Help, Server is rebooting every few minutes!


I don't know what is going on. All of a sudden my server is rebooting every few minutes. I have been running 6.9.2 since it first came out, so it isn't as if I am running a beta release, and I haven't made any configuration changes to the server. In fact, I was simply using the Windows 10 VM when this behavior started.

I am attaching the diagnostic data from Safe Mode with the array started. Yes, it seems to work in safe mode. I know that some plugins got updated today but I honestly don't know which ones - Unassigned Devices? But that shouldn't cause this, right? Please help!

eeyore-diagnostics-20210504-1719.zip


Yes, I read that a while back. I do not overclock my RAM (or my CPU), I am using approved RAM, and the same RAM is in all the slots. Also, my power settings are correct, so C-states should not be an issue. And again, I haven't made any changes to any of that recently, and I have been running Unraid for quite some time now.

 

Also, here are the diagnostics from a regular boot - it crashed about 20 seconds after I got them!

eeyore-diagnostics-20210504-1730.zip


Why are you loading the Intel integrated graphics driver?

# Load the i915 driver
modprobe i915

 

This should really be posted in the General Support section. It isn't an Unraid bug and it isn't Urgent.

 

Have you done a memory test? Have you tried enabling syslog mirroring to flash to see if it catches the problem?
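
Once the syslog is mirrored, the lines written just before each unexpected reboot are usually the interesting ones. Here is a minimal Python sketch of one way to pull them out. It assumes each boot starts with the kernel's "Linux version" message, and the function name and the `context` parameter are made up for illustration, not part of Unraid.

```python
from collections import deque

# Assumption: the kernel logs a "Linux version ..." line at the start of
# every boot, so that line marks the boundary between two boots.
BOOT_MARKER = "kernel: Linux version"


def tail_before_each_boot(lines, context=20):
    """Return the last `context` lines logged before every boot marker
    that is preceded by at least one earlier line."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    tails = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER in line and recent:
            # A new boot begins: keep what was logged right before it.
            tails.append(list(recent))
            recent.clear()
        recent.append(line)
    return tails
```

To use it, open the mirrored syslog (wherever the mirror writes it on your flash drive) and pass the file object to `tail_before_each_boot`. If the tails end abruptly with nothing unusual in them, that points more towards a hardware or power problem than a software one, although it does not prove it.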


Thank you, @JorgeB for moving the thread to the correct forum. Apologies to @Squid for posting in the wrong place. I was in somewhat of a panic when I created the original thread.

 

So, to address the comments of @John_M - I let MemTest run overnight and there were no errors in the morning. The system also did not reboot at all during the test. I am now mirroring syslog to flash, and I will attach a new diagnostics bundle. I have also removed the modprobe i915 line - this install used to run on an Intel system, and that line was a remnant.

 

@ChatNoir The PSU seems to be okay, but that is one of the more difficult pieces of hardware to be sure about. Cooling also seems okay - the CPU and motherboard temps hover around 45, and one occasionally hits 60, but only ever for a few seconds, and it has always been that way. These were my first thoughts too.

 

I have seen some errors relating to L2 or L3 cache in the syslog. Could that be the issue? Is there a way to test the CPU for faults? I am at a loss here. The system stays up if I boot into safe mode and don't mount the array. Once I mount the array, it only takes minutes until an unexpected reboot.
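
For anyone who wants to see whether those cache errors cluster on one core, here is a rough Python sketch that tallies the kernel's `[Hardware Error]` lines per CPU and bank, and per cache level. The regular expressions are my assumption about the line format, based on the entries in this syslog, and `summarize_mce` is a made-up name.

```python
import re
from collections import Counter

# Assumed format of the MCE summary line, e.g.
#   "... kernel: [Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[...]: 0x9c20..."
CPU_RE = re.compile(r"\[Hardware Error\]: CPU:(\d+) .*?MC(\d+)_STATUS")
# Assumed format of the decoded description line, e.g.
#   "... kernel: [Hardware Error]: L2 Cache Ext. Error Code: 2, ..."
CACHE_RE = re.compile(r"\[Hardware Error\]: (L[23]) Cache Ext\. Error Code")


def summarize_mce(lines):
    """Count machine-check reports per (CPU, bank) and per cache level."""
    per_cpu_bank = Counter()
    per_cache = Counter()
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            cpu, bank = int(match.group(1)), int(match.group(2))
            per_cpu_bank[(cpu, bank)] += 1
            continue
        match = CACHE_RE.search(line)
        if match:
            per_cache[match.group(1)] += 1
    return per_cpu_bank, per_cache
```

Pass it an open syslog file (or any list of lines). If the counts pile up on one or two CPU numbers, that would hint at a specific core, while an even spread would look more like a power or board issue. Treat that as a hint, not a diagnosis.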

@JorgeB - as to the memory overclocking - you are right! I had XMP turned on for my RAM. I turned it off this morning, and the current speed should be 2133.

 

I deeply appreciate all the help you are all providing.  Any ideas what my next steps should be?

IMG_0373.png

eeyore-diagnostics-20210505-1013.zip

Edited by Mantene

Does it still reboot frequently with the RAM running @ 2133 MT/s? If yes, one thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one.

8 minutes ago, JorgeB said:

Does it still reboot frequently with the RAM running @ 2133 MT/s? If yes, one thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one.

Yep, it still does it with the RAM at 2133. I tried safe mode with Docker and VMs disabled. It stays up for longer, but it still seems to reboot randomly. So yes, I am also of the opinion that it is hardware. I just wish I knew which component. MB, CPU, or PSU are the main suspects.

May  5 08:58:30 Eeyore kernel: RSP: 0018:ffffc900007b78a0 EFLAGS: 00010202
May  5 08:58:30 Eeyore kernel: RAX: ffffea0005e41d80 RBX: ffffc900007b7940 RCX: 0000000000000006
May  5 08:58:30 Eeyore kernel: RDX: 0000000000000101 RSI: 17fec0817ed02b28 RDI: ffffc900007b78e8
May  5 08:58:30 Eeyore kernel: RBP: ffffc900007b7930 R08: 000000000000007f R09: ffffea0005e41d80
May  5 08:58:30 Eeyore kernel: R10: 0000000000000000 R11: ffff888103891500 R12: 000000000000000d
May  5 08:58:30 Eeyore kernel: R13: ffff888103891500 R14: ffff8881026826c0 R15: 17fec0817ed02b08
May  5 08:58:30 Eeyore kernel: FS:  0000000000000000(0000) GS:ffff888ffea40000(0000) knlGS:0000000000000000
May  5 08:58:30 Eeyore kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  5 08:58:30 Eeyore kernel: CR2: 00001510c418d4e8 CR3: 000000000200a000 CR4: 0000000000350ee0
May  5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400000020136
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May  5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x000171f21a4418f5
May  5 08:59:20 Eeyore kernel: [Hardware Error]: L2 Cache Ext. Error Code: 2, L2M Data Array ECC Error.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
May  5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:1 (17:71:0) MC14_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000004010b
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May  5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000700b020f50300, Syndrome: 0x000171f21a47010a
May  5 08:59:20 Eeyore kernel: [Hardware Error]: L3 Cache Ext. Error Code: 4, L3M Data ECC Error.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
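
The two status values above can also be decoded by hand. Here is a small Python sketch, with some assumptions: bits 63 to 57 are the standard x86 machine-check status fields, and I am assuming bits 53, 46 and 45 follow AMD's scalable MCA layout, since that reproduces the flags the kernel printed. The names `STATUS_BITS`, `decode_status` and `is_corrected` are just for illustration.

```python
# Flag name -> bit position in the 64-bit MCA status value.
# Bits 63..57 are the standard x86 fields; 53, 46 and 45 are assumed to be
# the AMD scalable-MCA SyndV / CECC / UECC bits.
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # error address is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome is valid
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # detected by ECC but not corrected
}


def decode_status(value):
    """Return the names of all known flags set in `value`."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]


def is_corrected(value):
    """True for a valid report whose UC (uncorrected) bit is clear."""
    flags = decode_status(value)
    return "Val" in flags and "UC" not in flags


print(decode_status(0x9C20400000020136))
# ['Val', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
print(decode_status(0xDC2040000004010B))
# ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
```

Both values decode as corrected ECC errors, which matches the "Corrected error, no action required" lines. The second one also has Over set, which means at least one earlier error in that bank was overwritten before it could be logged.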

 

26 minutes ago, Mantene said:

I just wish I knew which component. MB, CPU, or PSU are the main suspects.

It's difficult to say. If you have spares for any of those, like a different PSU, start with what you can test and rule out. You can also try with just two DIMMs at a time, to completely rule out the RAM.

7 minutes ago, JorgeB said:

It's difficult to say. If you have spares for any of those, like a different PSU, start with what you can test and rule out. You can also try with just two DIMMs at a time, to completely rule out the RAM.

I can probably do all of those things. I am fairly sure I have a spare PSU, though it is lower wattage, and running with just two DIMMs is easy enough to try. However, I just started the Prime95 CPU test, so I will let that run for a few hours (if the PC stays up that long)! Thank you for the suggestions. I will report back when I have more information.

