Help, Server is rebooting every few minutes!

Mantene · May 4, 2021

I don't know what is going on. All of a sudden my server is rebooting what seems like every few minutes. I have been running 6.9.2 since it first came out, so it isn't like I am running a beta release. And I haven't made any configuration changes to the server. In fact, I was simply using the Windows 10 VM when it started this behavior.

I am attaching the diagnostic data from Safe Mode with the array started. Yes, it seems to work in safe mode. I know that some plugins got updated today but I honestly don't know which ones - unassigned devices? But that shouldn't cause this, right? Please help!

eeyore-diagnostics-20210504-1719.zip

Squid · May 4, 2021

Have you read this post yet?

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-819173

Mantene · May 4, 2021

Yes, I read that a while back. And I do not overclock my RAM (or my CPU), I am using approved RAM and all the same RAM in all the slots. Also, my power settings are correct so the c-state should not be an issue. And again, I haven't made any changes to any of that recently and I have been running Unraid for quite some time now.

Also, here is a diagnostics from a regular boot - it crashed about 20 seconds after I got this!

eeyore-diagnostics-20210504-1730.zip

Mantene · May 4, 2021

Oh, it even does it in safe mode.

I am ready to throw the box out the window

John_M · May 5, 2021

Why are you loading the Intel integrated graphics driver?

# Load the i915 driver
modprobe i915

This should really be posted in the General Support section. It isn't an Unraid bug and it isn't Urgent.

Have you done a memory test? Have you tried enabling syslog mirroring to flash to see if it catches the problem?

ChatNoir · May 5, 2021

"Every few" minutes seems excessive, even for the regular Ryzen issues.

Are you sure you don't have a cooling or PSU issue?

Mantene · May 5, 2021

Thank you, @JorgeB for moving the thread to the correct forum. Apologies to @Squid for posting in the wrong place. I was in somewhat of a panic when I created the original thread.

So, to address the comments of @John_M - I let MemTest run overnight and there were no errors in the morning - also the system did not reboot at all. I am now mirroring syslog to flash. I will attach a new diags bundle. Also, I have removed the modprobe i915 now - this used to be an Intel system, and that is a remnant.

@ChatNoir PSU seems to be okay, but that is one of the more difficult pieces of hardware to know for sure. Cooling also seems okay - the cpu and mb temps hover around 45, one occasionally hits 60 but only ever for a few seconds and it has always been so. These were my first thoughts too.

I have seen some errors relating to L2 or L3 cache in the syslog. Could that be the issue? Is there a way to test the CPU for faults? I am at a loss here. The system stays up if I boot into safe mode and don't mount the array. Once I mount the array, it takes just minutes until an unexpected reboot.

@JorgeB - as to the memory overclocking - you are right! I had XMP turned on for my RAM. I turned it off this AM and the current speed should be 2133.

I deeply appreciate all the help you are all providing. Any ideas what my next steps should be?

eeyore-diagnostics-20210505-1013.zip

Edited May 5, 2021 by Mantene

JorgeB · May 5, 2021

Does it still reboot frequently with the RAM running @ 2133 MT/s? If so, one thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Mantene · May 5, 2021

8 minutes ago, JorgeB said:

Does it still reboot frequently with the RAM running @ 2133 MT/s? If so, one thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Yep, it still does it with the RAM at 2133. I tried safe mode with Docker and VMs disabled. It stays up for longer, but it still seems to reboot randomly. So yes, I am also of the opinion that it is hardware. I just wish I knew which component. MB, CPU, or PSU are the main suspects.

Mantene · May 5, 2021

May  5 08:58:30 Eeyore kernel: RSP: 0018:ffffc900007b78a0 EFLAGS: 00010202
May  5 08:58:30 Eeyore kernel: RAX: ffffea0005e41d80 RBX: ffffc900007b7940 RCX: 0000000000000006
May  5 08:58:30 Eeyore kernel: RDX: 0000000000000101 RSI: 17fec0817ed02b28 RDI: ffffc900007b78e8
May  5 08:58:30 Eeyore kernel: RBP: ffffc900007b7930 R08: 000000000000007f R09: ffffea0005e41d80
May  5 08:58:30 Eeyore kernel: R10: 0000000000000000 R11: ffff888103891500 R12: 000000000000000d
May  5 08:58:30 Eeyore kernel: R13: ffff888103891500 R14: ffff8881026826c0 R15: 17fec0817ed02b08
May  5 08:58:30 Eeyore kernel: FS:  0000000000000000(0000) GS:ffff888ffea40000(0000) knlGS:0000000000000000
May  5 08:58:30 Eeyore kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  5 08:58:30 Eeyore kernel: CR2: 00001510c418d4e8 CR3: 000000000200a000 CR4: 0000000000350ee0
May  5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400000020136
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May  5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x000171f21a4418f5
May  5 08:59:20 Eeyore kernel: [Hardware Error]: L2 Cache Ext. Error Code: 2, L2M Data Array ECC Error.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
May  5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:1 (17:71:0) MC14_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000004010b
May  5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May  5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000700b020f50300, Syndrome: 0x000171f21a47010a
May  5 08:59:20 Eeyore kernel: [Hardware Error]: L3 Cache Ext. Error Code: 4, L3M Data ECC Error.
May  5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
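
For anyone who wants to tally entries like these across a longer syslog, here is a minimal Python sketch. It assumes the log lines follow the same layout as the excerpt above (a `CPU:<n> ... MC<bank>_STATUS[flags]` line followed later by a `cache level:` line); the function name `summarize_mce` and the way events are grouped are illustrative choices, not part of any Unraid or kernel tool. It only counts what is logged and cannot by itself say which component is faulty.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   ... kernel: [Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[-|CE|...]: 0x9c20...
CPU_RE = re.compile(r"\[Hardware Error\]: CPU:(\d+) .*?MC(\d+)_STATUS\[([^\]]*)\]")
# Matches lines such as:
#   ... kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
LEVEL_RE = re.compile(r"\[Hardware Error\]: cache level: ([^,]+),")


def summarize_mce(lines):
    """Count machine check events keyed by (cpu, bank, corrected, cache_level).

    Assumption: each event's "cache level" line appears after its CPU line
    and before the next CPU line, as in the excerpt above. Events with no
    such line are counted with cache_level "unknown".
    """
    counts = Counter()
    pending = None  # (cpu, bank, corrected) still waiting for a cache level
    for line in lines:
        cpu_match = CPU_RE.search(line)
        if cpu_match:
            if pending is not None:
                counts[pending + ("unknown",)] += 1
            flags = cpu_match.group(3).split("|")
            # "CE" in the status flags marks a corrected error
            pending = (int(cpu_match.group(1)), int(cpu_match.group(2)), "CE" in flags)
            continue
        level_match = LEVEL_RE.search(line)
        if level_match and pending is not None:
            counts[pending + (level_match.group(1).strip(),)] += 1
            pending = None
    if pending is not None:
        counts[pending + ("unknown",)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # The path is supplied by the user; no default location is assumed here.
    with open(sys.argv[1], errors="replace") as handle:
        for key, n in sorted(summarize_mce(handle).items()):
            cpu, bank, corrected, level = key
            kind = "corrected" if corrected else "uncorrected"
            print(f"CPU {cpu} bank MC{bank} {level} {kind}: {n}")
```

Run it with the path to a saved syslog as the only argument. If the counts keep landing on the same core or cache level, that is one more data point to weigh alongside swap tests of the other components.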

JorgeB · May 5, 2021

26 minutes ago, Mantene said:

I just wish I knew which component. MB, CPU, or PSU are the main suspects.

It's difficult to say. If you have spares of any of those, like a different PSU, start with what you can test and rule out. You can also try running with just two DIMMs at a time to completely rule out the RAM.

Mantene · May 5, 2021

7 minutes ago, JorgeB said:

It's difficult to say. If you have spares of any of those, like a different PSU, start with what you can test and rule out. You can also try running with just two DIMMs at a time to completely rule out the RAM.

I can probably do all of those things. I am fairly sure I have a spare, though lower wattage, PSU. And just using two DIMMs is easy enough to try. However, I just started the Prime95 CPU test, so I will let that run for a few hours (if the PC stays up that long)! Thank you for the suggestions, I will report back when I have more information.

John_M · May 5, 2021

It looks like a faulty CPU to me, if the error reports are accurate. You'll be covered by the warranty, assuming you bought it through official channels.

rasmus · May 23, 2022

@Mantene Did you ever solve this?
I'm asking because I am seeing the exact same error in my logs now, and my server has started crashing now and then... My CPU has been overclocked, so I suspect it could be degradation...

Mantene · May 31, 2022

On 5/23/2022 at 12:09 PM, rasmus said:

@Mantene Did you ever solve this?
I'm asking because I am seeing the exact same error in my logs now, and my server has started crashing now and then... My CPU has been overclocked, so I suspect it could be degradation...

Sorry to tell you, but my problem turned out to be a bad CPU. I replaced it and all went back to working well.
