Mantene Posted May 4, 2021 I don't know what is going on. All of a sudden my server is rebooting what seems like every few minutes. I have been running 6.9.2 since it first came out, so it isn't like I am running a beta release. And I haven't made any configuration changes to the server. In fact, I was simply using the Windows 10 VM when it started this behavior. I am attaching the diagnostic data from Safe Mode with the array started. Yes, it seems to work in safe mode. I know that some plugins got updated today but I honestly don't know which ones - Unassigned Devices? But that shouldn't cause this, right? Please help! eeyore-diagnostics-20210504-1719.zip
Squid Posted May 4, 2021 Have you read this post yet? https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-819173
Mantene Posted May 4, 2021 Author Yes, I read that a while back. I do not overclock my RAM (or my CPU), and I am using approved RAM, with the same RAM in all the slots. Also, my power settings are correct, so C-states should not be an issue. And again, I haven't made any changes to any of that recently, and I have been running Unraid for quite some time now. Also, here are diagnostics from a regular boot - it crashed about 20 seconds after I got this! eeyore-diagnostics-20210504-1730.zip
Mantene Posted May 4, 2021 Author Oh, it even does it in safe mode. I am ready to throw the box out the window.
John_M Posted May 5, 2021 Why are you loading the Intel integrated graphics driver?
    # Load the i915 driver
    modprobe i915
This should really be posted in the General Support section. It isn't an Unraid bug and it isn't Urgent. Have you done a memory test? Have you tried enabling syslog mirroring to flash to see if it catches the problem?
ChatNoir Posted May 5, 2021 "Every few minutes" seems excessive, even for the regular Ryzen issues. Are you sure you don't have a cooling or PSU issue?
Mantene Posted May 5, 2021 Author Thank you, @JorgeB, for moving the thread to the correct forum. Apologies to @Squid for posting in the wrong place. I was in somewhat of a panic when I created the original thread. So, to address the comments of @John_M - I let MemTest run overnight and there were no errors in the morning - also the system did not reboot at all. I am now mirroring syslog to flash. I will attach a new diags bundle. Also, I have removed the modprobe i915 now - this used to be an Intel system, and that is a remnant. @ChatNoir PSU seems to be okay, but that is one of the more difficult pieces of hardware to know for sure. Cooling also seems okay - the CPU and MB temps hover around 45, one occasionally hits 60 but only ever for a few seconds and it has always been so. These were my first thoughts too. I have seen some errors relating to L2 or L3 cache in the syslog. Could that be the issue? Is there a way to test the CPU for faults? I am at a loss here. The system stays up if I boot into safe mode and don't mount the array. Once I mount the array it just takes minutes until an unexpected reboot. @JorgeB - as to the memory overclocking - you are right! I had XMP turned on for my RAM. I turned it off this AM and the current speed should be 2133. I deeply appreciate all the help you are all providing. Any ideas what my next steps should be? eeyore-diagnostics-20210505-1013.zip Edited May 5, 2021 by Mantene
JorgeB Posted May 5, 2021 Does it still reboot frequently with the RAM running @ 2133 MT/s? If yes, one thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
Mantene Posted May 5, 2021 Author 8 minutes ago, JorgeB said: Does it still reboot frequently with the RAM running @ 2133 MT/s? If yes, one thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a while. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
Yep, it still does it with the RAM at 2133. I did safe mode with Docker and VMs disabled. It stays up for longer, but it still seems to reboot randomly. So yes, I am also of the opinion that it is hardware. I just wish I knew which component. MB, CPU, or PSU are the main suspects.
Mantene Posted May 5, 2021 Author
May 5 08:58:30 Eeyore kernel: RSP: 0018:ffffc900007b78a0 EFLAGS: 00010202
May 5 08:58:30 Eeyore kernel: RAX: ffffea0005e41d80 RBX: ffffc900007b7940 RCX: 0000000000000006
May 5 08:58:30 Eeyore kernel: RDX: 0000000000000101 RSI: 17fec0817ed02b28 RDI: ffffc900007b78e8
May 5 08:58:30 Eeyore kernel: RBP: ffffc900007b7930 R08: 000000000000007f R09: ffffea0005e41d80
May 5 08:58:30 Eeyore kernel: R10: 0000000000000000 R11: ffff888103891500 R12: 000000000000000d
May 5 08:58:30 Eeyore kernel: R13: ffff888103891500 R14: ffff8881026826c0 R15: 17fec0817ed02b08
May 5 08:58:30 Eeyore kernel: FS: 0000000000000000(0000) GS:ffff888ffea40000(0000) knlGS:0000000000000000
May 5 08:58:30 Eeyore kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 5 08:58:30 Eeyore kernel: CR2: 00001510c418d4e8 CR3: 000000000200a000 CR4: 0000000000350ee0
May 5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May 5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May 5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400000020136
May 5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May 5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x000171f21a4418f5
May 5 08:59:20 Eeyore kernel: [Hardware Error]: L2 Cache Ext. Error Code: 2, L2M Data Array ECC Error.
May 5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
May 5 08:59:20 Eeyore kernel: mce: [Hardware Error]: Machine check events logged
May 5 08:59:20 Eeyore kernel: [Hardware Error]: Corrected error, no action required.
May 5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:1 (17:71:0) MC14_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000004010b
May 5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0
May 5 08:59:20 Eeyore kernel: [Hardware Error]: IPID: 0x000700b020f50300, Syndrome: 0x000171f21a47010a
May 5 08:59:20 Eeyore kernel: [Hardware Error]: L3 Cache Ext. Error Code: 4, L3M Data ECC Error.
May 5 08:59:20 Eeyore kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
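For anyone trying to see which cores a mirrored syslog keeps blaming, here is a minimal Python sketch that tallies the `MCn_STATUS` lines per CPU and bank. The function name `tally_mce` and the regular expression are illustrative assumptions written against the two status lines in this post, not an Unraid or kernel tool, and other kernels may format these lines differently.

```python
import re
from collections import Counter

# Matches lines shaped like the ones in this post, e.g.
# "[Hardware Error]: CPU:9 (17:71:0) MC2_STATUS[-|CE|...]: 0x9c20..."
# (pattern is an assumption based only on this thread's log excerpt).
STATUS_RE = re.compile(
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) \([0-9a-fA-F:]+\) "
    r"MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]: (?P<status>0x[0-9a-fA-F]+)"
)


def tally_mce(lines):
    """Count machine-check status lines per (cpu, bank, severity)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a status line (address, syndrome, register dump...)
        flags = match.group("flags").split("|")
        if "UE" in flags:
            severity = "UE"   # uncorrected error
        elif "CE" in flags:
            severity = "CE"   # corrected error
        else:
            severity = "other"
        counts[(int(match.group("cpu")), int(match.group("bank")), severity)] += 1
    return counts


# Three lines copied from the log excerpt in this post.
SAMPLE = [
    "May 5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:9 (17:71:0) "
    "MC2_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c20400000020136",
    "May 5 08:59:20 Eeyore kernel: [Hardware Error]: Error Addr: 0x00000001790531e0",
    "May 5 08:59:20 Eeyore kernel: [Hardware Error]: CPU:1 (17:71:0) "
    "MC14_STATUS[Over|CE|MiscV|AddrV|-|SyndV|CECC|-|-|-]: 0xdc2040000004010b",
]

if __name__ == "__main__":
    for (cpu, bank, severity), n in sorted(tally_mce(SAMPLE).items()):
        print(f"CPU {cpu} bank MC{bank} {severity}: {n}")
```

To run it over a full log, pass an open file object instead of `SAMPLE`; the function only needs an iterable of lines. If the same CPUs and banks keep showing up, that points at the processor more than at RAM or the PSU, but it is only a hint.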
JorgeB Posted May 5, 2021 26 minutes ago, Mantene said: I just wish I knew which component. MB, CPU, or PSU are the main suspects.
It's difficult to say. If you have any spares, like a different PSU, start with what you can test and rule out. You can also try with just two DIMMs at a time, to completely rule out the RAM.
Mantene Posted May 5, 2021 Author 7 minutes ago, JorgeB said: It's difficult to say. If you have any spares, like a different PSU, start with what you can test and rule out. You can also try with just two DIMMs at a time, to completely rule out the RAM.
I can probably do all of those things. I am fairly sure I have a spare, though lower wattage, PSU. And just using two DIMMs is easy enough to try. However, I just started the Prime95 CPU test, so I will let that run for a few hours (if the PC stays up that long)! Thank you for the suggestions, I will report back when I have more information.
John_M Posted May 5, 2021 It looks like a faulty CPU to me, if the error reports are accurate. You'll be covered by the warranty, assuming you bought it through official channels.
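As a cross-check on what the error reports say, the raw status words from the log can be decoded by hand. The sketch below is only that: the bit positions in `FLAG_BITS` are assumptions based on the common x86 machine-check status layout as used on AMD, and they have been checked only against the two values posted in this thread, so treat the processor documentation as the authority.

```python
# Assumed bit positions in the 64-bit MCx_STATUS word (hypothetical table,
# verified only against the two status values posted in this thread).
FLAG_BITS = {
    "Val": 63,    # entry is valid
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # address register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
    "CECC": 46,   # corrected ECC error
    "UECC": 45,   # uncorrected ECC error
}


def decode_status(value):
    """Return the names of the assumed flag bits that are set in value."""
    return [name for name, bit in FLAG_BITS.items() if (value >> bit) & 1]


if __name__ == "__main__":
    for raw in (0x9C20400000020136, 0xDC2040000004010B):
        print(hex(raw), decode_status(raw))
```

Under those assumptions, both values have CECC set and UC clear, which agrees with the "Corrected error, no action required" lines. The second value also has Over set, meaning at least one earlier error in that bank was overwritten before it could be logged.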
rasmus Posted May 23, 2022 @Mantene Did you ever solve this? Wondering as I am having the exact same error in my logs now and my server has started crashing now and then... My CPU has been overclocked, so I suspect it could be degradation...
Mantene Posted May 31, 2022 Author On 5/23/2022 at 12:09 PM, rasmus said: @Mantene Did you ever solve this? Wondering as I am having the exact same error in my logs now and my server has started crashing now and then... My CPU has been overclocked, so I suspect it could be degradation...
Sorry to tell you, but my problem turned out to be a bad CPU. I replaced it and all went back to working well.