Stable on 6.8.3 but crashes on 6.9.2 and 6.10.0-rc2



I've had amazing stability running my server on version 6.8.3, but when I upgrade to 6.9.2 or even 6.10.0-rc2 I get random lock-ups that require a reboot to fix. I've not seen anything in the logs that seems obvious as the cause, so I was hoping I could get some outside help.

 

Hardware Config:

Supermicro X9DRL-iF

2x Xeon E5-2650v2

32GB (8x4GB) Hynix DDR3-1333

3x Seagate Ironwolf 6TB + 1x Ironwolf 6TB for parity

1TB Crucial MX500 cache disk

 

I'm using only the onboard SATA controller and networking, so there are no separate PCIe cards installed.

 

Some things that I noticed after upgrading:

1. Started seeing MCE messages about memory read errors that I never saw while on 6.8.3. I'm hoping this is a non-issue like the one described in this thread. I ran memtest for an hour but never encountered any errors (there's a quick counter-check sketch right after this list).

2. The lock-up or crash leaves the server unresponsive but does not trigger a restart on its own. I have to log into the motherboard's remote admin interface to force a reboot, or actually hit the power button, to bring the server back.

3. I can't clearly identify any single trigger for the crash; it has happened when updating a Minecraft docker, clicking the "WebUI" button for my Jellyfin docker, just watching videos in Jellyfin, or even when the server is sitting idle.
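
A rough sketch of how the corrected-error counters could be watched between those MCE messages (assuming the EDAC driver is loaded and exposes the usual sysfs counters; the per-DIMM filenames can vary by kernel):

# dump corrected/uncorrected error counts per memory controller and per DIMM
for mc in /sys/devices/system/edac/mc/mc*; do
  echo "$mc: CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
  # per-DIMM counters, where present, help map an error back to a physical slot
  for dimm in "$mc"/dimm*; do
    [ -d "$dimm" ] || continue
    echo "  $(cat "$dimm/dimm_label"): CE=$(cat "$dimm/dimm_ce_count")"
  done
done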

 

I'm currently booted into safe mode, with 35 minutes of uptime so far, as a test to see whether there is a problem with a plugin or if it makes any difference at all.

 

Thanks in advance for any help you guys can provide,

blkjack410

 

 

beowulf-diagnostics-20220108-1829.zip syslog-192.168.101.88.log


OK, so the safe mode test is over because the server once again crashed while idle. Jellyfin, Komanga, and pihole were the only dockers running, and only pihole should have had any activity. I'm rolling back to 6.8.3 for now, but I would like to be able to run the latest Unraid if possible.

 

Thanks again in advance to all the big-brain gurus out there that have any advice.

syslog-192.168.101.88.log

Jan  8 13:04:10 Beowulf kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x17fa6b offset:0x600 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)

Server is reporting memory issues; the System/IPMI event log in the BIOS might show more info for the affected DIMM.
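
If ipmitool is available on the host (or pointed at the BMC over the network), a minimal sketch for reading and saving that same event log from the OS side; the output path here is just an example:

# list the System Event Log in human-readable form
ipmitool sel elist
# keep a copy alongside the diagnostics
ipmitool sel elist > /boot/logs/sel-$(date +%Y%m%d).txt
# or run remotely against the BMC:
# ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist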


Got into the event log and sure enough there is something strange going on. For one thing, the first series of events listed is a string of single-bit ECC memory errors that occurred multiple times a second over a span of about 30 seconds back in July. I have no real way of telling whether that was the full extent of the event or whether the log is overflowing and old events are being dropped as new ones are added. Is anyone familiar with how the Supermicro log behaves? Those events all applied to one DIMM (P2_DIMMF1).

 

From what I recall, July was around the first time I tried to upgrade from 6.8.3 to 6.9.2, and I only set up syslog in the first place because I was getting random crashes. Unfortunately the syslog doesn't seem to show what version of Unraid is running, or I don't know where to look for it.
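
One thing that might at least narrow it down (assuming the kernel's boot banner made it into the remote syslog): the "Linux version" line logged at each boot identifies the kernel build, which maps roughly to an Unraid release:

# show the kernel version banner from each captured boot
grep "Linux version" syslog-192.168.101.88.log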

 

Then there are no more events until yesterday, 1/8, at 16:14 local time. I get a flurry of single-bit errors from 16:14 to 16:17, all related to P1_DIMMB1. Then there is a break until 19:56 local time, when P1_DIMMB1 starts having single-bit errors again until 20:03 local. There have been no other events since then. According to the syslog it looks like I did the upgrade to 6.9.2 at 12:42, and after the reboot I see memory error messages in the syslog that don't seem to line up with any of the events in the BIOS log. The last string of errors seems to occur about an hour before I rolled back to 6.8.3.

 

Because of the timing of these errors, it feels like this is somehow caused by upgrading to 6.9.2.

  1. Can a single-bit error like this even be caused by software, or is it entirely at the hardware level?
  2. Does some solar flare hit earth every time I want to try upgrading Unraid?
  3. Why don't some of the errors shown in the syslog match up with what I see in the BIOS?
  4. Is there any way to improve my logging config to add more info useful for debugging this?
  5. Does syslog have a blind spot between boot and the syslog service actually starting, or does it capture what happened retroactively? (There's a sketch for this right after this list.)
  6. Should I change any of the settings in my BIOS event log to reduce event spamming, or is it OK as is?
  7. Is there any way to actually export the BIOS log?
  8. Should I leave the BIOS time as GMT or change it to local time so that timestamps match between the BIOS and syslog?
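
For #5, a minimal sketch of covering any gap (assuming the kernel ring buffer still holds the early-boot messages, and that /boot/logs is a reasonable place to drop them, e.g. from the go file or a user script run at array start):

# snapshot the kernel ring buffer to the flash drive with human-readable timestamps
mkdir -p /boot/logs
dmesg --ctime > /boot/logs/dmesg-$(date +%Y%m%d-%H%M).txt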

 

Thanks everyone

 

Screenshots:

bioslog1.png bioslog2.png bioslog3.png bioslogsettings.png

syslog-192.168.101.88.log

