
Almost Nightly Restarts - Unraid 6.9.2



I installed Unraid about a month or two ago.  I initially installed 6.9.1 with two 4TB drives for storage and one 4TB drive as parity.  Things were running pretty great: uptime reached 7 days before I brought it down to tinker (swapping fans around; temps were fine before, but I wanted to tweak).

 

Then two changes happened close together: I installed a second parity drive (also 4TB), and I upgraded to 6.9.2.  For the last couple of weeks, I've been getting restarts almost daily (sometimes every couple of days).  These seem to happen during periods of little to no load on the server (I wake up in the morning and the server has restarted overnight).  It's never happened during a parity check or anything else that puts a decent load on the system.

 

I'm watching my temps and they aren't getting high or anything.

 

Looking at syslog, I seem to always see this error:

 

May 18 02:22:34 Tower kernel: smp: Bringing up secondary CPUs ...
May 18 02:22:34 Tower kernel: x86: Booting SMP configuration:
May 18 02:22:34 Tower kernel: .... node  #0, CPUs:        #1  #2
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: bc002800000c0135
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 1a1410100 MISC d012000000000000 IPID b000000000
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1621329732 SOCKET 0 APIC 4 microcode 8701021

 

 

The CPU number changes (e.g. "CPU 2"), but I see that in syslog every time.  Sometimes I also have ECC errors logged that look like:

 

May 17 01:49:06 Tower kernel: [Hardware Error]: Deferred error, no action required.
May 17 01:49:06 Tower kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
May 17 01:49:06 Tower kernel: [Hardware Error]: Error Addr: 0x00000000b0208000
May 17 01:49:06 Tower kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xadd8234b0b800002
May 17 01:49:06 Tower kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.

 

but those aren't always present. 
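For anyone reading the MC17_STATUS line above, here is a minimal Python sketch that decodes the raw status value into named flags. The bit positions are an assumption taken from the Linux kernel's x86 MCE definitions (MCI_STATUS_* in mce.h), not from anything posted in this thread, and the function name is made up for illustration.

```python
# Bit positions assumed from the Linux kernel's x86 MCE status definitions.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43, "Scrub": 40,
}

def decode_mca_status(status):
    """Return the names of the flags set in a 64-bit MCA status value."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]

# Status value copied from the May 17 log line above.
flags = decode_mca_status(0xdc2031000000011b)
print(flags)
# ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'UECC', 'Deferred', 'Scrub']
```

The decoded flags line up with the bracketed list the kernel printed (Over, MiscV, AddrV, SyndV, UECC, Deferred, Scrub). Note that the set bits are UECC and Deferred rather than CECC.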

 

I've tried a couple of things to see if I could figure out the issue:

  • Sometimes when it came back up it would fail to load bzroot from the flash drive.  Copying that file over from a fresh download would fix it, but on the next restart it would be corrupt again.  So I replaced the flash drive with a higher quality drive.  I haven't had the bzroot issue since.
  • Removed the new parity drive to see if the new drive was the issue.  SMART reported no problems with the drive, but I figured it was worth a shot.  It still restarted without the parity drive.
  • Ran memtest, as the ECC errors made me wonder if the RAM was bad.  Memtest passed perfectly.

 

Any thoughts on what else to try?  I have diagnostic logs downloaded if those would help too.

 

Thanks!!

10 minutes ago, Sixtey7 said:

I've been getting restarts almost daily (sometimes every couple of days).

You should mirror the syslog to the flash, then post the diagnostics and the saved syslog in your next post.
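Once a mirrored syslog exists, a quick way to see how often hardware errors show up is to count the "[Hardware Error]" lines per day. This is only a rough Python sketch: the regular expression assumes the line layout shown in the first post, and the function name and the file path in the comment are hypothetical.

```python
import re
from collections import Counter

# Assumes syslog lines shaped like the ones quoted in the first post.
HW_ERR = re.compile(
    r"^(\w{3}\s+\d+) [\d:]+ \S+ kernel: (?:mce: )?\[Hardware Error\]: (.*)$"
)

def count_hw_errors_per_day(lines):
    """Count '[Hardware Error]' kernel lines, keyed by the date prefix."""
    per_day = Counter()
    for line in lines:
        match = HW_ERR.match(line.rstrip("\n"))
        if match:
            per_day[match.group(1)] += 1
    return per_day

# Sample lines copied from the first post.
sample = [
    "May 18 02:22:34 Tower kernel: smp: Bringing up secondary CPUs ...",
    "May 18 02:22:34 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "May 17 01:49:06 Tower kernel: [Hardware Error]: Deferred error, no action required.",
    "May 17 01:49:06 Tower kernel: [Hardware Error]: Error Addr: 0x00000000b0208000",
]
counts = count_hw_errors_per_day(sample)
print(counts)

# For a real file (hypothetical path, adjust to wherever the mirror is saved):
# with open("/boot/logs/syslog") as f:
#     print(count_hw_errors_per_day(f))
```

Comparing the dates with hardware error lines against the dates of the restarts may help show whether the two are related.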

 

12 minutes ago, Sixtey7 said:

Looking at syslog, I seem to always see this error:

MCEs during processor initialization are semi-common on certain CPU combinations and are nothing to worry about.

12 minutes ago, Sixtey7 said:

Sometimes I also have ECC errors logged that look like:

Definitely memory issues.

12 minutes ago, Sixtey7 said:

Ran memtest, as the ECC errors made me wonder if the RAM was bad.  Memtest passed perfectly.

The memtest that ships with Unraid won't find corrected ECC errors (as they're corrected).  Your system event log will probably have more info on which stick is going bad (or go to https://www.memtest86.com/ and create a stick with the updated version, which should also find it for you).


Sorry for the long delay.  After posting this the system stabilized, and then I was offline for a few days.

 

Got things stood back up this morning, and one hour after the server was running, it restarted on me.

 

I ran the memtest86 version of memtest the same day I got the post suggesting it.  The test completed successfully (all four iterations of the test) with zero errors found.  I verified that it was configured to report ECC errors.

 

Attached are the syslog (from the flash drive - I removed all logs prior to today), and the diagnostics download.

 

Thanks again for the help!

syslog (1) tower-diagnostics-20210530-0810.zip

Edited by Sixtey7
