Sixtey7 Posted May 23, 2021

I installed Unraid a month or two ago. I initially installed 6.9.1 with two 4TB drives for storage and one 4TB drive as parity. Things were running pretty great: uptime reached 7 days before I brought it down to tinker (swapping fans around; temps were fine before, but I wanted to tweak). Then two changes happened close together: I installed a second parity drive (also 4TB), and I upgraded to 6.9.2.

For the last couple of weeks I've been getting restarts almost daily (sometimes every few days). These seem to happen during periods of little to no load on the server (I wake up in the morning and the server has restarted overnight). It has never happened during a parity check or anything else that puts a decent load on the system. I'm watching my temps and they aren't getting high.

Looking at syslog, I seem to always see this error:

May 18 02:22:34 Tower kernel: smp: Bringing up secondary CPUs ...
May 18 02:22:34 Tower kernel: x86: Booting SMP configuration:
May 18 02:22:34 Tower kernel: .... node #0, CPUs: #1 #2
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: bc002800000c0135
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 1a1410100 MISC d012000000000000 IPID b000000000
May 18 02:22:34 Tower kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1621329732 SOCKET 0 APIC 4 microcode 8701021

The CPU number changes (e.g. "CPU 2"), but I see that in syslog every time. Sometimes I also have ECC errors logged that look like:

May 17 01:49:06 Tower kernel: [Hardware Error]: Deferred error, no action required.
May 17 01:49:06 Tower kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
May 17 01:49:06 Tower kernel: [Hardware Error]: Error Addr: 0x00000000b0208000
May 17 01:49:06 Tower kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xadd8234b0b800002
May 17 01:49:06 Tower kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.

but those aren't always present.

I've tried a few things to see if I could figure out the issue:

- Occasionally, when the server came back up, it would fail to load bzroot from the flash drive. Copying that file over from a fresh download would fix it, but on the next restart it would be corrupt again. So I replaced the flash drive with a higher-quality one. I haven't had the bzroot issue since.
- Removed the new parity drive to see if it was the cause. SMART showed no problem with the drive, but I figured it was worth a shot. The server still restarted without the parity drive.
- Ran memtest, since the ECC errors made me wonder if the RAM was bad. Memtest passed perfectly.

Any thoughts on what else to try? I have diagnostic logs downloaded if those would help too.

Thanks!!
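For anyone who wants to cross-check the flag list in the MC17_STATUS line above against the raw hex value, here is a minimal Python sketch. The bit positions are an assumption based on the usual AMD MCA status register layout (nothing in the thread states them), and `decode_mca_status` is a hypothetical helper name, so treat the result as a sanity check rather than an authoritative decode.

```python
# Assumed bit positions for an AMD MCA status register. These are NOT taken
# from the thread; verify them against the processor documentation.
MCA_STATUS_BITS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
    46: "CECC",
    45: "UECC",
    44: "Deferred",
    43: "Poison",
    40: "Scrub",
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the flags set in a 64-bit status value,
    ordered from the highest bit to the lowest."""
    return [
        name
        for bit, name in sorted(MCA_STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Raw value copied from the MC17_STATUS syslog line in the post.
    print(decode_mca_status(0xDC2031000000011B))
```

Under these assumptions, the flags the sketch reports for `0xdc2031000000011b` include the same ones the kernel printed (Over, MiscV, AddrV, SyndV, UECC, Deferred, Scrub), which suggests the assumed layout is at least consistent with this log line.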
Squid Posted May 23, 2021

10 minutes ago, Sixtey7 said:
"I'm getting almost daily (maybe every few days sometimes) restarts."

You should mirror the syslog to the flash, then post the diagnostics and the saved syslog in your next post.

12 minutes ago, Sixtey7 said:
"Looking at syslog, I seem to always see this error:"

MCEs during processor initialization are semi-common on certain CPU combinations and are nothing to worry about.

12 minutes ago, Sixtey7 said:
"Sometimes I also have ECC errors logged that look like:"

Definitely memory issues.

12 minutes ago, Sixtey7 said:
"Ran memtest as the ECC errors made me wonder if the ram was bad, memtest passed perfectly"

The memtest that ships with Unraid won't find corrected ECC errors (as they're corrected). Your system event log will probably have more info on which stick is going bad, or go to https://www.memtest86.com/ and create a stick with the updated version, which should find it for you as well.
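Once a syslog has been saved off the server, it can help to separate the boot-time MCE notices from the DRAM ECC reports before posting. The following is a rough Python sketch of that triage, not an Unraid tool: `summarize_hardware_errors` is a hypothetical helper, the category names are made up for illustration, and the matching relies only on the message strings that appear in the log excerpts earlier in this thread.

```python
import re
import sys
from collections import Counter

# Matches the "[Hardware Error]:" marker seen in the syslog excerpts and
# captures the rest of the message.
HW_ERROR = re.compile(r"\[Hardware Error\]:\s*(?P<msg>.*)")


def summarize_hardware_errors(lines):
    """Count hardware-error lines, split into DRAM ECC reports,
    'Machine check events logged' notices, and everything else."""
    counts = Counter()
    for line in lines:
        match = HW_ERROR.search(line)
        if not match:
            continue  # not a hardware-error line
        msg = match.group("msg")
        if "DRAM ECC error" in msg:
            counts["dram_ecc"] += 1
        elif msg.startswith("Machine check events logged"):
            counts["mce_logged"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python triage.py <path-to-saved-syslog>
    with open(sys.argv[1], errors="replace") as handle:
        for kind, count in summarize_hardware_errors(handle).items():
            print(f"{kind}: {count}")
```

A non-zero `dram_ecc` count that keeps growing between reboots would point toward the memory suggestion above, while `mce_logged` entries that only appear right after "Bringing up secondary CPUs" match the boot-time MCEs described as harmless.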
Sixtey7 Posted May 30, 2021 (edited)

Sorry for the long delay. After I posted this, the system stabilized, and then I was offline for a few days. I got things stood back up this morning, and one hour after the server was running, it restarted on me.

I ran the memtest86.com version of memtest the same day it was suggested. The test completed successfully (all four iterations) with zero errors found. I verified that it was configured to report ECC errors.

Attached are the syslog (from the flash drive; I removed all logs prior to today) and the diagnostics download.

Thanks again for the help!

Attachments: syslog (1), tower-diagnostics-20210530-0810.zip

Edited May 30, 2021 by Sixtey7