sedoro Posted July 27, 2019 (edited)

Hello everyone,

I've had my server since last Christmas, so I'm quite new to this. This morning I woke up to a hardware error in my system:

Quote
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc1d53c000010091
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: TSC 0
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: ADDR 106714a940
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: MISC 2040444486
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d6 TIME 1564198254 SOCKET 1 APIC 20
Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a offset:0x940 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)

As far as I understand, there's a defective memory module. Should I just remove or replace it? The syslog says it's the Channel 1, DIMM 0 module. I've attached the diagram of my motherboard. Would Channel 1, DIMM 0 correspond to CPU1_DIMM_C0?

Thanks in advance for your help!

Attachments: syslog, tower-diagnostics-20190727-0756.zip

Edited September 17, 2019 by sedoro (title updated)
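For anyone trying to read the EDAC line in the post above, here is a minimal Python sketch that splits it into its fields. The function name `parse_edac_line` and the regular expression are my own, written against this one log line rather than taken from any EDAC tool, so other kernel versions may format the message differently. The sketch also assumes 4 KiB pages when combining `page` and `offset`; under that assumption the result matches the `ADDR 106714a940` value logged a few lines earlier.

```python
import re

# Sample line copied from the syslog excerpt in the first post.
LINE = (
    "Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a "
    "offset:0x940 grain:32 syndrome:0x0 - OVERFLOW area:DRAM "
    "err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)"
)

# Hypothetical pattern, fitted to the message format seen in this thread.
PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Return a dict of fields from an EDAC error line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "kind": match.group("kind"),    # CE = corrected, UE = uncorrected
        "label": match.group("label"),  # DIMM label as reported by the driver
    }
    # The parenthesised part is a run of key:value pairs.
    for key, value in re.findall(r"(\w+):(\S+)", match.group("details")):
        info[key] = value
    # Assumption: 4 KiB pages, so the physical address is page * 4096 + offset.
    info["address"] = (int(info["page"], 16) << 12) | int(info["offset"], 16)
    return info


info = parse_edac_line(LINE)
print(info["label"], "channel", info["channel"], "slot", info["slot"])
print(hex(info["address"]))
```

This only pulls out what the driver reported (socket 1, channel 1, slot 0 here). Which physical slot on the board that corresponds to still has to come from the motherboard manual.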
trurl Posted July 27, 2019

There is memtest on the boot menu.
Squid Posted July 27, 2019

Memtest won't find the errors, since they're being corrected.

Sent from my phone as I'm probably having a beer and enjoying a fire
sedoro Posted July 29, 2019

Thanks for the answers. I tried memtest from the boot menu; the system rebooted but nothing happened. After the reboot, everything was fine until this morning, when I received another hardware error. This one is different. What should I do next?

Attachment: syslog290719
sedoro Posted August 30, 2019

It's been a month since I had the first hardware error, and it has just gotten worse. The system has been rebooting randomly since the end of July (kernel panic reboots, see attached capture). I haven't been able to complete a parity check, as the system always reboots before it finishes (10 TB, usually 25 hours), and I know there are parity errors, so I'm living on the edge now. When not running a parity check, the longest the system has gone without a reboot is 4 days, but it is so random that sometimes it reboots before I can even start the array again.

This is what I've ruled out, and why:

RAM: I removed all sticks but one and ran the system. Same reboots. I did this with 3 different sticks and different slots.
PSU: I have dual PSUs and have tried with only one at a time, with the same result.
UPS: I plugged the system directly into AC power. Same results.
Latest Unraid upgrade: the problems started, more or less, when I upgraded to 6.7.2. I downgraded to 6.7.1, but the reboots keep happening as before.

I also removed both CPUs, checked for dust or bent pins, and applied new thermal grease afterwards.

I contacted the retailer, and after some hardware tests they said this:

Quote
It seems to be a small known issue with unraid, something to do with Broadwell Era CPU's. Some people are suggesting a boot option to set C States on the CPU to C1 but not a definitive fix. Have a look at this forum and see if any applies to your situation - https://forums.unraid.net/topic/55140-632-kernel-panic-not-syncing-timeout-not-all-cpus-entered-broadcast-exception-handler/

Could it be related to buggy microcode or to a software problem? They say I could try downgrading to 6.3.2, since that version seems to be the one discussed in that thread. What do you think? Is it worth trying?
Also, two days ago I got a new hardware error:

Quote
Aug 28 01:42:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 01:42:03 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x105fa95 offset:0xf00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Aug 28 05:30:38 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Aug 28 05:30:38 Tower root: Hardware event. This is not a software error.
Aug 28 05:30:38 Tower root: Uncorrected error
Aug 28 05:30:38 Tower root: Data CACHE Level-2 Snoop Error
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged

Thanks, all, for your help.

PS: Title changed to reflect the new symptoms.

Attachment: syslog
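The corrected-error (CE) lines quoted in this thread name two different DIMM labels, one on July 27 and one on August 28. A rough way to see whether the errors concentrate on a single module is to tally the reported counts per label across a syslog. The sketch below is only an illustration: `tally_corrected_errors` is a made-up helper, the pattern is fitted to the two lines quoted in this thread, and it assumes the number before `CE` is the error count for that message.

```python
import re
from collections import Counter

# Hypothetical pattern: "<count> CE ... on <DIMM label> (" as seen in this thread.
CE_LINE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+) \(")


def tally_corrected_errors(lines):
    """Sum the corrected-error counts per DIMM label over syslog lines."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# The two CE lines quoted in this thread, shortened after the opening parenthesis.
sample = [
    "Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 ...)",
    "Aug 28 01:42:03 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR",
    "Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 ...)",
]

totals = tally_corrected_errors(sample)
for label, count in totals.most_common():
    print(label, count)
```

To run it on a real log, pass an open file instead of `sample`, for example `tally_corrected_errors(open("syslog"))`, where the file name is whatever your saved syslog is called.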
sedoro Posted September 5, 2019

So I've been able to complete a parity check (25 hours, 2,943 errors) by downgrading Unraid to 6.6.7. The system has been up for 1 day and 2 hours now. Maybe I've just been lucky, but I have a good feeling about it, as I had tried a parity check about 20 times before on version 6.7.x with no luck. No more MCE errors either. I'll run another parity check in a few days, and if the system doesn't reboot I'll add [Solved] to the title.
sedoro Posted September 17, 2019

Hi, it's been 14 days of uptime with no problems or errors. Two parity checks completed with 0 errors. It seems the problem was somehow related to Unraid 6.7.x. I hope it gets fixed in future updates. I find the "Hardware event. This is not a software error." message quite misleading.