August 1, 2020 I'm not sure what's going on. I'm trying out Unraid right now, and running a parity check seems to make the system reboot after roughly 4 hours (I can't be certain, since I've never been able to watch it happen). Fix Common Problems found that I had Machine Check Events. tower-diagnostics-20200801-0200.zip Edited August 1, 2020 by sand_
August 1, 2020 I would run Memtest (a boot option) for 24 hours...
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
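For anyone who wants to pull the interesting fields out of lines like these, here is a minimal Python sketch. The log text is copied from the post above; the function name `parse_mce` and the regular expressions are my own assumptions about the line layout, not part of any Unraid or kernel tool.

```python
import re

# Lines quoted from the diagnostics in the post above.
LOG = """\
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
"""

# Assumed layouts: "CPU n: Machine Check[ Exception][:] x Bank n: status"
# and "TSC x ADDR x MISC x". The numeric fields other than CPU and bank are hex.
BANK_RE = re.compile(
    r"CPU (\d+): Machine Check(?: Exception)?:? ([0-9a-f]+) Bank (\d+): ([0-9a-f]+)"
)
ADDR_RE = re.compile(r"TSC ([0-9a-f]+) ADDR ([0-9a-f]+) MISC ([0-9a-f]+)")


def parse_mce(text):
    """Return the CPU, bank, status, address and misc fields found in text."""
    event = {}
    bank = BANK_RE.search(text)
    if bank:
        event["cpu"] = int(bank.group(1))
        event["bank"] = int(bank.group(3))
        event["status"] = int(bank.group(4), 16)
    addr = ADDR_RE.search(text)
    if addr:
        event["addr"] = int(addr.group(2), 16)
        event["misc"] = int(addr.group(3), 16)
    return event


event = parse_mce(LOG)
print(event["cpu"], event["bank"], hex(event["status"]), hex(event["addr"]))
```

Having the status as an integer makes it easy to compare events across reboots, for example to check whether the same CPU and bank report the error every time.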
August 2, 2020 Author 8 hours in with 4 passes, still no errors. Good, I'm guessing? Edited August 2, 2020 by sand_
August 2, 2020 Author On 8/1/2020 at 7:53 AM, Frank1940 said: I would run Memtest (a boot option) for 24 hours...
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
Zero errors during the memtest after 10 passes.
August 2, 2020 Next step: let's see if there is a clue in the syslog at the time the failure occurs. Set up the Syslog Server per the following set of instructions: I would use the Mirror Syslog to Flash method, since the failure occurs within a few hours.
August 3, 2020 Author It took about 6 hours for it to reboot this time. I was able to catch it, and I think it might have said there was a kernel panic. syslog
August 3, 2020 The reboot occurs at line 22 in the syslog, at 22:06:21. I am not an expert at reading syslogs, but I don't see anything in the first 21 lines that is not typical of normal operation. I must ask: is it possible that you have a pet or child that might be pushing the reset button? Oftentimes during a parity check there is a nice flashing LED that tends to attract and demand attention from the curious. Is this a new hardware build or a recycled computer? You might provide a few details about the background of this server.
August 3, 2020 Author While I do have pets, none were in my room when it rebooted, and there are no children in my house. This is a recycled computer; it was my old gaming computer from about 5 years ago. It has an i5-4690K, 2x8 GB of RAM and a Corsair CX450 PSU. Before I started using Unraid, it ran as a Windows/Ubuntu computer for a couple of weeks with no problems. I'm going to try to snap a picture when it crashes, because the most recent reboot showed that text does appear on screen when it crashes.
August 3, 2020 Next thing to try: boot it in Safe Mode and see if it still reboots. Also go back to the stock BIOS settings if there is any overclocking. (Overclocking is a no-no for servers!) Also look at the inside of the case. Make sure it is clean. Get the dust out of the heat sinks and fans. Make sure that the airflow is over the drives. Basically, the fans at the back of the case should blow out. Double check that the PS/MB power plugs are all securely plugged in. (By the way, power supplies have caused this problem in the past...) Most rebooting problems are hardware related.
August 4, 2020 Author Tried safe mode; this time the reboot happened around 10 hours in. Will try the other stuff soon.
August 4, 2020 A parity check is when the system is likely to be under maximum load. This suggests the problem might be either power supply or temperature related. Do you have access to another power supply to see if that might be the culprit?
August 5, 2020 Author I managed to capture the moment when it reboots, and it spits out this:
mce: [Hardware Error]: CPU 2: Machine Check Exception 5 Bank 1: bf80000000000124
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81334b4e> {percpu_counter_add_batch+0x4e/0x52}
mce: [Hardware Error]: TSC 3963ac8a7429 ADDR 40b9a9340 MISC 86
mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1596582445 SOCKET 0 APIC 4 microcode 27
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Kernel Offset: disabled
Rebooting in 30 seconds..
I don't currently have a spare power supply on hand; I would use my main PC's PSU once I find a good sale on a replacement for it, however. It would suck if it is the PSU, as this one is only 4ish months old.
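As a rough aid for reading the Bank 1 status value in that panic output, here is a small Python sketch that splits it into the architectural flag bits. The bit positions follow my reading of the IA32_MCi_STATUS layout in Intel's Software Developer's Manual; the function name `decode_mci_status` is hypothetical, and this is not a replacement for running the output through `mcelog --ascii` as the kernel message suggests.

```python
# Flag bits of an IA32_MCi_STATUS value (assumed from the Intel SDM layout).
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status_hex):
    """Split a hex status string into flag names and error-code fields."""
    value = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (value >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": value & 0xFFFF,              # bits 15:0
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 31:16
    }


result = decode_mci_status("bf80000000000124")
print(result["flags"])
print(hex(result["mca_error_code"]))
```

For this value the PCC bit is set, which lines up with the "Processor context corrupt" line in the capture, and the UC bit marks the error as uncorrected, which is why the kernel panics instead of just logging the event.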
August 10, 2020 Author Solved? It didn't crash, and I was able to complete a parity rebuild after changing C-States in the BIOS from Auto to Disabled.