sand_, posted August 1, 2020 (edited):
Not sure what's happening. I'm trying out Unraid right now, and running a parity check seems to make the system reboot after roughly 4 hours (I can't be sure, since I've never been able to watch it happen). Fix Common Problems reported that I had Machine Check Events.
Attachment: tower-diagnostics-20200801-0200.zip
Frank1940, posted August 1, 2020:
I would run Memtest (a boot option) for 24 hours...
    Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
    Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
    Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
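For anyone who wants to pull these entries out of a longer syslog, here is a minimal Python sketch. It assumes the summary line has the shape shown in the post above; the function name `find_mce_events` and the regular expression are illustrative and are not part of Unraid or any kernel tool.

```python
import re

# Matches the MCE summary line as it appears in this thread, with or without
# the word "Exception" and the colon after "Machine Check".
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): "
    r"Machine Check(?: Exception)?:? [0-9a-f]+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)


def find_mce_events(syslog_text):
    """Return one dict per MCE summary line found in the given syslog text."""
    events = []
    for line in syslog_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "bank": int(match.group("bank")),
                # The status word is printed in hexadecimal.
                "status": int(match.group("status"), 16),
            })
    return events


# The three lines quoted in the post above.
SAMPLE = """\
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug 1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
"""

if __name__ == "__main__":
    for event in find_mce_events(SAMPLE):
        print(f"CPU {event['cpu']} bank {event['bank']} status {event['status']:#018x}")
```

Only the line that carries the CPU, bank, and status word is reported; the "events logged" and TSC/ADDR lines are skipped.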
sand_, posted August 2, 2020 (edited):
8 hours in with 4 passes, still no errors. Good, I'm guessing?
sand_, posted August 2, 2020:
On 8/1/2020 at 7:53 AM, Frank1940 said: I would run Memtest (a boot option) for 24 hours...
Zero errors during the memtest after 10 passes.
Frank1940, posted August 2, 2020:
Next step: let's see if there is a clue in the syslog at the time the failure occurs. Set up the Syslog Server per the following set of instructions: I would use the Mirror Syslog to Flash method, since the failure occurs within a few hours.
sand_, posted August 3, 2020:
It took about 6 hours to reboot this time. I was able to catch it, and I think it might have said there was a kernel panic.
Attachment: syslog
Frank1940, posted August 3, 2020:
The reboot occurs at line 22 in the syslog (time 22:06:21). I am not an expert at reading syslogs, but I don't see anything in the first 21 lines that is not typical of normal operation.
I must ask: is it possible that you have a pet or child who might be pushing the reset button? During a parity check there is often a nice flashing LED that attracts the attention of the curious.
Is this a new hardware build or a recycled computer? It would help if you provided a few details about the background of this server.
sand_, posted August 3, 2020:
While I do have pets, none were in my room when it rebooted, and there are no children in my house. This is a recycled computer: my old gaming PC from about 5 years ago. It has an i5 4690K, 2x8 GB of RAM, and a Corsair CX450 PSU. Before I started using Unraid, it ran as a Windows/Ubuntu machine for a couple of weeks with no problems. I'm going to try to snap a picture when it crashes, because the most recent reboot showed that text does appear on screen at that moment.
Frank1940, posted August 3, 2020:
Next thing to try: boot in Safe Mode and see if it still reboots. Also reset any overclocking to the BIOS stock settings. (Overclocking is a no-no for servers!)
Also look inside the case and make sure it is clean. Get the dust out of the heat sinks and fans. Make sure that the air flows over the drives. Basically, the fans at the back of the case should blow out. Double-check that the PS/MB power plugs are all securely plugged in. (By the way, power supplies have caused this problem in the past...) Most rebooting problems are hardware related.
sand_, posted August 4, 2020:
Tried safe mode; this time the reboot happened around 10 hours in. Will try the other stuff soon.
itimpi, posted August 4, 2020:
A parity check is when the system is likely to be under maximum load, which suggests the problem might be either power supply or temperature related. Do you have access to another power supply to see if that might be the culprit?
sand_, posted August 5, 2020:
I managed to capture the moment it reboots. It prints this:
    mce: [Hardware Error]: CPU 2: Machine Check Exception 5 Bank 1: bf80000000000124
    mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81334b4e> {percpu_counter_add_batch+0x4e/0x52}
    mce: [Hardware Error]: TSC 3963ac8a7429 ADDR 40b9a9340 MISC 86
    mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1596582445 SOCKET 0 APIC 4 microcode 27
    mce: [Hardware Error]: Run the above through 'mcelog --ascii'
    mce: [Hardware Error]: Machine check: Processor context corrupt
    Kernel panic - not syncing: Fatal machine check
    Kernel Offset: disabled
    Rebooting in 30 seconds..
I don't have a spare power supply on hand. I would use my main PC's PSU once I find a good sale on a replacement for it, however. It would suck if it is the PSU, as this one is only about 4 months old.
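As a rough aid for reading the status word in the capture above, here is a minimal Python sketch that splits it into the architectural flag bits of the x86 machine-check bank status register (IA32_MCi_STATUS). The bit positions used are the architectural ones (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57); the helper name `decode_mci_status` is hypothetical, and this is not a substitute for running the record through `mcelog --ascii` as the kernel message suggests.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions, high to low).
STATUS_FLAGS = {
    "VAL": 63,    # the register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # the error was not corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # the MISC value is valid
    "ADDRV": 58,  # the ADDR value is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status word reported for CPU 2, bank 1 in this thread.
    decoded = decode_mci_status(0xBF80000000000124)
    set_flags = [name for name, is_set in decoded["flags"].items() if is_set]
    print("flags set:", ", ".join(set_flags))
    print("MCA error code:", hex(decoded["mca_error_code"]))
```

For `bf80000000000124` the PCC and UC bits come out set and OVER clear, which is consistent with the "Processor context corrupt" line and the fatal panic in the capture. The same bank and status word appear in the August 1 syslog excerpt, so both reports describe the same kind of event.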
sand_, posted August 10, 2020:
Solved? It didn't crash, and I was able to complete a parity rebuild after changing C-States in the BIOS from Auto to Disabled.