JohnSnyder Posted July 3, 2018
Fix Common Problems found hardware errors on my machine. I've attached the syslog file. I'm using the latest version of unRAID (v6.5.3). I have dual 8-core Xeon processors on a Z9PE-D8-WS motherboard. I'd really appreciate an interpretation of this syslog by someone who understands it! Thanks!
Attachment: unraid-nas-syslog-20180703-1403.zip
JorgeB Posted July 3, 2018
Check the board's SEL (system event log); there might be more info there.
JohnSnyder Posted July 5, 2018
Crazy! I did a parity check/restore, and not only did the corrupt disk get restored, but the hardware problems disappeared from the Fix Common Problems report! It's interesting that I've had disk1 get corrupted 3 times. And I've swapped the disks around so that today's disk1 is different from a previous disk1 that also got corrupted. And the corruption occurred EACH time during a manually initiated Mover operation. Never during a scheduled Mover, only when I clicked the button that starts a Mover operation immediately.
JohnSnyder Posted July 7, 2018
Well ... The hardware errors are now showing up again.
Message from syslogd@unRAID-NAS at Jul 7 15:59:34 ...
kernel:mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f
Message from syslogd@unRAID-NAS at Jul 7 15:59:34 ...
kernel:mce: [Hardware Error]: TSC 65c0cc4db27 MISC 1c6c46004c00bd
Message from syslogd@unRAID-NAS at Jul 7 15:59:34 ...
kernel:mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1530993574 SOCKET 1 APIC 20 microcode 713
Any idea what these mean?
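For anyone trying to read those three mce lines, here is a minimal sketch that splits the Bank 7 status word into the architectural fields of the IA32_MCi_STATUS register, and decodes the PROCESSOR signature and the TIME value. The bit positions follow the machine-check chapter of the Intel SDM as I understand it; the function names are made up for this example, and the bus/interconnect pattern check in particular is my reading of the SDM's compound error code table and should be verified against it. It says nothing about what Bank 7 means on this specific CPU model, which is model-specific.

```python
from datetime import datetime, timezone

# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds an address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        "corrected": "VAL" in flags and "UC" not in flags,
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": mca_code,                      # bits 15:0
        # Assumption: compound codes of the form 000F 1PPT RRRR IILL
        # are bus/interconnect errors.
        "bus_interconnect_form": (mca_code & 0xE800) == 0x0800,
    }


def decode_cpuid_signature(sig: int) -> dict:
    """Decode the CPUID signature printed after 'PROCESSOR 0:'."""
    base_family = (sig >> 8) & 0xF
    family, model = base_family, (sig >> 4) & 0xF
    if base_family == 0xF:
        family += (sig >> 20) & 0xFF          # extended family
    if base_family in (0x6, 0xF):
        model += ((sig >> 16) & 0xF) << 4     # extended model
    return {"family": family, "model": model, "stepping": sig & 0xF}


status = decode_mci_status(0x8800004000310E0F)
cpu = decode_cpuid_signature(0x206D7)
when = datetime.fromtimestamp(1530993574, tz=timezone.utc)

print(status)
print(cpu)
print(when.isoformat())
```

Under those assumptions, the status word has VAL and MISCV set and UC clear, so it describes a corrected (not fatal) event, and the TIME field is simply the Unix timestamp of the same event that syslogd reported in local time.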
Squid Posted July 7, 2018
On 7/5/2018 at 12:26 PM, JohnSnyder said: Crazy! I did a parity check/restore, and not only did the corrupt disk get restored
On 7/5/2018 at 12:26 PM, JohnSnyder said: the hardware problems disappeared from the Fix Common Problems report!
On 7/5/2018 at 12:26 PM, JohnSnyder said: It's interesting that I've had disk1 get corrupted 3 times
All of this tends to imply memory... Did you do the following?
On 7/3/2018 at 2:30 PM, johnnie.black said: Check the board's SEL (system event log), there might be more info there.
_0m0t3ur Posted July 7, 2018
Reboot into Memtest and run it for at least 24 hours.
pwm Posted July 9, 2018
On 7/5/2018 at 6:26 PM, JohnSnyder said: Never during a scheduled Mover - only when I clicked on the button which initiates a Mover operation now.
The scheduled Mover runs most probably happen during low-traffic periods when the machine is otherwise idle, whereas a manually started run is more likely to coincide with other load. So system temperature, memory or PSU are prime suspects.
JohnSnyder Posted July 9, 2018
I ran a 36-hour memory test and the result was 0 errors. After I restarted unRAID I continued getting the same hardware errors. I reseated my video card and my memory modules, and then ran an extended Fix Common Problems scan. So far (only 30 minutes or so), no errors of any kind are showing up. I'm encouraged ... but not yet convinced that reseating those components has fixed the problem. I've had periods in the past where no errors showed up for hours and even days -- only to reappear without warning. So, we'll see.
JohnSnyder Posted July 11, 2018
Well, the same hardware errors showed up. I removed the original memory from CPU 0 in socket 1 (I've filled up the second set of slots in the interim) and the hardware error remains. So ... whatever ...
pwm Posted July 11, 2018
Was the machine doing something special when the error happened? Have you done anything to eliminate the other two prime suspects?
On 7/9/2018 at 1:51 PM, pwm said: So system temperature, memory or PSU are prime suspects.
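One way to answer the "doing something special" question is to count the hardware-error lines in the syslog per hour and compare the busy hours against Mover runs or other load. The following is only a sketch: it assumes the classic syslog line layout (`Jul  7 15:59:34 host kernel: ...`), the function name is made up, and the sample lines are hypothetical, built from the messages quoted earlier in this thread.

```python
import re
from collections import Counter

# Assumed layout: "Jul  7 15:59:34 host kernel: mce: [Hardware Error]: ..."
LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s.*\[Hardware Error\]"
)


def hardware_errors_by_hour(lines):
    """Count '[Hardware Error]' lines, keyed by 'Mon D HH:00'."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            month, day, hour = match.groups()
            counts[f"{month} {int(day)} {hour}:00"] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample = [
    "Jul  7 15:59:34 unRAID-NAS kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f",
    "Jul  7 15:59:34 unRAID-NAS kernel: mce: [Hardware Error]: TSC 65c0cc4db27 MISC 1c6c46004c00bd",
    "Jul  7 15:59:34 unRAID-NAS kernel: mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1530993574 SOCKET 1 APIC 20 microcode 713",
    "Jul  7 16:00:01 unRAID-NAS root: mover started",
    "Jul  9 02:10:05 unRAID-NAS kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f",
]

by_hour = hardware_errors_by_hour(sample)
for hour, count in sorted(by_hour.items()):
    print(hour, count)
```

To run it against a real log, pass an open file object instead of `sample`; each mce event produces three lines, so divide the counts by three if you want events rather than lines.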
Squid Posted July 11, 2018
On 7/7/2018 at 5:27 PM, Squid said: Did you
On 7/3/2018 at 2:30 PM, johnnie.black said: Check the board's SEL (system event log), there might be more info there.
JohnSnyder Posted July 12, 2018
The temperature always reads in the 30s and 40s. Currently I have the side panel off. I haven't done anything specific to check the power supply, and I'm not sure what to do. It's a fairly new 850-watt Corsair. I have repeatedly checked the log file (the one linked in the upper right-hand corner). It's the only log file I know of for unRAID, and it's the one I attached to my original post (saved in a zip file as unRAID-NAS SysLog). Is there another one I should look at?
JorgeB Posted July 12, 2018
You should check the board's system event log; check your manual.
JohnSnyder Posted July 12, 2018
Thanks, johnnie.black! I wasn't aware that the log you were referring to was in the BIOS. I did check it, and it is completely empty. I verified that the logging is enabled. However, I'll read the manual and try to figure out if there are any other settings I need to change in order to have this event log actually log something!