Hardware errors found by Fix Common Problems



Crazy!  I did a parity check/restore, and not only did the corrupt disk get restored, but the hardware problems disappeared from the Fix Common Problems report!  It's interesting that I've had disk1 get corrupted 3 times.  And I've swapped the disks around so that today's disk1 is different from a previous disk1 that also got corrupted.  And the corruption occurred EACH time during a manually initiated Mover operation.  Never during a scheduled Mover - only when I clicked on the button which initiates a Mover operation now.


Well ... The hardware errors are now showing up again.

 

Message from syslogd@unRAID-NAS at Jul  7 15:59:34 ...
 kernel:mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f

Message from syslogd@unRAID-NAS at Jul  7 15:59:34 ...
 kernel:mce: [Hardware Error]: TSC 65c0cc4db27 MISC 1c6c46004c00bd

Message from syslogd@unRAID-NAS at Jul  7 15:59:34 ...
 kernel:mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1530993574 SOCKET 1 APIC 20 microcode 713

 

Any idea what these mean??
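
For reference, the raw numbers in those three lines can be partly unpacked. Below is a minimal Python sketch, assuming the value after "Bank 7:" is the raw IA32_MCi_STATUS register and TIME is a Unix timestamp, which is how the kernel's mce output normally prints them. The bit positions are the architectural ones from Intel's machine-check documentation. The helper name decode_mci_status is made up for illustration, and the corrected-error count field is only meaningful on CPUs that implement it. The model-specific code is left undecoded because its meaning depends on the processor and bank.

```python
from datetime import datetime, timezone

# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

def decode_mci_status(status: int) -> dict:
    """Hypothetical helper: split a raw status value into its generic fields."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF, # bits 31:16, CPU-specific
        # Bits 52:38; assumption: only valid if the CPU supports the count field.
        "corrected_count": (status >> 38) & 0x7FFF,
    }

# Values copied from the log lines above.
decoded = decode_mci_status(0x8800004000310E0F)
when = datetime.fromtimestamp(1530993574, tz=timezone.utc)

print(decoded["flags"], hex(decoded["mca_error_code"]), decoded["corrected_count"])
print(when.isoformat())
```

For the values above this reports VAL and MISCV set with UC clear, which would mean a valid error that the hardware corrected rather than an uncorrected one. The timestamp works out to 19:59:34 UTC on 7 July 2018, consistent with the 15:59:34 syslog time if the server's clock is four hours behind UTC.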

On 7/5/2018 at 12:26 PM, JohnSnyder said:

Crazy!  I did a parity check/restore, and not only did the corrupt disk get restored

 

On 7/5/2018 at 12:26 PM, JohnSnyder said:

the hardware problems disappeared from the Fix Common Problems report! 

 

On 7/5/2018 at 12:26 PM, JohnSnyder said:

 It's interesting that I've had disk1 get corrupted 3 times

All of that tends to point to memory...

 

Did you do this?

On 7/3/2018 at 2:30 PM, johnnie.black said:

Check the board's SEL (system event log), there might be more info there.

 

On 7/5/2018 at 6:26 PM, JohnSnyder said:

Never during a scheduled Mover - only when I clicked on the button which initiates a Mover operation now.

 

The scheduled Mover operations most probably run during low-traffic periods, when the machine is otherwise idle.

 

So system temperature, memory or PSU are prime suspects.


I ran a 36-hour memory test and the result was 0 errors. After I restarted unRAID, I continued getting the same hardware errors.

 

I reseated my video card and my memory modules, and then ran an extended Fix Common Problems scan. So far (only 30 minutes or so), no errors of any kind are showing up. I'm encouraged ... but not yet convinced that reseating those components has fixed the problem. I've had periods in the past where no errors showed up for hours and even days, only for them to reappear without warning. So, we'll see.

 


The temperature always reads in the 30s and 40s.  Currently I have the side panel off.

 

I haven't done anything specific to check the power supply - I'm not sure what to do. It's quite new, an 850-watt Corsair.

 

I have repeatedly checked the log file (the link to it is in the upper right-hand corner). It's the only log file I know of for unRAID, and it's the one I attached to my original post (saved in a zip file as unRAID-NAS SysLog). Is there another one I should look at?
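
When going through that syslog, it can help to tally the machine-check lines rather than read them one by one. Here is a rough Python sketch, assuming the lines look like the ones quoted earlier in this thread. The function name count_mce_events and the sample text are only for illustration, and the pattern would need adjusting if the kernel prints a different layout.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel:mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f
MCE_LINE = re.compile(
    r"\[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)

def count_mce_events(log_text: str) -> Counter:
    """Hypothetical helper: count machine checks per (cpu, bank, status)."""
    counts = Counter()
    for match in MCE_LINE.finditer(log_text):
        cpu, bank, status = match.groups()
        counts[(int(cpu), int(bank), status.lower())] += 1
    return counts

# Sample text copied from the messages posted above.
sample = """\
Message from syslogd@unRAID-NAS at Jul  7 15:59:34 ...
 kernel:mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 7: 8800004000310e0f
Message from syslogd@unRAID-NAS at Jul  7 15:59:34 ...
 kernel:mce: [Hardware Error]: TSC 65c0cc4db27 MISC 1c6c46004c00bd
"""

counts = count_mce_events(sample)
for (cpu, bank, status), n in counts.items():
    print(f"CPU {cpu} bank {bank} status {status}: {n} event(s)")
```

If the same CPU and bank keep turning up, that at least narrows down where the errors are being reported from.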


Thanks, johnnie.black!  I wasn't aware that the log you were referring to was in the BIOS.

 

I did check it, and it is completely empty.  I verified that the logging is enabled.  However, I'll read the manual and try to figure out if there are any other settings I need to change in order to have this event log actually log something!

