Multiple Disk Read Errors - Random



So I've been scratching my head about this one.  It seems that I will randomly get read errors on random disks (sometimes 2 at once).  Initially I thought that it was an issue with the drives, so I removed them from the array, ran SMART and tested them, and all came back good.  I re-added them to the array and rebuilt the data, and they seemed fine until another random drive had read errors.  I'm trying to work my way through what's common among the drives and testing those shared components.  The weird thing is that they're all on different backplanes and therefore different cables.  So far, here is what I've done:

 

- Replaced cables going from HP SAS Expander to 8087/8088 converter

- Replaced cables going from 8087/8088 converter to LSI 9205-8e HBA

 

The next thing I was thinking of was updating the firmware on the 9205, as it appears to be from 2011... but it seems older firmware is preferred on these cards?  I was also considering running memtest to see if memory is the problem, though the difficulty there is finding downtime.

 

Does anyone else have any thoughts or suggestions?  I've included the diagnostics here as well :)


~Spritz

dyson-diagnostics-20200705-1250.zip

Are you running a UPS? It could be dirty power.

8 minutes ago, Spritzup said:

Thanks for the reply.  Yeah, the system(s) are on a UPS.

Well, good luck. You have done all the other steps I would have done. There are people smarter than me on here, so hopefully one of them has an idea. The only other thing I can think of is the power cable to the backplane, or the backplane itself. But if it's random across different ones, I don't think it's that. If you think it's RAM, sometimes just re-seating the sticks and moving them around fixes it.

Edited by nicksphone
Thanks for the follow-up.  I mean, it could be the backplanes; it just seems like it would be really odd to have all of them start acting up at the same time.  As for your suggestion about the power cable, the system has the backplanes load balanced across separate lines on the PSU, without any extensions, splitters, etc.  So while it's possible, I think power is unlikely to be the cause.

 

Fun fact though: older Norco 4224s don't support the 3.3V SATA spec, so you need to cover the 3.3V pin on newer SAS/SATA drives.

 

~Spritz


The errors on disk10 don't appear random; they are logged as media errors, and the disk appears to be failing:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    104
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
197 Current_Pending_Sector  -O--C-   100   100   000    -    16
198 Offline_Uncorrectable   ----C-   100   100   000    -    16

 

Run an extended SMART test.
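
For anyone who wants to check their own drives the same way, here is a rough Python sketch of how an attribute table like the one above can be scanned for the usual media-error indicators. It is only an illustration: the watch list (IDs 5, 187, 197, 198), the helper name `flag_attributes`, and the "any non-zero raw value is worth a look" rule are my assumptions, not anything official from smartmontools. The sample text is the disk10 table, using the standard ID 187 for Reported_Uncorrect.

```python
import re

# Attribute IDs whose non-zero raw value often points at media problems.
# This watch list is an assumption for illustration, not an official list.
WATCHED_IDS = {5, 187, 197, 198}

# One attribute row: ID, name, flags, value, worst, thresh, fail, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)


def flag_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or anything that is not an attribute row
        attr_id = int(match.group(1))
        name = match.group(2)
        raw_value = int(match.group(8))
        if attr_id in WATCHED_IDS and raw_value > 0:
            flagged[name] = raw_value
    return flagged


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    104
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
197 Current_Pending_Sector  -O--C-   100   100   000    -    16
198 Offline_Uncorrectable   ----C-   100   100   000    -    16
"""

if __name__ == "__main__":
    for name, raw in flag_attributes(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

A healthy drive would normally come back with nothing flagged. Here all four attributes are flagged, which fits a failing disk rather than a cabling or controller problem.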

2 minutes ago, Spritzup said:

Thanks @johnnie.black, you're an asset to this forum :)

 

I ran the extended SMART test, and assuming I'm reading it right, it does appear that the drive is failing.  I've posted it here to have another set of eyes have a look, in case I missed something.

 

~Spritz

dyson-smart-20200706-0824.zip 4.23 kB · 0 downloads

According to that report only the ‘short’ SMART test has been run - but that failed anyway.
