Spritzup Posted July 5, 2020
So I've been scratching my head about this one. It seems that I will randomly get read errors on random disks (sometimes two). Initially I thought it was an issue with the drives, so I removed them from the array, ran SMART tests on them, and all came back good. I re-added them to the array and rebuilt the data, and they seemed fine until another random drive had read errors.

I'm trying to work my way through what's common among the drives and testing those shared components. The weird thing is that they're all on different backplanes and therefore different cables. So here is what I've done so far:
- Replaced cables going from HP SAS Expander to 8087/8088 converter
- Replaced cables going from 8087/8088 converter to LSI 9205-8e HBA

The next thing I was thinking of was updating the firmware on the 9205, as it appears to be from 2011... but it seems older firmware is preferred on these cards? I was also considering running memtest to see if that's the problem, though the issue there is finding downtime.

Does anyone else have any thoughts or suggestions? I've included the diagnostics here as well.
~Spritz
dyson-diagnostics-20200705-1250.zip
nicksphone Posted July 5, 2020
Are you running a UPS? It could be dirty power.
Spritzup (Author) Posted July 5, 2020
Thanks for the reply. Yeah, the system(s) are on a UPS.
nicksphone Posted July 5, 2020 Posted July 5, 2020 (edited) 8 minutes ago, Spritzup said: Thanks for the reply. Yeah, the system(s) are on a UPS. well good luck you have done all the other steps i would of done there are people smarter than me on here hopefully one of them has an idea. my only other thing i can think of is power cable to the back plane or the plane its self. but if its random over different ones i don't think its that. if you think its ram sometimes if you just re-seat and move them around it fixes it. Edited July 5, 2020 by nicksphone Quote
Spritzup (Author) Posted July 5, 2020
Thanks for the follow up. I mean, it could be the backplanes; it just seems like it would be really odd to have all of them start acting up at the same time.

As for your suggestion re: the power cable, the system has the backplanes load-balanced across separate lines on the PSU without the use of any extensions, splitters, etc. So while it's possible, I think power is not likely.

Fun fact though: older Norco 4224s don't support the 3.3 V SATA spec, and you need to cover the pin on newer SAS/SATA drives.
~Spritz
JorgeB Posted July 6, 2020
The errors on disk10 don't appear random; they are logged as media errors and the disk appears to be failing:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    104
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
197 Current_Pending_Sector  -O--C-   100   100   000    -    16
198 Offline_Uncorrectable   ----C-   100   100   000    -    16

Run an extended SMART test.
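For anyone wanting to check their own drives the same way, here is a minimal Python sketch of the check described above: scan a SMART attribute table for the four media-error attributes and report any with a non-zero raw value. It assumes the 8-column brief layout shown in the post; the sample text is reconstructed from that post (Reported_Uncorrect is normally attribute 187), and the names `WATCHED`, `parse_attributes`, and `flag_media_errors` are made up for illustration. A non-zero count is a reason to run an extended test, not a verdict on its own.

```python
# Attribute names treated as media-error indicators in this sketch.
# Assumption: this watch list mirrors the four attributes quoted in the thread;
# it is illustrative, not an exhaustive or vendor-approved list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Sample reconstructed from the post (8-column brief layout).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    104
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
197 Current_Pending_Sector  -O--C-   100   100   000    -    16
198 Offline_Uncorrectable   ----C-   100   100   000    -    16
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every parsable table row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 8 columns;
        # this skips the header line and anything malformed.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        raw = fields[7]  # RAW_VALUE column in the brief layout
        if raw.isdigit():
            rows[fields[1]] = int(raw)
    return rows


def flag_media_errors(text):
    """Return the watched attributes whose raw value is greater than zero."""
    return {
        name: raw
        for name, raw in parse_attributes(text).items()
        if name in WATCHED and raw > 0
    }


if __name__ == "__main__":
    for name, raw in flag_media_errors(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

The sketch matches on attribute name rather than ID so that it still works if the ID column is damaged in a pasted report.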
Spritzup (Author) Posted July 6, 2020
Thanks @johnnie.black, you're an asset to this forum.

I ran the extended SMART test and, assuming I'm reading it right, it does appear that the drive is failing. I've posted it here so another set of eyes can have a look, in case I missed something.
~Spritz
dyson-smart-20200706-0824.zip
itimpi Posted July 6, 2020
According to that report only the 'short' SMART test has been run - but that failed anyway.
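Since the short and extended tests are easy to mix up, here is a small Python sketch of how the extended one could be started from the command line. It assumes smartmontools is installed and uses `smartctl -t long` to start the extended test and `smartctl -l selftest` to read the self-test log afterwards. The device path `/dev/sdX` is a placeholder and the function names are invented for this example; it is not how anyone in this thread ran their tests.

```python
import subprocess


def extended_test_command(device):
    """Build the smartctl command that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device):
    """Build the smartctl command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def start_extended_test(device, dry_run=True):
    """Start the extended test, or only return the command line when dry_run is True.

    dry_run defaults to True so nothing touches a disk unless explicitly requested.
    """
    cmd = extended_test_command(device)
    if dry_run:
        return " ".join(cmd)
    # Needs root privileges and a real device path; returns smartctl's output.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the real device before a live run.
    print(start_extended_test("/dev/sdX"))
    print(" ".join(selftest_log_command("/dev/sdX")))
```

The extended test runs in the background on the drive and can take many hours on large disks, so the log is something to check later rather than straight away.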
JorgeB Posted July 6, 2020
2 minutes ago, itimpi said: but that failed anyway.
Yep, that disk needs to be replaced.
Spritzup (Author) Posted July 7, 2020
*facepalm*
OK, thanks both @johnnie.black and @itimpi. New drive it is.
~Spritz