
Multiple disks with various SMART errors in logs: bad disks or something else?


Solved by JorgeB


Hey all

 

I'm hoping some smarter people can look at my diags and help point to the likely culprit of an issue.

 

I built a server a few months ago

HPE DL380 LFF Gen9 (12 x 3.5" drive slots)

2 @ E5-2690 Xeons

128GB ECC RAM

Matrox onboard display

4-port Broadcom NIC

Emulex 2-port 10G NIC

LSI SAS2308 SAS controller for JBOD enclosure (not currently in use)

HPE Smart Array SAS controller and expander to control all 12 drives (sorry, I forget the exact models)

Nvidia GTX 1650 Super (for Plex transcoding)

7 @ 8TB HGST H7280A520SUN8.0T drives (data)

2 @ Seagate ST12000NM004G (dual parity)

1 @ Samsung 870 4TB SSD (data write cache)

1 @ Inland SATA SSD 1TB (appdata)

1 @ Samsung SATA SSD 500GB (VM OS disk)

1 @ 8TB HGST H7280A520SUN8.0T, precleared and in a static bag as a cold spare.

 

All of the spindle drives were purchased used, ex-data-center drives, so I know there is a good possibility of drive failures with these. The 12TB drives had less than 500 hours on them, so they were basically new.

 

 

About a week ago I saw errors on one data drive. I looked at the SMART logs and saw delayed read errors. I talked to the company I bought the drives from, and even though the drive was out of their warranty period they replaced it!

I put the cold spare in and it started rebuilding. During the rebuild I went to check whether Plex and other apps needed updates, and got an error saying the server couldn't connect to GitHub, plus an error saying it couldn't write to a file under /usr/local. That had happened a few weeks ago too, and the "fix" (really a workaround) was to reboot the server 🙄 Unfortunately, because of the error the server saw it as an unclean shutdown, stopped the data rebuild, and kicked off a parity check instead (I hope that didn't delete 5 to 6 TB of data 🤬)...

During the parity check the "spare" drive listed over 1 million errors (very close to the number of writes on the disk). I paused the parity check, stopped the array, put the replacement drive in the system (removing an unassigned drive), and restarted the array. At that point Unraid disabled the "spare" drive. I looked at the logs on the replacement drive and it has a bunch of SMART errors as well 🤬

 

I then looked at all the drives, and out of the 9 spinning drives, 6 are showing errors in the SMART logs; most are delayed read errors, but there are some delayed write and verification errors as well. NONE of the SSDs have any errors, and one 12TB and two 8TB drives show no errors!

 

I am currently running a full SMART scan on all drives.
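(In case anyone wants to script the same thing, the sketch below is roughly what I mean by a full scan: start an extended SMART self-test on each drive with smartctl and poll until they finish. The device names are placeholders rather than my actual layout, and the "in progress" check is a rough heuristic, so treat it as a starting point, not a finished tool.)

```python
#!/usr/bin/env python3
"""Rough sketch: start a long (extended) SMART self-test on a list of drives
and poll for completion with smartctl. Device names are examples only --
substitute the sdX names from your own system (lsblk / smartctl --scan)."""

import subprocess
import time

# Example device list -- not my real layout.
DRIVES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]

def start_long_test(dev: str) -> None:
    # smartctl -t long kicks off the drive's internal extended self-test.
    subprocess.run(["smartctl", "-t", "long", dev], check=False)

def self_test_running(dev: str) -> bool:
    # Heuristic: while the test runs, the smartctl -a output mentions that a
    # self-test is "in progress" (wording differs slightly between ATA and SAS).
    out = subprocess.run(["smartctl", "-a", dev],
                         capture_output=True, text=True).stdout.lower()
    return "in progress" in out

if __name__ == "__main__":
    for dev in DRIVES:
        start_long_test(dev)
    # Poll every 10 minutes; an extended test on an 8TB drive can take 12+ hours.
    while any(self_test_running(d) for d in DRIVES):
        time.sleep(600)
    for dev in DRIVES:
        print(subprocess.run(["smartctl", "-a", dev],
                             capture_output=True, text=True).stdout)
```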

 

In my mind it doesn't seem likely that so many of the drives would have errors, BUT it also doesn't seem likely that there is another hardware issue, because then I would expect ALL the drives to have errors.

 

I have attached diags in the hope that some other smart people, who know Linux and Unraid much better than me, can look at them and help me figure out whether this is a random fluke and most of the drives I bought went bad, or whether something else is going on.

 

bob-diagnostics-20230621-1625.zip


Update

 

I have attached an updated diagnostic report

 

I was misreading the logs and counting correction algorithm invocations as errors, but it seems they are not actually "errors". However, I still have 4 drives with delayed read and write SMART errors...
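(For anyone else squinting at these tables, below is a rough sketch of pulling the "delayed" and "Total uncorrected errors" columns out of the SAS error counter log that smartctl prints for these drives. The column positions are assumed from the usual smartctl layout and the device names are just examples, so adjust as needed.)

```python
#!/usr/bin/env python3
"""Sketch: extract the 'delayed' and 'Total uncorrected errors' columns from the
SAS/SCSI error counter log printed by smartctl -a. The column layout is assumed
to match the usual smartctl table; adjust the indexes if yours differs."""

import subprocess

def error_counters(dev: str) -> dict:
    out = subprocess.run(["smartctl", "-a", dev],
                         capture_output=True, text=True).stdout
    counters = {}
    for line in out.splitlines():
        parts = line.split()
        # Rows of interest look like:  read:  0  2  0  2  2  12345.678  0
        if parts and parts[0] in ("read:", "write:", "verify:"):
            counters[parts[0].rstrip(":")] = {
                "delayed": int(parts[2]),              # corrected by delayed retries
                "algorithm_invocations": int(parts[5]),
                "uncorrected": int(parts[7]),          # the number that really matters
            }
    return counters

if __name__ == "__main__":
    for dev in ["/dev/sdb", "/dev/sdc"]:   # placeholder device names
        print(dev, error_counters(dev))
```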

 

It still seems kind of unlikely that 5 drives would all start throwing errors (even though Unraid is only reporting errors on the main screen for one)... Is it possible the SAS card is failing, or that something else is going on? Or did I just have a run of really bad luck with drives?

bob-diagnostics-20230621-2156.zip

10 hours ago, JorgeB said:

Where are you seeing these?

In the SMART scan logs; they are all corrected. I attached a pic of the logs for the drive Unraid disabled. It showed somewhere around 1.5 million errors on the main screen while rebuilding the drive, even though SMART only shows 2 errors.

 

Bill

 

 

Screenshot_20230622_134920_Chrome~2.jpg

8 hours ago, JorgeB said:

Those are nothing to worry about, a non zero "Total uncorrected errors" will usually indicate a problem.

If these are "nothing to worry about", why did disk 5, with only 2 corrected delayed read errors, throw over 1.5 million errors in Unraid while rebuilding data (drive replaced), and why did Unraid disable the drive?

 

Sorry, I'm not trying to be confrontational, just trying to understand why a drive with so few errors, of a kind that is usually "nothing to worry about", would get toasted by Unraid...

 

While I'm not happy about the drives having errors, I'm more worried about something else happening on the server that I was hoping would have shown in the logs. 


The diags are from after the disk was disabled :( so there are no diags from when the disk was throwing the errors in Unraid.

I had to reboot the server due to another issue I have a separate thread open for (an error showing Unraid is unable to write to /usr/local/, which basically stops Unraid from accessing GitHub, updating plugins/dockers, etc., and causes an unclean shutdown and parity check when using the reboot or shutdown option in the GUI).

 

I can try doing a preclear on the disabled drive and see what it does. I just rebuilt the data onto a different drive and would really rather not put the disabled drive back in and rebuild again. I don't know if I will see the errors if the disk is unassigned.
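(If it does start throwing errors again, something like this should at least pull the error lines out of a saved diagnostics zip before another reboot rotates the syslog. The file names inside the zip and the exact kernel message wording are my guesses, so the pattern will probably need tweaking.)

```python
#!/usr/bin/env python3
"""Sketch: search the syslog files inside an Unraid diagnostics zip for
md-driver read/write error lines. The file names inside the zip and the exact
wording of the kernel messages are assumptions -- adjust to what your log shows."""

import re
import zipfile

DIAG_ZIP = "bob-diagnostics-20230621-1625.zip"            # one of the zips from this thread
PATTERN = re.compile(r"md: disk\d+ (read|write) error", re.IGNORECASE)

with zipfile.ZipFile(DIAG_ZIP) as z:
    for name in z.namelist():
        if "syslog" not in name:
            continue
        with z.open(name) as fh:
            for raw in fh:
                line = raw.decode("utf-8", errors="replace")
                if PATTERN.search(line):
                    print(f"{name}: {line.rstrip()}")
```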

  • 2 weeks later...

Well I'm back, 

 

The initial disk has been solid: it passed a full SMART test (just the 2 corrected delayed read errors that have been there for a while), and I copied 2 or 3 TB of data to the drive with no issues.

 

BUT the replacement drive, which has no SMART errors, now has Unraid reporting almost 22,000 errors against it 😞

 

My guess is the server's backplane has a flaky port or something similar... but I have attached the diags for anyone who wants to take a look and make suggestions!
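(To double-check the flaky-slot theory, one read-only thing I can try is listing the /dev/disk/by-path symlinks to see which controller and phy each sdX hangs off, and whether the disks that have logged errors all share the same path. A minimal sketch, assuming the by-path names exist on this box:)

```python
#!/usr/bin/env python3
"""Sketch: map each whole-disk device to its controller/expander path by
reading the /dev/disk/by-path symlinks. Useful for checking whether the disks
that have logged errors all sit behind the same backplane slot or phy."""

import os

BY_PATH = "/dev/disk/by-path"

for name in sorted(os.listdir(BY_PATH)):
    if "part" in name:          # skip partition entries, keep whole disks
        continue
    target = os.path.realpath(os.path.join(BY_PATH, name))
    print(f"{target:<12} {name}")
```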

 

 

Screenshot_20230702_163032_Chrome.jpg

bob-diagnostics-20230702-1619.zip


That is my fear too... All internal slots are full 😞 BUT I am condensing a couple of drives down onto a bigger drive. All the disks that have logged errors in Unraid have been in that one drive slot, so I'm just going to mark it as bad and not use it. It sucks to have a nice hot-swap chassis with a bad slot, but as long as it doesn't spread to other slots I can live with it.

 

Thanks for your help JorgeB! 
