
Unraid disk SMART passed with read errors?



Hi All,

I'm new to Unraid and SMART (although I have set up my Unraid array and know roughly what SMART is).

I'm testing now and learning about both of these.

I have one disk in the array that's reported 9000+ errors (particularly when setting up the array).

I ran an extended self-test and it stopped at 10% with "Completed: read failure."

When I look at the SMART report though it suggested the disk "Passed."

How can that be?  Is the disk usable?  Unraid is not reporting failure or need to replace etc.
The SMART report is attached.
Thanks for any suggestions.

Cheers,
Ashley.

 

ST3000DM001-1ER166_Z501RTHG-20230216-0922.txt

Link to comment
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     18441         886976
# 2  Short offline       Completed: read failure       90%     18441         886976

 

29 minutes ago, AshleyAitken said:

Is the disk usable?

no

Link to comment

Also, you should be seeing SMART warnings ( 👎 ) on the Dashboard page for that disk. And if you click on the disk to get to its Attributes, these would all be highlighted.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   097   097   010    -    3512
187 Reported_Uncorrect      -O--CK   001   001   000    -    147
197 Current_Pending_Sector  -O--C-   001   001   000    -    57720
198 Offline_Uncorrectable   ----C-   001   001   000    -    57720
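
For anyone who would rather check these values in a script than by eye, here is a minimal sketch that pulls the raw values out of an attribute table in the format quoted above. The watch list (5, 187, 197, 198) is just the set of attributes highlighted in this post, and the function name `nonzero_watched` is made up for illustration; it is not part of Unraid or smartctl.

```python
import re

# Attribute IDs highlighted in this thread; an illustrative watch list,
# not an official one.
WATCHED_IDS = {5, 187, 197, 198}

# One data row of the table above:
# ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)


def nonzero_watched(report):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header line or anything that is not an attribute row
        attr_id = int(match.group(1))
        raw_value = int(match.group(8))
        if attr_id in WATCHED_IDS and raw_value > 0:
            flagged[match.group(2)] = raw_value
    return flagged


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   097   097   010    -    3512
187 Reported_Uncorrect      -O--CK   001   001   000    -    147
197 Current_Pending_Sector  -O--C-   001   001   000    -    57720
198 Offline_Uncorrectable   ----C-   001   001   000    -    57720
"""

if __name__ == "__main__":
    for name, raw in nonzero_watched(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

On a healthy disk all four of these raw values would normally be zero, so the function would return an empty dict.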

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

 

Link to comment

 

Thank you @trurl for your replies.

 

TBH, I was overwhelmed by the SMART data and focussed on the "Passed" status.

 

Is this disk usable? No.

 

It's working as a part of the Unraid array (AFAIK). I can copy files to and read them from the array, but of course that could be the other drive doing the work?

 

I would have thought Unraid would have told me if it was having trouble writing or reading to a disk in the array, and I should replace the disk?  I would hope I would get more than a "warning."

 

And now it does seem to be, although I don't believe it showed this "error" status before I had run the SMART Self-Tests. Even when I was setting up the array and there were 9000+ *warnings* the UI dashboard didn't show anything.

 

image.png.1fb89562bea12a072f2f98b40aac4d25.png

 

Notifications do come up in Unraid and I do get emails but again they were listed as "warnings" in the notifications and emails (AFAIK), whereas in the Main page > Array Devices they are listed as errors.  

 

image.thumb.png.7372ffac4a69369041350ef5a6773f16.png

 

The disk does report these attributes:

 

image.png.f304aa92d3f149fff4630c59bb3aa406.png

 

and in more detail:

 

image.thumb.png.55b4fc4e111b94e1ab08145da89df6c3.png

 

Those "Pre-fail" look ominous 😞 and probably mean I should replace the disk?

 

But still I am confused why SMART reports "PASSED."  Surely it should be saying something like "going to fail" or "replace"?

 

Again, I am only learning and there's no real data on the array as yet. 

 

Thanks for all comments and suggestions.

 

 

Edited by AshleyAitken
Link to comment
2 hours ago, AshleyAitken said:

that could be the other drive doing the work?

If a disk isn't disabled, it is the only one doing the work for reads. If a disk is disabled, it isn't used at all and the rest of the array is doing the work for it.

 

2 hours ago, AshleyAitken said:

before I had run the SMART Self-Tests

Until you ran the self-tests, some of the problems might have been on parts of the disk that hadn't been accessed yet.

 

Link to comment

UDMA CRC errors are recorded by the drive firmware when it receives inconsistent data based on checksum. These are almost always connection problems. Usually the data is resent so no harm, but you do need to fix something if they increase frequently. And often, a connection problem will not result in CRC error because the drive never receives any data to checksum.

 

More in my next post on things you need to be concerned with and some things you can do with the webUI.

 

Link to comment
10 hours ago, AshleyAitken said:

was overwhelmed by the SMART data and focussed on the "Passed" status.

In many cases the overall Passed status is meaningless.    It only changes if one of the attributes has a “Failing Now” status.    The one that is the best indication of drive health is the Extended SMART test.   If this cannot complete without error then you should be replacing the drive.
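
As a rough illustration of trusting the self-test log rather than the overall status, here is a small sketch that scans a log in the format quoted earlier in this thread and reports entries that ended in a failure. It assumes failed runs have a status beginning with "Completed:" (as in "Completed: read failure"), while a clean run reads "Completed without error"; the function name `failed_self_tests` is hypothetical.

```python
import re

# One row of the self-test log format quoted in this thread:
# "# N  Description  Status  Remaining%  LifeTime(hours)  LBA_of_first_error"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def failed_self_tests(log):
    """Return a list of dicts, one per self-test entry that ended in a failure."""
    failures = []
    for line in log.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        num, description, status, _remaining, hours, lba = match.groups()
        # Assumption: failure statuses start with "Completed:" whereas a
        # clean run is reported as "Completed without error".
        if status.startswith("Completed:"):
            failures.append(
                {
                    "num": int(num),
                    "test": description,
                    "status": status,
                    "hours": int(hours),
                    "first_error_lba": lba,
                }
            )
    return failures


SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     18441         886976
# 2  Short offline       Completed: read failure       90%     18441         886976
"""

if __name__ == "__main__":
    for entry in failed_self_tests(SAMPLE_LOG):
        print(f"{entry['test']}: {entry['status']} at LBA {entry['first_error_lba']}")
```

Any non-empty result here is a reason to plan a replacement, whatever the overall status line says.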

Link to comment

In general, you can always Acknowledge a SMART warning on the Dashboard page by clicking on it ( 👎) and it will warn again if it increases. Some are more important than others.
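
To make the acknowledge-then-warn-again behaviour concrete, here is a toy model of it. This is only a sketch of the idea as described above, not Unraid's actual implementation; the class and method names are invented for illustration.

```python
class AcknowledgedCounters:
    """Toy model: warn when a raw SMART counter exceeds the last acknowledged value."""

    def __init__(self):
        # attribute name -> raw value the user last acknowledged
        self._acknowledged = {}

    def acknowledge(self, attribute, raw_value):
        """Record that the user has seen this value and accepted it."""
        self._acknowledged[attribute] = raw_value

    def should_warn(self, attribute, raw_value):
        """Warn if the counter is above the acknowledged baseline (default 0)."""
        return raw_value > self._acknowledged.get(attribute, 0)


if __name__ == "__main__":
    counters = AcknowledgedCounters()
    print(counters.should_warn("UDMA_CRC_Error_Count", 3))  # new, unacknowledged
    counters.acknowledge("UDMA_CRC_Error_Count", 3)
    print(counters.should_warn("UDMA_CRC_Error_Count", 3))  # unchanged since ack
    print(counters.should_warn("UDMA_CRC_Error_Count", 5))  # has increased
```

The point is that acknowledging does not silence an attribute for good; it only moves the baseline, so a count that keeps climbing keeps warning.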

 

An occasional CRC I usually just acknowledge, and maybe just check and reseat the connection next time I need to open the case. More frequent CRC needs to be taken care of.

 

The screenshot you posted earlier with the checkboxes shows which Attributes get monitored. You can set these for all disks in Disk Settings, and override those settings for an individual disk in its settings.

 

A disk has some sectors reserved to replace bad sectors. That is what reallocation is about. A few reallocated is usually OK as long as it isn't increasing. Pending sectors are sectors that will be reallocated when they are written again. These are a little more worrisome because it means the data at that sector can't be reliably read. You can ensure these get written by rebuilding the disk.

 

You can add to the list of SMART attributes for monitoring in Disk Settings or in the settings for an individual disk. Some disk manufacturers use the attributes a little differently. It is recommended that you add attributes 1 and 200 for WD disks.

 

 

 

Link to comment

 

I must say I am impressed with the look and feel of the Unraid web UI but somewhat disappointed with the vagueness (at least it seems so to me) and confusion with regard to disk / array status.

 

Here's another example:

image.png.c86c17b2dede335d38a8bf30016d0e1e.png

 

Disk 2, which has been the better disk so far, has now had 22 errors/warnings and shows this in the dashboard.

 

Why is it "disabled" when SMART is showing healthy (after errors and warnings)?

 

As an end user I would really like to know what Unraid's best understanding is of the overall status of each disk, e.g. healthy, failing, replace asap, or failed, and similarly for the array, e.g. healthy, needs repairing, repairing.  Nothing more... (unless I go to an advanced page etc).

 

IMHO, having to understand and research different disk errors and try to work out what the UI is telling me (without popups explaining different values and information etc) is not good. 

 

 

Link to comment

It might be useful to think of warnings as more important than errors.

 

Error typically means a specific thing has failed in a specific way. How important that is depends on the details.

 

Warnings means there is something that deserves your attention. How to deal with that depends on the details.

 

1 hour ago, AshleyAitken said:

Why is it "disabled" when SMART is showing healthy

Unraid disables a disk when a write to it fails for any reason. Often this isn't a problem with the disk, but a problem communicating with the disk. That failed write updates parity, so it can be recovered by rebuilding, but the physical disk isn't used again until rebuilt since it is now out-of-sync with the array. After disabling, the disk is emulated from parity.

 

When reading an emulated disk, the data comes from the parity calculation by reading all other disks. When writing an emulated disk, parity is updated as if the disk had been written. The initial failed write, and any subsequent writes to the disabled/emulated disk, can be recovered by rebuilding.
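
Here is a small sketch of what "emulated from parity" means in practice, assuming a single parity disk computed as a bytewise XOR across the data disks. The four-byte "disks" and the helper name `xor_blocks` are purely illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny "data disks" and the parity computed across them.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([disk1, disk2, disk3])

# disk2 is disabled: a read is answered from parity plus all the other disks.
emulated_read = xor_blocks([parity, disk1, disk3])

# A write to the emulated disk only updates parity, as if disk2 had been written.
new_disk2 = b"\x01\x02\x03\x04"
parity_after_write = xor_blocks([disk1, new_disk2, disk3])

# A later rebuild recovers that written data from parity and the other disks.
rebuilt_disk2 = xor_blocks([parity_after_write, disk1, disk3])

if __name__ == "__main__":
    print(emulated_read == disk2)
    print(rebuilt_disk2 == new_disk2)
```

This also shows why every other disk has to read reliably during a rebuild: each rebuilt byte depends on the matching byte from parity and from all remaining data disks.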

 

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array

 

1 hour ago, AshleyAitken said:

having to understand and research

Unfortunately, there are a lot of details to consider, and trying to put them all in the webUI would make for an incomprehensible user interface. And, even with all the information, it might be difficult for an inexperienced user to choose the best approach for resolving a problem. So, please ask on the forum if you are unsure how to proceed.

 

There is a lot more information in the wiki, which you can access from the Documentation link at the bottom of the forum, or from your Unraid webUI by clicking "manual" at lower right.

 

1 hour ago, AshleyAitken said:

Why is it "disabled"

The best way to answer that question is by examining the diagnostics, taken before reboot. The diagnostics contains the current syslog, which is in RAM like the rest of the OS. Diagnostics after reboot can tell us a lot about how things are, but they may not be able to tell us much about how they got that way.

 

Diagnostics also includes SMART reports for all connected disks and other useful information about your hardware and configuration.

 

Whether or not you have rebooted since disk2 became disabled:

 

Attach new diagnostics to your NEXT post in this thread.

Link to comment
On 2/16/2023 at 1:36 AM, AshleyAitken said:

I have one disk in the array that's reported 9000+ errors (particularly when setting up the array).

 

 

 

Hi, on a side note: that specific model of 3TB drive is somewhat notorious for a very high failure rate, around 32% at Backblaze.

Given the age of the drives, you may want to consider replacing them or ensuring you have an independent backup of anything irreplaceable.

 

Wiki ST3000DM001

ExtremeTech

 

Link to comment
On 2/21/2023 at 1:27 AM, trurl said:

You have to rebuild disk2, but it would be useful to see your diagnostics first.

 

Thank you, so disabled means the disk has "failed" (or got out of whack somehow) and the array is running without it. Interesting, I would have thought that would have been a more significant event and made headline news on the Dashboard and Main (disks) page.

Here are my diagnostics, but please don't waste too much time on them: because of your advice (and thanks for the heads-up on those disks), I am going to replace the disks.  They are old but haven't been used for a number of years.


media-diagnostics-20230224-1930.zip
 

Also, somewhat strange (IMHO) is that there is no clear direction from the web UI on what to do in this case (the whole reason for having Unraid?). There is, however, some nice documentation giving the procedure, which is relatively simple. 

Let's see how I go...

Link to comment

I shut down the array.

 

Set the disk to "No device."

 

Then tried to set the disk back to the disk that "failed."  

 

It seemed to set temporarily, but then there were some notifications (sorry, I lost those after the reboot) and that disk disappeared from the drop-down and is no longer in the list of "unassigned devices."

 

I rebooted and it still doesn't appear, so perhaps that disk has completely died 😞

 

 

Link to comment

SMART for parity looks OK with not many power-on hours. Both disks 1, 2 need to be replaced, but you can only rebuild one disk at a time. To reliably rebuild a disk, it must be able to reliably read all other disks. Disk1 may not work well enough to rebuild disk2, and disk2 isn't working at all.

 

Neither disk has much data yet. If you don't need any of the data it will be simpler to just start over with new disks.

Link to comment

FYI, I replaced the disks one after the other, letting the system rebuild one disk before doing the other, and now it's all green lights.

 

Now going to explore Docker apps... see if this very old hardware (without address mapping) can run it and handle any load.

 

Overall, I'm impressed with Unraid and thankful for the support I received here.

 

[Apologies for any dupes... I wasn't logged in and didn't realise it would post but be moderated.]

Link to comment
