[6.11.5] Help determining drive health (Reported uncorrect)


Beta
Go to solution Solved by JonathanM,

Hi!

 

I'm in the middle of a read/write-heavy procedure: unpacking about 25 TB of RAR-archived content with unpackerr, throwing a few hundred GB at it at a time. Last night I got a push notification from Unraid saying that Reported uncorrect on one drive had increased from 0 to 1 in the middle of the current batch of unpacking. When done, Unraid indicated 32 errors on the drive (I assume these are corrected?)

 

I then ran a short SMART test followed by an extended SMART test. During the extended test, Reported uncorrect increased to 3. However, the test results say that the drive passed. See the results below:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     37799         -
# 2  Short offline       Completed without error       00%     37786         -

 

After the extended SMART test, Unraid now indicates 96 errors on the drive:

[Screenshot: Unraid showing 96 errors on the drive]

 

Reported uncorrect now sits at 3 after the extended test:

187	Reported uncorrect	0x0032	097	097	000	Old age	Always	Never	3

 

I've attached my diagnostics. (Disk 6 is the disk with the errors)

 

Is it safe to acknowledge the error and keep using the disk, while keeping an eye out for increasing values and other SMART errors? Or should I be looking at replacing the drive ASAP?

 

Thanks!

 

unraid-diagnostics-20230329-2137.zip

Link to comment
4 hours ago, Beta said:

safe to acknowledge the error and keep using the disk, while keeping an eye out for increasing values and other SMART errors

This. If all is quiet for a significant period of time, you can relax a little. If the errors keep happening regularly, I'd replace it. Regardless, the drive is now officially on your watch list.

4 hours ago, Beta said:

When done, Unraid indicated 32 errors on the drive (I assume these are corrected?)

Yes. When the drive returned a read error, Unraid read the rest of the drives, calculated from parity the bits that were supposed to be there, and wrote the calculated values back to the drive. The drive acknowledged a successful write, so it is deemed still fit for use and was not disabled. Unraid will continue to use a drive until a write fails, but that doesn't mean the drive is healthy; that's up to you to monitor and make a judgment call.
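
For anyone curious how "calculated from parity" works, here is a minimal sketch. It assumes single parity computed as a byte-wise XOR across the data disks; the post above does not spell out the arithmetic, and the sector contents and names (`xor_blocks`, `data`, `failed`) are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "sectors" at the same offset on three data disks.
data = [b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c", b"\xaa\xbb\xcc\xdd"]

# The parity disk stores the XOR of all data disks at that offset.
parity = xor_blocks(data)

# Pretend disk index 1 returned a read error: rebuild its sector
# from the surviving data disks plus parity.
failed = 1
survivors = [blk for i, blk in enumerate(data) if i != failed]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == data[failed])
```

The rebuilt bytes are what would be written back to the sector that failed to read. Note that this only works when every other disk reads cleanly, which is why an untrusted second drive matters so much with single parity.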

 

Just as an aside, if you don't need your FTP server available 24/7/365, you might consider shutting it down when not in use. All the hack attempts make reading the logs irritating.

Link to comment
6 hours ago, JonathanM said:

This. If all is quiet for a significant period of time, you can relax a little. If the errors keep happening regularly, I'd replace it. Regardless, the drive is now officially on your watch list.

Yes. When the drive returned a read error, Unraid read the rest of the drives, calculated from parity the bits that were supposed to be there, and wrote the calculated values back to the drive. The drive acknowledged a successful write, so it is deemed still fit for use and was not disabled. Unraid will continue to use a drive until a write fails, but that doesn't mean the drive is healthy; that's up to you to monitor and make a judgment call.

 

Just as an aside, if you don't need your FTP server available 24/7/365, you might consider shutting it down when not in use. All the hack attempts make reading the logs irritating.

 

Awesome, I thought as much, but nice having confirmation from someone more knowledgeable! :) I'll acknowledge the errors and keep my eyes on it for further errors for now.

 

Thanks for the hint on the FTP server. I was going to disable it soon anyway; it's barely used anymore.

Link to comment

Oops, more SMART errors popped up this evening. Time to order a replacement?

 

Running another extended test right now.

 

 

197	Current pending sector	0x0012	100	100	000	Old age	Always	Never	8
198	Offline uncorrectable	0x0010	100	100	000	Old age	Offline	Never	8

 

Link to comment
3 hours ago, JonathanM said:

It would be prudent to have the replacement ready.

 

Do you trust the health of the rest of your drives?

 

Values increased during the extended test:

187	Reported uncorrect	0x0032	097	097	000	Old age	Always	Never	3
197	Current pending sector	0x0012	100	100	000	Old age	Always	Never	16
198	Offline uncorrectable	0x0010	100	100	000	Old age	Offline	Never	16
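
A rough sketch of how the "keep an eye on increasing values" check could be automated. It assumes the attribute rows are tab-separated with ten columns and an integer raw value in the last one, as in the rows pasted in this thread; `parse_smart_rows` and `increased` are hypothetical helper names, not part of Unraid or smartmontools.

```python
def parse_smart_rows(text):
    """Map attribute ID -> (name, raw value) from tab-separated attribute rows."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 10 or not fields[0].strip().isdigit():
            continue  # not an attribute row in the assumed 10-column layout
        # Assumption: the raw value starts with a plain integer.
        rows[int(fields[0])] = (fields[1].strip(), int(fields[9].split()[0]))
    return rows

def increased(before, after):
    """Return {id: (name, old_raw, new_raw)} for raw values that went up."""
    changes = {}
    for attr_id, (name, new_raw) in after.items():
        if attr_id in before and new_raw > before[attr_id][1]:
            changes[attr_id] = (name, before[attr_id][1], new_raw)
    return changes

# Snapshots copied from the rows posted earlier and just above.
earlier = parse_smart_rows(
    "187\tReported uncorrect\t0x0032\t097\t097\t000\tOld age\tAlways\tNever\t3\n"
    "197\tCurrent pending sector\t0x0012\t100\t100\t000\tOld age\tAlways\tNever\t8\n"
    "198\tOffline uncorrectable\t0x0010\t100\t100\t000\tOld age\tOffline\tNever\t8\n"
)
later = parse_smart_rows(
    "187\tReported uncorrect\t0x0032\t097\t097\t000\tOld age\tAlways\tNever\t3\n"
    "197\tCurrent pending sector\t0x0012\t100\t100\t000\tOld age\tAlways\tNever\t16\n"
    "198\tOffline uncorrectable\t0x0010\t100\t100\t000\tOld age\tOffline\tNever\t16\n"
)

for attr_id, (name, old, new) in sorted(increased(earlier, later).items()):
    print(f"{attr_id} {name}: {old} -> {new}")
```

With these two snapshots it reports attributes 197 and 198 going from 8 to 16, while 187 stays at 3 and is not listed.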

 

And the extended SMART test failed:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     37833         -
# 2  Short offline       Completed without error       00%     37825         -
# 3  Extended offline    Completed without error       00%     37799         -
# 4  Short offline       Completed without error       00%     37786         -
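
A small sketch for picking failed runs out of a self-test log like the one above. It assumes the column layout shown in this post (columns separated by two or more spaces); the regular expression and the `parse_selftest_log` helper are illustrative and may need adjusting for other drives or smartctl versions.

```python
import re

# One log row: "# N  <test>  <status>  NN%  <hours>  <LBA>"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)

def parse_selftest_log(text):
    """Return one dict per self-test row; header and blank lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            entries.append({
                "num": int(m["num"]),
                "test": m["test"],
                "status": m["status"],
                "remaining_pct": int(m["remaining"]),
                "hours": int(m["hours"]),
            })
    return entries

log = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     37833         -
# 2  Short offline       Completed without error       00%     37825         -
# 3  Extended offline    Completed without error       00%     37799         -
# 4  Short offline       Completed without error       00%     37786         -
"""

entries = parse_selftest_log(log)
# Flag anything that is not a clean pass so it gets a closer look.
flagged = [e for e in entries if e["status"] != "Completed without error"]
for e in flagged:
    print(f"test #{e['num']} ({e['test']}) at {e['hours']} h: {e['status']}")
```

Here only test #1 is flagged: the extended test that stopped with a read failure at 37833 power-on hours with 10% remaining.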

 

Ordering a replacement disk now. I have not seen any issues with any other drives yet. Most of them are rather new, except for two WD Black 2 TB drives with 7 years of power-on time, which I have been waiting to fail.

 

Halting all read/write heavy tasks until it's replaced.

Edited by Beta
Link to comment
  • Solution
6 hours ago, Beta said:

extended SMART test failed

Expedite the replacement. Don't bother with extensive testing of the incoming drive; the rebuild process, followed by a non-correcting parity check and a long SMART test, will be a trial by fire for the new drive.

6 hours ago, Beta said:

two WD Black 2 TB drives with 7 years of power-on time, which I have been waiting to fail

It might be a good idea to order another replacement to have on hand. I personally keep a tested drive, the same size as parity, in a box as an on-deck option.

6 hours ago, Beta said:

Halting all read/write heavy tasks until it's replaced.

Good reason to consider keeping a tested cold spare to limit time at risk.

Link to comment
9 hours ago, JonathanM said:

Expedite the replacement. Don't bother with extensive testing of the incoming drive; the rebuild process, followed by a non-correcting parity check and a long SMART test, will be a trial by fire for the new drive.

It might be a good idea to order another replacement to have on hand. I personally keep a tested drive, the same size as parity, in a box as an on-deck option.

Good reason to consider keeping a tested cold spare to limit time at risk.

 

Thanks for the help, Jonathan! I didn't want to risk running the array with the disk over the weekend, so I went to my local electronics chain store and purchased a replacement. I was going to order an IronWolf 8TB online with next-day delivery (which would have meant Monday because of the weekend); it was €20 cheaper, but ¯\_(ツ)_/¯

 

Running rebuild now!

[Screenshot: rebuild in progress]

Link to comment
31 minutes ago, Beta said:

Running rebuild now!

Good deal. I know you aren't expecting the 2 old WDs to die immediately, but running single parity means you have no margin for error when it comes to drive replacement. Running with any single drive that you can't trust through a rebuild means you are just asking for a sudden, unannounced failure by a drive you had no clue was on the way out.

 

All drives fail eventually; the tricky part is predicting when.

Link to comment
