
Schedule regular Smart Tests?


JimPhreak

an extended once a month

This is probably all you need. The issue is that spin down has to be temporarily disabled while the test is running.

Until there's an API for that, it's kind of hard.

I had started on some sort of dd of single random blocks, but stopped, as the real way to fix this is to turn off the spin down timer temporarily.

 

The short test is easy and finishes in minutes, but it's not really comprehensive enough.

 

Perhaps the spin down logic can inspect the smart data and if a test is being executed, skip the spin down until the test is no longer active.
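
For anyone wanting to try the "inspect the smart data" idea, here is a minimal sketch of the detection half only. It assumes the text comes from `smartctl -c` on an ATA drive and that the status line looks like the usual "Self-test execution status: ( 249)" form. The function names are made up for illustration, and nothing here hooks into emhttp's spin down logic.

```python
import re
from typing import Optional

# Status line as printed by `smartctl -c` (layout assumed from typical ATA output), e.g.
#   Self-test execution status:      ( 249) Self-test routine in progress...
STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def self_test_status(capabilities_text: str) -> Optional[int]:
    """Return the raw self-test execution status byte, or None if absent."""
    match = STATUS_RE.search(capabilities_text)
    return int(match.group(1)) if match else None


def self_test_in_progress(capabilities_text: str) -> bool:
    """True when the upper four bits of the status byte are 0xF (test running)."""
    status = self_test_status(capabilities_text)
    return status is not None and (status >> 4) == 0xF


def percent_remaining(capabilities_text: str) -> Optional[int]:
    """Percent of the running test still to go, in steps of ten, else None."""
    status = self_test_status(capabilities_text)
    if status is None or (status >> 4) != 0xF:
        return None
    return (status & 0x0F) * 10
```

A wrapper could capture the `smartctl -c` output for a device and skip its own spin down request while `self_test_in_progress` returns True. Whether the stock spin down timer can be made to respect that is a separate question.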

Link to comment

Oh, so you're saying that currently, if you are running an extended test, the disk will still spin down if it's not otherwise being accessed?

Link to comment

Last I remember, yes.  A spin down is issued and it aborts the SMART test.

This may have been changed, but I may have missed it.

 

As far as short vs. extended goes: if we do a monthly parity check on the 27th and run an extended test on one drive a day, working through disk1 to diskN so that the day of the month is the disk number, this could be done nicely.

At least that's how I planned to do it.

 

That's two full sweeps of each disk a month.

Another idea is to schedule a smart extended test for all drives on the 27th day and parity check on the 28th day.

Link to comment

Sounds good, but currently there is no automated way to do this, correct?  You just have to remember and then do it manually.

Link to comment

That is correct. It's fairly easy to write a script that takes today's date and turns it into a drive assignment.

The issue is telling emhttp not to spin down the drive.

It may be easier with unRAID 6. I have not explored it further after coming across this issue in unRAID 5.

 

Last I remember, even the webGui SMART Long test would still abend due to emhttp's spin down functionality.
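
The date-to-drive part could look roughly like the sketch below. It assumes the scheme described earlier in the thread (day of the month equals the disk number, with the 27th reserved for the parity check). The `disk_for_day` name and the `PARITY_CHECK_DAY` constant are only illustrative. It does not map diskN to a device node or start any test.

```python
import datetime
from typing import Optional

# Day of the month reserved for the parity check (taken from the schedule above).
PARITY_CHECK_DAY = 27


def disk_for_day(today: datetime.date, disk_count: int) -> Optional[str]:
    """Return the array disk to test today, e.g. 'disk3', or None to skip the day."""
    day = today.day
    if day == PARITY_CHECK_DAY or day > disk_count:
        return None  # parity check day, or no disk carries this number
    return f"disk{day}"
```

A daily cron job could call `disk_for_day(datetime.date.today(), disk_count)` and do nothing when it returns None. Note that with 27 or more data disks, disk27 would never get a turn under this scheme, so the parity day would have to move.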

Link to comment

It is not clear to me that running the extended SMART test is worthwhile if you have just run a parity check as that already involves reading every sector on every disk. If key SMART attributes change during the parity check process the current notification system will already tell you.

Link to comment

I've run into issues whereby a badblocks read of the entire disk (like a parity check) succeeds without a hint of issue and a smart long test catches an LBA that is causing trouble.

 

In fact, I ran into this about 3 days ago while testing a new drive. 

Executed smart long test: READ ERROR flagged at LBA nnnnnn.

Executed 4-pass badblocks.

First pass reported bad blocks; the next 3 passes succeeded without issue.

Checked smart: sectors were reallocated.

Reran smart long test: a new READ ERROR flagged at LBA oooooo (this one was much further into the drive).

Reran the 4-pass badblocks test: no errors reported.

Smart did not reveal any new pending sectors or additional sector reallocations.

Reran smart long test: succeeded without error.

 

In addition, I've run into cases where a parity check was executed with no hint of an issue, followed by a double drive failure only hours later, with the smart long test then revealing pending sectors.

 

I would be in the "it's not needed" camp as well. However, my experience has shown that a smart long test can reveal problems that other basic read tests do not.

 

Every new drive gets a conveyance test, smart long test, 4 pass bad blocks and a final smart long test.

Periodically I run smart long tests just for good measure.

They have revealed LBAs that were not flagged in the normal course of a full drive read.

 

Perhaps the smart long test is less forgiving and flags an error earlier than the firmware read ECC/recovery does.

Link to comment

Anecdotal Evidence

 

Notice how the smart short tests did not reveal the problem.

This case went the same way as the one in my prior post.

 

smart short test

smart long test, Read failure.

4 pass badblocks.  pass 1 revealing problem, pass 2-4 did not reveal issue.

16 sectors reallocated at start, 30 sectors reallocated at end, no pending sectors.

Smart long test again, read failure in a different place.  No pending sectors. (this is what really surprised me)

4 pass badblocks, no errors reported, no sectors reallocated, no pending sectors.

smart long test passes.

 

What this reveals is that badblocks or preclear alone cannot detect the error or force it to occur. Nor can a full badblocks read or a parity check read.  A full read is probably more likely to reveal a problem, but in this case that did not happen until the smart long test occurred.
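
To keep track of counts like the "16 reallocated at start, 30 at end" above, two `smartctl -A` snapshots can be compared. This is only a sketch: it assumes the usual ten-column ATA attribute table and the attribute names `Reallocated_Sector_Ct` and `Current_Pending_Sector`, which can differ between manufacturers. The helper names are invented for the example.

```python
from typing import Dict

# Attribute names assumed from typical `smartctl -A` output; vendors may differ.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_watched(attributes_text: str) -> Dict[str, int]:
    """Pull the raw values of the watched attributes out of `smartctl -A` text."""
    values: Dict[str, int] = {}
    for line in attributes_text.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def attribute_changes(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return {name: delta} for watched attributes whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in WATCHED
        if name in before and name in after and after[name] != before[name]
    }
```

Taking one snapshot before the badblocks run and one after would show the reallocation count moving even when no pending sectors are left behind.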

 

It's crucial to do the smart long test before inserting the drive into your array or someone may be unpleasantly surprised.

2015-10-21_17_55.17-edit.jpg.ffada46495641f29bbabc71f7b708c55.jpg

2015-10-18_08_18_23.jpg.2dc30c5b709d2d1a600daaab186506c0.jpg

Link to comment
  • 2 months later...

Bump. I think it would be awesome to see this incorporated into unRAID as a plugin or built into the webGui. Even a script that could perform this would be great. Is there a way to temporarily disable spin down and re-enable it from the CLI, making a script for this relatively easy? Because SMART report information differs from manufacturer to manufacturer, I imagine parsing these results in a meaningful way could prove difficult. But even just having the results emailed to you monthly or written to a log file would be a great and welcome addition for most users, I'm sure. Who has a NAS and doesn't want as much information about their drive health as possible?  :P
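
On the parsing worry: the self-test log is one of the more regular parts of the report. Below is a rough sketch that picks failed entries out of `smartctl -l selftest` text, suitable for dropping into a log file or mail body. The column layout is assumed from typical ATA output and may not hold for every drive. `failed_self_tests` is a made-up name.

```python
import re
from typing import Dict, List

# One log row, e.g. (layout assumed from typical ATA output):
# "# 1  Extended offline    Completed: read failure       90%      1234         123456789"
ENTRY_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text: str) -> List[Dict[str, str]]:
    """Return the log entries that report an LBA of first error."""
    failures: List[Dict[str, str]] = []
    for line in log_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        # Entries that completed cleanly show "-" in the LBA column.
        if match and match.group("lba") != "-":
            failures.append(match.groupdict())
    return failures
```

Each returned entry carries the test type, status text, power-on hours and the failing LBA as strings, which is probably enough for a monthly summary without interpreting vendor-specific attributes.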

 

 

Link to comment
  • 2 months later...

Regular SMART tests are unlikely to give you anything that cannot already be done by doing periodic parity checks and also making sure you have notifications enabled so that you are immediately informed about any significant changes to SMART attributes.

 

With the above in place the only time you are likely to want to do a SMART short or long test is if you already suspect a drive is failing, and in such a case doing it manually is not an issue.  Note also that even if you did have the ability to do an automatic SMART long test it would be silently aborted if there was any read/write access to the drive which would limit the usefulness of such a feature.  A parity check on the other hand would not be disrupted and continue to read the remainder of any disks so in many ways is better than a SMART long test.

Link to comment

Thanks for the advice.  This is exactly what I currently have set up: a monthly parity check (on the 1st of the month). I just had my first drive throw a bunch of errors, and it had 51 currently pending sectors.  I ran a SMART long test and it claims the drive is still healthy, so I was just curious whether I could run more frequent long tests to see if the drive gets progressively worse.

 

Sorry, getting off topic here.  If I suspect a drive is failing, do I acknowledge the current state and wait for notifications that it is getting worse?  I set up an advance RMA with WD, so it shouldn't be more than a week.

Link to comment

A SMART long test succeeding is no guarantee that the disk is OK (although a failure does tend to indicate that the disk is bad)!  That is one reason why regular SMART tests are not a panacea.

 

A drive with pending sectors is never a good idea in unRAID: if another drive failed, the rebuild would potentially have corruption at those sectors, since a perfect rebuild requires all other drives plus parity to be perfect.  Therefore, the moment you get any pending sectors on any drive, you want to work on resolving the problem.

 

Pending sectors are sectors that could not be read reliably the last time a read was attempted.  Pending sectors can sometimes be a 'false positive' and get cleared the next time the sectors in question are written, but if not, the drive needs replacement.  The ideal way forward is to rebuild the suspect drive onto a replacement, as that way you still have the original available for data recovery purposes if anything goes wrong during the rebuild, and if the rebuild succeeds you can then test the drive with no risk of data loss.  Another option, if the SMART attributes for the drive otherwise look OK, is to try rebuilding the drive onto itself to see if that clears the pending sectors back to 0.  Since this should just be writing back the data that is expected to be in each sector, it is a low-risk strategy as long as all other drives look OK.

Link to comment
