Possible failing disks - some guidance please

BetaQuasi · January 19, 2018

So I have two problematic disks in my array currently. I am looking to replace them - the data on the shares on those disks is not critical and not backed up, but if I can avoid losing it, that'd be great.

I've attached SMART reports for both. Extended SMART tests on both disks passed with no errors after running for several hours. I'm currently preclearing a 3TB disk, and would like to address whichever disk is more concerning by replacing that first. I then have 3x 8TB disks. With those, I intend to replace my current 3TB parity with 2x 8TB dual parity, and then use the third as a data disk.

I don't currently have a handle on whether these values have been changing, as at some point I screwed up my notification configuration and haven't really had time to keep an eye on the array. I've been watching them for the past two days and there seems to be no change.

With the WD disk - it has a bunch of pending sectors. My understanding, based on reading a bunch of posts here which hurt my brain (!), is that if those have data in them and I replace the disk, that data could be rewritten incorrectly and corrupt the files. Is that correct? If so, is there a better method of handling that data? I noticed quite a few posts suggesting running a non-correcting parity check, so I've kicked that off. I'll check the SMART attributes again when that completes.

Also, is this the disk to be the most concerned about, as it has both pending sectors and offline uncorrectable?

With the Seagate disk, I can't seem to find clear info on whether I should be concerned about the uncorrected error count. I get the impression it's not nearly as important as the pending sectors on the WD though.

So, based on all of the above, and the fact that I'll soon have a 3tb disk precleared and ready to drop in, what would be my best way forward? Will I get corruption if I rebuild that disk while the sectors are still pending? Is there any way out of that conundrum?

I am guessing it will be best to postpone the parity replacement until such time as the issues with this disk are resolved (?), but there are 3 8tb disks kicking around should I need those. The array itself is already quite full, so moving data around would be challenging - one of the reasons why I'm (finally) moving it to 8tb disks.

Any feedback greatly appreciated!

Cheers

BQ

ST32000542AS_5XW1PQ16-20180119-1821.txt

WDC_WD30EZRX-00D8PB0_WD-WMC4N2092430-20180119-1820.txt

pwm · January 19, 2018

Not all disks attach the same meaning to 'pending'. But normally you can still read the data out of such sectors; the disk just prefers to see a write before it decides whether the sector is good enough for continued use or whether the data should be moved to a spare sector.

If the disk knows it can't read out the data, then it would flag the sector as uncorrectable.

Mat1926 · January 19, 2018

Guys, the OP mentioned pending sectors, what are they?!

JorgeB · January 19, 2018

The WD completed the extended test without a read failure. This means the current pending sectors are "false positives" and the disk is fine for now, so you should be able to copy all data from it without issues. Still, IMO that disk should be replaced when possible, as double-digit raw read errors are never a good sign on WD disks.

For the Seagate you only did a short test. A long test would be much better, since there are some reported uncorrected errors, though they are quite old.
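For anyone who wants to track the attributes mentioned in this thread between two reports, here is a minimal Python sketch. It assumes the report contains the usual `smartctl -A` attribute table, and the function names and sample values are made up for illustration (they are not taken from the attached reports). Attribute 187 is only a guess at what the "uncorrected error count" on the Seagate refers to.

```python
import re

# SMART attribute IDs relevant to this thread.
WATCHED = {
    1: "Raw_Read_Error_Rate",
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, then the raw value (leading digits only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Hypothetical sample, NOT the OP's data.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       23
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29500
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       9
"""


def parse_smart_attributes(report_text):
    """Return {attribute_id: raw_value} for the watched attributes only."""
    values = {}
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(1)) in WATCHED:
            values[int(match.group(1))] = int(match.group(3))
    return values


def changed_attributes(old, new):
    """Return {attribute_id: (old_raw, new_raw)} for values that differ."""
    return {
        attr: (old.get(attr), new[attr])
        for attr in new
        if old.get(attr) != new[attr]
    }
```

Parsing one report from each day and passing both results to `changed_attributes` would show whether the pending or uncorrectable counts are moving, which is the thing to watch here.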

Edited January 19, 2018 by johnnie.black

pwm · January 19, 2018

40 minutes ago, Mat1926 said:

Guys, the OP mentioned pending sectors, what are they?!

It's exactly what I said it is.

The HDD has seen some bit errors when reading the sector. Not critically so - the data can be read out - but enough that the drive has flagged the sector as potentially bad.

The HDD can at a later time either unflag the sector or it can retire the sector and move the data to a spare sector, in which case it would be counted as a remapped sector.

It's just that different drives have different logic for what to do with these pending sectors - and when to do it. Some drives are aggressive and quickly make a decision. Some wait until the user performs a write to the sector.

While a pending sector means the drive has seen bit errors, it doesn't mean there is something wrong with the physical sector. It can have been a glitch when writing - such as too much vibration - that resulted in a slightly bad write. That's why the drive isn't sure if the sector should be replaced or if it is safe for continued use.

Easiest for the drive is if the user issues a write to that sector - then the drive can perform the write and then go back and check if the new write had better luck - if not, then the sector is scrapped.

A more aggressive disk may accept false positives and remap the sector even if there isn't anything physically wrong with it. Or it can decide to do a rewrite without waiting for the user, and check the result. But this means the drive, on seeing the bit errors, has to wait a full revolution of the platters before it can perform the write, and then one more full revolution before it can double-check whether the new write succeeded. If it didn't, the drive needs to wait for a spare sector to rotate in under the heads before it can write the moved data.

Most disks don't do this on their own because the additional work adds a slowdown that may break some realtime requirements. The rewrite test can also downgrade a sector from having correctable bit errors to being totally unreadable, creating a race to copy the buffered data to a spare sector before power is lost. So it's common that drives prefer to handle the remapping at a later time - preferably when the user has already issued a write, which saves one rotation delay and doesn't add any extra race condition that could lose data.
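The lifecycle described above can be sketched as a tiny state machine. This is only a toy model of the "wait for a user write" behaviour; real firmware differs per drive, and the names `SectorState`, `on_read` and `on_write` are invented for the illustration.

```python
from enum import Enum


class SectorState(Enum):
    GOOD = "good"
    PENDING = "pending"              # bit errors seen, data still readable
    REALLOCATED = "reallocated"      # retired, data moved to a spare sector
    UNCORRECTABLE = "uncorrectable"  # drive knows it cannot read the data


def on_read(state, bit_errors, recoverable):
    """State after a read. Assumption: a clean read never clears 'pending'."""
    if state in (SectorState.GOOD, SectorState.PENDING) and bit_errors:
        return SectorState.PENDING if recoverable else SectorState.UNCORRECTABLE
    return state


def on_write(state, verify_ok):
    """State after a user write followed by the drive's read-back check."""
    if state is SectorState.PENDING:
        # A good read-back unflags the sector; a bad one scraps it.
        return SectorState.GOOD if verify_ok else SectorState.REALLOCATED
    # Simplification: writes to sectors in other states are not modelled.
    return state
```

In this model a pending sector only leaves the pending state on a write, which matches the less aggressive drives described above; an aggressive drive would effectively call `on_write` itself right after the bad read.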

BetaQuasi · January 19, 2018

1 hour ago, johnnie.black said:

The WD completed the extended test without a read failure. This means the current pending sectors are "false positives" and the disk is fine for now, so you should be able to copy all data from it without issues. Still, IMO that disk should be replaced when possible, as double-digit raw read errors are never a good sign on WD disks.

For the Seagate you only did a short test. A long test would be much better, since there are some reported uncorrected errors, though they are quite old.

Thanks Johnnie - could have sworn I hit long on both. I'll do a long test on both after the parity check and update the thread with the results. Will definitely replace the WD regardless - great news that the data there should be fine right now, so I'll start moving that off.

Also, thanks pwm for the extended detail, interesting reading!

Edited January 19, 2018 by BetaQuasi

Mat1926 · January 19, 2018

13 hours ago, pwm said:

Thnx ...

BetaQuasi · January 20, 2018

Ok so I've finished a non-correcting parity check (with no errors) and ran full SMART tests on both drives again. Results are attached. The WD reported errors in the web GUI this time around (but didn't last time?).

I've got a 3tb precleared and ready to replace it, but just want to check if the pending sectors are likely to cause any issues for me while I do that, or whether any problems with the Seagate disk might possibly contribute to more potential failure while rebuilding the WD onto the new disk. What would be the safest way forward? (I only have one parity disk at this stage.)

Thanks!

BQ

ST32000542AS_5XW1PQ16-20180120-2033.txt

WDC_WD30EZRX-00D8PB0_WD-WMC4N2092430-20180120-2032.txt

Edited January 20, 2018 by BetaQuasi

JorgeB · January 20, 2018

As expected the WD is getting worse, having failed the last extended test. Good news is that the Seagate looks fine for now, so you should have no issues replacing the WD.
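Why the health of the Seagate matters for the rebuild can be shown with a minimal Python sketch. It assumes single parity is a plain byte-wise XOR across the data disks; the helper names and the tiny byte strings standing in for disks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving_disks):
    """Rebuild the one missing disk from parity plus every other data disk."""
    return xor_blocks([parity, *surviving_disks])


# Hypothetical three-disk array with single parity.
disk1 = b"\x01\x02\x03"
disk2 = b"\x10\x20\x30"   # stands in for the disk being replaced
disk3 = b"\xff\x00\x55"
parity = xor_blocks([disk1, disk2, disk3])

# With all other disks read correctly, the rebuild is exact.
good_rebuild = rebuild_missing(parity, [disk1, disk3])

# If another disk returns one wrong byte during the rebuild,
# the same byte position on the rebuilt disk comes out wrong too.
disk3_bad = b"\xff\x00\x54"
bad_rebuild = rebuild_missing(parity, [disk1, disk3_bad])
```

The rebuilt disk is computed from parity and all remaining data disks, not from the old disk being replaced, so what matters during the rebuild is that every other disk, the Seagate included, reads cleanly.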

BetaQuasi · January 20, 2018

Cheers Johnnie, will kick that off.

BetaQuasi · January 21, 2018

All good now, disk 4 rebuilt onto a new disk. Now to do the parity upgrade! Thanks again.
