Disk with errors (but green) during parity rebuild

steve1977 · November 11, 2017

I am in the process of replacing my parity disk with a larger one. The rebuild is ongoing.

All disks show "green", but disk 3 shows 128 errors. The SMART test looks OK.

Does it make sense to let the parity rebuild complete? Any advice on what to do?

testdasi · November 13, 2017

What kind of errors?

steve1977 · November 13, 2017

Disk read errors. The disk has not been disabled and the parity rebuild is still ongoing. Error log below:

https://pastebin.com/hKRjCEGN

JorgeB · November 13, 2017

Without the diagnostics it's hard to say more, but at a minimum you'll need to run a correcting parity check, since the current parity is not 100% valid.

steve1977 · November 13, 2017

Thanks. How do I trigger a correcting parity check? Do I need to rebuild, or just run another parity check after this one has completed?

It will still take another 12 hours for the parity build to complete. I'll wait for your guidance before replacing another drive.

Diagnostic attached. Thanks for your help!

tower-diagnostics-20171113-1711.zip

JorgeB · November 13, 2017

It looks like a disk problem:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

You can confirm by running an extended SMART test.
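As a rough illustration of reading those two attribute rows programmatically, here is a minimal Python sketch. It assumes the usual ten-column layout of the smartctl attribute table (raw value in the last column); the function names and the choice of watched IDs are illustrative, not part of any tool.

```python
def parse_smart_attributes(report):
    """Parse rows of a smartctl-style attribute table into a dict keyed by attribute ID."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {"name": fields[1], "raw": fields[9]}
    return attrs


def nonzero_sector_counters(attrs, watched=(197, 198)):
    """Return names of watched attributes whose raw value is above zero."""
    return [
        info["name"]
        for attr_id, info in attrs.items()
        if attr_id in watched and info["raw"].isdigit() and int(info["raw"]) > 0
    ]


# The two rows quoted in the post above.
REPORT = """\
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

flagged = nonzero_sector_counters(parse_smart_attributes(REPORT))
print(flagged)
```

Both counters are non-zero here, which is what prompts the suggestion to run the extended test.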

steve1977 · November 13, 2017

Got it. So, need to switch to a new disk?

Is my array still safe as long as the currently ongoing parity build completes? As mentioned, the drive still shows "green" and the parity rebuild is underway.

JorgeB · November 13, 2017

23 minutes ago, steve1977 said:

So, need to switch to a new disk?

Yes if it fails the extended SMART test.

24 minutes ago, steve1977 said:

Is my array still safe as long as the currently ongoing parity build completes? As mentioned, the drive still shows "green" and the parity rebuild is underway.

No, because as I mentioned earlier, parity is not valid.

steve1977 · November 13, 2017

What does this mean for me? Right now (during parity rebuild), I have access to all my data. What part do you expect that I'll lose? Anything I can do to prevent data loss?

I'll run the extended SMART test once the parity is fully built.

JorgeB · November 13, 2017

And don't run a correcting parity check now with a suspected bad disk. Post the SMART report after the test is complete.

pwm · November 13, 2017

Note that a disk reporting one offline uncorrectable sector isn't necessarily toast.

But it means that based on statistics, there is an increased probability that the drive will fail more - or totally - within a limited time span. Some disks just may get a bad sector because of a defect on the surface that wasn't noticed during the original factory scan, but there is a danger that the problem isn't just a tiny spot but a larger surface area that isn't good or that there is some issue with the head or other parts of the drive, in which case the drive is dangerous to continue to use.

It also means there is one sector that can't be read out correctly because the error correction code (ECC) for that sector isn't enough to correct the bit errors. If you already know the contents of that sector and try to overwrite it, the disk can use a spare sector to store the correct data, giving your RAID a full set of disks with all correct data again.

As johnnie.black notes, you most definitely do not want to rebuild your parity at this stage, since the current parity is one way to recompute the contents that should have been stored in the offline uncorrectable sector (unless you happen to have a backup of the specific file that uses this disk sector).
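To make the parity argument concrete, here is a minimal Python sketch. It assumes single parity behaves as a plain byte-wise XOR across the data disks; the four-byte "sectors" and the all-zero stand-in for an unreadable sector are invented purely for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "sectors" at the same offset on three data disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x1E, 0x2D, 0x3C])  # the sector that later becomes unreadable

# Parity computed while every sector was still readable.
good_parity = xor_blocks([disk1, disk2, disk3])

# With valid parity, the lost sector is recoverable from parity plus the other disks.
recovered = xor_blocks([good_parity, disk1, disk2])

# Parity computed while disk3's sector could NOT be read. What actually gets
# used in place of the missing data is an assumption here (zeros as a stand-in).
stand_in = bytes(4)
bad_parity = xor_blocks([disk1, disk2, stand_in])
rebuilt_from_bad_parity = xor_blocks([bad_parity, disk1, disk2])

print(recovered == disk3)                # valid parity recovers the original sector
print(rebuilt_from_bad_parity == disk3)  # parity built over a failed read does not
```

The second comparison is the point being made in this thread: once parity has been computed over a failed read, it can only reproduce whatever stood in for the missing data, not the original contents.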

Anyway - after an extended SMART scan, the disk will be able to tell which sector it finds the first error on. The scan might also increase the reported number of bad sectors.

steve1977 · November 14, 2017

Thanks. As you'd anticipated, the SMART test failed. Please find the report attached. Please advise what to do next. Replace the drive and rebuild?

tower-smart-20171114-0210.zip

pwm · November 14, 2017

The result was as expected - the offline uncorrectable sector will stay uncorrectable - only a direct write to that address has a chance to clear the error counter.

You did get to know that the drive didn't find any more errors over the first 60% of the surface - and you got the address of that uncorrectable sector - LBA 91525368.

I would recommend doing a selective test where you start testing from the next sector and scan the rest of the drive to see if more errors show up.

If you connect using ssh you can run smartctl and specify

smartctl -t select,91525369-max /dev/<drive>

The drive will then continue from the first sector after the error and run to the end of the drive.
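The start address in that command is simply the failing LBA plus one. A small Python sketch of building the same argument list follows; the helper name and the `/dev/sdX` device path are placeholders, not real values from this system.

```python
def selective_test_command(failed_lba, device):
    """Build a smartctl argument list that tests from the sector after failed_lba to the end."""
    if failed_lba < 0:
        raise ValueError("LBA must be non-negative")
    # select,N-max asks the drive to test from LBA N to the last LBA.
    return ["smartctl", "-t", "select,{}-max".format(failed_lba + 1), device]


# LBA reported by the failed extended test in this thread; the device is a placeholder.
cmd = selective_test_command(91525368, "/dev/sdX")
print(" ".join(cmd))
```

If the selective test stops at another error, the same helper can be called again with the newly reported LBA to resume past it.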

steve1977 · November 14, 2017

Thanks. Let me do this later and get back to you. I don't mind if I lose a few files from this disk, but it would be a pain if I lost the full disk.

JorgeB · November 14, 2017

If you have the space you can just copy everything from disk3 to other disk(s). The file(s) on the damaged sector(s) will give an I/O error; restore those from backup. With some luck it will be only 1 or 2.

steve1977 · November 14, 2017

Don't have the space per se, but could find a way. Would this give me better result than just replacing the disk?

JorgeB · November 14, 2017

3 minutes ago, steve1977 said:

Would this give me better result than just replacing the disk?

If parity has finished syncing, replacing the disk is also an option, but unless you have checksums (or the disk is btrfs) the rebuilt disk will have some corrupt file(s) and you'll have no way of knowing which ones.

By copying/moving the data manually you'll know which files need to be restored.

steve1977 · November 14, 2017

Got it. So, copying may indeed be preferred. Could chkdsk or a variant thereof also be an option?

JorgeB · November 14, 2017

Could chkdsk or a variant thereof also be an option?

Not sure what you mean. If you mean marking the bad sectors, there's no equivalent, and that would give the same end result as a rebuild anyway: corrupt files.

steve1977 · November 14, 2017

I was thinking of fixing or deleting the files within the bad sectors. Copying will take a lot longer, and copying from corrupt disks typically turns out to be difficult. So I am trying to avoid it, but I don't like the idea of having some corrupt files and not knowing which ones they are.

JorgeB · November 14, 2017

That's why I think having checksums or using an auto checksum filesystem is highly recommended, for situations like this.
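As a rough sketch of what per-file checksums buy you here, the Python below records a SHA-256 digest for every file under a directory and later reports which files no longer match. The function names and the throwaway demo directory are illustrative only; this is not the mechanism of any particular unRAID plugin.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(before, after):
    """List files whose digest differs from, or is missing in, the later manifest."""
    return sorted(path for path, digest in before.items() if after.get(path) != digest)


# Demo on a throwaway directory standing in for a data disk.
with tempfile.TemporaryDirectory() as root:
    for name, data in (("a.txt", b"alpha"), ("b.txt", b"bravo")):
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    before = build_manifest(root)
    # Simulate silent corruption of one file, as after a rebuild from bad parity.
    with open(os.path.join(root, "b.txt"), "wb") as handle:
        handle.write(b"brav0")
    after = build_manifest(root)

damaged = changed_files(before, after)
print(damaged)
```

A manifest taken before the trouble started is what turns "some file is corrupt" into a concrete list of files to restore.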

pwm · November 14, 2017

Copying from the corrupt disk shouldn't be problematic. If you use rsync, for example, you can have it continue with other files after a read error. And unless the last 40% of the drive has more errors, only one single file will fail to copy. Another advantage of rsync is that it is well suited to restarting the copy if it gets interrupted for some reason.
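The "keep going after a read error and note which file failed" behaviour can be sketched in a few lines of Python. This is not rsync and lacks its resume support; the function name and demo paths are hypothetical, and a dangling symlink is used only as a stand-in for a file that fails to read.

```python
import os
import shutil
import tempfile


def copy_tree_logging_failures(src_root, dst_root):
    """Copy every file under src_root to dst_root, collecting failures instead of aborting."""
    failed = []
    for dirpath, _dirs, files in os.walk(src_root):
        target_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as exc:
                # A read error on a bad sector surfaces as an OSError; record it and move on.
                failed.append((src, str(exc)))
    return failed


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "disk3")   # hypothetical source directory
    dst = os.path.join(tmp, "spare")   # hypothetical destination directory
    os.makedirs(src)
    with open(os.path.join(src, "good.txt"), "w") as handle:
        handle.write("readable")
    # Stand-in for an unreadable file: a symlink whose target does not exist.
    os.symlink(os.path.join(src, "missing"), os.path.join(src, "bad.bin"))

    failed = copy_tree_logging_failures(src, dst)
    copied_ok = os.path.exists(os.path.join(dst, "good.txt"))
    failed_names = [os.path.basename(path) for path, _reason in failed]

print(copied_ok, failed_names)
```

A real I/O error partway through a file may leave a partial copy at the destination, so anything in the failure list should be treated as missing and restored from backup.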

It is normally also possible to look up which file is using the specific LBA that the SMART test indicated. Exactly how to do that depends on the file system used. If this is a file you have a backup of or do not care about, then you can overwrite the file and have a good chance of zeroing the unrecoverable sector count.
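Whatever file system tool is used for the lookup, the first step is usually converting the disk LBA into an offset inside the partition. A minimal Python sketch of that arithmetic is below; the 512-byte logical sector size, the partition start of 64 and the 4096-byte file system block size are all assumptions that would need to be checked on the actual disk.

```python
SECTOR_SIZE = 512  # assumption: the reported LBA counts 512-byte logical sectors


def lba_to_fs_block(lba, partition_start_lba, fs_block_size=4096):
    """Convert a whole-disk LBA to (file system block number, byte offset) within the partition."""
    if lba < partition_start_lba:
        raise ValueError("LBA lies before the start of the partition")
    byte_offset = (lba - partition_start_lba) * SECTOR_SIZE
    return byte_offset // fs_block_size, byte_offset


# LBA from the failed test in this thread; the partition start is hypothetical.
block, offset = lba_to_fs_block(91525368, partition_start_lba=64)
print(block, offset)
```

The resulting block number is what a file system's own debugging tools would then map to an inode and file name.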

If you have read out all the data you can recover from the problematic disk, then you could also have unRAID restore the damaged file by replacing the problematic disk with a new disk and have unRAID recompute the content from the other parity and data disks.

The main thing is that you want to keep as much redundancy as possible for as long as possible. Rebuilding the parity now would make the parity computed based on failed sector(s) of the problematic disk. And replacing the disk will make you vulnerable to other issues for the full time until unRAID has recovered the full content by use of the parity data.

steve1977 · November 14, 2017

Let me clarify - my parity rebuild has completed (the SMART test was run after the rebuild completed). Actually, the errors only occurred during the rebuild.

If I were to move the files to a UD, this would take quite a few hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back. Also, how is rsync different from "mv -r"?

I am not fully clear on the exact suggested next steps. Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity.

JorgeB · November 14, 2017

1 minute ago, steve1977 said:

Let me clarify - my parity rebuild has completed (the SMART test was run after the rebuild completed). Actually, the errors only occurred during the rebuild.

And as I already mentioned, this is why the current parity is not 100% valid.

3 minutes ago, steve1977 said:

If I were to move the files to a UD, this would take quite a few hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back.

Not quite following here.

4 minutes ago, steve1977 said:

Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity.

AFAIK xfs_repair has no option to scan the complete filesystem, but even if it could identify the files that are on the bad sectors you couldn't rebuild from parity as your current parity is not valid.

steve1977 · November 14, 2017

I am still not 100% sure whether I fully understand, but let me follow the next steps:

* I can add an additional 6TB disk as unassigned disk (UD)

* I can then move all files from the "corrupt" disk to the UD ("mv -r")

* I can then pull the corrupt disk and put the UD into the old "corrupt" slot

* Reconfigure and rebuild parity

Makes sense?
