[SOLVED] Disk errors during parity rebuild



TL;DR: I need to swap out a drive after, well, swapping out all my drives, and I need help with the order in which to proceed.

 

I finally replaced all my old and slower drives, and my read errors stopped completely about halfway through the process, I'm assuming because of the drive that was pulled at that point (my array has been plagued with read errors for a while due to old drives and a bad cable, all of which have now been replaced). However, I had to shrink the array once I had replaced everything I was going to, so I used Unbalance to move the data off the drive, removed it, and rebuilt parity.

During the parity rebuild, one of the new drives threw a bunch of errors; the totals at completion:

 

Read errors (Unraid): 64864

5 Reallocated Sector Count (SMART): 253

196 Reallocated Event Count (SMART): 249

197 Current Pending Sectors (SMART): 643

 

I am assuming these are signs that the drive is faulty, and I am planning to return it and swap in my spare drive, but should I try to run a parity check first? Or should I trust Unraid saying parity is valid and go ahead and swap in the spare and rebuild the faulty drive onto it? Or should I rebuild parity first? (My concern there: if the errors repeat, wouldn't they be written to parity? Or have they already been?) None of the data on the drive is crucial; I'm more concerned about maintaining parity for the more important drives, but if the parity drives are going to trash a bunch of content and there's a way to prevent that, I'd like to do so.

(Unfortunately I rebooted before I remembered to grab a diagnostics dump, but I can post full SMART results once the deep test currently running on the bad drive finishes.)

Edited by greyday

That drive is very sick, as each pending sector indicates a sector that cannot be read reliably (and can thus result in the corresponding sector on the parity drive potentially having the wrong contents). Reallocated sectors, while not necessarily a problem if they are stable, are a big warning sign if the number is not small.
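To make that reasoning concrete, here is a small Python sketch that triages the three SMART attributes posted in the first message. The function name `triage_smart` and the `REALLOCATED_WARN` cut-off are hypothetical choices for illustration, not rules used by Unraid or smartmontools.

```python
# SMART attribute IDs, as listed in the first post.
REALLOCATED_SECTORS = 5
REALLOCATED_EVENTS = 196
PENDING_SECTORS = 197

# Hypothetical cut-off: a few stable reallocations are tolerable,
# anything above this is treated as a warning sign.
REALLOCATED_WARN = 10


def triage_smart(raw: dict[int, int]) -> list[str]:
    """Return warnings for a {attribute_id: raw_value} SMART snapshot."""
    warnings = []
    pending = raw.get(PENDING_SECTORS, 0)
    if pending > 0:
        # Each pending sector could not be read reliably, so parity built
        # while this drive was in the array may be wrong at those offsets.
        warnings.append(f"{pending} pending sectors: parity built with this drive is suspect")
    reallocated = raw.get(REALLOCATED_SECTORS, 0)
    if reallocated > REALLOCATED_WARN:
        warnings.append(f"{reallocated} reallocated sectors: not a small number")
    return warnings


# Raw values reported after the parity rebuild.
snapshot = {REALLOCATED_SECTORS: 253, REALLOCATED_EVENTS: 249, PENDING_SECTORS: 643}
for warning in triage_smart(snapshot):
    print(warning)
```

In practice the raw values would come from the drive's SMART page in the Unraid GUI or from `smartctl -A /dev/sdX`.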

 

With that drive in the system, I would not assume that the contents of the parity drive are valid enough for parity plus the remaining drives to rebuild any failed drive without serious file system corruption on the rebuilt drive.
 

Since you say the content of that drive is unimportant, I would suggest:

  • doing Tools -> New Config and selecting the options to retain all current settings
  • return to the Main tab and change the problem drive slot to its replacement
  • rebuild parity with the new drive set. Hopefully this time it will build without drive-level errors, so it can be assumed valid.
  • You can then format the replacement drive to create an empty file system on it so it is ready to receive data.
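A toy model of why the New Config plus parity rebuild route gives trustworthy parity: parity is recomputed from whatever the current drive set returns, so the old drive's bad reads no longer contribute. The sketch assumes single (XOR) parity, uses tiny byte strings as stand-ins for disks, and its helper names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings position by position."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_sync(disks: list[bytes]) -> bytes:
    """Build single parity from scratch: the XOR of every disk at each offset."""
    return reduce(xor_bytes, disks)


def parity_matches(disks: list[bytes], parity: bytes) -> bool:
    """True if the stored parity agrees with the disks at every offset."""
    return parity_sync(disks) == parity


disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
bad_reads = bytes([0xAA, 0x00, 0xFF, 0x01])  # whatever the failing drive returned
replacement = bytes(4)                       # new, empty drive in the same slot

old_parity = parity_sync([disk1, disk2, bad_reads])
parity = parity_sync([disk1, disk2, replacement])  # the fresh parity rebuild

print(parity_matches([disk1, disk2, replacement], parity))      # True
print(parity_matches([disk1, disk2, replacement], old_parity))  # False
```

The old parity only matches a drive set that includes the failing drive's reads, which is why it is discarded rather than reused.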
1 hour ago, itimpi said:

That drive is very sick, as each pending sector indicates a sector that cannot be read reliably (and can thus result in the corresponding sector on the parity drive potentially having the wrong contents). Reallocated sectors, while not necessarily a problem if they are stable, are a big warning sign if the number is not small.

Considering the number has doubled during the SMART test, I'd say that's a safe bet. ;)

My initial thought was to do pretty much exactly what you suggested, but since a parity rebuild takes WAY longer and I care less about the data on this drive (it's all media, but re-sourcing/replacing/re-ripping ALL of it would be cumbersome), I'd prefer to just rebuild it with the current parity (which will keep it valid, correct?) and then delete/replace anything corrupt I come across. It'd be a lot easier to replace files here and there than the whole drive, especially since (thanks to Unbalance writing something like 4x more to that drive than to any other) it holds about 60% of my total non-work media files. Any reason not to take this approach? I do have the original pulled drives, so I could likely recreate the directory doing it your way as well; it will just take half a week (and put more wear and tear on the parity drives, no?).

Edited by greyday
27 minutes ago, greyday said:

I think I'll just rebuild it with the current parity (which will keep it valid, correct?)

 

I do not think it is safe to assume that the current parity is valid after that level of error, so you are almost certainly going to need a correcting parity check after the rebuild anyway. I would go directly to rebuilding parity from scratch, as at least that way (as long as it completes without error) you know it matches the current drive set.
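For readers unfamiliar with the distinction: a correcting parity check recomputes parity from the data drives and rewrites any parity byte that disagrees, while a non-correcting check only counts mismatches. A minimal sketch under a toy model (single XOR parity, one byte per block); `parity_check` and its `correct` flag are hypothetical names, not Unraid internals.

```python
def parity_check(disks: list[bytearray], parity: bytearray, correct: bool) -> int:
    """Count offsets where parity disagrees with the XOR of the data disks.

    With correct=True, mismatched parity bytes are rewritten in place.
    """
    mismatches = 0
    for offset in range(len(parity)):
        expected = 0
        for disk in disks:
            expected ^= disk[offset]
        if parity[offset] != expected:
            mismatches += 1
            if correct:
                parity[offset] = expected
    return mismatches


disks = [bytearray([1, 2, 3]), bytearray([4, 5, 6])]
parity = bytearray([5, 0, 5])  # offset 1 is wrong: 2 ^ 5 == 7

print(parity_check(disks, parity, correct=False))  # 1 (parity left untouched)
print(parity_check(disks, parity, correct=True))   # 1 (parity rewritten)
print(parity_check(disks, parity, correct=False))  # 0
```

Note the design choice: a correcting check treats the data drives as the source of truth, so it only produces trustworthy parity once every drive in the array reads reliably.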

1 minute ago, itimpi said:

 

I do not think that it is safe to assume that the current parity is valid after that level of error so you are going to almost certainly need to do a correcting parity check anyway after the rebuild.  I would go directly for rebuilding parity from scratch as at least that way (as long as it completes without error) you know it matches the current drive set.

This is what I am trying to figure out; wouldn't rebuilding the drive validate parity, since all the errors would be on the newly rebuilt drive? I'm really game to do it in whatever way works, I'm just trying to figure out which will be less wear and tear on all the drives (but specifically the parity drives) and, less importantly, which would take less time. But based on my understanding of how parity works, any errors would be written to the rebuilt drive, as they would be checked against current data on the rest of the array, correct?

Or is there a potential for file system corruption or some such that I'm not factoring in?

10 minutes ago, greyday said:

This is what I am trying to figure out; wouldn't rebuilding the drive validate parity, since all the errors would be on the newly rebuilt drive? I'm really game to do it in whatever way works, I'm just trying to figure out which will be less wear and tear on all the drives (but specifically the parity drives) and, less importantly, which would take less time. But based on my understanding of how parity works, any errors would be written to the rebuilt drive, as they would be checked against current data on the rest of the array, correct?

Or is there a potential for file system corruption or some such that I'm not factoring in?

 

No. At this point you do not know how (if at all) parity might have been corrupted by the bad disk.

 

Rebuilding the data drive just assumes that the parity is valid and that all the other disks are fine.  It is highly likely that the rebuild is going to result in a badly corrupt file system.   Your scenario would work if the drive being rebuilt now failed, but not if a different one failed.

 

As I said, the only way you can be confident that parity is valid for the complete disk set is to run a parity check anyway, so why not do it from the outset by rebuilding parity?
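The point about which drive fails can be made concrete with a toy simulation. It assumes single XOR parity that was built while the failing drive returned some wrong bytes; the drive names and data are made up for illustration.

```python
from functools import reduce


def xor_all(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte strings together position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(missing: str, disks: dict[str, bytes], parity: bytes) -> bytes:
    """Reconstruct one disk from parity plus every other disk."""
    others = [data for name, data in disks.items() if name != missing]
    return xor_all([parity, *others])


true_bad = bytes([0x11, 0x22, 0x33, 0x44])          # what the failing drive should hold
read_during_sync = bytes([0x11, 0x00, 0x33, 0x44])  # offset 1 came back wrong
disks = {
    "disk1": bytes([0xA0, 0xB0, 0xC0, 0xD0]),
    "disk2": bytes([0x01, 0x02, 0x03, 0x04]),
    "bad": true_bad,  # here the failing drive happens to read correctly later
}

# Parity was built from what the failing drive returned, not its true contents.
parity = xor_all([disks["disk1"], disks["disk2"], read_during_sync])

# Case 1: the failing drive itself is replaced and rebuilt.
rebuilt_bad = rebuild("bad", disks, parity)
print(rebuilt_bad == read_during_sync)  # True: consistent with parity
print(rebuilt_bad == true_bad)          # False: its contents are damaged

# Case 2: a healthy drive fails while the suspect parity is still in use.
print(rebuild("disk1", disks, parity) == disks["disk1"])  # False
```

Case 1 is why parity and the rebuilt drive agree afterwards even though the rebuilt files are damaged; case 2 is why that same parity cannot be trusted to protect the other drives.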

 

11 minutes ago, itimpi said:

 

No. At this point you do not know how (if at all) parity might have been corrupted by the bad disk.

That makes sense, and thank you for your responses; I will most likely remove the drive from the array, rebuild parity, and try to salvage the files from the originally removed drives.

 

But to continue the thought (just to understand better): I 100% get what you are saying insofar as the time during the rebuild goes, that dual parity is most certainly hosed, but wouldn't a successful rebuild mean parity is restored? My understanding is that parity is based on adding up each 1 or 0 at the same position on each drive and determining whether that sum is odd (1) or even (0), so that if a drive drops out, adding the same numbers back up with the parity values gives the missing value. (I'm a little fuzzier on dual parity, to be honest; I'm guessing it is a line-by-line opposite of the first parity drive?) So wouldn't reading all the healthy drives result in parity being valid, with any damage being written to the rebuilt drive? What I'm wondering is whether that drive can be rebuilt, then any corrupt files removed/replaced and the file system repaired, without forcing a complete parity rebuild. (Again, this is academic at this point; I'm not planning to take this approach anymore, but I'm trying to understand how parity and dual parity work overall.)
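For the single-parity part of the question, the "odd or even sum" description is exactly bitwise XOR. An inverted copy of the first parity would carry no extra information, so it could not recover a second failed drive. Unraid's dual parity is generally described as a RAID-6 style P+Q scheme, where Q is a weighted sum over the Galois field GF(2^8). The sketch below assumes that construction (generator 2, polynomial 0x11D, as used by Linux md RAID-6); it illustrates the math and is not Unraid's code.

```python
from functools import reduce

GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed RAID-6 field polynomial)


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8): shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    # The nonzero elements form a group of order 255, so a^254 == a^-1.
    return gf_pow(a, 254)


def p_parity(disks: list[bytes]) -> list[int]:
    """First parity: XOR (the 'odd or even' bit sum) of each column."""
    return [reduce(lambda x, y: x ^ y, col) for col in zip(*disks)]


def q_parity(disks: list[bytes]) -> list[int]:
    """Second parity: sum of 2^i * D_i over GF(2^8), so each disk is weighted."""
    q = [0] * len(disks[0])
    for i, disk in enumerate(disks):
        coef = gf_pow(2, i)
        for k, byte in enumerate(disk):
            q[k] ^= gf_mul(coef, byte)
    return q


def recover_two(disks, x, y, p, q):
    """Rebuild disks x and y from P, Q and the survivors (entries x, y are ignored)."""
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx, dy = [], []
    for k in range(len(p)):
        pxy, qxy = p[k], q[k]
        for i, disk in enumerate(disks):
            if i not in (x, y):
                pxy ^= disk[k]
                qxy ^= gf_mul(gf_pow(2, i), disk[k])
        # Solve Dx ^ Dy = pxy and gx*Dx ^ gy*Dy = qxy for Dx, then Dy.
        vx = gf_mul(qxy ^ gf_mul(gy, pxy), inv)
        dx.append(vx)
        dy.append(pxy ^ vx)
    return dx, dy


disks = [bytes([3, 7, 200]), bytes([10, 0, 55]), bytes([99, 128, 1])]
p, q = p_parity(disks), q_parity(disks)

# One failure: XOR parity plus the survivors gives back the missing disk.
print(bytes(a ^ b ^ c for a, b, c in zip(p, disks[1], disks[2])) == disks[0])  # True

# Two failures (disks 0 and 2): P alone is not enough, P and Q together are.
d0, d2 = recover_two(disks, 0, 2, p, q)
print(bytes(d0) == disks[0] and bytes(d2) == disks[2])  # True
```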

Edited by greyday

I don't hold a candle to itimpi, but reading through this I would follow exactly what he suggests. Since you built parity with a disk giving read errors, you very likely have invalid parity. Luckily the data on that disk isn't important to you, so if it were me I would shut down my array now, do a New Config, remove that drive from the array, and let fresh parity build.

 

It makes no difference how many parity disks you have or how many drives fail right now: you can't trust your existing parity with that failing disk in place, because parity is based on "bad bits" from that disk, or at least you can't be sure that it isn't.

 

Remove the drive and do a New Config ASAP, IMO.

7 hours ago, greyday said:

That makes sense, and thank you for your responses; I will most likely remove it from the array and rebuild parity, try to salvage the files from the originally removed drives.

 

But to continue the thought (just to understand better), I 100% get what you are saying insofar as the time during the rebuild goes, that dual parity is most certainly hosed, but wouldn't a successful rebuild mean parity is restored? My understanding is that parity is based on adding up each 1 or 0 from the same sector of each drive and determining whether that sum is odd (1) or even (0), so that if a drive falls out adding the same numbers back up with the parity values will result in the missing value (I'm a little fuzzier on dual parity, to be honest, I'm guessing it is a line by line opposite to the first parity drive?). So then wouldn't reading all the healthy drives result in parity being valid and any damage to it being written to the rebuilt drive? What I'm wondering here is if that drive can be rebuilt, then any corrupt files removed/replaced and the file system repaired without forcing a complete parity rebuild (again, academic at this point, I'm not planning to take this approach anymore, but I'm trying to understand how parity, and dual parity, work overall)...


I was thinking through what you said, and you might get away with just the rebuild!

 

After the rebuild completes, the contents of the rebuilt data drive will agree with what parity plus all the other data drives contain, but the contents are likely to be badly corrupt at best. However, since you said you were not worried about preserving the drive's contents, if you now follow the procedure for Reformatting a drive, the fact that the contents may currently be invalid is not relevant, as the format operation will create a new empty file system and update parity accordingly. Note that this is rather a special case where you are definitely going to discard the contents of the rebuilt drive.
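Why the format step keeps parity valid can be shown with a read/modify/write parity update, one way an XOR-parity array can keep parity in sync on every write: new parity = old parity XOR old data XOR new data. This toy sketch assumes single XOR parity and uses hypothetical helper names; a real format writes file system structures rather than zeros, and Unraid's actual write path differs in detail.

```python
def write_block(disks: list[bytearray], parity: bytearray,
                disk_index: int, offset: int, new_byte: int) -> None:
    """Write one byte to a data disk and update parity in place (read/modify/write)."""
    old_byte = disks[disk_index][offset]
    parity[offset] ^= old_byte ^ new_byte  # remove the old contribution, add the new one
    disks[disk_index][offset] = new_byte


def parity_ok(disks: list[bytearray], parity: bytearray) -> bool:
    """True if parity equals the XOR of all data disks at every offset."""
    for offset, p in enumerate(parity):
        expected = 0
        for disk in disks:
            expected ^= disk[offset]
        if expected != p:
            return False
    return True


# Disk 1 holds the corrupt rebuilt contents; parity is consistent with them.
disks = [bytearray([5, 6, 7]), bytearray([0xDE, 0xAD, 0xBE])]
parity = bytearray(a ^ b for a, b in zip(*disks))

# "Format": overwrite the rebuilt drive with fresh contents (zeros here).
for offset in range(3):
    write_block(disks, parity, 1, offset, 0)

print(parity_ok(disks, parity))  # True
```

The update only works because parity already agreed with the old (corrupt) contents, which is exactly what the rebuild guarantees in this special case.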

11 hours ago, itimpi said:


I was thinking through what you said, and you might get away with just the rebuild!

 

After the rebuild completes, the contents of the rebuilt data drive will agree with what parity plus all the other data drives contain, but the contents are likely to be badly corrupt at best. However, since you said you were not worried about preserving the drive's contents, if you now follow the procedure for Reformatting a drive, the fact that the contents may currently be invalid is not relevant, as the format operation will create a new empty file system and update parity accordingly. Note that this is rather a special case where you are definitely going to discard the contents of the rebuilt drive.

 

This is what I was thinking too, and it involves 4TB of reads from the parity drives instead of 14TB of writes to them, though I'm sure the difference to their lifespans will be nominal. Since it's just one drive, and there's only one other data drive larger than it, I think this plus a parity check when done may be all I need. Plus, since I don't want to mount this drive in the array again and it's likely pretty buggy/crashy on its own, a rebuild may be the only way I can get a list of files that need replacing...

 

 

And, the more I think about it, the worst case scenario is an extra few hours, as I'd basically be right back at the "blank it and rebuild" stage if it doesn't work. I'm gonna give it a try, will report back when it's "done"!

EDIT: Just for shits and giggles, here are the same SMART stats after running the deeper SMART test:
 

5 Reallocated Sector Count (SMART): 445

196 Reallocated Event Count (SMART): 416

197 Current Pending Sectors (SMART): 3451
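Comparing the two snapshots shows how fast the drive is degrading. A small sketch using the raw values posted in this thread; the function name `growth` is hypothetical.

```python
before = {"Reallocated Sector Count": 253, "Reallocated Event Count": 249,
          "Current Pending Sectors": 643}
after = {"Reallocated Sector Count": 445, "Reallocated Event Count": 416,
         "Current Pending Sectors": 3451}


def growth(old: dict[str, int], new: dict[str, int]) -> dict[str, float]:
    """Ratio of each attribute's raw value in the new snapshot to the old one."""
    return {name: new[name] / old[name] for name in old if old[name]}


for name, ratio in growth(before, after).items():
    print(f"{name}: {before[name]} -> {after[name]} ({ratio:.1f}x)")
```

Pending sectors grew more than fivefold during a single test, a much stronger signal than the reallocation counts alone.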

Edited by greyday
On 3/10/2021 at 8:48 AM, itimpi said:


I was thinking through what you said, and you might get away with just the rebuild!

 

Rebuild went smoothly! Now just cataloging what files I need to pull from the old drives and then on to blanking and copying over. Parity seems to be just fine, zero read errors across the board leads me to believe that this worked. I'll update if there are any other problems, but otherwise marking as solved, thanks for all your help!

  • greyday changed the title to [SOLVED] Disk errors during parity rebuild
