Help needed with disk replacement after parity errors

Benbou · April 13, 2023

Hello,

I'm a long-time Unraid user, but this is my first time posting on the forum.

I recently experienced an unclean shutdown on my Unraid server due to a power outage. During the subsequent parity check, I received a notification about "Current pending sector" errors on one of my drives (disk 3). This drive has also regularly reported "UDMA CRC errors" during previous parity checks.

The parity check found 7 errors (this is the first time I have encountered parity sync errors), and I suspect that disk 3 is bad.

I have ordered a replacement drive, but I need help with the process of replacing it to prevent data loss and get my server up and running.

Here's my current plan:

Keep VMs and Docker disabled for now and avoid writes to the server as much as possible.
Shut down the server, add the new disk, and preclear it when I receive it (I have a few available bays).
Shut down the server, remove disk 3, start the array (to flag disk 3 as missing), stop the array, and assign the new precleared drive as disk 3.
Wait for Unraid to reconstruct the data.
Run a parity check and hope for 0 errors found.

However, I'm not sure if this plan is the best course of action to prevent data loss and ensure a smooth transition to the new drive. I would appreciate any recommendations or ideas to help me complete this process with confidence.

Thank you in advance for your help.

PS: I have attached my diagnostic file to this post

PS2: I also ran a SMART extended self-test on all my drives, and no further errors were discovered.

tower-diagnostics-20230412-2143_anon.zip

JorgeB · April 13, 2023

There are no disk errors for now, suggesting the pending sectors are "false positives". The parity sync errors are likely from the unclean shutdown, and they are only on parity2. Since the disk is not reporting errors for now and it passed a recent extended SMART test, I would do a correcting check first; you can then replace the disk later if there really are issues.

itimpi · April 13, 2023

A few comments:

Your plan should work, but it is possible it may not be necessary if the disk is not really going bad. One thing you could do now is run an Extended SMART test on the drive and see if that passes and if it does the drive is quite likely to be OK.
A few sync errors after an unclean shutdown are expected.
The pending sectors on disk3 are not a good sign, but sometimes these can go back to 0 if they are rewritten. Ideally
CRC errors rarely in themselves indicate a problem with the actual drive, as they relate to the connection between the drive and host.
The preclear on the replacement disk is not strictly necessary unless you want it as a stress test. However, a rebuild onto that drive would also act as a good stress test and would be quicker (assuming it works without error).

Since you already have a drive on order maybe the best thing to do is to proceed with your plan, and keep disk3 with its contents intact until the rebuild onto the replacement disk3 completes successfully. After that you can then run a pre-clear on the old drive to see if it passes that OK.

Benbou · April 13, 2023

5 hours ago, JorgeB said:

[...] and they are only on parity2, [...].

Does that mean that parity1 is still valid? I didn't know that the two parities were not tied together.

Also, how did you find this information? I would like to be able to diagnose it myself in the future.

5 hours ago, JorgeB said:

There are no disk errors for now, suggesting the pending sectors are "false positives". The parity sync errors are likely from the unclean shutdown, [...] Since the disk is not reporting errors for now and it passed a recent extended SMART test, I would do a correcting check first; you can then replace the disk later if there really are issues.

Please tell me if I'm wrong, but my understanding was that sectors are marked as pending when the drive is unable to read data from them (and that it will attempt to recover the data later).

With that understanding, I was under the impression that the 8 pending sectors likely caused the 7 parity sync errors.

However, now you are suggesting a correcting check. This check will use the data on the drives to rebuild the parity, right?

So I'm worried that if some data is not recoverable on disk3, a correcting check will prevent me from getting it back from the parity.
On the other hand, if the parity is really invalid and the data on disk3 is fine, rebuilding the drive will cause some data corruption/loss.

Is there a way to be certain of the proper course of action?

Thanks for your time and your help!

PS: Writing this gave me a new idea that might allow me to be certain that the data will be saved.

Exclude disk3 from all my shares, to prevent any data change on the drive.
Find a way to run a checksum on all the files on the drive.
Replace disk3 with the new drive and reconstruct it (keeping the old disk3 in a safe spot for now).
Re-run the checksum on all the reconstructed files on the drive.
Diff the 2 lists of files and ensure that everything is identical.

However, writing those bullet points makes me wonder how to interpret the result if the data is not identical in those 2 lists... is disk3 or the parity bad?
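
The "run a checksum on all the files" step could look something like the sketch below. It is only an illustration, not an Unraid feature: the function names and the manifest layout (one `digest  relative/path` line per file) are assumptions made here, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(folder, name)
            relative = os.path.relpath(full_path, root)
            manifest[relative] = sha256_of_file(full_path)
    return manifest


def write_manifest(manifest, out_path):
    """Write 'digest  relative/path' lines, sorted by path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for relative in sorted(manifest):
            out.write(f"{manifest[relative]}  {relative}\n")
```

A call such as `write_manifest(build_manifest("/mnt/disk3"), "disk3-before.sha256")` would produce the first list; both the mount path and the output file name are placeholders here. Note that hashing reads every file, so it would also hit any sector that is genuinely unreadable.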

Benbou · April 13, 2023

5 hours ago, itimpi said:

A few comments:

[...] One thing you could do now is run an Extended SMART test on the drive and see if that passes and if it does the drive is quite likely to be OK.

[...]

I did run an Extended SMART test, and no new errors were reported. Only the same 8 pending sectors.

5 hours ago, itimpi said:

A few comments:

[...]

The pending sectors on disk3 are not a good sign, but sometimes these can go back to 0 if they are rewritten. Ideally

[...]

Wouldn't this prevent any reads from those sectors (which would invalidate the parity)?

5 hours ago, itimpi said:

[...]

Since you already have a drive on order maybe the best thing to do is to proceed with your plan, and keep disk3 with its contents intact until the rebuild onto the replacement disk3 completes successfully. After that you can then run a pre-clear on the old drive to see if it passes that OK.

But at this point I'm not sure whether the data on disk3 is corrupted or the parity is.

So if the parity check (the last step of my plan) does report 0 errors, what would that mean?
If the parity was corrupted, would the rebuild just corrupt the data on disk3 and the check still report 0 parity sync errors?

But if disk3 was corrupted, the rebuild would indeed fix the corrupted data and the check would report 0 parity errors, right?

How can I be sure that I don't lose any data?

The data from the old disk3 will still be intact, but it will most likely be different from the data on the new disk3. How do I know which drive has the "good" data?

Thank you for your time and your help!

itimpi · April 13, 2023

13 minutes ago, Benbou said:

The data from the old disk3 will still be intact, but it will most likely be different from the data on the new disk3. How do I know which drive has the "good" data?

You cannot be certain with XFS file systems unless you have checksums from when the data was known to be good. This is why some people use the File Integrity plugin in conjunction with XFS file systems.

You also have the option of using BTRFS (and shortly ZFS) file systems on array drives. These have built-in checksum checking of files as they are read or written; although this does not correct such errors, it lets you know immediately when they are detected and in which files.

JorgeB · April 13, 2023

42 minutes ago, Benbou said:

Does that mean that parity1 is still valid?

Correct, no errors found on P (parity1), only Q (parity2).

Benbou · April 13, 2023

3 minutes ago, itimpi said:

[...] This is why some people use the File Integrity plugin in conjunction with XFS file systems.

[...]

You can be assured that I will set up this plugin as soon as the aftermath of this power outage is behind me.

So maybe I should find a way to checksum all the files from both drives (the old and new disk3) and diff the results.

If any files don't have the same hash, I can try to open them and figure out which one is corrupted.
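
The "diff the results" step might be sketched as follows, assuming each drive's checksums were saved as a text file with one `digest  relative/path` line per file. That layout and the function names are assumptions for illustration; nothing in this thread prescribes them.

```python
def load_manifest(path):
    """Read 'digest  relative/path' lines into a {path: digest} dict."""
    manifest = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, relative = line.split("  ", 1)
            manifest[relative] = digest
    return manifest


def compare_manifests(old, new):
    """Return (changed, only_in_old, only_in_new) as sorted path lists."""
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    only_in_old = sorted(old.keys() - new.keys())
    only_in_new = sorted(new.keys() - old.keys())
    return changed, only_in_old, only_in_new
```

Any path in `changed` is a file whose contents differ between the old and the rebuilt disk3. The comparison alone cannot say which copy is the good one, so those files would still need to be opened and inspected by hand.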

Benbou · April 13, 2023

6 minutes ago, JorgeB said:

Correct, no errors found on P (parity1), only Q (parity2).

This is great news!

So does that mean it's certain that the files on disk3 are fine?
I mean, if they were corrupted, both parities would have reported errors, right?

JorgeB · April 13, 2023

11 minutes ago, Benbou said:

So does that mean it's certain that the files on disk3 are fine?
I mean, if they were corrupted, both parities would have reported errors, right?

Yep.
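
The reasoning behind that answer can be illustrated with a toy model. The sketch below assumes a RAID-6 style scheme in which P is the XOR of the data bytes and Q is a Reed-Solomon style sum over GF(2^8). This thread does not describe how Unraid actually computes Q, so the field polynomial `0x11D`, the generator `2`, and all function names are assumptions for illustration only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def compute_pq(data_bytes):
    """Return (P, Q) for one byte position across the data disks."""
    p = 0
    q = 0
    coefficient = 1  # g**0, with assumed generator g = 2
    for value in data_bytes:
        p ^= value
        q ^= gf_mul(coefficient, value)
        coefficient = gf_mul(coefficient, 2)
    return p, q


def check(data_bytes, stored_p, stored_q):
    """Return (p_ok, q_ok) for the stored parity against the data."""
    p, q = compute_pq(data_bytes)
    return p == stored_p, q == stored_q


data = [0x12, 0x34, 0x56, 0x78]  # one byte from each of four data disks
stored_p, stored_q = compute_pq(data)

# Case 1: a byte on the third data disk is silently corrupted.
corrupted = list(data)
corrupted[2] ^= 0x01
print(check(corrupted, stored_p, stored_q))  # (False, False)

# Case 2: the data is intact and only the stored Q byte is stale.
print(check(data, stored_p, stored_q ^ 0x01))  # (True, False)
```

In this model a change on a single data disk always alters both P and Q, so a mismatch that shows up on Q alone points at the stored Q rather than at the data.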

Benbou · April 13, 2023

4 hours ago, JorgeB said:

Correct, no errors found on P (parity1), only Q (parity2).

But where, in the diagnostic files, did you find that the errors were only on Q?

Despite my attempts, I was unable to locate it.

JorgeB · April 14, 2023

Apr 10 23:49:25 Tower kernel: mdcmd (40): check nocorrect
Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...
Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896
Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901840
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901848
Apr 11 13:17:52 Tower kernel: md: recovery thread: Q incorrect, sector=12889075728
Apr 11 13:23:34 Tower kernel: md: recovery thread: Q incorrect, sector=12986064728
Apr 12 18:01:08 Tower kernel: md: sync done. time=31194sec
Apr 12 18:01:08 Tower kernel: md: recovery thread: exit status: 0

Benbou · April 14, 2023

10 hours ago, JorgeB said:

Apr 10 23:49:25 Tower kernel: mdcmd (40): check nocorrect
Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...
Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896
Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901840
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901848
Apr 11 13:17:52 Tower kernel: md: recovery thread: Q incorrect, sector=12889075728
Apr 11 13:23:34 Tower kernel: md: recovery thread: Q incorrect, sector=12986064728
Apr 12 18:01:08 Tower kernel: md: sync done. time=31194sec
Apr 12 18:01:08 Tower kernel: md: recovery thread: exit status: 0

Thank you!

Note to self: next time I can search the files for the keyword "incorrect".
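
That keyword search could also be scripted. The sketch below counts the "incorrect" lines per parity, assuming the syslog lines look exactly like the ones quoted above; other message variants are not handled, and the function name is made up for this example.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
#   "... md: recovery thread: Q incorrect, sector=2220657896"
INCORRECT = re.compile(r"md: recovery thread: ([PQ]) incorrect, sector=(\d+)")


def count_parity_errors(lines):
    """Count 'incorrect' lines per parity (P or Q) and collect the sectors."""
    counts = Counter()
    sectors = defaultdict(list)
    for line in lines:
        match = INCORRECT.search(line)
        if match:
            parity = match.group(1)
            counts[parity] += 1
            sectors[parity].append(int(match.group(2)))
    return counts, sectors


sample = [
    "Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...",
    "Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896",
    "Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816",
]
counts, sectors = count_parity_errors(sample)
print(counts["P"], counts["Q"])  # 0 2
```

To scan a real log, the lines of the syslog file from the diagnostics zip could be passed in instead of `sample`.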
