Mahmutti Posted February 24, 2021 (edited)

I got some read errors during a parity check a couple of days ago and decided to replace the bad data disk. However, all the disks in the array were 8 TB at that point and my only replacements are 12 TB, so a parity swap was needed. Being an idiot, I misread the instructions on the parity swap procedure and invalidated my old parity disk, so now I need to do a parity rebuild onto one of the 12 TB disks.

Yesterday I did that and everything went seemingly fine, but afterwards, as I stopped the array to replace the bad 8 TB drive with another 12 TB drive, it said that the new 12 TB parity drive I had just rebuilt was the wrong disk and suggested that the old 8 TB one was the right one. I powered off the machine, removed the old parity drive and started the rebuild again.

During this rebuild (which is still going on at this time) I got some read errors on the bad data disk, 868 of them to be precise. They seem to be concentrated in 3 different "clusters" of sectors, since I got a bunch of read errors within the same second at 3 different times.

Now I don't mind it if I get a few corrupt files as a result (this is a media server, nothing I can't replace), but I'd rather not lose the contents of the whole drive. So I have some questions:

- Will the array work, just with some corrupt files?
- Should I run a parity check afterwards? The disk seems to work fine sometimes; other times it gives a bunch of read errors. The alternative is to take out the bad disk, replace it with a fresh 12 TB one and rebuild that from some bad parity data.
- Is there any way to find out which files are affected from the sector numbers?

kratos-diagnostics-20210224-1046.zip

Edited February 24, 2021 by Mahmutti
JorgeB Posted February 24, 2021

    1 hour ago, Mahmutti said:
    Will the array work, just with some corrupt files?

Yes.

    1 hour ago, Mahmutti said:
    Should I run a parity check afterwards? The disk seems to work fine sometimes, sometimes it gives a bunch of read errors. The alternative is to take out the bad disk and replace it with a fresh 12 TB one and rebuild that with some bad parity data.

It's difficult to say: it could get better, it could get worse. One thing you can do is clone that disk with ddrescue after the parity is built. If there are no errors, great; if there are errors, you can find out which files are affected.
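For reference, a GNU ddrescue clone of a failing disk typically looks like the sketch below. The device names and mapfile path are placeholders (the failing source was /dev/sdd in this thread), and the command is only echoed here as a dry run since it would overwrite the target device:

```shell
# Dry-run sketch of a ddrescue clone (GNU ddrescue; devices are placeholders).
# -f forces writing to a block device, -r3 retries bad areas three times, and
# the mapfile records progress so the copy can be resumed and the unreadable
# regions listed afterwards.
echo ddrescue -f -r3 /dev/sdX /dev/sdY /path/to/rescue.map
```

Remove the `echo` only once you have triple-checked which device is the source and which is the target.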
Mahmutti Posted February 24, 2021 (Author)

    2 minutes ago, JorgeB said:
    It's difficult to say, it could get better, it could get worse, one thing you can do is to clone that disk with ddrescue after the parity is built, if no errors great, if there are errors you can find out which files are affected.

Alright, I don't want to take any unnecessary risks, so I won't be running a parity check. I can probably deal with a few broken files, though it's good to know that ddrescue is an option if I end up needing it. Thanks for the reassurance, I'll mark this as solved.
Mahmutti Posted February 25, 2021 (Author, edited)

    On 2/24/2021 at 11:18 AM, Mahmutti said:
    Is there any way to find out which files are affected from the sector numbers?

Sorry about the late bump, I just thought I would add the solution to this question (beyond ddrescue):

- Ensure that you have plenty of RAM on your system. 16 GB wasn't enough for an 8 TB drive; about 20 GB would have been cutting it close but sufficient.
- If mounted, unmount the affected disk (in my case, /dev/sdd).
- Run xfs_db -r /dev/sdd1 (replace sdd1 with your disk). This opens an interactive command prompt for debugging XFS. In it:
  1. blockget -n (this is the part that will eat your memory; it took about 20 or so seconds to complete for me)
  2. daddr 13364668368 (replace the number with your affected sector number; in my case I had 4 different sequences of sectors that threw read errors, and I just took the first and last sector of each sequence)
  3. blockuse -n (this will output the affected block, inode and filename)
- Repeat steps 2-3 for any affected sectors that you want to check.
- quit to exit xfs_db.

There isn't a lot of easy-to-find information on this, so I'm putting it here to help anyone else with the same specific issue.

Edited February 26, 2021 by Mahmutti
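The "first and last sector of each sequence" step above can be scripted. A minimal sketch, assuming you have already pulled the failing sector numbers out of the syslog (the numbers below are made-up examples, not the ones from this thread):

```shell
# Collapse a list of bad sector numbers into contiguous ranges and print the
# first and last sector of each range (these are the values to feed to daddr).
printf '%s\n' 1000 1001 1002 5000 5001 9000 |
  sort -n |
  awk 'NR==1 {s=p=$1; next}       # start the first range
       $1==p+1 {p=$1; next}       # still contiguous, extend the range
       {print s, p; s=p=$1}       # gap found: emit range, start a new one
       END {print s, p}'          # emit the final range
```

With the example input this prints `1000 1002`, `5000 5001` and `9000 9000`, one range per line.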