Need help, parity disk shows error during rebuild

hawihoney · January 11, 2019

Diagnostics attached (see below).

- Two disks (disk19, disk20) failed at once in a Dual Parity system. Both disks show lots of reallocated sectors.

- Started data rebuild of disk19 on a new disk.

- First parity disk shows 64 read errors during data rebuild. Rebuild went to completion.

- First parity disk reported one pending sector during the rebuild (it's still that value after a reboot).

197	Current pending sector --> 1

- First Parity disk shows the following read errors during the rebuild (several times):

Jan 10 11:21:27 Tower2 kernel: md: disk0 read error, sector=37066384
Jan 10 11:21:27 Tower2 kernel: md: recovery thread: multiple disk errors, sector=37066384

- Parity check history shows this entry after completion.

2019-01-11, 02:52:33	15 hr, 33 min, 33 sec	107,1 MB/s	OK	64 (this stands for the 64 read errors)

My question: Was that rebuild successful, or was garbage data from those 64 read errors copied to the rebuilt disk19? I ask because two disks were down (disk19 and disk20) and the first parity disk showed read errors during the rebuild. I mean, there was no additional disk to get "good" data from. Was everything successful?

I'm currently rebuilding disk20 and will replace the first parity disk afterwards.

Any help is highly appreciated.

Many thanks in advance.

tower2-diagnostics-20190111-0650.zip

Edited January 11, 2019 by hawihoney

JorgeB · January 11, 2019

Ahh, this makes more sense than the scenario you described yesterday. The rebuilt disk19 will be mostly OK, but there will be some corruption due to the read errors on parity (unless there's no data on those sectors). "md: recovery thread: multiple disk errors" is Unraid speak for "there are errors on more disks than the current redundancy can correct; the rebuild/sync will continue, but there will be some (or a lot of) corruption." Note also that the rebuild of disk20 will have some corruption because of the previous errors, again unless there's no data on those sectors.
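For anyone who wants to see which sectors were affected, here is a minimal Python sketch that scans syslog lines for the two messages quoted in the first post. The regular expressions are modeled only on those two lines, and the function name `summarize_md_errors` is made up for illustration; other Unraid versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modeled on the two syslog lines quoted in this thread (an assumption;
# other message formats are not covered).
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
MULTI_ERROR = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")

def summarize_md_errors(lines):
    """Return (sectors with read errors per disk, sectors flagged as uncorrectable)."""
    read_errors = defaultdict(set)
    uncorrectable = set()
    for line in lines:
        m = READ_ERROR.search(line)
        if m:
            read_errors[m.group(1)].add(int(m.group(2)))
            continue
        m = MULTI_ERROR.search(line)
        if m:
            uncorrectable.add(int(m.group(1)))
    return dict(read_errors), uncorrectable

sample = [
    "Jan 10 11:21:27 Tower2 kernel: md: disk0 read error, sector=37066384",
    "Jan 10 11:21:27 Tower2 kernel: md: recovery thread: multiple disk errors, sector=37066384",
]
errors, bad = summarize_md_errors(sample)
print(errors)       # read-error sectors grouped by md disk name
print(sorted(bad))  # sectors where redundancy was exceeded
```

Sectors that show up in the second set are the ones where the rebuilt data cannot be trusted.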

hawihoney · January 11, 2019

Thanks for your answer.

Some background:

This week I bought two new barebones and moved everything from the old system into them. It seems I handled 1-3 disks a little roughly.

About the rebuild result:

"OK" means everything is OK. "OK" does not mean there might be some corruption. The wording "OK" for the rebuild result simply puzzles me. I think there could be a better explanation for an average user like me.

For disk20, I do have a full backup. I will copy it over once the rebuild is done. This will "repair" the state of the first parity as well. After this rebuild I will replace the first parity disk.

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors. This disk was full, so there might be some problems then.

Arrggghhhh.

JorgeB · January 11, 2019

1 hour ago, hawihoney said:

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors. This disk was full, so there might be some problems then.

It's still a very small number of errors considering how many sectors there are. If the disk contains mostly video files, for example, there will likely be only one corrupt file, and that would likely mean a small (or large) glitch during playback. Still, these are the situations where it's good to have checksums of all your files (for those not using btrfs), so you can easily find out which file(s) are corrupt.
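To illustrate the checksum idea, here is a minimal Python sketch, not any particular Unraid plugin: it records a SHA-256 digest for every file under a directory and later reports the files whose contents no longer match. The function names `build_manifest` and `find_corrupt` are hypothetical, and the manifest is kept in memory only; a real setup would need to save it somewhere off the disk being checked.

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large video files are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest

def find_corrupt(root, manifest):
    """Return the relative paths that are missing or whose digest has changed."""
    bad = []
    for rel, digest in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != digest:
            bad.append(rel)
    return sorted(bad)
```

The manifest has to be built while the files are known to be good. After a rebuild with read errors, `find_corrupt` would then list exactly the files that were hit.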

hawihoney · January 11, 2019

Thanks for the checksum reminder. It's a project I've been thinking about for 2-3 years. It's time to start it now.

I did not like the solutions I saw when I looked at them in the past. Will search again.

trurl · January 11, 2019

4 hours ago, hawihoney said:

For disk20, I do have a full backup...

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors...

35 minutes ago, hawihoney said:

Thanks for the checksum reminder.

Of course backups are more important than checksums.

hawihoney · January 12, 2019

Sorry, but I'm having a hell of a time interpreting the wording around parity checks and disk rebuilds on a dual parity system. Please have a look at these three images. I took them during and after a correcting parity check:

- Img1: During the correcting parity check, parity-disk1 and data-disk3 throw read errors. This kind of error was already mentioned in this thread.

- Img2: Status during parity check says "0 sync errors corrected".

- Img3: Result after parity check says "0 errors found".

I don't get that. On the same "Main" page I see "248 Errors" that are not corrected and lead to no errors found. Really?

If I look closely I see 17 writes to disk17. That was an image I copied during parity sync. If I ignore these, I see one write access to disk3 (the data-disk with the read errors). This write access is not written to parity-disk2 but written to parity-disk1.

So what is the result?

- Trust the end result: Everything is ok.

- Trust the status: 248 read errors were detected and these are left uncorrected.

- 248 read errors were detected but corrected without Unraid's intervention.

- And what might be the scenario to write to parity-disk2 but not to parity-disk1?

Am I looking too closely and interpreting too much?

I'm pretty sure the WD-EFRX disks are the problem. I started with only a few of them, and not that many in one case. It is mentioned that they are only rated for systems with 1-8 disks. I bet it's a vibration problem that leads to the read errors. What do you think?

Edited January 12, 2019 by hawihoney

trurl · January 12, 2019

It will retry and sometimes succeed, so the errors might have eventually resulted in correct data.

41 minutes ago, hawihoney said:

I see one write access to disk3 (the data-disk with the read errors)

After retries, if it still can't read a data disk, it will get the data from the parity calculation and then write it back to that data disk.
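As a rough illustration of "get the data from the parity calculation", here is a simplified Python sketch using single XOR parity only. It is not Unraid's code: the second parity disk uses a different calculation that is not modeled here, and the block contents and function names are invented for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(data_blocks, parity, missing_index):
    """Recompute one unreadable data block from the other data blocks plus parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])

# Three made-up data "disks", four bytes each, and their XOR parity.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(disks)

# Pretend the third disk returned a read error for this block.
rebuilt = reconstruct(disks, parity, missing_index=2)
print(rebuilt == disks[2])  # True
```

This only works while the number of unreadable blocks at the same position does not exceed the number of parity disks, which is why a read error on parity during a two-disk rebuild cannot be corrected.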

Maybe diagnostics would shed more light on the errors.

hawihoney · January 12, 2019

Yes, diagnostics attached.

Thanks for looking into it. I do see the read errors, but nowhere do I find any info about corrections, etc.

tower2-diagnostics-20190112-1844.zip

trurl · January 12, 2019

I see the errors in syslog but SMART attributes for both disks look OK. I would do an extended SMART test on both.

JorgeB · January 12, 2019

In this case the read errors were corrected by parity, so all is well except for parity and disk3: you already knew parity was failing, and disk3 appears to be failing as well.

hawihoney · January 13, 2019

Thank you both. Can an extended SMART test run while the array is started?

JorgeB · January 13, 2019

1 hour ago, hawihoney said:

Can an extended SMART test run while the array is started?

Yes, but avoid accessing that disk during the test.

hawihoney · January 13, 2019

Ah, thanks.
