hawihoney

Need help, parity disk shows error during rebuild


Diagnostics attached (see below).

 

- Two disks (disk19, disk20) failed at once in a Dual Parity system. Both disks show lots of reallocated sectors.

- Started data rebuild of disk19 on a new disk.

- First parity disk shows 64 read errors during data rebuild. Rebuild went to completion.

- First parity disk reported one pending sector during the rebuild (it's still that value after a reboot).

197	Current pending sector --> 1

 

- First Parity disk shows the following read errors during the rebuild (several times):

Jan 10 11:21:27 Tower2 kernel: md: disk0 read error, sector=37066384
Jan 10 11:21:27 Tower2 kernel: md: recovery thread: multiple disk errors, sector=37066384

 

- Parity check history shows that entry after completion.

2019-01-11, 02:52:33	15 hr, 33 min, 33 sec	107,1 MB/s	OK	64 (this stands for the 64 read errors)

 

My question: Was that rebuild successful, or was garbage data from those 64 read errors copied to the rebuilt disk19? I ask because two disks were offline (disk19 and disk20) and the first parity disk showed read errors during the rebuild. I mean, there was no additional disk to get "good" data from. Was everything successful?

 

I'm currently rebuilding disk20 and will replace the first parity disk afterwards.

 

Any help is highly appreciated.

 

Many thanks in advance.

 

 

tower2-diagnostics-20190111-0650.zip


Ahh, this makes more sense than the scenario you described yesterday. The rebuilt disk19 will be mostly OK, but there will be some corruption due to the read errors on parity (unless there's no data on those sectors). "md: recovery thread: multiple disk errors" is Unraid speak for "there are errors in more disks than the current redundancy can correct; the rebuild/sync will continue, but there will be some (or a lot of) corruption." Note also that the rebuild of disk20 will have some corruption because of the previous errors, again unless there's no data on those sectors.
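
To see which sectors were hit, the syslog from the diagnostics can be scanned for these two messages. This is only a minimal Python sketch, assuming the log lines have exactly the format quoted in the first post; the function name and the sample list are made up for illustration and are not part of Unraid.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   "... kernel: md: disk0 read error, sector=37066384"
#   "... kernel: md: recovery thread: multiple disk errors, sector=37066384"
READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
UNCORRECTABLE = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")


def summarize_md_errors(lines):
    """Return (read-error sectors per disk, sectors the rebuild could not correct)."""
    read_errors = defaultdict(set)
    uncorrectable = set()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            read_errors[match.group(1)].add(int(match.group(2)))
            continue
        match = UNCORRECTABLE.search(line)
        if match:
            uncorrectable.add(int(match.group(1)))
    return dict(read_errors), uncorrectable


# The two lines quoted in the first post serve as sample input.
sample = [
    "Jan 10 11:21:27 Tower2 kernel: md: disk0 read error, sector=37066384",
    "Jan 10 11:21:27 Tower2 kernel: md: recovery thread: multiple disk errors, sector=37066384",
]
per_disk, bad_sectors = summarize_md_errors(sample)
print(per_disk, sorted(bad_sectors))
```

For a real log, the lines of the syslog file inside the diagnostics zip could be passed in instead of `sample`. The sectors in the second set are the ones where the rebuilt data may be wrong.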


Thanks for your answer.

 

 

Some background:

This week I bought two new barebones and moved all the old hardware into them. It seems I handled 1-3 disks a little roughly.

 

 

On the rebuild result:

"OK" means everything's OK. "OK" does not mean there might be some corruption. The wording "OK" for the rebuild result simply puzzles me. I think there could be a better explanation for an average user like me.

 

For disk20, I do have a full backup. I will copy it over once the rebuild is done. This will "repair" the state of the first parity as well. After this rebuild I will replace the first parity disk.

 

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors. This disk was full, so there might be some problems then.

 

Arrggghhhh.

 

1 hour ago, hawihoney said:

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors. This disk was full, so there might be some problems then.

It's still a very small number of errors considering how many sectors there are. If the disk contains mostly video files, for example, there will likely be only one corrupt file, and that would likely mean a small (or large) glitch during playback. Still, these are the situations when it's good to have checksums of all your files (for those not using btrfs), so you can easily find out which file(s) are corrupt.
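
For anyone who wants to roll their own, the checksum idea can be sketched in a few lines of Python. This is just one possible approach and not a specific Unraid plugin: the function names are made up, SHA-256 is an arbitrary choice of hash, and the manifest would have to be stored somewhere other than the disk it describes.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """Return files whose current hash differs from the manifest, or that are missing."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

The manifest would be built while the disk is known to be good, and `changed_files` run against it after a rebuild. Files that were edited on purpose in the meantime show up as changed too, so this fits mostly static data such as media files best.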


Thanks for the checksum reminder. It's a project I've been thinking about for 2-3 years. It's time to start it now.

 

I did not like the solutions I saw when I looked at them in the past. Will search again.

 

4 hours ago, hawihoney said:

For disk20, I do have a full backup...

 

For disk19, according to your answer, I have to live with possibly up to 64 wrong sectors...

 

35 minutes ago, hawihoney said:

Thanks for the checksum reminder. 

Of course backups are more important than checksums.


Sorry, but I'm having a hell of a time interpreting the wording around parity checks and disk rebuilds on a dual parity system. Please have a look at these three images. I took them during and after a correcting parity check:

 

- Img1: During the correcting parity check, parity-disk1 and data-disk3 throw read errors. This kind of error has already been mentioned in this thread.

- Img2: Status during parity check says "0 sync errors corrected".

- Img3: Result after parity check says "0 errors found".

 

I don't get that. On the same "Main" page I see "248 Errors" that are not corrected and lead to no errors found. Really?

 

If I look closely, I see 17 writes to disk17. That was an image I copied during the parity sync. If I ignore these, I see one write access to disk3 (the data-disk with the read errors). This write access is not written to parity-disk2 but is written to parity-disk1.

 

So what is the result?

 

- Trust the end result: Everything is ok.

- Trust the status: 248 read errors were detected and these are left uncorrected.

- 248 read errors were detected but corrected without Unraid's interaction.

- And what might be the scenario to write to parity-disk2 but not to parity-disk1?

 

Am I looking too closely and interpreting too much?

 

I'm pretty sure the WD-EFRX disks are the problem. I started with a few of them, and not that many in one case. It is mentioned that they tolerate only 1-8 disk systems. I bet it's a vibration problem that leads to the read errors. What do you think?

 

 

img1.jpg

 

img2.jpg

img3.jpg


It will retry and sometimes succeed so the errors might have eventually resulted in correct data.

41 minutes ago, hawihoney said:

I see one write access to disk3 (the data-disk with the read errors)

After retries, if it still can't read a data disk, it will get the data from the parity calculation and then write it back to that data disk.
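
What "get the data from the parity calculation" means can be shown with plain XOR parity. This is a toy sketch under stated assumptions, not Unraid's code: it models a single XOR parity only (the second parity of a dual parity array uses a different calculation that is not shown here), and the block contents are invented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of the same sector on three data disks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Parity is the XOR of all data blocks.
parity = xor_blocks([disk1, disk2, disk3])

# If disk3 cannot be read, XOR-ing parity with the remaining data disks
# yields disk3's block, which can then be written back to disk3.
rebuilt_disk3 = xor_blocks([parity, disk1, disk2])
print(rebuilt_disk3 == disk3)

# If parity is also wrong at that sector, the same calculation silently
# produces different bytes: the errors now exceed what the redundancy can fix.
damaged_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
corrupt_disk3 = xor_blocks([damaged_parity, disk1, disk2])
print(corrupt_disk3 == disk3)
```

The second case is the situation from the first post: the calculation still returns something, so the rebuild runs to completion, but the bytes written for that sector are not the original data.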

 

Maybe diagnostics would shed more light on the errors.


I see the errors in syslog but SMART attributes for both disks look OK. I would do an extended SMART test on both.


In this case the read errors were corrected by parity, so all is well, except for parity and disk3. You already knew parity was failing, and disk3 appears to be failing as well.


Thank you both. Can an extended SMART test run while the array is started?

 

1 hour ago, hawihoney said:

Can an extended SMART test run while the array is started?

Yes, but avoid accessing that disk during the test.
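
Outside the web GUI, an extended test is usually started with smartmontools' `smartctl -t long`, and the results are read back later with `smartctl -a`. A small Python sketch, assuming `smartctl` is installed and the script runs with sufficient privileges; the helper names are made up and `/dev/sdX` is a placeholder for the real device.

```python
import subprocess


def extended_test_command(device):
    """Build the smartctl call that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def smart_report_command(device):
    """Build the smartctl call that prints all SMART info, including the self-test log."""
    return ["smartctl", "-a", device]


def run(command):
    """Run a command and return its text output (usually needs root privileges)."""
    return subprocess.run(command, capture_output=True, text=True, check=False).stdout


# Example (not executed here; replace /dev/sdX with the disk to test):
#   print(run(extended_test_command("/dev/sdX")))
#   ... wait for the test to finish, then:
#   print(run(smart_report_command("/dev/sdX")))
```

The test runs inside the drive itself, so the script returns right away and the outcome only shows up in the later report.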
