Parity Swap Procedure with Dual Parity?


tstor


Hello,

 

I need to upgrade my two parity drives with larger disks in order to be able to use larger data drives in the future. The current parity drives shall then replace the smallest data drives. Is the parity swap procedure described here (https://wiki.unraid.net/The_parity_swap_procedure) still supported with Unraid 6.8.2 and a dual parity array, and if so, can I do both swaps at the same time?

It would obviously be faster to just copy the current parity drives to the new ones and fill the remaining area on the new parity disks with the correct bits using the swap procedure than to recreate parity by reading the whole array, but I have some doubts whether this works for dual parity arrays.
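Just to illustrate what I mean by "copy the current parity drives": a minimal sketch of the raw copy step, assuming hypothetical device names (/dev/sdX = old parity, /dev/sdY = new larger parity) and a stopped array:

  # Raw block copy of the old parity drive onto the larger replacement.
  # /dev/sdX and /dev/sdY are placeholders - verify them with lsblk before running.
  dd if=/dev/sdX of=/dev/sdY bs=64M status=progress
  sync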

 

Thanks for any feedback

  • 1 month later...

While I originally wanted to avoid increasing the array size and was therefore interested in the parity swap procedure, I ultimately decided to add another controller and increase the number of drives. So I just put in larger parity drives and recalculated parity. For that I started the array in maintenance mode, so that if a drive failed during the recalculation, I would still have the old parity drives for reconstruction.

 

Now two things are worrying me.

 

1. Parity does not match between old and new drives.

In order to learn more about Unraid, as well as to be sure everything went well, I did a binary compare of some regions of the previous and the replacement parity drive (see the sketch after this list), assuming that the parity calculation should result in the same bytes. This was not the case. Therefore I have the following questions:

  • Is there somewhere a description of the disk layout of the parity drives?
  • I had assumed that the parity drives are just a binary blob, but they seem to be at least partitioned. Correct?
  • Is the assumption correct that, except for the first sectors, the parity bytes should be the same as on the previous drives?
  • If the above is correct, where does the parity data start?
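For reference, one way to do such a binary compare from the command line; a minimal sketch with hypothetical device names and an arbitrary 1 GiB window starting 1 MiB into each disk:

  # Compare a 1 GiB region of the old and new parity drives,
  # skipping the first 1 MiB on both. Devices are placeholders.
  cmp --ignore-initial=$((1024*1024)) --bytes=$((1024*1024*1024)) /dev/sdX /dev/sdY \
    && echo "regions identical" || echo "regions differ"

cmp stays silent and exits 0 when the regions match, and reports the offset of the first differing byte otherwise.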

 

2. Drive Errors

Parity reconstruction took about three days and finished without any warnings. Then I started a parity check, which required another three days and reported zero errors. Only 35 hours later, however, I got notifications for "offline uncorrectable" and "current pending sector" (see screenshot) from a data drive that was not physically touched during the parity upgrade. Seven days later there was yet another warning. During all this time I left the array in maintenance mode, because I wanted to observe how the situation evolved before writing to it again. Then I downloaded the S.M.A.R.T. info for the drive (tower-smart-20200412-2324-1.zip), started an extended test and downloaded the info again (tower-smart-20200412-2324-2.zip). The extended test came back immediately, while I expected it to take some time. After that, I immediately shut the server down, unfortunately without capturing diagnostics first.
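For anyone wanting to reproduce the SMART part from the command line instead of the GUI, roughly the equivalent steps; a minimal sketch with a hypothetical device name /dev/sdX:

  # Dump all SMART information (attributes, error log, self-test log).
  smartctl -x /dev/sdX > smart-before.txt

  # Start an extended (long) self-test; it runs inside the drive in the background.
  smartctl -t long /dev/sdX

  # Later: inspect the self-test log and the attributes again.
  smartctl -l selftest /dev/sdX
  smartctl -A /dev/sdX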

 

[Screenshot: drive status showing the SMART warnings]

 

Questions:

  • I did not find any definitive answer as to what "Current pending sector count" and "Offline scan uncorrectable count" precisely mean. Google hits gave conflicting answers, and the documentation from the T13 Technical Committee just says "Number of unstable sectors (waiting for remapping)" for the former and "Number of uncorrected errors" for the latter. There are 48 pending and 6 uncorrectable sectors, but no reallocated ones, which I don't fully understand. So it's not clear to me whether the pending sectors already signify data loss, but the uncorrected errors definitely do (assuming those sectors contained data, of course). Agree?
  • I assume that I can no longer trust this drive and have to restore its data from parity. Agree?
  • Given the timing, can I trust the new parity drives, or could it be that the sector read errors already happened during the parity recalculation and were just reported a few days later? In other words, is it better to re-install the previous parity drives, or can I continue with the new ones for restoring the drive? I did a parity check before taking out the old parity drives; the result was zero corrections, just as with every previous check.
  • Does it make sense to zero every sector of the old drive once it has been replaced (outside of the array, of course) and see whether the pending sectors disappear without being reallocated? See the sketch after this list. The hypothesis is that the read errors could be the result of something having gone wrong during writing and that there is nothing physically wrong with the drive. If sectors get reallocated, the drive is damaged; otherwise it was a glitch.
  • The drive has a very high number of head parking / loading cycles: 12121 for only 12 start/stop counts. Being an enterprise drive, I wouldn't expect it to aggressively park its heads in order to conserve energy. Is there a feature in Unraid responsible for that, or is it firmware related?
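For the zeroing experiment in the fourth bullet, something like the following is what I have in mind; a minimal sketch, assuming the replaced drive shows up under the hypothetical name /dev/sdX and is no longer part of the array:

  # DESTRUCTIVE: overwrites the entire (already replaced) drive with zeros.
  # /dev/sdX is a placeholder - verify the device with lsblk first.
  dd if=/dev/zero of=/dev/sdX bs=64M status=progress
  sync

  # Afterwards, re-check the relevant SMART attributes:
  # 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector, 198 Offline_Uncorrectable.
  smartctl -A /dev/sdX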

A lot of questions, I know, and I will be grateful to anyone able to provide answers to some of them.

 

tower-smart-20200412-2324-1.zip tower-smart-20200412-2324-2.zip


Pending sectors usually mean bad sectors, i.e. sectors that can't be read. The extended SMART test confirms it's a disk problem, so yes, you need to replace it.

 

Regarding whether parity is valid or not, we'd need the diagnostics, but if there were no errors during the parity sync it can be assumed valid (assuming no other hardware issues that would cause sync errors, like bad or failing RAM). It also depends on whether the parity check that ran afterwards was correcting or non-correcting; in some rare cases a correcting check can corrupt parity if there are read errors on a data disk, but again we'd need the diags to confirm.

17 minutes ago, johnnie.black said:

if there were no errors during the parity sync it can be assumed valid (assuming no other hardware issues that would cause sync errors, like bad or failing RAM), and it also depends on whether the parity check that ran afterwards was correcting or non-correcting

There were no errors during the parity sync. I ran the parity check in non-correcting mode and it terminated with zero errors.

On 4/17/2020 at 1:43 PM, johnnie.black said:

In that case go ahead and replace the failing disk.

Thanks, I will. I'd like to do that in maintenance mode so that I can be sure there are no writes to the array during the rebuild. But I would like to have read access. Is there a way to do that?

 

If not, can I mount individual drives with 

  mount -o ro /dev/sdX1 /x

or does that interfere with the rebuild process?
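In case it matters, the fuller version of what I have in mind; a minimal sketch assuming the hypothetical partition /dev/sdX1 carries an XFS file system (for ext4 the corresponding option would be noload):

  # Mount a single array member strictly read-only.
  # norecovery prevents the journal replay that even a ro mount
  # would otherwise perform if the log happens to be dirty (XFS).
  mkdir -p /x
  mount -o ro,norecovery /dev/sdX1 /x

  # When finished:
  umount /x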

16 hours ago, johnnie.black said:

If it's mounted read-only it won't affect parity, though it will affect rebuild speed if being accessed regularly.

Thanks, the array is now rebuilding the missing drive (disk12). I am also observing the head load count because in my opinion it is excessive. For the busy array drives it currently remains stable, but for the idling unassigned (UA) drives, which are not mounted, it continues to increase (S.M.A.R.T. attributes 192 & 193 increase by about 3 per hour). It is known that WD Green drives aggressively park their heads, but these are HGST data center drives. Looking at the high values in the counters, all drives seem to do this when inactive. Is this normal?
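For reference, how I am watching the counters, plus the knob I would try first; a minimal sketch with a hypothetical device name (whether the HGST firmware honours the APM setting is an assumption on my part):

  # Possible mitigation: raise the APM level so the drive parks its heads
  # less aggressively (254 = maximum performance, 255 = disable APM, if supported).
  hdparm -B 254 /dev/sdX

  # Watch SMART attributes 192 and 193 once per hour (run in its own shell).
  while true; do
    date
    smartctl -A /dev/sdX | grep -E '^(192|193) '
    sleep 3600
  done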

tower-diagnostics-20200421-0920.zip

