Dual Parity Question: Parity Sync Error



I remember reading a while back that with dual parity in unRAID, when a parity check finds a sync error, unRAID would tell me which disk had the error. Is this true?

 

Also, if doing a parity check and a sync error is detected, does unRAID use the dual parity to check if the sync error is due to an error on the parity drive or an error on a data drive?  In other words, when a sync error is found, does unRAID check to see if the sync error is from bad data on the parity drive or a data drive; does unRAID correct the data drive if it is the cause of the sync error?

 

Thanks,

craigr

Edited by craigr
1 hour ago, craigr said:

I remember reading a while back that with dual parity in unRAID, when a parity check finds a sync error, unRAID would tell me which disk had the error. Is this true?

No. The theory was discussed, but it didn't happen.

 

1 hour ago, craigr said:

Also, if doing a parity check and a sync error is detected, does unRAID use the dual parity to check if the sync error is due to an error on the parity drive or an error on a data drive?  In other words, when a sync error is found, does unRAID check to see if the sync error is from bad data on the parity drive or a data drive; does unRAID correct the data drive if it is the cause of the sync error?

No, when a sync error occurs during a check, parity is always assumed to be wrong in the absence of a read error on a data drive.

 

In practice, parity protection only steps in when a drive refuses to return a success on a read request. As long as the drive returns data, unraid assumes it's correct.
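To make the behavior described above concrete, here is a minimal Python sketch of a correcting check on one single-parity stripe. It is an illustration only, not Unraid's actual code: it assumes plain XOR parity, and the function names (`xor_bytes`, `correcting_check`) and byte values are made up for the example.

```python
from functools import reduce


def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def correcting_check(data_blocks, parity_block):
    """Model a correcting parity check for one stripe.

    Every data block here was returned by a 'successful' read, so it is
    trusted. On a mismatch the parity is the thing that gets rewritten.
    Returns (sync_error_found, parity_to_store).
    """
    expected = xor_bytes(data_blocks)
    if expected == parity_block:
        return False, parity_block
    return True, expected


# Hypothetical stripe: three data disks plus parity that was correct when written.
good = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_bytes(good)

# Disk 1 later returns altered bytes without reporting a read error.
as_read = [good[0], b"\x11\x20", good[2]]
sync_error, new_parity = correcting_check(as_read, parity)

# Rebuilding disk 1 from the rewritten parity reproduces the altered bytes,
# not the original ones.
rebuilt = xor_bytes([as_read[0], as_read[2], new_parity])
```

In this model the check cannot tell a stale parity block from a silently altered data block; it only sees that the XOR does not match, and it resolves the mismatch in favor of the data.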

2 hours ago, jonathanm said:

No. The theory was discussed, but it didn't happen.

 

No, when a sync error occurs during a check, parity is always assumed to be wrong in the absence of a read error on a data drive.

 

In practice, parity protection only steps in when a drive refuses to return a success on a read request. As long as the drive returns data, unraid assumes it's correct.

Thanks for the thorough answer.

 

I really wish unRAID would use the dual parity to determine whether the parity or the data on a disk is wrong. I recently had parity “corrected” for a drive that had pending sectors, resulting in two sync errors... oh well.

 

I still don’t know if the parity was wrong or not, but I suspect parity was correct and the data drive had bad reads.  I was not expecting any sync errors on that check and it was the first time I have ever gotten sync errors when I hadn’t expected them for one reason or another.

 

I pulled the drive out of the array, ran three preclears on it, and then a long SMART test. After the first preclear the pending sectors were not reallocated, so I still hope their data was good.

 

Best,

craigr

8 hours ago, craigr said:

 I recently had parity “corrected” for a drive that had pending sectors resulting in two sync errors... oh well.

I understand and share your concern. Problem is, unless the drive gives up and tells unraid it can't retrieve the data, unraid assumes the drive is giving back good data. Theoretically a drive either fails the read, or returns the data that was stored there successfully. The concept of a successful read returning corrupt data just isn't considered.

 

If that were to happen, how does unraid know for certain WHICH successful read was corrupt? What criteria does it use to initiate the "I no longer trust any drive in the array" procedure?

9 hours ago, craigr said:

really wish unRAID would use the dual parity to verify if the parity or if the data on a disk is wrong.  I recently had parity “corrected” for a drive that had pending sectors resulting in two sync errors... oh well.

 

You are on the horns of the bull in cases like this. What I would do at this point is to get that drive out of my array and rebuild it on a new drive. (I know that there will be people who will disagree with me. But I consider my data to be more important than the cost of a new drive! Philosophical questions are best left to those without a stake in the outcome.) It could be that there is a file (or, perhaps, two) on that disk that might be bad. However, that is now in the past. There is nothing that can be done about it. Now, let's look forward. Let's not put more files at risk with a drive that is flaky! Remember, if you have another drive go completely bad, you might not be able to rebuild that one if this one decides to act up during the rebuild!

 

Remember that if you really want to see if this drive truly has issues, you can then do it at your leisure using the manufacturer's utility to test the drive with multiple passes of the 'long test'. If the drive is still in warranty, you could simply return it for replacement. (I have never had a manufacturer question me about a drive return, and the replacement drive was always shipped on the same (or next) day as the old one was returned. So it was not being tested to verify my claim. They are not going to hassle you, as most people don't go through the hassle of replacing a drive on a whim just to get a new drive. It simply is not worth it from a public relations standpoint.)

1 hour ago, jonathanm said:

I understand and share your concern. Problem is, unless the drive gives up and tells unraid it can't retrieve the data, unraid assumes the drive is giving back good data. Theoretically a drive either fails the read, or returns the data that was stored there successfully. The concept of a successful read returning corrupt data just isn't considered.

 

If that were to happen, how does unraid know for certain WHICH successful read was corrupt? What criteria does it use to initiate the "I no longer trust any drive in the array" procedure?

Well, my thought with dual parity is that unRAID could check the data against both parity drives. If both parity drives agree, but the data drive does not, then fix the data drive. However, in many situations I suppose that would corrupt a good piece of data. Sigh.
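For what it's worth, here is a rough Python sketch of how that idea works on paper. It assumes a RAID-6-style P+Q scheme over GF(2^8) (polynomial 0x11d, generator 2); whether Unraid's second parity is laid out this way internally is an assumption on my part, and as stated earlier in the thread Unraid does not do this during a check. All names (`compute_pq`, `locate_single_error`) are made up for the example, and each "disk" is reduced to a single byte.

```python
# GF(2^8) log/antilog tables for polynomial 0x11d with generator 2.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def compute_pq(data):
    """data: one byte (int) per data disk. Returns the (P, Q) parity bytes."""
    p = q = 0
    for index, byte in enumerate(data):
        p ^= byte                      # P is the plain XOR of all data bytes
        q ^= gf_mul(EXP[index], byte)  # Q weights disk i by g**i
    return p, q


def locate_single_error(data, p, q):
    """Guess which single device is wrong, assuming at most one is.

    Returns "ok", "P", "Q", a data disk index, or "unknown".
    """
    calc_p, calc_q = compute_pq(data)
    sp, sq = calc_p ^ p, calc_q ^ q    # the two syndromes
    if sp == 0 and sq == 0:
        return "ok"
    if sq == 0:
        return "P"                     # only P disagrees: P itself is off
    if sp == 0:
        return "Q"                     # only Q disagrees: Q itself is off
    index = (LOG[sq] - LOG[sp]) % 255  # sq / sp == g**index for a single bad disk
    return index if index < len(data) else "unknown"


# Hypothetical stripe with four data disks.
stripe = [0x12, 0x34, 0x56, 0x78]
p, q = compute_pq(stripe)
damaged = list(stripe)
damaged[2] ^= 0x40                     # disk 2 silently returns a wrong byte
suspect = locate_single_error(damaged, p, q)
```

The catch is the "at most one device is wrong" assumption. If two devices are off in the same stripe, the same arithmetic can point at an innocent disk, and "fixing" it would damage good data, which is the risk raised above.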

 

Thanks again,

craigr

33 minutes ago, Frank1940 said:

You are on the horns of the bull in cases like this. What I would do at this point is to get that drive out of my array and rebuild it on a new drive. (I know that there will be people who will disagree with me. But I consider my data to be more important than the cost of a new drive! Philosophical questions are best left to those without a stake in the outcome.) It could be that there is a file (or, perhaps, two) on that disk that might be bad. However, that is now in the past. There is nothing that can be done about it. Now, let's look forward. Let's not put more files at risk with a drive that is flaky! Remember, if you have another drive go completely bad, you might not be able to rebuild that one if this one decides to act up during the rebuild!

 

Remember that if you really want to see if this drive truly has issues, you can then do it at your leisure using the manufacturer's utility to test the drive with multiple passes of the 'long test'. If the drive is still in warranty, you could simply return it for replacement. (I have never had a manufacturer question me about a drive return, and the replacement drive was always shipped on the same (or next) day as the old one was returned. So it was not being tested to verify my claim. They are not going to hassle you, as most people don't go through the hassle of replacing a drive on a whim just to get a new drive. It simply is not worth it from a public relations standpoint.)

I did pull the drive and replace it.

 

With the drive pulled, I ran three preclear cycles, the first of which allowed the pending sectors to be written and returned to service.  After that I ran an extended SMART test which it passed.

 

The extended SMART test is the same thing as in the WD Lifeguard diagnostics utility.

 

The drive is out of warranty, but I doubt WD would have done anything anyway because there were no reallocated sectors.  All SMART data looks 100% perfect.
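For anyone who wants to keep an eye on the same counters, here is a minimal Python sketch that pulls the reallocated and pending sector counts out of the attribute table printed by `smartctl -A`. The sample text and its values are made up for illustration (they are not from my drive), and the function name `parse_watched_attributes` is hypothetical; it assumes the usual ten-column table with the raw value in the last column.

```python
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_watched_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            counts[fields[1]] = int(fields[9])
    return counts


# Made-up sample in the layout smartctl prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

counts = parse_watched_attributes(SAMPLE)
needs_attention = [name for name, raw in counts.items() if raw > 0]
```

A nonzero pending count with a zero reallocated count is the state my drive was in before the first preclear pass.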

 

After the testing I returned the drive to the array.  It’s been several months and the drive has performed flawlessly since...

 

I know now it’s more prone to failure, but the risk is pretty low at this point and the data on the drive is not worth more to me than the cost of a replacement drive ;-)

 

Thanks for your help,

craigr


I agree completely with the action that you took. You basically did what I suggested. I have not had reason to use preclear for quite some time, but from the 'noise' which seems to exist on the support thread, it seems to have some problems with the recent releases of Unraid. It is good to hear that preclear is working for you with your setup.

 

EDIT: I think the problem with Pending Sectors is that different HD manufacturers handle the error differently. (Apparently, it is not necessarily that the data could not be read, but that it was read with difficulty and required extended error-correcting algorithms.) Some seem to reallocate the sectors automatically on the next write, and others seem to write first to the sectors in question and then 'test' to see that they can be read. If a sector can be read, it is not reallocated.

Edited by Frank1940
36 minutes ago, Frank1940 said:

I agree completely with the action that you took. You basically did what I suggested. I have not had reason to use preclear for quite some time, but from the 'noise' which seems to exist on the support thread, it seems to have some problems with the recent releases of Unraid. It is good to hear that preclear is working for you with your setup.

 

EDIT: I think the problem with Pending Sectors is that different HD manufacturers handle the error differently. (Apparently, it is not necessarily that the data could not be read, but that it was read with difficulty and required extended error-correcting algorithms.) Some seem to reallocate the sectors automatically on the next write, and others seem to write first to the sectors in question and then 'test' to see that they can be read. If a sector can be read, it is not reallocated.

Great :)

 

Yea, there are lots of folks it seems that have issues with preclear.  I have used it a lot over the past year as I expanded my server and have never had any issue using it.  I seem to recall at one point having it installed was causing some error messages in the log so I uninstalled it and only reinstalled when I wanted to use it.  Preclear has been installed on my server since this incident, and has not caused any more error messages in my log.

 

Yes, I believe WD attempts to read back the data and, as long as the drive thinks it got it right, it does not trigger a read error but adds the degraded sectors to pending reallocation. Then, if the sectors are written to correctly, they go back into circulation and are considered healthy: no longer pending and not reallocated.

 

It's actually totally possible that the sectors were read correctly and that I truly had two parity sync errors. I like to believe this because all of my data is still perfect in this scenario ;)

 

Each preclear pass reads the sectors twice, so my three preclears would have written zeros to those sectors three times and read them six times. I figured six good reads and a good SMART test were proof enough the drive is OK. That said, this drive will always remain suspect and closely monitored. It easily could have been a fluke, though. I think my 18 AWG wires may have had some voltage drop with the length and load on them. The sectors could have gotten hit with a particle, there may have been a large vibration while they were written (sometimes semi trucks go up our street and the speed bump is right in front of our house), or who knows what else.

 

Anyway, thanks again,

craigr

23 minutes ago, craigr said:

It's actually totally possible that the sectors were read correctly and that I truly had two parity sync errors. I like to believe this because all of my data is still perfect in this scenario ;)

What you surmise I suspect is true. I have been following these boards for several years now, and there have been a lot of folks who have rebuilt parity after finding a parity sync error. I cannot recall a single instance in which anyone has found an error in a file after doing so. Of course, most of the time there was nothing like a smoking gun such as a disk with pending sectors. Normally, everything else looks fine. If one were to assume that the Parity disk(s) are right, which disk has the problem???? Plus, even if the parity was correct and one of the data disks does have errors, it is not necessarily true that the data on that disk has been compromised! It could be in an area of the disk where no data is stored!

Edited by Frank1940
14 minutes ago, Frank1940 said:

...If one were to assume that the Parity disk(s) are right, which disk has the problem???? Plus, even if the parity was correct and one of the data disks does have errors, it is not necessarily true that the data on that disk has been compromised! It could be in an area of the disk where no data is stored!

Great points and I hadn’t thought about that!  The disks are only 50% full so that gives me a 50/50 shot even if the drive data was bad.

 

I feel even better 🤗

 

craigr
