Non-correcting parity check finished with - sync errors and disk read errors


shEiD


First and foremost - I have been running unRAID for years - but never used parity, so I apologize for having no experience when it comes to parity stuff.

 

I have added parity on my secondary ("testing") unRAID machine some time ago. I wanted to try it out, before using it on my main unRAID machine.

Yesterday I noticed, that one of the disks reported 32 read errors. So today I decided to run a parity check.

  • As per instructions in the wiki, I ran non-correcting parity check.
  • It finished with 157 sync errors.
  • At the same time, the disk with 32 errors, now had 416 read errors.

 

So, basically - what do I do now?

  1. Because I ran a non-correcting check, does this mean that I may be able to avoid data loss? I mean, logically - if there's clearly one data drive that produces errors - the parity may/should be OK/correct? Am I right?
  2. Is there some way, at this point, to "find out" exactly if parity is fine, and the errors are just because of that failing data disk?
  3. What are my options? I mean, in my extremely limited experience with having parity - is there any difference in "data safety" or "speed" between these two options:
    a) Use rsync to try and copy all the data off the failing drive to a "new" drive mounted via Unassigned Devices plugin. Then, replace the failing drive and rebuild the parity anew.
    b) Replace the failing disk and rebuild from parity.
  4. Or maybe I should do both, just for fun? 😉 Try and copy the files off the failing drive, then rebuild using another drive. Afterwards, I could compare the files on those two drives and see if there are any differences? That would be interesting.
  5. Would having all the files hashed with something like [Dynamix File Integrity plugin](https://forums.unraid.net/topic/43290-dynamix-file-integrity-plugin/) be useful in a situation like this? Alas, I have not...
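
The comparison idea in point 4 could be sketched roughly like this in Python. This is only a minimal sketch, assuming both copies are reachable as ordinary mounted directories; the function names and the example paths in the note below are made up for illustration and are not part of unRAID or any plugin.

```python
import filecmp
from pathlib import Path


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def compare_trees(copied_root, rebuilt_root):
    """Compare two directory trees by relative path and file content.

    Returns (differing, only_in_copied, only_in_rebuilt) as sorted lists
    of relative paths.
    """
    copied, rebuilt = list_files(copied_root), list_files(rebuilt_root)
    differing = sorted(
        str(rel)
        for rel in copied & rebuilt
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting size and modification time.
        if not filecmp.cmp(
            Path(copied_root) / rel, Path(rebuilt_root) / rel, shallow=False
        )
    )
    only_copied = sorted(str(rel) for rel in copied - rebuilt)
    only_rebuilt = sorted(str(rel) for rel in rebuilt - copied)
    return differing, only_copied, only_rebuilt
```

It would be called as something like `compare_trees("/mnt/disks/copy", "/mnt/disk5")` (both paths hypothetical). A file listed as differing only means the two copies disagree; on its own it does not say which copy is the good one.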

 

Here are the diagnostics:

ungiga-diagnostics-20210115-2002.zip

 

If you need any more info, please let me know.
Thanks in advance for any help.

 

Starting the parity sync

After the parity sync has finished.

Link to comment

Added the 1 and 200 SMART attributes to all drives.
Ran the short SMART test - completed without error

Now running the long SMART test... it'll take some time, I gather.

Can I navigate away from the SMART page? I mean, will I lose the progress info if I close the browser tab and come back to this SMART info page later on, while the test is still running?

Link to comment

Sorry, I forgot to write up my system. There are no cables to replace, tbh. I have all my array drives in 16-bay SAS disk shelves. Inside the shelves, the backplanes are connected to the expanders via 4x SFF-8087 cables. And the shelves themselves are connected to the server's HBA via SFF-8088 cables.

Another detail I forgot to mention, is that I actually swapped one of the shelves between my primary and secondary servers just last Sunday (6 days ago). So I am pretty sure that the cables and/or drive slot is not the culprit, as I had no problems with this shelf and a drive slot, when it was connected to my primary server.

Finally, and most importantly - I am confident that the problem is with the drive, because this is the 3rd time it has read errors.
1st time was on Mar 21, 2020.
2nd time was last week.

And now is the 3rd time.

 

I mean, I'm pretty confident. I'm no expert by any means. If I'm mistaken or forgetting something - please, let me know.

I keep extensive notes/logs on what happens with my server. So it helps to remember, in situations like this 😉

 

I am retiring this drive - that's for sure. The question is what's the best way to do this and why (the questions in my 1st post)?

Link to comment

Switched to another slot and SMART tests again - no errors.

 

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     38078         -
# 2  Short offline       Completed without error       00%     38070         -
# 3  Extended offline    Completed without error       00%     38027         -
# 4  Short offline       Completed without error       00%     38020         -

 

Now I'm really confused. This drive had read errors in 2 separate enclosures and slots. But now SMART tests are good? What does that mean?
So is it safe to keep using it, or should I just retire it?
Do I run non-correcting parity check again?

Link to comment

So I ran a non-correcting parity check again - the same number of 157 errors. 🤔

2021-01-19_02-25-01__chrome.png.bf6f0ac8e879c39935447f3272dc1175.png

 

No disk read errors on this 2nd run. 👍

2021-01-19_03-17-32__chrome.thumb.png.c18076221d551a0c4d19404bf2aaae10.png

 

But this time - TWO drives had their SMART Raw read error rate increased by 1 (15 to 16 and 2 to 3). 😠

2021-01-19_02-23-29__chrome.png.ad197505a36b450291ad2e2528a68b79.png

 

Questions:

  1. So what do these 157 parity errors mean, actually?
  2. Actually, I am really confused - is the data wrong on the disk(s) that had those read/SMART errors, or is the parity incorrect?
  3. What do I do now? What can I do to make sure the files are recovered and correct? Or am I screwed, because I do not have file hashes?
  4. How do I find out, exactly which files are affected by these errors? Can I?

P.S.: I do not care about the drives - BOTH of them were known to me as "showing signs for retirement". This is my secondary/testing unRAID server.
This post and all of this ordeal is to learn how unRAID works before I add parity to my main unRAID server. 

Edited by shEiD
Link to comment
50 minutes ago, shEiD said:

So what do these 157 parity errors mean, actually?

It means parity is out of sync, you need to run a correcting check.

 

51 minutes ago, shEiD said:

because I do not have file hashes?

 

51 minutes ago, shEiD said:

How do I find out, exactly which files are affected by these errors?

Without checksums (or btrfs) you can only correct parity.

Link to comment
5 hours ago, jonathanm said:

The issues you are having are a good example of why you shouldn't use questionable drives in Unraid.

I do not use questionable drives anymore - all of the drives in both of my servers are only WD Reds. But that's beside the point. Every single drive is gonna become questionable with time... hopefully... if it's not gonna straight up die with no warning 😄 That is exactly the situation here - these are older WD Reds, and it seems a couple of them started to show their age and need to be retired... I am actually happy, this happened on this testing server, and not on my main one.

 

I am actually trying to learn how unRAID and its parity work, and how to deal with this exact situation when it inevitably happens again and again - with all the drives, when they get old.

 

8 hours ago, JorgeB said:

Without checksums (or btrfs) you can only correct parity.

So I have no way of knowing if any files on the array are corrupt, let alone which ones?

And all I can do now is just run a parity check with the "write corrections to parity" checkbox selected? And that is gonna update the parity, as in unRAID assumes that the array is OK and that parity needs updating, not that any of the files are borked and need fixing?

So basically - in this situation, I can be 100% sure that some of the files on my array are gonna stay corrupt, and parity does not help me here..?

 

Is this a normal unraid behavior, or did I do something wrong and missed a step somewhere, which led me here?

What should I do to avoid this problem in the future, as best I can? I mean, I feel like drives go bad and/or die all the time...

What is the optimal/safest way to use unraid?

Link to comment
39 minutes ago, shEiD said:

So I have no way of knowing if any files on the array are corrupt, let alone which ones?

Yep, I recommend always having checksums, I use btrfs and still create manual checksums with corz, for some situations they can still be useful.
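
To illustrate what having checksums buys you, here is a minimal Python sketch of the general idea: record a hash per file while the data is known to be good, then re-hash later and report anything that changed or disappeared. It is not how corz or the Dynamix File Integrity plugin work internally; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return (changed, missing) relative paths versus a saved manifest."""
    root = Path(root)
    changed, missing = [], []
    for rel, expected in sorted(manifest.items()):
        target = root / rel
        if not target.is_file():
            missing.append(rel)
        elif sha256_of(target) != expected:
            changed.append(rel)
    return changed, missing
```

With a manifest saved before the trouble started, `verify_manifest` would name the specific files whose content no longer matches, which is exactly the information a parity check alone cannot give.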

 

40 minutes ago, shEiD said:

And that is gonna update the parity, as in unRAID assumes that the array is OK and that parity needs updating, not that any of the files are borked and need fixing?

Yep.

 

41 minutes ago, shEiD said:

So basically - in this situation, I can be 100% sure that some of the files on my array are gonna stay corrupt, and parity does not help me here..?

No, the problem can really be on parity, again no way of knowing without checksums.

 

41 minutes ago, shEiD said:

What should I do to avoid this problem in the future, as best I can?

See first reply.

 

Link to comment
1 hour ago, shEiD said:

Every single drive is gonna become questionable with time... hopefully... if it's not gonna straight up die with no warning

Yep. That's why you must never let a drive with signs of failure stay, it must be replaced ASAP. As soon as you get more drives failing than you have parity drives, you stand a chance of losing data.

 

Like you said, better to learn now than when something really valuable is at stake.

Link to comment
4 hours ago, shEiD said:

trying to learn how unRAID and its parity work

Parity is a common concept in computers and communications. It is basically the same idea wherever it is used. Parity is just an extra bit that allows a missing bit to be calculated from all the other bits.

 

And, if that extra bit doesn't match the calculated value from all the other bits, there is no way to tell which of these bits may have changed.
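
Both points can be shown with a few bytes in Python. This is a toy sketch assuming plain XOR parity over three tiny "disks"; the byte values are arbitrary and the helper name `xor_parity` is made up for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny "data disks" and the parity computed from them.
data_disks = [b"\x0f\xa0", b"\x33\x05", b"\xc4\x5a"]
parity = xor_parity(data_disks)

# A missing disk can be calculated from parity plus all surviving disks.
rebuilt = xor_parity([parity, data_disks[0], data_disks[2]])
assert rebuilt == data_disks[1]

# Case A: one bit flips on a data disk. Case B: one bit flips on parity.
corrupt_data = [data_disks[0], b"\x32\x05", data_disks[2]]
corrupt_parity = bytes([parity[0] ^ 0x01, parity[1]])

# XOR of everything is all zeros when in sync; any set bit is a mismatch.
mismatch_a = xor_parity(corrupt_data + [parity])
mismatch_b = xor_parity(data_disks + [corrupt_parity])

# Both cases produce the identical mismatch, so the check alone cannot
# tell whether a data disk or the parity disk holds the wrong bit.
assert mismatch_a == mismatch_b == b"\x01\x00"
```

The last assertion is the whole problem in miniature: a sync error says "these bits disagree" and nothing about which disk is at fault.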

 

Understanding parity can help make sense of how Unraid works to manage and protect your storage, and how you work with Unraid.

 

Here is the Unraid wiki on parity:

 

https://wiki.unraid.net/UnRAID_6/Overview#Parity-Protected_Array

Link to comment
8 hours ago, JorgeB said:

No, the problem can really be on parity, again no way of knowing without checksums.

I assume that parity is OK, and the problem is borked files on the array, because the read errors were on the array drives...

Or rather maybe I should ask - so there are situations, where parity goes out of sync for some reason, even without disk errors on the parity drives themselves? Is there some place with detailed documentation about this? Because the wiki has pretty much nothing useful on this topic.

 

Quote

If you wish to run purely a check without writing correction, uncheck the checkbox that says Write corrections to parity before starting the check. In this mode, parity errors will be notated but not actually fixed during the check operation.

If I run the write-corrections-to-parity check after the read-only check (when parity errors were notated) - can unRAID use those "error notes" and be able to go and fix only those errors? Or does it do a full check of the whole array again?

I have a 200TB+ array, and shudder to think how long the parity creation/check is gonna take. 🤔 I'm hoping that [PLUGIN] PARITY CHECK TUNING is gonna help with checks, at least.

 

9 hours ago, JorgeB said:

Yep, I recommend always having checksums, I use btrfs and still create manual checksums with corz, for some situations they can still be useful.

Can I ask, why did you choose BTRFS on array disks with additional checksums using corz, instead of XFS and DYNAMIX FILE INTEGRITY PLUGIN?
Also, do you use this linux version of corz?

 

Link to comment
3 hours ago, shEiD said:

I assume that parity is OK, and the problem is borked files on the array, because the read errors were on the array drives...

Yes, but the sync errors started before the read errors, so not necessarily a direct connection, though you can never say conclusively one way or the other.

 

3 hours ago, shEiD said:

Or rather maybe I should ask - so there are situations, where parity goes out of sync for some reason,

#1 is after an unclean shutdown, #2 is due to bad hardware, most commonly bad RAM.

 

3 hours ago, shEiD said:

Can I ask, why did you choose BTRFS on array disks with additional checksums using corz, instead of XFS and DYNAMIX FILE INTEGRITY PLUGIN?

Not practical for me, because every couple of weeks or so I move data from my main servers to cold storage servers, which are always off except while receiving the data. If I moved the data and then waited for the checksums to be generated, it would take twice as long. I also need the snapshot feature of btrfs, since it's what I use to send data to my backup servers.

 

3 hours ago, shEiD said:

Also, do you use this linux version of corz?

I use the Windows version. I create the checksums by folder before sending them to the cold storage servers (but not for everything, only for movies and TV shows). This makes it possible, first, to confirm the data was sent correctly, for example if I ever suspect a network problem; btrfs couldn't help with that, since the data would arrive already corrupt. It's also good if I just need to check a single movie or TV show: with btrfs I can only do a full disk check, and a TV season can be split over various disks.

Link to comment

I just finished running Parity Check with `Write corrections to parity` and updated the parity. The log shows a list of corrections, like these:

 

Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701888
Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701896
Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701904
Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701912
Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701920

 

I guess, that means that parity has been updated.
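
For anyone wanting to summarize such a log, here is a small Python sketch that pulls the sector numbers out of lines shaped like the ones above and merges neighbouring sectors into runs. The regular expression is based only on the five sample lines shown here, and the step of 8 between consecutive sectors is an assumption taken from that same sample; the function names are made up for the example.

```python
import re

# Pattern derived from the sample syslog lines in this post.
CORRECTED = re.compile(r"md: recovery thread: P corrected, sector=(\d+)")


def corrected_sectors(log_lines):
    """Extract sector numbers from 'P corrected' syslog lines."""
    return [int(m.group(1)) for line in log_lines if (m := CORRECTED.search(line))]


def group_runs(sectors, step=8):
    """Merge sectors spaced exactly `step` apart into (first, last, count) runs."""
    runs = []
    for sector in sorted(sectors):
        if runs and sector - runs[-1][1] == step:
            first, _, count = runs[-1]
            runs[-1] = (first, sector, count + 1)
        else:
            runs.append((sector, sector, 1))
    return runs


sample = [
    "Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701888",
    "Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701896",
    "Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701904",
    "Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701912",
    "Jan 21 13:58:50 unGiga kernel: md: recovery thread: P corrected, sector=5455701920",
]
print(group_runs(corrected_sectors(sample)))
```

The five sample lines collapse into a single contiguous run. Provided the log actually records every correction, the total number of matching lines can be compared with the error count shown in the UI.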

What is kind of confusing, is that the results in the UI and the log look exactly the same, as if I had run a read-only check...

 

2021-01-21_15-11-58__chrome.png.bb01815266757e4809ab9fd6a4bc0fb2.png

 

2021-01-21_15-12-19__chrome.png.16741404c0a74985aa20a97cc064a44d.png

 

Should I run a read-only check again? To make sure that I get the desired 0 errors result?

Edited by shEiD
Link to comment
1 hour ago, shEiD said:

Should I run a read-only check again? To make sure that I get the desired 0 errors result?

Yes. Until you have confirmed you have exactly zero sync errors you aren't finished. If you don't get zero after you already corrected parity then that would indicate some other problem.
