Parity Check Corruption: Parity Disk or Data Disk?



True, maybe the previously acknowledged SMART value for the "reallocated sector count" field got lost, so the known good value now shows up as a new warning.

 

Thanks for your feedback. Knowing that HDDs have checksums as well makes it a lot easier to handle.

 

For which SMART fields do we get notification emails when a value changes? I cannot test SMART data changes, so I would like to know where to look up which fields are configured to raise notification emails.

Link to comment

Well, once you know that SMART already covers bit errors on each disk, you don't need integrity checks anymore. And parity checks on top of SMART are already a second data validation tool.

 

If 2 validations are not enough, you could use dual parity to have 3 validations of your data.

 

Once the data has been transferred intact to your Unraid system, parity plus SMART notifications will be enough.

 

I enabled all SMART notifications and added attributes 10, 184, 196, and 200. I don't use attribute 1 because I would need to exclude Seagate disks from it.
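Since SMART changes are hard to trigger on purpose, one way to test the idea offline is to compare two saved attribute tables. The following is only a minimal sketch: it assumes text in the column layout that `smartctl -A` prints, the watch list (5 plus the IDs mentioned above) is a hypothetical choice, and it says nothing about which fields Unraid itself is configured to notify on.

```python
# Hypothetical watch list: attribute 5 plus the IDs mentioned in the post above.
WATCHED_IDS = {5, 10, 184, 196, 200}


def parse_smart_attributes(report: str) -> dict:
    """Map attribute ID -> (name, raw value) from `smartctl -A` style text."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped here.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attributes[int(fields[0])] = (fields[1], int(raw))
    return attributes


def changed_attributes(old: dict, new: dict, watched=WATCHED_IDS) -> list:
    """Return the watched IDs whose raw value differs between two snapshots."""
    return sorted(
        attr_id
        for attr_id in watched
        if attr_id in old and attr_id in new and old[attr_id][1] != new[attr_id][1]
    )


# Made-up sample data, not taken from a real drive.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
"""
# Same table, but the last raw value (attribute 196) changes from 0 to 3.
SAMPLE_AFTER = SAMPLE_BEFORE[:-2] + "3\n"

if __name__ == "__main__":
    before = parse_smart_attributes(SAMPLE_BEFORE)
    after = parse_smart_attributes(SAMPLE_AFTER)
    print(changed_attributes(before, after))
```

Feeding it two real snapshots taken days apart would show which watched raw values moved in between, without having to wait for a notification email.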

 

20 minutes ago, Jaybau said:

Here is someone that just posted on Reddit (parity error how to identify affected disk).  Apparently others have my concerns, and I hope my feedback is not ignored.

 

 

SMART data should be used to work out which disks have no errors, but only after checking overall system health. That is always the first step on a system without drive failures.

Edited by Falcosc
Link to comment

"Unraid team did discuss executing a correction run automatically after each unsafe shutdown. Because it is so common to have these issues after unclean shutdown. But in the end it was decided against it."

 

How often do these data/parity problems happen?

If SMART errors start showing up, does parity stop too?

If there's an unclean shutdown, does Unraid know it's unclean?

If there's errors, does unraid stop parity until resolved?

How many of the problems could have been prevented by using ECC memory and UPS?

 

Link to comment
2 minutes ago, Jaybau said:

If there's an unclean shutdown, does Unraid know it's unclean?

 

When the array is successfully shut down/stopped, a flag is set in a file on the flash drive.  (When the array is stopped, the Linux write cache is flushed as part of the shutdown procedure.)   When the server is restarted, the flag is checked to see if the array was in the stopped condition.  If not, Unraid considers the shutdown unclean.  However, an unclean shutdown does not mean that a parity check will find an error.  That usually depends on whether the write cache had been flushed first.
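The flag logic described above can be pictured roughly like this. It is a minimal sketch of the general "clean stop marker" pattern only: the file name and functions are hypothetical and are not taken from Unraid's actual code.

```python
import tempfile
from pathlib import Path

# Hypothetical flag file name; the real file and format are not shown in this thread.
FLAG_NAME = "array_stopped.flag"


def mark_clean_stop(flash_dir: Path) -> None:
    """Run at the end of a successful array stop, after caches are flushed."""
    (flash_dir / FLAG_NAME).write_text("stopped\n")


def was_last_shutdown_clean(flash_dir: Path) -> bool:
    """Run at boot: read the flag, then clear it so a later crash is detected."""
    flag = flash_dir / FLAG_NAME
    clean = flag.exists()
    flag.unlink(missing_ok=True)
    return clean


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flash = Path(tmp)
        mark_clean_stop(flash)
        print(was_last_shutdown_clean(flash))  # array was stopped before this boot
        print(was_last_shutdown_clean(flash))  # no stop happened since the last boot
```

Note that the marker only records whether the stop procedure finished. It says nothing about whether parity is actually out of sync, which matches the point above that an unclean shutdown does not always produce parity errors.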

 

10 minutes ago, Jaybau said:

If SMART errors start showing up, does parity stop too?

No, unless there is a failure that parity cannot correct!   That is why it is important to set up Notifications.

 

12 minutes ago, Jaybau said:

If there's errors, does unraid stop parity until resolved?

 

Any parity operation will continue to completion unless there is a complete failure of a physical device.

======================

 

The remainder of your questions have no definite answers.  Some people can fix any parity issues without assistance.  Others will need help, and those are the ones that you see here on the board.  (You could get a count of them over a period of time if you wanted to.)  However, none of us volunteers have any indication of how many registered Unraid licenses there are.  Without that knowledge, how can one say what percentage that is over a given time period?

 

I have had several array issues over the ten plus years that I have been using it.  I have never lost a single file in that time!  But I do keep a careful watch on my servers and address any issue as soon as it crops up...

 

I have had unclean shutdowns, and I would say that in more than 90% of them, a parity check on startup did not find a problem.  (But that could well be because my write pattern to that array is in batches rather than a continuous stream of writes.)

 

Hope this helps...

Link to comment
  • 1 year later...
On 5/29/2022 at 1:02 PM, Jaybau said:

Not sure if there's a better way of doing this, or something built-in and automated for a new user experience (lots of room for ignorance).  The last thing I want to deal with is corruption, not knowing what is corrupted, why/how, and trying to recover.  

 

I like btrfs scrub/validation feature, but I've read the array (data-parity process) doesn't recover well from an unexpected ungraceful shutdown.  Even with a UPS, it still makes me nervous.

 

Perhaps XFS + Dynamix File Integrity (DFI) is the more robust/safest solution.  XFS for the unexpected shutdown recovery + DFI that basically does the same thing as btrfs scrub.  I'm using BLAKE3 and hope it is as good or better than btrfs metadata/scrubs.
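To make the checksum idea concrete, here is a minimal sketch of a per-file hash manifest that can be built once and verified later. It is not how the DFI plugin is implemented. It uses `hashlib.blake2b` purely as a stand-in because BLAKE3 is not in Python's standard library, and the function names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    hasher = hashlib.blake2b()  # stand-in only; BLAKE3 is not in the standard library
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(root: Path) -> dict:
    """Record a digest for every regular file under root."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose content no longer matches."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_digest(target) != expected:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        disk = Path(tmp)
        (disk / "movie.bin").write_bytes(b"\x00" * 4096)
        (disk / "notes.txt").write_text("hello")
        manifest = build_manifest(disk)
        # Simulate silent corruption of a single byte.
        (disk / "movie.bin").write_bytes(b"\x00" * 4095 + b"\x01")
        print(verify_manifest(disk, manifest))
```

The useful property is that a failed verification names a file on a specific disk, which is exactly the information a parity mismatch alone does not give you.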

 

Perhaps something in the future:

  1. Built-in file integrity check (for XFS too), and perhaps the default for new users.  If a parity error happens, knowing if your parity versus your data is corrupt seems absolutely essential.  Without knowing this information, it could leave somebody with a lot to try and figure out.
  2. Scheduled hardware checks (e.g. memory test), SMART drive tests, file integrity checks, parity checks.  I would like to catch problems as early (and conveniently) as possible. 
  3. Best practices built-in and streamlined.  There's already some of these, but it's not intuitive, and I still don't have it nicely figured out with best practices.

Thank you.

I am running into the same issue right now and have no clue how to figure this out. I swapped out my RAM, my HBA, and both of my SAS cables.

I am using your settings (BLAKE3), and building the checksums on 24 disks will take about 25 hours. Once you have run the parity check and gathered which sectors are bad, how do you check them with DFI?  The only things I can do now are locate the files and erase them, completely format the drive, or replace it. Everything checks out fine on SMART, and even the Scrutiny SMART data shows nothing wrong with the hard drives. However, the crap Unraid parity check can't tell me anything: I have run 2 parity checks and am still getting roughly 5-50 errors, with auto-correction not working even though the box is checked on the main page.  Very frustrating, and I received no help on Discord.

Link to comment
On 6/1/2022 at 6:59 AM, Falcosc said:

I have the same question.

 

How to detect what data is correct and how to correct?

Maybe we can start easy:

Is it possible to extract both versions of the mismatching data block for all data disks? And if you find tools to validate the result, how do you decide in which direction the array should be reconstructed?

 

2nd Question: why is there a "Correct" Flag implemented in the parity check? Who would want to blindly overwrite the parity drive data?

https://wiki.unraid.net/Parity#Checking_parity

 

I would guess that the process is to resolve the issue on the data disk and then rebuild the parity with the correct flag. But this would only be possible if the parity check told you on which data disk the issue occurred, which is not possible without checking the data block on all connected data disks.

 

My conclusion: if we don't know how to correctly solve a parity check result (blindly changing parity drive data is not a solution) I don't understand what the point of the parity check is.
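The ambiguity discussed in this post can be shown with a toy model. This is only a sketch of plain single XOR parity over three made-up "disks", not Unraid's implementation: a flipped bit on a data disk and the same flipped bit on the parity disk produce an identical mismatch report.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mismatched_offsets(data_blocks, parity_block):
    """Offsets where the stored parity disagrees with parity recomputed from data."""
    expected = xor_parity(data_blocks)
    return [i for i, (e, p) in enumerate(zip(expected, parity_block)) if e != p]


def flip_bit(block, offset, bit=0):
    """Return a copy of block with one bit flipped at the given byte offset."""
    damaged = bytearray(block)
    damaged[offset] ^= 1 << bit
    return bytes(damaged)


if __name__ == "__main__":
    disks = [b"alpha-01", b"bravo-02", b"delta-03"]
    parity = xor_parity(disks)

    # Case 1: a bit flips on the second data disk.
    bad_data = [disks[0], flip_bit(disks[1], 3), disks[2]]
    print(mismatched_offsets(bad_data, parity))

    # Case 2: the same bit flips on the parity disk instead.
    print(mismatched_offsets(disks, flip_bit(parity, 3)))
```

Both cases report the same offset, so the check alone cannot say which disk is wrong. In this model a correcting run that rewrites parity from the data makes the mismatch disappear in both cases, which is only the right outcome in case 2.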

Yeah, I agree.

Link to comment

I've decided just to reformat both parity disks and rewrite the parity information to them. Obviously this is too complicated and a poor implementation on Unraid's part, and it should be addressed.  Regardless, I should not be too affected, since I only have about 60 sync errors that could not be corrected.

Link to comment
