Parity Check Corruption: Parity Disk or Data Disk?


Recommended Posts

16 minutes ago, Jaybau said:

"Ways that there should be parity check errors (https://wiki.unraid.net/Parity#Checking_parity😞

Undetected hardware fault (such as silent memory corruption)"

 

When there's a hardware fault/memory corruption, how do you know if the parity error is on the parity disk or the data disk?  Will the parity check tell me which is corrupted?

 

The only way to know where the corruption occurred is if you are using the btrfs file system on your array disks (in which case a scrub will tell you about corrupt files), or if you have checksums for your files when running xfs/reiserfs.
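For the btrfs route, a scrub reads every allocated block and verifies it against the checksums btrfs stores, and any corrupt file paths end up in the syslog/dmesg. As a rough sketch only (the mount point /mnt/disk1 is just a placeholder for one of your btrfs array disks), kicking one off could look like this in Python:

    import subprocess

    MOUNT = "/mnt/disk1"   # placeholder: a btrfs-formatted array disk

    # Run a scrub in the foreground (-B blocks until it finishes)
    subprocess.run(["btrfs", "scrub", "start", "-B", MOUNT])

    # Show the summary; checksum (csum) error counts appear here,
    # and the affected file paths are reported in the kernel log
    status = subprocess.run(["btrfs", "scrub", "status", MOUNT],
                            capture_output=True, text=True)
    print(status.stdout)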

Link to comment

Not sure if there's a better way of doing this, or something built-in and automated for the new-user experience (lots of room for ignorance).  The last thing I want to deal with is corruption: not knowing what is corrupted, why or how, and trying to recover.

 

I like the btrfs scrub/validation feature, but I've read that the array (data + parity process) doesn't recover well from an unexpected, ungraceful shutdown.  Even with a UPS, it still makes me nervous.

 

Perhaps XFS + Dynamix File Integrity (DFI) is the more robust/safest solution: XFS for the unexpected-shutdown recovery + DFI, which basically does the same thing as a btrfs scrub.  I'm using BLAKE3 and hope it is as good as or better than btrfs metadata checksums/scrubs.
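To make the DFI idea concrete, here is a minimal sketch of what a checksum manifest buys you; it is not how the plugin is actually implemented. hashlib's BLAKE2 stands in for BLAKE3 (real BLAKE3 needs the third-party blake3 package), and the paths are made up:

    import hashlib, json
    from pathlib import Path

    ROOT = Path("/mnt/disk1")                    # hypothetical array disk
    MANIFEST = Path("/boot/disk1.hashes.json")   # hypothetical manifest location

    def file_hash(path):
        # Hash in 1 MiB chunks; BLAKE2b here is a stand-in for BLAKE3
        h = hashlib.blake2b()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest():
        hashes = {str(p): file_hash(p) for p in ROOT.rglob("*") if p.is_file()}
        MANIFEST.write_text(json.dumps(hashes, indent=2))

    def verify_manifest():
        # After a parity check error: any mismatch here points at corrupt data;
        # if everything verifies, the parity disk becomes the likelier culprit.
        for name, old in json.loads(MANIFEST.read_text()).items():
            if Path(name).is_file() and file_hash(name) != old:
                print("checksum mismatch:", name)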

 

Perhaps something in the future:

  1. Built-in file integrity check (for XFS too), and perhaps the default for new users.  If a parity error happens, knowing whether your parity or your data is corrupt seems absolutely essential.  Without this information, somebody could be left with a lot to figure out.
  2. Scheduled hardware checks (e.g. memory test), SMART drive tests, file integrity checks, parity checks.  I would like to catch problems as early (and conveniently) as possible. 
  3. Best practices built-in and streamlined.  Some of this already exists, but it isn't intuitive, and I still don't have best practices nicely figured out.

Thank you.

 

Link to comment

I have the same question.

 

How do you detect which data is correct, and how do you correct it?

Maybe we can start easy:

Is it possible to extract both versions of the mismatching data block for all data disks? And if you find tools to validate the result, how do you decide in which way the array should be reconstructed?

 

Second question: why is there a "Correct" flag implemented in the parity check? Who would want to blindly overwrite the parity drive data?

https://wiki.unraid.net/Parity#Checking_parity

 

I would guess that the process is to resolve the issue on the data disk and then rebuild the parity with the Correct flag. But this would only be possible if the parity check told you on which data disk the issue occurred, which is not possible without checking the data block on all connected data disks.
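That limitation is inherent to single parity. The parity byte is just the XOR of the bytes at the same position on every data disk, so a mismatch only says that the whole "column" is inconsistent, never which member is wrong. A tiny illustration with made-up values:

    from functools import reduce

    # Bytes at the same offset on three data disks (made-up values)
    data = [0b10110010, 0b01011100, 0b11100001]
    parity = reduce(lambda a, b: a ^ b, data)   # what the parity disk stores

    # Flip one bit on a data disk, or flip the same bit on the parity disk...
    bad_data = [data[0], data[1] ^ 0b00001000, data[2]]
    bad_parity = parity ^ 0b00001000

    # ...and both faults produce the identical symptom: a parity mismatch.
    print(reduce(lambda a, b: a ^ b, bad_data) != parity)    # True
    print(reduce(lambda a, b: a ^ b, data) != bad_parity)    # True

With one parity equation there is no redundancy left over to point at the offending disk; only an outside reference (file checksums, backups, SMART evidence) can do that.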

 

My conclusion: if we don't know how to correctly resolve a parity check result (blindly changing the parity drive data is not a solution), I don't understand what the point of the parity check is.

  • Like 1
Link to comment
2 hours ago, trurl said:

Unraid already monitors important SMART attributes and will warn about those. Do you have Notifications set up to alert you by email or another agent as soon as a problem is detected?

 

Yes, but I'm suggesting a more consolidated approach to the different tests, rather than separate, scattered checks, some of which are not known to (or included for) novice users.

 

First, is there a difference between SMART attribute monitoring/flags and running a SMART short/extended test?  I assume SMART attributes get triggered when a problem occurs, while a SMART test is proactive and can potentially find problems earlier.  Additionally, the SMART tests for my SAS drives don't seem to work at the GUI level, so I have to run them manually at the console.
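My understanding (worth double-checking) is: attribute monitoring is passive, the drive simply updates counters such as reallocated or pending sectors as it hits problems, while a short/extended self-test actively exercises the drive and records a pass/fail verdict in its self-test log. A rough Python sketch around smartctl; the device name is a placeholder, and SAS drives may additionally need the -d scsi option:

    import subprocess

    DEV = "/dev/sdb"   # placeholder device node

    def smartctl(*args):
        out = subprocess.run(["smartctl", *args, DEV],
                             capture_output=True, text=True)
        return out.stdout

    # Passive side: dump the attribute table (reallocated/pending sectors, etc.)
    print(smartctl("-A"))

    # Active side: start a short self-test...
    print(smartctl("-t", "short"))

    # ...and some minutes later, read the self-test log for the verdict
    print(smartctl("-l", "selftest"))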

 

I have separate checks for file integrity checksums, btrfs, and parity, and I run SMART tests.  Occasionally I bring down the system for a memory check.

Link to comment
2 hours ago, trurl said:

There is no way to schedule a memory test, since memtest must be booted and be the only thing running.

 

How do people know when memory goes bad?  When they start seeing data errors? 

Is there a way to be proactive?

Can memory checks be scheduled to reboot the machine, run the memory check, boot the machine back up, and report the analysis?

Link to comment

When you get errors from a non-correcting parity check, you conduct an investigation to see if there is a cause for the errors.  IF you don't find one, REBUILD parity.

 

Why? The reason is simple. If something has happened that caused a data error on a disk, you will have a problem from that error.  There is nothing you can do about it, since you could not find it.  If you rebuild parity, that error will become a part of parity.

 

Now, look at what happens if a second drive should fail completely and you can identify it.  You replace the drive and rebuild it using parity.  If you rebuilt parity, the rebuilt disk will be without error.  (That original disk will still have the error on it.)

 

If you don't rebuild parity and use that parity with errors in it, the drive that you are rebuilding will have errors on it because parity is wrong for the current data that is on all of the other disks!  So you will now have two disks with errors on them!

 

Now, for an observation.  I cannot recall an instance where parity was rebuilt after finding an error during a parity check and the user subsequently found a problem on a data disk!  I suspect that the error is almost always in the parity data that is on the parity disk.  I am not privy to the code that LimeTech uses, but I suspect that the parity data update is always the last data committed to be written during a data write operation.  If anything goes wrong during that operation, the parity data is the most likely one to be affected.

Link to comment

Do we have a guide that says anything about how to fix parity check errors? I am thinking about ways of identifying which data disk is causing the mismatch in the parity.

 

Depending on which file system you use, the guide would look different.

 

I have already thought about this and have one major issue: what if the error is caused by a bit flip in an area of a data disk that has no data in that sector? A file-based check will not find this issue, so all file checksums on all data disks would be OK and you would not know which disk caused the error.

Link to comment
52 minutes ago, Falcosc said:

I have already thought about this and have one major issue: what if the error is caused by a bit flip in an area of a data disk that has no data in that sector? A file-based check will not find this issue, so all file checksums on all data disks would be OK and you would not know which disk caused the error.

 

In such a scenario, where parity does not match and produces an error, but you have checksums of your data that all match, there is no error in your data, and a "bit flip" in an "empty" portion of the drive would not be an issue. You'd simply rebuild parity, as you already know that your data is intact (via the checksum verification).

Link to comment
51 minutes ago, Falcosc said:

What if the error is caused by a bit flip in an area of a data disk that has no data in that sector?

What would happen (in this case) is that during the non-correcting parity check the drive would have a read error.  The error detection and correction routines would automatically begin.  They would either correct the error (the most likely outcome, BTW) or the disk would throw a read error.  (Modern hard disks are designed for this situation and recover gracefully from most data read problems.  IF they don't, the error will be flagged in the SMART attributes...)

 

As I said previously, the error is most likely in the parity data that is on the parity disk.  The most likely cause is usually an unclean shutdown of the array (power failure, lockup, etc.) where the parity drive does not get updated with the latest parity data.  Unraid uses a write-delay cache (as do ALL modern OSes since about 1995) and does the actual writes when the system is idle.  This means there is always a period of time when the actual data on the physical drive is out of sync with the data that should be there.  You can read about that here:

       https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
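For anyone who wants to see that write-back window on their own box, the kernel exposes the relevant knobs under /proc/sys/vm; a small read-only Python sketch (tuning them is what the linked article discusses):

    from pathlib import Path

    # How much dirty (not yet written) data may sit in RAM, and for how long
    for name in ("dirty_ratio", "dirty_background_ratio",
                 "dirty_expire_centisecs", "dirty_writeback_centisecs"):
        value = Path("/proc/sys/vm", name).read_text().strip()
        print(f"vm.{name} = {value}")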

 

Unless you find a cause for the parity error, you want to have the parity reflect what is actually on the data disks, whether that data is right or wrong!  You don't want to be using parity data that is wrong to reconstruct data on a second drive, as that can result in the reconstructed data being wrong!

Link to comment

What is the best way to find out whether the parity drive has to be overwritten, or whether the unlikely case has happened and the parity data is the correct copy?

I mean, yes, you could start a validation of all data on all data disks, but this isn't very easy, and it takes a long time.

 

Is there a way to find out which files are stored on each disk at a specific sector? (We have the sector number from the parity check result.)

 

It is cool that we have the parity check; I guess we are just missing documentation about how to handle parity check errors correctly.

The coolest solution would be a selective data validation based on the invalid parity bit. Is that possible? That would be great and very fast. It might even be usable if you don't have checksums, because you could extract the affected file and compare it with a backup.
I don't know enough about the structure of the file systems to know how to find data references pointing to the affected sector.
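One piece of such a selective check is just arithmetic, since the sector reported by the parity check sits at the same offset on every array disk. A sketch of the calculation with illustrative numbers (the partition start and block size are assumptions you would confirm with fdisk and xfs_info or similar); mapping the resulting block back to the owning file then needs a filesystem-specific tool, for example xfs_db for XFS (other filesystems have their own debug tools), which is exactly the per-filesystem documentation that seems to be missing:

    # Sector reported by the parity check (example value)
    SECTOR = 123_456_789
    SECTOR_SIZE = 512        # parity check sectors are 512 bytes
    PARTITION_START = 64     # assumed first sector of the data partition (verify with fdisk);
                             # set to 0 if the reported sector is already partition-relative
    FS_BLOCK_SIZE = 4096     # assumed filesystem block size (verify with xfs_info/tune2fs)

    # Byte offset of the suspect sector inside the partition
    byte_offset = (SECTOR - PARTITION_START) * SECTOR_SIZE

    # Filesystem block containing that offset; the same number applies to
    # every data disk because they all share the parity geometry
    fs_block = byte_offset // FS_BLOCK_SIZE

    print("byte offset in partition:", byte_offset)
    print("filesystem block to inspect:", fs_block)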

Link to comment

Yes, I know. But could we find the file that uses that sector on each data disk, so we only have to validate those files?

If 3 of 5 drives have a file at that sector, we only need to validate those 3 files. If the 3 files are OK, the error happened on one of the drives which have no data there, or on the parity drive itself. That means you can rebuild the parity.
If a file is invalid, you have to fix it by fixing the cause of the issue (replacing hardware), then you can fix the data, and after the data is valid again, rebuild the parity drive.

But I don't know how to check all data drives for data corruption.

 

We have this page: https://wiki.unraid.net/Check_Disk_Filesystem

But it only covers file system checks. Is this enough? Or do some of the mentioned file systems not support checksums for files?

 

For me there is a gap between this file system documentation and how to treat parity check errors. I would like to know how to deal with monitoring tool results before I encounter them. It is pointless to monitor something if I don't know how to handle the result of a parity check correctly :)

Edited by Falcosc
Link to comment
26 minutes ago, Falcosc said:

Yes, I know. But could we find the file that uses that sector on each data disk, so we only have to validate those files?

The biggest problem is not an error in the data of a single file.  (You should have a separate backup of any file that is irreplaceable.  Parity on any server is not a substitute for a backup!)  The biggest problems come if the error is in the file allocation area of the disk.  That can cause the disk to become completely unreadable.  (These errors are usually detected and are often repairable with minimal data loss.)

 

IF you are really concerned about this type of issue, you should be using ECC memory and either the Dynamix File Integrity plugin with XFS-formatted disks or btrfs as your disk format.  Then, when you have a parity error, you can check for data integrity and determine where the problem is.

Link to comment

Yes, but how?

We have backup files to compare with the affected files.
But how do I find which files are affected by the parity check result?

Formatting the whole array and restoring multiple terabytes from backup is not the solution when you encounter your first parity check error.

I want to know how to find out what needs to be done.

How do I check which disk is broken? If you know which disk is broken, then it is easy:

  • Parity disk is wrong (memory fault, unclean shutdown, other reasons): run a parity check with the Correct option
  • Data disk is wrong: restore from backup

 

My question is: how do you decide between a correcting parity check and restoring from backup? And if restoring from backup is the answer, do I need to restore the whole backup, or just the broken file?

 

I would like to read or create documentation about how to deal with parity check results:
- First of all, check your system stability (we don't need to talk about this first point, we have good pages for that)

Then find the broken sector (this documentation is missing):
- On file system A, do this
- On file system B, do that
- On file system C, you cannot do this and need the file integrity plugin

Or is there a trick I am missing that bypasses the whole file system analysis? Because it is strange that this important information, how to determine what needs to be done after finding a parity check error, is missing.

 

  

41 minutes ago, Frank1940 said:

IF you are really concerned about this type of issue, you should be using ECC memory and either the Dynamix File Integrity plugin with XFS-formatted disks or btrfs as your disk format.  Then, when you have a parity error, you can check for data integrity and determine where the problem is.

I don't understand why there is an uppercase if.

Could it be that modern hard disks are nearly flawless (flawless in the sense that every issue is covered by SMART), so you can skip all the file system checks and only do something if SMART detects an error? If they were flawless, would the process look like this when a parity check finds an error:

- No SMART counter rises: the parity data is broken; fix the system and rebuild parity

- SMART counters did rise: the data disk is broken; replace the disk and use the parity data, because parity is correct based on zero SMART warnings on the parity drive

I don't know how powerful SMART is, or if there are other things that tell us what to do on a parity error.

It would be cool if HDDs had checksums at the hardware level and if these were reliably reported in SMART.

 

Or, if I rephrase the "why the uppercase IF" question: why is the file integrity check optional? Is there functionality somewhere that makes us comfortable skipping these things?

Edited by Falcosc
Link to comment

You have raised an interesting issue.  There are two worlds of disk data people: (1) those who believe that 'bit rot' is possible and (2) those who believe that it is not possible.  (I am a member of the second category.)  This topic can be debated ad nauseam (and has been)!  I am not going to be dragged into that argument.  (Basically, I am convinced that the HD manufacturers have reduced the probability of 'bit rot' to about the same as being able, as I learned in my introductory course in Modern Physics, to propel a BB through the glass of the window in your living room and leave the glass completely undamaged.)

 

Why is there not a manual?  Probably because no one has figured out how to write a useful one.  This is not a topic where you can write a simple script and have it apply to every possible case.

 

I will say it again.  IF the data on a sector of the disk is not identical to what was originally written, or if the sector cannot be read, the disk will throw an error and that error will be reported in the SMART attributes.

 

The typical Unraid user who has an error can always turn to the forum by posting his problem and his server's diagnostics file for advice.  Depending on what is found, he will probably be told to rebuild parity.  A user who has installed one of the data integrity features can run one of those to see if there is a problem.

 

You have not even touched on the biggest issue.  What happens if a second problem occurs during a parity rebuild when the array has only single parity?  If the issue you are so concerned about is that important, the first thing you should do is implement dual parity!

 

EDIT:  One more thing.  As I recall, it was proposed (perhaps even in a beta release) many years ago that the parity check after an unclean shutdown be a correcting one.  After much, much discussion, it was decided that the check should be non-correcting to allow the user time to check things out first!

Edited by Frank1940
Link to comment

@Frank1940 Thank you so much, now it starts to make sense why there is a gap in the documentation. I felt stupid because I did not find anything, but I couldn't find an obvious reason for that. And it didn't help that every parity check issue was attributed to hardware problems, and nobody asked what happens if parity check errors occur on healthy hardware.

 

OK, that means we can use SMART if we don't believe in bit rot.

Or maybe go even further: maybe SMART is able to detect bit rot during reads as well?

 

My issue is that I don't know how powerful SMART actually is. Is there a checksum on the HDD that makes it possible for the HDD to detect single-bit errors during reads?

I don't care about multi-bit errors; my data is not important enough to talk about edge cases. For me, it is good enough to have a guideline for handling something that is part of my monitoring.

 

But I want to know if it is enough to trust SMART for the usual use cases (people with a single parity drive).

 

If we can trust SMART, I would formulate the guide as follows:

Parity check found an error:

  • No SMART counter rose: unclean shutdown, or hardware is broken (RAM, PCI, controller, cables, etc.)
  • SMART counter increased only on a data disk: replace the data disk
  • SMART counter increased only on the parity disk: replace the parity disk
  • SMART counter increased on multiple disks: we don't know which data is valid anymore
    • new: this is very unlikely and should be investigated by the community
    • either validate the data by using file system features or 3rd-party apps
    • or replace all disks with warnings and restore from backup

Would you agree with this SMART-based guide on how to deal with parity check results?
Or did I make some wrong assumptions?
 

 

Edited by Falcosc
Link to comment
33 minutes ago, Falcosc said:

My issue is that I don't know how powerful SMART actually is. Is there a checksum on the HDD that makes it possible for the HDD to detect single-bit errors during reads?

This is built into modern disk drives at the sector level, and in the case of a mismatch a read error is returned.  Note that at this point there is no correction of the error, just detection, so that you can then take appropriate recovery action.

  • Like 1
Link to comment
1 hour ago, Falcosc said:

My issue is that I don't know how powerful SMART actually is. Is there a checksum on the HDD that makes it possible for the HDD to detect single-bit errors during reads?

A single-bit error is not a problem on a modern hard drive.  They are all correctable!  (And not really that uncommon!!)  The problem comes up when there are, say, a couple of hundred in a single sector to deal with.  The issue is: "Is there a combination of errors that will make the error detection algorithm compute that the data is correct?"  And there is no way to answer that question, since the firmware used is a trade secret of the manufacturer.  All we can say is that it must not happen very often, or you would be able to read about the failures on the Internet.

Link to comment
3 hours ago, Falcosc said:

SMART counter increased on multiple disks: we don't know which data is valid anymore

  • either validate the data by using file system features or 3rd-party apps
  • or replace all disks with warnings and restore from backup

 

 

This one is the only one I would take issue with.  If this were to occur, I would be posting on the forum to see if something else was going on.  If you are running parity checks at least once a month, having more than one drive 'fail' in a single month with SMART-type errors would be an anomaly.  (Virtually every time what initially appears to be two disk-related failures has been investigated on the forum, the disks themselves have not been the source of the problem and are usually good.)

Link to comment
