Article: Why RAID 5 stops working in 2009


bubbaQ

Recommended Posts

On 8/11/2020 at 6:03 PM, testdasi said:

I don't see how dual parity on its own can identify which disk is wrong so please provide support behind your claim.

I will give you a very easy practical proof. (Proof is proof, right?)

Set up a test server with dual parity. Run a parity check, and verify that everything is good.

 

For the purposes of the following exercise, to eliminate any interference, only use "maintenance mode", and only run "read-only" parity checks. Note: there's a bug in the current UI that ignores the read-only checkbox, so only do read-only parity checks from the command line (mdcmd check nocorrect).

 

For extra proof, collect the md5 checksums of the disks, and save them somewhere, like...  md5sum /dev/mdX >/boot/mdX.before.md5

 

Now, let's "illegally" modify one byte on one of the disks that are behind the "md" devices. Something like:  dd if=/dev/random of=/dev/sdX1 bs=1 count=1 seek=10000000
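If you want to rehearse that one-byte modification safely before touching a real array device, you can do the same thing against a throwaway file (the paths below are scratch examples, not real /dev/sdX devices):

```shell
# Rehearse the one-byte "illegal" modification on a scratch file
# instead of a real /dev/sdX1 -- safe to run anywhere.
truncate -s 1M /tmp/fakedisk.img
md5sum /tmp/fakedisk.img > /tmp/fakedisk.before.md5

# Flip one byte at offset 10000 (conv=notrunc keeps the rest of the file intact).
printf '\xff' | dd of=/tmp/fakedisk.img bs=1 count=1 seek=10000 conv=notrunc 2>/dev/null

# The saved checksum now fails, showing a single-byte change is detectable.
md5sum -c /tmp/fakedisk.before.md5 || echo "single-byte corruption detected"
```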

 

Run a "read-only" parity check, and you will see the error reported in the syslog, like...  kernel: md: recovery thread: PQ incorrect, sector=19528

 

Shut down the server, and pull out one of the disks.  Start the server.  The missing disk will of course be simulated.  But here is the interesting part: with dual parity, you can do a parity check at this point.  And if you have been "lucky" enough to have pulled out the offending disk, the parity check will show no errors this time.  With any other disk, you will see exactly the same error as before.  We have now correctly identified exactly which disk is carrying the "wrong" byte, and we don't need to assume that it's the parity that's wrong.

 

Q.E.D.

 

Now shut down the server, replace the missing disk, and start the server.  Unraid will offer to rebuild that disk from parity.  Proceed.  When done, you'll have all data on your disk exactly as it was before our "illegal" modification.  If you want extra verification, just run the new md5 sums and compare them with the ones you saved before. (Given that you were extra careful to only use "maintenance mode", as advised earlier.)

 

Note: the "illegal" modification could have been done on any of the disks, data or parity, with exactly the same result.

 

Also note: this whole exercise was designed only as a proof that the offending disk can indeed be identified with the help of dual parity, and that the good information can be restored back to the disk, instead of just assuming that the parity is wrong and propagating the wrong byte onto the parity disks. This exercise is not meant as practical advice about how you should deal with such situations on real servers, although you could, if it comes to that.

 

Now that the proof is out of the way, limetech can think of some efficient programmatic ways of doing it. My point is that it would be extremely useful if the report in the syslog showed something like:

kernel: md: recovery thread: PQ incorrect, rdevName.2=sdc, sector=19528

 

Edited by Pourko
some extra info
Link to comment

No. Not Q.E.D. What you showed is something else entirely.

 

You went through a situation where you control all the other variables in the algorithm and change only one of them. In reality, you cannot control all of the variables and be certain that only one change was made.

Link to comment
3 hours ago, Pourko said:

I will give you a very easy practical proof. (Proof is proof, right?)

Set up a test server with dual parity. Run a parity check, and verify that everything is good.

 

For the purposes of the following exercise, to eliminate any interference, only use "maintenance mode", and only run "read-only" parity checks. Note: there's a bug in the current UI that ignores the read-only checkbox, so only do read-only parity checks from the command line (mdcmd check nocorrect).

 

For extra proof, collect the md5 checksums of the disks, and save them somewhere, like...  md5sum /dev/mdX >/boot/mdX.before.md5

 

Now, let's "illegally" modify one byte on one of the disks that are behind the "md" devices. Something like:  dd if=/dev/random of=/dev/sdX1 bs=1 count=1 seek=10000000

 

Run a "read-only" parity check, and you will see the error reported in the syslog, like...  kernel: md: recovery thread: PQ incorrect, sector=19528

 

Shut down the server, and pull out one of the disks.  Start the server.  The missing disk will of course be simulated.  But here is the interesting part: with dual parity, you can do a parity check at this point.  And if you have been "lucky" enough to have pulled out the offending disk, the parity check will show no errors this time.  With any other disk, you will see exactly the same error as before.  We have now correctly identified exactly which disk is carrying the "wrong" byte, and we don't need to assume that it's the parity that's wrong.

 

Q.E.D.

 

Now shut down the server, replace the missing disk, and start the server.  Unraid will offer to rebuild that disk from parity.  Proceed.  When done, you'll have all data on your disk exactly as it was before our "illegal" modification.  If you want extra verification, just run the new md5 sums and compare them with the ones you saved before. (Given that you were extra careful to only use "maintenance mode", as advised earlier.)

 

Note: the "illegal" modification could have been done on any of the disks, data or parity, with exactly the same result.

 

Also note: this whole exercise was designed only as a proof that the offending disk can indeed be identified with the help of dual parity, and that the good information can be restored back to the disk, instead of just assuming that the parity is wrong and propagating the wrong byte onto the parity disks. This exercise is not meant as practical advice about how you should deal with such situations on real servers, although you could, if it comes to that.

 

Now that the proof is out of the way, limetech can think of some efficient programmatic ways of doing it. My point is that it would be extremely useful if the report in the syslog showed something like:

kernel: md: recovery thread: PQ incorrect, rdevName.2=sdc, sector=19528

 

No, all proofs are not equal. The reason I asked for mathematical proof is that it must state all assumptions upfront and then use logic to arrive at a set of guaranteed conclusions. To be brutally honest, your statement "proof is proof" and casual use of "Q.E.D." give me the impression that you don't appreciate the rigid standard that the science in computer science has to adhere to. "Some efficient programmatic ways" don't automagically materialise from hope-and-pray. Good algorithms are always backed by sound maths, and adding "Q.E.D." doesn't make your maths sound.

 

 

Firstly, an assumption you didn't state is that there is only a single error (because your apparatus precisely controls for that single variable, which is further apparent when you mention we don't need to assume parity is wrong). The implication of trying to recover data with that assumption violated, which you ignored, is further corruption to the array.

In other words, the conclusion is NOT guaranteed.

Perhaps someone who knows exactly what he/she is doing can use that as a last resort, but it should be reserved for specialist data recovery products or even a plugin, not a main Unraid feature. Unraid has to cater for layman users who, for example, despite the big red warning that format is not part of the data recovery process and data will be lost, still click format, lose data, and ask why.

 

 

There is also an elephant in the room.

Why should LT's small team put any effort into implementing this niche feature (among way more highly requested features) when there is a better, guaranteed solution already implemented?

Use the btrfs file system, run a scrub, and it will tell you exactly which file is corrupted. Not just which disk, but which file.

 

 

I can understand a feature request to do a partial recovery (which with BTRFS would allow file-level data recovery).

I don't understand forcing in a feature that MAY fix a rare issue (random data corruption on a single drive) but would confuse users and even risk the ability to recover from a relatively less rare issue (a drive failure).

 

Edited by testdasi
Link to comment
4 hours ago, testdasi said:

No, all proofs are not equal. The reason I asked for mathematical proof is that it must state all assumptions upfront and then use logic to arrive at a set of guaranteed conclusions. To be brutally honest, your statement "proof is proof" and casual use of "Q.E.D." give me the impression that you don't appreciate the rigid standard that the science in computer science has to adhere to.

I can't believe what I'm reading here.  A proof is either a proof or it isn't.  There is no gray area in that, and that applies to any hypothesis that ever needed a proof. You can choose numerous different ways of proving something, and they would be equally valid in the end. Choosing a complicated and obscure way to prove something doesn't make it any more of a proof. So I went out of my way to give you the simplest possible setup that you can replicate yourself. Unless you find a flaw in the steps I presented, you should be completely satisfied with the validity of my conclusion.

Link to comment
6 hours ago, BRiT said:

You went through a situation where you control all the other variables in the algorithm and change only one of them. In reality, you cannot control all of the variables and be certain that only one change was made.

Right. For the sake of simplicity, and as a proof of concept.

 

We check parity one byte/sector at a time. We have just arrived at a byte/sector which does not match what we expected. We report the sector number in the syslog. We have a few possible situations here:

A) We have single parity: we reasonably assume that we should sync the parity disk, as in 99% of the cases that would indeed be the correct thing to do, and also because we have no way of knowing otherwise.

B) We have dual parity, with more than one disk mismatched at that particular byte/sector: again, we have to assume that the parity disks need syncing, as that's the only reasonable thing we can do.

C) -- and this is the interesting one -- we have dual parity, and only one disk is mismatched at that particular byte/sector. Here we CAN know exactly which disk that is, and at the very least we can report it in the syslog.

And now that we've dealt with that particular byte/sector, we can proceed with our parity check, on to the next mismatched sectors we may find, and deal with them one at a time.
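Case C is exactly the trick described in hpa's raid6.pdf: with P and Q parity over GF(2^8), a single wrong byte e on data disk z shifts the P syndrome by e and the Q syndrome by g^z * e, so z falls out of a discrete logarithm. The following is a self-contained sketch of that arithmetic (my own illustration, not Unraid's actual md driver code; the disk bytes are made up):

```python
# Sketch of RAID-6 single-error identification over GF(2^8),
# reduction polynomial 0x11d, generator g = 2 (as in raid6.pdf).
# Illustration only -- not Unraid's md driver.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

# Precompute log/antilog tables for generator 2.
EXP = [0] * 255
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = gf_mul(x, 2)

def syndromes(data):
    """P = xor of data bytes; Q = xor over GF of g^i * d_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

def find_bad_disk(data, p_stored, q_stored):
    """Index of the single corrupt data disk; None if P and Q both match;
    -1 when only one syndrome mismatches (a parity disk itself is wrong)."""
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if dp == 0 and dq == 0:
        return None
    if dp == 0 or dq == 0:
        return -1
    return (LOG[dq] - LOG[dp]) % 255  # the z for which dq == g^z * dp

# Demo: 4 data disks, then flip one byte on disk 2.
disks = [0x11, 0x22, 0x33, 0x44]
p0, q0 = syndromes(disks)       # "good" parity, as after a clean check
disks[2] ^= 0x5a                # the "illegal" one-byte modification
print(find_bad_disk(disks, p0, q0))  # -> 2
```

Note this only names the culprit when exactly one disk is wrong at that byte position; with two simultaneous errors the same formula can point at an innocent disk, which is the caveat the paper itself raises.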

Link to comment
58 minutes ago, trott said:

not directly related to this topic, but assume I use btrfs for the disks in the array. When there is an error during the parity check (no automatic fix), can I run a btrfs scrub on each disk to confirm whether it is a disk issue or a parity issue?

Yes you can. If none of the disks has a scrub error, then it's likely the parity that failed.

Now, to be precise, scrub only checks used space, not free space. But from my perspective I don't care too much that there's a parity error in free space, as long as none of the disks has a failure (e.g. a reallocated sector) and I run a correcting parity check afterwards to make sure everything is in line.

 

5 hours ago, Pourko said:

I can't believe what I'm reading here.  A proof is either a proof or it isn't.  There is no gray area in that, and that applies to any hypothesis that ever needed a proof. You can choose numerous different ways of proving something, and they would be equally valid in the end. Choosing a complicated and obscure way to prove something doesn't make it any more of a proof. So I went out of my way to give you the simplest possible setup that you can replicate yourself. Unless you find a flaw in the steps I presented, you should be completely satisfied with the validity of my conclusion.

You proposed an experiment, which is not a proof. An experiment can disprove a theory (by providing a counterexample), but it cannot prove a theory.

For example, I can crash a car into a wall at exactly 10 mph and have a 100% survival rate. But that doesn't prove the theory that crashing a car always has a 100% survival rate.

 

Let me be clear though. You are not completely wrong. You have simply generalised a specific scenario (i.e. your experiment) to be always true in all cases.

I wonder if you ignored my later paragraph:

"Perhaps someone who knows exactly what he/she is doing can use that as a last resort, but it should be reserved for specialist data recovery products or even a plugin, not a main Unraid feature. Unraid has to cater for layman users who, for example, despite the big red warning that format is not part of the data recovery process and data will be lost, still click format, lose data, and ask why."

i.e. Unraid has to work in all cases and has to take care not to confuse general Unraid users with something that has the potential to cause catastrophic effects, i.e. data loss.

 

And of course the elephant in the room: BTRFS.

Link to comment
18 minutes ago, testdasi said:

You proposed an experiment, which is not a proof. An experiment can disprove a theory (by providing a counterexample), but it cannot prove a theory.

For example, I can crash a car into a wall at exactly 10 mph and have a 100% survival rate. But that doesn't prove the theory that crashing a car always has a 100% survival rate.

Dude, with your example you can prove the theory that a car can be crashed.  After I see that once, I know for a fact that a car can be crashed. In my example, I prove the theory that with dual parity you can know exactly which disk is carrying the mismatched byte, in the case where it is mismatched on only one disk at that particular byte/sector.  Nothing else.

Link to comment
2 hours ago, Pourko said:

Chapter 4 of that paper discusses exactly what I've been talking about -- identifying a single disk corruption at a certain byte position

Did you read this sentence?

Quote

If two disks are corrupt in the same byte positions, the above algorithm will (again, in the general case) introduce additional data corruption by corrupting a third drive.

 

Link to comment
2 hours ago, Pourko said:

OK then, here is your proof: 

https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

Chapter 4 of that paper discusses exactly what I've been talking about -- identifying a single disk corruption at a certain byte position.

I have already read that paper quite a while ago.

Quote from the paper (underline added for emphasis): "If two disks are corrupt in the same byte positions, the above algorithm will (again, in the general case) introduce additional data corruption by corrupting a third drive. "

 

I think I shall stop responding to you now. You are beginning to take my sentences out of the context of the entire long posts (plural) in order to push your reckless idea.

I have already stated my argument that Unraid should stay on the safe side and focus on the general case, especially when there is a safer alternative (btrfs scrub) already implemented.

Link to comment
2 hours ago, testdasi said:

I have already read that paper quite a while ago.

Quote from the paper (underline added for emphasis): "If two disks are corrupt in the same byte positions, the above algorithm will (again, in the general case) introduce additional data corruption by corrupting a third drive. "

Keep the two things separate -- detecting something, and doing something about it.  I am only talking about the first part.  Yes, with two disks corrupt in the same byte positions, you can't detect which disks they are. But with one corrupt disk you can.  Report that in the syslog.

Link to comment
12 minutes ago, Pourko said:

Keep the two things separate -- detecting something, and doing something about it.  I am only talking about the first part.  Yes, with two disks corrupt in the same byte positions, you can't detect which disks they are. But with one corrupt disk you can.  Report that in the syslog.

The problem here is that when an error is detected, how can we know whether there is only one corrupt disk or there are two corrupt disks?
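This ambiguity is easy to exhibit with the same GF(2^8) syndrome arithmetic from the raid6.pdf paper: a particular pair of error bytes on disks 0 and 1 produces exactly the P/Q deltas that a single error on disk 2 would, so the "which disk" answer can finger an innocent drive. A toy sketch (my own illustration; the error bytes are contrived on purpose):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the RAID-6 polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

g = 2  # RAID-6 generator

# Two simultaneous errors at the same byte position: e0 on disk 0, e1 on disk 1.
e0, e1 = 6, 5
dp = e0 ^ e1                 # P-syndrome delta
dq = e0 ^ gf_mul(g, e1)      # Q-syndrome delta = g^0*e0 xor g^1*e1

# A single error e on disk z would give dq == g^z * dp.
# These two errors exactly mimic a lone error on disk 2:
assert dq == gf_mul(gf_mul(g, g), dp)   # dq == g^2 * dp
print("syndromes point at disk 2, but disks 0 and 1 are the corrupt ones")
```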

Link to comment
3 hours ago, testdasi said:

I have already stated my argument that Unraid should stay on the safe side and focus on the general case, especially when there is a safer alternative (btrfs scrub) already implemented.

Speaking of the safe side, imagine for a moment that what I'm asking for has already been implemented, i.e., Unraid reports the disk number in the syslog when it finds a single-disk mismatch. So now you run your read-only parity check, and on this occasion you find numerous parity errors, but to your surprise you notice that all those errors seem to be coming from disk #5.  What safe road would you want on that occasion?  The one where Unraid propagates the wrong bytes onto the parity, or the one that gives you the golden chance to yank disk #5 out of your server and rebuild your good data from parity?

 

And about all the other features that you keep saying are higher priority for Unraid... They are not the core of Unraid. The "md" driver is.  The way Unraid does parity is the only reason I bought Unraid.  Everything else is bells and whistles, and you can do them on any other distro.

Link to comment
15 minutes ago, trott said:

The problem here is that when an error is detected, how can we know whether there is only one corrupt disk

I have demonstrated, in a clumsy example, that if you disregard a disk in your parity calculation you can arrive at a case where that calculation shows correct parity.  Now that we know that we CAN know that, we can find better ways of determining it.

Link to comment
