
So if something goes screwy, for example a disk starts spewing garbage in the middle of a parity check (RobJ is going to challenge me on this, I know it is coming)...

Chuckle chuckle!  Really, I don't *try* to be a troublemaker!  I know you were just trying to make a point, but for anyone who might be confused by Brian's hyperbole(!), drives don't spew garbage, they spew errors.  However, it's true that we have heard here of a few cases where it seems provable that a drive returned corrupted data, perhaps on the order of 1 bit per 10 gigabytes.  With apologies to Jonathan (I don't know your situation), I still remain rather skeptical about some of the individual cases, but I agree it is possible, although extremely unlikely.  Brian brings up a good point, that we as UnRAIDers are 'uber diligent' about disk issues, and are far more likely to both detect and be concerned about even one bad bit per terabyte, whereas the great unwashed masses of Windows and Mac users will probably never know about it if it happened to them.

 

I think we have a problem of scale here.  Yes, we are uber vigilant, and yes we have far more disk storage than they do, which multiplies the probability of the issue, but there are so very many more of them.  It's hard for me to believe that (compared to how few there are of us) the worldwide hundreds of millions of Windows and Mac users haven't run into this issue before.  They may mostly be ignorant, but not all of them are.  And what about the relatively small but very significant enterprise segment, with mostly RAID-protected hard drives?  Have you ever heard of this issue being widely discussed among them?  I've heard of discussions of bit rot in the past, but bit rot and similar issues are not really a problem any more, because of the error correction info embedded in each data sector.

 

When we had this discussion before, I in essence challenged others to come up with a valid scenario that could explain these corrupted bits, and Joe L and Bubba came up with valid ideas.  I only remember one of them: the case where the drive's internal memory was bad.  The scenario would be:

* a given sector is requested, with good, uncorrupted data on the disk surface - (data is good)

* we read the data into a memory buffer (within the drive) - (data is still good)

* using the error correction info and CRC, the data is checked, and found to be good - (data is still good)

* the data is corrupted: either bad memory flips a bit, or a firmware bug or a flaky hardware register corrupts a byte - (data is bad (but assumed good))

* a packet is created with fresh packet overhead and a CRC computed over the already-corrupted data - (packet is good, data is bad)

* the packet is transmitted to the host machine, error-checked at every step of the way and always found to be good - (packet is good, data is bad)

* the packet arrives at the destination buffer, the data is moved to its destination memory location, and no errors at all have been reported - (yet data is bad)

* if it's an executable, it runs with possibly defective behavior; if it's streamed media, it may play with a detectable flaw; if it's being stored, it is stored unknowingly corrupted; if it's part of an UnRAID parity check, it produces an 'incorrect parity' error
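
The insidious part is that the CRC in the packet step is computed over the already-corrupted data, so every downstream check happily validates garbage.  The only thing that can catch it is a checksum computed end-to-end, back when the file was known to be good.  A minimal sketch of building such a manifest, in Python -- the paths are purely hypothetical examples, not anything unRAID provides:

# Build a SHA-256 manifest of a disk so silent corruption can be
# detected later.  The paths below are only examples.
import hashlib, os

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, manifest_path):
    with open(manifest_path, 'w') as out:
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                out.write('%s  %s\n' % (sha256_of(full), full))

build_manifest('/mnt/disk1', '/boot/manifests/disk1.sha256')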

 

The one conclusion we can make is that if we are convinced a drive is doing this, then it MUST be replaced, even if nothing else appears to be wrong with it (e.g. a perfect SMART report).

 

I do like the ideas of Jonathan and Brian here: a diagnostic mode with enforced read-only access, and delayed approval of reviewable corrections.  I'll probably always do correcting parity checks myself, but the ideas are good and will prove useful in certain diagnostic circumstances.  I think Brian also proposed being able to do parity checks across specified ranges, which would clearly be useful here.  You could test and retest specific blocks.
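
To illustrate what a ranged check amounts to (this is just the underlying math, not unRAID's actual code): parity is the bytewise XOR of all the data disks, so verifying a range means XOR-ing that range from each data disk and comparing the result with the same range on the parity disk.

# Illustration only -- not unRAID's implementation.  Parity is the
# bytewise XOR of all data disks, so checking a block range means
# XOR-ing that range from each data disk and comparing with parity.
from functools import reduce

def check_range(data_blocks, parity_block):
    """data_blocks: one bytes object per data disk; parity_block:
    bytes read from the parity disk, all the same length."""
    computed = reduce(
        lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
        data_blocks)
    return computed == parity_block  # True -> this range is in sync

# Hypothetical usage: re-test one suspect 4 KiB region as often as
# you like, where read() is a stand-in for a raw-device read:
#   ok = check_range([read(d, offset, 4096) for d in data_disks],
#                    read(parity_disk, offset, 4096))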

 

I'll just end by saying I completely agree with Gary.  The chances of one of these bit anomalies occurring seem so small that if I'm going to worry about it, I might as well drive a tank, armor-proof the house roof so no airplane tires or satellite debris crash through, get someone else to taste all my food first ...


OMG, I laughed so hard tears are still rolling down my cheeks! Thanks for that - needed after a long day. Completely agree. No rebuttals save one.

 

People have lost data due to correcting parity checks.

 

Peace!

 

People have also died because a plane crashed through their roof -- but neither Rob nor I have any plans to armor-proof our roofs  :)    How's the integrity of your roof?

 


Thanks for the information on backups.

 

I'm planning on building a new unRAID box as a backup, with a second Pro license.  I'm wondering if it's a good idea to use the product below to duplicate all my disks first, around ten 3TB HDDs:

 

http://www.newegg.com/Product/Product.aspx?Item=0Y5-0001-00014

 

I wouldn't.    I'd just build your 2nd server, then copy all the data across your network.  Using a duplicator would require using the same size drives ... and you're better off using newer 4TB drives for the new server.    If you want the copies to be quick, just set up the 2nd server without a parity drive assigned;  then do the copies; and then assign a parity drive.    The writes will be about 2 1/2 to 3 times as fast like that (although they won't, of course, be fault-tolerant).
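
If it helps, here's a rough sketch of the copy-with-verification idea in Python, assuming the new server's disk is reachable as a local mount (the mount points are purely hypothetical; rsync across the network would do the same job):

# Copy a disk's tree to the new server and byte-compare each file
# after copying.  Both mount points below are hypothetical examples.
import filecmp, os, shutil

def copy_tree_verified(src_root, dst_root):
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_root, rel, name)
            shutil.copy2(src, dst)  # copies data plus timestamps
            if not filecmp.cmp(src, dst, shallow=False):
                raise IOError('verify failed: %s' % dst)

copy_tree_verified('/mnt/disk1', '/mnt/backup_server/disk1')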

 


Similar to unRAID's requirements, most duplicators will clone to larger-capacity drives than the source.  I have an Inatech (?) dock; it's invaluable.

 

So in your experience, if I take my 10 x 3TB HDDs and duplicate to 3TB HDDs and larger, it should work just fine?

 

Thanks.


As I noted earlier, this would work just fine -- I just don't see any reason to do it unless you simply want the duplicator (not a bad idea, as it is a neat piece of equipment).

 

My comment was really that you wouldn't want to buy a bunch of 3TB drives when you can use newer, higher-capacity 4TB drives.    And it certainly seems simpler to just build the 2nd server, then start ONE copy across the network to get all the data copied, than to remove all of your current server's drives one-at-a-time and duplicate them.

 


Yes, of course, it's well understood, thank you.  I can actually put a duplicator to good use for reasons beyond my unRAID server.  I am considering building a second server as I mentioned, but I have also thought of just cloning all the current drives and keeping those clones off-site for safekeeping.

 

On another note, many years ago I used Arcoide DupliDisk hardware to mirror/duplicate HDDs on the fly in quite a few PCs and servers.  If I remember correctly, the servers ran RAID 5 with 5 DupliDisk controllers and a total of 10 HDDs.  I wonder if this method would be overkill for an unRAID server?


Nice chart -- although the final entry ["Only if off-site"] should probably read ["Only if off-site or backup still okay"].    Not all acts of God (fires, floods, etc.) will destroy local backup drives -- they may be at the opposite end of a home; stored in a fireproof/waterproof safe; or perhaps on a higher floor in a flood that destroys lower floors.    [In my case, I have BOTH a backup server on the far side of the house AND a 2nd set of backup drives stored in a waterproof/fireproof safe.]

 


Getting picky on me! :D  Hopefully people will understand.


 

Not so "picky" as "precise".  While I agree I should really have an off-site backup, I DO feel fairly well protected.    It's unlikely I'd lose all of my backups in a fire, or to theft - or even to an earthquake.  A flood would be more troublesome (not sure just how long my "waterproof safe" remains that way), although we live well above any local flood plains, so that's a reasonably unlikely event.

 

But I certainly agree off-site backups would be better.

 

 


 

Image updated with 2 changes. Go back to the post and see if this is better. :)


It is VERY rare that a sync error is due to an error on one of the data disks.  If a read error occurs on a data disk, UnRAID will re-write that sector (reconstructed from parity and the other disks) to correct it ... and if the write fails, it will disable the disk (so no inappropriate change to the parity disk would occur).    So unless you have multiple drive failures at the same time, you can always recover a failed disk -- that's the whole concept of UnRAID's fault tolerance.    ... it's also why you want to periodically do a parity check to ensure your parity information is good.

 

Bottom line:  I can really think of NO reason to do a "non-correcting" parity check.    The whole idea of doing the check is to ensure your parity is good !!

 

Yeah, the theory is nice and all, but just a couple of releases back (4.6 and 4.7) there was a substantial bug in unRAID that would cause a drive being rebuilt to have errors in the very first part of it (the superblock, I believe).  This occurred for me several times; it is provoked by having add-ons running and accessing disks (changing the superblock) while the rebuild process starts.  See my old topic on this:

http://lime-technology.com/forum/index.php?topic=12884.msg122870#msg122870

 

Now, if you had that happen, then the next time you run a correcting parity check, those errors will become permanent corruption on the drive you had rebuilt.

 

I am very grateful to Joe for advising all of us to run NON-CORRECTING monthly parity checks; thanks to this, my unRAID server maintains a perfect record of never losing or corrupting any data (I was able to successfully rebuild the disk in question again by doing it without my add-ons running).

 

Sure, the bug was eventually (after far, far, FAR too freaking long) corrected in unRAID 5, but I say better safe than sorry.  Non-correcting monthly parity checks are safest, and I would STILL like to see an option to automatically perform a non-correcting parity check after upgrading / rebuilding a disk.


 

I don't have any problem with the idea of running non-correcting checks.  It's simply that there's very little you can do if you find errors, other than just run a correcting check afterwards.  If there were better tools [i.e. a "potentially corrupted files" tool that would show which files on each disk were potentially corrupted by a specific bit location's parity being wrong], then it would make more sense to do the non-correcting check -- PROVIDED, of course, that you have a way to USE the information.  For example, if you knew that files a, b, c, d, e, and f were potentially corrupted due to a parity error, do you have the ability to verify which ones are good or not?  If you don't have checksums or backups (or both), this information isn't of much utility.  But I agree that if you CAN answer that question reliably, then a non-correcting check would let you identify those times that the error was actually on a data disk, although then you have the issue that there's no "rebuild this file from existing parity" tool (you could, of course, rebuild the entire disk).
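
For what it's worth, the "verify which ones are good" question is answerable if you built a checksum manifest while the files were known-good, as sketched earlier in the thread.  A minimal verification sketch, assuming the same two-space "<sha256>  <path>" manifest format and a hypothetical manifest path:

# Re-hash each file listed in a saved manifest and report any that
# no longer match -- those are the candidates that were actually
# corrupted on the data disk rather than on parity.
import hashlib

def verify_manifest(manifest_path):
    bad = []
    with open(manifest_path) as f:
        for line in f:
            digest, path = line.rstrip('\n').split('  ', 1)
            h = hashlib.sha256()
            with open(path, 'rb') as data:
                for chunk in iter(lambda: data.read(1 << 20), b''):
                    h.update(chunk)
            if h.hexdigest() != digest:
                bad.append(path)
    return bad

print(verify_manifest('/boot/manifests/disk1.sha256'))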

 


 

In the case I brought up, where unRAID botched up while rebuilding a disk, there was a far better action to take: reboot with a clean go script (no add-ons), and rebuild the disk again.  Had I run a correcting parity check, I would not have had that option.

