Unraid Encryption and UREs

NasOnABudget · September 11, 2020

Hi, I'm about to build my first server and have a question about UREs (Unrecoverable Read Errors) and encryption with Unraid that I was hoping to figure out before configuration.

So if I have this right, with regular Unraid and no encryption, a URE during a rebuild (and I suppose outside of a rebuild as well) simply means the file with the URE will be corrupted at the byte where the error is. If you are using AES, for instance, will the same be true?

As AES is a block level cipher, I imagine that would mean entire blocks of 128 bits would be lost in a file, but otherwise it would be the same result. Does anyone know if this is true?

I also wonder if there are any measures in place, with 2 levels of parity, to remedy this type of situation without data loss, or a method for easily figuring out which file was affected.

Frank1940 · September 11, 2020

11 hours ago, NasOnABudget said:

So if I have this right, with regular Unraid and no encryption, a URE during a rebuild (and I suppose outside of a rebuild as well) simply means the file with the URE will be corrupted at the byte where the error is. If you are using AES, for instance, will the same be true?

I am not really a true expert in this whole field but I will try to impart what I believe to be true.

Not quite right. You have to realize that a hard disk is not a simple device. It may be best to consider that there is a line at the connectors. Up to that line on the outside of the drive, you pretty much know what is going on. Once you cross that line and get inside of the drive, everything becomes smoke and mirrors. The manufacturers basically don't want to reveal what is really going on as they use a lot of very highly proprietary mathematical algorithms to increase the reliability and longevity of their drives.

The first thing which is known is that the data which they write to the drive is not the actual string of ones-and-zeros that was sent to it. They employ a scheme called Run-length limited coding to reduce the number of magnetic transitions. You can read about that here:

https://en.wikipedia.org/wiki/Run-length_limited

The next thing is that the raw read error rate is actually rather high. (I can recall finding data from one HD manufacturer many years back on the raw read error rate. When I looked at it, I realized that all hard disks would basically fail within a few days unless something was being done.) What is done is that each manufacturer has developed an elaborate multi-layer error correction scheme. Basically, each block/sector of data is encoded with an error detection code which will first detect a certain number of errors and fix a lower number of errors. If everything can be handled at the block level, all is good. If it cannot, then the second, third, fourth... methods are successively employed. If all of these fail, then you are going to get the dreaded Unrecoverable Read Error message.

You could have a single block/sector bad or it could be a third of the disk. At the actual data density rate on a large capacity HD, it is actually quite unlikely that only a single bit or byte is bad. There is a lot of technology in a modern HD-- mechanical, electrical, magnetic are all involved. Any one of those systems can fail and you have problems.

Now, let's look at the other side of the connector line that I talked about. This is the computer side. If you decide to employ encryption, this is where encryption is done. What crosses the line to the drive is encrypted data. (Aside from the HD commands that tell the drive where to write the data.)

When Unraid builds parity, it looks at each sector of the disk and uses the data from that sector to calculate the parity. It does not know what that data is, what it is used for, or what it represents. It only sees a string of ones and zeros. Here is a description of how parity is built and recovered:

https://wiki.unraid.net/index.php/UnRAID_Manual_6#Parity-Protected_Array

So if you have an encrypted drive, Unraid will rebuild an identical encrypted drive using parity.
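As a rough illustration of why encryption makes no difference to parity, here is a minimal Python sketch of single (XOR) parity. The sector contents and the `xor_parity` helper are made up for the example; this is not Unraid's actual code, and the second parity disk in a dual-parity array uses a more involved calculation that is not shown here.

```python
from functools import reduce

def xor_parity(sectors):
    """XOR equal-length byte strings together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

# Hypothetical sector contents. disk2 stands in for an encrypted disk:
# to the parity calculation it is just more opaque bytes.
disk1 = b"plain text data!"
disk2 = bytes.fromhex("9f3c17a2e45b08d6c1f07a2b3d4e5f60")
disk3 = b"\x00" * 16

parity = xor_parity([disk1, disk2, disk3])

# Pretend disk2 died, then rebuild it from the surviving disks plus parity.
rebuilt = xor_parity([disk1, disk3, parity])
```

In this sketch `rebuilt` comes back byte-for-byte equal to `disk2`, ciphertext and all, without the parity code ever needing a key.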

One final observation. Make sure that you have a truly valid reason that you absolutely have to encrypt your data. (Fear of espionage by a government agency would be a good example.) Don't do it on a whim!!!! Look at all other options to secure the data first. If you lose or misplace that pass phrase, you will never recover your data. On the other side of the coin, if you make that pass phrase easy to locate by making too many copies of it or putting it someplace where it can be easily found, you have defeated the whole purpose of encrypting in the first place.

As for your other question about data recovery from a URE during a parity rebuild of a disk: with dual parity, you should be able to get by single ones and most dual ones. The exception would be if two of the disks in the parity operation had problems in the identical location on their surface. But I must also point out that Unraid parity is not a data backup. You need another backup of all critical data besides your Unraid server. Data read errors are not the only reason that you can lose data. Fire, lightning, theft, and flooding are examples.

NasOnABudget · September 12, 2020

UREs

13 hours ago, Frank1940 said:

You could have a single block/sector bad or it could be a third of the disk. At the actual data density rate on a large capacity HD, it is actually quite unlikely that only a single bit or byte is bad.

As far as I can tell this is really the only thing that does not line up with what I posted? The hardware information about the background of the disks is cool, though I'm not sure the obfuscation of the inner workings of drives is all that detrimental to me, because the actual methods employed don't really matter to the end user, only the resulting conditions. We have an interface, where we know what we can get out for what we put in. That doesn't mean this isn't interesting, but it is incidental to my queries.

A partial problem with this, though, is that the resulting conditions are unknown. Looking into UREs previously and now, the only way you can get meaningful statistics on how many UREs occur is through working at a large company under NDA or having access to a statistically relevant number of drives. What you said about a bad sector/block sounds reasonable, but a third of the disk? That sounds like catastrophic disk failure, like a head crash or sandpaper intrusion. A statistical anomaly covered by parity drives or backups (which I will get to later).

As a side note, a Reddit comment from someone claiming to be an Oracle employee said something quite interesting. There was a part about reading all of the data more frequently leading to higher chances of fixing errors before they are even registered to the user, sort of related to the inner workings of the drive and the basically magic proprietary error correcting/read technologies that go into them. In the same chain, they also talk about how many errors are hidden until the spare disk space hidden from the user is used up. All sort of incidental to what I'm really looking for but interesting nonetheless.

Encryption

Ok, so encryption. How the parity works is all well and good, and I looked through all the articles on the wiki I could find on the topic before making this post to see if this was already answered. What I'm curious about is the interaction between encryption and data corruption: basically, what the effects are of what I gather to be the most likely type of data integrity issue. For a video file, you would usually just see a blip, so my practical question is whether encryption, particularly the commonplace block-level encryption in AES-256, will exacerbate that issue in any meaningful way.
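To put rough numbers on that question, here is a back-of-the-envelope sketch in Python. It assumes a read error costs a whole drive sector (which matches the bad sector/block description above) and a 4096-byte sector; 512-byte sectors are also common. The function name is made up for illustration.

```python
AES_BLOCK_BYTES = 16   # AES always works on 128-bit (16-byte) blocks
SECTOR_BYTES = 4096    # assumption: 4K sectors; some drives use 512

def aes_blocks_in_lost_sector(sector_index, sector_bytes=SECTOR_BYTES):
    """Return the range of AES block numbers that fall inside one lost sector."""
    first_byte = sector_index * sector_bytes
    first_block = first_byte // AES_BLOCK_BYTES
    blocks_per_sector = sector_bytes // AES_BLOCK_BYTES
    return range(first_block, first_block + blocks_per_sector)

lost = aes_blocks_in_lost_sector(10)
blocks_lost_4k = len(lost)                                   # 4096 / 16
blocks_lost_512 = len(aes_blocks_in_lost_sector(10, 512))    # 512 / 16
```

Under these assumptions one bad 4K sector already spans 256 AES blocks (32 for a 512-byte sector), so the unit of loss would be set by the sector size rather than by the 128-bit block size.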

As for losing the encryption key, it shouldn't be too hard to avoid losing it by keeping multiple separate backups. The point about access to it is absolutely valid; however, my reason for using encryption isn't that I'm worried about any high intel or trade secrets, but that my server is used as a backup for other devices. The thought is that if I become the victim of theft, I don't also need to worry about becoming the victim of identity theft, at least due to digital data. Now, for my portable devices this is way more important, and is the reason I have BitLocker enabled on laptops, as I imagine that theft in a public area is far more likely than theft at home. At home it's less important, given that if someone has access to your home, they probably have access to a lot of the information they might want anyway, but crucially not things like credit cards or email passwords.

I imagine that so long as a potential bad actor does not have access to my online presence, that's job done for data security. If a government entity, large corporation or anyone else wants at my backups, Linux distros and general media, well, firstly, they are going to get it no matter what unless I devote my life to data security, and secondly, I'll have bigger problems than my identity being stolen.

Backups

Firstly, I want to say you are 100% right: parity is not a backup, RAID is not a backup, and backups in the same location aren't the most useful backups. 3-2-1 is absolutely a great standard to keep. That being said, I am not an enterprise. Only some of my data is important enough for me to be able to financially justify backing it up using the 3-2-1 technique, and so I only do that with that data. In a perfect world, absolutely, and indeed I plan on building a more or less identical server when reasonable to function solely as a backup to this one (which would add to my protections, malware protection, and true lightning strike protection), but that's not currently possible.

The risk of losing this data (relatively small) and the financial penalty for dedicating funds towards this insurance don't balance out currently. I realize that for some people with less data or more money, that may sound ridiculous, but unfortunately not everyone is in that position, so compromise must be made here as with every area in life.

To give some perspective, I have about ~50TB of data that I would need to find 3 homes for (you can see my in progress build in this post). Doing some rough estimates, and leaving some room for expansion, following the 3-2-1 method would cost about as much as a (low end) new car, which is an unfathomable amount to spend on data storage. That's simply not feasible. Once I myself am properly insulated from the life-altering events listed, I'll do so for my unimportant (non show stopper) data as well.

The amount of data that is more important, though (PC backups; I don't really hold sentimental value with any of my other data), is much smaller, and I plan to adhere to 3-2-1 with this data, likely with an online backup solution at much more manageable prices via more hands-off services (ex. Backblaze + this server).

To put it another way, second priority data (~48TB out of 50) simply will have to be second class, while priority data can follow 3-2-1.

Reading through various forums, I see many users are in this situation and are often made to feel bad, perhaps due to a lack of elaboration and explanation, so I thought I'd include that, if only to say: yes, I understand best practices, and they are important, particularly for data related to income, clients etc., but sometimes best practices aren't the best (or most feasible) options.

Ok so that was long, so let me give a short summary of the above.

3-2-1 on ~49TB of non essential, non income related data is a small new car (completely out of budget/reason)
3-2-1 on ~1TB of essential, still non income related data is doable and I'm doing it.

With that out of the way, let's talk parity.

Parity

Ok, so one thing I'm very unclear on is what Unraid parity can and cannot do with regards to UREs. I understand that it can indeed recreate a drive's contents using parity with up to 2 drive failures. What isn't clear is whether or not it can deal with UREs in a way that recovers data. I think some systems are able to make a best guess out of 3 sources (ex. ZFS raidz2/3), but as far as I've read, Unraid is not able to do this, and while it won't fail a rebuild due to bad data, it also won't be able to fill in the gaps.

That isn't the end of the world for me, as like I've said, none of my data (except that which is backed up multiple times) needs to be absolutely 100%, but I would still like to know authoritatively what current Unraid does, and if encryption worsens it.

Since I haven't already said it, thanks for the lengthy reply. I found the wiki link interesting.

Edited September 12, 2020 by NasOnABudget
Spacing

Frank1940 · September 12, 2020

58 minutes ago, NasOnABudget said:

As a side note, a Reddit comment from someone claiming to be an Oracle employee said something quite interesting. There was a part about reading all of the data more frequently leading to higher chances of fixing errors before they are even registered to the user, sort of related to the inner workings of the drive and the basically magic proprietary error correcting/read technologies that go into them. In the same chain, they also talk about how many errors are hidden until the spare disk space hidden from the user is used up. All sort of incidental to what I'm really looking for but interesting nonetheless.

This is true in typical usage on a 'normal' PC. All of the unused/empty space on a HD is unchecked until something saves a file to it. With a drive that is in the Unraid array, that is not the case. Every time any parity operation is performed, every single byte of every disk in that array will be read. Most knowledgeable Unraid users run periodic parity checks. Many will do it monthly.

58 minutes ago, NasOnABudget said:

Ok, so one thing I'm very unclear on is what Unraid parity can and cannot do with regards to UREs. I understand that it can indeed recreate a drive's contents using parity with up to 2 drive failures. What isn't clear is whether or not it can deal with UREs in a way that recovers data. I think some systems are able to make a best guess out of 3 sources (ex. ZFS raidz2/3), but as far as I've read, Unraid is not able to do this, and while it won't fail a rebuild due to bad data, it also won't be able to fill in the gaps.

From what I understand, Unraid will correctly rebuild any data using its parity calculation. That makes the assumption that the number of disks with error conditions meets the parity correction requirement, i.e., one failed disk (the one being rebuilt) with single parity, or two failed disks (the one being rebuilt and one still in the array) with dual parity. I have been following this forum for years and I have not heard of a single case where every single byte on a failed disk was not recovered.

Yes, you can have data loss if a second disk has a problem during a rebuild with single parity. Yes, you can lose data if you have a problem with two additional disks with dual parity.

In fact, that was the principal reason that dual parity was wanted by many Unraid users. It was to be able to rebuild a single data disk when there was an undetected problem on a second disk that would only be discovered during the rebuild operation. It was not added to permit a sloppy administrator to run a server until he had two failed disks before he was going to take action!

58 minutes ago, NasOnABudget said:

...but I would still like to know authoritatively what current unraid does, and if encryption worsens it.

One important concept is that Unraid is not rebuilding files when it does a disk rebuild. It is actually rebuilding the disk sector-by-sector. The finished disk will be a mirror image of the original disk. Since you seem to be from Missouri, I will ping @limetech and @jonp and see if they will add anything.

Edited September 12, 2020 by Frank1940

Frank1940 · September 12, 2020

OH, one thing I must say for completeness. If a file was corrupted on an array disk and parity on the array is correct, if the disk is rebuilt, the file will still contain the original corruption after the disk is rebuilt.

Having said that I firmly believe that most file corruption occurs before the file is written to the disk. If a file should get corrupted after being written to the disk, it should be unreadable. (Basically, the error detection/correction routines should find it, and either fix it correctly or throw an error!)

Edited September 12, 2020 by Frank1940

NasOnABudget · September 13, 2020

22 hours ago, Frank1940 said:

Since you seem to be from Missouri

lol I have no idea where you got that from ha. I'm in Canada.

22 hours ago, Frank1940 said:

Most knowledgeable Unraid users run periodical parity checks.

Hmm, sort of unraid scrubbing I suppose. Though, unlike btrfs scrubbing I imagine by default with XFS there is no data repair that goes on when problems are discovered.

So if I'm not missing something, a corrupted disk, if not outright failing, could pass off bad data as the true data, which would then migrate to the replacement disk as well.

Unless throwing an error implies something more that I'm not thinking of.

Edited September 13, 2020 by NasOnABudget

JonathanM · September 13, 2020

6 hours ago, NasOnABudget said:

On 9/12/2020 at 7:32 AM, Frank1940 said:

Since you seem to be from Missouri

lol I have no idea where you got that from ha. I'm in Canada.

https://www.sos.mo.gov/archives/history/slogan.asp

6 hours ago, NasOnABudget said:

So if I'm not missing something, a corrupted disk, if not outright failing, could pass off bad data as the true data

If a process stores a 1, and the disk returns a confident 0, then yes, you will have problems.

Thing is, that's not the way disks work, as Frank tried to tell you. If the disk is having issues, it returns a read error, which triggers unraid to calculate from all the remaining disks what SHOULD be there, and write the correct data back to the disk. If the write succeeds, a read error is shown, and life moves on. If the disk returns a write error, then unraid fails the disk and all further operations to that slot happen purely from emulated data.
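The sequence described above can be sketched roughly as follows. This is only a toy model in Python with single XOR parity; the `Disk` class, the `read_sector` function, and the error types are hypothetical names for illustration, not Unraid's driver code.

```python
from functools import reduce

class ReadError(Exception): pass
class WriteError(Exception): pass

class Disk:
    """Toy disk: a list of sectors, some of which refuse reads or writes."""
    def __init__(self, sectors, bad_reads=(), bad_writes=()):
        self.sectors = list(sectors)
        self.bad_reads, self.bad_writes = set(bad_reads), set(bad_writes)
        self.disabled = False

    def read(self, n):
        if n in self.bad_reads:
            raise ReadError(n)
        return self.sectors[n]

    def write(self, n, data):
        if n in self.bad_writes:
            raise WriteError(n)
        self.sectors[n] = data
        self.bad_reads.discard(n)  # toy assumption: a good rewrite fixes the sector

def xor_all(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def read_sector(array, idx, n):
    """Read sector n of array[idx]; array holds the data disks plus one parity disk."""
    disk = array[idx]
    others = [d for i, d in enumerate(array) if i != idx]
    if disk.disabled:                      # slot is emulated from the other disks
        return xor_all([d.read(n) for d in others])
    try:
        return disk.read(n)
    except ReadError:
        # Work out what SHOULD be there (a second failure here is not handled).
        data = xor_all([d.read(n) for d in others])
        try:
            disk.write(n, data)            # write the correct data back
        except WriteError:
            disk.disabled = True           # write failed: fail the disk
        return data

# Two data disks plus parity, one sector each.
d0, d1 = b"\x11\x22", b"\x0f\xf0"
p = xor_all([d0, d1])
flaky = Disk([b"??"], bad_reads={0})                    # read fails, rewrite works
recovered = read_sector([flaky, Disk([d1]), Disk([p])], 0, 0)
dying = Disk([b"??"], bad_reads={0}, bad_writes={0})    # rewrite fails too
array2 = [dying, Disk([d1]), Disk([p])]
emulated = read_sector(array2, 0, 0)
```

In both cases the caller gets the original contents of `d0` back; the only difference is whether the disk stays in service or the slot is emulated from then on.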

A parity check is just reading the data from all the devices and making sure the maths are correct. A read error during a parity check is handled the exact same way as normal operations, as I said in the above paragraph.
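For single parity, "making sure the maths are correct" amounts to something like the sketch below. It is an illustration only: disks are modelled as Python lists of equal-sized sectors, and `parity_check` is a made-up name, not an Unraid function.

```python
from functools import reduce
from operator import xor

def parity_check(data_disks, parity_disk):
    """Return the sector numbers where stored parity differs from the XOR of the data."""
    mismatches = []
    for n, stored in enumerate(parity_disk):
        column = zip(*(disk[n] for disk in data_disks))
        expected = bytes(reduce(xor, byte_group) for byte_group in column)
        if expected != stored:
            mismatches.append(n)
    return mismatches

# Two data disks with two sectors each; parity sector 1 is deliberately wrong.
data_disks = [[b"\x01\x02", b"\xaa\xbb"], [b"\x10\x20", b"\x0f\xf0"]]
parity_disk = [b"\x11\x22", b"\x00\x00"]
bad = parity_check(data_disks, parity_disk)
```

Note that a mismatch only says that the data and parity disagree at that sector; the XOR alone cannot tell which disk holds the wrong bytes.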

Data corruption occurs elsewhere in the chain of custody. Bad RAM, untidy shutdown, buggy application code, and the like are what can cause what should have been a 1 to be a 0 or vice-versa. Once the data has been sent to the disks, it's either returned as it was sent, or a disk error is generated that causes Unraid to reconstruct it. Writes are sent to the data slots first then the parity slots, so if the power is pulled during a write, it's necessary to do a parity check to make sure everything that went to the data drives also updated parity.

NasOnABudget · September 15, 2020

On 9/13/2020 at 12:07 PM, jonathanm said:

Thing is, that's not the way disks work, as Frank tried to tell you. If the disk is having issues, it returns a read error, which triggers unraid to calculate from all the remaining disks what SHOULD be there, and write the correct data back to the disk. If the write succeeds, a read error is shown, and life moves on. If the disk returns a write error, then unraid fails the disk and all further operations to that slot happen purely from emulated data.

Is there any verification of this by unraid?

I've seen multiple people describe what Unraid does in these sorts of situations differently, and I was really hoping for a more definitive answer.

Maybe I should send them an email.

Edited September 15, 2020 by NasOnABudget

Squid · September 15, 2020

3 hours ago, NasOnABudget said:

Is there any verification of this by unraid?

I've seen multiple people describe what Unraid does in these sorts of situations differently, and I was really hoping for a more definitive answer.

Maybe I should send them an email.

Where? That is exactly what happens...

trurl · September 15, 2020

On 9/13/2020 at 12:07 PM, jonathanm said:

If the disk is having issues, it returns a read error, which triggers unraid to calculate from all the remaining disks what SHOULD be there, and write the correct data back to the disk. If the write succeeds, a read error is shown, and life moves on. If the disk returns a write error, then unraid fails the disk and all further operations to that slot happen purely from emulated data.

This is the way Tom described it in some post somewhere years ago, good luck on coming up with the best search terms to find it😉 I don't know what they say on reddit.

trurl · September 15, 2020

3 minutes ago, trurl said:

This is the way Tom described it in some post somewhere years ago

Here is one example:

NasOnABudget · September 15, 2020

5 hours ago, trurl said:

Here is one example:

Thanks, that is pretty definitive. Now I really only have one question left: Does encryption affect the recoverability of data at all? I suppose, though, that that could be a question all on its own.

Edited September 15, 2020 by NasOnABudget

JonathanM · September 15, 2020

45 minutes ago, NasOnABudget said:

Does encryption affect the recoverability of data at all? I suppose though, that that could be a question all on its own.

Yes on both counts.

It's a totally separate question, and encrypted disks are basically unrecoverable if corrupted. None of the normal recovery tools that I'm aware of can even mount a damaged encrypted filesystem. It's imperative that any system using encryption be extremely robust and error free.

Backup, backup, backup.
