How do you manage data on a failing array drive?

matt15k · July 4

Hello, I understand similar questions have been asked, but I am still uncertain about this after hours and hours of research.

I recently added another refurbished server-grade 18TB HDD to my Unraid array, and it passed the initial preclear successfully (1 round of pre-read, clear, and post-read), followed by a successful extended SMART test. Then I moved ~8TB of files from another full disk to the new disk using Krusader. After this move, the new disk is showing "Reallocated sector count: 2", "Current pending sector: 42" and "Offline uncorrectable: 20", and a new extended SMART test shows "Completed: read failure".

This looks to be a bad HDD, so I am going to return it for another one under the seller warranty, but I am mainly concerned about the integrity of the data I copied to the new disk. Some of the files were backed up outside the array, so I am restoring them from the backup to the original disk on the array. The other files were new downloads (online videos), and I am not finding any way to locate which files (if any) may have been damaged by this disk failure. In the meantime I plan to copy those video files back to another disk and scan them with CorruptVideoFileInspector from a Windows VM. My hope is that if everything passes, I can assume no files were damaged. Any other suggestions are welcome.
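For the files that do have a backup, one way to check the copies on the suspect disk is to hash both sides and compare. The following is only a minimal sketch of that idea, not an Unraid feature: `compare_trees` and `sha256_of` are hypothetical helper names, and the two directory arguments are whatever paths the backup and the suspect disk happen to be mounted at.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(backup_root, suspect_root):
    """Compare every file under backup_root with the same relative path
    under suspect_root. Returns (mismatched, missing, unreadable) lists
    of relative paths."""
    backup_root, suspect_root = Path(backup_root), Path(suspect_root)
    mismatched, missing, unreadable = [], [], []
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue
        rel = backup_file.relative_to(backup_root)
        candidate = suspect_root / rel
        if not candidate.is_file():
            missing.append(str(rel))
            continue
        try:
            if sha256_of(candidate) != sha256_of(backup_file):
                mismatched.append(str(rel))
        except OSError:
            # An I/O error while reading either copy lands here.
            unreadable.append(str(rel))
    return mismatched, missing, unreadable
```

Calling `compare_trees` with the backup location first and the suspect disk second returns three lists. Anything in `mismatched` or `unreadable` is a candidate for restoring from the backup. This only helps for files that have a known-good copy somewhere; it says nothing about the new downloads.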

Overall, what is the best practice when this type of drive error happens? The array has dual parity which I know is only for total disk failure, but it seems strange there are no protections available for the situation I have encountered here. I back up my important files, but the media files are too large unless I build a second Unraid array. If this suddenly happens to another disk in the future within the array, or to a parity disk, it does not seem practical to restore everything from backup over a few failed disk sectors. That is why I am wondering if there is a way to know if any files got affected by this (to put my mind at ease instead of manually redownloading TBs of files), and how to treat situations like this in the future.

Thanks in advance.

JonathanM · July 5

If the error checking and correcting algorithm built into the failing hard drive was working properly (and it should be, the probability of a media error being reported correctly but handled incorrectly is exceedingly small), then when data was sent to the drive, it either wrote properly and returned a success, or it failed the write and returned a fail. When Unraid gets a message from the drive saying a write failed, the drive is immediately disabled, and all further writes are committed only to the parity emulated drive. Drives are smart though: when a write to a bad sector fails, the data is not immediately tossed. The drive retires the bad sector and writes the data to a spare sector instead, which is what the reallocated sector count tracks. It's only if the drive is incapable of writing the data anywhere that it fails a write.

If in the future a sector fails a read, first the drive tries again a BUNCH of times, and if it fails, it sends that to the OS. Unraid records that failure, calculates what should be there from the parity equations, and writes the data back to the drive. If the drive is successful in writing the data, Unraid continues as normal, and the only indication you get is an extreme slowdown and the error counter incremented in the Main GUI. If the write fails, see above.
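To make "calculates what should be there from the parity equations" concrete, here is a rough sketch. It assumes the first parity disk holds a plain XOR of the same sector on every data disk; the second parity disk uses different math that is not shown. The function names and the 4-byte "sectors" are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving_blocks, parity_block):
    """Recover the one unreadable block from the surviving data blocks
    plus the parity block (XOR parity tolerates one missing block)."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Three data disks, one tiny "sector" each.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is computed when the data is written.
parity = xor_blocks([disk1, disk2, disk3])

# Later, disk2 reports a read error for this sector. Its contents are
# recomputed from the other disks and parity, then written back.
recovered = rebuild_block([disk1, disk3], parity)
assert recovered == disk2
```

The key point is that this only kicks in when the drive admits it cannot read the sector. If a drive silently returned wrong data without reporting an error, XOR parity alone could tell that something is inconsistent but not which disk is wrong.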

So, the tldr is, your data is fine.

22 hours ago, matt15k said:

The array has dual parity which I know is only for total disk failure, but it seems strange there are no protections available for the situation I have encountered here.

The drives themselves provide that protection by reporting back when data is unable to be read or written correctly. Parity provides the safety net when the drive gives up.

matt15k · July 22

Thanks for the reply.

On 7/5/2024 at 5:11 PM, JonathanM said:

If in the future a sector fails a read, first the drive tries again a BUNCH of times, and if it fails, it sends that to the OS. Unraid records that failure, calculates what should be there from the parity equations, and writes the data back to the drive. If the drive is successful in writing the data, Unraid continues as normal, and the only indication you get is an extreme slowdown and the error counter incremented in the Main GUI.

Where can I read more about this? Everything I have read about data integrity in Unraid has said the parity protection is only for emulating a failed disk, not restoring incorrect or corrupted data, so I'd love to learn more details about how parity drives in the Unraid array protect against disk read errors. I have been considering moving my XFS array over to a ZFS pool and wondering if its stronger data integrity protections would help avoid similar issues in the future.

JonathanM · July 22

4 hours ago, matt15k said:

restoring incorrect or corrupted data, so I'd love to learn more details about how parity drives in the Unraid array protect against disk read errors.

There is a disconnect in that statement: disk read errors (which parity handles) are not the only cause of incorrect or corrupt data.

Yes, if you have a disk that acts up in an unprotected environment (no RAID or parity), then a disk read error can lose data. Unraid takes that out of the equation if you use the parity array or a pool with a redundant RAID level.

Most corrupt data occurs upstream while it's being manipulated in memory, or due to an incomplete write from a power failure or crash. The drive system can only store and retrieve what is sent to it: garbage in, garbage out.

It's the file system that's responsible for finding and correcting those types of issues.
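Checksumming file systems do this kind of checking per block. On a file system without data checksums, a rough per-file approximation is to record a hash for every file while the data is known to be good and re-check it later. The sketch below shows only that idea, with hypothetical helper names `build_manifest` and `verify_manifest`; it can detect a changed file but cannot repair one.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's relative path under root to its current hash."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing, unreadable, or whose
    contents no longer match the recorded hash."""
    root = Path(root)
    problems = []
    for rel, expected in sorted(manifest.items()):
        target = root / rel
        try:
            if not target.is_file() or file_digest(target) != expected:
                problems.append(rel)
        except OSError:
            # The drive reported an error while the file was being read.
            problems.append(rel)
    return problems
```

The manifest is a plain dict, so it can be saved with `json.dump` and kept on a different disk. A file that was edited on purpose will also show up as changed, so this fits best for media that is written once and never modified.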
