Bad Copy on Failing Disk

October 27, 2014

I recently had a drive fail and be kicked from the array in the middle of having files moved to it.

I was using Teracopy, and had it doing checksum comparison.

One of the files failed the CRC check (I presume it was the file being copied at the time the drive was "red balled"). The CRC was calculated on the source file as well as on the simulated disk's copy of the file.

Looks like unRAID was not able to seamlessly switch to drive simulation mode without losing data.

October 27, 2014

Looking at WeeboTech's comment in your other thread, it seems that what happened is that a sector or two was written (and failed) before the drive was red-balled ... so the drive emulation didn't kick-in until there was already bad data.

Not sure if the reassigned drive letter also impacted that ... but at a minimum that would explain why the CRC failed when reading the file.

As a matter of interest, did the REST of the copies continue and pass the validation? If so, then that's almost certainly what happened ... and UnRAID did its magic just fine, since the rest of the copies would be to the emulated disk rather than the physical unit.

October 27, 2014

Was the bad file in your md5sum catalog? Can you verify it that way?

This really strengthens my resolve to have the hash sum catalog both as a database and attached to the extended attributes.

In addition, I'd like to have a folder.hash and possibly a folder.par2 file in each folder.

October 27, 2014

In my experience, if TeraCopy says it was a bad copy, it's a bad copy.

But I agree it's nice to have the ability to confirm that. All of my folders have a <FolderName>.hash file generated by the Corz Checksum utility. I can check any folder with a simple right-click and "Verify checksums."
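For anyone without a checksum utility handy, the per-folder hash file idea can be sketched in a few lines of Python. This is only an illustration: the `folder.hash` name, the `digest  filename` line layout, and the function names are assumptions made for the example, and it is not the file format that Corz Checksum actually writes.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading in chunks so large files are fine."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder, manifest_name="folder.hash"):
    """Write one 'digest  filename' line per regular file; return the file count."""
    folder = Path(folder)
    lines = []
    for entry in sorted(folder.iterdir()):
        # Skip subfolders and the manifest itself.
        if entry.is_file() and entry.name != manifest_name:
            lines.append(f"{md5_of(entry)}  {entry.name}")
    (folder / manifest_name).write_text("\n".join(lines) + "\n")
    return len(lines)


def verify_manifest(folder, manifest_name="folder.hash"):
    """Return the names of files that are missing or no longer match their MD5."""
    folder = Path(folder)
    bad = []
    for line in (folder / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        target = folder / name
        if not target.is_file() or md5_of(target) != expected:
            bad.append(name)
    return bad
```

Run `write_manifest` once while the files are known to be good, then `verify_manifest` after a copy or a rebuild; an empty list means every recorded file still matches.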

October 27, 2014

Author

I did not go back to my md5 catalog. I trust that if Teracopy says it didn't verify that it didn't. I have been doing mass amounts of moving data around using Teracopy, and this is the first time there has been a mismatch, right as a disk was being kicked. (All other files verified). I believe this is a bug that unRAID does not update parity on the write that kicks the disk from the array - so that parity is off by that I/O.

After the rebuild I moved the source file overwriting the bad file. In retrospect I wish I had saved the bad file and compared to the original to determine the size of the block that mismatched.

Here is a post containing the syslog of the failure if this is helpful in researching the issue.

SYSLOG

October 27, 2014

This isn't to say I do not trust Teracopy. I do. It was more of an end-point diagnostic for proof of concept.

After the rebuild I moved the source file overwriting the bad file. In retrospect I wish I had saved the bad file and compared to the original to determine the size of the block that mismatched.

*sigh* I was hoping you still had both, checked the md5 on both of them and compared the data. cmp shows you what byte(s) start to differ.
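If both copies of a file are still around, the comparison described above can also be done with a short script that reports where the files diverge and how many bytes are affected. This is a rough sketch, not a replacement for cmp; the function name `diff_extent` and its return shape are made up for the example.

```python
def diff_extent(path_a, path_b, chunk_size=1 << 20):
    """Compare two files byte by byte.

    Returns None if they are identical, otherwise a tuple
    (first_offset, last_offset, differing_byte_count) using 0-based offsets.
    Bytes present in only one file count as differing.
    """
    first = last = None
    count = 0
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break
            if chunk_a != chunk_b:
                common = min(len(chunk_a), len(chunk_b))
                for i in range(common):
                    if chunk_a[i] != chunk_b[i]:
                        if first is None:
                            first = offset + i
                        last = offset + i
                        count += 1
                longer = max(len(chunk_a), len(chunk_b))
                if longer > common:
                    # One file ended early; the tail of the other has no match.
                    if first is None:
                        first = offset + common
                    last = offset + longer - 1
                    count += longer - common
            offset += max(len(chunk_a), len(chunk_b))
    if first is None:
        return None
    return first, last, count
```

A damaged region that starts and ends on sector-sized boundaries (512 or 4096 bytes) would point toward a lost or stale sector write rather than random bit errors.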

October 27, 2014

... I believe this is a bug that unRAID does not update parity on the write that kicks the disk from the array - so that parity is off by that I/O.

Not sure I agree with that. I think UnRAID likely calculated that parity just fine, as the sector that failed the write was likely then written to the emulated disk after the actual disk was red-balled. I think the parity mismatch is due to the sectors that apparently had bad writes that were unreported just prior to the red-balling of the disk.

October 28, 2014

Author

... I believe this is a bug that unRAID does not update parity on the write that kicks the disk from the array - so that parity is off by that I/O.

Not sure I agree with that. I think UnRAID likely calculated that parity just fine, as the sector that failed the write was likely then written to the emulated disk after the actual disk was red-balled. I think the parity mismatch is due to the sectors that apparently had bad writes that were unreported just prior to the red-balling of the disk.

Whether the write hit the physical disk or not, parity should have been updated correctly.
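To make the parity argument concrete, here is a toy model of single XOR parity. It is only a sketch of the general technique, assuming one parity block that is the XOR of all data blocks; it says nothing about how the unRAID driver is actually written, and the function names are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def emulate_missing(parity, surviving_blocks):
    """Rebuild the missing disk's block from parity and the surviving disks."""
    return xor_blocks([parity, *surviving_blocks])


def emulated_read_after_failed_write(old_blocks, failed_index, new_block,
                                     parity_updated):
    """Return what an emulated read yields after a write that failed physically.

    old_blocks holds every data disk's block before the write. If
    parity_updated is True, parity reflects new_block even though the
    physical write to the failed disk never landed.
    """
    survivors = [b for i, b in enumerate(old_blocks) if i != failed_index]
    if parity_updated:
        parity = xor_blocks([new_block, *survivors])
    else:
        parity = xor_blocks(old_blocks)
    return emulate_missing(parity, survivors)
```

In this model, if parity is updated for the write that kicks the disk, the emulated read returns the new data. If that one parity update is skipped, the emulated read returns the old contents of the block, which is the kind of mismatch a post-copy verify would flag.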

October 28, 2014

Agree ... which is why it would have been very interesting to see if the emulated file had a good MD5 (it should have). It does, however, raise the question as to why TeraCopy validation failed.

Sounds like UnRAID may have read from the physical disk instead of the emulated one ... something I wouldn't expect after the drive has been flagged with a red ball.

October 28, 2014

I've had TeraCopy CRC failures occur, albeit very rarely, but there were never disk errors.

I've always re-copied, never thought of validating things. Soon I'll have the means to.

October 28, 2014

Author

The way Teracopy works is it computes the CRC on the source as it does the copy and then, when the copy is complete, it reads the destination to compute its CRC. So there is no way that the destination CRC was computed reading the physical disk. It was reading the simulated disk.
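The copy-then-verify sequence described above looks roughly like this in Python. It is a sketch that follows the description in this post, not TeraCopy's actual code; the function name and the choice of CRC-32 from `zlib` are assumptions for the example.

```python
import os
import zlib


def copy_with_crc_verify(src, dst, chunk_size=1 << 20):
    """Copy src to dst, then re-read dst and compare CRC-32 values.

    Returns (matches, source_crc, destination_crc).
    """
    source_crc = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            # The source CRC is accumulated while the data is being copied.
            source_crc = zlib.crc32(chunk, source_crc)
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())

    destination_crc = 0
    with open(dst, "rb") as fcheck:
        # Second pass: only after the copy finishes is the destination read back.
        for chunk in iter(lambda: fcheck.read(chunk_size), b""):
            destination_crc = zlib.crc32(chunk, destination_crc)

    return source_crc == destination_crc, source_crc, destination_crc
```

One caveat if this is run on the server itself: the read-back may be served from the page cache rather than the disk. Over a network share, as in the case described here, the read-back goes through the server and reflects what the array returns for the file.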

Gary - not sure why you want to doubt this bug. Thinking up highly unlikely/impossible scenarios why it isn't a bug is not very useful. Clearly the copy I was doing was inaccurate because a disk failure occurred in the middle. Let LimeTech do its research. I'm sure they have a technique to simulate a failure. I have long suspected this issue due to a similar problem I experienced long ago. But there was no Teracopy verification failure to prove it. The reason I use Teracopy is not because I doubt my disks or network, it is to ensure each file is accurately moved if something weird like this happens.

Users should be aware of this issue and take proper precautions.

October 28, 2014

Not doubting the bug ... just trying to understand what might have occurred. What "long ago" issue did you have that would account for this?

October 28, 2014

Author

I don't remember all the details but remember I discovered my media file had a giant glitch in the middle, and was able to re-rip the movie. The new file was exactly the same size and did not glitch at all. I thought that the file might have been corrupted in rebuilding a failed disk that had occurred relatively recently but there were a lot of possibilities and I didn't know for sure. Since then I've been pretty religious about copying with Teracopy or doing md5 comparisons.

This was the first red ball since then.

August 3, 2015

The only way corruption like this should be possible is if the failing device returns bad read data, meaning a read command returned with 'success' status but the data it returned was actually bad. This of course would be a serious firmware issue with the device.

There is a way to simulate a write error. At any time you can type this command:

mdcmd set wrerror <N>

where N is the disk number: 0 => parity, 1 => disk1, etc.

After typing this command, the effect is that the very next write command (and only that write command) on that disk will be tagged as "failed" and the code that disables the device will kick in.

August 3, 2015

Agree this seems like an edge case that requires something to fail in the hardware -- either the read data issue you noted or a random memory error. In any event, it doesn't seem like a major issue as long as folks do verifications on their array writes (e.g. with TeraCopy or via MD5 checksums), as it will be detected then and the involved file can be re-written.

August 3, 2015

Author

Thanks Tom! Having had it happen a couple times, forgive a little skepticism, even for the bug free programmer. I will do some testing and let you know if I can coerce this situation.

Gary - I agree that doing the post read verifies is a good practice to avoid this issue, but this does not justify leaving a bug like this in place (if indeed it is a bug). I would not call a corruption that affects data integrity as a disk fails with data being written to it an edge case. It is squarely at the heart of what unRaid does. More testing is needed to confirm this is a true bug or not.

August 3, 2015

I would not call a corruption that affects data integrity as a disk fails with data being written to it an edge case.

It is squarely at the heart of what unRaid does.

Wholeheartedly agree.

More testing is needed to confirm this is a true bug or not.

I think a couple of tests need to occur.

One with the logic of setting the write error in software.

Another with actually turning off the drive being written, and finally with turning off a different drive so the array goes into the fault tolerant state while writes are occurring.

The question is how the system and/or kernel deals with drives that have fallen off the bus and can't get up.

There could be buffer/cache (drive/controller) situations that crop up at the driver/firmware level.

August 3, 2015

Agree this seems like an edge case that requires something to fail in the hardware -- either the read data issue you noted or a random memory error. In any event, it doesn't seem like a major issue as long as folks do verifications on their array writes (e.g. with TeraCopy or via MD5 checksums), as it will be detected then and the involved file can be re-written.

Most of the writes on my server are initiated at the server by applications, or by backup processes running on other computers.

Very little of my server content was put there by me with a copy operation over the network, so hard to see how verification is going to help.

August 4, 2015

Thanks Tom! Having had it happen a couple times, forgive a little skepticism, even for the bug free programmer. I will do some testing and let you know if I can coerce this situation.

Gary - I agree that doing the post read verifies is a good practice to avoid this issue, but this does not justify leaving a bug like this in place (if indeed it is a bug). I would not call a corruption that affects data integrity as a disk fails with data being written to it an edge case. It is squarely at the heart of what unRaid does. More testing is needed to confirm this is a true bug or not.

Agree this needs to be resolved. I simply meant that it requires some very specific set of circumstances (i.e. an edge case) that need to be identified to resolve. But other than the example you noted some time ago in the other thread, I've not seen any reports of this happening, and Tom did a fairly thorough analysis of the driver and can't see any condition that would cause it ... so it's almost certainly an "edge case" that hasn't yet been identified.

Not saying it's not a problem -- just that it's rare (perhaps VERY rare) and that a prudent way to ensure it doesn't "bite" you is to always do verification of the data you store on your array.

Fortunately, a drive failing in the midst of a file being written to it seems to be a very rare event [On the other hand, that's likely why it's so difficult to isolate exactly what happened in the case you noted.].

August 4, 2015

Thanks Tom! Having had it happen a couple times, forgive a little skepticism, even for the bug free programmer. I will do some testing and let you know if I can coerce this situation.

Gary - I agree that doing the post read verifies is a good practice to avoid this issue, but this does not justify leaving a bug like this in place (if indeed it is a bug). I would not call a corruption that affects data integrity as a disk fails with data being written to it an edge case. It is squarely at the heart of what unRaid does. More testing is needed to confirm this is a true bug or not.

Agree this needs to be resolved. I simply meant that it requires some very specific set of circumstances (i.e. an edge case) that need to be identified to resolve. But other than the example you noted some time ago in the other thread, I've not seen any reports of this happening, and Tom did a fairly thorough analysis of the driver and can't see any condition that would cause it ... so it's almost certainly an "edge case" that hasn't yet been identified.

Not saying it's not a problem -- just that it's rare (perhaps VERY rare) and that a prudent way to ensure it doesn't "bite" you is to always do verification of the data you store on your array.

Fortunately, a drive failing in the midst of a file being written to it seems to be a very rare event [On the other hand, that's likely why it's so difficult to isolate exactly what happened in the case you noted.].

I don't think it's so edge case. It's just that no one has used the tools that reveal it.

I've had drives fail on writes.

With recent tools like bunker and bitrot, the problem may be revealed more.

Might be worthwhile to look at the whole cache/mover subsystem and create the hash extended attributes so they are stored with files ongoing.

August 4, 2015

...Might be worthwhile to look at the whole cache/mover subsystem and create the hash extended attributes so they are stored with files ongoing.

This would only help for cached writes. I don't use that feature.

August 4, 2015

Author

I simply meant that it requires some very specific set of circumstances (i.e. an edge case) that need to be identified to resolve.

I guess I object to the term "edge case" related to this issue.

UnRaid is in the business of handling failed disks. Whereas for Excel, smoothly handling a failed disk might be considered an edge case, for unRAID it is the main function! You might say unRAID's whole existence is to handle this edge case.

And having a disk fail while data is being copied to it - this is a very normal use case for a failing disk.

I just don't like the labeling which acts to trivialize the problem.

I don't think it's so edge case. It's just that no one has used the tools that reveal it.

I've had drives fail on writes.

I've also had disks fail on writes. I'm sure others have too. Not an edge case.

I discovered the user share copy bug and a number of other issues with unRAID over the years. I have seen this happen 3 times. I am 97% confident this is a real issue. Treat this with skepticism at your own risk.

It is difficult to provide a reproducible scenario. If ANYONE has a red ball in the middle of a copy operation, PLEASE contact me immediately for instructions to gather information that might help identify this issue.

August 4, 2015

...Might be worthwhile to look at the whole cache/mover subsystem and create the hash extended attributes so they are stored with files ongoing.

This would only help for cached writes. I don't use that feature.

I do agree, only for cached writes.

Yet if it could serve to automatically tag files with a checksum, it might be an impetus to use a cache drive.

Teracopy does the checksum before and after. If the cache subsystem did the same, we might capture issues that have never been caught before or set the stage for bitrot verification as part of the unRAID feature set. Frankly, this was a reason I had considered btrfs, but this part of the BTRFS feature set is not stable enough for my purposes.

August 4, 2015

I simply meant that it requires some very specific set of circumstances (i.e. an edge case) that need to be identified to resolve.

I guess I object to the term "edge case" related to this issue.

UnRaid is in the business of handling failed disks. Whereas for Excel, smoothly handling a failed disk might be considered an edge case, for unRAID it is the main function! You might say unRAID's whole existence is to handle this edge case.

And having a disk fail while data is being copied to it - this is a very normal use case for a failing disk.

I just don't like the labeling which acts to trivialize the problem.

This is part of my beef with the commentary.

I was sitting here thinking about it and realized: drives don't usually fail sitting idle. They fail under heavy usage conditions.

I've had drives fail in trayless sata units under heavy usage because the sata cable disconnected from the vibrations.

Used to happen to me all the time on one slot. I would redball anytime I did heavy writes to that drive.

As far as testing conditions, I've proposed them below.

A hardware drop or sata disconnect of a hard drive is a great test for this condition outside of the normal mdcmd wrerror software layer. That may be where the issue lies.

Remember JoeL revealing that badblocks alone cannot be trusted for the integrity of a drive.

i.e. the kernel hides some things, and who knows if the kernel or the drive loses something when a drive fails out from under it.

It could also be controller dependent when an ATA reset is sent.

August 4, 2015

...Might be worthwhile to look at the whole cache/mover subsystem and create the hash extended attributes so they are stored with files ongoing.

This would only help for cached writes. I don't use that feature.

I do agree, only for cached writes.

Yet if it could serve to automatically tag files with a checksum, it might be an impetus to use a cache drive.

Teracopy does the checksum before and after. If the cache subsystem did the same, we might capture issues that have never been caught before or set the stage for bitrot verification as part of the unRAID feature set. Frankly, this was a reason I had considered btrfs, but this part of the BTRFS feature set is not stable enough for my purposes.

Automatic checksums would be nice, but putting it in mover is not going to get the job done for most of my data, and my data doesn't typically get touched by Teracopy either.

Different strokes for different folks I guess.
