Hard Drive Post-Read Verification FAIL


Twisted

Posted

I was running preclear on a new WD Red drive. I bought it to move files over so I could start using unRAID. After I moved the files over, I ran the test and the health looks good, but the Post-Read verification failed. Should I re-run the test?

 

############################################################################################################################
#                                                                                                                          #
#                                        unRAID Server Preclear of disk VJH3TWDX                                           #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 5 - Pre-read verification:                                                  [15:36:43 @ 142 MB/s] SUCCESS    #
#   Step 2 of 5 - Zeroing the disk:                                                       [15:29:55 @ 143 MB/s] SUCCESS    #
#   Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 4 of 5 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#   Step 5 of 5 - Post-Read verification:                                                                          FAIL    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                              Cycle elapsed time: 42:12:29 | Total elapsed time: 42:12:30                                 #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  STATUS                                                                           #
#   5-Reallocated_Sector_Ct      0        -                                                                                #
#   9-Power_On_Hours             136      -                                                                                #
#   194-Temperature_Celsius      40       -                                                                                #
#   196-Reallocated_Event_Count  0        -                                                                                #
#   197-Current_Pending_Sector   0        -                                                                                #
#   198-Offline_Uncorrectable    0        -                                                                                #
#   199-UDMA_CRC_Error_Count     0        -                                                                                #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################

--> FAIL: Post-Read verification failed. Your drive is not zeroed.
Posted

@Twisted -

 

Just seeing this thread.

 

The write verification failure is not something to ignore. A drive should never, ever return data that is different from what was written to it. Here we have a case where zeroes were written to the disk and the very next read returned something else. In theory a random error is remotely possible, but the probability is so close to 0% that you should not assume it was normal in some way. In the real world, this never happens.

 

A memory error comes to mind first. It looks like you ran a 6-hour memory test; I'd run it for at least 24 hours.

 

The other thing that can cause this (in theory, anyway) is cables that are run tightly bundled. This can alter the signal, so the data received can differ from what the drive returned. Again - very rare, but in theory it could happen.

 

Other thoughts include bad cache memory in the drive or controller. I've never heard of such a problem, but something freaky like this could cause data errors leaving the drive.

 

I understand you changed the cables and port, and got a successful preclear. I'm not sure one good preclear would do it for me. I'd run at least several. But I'd also go back to the original port and try to induce it to fail again there. YOU REALLY WANT A REPEATABLE FAILURE SCENARIO; only then do you have a way to know the problem is fixed. Otherwise you can ignore that port forever but never really know if it had anything to do with the problem.

 

I would suggest installing some type of checksum collection and testing protocol. There is a plugin that you might look into to collect and check these. If you get a seemingly random checksum failure, that would alert you to an unresolved issue.
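For anyone who wants to see what such a checksum protocol amounts to, here is a minimal sketch in Python. It is not the plugin mentioned above; the function names (`sha256_of`, `build_manifest`, `changed_files`) are made up for illustration. The idea is to record a SHA-256 hash per file once, then re-hash later and list anything that no longer matches.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(root, manifest):
    """Return relative paths that are missing or whose hash no longer matches."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return bad
```

A mismatch only says the bytes differ from when the manifest was built, so files you edited on purpose will show up too. The interesting case is a file nobody touched that suddenly fails.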

Posted

@SSD Thank you for the reply. I think it may be a cable issue. I did rerun the memory test with no issues and I precleared another drive on the same port with the new cable and had no issues. I just updated to 4.6.1, so I cannot use preclear yet. Once it is available again, I will run it a few more times on the drive to see if I get no errors repeatedly. Thank you for the advice, I will give it a second look.

Posted
6 hours ago, Twisted said:

@SSD Thank you for the reply. I think it may be a cable issue. I did rerun the memory test with no issues and I precleared another drive on the same port with the new cable and had no issues. I just updated to 4.6.1, so I cannot use preclear yet. Once it is available again, I will run it a few more times on the drive to see if I get no errors repeatedly. Thank you for the advice, I will give it a second look.

 

Sounds good. I'd just say that a bad cable does not normally cause altered data. The scenario would have to be that the drive returned zeroes, but somehow the signal was altered without the connection being compromised. Bad or loose cables are pretty common. They can cause drives to be slow and drop offline. They might cause read errors. But a successful read with wrong data is not something I've ever seen. There have only been 2 or 3 cases of verification error I can remember, and bad memory was the cause of them as I recall.

 

I did some of the work on the original preclear script, but am not involved with the plugin. You should be able to run the script independently. I'll see if I can figure out some instructions for you and others that need to run preclear.

Posted

A loose/bad cable can result in bad data. But based on statistics, such a bad cable would have produced a huge number of retransmissions - transfer errors that were caught by the ECC - before one single transfer managed to contain a specific set of erroneous bits that just happened to compute to the expected ECC of a non-corrupt transfer.

 

So it's basically only users who continue to use a system even if they get large numbers of transfer errors that are likely to have the drive or controller card sometimes accept corrupt data as valid.

 

So it's way more likely that invalid data was caused by the drive failing to correctly write the data to the sector - the drive performs the writes without reading back the result. Possibly the write failed because of external events - power glitch, vibration, shock, ... - and possibly because one sector was marginal. If this happens once in a blue moon it's just as expected - statistically, drives do not promise 100% data reliability. If it happens for multiple sectors during the same run, or new sectors keep cropping up soon afterwards, then it's time to be scared because it's then more likely to be something wrong with the drive.

Posted
56 minutes ago, pwm said:

A loose/bad cable can result in bad data. But based on statistics, such a bad cable would have produced a huge number of retransmissions - transfer errors that were caught by the ECC - before one single transfer managed to contain a specific set of erroneous bits that just happened to compute to the expected ECC of a non-corrupt transfer.

 

So it's basically only users who continue to use a system even if they get large numbers of transfer errors that are likely to have the drive or controller card sometimes accept corrupt data as valid.

 

So it's way more likely that invalid data was caused by the drive failing to correctly write the data to the sector - the drive performs the writes without reading back the result. Possibly the write failed because of external events - power glitch, vibration, shock, ... - and possibly because one sector was marginal. If this happens once in a blue moon it's just as expected - statistically, drives do not promise 100% data reliability. If it happens for multiple sectors during the same run, or new sectors keep cropping up soon afterwards, then it's time to be scared because it's then more likely to be something wrong with the drive.

 

It is incredibly more likely that something hardware-wise is going on than that the statistical lottery hit during this preclear.

 

When you hear hoofbeats, think of horses not zebras.

 

But most common hardware issues would not cause this, because the checks and balances in the computer-to-hard-disk chain are very strong and keep bad data from getting through. Bad memory is one plausible reason, and still high on my list of culprits. I had a lightly used Windows HTPC that randomly crashed every couple of months. I ran memory tests a couple of times - all good. Very annoying. Then I read that it is good to run a 24-hour test and gave it a try. It finally found an error near the 18-hour mark. Replacing the chip solved my infrequent crashing problem.

 

If a post-read verify failure occurs, my first instinct would be to do a second verify (at least on the area of the disk that failed the first time). Then I'd know if it was the write or the read that was the problem. If it was the write, I'd be looking suspiciously at the drive. If it was the read, I'd be looking suspiciously at the RAM. But either drive or RAM could be the cause of either.
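A second verify of this kind can be sketched in a few lines of Python. This is not the preclear script's own verification, just an illustration: it reads a stream block by block and reports the byte offsets of blocks that are not all zero. The function name and the `skip_bytes` parameter are made up here, and the assumption that the preclear signature sits in the first sector (so it should be skipped) is mine, not something stated in this thread.

```python
import io


def find_nonzero_blocks(stream, block_size=1024 * 1024, skip_bytes=0, max_report=20):
    """Return byte offsets of blocks that contain at least one non-zero byte.

    skip_bytes lets the caller step over a region that is expected to be
    non-zero (assumption: the preclear signature at the start of the disk).
    """
    bad_offsets = []
    stream.seek(skip_bytes)
    offset = skip_bytes
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # A block is clean only if every byte in it is zero.
        if block.count(0) != len(block):
            bad_offsets.append(offset)
            if len(bad_offsets) >= max_report:
                break
        offset += len(block)
    return bad_offsets


if __name__ == "__main__":
    # Self-contained demo on an in-memory "disk" with one flipped byte.
    fake_disk = bytearray(4096)
    fake_disk[1500] = 0xFF
    print(find_nonzero_blocks(io.BytesIO(bytes(fake_disk)), block_size=512))
```

On a real system the stream would be the raw device opened read-only in binary mode (the device path depends on your setup). Running it twice is the point: if the same offsets come back both times, the bad data is really on the platter and the write is suspect; if the offsets move or vanish, the read path (RAM, controller, cable) is suspect.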

Posted
24 minutes ago, SSD said:

When you hear hoofbeats, think of horses not zebras.

Which is why my post suggests horses and not zebras. I very clearly indicate that it's very unlikely to have a transfer error not caught by ECC unless the device constantly produces a huge number of failed transfers.

 

1 hour ago, pwm said:

So it's way more likely that invalid data was caused by the drive failing to correctly write the data to the sector

 

Posted
4 hours ago, pwm said:

If this happens once in a blue moon it's just as expected - statistically, drives do not promise 100% data reliability. If it happens for multiple sectors during the same run, or new sectors keep cropping up soon afterwards, then it's time to be scared because it's then more likely to be something wrong with the drive.

 

I hear you. No offense intended. It was the quote above that generated my response. I don't agree it is expected to see this "once in a blue moon". (A blue moon happens about once every two to three years, by the way.) This should never (or just short of never) happen. Maybe at the North Pole or in some other challenging environment right on the edge of acceptable environmental conditions, with an already shaky drive that is reallocating sectors and generating resets. But not in one's unRAID server at home with a new drive that is otherwise acting normally.


The forum's experience bears that out - this has never happened before without being attributed to a memory error. How many drive years does this represent? It should not take multiple sectors having this problem to pique interest and be scared, although my fear would have nothing to do with a bad drive. It should raise the alarm and demand we drive it to resolution. That, or test so extensively as to convince us that we can rely on the server.

Posted

Thank you both for the insight into this issue. It is frustrating to have an error come and go without being able to identify the cause with 100% certainty. Once preclear is up and running again, I am going to run it a third time on the drive, and then run it on another drive that passed, using that old cable again, to see if I can reproduce the error. I will let you know the outcome, so it can possibly help others troubleshoot in the future.

Posted
3 hours ago, SSD said:

This should never (or just short of never) happen.

This should never happen for people who do react when they see transfer errors and fix the cable or broken controller card port.

But if you continue to run with a broken port and start to collect millions of retransmissions then you have to be prepared that the ECC will sometimes fail.
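To put a rough number on "will sometimes fail", here is a back-of-the-envelope sketch. It assumes a randomly corrupted transfer slips past a k-bit check code with probability 2^-k. That is a simplification (real codes catch all short burst errors outright), and the default of 32 check bits is only an illustrative value, not a figure taken from this thread.

```python
import math


def p_undetected_at_least_once(corrupted_transfers, check_bits=32):
    """Chance that at least one corrupted transfer passes a check_bits-wide check.

    Assumption: each corrupted transfer independently passes with
    probability 2**-check_bits.
    """
    per_transfer = 2.0 ** -check_bits
    # 1 - (1 - p)**n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(corrupted_transfers * math.log1p(-per_transfer))


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(n, p_undetected_at_least_once(n))
```

Under these assumptions a few thousand bad transfers are harmless, while a port left limping along for millions or billions of them makes a silent error a realistic outcome.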

 

The ECC used is nowhere near as good as a good cryptographic hash to catch larger multi-bit errors in a data block. And this is also why large storage centers are constantly seeing silent errors in their storage pools.

 

This is also one of the big reasons why disk drives are moving to 4kB sectors. Besides fitting more data on each track, they also get a more efficient ECC when both the sector size and the ECC are increased in size. The current ECC size for 512-byte sectors is too small in relation to the huge number of sector transfers we perform every year. If I take a quick peek at one of my disks, it has 19,361,616,984 sector writes and 399,306,495,141 sector reads (about 200 TB read). This should then be compared to the datasheet specifying 1 nonrecoverable read error per 1*10^15 bits read. 4*10^11 sectors read is 1.6*10^15 bits read. So just a single disk in a single server at home has performed a similar number of bit reads as Seagate's datasheet specifies in its guesstimate of bit errors.
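The arithmetic in that paragraph is easy to check. The sketch below uses the sector-read count and the 1-per-10^15 datasheet figure quoted in the post, with 512-byte sectors. Treating the datasheet rate as an average error rate is a simplifying assumption; such figures are usually stated as an upper bound.

```python
SECTOR_BYTES = 512                   # 512-byte sectors, as in the post
sector_reads = 399_306_495_141       # read count quoted in the post
spec_bits_per_error = 1e15           # datasheet: 1 error per 1*10^15 bits read

bits_read = sector_reads * SECTOR_BYTES * 8
terabytes_read = sector_reads * SECTOR_BYTES / 1e12

# Assumption: the datasheet figure is treated as an average rate.
expected_errors = bits_read / spec_bits_per_error

print(f"{terabytes_read:.0f} TB read, {bits_read:.2e} bits read")
print(f"expected nonrecoverable errors at spec rate: {expected_errors:.2f}")
```

So roughly 4*10^11 sector reads come to about 204 TB, or about 1.6*10^15 bits, which is already past the point where the datasheet rate predicts one nonrecoverable error.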


That's also why lots of RAID-6 systems use vertical parity - even if they don't have to, they always read from all disks so they can compute the P+Q parity to augment the per-sector ECC on the drive and the per-sector ECC in the transfer from disk to controller.

 

And it's the reason why newer file systems are adding support for additional checksums on top of the drive subsystem - throw a die a billion times and even a 20-sided die will quite often hit the highest value...

 

The reason we don't hurt more from silent errors is that the majority of our disks are filled with media files - and most media files are very resilient to a few bit errors.

 

https://storagemojo.com/2007/09/19/cerns-data-corruption-research/

http://www.enterprisestorageforum.com/storage-management/silent-data-corruption-the-backup-killer.html

https://storagegaga.wordpress.com/2011/07/27/silent-data-corruption-sdc-its-more-prevalent-that-you-think/

Archived

This topic is now archived and is closed to further replies.
