WD Datacenter Gold 12TB issues

dchamb · November 27, 2017

Hello,

I am trying to ready a Western Digital Datacenter Gold 12TB drive to replace my current parity drive. I ran the preclear for over 58 hours making it through 4 of 5 steps before failing on step 5. Here are my questions:

1. Am I dealing with a bad drive?

2. Is there a way to start the preclear without going through steps 1 - 4?

3. Could there be a BIOS issue here?

Here is the preclear report:

############################################################################################################################
#                                                                                                                          #
#                                         unRAID Server Preclear of disk 8DG3KEVD                                          #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 5 - Pre-read verification:                                                  [17:14:50 @ 193 MB/s] SUCCESS    #
#   Step 2 of 5 - Zeroing the disk:                                                        [41:09:23 @ 80 MB/s] SUCCESS    #
#   Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 4 of 5 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#   Step 5 of 5 - Post-Read verification:                                                                          FAIL    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                              Cycle elapsed time: 58:33:47 | Total elapsed time: 58:33:47                                 #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  STATUS                                                                           #
#   5-Reallocated_Sector_Ct      0        -                                                                                #
#   9-Power_On_Hours             0        -                                                                                #
#   194-Temperature_Celsius      34       -                                                                                #
#   196-Reallocated_Event_Count  0        -                                                                                #
#   197-Current_Pending_Sector   0        -                                                                                #
#   198-Offline_Uncorrectable    0        -                                                                                #
#   199-UDMA_CRC_Error_Count     131      -                                                                                #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################

--> FAIL: Post-Read verification failed. Your drive is not zeroed.


root@Tower:/usr/local/emhttp#

Thanks!

Dale

tdallen · November 27, 2017

A nonzero UDMA_CRC_Error_Count (yours shows 131) usually indicates a bad cable or a bad connection on the existing cable.
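
If it helps, here is a minimal sketch of how one might compare that counter before and after swapping the cable. It assumes the text comes from `smartctl -A /dev/sdX` (smartmontools) in its usual ten-column attribute table; the function names and the sample rows are made up for illustration and are not part of preclear. The raw value is a lifetime counter that does not reset, so what matters is whether it keeps climbing.

```python
def parse_smart_raw_values(smartctl_output):
    """Map attribute ID -> (name, raw value) from `smartctl -A` style text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                values[int(fields[0])] = (fields[1], int(raw))
    return values


def crc_errors_increased(before, after, attr_id=199):
    """True if the UDMA CRC error counter grew between two snapshots."""
    return after[attr_id][1] > before[attr_id][1]


# Illustrative sample rows, not real output from the drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       131
"""
```

Take one snapshot before the run and one after; if attribute 199 is unchanged, the new cable is probably fine and the remaining suspects are the drive or the script.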

Edited November 27, 2017 by tdallen

HellDiverUK · November 27, 2017

I concur, looks like a bad cable to me.

dchamb · December 1, 2017

I replaced the cable and started preclear again. This time it was going much faster, and the CRC error count did not increase from where it was before the cable was replaced. I got through steps 1 through 4, but it has been hung up on step 5 for several hours at 18%. I really need to determine whether the drive is faulty even though SMART shows it is fine, whether there is a problem with preclear, or whether it is something else.

Help!

Thanks

Dale

SSD · December 1, 2017

Not a good sign. If the cabling is good and the drive locks up, seems like a bad drive.

Using fancy names like "gold" and "datacenter" may give you a warm fuzzy feeling in your psyche that this drive is going to have a long and problem-free life, but truth is drives are drives, and commercial and enterprise drives have similar failure rates. The "bathtub curve" phenomenon is real, meaning that early drive fatality is more common than fatality after a break-in period.

12TB drives are relatively new and nowhere near the sweet spot on price. We old timers tend to be very price conscious, because the premium drives we bought in the 2T and 3T days are long gone or in backup servers, and we realize that it is costly to do the refresh cycles. 8T have been the way to go for drives for the past 6-8 months or so at ~$20/T. 12T are about $35/T.

So you are one of the few I've seen with 12's. They keep saying they are pushing the laws of physics to make higher capacity drives, but somehow they keep doing it anyway. I guess HAMR is coming and maybe we'll see a jump in sizes. But these 12s may be eking out the last bit of capacity from the current tech, and it could be that you are a bit out there on the bleeding edge, and more failures are going to be normal. Or it could just be that this is a bad drive, and a replacement will work just fine.

BTW, a failure in the 3rd pass (the post-read verification) is a bad thing. That literally means that the drive read something other than a zero somewhere on the disk. There is a file that gets generated that tells you where. Probably just a couple of bytes. It could be that cable crosstalk induced a non-zero signal AFTER the read, or that the marginal cable connection did something similar. But I will say that this is extremely rare. I've only seen it a small handful of times. The drive's ECC will usually not let a bad read escape the drive. I call it spewing garbage when a drive returns data that is different from what was written to the disk. There are those that would argue that it is impossible - but it does happen as you've proven. Bit rot is sometimes blamed when it happens in the real world, but you've got some very fast rotting happening if this problem develops between the 2nd and 3rd stage of a preclear!
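
To make the idea concrete, here is a rough sketch of what "reading back and checking for zeros" amounts to. This is not the preclear script's code; `find_nonzero_bytes` is a made-up name, and the sketch ignores the preclear signature at the start of the disk, which a real check would have to account for. It just reads a file or block device in chunks and reports where the first few non-zero bytes sit.

```python
def find_nonzero_bytes(path, chunk_size=1024 * 1024, max_hits=10):
    """Return byte offsets of the first few non-zero bytes in a file or device."""
    hits = []
    offset = 0
    with open(path, "rb") as f:
        while len(hits) < max_hits:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file or device
            # Fast path: an all-zero chunk has nothing left after stripping zeros.
            if chunk.strip(b"\x00"):
                for i, byte in enumerate(chunk):
                    if byte != 0:
                        hits.append(offset + i)
                        if len(hits) >= max_hits:
                            break
            offset += len(chunk)
    return hits
```

An empty list means everything read back as zero. A handful of offsets that move around between runs would point more toward the cable or controller than toward the platters, since the same bad spot on the disk should show up in the same place each time.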

You might rule out cabling problems first, but I'd be pretty quick to pull the trigger on a replacement. If you're within the return window of wherever you bought it, you'd be assured of getting a brand new drive, which is better than a possible refurb from WD.

dchamb · December 2, 2017

Seems the fault lies with preclear. It crashed on a segmentation fault. I'm going to reboot and try it from a command line.

Btw, I'm not hung up on the fancy names either. That's just what they call it. I have a 10TB WD Gold that runs like a top, so when they came out with a 12TB for the same price as the 10TB I grabbed it up. Being 62, I think I'm an old timer myself lol!

AndroidCat · December 2, 2017

Did you use the preclear plugin?
It tends to die, i.e. it stops progressing and the CPU and memory utilization for the preclear script skyrocket to 100%.

dchamb · December 3, 2017

I used the preclear plugin. But when I try to run the script, it keeps telling me the drive is busy! Why can't I get this thing to preclear? I'm thinking of just putting the drive in the array and forgetting preclear.

AndroidCat · December 3, 2017

I had to kill it from the CLI and start over. Luckily it saves progress periodically and resumes where it left off.

dchamb · December 3, 2017

Not sure what there is to kill. I rebooted the unRAID machine and it still says the device is busy. It looks like an error in the script to me.

AndroidCat · December 3, 2017

Yep, looks like a different issue.

JorgeB · December 3, 2017

The script needs to be patched to work with unRAID v6.2+:

https://forums.lime-technology.com/topic/12391-re-preclear_disksh-a-new-utility-to-burn-in-and-pre-clear-disks-for-quick-add/?do=findComment&comment=460592

dchamb · December 3, 2017

johnnie.black,

Thanks! That did the trick! Preclear works again and reports my drive was successfully precleared. I am rebuilding my parity drive now.

pras1011 · February 18, 2018

How long does it take to preclear a 12TB drive?

witalit · February 25, 2018

Must take forever, at least a few days.

dchamb · February 25, 2018

It was a couple of days, but that was going through the first 4 phases of preclear. Preclear crashed because of the script problem in phase 5, so I never completed it. I assigned the drive to unRAID and everything has been working fine.

My array consists of the 12TB WD Gold as the parity drive, a 10TB Gold data drive, and three 6TB Red drives, and it takes 1 day 40 minutes to do a parity check at 135.1 MB/s.
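
For anyone wanting to estimate this for their own drive, the arithmetic is just capacity divided by average speed. A small sketch, assuming "12TB" means 12 × 10^12 bytes, that MB/s is decimal megabytes, and that the whole pass runs at the quoted average (the function name is only for illustration):

```python
def sequential_pass_time(capacity_bytes, speed_mb_per_s):
    """Return (hours, minutes) for one full pass at an average speed in MB/s."""
    seconds = capacity_bytes / (speed_mb_per_s * 1_000_000)
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


TWELVE_TB = 12 * 10**12  # assumed decimal terabytes, as drives are marketed

print(sequential_pass_time(TWELVE_TB, 135.1))  # parity check speed quoted above
print(sequential_pass_time(TWELVE_TB, 193))    # pre-read speed from the report
print(sequential_pass_time(TWELVE_TB, 80))     # zeroing speed from the report
```

At 135.1 MB/s this works out to 24 hours 40 minutes, which matches the parity check time above. The 193 MB/s and 80 MB/s figures from the preclear report give roughly 17 and 41.5 hours, close to the reported step times, so a full preclear with pre-read, zeroing, and post-read lands somewhere around three days.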
