[SOLVED] preclear hanging on pass#2, bad drive?

mklv · January 24, 2011

Hi,

I've been migrating my drives one (or two) at a time from my WHS to my new unraid server. I moved two WD EADS drives, and started a preclear. They both passed the first preclear (see attachment #1), so I tried a second pass just in case.

With one of my drives, the preclear hung 40% into step 10 (I think it was a read step). So I canceled it, and restarted my unraid server.

I restarted the preclear just on this drive. A few hours later, I see that it is stuck on step 1 "Disk Pre-Read in progress 40% complete".

I canceled it again, and tried dumping the syslog, which reached ~1.75GB before I ran out of room on the flash drive. I extracted the beginning of the log, up until the errors started popping up (see attachment #2).
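For anyone else facing a syslog too large to copy off, one option is to keep only the lines up to the first error plus a few lines of context. Below is a minimal Python sketch of that idea. The function name, the `error` pattern and the file paths in the usage note are assumptions for illustration, not something taken from this thread.

```python
import re
from typing import Iterable, Iterator


def head_until_first_match(lines: Iterable[str], pattern: str,
                           extra: int = 50) -> Iterator[str]:
    """Yield lines up to and including the first match, plus `extra` more."""
    rx = re.compile(pattern, re.IGNORECASE)
    remaining = None  # stays None until the first matching line is seen
    for line in lines:
        if remaining is None:
            if rx.search(line):
                remaining = extra  # start counting context lines
        else:
            if remaining == 0:
                return  # enough context copied, stop reading
            remaining -= 1
        yield line


# Small in-memory demonstration (made-up log lines)
demo = ["boot ok\n", "disk found\n", "ata1: error\n", "retry 1\n", "retry 2\n"]
for kept in head_until_first_match(demo, r"error", extra=1):
    print(kept, end="")
```

Because the function reads lazily, it can be fed an open file object, for example `dst.writelines(head_until_first_match(src, r"error"))` with `src` and `dst` opened on hypothetical source and destination paths, and it stops reading soon after the first match instead of walking the whole multi-gigabyte file.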

Do any of you see why the drive passed the first preclear, but couldn't get through another?

Thanks

attempt_1_preclear_success.txt

attempt_2_syslog.txt

furymaster · January 24, 2011

Hi,

I'm currently running into the same problem with 2 WD20EARS drives. Both of them stop during the pre-read, both at the same time: at 51%, at 61%, or at 45%. Funny thing is that I've always tried to preclear them in parallel. Now I've taken a look at the SMART data, and both drives have a very high Load_Cycle_Count (500 in 24 Power_On_Hours), unlike the working ones.
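To put that Load_Cycle_Count in perspective, it can be expressed as head load cycles per powered-on hour. A minimal Python sketch of the arithmetic follows; the function name is hypothetical, and only the 500 and 24 come from the figures above.

```python
def load_cycles_per_hour(load_cycle_count: int, power_on_hours: int) -> float:
    """Average head load/unload cycles per powered-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# Raw values quoted above: Load_Cycle_Count 500 after 24 Power_On_Hours
rate = load_cycles_per_hour(500, 24)
print(f"{rate:.1f} load cycles per hour")  # about 20.8
```

That is roughly one head park every three minutes of power-on time.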

I've read something about hdparm -B, but funny enough hdparm -I doesn't show the capability "Advanced Power Management" - huh?

If I don't get the two disks to work properly, I have a RMA rate of 4 out of 10 ... grrr

furymaster · January 24, 2011

Just ran 2 short tests on the drive:

# 1  Short offline       Completed: read failure       90%        24         2415436963
# 2  Short offline       Completed: read failure       90%        24         2415436963

RMA???

Joe L. · January 24, 2011

A drive that fails its own read tests, and cannot be pre-cleared.... RMA it.

(Unless you suspect the power supply is unable to handle the combined load of all the disks in the server. In that case it might not be the fault of the disks, but of the fact that they are not being given clean power.)

SSD · January 24, 2011

Sometimes these internal drive tests fail if unRaid spins the drive down. So you need to turn off spindown before starting them.

Joe L. · January 24, 2011

Yes, but that was a "short" test. It should take a few minutes at most, and if it is interrupted by a spin-down, the error is not "read failure". Instead the log entry will state "Aborted by host", like this:

# 1  Extended offline    Aborted by host               90%      9009         -
# 2  Short offline       Completed without error       00%      3992         -
# 3  Extended offline    Completed without error       00%      3792         -

Joe L.
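The distinction above (a "read failure" reported by the drive versus a test that was merely "Aborted by host") can be checked mechanically. Here is a minimal Python sketch that parses self-test log lines shaped like the ones quoted in this thread. The regular expression, function names and category labels are assumptions for illustration; real `smartctl` output has more test types and status strings than this handles.

```python
import re

# Assumes entries shaped like the ones in this thread:
# "# <num> <kind> offline <status> <remaining>% <hours> <LBA or ->"
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(\w+ offline)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per recognised self-test log line."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # skip headers, blank lines, unknown formats
        num, test, status, remaining, hours, lba = m.groups()
        entries.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_pct": int(remaining),
            "hours": int(hours),
            "first_error_lba": int(lba) if lba.isdigit() else None,
        })
    return entries


def classify(status):
    """Map a status string to a rough category (labels are made up)."""
    s = status.lower()
    if "read failure" in s:
        return "drive reported a read failure"
    if "aborted by host" in s:
        return "test was interrupted"
    if "without error" in s:
        return "passed"
    return "other"


SAMPLE = """\
# 1 Short offline Completed: read failure 90% 24 2415436963
# 1  Extended offline    Aborted by host               90%      9009         -
# 2  Short offline       Completed without error       00%      3992         -
# 3  Extended offline    Completed without error       00%      3792         -
"""

for entry in parse_selftest_log(SAMPLE):
    print(entry["test"], "->", classify(entry["status"]))
```

Only the first sample line carries an LBA of the first error, which is what separates a genuine media problem from an interrupted test.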

SSD · January 24, 2011

I don't know how long the drive had been sitting unused before the test was started. The test's accesses to the drive are hidden from unRAID, so it does not see them as "activity".

You are right about the "Aborted by host" message, but firmware is not always 100% consistent across different drives. There is still a small chance the drive reported the spin-down as a read error. And unRAID doesn't just spin the drive down once: it seems to monitor it and keep slamming it down if anything (e.g., taking a smartctl report) spins up the disk without triggering one of the specific things unRAID monitors for spin-up.

But overall I agree with your diagnosis, Dr. Joe. The drive definitely looks like it's ready to be RMAed.

furymaster · January 24, 2011

The power supply is OK. All other drives are spun down and only 2 are currently preclearing, so that's definitely not the problem. I suppose WD's quality tests are the problem.

mklv · January 25, 2011

I had to read up on the long and short tests, as they were new to me. It seems that my drive has no problem with the short test, but can't pass the long test.

The thing that really disturbs me is that there is no indication of a failure in any of the smart attributes (at least from what I can make out).

Is this a common occurrence?

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   229   229   021    Pre-fail  Always       -       10516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       162
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       95
194 Temperature_Celsius     0x0022   130   110   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       15
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%       156         1567202372
# 2  Short offline       Completed without error       00%       154         -
# 3  Extended offline    Completed: read failure       70%       154         1567202372
# 4  Short offline       Completed without error       00%       151         -
# 5  Short offline       Completed without error       00%       151         -

Joe L. · January 25, 2011

Actually, there are 15 sectors pending re-allocation, apparently detected by the "offline" long test before it stopped.

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 15
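Since nonzero raw values like these are easy to miss by eye, here is a minimal Python sketch that scans a pasted attribute table for them. The watch list and function name are assumptions chosen for illustration (the thread only points at attributes 197 and 198), and the parser assumes the ten whitespace-separated columns shown in this thread.

```python
# Attributes whose RAW_VALUE is checked; this watch list is an assumption
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def nonzero_watched(report, watched=WATCHED):
    """Return (name, raw_value) for watched attributes with raw value > 0."""
    hits = []
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            hits.append((name, int(raw)))
    return hits


SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 15
"""

for name, raw in nonzero_watched(SAMPLE):
    print(f"{name}: raw value {raw}")
```

Note that the normalized VALUE column still reads 200 for both attributes, so only the RAW_VALUE column reveals the 15 pending sectors.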

mklv · January 25, 2011

Ooh, I missed that. Thank you!
