[SOLVED] preclear hanging on pass#2, bad drive?


mklv

Hi,

 

I've been migrating my drives one (or two) at a time from my WHS to my new unraid server. I moved two WD EADS drives, and started a preclear. They both passed the first preclear (see attachment #1), so I tried a second pass just in case.

 

With one of my drives, the preclear hung 40% into step 10 (I think it was a read step). So I canceled it, and restarted my unraid server.

 

I restarted the preclear just on this drive. A few hours later, I see that it is stuck on step 1 "Disk Pre-Read in progress 40% complete".

 

I canceled it again, and tried dumping the syslog, which reached ~1.75GB before I ran out of room on the flash drive. I extracted the beginning of the log, up until the errors started popping up (see attachment #2).

 

Do any of you see why the drive passed the first preclear, but couldn't get through another?

 

Thanks

attempt_1_preclear_success.txt

attempt_2_syslog.txt

Link to comment

Hi,

 

I'm currently running into the same problem with 2 WD20EARS drives. Both of them stop during the pre-read, and both at the same point: at 51%, 61% or 45%, depending on the run. Funny thing is that I've always tried to preclear them in parallel. I've now taken a look at the SMART data, and both drives have a very high Load_Cycle_Count (500 in 24 Power_On_Hours), unlike the working ones.
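
To put a number on "very high", here is a minimal Python sketch that turns the two raw SMART values into a load-cycle rate. The function name `load_cycles_per_hour` is made up for illustration; the inputs are the figures quoted above, plus the 95 load cycles in 162 power-on hours from the SMART report posted later in this thread.

```python
def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Average head load/unload cycles per power-on hour (hypothetical helper)."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# Figures quoted in this post: 500 load cycles in 24 power-on hours.
print(f"{load_cycles_per_hour(500, 24):.1f} load cycles per hour")   # 20.8

# Figures from the SMART report later in the thread: 95 cycles in 162 hours.
print(f"{load_cycles_per_hour(95, 162):.1f} load cycles per hour")   # 0.6
```

The rate only describes how often the heads park. It says nothing by itself about whether the drive is failing.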

 

I've read something about hdparm -B, but oddly enough hdparm -I doesn't list the "Advanced Power Management" capability. Huh?

 

If I can't get these two disks to work properly, I'll have an RMA rate of 4 out of 10 ... grrr

 

 

Link to comment

Just ran 2 short tests on the drive:

 

# 1  Short offline       Completed: read failure       90%        24         2415436963

# 2  Short offline       Completed: read failure       90%        24         2415436963

 

RMA???

A drive that fails its own read tests, and cannot be pre-cleared....  RMA it. 

(Unless you suspect the power supply is unable to handle the combined load of all the disks in the server. In that case it might not be the fault of the disks, but the fact that they are not being given clean power.)

Link to comment

Sometimes these internal drive tests fail if unRaid spins the drive down.  So you need to turn off spindown before starting them.

Yes, but that was a "short" test. It should take a few minutes at most, and if it is interrupted by a spin-down, the error is not "read failure". Instead it will look like this, stating "aborted by host":

 

# 1  Extended offline    Aborted by host               90%      9009         -
# 2  Short offline       Completed without error       00%      3992         -
# 3  Extended offline    Completed without error       00%      3792         -

 

Joe L.
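
The distinction above can be sketched in a few lines of Python: parse one row of the self-test log and label it by its status text. This is only a sketch, assuming the row layout shown in this thread (columns separated by two or more spaces). The function name `classify_selftest_row` and the verdict labels are invented for illustration and are not part of smartctl.

```python
import re

# One row of the "SMART Self-test log" as pasted in this thread, e.g.
# "# 1  Short offline       Completed: read failure       90%        24         2415436963"
# Assumption: columns are separated by at least two spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def classify_selftest_row(line):
    """Return a dict describing one self-test row, or None if it does not parse."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    status = m.group("status").lower()
    if "read failure" in status:
        verdict = "read_failure"   # the drive itself could not read a sector
    elif "aborted by host" in status:
        verdict = "interrupted"    # e.g. the test was cut short by a spin-down
    elif "without error" in status:
        verdict = "passed"
    else:
        verdict = "other"
    lba = None if m.group("lba") == "-" else int(m.group("lba"))
    return {
        "num": int(m.group("num")),
        "test": m.group("test"),
        "verdict": verdict,
        "lba": lba,
    }


if __name__ == "__main__":
    rows = [
        "# 1  Short offline       Completed: read failure       90%        24         2415436963",
        "# 1  Extended offline    Aborted by host               90%      9009         -",
        "# 2  Short offline       Completed without error       00%      3992         -",
    ]
    for row in rows:
        print(classify_selftest_row(row))
```

Only a "read_failure" row carries an LBA. The other two kinds of rows show "-" in that column, which the sketch maps to `None`.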

Link to comment

I don't know how long the drive might have been sitting unused before the test was started. The test's accesses to the drive are hidden from unRAID, so it does not see them as "activity".

 

You are right about the "Aborted by host" message, but firmware is not always 100% consistent across different drives. There is still a small chance the drive saw the spin-down as a read error. And unRAID doesn't just spin the drive down once: it seems to monitor it and keeps slamming it down if anything (e.g., taking a smartctl report) spins up the disk without going through one of the specific things unRAID monitors to spin the drive up.

 

But overall I agree with your diagnosis, Dr. Joe. ;)  The drive definitely looks like it's ready to be RMAed.

Link to comment

The power supply is OK. All other drives are spun down and only 2 are currently preclearing, so that's definitely not the problem. I suppose WD's quality testing is the problem  >:(

Link to comment

I had to look into the long and short tests, as they were new to me.  It seems that my drive has no problem with the short test, but can't pass the long test.

 

The thing that really disturbs me is that there is no indication of a failure in any of the smart attributes (at least from what I can make out). 

 

Is this a common occurrence?

 


Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   229   229   021    Pre-fail  Always       -       10516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       162
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       95
194 Temperature_Celsius     0x0022   130   110   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       15
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       15
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                       Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      70%        156              1567202372
# 2  Short offline       Completed without error      00%        154              -
# 3  Extended offline    Completed: read failure      70%        154              1567202372
# 4  Short offline       Completed without error      00%        151              -
# 5  Short offline       Completed without error      00%        151              -

Link to comment

The thing that really disturbs me is that there is no indication of a failure in any of the smart attributes (at least from what I can make out). 

Actually, there are 15 sectors pending re-allocation, apparently detected by the "offline" long test before it stopped.

 

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      15

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      15
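
Since the pending sectors are easy to miss in the RAW_VALUE column, here is a minimal Python sketch that scans a pasted attribute table and reports any watched attribute with a nonzero raw value. It assumes the ten-column layout shown in this thread. The function name `nonzero_watched` is made up for illustration, and the choice of IDs to watch (5, 197 and 198) is an assumption, not an official list.

```python
# Attribute IDs to watch. 197 and 198 are the ones pointed out in this thread;
# adding 5 (Reallocated_Sector_Ct) is an assumption made for illustration.
WATCHED_IDS = {5, 197, 198}


def nonzero_watched(report):
    """Map attribute name -> raw value for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # A data row has 10 whitespace-separated columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    sample = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      15
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      15
"""
    print(nonzero_watched(sample))
```

The normalized VALUE and WORST columns for these attributes still read 200, which is why the problem only shows up in the raw value.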

Link to comment

Archived

This topic is now archived and is closed to further replies.
