New (used) drive shown as defective...now what?

tillkrueger · July 28, 2018

Hi everyone...

Because I'm so low on funds right now, I took a risk and ordered a used 6TB WD Red drive to replace a 3TB WD Red drive that had been taken out of my array for being defective.

The new (used) 6TB drive rebuilt from parity successfully, but today, a few days later, when I logged into my unRAID WebUI remotely, it showed the drive with a red cross and the array in a compromised state.

my questions now:

> is a pre-clear the next logical step (I probably should have done that first), or is it too late for that.

> what would be the procedure to do a pre-clear now, after the fact?

> if a pre-clear is *not* what's to be done next, what is?

JorgeB · July 28, 2018

Please post your diagnostics: Tools -> Diagnostics

tillkrueger · July 28, 2018

thanks jb...here ya go.

unraid-diagnostics-20180728-0856.zip

JorgeB · July 28, 2018

It looks like a disk problem, run an extended SMART test.

binhex · July 28, 2018

> is a pre-clear the next logical step (I probably should have done that first), or is it too late for that.

Hate to say this, but for reference: the next time you buy a disk, the first thing you need to do is preclear it. This checks for errors (and zeros the drive, which is less important).


tillkrueger · July 28, 2018

7 hours ago, johnnie.black said:

It looks like a disk problem, run an extended SMART test.

how exactly do I do that, and is this the next thing I should do, or should I aim for a pre-clear first?

6 hours ago, binhex said:

Hate to say this, but for reference: the next time you buy a disk, the first thing you need to do is preclear it. This checks for errors (and zeros the drive, which is less important).

yeah, in retrospect I do realise that I should have done that first...my bad...too late now?

trurl · July 28, 2018

5 hours ago, tillkrueger said:

how exactly do I do that

Click on the disk to get to its page then go to Self Test

tillkrueger · July 30, 2018

thx trurl

I started it yesterday around the time you pointed out to me where to do so, and some 20hrs later it's still churning away at 30%...it's a 6TB drive, so we're talking 3-4 days then?

The 3TB drive that this 6TB replaced was also marked as defective. If I have to get another 6TB to replace this one, is there a risk that the rebuild can't recover whatever data was affected by the read errors on this drive (and maybe on the 3TB it replaced), or is that data "safe" as part of the parity information?

in other words, the errors (bad data) itself will not be rebuilt as "bad", but as the original data before it went bad, I hope?

pwm · July 30, 2018

unRAID rebuilds the content as it should have been.

With single parity, you can think of the parity logic as a+b+c+d = e, where e is the parity. (For the first parity the operation isn't actually + but XOR, which is often written as ⊕.)

And unRAID computes block-for-block what content the broken disk must have contained (before failure) to satisfy the parity equation.

The parity is computed based on the expected content of the different data disks, i.e. the value the sectors had when last written or when parity was last rebuilt.
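
The XOR relationship described above can be illustrated with a minimal sketch. This is not unRAID's actual code; the function names and the four tiny "disks" are made up for the example, and real rebuilds work on whole sectors rather than four-byte blocks.

```python
from functools import reduce
from operator import xor


def parity_block(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block: XOR of all survivors plus parity."""
    return parity_block(surviving_blocks + [parity])


# Four hypothetical data disks (a, b, c, d), one block each.
data = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
    b"\x0f\x0e\x0d\x0c",
]
parity = parity_block(data)  # e = a XOR b XOR c XOR d

# Pretend disk index 2 failed and rebuild it from the rest.
lost_index = 2
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
rebuilt = rebuild_missing(survivors, parity)

print(rebuilt == data[lost_index])  # True
```

The rebuilt block equals what was last written to the failed disk, not whatever the failing disk would return now, which is why the rebuild does not copy "bad" data.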

tillkrueger · July 30, 2018

great explanation, pwm...and what a relief!

pwm · July 30, 2018

2 minutes ago, tillkrueger said:

great explanation, pwm...and what a relief!

This is also why it is dangerous to throw away existing parity and rebuild the parity if you don't trust all data disks.

tillkrueger · July 30, 2018

yeah, I get that...I don't think that I have *ever* thrown away or overwritten a faulty drive, as one look into my shelves would show

tillkrueger · July 30, 2018

whoops...I just navigated away from the page that showed the progress "report" of the extended SMART self-test (which had been showing 30% for the past 24hrs), and when I went back to that page it said "Completed without error" (the same thing it showed before I started the extended test).
I hit the "Download" button and attached what it saved...did I interrupt the actual extended test, or is this really the outcome of the extended test?

WDC_WD60EFRX-68MYMN1_WD-WXL1H642H7PL-20180730-1200.txt

pwm · July 30, 2018

Last test did end without error:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6723         -

Notice that the lifetime value matches the current statistics - the SMART test ended about 9 hours ago, give or take the roundoff to whole hours.

  9 Power_On_Hours          -O--CK   091   091   000    -    6732

Your original estimate that the SMART test would take days didn't sound correct - most drives manage the test in 6-15 hours depending on capacity and age. Your specific disk reports that it needs about 12 hours (for a healthy drive that doesn't receive read/write requests during the test):

Extended self-test routine
recommended polling time: 	 ( 719) minutes.
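
The two timing figures above are simple arithmetic on values from the posted SMART report. A small sketch, with variable names chosen only for this example:

```python
# Values copied from the posted SMART report.
power_on_hours = 6732        # attribute 9, Power_On_Hours, at report time
test_lifetime_hours = 6723   # LifeTime(hours) of self-test #1
polling_minutes = 719        # recommended polling time for the extended test

# How long ago the extended test finished (whole hours only).
hours_since_test = power_on_hours - test_lifetime_hours

# Expected duration of an undisturbed extended test.
hours, minutes = divmod(polling_minutes, 60)

print(hours_since_test)             # 9
print(f"{hours} h {minutes} min")   # 11 h 59 min
```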

I wonder if you have problems with either power or vibrations. The drive could read all data correctly during the SMART test, but it reported 256 uncorrectable sectors about 200 power-on hours earlier, and it had a command fail to complete about 100 hours before the SMART test.

tillkrueger · July 30, 2018

So what does that mean in terms of best course of actions?

Vibrations/power within the server?

pwm · July 30, 2018

13 minutes ago, tillkrueger said:

So what does that mean in terms of best course of actions?

Vibrations/power within the server?

It means you are in uncharted territory. I don't see any obvious reasons for the failures and then the perfect SMART test, so no easy way to figure out exactly what to fix. You could try to clear the drive again and see if it works better. Or you could try to switch PSU.

tillkrueger · July 30, 2018

Since I am 500 miles away from this system, clearing the drive sounds like the only option I have right now.

Can this be done remotely, and if so, are there instructions online for clearing a drive that has been rebuilt as part of the array already?

pwm · July 30, 2018

6 minutes ago, tillkrueger said:

Since I am 500 miles away from this system, clearing the drive sounds like the only option I have right now.

Can this be done remotely, and if so, are there instructions online for clearing a drive that has been rebuilt as part of the array already?

No, you can't start a clear on the drive if it has already been added to the array and completely or partially rebuilt - since it's part of the array, any writes to it will update the parity state based on the writes. So a clear would teach unRAID that the disk is expected to be empty.

trurl · July 30, 2018

5 minutes ago, tillkrueger said:

Since I am 500 miles away from this system, clearing the drive sounds like the only option I have right now.

Can this be done remotely, and if so, are there instructions online for clearing a drive that has been rebuilt as part of the array already?

You can't clear a drive that's in the array. If you remove it from the array you will be unprotected unless you have dual parity. If you already have a spare disk installed you could replace/rebuild to that and then clear the disk.

Usually clearing a disk is recommended for getting pending sectors reallocated, but I just looked at your posted diagnostics and that doesn't seem to be the problem. I don't know what others saw that told them it was a disk problem.

tillkrueger · July 30, 2018

Makes sense.

what if I copied all data on it to the other drives in the array first? Could I then do a pre-clear like I should have done in the first place and see what happens?

there must be some way to recover from this situation, or have I really navigated myself into a check-mate?

tillkrueger · July 30, 2018

so trurl, would you agree with pwm that the culprit would likely be a faulty PSU and further attempts at trying to deal with the disk are likely to be futile?

pwm · July 30, 2018

8 minutes ago, trurl said:

I don't know what others saw that told them it was a disk problem.

What didn't look too good was:

Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 Sense Key : 0x5 [current] 
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 ASC=0x21 ASCQ=0x0 
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 CDB: opcode=0x8a 8a 08 00 00 00 00 74 75 19 b0 00 00 00 08 00 00
Jul 26 04:41:33 unRAID kernel: print_req_error: critical target error, dev sdk, sector 1953831344
Jul 26 04:41:33 unRAID kernel: md: disk6 write error, sector=1953831280

and in the SMART data:

Error 2 [1] occurred at disk power-on lifetime: 6629 hours (276 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 74 75 19 b0 c0 00  Error: IDNF at LBA = 0x747519b0 = 1953831344

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 74 75 19 b0 41 00  2d+17:37:21.022  WRITE FPDMA QUEUED
  e5 00 00 00 00 00 00 00 00 00 00 40 00  2d+17:37:21.022  CHECK POWER MODE
  ea 00 00 00 00 00 00 00 00 00 00 40 00  2d+17:37:20.942  FLUSH CACHE EXT
  e5 00 00 00 00 00 00 00 00 00 00 00 00  2d+17:37:20.939  CHECK POWER MODE
  40 00 00 00 01 00 00 00 00 00 00 40 00  2d+17:37:11.720  READ VERIFY SECTOR(S)

Error 1 [0] occurred at disk power-on lifetime: 6533 hours (272 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 01 00 00 02 86 6d 4e 00 e0 00  Error: UNC 256 sectors at LBA = 0x2866d4e00 = 10845244928

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 01 00 00 02 86 6d 4e 00 e0 08  2d+03:05:41.795  READ DMA EXT
  25 00 00 04 00 00 02 86 6d 4a 00 e0 08  2d+03:05:41.755  READ DMA EXT
  25 00 00 02 00 00 02 86 6d 47 80 e0 08  2d+03:05:41.733  READ DMA EXT
  25 00 00 00 40 00 00 9a 84 31 80 e0 08  2d+03:05:41.707  READ DMA EXT
  25 00 00 01 00 00 00 9a 84 30 00 e0 08  2d+03:05:41.705  READ DMA EXT

So the drive reported 256 uncorrectable sectors at 6533 power-on hours. I'm not sure if that was before or after the drive was connected to the unRAID server. All the SMART data tells us is the number of power-on hours, so @tillkrueger must count backwards and figure out whether the UNC errors happened before or after the drive was bought. If it was before, then it might be possible to ignore this error - maybe there was a power issue in the original system, or maybe someone tried to move the machine while the drive was busy.

The more recent error (IDNF) at 6629 hours could have been caused by the drive disconnecting, so the "IDNF" would be because the drive no longer had a connected controller to interact with. In that case it could be a cable problem or a controller card issue.

Not knowing the exact conditions when the two errors happened makes it harder to guess the reason.
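
As a cross-check, the kernel log lines and the SMART error log entries quoted above can be shown to refer to the same address. The sketch below only decodes numbers copied from this thread. The field layout is that of a 16-byte WRITE(16) CDB (opcode 0x8a: bytes 2-9 are the LBA, bytes 10-13 the transfer length). Reading the 64-sector difference as the partition start offset is an assumption, not something the logs state.

```python
# CDB bytes copied from the kernel log line above.
cdb = bytes.fromhex("8a 08 00 00 00 00 74 75 19 b0 00 00 00 08 00 00")

opcode = cdb[0]                                    # 0x8a = WRITE(16)
cdb_lba = int.from_bytes(cdb[2:10], "big")         # 8-byte big-endian LBA
cdb_blocks = int.from_bytes(cdb[10:14], "big")     # 4-byte transfer length

# LBAs from the two SMART error log entries.
idnf_lba = int("747519b0", 16)
unc_lba = int("2866d4e00", 16)

# Sector numbers reported by the kernel (device) and by the md driver (disk6).
device_sector = 1953831344
md_sector = 1953831280

print(hex(opcode), cdb_lba, cdb_blocks)   # 0x8a 1953831344 8
print(cdb_lba == idnf_lba == device_sector)
print(unc_lba)                            # 10845244928
print(device_sector - md_sector)          # 64 (assumed: partition start offset)
```

So the failed 8-sector write in the syslog and the IDNF entry in the SMART log describe the same request, while the older UNC error is at an unrelated address.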

tillkrueger · July 30, 2018

My gut feeling is that the Amazon Marketplace seller who sold me this used drive in “like-new” condition dumped a faulty drive on me.

if there is nothing I can do with this drive remotely at this point, I’ll have to take it up with Amazon and try to pressure him into returning my money, and get a new drive.

this drive had only been in my array for 48-72hrs before being marked as faulty, and I had replaced cables, drive cages, and controller all within the past 3 years, so I would think that my server hardware is good...not to say that even new components can’t be faulty, but my gut feeling is that something is fishy with this used drive.

trurl · July 30, 2018

1 hour ago, tillkrueger said:

Makes sense.

what if I copied all data on it to the other drives in the array first? Could I then do a pre-clear like I should have done in the first place and see what happens?

there must be some way to recover from this situation, or have I really navigated myself into a check-mate?

If you copy all the data off, you could New Config without it and rebuild parity. Then it wouldn't be in the array.

pwm · July 30, 2018

51 minutes ago, tillkrueger said:

My gut feeling is that the Amazon Marketplace seller who sold me this used drive in “like-new” condition dumped a faulty drive on me.

The SMART data claims that the drive isn't faulty.

The older uncorrectable error need not represent any fault in the disk. When the disk writes, it always writes blind. It first aligns to the track in read mode. Then it counts sectors, waiting until it's about to reach the correct sector. Then it makes a blind realignment of the write head over the track and performs the write. The drive does not know whether the write went well or not - it isn't until you later try to read the sectors that the drive finds out whether they can be read.

Drives with a vibration sensor try to abort writes if vibrations are detected. Drives without vibration sensors will just produce garbage writes if there is too much vibration. Each track is a two-digit number of nanometers wide, so 1000 tracks are about the same width as a human hair, and the resolution used when aligning the head is in single-digit nanometers. 10 nm is about the width of 20 silicon atoms. So lots can go wrong when the drive tries to properly align the head and write the data. There are videos showing how the drives in a server rack stop producing data if a person shouts at the machine - the voice vibrations are enough to make the enterprise disks stop their tasks and wait for the vibrations to end.

The open question is why the drive disconnected.

But Sense Key 0x5 means ILLEGAL REQUEST, i.e. the drive rejected the command as invalid.

And ASC=0x21, ASCQ=0x0 means LOGICAL BLOCK ADDRESS OUT OF RANGE.

Sense Key : 0x5 [current] 
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 ASC=0x21 ASCQ=0x0
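
For anyone who wants to pull these codes out of a syslog automatically, here is a rough sketch. The two lookup tables contain only the codes seen in this thread (sense key 0x5 and ASC/ASCQ 0x21/0x00, with their standard SCSI names). They are not a complete table, and the regular expressions are written against the exact log format shown above.

```python
import re

# Partial lookup tables: only the codes that appear in this thread.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x21, 0x00): "LOGICAL BLOCK ADDRESS OUT OF RANGE"}

log = """\
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 Sense Key : 0x5 [current]
Jul 26 04:41:33 unRAID kernel: sd 5:0:11:0: [sdk] tag#1 ASC=0x21 ASCQ=0x0
"""

# Extract the hex values; int(..., 16) accepts the "0x" prefix.
key_match = re.search(r"Sense Key : (0x[0-9a-fA-F]+)", log)
asc_match = re.search(r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)", log)

sense_key = int(key_match.group(1), 16)
asc = int(asc_match.group(1), 16)
ascq = int(asc_match.group(2), 16)

sense_name = SENSE_KEYS.get(sense_key, "unknown")
asc_name = ASC_ASCQ.get((asc, ascq), "unknown")

print(f"sense key {sense_key:#x}: {sense_name}")
print(f"ASC/ASCQ {asc:#04x}/{ascq:#04x}: {asc_name}")
```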

If we assume the drive has been up continuously, then the time stamp in the SMART data was Mon Jul 30 12:00:49 2018 and the power on counter then was 6732.

The error in the SMART log happened at 6629, so 103 power-on hours earlier - 4 days and 7 hours.

So somewhere around Jul 26, 05:00.

That agrees with the unRAID log printout of Jul 26 04:41:33.
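
The back-calculation above can be reproduced in a few lines. It relies on the same assumption stated above, that the drive was powered on continuously between the error and the SMART report. Because the power-on counter only has whole-hour resolution, a match within an hour is the best that can be expected.

```python
from datetime import datetime, timedelta

# Values copied from the SMART report and the syslog in this thread.
report_time = datetime(2018, 7, 30, 12, 0, 49)   # SMART report timestamp
hours_at_report = 6732                           # Power_On_Hours at report time
hours_at_error = 6629                            # power-on hours of the IDNF error
syslog_time = datetime(2018, 7, 26, 4, 41, 33)   # kernel write error timestamp

# Assumes the drive was powered on the whole time in between.
elapsed = timedelta(hours=hours_at_report - hours_at_error)
estimated_error_time = report_time - elapsed

print(elapsed.days, elapsed.seconds // 3600)   # 4 7
print(estimated_error_time)                    # 2018-07-26 05:00:49

# Within the one-hour resolution of the power-on counter?
matches = abs(estimated_error_time - syslog_time) <= timedelta(hours=1)
print(matches)                                 # True
```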

So possibly the command was dropped because of a transfer error. Or a software bug. But the error doesn't represent a broken disk - just that the disk couldn't perform the task because the requested task wasn't valid.
