Dying drives conundrum


Hi guys, I have a bit of trouble. It came to light while upgrading to 5.04, but that's probably not that relevant. I had a few read errors and current pending sectors on my array disk 4, then some on disk 8 as well, and 1 current pending sector on the parity drive.

 

My unRAID array is made of 9x2TB drives including parity, plus a 750GB cache drive.

 

I had a spare 2TB drive in case of failure but it turned out to be faulty when I swapped it in for the questionable disk 4. I just bought a new drive because of some current pending sectors on other drives, and have swapped it into the system.

 

Here is the current situation after some juggling around of drives and data rebuilding:

 

Spare drive #1: dead, in warranty, to be RMA'd.

Spare drive #2: appears functional but is reporting 65535 current pending sectors. Haven't run WD tools on it yet but it should RMA.

 

So far so good.

 

On the array:

 

 

D1: good

D2: good

D3: good

D4: reporting 65535 current pending sectors. Two days ago, as an array drive, it was reporting only 2 current pending sectors, so I thought it would be okay to use as a replacement for the failed previous D4. Rebuilding D4 onto it seems to have caused/revealed this pending sector avalanche.

D5: good

D6: good

D7: good

D8: good - this is the brand new drive.

Parity: 1 current pending sector. Appears to work fine though.

 

They are all WD "green" drives.

 

Clearly I have three drives that need RMA. But I have no spares and will have an array that is one disk down. The 1 pending sector on the parity drive makes me nervous based on the last few days' failures.

 

RMA is going to take a few weeks no matter what as it is via post.

 

2TB of data loss would be very annoying but not completely catastrophic.

 

I would prefer not to have to buy ANOTHER drive and then end up with three spares sitting there ageing.

 

So do I:

 

(a) RMA the two spare drives, leave the dubious D4 in, and then RMA it later (probably slightly more expensive due to postage).

(b) RMA all three dodgy drives and run unprotected for a few weeks

(c) preclear and rebuild the parity drive now, while I still have a working (but untrustworthy) D4, to clear that current pending sector?

(d) something else?

 

Also, is it better in general to leave all drives permanently spun up, or spun down, to minimise failure risk in the interim?

 

I considered condensing my data and shrinking the array by one drive, which I could probably just about manage the space for, but the process involves working the drives fairly hard, and that seems to invite failure of the current D4 and/or P.

 

Any suggestions for me?

Link to comment

There are no easy answers to your issues. I would move all the files off D4 onto the other disks, possibly by dropping D4 out of the array so that it is simulated, which lets you get all the files except any affected by that bad sector on the parity drive.

 

If you leave D4 installed and move the files and the parity dies you won't lose anything else. The bigger problem occurs if another drive goes before you get the array protected again so trying to get the array protected ASAP might be the best choice. Also, moving the files around would be exercising the parity which would give you more confidence in it being OK, or not. Moving might also write to the pending sector on the parity and re-allocate it.

 

Trying to rebuild parity with the bad sectors on D4 will just end up with parity issues for each bad sector so I don't see that as an option. If you return D4 and the parity pending sectors increase then you won't be able to rebuild D4 well at all - each bad sector on the parity would corrupt the corresponding rebuilt data on D4.
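To make the point about bad parity sectors concrete, here is a minimal sketch, assuming plain single-parity XOR across the data disks. The disk contents and the `xor_blocks` helper are made up for illustration; this is a toy model, not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks" of 4 bytes each (hypothetical contents).
d1 = bytes([0x11, 0x22, 0x33, 0x44])
d2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
d4 = bytes([0x0F, 0x1E, 0x2D, 0x3C])

# Parity is the XOR of all data disks.
parity = xor_blocks([d1, d2, d4])

# With good parity, a missing D4 is recovered exactly from the survivors.
rebuilt = xor_blocks([d1, d2, parity])

# Damage one position on parity, then rebuild D4 again.
bad_parity = bytearray(parity)
bad_parity[2] ^= 0xFF
rebuilt_bad = xor_blocks([d1, d2, bytes(bad_parity)])

# Only the position that was bad on parity comes back wrong on D4.
damaged = [i for i, (a, b) in enumerate(zip(rebuilt_bad, d4)) if a != b]
print(rebuilt == d4, damaged)
```

In this model the damage stays confined to the matching position on the rebuilt disk, which is why a single pending sector on parity threatens one sector's worth of rebuilt data rather than the whole disk.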

 

I think in parallel to moving files, I would RMA the 2 spare drives. By the time they came back you'd know if the parity was good or not and you'd have D4 gone from the array. You could then use one of these spares to either add as D4 to get back the storage or to replace the parity, whichever is necessary first. If the parity is bad, once D4 and the parity come back from the second RMA you can finish fixing the array and possibly expand to have another disk in the array.

 

 

Link to comment

Thanks Lionel. I have started emptying disk 4 while leaving it in the array. I will RMA the two "spare" pieces of junk and then start rebuilding the array properly when their replacements arrive. With luck nothing else fails in the meantime.

 

Just thinking more though... is there an argument for not moving the 1.7TB of files from D4 for now, and just avoiding putting stress on any disks in the array until the replacement drives arrive? My thinking is that failure is more likely to result from working the drives than just from time passing. 99% of the time this system is just reading to stream video and even then it's not that much. I could keep write operations (and hence use of the parity disk?) to a minimum.

Link to comment

The safest thing to do is simply turn off the system until you have replacement drives, so you can rebuild D4 again with a good drive.    If you don't have backups (which it sounds like you don't), then that's the best way to protect against data loss in the event of a 2nd drive failure.

 

Link to comment

Thanks Gary. I think you are probably right.

 

On backups - the "my life is screwed if I lose these files" stuff is all backed up separately. But the 15TB or so of other stuff is not. That is why it would be annoying but not catastrophic to lose it; I would lose time more than I would lose data.

 

I suppose I could shut the NAS down for a few weeks, but I am leaning towards the middle ground of leaving it running but not putting it under any strain. If I thought it were really crucial to shut it down, I would sooner blow another $100 on yet another 2TB drive.

 

Fingers crossed!

 

Supposing I did have a good spare disk here, what would be the correct course? Would the first disk to replace and rebuild be D4, or parity?

Link to comment

Supposing I did have a good spare disk here, what would be the correct course? Would the first disk to replace and rebuild be D4, or parity?

You will not be able to rebuild parity to be valid if any other disk (e.g. disk4) is having problems.

 

If you have managed to move all the files off disk4, then you could do a 'new config' to create the array with disk4 omitted.  You could then create a valid parity from the remaining disks and re-add disk4 when you have a suitable replacement.

Link to comment

Thanks for the replies guys.

 

I am thinking that the "shut it down till you have a new drive" position is correct.

 

I see that there is indeed no easy answer here - every path involves some risk. That's okay!

 

I think I've decided to move the files I'd most hate to lose and leave those that are easily replaced, then sit tight and not stress the server (or myself).

Link to comment

I am thinking that the "shut it down till you have a new drive" position is correct.

 

Definitely what I'd do ... but I also wouldn't wait for an RMA to process -- I'd order the new drive immediately (or pick one up locally).

 

Gary, you are really annoying me... because I know you're right, but I don't want to spend the $$$$. I'll just get one tomorrow.

 

And then after I replace/rebuild disk4, how do I clear that pending sector on the parity drive?

Link to comment

What does the Web GUI show as your current status for the drives?  i.e. are they all "green"?

 

If so, then despite the pending sectors, there haven't been any write failures nor any uncorrectable reads ... so the array should be good.    Especially after you replace disk4.

 

Here's what I'd do:

 

=> Replace disk4 and let it rebuild.  After the rebuild completes, run a parity check to confirm all is well.

 

=>  Now you want the pending sectors cleared on parity, but that requires writing to those sectors ... and that won't happen with a parity check.  So ... I'd do a new parity sync.  [This requires un-assigning the parity drive; starting the array without it; then stopping the array; reassigning the parity drive; and then starting the array again.]  But do NOT do this if you have any red-balled disks, as you will lose fault-tolerance for the duration of the new parity sync.
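The difference between a check and a sync can be sketched with a toy model. It assumes a pending sector only stops being pending once it is written, and that a parity check with no mismatches only reads. `ToyDisk`, `parity_check` and `parity_sync` are hypothetical names, and real drives do not always behave this neatly.

```python
class ToyDisk:
    """Very simplified drive: a pending sector stays pending until written."""

    def __init__(self, sectors, pending):
        self.sectors = sectors
        self.pending = set(pending)

    def read(self, lba):
        # Reading never clears a pending sector in this model.
        return lba not in self.pending

    def write(self, lba):
        # A write lets the drive reuse or remap the sector,
        # so it is no longer counted as pending.
        self.pending.discard(lba)


def parity_check(disk):
    """Read-only pass over every sector (assumes no mismatches to correct)."""
    for lba in range(disk.sectors):
        disk.read(lba)


def parity_sync(disk):
    """Full rewrite of every sector on the parity drive."""
    for lba in range(disk.sectors):
        disk.write(lba)


parity = ToyDisk(sectors=1000, pending={417})

parity_check(parity)
after_check = len(parity.pending)  # still pending: nothing was written

parity_sync(parity)
after_sync = len(parity.pending)   # cleared: the sector was rewritten

print(after_check, after_sync)
```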

 

Link to comment

Currently rebuilding disk4 on a brand new drive.

 

Watch this space, I may have some questions. The removed drive is not failing the WD tools diagnostics as I expected it to, and is giving different SMART readings. I may need some help interpreting it... I will just see what happens with the extended test.

 

Thanks again to all who replied.

Link to comment

You may have to try that. I recall someone having an issue with a pending sector that wouldn't clear even with a preclear.

 

Are there any errors in the log indicating issues writing to the parity?

 

The syslog appears clean as far as I can tell. I am somewhat inclined just to leave it for now, replace that drive when I get my RMAs back, and then do the preclear etc then. I can leave it in the box as a warm spare if it cleans up okay.

Link to comment

 

I just replaced a WD 3.0TB green in my main server that reported 65534 pending sectors. After the data was rebuilt and running on the new drive, I placed the old one in my test bed and ran one preclear cycle. It came back with the result below, but the pending sector number I started with does not look like it changed at all.

 

  Current_Pending_Sector =    66    200            0        ok          65534

No SMART attributes are FAILING_NOW

 

3 sectors were pending re-allocation before the start of the preclear.

14 sectors were pending re-allocation after pre-read in cycle 1 of 1.

65534 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

65534 sectors are pending re-allocation at the end of the preclear,

    a change of 65531 in the number of sectors pending re-allocation.

 

This drive is going for RMA but the results above do not make any sense to me.
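One way to sanity-check the figures in that report is the short sketch below. The readings are copied from the preclear summary above. Treating a value at or next to the 16-bit ceiling (65535) as a possibly saturated or bogus counter, rather than a literal sector count, is an assumption and not something the report itself states.

```python
UINT16_MAX = 0xFFFF  # 65535, the largest value a 16-bit counter can hold

# Pending-sector readings taken from the preclear summary.
readings = {
    "before preclear": 3,
    "after pre-read": 14,
    "after zeroing": 65534,
    "end of preclear": 65534,
}

# The reported "change" is simply end minus start.
change = readings["end of preclear"] - readings["before preclear"]

# Flag readings sitting at or right next to the 16-bit ceiling (assumed heuristic).
suspicious = [stage for stage, value in readings.items() if value >= UINT16_MAX - 1]

print(change, suspicious)
```

The arithmetic in the report is self-consistent (65534 - 3 = 65531). What stands out is the jump from 14 to a value one short of the 16-bit maximum during the zeroing pass.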

 

rgds,

 

Dave

 

Link to comment

I had a WD drive where 3 pending sectors would appear in one spot and then disappear again, before 3 more popped up in another spot, as I kept running preclears on it. Funny things happen sometimes. Best to just send it back if it's under warranty.

 

I will leave the Parity Disk for now and replace it with one of the RMAs that's coming back.

 

The one I had that had 65535 sectors - which was Disk 4 in the OP - kept passing in WD Diagnostics. I had to stress it (fill with zeros etc) a few times before it finally gave clear cut errors. Now it is in my RMA box.

 

Jason

 

Link to comment
  • 1 month later...

I thought I'd just pop in and say that all is well now. Nothing failed while I was waiting for the RMA drives, and one of the 2TB drives was replaced with a 3TB... no comment from WD so I don't know if it's a policy or an error. Obviously that is now my parity drive so I am "3TB drive ready" from here on.

 

I ended up with more drives than I really needed, so now my array has 2TB more capacity than it had before.

Link to comment

Archived

This topic is now archived and is closed to further replies.
