Hard Drive Help



Hello,

 

I am hoping some of the experts out there can give me some guidance on the best next steps. Here is the situation:

 

UnRAID Version: 5.0-rc6-r8168-test

Server: I forget the model, but it was purchased from LimeTech about 4 years ago. Never had a problem with it, rock solid.

 

I purchased 4 WD Green 3TB drives.

- Precleared all 4, no issues

- Replaced my parity drive, rebuilt and then did a parity check, all ok

- Added a new drive to expand space - all ok

- Replaced drive 3 which had a lot of bad sectors, so wanted to do a preemptive strike

- Rebuild was going VERY VERY slowly, like 3 days' worth. I read up on the issue, shut down, then pulled the drives and pushed them back in firmly to re-seat them. The rebuild seemed to go ok, but I noticed lots of errors coming up

- During the rebuild SABnzbd came off pause and a lot of downloads came through. Not sure if that contributed to the issues, but I figured I would mention it.

- After drive 3 rebuild I did a parity check and all came out ok.

- Added 1 more 3TB drive in slot 12, which was empty because of the way I had juggled drives; 11 and 13 were active (not sure if that matters at all, but again I figured I would mention everything just in case)

 

As you can see below, I am getting all kinds of errors on 3 of the 4 drives; to be fair, drive 12 has not had any real load on it yet.  At first the errors were just "current pending sector", but then the load cycle count came up along with other stuff.

 

I get lots of handle_stripe read error messages, and from what I can see in the forums that typically means a bad cable. As I purchased the server from LimeTech, I am assuming quality cables were used, so I guess the question is: do they go bad in 4 years?

 

Logs attached. I did run a full SMART test via unMenu and think it should be in the logs.  There is lots of stuff in there, and I am a noob when it comes to Linux and HD issues.

 

TIA for any thoughts on this.

 

Greg

 

unraidhdissues.jpg

syslog-20120821-092454.txt

Link to comment

How many passes did each new 3TB get in preclear? It looks to me like disk3 and disk13 are in trouble, given their age and your symptoms. Also, perhaps you should investigate whether you need to change the idle power-down setting on the new 3TB drives; it appears your LCC values are rising faster than I'd be comfortable with.

 

Bottom line, I hope you have data backups of your irreplaceable stuff, if there is any. You may not be able to get out of this without data loss if the current trend continues.

 

As far as what I'd do next, I think I'd start by backing up all the data on disk3 and disk13 to disk12 if there is enough free space. Those 2 drives need some extra testing before I'd trust them to hold data for any length of time. If there is irreplaceable data involved, I'd tread very carefully.

Link to comment

As long as load cycle count is a small multiple of power-on hours it is fine. Several times per hour is ok.  Hundreds of times per hour may eventually cause a problem but I've seen no supporting evidence.

 

The main issue is the current_pending_sector count on the 2 mentioned disks. If a disk shows any errors on unRAID main they must be resolved. The disks must have shown an error count on unRAID main at some point. Allowing 2 disks to reach this state is indeed a serious problem. See here: http://lime-technology.com/forum/index.php?topic=22111.msg196385#msg196385

Link to comment

Thanks guys,

 

I only ran one pre-clear cycle on each as I was anxious to get started.  The whole process of pre-clear, then data rebuild, takes so long!  My goal is to replace a number of the drives that have been in the system for 4-plus years, mostly the Seagate drives, as all of them have some sort of error on the SMART report.

 

I will move the data from disk 3 and 13 to 12 to empty 3 and 13.

 

If the drive is empty, is there a way to remove it from the array and run the array without a disk in that spot?  I was going to purchase 2 more drives, so I guess I can pre-clear them and then replace the empty 3 and 13, but that whole process will take over a week.  Oh well, I guess I'd better get started.

 

Appreciate any other guidance / suggestions.

 

Thanks.

Link to comment


The ratios are:

Parity: load cycle count = 1,694; power-on hours = 693; power cycle count = 17 (the load cycle count does not seem to have increased recently)

Disk 3: load cycle count = 1,523; power-on hours = 694; power cycle count = 17
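
Worked out as a quick sketch, using only the two sets of numbers above (the function name is made up for illustration and is not part of any unRAID or SMART tool):

```python
def load_cycles_per_hour(load_cycle_count: int, power_on_hours: int) -> float:
    """Return the average number of head load cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# (load cycle count, power-on hours) as reported for each drive above
drives = {
    "Parity": (1694, 693),
    "Disk 3": (1523, 694),
}

ratios = {name: load_cycles_per_hour(lcc, poh) for name, (lcc, poh) in drives.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} load cycles per power-on hour")
```

Both drives come out at a little over 2 load cycles per power-on hour, which falls in the "several times per hour" range rather than hundreds.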

 

One more question: I have the old disk 3, which was a 1.5TB drive. Can I take out the 3TB drive and put the old drive back in?  I think in this case I would only lose any new data written to the disk since it was installed, correct?  The 1.5TB drive was full, which is why I upgraded it to 3TB.

 

TIA

 

 

 

Link to comment

If your old drives are all still available, meaning the data on them is intact, then that gives you much more breathing room. When you move data around on the array I'd be sure to keep in mind which data is more easily replaced, and move that data last.

 

Data drives in unraid are easily swapped in and out without losing data, AS LONG AS YOU DON'T HAVE A DRIVE FAILURE. As soon as you remove a drive and unassign the slot, unraid expects that you will be replacing that drive with an identical or larger drive. If you wish to stop using that slot, or replace it with a smaller drive, you must give up drive failure protection temporarily by setting a new configuration, and recalculating parity based on the new smaller array.
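
To see why dropping a slot means recalculating parity, here is a toy model. It assumes a single parity drive computed as the bytewise XOR of all data drives; the three "disks" and their byte values are invented for illustration and the helper name is hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR the given equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents for three tiny data "disks"
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
disk3 = bytes([0x0F, 0xF0, 0x55])

# Parity covers every data disk in the array
parity = xor_parity([disk1, disk2, disk3])

# If disk3 fails, XOR of the survivors plus parity gives its contents back
rebuilt_disk3 = xor_parity([disk1, disk2, parity])
print(rebuilt_disk3 == disk3)

# Drop disk3 from the array and the old parity no longer matches the remaining disks
parity_without_disk3 = xor_parity([disk1, disk2])
print(parity_without_disk3 == parity)
```

In this model the old parity still encodes the removed disk's data, so it is wrong for the smaller array until it has been recomputed, and during that recomputation nothing can be rebuilt.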

 

If you still have all your old drives with data intact, you could install them and pull all the new drives, and set a new configuration to recalculate parity based on the old data drives. Problem being, if one of those data drives fails, you will be unable to rebuild it.

 

Safe bet at this point would probably be to empty out the drives showing obvious signs of failure onto the drives showing healthy, and pull the 2 suspect drives out of the array after you have accounted for all the data, then set a new configuration and recalc parity. I'd probably not physically pull the drives yet, just unassign them. You could then run a couple more preclear cycles and see if they are going to stabilize or completely die.

 

If things go pear shaped while you are in the process of moving data around on the current array, at least you still have the old drives as data back up.

Link to comment

Thanks Jonathan,

 

That is helpful.  Disk 13 only had 50 GB on it, but it took 1.5 hours to move the data to drive 12.  Drive 3 has 1.5 TB, so that will take a while to move, I think.  I will move the data and then report back as I do the next steps.  I guess the lesson is to do more pre-clear cycles and then make 1 change at a time and live with it a bit to ensure that the drive settles in before adding or changing another drive.
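
A rough back-of-the-envelope estimate for the disk 3 move, assuming the copy runs at the same rate as the disk 13 move and taking 1.5 TB as 1,500 GB (both are assumptions; a drive with pending sectors may well be slower):

```python
def estimate_hours(size_gb: float, sample_gb: float, sample_hours: float) -> float:
    """Scale a measured copy (sample_gb moved in sample_hours) up to size_gb."""
    if sample_gb <= 0:
        raise ValueError("sample_gb must be positive")
    return size_gb * sample_hours / sample_gb


# Measured: 50 GB from disk 13 took 1.5 hours
# Assumed: 1.5 TB on disk 3 is treated as 1,500 GB
disk3_hours = estimate_hours(size_gb=1500, sample_gb=50, sample_hours=1.5)

print(f"Estimated time to empty disk 3: {disk3_hours:.0f} hours "
      f"({disk3_hours / 24:.1f} days)")
```

At that rate the move would take about 45 hours, so just under two days.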

 

Cheers,

 

Greg

Link to comment

I guess the lesson is to do more pre-clear cycles and then make 1 change at a time and live with it a bit to ensure that the drive settles in before adding or changing another drive.

Yep. It's also a good idea NOT to order a bunch of drives at the same time. Space the purchases out either from different vendors, or wait as long as you can before purchasing more from the same vendor.

Link to comment
