One drive has major issues. What do I do next?



http://lime-technology.com/wiki/index.php?title=The_Analysis_of_Drive_Issues#Drive_interface_issue_.231

 

Your syslog shows both sdr and sdq having issues, and at first glance they appear to be physical. Try the link above for details, but in summary: a SATA cable may be bad or loose. It could also be a power supply problem, so check that all connections are solid.

 

Perhaps those two drives share a common controller, are in the same drive cage, or share a SAS cable?
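If you want to double-check which devices the errors cluster on before reseating anything, you can count the error lines per device in a saved copy of the syslog. A minimal sketch — the sample log lines below are made up, and on the server you would point grep at the real syslog instead:

```shell
# Build a small sample log (a stand-in for a saved copy of the syslog),
# then count the error lines that mention the suspect devices sdr and sdq.
printf '%s\n' \
  'kernel: sd 12:0:0:0: [sdr] Unhandled sense code' \
  'kernel: end_request: I/O error, dev sdq, sector 12345' \
  'kernel: [sdc] Attached SCSI disk' \
  > syslog.txt
grep -E 'sd(r|q)' syslog.txt | grep -ciE 'error|sense|reset|timeout'
# prints: 2
```

If one device accounts for nearly all the hits, that points at its cable, bay, or controller port rather than a system-wide problem.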

 

 

 

 

 

 

Link to comment

This is what I have today.

 

unRAID Version: 5.0

What should be my next step?

 

 

It also looks like you have an issue with disk4 as well as disk5.  You should probably also list your hardware, particularly what you are using for your SATA expansion cards.  You also had two disks that failed to return SMART data.  (I suspect that these were disk4 and disk5, since they did not indicate that they were powered up: no disk temperatures in the table.)

 

It also appears that you have a lot of disks with very high 'power on' hours.  While the actual hours are only an indication that you could be reaching the end-of-life region of the bathtub failure curve, they do raise the risk of multiple disk failures substantially!  I hope you have another backup of your critical data.
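For reference, that power-on figure comes from SMART attribute 9 in each drive's report. A sketch of pulling it out of a saved report — the report lines below are made-up samples, and on a live system you would generate the real thing with smartctl -A /dev/sdX:

```shell
# Sample lines in the format smartctl -A prints (made-up values),
# saved to a file as a stand-in for a real report.
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0' \
  '  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40321' \
  > smart_report.txt

# Attribute 9 is Power_On_Hours; the raw value is the last column.
awk '$2 == "Power_On_Hours" { print $NF }' smart_report.txt
# prints: 40321
```

Doing this for every drive gives you a quick table of which disks are the oldest in service.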

Link to comment


 

 

I have a 24-bay case. I'll need to open it up to get the info; I no longer have it.

As far as I remember there are 3 cards for all 24 bays.

 

 

Is there anything I can do to test the hardware?

 

 


 

Why do they have high power on hours? How do I fix that? Is this bad?

 

Link to comment

Why do they have high power on hours? How do I fix that? Is this bad?

 

It's not something you can fix - it's just a running count of how many hours that drive has been running. It is used to give an indication of how old the drive is and how long it's been in service (think of it like the odometer in a car). Like anything else, the longer you use it, the more prone to issues it becomes. If you are running a number of old disks, it may be time to at least start planning on replacing them with newer, high-capacity disks.

 

Mind you, you need to get through your current issue first, so I would sort of back-bench this for the moment.

Link to comment

 

 

 

 


Why do they have high power on hours? How do I fix that? Is this bad?

 

The power-on hours is the length of time that these drives have been in service with power applied to them.  (As I recall, some of your disks have over 40,000 hours on them. This is about 5 years of continuous use.)  Hard disks have a failure curve that looks like a bathtub (Google 'bathtub curve').  Basically, a relatively high percentage of hard disks will fail quite early in their life.  (This is often referred to as 'infant mortality'.)  Once they get past this point, there are very few failures over a long period of time.  Then the failure rate starts to rise again.  This last period is the 'end-of-life', when the disk simply wears out from use.
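As a quick sanity check on that 40,000-hour figure (plain arithmetic, nothing unRAID-specific):

```shell
# 40,000 power-on hours expressed as years of continuous (24x7) operation.
awk 'BEGIN { printf "%.1f\n", 40000 / (24 * 365) }'
# prints: 4.6
```

So roughly four and a half years of non-stop operation, in line with the 5-year ballpark above.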

Link to comment

OK, good info regarding the bathtub curve.

Some drives have been on that long, huh? Wow

 

 

 

What is DSBL on disk5?

I have 2 3TB drives still in the package. I would preclear them, then what? What are the next steps to get the data from disk5?

Also, how do I properly swap a drive and have the data put back on?

Lastly, how do I determine the drives to replace from looking at the SMART Report?

 

 

I pulled the case off the rack. A few of the SATA cables may not have been connected securely. I snugged them all up and powered on.

 

 

I now have this:

SMART Report

Syslog

 

 

This is with the array off.

2014-07-06_18-27-42.png

 

 

This is with the array on. This does not look good.

2014-07-06_18-52-03.png

 

 

 

 

 

 

Link to comment

I have been waiting to see if anyone else would jump in with advice, and no one has so far.  I have been in a similar situation, and this is what I did.  You now have an array with only one defective disk, so unRAID should be able to completely rebuild the contents of that disk.

 

First, I would shut the array down until I was ready to replace that disk.  (You don't want a second disk to fail before the replacement is completed.) 

 

Second, I would preclear a replacement disk using another machine.  My personal preference is to do three complete cycles, as this should get you past the 'infant mortality' portion of the bathtub curve.  (You can find lots of discussion on this point if you look for it...  ;)  )

 

Third, I would replace that disabled disk using these instructions from the unRAID Manual 5 (written for Version 5.0-beta and not yet complete):

 

      http://lime-technology.com/wiki/index.php/UnRAID_Manual_5#Replace_a_failed_disk

 

Whatever you do, DON'T do any writes to the disabled disk you removed until the array is up and running and you have run a non-correcting parity check without any errors.  (If anything goes wrong, there is a possibility that you can recover some of the files from it on another computer.)

 

I would then set up automatic non-correcting parity checks on a very regular basis (probably weekly) and check the outcome.  (Look at that unMENU SMART View screen in your last post.  All of those yellow, orange and red entries under the 'Add'l Issues/Failures' column are flagging issues on drives that are starting to fail!)
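For the weekly schedule, unRAID 5 setups typically use a cron entry that invokes mdcmd check NOCORRECT (the non-correcting form of the parity check). A sketch that stages such an entry in a local file — the 01:00-Sunday schedule and the file name are just examples, and how you install the line into root's crontab depends on your setup (often via the go file):

```shell
# Stage a cron line for a non-correcting parity check every Sunday at 01:00.
# 'mdcmd check NOCORRECT' starts a check that reports, but does not write,
# parity corrections.
echo '0 1 * * 0 /root/mdcmd check NOCORRECT' > weekly-parity-check.cron
cat weekly-parity-check.cron
```

The point of the non-correcting form is that a failing drive can't silently "correct" good parity; you see the error count first and decide what to do.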

 

You should keep an eagle eye on disks 10, 12 and 15.  (They already have serious issues!  If the 'reallocated sector ct' increases, that is a real sign that failure is imminent!)  (In fact, I would be replacing them as soon as possible, mostly because I like to be proactive in fixing things before they become major problems.)
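One way to keep that eagle eye is to save a dated copy of each drive's SMART report after every check and compare the raw Reallocated_Sector_Ct values. A sketch with two made-up snapshot files standing in for real smartctl -A output:

```shell
# Two made-up snapshots of the same drive's Reallocated_Sector_Ct raw value,
# taken a week apart (stand-ins for saved 'smartctl -A' output).
printf '  5 Reallocated_Sector_Ct ... - 24\n' > smart_week1.txt
printf '  5 Reallocated_Sector_Ct ... - 57\n' > smart_week2.txt

old=$(awk '$2 == "Reallocated_Sector_Ct" { print $NF }' smart_week1.txt)
new=$(awk '$2 == "Reallocated_Sector_Ct" { print $NF }' smart_week2.txt)

# A rising count means the drive is remapping bad sectors: plan a replacement.
if [ "$new" -gt "$old" ]; then
  echo "reallocated sectors rising: $old -> $new"
fi
```

The absolute number matters less than the trend: a count that keeps climbing between checks is the warning sign.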

Link to comment

I agree with Frank1940. Shut down the array, but I would pre-clear both 3TB disks you have ready to go. Once they are pre-cleared and confirmed good, I would use the first one to replace your failed disk and let parity rebuild it. Do a non-correcting parity check, and hopefully the results are good.

 

Once you have that back up and running, I would add the other 3TB disk as a new drive to the array and copy the contents of disks 10, 12 and 15 to it.

 

In order to remove disks 10, 12 and 15 you are going to have to run the New Config utility, which means you will need to re-assign all the disks in your array. I would take a screenshot of your current GUI page, power down and take out the 3 questionable disks, then power up again and re-assign the drives to the array (with 5.0 and later each disk can be in whatever slot you like, but you want to make sure you get Parity correct).

 

This will get you back to a point of a clean array with healthy disks.

 

However... you have a large number of disks with 20,000+ power-on hours (some double that), which leads back to the earlier point that you are moving out of the disks' "safe" period and into the period where issues are more likely. As you can afford it, I would suggest you start a longer-term project of introducing 3TB or 4TB disks to replace those old 1TB/2TB disks.

Link to comment

Frank1940 and bkastner, you have both contributed a lot of very good information, right to the point. I really appreciate your help.

 

 

I spent the rest of my Sunday researching the issue. I wonder if the half-seated SATA cable to the controller caused this. There is a train that rolls by about 100 feet away; maybe the vibrations contributed? I have no issue doing what was suggested, I am just curious as to how disk5 developed problems beyond the high power-on hours.

 

 

After looking at the syslog during bootup, I took a more in-depth look at disk5. I mounted the drive with mount -t reiserfs /dev/sdp1 /tmp/x, cd'd into the directory, and could view the folders. I was going to attempt to run Joe L.'s unraid_partition_disk but have decided against it for the time being, as I will just replace the drive with a precleared one.

 

 

Do either of you folks have any recommendations or opinions on what I can do to this machine to extend the life of the hard drives, both current and new ones? Settings, configurations, scripts, spin-down, etc.?

Link to comment


Do either of you folks have any recommendations or opinions on what I can do to this machine to extend the life of the hard drives, both current and new ones? Settings, configurations, scripts, spin-down, etc.?

 

Unfortunately, drive life is dynamic and unique to each individual drive. You can buy 2 identical drives on the same day, from the same vendor, and from the same lot number, and get vastly different life expectancies from them. It's the nature of the beast. :)

 

As for the train causing issues... vibration could have an impact, but I'd expect it to be minimal. If you want to eliminate cabling as an issue in the future, look for locking SATA cables. They click into place, and you need to press on either side of the connector to unplug them. This will help ensure cabling isn't a problem down the road.

 

As for extending the life of your old drives - nothing is going to help here. It looks like you've gotten good use out of them, if the power-on hours are any indication, and they are getting due for retirement. For future drives, again, there is not much I think you can do, and I wouldn't worry too much about it.

 

You should get 5-7 years out of a standard drive (maybe more, maybe less) using it with unRAID. I wouldn't stress myself out about trying to extend lifespans. If you are buying 3TB/4TB disks now, by the time they are getting long in the tooth you will likely be able to buy 8TB or 10TB disks and can look at consolidating data further at that point.

Link to comment

I followed the simple steps from http://lime-technology.com/wiki/index.php/UnRAID_Manual_5#Replace_a_failed_disk.

After powering on the machine I allowed it to sync, and it is now running the parity check with 'correct any parity errors' turned off. I still see disk5 as unformatted while the parity sync is running.

 

 

Should I scrap the sync and then format the drive and run rebuild, then parity check without correction again? I did preclear the drive.

Link to comment


Should I scrap the sync and then format the drive and run rebuild, then parity check without correction again? I did preclear the drive.

 

Was the new disk assigned to the disk5 position before you started the array?  You can tell by the serial number.  (It has been a long time since I have done a disk replacement and I don't remember all of the nitty-gritty details.)

Link to comment


Should I scrap the sync and then format the drive and run rebuild, then parity check without correction again? I did preclear the drive.

With unRAID, I normally think of 'sync' as meaning correcting parity, or possibly rebuilding parity. Also, formatting a drive is not part of the process of rebuilding an array drive either, as the file system is part of the rebuild. Many people seem to think formatting a drive means erasing it; what formatting really means is creating an empty file system.

 

Maybe another screenshot would help clarify where you are.

Link to comment

dgaschk,

I think he intended to rebuild drive5 from parity. The drive5 in the new screenshot is not the same as the drive5 in the first screenshot. I don't know how he got to this point.

 

berizzle,

As I said before, formatting a drive is not part of the process of rebuilding a drive. When you replaced drive5 it should have offered to rebuild it. It should not be offering to format a drive if you have done the procedure correctly.

 

Not sure how to proceed now. Exactly how did you get to this point? It looks like unRAID doesn't know that there is supposed to be any drive5 data to rebuild.

 

Maybe someone can help you recover the data from the original drive5.

Link to comment
