Repeated Drive Failures



Unraid 5.0 RC 10

 

I've had 4.7 running for about a year with little to no problems, but I'm still a relative noob. Decided to jump the gun and give 5.0 a try, starting with rc9 and quickly moving to rc10. Things were running fairly smoothly; I installed sabnzbd, sickbeard, and couchpotato.

Got my first red ball next to one of my drives on Jan 10th. I was a little surprised, but since it was a roughly four-year-old 1.5TB Samsung I wasn't worried. I had another 1.5TB sitting around that was working fine, so I pre-cleared and installed that, and everything was going smoothly. While that was pre-clearing I ordered a 3TB Red drive to upgrade my parity drive. Got that, ran a 3-cycle pre-clear (6 days, if anyone was wondering) with no errors, swapped out the older 2TB parity drive for the new 3TB, and let everything rebuild. Things seemed to go fine.

Went to do a parity check; it was chugging away and I went to bed. Woke up, refreshed my screen, and behold: another red ball on a different drive, this time a 1.5TB Seagate. Thought that was odd but just a coincidence. Since I still had my old parity drive hanging around, I pre-cleared it just fine and let the system rebuild last night. This morning, right after I clicked to start a parity check, I refreshed to see how long it was going to take... a third red ball, on yet another drive. At this point I no longer think my old drives are just dying; something else is going on.

 

 

TL;DR: Multiple drive failures; I'm thinking it's no longer a coincidence.

 

Another issue I was going to address, though it may or may not be related: Sickbeard keeps losing its settings and I have to keep re-running the install wizard. Then, after having the system shut down for a number of days during this fiasco, it's no longer installed at all; it's simply gone, and I never uninstalled it. Sab, sick, and couch are all stored on my cache drive. PLEASE NOTE: if this is not related to my drive failures, I don't need to address it now.

syslog-2013-02-05.txt

Link to comment

A little update. I installed rfstools and yareg to take a look at the latest red-balled drive from within Windows. I randomly checked files and everything seems OK. I decided to switch the drives around to different physical slots in my unRAID system to see if that changes anything. All of the red balls have been on slots attached to the SUPERMICRO AOC-SASLP-MV8. Perhaps this is the issue?

 

Put the seemingly OK drive back in the system and I still get the red ball of death. I decided to run a single-cycle pre-clear on it; it's currently at 28% of the post-read. Adding drives is not a quick process in unRAID, is it? I'm wondering what the pre-clear results will be and, if they come back clean, whether I'll still have a red ball when it's done. Got another 3TB Red drive in today; I think I'm going to back up my critical data to it before anything else goes down the tubes.

 

Any advice would be appreciated.

Link to comment

A drive that has a "red" indicator has had a "write" to it fail.  It will NEVER change back to green without being reconstructed (or forced back into service by use of the new-config and invalidslot commands).

It does not matter whether the issue was a bad sector or a loose cable; once the problem is corrected, you must re-construct the drive for its contents to be correct.  (Remember, a write to it failed, so it is nearly guaranteed to be out of date with respect to its correct contents.)

 

Running the preclear script on a disk with sectors pending re-allocation should re-allocate them...  That will not make it "green" until you put it back into the array and reconstruct the data back onto it.
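As a side note, one quick way to verify that a preclear actually dealt with those sectors is to compare the relevant SMART counters before and after the cycle. A minimal sketch, assuming smartmontools is available and /dev/sdX stands in for the suspect drive:

    # Show the sector counters a preclear cycle should affect on the suspect disk
    smartctl -A /dev/sdX | grep -Ei 'Reallocated_Sector|Current_Pending_Sector'

After a successful preclear, Current_Pending_Sector should be back to 0; if any of those pending sectors really were bad, Reallocated_Sector_Ct will typically have gone up by roughly the same amount.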

 

Glad your data looks good.  Hope it is as simple as a flaky disk controller card.

 

As far as critical data, you have the right idea. See here:

http://lime-technology.com/forum/index.php?topic=2601.msg21033#msg21033

Link to comment


I should have thanked you when I first got everyone's quick responses, but I wanted to wait until I had my one-cycle pre-clear results. I didn't know that unRAID would keep the drive red until some intervention was done. The drive passed the pre-clear; I've attached the report just in case I missed something.

 

I have re-arranged the drives in my unRAID system so that none of them are going through the AOC-SASLP-MV8. At this moment I'm thinking the card might be the issue. I wish it weren't, but that's better than a bad motherboard and the headache of swapping that out. Going to re-assign the drive and let it rebuild; I'll update with the outcome.
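For anyone doing the same shuffle, here is a minimal sketch of how to confirm which controller each disk actually sits behind; it only uses the standard Linux sysfs layout, nothing unRAID-specific, and the device names it prints are whatever your system has assigned:

    # Print each disk's kernel device path; the PCI address in the path
    # shows whether it hangs off the motherboard ports or the add-on card
    for d in /sys/block/sd?; do
        echo "$(basename "$d") -> $(readlink -f "$d"/device)"
    done

Disks behind the AOC-SASLP-MV8 will all share that card's PCI address in the printed path, which makes it easy to double-check that none of the array drives are still routed through it.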

preclear_results_Seagate_1.5TB_9XW00BCZ.txt

Link to comment

The smart report has this line:

        Seek_Error_Rate =    43      42          30        near_thresh 1876908072301

It seems your disk sets all parameters to 100 once the disk is in use (from a factory-new value of 253).

The "worst" value has been 42, the current normalized value is 43, and the failure threshold is 30 for errors in seeking to a track.  Remember, the normalized value started at "100".

 

Basically, it appears as if physical wear is showing itself.  The disk has not failed, but seems to be showing signs of wear in that "seeks" are experiencing errors.
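For anyone who wants to spot this kind of creep before a preclear report flags it, here is a minimal sketch that prints any attribute whose normalized value has fallen close to its failure threshold. The column positions assume the usual smartctl -A table layout, and the 15-point margin is an arbitrary choice, not the preclear script's exact rule:

    # Flag SMART attributes whose normalized VALUE (col 4) is within
    # 15 points of the failure THRESH (col 6)
    smartctl -A /dev/sdX | awk '$1 ~ /^[0-9]+$/ && ($4 - $6) <= 15 {print $2, "value=" $4, "thresh=" $6}'

Run against this disk it would flag Seek_Error_Rate (43 versus a threshold of 30), which is exactly what the preclear report's near_thresh marker is telling you.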

 

I'd think about replacing the drive.  It may clear just fine, but if it cannot move the disk heads to the correct track without errors, that really does not help as much as you might want.

 

Joe L.

Link to comment


So I guess I did miss something. The array rebuilt and I checked parity without any errors, so for now it seems to be in a good place. Replacing the drive is next on my list. Is there a way to run a test that gives results similar to a pre-clear, but without completely clearing the drive in the process? If so, I'd like to run it on the other drives in the array. In the meantime I'm going to pre-clear (i.e. test) the two drives I removed/replaced in the past two weeks.

Also, since I think I've narrowed the issue down to the AOC-SASLP-MV8 card, is there any type of test/diagnostic I can run to really see whether that is the issue?
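One low-effort check in the meantime: the SASLP-MV8 is generally handled by the Marvell mvsas driver under Linux, so grepping the syslog for that driver plus the generic link/IO error messages around the times the red balls appeared should at least show whether the errors cluster on that card. A rough sketch (the exact message strings vary by kernel version):

    # Look for controller/driver errors and link resets in the current syslog
    grep -iE 'mvsas|i/o error|medium error|hard resetting link' /var/log/syslog

If the hits line up with the disks that went red while they were on the card, and stop after moving them to the motherboard ports, that is about as good a field diagnostic as you'll get without swapping hardware.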

Link to comment
A clean parity check is an indicator that all drives read back the expected values. It's not as thorough as a preclear, because it doesn't do a write-and-compare pass, but it does read every single spot on every drive once. Checking SMART attributes for all drives, running a non-correcting parity check, and then comparing the after-check SMART results to the ones you got pre-check is a pretty good way to keep up with drive health.
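As a minimal sketch of that workflow (device names are placeholders for your own disks, and /boot is just the unRAID flash drive used here as a convenient place to keep the snapshots):

    # Snapshot SMART attributes for every /dev/sd? disk into a timestamped file
    mkdir -p /boot/smart
    stamp=$(date +%Y%m%d-%H%M)
    for dev in /dev/sd?; do
        smartctl -A "$dev" > "/boot/smart/$(basename "$dev")-$stamp.txt"
    done

Run it once before kicking off the non-correcting parity check and once after, then diff the two files for each disk; growth in attributes like Reallocated_Sector_Ct, Current_Pending_Sector, or UDMA_CRC_Error_Count between the two runs is what you are looking for.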
Link to comment

A little update. Things have been stable since I moved the drives off of the AOC-SASLP-MV8: no red balls. I know that a few of my older drives are approaching the end of their lives. I did pre-clear the two drives that gave me the red balls last month; they passed, with "No SMART attributes are FAILING_NOW" in the results but a couple of "near_thresh" line items. Going to add my second 3TB Red drive this week.

 

I've attached the pre-clear results from the drive that gave me my first red ball. There is only one line with a "near_thresh" indication:

 

End-to-End_Error =    1      1            0        near_thresh 150

 

Is this something that I should be concerned about? Would this have caused the first red ball? Or am I right in suspecting AOC-SASLP-MV8?

 

 

preclear_HD154UI.txt

Link to comment

The only way to answer that is to know the starting value of that individual parameter for that specific make/model/firmware version of drive.  It might start at "1", and any failure of that type might fail the drive; or it might start at 100, and the raw count of 150 might have gotten it to where it is just about to hit the threshold.
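Either way, a cheap follow-up is to keep an eye on that single attribute now that the drive is off the SASLP-MV8 and see whether the raw count moves past 150. A minimal sketch, with /dev/sdX standing in for the Samsung HD154UI from the attached report:

    # Pull just the End-to-End_Error line so the raw count can be tracked over time
    smartctl -A /dev/sdX | grep -i 'End-to-End'

End-to-End_Error is generally understood to count errors inside the drive's own internal data path, so a flaky controller card would not normally push it up; if the count keeps climbing, the drive itself is the thing to worry about.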
Link to comment
