One red ball drive. Rebuild knocks another drive off line.



Hi guys,

 

So on the weekend I noticed that one of my drives (Drive 6) had gone redball. I powered off the system and restarted for good measure. I ran a smartctl against it and everything looked ok (as far as I could tell...it "passed" anyway). So I was heading out for golf and I decided to run the long test just for good measure. When I checked after it finished, again I could find no problems. So I initiated a rebuild of the drive on the array.

 

After the rebuild started, everything proceeded fine until about 4% or 5% completed, and then another drive (Drive 5) started reporting a bunch (1000s) of errors. I stopped the array and checked, and the system could no longer find Drive 5 (it showed as missing).

 

I rebooted again, and the drive was back. I ran smartctl short test on that drive and it all showed OK, so I started the rebuild again.

 

As before, it bombed out at about 5% and Drive 5 was "missing" again. So I powered off, rebooted and started the array, but I have stopped the rebuild until I hear from you guys.

 

I've attached a syslog and the smartctl logs from the two drives in question. I'm running unRaid 5.0.6.

 

What do you guys suggest for a course of action?

 

thanks,

 

Sam

Syslog_Smartctl.zip

Link to comment

I wonder if you might have some loose connections to your 5in3 cage. Once the connections are made and burned in, and if you never open the case, they usually are good for years. But your symptoms point to something wrong with cabling to the cages.

 

Some cages require a little extra shove to ensure that the tray (with drive attached) is seated on its backplane. My Supermicros snap the drive securely in place, but the latch on my Rosewills is not quite as sure and I always give an extra push. Not sure of your cages, but that might also be the issue.

Link to comment

Thanks for that. I haven't touched this computer in months... if not years. It's just been sitting there happily and quietly!

 

I bought a replacement drive for the one that was questionable. I'll run it through a preclear and then try rebuilding with it.

 

Am I right that there is nothing wrong in the smartctl reports? I noticed the drive in question had this:

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       32
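
For anyone wanting to sanity-check a line like that one, here is a minimal Python sketch. It assumes the usual smartctl column order shown in the header above and the common convention that an attribute counts as failed when its normalized VALUE is at or below THRESH. The helper names (`parse_attribute_line`, `is_failing`) are made up for illustration and are not part of smartctl.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized value (higher is generally better)
    worst: int        # lowest normalized value recorded so far
    thresh: int       # failure threshold set by the vendor
    when_failed: str  # "-" means the attribute has never failed
    raw_value: str    # vendor-specific raw counter


def parse_attribute_line(line: str) -> SmartAttribute:
    """Split one row of the attribute table into its columns.

    Assumes the 10-column layout:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split(None, 9)
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        when_failed=fields[8],
        raw_value=fields[9].strip(),
    )


def is_failing(attr: SmartAttribute) -> bool:
    # Assumed convention: failed when normalized value <= threshold.
    return attr.value <= attr.thresh


line = "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       32"
attr = parse_attribute_line(line)
print(attr.name, attr.value, attr.thresh, is_failing(attr))
# prints: Raw_Read_Error_Rate 200 51 False
```

By that reading, the normalized value of 200 is well above the threshold of 51, so this attribute is not flagged as failing even though the raw count is 32.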

 

I only noticed this because a drive in my HTPC failed last week. Maybe there is something in the water!

 

Sam

Link to comment

OK, well that didn't work.

 

I tried rebuilding with the brand new drive in place of the one that was redballed (Drive 6), but the rebuild process still knocked Disk 5 off line (missing).

 

Can I move Disk 5 to another slot/controller and still rebuild from there, or will the array think that it's a different disk because it will change from /dev/sdh to /dev/sde or something similar?

 

That's all I can think of trying for now. Any other suggestions?

 

Sam

Link to comment
"...Can I move Disk 5 to another slot/controller and still rebuild from there, or will the array think that it's a different disk because it will change from /dev/sdh to /dev/sde or something similar..."

Since you are on v5, unRAID will remember the disks by their serial numbers, so you can try a different port.
Link to comment

So it looks like I've got something else going on: a controller, cable, or cage problem.

 

I moved Disk 5 to another slot in a separate drive tray that I use when I preclear drives. I know that tray is connected directly to the motherboard. I started up the system and checked the file system of Disk 5 just because I could (no problems). I then started another rebuild.

 

Now another drive is getting knocked off line. I'm not sure whether that drive is in the same 5in3 cage that Disk 5 was in, but I don't think it is.

 

The system has not been moved/touched/disturbed in months, so I don't think I have spontaneously loose cables. Plus, I think I'm using locking cables (at least coming from the add-in controller card). I'm starting to suspect the controller card (Supermicro AOC-SASLP-MV8).

 

So, based on that information, what do you guys think? Any other thoughts on how to isolate? Looks like I will have to open the system up!

 

Sam

 

Link to comment
