Large Disk Failure Help


tential


Wow, it's surprising that after all this time I've now lost 2 drives.  The thing is, Drive 9 still shows up in Unraid as available for use.

 

What should I do, then, since rebuilding my parity goes super slow whenever I try it?

 

The only thing that makes me slightly worried is that the 2 connectors powering Drives 9/10 are at the very end of my chain of HDD power connectors.  Previously, they were connected to the out-of-use drives.  When I recabled, I used those 2 connectors because Drives 9/10 previously had 2 dedicated connectors with no splitters, and those were easier to move over to my SSDs.

 

Oh well, if I lose drives I lose drives, what's my next step here?

 

Edit: Maybe I should recable and take the Marvell controller out of the equation?

Edited by tential
Link to comment

You should try to rebuild disk10 first, since that's the worst one. There will be some corruption because of disk9, but it's the best option you have.

When that is done, replace disk9, and only after that resync parity1.

 

Bad or failing power can damage disks, so it's best to make sure all is good before proceeding. Also, like I mentioned, 500W is IMO on the low side for 15 disks: I don't use more than 12 disks on 500W, and for 14/15 disks I use 550W.
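As a rough illustration of why the drive count matters for PSU sizing, here is a back-of-the-envelope estimate in Python. Every figure in it (spin-up current per drive, steady-state draw, base system load) is an assumption picked for illustration, not a measurement from this server; the real numbers are on each drive's datasheet.

```python
# Back-of-the-envelope PSU load estimate.
# All constants below are ASSUMED illustrative values, not measured ones.
SPINUP_AMPS_12V = 2.0      # assumed peak 12 V current per 3.5" drive at spin-up
STEADY_WATTS = 6.0         # assumed draw per drive once it is spinning
BASE_SYSTEM_WATTS = 100.0  # assumed CPU, board, controllers and fans


def spinup_watts(drive_count: int) -> float:
    """Worst case: every drive spins up at the same moment."""
    return drive_count * SPINUP_AMPS_12V * 12.0 + BASE_SYSTEM_WATTS


def steady_watts(drive_count: int) -> float:
    """Load once all drives are spinning normally."""
    return drive_count * STEADY_WATTS + BASE_SYSTEM_WATTS


if __name__ == "__main__":
    for n in (12, 15):
        print(f"{n} drives: spin-up ~{spinup_watts(n):.0f} W, "
              f"steady ~{steady_watts(n):.0f} W")
```

Under these assumed figures the steady-state load is modest, but simultaneous spin-up is where a 500W unit gets close to its limit with 15 disks.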

Link to comment
31 minutes ago, johnnie.black said:

You should try to rebuild disk10 first, since that's the worst one. There will be some corruption because of disk9, but it's the best option you have.

When that is done, replace disk9, and only after that resync parity1.

 

Bad or failing power can damage disks, so it's best to make sure all is good before proceeding. Also, like I mentioned, 500W is IMO on the low side for 15 disks: I don't use more than 12 disks on 500W, and for 14/15 disks I use 550W.

Maybe it's a power issue; it only started when I tried to use all 15 of my drives at once.

I came across this thread, where the same error I'm having went away when the person upgraded their PSU.

https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous

 

If I can't rebuild because the process is extremely slow, should I try upgrading my PSU next?

 

tower-diagnostics-20180207-0935.zip

Latest Diagnostic.

Edited by tential
Link to comment
1 minute ago, johnnie.black said:

I would, or disconnect the unassigned disks you currently have connected.

I was just having the SAME thought, and was going to ask.  Alright, next step!  I'm going to enjoy watching an episode or two of TV first before I try.

 

I'm going to leave the two power connectors that were feeding ATA9/10 unplugged (the way they were before, when they only fed unassigned drives), use different power connectors in their place, and disconnect the unassigned disks.

 

Thanks, wish me luck!

Link to comment

tower-diagnostics-20180207-1404.zip

 

That's my latest diagnostic now.  What's the damage here?

 

Should I next swap the ATA9/ATA10 data cables with the ones on my SSDs to rule that out?  That way those 2 drives would be connected to my motherboard, and the 2 SSDs would move to the Marvell controller, so I can isolate it and see whether the trouble is with the controller or with those 2 drives specifically.

 

Edit: I'm guessing that diagnostic won't be very good. I started the array and am now getting a ton of read errors on Disks 7/8.  I think I'm making this worse.

 

I'm not sure why, but Transmission is also starting up ON even though autostart is off.  I'm sure it's not helping that it turns on at the beginning each time.

Edited by tential
Link to comment

I would mention that most drives will park (lock) the drive heads after a relatively short period of being idle. There is a SMART attribute called load cycle count (LCC) that tells you how many times the heads have been parked, and I think you'll find it is a pretty big number. Parking the heads helps prevent them from slamming into the disk surface and causing damage.
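For anyone who wants to check that number, here is a minimal Python sketch that pulls the load cycle count out of `smartctl -A` style output. The sample text and its values are made up for illustration (they are not from the diagnostics in this thread), and some drives name or report this attribute differently, so treat the `Load_Cycle_Count` match as an assumption.

```python
import re
from typing import Optional

# Hypothetical excerpt in the layout `smartctl -A` prints; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18512
193 Load_Cycle_Count        0x0032   134   134   000    Old_age   Always       -       198734
"""


def load_cycle_count(smart_text: str) -> Optional[int]:
    """Return the raw Load_Cycle_Count value, or None if it is not listed."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is the 10th.
        if len(fields) >= 10 and fields[1] == "Load_Cycle_Count":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


if __name__ == "__main__":
    print(load_cycle_count(SAMPLE))
```

To use it on a real disk, feed it the text printed by `smartctl -A` for that device instead of `SAMPLE`.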

Link to comment

Ok, well I'm hoping it isn't bad.  On my last array start, Unraid hard locked up, which I've never had happen before.  It spit out tons of errors at me, which I couldn't capture in a diagnostic :(

tower-diagnostics-20180208-0604.zip

 

I'm just going to wait for y'all to give me a clear set of steps for what I need to do next, so I don't mess this up any more.  I feel like at this point I don't even know how to get myself running stably again.  Maybe I need that new PSU. I don't know, I'm sad.

Link to comment
4 minutes ago, johnnie.black said:

Nothing on the diags, are they from after rebooting?

Yes, I couldn't get anything from Unraid before.  It was just locked up, which has never happened to me.  I couldn't cancel the parity check or anything.  I tried to shut down, but the PC stayed on all night and never powered off, and the server was inaccessible the whole time.  Sadly I couldn't pull a diagnostic, and I eventually had to hard power off.

 

This is after the reboot now. I haven't started the array, since last time I got a ton of read errors (like a million), and the parity check was still going but was going to take 30 days.

 

Not sure what to do here, this is now after the reboot.  

 

Sorry to have made this so massively confusing.

 

 

Link to comment

There are no syslog errors, but you still have the known failing disks 9 and 10 assigned. Were you trying to sync parity1?

 

If so, like I said, that won't work. You should try to rebuild disk10 to a new disk, since that's the worst one. There will be some corruption because of disk9, but it's the best option you have. When that is done, replace disk9, and only after that resync parity1.

 

 

Link to comment

Ok, will try that now.

 

Any thoughts as to how I lost disk 9/10?  

 

I'm getting errors everywhere now though.

 

Feb 8 06:43:00 Tower kernel: md: disk7 read error, sector=14552776

 

That can't be good, right?  There are over a million of them, on Disks 6/7/8.  The rebuild is still going and now says it will take 11 hours.
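To see how errors like these are spread across disks, a small Python sketch can tally them per disk. The regular expression is based only on the one log line quoted above; the other sample lines are hypothetical, and reading from `/var/log/syslog` on a live server is an assumption about where the log lives.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern modeled on the single syslog line quoted in this post.
MD_READ_ERROR = re.compile(r"kernel: md: (disk\d+) read error, sector=(\d+)")


def count_read_errors(lines: Iterable[str]) -> Counter:
    """Count md read-error lines per disk name (e.g. 'disk7')."""
    counts: Counter = Counter()
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Feb 8 06:43:00 Tower kernel: md: disk7 read error, sector=14552776",  # from the post
    "Feb 8 06:43:00 Tower kernel: md: disk7 read error, sector=14552784",  # hypothetical
    "Feb 8 06:43:00 Tower kernel: md: disk8 read error, sector=14552776",  # hypothetical
    "Feb 8 06:43:01 Tower kernel: some unrelated message",                 # hypothetical
]

if __name__ == "__main__":
    for disk, total in sorted(count_read_errors(sample).items()):
        print(disk, total)
```

If the errors turn out to be spread evenly over several disks at once, that points more towards something shared between them than towards each disk failing on its own.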

 

I'm just mind-boggled as to what's going on here. It feels like things are getting worse, not better.

 

MwC12oe.png

Edited by tential
Link to comment

The good news is that SMART for disks 6, 7 and 8 looks OK, so it's probably a cable issue: they are likely sharing the same miniSAS cable on the LSI. It can also be a controller, controller port or power issue.

 

You can try canceling the rebuild and replacing the miniSAS cable, or swap ports with the other one, to try and rule things out, then restart the rebuild.

 

Link to comment

Ok, I'm trying that now.  With the number of times power is being mentioned, and considering I'm at the power limit with 15 drives on a CX500 (500W) PSU, should I just order a new PSU now?  Looks like I'm screwed on that side either way, especially since I was going to upgrade to a Xeon and more GPUs this year too.

Link to comment

Seriously, I'd go for a new power supply. I know from experience that inadequate or failing power supplies can be responsible for some very obscure and frustrating fault conditions. My personal favourites are the Corsair AX and RMx series but there are other decent brands. You need a single +12 volt rail and I'd go for a power rating in the region of 650 to 750 watts, depending on your future expansion plans.

Link to comment
