How to proceed after my SAS controller card crapped out. Possible parity rebuild required



Will try to accurately and completely explain what happened and the state I am currently in. I have a server with 12 drives and dual parity; the parity disks are 8TB drives. Yesterday the power company stopped by to let me know they would be replacing a pole in front of my house. I shut down my server and went to lunch. When I got back the power was on again, so I started my server.

Upon booting I had alerts that drives were missing, 5 to be exact. The array started anyway, not sure how. Docker had also started and a few containers were running... Anyway, I stopped the array and the drives were not detected by the OS. Thinking it was perhaps a random issue, I restarted. The drives all came back and the array restarted, but 2 were marked disabled and emulated. Then I started to get read errors on 6 drives (including the 2 that were "disabled"), so I once again shut down.

I checked the cable layout and it turns out the affected drives were all on the same SAS controller. I removed the SAS controller and everything boots fine, all drives present, but the 2 drives are still disabled. That would be fine, but now I am getting a SMART error of "Current pending sector" with a value of "1" on another drive (it was connected to the SAS controller and was one of the ones throwing read errors). I ran an extended test on the 6 drives that were having read errors and all appear fine except for the one still showing the pending sector error.
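For anyone following along, the pending sector count and the self-test results can be checked from the command line roughly like this (the device name is a placeholder; substitute the actual disk from the Unraid GUI or lsblk):

# show SMART attributes and look for attribute 197 (Current_Pending_Sector)
smartctl -a /dev/sdX | grep -i current_pending_sector

# run an extended self-test, then read the results back once it finishes
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX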

 

As far as Unraid is concerned I have two failed drives. With the SMART error I am worried about rebuilding parity and having an issue there.

 

I want to throw out that the two drives that are disabled are a 4TB and an 8TB. I cannot be sure that no data was written to the array when I booted, as Docker had started and a few containers were running...

 

How would you proceed here? Would it be better to go the new config route and hope that the drives still match parity? Or would it be better to remove the drives, add them back, and let them rebuild? If so, should I add them both at the same time, or would it be better to do them individually? Would the 4TB rebuild faster than the 8TB? If so, that would give me the benefit of only having 1 failed drive if something else happens during that window. I'm worried I may have to start a manual recovery of files here, and that is going to be a mess...

Link to comment

Well, I read more docs and it seemed like new config was the way to go. I did it and everything seemed fine until it wasn't: I started getting new read errors on completely different drives. I swapped the PSU for a brand new one and it did not make a difference. Here are the diagnostics files. They are not from when the read errors were occurring, at least I do not think they are...

 

essex-diagnostics-20191115-0106.zip

Link to comment

So this is an unusual case (too many faults) and I don't think Unraid's protection mechanism will work well or remain valid here.

You have one disk with a pending sector error and a failed SMART self-test, so I don't think a rebuild would be trouble-free.

 

7 hours ago, Syco54645 said:

I ran an extended test on the 6 drives that were having read errors and all appear fine except for the one still having the pending sector error.

I would first check whether the filesystem on every disk is healthy.

Then stop the array and mount the emulated disks with UD (Unassigned Devices). If they mount and the filesystem check also shows no errors, then I would assume the physical disks and their data are healthy.
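Roughly like this from the console (assuming the disk in question is disk 1 and formatted XFS; adjust the number and device names for your array, and use the matching check tool if a disk is ReiserFS or Btrfs):

# with the array started in maintenance mode, check the emulated disk read-only (no changes written)
xfs_repair -n /dev/md1

# or mount the physical disk read-only outside the array to see if the data is reachable
mkdir -p /mnt/test
mount -o ro /dev/sdX1 /mnt/test
ls /mnt/test
umount /mnt/test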

 

If the filesystem check comes back clean, then I would perform a "new config" (without adding the parity disks), copy the files from the Samsung 2TB disk to a healthy disk, and finally add parity back and rebuild.
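For the copy step, something along these lines would do it (the paths are only examples; the source is wherever UD mounts the Samsung disk and the target is a healthy array disk with enough free space):

rsync -avh --progress /mnt/disks/samsung_2tb/ /mnt/disk5/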

 

If I can't confirm parity is valid then I wouldn't rebuild from it; that would just make things worse (file corruption).

 

Link to comment
6 hours ago, Syco54645 said:

Well, I read more docs and it seemed like new config was the way to go. I did it and everything seemed fine until it wasn't: I started getting new read errors on completely different drives. I swapped the PSU for a brand new one and it did not make a difference. Here are the diagnostics files. They are not from when the read errors were occurring, at least I do not think they are...

 

essex-diagnostics-20191115-0106.zip

You have two Marvell RAID controllers; it would be best to use some spare disks for testing to make sure everything is stable and functional before any disk recovery.

Link to comment
6 hours ago, Benson said:

You have two Marvell RAID controllers; it would be best to use some spare disks for testing to make sure everything is stable and functional before any disk recovery.

I have an LSI card on order, just waiting for it to arrive. I am seeing that Marvell is not recommended for Unraid. This card was recommended to me by someone in #unraid on Freenode; not that it matters, just giving my reason for purchasing it. I thought one of the Marvell cards was bad, because when I moved the drives to the onboard SATA and the other card the errors went away, until I tried to rebuild and got a bunch of read errors. I have done multiple extended SMART tests on the drives that are having these problems and they check out fine every time (except for disk 9).

Physically my server is a 4U rackmount Rosewill case and the drives are all in Norco SS-500 cages. Card0 had 8 drives on it and card1 had 4. Cage0 and cage1 were on card0 and cage2 was on card1. I do not think the issue lies there, as I was first having issues with all of the drives on card0 that were in cage0. The 5th drive was working fine; it was plugged into the onboard SATA ports. I have since removed drives from card0, moved the drives that were having issues to onboard SATA and the other 4 to card1, and now I am having issues with the drives originally on card1. Because of this I do not think the issue is in the cages, as I am having issues across all three.

 

How would you suggest I test for stability? Create a new server on a different stick and toss drives in it? That actually does not sound like a bad idea now that I say it. I have a pile of drives that were pulled for age but were still performing well; I can create a new server with those on the current hardware and see if I have the same issues.
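If I go that route I would probably just run read-only passes over the spare drives and watch the syslog while they run, something like this (device names are placeholders):

# non-destructive read-only surface scan of a spare drive
badblocks -sv /dev/sdX

# or simply stream the whole disk and discard the output, watching for read errors
dd if=/dev/sdX of=/dev/null bs=1M status=progress

# in another shell, watch the syslog for controller/disk errors while the reads run
tail -f /var/log/syslog | grep -iE "error|ata|sas"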

essex-diagnostics-20191115-0106.zip

Link to comment

If the drives connected to the onboard SATA never got read errors, then I suggest simply waiting for the LSI card to arrive.

 

But if you would like to troubleshoot, using the Unraid USB backup function and restoring it at the end is also fine.
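The flash backup can be taken from the web GUI if your version has the flash backup option, or simply copied off by hand; the Unraid flash drive is mounted at /boot (the destination path below is just an example):

mkdir -p /mnt/user/backups/flash
cp -a /boot/. /mnt/user/backups/flash/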

 

I agree the problem is not in the cages (you have 3), but I have no clue why it started after a power cycle.

 

Link to comment
1 hour ago, Benson said:

If the drives connected to the onboard SATA never got read errors, then I suggest simply waiting for the LSI card to arrive.

 

But if you would like to troubleshoot, using the Unraid USB backup function and restoring it at the end is also fine.

 

I agree the problem is not in the cages (you have 3), but I have no clue why it started after a power cycle.

 

Yes, at this point I am just going to wait for the LSI. Nothing else is going to get me up and running, I think. I will probably swap to the LSI card and start the array, making sure anything that may write to the array is stopped, and allow the parity sync to happen. The errors were only cropping up last night during a sync.
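While the sync runs I will probably keep a console open and watch for fresh errors, something like:

# watch the syslog for new read/controller errors during the parity sync
tail -f /var/log/syslog | grep -iE "error|read"

# if I remember right, mdcmd status also reports the sync position and per-disk error counters
mdcmd status | grep -iE "resync|rdevnumerrors"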

Link to comment
