stor44

(SOLVED) Parity upgrade = lots of problems


Hi there, having some unexpected problems here. Diagnostics attached.

 

I'm trying to upgrade my two parity drives (1 at a time) from 4TB Reds, to 10TB Seagates I scored on Black Friday (got 3 of them for $200 CDN each, BarraCuda Pro's inside).

- I precleared the 10TB drives before shucking, no errors. I'm pretty sure you don't have to preclear parity drives, but I wanted to test them anyway.

- I set unRAID to not start the array automatically

- I shut down unRAID last night, no problems prior to that.

- removed the 4TB 1st Parity drive

- installed the 10TB Seagate after shucking it

- booted up unRAID, it said the Parity 1 drive is missing (correct) so I set that slot to the 10TB drive and started the array and waited for the parity rebuild to begin.

 Now I'm seeing many errors: "Array has 6 disks with errors", "Disk 2 in error state", and now it says every drive has read errors. Also I can't access the shares from my other computers over the network.

- The parity check has paused itself at 4.64GB.

 

I'm using a Dell PERC H310 RAID card flashed to IT firmware.

I will double-check my cabling; maybe I bumped something during the install?

Thanks for any help!

tower-diagnostics-20191206-1228.zip

Edited by stor44
Solved by swapping motherboard


It was a controller problem:

Dec  6 06:22:09 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot if one is available.

 

As for the disabled disk, either rebuild on top of it if the emulated disk is mounting correctly and the data looks fine, or, assuming nothing was written to it after it got disabled, do a new config and resync parity.
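
For anyone checking their own diagnostics, the controller fault quoted above can be picked out of a saved syslog with a short script. This is only a rough sketch: the function name `find_controller_faults` is made up for illustration, the pattern is built from the single log line in this thread (other driver versions may word it differently), and the second sample line is filler, not a real log entry.

```python
import re

# Pattern modelled on the one log line quoted in this thread. The device
# prefix ("mpt2sas_cm0" here) is matched loosely because it is an assumption
# that other systems use a similar name.
FAULT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<dev>mpt\dsas_cm\d+): SAS host is non-operational"
)


def find_controller_faults(lines):
    """Return (timestamp, device) pairs for every HBA fault line found."""
    hits = []
    for line in lines:
        match = FAULT.search(line)
        if match:
            hits.append((match.group("stamp"), match.group("dev")))
    return hits


sample = [
    "Dec  6 06:22:09 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
    "Dec  6 06:22:10 Tower kernel: placeholder line that is not a fault",
]

for stamp, dev in find_controller_faults(sample):
    print(f"{stamp}: controller {dev} dropped out")
```

If the message shows up, every disk error logged after that timestamp is most likely a symptom of the controller going away rather than a problem with the disks themselves.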


Thanks Johnnie.

- I shut down unRAID. Checked my cables, removed and reseated the Dell card. That seems to work.

- unRAID boots, and now the rest of the disks seem fine again, except Disk2 is "unmountable: no file system".

- I clicked Start the array, but it wants to format Disk2 first.

 

Should I format Disk 2, or try a new config like you said? Seems like both roads lead to a parity check.

unraid DEC6 1.png

tower-diagnostics-20191206-1324.zip

Should I format Disk 2

Format is never an option unless there's no data on disk2. After a format, data can't be recovered from parity; thinking otherwise seems to be a common misconception.

 

You could try fixing the filesystem on the emulated disk2 first and, if successful, rebuild. But since there can be some corruption, and assuming nothing was written to disk2 once it got disabled, doing a new config would IMHO be the way to go.
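
The choice between the options discussed above can be written down as a small decision function. This is just a sketch of the reasoning in this thread, not anything unRAID itself runs; the function name and its three yes/no inputs are hypothetical, and you have to answer them yourself by looking at the emulated disk.

```python
def recovery_option(emulated_mounts, data_looks_fine, written_since_disabled):
    """Suggest a next step for a disabled disk, following the advice above.

    All three arguments are booleans the user must judge by hand; nothing
    here inspects a real array. Formatting is deliberately never returned,
    because a format wipes the data and parity cannot bring it back.
    """
    if emulated_mounts and data_looks_fine:
        # The emulated disk is healthy, so parity can rebuild the disk.
        return "rebuild on top of the disabled disk"
    if not written_since_disabled:
        # The physical disk still holds current data, so trust it and
        # recompute parity instead.
        return "new config, then resync parity"
    # Otherwise the emulated disk holds the newest data but needs repair.
    return "repair filesystem on the emulated disk, then rebuild"


# The situation in this thread: emulated Disk2 is unmountable and nothing
# was written to it after it got disabled.
print(recovery_option(False, False, False))
```

For the case in this thread that prints the new-config route, which matches the suggestion above.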

 

 


Thanks, that makes sense re: format. Nothing new should have been written to Disk2.

- I clicked New Config, use current assignments.

- started the array, and now a parity check is underway. Will report back once it's done in 18 to 20 hours.

 

Thanks again johnnie.black, you've helped me numerous times and I always appreciate it! Couldn't run my server without this helpful community.

31 minutes ago, stor44 said:

re: format. Nothing new should have been written to Disk2.

So that we're clear, formatting would delete any new and old data, i.e. all existing data on disk2.


Five hours into the parity check, but now I got an alert saying "array has six failed disks." Diagnostics attached.

I'm at work for another few hours, but I have remote access to the GUI. 
 

The parity check is still running, but instead of taking 18 hours, it’s changed to about 3 hours left, 65% complete. Also it says 4TB total size, shouldn’t that be 10TB?

 

Thanks for any help.

tower-diagnostics-20191206-1936.zip

Edited by stor44
More detail

38 minutes ago, stor44 said:

The parity check is still running, but instead of taking 18 hours, it’s changed to about 3 hours left, 65% complete. Also it says 4TB total size, shouldn’t that be 10TB?

Parity1 dropped so I think it is only checking parity2 now.
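
A quick back-of-the-envelope check fits that explanation. The sketch below assumes the GUI computes progress as position divided by total size and that the check ran at a roughly constant rate for the five hours mentioned; both are assumptions, and the helper name `estimate` is made up for illustration.

```python
def estimate(total_tb, percent_done, hours_elapsed):
    """Return (TB checked so far, TB per hour, hours remaining).

    Assumes a constant rate, which a real parity check only approximates.
    """
    done_tb = total_tb * percent_done / 100
    rate = done_tb / hours_elapsed
    remaining_hours = (total_tb - done_tb) / rate
    return done_tb, rate, remaining_hours


# Numbers reported in the thread: 65% of a 4TB total after about 5 hours.
done_tb, rate, remaining_hours = estimate(total_tb=4, percent_done=65, hours_elapsed=5)
print(f"checked so far: {done_tb:.1f} TB at {rate:.2f} TB/hour")
print(f"time left against a 4TB total: {remaining_hours:.1f} hours")

# Same rate against the 10TB parity1 size, for comparison.
print(f"full 10TB pass at this rate: {10 / rate:.1f} hours")
```

Under those assumptions the remaining time comes out at a little under 3 hours for a 4TB total, and a full 10TB pass at the same rate lands in the 18 to 20 hour range quoted earlier, so the numbers line up with the total having shrunk from 10TB to 4TB when parity1 dropped.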


Thanks trurl, I think you’re right. I doubt it’s actually the 10TB drive, might be something up with my controller card. Will try another slot when I get a chance.


Controller problem again:

Dec  6 13:19:53 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Errors in multiple disks, no point in continuing the sync.


Thanks, and here's an update. I got home and the parity check said it was 165% complete.

- stopped parity check and shut down the server

- swapped the controller card to a different slot, re-checked cabling

- booted up, did New Config again, keeping assignments

- started parity check. It's doing the full 10TB this time. About 7 hours left now. No dropped drives so far.


Parity check is finished, no errors, all drives accounted for. Thanks for the help! Will upgrade the 2nd Parity drive to 10TB later next week.


Spoke too soon. Suddenly six bad disks again. I can see the SAS card is "non-operational" again in the logs. Will try another slot.


Removed GTX 970, moved H310 card to 8x slot that was covered by the video card. Now all drives are working again.


Any chance the H310 is on its way out?  You keep moving it, it works for a while, then goes dark.

14 minutes ago, sota said:

Any chance the H310 is on its way out?  You keep moving it, it works for a while, then goes dark.

Or inadequately cooled. Server-grade HBAs assume you have hurricane-force winds inside your case for ventilation (or at least most data centers sound like that).


Indeed. The card has just failed again today, same error as before:

Dec  9 16:24:23 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

I do have a second H310, but I use it in my Windows PC to host the drives that keep copies of my unRAID shares (running DrivePool on there). Might need to research another host card.

I have one more slot I can try, but yes, starting to seem heat-related or end of the line. I'll check the heatsink on the card too.

 

Thanks for the replies.


Latest update. I was able to swap slots and the host card ran for about 24 hours, but then drives would suddenly disappear from the array.

 

I don't want to mess with my other PC (where my other H310 has been working fine for 3 years), so I've bought two more H310s off eBay to test/replace this one. Hopefully they show up next week.

 

In the meantime, what does "stale configuration" mean? I did search for it, but I'm still not sure. Latest logs attached. Thanks for your time.

tower-diagnostics-20191212-1205.zip


Thanks. I only use Chrome on my Mac to access unRAID, so that's weird, but it sounds like nothing to worry about.


Sorry to hijack the thread, but I worry about the temps on my HBA. Is there any way to measure the temp?

 

How much better would it be to remove the heatsink and apply some high quality thermal paste and reapply the heatsink?

 

My HBA is an LSI SAS9340-8i ServeRAID M1215. I have not had any issues yet, but I worry because I put my finger on the heatsink and it was warm. I do have a fan blowing on it, but I'd say the heatsink itself was at least 50 or even 60°C.

1 minute ago, je82 said:

it, but I'd say the heatsink itself was at least 50 or even 60°C.

That sounds about right, they do run pretty hot, as long as there's some airflow around them it's fine, they just shouldn't be without any as they are designed to run in servers with good cooling.

1 minute ago, johnnie.black said:

That sounds about right, they do run pretty hot, as long as there's some airflow around them it's fine, they just shouldn't be without any as they are designed to run in servers with good cooling.

OK, thanks. Perhaps I read the specifications wrong, but according to Lenovo the card is only operational at 5°C to 40°C, which sounds crazy to me; most things will operate well above 40°C? Perhaps they are talking about the recommended temperature of the surrounding environment and not the actual card temp?

image.png.2d571f8d61e838381b1b3fe47a7a3176.png

 

