stor44 Posted December 6, 2019 (edited)
Hi there, having some unexpected problems here. Diagnostics attached. I'm trying to upgrade my two parity drives (one at a time) from 4TB Reds to 10TB Seagates I scored on Black Friday (got 3 of them for $200 CDN each, BarraCuda Pros inside).
- I precleared the 10TB drives before shucking, no errors. I'm pretty sure you don't have to preclear parity drives, but I wanted to test them anyway.
- I set unRAID to not start the array automatically.
- I shut down unRAID last night, no problems prior to that.
- Removed the 4TB Parity 1 drive.
- Installed the 10TB Seagate after shucking it.
- Booted up unRAID; it said the Parity 1 drive was missing (correct), so I set that slot to the 10TB drive, started the array, and waited for the parity rebuild to begin.
Now I'm seeing many errors: "Array has 6 disks with errors", "Disk 2 in error state", and now it says every drive has read errors. Also, I can't access the shares from my other computers over the network.
- The parity check has paused itself at 4.64GB.
I'm using a Dell PERC H310 RAID card flashed to IT firmware. I will double-check my cabling; maybe I bumped something during the install? Thanks for any help!
tower-diagnostics-20191206-1228.zip
Edited January 13, 2020 by stor44: Solved by swapping motherboard
JorgeB Posted December 6, 2019
It was a controller problem:
Dec 6 06:22:09 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot if available. As for the disabled disk, either rebuild on top if the emulated disk is mounting correctly and the data looks fine, or, assuming nothing was written to it after it got disabled, do a new config and resync parity.
stor44 Posted December 6, 2019
Thanks Johnnie.
- I shut down unRAID, checked my cables, and removed and reseated the Dell card. That seems to have worked.
- unRAID boots, and the rest of the disks seem fine again, except Disk2 is "unmountable: no file system".
- I clicked Start the array, but it wants to format Disk2 first.
Should I format Disk 2, or try a new config like you said? Seems like both roads lead to a parity check.
tower-diagnostics-20191206-1324.zip
stor44 Posted December 6, 2019
Also I'm seeing this on the video output from unRAID.
JorgeB Posted December 6, 2019
"Should I format Disk 2"
Format is never an option unless there's no data on disk2. After a format, data can't be recovered from parity; that seems to be a common misconception. You could try fixing the filesystem on the emulated disk2 first and, if successful, rebuild. But since there can be some corruption, and assuming nothing was written to disk2 once it got disabled, doing a new config would IMHO be the way to go.
stor44 Posted December 6, 2019
Thanks, that makes sense re: format. Nothing new should have been written to Disk2.
- I clicked New Config, use current assignments.
- Started the array, and now a parity check is underway.
Will report back once it's done in 18 to 20 hours. Thanks again johnnie.black, you've helped me numerous times and I always appreciate it! Couldn't run my server without this helpful community.
JorgeB Posted December 6, 2019
31 minutes ago, stor44 said: "re: format. Nothing new should have been written to Disk2."
So that we're clear, formatting would delete any new and old data, i.e. all existing data on disk2.
stor44 Posted December 6, 2019
Understood, thanks again.
stor44 Posted December 6, 2019 (edited)
Five hours into the parity check, but now I got an alert saying "array has six failed disks." Diagnostics attached. I'm at work for another few hours, but I have remote access to the GUI. The parity check is still running, but instead of taking 18 hours, it's changed to about 3 hours left, 65% complete. Also, it says 4TB total size; shouldn't that be 10TB? Thanks for any help.
tower-diagnostics-20191206-1936.zip
Edited December 6, 2019 by stor44: More detail
trurl Posted December 6, 2019
38 minutes ago, stor44 said: "The parity check is still running, but instead of taking 18 hours, it's changed to about 3 hours left, 65% complete. Also it says 4TB total size, shouldn't that be 10TB?"
Parity1 dropped, so I think it is only checking parity2 (still a 4TB drive) now.
stor44 Posted December 6, 2019
Thanks trurl, I think you're right. I doubt it's actually the 10TB drive; might be something up with my controller card. Will try another slot when I get a chance.
JorgeB Posted December 7, 2019
Controller problem again:
Dec 6 13:19:53 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Errors on multiple disks, no point in continuing the sync.
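For anyone who wants to check their own logs for the same fault, here is a minimal sketch that scans syslog text for the controller message quoted in this thread. The function name `find_controller_faults` and the regular expression are illustrative assumptions, not part of unRAID or the driver; the pattern is only modeled on the two log lines posted above and may need adjusting for other hostnames, drivers, or timestamp formats.

```python
import re
from typing import List, Tuple

# Modeled on the line quoted in this thread:
#   Dec 6 06:22:09 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
# Assumption: a classic syslog timestamp, then a hostname, then "kernel:".
FAULT_PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<dev>mpt\dsas_cm\d+): SAS host is non-operational"
)


def find_controller_faults(log_text: str) -> List[Tuple[str, str]]:
    """Return (timestamp, device) for every 'non-operational' line found."""
    hits = []
    for line in log_text.splitlines():
        match = FAULT_PATTERN.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("dev")))
    return hits


if __name__ == "__main__":
    # Sample text built from the log lines quoted in this thread.
    sample = (
        "Dec 6 06:22:09 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!\n"
        "Dec 6 13:19:53 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!\n"
    )
    for stamp, dev in find_controller_faults(sample):
        print(f"{stamp}: {dev} reported non-operational")
```

To use it on a real server, read the syslog file from the diagnostics zip into a string and pass it to `find_controller_faults`. Repeated hits spread over days, as in this thread, point at the controller, its slot, or its cooling rather than at a single disk.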
stor44 Posted December 7, 2019
Thanks, and here's an update. I got home and the parity check said it was 165% complete.
- Stopped the parity check and shut down the server.
- Swapped the controller card to a different slot, re-checked cabling.
- Booted up, did New Config again, keeping assignments.
- Started the parity check. It's doing the full 10TB this time.
About 7 hours left now. No dropped drives so far.
stor44 Posted December 7, 2019
Parity check is finished, no errors, all drives accounted for. Thanks for the help! Will upgrade the 2nd Parity drive to 10TB later next week.
stor44 Posted December 7, 2019
Spoke too soon. Suddenly six bad disks again. I can see the SAS card is "non-operational" again in the logs. Will try another slot.
stor44 Posted December 8, 2019
Removed the GTX 970 and moved the H310 card to the 8x slot that was covered by the video card. Now all drives are working again.
sota Posted December 8, 2019
Any chance the H310 is on its way out? You keep moving it, it works for a while, then goes dark.
JonathanM Posted December 8, 2019
14 minutes ago, sota said: "Any chance the H310 is on its way out? You keep moving it, it works for a while, then goes dark."
Or inadequately cooled. Server-grade HBAs assume you have hurricane-force winds inside your case for ventilation (or at least most data centers sound like that).
stor44 Posted December 9, 2019
Indeed. The card has just failed again today, same error as before:
Dec 9 16:24:23 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
I do have a second H310, but I use it in my Windows PC to host the drives that keep copies of my unRAID shares (running DrivePool on there). Might need to research another host card. I have one more slot I can try, but yes, it's starting to seem heat-related or end of the line. I'll check the heatsink on the card too. Thanks for the replies.
stor44 Posted December 12, 2019
Latest update. I was able to swap slots and the host card ran for about 24 hours, but then drives suddenly disappeared from the array. I don't want to mess with my other PC (where my other H310 has been working fine for 3 years), so I've bought two more H310s off eBay to test/replace this one. Hopefully they show up next week. In the meantime, what does "stale configuration" mean? I did search for it, but I'm still not sure. Latest logs attached. Thanks for your time.
tower-diagnostics-20191212-1205.zip
JorgeB Posted December 12, 2019
https://forums.unraid.net/topic/59845-unraid-os-version-640-rc8q-available/?do=findComment&comment=588602
stor44 Posted December 12, 2019
Thanks. I only use Chrome on my Mac to access unRAID, so that's weird, but it sounds like nothing to worry about.
je82 Posted December 12, 2019
Sorry to hijack the thread, but I worry about the temps on my HBA. Is there any way to measure the temp? How much better would it be to remove the heatsink, apply some high-quality thermal paste, and reattach the heatsink? My HBA is an LSI SAS9340-8i ServeRAID M1215. I have not had any issues yet, but I worry because I put my finger on the heatsink and it was warm. I do have a fan blowing on it, but I'd say the heatsink itself was at least 50 or even 60°C.
JorgeB Posted December 12, 2019
1 minute ago, je82 said: "it but id say the heatsink it self was at least 50 or even 60c."
That sounds about right; they do run pretty hot. As long as there's some airflow around them it's fine. They just shouldn't be without any, as they are designed to run in servers with good cooling.
je82 Posted December 12, 2019
1 minute ago, johnnie.black said: "That sounds about right, they do run pretty hot, as long as there's some airflow around them it's fine, they just shouldn't be without any as they are designed to run in servers with good cooling."
OK, thanks. Perhaps I read the specifications for the card wrong, but according to Lenovo it is only operational from 5°C to 40°C, which sounds crazy to me; most things will operate well above 40°C? Perhaps they are talking about the recommended ambient temperature and not the actual card temperature?