hsingh314 Posted August 5
My journey with 6.12 continues. I figured it was finally fine to upgrade, but I might have been too hasty, as 6.12 still doesn't seem to like my system.

I updated and followed all the 'update' steps online, and things seemed fine. After a couple of minutes my first parity drive and drive 6 got 'disabled'. I don't know why; there was no initial error that I could see, so I figured it was just something wonky. I stopped the array, but it hung on "sync filesystem" and I had to just shut it down. After starting up, the drives were still disabled, so I unassigned them, started in maintenance mode, assigned them back again, and fired it up. Both had passed SMART tests and the rebuild started, but then both drives dropped again at the same time. The sync continued to run, with the other 'good' parity drive accumulating millions of errors, so I decided it was time to stop the sync. Now the system is hung again: the array won't stop and I can't generate a diagnostics file.

Ugh, I feel like my system is cursed. Any suggestions on how to proceed without losing all my data? Even pressing the shutdown button in the GUI or running the command through SSH does nothing. If it shuts down and restarts, I will try to generate a diag file and post it here.
hsingh314 Posted August 5 (edited)
Just had to do a hard power-button shutdown (ugh). Started it up in maintenance mode. I'm running another extended SMART test on each drive (they seem to pass the short ones). It hasn't finished yet, but here are the current diagnostics: bishop-diagnostics-20240805-1405.zip
JorgeB Posted August 5
Diags are after rebooting, so we can't see what happened, but you have an LSI HBA, which is probably the most used controller with Unraid, and AFAIK there haven't been any other similar reports.
hsingh314 Posted August 5 (edited)
Welp, is it possible that this is from the HBA failing or overheating? I had a past issue that I put down to the HBA: after I fired up a VM using a video card next to the HBA, all the drives connected to the HBA dropped, but none of the ones on the motherboard did.

What is the protocol for re-enabling disabled HDDs when the HDDs themselves seem fine (I just ran a short SMART test and it's fine)?

I do have a new HBA (Lenovo 430-16i), which is supposed to run cooler and has a larger heatsink, but I realized today that I don't have the right cables, so I can't use it right now. I did order the mini-SAS cables.
hsingh314 Posted August 5
Do I need to run the xfs_repair check before I can enable the drives?
hsingh314 Posted August 5
Tried to re-insert the drives again, and again they threw a bunch of errors and the parity sync was canceled. I was able to grab diagnostics afterwards. What is causing this? bishop-diagnostics-20240805-1942.zip
hsingh314 Posted August 6
Could this be due to a failing HBA? I noticed that the only disks getting errors were the ones attached to the HBA. In maintenance mode everything was fine, but once the parity sync started, it looked fine for a couple of minutes and then everything went down. After the parity sync fails and stops, I can't even see SMART results for the drives attached to the HBA, but I can for the ones attached to the motherboard. Could the HBA be fine until a lot of data is going through it, and then overheat or otherwise fail?
hsingh314 Posted August 6
I tried downgrading to 6.11.5 to see if that would make any difference, but the same thing happened. I grabbed diagnostics: bishop-diagnostics-20240805-2050.zip

I also started getting a syslog error:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 134045728 bytes) in /usr/local/emhttp/plugins/dynamix/include/Syslog.php on line 18

I think this is why it won't let me shut down the server without pushing the power button after the HBA disappears.
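For reference, the two numbers in that PHP error can be put into more readable units. The sketch below is only arithmetic on the quoted message; the helper name `parse_memory_error` is made up for illustration. The limit works out to exactly 128 MiB, and the single allocation that failed was almost that large on its own. That would be consistent with the syslog page trying to load a very large log in one go, but that reading is a guess and is not confirmed anywhere in this thread.

```python
import re

# The error text as quoted in the post above (file path and line number omitted).
ERROR = ("Fatal error: Allowed memory size of 134217728 bytes exhausted "
         "(tried to allocate 134045728 bytes)")

MIB = 1024 * 1024


def parse_memory_error(message):
    """Hypothetical helper: return (limit_bytes, requested_bytes) or None."""
    match = re.search(
        r"Allowed memory size of (\d+) bytes exhausted "
        r"\(tried to allocate (\d+) bytes\)",
        message,
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


limit, wanted = parse_memory_error(ERROR)
print(f"memory limit:         {limit / MIB:.1f} MiB")   # 128.0 MiB
print(f"allocation attempted: {wanted / MIB:.1f} MiB")  # 127.8 MiB
```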
JorgeB Posted August 6
Aug 5 20:46:27 Bishop kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Make sure the HBA is well seated and sufficiently cooled, you can also try a different PCIe slot.
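For anyone else trying to spot this kind of line in a long syslog, here is a rough Python sketch that filters log lines for the message quoted above. The regular expression is an assumption built from that single line, so other controllers or driver versions may word their faults differently, and the function name `find_hba_faults` is made up for illustration.

```python
import re
from typing import Iterable, List

# Assumed pattern, derived only from the one kernel line quoted in this thread.
HBA_FAULT = re.compile(r"mpt\dsas_cm\d+: .*non-operational", re.IGNORECASE)


def find_hba_faults(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match the assumed HBA fault pattern."""
    return [line.rstrip("\n") for line in lines if HBA_FAULT.search(line)]


sample = [
    # Real line from the thread:
    "Aug 5 20:46:27 Bishop kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
    # Made-up filler line, only here to show that non-matching lines are skipped:
    "Aug 5 20:46:28 Bishop kernel: some unrelated message",
]

hits = find_hba_faults(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, open the syslog file from your extracted diagnostics (for example `open(path, errors="replace")`, where `path` is wherever that file sits on your machine) and pass the file object to `find_hba_faults`.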
hsingh314 Posted August 6
Yikes! I appreciate the confirmation. I dunno why I couldn't find that error. I have tried two separate slots and I have a small fan on the heatsink. I have had issues with my HBA in the past and even repasted the heatsink, but maybe it's just been on its last legs in general.

Hopefully the Lenovo 430-16i (I think it's a rebranded LSI 9400-16i) will work properly out of the box, as I am not looking forward to having to flash the HBA's BIOS and firmware.
hsingh314 Posted August 7
@JorgeB So I slapped in the new HBA. The system seemed to pick it up fine, so I assigned all the disks and started the parity sync, and there are no errors so far. After the parity sync finishes, I'll try to upgrade to 6.12 again.