urbancore Posted February 8

Hello all,

I have a parity check (which I run every 90 days) on an array of 32 drives. It paused about halfway through, and two drives that I added within the last six months have been disabled. Both drives are currently empty, and the SMART check on them looks okay. They are both connected via a SAS backplane in my main chassis. I have six more drives in a DAS, which are also empty and not having issues.

I have replacement drives available if that is what must be done; I just want to make sure I do this correctly. As I said, there is no data on either of the disabled drives at this time.

I am posting diagnostics below. The server has not been rebooted, so they should be helpful. The parity check was running while I was away at work for the last two days. I verified with my wife that there were no power fluctuations or hits during that time, and both the NAS and the DAS have their own UPS in good health.

Any guidance would be welcome. Thanks in advance, all. I am sure you get this a lot, but this is my first real tango with a failed disk.

tower-diagnostics-20240208-0138.zip
JorgeB Posted February 8

Feb 7 03:38:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
Feb 7 03:38:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e23)!
Feb 7 03:38:25 Tower kernel: mpt2sas_cm0: sending diag reset !!
Feb 7 03:38:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS

The HBA timed out and had to be reset, which caused the disk errors. Check that it's well seated and sufficiently cooled.

P.S.: Were the parity sync errors before this issue expected?
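For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch that looks for a `fault_state` line followed by a `diag reset` result, as in the log lines quoted in this post. The regular expressions are modeled only on those quoted lines, so other driver or kernel versions may word the messages differently; the function name `find_hba_resets` is made up for this example.

```python
import re

# Patterns modeled on the log lines quoted in this thread (an assumption:
# other mpt2sas/mpt3sas driver versions may format these messages differently).
TS_RE = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")
FAULT_RE = re.compile(r"kernel: (mpt[23]sas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)")
RESET_RE = re.compile(r"kernel: (mpt[23]sas_cm\d+): diag reset: (\w+)")


def find_hba_resets(lines):
    """Return one dict per fault_state event, with the reset result if one follows."""
    events = []
    for line in lines:
        fault = FAULT_RE.search(line)
        if fault:
            ts = TS_RE.match(line)
            events.append({
                "timestamp": ts.group(1) if ts else None,
                "controller": fault.group(1),
                "fault_state": fault.group(2),
                "reset_result": None,  # filled in when the matching reset line appears
            })
            continue
        reset = RESET_RE.search(line)
        if (reset and events
                and events[-1]["controller"] == reset.group(1)
                and events[-1]["reset_result"] is None):
            events[-1]["reset_result"] = reset.group(2)
    return events


if __name__ == "__main__":
    # The four lines quoted in the post above, used as sample input.
    sample = [
        "Feb 7 03:38:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready",
        "Feb 7 03:38:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e23)!",
        "Feb 7 03:38:25 Tower kernel: mpt2sas_cm0: sending diag reset !!",
        "Feb 7 03:38:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS",
    ]
    for event in find_hba_resets(sample):
        print(event)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics archive to `find_hba_resets` (for example via `open(path).readlines()`). A hit only shows that the controller faulted and was reset; it does not by itself say whether heat, seating, or something else was the cause.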
urbancore Posted February 8 (Author)

Thank you! I might very well have missed that. I will get a small 40mm fan to put on that LSI card right away. I was low-key worried about this before, but we got through a few parity checks and it didn't seem to be an issue, so I suppose I forgot about it. Thanks!

Once I shut down the server to attach the fan and start it back up, is there anything I need to do? How do I re-enable those drives? And I presume I then finish the parity check?
JorgeB Posted February 8 (Solution)

Post new diags after array start to see the current emulated disk status.
urbancore Posted February 20 (Author)

On 2/8/2024 at 8:44 AM, JorgeB said: Post new diags after array start to see the current emulated disk status.

I wanted to follow up on this. The disks were absolutely fine; the issue was the HBA running too hot. My solution was to remove the heat sink from the HBA, re-paste it with new thermal compound, and re-attach it using 25mm nylon bolts and nuts, with a 40mm Noctua fan mounted on it. This keeps the card nice and cool, and I was able to re-add the same disks to the array with no issues aside from a 3-day rebuild time.

So yes, heat is an issue on these HBAs, which are designed for the high-speed airflow of a rack-mounted system and suffer when they don't get the air they expect. Cool those cards, people.