
Parity check paused, two disks disabled. Devices emulated.


Solved by JorgeB

Hello all, 

 

I have a parity check (which I run every 90 days) on an array of 32 drives. It paused about halfway through, and two drives that I added in the last six months have been disabled. Both drives are currently empty, and the SMART check on them looks okay. Both are connected via a SAS backplane in my main chassis. I have six more drives in a DAS which are also empty and not having issues.

 

I have replacement drives available if that is what must be done; I just want to make sure I do this correctly. As I stated, there is no data on either of the disabled drives at this time. I am posting diagnostics below. The server has not been rebooted, so they should be helpful.

 

The parity check ran while I was away at work for the last two days. I verified with my wife that there were no power fluctuations or hits during that time, and both the NAS and DAS have their own UPS in good health. Any guidance would be welcome.

 

Thanks in advance, all. I am sure you get this a lot, but this is my first real tango with a failed disk.

 

tower-diagnostics-20240208-0138.zip

Feb  7 03:38:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
Feb  7 03:38:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e23)!
Feb  7 03:38:25 Tower kernel: mpt2sas_cm0: sending diag reset !!
Feb  7 03:38:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS

 

The HBA timed out and had to be reset, which caused the disk errors. Check that it is well seated and sufficiently cooled.

 

P.S.: Were the parity sync errors before this issue expected?
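For anyone who wants to check their own syslog for the same pattern, here is a minimal sketch that picks out HBA fault and reset messages like the ones quoted above. The function name `find_hba_events` and the regular expressions are my own illustration, based only on the four log lines in this thread; other driver versions may word their messages differently.

```python
import re

# Matches kernel messages from the mpt2sas/mpt3sas driver, for example:
#   "Feb  7 03:38:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e23)!"
# The pattern is an assumption derived from the lines quoted in this thread.
HBA_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<dev>mpt[23]sas_cm\d+):?\s+(?P<msg>.*)$"
)
FAULT_CODE = re.compile(r"fault_state\((0x[0-9a-fA-F]+)\)")


def find_hba_events(lines):
    """Return (timestamp, device, kind, fault_code) tuples.

    kind is "fault" for fault_state messages and "reset" for diag reset
    messages; fault_code is None for resets. Other lines are ignored.
    """
    events = []
    for line in lines:
        match = HBA_LINE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        fault = FAULT_CODE.search(msg)
        if fault:
            events.append((match.group("ts"), match.group("dev"), "fault", fault.group(1)))
        elif "diag reset" in msg:
            events.append((match.group("ts"), match.group("dev"), "reset", None))
    return events


# Sample input: the four lines quoted earlier in this thread.
SAMPLE = [
    "Feb  7 03:38:25 Tower kernel: mpt2sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready",
    "Feb  7 03:38:25 Tower kernel: mpt2sas_cm0: fault_state(0x7e23)!",
    "Feb  7 03:38:25 Tower kernel: mpt2sas_cm0: sending diag reset !!",
    "Feb  7 03:38:26 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS",
]

for event in find_hba_events(SAMPLE):
    print(event)
```

To run it against a live system, pass the lines of the syslog file instead of `SAMPLE` (on my understanding that is `/var/log/syslog` on Unraid, but treat the path as an assumption). A match only tells you the HBA faulted and was reset; it does not by itself say whether heat, seating, or something else was the cause.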


Thank you! I might very well have missed that. I will get a small 40mm fan right away to put on that LSI card.

 

I was low-key worried about this before, but we got a few parity checks in and it didn't seem to be an issue, so I suppose I forgot about it. Thanks!

 

Once I shut down the server, attach the fan, and start back up, is there anything I need to do? How do I re-enable those drives? And then finish the parity check, I presume?

  • 2 weeks later...
On 2/8/2024 at 8:44 AM, JorgeB said:

Post new diags after array start to see the current emulated disk status.

 

I wanted to follow up on this. The disks were absolutely fine; the HBA running too hot was the issue. My solution was to remove the heat sink from the HBA, re-paste it with new thermal compound, and re-attach it using 25mm nylon bolts and nuts, with a 40mm Noctua fan mounted on it. This keeps the card nice and cool, and I was able to re-add the same disks to the array with no issues aside from a 3-day rebuild time.

 

So yes, heat is an issue on these HBAs. They are designed for the high-speed airflow of a rack-mounted system and run hot when they don't get the air they expect. Cool those cards, people.

 

 
