
Is my HBA failing?


Solved by spdelope


All 8 of the drives attached to the HBA suddenly started showing read errors. I can resolve it and stay stable for a week or two, but if I reboot or enough time passes, the read errors eventually come back and I end up rebooting two or three times before the issue goes away. I've tried different BIOS settings and such. I almost thought I'd found the issue when I noticed the 4-pin CPU power plug on the motherboard wasn't plugged in (I thought maybe the CPU was stealing power from the motherboard or something like that).

 

I have an 850 W PSU; not sure if that could be an issue (a calculator says I only need 500 W).

 

I have the 8i card, a 16e card, and a 10GbE NIC installed.

 

Would love your feedback, thanks!

 

I am working on expanding my array using a 16e card with an external enclosure. Rather than getting another 8i, would it be smart to just use the two ports on the 16e that I won't be using? That would mean going from the breakout cable to an SFF-8087 to SFF-8088 adapter and a short SFF-8088 cable that connects at the back of the case.

 

Thanks again (and never mind some of the file names you'll see in the logs, that's just how I name my taxes)

 

This is the relevant line that starts the read errors:

Jun 18 11:40:46 UnRaid-Server kernel: mpt2sas_cm0: SAS host is non-operational !!!!

unraid-server-diagnostics-20240618-1148.zip
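
For anyone who wants to check their own syslog for the same fault, here is a minimal sketch that scans log lines for the message quoted above and returns the timestamp and controller name of each hit. The function name `find_hba_faults` and the second sample line are made up for illustration, and matching an `mpt3sas` prefix in addition to `mpt2sas` is an assumption; only the first sample line comes from the log in this post.

```python
import re

# Matches the fault line quoted in this post. The syslog layout
# (month, day, time, host, "kernel:") is taken from that one line.
# Also accepting "mpt3sas" is an assumption for newer cards.
FAULT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<ctrl>mpt[23]sas_cm\d+): SAS host is non-operational"
)


def find_hba_faults(lines):
    """Return (timestamp, controller) for every 'non-operational' line."""
    hits = []
    for line in lines:
        match = FAULT_RE.match(line)
        if match:
            hits.append((match.group("stamp"), match.group("ctrl")))
    return hits


sample = [
    # Real line from the post above.
    "Jun 18 11:40:46 UnRaid-Server kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
    # Hypothetical unrelated line, only here to show it is ignored.
    "Jun 18 11:40:47 UnRaid-Server kernel: example unrelated message",
]

for stamp, ctrl in find_hba_faults(sample):
    print(f"{stamp}: {ctrl} reported non-operational")
```

To run it against a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.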

Edited by spdelope
more info
Posted (edited)
6 hours ago, JorgeB said:

Make sure the HBA is well seated and sufficiently cooled; these run hot. You can also try a different PCIe slot, if available.

 

One other thing worth mentioning: when I reboot the server, the drives are gone from the system and I have to reboot a couple of times for them to show back up. A couple of times, two random drives were disabled and I had to resync.

 

Shutting down and waiting doesn't seem to help, so I'm not sure it's a cooling issue. Right now I have a fan on it and am running with the top off to help.

Edited by spdelope
  • 2 weeks later...
Posted (edited)
On 6/19/2024 at 7:47 AM, JorgeB said:

Try a different PCIe slot; if it's the same, I would try a different HBA.

I've tried a different PCIe slot (swapped the 16e and 8i) and experienced the same issue. I then got another HBA and got the same thing.

 

I tried it without the 16e and the problem seems to go away. It seems, for now at least, that it only occurs when the second PCIe slot is filled (keep in mind I have a 10GbE NIC in the third slot and it hasn't been an issue).

 

I also tried it with just the 16e (using an internal to external adapter and SAS cable) and the issue was gone.

 

So now I'm thinking maybe the motherboard? PSU? Thoughts? I have an LSI card on the way to rule out the possibility that the Dell is causing compatibility issues.

 

Thanks!

Edited by spdelope
more info
  • Solution
On 6/29/2024 at 12:53 AM, JorgeB said:

Possibly, of the two, the board would be my first suspect.

 

 

The LSI wasn't an issue, but I swapped the CMOS battery at the same time (as suggested by ArtOfServer). The existing one was reading 2.89 V and the brand-new one read 3.3 V.

 

I then put the Dell back and it has been running strong for a couple of days. Wild if that little button battery could have been the issue all along. (Swapping the battery also resets the CMOS by default, so maybe that too.)
