January 22, 2023 I have been plagued by repeated disk errors for quite a while, but lately it has gotten quite bad. Unraid has disabled disks because of this, forcing me to do a rebuild to re-enable them. I have replaced my case so that I can connect directly to the drives, I bought an LSI 9207-8i HBA with the latest IT firmware, and I have replaced the SAS cables, but nothing has seemed to work. This last time Unraid reported over 1000 errors on each of my array disks, with the exception of the one that it disabled; for that one it only reported 3. Can anyone look at the diagnostic data and give me some insight on where to go from here? I am at a loss. Thanks, server-diagnostics-20230122-1102.zip
January 22, 2023 Author It's almost like the HBA card drops out or something, because no disk that is running on that card shows up in the diagnostics, and when I stop the array they are nowhere to be found.
January 23, 2023 Community Expert HBA problem: Jan 22 10:44:00 Server kernel: mpt2sas_cm0: SAS host is non-operational !!!! Make sure it's well seated and not overheating; you can also try a different PCIe slot if available.
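For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that scans syslog lines for the kernel message quoted above. It assumes the classic syslog line layout shown in that message; the function name `find_hba_faults` and the default path `/var/log/syslog` are illustrative assumptions, not something taken from the diagnostics in this thread.

```python
import re
import sys

# Matches the fault line quoted in the post above. The controller name
# (mpt2sas_cm0 here) can differ between systems, so it is matched loosely.
FAULT_RE = re.compile(r"kernel: (mpt[23]sas_cm\d+): SAS host is non-operational")


def find_hba_faults(lines):
    """Return a (timestamp, controller) tuple for every fault line found."""
    hits = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            # Classic syslog lines start with 'Mon DD HH:MM:SS' (15 characters).
            hits.append((line[:15], match.group(1)))
    return hits


if __name__ == "__main__":
    # Assumption: the log lives at /var/log/syslog unless a path is given.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as log_file:
        for timestamp, controller in find_hba_faults(log_file):
            print(timestamp, controller)
```

Comparing the timestamps it prints against when disks dropped out can help tell whether the controller faulted first or the disk errors came first.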
January 23, 2023 Author 7 hours ago, JorgeB said: HBA problem: Jan 22 10:44:00 Server kernel: mpt2sas_cm0: SAS host is non-operational !!!! Make sure it's well seated and not overheating, you can also try a different PCIe slot if available. Funny you should mention that. I am not quite ready to say definitively that this was the issue, but the screw hole in my case and the one on the card don't line up very well. When I forced the bracket over to line up with the hole in the case, I believe it pulled part of the card out of the PCIe slot, and over a short amount of time the card stopped making contact with the slot. So right now I just have it sitting in there unsecured, but I should probably look at modifying the bracket so I can secure it, in case the tension of the slot isn't enough to keep it from coming out.
January 27, 2023 Author Well, it happened again. This time I removed the heat sink, tried to remove as much of that cement-like thermal paste as possible, and added some Arctic Silver 5 to the chip before reinstalling the heat sink. Hopefully that was the issue. We will see if that does the trick; if not, I might have to add a fan to it, or it's just EOL and I need to return it. Is there any way to monitor the temperature of the card from Unraid?
January 27, 2023 Author Actually, I am usually not stressing the board when this happens. Is it possible it's just a bad board?
January 27, 2023 Community Expert 6 hours ago, RysXr200 said: Is there any way to monitor the temperature of the card from unraid? AFAIK they don't have a temp sensor. 6 hours ago, RysXr200 said: Is it possible its just a bad board? Could be; did you try a different PCIe slot?
February 4, 2023 Author Solution It's been over a week and I haven't had a recurrence of the issue. Hopefully that remains the case. What I did, in addition to applying better thermal paste to the heat sink, was bend the bracket on the card so that it holds the card in the slot better.