May 1, 2025
Hello everyone. Hope you're doing well. I've been experiencing some issues with my server, and I can't seem to figure out what's going on.
My server consists of:
Supermicro X9DRi-LN4F+
2 x 2690 v2
128GB RAM
8 drives on a Dell H200
3 drives on the onboard SAS
3 NVMes on an Asus Hyper M.2 card
1660 Ti GPU
A Dual Coral TPU on a PCIe adapter
A month ago, my server showed read errors on all the drives connected to the H200. I restarted and it worked for a couple of days. Then it happened again, and so on. Then I saw some errors on the onboard SAS as well, so I thought that was causing the issue. I moved the drives from the onboard SAS to the onboard SATA, and then everything worked for about 20 days.
Today, the server again showed read errors on all drives on the H200 controller. So I bought a new one, flashed it to IT mode, and installed it. Same errors. Right when I boot the server, it shows read errors, and I have to shut it down. I tried a second H200. I tried different PCIe slots. I don't know what else to try; my mind is about to explode. I've been relying heavily on this server for my work, and this has cut my legs off. I can't pinpoint the problem, so I don't know what to change, fix, or buy.
This is what I see 15 seconds after I start the array [screenshot], and then I have to shut it down. Any ideas?
May 18, 2025 Author
Sorry, I was on a business trip and had no internet to post diagnostics. I was also so busy I couldn't touch the server itself. I THINK I've figured out the issue, but I'll keep an eye on it, just in case. I'm replying mainly in case someone runs into a similar problem in the future.
It turns out the Dell H200 controller was overheating and causing all sorts of problems. I strapped a fan onto it with a couple of zip ties, and everything works flawlessly. That's pretty weird, though, since this is exactly how the server has been running for quite a while now, and the room is air conditioned 24/7 to 22 degrees. I don't know what's gotten into it now.
That being said, I will remove the heatsink, replace the paste (or pad) if there is any, which I guess there must be, and mount the fan properly using a 3D-printed bracket. But for now, just cooling the card did the trick. Anyway, just for reference for future users!
May 18, 2025
5 hours ago, bonamin said: "replace the paste if there is any, which I guess there must be"
The cards normally have thermal epoxy, which is not easy to remove; see here.
May 27, 2025 Author
On 5/19/2025 at 12:08 AM, Wody said: "The cards normally have thermal epoxy, which is not easy to remove; see here."
Mine just have thermal paste, like a CPU.
So, new update. The server had been running fine up until about 2 days ago. When the Fire Nation attacked. :D Same problem. All disks on the HBA card get read errors. I really don't know what to do. This is really, REALLY frustrating. :( I will wait for the next error and post the diagnostics. Maybe someone can help.
June 5, 2025 Author
Well, I managed to get the diagnostics. If anyone can take a look, I'm more than happy to hear your thoughts.
I've now removed the card altogether and connected all 9 drives to the motherboard: 6 via SATA and 3 via the SAS port. Since the problems began, I've completely removed the parity drive, as I had to constantly rebuild it, which would probably fry my drives altogether. Anyway, other than one specific drive with all my family photos, the rest of the drives are media only, so I don't really care even if one fails right now. I just want to fix this. It's getting seriously annoying, and I don't know where to go from here.
theark-diagnostics-20250604-1403.zip
June 5, 2025
Jun 4 11:20:19 TheArk kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Problems with the LSI controller. Is this onboard?
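
For anyone who wants to check their own syslog for the same message, here is a minimal Python sketch. It is only an illustration, not something anyone in the thread ran. It assumes you have extracted the diagnostics zip and know where the syslog file is. The function name `find_hba_failures` and the script name `scan_syslog.py` are made up for this example, and the only failure text it looks for is the one quoted above.

```python
import re
import sys
from typing import Iterable, List

# Matches kernel lines emitted by the mpt2sas driver, e.g. "kernel: mpt2sas_cm0: ...".
DRIVER_LINE = re.compile(r"kernel: (mpt2sas_cm\d+): (.*)$")

# The only failure text taken from this thread; add others if you see them.
FAILURE_TEXT = "SAS host is non-operational"


def find_hba_failures(lines: Iterable[str]) -> List[str]:
    """Return the syslog lines where an mpt2sas controller reports FAILURE_TEXT."""
    hits = []
    for line in lines:
        stripped = line.rstrip("\n")
        match = DRIVER_LINE.search(stripped)
        if match and FAILURE_TEXT in match.group(2):
            hits.append(stripped)
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_syslog.py path/to/syslog.txt
    with open(sys.argv[1], errors="replace") as fh:
        for hit in find_hba_failures(fh):
            print(hit)
```

If the timestamps of the matching lines line up with the moment the read errors start, that at least points to the controller (or whatever is upsetting it) rather than the individual disks.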
June 5, 2025 Author
3 hours ago, JorgeB said: "Jun 4 11:20:19 TheArk kernel: mpt2sas_cm0: SAS host is non-operational !!!! Problems with the LSI controller. Is this onboard?"
No, it's not onboard. I'm using a Dell PERC H200 card (IT mode). The problem is, I bought a second identical card, and both cards gave me errors. I then cooled the cards using a fan and also changed the thermal compound. After this, I didn't have errors for a month. Until I had errors again. :D
This is what confuses me. I don't know if the cards are fried, if the cooling is insufficient, if a drive is misbehaving and causing the card to crash, or if the motherboard or one of the CPUs is faulty. At this point, I'm even considering the PSU. I don't know what else to try. Would you suggest I buy a third card?
June 6, 2025 Author
I did try different slots, and on different CPUs, as half of the slots go through CPU1 and the others through CPU2. It didn't change anything.
June 7, 2025
In that case I would try a different controller. It can still be an LSI, e.g., a 9300-8i.