August 31, 2024

I've been struggling with my system for a couple of months now. The issue is that the array reports drive errors that bog down the system until a reboot. After a reboot the errors disappear, then come back after an inconsistent number of hours or tasks. I've looked at the disk reports and don't see anything permanently wrong. I have re-seated all my components, including RAM, PSU, HBA, cables, and drives. I've also swapped HBAs, RAM, and cables.

I will attach 2 diagnostics. They were taken within seconds of the errors occurring, with the system up for only a short time. The system has gone weeks without issues in the past, but I thought diags from those periods would be harder to parse.

Some things that come to mind to mention:
The behavior has occurred even when running in safe mode, GUI, no GUI, and any other combination.
I methodically deleted and reinstalled each docker container to avoid runaway issues.
The system ran flawlessly for around 4 months, and I had 5 drives total then. I THINK the issue appeared when the array grew to 6+ drives. I managed to come back down to 6.
I finally managed to update the BIOS a couple of days ago (it seemed complicated on this MB).

I'm starting to feel a little insane about this and have been glued to my computer for weeks, but evidently this is beyond my knowledge and ability to google. TIA to anybody who can help.

* Current HBA: 9600-24i
* Current drive cables: 2x (SFF-8654 8i to 2x (4x SFF-8643))

hl15-diagnostics-20240828-2034.zip
hl15-diagnostics-20240830-2050.zip

Edited August 31, 2024 by ZVeguillaCotto
August 31, 2024 (Community Expert, Solution)

All disks dropped offline, most probable reasons would be a power/connection issue or the controller.
August 31, 2024 (Author)

3 hours ago, JorgeB said: "All disks dropped offline, most probable reasons would be a power/connection issue or the controller."

The disks that produce the errors are different each time. I've also changed disk bays. I have a couple of additional theories:
1. Files got corrupted once upon a time, and each time such a file is read the errors start (no idea if that is possible).
2. The backplane is damaged and thermal expansion causes the errors.

It hadn't occurred to me to connect the drives directly to the HBA to bypass the backplane. I will attach the disks directly to the HBA today and report back.
August 31, 2024 (Community Expert)

9 hours ago, ZVeguillaCotto said: "Files got corrupted once upon a time"

The errors are because the disks are dropping, it has nothing to do with files or data, it's a hardware issue.
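One rough way to test the "disks are dropping together" reading against the "one bad file" theory is to check whether errors from several different disks cluster in the same short time window. The sketch below assumes you have already extracted (timestamp, disk) pairs by hand. The sample events, the 30-second window, and the function names are all hypothetical and are not taken from the attached diagnostics.

```python
from datetime import datetime, timedelta

# Hypothetical error events (timestamp, disk). Made up for illustration,
# NOT taken from the attached diagnostics.
events = [
    ("2024-01-01 10:00:01", "disk1"),
    ("2024-01-01 10:00:02", "disk3"),
    ("2024-01-01 10:00:02", "disk4"),
    ("2024-01-02 18:30:10", "disk2"),
    ("2024-01-02 18:30:11", "disk5"),
]


def group_bursts(events, window_seconds=30):
    """Group events into bursts. A new burst starts whenever the gap to
    the previous event is larger than window_seconds."""
    parsed = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), disk) for ts, disk in events
    )
    bursts = []
    for when, disk in parsed:
        if bursts and when - bursts[-1]["end"] <= timedelta(seconds=window_seconds):
            bursts[-1]["disks"].add(disk)
            bursts[-1]["end"] = when
        else:
            bursts.append({"start": when, "end": when, "disks": {disk}})
    return bursts


def shared_fault_suspected(bursts, min_disks=2):
    """True if any single burst involves at least min_disks different disks."""
    return any(len(b["disks"]) >= min_disks for b in bursts)


bursts = group_bursts(events)
for b in bursts:
    print(b["start"], sorted(b["disks"]))
print("shared component suspected:", shared_fault_suspected(bursts))
```

If several disks error within the same few seconds, a shared component (power, cabling, backplane, or controller) is a more likely suspect than any single file or drive. This is only a heuristic and does not identify which shared component is at fault.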
September 2, 2024 (Author)

On 8/31/2024 at 6:30 PM, JorgeB said: "The errors are because the disks are dropping, it has nothing to do with files or data, it's a hardware issue."

I have been running the drives connected directly to the HBA and PSU, bypassing the backplane, for about a day and a half. I haven't seen any errors yet. I will keep updating as the days go on. Thanks.
September 7, 2024 (Author)

One-week update: the issue has not returned... yet. I will mark this as solved and keep updating if anything relevant comes up. I had read that connector issues are the leading cause of this problem and thought I had tried everything to rule them out, but I had missed removing the backplane from the equation. Thanks to @JorgeB for the help.