
Array errors that disappear on reboot.


Go to solution Solved by JorgeB,


I've been struggling with my system for a couple of months now.

 

The issue is that the array reports drive errors that bog down the system until I reboot. After a reboot the errors disappear, then come back after an inconsistent number of hours or tasks.

 

I've looked at the disk reports and don't see anything permanently wrong.

 

I have re-seated all my components, including RAM, PSU, HBA, cables, and drives. I've also swapped HBAs, RAM, and cables.

 

I have 2 diagnostics that I will attach. Both were taken within seconds of the errors occurring, with the system up for only a short time. The system has gone weeks without issues in the past, but I thought diags from those times would be harder to parse.

 

Some things that come to mind to mention:

The behavior has occurred even when running in safe mode, GUI, no GUI and any other combination.

I methodically deleted and reinstalled each docker container to avoid runaway issues.

The system ran flawlessly for around 4 months, when I had 5 drives total. I THINK the issue appeared when the array grew to 6+ drives. I have since managed to come back down to 6.

Finally managed to update BIOS a couple of days ago (seemed complicated on this MB).

 

I'm starting to feel a little insane about this and have been glued to my computer for weeks, but evidently this is beyond my knowledge and my ability to google.

 

TIA to anybody that can help.

 

[Attached image: imagen.png]

*Current HBA: 9600-24i

*Current drive cables: 2x (SFF-8654 8i to 2x (4x SFF-8643))

hl15-diagnostics-20240828-2034.zip hl15-diagnostics-20240830-2050.zip

Edited by ZVeguillaCotto
3 hours ago, JorgeB said:

All disks dropped offline, most probable reasons would be a power/connection issue or the controller.

The disks that produce the errors are different each time.

I've also changed disk bays.

 

I have a couple of additional theories:

A file got corrupted at some point, and each time that file is read the errors start (no idea if this is possible).

 

The backplane is damaged and thermal expansion causes the errors.

It hadn't occurred to me to connect the drives directly to the HBA to bypass the backplane.

 

I will attach the disks directly to the HBA today and report back.

On 8/31/2024 at 6:30 PM, JorgeB said:

The errors are because the disks are dropping, it has nothing to do with files or data, it's a hardware issue.

I have been running the drives directly to the HBA and PSU, bypassing the backplane for about a day and a half. I haven't seen any errors yet.

 

Will keep updating as the days go on.

 

Thanks.
