Help! Multiple HDDs getting lots of read errors all at the same time

September 5, 2024

In the last week or so I started getting Unraid notifications that multiple disks (around 4-6) were having read errors. When I logged into the dashboard, the Main tab showed error counts for those disks. I have 13 disks (1 parity) and the error counts ranged from 1 to ~1200. When I restart the server, all error counts go to zero and I get an Unraid notification saying all read errors have returned to normal. Since that first time it has happened about 2-3 more times; the most recent time the system locked up and I had to hold the power button to restart it.

I didn't take a screenshot or picture of the dashboard when it showed all the errors. I will do that next time.

Some info about my setup:

- All HDDs are SAS and were bought used from eBay in batches (so HDDs brands/models cluster together)

- All HDDs go through my Supermicro backplane (I believe it is bpn-sas3-826el1)

- Backplane is connected to mobo via HBA card (Adaptec asr-7805)

I find it highly unlikely that HDDs across multiple vendors and generations would start failing at the exact same moment so I wonder if it is the backplane or HBA card. Does that seem reasonable? How can I verify this?

Attached my diagnostics zip, the monitor log before I had to force shutdown, and my hardware setup described above.

Thanks in advance!

maroon-diagnostics-20240905-0943.zip

Edited September 5, 2024 by tone
add info

September 5, 2024

Community Expert

Diags are from after a reboot. Enable the syslog server in case the server crashes again, and post the log when it next happens.

September 5, 2024

Author

Thanks for responding. I have enabled the syslog server and will post again when it happens.

Appreciate the help!

September 9, 2024

Author

Ok it happened again, here are some screenshots and attached are my diagnostics.

Thank you for the help. Hoping this isn't really bad

image.png.dcd315c0db500c73c304ed4910299d9a.png

maroon-diagnostics-20240908-2145.zip

September 9, 2024

Community Expert

Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed
Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: ARC Reset attempt failed

Controller issues. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.
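For anyone wanting to check their own syslog for the same controller messages, here is a minimal Python sketch. It assumes the standard syslog line layout seen in the two lines quoted above; the function name `summarize_aacraid` and the regular expression are illustrative and not part of Unraid or the aacraid driver.

```python
import re
from collections import Counter

# Matches kernel lines from the aacraid driver, e.g.
#   Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed
# The layout is an assumption based on the two lines quoted in this thread.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"aacraid\s+(?P<pci>[0-9a-f:.]+):\s+(?P<msg>.*)$"
)

def summarize_aacraid(lines):
    """Count aacraid messages per (PCI address, message) and note when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not an aacraid kernel line
        key = (m.group("pci"), m.group("msg"))
        counts[key] += 1
        first_seen.setdefault(key, m.group("ts"))
    return counts, first_seen

# The two lines quoted in this thread, used as sample input.
sample = [
    "Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed",
    "Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: ARC Reset attempt failed",
]

counts, first_seen = summarize_aacraid(sample)
for (pci, msg), n in counts.items():
    print(f"{pci}  {msg!r}  x{n}  first at {first_seen[(pci, msg)]}")
```

To run it against a real log, pass an open file instead of `sample`, for example `summarize_aacraid(open(path))`, where `path` is wherever your syslog server writes its file.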

3 weeks later...

September 29, 2024

Author

Ok, I have done a few things since the last post:

- I put a dedicated fan on the heatsink of the HBA (blowing toward it)
- I still got errors, so I replaced the HBA with another one (from eBay)
- I am still getting errors, so now I think it's either the SAS cables, my backplane, or my HDDs?

Another symptom I am experiencing is that the server has locked up with the CPU at 100%, apparently stuck in iowait.

Also, shutdown doesn’t seem to work; it gets stuck at “Forcing shutdown…”

I attached my latest diagnostics in case the errors are different. Otherwise I will replace the cables and then, if needed, the backplane.

maroon-diagnostics-20240929-0959.zip

Edited September 29, 2024 by tone
Added pics

September 30, 2024

Community Expert

The log is completely spammed with controller-related crashes, but I cannot see the start of the problem. Reboot to clear the logs and post new diags as soon as you see errors in the log.
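If the log is too noisy to spot where the trouble begins, a small Python sketch like the one below can pull out the first matching line plus a few lines before it. The helper name `first_match_with_context` and the choice of search strings are assumptions for illustration; the placeholder lines in the sample are made up and are not real log output.

```python
from collections import deque

def first_match_with_context(lines, needles, before=3):
    """Return (index, context) for the first line containing any needle, else None.

    `context` holds up to `before` preceding lines followed by the matching line.
    """
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    for i, line in enumerate(lines):
        if any(n in line for n in needles):
            return i, list(recent) + [line]
        recent.append(line)
    return None

# Sample input: two made-up placeholder lines, then a line quoted earlier in this thread.
sample = [
    "placeholder line A (hypothetical)",
    "placeholder line B (hypothetical)",
    "Sep  6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed",
]

result = first_match_with_context(sample, ["aacraid"], before=1)
if result is not None:
    index, context = result
    print(f"first match at line {index}")
    for line in context:
        print(line)
```

Stopping at the first match keeps the output short even when the rest of the log is flooded with repeats of the same error.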

1 month later...

November 11, 2024

Author

Ok update here.

I actually turned off the ZFS backups (uninstalled Sanoid and the ZFS Plugin and disabled the user scripts) and then had no errors for over a month.

I also increased the fans so there was better cooling in the case and on the HBA.

Anyway, I had an error last night but now on only one disk (Disk 6, which was the ZFS backup target):

Log has new errors too:

I am not able to download a diagnostics as it freezes/hangs:

Here is the disk log for Disk 6 (sde):

LMK if this should be a new post/topic altogether. Any idea what I should do? TIA!

November 12, 2024

Community Expert

Looks like the disk dropped offline. I would disable spin down and see if it still happens.
