September 5, 2024

In the last week or so I started getting Unraid notifications that multiple disks (around 4-6) were having read errors. When I logged into the dashboard, the Main tab showed error counts for those disks. I have 13 disks (1 parity), and the error counts ranged from 1 to ~1200. When I restart the server, all error counts go to zero and I get an Unraid notification saying all read errors have returned to normal. Since that first time it has happened 2-3 more times; the most recent time, the system locked up and I had to hold the power button to restart it. I didn't take a screenshot or picture of the dashboard while it showed the errors. I will do that next time.

Some info about my setup:
- All HDDs are SAS and were bought used from eBay in batches (so HDD brands/models cluster together)
- All HDDs go through my Supermicro backplane (I believe it is bpn-sas3-826el1)
- The backplane is connected to the motherboard via an HBA card (Adaptec asr-7805)

I find it highly unlikely that HDDs across multiple vendors and generations would start failing at the exact same moment, so I wonder if it is the backplane or the HBA card. Does that seem reasonable? How can I verify this?

I've attached my diagnostics zip, the monitor log from before I had to force a shutdown, and my hardware setup as described above. Thanks in advance!

maroon-diagnostics-20240905-0943.zip

Edited September 5, 2024 by tone: add info
September 5, 2024, Community Expert
The diags were taken after a reboot. Enable the syslog server in case the server crashes again, and post the log when it next happens.
September 5, 2024, Author
Thanks for responding. I have enabled the syslog server and will post again when it happens. Appreciate the help!
September 9, 2024, Author
OK, it happened again. Here are some screenshots, and my diagnostics are attached. Thank you for the help. Hoping this isn't really bad.

maroon-diagnostics-20240908-2145.zip
September 9, 2024, Community Expert
Sep 6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed
Sep 6 03:18:17 maroon kernel: aacraid 0000:01:00.0: ARC Reset attempt failed

Controller issues. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.
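For anyone wanting to check a saved syslog for the same kind of message, here is a minimal Python sketch. It assumes a plain-text log in the layout quoted above; the function name `find_aacraid_resets` and the third sample line are made up for illustration, and the pattern is based only on the two messages in this thread, so other aacraid errors may be worded differently.

```python
import re

# Matches kernel lines from the aacraid driver that mention a failed reset.
# Assumption: the pattern is derived only from the two lines quoted in this thread.
RESET_FAILED = re.compile(r"kernel: aacraid \S+: .*[Rr]eset.*failed")

def find_aacraid_resets(lines):
    """Return the log lines that report a failed aacraid controller reset."""
    return [line.rstrip("\n") for line in lines if RESET_FAILED.search(line)]

sample = [
    "Sep 6 03:18:17 maroon kernel: aacraid 0000:01:00.0: IOP reset failed",
    "Sep 6 03:18:17 maroon kernel: aacraid 0000:01:00.0: ARC Reset attempt failed",
    "Sep 6 03:18:20 maroon kernel: hypothetical unrelated message",  # invented line
]
hits = find_aacraid_resets(sample)
for line in hits:
    print(line)
```

To run it against a real log, pass an open file object instead of `sample`; the path depends on where your syslog server writes its output.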
September 29, 2024, Author
OK, I have done a few things since the last post:
- I put a dedicated fan on the heatsink of the HBA (blowing toward it)
- I still got errors, so I replaced the HBA with another one (from eBay)
- I'm still getting errors, so now I think it's either the SAS cables, my backplane, or my HDDs?

Another symptom I am experiencing is that the server has locked up at 100% CPU, waiting in iowait. Shutdown also doesn't seem to work; it gets stuck at "Forcing shutdown…"

I attached my latest diagnostics in case the errors are different. Otherwise I will replace the cables and then, if needed, the backplane.

maroon-diagnostics-20240929-0959.zip

Edited September 29, 2024 by tone: Added pics
September 30, 2024, Community Expert
The log is completely spammed with controller-related crashes, but I cannot see the start of the problem. Reboot to clear the logs and post new diags as soon as you see errors in the log.
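When a log is flooded like this, the first matching line and what came just before it are usually the useful part. A rough Python sketch of that idea follows; the helper name `first_match_with_context` and all sample lines are hypothetical, and the keyword `aacraid` is only a guess at what to search for with this particular controller.

```python
def first_match_with_context(lines, keyword, before=3):
    """Return the first line containing `keyword`, preceded by up to `before` earlier lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i, line in enumerate(lines):
        if keyword in line:
            # Slice from a few lines before the match up to and including it.
            return lines[max(0, i - before): i + 1]
    return []  # keyword never appears

# Invented sample lines, only to show the behaviour.
sample = [
    "Sep 29 09:00:01 maroon kernel: hypothetical unrelated line A",
    "Sep 29 09:00:02 maroon kernel: hypothetical unrelated line B",
    "Sep 29 09:00:03 maroon kernel: aacraid: hypothetical first controller error",
    "Sep 29 09:00:04 maroon kernel: aacraid: hypothetical repeated error",
]
context = first_match_with_context(sample, "aacraid", before=1)
for line in context:
    print(line)
```

The later repeats are ignored on purpose, since only the start of the problem matters here.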
November 11, 2024, Author
OK, update here. I turned off the ZFS backups (uninstalled Sanoid and the ZFS plugin and disabled the user scripts) and then went more than a month without an error. I also increased the fan speeds so there was better cooling in the case and on the HBA.

Anyway, I had an error last night, but this time on only one disk (Disk 6, which was the ZFS backup target):

The log has new errors too:

I am not able to download diagnostics because it freezes/hangs:

Here is the disk log for Disk 6 (sde):

LMK if this should be a new post/topic altogether. Any idea what I should do? TIA!
November 12, 2024, Community Expert
Looks like the disk dropped offline. I would disable spin down and see if it still happens.