Multiple Disk Read Errors across the majority of my Array disks (after upgrading to 7.0.0...)?

January 12, 2025

I upgraded to 7.0.0 on Friday evening. This might be relevant or not to this discussion as I have never seen this error before. (Over a year using Unraid on this specific device.)

The error stated:
- Warning - array has errors Array has 7 disks with read errors

I noticed that 7 of the 8 data disks in the array were spun down (which is odd because I had turned off spindown as part of the upgrade process for what I thought was safety reasons).

When I tried to spin up the disks, nothing happened at first. Then disk 8 was disabled. Notification: "Alert - Disk 8 in error state (disk dsbl)
WDC_XXXXXXXXXXXXX (sde)"

Here is a diagnostics run from approximately that time: (REMOVED)

Then I turned off the Docker and VM services (to limit any writing to the disks) and rebooted the device. I then got the following notifications:

Notice - array turned good Array has 0 disks with read errors
Notice - Disk X returned to normal operation WDC_XXXXXXXXXXXXX (sdh) [times 6]

Here is another "diagnostics" run from after the reboot: (REMOVED)

Did I screw something up? It appears that at a minimum I will have to rebuild disk 8 because something caused the filesystem to get corrupted (or something along those lines). I don't want to take any additional steps before I know whether this is safe to do, or whether all the other disks might go down again while I'm doing it. I checked the "Attributes" for all the other disks that went into the weird read state (1, 2, 3, 4, 5, 6) and nothing looks out of the ordinary to me. (The UDMA CRC error count is low on those that have it, and that was usually from something to do with static electricity discharge against the case itself. I've learned to ground myself against something else before touching the case to prevent that.)

Doing a quick web and forum search it looks like this could be related to my HBA SAS controller (I thought I did a firmware update before installing it) but I have no way to confirm.

Other notes:

As part of the 7.0.0 upgrade, I switched all my shares to pool -> array and then ran the mover in case my btrfs RAID1 pool went down. (I also set the Docker and VM services to disabled for speed and safety.) Afterward, I deleted and recreated my primary RAID1 pool with two NVMe disks (primary server data) and then created a new single-disk pool (a temporary disk for data between mover runs to the disk). I then reconfigured all the shares to point to a specific pool and/or the array depending on the share's use case. I hope this all looks good as well.

Edited May 1, 2025 by HeliusSol

January 12, 2025

Author

I've been looking at the logs posted in my OP. It looks like something might have happened with either the HBA SAS card or maybe the motherboard itself, causing a section of the PCI-E bus to go down. I think the 7 disks in question are connected to the HBA SAS card via SAS -> SATA cables. Does anyone else see any evidence of this in the logs? I'm unsure what exactly I should be looking for. (Does this look like a motherboard issue with the bus? Does it look like a problem with the HBA SAS card itself?)

If this is the case what might be recommended? I can move a grand total of 8 disks to the motherboard SATA ports but then I'm left with 2 that can't be connected at all if I try to remove the HBA SAS card from the machine for now...

Edited January 12, 2025 by HeliusSol

January 12, 2025

Community Expert
Solution

1 hour ago, HeliusSol said:

rebuild disk 8 due to something causing the filesystem to get corrupted (or something along those lines

Rebuild is required because the disk is out-of-sync. The emulated filesystem is mounted, so there shouldn't be any corruption on rebuild.

Unraid disables a disk when a write to it fails. After a disk is disabled, it will not be used again until rebuilt. Instead, the disk is emulated. When the emulated disk is read, all other disks are read and the contents of the emulated disk are the result of the parity calculation. When the emulated disk is written, parity is updated so the emulated write can be recovered.

The initial write failure which caused the disk to be disabled, and any subsequent writes, are all emulated by updating parity. The actual disk is not written, so it is out-of-sync with the rest of the array, and must be rebuilt to recover the emulated writes and get back in sync.
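To make the emulation concrete, here is a minimal Python sketch of the idea on a hypothetical array with three data disks and one parity disk. It assumes the first parity disk is a plain XOR across the data disks; a second parity disk uses a different calculation that this does not cover. The two-byte blocks and the helper names (`xor_blocks`, `emulated_read`, `parity_after_emulated_write`) are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def emulated_read(surviving_blocks, parity_block):
    """Reconstruct the disabled disk's block from all other disks plus parity."""
    return xor_blocks(surviving_blocks + [parity_block])

def parity_after_emulated_write(surviving_blocks, new_block):
    """Recompute parity so that it encodes new_block for the disabled disk."""
    return xor_blocks(surviving_blocks + [new_block])

# Hypothetical contents of the same block on three data disks.
disk1, disk2, disk3 = b"\x10\x20", b"\x0f\x01", b"\xa5\x5a"
parity = xor_blocks([disk1, disk2, disk3])

# disk3 is disabled: reads are answered from disk1, disk2, and parity.
recovered = emulated_read([disk1, disk2], parity)

# A write to the disabled disk only changes parity; the physical disk3
# still holds the old bytes, which is why it is out-of-sync afterwards.
new_data = b"\xff\x00"
parity = parity_after_emulated_write([disk1, disk2], new_data)
rebuilt = emulated_read([disk1, disk2], parity)
```

In this sketch `recovered` equals the original `disk3` contents, and `rebuilt` equals `new_data` rather than what is physically on `disk3`, which is the out-of-sync state a rebuild corrects.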

January 12, 2025

Author

@trurl Thank you for your reply. I understand that a rebuild is necessary.

My primary concern is whether there is anything indicating drive failure or other hardware failure that would preclude me from attempting this right now. I don't know anything specific about the SMART data other than that the UDMA CRC errors can be completely isolated to the SAS breakout cables in at least some cases (which I think is what is going on with mine). I don't see any bad sectors or such.

Do you (or anyone else) see anything in the logs before or after a reboot that would explain what happened? My best guess is something happened with either the PCI-E bus or the HBA SAS card itself. Power Supply failure to one of those devices? Appears the card "disappeared" but nothing I see in the logs indicates what actually happened.

I'd like to start the rebuild and hope that everything recovers like it should. I just would rather try to understand what caused the problem before starting a rebuild that will definitely take over 24 hours and keep me from writing to the array for that long.

(My anxiety about not knowing what happened to cause this large of an issue after over 12 months of no significant problems with the device has my stomach in knots.)

January 13, 2025

Author

I started the rebuild of disk8 after moving as many of the HDDs as I could back to SATA cables directly connected to the motherboard. It says around 24 hours for a complete rebuild of a 20TB disk at current speeds. I expect that as long as nothing happens with the motherboard or the HBA SAS card that this should resolve my issue for the moment. If anyone has any idea what happened (or may have happened) when the HBA SAS card and everything connected to it went offline, please let me know. I don't understand what happened or how to prevent it going forward. Hoping that removing most of the drives from the device will keep it "happy" for the time being. Might be worth getting a different one as a backup or something.

In the future, I think I will need to make sure to just grab the diagnostics and reboot the machine before attempting to spin up the disks directly...

January 13, 2025

Community Expert

15 hours ago, HeliusSol said:

around 24 hours for a complete rebuild of a 20TB disk

Almost certainly going to take longer than that. Typically 2-3 hours per TB.

January 14, 2025

Author

On 1/13/2025 at 12:18 PM, trurl said:

Almost certainly going to take longer than that. Typically 2-3 hours per TB.

Notification from server:

Elapsed Time 1 day, 9 hr, 30 min, 39 sec, Runtime 1 day, 9 hr, 6 min, 51 sec, Increments 3, Average Speed 184.6 MB/s

I left Docker and the VM stuff off for at least the first half of it and set mover not to run until a week from the day it was started. So not bad, all things considered.
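For a rough cross-check of these numbers, here is a small Python sketch of the arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which may not be what the notification uses, and the function names are made up for illustration.

```python
def rebuild_hours(size_tb, speed_mb_s):
    """Hours needed to process size_tb terabytes at a steady speed_mb_s."""
    return size_tb * 1e12 / (speed_mb_s * 1e6) / 3600

def hours_per_tb(total_hours, size_tb):
    """Average hours spent per terabyte."""
    return total_hours / size_tb

# 20 TB disk at the reported average speed of 184.6 MB/s.
estimate = rebuild_hours(20, 184.6)       # about 30.1 hours

# Reported runtime: 1 day, 9 hr, 6 min, 51 sec.
runtime = 24 + 9 + 6 / 60 + 51 / 3600     # about 33.1 hours
rate = hours_per_tb(runtime, 20)          # about 1.66 hours per TB

# The 2-3 hours per TB rule of thumb applied to the same disk.
rule_of_thumb = (2 * 20, 3 * 20)          # 40 to 60 hours
```

On these assumptions the rebuild ran longer than the original 24 hour estimate but finished under the 2-3 hours per TB rule of thumb. The speed-based estimate and the reported runtime need not agree exactly, since the unit convention here is a guess.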

That all said, my best guess is that the HBA SAS card went offline (how? why? how to prevent it? anything in the logs that might explain it would be helpful) and my attempt to force disk 8 back online manually caused the problem. If this ever happens in the future, I will attempt to reboot the system to reset the HBA SAS card. If that doesn't work, I'd be down until I get a replacement (or replace the motherboard if that is the problem at that point).

January 27, 2025

Author

For anyone late to the party, the final result is this: the rebuild was required and completed successfully. However, the reason all the HDDs connected to the HBA SAS card fell out of the array is still not clear. It appears the card failed. I wish there was a way to tell the system to shut down the array immediately when any or more than X drives "disappear". That might have prevented my need to rebuild.

I think I will need to take the HBA SAS card's heatsink off, replace the thermal paste, and somehow add a fan to it, as overheating is one of two possible reasons that this happened. The other is that there is some issue between this specific card, its firmware (which I did update a little over a year ago), or its drivers and Unraid 7.0.0.

January 27, 2025

Community Expert

Just now, HeliusSol said:

That might have prevented my need to rebuild.

When a drive can't be written, it becomes disabled. After that, it must be rebuilt. The disabled drive is emulated by parity and so the array can continue to be used in that degraded state.

On 1/12/2025 at 4:18 PM, trurl said:

When the emulated disk is read, all other disks are read and the contents of the emulated disk are the result of the parity calculation. When the emulated disk is written, parity is updated so the emulated write can be recovered.

Dual parity would allow 2 disks to become disabled and the array could continue to be used in that degraded state.
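As a simple way to picture that limit, here is a tiny Python sketch. The function name and the state labels are made up for illustration, and it only counts disks; it says nothing about how Unraid tracks this internally.

```python
def array_state(unavailable_disks, parity_disks):
    """Classify an array by how many disks are unavailable versus parity count."""
    if unavailable_disks < 0 or parity_disks < 0:
        raise ValueError("disk counts cannot be negative")
    if unavailable_disks == 0:
        return "normal"
    if unavailable_disks <= parity_disks:
        # Every unavailable disk can still be emulated from the rest.
        return "degraded (emulated)"
    # More disks are unavailable than parity can reconstruct.
    return "beyond parity protection"

one_disabled = array_state(1, 2)    # a single disabled disk with dual parity
two_disabled = array_state(2, 2)    # the most dual parity can cover
seven_offline = array_state(7, 2)   # seven disks unreachable at once
```

With dual parity, one or two unavailable disks fall in the emulated case, while seven at once is past what parity can cover.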

January 27, 2025

Author

1 minute ago, trurl said:

Dual parity would allow 2 disks to become disabled and the array could continue to be used in that degraded state.

I understand that. I had what I believed at the time to be 7 of the 8 data drives in my 10-drive array (dual parity) "go offline".

It is quite unnerving to have 7 drives get "disconnected" (not disabled) all at the same time. Thus my conclusion about either a temporary hardware failure or a driver/firmware issue.

I don't know of any setting that would allow for automatically disabling the entire array when more than 2 drives "disappear" from the array (while it is active) that might have prevented the 1 drive that got disabled from getting disabled due to the write error in the first place.
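A do-it-yourself check along those lines might look like the following Python sketch. This is not an Unraid feature: the device paths are placeholders, the default threshold of 2 simply mirrors dual parity, and it assumes a disk that drops off the controller also loses its device node, which may not hold in every failure mode. What to do when the check trips (send a notification, stop the array) is deliberately left out.

```python
import os

def missing_devices(expected_paths, exists=os.path.exists):
    """Return the expected device paths that are not currently present."""
    return [path for path in expected_paths if not exists(path)]

def too_many_missing(expected_paths, threshold=2, exists=os.path.exists):
    """True when more than `threshold` expected devices have disappeared."""
    return len(missing_devices(expected_paths, exists)) > threshold

# Placeholder paths; a real list would come from the array's disk assignments.
EXPECTED = [f"/dev/disk/by-id/example-disk-{n}" for n in range(1, 9)]
```

A script like this would have to be run on a schedule, and the `exists` parameter is only there so the logic can be exercised without real hardware.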

January 27, 2025

Community Expert

It doesn't know there is a problem until it tries to access a disk.

Solved by trurl