Disks getting disabled, this time the parity disk

lfas · July 27, 2022

I was having issues booting my Windows VM, so I did a clean reboot of Unraid. When it came back online, my parity disk was disabled. I stopped the array, removed the disk, started and then stopped the array, and re-added the disk. It is now doing a parity rebuild/check.

But I am concerned, and I don't understand how to interpret the logs. In the past, disk 4 would randomly disable itself (and as far as I could tell, all the SMART tests checked out). I eventually determined that disk 4 was being disabled because of a bad cable, which I removed from the system. Disk 4 has been fine since then, but now that the parity disk has been disabled, I am worried, since I didn't touch it.

What should I do to make sure my system is running correctly?

The diagnostics I attached are from right after I rebooted and noticed the drive was disabled.

icaria-diagnostics-20220727-1046.zip

trurl · July 27, 2022

5 minutes ago, lfas said:

from right after I rebooted

And so they can't tell us why it became disabled. If this is a recurring issue, you might want to set up a syslog server.

Looks like problems with reading cache2. Maybe related? Controller or power?

JorgeB · July 27, 2022

Disk was already disabled at boot time; if it happens again, grab the diags before rebooting.

This indicates that the cache SATA device dropped offline some time ago:

Jul 27 11:41:21 Icaria kernel: BTRFS info (device nvme0n1p1): bdev /dev/sdd1 errs: wr 231821, rd 56307, flush 14833, corrupt 820, gen 0

Run a scrub and then see here to reset the stats and for better pool monitoring.

There's also data corruption detected on the other pool. Ryzen with overclocked RAM, like you have, is known to corrupt data in some cases; see here.
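
For anyone reading along who wants to compare these counters before and after a scrub or a stats reset, the BTRFS error line quoted above can be split into its fields with a few lines of Python. This is only a minimal sketch: the function name `parse_btrfs_errs` is hypothetical, and the pattern is based solely on the single log line shown in this thread, so other kernel versions may format it differently. The fields are write errors (wr), read errors (rd), flush errors (flush), corruption errors (corrupt), and generation errors (gen).

```python
import re

# Pattern modeled on the one log line quoted in this thread (an assumption,
# not an official format description).
_ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return the device name and error counters, or None if no match."""
    match = _ERRS_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    stats = {name: int(value) for name, value in fields.items() if name != "dev"}
    stats["dev"] = fields["dev"]
    return stats


# The line posted above, copied verbatim.
line = (
    "Jul 27 11:41:21 Icaria kernel: BTRFS info (device nvme0n1p1): "
    "bdev /dev/sdd1 errs: wr 231821, rd 56307, flush 14833, corrupt 820, gen 0"
)
stats = parse_btrfs_errs(line)
print(stats)
```

Non-zero wr, rd, and flush counts on a single pool member fit the reading above that the device dropped offline at some point. The counters persist until they are reset, so it helps to note them down before resetting in order to spot new errors later.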

lfas · July 27, 2022

3 minutes ago, trurl said:

And so they can't tell us why it became disabled. If this is a recurring issue, you might want to set up a syslog server.

Looks like problems with reading cache2. Maybe related? Controller or power?

Thank you for the reply. I will set up the syslog server as you recommended.

The only way I know how to test if it is the controller or power is to use a different motherboard and different PSU. Is there an easier way to test?

lfas · July 27, 2022

6 minutes ago, JorgeB said:

Disk was already disabled at boot time; if it happens again, grab the diags before rebooting.

This indicates that the cache SATA device dropped offline some time ago:

Jul 27 11:41:21 Icaria kernel: BTRFS info (device nvme0n1p1): bdev /dev/sdd1 errs: wr 231821, rd 56307, flush 14833, corrupt 820, gen 0

Run a scrub and then see here to reset the stats and for better pool monitoring.

There's also data corruption detected on the other pool. Ryzen with overclocked RAM, like you have, is known to corrupt data in some cases; see here.

Thank you, I will do both those things. For the scrub, should I enable "repair corrupted blocks"?

JorgeB · July 27, 2022

2 minutes ago, lfas said:

For the scrub, should I enable "repair corrupted blocks"?

Yes, and check that there are no uncorrectable errors at the end.
