Parity disk in error state followed by 5 other disks showing Input/output errors, cache drive now unreadable



As mentioned in the title, I first received an email about the parity drive being in an error state, followed by 5 other disks showing Input/output errors. Several Docker containers are reporting that they have no read access to the appdata directory, which is on the cache drive.

 

The array is currently stalled at "Unmounting disks" as I try to stop it.

Any ideas?

 

.tobor-server-diagnostics-20211021-1329.zip

 

 ls -la /mnt/
/bin/ls: cannot access '/mnt/disk18': Input/output error
/bin/ls: cannot access '/mnt/disk16': Input/output error
/bin/ls: cannot access '/mnt/disk10': Input/output error
/bin/ls: cannot access '/mnt/disk7': Input/output error
/bin/ls: cannot access '/mnt/disk6': Input/output error
total 16
drwxr-xr-x 26 root   root  520 Sep 15 18:57 ./
drwxr-xr-x 21 root   root  440 Oct 21 12:12 ../
drwxrwxrwx  1 nobody users  66 Oct 17 04:30 cache/
drwxrwxrwx  7 nobody users 108 Oct 17 04:30 disk1/
d?????????  ? ?      ?       ?            ? disk10/
drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk11/
drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk12/
drwxrwxrwx  5 nobody users  69 Oct 10 04:30 disk13/
drwxrwxrwx  5 nobody users  69 Oct 17 04:30 disk14/
drwxrwxrwx  5 nobody users  53 Oct 17 04:30 disk15/
d?????????  ? ?      ?       ?            ? disk16/
drwxrwxrwx  4 nobody users  36 Oct 17 04:30 disk17/
d?????????  ? ?      ?       ?            ? disk18/
drwxrwxrwx  5 nobody users  67 Oct 17 04:30 disk19/
drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk2/
drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk3/
drwxrwxrwx  6 nobody users  88 Oct 17 04:30 disk4/
drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk5/
d?????????  ? ?      ?       ?            ? disk6/
d?????????  ? ?      ?       ?            ? disk7/
drwxrwxrwx  5 nobody users  51 Oct 17 04:30 disk8/
drwxrwxrwx  7 nobody users 109 Oct 17 04:30 disk9/
drwxrwxrwt  2 nobody users  40 Sep 15 18:57 disks/
drwxrwxrwt  2 nobody users  40 Sep 15 18:57 remotes/
drwxrwxrwx  1 nobody users 108 Oct 17 04:30 user/
drwxrwxrwx  1 nobody users 108 Oct 17 04:30 user0/
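A quick way to see which mount points are throwing I/O errors (the `d?????????` entries above) without eyeballing the full `ls -la` output is a small loop like this. This is just a generic sketch, not Unraid-specific tooling:

```shell
#!/bin/sh
# Probe each given path with ls and report which ones fail,
# e.g. with Input/output errors like the disks above.
check_paths() {
    for p in "$@"; do
        if ls "$p" >/dev/null 2>&1; then
            echo "OK   $p"
        else
            echo "FAIL $p"
        fi
    done
}
# Usage on the server would be: check_paths /mnt/disk*
```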

 


Looks like a controller problem:

 

Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: outstanding cmd: midlevel-0
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: outstanding cmd: lowlevel-0
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: outstanding cmd: error handler-48
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: outstanding cmd: firmware-33
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: outstanding cmd: kernel-0
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: Controller reset type is 3
Oct 21 12:48:28 tobor-server kernel: aacraid 0000:01:00.0: Issuing IOP reset
Oct 21 12:49:49 tobor-server kernel: aacraid 0000:01:00.0: IOP reset failed
Oct 21 12:49:49 tobor-server kernel: aacraid 0000:01:00.0: ARC Reset attempt failed

 

If possible, use one of the recommended controllers, like an LSI HBA. You also have filesystem corruption on multiple disks.
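For anyone checking their own diagnostics for the same symptom, a rough way to pull these aacraid reset/outstanding-command events out of a saved syslog (such as the one inside the diagnostics zip; the path here is just an example) is:

```shell
#!/bin/sh
# Scan a saved syslog for aacraid controller trouble: outstanding
# command counts and reset attempts, as in the excerpt above.
scan_aacraid() {
    grep -E 'aacraid .*(outstanding cmd|[Rr]eset)' "$1"
}
# Example: scan_aacraid /var/log/syslog
```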

39 minutes ago, JorgeB said:

Looks like a controller problem:


If possible, use one of the recommended controllers, like an LSI HBA. You also have filesystem corruption on multiple disks.

 

I'm using an Adaptec RAID 71605, which has served me well for years. I did force a reboot, and the controller was emitting a high-pitched alert beep indicating it had overheated. I'll shut down and let it cool off a bit, then try again. After the reboot the missing drives were back, but the parity drive is still listed as being in an error state.

 

Can you suggest how best to deal with the filesystem corruption? I assume it's somehow related to the RAID controller acting up.

 

EDIT: regarding the fs corruption, I'm following the instructions here: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
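For anyone finding this thread later, the command-line equivalent of those wiki steps looks roughly like the sketch below. This assumes XFS array disks (the Unraid default) and the array started in maintenance mode; `/dev/md1` is an example device corresponding to disk1, so substitute each affected disk:

```shell
#!/bin/sh
# Dry-run an XFS check against one md device (assumed example: /dev/md1).
# Running against /dev/mdX rather than /dev/sdX keeps parity in sync.
# -n = no-modify: report problems without changing anything.
check_fs() {
    dev="$1"
    xfs_repair -n "$dev"
}
# After reviewing the dry-run output, repeat without -n to actually
# repair, e.g.: xfs_repair /dev/md1
```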

Edited by enmesh-parisian-latest
