Phredwerd Posted June 28, 2023

Hi everyone, so my server has an 8TB parity drive with data on 3TB, 2TB and 2TB drives, all running XFS. Last month, I ran a parity check and the second drive showed 336 errors. A SMART short self-test reported no errors, and the drive attributes look pretty solid as well. This is the drive log (server name redacted):

May 11 01:27:42 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x1fc000 SErr 0x0 action 0x0
May 11 01:27:42 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:42 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:42 <ServerName> kernel: ata6.00: cmd 60/40:70:28:09:80/05:00:4b:00:00/40 tag 14 ncq dma 688128 in
May 11 01:27:42 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:42 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:42 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=11s
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 Sense Key : 0x3 [current]
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 ASC=0x11 ASCQ=0x4
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 CDB: opcode=0x28 28 00 4b 80 09 28 00 05 40 00
May 11 01:27:42 <ServerName> kernel: I/O error, dev sde, sector 1266682824 op 0x0:(READ) flags 0x0 phys_seg 84 prio class 0
May 11 01:27:42 <ServerName> kernel: ata6: EH complete
May 11 01:27:51 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x40c0001f SErr 0x0 action 0x0
May 11 01:27:51 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:51 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:51 <ServerName> kernel: ata6.00: cmd 60/40:20:e8:0e:80/05:00:4b:00:00/40 tag 4 ncq dma 688128 in
May 11 01:27:51 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:51 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:51 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=20s
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 Sense Key : 0x3 [current]
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 ASC=0x11 ASCQ=0x4
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 CDB: opcode=0x28 28 00 4b 80 0e e8 00 05 40 00
May 11 01:27:51 <ServerName> kernel: I/O error, dev sde, sector 1266684232 op 0x0:(READ) flags 0x0 phys_seg 92 prio class 0
May 11 01:27:51 <ServerName> kernel: ata6: EH complete
May 11 01:27:58 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x600fe20 SErr 0x0 action 0x0
May 11 01:27:58 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:58 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:58 <ServerName> kernel: ata6.00: cmd 60/40:60:28:14:80/05:00:4b:00:00/40 tag 12 ncq dma 688128 in
May 11 01:27:58 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:58 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:58 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=27s
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 Sense Key : 0x3 [current]
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 ASC=0x11 ASCQ=0x4
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 CDB: opcode=0x28 28 00 4b 80 14 28 00 05 40 00
May 11 01:27:58 <ServerName> kernel: I/O error, dev sde, sector 1266685032 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 0
May 11 01:27:58 <ServerName> kernel: ata6: EH complete
May 18 00:24:32 <ServerName> emhttpd: read SMART /dev/sde
May 18 00:44:58 <ServerName> emhttpd: read SMART /dev/sde

I then followed the steps outlined here, in that I stopped the array and
restarted it in maintenance mode. I ran xfs_repair from the GUI with -v as the only parameter, which, according to the docs, "tests and reports, making changes when necessary". I can't find anything bad in the log, nor any recommendations. It is listed here:

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
- block cache size set to 359968 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 108140 tail block 108140
- 19:55:38: zeroing log - 29809 of 29809 blocks done
- scan filesystem freespace and inode maps...
- 19:55:39: scanning filesystem freespace - 32 of 32 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 19:55:39: scanning agi unlinked lists - 32 of 32 allocation groups done
- process known inodes and perform inode discovery...
- agno = 30 - agno = 15 - agno = 0 - agno = 1 - agno = 31 - agno = 16 - agno = 2 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 3 - agno = 4 - agno = 5 - agno = 27 - agno = 28 - agno = 29 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14
- 19:55:51: process known inodes and inode discovery - 24256 of 24256 inodes done
- process newly discovered inodes...
- 19:55:51: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 19:55:51: setting up duplicate extent list - 32 of 32 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14 - agno = 15 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
- 19:55:51: check for inodes claiming duplicate blocks - 24256 of 24256 inodes done
Phase 5 - rebuild AG headers and trees...
- agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14 - agno = 15 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
- 19:55:53: rebuild AG headers and trees - 32 of 32 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0 - agno = 15 - agno = 30 - agno = 1 - agno = 16 - agno = 17 - agno = 2 - agno = 31 - agno = 18 - agno = 19 - agno = 3 - agno = 20 - agno = 4 - agno = 5 - agno = 21 - agno = 6 - agno = 22 - agno = 7 - agno = 8 - agno = 23 - agno = 9 - agno = 24 - agno = 25 - agno = 10 - agno = 11 - agno = 26 - agno = 12 - agno = 27 - agno = 13 - agno = 14 - agno = 28 - agno = 29
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 19:55:57: verify and correct link counts - 32 of 32 allocation groups done

XFS_REPAIR Summary    Fri May 26 19:55:59 2023

Phase       Start           End             Duration
Phase 1:    05/26 19:55:37  05/26 19:55:38  1 second
Phase 2:    05/26 19:55:38  05/26 19:55:39  1 second
Phase 3:    05/26 19:55:39  05/26 19:55:51  12 seconds
Phase 4:    05/26 19:55:51  05/26 19:55:51
Phase 5:    05/26 19:55:51  05/26 19:55:53  2 seconds
Phase 6:    05/26 19:55:53  05/26 19:55:57  4 seconds
Phase 7:    05/26 19:55:57  05/26 19:55:57

Total run time: 20 seconds
done

I then restarted the array in normal mode and still saw the drive showing 336 errors, just as before. I then ran the parity check again with the 'Write corrections to parity' option checked. It reported no errors, BUT the 336 errors still persist, so I'm not sure where to go next with this. The only thing I can think of is to run xfs_repair with the -L option, which clears the log, but I'm not sure what side effects that command brings. Any insight?! Thanks!
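On the -L question: a cautious sequence might look like the sketch below. This is purely illustrative, not Unraid's own tooling, and the /dev/md2 device path is an assumption (on Unraid, file-system repairs are generally run against the parity-protected /dev/mdX device rather than /dev/sdX, so parity stays in sync); the idea is to run a read-only check with -n first, and treat -L as a last resort since zeroing the log can discard the most recent metadata updates.

```shell
# Hedged sketch only; device names below are hypothetical examples.

repair_xfs() {
    dev=$1
    if [ ! -e "$dev" ]; then
        echo "no such device: $dev"
        return 1
    fi
    # Step 1: read-only check. -n reports problems but changes nothing.
    xfs_repair -n "$dev" || return 1
    # Step 2: actual repair. If this refuses to run because of a dirty log,
    # try mounting and cleanly unmounting the filesystem once to replay the
    # log before ever reaching for -L.
    xfs_repair "$dev"
}

# Example (array stopped / in maintenance mode first):
# repair_xfs /dev/md2
```

Defining this as a function keeps the destructive step behind an explicit call, so nothing runs against a disk until you pass it a real device.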
itimpi Posted June 28, 2023

Once you get parity errors, the only way to clear them is to run a correcting check, and the correcting check will report the number of errors it corrected. It is only the next check after that which will show 0 errors.

The syslog snippet you posted suggests you may have a cabling issue (either SATA or power) to whatever drive is ata6, as you are getting resets happening.
Phredwerd Posted June 28, 2023 (Author)

What strikes me as odd is that this setup persisted for a good 18 months through multiple parity checks. Then the one I ran in May showed those errors. I ran another check with corrections enabled and it resulted in 0 errors, but the count still persists. Also, I have my Unraid setup send me a Telegram message every morning at 12:20 AM, and it constantly shows up as FAIL because of this. More annoying than anything at this point.
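For what it's worth, the failing sectors in a syslog excerpt like the one above can be pulled out programmatically to check whether the UNC read errors cluster in one small region of the disk (here all three fall within about a megabyte of each other, which often points at a localized media problem rather than random link noise). A small helper, purely illustrative and not part of Unraid:

```python
import re

# Sample lines copied from the kernel log above (server name shortened).
LOG = """\
May 11 01:27:42 srv kernel: I/O error, dev sde, sector 1266682824 op 0x0:(READ) flags 0x0 phys_seg 84 prio class 0
May 11 01:27:51 srv kernel: I/O error, dev sde, sector 1266684232 op 0x0:(READ) flags 0x0 phys_seg 92 prio class 0
May 11 01:27:58 srv kernel: I/O error, dev sde, sector 1266685032 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 0
"""

# Extract (device, sector) pairs from the kernel's "I/O error" lines.
pattern = re.compile(r"I/O error, dev (\w+), sector (\d+)")
errors = [(dev, int(sec)) for dev, sec in pattern.findall(LOG)]

for dev, sec in errors:
    # The kernel reports sectors in 512-byte units; convert to a byte offset.
    print(f"/dev/{dev}: sector {sec} (~{sec * 512 / 1e9:.1f} GB into the disk)")
```

Feeding it the full syslog instead of the excerpt would show whether errors keep hitting the same neighborhood across checks.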