Phredwerd Posted June 28, 2023

Hi everyone, so my server has an 8TB parity drive with data on 3TB, 2TB and 2TB drives, all running XFS. Last month, I ran a parity check and the second drive showed 336 errors. A SMART short self-test reported no errors, and the drive attributes look pretty solid as well. This is the drive log (server name redacted):

May 11 01:27:42 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x1fc000 SErr 0x0 action 0x0
May 11 01:27:42 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:42 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:42 <ServerName> kernel: ata6.00: cmd 60/40:70:28:09:80/05:00:4b:00:00/40 tag 14 ncq dma 688128 in
May 11 01:27:42 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:42 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:42 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=11s
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 Sense Key : 0x3 [current]
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 ASC=0x11 ASCQ=0x4
May 11 01:27:42 <ServerName> kernel: sd 6:0:0:0: [sde] tag#14 CDB: opcode=0x28 28 00 4b 80 09 28 00 05 40 00
May 11 01:27:42 <ServerName> kernel: I/O error, dev sde, sector 1266682824 op 0x0:(READ) flags 0x0 phys_seg 84 prio class 0
May 11 01:27:42 <ServerName> kernel: ata6: EH complete
May 11 01:27:51 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x40c0001f SErr 0x0 action 0x0
May 11 01:27:51 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:51 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:51 <ServerName> kernel: ata6.00: cmd 60/40:20:e8:0e:80/05:00:4b:00:00/40 tag 4 ncq dma 688128 in
May 11 01:27:51 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:51 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:51 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=20s
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 Sense Key : 0x3 [current]
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 ASC=0x11 ASCQ=0x4
May 11 01:27:51 <ServerName> kernel: sd 6:0:0:0: [sde] tag#4 CDB: opcode=0x28 28 00 4b 80 0e e8 00 05 40 00
May 11 01:27:51 <ServerName> kernel: I/O error, dev sde, sector 1266684232 op 0x0:(READ) flags 0x0 phys_seg 92 prio class 0
May 11 01:27:51 <ServerName> kernel: ata6: EH complete
May 11 01:27:58 <ServerName> kernel: ata6.00: exception Emask 0x0 SAct 0x600fe20 SErr 0x0 action 0x0
May 11 01:27:58 <ServerName> kernel: ata6.00: irq_stat 0x40000008
May 11 01:27:58 <ServerName> kernel: ata6.00: failed command: READ FPDMA QUEUED
May 11 01:27:58 <ServerName> kernel: ata6.00: cmd 60/40:60:28:14:80/05:00:4b:00:00/40 tag 12 ncq dma 688128 in
May 11 01:27:58 <ServerName> kernel: ata6.00: status: { DRDY ERR }
May 11 01:27:58 <ServerName> kernel: ata6.00: error: { UNC }
May 11 01:27:58 <ServerName> kernel: ata6.00: configured for UDMA/133
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=27s
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 Sense Key : 0x3 [current]
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 ASC=0x11 ASCQ=0x4
May 11 01:27:58 <ServerName> kernel: sd 6:0:0:0: [sde] tag#12 CDB: opcode=0x28 28 00 4b 80 14 28 00 05 40 00
May 11 01:27:58 <ServerName> kernel: I/O error, dev sde, sector 1266685032 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 0
May 11 01:27:58 <ServerName> kernel: ata6: EH complete
May 18 00:24:32 <ServerName> emhttpd: read SMART /dev/sde
May 18 00:44:58 <ServerName> emhttpd: read SMART /dev/sde

I then followed the steps outlined here, in that I stopped the array and
restarted it in maintenance mode. I ran xfs_repair from the GUI with -v as the only parameter, which, according to the docs, "tests and reports, making changes when necessary". I can't find anything bad in the log, nor any recommendations. It is listed here:

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
- block cache size set to 359968 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 108140 tail block 108140
- 19:55:38: zeroing log - 29809 of 29809 blocks done
- scan filesystem freespace and inode maps...
- 19:55:39: scanning filesystem freespace - 32 of 32 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 19:55:39: scanning agi unlinked lists - 32 of 32 allocation groups done
- process known inodes and perform inode discovery...
- agno = 30 - agno = 15 - agno = 0 - agno = 1 - agno = 31 - agno = 16 - agno = 2 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 3 - agno = 4 - agno = 5 - agno = 27 - agno = 28 - agno = 29 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14
- 19:55:51: process known inodes and inode discovery - 24256 of 24256 inodes done
- process newly discovered inodes...
- 19:55:51: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 19:55:51: setting up duplicate extent list - 32 of 32 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14 - agno = 15 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
- 19:55:51: check for inodes claiming duplicate blocks - 24256 of 24256 inodes done
Phase 5 - rebuild AG headers and trees...
- agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14 - agno = 15 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
- 19:55:53: rebuild AG headers and trees - 32 of 32 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0 - agno = 15 - agno = 30 - agno = 1 - agno = 16 - agno = 17 - agno = 2 - agno = 31 - agno = 18 - agno = 19 - agno = 3 - agno = 20 - agno = 4 - agno = 5 - agno = 21 - agno = 6 - agno = 22 - agno = 7 - agno = 8 - agno = 23 - agno = 9 - agno = 24 - agno = 25 - agno = 10 - agno = 11 - agno = 26 - agno = 12 - agno = 27 - agno = 13 - agno = 14 - agno = 28 - agno = 29
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 19:55:57: verify and correct link counts - 32 of 32 allocation groups done

XFS_REPAIR Summary    Fri May 26 19:55:59 2023

Phase       Start           End             Duration
Phase 1:    05/26 19:55:37  05/26 19:55:38  1 second
Phase 2:    05/26 19:55:38  05/26 19:55:39  1 second
Phase 3:    05/26 19:55:39  05/26 19:55:51  12 seconds
Phase 4:    05/26 19:55:51  05/26 19:55:51
Phase 5:    05/26 19:55:51  05/26 19:55:53  2 seconds
Phase 6:    05/26 19:55:53  05/26 19:55:57  4 seconds
Phase 7:    05/26 19:55:57  05/26 19:55:57

Total run time: 20 seconds
done

I then restarted the array in normal mode and still saw the drive showing 336 errors, just as before. I then ran the parity check again with the 'Write corrections to parity' option checked. It reported no errors, BUT the 336 errors still persist, so I'm not sure where to go next with this. The only thing I can think of is to run xfs_repair with the -L option, which clears the log, but I'm not sure what side effects that command brings. Any insight?! Thanks!
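On the -L question: a cautious sequence might look like the sketch below. This is purely illustrative, not Unraid's own tooling, and the /dev/md2 device path is an assumption (on Unraid, file-system repairs are generally run against the parity-protected /dev/mdX device rather than /dev/sdX, so parity stays in sync); the idea is to run a read-only check with -n first, and treat -L as a last resort since zeroing the log can discard the most recent metadata updates.

```shell
# Hedged sketch only; device names below are hypothetical examples.

repair_xfs() {
    dev=$1
    if [ ! -e "$dev" ]; then
        echo "no such device: $dev"
        return 1
    fi
    # Step 1: read-only check. -n reports problems but changes nothing.
    xfs_repair -n "$dev" || return 1
    # Step 2: actual repair. If this refuses to run because of a dirty log,
    # try mounting and cleanly unmounting the filesystem once to replay the
    # log before ever reaching for -L.
    xfs_repair "$dev"
}

# Example (array stopped / in maintenance mode first):
# repair_xfs /dev/md2
```

Defining this as a function keeps the destructive step behind an explicit call, so nothing runs against a disk until you pass it a real device.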
itimpi Posted June 28, 2023

Once you get parity errors, the only way to clear them is to run a correcting check, and the correcting check will report the number of errors it corrected. It is only the next check after that which will show 0 errors.

The syslog snippet you posted suggests you may have a cabling issue (either SATA or power) to whatever drive is ata6, as you are getting resets happening.
Phredwerd Posted June 28, 2023 (Author)

What strikes me as odd is that this setup persisted for a good 18 months through multiple parity checks. Then the one I ran in May showed those errors. I ran another check with corrections enabled and it resulted in 0 errors, but the count still persists. Also, I have my Unraid setup send me a Telegram message every morning at 12:20 AM, and it constantly shows up as FAIL because of this. More annoying than anything at this point.
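For what it's worth, the failing sectors in a syslog excerpt like the one above can be pulled out programmatically to check whether the UNC read errors cluster in one small region of the disk (here all three fall within about a megabyte of each other, which often points at a localized media problem rather than random link noise). A small helper, purely illustrative and not part of Unraid:

```python
import re

# Sample lines copied from the kernel log above (server name shortened).
LOG = """\
May 11 01:27:42 srv kernel: I/O error, dev sde, sector 1266682824 op 0x0:(READ) flags 0x0 phys_seg 84 prio class 0
May 11 01:27:51 srv kernel: I/O error, dev sde, sector 1266684232 op 0x0:(READ) flags 0x0 phys_seg 92 prio class 0
May 11 01:27:58 srv kernel: I/O error, dev sde, sector 1266685032 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 0
"""

# Extract (device, sector) pairs from the kernel's "I/O error" lines.
pattern = re.compile(r"I/O error, dev (\w+), sector (\d+)")
errors = [(dev, int(sec)) for dev, sec in pattern.findall(LOG)]

for dev, sec in errors:
    # The kernel reports sectors in 512-byte units; convert to a byte offset.
    print(f"/dev/{dev}: sector {sec} (~{sec * 512 / 1e9:.1f} GB into the disk)")
```

Feeding it the full syslog instead of the excerpt would show whether errors keep hitting the same neighborhood across checks.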