Data drive corruption reported by syslog with repair not going through

campfred · July 11, 2021

Hello everyone!
I would like some assistance with a drive suffering from data corruption.

I noticed it not because of a notification but because I wasn't able to write to Array-backed shares (cache-only ones were working fine). So I checked the syslog and found this:

Jul 11 15:45:02 Alfred kernel: XFS (md3): Corruption detected! Free inode 0x1800d6ba7 not marked free! (mode 0x41ed)
Jul 11 15:45:02 Alfred kernel: XFS (md3): Internal error xfs_trans_cancel at line 954 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x280/0x2ea [xfs]
Jul 11 15:45:02 Alfred kernel: CPU: 0 PID: 32201 Comm: shfs Tainted: P     U     O      5.10.28-Unraid #1
Jul 11 15:45:02 Alfred kernel: Hardware name: ASUS All Series/Z87-C, BIOS 2103 08/15/2014
Jul 11 15:45:02 Alfred kernel: Call Trace:
Jul 11 15:45:02 Alfred kernel: dump_stack+0x6b/0x83
Jul 11 15:45:02 Alfred kernel: xfs_trans_cancel+0x52/0xc9 [xfs]
Jul 11 15:45:02 Alfred kernel: xfs_create+0x280/0x2ea [xfs]
Jul 11 15:45:02 Alfred kernel: xfs_generic_create+0xc9/0x1ed [xfs]
Jul 11 15:45:02 Alfred kernel: vfs_mkdir+0x55/0x77
Jul 11 15:45:02 Alfred kernel: do_mkdirat+0x7a/0xc7
Jul 11 15:45:02 Alfred kernel: do_syscall_64+0x5d/0x6a
Jul 11 15:45:02 Alfred kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 11 15:45:02 Alfred kernel: RIP: 0033:0x14d104ab8467
Jul 11 15:45:02 Alfred kernel: Code: 1f 40 00 48 8b 05 29 8a 0d 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 89 0d 00 f7 d8 64 89 01 48
Jul 11 15:45:02 Alfred kernel: RSP: 002b:000014d0fdc18bb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
Jul 11 15:45:02 Alfred kernel: RAX: ffffffffffffffda RBX: 000014d0e8083cc0 RCX: 000014d104ab8467
Jul 11 15:45:02 Alfred kernel: RDX: 00000000000001c0 RSI: 00000000000001c0 RDI: 000014d0e807aae0
Jul 11 15:45:02 Alfred kernel: RBP: 000014d0fdc18bf0 R08: 000014d0e8561820 R09: 0065766973756c63
Jul 11 15:45:02 Alfred kernel: R10: 000014d0e807fe80 R11: 0000000000000206 R12: 0000000000000000
Jul 11 15:45:02 Alfred kernel: R13: 000000000000a67d R14: 000014d0e8087040 R15: 00000000000001c0
Jul 11 15:45:02 Alfred kernel: XFS (md3): xfs_do_force_shutdown(0x8) called from line 955 of file fs/xfs/xfs_trans.c. Return address = 00000000a737bb2b
Jul 11 15:45:02 Alfred kernel: XFS (md3): Corruption of in-memory data detected.  Shutting down filesystem
Jul 11 15:45:02 Alfred kernel: XFS (md3): Please unmount the filesystem and rectify the problem(s)

What I understood from this message: data corruption has been found on Drive 3 (md3), and unRAID is stopping all I/O to the Array and asking me to unmount and check the drive.
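For anyone wanting to pull the affected device out of a long syslog, here is a minimal Python sketch. It assumes the kernel lines have the same shape as the excerpt above; the function name `xfs_problem_devices` is made up for illustration and is not part of any tool.

```python
import re

# Assumed line shape, taken from the syslog excerpt above:
#   "<date> <host> kernel: XFS (<device>): <message>"
XFS_LINE = re.compile(r"kernel: XFS \((?P<dev>[^)]+)\): (?P<msg>.*)")

def xfs_problem_devices(lines):
    """Map each device to the XFS corruption/shutdown messages logged for it."""
    found = {}
    for line in lines:
        match = XFS_LINE.search(line)
        if match is None:
            continue  # not an XFS line (e.g. call trace, register dump)
        msg = match.group("msg")
        lowered = msg.lower()
        if "corruption" in lowered or "shutdown" in lowered or "shutting down" in lowered:
            found.setdefault(match.group("dev"), []).append(msg)
    return found

sample = [
    "Jul 11 15:45:02 Alfred kernel: XFS (md3): Corruption detected! Free inode 0x1800d6ba7 not marked free! (mode 0x41ed)",
    "Jul 11 15:45:02 Alfred kernel: Call Trace:",
    "Jul 11 15:45:02 Alfred kernel: XFS (md3): xfs_do_force_shutdown(0x8) called from line 955 of file fs/xfs/xfs_trans.c. Return address = 00000000a737bb2b",
    "Jul 11 15:45:02 Alfred kernel: XFS (md3): Corruption of in-memory data detected.  Shutting down filesystem",
]

for dev, msgs in xfs_problem_devices(sample).items():
    print(dev, len(msgs))
```

On the sample lines this reports only md3, which is the device the check has to be run against.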

Fine, I'm gonna follow the « Check Disk Filesystems » guide in the wiki and I should be good.

Except I ran the check in verbose, no-modify mode (« xfs_repair -nv /dev/md3 ») and I don't understand what the error is.

Here's the output from it:

Phase 1 - find and verify superblock...
        - block cache size set to 1040488 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 5123 tail block 5119
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agi_freecount 128, counted 105 in ag 9
agi_freecount 128, counted 105 in ag 9 finobt
agi_freecount 63, counted 61 in ag 10
agi_freecount 63, counted 61 in ag 10 finobt
sb_fdblocks 331469875, counted 345095662
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
imap claims in-use inode 6443330471 is free, correcting imap
imap claims in-use inode 6443330472 is free, correcting imap
imap claims in-use inode 6443330473 is free, correcting imap
imap claims in-use inode 6443330474 is free, correcting imap
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 5
        - agno = 3
        - agno = 7
        - agno = 6
        - agno = 4
        - agno = 8
        - agno = 9
        - agno = 10
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 10856320197, would move to lost+found
disconnected dir inode 21596214240, would move to lost+found
Phase 7 - verify link counts...
Maximum metadata LSN (11:7968) is ahead of log (11:5123).
Would format log to cycle 14.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sun Jul 11 17:33:36 2021

Phase           Start           End             Duration
Phase 1:        07/11 17:33:18  07/11 17:33:18
Phase 2:        07/11 17:33:18  07/11 17:33:19  1 second
Phase 3:        07/11 17:33:19  07/11 17:33:31  12 seconds
Phase 4:        07/11 17:33:31  07/11 17:33:31
Phase 5:        Skipped
Phase 6:        07/11 17:33:31  07/11 17:33:36  5 seconds
Phase 7:        07/11 17:33:36  07/11 17:33:36

Total run time: 18 seconds
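Two numbers in this output can be cross-checked with a little arithmetic. The sketch below uses only values that appear in the logs above, except for the block size: 4096 bytes is an assumption (the usual XFS default) and is not shown anywhere in the output. Keep in mind that the -n run ignored the log, so the counters may be partly spurious.

```python
# Inode from the syslog line "Free inode 0x1800d6ba7 not marked free!"
syslog_inode = int("0x1800d6ba7", 16)

# First inode from "imap claims in-use inode 6443330471 is free"
repair_inode = 6443330471
same_inode = syslog_inode == repair_inode

# Free-block counters from "sb_fdblocks 331469875, counted 345095662"
sb_fdblocks = 331469875
counted = 345095662
block_diff = counted - sb_fdblocks

# ASSUMPTION: 4096-byte blocks (usual XFS default; not shown in the output)
BLOCK_SIZE = 4096
diff_gib = block_diff * BLOCK_SIZE / 2**30

print(same_inode, block_diff, round(diff_gib, 1))
```

The inode the kernel complained about is the same one xfs_repair flags in its imap messages, so the two reports agree on where the inconsistency is.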

Okay, there's an alert about the filesystem's log telling me to mount the disk to resolve the log inconsistencies.

...Except after I mounted the Array again, I was back to square one, with the array I/O-blocked because of corruption.

So I went back into Maintenance mode and ran the repair anyway to see whether it would attempt to do something with the log, but no: it asks me to mount the drive first.

Phase 1 - find and verify superblock...
        - block cache size set to 1040488 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 5123 tail block 5119
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
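The order of operations these xfs_repair messages describe can be summarized in a small Python sketch. It only matches phrases that appear verbatim in the outputs in this thread; the function name `next_step` and the wording of the returned hints are made up for illustration.

```python
def next_step(repair_output: str) -> str:
    """Suggest a follow-up based on the log-related message in xfs_repair output."""
    # Collapse line wraps so phrases split across lines still match.
    text = " ".join(repair_output.split())
    if "valuable metadata changes in a log" not in text:
        return "no log warning: review the rest of the report"
    if "ignored because the -n option was used" in text:
        return "dry run: mount the filesystem to replay the log, then re-check"
    if "needs to be replayed" in text:
        return "mount to replay the log; use -L only if the mount fails"
    if "destroyed because the -L option was used" in text:
        return "log was destroyed: check lost+found after the repair"
    return "unrecognized log message"

error_text = """ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair."""
print(next_step(error_text))
```

In this case the mount had already been tried and the filesystem shut down again, which is the situation the -L option is documented for in the message itself.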

After that, I powered down the server and replaced the power and data cables for all my drives, just in case a failing cable was causing this. There are no bends or kinks in them. I'm still getting the same state, though.

Now I don't know what else I can do to resolve this issue.

Does someone have an idea or a pointer that could potentially help me solve this?

Of course, diagnostics data is attached to this post.

Thank you very much for taking the time to read this!

alfred-diagnostics-20210711-1735.zip

Squid · July 12, 2021

Just do the -L flag. Usually there's no corruption.

campfred · July 13, 2021

On 7/11/2021 at 9:49 PM, Squid said:

Just do the -L flag. Usually there's no corruption.

Thank you for the pointer! It looks like it rebuilt the log on the filesystem, and it's mounting properly now!

I'll wait until the end of the week to see whether something comes up and the array locks up the drive again.

If everything's fine by the weekend, I'll mark the thread as solved.

Command output for anyone who is interested or in the same situation:

root@Alfred:~# xfs_repair /dev/md3 -Lv
Phase 1 - find and verify superblock...
        - block cache size set to 1040488 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 5123 tail block 5119
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
agi_freecount 128, counted 105 in ag 9
agi_freecount 128, counted 105 in ag 9 finobt
agi_freecount 63, counted 61 in ag 10
agi_freecount 63, counted 61 in ag 10 finobt
sb_fdblocks 331469875, counted 345095662
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
imap claims in-use inode 6443330471 is free, correcting imap
imap claims in-use inode 6443330472 is free, correcting imap
imap claims in-use inode 6443330473 is free, correcting imap
imap claims in-use inode 6443330474 is free, correcting imap
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 6
        - agno = 1
        - agno = 7
        - agno = 5
        - agno = 4
        - agno = 8
        - agno = 9
        - agno = 10
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 10856320197, moving to lost+found
disconnected dir inode 21596214240, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 1181372 nlinks from 3 to 5
Maximum metadata LSN (11:7988) is ahead of log (1:2).
Format log to cycle 14.

        XFS_REPAIR Summary    Tue Jul 13 09:33:48 2021

Phase           Start           End             Duration
Phase 1:        07/13 09:31:14  07/13 09:31:14
Phase 2:        07/13 09:31:14  07/13 09:31:46  32 seconds
Phase 3:        07/13 09:31:46  07/13 09:31:58  12 seconds
Phase 4:        07/13 09:31:58  07/13 09:31:58
Phase 5:        07/13 09:31:58  07/13 09:31:59  1 second
Phase 6:        07/13 09:31:59  07/13 09:32:05  6 seconds
Phase 7:        07/13 09:32:05  07/13 09:32:05

Total run time: 51 seconds
done
root@Alfred:~#

trurl · July 13, 2021

Be sure to check your lost+found share for anything the repair might have put there.
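A quick way to see what the repair left behind is to list the lost+found folder on the repaired disk. The Python sketch below assumes the disk is mounted at /mnt/disk3 (a guess based on md3 being Drive 3; adjust it to your setup), and the function name `list_lost_found` is made up for illustration. Recovered entries are typically named after their inode number, so the two directories from the repair output would be expected to show up as 10856320197 and 21596214240.

```python
from pathlib import Path

def list_lost_found(disk_root):
    """Return the sorted entry names in <disk_root>/lost+found, or [] if it is absent."""
    lost_found = Path(disk_root) / "lost+found"
    if not lost_found.is_dir():
        return []
    return sorted(entry.name for entry in lost_found.iterdir())

# ASSUMPTION: /mnt/disk3 is where the repaired disk is mounted.
for name in list_lost_found("/mnt/disk3"):
    print(name)
```

An empty result means either nothing was moved there or the folder does not exist at that path.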
