[SOLVED, bug?] XFS drive issue: xfs_repair fails and unRaid crashes on mount



I have an XFS drive with filesystem issues, and mounting it causes unRaid to crash. The syslog lines from when the problem first occurred are below (the full syslog, which has many more lines after the ones included here, is attached as PreBootSyslog.txt).

 

Jan  6 00:42:36 Tower kernel: XFS (md2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3156 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_free_ag_extent+0x419/0x558

Jan  6 00:42:36 Tower kernel: CPU: 3 PID: 32691 Comm: shfs Not tainted 4.4.30-unRAID #2

Jan  6 00:42:36 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012

Jan  6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303b78 ffffffff8136f79f ffff88004b38e0d0

Jan  6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303b90 ffffffff81275fd0 ffffffff812465ad

Jan  6 00:42:36 Tower kernel: ffff8803af303c00 ffffffff8125a774 000001526b9f3000 000000005036eb68

Jan  6 00:42:36 Tower kernel: Call Trace:

Jan  6 00:42:36 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e

Jan  6 00:42:36 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35

Jan  6 00:42:36 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558

Jan  6 00:42:36 Tower kernel: [<ffffffff8125a774>] xfs_btree_insert+0xba/0x152

Jan  6 00:42:36 Tower kernel: [<ffffffff812465ad>] xfs_free_ag_extent+0x419/0x558

Jan  6 00:42:36 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558

Jan  6 00:42:36 Tower kernel: [<ffffffff812471e9>] xfs_free_extent+0xbd/0xed

Jan  6 00:42:36 Tower kernel: [<ffffffff812960fd>] xfs_trans_free_extent+0x21/0x58

Jan  6 00:42:36 Tower kernel: [<ffffffff81271712>] xfs_bmap_finish+0xdf/0x102

Jan  6 00:42:36 Tower kernel: [<ffffffff81281ce1>] xfs_itruncate_extents+0xe3/0x152

Jan  6 00:42:36 Tower kernel: [<ffffffff81281dde>] xfs_inactive_truncate+0x8e/0xce

Jan  6 00:42:36 Tower kernel: [<ffffffff812827ef>] xfs_inactive+0xa2/0xc1

Jan  6 00:42:36 Tower kernel: [<ffffffff81286a64>] xfs_fs_evict_inode+0x90/0x93

Jan  6 00:42:36 Tower kernel: [<ffffffff8111e687>] evict+0xaf/0x164

Jan  6 00:42:36 Tower kernel: [<ffffffff8111f1d3>] iput+0x160/0x16d

Jan  6 00:42:36 Tower kernel: [<ffffffff8111655c>] do_unlinkat+0x125/0x201

Jan  6 00:42:36 Tower kernel: [<ffffffff81116ba7>] SyS_unlink+0x11/0x13

Jan  6 00:42:36 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Jan  6 00:42:36 Tower kernel: XFS (md2): Internal error xfs_trans_cancel at line 990 of file fs/xfs/xfs_trans.c.  Caller xfs_inactive_truncate+0xb9/0xce

Jan  6 00:42:36 Tower kernel: CPU: 3 PID: 32691 Comm: shfs Not tainted 4.4.30-unRAID #2

Jan  6 00:42:36 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012

Jan  6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303dc8 ffffffff8136f79f ffff8802b5bfae80

Jan  6 00:42:36 Tower kernel: ffffffff81664fc0 ffff8803af303de0 ffffffff81275fd0 ffffffff81281e09

Jan  6 00:42:36 Tower kernel: ffff8803af303e08 ffffffff8128a1e6 00000000ffffff8b ffff880119467800

Jan  6 00:42:36 Tower kernel: Call Trace:

Jan  6 00:42:36 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e

Jan  6 00:42:36 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35

Jan  6 00:42:36 Tower kernel: [<ffffffff81281e09>] ? xfs_inactive_truncate+0xb9/0xce

Jan  6 00:42:36 Tower kernel: [<ffffffff8128a1e6>] xfs_trans_cancel+0x49/0xbf

Jan  6 00:42:36 Tower kernel: [<ffffffff81281e09>] xfs_inactive_truncate+0xb9/0xce

Jan  6 00:42:36 Tower kernel: [<ffffffff812827ef>] xfs_inactive+0xa2/0xc1

Jan  6 00:42:36 Tower kernel: [<ffffffff81286a64>] xfs_fs_evict_inode+0x90/0x93

Jan  6 00:42:36 Tower kernel: [<ffffffff8111e687>] evict+0xaf/0x164

Jan  6 00:42:36 Tower kernel: [<ffffffff8111f1d3>] iput+0x160/0x16d

Jan  6 00:42:36 Tower kernel: [<ffffffff8111655c>] do_unlinkat+0x125/0x201

Jan  6 00:42:36 Tower kernel: [<ffffffff81116ba7>] SyS_unlink+0x11/0x13

Jan  6 00:42:36 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Jan  6 00:42:36 Tower kernel: XFS (md2): xfs_do_force_shutdown(0x8) called from line 991 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff8128a1ff

Jan  6 00:42:41 Tower kernel: XFS (md2): Corruption of in-memory data detected.  Shutting down filesystem

Jan  6 00:42:41 Tower kernel: XFS (md2): Please umount the filesystem and rectify the problem(s)

 

I have checked the memory with MemTest86+ overnight in failsafe single-CPU mode, without issues.

 

After a reboot I started the array in maintenance mode and tried a repair:

 

root@Tower:~# xfs_repair -v /dev/md2

Phase 1 - find and verify superblock...

        - block cache size set to 1470848 entries

Phase 2 - using internal log

        - zero log...

zero_log: head block 639392 tail block 625937

ERROR: The filesystem has valuable metadata changes in a log which needs to

be replayed.  Mount the filesystem to replay the log, and unmount it before

re-running xfs_repair.  If you are unable to mount the filesystem, then use

the -L option to destroy the log and attempt a repair.

Note that destroying the log may cause corruption -- please attempt a mount

of the filesystem before doing this.

 

Then I mounted the disk, as it suggests:

 

root@Tower:~# mkdir /mnt/disk2 ; mount /dev/md2 /mnt/disk2

 

The system crashes shortly afterwards. When I tail the syslog in a telnet session, this is what I get before telnet loses the connection (see the attached syslogTailAfterMountingBadDisk.txt for the full tail, and the attached image for the screen output):

 

Jan  7 11:53:47 Tower login[2317]: ROOT LOGIN  on '/dev/pts/1' from '10.0.0.5'

Jan  7 12:00:40 Tower kernel: XFS (md2): Mounting V5 Filesystem

Jan  7 12:00:40 Tower kernel: XFS (md2): Starting recovery (logdev: internal)

Jan  7 12:01:06 Tower kernel: XFS (md2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3156 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_free_ag_extent+0x419/0x558

Jan  7 12:01:06 Tower kernel: CPU: 2 PID: 5742 Comm: mount Not tainted 4.4.30-unRAID #2

Jan  7 12:01:06 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012

Jan  7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfa38 ffffffff8136f79f ffff88040bfa31a0

Jan  7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfa50 ffffffff81275fd0 ffffffff812465ad

Jan  7 12:01:06 Tower kernel: ffff8803f55cfac0 ffffffff8125a774 ffffffff812598f9 00000000cedf01d0

Jan  7 12:01:06 Tower kernel: Call Trace:

Jan  7 12:01:06 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e

Jan  7 12:01:06 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35

Jan  7 12:01:06 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558

Jan  7 12:01:06 Tower kernel: [<ffffffff8125a774>] xfs_btree_insert+0xba/0x152

Jan  7 12:01:06 Tower kernel: [<ffffffff812598f9>] ? xfs_btree_lookup+0x307/0x4a1

Jan  7 12:01:06 Tower kernel: [<ffffffff812465ad>] xfs_free_ag_extent+0x419/0x558

Jan  7 12:01:06 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558

Jan  7 12:01:06 Tower kernel: [<ffffffff812471e9>] xfs_free_extent+0xbd/0xed

Jan  7 12:01:06 Tower kernel: [<ffffffff812960fd>] xfs_trans_free_extent+0x21/0x58

Jan  7 12:01:06 Tower kernel: [<ffffffff81291ad4>] xlog_recover_process_efi+0x125/0x155

Jan  7 12:01:06 Tower kernel: [<ffffffff81291b75>] xlog_recover_process_efis+0x71/0xb5

Jan  7 12:01:06 Tower kernel: [<ffffffff81076179>] ? wake_up_bit+0x1d/0x1f

Jan  7 12:01:06 Tower kernel: [<ffffffff8127acb2>] ? xfs_iget+0x50f/0x54e

Jan  7 12:01:06 Tower kernel: [<ffffffff81294f17>] xlog_recover_finish+0x18/0x8b

Jan  7 12:01:06 Tower kernel: [<ffffffff81294f17>] ? xlog_recover_finish+0x18/0x8b

Jan  7 12:01:06 Tower kernel: [<ffffffff8128c20a>] xfs_log_mount_finish+0x20/0x36

Jan  7 12:01:06 Tower kernel: [<ffffffff8128547f>] xfs_mountfs+0x601/0x6a8

Jan  7 12:01:06 Tower kernel: [<ffffffff81287d7f>] xfs_fs_fill_super+0x3fd/0x489

Jan  7 12:01:06 Tower kernel: [<ffffffff8110c871>] mount_bdev+0x141/0x195

Jan  7 12:01:06 Tower kernel: [<ffffffff81287982>] ? xfs_parseargs+0x8c1/0x8c1

Jan  7 12:01:06 Tower kernel: [<ffffffff8128633d>] xfs_fs_mount+0x10/0x12

Jan  7 12:01:06 Tower kernel: [<ffffffff8110d4e2>] mount_fs+0xf/0x84

Jan  7 12:01:06 Tower kernel: [<ffffffff81122036>] vfs_kern_mount+0x65/0xf7

Jan  7 12:01:06 Tower kernel: [<ffffffff811249ac>] do_mount+0x91c/0xa72

Jan  7 12:01:06 Tower kernel: [<ffffffff810ce84b>] ? strndup_user+0x3a/0x82

Jan  7 12:01:06 Tower kernel: [<ffffffff81124cf1>] SyS_mount+0x70/0x9c

Jan  7 12:01:06 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Jan  7 12:01:06 Tower kernel: XFS (md2): Internal error xfs_trans_cancel at line 990 of file fs/xfs/xfs_trans.c.  Caller xlog_recover_process_efi+0x148/0x155

Jan  7 12:01:06 Tower kernel: CPU: 2 PID: 5742 Comm: mount Not tainted 4.4.30-unRAID #2

Jan  7 12:01:06 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012

Jan  7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfbd8 ffffffff8136f79f ffff8800cedf0000

Jan  7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfbf0 ffffffff81275fd0 ffffffff81291af7

Jan  7 12:01:06 Tower kernel: ffff8803f55cfc18 ffffffff8128a1e6 ffff8800ce278000 ffff8803e0bde000

Jan  7 12:01:06 Tower kernel: Call Trace:

Jan  7 12:01:06 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e

Jan  7 12:01:06 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35

Jan  7 12:01:06 Tower kernel: [<ffffffff81291af7>] ? xlog_recover_process_efi+0x148/0x155

Jan  7 12:01:06 Tower kernel: [<ffffffff8128a1e6>] xfs_trans_cancel+0x49/0xbf

Jan  7 12:01:06 Tower kernel: [<ffffffff81291af7>] xlog_recover_process_efi+0x148/0x155

Jan  7 12:01:06 Tower kernel: [<ffffffff81291b75>] xlog_recover_process_efis+0x71/0xb5

Jan  7 12:01:06 Tower kernel: [<ffffffff81076179>] ? wake_up_bit+0x1d/0x1f

Jan  7 12:01:06 Tower kernel: [<ffffffff8127acb2>] ? xfs_iget+0x50f/0x54e

Jan  7 12:01:06 Tower kernel: [<ffffffff81294f17>] xlog_recover_finish+0x18/0x8b

Jan  7 12:01:06 Tower kernel: [<ffffffff81294f17>] ? xlog_recover_finish+0x18/0x8b

Jan  7 12:01:06 Tower kernel: [<ffffffff8128c20a>] xfs_log_mount_finish+0x20/0x36

Jan  7 12:01:06 Tower kernel: [<ffffffff8128547f>] xfs_mountfs+0x601/0x6a8

Jan  7 12:01:06 Tower kernel: [<ffffffff81287d7f>] xfs_fs_fill_super+0x3fd/0x489

Jan  7 12:01:06 Tower kernel: [<ffffffff8110c871>] mount_bdev+0x141/0x195

Jan  7 12:01:06 Tower kernel: [<ffffffff81287982>] ? xfs_parseargs+0x8c1/0x8c1

Jan  7 12:01:06 Tower kernel: [<ffffffff8128633d>] xfs_fs_mount+0x10/0x12

Jan  7 12:01:06 Tower kernel: [<ffffffff8110d4e2>] mount_fs+0xf/0x84

Jan  7 12:01:06 Tower kernel: [<ffffffff81122036>] vfs_kern_mount+0x65/0xf7

Jan  7 12:01:06 Tower kernel: [<ffffffff811249ac>] do_mount+0x91c/0xa72

Jan  7 12:01:06 Tower kernel: [<ffffffff810ce84b>] ? strndup_user+0x3a/0x82

Jan  7 12:01:06 Tower kernel: [<ffffffff81124cf1>] SyS_mount+0x70/0x9c

Jan  7 12:01:06 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Jan  7 12:01:06 Tower kernel: XFS (md2): xfs_do_force_shutdown(0x8) called from line 991 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffff8128a1ff

Jan  7 12:01:06 Tower kernel: XFS (md2): Corruption of in-memory data detected.  Shutting down filesystem

Jan  7 12:01:06 Tower kernel: XFS (md2): Please umount the filesystem and rectify the problem(s)

Jan  7 12:01:06 Tower kernel: XFS (md2): Failed to recover EFIs

Jan  7 12:01:06 Tower kernel: XFS (md2): log mount finish failed

Jan  7 12:01:06 Tower kernel: XFS (md2): xfs_log_force: error -

 

I have tried swapping the power and SATA cables with those of another drive; it made no difference.

 

I upgraded from unRaid 6.1.9 to 6.2.4 the day before the issue happened.

 

I just tested the parity without writing changes to disk, and the parity was still valid.

 

MemTest86+ hung on parallel testing, but worked fine in four complete tests in failsafe mode (CPU-core 0 only), and also completed a full test in round robin between four CPU cores.

 

Do you have any suggestions for how to proceed? I can buy a new disk and rebuild from parity if that is the easiest or safest option. When I have disk filesystem issues, I always worry that fixing them will ruin parity, but since parity is valid, that seems to be a moot point.

 

Is there a bug in XFS that should be reported?

 

Here is a similar old support request that does not seem helpful to me: http://lime-technology.com/forum/index.php?topic=40603.msg383298#msg383298

 

I found someone mentioning that he had to delete the XFS journal and then run the repair in order to proceed; I haven't tried or investigated that yet.
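A minimal sketch of how "delete the journal, then repair" could be approached step by step. The `-n` and `-L` flags of xfs_repair and the `ro,norecovery` mount options are real; the `run` and `xfs_step` helpers and the `DRY_RUN` variable are made up for illustration, and the ordering is an assumption rather than something tested here.

```shell
# Hypothetical helpers: run one recovery step at a time against an XFS
# device such as /dev/md2 (array started in maintenance mode).
# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

xfs_step() {
    step=$1
    dev=$2
    case "$step" in
        # Inspect only: -n reports problems without writing to the disk.
        check)    run xfs_repair -n "$dev" ;;
        # Read-only mount that skips log replay, so files can be copied
        # off before the log is destroyed. $3 is the mount point.
        ro-mount) run mount -o ro,norecovery "$dev" "$3" ;;
        # Last resort: -L zeroes the log, discarding unreplayed metadata
        # changes, and then repairs.
        zero-log) run xfs_repair -L -v "$dev" ;;
        *) echo "usage: xfs_step check|ro-mount|zero-log DEV [MOUNTPOINT]" >&2
           return 2 ;;
    esac
}
```

For example, `DRY_RUN=1; xfs_step ro-mount /dev/md2 /mnt/disk2` only prints the mount command. Keeping the steps separate means the destructive `zero-log` step is never reached by accident.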

 

Best Alex

 

PS:

I have my data on CrashPlan, though it wasn't quite up to date due to other issues; it is good enough if I really need it. I also have six-month-old data on an offsite backup disk, so data-wise I'm not panicking.

all.zip


Thank you for the help. Running

xfs_repair -L -v /dev/md2

fixed the issues, and the disk is up and running again. I'm currently checking the md5 checksums of the files and will report back, as it might be slightly interesting to know whether all files are still valid.
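A minimal sketch of such a checksum check, assuming GNU `md5sum` and a manifest that was recorded while the files were known to be good. The helper names `md5_record` and `md5_verify` are made up for illustration.

```shell
# Hypothetical helpers around GNU md5sum.

# Record checksums for every file under a directory. Keep the manifest
# outside that directory so it does not list itself.
md5_record() {
    dir=$1
    manifest=$2
    ( cd "$dir" && find . -type f -exec md5sum {} + ) > "$manifest"
}

# Re-check the files against the manifest. --quiet suppresses the "OK"
# lines, so only mismatches are printed; the exit status is non-zero if
# any file fails the check.
md5_verify() {
    dir=$1
    manifest=$2
    ( cd "$dir" && md5sum --quiet -c - ) < "$manifest"
}
```

After a repair, `md5_verify /mnt/disk2 /boot/disk2.md5` (paths chosen only as an example) would list any file whose content changed since the manifest was written.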

 

Before I close the issue, I would like your input on whether a bug report should be filed with XFS or elsewhere. Ideally the kernel should never crash, but this is probably a known issue.

 

Best Alex


One thing you mentioned is the memtest hang during parallel testing, along with this log entry:

 

Jan  7 12:01:06 Tower kernel: XFS (md2): Corruption of in-memory data detected.  Shutting down filesystem

 

together makes me wonder about the details of your system: not that you have a bad component, but whether something in the architecture is causing this. My guess is that it is an ASUS P8B with the C206 chipset and an E3 processor. That should have no trouble running memtest in parallel.

 

There are bugs in XFS, and this might be one of them. They are hard to make progress on because, as in your case, the log was discarded and you moved on.


Yeah, it's an Intel® Xeon® CPU E31225 @ 3.10GHz on an Asus P8B WS with an up-to-date BIOS.

 

I have seen reports that the parallel test hangs on earlier versions of MemTest86+. On my system it fails on the same test quite reliably, and so did the unRaid XFS mount command (while replaying the log, I think) before I deleted the journal log. I'm always quite cautious about blaming 'other components', especially presumably well-tested stuff, because it's just too easy to do when I cannot find the bug in my own code. It seems a lot more likely to me that there is a bug in the presumably not very frequently used XFS log replay than in the CPU, but I do recognize that there is a real chance that it is a hardware issue. I noticed that the other threads I found containing 'XFS_WANT_CORRUPTED_GOTO' also contained warnings about memory corruption, and on those systems MemTest86+ also ran fine. It seems likely that it's just an unfortunate error message from XFS that causes us users to think the problem is hardware related.

 

However, I have no clue what to do with this error report, and it is probably useless to the developers if they don't have the drive or detailed dump data.
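For the "detailed dump data" part, xfsprogs ships `xfs_metadump`, which copies filesystem metadata (not file contents) to a file that can be attached to a bug report. The sketch below is an assumption about how it could have been used here: it would need to run on the unmounted device before `xfs_repair -L`, since the repair removes the state of interest. The wrapper name and `DRY_RUN` variable are made up for illustration.

```shell
# Hypothetical wrapper: capture an XFS metadata image for a bug report.
# With DRY_RUN=1 the command is printed instead of executed.
xfs_capture_metadata() {
    dev=$1    # unmounted XFS device, e.g. /dev/md2
    out=$2    # target file on a different disk
    set -- xfs_metadump "$dev" "$out"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}
```

A call such as `xfs_capture_metadata /dev/md2 /mnt/disk1/md2.metadump` writes the image to another disk; the target path is only an example.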

 

Best Alex
