Alex R. Berg Posted January 7, 2017

I have an XFS drive with filesystem issues, and mounting it causes unRaid to crash. The first problem lines in the syslog from when the problem first occurred are below (the full syslog, which has many more lines after this excerpt, is attached as PreBootSyslog.txt).

Jan 6 00:42:36 Tower kernel: XFS (md2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3156 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_free_ag_extent+0x419/0x558
Jan 6 00:42:36 Tower kernel: CPU: 3 PID: 32691 Comm: shfs Not tainted 4.4.30-unRAID #2
Jan 6 00:42:36 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012
Jan 6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303b78 ffffffff8136f79f ffff88004b38e0d0
Jan 6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303b90 ffffffff81275fd0 ffffffff812465ad
Jan 6 00:42:36 Tower kernel: ffff8803af303c00 ffffffff8125a774 000001526b9f3000 000000005036eb68
Jan 6 00:42:36 Tower kernel: Call Trace:
Jan 6 00:42:36 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 6 00:42:36 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35
Jan 6 00:42:36 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558
Jan 6 00:42:36 Tower kernel: [<ffffffff8125a774>] xfs_btree_insert+0xba/0x152
Jan 6 00:42:36 Tower kernel: [<ffffffff812465ad>] xfs_free_ag_extent+0x419/0x558
Jan 6 00:42:36 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558
Jan 6 00:42:36 Tower kernel: [<ffffffff812471e9>] xfs_free_extent+0xbd/0xed
Jan 6 00:42:36 Tower kernel: [<ffffffff812960fd>] xfs_trans_free_extent+0x21/0x58
Jan 6 00:42:36 Tower kernel: [<ffffffff81271712>] xfs_bmap_finish+0xdf/0x102
Jan 6 00:42:36 Tower kernel: [<ffffffff81281ce1>] xfs_itruncate_extents+0xe3/0x152
Jan 6 00:42:36 Tower kernel: [<ffffffff81281dde>] xfs_inactive_truncate+0x8e/0xce
Jan 6 00:42:36 Tower kernel: [<ffffffff812827ef>] xfs_inactive+0xa2/0xc1
Jan 6 00:42:36 Tower kernel: [<ffffffff81286a64>] xfs_fs_evict_inode+0x90/0x93
Jan 6 00:42:36 Tower kernel: [<ffffffff8111e687>] evict+0xaf/0x164
Jan 6 00:42:36 Tower kernel: [<ffffffff8111f1d3>] iput+0x160/0x16d
Jan 6 00:42:36 Tower kernel: [<ffffffff8111655c>] do_unlinkat+0x125/0x201
Jan 6 00:42:36 Tower kernel: [<ffffffff81116ba7>] SyS_unlink+0x11/0x13
Jan 6 00:42:36 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jan 6 00:42:36 Tower kernel: XFS (md2): Internal error xfs_trans_cancel at line 990 of file fs/xfs/xfs_trans.c. Caller xfs_inactive_truncate+0xb9/0xce
Jan 6 00:42:36 Tower kernel: CPU: 3 PID: 32691 Comm: shfs Not tainted 4.4.30-unRAID #2
Jan 6 00:42:36 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012
Jan 6 00:42:36 Tower kernel: 0000000000000000 ffff8803af303dc8 ffffffff8136f79f ffff8802b5bfae80
Jan 6 00:42:36 Tower kernel: ffffffff81664fc0 ffff8803af303de0 ffffffff81275fd0 ffffffff81281e09
Jan 6 00:42:36 Tower kernel: ffff8803af303e08 ffffffff8128a1e6 00000000ffffff8b ffff880119467800
Jan 6 00:42:36 Tower kernel: Call Trace:
Jan 6 00:42:36 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 6 00:42:36 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35
Jan 6 00:42:36 Tower kernel: [<ffffffff81281e09>] ? xfs_inactive_truncate+0xb9/0xce
Jan 6 00:42:36 Tower kernel: [<ffffffff8128a1e6>] xfs_trans_cancel+0x49/0xbf
Jan 6 00:42:36 Tower kernel: [<ffffffff81281e09>] xfs_inactive_truncate+0xb9/0xce
Jan 6 00:42:36 Tower kernel: [<ffffffff812827ef>] xfs_inactive+0xa2/0xc1
Jan 6 00:42:36 Tower kernel: [<ffffffff81286a64>] xfs_fs_evict_inode+0x90/0x93
Jan 6 00:42:36 Tower kernel: [<ffffffff8111e687>] evict+0xaf/0x164
Jan 6 00:42:36 Tower kernel: [<ffffffff8111f1d3>] iput+0x160/0x16d
Jan 6 00:42:36 Tower kernel: [<ffffffff8111655c>] do_unlinkat+0x125/0x201
Jan 6 00:42:36 Tower kernel: [<ffffffff81116ba7>] SyS_unlink+0x11/0x13
Jan 6 00:42:36 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jan 6 00:42:36 Tower kernel: XFS (md2): xfs_do_force_shutdown(0x8) called from line 991 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff8128a1ff
Jan 6 00:42:41 Tower kernel: XFS (md2): Corruption of in-memory data detected. Shutting down filesystem
Jan 6 00:42:41 Tower kernel: XFS (md2): Please umount the filesystem and rectify the problem(s)

I have checked the memory with MemTest86+ overnight in failsafe single-CPU mode without issues. After rebooting, I mounted in maintenance mode and tried a repair:

root@Tower:~# xfs_repair -v /dev/md2
Phase 1 - find and verify superblock...
- block cache size set to 1470848 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 639392 tail block 625937
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Then I mounted the disk as it suggests:

root@Tower:~# mkdir /mnt/disk2 ; mount /dev/md2 /mnt/disk2

The system crashes shortly after.
When I tail the syslog over telnet, this is what I get before telnet loses the connection (the full tail is attached as syslogTailAfterMountingBadDisk.txt; see the attached image for the screen output):

Jan 7 11:53:47 Tower login[2317]: ROOT LOGIN on '/dev/pts/1' from '10.0.0.5'
Jan 7 12:00:40 Tower kernel: XFS (md2): Mounting V5 Filesystem
Jan 7 12:00:40 Tower kernel: XFS (md2): Starting recovery (logdev: internal)
Jan 7 12:01:06 Tower kernel: XFS (md2): Internal error XFS_WANT_CORRUPTED_GOTO at line 3156 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_free_ag_extent+0x419/0x558
Jan 7 12:01:06 Tower kernel: CPU: 2 PID: 5742 Comm: mount Not tainted 4.4.30-unRAID #2
Jan 7 12:01:06 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012
Jan 7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfa38 ffffffff8136f79f ffff88040bfa31a0
Jan 7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfa50 ffffffff81275fd0 ffffffff812465ad
Jan 7 12:01:06 Tower kernel: ffff8803f55cfac0 ffffffff8125a774 ffffffff812598f9 00000000cedf01d0
Jan 7 12:01:06 Tower kernel: Call Trace:
Jan 7 12:01:06 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 7 12:01:06 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35
Jan 7 12:01:06 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558
Jan 7 12:01:06 Tower kernel: [<ffffffff8125a774>] xfs_btree_insert+0xba/0x152
Jan 7 12:01:06 Tower kernel: [<ffffffff812598f9>] ? xfs_btree_lookup+0x307/0x4a1
Jan 7 12:01:06 Tower kernel: [<ffffffff812465ad>] xfs_free_ag_extent+0x419/0x558
Jan 7 12:01:06 Tower kernel: [<ffffffff812465ad>] ? xfs_free_ag_extent+0x419/0x558
Jan 7 12:01:06 Tower kernel: [<ffffffff812471e9>] xfs_free_extent+0xbd/0xed
Jan 7 12:01:06 Tower kernel: [<ffffffff812960fd>] xfs_trans_free_extent+0x21/0x58
Jan 7 12:01:06 Tower kernel: [<ffffffff81291ad4>] xlog_recover_process_efi+0x125/0x155
Jan 7 12:01:06 Tower kernel: [<ffffffff81291b75>] xlog_recover_process_efis+0x71/0xb5
Jan 7 12:01:06 Tower kernel: [<ffffffff81076179>] ? wake_up_bit+0x1d/0x1f
Jan 7 12:01:06 Tower kernel: [<ffffffff8127acb2>] ? xfs_iget+0x50f/0x54e
Jan 7 12:01:06 Tower kernel: [<ffffffff81294f17>] xlog_recover_finish+0x18/0x8b
Jan 7 12:01:06 Tower kernel: [<ffffffff81294f17>] ? xlog_recover_finish+0x18/0x8b
Jan 7 12:01:06 Tower kernel: [<ffffffff8128c20a>] xfs_log_mount_finish+0x20/0x36
Jan 7 12:01:06 Tower kernel: [<ffffffff8128547f>] xfs_mountfs+0x601/0x6a8
Jan 7 12:01:06 Tower kernel: [<ffffffff81287d7f>] xfs_fs_fill_super+0x3fd/0x489
Jan 7 12:01:06 Tower kernel: [<ffffffff8110c871>] mount_bdev+0x141/0x195
Jan 7 12:01:06 Tower kernel: [<ffffffff81287982>] ? xfs_parseargs+0x8c1/0x8c1
Jan 7 12:01:06 Tower kernel: [<ffffffff8128633d>] xfs_fs_mount+0x10/0x12
Jan 7 12:01:06 Tower kernel: [<ffffffff8110d4e2>] mount_fs+0xf/0x84
Jan 7 12:01:06 Tower kernel: [<ffffffff81122036>] vfs_kern_mount+0x65/0xf7
Jan 7 12:01:06 Tower kernel: [<ffffffff811249ac>] do_mount+0x91c/0xa72
Jan 7 12:01:06 Tower kernel: [<ffffffff810ce84b>] ? strndup_user+0x3a/0x82
Jan 7 12:01:06 Tower kernel: [<ffffffff81124cf1>] SyS_mount+0x70/0x9c
Jan 7 12:01:06 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jan 7 12:01:06 Tower kernel: XFS (md2): Internal error xfs_trans_cancel at line 990 of file fs/xfs/xfs_trans.c. Caller xlog_recover_process_efi+0x148/0x155
Jan 7 12:01:06 Tower kernel: CPU: 2 PID: 5742 Comm: mount Not tainted 4.4.30-unRAID #2
Jan 7 12:01:06 Tower kernel: Hardware name: System manufacturer System Product Name/P8B WS, BIOS 2106 07/16/2012
Jan 7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfbd8 ffffffff8136f79f ffff8800cedf0000
Jan 7 12:01:06 Tower kernel: 0000000000000000 ffff8803f55cfbf0 ffffffff81275fd0 ffffffff81291af7
Jan 7 12:01:06 Tower kernel: ffff8803f55cfc18 ffffffff8128a1e6 ffff8800ce278000 ffff8803e0bde000
Jan 7 12:01:06 Tower kernel: Call Trace:
Jan 7 12:01:06 Tower kernel: [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 7 12:01:06 Tower kernel: [<ffffffff81275fd0>] xfs_error_report+0x32/0x35
Jan 7 12:01:06 Tower kernel: [<ffffffff81291af7>] ? xlog_recover_process_efi+0x148/0x155
Jan 7 12:01:06 Tower kernel: [<ffffffff8128a1e6>] xfs_trans_cancel+0x49/0xbf
Jan 7 12:01:06 Tower kernel: [<ffffffff81291af7>] xlog_recover_process_efi+0x148/0x155
Jan 7 12:01:06 Tower kernel: [<ffffffff81291b75>] xlog_recover_process_efis+0x71/0xb5
Jan 7 12:01:06 Tower kernel: [<ffffffff81076179>] ? wake_up_bit+0x1d/0x1f
Jan 7 12:01:06 Tower kernel: [<ffffffff8127acb2>] ? xfs_iget+0x50f/0x54e
Jan 7 12:01:06 Tower kernel: [<ffffffff81294f17>] xlog_recover_finish+0x18/0x8b
Jan 7 12:01:06 Tower kernel: [<ffffffff81294f17>] ? xlog_recover_finish+0x18/0x8b
Jan 7 12:01:06 Tower kernel: [<ffffffff8128c20a>] xfs_log_mount_finish+0x20/0x36
Jan 7 12:01:06 Tower kernel: [<ffffffff8128547f>] xfs_mountfs+0x601/0x6a8
Jan 7 12:01:06 Tower kernel: [<ffffffff81287d7f>] xfs_fs_fill_super+0x3fd/0x489
Jan 7 12:01:06 Tower kernel: [<ffffffff8110c871>] mount_bdev+0x141/0x195
Jan 7 12:01:06 Tower kernel: [<ffffffff81287982>] ? xfs_parseargs+0x8c1/0x8c1
Jan 7 12:01:06 Tower kernel: [<ffffffff8128633d>] xfs_fs_mount+0x10/0x12
Jan 7 12:01:06 Tower kernel: [<ffffffff8110d4e2>] mount_fs+0xf/0x84
Jan 7 12:01:06 Tower kernel: [<ffffffff81122036>] vfs_kern_mount+0x65/0xf7
Jan 7 12:01:06 Tower kernel: [<ffffffff811249ac>] do_mount+0x91c/0xa72
Jan 7 12:01:06 Tower kernel: [<ffffffff810ce84b>] ? strndup_user+0x3a/0x82
Jan 7 12:01:06 Tower kernel: [<ffffffff81124cf1>] SyS_mount+0x70/0x9c
Jan 7 12:01:06 Tower kernel: [<ffffffff81629c2e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jan 7 12:01:06 Tower kernel: XFS (md2): xfs_do_force_shutdown(0x8) called from line 991 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff8128a1ff
Jan 7 12:01:06 Tower kernel: XFS (md2): Corruption of in-memory data detected. Shutting down filesystem
Jan 7 12:01:06 Tower kernel: XFS (md2): Please umount the filesystem and rectify the problem(s)
Jan 7 12:01:06 Tower kernel: XFS (md2): Failed to recover EFIs
Jan 7 12:01:06 Tower kernel: XFS (md2): log mount finish failed
Jan 7 12:01:06 Tower kernel: XFS (md2): xfs_log_force: error -

I have tried swapping the power and SATA cables with those of another drive; it made no difference. I upgraded from unRaid 6.1.9 to 6.2.4 the day before the issue happened. I just checked the parity without writing changes to disk, and the parity was still valid. MemTest86+ hung on parallel testing, but it worked fine in four complete tests in failsafe mode (CPU core 0 only) and also completed a full test in round-robin mode across the four CPU cores.

Do you have any suggestions for how to proceed? I can buy a new disk and rebuild from parity if that is easiest or safest. When I have disk filesystem issues, I always worry that fixing them will ruin parity, but since parity is valid, that seems a moot issue. Is there a bug in XFS that should be reported?

Here is a similar old support request that does not seem helpful to me: http://lime-technology.com/forum/index.php?topic=40603.msg383298#msg383298. I found someone mentioning that he had to delete the XFS journal and then repair in order to proceed; I haven't tried or investigated that yet.

Best
Alex

PS: I have my data on CrashPlan, though it wasn't quite up to date due to other issues. It's good enough if I really need it, and I have six-month-old data on an offsite backup disk, so data-wise I'm not panicking.

Attachment: all.zip
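The syslog excerpts in this post are mostly call-trace frames; the lines that say what XFS actually did all carry the `XFS (md2):` prefix. A minimal sketch for pulling only those lines out of a saved syslog, assuming a POSIX shell with `grep` and `sed`. The function name `xfs_summary` is made up for this illustration, and the file name in the example is the attachment mentioned in the post.

```shell
#!/bin/sh
# xfs_summary FILE DEVICE
# Hypothetical helper (not part of unRaid or xfsprogs): print only the
# "XFS (DEVICE):" messages from a syslog file, with the leading
# timestamp/host/"kernel:" prefix removed. Call-trace frames and register
# dumps do not carry that prefix, so they are skipped.
xfs_summary() {
    grep -F "XFS ($2):" "$1" | sed 's/^.*kernel: //'
}

# Example (file name taken from the post's attachments):
#   xfs_summary PreBootSyslog.txt md2
```

Run against the second excerpt, this would leave the short sequence from "Mounting V5 Filesystem" through "log mount finish failed", which makes it easier to see that the failure happens during log recovery.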
trurl Posted January 7, 2017

Since you can't mount safely, you will have to use the -L option. Data loss, if any, would just be the most recent writes that hadn't been committed.
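The sequence this advice leads to can be sketched as a small script. This is a cautious illustration, not an official unRaid procedure: the `run` wrapper and the `APPLY` variable are inventions for this example so that nothing is executed by default, and the `-n` (no-modify) pass is an extra check that nobody in the thread mentions. The device path is the one from the original post, and the array is assumed to be started in maintenance mode as described there.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed unless APPLY=1 is set.
# "run" and APPLY are hypothetical names used for this sketch only.
APPLY="${APPLY:-0}"

run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# repair_zero_log DEVICE
repair_zero_log() {
    # 1. No-modify pass: report problems without writing to the disk.
    run xfs_repair -n "$1"
    # 2. Zero the log, then repair. Changes that exist only in the log
    #    are discarded by -L.
    run xfs_repair -L -v "$1"
}

# Example: repair_zero_log /dev/md2
```

Printing the commands first gives a chance to confirm the device name before anything destructive runs; setting `APPLY=1` executes them for real.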
Alex R. Berg Posted January 8, 2017

Thank you for the help. Running xfs_repair -L -v /dev/md2 fixed the issues, and the disk is up and running again. I'm currently checking the MD5s of the files and will report back, as it might be slightly interesting to know whether all files are still valid.

Before I close the issue, I'd like your input on whether a bug report should be filed with XFS or elsewhere. Ideally the kernel should never crash, but it's probably a known issue.

Best
Alex
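A check like the MD5 comparison mentioned here can be done with the `md5sum` tool from GNU coreutils. A minimal sketch, assuming a checksum list was saved before the incident; the function names and paths are placeholders, and this is not necessarily how the check in this thread was run.

```shell
#!/bin/sh
# make_manifest DIR MANIFEST
# Record an MD5 line for every regular file under DIR (paths relative to DIR).
# MANIFEST should be an absolute path outside DIR.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + > "$2" )
}

# verify_manifest DIR MANIFEST
# Re-hash the files listed in MANIFEST. md5sum -c exits non-zero if any
# listed file is missing or its checksum differs.
verify_manifest() {
    ( cd "$1" && md5sum -c "$2" )
}

# Example (placeholder paths):
#   make_manifest /mnt/disk2 /boot/disk2.md5     # before trouble
#   verify_manifest /mnt/disk2 /boot/disk2.md5   # after the repair
```

Note that this only checks files that are listed in the manifest; files created after the manifest was written are not covered.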
trurl Posted January 8, 2017

I don't know. Most people here who have needed to repair their filesystem don't get crashes, but I think there may have been a few.
c3 Posted January 8, 2017

One thing you mentioned is the MemTest hang during parallel testing, along with this log entry:

Jan 7 12:01:06 Tower kernel: XFS (md2): Corruption of in-memory data detected. Shutting down filesystem

Together these make me wonder about the details of your system; not that you have a bad component, but whether something in the architecture is causing this. I would guess it is an ASUS P8B using the C206 chipset and an E3 processor. That should have no trouble running MemTest in parallel.

There are bugs in XFS, and this might be one of them. They are hard to make progress on because, as in your case, the log was dumped and you moved forward.
Alex R. Berg Posted January 8, 2017

Yeah, it's an Intel® Xeon® CPU E31225 @ 3.10GHz on an Asus P8B WS with an up-to-date BIOS.

I have seen reports that the parallel test hangs on earlier versions of MemTest86+. On my system it fails on the same test quite reliably, and so did the unRaid XFS mount command (when replaying the log, I think) before I deleted the journal log.

I'm always quite cautious about blaming "other components", especially presumably well-tested ones, because it's just too easy to do when I cannot find the bug in my own code. It seems a lot more likely to me that there is a bug in the XFS log replay, which is presumably not used that frequently, than in the CPU, but I do recognize that there is a real probability that it is a hardware issue.

I noticed that the other threads I found containing 'XFS_WANT_CORRUPTED_GOTO' also contained warnings of memory corruption, and on those systems MemTest86+ also ran well. It seems likely that it's just an unfortunate error message from XFS that causes us users to think the problem is hardware-related.

However, I have no clue what to do with this error report, and it's probably useless to the developers if they don't have the drive or detailed dump data.

Best
Alex
Alex R. Berg Posted January 9, 2017

In case it is of interest: my MD5 scan of all files on the disk went fine, with no files missing and no files with wrong MD5s.
Archived
This topic is now archived and is closed to further replies.