sreknob Posted April 27, 2020 (edited)

Hi,

Looking for some guidance on the following error. I noted this morning that some of my docker containers were misbehaving, and then found that I was unable to stop any containers. Looking in my syslog, I found the following error overnight:

Apr 27 04:37:28 unRAID kernel: ------------[ cut here ]------------
Apr 27 04:37:28 unRAID kernel: kernel BUG at fs/inode.c:518!
Apr 27 04:37:28 unRAID kernel: invalid opcode: 0000 [#1] SMP PTI
Apr 27 04:37:28 unRAID kernel: CPU: 5 PID: 686 Comm: kswapd0 Tainted: P O 4.19.107-Unraid #1
Apr 27 04:37:28 unRAID kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Extreme3 Gen3, BIOS P2.30 06/29/2012
Apr 27 04:37:28 unRAID kernel: RIP: 0010:clear_inode+0x78/0x88
Apr 27 04:37:28 unRAID kernel: Code: 74 02 0f 0b 48 8b 83 90 00 00 00 a8 20 75 02 0f 0b a8 40 74 02 0f 0b 48 8b 83 20 01 00 00 48 8d 93 20 01 00 00 48 39 c2 74 02 <0f> 0b 48 c7 83 90 00 00 00 60 00 00 00 5b 5d c3 53 83 7f 40 00 48
Apr 27 04:37:28 unRAID kernel: RSP: 0018:ffffc90001af7be8 EFLAGS: 00010287
Apr 27 04:37:28 unRAID kernel: RAX: ffff8881433d2220 RBX: ffff888143352100 RCX: 0000000000000000
Apr 27 04:37:28 unRAID kernel: RDX: ffff888143352220 RSI: 0000000000000000 RDI: ffff888143352270
Apr 27 04:37:28 unRAID kernel: RBP: ffff888143352270 R08: 0000000000000001 R09: 0000000000000000
Apr 27 04:37:28 unRAID kernel: R10: 0000000000000001 R11: ffff88841f55fb40 R12: ffff888143352210
Apr 27 04:37:28 unRAID kernel: R13: ffff88812b2a76c0 R14: ffffc90001af7c98 R15: ffff888151041f00
Apr 27 04:37:28 unRAID kernel: FS: 0000000000000000(0000) GS:ffff88841f540000(0000) knlGS:0000000000000000
Apr 27 04:37:28 unRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 04:37:28 unRAID kernel: CR2: 000014c9af20d020 CR3: 0000000004e0a005 CR4: 00000000000606e0
Apr 27 04:37:28 unRAID kernel: Call Trace:
Apr 27 04:37:28 unRAID kernel: fuse_evict_inode+0x18/0x50
Apr 27 04:37:28 unRAID kernel: evict+0xb8/0x16e
Apr 27 04:37:28 unRAID kernel: __dentry_kill+0xcb/0x135
Apr 27 04:37:28 unRAID kernel: shrink_dentry_list+0x149/0x185
Apr 27 04:37:28 unRAID kernel: prune_dcache_sb+0x56/0x74
Apr 27 04:37:28 unRAID kernel: super_cache_scan+0xee/0x16d
Apr 27 04:37:28 unRAID kernel: do_shrink_slab+0x128/0x194
Apr 27 04:37:28 unRAID kernel: shrink_slab+0x11b/0x276
Apr 27 04:37:28 unRAID kernel: shrink_node+0x108/0x3cb
Apr 27 04:37:28 unRAID kernel: kswapd+0x451/0x58a
Apr 27 04:37:28 unRAID kernel: ? __switch_to_asm+0x41/0x70
Apr 27 04:37:28 unRAID kernel: ? __switch_to_asm+0x41/0x70
Apr 27 04:37:28 unRAID kernel: ? mem_cgroup_shrink_node+0xa4/0xa4
Apr 27 04:37:28 unRAID kernel: kthread+0x10c/0x114
Apr 27 04:37:28 unRAID kernel: ? kthread_park+0x89/0x89
Apr 27 04:37:28 unRAID kernel: ret_from_fork+0x35/0x40
Apr 27 04:37:28 unRAID kernel: Modules linked in: macvlan nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod nct6775 hwmon_vid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd kvm_intel drm_kms_helper kvm drm intel_cstate coretemp mxm_wmi cp210x wmi usbserial syscopyarea sysfillrect sysimgblt fb_sys_fops agpgart i2c_i801 crct10dif_pclmul intel_powerclamp i2c_core crc32c_intel r8169 ahci mpt3sas video x86_pkg_temp_thermal backlight realtek libahci pata_jmicron button raid_class scsi_transport_sas pcc_cpufreq
Apr 27 04:37:28 unRAID kernel: ---[ end trace 6347766c3e151675 ]---
Apr 27 04:37:28 unRAID kernel: RIP: 0010:clear_inode+0x78/0x88
Apr 27 04:37:28 unRAID kernel: Code: 74 02 0f 0b 48 8b 83 90 00 00 00 a8 20 75 02 0f 0b a8 40 74 02 0f 0b 48 8b 83 20 01 00 00 48 8d 93 20 01 00 00 48 39 c2 74 02 <0f> 0b 48 c7 83 90 00 00 00 60 00 00 00 5b 5d c3 53 83 7f 40 00 48
Apr 27 04:37:28 unRAID kernel: RSP: 0018:ffffc90001af7be8 EFLAGS: 00010287
Apr 27 04:37:28 unRAID kernel: RAX: ffff8881433d2220 RBX: ffff888143352100 RCX: 0000000000000000
Apr 27 04:37:28 unRAID kernel: RDX: ffff888143352220 RSI: 0000000000000000 RDI: ffff888143352270
Apr 27 04:37:28 unRAID kernel: RBP: ffff888143352270 R08: 0000000000000001 R09: 0000000000000000
Apr 27 04:37:28 unRAID kernel: R10: 0000000000000001 R11: ffff88841f55fb40 R12: ffff888143352210
Apr 27 04:37:28 unRAID kernel: R13: ffff88812b2a76c0 R14: ffffc90001af7c98 R15: ffff888151041f00
Apr 27 04:37:28 unRAID kernel: FS: 0000000000000000(0000) GS:ffff88841f540000(0000) knlGS:0000000000000000
Apr 27 04:37:28 unRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 04:37:28 unRAID kernel: CR2: 000014c9af20d020 CR3: 0000000004e0a002 CR4: 00000000000606e0

Given that I'm unable to stop or kill my docker containers, I haven't tried a reboot, as I presume my system is going to hang on shutdown due to this. My plan is to turn off auto-array start, shut down (gracefully or not -- perhaps trying umount -f /var/lib/docker prior to this), nuke the docker image, and check my file systems.

Any advice appreciated. I've attached my diagnostics. Thanks!

unraid-diagnostics-20200427-1448.zip

Edited June 4, 2020 by sreknob: solved
sreknob Posted April 27, 2020 Author

So, it ended up being a dirty shutdown, despite lazy-unmounting the docker image and killing off a boatload of processes. Also, note to self: don't trust the GUI for config edits when something is hanging. I disabled autostart from the GUI but it didn't stick, so a parity check is happening now. Still nuked the docker image, but won't be able to check the filesystems until the parity check is done... ¯\_(ツ)_/¯
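For anyone who ends up in the same spot: one way to avoid trusting the GUI is to read the flag back off the flash drive after changing it. This is only a sketch -- the path and key name (/boot/config/disk.cfg, startArray) are my assumption about Unraid's config layout, and a temp file stands in for the real one here.

```shell
# Sketch: verify an autostart change actually landed instead of trusting
# the GUI while the box is wedged. The real file would be
# /boot/config/disk.cfg (an assumption); a temp file stands in for it.
cfg=$(mktemp)
echo 'startArray="yes"' > "$cfg"

# Flip the flag the way the GUI should have...
sed -i 's/^startArray=.*/startArray="no"/' "$cfg"

# ...then read it back to confirm the edit stuck before rebooting.
grep '^startArray' "$cfg"
```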
JorgeB Posted April 28, 2020

Your server is constantly running out of memory; you should try to fix that first.
sreknob Posted April 28, 2020 Author

Thanks for having a look. It's actually not the server itself that's running out of memory. The mongodb in my Unifi controller docker container gets out of control, so I have constrained the memory for that container. When the Unifi container hits its limit, it throws the error and the container restarts. AFAIK it's never been the actual Unraid OS.

Edited April 28, 2020 by sreknob: typo
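For anyone wanting to set up the same kind of per-container cap, the standard mechanism is Docker's memory flags, which in an Unraid template go in the container's "Extra Parameters" field. Treat this as an illustrative fragment only -- the limit values and image name are placeholders, not what was actually used here:

```shell
# Illustrative only: cap the container so a runaway mongodb takes down
# just this container rather than the host. In Unraid, the --memory and
# --memory-swap flags go in the template's "Extra Parameters" field;
# the image name below is an assumption.
docker run -d --name unifi-controller \
  --memory=2g \
  --memory-swap=2g \
  --restart=unless-stopped \
  jacobalberty/unifi:latest
```

With --memory-swap equal to --memory, the container gets no extra swap headroom, so the leak hits the limit and triggers the restart rather than dragging the host into swap.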
JorgeB Posted April 28, 2020

Then you can ignore that. There's also filesystem corruption on disk4 that I missed when I first looked, since the log is heavily spammed with the OOM errors:

Mar 17 12:18:05 unRAID kernel: XFS (md4): Metadata corruption detected at xfs_dinode_verify+0xa5/0x52e [xfs], inode 0x90d66a8e dinode
Mar 17 12:18:05 unRAID kernel: XFS (md4): Unmount and run xfs_repair
sreknob Posted April 28, 2020 Author

Thanks again. I figured it was a FS error and wanted to make sure it was enough to cause the fs/inode.c error and the lockup of my dockers. Kind of strange to me, though, that a filesystem error caused this big a problem -- especially as my docker appdata and docker image all run off the cache, though many containers use the FUSE file system for appdata and storage. Seeing that it was over a month between the FS error and the lockup, do you think this explains it?

Also, I hadn't realized my container was running out of memory so often. Suppose I'll give Unifi a little more room. Never found an adequate reason my controller gets out of control...

Edited April 28, 2020 by sreknob: clarification
sreknob Posted April 29, 2020 Author

While in maintenance mode, I ran a file system check on all my array drives (which are XFS) and did not find any major errors. Will check lost+found to see if anything is in there.

Interestingly, my cache pool (btrfs) did show some errors:

Opening filesystem to check...
Checking filesystem on /dev/sdb1
UUID: 6512c394-cc26-4544-97d3-4a484923301f
[1/7] checking root items
[2/7] checking extents
data backref 333842939904 root 5 owner 18675039 offset 18446673705361137664 num_refs 0 not found in extent tree
incorrect local backref count on 333842939904 root 5 owner 18675039 offset 18446673705361137664 found 1 wanted 0 back 0x189df1d0
incorrect local backref count on 333842939904 root 5 owner 18675039 offset 395763712 found 7 wanted 8 back 0x187170b0
backpointer mismatch on [333842939904 102400]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 18675039 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
found 213364232192 bytes used, error(s) found
total csum bytes: 92116280
total tree bytes: 1020280832
total fs tree bytes: 773505024
total extent tree bytes: 114278400
btree space waste bytes: 224328220
file data blocks allocated: 2574916517888
 referenced 20818464358

There doesn't seem to be a lot of info about these errors, and I'm uncertain whether it's safe to run a scrub or repair on it...

Thanks in advance for any guidance!
JorgeB Posted April 29, 2020

Scrub is safe but won't fix those. --repair is not safe; you can try it, but make a backup first.
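In case it helps anyone following along, the order of operations implied here might look like the commands sketched below (device names and paths are my placeholders, not from this system). The guard function is just a way of encoding "backup first, then --repair" so the risky step can't run by accident; the actual btrfs call stays commented out.

```shell
# Illustrative order of operations (device and paths are assumptions):
#   btrfs scrub start -B /mnt/cache        # safe, but won't fix these errors
#   rsync -a /mnt/cache/ /backup/cache/    # back up BEFORE --repair
#   btrfs check --repair /dev/sdb1         # risky; run on an unmounted pool
#
# A tiny guard makes the "backup first" rule hard to skip by accident:
repair_cache() {
  if [ ! -e "$1" ]; then
    echo "refusing: back up the pool first"
    return 1
  fi
  echo "ok to run btrfs check --repair"
  # btrfs check --repair /dev/sdb1   # the actual risky step, left disabled
}

touch /tmp/cache-backup.done   # would be created by the backup step
repair_cache /tmp/cache-backup.done
```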
JorgeB Posted April 29, 2020

Also, xfs_repair isn't always clear about whether it found problems; the only way to know for sure is to check its exit status. And remember to run the actual repair without -n, or nothing will be fixed even if issues are found.
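To make the exit-status check concrete, here is a small helper. The code meanings are from my reading of xfs_repair(8) -- with -n, 0 means no corruption detected and 1 means corruption was found; I believe 2 indicates a dirty log that needs replaying first -- so treat them as assumptions to verify against your man page.

```shell
# Sketch: interpret xfs_repair -n's exit status instead of eyeballing
# its output. Code meanings below are from my reading of xfs_repair(8).
explain_status() {
  case "$1" in
    0) echo "clean: no corruption detected" ;;
    1) echo "corruption detected: rerun without -n to repair" ;;
    2) echo "dirty log: mount/unmount the fs first, then retry" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Usage on a real array disk (maintenance mode), e.g.:
#   xfs_repair -n /dev/md4; explain_status $?
explain_status 1
```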
sreknob Posted May 23, 2020 Author

Thanks for your help so far, johnnie! As an update, I wiped my cache drives, reformatted to a new cache pool, and recreated my docker image. I also ran xfs_repair on all my array drives; no lost+found folders were created and no other obvious errors were flagged.

Things seemed OK for a bit, but then I had another crash -- a different error this time. It seems to only happen overnight; not sure if the mover is to blame. I set up remote logging over the network after my last issues, so I have a system log of the crash, but I was unable to retrieve diagnostics at the time because the server was completely unresponsive, with not even a console active. I've included an updated diagnostics file from just now, since I rebooted this morning, to show the current configs and status.

Here are some log entries from just before it became unresponsive and stopped logging:

May 22 03:22:25 unRAID crond[1904]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
>>>
May 22 04:40:14 unRAID kernel: BUG: unable to handle kernel paging request at ffffffff820e0740
May 22 04:40:14 unRAID kernel: PGD 4e0e067 P4D 4e0e067 PUD 4e0f063 PMD 41d500063 PTE 800ffffffaf1f062
May 22 04:40:14 unRAID kernel: Oops: 0002 [#1] SMP PTI
>>>
May 22 04:40:15 unRAID kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
May 22 04:40:15 unRAID kernel: PGD 0 P4D 0
May 22 04:40:15 unRAID kernel: Oops: 0000 [#2] SMP PTI
>>>

I've truncated for brevity of this post; the more complete syslog and full traces are attached. Any thoughts or advice highly appreciated.

unraid-2020-05-22-crash.log
unraid-diagnostics-20200522-1935.zip
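Side note for anyone replicating the remote-logging setup (Unraid's GUI can generate the equivalent): the classic rsyslog forwarding rule is a one-liner. The target address and port below are placeholders, not the ones used on this server:

```
# /etc/rsyslog.conf fragment (placeholder target): forward all messages
# over UDP to a remote syslog server so kernel traces survive a hard crash.
*.*  @192.168.1.50:514      # single @ = UDP; use @@ for TCP
```

The point is that the oops lines above only exist because they were shipped off-box before the machine locked up; local /var/log would have been lost to the crash and reboot.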
JorgeB Posted May 23, 2020

See if memtest finds any RAM issues.
sreknob Posted May 23, 2020 Author

Thanks, that was going to be my next step. I actually have a new set of RAM in the mail as of last week, as I was planning on adding more, so I'll save my memtest downtime for the new sticks and test the current modules in another machine afterwards. Will update here with anything new. Thanks again!
sreknob Posted June 4, 2020 Author

Just to follow up... RAM was all good. It seems that when I cleared my docker image and reset my docker networking, I uncovered a different problem: the more recent crashes appear to be due to some sort of conflict between my custom docker networking and maybe the nvidia driver. I've turned off Folding@home, which was using the GPU, and there have been no issues since then. What clued me in was the modules from the call trace:

Modules linked in: nvidia_uvm(O) tun macvlan xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod nct6775 hwmon_vid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm_kms_helper kvm_intel kvm drm intel_cstate coretemp mpt3sas r8169 syscopyarea sysfillrect crct10dif_pclmul sysimgblt mxm_wmi fb_sys_fops intel_powerclamp crc32c_intel agpgart i2c_i801 i2c_core x86_pkg_temp_thermal wmi ahci realtek video libahci raid_class pata_jmicron cp210x backlight usbserial button pcc_cpufreq scsi_transport_sas

Looking at this thread --> [6.5.0]+ Call Traces when assigning IP to Dockers, and given that the nvidia module was linked in the trace, I just turned off F@H and the problem went away. When I move this back over to Plex, I may have to create a different docker network as in the linked thread if it returns.
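If it does return, the usual shape of that fix is a dedicated macvlan network so containers with their own IPs get their own parent-interface attachment. The subnet, gateway, IP range, and parent interface below are placeholders for illustration only -- they would need to match the actual LAN and Unraid bridge:

```shell
# Illustrative only -- subnet/gateway/range/parent are placeholders.
docker network create -d macvlan \
  --subnet=192.168.1.0/24 \
  --gateway=192.168.1.1 \
  --ip-range=192.168.1.192/27 \
  -o parent=br0 \
  dockervlan

# Then attach the container to it with a fixed address, e.g.:
#   docker run --network dockervlan --ip 192.168.1.200 ... plex
```

Keeping the --ip-range outside the router's DHCP pool avoids address collisions with other devices on the LAN.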