sreknob Posted April 27, 2020 (edited)

Hi,

Looking for some guidance on the following error. I noted this morning that some of my docker containers were misbehaving, and then found that I was unable to stop any containers. Looking in my syslog, I found the following error overnight:

Apr 27 04:37:28 unRAID kernel: ------------[ cut here ]------------
Apr 27 04:37:28 unRAID kernel: kernel BUG at fs/inode.c:518!
Apr 27 04:37:28 unRAID kernel: invalid opcode: 0000 [#1] SMP PTI
Apr 27 04:37:28 unRAID kernel: CPU: 5 PID: 686 Comm: kswapd0 Tainted: P O 4.19.107-Unraid #1
Apr 27 04:37:28 unRAID kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Extreme3 Gen3, BIOS P2.30 06/29/2012
Apr 27 04:37:28 unRAID kernel: RIP: 0010:clear_inode+0x78/0x88
Apr 27 04:37:28 unRAID kernel: Code: 74 02 0f 0b 48 8b 83 90 00 00 00 a8 20 75 02 0f 0b a8 40 74 02 0f 0b 48 8b 83 20 01 00 00 48 8d 93 20 01 00 00 48 39 c2 74 02 <0f> 0b 48 c7 83 90 00 00 00 60 00 00 00 5b 5d c3 53 83 7f 40 00 48
Apr 27 04:37:28 unRAID kernel: RSP: 0018:ffffc90001af7be8 EFLAGS: 00010287
Apr 27 04:37:28 unRAID kernel: RAX: ffff8881433d2220 RBX: ffff888143352100 RCX: 0000000000000000
Apr 27 04:37:28 unRAID kernel: RDX: ffff888143352220 RSI: 0000000000000000 RDI: ffff888143352270
Apr 27 04:37:28 unRAID kernel: RBP: ffff888143352270 R08: 0000000000000001 R09: 0000000000000000
Apr 27 04:37:28 unRAID kernel: R10: 0000000000000001 R11: ffff88841f55fb40 R12: ffff888143352210
Apr 27 04:37:28 unRAID kernel: R13: ffff88812b2a76c0 R14: ffffc90001af7c98 R15: ffff888151041f00
Apr 27 04:37:28 unRAID kernel: FS: 0000000000000000(0000) GS:ffff88841f540000(0000) knlGS:0000000000000000
Apr 27 04:37:28 unRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 04:37:28 unRAID kernel: CR2: 000014c9af20d020 CR3: 0000000004e0a005 CR4: 00000000000606e0
Apr 27 04:37:28 unRAID kernel: Call Trace:
Apr 27 04:37:28 unRAID kernel: fuse_evict_inode+0x18/0x50
Apr 27 04:37:28 unRAID kernel: evict+0xb8/0x16e
Apr 27 04:37:28 unRAID kernel: __dentry_kill+0xcb/0x135
Apr 27 04:37:28 unRAID kernel: shrink_dentry_list+0x149/0x185
Apr 27 04:37:28 unRAID kernel: prune_dcache_sb+0x56/0x74
Apr 27 04:37:28 unRAID kernel: super_cache_scan+0xee/0x16d
Apr 27 04:37:28 unRAID kernel: do_shrink_slab+0x128/0x194
Apr 27 04:37:28 unRAID kernel: shrink_slab+0x11b/0x276
Apr 27 04:37:28 unRAID kernel: shrink_node+0x108/0x3cb
Apr 27 04:37:28 unRAID kernel: kswapd+0x451/0x58a
Apr 27 04:37:28 unRAID kernel: ? __switch_to_asm+0x41/0x70
Apr 27 04:37:28 unRAID kernel: ? __switch_to_asm+0x41/0x70
Apr 27 04:37:28 unRAID kernel: ? mem_cgroup_shrink_node+0xa4/0xa4
Apr 27 04:37:28 unRAID kernel: kthread+0x10c/0x114
Apr 27 04:37:28 unRAID kernel: ? kthread_park+0x89/0x89
Apr 27 04:37:28 unRAID kernel: ret_from_fork+0x35/0x40
Apr 27 04:37:28 unRAID kernel: Modules linked in: macvlan nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod nct6775 hwmon_vid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd kvm_intel drm_kms_helper kvm drm intel_cstate coretemp mxm_wmi cp210x wmi usbserial syscopyarea sysfillrect sysimgblt fb_sys_fops agpgart i2c_i801 crct10dif_pclmul intel_powerclamp i2c_core crc32c_intel r8169 ahci mpt3sas video x86_pkg_temp_thermal backlight realtek libahci pata_jmicron button raid_class scsi_transport_sas pcc_cpufreq
Apr 27 04:37:28 unRAID kernel: ---[ end trace 6347766c3e151675 ]---
Apr 27 04:37:28 unRAID kernel: RIP: 0010:clear_inode+0x78/0x88
Apr 27 04:37:28 unRAID kernel: Code: 74 02 0f 0b 48 8b 83 90 00 00 00 a8 20 75 02 0f 0b a8 40 74 02 0f 0b 48 8b 83 20 01 00 00 48 8d 93 20 01 00 00 48 39 c2 74 02 <0f> 0b 48 c7 83 90 00 00 00 60 00 00 00 5b 5d c3 53 83 7f 40 00 48
Apr 27 04:37:28 unRAID kernel: RSP: 0018:ffffc90001af7be8 EFLAGS: 00010287
Apr 27 04:37:28 unRAID kernel: RAX: ffff8881433d2220 RBX: ffff888143352100 RCX: 0000000000000000
Apr 27 04:37:28 unRAID kernel: RDX: ffff888143352220 RSI: 0000000000000000 RDI: ffff888143352270
Apr 27 04:37:28 unRAID kernel: RBP: ffff888143352270 R08: 0000000000000001 R09: 0000000000000000
Apr 27 04:37:28 unRAID kernel: R10: 0000000000000001 R11: ffff88841f55fb40 R12: ffff888143352210
Apr 27 04:37:28 unRAID kernel: R13: ffff88812b2a76c0 R14: ffffc90001af7c98 R15: ffff888151041f00
Apr 27 04:37:28 unRAID kernel: FS: 0000000000000000(0000) GS:ffff88841f540000(0000) knlGS:0000000000000000
Apr 27 04:37:28 unRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 04:37:28 unRAID kernel: CR2: 000014c9af20d020 CR3: 0000000004e0a002 CR4: 00000000000606e0

Given that I'm unable to stop or kill my docker containers, I haven't tried a reboot, as I presume my system is going to hang on shutdown due to this. My plan is to turn off auto-array start, shut down (gracefully or not -- perhaps trying umount -f /var/lib/docker prior to this), nuke the docker image, and check my file systems.

Any advice appreciated. I've attached my diagnostics. Thanks!

unraid-diagnostics-20200427-1448.zip

Edited June 4, 2020 by sreknob: solved
sreknob Posted April 27, 2020 Author

So, it ended up being a dirty shutdown, despite lazy-unmounting the docker image and killing off a boatload of processes. Also, note to self: don't trust the GUI for config edits when something is hanging. I disabled autostart from the GUI but it didn't stick, so a parity check is happening now. Still nuked the docker image, but won't be able to check the filesystems until the parity check is done... ¯\_(ツ)_/¯
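For anyone who ends up in the same spot: one way to avoid trusting the GUI is to read the flag back off the flash drive after changing it. This is only a sketch -- the path and key name (/boot/config/disk.cfg, startArray) are my assumption about Unraid's config layout, and a temp file stands in for the real one here.

```shell
# Sketch: verify an autostart change actually landed instead of trusting
# the GUI while the box is wedged. The real file would be
# /boot/config/disk.cfg (an assumption); a temp file stands in for it.
cfg=$(mktemp)
echo 'startArray="yes"' > "$cfg"

# Flip the flag the way the GUI should have...
sed -i 's/^startArray=.*/startArray="no"/' "$cfg"

# ...then read it back to confirm the edit stuck before rebooting.
grep '^startArray' "$cfg"
```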
JorgeB Posted April 28, 2020

Your server is constantly running out of memory; you should try to fix that first.
sreknob Posted April 28, 2020 Author

Thanks for having a look. It's actually not the server itself that's running out of memory. The mongodb in my Unifi controller docker container gets out of control, so I have constrained the memory for that container. When the Unifi container hits its limit, it throws the error and the container restarts. AFAIK it's never been the actual Unraid OS.

Edited April 28, 2020 by sreknob: typo
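For anyone wanting to set up the same kind of per-container cap, the standard mechanism is Docker's memory flags, which in an Unraid template go in the container's "Extra Parameters" field. Treat this as an illustrative fragment only -- the limit values and image name are placeholders, not what was actually used here:

```shell
# Illustrative only: cap the container so a runaway mongodb takes down
# just this container rather than the host. In Unraid, the --memory and
# --memory-swap flags go in the template's "Extra Parameters" field;
# the image name below is an assumption.
docker run -d --name unifi-controller \
  --memory=2g \
  --memory-swap=2g \
  --restart=unless-stopped \
  jacobalberty/unifi:latest
```

With --memory-swap equal to --memory, the container gets no extra swap headroom, so the leak hits the limit and triggers the restart rather than dragging the host into swap.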
JorgeB Posted April 28, 2020

Then you can ignore that. There's also filesystem corruption on disk4 that I missed when I first looked, since the log is heavily spammed with the OOM errors:

Mar 17 12:18:05 unRAID kernel: XFS (md4): Metadata corruption detected at xfs_dinode_verify+0xa5/0x52e [xfs], inode 0x90d66a8e dinode
Mar 17 12:18:05 unRAID kernel: XFS (md4): Unmount and run xfs_repair
sreknob Posted April 28, 2020 Author

Thanks again. I figured it was a FS error and wanted to make sure it was enough to cause the fs/inode.c error and the lockup of my dockers. Kind of strange to me, though, that a filesystem error caused this big a problem -- especially as my docker appdata and docker image all run off the cache, though many containers use the FUSE file system for appdata and storage. Seeing that it was over a month between the FS error and the lockup, do you think this explains it?

Also, I hadn't realized my container was running out of memory so often. Suppose I'll give Unifi a little more room. Never found an adequate reason my controller gets out of control...

Edited April 28, 2020 by sreknob: clarification
sreknob Posted April 29, 2020 Author

While in maintenance mode, I ran a file system check on all my array drives (which are XFS) and did not find any major errors. Will check lost+found to see if anything is in there.

Interestingly, my cache pool (btrfs) did show some errors:

Opening filesystem to check...
Checking filesystem on /dev/sdb1
UUID: 6512c394-cc26-4544-97d3-4a484923301f
[1/7] checking root items
[2/7] checking extents
data backref 333842939904 root 5 owner 18675039 offset 18446673705361137664 num_refs 0 not found in extent tree
incorrect local backref count on 333842939904 root 5 owner 18675039 offset 18446673705361137664 found 1 wanted 0 back 0x189df1d0
incorrect local backref count on 333842939904 root 5 owner 18675039 offset 395763712 found 7 wanted 8 back 0x187170b0
backpointer mismatch on [333842939904 102400]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 18675039 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
found 213364232192 bytes used, error(s) found
total csum bytes: 92116280
total tree bytes: 1020280832
total fs tree bytes: 773505024
total extent tree bytes: 114278400
btree space waste bytes: 224328220
file data blocks allocated: 2574916517888
 referenced 20818464358

There doesn't seem to be a lot of info about these errors, and I'm uncertain whether it's safe to run a scrub or repair on it...

Thanks in advance for any guidance!
JorgeB Posted April 29, 2020

Scrub is safe but won't fix those. --repair is not safe; you can try it, but make a backup first.
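In case it helps anyone following along, the order of operations implied here might look like the commands sketched below (device names and paths are my placeholders, not from this system). The guard function is just a way of encoding "backup first, then --repair" so the risky step can't run by accident; the actual btrfs call stays commented out.

```shell
# Illustrative order of operations (device and paths are assumptions):
#   btrfs scrub start -B /mnt/cache        # safe, but won't fix these errors
#   rsync -a /mnt/cache/ /backup/cache/    # back up BEFORE --repair
#   btrfs check --repair /dev/sdb1         # risky; run on an unmounted pool
#
# A tiny guard makes the "backup first" rule hard to skip by accident:
repair_cache() {
  if [ ! -e "$1" ]; then
    echo "refusing: back up the pool first"
    return 1
  fi
  echo "ok to run btrfs check --repair"
  # btrfs check --repair /dev/sdb1   # the actual risky step, left disabled
}

touch /tmp/cache-backup.done   # would be created by the backup step
repair_cache /tmp/cache-backup.done
```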
JorgeB Posted April 29, 2020

Also, xfs_repair isn't always clear about whether it found problems; the only way to know for sure is to check its exit status. And remember to run the actual repair without -n, or nothing will be fixed even if issues are found.
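To make the exit-status check concrete, here is a small helper. The code meanings are from my reading of xfs_repair(8) -- with -n, 0 means no corruption detected and 1 means corruption was found; I believe 2 indicates a dirty log that needs replaying first -- so treat them as assumptions to verify against your man page.

```shell
# Sketch: interpret xfs_repair -n's exit status instead of eyeballing
# its output. Code meanings below are from my reading of xfs_repair(8).
explain_status() {
  case "$1" in
    0) echo "clean: no corruption detected" ;;
    1) echo "corruption detected: rerun without -n to repair" ;;
    2) echo "dirty log: mount/unmount the fs first, then retry" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Usage on a real array disk (maintenance mode), e.g.:
#   xfs_repair -n /dev/md4; explain_status $?
explain_status 1
```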
sreknob Posted May 23, 2020 Author

Thanks for your help so far, johnnie! As an update, I wiped my cache drives, reformatted to a new cache pool, and recreated my docker image. I also ran xfs_repair on all my array drives; no lost+found folders were created and no other obvious errors were flagged.

Things seemed OK for a bit, but then I had another crash -- a different error this time. It seems to only happen overnight; not sure if the mover is to blame. I set up remote logging over the network after my last issues, so I have a system log of the crash, but I was unable to retrieve diagnostics at the time because the server was completely unresponsive, with not even a console active. I've included an updated diagnostics file from just now, since I rebooted this morning, to show the current configs and status.

Here are some log entries from just before it became unresponsive and stopped logging:

May 22 03:22:25 unRAID crond[1904]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
>>>
May 22 04:40:14 unRAID kernel: BUG: unable to handle kernel paging request at ffffffff820e0740
May 22 04:40:14 unRAID kernel: PGD 4e0e067 P4D 4e0e067 PUD 4e0f063 PMD 41d500063 PTE 800ffffffaf1f062
May 22 04:40:14 unRAID kernel: Oops: 0002 [#1] SMP PTI
>>>
May 22 04:40:15 unRAID kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
May 22 04:40:15 unRAID kernel: PGD 0 P4D 0
May 22 04:40:15 unRAID kernel: Oops: 0000 [#2] SMP PTI
>>>

I've truncated for brevity of this post; the more complete syslog and full traces are attached. Any thoughts or advice highly appreciated.

unraid-2020-05-22-crash.log
unraid-diagnostics-20200522-1935.zip
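Side note for anyone replicating the remote-logging setup (Unraid's GUI can generate the equivalent): the classic rsyslog forwarding rule is a one-liner. The target address and port below are placeholders, not the ones used on this server:

```
# /etc/rsyslog.conf fragment (placeholder target): forward all messages
# over UDP to a remote syslog server so kernel traces survive a hard crash.
*.*  @192.168.1.50:514      # single @ = UDP; use @@ for TCP
```

The point is that the oops lines above only exist because they were shipped off-box before the machine locked up; local /var/log would have been lost to the crash and reboot.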
JorgeB Posted May 23, 2020

See if memtest finds any RAM issues.
sreknob Posted May 23, 2020 Author

Thanks, that was going to be my next step. I actually have a new set of RAM in the mail as of last week, as I was planning on adding more, so I'll save my memtest downtime for the new sticks and test the current modules in another machine afterwards. Will update here with anything new. Thanks again!
sreknob Posted June 4, 2020 Author

Just to follow up... RAM was all good. It seems that when I cleared my docker image and reset my docker networking, I uncovered a different problem: the more recent crashes appear to be due to some sort of conflict between my custom docker networking and maybe the nvidia driver. I've turned off Folding@home, which was using the GPU, and there have been no issues since then. What clued me in was the modules from the call trace:

Modules linked in: nvidia_uvm(O) tun macvlan xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod nct6775 hwmon_vid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm_kms_helper kvm_intel kvm drm intel_cstate coretemp mpt3sas r8169 syscopyarea sysfillrect crct10dif_pclmul sysimgblt mxm_wmi fb_sys_fops intel_powerclamp crc32c_intel agpgart i2c_i801 i2c_core x86_pkg_temp_thermal wmi ahci realtek video libahci raid_class pata_jmicron cp210x backlight usbserial button pcc_cpufreq scsi_transport_sas

Looking at this thread --> [6.5.0]+ Call Traces when assigning IP to Dockers, and given that the nvidia module was linked in the trace, I just turned off F@H and the problem went away. When I move this back over to Plex, I may have to create a different docker network as in the linked thread if it returns.
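If it does return, the usual shape of that fix is a dedicated macvlan network so containers with their own IPs get their own parent-interface attachment. The subnet, gateway, IP range, and parent interface below are placeholders for illustration only -- they would need to match the actual LAN and Unraid bridge:

```shell
# Illustrative only -- subnet/gateway/range/parent are placeholders.
docker network create -d macvlan \
  --subnet=192.168.1.0/24 \
  --gateway=192.168.1.1 \
  --ip-range=192.168.1.192/27 \
  -o parent=br0 \
  dockervlan

# Then attach the container to it with a fixed address, e.g.:
#   docker run --network dockervlan --ip 192.168.1.200 ... plex
```

Keeping the --ip-range outside the router's DHCP pool avoids address collisions with other devices on the LAN.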