Tons of Call Traces, then out of memory.

SelfSD · September 30, 2020

My server has now crashed 3 days in a row due to out-of-memory problems. It happens overnight while I'm sleeping, so I wake up to a hard-locked OS, but today I managed to log in to it and copy the syslog. Sadly, running diagnostics froze it completely, and even after an hour of waiting before force rebooting, there was no ZIP file on the USB drive.

I've attached the diagnostics from after the reboot. I know it's not very helpful for troubleshooting but it shows my system and configs at least.

It completed 1 pass of memtest this morning with no issues. I will have to wait until the weekend if I'm gonna run a day-long check.

I thought the OOM issues were the problem, so I tried adding a 128 GB swap file, but it didn't help.

I'm by no means an expert at reading these call traces, but last time it was something about C states, so I disabled C states in the BIOS. I know the C state issue mostly affects 1st-generation Ryzen CPUs, but I thought it could help. At least I'm not getting the C state error in the call traces anymore, but it's still not working well...

Maybe it's time to just retire this whole system and get some new hardware. ☹️

BTW, didn't Fix Common Problems have a diagnostic mode that would pull diagnostics every 15 minutes? I can't seem to find this option again.

syslog.log unraid-diagnostics-20200930-0906.zip

JorgeB · September 30, 2020

Macvlan call traces are usually the result of dockers with a custom IP address, more info here.

For better stability, it's also a good idea to respect the maximum RAM speed AMD officially supports for your config, which is 1333 MHz, not 1600.

SelfSD · September 30, 2020

Thanks for the link! I'll read through the topic and I'll turn down the memory clock.

SelfSD · September 30, 2020

The RAM has been manually set to 1333 MHz; the Auto setting had bumped it up to 1600 for some reason. I've also put my dockers on a separate VLAN, which seems to be the best solution from the thread you linked.

Now I just hope that the out of memory errors at the end were related to the call traces. 😕
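For anyone wanting to confirm from the running system that the new memory clock actually took effect, here is a minimal sketch. It assumes `dmidecode` is available and run as root; the helper name `ram_speeds` is made up for illustration, and the exact field label ("Configured Clock Speed" or "Configured Memory Speed") depends on the dmidecode version.

```shell
# Hypothetical helper: reads `dmidecode -t memory` output on stdin and
# counts the distinct rated and configured speeds reported per DIMM slot.
ram_speeds() {
    grep -E '^[[:space:]]*(Speed|Configured (Clock|Memory) Speed):' \
        | sed -E 's/^[[:space:]]+//' \
        | sort | uniq -c
}

# Intended usage (needs root and real hardware, so it is left commented out):
#   dmidecode -t memory | ram_speeds
```

If the configured speed still reads 1600 after the BIOS change, the setting probably did not stick.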

SelfSD · October 1, 2020

I got another call trace after about 17 hours of uptime. This time it's something else. Everything still seems to be running fine.

Oct  1 05:15:49 UnRaid kernel: general protection fault: 0000 [#1] SMP NOPTI
Oct  1 05:15:49 UnRaid kernel: CPU: 6 PID: 674 Comm: kswapd0 Tainted: P           O      4.19.107-Unraid #1
Oct  1 05:15:49 UnRaid kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015
Oct  1 05:15:49 UnRaid kernel: RIP: 0010:iput+0x87/0x154
Oct  1 05:15:49 UnRaid kernel: Code: 89 e6 e8 e5 80 50 00 85 c0 75 bc e9 df 00 00 00 48 8b 5d 28 a8 08 4c 8b 6b 30 74 0e 48 c7 c7 79 fc d2 81 e8 83 7b f2 ff 0f 0b <49> 8b 45 20 48 85 c0 74 29 48 89 ef e8 62 c7 89 00 85 c0 75 75 f6
Oct  1 05:15:49 UnRaid kernel: RSP: 0018:ffffc9000344bc08 EFLAGS: 00010246
Oct  1 05:15:49 UnRaid kernel: RAX: 0000000000000000 RBX: ffff88868e356000 RCX: 0000000000000000
Oct  1 05:15:49 UnRaid kernel: RDX: 0000000100000000 RSI: ffff888226874680 RDI: ffff888226874680
Oct  1 05:15:49 UnRaid kernel: RBP: ffff888226874600 R08: 0000000000000001 R09: ffffc9000344bb78
Oct  1 05:15:49 UnRaid kernel: R10: 0000000000000000 R11: ffff88881fb9fb40 R12: ffff888226874680
Oct  1 05:15:49 UnRaid kernel: R13: 7d808fd500000000 R14: ffffc9000344bc98 R15: ffff888222ce4cc0
Oct  1 05:15:49 UnRaid kernel: FS:  0000000000000000(0000) GS:ffff88881fb80000(0000) knlGS:0000000000000000
Oct  1 05:15:49 UnRaid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  1 05:15:49 UnRaid kernel: CR2: 00007f091b0898c8 CR3: 00000006464e6000 CR4: 00000000000406e0
Oct  1 05:15:49 UnRaid kernel: Call Trace:
Oct  1 05:15:49 UnRaid kernel: __dentry_kill+0xcb/0x135
Oct  1 05:15:49 UnRaid kernel: shrink_dentry_list+0x149/0x185
Oct  1 05:15:49 UnRaid kernel: prune_dcache_sb+0x56/0x74
Oct  1 05:15:49 UnRaid kernel: super_cache_scan+0xee/0x16d
Oct  1 05:15:49 UnRaid kernel: do_shrink_slab+0x128/0x194
Oct  1 05:15:49 UnRaid kernel: shrink_slab+0x11b/0x276
Oct  1 05:15:49 UnRaid kernel: shrink_node+0x108/0x3cb
Oct  1 05:15:49 UnRaid kernel: kswapd+0x451/0x58a
Oct  1 05:15:49 UnRaid kernel: ? __switch_to_asm+0x41/0x70
Oct  1 05:15:49 UnRaid kernel: ? mem_cgroup_shrink_node+0xa4/0xa4
Oct  1 05:15:49 UnRaid kernel: kthread+0x10c/0x114
Oct  1 05:15:49 UnRaid kernel: ? kthread_park+0x89/0x89
Oct  1 05:15:49 UnRaid kernel: ret_from_fork+0x22/0x40
Oct  1 05:15:49 UnRaid kernel: Modules linked in: vhost_net tun vhost tap kvm_amd kvm ccp veth nvidia_uvm(O) xt_nat macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables iptable_filter xfs md_mod it87 hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat ip_tables wireguard ip6_udp_tunnel udp_tunnel bonding mlx4_en mlx4_core e1000e nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) edac_mce_amd crc32_pclmul pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm_kms_helper drm syscopyarea sysfillrect sysimgblt fb_sys_fops mpt3sas fam15h_power mxm_wmi wmi_bmof agpgart k10temp crct10dif_pclmul wmi crc32c_intel i2c_piix4 ahci raid_class scsi_transport_sas i2c_core libahci button [last unloaded: kvm]
Oct  1 05:15:49 UnRaid kernel: ---[ end trace eda3ee69822f802e ]---
Oct  1 05:15:49 UnRaid kernel: RIP: 0010:iput+0x87/0x154
Oct  1 05:15:49 UnRaid kernel: Code: 89 e6 e8 e5 80 50 00 85 c0 75 bc e9 df 00 00 00 48 8b 5d 28 a8 08 4c 8b 6b 30 74 0e 48 c7 c7 79 fc d2 81 e8 83 7b f2 ff 0f 0b <49> 8b 45 20 48 85 c0 74 29 48 89 ef e8 62 c7 89 00 85 c0 75 75 f6
Oct  1 05:15:49 UnRaid kernel: RSP: 0018:ffffc9000344bc08 EFLAGS: 00010246
Oct  1 05:15:49 UnRaid kernel: RAX: 0000000000000000 RBX: ffff88868e356000 RCX: 0000000000000000
Oct  1 05:15:49 UnRaid kernel: RDX: 0000000100000000 RSI: ffff888226874680 RDI: ffff888226874680
Oct  1 05:15:49 UnRaid kernel: RBP: ffff888226874600 R08: 0000000000000001 R09: ffffc9000344bb78
Oct  1 05:15:49 UnRaid kernel: R10: 0000000000000000 R11: ffff88881fb9fb40 R12: ffff888226874680
Oct  1 05:15:49 UnRaid kernel: R13: 7d808fd500000000 R14: ffffc9000344bc98 R15: ffff888222ce4cc0
Oct  1 05:15:49 UnRaid kernel: FS:  0000000000000000(0000) GS:ffff88881fb80000(0000) knlGS:0000000000000000
Oct  1 05:15:49 UnRaid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  1 05:15:49 UnRaid kernel: CR2: 00007f091b0898c8 CR3: 00000006464e6000 CR4: 00000000000406e0

Googling "kswapd0 Tainted" led me to this page and following the steps there gives me the following:

cat /proc/sys/kernel/tainted
4225

linux-tools/kernel-tools is not installed and doesn't exist in the NerdPack plugin so I can't run that.

for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
11 0
12 1
13 0
14 0
15 0
16 0
17 0

If I'm reading the decoding table correctly, a proprietary module was loaded, the kernel died recently and an externally built module was loaded.

This doesn't tell me much, but maybe it can help find out why it happened.
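The bit-by-bit decoding above can also be sketched as a small function that prints only the bits that are set. The function name `decode_taint` is made up for illustration, and it only names the three bits seen in this thread (0, 7 and 12, as listed in the kernel's tainted-kernels table); any other set bit is printed by number.

```shell
# Hypothetical helper: print the taint bits that are set in a bitmask.
# Only bits 0, 7 and 12 are named here; 18 bits are checked, as in the loop above.
decode_taint() {
    value=$1
    bit=0
    while [ "$bit" -lt 18 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            case $bit in
                0)  echo "0 P proprietary module was loaded" ;;
                7)  echo "7 D kernel died recently (OOPS or BUG)" ;;
                12) echo "12 O externally-built (out-of-tree) module was loaded" ;;
                *)  echo "$bit (look this bit up in the tainted-kernels table)" ;;
            esac
        fi
        bit=$((bit + 1))
    done
}

# 4225 is the value reported by /proc/sys/kernel/tainted in this thread.
decode_taint 4225
```

On a live system the argument would be `"$(cat /proc/sys/kernel/tainted)"` instead of the hard-coded value. 4225 is 4096 + 128 + 1, which is why exactly bits 12, 7 and 0 come back.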

unraid-diagnostics-20201001-0839.zip

civic95man · October 1, 2020

9 hours ago, SelfSD said:

If I'm reading the decoding table correctly, a proprietary module was loaded

The 'tainted' keyword is there because of the out-of-tree nvidia driver that was loaded. It's nothing bad or anything to worry about; it just lets people know that you are using an "unsupported" configuration because of that OOT driver, so kernel-level tech support would be limited.

The call trace appears to be a kernel bug relating to kswapd, and according to Google several other people are seeing this issue. It *may* not be a critical issue and can be ignored, but I don't know for sure. You could always try the beta version to see if the newer kernel fixes things.
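To see whether the same fault keeps coming back, a rough sketch like the one below can tally the faulting function from a saved syslog. The function name `summarise_faults` is made up for illustration, and it assumes the `kernel: RIP: 0010:symbol+offset/size` line format shown in the trace above.

```shell
# Hypothetical helper: count kernel faults per faulting function (the RIP symbol).
# Each fault in the log above prints its RIP line twice (before and after the
# "end trace" marker), so the counts are double the number of faults.
summarise_faults() {
    sed -nE 's/.*kernel: RIP: [0-9a-f]{4}:([A-Za-z0-9_.]+)\+.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}

# Intended usage with a saved log file:
#   summarise_faults syslog.log
```

If the list is dominated by one symbol such as `iput`, that points at a single recurring bug rather than random memory corruption, though it doesn't prove it.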

SelfSD · October 1, 2020

Thank you, that's good to hear.

I've been holding off on the betas, but maybe I should try the latest one. If the server crashes once more I'll jump on the new beta 29.

civic95man · October 1, 2020

This current beta seems to be pretty solid for people, and for some it's the only way to support their new hardware. Just be sure to read up on the release notes because there were some significant changes.

SelfSD · October 2, 2020

I've been following the changes and I'm excited for it all!

But I do wonder if that call trace messes with my dockers and VM. I'm not able to stop any of the dockers and they're all stuck on:

[s6-finish] sending all processes the KILL signal and exiting.

I managed to manually kill the dockers and stop the docker service, but now when I'm attempting to shut down the array, it's stuck on:

Oct 2 17:24:46 UnRaid root: Waiting on VMs to shutdown
Oct 2 17:24:46 UnRaid root: Stopping libvirtd...

I can't manually stop the process either as it's not running.

root@UnRaid:~# /etc/rc.d/rc.libvirt stop
libvirt is not running...

unraid-diagnostics-20201002-1731.zip

SelfSD · October 2, 2020

It got stuck on "turning off swap" even though I did not have a swapfile enabled. It generated a diagnostics file which I have attached.

I'm gonna give 6.9.0 beta 29 a try. Hopefully it will solve these weird issues.

unraid-diagnostics-20201002-1740.zip
