Server crashing with "kernel BUG at mm/vmscan.c:1703

lp0101 · July 8, 2020

I'm running unraid on an intel x99 platform.

Occasionally, during somewhat heavy loads from VMs/containers (recently - scanning a whole library for metadata in Jellyfin docker), I'll get hit with a crash. The only way to recover from this is to hard reboot the machine. Below is the log: (Note - Log is reversed, read the log from bottom to top. It's just how my syslog server gave it to me.)

2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CR2: 000014d1aebde718 CR3: 0000000001e0a002 CR4: 00000000001606e0
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,FS:  0000000000000000(0000) GS:ffff88881fa00000(0000) knlGS:0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R13: ffff88881f41a000 R14: 0000000000000002 R15: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R10: ffffc90003ba3a48 R11: 0000000000000001 R12: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RBP: ffff88881f41a020 R08: ffffc90003ba3cb3 R09: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RDX: ffffc90003ba3a50 RSI: 0000000000000000 RDI: ffffea0002e89e80
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RAX: 00000000ffffffea RBX: ffffea0002e89e88 RCX: ffffc90003ba3990
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RSP: 0018:ffffc90003ba3960 EFLAGS: 00010082
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,---[ end trace 3842a02541499cc3 ]---
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Modules linked in: vhost_net tun vhost tap kvm_intel kvm md_mod nvidia_uvm(O) nfsv3 nfs lockd grace xt_CHECKSUM sunrpc ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs bonding nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) mxm_wmi wmi_bmof intel_wmi_thunderbolt crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd intel_cstate drm_kms_helper coretemp crct10dif_pclmul intel_powerclamp crc32c_intel drm x86_pkg_temp_thermal syscopyarea sysfillrect sysimgblt fb_sys_fops e1000e i2c_i801 agpgart i2c_core ahci libahci wmi pcc_cpufreq button [last unloaded: md_mod]
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,ret_from_fork+0x35/0x40
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? kthread_park+0x89/0x89
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,kthread+0x10c/0x114
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? collapse_shmem+0xacd/0xacd
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? wait_woken+0x6a/0x6a
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,khugepaged+0xa67/0x1829
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? pagevec_lru_move_fn+0xaa/0xb9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? __lru_cache_add+0x51/0x51
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,__alloc_pages_nodemask+0x423/0xae1
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,try_to_free_pages+0xb2/0xcd
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,do_try_to_free_pages+0x1a1/0x300
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_node+0xf1/0x3cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? compaction_suitable+0x25/0x61
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? compaction_suitable+0x25/0x61
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? __compaction_suitable+0x77/0x96
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_node_memcg+0x4c4/0x64a
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? move_to_new_page+0x169/0x21b
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_inactive_list+0xd8/0x47e
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Call Trace:
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CR2: 000014d1aebde718 CR3: 0000000001e0a002 CR4: 00000000001606e0
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,FS:  0000000000000000(0000) GS:ffff88881fa00000(0000) knlGS:0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R13: ffff88881f41a000 R14: 0000000000000002 R15: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R10: ffffc90003ba3a48 R11: 0000000000000001 R12: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RBP: ffff88881f41a020 R08: ffffc90003ba3cb3 R09: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RDX: ffffc90003ba3a50 RSI: 0000000000000000 RDI: ffffea0002e89e80
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RAX: 00000000ffffffea RBX: ffffea0002e89e88 RCX: ffffc90003ba3990
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RSP: 0018:ffffc90003ba3960 EFLAGS: 00010082
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,"Hardware name: ASUS All Series/X99-A, BIOS 4101 07/10/2019"
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CPU: 8 PID: 349 Comm: khugepaged Tainted: P           O      4.19.107-Unraid #1
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,invalid opcode: 0000 [#1] SMP PTI
2020-07-07,21:25:56,crit,barad-dur,kern,kernel,kernel BUG at mm/vmscan.c:1703!

Edited July 8, 2020 by lp0101

lp0101 · July 8, 2020

Bump. Would love to move more essential services to my unraid box, but I'm not comfortable doing that until I know what causes this

lp0101 · July 9, 2020

Happened again tonight. Server was actually not doing anything when it crashed, completely idle. Woke up to the same error.

ColdKeyboard · February 17, 2021

I had the same issue for the past couple of weeks.

Server would crash "randomly". I disabled docker and VM (I'm not using any VMs but still). Same thing, UnRaid crashes.

If I leave array stopped, server runs for days. As soon as I start the array, in the next hours or couple of days, it will crash.

Last time I setup remote logging and captured that last message received mentioned "kernel BUG at mm/vmscan.c:1703!"

Below is what I captured with my remote logging. Any help/suggestion on how to solve this would be greatly appreciated.

2021-02-17 05:53,Warning,192.168.0.25,CR2: 000000000044f300 CR3: 00000001d6ae2000 CR4: 00000000003406f0
2021-02-17 05:53,Warning,192.168.0.25,CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-17 05:53,Warning,192.168.0.25,FS:  0000000000000000(0000) GS:ffff888430600000(0000) knlGS:0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000074c
2021-02-17 05:53,Warning,192.168.0.25,R10: 0000000000044268 R11: ffff8884306dfb40 R12: 0000000000000046
2021-02-17 05:53,Warning,192.168.0.25,RBP: 00000000000000e0 R08: 0000000000000000 R09: ffff8884306da5c0
2021-02-17 05:53,Warning,192.168.0.25,RDX: 0000000000000001 RSI: 0000000000000003 RDI: 000000000000074c
2021-02-17 05:53,Warning,192.168.0.25,RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81e3ca80
2021-02-17 05:53,Warning,192.168.0.25,RSP: 0018:ffffc90001933e20 EFLAGS: 00010046
2021-02-17 05:53,Warning,192.168.0.25,Code: 89 f8 48 89 f7 c6 00 00 57 9d 0f 1f 44 00 00 c3 41 54 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 74 09 89 c6 e8 16 d1 9f ff 66 90 4c 89 e0 41 5c
2021-02-17 05:53,Warning,192.168.0.25,RIP: 0010:_raw_spin_lock_irqsave+0x1a/0x31
2021-02-17 05:53,Warning,192.168.0.25,---[ end trace 575c6c2f1f88a641 ]---
2021-02-17 05:53,Warning,192.168.0.25,Modules linked in: xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding edac_mce_amd ccp kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd i2c_piix4 i2c_core r8169 video ahci k10temp backlight glue_helper libahci nvme realtek wmi_bmof nvme_core wmi thermal button pcc_cpufreq acpi_cpufreq
2021-02-17 05:53,Warning,192.168.0.25,ret_from_fork+0x22/0x40
2021-02-17 05:53,Warning,192.168.0.25,? kthread_park+0x89/0x89
2021-02-17 05:53,Warning,192.168.0.25,kthread+0x10c/0x114
2021-02-17 05:53,Warning,192.168.0.25,? mem_cgroup_shrink_node+0xa4/0xa4
2021-02-17 05:53,Warning,192.168.0.25,? __switch_to_asm+0x41/0x70
2021-02-17 05:53,Warning,192.168.0.25,kswapd+0x451/0x58a
2021-02-17 05:53,Warning,192.168.0.25,shrink_node+0xf1/0x3cb
2021-02-17 05:53,Warning,192.168.0.25,? super_cache_count+0x70/0xb4
2021-02-17 05:53,Warning,192.168.0.25,? xfs_fs_nr_cached_objects+0x16/0x19 [xfs]
2021-02-17 05:53,Warning,192.168.0.25,shrink_node_memcg+0x4c4/0x64a
2021-02-17 05:53,Warning,192.168.0.25,shrink_inactive_list+0xd8/0x47e
2021-02-17 05:53,Warning,192.168.0.25,Call Trace:
2021-02-17 05:53,Warning,192.168.0.25,CR2: 000000000044f300 CR3: 00000001d6ae2000 CR4: 00000000003406f0
2021-02-17 05:53,Warning,192.168.0.25,CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-17 05:53,Warning,192.168.0.25,FS:  0000000000000000(0000) GS:ffff888430600000(0000) knlGS:0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,R13: ffff88818f48bc00 R14: 0000000000000002 R15: 0000000000000002
2021-02-17 05:53,Warning,192.168.0.25,R10: ffffc90001d0bc58 R11: 0000000000000003 R12: 0000000000000002
2021-02-17 05:53,Warning,192.168.0.25,RBP: ffff88818f48bc20 R08: ffffc90001d0bea3 R09: 0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,RDX: ffffc90001d0bc60 RSI: 0000000000000000 RDI: ffffea000d70c6c0
2021-02-17 05:53,Warning,192.168.0.25,RAX: 00000000ffffffea RBX: ffffea000d70c6c8 RCX: ffffc90001d0bba0
2021-02-17 05:53,Warning,192.168.0.25,RSP: 0018:ffffc90001d0bb70 EFLAGS: 00010082
2021-02-17 05:53,Warning,192.168.0.25,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2021-02-17 05:53,Warning,192.168.0.25,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2021-02-17 05:53,Warning,192.168.0.25,Hardware name: Gigabyte Technology Co., Ltd. B450M DS3H/B450M DS3H-CF, BIOS F60c 10/29/2020
2021-02-17 05:53,Warning,192.168.0.25,CPU: 0 PID: 786 Comm: kswapd0 Tainted: G      D           4.19.107-Unraid #1
2021-02-17 05:53,Warning,192.168.0.25,invalid opcode: 0000 [#2] SMP NOPTI
2021-02-17 05:53,Critical,192.168.0.25,kernel BUG at mm/vmscan.c:1703!
2021-02-17 05:53,Warning,192.168.0.25,------------[ cut here ]------------
2021-02-17 05:53,Info,192.168.0.25,oom_reaper: reaped process 28849 (shfs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2021-02-17 05:53,Error,192.168.0.25,Killed process 28849 (shfs) total-vm:483680kB, anon-rss:21368kB, file-rss:4kB, shmem-rss:1064kB
2021-02-17 05:53,Error,192.168.0.25,Out of memory: Kill process 28849 (shfs) score 1 or sacrifice child

Aldarel · February 20, 2021

Hi,

Unfortunately I have the same issue and it has been there for quite a while. Luckily for me it does not occur that quickly and seems to be pretty random. I have had it once occur close to when I was upgrading a Docker container, but today it occurred without any container update operation. I am running like 15 containers, out of which a few have constant workload, others mostly idle. I do not run VMs and the VM service is turned off.

What also occurs is that this kernel bug causes some sort of ethernet packet storm because my whole network drops connectivity the moment this happens and recovers whenever I plug out Unraid ethernet cable OR hard power down the server (it does not respond to any other shutdown methods at this point). So this issue may also link to many other posts complaining about complete network loss because of Unraid!

This is super unreliable behavior and whats worse it takes my network down from other devices, and also requires full array check which takes a lot of time with 16x10TB array.

I don't have nice text stack trace of the problem, but I have screenshots of the console from today (Feb 20) and Feb 12 when it last crashed.
Crash stack trace screenshots (https://imgur.com/a/nj1VWDa)

My crash stack trace looks super similar to yours. I figured out it could be an issue with memory but these sticks have been running for years now and I don't really have had much problem with those, count of crashes have been less than 5 during the lifetime but I don't have the error reports of the other time. ~~I would also try (again) to run memtest but for some reason it just freezes after attempting to load memtest program. Possibly somehow related to uefi/secure boot/the way usb stick is prepared?~~ Edit: This was just an issue of trying to run BIOS memtest in UEFI. Got the newest version of Memtest86 which runs UEFI and it now starts properly. Edit2: Running now memtest86 newest version and it found at least one error so far on 1st pass, in moving inversions (test 7). I will let it run a few more passes to see if it reproduces. Also because I am running XMP, I think that is another cause for potential instability. I will eventually try to run RAM on stock speeds and see if that helps. Although I guess if it is reproducable error I should RMA the sticks no matter XMP or not. But for future I think I should run Unraid without XMP, I would assume main hit would be in using huge write cache (over 50% of ram is allowed for file write caching) and maybe iGPU Plex transcode performance?

Given the crash stack trace hints at memory operations faulty RAM (or some other component in memory pipeline) seems a valid cause right now. Will update further details later.

Please help, this unreliability is really worrying and complete silence is not very reassuring to me or future customers I suppose.

Edited February 20, 2021 by Aldarel
Memtest details#2

trurl · February 20, 2021

Posting diagnostics when the problem isn't occurring would give us a better idea of your configuration and hardware.

Aldarel · February 20, 2021

So I got 7 errors from 4 passes of Memtest86 newest version in 3200 XMP configuration. Reseated RAM and configured them to run on base speed 2133. Will rerun the same test and if they still fail I will start looking into which one/ones are faulty and RMA them. I think that will already solve my issue and if not then I will post further system specs and so on.

MemTest86-Report-20210220-100537.html

Edited February 20, 2021 by Aldarel

ColdKeyboard · February 21, 2021

Ok, I ran memtest for 24h, completed 10 passes. I know I didn't give it a lot of time, but still, it reported 0 errors. (Months ago I let it run for couple of days and still had 0 errors. I didn't touch the config since then). So I think it would be reasonable to assume memory is ok.

I have also attached diagnostics report from UnRaid.

nas-diagnostics-20210221-0127.zip

Edited February 21, 2021 by ColdKeyboard

Aldarel · February 23, 2021

So I have now located my problem down to a pair of memory modules with Memtest86 reports and those are now in progress to be RMA changed for working ones. Rest of the memory modules did not error in runs worth over a day. Will run at least a day of testing for all 4 modules once I receive the new ones from RMA.

I will most likely keep memory speed at DDR4 JEDEC standard however, instead of putting them back to XMP settings. Now having endured these crashes, stability would be the primary concern here and it is not that much slower anyway.

This should fortunately solve my crash issues, as well as complete network dead issue. Although it is something you guys may want to take a look why would it cause a network packet storm when OS crashes.

Aldarel · March 1, 2021

With new replaced RAM stick pair and 36 hours memtest ok behind me, I have much more confidence everything will be now stable and the issue is closed at least for me.

Aldarel · March 26, 2021

So far 24 days uptime without that issue any more. Hopefully it stays as such and was fixed by replacing a faulty memory module (and downclocking back to standard memory speed).

Server crashing with "kernel BUG at mm/vmscan.c:1703

Recommended Posts

lp0101

Link to comment

lp0101

Link to comment

lp0101

Link to comment

ColdKeyboard

Link to comment

Aldarel

Link to comment

trurl

Link to comment

Aldarel

Link to comment

ColdKeyboard

Link to comment

Aldarel

Link to comment

Aldarel

Link to comment

Aldarel

Link to comment

Join the conversation