lp0101 Posted July 8, 2020 (edited)

I'm running Unraid on an Intel X99 platform. Occasionally, during somewhat heavy load from VMs/containers (most recently, scanning a whole library for metadata in the Jellyfin docker), I get hit with a crash. The only way to recover is to hard reboot the machine. Below is the log.

(Note: the log is reversed, so read it from bottom to top. That's just how my syslog server gave it to me.)

2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CR2: 000014d1aebde718 CR3: 0000000001e0a002 CR4: 00000000001606e0
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,FS: 0000000000000000(0000) GS:ffff88881fa00000(0000) knlGS:0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R13: ffff88881f41a000 R14: 0000000000000002 R15: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R10: ffffc90003ba3a48 R11: 0000000000000001 R12: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RBP: ffff88881f41a020 R08: ffffc90003ba3cb3 R09: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RDX: ffffc90003ba3a50 RSI: 0000000000000000 RDI: ffffea0002e89e80
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RAX: 00000000ffffffea RBX: ffffea0002e89e88 RCX: ffffc90003ba3990
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RSP: 0018:ffffc90003ba3960 EFLAGS: 00010082
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,---[ end trace 3842a02541499cc3 ]---
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Modules linked in: vhost_net tun vhost tap kvm_intel kvm md_mod nvidia_uvm(O) nfsv3 nfs lockd grace xt_CHECKSUM sunrpc ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs bonding nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) mxm_wmi wmi_bmof intel_wmi_thunderbolt crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd intel_cstate drm_kms_helper coretemp crct10dif_pclmul intel_powerclamp crc32c_intel drm x86_pkg_temp_thermal syscopyarea sysfillrect sysimgblt fb_sys_fops e1000e i2c_i801 agpgart i2c_core ahci libahci wmi pcc_cpufreq button [last unloaded: md_mod]
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,ret_from_fork+0x35/0x40
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? kthread_park+0x89/0x89
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,kthread+0x10c/0x114
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? collapse_shmem+0xacd/0xacd
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? wait_woken+0x6a/0x6a
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,khugepaged+0xa67/0x1829
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? pagevec_lru_move_fn+0xaa/0xb9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? __lru_cache_add+0x51/0x51
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,__alloc_pages_nodemask+0x423/0xae1
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,try_to_free_pages+0xb2/0xcd
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,do_try_to_free_pages+0x1a1/0x300
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_node+0xf1/0x3cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? compaction_suitable+0x25/0x61
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? compaction_suitable+0x25/0x61
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? __compaction_suitable+0x77/0x96
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_node_memcg+0x4c4/0x64a
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,? move_to_new_page+0x169/0x21b
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,shrink_inactive_list+0xd8/0x47e
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Call Trace:
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CR2: 000014d1aebde718 CR3: 0000000001e0a002 CR4: 00000000001606e0
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,FS: 0000000000000000(0000) GS:ffff88881fa00000(0000) knlGS:0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R13: ffff88881f41a000 R14: 0000000000000002 R15: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,R10: ffffc90003ba3a48 R11: 0000000000000001 R12: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RBP: ffff88881f41a020 R08: ffffc90003ba3cb3 R09: 0000000000000000
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RDX: ffffc90003ba3a50 RSI: 0000000000000000 RDI: ffffea0002e89e80
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RAX: 00000000ffffffea RBX: ffffea0002e89e88 RCX: ffffc90003ba3990
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RSP: 0018:ffffc90003ba3960 EFLAGS: 00010082
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,Hardware name: ASUS All Series/X99-A, BIOS 4101 07/10/2019
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,CPU: 8 PID: 349 Comm: khugepaged Tainted: P O 4.19.107-Unraid #1
2020-07-07,21:25:56,Warning,barad-dur,kern,kernel,invalid opcode: 0000 [#1] SMP PTI
2020-07-07,21:25:56,crit,barad-dur,kern,kernel,kernel BUG at mm/vmscan.c:1703!

Edited July 8, 2020 by lp0101
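For anyone who would rather read that trace top to bottom: since the export is simply newest-first, it can be flipped back into chronological order with a one-liner. This is just a sketch assuming a GNU/Linux shell with coreutils available; the filenames are placeholders, not actual files from this thread.

# The syslog export above is newest-first; tac emits its lines in reverse,
# so the oldest entry ends up first (filenames are hypothetical).
tac syslog-export.csv > syslog-chronological.csv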
lp0101 (Author) Posted July 8, 2020

Bump. I'd love to move more essential services to my Unraid box, but I'm not comfortable doing that until I know what causes this.
lp0101 (Author) Posted July 9, 2020

Happened again tonight. The server wasn't doing anything when it crashed; it was completely idle. Woke up to the same error.
ColdKeyboard Posted February 17, 2021

I've had the same issue for the past couple of weeks: the server crashes "randomly". I disabled Docker and VMs (I'm not using any VMs, but still) and it crashed all the same. If I leave the array stopped, the server runs for days; as soon as I start the array, it crashes within hours or a couple of days.

Last time I set up remote logging and captured the crash; the last message received mentions "kernel BUG at mm/vmscan.c:1703!". Below is what I captured with my remote logging. Any help or suggestions on how to solve this would be greatly appreciated.

2021-02-17 05:53,Warning,192.168.0.25,CR2: 000000000044f300 CR3: 00000001d6ae2000 CR4: 00000000003406f0
2021-02-17 05:53,Warning,192.168.0.25,CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-17 05:53,Warning,192.168.0.25,FS: 0000000000000000(0000) GS:ffff888430600000(0000) knlGS:0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000074c
2021-02-17 05:53,Warning,192.168.0.25,R10: 0000000000044268 R11: ffff8884306dfb40 R12: 0000000000000046
2021-02-17 05:53,Warning,192.168.0.25,RBP: 00000000000000e0 R08: 0000000000000000 R09: ffff8884306da5c0
2021-02-17 05:53,Warning,192.168.0.25,RDX: 0000000000000001 RSI: 0000000000000003 RDI: 000000000000074c
2021-02-17 05:53,Warning,192.168.0.25,RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81e3ca80
2021-02-17 05:53,Warning,192.168.0.25,RSP: 0018:ffffc90001933e20 EFLAGS: 00010046
2021-02-17 05:53,Warning,192.168.0.25,Code: 89 f8 48 89 f7 c6 00 00 57 9d 0f 1f 44 00 00 c3 41 54 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 74 09 89 c6 e8 16 d1 9f ff 66 90 4c 89 e0 41 5c
2021-02-17 05:53,Warning,192.168.0.25,RIP: 0010:_raw_spin_lock_irqsave+0x1a/0x31
2021-02-17 05:53,Warning,192.168.0.25,---[ end trace 575c6c2f1f88a641 ]---
2021-02-17 05:53,Warning,192.168.0.25,Modules linked in: xt_nat veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding edac_mce_amd ccp kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd i2c_piix4 i2c_core r8169 video ahci k10temp backlight glue_helper libahci nvme realtek wmi_bmof nvme_core wmi thermal button pcc_cpufreq acpi_cpufreq
2021-02-17 05:53,Warning,192.168.0.25,ret_from_fork+0x22/0x40
2021-02-17 05:53,Warning,192.168.0.25,? kthread_park+0x89/0x89
2021-02-17 05:53,Warning,192.168.0.25,kthread+0x10c/0x114
2021-02-17 05:53,Warning,192.168.0.25,? mem_cgroup_shrink_node+0xa4/0xa4
2021-02-17 05:53,Warning,192.168.0.25,? __switch_to_asm+0x41/0x70
2021-02-17 05:53,Warning,192.168.0.25,kswapd+0x451/0x58a
2021-02-17 05:53,Warning,192.168.0.25,shrink_node+0xf1/0x3cb
2021-02-17 05:53,Warning,192.168.0.25,? super_cache_count+0x70/0xb4
2021-02-17 05:53,Warning,192.168.0.25,? xfs_fs_nr_cached_objects+0x16/0x19 [xfs]
2021-02-17 05:53,Warning,192.168.0.25,shrink_node_memcg+0x4c4/0x64a
2021-02-17 05:53,Warning,192.168.0.25,shrink_inactive_list+0xd8/0x47e
2021-02-17 05:53,Warning,192.168.0.25,Call Trace:
2021-02-17 05:53,Warning,192.168.0.25,CR2: 000000000044f300 CR3: 00000001d6ae2000 CR4: 00000000003406f0
2021-02-17 05:53,Warning,192.168.0.25,CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-02-17 05:53,Warning,192.168.0.25,FS: 0000000000000000(0000) GS:ffff888430600000(0000) knlGS:0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,R13: ffff88818f48bc00 R14: 0000000000000002 R15: 0000000000000002
2021-02-17 05:53,Warning,192.168.0.25,R10: ffffc90001d0bc58 R11: 0000000000000003 R12: 0000000000000002
2021-02-17 05:53,Warning,192.168.0.25,RBP: ffff88818f48bc20 R08: ffffc90001d0bea3 R09: 0000000000000000
2021-02-17 05:53,Warning,192.168.0.25,RDX: ffffc90001d0bc60 RSI: 0000000000000000 RDI: ffffea000d70c6c0
2021-02-17 05:53,Warning,192.168.0.25,RAX: 00000000ffffffea RBX: ffffea000d70c6c8 RCX: ffffc90001d0bba0
2021-02-17 05:53,Warning,192.168.0.25,RSP: 0018:ffffc90001d0bb70 EFLAGS: 00010082
2021-02-17 05:53,Warning,192.168.0.25,Code: 53 08 48 89 1a eb 25 48 8b 43 08 48 8b 3b 48 89 47 08 48 89 38 48 8b 45 00 48 89 58 08 48 89 03 48 89 6b 08 48 89 5d 00 eb 02 <0f> 0b 49 ff c7 4c 89 d8 4d 39 dc 49 0f 43 c4 48 3b 04 24 0f 82 cb
2021-02-17 05:53,Warning,192.168.0.25,RIP: 0010:isolate_lru_pages.isra.0+0x18b/0x2b9
2021-02-17 05:53,Warning,192.168.0.25,Hardware name: Gigabyte Technology Co., Ltd. B450M DS3H/B450M DS3H-CF, BIOS F60c 10/29/2020
2021-02-17 05:53,Warning,192.168.0.25,CPU: 0 PID: 786 Comm: kswapd0 Tainted: G D 4.19.107-Unraid #1
2021-02-17 05:53,Warning,192.168.0.25,invalid opcode: 0000 [#2] SMP NOPTI
2021-02-17 05:53,Critical,192.168.0.25,kernel BUG at mm/vmscan.c:1703!
2021-02-17 05:53,Warning,192.168.0.25,------------[ cut here ]------------
2021-02-17 05:53,Info,192.168.0.25,oom_reaper: reaped process 28849 (shfs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2021-02-17 05:53,Error,192.168.0.25,Killed process 28849 (shfs) total-vm:483680kB, anon-rss:21368kB, file-rss:4kB, shmem-rss:1064kB
2021-02-17 05:53,Error,192.168.0.25,Out of memory: Kill process 28849 (shfs) score 1 or sacrifice child
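In case it helps anyone reproduce the same kind of capture: below is a minimal receiver sketch for another Linux machine, assuming rsyslogd is installed there and that Unraid's syslog server setting is pointed at it over UDP port 514. The file path and service name are examples, not Unraid specifics; 192.168.0.25 is simply the Unraid host IP as it appears in the log above.

# Minimal rsyslog receiver config (legacy syntax) on the logging box
cat <<'EOF' | sudo tee /etc/rsyslog.d/10-unraid.conf
$ModLoad imudp
$UDPServerRun 514
# Send everything arriving from the Unraid host to its own file
:fromhost-ip, isequal, "192.168.0.25"    /var/log/unraid-remote.log
& stop
EOF
sudo systemctl restart rsyslog
# Tail it live so the last lines before a crash survive off-box
tail -f /var/log/unraid-remote.log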
Aldarel Posted February 20, 2021 (edited)

Hi, unfortunately I have the same issue, and it has been there for quite a while. Luckily for me it does not occur that quickly and seems to be pretty random. It once occurred close to when I was upgrading a Docker container, but today it happened without any container update at all. I'm running about 15 containers, a few with a constant workload and the rest mostly idle. I do not run VMs and the VM service is turned off.

What also happens is that this kernel bug causes some sort of Ethernet packet storm: my whole network loses connectivity the moment it hits, and only recovers when I unplug the Unraid Ethernet cable or hard power down the server (it does not respond to any other shutdown method at that point). So this issue may also be linked to the many other posts complaining about complete network loss because of Unraid. This is really unreliable behavior, and what's worse, it takes the rest of my network down and then requires a full array check, which takes a long time with a 16x10TB array.

I don't have a nice text stack trace of the problem, but I have screenshots of the console from today (Feb 20) and from Feb 12, when it last crashed: crash stack trace screenshots (https://imgur.com/a/nj1VWDa). My stack trace looks very similar to yours. I figured it could be a memory issue, but these sticks have been running for years and I haven't had much trouble with them; there have been fewer than 5 crashes over their lifetime, and I don't have the error reports from the earlier ones. I would also try (again) to run memtest, but for some reason it just freezes after attempting to load the memtest program. Possibly related to UEFI/Secure Boot/the way the USB stick is prepared?

Edit: That was just an issue of trying to run the BIOS memtest under UEFI. I got the newest version of MemTest86, which runs under UEFI, and it now starts properly.

Edit 2: I'm now running the newest MemTest86 and it has found at least one error so far on the first pass, in moving inversions (test 7). I will let it run a few more passes to see if it reproduces. Also, because I am running XMP, I think that is another potential cause of instability. I will eventually try running the RAM at stock speeds to see if that helps, although I suppose that if the error is reproducible I should RMA the sticks regardless of XMP. Going forward I think I should run Unraid without XMP anyway; I would assume the main hit would be to the large write cache (over 50% of RAM is allowed for file write caching) and maybe to iGPU Plex transcode performance. Given that the crash stack trace points at memory operations, faulty RAM (or some other component in the memory pipeline) seems a valid cause right now. Will update with further details later.

Please help; this unreliability is really worrying, and complete silence is not very reassuring to me, or to future customers I suppose.

Edited February 20, 2021 by Aldarel (Memtest details #2)
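On the write-cache point above: how much of RAM the kernel lets fill up with dirty (not yet written) page cache is governed by two sysctls, so it can be inspected and lowered independently of XMP. A rough sketch, assuming a root shell on the Unraid box; the numbers are only examples, not recommendations.

# Show the current limits (percent of RAM that may hold dirty data)
sysctl vm.dirty_ratio vm.dirty_background_ratio
# Example: start background writeback at 5% and hard-cap dirty data at 10%
# (takes effect immediately; lost on reboot unless made persistent)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10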
trurl Posted February 20, 2021

Posting diagnostics when the problem isn't occurring would give us a better idea of your configuration and hardware.
Aldarel Posted February 20, 2021 (edited)

So I got 7 errors from 4 passes of the newest MemTest86 in the 3200 XMP configuration. I reseated the RAM and configured it to run at the base speed of 2133. I will rerun the same test, and if it still fails I will start narrowing down which module(s) are faulty and RMA them. I think that will already solve my issue; if not, I will post further system specs and so on.

MemTest86-Report-20210220-100537.html
ColdKeyboard Posted February 21, 2021 (edited)

OK, I ran memtest for 24 hours and completed 10 passes. I know I didn't give it a lot of time, but still, it reported 0 errors. (Months ago I let it run for a couple of days and also had 0 errors; I haven't touched the config since then.) So I think it is reasonable to assume the memory is OK. I have also attached a diagnostics report from Unraid.

nas-diagnostics-20210221-0127.zip
Aldarel Posted February 23, 2021

So I have now narrowed my problem down to a pair of memory modules, backed by MemTest86 reports, and those are now being replaced under RMA. The rest of the memory modules did not produce any errors in runs of over a day. I will run at least a day of testing on all 4 modules once I receive the replacements. I will most likely keep the memory at the DDR4 JEDEC standard speed, though, instead of putting it back on XMP settings. Having endured these crashes, stability is the primary concern here, and it is not that much slower anyway.

This should solve my crash issue as well as the complete network outage issue, although the network packet storm when the OS crashes is still something you may want to look into.
Aldarel Posted March 1, 2021

With the replacement pair of RAM sticks installed and 36 hours of clean memtest behind me, I am much more confident that everything will now be stable; the issue is closed, at least for me.
Aldarel Posted March 26, 2021

24 days of uptime so far without the issue recurring. Hopefully it stays that way and was indeed fixed by replacing the faulty memory module (and downclocking back to the standard memory speed).