fpoa Posted August 22, 2020

Hi community,

I was having some crashing issues, so I had the server powered off for a few days while I did some research. Since powering it back on I've been keeping the log viewer open to keep an eye on things, and over the last several days I have noticed weird messages in the logs.

First, my system:

CPU: AMD Ryzen 7 2700 8 cores
Mobo: Asus ROG Strix B450-F Gaming
16 GB
Asus Radeon HD6450 1gb (passthrough to VM)
GTX 1080 Ti (used for Plex transcoding)

Running on Unraid 6.8.3 and linuxserver.io's Unraid Nvidia plugin version 2019-06-23.

At first, the log was getting spammed with the same error message every 10 seconds or so (it flooded past what my syslog viewer could show at a time, so I have no idea how long it went on for). Unfortunately I did not save diagnostics or take a screenshot, but it was:

"NVRM: GPU RmInitAdapter failed!
NVRM: rm_init_adapter failed for device bearing minor number 0."

Rebooting the server seemed to fix things, at least temporarily. I could watch things on Plex and it would use hardware transcoding just fine, with no errors in the log. However, the next day the syslog would be flooded with the above messages again.

I saw a post on reddit recommending going back to stock 6.8.3 in the Unraid Nvidia plugin and then redoing the Nvidia 6.8.3 build. This seemed to work, and there were no errors when I woke up this morning. However, tonight when I checked the logs before bed I saw this:

Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU is on Board .
Aug 21 20:34:12 SPAMFAM kernel: NVRM: A GPU crash dump has been created. If possible, please run
Aug 21 20:34:12 SPAMFAM kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 21 20:34:12 SPAMFAM kernel: NVRM: the NVIDIA kernel module is unloaded.

This time I have the diagnostics file saved if it's needed. If any other information is needed, please let me know. I am heading to bed now, but hopefully someone can help, and I'll check this thread when I wake up.

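For anyone triaging a similar log, here is a minimal Python sketch that counts the two kinds of NVRM messages quoted in the post above: Xid events and adapter-init failures. The regular expressions are modeled only on the lines shown in this thread, so other driver versions may word things differently. The function name `summarize_nvrm` and the `sample` list are made up for illustration.

```python
import re
from collections import Counter

# Patterns modeled on the NVRM lines quoted in this thread (an assumption:
# other driver versions may phrase these messages differently).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")
INIT_FAIL_RE = re.compile(r"NVRM: .*?(RmInitAdapter failed|rm_init_adapter failed)")


def summarize_nvrm(lines):
    """Count Xid codes and adapter-init failures in an iterable of log lines."""
    xids = Counter()
    init_failures = 0
    for line in lines:
        match = XID_RE.search(line)
        if match:
            xids[int(match.group("xid"))] += 1
        elif INIT_FAIL_RE.search(line):
            init_failures += 1
    return xids, init_failures


# Lines copied from the post above.
sample = [
    "Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.",
    "Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.",
    "NVRM: GPU RmInitAdapter failed!",
    "NVRM: rm_init_adapter failed for device bearing minor number 0.",
]

xid_counts, init_failures = summarize_nvrm(sample)
print(xid_counts, init_failures)
```

To run it against a live log, pass an open file object instead of `sample`. The syslog path (often `/var/log/syslog`) is an assumption to check on your own system.
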
JorgeB Posted August 23, 2020

You should post this on the Nvidia plugin support thread:

fpoa Posted August 23, 2020

johnnie.black said: You should post this on the Nvidia plugin support thread:

Will do, thanks!

fpoa Posted August 30, 2020

I haven't gotten a reply in the Nvidia plugin support thread, but I recently saw a new error message that I don't think is related to the plugin, and I'm starting to get worried.

Aug 29 19:42:15 SPAMFAM kernel: Modules linked in: nvidia_uvm(O) macvlan xt_CHECKSUM ipt_REJECT xt_nat ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding rsnvme(PO) sr_mod cdrom nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) btusb btrtl btbcm btintel bluetooth ecdh_generic drm_kms_helper edac_mce_amd wmi_bmof mxm_wmi crc32_pclmul pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm kvm_amd kvm syscopyarea sysfillrect sysimgblt fb_sys_fops igb(O) k10temp agpgart i2c_piix4 ahci ccp i2c_core nvme libahci usblp crct10dif_pclmul nvme_core crc32c_intel wmi button pcc_cpufreq acpi_cpufreq
Aug 29 19:42:15 SPAMFAM kernel: CPU: 2 PID: 31159 Comm: kworker/2:0 Tainted: P O 4.19.107-Unraid #1
Aug 29 19:42:15 SPAMFAM kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 2008 03/04/2019
Aug 29 19:42:15 SPAMFAM kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
Aug 29 19:42:15 SPAMFAM kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
Aug 29 19:42:15 SPAMFAM kernel: RSP: 0018:ffff88842e683d90 EFLAGS: 00010202
Aug 29 19:42:15 SPAMFAM kernel: RAX: 0000000000000188 RBX: ffff88842b6d0100 RCX: ffff888286597618
Aug 29 19:42:15 SPAMFAM kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffffffff81e08b90
Aug 29 19:42:15 SPAMFAM kernel: RBP: ffff8882865975c0 R08: 00000000896aacaa R09: ffff8883531b31c0
Aug 29 19:42:15 SPAMFAM kernel: R10: 0000000000000000 R11: ffff8883532c8000 R12: 0000000000008481
Aug 29 19:42:15 SPAMFAM kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000f964
Aug 29 19:42:15 SPAMFAM kernel: FS: 0000000000000000(0000) GS:ffff88842e680000(0000) knlGS:0000000000000000
Aug 29 19:42:15 SPAMFAM kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 19:42:15 SPAMFAM kernel: CR2: 00005621bb0b1018 CR3: 0000000001e0a000 CR4: 00000000003406e0
Aug 29 19:42:15 SPAMFAM kernel: Call Trace:
Aug 29 19:42:15 SPAMFAM kernel: <IRQ>
Aug 29 19:42:15 SPAMFAM kernel: ipv4_confirm+0xaf/0xb9
Aug 29 19:42:15 SPAMFAM kernel: nf_hook_slow+0x3a/0x90
Aug 29 19:42:15 SPAMFAM kernel: ip_local_deliver+0xad/0xdc
Aug 29 19:42:15 SPAMFAM kernel: ? ip_sublist_rcv_finish+0x54/0x54
Aug 29 19:42:15 SPAMFAM kernel: ip_rcv+0xa0/0xbe
Aug 29 19:42:15 SPAMFAM kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Aug 29 19:42:15 SPAMFAM kernel: __netif_receive_skb_one_core+0x53/0x6f
Aug 29 19:42:15 SPAMFAM kernel: process_backlog+0x77/0x10e
Aug 29 19:42:15 SPAMFAM kernel: net_rx_action+0x107/0x26c
Aug 29 19:42:15 SPAMFAM kernel: __do_softirq+0xc9/0x1d7
Aug 29 19:42:15 SPAMFAM kernel: do_softirq_own_stack+0x2a/0x40
Aug 29 19:42:15 SPAMFAM kernel: </IRQ>
Aug 29 19:42:15 SPAMFAM kernel: do_softirq+0x4d/0x5a
Aug 29 19:42:15 SPAMFAM kernel: netif_rx_ni+0x1c/0x22
Aug 29 19:42:15 SPAMFAM kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: ? __switch_to_asm+0x41/0x70
Aug 29 19:42:15 SPAMFAM kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: process_one_work+0x16e/0x24f
Aug 29 19:42:15 SPAMFAM kernel: worker_thread+0x1e2/0x2b8
Aug 29 19:42:15 SPAMFAM kernel: ? rescuer_thread+0x2a7/0x2a7
Aug 29 19:42:15 SPAMFAM kernel: kthread+0x10c/0x114
Aug 29 19:42:15 SPAMFAM kernel: ? kthread_park+0x89/0x89
Aug 29 19:42:15 SPAMFAM kernel: ret_from_fork+0x22/0x40
Aug 29 19:42:15 SPAMFAM kernel: ---[ end trace 4067e0319717aeb0 ]---
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x23:0x56:515)
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0

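A call trace like the one above is easy to miss in a busy syslog. As a rough sketch (not an official tool), the Python below groups the frames between "Call Trace:" and "---[ end trace" and reports which loadable modules appear in each trace, using the trailing "[module]" tag seen on frames such as macvlan_broadcast. The function name `traces_by_module` and the shortened `sample` list are illustrative assumptions. The frame pattern is based only on the lines quoted in this thread.

```python
import re

# A frame such as "macvlan_broadcast+0x111/0x156 [macvlan]" ends with the
# name of the loadable module it belongs to; built-in kernel code has no tag.
FRAME_MODULE_RE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\s*$")


def traces_by_module(lines):
    """Return one set of module names for each call trace found in the lines."""
    traces = []
    current = None  # set of modules for the trace being read, or None
    for line in lines:
        if "Call Trace:" in line:
            current = set()
        elif "---[ end trace" in line and current is not None:
            traces.append(current)
            current = None
        elif current is not None:
            match = FRAME_MODULE_RE.search(line)
            if match:
                current.add(match.group(1))
    return traces


# A shortened excerpt of the trace quoted above.
sample = [
    "Aug 29 19:42:15 SPAMFAM kernel: Call Trace:",
    "Aug 29 19:42:15 SPAMFAM kernel: ipv4_confirm+0xaf/0xb9",
    "Aug 29 19:42:15 SPAMFAM kernel: macvlan_broadcast+0x111/0x156 [macvlan]",
    "Aug 29 19:42:15 SPAMFAM kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]",
    "Aug 29 19:42:15 SPAMFAM kernel: ---[ end trace 4067e0319717aeb0 ]---",
]

for modules in traces_by_module(sample):
    print(sorted(modules))
```

Counting how many returned sets contain "macvlan" gives a quick sense of how often these traces occur between reboots.
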
JorgeB Posted August 30, 2020

Macvlan call traces are usually the result of having dockers with a custom IP address:

fpoa Posted August 30, 2020

johnnie.black said: Macvlan call traces are usually the result of having dockers with a custom IP address:

Like others in that thread, I had followed Spaceinvader One's video on setting up pihole. It is the only docker I have with a custom IP. I've had it set up for almost 4 months now and don't think I've ever seen that macvlan trace error before, but perhaps I missed it. Of note, according to that thread, multiple macvlan trace errors can result in Unraid crashing, so perhaps I have missed a bunch and that is what's causing my hard reboots.

I'm not very network savvy, but it looks like I'll need to learn how to set up VLANs. Thank you johnnie.black!

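To confirm which containers actually sit on a macvlan network, one option is to inspect the JSON that `docker network inspect` prints. The sketch below is hedged: it assumes that output has been captured into a string, for example from `docker network inspect $(docker network ls -q)`. The function name `macvlan_containers`, the network names, and the IP addresses in `sample` are hypothetical and not taken from this thread's diagnostics.

```python
import json


def macvlan_containers(inspect_json):
    """List (network, container, IPv4) for containers on macvlan networks."""
    found = []
    for network in json.loads(inspect_json):
        if network.get("Driver") != "macvlan":
            continue
        # "Containers" maps container IDs to their endpoint details.
        for info in (network.get("Containers") or {}).values():
            found.append(
                (network.get("Name"), info.get("Name"), info.get("IPv4Address"))
            )
    return found


# Hypothetical, trimmed-down inspect output used only for illustration.
sample = json.dumps([
    {
        "Name": "br0",
        "Driver": "macvlan",
        "Containers": {
            "abc123": {"Name": "pihole", "IPv4Address": "192.168.1.2/24"},
        },
    },
    {
        "Name": "bridge",
        "Driver": "bridge",
        "Containers": {
            "def456": {"Name": "plex", "IPv4Address": "172.17.0.2/16"},
        },
    },
])

for network, container, address in macvlan_containers(sample):
    print(network, container, address)
```

Any container this lists is a candidate for the custom-IP setup described in the reply above.
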