GPU issue, "GPU has fallen off the bus"


fpoa


Hi community,

 

I was having some crashing issues, so I had the server powered off for a few days while I did some research. Since powering it back on, I've kept the log viewer open to keep an eye on things, and over the last several days I have noticed weird messages in the logs.

 

First, my system:

CPU: AMD Ryzen 7 2700 (8 cores)

Mobo: Asus ROG Strix B450-F Gaming

RAM: 16 GB

GPU 1: Asus Radeon HD 6450 1 GB (passed through to a VM)

GPU 2: GTX 1080 Ti (used for Plex transcoding)

 

Running on Unraid 6.8.3 and linuxserver.io's Unraid Nvidia plugin version 2019-06-23.

 

At first, the log was getting spammed with the same error message every 10 seconds or so (it flooded past what my syslog viewer could show at a time, so I have no idea how long it went on for). Unfortunately I did not save diagnostics or take a screenshot, but the message was:

 

"NVRM: GPU RmInitAdapter failed!

NVRM: rm_init_adapter failed for device bearing minor number 0."

 

Rebooting the server seemed to fix things, at least temporarily. I could watch things on Plex and it would use hardware transcoding just fine, with no errors in the log. However, the next day the syslog would be flooded with the above messages again. I saw a post on Reddit recommending going back to stock 6.8.3 in the Unraid Nvidia plugin and then reinstalling the Nvidia 6.8.3 build. This seemed to work, and there were no errors when I woke up this morning. However, tonight when I checked the logs before bed I saw this:

 

Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU is on Board .
Aug 21 20:34:12 SPAMFAM kernel: NVRM: A GPU crash dump has been created. If possible, please run
Aug 21 20:34:12 SPAMFAM kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 21 20:34:12 SPAMFAM kernel: NVRM: the NVIDIA kernel module is unloaded.
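
For anyone who wants to pull lines like these out of a flooded syslog, here is a rough Python sketch. The regex is modelled only on the Xid line quoted above (other driver versions may format it differently), and `find_xid_events` is a made-up helper name, not part of any NVIDIA tool.

```python
import re

# Pattern based solely on the Xid line quoted in this post (an assumption,
# not an official format specification).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<msg>.*)"
)


def find_xid_events(lines):
    """Return one dict per syslog line that contains an NVRM Xid report."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append({
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": int(match.group("pid")),
                "message": match.group("msg").strip(),
            })
    return events


# Two lines copied from the log above; only the first is an Xid report.
sample = [
    "Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.",
    "Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.",
]

for event in find_xid_events(sample):
    print(event)
```

To scan a real log, pass an open file instead of `sample`, for example `find_xid_events(open(path))` with `path` pointing at wherever your syslog is stored.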

 

This time I have the diagnostics file saved if it's needed. If any other information is needed, please let me know. I am heading to bed now, but hopefully someone can help, and I'll check this thread when I wake up.


I haven't gotten a reply in the Nvidia plugin support thread, but I recently saw a new error message. I don't think it is related to the plugin, but I am starting to get worried.

Quote

Aug 29 19:42:15 SPAMFAM kernel: Modules linked in: nvidia_uvm(O) macvlan xt_CHECKSUM ipt_REJECT xt_nat ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding rsnvme(PO) sr_mod cdrom nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) btusb btrtl btbcm btintel bluetooth ecdh_generic drm_kms_helper edac_mce_amd wmi_bmof mxm_wmi crc32_pclmul pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm kvm_amd kvm syscopyarea sysfillrect sysimgblt fb_sys_fops igb(O) k10temp agpgart i2c_piix4 ahci ccp i2c_core nvme libahci usblp crct10dif_pclmul nvme_core crc32c_intel wmi button pcc_cpufreq acpi_cpufreq
Aug 29 19:42:15 SPAMFAM kernel: CPU: 2 PID: 31159 Comm: kworker/2:0 Tainted: P O 4.19.107-Unraid #1
Aug 29 19:42:15 SPAMFAM kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 2008 03/04/2019
Aug 29 19:42:15 SPAMFAM kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
Aug 29 19:42:15 SPAMFAM kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
Aug 29 19:42:15 SPAMFAM kernel: RSP: 0018:ffff88842e683d90 EFLAGS: 00010202
Aug 29 19:42:15 SPAMFAM kernel: RAX: 0000000000000188 RBX: ffff88842b6d0100 RCX: ffff888286597618
Aug 29 19:42:15 SPAMFAM kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffffffff81e08b90
Aug 29 19:42:15 SPAMFAM kernel: RBP: ffff8882865975c0 R08: 00000000896aacaa R09: ffff8883531b31c0
Aug 29 19:42:15 SPAMFAM kernel: R10: 0000000000000000 R11: ffff8883532c8000 R12: 0000000000008481
Aug 29 19:42:15 SPAMFAM kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000f964
Aug 29 19:42:15 SPAMFAM kernel: FS: 0000000000000000(0000) GS:ffff88842e680000(0000) knlGS:0000000000000000
Aug 29 19:42:15 SPAMFAM kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 19:42:15 SPAMFAM kernel: CR2: 00005621bb0b1018 CR3: 0000000001e0a000 CR4: 00000000003406e0
Aug 29 19:42:15 SPAMFAM kernel: Call Trace:
Aug 29 19:42:15 SPAMFAM kernel: <IRQ>
Aug 29 19:42:15 SPAMFAM kernel: ipv4_confirm+0xaf/0xb9
Aug 29 19:42:15 SPAMFAM kernel: nf_hook_slow+0x3a/0x90
Aug 29 19:42:15 SPAMFAM kernel: ip_local_deliver+0xad/0xdc
Aug 29 19:42:15 SPAMFAM kernel: ? ip_sublist_rcv_finish+0x54/0x54
Aug 29 19:42:15 SPAMFAM kernel: ip_rcv+0xa0/0xbe
Aug 29 19:42:15 SPAMFAM kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Aug 29 19:42:15 SPAMFAM kernel: __netif_receive_skb_one_core+0x53/0x6f
Aug 29 19:42:15 SPAMFAM kernel: process_backlog+0x77/0x10e
Aug 29 19:42:15 SPAMFAM kernel: net_rx_action+0x107/0x26c
Aug 29 19:42:15 SPAMFAM kernel: __do_softirq+0xc9/0x1d7
Aug 29 19:42:15 SPAMFAM kernel: do_softirq_own_stack+0x2a/0x40
Aug 29 19:42:15 SPAMFAM kernel: </IRQ>
Aug 29 19:42:15 SPAMFAM kernel: do_softirq+0x4d/0x5a
Aug 29 19:42:15 SPAMFAM kernel: netif_rx_ni+0x1c/0x22
Aug 29 19:42:15 SPAMFAM kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: ? __switch_to_asm+0x41/0x70
Aug 29 19:42:15 SPAMFAM kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: process_one_work+0x16e/0x24f
Aug 29 19:42:15 SPAMFAM kernel: worker_thread+0x1e2/0x2b8
Aug 29 19:42:15 SPAMFAM kernel: ? rescuer_thread+0x2a7/0x2a7
Aug 29 19:42:15 SPAMFAM kernel: kthread+0x10c/0x114
Aug 29 19:42:15 SPAMFAM kernel: ? kthread_park+0x89/0x89
Aug 29 19:42:15 SPAMFAM kernel: ret_from_fork+0x22/0x40
Aug 29 19:42:15 SPAMFAM kernel: ---[ end trace 4067e0319717aeb0 ]---
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x23:0x56:515)
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0

 

1 hour ago, johnnie.black said:

Macvlan call traces are usually the result of having dockers with a custom IP address:

 

Like others in that thread, I had followed Spaceinvader One's video on setting up Pi-hole. It is the only Docker container I have with a custom IP. I've had it set up for almost 4 months now and don't think I've ever seen that macvlan call trace before, but perhaps I missed it. Of note, according to that thread, multiple macvlan call traces can result in Unraid crashing, so perhaps I have missed a bunch and that is what is causing my hard reboots.
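
To check whether earlier macvlan traces were missed, something like the following Python sketch could list when they occurred. It assumes each trace contains one `Workqueue: events macvlan_process_broadcast` line, as in the log quoted earlier, and that lines start with the 15-character `Mmm dd HH:MM:SS` syslog timestamp. `macvlan_trace_times` is a hypothetical helper name.

```python
def macvlan_trace_times(lines):
    """Return the syslog timestamp of each trace raised from the macvlan workqueue."""
    times = []
    for line in lines:
        # The "Workqueue:" line appears once per call trace in the quoted log,
        # so matching on it avoids counting every stack frame.
        if "Workqueue:" in line and "macvlan_process_broadcast" in line:
            times.append(line[:15])  # "Mmm dd HH:MM:SS" is 15 characters
    return times


# Three lines copied from the quoted log; only the first marks a new trace.
sample = [
    "Aug 29 19:42:15 SPAMFAM kernel: Workqueue: events macvlan_process_broadcast [macvlan]",
    "Aug 29 19:42:15 SPAMFAM kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]",
    "Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x23:0x56:515)",
]

print(macvlan_trace_times(sample))
```

Run against a full syslog, the length of the returned list gives a rough count of how many macvlan traces preceded a crash.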

 

I'm not very network savvy, but it looks like I'll need to learn how to set up VLANs. Thank you, johnnie.black!
