Nvidia cards throwing D3 power states and header 127 issues, what happened?



Hi everyone,

Fair warning, this is my crosspost attempt from Reddit.  Why I started there, I'm not sure. 

I own a 2-streamers-1-CPU build with two GTX 1080s and a Threadripper 1950X. It had been stable for a long time, 6 months to a year since my last issue. A couple of months ago I upgraded to 6.9.0-rc2, and after about a month of uptime my GF and I both got a black screen at the same time while playing the same game, Monster Hunter World. The log from one of the VMs is below; the last 4 lines show the main issue:

-boot strict=on \
-device nec-usb-xhci,p2=15,p3=15,id=usb,bus=pci.0,addr=0x7 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x3 \
-blockdev '{"driver":"host_device","filename":"/dev/disk/by-id/ata-Samsung_SSD_850_EVO_1TB_S3PJNB0J806112N","node-name":"libvirt-4-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-4-format","read-only":false,"cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-4-storage"}' \
-device virtio-blk-pci,bus=pci.0,addr=0x4,drive=libvirt-4-format,id=virtio-disk2,bootindex=1,write-cache=on \
-blockdev '{"driver":"file","filename":"/mnt/disks/GameCache/Blizzard/FearTurkey/vdisk2.img","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-3-format","read-only":false,"cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-3-storage"}' \
-device virtio-blk-pci,bus=pci.0,addr=0x5,drive=libvirt-3-format,id=virtio-disk3,write-cache=on \
-blockdev '{"driver":"file","filename":"/mnt/user/isos/Win10_1803_English_x64.iso","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":true,"driver":"raw","file":"libvirt-2-storage"}' \
-device ide-cd,bus=ide.0,unit=0,drive=libvirt-2-format,id=ide0-0-0,bootindex=2 \
-blockdev '{"driver":"file","filename":"/mnt/user/isos/virtio-win-0.1.173-2.iso","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":true,"driver":"raw","file":"libvirt-1-storage"}' \
-device ide-cd,bus=ide.0,unit=1,drive=libvirt-1-format,id=ide0-0-1 \
-netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:45:b5:84,bus=pci.0,addr=0x2 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=35,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device usb-tablet,id=input0,bus=usb.0,port=2 \
-device 'vfio-pci,host=0000:44:00.0,id=hostdev0,bus=pci.0,addr=0x6,romfile=/mnt/user/nas/Build Info (DO NOT TOUCH)/gtx1080.dump' \
-device vfio-pci,host=0000:44:00.1,id=hostdev1,bus=pci.0,addr=0x8 \
-device vfio-pci,host=0000:09:00.3,id=hostdev2,bus=pci.0,addr=0x9 \
-device vfio-pci,host=0000:42:00.0,id=hostdev3,bus=pci.0,addr=0xa \
-device usb-host,hostbus=1,hostaddr=2,id=hostdev4,bus=usb.0,port=1 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2021-03-15 20:58:17.231+0000: Domain id=1 is tainted: high-privileges
2021-03-15 20:58:17.231+0000: Domain id=1 is tainted: host-cpu
char device redirected to /dev/pts/0 (label charserial0)
2021-03-16T05:19:05.856739Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2021-03-16T05:19:05.861722Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2021-03-16T05:19:07.054770Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
2021-03-16T05:19:07.054895Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
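
For anyone comparing their own logs against the one above, here is a minimal Python sketch that pulls out the "stuck in D3" lines and counts them per second. The function names are made up for illustration, and the pattern only assumes the exact line format shown in this log.

```python
import re
from collections import Counter

# Matches QEMU lines of the form shown in the log above, e.g.
# 2021-03-16T05:19:05.856739Z qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
D3_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z\s+"
    r"qemu-system-x86_64: vfio: Unable to power on device, stuck in D3\s*$"
)


def find_d3_errors(log_text):
    """Return the timestamp (to the second) of every 'stuck in D3' line."""
    hits = []
    for line in log_text.splitlines():
        match = D3_LINE.match(line.strip())
        if match:
            hits.append(match.group("ts"))
    return hits


def summarize(log_text):
    """Count 'stuck in D3' errors per second, to show how they cluster."""
    return Counter(find_d3_errors(log_text))
```

Reading each VM's log file and passing the text to `summarize` should show whether both VMs hit the error in the same second, which is what the double black screen suggests.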

"Ok", I think to myself, "maybe sitting for multiple months on an RC branch wasn't a good idea, upgrade to 6.9.1". I make the upgrade and decide to try again. My GF and I play some other games for a couple of hours, then swap back to Monster Hunter World. After roughly an hour of MHW, both VMs crash again with the same message. Now, I'm aware of the dreaded AMD reset bug and I've seen Code 43s, but this is neither AMD nor a Code 43, not to mention the system has been as stable as our relationship through Covid, so I'm at a loss. Google-fu told me to try disabling the hypervisor, and I'm also thinking of vfio binding one or both GPUs at boot, but I'd be concerned about booting, since one of the 1080s is the primary card. Does anyone have a better idea of what could be wrong?
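
Before deciding whether to bind at boot, it may help to see which host driver currently owns each passed-through function. The sketch below reads the standard `driver` symlink under `/sys/bus/pci/devices` on the host. The addresses are the ones from the QEMU command line above; the function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a fake directory.

```python
import os

# PCI functions passed to the VM in the QEMU command line above.
PASSED_THROUGH = ["0000:44:00.0", "0000:44:00.1", "0000:09:00.3", "0000:42:00.0"]


def bound_driver(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the host driver bound to a PCI function, or None."""
    link = os.path.join(sysfs_root, address, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device does not exist
    # The symlink points at the driver's directory; its last component is the name.
    return os.path.basename(os.readlink(link))


def report(addresses, sysfs_root="/sys/bus/pci/devices"):
    """Map each address to its current driver name (None if unbound)."""
    return {addr: bound_driver(addr, sysfs_root) for addr in addresses}
```

Calling `report(PASSED_THROUGH)` on the host while a VM is running would be expected to show `vfio-pci` for these functions. What shows up with the VM stopped depends on the host configuration, so treat the output as a starting point rather than a diagnosis.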

Logfiles can be found here. Forgive the VM names, I'm a TFS fan at heart.


After a slew of changes (without validating them individually of course) it APPEARS to be stable, for now.  Notable events:

  • Pointed our VM GPUs at different BIOS files. (I point secondary GPUs at BIOS files too, but somehow both were pointing at the same one. It's the same GPU model, so no compatibility issue, but maybe a file handle issue.)
  • Disabled Hypervisor (re-enabling it afterwards has not brought the issue back, from what I can tell)
  • Fast boot was enabled on a VM. Disabled that crap. Not sure if that did it, but the words "fast startup" and "power state" go hand in hand in my brain. You should have that disabled by default in these VMs anyway.
  • Did not try the nvidia-persistenced tag yet, and apparently didn't need it
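
To check the first item without eyeballing each VM, here is a rough Python sketch that scans the QEMU command lines from the VM logs and reports any `romfile=` path used by more than one VM. The VM names in the usage note are placeholders and the function names are hypothetical. It assumes one `-device` argument per line, as in the log earlier in the thread, and no commas in the ROM path.

```python
import os
import re
from collections import defaultdict

# Finds vfio-pci devices that carry a romfile= option on the same line.
VFIO_ROM = re.compile(
    r"vfio-pci,[^\n]*?host=(?P<host>[0-9a-fA-F:.]+)"
    r"[^\n]*?romfile=(?P<rom>[^,'\"\n]+)"
)


def romfiles(qemu_cmdline):
    """Return (host PCI address, ROM path) pairs found in one command line."""
    pairs = []
    for match in VFIO_ROM.finditer(qemu_cmdline):
        # Drop the trailing " \" left over from line continuations.
        rom = match.group("rom").rstrip(" \\")
        pairs.append((match.group("host"), os.path.normpath(rom)))
    return pairs


def shared_romfiles(cmdlines_by_vm):
    """Return {rom path: [(vm, host address), ...]} for ROMs used by 2+ VMs."""
    users = defaultdict(list)
    for vm, cmdline in cmdlines_by_vm.items():
        for host, rom in romfiles(cmdline):
            users[rom].append((vm, host))
    return {rom: who for rom, who in users.items()
            if len({vm for vm, _ in who}) > 1}
```

Something like `shared_romfiles({"vm-a": log_a, "vm-b": log_b})` returns an empty dict when each VM has its own file. Whether sharing one ROM file actually caused the D3 errors is still unconfirmed; this only detects the configuration.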

If anyone else has any thoughts, please share them.
