AMD GPU Reset Bug?



I have a few older AMD graphics cards running virtual instances of Kodi for TVs around the house.

 

When the occasional crash occurs leaving my only option as "force stop", I'm unable to boot back into the VM without rebooting unRAID.

 

Besides a newer graphics card, or the current power cycle method, are there any other ways to reset the GPU without disrupting the entire server?

Edited by Living Legend

One solution I have found is that before a "force stop", if I SSH into my server and type:

virsh detach-device testLibreelec gpudev.xml

and I have created gpudev.xml to say:

<hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
      </source>
</hostdev>

then I can get the result:

root@unraid:~# virsh detach-device testLibreelec gpudev.xml
Device detached successfully

At this point, a "force stop" from the GUI or a:

virsh shutdown testLibreelec

will stop the VM.  And now, when the VM starts back up, we don't get any kernel panic errors. So far, so good!
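Incidentally, the bus/slot/function attributes in gpudev.xml map straight from the device's PCI address as shown by `lspci`. Here's a small helper sketch (the function name is made up; the address is the one from this post):

```shell
# pci_to_hostdev: turn an lspci-style address like "05:00.0" into the
# <address> element used inside gpudev.xml.
pci_to_hostdev() {
  local addr="$1"                # e.g. "05:00.0" from `lspci | grep -i vga`
  local bus="${addr%%:*}"        # "05"
  local slotfunc="${addr#*:}"    # "00.0"
  local slot="${slotfunc%%.*}"   # "00"
  local func="${slotfunc##*.}"   # "0"
  printf "<address domain='0x0000' bus='0x%s' slot='0x%s' function='0x%s'/>\n" \
    "$bus" "$slot" "$func"
}
```

Running `pci_to_hostdev 05:00.0` prints the exact `<address>` line shown in the XML above.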

 

So rather than a manual entry, I figured either a basic user script or, better yet, a plugin would be quite useful. I stumbled across this USB hot-swap plugin:

https://github.com/cmgraz/unraid-libvirt-pcidetatch

That got me thinking:

 

My coding is rudimentary at best, but it seems it would be fairly straightforward to change the USB-specific parts of the code to let a user detach a specific PCI device instead.

 

It looks like the primary file to modify would be:

https://github.com/cmgraz/unraid-libvirt-pcidetatch/blob/master/source/libvirt.hotplug.usb/usr/local/emhttp/plugins/libvirt.hotplug.usb/include/virshcmd.php

 

The biggest changes needed would be on lines 17-22 and again on lines 35-40.

 

Does this sound practical, or is this a bigger project than I'm projecting?
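In the meantime, the manual procedure above can at least be wrapped in a small user script. A sketch, assuming gpudev.xml lives at /boot/config/gpudev.xml (the post doesn't give a path) and using this thread's example VM name; the helper name is hypothetical:

```shell
#!/bin/bash
# Sketch: detach the passed-through GPU first, then stop the VM,
# so the host never needs a reboot to recover the card.

detach_and_stop() {
  local vm="$1" xml="$2"
  virsh detach-device "$vm" "$xml"   # pull the GPU out of the guest first
  virsh shutdown "$vm"               # then stop the VM as usual
}

# Usage on the unRAID host (example names from this thread):
#   detach_and_stop testLibreelec /boot/config/gpudev.xml
```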

Edited by Living Legend
  • 8 months later...

Hi, I have an NVIDIA card which does the same thing. On a hard reset of the VM, I need to reboot unRAID. I have LibreELEC VMs around the house with the same issue. Good to know about the SSH solution; I'll try it next time. But did an easier solution ever get made?

 

Cheers

  • 1 month later...

Bumping Topic. 

 

My Vega 64 will not run again after the first boot and requires the entire system to be rebooted for it to work again.

 

Through trial and error, I discovered that if you pass through only the GPU's video device and not its sound device, running that command will pause the machine when shutting down. It was reliable, and it force closed without issue.

 

I have an RX 480 that behaves the same way, but only when switching between Linux and Windows KVM guests. I read that Vega might play nicer with a newer Linux kernel (4.16+), but we are on 4.14. It's going to be a while, and I may just switch to Nvidia.

 

 


I have the same problem with two GPUs and found this on the net.

 

When the VM shuts down, all devices used by the guest are deinitialized by its OS in preparation for shutdown. In this state, those devices are no longer functional and must then be power-cycled before they can resume normal operation. Linux can handle this power-cycling on its own, but when a device has no known reset methods, it remains in this disabled state and becomes unavailable. Since Libvirt and QEMU both expect all host PCI devices to be ready to reattach to the host before completely stopping the VM, when encountering a device that won't reset, they will hang in a "Shutting down" state where they will not be able to be restarted until the host system has been rebooted. It is therefore recommended to only pass through PCI devices which the kernel is able to reset, as evidenced by the presence of a reset file in the PCI device sysfs node, such as /sys/bus/pci/devices/0000:00:1a.0/reset.
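Following that advice, you can list which PCI devices the kernel can actually reset before passing anything through. A minimal sketch (the function name is made up, and the sysfs root is a parameter only so it can be tested; on a real host the default path is used):

```shell
# list_resettable: print the address of every PCI device whose sysfs
# node exposes a "reset" file, i.e. devices the kernel can power-cycle.
list_resettable() {
  local root="${1:-/sys/bus/pci/devices}"
  local dev
  for dev in "$root"/*; do
    [ -e "$dev/reset" ] && basename "$dev"
  done
}

# On the host, simply run: list_resettable
# Or check a single device:
#   [ -e /sys/bus/pci/devices/0000:00:1a.0/reset ] && echo resettable
```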

 

This is something that has been a bug for a long time but isn't fixed. Maybe unRAID 6.6 will fix it when they upgrade Libvirt and QEMU.

 

  • 5 weeks later...

Hi

 

I have done the above, but if I need to "force shutdown" the VM, it still hits the error. Every now and again my LibreELEC VM freezes, and this is the only way to fix it, but obviously the whole unRAID box then needs a reboot to bring my VM back up.

I have a couple of others with newer cards and there is no issue; it's just this one old card.

  • 3 weeks later...

Hi everyone, I'm new to the forums and found this thread. I also have reset issues with both my RX 580 and Vega 64. I did find this on the Level1Techs forum:

 

https://forum.level1techs.com/t/solved-testers-needed-pci-passthrough-with-4-19rcx-pci-reset-regression/132372

 

Looks like there is a fix in kernel 4.19. I haven't tested it myself, as my only system is my unRAID box, which is sporting 6.6.0-rc4 (which is awesome, btw). It's currently on kernel 4.18, but I'm hoping they add this patch or move to kernel 4.19 when it's stable. Hope it's soon. :)

 

  • 4 months later...

Any progress on this issue? I wasted my weekend trying to get my new RX Vega 64 passed through without much luck. I'm getting the same problem of not being able to start the VM a second time without restarting the server. The problem exists with both my Windows 10 and macOS Mojave VMs.

 

Sleeping the server and waking it also seems to reset the GPU.

 

I even tried upgrading to unRAID 6.6.7-rc2, which has a newer kernel, and still had the same problem. I had to revert back to 6.6.6 because AFS didn't work for me on the RC version of unRAID. So it's back to the drawing board...

 

I just ordered a Sapphire RX 580 8GB; I hope this card works. I got my Nvidia GTX 970 to work fine, but I need an AMD GPU mainly for macOS Mojave and Final Cut Pro. I do minimal gaming on the Windows 10 VM, but my daily driver is the Mojave VM, and I want that to run stably.

Edited by alfredo_2020
  • 4 months later...
  • 3 weeks later...
  • 2 months later...

Facing this issue on both an AMD R9 290X and an AMD Vega Frontier Edition (does anybody else see it on these cards?)

 

AFAIK, there is only one person actively working on a fix: gnif.

The problem is that he doesn't have all the different types of AMD hardware to test. 🙁

 

Donating to him could help!

https://forum.level1techs.com/t/navi-reset-kernel-patch/147547

 

Just an update on my Gigabyte RX Vega 64 OC on unRAID 6.6: as long as I don't pass through the GPU's sound device, the video card won't encounter the bug.

Meaning, you just have to use a different device for sound. At least it's something. Thought I'd let everyone know in case you have an RX Vega card.
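For reference, "video only" means the VM XML keeps a single `<hostdev>` for the GPU's function 0x0 and omits the matching audio device (normally the same bus/slot with function='0x1'). The addresses below reuse the example from earlier in this thread; substitute your own:

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <!-- video function only; the audio device at function='0x1' is left out -->
    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```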


Careful!
I tried it today with my Vega 56, but if I don't pass the sound device, then after the first reboot of the (macOS) VM, instead of the VM hanging, my complete unRAID server kernel panics after an initial garbled kaleidoscopic screen output on the Vega. That's a new one for me, as unRAID has never crashed on me in several years. Luckily my array and all my ZFS SSD pools did not give an inch and worked fine after the crashes.
So it's back to passing both and just waiting for a proper fix; until then I'll just reboot the box if I have to restart the VM. Crap.
  • 2 weeks later...

Well, not for me. It made it worse: now, instead of locking up the VM on the next start or restart, it crashes my whole server. It could be specific to my setup, but if I revert, it does not crash the box, so...

(MSI Radeon Vega 56, Supermicro dual-socket motherboard, macOS VM)

Edited by glennv

These patches are supposed to be workarounds for older AMD cards that don't properly support the functionality that makes them nice to use in a VM. However, if these patches break newer AMD cards, then we must get rid of them, and, sorry to say, those with older cards will have to live with the limitations.

 

Do I have this right?


Could be, and in my case, agreed, although I am just a single data point and don't presume my experience is commonplace. It would be nice to hear feedback from other AMD card passthrough users (ideally macOS) on how their experience changes, either positively or negatively, after this patch implementation.

I think this should ideally be fixed by AMD with a BIOS upgrade or something. The fixes seem a bit shady to me, messing with the card's power states at the PCIe level. When I think about that, it starts to make sense why my system doesn't like it and says goodbye.

And if the fix stays as it is, it should ideally have an on/off switch so it can be disabled for users whose cards don't play nice.

As I can live with a manual reboot once in a while, but not with a crashing production server.

 

Edited by glennv
21 minutes ago, glennv said:

It would be nice to hear feedback from other AMD card passthrough users (ideally macOS) on how their experience changes, either positively or negatively, after this patch implementation.

Yes, that is the reason for my post here.

 

22 minutes ago, glennv said:

I think this should ideally be fixed by AMD with a BIOS upgrade or something.

Or at least a patch that makes it into the mainline kernel; barring that, a patch from an actual kernel dev. Not knocking the patch author, but it was created to address one particular issue on their particular card.

 

23 minutes ago, glennv said:

And if the fix stays as it is, it should ideally have an on/off switch so it can be disabled for users whose cards don't play nice.

That won't be possible.

 

23 minutes ago, glennv said:

As I can live with a manual reboot once in a while, but not with a crashing production server.

Agreed - this is most important and why I'm inclined to remove these patches...

 

The best fix of course is to use Nvidia.

3 minutes ago, limetech said:

The best fix of course is to use Nvidia.

Tell that to Tim C(r)ook O.o 

I only wish, as I have a few Nvidias lying in the dust here. Like many, I was forced to go the AMD route to keep my macOS workflows/tools.

Edited by glennv

The patch doesn't seem to work.

 

I have a Win 10 VM with an AMD Sapphire 5700 passed through. If I need to reboot Windows, the VM will not boot up again. I don't get any error notification; it just doesn't work.

 

I need to restart the whole server to get the VM back.

 

I'm on 6.8-rc4.

 

edit: I have this message in the logs.

 

2019-10-28T19:23:34.570409Z qemu-system-x86_64: vfio: Cannot reset device 0000:0b:00.1, no available reset mechanism.
2019-10-28T19:23:34.574452Z qemu-system-x86_64: vfio: Cannot reset device 0000:0b:00.1, no available reset mechanism.
2019-10-28T19:23:36.007454Z qemu-system-x86_64: vfio: Cannot reset device 0000:0b:00.1, no available reset mechanism.
2019-10-28T19:23:36.011432Z qemu-system-x86_64: vfio: Cannot reset device 0000:0b:00.1, no available reset mechanism.


Ben

Edited by Benjamin Picard
