Living Legend

AMD GPU Reset Bug?


I have a few older AMD graphics cards running virtual instances of Kodi for tvs around the house.

 

When an occasional crash leaves "force stop" as my only option, I can't boot the VM again without rebooting unRAID.

 

Besides a newer graphics card, or the current power cycle method, are there any other ways to reset the GPU without disrupting the entire server?

Edited by Living Legend


One solution I have found is to SSH into my server before a "force stop" and run:

virsh detach-device testLibreelec gpudev.xml

and I have created gpudev.xml to say:

<hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
      </source>
</hostdev>

then I can get the result:

root@unraid:~# virsh detach-device testLibreelec gpudev.xml
Device detached successfully

At this point, a "force stop" from the GUI or a:

virsh shutdown testLibreelec

will stop the VM. Now, when the VM starts back up, there are no kernel panic errors. So far so good!
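A minimal sketch of those manual steps as a script, assuming a POSIX-style shell and `virsh` on the PATH. The function names `write_hostdev_xml` and `detach_and_shutdown` and the `VIRSH` override are made up for illustration and are not part of any existing plugin; the PCI address and VM name are the ones from this post.

```shell
#!/bin/bash
# Sketch only: wraps the manual detach-then-shutdown steps from this post.
# Function names and the VIRSH override are illustrative assumptions.

VIRSH="${VIRSH:-virsh}"   # set VIRSH=echo to print the commands instead of running them

# Print a <hostdev> definition for the given PCI bus/slot/function
# (hex digits without the 0x prefix, e.g. 05 00 0).
write_hostdev_xml() {
    local bus="$1" slot="$2" func="$3"
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x${bus}' slot='0x${slot}' function='0x${func}'/>
  </source>
</hostdev>
EOF
}

# Detach the GPU from the running VM, then ask the VM to shut down.
# Returns non-zero without attempting the shutdown if the detach fails.
detach_and_shutdown() {
    local vm="$1" xml="$2"
    "$VIRSH" detach-device "$vm" "$xml" || return 1
    "$VIRSH" shutdown "$vm"
}

# Example (values taken from this post):
#   write_hostdev_xml 05 00 0 > gpudev.xml
#   detach_and_shutdown testLibreelec gpudev.xml
```

Stopping when the detach fails is a deliberate choice here: if the device is still attached, shutting the VM down is the case that leads to the hang.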

 

So rather than a manual entry, I figured either a basic user script, or better yet, a plugin would be quite useful.  I stumbled across this USB hot swap plugin:

That got me thinking:

https://github.com/cmgraz/unraid-libvirt-pcidetatch

 

My coding is rudimentary at best, but it seems it would be fairly straightforward to make some USB-to-PCI changes in the code so that a user can detach a specific PCI device.

 

It looks like the primary file to modify would be:

https://github.com/cmgraz/unraid-libvirt-pcidetatch/blob/master/source/libvirt.hotplug.usb/usr/local/emhttp/plugins/libvirt.hotplug.usb/include/virshcmd.php

 

The biggest changes would be needed on lines 17-22 and again on lines 35-40.

 

Does this sound practical, or is this a bigger project than I'm projecting?

Edited by Living Legend


Hi, I have an NVIDIA card which does the same thing. After a hard reset of the VM I need to reboot Unraid. I have LibreELEC VMs around the house with the same issue. Good to know about the SSH solution, which I will try next time, but did an easier solution ever get made?

 

Cheers


Bumping this topic.

 

My Vega 64 will not run again after the first VM boot; the entire system has to be rebooted before it works again.

 

Through trial and error, I discovered that if you pass through only the video card and not its sound card, running that command pauses the machine when shutting down the computer. It was reliable, and the force stop then worked without issue.

 

I have an RX 480 that behaves the same way, but only when switching between Linux and Windows VMs. I read that Vega might play nicer with newer Linux kernels (4.16+), but we are on 4.14. It's going to be a while, so I may just switch to an Nvidia card.

 

 


I have the same problem with two GPUs and found this on the net.

 

When the VM shuts down, all devices used by the guest are deinitialized by its OS in preparation for shutdown. In this state, those devices are no longer functional and must then be power-cycled before they can resume normal operation. Linux can handle this power-cycling on its own, but when a device has no known reset methods, it remains in this disabled state and becomes unavailable. Since Libvirt and Qemu both expect all host PCI devices to be ready to reattach to the host before completely stopping the VM, when encountering a device that won't reset, they will hang in a "Shutting down" state where they will not be able to be restarted until the host system has been rebooted. It is therefore recommended to only pass through PCI devices which the kernel is able to reset, as evidenced by the presence of a reset file in the PCI device sysfs node, such as /sys/bus/pci/devices/0000:00:1a.0/reset.
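The reset-file check described in that quote can be sketched as a couple of shell functions. The names `has_reset` and `report_reset` and the `SYSFS_PCI` override are made up for illustration. Note that this only tests whether the file exists, as the quoted text suggests; it cannot tell whether the reset actually recovers a given card.

```shell
# Sketch: report whether the kernel exposes a reset file for a PCI device.
# SYSFS_PCI is an illustrative override so the check can be tried on a test tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Succeeds if <address>/reset exists. Usage: has_reset 0000:05:00.0
has_reset() {
    [ -e "$SYSFS_PCI/$1/reset" ]
}

# Print a one-line verdict for each PCI address given.
report_reset() {
    local addr
    for addr in "$@"; do
        if has_reset "$addr"; then
            echo "$addr: reset file present"
        else
            echo "$addr: no reset file"
        fi
    done
}

# Example: report_reset 0000:05:00.0 0000:05:00.1
```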

 

This has been a bug for a long time but still isn't fixed. Maybe 6.6 will fix it when they upgrade Libvirt and Qemu.

 


I have tried an RX 480 and an RX 580, and both fail the same way.

Is it possible to write a script that automatically resets the GPU when a VM shuts down or reboots?


Hi

 

I have done the above, but if I need to 'force shutdown' the VM it still has the error. Every now and again my LibreELEC VM freezes and this is the only way to fix it, but then the whole Unraid box obviously needs a reboot to bring my VM up again.

I have a couple of others with new cards and there is no issue; it's just with one of my old cards.


I've tried an RX 460 without any failures: I can reboot and shut down Linux and Windows VMs without rebooting Unraid completely. With the RX 480/580, though, I give up and will wait for an Unraid update.


Hi everyone, I'm new to the forums and found this thread. I also have reset issues with both my RX 580 and Vega 64. I did find this on the Level1Techs forum:

 

https://forum.level1techs.com/t/solved-testers-needed-pci-passthrough-with-4-19rcx-pci-reset-regression/132372

 

Looks like there is a fix with kernel 4.19. I haven't tested it myself, as my only system is my Unraid box, which is running 6.6.0-rc4 (which is awesome, btw). It's currently on kernel 4.18, but I'm hoping they add this patch or move to kernel 4.19 when it's stable. Hope it's soon :).
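For anyone wanting to check which side of 4.19 their box is on, here is a small sketch. The function name `kernel_at_least` is made up for illustration, and it assumes a `sort` that supports `-V` (GNU coreutils does). Whether 4.19 actually contains the fix is only what the linked thread suggests and is untested here.

```shell
# Sketch: is a kernel version at least the wanted version?
# Defaults to the running kernel (uname -r) when no second argument is given.
kernel_at_least() {
    local want="$1" have="${2:-$(uname -r)}"
    # Version-sort the two strings; if the wanted version sorts first
    # (or they are equal), the kernel is new enough.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example:
#   if kernel_at_least 4.19; then echo "kernel is 4.19 or newer"; fi
```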

 


Just an update on my Gigabyte RX Vega 64 OC: on Unraid 6.6, as long as I don't pass through the GPU's sound card, the video card won't encounter the bug.

Meaning you just have to use a different device for sound. At least it's something. Thought I'd let everyone know in case you have an RX Vega card.
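To see which function on the card is the sound device, the functions sharing the GPU's PCI slot can be listed from sysfs along with their class codes. This is a sketch; `list_slot_functions` and the `SYSFS_PCI` override are made-up names. In the PCI class codes, values starting with 0x03 are display controllers and 0x0403 is an audio device; on many cards the HDMI audio shows up as function 1 of the same slot, but check your own output.

```shell
# Sketch: list every function on one PCI slot together with its class code.
# SYSFS_PCI is an illustrative override so this can be tried on a test tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Usage: list_slot_functions 0000:05:00
# Prints lines such as "0000:05:00.0 0x030000".
list_slot_functions() {
    local slot="$1" dev
    for dev in "$SYSFS_PCI/$slot".*; do
        # Skips the unexpanded glob when nothing matches.
        [ -r "$dev/class" ] || continue
        echo "${dev##*/} $(cat "$dev/class")"
    done
}
```

The function whose class begins with 0x0403 is the one to leave out of the VM's passthrough configuration if you want to try this workaround.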


Any progress on this issue? I wasted my weekend trying to get passthrough working on my new RX Vega 64 without much luck. I'm getting the same problem of not being able to start the VM a second time without restarting the server. The problem exists with both my Windows 10 and OSX Mojave VMs.

 

Sleeping the server and waking it seems to reset the GPU also.

 

I even tried upgrading to unraid 6.6.7RC2, which has a newer kernel, and still had the same problem. I had to revert to 6.6.6 because AFS didn't work for me with the RC version of unraid. So back to the drawing board...

 

I just ordered a Sapphire RX 580 8GB; I hope this card works. I got my Nvidia GTX 970 to work fine, but I need an AMD GPU mainly for OSX Mojave and Final Cut Pro. I do minimal gaming on the Windows 10 VM, but my daily driver is OSX Mojave and I want that to be stable.

Edited by alfredo_2020


One day I will be able to move to Mojave, but until I find an AMD card without this bug, the good old GT 1030 will drive High Sierra.

