AMD GPU Reset Bug?



Promising solution

Quote

The project https://github.com/gnif/vendor-reset is a collaboration between @belfrypossum and myself. It aims to provide an avenue for easily adding complex reset sequences to the kernel without needing to upstream them into the kernel itself.

 

Today both @belfrypossum and I have agreed that the project is ready for use by the general public, and we would like to announce that it completely supersedes the previously released patches for AMD GPU resets. Currently the project targets (note this is not an exhaustive list and only a few example GPUs for each ASIC are listed here):

Polaris 10, 11 & 12

Vega 10 (Vega56/64/FE)

Vega 20 (Radeon VII)

Navi 10 (5600XT, 5700, 5700XT)

Navi 12 (Pro 5600M)

Navi 14 (Pro 5300, RX 5300, 5500XT)

 

Usage is very simple: just build the module and modprobe it, or use dkms to manage it directly (configuration is included). Nothing more is needed.

There are still conditions under which the GPUs will not reset; however, we are working to improve them as time permits.

This entirely removes the need to patch your kernel, and it is required that any patches you have applied for GPU resets be removed when using this module.

https://forum.level1techs.com/t/amd-polaris-vega-navi-reset-project-vendor-reset/163801

 

https://github.com/gnif/vendor-reset
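For anyone who wants to confirm the module really is loaded after the "build the module and modprobe it" step described above, here is a minimal Python sketch. It assumes the module is listed in /proc/modules as vendor_reset (the kernel reports module names with underscores instead of hyphens); the helper name module_loaded is made up for illustration and is not part of the vendor-reset project.

```python
from pathlib import Path


def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed as a loaded module in /proc/modules text."""
    # Assumption: the kernel lists module names with underscores,
    # so "vendor-reset" is looked up as "vendor_reset".
    wanted = name.replace("-", "_")
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == wanted:
            return True
    return False


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    print("vendor-reset loaded:", module_loaded("vendor-reset", text))
```

This only tells you the module is present; it says nothing about whether a given card will actually reset.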

 

 

Link to comment

This guy is pointing to this GitHub project. I don't know if there is something to be done or added to the Unraid kernel to include it?

 

I'm sick of having to reboot the whole server every time I need to reboot macOS with the 5700 XT, while everything is so smooth and nice with every other VM I have on the Nvidia card.

Edited by mSedek
Link to comment
11 hours ago, methanoid said:

Is there any guide as to how to use this with unRAID? Is @jonp or @limetech going to integrate it?

I guess I'll need to do some research on this.  If these are kernel modifications that fix bugs with those devices, why not just submit them upstream for merger into the Linux kernel itself?  Not saying we wouldn't consider adding this or finding a way to make it supportable, but need to understand all the ramifications before we make any commitments.

Link to comment
On 11/19/2020 at 8:23 PM, jonp said:

I guess I'll need to do some research on this.  If these are kernel modifications that fix bugs with those devices, why not just submit them upstream for merger into the Linux kernel itself?  Not saying we wouldn't consider adding this or finding a way to make it supportable, but need to understand all the ramifications before we make any commitments.

The reasoning is that a BIG patch that's very specific like this WON'T make it into the kernel, but applying it as a runtime patch achieves the same end result (working VFIO passthrough for AMD GPUs). It would be killer if LT could make this easy to use in unRAID, as virtualisation is for some a key part of unRAID's appeal ;-)

Link to comment
On 11/22/2020 at 3:27 PM, methanoid said:

The reasoning is that a BIG patch that's very specific like this WON'T make it into the kernel, but applying it as a runtime patch achieves the same end result (working VFIO passthrough for AMD GPUs). It would be killer if LT could make this easy to use in unRAID, as virtualisation is for some a key part of unRAID's appeal ;-)

 

I completely agree with this sentiment.  So we might have a solution for this, but I need to chat with the team about it.  More info soon!

Link to comment
10 minutes ago, ghulican said:

Does anyone else get drive failures after trying to start a VM and having an AMD card reset issue? 

I get issues with an unassigned-device SSD passed through to a VM after one of these issues forces a reboot. I have to reboot twice to get the SSD to pass through to my VM again.

Edited by Andiroo2
Link to comment

If anyone is interested, I have an Unraid 6.9.0-beta35 build with a working reset patch integrated, for Navi10 only (Radeon Pro 5700XT, Radeon RX5700XT, Radeon Pro W5700X, Radeon Pro W5700, RX5700, Radeon Pro RX5700, RX5600XT, RX5600).

 

The build includes this patch from here: Click (slightly modified to work with kernel 5.8.18)

Link to comment
6 minutes ago, ich777 said:

If anyone is interested, I have an Unraid 6.9.0-beta35 build with a working reset patch integrated, for Navi10 only (Radeon Pro 5700XT, Radeon RX5700XT, Radeon Pro W5700X, Radeon Pro W5700, RX5700, Radeon Pro RX5700, RX5600XT, RX5600).

 

The build includes this patch from here: Click (slightly modified to work with kernel 5.8.18)

Works good so far. Thank you very much!

Link to comment
27 minutes ago, ich777 said:

If anyone is interested, I have an Unraid 6.9.0-beta35 build with a working reset patch integrated, for Navi10 only (Radeon Pro 5700XT, Radeon RX5700XT, Radeon Pro W5700X, Radeon Pro W5700, RX5700, Radeon Pro RX5700, RX5600XT, RX5600).

 

The build includes this patch from here: Click (slightly modified to work with kernel 5.8.18)

Does anyone know if this will be included in the betas any time soon? I know this patch is new and these things take time, but I'm tired of this bug. If it will be a while, I'm willing to patch my build.

 

I have an RX 5600 XT and I am on Unraid 6.8.3, but I have never patched Unraid before, and I was wondering if someone could provide instructions on how to apply this patch. I'm more than willing to update to the beta if that makes it easier. I also have a secondary NVIDIA GPU I use for transcoding, so I don't know if that complicates things or not. Any advice would be appreciated, even if it is just to wait.

Link to comment
5 minutes ago, ndetar said:

Does anyone know if this will be included in the betas any time soon? I know this patch is new and these things take time, but I'm tired of this bug. If it will be a while, I'm willing to patch my build.

 

I have an RX 5600 XT and I am on Unraid 6.8.3, but I have never patched Unraid before, and I was wondering if someone could provide instructions on how to apply this patch. I'm more than willing to update to the beta if that makes it easier. I also have a secondary NVIDIA GPU I use for transcoding, so I don't know if that complicates things or not. Any advice would be appreciated, even if it is just to wait.

I would wait for the "official" OTB solution, as the guy from the article explains that his patch was made obsolete by the new stuff that Unraid is looking to implement.

Edited by mSedek
Link to comment
10 minutes ago, ndetar said:

Does anyone know if this will be included in the betas any time soon? I know this patch is new and these things take time, but I'm tired of this bug.

This is not actually the new vendor-reset patch (since that one isn't working properly for all cards).

 

10 minutes ago, ndetar said:

I have an RX 5600 XT and I am on Unraid 6.8.3, but I have never patched Unraid before, and I was wondering if someone could provide instructions on how to apply this patch. I'm more than willing to update to the beta if that makes it easier. I also have a secondary NVIDIA GPU I use for transcoding, so I don't know if that complicates things or not. Any advice would be appreciated, even if it is just to wait.

This is pretty simple: I send you a download link to the zip archive and you replace the two files on your USB boot device. I only made this in case someone really needs it...

EDIT: Of course you have to upgrade first to beta35 and then replace the files.

 

6 minutes ago, mSedek said:

I would wait for the "official" OTB solution, as the guy from the article explains that his patch was made obsolete by the new stuff that Unraid is looking to implement.

Yes, that's true, but the new vendor-reset patch doesn't work with kernel v5.8+ and Navi10 cards (GitHub issue: Click).

 

I've already made a build with this new patch, and it works for the video reset on a 5700XT but not for the card's audio, so after the first reset you can't get the audio to work again.

 

EDIT: I also had to edit the "old" patch a little bit to work with Kernel 5.8.18 but it works.

Link to comment
1 minute ago, ich777 said:

This is not actually the new vendor-reset patch (since that one isn't working properly for all cards).

 

This is pretty simple: I send you a download link to the zip archive and you replace the two files on your USB boot device. I only made this in case someone really needs it...

EDIT: Of course you have to upgrade first to beta35 and then replace the files.

 

Yes, that's true, but the new vendor-reset patch doesn't work with kernel v5.8+ and Navi10 cards (GitHub issue: Click).

 

I've already made a build with this new patch, and it works for the video reset on a 5700XT but not for the card's audio, so after the first reset you can't get the audio to work again.

That would be awesome. I don't use my GPU's audio anyway, so that's not a problem for me. Correct me if I'm wrong: I would just upgrade to the beta, install the NVIDIA drivers, then replace the two files.

Link to comment
5 minutes ago, ndetar said:

That would be awesome. I don't use my GPU's audio anyway, so that's not a problem for me. Correct me if I'm wrong: I would just upgrade to the beta, install the NVIDIA drivers, then replace the two files.

Yes exactly.

I would do it that way.

Upgrade to beta35, download the Nvidia drivers from the CA App, replace the two files on the USB Boot Device and reboot again.

 

Am I correct that you pass the AMD graphics card through to a VM?
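Since the "replace the two files on the USB Boot Device" step overwrites files the server boots from, it may be worth keeping a copy of the stock ones first. Below is a rough Python sketch of that idea, not an official procedure. Everything in it is an assumption for illustration: the function name, the backup folder name, and the file names you pass in (the thread does not name the two files, so they are left as parameters).

```python
import shutil
from pathlib import Path


def replace_with_backup(flash_dir, patched_dir, filenames,
                        backup_name="backup-before-patch"):
    """Copy patched files over the ones in flash_dir, saving the originals first.

    flash_dir   -- where the USB boot device is mounted (you supply this)
    patched_dir -- folder holding the unpacked patched files
    filenames   -- names of the files to replace (hypothetical, not from the thread)
    """
    flash = Path(flash_dir)
    patched = Path(patched_dir)

    # Refuse to touch anything unless every patched file is actually present.
    missing = [n for n in filenames if not (patched / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing patched files: {missing}")

    backup = flash / backup_name
    backup.mkdir(exist_ok=True)

    replaced = []
    for name in filenames:
        target = flash / name
        if target.is_file():
            # Keep the stock copy so the change can be rolled back by hand.
            shutil.copyfile(target, backup / name)
        shutil.copyfile(patched / name, target)
        replaced.append(name)
    return replaced
```

Rolling back would then just mean copying the files from the backup folder over the patched ones and rebooting.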

Link to comment

vendor-reset does not work with Navi cards right now. The sound card gets a Code 10 error after the second boot of the VM.

The quoted issue also shows that Vega cards don't work at the moment either.

 

Edit: we could only test with a reference-design 5700 XT.

 

If you update to 6.9.0-beta35, you do need the patched build @ich777 provides. Otherwise you cannot use the GPU in VMs.

 

Keep in mind that this is a workaround until gnif releases a new version of vendor-reset. As soon as it works, we should move to vendor-reset because, having tested it for hours and hours together with @ich777 last week, I think it does indeed handle the GPU in a better way.

 

Edited by giganode
Link to comment
2 minutes ago, giganode said:

vendor-reset does not work with Navi cards right now. The sound card gets a Code 10 error after the second boot of the VM.

The quoted issue also shows that Vega cards don't work at the moment either.

 

Edit: we could only test with a reference-design 5700 XT.

 

If you update to 6.9.0-beta35, you do need the patched build @ich777 provides. Otherwise you cannot use the GPU in VMs.

 

Keep in mind that this is a workaround until gnif releases a new version of vendor-reset. As soon as it works, we should move to vendor-reset because, having tested it for hours and hours together with @ich777 last week, I think it does indeed handle the GPU in a better way.

 

Hope it will get like Nvidia's. My 3080 hasn't failed a single time switching between lots of Linux/Windows VMs several times per day over the past months, without a single Unraid reset, as I won't reset/power off my macOS/5700 XT (Sapphire) VM for anything in the world.

Link to comment
6 minutes ago, mSedek said:

Hope it will get like Nvidia's. My 3080 hasn't failed a single time switching between lots of Linux/Windows VMs several times per day over the past months, without a single Unraid reset, as I won't reset/power off my macOS/5700 XT (Sapphire) VM for anything in the world.

Up until now Nvidia has never had a reset bug, afaik. Also, the RTX 3000 series does not have the Code 43 issue that, for example, the 2000 series or the earlier GTX series had.

As reported by gnif and Level1Techs the new RX 6000 series does not have a reset bug anymore. This is very nice!!

 

Let's all sell our old stuff to bare-metal users as soon as the new generation is available in sufficient numbers, and let's never have GPU passthrough problems again...

Edited by giganode
Link to comment
5 minutes ago, giganode said:

vendor-reset does not work with Navi cards right now. The sound card gets a Code 10 error after the second boot of the VM.

The quoted issue also shows that Vega cards don't work at the moment either.

 

Edit: we could only test with a reference-design 5700 XT.

 

If you update to 6.9.0-beta35, you do need the patched build @ich777 provides. Otherwise you cannot use the GPU in VMs.

 

Keep in mind that this is a workaround until gnif releases a new version of vendor-reset. As soon as it works, we should move to vendor-reset because, having tested it for hours and hours together with @ich777 last week, I think it does indeed handle the GPU in a better way.

 

Makes sense, thanks. I got the patched build and will try it out in a few days. I have a 5600XT so I'll let everyone know if it works.

 

Once the new vendor-reset is released, what happens then? Is it something we would implement in our own builds, or is it something that would be included in a future beta/release? Just curious, since I'm not too familiar with the whole process.

Link to comment
3 minutes ago, ndetar said:

Makes sense, thanks. I got the patched build and will try it out in a few days. I have a 5600XT so I'll let everyone know if it works.

That would be nice!

 

3 minutes ago, ndetar said:

Once the new vendor-reset is released, what happens then? Is it something we would implement in our own builds, or is it something that would be included in a future beta/release? Just curious, since I'm not too familiar with the whole process.

What happens? We will try it ✌️😏

If everything works there could be a possibility to implement it, I think. But only if it does not create new problems for other users.

Link to comment
