
AMD 7900 XTX reset bug - Please help me I tried everything


matz7


Hey guys,

I'm starting to lose my mind. I have a reset bug with an AMD 7900 XTX.

After a host restart it works perfectly once, but as soon as I reboot the Windows VM, it can't boot again.

I tried ACS, blacklisting the amdgpu driver, Q35 vs. i440fx, dumping the ROM, disabling the D0 state, the AMD reset bug plugin from the store, and a PCI remove and rescan.

I'm out of ideas.

The VM log shows this (but it is the same on the first start after boot, when it works):
 

2023-11-19T01:47:15.893928Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument
2023-11-19T01:47:15.893968Z qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380000000000, 0x10000000, 0x14a7dce00000) = -22 (Invalid argument)
2023-11-19T01:47:15.894085Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument
2023-11-19T01:47:15.894090Z qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380010000000, 0x200000, 0x14a7dcc00000) = -22 (Invalid argument)
2023-11-19T01:47:15.901186Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument
2023-11-19T01:47:15.901197Z qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380000000000, 0x10000000, 0x14a7dce00000) = -22 (Invalid argument)
2023-11-19T01:47:15.901336Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument
2023-11-19T01:47:15.901341Z qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380010000000, 0x200000, 0x14a7dcc00000) = -22 (Invalid argument)
2023-11-19T01:47:15.907046Z qemu-system-x86_64: VFIO_MAP_DMA failed: Invalid argument
2023-11-19T01:47:15.907054Z qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380000000000, 0x10000000, 0x14a7dce00000) = -22 (Invalid argument)
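
A minimal sketch for decoding the `vfio_dma_map` lines above, assuming the log format shown here. It converts the hex mapping size to MiB and the return code to an errno name. `parse_dma_map_failures` and `SAMPLE` are illustrative names, not part of QEMU.

```python
import errno
import re

# Matches QEMU lines such as:
#   vfio_dma_map(0x..., 0x380000000000, 0x10000000, 0x...) = -22 (Invalid argument)
# Groups: container pointer, IOVA, size, host virtual address, return code.
DMA_MAP_RE = re.compile(
    r"vfio_dma_map\((0x[0-9a-f]+), (0x[0-9a-f]+), (0x[0-9a-f]+), (0x[0-9a-f]+)\)"
    r" = (-?\d+)"
)


def parse_dma_map_failures(log_text):
    """Return one dict per vfio_dma_map line found in log_text."""
    failures = []
    for match in DMA_MAP_RE.finditer(log_text):
        size = int(match.group(3), 16)
        ret = int(match.group(5))
        failures.append({
            "iova": match.group(2),
            "size_mib": size / (1024 * 1024),
            # The return value is a negated errno; look up its symbolic name.
            "errno_name": errno.errorcode.get(-ret, "UNKNOWN"),
        })
    return failures


# Two lines copied from the VM log above.
SAMPLE = """\
qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380000000000, 0x10000000, 0x14a7dce00000) = -22 (Invalid argument)
qemu-system-x86_64: vfio_dma_map(0x14a7edc56c00, 0x380010000000, 0x200000, 0x14a7dcc00000) = -22 (Invalid argument)
"""

if __name__ == "__main__":
    for failure in parse_dma_map_failures(SAMPLE):
        print(failure)
```

For the two sample lines this reports mappings of 256 MiB and 2 MiB, both failing with EINVAL.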

 

No errors in dmesg after I remove and rescan the GPU:
 

[185956.423209] pci 0000:03:00.0: Removing from iommu group 22
[185956.423323] pci 0000:03:00.1: Removing from iommu group 23
[185956.447322] pci 0000:03:00.0: [1002:744c] type 00 class 0x030000
[185956.447337] pci 0000:03:00.0: reg 0x10: [mem 0x6130000000-0x613fffffff 64bit pref]
[185956.447346] pci 0000:03:00.0: reg 0x18: [mem 0x6140000000-0x61401fffff 64bit pref]
[185956.447352] pci 0000:03:00.0: reg 0x20: [io  0x4000-0x40ff]
[185956.447358] pci 0000:03:00.0: reg 0x24: [mem 0x86f00000-0x86ffffff]
[185956.447364] pci 0000:03:00.0: reg 0x30: [mem 0x87000000-0x8701ffff pref]
[185956.447438] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[185956.447564] pci 0000:03:00.0: Adding to iommu group 22
[185956.447570] pci 0000:03:00.0: vgaarb: bridge control possible
[185956.447571] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[185956.447595] pci 0000:03:00.1: [1002:ab30] type 00 class 0x040300
[185956.447606] pci 0000:03:00.1: reg 0x10: [mem 0x87020000-0x87023fff]
[185956.447682] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[185956.447751] pci 0000:03:00.1: Adding to iommu group 23
[185956.471547] pci 0000:03:00.0: BAR 0: assigned [mem 0x6130000000-0x613fffffff 64bit pref]
[185956.471561] pci 0000:03:00.0: BAR 2: assigned [mem 0x6140000000-0x61401fffff 64bit pref]
[185956.471567] pci 0000:03:00.0: BAR 5: assigned [mem 0x86f00000-0x86ffffff]
[185956.471569] pci 0000:03:00.0: BAR 6: assigned [mem 0x87000000-0x8701ffff pref]
[185956.471570] pci 0000:03:00.1: BAR 0: assigned [mem 0x87020000-0x87023fff]
[185956.471572] pci 0000:03:00.0: BAR 4: assigned [io  0x4000-0x40ff]
[185956.471630] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
[185982.931308] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[185982.958294] br0: port 6(vnet9) entered blocking state
[185982.958297] br0: port 6(vnet9) entered disabled state
[185982.958322] device vnet9 entered promiscuous mode
[185982.958393] br0: port 6(vnet9) entered blocking state
[185982.958394] br0: port 6(vnet9) entered forwarding state
[185999.148982] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[185999.148989] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[185999.148992] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[185999.148993] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x450
[186030.378009] br0: port 6(vnet9) entered disabled state
[186030.378189] device vnet9 left promiscuous mode
[186030.378191] br0: port 6(vnet9) entered disabled state
[186031.039678] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
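
A small sketch that computes BAR sizes from the `BAR n: assigned` lines in the dmesg output above, assuming that line format. `bar_sizes` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re

# Matches kernel lines such as:
#   pci 0000:03:00.0: BAR 0: assigned [mem 0x6130000000-0x613fffffff 64bit pref]
BAR_RE = re.compile(
    r"pci (\S+): BAR (\d+): assigned \[(mem|io)\s+0x([0-9a-f]+)-0x([0-9a-f]+)"
)


def bar_sizes(dmesg_text):
    """Map (device address, BAR index) to the BAR size in bytes."""
    sizes = {}
    for match in BAR_RE.finditer(dmesg_text):
        device, bar, _kind, start, end = match.groups()
        # The range is inclusive, so add 1 to get the size.
        sizes[(device, int(bar))] = int(end, 16) - int(start, 16) + 1
    return sizes


# Lines copied from the dmesg output above.
SAMPLE = """\
[185956.471547] pci 0000:03:00.0: BAR 0: assigned [mem 0x6130000000-0x613fffffff 64bit pref]
[185956.471561] pci 0000:03:00.0: BAR 2: assigned [mem 0x6140000000-0x61401fffff 64bit pref]
[185956.471567] pci 0000:03:00.0: BAR 5: assigned [mem 0x86f00000-0x86ffffff]
[185956.471572] pci 0000:03:00.0: BAR 4: assigned [io  0x4000-0x40ff]
"""

if __name__ == "__main__":
    for (device, bar), size in sorted(bar_sizes(SAMPLE).items()):
        print(f"{device} BAR {bar}: {size:#x} bytes")
```

BAR 0 and BAR 2 come out as 0x10000000 and 0x200000 bytes, the same sizes as the two failing `vfio_dma_map` calls in the VM log. That may mean those messages relate to mapping the card's own BARs, but this is only a guess and nothing in the logs here confirms it.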

 

No libvirt or vfio errors.

 

What else can I check or try? I'm getting desperate; I can't reboot the host every time.

If you have any idea, please help me out.

 

Thanks

  • 3 weeks later...

No, I did not. And yes, the vendor-reset module is only for older cards.

I tried a custom kernel, because there is an SR-IOV fix in 6.2, but it didn't help.

If I remove the GPU after VM shutdown, put the server to sleep, and rescan once it wakes, it works, but sleep breaks other things (for example, my Intel iGPU passthrough to another VM), and overall it's just a bad solution.
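
A rough sketch of that remove, sleep, rescan sequence using the standard Linux sysfs files (`remove`, `/sys/power/state`, `/sys/bus/pci/rescan`). The function name, the `dry_run` flag, and the choice of the `mem` sleep state are assumptions for illustration; the device addresses are the ones from the dmesg output earlier in the thread. It must run as root with the VM shut down, and it defaults to a dry run that only lists the writes.

```python
from pathlib import Path


def remove_suspend_rescan(devices, sysfs_root="/sys", dry_run=True):
    """Remove the given PCI functions, suspend the host, then rescan the bus.

    Returns the list of (path, value) writes in the order they are performed.
    With dry_run=True nothing is written.
    """
    root = Path(sysfs_root)
    steps = [
        (root / "bus/pci/devices" / device / "remove", "1")
        for device in devices
    ]
    # Assumption: suspend-to-RAM. The write returns only after the host resumes.
    steps.append((root / "power/state", "mem"))
    # Re-enumerate the bus so the GPU and its audio function come back.
    steps.append((root / "bus/pci/rescan", "1"))

    for path, value in steps:
        if not dry_run:
            path.write_text(value)
    return [(str(path), value) for path, value in steps]


if __name__ == "__main__":
    # GPU and its HDMI audio function, as seen in dmesg.
    for path, value in remove_suspend_rescan(["0000:03:00.0", "0000:03:00.1"]):
        print(f"echo {value} > {path}")
```

As noted above, suspending the host can break other passthrough devices, so this is a workaround to experiment with and not a fix.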

I tried downloading ROMs from other vendors (I have an MSI card) and using those, but that didn't help.

I don't know what to do. I don't want to buy another card just for this, but it's really frustrating.

After reboot it works so well...

Edited by matz7
12 hours ago, matz7 said:

I don't know what to do.

12 hours ago, Johns6977 said:

Did you get any help with this?

I can't help here since nobody is interested in fixing this (not even AMD itself). I hate to say it, but you are on your own. For a more detailed explanation, head over to this conversation on the L1 Forums (this post from gnif himself, the creator of the first vendor-reset fix, is really eye-opening and should make it clear to everyone that AMD is not better than Nvidia).

 

A few years ago I would have replied with something like: "...AMD is not as far along as Nvidia in terms of virtualization, and you had better stick to Nvidia cards for virtualization..." Nowadays I would rather reply with something like: "...AMD learned nothing, and if you plan to use the card for virtualization you should stick to Nvidia; in other words, sell your AMD card and buy Nvidia..."

 

I'm usually not that harsh, and I'm far from an Nvidia fanboy, but it's the cold, (sad,) hard truth...

  • 5 months later...
On 12/14/2023 at 9:12 AM, ich777 said:

I can't help here since nobody is interested in fixing this (not even AMD itself). I hate to say it, but you are on your own. For a more detailed explanation, head over to this conversation on the L1 Forums (this post from gnif himself, the creator of the first vendor-reset fix, is really eye-opening and should make it clear to everyone that AMD is not better than Nvidia).

 

A few years ago I would have replied with something like: "...AMD is not as far along as Nvidia in terms of virtualization, and you had better stick to Nvidia cards for virtualization..." Nowadays I would rather reply with something like: "...AMD learned nothing, and if you plan to use the card for virtualization you should stick to Nvidia; in other words, sell your AMD card and buy Nvidia..."

 

I'm usually not that harsh, and I'm far from an Nvidia fanboy, but it's the cold, (sad,) hard truth...

Spent hours searching high and low for a solution just to end up at this comment. Literally opened ebay right after reading this to see how much I could possibly get for my 7900 XTX. How sad.

 

I need to ask though: have there been any developments on this at all since 2023? Is the 7900 XTX still an "abandon ship" card for VMs?

2 minutes ago, PlexAJ said:

Spent hours searching high and low for a solution just to end up at this comment. Literally opened ebay right after reading this to see how much I could possibly get for my 7900 XTX. How sad.

 

I need to ask though: have there been any developments on this at all since 2023? Is the 7900 XTX still an "abandon ship" card for VMs?

I got it working, no idea how; I spent 50+ hours trying every setting. After about 100 days I had to restart the host because of a RAM upgrade, and now I can't even start the VM the first time. I couldn't get it to start again, so I just bought an Intel A770 (which also has a reset bug, but only if you pass through audio, and I don't need that).

 

So it is possible, but I have no idea how. I flashed random BIOSes, blacklisted the driver, added GRUB parameters, and edited the XML, but I don't know what fixed it. For the 100 days my host was running, I was able to restart the Windows guest with the card.

 

If anyone has info, I would appreciate it, because the A770 is way slower, but I don't want to spend 1k+ on an Nvidia card now...

58 minutes ago, PlexAJ said:

I need to ask though: have there been any developments on this at all since 2023? Is the 7900 XTX still an "abandon ship" card for VMs?

Sadly, what I wrote still applies today.

 

51 minutes ago, matz7 said:

I was able to restart windows guest with the card.

Keep in mind that this always depends heavily on the manufacturer, revision, and so on, and it is still not guaranteed to always work.
