[SOLVED] AMD Chipset Passthrough sudden failure with Radeon Graphics cards


Posted (edited)

UPDATE: PLEASE SEE MY THIRD POST FOR MY FINDINGS

Good evening,

My VM Engine appears to have severely broken itself. I have a primary Unraid server using a Ryzen 1700X and a Gigabyte B450M DS3H motherboard, plus an add-in ASMedia Technology Inc. ASM1062 Serial ATA Controller. I had used ACS override and configured a Windows VM with a lot of passthrough, so that in addition to my NAS I'd have a close-to-bare-metal lab/work-from-home machine in the garage (I can get into why I configured things the way I did at some point, but I needed USB plug-and-play functionality, and I wanted to be able to use a SATA dock on the PC as needed for various things).

Passed through are an Asus Strix RX570 4GB, the ASMedia SATA controller, and one of the USB controllers.
This unit has worked flawlessly for months (apart from a hiccup I caused in Windows, which prompted a re-installation).

Recently, I tested a second PCI-E video card (GTX 6800) to verify an issue on another test Unraid server I'm tinkering with. The card was never attached to the primary VM and only ever worked with another single-use VM (to verify its behavior and rule out a board problem on the other server; it doesn't reset correctly).

I don't know whether this happened immediately before or after, as I'd been working inside the main house most of the week and not in the garage, but shortly after removing the PCI card, cleaning up some bad cable management, and resuming normal Unraid behavior, the USB mouse and keyboard plugged into the passed-through USB controller had become unresponsive.

 

Troubleshooting:
Fully power cycling the system, as well as countless reboots.
Unconfiguring the USB controller and reconfiguring it.
Removing it from and re-adding it to the pci-stub.ids modifier on the flash drive (in both OS and GUI mode).
Removing it from the XML config and re-adding it.
Recreating the VM using the existing installation of Windows.
Upgrading from 6.8.0 to 6.8.3.

Upon a deeper dive, I found the following log lines:
2020-07-12T09:09:24.619673Z qemu-system-x86_64: vfio: Cannot reset device 0000:02:00.0, depends on group 15 which is not owned.

Here 02:00.0 (group 14) is my USB controller, and group 15 is the onboard (NOT the ASMedia) SATA controller.
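For anyone scanning their own VM logs for the same error, here is a minimal Python sketch that pulls the device address and the dependent group number out of a line like the one above. The function name `parse_reset_error` and the regular expression are illustrative assumptions based only on the wording of the message quoted in this post; other QEMU versions may phrase it differently.

```python
import re

# Pattern built from the message text quoted in this thread (an assumption,
# not an official format definition).
RESET_ERROR = re.compile(
    r"vfio: Cannot reset device "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]), "
    r"depends on group (?P<group>\d+) which is not owned"
)


def parse_reset_error(line):
    """Return (pci_address, group_number) or None if the line does not match."""
    match = RESET_ERROR.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("group"))


if __name__ == "__main__":
    sample = (
        "2020-07-12T09:09:24.619673Z qemu-system-x86_64: vfio: Cannot reset "
        "device 0000:02:00.0, depends on group 15 which is not owned."
    )
    print(parse_reset_error(sample))
```

Feeding it each line of the VM log and collecting the non-None results gives a quick list of which devices fail to reset and which groups they depend on.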

 

2020-07-12T11:09:15.094365Z qemu-system-x86_64: vfio: Cannot reset device 0000:02:00.0, depends on group 15 which is not owned.

IOMMU group 14:	[1022:43d5] 02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller (rev 01)
IOMMU group 15:	[1022:43c8] 02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
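To cross-check a listing like the one above without going through the GUI, the groups can also be read from sysfs. This is a rough sketch that assumes the standard Linux layout of `/sys/kernel/iommu_groups/<group>/devices/<pci address>`; the function name `iommu_groups` and the `root` parameter (there only so the function can be pointed at a test directory) are my own.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} as found under root."""
    groups = {}
    if not os.path.isdir(root):
        # IOMMU disabled or not a Linux host: nothing to report.
        return groups
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(f"IOMMU group {group}: {', '.join(devices)}")
```

On a system matching the listing above, 0000:02:00.0 would be expected under group 14 and 0000:02:00.1 under group 15.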



In addition, as of tonight, while troubleshooting, the video for the VM stopped coming on (though the card outputs normal video until the VM launches), and I've noticed that leaving the VM running eventually locks up the entire Unraid server (or so I speculate; it hasn't locked up without the VM running yet). Recreating the VM gives the same result, even with the settings changed. I'm stuck with the Q35 machine type, as the i440fx machine seems to crap itself with current Radeon video drivers, but I tried other versions before finding the log lines (and after, for giggles) and tried SeaBIOS, among other things. I've also recreated the VM with VNC only, and it blue screens (removing the video card from the original VM gives a VNC error saying the guest hasn't initialized, which is consistent with installing with a GPU and then removing it).

Also, I can verify with lspci -v that the USB controller isn't bound. I'm not super familiar with the inner workings of this, but I'm at a loss at this point as I cannot seem to get the system back up and running, and I'm at the end of the rope for the day. 
 

02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller (rev 01) (prog-if 30 [XHCI])
        Subsystem: ASMedia Technology Inc. Device 1142
        Flags: fast devsel, IRQ 30
        Memory at fcca0000 (64-bit, non-prefetchable) [size=32K]
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [78] Power Management version 3
        Capabilities: [80] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [200] Secondary PCI Express <?>
        Capabilities: [300] Latency Tolerance Reporting
        Capabilities: [400] L1 PM Substates
        Kernel driver in use: pci-stub
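The last line of the lspci output above names the driver the device is currently attached to (pci-stub here). The same information can be read directly from sysfs, which is handy when checking several devices in a script. A minimal sketch, assuming the usual Linux layout where `/sys/bus/pci/devices/<pci address>/driver` is a symlink to the bound driver; `bound_driver` and its `root` parameter are names I made up for illustration.

```python
import os


def bound_driver(pci_address, root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to pci_address, or None if unbound."""
    link = os.path.join(root, pci_address, "driver")
    if not os.path.islink(link):
        # No driver symlink means no driver is currently bound.
        return None
    # The symlink target ends with the driver name, e.g. .../drivers/pci-stub
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    for address in ("0000:02:00.0", "0000:02:00.1"):
        print(address, bound_driver(address))
```

Running it before and after starting the VM shows whether the driver for each device changes at launch.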


I've also found the following in the server logs:
 

Tower kernel: vfio-pci 0000:02:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000000007f1f1c80 flags=0x0000]

Best I can tell, the USB controller has suddenly become linked to the state of the SATA controller used for the array, but I don't understand why this didn't matter before and does now. This was working without issue for over six months.

I'm going to verify the unit isn't overheating next, then perhaps try a new USB drive, though I'd rather not burn my transfer if I can avoid it. I'm also going to try a completely new Windows 10 installation with the video card as it's still not outputting video. 

Edited by mirrion
Posted

I think it's a combination of things. Moving to a new USB flash drive brought the VM back online. The new system isn't appending the USB controller's downstream devices like it used to, which lets me pass through the keyboard and mouse manually, even though it also gives me the option to pass through the USB controller they're hooked into. The error for the USB controller persists:
 

qemu-system-x86_64: vfio: Cannot reset device 0000:02:00.0, depends on group 15 which is not owned.

I am unclear what changed and why the reset is now tied to the SATA controller. Again, it worked for six months without doing this. While I could seek out an add-in USB controller, I'd rather just use one of the two that come with the machine, as their hot-swap ability is very valuable to me in this scenario.

Does anyone have any ideas?

Also, can anyone explain explicitly what 'owned' means here? I'm going to dig around in the BIOS later and see if something accidentally got changed in the relationship between the two controllers.

Posted

So, after a lot of troubleshooting, here is what I determined, and I'm changing the topic title to reflect my findings:

There is a bug somewhere in the software when using RX 500 series cards (I presume all Polaris cards would be affected), wherein you can run headless for a while, but modifying the VM Engine/KVM system with another graphics card can cause this and other functionality to break. Here is a rough timeline of configurations:

Testing phase:

1st Config:
Gigabyte b450m ds3h motherboard (with 2X PCIe)
Ryzen 1700X
EVGA Geforce GTX 970 SC
Radeon 4350
32 GB mixed DDR4 2400 memory


Booting from the Radeon (configured to boot from the bottom PCI slot).
VM passthrough of the GTX 970, an ASMedia Technology Inc. ASM1062 Serial ATA Controller (connected through a converter on the M.2 slot), and the 400 series chipset USB controller.
Attached a 1TB Inland Professional SSD to the SATA controller to run bare metal.

Everything worked without issue - used for 2 months during trial run of Unraid

Purchased the Unraid software:

Use phase:

 

Used the 1st config for about two months.

2nd Config:
Needed the GTX 970 for a project using Windows XP, so I swapped hardware around and removed the GTX 970. The config became:

Gigabyte b450m ds3h motherboard (with 2X PCIe)
Ryzen 1700X
Asus ROG Strix RX570 4GB
Radeon 4350
32 GB mixed DDR4 2400 memory

Booting from the Radeon 4350 (configured to boot from the bottom PCI slot).
VM passthrough of the RX 570, an ASMedia Technology Inc. ASM1062 Serial ATA Controller (connected through a converter on the M.2 slot), and the 400 series chipset USB controller.
Attached a 1TB Inland Professional SSD to the SATA controller to run bare metal.

 

Encountered an issue where installing drivers crashed the VM. Switching to the newest Q35 machine resolved it. Worked for ~3 months without issue.

Wanted to attempt a project Unraid build and pulled the Radeon 4350 for it, but didn't get around to using it. The config worked for another ~3 months.

Transitioned to working inside and stopped using the machine for a few weeks. Decided to really test an Unraid build with an XFX GTX5800 XTreme, found odd reset behavior in the test build, moved the card to my main build to test, and observed the same behavior. Removed it, rebooted the server, and let it sit for several days without use.

Attempted to use the server and found the mouse and keyboard were unresponsive; rebooted with the intent to come back later. Checked again several days later and found them still unresponsive. Rebooted and determined USB was not working at all, by unplugging and replugging devices while checking via RDP.

Attempted troubleshooting, such as toggling device associations with the VM, and video stopped working.

After extensive troubleshooting, including recreating the VM, booting to the 1TB Inland drive directly (which worked), and so on, I eventually reseated all the components in the machine and created a new USB drive. The VM worked (albeit with issues in Windows, likely due to the driver changes from booting it natively rather than in a VM).

Finally, attempting to fix this issue and juggle parts, as the machine had more power than it needed, I replaced the 1700X with a Ryzen 1600 and the RX570 with an XFX RX560 I had, and downgraded the 1TB drive to a 120GB one. The original issues of no video and USB reset errors returned.

Putting some thought into it, I started to suspect a compatibility issue with running headless after boot on the RX500 series, and that the modifications to the video setup were to blame. I re-inserted the Radeon 4350 and the machine instantly booted, everything installed, and it's now running flawlessly.

If you're running a Polaris card and having issues with resetting hardware, I highly recommend finding a second graphics adapter to boot from, as there seem to be issues running without one. I expect that if I were to remove the Radeon 4350 now, it might work without issue until I futz with it again. If I get around to testing this, I'll update here with the results. Hope this helps someone in the future. It's still possible another component is at fault, but I expect it's just that the hardware in use is too new (particularly the video card series, which has other known issues) to have support as stable as older, more tried-and-true architectures. If you encounter a similar issue/bug and get the same results on similar hardware, please post here for people like me in the future who might need the assist.
