Second GPU Passthrough (SOLVED)

bat2o · July 23, 2020

I am having trouble passing through a second GPU. My first GPU passthrough works great. During my trials with the second GPU, the system loses the ability to write to the “domains” disk (on my system, that is my cache drive). Each time I have to shut the system down and usually reformat the drive in the array.

Here is my system:

Motherboard: Micro-Star International Co., Ltd - B450 TOMAHAWK MAX (MS-7C02)

Processor: AMD Ryzen 7 3700X 8-Core @ 3.6 GHz

GPU1: XFX Radeon RX 580 8 GB (Graphics: [1002:67df] 29:00.0 / Sound: [1002:aaf0] 29:00.1)

GPU2: SAPPHIRE Radeon RX 550 DirectX 12 100414P4GL 4GB (Graphics: [1002:699f] 25:00.0 / Sound: [1002:aae0] 25:00.1)

In order to get GPU1 to work I had to include the GPU’s vfio-pci.ids into the syslinux configuration.

vfio-pci.ids=1002:67df,1002:aaf0,1002:699f,1002:aae0

When I added the GPU2 vfio-pci.ids (1002:699f,1002:aae0) to the syslinux configuration, unRAID would not boot with the PCIe ACS override set to ‘downstream’ (pcie_acs_override=downstream). So I am running ‘both’: pcie_acs_override=downstream,multifunction.
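For reference, the parameters described above all end up on the append line of syslinux.cfg. Below is a minimal Python sketch that assembles such a line from the IDs and override mode quoted in this post. The helper name build_append_line is hypothetical, and putting the extra parameters before initrd=/bzroot is an assumption about a typical unRAID syslinux.cfg, not something confirmed by the diagnostics.

```python
import re

# vendor:device pair as printed in the hardware list, e.g. "1002:67df"
PCI_ID_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{4}$")


def build_append_line(vfio_ids, acs_override=None, initrd="/bzroot"):
    """Assemble a syslinux 'append' line (hypothetical helper, for illustration only)."""
    params = []
    if vfio_ids:
        for pci_id in vfio_ids:
            if not PCI_ID_RE.match(pci_id):
                raise ValueError(f"not a vendor:device ID: {pci_id!r}")
        # vfio-pci.ids takes a comma-separated list of vendor:device pairs
        params.append("vfio-pci.ids=" + ",".join(vfio_ids))
    if acs_override:
        params.append("pcie_acs_override=" + acs_override)
    params.append("initrd=" + initrd)
    return "append " + " ".join(params)


if __name__ == "__main__":
    # IDs for GPU1 and GPU2 (graphics + sound) as listed in this post
    ids = ["1002:67df", "1002:aaf0", "1002:699f", "1002:aae0"]
    print(build_append_line(ids, "downstream,multifunction"))
```

With no IDs and no override the helper returns the plain "append initrd=/bzroot" line, which is a convenient way to get back to a default boot when troubleshooting.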

Below is a list of settings I tried. Each one ended with me losing the ability to write to the ‘domains’ share.

- Changed domains to reside on disk1

- Changed the VM's machine type to Q35 per spaceinvader one (https://www.youtube.com/watch?v=QlTVANDndpM&t=509s). Initial trials were with machine type i440fx.

- Added the ‘Graphics ROM BIOS’ for GPU2

- Added multifunction='on' to the VM XML file per spaceinvader one (https://www.youtube.com/watch?v=QlTVANDndpM&t=509s).

- Updated to unRAID 6.9.0-beta25

I have not tried the following

- Changed the ‘Server boot mode’ to Legacy. I cannot get the system to boot in legacy.

- Swap GPU location on the motherboard

- Tried another GPU

- Update bios

Each trial had varying levels of success, but all eventually ended the same way. The setup that kept the system working for the longest period had the following settings (diagnostic file name included):

- Machine: Q35-4.2

- vbios: NA

- Multifunction: on

- unRAID OS: 6.8.3

- Diagnostic file: batcave-diagnostics-20200721-2302.zip

Here is my latest trial, which didn’t get far.

- Machine: Q35-5.0

- vbios: NA

- Multifunction: on

- unRAID OS: 6.9.0-beta25

- Diagnostic file: batcave-diagnostics-20200723-0918.zip

Here are some forums that indicate similar issues.

Seems like this person has a similar issue, but no resolution shared.

https://forums.unraid.net/topic/86519-unraid-680-1660-gtx-gpu-passthrough/

This person was able to fix it with a bios update (not unRAID).

https://www.reddit.com/r/VFIO/comments/g5hi4k/going_mad_help_needed_with_gpu_passthrough/

This person got it to work by changing their GPU.

https://forums.unraid.net/topic/79134-ryzen-internal-graphics-passthrough/

batcave-diagnostics-20200721-2302.zip batcave-diagnostics-20200723-0918.zip

Edited July 24, 2020 by bat2o

bat2o · July 25, 2020

On 7/23/2020 at 10:20 AM, bat2o said:

I have not tried the following

- Changed the ‘Server boot mode’ to Legacy. I cannot get the system to boot in legacy.

- Swap GPU location on the motherboard

- Tried another GPU

- Update bios

An update on additional attempts. I removed GPU1 (address 29:00) from its PCI-E x16 slot and placed GPU2 into it. A VM with GPU passthrough ran very well there for about an hour. I then moved GPU2 from that slot (address 29:00) back to its original slot (address 25:00), without reconnecting GPU1. The VM with GPU passthrough worked well (I ran it for an hour), but the system crashed like in previous attempts when I tried shutting down the VM (diagnostics included).

Because GPU2 worked well at address 29:00 and not at 25:00, I believe the problem has to do with how unRAID and the motherboard are handling that address. By the way, in the IOMMU grouping 25:00 is in its own group. So I updated the BIOS to the most current version (Version: 7C02v37; Release Date: 2020-06-15).
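For anyone who wants to check the IOMMU grouping from a terminal instead of the unRAID GUI, here is a minimal Python sketch that reads the standard Linux sysfs tree at /sys/kernel/iommu_groups. The function names iommu_groups and group_of are hypothetical, and the full address 0000:25:00.0 assumes PCI domain 0000, as in the log line quoted later in this thread.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled, or not running on Linux
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def group_of(address, groups):
    """Return (group number, devices) for a PCI address, or None if not found."""
    for number, devices in groups.items():
        if address in devices:
            return number, devices
    return None


if __name__ == "__main__":
    found = iommu_groups()
    for number in sorted(found):
        print(f"IOMMU group {number}: {', '.join(found[number])}")
    # The second GPU's slot in this thread (domain 0000 assumed)
    print(group_of("0000:25:00.0", found))
```

A device that shares its group only with its own functions is the usual target for passthrough. Note that the ACS override changes how these groups are reported, so the listing with the override enabled may not reflect the real isolation of the hardware.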

At first unRAID booted just fine, but when I tried running the VM the system crashed again (I didn't get a diagnostic file). After the reboot unRAID wouldn't load. While investigating why, I discovered that many files had disappeared from the flash drive. I restored the flash drive from a backup, but unRAID still wouldn't load. I reset the syslinux configuration (syslinux.cfg) to the default so I could see the boot-up display on my monitor (attached is a photo of that boot screen). It showed errors similar to those in the log files after the VM crashes (i.e., iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=25:00.0 ...). So I removed the GPU from that slot (address: 25:00), and unRAID is able to boot.
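To see how often these events occur and which device they name, the syslog from a diagnostics zip can be scanned with a few lines of Python. This is only a sketch: the regular expression is based solely on the fragment quoted above (IOTLB_INV_TIMEOUT device=25:00.0), the rest of the log line is not assumed, and the function name count_iotlb_timeouts is hypothetical.

```python
import re
from collections import Counter

# Matches only the part of the event shown in this post; the remainder of the
# line is deliberately ignored because it was elided in the quote.
EVENT_RE = re.compile(
    r"IOTLB_INV_TIMEOUT device=([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_iotlb_timeouts(lines):
    """Count IOTLB_INV_TIMEOUT events per PCI device in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Typical use would be to open the extracted syslog file (the exact path inside the zip is not assumed here), pass the file object to count_iotlb_timeouts, and print the resulting counter. If every event names 25:00.0, that supports the idea that the problem follows the slot and not the card.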

Syslinux configuration:

kernel /bzimage
append initrd=/bzroot

I have also been trying to boot my system in legacy mode, but haven't been successful. From what I understand, my BIOS settings allow a legacy boot (BIOS Mode: CSM; Boot Mode: Legacy + UEFI). However, in my boot options the system only recognizes the flash drive as a UEFI USB drive.

batcave-diagnostics-20200723-1912.zip

bat2o · July 28, 2020

On 7/23/2020 at 10:20 AM, bat2o said:

I have not tried the following

- Changed the ‘Server boot mode’ to Legacy. I cannot get the system to boot in legacy.

- Swap GPU location on the motherboard

- Tried another GPU

- Update bios

I have now conducted all of these trials. I was able to run unRAID in legacy mode, tried another GPU (Sapphire RX 580), and ran with vbios files, first for the primary GPU only and then for both the primary and secondary GPUs. All with similar results.

On 7/24/2020 at 6:16 PM, bat2o said:

I believe it has to do with how unRAID and the motherboard are handling that address.

I still believe it has something to do with how unRAID is handling the address. For instance, my latest attempt resulted in the parity drive being disabled (diagnostics below). For this attempt I ran both GPUs and compared their entries in the log file from when the unRAID OS boots. They are similar, but address 25:00 has this line:

Quote

pci 0000:25:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x2 link at 0000:20:04.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)

I don't know what that means, but it is different from 29:00, where the primary GPU is located.
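As a rough aid to reading that log line: it says a link at 0000:20:04.0 is running at 5 GT/s with 2 lanes, while the card could use 8 GT/s with 8 lanes. The two Gb/s figures can be reproduced with simple arithmetic, as in the Python sketch below. The encoding overheads (8b/10b up to 5 GT/s, 128b/130b from 8 GT/s) are standard PCIe facts; the integer rounding per lane is an assumption chosen so the results match the numbers in the log, and the function name is hypothetical.

```python
# Line-encoding efficiency per link speed in GT/s: (payload bits, total bits)
ENCODING = {
    2.5: (8, 10),     # PCIe 1.x, 8b/10b
    5.0: (8, 10),     # PCIe 2.0, 8b/10b
    8.0: (128, 130),  # PCIe 3.0, 128b/130b
    16.0: (128, 130), # PCIe 4.0, 128b/130b
}


def link_bandwidth_gbps(gt_per_s, lanes):
    """Usable bandwidth of a PCIe link in Gb/s after line-encoding overhead."""
    payload_bits, total_bits = ENCODING[gt_per_s]
    # Work in whole Mb/s per lane (assumed rounding, see the note above)
    per_lane_mbps = int(gt_per_s * 1000) * payload_bits // total_bits
    return per_lane_mbps * lanes / 1000


if __name__ == "__main__":
    print(link_bandwidth_gbps(5.0, 2))  # 8.0, the "available" figure in the log
    print(link_bandwidth_gbps(8.0, 8))  # 63.008, the "capable of" figure in the log
```

So the message itself only reports a narrower, slower link for the 25:00 slot than the card supports. It explains why the line is absent for 29:00 but does not by itself indicate a fault.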

tower-diagnostics-20200727-1901.zip

Edited July 28, 2020 by bat2o

meep · July 28, 2020

Hi. Came here from a post on reddit.

I had a look at your diagnostics there, and there are a few oddities - mainly the 'Pod' and 'Tumbler' VM XML files appear to be truncated, so I cannot see how the GPUs are assigned.

Also, from analysing this, I think you might be looking in the wrong place, or at the wrong thing. Your logs are filled with BTRFS and disk errors. Rather than thinking the second GPU/VM is causing the crash / disk errors, have you considered that the disk errors might be causing the VM crash?

It looks like your 'domains' share (where the VM images are stored) is set to cache prefer and your cache is BTRFS. If there's a disk error, your VM files could be corrupted and cause the VM to crash. In my experience, BTRFS has been a very poor option for the cache file system (two catastrophic crashes resulting in data loss).

My next step here would be to reconfigure my cache to be a single drive with XFS (preferably a different physical drive on a different controller, to eliminate possible issues with a bad disk, controller or cable). Then I'd re-create my VMs, ensuring the data is on this safer single drive, and see where it goes from there.

Also, check out the blog link in my footer for some tips on GPU config. It looks like you've done a lot of them already, but you might find a nugget!

bat2o · July 29, 2020

Attached are the XML files. For your reference, Tumbler is meant for GPU1 (29:00) and Pod is meant for GPU2 (25:00). Throughout my trials I have also created new VM XML files.

You are correct that it corrupts my vdisk files. After crashes I usually have to replace them with my backup versions.

13 hours ago, meep said:

Rather than thinking the second GPU/VM is causing the crash / disk errors, have you considered that the disk errors might be causing the VM crash?

This is a possibility I'll look into. I don't believe it is the cause, though, because I was running the Tumbler VM on GPU1 for over a month with no issues, and I created the Pod VM through VNC during that time. I only started seeing these issues when I tried to set up the Pod VM with GPU2.

VM_XMLs.zip

Edited July 29, 2020 by bat2o
Including xml files

bat2o · November 11, 2020

Replaced my motherboard with:

ASRock - X570 Phantom Gaming 4

Now it works great.

scorcho99 · February 11, 2021

This is a resolved post but I appreciate the data and resolution information.

My hardware behaved similarly:

On my MSI X470 I noticed that when I used the 3rd slot for VGA passthrough, some of my SATA disks in the array dropped out, which sounds like the issue you were having as well. My theory was that the IOMMU was having some sort of resource conflict or misdirection with devices running off the chipset. The fact that I had to use the ACS override to force this configuration to work points that way too. I've always thought people were a bit eager to solve the problem with the ACS override, since I've heard it has potential stability concerns.

I only recall testing an older AMD GPU. I believe all tests were with unRAID 6.8. Since 3rd slot passthrough was just a test and not required for my use case, I kept using the board, and it has worked fine with dual GPUs passed through off the CPU slots. I'm still running the ACS override option to pass through USB cards; it only seemed like the GPU mucked things up.
