unRAID crashes after GPU Passthrough to Windows KVM


cap089

Hi! 

I have been following unRAID for a while. Now I want to build my own rig with a Windows KVM for gaming, FreeNAS etc.

Before my new processor arrived I tested it on another system, just a few months ago... With the help of SpaceInvaderOne's videos I successfully passed my EVGA GTX 1080 FTW through to the Windows KVM, including dumping the card's BIOS so I can have it in the first PCI slot. So far so good.

 

Yesterday I started to build my own PC and everything seems to work fine - like USB controller passthrough - but when I try to pass through my GTX 1080, unRAID crashes completely and afterwards does not even boot anymore. At first I thought the current version 6.7.2 was buggy, and I also dumped the card's BIOS again... but with version 6.7.0, which I used in my first tests, I have the same problem.

So far I have no clue what's going wrong here. Of course the graphics card and its audio device have their own IOMMU group.

 

Here is some hardware info on the two systems:

- test system was an MSI X470 Gaming Plus with a Ryzen 1700

- current system is an Asus Strix X570-E Gaming with a Ryzen 3900X

 

Can anyone give me some advice, please?

 

Fun fact: The Windows KVM has its own SATA SSD, so after unRAID has crashed and is no longer bootable, the original Windows KVM boots like a normal system (?)

Link to comment

Update: After giving up yesterday, I used the original Windows KVM, which is bootable like a bare-metal system. I installed the newest Nvidia graphics driver and played some games without any issues.

 

Today I created a new Windows VM in unRAID with the same hardware specifications as the first one, and there I just got an ugly image on the screen and nothing else (see below). But at least unRAID does not seem to crash... Yesterday the "VM" tab in the web GUI first became unreachable and then the whole system crashed.

 

Does anyone have any ideas? I couldn't find anything on the web yet.

Note: The graphics card works in the second PCI slot if another card sits in the first slot.

 

Photo-2019-09-23-10-23-03_3315.JPG

 

Update 2: 

Ok, I tested two more things:

1. Tried to pass through an older AMD 270X. I got to the "TianoCore" screen, but after that the monitor stays black - the card is UEFI ready.

2. Tried it with unRAID version 6.7.3-rc4 but got the same result...

 

Update 3: That's the error message I get:

2019-09-23T15:29:53.780080Z qemu-system-x86_64: vfio_region_write(0000:09:00.0:region1+0x2206c8, 0x0,8) failed: Device or resource busy

I also tried to "exclude" the graphics card via vfio-pci.ids, but then unRAID hangs at the boot screen.

So after my tests I think unRAID is not ready for X570 boards.
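
One possible reading of the "Device or resource busy" error is that a host framebuffer driver still claims part of the card's memory. That is an assumption at this point in the thread, not a confirmed diagnosis. A minimal sketch to check it by searching /proc/iomem for the usual framebuffer driver names (the function name is hypothetical):

```shell
#!/bin/bash
# Show which host framebuffer driver, if any, still claims a memory region.
# Reads /proc/iomem by default; pass another file to inspect a saved copy.
find_framebuffer_claims() {
    grep -iE 'efifb|bootfb|vesafb' "${1:-/proc/iomem}"
}

find_framebuffer_claims || echo "no host framebuffer claim found"
```

If a matching line sits under the address range of the GPU at 0000:09:00.0, the host console is probably still attached to the card.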

Edited by cap089
Link to comment

Hi, I scratched my head on this for a while. There is a solution: essentially, unRAID doesn't release the primary GPU for use by a VM.

 

First test with the following commands via SSH with your VM off:

 

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

 

Then try to boot your VM with the GPU passed through. If it boots successfully, you had the same issue as me. Then install User Scripts from Community Applications and set this to run as a little script when the array starts. It even works with auto-starting VMs.
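
For the User Scripts step, the three commands above could be wrapped in a slightly more defensive script. This is only a sketch: the function name and the optional sysfs-root argument are hypothetical additions for dry runs, and the paths are the ones from the commands above.

```shell
#!/bin/bash
# Release the host's virtual consoles and the EFI framebuffer so that a VM
# can take over the primary GPU. The optional argument is a sysfs root
# (hypothetical, for dry runs against a copy); it defaults to /sys.
release_primary_gpu() {
    local root="${1:-/sys}" bind_file fb_dir
    # Detach every virtual console that exists and is writable.
    for bind_file in "$root"/class/vtconsole/vtcon*/bind; do
        if [ -w "$bind_file" ]; then
            echo 0 > "$bind_file"
        fi
    done
    # Unbind the EFI framebuffer only while it is still bound.
    fb_dir="$root/bus/platform/drivers/efi-framebuffer"
    if [ -e "$fb_dir/efi-framebuffer.0" ] && [ -w "$fb_dir/unbind" ]; then
        echo efi-framebuffer.0 > "$fb_dir/unbind"
    fi
}

release_primary_gpu
```

The existence checks make it safe to run the script more than once, for example when the array is stopped and started again without a reboot.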

Link to comment
15 minutes ago, david279 said:

AMD BIOS bug... try the kernel here. This is a bug that was introduced with the AGESA version that added support for the 3000 series.

 

 https://forums.unraid.net/topic/82625-kernelcustom-build-kernel-527-for-latest-ryzen-3000-and-x570-sensors-suppport-and-r8125-driver/

Ok, thanks for the reply. I am actually using the current BIOS version for my board (from 09/12/2019) - so maybe I don't need the custom kernel?

 

But anyway, I will first try what Hankanman said. I also read your thread because I get the same error...

So I will try these commands first, tomorrow. Today I have no more nerves for testing...

And yes, I also dumped the BIOS - the old way, with the GTX 1080 in the second slot - so this should be fine, especially because it worked on the test system.

Link to comment

From all that I've read so far about the X570 boards, in most cases the IOMMU groups are horrible. Tons of devices are grouped together and can't be split, and a couple of things aren't stable under Linux yet. Even Wendel reported in one of his videos that this platform isn't recommended for virtualisation yet. I'm not sure whether the BIOS/AGESA versions released in the last few weeks have already fixed the passthrough issues. I remember the first-gen Threadrippers had nearly the same problems. The platform became stable 3 months or so after release. Until then I had the same issues: random crashes, freezes, GPU passthrough not fully functional, terrible IOMMU groupings. Fingers crossed that your issues get solved soon.

Link to comment

Ok! Hankanman's unbind command fixed it! And it works very well with the User Scripts plugin. Thanks a lot!

 

Now I just need to figure out how to pass my optical drive through to the VM. Passing it through via "form view" as a normal SATA device does not work because unRAID claims that there is no media in it. The manual way explained in other threads on this forum does not work either. Like:

 

 

I also just tried to pass through the whole SATA controller, but for now I do not know how to identify the actual SATA port that the optical drive is plugged into.

Link to comment

UPDATE: 

So, unbinding the GPU for passthrough only worked in a test VM; for my "real" VM it didn't work. I also did a completely new install and there it doesn't work either.

 

I also tested the mentioned custom kernel, but with it the array could not start because of "incompatible md version"? So I think it refers to the signature... The custom kernel provides just MD5 signatures, but on my unRAID boot stick I have SHA256 signatures for "bzimage" and "bzmodules".
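
Whether the files on the boot stick match their SHA256 files can be checked with a few lines of shell. This is only a sketch under assumptions: each .sha256 file sits next to its file under /boot and begins with the hex digest, and the function name is hypothetical. It verifies the files only and says nothing about the "md version" message itself.

```shell
#!/bin/bash
# Compare a file against the digest stored in "<file>.sha256".
# Assumes the first whitespace-separated field of that file is the hex digest.
verify_sha256() {
    local file="$1" expected actual
    expected="$(awk '{print $1; exit}' "$file.sha256")"
    actual="$(sha256sum "$file" | awk '{print $1}')"
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

for f in /boot/bzimage /boot/bzmodules; do
    # Skip files that have no checksum file next to them.
    [ -f "$f" ] && [ -f "$f.sha256" ] || continue
    if verify_sha256 "$f"; then
        echo "$f: checksum OK"
    else
        echo "$f: checksum MISMATCH"
    fi
done
```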

 

In the end I did not get any further and need a break.

Edited by cap089
Link to comment

Final Update: In my first test with the custom kernel, which is available in the other thread, I used the wrong unRAID version, and that caused the error... So today I tested again with the stable version 6.7.2 and the custom kernel modules.

 

First run: In a test environment with a smaller SSD. I set up the unbind script via the User Scripts plugin and installed Windows. After updating, I switched from VNC to the GPU and it worked fine. So time for the next step...

 

Second run: Now using all my drives. I set up the disk, parity and cache devices and deleted the SSD with the "bare-metal" Windows installation. I did exactly the same steps as in the test environment. After passing through the GPU, unRAID again crashes completely - WTF ?!!!

 

So finally I am done with unRAID, it's too buggy. And I am lucky that I have not already bought the software.

Edited by cap089
Link to comment

Sorry to hear that. Unraid is certainly not for the faint of heart, particularly the KVM aspects. And to be honest, that's why the trial license is a reasonable way to let people attempt to make it work for them without being on the hook immediately for the cost of the OS/hypervisor. On the subject of your issues: while I couldn't begin to solve them within a single post here, I'm fairly optimistic that it would be possible to make Unraid work for you. Should you decide to give it another try, let me know. I would be happy to help you remedy your issues.

Link to comment
Hi! Thanks for your reply. I really want to give it another try, but without a "stable GPU passthrough" it's not the right one for me. And I think the Linux kernel has to be updated first to run properly with X570 chipset hardware (?)

Link to comment

I do wonder exactly what you mean by Unraid crashing. Do you get a full system reset back to POST? I've yet to have Unraid actually crash on me. It's possible that passing through a SATA controller destabilizes the system. On my system (Asus Prime X399 board with a Threadripper 1950X), there are no SATA ports whose IOMMU groups don't also contain several system-critical devices, which prevents me from passing through a SATA controller directly. Instead I use vdisks on dedicated SSDs, each meant to host only a single image (1 SSD per VM).

Edited by metathias
Link to comment

By crash I mean that the "VM" tab becomes unresponsive (endless loading) as soon as I start the VM with GPU passthrough switched on. I can still switch to "Dashboard" or any other tab, but after about 20 seconds everything becomes unresponsive and all I can do is a hard reset by pressing and holding the power button.

 

And I am NOT passing through the whole SATA controller. In the form field for the primary drive I just select "Manual" and then type: "/dev/disk/by-id/the_desired_SSD". So that's the same way you are using?

 

At first I planned to pass through the whole SATA controller, but I cannot figure out which port the desired SSD sits on. This is needed because the SATA controllers on my board have different IOMMU groups but the same PCI IDs. The trick of "hiding" the desired device, as I did for my USB controller, doesn't work because I found no command to see which ports my drives are plugged into. For USB controllers I got the right command from one of SpaceInvaderOne's videos:

for usb_ctrl in $(find /sys/bus/usb/devices/usb* -maxdepth 0 -type l); do pci_path="$(dirname "$(realpath "${usb_ctrl}")")"; echo "Bus $(cat "${usb_ctrl}/busnum") --> $(basename $pci_path) (IOMMU group $(basename $(realpath $pci_path/iommu_group)))"; lsusb -s "$(cat "${usb_ctrl}/busnum"):"; echo; done
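
A similar mapping for SATA devices can be read from sysfs, because the resolved path of each block device contains the PCI address of its controller and the ataN port. The following is a sketch, not a command from the videos; the function name is hypothetical and the output format simply mirrors the USB one-liner above.

```shell
#!/bin/bash
# Print the PCI address of the controller that a resolved sysfs block-device
# path hangs off: the last PCI address that appears before the ataN component.
controller_of() {
    local path="$1" part addr=""
    local IFS=/
    for part in $path; do
        case "$part" in
            ata[0-9]*) break ;;
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-9a-f]) addr="$part" ;;
        esac
    done
    echo "$addr"
}

# Walk all disks (sd*) and optical drives (sr*) known to the kernel.
for dev in /sys/block/sd* /sys/block/sr*; do
    [ -e "$dev" ] || continue
    path="$(readlink -f "$dev")"
    ctrl="$(controller_of "$path")"
    [ -n "$ctrl" ] || continue
    port="$(echo "$path" | grep -oE '/ata[0-9]+' | head -n 1 | tr -d /)"
    group="unknown"
    if [ -e "/sys/bus/pci/devices/$ctrl/iommu_group" ]; then
        group="$(basename "$(readlink -f "/sys/bus/pci/devices/$ctrl/iommu_group")")"
    fi
    echo "$(basename "$dev") --> controller $ctrl, port ${port:-n/a} (IOMMU group $group)"
done
```

The optical drive normally shows up as sr0, so its line names the controller that would have to be passed through. The same information also appears in the link names under /dev/disk/by-path.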

 

Link to comment

Ahh ok. Yes, that's the way I pass through my SSDs as well. The hanging you're experiencing in Unraid seems to me very likely a lack of resources on Unraid's part somewhere. Or the 6.7 bug that chokes everyone's bandwidth to death when it starts doing multiple parallel file operations. To be honest I'm still running 6.5.x.

 

Link to comment

I actually have one VM sitting directly on the cache drive as an image. Another gets an NVMe drive passed through. The last and latest one currently uses the manual method described above. All of them work great at the same time, so long as the resource distribution is not overly optimistic. I save about 20% of my total RAM (currently only 32GB) for Unraid and a few dockers.

Edited by metathias
Link to comment
Ok, so under normal conditions this method should not be the reason for my issues, and it is just the incompatibility between unRAID / the Linux kernel and the X570 mainboards. That's the only explanation I have at this moment.

 

You mean a lack of resources? Okay, I don't know... During my tests there were no Docker apps or big file transfers running in the background. Ok, just the parity build/sync process, but this shouldn't be that intensive a load. In my configuration unRAID has access to 4 + 4 cores and 16 GB RAM, I think that's more than enough.

Edited by cap089
Link to comment
14 hours ago, metathias said:

I don't think you're gonna find anything hardware-specific for your chipset in any of the current builds. So the previous builds should be pretty much the same (compatibility-wise). Given how difficult it is to try the lower versions, 6.5 or 6.6 (not very), it's worth a shot IMO.

Hi! I tested it with version 6.6.7 and it's the same problem there. Endless loading and crashing...

Link to comment

Sorry to hear that. Based on the fact that your system seems to start showing issues AFTER the VM attempts to boot, the most likely culprit is a malformed VM config XML. Throughout this thread I'm seeing quite a few suggestions to use all manner of hacks, which should not be necessary AFAIK at this point in time. I'm pretty sure Unraid gladly gives up the primary GPU, for instance, so long as you don't use UEFI and don't run the graphical version of Unraid on boot. To enjoy VM rebootability, though, you have to deploy the VM with a link to its vBIOS.

Link to comment

Ok, what do you mean by a malformed XML config? In my test environment, without the parity build etc. running, it worked like a charm. And there I applied the same steps.

And yes, I use UEFI because otherwise unRAID is not bootable. "Legacy support" is actually switched on in the BIOS of my mainboard, but I think the Asus board behaves buggy here (?)
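
Which mode the host actually booted in can be confirmed from the running system, since the kernel only creates /sys/firmware/efi when it was started through UEFI. A small sketch (the function name and its optional sysfs-root argument are hypothetical):

```shell
#!/bin/bash
# Report how the host was booted. /sys/firmware/efi only exists when the
# kernel was started through UEFI firmware. The optional argument is an
# alternative sysfs root for dry runs.
boot_mode() {
    if [ -d "${1:-/sys}/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "legacy BIOS (CSM)"
    fi
}

echo "Host boot mode: $(boot_mode)"
```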

Edited by cap089
Link to comment

Unraid not being bootable without UEFI sounds a bit odd to me. It's bootable either way for me. Having it on caused a host of issues for me in the past (I can't speak for recent versions of Unraid). The most notable was the inability to actually grab the video card from Unraid and give it back when the VM was starting/ending. When I speak of a malformed XML, I mean a possible misconfiguration of, or incompatibility with, a setting you have set in your VM.

 

Perhaps you could post a copy of your VM's XML and a copy of your IOMMU groups for further examination.
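
The IOMMU groups can be collected over SSH with a short loop over sysfs. This is a sketch rather than an Unraid tool: the function name and its optional directory argument are hypothetical, and lspci is only used for the device description when it is installed.

```shell
#!/bin/bash
# List every PCI device together with its IOMMU group. The optional argument
# is the directory holding the groups; it defaults to the real sysfs location.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}" dev group addr
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue
        group="$(basename "$(dirname "$(dirname "$dev")")")"
        addr="$(basename "$dev")"
        if command -v lspci >/dev/null 2>&1; then
            # Add the device name and [vendor:device] IDs when lspci exists.
            echo "IOMMU group $group: $(lspci -nns "$addr")"
        else
            echo "IOMMU group $group: $addr"
        fi
    done
}

list_iommu_groups | sort -V
```

The VM definition itself can be exported with virsh dumpxml followed by the VM's name, and both outputs can then be pasted into a reply.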

Link to comment
