September 26, 2016 Greetings, I am not sure if this is an issue for the devs, but I have 6x NVidia Quadro M4000 GPUs on my Asus Z10PE-D16 WS motherboard. I am running unRAID 6.2 and am only able to pass through 3 of my 6 GPUs to the VMs. I have only tried 1 VM, switching the GPU each time, because my motherboard assigns GPUs 2, 3 & 4 the IDs 02, 01, 03 in the device list, so I was just trying to figure out the physical output location. However, I don't quite understand why my other 3 GPUs are assigned 81, 82 and 83 and cannot be passed through to my VM. When my motherboard POSTs I can see that all 6 Nvidia GPUs are detected successfully, labeled 1, 2, 3, 81, 82, 83; however, I seem unable to get the last 3 to work. Am I doing something wrong in unRAID? Yes, the GPUs are powered on. Yes, they have enough power (1500W PSU). Yes, I have 4G decoding enabled in the BIOS. I am using the onboard GPU as the 'head' for my current setup, so output video in the BIOS is set to 'onboard device', as a 7th GPU as well. I'm stumped. Help?
September 26, 2016 I'd say it sounds like the vfio-pci error... but that was supposed to be fixed in 6.2. Do you have a CPU installed in each socket, with RAM for each socket?
September 26, 2016 Author Yeah, I have 2x 2696/2699 v4 CPUs with 14 of the 16 DIMM slots populated. I didn't populate the last 2 because they overlap with one of the graphics cards. What can I do about the vfio-pci error? I'm supposed to deliver a working product today.
September 26, 2016 You could try editing the /etc/libvirt/qemu.conf file and adding the groups for the 3 cards that are not working, to see if this is the problem. Look for the cgroup_device_acl section and add the vfio groups of the three cards. You can find the groups under Tools --> System Devices in the webgui. Then disable and re-enable libvirt under Settings --> VM Manager and try to pass one card through.
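For anyone who prefers the command line to the webgui, here is a minimal sketch of how the group numbers could be looked up, assuming the standard Linux sysfs layout (/sys/kernel/iommu_groups and /sys/bus/pci/devices). The function name and the sysfs_root parameter are made up for illustration and are not part of unRAID; 0x10de is NVIDIA's PCI vendor ID.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID used by NVIDIA devices


def nvidia_iommu_groups(sysfs_root="/sys"):
    """Map each NVIDIA PCI address to its IOMMU group number.

    sysfs_root is a hypothetical parameter so the function can be
    pointed at a copy of the sysfs tree; on a live system leave it
    at "/sys".
    """
    root = Path(sysfs_root)
    groups = {}
    # Each entry looks like kernel/iommu_groups/<group>/devices/<pci address>
    for dev in sorted(root.glob("kernel/iommu_groups/*/devices/*")):
        group = dev.parent.parent.name
        vendor_file = root / "bus/pci/devices" / dev.name / "vendor"
        if vendor_file.is_file() and vendor_file.read_text().strip() == NVIDIA_VENDOR_ID:
            groups[dev.name] = int(group)
    return groups
```

Calling nvidia_iommu_groups() on the server should return something like a dictionary of PCI addresses to group numbers; each number N corresponds to a /dev/vfio/N node. Note that a GPU's HDMI audio function shares the same vendor ID, so it will show up in the result as well.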
September 26, 2016 Author "This is the hotfix I mentioned, it's supposed to be rolled into 6.2." Thank you for answering. In the hotfix thread you linked to, the OP says that I should see an error message. Where would I be seeing this error message? As for editing qemu.conf, I checked that each NVidia card was assigned to its own group, i.e. card 81 belongs to group /77/. Do I still need to edit the qemu.conf file, and if so, how do I know I'm doing it right? I don't want to ruin the cards that already work. My Linux skills are already good, but I am not experienced with unRAID.
September 26, 2016 Yes, you need to add /dev/vfio/77 to the qemu.conf file. You will probably need to uncomment some parts; take a look in the vfio hotfix thread, download the qemu.conf there, and compare the two files. Then you'll see what to uncomment. Do not add all the devices from that file, only the vfio entries for the 3 non-working cards.
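To make the suggested edit concrete, here is a rough sketch of appending /dev/vfio/N entries to the cgroup_device_acl list. It assumes the list is already uncommented and uses the usual bracketed, quoted-string layout of qemu.conf. The helper name add_vfio_groups is made up for illustration, and only group 77 comes from this thread; any other group numbers would have to be read from your own system. Back up the file before trying anything like this.

```python
import re


def add_vfio_groups(conf_text, group_numbers):
    """Return conf_text with "/dev/vfio/<n>" entries appended to cgroup_device_acl.

    Assumption: the cgroup_device_acl list is uncommented. A line that
    starts with '#' is deliberately not matched.
    """
    pattern = re.compile(
        r"^(cgroup_device_acl\s*=\s*\[)(.*?)(\])", re.DOTALL | re.MULTILINE
    )
    match = pattern.search(conf_text)
    if match is None:
        raise ValueError("no uncommented cgroup_device_acl list found")

    body = match.group(2)
    existing = set(re.findall(r'"([^"]+)"', body))
    new_entries = [
        f"/dev/vfio/{n}" for n in group_numbers if f"/dev/vfio/{n}" not in existing
    ]
    if not new_entries:
        return conf_text  # nothing to add, leave the file untouched

    addition = ", ".join(f'"{entry}"' for entry in new_entries)
    stripped = body.rstrip()
    # Add a separating comma only if the list has entries and no trailing comma
    sep = ",\n    " if stripped and not stripped.endswith(",") else "\n    "
    new_body = stripped + sep + addition + "\n"
    return conf_text[: match.start(2)] + new_body + conf_text[match.end(2):]
```

The function only returns the new text; writing it back to /etc/libvirt/qemu.conf and restarting libvirt from the VM manager settings is left as a manual step, so the existing entries for the working cards stay as they are.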
September 26, 2016 Author Update! I was checking the system logs and found that the only difference between an NVidia card that works and one that doesn't was the following line: Kernel: vfio-pci 0000 02:00.0 enabling device (0100 -> 0103) This line does not appear for the cards assigned to 81, 82 or 83, but does appear for cards 01, 02 (shown above), and 03. I replaced the qemu.conf file with the one from the hotfix you provided, and now the following line appears: Kernel: vfio-pci 0000 81:00.0 enabling device (0100 -> 0103) However, the display still does not appear for any of the 81, 82, or 83 cards. I cannot replace the 'domain' file provided in the hotfix because neither the directory nor the file exists in 6.2. This really is strange.
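The log comparison described above can be automated with a small sketch like the one below. It scans syslog lines for the vfio-pci "enabling device" message and reports which expected cards never produced it. The function names are made up for illustration, and the pattern is written loosely so that it accepts the line as quoted in this thread as well as the colon-separated form (0000:81:00.0:); that looseness is an assumption, not a documented log format.

```python
import re

# Captures the bus:device.function part, e.g. "81:00.0"
ENABLE_RE = re.compile(
    r"vfio-pci\s+(?:[0-9a-fA-F]{4}[:\s])?"
    r"([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):?\s+enabling device"
)


def enabled_vfio_devices(log_lines):
    """Return the set of PCI addresses that logged 'enabling device'."""
    found = set()
    for line in log_lines:
        match = ENABLE_RE.search(line)
        if match:
            found.add(match.group(1).lower())
    return found


def missing_devices(log_lines, expected):
    """Return expected addresses that never logged the message, sorted."""
    return sorted(set(expected) - enabled_vfio_devices(log_lines))
```

As the update itself shows, seeing the message for 81:00.0 does not guarantee that the card outputs video, so this is only a first check.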
September 27, 2016 I note you said you're supposed to deliver a working product. Might it be worth looking into a paid support session with one of the Limetech folks? https://lime-technology.com/services/
September 27, 2016 If you had read the hotfix thread OP, you would have seen that it's either the domain.img or copying the qemu.conf file. It's also stated that this is not the fix that is implemented in 6.2. That was the reason I told you to add the vfio devices to the correct section instead of using the whole config file. There might be things they changed in the config for 6.2 that could break now that you are using an old file. On a side note, I would suggest that you test the configuration next time before promising a product. It's not exactly a normal setup you have there. I would contact Limetech support to try to solve this, as the previous poster also suggested.