'VIDEO_SCHEDULER_INTERNAL_ERROR' on Windows 10 VM


Hi all,

 

I need some help working out how to start troubleshooting my problem.

 

Currently running Unraid 6.8.3 on the following hardware:

Motherboard: ASRock X570 Taichi, latest BIOS version P3.40

CPU: AMD Ryzen 9 3950X

Memory: 128 GB DDR4

GPU: 3x GTX 1080 Ti Gaming X 11 GB

Hard drives: 3x 1 TB SSDs and 1x 1 TB NVMe drive as unassigned mounted devices (plus 2 hard disks, 1 parity and 1 array disk with just the default shares, and a 1 TB NVMe cache drive).

PSU: 1600 W (each GPU is connected properly with 2 GPU power cables, and the second power connector on the motherboard is connected as well).

 

Settings:

HVM enabled, IOMMU enabled.

No PCIe ACS override, no VFIO allow unsafe interrupts.

BIOS boot mode is Legacy; C-states are switched off, as one of the guides advises for Ryzen builds.

No Docker containers running, and no VMs other than the 3 W10 VMs described below.

 

As stated above, I have 3 W10 gaming VMs, each with 4 pinned cores plus the corresponding HT siblings, 16 GB of memory, an SSD mounted by device ID, and a dedicated GPU, all using the same custom vBIOS (I also tried a W10 VM with the NVMe drive mounted by device ID).

All settings are identical. 2 of the W10 gaming VMs run fine and stable, both simultaneously and individually; the problem starts when I fire up the 3rd one (from either the last remaining SSD or the NVMe drive, it does not matter).

The VM has a fresh W10 install, and within a couple of minutes of starting it crashes with the (graphics card) stop code 'VIDEO_SCHEDULER_INTERNAL_ERROR' on the familiar Windows blue screen. Sometimes it freezes my other VMs too, and once it threw the same error on one of the other VMs at the same time. The GPU is not broken; its output works.

 

What I have tried so far:

- Tested with different NVIDIA drivers -> no solution/fix (driver tested via Windows Update and by downloading it manually; also installed the NVIDIA Control Panel).

- Checked if the vBIOS was mounted -> yes, they all use the same vBIOS file.

- Checked pinned CPU cores/HT siblings -> all different, within the isolated range, and mapped correctly.

- Checked cables and PCIe extenders -> all fine.

- Checked if the PCIe lane was enabled -> yes (no 3rd NVMe drive is inserted on the motherboard, which would disable the last slot).

- Checked for errors -> no errors in the logs of the machines.

- Launching only VM 3 so the other two are not running.

- Checked the VM on the other GPUs -> no problems encountered so VM Image should be fine.

- Checked the IOMMU groups, all GPUs are on different groups:

Quote

IOMMU group 24:

[10de:1b06] 04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

[10de:10ef] 04:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

IOMMU group 31: 

[10de:1b06] 0f:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

[10de:10ef] 0f:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

IOMMU group 32: 

[10de:1b06] 10:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

[10de:10ef] 10:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
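For anyone who wants to repeat the group check from a shell instead of the web UI, here is a minimal sketch. It assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups; the function names `iommu_groups` and `mixed_groups` are only illustrative, not part of any tool.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled or not a Linux host: nothing to report.
        return groups
    for group_dir in base.iterdir():
        dev_dir = group_dir / "devices"
        if group_dir.name.isdigit() and dev_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in dev_dir.iterdir())
    return groups


def mixed_groups(groups):
    """Return only the groups that contain functions from more than one PCI slot."""
    mixed = {}
    for number, devices in groups.items():
        # "0000:04:00.0" and "0000:04:00.1" share the slot "0000:04:00".
        slots = {addr.rsplit(".", 1)[0] for addr in devices}
        if len(slots) > 1:
            mixed[number] = devices
    return mixed


found = iommu_groups()
for number in sorted(found):
    print(f"IOMMU group {number}: {', '.join(found[number])}")
print("Groups spanning more than one slot:", sorted(mixed_groups(found)))
```

A GPU and its HDMI audio function sharing a group (as in the listing above) is expected; a group that also contains a device from another slot is the case worth looking at.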

 

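When I check the detailed device information with lspci -v, I notice that on the problem-causing GPU (device 10:00.0) the three memory regions and the I/O ports are marked [disabled], and that the 'Memory at xxxxxxxx' addresses differ between the cards. Its output also lacks the 'bus master' flag and the 'Kernel driver in use: vfio-pci' line that the other two GPUs show.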

Quote

04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] GP102 [GeForce GTX 1080 Ti]
        Flags: bus master, fast devsel, latency 0, IRQ 38
        Memory at ea000000 (32-bit, non-prefetchable)
        Memory at 90000000 (64-bit, prefetchable)
        Memory at a0000000 (64-bit, prefetchable)
        I/O ports at d000
        Expansion ROM at eb000000 [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express <?>
        Kernel driver in use: vfio-pci

0f:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] GP102 [GeForce GTX 1080 Ti]
        Flags: bus master, fast devsel, latency 0, IRQ 127
        Memory at ee000000 (32-bit, non-prefetchable)
        Memory at d0000000 (64-bit, prefetchable)
        Memory at e0000000 (64-bit, prefetchable)
        I/O ports at f000
        Expansion ROM at 000c0000 [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express <?>
        Kernel driver in use: vfio-pci

10:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] GP102 [GeForce GTX 1080 Ti]
        Flags: fast devsel, IRQ 5
        Memory at ec000000 (32-bit, non-prefetchable) [disabled]
        Memory at b0000000 (64-bit, prefetchable) [disabled]
        Memory at c0000000 (64-bit, prefetchable) [disabled]
        I/O ports at e000 [disabled]
        Expansion ROM at ed000000 [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express <?>
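To make the differences in the output above easier to spot, here is a rough sketch that parses `lspci -v` text and reports, per device, whether the bus master flag is set, which memory or I/O regions are marked [disabled], and which kernel driver is bound. The function name `summarize_lspci` is hypothetical, and the sample is a trimmed copy of the output above; the Expansion ROM line is ignored because it is [disabled] on all three cards.

```python
def summarize_lspci(text):
    """Collect bus-master flag, disabled regions and bound driver per device."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Device header line, e.g. "10:00.0 VGA compatible controller: ..."
            current = {"bus_master": False, "disabled": [], "driver": None}
            devices[line.split()[0]] = current
            continue
        if current is None:
            continue
        field = line.strip()
        if field.startswith("Flags:"):
            current["bus_master"] = "bus master" in field
        elif field.startswith(("Memory at", "I/O ports at")) and field.endswith("[disabled]"):
            current["disabled"].append(field)
        elif field.startswith("Kernel driver in use:"):
            current["driver"] = field.split(":", 1)[1].strip()
    return devices


# Trimmed copy of the lspci -v output from this post.
SAMPLE = """\
0f:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
        Flags: bus master, fast devsel, latency 0, IRQ 127
        Memory at ee000000 (32-bit, non-prefetchable)
        I/O ports at f000
        Expansion ROM at 000c0000 [disabled]
        Kernel driver in use: vfio-pci

10:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
        Flags: fast devsel, IRQ 5
        Memory at ec000000 (32-bit, non-prefetchable) [disabled]
        I/O ports at e000 [disabled]
        Expansion ROM at ed000000 [disabled]
"""

for address, info in summarize_lspci(SAMPLE).items():
    print(address, info)
```

On the real host, the same function could be fed the full text of `lspci -v` instead of the sample.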

Is anyone else having the same issue?

Edited by Timothyy