AMD GPU Locks Up During Transcoding When Passed Through to Ubuntu/Emby VM


flexage


Hey fellow Unraiders,

 

On my Unraid server, I've got a UEFI Ubuntu Server 20.04 VM running Emby.

 

I've followed SpaceInvaderOne's Advanced GPU Passthrough Techniques guide, and have the AMD RX570 8GB passed through. I also installed the latest Radeon Software for Linux / AMDGPU-PRO drivers.

 

The RX570 is the primary and only GPU installed in the system at present time.

 

The VM boots successfully (although there's no TianoCore screen, just the Ubuntu boot logs whirling by), and I can begin GPU transcoding in Emby.

 

However, after a while the GPU locks up, especially if I start a second transcode or seek to a new video position. Ubuntu and Emby continue to run, but the GPU appears to be hung: I can no longer run `radeontop` or start any GPU-related activity such as transcoding.

 

I made sure to pass through the RX570's HDMI audio device too, and put both functions on the same virtual PCI slot in the VM's XML:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x29' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x29' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x01' multifunction='on'/>
    </hostdev>
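
For anyone who wants to sanity-check the same thing on their own VM, here is a minimal Python sketch of the check I did by eye. It is not from the guide. It assumes the two `<hostdev>` entries are wrapped in a `<devices>` element as they are in the full VM XML, and the function name `check_guest_addresses` is just something I made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two hostdev entries from this post, wrapped in <devices> so the
# fragment has a single root element (as it would in the full VM XML).
SNIPPET = """
<devices>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x29' slot='0x00' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x0' multifunction='on'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x29' slot='0x00' function='0x1'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x01' multifunction='on'/>
  </hostdev>
</devices>
"""


def check_guest_addresses(xml_text):
    """Return a list of problems with the guest-side PCI addresses."""
    root = ET.fromstring(xml_text)
    guest = []
    for hostdev in root.iter("hostdev"):
        # find() only looks at direct children, so this picks the guest-side
        # address and skips the host-side one nested under <source>.
        addr = hostdev.find("address")
        if addr is None:
            continue
        guest.append({
            "bus": int(addr.get("bus"), 16),
            "slot": int(addr.get("slot"), 16),
            "function": int(addr.get("function"), 16),
            "multifunction": addr.get("multifunction") == "on",
        })

    problems = []
    if len({(a["bus"], a["slot"]) for a in guest}) > 1:
        problems.append("devices are not on the same guest bus/slot")
    functions = [a["function"] for a in guest]
    if len(functions) != len(set(functions)):
        problems.append("duplicate function numbers")
    zero = [a for a in guest if a["function"] == 0]
    if not zero:
        problems.append("no device at function 0")
    elif not zero[0]["multifunction"]:
        problems.append("function 0 lacks multifunction='on'")
    return problems


if __name__ == "__main__":
    print(check_guest_addresses(SNIPPET) or "guest addresses look consistent")
```

This only checks that the layout is consistent. It can't say whether the layout has anything to do with the lockups.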

 

Also, the VGA and audio functions were already together in their own isolated IOMMU group, so there should be no need for the ACS override, right?

IOMMU group 15:

[1002:67df] 29:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)

[1002:aaf0] 29:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
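
In case it's useful for comparing against other systems, this is a rough Python sketch of how the grouping above could be double-checked from a shell on the host. It assumes the standard Linux sysfs layout (`/sys/kernel/iommu_groups/<group>/devices/<pci-address>`). The helper names `iommu_groups` and `group_is_isolated` are my own, not from any tool.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted device names it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled, or not running on a Linux host with sysfs.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def group_is_isolated(groups, wanted):
    """True if some group contains exactly the wanted devices and nothing else."""
    return any(set(devices) == set(wanted) for devices in groups.values())


if __name__ == "__main__":
    found = iommu_groups()
    for number in sorted(found):
        print(number, found[number])
    # Addresses taken from the listing in this post.
    print(group_is_isolated(found, ["0000:29:00.0", "0000:29:00.1"]))
```

If the last line prints `True`, the GPU and its audio function share a group with nothing else in it, which is what the listing above shows for group 15.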

 

Looking at the Emby Server transcode logs, I'm seeing some messages I'm not familiar with, although they seem to be about no longer being able to access the GPU:

 

>> ThrottleBySegmentRequest: RequestPosition: 00:05:30 - TranscodingPosition: 00:06:21 - ThrottleBuffer: 51s (Treshold: 120s)
>> ThrottleBySegmentRequest: RequestPosition: 00:05:30 - TranscodingPosition: 00:06:21 - ThrottleBuffer: 51s (Treshold: 120s)
>> ThrottleBySegmentRequest: RequestPosition: 00:05:30 - TranscodingPosition: 00:06:21 - ThrottleBuffer: 51s (Treshold: 120s)
>> ThrottleBySegmentRequest: RequestPosition: 00:05:30 - TranscodingPosition: 00:06:21 - ThrottleBuffer: 51s (Treshold: 120s)
>> ThrottleBySegmentRequest: RequestPosition: 00:05:30 - TranscodingPosition: 00:06:21 - ThrottleBuffer: 51s (Treshold: 120s)
amdgpu: amdgpu_cs_query_fence_status failed.
amdgpu: amdgpu_cs_query_fence_status failed.
amdgpu: amdgpu_cs_query_fence_status failed.
16:23:55.588 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: amdgpu_cs_query_fence_status failed.
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.595 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
16:23:55.595 frame= 1228 fps= 18 q=-0.0 size= 30086kB time=00:06:21.86 bitrate=4789.8kbits/s throttle=off speed=0.752x
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.600 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.605 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.622 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.629 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.633 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
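
Since the log is so repetitive, here is a small Python sketch for condensing it: it counts each kind of error and reports the timestamp of the first timestamped error line. The patterns are based only on the messages in the excerpt above, so treat them as assumptions about what a hang looks like rather than a complete list. The name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the messages in this post's log excerpt.
PATTERNS = {
    "fence_status_failed": re.compile(r"amdgpu_cs_query_fence_status failed"),
    "context_lost": re.compile(r"context is lost"),
    "bitstream_error": re.compile(r"H\.264 bitstream error"),
}
# Timestamps look like "16:23:55.588" at the start of a line.
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s")

SAMPLE = """\
amdgpu: amdgpu_cs_query_fence_status failed.
16:23:55.588 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
amdgpu: amdgpu_cs_query_fence_status failed.
amdgpu: The CS has been cancelled because the context is lost.
amdgpu: The CS has been cancelled because the context is lost.
16:23:55.595 [mpegts @ 0x1a58a40] H.264 bitstream error, startcode missing, size 0
"""


def summarize(lines):
    """Count error lines per pattern and find the first timestamped error."""
    counts = Counter()
    first_timestamp = None
    for line in lines:
        for name, pattern in PATTERNS.items():
            if not pattern.search(line):
                continue
            counts[name] += 1
            if first_timestamp is None:
                match = TIMESTAMP.match(line)
                if match:
                    first_timestamp = match.group(1)
    return counts, first_timestamp


if __name__ == "__main__":
    counts, first = summarize(SAMPLE.splitlines())
    print(dict(counts), first)
```

The amdgpu lines carry no timestamp of their own, so the first timestamped error is only an approximation of when the GPU actually went away.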

 

I've also tried the same setup as above, but with a Radeon 550 2GB in the system, and get exactly the same behaviour and error logs.

 

For shits and giggles I set up a Win 10 VM with a GPU passed through and went to install Emby. However, after I'd installed the official Radeon drivers and opened Chrome, the VM locked up.

 

Any ideas, fellow raiders? Would having 2 GPUs in the system help at all, i.e. the lesser 550 in the first PCIe slot (not passed through to any VM) and the RX570 in another PCIe slot (passed through to the Emby/Ubuntu VM)?

 

TIA
