UNRAID freezes, hard reset



Yesterday, I remotely accessed one of my VMs with Nvidia GPU passthrough, and suddenly the whole VM froze and disconnected. Subsequently, the entire UNRAID server became unresponsive, and unfortunately, the only solution was a hard reset. This has occurred twice, and I'm now concerned that if I tempt fate and start the VM a third time, it could cause irreparable damage to the server, potentially leading to data loss.

 

Memtest reports no errors.

 

Below is a snippet of what the logs show. Note that the Docker log is also 100% full:

May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
          

 

The log entries show recurring PCIe link errors. "BadDLLP" means the receiver got a Data Link Layer Packet that failed its CRC check, so corrupted data on the link is a possible cause.

The device [10b5:8714] (see the IOMMU groups below) is probably the culprit, because it is connected to that PCIe port.

The repeating "severity=Corrected" entries suggest an underlying issue that, while currently correctable, persists. The "AER: Multiple Corrected error received" messages logged by the port at 0000:40:01.3 indicate multiple corrected errors from the device at 0000:46:02.0.
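
To see how many of these errors there are and whether they all come from the same device, the log can be tallied rather than read line by line. Below is a minimal Python sketch that assumes the lines look exactly like the snippet above; `summarize_aer` is a made-up helper name, and the two regular expressions are my own guess at the format, not anything shipped with Unraid.

```python
import re
from collections import Counter

# Matches e.g. "pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected,
# type=Data Link Layer, (Receiver ID)". Format assumed from the snippet above.
BUS_ERROR = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=(?P<sev>[^,]+), type=(?P<layer>[^,]+)"
)
# Matches the decoded status bit lines, e.g. "   [ 7] BadDLLP".
STATUS_BIT = re.compile(r"\[\s*(?P<bit>\d+)\]\s+(?P<name>\w+)")


def summarize_aer(lines):
    """Count bus errors per (device, severity, layer) and per status bit name."""
    errors = Counter()
    bits = Counter()
    for line in lines:
        match = BUS_ERROR.search(line)
        if match:
            errors[(match["dev"], match["sev"], match["layer"])] += 1
            continue
        match = STATUS_BIT.search(line)
        if match:
            bits[match["name"]] += 1
    return errors, bits
```

It could be run over the whole log with something like `summarize_aer(open("/var/log/syslog"))`, where the path is an assumption about where the syslog lives on the server. If every count lands on 0000:46:02.0 with BadDLLP, that points at a single link rather than a system-wide problem.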

 

 

IOMMU group 53: [10b5:8714] 45:00.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)
IOMMU group 54: [10b5:8714] 46:01.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)
IOMMU group 55: [10b5:8714] 46:02.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

 

 

PCIe ACS override is set to "both".

 

Attached, you will find the diagnostics.zip folder.

 

OS: unRAID 6.12.0-rc5

 

Hardware:
Threadripper 1950X
ASUS Zenith Extreme
PCIe 1: PNY NVIDIA Quadro P2000 5GB
PCIe 2: M.2 x16 adapter, Gen 3
PCIe 3: NVIDIA RTX 3060 12GB
PCIe 4: HBA card, LSI SAS 9207-8i

 

 

tower-diagnostics-20230512-2120.zip

Edited by SimpleDino

There are a lot of AER errors; I think the crash happens because the log is filling up.

The source of the issue could be hardware or software.

If I were you, I would first clean the slots and check the cables.

If it's software related, it could be fixed by a newer kernel.

In the meantime, to keep these messages from showing up, you can try adding pcie_aspm=off to your syslinux config.

Otherwise, you can try pci=noaer, which turns off AER reporting.
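
Both options are kernel boot parameters, so they go on the `append` line of the syslinux config. Here is a minimal Python sketch of that edit. It assumes the config sits at /boot/syslinux/syslinux.cfg and that the boot entries use lines like `append initrd=/bzroot` (both are assumptions about a stock Unraid flash drive), and `add_kernel_param` is a hypothetical helper, not an Unraid tool.

```python
def add_kernel_param(cfg_text, param):
    """Return cfg_text with `param` added to every `append` line that lacks it.

    Assumption: each boot entry has one `append ...` line holding the
    kernel command line, as in a typical syslinux.cfg.
    """
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        is_append = stripped.lower().startswith("append ")
        if is_append and param not in stripped.split():
            line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"


# Example entry modelled on a typical Unraid boot label (assumed layout).
example = (
    "label Unraid OS\n"
    "  menu default\n"
    "  kernel /bzimage\n"
    "  append initrd=/bzroot\n"
)
updated = add_kernel_param(example, "pci=noaer")
```

Note that this adds the parameter to every boot entry, which may be more than you want, and the change only takes effect after a reboot. Afterwards, `cat /proc/cmdline` should show whether the parameter was picked up.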

1 hour ago, ghost82 said:

There are a lot of AER errors; I think the crash happens because the log is filling up.

The source of the issue could be hardware or software.

If I were you, I would first clean the slots and check the cables.

If it's software related, it could be fixed by a newer kernel.

In the meantime, to keep these messages from showing up, you can try adding pcie_aspm=off to your syslinux config.

Otherwise, you can try pci=noaer, which turns off AER reporting.

 

Thanks for the quick response.

 

I'm curious whether the pcie_aspm=off kernel parameter applies globally to all PCIe slots. Specifically, I'm wondering if it will have an impact on the 4x M.2 PCIe adapter.

Given the circumstances, pci=noaer might be the preferable solution for the time being. The peculiar thing is, I hadn't experienced any issues until recently.

 

The problem likely originates from PCIe slot 3, which houses the RTX 3060. As mentioned in my post, Unraid and the VM only freeze when the GPU is passed through to the Windows VM.

The GPU in PCIe slot 3 operates fine outside of the VM when used for stable diffusion or other tasks. However, the AER error persists, and this error first appeared after the initial system freeze.
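
Since the errors are reported by the bridge at 0000:46:02.0 rather than by the GPU itself, it may be worth confirming which devices actually sit behind that bridge. In sysfs, each entry under /sys/bus/pci/devices resolves to a path that nests the device under its upstream bridges, so the chain can be read from the path. A rough Python sketch follows; `pci_chain` and `behind_bridge` are hypothetical helper names, and the 0000:48:00.0 address in the comment is invented for illustration.

```python
import os
import re

# A full PCI address such as 0000:46:02.0 (domain:bus:device.function).
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def pci_chain(sysfs_path):
    """Return the PCI addresses found in a resolved sysfs path, root port first.

    Example (made-up endpoint 0000:48:00.0):
      /sys/devices/pci0000:40/0000:40:01.3/0000:45:00.0/0000:46:02.0/0000:48:00.0
    """
    return [part for part in sysfs_path.split("/") if BDF.match(part)]


def behind_bridge(bridge, sysfs_root="/sys/bus/pci/devices"):
    """List devices whose resolved sysfs path passes through `bridge`."""
    found = []
    for name in sorted(os.listdir(sysfs_root)):
        real = os.path.realpath(os.path.join(sysfs_root, name))
        # chain[:-1] drops the device itself, leaving only its upstream bridges.
        if bridge in pci_chain(real)[:-1]:
            found.append(name)
    return found
```

Calling `behind_bridge("0000:46:02.0")` on the server would list whatever hangs off that PLX bridge. If the RTX 3060 is not in the list, the slot 3 theory would need another look.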
