UNRAID freezes, hard reset

May 12, 2023

Yesterday, I remotely accessed one of my VMs with Nvidia GPU passthrough, and suddenly the whole VM froze and disconnected. Subsequently, the entire UNRAID server became unresponsive, and unfortunately, the only solution was a hard reset. This has occurred twice, and I'm now concerned that if I tempt fate and start the VM a third time, it could cause irreparable damage to the server, potentially leading to data loss.

memtest is OK!

Below is a snippet of what the logs show. Note that the Docker log is also 100% full:

May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP

The log entries show recurring PCIe link errors. "BadDLLP" (bad Data Link Layer Packet) possibly points to data being corrupted on the link?

The device [10b5:8714] (see the IOMMU groups below) is probably the cause, because it's connected to the PCIe port.
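For anyone reading the "error status/mask=00000080/0000a000" field, here is a minimal sketch of how the two hex values map to named bits. The bit names are the short labels the Linux AER driver prints for correctable errors; the function name is made up for illustration.

```python
# Short labels used by the Linux AER driver for correctable error bits.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(value: int) -> list[str]:
    """Return the names of the bits set in a correctable status or mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits without a known label are reported by position.
            names.append(CORRECTABLE_BITS.get(bit, f"bit{bit}"))
    return names


# Values copied from the log line above.
status, mask = (int(part, 16) for part in "00000080/0000a000".split("/"))
print(decode_correctable(status))  # ['BadDLLP']
print(decode_correctable(mask))    # ['NonFatalErr', 'HeaderOF']
```

So the status value only has bit 7 set, which matches the "[ 7] BadDLLP" line, and the mask shows that two other correctable error types are masked on this port.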

The repeating "severity=Corrected" error suggests an underlying issue that persists even though the hardware can currently correct it. The 'AER: Multiple Corrected error received' messages from the root port at 0000:40:01.3 indicate multiple corrected errors from the device at 0000:46:02.0.
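To see which device is responsible for most of the noise, the "PCIe Bus Error" lines can be tallied per device. This is only a rough sketch: the regular expression is written against the lines shown in this post, and the function name is hypothetical. To run it against the real log, one would pass an open file (on Unraid presumably /var/log/syslog, which is an assumption) instead of the sample list.

```python
import re
from collections import Counter

# Matches the per-device summary line, for example:
# "pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)"
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def tally_aer(lines):
    """Count AER bus-error lines per (device, severity, layer)."""
    counts = Counter()
    for line in lines:
        m = AER_LINE.search(line)
        if m:
            counts[(m["dev"], m["severity"], m["layer"])] += 1
    return counts


# A few lines copied from the snippet above.
sample = [
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP",
    "May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0",
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
]

for key, n in tally_aer(sample).most_common():
    print(n, *key)
```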

Quote

IOMMU group 53:[10b5:8714] 45:00.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

IOMMU group 54:[10b5:8714] 46:01.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

IOMMU group 55:[10b5:8714] 46:02.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

PCIe ACS override is set to "both".

Attached, you will find the diagnostics.zip folder.

OS: unRAID 6.12.0-rc5

Hardware:
Threadripper 1950x

ASUS Zenith Extreme

PCIE 1: PNY NVIDIA Quadro P2000 5GB

PCIE 2: M.2 X16 Adapter GEN 3

PCIE 3: Nvidia RTX 3060 12GB

PCIE 4: HBA Card LSI SAS 9207-8i

tower-diagnostics-20230512-2120.zip

Edited May 12, 2023 by SimpleDino

May 14, 2023

Author

@Squid @ghost82 Do you guys have any clues?

syslog.txt

May 14, 2023

There are a lot of AER errors; I think the crash happens because the log fills up.

The source of the issue could be hardware or software.

If I were you, first of all, I would clean the slots and check the cables.

If it's software related it could be fixed with a new kernel update.

In the meantime, to keep these errors out of the log, you can try adding pcie_aspm=off to your syslinux config.

Otherwise you can try pci=noaer
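As a rough illustration of where such a parameter goes: kernel parameters sit on the "append" line of the boot entry in the syslinux config. The sketch below works on the config text only and does not touch any file. It assumes the entry looks like the stock Unraid one ("append initrd=/bzroot"); the helper name is made up, and the real file on the flash drive may differ.

```python
def add_kernel_param(cfg_text: str, param: str) -> str:
    """Append `param` to every 'append' line that does not already contain it."""
    out = []
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() == "append" and param not in tokens:
            line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical boot entry, assumed to resemble the default Unraid one.
example = "label Unraid OS\n  kernel /bzimage\n  append initrd=/bzroot\n"
print(add_kernel_param(example, "pci=noaer"))
```

Note that this sketch changes every "append" line it finds. With several boot entries in the file, editing only the default entry by hand is probably the safer choice. The change only takes effect after a reboot.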

May 14, 2023

Author

Thanks for the quick response.

I'm curious whether the pcie_aspm=off parameter applies globally to all PCIe slots. Specifically, I'm wondering if it will have an impact on the 4xM.2 PCIe adapter.

Given the circumstances, pci=noaer might be the preferable solution for the time being. The peculiar thing is, I hadn't experienced any issues until recently.

The problem likely originates from PCIe slot 3, which houses the RTX 3060. As mentioned in my post, Unraid and the VM only freeze when the GPU is passed through to the Windows VM.

The GPU in PCIe slot 3 operates fine outside of the VM when used for stable diffusion or other tasks. However, the AER error persists, and this error first appeared after the initial system freeze.

May 15, 2023

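pcie_aspm=off acts globally and disables PCIe Active State Power Management on all links. pci=noaer only turns off Advanced Error Reporting, so the errors stop appearing in the logs, but the underlying link problem is not fixed.

It may be hardware related, possibly one of your PLX Technology switches.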
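After rebooting, it can be worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the usual Linux /proc/cmdline file; the command line string used here is a made-up example, not taken from the diagnostics.

```python
def has_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


# On the server itself, the running kernel's command line could be read with:
#   cmdline = open("/proc/cmdline").read()
cmdline = "BOOT_IMAGE=/bzimage initrd=/bzroot pci=noaer"  # hypothetical example

for param in ("pcie_aspm=off", "pci=noaer"):
    print(param, "active" if has_param(cmdline, param) else "not set")
```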
