SimpleDino Posted May 12, 2023 (edited)

Yesterday, I remotely accessed one of my VMs with Nvidia GPU passthrough, and suddenly the whole VM froze and disconnected. Subsequently, the entire UNRAID server became unresponsive, and unfortunately, the only solution was a hard reset. This has occurred twice, and I'm now concerned that if I tempt fate and start the VM a third time, it could cause irreparable damage to the server, potentially leading to data loss. memtest is OK!

Below is a snippet of what the logs show. Note that the Docker log is also 100% full:

May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP

The log entries show recurring PCIe link errors. "BadDLLP" (Bad Data Link Layer Packet) is possibly caused by corrupted data on the link? The device [10b5:8714] (see the IOMMU groups below) is probably the cause, because it's connected to the PCIe port. The repeating "severity=Corrected" entries suggest an underlying issue that, while currently correctable, persists. The "AER: Multiple Corrected error received" messages from the root port at 0000:40:01.3 indicate multiple corrected errors from the device at 0000:46:02.0.

IOMMU group 53: [10b5:8714] 45:00.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)
IOMMU group 54: [10b5:8714] 46:01.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)
IOMMU group 55: [10b5:8714] 46:02.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

PCIe ACS override is set to: both

The diagnostics zip is attached.

OS: unRAID 6.12.0-rc5
Hardware: Threadripper 1950x, ASUS Zenith Extreme
PCIE 1: PNY NVIDIA Quadro P2000 5GB
PCIE 2: M.2 X16 Adapter GEN 3
PCIE 3: Nvidia RTX 3060 12GB
PCIE 4: HBA Card LSI SAS 9207-8i

tower-diagnostics-20230512-2120.zip

Edited May 12, 2023 by SimpleDino
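For anyone reading along, here is a minimal Python sketch of how the `error status/mask` lines above can be tallied and decoded. It assumes the log format shown in the snippet, and it uses the short labels the Linux kernel prints for the AER correctable error bits (bit 7 is the `BadDLLP` seen in the log). The function names `decode_bits` and `tally` are made up for this illustration. As far as I understand AER, bits set in the mask are errors the device does not report, so `0000a000` would mean `NonFatalErr` and `HeaderOF` are masked, not that they occurred.

```python
import re
from collections import Counter

# Short labels for bits of the PCIe AER Correctable Error Status register,
# matching the names the Linux kernel prints (e.g. "[ 7] BadDLLP").
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches lines such as:
# pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
STATUS_RE = re.compile(
    r"pcieport (?P<dev>[0-9a-f:.]+): device \[(?P<id>[0-9a-f]{4}:[0-9a-f]{4})\] "
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def decode_bits(value):
    """Return the labels of all bits set in a correctable-error register value."""
    return [
        CORRECTABLE_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def tally(log_text):
    """Count (device address, error label) pairs over all status/mask lines."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        for name in decode_bits(int(match["status"], 16)):
            counts[(match["dev"], name)] += 1
    return counts


# Three lines copied from the log snippet in the post above.
sample = """\
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: [ 7] BadDLLP
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
"""

if __name__ == "__main__":
    for (dev, name), count in tally(sample).items():
        print(dev, name, count)
    print("masked:", decode_bits(0x0000A000))
```

Running `tally` over a full syslog would show whether every error comes from the same port (here 0000:46:02.0) or whether other devices are affected too.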
SimpleDino (Author) Posted May 14, 2023

@Squid @ghost82 Do you guys have any clues?

syslog.txt
ghost82 Posted May 14, 2023

There are a lot of AER errors; I think the crash happens because the log fills up. The source of the issue could be hardware or software. If I were you, I would first clean the slots and check the cables. If it's software related, it could be fixed by a new kernel update. In the meantime, to stop these messages from showing, you can try pcie_aspm=off in your syslinux config. Otherwise, you can try pci=noaer.
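To show where such a parameter goes, here is a hedged Python sketch that adds a kernel parameter to the `append` line of one boot label in a syslinux config. The sample config text, the label name "Unraid OS", and the helper `add_kernel_param` are assumptions for illustration; check your own syslinux.cfg (on the flash drive) before changing anything, and keep a backup.

```python
# Hypothetical sample of a syslinux config; a real file may differ.
SAMPLE_CFG = """\
default menu.c32
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
"""


def add_kernel_param(cfg_text, param, label="Unraid OS"):
    """Return cfg_text with `param` added to the append line of `label`.

    Only the named boot label is touched, and the parameter is not
    added twice if it is already present.
    """
    out = []
    in_label = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("label "):
            # Track whether we are inside the boot entry we want to edit.
            in_label = stripped[6:].strip() == label
        elif in_label and stripped.startswith("append "):
            args = stripped.split()[1:]
            if param not in args:
                indent = line[: len(line) - len(line.lstrip())]
                line = indent + "append " + " ".join([param] + args)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(add_kernel_param(SAMPLE_CFG, "pcie_aspm=off"))
```

With the sample above, the default entry's line becomes `append pcie_aspm=off initrd=/bzroot`, while the GUI Mode entry is left alone. The change only takes effect after a reboot.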
SimpleDino (Author) Posted May 14, 2023

1 hour ago, ghost82 said:
There are a lot of AER errors; I think the crash happens because the log fills up. The source of the issue could be hardware or software. If I were you, I would first clean the slots and check the cables. If it's software related, it could be fixed by a new kernel update. In the meantime, to stop these messages from showing, you can try pcie_aspm=off in your syslinux config. Otherwise, you can try pci=noaer.

Thanks for the quick response. Does the pcie_aspm=off option affect all PCIe slots globally? Specifically, I'm wondering whether it will have an impact on the 4x M.2 PCIe adapter. Given the circumstances, pci=noaer might be the preferable solution for the time being.

The peculiar thing is that I hadn't experienced any issues until recently. The problem likely originates from PCIe slot 3, which houses the RTX 3060. As mentioned in my post, Unraid and the VM only freeze when the GPU is passed through to the Windows VM. The GPU in PCIe slot 3 operates fine outside of the VM when used for Stable Diffusion or other tasks. However, the AER errors persist, and they first appeared after the initial system freeze.
ghost82 Posted May 15, 2023

pcie_aspm=off acts globally and disables PCIe link power management (ASPM) for all devices; pci=noaer only stops Advanced Error Reporting, so the errors no longer show up in the logs. The problem may be hardware related: one of your PLX Technology switches.
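After rebooting, it can be worth confirming that the option actually reached the kernel. The following is a small Python sketch, assuming a Linux host where the boot command line is readable at /proc/cmdline; the helper name `active_pcie_flags` is made up for this example. It accounts for `pci=` taking a comma-separated list of options.

```python
def active_pcie_flags(cmdline):
    """Report which of the two options discussed above are on a kernel command line."""
    params = cmdline.split()
    return {
        # pcie_aspm=off is a standalone parameter.
        "aspm_disabled": "pcie_aspm=off" in params,
        # pci= accepts a comma-separated list, e.g. pci=noaer,nomsi.
        "aer_disabled": any(
            p.startswith("pci=") and "noaer" in p[4:].split(",") for p in params
        ),
    }


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print(active_pcie_flags(fh.read()))
    except OSError:
        # Not on Linux, or /proc is unavailable; fall back to a sample string.
        print(active_pcie_flags("BOOT_IMAGE=/bzimage initrd=/bzroot pci=noaer"))
```

Note that neither option fixes the underlying link errors; they only change whether power management is used or whether the errors are reported.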