UNRAID freezes, hard reset

Yesterday, I remotely accessed one of my VMs with Nvidia GPU passthrough, and suddenly the whole VM froze and disconnected. Subsequently, the entire UNRAID server became unresponsive, and unfortunately, the only solution was a hard reset. This has occurred twice, and I'm now concerned that if I tempt fate and start the VM a third time, it could cause irreparable damage to the server, potentially leading to data loss.

 

memtest is OK!

 

Below is a snippet of what the logs show. Note that the Docker log is also 100% full:

May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Corrected error received: 0000:46:02.0
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000
May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP               
          

 

The log entries show recurring PCIe link errors. "BadDLLP" means that a Data Link Layer Packet failed its integrity check, possibly because of corrupted data on the link.

The device [10b5:8714] (see the IOMMU groups below) is probably the cause, because it is connected to the PCIe port.

The repeated "severity=Corrected" entries suggest an underlying issue that is currently correctable but persistent. The "AER: Multiple Corrected error received" messages from the PCIe port at 0000:40:01.3 report multiple corrected errors from the device at 0000:46:02.0.
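To get a feel for how often each device reports these errors, the syslog can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is written against the line format in the excerpt above, and the function name `count_bus_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary lines as they appear in the excerpt above.
BUS_ERROR = re.compile(
    r"pcieport (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)

def count_bus_errors(lines):
    """Count bus-error summary lines per (device, severity, layer)."""
    counts = Counter()
    for line in lines:
        m = BUS_ERROR.search(line)
        if m:
            counts[(m["dev"], m["severity"], m["layer"])] += 1
    return counts

# Four lines copied from the log excerpt; only the second is a summary line.
sample = [
    "May 12 14:54:03 Tower kernel: pcieport 0000:40:01.3: AER: Multiple Corrected error received: 0000:46:02.0",
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:   device [10b5:8714] error status/mask=00000080/0000a000",
    "May 12 14:54:03 Tower kernel: pcieport 0000:46:02.0:    [ 7] BadDLLP",
]

for key, n in count_bus_errors(sample).items():
    print(key, n)
```

Feeding it the lines of a full syslog file instead of `sample` would show whether 0000:46:02.0 is the only port reporting errors.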

 

 

Quote

 

IOMMU group 53:[10b5:8714] 45:00.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

IOMMU group 54:[10b5:8714] 46:01.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)

IOMMU group 55:[10b5:8714] 46:02.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)
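The kernel log prints addresses with a PCI domain prefix (0000:46:02.0), while the IOMMU listing omits it (46:02.0). A small sketch like the following can match the two; the helper name `find_group` and the regular expression are assumptions based only on the three lines quoted here.

```python
import re

# Pattern written against the IOMMU lines quoted above.
IOMMU_LINE = re.compile(
    r"IOMMU group (?P<group>\d+):\s*"
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]\s+"
    r"(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+(?P<desc>.+)"
)

def find_group(listing, log_address):
    """Return (group number, description) for an address from the kernel log."""
    # Drop the leading PCI domain ("0000:") that the kernel log includes.
    if log_address.count(":") == 2:
        short = log_address.split(":", 1)[1]
    else:
        short = log_address
    for line in listing:
        m = IOMMU_LINE.search(line)
        if m and m["addr"] == short:
            return int(m["group"]), m["desc"]
    return None

listing = [
    "IOMMU group 53:[10b5:8714] 45:00.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)",
    "IOMMU group 54:[10b5:8714] 46:01.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)",
    "IOMMU group 55:[10b5:8714] 46:02.0 PCI bridge: PLX Technology, Inc. Device 8714 (rev ab)",
]

print(find_group(listing, "0000:46:02.0"))
```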

 

 

PCIe ACS override is set to: both

 

Attached, you will find the diagnostics.zip folder.

 

OS: unRAID 6.12.0-rc5

 

Hardware:
Threadripper 1950x

ASUS Zenith Extreme

PCIE 1: PNY NVIDIA Quadro P2000 5GB

PCIE 2: M.2 X16 Adapter GEN 3

PCIE 3: Nvidia RTX 3060 12GB

PCIE 4: HBA Card LSI SAS 9207-8i

 

 

tower-diagnostics-20230512-2120.zip

Edited by SimpleDino

There are a lot of AER errors; I think the crash happens because the log is filling up.

The source of the issue could be hardware or software.

If I were you, first of all, I would clean the slots and check the cables.

If it's software related it could be fixed with a new kernel update.

In the meantime, to suppress these messages, you can try adding pcie_aspm=off to your syslinux config.

Otherwise you can try pci=noaer
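For illustration, here is a rough sketch of what adding one of these parameters to the `append` line of a syslinux config amounts to. The function name `add_kernel_param` and the sample config text are hypothetical, not copied from a real server, and the sketch works on a string rather than touching any file.

```python
def add_kernel_param(cfg_text, param):
    """Return cfg_text with `param` added to every 'append' line that lacks it.

    Note: whitespace inside a modified append line is collapsed to single spaces.
    """
    out = []
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() == "append" and param not in tokens[1:]:
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join([tokens[0], param] + tokens[1:])
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical config text, used only to show the transformation.
sample = """label Example OS
  kernel /bzimage
  append initrd=/bzroot
"""

print(add_kernel_param(sample, "pci=noaer"))
```

The sketch changes every boot entry; in practice you may want to edit only the entry you actually boot, and keep a backup of the original file.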

1 hour ago, ghost82 said:

There are a lot of AER errors; I think the crash happens because the log is filling up.

The source of the issue could be hardware or software.

If I were you, first of all, I would clean the slots and check the cables.

If it's software related it could be fixed with a new kernel update.

In the meantime, to suppress these messages, you can try adding pcie_aspm=off to your syslinux config.

Otherwise you can try pci=noaer

 

Thanks for the quick response.

 

I'm curious whether the pcie_aspm=off parameter affects all PCIe slots globally. Specifically, I'm wondering whether it will have an impact on the 4xM.2 PCIe adapter.

Given the circumstances, pci=noaer might be the preferable solution for the time being. The peculiar thing is, I hadn't experienced any issues until recently.

 

The problem likely originates from PCIe slot 3, which houses the RTX 3060. As mentioned in my post, Unraid and the VM only freeze when the GPU is passed through to the Windows VM.

The GPU in PCIe slot 3 operates fine outside of the VM when used for Stable Diffusion or other tasks. However, the AER error persists, and it first appeared after the initial system freeze.

pcie_aspm=off acts globally and disables PCIe Active State Power Management. pci=noaer only disables Advanced Error Reporting, so the errors stop appearing in the logs.

It may be hardware related: one of your PLX Technology switches.
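After rebooting, it is worth confirming which of the two parameters actually reached the kernel. A minimal sketch, assuming the boot command line is available as a string (on Linux it can be read from /proc/cmdline); the function names are made up, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter name to the list of its comma-separated values."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params.setdefault(key, []).extend(value.split(",") if value else [])
    return params

def summarize(cmdline):
    """Report whether ASPM and AER were disabled on the command line."""
    params = parse_cmdline(cmdline)
    return {
        "aspm_disabled": "off" in params.get("pcie_aspm", []),
        "aer_disabled": "noaer" in params.get("pci", []),
    }

# Hypothetical command line; on a running system use open("/proc/cmdline").read()
example = "BOOT_IMAGE=/bzimage initrd=/bzroot pci=noaer"
print(summarize(example))
```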
