[SOLVED] AER PCIe Bus Errors



I came home to an error saying my log file was full. Turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend.

 

The first device that is throwing errors is one of two GPUs in the system. The errors look like:

Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Tower kernel: vfio-pci 0000:01:00.0:   device [10de:1e84] error status/mask=00100000/00000000
Tower kernel: vfio-pci 0000:01:00.0:    [20] UnsupReq               (First)
Tower kernel: vfio-pci 0000:01:00.0: AER:   TLP Header: 40000001 00000003 000be7c0 f7f7f7f7
Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful

 

The second device throwing errors is my LSI card, which is new. It is an LSI 9207-8i purchased from The Art of the Server on eBay. It sits in a PCIe slot that was previously occupied by an NVMe SSD in a PCIe adapter. Those errors look like:

Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: mpt3sas 0000:04:00.0:   device [1000:0087] error status/mask=00000001/00002000
Tower kernel: mpt3sas 0000:04:00.0:    [ 0] RxErr 
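
For anyone reading these logs later: the `error status/mask` value is a bitmask, and the bracketed number on the next line is the bit that was set. Below is a rough Python sketch of that decoding. The labels are the short names the Linux AER driver prints, only a subset of the bits is listed, and the function name `decode_aer_status` is made up for illustration.

```python
# Short labels printed by the Linux AER driver (subset only).
UNCORRECTABLE_BITS = {
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    18: "MalfTLP",
    20: "UnsupReq",
}
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
}


def decode_aer_status(status_hex, corrected):
    """Return (bit, label) pairs for every bit set in an AER status word."""
    table = CORRECTABLE_BITS if corrected else UNCORRECTABLE_BITS
    status = int(status_hex, 16)
    return [
        (bit, table.get(bit, "unknown"))
        for bit in range(32)
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # Status words copied from the log lines in this thread.
    print(decode_aer_status("00100000", corrected=False))  # GPU, 0000:01:00.0
    print(decode_aer_status("00000001", corrected=True))   # LSI, 0000:04:00.0
```

The same function can be pointed at the mask half of the field; for the LSI card's `00002000` it reports bit 13 of the correctable table.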

 

Despite these errors, both devices are acting normally. The GPU is passed through to a VM and behaves as expected even under full load. The LSI card also appears fully functional. I ran an entire parity check, which passed with zero errors. I am currently running a drive rebuild (not because of a drive failure, just swapping it out) and would rather not have to abort, but I also do not know how severe these errors are or whether I need to take immediate action.
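
One way to get a feel for how bad the stream is: count the `PCIe Bus Error` lines per device and severity. This is only a sketch, assuming the log lines look like the ones pasted above; the name `tally_aer_errors` is hypothetical, and you would feed it lines from wherever your syslog lives.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line and captures the PCI address,
# the severity, and the layer that reported the error.
AER_LINE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def tally_aer_errors(lines):
    """Count AER summary lines, keyed by (PCI address, severity)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[(match.group("bdf"), match.group("severity"))] += 1
    return counts


if __name__ == "__main__":
    # Two summary lines copied from this thread.
    sample = [
        "Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: "
        "severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)",
        "Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: "
        "severity=Corrected, type=Physical Layer, (Receiver ID)",
    ]
    for (bdf, severity), count in tally_aer_errors(sample).items():
        print(bdf, severity, count)
```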

 

I am attaching my full diagnostics dump.

 

Any advice would be much appreciated.

 

Thank you.

tower-diagnostics-20210310-1757.zip


After doing some digging I believe I have solved my issue. It seems to be a somewhat known bug on Asus X99 motherboards. Mine is an Asus X99-WS/IPMI. I am already on the latest BIOS, so updating was not an option.

 

The solution was to add "pcie_aspm=off" to my syslinux configuration. After a reboot I appear to no longer be getting errors. Fingers crossed it stays fixed. 
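
To confirm after the reboot that the parameter actually reached the kernel, you can look at `/proc/cmdline`, which holds the command line the running Linux kernel was booted with. A small sketch of that check (the helper name `has_kernel_param` is just for illustration):

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc is unavailable
    print("pcie_aspm=off active:", has_kernel_param(cmdline, "pcie_aspm=off"))
```

Matching whole tokens rather than substrings avoids a false positive from a longer argument that merely contains the same text.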

 

If anyone has anything to add feel free to chime in. If I don't have any errors tomorrow morning I'll mark this solved.

  • Team_Dango changed the title to [SOLVED] AER PCIe Bus Errors
  • 1 year later...
On 3/11/2021 at 7:03 AM, Team_Dango said:

Thank you for the suggestion. I'll check for that next time I reboot the server.

Where exactly did you add pcie_aspm=off?

 

I am currently trying to get my PCIe Google Coral to work.

 

In Google's GitHub they said this

 

I had tried disabling it in my BIOS and that didn't help.

 

This was Google's response:

 

Can you please share how you turned off pcie_aspm? Have you added pcie_aspm=off to /boot/extlinux/extlinux.conf?

 

 

$ cat /boot/extlinux/extlinux.conf
TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
INITRD /boot/initrd
APPEND ${cbootargs} quiet pcie_aspm=off
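
In the file above the parameter sits at the end of the `APPEND` line. If you would rather script the change than edit by hand, here is a minimal sketch, assuming a syslinux/extlinux-style config where the kernel arguments live on lines that start with `APPEND`. The function name `add_kernel_param` is hypothetical, and it works on a string instead of touching the real file, so back up the config before writing anything out.

```python
def add_kernel_param(config_text, param):
    """Append `param` to every APPEND line that does not already contain it."""
    out = []
    for line in config_text.splitlines():
        tokens = line.split()
        # The keyword is matched case-insensitively; indentation is preserved.
        if tokens and tokens[0].upper() == "APPEND" and param not in tokens[1:]:
            line = line.rstrip() + " " + param
        out.append(line)
    trailing_newline = "\n" if config_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


if __name__ == "__main__":
    # Stanza modelled on the extlinux.conf shown in this thread,
    # before pcie_aspm=off was added.
    before = (
        "LABEL primary\n"
        "MENU LABEL primary kernel\n"
        "LINUX /boot/Image\n"
        "INITRD /boot/initrd\n"
        "APPEND ${cbootargs} quiet\n"
    )
    print(add_kernel_param(before, "pcie_aspm=off"))
```

Running it a second time on its own output changes nothing, so it is safe to re-run.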

 

 

Thank you for your help
