Team_Dango, posted March 10, 2021 (edited March 11, 2021)

I came home to an error saying my log file was full. It turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend.

The first device that is throwing errors is one of two GPUs in the system. The errors look like:

Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Tower kernel: vfio-pci 0000:01:00.0: device [10de:1e84] error status/mask=00100000/00000000
Tower kernel: vfio-pci 0000:01:00.0: [20] UnsupReq (First)
Tower kernel: vfio-pci 0000:01:00.0: AER: TLP Header: 40000001 00000003 000be7c0 f7f7f7f7
Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful

The second device that is throwing errors is my LSI card. This is new. It is an LSI 9207-8i purchased from The Art of the Server on eBay. It is in a PCIe slot that was previously occupied by an NVMe SSD in a PCIe adapter. Those errors look like:

Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: mpt3sas 0000:04:00.0: device [1000:0087] error status/mask=00000001/00002000
Tower kernel: mpt3sas 0000:04:00.0: [ 0] RxErr

Despite these errors, both devices are acting normally. The GPU is passed through to a VM and behaves as expected even under full load. The LSI card also appears fully functional. I went through an entire parity check, which passed with zero errors. I am currently running through a drive rebuild (not because of drive failure, just swapping it out) and would rather not have to abort, but I also do not know how severe these errors are and if I need to take immediate action. I am attaching my full diagnostics dump.

Any advice would be much appreciated. Thank you.

Attachment: tower-diagnostics-20210310-1757.zip
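For anyone reading similar log lines, the "error status/mask" pair is a bit field, and the bracketed number on the following line ("[20] UnsupReq", "[ 0] RxErr") is the index of the bit that is set. A minimal Python sketch of that decoding follows. The function name `decode` and the two name tables are made up for illustration, and the tables only contain the two bit names that appear in the logs above, so any other bit is reported by number only.

```python
import re

# Bit names copied from the kernel's own decoding in the log lines above.
# These tables are deliberately incomplete: other bits are reported by number.
UNCORRECTED_NAMES = {20: "UnsupReq"}
CORRECTED_NAMES = {0: "RxErr"}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode(line, names):
    """Return (bit, name, is_masked) for every status bit set in an AER log line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    found = []
    for bit in range(32):
        if status & (1 << bit):
            name = names.get(bit, "bit %d" % bit)
            found.append((bit, name, bool(mask & (1 << bit))))
    return found


gpu_line = "vfio-pci 0000:01:00.0: device [10de:1e84] error status/mask=00100000/00000000"
hba_line = "mpt3sas 0000:04:00.0: device [1000:0087] error status/mask=00000001/00002000"

print(decode(gpu_line, UNCORRECTED_NAMES))
print(decode(hba_line, CORRECTED_NAMES))
```

In both cases the set bit is not covered by the mask value, which is consistent with the kernel reporting the error rather than ignoring it.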
Team_Dango, posted March 11, 2021
After doing some digging I believe I have solved my issue. It seems to be a somewhat known bug on Asus X99 motherboards. Mine is an Asus X99-WS/IPMI. I am on the latest BIOS, so updating was not an option. The solution was to add "pcie_aspm=off" to my syslinux configuration. After a reboot I appear to no longer be getting errors. Fingers crossed it stays fixed. If anyone has anything to add, feel free to chime in. If I don't have any errors tomorrow morning I'll mark this solved.
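One way to confirm after the reboot that the parameter actually reached the kernel is to look at /proc/cmdline, which holds the command line the running kernel was booted with. A small Python sketch, assuming a Linux system; the helper names `has_kernel_param` and `read_cmdline` are hypothetical and not part of any Unraid tooling.

```python
def has_kernel_param(cmdline, param):
    """True if param appears as a whole space-separated token in cmdline."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    try:
        active = has_kernel_param(read_cmdline(), "pcie_aspm=off")
        print("pcie_aspm=off active:", active)
    except OSError:
        # /proc/cmdline is not available on non-Linux systems
        print("could not read /proc/cmdline")
```

Matching whole tokens rather than substrings avoids a false positive from a longer, different parameter that merely contains the same text.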
Vr2Io, posted March 11, 2021
The BIOS usually has an ASPM control; anyway, it's OK to disable it in software.
Team_Dango, posted March 11, 2021
Thank you for the suggestion. I'll check for that next time I reboot the server.
ssjucrono, posted August 23, 2022
On 3/11/2021 at 7:03 AM, Team_Dango said: "Thank you for the suggestion. I'll check for that next time I reboot the server."

Where exactly did you add the pcie_aspm=off? I am currently trying to get my PCIe Google Coral to work. I had tried disabling it in my BIOS and that didn't help. This was Google's response on their GitHub:

Can you please share how you turned off pcie_aspm? Have you added pcie_aspm=off to /boot/extlinux/extlinux.conf?

$ cat /boot/extlinux/extlinux.conf
TIMEOUT 30
DEFAULT primary
MENU TITLE L4T boot options
LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
INITRD /boot/initrd
APPEND ${cbootargs} quiet pcie_aspm=off

Thank you for your help.
JorgeB, posted August 24, 2022
13 hours ago, ssjucrono said: "Where exactly did you add the pcie_aspm=off"
Same as here: https://forums.unraid.net/topic/111161-pcie-errors/?do=findComment&comment=1013378
ssjucrono, posted August 24, 2022
@JorgeB Thank you! I added it like so:

kernel /bzimage
append initrd=/bzroot pci=noaer pcie_aspm=off

I will reboot and test today. Thank you.
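If the append line is edited by a script rather than by hand, it helps to add each parameter only when it is not already there, so repeated runs do not stack duplicates. A minimal Python sketch of that idea; `add_params` is a hypothetical helper that only transforms the text of one line, and it does not read or write the syslinux configuration file itself. Note that pci=noaer turns off the kernel's AER handling, so with it present the PCIe error messages stop being logged whether or not the underlying fault is gone.

```python
def add_params(append_line, params):
    """Return append_line with each missing parameter added at the end."""
    tokens = append_line.split()
    for param in params:
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens)


original = "append initrd=/bzroot"
updated = add_params(original, ["pci=noaer", "pcie_aspm=off"])
print(updated)

# Running it again on the result leaves the line unchanged
print(add_params(updated, ["pcie_aspm=off"]))
```

Applying only pcie_aspm=off first, without pci=noaer, keeps the error reports visible and makes it easier to tell whether the ASPM change by itself helped.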
ssjucrono, posted August 24, 2022
So now it works for a bit, but then my whole server stops responding. I cannot reach it over SSH or the web GUI, nothing. I have to hard reboot it by holding the power button. I also don't think I can see the logs, because I have to reboot and so I lose the syslog. I thought I was on the right path, but I guess not.
JorgeB, posted August 24, 2022
Enable the syslog server and post that after a crash.
ssjucrono, posted August 24, 2022
Oh, OK. I did enable it before this most recent crash. Let me reboot and check it out.
ssjucrono, posted August 24, 2022
Attachment: syslog
OK, I have attached the syslog. I do not see anything really telling in it. Thank you @JorgeB
JorgeB, posted August 25, 2022
Nothing relevant was logged. PCIe errors were of course suppressed, but they might not be the problem. Does the server crash without that device?
ssjucrono, posted August 25, 2022
So the device is currently in the server and operating, according to the drivers. It is when I start the Frigate docker that it works for about an hour or less and then crashes the server. I previously didn't have a PCIe bracket on it. I put one on it and it started this behavior.
JorgeB, posted August 25, 2022
Try a different PCIe slot if available. If it's related to the PCIe errors, some slots might not show the issue, especially CPU vs. PCH slots.
ssjucrono, posted August 25, 2022
Yeah, I originally had it in slot 4 and that was worse. I moved it to slot 6, where it is currently. I had this working fine with a more generic adapter, however only 1 TPU was showing due to the weird layout of M.2. The Google Coral is a dual TPU M.2 with an adaptor from https://github.com/magic-blue-smoke/Dual-Edge-TPU-Adapter to make it PCIe.
JorgeB, posted August 25, 2022
Both are CPU slots, but it looks like that board only has CPU slots.
ssjucrono, posted August 25, 2022
15 minutes ago, JorgeB said: "Both are CPU slots, but looks like that board only has CPU slots."
Yes, I guess I am not sure what you are getting at?
JorgeB, posted August 25, 2022
Most boards have CPU slots (slots that are connected directly to the CPU) and PCH slots (slots that are connected to the chipset). Since that board has dual CPUs, it has enough lanes to use only CPU slots. The device *might* have worked better if there was a PCH slot.
ssjucrono, posted August 25, 2022
Oooh, interesting. I had no idea. I am also talking to the creator of the PCIe adaptor.
ssjucrono, posted August 26, 2022
@JorgeB I did find this jumper on my motherboard. Not sure if that is of any help?
JorgeB, posted August 26, 2022
11 minutes ago, ssjucrono said: "Not sure if that is of any help?"
It won't hurt to try.
SohailS, posted March 13, 2023
Sorry to dig this up, but I wanted to say that I'm using the same motherboard, the Asus X99-WS/IPMI, and adding pcie_aspm=off worked for me. I'm curious to know what the error is, though. I checked in my BIOS and ASPM is already off.