KombatJam Posted January 8, 2022

I am running my Unraid server on an X99 Deluxe II with 3x NVMe drives just installed. Since installation, the log fills up immediately on reboot with PCIe errors. I did find a suggestion to update syslinux.cfg with append initrd=/bzroot pci=nommconf, but this has not resolved the issue. Please let me know the best course of action.

Here are the log entries:

Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: device [2646:2263] error status/mask=00000001/0000e000
Jan 7 21:39:23 Tower kernel: nvme 0000:03:00.0: [ 0] RxErr
Jan 7 21:39:23 Tower kernel: pcieport 0000:00:02.0: AER: Corrected error received: 0000:03:00.0

Thanks!
JorgeB Posted January 8, 2022 (Solution)

Try the below; if it doesn't help, look for a BIOS update or try using different PCIe slots if possible.
KombatJam (Author) Posted January 8, 2022

14 hours ago, JorgeB said: Try the below; if it doesn't help, look for a BIOS update or try using different PCIe slots if possible.

That solved it. Thanks for digging that up, really appreciated.
David Bott Posted August 11, 2022

Hi.... I am sorry, but I do not think the above solution of adding "pci=noaer" to the boot options really "solves" anything; it only hides the error. The error is still happening, it is just no longer reported. So the real question is why it is happening, so that it can be fixed.

Here is my system config, with the most recent BIOS running:

Gigabyte Technology Co., Ltd. Z690I A ULTRA LITE D4, Version Default string
American Megatrends International, LLC., Version F20a, BIOS dated: Fri 22 Jul 2022 12:00:00 AM EDT
12th Gen Intel® Core™ i5-12600K @ 3700 MHz
Samsung 1TB NVMe (2 in a RAID for cache)
LSI PCIe 8-drive controller

I get the error whenever Mover runs. I use NVMe cache drives and it reports the issue each time.

Aug 11 13:42:59 Server emhttpd: shcmd (94): /usr/local/sbin/mover &> /dev/null &
Aug 11 13:43:29 Server kernel: pcieport 0000:00:06.0: AER: Corrected error received: 0000:02:00.0
Aug 11 13:43:29 Server kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 11 13:43:29 Server kernel: nvme 0000:02:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 11 13:43:29 Server kernel: nvme 0000:02:00.0: [ 0] RxErr
Aug 11 13:43:51 Server kernel: pcieport 0000:00:06.0: AER: Corrected error received: 0000:02:00.0
Aug 11 13:43:51 Server kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 11 13:43:51 Server kernel: nvme 0000:02:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 11 13:43:51 Server kernel: nvme 0000:02:00.0: [ 0] RxErr

Thanks for any additional info.
David
JorgeB Posted August 11, 2022

This is usually hardware related. A BIOS update might help, as might using other PCIe/M.2 slots if available. A different kernel might also help.
David Bott Posted August 28, 2022

On 8/11/2022 at 2:25 PM, JorgeB said: This is usually hardware related. A BIOS update might help, as might using other PCIe/M.2 slots if available. A different kernel might also help.

Hi... (First, sorry for the delay. I have been trying other things, like changing timing and the NVMe drive.) Thank you kindly for the reply.

I have the latest BIOS, as mentioned. This is happening in the NVMe channel. My motherboard has two NVMe slots and one on one board. One channel goes through PCIe and the other goes through the CPU, it seems. I have tried swapping the two NVMe drives being used as a RAID 1 cache, and the error "seemed" to follow the drive. So I replaced that drive just for fun, and yet the problem still shows. It surely could be a kernel issue, but I am not sure how to deal with that in unRAID, as they supply the kernel with their OS.

Here is the error repeated on the new NVMe (note they are Samsung 980 Pro drives). (I had the GUI only show me errors in the log.)

Aug 19 23:01:47 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 19 23:01:47 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 19 23:01:47 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 20 00:26:42 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 20 00:26:42 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 20 00:26:42 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 20 18:29:22 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 20 18:29:22 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 20 18:29:22 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 22 17:26:33 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 22 17:26:33 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 22 17:26:33 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 23 04:06:23 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 23 04:06:23 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 23 04:06:23 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 23 11:17:45 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 23 11:17:45 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 23 11:17:45 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 24 03:53:45 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 24 03:53:45 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 24 03:53:45 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000
Aug 24 13:18:54 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 24 13:18:54 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 24 13:18:54 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000

Edited August 28, 2022 by David Bott
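For anyone trying to judge how often these reports occur and which device they point at, a small script can tally them from a saved copy of the syslog. This is only a minimal Python sketch: the function name count_corrected_errors is made up for this example, and it assumes the log lines look exactly like the ones pasted in this thread.

```python
import re
from collections import Counter

# Matches the kernel's summary line, for example:
# "pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0"
# The captured group is the PCI address of the device that reported the error.
AER_LINE = re.compile(
    r"AER: Corrected error received: "
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def count_corrected_errors(lines):
    """Return a Counter mapping PCI address -> number of corrected-error reports."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    # Two summary lines copied from the logs in this thread.
    sample = [
        "Aug 19 23:01:47 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0",
        "Aug 20 00:26:42 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0",
    ]
    for device, total in count_corrected_errors(sample).items():
        print(device, total)
```

Only the pcieport summary line is counted, so each report is tallied once even though the kernel prints several follow-up lines for it.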
JorgeB Posted August 29, 2022

You can try installing v6.11.0-rc4; the newer kernel might help. If not, there is not much more you can do other than suppressing the error, unless you are willing to use a different board (or devices).
David Bott Posted August 30, 2022

On 8/29/2022 at 4:17 AM, JorgeB said: You can try installing v6.11.0-rc4; the newer kernel might help. If not, there is not much more you can do other than suppressing the error, unless you are willing to use a different board (or devices).

Well, I guess that might be worth a shot. The error is just bugging me and I cannot tell if it is hardware or software related. And using one setting that suppresses all the PCI errors makes no sense, for you want to know if there is an error after all. I have, as mentioned, even replaced one of the Samsung drives, as the error seemed to follow the drive.

Thank you again.
David
Nicktdot Posted August 31, 2022

Let me know how it works out for you. I have 1 of 4 SK Hynix NVMe drives on an Asus HYPER M.2 X16 GEN 4 CARD throwing this constantly.
trurl Posted August 31, 2022

On 8/28/2022 at 10:59 AM, David Bott said: Samsung 980 Pro

Have you checked for a firmware update? I had to update the firmware on mine before they would play well on my new desktop build.
David Bott Posted August 31, 2022

13 hours ago, Nicktdot said: Let me know how it works out for you. I have 1 of 4 SK Hynix NVMe drives on an Asus HYPER M.2 X16 GEN 4 CARD throwing this constantly.

I just now installed version 6.11.0-rc4 and will let it run. As you may have seen from my log times, the error does not happen all the time, so I need to give it a number of hours. I will report back.

13 hours ago, trurl said: Have you checked for a firmware update? I had to update the firmware on mine before they would play well on my new desktop build.

Hi... Do you mean firmware for the Samsung 980 drive itself? If that is what you mean, no, I have not. I will need to look into the current version and how to even upgrade it. Removing the drives from my setup is easy for one and hard for the other, so I hope I can do it in place. (Hoping it is a bootable flash drive or something.) Thanks for the thought, as I never considered updating the firmware on the drive.
David Bott Posted August 31, 2022

14 hours ago, Nicktdot said: Let me know how it works out for you. I have 1 of 4 SK Hynix NVMe drives on an Asus HYPER M.2 X16 GEN 4 CARD throwing this constantly.

Well, that did not take long. Sorry to say that even with the new kernel (in 6.11.0-rc4) I still have the error, so I went back to 6.10.3.

Aug 31 10:53:52 Server kernel: pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
Aug 31 10:53:52 Server kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 31 10:53:52 Server kernel: nvme 0000:03:00.0: device [144d:a80a] error status/mask=00000001/0000e000

Looks like I may have to try a firmware update on the NVMe drives. I did find the Linux tool fwupdmgr, which it seems I might be able to run to upgrade the drives. Otherwise, it seems Samsung only supports Windows with their software, and that would be a pain, as I would need to remove the drives and find a Windows machine.

UPDATE: It seems that fwupdmgr will not work, as it is only for supported products: https://fwupd.org/lvfs/devices/

Edited August 31, 2022 by David Bott
David Bott Posted August 31, 2022

15 hours ago, trurl said: Have you checked for a firmware update? I had to update the firmware on mine before they would play well on my new desktop build.

It seems I am on the current firmware release, 5B2QGXA7, so bummer on that. The good news is that the error shows as "Corrected"; the bad news is that there is an error it needs to correct at all.
Nicktdot Posted September 1, 2022

Could you try pcie_aspm=off? This disables the power management mode that seems to be throwing the error. I've put it in my config for the next time I reboot.
David Bott Posted September 1, 2022

41 minutes ago, Nicktdot said: Could you try pcie_aspm=off? This disables the power management mode that seems to be throwing the error. I've put it in my config for the next time I reboot.

Hi... Thanks for the idea. I just added it to my Syslinux config, so it now looks like this:

kernel /bzimage
append initrd=/bzroot pcie_aspm=off

...saved and rebooted. I will check the log tomorrow morning to see if I still have the errors and report back.
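The edit described above comes down to adding one space-separated token to the append line. As a hedged illustration only, here is a small Python sketch of that rule; add_kernel_param is a hypothetical helper name, not part of Unraid or Syslinux, and the sketch works on a single line of text rather than on the real config file.

```python
def add_kernel_param(append_line, param):
    """Return the append line with `param` added as its own token.

    Kernel parameters on the append line are separated by spaces, so the
    line is split into tokens, the new one is added if it is not already
    present, and the tokens are joined with single spaces again.
    """
    tokens = append_line.split()
    if param not in tokens:
        tokens.append(param)
    return " ".join(tokens)


if __name__ == "__main__":
    # Line taken from the post above, before the change.
    print(add_kernel_param("append initrd=/bzroot", "pcie_aspm=off"))
```

Running the helper twice with the same parameter leaves the line unchanged, which avoids duplicate entries if the edit is repeated. Leading indentation of the original line is not preserved by this sketch.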
David Bott Posted September 1, 2022

11 hours ago, Nicktdot said: Could you try pcie_aspm=off? This disables the power management mode that seems to be throwing the error. I've put it in my config for the next time I reboot.

Morning... And a good one it is! No errors in the log using pcie_aspm=off. Might I ask how you came up with that being the likely issue? Thanks
Nicktdot Posted September 1, 2022

I'm very glad to know it works. I was researching the error status/mask 00000001/0000e000 in the message, and found out it had to do with the PCI end device not responding to an ASPM command. So while turning off AER masks the problem by not logging the errors, it doesn't solve the actual PCI errors.

I then started going down the rabbit hole of what ASPM is all about (https://en.wikipedia.org/wiki/Active_State_Power_Management) and saw there is a kernel boot flag to turn off the feature. I don't think we need it anyway, seeing as my server runs 24h/day and never goes into sleep mode. I figured it might help avoid the error altogether if the unused feature is disabled. I'll check my own server next time it reboots!

Edited September 1, 2022 by Nicktdot
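As a side note on the status/mask pair discussed above: the two hex values in the log line are bit fields from the PCIe AER correctable error registers. The Python sketch below decodes them using the short names the Linux kernel prints (such as RxErr); the function names decode_bits and parse_status_mask are made up for this example. The decoded value only says that a receiver error was reported. The connection to ASPM is what this thread found in practice, not something the register itself encodes.

```python
# Bit positions in the AER Correctable Error Status / Mask registers,
# labelled with the short names used in the kernel log output.
CORRECTABLE_BITS = {
    0: "RxErr",         # receiver error
    6: "BadTLP",        # bad transaction layer packet
    7: "BadDLLP",       # bad data link layer packet
    8: "Rollover",      # REPLAY_NUM rollover
    12: "Timeout",      # replay timer timeout
    13: "NonFatalErr",  # advisory non-fatal error
    14: "CorrIntErr",   # corrected internal error
    15: "HeaderOF",     # header log overflow
}


def decode_bits(value):
    """Return the names of all known bits that are set in `value`."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_mask(text):
    """Split a 'status/mask' string such as '00000001/0000e000' into two ints."""
    status_hex, mask_hex = text.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


if __name__ == "__main__":
    status, mask = parse_status_mask("00000001/0000e000")
    print("reported:", decode_bits(status))  # bit 0 only: the RxErr seen in the log
    print("masked:  ", decode_bits(mask))    # bits 13, 14 and 15
```

For the values in this thread, the status has only bit 0 set, which matches the "[ 0] RxErr" line that follows it in the full log output.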
David Bott Posted September 2, 2022

19 hours ago, Nicktdot said: I'm very glad to know it works. I was researching the error status/mask 00000001/0000e000 in the message, and found out it had to do with the PCI end device not responding to an ASPM command. So while turning off AER masks the problem by not logging the errors, it doesn't solve the actual PCI errors. I then started going down the rabbit hole of what ASPM is all about (https://en.wikipedia.org/wiki/Active_State_Power_Management) and saw there is a kernel boot flag to turn off the feature. I don't think we need it anyway, seeing as my server runs 24h/day and never goes into sleep mode. I figured it might help avoid the error altogether if the unused feature is disabled. I'll check my own server next time it reboots!

Totally agree on the AER setting, as I said before to others who mentioned using it. I sure do not want to suppress all PCI errors just to hide this one, which was an actual error; even though it was being corrected, it should not need to be. Hiding all the errors is not a good idea.

I have been looking for weeks and never came across that setting, as I had no clue it was a power issue. (Power as in a power-saving feature, which, like you say, surely is not needed in this case, especially for cache, which is always doing something with Dockers.) One setting I did try was the sleep timer setting, set to 5500 or something. It seemed to help, but not fully.

I have had no other errors as of yet, and my system also just did a full parity check, which runs on the 1st of each month. It took 1 day, 5 hours, 21 minutes to complete, though. (I have an 18TB parity drive for future upgrades to storage. The four data drives are all 8TB currently. But man, it takes a long time to verify 8TB, and then it still continues to scan the parity drive, which is another 10TB above that. Not sure why it does that, but oh well. I may need to look into using another method or drive format.)

This all started after I upgraded my motherboard and processor and added two NVMe drives for cache (RAID); then the error showed up.

I so want to thank you again for your research. So...THANKS!!!! Kudos!!!! Yippee!!! Etc.

Edited September 2, 2022 by David Bott
Nuke Posted December 17, 2022

On 9/1/2022 at 4:39 AM, David Bott said: Hi... Thanks for the idea. I just added it to my Syslinux config, so it now looks like this: kernel /bzimage append initrd=/bzroot pcie_aspm=off ...saved and rebooted. I will check the log tomorrow morning to see if I still have the errors and report back.

Did I change things correctly?
JorgeB Posted December 18, 2022

Leave a space between pcie_aspm=off and pcie_acs_override.
ncceylan Posted February 6

I also encountered this situation. How can I solve this?

Feb 7 00:01:10 xxlan kernel: pcieport 0000:80:01.1: AER: Corrected error received: 0000:82:00.0
Feb 7 00:01:10 xxlan kernel: nvme 0000:82:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Feb 7 00:01:10 xxlan kernel: nvme 0000:82:00.0: device [1e4b:1001] error status/mask=00000001/00002000
Feb 7 00:01:10 xxlan kernel: nvme 0000:82:00.0: [ 0] RxErr
JorgeB Posted February 6

2 hours ago, ncceylan said: I also encountered this situation. How can I solve this?

Try this.
David Bott Posted February 6

33 minutes ago, JorgeB said: Try this.

Hi... I added

append initrd=/bzroot pcie_aspm=off

to the Unraid OS config area. (Click on the flash drive to get to the right area.) Add it in as a new line, save and reboot.
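After the reboot, one way to confirm that the flag actually reached the kernel is to look at /proc/cmdline, which holds the command line the running Linux kernel was booted with. A minimal Python sketch follows, assuming a Python interpreter is available on the server (it may need to be installed separately on Unraid); has_kernel_param is a hypothetical helper name used only for this example.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as its own token on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline is provided by the Linux kernel itself.
    with open("/proc/cmdline") as handle:
        booted_with = handle.read()
    if has_kernel_param(booted_with, "pcie_aspm=off"):
        print("pcie_aspm=off is active for this boot")
    else:
        print("pcie_aspm=off was not found; check the append line and reboot")
```

The check compares whole tokens, so a parameter that was accidentally glued to its neighbour without a space would be reported as missing.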
David Bott Posted February 6

1 hour ago, JorgeB said: Yes, that's it, did it work?

Well, it did for me. I am not 100% sure about your issue or hardware. I had the issue with Samsung 980 Pro 1TB NVMe drives.