juanamingo Posted June 6, 2022

Good afternoon all! Fix Common Problems just notified me that my log folder was filling up - currently about 67% full. I took a look and saw two or three rotated syslog files totaling about 256 MB, so I opened the newest syslog and I'm seeing this repeated:

Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: event severity: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 0, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  bridge: secondary_status: 0x0000, control: 0x0000
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 1, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  bridge: secondary_status: 0x0000, control: 0x0000

device_id 0000:02:00.0 is my Samsung 980 Pro NVMe drive, which is the second drive in my cache pool. I haven't noticed this error before, and I've been running this setup for about 3-4 months. The only thing that has changed is upgrading from 6.9 -> 6.10.1 -> 6.10.2.

One thing that's weird: I looked at the attributes for the drive and see "Power on hours 315 (13d, 3h)" - which is NOT right. Its partner drive shows "Power on hours 2,605", which is ~108 days or ~3 1/2 months - and that sounds about right, as they were installed at almost the same time (about a week apart).

Any suggestions?

guardian-diagnostics-20220606-1527.zip
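For anyone else chasing a similar "log folder filling up" warning: on Unraid, /var/log is a small tmpfs, so a quick way to see how full it is and which files are responsible is something like the following (a generic sketch, not taken from the thread; the paths assume a stock Unraid layout):

```shell
# Show how full the log filesystem is
# (Unraid mounts /var/log as a small tmpfs, typically 128 MB)
df -h /var/log

# List the largest entries under /var/log, biggest first,
# to see whether syslog and its rotated copies are the culprits
du -ah /var/log 2>/dev/null | sort -rh | head -n 10
```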
JorgeB Posted June 7, 2022

Unlikely that it would be a device problem; you can swap it with the other one to confirm. If there's a different M.2/PCIe slot you can use, that might also help.
juanamingo Posted June 7, 2022 (Author)

Thanks - I'll try that as soon as I can pull the server, and let you know. I shouldn't be worried about a drive reporting only 315 power-on hours when it should have more than 1,500?
JorgeB Posted June 7, 2022

I wouldn't; I've seen similar reporting issues before.
juanamingo Posted July 28, 2022 (Author)

Well, I finally got a chance to look into this. I have 4 NVMe slots on my board and 2 NVMe drives in a cache pool. I pulled both drives and installed heatsinks on them, because every so often when the mover was running, one of the drives would hit 50-60 °C, and I didn't like that much. (I think it was the drive that was erroring, but I'm not sure.)

Originally the drive with the errors was in slot 0 and the second drive in slot 1. When reinstalling the drives, I put the drive with the errors in slot 2 and the second drive in slot 0, leaving slots 1 and 3 unoccupied. I'm still seeing the errors, and they've followed the drive...

Some Google research seems to indicate this is a harmless error - BUT in a week or so the log folder will fill up and I'll need to delete syslog.1 or reboot. Any suggestions besides seeing if Samsung will replace the drive?
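As a stop-gap between reboots, the rotated syslog copies can be deleted and the live syslog truncated in place, which frees the tmpfs space without restarting the server. This is a sketch assuming stock Unraid paths; truncating (rather than deleting) the live file keeps syslogd's open file handle valid:

```shell
# Delete the rotated copies that are consuming the log tmpfs
rm -f /var/log/syslog.1 /var/log/syslog.2

# Truncate the live syslog in place rather than deleting it,
# so the running syslog daemon keeps writing to a valid handle
truncate -s 0 /var/log/syslog
```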
JorgeB Posted July 29, 2022

11 hours ago, juanamingo said: "Any suggestions besides seeing if Samsung will replace the drive?"

Not really; I don't know of a way to suppress that error, so you'd need to reboot regularly to keep the log from filling up.
juanamingo Posted July 29, 2022 (Author)

Gotcha - thank you, sir! It looks like I'd have to compile a custom kernel to suppress that.
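For reference, one commonly cited way to silence corrected-AER spam without a custom kernel is the stock `pci=noaer` boot parameter, which disables the kernel's PCIe Advanced Error Reporting driver entirely (so genuine AER reports are lost too). On Unraid this would be added to the `append` line in the boot config; the snippet below is an illustrative sketch of a typical /boot/syslinux/syslinux.cfg entry, not a verified fix for this system:

```
label Unraid OS
  menu default
  kernel /bzimage
  append pci=noaer initrd=/bzroot
```

Whether this is worth the trade-off depends on how much you rely on AER to flag real link problems.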