juanamingo Posted June 6, 2022

Good afternoon all! Fix Common Problems just notified me that my log folder was filling up - currently about 67% full. I took a look and saw two or three rotated syslog files totaling about 256 MB, so I opened the newest syslog and I'm seeing this repeated:

Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: event severity: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 0, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  bridge: secondary_status: 0x0000, control: 0x0000
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 1, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]:  bridge: secondary_status: 0x0000, control: 0x0000

device_id 0000:02:00.0 is my Samsung 980 Pro NVMe drive, which is the second drive in my cache pool. I haven't noticed this error before, and I've been running this setup for about 3-4 months. The only thing that has changed is upgrading from 6.9 -> 6.10.1 -> 6.10.2.

One thing that's weird: I looked at the attributes for the drive and see "Power on hours 315 (13d, 3h)" - which is NOT right. Its partner drive shows "Power on hours 2,605", which is ~108 days or ~3 1/2 months - and that sounds about right, as they were installed at almost the same time (about a week apart).

Any suggestions?

guardian-diagnostics-20220606-1527.zip
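For anyone else chasing a similar "log folder filling up" warning: on Unraid, /var/log is a small tmpfs, so a quick way to see how full it is and which files are responsible is something like the following (a generic sketch, not taken from the thread; the paths assume a stock Unraid layout):

```shell
# Show how full the log filesystem is
# (Unraid mounts /var/log as a small tmpfs, typically 128 MB)
df -h /var/log

# List the largest entries under /var/log, biggest first,
# to see whether syslog and its rotated copies are the culprits
du -ah /var/log 2>/dev/null | sort -rh | head -n 10
```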
JorgeB Posted June 7, 2022

Unlikely that it would be a device problem; you can swap it with the other one to confirm. If there's a different M.2/PCIe slot you can use, that might also help.
juanamingo Posted June 7, 2022 (Author)

Thanks - I'll try that as soon as I can pull the server, and let you know. I shouldn't be worried about a drive reporting only 315 power-on hours when it should have more than 1,500?
JorgeB Posted June 7, 2022

I wouldn't; I've seen similar reporting issues before.
juanamingo Posted July 28, 2022 (Author)

Well, I finally got a chance to look into this. I have 4 NVMe slots on my board and 2 NVMe drives in a cache pool. I pulled both drives and installed heatsinks on them, because every so often when the mover was running, one of the drives would hit 50-60 °C, and I didn't like that much. (I think it was the drive that was erroring, but I'm not sure.)

Originally the drive with the errors was in slot 0 and the second drive in slot 1. When reinstalling the drives, I put the drive with the errors in slot 2 and the second drive in slot 0, leaving slots 1 and 3 unoccupied. I'm still seeing the errors, and they've followed the drive...

Some Google research seems to indicate this is a harmless error - BUT in a week or so the log folder will fill up and I'll need to delete syslog.1 or reboot. Any suggestions besides seeing if Samsung will replace the drive?
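As a stop-gap between reboots, the rotated syslog copies can be deleted and the live syslog truncated in place, which frees the tmpfs space without restarting the server. This is a sketch assuming stock Unraid paths; truncating (rather than deleting) the live file keeps syslogd's open file handle valid:

```shell
# Delete the rotated copies that are consuming the log tmpfs
rm -f /var/log/syslog.1 /var/log/syslog.2

# Truncate the live syslog in place rather than deleting it,
# so the running syslog daemon keeps writing to a valid handle
truncate -s 0 /var/log/syslog
```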
JorgeB Posted July 29, 2022

11 hours ago, juanamingo said: "Any suggestions besides seeing if Samsung will replace the drive?"

Not really; I don't know of a way to suppress that error, so you'd need to reboot regularly to keep the log from filling up.
juanamingo Posted July 29, 2022 (Author)

Gotcha - thank you, sir! It looks like I'd have to compile a custom kernel to suppress that.
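For reference, one commonly cited way to silence corrected-AER spam without a custom kernel is the stock `pci=noaer` boot parameter, which disables the kernel's PCIe Advanced Error Reporting driver entirely (so genuine AER reports are lost too). On Unraid this would be added to the `append` line in the boot config; the snippet below is an illustrative sketch of a typical /boot/syslinux/syslinux.cfg entry, not a verified fix for this system:

```
label Unraid OS
  menu default
  kernel /bzimage
  append pci=noaer initrd=/bzroot
```

Whether this is worth the trade-off depends on how much you rely on AER to flag real link problems.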