November 29, 2023

I've just migrated my Unraid setup to new hardware and hadn't checked the logs for a few days. I noticed there are L3 cache and ECC errors.

Nov 25 10:35:57 Apollo kernel: traps: lsof[30540] general protection fault ip:149830c9cc6e sp:47ac5b58ccaad0d1 error:0 in libc-2.37.so[149830c84000+169000]
Nov 26 11:28:22 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc61c0
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Nov 26 14:26:57 Apollo usbhid-ups[22983]: [D5:22983] send_to_all: ADDCMD driver.reload-or-error
Nov 26 14:26:58 Apollo usbhid-ups[22983]: [D2:22983] send_to_one: sending ADDCMD driver.reload-or-error
Nov 26 14:26:58 Apollo usbhid-ups[22983]: [D5:22983] send_to_one: ADDCMD driver.reload-or-error
Nov 26 14:26:58 Apollo usbhid-ups[22983]: [D6:22983] send_to_one: write 30 bytes to socket 16 succeeded (ret=30): ADDCMD driver.reload-or-error
Nov 26 15:26:32 Apollo kernel: apcupsd[7843]: segfault at 0 ip 000000000041acf3 sp 00007ffd830b9ac0 error 4 in apcupsd[404000+2f000] likely on CPU 31 (core 29, socket 0)
Nov 26 15:53:59 Apollo root: NUTSTATS: Error Updating ups.realpower.json - UPS returned BAD or NULL value.
Nov 26 15:53:59 Apollo root: NUTSTATS: Error Updating input.voltage.json - UPS returned BAD or NULL value.
Nov 26 15:53:59 Apollo root: NUTSTATS: Error Updating input.frequency.json - UPS returned BAD or NULL value.
Nov 26 15:53:59 Apollo root: NUTSTATS: Error Updating output.voltage.json - UPS returned BAD or NULL value.
Nov 26 15:53:59 Apollo root: NUTSTATS: Error Updating output.frequency.json - UPS returned BAD or NULL value.
Nov 26 21:45:30 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4240
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Nov 27 09:13:37 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc61c0
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 27 09:13:38 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Nov 27 20:30:50 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc6b00
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x32d000040a800503
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Nov 28 14:21:15 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4240
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Nov 28 21:10:51 Apollo kernel: mce: [Hardware Error]: Machine check events logged
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: Corrected error, no action required.
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: CPU:1 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4440
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x32d000040a800503
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

The Unraid logs also seem to line up with some logs from the Gigabyte management panel. It has logged this within a minute or two of each of the errors:

Quote
Unknown sensor of type memory logged a smi_handler : Correctable ECC / other correctable memory error(DIMM_D1) was asserted

Is all of this suggesting the RAM in DIMM_D1 is bad, or is there something else going on? I realise EPYC is a bit funny about mounting pressure, so I was going to try reseating the CPU and the RAM. The CPU was preinstalled on the mobo when I got it, but I have an appropriate torque screwdriver now.

I've attached diagnostics just in case. Any help would be greatly appreciated.

apollo-diagnostics-20231129-0847.zip

Edited November 29, 2023 by Benji
November 29, 2023

I would try swapping DIMM_D1 with another one and see if the errors follow the stick.
November 30, 2023, Author

On 11/29/2023 at 12:00 PM, JorgeB said: I would try swapping DIMM_D1 with another one and see if the errors follow the stick.

I swapped it to C1 and I'm getting the same errors in Unraid, but the motherboard panel has changed the error to C1. So is it safe to say it's a bad stick then?
November 30, 2023

8 minutes ago, Benji said: So is it safe to say it's a bad stick then?

I would think so.
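
For anyone wanting to compare the kernel records before and after a swap like this, here is a rough sketch (not something used in the thread) that tallies the Error Addr and Syndrome values from the syslog lines in the first post. The helper names `tally_mce` and `address_span` are made up for illustration, and the sample holds only the Error Addr and IPID/Syndrome lines of the six corrected-error records. The kernel lines carry no slot name; the DIMM_D1 and C1 labels came from the Gigabyte management panel.

```python
import re
from collections import Counter

# Error Addr and Syndrome lines copied from the six corrected-error records
# in the first post (other lines of each record omitted for brevity).
SAMPLE = """\
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc61c0
Nov 26 11:28:22 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4240
Nov 26 21:45:30 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc61c0
Nov 27 09:13:37 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc6b00
Nov 27 20:30:50 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x32d000040a800503
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4240
Nov 28 14:21:15 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x196800020a800503
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: Error Addr: 0x00000001d0bc4440
Nov 28 21:10:51 Apollo kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x32d000040a800503
"""

ADDR_RE = re.compile(r"\[Hardware Error\]: Error Addr: (0x[0-9a-fA-F]+)")
SYND_RE = re.compile(r"Syndrome: (0x[0-9a-fA-F]+)")


def tally_mce(log_text):
    """Count how often each error address and each syndrome appears."""
    addrs = Counter(ADDR_RE.findall(log_text))
    synds = Counter(SYND_RE.findall(log_text))
    return addrs, synds


def address_span(addr_counts):
    """Distance in bytes between the lowest and highest reported address."""
    values = [int(a, 16) for a in addr_counts]
    return max(values) - min(values) if values else 0


if __name__ == "__main__":
    addrs, synds = tally_mce(SAMPLE)
    for addr, count in addrs.most_common():
        print("addr", addr, count)
    for synd, count in synds.most_common():
        print("syndrome", synd, count)
    print("address span (bytes):", address_span(addrs))
```

On the excerpt from the first post this reports four distinct addresses spread over 10432 bytes and two distinct syndromes. Running the same tally on the log captured after moving the stick shows whether the kernel is still reporting the same small set of addresses while the management panel names the new slot.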