March 9, 2025

Hello folks!

For the past few months, I've been getting MCE errors on my server. Nothing serious seems to accompany them, no crashes or other issues, just these warnings. I'm concerned that they could indicate a fault that is growing into a bigger issue. They were pretty infrequent at first, but lately I've been getting them more often.

I currently have the machine down, running MemTest86 to check for potential issues with the RAM sticks, so I don't have access to logs at the moment, but here is one of the recent events:

Mar 1 20:06:14 Wintermute kernel: mce: [Hardware Error]: Machine check events logged
Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: Deferred error, no action required.
Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: CPU:1 (19:21:0) MC16_STATUS[-|-|-|-|-|-|Deferred|-|-]: 0x9090909090909090
Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: IPID: 0x0000000000000000
Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: Bank 16 is reserved.
Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: cache level: RESV, tx: INSN

Each MCE event is similar, with minor differences such as referencing a different bank being reserved. I can post more logs once the system is up again.

As for hardware specs:

CPU: AMD Ryzen 7 5800X
RAM: Corsair Vengeance RGB Pro 32GB (2x16GB) DDR4 3600 [PC4-28800] - 2 kits, 4 sticks total for 64GB. Not as optimal as 2 sticks, I know.
Mobo: ASUS B650-PLUS TUF Gaming WIFI ATX AM5 Motherboard TUF GAMING B650-PLUS WIFI
PSU: EVGA SuperNOVA 850 G2
HDDs: 4 Seagate IronWolf Pro 22TB drives

^ All of this is roughly 4-5 years old.

Does anyone have any advice as to what the cause might be, or how I could go about troubleshooting further? Thanks!

Edit: MemTest86 did not come up with any errors. I've attached diagnostics; however, they don't seem to cover the period when the issue occurred, since I restarted the machine (sorry, I'm new to this). I'll post them regardless in case they are of use, and I'll post an updated set if/when the issue recurs.

wintermute-diagnostics-20250309-2107.zip

Edited March 10, 2025 by ReddlyB92: Adding diagnostic file.
March 11, 2025

One of the biggest reasons for ECC is stray electrons. It's entirely possible that a bit was flipped in memory by an electron. This wasn't caused by a hardware failure, but was detected by the ECC and corrected. There is an awesome video on this cause, but sadly I cannot find it. If I do, I will post it here.

Found it!

Edited March 11, 2025 by Beercules: added video
March 11, 2025

Community Expert Solution

On 3/9/2025 at 3:58 PM, ReddlyB92 said: "I've been getting MCE errors on my server"

Does this only happen soon after booting and not again until next boot?
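
One rough way to check this from a saved syslog is to measure how long after the most recent boot each MCE event was logged. The following is only a minimal sketch under stated assumptions: it expects syslog-style lines like the ones in the first post, and it assumes each boot starts with a kernel line containing "Linux version" (adjust `BOOT_MARKER` if this system's log differs). The function name `seconds_after_boot` and the `year` parameter are hypothetical; the year is supplied by hand because these timestamps do not include one.

```python
import re
from datetime import datetime

# Matches syslog-style kernel lines such as:
# "Mar 1 20:06:14 Wintermute kernel: [Hardware Error]: Bank 16 is reserved."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: (?P<msg>.*)$"
)

# Assumption: the kernel logs a "Linux version ..." line at the start of each boot.
BOOT_MARKER = "Linux version"
# First line of each MCE event in the log excerpt from the original post.
MCE_MARKER = "Machine check events logged"


def seconds_after_boot(lines, year=2025):
    """Return (timestamp, seconds since the last boot marker) for each MCE event.

    The second element is None when no boot marker was seen before the event.
    """
    last_boot = None
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a kernel line in the expected format
        # The log has no year, so prepend the one supplied by the caller.
        when = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
        message = match["msg"]
        if BOOT_MARKER in message:
            last_boot = when
        elif MCE_MARKER in message:
            delta = None if last_boot is None else (when - last_boot).total_seconds()
            events.append((when, delta))
    return events
```

Passing the lines of a syslog file to `seconds_after_boot` gives one entry per MCE event. If every delay is small (say, under a minute) and no event appears later in the same uptime, the errors are boot-time only. Large or widely varying delays suggest they also occur during normal operation.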