Kyle W Posted February 7, 2023
I noticed last night that my Mac VM was not running after I had not touched my NAS all weekend (my BlueBubbles client was not connected on my Android). When I logged into the web GUI, there was a notification of an unclean shutdown and a parity check in progress. The power did not go out (I have a UPS which triggers a shutdown within 120 seconds), and I asked my family if anyone had touched the NAS, but they hadn't. Looking at the logs, there is a significant number of machine check events with the same IPID and syndromes 0x7e3a00100a800a02 and 0xc3f501000a800a02. I'm assuming there was a crash that resulted in a reboot of the system.

Specs:
AsRock B450M Pro 4, P5.40 BIOS
Ryzen 5700X
Nemix 2x16GB DDR4-3200 unbuffered ECC RAM
3x 8TB WD Red Plus
1x 1TB Samsung 980 Pro cache drive
Nvidia Quadro P4000 (only using for video out so I can modify BIOS settings, etc. right now)

Here's an example of the MCE, diagnostics also attached:

Feb 6 15:06:14 Kyle-Server kernel: mce: [Hardware Error]: Machine check events logged
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: Corrected error, no action required.
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: CPU:0 (19:21:2) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: Error Addr: 0x00000000b5a68b40
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x7e3a00100a800a02
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Feb 6 15:06:14 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x3169a2 offset:0xc40 grain:64 syndrome:0x10)
Feb 6 15:06:14 Kyle-Server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Any advice for troubleshooting this one?
The RAM is running at its specified 3200MHz speed and I've not touched any other settings in the BIOS except for custom fan curves on my chassis fans. I did see a note about setting typical idle current in the BIOS, so I will look into that later today when the parity check is finished. All the listed components are new except for the Quadro. I've been running the RAM since December 31 with no other issues, and the current CPU since January 19 (previously a Ryzen 5700G, which I returned due to lack of ECC support, then a Ryzen 1600X for about a week while I was waiting for my 5700X to arrive), also with no noticeable issues, but I have not been keeping an eye on the logs. I did have a fan controller failure about a week ago, and the machine shut down from what I assumed to be a CPU over-temperature condition. I'm hoping the CPU wasn't damaged by this, though it is still within the return window. The RAM would require a manufacturer RMA at this point. Two of the WD Reds hit 47C and 50C, which has me a bit freaked out, but that's within their operating temperature range.
Attachment: server-diagnostics-20230207-1026.zip
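For anyone wanting to see whether the corrected errors cluster on one location, here is a minimal Python sketch that tallies EDAC "CE" lines per memory controller, csrow and channel. It assumes the line format shown in the excerpt above; the function name `count_corrected_errors` and the `sample` list are made up for illustration, and the real input would be whatever syslog file your system writes.

```python
import re
from collections import Counter

# Pattern for the EDAC summary line as it appears in the excerpt above.
# Other kernel versions may format this line differently (assumption).
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on \S+ "
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
)

def count_corrected_errors(lines):
    """Tally corrected errors per (memory controller, csrow, channel)."""
    totals = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            key = (int(match["mc"]), int(match["csrow"]), int(match["channel"]))
            totals[key] += int(match["count"])
    return totals

# Two lines copied from the log excerpt; only the second is an EDAC line.
sample = [
    "Feb 6 15:06:14 Kyle-Server kernel: mce: [Hardware Error]: Machine check events logged",
    "Feb 6 15:06:14 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 "
    "(csrow:2 channel:0 page:0x3169a2 offset:0xc40 grain:64 syndrome:0x10)",
]

totals = count_corrected_errors(sample)
for (mc, csrow, channel), count in totals.items():
    print(f"MC{mc} csrow {csrow} channel {channel}: {count} corrected error(s)")
```

If every hit lands on the same csrow and channel, that points toward one stick or slot rather than a general instability, though it does not say which of the two is at fault.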
apandey Posted February 8, 2023
Download and run memtest86 to start with.
Kyle W Posted February 8, 2023
Note: tested the typical idle current setting and that did not fix the issue. Running Memtest today.
Kyle W Posted February 9, 2023
Memtest passed 3 times after running for roughly 24 hours. I am going to swap in my Ryzen 1600X to see if the CPU is causing problems.
Kyle W Posted February 10, 2023
Well, unfortunately the CPU swap did not fully resolve the issue. The errors are different now, but not gone. I'm going to reach out to Nemix regarding an RMA on the RAM.

Feb 9 22:53:27 Kyle-Server kernel: mce: [Hardware Error]: Machine check events logged
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: Corrected error, no action required.
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: Error Addr: 0x00000000b3052b60
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000004c60a400a02
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Feb 9 22:53:27 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x1a60a5 offset:0x760 grain:64 syndrome:0x4c6)
Feb 9 22:53:27 Kyle-Server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Attachment: kyle-server-diagnostics-20230209-2331.zip
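To compare the two status words logged in this thread (0x9c2041000000011b with the 5700X and 0x9c2040000000011b with the 1600X), a small Python sketch can turn each value into flag names. The bit positions below are taken, as far as I know, from the Linux kernel's MCI_STATUS_* definitions; treat them as an assumption and check your kernel's headers. `STATUS_FLAGS` and `decode_status` are hypothetical names for illustration.

```python
# Assumed bit positions for MCA status flags (per the Linux kernel's
# MCI_STATUS_* macros on AMD systems; verify against your kernel source).
STATUS_FLAGS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 53: "SyndV", 46: "CECC", 45: "UECC", 44: "Deferred",
    43: "Poison", 40: "Scrub",
}

def decode_status(status):
    """Return the names of the flags whose bits are set in a status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]

with_5700x = decode_status(0x9C2041000000011B)
with_1600x = decode_status(0x9C2040000000011B)

print("5700X:", with_5700x)
print("1600X:", with_1600x)
# Flags present in one reading but not the other.
print("difference:", sorted(set(with_5700x) ^ set(with_1600x)))
```

Under these assumptions both values decode to a corrected ECC error with valid address and syndrome, and the only flag that differs is Scrub, which agrees with the bracketed flag lists the kernel printed. Both CPUs reporting a corrected DRAM ECC error on the same csrow and channel is what makes the RAM the more likely suspect.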
apandey Posted February 10, 2023
Have you tried running with a single RAM stick at a time to try and eliminate any bad ones? Normally, if it's stable in memtest, it should not cause problems. I hope you ran the latest downloaded memtest, not the one bundled with Unraid.
Kyle W Posted February 10, 2023
7 hours ago, apandey said: Have you tried running with a single RAM stick at a time to try and eliminate any bad ones? Normally, if it's stable in memtest, it should not cause problems. I hope you ran the latest downloaded memtest, not the one bundled with Unraid.
I'll try this next. I did download the latest memtest and created a bootable USB.
Kyle W Posted February 14, 2023
Well, it's been running for over 24 hours with no errors after pulling the second stick. Hopefully I've isolated the bad stick!
Lolight Posted February 14, 2023 (edited February 15, 2023)
On 2/14/2023 at 12:20 AM, Kyle W said: Well it's been running for over 24 hours with no errors after pulling the second stick. Hopefully I've isolated the bad stick!
A 24-hour test is not really required to isolate bad RAM. 10 passes without errors would be sufficient at first. Run each stick separately for 10 passes. If a stick shows errors, try it in a different slot. If neither stick shows errors, run the sticks together for another 10 passes. If errors appear only then, the memory controller or the memory slots could be at fault. In that case, try switching RAM slots and run the test again.
Kyle W Posted March 3, 2023
I ran the machine for about 20 days without errors after removing the single stick I previously isolated. Nemix sent me an RMA and I mailed the stick in; a little over a week later I had a replacement in hand. It has been over 24 hours without any errors so far, so I'm hoping things are resolved.
duelistjp Posted March 12, 2023
On 2/13/2023 at 10:17 PM, Lolight said: A 24 hours long test is not really required to isolate bad RAM. 10 passes without errors would be sufficient at first. Run each stick separately for 10 passes. If errors are encountered try a different slot. If no errors then run the sticks together for another 10 passes. If errors are encountered then the memory controller or memory slots could be at fault. In that case try switching RAM slots and do it again.
Depends how many sticks. If you have more than 2, you are better off with a binary search, testing half of them at a time. It makes a pretty big difference if you are running server hardware with 16 sticks of RAM.
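The binary search suggested here can be sketched in Python as follows. This is only an illustration under stated assumptions: exactly one stick is faulty, the fault follows the stick rather than a slot or the memory controller, and the board will boot with any subset installed. `find_bad_stick` and `fake_memtest` are hypothetical names; in practice "running the test" means physically installing that half and running memtest.

```python
def find_bad_stick(sticks, passes_memtest):
    """Locate the single faulty stick by testing half of the candidates at a time.

    passes_memtest(group) must return True only if every stick in the
    installed group is good (assumption: exactly one stick is faulty).
    """
    candidates = list(sticks)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if passes_memtest(first_half):
            # The tested half is clean, so the fault is in the other half.
            candidates = candidates[middle:]
        else:
            candidates = first_half
    return candidates[0]

# Simulation: 16 sticks numbered 0..15, with stick 11 being the bad one.
runs = []

def fake_memtest(group):
    runs.append(list(group))
    return 11 not in group

bad = find_bad_stick(range(16), fake_memtest)
print(f"bad stick: {bad}, memtest runs needed: {len(runs)}")
```

In this simulation the faulty stick is found in 4 memtest runs instead of up to 16 one-at-a-time runs. With only two sticks, as in this thread, there is nothing to gain over testing each stick on its own.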