singha Posted March 20, 2023 (edited)
Apologies if I am posting this in the wrong area; this is my first time on the forum. My machine has crashed twice, and the Fix Common Problems plugin reports "Machine Check Events detected on your server". I've been running this server for almost 1.5 years. The first crash was 3 days ago; the second was today. NerdPack and mcelog are already installed. In the BIOS, "Power Supply Idle Control" is set to "Typical Current Idle", C-states are disabled globally, and DOCP is disabled (it has been configured this way for about a year). I am looking for your technical expertise, and I thank you for your time.
EDIT: No BIOS update has been applied for more than a year. Both times, the server was heavily loaded (22 of 32 threads) when the MCE occurred.
Attachment: tower-diagnostics-20230320-0829.zip
Edited March 20, 2023 by singha (More details)
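For anyone wanting to check a saved system log for machine check entries before digging further, here is a minimal sketch in Python. It assumes the relevant lines contain either `mce:` or the phrase "machine check"; that matches the usual kernel wording, but the exact format on a given system may differ. The function name `find_mce_lines` and the sample log lines are made up for illustration and are not taken from the attached diagnostics.

```python
import re
from typing import Iterable, List

# Assumption: machine check entries mention "mce:" or "machine check".
# Adjust the pattern if your log uses different wording.
MCE_PATTERN = re.compile(r"\bmce:|machine check", re.IGNORECASE)


def find_mce_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like machine check reports."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


# Illustrative sample lines (not from a real diagnostics file).
sample_log = [
    "Mar 20 08:01:02 Tower kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Mar 20 08:01:03 Tower kernel: usb 1-1: new high-speed USB device\n",
]

for hit in find_mce_lines(sample_log):
    print(hit)
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_mce_lines(open(path, errors="replace"))`, where `path` points at the syslog extracted from the diagnostics zip.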
singha Posted March 21, 2023
Ran memtest86+ (the one bundled with Unraid) for 48 hours last weekend and am now running it again. No errors. Disabled PBO, enabled SR-IOV, and changed the Tips and Tweaks plugin power mode to performance. Let's see if it makes a difference. This error doesn't occur very often, though, so it might take some time to resurface.
singha Posted March 27, 2023 (edited)
That didn't work. It happened again under a heavy sustained load.
Attachment: tower-diagnostics-20230327-0114.zip
Edited March 27, 2023 by singha
singha Posted April 13, 2023
I'm thinking bad RAM or overheating RAM, which would explain the sudden crash/shutdown.
Solution: singha Posted May 8, 2023
It was overheating RAM.
B_Sinn3d Posted May 8, 2023
7 minutes ago, singha said: "It was overheating ram"
Just curious, how did you determine the RAM was overheating? I get these errors on my backup server (R720xd) every once in a while.
singha Posted May 8, 2023 (edited)
It was only happening during a very specific (and thermally hot) workload. If I didn't run that workload, the server didn't crash. Memtest didn't show any errors (I ran it for a week), and since the system was stable except under that one workload, faulty memory seemed unlikely. The one thing that stood out was that the system was running very hot. Normally, CPUs and VRMs will throttle if they get hot. The only thing that won't throttle is RAM. I increased the fan speeds everywhere (I had previously set a custom, quieter fan curve) and haven't crashed for about a month now, even when running that specific workload.
Hypothetically, it could still be bad RAM. If the RAM overheats, a bit flip could occur in kernel space and cause a crash. Nevertheless, in my case, the solution was to increase cooling.
EDIT: MCE can refer to a lot of things, so I'm not sure whether this applies to your problem.
Edited May 8, 2023 by singha
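One way to check whether a workload is pushing temperatures up is to read the sensors Linux exposes under `/sys/class/hwmon` while that workload runs. The sketch below is only an illustration, not the method used in this thread: the function names `read_hwmon_temps` and `over_limit` are hypothetical, the 80 °C threshold is an arbitrary example, and whether the DIMMs expose a temperature sensor at all depends on the hardware and loaded drivers.

```python
from pathlib import Path
from typing import Dict


def read_hwmon_temps(base: str = "/sys/class/hwmon") -> Dict[str, float]:
    """Return {"hwmonN:chip/label": degrees Celsius} for each readable sensor.

    hwmon temperature inputs are reported in millidegrees Celsius.
    """
    temps: Dict[str, float] = {}
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            prefix = input_file.name[: -len("_input")]  # e.g. "temp1"
            label_file = chip_dir / f"{prefix}_label"
            label = label_file.read_text().strip() if label_file.exists() else prefix
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read; skip them
            temps[f"{chip_dir.name}:{chip}/{label}"] = millidegrees / 1000.0
    return temps


def over_limit(temps: Dict[str, float], limit_c: float) -> Dict[str, float]:
    """Keep only the sensors at or above limit_c (the threshold is your choice)."""
    return {name: value for name, value in temps.items() if value >= limit_c}


# Example: list every sensor currently at or above an arbitrary 80 °C.
for name, value in over_limit(read_hwmon_temps(), 80.0).items():
    print(f"{name}: {value:.1f} C")
```

Running this in a loop before and during the hot workload, and comparing the readings with the timestamps of any crashes, gives a rough picture of whether heat correlates with the problem. It cannot prove that RAM is the component at fault.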