Sporadic Crashing on Epyc 7551P

AlaskanBeard · September 12, 2022

I recently moved Unraid over to a server with an AMD CPU (Epyc 7551P), and my server has been crashing ever since. So far uptime has ranged from 2 hours to ~50, with it typically crashing around the 5-hour mark. Aside from the CPU, memory, and motherboard, I haven't made any other changes.

So far I've run two memtest passes without error, and I've disabled global C-states in the BIOS.

I have server health logging enabled in my IPMI as well, and the only thing logged there is "Correctable ECC / other correctable memory error @DIMMC1 - Assertion". However, there's one of these messages for each DIMM each time a crash happens.

I've also tried upgrading from 6.10.3 to 6.11.0-rc4 with the same behavior and I've since reverted to 6.10.3.

tower-diagnostics-20220912-1227.zip

JorgeB · September 13, 2022

Fix the RAM problem; an uncorrectable error will halt the server.

Memtest doesn't detect ECC-corrected errors. Remove one or more DIMMs and see if the SEL errors stop.
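To see whether the SEL errors really stop after pulling a DIMM, it can help to tally the correctable ECC entries per slot. The snippet below is only a rough sketch: it assumes the SEL has been exported as plain text, one event per line, worded like the message quoted earlier in this thread. The function name and the sample lines are made up for illustration, and other BMCs may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted in this thread; this is an
# assumption about the log wording, not a documented IPMI format.
ECC_PATTERN = re.compile(r"Correctable ECC.*?@(DIMM\w+)")


def count_ecc_by_dimm(sel_lines):
    """Return a Counter mapping DIMM slot name -> correctable ECC event count."""
    counts = Counter()
    for line in sel_lines:
        match = ECC_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines (not taken from the attached diagnostics).
sample = [
    "Correctable ECC / other correctable memory error @DIMMC1 - Assertion",
    "Correctable ECC / other correctable memory error @DIMMA1 - Assertion",
    "Correctable ECC / other correctable memory error @DIMMC1 - Assertion",
    "Some unrelated sensor event - Deassertion",
]

counts = count_ecc_by_dimm(sample)
for slot, n in counts.most_common():
    print(slot, n)
```

Comparing the tally before and after removing a DIMM shows whether new events are still being logged and against which slots.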

AlaskanBeard · September 14, 2022

It's hard to say for sure, since one of my crashes happened after 2 days of uptime, but it does seem to be doing fine with what I think is the problem DIMM removed. Only issue now is I'm running out of memory haha.

I'm going to stop some of my containers to reduce memory usage and I'll report back in a few days if I don't have any crashes.

Thanks for the suggestion!

AlaskanBeard · September 17, 2022

I'm unfortunately still having issues.

I haven't had an ECC error logged since I took out the one DIMM, so I do think that was an issue.

unRAID has locked up a couple of times due to CPU and memory consumption. The memory consumption I've solved by just powering off a couple of my containers. The CPU consumption, I think I've fixed as well. I'd read a couple of threads where cache drive corruption was an issue, and I decided now was as good a time as any to replace my cache array with a single NVMe drive. I haven't had any CPU consumption issues since.

After all that, I managed ~45 hours of uptime before unRAID crashed. I've been using grafana to check CPU and memory usage when crashes happen, and this most recent crash has CPU usage right at 50% and memory at 2.5GB free, with 26GB used and 19GB for cache+buffer, so I'm thinking it's not a memory consumption issue either? I'm not sure how much memory unRAID needs free at any given time.
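On the free-memory question: on Linux, most of the cache+buffer figure can normally be reclaimed when applications need it, so "free" on its own understates what is usable. The kernel publishes its own estimate as MemAvailable in /proc/meminfo. Below is a minimal sketch of reading it. The helper names and the sample numbers are hypothetical (chosen only to roughly resemble the figures above), and how much cache is actually reclaimable on a given Unraid box is not something this thread establishes.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Most fields are reported in kB; the unit suffix is ignored here.
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(fields):
    """Kernel's estimate of memory usable without swapping, in GiB."""
    return fields["MemAvailable"] / (1024 * 1024)


# Hypothetical sample, NOT taken from the attached diagnostics.
sample = """MemTotal:       49807360 kB
MemFree:         2621440 kB
MemAvailable:   18874368 kB
Buffers:          524288 kB
Cached:         19398656 kB
"""

fields = parse_meminfo(sample)
print(f"free: {fields['MemFree'] / (1024 * 1024):.1f} GiB")
print(f"available: {available_gib(fields):.1f} GiB")

# On the server itself, the same helpers could be fed the live file:
# with open("/proc/meminfo") as f:
#     fields = parse_meminfo(f.read())
```

If MemAvailable is still comfortably large at the time of a crash, memory pressure is a less likely cause.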

I've attached new diagnostics generated after this most recent crash. I've also updated to 6.11 rc5 (after the crash), and the crash happened on rc4.

tower-diagnostics-20220917-1327.zip

JorgeB · September 18, 2022

Enable the syslog server and post the syslog after a crash. Also make sure power supply idle control is correctly set.

AlaskanBeard · September 18, 2022

Unfortunately I don't have any power supply idle control settings in my BIOS, but I do have global C-States disabled.

The last crash happened ~16 hours ago. I believe that boot finished at ~September 17th 12:17 in the logs, and then I installed rc5 and rebooted at 12:35, if I'm reading the logs right. And it looks like the boot started at 12:15:39 after the most recent crash.

syslog

JorgeB · September 19, 2022

Nothing relevant logged that I can see, which usually points to a hardware problem. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

AlaskanBeard · October 3, 2022

Thanks again for your help!

The issue was twofold. The first is that my BIOS was getting reset. Despite no warnings in IPMI, the CMOS battery was bad, and replacing it seems to have solved the hard crashes I was seeing (now that it remembers C-states are supposed to be disabled). I was able to get a little over 6 days of uptime before I re-enabled TDarr (more on that below).

The other issue I was seeing is that sometimes it wouldn't crash and reset, it would just hit 100% CPU usage and become unresponsive. I was able to kill docker one time this happened, and while the system didn't recover, I did see CPU usage drop to a more normal level. From there, I started experimenting with the containers I'm running, and long story short, it's 100% TDarr.

For whatever reason, on this hardware, using my GPU to re-encode videos just saturates the CPU. There's a setting in the application to set ffmpeg priority to low, and that seems to have fixed the issue.

If someone somehow stumbles on this from Google, the TDarr setting is under GPU > Options > Low FFmpeg/HandBrake process priority.
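For anyone wondering what a "low process priority" option generally amounts to on Linux, here is a rough sketch of launching a child process with a raised nice value, so the scheduler favours everything else when the CPU is contended. This is an assumption about the general technique, not a description of how TDarr implements its setting. The helper name is hypothetical and the child command is a harmless stand-in for an actual ffmpeg invocation.

```python
import os
import subprocess
import sys


def run_low_priority(cmd, niceness=19):
    """Run cmd with lowered CPU scheduling priority (POSIX only).

    os.nice() is called in the child just before exec, so only the
    child's priority changes, not the calling process's.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(niceness),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in child: a Python one-liner that reports its own niceness.
# os.nice(0) adds 0 and returns the current value.
result = run_low_priority(
    [sys.executable, "-c", "import os; print(os.nice(0))"]
)
child_niceness = int(result.stdout.strip())
print("child niceness:", child_niceness)
```

A niced process can still use 100% CPU on an otherwise idle box; the difference is that other processes get scheduled ahead of it, which fits with the system staying responsive after the setting was changed.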
