AlaskanBeard Posted September 12, 2022

I recently moved unraid over to a server with an AMD CPU (Epyc 7551P) and my server has been crashing ever since. So far uptime has ranged from 2 hours to ~50, with it typically crashing around the 5 hour mark. Aside from the CPU, memory, and motherboard, I haven't made any other changes. So far I've run two memtest passes without error, and I've disabled global C-states in the BIOS. I have server health logging enabled in my IPMI as well, and the only thing logged there is "Correctable ECC / other correctable memory error @DIMMC1 - Assertion"; however, there's one of these messages for each DIMM each time a crash happens. I've also tried upgrading from 6.10.3 to 6.11.0-rc4 with the same behavior, and I've since reverted to 6.10.3.

tower-diagnostics-20220912-1227.zip
JorgeB Posted September 13, 2022

Fix the RAM problem; an uncorrectable error will halt the server. Memtest doesn't detect ECC-corrected errors, so remove one or more DIMMs and see if the SEL errors stop.
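To see which module is throwing the corrected errors before pulling DIMMs, the SEL can be tallied per slot. A minimal sketch, assuming `ipmitool` is available on the host; the sample lines below are made-up stand-ins for real `ipmitool sel elist` output:

```shell
# On the real server you'd feed the filter from the BMC directly:
#   ipmitool sel elist | grep -i 'Correctable ECC' | ...
# The sample below is hypothetical data standing in for that output.
sel_sample='1 | 09/12/2022 | 10:03:11 | Memory #0x53 DIMMC1 | Correctable ECC | Asserted
2 | 09/12/2022 | 10:03:11 | Memory #0x54 DIMMD1 | Correctable ECC | Asserted
3 | 09/13/2022 | 02:17:40 | Memory #0x53 DIMMC1 | Correctable ECC | Asserted'

# Tally correctable-ECC assertions per DIMM slot; a lopsided count
# points at the module worth removing first.
echo "$sel_sample" | grep -i 'Correctable ECC' | awk -F'|' '{print $4}' | sort | uniq -c | sort -rn
```

If every DIMM logs an error at each crash (as here), the counts may not single one module out, but the timestamps still show whether the errors precede or follow the crash.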
AlaskanBeard Posted September 14, 2022

It's hard to say for sure, since one of my crashes happened after 2 days of uptime, but it does seem to be doing fine with what I think is the problem DIMM removed. Only issue now is I'm running out of memory haha. I'm going to stop some of my containers to reduce memory usage, and I'll report back in a few days if I don't have any crashes. Thanks for the suggestion!
AlaskanBeard Posted September 17, 2022

I'm unfortunately still having issues. I haven't had an ECC error logged since I took out the one DIMM, so I do think that was a real problem. unRAID has locked up a couple of times due to CPU and memory consumption. The memory consumption I've solved by powering off a couple of my containers. The CPU consumption I think I've fixed as well: I'd read a couple of threads where cache drive corruption was an issue, and I decided now was as good a time as any to replace my cache array with a single NVMe, and I haven't had any CPU consumption issues since. After all that, I managed ~45 hours of uptime before unRAID crashed. I've been using Grafana to check CPU and memory usage when crashes happen, and this most recent crash shows CPU usage right at 50% and 2.5GB of memory free, with 26GB used and 19GB for cache+buffers, so I'm thinking it's not a memory consumption issue either? I'm not sure how much memory unRAID needs free at any given time. I've attached new diagnostics generated after this most recent crash. I've also updated to 6.11-rc5 (after the crash); the crash itself happened on rc4.

tower-diagnostics-20220917-1327.zip
JorgeB Posted September 18, 2022

Enable the syslog server and post that after a crash. Also make sure power supply idle control is correctly set in the BIOS.
AlaskanBeard Posted September 18, 2022

Unfortunately I don't have any power supply idle control settings in my BIOS, but I do have global C-states disabled. The last crash happened ~16 hours ago. I believe that boot finished at ~September 17th 12:17 in the logs, and then I installed rc5 and rebooted at 12:35, if I'm reading the logs right. And it looks like the boot after the most recent crash started at 12:15:39.

syslog
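For anyone else trying to match timestamps to reboots in a saved syslog: the kernel's boot banner is a convenient marker, since the line just before each banner (other than the first) is the last thing logged before that crash. A sketch using a hypothetical excerpt; on the real file you'd point grep at the saved log itself:

```shell
# Made-up excerpt standing in for the saved syslog file; on the real
# server you'd run:  grep -n 'kernel: Linux version' /path/to/saved/syslog
log_sample='Sep 17 12:15:39 Tower kernel: Linux version 5.19.9-Unraid
Sep 17 12:17:02 Tower root: starting services
Sep 17 12:35:10 Tower kernel: Linux version 5.19.9-Unraid
Sep 17 12:36:44 Tower root: starting services'

# Each "Linux version" line marks the start of a boot; -n prints the
# line number so you can jump to the lines immediately before it.
echo "$log_sample" | grep -n 'kernel: Linux version'
```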
JorgeB Posted September 19, 2022

Nothing relevant logged that I can see; that usually points to a hardware problem. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
AlaskanBeard Posted October 3, 2022

Thanks again for your help! The issue was twofold. The first is that my BIOS was getting reset: despite no warnings in IPMI, the CMOS battery was bad, and replacing it seems to have solved the hard crashes I was seeing (now that it remembers C-states are supposed to be disabled). I was able to get a little over 6 days of uptime before I re-enabled TDarr (more on that below). The other issue I was seeing is that sometimes it wouldn't crash and reset; it would just hit 100% CPU usage and become unresponsive. I was able to kill docker one time this happened, and while the system didn't recover, I did see CPU usage drop to a more normal level. From there I started experimenting with the containers I'm running, and long story short, it's 100% TDarr. For whatever reason, on this hardware, re-encoding videos with my GPU saturates the CPU. There's a setting in the application to set ffmpeg priority to low, and that seems to have fixed the issue. If someone somehow stumbles on this from Google, the TDarr setting is under GPU > Options > Low FFmpeg/HandBrake process priority.
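For context on why that toggle helps: lowering the process priority lets the scheduler preempt the encoder whenever anything else needs CPU, which is presumably what TDarr's low-priority option does under the hood (an assumption on my part, not confirmed from its source). A minimal sketch of the same idea with `nice`; the ffmpeg command is only a placeholder:

```shell
# In practice you'd wrap the real encode, e.g.:
#   nice -n 19 ffmpeg -i in.mkv -c:v hevc_nvenc out.mkv
# Here a stand-in command just shows the child really runs deprioritized:
# `nice` with no arguments prints the current niceness.
nice -n 19 sh -c 'echo "encoder would run at niceness $(nice)"'
```

Niceness 19 is the lowest priority; the encode still uses idle CPU, it just can't starve the rest of the system the way an unniced GPU-transcode helper apparently did here.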