Adeon, posted October 19, 2022 (edited October 20, 2022):

Hi,

What I mean by "hard crash" is that the server turns off completely. IPMI is still available, but I cannot power the server back on via IPMI. I can only turn it back on if I unplug the power and wait a couple of seconds (discharging the capacitors).

I have set up the syslog server, but there is nothing useful in the logs at all. The crash always happens while the server is idling at 45 W. The server is connected to a PDU that is connected to a UPS. No other device connected to the UPS (modem, router, etc.) turns off.

Hardware:
- Motherboard: Supermicro X11SCH-F
- CPU: Xeon E-2278G
- RAM: 64 GB DDR4 ECC @ 2666 MHz
- PSU: Seasonic Focus PX 750W (80+ Platinum)
- 3 x Toshiba 16 TB HDDs
- 2 x Silicon Power 1 TB M.2 NVMe drives

The server ran smoothly for 1.5 years. The crashes started about 3 months ago (no hardware changes at that time).

What I have done so far:
- MemTest86+, 1 pass
- Changed the custom network to "ipvlan"
- 1 hour of Prime95 small FFTs (150 W max power draw, 68 °C max CPU temperature)
- Checked and reseated all power connections on the motherboard
- Reseated the 2 RAM modules
- After the last crash, connected the server directly to the UPS (without the PDU)

The relevant part of the syslog is attached. I have an appdata backup script running at 4:30 that stops and starts my Docker containers. I don't know the exact time, but I think the server crashed between 11:00 and 13:00.

The Unraid version is 6.11.1; the crashes also happened with 6.10.3.

Attachment: Syslog.txt
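Editor's sketch: since the syslog shows nothing and the crash time is only known to within a couple of hours, one way to narrow it down is a heartbeat file that gets a timestamp appended at a fixed interval, so the last entry before a gap marks when the machine went dark. The Python below is a minimal sketch of that idea, not something used in the thread. The function names, the 60-second interval, and the slack factor are all assumptions chosen for illustration.

```python
import os
import time
from pathlib import Path


def append_heartbeat(log_path, now=None):
    """Append one Unix timestamp (whole seconds) to log_path and force it to disk."""
    stamp = int(time.time() if now is None else now)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(f"{stamp}\n")
        fh.flush()
        # fsync so the entry is not lost in a write cache when power drops.
        os.fsync(fh.fileno())
    return stamp


def read_heartbeats(log_path):
    """Return all timestamps stored in log_path, in file order."""
    path = Path(log_path)
    if not path.exists():
        return []
    return [int(tok) for tok in path.read_text(encoding="utf-8").split() if tok.isdigit()]


def find_gaps(stamps, interval=60, slack=2.0):
    """Return (last_seen, next_seen) pairs whose spacing exceeds interval * slack.

    interval is the expected seconds between beats; slack is a hypothetical
    tolerance factor so a slightly late beat is not reported as an outage.
    """
    limit = interval * slack
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]
```

A scheduler of your choice would call `append_heartbeat(path)` once per interval, with `path` pointing at storage that survives a power loss (the exact location is left to the reader). After a crash, `find_gaps(read_heartbeats(path))` gives the last timestamp before the outage and the first one after the restart.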
JorgeB, posted October 20, 2022:

Adeon said: "what i mean with "Hard crash" is, the server just turns completely off."

That suggests a hardware problem; the PSU or the board would be the main suspects.
Adeon (author), posted October 20, 2022 (edited October 21, 2022):

What I remembered now is that I disabled "Restart after AC loss" in the BIOS, so that is the reason I have to restart the server manually. There is also nothing useful in the IPMI logs (the only entries are "ACPowerOn(OEM)").

After the parity check is done, I'll update the BIOS and BMC and do another run with MemTest86+, but this time with a downloaded version, because I learned that the version shipped with Unraid is not capable of reporting ECC errors. If the server crashes again, I'll try another PSU and reseat the CPU.

What is weird to me is that the server has never crashed under load, only when idling, and the server is idling 90% of the time. Because of that, I changed the "Normal CPU Scaling Governor" setting in the Tips & Tricks plugin back to "Performance" (it was "Power Save"), just in case there is an issue with C-states or P-states.

I'll keep updating this thread and hopefully find a solution.

Edit: Ran MemTest for ~14 hours (4 passes) with 0 errors.
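Editor's sketch: after changing the governor in a plugin, it can be worth confirming which governor each core is actually using. The Python below is a minimal sketch that assumes the host exposes the standard Linux cpufreq sysfs files (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`). Whether they are present depends on the kernel and the active frequency driver, and the function name `read_governors` is hypothetical.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its active scaling governor.

    Assumes the standard Linux cpufreq sysfs layout; returns an empty dict
    if no scaling_governor files are found under cpu_root.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # .../cpu0/cpufreq/scaling_governor -> 'cpu0'
        governors[cpu_name] = gov_file.read_text(encoding="utf-8").strip()
    return governors
```

Calling `read_governors()` on the server and printing the result should show the same governor name for every core if the plugin setting was applied. A mix of values would suggest the change did not take effect everywhere.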
Adeon (author), posted November 16, 2022 (edited November 16, 2022):

A little update: the server crashed again today after running smoothly for ~4 weeks. I checked the logs every day and there was nothing suspicious at all. The server was pretty much idling all the time.

Again, there is absolutely nothing in the syslogs and nothing in the IPMI logs. I just can't imagine that this is caused by a broken PSU.

I'm going to disable the last couple of Docker containers that I installed 3-4 months ago. If the server crashes again, I'll change the PSU.
Hoopster, posted November 16, 2022 (edited November 16, 2022):

Adeon said: "A little Update: The server crashed again today after running smooth for ~4 weeks. [...] If the server crashes again, i'll change the PSU."

Are you using the iGPU in the 2278G for transcoding via the i915 drivers? If so, how are you loading the drivers?

A thread about issues others have had with the i915 drivers can be found here. I am not saying that is your problem.

This is my reply in that thread about what I did to solve the server crashing problems that seemed to be related to i915:

I have a 2288G in my server and it started crashing with the 6.10.x releases of unRAID. Nothing at all useful in the syslog. I had the GPU Statistics and Intel-GPU-Top plugins installed. As a test, I removed them and the crashes (about once a week) stopped. As a further test, I re-installed these plugins with the 6.11.1 release and my unRAID server locked up five days later. I uninstalled the plugins and it has been smooth sailing ever since. I let the i915 drivers load via the touch method. I also uninstalled the CoreFreq plugin.
Adeon (author), posted November 16, 2022:

Thank you for the tip. Yes, the iGPU is for transcoding, but I pretty much never use it. I have both GPU TOP and GPU Statistics installed. The i915.conf already exists at /boot/config/modprobe.d/i915.conf.

I have removed both plugins for now. Let's see if that helps.
Hoopster, posted November 16, 2022 (edited November 17, 2022):

Adeon said: "Yes the iGPU is for transcoding but i pretty much never use it."

It does not have to be in use to cause server lockups. In fact, for most people, including me, it always happened when the server was fairly idle.

Let's see if removing those plugins works for you as it did for me.
PRG, posted November 17, 2022:

For me, mysterious crashes like this have always turned out to be the PSU. I had one that would very occasionally trip the breaker before it started failing more often. It took ages to figure that out because the system was on a UPS that kept it on.
Adeon (author), posted December 6, 2022:

Well, the server crashed again today. Again, nothing in the logs at all.

Tomorrow I'm going to change the PSU. I hope that this finally fixes this really annoying issue.
UnKwicks, posted January 1, 2023 (edited January 1, 2023):

@Adeon, I have had the same issue for 3 months. My setup is completely different apart from the PSU: I have a Seasonic FOCUS PX as well (550 W). Did you already change the PSU, and did that help?

As the motherboard I run an ASRock J5005-ITX with an embedded Celeron.
Adeon (author), posted January 1, 2023:

Yes, I did change the PSU to a Seasonic Focus GX 550W, and Unraid has been running fine for 25 days now. But that was also the case with the old PSU, so it is too early to tell if the issue is gone.
UnKwicks, posted January 4, 2023:

OK, let's see. I ordered a new PSU as well. It should arrive today. Hopefully this helps.
Adeon (author), posted January 21, 2023 (edited February 13, 2023), marked as solution:

Update: Since the PSU replacement, the server has been running without a crash for 44 days.

Update 2: The server has now been running smoothly for 60+ days, so I consider this fixed.