czaj, posted June 27, 2021

Dear Unraid Community,

I am looking for help identifying the root cause of unplanned restarts of my Unraid server, which are driving me nuts. I am hoping for suggestions on what can be checked, how I can capture more logs from the time right before an unplanned restart, and how I can try to trigger one in order to find and solve the problem.

The Unraid system runs a few Docker containers and VMs. Everything is configured and running perfectly, except when a random reboot happens. The server can run without problems for days or weeks, but sometimes it reboots day after day. The time of the reboot is random, although I have observed that it usually happens in the morning hours, between 5 AM and 8 AM. This is not a power issue, as I have a UPS.

So far I have done the following:
- Checked the RAM with memtest: no issues.
- Replaced the PSU with a new one (I had a change to a modular PSU planned anyway, so I used it as an opportunity to troubleshoot).
- Enabled syslog mirroring to flash so that I can review logs after an unplanned reboot, but there is nothing helpful in them.
- Disabled a few containers/VMs to check whether they are the root cause: no luck. I suspected that the Windows VM (vdesktop01) might somehow be the cause, but recently a restart happened even with the Windows VM turned off.
- Identified an IRQ #16 issue and fixed it by disabling the feature in modprobe (options i2c-i801 disable_features=0x10): didn't help.

I am attaching the following files for review:
- Syslog, which contains logs from a few hours before an unplanned restart and the full boot log after the crash.
- Diagnostics data.

System specification:
- Mobo: MSI B250M PRO-VDH
- CPU: Intel Pentium G4600
- RAM: GOODRAM 4GB (1x4GB) 3000MHz CL16 IRDM X Black; G.SKILL 32GB (2x16GB) 3000MHz CL16 Aegis
- PSU: SeaSonic FOCUS GX-650 ATX 80+ GOLD
- Drives: 2x WD RED 4TB for array; 2x GOODRAM IRDM Pro gen. 2 512GB for cache; 1x GOODRAM 120GB 2,5" SATA SSD IRDM GEN. 2, directly attached to VM
- UPS: APC SmartUPS 750

Please let me know what additional data I should provide to review this case.

Attachments: elysium-diagnostics-20210627-1226.zip, syslog-unraid.log
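For reviewing a syslog copy that survives a reboot, one possible approach is to pull out the last few lines logged before each boot. The sketch below is only an illustration, not an Unraid tool: the function name `lines_before_each_boot` is made up here, and it assumes that each boot can be recognised by the kernel's "Linux version" banner appearing in the log.

```python
from collections import deque

# Assumption: the kernel logs a line containing this text at the start of every boot.
BOOT_MARKER = "Linux version"


def lines_before_each_boot(lines, context=5):
    """Return, for each boot marker found, the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER in line:
            # Everything still in the window was written before this boot started.
            results.append(list(recent))
            recent.clear()
        recent.append(line)
    return results
```

It accepts any iterable of lines, so an open file such as the attached `syslog-unraid.log` can be passed directly. If the window before a boot contains no shutdown messages at all, that is consistent with an abrupt reset rather than a clean restart.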
JorgeB, posted June 27, 2021

This sounds like a hardware issue, and nothing being logged at the time points to the same. It is very difficult to diagnose remotely; basically you need to start swapping hardware. The easiest thing to test is using only one DIMM at a time. If that doesn't help, and since you already replaced the PSU, the next thing would be using a different board.
czaj, posted June 27, 2021

Thank you for the quick reply. I was thinking about a hardware issue as well, but I intentionally skipped that, since this setup ran for 2 years without an issue on the old OS and the problems started not long after installing Unraid (of course I could be totally wrong, as a hardware issue might have appeared around the same time as the Unraid installation).

Leaving one memory DIMM in is a very good suggestion. I stress-tested the memory by running dedicated stress-test software in VMs and nothing happened, but I didn't test by physically removing DIMMs. I will do that and update this thread if I find something.

If there are any other ideas, especially on how to force that behavior again, I am open to suggestions.
Tristankin, posted June 27, 2021

Any chance you are using iGPU transcoding in Plex? You may have used old config steps. I was having the same issue on my system.
JorgeB, posted June 28, 2021

Another thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
czaj, posted June 28, 2021

Tristankin, I don't use Plex, so this is not it. Thank you for the suggestion!

JorgeB, I tested how the server behaves with all Docker containers and VMs powered down: it worked for 2 weeks without any problem.

I actually remembered that not so long ago I enabled the XMP profile to increase memory speed from 2133 MHz to 2400 MHz. After reading online that XMP can cause unplanned restarts, I disabled it and am currently waiting for results (with all memory DIMMs and all containers/VMs enabled). If that doesn't help, my next step will be to leave one DIMM in and test again.
czaj, posted June 29, 2021

The server restarted again this morning, so this is not XMP. I am going to leave one memory stick in and see what happens.
czaj, posted August 7, 2021 (edited)

To provide an update on my situation: the server is still randomly rebooting. The only pattern I have noticed is that when there is not much running (only core containers/VMs), reboots are less frequent (once per week), but when I enable everything, a reboot usually happens within 24 hours.

I've tried:
- Moving the USB stick with the OS from a USB3 to a USB2 port (to check whether the USB stick is the problem),
- Running without the UPS (to check whether the UPS is the problem),
- Resetting the BIOS to default settings,
- Running the system on one memory stick (tried each of the 3, in different DIMM slots).

I have run out of ideas for what else in the current setup can be tested. If someone has any other ideas about what I can check or what the problem could be, please let me know!

I am planning to buy a new motherboard (Gigabyte B560M DS3H) and CPU (Intel i5-11400) to upgrade the current setup and hopefully confirm that the reboots no longer happen. I went through the forum to check whether the B560 chipset and 11th gen CPUs work with unRAID 6.9.2, and based on the forum posts they do. So hopefully I won't be surprised here.
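A pattern like the one described above (weekly versus daily reboots, mostly in the morning) can be checked with a small summary of boot times. This is a minimal sketch under stated assumptions: `reboot_pattern` is a hypothetical helper, and the timestamps are made up for illustration, not taken from the attached logs.

```python
from collections import Counter
from datetime import datetime


def reboot_pattern(boot_times):
    """Summarise boot timestamps: hours between consecutive boots and boots per hour of day."""
    ordered = sorted(boot_times)
    # Uptime (in hours) between each boot and the next one.
    gaps_h = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    # How many boots fell into each hour of the day (0-23).
    by_hour = Counter(t.hour for t in ordered)
    return gaps_h, by_hour


# Hypothetical timestamps, for illustration only.
boots = [
    datetime(2021, 8, 1, 6, 10),
    datetime(2021, 8, 2, 5, 40),
    datetime(2021, 8, 9, 7, 5),
]
gaps, hours = reboot_pattern(boots)
```

Noting which containers and VMs were enabled during each gap makes it easier to compare configurations, and a cluster of boots in the same hours would suggest looking at whatever is scheduled around that time.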
Tristankin, posted August 8, 2021

Have you tried 6.8.3? I had to roll back to stop the random hangs.
czaj, posted August 8, 2021

Were the symptoms in your case the same as mine? I mean literally no logs indicating what happened, and a reboot (not just a hang that requires a manual restart)? Did you have these hangs in 6.9.1 as well?

Rolling back to 6.8.3 sounds like a time-consuming process (considering, for example, that the cache array won't be detected in 6.8.x). Maybe rolling back to 6.9.1 is something I need to test.

PS. I am really hoping that this won't be a software issue, because keeping an old version of the OS running is not the best thing from a security perspective.
Tristankin, posted August 9, 2021 (edited)

Tell me about it. The new 1MB-aligned cache works fine on 6.8.3, though. Mine hung rather than rebooted, with nothing in the syslog. So maybe it is different? It's just a cheaper option than having to replace hardware.
czaj, posted September 1, 2021

Update: today is the 21st day of the NAS running without a crash under a full workload. It seems that changing the motherboard (Gigabyte B560M DS3H) and CPU (Intel i5-11400) resolved the problem.

There are two potential root causes:
- An issue with the old motherboard/CPU: most likely, but I haven't tested it yet to confirm,
- An incompatibility between the OS and the hardware components: least likely, but still possible (I guess ;)).

No matter what the root cause is, this case seems to be resolved. Thank you for the help and suggestions on how to resolve this.