DrJake Posted June 5, 2022
Hi all, For reasons unknown, my Unraid server reboots itself every 3-4 weeks; between reboots there don't seem to be any issues with the server. This has been happening for 6+ months now and has only recently caught my attention. I kinda dismissed it as something I did to crash it in the early days. Anyhow, I have followed the instructions and installed "mcelog via the NerdPack plugin". The system diagnostics are attached; I hope there is something in the log from before the latest self-reboot (~6 hours ago).
Attachment: tower-diagnostics-20220605-1022.zip
ChatNoir Posted June 5, 2022
Your first step would be to look at that (linked FAQ thread), for both memory speed and power state. As for the FCP reports, the MCE seems harmless, but you should probably act on these:
Jun 5 04:41:06 Tower root: Fix Common Problems: Other Warning: Mover logging is enabled
Jun 5 04:41:07 Tower root: Fix Common Problems: Other Warning: Background notifications not enabled
Mover logging is only useful for debugging why mover does not do what you want. It should be OFF otherwise, as it tends to spam the log for no reason. You should really set up notifications so that you are alerted immediately when Unraid detects an issue. Acting soon will keep your data safer than doing nothing and realizing too late that the system had issues for weeks and finally failed. If the issues continue after the changes from the FAQ, you should set up a syslog server and post that file after the next crash/reboot.
DrJake Posted June 6, 2022
Thank you ChatNoir, there's a lot of information to digest. I've already acted on the Fix Common Problems issues as suggested. Re: memory speed, I'm running an AMD Ryzen 7 3700X 8-Core @ 3600 MHz and 4 sticks of Corsair Vengeance LPX (4x16GB) DDR4 3200MHz C16 Desktop Gaming Memory Black. I'm pretty sure I set the memory speed at 3200MHz. According to your linked thread, for a 3rd gen processor running a dual 4/4 config, the speed should be dialed back to 2667MHz? Is that correct? I will check on the C-state once I get back.
ChatNoir Posted June 6, 2022
6 hours ago, DrJake said: "the speed should be dialed back to 2667MHz? is that correct?"
Yes.
DrJake Posted July 15, 2022
Hi again ChatNoir, since adjusting the RAM speed, my server uptime is ~29 days 14 hours now. ~2 days ago, the GPU passed through to a VM stopped working. I kinda suspected that my system crashes were related to the GPU (RX580) somehow, but cannot confirm. I've attached the syslog here; I don't think there's anything in it about the GPU.
So the current situation is that the VM (with the GPU passed through) is still running. I can RDC into it and everything works, just that the screen attached to the Unraid server is not showing anything, and Win10 is reporting issues (see screenshot). In the past, I have used SpaceInvaderOne's script to reset the GPU, and more recently I tend to just shut down the server and restart it. I have left the server running for 2 days now since the GPU "crash", and it appears to be stable.
Regarding the GPU, I followed SpaceInvaderOne's video on dumping the vbios and passing it through. I don't think there's much of an issue there, plus if I were to restart the server, the GPU would be working again.
So my question is, what should I do to further diagnose the problem here? ~3-4 weeks seems to be the magical timeframe when the GPU experiences issues, and that's also the timeframe in which past Unraid reboots have happened. Should I keep the server running for another month and see if it crashes?
Attachment: unraid syslog 20220715.log
ChatNoir Posted July 15, 2022
Can't help you with that, as I do not use VMs. But hopefully someone else has something to say.
DrJake Posted July 16, 2022
Hi again, Thought I'd give an update and close this thread. Thx again ChatNoir, I think dialing back the RAM speed helped a lot and resolved that instability issue. Since then, the Unraid server crash problem feels more like 2 separate issues compounded together. It was very telling that when the GPU crashed, the VM and the Unraid server remained operational.
I found this thread, https://www.reddit.com/r/Amd/comments/pf4ebr/figured_out_a_fix_for_the_rx_580_intermittent/, and it sounds like the RX580OC crashing is a common problem that coincidentally also relates to the OC clock speed (similar to the RAM speed issue).
So since my last post a couple of days ago, I went ahead and rebooted the server. I followed the instructions in the thread above and installed Adrenalin (the install required many reboots, the RX580 kept crashing, and the install wouldn't progress unless it detected a valid GPU... so I had to reboot the server quite a few times...). Anyhow, it's now installed, and I used it to manually tune the GPU clock to the base speed of 1257MHz, did a stress test, and nothing crashed, phew.
So that's where I am now. Let's see how long the server can run this time without issues (I'm quite hopeful actually).
DrJake Posted September 9, 2022
Hello, thought I'd check in again. My Unraid server is still rebooting every 3-4 weeks. The last reboot was on 8 September; the ones before that were on 12 August and 29 July. I've been running a syslog server and recorded the last 3 resets (logs attached). The log entries below are there each time there was a reboot. I'm not sure if they were logged before or after the reboot. The GPU for the VM seems quite stable these days; I even did stress testing without it crashing. Could someone take a look and point me in the right direction? I'm running out of things to try here.
Jul 29 21:35:20 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:20 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:37:55 Tower webGUI: Successful login user root from 192.168.11.7
Jul 29 21:38:01 Tower kernel: clocksource: timekeeping watchdog on CPU7: Marking clocksource 'tsc' as unstable because the skew is too large:
Jul 29 21:38:01 Tower kernel: clocksource: 'hpet' wd_now: 71b477c4 wd_last: 7100a93b mask: ffffffff
Jul 29 21:38:01 Tower kernel: clocksource: 'tsc' cs_now: ecc745390ca5c cs_last: ecc73ce4daf64 mask: ffffffffffffffff
Jul 29 21:38:01 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jul 29 21:38:01 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jul 29 21:38:01 Tower kernel: sched_clock: Marking unstable (1159212382806303, -10482851288)<-(1159202045772327, -145824138)
Jul 29 21:38:02 Tower kernel: clocksource: Switched to clocksource hpet
Aug 12 00:05:06 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:06 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:09 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:10 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:07:42 Tower kernel: clocksource: timekeeping watchdog on CPU15: Marking clocksource 'tsc' as unstable because the skew is too large:
Aug 12 00:07:42 Tower kernel: clocksource: 'hpet' wd_now: 31b6948a wd_last: 313252fd mask: ffffffff
Aug 12 00:07:42 Tower kernel: clocksource: 'tsc' cs_now: e710a7ac3d094 cs_last: e710a2445a3b4 mask: ffffffffffffffff
Aug 12 00:07:42 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Aug 12 00:07:42 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Aug 12 00:07:42 Tower kernel: sched_clock: Marking unstable (1131240282539958, -10214497034)<-(1131230211029576, -142991887)
Aug 12 00:07:43 Tower kernel: clocksource: Switched to clocksource hpet
Sep 8 17:35:55 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:56 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:59 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:35:59 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep 8 17:38:34 Tower kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Sep 8 17:38:34 Tower kernel: clocksource: 'hpet' wd_now: f5ae5efc wd_last: f4fdec69 mask: ffffffff
Sep 8 17:38:34 Tower kernel: clocksource: 'tsc' cs_now: 1e2215fac635f8 cs_last: 1e221579001198 mask: ffffffffffffffff
Sep 8 17:38:34 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Sep 8 17:38:34 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Sep 8 17:38:34 Tower kernel: sched_clock: Marking unstable (2360442394677355, -18500752721)<-(2360424039483928, -145566028)
Sep 8 17:38:34 Tower kernel: clocksource: Switched to clocksource hpet
Attachments: 20220908 unraid crash 1738.log, 20220729 unraid crash 2130.log, 20220812 double crash 0.00am and 9.40am.log
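On the open question of whether these entries were logged before or after a reboot, one rough check is to read the first number in each "sched_clock: Marking unstable (...)" line as time since boot. The sketch below does that conversion for the three lines quoted above. It rests on two assumptions that the thread does not confirm: that the first number is a nanosecond count since boot, and that the year is 2022 (taken from the post dates, since syslog lines carry no year). The function name uptime_days is made up for this illustration.

```python
import re
from datetime import datetime, timedelta

# The three sched_clock lines, copied verbatim from the post above.
SCHED_CLOCK_LINES = [
    "Jul 29 21:38:01 Tower kernel: sched_clock: Marking unstable "
    "(1159212382806303, -10482851288)<-(1159202045772327, -145824138)",
    "Aug 12 00:07:42 Tower kernel: sched_clock: Marking unstable "
    "(1131240282539958, -10214497034)<-(1131230211029576, -142991887)",
    "Sep 8 17:38:34 Tower kernel: sched_clock: Marking unstable "
    "(2360442394677355, -18500752721)<-(2360424039483928, -145566028)",
]

# Captures the syslog timestamp and the first number in parentheses.
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*Marking unstable \((\d+),"
)

ASSUMED_YEAR = 2022  # assumption: syslog omits the year; taken from post dates


def uptime_days(line):
    """Return (log timestamp, uptime in days) for one sched_clock line.

    ASSUMPTION: the first number in parentheses is nanoseconds since boot.
    """
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("not a sched_clock 'Marking unstable' line")
    stamp = datetime.strptime(
        f"{ASSUMED_YEAR} {match.group(1)}", "%Y %b %d %H:%M:%S"
    )
    days = int(match.group(2)) / 1e9 / 86400  # ns -> s -> days
    return stamp, days


if __name__ == "__main__":
    for line in SCHED_CLOCK_LINES:
        stamp, days = uptime_days(line)
        boot_estimate = stamp - timedelta(days=days)
        print(f"{stamp}: ~{days:.1f} days up, boot estimated {boot_estimate}")
```

If the nanosecond assumption holds, the three values work out to roughly 13.4, 13.1 and 27.3 days, which would mean these entries were written well into an uptime period and not minutes after a fresh boot. Comparing the estimated boot times against the known reboot dates is a quick way to see whether that reading is plausible.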
JorgeB Posted September 9, 2022
Nothing obvious logged. Did you set the correct power supply idle control, as linked above?