Unraid server reboots every ~3-4 weeks, reason unknown



Hi all,

 

For reasons unknown, my Unraid server reboots itself every 3-4 weeks. Between reboots, there don't seem to be any issues with the server. This has been happening for 6+ months now and has only recently caught my attention; I had kinda dismissed it as something I did to crash it in the early days. Anyhow, I have followed the instructions and installed "mcelog via the NerdPack plugin". The system diagnostics are attached; I hope there is something in the log from before the latest self-reboot (~6 hours ago).

tower-diagnostics-20220605-1022.zip

Link to comment

Your first step would be to look at this:

 

For both memory speed and power state.

 

As for the FCP (Fix Common Problems) reports, the MCE seems harmless, but you should probably act on these:

Jun  5 04:41:06 Tower root: Fix Common Problems: Other Warning: Mover logging is enabled
Jun  5 04:41:07 Tower root: Fix Common Problems: Other Warning: Background notifications not enabled

Mover logging is only useful for debugging why mover does not do what you want. It should be OFF otherwise as it tends to spam the log for no reason.

You should really set up notifications so that you are alerted immediately when Unraid detects an issue. Acting soon will keep your data safer than doing nothing and realizing too late that the system had issues for weeks and finally failed.

 

If the issues continue after the changes from the FAQ, you should set up a syslog server and post that file after the next crash/reboot.

Link to comment

Thank you ChatNoir, there's a lot of information to digest.

 

I've already acted on the Fix Common Problems warnings as suggested.

 

Re: memory speed, I'm running an AMD Ryzen 7 3700X 8-Core @ 3600 MHz with 4 sticks of Corsair Vengeance LPX (4x16GB) DDR4 3200MHz C16 Desktop Gaming Memory Black. I'm pretty sure I set the memory speed to 3200MHz. According to your linked thread, for a 3rd gen processor running a dual 4/4 config, the speed should be dialed back to 2667MHz. Is that correct?

 

I will check on the c-state once I get back.

Link to comment
  • 1 month later...

Hi again ChatNoir, since adjusting the RAM speed, my server uptime is ~29 days 14 hours now. 

 

~2 days ago, the GPU passed through to a VM stopped working. I kinda suspected that my system crashes were related to the GPU (RX580) somehow, but cannot confirm. I've attached the syslog here, don't think there's anything about the GPU. 

 

So the current situation is that the VM (with the GPU passed through) is still running, and I can RDC into it and everything works; it's just that the screen attached to the Unraid server is not showing anything. But Win10 is reporting issues (see screenshot). In the past, I have used SpaceInvaderOne's script to reset the GPU, and more recently I tend to just shut down the server and restart it.

1031379810_UnraidGPU.PNG.3521bbf67854b44b6b94fafc5797db74.PNG

 

I have left the server running for 2 days now since the GPU "crash", and it appears to be stable. Regarding the GPU, I followed SpaceInvaderOne's video on dumping vbios, and passing it through. I don't think there's much of an issue there, plus if I were to restart the server, the GPU would be working again.

 

So my question is, what should I do to further diagnose the problem here? ~3-4 weeks seems to be the magical timeframe when the GPU experiences issues, and that's also the timeframe in which past Unraid reboots have happened. Should I keep the server running for another month and see if it crashes?

 

unraid syslog 20220715.log

Link to comment

Hi again,

 

Thought I'd give an update and close this thread.

 

Thx again ChatNoir, I think dialing back the RAM speed helped a lot and resolved that instability issue. Since then, the Unraid server crash problem feels more like 2 separate issues compounded together. It was very telling that when the GPU crashed, both the VM and the Unraid server remained operational.

 

I found this thread, https://www.reddit.com/r/Amd/comments/pf4ebr/figured_out_a_fix_for_the_rx_580_intermittent/, it sounds like the RX580OC crashing is a common problem, and coincidentally also relates to the OC clock speed (similar to the RAM speed issue).

 

So since my last post a couple of days ago, I went ahead and rebooted the server. I followed the instructions in the thread above and installed Adrenalin (the install required many reboots, the RX580 kept crashing, and the install wouldn't progress unless it detected a valid GPU... so I had to reboot the server quite a few times...). Anyhow, it's now installed, and I used it to manually tune the GPU clock to the base speed of 1257MHz, did a stress test, and nothing crashed, phew.

 

So that's where I am now. Let's see how long the server can run this time without issues (I'm quite hopeful actually).  

 

Link to comment
  • 1 month later...

Hello, thought I'd check in again.

 

My Unraid server is still rebooting every 3-4 weeks. The last reboot was on 8/September, and the ones before that were on 12/August and 29/July. I've been running a syslog server and recorded the last 3 resets (logs attached). The log entries below appear each time there was a reboot; I'm not sure whether they were logged before or after the reboot. The GPU for the VM seems quite stable these days; I even did stress testing without it crashing.
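
For reference, the gaps between the three reboot dates listed above can be checked with a few lines of Python. This is only a quick sketch: the year 2022 is an assumption taken from the attached log file names, and `intervals_in_days` is a helper name made up for this example.

```python
from datetime import date

# Reboot dates as reported in the post. The year 2022 is an assumption taken
# from the attached log file names (e.g. "20220908 unraid crash 1738.log").
reboots = [date(2022, 7, 29), date(2022, 8, 12), date(2022, 9, 8)]


def intervals_in_days(dates):
    """Return the number of days between each pair of consecutive dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


gaps = intervals_in_days(reboots)
print(gaps)  # [14, 27]
```

So the two most recent gaps are 14 and 27 days, which is closer to "2-4 weeks" than a fixed 3-4 week cycle.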

 

Could someone take a look and point me in the right direction? I'm running out of things to try here.

Jul 29 21:35:20 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:20 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:35:24 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Jul 29 21:37:55 Tower webGUI: Successful login user root from 192.168.11.7
Jul 29 21:38:01 Tower kernel: clocksource: timekeeping watchdog on CPU7: Marking clocksource 'tsc' as unstable because the skew is too large:
Jul 29 21:38:01 Tower kernel: clocksource:                       'hpet' wd_now: 71b477c4 wd_last: 7100a93b mask: ffffffff
Jul 29 21:38:01 Tower kernel: clocksource:                       'tsc' cs_now: ecc745390ca5c cs_last: ecc73ce4daf64 mask: ffffffffffffffff
Jul 29 21:38:01 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jul 29 21:38:01 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jul 29 21:38:01 Tower kernel: sched_clock: Marking unstable (1159212382806303, -10482851288)<-(1159202045772327, -145824138)
Jul 29 21:38:02 Tower kernel: clocksource: Switched to clocksource hpet
Aug 12 00:05:06 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:06 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:09 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:05:10 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Aug 12 00:07:42 Tower kernel: clocksource: timekeeping watchdog on CPU15: Marking clocksource 'tsc' as unstable because the skew is too large:
Aug 12 00:07:42 Tower kernel: clocksource:                       'hpet' wd_now: 31b6948a wd_last: 313252fd mask: ffffffff
Aug 12 00:07:42 Tower kernel: clocksource:                       'tsc' cs_now: e710a7ac3d094 cs_last: e710a2445a3b4 mask: ffffffffffffffff
Aug 12 00:07:42 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Aug 12 00:07:42 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Aug 12 00:07:42 Tower kernel: sched_clock: Marking unstable (1131240282539958, -10214497034)<-(1131230211029576, -142991887)
Aug 12 00:07:43 Tower kernel: clocksource: Switched to clocksource hpet

 

Sep  8 17:35:55 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep  8 17:35:56 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep  8 17:35:59 Tower kernel: usb 3-2: reset full-speed USB device number 2 using xhci_hcd
Sep  8 17:35:59 Tower kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Sep  8 17:38:34 Tower kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Sep  8 17:38:34 Tower kernel: clocksource:                       'hpet' wd_now: f5ae5efc wd_last: f4fdec69 mask: ffffffff
Sep  8 17:38:34 Tower kernel: clocksource:                       'tsc' cs_now: 1e2215fac635f8 cs_last: 1e221579001198 mask: ffffffffffffffff
Sep  8 17:38:34 Tower kernel: tsc: Marking TSC unstable due to clocksource watchdog
Sep  8 17:38:34 Tower kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Sep  8 17:38:34 Tower kernel: sched_clock: Marking unstable (2360442394677355, -18500752721)<-(2360424039483928, -145566028)
Sep  8 17:38:34 Tower kernel: clocksource: Switched to clocksource hpet
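
On the "before or after the reboot" question, the `sched_clock: Marking unstable (...)` lines in the excerpts above may help. The sketch below assumes that the first number in the parentheses is the kernel's monotonic time since boot in nanoseconds; that is an assumption about the log format, not something confirmed in this thread. Under that assumption, subtracting it from the syslog timestamp gives a rough boot time. `estimate_boot_time` is a name made up for this example, and the year has to be passed in because syslog timestamps don't carry one.

```python
import re
from datetime import datetime, timedelta

# Matches the kernel's "sched_clock: Marking unstable (a, b)<-(c, d)" line.
# ASSUMPTION: the first number (a) is monotonic time since boot in nanoseconds.
SCHED_CLOCK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"sched_clock: Marking unstable \((?P<gtod_ns>-?\d+),"
)


def estimate_boot_time(line, year):
    """Return (uptime, estimated boot time) for one sched_clock log line.

    Returns None if the line is not a "sched_clock: Marking unstable" entry.
    Month names are parsed with %b, so an English locale is assumed.
    """
    match = SCHED_CLOCK_RE.match(line)
    if match is None:
        return None
    # Collapse the double space syslog uses for single-digit days ("Sep  8").
    stamp = " ".join(match["stamp"].split())
    logged_at = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    uptime = timedelta(microseconds=int(match["gtod_ns"]) // 1000)
    return uptime, logged_at - uptime


# The three sched_clock lines quoted in this post.
lines = [
    "Jul 29 21:38:01 Tower kernel: sched_clock: Marking unstable (1159212382806303, -10482851288)<-(1159202045772327, -145824138)",
    "Aug 12 00:07:42 Tower kernel: sched_clock: Marking unstable (1131240282539958, -10214497034)<-(1131230211029576, -142991887)",
    "Sep  8 17:38:34 Tower kernel: sched_clock: Marking unstable (2360442394677355, -18500752721)<-(2360424039483928, -145566028)",
]

for line in lines:
    uptime, booted = estimate_boot_time(line, 2022)
    print(f"uptime ~{uptime.days} days, estimated boot {booted:%b %d %H:%M}")
```

If the assumption holds, the three entries were logged at roughly 13, 13 and 27 days of uptime, with estimated boot times on Jul 16, Jul 29 and Aug 12. That would suggest these messages were written shortly before each reboot, not right after it.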

 

20220908 unraid crash 1738.log 20220729 unraid crash 2130.log 20220812 double crash 0.00am and 9.40am.log

Link to comment
