Random full system crash with no apparent cause

Hello,

 

In the last few days, my server started randomly crashing and rebooting. I couldn't find anything useful when reading the logs, and I've already configured the logs to be mirrored to the flash drive.

 

If anyone with more experience with this kind of troubleshooting could give me a hand, it would be greatly appreciated!

 

I've attached the diagnostic file to this thread. The last crash occurred on "28 February 2024 03:18", according to the uptime indicator.

 

Thank you!

birdskeep-diagnostics-20240228-0832.zip

Solved by TheBird956

  • Community Expert

Unfortunately there's nothing relevant logged, and that's not very surprising if the server really is rebooting on its own, vs crashing or hanging, since that usually indicates a hardware issue.

  • Author

Interesting

 

Any idea what I could do to get more visibility?

 

At first I thought it could be related to the cache disk getting full, but I don't see why that would cause a reboot, and there is still enough space.

  • Community Expert
14 minutes ago, TheBird956 said:

Any idea what I could do to get more visibility?

 

Not sure what you mean?

  • Author
6 minutes ago, JorgeB said:

Not sure what you mean?

Is there a tool that I could set up that would give me more information on what could be the cause, like logging the shutdown reason, or whether something is overheating?

  • Community Expert

It may be easier to just boot using a Windows live flash drive and run some tests. I don't know of many stress test tools that could run with Unraid, and some hardware issues usually don't leave anything logged.
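For the overheating part of the question, one low-tech option is to write temperatures to a file at a fixed interval, so the last line before a crash shows what the sensors read. Below is a minimal sketch, not a tested Unraid tool. It assumes Python is available on the server (it is not part of a stock Unraid install), that the hardware exposes sensors under the Linux `/sys/class/thermal` interface, and that the log path is something you choose yourself; the function names are made up for illustration.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())  # sysfs reports millidegrees C
        except (OSError, ValueError):
            continue  # zone missing or unreadable; skip it
        readings[os.path.basename(zone)] = millideg / 1000.0
    return readings


def append_reading(log_path, readings, now=None):
    """Append one timestamped line and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = " ".join(
        f"{name}={temp:.1f}C" for name, temp in sorted(readings.items())
    )
    with open(log_path, "a") as f:
        f.write(f"{stamp} {fields}\n")
        f.flush()
        os.fsync(f.fileno())  # so the last line survives an abrupt reset


def log_temperatures(log_path, interval_s=60, iterations=None):
    """Log readings every interval_s seconds; iterations=None runs until stopped."""
    count = 0
    while iterations is None or count < iterations:
        append_reading(log_path, read_thermal_zones())
        count += 1
        time.sleep(interval_s)
```

Calling `log_temperatures("/boot/logs/temps.log")` would keep the file on the flash drive (the path is an assumption), at the cost of extra flash writes, so a longer interval or a share on the array may be the better trade-off.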

I am diving into this conversation. I noticed that TheBird956 is running Unraid 6.12.8. I have 7 Unraid servers that I manage. I recently upgraded 2 of them from 6.12.6 to 6.12.8 and now both of the upgraded servers are randomly shutting down. I have Virtual Syslog Server running on a Windows machine in the office of one of the upgraded Unraid servers to hopefully catch some log files of what is going on. I am posting something here so that others can add if they are experiencing similar issues.

  • Author

I am noticing that the crash is occurring daily, around 3am. I'm trying to see what is running at that time. Here's a diagnostic file that I just generated.

birdskeep-diagnostics-20240301-0838.zip

  • Author

I just noticed I have a bunch of errors at the end of the previous syslog file (included in the previous diagnostic file). I'm not too sure if they are useful right now since they are from the parity check tuning plugin.

  • Community Expert
20 minutes ago, TheBird956 said:

I just noticed I have a bunch of errors at the end of the previous syslog file (included in the previous diagnostic file). I'm not too sure if they are useful right now since they are from the parity check tuning plugin.

They can be ignored as they are an internal check simply reporting a state I did not think would occur. You could delete the parity.check.tuning.automatic and parity.check.tuning.manual files from the plugins folder on the flash drive to stop them being reported.

 

They should also have been removed during a boot sequence as you are on the latest release of the plugin.  Have you rebooted since it was installed?   If so although this is a different issue I would appreciate it if you could turn on the 'testing' mode of logging in the plugin's settings; reboot the server and then post new diagnostics taken after the reboot so I can see if I can determine why this was not tidied up during the reboot.

  • Author
1 hour ago, itimpi said:

They can be ignored as they are an internal check simply reporting a state I did not think would occur. You could delete the parity.check.tuning.automatic and parity.check.tuning.manual files from the plugins folder on the flash drive to stop them being reported.

 

They should also have been removed during a boot sequence as you are on the latest release of the plugin.  Have you rebooted since it was installed?   If so although this is a different issue I would appreciate it if you could turn on the 'testing' mode of logging in the plugin's settings; reboot the server and then post new diagnostics taken after the reboot so I can see if I can determine why this was not tidied up during the reboot.

 

Here you go

birdskeep-diagnostics-20240301-1031.zip

  • Community Expert
33 minutes ago, TheBird956 said:

 

Thanks. According to those diagnostics the state information WAS cleared up during the boot sequence and you are no longer getting those error messages from the plugin.

 

If by any chance you manage to get this symptom again then I would appreciate if you can provide any information about what you might have done to create it and if possible any new diagnostics taken at the time.

  • Author
18 minutes ago, itimpi said:

 

Thanks. According to those diagnostics the state information WAS cleared up during the boot sequence and you are no longer getting those error messages from the plugin.

 

If by any chance you manage to get this symptom again then I would appreciate if you can provide any information about what you might have done to create it and if possible any new diagnostics taken at the time.

 

Maybe the crashes I've been having were preventing the state from being properly updated? Lately, the only reason the server rebooted was a crash, and this is the first time since I opened this thread that I manually triggered a reboot.

  • Community Expert
4 minutes ago, TheBird956 said:

 

Maybe the crashes I've been having were preventing the state from being properly updated? Lately, the only reason the server rebooted was a crash, and this is the first time since I opened this thread that I manually triggered a reboot.

 

It should not have mattered that you were having crashes.   The trigger for the error message you were getting (which was meant to be fixed in the last plugin update) was exactly you having a crash during a parity check that was manually initiated. 

 

It is quite possible that you shutting down tidily meant that these conditions were not met.   That is why I am interested in seeing if you can get it again in the future and what leads up to it.

  • Author

Ok, I'll keep an eye out.

 

Just for the record, I noticed that a task in Jellyfin (Generate BIF Files) was scheduled around the time of the crash. I disabled it just in case, I will see tomorrow if the crash still occurs. (I have no reason to believe this is the cause, other than the scheduled time)

  • 2 months later...
  • Author

The issue never completely disappeared, but wasn't as present after changing settings in Jellyfin. It stopped being always at the same time of the day though...

 

It just crashed again, so I am attaching to this message the latest diagnostic files. Also, I've configured more metrics on the system and an external Prometheus instance is scraping those metrics. Hopefully there is something useful in there.

https://snapshots.raintank.io/dashboard/snapshot/ZeOO2xqaPsgjQGgYRlkQibwhJZz3R3vX

You can see no metrics were collected between 11:43 and 11:45, this is the time when the server was rebooting. Metrics are scraped every 10 seconds.
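A gap like this can also be found programmatically from the sample timestamps, which helps when looking for earlier reboots in a longer time range. Here is a minimal sketch: it takes a plain list of sample times, however they were exported from the metrics store, and flags any spacing well above the scrape interval. The function name and the 2x tolerance are assumptions, and the demo timestamps are synthetic, built only to mirror the 11:43 to 11:45 gap described above.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval=timedelta(seconds=10), tolerance=2.0):
    """Return (last_seen, next_seen) pairs where consecutive samples
    are more than tolerance * interval apart."""
    ordered = sorted(timestamps)
    limit = interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Synthetic demo data: one sample every 10 s, with a hole from 11:43 to 11:45.
start = datetime(2024, 5, 17, 11, 40, 0)
resume = datetime(2024, 5, 17, 11, 45, 0)
samples = [start + timedelta(seconds=10 * i) for i in range(19)]
samples += [resume + timedelta(seconds=10 * i) for i in range(6)]

for last_seen, next_seen in find_gaps(samples):
    print(f"no samples between {last_seen:%H:%M:%S} and {next_seen:%H:%M:%S}")
```

The timestamp of the last sample before each gap is the useful part: it narrows down where to look in the other metrics (temperature, power, load) for the final readings before the reboot.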

 

The only thing I notice is the high CPU package temperatures, but I don't know if it got too high right before the crash.

birdskeep-diagnostics-20240517-1146.zip

  • Community Expert

I assume the diags are after rebooting? It does also have the syslog-previous, but don't see anything there other than read errors on disk1, which are logged as a disk problem btw, and based on SMART I would replace that disk ASAP, though that will be unrelated to the crashing.

  • Author

Yes, the diagnostics file was right after the reboot. I would love to know how to have one right before the crash, but it's just unpredictable.

 

I noticed the errors in SMART started occurring after a crash/unclean shutdown, though I am expecting it to be stable right now. The read errors reported by Unraid also stopped. I am running an extended test to see if it is getting worse. What have you noticed exactly? Maybe I am not reading it correctly.

  • Community Expert
Just now, TheBird956 said:

Yes, the diagnostics file was right after the reboot. I would love to know how to have one right before the crash, but it's just unpredictable.

 

You can't get the diagnostics but you can get the syslog. Just set up the Syslog Server. If you submit a syslog, try to give an approximate time when the reboot occurred.
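Once a syslog is being captured, the lines written just before each boot are usually the ones worth posting. Here is a minimal sketch for pulling them out of a saved log. It assumes the kernel's "Linux version" banner makes it into the captured file at every boot; if it does not in your setup, substitute any string that reliably appears at startup. The function name is made up for illustration.

```python
def lines_before_boot(lines, marker="Linux version", context=5):
    """For every line containing `marker`, return the `context` lines
    that immediately precede it (fewer if the log starts sooner)."""
    results = []
    for i, line in enumerate(lines):
        if marker in line:
            results.append(lines[max(0, i - context):i])
    return results


def read_log(path):
    """Read a syslog file into a list of lines, tolerating odd bytes."""
    with open(path, errors="replace") as f:
        return f.read().splitlines()
```

For example, `lines_before_boot(read_log("syslog-10.10.10.40.log"))` returns one list per detected boot. An empty or uneventful list right before a boot is itself a hint that the reset happened without the OS noticing.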

  • Author
10 minutes ago, Frank1940 said:

You can't get the diagnostics but you can get the syslog. Just set up the Syslog Server. If you submit a syslog, try to give an approximate time when the reboot occurred.

Here's everything for today. The crash recovered around 11:45 (line 94)

 

For reference, here's the syslog configuration: image.thumb.png.ccbbc017897c98f11fabd700781542e7.png

syslog-10.10.10.40.log

Edited by TheBird956
Replaced occured with recovered

  • Community Expert

Unfortunately there's nothing relevant logged, so this can be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

  • Author

I received a replacement drive, and the server crashed when rebuilding the array and will not POST anymore.

 

I unplugged the new drive and the server is able to POST again.

 

I am starting to believe all this is caused by the 600W PSU, including the random crashes. The new drive has a higher power load (WD40EFRX vs WD80EFPX).

 

Hopefully the crash during the rebuild won't cause more problems.

  • 5 months later...
  • Author

You can't make this up...

 

I was writing an update to let the world know that the issue seems to be fixed after replacing the CPU (Intel 13th gen RMA), but it crashed again after almost 8 days of uptime.

 

I've attached my diagnostics.zip file. The only things I notice in the logs are that one SSD is having trouble trimming and that there are around 5 minutes of missing logs before the restart (the server takes at most 1 minute to boot).

birdskeep-diagnostics-20241021-0831.zip

birdskeep-syslog-10.10.10.40-20241021-1236.zip

Edited by TheBird956
Add syslog file

  • Community Expert

Some device and USB issues logged, though probably unrelated, should still be checked.

 

P.S. you should start a new log, that one is getting very large and takes a long time to open.

  • Author
Quote

Some device and USB issues logged, though probably unrelated, should still be checked.

 

I'll take a look. The only USB devices I have are the USB drive and a UPS.

 

Quote

P.S. you should start a new log, that one is getting very large and takes a long time to open.

 

Weird, I thought it did that automatically. I reduced the size from 100 MB to 5 MB.

 
