Server Hard Crashes


I have been experiencing server hard crashes for quite a while now. I have worked through various suggestions (adding RAM, running RAM tests, writing log files to a share, etc.), but nothing has really helped pinpoint the cause. I originally thought it was related to Plex transcoding, but I am no longer sure that is the problem. I had another hard crash over the weekend. Attached is the log. Any help is appreciated.

 

The server became unreachable at around 2 PM on the 24th. The last log entry was:  Apr 24 14:00:13 HomeServer kernel: mdcmd (97): spindown 5

 

Thanks in advance,

 

Edit: added diagnostic dump.

 

Jay

syslog-192.168.1.2.log

homeserver-diagnostics-20200427-1109.zip

Edited by jcreynoldsii
Link to comment
  • 4 weeks later...

I experienced a couple more crashes, and each time the log doesn't show anything. Attached is the one from today. I came home to a non-responsive server. I have to find a solution, or at least a way to shut down gracefully; a quick press of the power button does not work. I thought I'd be slick and pull the power on my UPS to initiate a controlled shutdown (it didn't work). My system is hooked up to a TV via HDMI; after a period of time the screen goes black and will not wake up, which is weird. Why wouldn't the console stay up? This is annoying as hell. Any help is greatly appreciated.

 

I found these 4 lines interesting on the reboot of the server:

/var/tmp/go: line 4: /boot/unmenu/uu: Permission denied
sh: ./apcupsd-3.14.10-x86_64-1_rlw.txz.auto_install: Permission denied
sh: ./powerdown-2.06-noarch-unRAID.tgz.auto_install: Permission denied
sh: ./screen-4.0.3-x86_64-4.txz.auto_install: Permission denied

Each of these items directly relates to a few problems I spoke of above.

 

Jay

syslog-192.168.1.2.log

Link to comment
2 hours ago, trurl said:

Why have you allocated 50G to docker image? Have you had problems filling it? 20G should be more than enough and when I see someone with a docker image larger I suspect they have something misconfigured with their dockers.

I use quite a few dockers, and in the past I hit the default maximum docker image file size, so when I needed to increase it I set it high enough that I would not have to keep doing it over and over again.

Edited by jcreynoldsii
Link to comment
2 hours ago, jcreynoldsii said:

/var/tmp/go: line 4: /boot/unmenu/uu: Permission denied

Why do you have any reference to unmenu in your config/go file on the flash drive?     Unmenu was a v5 feature that is not compatible with the current v6.   Any reference to it should be removed as trying to use it will just cause problems.

Link to comment
4 minutes ago, jcreynoldsii said:

Good question, nothing I ever did. I used v5 before v6; perhaps stuff that didn't get cleaned up?

You might want to check your ‘go’ file in case there are any other references to incompatible/obsolete features.

Also, do you have an ‘extras’ folder on the flash drive? If so, that should probably be removed, as it may be installing incompatible packages. In v6 any extra packages are normally installed via the Nerdpack or DevPack plugins.

Link to comment

As for my docker image file, I am currently using ~23 GB.

23 minutes ago, itimpi said:

You might want to check your ‘go’ file in case there are any other references to incompatible/obsolete features.

Also, do you have an ‘extras’ folder on the flash drive? If so, that should probably be removed, as it may be installing incompatible packages. In v6 any extra packages are normally installed via the Nerdpack or DevPack plugins.

I don't have an extras folder on the flash drive. I just got the Nerdpack tonight and have yet to install anything from it.

 

Content of Go File:

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &
/boot/unmenu/uu

# resize log partition
mount -o remount,size=384m /var/log

cd /boot/packages && find . -name '*.auto_install' -type f -print | sort | xargs -n1 sh -c

Those packages are inside the /boot/packages directory.

Edited by jcreynoldsii
Link to comment

That /boot/packages line is not part of a standard go file. It is probably what causes the other error messages you mentioned: a tightening of Unraid security a few releases ago means that files on the flash drive can no longer have the ‘execute’ permission set on them.
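
As a rough sketch of the check suggested above, the following shell function lists the lines of a go file that mention unmenu or run *.auto_install packages. The function name find_obsolete_go_lines and the pattern list are illustrative assumptions, not an Unraid tool; extend the pattern for anything else you know to be obsolete.

```shell
#!/bin/bash
# Hypothetical helper (not part of Unraid): print, with line numbers, every
# non-comment line of a go file that mentions unmenu or auto_install packages.
find_obsolete_go_lines() {
    local go_file="$1"
    # -n prefixes each match with its line number; -E enables the alternation.
    # The second grep drops matches that are only comments.
    grep -nE 'unmenu|auto_install' "$go_file" | grep -vE '^[0-9]+:[[:space:]]*#'
}

# Assumed usage, with the flash drive mounted at /boot as in this thread:
#   find_obsolete_go_lines /boot/config/go
```

Any line it prints is a candidate for removal; it only reports and never edits the file.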

Link to comment

Checked my log and there are several peculiar entries from dockers:

May 21 19:36:21 HomeServer kernel: docker0: port 9(vethfafdac9) entered blocking state
May 21 19:36:21 HomeServer kernel: docker0: port 9(vethfafdac9) entered forwarding state
May 21 19:36:21 HomeServer kernel: docker0: port 9(vethfafdac9) entered disabled state
May 21 19:36:21 HomeServer kernel: eth0: renamed from veth7d7cd71
May 21 19:36:21 HomeServer kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethfafdac9: link becomes ready
May 21 19:36:21 HomeServer kernel: docker0: port 9(vethfafdac9) entered blocking state
May 21 19:36:21 HomeServer kernel: docker0: port 9(vethfafdac9) entered forwarding state
May 21 19:38:07 HomeServer kernel: vethafb7e9d: renamed from eth0
May 21 19:38:07 HomeServer kernel: docker0: port 10(veth037a102) entered disabled state
May 21 19:38:07 HomeServer kernel: docker0: port 10(veth037a102) entered disabled state
May 21 19:38:07 HomeServer kernel: device veth037a102 left promiscuous mode
May 21 19:38:07 HomeServer kernel: docker0: port 10(veth037a102) entered disabled state
May 21 19:38:09 HomeServer kernel: docker0: port 10(vethbe1888a) entered blocking state
May 21 19:38:09 HomeServer kernel: docker0: port 10(vethbe1888a) entered disabled state
May 21 19:38:09 HomeServer kernel: device vethbe1888a entered promiscuous mode
May 21 19:38:09 HomeServer kernel: IPv6: ADDRCONF(NETDEV_UP): vethbe1888a: link is not ready
May 21 19:38:09 HomeServer kernel: docker0: port 10(vethbe1888a) entered blocking state
May 21 19:38:09 HomeServer kernel: docker0: port 10(vethbe1888a) entered forwarding state
May 21 19:38:09 HomeServer kernel: docker0: port 10(vethbe1888a) entered disabled state
May 21 19:38:10 HomeServer kernel: eth0: renamed from veth0d0dee2
May 21 19:38:10 HomeServer kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethbe1888a: link becomes ready
May 21 19:38:10 HomeServer kernel: docker0: port 10(vethbe1888a) entered blocking state
May 21 19:38:10 HomeServer kernel: docker0: port 10(vethbe1888a) entered forwarding state
May 21 19:38:20 HomeServer kernel: vethadbaa22: renamed from eth0
May 21 19:38:20 HomeServer kernel: docker0: port 11(vethd09af7a) entered disabled state
May 21 19:38:20 HomeServer kernel: docker0: port 11(vethd09af7a) entered disabled state
May 21 19:38:20 HomeServer kernel: device vethd09af7a left promiscuous mode
May 21 19:38:20 HomeServer kernel: docker0: port 11(vethd09af7a) entered disabled state
May 21 19:38:22 HomeServer kernel: docker0: port 11(veth0dcbe21) entered blocking state
May 21 19:38:22 HomeServer kernel: docker0: port 11(veth0dcbe21) entered disabled state
May 21 19:38:22 HomeServer kernel: device veth0dcbe21 entered promiscuous mode
May 21 19:38:22 HomeServer kernel: IPv6: ADDRCONF(NETDEV_UP): veth0dcbe21: link is not ready
May 21 19:38:22 HomeServer kernel: eth0: renamed from veth9b5b4d4
May 21 19:38:22 HomeServer kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth0dcbe21: link becomes ready
May 21 19:38:22 HomeServer kernel: docker0: port 11(veth0dcbe21) entered blocking state
May 21 19:38:22 HomeServer kernel: docker0: port 11(veth0dcbe21) entered forwarding state

How do I correlate the port to a specific docker?
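
One possible way to approach this, offered as a sketch rather than a verified procedure: the vethXXXX name in each log line is the host side of a container's virtual network pair, and on Linux the container's eth0 reports the interface index of that host-side peer in /sys/class/net/eth0/iflink. The helper name veth_for_iflink, the container name "plex" and the optional directory argument are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: given the iflink number read inside a container, print
# the name of the host-side veth interface whose ifindex matches it.
# The second argument (sysfs directory) defaults to /sys/class/net.
veth_for_iflink() {
    local iflink="$1" net_dir="${2:-/sys/class/net}" dev
    for dev in "$net_dir"/veth*; do
        [ -r "$dev/ifindex" ] || continue
        if [ "$(cat "$dev/ifindex")" = "$iflink" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Assumed usage on the host, for a running container named "plex" whose image
# includes cat:
#   veth_for_iflink "$(docker exec plex cat /sys/class/net/eth0/iflink)"
```

The name it prints can then be compared with the vethXXXX names in the syslog. Note that a container gets a new veth name each time it restarts, so the match only holds for the current run.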

Link to comment
3 hours ago, trurl said:

Also that /boot/unmenu/uu line should be removed.

This look better?

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &

# resize log partition
mount -o remount,size=384m /var/log

 

How can I troubleshoot the crashes? The log has nothing tangible in it. I would like to be able to shut down Unraid gracefully; however, the console keyboard doesn't work, the web GUI and web interfaces are gone, a quick press of the power button doesn't work, and neither does pulling the UPS from mains power. It seems like Unraid gets hung 100%.
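
As a small aid for this kind of log review, here is a sketch that prints the last line logged before each boot. It assumes a syslog that survives reboots (such as the remote syslog file attached in this thread) and that the first kernel message of every boot contains "kernel: Linux version"; the function name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: for every boot banner found in a syslog file, print the
# line logged immediately before it, i.e. the last entry before the reboot.
last_entries_before_boot() {
    # prev holds the previous line; it is printed whenever a boot banner appears.
    awk '/kernel: Linux version/ && prev != "" { print prev } { prev = $0 }' "$1"
}

# Assumed usage:
#   last_entries_before_boot syslog-192.168.1.2.log
```

Comparing those last entries across several crashes may show whether they share a time of day or a preceding event, even when no error is logged.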

Link to comment

Yet another one this morning.  Last message before rebooting it this morning was:

May 24 02:30:07 HomeServer crond[1722]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

Edit: Mover is set up to run at 2:30 AM; however, mover logging was not enabled. I have since enabled it. I shut down all non-essential dockers this morning so the system can finish a parity check. It has crashed two days in a row while performing the parity check triggered by the unclean hard reboot the day before.

syslog-192.168.1.2.log homeserver-diagnostics-20200524-1048.zip

Edited by jcreynoldsii
Link to comment

Not related to crash, but something to be aware of so you can clean it up. Your syslog is being spammed with these:

May 24 10:47:22 HomeServer root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token
May 24 10:47:24 HomeServer root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token

Here is the FAQ on that:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=545988

 

Also probably unrelated to your crashes, but your docker/VM related shares are not configured ideally, and your docker image is much larger than should be needed. 20G is my usual recommendation and should be more than enough unless you have something misconfigured. Have you had problems filling docker image?

 

On 4/27/2020 at 12:08 PM, jcreynoldsii said:

I was originally thinking it was Plex transcoding related

How is that configured? Misconfigured dockers can fill docker image, or they can fill RAM.

 

Any mapped host path that is not an actual disk or user share is in RAM just like the rest of the OS, and anything that writes to the container path of that mapping is writing to RAM.
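
To illustrate that rule, here is a minimal sketch that classifies a host path by where it would end up. The function name and the list of prefixes are assumptions about a typical Unraid v6 layout (array disks, cache, user shares, unassigned-device mounts, flash at /boot); adjust them to your own system.

```shell
#!/bin/bash
# Hypothetical helper: classify a docker host path as "storage" (array disk,
# cache, user share or unassigned-device mount), "flash" (under /boot), or
# "ram" (anywhere else in the root filesystem).
# The prefix list is an assumption, not something read from the system.
classify_host_path() {
    case "$1" in
        /mnt/user/*|/mnt/user0/*|/mnt/cache/*|/mnt/disk[0-9]*/*|/mnt/disks/*) echo storage ;;
        /boot/*) echo flash ;;
        *) echo ram ;;
    esac
}

# Assumed usage:
#   classify_host_path /mnt/user/appdata/plex   # prints "storage"
#   classify_host_path /mnt/User/appdata/plex   # prints "ram" (wrong case)
```

The comparison is case-sensitive on purpose: a path such as /mnt/User/... is not a user share and lands in RAM.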

 

Any application path that does not correspond exactly to a container path, including upper/lower case and including the leading '/', isn't a mapped path, so if the application writes to that path it is writing into the docker image.
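
The matching rule can be sketched in a few lines of shell. This is only an illustration of the comparison, with a made-up function name and example paths; it does not inspect any real container.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an application path equals, or lies beneath,
# one of the given container paths. The comparison is case-sensitive and the
# leading "/" must be present, mirroring the rule described above.
is_mapped_path() {
    local app_path="$1" container_path
    shift
    for container_path in "$@"; do
        container_path="${container_path%/}"   # ignore one trailing slash
        case "$app_path" in
            "$container_path"|"$container_path"/*) return 0 ;;
        esac
    done
    return 1
}

# Assumed usage, for a container with /config and /media mapped:
#   is_mapped_path /config/Library /config /media && echo mapped
#   is_mapped_path /Config/Library /config /media || echo "written into docker image"
```

Any application setting that fails this check, such as a download or transcode directory that was never mapped, is a path whose data ends up inside the docker image.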

Link to comment

My docker image size is set to 50GB, and I am using about 23GB of that. At one point in the past I ran out of docker image space, so I increased it.

 

All of my dockers reside on the cache/appdata directory and I believe that all of my mapped directories reside on shares.

 

Plex docker configuration:

 

[Screenshot: Plex docker configuration]

 

 

I am wondering if the mover app isn't what got me, considering the last two days I woke up to a hard crashed system.

Edited by jcreynoldsii
Link to comment
2 hours ago, jcreynoldsii said:

My docker image size is set to 50GB, and I am using about 23GB of that. At one point in the past I ran out of docker image space, so I increased it.

To me this suggests a problem with one or more path settings within one or more applications you have running as a docker container. I have 15 dockers installed and they are only using 44% of 20G, and they are stable at that amount of usage.

 

That plex screenshot isn't nearly as useful as the docker run command that results from it. See this very first link in the Docker FAQ for instructions on how to get that docker run command and post it:

 

What other dockers do you run?

Link to comment

I only use a few of those, and I'm not likely to do the research required to figure out how to use the others and what if anything might be misconfigured in them. Maybe someone else will contribute.

 

Plex is one I do use, and it doesn't write as much as, for example, a downloading application. Other than its library (appdata), the only things it is likely to write are DVR recordings and transcodes. Do you use Plex DVR? You don't have a mapping for transcodes. What does the application itself use for the transcode directory?

 

If you go to Settings - Docker and disable docker altogether, do you still get crashes?

Link to comment

No, I don't use Plex DVR, and I don't have anything mapped for transcoding. Should I make a directory on a share for it? Or should I transcode to RAM?

 

 

I'll have to give shutting down docker a shot; I plan on doing that tonight before turning in. I don't want to run with it off during the day, as I use PiHole for DNS, and without a secondary PiHole running it would pretty much shut down the internet in the house.

 

The reason I initially suspected Plex as the culprit is that I would experience a lot of hard crashes while casting Plex to the Chromecasts in the house. Lately that hasn't been the case. The last two nights have had hard crashes for no apparent reason.

 

 

Edited by jcreynoldsii
Link to comment
5 hours ago, jcreynoldsii said:

transcode to RAM?

That's what I do. Here is a link in that thread to a post that is a good starting point for getting this set up like I have it. Just read from there to the end of the page and all the things you need to do are explained:

 

 

Link to comment

No crashes overnight. I did, however, change the mover schedule so that it would not run. I did this simply because I wanted the parity check to finish, as it had already found and corrected errors, presumably because the server hard crashed yesterday during the parity check that followed the previous hard crash.

Date                    Duration                 Speed         Status    Errors
2020-05-25, 08:25:18    21 hr, 39 min, 34 sec    102.6 MB/s    OK        129

I ran only the bare minimum of dockers last night: PiHole, Unifi-Controller and Plex. I will slowly reintroduce dockers over the next few days to see if these crashes are docker related. If that doesn't turn anything up, I think I may change the mover frequency back to every night.

 

Also, I have a drive reporting the following. It seems to flip between good and bad; any insight on this?

#      Attribute name          Flag      Value    Worst    Threshold    Type       Updated    Failed    Raw value
199    UDMA CRC error count    0x000a    200      200      000          Old age    Always     Never     1

It does, however, pass the overall SMART health check. Attached is the SMART report.

Hitachi_HDS723030ALA640_MK0311YHK1SMBA-20200525-1019.txt

Link to comment

CRC errors are typically connection issues not drive problems. You can click on that SMART warning on the Dashboard and acknowledge it and it won't warn you again unless it increases.
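
If you would rather track this yourself between saved SMART reports, a minimal sketch follows. It assumes each report contains an attribute table in which the row for attribute 199 starts with "199" and ends with the raw value, as in the line posted above; both function names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the raw value (last column) of SMART attribute
# 199 (UDMA CRC error count) from a saved attribute table.
crc_error_count() {
    awk '$1 == 199 { print $NF; exit }' "$1"
}

# Succeed only if the count in the newer report is higher than in the older
# one. A missing attribute is treated as a count of 0.
crc_count_increased() {
    local old new
    old="$(crc_error_count "$1")"
    new="$(crc_error_count "$2")"
    [ "${new:-0}" -gt "${old:-0}" ]
}

# Assumed usage with two saved reports:
#   crc_count_increased old-report.txt new-report.txt && echo "check the cable"
```

A count that stays the same only reflects errors from the past; a count that keeps climbing points to a connection that is still bad.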

 

The small number of parity errors is typical of unclean shutdowns. But the only acceptable number of parity errors is exactly zero, so they must be corrected. If there are still parity errors after a correcting parity check then you still have problems to diagnose, so after parity errors get corrected we usually recommend following with a non-correcting parity check to verify that parity has zero errors now.

Link to comment
