
System hung, all shares unresponsive, webgui unresponsive, can't ssh to unraid



I am running unRAID v6.3.5 (permanent license).  The PC running unRAID was built about 45 days ago and made it through a 30 day trial license all the way until today with no trouble.

 

A few minutes ago I was using VLC on my Win2k10 desktop PC to watch a video (.avi) from a mounted unRAID CIFS share when the video started pixelating and then hung.  VLC crashed trying to read the file.  The desktop computer lost access to all unRAID shares.  There are about a dozen unRAID shares mounted to the desktop, some CIFS, some both CIFS and NFS.  The NFS shares are mounted to several different VMware instances of Ubuntu 16.x on an HP DL380 G6 server.

 

At the same time this happened I was gunzipping a large file (300GB+) on an Ubuntu 16.04 VM running on the G6.  The .gz file was on an unRAID share mounted to the VM (R/W) via NFS.  It hung.

 

At the same time, on a different Ubuntu VM on the G6 server, SABNZBD hung while downloading to an unRAID NFS share.  

 

Basically all unRAID shares hung on all my machines.  There were no dockers running.  There are no VMs defined on the unRAID.

 

I was able to ping the unRAID server but I was unable to http or ssh to the server.   I unplugged the network cable but there was no effect so I had to power cycle the server.

 

After power cycling I was able to mount shares and everything came back online.  While trawling through /var/log/syslog I noticed that it only started at the time of the crash/reboot.  There didn't appear to be any previous versions of syslog.  I did run "diagnostics" and attached it to this post.  Can anyone please advise where I can find older syslogs or some indication that might show what happened here?  Thanks!

unraid1-diagnostics-20170816-1951.zip

Link to comment

Since unRAID runs purely from RAM in normal operation you do not normally have any historical syslogs.

 

If this persists, install the Fix Common Problems plugin and put it into troubleshooting mode.  That will keep writing syslogs to the flash drive to help with trying to pin down the cause.
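
For anyone who wants to see the general idea by hand, here is a minimal sketch of copying the in-RAM syslog to the flash drive. It assumes the live log is /var/log/syslog and the flash drive is mounted at /boot (both paths appear elsewhere in this thread). The function name `snapshot_syslog` and the timestamped file name are made up for illustration, and this is not how the plugin itself is implemented.

```shell
#!/bin/bash
# Hypothetical helper, not part of unRAID or the Fix Common Problems plugin.
# Copies the in-RAM syslog to persistent storage so a copy survives a hard reset.
snapshot_syslog() {
    local src="${1:-/var/log/syslog}"   # live log, held in RAM
    local dest_dir="${2:-/boot/logs}"   # directory on the flash drive (assumed)
    local stamp
    stamp="$(date +%Y%m%d-%H%M%S)"      # no colons, so the name is FAT-friendly

    mkdir -p "$dest_dir" || return 1
    cp -- "$src" "$dest_dir/syslog-$stamp.txt" || return 1
    sync                                # flush the copy out to the flash drive
    echo "$dest_dir/syslog-$stamp.txt"  # print the path of the new copy
}
```

Calling `snapshot_syslog` every few minutes (from a loop or cron) leaves a trail of copies on the flash drive. Each call writes a new file, so old copies need cleaning up and frequent writes add wear to the flash drive.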

Link to comment

Partial crash.  System stayed up but shares went offline.  I have attached the latest diagnostics file (I ran before  doing a clean reboot from the gui).  Seems to be a lot of chatter about running out of memory/swap.  The only traffic the box should be seeing is CIFS/NFS.  Any advice or observations would be appreciated.  Please advise if any other files are needed.  There are quite a few diagnostics files in /boot/logs. Thanks.

unraid1-diagnostics-20170819-1538.zip

Link to comment

The kernel used in v6.3.5 is very prone to OOM errors, even when the server is only doing transfers. Try lowering the amount of RAM used for cache; it works for me and many others with similar issues:

 

sysctl vm.dirty_ratio=2
sysctl vm.dirty_background_ratio=1

These won't survive a reboot, so if it helps you need to make them permanent using your go file or the Tips and Tweaks plugin.
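
A minimal sketch of the go-file route, assuming the go file is at its usual location, /boot/config/go (check your own flash drive). The helper name `add_to_go` is hypothetical. It appends a command only if that exact line is not already present, so running it twice does no harm.

```shell
#!/bin/bash
# Hypothetical helper: append a command to the go file only if that exact
# line is not already there, so re-running it does not create duplicates.
add_to_go() {
    local line="$1"
    local go_file="${2:-/boot/config/go}"   # assumed location of the go file

    # Already present as a whole line? Then there is nothing to do.
    if grep -qxF -- "$line" "$go_file" 2>/dev/null; then
        return 0
    fi

    # If the file is non-empty and lacks a trailing newline, add one first
    # so the new command starts on its own line.
    if [ -s "$go_file" ] && [ -n "$(tail -c 1 "$go_file")" ]; then
        printf '\n' >> "$go_file"
    fi
    printf '%s\n' "$line" >> "$go_file"
}

# Usage (on the server, as root):
#   add_to_go "sysctl vm.dirty_ratio=2"
#   add_to_go "sysctl vm.dirty_background_ratio=1"
```

The go file runs at boot, so the two sysctl commands are reapplied on every start. Keep a backup copy of the go file before editing it.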

 

Link to comment
Just now, Gordon Shumway said:

Not to go too far off topic, but how soon before 6.4 is GA?

 

Soon

 

1 minute ago, Gordon Shumway said:

Also, how much of a pain is it to switch to a release candidate, then switch back to GA when it is available, or back to 6.3.5?

 

Very easy, most of that can now be done using the webGUI but you can also just overwrite 3 or 5 files on the flash drive, depending on the release installed.

Link to comment

Your recommendation seems to have helped with the RAM issue, but this morning it crashed as all CPUs were at 100% and the system started killing processes.  I was able to ssh in and run a quick top and saw that load averages were 17, 10, and 4 but nothing was running (that I know of).

 

I also noticed some of the top processes running were "docker" and "php", which was odd because, as I mentioned before, I don't have any VMs defined and have 2 dockers defined but neither is running.  I have to delete the Plex docker (running it on another machine) and I leave Krusader off until I need it.  When I logged in I tried to launch diagnostics but the system killed it.  I was able to run "poweroff", which took about 5 minutes, and after I powered the machine on I ran a new "diagnostics" (attached).  Any assistance would be appreciated.  Thanks.

 

GS

unraid1-diagnostics-20170823-0723.zip
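
When the box is too starved to finish a full diagnostics run, a much lighter snapshot may still get through. This is only a sketch, assuming a Linux /proc filesystem and the procps version of `ps` (the one that supports `--sort`). The function name `quick_snapshot` is made up.

```shell
#!/bin/bash
# Hypothetical helper: print load averages, memory/swap totals and the
# processes using the most memory, using only a few cheap commands.
quick_snapshot() {
    local n="${1:-10}"   # how many processes to list

    echo "== load averages (1, 5, 15 min) =="
    cut -d ' ' -f 1-3 /proc/loadavg

    echo "== memory =="
    grep -E '^(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo

    echo "== top $n processes by memory =="
    # One header line plus n process lines, sorted by memory use, largest first.
    ps -eo pid,user,%mem,%cpu,comm --sort=-%mem | head -n "$((n + 1))"
}
```

Redirecting the output to the flash drive (for example a file under /boot/logs) keeps it across the reboot.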

Link to comment
50 minutes ago, Gordon Shumway said:

Just for fun I deleted the two dockers (Krusader and Plex) and ran the CA to clean up the docker remains.  Hopefully the new version will be coming out soon or someone can give us a clue what's killing our machines.  Sorry to say folks, but I didn't (and still don't) have these problems with my old Synology...

 

Synology is totally in control of the hardware platform. Yours is not a common problem which probably means either broken hardware or slight incompatibility of some kind.

 

Not fair to compare a very closed platform like Synology with an open platform like unRAID. You get better stability perhaps (although closed systems can have problems too), but pay a lot more and get only a small amount of the function.

 

Wonderful thing - you get to choose :) 

 

I'd take the system down to the basics. No dockers. Just NAS. See if it runs stable for several days.

 

Then introduce one Docker. Same thing.

 

Continue to add function until the problem occurs.

 

An approach like this may provide some useful data.

Link to comment

Just teasing about the Synology vs. unRAID.  If Synology was that great I wouldn't be here now... The hardware is brand-new out-of-the-box so I tend to doubt it is failing this early in its life cycle, but anything is possible.  I did remove the two dockers this morning and cleaned them up with the Community App (don't recall the name and not connected to my home at the moment).  I'll bounce the box tonight to make sure everything is clean, then keep it under observation.  Thanks.

Link to comment

I ran the sysctl commands that Johnnie.black recommended and also removed the Krusader and Plex dockers that were not in use, and have yet to have a problem.  A week ago I had to shut down EVERYTHING as the local power company was running new lines and had to shut down my entire block.  I have not seen a CPU or memory spike since then and also have not re-entered the "sysctl" commands, so I will have to conclude that removing the unused dockers seems to have made the trouble go away.  For now.  Thanks everyone.

Link to comment
  • 1 month later...

I'm baaaaack...  About 15 minutes ago my Windows desktop lost connection to an unraid share.  I logged into my unraid server and found that I was "out of memory".  I stopped the array and tried to restart it but got errors telling me to run "fix common problems".  When I launched that app it complained that my server was out of memory (duh), and that call traces had been found.  I have attached the lines of the syslog file from just before the server ran out of memory.  I have not yet actually rebooted the machine and will probably hold off until tomorrow morning in case anyone wants to see anything that might disappear after a reboot.

 

GS

2017-10-18_unraid1_syslog.txt
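
To pull the relevant lines out of a long syslog, something like the sketch below may help. The search patterns are the usual kernel wording ("invoked oom-killer", "Out of memory", "Call Trace"), but it is an assumption that this kernel logs them exactly that way. The function name `oom_lines` is made up.

```shell
#!/bin/bash
# Hypothetical helper: list OOM-killer and call-trace lines from a syslog
# file, each prefixed with its line number for easy lookup.
oom_lines() {
    local log="${1:-/var/log/syslog}"
    # -n: show line numbers, -i: ignore case, -E: extended regex
    grep -n -iE 'out of memory|oom-killer|call trace' -- "$log"
}
```

Run it as `oom_lines` for the live log, or `oom_lines /path/to/saved-syslog.txt` for a saved copy. Running it before rebooting matters, since the live log is lost on reboot.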

Link to comment
