Unraid web console unreachable? "Transport endpoint is not connected"? Read this.



This thread describes errors where the unRAID web console dies and "transport endpoint is not connected" errors appear. This first post collects the combined tips for solving these issues; how we arrived at them can be found in the posts below.

 

Add the following commands to your go script:

 

syntax for 4.7:

pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -17 > /proc/$PID/oom_adj; done
pgrep -f "/usr/sbin/smbd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

 

(I will only maintain a set of commands for the 5.x branch from now on.)

 

The following set of commands will make sure your needed base functionality does not get killed off in case of an out-of-memory error:

 

on 5.*: (thanks to int13h)

pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
pgrep -f "/usr/sbin/smbd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
pgrep -f "in.telnetd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

 

Make SABnzbd likely to get killed

pgrep -f "/usr/local/sabnzbd/SABnzbd.py" | while read PID; do echo 1000 > /proc/$PID/oom_score_adj; done

 

The SABnzbd setting actually helps more than the other ones: it makes sure that whenever a process needs to be sacrificed, SABnzbd will be the one killed off. The good thing is that you can just disable/enable it from the web interface and it will run again; no need to go into the console or reboot.

 

Please take note of the thread below for the how and why of these additions.

 

-----

Original first post follows:

-----

 

Just noticed I could no longer access my user shares. Through the console I get the following error if I try to cd to one:

 

-bash: cd: user: Transport endpoint is not connected

 

This looks like a bug... And it caused the loss of some files for me; nothing unsolvable, but still...

 

I cannot attach a syslog now since I also cannot reach the flash drive from my system; it looks like SMB has crashed...

 

I will now reboot; this will undoubtedly solve the issue, but I hope it does not happen again...

 

The array refuses to shut down, even though none of the processes that normally cause that are running. I have saved the output of ps -elf.

 

I have now rebooted and can access the syslog I saved on the flashdrive, it is in the attached zip together with the ps -elf output.

 

The system is back up, but (of course) a parity check is now running.

syslog.zip

-----

We need Joe on this :-)

 

I did some digging and I think I have found what is going on. Most probably the server ran out of memory, which triggers Linux to kill off processes that are not used very often. The downside is that emhttp (which gives you the web interface for administering unRAID) and SMB (which actually provides your drive shares) are among the first to go.

 

I recently had the web interface die on me, and now SMB; both happened after I started using SABnzbd/CouchPotato/Sickbeard. All are used a lot recently and might even have memory leaks.

 

The issue is that I can hardly believe this is Linux behaviour that cannot be altered... Core system processes (and for unRAID, SMB and the web interface are exactly that) should be protected against this behaviour... On the other hand, if that were possible it would most probably already have been done...

 

Fortunately memory is quite cheap these days, so yesterday I ordered another 8 GB; my box will then run on 16 GB, which I believe is basically total nonsense, but if it helps me survive a couple of days longer it is worth the 50 euros to me...

-----

Did some digging...

 

Now we really need some Linux expertise:

 

http://lwn.net/Articles/317814/

 

Apparently Linux uses something called an "OOM killer" (out-of-memory killer).

 

The OOM killer is used to make sure a server will not crash: if memory becomes desperately low, it will kill off processes according to a predefined rule set in an effort to keep the server afloat.

 

The article behind the above link describes ways to (quote) "SAVE IMPORTANT PROCESSES FROM BEING KILLED".

 

Now that is exactly what we need, right?

 

I have no problem with the system killing CouchPotato, Sickbeard, SABnzbd, AirVideo... But the server should keep itself running, and ITSELF for an unRAID box includes emhttp and smb (and possibly also the daemons for NFS etc.).

 

The way to do this is to give processes a score that defines the order in which they are available for "killing": give an application a score of -17 and it will not be considered a candidate for killing at all.

 

Sounds good, eh?

 

The problem is there is no trace of these options within the unRAID distro (at least not to my untrained eye). It could be we need to compile something in, or possibly the OOM killer is specific to one distro and unRAID uses something else...

 

So... linux savvy guys out there... get your hack on ?  This would be a very valuable thing to have imho..

-----

Ehm....

 

Someone needs to help me and tell me I am talking crap because I think I actually found the solution...

 

Someone PLEASE verify...:

 

STEPS:

 

1) Log in to your server.

2) Run ps -elf | grep emhttp

 

This gives you the process ID of the emhttp process.

 

3) Go to /proc/<the process ID you just found>

 

Here you will find a whole lot of stuff that I do not understand, but you will also find a file called oom_adj. If you look into it you will find one number: 0. That is the default every process gets. If we set this to -17, the emhttp process will no longer get killed.

 

For me the process ID for emhttp is 14736.

 

So I gave the following command:

 

echo -17 > /proc/14736/oom_adj

 

The file now contains -17, and the process should no longer be targeted by the OOM killer...

 

Now, this is done on a per-process-ID basis, and the ID changes when you reboot, so we need something that can go into the go file. I think we have that too:

 

pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

Before anyone tries this: someone with Linux knowledge needs to confirm... I am not afraid to experiment, but I found this after 15 minutes of googling, so it might be that this totally does not work or is somehow dangerous...

 

 

 

 

-----

I've never used it, but I've read about it before... It should be safe as long as you're setting emhttp and smb to not be killed; it could get dangerous if you started setting other apps to not be killed. emhttp and smb are relatively low-resource, so "not" killing them shouldn't make a lot of difference.

 

If your scores are like mine, there is a whole slew of applications that would be killed before smb and emhttp, though. Checking right now, these would be killed before smb for me:

Sabnzbd

sickbeard

mylar

couchpotato

sickbeard (second instance)

headphones

telnet

+2 others I'm forgetting (9 total)

-----

 

That would be the case if only the oom_adj score were used to select the process to be killed, but as I understand it there is a whole formula behind it that decides on the process to be sacrificed by looking at the recent usage of the process as well as its oom_adj.

 

Therefore it is perfectly possible that emhttp gets killed off even while CouchPotato has an oom_adj score that offers it up sooner...

 

Personally I would never make sab/couch/sick unkillable, if only because they can easily be restarted... emhttp can no longer be restarted in the v5 release, making it necessary to perform a full reboot and possibly sit through a parity check at boot-up...

 

Sent from my iPad using Tapatalk HD

-----

The list I was referring to was the list of oom_score values. From my understanding (and I just went back and read up on it again to make sure I wasn't completely off base), the oom_score is the result of the formula you're speaking of, "shifted" according to the value set in oom_adj. Since I haven't set any oom_adj values, they are all currently "0". In this case, the list I made would be the order the OOM killer would go in to free memory.

 

That being said, it would definitely be possible that emhttp is killed before other apps, depending on that formula. My list is just an example.

 

If it does happen again, print the output of

egrep -i 'killed process' /var/log/syslog

That will at least let you know what was killed.

-----

That helps, thanks!

 

Imho the whole thing is tightly related to the fact that processes that cannot be restarted are the ones getting killed...

 

Also:

 

With 8 GB in my box there is no -real- reason there should be out-of-memory situations; those are (imho) applications behaving badly, and they should be punished (i.e. killed). Should it occur more often (I will monitor my system closely), I could also create a small cron job (daily, for example) that checks whether sab/sick/couch are still running and, if not, restarts them. That probably is not too difficult to make, and it would make sure that badly behaving applications get killed and, after some time, restarted with fresh opportunities... That would keep the ecosystem running up until the point where the latest updates of these apps have made them behave more nicely...
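A rough sketch of what such a watchdog could look like as a cron script. Note that the app paths and start commands below are my assumptions, not verified install locations; adjust them to your own setup:

```shell
#!/bin/bash
# Hypothetical watchdog sketch: restart an app if the OOM killer took it out.
# All patterns and start commands below are assumptions -- adjust to your setup.

restart_if_dead() {
  local pattern="$1"; shift
  if pgrep -f "$pattern" > /dev/null 2>&1; then
    echo "ok: $pattern is running"
  else
    echo "restarting: $pattern"
    # Launch the supplied start command in the background.
    "$@" > /dev/null 2>&1 &
  fi
}

# Example entries (hypothetical paths):
restart_if_dead "SABnzbd.py"     python /usr/local/sabnzbd/SABnzbd.py -d
restart_if_dead "SickBeard.py"   python /usr/local/sickbeard/SickBeard.py -d
restart_if_dead "CouchPotato.py" python /usr/local/couchpotato/CouchPotato.py -d
```

Dropped into /etc/cron.daily/ (or invoked from the go file via cron), this would give the apps their "fresh opportunities" automatically.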

 

 

-----

It's true, with 8 GB you shouldn't see those situations often, but they may happen under the right circumstances. If you were to start seeing them often, I'd run memtest to rule out RAM going bad.

 

Sab has been known to be a resource hog, but it's never been bad enough for me to have issues with it.

 

The problem isn't so much running out of memory, it's letting applications overcommit. If Sab asks for 1 GB of memory but only 500 MB is available, Linux will still grant it. Sab most likely will not use it all, so the 500 MB is fine until Sab actually does need it. Then Linux tries to kill off other processes so Sab can use the memory, since Linux has already told Sab it can have it.

 

The problem here is: if Linux has to kill off processes, what will it kill? And if it doesn't free enough in time, Sab will try to write to memory it can't, causing a segfault.

 

Thanks for the thread; it made me reread the topic. It's been a while since I've thought about how it works, other than acknowledging it's there.

-----

And I do feel that we have stumbled upon a solution for emhttp getting killed.

 

A lot of memory could even make it more likely that issues occur, if the cause is a lack of low memory.

 

Since low memory is used to address high memory, adding more gigabytes will leave less low memory available. That will then cause the OOM killer to run more frequently...

 

Should unRAID become 64-bit, this issue would not occur...
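The low-memory squeeze can actually be watched from the console: on a 32-bit kernel, /proc/meminfo exposes the low zone directly (these lines are absent on 64-bit kernels, where the split does not matter):

```shell
# LowTotal/LowFree only exist on 32-bit kernels with highmem:
grep -E 'LowTotal|LowFree' /proc/meminfo || echo "no Low* lines (64-bit kernel)"

# Overall figures for comparison -- MemFree alone does not tell the whole
# story on a 32-bit box:
grep -E '^(MemTotal|MemFree)' /proc/meminfo
```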

-----

I really hate to be one of those guys but here goes:

 

This issue only happened to me with RC8. I went back to RC5 and it's been flawless (as it was before I upgraded).

 

My setup has 3 GB of memory + a 2 GB swap file.

 

So the bottom line is: if it's a memory issue, what is different in RC8 that causes the memory to fill up in a matter of a day?

 

-----

Thanks Helmonder for bringing this up. I too have been having this issue since installing SAB/SB/CP onto unRAID, and I too have 8 GB of memory.

See this post http://lime-technology.com/forum/index.php?topic=20013.msg200189#msg200189

 

If you watch top closely while SAB is downloading, you will see that it eats memory at the same rate that it downloads (that is what it looks like, anyway; no calculations to back up that statement). It seems to hold onto that memory in cache, but then when the post-processor starts, the OOM killer starts killing off processes. This, to me, looks like an issue with SAB.

 

I have resolved my issue by setting SAB to pause the queue while post-processing. I haven't run out of memory once since doing that; however, I do prefer your solution. Being able to manage this via the GUI or via a conf file would be excellent.

-----

Keep in mind: processes that access large amounts of memory or large file counts tend to cause these OOM issues.

I had issues in the past where a large rsync job caused OOM problems. I had to drop the cache before and after I ran the job to prevent them.

 

I wonder if the unRAID md drive was re-tuned in RC8.

Can anyone validate differences between the two?

 

Also, anything that uses /tmp or /var/tmp can cause the system to run out of memory, since those live on the root fs, which cannot be swapped out.

If you have a swap partition, you can mount a tmpfs on /var/tmp and /tmp to allow swapping to assist. (I've verified that tmpfs filesystems can be swapped out, whereas rootfs will not be.)
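A sketch of what that could look like; the sizes here are my guesses, the mounts need root, and I have not tested this on unRAID, so the actual mount commands are shown commented out:

```shell
# Check what /var/tmp is currently backed by (filesystem type column):
df -T /var/tmp | awk 'NR==2 {print $2}'

# Hedged go-script sketch: back /tmp and /var/tmp with tmpfs so their contents
# can be pushed out to swap under memory pressure (unlike rootfs). Needs root,
# and anything already in those directories would be shadowed by the mount:
# mount -t tmpfs -o size=512m tmpfs /tmp
# mount -t tmpfs -o size=512m tmpfs /var/tmp
```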

 

Another point to consider: if you are running the cache_dirs script, this could cause OOM conditions too. It did on mine, but I have drives with enormous numbers of files, so I cannot use it without causing crashes.

 

The issue boils down to how much low memory is available for the kernel. 2 GB, 4 GB, 8 GB doesn't matter; it's low memory that's the issue.

If we use tmpfs for the root filesystem one day, I think the low-memory pressure would be better, since tmpfs can be swapped out in an emergency.

 

I did find a way of remounting root onto a tmpfs filesystem, but it requires some new tools and startup changes in unRAID that have to occur at the boot level.

-----

After upgrading to RC8 from 4.7 I have the same issue. I can't run SAB, Sickbeard, and the CouchPotato server at the same time; they will all crash within a minute or so of starting up and running for a bit. I'm going to try RC5, so we'll see. Just thought I'd throw my hat in so you're not alone.

-----

I think you guys are running into issues similar to mine; see my post "Lost connectivity". I feel that my problems are related to SAB/SickBeard/CouchPotato on unRAID 4.7. My warning to everyone is that prior to performing a disk upgrade or parity check, they should reboot to a vanilla unRAID installation, i.e. no unMENU or other extraneous apps running. I have been down for 4 days now and I'm still unsure I will be whole again :-(

 

Sent from my ASUS Transformer Pad TF700T using Tapatalk 2

 

 

-----

My syslog says:

 

/proc/1245/oom_adj is deprecated, please use /proc/1245/oom_score_adj instead

 

OK, now, if I write a value to the oom_score_adj file, do I have to use the same scoring system (from -17 to +15)?

Or is it expecting a different kind of value?

 

 

Autoreply: actually, the "new" scoring system uses the range [-1000, +1000].

So I'll add to my GO file these three lines :

 

# OOM daemon trick : do not kill unRaid WebUI and SMB shares too easily.
pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
pgrep -f "/usr/local/sbin/smbd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

 

I'll see how it goes.  :)

-----

It is described in the following knowledgebase article:

 

http://www.dbasquare.com/kb/how-to-adjust-oom-score-for-a-process/

 

/proc/[pid]/oom_score_adj, for kernels 2.6.36 and newer, takes a value between -1000 and 1000

/proc/[pid]/oom_adj, for older kernels, takes a value between -17 and 15

 

unRAID is now on kernel 3.4.11, so indeed, the -1000 to 1000 range should be used!
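Either way, you can read the files back to verify what value took effect. A small sketch, using the current shell's PID as a stand-in for the emhttp PID (substitute the PID found via pgrep on a real box):

```shell
# Use this shell's PID as a stand-in for the emhttp PID:
PID=$$

# The adjustment currently set for the process (-1000..1000 on newer kernels):
cat /proc/$PID/oom_score_adj

# The kernel's resulting "badness" score for this process -- higher means
# more likely to be chosen by the OOM killer:
cat /proc/$PID/oom_score
```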

 

 

 

-----

I think I have the same problem: everything works (SABnzbd, Sickbeard, and SMB), but the Tower web interface will stop working. I didn't have this problem before. In Chrome I get:

Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data.

-----

Using oom_adj:

 

Oct 24 18:49:26 Tower emhttp: unRAID System Management Utility version 5.0-rc8a
Oct 24 18:49:26 Tower emhttp: Copyright (C) 2005-2012, Lime Technology, LLC
Oct 24 18:49:26 Tower emhttp: Plus key detected, GUID: 090C-6300-AA00-000000168261
Oct 24 18:49:26 Tower kernel: go (11365): /proc/11363/oom_adj is deprecated, please use /proc/11363/oom_score_adj instead.

 

Using oom_score_adj:

 

Oct 25 22:50:49 Tower emhttp: unRAID System Management Utility version 5.0-rc8a
Oct 25 22:50:49 Tower emhttp: Copyright (C) 2005-2012, Lime Technology, LLC
Oct 25 22:50:49 Tower emhttp: Plus key detected, GUID: 090C-6300-AA00-000000168261

 

Better  ;)

 

So we should use oom_score_adj, at least for the unRAID 5.0 branch.

-----
