TRANSPORT ENDPOINT IS NOT CONNECTED WITH /MNT/USER (due to docker?) - shfs segfault - 2023 edition


Recommended Posts

I have this problem and have had it for probably a couple of years now.

$ ls /mnt/user
/bin/ls: cannot access '/mnt/user': Transport endpoint is not connected

There have been many others telling of the same problem:

https://forums.unraid.net/topic/102568-transport-endpoint-is-not-connected-with-mntuser/

https://forums.unraid.net/topic/115300-binls-cannot-access-user-transport-endpoint-is-not-connected/



The first one mentions it could be Docker accessing /mnt/user. This could very well fit my use case, because I use Duplicati (running via Docker) to back up both /mnt/disk* and /mnt/user, and let Duplicati deduplicate the double backup. That way, if a disk + parity fails I can restore the files from that disk, and if I lose a /mnt/user folder I can restore it across all disks.
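For context, the kind of container mapping I mean looks roughly like this (an illustrative sketch only; the image name and paths are examples, not my exact Duplicati setup):

# Illustrative sketch only - image name and paths are examples, not my exact config.
# The point is that the same data is visible twice inside the container:
# once per disk (/mnt/disk*) and once through the shfs user share (/mnt/user).
docker run -d --name duplicati \
  -v /mnt/disk1:/source/disk1:ro \
  -v /mnt/disk2:/source/disk2:ro \
  -v /mnt/user:/source/user:ro \
  lscr.io/linuxserver/duplicati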

Has anyone found a solution to this problem? Is there a different way of using Docker so that it can access /mnt/user without the issue?

Has anyone determined that Docker IS the cause of this issue (at least in their case)?

Edited by Alex R. Berg
Link to comment

Thank you itimpi.

Here are my diagnostics. I have further anonymized the syslog: it echoed all my sudo lines and lots of scripts I was running, so I removed the lines that seemed private but kept the ones at the end, around where the problem arises, in the hope of not removing anything important for the diagnostic process.

Best Alex

tower-diagnostics-20230729-0846-anonymized-extra-anonymized.zip

Link to comment
Jul 29 03:40:01 Tower kernel: shfs[15981]: segfault at 0 ip 00005587d0b1359c sp 0000149390b7f810 error 4 in shfs[5587d0b11000+c000]

 

shfs segfaulted. After this happens the shares will be gone until the server is rebooted, but it's not clear to me what caused the segfault. If it keeps happening, try to drill down into whether it occurs after doing something specific.
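If you want to check for this yourself, you can search the syslog for the segfault, for example:

grep -i 'shfs.*segfault' /var/log/syslog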

Link to comment

Thank you, I hadn't noticed that line as being the first one, or at least I hadn't previously followed up on it.

It does (did) keep happening. I have now changed the Docker containers that access lots of files (backup + Resilio Sync) so that they only access my files via disk shares, and so far it looks good. It's a bit too early to tell, though; I'll report back in a few weeks on whether it has been stable by then. It was happening maybe once or twice a week while my backup was processing all my files and checking checksums through /mnt/user.
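Concretely, the change was along these lines (paths are illustrative, not my exact mappings):

# Illustrative example of the path-mapping change, not my exact container config.
# Before: the container reached the data through the shfs user share:
#   -v /mnt/user/sync:/data
# After: the same data is reached via the cache/disk shares, bypassing shfs:
#   -v /mnt/cache/sync:/data
#   -v /mnt/disk1/sync:/data/disk1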

I exported diagnostics many times when it happened back last year; I just checked the previous logs.

 

2022: 

Aug 22 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync/Archive
Aug 22 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync
Aug 22 03:40:02 Tower kernel: shfs[23214]: segfault at 0 ip 00000000004043a4 sp 0000152d0c670780 error 4 in shfs[402000+c000]
Aug 22 03:40:02 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 43 8b 05 8f df 00 00 85 c0 78 2f e8 16 e0 ff ff
Aug 22 03:40:02 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup
Aug 22 03:40:02 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync

Aug 29 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync/Archive
Aug 29 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync
Aug 29 03:40:01 Tower kernel: shfs[1963]: segfault at 0 ip 00000000004043a4 sp 0000147af70f3780 error 4 in shfs[402000+c000]
Aug 29 03:40:01 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 43 8b 05 8f df 00 00 85 c0 78 2f e8 16 e0 ff ff
Aug 29 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup
Aug 29 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync

Sep  2 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync/Archive
Sep  2 03:40:01 Tower move: error: move, 392: No such file or directory (2): lstat: /mnt/cache/sync/AppBackup/.sync
Sep  2 03:40:01 Tower kernel: shfs[16416]: segfault at 0 ip 00000000004043a4 sp 0000152977cfb780 error 4 in shfs[402000+c000]
Sep  2 03:40:01 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 43 8b 05 8f df 00 00 85 c0 78 2f e8 16 e0 ff ff

Sep 16 03:40:02 Tower kernel: shfs[27859]: segfault at 0 ip 00000000004043a4 sp 000014caa2cb2780 error 4 in shfs[402000+c000]
Sep 16 03:40:02 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 43 8b 05 8f df 00 00 85 c0 78 2f e8 16 e0 ff ff

2023: 
Jun 18 03:40:02 Tower kernel: shfs[4635]: segfault at 0 ip 000055c0d617359c sp 0000150db53fa810 error 4 in shfs[55c0d6171000+c000]
Jun 18 03:40:02 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 4d 8b 05 67 dd 00 00 85 c0 78 39 e8 9e da ff ff

Aug 10 03:40:02 Tower kernel: shfs[8688]: segfault at 0 ip 000055a8e226f59c sp 000014bf269d9810 error 4 in shfs[55a8e226d000+c000]
Aug 10 03:40:02 Tower kernel: Code: 48 8b 45 f0 c9 c3 55 48 89 e5 48 83 ec 20 48 89 7d e8 48 89 75 e0 c7 45 fc 00 00 00 00 8b 45 fc 48 63 d0 48 8b 45 e0 48 01 d0 <0f> b6 00 3c 2f 74 4d 8b 05 67 dd 00 00 85 c0 78 39 e8 9e da ff ff


A picture presents itself. I have lots of things running during the day, but I also log whenever run-parts executes something, and in the examples above it seems the previous ones finished long before. Lines like these:
 

Aug 29 03:24:49 Tower /etc/cron.daily/userScripts/run_vps_up_test: Executing
Aug 29 03:24:49 Tower /etc/cron.daily/userScripts: run-parts completed


I also checked all my other crontab-jobs and userScripts, and I'm fairly certain I have nothing else running at that specific time.

Hmm, so can I just disable the mover and my problems will probably be fixed? That sounds rather strange, given that I'm using my own custom mover that is based on the mover codebase and runs it regularly throughout the day.

I run this at 0:40 every day, so if it were just the mover script then this should also have triggered the issue sometimes (unless another script of mine had turned off Docker to make backups, but I don't think I have that any longer):

 

#!/usr/bin/bash

$ALEXBIN/custom_mover
/usr/local/sbin/mover > /dev/null
# Completion check was previously moved to appBackup which turns off docker and services: \\towerAlex\appbackup\backupRsync
# but now back here as there's no longer an appbackup
mover_completion_check


I don't know if the scheduled mover does anything different from running '/usr/local/sbin/mover'.

---

I'll leave my Docker changes in place, so everything runs on /mnt/disk*, for the next month or two. I'll also change the mover schedule to run at another time, so if it crashes again I would expect a new timestamp. If it runs smoothly for a couple of months, I'll try changing my Docker containers to use /mnt/user again and see if the problem resurfaces.

Would it be helpful to anyone else in tracking down the issue if I report back and try to recreate it?

PS: I updated the title to include 'shfs segfault', to make it easier for others to find this issue.

Edited by Alex R. Berg
Link to comment
  • Alex R. Berg changed the title to TRANSPORT ENDPOINT IS NOT CONNECTED WITH /MNT/USER (due to docker?) - shfs segfault - 2023 edition

Some containers that directly manipulate files (move, rename, remove) have been known to cause issues with shfs. It's been happening for some time, so it's likely not an easy fix, but see what you can find; at the very least it may help other users having the same issue. Another similar issue is this one:


https://forums.unraid.net/bug-reports/stable-releases/683-shfs-error-results-in-lost-mntuser-r939/

 

Link to comment

Right, that fits with my experience. So it seems that when an unRaid process moves a file away from a disk via a disk share like /mnt/disk1 (such as the mover process) while a Docker container accesses (reads?) the file via the /mnt/user share, it crashes shfs. It's probably the remove operation 'rm /mnt/disk1/someFile' that conflicts with shfs reading that file; possibly it could also be related to the same file appearing in shfs on another disk. My vague understanding is that it is shfs that creates the virtual view /mnt/user as the sum of /mnt/disk*.

So I would say, if someone wants to reproduce this issue: just create a Docker container that reads a large (GB-sized) file from /mnt/user, and then have a mover process running directly on unRaid that moves that file back and forth between two disks.
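Something like this untested sketch is what I have in mind (the share and file names are made up, and plain mv stands in for whatever the real mover does):

#!/bin/bash
# Untested sketch of the reproduction idea - share and file names are made up.
# 1) A container continuously reads a large file through the shfs user share:
docker run -d --rm --name shfs-reader -v /mnt/user/test:/data alpine \
  sh -c 'while true; do cat /data/bigfile > /dev/null; done'

# 2) Meanwhile, on the host, keep moving the same file between two disks
#    (plain mv used here as a stand-in for the real mover):
while true; do
  mv /mnt/disk1/test/bigfile /mnt/disk2/test/
  mv /mnt/disk2/test/bigfile /mnt/disk1/test/
done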


The reason I was only seeing it at night and not from my custom mover is almost certainly that my custom mover runs every 5 minutes, checks that the disks are idle, and only then moves files.

*/5 * * * * /home/alex/bin/move_if_idle > /dev/null

See the attached files for the contents of my move_if_idle and are_disks_idle scripts:

move_if_idle are_disks_idle
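For anyone who doesn't want to open the attachments, the idea behind are_disks_idle is roughly this (a simplified sketch, not the actual script):

#!/bin/bash
# Simplified sketch of the idea only - the real are_disks_idle script is attached.
# Snapshot the read/write counters from /proc/diskstats, wait a bit,
# and treat the disks as idle if nothing changed in between.
before=$(awk '$3 ~ /^sd[a-z]+$/ {print $3, $4, $8}' /proc/diskstats)
sleep 10
after=$(awk '$3 ~ /^sd[a-z]+$/ {print $3, $4, $8}' /proc/diskstats)
[ "$before" = "$after" ]   # exit 0 if idle, non-zero otherwise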

Edited by Alex R. Berg
Link to comment
  • 2 weeks later...
On 8/27/2023 at 9:14 AM, Alex R. Berg said:

So I would say, if someone wants to reproduce this issue: just create a Docker container that reads a large (GB-sized) file from /mnt/user, and then have a mover process running directly on unRaid that moves that file back and forth between two disks.

Just so I'm clear: the mover should not try to move any file that is in use. Can you confirm that's really what's happening?

Link to comment
  • 2 months later...

Ah, I had forgotten that part of the mover semantics: the mover will leave any file currently in use alone. That makes my hypothesis less likely, though the mover might have started moving a file that only then came into use.
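If I understand the in-use check correctly, it amounts to something like this (my assumption, not verified against the actual mover code):

# Assumed behaviour, not verified against the actual mover code:
# a file is skipped if any process currently has it open.
f=/mnt/disk1/share/somefile
if fuser -s "$f"; then
  echo "in use, skipping: $f"
else
  echo "would move: $f"
fi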

I haven't seen the problem at all since changing my Duplicati + Resilio Sync containers to use /mnt/disk* and /mnt/cache instead of /mnt/user. Before, it happened nearly once a month or so.

But now I have just reliably reproduced the issue, though the reliable way is unfortunately just by calling the move script incorrectly. Maybe you can still find something. This makes it much more likely that the actual bug is in the move script, or in how the move script uses shfs (or something...), since it still manages to crash the shfs process.

Note that the thing I reproduced here is not something I did in the previous error reports; back then it seemed quite reasonable that it was the actual mover causing the error, not my custom move script, since the time of day was exactly when the actual mover ran.

---

I configured one of my shares to be cache-only. Currently the share's files are on disk1. This caused the issue immediately each time I ran my custom mover, which uses Limetech's move command (but uses it incorrectly, as I discovered).

I now know that I can reproduce the issue while none of these are running:

  • Docker stopped
  • Parity check not running
  • Mover disabled using touch /var/run/mover.pid, and checked that it's not running

To crash the server I simply run:

echo /mnt/user0/home/svnbackup/dead.letter | /usr/local/bin/move -e 0xfffffffe


Other files on the same share give the same result when using /mnt/user0.


`0xfffffffe` should have been the magic word for "move to cache" (many years ago). However, I can see that the mover script has changed and no longer uses the 'move' command in this way, so maybe this problem has nothing to do with the other one, though it seems to crash in a similar way (different code, but also a different unRaid version).
If I run this instead (which seems to be what the mover does now), then I get:

> echo /mnt/user0/home/svnbackup/dead.letter | /usr/local/bin/move ""
move_object: /mnt/user0/home/svnbackup/dead.letter File exists

> find /mnt/*/home/svnbackup -maxdepth 0
/mnt/disk1/home/svnbackup
/mnt/user/home/svnbackup
/mnt/user0/home/svnbackup

If I run it CORRECTLY :) - using disk1 instead of user0:
> echo /mnt/disk1/home/svnbackup/dead.letter | /usr/local/bin/move ""
then it just moves the file to cache as it should.

I'm running unRaid 6.12.4 now.

PS: I'm now following the topic, so that should help me notice late replies.

tower-diagnostics-20231105-1706-shfs-crash-5-anon.zip

Link to comment
  • 2 weeks later...

This issue has been driving me nuts for the last few weeks; I need some kind of solution.

 

I use this server only as a backup for other servers in my home network. The server runs a few cron jobs that start the other servers and back up their data with rsync. Nothing fancy here: no Docker containers, no cache disks, only a lot of storage.
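The jobs are essentially of this shape (the wake-up step is omitted, and the host and paths are invented for illustration):

# Illustrative crontab entry only - host and paths are invented.
# Nightly pull of another server's data onto the array via rsync:
0 3 * * * rsync -a --delete backupuser@server1:/srv/data/ /mnt/user/backups/server1/ >> /var/log/backup-server1.log 2>&1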

 

I am attaching the last log to see if that can be of any help.

 

The server has 16 GB of RAM and an i9-9900K. I don't think the problem is the RAM or the processor, since those are more than enough for the task.

 

The problem usually starts when rsyncing folders with a lot (millions) of small files. I'm not sure whether the number of files has something to do with it, but it probably does.

 

Let's see if we can figure it out and solve it.

dayna-diagnostics-20231114-0449.zip

Link to comment
5 hours ago, Sanduleak said:

This issue has been driving me nuts for the last few weeks; I need some kind of solution.

This is a different issue: shfs is not crashing, it's being killed because the server is running out of RAM:

 

Nov 13 23:08:07 Dayna kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=shfs,pid=9741,uid=0
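To confirm that it's OOM kills like this one rather than a crash, you can search the log, for example:

grep -i 'oom-kill\|out of memory' /var/log/syslog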

 

Link to comment
