• [6.8.3] shfs error results in lost /mnt/user


    JorgeB
    • Minor

    There are several reports in the forums of this shfs error causing /mnt/user to go away:

     

    May 14 14:06:42 Tower shfs: shfs: ../lib/fuse.c:1451: unlink_node: Assertion `node->nlookup > 1' failed.

     

    Rebooting fixes it until it happens again. I remember seeing at least 5 or 6 different users with the same issue in the last couple of months, and it was reported here that it's possibly this issue:

     

    https://github.com/libfuse/libfuse/issues/128

     

    Attached diags from latest occurrence.

     

     

     

    tower-diagnostics-20200514-1444.zip
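
    For anyone checking whether their own server has hit the same failure, here is a minimal sketch against the default syslog location (the second pattern is the usual follow-on error mentioned later in this thread; exact wording may vary by release):

        # Search the live syslog for the fuse assertion and the follow-on mount error.
        grep -E 'unlink_node|Transport endpoint is not connected' /var/log/syslog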




    User Feedback

    Recommended Comments



    On 10/27/2021 at 6:00 PM, VBilbo said:

    I have the same issue with disappearing /mnt/user every couple of days after installing Tdarr.

    I had this too; just disable NFS shares if you can get away with it.

    I went a step further and disabled NFS entirely.

    Link to comment

    Not sure if this is already known, but today this issue suddenly started happening to me and I began looking for solutions. I stumbled on this post:

    I changed the path for my media share from /mnt/user/media to /mnt/user0/media in the Tdarr docker containers, and to be safe I also made sure the /temp path is now on a share that the mover never runs against.

     

    I have no Tdarr-related errors in the logs anymore, so fingers crossed!
    I have NFS disabled and still have Hard Links enabled under Global Share Settings.
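
    For reference, that remapping looks roughly like the following container volume mappings. This is only a sketch: the share name, the container paths, and the cache pool location are illustrative rather than the real Tdarr template defaults, which also need their usual ports and config mounts.

        # Hypothetical Tdarr volume mappings (illustrative names, not the template defaults):
        # media is pointed at /mnt/user0/<share> instead of /mnt/user/<share>, and the
        # transcode/temp path lives on a cache-only location the mover never touches.
        docker run -d --name tdarr \
          -v /mnt/user0/media:/media \
          -v /mnt/cache/tdarr_transcode:/temp \
          haveagitgat/tdarr:latest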

    Edited by VlarpNL
    Link to comment

    I have just seen this for the first time and found my way here by accident: I ran Fix Common Problems and there was a comment about Tdarr, which I also have installed.

     

    The server is on 6.9.2 and is always very stable, as I generally don't touch it. Today I added a new physical NVMe drive and moved some files/folders with bash. On completion of this task I noticed that /mnt/user was missing. I should also note that some of the folders moved were shares that Tdarr interacts with.

     

    A reboot resolved the issue.

    Edited by bandiboo
    Link to comment

    Hello everyone,

     

    Just wanted to chime in here and potentially give some people a temporary solution. I am also having the Tdarr warning pop up in Fix Common Problems:

     

    "tdarr (haveagitgat/tdarr:latest) has the following comments: Anecdotal evidence is implicating tdarr with Unraid issues."

     

    which refers me to this thread. In doing some research and reading through here, it looks like I'm not truly having the issues everyone is experiencing.

     

    I believe that's due to dumb luck. The way I have Tdarr configured, my transcode folder is on an NVMe drive that is my second cache pool, and the share it uses is set to cache-only, specifically on that drive. I guess because the mover is never invoked on this share, I don't have a true issue.

     

    I haven't had the /mnt/user folder disappear on me or hit any of the other issues I have read about with this warning.

     

    Maybe this can help some other people experiencing this.

     

    Although, due to my OCD, I would like to figure out how to get Fix Common Problems to stop sending me this warning and ultimately fix the overall issue. (I know I can ignore the warning, but that's not how I like to deal with these things.)

    Link to comment

    I can confirm I am having the same experience you are. I have my tdarr_temp folder set to prefer cache (NVMe) and have not experienced the /mnt/user share issue. I went with this approach because I didn't want unnecessary writes to my spinning drives. I've been running Tdarr for months now.

    Link to comment

    I just want to report my experience with Tdarr. I've been running Tdarr in multiple configurations with multiple nodes across two Unraid servers. Currently, I'm running two Tdarr servers with four nodes operating on two independent data sets totaling around 90k files on 20 disks with parity. Temporary files are written directly to disk, skipping cache. I have never had any issues like this, and I've saved almost 40 TB over the last 5 months.

    Edited by paloooz
    Link to comment

    FYI: I have been seeing this issue (the disappearing user shares) since upgrading to 6.10 RC4. I am using NFS for my shares as the majority of my systems are Linux or Mac. Macs in particular have speed issues with SMB shares, but NFS works great.

     

    The gotcha is that I don't use Tdarr... in fact I don't use any of the *arr apps. I grabbed diagnostics just now as it happened again. I will send them via PM if anyone wants to look at them, but I prefer not to post them here. Although I use anonymize, going through the diagnostics reveals they still contain information that I consider private.

     

    I'll be taking my own stab at the diagnostics shortly, but I've disabled hard links as suggested and will see if that helps.

     

    Link to comment

    Much like AgentXXL, I also don't use Tdarr and am having this issue. The real kicker for me is that I can't do a clean shutdown after this happens either, and since I'm running two parity drives, I have to rebuild parity every time (a total of about 30 hours).

     

    That being said, the last time this happened the parity drives went offline before the reboot, which seems to be related. In any event, shares randomly going away, which causes Docker to crash and then puts the machine in a bad state requiring a physical reboot to bring it back up, is a problem.

     

    Edit: and /mnt/user is offline again, in the middle of a parity rebuild.

    Edited by Jeremyb
    Link to comment

    I too have had the shares disappear overnight, with this shfs error, on a production server. So I will be watching this forum with great interest!

    Changes I have recently made:
    - consolidated 2x cache drives (production & server) into one larger SSD
    - added a new RAID6 cache pool
    - turned on a docker called prism and let it run all night; it writes its metadata to the new cache pool (not a public shared folder)
    - NFS was on, but I have turned it off since the shares disappeared, to see if that was the issue (rarely the CFO brings in her MacBook, and I make fun of her for its overpriced limitations)

    A reboot did bring everything back online for now. I have stopped the prism app in order to monitor the server for 24 hours without any other changes. This Unraid server is used heavily during the day in a production environment, with an average of 5 clients grabbing/saving files all day. Normally this server does not have any issues (besides the ones W10 bestows upon it with its random forced updates) and has been left on, issue-free, without a reboot for months.

    I have logs of the crash, but they contain sensitive information I do not feel safe posting publicly.

     

    Link to comment
    2 hours ago, miicar said:

    I too have had the shares disappear overnight, with this shfs error, on a production server. So I will be watching this forum with great interest!

    Changes I have recently made:
    - consolidated 2x cache drives (production & server) into one larger SSD
    - added a new RAID6 cache pool
    - turned on a docker called prism and let it run all night; it writes its metadata to the new cache pool (not a public shared folder)
    - NFS was on, but I have turned it off since the shares disappeared, to see if that was the issue (rarely the CFO brings in her MacBook, and I make fun of her for its overpriced limitations)

    A reboot did bring everything back online for now. I have stopped the prism app in order to monitor the server for 24 hours without any other changes. This Unraid server is used heavily during the day in a production environment, with an average of 5 clients grabbing/saving files all day. Normally this server does not have any issues (besides the ones W10 bestows upon it with its random forced updates) and has been left on, issue-free, without a reboot for months.

    I have logs of the crash, but they contain sensitive information I do not feel safe posting publicly.

     

    I am having the same problem, but don't have any of those same "recent changes". I got the issue when NFS was off, and I don't have or use prism. I have two drives in my cache pool in RAID1 Btrfs, and a second one that is just an NVMe. I don't use RAID6 for anything.

     

    As with most other people, this issue is relatively new for me; my machine was humming along pretty well (it's pretty new) before it started crashing. I've replaced most of the suspect hardware by now.

     

    The other challenging part is that it requires a hard reboot, which means that if I am away from the machine for a while, I won't be able to access or reboot it.

    Link to comment
    2 hours ago, Jeremyb said:

    The other challenging part is that it requires a hard reboot, which means that if I am away from the machine for a while, I won't be able to access or reboot it.

    Not the answer to the problem, but the MyServers plugin makes that painless to accomplish.

    Link to comment
    1 hour ago, Squid said:

    Not the answer to the problem, but the MyServers plugin makes that painless to accomplish.

     

    OK, I have the plugin, but I'm not sure how it helps. I can VPN into my network, but clicking the shutdown/restart button on the Unraid dashboard does nothing. SSHing in and typing "shutdown -r now" does nothing. Typing the same command from a connected keyboard and monitor also does nothing. The only way to access it again is a hard reboot.

    Link to comment

    I believe it is hung doing a soft reboot because it cannot unmount /mnt/user.  Even trying to access /mnt/user from the terminal hangs.  
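
    A minimal sketch to confirm that from a terminal, assuming the shfs user share mount is at the default /mnt/user. A lazy unmount sometimes lets a stuck shutdown continue, but as reported above, often only a hard reset works:

        # Check whether the shfs process is still alive and whether the mount responds.
        ps aux | grep '[s]hfs'
        timeout 5 ls /mnt/user >/dev/null && echo "mount responding" || echo "mount hung or gone"
        # Optional and not guaranteed to help: lazily detach the stuck FUSE mount.
        # umount -l /mnt/user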

    Link to comment
    On 4/19/2022 at 5:35 PM, Jeremyb said:

     

    OK, I have the plugin, but I'm not sure how it helps. I can VPN into my network, but clicking the shutdown/restart button on the Unraid dashboard does nothing. SSHing in and typing "shutdown -r now" does nothing. Typing the same command from a connected keyboard and monitor also does nothing. The only way to access it again is a hard reboot.

    This has happened to me as well. I am lucky I have a managed motherboard I can access outside of Unraid to force a reset... I don't know if I will ever build another server that doesn't have some sort of backdoor access.

    Link to comment

    I also had this issue (transport endpoint not connected) every few days.

     

    In the end it turned out to be caused by bad memory. I ran Memtest for several days but got only rare errors (one every few hours), so I wondered whether that could really cause the frequent shfs errors.

     

    However, replacing the memory module totally solved this issue for me (no errors for two months now). So maybe you should also check your RAM.

    Link to comment

    I'm another one referred here by the Fix Common Problems plugin. I use Tdarr extensively, having been in discussion with the developer since the beginning of its creation. I have never had, and still do not have, any of these issues. However, I do not use the Unraid array except as a dummy USB device to start the Docker services (I use ZFS), and I do not use NFS (I use SMB). I strongly suspect this issue is more about Tdarr triggering an Unraid bug of some kind than Tdarr itself being the issue.

    Link to comment

    I've experienced this twice in two days. Never seen it before now. 

     

    I streamed from Plex earlier and now my user shares are gone. 

     

    I do not use Tdarr but do use many of the *arrs.

    Link to comment

    Happened to me recently with 6.11.2.

     

    In my case it's always triggered by SMB file operations - macOS's poor implementation acting on a stale version of the directory tree, causing invalid operations and the fuse exception. I have to remember to navigate down or up then back to force a refresh.

    Link to comment
    On 9/9/2021 at 1:33 PM, niavasha said:

    Well, the only thing I can do is commiserate; this is obviously painful for all involved. However, I set up a User Script, as per an above post, that runs every minute on a custom schedule (* * * * *), looks for the error in the syslog, and reboots the machine safely if it finds it.

     

    My machine has rebooted over 200 times since May last year, but now I hardly notice it, rather than finding my machine inaccessible.

     

    Yes, it's a pain in the proverbial, but at least it's now automatic. Here's the script. Note that at one stage I was dumping core on mount.shfs (which involved removing the "ulimit -c none" set on boot) in the hope that @limetech might value a core file for debugging purposes, but at this stage, despite proffering this several times, I've given up:

     

    #!/bin/bash
    # Looks for the dreaded unlink_node error and logs occurrences to /boot/config/unlink_reboots.log.
    # Also reboots the machine using the prescribed powerdown -r.
    # Also backs up a core file from mount.shfs if it finds one in / (although this requires
    # adjusting the core ulimit, which is not covered here).

    if grep -q unlink_node /var/log/syslog; then
        echo "----------------------------------------------------------------------------" >> /boot/config/unlink_reboots.log
        grep unlink_node /var/log/syslog >> /boot/config/unlink_reboots.log
        date >> /boot/config/unlink_reboots.log
        uname -a >> /boot/config/unlink_reboots.log
        echo "----------------------------------------------------------------------------" >> /boot/config/unlink_reboots.log
        echo "" >> /boot/config/unlink_reboots.log
        if [ -f "/core" ]; then
            echo "Found core file, moving it" >> /boot/config/unlink_reboots.log
            mv /core /boot/logs/core.$(date +%Y%m%d-%H%M)
        fi
        powerdown -r
    fi

     


    Does this reboot take enough time to bring down all the containers?

    Link to comment

    @limetech Happy to say this looks completely fixed as of whatever release came out after September 14th, as that was the last time I had to auto-reboot to handle the issue.

     

    Currently I am running 6.11.5, which has the 5.19.17-Unraid kernel, and I've been up and running for 82 days... yay.

     

    Thanks!

    Link to comment




