• [6.8.3] shfs error results in lost /mnt/user


    JorgeB
    • Minor

    There are several reports in the forums of this shfs error causing /mnt/user to go away:

     

    May 14 14:06:42 Tower shfs: shfs: ../lib/fuse.c:1451: unlink_node: Assertion `node->nlookup > 1' failed.

     

    Rebooting fixes it until it happens again. I remember seeing at least 5 or 6 different users with the same issue in the last couple of months, and it was reported here that it's possibly this issue:

     

    https://github.com/libfuse/libfuse/issues/128

     

    Attached diags from latest occurrence.

     

     

     

    tower-diagnostics-20200514-1444.zip




    User Feedback

    Recommended Comments

    Ditto, I have it too.

     

    I was deleting files and directories via rm and rmdir, hence the unlink_node, I presume. After the last successful delete I was still able to ls, but when I tried ls again a few minutes later, on returning to the shell, I received:

    ls: cannot open directory '.': Transport endpoint is not connected

    The /mnt/user file handle was stale and not showing via mount:

    ls -l /mnt
    ls: cannot access '/mnt/user': Transport endpoint is not connected
    total 192
    drwxrwxrwx 1 nobody users 242 May 20 04:40 cache
    drwxrwxrwx 1 nobody users 178 May 20 04:40 disk1
    drwxrwxrwx 1 nobody users  82 May 20 04:40 disk10
    drwxrwxrwx 1 nobody users 148 May 20 04:40 disk2
    drwxrwxrwx 1 nobody users 160 May 20 04:40 disk3
    drwxrwxrwx 1 nobody users 160 May 20 04:40 disk4
    drwxrwxrwx 1 nobody users 134 May 20 04:40 disk5
    drwxrwxrwx 1 nobody users 150 May 20 04:40 disk6
    drwxrwxrwx 1 nobody users 144 May 20 04:40 disk7
    drwxrwxrwx 1 nobody users 140 May 20 04:40 disk8
    drwxrwxrwx 1 nobody users 106 May 20 04:40 disk9
    drwxrwxrwt 2 nobody users  40 May 18 18:20 disks
    d????????? ? ?      ?       ?            ? user
    drwxrwxrwx 1 nobody users 178 May 20 04:40 user0

    Nothing in dmesg, but /var/log/syslog has:

    May 20 08:36:00 qwwweeq shfs: shfs: ../lib/fuse.c:1451: unlink_node: Assertion `node->nlookup > 1' failed.

    I also see this, which I've not seen before:

     

    May 20 08:41:03 pkpaoskd root:  HDIO_DRIVE_CMD(setidle) failed: Inappropriate ioctl for device

    Thanks

    Link to comment

    attaching diagnostics here.

    @johnnie.black - apologies for the double post, I'd not seen any reply to your or the other threads (which weren't closed and to which you had duplicate posted, hence I assumed the more posts the better). I also did not agree with the priority of Minor, as I've had data corruption several times along with daily reboots, which is more in line with a higher severity.

    Can you please increase the priority from Minor to a higher one?

    Also, this log set does not include the unlink_node error, as it was taken post-reboot and the logs get wiped. I'll reattach when it inevitably reoccurs.

    thanks

    diagnostics-20200528-0846.zip

    Link to comment
    34 minutes ago, niavasha said:

    I'd not seen any reply to your or the other threads (which weren't closed and to which you had duplicate posted

    The other threads are in the general support forum. This is the bug report section and there are no duplicate bug reports about this, at least AFAIK, and there's no point in having duplicates here. Any other users with the same issue are encouraged to add to this report: the more people affected, the more likely it will be prioritized.

     

    37 minutes ago, niavasha said:

    I also did not agree with the priority of Minor

    That's not always easy to choose. I'll be leaving it as minor for now since most affected users don't have major issues. It can always be changed later, and regardless it will have LT's attention.

     

    As for your issues, I would try upgrading to v6.9-beta1 to see if it helps. You can also try running the mover during off hours, when there's no docker activity, since that might be one of the triggers, i.e., trying to move a file after it was moved/deleted by a docker.

     

     

    Link to comment

    I have the same problem.

    I can trigger the error by stopping/starting certain docker containers, but after that /mnt/user is red and unusable,

    with the error Tower shfs: shfs: ../lib/fuse.c:1451: unlink_node: Assertion `node->nlookup > 1' failed. in the log.

    tower-diagnostics-20200602-0925.zip

    Link to comment

    Well, despite trying several versions of the beta, this is still occurring daily, sometimes several times a day. The best thing I did to make this as painless as possible was adding a user script that runs on a custom schedule (every minute, i.e. * * * * *), checks syslog for the error, and then reboots the server using the supplied powerdown tool, which is the "recommended" way of rebooting.

    I experimented with stopping docker, unmounting shares, remounting, starting docker, and the same for VMs, but at the end of the day it was quicker and easier to just reboot (urgh indeed), and to date this has been safe for me. Fingers crossed.

    Obviously use this at your own risk, caveat emptor, buyer beware, not officially endorsed, etc., but it's as simple as stated:

     

     

    #!/bin/bash

    # Only look at the last 1000 lines (to avoid matching too old an error; a 1-minute user script schedule works well).
    if tail -n 1000 /var/log/syslog | grep -q unlink_node; then
        powerdown -r
    fi
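
    A slightly stricter variant of the check, as a sketch only: it matches the shfs assertion line rather than any line that happens to contain unlink_node, and takes the log path and line count as parameters so it can be tried against a copy of the log first. The function name shfs_assert_seen and its parameters are made up for illustration; the pattern is based on the syslog line quoted in this thread.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if the shfs unlink_node assertion appears
# in the last N lines of a log file.
#   $1 = log file (default /var/log/syslog)
#   $2 = number of trailing lines to scan (default 1000)
shfs_assert_seen() {
    local logfile="${1:-/var/log/syslog}"
    local lines="${2:-1000}"
    tail -n "$lines" "$logfile" 2>/dev/null |
        grep -q "shfs: .*unlink_node: Assertion"
}

# Example use in the user script (uncomment to act on a match):
# if shfs_assert_seen; then powerdown -r; fi
```

    Running it by hand against a saved log, with the reboot line still commented out, is a low-risk way to see whether the pattern matches on your system before scheduling it.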
    Link to comment

    I have been facing the same issue recently. Deleting a folder while restarting a Docker container that "uses" it makes /mnt/user go down.

    Link to comment

    Hi, just wondering if there is any news/update on this? I know which docker is causing the problem for me (tdarr), but I am the only user affected by this issue, so the tdarr devs cannot help.

    Link to comment

    I cannot automatically backup my dockers now because of this bug. It's risky to stop/start dockers unless I am there to restart the server in case this bug pops up.

    Link to comment

    I have been studying this issue.  Wondering if anyone has a simple way to make this happen?

     

    Something to try: if you can tolerate NFS stale file handles for a bit, set "Settings/NFS/Tunable (fuse remember)" to 0, or disable NFS entirely and then Stop/Start array and see if you can make it happen, or what different issue occurs.  I have posted a query in the FUSE github issue list asking if anyone there can shed some light on why this assert() gets reached.  When it does hit this assert() it kills the shfs process which is why your user shares become inaccessible.

     

    I am continuing to study this code.  I may end up adding some additional debugging code to FUSE in a test release.
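
    Since the assert kills the shfs process, one way to tell this failure apart from other causes of a missing /mnt/user is to check for the process and the mount point separately. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes the process is named shfs and the mount point is /mnt/user, as seen elsewhere in this report.

```shell
#!/bin/bash

# Succeeds if any running process has exactly this name.
# Reads /proc directly so no extra tools are needed (Linux only).
process_running() {
    local name="$1" comm_file
    for comm_file in /proc/[0-9]*/comm; do
        if [ "$(cat "$comm_file" 2>/dev/null)" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical helper. Return codes: 0 = healthy, 1 = process gone,
# 2 = process alive but mount point not accessible.
check_user_share() {
    local mount_dir="${1:-/mnt/user}"
    local proc_name="${2:-shfs}"

    if ! process_running "$proc_name"; then
        echo "$proc_name is not running"
        return 1
    fi
    # On a dead FUSE mount, stat fails (e.g. "Transport endpoint is not connected").
    if ! stat "$mount_dir" > /dev/null 2>&1; then
        echo "$mount_dir is not accessible"
        return 2
    fi
    echo "$mount_dir looks healthy"
    return 0
}
```

    Calling check_user_share with no arguments uses the defaults above; a return code of 1 right after the assertion shows up in syslog would be consistent with the shfs process having been killed.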

    Link to comment

    Hi, I have no simple way. The only way I know is to let tdarr loose on my library.  It runs 24/7 and I can lose my shares once every 2-3 days or, like yesterday, 3 times in 1 day.  Would it be useful to you if I posted my diags before each reboot when it happens?

     

    I'm afraid I don't understand NFS stale file handles well enough to answer your question.

    Link to comment

    @limetech I have a couple of coredump files if you are interested in debugging with gdb:

     

    core.shfs.2047.202101242340: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/local/sbin/shfs /mnt/user -disks 2047 -o noatime,allow_other -o remember=-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/local/sbin/shfs', platform: 'x86_64'
    core.shfs.2047.202101252228: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/local/sbin/shfs /mnt/user -disks 2047 -o noatime,allow_other -o remember=-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/local/sbin/shfs', platform: 'x86_64'

     

    Please PM me. 

     

    I've not had time to look into this myself due to work commitments, but did at least manage to set up the system to dump core.
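
    For anyone else holding core files like these, a first look with gdb could go roughly as follows. This is only a sketch: core_backtrace is a made-up helper name, gdb has to be installed separately, and without debug symbols for shfs the backtrace may show little more than addresses.

```shell
#!/bin/bash

# Hypothetical helper: print a backtrace of all threads from a core file.
#   $1 = path to the binary that crashed
#   $2 = path to the core file
core_backtrace() {
    local binary="$1" core="$2"

    if [ ! -f "$binary" ] || [ ! -f "$core" ]; then
        echo "usage: core_backtrace <binary> <core-file>" >&2
        return 1
    fi
    if ! command -v gdb > /dev/null 2>&1; then
        echo "gdb is not installed" >&2
        return 2
    fi
    # -batch makes gdb exit after running the -ex commands
    gdb -batch -ex "thread apply all bt" "$binary" "$core"
}

# Example (binary path and core file name taken from the listing above):
# core_backtrace /usr/local/sbin/shfs core.shfs.2047.202101242340
```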

    Link to comment

    Unfortunately, of my three servers, the one this bug occurs on is a "production" server, so it's hard to debug because it can't have much downtime. But it happened again and there are a lot of "kernel: tun: unexpected GSO type" errors 

    Here are my diagnostics.

    cs-diagnostics-20210315-2045.zip

    Link to comment
    11 hours ago, Jagadguru said:

    But it happened again and there are a lot of "kernel: tun: unexpected GSO type" errors 

    That should be unrelated; more info and workarounds here. As for the shfs issue, you could try this, posted just above:

     

    On 3/14/2021 at 5:39 PM, limetech said:

    Something to try: if you can tolerate NFS stale file handles for a bit, set "Settings/NFS/Tunable (fuse remember)" to 0, or disable NFS entirely and then Stop/Start array and see if you can make it happen, or what different issue occurs.

     

     

    Link to comment

    I turned NFS off entirely and migrated the affected shares to SMB.  I also re-enabled CA backup to shut down containers and restart them after backup, which I had been hesitant to do because of this bug.

     

    After doing that a month ago, the bug has not popped up since.

    Link to comment

    So I had this issue a while ago and now it has started again with no change to my docker containers. In fact, nothing changed either when it stopped happening or now that it has started again.

     

    What is curious is that it happens overnight. I reboot the server in the morning and then within 10 to 20 minutes it happens again. I reboot again and it is fine all day, and then overnight it happens again and the cycle restarts.

     

    I didn't grab the logs before the first reboot, but the logs from before the second are attached.

    35highfield-diagnostics-20210518-0943.zip

     

    EDIT: I am running 6.9.2

    Edited by doma_2345
    Link to comment

    Well, the only thing I can do is commiserate. This is obviously painful for all involved. However, I set up a User Script, as per an above post, that runs every minute with a Custom schedule (* * * * *), looks for the error in syslog, and reboots the machine safely if it finds it.

     

    My machine has rebooted over 200 times since May last year, but now I don't really notice it, instead of finding my machine inaccessible.

     

    Yes, it's a pain in the proverbial, but at least it's now automatic. Here's the script. Note that at one stage I was dumping core on mount.shfs (which involved removing the "ulimit -c none" set on boot) in the hope that @limetech might value a core file for debugging purposes, but at this stage, despite proffering this several times, I've given up:

     

    #!/bin/bash
    # Looks for the dreaded unlink_node error and logs occurrences to /boot/config/unlink_reboots.log
    # Also reboots the machine using the prescribed powerdown -r
    # Also backs up a core file from mount.shfs if it finds one in / (this requires raising the core ulimit, which is not covered here)

    LOG=/boot/config/unlink_reboots.log

    if grep -q unlink_node /var/log/syslog; then
        {
            echo "----------------------------------------------------------------------------"
            grep unlink_node /var/log/syslog
            date
            uname -a
            echo "----------------------------------------------------------------------------"
            echo ""
            if [ -f "/core" ]; then
                echo "Found core file, moving it"
                mv /core "/boot/logs/core.$(date +%Y%m%d-%H%M)"
            fi
        } >> "$LOG"
        powerdown -r
    fi
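
    The script above only finds /core if the crashed process was allowed to dump core in the first place. One way to check that on a running process, as a sketch only, is to read its limits from /proc. The helper name core_soft_limit is made up for illustration, and this does not cover how to change the limit.

```shell
#!/bin/bash

# Hypothetical helper: print the soft "Max core file size" limit of a
# running process, read from /proc/<pid>/limits (Linux only).
# A value of 0 means the process will not write a core file.
core_soft_limit() {
    local pid="$1"
    awk '/^Max core file size/ { print $5 }' "/proc/$pid/limits"
}

# Example: check the limit of the current shell.
core_soft_limit $$
```

    To check shfs itself, pass the PID of the running shfs process instead of $$.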

     

    Link to comment



