  • Unraid 7.0.0-beta2 - Array Stop & NFS issues


    jpatriarca
    • Urgent

    Hi team,

    I have a newly installed server running 7.0.0-beta2 and I'm seeing two issues with my install that seem to share the same root cause: the NFS service.

     

    Issue #1

     

    Anytime I need to stop the array, the log shows that one or more mount locations are busy and the system is unable to unmount them. No matter how long I wait, they never unmount, even if the mounts on the client servers (several physical machines plus Proxmox VMs and CTs) are unmounted and closed before I stop the array.

    If I then press shutdown/reboot, every time the system restarts a new parity check begins, which takes around 2 days to complete.

     

    Issue #2

     

    Regarding the servers connected to Unraid over NFS: sometimes the NFS service hangs on Unraid, which prevents my servers from connecting to it. Even if I restart the NFS or RPC services on the Unraid server, I cannot get the Unraid NFS shares to connect again. Yet if I run showmount against Unraid, all exported shares are still listed, from any server. I've also tried restarting my servers and re-connecting to the NFS shares, but I get the same timeout error when connecting.
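
    For reference, this is roughly what I run when it happens (a sketch only; the rc script paths are the stock Slackware ones Unraid ships, and the hostname is my server):

    # on the Unraid server
    /etc/rc.d/rc.rpc restart        # restart rpcbind / the RPC services
    /etc/rc.d/rc.nfsd restart       # restart the NFS server daemons
    # from any client, the exports are still listed even while the mounts time out
    showmount -e unraid.internal.patriarca.pt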

    Anytime I get these errors, without exception, I'm forced to restart the server and go through issue #1 and the consequences described above.

     

    I'm sharing the diagnostics saved on my USB drive from the last restart I had to do this afternoon, related to issue #2; the parity check it triggered is running now.

     

    Both issues were also present on 7.0.0-beta1. This behavior was not seen on my previous NAS, a Synology.

     

    Looking forward to your feedback.

    unraid-diagnostics-20240813-1621.zip




    User Feedback

    Recommended Comments

    Re: issue #1, several zfs datasets are failing to unmount:

     

    Aug 13 16:21:24 unraid root: cannot unmount '/mnt/disk5': pool or dataset is busy
    Aug 13 16:21:24 unraid root: cannot unmount '/mnt/disk3/media': pool or dataset is busy
    Aug 13 16:21:24 unraid root: cannot unmount '/mnt/disk3': pool or dataset is busy
    Aug 13 16:21:25 unraid root: cannot unmount '/mnt/disk2/proxmox_bck': pool or dataset is busy
    Aug 13 16:21:25 unraid root: cannot unmount '/mnt/disk2/proxmox': pool or dataset is busy
    Aug 13 16:21:25 unraid root: cannot unmount '/mnt/disk2/media': pool or dataset is busy
    Aug 13 16:21:25 unraid root: cannot unmount '/mnt/disk2': pool or dataset is busy
    

    Possibly because of the NFS issue. Can you please confirm whether you can stop the array normally if you do it before issue #2 happens?

    Link to comment

    I cannot stop the array normally. To be honest, since I started using Unraid with the beta, I have never been able to stop the array normally: every time I try, I get the "cannot unmount" errors in the log and then have to force a restart to recover.

     

    I would expect any client to simply get a timeout and Unraid to stop the array even with files open. So some other NFS problem must be involved here, based on the description in my initial post, and it is related to issue #1.

     

    Even with those "cannot unmount" messages, I can confirm that all clients connected to the Unraid NFS service were hung and unable to reach the NFS shares. After the restart, everything went back to normal on the connected clients.

    Edited by jpatriarca
    Adding last paragraph with additional information and context
    Link to comment

    So to confirm: if you start the array and stop it immediately, before any NFS issues occur, you still cannot stop it?

     

    That would mean something else is preventing the datasets from unmounting. If that is the case, try to rule out plugins by rebooting in safe mode, and rule out the Docker and VM services by leaving them disabled initially.

    Link to comment

    Still unable to stop the array; this is what I'm getting right now after trying to stop it:

     

    Quote

    Aug 13 18:48:32 unraid root: cannot unmount '/mnt/disk3/media': pool or dataset is busy
    Aug 13 18:48:32 unraid root: cannot unmount '/mnt/disk3': pool or dataset is busy
    Aug 13 18:48:32 unraid root: cannot unmount '/mnt/disk2/proxmox_bck': pool or dataset is busy
    Aug 13 18:48:32 unraid root: cannot unmount '/mnt/disk2/media': pool or dataset is busy
    Aug 13 18:48:32 unraid root: cannot unmount '/mnt/disk2': pool or dataset is busy

     

    Now I need to force a reboot of Unraid and start again.

    I will try rebooting into safe mode with no plugins enabled and only the NFS shares active, and then try to stop and unmount the array.

     

    I will share the result later today or tomorrow in this thread. Thanks

    Link to comment

    Hi @JorgeB, I was able to test the mount/unmount behavior of my Unraid array in safe mode, and it mounts and unmounts correctly, both with all servers connected over NFS and with Docker active or inactive. Attached is a new diagnostics file from this session and the tests performed, for your analysis.

     

    At this moment, my strong bet is that once I move out of Safe Mode and mount the array, one of the community apps is the root cause of this behavior.

     

    Question: does the diagnostics file include logs or information about the installed community apps that might show what is causing this lockdown and preventing the array from unmounting safely?

     

    Thanks

    unraid-diagnostics-20240813-2226.zip

    Link to comment

    If it works in safe mode it suggests a plugin issue; I recommend uninstalling or disabling all plugins, then adding them back one by one and retesting.

    Link to comment

    I was able to test Unraid outside of Safe Mode after removing some of the installed plugins, and I could stop the array correctly without having to restart the server. So I'm now narrowing this down to NFS and the shares that are active between my servers and Unraid. For some reason the logs don't explain, Unraid cannot close those connections in a controlled way. If I disable NFS entirely, mounting/unmounting the array works as expected, which rules out Docker and the plugins as the cause.

     

    Also, I've tested using Unraid with no servers connected to it through NFS, and with the NFS service disabled entirely, and in both cases I could stop and restart the array without issues.

     

    There are more and more indications that both issues described in my initial post are related: something in the NFS service in Unraid is preventing the array from being unmounted, and it is causing the severe NFS hangs that force me to restart.

     

    The NFS connections between my servers and Unraid are a set of fstab NFS mounts (2 physical and 2 virtual servers) plus Proxmox mount points to a single NFS share on the two physical servers that act as Proxmox hosts.

     

    During the unmount process, after the logs report that the shares are busy, I can go to each server and unmount every NFS share that was mounted. Even after all those shares are unmounted, Unraid keeps reporting that the shares are still busy.

     

    I couldn't find any command or tool in Unraid that would tell me which shares and files are preventing Unraid from unmounting the array so I could do some debugging. If there is any such tool or log, please let me know and I will proceed with those tests.

    Link to comment
    2 hours ago, jpatriarca said:

    I couldn't find any command or tool in Unraid that would tell me which shares and files are preventing Unraid from unmounting the array so I could do some debugging. If there is any such tool or log, please let me know and I will proceed with those tests.

    Install the Open Files plugin. It will show you which files are open, so you can see which ones might prevent a shutdown. You are also given the opportunity to kill the tasks holding those files open.
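
    From the command line you can get roughly the same picture (a sketch; lsof and fuser are available on a stock Unraid install):

    # list processes with files open under a busy mount point
    lsof /mnt/disk3
    fuser -vm /mnt/disk3        # -v verbose, -m every process using the mount
    # kill the offending process if needed (<PID> is a placeholder)
    kill <PID>
    # note: files held by the in-kernel NFS server are not tied to a userspace
    # process, so they may not show up here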

     

    I will review your diagnostics and see if I find anything.

    Link to comment

    Hi @JorgeB & @dlandon, more details that might be useful:

     

    - the root cause is in the NFS service

    - I isolated the scenario to a single VM mounting an NFS share (Proxmox Backup Server), with all the other mounts disabled

    - when stopping the array, I get "pool or dataset is busy" for the mounts that the NFS client reads from

    - if I go to the server and unmount the NFS share, I get confirmation of the unmount, but Unraid continues to state that the pool and share are busy

    - I need to force a restart of Unraid to be able to start the array again

     

    I've tried changing the NFS mount options in several ways to avoid this behavior, such as the hard/soft, timeo and retrans options. On the other side, I have export settings in the NFS share configuration in Unraid, related to uid/gid, that were migrated from the old storage to avoid permission issues. I don't know whether this has any influence, or whether there are other export options that could be used to avoid this.
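
    For example, one variation I tried was switching the clients to soft mounts (illustrative only; my current entries use the default hard behavior, as shown further down):

    # soft: the client returns an error after retrans retries instead of retrying forever
    unraid.internal.patriarca.pt:/mnt/user/media /mnt/unraid/media nfs4 defaults,nolock,soft,timeo=50,retrans=4 0 0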

     

    What surprises me is that the NFS service in Unraid does not take the timeout into consideration and does not force the closure of the mounts being accessed by the client servers. That forced closure (with the associated risk of losing data, of course) is the behavior I've seen on other platforms, like Synology.

     

    Hope this additional information helps.

    Link to comment
    3 hours ago, JorgeB said:

    @dlandon any ideas for how to confirm if NFS is preventing the datasets from unmounting?

    Those shares with the 'busy' log messages are all exported with NFS. Looking at the log, it appears Unraid is doing all the right things to get the remote NFS client files released. I'm suspecting that the pool or dataset will have to be force exported to get it out of the busy state. I did a little test, and a zfs pool mounted remotely over NFS shows as busy even when no files are open.
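
    Roughly what my test looked like (a sketch; the dataset, address and mount point names are placeholders, not taken from your diagnostics):

    # on the server: a zfs dataset exported over NFS
    zfs create cache/nfstest                        # mounted at /mnt/cache/nfstest
    exportfs -o rw 192.168.1.50:/mnt/cache/nfstest
    # on the remote client
    mount -t nfs4 server:/mnt/cache/nfstest /mnt/nfstest
    # back on the server, with no files open by the client:
    zfs unmount cache/nfstest                       # reported "pool or dataset is busy"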

     

    You know better than I do how Unraid unmounts zfs pools.

    • Like 1
    Link to comment
    1 hour ago, dlandon said:

    I did a little test and a zfs pool mounted remotely using NFS shows that the pool is busy even if no files are open.

    I cannot reproduce that; I can stop the array with a share on a zfs pool exported over NFS and mounted on a different server with UD.

    Link to comment

    Thanks both for your feedback. @JorgeB, I can stop the array with open NFS connections as long as no files are open. With files open, I cannot stop the array and I see the behavior I've been describing: Unraid tries to close the array and the NFS exports while a file is open, cannot do it, and no timeout is ever triggered. What doesn't fit in this picture is what I mentioned earlier: unmounting all the NFS mounts on my servers changes nothing, and Unraid just keeps waiting in a loop. My expectation was that, after the NFS share is unmounted on the client servers, the open files would be discarded, a timeout would occur and Unraid would unmount the zfs pool safely. That is not what's happening.

     

    Based on that, and to prove the point, I ran some tests this afternoon: I reproduced the behavior, got the busy messages for multiple shares in Unraid, and then stopped the NFS service manually by running /etc/rc.d/rc.nfsd stop. A couple of seconds later (10 to 20) the array stopped and unmounted safely.
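
    In other words, the sequence that currently works for me is roughly:

    # array stop already requested, shares reported busy in the syslog
    exportfs -v                  # optional: see which shares are still exported
    /etc/rc.d/rc.nfsd stop       # stop the NFS server daemons manually
    # about 10-20 seconds later the pending array stop completes and the datasets unmount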

     

    I'm sharing a new diagnostics file, and you should see this happening in the system log attached to it, somewhere between 14:00 and 15:00.

     

    Please take into consideration that this is only about issue #1; issue #2 is still to be reviewed and discussed, if you don't mind.

    unraid-diagnostics-20240814-1705.zip

    Link to comment

    OK, so failing to unmount with open files is expected. Whether it should still fail after the NFS shares are unmounted on the remote servers I don't know, I'm not really familiar with NFS, nor whether the current behavior can be improved.

    Link to comment

    I'm seeing some log entries with regard to NFS:
     

    Aug 14 11:03:03 unraid rc.nfsd: /usr/sbin/rpc.mountd
    Aug 14 11:03:03 unraid rpc.mountd[5605]: Version 2.6.4 starting
    Aug 14 11:03:03 unraid kernel: rpc-srv/tcp: nfsd: got error -32 when sending 20 bytes - shutting down socket
    ### [PREVIOUS LINE REPEATED 1 TIMES] ###
    Aug 14 11:03:03 unraid rc.nfsd: NFS server daemon...  Started.

    This indicates a problem occurred with the NFS client that temporarily caused nfsd to shut down the socket (error -32 is EPIPE, i.e. the peer closed the connection).

     

    I assume you are mounting the NFS shares with fstab entries.  Can you post the fstab file on your client?

    Link to comment

    As for issue #2, we are seeing this kind of issue on some Unraid 6.12 servers and are looking into it. We are not sure whether it also affects Unraid 7.0. The issue seems to be related to a kernel release where nfsd does not terminate reliably, so an nfsd restart does not work.

     

    Does issue #2 occur on a clean boot with the array auto-starting? Or after an array stop and restart?

    Link to comment

    Sure, these are the fstab entries on my 4 servers:

     

    Server 1 - Proxmox server using CT mount points

     

    unraid.internal.patriarca.pt:/mnt/user/media /mnt/unraid/media nfs4 defaults,nolock,timeo=50,retrans=4 0 0

     

    These are the mount details:

     

    unraid.internal.patriarca.pt:/mnt/user/media on /mnt/unraid/media type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=50,retrans=4,sec=sys,clientaddr=192.168.20.21,local_lock=none,addr=192.168.20.199)

     

    Server 2 - Proxmox server using CT mount points

     

    unraid.internal.patriarca.pt:/mnt/user/media /mnt/unraid/media nfs4 defaults,nolock,timeo=50,retrans=4 0 0

     

    These are the mount details:

     

    unraid.internal.patriarca.pt:/mnt/user/media on /mnt/unraid/media type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=50,retrans=4,sec=sys,clientaddr=192.168.20.22,local_lock=none,addr=192.168.20.199)

     

    Server 3 - Virtual Server with Debian

     

    unraid.internal.patriarca.pt:/mnt/user/media    /mnt/unraid/media       nfs4 defaults,nolock,timeo=50,retrans=4 0 0
    unraid.internal.patriarca.pt:/mnt/user/personal /mnt/unraid/personal    nfs4 defaults,nolock,timeo=50,retrans=4 0 0

     

    These are the mount details:

     

    unraid.internal.patriarca.pt:/mnt/user/proxmox_bck on /mnt/proxmox_bck type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=50,retrans=4,sec=sys,clientaddr=192.168.10.102,local_lock=none,addr=192.168.20.199)

     

    Server 4 - Virtual server with Proxmox Backup Server

     

    unraid.internal.patriarca.pt:/mnt/user/proxmox_bck /mnt/proxmox_bck nfs4 defaults,nolock,timeo=50,retrans=4 0 0

     

    In addition to the fstab configuration on servers #1 and #2, there's a PVE storage configuration that creates another NFS mount for an LVM volume, where I keep the hard drive images for a single virtual server. This is the configuration:

     

    nfs: unraid-lvm
            export /mnt/user/proxmox
            path /mnt/pve/unraid-lvm
            server unraid.internal.patriarca.pt
            content images,iso,vztmpl
            options timeo=50,retrans=4
            preallocation off

     

    These are the mount details (the same for server #1 and server #2, which share the storage configuration through Proxmox):

     

    unraid.internal.patriarca.pt:/mnt/user/proxmox on /mnt/pve/unraid-lvm type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=50,retrans=4,sec=sys,clientaddr=192.168.20.21,local_lock=none,addr=192.168.20.199)

     

    Hope it helps

    Link to comment
    4 minutes ago, dlandon said:

    As for issue #2, we are seeing this kind of issue on some Unraid 6.12 servers and are looking into it. We are not sure whether it also affects Unraid 7.0. The issue seems to be related to a kernel release where nfsd does not terminate reliably, so an nfsd restart does not work.

     

    Does issue #2 occur on a clean boot with the array auto-starting? Or after an array stop and restart?

    I've seen the behavior on a clean boot. Since I also hit the unmount problem described above, I had to restart the server before I could stop and start the array again. So all the NFS hangs I've seen happened after a clean boot, after several hours or days of uptime. The first diagnostics file shared in this thread should have some info about this nfsd service hang.

    Link to comment

    Hi @dlandon and @JorgeB,

    Is there any update on either issue? Were you able to test my scenario for issue #1? And are there any updates or an expected fix (in the next 7.0.0 beta) for the kernel issue with NFS reported as issue #2?

     

    I'm still facing the same issues as reported.

     

    Looking forward to your updates. Thank you.

    Link to comment

    Unraid 7.0 beta 3 will be released soon. We recommend you try it and see if the issues are resolved. There is a good chance the NFS issue (#2) will be fixed, and I suspect it may also fix your first issue.

    Link to comment



