• [6.6.0] NFS Kernel crash


    ajeffco
    • Solved Urgent

    Hello,

     

    Running unraid 6.6.0 stable, with a mostly NFS shares server.  NFS appears to be crashing.  Below is the first indication in the log file of a problem.  All clients lock up with "nfs: server tower not responding, timed out" from that point forward.  I have a coworker running unraid who has had the same issue, and while we initially thought it was just NFS, all CIFS AND RSYNC shares become unavailable also when this happens.  When this happens unraid becomes 100% unusable for file operations for any client!

     

    This appears to  have been reported already at [ 6.6.0-RC4 ] NFS CRASHES.  I submitted another since this is 6.6.0 stable.

     

    HOW TO REPRODUCE:  Reboot and just wait.  My coworker has had this happen a few times, this is my first issue.

     

    Sep 26 03:48:41 tower kernel: ------------[ cut here ]------------
    Sep 26 03:48:41 tower kernel: nfsd: non-standard errno: -103
    Sep 26 03:48:41 tower kernel: WARNING: CPU: 2 PID: 12478 at fs/nfsd/nfsproc.c:817 nfserrno+0x44/0x4a [nfsd]
    Sep 26 03:48:41 tower kernel: Modules linked in: md_mod nfsd lockd grace sunrpc bonding mlx4_en mlx4_core igb sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ast ttm kvm_intel drm_kms_helper kvm drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd agpgart glue_helper intel_cstate intel_uncore ipmi_ssif intel_rapl_perf syscopyarea mpt3sas i2c_i801 i2c_algo_bit i2c_core ahci sysfillrect pcc_cpufreq libahci sysimgblt fb_sys_fops raid_class scsi_transport_sas wmi acpi_power_meter ipmi_si acpi_pad button [last unloaded: md_mod]
    Sep 26 03:48:41 tower kernel: CPU: 2 PID: 12478 Comm: nfsd Not tainted 4.18.8-unRAID #1
    Sep 26 03:48:41 tower kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.0a 02/08/2018
    Sep 26 03:48:41 tower kernel: RIP: 0010:nfserrno+0x44/0x4a [nfsd]
    Sep 26 03:48:41 tower kernel: Code: c0 48 83 f8 22 75 e2 80 3d b3 06 01 00 00 bb 00 00 00 05 75 17 89 fe 48 c7 c7 3b 9a 18 a0 c6 05 9c 06 01 00 01 e8 8a ec ec e0 <0f> 0b 89 d8 5b c3 48 83 ec 18 31 c9 ba ff 07 00 00 65 48 8b 04 25 
    Sep 26 03:48:41 tower kernel: RSP: 0018:ffffc9000c743db8 EFLAGS: 00010286
    Sep 26 03:48:41 tower kernel: RAX: 0000000000000000 RBX: 0000000005000000 RCX: 0000000000000007
    Sep 26 03:48:41 tower kernel: RDX: 0000000000000000 RSI: ffff88087fc96470 RDI: ffff88087fc96470
    Sep 26 03:48:41 tower kernel: RBP: ffffc9000c743e08 R08: 0000000000000003 R09: ffffffff82202400
    Sep 26 03:48:41 tower kernel: R10: 000000000000087f R11: 000000000000a9e4 R12: ffff8802b01ea808
    Sep 26 03:48:41 tower kernel: R13: ffff8807febb2a58 R14: 0000000000000002 R15: ffffffffa01892a0
    Sep 26 03:48:41 tower kernel: FS:  0000000000000000(0000) GS:ffff88087fc80000(0000) knlGS:0000000000000000
    Sep 26 03:48:41 tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Sep 26 03:48:41 tower kernel: CR2: 00001501e0097000 CR3: 0000000001e0a005 CR4: 00000000001606e0
    Sep 26 03:48:41 tower kernel: Call Trace:
    Sep 26 03:48:41 tower kernel: nfsd_open+0x15e/0x17c [nfsd]
    Sep 26 03:48:41 tower kernel: nfsd_write+0x4c/0xaa [nfsd]
    Sep 26 03:48:41 tower kernel: nfsd3_proc_write+0xad/0xdb [nfsd]
    Sep 26 03:48:41 tower kernel: nfsd_dispatch+0xb4/0x169 [nfsd]
    Sep 26 03:48:41 tower kernel: svc_process+0x4b5/0x666 [sunrpc]
    Sep 26 03:48:41 tower kernel: ? nfsd_destroy+0x48/0x48 [nfsd]
    Sep 26 03:48:41 tower kernel: nfsd+0xeb/0x142 [nfsd]
    Sep 26 03:48:41 tower kernel: kthread+0x10b/0x113
    Sep 26 03:48:41 tower kernel: ? kthread_flush_work_fn+0x9/0x9
    Sep 26 03:48:41 tower kernel: ret_from_fork+0x35/0x40
    Sep 26 03:48:41 tower kernel: ---[ end trace 0df913a547279c0d ]---

    tower-diagnostics-20180926-0904.zip




    User Feedback

    Recommended Comments



    Forgot to mention.  When this happens the web GUI works, ssh works. 

     

    Just noticed also that /mnt/user is missing!

     

    root@tower:~# df
    df: /mnt/user: Transport endpoint is not connected
    Filesystem       1K-blocks        Used   Available Use% Mounted on
    rootfs            16367896      708240    15659656   5% /
    tmpfs                32768         428       32340   2% /run
    devtmpfs          16367912           0    16367912   0% /dev
    tmpfs             16448628           0    16448628   0% /dev/shm
    cgroup_root           8192           0        8192   0% /sys/fs/cgroup
    tmpfs               131072         524      130548   1% /var/log
    /dev/sda1          1000336      451424      548912  46% /boot
    /dev/loop0            8320        8320           0 100% /lib/modules
    /dev/loop1            4992        4992           0 100% /lib/firmware
    /dev/md1        2930266532  1651302708  1278054652  57% /mnt/disk1
    /dev/md3        3907018532  3466271080   439927960  89% /mnt/disk3
    /dev/md4        3907018532  3467094748   439014820  89% /mnt/disk4
    /dev/md5        3907018532  3485190236   421011284  90% /mnt/disk5
    /dev/md6        3907018532  3483896356   421228428  90% /mnt/disk6
    /dev/md7        2930266532  1100410644  1829428332  38% /mnt/disk7
    /dev/md9        3907018532  3474464252   431616164  89% /mnt/disk9
    /dev/md10       3907018532  3437929976   468108360  89% /mnt/disk10
    /dev/md11       3907018532  3192044520   713767432  82% /mnt/disk11
    /dev/md12       3907018532  3485902824   420362712  90% /mnt/disk12
    /dev/md13       3907018532  2182583140  1722297612  56% /mnt/disk13
    /dev/md14       3907018532  3485206232   420995080  90% /mnt/disk14
    /dev/md15       3907018532  3355538164   550484364  86% /mnt/disk15
    /dev/md16       3907018532  3368869408   537176896  87% /mnt/disk16
    /dev/md17       2930266532  1085994852  1843976364  38% /mnt/disk17
    /dev/sds1        878906148    43783224   834069176   5% /mnt/cache
    shfs           59582040512 43729910024 15836218248  74% /mnt/user0
    /dev/md2        3907018532     7210884  3898767788   1% /mnt/disk2
     

    Edited by ajeffco
    Link to comment

    I am the co-worker mentioned. I experienced the nfs issue as well this past weekend. As with ajeffco, the gui worked, ssh was available....but /mnt/user was no longer mounted. I downgraded back to 6.5.3 and had no further nfs issues. upgraded back to 6.6.0 to retest and nfs had crashed by the next morning. switched from nfs to cifs and have not, as yet, had any issues accessing shares.

     

    ****

     

    [Sat Sep 22 22:02:02 2018] ------------[ cut here ]------------
    [Sat Sep 22 22:02:02 2018] nfsd: non-standard errno: -103
    [Sat Sep 22 22:02:02 2018] WARNING: CPU: 1 PID: 8892 at fs/nfsd/nfsproc.c:817 nfserrno+0x44/0x4a [nfsd]
    [Sat Sep 22 22:02:02 2018] Modules linked in: xfs nfsd lockd grace sunrpc md_mod bonding ax88179_178a usbnet mii e1000e mpt3sas i2c_i801 i2c_core ahci raid_class coretemp ipmi_si libahci scsi_transport_sas intel_agp intel_gtt button agpgart [last unloaded: mii]
    [Sat Sep 22 22:02:02 2018] CPU: 1 PID: 8892 Comm: nfsd Not tainted 4.18.8-unRAID #1
    [Sat Sep 22 22:02:02 2018] Hardware name: Supermicro X7SPA-HF/X7SPA-HF, BIOS 1.2b       07/19/13
    [Sat Sep 22 22:02:02 2018] RIP: 0010:nfserrno+0x44/0x4a [nfsd]
    [Sat Sep 22 22:02:02 2018] Code: c0 48 83 f8 22 75 e2 80 3d b3 06 01 00 00 bb 00 00 00 05 75 17 89 fe 48 c7 c7 3b 5a 11 a0 c6 05 9c 06 01 00 01 e8 8a 2c f4 e0 <0f> 0b 89 d8 5b c3 48 83 ec 18 31 c9 ba ff 07 00 00 65 48 8b 04 25
    [Sat Sep 22 22:02:02 2018] RSP: 0018:ffffc900009bbdc0 EFLAGS: 00010282
    [Sat Sep 22 22:02:02 2018] RAX: 0000000000000000 RBX: 0000000005000000 RCX: 0000000000000007
    [Sat Sep 22 22:02:02 2018] RDX: 0000000000000000 RSI: ffff88013fc96470 RDI: ffff88013fc96470
    [Sat Sep 22 22:02:02 2018] RBP: ffffc900009bbe10 R08: 0000000000000003 R09: ffffffff8220c400
    [Sat Sep 22 22:02:02 2018] R10: 00000000000003ea R11: 00000000000149c0 R12: ffff880139161808
    [Sat Sep 22 22:02:02 2018] R13: 00000000007a0000 R14: ffff880139161968 R15: 0000000000000080
    [Sat Sep 22 22:02:02 2018] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
    [Sat Sep 22 22:02:02 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [Sat Sep 22 22:02:02 2018] CR2: 0000155140b36020 CR3: 0000000001e0a000 CR4: 00000000000006e0
    [Sat Sep 22 22:02:02 2018] Call Trace:
    [Sat Sep 22 22:02:02 2018]  nfsd_open+0x15e/0x17c [nfsd]
    [Sat Sep 22 22:02:02 2018]  nfsd_read+0x45/0xec [nfsd]
    [Sat Sep 22 22:02:02 2018]  nfsd3_proc_read+0x95/0xda [nfsd]
    [Sat Sep 22 22:02:02 2018]  nfsd_dispatch+0xb4/0x169 [nfsd]
    [Sat Sep 22 22:02:02 2018]  svc_process+0x4b5/0x666 [sunrpc]
    [Sat Sep 22 22:02:02 2018]  ? nfsd_destroy+0x48/0x48 [nfsd]
    [Sat Sep 22 22:02:02 2018]  nfsd+0xeb/0x142 [nfsd]
    [Sat Sep 22 22:02:02 2018]  kthread+0x10b/0x113
    [Sat Sep 22 22:02:02 2018]  ? kthread_flush_work_fn+0x9/0x9
    [Sat Sep 22 22:02:02 2018]  ret_from_fork+0x1f/0x40
    [Sat Sep 22 22:02:02 2018] ---[ end trace 23bd716bac671985 ]---

    Link to comment

    What is the nature of the transfer taking place when the crash happens?

    That is, reading from server or writing to server?

    Large files or small files?

    What program is running in the client?

     

    Can you post your /etc/exports file?

     

    I'm having a hard time reproducing this bug.

    Link to comment

    Thanks for responding...

     

    --- /etc/exports ---

    # See exports(5) for a description.
    # This file contains a list of all directories exported to other computers.
    # It is used by rpc.nfsd and rpc.mountd.

    "/mnt/user/downloads" -async,no_subtree_check,fsid=116 10.10.10.0/24(sec=sys,rw,no_root_squash,insecure)
    "/mnt/user/home" -async,no_subtree_check,fsid=118 10.10.10.0/24(sec=sys,rw,no_root_squash,insecure)
    "/mnt/user/movies" -async,no_subtree_check,fsid=119 10.10.10.0/24(sec=sys,rw,no_root_squash,insecure)
    "/mnt/user/tv" -async,no_subtree_check,fsid=117 10.10.10.0/24(sec=sys,rw,no_root_squash,insecure)
    --- end of exports ---

     

    There are likely reads and writes occurring to/from the server.  I can't say the transfer sizes.

     

    The clients are as follows:

    /mnt/user/downloads     : radarr, sonarr, sabnzbd and plex

    /mnt/user/home             : plex

    /mnt/user/movies          : radarr, plex

    /mnt/user/tv                 : sonarr, plex

     

    Not shown in exports and not shared in any form (NFS, SMB, AFP) is /mnt/user/synology, which is an rsync target for Synology Hyper Backup.  According to my synology Hyper Backup logs the night backup had finished successfully about 2 minutes before the first timestamp on the kernel message.

     

    Also not shown in exports but shared via SMB is /mnt/user/sort, which would have no traffic at all at that time of day.

     

    On the clients the first indication of a problem is "03:48:07,970::ERROR::[misc:1634] Cannot change permissions of /downloads".  My servers sync time via NTP, so it's in the same timeframe although not exactly the same.

     

    Here's the fstab entry for the /mnt/user/downloads export, the same settings are used on all NFS clients.

    tower:/mnt/user/downloads /downloads  nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800,soft,_netdev 0 0

     

    Let me know if I can get you anything else, and thanks again.


    Al

     

     

     

    Link to comment

    I disabled the direct_io tunable on the global shares page earlier this morning after getting everything back in place, will report back any events.

    Link to comment

    my configuration is much simpler that ajeffco's in that I only have 2 nfs shares which are named, ironically:

     

    /mnt/user/movies          : plex

    /mnt/user/TV                : plex

     

    but, as you can see above, are only accessed by the plex server running on another host and it does not have permissions to delete any data, so it should almost mostly reads, but some minor bit of writing may be occurring as well.

     

    unraid:/mnt/user/movies /movies  nfs auto,nofail,noatime,nolock,intr,soft,_netdev,tcp,actimeo=1800 0 0
    unraid:/mnt/user/TV /tv  nfs auto,nofail,noatime,nolock,intr,tcp,soft,_netdev,actimeo=1800 0 0

     

    I thought I'd uploaded the diagnostic zip file as well using the GUI's bug report feature...but I may have gone about it incorrectly. I can re-up it if needed.

    Link to comment

    Please try this test.  Go to Settings/NFS and set the "Tunable (fuse_remember)" setting to 0.  This might cause your clients to crash and burn due to "stale file handle", but maybe not.  Would be interesting test.

    Link to comment
    2 minutes ago, macor64 said:

    my configuration is much simpler that ajeffco's in that I only have 2 nfs shares which are named, ironically: 

    yes diagnostics would be nice as well as seeing your /etc/exports file, which we omit from diagnostics.zip for some reason.  (That file only defined NFS exports.)

     

    Also try above test to set 'fuse_remember' to 0.

    Link to comment

    I will try that test after sending this:

     

    /etc/exports

     

    # See exports(5) for a description.
    # This file contains a list of all directories exported to other computers.
    # It is used by rpc.nfsd and rpc.mountd.

    "/mnt/user/TV" -async,no_subtree_check,fsid=103 *(sec=sys,rw,no_root_squash,insecure,no_subtree_check)
    "/mnt/user/movies" -async,no_subtree_check,fsid=102 *(sec=sys,rw,no_root_squash,insecure,no_subtree_check)

    unraid-2-diagnostics-20180925-0534.zip

    Link to comment

    One more thing to set up: boot the server in non-GUI mode, ie, Normal mode and have monitor hooked up (or use IPMI).  Then when crash happens see if any messages spit out there.

    Link to comment

    this has been covered by the other ticket regarding shares disappearing. 

    i was not able to post diagnostics as I had to revert back to 6.5.3

    over weekend I will upgrade again and set the tunable to 0 and post diagnostics if it happens again.

    My syslog looks exactly like the others posted.

    Link to comment
    On 9/26/2018 at 1:51 PM, limetech said:

    Please try this test.  Go to Settings/NFS and set the "Tunable (fuse_remember)" setting to 0.  This might cause your clients to crash and burn due to "stale file handle", but maybe not.  Would be interesting test.

    I came here because I've been having problems getting my backup completed all day. This is the first time I've had to reboot my server aside from upgrades (and the extended power outage like last weekend). I have set the fuse_remember to 0, and I will report back if I'm actually able to complete my backup.

     

    Here's the trace from my last crash:

    [ 2714.619041] ------------[ cut here ]------------
    [ 2714.619043] nfsd: non-standard errno: -103
    [ 2714.619077] WARNING: CPU: 1 PID: 10676 at fs/nfsd/nfsproc.c:817
    nfserrno+0x44/0x4a [nfsd]
    [ 2714.619078] Modules linked in: xt_nat xt_CHECKSUM iptable_mangle
    ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables vhost_net
    tun vhost tap veth ipt_MASQUERADE iptable_nat nf_conntrack_ipv4
    nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs nfsd
    lockd grace sunrpc md_mod i915 i2c_algo_bit iosf_mbi drm_kms_helper drm
    intel_gtt agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops it87
    hwmon_vid bonding e1000e r8169 mii x86_pkg_temp_thermal
    intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul
    crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64
    crypto_simd cryptd ahci libahci glue_helper intel_cstate intel_uncore
    i2c_i801 video intel_rapl_perf i2c_core backlight thermal button
    acpi_pad fan pcc_cpufreq [last unloaded: e1000e]
    [ 2714.619113] CPU: 1 PID: 10676 Comm: nfsd Not tainted 4.18.8-unRAID
    #1
    [ 2714.619114] Hardware name: Gigabyte Technology Co., Ltd. Z97-
    HD3P/Z97-HD3P, BIOS F2 09/17/2014
    [ 2714.619118] RIP: 0010:nfserrno+0x44/0x4a [nfsd]
    [ 2714.619118] Code: c0 48 83 f8 22 75 e2 80 3d b3 06 01 00 00 bb 00 00
    00 05 75 17 89 fe 48 c7 c7 3b 6a 42 a0 c6 05 9c 06 01 00 01 e8 8a 1c c3
    e0 <0f> 0b 89 d8 5b c3 48 83 ec 18 31 c9 ba ff 07 00 00 65 48 8b 04 25
    [ 2714.619140] RSP: 0018:ffffc90001d53dc0 EFLAGS: 00010282
    [ 2714.619142] RAX: 0000000000000000 RBX: 0000000005000000 RCX:
    0000000000000007
    [ 2714.619143] RDX: 0000000000000000 RSI: ffff88041fa56470 RDI:
    ffff88041fa56470
    [ 2714.619144] RBP: ffffc90001d53e10 R08: 0000000000000003 R09:
    ffffffff8220a800
    [ 2714.619144] R10: 00000000000003d4 R11: 0000000000012ddc R12:
    ffff88040908a008
    [ 2714.619145] R13: 000000008de30000 R14: ffff88040908a168 R15:
    0000000000000002
    [ 2714.619146] FS:  0000000000000000(0000) GS:ffff88041fa40000(0000)
    knlGS:0000000000000000
    [ 2714.619147] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2714.619148] CR2: 000000000094e690 CR3: 0000000001e0a006 CR4:
    00000000001626e0
    [ 2714.619148] Call Trace:
    [ 2714.619153]  nfsd_open+0x15e/0x17c [nfsd]
    [ 2714.619157]  nfsd_read+0x45/0xec [nfsd]
    [ 2714.619161]  nfsd3_proc_read+0x95/0xda [nfsd]
    [ 2714.619164]  nfsd_dispatch+0xb4/0x169 [nfsd]
    [ 2714.619170]  svc_process+0x4b5/0x666 [sunrpc]
    [ 2714.619173]  ? nfsd_destroy+0x48/0x48 [nfsd]
    [ 2714.619175]  nfsd+0xeb/0x142 [nfsd]
    [ 2714.619179]  kthread+0x10b/0x113
    [ 2714.619181]  ? kthread_flush_work_fn+0x9/0x9
    [ 2714.619183]  ret_from_fork+0x35/0x40
    [ 2714.619185] ---[ end trace 94c2c1298e7ff70a ]---

     

    Update: Nope, the fuse_remember set to 0 option didn't help at all

     

     

    Edited by Frank76
    Link to comment

    Good Morning,

     

    My unraid rig has survived the night without crashing!  I'll be watching it closely and will report back if anything happens.

     

    @Frank76 I also turned off the Tunable (enable Direct IO): hich had been enabled.

    Edited by ajeffco
    fixed name of parameter
    Link to comment
    On 9/28/2018 at 8:47 AM, ajeffco said:

    Good Morning,

     

    My unraid rig has survived the night without crashing!  I'll be watching it closely and will report back if anything happens.

     

    @Frank76 I also turned off the Tunable (enable Direct IO): hich had been enabled.

    Last night I disabled the Direct IO, and rebooted, and woke up this morning to the same issue. It seems like my backups are too IO intensive for NFS to handle. I'm getting very close to downgrading, but I have lost the ability to go back to 6.5.x (at least easily). In the mean time I will disable my backups. Hopefully that will prevent my server from crashing again.

    Link to comment
    57 minutes ago, Frank76 said:

    Last night I disabled the Direct IO, and rebooted, and woke up this morning to the same issue. It seems like my backups are too IO intensive for NFS to handle. I'm getting very close to downgrading, but I have lost the ability to go back to 6.5.x (at least easily). In the mean time I will disable my backups. Hopefully that will prevent my server from crashing again.

    If you need to downgrade the manual method will still work.    It is likely on;y the automatic downgrade via the GUI that is no longer possible.

    Link to comment

    Hello,

     

    Another night of no crash.

    root@tower:~# uptime
     11:28:15 up 1 day, 12:58,  1 user,  load average: 0.01, 0.00, 0.00

     

    @Frank76 Sorry to hear your still having problems.  I've run Synology backups manually and let them run by schedule and haven't had trouble since the two changes.  I've also run 2 macbook timemachine backups at the same time as a manual synology backup scan each day since the changes, and it hasn't crashed.  I want to say that for certain one of my crashes occurred when there was no I/O going to the unraid rig.

    Link to comment



    Join the conversation

    You can post now and register later. If you have an account, sign in now to post with your account.
    Note: Your post will require moderator approval before it will be visible.

    Guest
    Add a comment...

    ×   Pasted as rich text.   Restore formatting

      Only 75 emoji are allowed.

    ×   Your link has been automatically embedded.   Display as a link instead

    ×   Your previous content has been restored.   Clear editor

    ×   You cannot paste images directly. Upload or insert images from URL.


  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.