  • shfs hangs (6.11 / 6.12) - possibly NFS-related?


    robertklep
    • Closed Minor

    (apologies for the long post)

     

    I'm trying to track down an issue where shfs randomly hangs every few days, causing my whole server to become unresponsive. This happened on 6.11.5 and now also happens on 6.12.6.

     

    Hardware:

    • Supermicro X11SCL-IF motherboard, latest BIOS
    • 2x Kingston KSM26ES8/8HD DDR4 Single-bit ECC RAM modules (tested)
    • 2x Seagate Firecuda 520 1TB SSD (pool)
    • 1x Seagate EXOS X16 16TB (parity)
    • 1x Seagate EXOS X16 16TB (data)

     

    The symptom is that shfs stops working for some reason, causing processes that access the disks to get stuck in the D state (uninterruptible sleep). Processes that don't access disk continue to work. I had `htop` running this time; note the `postscreen` process (running inside a Docker-Mailserver container) that is stuck:

    image.thumb.png.5ddcd1ec6f2b6cb586178edb99fb1773.png
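For anyone without `htop` open when the hang occurs, the same information can be read from `/proc`. The following is a minimal sketch, assuming a Linux `/proc` layout; the function names `parse_stat_line` and `blocked_processes` are illustrative and not part of Unraid or any tool mentioned here. It lists every process whose state field is `D`.

```python
from pathlib import Path


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # no procfs here (e.g. not running on Linux)
    blocked = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            pid, comm, state = parse_stat_line((entry / "stat").read_text())
        except (OSError, ValueError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

Run periodically (for example from cron, with output sent to a remote syslog), this would record which processes were blocked first, which may help narrow down what triggers the hang.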

     

    At this point I can try to initiate a shutdown, but it fails because the filesystem is stuck, and I have to hard power off the server.

     

    This time though, I killed the main shfs process (PID 7003), which unstuck the server and the shutdown succeeded.

     

    Things I've tried/done:

    • disabled C states in BIOS
    • configured Docker to use ipvlan instead of macvlan
    • changed mover schedule from daily to weekly (this temporarily seemed to fix the problem: I got an uptime of 3 months, which I'd never gotten before)
    • disabled I/O-intensive Docker containers (Duplicacy backups)
    • disabled I/O-intensive external tasks (Time Machine backups)

     

    Nothing has worked so far.

     

    Because I managed to unstick the disks, Unraid was able (for the first time) to dump a bunch of stuff to my syslog server. It's mostly a long list of "Transport endpoint is not connected" errors, but there is also this stack trace:

    nfsd: non-standard errno: -103
    WARNING: CPU: 2 PID: 5015 at fs/nfsd/nfsproc.c:909 nfserrno+0x45/0x51 [nfsd]
    Modules linked in: tcp_diag inet_diag bluetooth ecdh_generic ecc tls xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net vhost vhost_iotlb xt_comment xt_connmark xt_mark nft_compat nf_tables wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap af_packet 8021q garp mrp bridge stp llc igb intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
    ghash_clmulni_intel sha512_ssse3 sha256_ssse3 ipmi_ssif sha1_ssse3 wmi_bmof ast drm_vram_helper drm_ttm_helper ttm aesni_intel crypto_simd cryptd drm_kms_helper rapl intel_cstate drm intel_uncore i2c_i801 i2c_algo_bit mei_me agpgart ahci syscopyarea sysfillrect sysimgblt i2c_smbus fb_sys_fops i2c_core libahci mei nvme cp210x pl2303 input_leds joydev led_class usbserial acpi_ipmi intel_pch_thermal nvme_core thermal fan video wmi ipmi_si backlight intel_pmc_core acpi_tad button unix [last unloaded: igb]
    CPU: 2 PID: 5015 Comm: nfsd Tainted: P           O       6.1.64-Unraid #1
    Hardware name: Supermicro Super Server/X11SCL-IF, BIOS 2.2 10/27/2023
    RIP: 0010:nfserrno+0x45/0x51 [nfsd]
    Code: c3 cc cc cc cc 48 ff c0 48 83 f8 26 75 e0 80 3d dd c9 05 00 00 75 15 48 c7 c7 b5 c2 d9 a0 c6 05 cd c9 05 00 01 e8 01 39 30 e0 <0f> 0b b8 00 00 00 05 c3 cc cc cc cc 48 83 ec 18 31 c9 ba ff 07 00
    RSP: 0000:ffffc9000155fde8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
    RDX: 0000000000000002 RSI: ffffffff820d7e01 RDI: 00000000ffffffff
    RBP: ffff88814e140180 R08: 0000000000000000 R09: ffffffff82245f10
    R10: 00007fffffffffff R11: ffffffff82969256 R12: 0000000000000001
    R13: 0000000000000000 R14: ffff88814f6dc0c0 R15: ffffffffa0dbf6c0
    FS:  0000000000000000(0000) GS:ffff88845ed00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000014a2aa5dbbd3 CR3: 0000000241028002 CR4: 00000000003706e0
    Call Trace:
    <TASK>
    ? __warn+0xab/0x122
    ? report_bug+0x109/0x17e
    ? nfserrno+0x45/0x51 [nfsd]
    ? handle_bug+0x41/0x6f
    ? exc_invalid_op+0x13/0x60
    ? asm_exc_invalid_op+0x16/0x20
    ? nfserrno+0x45/0x51 [nfsd]
    ? nfserrno+0x45/0x51 [nfsd]
    nfsd_access+0xac/0xf1 [nfsd]
    nfsd3_proc_access+0x78/0x88 [nfsd]
    nfsd_dispatch+0x1a6/0x262 [nfsd]
    svc_process_common+0x32f/0x4df [sunrpc]
    ? ktime_get+0x35/0x49
    ? nfsd_svc+0x2b6/0x2b6 [nfsd]
    ? nfsd_shutdown_threads+0x5b/0x5b [nfsd]
    svc_process+0xc7/0xe4 [sunrpc]
    nfsd+0xd5/0x155 [nfsd]
    kthread+0xe4/0xef
    ? kthread_complete_and_exit+0x1b/0x1b
    ret_from_fork+0x1f/0x30
    </TASK>

     

    I don't know if this is a cause or a symptom.
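As a reading aid for the trace above, and not as a diagnosis: on Linux, errno 103 is `ECONNABORTED`, while the "Transport endpoint is not connected" message corresponds to `ENOTCONN` (107). The "non-standard errno" warning appears to mean only that nfsd received an error code from the underlying filesystem that it has no NFS status mapping for. The sketch below decodes the two values; the table is hardcoded from the Linux generic errno list so that it gives the same answer on any platform, and `describe_errno` is a made-up helper name.

```python
# Values from the Linux generic errno table (asm-generic/errno.h).
# Hardcoded on purpose: errno numbering differs between operating systems.
LINUX_ERRNO = {
    103: ("ECONNABORTED", "Software caused connection abort"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def describe_errno(value):
    """Map a (possibly negative) kernel errno to (name, message)."""
    # Kernel code returns errors as negative numbers, hence abs().
    return LINUX_ERRNO.get(abs(value), ("UNKNOWN", "not in this table"))


if __name__ == "__main__":
    print(describe_errno(-103))
```

Both codes describe a connection that has gone away, so the warning may well be a consequence of shfs having stopped rather than its cause, but the trace alone does not settle that.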

     

    The server was now also able to create a diagnostics log.

     

    unraid-diagnostics-20231215-0947.zip




    User Feedback

    Recommended Comments

    Okay, I should have searched first: there are plenty of threads that relate to this issue.

     

    I've disabled NFS for now (although I've also read that SMB causes the same issue for some users) and hope this gets solved at some point (if I'm still running Unraid at that time, because it doesn't inspire confidence).


    Changed Status to Closed

     

    Solution: lost confidence and moved away from Unraid entirely. Ubuntu on the same hardware now has an uptime of 7 days (that is, since I installed it), which is already more than the average I got with Unraid over the last couple of months.

    Edited by robertklep




