Issue with shares dropping/vanishing increasingly frequently (going stale)



I've been seeing this issue more and more lately and it's causing concern: I'll be in the middle of some transfers or other activity on the shares and then UNRAID drops them all (NFS and SMB both gone, no systems can access them). The files are still there on disk, and when I reboot UNRAID things appear to be working again.

It seems like it might have to do with a cache drive that keeps filling up too quickly (it's small), but I'm really not sure. Or maybe it's related to NFS issues, though SMB drops out as well.

I have the NFS share rules set up more or less like this:
IP(ro,async,root_squash,all_squash,no_subtree_check) IP2(rw,async,root_squash,all_squash,no_subtree_check)

Nothing crazy. Maybe I'm doing too much with the file systems over the network, but it's maybe 4-5 parallel actions at times... that doesn't seem excessive.
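For reference, this is roughly how those rules would look as a standard /etc/exports entry (the share path and IP addresses below are placeholders, not my real ones):

# /etc/exports (sketch, placeholder path and addresses)
"/mnt/user/share1" 192.168.1.10(ro,async,root_squash,all_squash,no_subtree_check) 192.168.1.20(rw,async,root_squash,all_squash,no_subtree_check)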

The other weird thing is that if I do ls -l on /mnt/user/downloads or /mnt/user/media, the permissions/owners are all over the place: different user names, sometimes just raw numeric IDs, sometimes nobody.

I've had to reboot UNRAID many times in the past week, where normally it would easily stay up for months, so either the issue is getting worse or it's related to using the system more heavily... which shouldn't break the system, right?

I can't even stay up a whole day lately. If it's not a complete share meltdown (all SMB/NFS shares dropped/gone stale), it's sometimes just NFS, and sometimes just specific systems using NFS. I've never seen an SMB-only drop so far.

2 hours ago, JorgeB said:

Not seeing anything relevant in the log, do you know the time it stopped working? Also post output of:

ls -la /mnt/user

Thanks for checking. It was within an hour or so before the original post. It's working now after reboots, possibly due to a relevant adjustment to NFS, but more importantly (I think) due to stricter control over the cache and monitoring it to make sure it doesn't fill up.

So I'm hoping to get some more definitive answers on how a full cache could negatively affect shares in this manner, and whether the NFS parameters could as well.

I changed the NFS params to:
IP(ro,sync,root_squash,all_squash,no_subtree_check) IP2(rw,sync,root_squash,all_squash,no_subtree_check)
async → sync (the description suggests async can sometimes cause issues, though I never seem to have problems unless the cache drive fills up)
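To double-check what the server is actually exporting after a change like this, something along these lines from the UNRAID console should show the active exports and the options actually in effect (sketch):

exportfs -v          # lists active exports with effective options (sync/async, squash settings, etc.)
cat /etc/exports     # what was written out for the shares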

 

:~# ls -la /mnt/user
total 0
drwxrwxrwx  1 nobody users   119 May 19 17:48 ./
drwxr-xr-x 19 root   root    380 May 19 16:50 ../
drwxrwxrwx  1 nobody users   752 May 18 16:53 appdata/
drwxrwxrwx  1 nobody users   102 May 17 16:09 backups/
drwxrwx---  1 nobody users   137 Dec 22 16:10 data/
drwxrwxrwx  1 nobody users    82 May 17 15:57 domains/
drwxrwxrwx  1 nobody users    24 May 18 06:44 share1/ (share in use when the dropoff occurred)
drwxrwxrwx  1 nobody users    48 May 17 15:35 isos/
drwxrwxrwx  1 nobody users    28 May 19 17:07 share2/ (share in use when the dropoff occurred)
drwxrwxrwx  1 nobody users    26 Sep 13  2021 system/

 

1 hour ago, Econaut said:

So I'm hoping to get some more definitive answers on how a full cache could negatively affect shares in this manner, and whether the NFS parameters could as well.

File system corruption can cause User Shares to misbehave, and btrfs file systems (as used on the cache pool) seem to be prone to corruption if the drive gets too full.
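If you want to check the pool before it gets to that point, something along these lines from the console should show how full the btrfs pool really is and whether it already has errors (assuming the pool is mounted at /mnt/cache):

btrfs filesystem usage /mnt/cache   # data vs. metadata allocation; metadata can run out even when free space looks fine
btrfs device stats /mnt/cache       # cumulative read/write/corruption error counters for the pool devices
btrfs scrub start -B /mnt/cache     # foreground scrub; verifies checksums and reports any corruption it finds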

1 hour ago, itimpi said:

File system corruption can cause User Shares to misbehave, and btrfs file systems (as used on the cache pool) seem to be prone to corruption if the drive gets too full.

That would explain it, thanks. I am indeed using btrfs. Does running the mover while the cache is still being used lead to such corruptions as well?

1 hour ago, itimpi said:

No.   Just make sure you have a sensible Minimum Free Space setting for the cache pool to protect it against overfilling.

So maybe 0 isn't very sensible then... I believe that was the default. Do I need to stop the array to change that value?
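In the meantime I've been keeping an eye on the pool from the console with something simple like this (assuming the pool is mounted at /mnt/cache):

df -h /mnt/cache                 # quick check of how full the cache pool is
watch -n 60 df -h /mnt/cache     # refresh the same check every minute during big transfers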


It lasted about a week and then it occurred again when I tried to delete a file through NFS. The delete failed, then all shares vanished when I tried the delete again.

I restarted UNRAID and they came back, and then I deleted the file through SMB, which worked OK.

NFS seems to be working better with IP(rw,all_squash,anonuid=99,anongid=100), but this issue still happened anyway... what can I do?
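For what it's worth, 99/100 should line up with UNRAID's default share owner; a quick sanity check from the console (the expected output is my assumption about a stock install):

id nobody    # expect something like uid=99(nobody) gid=100(users) on a stock UNRAID box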

 

@JorgeB @itimpi

 

  • 3 weeks later...

With some changes, things have been comparatively stable. However, for the second time, I deleted a file on an NFS share and this immediately caused UNRAID to drop all shares (they all seem to get corrupted in some form).

This does not look good and appears to be a major defect of some sort:
 

/mnt# ls -lh
/bin/ls: cannot access 'user': Transport endpoint is not connected
...
d????????? ? ?      ?       ?            ? user/
...


I have to reboot to get them restored.
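If it happens again, before rebooting I plan to check whether the user-share FUSE process (shfs) is still alive and whether /mnt/user is still mounted, with something like:

ps aux | grep [s]hfs        # is the user-share FUSE process still running?
mount | grep '/mnt/user'    # is the fuse.shfs mount still listed?
ls -ld /mnt/user            # "Transport endpoint is not connected" here means the FUSE mount died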

  • 2 weeks later...

I created a file on an NFS share from a Linux mount, then deleted the file, and all UNRAID shares crashed and effectively shut down. I have to restart UNRAID after that. The issue maybe stems from the permissions of an SMB user being forced into the file system as seen by the NFS mount point... even though I believe I've done all I can to force and normalize users/groups for the files, they just keep changing from nobody.

Sounds like a major defect that needs some attention... Is general support not the right place for this?


I'm seeing this in the log leading up to the problem:

May 30 21:04:31 REDACTED  shfs: share cache full
### [PREVIOUS LINE REPEATED 3 TIMES] ###
...
May 31 09:44:38 REDACTED kernel: shfs[10308]: segfault at 10 ip 000014648b6515c2 sp 000014648af45c20 error 4 in libfuse3.so.3.12.0[14648b64d000+19000]
May 31 09:44:38 REDACTED kernel: Code: f4 c8 ff ff 8b b3 08 01 00 00 85 f6 0f 85 46 01 00 00 4c 89 ee 48 89 df 45 31 ff e8 18 dc ff ff 4c 89 e7 45 31 e4 48 8b 40 20 <4c> 8b 68 10 e8 15 c2 ff ff 48 8d 4c 24 18 45 31 c0 31 d2 4c 89 ee
...
May 31 09:44:38 REDACTED kernel: ------------[ cut here ]------------
May 31 09:44:38 REDACTED kernel: nfsd: non-standard errno: -103
May 31 09:44:38 REDACTED kernel: WARNING: CPU: 15 PID: 9463 at fs/nfsd/nfsproc.c:889 nfserrno+0x45/0x51 [nfsd]
May 31 09:44:38 REDACTED kernel: Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha rpcsec_gss_krb5 cmac cifs asn1_decoder cifs_arc4 cifs_md4 dns_resolver xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod tcp_diag inet_diag it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc bonding tls amdgpu radeon gpu_sched drm_ttm_helper ttm btusb edac_mce_amd gigabyte_wmi wmi_bmof mxm_wmi edac_core kvm_amd btrtl kvm btbcm crct10dif_pclmul drm_display_helper crc32_pclmul crc32c_intel btintel ghash_clmulni_intel aesni_intel drm_kms_helper crypto_simd cryptd drm bluetooth
May 31 09:44:38 REDACTED kernel: igb rapl nvme agpgart i2c_piix4 i2c_algo_bit k10temp joydev ahci ecdh_generic nvme_core i2c_core ecc syscopyarea libahci ccp sysfillrect sysimgblt fb_sys_fops tpm_crb thermal tpm_tis tpm_tis_core video tpm backlight wmi button acpi_cpufreq unix
May 31 09:44:38 REDACTED kernel: CPU: 15 PID: 9463 Comm: nfsd Not tainted 5.19.17-Unraid #2
May 31 09:44:38 REDACTED kernel: Hardware name: Gigabyte Technology Co., Ltd. REDACTED_MODEL/REDACTED_MODEL, BIOS F37b 03/23/2023
May 31 09:44:38 REDACTED kernel: RIP: 0010:nfserrno+0x45/0x51 [nfsd]
May 31 09:44:38 REDACTED kernel: Code: c3 cc cc cc cc 48 ff c0 48 83 f8 26 75 e0 80 3d bb 47 05 00 00 75 15 48 c7 c7 17 64 f0 a0 c6 05 ab 47 05 00 01 e8 42 47 94 e0 <0f> 0b b8 00 00 00 05 c3 cc cc cc cc 48 83 ec 18 31 c9 ba ff 07 00
May 31 09:44:38 REDACTED kernel: RSP: 0018:ffffc90000b47b58 EFLAGS: 00010282
May 31 09:44:38 REDACTED kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
May 31 09:44:38 REDACTED kernel: RDX: 0000000000000001 RSI: ffffffff820d7be1 RDI: 00000000ffffffff
May 31 09:44:38 REDACTED kernel: RBP: ffffc90000b47db0 R08: 0000000000000000 R09: ffffffff828653f0
May 31 09:44:38 REDACTED kernel: R10: 00003fffffffffff R11: ffff88a05e2c4bae R12: 000000000000000c
May 31 09:44:38 REDACTED kernel: R13: 000000000010011a R14: ffff8881623011a0 R15: ffffffff82909480
May 31 09:44:38 REDACTED kernel: FS:  0000000000000000(0000) GS:ffff889fde3c0000(0000) knlGS:0000000000000000
May 31 09:44:38 REDACTED kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 31 09:44:38 REDACTED kernel: CR2: 00007faf350d6590 CR3: 0000000101500000 CR4: 0000000000750ee0
May 31 09:44:38 REDACTED kernel: PKRU: 55555554
May 31 09:44:38 REDACTED kernel: Call Trace:
May 31 09:44:38 REDACTED kernel: <TASK>
May 31 09:44:38 REDACTED kernel: nfsd4_encode_fattr+0x1372/0x13d9 [nfsd]
May 31 09:44:38 REDACTED kernel: ? getboottime64+0x20/0x2e
May 31 09:44:38 REDACTED kernel: ? kvmalloc_node+0x44/0xbc
May 31 09:44:38 REDACTED kernel: ? __kmalloc_node+0x1b4/0x1df
May 31 09:44:38 REDACTED kernel: ? kvmalloc_node+0x44/0xbc
May 31 09:44:38 REDACTED kernel: ? override_creds+0x21/0x34
May 31 09:44:38 REDACTED kernel: ? nfsd_setuser+0x185/0x1a5 [nfsd]
May 31 09:44:38 REDACTED kernel: ? nfsd_setuser_and_check_port+0x76/0xb4 [nfsd]
### [PREVIOUS LINE REPEATED 1 TIMES] ###
May 31 09:44:38 REDACTED kernel: nfsd4_encode_getattr+0x28/0x2e [nfsd]
May 31 09:44:38 REDACTED kernel: nfsd4_encode_operation+0xb0/0x201 [nfsd]
May 31 09:44:38 REDACTED kernel: nfsd4_proc_compound+0x2a7/0x56c [nfsd]
May 31 09:44:38 REDACTED kernel: nfsd_dispatch+0x1a9/0x262 [nfsd]
May 31 09:44:38 REDACTED kernel: svc_process+0x3f1/0x5d6 [sunrpc]
May 31 09:44:38 REDACTED kernel: ? nfsd_svc+0x2b6/0x2b6 [nfsd]
May 31 09:44:38 REDACTED kernel: ? nfsd_shutdown_threads+0x5b/0x5b [nfsd]
May 31 09:44:38 REDACTED kernel: nfsd+0xd5/0x155 [nfsd]
May 31 09:44:38 REDACTED kernel: kthread+0xe7/0xef
May 31 09:44:38 REDACTED kernel: ? kthread_complete_and_exit+0x1b/0x1b
May 31 09:44:38 REDACTED kernel: ret_from_fork+0x22/0x30
May 31 09:44:38 REDACTED kernel: </TASK>
May 31 09:44:38 REDACTED kernel: ---[ end trace 0000000000000000 ]---
...
May 31 09:44:40 REDACTED  smbd[14715]: [2023/05/31 09:44:40.111782,  0] ../../source3/smbd/files.c:1199(synthetic_pathref)
May 31 09:44:40 REDACTED  smbd[14715]:   synthetic_pathref: opening [.] failed
May 31 09:44:40 REDACTED  smbd[14714]: [2023/05/31 09:44:40.260325,  0] ../../source3/smbd/files.c:1199(synthetic_pathref)
May 31 09:44:40 REDACTED  smbd[14714]:   synthetic_pathref: opening [.] failed
...
May 31 09:44:40 REDACTED  smbd[14714]: [2023/05/31 09:44:40.355105,  0] ../../source3/smbd/smb2_service.c:168(chdir_current_service)
May 31 09:44:40 REDACTED  smbd[14714]:   chdir_current_service: vfs_ChDir(/mntREDACTED) failed: Transport endpoint is not connected. Current token: uid=1001, gid=100, 4 groups: 100 3003 3004 3005
May 31 09:44:40 REDACTED  smbd[14714]: [2023/05/31 09:44:40.355148,  0] ../../source3/smbd/smb1_process.c:1159(switch_message)
May 31 09:44:40 REDACTED  smbd[14714]:   Error: Could not change to user. Removing deferred open, mid=30024.
May 31 09:44:40 REDACTED  smbd[14714]: [2023/05/31 09:44:40.355859,  0] ../../source3/smbd/smb2_service.c:168(chdir_current_service)
May 31 09:44:40 REDACTED  smbd[14714]:   chdir_current_service: vfs_ChDir(/mntREDACTED) failed: Transport endpoint is not connected. Current token: uid=1001, gid=100, 4 groups: 100 3003 3004 3005

I can see that after all this, NFS and SMB both are going south.
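For reference, this is roughly how I've been pulling the relevant lines out of the syslog around the time of a drop (path is the standard UNRAID syslog location):

grep -E 'shfs|segfault|Transport endpoint|share cache full|nfsd' /var/log/syslog | less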

 

Here are some suggestions:

  • Don't use a cache on your shares for the time being.
  • Your disks are all very full.  I'd suggest you verify the 'Allocation Method' and the Minimum Free Space make sense for what you are doing.
  • Can you remove some content from the array, or add more storage?  Many of your disks are over 90% full.
  • I am not an expert on NFS, but try these rules 'IP(rw,sec=sys,insecure,anongid=100,anonuid=99,no_root_squash)'.  These are what I recommend for UD devices.

@dlandon Thanks for looking at it. I did recently upgrade the cache to be much larger and it's no longer getting anywhere close to full.

The disks do remain very full, and I plan to get more, but it is what it is for now... Allocation method is High-Water for all of them. Even with mostly full disks, a delete action shouldn't cause a complete system share meltdown. The Minimum Free Space settings are all about 100GB or more, which should all make sense.

Good suggestions for the NFS exports. I'm no expert either, but since UNRAID limits the number of characters for the rules, I've had to keep them short. Thankfully sec=sys should be the default and unnecessary to add. I had stale file handle issues and am not sure what helped, but removing insecure was one of the things I did. Is it a problem for UNRAID to have NFS mounts/shares only use the 'secure' port(s)?

The clients are mounting the nfs shares with fstab options: rw,nosuid,noexec,hard,intr
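The full fstab entries look roughly like this (server address and paths are placeholders; as I understand it, intr has been a no-op on modern kernels, so it's mostly harmless leftover):

# /etc/fstab on the clients (sketch)
192.168.1.5:/mnt/user/share1  /mnt/share1  nfs  rw,nosuid,noexec,hard,intr  0  0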

all_squash vs no_root_squash and all that confuses the hell out of me, but my choice of all_squash to help normalize UIDs shouldn't prevent deletes from working or cause them to corrupt the UNRAID shares.
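As I understand the squash options (so take this as my reading of the exports man page rather than gospel):

# root_squash     -> only root on the client is mapped to the anonymous uid/gid (the default)
# no_root_squash  -> root on the client stays root on the server
# all_squash      -> every client user is mapped to the anonymous uid/gid (anonuid/anongid)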

I got another hard system share meltdown... I sat there wondering if things were stable enough to attempt a delete, started the delete, got the popup asking if I was sure, and let it sit there because I was not sure. And there you have it: a complete system share meltdown without the delete even being issued. I didn't proceed with the delete, but whatever NFS did prior to that prompt is presumably what caused the issue, probably in each of the three times this has occurred so far.

15 minutes ago, Econaut said:

The clients are mounting the nfs shares with fstab options: rw,nosuid,noexec,hard,intr

Be sure the clients are using NFSv4.  NFSv3 has issues when a share is using the cache and files are moved by the mover from the cache to the share disk(s).
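A rough way to pin and verify that from the client side (server address and paths are placeholders; the exact version string depends on the distro):

# in the client's fstab, force v4.x explicitly (sketch)
192.168.1.5:/mnt/user/share1  /mnt/share1  nfs  rw,nosuid,noexec,hard,nfsvers=4.2  0  0

# then confirm the negotiated version on the client
nfsstat -m          # shows per-mount options including vers=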


If possible I'd like to know where in the logs to look to see the crash event from this delete action (or pre-delete action).

This was another:
 

Jun 24 06:11:33 REDACTED  smbd[12682]: [2023/06/24 06:11:33.404989,  0] ../../source3/smbd/close.c:1485(close_directory)
Jun 24 06:11:33 REDACTED  smbd[12682]:   Could not close dir! fname=REDACTED_FOLDER_PATH, fd=37, err=116=Stale file handle
Jun 24 06:11:33 REDACTED  smbd[12682]: [2023/06/24 06:11:33.407654,  0] ../../source3/smbd/close.c:1485(close_directory)
Jun 24 06:11:33 REDACTED  smbd[12682]:   Could not close dir! fname=REDACTED_FOLDER_PATH, fd=38, err=116=Stale file handle
Jun 24 06:15:52 REDACTED kernel: ------------[ cut here ]------------
Jun 24 06:15:52 REDACTED kernel: nfsd: non-standard errno: -38
Jun 24 06:15:52 REDACTED kernel: WARNING: CPU: 10 PID: 9462 at fs/nfsd/nfsproc.c:889 nfserrno+0x45/0x51 [nfsd]
Jun 24 06:15:52 REDACTED kernel: Modules linked in: rpcsec_gss_krb5 cmac cifs asn1_decoder cifs_arc4 cifs_md4 dns_resolver xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net vhost vhost_iotlb tap wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha tun xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod tcp_diag inet_diag it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc bonding tls amdgpu radeon gpu_sched drm_ttm_helper ttm btusb edac_mce_amd edac_core gigabyte_wmi wmi_bmof mxm_wmi kvm_amd kvm btrtl crct10dif_pclmul drm_display_helper crc32_pclmul crc32c_intel btbcm ghash_clmulni_intel aesni_intel drm_kms_helper crypto_simd btintel cryptd rapl bluetooth
Jun 24 06:15:52 REDACTED kernel: drm igb i2c_piix4 nvme agpgart k10temp i2c_algo_bit ecdh_generic joydev ecc ahci nvme_core i2c_core libahci ccp syscopyarea sysfillrect sysimgblt fb_sys_fops tpm_crb tpm_tis thermal tpm_tis_core video tpm wmi backlight button acpi_cpufreq unix
Jun 24 06:15:52 REDACTED kernel: CPU: 10 PID: 9462 Comm: nfsd Not tainted 5.19.17-Unraid #2
Jun 24 06:15:52 REDACTED kernel: Hardware name: MOTHERBOARD INFO
Jun 24 06:15:52 REDACTED kernel: RIP: 0010:nfserrno+0x45/0x51 [nfsd]
Jun 24 06:15:52 REDACTED kernel: Code: c3 cc cc cc cc 48 ff c0 48 83 f8 26 75 e0 80 3d bb 47 05 00 00 75 15 48 c7 c7 17 24 f0 a0 c6 05 ab 47 05 00 01 e8 42 87 94 e0 <0f> 0b b8 00 00 00 05 c3 cc cc cc cc 48 83 ec 18 31 c9 ba ff 07 00
Jun 24 06:15:52 REDACTED kernel: RSP: 0018:ffffc90002017dc0 EFLAGS: 00010282
Jun 24 06:15:52 REDACTED kernel: RAX: 0000000000000000 RBX: ffff888162b681a0 RCX: 0000000000000027
Jun 24 06:15:52 REDACTED kernel: RDX: 0000000000000001 RSI: ffffffff820d7be1 RDI: 00000000ffffffff
Jun 24 06:15:52 REDACTED kernel: RBP: ffff888162b68030 R08: 0000000000000000 R09: ffffffff828653f0
Jun 24 06:15:52 REDACTED kernel: R10: 00003fffffffffff R11: ffff88a05e2bdcdd R12: ffff8882f77de6c0
Jun 24 06:15:52 REDACTED kernel: R13: ffff8881043f4000 R14: 00000000ffffffda R15: ffff8892bd198340
Jun 24 06:15:52 REDACTED kernel: FS:  0000000000000000(0000) GS:ffff889fde280000(0000) knlGS:0000000000000000
Jun 24 06:15:52 REDACTED kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 06:15:52 REDACTED kernel: CR2: 000000a98e47efb4 CR3: 0000000160f4c000 CR4: 0000000000750ee0
Jun 24 06:15:52 REDACTED kernel: PKRU: 55555554
Jun 24 06:15:52 REDACTED kernel: Call Trace:
Jun 24 06:15:52 REDACTED kernel: <TASK>
Jun 24 06:15:52 REDACTED kernel: nfsd_link+0x15f/0x1a5 [nfsd]
Jun 24 06:15:52 REDACTED kernel: nfsd4_link+0x20/0x3e [nfsd]
Jun 24 06:15:52 REDACTED kernel: nfsd4_proc_compound+0x437/0x56c [nfsd]
Jun 24 06:15:52 REDACTED kernel: nfsd_dispatch+0x1a9/0x262 [nfsd]
Jun 24 06:15:52 REDACTED kernel: svc_process+0x3f1/0x5d6 [sunrpc]
Jun 24 06:15:52 REDACTED kernel: ? nfsd_svc+0x2b6/0x2b6 [nfsd]
Jun 24 06:15:52 REDACTED kernel: ? nfsd_shutdown_threads+0x5b/0x5b [nfsd]
Jun 24 06:15:52 REDACTED kernel: nfsd+0xd5/0x155 [nfsd]
Jun 24 06:15:52 REDACTED kernel: kthread+0xe7/0xef
Jun 24 06:15:52 REDACTED kernel: ? kthread_complete_and_exit+0x1b/0x1b
Jun 24 06:15:52 REDACTED kernel: ret_from_fork+0x22/0x30
Jun 24 06:15:52 REDACTED kernel: </TASK>
Jun 24 06:15:52 REDACTED kernel: ---[ end trace 0000000000000000 ]---

 

13 minutes ago, dlandon said:

Be sure the clients are using NFSv4.  NFSv3 has issues when a share is using the cache and files are moved by the mover from the cache to the share disk(s).

They are updated Ubuntu systems running v4 (checked with nfsstat -s and nfsstat -c).

Some extra info:
UNRAID is NFS server

Ubuntu bare metal is one client (I haven't actually tested deleting from there; it hasn't caused a crash to my knowledge).
An Ubuntu VM on UNRAID is another client (this seems to be where I see the issues the most, with deletes and so on).

I use UD SMB mounts (no NFS for UD)


The bare-metal system was seeing stale file handle issues (all NFS clients were), but that issue has passed now.

