robertklep

Everything posted by robertklep

  1. Switching to Samba and disabling NFS didn't fix the issue for me anyway, so it's not exclusive to using NFS. I moved away from Unraid to plain Ubuntu months ago, on the same hardware, and never had any issues since (uptimes of many months 🥳).
  2. The issue for me was triggered by actively using NFS. I disabled NFS and moved to SMB but that didn't fix it.
  3. So to clarify what actually happened: `shfs` deadlocked for some reason. I was running htop (see the screenshot in my post), which showed a random postfix process hanging in a D state. On Dec 15 at 09:46:09 I initiated a shutdown, which got stuck because `/mnt/user` was borked. I then kill -9'd the shfs processes I could see in the htop output ("voluntary" signals didn't work, another clue that the process was stuck in some unforeseen state), causing a ton of "Transport endpoint is not connected" errors and very likely also the mariadb error that you mention (I only just noticed that the nfsd error I initially thought could be the cause of the hang actually came after I killed shfs, so that's just caused by `/mnt/user` disappearing). Killing shfs unstuck disk I/O enough for Unraid to actually write diagnostics and finish the shutdown.
     Note that up until the point where I started the shutdown, nothing was being logged. Between 08:00 (when Kodi NFS-mounted a share to do a library scan) and 09:46:09, something happened that caused shfs to deadlock. To debug this issue over the last few months I've tried everything already: no additional packages, disabling NFS entirely, etc. Nothing fixed it. I've also had this problem from the moment I started using Unraid, so with a very fresh and basic system. Also, there aren't a lot of NFS mounts (probably about 5 at most); you see a lot of mount requests, but that's because Kodi manages NFS sources on demand, closing them when it's done with them (which you don't see being logged).
     Since I got fed up with this instability I moved from Unraid to Ubuntu which, with the exact same workload in terms of NFS, Docker, etc., now has an uptime of 3 weeks, which with Unraid would have been "an exceptional run". I'm perfectly willing to accept that something in the way I use my server doesn't fit with Unraid, but on the other hand, judging by this thread, I'm not the only person running into these issues. I'm not willing to dumb down my server to fit the OS, so instead I changed OS.
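     For reference, a rough sketch of how the stuck processes can be spotted and shfs force-killed as a last resort (plain Linux tooling, nothing Unraid-specific; shown only as an illustration, not exactly what was run here):

     ```sh
     # List tasks in uninterruptible sleep (state "D") and what they are waiting on.
     ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'

     # Find the shfs processes backing /mnt/user.
     pgrep -af shfs

     # Last resort: SIGKILL them. This tears down /mnt/user, so only do it
     # when you are about to reboot anyway.
     pkill -9 -x shfs
     ```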
  4. They're attached to the post: https://forums.unraid.net/applications/core/interface/file/attachment.php?id=284784&key=276c1e0db7ed3ce86c8547d0c185730b
     Fair enough, but in all my time using Docker on various systems I've never come across a filesystem driver hanging itself up over it.
  5. Posted those about a month ago, then found this thread about what I believe to be the same issue (or at least related) that started back in 2018. As for the Docker documentation, that's just general information; at least, I don't see anything specific about what you shouldn't do to prevent `shfs` from getting into trouble.
  6. That's basically what I did, but I was never able to find a specific plugin or container that caused the problem. For me, the best way to trigger the issue was using NFS, but even with NFS completely turned off and using SMB it happened, which for me invalidated the use of Unraid as a NAS. Up until now, nobody has been able to pinpoint the exact cause of this problem, though. If a misconfiguration of a Docker container can cause `shfs` to deadlock, I would at least expect some documentation on how to prevent this.
  7. For me, it was clearly an Unraid problem.
  8. Which for me was another reason to move away from Unraid. This is the magic sauce that Unraid is built on, and it's more or less abandonware. I started using Unraid in October 2022, when 6.11 was just released. The issues I had were present already, although they didn't occur as often as later on (about every two weeks or so). Since I suspected it might have been a hardware issue at first, it took me quite some time to realise it wasn't. I held off upgrading to 6.12 (from 6.11.5 which I was running) for quite a while until I was fed up with the constant hangs, so I upgraded to 6.12.6 hoping the issue would be fixed, but it actually made matters worse. I then downgraded back to 6.11.5 (which indeed was very easy!) only to find that by doing so, the community apps plugin was broken because I had mistakenly upgraded it on 6.12.6 and the minimum Unraid version that the installed plugin supported was 6.12.0. Which is when I decided to migrate away from Unraid. It's great when it works, but during this entire ordeal I found so many implementation details that puzzle me (`in_use` a shell script? really!?) that I'm happy to be running something else now.
  9. Given that I was the one moving to Ubuntu: the root cause for me was an issue in Unraid, specifically in `shfs`. I can't fix it because it's closed source, but since the issue has existed for quite some time now (this thread started back in 2018), apparently nobody else can fix it either. I'm not inclined to keep trying to find the specific error, or the specific conditions that trigger it, because that's simply not my job and the days where I liked to spend my free time on something like this are long gone. This is commercial server software and if it can't handle keeping my server up for more than a few days at a time, it's not suitable for me. So the next best thing for me was moving to something else, which became Ubuntu because I've been using it since it was first released, and Debian before that. I've never had issues with updates, but I'm also not someone that runs automatic updates or .0 releases. With the next big update I guess I'll see whether my boot/root ZFS snapshotting will come in useful if the update gets botched 😅
  10. Do you use NFS a lot? I've had a lot of issues with Unraid freezing up with the symptoms you describe (server is still up, but unresponsive and login attempts freeze), which seemed to be related to NFS in my case. The ultimate cause was shfs deadlocking itself for whatever reason, causing most disk I/O to get stuck until it was killed (or the server was power cycled, since it couldn't perform a proper shutdown). I had the issue on 6.11.5 but it became much worse when I upgraded to 6.12.6 to see if that would fix my problems. It looks like my problem is related to a long-standing issue.
  11. Moved from Unraid to plain Ubuntu, uptime is now already more than the average that Unraid managed on the same hardware the last couple of months.
  12. Changed status to Closed. Solution: lost confidence and moved away from Unraid entirely. Ubuntu on the same hardware now has an uptime of 7 days (that's when I installed it), which is already more than the average I got with Unraid over the last couple of months.
  13. After yet another hang I am planning my move away from Unraid to either TrueNAS SCALE or Proxmox. I've enjoyed using Unraid due to its simplicity, but in the year that I've used it it was never stable for me and now I'm done with it. My hardware setup doesn't really benefit a whole lot from Unraid's mixing-and-matching-drives feature, while that same feature is causing a lot of issues for me.
  14. That didn't fix anything, I just got another hang. That's three times in three days now. I'll revert back to 6.11.5 where I at least managed to get an uptime of a few weeks (once).
  15. I posted a large report earlier today which seems to touch on this issue. In my case it's most likely related to NFS, and it occurs both on 6.11 and 6.12. I can't disable hard link support easily as this causes issues with Docker Mailserver, but I have disabled NFS for the time being and hope that my server will be able to stay up for more than a few days. However, given that this issue might be related to Unraid's "magic" (shfs) and libfuse, and because it has existed for such a long time already, I'm going to have to assume it's unfixable. At the moment I'm still suffering from sunk-cost-fallacy syndrome, but once I'm past that I guess I'll start looking at alternatives 😢
  16. Okay, I should have searched first, there are plenty of threads that relate to this issue. I've disabled NFS for now (although I've also read that SMB causes the same issue for some users) and hope this gets solved at some point (if I'm still running Unraid at that time, because it doesn't inspire confidence).
  17. (apologies for the long post) I'm trying to track down an issue where shfs randomly hangs every few days, causing my whole server to become unresponsive. This happened on 6.11.5 and now also on 6.12.6.
     Hardware:
     • Supermicro X11SCL-IF motherboard, latest BIOS
     • 2x Kingston KSM26ES8/8HD DDR4 single-bit ECC RAM modules (tested)
     • 2x Seagate FireCuda 520 1TB SSD (pool)
     • 1x Seagate EXOS X16 16TB (parity)
     • 1x Seagate EXOS X16 16TB (data)
     The symptoms are that shfs stops working for some reason, causing processes to get stuck in a D state. Processes that don't access disk continue to work. I had an `htop` running this time; notice the `postscreen` process (running inside a Docker Mailserver container) that is stuck (see the attached screenshot). At this point I can try to initiate a shutdown, but it will fail because the filesystem is stuck and I have to hard-power-down the server. This time, though, I killed the main shfs process (PID 7003), which unstuck the server and the shutdown succeeded.
     Things I've tried/done:
     • disabled C-states in the BIOS
     • configured Docker to use ipvlan instead of macvlan
     • changed the mover schedule from daily to weekly (this temporarily seemed to fix the problem; I was able to get an uptime of 3 months, which I'd never gotten before)
     • disabled I/O-intensive Docker containers (Duplicacy backups)
     • disabled I/O-intensive external tasks (Time Machine backups)
     Nothing has worked so far. Because I managed to unstuck the disks, Unraid was able (for the first time) to dump a bunch of stuff to my syslog server. It's mostly a long list of "Transport endpoint is not connected" errors, but also this stack trace:

        nfsd: non-standard errno: -103
        WARNING: CPU: 2 PID: 5015 at fs/nfsd/nfsproc.c:909 nfserrno+0x45/0x51 [nfsd]
        Modules linked in: tcp_diag inet_diag bluetooth ecdh_generic ecc tls xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net vhost vhost_iotlb xt_comment xt_connmark xt_mark nft_compat nf_tables wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap af_packet 8021q garp mrp bridge stp llc igb intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 ipmi_ssif sha1_ssse3 wmi_bmof ast drm_vram_helper drm_ttm_helper ttm aesni_intel crypto_simd cryptd drm_kms_helper rapl intel_cstate drm intel_uncore i2c_i801 i2c_algo_bit mei_me agpgart ahci syscopyarea sysfillrect sysimgblt i2c_smbus fb_sys_fops i2c_core libahci mei nvme cp210x pl2303 input_leds joydev led_class usbserial acpi_ipmi intel_pch_thermal nvme_core thermal fan video wmi ipmi_si backlight intel_pmc_core acpi_tad button unix [last unloaded: igb]
        CPU: 2 PID: 5015 Comm: nfsd Tainted: P O 6.1.64-Unraid #1
        Hardware name: Supermicro Super Server/X11SCL-IF, BIOS 2.2 10/27/2023
        RIP: 0010:nfserrno+0x45/0x51 [nfsd]
        Code: c3 cc cc cc cc 48 ff c0 48 83 f8 26 75 e0 80 3d dd c9 05 00 00 75 15 48 c7 c7 b5 c2 d9 a0 c6 05 cd c9 05 00 01 e8 01 39 30 e0 <0f> 0b b8 00 00 00 05 c3 cc cc cc cc 48 83 ec 18 31 c9 ba ff 07 00
        RSP: 0000:ffffc9000155fde8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
        RDX: 0000000000000002 RSI: ffffffff820d7e01 RDI: 00000000ffffffff
        RBP: ffff88814e140180 R08: 0000000000000000 R09: ffffffff82245f10
        R10: 00007fffffffffff R11: ffffffff82969256 R12: 0000000000000001
        R13: 0000000000000000 R14: ffff88814f6dc0c0 R15: ffffffffa0dbf6c0
        FS:  0000000000000000(0000) GS:ffff88845ed00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000014a2aa5dbbd3 CR3: 0000000241028002 CR4: 00000000003706e0
        Call Trace:
         <TASK>
         ? __warn+0xab/0x122
         ? report_bug+0x109/0x17e
         ? nfserrno+0x45/0x51 [nfsd]
         ? handle_bug+0x41/0x6f
         ? exc_invalid_op+0x13/0x60
         ? asm_exc_invalid_op+0x16/0x20
         ? nfserrno+0x45/0x51 [nfsd]
         ? nfserrno+0x45/0x51 [nfsd]
         nfsd_access+0xac/0xf1 [nfsd]
         nfsd3_proc_access+0x78/0x88 [nfsd]
         nfsd_dispatch+0x1a6/0x262 [nfsd]
         svc_process_common+0x32f/0x4df [sunrpc]
         ? ktime_get+0x35/0x49
         ? nfsd_svc+0x2b6/0x2b6 [nfsd]
         ? nfsd_shutdown_threads+0x5b/0x5b [nfsd]
         svc_process+0xc7/0xe4 [sunrpc]
         nfsd+0xd5/0x155 [nfsd]
         kthread+0xe4/0xef
         ? kthread_complete_and_exit+0x1b/0x1b
         ret_from_fork+0x1f/0x30
         </TASK>

     I don't know if this is a cause or a symptom. The server was now also able to create a diagnostics log: unraid-diagnostics-20231215-0947.zip
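     For reference, a rough sketch of what can be captured before hard-rebooting to see where stuck tasks are blocked (standard Linux tooling, run as root; shown only as an illustration):

     ```sh
     # Kernel stack of a single stuck task; set pid to the PID of a D-state process.
     pid=7003
     cat /proc/$pid/stack

     # Or dump the stacks of all blocked tasks to the kernel log
     # (requires the sysrq interface to be enabled).
     echo w > /proc/sysrq-trigger
     dmesg | tail -n 200
     ```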
  18. FWIW, I was having issues with freezes on 6.11.5 with ssh and console logins working but getting stuck immediately (never getting a shell prompt). The only way to "fix" it was to do a hard reboot, since any file I/O got stuck. I'm now fairly sure this was caused by a combination of a Docker container performing relatively intensive I/O (Duplicacy) and mover running at the same time. My guess is that this triggers a bug in shfs which causes it to hang, basically blocking all filesystem I/O (which explains why I could still log in, since credentials are likely cached in RAM, but then get stuck). I rescheduled mover to run once a week (it was running daily, which isn't really necessary in my setup anyway) and made sure that it doesn't overlap with any I/O-heavy Docker tasks, like running backups. So far my server has been up for 71 days, whereas before I was happy to have it last more than 2 weeks.
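     A minimal sketch of that last idea, e.g. wrapped around the backup job (the `mover` process name and the `duplicacy` container name are assumptions; adjust to your setup):

     ```sh
     #!/bin/bash
     # Wait until mover is no longer running before starting the I/O-heavy backup.
     while pgrep -x mover >/dev/null; do
         sleep 300   # check again in 5 minutes
     done
     docker start duplicacy   # hypothetical container name
     ```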
  19. You have a leading space in the path to `/mnt/cache/snaps` 😅
  20. No problem, I can run a user script for that 😊
  21. Good to know (although in my case the correct cache settings were kept) 👍🏻 One other question: there's no option to run a script after a scheduled snapshot has been created, is there? Ideally I'd like to create a symlink that points to the latest snapshot.
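     A minimal sketch of what such a user script could look like, assuming snapshots end up as timestamped directories under `/mnt/nvme-cache/snaps` (paths and naming are only examples):

     ```sh
     #!/bin/bash
     # Point a "latest" symlink at the newest snapshot (example layout).
     snapdir=/mnt/nvme-cache/snaps
     newest=$(ls -1d "$snapdir"/appdata_* 2>/dev/null | sort | tail -n 1)
     [ -n "$newest" ] && ln -sfn "$newest" "$snapdir/latest"
     ```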
  22. Ah thanks for the explanation! (you too, @itimpi) Which makes my whole question moot, so I won't bother you with any more details
  23. Hmm, perhaps I misunderstood how "Cache: prefer" works (I only installed Unraid a few days ago), but I thought that if the cache pool fills up, eventually files get moved to the array? No, I meant subvolumes. I would think that those look like regular directories to Mover and hence won't be problematic. To give a bit more background: I have an `appdata` share which is set to "Cache: prefer". On my cache pool mountpoint (`/mnt/nvme-cache/`) I converted it from a regular directory to a subvolume so I can snapshot it.
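     For anyone wanting to do the same, a rough sketch of such a conversion (stop Docker first so nothing writes to `appdata` during the copy; paths match the setup described above):

     ```sh
     # Rough sketch: replace the appdata directory with a btrfs subvolume of the same name.
     cd /mnt/nvme-cache
     mv appdata appdata.old
     btrfs subvolume create appdata
     cp -a --reflink=always appdata.old/. appdata/
     rm -rf appdata.old
     ```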
  24. Thanks for this great plugin! In order to create snapshots I'm converting some of my "Cache: prefer" shares to subvolumes in the cache pool (which is btrfs; my array is XFS), but now I wonder if this may interfere with Mover. I don't _think_ it'll be an issue, but I'm interested in hearing whether my reasoning is correct. When the cache fills up, Mover will eventually move data from the cached shares to the array. These files will (obviously) no longer be snapshotted, but that won't be an issue because my whole goal with snapshotting is to back those shares up to the array anyway. The only issue I can foresee is Mover somehow having a problem with subvolumes in general that I'm not aware of?
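     The kind of backup flow this enables is roughly the following; a sketch only, with example paths (the array-side destination is just an illustration):

     ```sh
     # Sketch: read-only snapshot of the appdata subvolume, then copy it to the array.
     src=/mnt/nvme-cache/appdata
     snap=/mnt/nvme-cache/snaps/appdata_$(date +%Y%m%d)
     btrfs subvolume snapshot -r "$src" "$snap"
     rsync -a --delete "$snap"/ /mnt/user0/backups/appdata/   # /mnt/user0 writes straight to the array, bypassing the cache
     ```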