Unraid FUSE (?) dies during docker container operation, whole OS gets stuck



Hi,

 

I have an issue with Unraid where I think the FUSE filesystem dies when a specific docker container receives a specific interaction. I know it sounds vague at first, but please read on; it's always reproducible and rather easy to trigger.

 

I'm using the latest Unraid (6.11.1, Basic license) and just upgraded my setup to add a parity drive as well as a cache drive (it was a clean install, not a migration).

My setup is: 2x 4TB HDD (1 parity, 1 XFS data) + 1x 500GB SSD as cache (tried with btrfs as well as XFS here)

 

What I did:

  • Installed all prerequisites to have Apps as well as docker-compose; enabled Docker, etc. - everything you need to add the MariaDB-Official app and to run docker-compose.
  • Within Apps, added MariaDB Official, made its database persistent under /mnt/user/appdata/... and created a dedicated network for it ("mariadb")
    • I don't think the external MariaDB or the custom network is relevant here, but I didn't want to modify the example. I just wanted to use one single MariaDB for other use-cases too, hence the initial setup looked like this.
  • Brought up Owncloud from their official docker-compose file (modified file attached to allow recreating the issue - the modifications include using the MariaDB Official container and persistent storage for both the Owncloud container and Redis; for both, I've added volume mounts to /mnt/user/appdata/owncloud/... - see the sketch after this list)
  • Enabled remote syslog to gather some logs, as after the issue happens, there's no way to get any diagnostics...
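
For reference, here's a minimal sketch of what the relevant parts of my modified docker-compose.yml look like. The image tags, environment variables and the sub-paths under /mnt/user/appdata/owncloud/ below are placeholders for illustration; the attached file has the actual values.

services:
  owncloud:
    image: owncloud/server
    restart: always
    ports:
      - "8080:8080"
    environment:
      - OWNCLOUD_DB_TYPE=mysql
      - OWNCLOUD_DB_HOST=mariadb        # the external MariaDB-Official container
      - OWNCLOUD_REDIS_ENABLED=true
      - OWNCLOUD_REDIS_HOST=redis
    volumes:
      # bind mount to the user share instead of the named docker volume
      # from the official example (the sub-path "files" is an example)
      - /mnt/user/appdata/owncloud/files:/mnt/data

  redis:
    image: redis:6
    restart: always
    volumes:
      - /mnt/user/appdata/owncloud/redis:/data   # sub-path "redis" is an example

networks:
  default:
    external: true
    name: mariadb                       # the custom network created for the MariaDB app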

 

What is observed:

  • After doing the initial Owncloud setup via the web interface, you can check that everything is working fine in the browser as well as with the PC client, HOWEVER:
  • If you then try to access the server via the iOS Owncloud application, everything dies instantly - and I mean everything, instantly (you don't even have to do anything; just having the iOS client request a login, or similar, is enough):
    • You cannot access the owncloud web interface, it's dead.
    • You cannot get into the owncloud container in any way, it's dead.
      • (there are no logs written by Docker for the container about any kind of error)
    • You cannot stop/remove the container:
      Error response from daemon: cannot stop container: owncloud_server: tried to kill container, but did not receive an exit event
    • You cannot stop the docker service:
      stopping dockerd... waiting for docker to die... repeat x15 times... umount: /var/lib/docker: target is busy.

       

    • You cannot stop the storage array anymore; you get into an infinite loop of the following log entries (it goes on forever and ignores any timeout set in the array settings):
      • Unmounting disks...
        shcmd (72301): umount /mnt/disk1
        umount: /mnt/disk1: target is busy.
        shcmd (72301): exit status: 32
        shcmd (72302): umount /mnt/cache
        umount: /mnt/cache: target is busy.
        shcmd (72302): exit status: 32
        Retry unmounting disk share(s)...
        Unmounting disks...

         

    • You cannot create diagnostics; it also hangs indefinitely at "Starting diagnostics collection..." and does not generate any kind of log.
    • powerdown -r also does not work the first time you issue it.
    • the second invocation of powerdown -r reboots the server with the following logs (these are the last entries I receive via syslog; nothing follows until the new logs from the rebooted system):
      • md: md_notify_reboot
        md: stopping all md devices
        md: 1 devices still in use.
        sd 4:0:0:0: [sdd] Synchronizing SCSI cache
        sd 2:0:0:0: [sdc] Synchronizing SCSI cache
        sd 1:0:0:0: [sdb] Synchronizing SCSI cache

         

    • A bonus event here: if, after owncloud dies and before trying to do anything else with the system, you open the FTP settings (Settings / FTP Server) in Unraid, the Unraid web interface also dies completely, and nothing seems to bring it back until you reboot the system. I didn't find any logs related to this either.

 

Of course, since the array did not stop properly, the whole parity check needs to be re-done, which takes 8+ hours with the 4TB disk, so it's not nice.

 

After reading about a lot of similar issues (albeit very outdated ones, from 2015-2016), it seemed to me that the issue could be caused by the internal FUSE filesystem, which transparently tries its best to use the cache drive. To put this theory to the test, I've recreated the owncloud container with the volumes mounted at /mnt/cache/appdata/owncloud as well as at /mnt/disk1/appdata/owncloud - bypassing the /mnt/user/appdata/... construct. In both cases the application worked perfectly, there weren't any crashes, and most importantly, no Unraid OS-level crashes (or rather, no storage-array hanging issues). The only change was the host-side path of the mounts, as shown below.
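
To illustrate, the only difference between the crashing and the working setup is the host-side prefix of the bind mount (the sub-path "files" is again just an example):

# via the FUSE user share - crashes reproducibly:
volumes:
  - /mnt/user/appdata/owncloud/files:/mnt/data

# pointing at the cache (or disk1) directly - works fine:
volumes:
  - /mnt/cache/appdata/owncloud/files:/mnt/data
  # or:
  # - /mnt/disk1/appdata/owncloud/files:/mnt/data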

The problem with this solution is that I cannot use the cache feature, which would be very beneficial so the HDDs don't have to spin up every time something needs to be written to them; so I'd really like to use the /mnt/user/appdata/... path. As far as I've understood, if you directly point a volume mount to /mnt/cache/appdata, then it's true you're using the cache drive, but it also means that:

  • the data is not going to be flushed at any time to the HDDs, where the parity would protect the data
  • the amount of space you can use is only as much as the cache drive, as everything is stored there completely.

 

What I've attached to this post:

  • The remote syslog capture - it's rather long, but you can skip to Oct  8 11:14:37 in the log; the important stuff starts there.
  • The owncloud docker-compose.yml file, which can be used to reproduce the issue (warning - you WILL face a parity check at the end...) - I've changed the users/passwords inside the file to generic ones and removed the need for the external .env file for easier testing. If you like, you can get the unmodified versions from their site: https://doc.owncloud.com/server/10.11/admin_manual/installation/docker/#docker-compose - scroll down, you'll find the .env as well as the docker-compose.yml - just make sure you modify it so it doesn't use a docker volume, but an external share.
  • A diagnostics zip file which I've created AFTER the system rebooted (I've noticed that in some posts a diagnostics file was also requested after a reboot, so here it is in advance..)

 

Any help would be appreciated with solving this while still maintaining the ability to use the cache feature.

tower-diagnostics-20221008-2003.zip syslog docker-compose.yml


Cannot help with the actual issue, but:

11 hours ago, zeroxx1986 said:

the data is not going to be flushed at any time to the HDDs, where the parity would protect the data

This is incorrect; any data on the cache will still be moved by the mover as long as the share is correctly configured.

 

11 hours ago, zeroxx1986 said:

the amount of space you can use is only as much as the cache drive, as everything is stored there completely.

This is correct.


Thanks for the correction Jorge, in the meantime I've also come to this conclusion; in my head the cache drive was behaving differently (I thought it works similarly to AutoTier) :)

 

Continuing my initial post, I've decided to try out yet another approach, but sadly this also failed:

  • created a new share, named it appdata-persistent and set it to Use cache pool: No.
  • created the owncloud container in /mnt/user/appdata-persistent/owncloud
  • Initially it seemed this method worked and I could use it this way, however after some time fiddling with owncloud via the iPhone application (still minutes, not hours), the app hung again, and the exact same scenario happened as previously. Again no other logs were visible, certainly no error logs.
1 month later...

Not as far as I know; they are investigating the issue (I've opened a proper bug ticket for this).

As a workaround, you can target one of the disks directly (/mnt/disk1/... or /mnt/cache/... in case you have and want to use the cache drive(s)) for the mount instead of the regular path (/mnt/user/...). That works flawlessly.
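
In compose terms it's just the host-side path of the bind mount that changes (the sub-path below is an example, adjust it to your own appdata layout):

volumes:
  # instead of:
  # - /mnt/user/appdata/owncloud/files:/mnt/data
  # point at a specific disk or pool:
  - /mnt/disk1/appdata/owncloud/files:/mnt/data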

5 months later...
3 hours ago, emiel said:

Suffering from the same issue.  Do you have a link for the bug ticket you created?

 

It says UNRAID USER BUG REPORT (ID #4044) #11151 - the link is https://forums.unraid.net/support/11151/, although it's private to the staff and the person who raised the ticket (me).

Long story short, they tried to replicate the issue, but not the way I had written it down, and of course it wasn't reproducible that way. Here's a copy of the last answer I got:

Quote

 

November 20, 2022

Hey there,

 

Unfortunately as best as I can tell on my systems there is no problem with using /mnt/user/... as a reference.  Admittedly because I don't particularly use these containers, I could be missing something.

 

My suggestion would be to continue to use the disk reference.  

 

This type of issue isn't widespread, and this is the first report we've had of it in the last couple of years (including via the forum etc)

 

-redacted-

 

I've since completely abandoned Unraid, as with this issue present I could never sleep peacefully, always worried about what would trigger it again, which in turn could break things or worse. Instead I've spun up a regular Ubuntu server with ZFS, installed everything I wanted and set up monitoring for my drives. To me, Unraid didn't provide anything more than a somewhat convenient UI to spin up my docker apps and occasional VMs, plus acting as a sort of NAS - all of which can be done in Ubuntu if you know how, and with ZFS my data is much more secure. Just recently it unfortunately also went through its trial by fire, as one mirrored drive got corrupted. Thanks to ZFS the swap was very easy and the resilvering took about 40 minutes instead of 7+ hours (the drives weren't full). Time will tell how this setup fares with major OS updates, etc.

On 5/12/2023 at 6:41 PM, zeroxx1986 said:

I've since completely abandoned Unraid, as with this issue present I could never sleep peacefully, always worried about what would trigger it again, which in turn could break things or worse. Instead I've spun up a regular Ubuntu server with ZFS, installed everything I wanted and set up monitoring for my drives. To me, Unraid didn't provide anything more than a somewhat convenient UI to spin up my docker apps and occasional VMs, plus acting as a sort of NAS - all of which can be done in Ubuntu if you know how, and with ZFS my data is much more secure. Just recently it unfortunately also went through its trial by fire, as one mirrored drive got corrupted. Thanks to ZFS the swap was very easy and the resilvering took about 40 minutes instead of 7+ hours (the drives weren't full). Time will tell how this setup fares with major OS updates, etc.

I can understand your frustration. I would really like to use owncloud, but can't either, which is very frustrating.

Owncloud is the only container suffering from this (as far as I'm currently aware), but it doesn't give me much confidence in the system anymore.

Maybe I will do a similar transition to another OS, as you did.

