Unraid FUSE (?) dies during docker container operation, whole OS gets stuck



Hi,

 

I have an issue with Unraid where I think the FUSE filesystem dies when a specific docker container receives a specific interaction. I know it sounds vague at first, but please read on; it's always reproducible and rather easy to trigger.

 

I'm using the latest Unraid (6.11.1, Basic license) and just upgraded my setup to add a parity drive as well as a cache drive (it was a clean install, not a migration).

My setup is: 2x 4TB HDD (1 parity, 1 XFS data) + 1x 500GB SSD as cache (tried with btrfs as well as XFS here)

 

What I did:

  • Installed all prerequisites to have Apps as well as docker-compose; enabled Docker, etc. - everything you need to add the MariaDB-Official app and to run docker-compose.
  • Within Apps, added MariaDB Official, made its database persistent under /mnt/user/appdata/... and created a dedicated network for it ("mariadb")
    • I don't think the external MariaDB or the custom network is relevant here, but I didn't want to modify the example. I just wanted to use one single MariaDB for other use-cases too, hence the initial setup looked like this.
  • Brought up Owncloud from their official docker-compose file (modified file attached to allow recreating the issue - the modifications include using the MariaDB Official container and persistent storage for both the Owncloud container and Redis; for both, I've added volume mounts to /mnt/user/appdata/owncloud/... - see the sketch after this list)
  • Enabled remote syslog to gather some logs, as after the issue happens, there's no way to get any diagnostics...
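
For reference, here's a minimal sketch of what the relevant parts of my modified docker-compose.yml look like. The image tags, environment variables and the sub-paths under /mnt/user/appdata/owncloud/ below are placeholders for illustration; the attached file has the actual values.

services:
  owncloud:
    image: owncloud/server
    restart: always
    ports:
      - "8080:8080"
    environment:
      - OWNCLOUD_DB_TYPE=mysql
      - OWNCLOUD_DB_HOST=mariadb        # the external MariaDB-Official container
      - OWNCLOUD_REDIS_ENABLED=true
      - OWNCLOUD_REDIS_HOST=redis
    volumes:
      # bind mount to the user share instead of the named docker volume
      # from the official example (the sub-path "files" is an example)
      - /mnt/user/appdata/owncloud/files:/mnt/data

  redis:
    image: redis:6
    restart: always
    volumes:
      - /mnt/user/appdata/owncloud/redis:/data   # sub-path "redis" is an example

networks:
  default:
    external: true
    name: mariadb                       # the custom network created for the MariaDB app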

 

What is observed:

  • After doing the initial Owncloud setup via the web interface, you can check that everything is working fine in the browser as well as with the PC client, HOWEVER:
  • If you then try to access the server via the iOS Owncloud application, everything dies instantly - and I mean everything, instantly (you don't even have to do anything; just having the iOS client request a login, or similar, is enough):
    • You cannot access the owncloud web interface, it's dead.
    • You cannot get into the owncloud container in any way, it's dead.
      • (there are no logs written by Docker for the container about any kind of error)
    • You cannot stop/remove the container:
      Error response from daemon: cannot stop container: owncloud_server: tried to kill container, but did not receive an exit event
    • You cannot stop the docker service:
      stopping dockerd... waiting for docker to die... repeat x15 times... umount: /var/lib/docker: target is busy.

       

    • You cannot stop the storage array anymore; you get into an infinite loop of the following log entries (it goes on forever and ignores any timeout set in the array settings):
      • Unmounting disks...
        shcmd (72301): umount /mnt/disk1
        umount: /mnt/disk1: target is busy.
        shcmd (72301): exit status: 32
        shcmd (72302): umount /mnt/cache
        umount: /mnt/cache: target is busy.
        shcmd (72302): exit status: 32
        Retry unmounting disk share(s)...
        Unmounting disks...

         

    • You cannot create diagnostics; it also hangs indefinitely at "Starting diagnostics collection..." and does not generate any kind of log.
    • powerdown -r also does not work the first time you issue it.
    • the second invocation of powerdown -r reboots the server with the following logs (these are the last entries I receive via syslog; nothing follows until the new logs from the rebooted system):
      • md: md_notify_reboot
        md: stopping all md devices
        md: 1 devices still in use.
        sd 4:0:0:0: [sdd] Synchronizing SCSI cache
        sd 2:0:0:0: [sdc] Synchronizing SCSI cache
        sd 1:0:0:0: [sdb] Synchronizing SCSI cache

         

    • A bonus event here: if, after owncloud dies and before trying to do anything else with the system, you open the FTP settings (Settings / FTP Server) in Unraid, the Unraid web interface also dies completely, and nothing seems to bring it back until you reboot the system. I didn't find any logs related to this either.

 

Of course, since the array did not stop properly, the whole parity check needs to be re-done, which takes 8+ hours with the 4TB disk, so it's not nice.

 

After reading about a lot of similar issues (albeit very outdated ones, from 2015-2016), it seemed to me that the issue could be caused by the internal FUSE filesystem, which transparently tries its best to use the cache drive. To put this theory to the test, I've recreated the owncloud container with the volumes mounted at /mnt/cache/appdata/owncloud as well as at /mnt/disk1/appdata/owncloud - bypassing the /mnt/user/appdata/... construct. In both cases the application worked perfectly, there weren't any crashes, and most importantly, no Unraid OS-level crashes (or rather, no storage-array hanging issues). The only change was the host-side path of the mounts, as shown below.
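
To illustrate, the only difference between the crashing and the working setup is the host-side prefix of the bind mount (the sub-path "files" is again just an example):

# via the FUSE user share - crashes reproducibly:
volumes:
  - /mnt/user/appdata/owncloud/files:/mnt/data

# pointing at the cache (or disk1) directly - works fine:
volumes:
  - /mnt/cache/appdata/owncloud/files:/mnt/data
  # or:
  # - /mnt/disk1/appdata/owncloud/files:/mnt/data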

The problem with this solution is that I cannot use the cache feature, which would be very beneficial so the HDDs don't have to spin up every time something needs to be written to them; so I'd really like to use the /mnt/user/appdata/... path. As far as I've understood, if you directly point a volume mount to /mnt/cache/appdata, then it's true you're using the cache drive, but it also means that:

  • the data is not going to be flushed at any time to the HDDs, where the parity would protect the data
  • the amount of space you can use is only as much as the cache drive, as everything is stored there completely.

 

What I've attached to this post:

  • The remote syslog capture - it's rather long, but you can skip to Oct  8 11:14:37 in the log; the important stuff starts there.
  • The owncloud docker-compose.yml file, which can be used to reproduce the issue (warning - you WILL face a parity check at the end...) - I've changed the users/passwords inside the file to generic ones and removed the need for the external .env file for easier testing. If you like, you can get the unmodified versions from their site: https://doc.owncloud.com/server/10.11/admin_manual/installation/docker/#docker-compose - scroll down, you'll find the .env as well as the docker-compose.yml - just make sure you modify it so it doesn't use a docker volume, but an external share.
  • A diagnostics zip file which I've created AFTER the system rebooted (I've noticed that in some posts a diagnostics file was also requested after a reboot, so here it is in advance..)

 

Any help would be appreciated with solving this while still maintaining the ability to use the cache feature.

tower-diagnostics-20221008-2003.zip syslog docker-compose.yml


Cannot help with the actual issue, but:

11 hours ago, zeroxx1986 said:

the data is not going to be flushed at any time to the HDDs, where the parity would protect the data

This is incorrect; any data on the cache will still be moved by the mover as long as the share is correctly configured.

 

11 hours ago, zeroxx1986 said:

the amount of space you can use is only as much as the cache drive, as everything is stored there completely.

This is correct.


Thanks for the correction Jorge, in the meantime I've also come to this conclusion; in my head the cache drive was behaving differently (I thought it works similarly to AutoTier) :)

 

Continuing my initial post, I've decided to try out yet another approach, but sadly this also failed:

  • created a new share, named it appdata-persistent and set it to Use cache pool: No.
  • created the owncloud container in /mnt/user/appdata-persistent/owncloud
  • Initially it seemed this method worked and I could use it this way, however after some time fiddling with owncloud via the iPhone application (still minutes, not hours), the app hung again, and the exact same scenario happened as previously. Again no other logs were visible, certainly no error logs.
1 month later...

Not as far as I know; they are investigating the issue (I've opened a proper bug ticket for this).

As a workaround, you can target one of the disks directly (/mnt/disk1/... or /mnt/cache/... in case you have and want to use the cache drive(s)) for the mount instead of the regular path (/mnt/user/...). That works flawlessly.
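
In compose terms it's just the host-side path of the bind mount that changes (the sub-path below is an example, adjust it to your own appdata layout):

volumes:
  # instead of:
  # - /mnt/user/appdata/owncloud/files:/mnt/data
  # point at a specific disk or pool:
  - /mnt/disk1/appdata/owncloud/files:/mnt/data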

5 months later...
3 hours ago, emiel said:

Suffering from the same issue.  Do you have a link for the bug ticket you created?

 

It says UNRAID USER BUG REPORT (ID #4044) #11151 - the link is https://forums.unraid.net/support/11151/, although it's private to the staff and the person who raised the ticket (me).

Long story short, they tried to replicate the issue, but not the way I had written it down, and of course it wasn't reproducible that way. Here's a copy of the last answer I got:

Quote

 

November 20, 2022

Hey there,

 

Unfortunately as best as I can tell on my systems there is no problem with using /mnt/user/... as a reference.  Admittedly because I don't particularly use these containers, I could be missing something.

 

My suggestion would be to continue to use the disk reference.  

 

This type of issue isn't widespread, and this is the first report we've had of it in the last couple of years (including via the forum etc)

 

-redacted-

 

I've since completely abandoned Unraid, as with this issue present I could never sleep peacefully, always worried about what would trigger it again, which in turn could break things or worse. Instead I've spun up a regular Ubuntu server with ZFS, installed everything I wanted and set up monitoring for my drives. To me, Unraid didn't provide anything more than a somewhat convenient UI to spin up my docker apps and occasional VMs, plus acting as a sort of NAS - all of which can be done in Ubuntu if you know how, and with ZFS my data is much more secure. Just recently it unfortunately also went through its trial by fire, as one mirrored drive got corrupted. Thanks to ZFS the swap was very easy and the resilvering took about 40 minutes instead of 7+ hours (the drives weren't full). Time will tell how this setup fares with major OS updates, etc.

On 5/12/2023 at 6:41 PM, zeroxx1986 said:

I've since completely abandoned Unraid, as with this issue present I could never sleep peacefully, always worried about what would trigger it again, which in turn could break things or worse. Instead I've spun up a regular Ubuntu server with ZFS, installed everything I wanted and set up monitoring for my drives. To me, Unraid didn't provide anything more than a somewhat convenient UI to spin up my docker apps and occasional VMs, plus acting as a sort of NAS - all of which can be done in Ubuntu if you know how, and with ZFS my data is much more secure. Just recently it unfortunately also went through its trial by fire, as one mirrored drive got corrupted. Thanks to ZFS the swap was very easy and the resilvering took about 40 minutes instead of 7+ hours (the drives weren't full). Time will tell how this setup fares with major OS updates, etc.

I can understand your frustration. I would really like to use owncloud, but can't either, which is very frustrating.

Owncloud is the only container suffering from this (as far as I'm currently aware), but it doesn't give me much confidence in the system anymore.

Maybe I will do a similar transition to another OS, as you did.

