docker causing btrfs errors making cache drive read-only

treos33 · April 28, 2018

Having some serious issues right now. Here are the steps I've gone through, as best I can remember them:

First I noticed some of my docker webUIs were unreachable. When I went to investigate, the Unraid GUI was still responsive but I couldn't really do anything: I couldn't stop docker, couldn't stop the array, and couldn't download diagnostic files.
I was still able to connect via ssh and sftp but I couldn't stop docker and unmount the array.
Tried to back up some of my docker appdata but I found the cache drive had become read-only and even reading was causing issues.
Ended up forcing an unclean shutdown.
On reboot, docker restarted with one of my containers as an orphaned image. When I tried to restore the orphan I got an error and the cache file system became read-only again.
This time I was able to stop docker from the GUI. I stopped the array then restarted it and the cache became read/write again.
Deleted the docker image and created a new one.
Set up about half my containers before I started to get errors again.
Downloaded system log.
Had to restart at this point because the cache was read-only and I couldn't unmount the array.
Recreated the docker image again, and installed only a select few containers. I'm basically just waiting now to see if it happens again or if it's dependent on the container.

I should also mention that I've performed 2 btrfs scrubs on my cache drive with 0 errors reported. I've tried a scrub on the docker image after errors occurred but it always stops immediately.
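For anyone wanting to confirm the read-only state described above without relying on the GUI, here is a minimal sketch in Python. It assumes the forced read-only state shows up as the `ro` option in `/proc/mounts` and that the cache is mounted at `/mnt/cache`; the function name `mount_is_read_only` and the sample mount table are made up for illustration and are not taken from my system.

```python
def mount_is_read_only(mounts_text, mount_point):
    """Return True/False for the ro/rw state of mount_point, or None if not mounted.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Options are comma-separated; a later entry for the same
        # mount point overrides an earlier one.
        state = "ro" in fields[3].split(",")
    return state


# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sdh1 /mnt/cache btrfs ro,noatime,nodiratime 0 0
/dev/md1 /mnt/disk1 xfs rw,noatime,nodiratime 0 0
"""

print(mount_is_read_only(SAMPLE_MOUNTS, "/mnt/cache"))
print(mount_is_read_only(SAMPLE_MOUNTS, "/mnt/disk1"))
```

On a live system the same function could be fed the contents of `/proc/mounts`, for example `mount_is_read_only(open("/proc/mounts").read(), "/mnt/cache")`.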

Any help would be appreciated!

Thanks.

tower-diagnostics-20180427-1922.zip

John_M · April 28, 2018

You have corruption in both your cache pool file system and your docker image. Recreating the docker image didn't fix the problem because the cache file system itself is corrupt.

treos33 · April 28, 2018

Thanks, John. How did you determine the cache file system is corrupt? Why wouldn't the btrfs scrub turn up errors if the file system was corrupt?

My cache drives are only a few months old, I'm surprised that they would be corrupt already.

Thanks for your help!

Squid · April 28, 2018

Apr 27 19:19:00 Tower kernel: ---[ end trace 623b3df827508758 ]---
Apr 27 19:19:00 Tower kernel: BTRFS: error (device sdh1) in __btrfs_free_extent:7073: errno=-2 No such entry
Apr 27 19:19:00 Tower kernel: BTRFS info (device sdh1): forced readonly
Apr 27 19:19:00 Tower kernel: BTRFS: error (device sdh1) in btrfs_run_delayed_refs:3089: errno=-2 No such entry
Apr 27 19:19:00 Tower kernel: BTRFS error (device sdh1): pending csums is 4096

8 hours ago, treos33 said:

I'm surprised that they would be corrupt already.

Apr 27 19:19:00 Tower kernel: BTRFS: Transaction aborted (error -2)

Apr 27 17:49:46 Tower emhttpd: unclean shutdown detected

Computers and filesystems have never particularly liked unclean shutdowns, and if they are in the middle of writes when that happens, corruption is a real possibility.
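If you want to pull these lines out of a long syslog yourself, something like the following Python sketch would do it. It is only an illustration: the patterns are derived from the log lines quoted above, the function name `summarize` is made up, and other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the log excerpts quoted in this thread.
PATTERNS = {
    "btrfs_error": re.compile(r"BTRFS:? error \(device (?P<dev>\w+)\)"),
    "forced_readonly": re.compile(
        r"BTRFS info \(device (?P<dev>\w+)\): forced readonly"
    ),
    "transaction_aborted": re.compile(r"BTRFS: Transaction aborted"),
    "unclean_shutdown": re.compile(r"unclean shutdown detected"),
}


def summarize(lines):
    """Count matching log lines per pattern and collect the devices named."""
    counts = {name: 0 for name in PATTERNS}
    devices = set()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                dev = match.groupdict().get("dev")
                if dev:
                    devices.add(dev)
    return counts, devices


# Lines copied from the posts above.
SAMPLE = [
    "Apr 27 17:49:46 Tower emhttpd: unclean shutdown detected",
    "Apr 27 19:19:00 Tower kernel: BTRFS: Transaction aborted (error -2)",
    "Apr 27 19:19:00 Tower kernel: BTRFS: error (device sdh1) in __btrfs_free_extent:7073: errno=-2 No such entry",
    "Apr 27 19:19:00 Tower kernel: BTRFS info (device sdh1): forced readonly",
    "Apr 27 19:19:00 Tower kernel: BTRFS: error (device sdh1) in btrfs_run_delayed_refs:3089: errno=-2 No such entry",
    "Apr 27 19:19:00 Tower kernel: BTRFS error (device sdh1): pending csums is 4096",
]

counts, devices = summarize(SAMPLE)
print(counts)
print(sorted(devices))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/syslog"))`, assuming that is where your syslog lives.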

treos33 · April 28, 2018

I still have a few questions about why this happens and how to fix it.

Isn't this the point of using a btrfs pool for the cache? It looks like all the errors are coming from one of my cache drives, sdh. Shouldn't a btrfs scrub identify this?

To correct this, should I just remove sdh from the pool and reformat it? Should I do it for both cache drives?

Thanks again!

JorgeB · April 28, 2018

38 minutes ago, treos33 said:

Isn't this the point of using a btrfs pool for the cache?

No, the point of a pool is to provide redundancy against a device failure, not filesystem corruption.

If both devices are affected, a scrub can't fix it. You can fsck the filesystem, but I would recommend backing up, formatting and restoring instead, since btrfs fsck is not sufficiently mature yet.

treos33 · April 28, 2018

Ok, thanks. I reformatted the cache drives and I'm currently restoring data.

Ran into a bit of an issue when recreating the cache pool. Not sure if it's a bug or I did something wrong.

I changed the cache pool to 1 slot. Changed the assigned drive to xfs, started the array, formatted, and then repeated with the second drive. I changed the pool to 2 slots and assigned both drives. When I started the array I was given the option to format the first drive in the array but the format never worked. It kept saying the drive was unmountable. I had to change the cache pool back to one slot and manually format each drive to btrfs before it would allow me to create the cache pool.

Thanks for all your help, everyone!
