thompn4 Posted April 5, 2023

Hey everyone – I've been trying to troubleshoot an error for over a month now and I can't figure it out. I have a pretty big array (10 disks total, double parity, 68TB usable) with a btrfs cache pool with parity (originally I had two 256 GB and two 1 TB SSDs for 1.25 TB of cache). A little over a month ago all my docker containers stopped working, and I started getting errors saying my cache drive was read-only. Restarting the server would fix things temporarily, but then at some random point (sometimes the same day, sometimes a few days later) docker would die again.

Here's what I tried:
1) Rebuilding the docker image.
2) Replacing my SSDs (took the existing 4 out and put in two brand new 1 TB SSDs).
3) Changing the cables to my SSDs.
4) Changing the OS version to 6.12.0-rc2.
5) Changing where Frigate writes to an NVMe (and not my main cache).

At this point I'm kind of out of ideas. Maybe it's still bad cables to my SSDs, resulting in them dropping offline when it gets too cold in my server room? I don't feel like I use my server significantly more or differently than anyone else; the only really taxing thing I do is run Frigate for CCTV. I have that set up for 24/7 recording, so that is a lot of writing to cache, but I figured moving that to the NVMe would solve that problem.

Diagnostics attached; Fix Common Problems is not showing any errors right now. When I try to re-enable docker, it currently fails to start. I can rebuild the docker image again and it'll work for a while (this last time was 4 days), but I fear I'm just in a loop of rebuilding my docker image every few days. Before this it worked solidly for well over a year.

unraid-diagnostics-20230405-0801.zip
JorgeB Posted April 5, 2023 (Solution)

Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdm1 errs: wr 0, rd 0, flush 0, corrupt 113, gen 0
Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdl1 errs: wr 201748643, rd 247526, flush 1001565, corrupt 86544, gen 0

This shows one of the cache devices dropped offline in the past, and there is also some possible data corruption. Run a correcting scrub and post the results. Also see here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

and here – Ryzen with overclocked RAM like you have is known to corrupt data:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
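For anyone landing here later, a minimal sketch of the commands involved, assuming the pool is mounted at /mnt/cache (adjust the path if your pool is named differently):

# show the per-device error counters for the pool
btrfs dev stats /mnt/cache

# start a scrub; on a read-write mounted filesystem it repairs whatever
# it can correct from the redundant copy (this is the "correcting" scrub)
btrfs scrub start /mnt/cache

# check progress and the final error summary
btrfs scrub status /mnt/cache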
thompn4 Posted April 7, 2023 (Author)

Thanks for your help @JorgeB! When I ran the 'btrfs dev stats' command there were a lot of errors. I removed the RAM overclock, checked the cables to the SSDs again, ran a scrub, and set up the 'btrfs dev stats' script on a schedule. Hopefully everything will continue to work this time around. Scrub stats here:

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 00:10:39 2023
Status:           finished
Duration:         0:11:18
Total to scrub:   240.50GiB
Rate:             389.42MiB/s
Error summary:    verify=3123 csum=6073889
  Corrected:      6076994
  Uncorrectable:  18
  Unverified:     0
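(For reference, a minimal sketch of that kind of scheduled check – not necessarily the exact script from the FAQ – assuming the pool is at /mnt/cache and that Unraid's notification helper lives at /usr/local/emhttp/webGui/scripts/notify:)

#!/bin/bash
# warn if any btrfs device error counter on the cache pool is non-zero
POOL=/mnt/cache
ERRORS=$(btrfs dev stats "$POOL" | grep -vw 0)
if [ -n "$ERRORS" ]; then
    /usr/local/emhttp/webGui/scripts/notify -i warning \
        -s "btrfs errors on $POOL" -d "$ERRORS"
fi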
JorgeB Posted April 7, 2023

Run another scrub and post the diagnostics once it finishes.
thompn4 Posted April 7, 2023 (Author)

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 06:39:42 2023
Status:           finished
Duration:         0:04:11
Total to scrub:   200.20GiB
Rate:             918.02MiB/s
Error summary:    csum=18
  Corrected:      0
  Uncorrectable:  18
  Unverified:     0
thompn4 Posted April 7, 2023 (Author)

Sorry – attached now – thanks for your help!!

unraid-diagnostics-20230407-0902.zip
JorgeB Posted April 7, 2023

Some errors were uncorrectable; the affected files are listed in the syslog during the scrub, e.g.:

Apr 7 06:39:56 Unraid kernel: BTRFS warning (device sdk1): checksum error at logical 7364440064 on dev /dev/sdj1, physical 6261338112, root 5, inode 181866, offset 86016, length 4096, links 1 (path: appdata/jellyfin/metadata/library/8a/8a0efc38e8f26aa46396532f81472c17/chapters/637569150190000000_3000000000.jpg)

These should be deleted or restored from a backup, then run another scrub to confirm 0 errors.
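A quick way to pull the affected paths out of the syslog is something like this (a sketch; it just greps the standard syslog for the scrub warnings shown above):

# list the unique file paths flagged with checksum errors during the scrub
grep 'checksum error' /var/log/syslog | grep -o 'path: [^)]*' | sort -u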
thompn4 Posted April 7, 2023 (Author)

No more errors – thanks again!

UUID:             39eb07fa-6f31-43bb-b9e0-e5bc3349ce85
Scrub started:    Fri Apr 7 12:30:54 2023
Status:           finished
Duration:         0:01:36
Total to scrub:   77.38GiB
Rate:             826.39MiB/s
Error summary:    no errors found

unraid-diagnostics-20230407-1232.zip
thompn4 Posted April 10, 2023 (Author)

@JorgeB – not sure if you have any other advice, but I am still getting a few btrfs corrupt errors (14 yesterday, 1 today) when the 'btrfs dev stats /mnt/cache' command runs. When I run a scrub it comes up with 0 errors, but the errors do show up in the syslog as "csum failed". Diagnostics attached. Thanks again.

unraid-diagnostics-20230409-2112.zip
JorgeB Posted April 10, 2023

At this point I suggest backing up the pool and then reformatting it.
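A minimal sketch of what that backup could look like, assuming the pool is mounted at /mnt/cache and copying to a hypothetical /mnt/user/cache_backup share on the array (stop docker, VMs, and anything else writing to the pool first):

# copy everything off the pool, preserving attributes and sparse files
rsync -avhPX --sparse /mnt/cache/ /mnt/user/cache_backup/

# after reformatting the pool from the GUI, copy the data back
rsync -avhPX --sparse /mnt/user/cache_backup/ /mnt/cache/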
thompn4 Posted May 2, 2023 (Author)

So I found at least one of the issues – one of my sticks of RAM was bad. I was able to get the server up and running, mover was working fine, and I could read and write to the cache drives fine. However, as soon as I started docker, the cache went read-only again. I removed the docker image and recreated it (adding apps from the Previous Apps section), and before it could even finish adding all my apps, the cache drive went read-only again. No corruption errors this time. I'm beginning to think one of the files for docker itself (not the image) is corrupted, or something in my appdata? But the server runs fine until I try to start docker. Attached are diagnostics from last night when this happened.

unraid-diagnostics-20230502-0002.zip
JorgeB Posted May 2, 2023

There's corruption on the cache filesystem, so you need to fix that first. Since you had bad RAM before, it's best to back up what you can, reformat the cache, and then recreate the docker image.
trurl Posted May 2, 2023

3 minutes ago, thompn4 said:
one of my sticks of ram was bad

If you've been running on bad RAM, then anything has some chance of being corrupt.