
Docker keeps stopping every few days, can't be restarted - 6.12.0-rc12


Solved by JorgeB

Recommended Posts

Hey everyone – I've been trying to troubleshoot an error for over a month now and I can't figure it out. I have a pretty big array (10 disks total, double parity, 68 TB usable) with a redundant btrfs cache pool (originally I had two 256 GB and two 1 TB SSDs for 1.25 TB of cache).

A little over a month ago all my docker containers stopped working. I started getting errors saying my cache drive was read only. Restarting the server would fix things temporarily, but then at some random point (sometimes the same day, sometimes a few days later), docker would die again.

Here’s what I tried:

1) Rebuilding the docker image.

2) Replacing my SSDs (took the existing 4 out and put in two brand-new 1 TB SSDs).

3) Changing the cables to my SSDs.

4) Upgrading the OS to 6.12.0-rc2.

5) Pointing Frigate's writes at an NVMe drive (and not my main cache).

At this point I'm kinda out of ideas. Maybe it's still bad cables to my SSDs, causing them to drop when it gets too cold in my server room? I don't feel like I use my server significantly more/differently than anyone else; the only taxing thing I do is run Frigate for CCTV. I have that set up for 24/7 recording, so that's a lot of writing to cache, but I figured moving it to the NVMe would solve that problem. Diagnostics attached; Fix Common Problems is not showing any errors right now. When I try to re-enable docker, it currently fails to start. I can rebuild the docker image again and it'll work for a while (4 days this last time), but I fear I'm just in a loop of rebuilding my docker image every few days. Before this it worked solidly for well over a year.

unraid-diagnostics-20230405-0801.zip

Link to comment
  • Solution
Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdm1 errs: wr 0, rd 0, flush 0, corrupt 113, gen 0
Mar 29 20:50:51 Unraid kernel: BTRFS info (device sdm1): bdev /dev/sdl1 errs: wr 201748643, rd 247526, flush 1001565, corrupt 86544, gen 0

This shows that one of the cache devices dropped offline in the past, and there is also some possible data corruption. Run a correcting scrub and post the results; also see here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

and here; Ryzen with overclocked RAM, like you have, is known to corrupt data:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
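From the command line, the check and the correcting scrub would look roughly like this (a sketch, assuming the pool is mounted at /mnt/cache; adjust the path to your setup):

```shell
# Check the per-device error counters first (they persist until reset).
btrfs dev stats /mnt/cache

# Run a correcting scrub in the foreground (-B waits for completion
# and prints the summary).
btrfs scrub start -B /mnt/cache

# Or start it in the background and poll for progress:
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

# Once the pool checks out clean, zero the counters so any new
# errors stand out immediately.
btrfs dev stats -z /mnt/cache
```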

 


Thanks for your help @JorgeB! When I ran the 'btrfs dev stats' command there were a lot of errors. I removed the RAM overclock, checked the cables to the SSDs again, ran a scrub, and set up a 'btrfs dev stats' script on a schedule. Hopefully everything will continue to work this time around.
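For anyone else setting this up, the scheduled check can be sketched as a small script (the pool path and the notification step are assumptions; on Unraid you'd typically run it via the User Scripts plugin):

```shell
#!/bin/sh
# Sketch of a scheduled cache-health check: flag any nonzero btrfs
# device error counters. POOL is an assumption -- adjust to your mount.
POOL=/mnt/cache

# Read `btrfs dev stats` output on stdin; print any nonzero counters
# and exit 0 if at least one counter is nonzero (i.e. trouble found).
nonzero_counters() {
    awk '$NF != 0 { print; bad = 1 } END { exit !bad }'
}

if command -v btrfs >/dev/null 2>&1; then
    if btrfs dev stats "$POOL" | nonzero_counters; then
        echo "WARNING: btrfs device errors detected on $POOL"
        # e.g. hook in an Unraid notification here
    fi
fi
```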

 

Scrub stats here: 

UUID: 39eb07fa-6f31-43bb-b9e0-e5bc3349ce85

Scrub started: Fri Apr 7 00:10:39 2023

Status: finished

Duration: 0:11:18

Total to scrub: 240.50GiB

Rate: 389.42MiB/s

Error summary: verify=3123 csum=6073889

Corrected: 6076994

Uncorrectable: 18

Unverified: 0


Some errors were uncorrectable; the affected files are listed in the syslog during the scrub, e.g.:

 

Apr  7 06:39:56 Unraid kernel: BTRFS warning (device sdk1): checksum error at logical 7364440064 on dev /dev/sdj1, physical 6261338112, root 5, inode 181866, offset 86016, length 4096, links 1 (path: appdata/jellyfin/metadata/library/8a/8a0efc38e8f26aa46396532f81472c17/chapters/637569150190000000_3000000000.jpg)

 

These should be deleted or restored from a backup, then run another scrub to confirm 0 errors.
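To collect all the affected paths in one go, something like this should work (a sketch; each checksum warning ends with "(path: <file>)", and /var/log/syslog is Unraid's default log location):

```shell
#!/bin/sh
# Pull the file paths out of the BTRFS checksum-error warnings.
list_bad_files() {
    grep -o 'path: [^)]*' | sed 's/^path: //' | sort -u
}

grep 'checksum error' /var/log/syslog 2>/dev/null | list_bad_files
```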

 

  • 4 weeks later...

So I found at least one of the issues: one of my sticks of RAM was bad. I was able to get the server up and running, mover was working fine, and I could read and write to the cache drives fine. However, as soon as I started docker, the cache went read-only again. I removed the docker image and recreated it (adding apps from the Previous Apps section), and before it could even finish adding all my apps, the cache drive went read-only again. No corruption errors this time. I'm beginning to think one of the files for docker itself (not the image) is corrupted, or something in my appdata? But the server runs fine until I try to start docker. Attached are diagnostics from last night when this happened.

unraid-diagnostics-20230502-0002.zip

