
Some Docker containers becoming corrupt repeatedly in 6.10 RC2


Sarge


I'm not sure if this should go in the 6.10 RC2 thread, in General, or here.  Someone let me know if I should move it.

I'm new to Unraid but have been watching a bunch of videos and reading tutorials and these forums.  I decided to go with 6.10 RC2 instead of the stable branch and am having some issues with Docker containers corrupting themselves. 

 

Diagnostics attached

 

What's happening:

  1. I'll get all the Docker apps installed; everything starts as one would expect and all of them work.
  2. Sometime later, typically the next day, I will discover that one or more have stopped or are in some kind of failed state.
    This is almost always the Docker OCI runtime complaining that a file inside the Docker image is missing and can't be run.
    From Pihole:  OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: read init-p: connection reset by peer: unknown
  3. Deleting the Docker container and image and re-installing does not fix it.  The only fix I've found is to stop the Docker service, delete the /mnt/user/system/docker/ folder, then restart the Docker service (roughly the commands sketched below).  I can then reinstall all of the Docker containers and they all work (back to #1 above, rinse and repeat).
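
For reference, the reset in step 3 comes down to roughly this from the console once the Docker service has been stopped in Settings > Docker (just a sketch of what I'm doing; the path matches my directory-mode setting, so adjust it if yours differs):

# with the Docker service stopped in Settings > Docker
rm -rf /mnt/user/system/docker
# re-enable the Docker service, then reinstall the containers from Community Applications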

 

Changes I've made:

  1. I changed Docker from running in a vdisk to running in a directory; this seems to be when things started breaking.
  2. I changed the Docker custom network type from macvlan to ipvlan at the same time.  I have not yet reverted either of these changes to try to fix it (a couple of quick checks on both settings are sketched after this list).
  3. I do have the appdata backup plugin set to back up my appdata folder weekly, but it did not run last night and things still failed.
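
In case it's relevant, here's how I've been sanity-checking both of those settings from the console (just a sketch; I'm assuming docker info reports the directory path when Docker is in directory mode, and br0 stands in for whatever the custom network is actually called):

docker info | grep -i 'root dir'   # should point at the directory rather than a loop-mounted vdisk
docker network ls                  # the custom network (e.g. br0) should now list the ipvlan driver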

 

System Specs:

  • Dell R720xd
  • PERC card flashed to IT mode
  • 128 GB of RAM
  • Cache pool is two new 2TB Samsung 970 Evo Plus drives in RAID 1 with BTRFS.
  • Main array is five 10TB Seagate Enterprise drives with about three years of spin on them.  Two more of the same 10TB drives are used for parity, but they are both new.

 

Things I've Tried:

  • I've run a scrub a couple of times on the NVMe cache pool (commands sketched after this list), no issues found.
  • I ran Memtest86 overnight; it made it all the way through one pass plus about halfway through the next with no errors.
  • I ran Prime95 from the Ultimate Boot CD (USB) for several hours with no issues.
  • I checked the firmware on the Samsung NVMe drives I'm using for cache and it is the latest version.
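
For the scrubs I've mostly used the GUI button, which as I understand it amounts to something like this against the pool mount (a sketch; it assumes the NVMe pool is mounted at /mnt/cache):

btrfs scrub start -B /mnt/cache   # -B waits and prints a summary when the scrub finishes
btrfs device stats /mnt/cache     # per-device read/write/corruption error counters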

 

Note: you may see some errors in the logs for some Samsung 860 Evo drives; I haven't had a chance to look into those yet, but they are generating CRC errors.  Nothing is on them; they are set up as a cache pool called Vmdks that I'm not using yet.
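
When I get a chance I'll look at those with smartctl, something along these lines (a sketch; sdX stands in for whichever devices the 860 Evos show up as):

smartctl -a /dev/sdX | grep -i crc   # attribute 199, UDMA_CRC_Error_Count, usually cabling or backplane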

 

Any help or tests anyone can think of would be very much appreciated.

 

 

thor-diagnostics-20220130-2050.zip

10 hours ago, Squid said:

Is this because one or more of the containers are updating themselves?

ca.update.applications.plg - 2021.09.24  (Up to date)

and other containers are "running through them" eg: connected to a VPN being managed by another container?

 

I do have Auto Update Applications installed and running, but I'm nearly 100% positive that's not it: one of the affected containers is one I built and host on Docker Hub myself, and I haven't updated it, so there are no changes to the image to download and no reason for the container to restart.
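
If it helps, I can verify that on the server by comparing the local image against what I pushed to Docker Hub, roughly like this (a sketch; myrepo/myapp stands in for my actual image name):

docker image inspect myrepo/myapp --format '{{.Id}} {{.Created}}'   # local image ID and build date
docker image inspect myrepo/myapp --format '{{.RepoDigests}}'       # digest it was pulled as, to compare against Docker Hub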

I'm not running a VPN (yet).  I do have a reverse proxy installed, but am currently just using it for SSL certificate renewal.  Other than that, they are all independent.

 

Note: I have the array down at the moment as I'm in the middle of testing the drives to see if there are any errors.  I should have done that before copying over all the data, but didn't think about it.  Unrelated question on the subject of drive testing: I know I can pull one empty drive from the array, change it from BTRFS to XFS, restart the array, and reformat it without any issues, but I don't know if I can do that to more than one drive at a time.  That is, if I use unbalance to move all data to one drive that has been tested, then shut down the array and run `badblocks -wvs -b 4096 /dev/sdX` on all the empty drives at once, will the array be broken when I spin it back up?
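
What I have in mind is just kicking badblocks off in parallel against each empty drive, something like this (a sketch; sdb and sdc stand in for the actual empty-drive devices, and -w is destructive, so only on drives with nothing on them):

nohup badblocks -wvs -b 4096 -o /boot/badblocks_sdb.log /dev/sdb &
nohup badblocks -wvs -b 4096 -o /boot/badblocks_sdc.log /dev/sdc &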

 

Any other thoughts on the Docker corruption?

