Resolved: Cache Pool Not Showing Correct Size & Docker Not Starting

-=Striker=- · November 11, 2020

Hello,

I am still relatively new to Unraid, and once I got it up and running, it's just worked, so I haven't needed to do a lot of troubleshooting.

Yesterday, I started encountering issues with a Docker container, and when I tried to restart it, I got a 403 error. I did some digging around on the forums and found a suggestion to delete my docker.img and rebuild it, so I gave that a try. While all my other Docker containers are working properly, one is still giving me issues.

While I was reviewing my system, I noticed that my cache pool looks very odd. I have two 1TB hard drives that should be in the cache pool operating as RAID1, resulting in a 1TB usable cache; however, it's reporting as 500GB (see below).

I've tried running the balance and scrub commands. I've rebooted the server and double-checked the power and SATA connections to these drives.

My one docker container that's having issues (sagetv-server-java-8) runs the PVR and TV playback functions in the house, so everyone's getting a little cranky around here!

Any suggestions would be greatly appreciated. As I mentioned, I'm still a little new to this, so feel free to use sock puppets and crayons when explaining things to me....

Edited November 13, 2020 by -=Striker=-

trurl · November 11, 2020

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

-=Striker=- · November 11, 2020

Diagnostics attached.

Sorry, I should have thought to do that!

hal9000-diagnostics-20201111-1347.zip

-=Striker=- · November 11, 2020

I'm currently trying to back up my appdata folder by copying it across the network to some spare space on my workstation, and it is going painfully slowly. Is there a faster way to do this?

I'm assuming that whatever will need to be done to resolve this will involve messing around with the cache drives and a backup would be prudent....

image.png

trurl · November 12, 2020

If you already had CA Backup plugin installed and scheduled, you would already have a recent compressed archive of appdata you could copy instead. I don't know if trying to do that now would be any faster than the copy you are already doing.

I am guessing from the large number of files that you have plex.
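As a rough illustration of why a single compressed archive tends to copy faster than a folder full of small files, here is a minimal Python sketch that packs a directory into one .tar.gz. This is not what the CA Backup plugin runs internally; the function name and the paths in the usage note are assumptions made up for the example.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir: str, archive_path: str) -> int:
    """Pack source_dir into one gzip-compressed tar archive.

    Returns the number of regular files stored in the archive.
    The archive should be written outside source_dir.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Add the whole tree under its own folder name, e.g. "appdata/...".
        tar.add(source, arcname=source.name)
        # In write mode, getmembers() lists the entries added so far.
        return sum(1 for member in tar.getmembers() if member.isfile())
```

Usage might look like `archive_directory("/mnt/user/appdata", "/mnt/user/backups/appdata.tar.gz")`, with both paths being hypothetical. Stop the containers first so files are not changing while they are archived, then copy the single archive across the network.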

-=Striker=- · November 12, 2020

29 minutes ago, trurl said:

If you already had CA Backup plugin installed and scheduled, you would already have a recent compressed archive of appdata you could copy instead. I don't know if trying to do that now would be any faster than the copy you are already doing.

I am guessing from the large number of files that you have plex.

Yes, you are correct, I have Plex.

I will look at setting up CA Backup once this is resolved!

Any ideas what my next steps are once I'm done backing up?

JorgeB · November 12, 2020

The cache pool has only one device, despite what the GUI shows; there are hardware issues with the other:

Nov 11 10:18:57 Hal9000 kernel: BTRFS info (device sdb1): bdev (null) errs: wr 545786, rd 0, flush 87, corrupt 0, gen 0

See here for more info.
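For anyone wanting to spot this kind of line in their own syslog, here is a minimal Python sketch that pulls the error counters out of a BTRFS "errs:" message like the one quoted above. The regular expression is an assumption based only on that one log line, and the function name is made up for the example. The counters appear to correspond to the write, read, flush, corruption and generation error counts that `btrfs device stats` reports.

```python
import re

# Pattern modelled on the single log line quoted in this thread (an assumption,
# other kernel versions may format the message differently).
BTRFS_ERRS = re.compile(
    r"BTRFS info \(device (?P<device>[^)]+)\): bdev (?P<bdev>.+?) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line: str):
    """Return the device names and error counters from a log line, or None."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    counters = {
        key: int(match.group(key))
        for key in ("wr", "rd", "flush", "corrupt", "gen")
    }
    return {
        "device": match.group("device"),
        "bdev": match.group("bdev"),
        "counters": counters,
    }


line = (
    "Nov 11 10:18:57 Hal9000 kernel: BTRFS info (device sdb1): "
    "bdev (null) errs: wr 545786, rd 0, flush 87, corrupt 0, gen 0"
)
result = parse_btrfs_errs(line)
# Any non-zero counter is worth investigating.
nonzero = {name: count for name, count in result["counters"].items() if count}
print(result["bdev"], nonzero)
```

In the line from the diagnostics, the write and flush counters are non-zero and the device name is reported as (null), which fits a pool member that dropped offline.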

-=Striker=- · November 12, 2020

OK, so I have migrated all cache content to the array, stopped the array, disabled the cache drives, pulled the cache drives, and started the array.

I'm now running some tests on the cache drives on a separate system. Once done, I'll build a new cache array and migrate the appropriate data back to the cache.

In my previous searches I also found the post you linked, so I had checked the SMART reports for the drives, and they indicated everything was good. I guess that was bad info or I misunderstood something.

Would all of this explain why my one Docker container is having issues? I thought the purpose of having 2 drives in the cache pool was so it could run in RAID1 and prevent issues from occurring after a single drive failure?

JorgeB · November 12, 2020

I thought the purpose of having 2 drives in the cache pool was so it could run in RAID1 and prevent issues from occurring after a single drive failure?

It is; that's why the cache was still working with a single device, but you still need to fix it.

trurl · November 12, 2020

7 minutes ago, -=Striker=- said:

stopped the array, disabled the cache drives, pulled the cache drives, and started the array.

If you started the array with dockers / VMs enabled, but with no cache installed, then you probably have had your docker / VM related shares (appdata, domains, system) recreated on the array.

-=Striker=- · November 12, 2020

3 minutes ago, JorgeB said:

It is; that's why the cache was still working with a single device, but you still need to fix it.

I agree, the cache needs to be fixed.

The reason why I started investigating my server though was because a docker container was failing. What was the cause of that failure if the RAID1 kept things going?

JorgeB · November 12, 2020

One possibility is the one mentioned in the linked FAQ: the system share is NOCOW by default, so if the other device comes back online, even for a little while, the out-of-sync device can corrupt any NOCOW data.
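To see whether a given folder is NOCOW, `lsattr -d` shows an uppercase `C` in the attribute column when the No_COW attribute is set. Below is a minimal Python sketch wrapping that check; the function names and the path in the usage note are assumptions made up for the example. NOCOW data on btrfs is also not checksummed, so the filesystem cannot tell which copy is the good one.

```python
import subprocess


def has_nocow_flag(lsattr_line: str) -> bool:
    """Return True if one line of lsattr output shows the No_COW ('C') attribute."""
    parts = lsattr_line.split(maxsplit=1)
    if not parts:
        return False
    # The first column holds the attribute flags; uppercase 'C' means No_COW
    # (lowercase 'c' is a different attribute, compression).
    return "C" in parts[0]


def is_nocow(path: str) -> bool:
    """Run `lsattr -d` on path and check the directory itself for No_COW."""
    output = subprocess.run(
        ["lsattr", "-d", path],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    lines = output.splitlines()
    return bool(lines) and has_nocow_flag(lines[0])
```

For example, `is_nocow("/mnt/cache/system")` could be run on the server, assuming the share lives at that path.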

-=Striker=- · November 12, 2020

9 minutes ago, trurl said:

If you started the array with dockers / VMs enabled, but with no cache installed, then you probably have had your docker / VM related shares (appdata, domains, system) recreated on the array.

I changed the settings on those shares so Mover would be allowed to migrate the data to the array. I stopped the Docker and VM services and ran Mover to migrate all data to the array before shutting down and pulling the cache drives. So all data should be intact on the array without needing to be recreated.

Once the cache is fixed, I'll change the share settings and migrate them back to the cache where they belong.

-=Striker=- · November 12, 2020

2 minutes ago, JorgeB said:

One possibility is the one mentioned in the linked FAQ: the system share is NOCOW by default, so if the other device comes back online, even for a little while, the out-of-sync device can corrupt any NOCOW data.

OK, so if that's the case, then my appdata for that Docker container may be corrupt and need to be rebuilt, since fixing the cache won't repair that issue?

JorgeB · November 12, 2020

Depends where the appdata folder is. If it's in the system share or any other share set to NOCOW, then yes, it might be corrupt; there's no way to know.

-=Striker=- · November 13, 2020

Thank you for all the assistance, I appreciate your insight and recommendations!

After removing the bad drive and rebuilding my cache array, things are working. I had some data loss in one of my dockers, but nothing that can't be rebuilt.
