All my dockers are missing!?! Please help!



18 hours ago, johnnie.black said:

As I've already posted more than once, the write errors can't be caused by software; it was a hardware problem. Most likely the NVMe devices dropped offline one or more times, and given the high number of errors, and the fact that it happened to both devices, it will most likely happen again.


Status update: It looks like deleting the runaway Logitech log file, and thereby freeing up a huge amount of space on my cache, has fixed my server.

Here is what it looks like on my end:

  1. As I said in my last post, I stopped all the dockers last night and started making /mnt/cache backups.   
  2. While using mc, I noticed it was taking a long time to copy over this file: /mnt/cache/appdata/LogitechMediaServer/logs/server.log
  3. I checked the file and found a year's worth of obscenely verbose LMS logging; the file was 85GB.
  4. I trashed the file and redid the cache backup.
  5. My cache usage has since dropped to 62GB used out of 250GB. 
  6. I have not run the btrfs balance at this point.  
  7. I rebooted and started the array.
  8. All of my dockers were immediately back up and running.  My Windows 10 VM is also back and up and running. 
  9. I have zero errors when running: btrfs dev stats /mnt/cache
  10. I did not rebuild my cache, docker.img, or VM.  I only deleted the one 85GB log file and rebooted...  I'm not clear that my docker image or VM are corrupted... They don't appear to be as best as I can tell.
  11. I checked the syslog, and app logs and don't see anything amiss.   No errors that I can see...
  12. I've since updated a few plugins and dockers... stopped and started dockers from the UI.  It all works.
  13. I've posted my latest diags...

I'm pretty sure all of the issues I have had, including the write errors, are the result of that one out-of-control log file and the way the btrfs cache behaves in this particular situation, where it decides there is no space left for some reason (even though I should still have had 90GB free even with the massive log file present).  I'll keep monitoring it over the next while to see if anything changes, but it looks pretty clear to me that this is what happened in this case.
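For anyone following along, the clean-stats check in step 9 is just the following (assuming the cache pool is mounted at /mnt/cache, as in my setup; the -z reset is optional):

```shell
# Per-device btrfs error counters for the cache pool;
# all counters (write_io_errs, read_io_errs, etc.) should read 0
btrfs dev stats /mnt/cache

# The counters persist across reboots; after fixing an underlying
# problem they can be zeroed so any new errors stand out:
btrfs dev stats -z /mnt/cache
```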


unraid-diagnostics-20180815-2124.zip

4 hours ago, glenner said:

I'm pretty sure all of the issues I have had, including the write errors, are a result of the one out of control log file

Not the write errors in the stats, as I'm going to say for the last time.

 

4 hours ago, glenner said:

I have not run the btrfs balance at this point.

You should still run the balance, or upgrade to v6.5, or you're going to hit the out-of-space errors again due to the cache filesystem being fully allocated.
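For anyone hitting the same wall, the suggested balance amounts to something like this (a sketch; the 75% usage filter is my assumption, adjust to taste):

```shell
# Compact data chunks that are under 75% full so their space is
# returned to the unallocated pool (addresses the "fully allocated"
# cache condition described above)
btrfs balance start -dusage=75 /mnt/cache

# Check chunk allocation vs. actual usage before and after
btrfs fi usage /mnt/cache
```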

 

On 8/16/2018 at 2:15 AM, johnnie.black said:

Not the write errors in the stats, as I'm going to say for the last time.

 

You should still run the balance, or upgrade to v6.5, or you're going to hit the out-of-space errors again due to the cache filesystem being fully allocated.

  

 

So last week, after my system seemed to become stable again once I deleted my runaway LMS log file, I ran the balance, and then upgraded everything over the weekend.

 

I updated my BIOS, upgraded to Unraid 6.5.3, updated my dockers and plugins, and recreated my appdata backups.  I also found some new dockers I needed and set those up too... :-)

btrfs stats are still clean.  I did a quick sanity check after each upgrade step to ensure the system was still stable...  My system is up to date now.

 

I did have to restore my Plex DB, as that did get corrupted in the initial outage.  Fortunately, Plex keeps dated backups under appdata, so that's an easy fix.

 

I don't see my instability issues (missing dockers, missing VMs, errors) returning... at least not anytime soon.

 

I've had Unraid for a year now, which I set up on a new custom pro build I bought last year.  It's been solid and much better than the Windows box I used to run all my HTPC stuff on...  The only issues I've seen over the last year that resulted in any kind of "outage" happened when some cache file got huge and out of control.  I saw it a while ago with a 100GB+ SageTV recording that brought down my whole server (https://forums.sagetv.com/forums/showthread.php?t=64895).  And now I've seen it more recently with a huge Logitech Media Server log file that also effectively brought down my server.

 

As best as I can tell, interactions between the btrfs cache, mover settings, environment settings, and huge files can lead to issues.  Once a cache file gets huge and there is "no space" left on the btrfs cache, while a docker is actively attempting to write 20GB/hr to the cache, all bets are off...

 

I'd like to find a way to get some kind of alert if I have a huge file brewing, or excess disk usage on my cache.  That might have averted all the problems I've had so far.  I should never have, say, a 20GB+ file on the system (some SageTV recordings, like a 3-hour sports program, could hit 20GB before being moved off to the array, but that's the biggest file I ever want to see on the cache).  Not sure if there is a plugin for that (maybe "Fix Common Problems" could scan for it), but I'll see if I can find something, or set up some kind of automated file-size scan in the cron.
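The cron scan I have in mind could be as simple as the sketch below. This is a hypothetical watchdog, not an existing Unraid feature; the /mnt/cache path and 20G cap are my assumptions.

```shell
#!/bin/sh
# Hypothetical cache watchdog: report any file larger than a cap so a
# runaway log is caught before it fills the pool. CACHE and CAP are
# assumptions and can be overridden via the environment.
CACHE="${CACHE:-/mnt/cache}"
CAP="${CAP:-+20G}"   # find(1) size syntax: +20G means "larger than 20GiB"

# List offending files, if any
big=$(find "$CACHE" -type f -size "$CAP" 2>/dev/null)
if [ -n "$big" ]; then
    # On Unraid this could be piped to the notify script instead of stdout
    printf 'Oversized cache files:\n%s\n' "$big"
fi
```

Dropped into an hourly cron entry, something like this would have flagged both the SageTV recording and the LMS log long before they got anywhere near 85GB.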

 

In any event, I'm on 6.5.3 and I think I'm super stable again...  Thanks for your help.  I really appreciate it.

Edited by glenner
