Docker Image corrupt?



Hi all, my UniFi docker has just failed for some reason. I noticed this yesterday, along with a few other containers not responding, so I rebooted my server. Everything came back OK apart from the UniFi Controller container which, although it had a green light against it, only showed 'server starting' when I connected to the WebUI.

 

When I look in the syslog I see the following error repeating the whole time the controller container is running:

 

Mar 23 21:35:03 Media kernel: BTRFS warning (device loop2): csum failed root 1796 ino 10101 off 41115648 csum 0xb3cf73fa expected csum 0x293aeffb mirror 1
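For context, 'csum failed' means the data read back did not match the checksum BTRFS stored when it was written, i.e. that block is corrupt. A quick way to gauge how often it is happening is to count the warnings in the log — a sketch, assuming a saved copy of the syslog (on Unraid the live log is /var/log/syslog):

```shell
# Count BTRFS checksum-failure warnings in a syslog file passed as $1.
count_csum_errors() {
  grep -c 'BTRFS warning.*csum failed' "$1"
}

# e.g. count_csum_errors /var/log/syslog
```

A steadily climbing count while only one container runs is a good hint that the corruption sits inside that container's layers in docker.img.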

 

This morning I got a message from Fix Common Problems that said the Docker image was read-only. I stopped Docker and increased the size to see if this was the issue, but to no avail. I stopped Docker again and ran a scrub:

 

scrub status for 266b6e65-1004-4619-8039-7a234e50f0f1
scrub started at Sun Mar 24 18:14:34 2019 and finished after 00:00:03
total bytes scrubbed: 3.65GiB with 2 errors
error details: csum=2
corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
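For anyone reproducing this, a scrub can also be run from the command line; 'uncorrectable errors' is the number to watch. A sketch, assuming the pool is mounted at /mnt/cache (the btrfs commands are commented out since they need a real mount):

```shell
# On the server:
# btrfs scrub start -B /mnt/cache   # -B waits for the scrub to finish
# btrfs scrub status /mnt/cache

# Pull the uncorrectable-error count out of the status output on stdin.
uncorrectable_count() {
  sed -n 's/.*uncorrectable errors: \([0-9]*\).*/\1/p'
}
```

Uncorrectable errors on a single-profile (or already-damaged) filesystem cannot be repaired by scrub; it can only report them.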

 

All other dockers run fine; only when I start this one do I see the csum errors appear in the syslog. The UniFi Controller never starts, and all I get is a 'server starting' message when I try to connect to the WebUI.

 

I assume I need to remove the Docker image file and restore from a CA backup? Obviously this will take all my containers back to the last backup; will this be a problem? I run Plex, Sonarr, Radarr, nzbget, UniFi-Video and Tautulli.

2 minutes ago, andyjayh said:

I assume I need to remove the Docker image file and restore from a CA backup? Obviously this will take all my containers back to the last backup; will this be a problem? I run Plex, Sonarr, Radarr, nzbget, UniFi-Video and Tautulli.

Nope.  All the appdata is stored outside of the docker.img.  Remove the docker image then go to Apps, Previous Apps, check off what you want and you're back in business
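For anyone following along, the delete step on the command line looks roughly like this — the path below is the usual Unraid default and is an assumption here; verify it under Settings -> Docker before removing anything:

```shell
# Default Unraid docker image path (an assumption; check Settings -> Docker).
DOCKER_IMG="/mnt/user/system/docker/docker.img"

# With the Docker service stopped in Settings -> Docker, remove the image.
# Destructive, so it is left commented out here:
# rm -f "$DOCKER_IMG"

echo "Would remove: $DOCKER_IMG"
```

Because appdata lives outside docker.img, deleting the image loses only the container layers, which Previous Apps rebuilds with the saved templates.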

43 minutes ago, Squid said:

Nope.  All the appdata is stored outside of the docker.img.  Remove the docker image then go to Apps, Previous Apps, check off what you want and you're back in business

Ah I see. So leave the appdata well alone in this case. Will give this a go, thanks. 


Hmm, not great. I did the above, and what I'm seeing in the syslog now is:

 

Mar 24 21:54:59 Media kernel: BTRFS warning (device sdl1): csum failed root 5 ino 17093788 off 4571136 csum 0x3235bd95 expected csum 0x00000000 mirror 2
Mar 24 21:54:59 Media kernel: BTRFS critical (device sdl1): corrupt node: root=7 block=93667393536 slot=86, bad key order, current (18446744073709551606 128 986823262208) next (18446744073707454454 128 986826604544)

 

When I go to the WebUI it is cycling between a message about repairing the database and then server startup.

1 hour ago, andyjayh said:
Hmm, not great. I did the above, and what I'm seeing in the syslog now is:
 
Mar 24 21:54:59 Media kernel: BTRFS warning (device sdl1): csum failed root 5 ino 17093788 off 4571136 csum 0x3235bd95 expected csum 0x00000000 mirror 2
Mar 24 21:54:59 Media kernel: BTRFS critical (device sdl1): corrupt node: root=7 block=93667393536 slot=86, bad key order, current (18446744073709551606 128 986823262208) next (18446744073707454454 128 986826604544)
 
When I go to the WebUI it is cycling between a message about repairing the database and then server startup.

Which means that the docker corruption was caused by an underlying file-system problem on the cache drive. I'm not in a position to assist right now, but if @johnnie.black is around he can help, no problem.

Sent via telekinesis
 

10 hours ago, andyjayh said:

Mar 24 21:54:59 Media kernel: BTRFS warning (device sdl1): csum failed root 5 ino 17093788 off 4571136 csum 0x3235bd95 expected csum 0x00000000 mirror 2

These are checksum errors, possibly the result of one device dropping offline. Please post the diagnostics: Tools -> Diagnostics.

23 hours ago, johnnie.black said:

These are checksum errors, possibly the result of one device dropping offline. Please post the diagnostics: Tools -> Diagnostics.

Apologies for not doing this yesterday but here is my diagnostics report.

 

I've also just noticed that the second of the three 'Flash / Log / Docker' graphs on my Dashboard is showing 100%! Not sure where this is located, but none of my drives are full or near full, including the flash drive.

media-diagnostics-20190326-0843.zip

18 minutes ago, johnnie.black said:

Syslog rotated, so I can't see the beginning of the problem, but it does look like a cache device dropped offline. Reboot, and after a few minutes of array usage grab and post new diags.

 

Ok, thanks for looking. I've restarted the server and uploaded a new diagnostics file.

 

It's been a learning curve, but I now understand that what I was seeing on the Dashboard was my Docker log filling up, and I suspect this is part of the issue. I don't know whether this could have taken one of the cache drives offline and therefore caused the corruption I'm now experiencing. I need to understand why my log file is filling up so quickly, when in the past my server has been up for months at a time without issue. UniFi Controller and UniFi-Video are the new dockers, so I suspect one of them is writing large amounts of log entries?
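One way to see which container is flooding the log is to list the per-container log files by size. This is a sketch: the path assumes Docker's default json-file logging driver, where a stock layout keeps logs under /var/lib/docker/containers.

```shell
# List container json-log sizes, smallest to largest. Pass the containers
# directory; defaults to Docker's stock location.
largest_logs() {
  du -h "${1:-/var/lib/docker/containers}"/*/*-json.log 2>/dev/null | sort -h
}

# e.g. largest_logs            # uses /var/lib/docker/containers
#      largest_logs /some/dir  # or a custom containers directory
```

The directory names are container IDs; `docker ps --no-trunc` maps them back to container names.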

media-diagnostics-20190326-0944.zip


At some point in the past there were errors writing to both cache devices; these are hardware errors:

Mar 26 09:36:26 Media kernel: BTRFS info (device sdl1): bdev /dev/sdl1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Mar 26 09:36:26 Media kernel: BTRFS info (device sdl1): bdev /dev/sdk1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0

See here for more info on what to do:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

 


Ok, thanks for this. I've created the following script as suggested in the linked post:

#!/bin/bash
btrfs dev stats -c /mnt/cache
if [[ $? -ne 0 ]]; then /usr/local/emhttp/webGui/scripts/notify -i warning -s "ERRORS on cache pool"; fi

I ran it with the following output:

 

[/dev/sdl1].write_io_errs 7
[/dev/sdl1].read_io_errs 0
[/dev/sdl1].flush_io_errs 0
[/dev/sdl1].corruption_errs 0
[/dev/sdl1].generation_errs 0
[/dev/sdk1].write_io_errs 9
[/dev/sdk1].read_io_errs 0
[/dev/sdk1].flush_io_errs 0
[/dev/sdk1].corruption_errs 0
[/dev/sdk1].generation_errs 0

/tmp/user.scripts/tmpScripts/btrfs check/script: line 4: syntax error: unexpected end of file

 

I'm not quite sure what's causing the syntax error?
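An "unexpected end of file" at a line past the end of a short script usually points at Windows-style (CRLF) line endings in the saved file, or an unclosed if, rather than the logic itself. A sketch of the same check written with an explicit if/fi — here the Unraid notify call is replaced by a plain echo so it runs anywhere, which is an assumption for illustration:

```shell
#!/bin/bash
# Report when the pool-status command exits non-zero, as `btrfs dev stats -c`
# does when any error counter is non-zero. On Unraid the real notifier is
# /usr/local/emhttp/webGui/scripts/notify; echo stands in for it here.
check_pool() {
  if ! "$@" > /dev/null 2>&1; then
    echo "ERRORS on cache pool"
  fi
}

# On the server this would be: check_pool btrfs dev stats -c /mnt/cache
```

Re-saving the script with Unix line endings (or running it through `dos2unix`) is worth trying before changing anything else.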

 

As expected, some write errors. As suggested in your linked post this is most likely a cable fault, so until I can buy another pair of cables I should try reseating them? Once I've done that, is the only remaining step to scrub the cache pool, by selecting the primary cache drive and running the scrub tool? I've done this and it's reporting no errors.
