Docker randomly failing and then failing to start

tidusjar · June 14, 2022

So this issue has been happening on and off for the past few months.

What seems to happen is that my Docker containers suddenly become unresponsive. Stopping or restarting them shows an error message in the UI along the lines of 'Service failed to start', with no other information.

Sometimes a reboot of the server fixes this; other times I need to disable Docker, delete the `docker.img` and reinstall the containers.

Today this has happened again, and it seems to be becoming more frequent, so I'm hoping someone can point me in the right direction to resolve this.

Server diagnostics are attached. Currently the error is `Docker Service failed to start.` At this point I'm probably going to have to delete the Docker image and start again.

If there is any other information I can provide please let me know.

server-diagnostics-20220614-1334.zip

tidusjar · June 14, 2022

After restarting I now have the following:

Server logs are attached for this instance

server-diagnostics-20220614-1401.zip

ChatNoir · June 14, 2022

Seems like there are errors on your cache drive and consequently with your docker image.

Jun 14 14:01:06 Server kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 189, rd 9526, flush 1, corrupt 0, gen 0
Jun 14 14:01:06 Server kernel: BTRFS warning (device sdg1): direct IO failed ino 16006543 rw 0,0 sector 0xb9db660 len 0 err no 10
Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev loop2, sector 1614912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 14 14:01:06 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 16, rd 4697, flush 0, corrupt 0, gen 0
Jun 14 14:01:06 Server kernel: sd 7:0:0:0: [sdg] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jun 14 14:01:06 Server kernel: sd 7:0:0:0: [sdg] tag#12 CDB: opcode=0x28 28 00 0a 97 da e0 00 00 08 00
Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev sdg, sector 177724128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
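For anyone reading along, the `errs:` lines above carry per-device counters (write, read, flush, corruption and generation errors). Below is a minimal Python sketch for pulling those counters out of a syslog line so they can be compared between diagnostics. The function name `parse_btrfs_errs` and the regex are illustrative assumptions based only on the log format shown in this thread, not part of any Unraid or btrfs tooling.

```python
import re

# Assumption: the line layout matches the kernel messages quoted in this thread.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return the counters from a BTRFS 'errs:' line, or None if it is not one."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    counters = {name: int(match.group(name)) for name in COUNTER_NAMES}
    counters["bdev"] = match.group("bdev")
    return counters


# Sample line copied from the log excerpt above.
sample = (
    "Jun 14 14:01:06 Server kernel: BTRFS error (device sdg1): "
    "bdev /dev/sdg1 errs: wr 189, rd 9526, flush 1, corrupt 0, gen 0"
)
print(parse_btrfs_errs(sample))
```

Non-zero `wr` and `rd` counts on both `/dev/sdg1` and `/dev/loop2` are consistent with the reading above: I/O errors on the cache device show up again inside the docker image that lives on it.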

tidusjar · June 14, 2022

Any sort of tests I can do to see what's wrong?

JorgeB · June 14, 2022

Jun 14 08:00:08 Server kernel: ata7.00: disabled

Cache device dropped offline, check/replace cables and if it comes back post new diags after array start.

tidusjar · June 14, 2022

9 minutes ago, JorgeB said:
Jun 14 08:00:08 Server kernel: ata7.00: disabled
Cache device dropped offline, check/replace cables and if it comes back post new diags after array start.

Just replaced the cable, started back up and things are running for now. But like I mentioned, it keeps happening.

Update: Actually, things are behaving quite strangely; it looks like nothing can write to the cache now.

server-diagnostics-20220614-1452.zip

Edited June 14, 2022 by tidusjar

JorgeB · June 14, 2022

Jun 14 14:41:57 Server kernel: ata7.00: disabled

It dropped again, did you replace both cables? Power and SATA. If that doesn't help try a different SATA port or replace the device.
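To check whether the device drops again after a cable or port change, the syslog from each diagnostics zip can be scanned for the same `disabled` message. This is a rough Python sketch; `find_ata_drops` is a hypothetical helper, and the pattern assumes the syslog line layout quoted in this thread.

```python
import re

# Assumption: lines look like "Jun 14 08:00:08 Server kernel: ata7.00: disabled".
ATA_DISABLED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<link>ata\d+\.\d+): disabled"
)


def find_ata_drops(lines):
    """Return (timestamp, ata link) pairs for every 'disabled' event found."""
    drops = []
    for line in lines:
        match = ATA_DISABLED.match(line)
        if match:
            drops.append((match.group("stamp"), match.group("link")))
    return drops


# The two events quoted in this thread, plus one unrelated line.
syslog_lines = [
    "Jun 14 08:00:08 Server kernel: ata7.00: disabled",
    "Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev sdg, sector 177724128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0",
    "Jun 14 14:41:57 Server kernel: ata7.00: disabled",
]
for stamp, link in find_ata_drops(syslog_lines):
    print(stamp, link)
```

An empty result after swapping cables or ports suggests the device stayed online for that boot, although other ATA errors can still appear without a full drop.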

tidusjar · June 14, 2022

11 minutes ago, JorgeB said:
Jun 14 14:41:57 Server kernel: ata7.00: disabled
It dropped again, did you replace both cables? Power and SATA. If that doesn't help try a different SATA port or replace the device.

I only replaced the SATA cable. I've now switched SATA ports and used a different power cable. It seems to be working now; I'll check over the next few days.

tidusjar · June 15, 2022

And it's happened again this morning: Docker containers are failing to start (and have stopped). New drive, I guess?

server-diagnostics-20220615-0849.zip

itimpi · June 15, 2022

The diagnostics are showing that the cache drive appears to be playing up and that the docker.img file is corrupt.

Looking at the SMART information for the cache drive I see:

199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    1
202 Percent_Lifetime_Remain ----CK   000   000   001    NOW  0

but not sure how significant the Remaining Lifetime attribute really is in practice. You could try running an extended SMART test on the drive to see if that can complete without error.
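As a small illustration of how the two rows above are read: smartctl marks an attribute as failing when its normalized VALUE is at or below THRESH, which is why attribute 202 shows `NOW` in the FAIL column. The Python sketch below parses rows in that layout and reproduces the check. The helper names are hypothetical, and it assumes the eight-column layout shown here (ID, name, flags, value, worst, threshold, fail, raw).

```python
def parse_smart_row(row):
    """Split one SMART attribute row into named fields (eight-column layout assumed)."""
    fields = row.split()
    if len(fields) < 8:
        raise ValueError(f"unexpected SMART row: {row!r}")
    attr_id, name, flags, value, worst, thresh, fail = fields[:7]
    return {
        "id": int(attr_id),
        "name": name,
        "flags": flags,
        "value": int(value),
        "worst": int(worst),
        "thresh": int(thresh),
        "fail": fail,
        "raw": " ".join(fields[7:]),
    }


def failing_attributes(rows):
    """Return names of attributes whose normalized value is at or below the threshold."""
    parsed = [parse_smart_row(row) for row in rows]
    return [attr["name"] for attr in parsed if attr["value"] <= attr["thresh"]]


# Rows copied from the SMART output quoted above.
smart_rows = [
    "199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    1",
    "202 Percent_Lifetime_Remain ----CK   000   000   001    NOW  0",
]
print(failing_attributes(smart_rows))
```

This only restates what the FAIL column already reports; it says nothing about how much practical weight the vendor's lifetime estimate deserves.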

JorgeB · June 15, 2022

Docker image went read-only because of an apparent lack of space; I would suggest balancing the cache filesystem.

Cache device didn't drop again, but it's still showing several ATA errors, so it still looks like a cable/connection problem. I would suggest trying it on a different controller; swap with another disk if you don't have other free ports.
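For reference, a balance of the kind suggested above is normally run with `btrfs balance start` and a data usage filter. Here is a hedged Python sketch that only builds the command by default. The mount point `/mnt/cache` and the 75% usage filter are assumptions for illustration, not values taken from the diagnostics, and the helper names are hypothetical.

```python
import subprocess


def balance_command(mount_point, data_usage_pct=75):
    """Build a 'btrfs balance start' command filtered on data chunk usage."""
    if not 0 <= data_usage_pct <= 100:
        raise ValueError("data_usage_pct must be between 0 and 100")
    return ["btrfs", "balance", "start", f"-dusage={data_usage_pct}", mount_point]


def run_balance(mount_point, data_usage_pct=75, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = balance_command(mount_point, data_usage_pct)
    if not dry_run:
        # Needs root and a healthy, mounted btrfs filesystem.
        subprocess.run(cmd, check=True)
    return cmd


# Assumption: the cache pool is mounted at /mnt/cache.
print(" ".join(run_balance("/mnt/cache")))
```

With the device still logging ATA errors, it is probably safer to sort out the connection problem first and run the balance afterwards.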
