tidusjar Posted June 14, 2022

This issue has been happening on and off for the past few months. All of a sudden my docker containers become unresponsive and stop working. Stopping or restarting the containers shows an error message in the UI along the lines of 'Service failed to start', with no other information. Sometimes a reboot of the server fixes this; other times I need to disable docker, delete the `docker.img` and reinstall the containers.

Today this has happened again and it seems to be becoming more frequent, so I'm hoping someone can point me in the right direction to resolve it.

Server diagnostics are attached. Currently I'm getting `Docker Service failed to start.` and at this point I'm probably going to have to delete the docker image and start again.

If there is any other information I can provide please let me know.

server-diagnostics-20220614-1334.zip
tidusjar (Author) Posted June 14, 2022

After restarting I now have the following:

Server logs are attached for this instance.

server-diagnostics-20220614-1401.zip
ChatNoir Posted June 14, 2022

Seems like there are errors on your cache drive and consequently with your docker image.

Jun 14 14:01:06 Server kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 189, rd 9526, flush 1, corrupt 0, gen 0
Jun 14 14:01:06 Server kernel: BTRFS warning (device sdg1): direct IO failed ino 16006543 rw 0,0 sector 0xb9db660 len 0 err no 10
Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev loop2, sector 1614912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 14 14:01:06 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 16, rd 4697, flush 0, corrupt 0, gen 0
Jun 14 14:01:06 Server kernel: sd 7:0:0:0: [sdg] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jun 14 14:01:06 Server kernel: sd 7:0:0:0: [sdg] tag#12 CDB: opcode=0x28 28 00 0a 97 da e0 00 00 08 00
Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev sdg, sector 177724128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
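For anyone who wants to pull the BTRFS error counters out of a long syslog rather than reading it by eye, here is a minimal Python sketch. The function name `parse_btrfs_errs` and the `SAMPLE` text are made up for illustration; the regular expression is only written against the two `errs:` lines quoted above and may need adjusting for other kernel versions.

```python
import re

# Matches the per-device counters btrfs prints, in the format quoted above.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_KEYS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(log_text):
    """Return one dict per matching log line: the bdev path plus its counters."""
    results = []
    for match in BTRFS_ERRS.finditer(log_text):
        entry = {"bdev": match.group("bdev")}
        for key in COUNTER_KEYS:
            entry[key] = int(match.group(key))
        results.append(entry)
    return results


# Two lines copied from the log excerpt in this thread.
SAMPLE = """\
Jun 14 14:01:06 Server kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 189, rd 9526, flush 1, corrupt 0, gen 0
Jun 14 14:01:06 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 16, rd 4697, flush 0, corrupt 0, gen 0
"""

if __name__ == "__main__":
    for entry in parse_btrfs_errs(SAMPLE):
        print(entry)
```

Non-zero `wr`/`rd` counts on both `/dev/sdg1` and `/dev/loop2` fit the reading above: the loop device backing the docker image sits on the cache drive, so I/O errors on the drive show up in the image as well.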
tidusjar (Author) Posted June 14, 2022

Are there any tests I can run to see what's wrong?
JorgeB Posted June 14, 2022

Jun 14 08:00:08 Server kernel: ata7.00: disabled

Cache device dropped offline, check/replace cables and if it comes back post new diags after array start.
tidusjar (Author) Posted June 14, 2022

9 minutes ago, JorgeB said:
Jun 14 08:00:08 Server kernel: ata7.00: disabled
Cache device dropped offline, check/replace cables and if it comes back post new diags after array start.

Just replaced the cable, started back up and things are running for now. But like I mentioned, it keeps happening.

Update: Actually things are behaving quite strangely, like nothing can write to the cache now.

server-diagnostics-20220614-1452.zip

Edited June 14, 2022 by tidusjar
JorgeB Posted June 14, 2022

Jun 14 14:41:57 Server kernel: ata7.00: disabled

It dropped again, did you replace both cables? Power and SATA. If that doesn't help try a different SATA port or replace the device.
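To see how often the device is dropping, the `ataN.NN: disabled` lines can be collected from a syslog with a few lines of Python. This is only a sketch: `find_ata_drops` and `SAMPLE` are hypothetical names, and the pattern assumes the same timestamp and hostname layout as the lines quoted in this thread.

```python
import re

# Timestamp, hostname, then "kernel: ataN.NN: disabled", as in the quoted lines.
ATA_DISABLED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<port>ata\d+\.\d+): disabled",
    re.MULTILINE,
)


def find_ata_drops(log_text):
    """Return (timestamp, ata port) for every 'disabled' event in the log."""
    return [(m.group("stamp"), m.group("port")) for m in ATA_DISABLED.finditer(log_text)]


# Lines copied from earlier posts in this thread.
SAMPLE = """\
Jun 14 08:00:08 Server kernel: ata7.00: disabled
Jun 14 14:01:06 Server kernel: blk_update_request: I/O error, dev sdg, sector 177724128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 14 14:41:57 Server kernel: ata7.00: disabled
"""

if __name__ == "__main__":
    for stamp, port in find_ata_drops(SAMPLE):
        print(stamp, port)
```

Comparing the timestamps against when a cable or port was changed gives a rough idea of whether the change made any difference.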
tidusjar (Author) Posted June 14, 2022

11 minutes ago, JorgeB said:
Jun 14 14:41:57 Server kernel: ata7.00: disabled
It dropped again, did you replace both cables? Power and SATA. If that doesn't help try a different SATA port or replace the device.

I only did the SATA cable. I've now switched SATA ports and used a different power cable. It now seems to be working; I'll check over the next few days.
tidusjar (Author) Posted June 15, 2022

And it's happened again this morning, docker containers are failing to start (and have stopped). New drive I guess?

server-diagnostics-20220615-0849.zip
itimpi Posted June 15, 2022

The diagnostics are showing that the cache drive appears to be playing up and that the docker.img file is corrupt. Looking at the SMART information for the cache drive I see:

199 UDMA_CRC_Error_Count    -O--CK 100 100 050 -   1
202 Percent_Lifetime_Remain ----CK 000 000 001 NOW 0

but I'm not sure how significant the Remaining Lifetime attribute really is in practice. You could try running an extended SMART test on the drive to see if that can complete without error.
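As a rough illustration of how those two SMART rows can be read, here is a short Python sketch. It assumes the columns are ID, name, flags, normalized value, worst, threshold, fail marker and raw value (the layout the two rows above appear to follow), and it uses the usual SMART convention that an attribute counts as failed when its normalized value is at or below the threshold. The helper names `parse_smart_line` and `is_failing` are invented for this example.

```python
def parse_smart_line(line):
    """Split one attribute row into named fields (column layout is assumed)."""
    parts = line.split(None, 7)
    if len(parts) < 8:
        raise ValueError("unexpected SMART attribute row: %r" % line)
    attr_id, name, flags, value, worst, thresh, fail, raw = parts
    return {
        "id": int(attr_id),
        "name": name,
        "flags": flags,
        "value": int(value),
        "worst": int(worst),
        "thresh": int(thresh),
        "fail": fail,
        "raw": raw.strip(),
    }


def is_failing(attr):
    """An attribute is treated as failed when value <= threshold."""
    return attr["value"] <= attr["thresh"]


# The two rows quoted in the post above.
ROWS = [
    "199 UDMA_CRC_Error_Count -O--CK 100 100 050 - 1",
    "202 Percent_Lifetime_Remain ----CK 000 000 001 NOW 0",
]

if __name__ == "__main__":
    for row in ROWS:
        attr = parse_smart_line(row)
        status = "FAILING" if is_failing(attr) else "ok"
        print(attr["id"], attr["name"], status)
```

By that rule attribute 202 is flagged, which matches the `NOW` in its fail column, while 199 is not. Whether the flag reflects real wear on this particular drive is a separate question, as noted above.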
JorgeB Posted June 15, 2022

Docker image went read-only for an apparent lack of space, I would suggest balancing the cache filesystem.

Cache device didn't drop again but it's still showing several ATA errors. It still looks like a cable/connection problem, so I would suggest trying it on a different controller; swap with another disk if you don't have other free ports.