Dockers Stopping/Crashing Issue



Every once in a while most of my dockers will just stop. I typically run the dockers listed below. For example, I was just watching a movie via the Emby app on my Nvidia Shield and the movie stopped with a connection error. I went to my server and could see that all my dockers were stopped except for Jackett and Syncthing. However, if I try to stop those two I get an error. The only way to get my server back is to do a cold shutdown, as I can't get the dockers to stop. I've attached the diagnostics file. I was hoping someone might be able to determine what the cause is.

EmbyServer

Jackett

Radarr

Sonarr

Syncthing

tower-diagnostics-20200606-2140.zip

 

 

Link to comment

I see a lot of errors in the docker log file but I have no idea what they mean. The issue seems to have started at about 9:38pm...

I can see the following around that time and it goes on much longer..

time="2020-06-06T21:38:08.741553472-04:00" level=info msg="shim reaped" id=45e6cbe90cec8a18a140012ad5087f9d781f4d5da740de6bcaad7b12c530f2e9 
time="2020-06-06T21:38:08.754121658-04:00" level=error msg="failed to delete bundle" error="readdirnames /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/45e6cbe90cec8a18a140012ad5087f9d781f4d5da740de6bcaad7b12c530f2e9: readdirent: input/output error" 
time="2020-06-06T21:38:08.754603368-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.769828095-04:00" level=info msg="shim reaped" id=88180c4bf81855d3e7a92bd501ab4bb47eb60560b28180b03cb2f2439ba663b2 
time="2020-06-06T21:38:08.780322960-04:00" level=error msg="failed to delete bundle" error="readdirnames /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/88180c4bf81855d3e7a92bd501ab4bb47eb60560b28180b03cb2f2439ba663b2: readdirent: input/output error" 
time="2020-06-06T21:38:08.780896891-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.787041004-04:00" level=info msg="shim reaped" id=635d3b7531ca34e8a5b65319e8ef448a6e77ce6d764bf35644d9f68a3edcadf4 
time="2020-06-06T21:38:08.797493369-04:00" level=error msg="failed to delete bundle" error="unlinkat /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/635d3b7531ca34e8a5b65319e8ef448a6e77ce6d764bf35644d9f68a3edcadf4/shim.stderr.log: read-only file system" 
time="2020-06-06T21:38:08.798231519-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.798331899-04:00" level=warning msg="failed to get endpoint_count map for scope local: open /var/lib/docker/network/files/local-kv.db: read-only file system"
time="2020-06-06T21:38:08.798676389-04:00" level=warning msg="Failed detaching sandbox e528b447c4dadb4d80f94c677c913eb0729cb652ea85d987c5262ad89ade1a4a from endpoint 121546bc8f2a267bc8f7daa58c88eb044d7b43221830b8b5b1e25db95e4a365b: failed to update store for object type *libnetwork.endpoint: open /var/lib/docker/network/files/local-kv.db: read-only file system\n"
time="2020-06-06T21:38:08.798782689-04:00" level=warning msg="Failed deleting endpoint 121546bc8f2a267bc8f7daa58c88eb044d7b43221830b8b5b1e25db95e4a365b: endpoint with name EmbyServer id 
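For anyone wanting to sift a docker log like the one above, here is a minimal Python sketch. It assumes the `key=value` / `key="quoted value"` layout seen in these entries. The helper names (`parse_fields`, `storage_failures`) and the two hint strings are my own choices for illustration, not anything Docker provides. It only picks out entries whose `msg` or `error` text mentions a read-only file system or an I/O error, which tends to point at the storage under /var/lib/docker rather than at the containers themselves.

```python
import re

# One key=value pair; the value is either a double-quoted string
# (with backslash escapes) or a run of non-space characters.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Phrases taken from the log lines quoted in this thread (an assumption,
# not an exhaustive list of storage-related errors).
STORAGE_HINTS = ("read-only file system", "input/output error")


def parse_fields(line):
    """Split one docker log line into a dict of its key=value fields."""
    fields = {}
    for key, value in FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def storage_failures(lines):
    """Return (time, level, msg) for entries that mention a storage problem."""
    hits = []
    for line in lines:
        fields = parse_fields(line)
        text = fields.get("msg", "") + " " + fields.get("error", "")
        if any(hint in text for hint in STORAGE_HINTS):
            hits.append((fields.get("time"), fields.get("level"), fields.get("msg")))
    return hits
```

Feeding it the lines of the docker log (for example `storage_failures(open(path))`, where `path` is wherever your log lives) gives a short list of the entries worth reading first.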

Link to comment

And from the syslog around the same time... here's a snippet...

Jun  6 21:37:08 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun  6 21:37:08 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Jun  6 21:37:08 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 9
Jun  6 21:37:08 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  6 21:37:08 Tower kernel: ata4.00: status: { DRDY }
Jun  6 21:37:08 Tower kernel: ata4: hard resetting link
Jun  6 21:37:13 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun  6 21:37:18 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun  6 21:37:18 Tower kernel: ata4: hard resetting link
Jun  6 21:37:23 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun  6 21:37:28 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun  6 21:37:28 Tower kernel: ata4: hard resetting link
Jun  6 21:37:33 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun  6 21:38:03 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun  6 21:38:03 Tower kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jun  6 21:38:03 Tower kernel: ata4: hard resetting link
Jun  6 21:38:08 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun  6 21:38:08 Tower kernel: ata4: reset failed, giving up
Jun  6 21:38:08 Tower kernel: ata4.00: disabled
Jun  6 21:38:08 Tower kernel: ata4: EH complete
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#8 CDB: opcode=0x2a 2a 00 02 c1 40 58 00 00 08 00
Jun  6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 46219352
Jun  6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#9 CDB: opcode=0x2a 2a 00 06 85 45 50 00 00 08 00
Jun  6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 109397328
Jun  6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#11 CDB: opcode=0x2a 2a 00 06 c1 4a 18 00 00 78 00
Jun  6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 113330712
Jun  6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun  6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
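The `BTRFS error ... errs: wr N, rd N, ...` lines in the snippet above carry running error counters per device. As a rough sketch (the function name `latest_btrfs_counters` is made up for illustration, and the pattern assumes exactly the message layout shown above), this pulls out the most recent counters for each device from syslog lines:

```python
import re

# Matches kernel lines such as:
#   BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device \S+\): bdev (\S+) errs: "
    r"wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(lines):
    """Map each block device to the last error counters seen for it."""
    latest = {}
    for line in lines:
        match = BTRFS_ERRS.search(line)
        if match:
            device = match.group(1)
            values = [int(v) for v in match.groups()[1:]]
            latest[device] = dict(zip(COUNTER_NAMES, values))
    return latest
```

In the snippet above only the write counter (`wr`) climbs, which fits a device that has stopped accepting writes rather than one returning corrupt data.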

Edited by DigitalDivide
forgot something
Link to comment

I replaced the SATA cable and have now changed the power cable as well... well, not replaced it, but moved it to a different drive to see what happens. The power cable for that drive has an extension on it as it was too short. Now I'll see if the problem follows to the other drive.

What's odd is that everything has been fine for years. No cables or anything else were moved. The only thing I did was upgrade the server software to the latest version, from 5.6 I think it was... oh well, we'll see.

Link to comment

I had partly the same issue before, but it was related to some cheap Marvell-based SATA controller cards. I've replaced everything and moved over to LSI RAID cards (flashed to remove the RAID function). Could there be an issue with your SATA controller and the new kernel, perhaps?

Link to comment

So the dockers are still crashing. Attached is the config. I'm not sure which part to actually look at to show that the cache drive is dropping.

I did replace the SATA cable and plugged it into a different port on the server as well. I also used a different SATA power cable... well, not entirely different. The power cable branches off into 3 different connectors; I plugged the cache drive into a different one, and the one that had been plugged into the cache drive now goes to one of my unassigned devices.

I can't imagine it's the drive as I replaced it. None of my other drives are dropping. Just the cache. Very frustrating indeed!

Any other ideas?

Edit: just realized my diagnostics file is from after the reboot so it won't show anything... For the future, what log should I be looking at to determine if it's dropping and/or why?

My motherboard is the ASRock C2550D4I.

tower-diagnostics-20200613-1320.zip

Link to comment
13 hours ago, DigitalDivide said:

Edit: just realized my diagnostics file is from after the reboot so it won't show anything... For the future, what log should I be looking at to determine if it's dropping and/or why?

Look in the syslog for errors like these:

Jun  7 14:24:30 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Jun  7 14:24:30 Tower kernel: ata3.00: irq_stat 0x00400040, connection status changed
Jun  7 14:24:30 Tower kernel: ata3: SError: { PHYRdyChg 10B8B DevExch }
Jun  7 14:24:30 Tower kernel: ata3.00: failed command: FLUSH CACHE EXT
Jun  7 14:24:30 Tower kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 7
Jun  7 14:24:30 Tower kernel:         res 40/00:34:20:86:36/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun  7 14:24:30 Tower kernel: ata3.00: status: { DRDY }
Jun  7 14:24:30 Tower kernel: ata3: hard resetting link
Jun  7 14:24:35 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun  7 14:24:37 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun  7 14:24:37 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Jun  7 14:24:37 Tower kernel: ata3.00: revalidation failed (errno=-5)
Jun  7 14:24:42 Tower kernel: ata3: hard resetting link
Jun  7 14:24:44 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun  7 14:24:44 Tower kernel: ata3.00: link offline, clearing class 1 to NONE
Jun  7 14:24:44 Tower kernel: ata3: hard resetting link
Jun  7 14:24:44 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun  7 14:24:44 Tower kernel: ata3.00: link offline, clearing class 1 to NONE
Jun  7 14:24:44 Tower kernel: ata3.00: disabled

 

The earlier lines show the connection issues; the last line, "ata3.00: disabled", is the device dropping offline.
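If the syslog is long, a small script can do the looking for you. This is only a sketch in Python: the function name `summarize_ata_events` is made up, and the message list is limited to the kernel messages quoted in this thread, so other wordings will be missed. It counts link-trouble messages per ATA port and reports which ports ended up disabled.

```python
import re

# Link-trouble messages as they appear in the syslog excerpts in this thread.
# The optional ".00" suffix covers both "ata3:" and "ata3.00:" prefixes.
ATA_EVENT = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: "
    r"(hard resetting link|COMRESET failed|SATA link down"
    r"|reset failed, giving up|disabled)"
)


def summarize_ata_events(lines):
    """Return (event count per port, set of ports that were disabled)."""
    counts = {}
    disabled = set()
    for line in lines:
        match = ATA_EVENT.search(line)
        if not match:
            continue
        port, event = match.groups()
        counts[port] = counts.get(port, 0) + 1
        if event == "disabled":
            disabled.add(port)
    return counts, disabled
```

Run it over the syslog before rebooting (for example `summarize_ata_events(open(path))`, with `path` pointing at your saved syslog), since a log captured after the reboot no longer contains these lines. A port that shows up in the disabled set is the one whose device dropped offline.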

Link to comment
  • 2 weeks later...

My issue is continuing, with my dockers constantly crashing. Looking at the syslog I'm seeing tons of errors but I'm not exactly sure what they mean. Would someone kindly be able to look at the attached file and provide some guidance? The errors start on the 27th of June, near the bottom.

syslog.txt

Link to comment

Cache device is still dropping offline:

 

Jun 27 05:14:26 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:14:31 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:14:31 Tower kernel: ata3: hard resetting link
Jun 27 05:14:37 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:14:41 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:14:41 Tower kernel: ata3: hard resetting link
Jun 27 05:14:47 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:15:16 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:15:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Jun 27 05:15:16 Tower kernel: ata3: hard resetting link
Jun 27 05:15:21 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:15:21 Tower kernel: ata3: reset failed, giving up
Jun 27 05:15:21 Tower kernel: ata3.00: disabled
Jun 27 05:15:21 Tower kernel: ata3: EH complete

 

If you have already replaced the cables as recommended, it could be a failing disk.

Link to comment
  • 3 weeks later...
  • 3 weeks later...
On 7/19/2020 at 6:48 AM, DigitalDivide said:

Hi, problem went away after replacing my drive with a new ssd. Haven't had any problems since.

What make and model did you get? I'm currently having a bear of a time with my Samsung SSDs and even got a newer Samsung that is also giving these errors.

Link to comment

So, it seems like I finally found the answer to my problem. The SSDs seem functional, so I checked my server log around midnight and spotted the auto-update plugin trying to update all dockers, even those I added via docker compose. It seems this nightly update is what broke my dockers. Everything has been fine for a few days; praying to god it stays this way as I am on vacation. Hope this helps someone with a similar problem.

Link to comment
