DigitalDivide Posted June 7, 2020
Every once in a while most of my dockers will just stop. I typically run the dockers listed below. For example, I was just watching a movie via the Emby app on my Nvidia Shield and the movie stopped with a connection error. I went to my server and could see that all my dockers were stopped except for Jackett and Syncthing. However, if I try to stop those two I get an error. The only way to get my server back is to do a cold shutdown, as I can't get the dockers to stop. I've attached the diagnostics file. I was hoping someone might be able to determine what the cause is.
EmbyServer
Jackett
Radarr
Sonarr
Syncthing
tower-diagnostics-20200606-2140.zip
DigitalDivide (Author) Posted June 7, 2020
I see a lot of errors in the docker log file but I have no idea what they mean. The issue seems to have started at about 9:38pm. I can see the following around that time, and it goes on much longer:
time="2020-06-06T21:38:08.741553472-04:00" level=info msg="shim reaped" id=45e6cbe90cec8a18a140012ad5087f9d781f4d5da740de6bcaad7b12c530f2e9
time="2020-06-06T21:38:08.754121658-04:00" level=error msg="failed to delete bundle" error="readdirnames /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/45e6cbe90cec8a18a140012ad5087f9d781f4d5da740de6bcaad7b12c530f2e9: readdirent: input/output error"
time="2020-06-06T21:38:08.754603368-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.769828095-04:00" level=info msg="shim reaped" id=88180c4bf81855d3e7a92bd501ab4bb47eb60560b28180b03cb2f2439ba663b2
time="2020-06-06T21:38:08.780322960-04:00" level=error msg="failed to delete bundle" error="readdirnames /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/88180c4bf81855d3e7a92bd501ab4bb47eb60560b28180b03cb2f2439ba663b2: readdirent: input/output error"
time="2020-06-06T21:38:08.780896891-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.787041004-04:00" level=info msg="shim reaped" id=635d3b7531ca34e8a5b65319e8ef448a6e77ce6d764bf35644d9f68a3edcadf4
time="2020-06-06T21:38:08.797493369-04:00" level=error msg="failed to delete bundle" error="unlinkat /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/635d3b7531ca34e8a5b65319e8ef448a6e77ce6d764bf35644d9f68a3edcadf4/shim.stderr.log: read-only file system"
time="2020-06-06T21:38:08.798231519-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2020-06-06T21:38:08.798331899-04:00" level=warning msg="failed to get endpoint_count map for scope local: open /var/lib/docker/network/files/local-kv.db: read-only file system"
time="2020-06-06T21:38:08.798676389-04:00" level=warning msg="Failed detaching sandbox e528b447c4dadb4d80f94c677c913eb0729cb652ea85d987c5262ad89ade1a4a from endpoint 121546bc8f2a267bc8f7daa58c88eb044d7b43221830b8b5b1e25db95e4a365b: failed to update store for object type *libnetwork.endpoint: open /var/lib/docker/network/files/local-kv.db: read-only file system\n"
time="2020-06-06T21:38:08.798782689-04:00" level=warning msg="Failed deleting endpoint 121546bc8f2a267bc8f7daa58c88eb044d7b43221830b8b5b1e25db95e4a365b: endpoint with name EmbyServer id
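A rough way to see that errors like these point at the storage under Docker rather than at any one container is to count the storage-related messages in the Docker log. The sketch below is only an illustration: the function name `count_storage_errors` and the two marker strings are choices made for this example, and the sample lines are abridged from the log in the post (the `...` marks where long paths were shortened).

```python
from collections import Counter

# Substrings that usually mean the filesystem underneath Docker is in trouble.
# This list is an assumption for the example, not an exhaustive catalogue.
STORAGE_MARKERS = ("read-only file system", "input/output error")


def count_storage_errors(lines):
    """Count how many log lines contain each storage-related marker."""
    counts = Counter()
    for line in lines:
        for marker in STORAGE_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Abridged sample lines; "..." replaces the long container paths.
DOCKER_SAMPLE = '''\
time="2020-06-06T21:38:08.741553472-04:00" level=info msg="shim reaped" id=...
time="2020-06-06T21:38:08.754121658-04:00" level=error msg="failed to delete bundle" error="readdirnames ...: readdirent: input/output error"
time="2020-06-06T21:38:08.797493369-04:00" level=error msg="failed to delete bundle" error="unlinkat .../shim.stderr.log: read-only file system"
time="2020-06-06T21:38:08.798331899-04:00" level=warning msg="failed to get endpoint_count map for scope local: open /var/lib/docker/network/files/local-kv.db: read-only file system"
'''.splitlines()

if __name__ == "__main__":
    for marker, n in count_storage_errors(DOCKER_SAMPLE).items():
        print(f"{n:4d}  {marker}")
```

If every container starts failing with these messages at the same second, the next place to look is the syslog for the disk that holds the Docker data.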
DigitalDivide (Author) Posted June 7, 2020
And from the syslog around the same time, here's a snippet:
Jun 6 21:37:08 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 6 21:37:08 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Jun 6 21:37:08 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 9
Jun 6 21:37:08 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 6 21:37:08 Tower kernel: ata4.00: status: { DRDY }
Jun 6 21:37:08 Tower kernel: ata4: hard resetting link
Jun 6 21:37:13 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun 6 21:37:18 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun 6 21:37:18 Tower kernel: ata4: hard resetting link
Jun 6 21:37:23 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun 6 21:37:28 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun 6 21:37:28 Tower kernel: ata4: hard resetting link
Jun 6 21:37:33 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jun 6 21:38:03 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun 6 21:38:03 Tower kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jun 6 21:38:03 Tower kernel: ata4: hard resetting link
Jun 6 21:38:08 Tower kernel: ata4: COMRESET failed (errno=-16)
Jun 6 21:38:08 Tower kernel: ata4: reset failed, giving up
Jun 6 21:38:08 Tower kernel: ata4.00: disabled
Jun 6 21:38:08 Tower kernel: ata4: EH complete
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#8 CDB: opcode=0x2a 2a 00 02 c1 40 58 00 00 08 00
Jun 6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 46219352
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#9 CDB: opcode=0x2a 2a 00 06 85 45 50 00 00 08 00
Jun 6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 109397328
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#11 CDB: opcode=0x2a 2a 00 06 c1 4a 18 00 00 78 00
Jun 6 21:38:08 Tower kernel: print_req_error: I/O error, dev sdc, sector 113330712
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jun 6 21:38:08 Tower kernel: sd 4:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Edited June 7, 2020 by DigitalDivide: forgot something
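The "BTRFS error ... errs:" lines in this snippet carry running per-device error counters (write, read, flush, corruption, generation). As a minimal sketch, assuming the message format stays exactly as it appears in this syslog, the counters can be pulled out with a regular expression. The helper name `latest_btrfs_counters` is made up for this example.

```python
import re

# Pattern written against the lines quoted in this thread, e.g.
# "BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0"
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(lines):
    """Return the most recent counter values seen for each block device."""
    latest = {}
    for line in lines:
        m = BTRFS_ERRS.search(line)
        if m:
            # Later lines overwrite earlier ones, so the last value wins.
            latest[m.group("bdev")] = {k: int(m.group(k)) for k in COUNTER_NAMES}
    return latest


# Three lines copied from the syslog snippet in the post.
BTRFS_SAMPLE = """\
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 6 21:38:08 Tower kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
""".splitlines()

if __name__ == "__main__":
    for bdev, counters in latest_btrfs_counters(BTRFS_SAMPLE).items():
        print(bdev, counters)
```

Here only the write counter climbs, and it starts right after "ata4.00: disabled", which fits a device that has gone away rather than a filesystem that was already corrupt.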
JorgeB Posted June 7, 2020
The cache device dropped offline. Check the cables, then post a SMART report (or new diags).
DigitalDivide (Author) Posted June 7, 2020
Rechecked the cables and replaced the one attached to the cache drive. Rebooted. Here are my diagnostics after the reboot.
tower-diagnostics-20200607-1111.zip
DigitalDivide (Author) Posted June 7, 2020
Now my Emby Server stopped again. Not sure what's going on. Here are my diagnostics. I've replaced the drive, rebuilt the docker image, etc. What's odd is that it always seems to be the Emby docker that goes, or goes first.
tower-diagnostics-20200607-1702.zip
JorgeB Posted June 8, 2020
The cache disk is still dropping. Did you replace both cables, power and SATA?
DigitalDivide (Author) Posted June 8, 2020
I replaced the SATA cable, and have now changed the power cable as well. Well, not replaced: I moved it to a different drive to see what happens. The power cable for that drive has an extension on it, as it was too short. Now I'll see if the problem follows it to the other drive. What's odd is that everything has been fine for years. No cables or anything were moved. The only thing I did was upgrade the server software to the latest version, from 5.6 I think it was. Oh well, we'll see.
Abnorm Posted June 8, 2020
I had partly the same issue before, but it was related to some cheapo Marvell-based SATA controller cards. I replaced everything and moved over to LSI RAID cards (with the RAID function flashed out). Could there be an issue with your SATA controller and the new kernel, perhaps?
DigitalDivide (Author) Posted June 13, 2020
So the dockers are still crashing. Attached is the config. I'm not sure which part to actually look at to show that the cache drive is dropping. I did replace the SATA cable, and plugged it into a different port on the server as well. I also used a different SATA power connector, though not an entirely different cable: the power cable branches off into 3 different connectors, so I plugged the cache drive into a different one, and plugged the one that had been on the cache drive into one of my unassigned devices. I can't imagine it's the drive, as I replaced it. None of my other drives are dropping, just the cache. Very frustrating indeed! Any other ideas?
Edit: just realized my diagnostics file is from after the reboot, so it won't show anything. For the future, what log should I be looking at to determine if it's dropping and/or why? My MB is the asrock c2550d41
tower-diagnostics-20200613-1320.zip
Edited June 13, 2020 by DigitalDivide: forgot something
JorgeB Posted June 14, 2020
13 hours ago, DigitalDivide said: Edit: just realized my diagnostics file is after the reboot so won't show anything....for the future, what log should I be looking at to determine if it's dropping and or why?
Look in the syslog for errors like these:
Jun 7 14:24:30 Tower kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Jun 7 14:24:30 Tower kernel: ata3.00: irq_stat 0x00400040, connection status changed
Jun 7 14:24:30 Tower kernel: ata3: SError: { PHYRdyChg 10B8B DevExch }
Jun 7 14:24:30 Tower kernel: ata3.00: failed command: FLUSH CACHE EXT
Jun 7 14:24:30 Tower kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 7
Jun 7 14:24:30 Tower kernel: res 40/00:34:20:86:36/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Jun 7 14:24:30 Tower kernel: ata3.00: status: { DRDY }
Jun 7 14:24:30 Tower kernel: ata3: hard resetting link
Jun 7 14:24:35 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 7 14:24:37 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 7 14:24:37 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Jun 7 14:24:37 Tower kernel: ata3.00: revalidation failed (errno=-5)
Jun 7 14:24:42 Tower kernel: ata3: hard resetting link
Jun 7 14:24:44 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 7 14:24:44 Tower kernel: ata3.00: link offline, clearing class 1 to NONE
Jun 7 14:24:44 Tower kernel: ata3: hard resetting link
Jun 7 14:24:44 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)
Jun 7 14:24:44 Tower kernel: ata3.00: link offline, clearing class 1 to NONE
Jun 7 14:24:44 Tower kernel: ata3.00: disabled
The earlier lines show the connection issues; the last line, "ata3.00: disabled", is the device dropping offline.
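The check described in this reply can be sketched as a small script that scans syslog lines for link resets and for the final "disabled" message. This is only an illustration: the function name `find_ata_drops` is invented here, the regular expression is written against the lines quoted in this thread, and reading from `/var/log/syslog` is an assumption about where the live log sits on the server.

```python
import re

# Matches kernel libata messages such as
# "Jun 7 14:24:44 Tower kernel: ata3.00: disabled"
# and captures the port name ("ata3") and the message text.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def find_ata_drops(lines):
    """Return (resets, drops).

    resets: dict mapping a port name to its number of "hard resetting link" lines.
    drops:  list of (port, full_line) for every "disabled" message.
    """
    resets = {}
    drops = []
    for line in lines:
        m = ATA_LINE.search(line)
        if not m:
            continue
        port, msg = m.group(1), m.group(2).strip()
        if msg == "hard resetting link":
            resets[port] = resets.get(port, 0) + 1
        elif msg == "disabled":
            drops.append((port, line.strip()))
    return resets, drops


# A few lines taken from the example in the post.
ATA_SAMPLE = """\
Jun 7 14:24:30 Tower kernel: ata3: hard resetting link
Jun 7 14:24:37 Tower kernel: ata3.00: revalidation failed (errno=-5)
Jun 7 14:24:42 Tower kernel: ata3: hard resetting link
Jun 7 14:24:44 Tower kernel: ata3: hard resetting link
Jun 7 14:24:44 Tower kernel: ata3.00: disabled
""".splitlines()

if __name__ == "__main__":
    # To scan the live log instead (path is an assumption):
    #   with open("/var/log/syslog", errors="replace") as fh:
    #       resets, drops = find_ata_drops(fh)
    resets, drops = find_ata_drops(ATA_SAMPLE)
    print("resets per port:", resets)
    for port, line in drops:
        print("device dropped on", port, "->", line)
```

Because the syslog is lost on reboot, a scan like this only helps if it runs before the server is restarted, or against a diagnostics file captured while the problem is still present.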
DigitalDivide (Author) Posted June 27, 2020
My issue is continuing, with my dockers constantly crashing. Looking at the syslog I'm seeing tons of errors, but I'm not exactly sure what they mean. Would someone kindly look at the attached file and provide some guidance? The errors start on the 27th of June, near the bottom.
syslog.txt
Edited June 27, 2020 by DigitalDivide: forgot something
JorgeB Posted June 27, 2020
The cache device is still dropping offline:
Jun 27 05:14:26 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:14:31 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:14:31 Tower kernel: ata3: hard resetting link
Jun 27 05:14:37 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:14:41 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:14:41 Tower kernel: ata3: hard resetting link
Jun 27 05:14:47 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Jun 27 05:15:16 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:15:16 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Jun 27 05:15:16 Tower kernel: ata3: hard resetting link
Jun 27 05:15:21 Tower kernel: ata3: COMRESET failed (errno=-16)
Jun 27 05:15:21 Tower kernel: ata3: reset failed, giving up
Jun 27 05:15:21 Tower kernel: ata3.00: disabled
Jun 27 05:15:21 Tower kernel: ata3: EH complete
If you have already replaced the cables as recommended, it could be a failing disk.
DigitalDivide (Author) Posted June 27, 2020
Went out and bought an SSD. Will replace the drive and see what happens.
JustusAurelius Posted July 19, 2020
On 6/28/2020 at 12:40 AM, DigitalDivide said: Went out and bought an SSD. Will replace and see what happens
Hey, I currently have what is probably a similar problem, where my docker crashes without an obvious reason. Did replacing your disk solve the issue? Good luck and have a great Sunday, JP
DigitalDivide (Author) Posted July 19, 2020
Hi, the problem went away after replacing my drive with a new SSD. Haven't had any problems since.
GermGerm Posted August 3, 2020
On 7/19/2020 at 6:48 AM, DigitalDivide said: Hi, problem went away after replacing my drive with a new ssd. Haven't had any problems since.
What make and model did you get? I'm currently having a bear of a time with my Samsung SSDs and even got a newer Samsung that is also giving these errors.
DigitalDivide (Author) Posted August 3, 2020
I don't recall what model, but it was a WD 1TB. I see this in my server GUI:
WDC WDS100T2B0A-00SM50, 201891804116, 401020WD, max UDMA/133
Still no problems and running smoothly.
JustusAurelius Posted August 6, 2020
So, it seems like I finally found the answer to my problem. The SSDs seem functional, so I checked my server log around midnight and spotted an attempt by the auto-update plugin to update all dockers, even those I added via docker compose. It seems this nightly update bricked my dockers. Everything has been fine for a few days, and I'm praying it stays this way as I am on vacation. Hope this helps someone with a similar problem.