January 3, 2018

I've recently seen a few posts similar to the issue I'm having, but none of the suggestions seem to help me. I have a few dockers running, but after about an hour or so some of them stop. I am unable to restart them, update them, stop the array, or stop other dockers (execution error). While I can ssh in to the box, I cannot successfully power down (powerdown -r, reboot, etc.) without holding the power button.

I have:
- Tried cycling through the dockers to see if a specific one is causing my issue, but it happens no matter which docker is running.
- Recently rebuilt my docker image and added dockers back from my templates.
- Tried to decipher the diagnostics (attached).
- Used htop to kill dockers.

I have not:
- Disabled docker to see if the system still becomes unresponsive.
- Done other tests. I leave 11ish hours between troubleshooting steps to let the parity check run.

Something I read on a similar thread: I do have unassigned devices with SMB shares. They do not unmount themselves, and I believe I have them set as rw/slave in each docker. But that idea wasn't followed up in the thread. Maybe my problem lies within UD?

A few weeks ago I had a pretty new 4 TB Red go bad on me. I replaced it and rebuilt from parity. That is the only change I can think of since this became an issue. Not sure if that could have any bearing on it.

Thanks for reading...

Attachment: mikey-diagnostics-20180102-2001.zip
January 5, 2018 (Author)

Grasping for a solution, I have tried a few more things:
- Fixed permissions (docker safe)
- Nuked the docker image
- Increased the image from 50 GB to 80 GB (probably unnecessary, it reports 7 GB used)
- Rebalanced the cache
- Scrubbed the cache
- Reloaded all dockers but stopped them all
- Turning dockers on one by one, with a few hours between each

I'm hoping that I'll blindly find the solution, but I'm afraid I won't learn from it. Looking at the docker log, this was everywhere for all containers:

Quote:
\"***aborting after fassert() failure\" for logger json-file: write /var/lib/docker/containers/68469640315c9914c7e4dc2247437c1df4ad68ff8e22ea27e3ddf4954a7be928/68469640315c9914c7e4dc2247437c1df4ad68ff8e22ea27e3ddf4954a7be928-json.log: read-only file system"

Everything is fine so far, just hit 14 hours. I don't expect it to be fine for long.
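The "read-only file system" message in that log suggests the filesystem backing the docker containers had been switched to read-only. One way to confirm that from an ssh session might be to look for the `ro` option in `/proc/mounts`. The snippet below is only a rough sketch: the function name `read_only_mounts` is made up for illustration, and it assumes the usual Linux `/proc/mounts` layout (device, mount point, filesystem type, options).

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected to look like the contents of /proc/mounts:
    one mount per line, whitespace-separated fields, options in field 4.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Options are comma-separated; match 'ro' exactly, not 'errors=...'
        if "ro" in options.split(","):
            result.append(mount_point)
    return result
```

On the server itself this could be called as `read_only_mounts(open("/proc/mounts").read())`. Some mounts are read-only by design, so only the ones that are supposed to be writable (the docker image or the cache, for example) are of interest.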
January 5, 2018 (Community Expert)

There are lots of ATA errors on your cache device, which ends up disabled, most likely because of a bad cable. This is just the end of it:

Quote:
Jan 2 18:59:57 Mikey kernel: ata7.00: status: { DRDY ERR }
Jan 2 18:59:57 Mikey kernel: ata7.00: error: { ICRC ABRT }
Jan 2 18:59:57 Mikey kernel: ata7: hard resetting link
Jan 2 18:59:59 Mikey kernel: ata7: softreset failed (SRST command error)
Jan 2 18:59:59 Mikey kernel: ata7: reset failed (errno=-5), retrying in 8 secs
Jan 2 19:00:00 Mikey shfs/user: err: shfs_write: write: (5) Input/output error
Jan 2 19:00:07 Mikey kernel: ata7: hard resetting link
Jan 2 19:00:09 Mikey kernel: ata7: softreset failed (SRST command error)
Jan 2 19:00:09 Mikey kernel: ata7: reset failed (errno=-5), retrying in 8 secs
Jan 2 19:00:17 Mikey kernel: ata7: hard resetting link
Jan 2 19:00:19 Mikey kernel: ata7: softreset failed (SRST command error)
Jan 2 19:00:19 Mikey kernel: ata7: reset failed (errno=-5), retrying in 33 secs
Jan 2 19:00:53 Mikey kernel: ata7: hard resetting link
Jan 2 19:00:55 Mikey kernel: ata7: softreset failed (SRST command error)
Jan 2 19:00:55 Mikey kernel: ata7: reset failed, giving up
Jan 2 19:00:55 Mikey kernel: ata7.00: disabled
Jan 2 19:00:55 Mikey kernel: ata7: EH complete
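For anyone wanting to spot this pattern in their own syslog without reading it line by line, a small script could tally the kernel's ATA error lines per port and note which ports were disabled. This is only a sketch under assumptions: the function name `summarize_ata` and the regular expression are illustrative, and it only recognises lines shaped like the ones quoted above (`kernel: ataN...: message`).

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata7.00: error: { ICRC ABRT }" or "kernel: ata7: reset failed ..."
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

def summarize_ata(log_text):
    """Return (error_counts, disabled_ports) for ATA lines in a syslog excerpt."""
    errors = Counter()
    disabled = set()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a kernel ATA line
        port, message = match.groups()
        # Count explicit error reports and failed resets
        if "error:" in message or "failed" in message:
            errors[port] += 1
        # The kernel logs a bare "disabled" once it gives up on the device
        if message.strip() == "disabled":
            disabled.add(port)
    return errors, disabled
```

A port that racks up failed resets and then shows up in the disabled set, as ata7 does in the excerpt above, is the one whose cable and connections are worth checking first. ICRC in the error line refers to an interface CRC error, which often points at the cable or connector rather than the drive itself.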
January 5, 2018 (Author)

I replaced the two cables on the cache drives, just because I had a stack of fresh ones. Everything has been stable for just over an hour now, and I haven't exactly been going easy on it. So far so good, thanks!