Recurring Issue When Using Docker



syslog.txt

 

Hi all,

 

I've been battling this problem for a couple of years now and have never had a true resolution. Problem statement:

  • If Docker is running (with or without containers running), my system will eventually hard lock on a page fault or (rarely) a kernel panic
  • It doesn't seem to matter which containers I run, but ones that access the array more or seem to use more computational power will speed up the failure
  • I can only recover by physically powering off the server, rebooting, and letting the parity check run again - then I get another couple of days of use

 

In the past, I've tried the following:

  • Remade the flash drive (I've used a total of 4 so far)
  • Blown away the docker image
  • Put the docker image on a single disk
  • Put the docker image on a disk that's not part of the array
  • Enabled or revoked privileged mode for every container
  • Limited the memory of each container such that the total sum is less than half my physical memory (64GB)
  • Swapped the mainboard and processor
  • Put all array disks on an LSI SAS controller
  • Swapped memory
  • Upgraded the power supply
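For anyone wanting to sanity-check the memory-limit item in that list, here is a minimal sketch. The container names and per-container limits are hypothetical placeholders, not the actual values from this server; only the 64GB total comes from the post.

```python
# Hypothetical per-container memory limits in GB. The names and values are
# illustrative only; they are not taken from the actual server.
PHYSICAL_MEMORY_GB = 64  # total installed RAM mentioned in the post

container_limits_gb = {
    "plex": 8,
    "nextcloud": 4,
    "mariadb": 4,
    "sonarr": 2,
}


def limits_within_budget(limits_gb, physical_gb, fraction=0.5):
    """Return True if the summed limits are strictly below fraction * physical_gb."""
    return sum(limits_gb.values()) < physical_gb * fraction


total = sum(container_limits_gb.values())
print(f"Total limit: {total} GB of {PHYSICAL_MEMORY_GB} GB")
print("Within budget:", limits_within_budget(container_limits_gb, PHYSICAL_MEMORY_GB))
```

The check only adds up the configured caps. It says nothing about memory used by the host, VMs, or the page cache, so passing it does not rule out memory pressure.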

 

So, syslog is attached - hoping someone can help here. My next step is to pull the 4 DIMMs in the system one stick at a time, on the assumption that one stick is bad. I don't have evidence of that, and I've already swapped the memory... but the evidence seems kind of RAMy.
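In case it helps anyone reading the attachment, here is a rough sketch of how the log could be scanned for the failure signatures described above. The pattern list is an assumption based on common kernel message wording, and `scan_syslog` is a hypothetical helper name; adjust both to whatever the log actually contains.

```python
import re

# Kernel messages that commonly accompany page faults and panics.
# This pattern list is an assumption; extend it to match the real log.
CRASH_PATTERNS = re.compile(
    r"unable to handle page fault"
    r"|Kernel panic"
    r"|general protection fault"
    r"|Call Trace",
    re.IGNORECASE,
)


def find_crash_lines(lines):
    """Return (line_number, text) pairs for lines matching a crash pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if CRASH_PATTERNS.search(line)
    ]


def scan_syslog(path):
    """Scan a syslog file on disk, e.g. a local copy of the attached syslog.txt."""
    with open(path, errors="replace") as handle:
        return find_crash_lines(handle)
```

A hard lock can happen before the last messages are written to disk, so an empty result does not prove the log is clean.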

 

Thanks in advance.

 

EDIT: I didn't immediately put this in the Docker support area because the last time I asked for help, someone pointed me towards a corrupted file system. It didn't resolve the issue, but that person was correct -- I don't know if the Docker thing is a symptom or the root cause.

Edited by FlexGunship
Link to comment

Looks like you used to have a 4 disk cache pool, but now are using UD for the SSDs. Why?

 

Your appdata, domains, and system shares are on the array. I know your docker.img isn't in the system share, but libvirt.img is.

 

None of this is likely a cause of your problem of course.

 

Are you still using macvlan with your dockers?

Link to comment

First question first - it seems to be an artifact of early PCIe SSDs. It's actually a single device on a single PCIe slot, but internally it has 4 devices in RAID. In my experience, unRAID recognizes a "head" device and 3 "others". If I mount one of the "others", I get 240GB and a single device; if I mount the "head" device, the internal firmware of the SSD kicks in and mounts all devices as a single 1TB device. It took a while to figure out how to make it work; I don't pretend to understand the internal machinations.

 

EDIT: Anyway, it made the cache pool tricky to manage when the array went down. So, I just nixed it. No deeper meaning.

 

If, for any reason, you believe this is contributing, I can pull it. Previously my docker.img was on the array.

 

Likewise, I can also collocate the appdata folder if you think it could be related. But the problem, again, existed long before this recent move to the non-array device.

 

I've recently switched to ipvlan - I didn't notice any difference, for better or for worse, after the change.

 


DOUBLE EDIT: Was it clear what the mode of failure was from the first syslog? It's not clear to me that there was evidence of failure in the diagnostics ZIP.

Edited by FlexGunship
Link to comment
  • 2 weeks later...

No overclocking. It's a stock i7-6700 (non-k) with a stock cooler on a stock Dell mobo.

 

I haven't run memtest recently. But I HAVE run memtest since I had the issue. 

 

I have also removed each stick (running with 48GB at a time) and have noted a crash in every case. 

 

Also... I bought this processor, mobo, and RAM to address the crashing issue. I had the same issue on an i3-6100 previously with 32GB of different RAM on an HP motherboard. I also bought an LSI SAS card to try to solve this issue, as the Dell mobo has a Marvell data controller, which is notorious for not working in unRAID (I'm told).

 

To be fair, I don't know for a *fact* that the failure mode is identical to the one from the i3-6100 HP days, but the rate of failure and the manifestation are the same.

Link to comment
