FlexGunship Posted February 4, 2023 (edited)

syslog.txt

Hi all, I've been battling this problem for a couple of years now and have never had a true resolution.

Problem statement:
- If running Docker (with containers running or not), eventually my system will hard lock on a page fault or (rarely) a kernel panic.
- It doesn't seem to matter which containers I run, but ones that access the array more, or use more computational power, seem to speed up the failure.
- I can only recover by physically powering off the server, rebooting, and letting the parity check run again; then I get another couple of days of use.

In the past, I've tried the following:
- Remade the flash drive (I've used a total of 4 so far)
- Blown away the Docker image
- Put the Docker image on a single disk
- Put the Docker image on a disk that's not part of the array
- Enabled or revoked privileged mode for every container
- Limited the memory of each container so that the total sum is less than half my physical memory (64GB)
- Swapped the mainboard and processor
- Put all array disks on an LSI SAS controller
- Swapped memory
- Upgraded the power supply

So, syslog is attached; hoping someone can help here. My next step is to pull one stick at a time of the 4 DIMMs in the system and assume one stick is bad. I don't have evidence of that, and I've already swapped the memory once... but the evidence seems kind of RAM-y. Thanks in advance.

EDIT: I didn't immediately put this in the Docker support area because the last time I asked for help, someone pointed me toward a corrupted file system. It didn't resolve the issue, but that person was correct. I don't know if the Docker thing is a symptom or the root cause.

Edited February 4, 2023 by FlexGunship
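The per-container memory cap described in the list above can be expressed in a Compose file. This is a minimal sketch, assuming Compose-managed containers; the service names, images, and limit values are examples, not the poster's actual setup:

```yaml
# Sketch: capping container memory so the sum stays under half of 64GB.
# Service names, images, and limits are illustrative only.
services:
  plex:
    image: linuxserver/plex
    mem_limit: 8g      # hard ceiling; the container is OOM-killed past this
  sabnzbd:
    image: linuxserver/sabnzbd
    mem_limit: 4g
```

Note that a memory-capped container that misbehaves gets OOM-killed by the kernel, which shows up in the syslog as an `oom-kill` event rather than a hard lock, so a cap like this can also help distinguish a runaway container from a hardware fault.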
trurl Posted February 4, 2023

Attach diagnostics to your NEXT post in this thread.
FlexGunship Posted February 4, 2023 (Author)

Lol... do you know how many times I've mentally admonished someone for forgetting their diag.zip? My bad, apologies.

athena-diagnostics-20230203-2333.zip
trurl Posted February 4, 2023

Looks like you used to have a 4-disk cache pool, but now are using UD for the SSDs. Why?

Your appdata, domains, and system shares are on the array. I know your docker.img isn't in the system share, but libvirt.img is. None of this is likely the cause of your problem, of course.

Are you still using macvlan with your dockers?
FlexGunship Posted February 4, 2023 (Author, edited)

First question first: it seems to be an artifact of early PCIe SSDs. It's actually a single device in a single PCIe slot, but internally it has 4 devices mounted in RAID. In my experience, unRAID recognizes a "head" device and 3 "others". If I mount one of the "others", I get 240GB and a single device; if I mount the "head" device, the internal firmware of the SSD kicks in and mounts all devices as a single 1TB device. It took a while to figure out how to make it work; I don't pretend to understand the internal machinations.

EDIT: Anyway, it made the cache pool tricky to manage when the array went down, so I just nixed it. No deeper meaning. If, for any reason, you believe this is contributing, I can pull it.

Previously my docker.img was on the array. Likewise, I can also colocate the appdata folder if you think it could be related. But the problem, again, existed long before this recent move to the non-array device.

I've recently switched to ipvlan; I didn't notice a change for better or for worse afterward.

DOUBLE EDIT: Was it clear what the mode of failure was from the first syslog? It's not clear to me that there was evidence of failure in the diagnostics ZIP.

Edited February 4, 2023 by FlexGunship
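For readers outside Unraid's GUI, the macvlan-to-ipvlan switch mentioned above can be declared on a stock Docker host in a Compose network definition. This is a sketch only; the parent interface name, subnet, gateway, and addresses are assumptions to be replaced with your own LAN values:

```yaml
# Sketch: an ipvlan network giving containers LAN-routable IPs,
# the same idea as Unraid's ipvlan custom network setting.
# eth0 and the 192.168.1.0/24 values below are placeholders.
networks:
  lan:
    driver: ipvlan
    driver_opts:
      parent: eth0        # host NIC the containers share
      ipvlan_mode: l2     # L2 mode: containers appear directly on the LAN
    ipam:
      config:
        - subnet: 192.168.1.0/24
          gateway: 192.168.1.1

services:
  pihole:
    image: pihole/pihole  # example service; any image works
    networks:
      lan:
        ipv4_address: 192.168.1.50
```

The practical difference from macvlan is that ipvlan containers share the host NIC's MAC address instead of fabricating one per container, which is the commonly cited reason macvlan has been associated with kernel call traces on some setups.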
FlexGunship Posted February 13, 2023 (Author)

Polite bump. I'm still having this issue about every 24 to 36 hours. No lost data, but I would really appreciate any other insights anyone has.
JorgeB Posted February 14, 2023

Enable the syslog server and post that after a crash.
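For context: Unraid's syslog server is enabled in the GUI (Settings → Syslog Server), and the point is to capture kernel output somewhere that survives a hard lock, since `/var/log` lives in RAM. Under the hood this is standard syslog forwarding; a rough sketch of an equivalent rsyslog configuration is below. The paths and port are assumptions, not Unraid's exact internals:

```
# Sketch of an rsyslog config capturing logs persistently (paths assumed).
# Listen for syslog messages on UDP 514:
module(load="imudp")
input(type="imudp" port="514")

# Write everything to a file on persistent storage (e.g. the flash drive),
# so the last messages before a crash survive the reboot:
*.* /boot/logs/syslog
```

In Unraid the "Mirror syslog to flash" option achieves the same persistence without a second machine; pointing the syslog server at another host on the LAN is the more robust option, since the log keeps flowing right up until the kernel dies.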
FlexGunship Posted February 17, 2023 (Author)

Just updating that the syslog server is enabled, and I'm waiting for the next crash at this point.
FlexGunship Posted February 17, 2023 (Author)

syslog

I guess I had it running since last year... so it was huge. I trimmed anything before 2/12 for the purpose of this upload. If you need more of the log, or need me to start it clean, please let me know.
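Trimming a large syslog by date, as described above, can be done with a one-liner, since syslog lines start with a literal "Mon DD" prefix. The sample data, filenames, and cutoff date below are illustrative:

```shell
# Example syslog with lines before and after the cutoff (sample data):
printf '%s\n' \
  'Feb 11 10:00:00 Athena kernel: old entry' \
  'Feb 12 00:00:01 Athena kernel: first kept entry' \
  'Feb 13 05:00:00 Athena kernel: later entry' > syslog.sample

# Keep everything from the first "Feb 12" line onward: the pattern sets a
# flag at the first match, and the bare "keep" prints every line after it.
awk '/^Feb 12 /{keep=1} keep' syslog.sample > syslog-trimmed.txt

cat syslog-trimmed.txt   # prints the Feb 12 and Feb 13 lines
```

This keeps the file intact from the cutoff point rather than filtering individual dates, which matters for multi-line kernel traces that straddle a date boundary.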
trurl Posted February 17, 2023

Continuous dumps for many hours starting here:

Feb 16 23:08:23 Athena kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-... 7-... } 21250 jiffies s: 1981 root: 0x90/.

Not clear what they are related to. Are you overclocking? Have you done memtest lately?
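Stalls like the one quoted often travel with other hardware-fault signatures (machine-check exceptions, page faults, panics), so one quick triage step is to scan a saved capture for all of them at once. A sketch, using a stand-in log line; point the grep at your own syslog file:

```shell
# Stand-in for a real capture (one line matching a signature):
echo 'Feb 16 23:08:23 Athena kernel: rcu: INFO: rcu_preempt detected expedited stalls' > syslog.txt

# Scan for common hardware-fault signatures, case-insensitive, with line
# numbers so clusters of events are easy to spot:
grep -Ein 'mce|machine check|rcu_preempt.*stall|page fault|kernel panic' syslog.txt
```

If MCE lines show up alongside the RCU stalls, that points more firmly at CPU/memory/power hardware than at anything Docker is doing.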
FlexGunship Posted February 17, 2023 (Author)

No overclocking. It's a stock i7-6700 (non-K) with a stock cooler on a stock Dell mobo.

I haven't run memtest recently, but I HAVE run memtest since I had the issue. I have also removed each stick in turn (running with 48GB at a time) and have noted a crash in every case.

Also... I bought this processor, mobo, and RAM to address the crashing issue. I had the same issue previously on an i3-6100 with 32GB of different RAM on an HP motherboard. I also bought an LSI SAS card to try to solve this issue, as the Dell mobo has a Marvell storage controller, which is notorious for not working in unRAID (I'm told).

To be fair, I don't know for a *fact* that the failure mode is identical to the i3-6100/HP days, but the rate of failure and the manifestation are the same.