FlexGunship Posted February 4, 2023 (edited)

syslog.txt

Hi all, I've been battling this problem for a couple of years now and have never had a true resolution.

Problem statement:
- If running Docker (with containers running or not), eventually my system will hard lock on a page fault or (rarely) a kernel panic.
- It doesn't seem to matter which containers I run, but ones that access the array more, or use more computational power, seem to speed up the failure.
- I can only recover by physically powering off the server, rebooting, and letting the parity check run again; then I get another couple of days of use.

In the past, I've tried the following:
- Remade the flash drive (I've used a total of 4 so far)
- Blown away the Docker image
- Put the Docker image on a single disk
- Put the Docker image on a disk that's not part of the array
- Enabled or revoked privileged mode for every container
- Limited the memory of each container so that the total sum is less than half my physical memory (64GB)
- Swapped the mainboard and processor
- Put all array disks on an LSI SAS controller
- Swapped memory
- Upgraded the power supply

So, syslog is attached; hoping someone can help here. My next step is to pull one stick at a time of the 4 DIMMs in the system and assume one stick is bad. I don't have evidence of that, and I've already swapped the memory once... but the evidence seems kind of RAM-y. Thanks in advance.

EDIT: I didn't immediately put this in the Docker support area because the last time I asked for help, someone pointed me toward a corrupted file system. It didn't resolve the issue, but that person was correct. I don't know if the Docker thing is a symptom or the root cause.

Edited February 4, 2023 by FlexGunship
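The per-container memory cap described in the list above can be expressed in a Compose file. This is a minimal sketch, assuming Compose-managed containers; the service names, images, and limit values are examples, not the poster's actual setup:

```yaml
# Sketch: capping container memory so the sum stays under half of 64GB.
# Service names, images, and limits are illustrative only.
services:
  plex:
    image: linuxserver/plex
    mem_limit: 8g      # hard ceiling; the container is OOM-killed past this
  sabnzbd:
    image: linuxserver/sabnzbd
    mem_limit: 4g
```

Note that a memory-capped container that misbehaves gets OOM-killed by the kernel, which shows up in the syslog as an `oom-kill` event rather than a hard lock, so a cap like this can also help distinguish a runaway container from a hardware fault.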
trurl Posted February 4, 2023

Attach diagnostics to your NEXT post in this thread.
FlexGunship Posted February 4, 2023 (Author)

Lol... do you know how many times I've mentally admonished someone for forgetting their diag.zip? My bad, apologies.

athena-diagnostics-20230203-2333.zip
trurl Posted February 4, 2023

Looks like you used to have a 4-disk cache pool, but now are using UD for the SSDs. Why?

Your appdata, domains, and system shares are on the array. I know your docker.img isn't in the system share, but libvirt.img is. None of this is likely the cause of your problem, of course.

Are you still using macvlan with your dockers?
FlexGunship Posted February 4, 2023 (Author, edited)

First question first: it seems to be an artifact of early PCIe SSDs. It's actually a single device in a single PCIe slot, but internally it has 4 devices mounted in RAID. In my experience, unRAID recognizes a "head" device and 3 "others". If I mount one of the "others", I get 240GB and a single device; if I mount the "head" device, the internal firmware of the SSD kicks in and mounts all devices as a single 1TB device. It took a while to figure out how to make it work; I don't pretend to understand the internal machinations.

EDIT: Anyway, it made the cache pool tricky to manage when the array went down, so I just nixed it. No deeper meaning. If, for any reason, you believe this is contributing, I can pull it.

Previously my docker.img was on the array. Likewise, I can also colocate the appdata folder if you think it could be related. But the problem, again, existed long before this recent move to the non-array device.

I've recently switched to ipvlan; I didn't notice a change for better or for worse afterward.

DOUBLE EDIT: Was it clear what the mode of failure was from the first syslog? It's not clear to me that there was evidence of failure in the diagnostics ZIP.

Edited February 4, 2023 by FlexGunship
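For readers outside Unraid's GUI, the macvlan-to-ipvlan switch mentioned above can be declared on a stock Docker host in a Compose network definition. This is a sketch only; the parent interface name, subnet, gateway, and addresses are assumptions to be replaced with your own LAN values:

```yaml
# Sketch: an ipvlan network giving containers LAN-routable IPs,
# the same idea as Unraid's ipvlan custom network setting.
# eth0 and the 192.168.1.0/24 values below are placeholders.
networks:
  lan:
    driver: ipvlan
    driver_opts:
      parent: eth0        # host NIC the containers share
      ipvlan_mode: l2     # L2 mode: containers appear directly on the LAN
    ipam:
      config:
        - subnet: 192.168.1.0/24
          gateway: 192.168.1.1

services:
  pihole:
    image: pihole/pihole  # example service; any image works
    networks:
      lan:
        ipv4_address: 192.168.1.50
```

The practical difference from macvlan is that ipvlan containers share the host NIC's MAC address instead of fabricating one per container, which is the commonly cited reason macvlan has been associated with kernel call traces on some setups.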
FlexGunship Posted February 13, 2023 (Author)

Polite bump. I'm still having this issue about every 24 to 36 hours. No lost data, but I would really appreciate any other insights anyone has.
JorgeB Posted February 14, 2023

Enable the syslog server and post that after a crash.
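For context: Unraid's syslog server is enabled in the GUI (Settings → Syslog Server), and the point is to capture kernel output somewhere that survives a hard lock, since `/var/log` lives in RAM. Under the hood this is standard syslog forwarding; a rough sketch of an equivalent rsyslog configuration is below. The paths and port are assumptions, not Unraid's exact internals:

```
# Sketch of an rsyslog config capturing logs persistently (paths assumed).
# Listen for syslog messages on UDP 514:
module(load="imudp")
input(type="imudp" port="514")

# Write everything to a file on persistent storage (e.g. the flash drive),
# so the last messages before a crash survive the reboot:
*.* /boot/logs/syslog
```

In Unraid the "Mirror syslog to flash" option achieves the same persistence without a second machine; pointing the syslog server at another host on the LAN is the more robust option, since the log keeps flowing right up until the kernel dies.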
FlexGunship Posted February 17, 2023 (Author)

Just updating that the syslog server is enabled, and I'm waiting for the next crash at this point.
FlexGunship Posted February 17, 2023 (Author)

syslog

I guess I had it running since last year... so it was huge. I trimmed anything before 2/12 for the purpose of this upload. If you need more of the log, or need me to start it clean, please let me know.
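Trimming a large syslog by date, as described above, can be done with a one-liner, since syslog lines start with a literal "Mon DD" prefix. The sample data, filenames, and cutoff date below are illustrative:

```shell
# Example syslog with lines before and after the cutoff (sample data):
printf '%s\n' \
  'Feb 11 10:00:00 Athena kernel: old entry' \
  'Feb 12 00:00:01 Athena kernel: first kept entry' \
  'Feb 13 05:00:00 Athena kernel: later entry' > syslog.sample

# Keep everything from the first "Feb 12" line onward: the pattern sets a
# flag at the first match, and the bare "keep" prints every line after it.
awk '/^Feb 12 /{keep=1} keep' syslog.sample > syslog-trimmed.txt

cat syslog-trimmed.txt   # prints the Feb 12 and Feb 13 lines
```

This keeps the file intact from the cutoff point rather than filtering individual dates, which matters for multi-line kernel traces that straddle a date boundary.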
trurl Posted February 17, 2023

Continuous dumps for many hours starting here:

Feb 16 23:08:23 Athena kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-... 7-... } 21250 jiffies s: 1981 root: 0x90/.

Not clear what they are related to. Are you overclocking? Have you done memtest lately?
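Stalls like the one quoted often travel with other hardware-fault signatures (machine-check exceptions, page faults, panics), so one quick triage step is to scan a saved capture for all of them at once. A sketch, using a stand-in log line; point the grep at your own syslog file:

```shell
# Stand-in for a real capture (one line matching a signature):
echo 'Feb 16 23:08:23 Athena kernel: rcu: INFO: rcu_preempt detected expedited stalls' > syslog.txt

# Scan for common hardware-fault signatures, case-insensitive, with line
# numbers so clusters of events are easy to spot:
grep -Ein 'mce|machine check|rcu_preempt.*stall|page fault|kernel panic' syslog.txt
```

If MCE lines show up alongside the RCU stalls, that points more firmly at CPU/memory/power hardware than at anything Docker is doing.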
FlexGunship Posted February 17, 2023 (Author)

No overclocking. It's a stock i7-6700 (non-K) with a stock cooler on a stock Dell mobo.

I haven't run memtest recently, but I HAVE run memtest since I had the issue. I have also removed each stick in turn (running with 48GB at a time) and have noted a crash in every case.

Also... I bought this processor, mobo, and RAM to address the crashing issue. I had the same issue previously on an i3-6100 with 32GB of different RAM on an HP motherboard. I also bought an LSI SAS card to try to solve this issue, as the Dell mobo has a Marvell storage controller, which is notorious for not working in unRAID (I'm told).

To be fair, I don't know for a *fact* that the failure mode is identical to the i3-6100/HP days, but the rate of failure and the manifestation are the same.