
Docker service failed to start


Solved by JorgeB


Hello everyone,
 

I have recently been having a lot of problems with Docker on my unRAID server. I am not sure what is causing this. I have done the usual web search and tried a few potential solutions I found (e.g. deleting docker.img and recreating it). I have not run a memtest yet; I will do that this coming Saturday when I have physical access to the server.

 

The following problems have occurred:
1) Docker Containers crashing because of segmentation fault

2) Docker service failing to start

 

I suspect it could have something to do with either my RAM or my Cache drive, as I do not remember having problems before installing the cache drives a few months ago. Maybe someone can spot my error in the provided logs?

I hope one of you guys from the community can help me out here! Let me know if you have any further questions or ideas what I could do here. Thank you!

ryan-diagnostics-20230620-1753.zip

16 minutes ago, JorgeB said:

The current docker image is corrupt and you need to recreate it. If it's a recurring issue it might indicate other problems, though I can't see anything else so far, possibly because uptime is only a few minutes.

This is most likely a recurring issue, as I have already recreated it once and it worked fine for a while before the problems came back.
Is there anything specific that could cause this? Probably one of the containers I am running is writing data to the docker.img, correct? (I have tried to avoid this, but I am not that experienced with the configuration and may have missed something.) Or are there other probable causes for a corrupt docker.img?

I will try to post logs again when uptime is higher. That is a bit tricky because I only have remote access via a Wireguard VPN running on the server itself (I don't want to send someone to switch on the server so often). Sometimes the VPN seems to be unresponsive or down as well, but I do not know if that is because the server crashes or because of other issues. Maybe this could be related.

Is it possible that both of these problems are caused by an error with the cache drive? I had no problems with unRAID at all before installing that drive a few months ago. I have found this line in the system log:

Jun 20 17:34:52 ryan kernel: BTRFS: error (device loop2) in btrfs_replay_log:2500: errno=-5 IO failure (Failed to recover log tree)

Is this maybe related?
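
For anyone trying to read a line like this, here is a minimal Python sketch that pulls out the device name and error code. The function name `parse_btrfs_error` and the regex are my own for illustration, not part of any unRAID tool. The only thing it relies on is that kernel errno values are printed negated, so `errno=-5` is error number 5 (EIO, an I/O error). Which file sits behind `loop2` is not stated in the log line itself.

```python
import errno
import os
import re

# Matches kernel lines shaped like the one quoted above, e.g.
# "BTRFS: error (device loop2) in btrfs_replay_log:2500: errno=-5 IO failure (...)"
BTRFS_ERROR = re.compile(
    r"BTRFS: error \(device (?P<device>[^)]+)\) in (?P<func>\w+):(?P<line>\d+): "
    r"errno=-(?P<errno>\d+)"
)


def parse_btrfs_error(log_line):
    """Return device, kernel function and errno details, or None if no match."""
    match = BTRFS_ERROR.search(log_line)
    if match is None:
        return None
    code = int(match.group("errno"))
    return {
        "device": match.group("device"),
        "function": match.group("func"),
        "errno": code,
        # Symbolic name from the standard errno table, e.g. "EIO"
        "errno_name": errno.errorcode.get(code, "unknown"),
        # Human-readable description as reported by the local C library
        "meaning": os.strerror(code),
    }


sample = (
    "Jun 20 17:34:52 ryan kernel: BTRFS: error (device loop2) in "
    "btrfs_replay_log:2500: errno=-5 IO failure (Failed to recover log tree)"
)
info = parse_btrfs_error(sample)
print(info)
```

The device name tells you which filesystem failed, so an error on a loop device points at an image file mounted through that loop device rather than directly at a physical disk.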

I have attached a screenshot of my share menu. Is there something obviously wrong there maybe?
 

 

shares.png

Edited by J0hn_Lawrence
28 minutes ago, JorgeB said:

Unlikely to be a cache device problem, could be RAM, but not seeing any other indications pointing to that for now.

Okay. I will check RAM this weekend.
So what you propose is to repeat the process of deleting and recreating the docker.img, correct?
Thank you for your help so far!!


I deleted the docker image and recreated according to information I found on the forum. Process worked fine.

I observed very strange behavior a day later. I could not reach the server anymore via my VPN. At first I thought the VPN had died, but when I came to the physical location of the server today, the server did not respond either. Pinging worked, but that only shows the network interface still works. I suspect that the whole OS must have crashed. What could have caused something like this? After I restarted it with the physical power button, the machine rebooted and seems to work again for now.

I tried to run memtest but the system always just reboots to the mode selection screen when I select memtest86 and press enter. Strange. Am I maybe missing something here?
16GiB of RAM are detected by the OS and displayed in the WebUI, which is the correct amount.

It seems to me there is more wrong with the system than I previously suspected. Upon boot I saw on the connected monitor of the server that it said "There are differences between boot sector and its backup". What does this mean exactly?
Another forum post described this event as well:

But I am not sure if I should try to use this solution or if this is not well suited in my case.

Also, I just saw this error in the log. I cannot explain why I would get a segmentation fault.

Jun 24 10:24:09 ryan kernel: php-fpm[22222]: segfault at 1568a92870a0 ip 00000000008dcd1b sp 00007fff6de4cf10 error 6 in php-fpm[600000+3c0000] likely on CPU 6 (core 2, socket 0)



I would much appreciate any further guidance on this topic. Thank you for your support thus far.

Edited by J0hn_Lawrence
13 minutes ago, J0hn_Lawrence said:

I tried to run memtest but the system always just reboots to the mode selection screen when I select memtest86 and press enter. Strange. Am I maybe missing something here?

It only works with legacy/CSM boot, not UEFI. If you can only boot UEFI, use the free Passmark memtest. You should also post the diagnostics.


Unfortunately, I was not able to switch out of UEFI to run memtest. I switched the DRAM frequency from auto to 2133 (which is the speed of my RAM) to ensure XMP is not being used.
 

15 minutes ago, JorgeB said:

Diags look mostly fine, though there are some out of memory errors.

How could the system be out of memory? I have 16 GiB DRAM installed and no Docker Container or VM is installed nor are any running.
What could this out of memory mean in this context?


Thank you for your ongoing support! I have attached newest logs, because I saw more segmentation fault error logs.

ryan-diagnostics-20230624-1338.zip

  • Solution
Jun 24 10:24:09 ryan kernel: __vm_enough_memory: pid: 22222, comm: php-fpm, no enough memory for the allocation
Jun 24 10:24:09 ryan kernel: php-fpm[22222]: segfault at 1568a92870a0 ip 00000000008dcd1b sp 00007fff6de4cf10 error 6 in php-fpm[600000+3c0000] likely on CPU 6 (core 2, socket 0)
Jun 24 10:24:09 ryan kernel: Code: 85 ff 74 10 f6 47 04 40 75 0a 83 2f 01 75 05 e8 2b 75 fc ff 48 83 c3 20 49 39 dc 0f 84 8e fe ff ff 80 7b 09 00 74 d4 48 8b 3b <83> 2f 01 75 b0 e8 6b d7 fe ff eb c5 66 0f 1f 84 00 00 00 00 00 48
Jun 24 10:24:09 ryan nginx: 2023/06/24 10:24:09 [error] 2296#2296: *9916 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.178.90, server: , request: "POST /plugins/community.applications/include/exec.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.178.103", referrer: "http://192.168.178.103/Apps"
Jun 24 10:24:09 ryan php-fpm[2158]: [WARNING] [pool www] child 22222 exited on signal 11 (SIGSEGV) after 0.519772 seconds from start
Jun 24 10:26:27 ryan kernel: __vm_enough_memory: pid: 24942, comm: php-fpm, no enough memory for the allocation

 

Were you starting something at this time?

 

I also see some other segfaults in the newer diags, so you should really run memtest.
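
To get a quick overview of how often this happens, something like the following Python sketch could count segfaults and failed allocations per process. The regexes are written against the exact line shapes quoted in this post and are an assumption about the general format. The function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# "kernel: php-fpm[22222]: segfault at ..."
SEGFAULT = re.compile(r"kernel: (?P<comm>[\w.-]+)\[(?P<pid>\d+)\]: segfault at")
# "kernel: __vm_enough_memory: pid: 22222, comm: php-fpm, ..."
NO_MEMORY = re.compile(
    r"__vm_enough_memory: pid: (?P<pid>\d+), comm: (?P<comm>[\w.-]+),"
)


def summarize(lines):
    """Count segfaults and failed allocations per process name."""
    segfaults = Counter()
    alloc_failures = Counter()
    for line in lines:
        match = SEGFAULT.search(line)
        if match:
            segfaults[match.group("comm")] += 1
            continue
        match = NO_MEMORY.search(line)
        if match:
            alloc_failures[match.group("comm")] += 1
    return segfaults, alloc_failures


sample_lines = [
    "Jun 24 10:24:09 ryan kernel: __vm_enough_memory: pid: 22222, comm: php-fpm, "
    "no enough memory for the allocation",
    "Jun 24 10:24:09 ryan kernel: php-fpm[22222]: segfault at 1568a92870a0 "
    "ip 00000000008dcd1b sp 00007fff6de4cf10 error 6 in php-fpm[600000+3c0000] "
    "likely on CPU 6 (core 2, socket 0)",
    "Jun 24 10:26:27 ryan kernel: __vm_enough_memory: pid: 24942, comm: php-fpm, "
    "no enough memory for the allocation",
]

segfaults, alloc_failures = summarize(sample_lines)
print("segfaults:", dict(segfaults))
print("failed allocations:", dict(alloc_failures))
```

To run it against a real log, pass an open file instead of `sample_lines`, for example `summarize(open("/var/log/syslog"))`. That path is an assumption, so adjust it to wherever your syslog is. Segfaults spread across many unrelated processes are a stronger hint at a hardware problem than repeated crashes of one program.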


I ran memtest86+ and I think I found the error. I had 2 sticks of 8GiB DDR4-2133 DRAM in my machine. When I ran memtest with both of them installed, I got >100k failures after 3% of the check. I then tested the sticks individually in a known working slot and, sure enough, one of them is completely broken. The other one passed memtest with 0 errors 2 times. I will monitor my server over the coming week. I expect no more segmentation faults now that I have removed the faulty 8GiB stick. At least none have occurred in the last 30 minutes, which makes me hopeful that the other stick is in fact fine.

Thank you so much @JorgeB for your help and assistance! Very much appreciated!

In case anyone reads this and is wondering how memtest86+ testing works:
- Download the .iso from this page: https://www.memtest.org/
- Write it to a USB stick, e.g. with Fedora Media Writer or a similar program.
- Change the boot priority in your BIOS/UEFI and set the newly created memtest86+ USB drive as first priority.
- Boot up the machine and start the memtest.
- It should complete with no errors and a big green PASS if all is good.
- It will show you an error count and red error messages if something is wrong.
- First test all RAM sticks together, then remove them and test them individually.
- You should then be able to figure out which sticks are probably OK and which are defective.
