
6.12.3/4 - Unstable WebUI that crashes (nginx?), Docker unable to start?


Go to solution Solved by Yivey_unraid,


Hi! Started getting problems with my server the other day. At the time I was on 6.12.3, and it all started with some Docker containers acting up. Went into the Unraid WebUI and the Docker page said that Docker was unable to start. Restarted Docker and got it working again.

 

Then today it acted up again, and I also started getting problems with Unraid WebUI crashing on me. Restarted nginx over SSH using

/etc/rc.d/rc.nginx restart

 

That got the WebUI up again, but even after rebooting the whole server, Docker won't start at all. SSH kept working the whole time, though.

 

In a futile attempt to perhaps fix it, I thought that I would try to upgrade to 6.12.4. So I pulled the USB stick and did a manual update. Everything started up OK, but now I couldn't start the array because the WebUI crashed on me again. Restarted nginx multiple times until it stopped working altogether. I SSH'd into the server and pulled the Diagnostics before shutting everything down.

Any ideas on what's causing this? I can't figure it out in the logs.

 

Unfortunately I don't have a remote syslog server.

define7-diagnostics-20230926-2316.zip

Link to comment
3 hours ago, JorgeB said:

That is mostly to see if it's a config problem, not the flash drive.

OK, I tried now and started up a completely fresh install on a new flash drive. It works, but I haven't started the array or anything as I don't want to screw anything up. So now you suggest copying some files from the old install, making a fresh install on the old flash, then copying those files back to the new install? What files should I keep and what will I end up having to reconfigure?

Link to comment

Does this approach mean I'll have to reconfigure all the plugins, modprobe.d, go-file, ssh keys, shares, network configs etc etc?
Not possible to start with something less "destructive"? Sort of the other way around, and start by removing some parts of the install on the old flash? If so, what parts should I start removing? I appreciate the help!

Link to comment

I've tried a bunch of stuff now, trying to narrow it down.

I did a fresh install of 6.12.4 on the old flash and copied over the bare minimum that you suggested. Tried to see if I could provoke an nginx failure. Everything seemed to work fine and I couldn't find any faults.

Started moving over more stuff and testing, but eventually grew impatient since every try involved starting/stopping the server and moving the USB back and forth. Tried removing only the plugins from the config folder. That wasn't the problem either, since nginx kept crashing. Managed to grab a log from one of the times it was happening. Look around 20:14 in the attached syslog. Don't know if that points anywhere? After that I restarted nginx, but eventually it stopped working altogether.


Also tried restoring a week-old flash backup, but that also presented the exact same issues with nginx crashing the Unraid API, even though I had none of those issues back then. What does that mean? I will keep trying to restore just parts of the config since the first stuff I tried didn't work.

 

Feels like I'm stumbling blind here and looking for a needle in a haystack...

define7-diagnostics-20230928-2024.zip

Link to comment

@JorgeB What is your take on the fact that the problems keep occurring even though I restored the flash from a 1-week-old backup? Back then I had no issues at all. Of course something could've been problematic in the setup back then already, just not presenting symptoms, but shouldn't this work like a snapshot?

I’ve found this bug report and my symptoms are very similar, but the problem is that I haven’t had IPv6 activated at all, only IPv4. 

 

2 hours ago, JorgeB said:

Those look related to the Connect plugin, you can try without it.

But I tried without any plugins at all earlier, still same issues? I will try this again though. I guess the connect plugin is still named dynamix.my.servers in the plugins folder, because I can’t find any plugin named Connect?

Link to comment

Ok, I'm soon about to give up.... 😫

 

I've tried so many things now without getting it to work properly again. Every time I try something new it creates a new problem.

 

I've done this (I'm sorry if it's not in chronological order, I should've kept a log on all my tests):

  • Started with the bare minimum like @JorgeB suggested on the old flash but with a new fresh install. Started installing stuff back manually. Kept having nginx problems.
  • Started with the bare minimum but on a new flash drive that I bought. Same deal, kept having nginx problems. And often the Docker engine couldn't start. If it started, I wasn't able to install any containers due to various errors, but the most common was "docker: error pulling image configuration: image config verification failed for digest sha256".
  • Started a completely new trial version on another new flash drive but kept having the same issues as above, minus the nginx one. Can install most plugins without an error, but no containers at all. Most installs look like the attached screenshot (just picked a random container in CA).
  • During all these trial and error sessions I've deleted the docker directory multiple times, tried switching back to docker image. Moving the image or dir to a different pool with different FS. No luck. Same result.
  • Tried reverting back to not use SSL. nginx still produced this type of error in the log (quoted from my post on 9/28/2023):
    nginx: 2023/09/28 20:07:12 [error] 19782#19782: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload

     

  • Multiple times CA has had problems loading and suggested that I change the DNS settings. I've done that and tried both the suggested DNS settings and others, setting them statically on the server as well as on the router. No change on either the "rebuild" system or the trial.
  • I've tried using both NICs on my MB (MSI Z590 Torpedo). Both are Intel, one is 1 GbE and the other is 2.5 GbE. Tried them both in bonding mode and separately. No change.
  • I even swapped my switch for another one I had lying around. No change.
  • I bought two new 16 GB RAM sticks to test with. No change. I guess I'll have 64 GB now if I ever get this going again...
  • I tried using a USB-to-Ethernet adapter to see if it was the onboard NICs, but couldn't get Unraid to see the adapter, and booting with it connected failed, I guess because the system tries to boot from the adapter since it's USB.
  • Changed out the ethernet cable between the switch and the server. No change.
  • I got desperate and updated the FW on both the MB and the HBA (Adaptec ASR-71605). No change.



Really starting to suspect it's a hardware issue. Especially since a completely new trial version doesn't work either.

I've attached two diagnostics. One is from the last time I ran my "rebuild" flash and one is the trial version. Unfortunately I've done so many restarts but these at least should show the problems I have with the Docker engine and how I'm not able to install any containers.

 

What should I do here?!? I've put 25+ hours into this hunt for a needle in a haystack now. PLEASE HELP!

 

define-7-diagnostics-20231005-2328_pro_version.zip tower-test-diagnostics-20231006-0119_trial_version.zip

Link to comment
8 hours ago, Yivey_unraid said:

Really starting to suspect it's a hardware issue. Especially since a completely new trial version doesn't work either.

I would suspect the same, unfortunately that's not always easy to diagnose, especially remotely, see if you can try with a different PC, or a different board/CPU combo.

Link to comment

I will start to try and narrow down what HW is causing the issue. Also maybe move the server and physically try it on another network.

 

Don't really have any other system to try it on, except for my old HP N40L Microserver. That doesn't have room for the HBA etc. But I can give the trial flash a go in it. Would you say I need to recreate the flash every time I do a new HW test? Like in the sense that faulty HW would corrupt the flash in some way so it'll present faults even when run on good HW? 

Link to comment

Tried some more.

 

I installed a TP-Link PCIe Ethernet card and tried with a new fresh install of 6.12.4 on a new flash. Still same problem. So either it's a problem in my network, or it's a problem in the MB/CPU.

This is the BIOS screen for the security settings on the MB. It does mention SHA256, which is what is erroring most frequently during my tries. Should I have this set up in any other way? These are settings I've never touched before.

Should I start a new thread for these problems, since my initial question wasn't really about this SHA256 problem? The initial problem was perhaps caused by this one, though.

IMG_9282.jpg

Link to comment

Update again...
 

I found an old NUC that I ran HA on earlier. Connected it to the same switch as my main server and booted the same flash from the previous post that gave errors in my main system.

 

Everything works on this NUC system as far as I can stress test it, just by installing Docker containers and plugins, removing all of them, clearing the Docker image and installing again.

 

As it looks now, I'm pretty certain that it's the main server's hardware that is faulty in some way. As far as I understand, SHA256 is a cryptographic hash function carried out by the CPU, but it could also be . Something is corrupting the data somewhere, or the algorithm isn't working (highly unlikely).
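One rough way to look for this kind of silent corruption from the shell is to hash the same file many times and see whether the digest ever changes. The snippet below is only a minimal sketch: `distinct_digests` is a made-up helper name, not an Unraid tool, and the file path in the usage comment is an assumption. Repeated reads of a small file will mostly come from the page cache, so this exercises RAM and CPU rather than the network path, and a clean result does not prove the hardware is healthy.

```shell
#!/bin/sh
# Hypothetical helper (not part of Unraid): hash the same file several times
# and print how many distinct SHA-256 digests were seen.
# The body runs in a subshell, so its variables do not leak into the caller.
distinct_digests() (
    file=$1
    rounds=${2:-20}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Keep only the digest column of the sha256sum output.
        sha256sum "$file" | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
)

# Example usage (assumed path; 64 MiB of random test data):
#   head -c 67108864 /dev/urandom > /tmp/hashtest.bin
#   distinct_digests /tmp/hashtest.bin 50
#   rm -f /tmp/hashtest.bin
```

Identical input should always give one digest, so any result other than 1 would point at the data being altered somewhere between storage, memory and CPU. A proper memory test remains the more thorough check.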

 

I was still not sure that the problem wasn't related to the network parts on the motherboard. Going through the MB manual I found that all networking functionality is going through the chipset in all the ways I've tested. So I tested the PCIe Ethernet card in PCI_E1 socket to rule out that it's the chipset that's faulty. At first I got my hopes up, as it seemed to actually be working, but no bueno. 🙈 I don't know if I can rule the chipset out completely since the USB is still connected via it, and I don't know a way around that short of buying a PCIe card with both ethernet and USB.

 

161286633_Skrmavbild2023-10-07kl_00_10_45.thumb.png.6dce04e7ffa89a95a523130347d622c2.png

 

Now, the only thing I can think of to do further is to disassemble the PC and reseat the CPU in the socket. Don't really see what difference that'll make but I'll give it a try. I'm out of ideas, and buying a new MB and CPU on a hunch is really the last resort...

 

Any other suggestions of what to look for or try out?

 

 

Link to comment

My Unraid GUI has started being unresponsive this evening.. (6.12.4)

I've been working with nginx & cloudflare config and it ground to a halt -  It seemed to be fixed after a restart but the system log was flooded with entries like:
Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 
 - before I started the container!

I'm happy to try to look for any commonalities that might give us a reproducible-on-demand scenario 👍

Link to comment
15 hours ago, Kev600 said:

My Unraid GUI has started being unresponsive this evening.. (6.12.4)

I've been working with nginx & cloudflare config and it ground to a halt -  It seemed to be fixed after a restart but the system log was flooded with entries like:
Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 
 - before I started the container!

I'm happy to try to look for any commonalities that might give us a reproducible-on-demand scenario 👍

Unfortunately this isn’t related to the container Nginx Proxy Manager at all. Nginx is related to the webUI.

Link to comment

Try disabling TPM in the BIOS, as Unraid does not use it. AFAIK it can only be configured to be passed through to a VM.

 

That last error

Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 

 

This seems to happen when you leave a web connection open on the Dashboard page, which causes an issue that then floods the /var/log/syslog and /errorlog files until they fill the filesystem. There is no fix that I can see; you can only clear the logs and restart the nginx process to clear it.
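To see how bad the flooding is before clearing anything, something like the following could be run over SSH. It is a sketch under assumptions: `count_worker_exits` is a hypothetical helper name, and the pattern is based only on the alert line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count nginx worker-crash alerts in a syslog-style file.
# The body runs in a subshell, so its variables do not leak into the caller.
count_worker_exits() (
    logfile=${1:-/var/log/syslog}
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status clean in that case.
    grep -c 'nginx: .*worker process [0-9]* exited on signal' "$logfile" || true
)

# Example usage on the server:
#   df -h /var/log                        # how full is the log filesystem?
#   count_worker_exits /var/log/syslog    # how many worker crashes were logged?
```

If the count keeps climbing between runs while the Dashboard is open in a browser tab, that would fit the explanation above. Restarting nginx with the /etc/rc.d/rc.nginx restart command from the first post is then the remaining step.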

 

 
