Yivey_unraid Posted September 26, 2023 Hi! I started getting problems with my server the other day. At that point I was on 6.12.3, and it all started with some Docker containers acting up. I went into the Unraid WebUI and the Docker page said that Docker was unable to start. I restarted Docker and got it working again. Then today it acted up again, and the Unraid WebUI also started crashing on me. I restarted nginx over SSH using /etc/rc.d/rc.nginx restart. That got the WebUI up again, but Docker won't start at all, even after restarting the whole server. SSH kept working the whole time, though. In a futile attempt to fix it, I thought I would try upgrading to 6.12.4, so I pulled the USB stick and did a manual update. Everything started up OK, but now I couldn't start the array because the WebUI crashed on me again. I restarted nginx multiple times until it stopped working altogether. I SSH'ed into the server and pulled the diagnostics before shutting everything down. Any ideas on what's causing this? I can't figure it out from the logs. Unfortunately I don't have a remote syslog server. define7-diagnostics-20230926-2316.zip
Yivey_unraid (Author) Posted September 27, 2023 Any takers? I’ve also tried booting into safe mode but the webUI still crashes.
JorgeB Posted September 27, 2023 You can try a new flash drive with a stock install, no key needed. If that works correctly it would suggest a config issue, and you can then copy over just the bare minimum and reconfigure the rest.
Yivey_unraid (Author) Posted September 27, 2023 3 hours ago, JorgeB said: "You can try a new flash drive with a stock install, no key needed. If that works correctly it would suggest a config issue, and you can then copy over just the bare minimum and reconfigure the rest." I'll give that a go! What is it that points to the USB drive being faulty?
JorgeB Posted September 27, 2023 That is mostly to see if it's a config problem, not the flash drive.
Yivey_unraid (Author) Posted September 27, 2023 3 hours ago, JorgeB said: "That is mostly to see if it's a config problem, not the flash drive." OK, I tried that now and booted a completely fresh install on a new flash drive. It works, but I haven't started the array or anything as I don't want to screw anything up. So now you suggest copying some files from the old install, making a fresh install on the old flash, and then copying those files back to the new install? Which files should I keep, and what will I end up having to reconfigure? Edited September 27, 2023 by Yivey_unraid
JorgeB Posted September 28, 2023 10 hours ago, Yivey_unraid said: "So now you suggest copying some files from the old install," Yes, start with the bare minimum, like the key, super.dat and the pools folder for the assignments, and the docker user templates.
Yivey_unraid (Author) Posted September 28, 2023 Does this approach mean I'll have to reconfigure all the plugins, modprobe.d, go file, SSH keys, shares, network configs etc.? Is it not possible to start with something less "destructive"? Sort of the other way around, starting by removing some parts of the install on the old flash? If so, what parts should I start removing? I appreciate the help!
JorgeB Posted September 28, 2023 After copying the bare minimum you can start restoring other config files, but do a few at a time in case the issue returns, so you can find the culprit, assuming it's that.
Yivey_unraid (Author) Posted September 28, 2023 I've tried a bunch of stuff now, trying to narrow it down. I did a fresh install of 6.12.4 on the old flash and copied over the bare minimum that you suggested. Tried to see if I could provoke an nginx failure. Everything seemed to work fine and I couldn't find any faults. Started moving over more stuff and testing again, but eventually grew impatient since every try involved starting/stopping the server and moving the USB stick back and forth. Tried removing only the plugins from the config folder. That wasn't the problem either, since nginx kept crashing. Managed to grab a log from one of the times it was happening. Look around 20:14 in the attached syslog. Don't know if that's something that can point somewhere? After that I restarted nginx, but eventually it stopped working altogether. Also tried restoring a week-old flash backup, but that presented the exact same issues with nginx crashing the Unraid API, even though I had none of those issues back then. What does that mean? I will keep trying to restore just parts of the config, since the first things I tried didn't work. Feels like I'm stumbling blind here, looking for a needle in a haystack... define7-diagnostics-20230928-2024.zip Edited September 28, 2023 by Yivey_unraid
Yivey_unraid (Author) Posted September 28, 2023
nginx: 2023/09/28 20:07:12 [error] 19782#19782: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload
This error message is coming up a lot in the logs.
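One rough way to put a number on "a lot" is to count how often that message shows up in a saved copy of the syslog. This is only a minimal sketch, assuming a plain-text log file; `count_log_matches` is a made-up helper name, not part of any Unraid tooling, and the example path is an assumption.

```shell
# Count the lines in a log file that contain a given fixed string.
# count_log_matches is a hypothetical helper name used for illustration.
count_log_matches() {
    local pattern="$1"
    local logfile="$2"
    # -F: treat the pattern as a literal string, -c: print the match count.
    # grep exits non-zero when nothing matches, so mask that with "|| true".
    grep -F -c -- "$pattern" "$logfile" || true
}

# Example call (the path is an assumption; point it at your own syslog copy):
# count_log_matches "SUB:WEBSOCKET:ws_recv NOT OK" /var/log/syslog
```

Running it against syslogs from different sessions would show whether the count grows with uptime or only jumps when the WebUI is open, which may help narrow things down.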
JorgeB Posted September 29, 2023 9 hours ago, Yivey_unraid said: "Look around 20:14 in the attached syslog." Those look related to the Connect plugin, you can try without it. 9 hours ago, Yivey_unraid said: "This error message is coming up a lot in the logs." No idea what these are about.
Yivey_unraid (Author) Posted September 29, 2023 @JorgeB What is your take on the fact that the problems keep occurring even though I restored the flash from a 1 week old backup? Back then I had no issues at all. Of course something could’ve been problematic in the setup back then already, just not presenting symptoms, but shouldn’t this work like a snapshot? I’ve found this bug report and my symptoms are very similar, but the problem is that I haven’t had IPv6 activated at all, only IPv4. 2 hours ago, JorgeB said: "Those look related to the Connect plugin, you can try without it." But I tried without any plugins at all earlier and still had the same issues? I will try this again though. I guess the Connect plugin is still named dynamix.my.servers in the plugins folder, because I can’t find any plugin named Connect?
Yivey_unraid (Author) Posted October 5, 2023 OK, I'm about to give up soon... 😫 I've tried so many things now without getting it to work properly again. Every time I try something new it creates a new problem. I've done this (I'm sorry if it's not in chronological order, I should've kept a log of all my tests):
Started with the bare minimum like @JorgeB suggested, on the old flash but with a fresh install. Started installing stuff back manually. Kept having nginx problems.
Started with the bare minimum but on a new flash drive that I bought. Same deal, kept having nginx problems. And often the Docker engine couldn't start. If it started I wasn't able to install any containers due to various errors, the most common being "docker: error pulling image configuration: image config verification failed for digest sha256".
Started a completely new trial version on another new flash drive but kept having the same issues as above, minus the nginx one. Can install most plugins without an error, but no containers at all. Most installs look like this (just picked a random container in CA):
During all these trial and error sessions I've deleted the docker directory multiple times and tried switching back to a docker image. Also tried moving the image or directory to a different pool with a different FS. No luck. Same result.
Tried reverting back to not using SSL. nginx still produced these types of errors in the log: On 9/28/2023 at 11:48 PM, Yivey_unraid said: "nginx: 2023/09/28 20:07:12 [error] 19782#19782: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload"
Multiple times CA has had problems loading and suggested that I change the DNS settings. I've done that and tried both the suggested DNS settings and others, setting them statically on the server as well as on the router. No change on either the "rebuild" system or the trial.
I've tried using both NICs on my MB (MSI Z590 Torpedo). Both are Intel, one is 1 GbE and the other is 2.5 GbE. Tried them both in bonding mode and separately. No change.
I even swapped my switch for another one, since I had one lying around. No change.
I bought two new 16 GB RAM modules to test with. No change. I guess I'll have 64 GB now if I ever get this going again...
I tried using a USB to Ethernet adapter to see if it was the onboard NICs, but couldn't get Unraid to see the adapter, and booting with it connected to the system failed. I guess it tries to boot from the adapter since it's USB.
Changed out the Ethernet cable between the switch and the server. No change.
I got desperate and updated the FW on both the MB and the HBA (Adaptec ASR-71605). No change.
Really starting to suspect it's a hardware issue, especially since a completely new trial version doesn't work either. I've attached two diagnostics. One is from the last time I ran my "rebuild" flash and one is from the trial version. Unfortunately I've done so many restarts, but these should at least show the problems I have with the Docker engine and how I'm not able to install any containers. What should I do here?!? I've put 25+ hours into this hunt for a needle in a haystack now. PLEASE HELP! define-7-diagnostics-20231005-2328_pro_version.zip tower-test-diagnostics-20231006-0119_trial_version.zip
JorgeB Posted October 6, 2023 8 hours ago, Yivey_unraid said: "Really starting to suspect it's a hardware issue. Especially since a completely new trial version doesn't work either." I would suspect the same. Unfortunately that's not always easy to diagnose, especially remotely. See if you can try with a different PC, or a different board/CPU combo.
Yivey_unraid (Author) Posted October 6, 2023 I will start trying to narrow down which HW is causing the issue. Maybe I'll also move the server and physically try it on another network. I don't really have any other system to try it on, except for my old HP N40L Microserver. That doesn't have room for the HBA etc., but I can give the trial flash a go in it. Would you say I need to recreate the flash every time I do a new HW test? In the sense that faulty HW could corrupt the flash in some way, so that it presents faults even when run on good HW?
Yivey_unraid (Author) Posted October 6, 2023 Tried some more. I installed a TP-Link PCIe Ethernet card and tried with a fresh install of 6.12.4 on a new flash. Still the same problem. So either it's a problem in my network, or it's a problem in the MB/CPU. This is the BIOS screen for the security settings on the MB. It does mention SHA256, which is what errors out most frequently during my tries. Should I have this set up in any other way? These are settings I've never touched before. Should I start a new thread about these problems, since my initial question wasn't really about this SHA256 problem? The initial problem was perhaps caused by this one, though...
Yivey_unraid (Author) Posted October 6, 2023 Update again... I found an old NUC that I ran HA on earlier. I connected it to the same switch as my main server and booted the same flash from the previous post that gave errors in my main system. Everything works on this NUC system, as far as I can stress test it just by installing Docker containers and plugins, removing all of them, clearing the Docker image and installing again. As it looks now, I'm pretty certain that it's the main server's hardware that is faulty in some way. As far as I understand, SHA256 is a cryptographic hash function carried out by the CPU, but it could also be . Something is corrupting the data somewhere, or the algorithm isn't working (highly unlikely). I was still not sure that the problem wasn't related to the network parts on the motherboard. Going through the MB manual, I found that all networking functionality goes through the chipset in all the ways I've tested. So I tested the PCIe Ethernet card in the PCI_E1 socket to rule out that it's the chipset that's faulty. At first I got my hopes up, as it seemed to actually be working, but no bueno. 🙈 I don't know if I can rule the chipset out completely, since the USB is still connected via it, and I don't know a way around that short of buying a PCIe card with both Ethernet and USB. Now the only thing I can think of doing is to disassemble the PC and reseat the CPU in the socket. I don't really see what difference that'll make, but I'll give it a try. I'm out of ideas, and buying a new MB and CPU on a hunch is really the last resort... Any other suggestions of what to look for or try out?
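Since the recurring failure is a SHA-256 digest mismatch, one crude check is to hash the same unchanged file over and over and see whether the digest ever differs. The following is only a sketch, assuming `sha256sum` from GNU coreutils is available; `hash_is_stable` is a made-up helper name. A clean run proves nothing about the hardware, while a mismatch on a file that has not changed would point towards CPU, RAM or storage corruption.

```shell
# Hash one file several times and report whether every digest matches.
# Returns 0 if all runs agree, 1 on the first mismatch, 2 if the file
# cannot be read. hash_is_stable is a hypothetical helper name.
hash_is_stable() {
    local file="$1"
    local runs="${2:-10}"   # number of hashing passes, default 10
    local first current i
    [ -r "$file" ] || return 2
    first=$(sha256sum "$file" | cut -d ' ' -f 1)
    i=1
    while [ "$i" -lt "$runs" ]; do
        current=$(sha256sum "$file" | cut -d ' ' -f 1)
        if [ "$current" != "$first" ]; then
            echo "mismatch on run $((i + 1)): $current != $first"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs digests match: $first"
    return 0
}

# Example call (the file path is a placeholder for any large local file):
# hash_is_stable /path/to/some/large/file 50
```

Repeated reads of the same file are usually served from the page cache, so this mostly exercises CPU and RAM rather than the disk. It is not a replacement for a proper memory test or CPU stress test.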
Yivey_unraid (Author) Posted October 7, 2023 Created a new thread since the problem has "evolved" so much from the original question. Perhaps the original problem will be solved if I can find the potential root problem. If you have any comments or suggestions please follow up in that thread instead. Thank you!
Kev600 Posted October 8, 2023 My Unraid GUI has started being unresponsive this evening (6.12.4). I've been working with nginx & cloudflare config and it ground to a halt. It seemed to be fixed after a restart, but the system log was flooded with entries like: Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 - before I started the container! I'm happy to try to look for any commonalities that might give us a reproducible-on-demand scenario 👍
Yivey_unraid (Author) Posted October 8, 2023 15 hours ago, Kev600 said: "My Unraid GUI has started being unresponsive this evening (6.12.4). I've been working with nginx & cloudflare config and it ground to a halt. It seemed to be fixed after a restart, but the system log was flooded with entries like: Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 - before I started the container! I'm happy to try to look for any commonalities that might give us a reproducible-on-demand scenario 👍" Unfortunately this isn’t related to the Nginx Proxy Manager container at all. The nginx in these logs is the one related to the webUI.
hwextreme Posted October 9, 2023 Try disabling TPM in the BIOS, as Unraid does not use it. AFAIK it can only be configured to be passed through to a VM for use. That last error: Oct 8 01:09:01 KBNAS nginx: 2023/10/08 01:09:01 [alert] 11885#11885: worker process 18538 exited on signal 6 This seems to happen when you leave a web connection open on the Dashboard page, which causes an issue that then floods the /var/log/syslog and /errorlog files until they fill the filesystem. There is no fix that I can see, you can only clear the logs and restart the nginx process to clear it.
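To check whether the logs really are filling up the filesystem before clearing anything, something like the following could be used. It is a sketch, assuming standard `df` and `awk`; `fs_usage_percent` is a hypothetical helper name, and the 90 percent threshold in the example is an arbitrary choice.

```shell
# Print how full the filesystem holding a given path is, as a whole number.
# fs_usage_percent is a hypothetical helper name used for illustration.
fs_usage_percent() {
    local path="${1:-/var/log}"
    # -P forces the portable output format with one line per filesystem;
    # column 5 is the capacity field, e.g. "37%". Mount sources containing
    # spaces would shift the columns, which this sketch does not handle.
    df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example use (threshold chosen arbitrarily):
# usage=$(fs_usage_percent /var/log)
# if [ "$usage" -ge 90 ]; then
#     echo "/var/log is ${usage}% full"
# fi
```

If the number climbs steadily while the Dashboard is left open in a browser tab, that would be consistent with the behaviour described above.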
Solution: Yivey_unraid (Author) Posted December 9, 2023 For me, the CPU ended up being faulty and Intel replaced it.