toughiv Posted December 18, 2023

Disclaimer!
I've never had to do the whole troubleshooting and diagnostics process before. I have likely made a pig's ear of the logs needed to properly troubleshoot (I am sorry). I'm getting desperate; these crashes are causing me lots of headaches!

Current Issue
- Unraid is crashing roughly once per day.
- The system reboots on its own just fine; all that is required is that I log in, click 'Start Array' and start my Docker containers again.

Services that are disabled when the Unraid server boots back up
Adding this here; it's likely nothing, but every little helps!
- SSHD doesn't start when Unraid boots back up. I need to open a console from the login dashboard and run: /etc/rc.d/rc.sshd start
- This has only happened once, but when I logged back in I had to go to Settings > Docker > Enable Docker > Yes.

Checks Already Performed
- Ran Memtestx86 until 3 passes completed - 0 errors found.
- Changed Docker network settings from MacVLAN to IPVlan (forum posts suggest a stability issue with MacVLAN).
- Created a persistent syslog output share and installed CA "User Scripts". The script is:

    #!/bin/bash
    # Copy the live syslog to a timestamped file on the array so it survives a reboot.
    FILENAME="/mnt/user/Persistent-Syslog/syslog-$(date +%s)"
    tail -f /var/log/syslog > "$FILENAME"

Persistent Syslog script output example
Please ignore the date shown; I wasn't making use of NTP servers (this highlighted it to me). I have changed that now.
Syslog doesn't seem to give me much as to the cause of the crash (which suggests to me this may all be Docker related...). The output below is just the last few lines before the crash appears to have occurred.
Dec 19 17:57:11 Tower kernel: pcieport 0000:00:01.2: device [1022:1453] error status/mask=00001000/00006000
Dec 19 17:57:11 Tower kernel: pcieport 0000:00:01.2: [12] Timeout
Dec 19 19:55:50 Tower nginx: 2023/12/19 19:55:50 [alert] 2939#2939: worker process 7301 exited on signal 6
Dec 19 19:55:53 Tower nginx: 2023/12/19 19:55:53 [alert] 2939#2939: worker process 10111 exited on signal 6
Dec 19 19:55:54 Tower nginx: 2023/12/19 19:55:54 [alert] 2939#2939: worker process 10162 exited on signal 6
Dec 19 19:55:56 Tower nginx: 2023/12/19 19:55:56 [alert] 2939#2939: worker process 10178 exited on signal 6
Dec 19 19:55:59 Tower nginx: 2023/12/19 19:55:59 [alert] 2939#2939: worker process 10199 exited on signal 6
Dec 19 19:56:01 Tower nginx: 2023/12/19 19:56:01 [alert] 2939#2939: worker process 10246 exited on signal 6
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: AER: Corrected error received: 0000:00:00.0
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: device [1022:1453] error status/mask=00001000/00006000
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: [12] Timeout
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: [ 6] BadTLP
Dec 20 22:57:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.161.51#123 flags 0x19 failed: Address already in use
Dec 20 22:57:40 Tower ntpd[6760]: unable to create socket on br0.99 (843) for 172.16.161.51#123
Dec 20 22:57:40 Tower ntpd[6760]: failed to init interface for address 172.16.161.51
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 127.0.0.1#123 flags 0x5 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on lo (844) for 127.0.0.1#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 127.0.0.1
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.160.9#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0 (845) for 172.16.160.9#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.160.9
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.162.51#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0.98 (846) for 172.16.162.51#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.162.51
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.161.51#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0.99 (847) for 172.16.161.51#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.161.51
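With this many repeated AER lines, it can help to tally them per PCIe port before reading the log line by line. The following is only a sketch, not something posted in the thread: the function name `summarize_aer` is made up for illustration, and it assumes the error-name lines look like the `[ 6] BadTLP` and `[12] Timeout` lines above, with a single-word error name at the end of the line.

```shell
#!/bin/bash
# Hypothetical helper: count AER error-name lines per PCIe port in a syslog file.
# Assumes lines shaped like "... kernel: pcieport 0000:00:03.1: [ 6] BadTLP".
summarize_aer() {
    local logfile="$1"
    awk '
        # Only the lines where "[bit] ErrorName" directly follows the port address
        /pcieport [0-9a-f:.]+: +\[ *[0-9]+\]/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "pcieport") { port = $(i + 1); break }
            }
            sub(/:$/, "", port)      # strip the trailing colon from the address
            count[port " " $NF]++    # key is "port error-name"
        }
        END {
            for (key in count) print count[key], key
        }
    ' "$logfile" | sort -k2,2 -k3,3
}
```

Calling `summarize_aer /path/to/syslog-file` prints one line per port and error type in the form "count port error". A port that dominates the tally is a reasonable place to start looking at whatever card or slot sits behind it.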
trurl Posted December 18, 2023
Attach Diagnostics to your NEXT post in this thread.
itimpi Posted December 18, 2023
If the server reboots itself, this suggests a hardware issue. The most likely culprits would be a cooling issue (so the CPU overheats) or inadequate power.
toughiv (Author) Posted December 18, 2023
Attached diagnostics: tower-diagnostics-20231218-0857.zip
toughiv (Author) Posted December 18, 2023
4 minutes ago, itimpi said: "If the server reboots itself, this suggests a hardware issue. The most likely culprits would be a cooling issue (so the CPU overheats) or inadequate power."
I had assumed a reboot because all my services become unresponsive and then, after some time, I can get access to the Unraid login page and the array is offline. Actually, it must be a reboot, now that I think about it, because the system uptime resets.
If there is a cooling or power issue, is there a way to diagnose that? It crashed at midnight recently when there were no major operations or usage going on; it was basically idling. It is not plugged into a UPS though...
JorgeB Posted December 18, 2023
Enable the syslog server and post that after a crash.
toughiv (Author) Posted December 18, 2023
@JorgeB - thanks, I've done this now. Will report back on the next crash. It makes my User Script redundant, but I suspect this will be more robust.
toughiv (Author) Posted December 21, 2023
Took a little longer to crash this time, but here are the latest syslog files. Thank you for your help.
Attachments: syslog-1703089476, syslog-1703203715
JorgeB Posted December 21, 2023
Try booting in safe mode and/or closing any browser windows open to the GUI; only open one when you need to use it, then close it again.
toughiv (Author) Posted December 21, 2023
Ah, so the crashing is being caused by long-open sessions, such as SSH connections, browser tabs and the like?
JorgeB Posted December 21, 2023
It can be, mostly browser windows open to the GUI on any device; it's worth a try.
toughiv (Author) Posted December 23, 2023
@JorgeB - ensuring browser windows were closed didn't help, unfortunately. Is this a known issue with 6.12.4? I didn't have this problem until I upgraded.
JorgeB Posted December 23, 2023
29 minutes ago, toughiv said: "Is this a known issue with 6.12.4?"
Not really. Update to v6.12.6 and post a new syslog after a crash.
toughiv (Author) Posted December 25, 2023
Merry Xmas everyone - I hope you all have a great festive time. I have attached the latest syslog. I think it shows that syslog isn't capturing the reason for the crash / it must be Docker related?
Attachment: syslog-1703330985
itimpi Posted December 25, 2023
@toughiv I see this in the syslog:
Dec 23 11:39:03 Tower root: Fix Common Problems: Warning: Deprecated plugin Unraid-Nvidia.plg
Not sure if it can cause problems or not.
toughiv (Author) Posted December 25, 2023
3 minutes ago, itimpi said: "@toughiv I see this in the syslog: Dec 23 11:39:03 Tower root: Fix Common Problems: Warning: Deprecated plugin Unraid-Nvidia.plg Not sure if it can cause problems or not."
@itimpi if I get rid of this though, then I won't be able to pass through my GPU to Plex?
itimpi Posted December 25, 2023
1 minute ago, toughiv said: "@itimpi if I get rid of this though, then I won't be able to pass through my GPU to Plex?"
I think that the Nvidia-Driver plugin replaces this one, giving similar functionality, and is compatible with the 6.12.x releases.
toughiv (Author) Posted December 25, 2023
Ah okay! I'll give that a go, thank you.
toughiv (Author) Posted December 25, 2023
2 hours ago, itimpi said: "I think that the Nvidia-Driver plugin replaces this one, giving similar functionality, and is compatible with the 6.12.x releases."
Turns out I had both installed. Hopefully removing the deprecated one stops the crashing. Will report back, thanks again.
toughiv (Author) Posted December 26, 2023
Crashed again this morning.
Attachment: syslog-1703491413
JorgeB Posted December 26, 2023
Not seeing anything relevant logged; this usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
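To double-check that nothing relevant was logged, one option is to print the last few lines written before each boot in the persistent syslog. This is a rough sketch under assumptions, not part of the thread: the function name `lines_before_boot` is invented here, and it assumes each boot shows up in the captured file as the kernel's "Linux version" banner line. If your file marks boots differently, pass another marker string as the third argument.

```shell
#!/bin/bash
# Hypothetical helper: show the last N lines logged before each boot marker.
# Usage: lines_before_boot FILE [N] [MARKER]   (N must be at least 1)
lines_before_boot() {
    local logfile="$1"
    local n="${2:-5}"
    local marker="${3:-kernel: Linux version}"   # assumed boot marker
    awk -v n="$n" -v marker="$marker" '
        # A marker line means a fresh boot: dump the buffered lines before it
        index($0, marker) && NR > 1 {
            print "--- lines before boot at line " NR " ---"
            start = NR - n
            if (start < 1) start = 1
            for (i = start; i < NR; i++) print buf[i % n]
        }
        # Ring buffer holding the most recent n lines
        { buf[NR % n] = $0 }
    ' "$logfile"
}
```

For example, `lines_before_boot /path/to/syslog-file 20` shows the 20 lines preceding every boot. If those lines are routine messages each time, with no kernel panic or out-of-memory report, that is consistent with the hardware explanation above.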
toughiv (Author) Posted December 26, 2023
Okay @JorgeB - thank you.
toughiv (Author) Posted December 31, 2023
@JorgeB - I have turned off the Tdarr container for the last 3 days and have had no issues. Have you ever heard of that causing an issue?
JorgeB Posted January 1
Not sure. You can try posting in the container's support thread to see if other users of that container can help.
toughiv (Author) Posted January 4
@JorgeB - I have found that I spoke too soon! If this is hardware, and there were no errors shown in RAM after 3 passes, no errors in parity, and no warnings about failing disks... what do you think it could be?