
Unstable since update to 6.12.4


toughiv


Disclaimer!
I've never had to do this kind of troubleshooting and diagnostics before, so I have likely made a pig's ear of the logs needed to troubleshoot properly (I am sorry).

 

I'm getting desperate; these crashes are causing me a lot of headaches!



Current Issue

  • Unraid is crashing roughly once per day
  • The system reboots on its own just fine; all that is required is that I log in, click 'Start Array' and start my Docker containers again.

 

Services that are disabled when the Unraid server boots back up

Adding this here, likely nothing, but every little helps!

 

  • SSHD doesn't start when Unraid boots back up.
  • I need to open a console from the login dashboard and run:

 

/etc/rc.d/rc.sshd start

 

  • This has only happened once, but when I logged back in I had to go to Settings > Docker > Enable Docker > Yes

 

Checks Already Performed

  • Ran MemTest86 until 3 passes completed - 0 errors found
  • Changed Docker network settings from macvlan to ipvlan (forum posts suggest stability issues with macvlan)
  • Created a persistent syslog output share & installed CA "User Scripts"

 

The script is:

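#!/bin/bash
# Write each capture to a new file named with the current epoch time
FILENAME="/mnt/user/Persistent-Syslog/syslog-$(date +%s)"
# Quote the path so the redirect target is always treated as a single word
tail -f /var/log/syslog > "$FILENAME"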
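
A possible variant of the same idea, offered as a sketch rather than a tested Unraid recipe: it names the capture file with a readable timestamp and uses `tail -F -n +1` so the capture includes lines already in the log and keeps following if `/var/log/syslog` is rotated. The function names `capture_path` and `start_capture` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build a capture path with a sortable, human-readable
# timestamp (for example syslog-2023-12-20_225740) instead of epoch seconds.
capture_path() {
    local dir="$1"
    printf '%s/syslog-%s\n' "$dir" "$(date +%Y-%m-%d_%H%M%S)"
}

# Follow the live syslog by name (-F) so the capture continues after rotation,
# starting from line 1 (-n +1) so lines written before the script ran are kept.
start_capture() {
    local out
    out="$(capture_path "$1")"
    tail -F -n +1 /var/log/syslog >> "$out"
}
```

Usage would be `start_capture /mnt/user/Persistent-Syslog`, which runs until the script is stopped or the server goes down.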

 

 

Persistent Syslog script output example

Please ignore the dates shown; I wasn't using NTP servers (this exercise highlighted that to me). I have changed that now.

 

Syslog doesn't seem to tell me much about the cause of the crash (which suggests to me this may all be Docker related...)

The outputs below are just the last few lines logged before the crash appears to have occurred.

 

Dec 19 17:57:11 Tower kernel: pcieport 0000:00:01.2:   device [1022:1453] error status/mask=00001000/00006000
Dec 19 17:57:11 Tower kernel: pcieport 0000:00:01.2:    [12] Timeout               
Dec 19 19:55:50 Tower nginx: 2023/12/19 19:55:50 [alert] 2939#2939: worker process 7301 exited on signal 6
Dec 19 19:55:53 Tower nginx: 2023/12/19 19:55:53 [alert] 2939#2939: worker process 10111 exited on signal 6
Dec 19 19:55:54 Tower nginx: 2023/12/19 19:55:54 [alert] 2939#2939: worker process 10162 exited on signal 6
Dec 19 19:55:56 Tower nginx: 2023/12/19 19:55:56 [alert] 2939#2939: worker process 10178 exited on signal 6
Dec 19 19:55:59 Tower nginx: 2023/12/19 19:55:59 [alert] 2939#2939: worker process 10199 exited on signal 6
Dec 19 19:56:01 Tower nginx: 2023/12/19 19:56:01 [alert] 2939#2939: worker process 10246 exited on signal 6
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: AER: Corrected error received: 0000:00:00.0
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2:   device [1022:1453] error status/mask=00001000/00006000
Dec 19 19:57:49 Tower kernel: pcieport 0000:00:01.2:    [12] Timeout               
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:24:22 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:30:21 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:30:49 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 03:40:49 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 04:43:12 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP                
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Dec 20 05:30:12 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP     
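
To see at a glance which PCIe ports are reporting corrected errors and of what type, a saved syslog can be summarised with a short shell sketch like the one below. It assumes the AER lines keep the format shown in the excerpt above; `summarize_aer` is a hypothetical helper name, not an Unraid tool.

```shell
#!/bin/bash
# Count AER error-type lines (e.g. "[ 6] BadTLP", "[12] Timeout") per PCIe port
# in a saved syslog file. Output: "<port> <error type> <count>", most frequent first.
summarize_aer() {
    local logfile="$1"
    # Lines of interest look like:
    #   Dec 20 03:07:28 Tower kernel: pcieport 0000:00:03.1:    [ 6] BadTLP
    grep -E 'pcieport [0-9a-f:.]+: +\[ *[0-9]+\] ' "$logfile" \
        | sed -E 's/.*pcieport ([0-9a-f:.]+): +\[ *[0-9]+\] +([A-Za-z]+).*/\1 \2/' \
        | sort | uniq -c | sort -rn \
        | awk '{print $2, $3, $1}'
}
```

Called as `summarize_aer` followed by the path to one of the captured syslog files, it prints one line per port and error type, which makes it easier to tell whether the errors cluster on a single slot or device.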

 

Dec 20 22:57:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.161.51#123 flags 0x19 failed: Address already in use
Dec 20 22:57:40 Tower ntpd[6760]: unable to create socket on br0.99 (843) for 172.16.161.51#123
Dec 20 22:57:40 Tower ntpd[6760]: failed to init interface for address 172.16.161.51
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 127.0.0.1#123 flags 0x5 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on lo (844) for 127.0.0.1#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 127.0.0.1
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.160.9#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0 (845) for 172.16.160.9#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.160.9
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.162.51#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0.98 (846) for 172.16.162.51#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.162.51
Dec 20 23:02:40 Tower ntpd[6760]: bind(19) AF_INET 172.16.161.51#123 flags 0x19 failed: Address already in use
Dec 20 23:02:40 Tower ntpd[6760]: unable to create socket on br0.99 (847) for 172.16.161.51#123
Dec 20 23:02:40 Tower ntpd[6760]: failed to init interface for address 172.16.161.51

 

4 minutes ago, itimpi said:

If the server reboots itself this suggests a hardware issue. The most likely culprits would be a cooling issue (so the CPU overheats) or inadequate power.

I had assumed a reboot because all my services become unresponsive and then, after some time, I can get access to the Unraid login page and the array is offline. Actually, now I think about it, it must be a reboot because the system uptime resets.

 

If there is a cooling or power issue, is there a way to diagnose that? It crashed at midnight recently when there were no major operations or usage going on, basically idling. It is not plugged into a UPS though...
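
One way to gather evidence for or against overheating would be to log sensor temperatures to the persistent share until the next crash and then check the last readings. The following is a minimal sketch, assuming the kernel exposes sensors under `/sys/class/hwmon` (this depends on the right sensor driver being loaded); the function names and the 30-second interval are arbitrary choices for illustration. It would not reveal a power problem, only temperatures.

```shell
#!/bin/bash
# Print one line per hwmon temperature sensor: "<chip> <sensor file> <degrees C>".
# hwmon temp*_input files report millidegrees Celsius.
read_temps() {
    local base="${1:-/sys/class/hwmon}" f chip raw
    for f in "$base"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue                 # skip if the glob matched nothing
        raw="$(cat "$f" 2>/dev/null)" || continue
        [ -n "$raw" ] || continue               # some sensors return no value
        chip="$(cat "$(dirname "$f")/name" 2>/dev/null || echo unknown)"
        echo "$chip $(basename "$f") $((raw / 1000))"
    done
}

# Append a timestamped set of readings to the given file every 30 seconds.
log_temps() {
    local out="$1"
    while true; do
        read_temps | sed "s/^/$(date '+%F %T') /" >> "$out"
        sleep 30
    done
}
```

Running `log_temps` with a file on the persistent share as its argument would leave a temperature trail leading up to the moment of the crash.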

 


I'm not seeing anything relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
