SSH and WebUI not accessible after couple days


Go to solution Solved by JorgeB,

Hello! I have been running a server for over 5 years now without any major problems. Now, however, I face an issue that is hard to debug because it's sporadic, and hard to pin down because I can't find any obvious errors in the logs.

 

Let's jump right into it!

 

Issue: Always after like 2-3 days after boot-up the system is not accessible anymore. Neither via WebUI nor SSH. However, the docker-containers still seem to work just fine. For example I can access Plex, SABnzbd or Radarr. On the other hand my Traefik is not responding, so my website is down, I can only access the services locally. Also the services seem to have trouble communicating with themselves, e.g. Radarr shows that SABnzbd is not accessible. After a hard-reboot the server is fully accessible again. However, the same problem usually occurs again after a couple days.

Logs: Firstly, you can find my syslog (two actually, March 11 and 14) captured via the syslog server's feature "Mirror syslog to flash" and a diagnostics file (after reboot, since I cannot access anything) attached. 
Sadly I don't seem to find anything interesting inside. The logs just seem to stop at March 14 02:15:23, even though I hard-shutdown the server today on March 16 at around 15:30 (in this example I first realised the server stopped responding on March 14 at around 12:00, but didn't have time to do anything). Just to check, I also enabled the "Local syslog folder", which shows the last message and me shutting it down:

Mar 14 00:48:49 RaidByte emhttpd: read SMART /dev/sdc
Mar 14 01:19:01 RaidByte emhttpd: spinning down /dev/sdc
Mar 14 01:19:03 RaidByte emhttpd: spinning down /dev/sde
Mar 14 02:03:04 RaidByte emhttpd: read SMART /dev/sde
Mar 14 02:15:23 RaidByte emhttpd: read SMART /dev/sdc
Mar 16 15:35:54 RaidByte root: Delaying execution of fix common problems scan for 10 minutes
Mar 16 15:35:54 RaidByte unassigned.devices: Mounting 'Auto Mount' Devices...
Mar 16 15:35:54 RaidByte emhttpd: Starting services...
Mar 16 15:35:54 RaidByte emhttpd: shcmd (59): /etc/rc.d/rc.samba restart
Mar 16 15:35:54 RaidByte wsdd2[10426]: 'Terminated' signal received.
Mar 16 15:35:54 RaidByte nmbd[10416]: [2024/03/16 15:35:54.270850,  0] ../../source3/nmbd/nmbd.c:59(terminate)
Mar 16 15:35:54 RaidByte nmbd[10416]:   Got SIGTERM: going down...
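
A quick way to find the point where a mirrored syslog goes silent is to look for large jumps between consecutive timestamps. The following is only a minimal Python sketch, not an Unraid tool: the function name find_gaps, the 6-hour threshold and the assumed year (classic syslog timestamps carry none) are illustrative choices. The sample lines are taken from the excerpt above.

```python
from datetime import datetime, timedelta

def find_gaps(lines, min_gap=timedelta(hours=6), year=2024):
    """Return (last_line_before, first_line_after, gap) for each pause >= min_gap."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        line = line.rstrip("\n")
        try:
            # The first 15 characters hold the timestamp, e.g. "Mar 14 02:15:23".
            # The year is an assumption, because the log format does not include one.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps

sample = [
    "Mar 14 02:03:04 RaidByte emhttpd: read SMART /dev/sde",
    "Mar 14 02:15:23 RaidByte emhttpd: read SMART /dev/sdc",
    "Mar 16 15:35:54 RaidByte root: Delaying execution of fix common problems scan for 10 minutes",
]

for before, after, gap in find_gaps(sample):
    print(f"silent for {gap}")
    print(f"  last line before: {before}")
    print(f"  first line after: {after}")
```

On a real file the same function can be fed an open file object instead of the sample list. The line just before each gap is the last thing the server managed to write before it froze.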


The only suspicious thing I found was this BadTLP error one day before the logs stop. I don't know if this is related, or what it means, but it seems to have been "corrected"?

Mar 13 22:03:06 RaidByte emhttpd: read SMART /dev/sdb
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3:    [ 6] BadTLP                
Mar 13 22:12:44 RaidByte emhttpd: read SMART /dev/sdc
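
The "error status/mask=00000040/00006000" part of the AER line is a pair of bit fields, and the "[ 6] BadTLP" line is simply bit 6 of the status value spelled out. As a hedged illustration, the sketch below decodes both values. The bit names are, to the best of my knowledge, the short names the Linux AER driver prints for the PCIe Correctable Error Status register; treat the table as an assumption and check it against the kernel source if it matters. The function name decode_correctable is made up for this example.

```python
# Bit position -> short name for correctable PCIe errors
# (assumed to match what the Linux AER driver prints).
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

def decode_correctable(hex_value):
    """Return the names of all bits set in a hex string such as '00000040'."""
    value = int(hex_value, 16)
    return [
        CORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]

# Values copied from the syslog line above.
status, mask = "00000040/00006000".split("/")
print("reported:", decode_correctable(status))
print("masked:  ", decode_correctable(mask))
```

For the line above this reports BadTLP only, and shows that bits 13 and 14 are masked. It does not say which device or cable is at fault, only that a corrupted packet was seen and retransmitted on that link.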

 

In the other file on March 11, there's also such an error. Additionally we can see this:
 

Mar 11 10:28:05 RaidByte kernel: Buffer I/O error on dev sdb1, logical block 2179, lost async page write
Mar 11 10:28:05 RaidByte kernel: device offline error, dev sdb, sector 4405507 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): unable to read inode block for updating (i_pos 70455344)
Mar 11 10:28:05 RaidByte kernel: device offline error, dev sdb, sector 18884 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2

 

followed by many more of this type and just before I reboot again this:
 

Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): FAT read failed (blocknr 2179)
Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): unable to read inode block for updating (i_pos 70648355)
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Failed to read block 0x3c2c820: -5
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read page, block 3c2c820, size a0ac
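
Because there were "many more of this type", it can help to tally the storage-related kernel errors by kind instead of reading them one by one. This is a rough Python sketch under stated assumptions: the category labels and the substrings they match are my own grouping of the messages quoted above, not an official classification, and the sample lines are copied from this post.

```python
from collections import Counter

# Substring to look for -> label used in the summary (labels are illustrative).
ERROR_KINDS = {
    "kernel: FAT-fs": "FAT filesystem error",
    "kernel: SQUASHFS error": "squashfs read error",
    "kernel: device offline error": "device offline",
    "kernel: Buffer I/O error": "buffer I/O error",
}

def tally_storage_errors(lines):
    """Count how many lines fall into each known storage error category."""
    counts = Counter()
    for line in lines:
        for needle, label in ERROR_KINDS.items():
            if needle in line:
                counts[label] += 1
                break  # one category per line is enough
    return counts

sample = [
    "Mar 11 10:28:05 RaidByte kernel: Buffer I/O error on dev sdb1, logical block 2179, lost async page write",
    "Mar 11 10:28:05 RaidByte kernel: device offline error, dev sdb, sector 4405507 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2",
    "Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): unable to read inode block for updating (i_pos 70455344)",
    "Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): FAT read failed (blocknr 2179)",
    "Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Failed to read block 0x3c2c820: -5",
    "Mar 13 22:03:06 RaidByte emhttpd: read SMART /dev/sdb",
]

for label, count in tally_storage_errors(sample).most_common():
    print(f"{count:4d}  {label}")
```

If nearly everything lands on the device that holds the flash drive (sdb in this excerpt), that points at the stick, its port, or the USB controller rather than at the array disks.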

 

Steps taken to try and mitigate: 

  • Hard Reboot (Problem occurs again after few days)
  • Change USB-Port of Unraid Drive
  • Change WebUI port from 1001 to an unused number below 1000, since I read in another forum with similar issues that one should use a port below 1000.
  • Replug Ethernet Cable
  • Fresh Install
  • Switched out USB drive with freshly bought SanDisk Cruzer Blade
  • Setting pcie_aspm=off

  • Unplugging Bluetooth Dongle (caused some logs once)

  • Setting pci=noaer in syslinux to prevent AER error bug

  • Removing Tailscale Plugin

 

Solution: Assuming it was,

  1. Disabling c-states.
  2. Setting the right RAM frequency.

Both of which are explained in this guide

 

raidbyte-diagnostics-20240316-1610.zip syslog-mar14.log syslog-mar11.log

Edited by xBotRaid
Update measures
Link to comment
  • xBotRaid changed the title to SSH and WebUI not accessible after couple days
18 hours ago, xBotRaid said:
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]

This indicates a flash drive problem, you can try re-formatting it first, if issues persist, replace it.

Link to comment
On 3/17/2024 at 11:41 AM, JorgeB said:

This indicates a flash drive problem, you can try re-formatting it first, if issues persist, replace it.

Thanks a lot for the suggestion! 

I tried to reformat the drive and put the backup on there. This caused the webUI to not show at all (which confirms your point). 
Afterwards I decided to do a fresh install and bought a new USB drive.

Copied some of the backup's config files on a fresh install, no problems yet. 

 

I'll mark this as solution if the issue doesn't reappear in the following days.

 

Have a nice evening!

Link to comment
On 3/17/2024 at 11:41 AM, JorgeB said:

This indicates a flash drive problem, you can try re-formatting it first, if issues persist, replace it.

Update: As said, I did a fresh install on a newly bought USB device (the popular SanDisk Cruzer Blade to be precise).

Sadly, the issue occurred twice again in the last day. SSH and webUI not responding. Luckily I have some more logs (attached), however I don't know if they help.
Let's break it down.


First Crash (syslog1)

Starts with the following message I encountered in the other syslogs as well:

Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3:    [ 6] BadTLP                

I read in a wiki post (German) that this is due to a bug in the AER driver: the log message does not get cleared (even though the error was corrected), which ends up overloading the system.

According to other unraid forum posts (like this one), as a workaround one can disable the advanced log messages using pci=noaer in the syslinux config. So to address this, I went ahead and set it. Let's see how it goes.

Then interestingly an error only seen in this log file:

Mar 19 17:59:22 RaidByte kernel: usb 7-4: USB disconnect, device number 8
Mar 19 17:59:23 RaidByte kernel: usb 7-4: new full-speed USB device number 9 using xhci_hcd
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI Read Default Erroneous Data Reporting command is advertised, but not supported.
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI Read Transmit Power Level command is advertised, but not supported.
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI LE Set Random Private Address Timeout command is advertised, but not supported.
Mar 19 18:02:45 RaidByte kernel: usb 7-4: USB disconnect, device number 9
...
Mar 19 22:37:45 RaidByte kernel: usb 7-4: new full-speed USB device number 29 using xhci_hcd
Mar 19 22:37:45 RaidByte kernel: usb 7-4: device descriptor read/64, error -71
Mar 19 22:37:46 RaidByte kernel: usb 7-4: device descriptor read/64, error -71
...
Mar 20 04:06:44 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered blocking state
Mar 20 04:06:44 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered disabled state
Mar 20 04:06:44 RaidByte kernel: device veth9ef81a9 entered promiscuous mode
Mar 20 04:06:52 RaidByte kernel: eth0: renamed from veth228d9bb
Mar 20 04:06:52 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth9ef81a9: link becomes ready
Mar 20 04:06:52 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered blocking state
Mar 20 04:06:52 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered forwarding state
Mar 20 04:06:54 RaidByte Docker Auto Update: Community Applications Docker Autoupdate finished (Note from me: Last line of log)

A lot of errors about usb 7-4 (my UGREEN Bluetooth USB dongle for the VM), I didn't encounter these before, except the `device descriptor read/64, error -71` error.

I believe this was not the problem before, since I didn't see that message before. Anyhow I removed the Bluetooth Dongle to eliminate it as a probable cause of the issue.

 

Second Crash (syslog2)

Now here it gets even more confusing. It seems as if there were two rounds of SIGTERMs (the termination signal sent to services, e.g. on power-off), neither of which I think I initiated. See for yourself:

Mar 20 11:27:17 RaidByte emhttpd: read SMART /dev/nvme0n1
Mar 20 11:27:17 RaidByte emhttpd: read SMART /dev/sda
Mar 20 11:27:18 RaidByte emhttpd: Starting services...
Mar 20 11:27:18 RaidByte emhttpd: shcmd (17): /etc/rc.d/rc.samba restart
Mar 20 11:27:18 RaidByte wsdd2[1771]: 'Terminated' signal received.
Mar 20 11:27:18 RaidByte nmbd[1761]: [2024/03/20 11:27:18.072045,  0] ../../source3/nmbd/nmbd.c:59(terminate)
Mar 20 11:27:18 RaidByte nmbd[1761]:   Got SIGTERM: going down...
...
Mar 20 21:00:00 RaidByte winbindd[21720]:   initialize_winbindd_cache: clearing cache and re-creating with version number 2
Mar 20 21:00:00 RaidByte emhttpd: shcmd (33579): /etc/rc.d/rc.avahidaemon restart
Mar 20 21:00:00 RaidByte root: Stopping Avahi mDNS/DNS-SD Daemon: stopped
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: Got SIGTERM, quitting.
Mar 20 21:00:00 RaidByte avahi-dnsconfd[7442]: read(): EOF
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: Leaving mDNS multicast group on interface br0.IPv4 with address 192.168.0.10.
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: avahi-daemon 0.8 exiting.
Mar 20 21:00:00 RaidByte root: Starting Avahi mDNS/DNS-SD Daemon: /usr/sbin/avahi-daemon -D
...
Mar 20 21:01:54 RaidByte kernel: br0: port 1(bond0) entered disabled state
Mar 20 21:01:54 RaidByte kernel: eth0: 0xffffc9000006d000, 2c:f0:5d:5e:84:9d, IRQ 51
Mar 20 21:01:55 RaidByte tips.and.tweaks: Tweaks Applied
Mar 20 21:01:55 RaidByte unassigned.devices: Mounting 'Auto Mount' Remote Shares... (Note: Last line of log)

 

Other than that I didn't find anything (not even an AER error). Maybe someone else can find something interesting.

 

Changes made, to prevent future crashes:

  • Set pci=noaer in syslinux to prevent AER error bug
  • Unplugged Bluetooth Dongle
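
Kernel parameters such as pci=noaer or pcie_aspm=off go on the append line of the boot entry in the syslinux config. The Python sketch below shows the text change involved. It is only an illustration under assumptions: the stanza layout and label names in SAMPLE_CFG are what I believe a typical Unraid syslinux.cfg looks like, and the helper add_kernel_param is hypothetical. On a real system the same edit is normally done by hand or through the flash device settings page.

```python
def add_kernel_param(cfg_text, param, label):
    """Append `param` to the append line of the boot entry named `label`.

    Other boot entries are left untouched, and the parameter is not added
    a second time if it is already present.
    """
    out = []
    current_label = None
    for line in cfg_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "label":
            current_label = " ".join(words[1:])
        elif (
            words
            and words[0].lower() == "append"
            and current_label == label
            and param not in words[1:]
        ):
            line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"

# Assumed, simplified layout of the config file (not copied from this server).
SAMPLE_CFG = """\
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
"""

print(add_kernel_param(SAMPLE_CFG, "pcie_aspm=off", "Unraid OS"))
```

Only the entry that is actually booted needs the parameter, which is why the helper takes a label. A reboot is required before the change has any effect.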

Let's see if that helps. If you have any other ideas what the issue could be here I'd love to hear it!

Thanks!

syslog2 syslog1

Link to comment
10 hours ago, xBotRaid said:

According to other unraid forum posts (like this one), as a workaround one can disable the advanced log messages using pci=noaer in the syslinux config. So to address this, I went ahead and set it. Let's see how it goes.

That just ignores the errors, try this first:

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165009

 

 

If it doesn't work then use the other option.

 

 

 

 

Link to comment
On 3/21/2024 at 10:13 AM, JorgeB said:

That just ignores the errors, try this first:

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165009

 

 

If it doesn't work then use the other option.

 

Ok, sadly neither pcie_aspm=off nor pci=noaer resolved the issue. However, I got new syslogs and noticed something else that is also present in the previous syslogs.

 

In syslog3 (using pcie_aspm=off) I got the last few lines saying:

Mar 22 15:20:19 RaidByte kernel: eth0: renamed from vetha3633d2
Mar 22 15:20:19 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth8d0eb2b: link becomes ready
Mar 22 15:20:19 RaidByte kernel: br-cfc1bfc30214: port 1(veth8d0eb2b) entered blocking state
Mar 22 15:20:19 RaidByte kernel: br-cfc1bfc30214: port 1(veth8d0eb2b) entered forwarding state
Mar 22 15:20:20 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered blocking state
Mar 22 15:20:20 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered disabled state
Mar 22 15:20:20 RaidByte kernel: device veth5e20028 entered promiscuous mode
Mar 22 15:20:23 RaidByte kernel: eth0: renamed from vethd6d954e
Mar 22 15:20:23 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth5e20028: link becomes ready
Mar 22 15:20:23 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered blocking state
Mar 22 15:20:23 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered forwarding state

 

Similarly in syslog4 (using pci=noaer):

Mar 23 03:53:15 RaidByte kernel: eth0: renamed from veth7fb01b6
Mar 23 03:53:15 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc29b4d6: link becomes ready
...
Mar 23 04:13:54 RaidByte kernel: eth0: renamed from vethd1c1ad9
Mar 23 04:13:54 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2079265: link becomes ready
...
Mar 23 04:14:03 RaidByte kernel: eth0: renamed from vethf8969d6
Mar 23 04:14:03 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethe35ff87: link becomes ready
...
Mar 23 04:14:05 RaidByte kernel: device veth1200252 entered promiscuous mode
Mar 23 04:14:45 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth1200252: link becomes ready
Mar 23 04:14:45 RaidByte kernel: br-cfc1bfc30214: port 6(veth1200252) entered blocking state
Mar 23 04:14:45 RaidByte kernel: br-cfc1bfc30214: port 6(veth1200252) entered forwarding state
Mar 23 04:15:01 RaidByte Docker Auto Update: Community Applications Docker Autoupdate finished
Mar 23 12:28:20 RaidByte webGUI: Successful login user root from 192.168.0.54 (Note: last line)

 

This looks like some IPv6 address conflict, and when scanning the old log files I found that this message is always present. Also worth noting: one minute before the server went down in syslog3, I connected using Tailscale (with the server set as exit node). Maybe that has something to do with it as well?

Searching online didn't help me fully understand what the problem is here.
As for my network setup:

  • I recently directly connected my server to the router by ethernet (before had to do some ugly wifi repeater stuff due to our infrastructure), this also matches up approximately when the issues started appearing.
  • "IPv4 address assignment" is set to Automatic (so DHCP should set IP, however there is a value in "IPv4 address" I guess this gets ignored when in Automatic?)

  • "Network protocol" is set to "IPv4 only"

  • On my router I set the server to statically use 192.168.0.10

  • Bonding is enabled with mode "active-backup (1)" with eth0 as bonding member.

  • VLANs are disabled.

So for now I'm going to set it back to pcie_aspm=off since the AER error was gone at least and hope to find a solution for these network issues.

syslog3 syslog4

Link to comment
On 3/24/2024 at 6:52 PM, JorgeB said:

IPv6 messages may come from containers, but looking at them, not sure if they are an actual problem, other than the log spam.

Restarted again, nothing seems to work. Got any idea what else this could be? (syslog attached)

 

Thanks for your thorough help, it is greatly appreciated!

syslog

Link to comment
11 minutes ago, JorgeB said:

Please be more specific, what is not working?

Same as mentioned before, after a couple days I get complete failure on all ends:

  • WebUI not accessible
  • Connecting via SSH times out
  • Connecting via SMB times out
  • When connecting via HDMI only getting a black screen
  • Now, as opposed to before, the Docker containers are also not responding

Then, after hard-reboot everything works again for about 1-2 days and then I get the same issue.

Link to comment

The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted.  You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash/freeze to see if it shows anything.  The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.  

Link to comment
Posted (edited)
33 minutes ago, itimpi said:

The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted.  You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash/freeze to see if it shows anything.  The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.  

Yes, I already enabled the mirror to flash option. All the syslogs I posted here are taken directly from the flash drive with the "Mirror syslog to flash" feature after it crashed.

Correction: The diagnostics zip file in the main post is not taken with "Mirror syslog to flash". But all the other syslogs are.

Edited by xBotRaid
Link to comment
32 minutes ago, xBotRaid said:

Correction: The diagnostics zip file in the main post is not taken with "Mirror syslog to flash". But all the other syslogs are.

OK - since only the original post contained diagnostics, are you saying that the subsequent syslogs posted are ones created by the syslog server and are not RAM copies?  Just asking as it is clearer when posting diagnostics (using the latest Unraid version) as the syslog server version included there gets labelled as syslog-previous.txt when using mirror to flash.

Link to comment
  • Solution

Unfortunately there's nothing relevant logged in the latest syslog, this usually points to a hardware issue, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

Link to comment
Posted (edited)
1 hour ago, itimpi said:

OK - since only the original post contained diagnostics, are you saying that the subsequent syslogs posted are ones created by the syslog server and are not RAM copies?  Just asking as it is clearer when posting diagnostics (using the latest Unraid version) as the syslog server version included there gets labelled as syslog-previous.txt when using mirror to flash.

 

1 hour ago, JorgeB said:

Unfortunately there's nothing relevant logged in the latest syslog, this usually points to a hardware issue, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

 

Yes, I did not know you can also post the diagnostics after a reboot (with the syslog server enabled) and that syslog-previous.txt is then included. What I have always done so far is, before restarting, plug the flash drive into my laptop and retrieve the syslog (no -previous.txt and no .txt extension). I think it only gets renamed to syslog-previous.txt after rebooting again. So yes, all syslogs are captured using the syslog server, by copying the file directly from my flash drive to my laptop before rebooting the server.

UPDATE: After it crashing again, I just tried to boot it up in safe-mode. Now it doesn't even start up anymore. It still shows the boot process via HDMI all the way to the point where I can log in using my credentials. But it's not accessible via webUI nor SSH. I tried Safe-Mode-GUI but this didn't work as it got stuck at a flashing cursor after boot-up. I attached some logs of the two previous boot attempts (syslog is the newest boot and syslog-previous the one before, both taken using syslog server).

I also loaded up a fresh version of unraid on another SanDisk Cruzer, this one booted up just fine.

 

Also noticed a couple of FSCK0000.REC files on my flash drive.

1022304662_Screenshot2024-03-26at13_12_22.thumb.png.f21e38cde74aa9dda71a12292f619a03.png

 

syslog syslog-previous

Edited by xBotRaid
Link to comment
3 minutes ago, xBotRaid said:

I also loaded up a fresh version of unraid on another SanDisk Cruzer, this one booted up just fine.

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

Link to comment
8 minutes ago, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

Ok, thanks! I'll try that. I'll keep you updated if something interesting happens.

Link to comment
On 3/26/2024 at 1:18 PM, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

 

I had some success (I think), since my server worked just fine for 3 days (before, it lasted less than 2 days). Then I decided to install the Tailscale plugin. Shortly after, the webGUI wasn't accessible anymore (differently this time, though: the login page loaded quickly, but logging in timed out, and a reload landed me back at the login window).

 

This can be attributed to a common problem I found listed in this plugin.

 

However, this does not indicate that this was the problem before, since before I used the docker version. 

But the server worked for 3 days, so that's a good sign. I'll try seeing how it behaves now over longer time without Tailscale, don't need it necessarily anyways.

 

Thanks for your help, will mark yours as solution. If anything unusual occurs, I'll revive this thread.

Link to comment
On 3/26/2024 at 1:18 PM, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

My hope was short-lived...

The server just crashed again within a few hours. But this time, completely differently and weirdly, which confuses me even more.

I definitely need your advice on that one, since that's making no sense to me - at all.

 

What happened?

As I was setting up some stuff in HomeAssistant, one second later it didn't respond anymore. So I checked my Unraid dashboard (within 30 seconds), there it was up and running, but now displaying "Uptime 1 minutes", I went to the docker containers and it had the warning that they were still booting up. So within a minute my system went from "all docker containers up and running just fine, more than 1h uptime" to "1 min uptime, docker containers are still starting up". How's that even possible? If the system really crashed, it would need to boot up first again, and that usually takes like 5 minutes until I can first access the dashboard, now it was more like 30 seconds.

 

First I thought I was crazy, however then it happened again. I was deleting some old docker containers, suddenly the UI didn't respond for like 10 seconds. Then I refreshed and again it said "Uptime 1 minutes" (maybe also 0, not sure) and the docker containers were starting again. How can that happen, that it crashes and then recovers in just about 15 seconds?

 

Two more interesting observations:

  1. For the second crash, or whatever that was, I took a diagnostic before (...-2023) and after (...-2148). Comparing syslog.txt from the "before" with syslog-previous.txt from the "after", they are, as expected, almost identical. In fact the only difference is a single line, and that's me logging into the webGUI. So nothing during the crash or shortly before it was logged by the syslog server.
  2. I've noticed that between the different crashes the timezone (which should be CET, i.e. UTC+1) was always different, so the hour was wrong but the minutes were correct. In the log of the first crash (...-1938) there's a "ntpd[1464]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" error. According to the only result I found on this, from another Unraid forum thread, this can be ignored and happens on every startup. Here, however, it appeared once during boot-up and once after all the docker containers had started.
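
The before/after comparison from the first observation can be repeated with Python's difflib, which avoids comparing two long logs by eye. This is a minimal sketch: the helper name log_differences is made up, and the two sample lists only reuse log lines quoted earlier in this thread to stand in for syslog.txt and syslog-previous.txt.

```python
import difflib

def log_differences(before_lines, after_lines):
    """Return only the added (+) and removed (-) lines between two logs."""
    diff = list(difflib.unified_diff(before_lines, after_lines, lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; hunk markers
    # start with '@@'. Everything else starting with + or - is a real change.
    return [entry for entry in diff[2:] if entry.startswith(("+", "-"))]

# Stand-ins for the two files, built from lines quoted earlier in the thread.
before = [
    "Mar 23 04:14:45 RaidByte kernel: br-cfc1bfc30214: port 6(veth1200252) entered forwarding state",
    "Mar 23 04:15:01 RaidByte Docker Auto Update: Community Applications Docker Autoupdate finished",
]
after = before + [
    "Mar 23 12:28:20 RaidByte webGUI: Successful login user root from 192.168.0.54",
]

for entry in log_differences(before, after):
    print(entry)
```

With real files, read each one with something like open(path).read().splitlines() and pass the two lists in. If the only difference is a login line, as described above, the log really did capture nothing in the moments before the restart.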

 

That's all I have. Attached is:

 

Thanks again for your help!

I bought you a coffee/beer for your great help so far!

Link to comment

(Stopping by because I saw the Tailscale mention and decided to check the diagnostics to make sure it wasn't a plugin issue)

 

The good news: your server restarts a lot faster than you think.

The bad news: your server is definitely restarting.

 

Given that there's nothing in the syslog indicating why the reboot is happening, I would go back to what JorgeB indicated previously -- this seems like a hardware problem.

Link to comment
Posted (edited)
3 hours ago, EDACerton said:

(Stopping by because I saw the Tailscale mention and decided to check the diagnostics to make sure it wasn't a plugin issue)

 

The good news: your server restarts a lot faster than you think.

The bad news: your server is definitely restarting.

 

Given that there's nothing in the syslog indicating why the reboot is happening, I would go back to what JorgeB indicated previously -- this seems like a hardware problem.

 

Hi, thanks for your input! Still find it kinda crazy it reboots so quickly, but ok.

If the problem is hardware, are there any tips (or best guesses) on how to figure out where in hardware the problem could lie? (don't have the funds or spare parts to replace the whole server ;D)

 

Bonus: Attached a tailscale diagnostics I had back when I was still using your plugin. Figured I might put it in here as well, if you need it to debug. Although for me doesn't matter since I don't use it anymore right now. But: Really cool plugin btw!

RaidByte-tailscale-diag-20240329-085341.zip

Edited by xBotRaid
Link to comment
3 hours ago, xBotRaid said:

are there any tips (or best guesses) on how to figure out where in hardware the problem could lie?

If you have a different PC you can try swapping some parts, like the PSU for example, if the server has multiple RAM sticks try with just one, if the same try a different one, that will basically rule out the RAM.

Link to comment
On 3/30/2024 at 11:28 AM, JorgeB said:

If you have a different PC you can try swapping some parts, like the PSU for example, if the server has multiple RAM sticks try with just one, if the same try a different one, that will basically rule out the RAM.

 

Update: Hopefully last update... 😅

 

After completely removing everything, except the essential parts, I still got the issue. Now however I found the same error in the syslog as detailed here

I disabled c-states and now I'm checking, with all the hardware, how long it will be up.

 

Wish me luck! I hope this was the issue, could very well be a different issue I produced after resetting the motherboard however.

Link to comment
