xBotRaid

April 29

For all future readers:
So! After a while I can confidently say the issue is gone. Current uptime 11 days.

As to what exactly the issue was, I cannot say for sure, since along the way I also got rid of some plugins I didn't need.

But I have a strong suspicion that the ultimate solution was as simple as disabling the c-states (and checking that the RAM is running at the correct speed), as suggested by JorgeB's guide.

I say this because my current config, which has lasted 11 days, is pretty much identical to the previous setup.

The only differences are some uninstalled plugins and a re-flash of Unraid.

Additionally, I came across the RTL8168 drivers from Community Applications, which I also installed; however, the server had already worked reliably for 3 days without them.

Otherwise, same Hardware, Docker Containers and with neither pcie_aspm=off nor pci=noaer.

So if anyone else comes across this issue, I would recommend you to:

Follow JorgeB's guide on finding the correct RAM speed and disabling the c-states.
Additionally, it makes sense to install the RTL Drivers (pick the right one for your system).
If all that doesn't resolve the issue, try setting pcie_aspm=off or pci=noaer.

Thanks for all the help along the way! Hope this helps if someone else has the same issue.

April 1

On 3/30/2024 at 11:28 AM, JorgeB said:

If you have a different PC you can try swapping some parts, like the PSU for example, if the server has multiple RAM sticks try with just one, if the same try a different one, that will basically rule out the RAM.

Update: Hopefully last update... 😅

After completely removing everything, except the essential parts, I still got the issue. Now however I found the same error in the syslog as detailed here.

I disabled c-states and now I'm checking, with all the hardware, how long it will be up.

Wish me luck! I hope this was the issue, though it could very well be a different issue I introduced by resetting the motherboard.

March 30

3 hours ago, EDACerton said:

(Stopping by because I saw the Tailscale mention and decided to check the diagnostics to make sure it wasn't a plugin issue)

The good news: your server restarts a lot faster than you think.

The bad news: your server is definitely restarting.

Given that there's nothing in the syslog indicating why the reboot is happening, I would go back to what JorgeB indicated previously -- this seems like a hardware problem.

Hi, thanks for your input! Still find it kinda crazy it reboots so quickly, but ok.

If the problem is hardware, are there any tips (or best guesses) on how to figure out where in hardware the problem could lie? (don't have the funds or spare parts to replace the whole server ;D)

Bonus: Attached a tailscale diagnostics I had back when I was still using your plugin. Figured I might put it in here as well, if you need it to debug. Although for me doesn't matter since I don't use it anymore right now. But: Really cool plugin btw!

RaidByte-tailscale-diag-20240329-085341.zip

March 29

On 3/26/2024 at 1:18 PM, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

My hope was short-lived...

The server just crashed again within a few hours. But this time, completely differently and weirdly, which confuses me even more.

I definitely need your advice on that one, since that's making no sense to me - at all.

What happened?

As I was setting up some stuff in HomeAssistant, one second later it didn't respond anymore. So I checked my Unraid dashboard (within 30 seconds), there it was up and running, but now displaying "Uptime 1 minutes", I went to the docker containers and it had the warning that they were still booting up. So within a minute my system went from "all docker containers up and running just fine, more than 1h uptime" to "1 min uptime, docker containers are still starting up". How's that even possible? If the system really crashed, it would need to boot up first again, and that usually takes like 5 minutes until I can first access the dashboard, now it was more like 30 seconds.

First I thought I was crazy, however then it happened again. I was deleting some old docker containers, suddenly the UI didn't respond for like 10 seconds. Then I refreshed and again it said "Uptime 1 minutes" (maybe also 0, not sure) and the docker containers were starting again. How can that happen, that it crashes and then recovers in just about 15 seconds?

Two more interesting observations:

For the second crash, or whatever that was, I took a diagnostic before (...-2023) and after (...-2148). Comparing syslog.txt from the before with syslog-previous.txt from the after, they are, as expected, identical apart from the very end. In fact the only difference is a single line: me logging into the webGUI. So the syslog server logged nothing during the crash or shortly before it.
I've noticed that between the different crashes the timezone (which should be CET, i.e. UTC+1) was always different: the hour was wrong, but the minutes were correct. In the log of the first crash (...-1938) there's a "ntpd[1464]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" error. According to the only result I found on this, from another Unraid forum thread, this can be ignored and happens on every startup; here, however, it occurred once during boot-up and once after all the docker containers had started.
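The before/after comparison described above can also be done mechanically. A minimal Python sketch, assuming both logs have been read into lists of lines; the helper name `lines_after_common_prefix` is made up for illustration, and the sample lines are simply borrowed from a syslog excerpt elsewhere in this thread:

```python
def lines_after_common_prefix(before: list[str], after: list[str]) -> list[str]:
    """Return the lines of `after` that follow the part it shares with `before`."""
    shared = 0
    for old, new in zip(before, after):
        if old != new:
            break
        shared += 1
    return after[shared:]


# Sample data only: `before` stands in for syslog.txt from the earlier diagnostic,
# `after` for syslog-previous.txt from the later one.
before = [
    "Mar 14 01:19:03 RaidByte emhttpd: spinning down /dev/sde",
    "Mar 14 02:03:04 RaidByte emhttpd: read SMART /dev/sde",
]
after = before + ["Mar 14 02:15:23 RaidByte emhttpd: read SMART /dev/sdc"]

for line in lines_after_common_prefix(before, after):
    print(line)
```

With real files, the two lists would come from something like `Path("syslog.txt").read_text().splitlines()`. If nothing but a login line shows up, the crash itself left no trace in the log.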

That's all I have. Attached is:

First Crash Diagnostic (...-1938.zip, raidbyte-diagnostics-20240329-1938.zip)
BEFORE: Second Crash Diagnostic (...-2023, raidbyte-diagnostics-20240329-2023.zip)
AFTER: Second Crash Diagnostic (...-2148, raidbyte-diagnostics-20240329-2148.zip)

Thanks again for your help!

I bought you a coffee/beer for your great help so far!

March 29

On 3/26/2024 at 1:18 PM, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

I had some success (I think), since my server worked just fine for 3 days (before, it lasted less than 2 days). Then I decided to install the Tailscale plugin. Shortly after, the webGUI wasn't accessible anymore (differently this time, though: the login page loaded quickly, but logging in timed out, and a reload landed back at the login window).

This can be attributed to a common problem I found listed in this plugin.

However, this does not indicate that this was the problem before, since before I used the docker version.

But the server worked for 3 days, so that's a good sign. I'll try seeing how it behaves now over longer time without Tailscale, don't need it necessarily anyways.

Thanks for your help, will mark yours as solution. If anything unusual occurs, I'll revive this thread.

March 26

8 minutes ago, JorgeB said:

That suggests a /config problem. You can try redoing the flash drive: back up the current one first, then redo it and restore just the bare minimum, like the key, super.dat and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server or try restoring a few config files at a time from the backup to see if you can find the culprit.

Ok, thanks! I'll try that. I'll keep you updated if something interesting happens.

March 26

1 hour ago, itimpi said:

OK - since only the original post contained diagnostics, are you saying that the subsequent syslogs posted are ones created by the syslog server and are not RAM copies? Just asking as it is clearer when posting diagnostics (using the latest Unraid version) as the syslog server version included there gets labelled as syslog-previous.txt when using mirror to flash.

1 hour ago, JorgeB said:

Unfortunately there's nothing relevant logged in the latest syslog, this usually points to a hardware issue, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

Yes, I did not know you can also post the diagnostics after reboot (with the syslog server enabled) and the syslog-previous.txt is included. What I always did so far is, before restarting, plugging in the flash drive into my laptop and retrieve the syslog (no -previous.txt or no .txt extension). I think it only gets renamed syslog-previous.txt after rebooting again. So yes, all syslogs are captured using the syslog server, directly moving the file on my flash drive to my laptop before rebooting the server.

UPDATE: After it crashing again, I just tried to boot it up in safe-mode. Now it doesn't even start up anymore. It still shows the boot process via HDMI all the way to the point where I can log in using my credentials. But it's not accessible via webUI nor SSH. I tried Safe-Mode-GUI but this didn't work as it got stuck at a flashing cursor after boot-up. I attached some logs of the two previous boot attempts (syslog is the newest boot and syslog-previous the one before, both taken using syslog server).

I also loaded up a fresh version of unraid on another SanDisk Cruzer, this one booted up just fine.

Also noticed a couple of FSCK0000.REC files on my flash drive.

syslog syslog-previous

March 26

33 minutes ago, itimpi said:

The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted. You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash/freeze to see if it shows anything. The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.

Yes, I already enabled the mirror to flash option. All the syslogs I posted here are taken directly from the flash drive with the "Mirror syslog to flash" feature after it crashed.

Correction: The diagnostics zip file in the main post is not taken with "Mirror syslog to flash". But all the other syslogs are.

March 26

11 minutes ago, JorgeB said:

Please be more specific, what is not working?

Same as mentioned before, after a couple days I get complete failure on all ends:

WebUI not accessible
Connecting via SSH times out
Connecting via SMB times out
When connecting via HDMI only getting a black screen
Now, as opposed to before, the Docker containers are also not responding

Then, after hard-reboot everything works again for about 1-2 days and then I get the same issue.
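To record which of these services is still reachable when the problem hits, a small probe run from another machine on the LAN may help. This is only a sketch: SSH on 22 and SMB on 445 are the standard ports, the WebUI port of 80 is an assumption (adjust it to whatever the server uses), and `port_open` and `probe` are made-up helper names.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, unreachable hosts and timeouts
        return False


def probe(host: str, services: dict[str, int]) -> dict[str, bool]:
    """Check every named service port on the host and report its reachability."""
    return {name: port_open(host, port) for name, port in services.items()}


# Example call (not executed here; 192.168.0.10 is the server address from this thread):
# print(probe("192.168.0.10", {"WebUI": 80, "SSH": 22, "SMB": 445}))
```

Running this every minute and logging the result with a timestamp would show whether all services drop at once or one after another.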

March 25

On 3/24/2024 at 6:52 PM, JorgeB said:

IPv6 messages may come from containers, but looking at them, not sure if they are an actual problem, other than the log spam.

Restarted again, nothing seems to work. Got any idea what else this could be? (syslog attached)

Thanks for your thorough help, it is greatly appreciated!

syslog

March 24

On 3/21/2024 at 10:13 AM, JorgeB said:

That just ignores the errors, try this first:

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165009

If it doesn't work then use the other option.

Ok, sadly neither pcie_aspm=off nor pci=noaer resolved the issue. However, I got new syslogs and noticed something new that is also present in the previous syslogs.

In syslog3 (using pcie_aspm=off) I got the last few lines saying:

Mar 22 15:20:19 RaidByte kernel: eth0: renamed from vetha3633d2
Mar 22 15:20:19 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth8d0eb2b: link becomes ready
Mar 22 15:20:19 RaidByte kernel: br-cfc1bfc30214: port 1(veth8d0eb2b) entered blocking state
Mar 22 15:20:19 RaidByte kernel: br-cfc1bfc30214: port 1(veth8d0eb2b) entered forwarding state
Mar 22 15:20:20 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered blocking state
Mar 22 15:20:20 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered disabled state
Mar 22 15:20:20 RaidByte kernel: device veth5e20028 entered promiscuous mode
Mar 22 15:20:23 RaidByte kernel: eth0: renamed from vethd6d954e
Mar 22 15:20:23 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth5e20028: link becomes ready
Mar 22 15:20:23 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered blocking state
Mar 22 15:20:23 RaidByte kernel: br-cfc1bfc30214: port 7(veth5e20028) entered forwarding state

Similarly in syslog4 (using pci=noaer):

Mar 23 03:53:15 RaidByte kernel: eth0: renamed from veth7fb01b6
Mar 23 03:53:15 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc29b4d6: link becomes ready
...
Mar 23 04:13:54 RaidByte kernel: eth0: renamed from vethd1c1ad9
Mar 23 04:13:54 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2079265: link becomes ready
...
Mar 23 04:14:03 RaidByte kernel: eth0: renamed from vethf8969d6
Mar 23 04:14:03 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethe35ff87: link becomes ready
...
Mar 23 04:14:05 RaidByte kernel: device veth1200252 entered promiscuous mode
Mar 23 04:14:45 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth1200252: link becomes ready
Mar 23 04:14:45 RaidByte kernel: br-cfc1bfc30214: port 6(veth1200252) entered blocking state
Mar 23 04:14:45 RaidByte kernel: br-cfc1bfc30214: port 6(veth1200252) entered forwarding state
Mar 23 04:15:01 RaidByte Docker Auto Update: Community Applications Docker Autoupdate finished
Mar 23 12:28:20 RaidByte webGUI: Successful login user root from 192.168.0.54 (Note: last line)

This looks like some IPv6 address conflict; when scanning the old log files, this message is always present. Also worth noting: one minute before the server went down in syslog3, I connected using Tailscale (setting the server as Exit Node). Maybe that has something to do with it as well?

Searching online didn't help me fully understand what the problem is here.
As for my network setup:

I recently directly connected my server to the router by ethernet (before had to do some ugly wifi repeater stuff due to our infrastructure), this also matches up approximately when the issues started appearing.
"IPv4 address assignment" is set to Automatic (so DHCP should set IP, however there is a value in "IPv4 address" I guess this gets ignored when in Automatic?)
"Network protocol" is set to "IPv4 only"
On my router I set the server to statically use 192.168.0.10
Bonding is enabled with mode "active-backup (1)" with eth0 as bonding member.
VLANs are disabled.

So for now I'm going to set it back to pcie_aspm=off since the AER error was gone at least and hope to find a solution for these network issues.

syslog3 syslog4

March 21

4 hours ago, JorgeB said:

That just ignores the errors, try this first:

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165009

If it doesn't work then use the other option.

Sounds like a good idea.

Thus, changes made:

Set pcie_aspm=off, instead of pci=noaer and rebooted
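After the reboot it is worth confirming that the flag actually reached the kernel. A minimal Python sketch, assuming the standard Linux `/proc/cmdline` file is readable on the server; the helper names are made up for illustration, and the sample string below is invented rather than copied from a real boot:

```python
from pathlib import Path


def active_flags(cmdline: str, wanted: list[str]) -> dict[str, bool]:
    """Report which of the wanted boot parameters appear on the kernel command line.

    Only exact tokens are matched, so a combined form such as
    "pci=noaer,<something>" would not be recognised by this sketch.
    """
    present = set(cmdline.split())
    return {flag: flag in present for flag in wanted}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the parameters the running kernel was booted with."""
    return Path(path).read_text()


# Invented sample command line, for demonstration only.
sample = "BOOT_IMAGE=/bzimage initrd=/bzroot pcie_aspm=off"
print(active_flags(sample, ["pcie_aspm=off", "pci=noaer"]))
```

On the server itself, `active_flags(read_cmdline(), [...])` would check the live system instead of the sample string.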

March 20

On 3/17/2024 at 11:41 AM, JorgeB said:

This indicates a flash drive problem, you can try re-formatting it first, if issues persist, replace it.

Update: As said, I did a fresh install on a newly bought USB device (the popular SanDisk Cruzer Blade to be precise).

Sadly, the issue occurred twice again in the last day. SSH and webUI not responding. Luckily I have some more logs (attached), however I don't know if they help.
Let's break it down.

First Crash (syslog1)

Starts with the following message I encountered in the other syslogs as well:

Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3:    [ 6] BadTLP

I read in a wiki post (German) that this is due to a bug in the AER driver: a log message does not get cleared (even though the error was corrected) and thus overloads the system.

According to other Unraid forum posts (like this one), as a workaround one can disable PCIe Advanced Error Reporting (AER) using pci=noaer in the syslinux config. So to address this, I went ahead and set it. Let's see how it goes.
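To judge whether the corrected AER errors are frequent enough to flood the log, they can be counted per day and per PCIe port. A rough Python sketch, assuming the syslog lines look like the excerpt above; the regular expression and the function name `aer_events_per_day` are my own and not part of any Unraid tooling:

```python
import re
from collections import Counter

# Matches e.g. "Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: ..."
AER_LINE = re.compile(r"^(\w{3}\s+\d+) .*pcieport (\S+): AER: Corrected error received")


def aer_events_per_day(lines: list[str]) -> Counter:
    """Count 'Corrected error received' entries, keyed by (date, PCIe port)."""
    counts: Counter = Counter()
    for line in lines:
        match = AER_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample lines taken from the excerpt above.
sample = [
    "Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0",
    "Mar 19 11:31:20 RaidByte kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
]
for (day, port), count in aer_events_per_day(sample).items():
    print(day, port, count)
```

A handful of events per day would make log flooding an unlikely cause; thousands per minute would point the other way.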

Then interestingly an error only seen in this log file:

Mar 19 17:59:22 RaidByte kernel: usb 7-4: USB disconnect, device number 8
Mar 19 17:59:23 RaidByte kernel: usb 7-4: new full-speed USB device number 9 using xhci_hcd
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI Read Default Erroneous Data Reporting command is advertised, but not supported.
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI Read Transmit Power Level command is advertised, but not supported.
Mar 19 17:59:23 RaidByte kernel: Bluetooth: hci0: HCI LE Set Random Private Address Timeout command is advertised, but not supported.
Mar 19 18:02:45 RaidByte kernel: usb 7-4: USB disconnect, device number 9
...
Mar 19 22:37:45 RaidByte kernel: usb 7-4: new full-speed USB device number 29 using xhci_hcd
Mar 19 22:37:45 RaidByte kernel: usb 7-4: device descriptor read/64, error -71
Mar 19 22:37:46 RaidByte kernel: usb 7-4: device descriptor read/64, error -71
...
Mar 20 04:06:44 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered blocking state
Mar 20 04:06:44 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered disabled state
Mar 20 04:06:44 RaidByte kernel: device veth9ef81a9 entered promiscuous mode
Mar 20 04:06:52 RaidByte kernel: eth0: renamed from veth228d9bb
Mar 20 04:06:52 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth9ef81a9: link becomes ready
Mar 20 04:06:52 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered blocking state
Mar 20 04:06:52 RaidByte kernel: br-cfc1bfc30214: port 13(veth9ef81a9) entered forwarding state
Mar 20 04:06:54 RaidByte Docker Auto Update: Community Applications Docker Autoupdate finished (Note from me: Last line of log)

A lot of errors about usb 7-4 (my UGREEN Bluetooth USB dongle for the VM), I didn't encounter these before, except the `device descriptor read/64, error -71` error.

I believe this was not the problem before, since I didn't see that message before. Anyhow I removed the Bluetooth Dongle to eliminate it as a probable cause of the issue.

Second Crash (syslog2)

Now here it gets even more confusing. It seems as if there were two SIGTERMs (termination signals), neither of which I think I initiated. See for yourself:

Mar 20 11:27:17 RaidByte emhttpd: read SMART /dev/nvme0n1
Mar 20 11:27:17 RaidByte emhttpd: read SMART /dev/sda
Mar 20 11:27:18 RaidByte emhttpd: Starting services...
Mar 20 11:27:18 RaidByte emhttpd: shcmd (17): /etc/rc.d/rc.samba restart
Mar 20 11:27:18 RaidByte wsdd2[1771]: 'Terminated' signal received.
Mar 20 11:27:18 RaidByte nmbd[1761]: [2024/03/20 11:27:18.072045,  0] ../../source3/nmbd/nmbd.c:59(terminate)
Mar 20 11:27:18 RaidByte nmbd[1761]:   Got SIGTERM: going down...
...
Mar 20 21:00:00 RaidByte winbindd[21720]:   initialize_winbindd_cache: clearing cache and re-creating with version number 2
Mar 20 21:00:00 RaidByte emhttpd: shcmd (33579): /etc/rc.d/rc.avahidaemon restart
Mar 20 21:00:00 RaidByte root: Stopping Avahi mDNS/DNS-SD Daemon: stopped
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: Got SIGTERM, quitting.
Mar 20 21:00:00 RaidByte avahi-dnsconfd[7442]: read(): EOF
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: Leaving mDNS multicast group on interface br0.IPv4 with address 192.168.0.10.
Mar 20 21:00:00 RaidByte avahi-daemon[7433]: avahi-daemon 0.8 exiting.
Mar 20 21:00:00 RaidByte root: Starting Avahi mDNS/DNS-SD Daemon: /usr/sbin/avahi-daemon -D
...
Mar 20 21:01:54 RaidByte kernel: br0: port 1(bond0) entered disabled state
Mar 20 21:01:54 RaidByte kernel: eth0: 0xffffc9000006d000, 2c:f0:5d:5e:84:9d, IRQ 51
Mar 20 21:01:55 RaidByte tips.and.tweaks: Tweaks Applied
Mar 20 21:01:55 RaidByte unassigned.devices: Mounting 'Auto Mount' Remote Shares... (Note: Last line of log)

Other than that I didn't find anything (not even an AER error). Maybe someone else can find something interesting.

Changes made, to prevent future crashes:

Set pci=noaer in syslinux to prevent AER error bug
Unplugged Bluetooth Dongle

Let's see if that helps. If you have any other ideas what the issue could be here I'd love to hear it!

Thanks!

syslog2 syslog1

March 18

On 3/17/2024 at 11:41 AM, JorgeB said:

This indicates a flash drive problem, you can try re-formatting it first, if issues persist, replace it.

Thanks a lot for the suggestion!

I tried to reformat the drive and put the backup on there. This caused the webUI to not show at all (which confirms your point).
Afterwards I decided to do a fresh install and bought a new USB drive.

Copied some of the backup's config files on a fresh install, no problems yet.

I'll mark this as solution if the issue doesn't reappear in the following days.

Have a nice evening!

March 16

Hello! I have been running a server for over 5 years now without any major problems. Now, however, I face an issue that is really hard to debug: it's sporadic, and it's hard to tell what the problem is since I don't see any errors.

Let's jump right into it!

Issue: About 2-3 days after every boot-up the system is not accessible anymore, neither via WebUI nor SSH. However, the docker containers still seem to work just fine. For example I can access Plex, SABnzbd or Radarr. On the other hand my Traefik is not responding, so my website is down and I can only access the services locally. The services also seem to have trouble communicating with each other, e.g. Radarr shows that SABnzbd is not accessible. After a hard reboot the server is fully accessible again. However, the same problem usually occurs again after a couple of days.

Logs: Firstly, you can find my syslog (two actually, March 11 and 14) captured via the syslog server's feature "Mirror syslog to flash" and a diagnostics file (after reboot, since I cannot access anything) attached.
Sadly I don't seem to find anything interesting inside. The logs just seem to stop at March 14 02:15:23, even though I hard-shutdown the server today on March 16 at around 15:30 (in this example I first realised the server stopped responding on March 14 at around 12:00, but didn't have time to do anything). Just to check, I also enabled the "Local syslog folder", which shows the last message and me shutting it down:

Mar 14 00:48:49 RaidByte emhttpd: read SMART /dev/sdc
Mar 14 01:19:01 RaidByte emhttpd: spinning down /dev/sdc
Mar 14 01:19:03 RaidByte emhttpd: spinning down /dev/sde
Mar 14 02:03:04 RaidByte emhttpd: read SMART /dev/sde
Mar 14 02:15:23 RaidByte emhttpd: read SMART /dev/sdc
Mar 16 15:35:54 RaidByte root: Delaying execution of fix common problems scan for 10 minutes
Mar 16 15:35:54 RaidByte unassigned.devices: Mounting 'Auto Mount' Devices...
Mar 16 15:35:54 RaidByte emhttpd: Starting services...
Mar 16 15:35:54 RaidByte emhttpd: shcmd (59): /etc/rc.d/rc.samba restart
Mar 16 15:35:54 RaidByte wsdd2[10426]: 'Terminated' signal received.
Mar 16 15:35:54 RaidByte nmbd[10416]: [2024/03/16 15:35:54.270850,  0] ../../source3/nmbd/nmbd.c:59(terminate)
Mar 16 15:35:54 RaidByte nmbd[10416]:   Got SIGTERM: going down...
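The silent stretch in a log like this can be located automatically by looking for the largest pause between consecutive entries. A minimal Python sketch, assuming classic syslog timestamps as in the excerpt above; since those carry no year, the year is passed in explicitly, and `largest_gap` is a made-up helper name:

```python
from datetime import datetime, timedelta


def largest_gap(lines: list[str], year: int) -> tuple[timedelta, str, str]:
    """Return the longest pause between consecutive entries and the two lines around it.

    Needs at least two lines; fewer will raise ValueError.
    """
    def stamp(line: str) -> datetime:
        # The first 15 characters hold the timestamp, e.g. "Mar 14 02:15:23".
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

    pairs = zip(lines, lines[1:])
    return max(((stamp(b) - stamp(a), a, b) for a, b in pairs), key=lambda item: item[0])


# Sample lines taken from the excerpt above.
sample = [
    "Mar 14 02:03:04 RaidByte emhttpd: read SMART /dev/sde",
    "Mar 14 02:15:23 RaidByte emhttpd: read SMART /dev/sdc",
    "Mar 16 15:35:54 RaidByte root: Delaying execution of fix common problems scan for 10 minutes",
]
gap, last_before, first_after = largest_gap(sample, 2024)
print(gap)
print(last_before)
print(first_after)
```

The line printed as `last_before` is the last thing the system managed to log, which is usually the best starting point when reading backwards for a cause.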

The only suspicious thing I found was this BadTLP error one day before the logs stop. I don't know if this is related, or what it means, but it seems to have been "corrected"?

Mar 13 22:03:06 RaidByte emhttpd: read SMART /dev/sdb
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Mar 13 22:10:14 RaidByte kernel: pcieport 0000:00:01.3:    [ 6] BadTLP                
Mar 13 22:12:44 RaidByte emhttpd: read SMART /dev/sdc

In the other file on March 11, there's also such an error. Additionally we can see this:

Mar 11 10:28:05 RaidByte kernel: Buffer I/O error on dev sdb1, logical block 2179, lost async page write
Mar 11 10:28:05 RaidByte kernel: device offline error, dev sdb, sector 4405507 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): unable to read inode block for updating (i_pos 70455344)
Mar 11 10:28:05 RaidByte kernel: device offline error, dev sdb, sector 18884 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2

followed by many more of this type and just before I reboot again this:

Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): FAT read failed (blocknr 2179)
Mar 11 10:28:05 RaidByte kernel: FAT-fs (sdb1): unable to read inode block for updating (i_pos 70648355)
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Failed to read block 0x3c2c820: -5
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read fragment cache entry [3c2c820]
Mar 11 10:28:05 RaidByte kernel: SQUASHFS error: Unable to read page, block 3c2c820, size a0ac

Steps taken to try and mitigate:

Hard Reboot (Problem occurs again after few days)
Change USB-Port of Unraid Drive
Change WebUI port from 1001 to an unused number below 1000, since I read in another forum with similar issues that one should use a port below 1000.
Replug Ethernet Cable
Fresh Install
Switched out USB drive with freshly bought SanDisk Cruzer Blade
Setting pcie_aspm=off
Unplugging Bluetooth Dongle (caused some logs once)
Setting pci=noaer in syslinux to prevent AER error bug
Removing Tailscale Plugin

Solution (assuming this was it):

Disabling c-states.
Setting the right RAM frequency.

Both are explained in this guide.

raidbyte-diagnostics-20240316-1610.zip syslog-mar14.log syslog-mar11.log

March 7, 2023

Hey, have had an annoying issue for the last couple months and can't wrap my head around it. Authentik just randomly crashes (which blocks access to my services) and upon further inspection in the Docker logs I find the following (repeated like multiple times, rest of logs normal):


redis.exceptions.ConnectionError: Error -5 connecting to redis:6379. -5.
{"error":"websocket: close 1006 (abnormal closure): unexpected EOF","event":"ws read error","level":"warning","logger":"authentik.outpost.ak-api-controller","loop":"ws-handler","timestamp":"2023-03-08T04:01:45+01:00"}
{"event": "Redis ConnectionError: Error -5 connecting to redis:6379. No address associated with hostname.", "level": "error", "logger": "django_redis.cache", "timestamp": 1678244505.058624}
Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/websockets/wsproto_impl.py", line 208, in run_asgi
result = await self.app(self.scope, self.receive, self.send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/message_logger.py", line 86, in __call__
raise exc from None
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
await self.app(scope, inner_receive, inner_send)
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 139, in _run_asgi3
return await self._run_app(scope, lambda: self.app(scope, receive, send))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 188, in _run_app
raise exc from None
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 183, in _run_app
return await callback()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/routing.py", line 62, in __call__
return await application(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/authentik/root/asgi.py", line 54, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/routing.py", line 116, in __call__
return await application(
^^^^^^^^^^^^^^^^^^
File "/authentik/lib/sentry.py", line 48, in __call__
return await self.inner(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/consumer.py", line 94, in app
return await consumer(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/consumer.py", line 58, in __call__
await await_many_dispatch(
File "/usr/local/lib/python3.11/site-packages/channels/utils.py", line 57, in await_many_dispatch
await task
redis.exceptions.ConnectionError: Error -5 connecting to redis:6379. -5.
{"error":"websocket: close 1006 (abnormal closure): unexpected EOF","event":"ws read error","level":"warning","logger":"authentik.outpost.ak-api-controller","loop":"ws-handler","timestamp":"2023-03-08T04:01:50+01:00"}
{"event": "Redis ConnectionError: Error -5 connecting to redis:6379. No address associated with hostname.", "level": "error", "logger": "django_redis.cache", "timestamp": 1678244510.065683}
Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/websockets/wsproto_impl.py", line 208, in run_asgi
result = await self.app(self.scope, self.receive, self.send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/message_logger.py", line 86, in __call__
raise exc from None
File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
await self.app(scope, inner_receive, inner_send)
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 139, in _run_asgi3
return await self._run_app(scope, lambda: self.app(scope, receive, send))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 188, in _run_app
raise exc from None
File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/asgi.py", line 183, in _run_app
return await callback()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/routing.py", line 62, in __call__
return await application(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/authentik/root/asgi.py", line 54, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/routing.py", line 116, in __call__
return await application(
^^^^^^^^^^^^^^^^^^
File "/authentik/lib/sentry.py", line 48, in __call__
return await self.inner(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/consumer.py", line 94, in app
return await consumer(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/channels/consumer.py", line 58, in __call__
await await_many_dispatch(
File "/usr/local/lib/python3.11/site-packages/channels/utils.py", line 57, in await_many_dispatch
await task

I'm running the beryju/authentik:latest Docker container on Unraid in a bridge network. It generally works fine, but lately it crashes at random. I tried replacing the hostname 'redis' with my local IP, which didn't change anything. Redis itself kept running without any issues during the crash. Has anyone had a similar issue, or does anyone have an idea what's going on here?

Redis and Authentik are running in the same Docker network called proxynet (just as advised by Spaceinvader One).

Redis is also running without interruption, so Authentik should be able to connect.
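
For anyone debugging the same message: "No address associated with hostname" is a name-lookup failure, so it may help to test name resolution and the TCP connection separately from inside the Authentik container. Below is a minimal sketch, assuming the container's Python (3.11 according to the traceback) can be used for this. `diagnose` is a hypothetical helper name, not part of Authentik or Redis; the host `redis` and port 6379 are taken from the log line above.

```python
import socket


def diagnose(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'dns-failure', 'tcp-failure', or 'ok' for host:port."""
    try:
        # Step 1: name resolution only. A "No address associated with
        # hostname" error would surface here as socket.gaierror.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: try a plain TCP connect to each resolved address.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "tcp-failure"
```

Calling `diagnose("redis", 6379)` in a loop while the crash happens would show whether the name lookup or the connection itself is failing at that moment. It only checks reachability and does not speak the Redis protocol.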

Does anyone have an idea what could be the issue? Thanks for any help!

(Attached two screenshots of Redis and Authentik configuration)

November 6, 2022

Hello!

I have been running Unraid for many years now and haven't had an issue until now.

Changed some things recently, switched from SWAG w/ Authelia to Traefik w/ Authentik.

Now, just a few days ago (about a month after the switch), I wasn't able to access any of my services (except a few, like HOOBS and SABnzbd).

Also, the Unraid console was no longer accessible via the web. I tried connecting via SSH as well, but no luck. So I restarted, and the issue was gone. After a few days it happened again. I was able to get some logs (I think from around the time it happened). The log entries keep repeating (see snippet below).

Now I am a bit scared since it could be a malicious attack.

Does anyone know what this could be and what the best action is in this scenario?

Is this normal? (syslog attached)

Snippet of logs:

Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth12f536f) entered disabled state
Oct 29 22:42:42 RaidByte kernel: device veth12f536f left promiscuous mode
Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth12f536f) entered disabled state
Oct 29 22:42:42 RaidByte  avahi-daemon[6274]: Withdrawing address record for fe80::ccb2:fff:fee0:8c6c on veth12f536f.
Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth4476eb3) entered blocking state
Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth4476eb3) entered disabled state
Oct 29 22:42:42 RaidByte kernel: device veth4476eb3 entered promiscuous mode
Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth4476eb3) entered blocking state
Oct 29 22:42:42 RaidByte kernel: br-cfc1bfc30214: port 6(veth4476eb3) entered forwarding state
Oct 29 22:42:42 RaidByte kernel: eth0: renamed from veth14953d6
Oct 29 22:42:42 RaidByte kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth4476eb3: link becomes ready
Oct 29 22:42:44 RaidByte  avahi-daemon[6274]: Joining mDNS multicast group on interface veth4476eb3.IPv6 with address fe80::c08d:9dff:fe83:4f91.
Oct 29 22:42:44 RaidByte  avahi-daemon[6274]: New relevant interface veth4476eb3.IPv6 for mDNS.
Oct 29 22:42:44 RaidByte  avahi-daemon[6274]: Registering new address record for fe80::c08d:9dff:fe83:4f91 on veth4476eb3.*.
Oct 29 22:43:13 RaidByte kernel: veth13dc14e: renamed from eth0
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethda2cfd0) entered disabled state
Oct 29 22:43:13 RaidByte  avahi-daemon[6274]: Interface vethda2cfd0.IPv6 no longer relevant for mDNS.
Oct 29 22:43:13 RaidByte  avahi-daemon[6274]: Leaving mDNS multicast group on interface vethda2cfd0.IPv6 with address fe80::f0f8:51ff:fefe:d0c5.
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethda2cfd0) entered disabled state
Oct 29 22:43:13 RaidByte kernel: device vethda2cfd0 left promiscuous mode
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethda2cfd0) entered disabled state
Oct 29 22:43:13 RaidByte  avahi-daemon[6274]: Withdrawing address record for fe80::f0f8:51ff:fefe:d0c5 on vethda2cfd0.
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethc8db68e) entered blocking state
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethc8db68e) entered disabled state
Oct 29 22:43:13 RaidByte kernel: device vethc8db68e entered promiscuous mode
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethc8db68e) entered blocking state
Oct 29 22:43:13 RaidByte kernel: br-cfc1bfc30214: port 7(vethc8db68e) entered forwarding state
Oct 29 22:43:14 RaidByte kernel: veth14953d6: renamed from eth0
Oct 29 22:43:14 RaidByte kernel: br-cfc1bfc30214: port 7(vethc8db68e) entered disabled state
Oct 29 22:43:14 RaidByte kernel: br-cfc1bfc30214: port 6(veth4476eb3) entered disabled state

syslog.1.txt

October 22, 2020

On 10/17/2020 at 8:48 AM, JorgeB said:

Don't see any crash there, just failure to unmount the disks because something was still using them, you can try this.

Just as an update: I set up your recommended logging method, but since then the issue has not occurred again. If it occurs again, I'll post the log here.

Thanks for the suggestion

October 16, 2020

12 hours ago, JorgeB said:

See here, make sure you're using the correct "power supply idle control" setting.

Ok, sadly it just crashed again. Didn't have syslog turned on, so no new logs, but probably wouldn't have helped more than the logs recorded previously anyways... (see in first post) Other ideas?

October 16, 2020

9 hours ago, JorgeB said:

See here, make sure you're using the correct "power supply idle control" setting.

Thanks for the suggestion! Seems related, since I also have a Gen 1 CPU, which is prone to this problem on Linux machines.

I found the exact "Power Supply Idle Control" setting on the MSI motherboard and set it to "Typical Current Idle" as suggested.

I will let you know if the issue persists, I can't yet give a definitive answer since it happens quite randomly.

October 15, 2020

Hello, my media server recently developed a problem:

It started crashing at random times. It usually runs for 3-24 h, and after that it just isn't accessible in any way: the web GUI is down, SSH is down, and all the Docker containers are down. The HDMI-connected monitor still displays the usual information it showed after start-up (cmdline mode).

So I enabled the syslog server to record all the necessary information. This is what I've extracted from syslog.txt (complete version below):

Aug 11 05:33:23 RaidByte root: umount: /mnt/disk1: target is busy.
Aug 11 05:33:23 RaidByte emhttpd: shcmd (105): exit status: 32
Aug 11 05:33:23 RaidByte emhttpd: Retry unmounting disk share(s)...
Aug 11 05:33:28 RaidByte emhttpd: Unmounting disks...
Aug 11 05:33:28 RaidByte emhttpd: shcmd (106): umount /mnt/disk1
Aug 11 05:33:28 RaidByte root: umount: /mnt/disk1: target is busy.
Aug 11 05:33:28 RaidByte emhttpd: shcmd (106): exit status: 32
Aug 11 05:33:28 RaidByte emhttpd: Retry unmounting disk share(s)...
Aug 11 05:33:31 RaidByte root: Status of all loop devices
Aug 11 05:33:31 RaidByte root: /dev/loop1: [2049]:4 (/boot/bzfirmware)
Aug 11 05:33:31 RaidByte root: /dev/loop0: [2049]:3 (/boot/bzmodules)
Aug 11 05:33:31 RaidByte root: /dev/loop3: [2305]:6442451073 (/mnt/disk1/system/libvirt/libvirt.img)
Aug 11 05:33:31 RaidByte root: Active pids left on /mnt/*
Aug 11 05:33:31 RaidByte root:                      USER        PID ACCESS COMMAND
Aug 11 05:33:31 RaidByte root: /mnt/disk1:          root     kernel mount /mnt/disk1
Aug 11 05:33:31 RaidByte root: /mnt/disks:          root     kernel mount /mnt/disks
Aug 11 05:33:31 RaidByte root: Active pids left on /dev/md*
Aug 11 05:33:31 RaidByte root:                      USER        PID ACCESS COMMAND
Aug 11 05:33:31 RaidByte root: /dev/md1:            root     kernel mount /mnt/disk1
Aug 11 05:33:31 RaidByte root: Generating diagnostics...

So it seems to unmount the disks for whatever reason and fails because a disk is busy. Does anyone have an idea why that could be?

Also, I recently had problems stopping the array because one random disk always failed to unmount (a different one every time), so I had to force it via the command line. (Maybe this is connected.)

I should also mention that I'm currently on the beta version. That could be the problem too, but doesn't have to be.
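
One detail in the excerpt: a loop device is still attached to a file on /mnt/disk1 while the unmount is retried, which could be one reason the target is reported as busy. That is an assumption based only on the lines above, not something confirmed here. A small sketch to pull such loop devices out of a syslog, assuming the "Status of all loop devices" line format shown above (`loops_backed_by` is a hypothetical helper name):

```python
import re

# Matches lines such as:
#   "... root: /dev/loop3: [2305]:6442451073 (/mnt/disk1/system/libvirt/libvirt.img)"
LOOP_LINE = re.compile(
    r"(?P<dev>/dev/loop\d+): \[\d+\]:\d+ \((?P<path>[^)]+)\)"
)


def loops_backed_by(lines, mount_point):
    """Return (loop device, backing file) pairs stored under mount_point."""
    prefix = mount_point.rstrip("/") + "/"
    hits = []
    for line in lines:
        match = LOOP_LINE.search(line)
        if match and match["path"].startswith(prefix):
            hits.append((match["dev"], match["path"]))
    return hits
```

Run against the excerpt with `mount_point="/mnt/disk1"`, this lists /dev/loop3 with libvirt.img as its backing file, while the two /boot loop devices are ignored.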

What do you think?

Thanks for any help!

System Info:

Version: 6.9.0-beta22

MOBO: MSI X470 GAMING PLUS MAX

CPU: Ryzen 5 1600

GPU: GT 710 (for Win10-VM passthrough)

HDDs: 3x4TB (one Barracuda, one WD Blue, one WD Red)

SSDs: Newly installed Kingston A2000 (1TB) as cache and a SanDisk Ultra (256 GB)

Diagnostics:

raidbyte-diagnostics-20200811-0533.zip

July 18, 2020

Hello,

I recently tried getting my GT 710 to work on a Win10 VM with passthrough. I went through a hellride of errors. I came to the conclusion that the combination of the following parts simply doesn't work well:

- Ryzen 5 1600

- Gigabyte B450 Aorus M

- (any GPU)

(More about this here)

So I thought about trying it with a different motherboard, since this seems to be the main factor.

What would you recommend? (Budget)

I have this MSI B450 Pro Max in mind, since it doesn't seem to have any issues (or at least none documented online).

What do you think? Also it should not consume too much power...

June 22, 2020

34 minutes ago, rachid596 said:

Try seabios in bios type

I did try that... No success

June 22, 2020

3 minutes ago, rachid596 said:

You can boot on baremetal windows and dump with gpu z ?

Can I run Windows baremetal somehow without having an extra empty HDD or SSD leftover?

June 22, 2020

11 minutes ago, rachid596 said:

You pass the Vbios? If it's on second PCI you don't need Vbios. Try to create a new VM template

I did not pass a vbios and I installed both like that, whilst passing the GT 710 through.

1 minute ago, rachid596 said:

You can boot on baremetal windows and dump with gpu z ?

I don't have it set up like that now, but I can try.

Posts posted by xBotRaid

SSH and WebUI not accessible after couple days

Authentik crashes randomly due to

Unraid Console Not Accessible (works again after restart) - Attack?

Help: Server randomly crashes "disks busy"

Reccomendation: Budget Motherboard for GPU Passthrough (AM4)

GPU-Passthrough: