August 19, 2024
So it started with the server randomly going offline a few months ago. When booting it showed errors, and plugging the boot stick into a Windows PC also showed it as corrupted or similar. In the meantime I enabled syslog and went about my day.
Today the server wasn't responsive again. It showed up as online on the Tailscale dashboard, but it would not respond via the web interface and the apps didn't work either. So I downloaded the stick backup and transferred the license to a new stick. I had also added 2x 8TB drives the day before, as the server was almost full (they came from another NAS, so I know they were working fine).
Now the server was running with its new USB stick, rechecking parity and preclearing the drives. And it happened again!
- The server isn't reachable from the web but shows up as online.
- The display attached to the server just shows the login page(?).
- Pressing the power button initiates the shutdown, but graceful shutdown doesn't work. It forces a shutdown, but after a few minutes it is still stuck on "Starting diagnostics collection...".
- Today qDirStat also kept freezing up. Perhaps related?
I don't know what's wrong with the server. I'm currently testing the old USB stick with H2testw, but it actually seems OK?? Also, I just saw "dirty bit is set, might be corrupt" on the boot screen again, even with the new drive.
Might it be due to the SATA cables? I googled around and saw this mentioned sometimes, and it is kind of crammed in the case.
I have to hard-reset the server with the reboot button to get it to "work" and download the syslog/diagnostics. Let's see how long it will last this time.
Edited January 4, 2025 by 012315
August 19, 2024 Author
Some other weird behavior I noticed: the preclear disks plugin seems to go back in time after I hard reset, but for the 2 disks it goes back at different rates. When it crashed yesterday, disk A had been pre-reading for 5h and disk B for 8h. When it worked today, we were at roughly 20h for disk A and 25h for disk B. Now, after the 2nd reset, we are back at ~20h and 18h??
August 19, 2024 Author
And it gets better! A few hours ago the server wasn't responding via the web UI again. I left it. Now it apparently responds again and works. The preclear is done on both new disks. Perhaps this was it? But some Docker containers aren't reachable, AND THE SHARES ARE GONE????
After rebooting, the shares are back. Other services such as Docker work again.
I can't connect to the Windows file system, but this is probably another issue. This has been resolved: I was connecting to server and not server.local, making it go over Tailscale.
I've attached logs for both before and after the reboot.
Edited January 4, 2025 by 012315
August 19, 2024 Author
Another update: apparently upgrading to 6.12 (or something around that time) removed the "tailscale" listening interface. I've now added tailscale back as a listening interface, and connecting to the GUI seems to work again.
This does not explain, though, why connecting via Tailscale would work after a reboot anyway, or why the shares would disappear and the server and Docker containers would suddenly stop working.
August 20, 2024 Author
And the server is now completely offline. I'm currently not there physically, so no rebooting or getting the logs.
If someone could help me, that would be greatly appreciated!
August 20, 2024 Community Expert
The server was running out of RAM, and the culprit appears to be qbittorrent-nox, so check its config and limit its RAM usage, or leave it disabled and retest.
August 27, 2024 Author
After rebooting I am getting "bzfirmware checksum error - Press enter to reboot". Mind you, this is a new USB stick; I transferred the old backup onto it. When I plugged it into Windows, it detected the drive as faulty; after I clicked the prompt, Windows "repaired" it and it is readable. I will now replace the bz files as instructed in an older forum thread I found.
I've replaced the files, and the server is now booting again. Let's see how long this lasts. I'll update to the latest Unraid OS as well.
Edited August 27, 2024 by 012315
September 1, 2024 Author
Alright, and the server is unreachable again. I'll try server.local once I'm home; for now all Docker containers and the control panel are down via Tailscale. Before rebooting, is there any way to get the logs, or should I do that after the reboot?
September 1, 2024 Author
Oh sure, I have that enabled. I was able to reach the panel locally; the server is still up. Reading the logs, it seems I got OOM'd again. However, my log comprehension isn't enough to tell me what caused the OOM. I've added logs and diags.
Also, as you told me last time it was qBittorrent, I had edited the Docker Compose file to add a memory limit.
Edit: After consulting ChatGPT on how to read the logs, it seems shfs and awk are using vast amounts of RAM. Is this normal? I have 32GB of ECC RAM. For now I've upped the Folder Caching plugin's pressure setting from 10 to 25, in case this is the culprit.
Edited January 4, 2025 by 012315
September 2, 2024 Community Expert
shfs was the process that got killed, but you have a lot of other container-related processes using a lot of RAM. One in particular is spawning a lot of docker-proxy entries; see if you can find out which container it is. There are other high RAM consumers as well.
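For anyone trying to read such an OOM report themselves, here is a minimal Python sketch that ranks processes by resident memory from the task table the kernel prints before killing a process. It assumes the usual layout of that dump (a header line starting with `[  pid  ]` that names an `rss` and a `name` column, with memory values in pages) and a 4 KiB page size; the function name `rss_by_process` and the syslog path in the usage comment are made up for illustration and are not taken from the attached diagnostics.

```python
import re
from collections import defaultdict

PAGE_KIB = 4  # assumption: 4 KiB memory pages

# Header of the kernel's OOM task dump, e.g. "[  pid  ]   uid  tgid total_vm rss ... name"
HEADER = re.compile(r"\[\s*pid\s*\]\s+(.*)$")
# One task row, e.g. "[   1234]     0  1234   500000   250000 ... shfs"
ROW = re.compile(r"\[\s*(\d+)\]\s+(.*)$")


def rss_by_process(lines):
    """Return [(process_name, rss_kib)] for the last OOM dump, largest first."""
    columns = None
    totals = defaultdict(int)
    for raw in lines:
        line = raw.rstrip()
        header = HEADER.search(line)
        if header:
            # A new dump starts: read the column names and drop older totals.
            columns = header.group(1).split()
            totals.clear()
            continue
        if columns is None:
            continue
        row = ROW.search(line)
        if not row:
            continue
        # The name is the last column and may contain spaces, so cap the split.
        fields = row.group(2).split(None, len(columns) - 1)
        if len(fields) != len(columns):
            continue
        record = dict(zip(columns, fields))
        try:
            rss_pages = int(record["rss"])
        except (KeyError, ValueError):
            continue
        # Sum all processes sharing a name (e.g. many docker-proxy entries).
        totals[record["name"]] += rss_pages * PAGE_KIB
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage (hypothetical path):
# with open("syslog.txt", errors="replace") as handle:
#     for name, kib in rss_by_process(handle)[:10]:
#         print(f"{name:20s} {kib / 1024:8.0f} MiB")
```

Summing by process name makes a swarm of small identical processes visible as one entry, but it does not say which container started them; that still has to be traced separately.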
September 2, 2024 Author
I have made a discovery: in the appdata folder/share there are a LOT of files, as the 2 PhotoPrism instances (among others) save all their thumbnails there. QDirStat crashes when trying to visualize the folder, and so does the Duplicacy process. The last OOM occurred about 2.5h after 6 AM, which is when the daily backup takes place. After rebooting the server today, I see that Duplicacy is stuck trying to index the appdata folder and RAM usage is creeping up again.
So, what might there be in appdata that no process can handle? It's about 300K files, I believe. Can this really be that bad? Can this be the actual culprit?
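To see which container is responsible for the file count without loading the whole tree into a visualizer, a small Python sketch like the following can count files per top-level subfolder. The function name `count_files_by_subfolder` is made up for illustration, and the path in the usage comment is an assumption about where the appdata share is mounted.

```python
import os


def count_files_by_subfolder(root):
    """Return {subfolder_name: file_count} for each directory directly under root."""
    counts = {}
    with os.scandir(root) as entries:
        for entry in entries:
            # Only look at real directories; loose files and symlinks are skipped.
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            # os.walk streams the tree, so memory use stays small even for huge folders.
            for _dirpath, _dirnames, filenames in os.walk(entry.path):
                total += len(filenames)
            counts[entry.name] = total
    return counts


# Usage (assumed mount point of the appdata share):
# counts = count_files_by_subfolder("/mnt/user/appdata")
# for name, total in sorted(counts.items(), key=lambda item: item[1], reverse=True):
#     print(f"{total:10d}  {name}")
```

Walking a large tree still takes a while and generates disk activity, so this is better run while the backup job is not active.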
September 2, 2024 Author
Alright, disabling SMB did not fix the issue. This time the server went down even quicker! So now I guess it's time to find out why there are so many Docker processes. Help or guidance would be appreciated!
Edited September 3, 2024 by 012315
September 3, 2024 Community Expert
You can try enabling one container at a time and see if you can find the culprit.
September 16, 2024 Community Expert
012315 said: "Also, what would be the proper way to fix this bzfirmware error? If you recall, this has happened multiple times already and I've also switched USB sticks."
You can try rewriting the bz* files from a zip download for the release. If this does not fix it, then you need to replace the flash drive.
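After rewriting the files, a quick integrity check can show whether the copies on the stick are still intact. The Python sketch below assumes each file has a sidecar named `<file>.sha256` whose first token is the expected hex digest; that layout, the function names, and the path in the usage comment are assumptions for illustration and should be checked against what is actually on the flash drive.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sidecars(folder):
    """Return {file_name: matches} for every file that has a '<name>.sha256' sidecar."""
    results = {}
    for sidecar in sorted(Path(folder).glob("*.sha256")):
        target = sidecar.with_suffix("")  # "bzroot.sha256" -> "bzroot"
        if not target.is_file():
            results[target.name] = False  # sidecar present but file missing
            continue
        tokens = sidecar.read_text().split()
        expected = tokens[0].lower() if tokens else ""
        results[target.name] = sha256_of(target) == expected
    return results


# Usage (hypothetical drive letter for the stick when plugged into a PC):
# for name, ok in verify_sidecars("E:/").items():
#     print("OK      " if ok else "MISMATCH", name)
```

A mismatch right after a fresh copy would point at the stick or the USB port rather than at the files themselves.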
January 3, 2025 Author Solution
For others: what seemed to fix the issue was adding more storage to the system, as the drives and cache were at 95%+ full. I haven't had a problem for months now.
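To catch this situation earlier, a minimal Python sketch along these lines can flag mount points that reach a fill level such as the 95% mentioned above. The function names and the example paths in the usage comment are assumptions for illustration; substitute the mount points of your own array disks and cache pool.

```python
import shutil


def usage_percent(path):
    """Return how full the filesystem containing path is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a size of zero
    return 100.0 * usage.used / usage.total


def nearly_full(paths, limit=95.0):
    """Return [(path, percent)] for every path at or above the limit."""
    report = []
    for path in paths:
        percent = usage_percent(path)
        if percent >= limit:
            report.append((path, round(percent, 1)))
    return report


# Usage (assumed mount points):
# for path, percent in nearly_full(["/mnt/disk1", "/mnt/disk2", "/mnt/cache"]):
#     print(f"{path} is {percent}% full")
```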