TreksterDK Posted December 22, 2023

Hey everyone,

I have been dealing with persistent challenges in my Unraid setup, which includes multiple dockers and a couple of VMs. Currently, the VMs are intentionally shut down as I'm not actively using them, so I don't believe they are related to my issues.

Typically, the server runs smoothly for about 1-2 weeks, but then it can suddenly become unresponsive. Issues range from the web interface being inaccessible to various other problems, such as certain dockers intermittently stopping with failed restart attempts. In some cases, I've also noticed "read-only file system" errors in some docker logs. In many instances, I had to perform a manual restart of the server by physically powering it off and then back on to restore it to a running state.

Initially, I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot?

Just today, a docker became unresponsive (Scripted), and attempts to stop it resulted in a "Server Error" popup (first time I have seen this). Following a community suggestion, I successfully stopped and force-updated the docker. However, attempting to restart it afterwards caused the web interface to become unresponsive. Fortunately, I had a console open and tried to shut down the server with the "powerdown" command, which did not work correctly (waited for 20 minutes-ish), leading to a forced server shutdown, once again.

I have attached three diagnostics files, two from previous incidents (28th of November and 12th of December) and another from today. All files were created just after rebooting the server.

As someone more comfortable with Windows than Linux, I have managed to maintain everything up until now. I'm reaching out to the community for assistance because my Linux skills are simply not advanced enough to interpret these diagnostic files.

I need help identifying the root cause and finding a more permanent solution to these recurring problems. Any insights or guidance on what to look for would be greatly appreciated.

Thank you

Attachments:
nas-disk-diagnostics-20231222-1114.zip
nas-disk-diagnostics-20231213-0619.zip
nas-disk-diagnostics-20231128-1807.zip
Michael_P Posted December 22, 2023
6 minutes ago, TreksterDK said: I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot?

FWIW, I had an NVMe cache drive that would do that too. Then, after a few months, any writes to it would cause it to go read-only, and the drive was less than a year old.
JorgeB Posted December 22, 2023

Enable the syslog server and post that after a crash.
TreksterDK (Author) Posted December 22, 2023 (edited)
22 minutes ago, JorgeB said: Enable the syslog server and post that after a crash.

Ok, I will do that. Is this enough if I just hit Apply ("Backup" is a share)?

Is there no information in the logs that hints at possible errors and solutions?

Edited December 22, 2023 by TreksterDK
TreksterDK (Author) Posted December 22, 2023
13 minutes ago, Michael_P said: FWIW I had an NVME cache drive that would do that too, then after a few months any writes to it would cause it to go read only and the drive was less than a year old

Did you also have occasionally unresponsive interfaces and problems similar to what I described?
Michael_P Posted December 22, 2023
26 minutes ago, TreksterDK said: Did you also have occasionally unresponsive interfaces and problems similar to what I described?

Yep, docker would fail to unmount and generally hose everything until I rebooted and did a file system repair on the drive. Then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.
itimpi Posted December 22, 2023
43 minutes ago, TreksterDK said: Ok. I will do that: Is that enough, and just hit Apply ("Backup" is a share)? There is no information in the logs that hints at possible errors and solutions?

Not quite enough. As set, the server is only listening for other systems to send it messages to log. To get the server to log its own syslog messages, you need to set one of the last two fields, as described in the syslog server link or the built-in help.
TreksterDK (Author) Posted December 22, 2023
48 minutes ago, itimpi said: not quite enough. As set then the server is listening for other systems to send it messages to log. To get the server to log its own syslog messages you need to set one of the last two fields as described in the syslog server link or the built-in help.

Ok. I did this:

The "syslogs" share is on the cache drive (even though that might be one of the problems 🤔, but it was recommended), and it will be moved to the array (secondary location).
TreksterDK (Author) Posted December 22, 2023
1 hour ago, Michael_P said: Yep, docker would fail to unmount and generally hose everything until i rebooted and did a file system repair on the drive - then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.

My NVMe (970 EVO Plus) is actually just an unassigned drive I use for various things, like Plex transcoding etc. My actual cache drive is a Western Digital Blue drive (WDS500G2B0A) sitting on the onboard SATA controller. It's been running for 4 years and 197 days (according to Diskspeed). But yeah, it might be the problem.
trurl Posted December 22, 2023
18 minutes ago, TreksterDK said: Ok. I did this:

You still didn't tell it which server IP to send the syslogs to, but you did mirror to flash so you will get those.
TreksterDK (Author) Posted December 22, 2023
25 minutes ago, trurl said: You still didn't tell it which server IP to send the syslogs to, but you did mirror to flash so you will get those.

Ok. I don't know much about syslog servers, to be honest. Any recommendations for a free syslog server for Windows that will be good enough for this situation? I get different results when googling it, but am unsure what to choose.

Setting this up will allow me to disable the mirror feature to the flash drive, right? (I would like to avoid unnecessary writes to it if I can.)
trurl Posted December 22, 2023

You can specify your Unraid server IP itself, as explained at the syslog server link.
TreksterDK (Author) Posted December 22, 2023
27 minutes ago, trurl said: You can specify your Unraid server IP itself, as explained at the syslog server link

Ok. I added the Unraid server IP. I hope this is enough for this purpose.

Is there no way to look at the current diagnostics files (the ones I uploaded) to figure out where the problem might be, or are the syslogs a requirement for now to get anywhere? I ask because I don't know.
itimpi Posted December 22, 2023
51 minutes ago, TreksterDK said: Is there no way to look at the current diagnostics files (I uploaded) to figure out where the problem might be?, or is the syslogs a requirement for now to get anywhere? I ask because I don't know

The problem is that the diagnostics only show what has happened since the last time the server was booted, so they are often insufficient (although they can contain clues).
Michael_P Posted December 22, 2023
57 minutes ago, TreksterDK said: Is there no way to look at the current diagnostics files

In your syslog there's an error that looks the same as when my cache drive showed signs of failing:

Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
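For anyone who wants to check their own syslog for the same kind of line, here is a minimal Python sketch that pulls the five counters out of a BTRFS error message like the one quoted above. The function name `parse_btrfs_errs` and the regular expression are illustrative assumptions, not part of Unraid or btrfs-progs; the sample line is the one from this thread.

```python
import re

# Matches the counter summary the kernel prints, e.g.
# "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0"
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def parse_btrfs_errs(line):
    """Return (device, counters) for a BTRFS error line, or None if no match."""
    m = BTRFS_ERRS.search(line)
    if m is None:
        return None
    counters = {k: int(m.group(k)) for k in ("wr", "rd", "flush", "corrupt", "gen")}
    return m.group("dev"), counters

# Sample line copied from the syslog excerpt in this thread.
line = ("Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): "
        "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0")
dev, counters = parse_btrfs_errs(line)

# Keep only the counters that are not zero.
nonzero = {k: v for k, v in counters.items() if v}
print(dev, nonzero)
```

In this line only the `corrupt` counter is non-zero. The `wr`, `rd` and `flush` counters track I/O errors reported by the device, while `corrupt` counts blocks whose checksum did not match, so the device itself did not report failed reads or writes here.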
TreksterDK (Author) Posted December 22, 2023
23 minutes ago, Michael_P said: In your syslog there's an error, looks the same as when my cache drive showed signs of failing Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0

That is actually my SSD cache drive. Maybe I should just buy a new one. Any recommendations for how to do the switch? Backing up files etc.?
Michael_P Posted December 22, 2023
1 minute ago, TreksterDK said: That is actually my SSD Cache drive. Maybe I should just buy a new one. Recommendations for doing a switch of that? Backing up files etc.?

I'm still not 100% sure that's the cause, but if you do need to replace it, just set all the shares to move to the array, then run mover. When you have the new drive installed, set the shares back to the cache.
JorgeB Posted December 22, 2023

Btrfs is detecting data corruption, which is usually RAM related. Start by running memtest, then scrub the pool.
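The scrub step can be started from the pool's page in the Unraid GUI or from a console. As a rough sketch only, assuming the pool is mounted at `/mnt/cache` (an assumption; substitute your own pool's mount point), the Python wrapper below runs `btrfs scrub start -B` in the foreground and then `btrfs device stats` to print the error counters. The helper names `scrub_commands` and `run_scrub` are hypothetical. Memtest runs from the boot menu rather than from the running system, so it is not covered here.

```python
import subprocess

def scrub_commands(mount_point):
    """Build the two btrfs commands: a foreground scrub, then the error counters."""
    return [
        ["btrfs", "scrub", "start", "-B", mount_point],  # -B: wait until the scrub finishes
        ["btrfs", "device", "stats", mount_point],       # print per-device error counters
    ]

def run_scrub(mount_point):
    """Run the commands in order (needs root) and return their combined output."""
    output = []
    for cmd in scrub_commands(mount_point):
        # check=False: a scrub that finds errors may exit non-zero,
        # and we still want to see the device stats afterwards.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        output.append(result.stdout + result.stderr)
    return "\n".join(output)

# Example (on the server itself, as root); /mnt/cache is an assumed mount point:
# print(run_scrub("/mnt/cache"))
```

The same two commands can simply be typed into a console instead; the wrapper only keeps them in order and collects the output in one place for posting.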