Unraid Stability Issues: Seeking Community Assistance

TreksterDK · December 22, 2023

Hey everyone

I have been dealing with persistent challenges in my Unraid setup, which includes multiple dockers and a couple of VMs. Currently, the VMs are intentionally shut down as I'm not actively using them, so I don't believe they are related to my issues.

Typically, the server runs smoothly for about 1-2 weeks, but then it can suddenly become unresponsive. Issues range from the web interface being inaccessible to various other problems, such as certain dockers intermittently stopping with failed restart attempts. In some cases, I've also noticed "read-only file system" errors in some docker logs. In many instances, I had to perform a manual restart of the server by physically powering it off and then back on to restore it to a running state.

Initially, I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot. Just today, a docker became unresponsive (Scripted), and attempts to stop it resulted in a "Server Error" popup (first time I have seen this). Following a community suggestion, I successfully stopped and force-updated the docker. However, attempting to restart it afterwards caused the web interface to become unresponsive. Fortunately, I had a console open and tried to shut down the server with the "powerdown" command, which did not work correctly (I waited for about 20 minutes), leading to a forced server shutdown, once again.

I have attached three diagnostics files, two from previous incidents (28th of November and 12th of December) and another from today. All files were created just after rebooting the server.

As someone more comfortable with Windows than Linux, I have managed to maintain everything up until now. I'm reaching out to the community for assistance because my Linux skills are simply not advanced enough to interpret these diagnostic files. I need help identifying the root cause and finding a more permanent solution to these recurring problems. Any insights or guidance on what to look for would be greatly appreciated.

Thank you

nas-disk-diagnostics-20231222-1114.zip nas-disk-diagnostics-20231213-0619.zip nas-disk-diagnostics-20231128-1807.zip

Michael_P · December 22, 2023

6 minutes ago, TreksterDK said:

I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot.

FWIW, I had an NVMe cache drive that would do that too. After a few months, any writes to it would cause it to go read-only, and the drive was less than a year old.

JorgeB · December 22, 2023

Enable the syslog server and post that after a crash.

TreksterDK · December 22, 2023

22 minutes ago, JorgeB said:

Enable the syslog server and post that after a crash.

Ok. I will do that:
[attached screenshot]

Is that enough, and just hit Apply ("Backup" is a share)?

Is there no information in the logs that hints at possible errors and solutions?

Edited December 22, 2023 by TreksterDK

TreksterDK · December 22, 2023

13 minutes ago, Michael_P said:

FWIW, I had an NVMe cache drive that would do that too. After a few months, any writes to it would cause it to go read-only, and the drive was less than a year old.

Did you also have occasionally unresponsive interfaces and problems similar to what I described?

Michael_P · December 22, 2023

26 minutes ago, TreksterDK said:

Did you also have occasionally unresponsive interfaces and problems similar to what I described?

Yep, docker would fail to unmount and generally hose everything until I rebooted and did a file system repair on the drive. Then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.

itimpi · December 22, 2023

43 minutes ago, TreksterDK said:

Ok. I will do that:

Is that enough, and just hit Apply ("Backup" is a share)?

Is there no information in the logs that hints at possible errors and solutions?

Not quite enough. As set, the server is only listening for other systems to send it messages to log. To get the server to log its own syslog messages, you need to set one of the last two fields, as described in the syslog server link or the built-in help.

TreksterDK · December 22, 2023

48 minutes ago, itimpi said:

Not quite enough. As set, the server is only listening for other systems to send it messages to log. To get the server to log its own syslog messages, you need to set one of the last two fields, as described in the syslog server link or the built-in help.

Ok. I did this:

The "syslogs" share is on the cache drive (even though that might be one of the problems 🤔, it was recommended), but it will be moved to the array (secondary location).

TreksterDK · December 22, 2023

1 hour ago, Michael_P said:

Yep, docker would fail to unmount and generally hose everything until I rebooted and did a file system repair on the drive. Then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.

My NVME (970 EVO Plus) is actually just an unassigned drive I use for various things - like Plex transcoding etc. My actual Cache drive is a Western Digital Blue drive (WDS500G2B0A) sitting on the onboard SATA controller. It's been running for 4 Years and 197 Days (According to Diskspeed). But yeah, it might be the problem.

trurl · December 22, 2023

18 minutes ago, TreksterDK said:

Ok. I did this:

You still didn't tell it which server IP to send the syslogs to, but you did mirror to flash so you will get those.

TreksterDK · December 22, 2023

25 minutes ago, trurl said:

You still didn't tell it which server IP to send the syslogs to, but you did mirror to flash so you will get those.

Ok. I don't know much about syslog servers, to be honest. Any recommendations for a free syslog server for Windows that will be good enough for this situation? I get different results when googling it, and am unsure what to choose. Setting this up will allow me to disable the mirror-to-flash feature, right? (I would like to avoid unnecessary writes to the flash drive if I can.)

trurl · December 22, 2023

You can specify your Unraid server IP itself, as explained at the syslog server link

TreksterDK · December 22, 2023

27 minutes ago, trurl said:

You can specify your Unraid server IP itself, as explained at the syslog server link

Ok. I added the Unraid server IP. I hope this is enough for this purpose.

Is there no way to look at the current diagnostics files (the ones I uploaded) to figure out where the problem might be, or are the syslogs a requirement for now to get anywhere? I ask because I don't know.

itimpi · December 22, 2023

51 minutes ago, TreksterDK said:

Is there no way to look at the current diagnostics files (the ones I uploaded) to figure out where the problem might be, or are the syslogs a requirement for now to get anywhere? I ask because I don't know.

The problem is that the diagnostics only show what has happened since the last time the server was booted, so they are often insufficient (although they can contain clues).

Michael_P · December 22, 2023

57 minutes ago, TreksterDK said:

Is there no way to look at the current diagnostics files

In your syslog there's an error that looks the same as the one I saw when my cache drive showed signs of failing:

Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
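
For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch that pulls the counters out of a line like the one above. The function name and the regular expression are illustrative and modelled only on this single line, so other kernel versions may format the message differently.

```python
import re

# Pattern modelled on the one log line quoted above (an assumption,
# not an official description of the kernel's message format).
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (block device, {counter: value}) for a matching line, else None."""
    m = BTRFS_ERRS.search(line)
    if m is None:
        return None
    counters = {k: int(m.group(k)) for k in ("wr", "rd", "flush", "corrupt", "gen")}
    return m.group("bdev"), counters


line = ("Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): "
        "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0")
device, counters = parse_btrfs_errs(line)

# Keep only the counters that are not zero.
nonzero = {name: value for name, value in counters.items() if value}
print(device, nonzero)
```

On the quoted line, only the corrupt counter is non-zero (6); the wr, rd, flush and gen counters are all 0.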

TreksterDK · December 22, 2023

23 minutes ago, Michael_P said:
In your syslog there's an error that looks the same as the one I saw when my cache drive showed signs of failing:
Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0

That is actually my SSD cache drive. Maybe I should just buy a new one. Any recommendations for how to swap it, back up the files, etc.?

Michael_P · December 22, 2023

1 minute ago, TreksterDK said:

That is actually my SSD cache drive. Maybe I should just buy a new one. Any recommendations for how to swap it, back up the files, etc.?

Still not 100% sure that's the cause, but if you do need to replace it, just set all the shares to move to the array, then run the mover. When you have the new drive installed, set the shares back to the cache.

JorgeB · December 22, 2023

Btrfs is detecting data corruption, which is usually RAM-related. Start by running memtest, then scrub the pool.
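
As a rough illustration of the scrub step (memtest has to be run from the boot menu, so it is not covered here), the sketch below builds the two btrfs commands involved and, by default, only prints them. It assumes the pool is mounted at /mnt/cache, which may not match your pool name; the helper names are made up for this example. The scrub can also be started from the pool's page in the web interface.

```python
import subprocess


def scrub_commands(mount_point="/mnt/cache"):
    """Build the btrfs commands for a foreground scrub and an error-counter report.

    The default mount point is an assumption; adjust it to your pool.
    """
    return [
        # -B keeps the scrub in the foreground and prints a summary when done.
        ["btrfs", "scrub", "start", "-B", mount_point],
        # Shows the per-device error counters (write, read, flush, corruption, generation).
        ["btrfs", "device", "stats", mount_point],
    ]


def run_scrub(mount_point="/mnt/cache", dry_run=True):
    """Print the commands (dry run) or execute them one after the other."""
    for cmd in scrub_commands(mount_point):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False so the counters are still printed if the scrub reports errors.
            subprocess.run(cmd, check=False)


# Dry run: only shows what would be executed.
run_scrub()
```

Calling run_scrub(dry_run=False) from the server's console would actually start the scrub, so only do that once you are ready for the extra disk load.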
