Unraid Stability Issues: Seeking Community Assistance


Hey everyone

 

I have been dealing with persistent challenges in my Unraid setup, which includes multiple dockers and a couple of VMs. Currently, the VMs are intentionally shut down as I'm not actively using them, so I don't believe they are related to my issues.

 

Typically, the server runs smoothly for about 1-2 weeks, but then it can suddenly become unresponsive. Issues range from the web interface being inaccessible to various other problems, such as certain dockers intermittently stopping with failed restart attempts. In some cases, I've also noticed "read-only file system" errors in some docker logs. In many instances, I had to perform a manual restart of the server by physically powering it off and then back on to restore it to a running state.

 

Initially, I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot? Just today, a docker (Scripted) became unresponsive, and attempts to stop it resulted in a "Server Error" popup, which I had not seen before. Following a community suggestion, I successfully stopped and force-updated the docker. However, attempting to restart it afterwards caused the web interface to become unresponsive. Fortunately, I had a console open and tried to shut down the server with the "powerdown" command, but it did not complete (I waited about 20 minutes), leading to yet another forced server shutdown.

 

I have attached three diagnostics files, two from previous incidents (28th of November and 12th of December) and another from today. All files were created just after rebooting the server.

 

As someone more comfortable with Windows than Linux, I have managed to maintain everything up until now. I'm reaching out to the community for assistance because my Linux skills are simply not advanced enough to interpret these diagnostic files. I need help identifying the root cause and finding a more permanent solution to these recurring problems. Any insights or guidance on what to look for would be greatly appreciated.

 

Thank you

nas-disk-diagnostics-20231222-1114.zip nas-disk-diagnostics-20231213-0619.zip nas-disk-diagnostics-20231128-1807.zip

Link to comment
6 minutes ago, TreksterDK said:

I suspected a faulty cache disk (because of the read-only file system errors), but what confuses me here is that these issues seem to temporarily resolve after a reboot?

 

FWIW, I had an NVMe cache drive that would do that too. After a few months, any writes to it would cause it to go read-only, and the drive was less than a year old.

Link to comment
26 minutes ago, TreksterDK said:

Did you also have occasionally unresponsive interfaces and problems similar to what I described?

 

Yep, docker would fail to unmount and generally hose everything until I rebooted and did a file system repair on the drive. Then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.
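
For readers who want to check for this symptom before rebooting, here is a minimal sketch that looks for file systems currently mounted read-only. It assumes the standard Linux `/proc/mounts` layout (device, mount point, file system type, options) and that the pools and array disks of interest are mounted under `/mnt/`; the function name `read_only_mounts` is made up for this example.

```python
def read_only_mounts(mounts_text, prefix="/mnt/"):
    """Return (device, mountpoint, fstype) for mounts under `prefix` that are read-only.

    `mounts_text` is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # "ro" must be a whole option, not a substring of another one
        if mountpoint.startswith(prefix) and "ro" in options.split(","):
            result.append((device, mountpoint, fstype))
    return result


if __name__ == "__main__":
    with open("/proc/mounts", encoding="utf-8") as fh:
        for device, mountpoint, fstype in read_only_mounts(fh.read()):
            print(f"READ-ONLY: {mountpoint} ({fstype} on {device})")
```

If a pool shows up here while the server is still running, that is a good moment to grab diagnostics, since the relevant kernel messages are lost after a reboot.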

Link to comment
43 minutes ago, TreksterDK said:


Ok. I will do that:
[Screenshot: syslog server settings]

Is that enough, and just hit Apply ("Backup" is a share)?

 

Is there no information in the logs that hints at possible errors and solutions?


Not quite enough. As currently set, the server is only listening for other systems to send it messages to log. To get the server to log its own syslog messages, you need to set one of the last two fields, as described in the syslog server link or the built-in help.

Link to comment
48 minutes ago, itimpi said:


Not quite enough. As currently set, the server is only listening for other systems to send it messages to log. To get the server to log its own syslog messages, you need to set one of the last two fields, as described in the syslog server link or the built-in help.

 

Ok. I did this: 

 

[Screenshot: updated syslog server settings]

 

The "syslogs" share is on the cache drive, as recommended (even though the cache might be one of the problems 🤔), but it will be moved to the array (the secondary location).

Link to comment
1 hour ago, Michael_P said:

 

Yep, docker would fail to unmount and generally hose everything until I rebooted and did a file system repair on the drive. Then it would be fine again until the next time a lot of writes were made. It was also a 970 Evo Plus.

 

My NVME (970 EVO Plus) is actually just an unassigned drive I use for various things - like Plex transcoding etc. My actual Cache drive is a Western Digital Blue drive (WDS500G2B0A) sitting on the onboard SATA controller. It's been running for 4 Years and 197 Days (According to Diskspeed). But yeah, it might be the problem.

Link to comment
25 minutes ago, trurl said:

You still didn't tell it which server IP to send the syslogs to, but you did mirror to flash so you will get those.

 

Ok. I don't know much about syslog servers, to be honest. Any recommendations for a free syslog server for Windows that would be good enough for this situation? I get different results when googling it and am unsure what to choose. Setting this up will allow me to disable the mirror to the flash drive, right? (I would like to avoid unnecessary writes to it if I can.)
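
For anyone who does want to collect the logs on another machine, a dedicated syslog server is not strictly needed for a short troubleshooting session. Below is a minimal sketch of a UDP listener using only the Python standard library. It assumes the Unraid remote syslog settings are pointed at this machine's IP using UDP on port 514; the output file name `unraid-syslog.log` and the helper `format_record` are made up for this example.

```python
import socketserver
from datetime import datetime

LOG_PATH = "unraid-syslog.log"  # hypothetical output file
LISTEN_PORT = 514               # assumption: sender uses UDP port 514


def format_record(data, sender_ip, now):
    """Turn one raw syslog datagram into a single timestamped log line."""
    text = data.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        line = format_record(data, self.client_address[0], datetime.now())
        # Append and close on every message so lines survive a crash of the sender.
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


if __name__ == "__main__":
    with socketserver.UDPServer(("0.0.0.0", LISTEN_PORT), SyslogHandler) as server:
        print(f"Listening for syslog messages on UDP port {LISTEN_PORT}")
        server.serve_forever()
```

The receiving machine's firewall has to allow inbound UDP on the chosen port, and the machine has to stay on until the next incident, otherwise the last messages before a crash are lost.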

Link to comment
27 minutes ago, trurl said:

You can specify your Unraid server IP itself, as explained at the syslog server link

 

Ok. I added the Unraid server IP. I hope this is enough for this purpose.

 

Is there no way to look at the current diagnostics files (the ones I uploaded) to figure out where the problem might be, or are the syslogs a requirement for now to get anywhere? I ask because I don’t know :)

Link to comment
51 minutes ago, TreksterDK said:

Is there no way to look at the current diagnostics files (the ones I uploaded) to figure out where the problem might be, or are the syslogs a requirement for now to get anywhere? I ask because I don’t know

The problem is that the diagnostics only show what has happened since the server was last booted, so they are often insufficient (although they can contain clues).

Link to comment
23 minutes ago, Michael_P said:

 

In your syslog there's an error, looks the same as when my cache drive showed signs of failing

 

Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0

 

 

That is actually my SSD cache drive. Maybe I should just buy a new one. Any recommendations for swapping it out? Backing up files etc.?
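
To see whether those counters grow between incidents, the kernel line quoted above can be pulled out of a saved syslog with a few lines of Python. This is a minimal sketch that only recognises the exact message layout shown in the quote; the function names `parse_btrfs_errs` and `nonzero_counters` are made up for this example.

```python
import re

# Matches only the "bdev ... errs:" layout quoted in this thread.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return the device and error counters from one syslog line, or None."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    parsed = {"device": match["device"], "bdev": match["bdev"]}
    parsed.update({name: int(match[name]) for name in COUNTERS})
    return parsed


def nonzero_counters(parsed):
    """Keep only the counters that are above zero."""
    return {name: parsed[name] for name in COUNTERS if parsed[name] > 0}


if __name__ == "__main__":
    sample = (
        "Dec 10 03:18:25 NAS-Disk kernel: BTRFS error (device sdc1): "
        "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0"
    )
    parsed = parse_btrfs_errs(sample)
    print(parsed["bdev"], nonzero_counters(parsed))
```

Run against each line of a mirrored syslog, this gives a quick timeline of when the corruption count changed.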

Link to comment
1 minute ago, TreksterDK said:

 

That is actually my SSD cache drive. Maybe I should just buy a new one. Any recommendations for swapping it out? Backing up files etc.?

 

Still not 100% sure that's the cause, but if you do need to replace it, just set all the shares to move to the array, then run mover. When you have the new drive installed, set the shares back to the cache.
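
Before pulling the old drive, it is worth confirming that mover really emptied it. The sketch below lists any files still left on the pool. It assumes the pool is mounted at `/mnt/cache` (adjust if the pool has a different name); the function name `remaining_files` is made up for this example.

```python
import os


def remaining_files(root, limit=20):
    """Return up to `limit` file paths still present under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            found.append(os.path.join(dirpath, name))
            if len(found) >= limit:
                return found
    return found


if __name__ == "__main__":
    leftovers = remaining_files("/mnt/cache")  # assumed mount point of the cache pool
    if leftovers:
        print("Files still on the cache (showing at most 20):")
        for path in leftovers:
            print("  " + path)
    else:
        print("No files found on the cache.")
```

Files that are in use, for example by running dockers or VMs, may be left behind, so if anything shows up it can help to stop those services and run mover again.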

Link to comment
