After 3 years, Unraid is being wonky: dockers are not working, drives keep failing



Hi everyone! I've been running an Unraid server for about 3 years now, and overall it has been going great.

 

A few months ago I started having issues with the stability of my server: drives were failing frequently, and my docker containers were acting up, unable to start or write.

 

This past week the server has become unusable and unreliable. Most of my dockers won't start, and I have 2 drives marked as unmountable.

 

I did the basic troubleshooting I have done in the past, such as deleting docker.img and rebuilding my containers.

In the past when I did have a drive fail (it has happened maybe 4-6 times over 3 years; 2 were actual bad drives I had to RMA), I would stop the array, unmount the bad drive, restart the array, stop the array, add the drive back to the array, and let it rebuild from parity. I decided not to do that this time with the 2 drives that died, since at this point it takes almost a week to rebuild a drive from parity. I did some digging and decided to do a "new config", and I was hopeful at first: parity rebuilt from data in just a few days, and overall the system felt snappier. But now nothing is working as expected.

 

Note: Anything important on the server is backed up to another NAS in my house and offsite to Backblaze. The unimportant stuff that is not backed up is TBs of movies for a Plex server. I would be sad if I had to repopulate it, but it's just not worth the cost to back up.

 

The important non-data stuff on the server is my docker containers. I run Home Assistant, Plex, NextCloud, Foundry, and around 20 or so dockers that get daily use. Most of these were set up 3 years ago when I started the server and haven't really been touched since; they just work.

 

I don't want to just mess with things. I feel like something is off somewhere, and I am hoping to get some help from the community to solve my issues. I love tech stuff and have experience with Linux, and I am a software developer. I just don't know the particular issues with Unraid and I am hoping to learn.

 

I attached my diagnostics zip, but let me know if I need to provide any other information.

 

Thanks in advance.

orthanc-diagnostics-20220717-1043.zip

Just now, trurl said:

Not a good idea to completely fill all of your disks. Can make it difficult or impossible to recover from filesystem corruption.

Any tips on this? Do I just need more disks? I have used unBalance in the past to empty out older disks when adding new disks.

Is there some setting I should change? Can I have disks be somehow marked as full once they only have a certain number of GBs left?


You should set Minimum Free for each user share to larger than the largest file you expect to write to the share. That will make Unraid choose a different disk when a drive has less than Minimum. Won't help in the case where no disks have any space, of course.

 

Filesystems will typically perform worse the fuller a disk is.

 

You need to have larger disks or more disks. Looks like a few of your disks could be replaced with larger disks.
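
To see exactly how full each array disk is from the console, assuming the standard /mnt/diskN mount points, something like this works:

df -h /mnt/disk[0-9]*    # free space per array disk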

9 minutes ago, thebedivere said:

I have used unBalance in the past to empty out older disks when adding new disks.

Not sure I understand. Usually if you have a disk you want to get rid of, you would rebuild it to a larger disk instead of adding another disk.

8 minutes ago, trurl said:

All of the syslogs included in diagnostics are just the same problem over and over, probably due to your completely full disks. Are there any older syslogs in /var/log?

 

 

Syslogs go back about a week; they all show the same issue of not being able to write to disks.

 

I am assuming the "new config" wiped any older logs?


Also note I corrected my first reply. Not only should you not fill all of your disks, you shouldn't even fill any of your disks.

13 minutes ago, trurl said:

Can make it difficult or impossible to recover from filesystem corruption.

1 minute ago, trurl said:

Filesystems will typically perform worse the fuller a disk is.

 

 

2 minutes ago, trurl said:

You need to have larger disks or more disks. Looks like a few of your disks could be replaced with larger disks.

Not sure I understand. Usually if you have a disk you want to get rid of, you would rebuild it to a larger disk instead of adding another disk.

 

Let's say all my disks have about 20 GB free. I add a new disk to expand the array by 6-14 TB, then use unBalance to move files from the older, fuller disks onto the new one.

 

It sounds like this isn't necessary if I follow your advice and just set up my shares with a minimum free space.

 

I can delete some of the movies on the array and free up some space, then set up the minimum free space.

3 minutes ago, trurl said:

Also note I corrected my first reply. Not only should you not fill all of your disks, you shouldn't even fill any of your disks.

 

 

 

Gotcha, thanks for the advice. I will see what I can do to empty some space and set up minimum free space.

 

I'll restart the server after I have some free space and see what happens with docker and such.

 

How much free space should I have? 10 GB? 1 GB? A percentage of the drive size?

Just now, thebedivere said:

"new config" wiped any older logs?

No, reboot wipes older logs since syslogs are in RAM just like the rest of the OS.
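
If you want logs that survive a reboot, you can enable the syslog mirror under Settings > Syslog Server, or just copy the current log to the flash drive (mounted at /boot) before rebooting. A minimal sketch:

mkdir -p /boot/logs
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt    # snapshot the in-RAM log to flash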

 

New Config isn't really the way to fix things if you get disabled or unmountable disks. For one thing, it won't fix unmountable anyway.

 

But if you New Config your way out of a disabled disk, you will lose any writes that happened to the disk when it became disabled and any writes to the emulated disk after it became disabled. All of those writes can be recovered from parity if you rebuild the data disk, but are lost if you rebuild parity instead.

 

7 minutes ago, trurl said:

You should set Minimum Free for each user share to larger than the largest file you expect to write to the share.

That just helps keep Unraid from choosing a disk that won't have enough space for the new file. But in general I would leave even more for the other reasons I mentioned.

 

4 minutes ago, thebedivere said:

I add a new disk to expand the array by 6-14 TB. Then I use unBalance to move files from the older disks that are more full onto the newer disk.

Why not replace/rebuild that older disk to the larger disk? Disk5 could be replaced with 14TB, for example, since that is the size of your smallest parity.

 

You can just keep adding more disks, of course. But you shouldn't get even close to running out of capacity before you do something about it.

2 hours ago, trurl said:

Why not replace/rebuild that older disk to the larger disk? Disk5 could be replaced with 14TB, for example since that is the size of your smallest parity.

 

You can just keep adding more disks, of course. But you shouldn't get even close to running out of capacity before you do something about it.

 

I've been trying to expand the space over time. Instead of removing a disk I'd rather add more disks. It's one of the reasons I wanted Unraid for the server, since I can just drop in disks of different sizes to grow when I need the space.


Cleaned up some space by deleting a bunch of movies, and unmounted disks 6 and 7 (they were flagged as unmountable, but had no SMART errors). Disk 7 is rebuilding from parity, and disk 6 is still marked as unmountable. I suspect disk 6 might actually need to be replaced.

 

But, now disk 3, which didn't have any issues before, is marked as unmountable.

 

I've attached new diagnostics.

 

[screenshot: Main - Array Devices]

orthanc-diagnostics-20220717-1415.zip

8 minutes ago, thebedivere said:

disk 6 and 7 (they were flagged as unmountable

disk7 is not unmountable according to that screenshot. Perhaps you just meant they were both disabled.

 

Both disks are rebuilding; disk6 is rebuilding an unmountable filesystem. We usually recommend repairing the filesystem before rebuilding on top of the same disk, to make sure it can be repaired before you overwrite the disk. Too late now; you will have to hope for the best when you repair the filesystem.

 

disk3 filesystem will also have to be repaired, after rebuild completes.

 

So you are exactly in the position I was concerned about with your full disks.

3 hours ago, trurl said:

Can make it difficult or impossible to recover from filesystem corruption.

 

And... while rebuilding disks 6 and 7, it looks like you are having connection problems with disk5. You need to stop and fix that before attempting any rebuild or repair.

 

 

4 minutes ago, thebedivere said:

Could that be sata cables, or the sata controller?

Yes, maybe just the SATA/power connectors. Any splitters?

 

I see you have a Marvell controller.

06:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller [1b4b:9215] (rev 11)
	Subsystem: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller [1b4b:9215]
	Kernel driver in use: ahci
	Kernel modules: ahci

Can you remove that or just not connect to it?
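
To check which drives are actually hanging off that card, the by-path symlinks (generic Linux, nothing Unraid-specific) show the PCI address each disk sits behind; 06:00.0 is the Marvell controller above:

ls -l /dev/disk/by-path/ | grep '06:00.0'    # lists the /dev/sdX devices behind the Marvell card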


There are some power splitters. Edit: they are 4-pin-to-SATA adapters, not splitters.

 

The 4-port SATA is on the motherboard, and I have a 6-port expansion card for the rest.

 

I'm assuming I should power down, disconnect, and test out each drive connection. What's the best tool/approach for that? I'm more than happy to use CLI tools.
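
For reference, the standard Linux tools should cover it, substituting the right /dev/sdX for each drive:

smartctl -a /dev/sdX                                 # full SMART report; watch reallocated/pending sector counts
dmesg | grep -iE 'ata[0-9]+|sd[a-z]' | tail -n 50    # recent link resets and I/O errors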


OK, moved some cables around to see what happened, and some drives are not showing up at all.

 

It's in a small server rack in the basement, so it'll take me some effort to pull the case fully open to get to the inner connectors.

 

The case also has some quick-replace slots in the front; I'll also try plugging directly into the drives.

 

Thanks for the help so far, it might be a day or two before I get to the next step.

[screenshot: array status]

9 minutes ago, thebedivere said:

moved some cables around

All cables should have some slack so connectors can sit squarely on the connection with no tension that might cause them to move. Don't bundle data cables.

 

15 minutes ago, thebedivere said:

some drives are not showing up

Multiple disks disappearing suggests power or controller. Already mentioned splitters.

 

Already mentioned Marvell. If it has been working for you then maybe OK, more likely to be an issue if you enable IOMMU for hardware passthru to VMs.


Never mind, syslog shows it is rebuilding disks 6 and 7.

 

Is it showing anything in the Errors column on Main - Array Devices since you took that screenshot?

 

Probably you are just going to have to repair the filesystems on all the unmountable disks and hope for the best.
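
For reference, the usual sequence, assuming XFS (the Unraid default) and this release's maintenance-mode /dev/mdN devices: stop the array, start it in Maintenance mode, then, using disk3 as an example:

xfs_repair -n /dev/md3    # check only; reports problems without writing anything
xfs_repair /dev/md3       # actual repair (may require -L if the log is dirty)

Use the /dev/mdN device rather than /dev/sdX so parity stays in sync.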


No errors are showing up. I'm freeing up some more storage space from disk 4 before I stop the array and try to do some filesystem repair.

 

Current view on the UI:

 

[screenshot: Main - Array Devices]

 

Disk 7 did eventually show up after I swapped out the SATA cable for one that connected better.

 

I tried moving disks 3, 5, and 6 to other SATA ports on the motherboard instead of the expansion card, and I also tried different power connectors; nothing changed.

 

I will attempt some filesystem repair and see how it goes.
