SSD cache pool - large writes causing connectivity issues

tiwing · April 15, 2021

kscs-fvm2-diagnostics-20210415-1243.zip

Hi all, I'm having an issue where any large writes to the cache pool cause what seems to be connectivity issues to the entire unraid server including dropping connection to dockers and interruption of the GUI from a browser. I've been able to isolate two scenarios where this happens, and how I can prevent it:

1) copying large files to a cache:yes share using krusader

2) processing files using tdarr from a cache:yes share. NOTE the tdarr cache space is on an SSD in undefined devices, not on the cache pool itself.

setting shares to cache:no prevents the issue.

During sustained writes to the cache pool there is ever increasing processor usage where it starts fine, then over a period of minutes the number of pinned CPUs progressively increases to a pretty high number at which point I lose connectivity via the web and dockers such as plex go offline.

appdata and system (docker, libvert) are all on the same cache pool.

I had an "unbalanced" pool previously and thought that might be the issue (1x480GB and 2x250GB). I since pulled the 2x250s and replaced with an additional 480GB for a "proper" btrfs raid1 *cough* cache pool.

I have not yet tested removing one of the drives in the pool to see if single disk on the pool is still an issue, but that is one of my next steps to problem solve.

I could also create a second cache pool of either 1 or 2 disks (prefer 2 for real world redundancy) since I still have my 250GB twins available, and isolate share writes form appdata and system folders, but I'd prefer not to use more disks for cache than I need to, and I just spent a bunch of money on 480GB SSDs. I AM willing to test though, if it seems like a good next step test.

I'm not challenged for CPU or memory in this machine as it's a dual 8 core Xeon with 128GB ram. All drives are all on twin LSI HBA cards through SAS2 backplanes.

For now I've left the shares set to cache:no where I normally do my large writes, but that defeats the point of having a somewhat large cache, and is not how I'd prefer to use the server. Additionally, there are a number of smaller writes to these same shares all day long and I'd rather make use of the cache pool.

Any help and comments is appreciated.

JorgeB · April 15, 2021

In my experience Kingston A400 is probably one of the worst SSDs you can buy, they are slow when new and tend to get progressively slower with use, avoid those and any other DRAMless TLC SSD, I would recommend trying with decent SSDs, like the 860/870EVO, MX500, WD Blue or similar.

tiwing · April 15, 2021

damn.. well that's why I'm seeing it now then - my 250s are EVOs and I never noticed a problem with them. Do you think that's largely the cause of high processor usage?? Maybe I'll dump the EVOs back in there - return window is still open for one of the 480s.

edit: follow up Q - would you recommend using separate cache pools for data writes versus appdata/system ? Makes a difference to the sizes I'm buying

Edited April 15, 2021 by tiwing

tiwing · April 15, 2021

also, thinking about the best way to switch back to the 250 evos , would it be to create a new cache pool (pool2) of 2x250SSD, move/copy over appdata and system folders and repoint the shares to "pool2", then remove the 480s? Far better, I think, than moving everything to the array, then back again (large plex library)

tiwing · April 16, 2021

update:

I created a new pool, put the 250GB evo's back into the box and moved docker and appdata back to them. Once I started writing large files to cache the same issue reoccurred where processors maxed out and it looks like disk IO caused connection issues, including dropping out plex, I assume because plex couldn't find IO to access the docker or appdata. Or maybe the networking within dockers... I have no idea.

So I've left the twin 480s in a separate pool and switched file writes to that pool while leaving system and appdata on the smaller pool. It works perfectly, and although large writes seem to really max many of the threads it did not take down plex and I didn't lose access to the unraid gui.

I don't remember this happening in any previous unraid versions, and unfortunately the upgrade o 6.9.0 happened with a day of moving to the new server.

For now I guess I'll have to leave things as separate pools or to turn off cache usage for all file shares except system and appdata. I'm not impressed that I have to do this. Is this a new bug introduced in a recent version of unraid? Anyone else experiencing this, or is anyone able to replicate it? I'll mess a bit with my backup unraid server but I don't have twin cache drives on it, and I don't have parity... it's just a jbod box for daily backup.

JorgeB · April 16, 2021

5 hours ago, tiwing said:

Is this a new bug introduced in a recent version of unraid?

It's been happening for some time but just to a few users, full reason why is still unknown to me, low performing SSDs can exacerbate the issue, and sometimes using faster ones helps, but not always.

tiwing · April 16, 2021

that's interesting, looks like I fall into the few users camp!

I just tested my old machine (diagnostics attached) which actually does show the same behaviour in terms of processor usage on writing to the BTRFS pool - in this case the "pool" is one SSD as I don't care for redundancy in this box. I never noticed it when it was my primary machine, but I think I know why. This box alternates between full speed and almost zero write speeds. Always has. Still does. Seems like when the processors get maxed the write speed drops to "slow", then things calm down, processors return to normal levels and write speed increases to full, then processors ramp up and write speed drops again. I think this behaviour meant that I never lost connectivity to my dockers or gui which is why I never noticed it there. It does not happen on a regular time schedule, sometimes it will be fine when copying for a minute, then will go into a cycle as described every 10-15 seconds, then it will be fine for a bit. It's also impossible for me to tell if the processors ramp up before the write speeds slow down... in other word I don't know if high processor usage is causing slow write speeds, or if slow write is causing high processor.

I'm happy to help test different scenarios since I have two different unraid boxes at home - just shoot me a PM if you want me to dig into anything.

cheers

kscs-bu-diagnostics-20210416-0807.zip

JorgeB · April 16, 2021

5 minutes ago, tiwing said:

I'm happy to help test different scenarios since I have two different unraid boxes at home

If I understood correctly having the system and appdata shares on a different pool helps? At least the dockers don't go offline like that?

P.S. it's normal for CPU usage to go up a lot in the GUI dashboard during any transfer since it accounts for i/o wait, same shouldn't happen for example on top.

tiwing · April 16, 2021

2 minutes ago, JorgeB said:

having the system and appdata shares on a different pool helps?

P.S. it's normal for CPU usage to go up a lot in the GUI dashboard during any transfer since it accounts for i/o wait, same shouldn't happen for example on top.

correct. After leaving system and appdata on the "cache" pool, and moving all other shares to a different pool, I had no connection loss to my dockers and no gui interruption. I probably copied several TB in various tests last night to test, including using krusader disk to pool directly, krusader share to share, windows share to share, and having Tdarr do a bunch of work on a test folder. I'm at work now, but I'll watch "top" tonight and see what happens there. I also have the netdata docker installed, will watch that tonight and see if anything looks bizarre, and will report back here.

JorgeB · April 16, 2021

6 minutes ago, tiwing said:

After leaving system and appdata on the "cache" pool, and moving all other shares to a different pool, I had no connection loss to my dockers and no gui interruption.

Thanks, that could be a good workaround for other users.

SSD cache pool - large writes causing connectivity issues

Recommended Posts

tiwing

Link to comment

JorgeB

Link to comment

tiwing

Link to comment

tiwing

Link to comment

tiwing

Link to comment

JorgeB

Link to comment

tiwing

Link to comment

JorgeB

Link to comment

tiwing

Link to comment

JorgeB

Link to comment

Join the conversation