[6.7.x] Very slow array concurrent performance

JonathanM · August 5, 2019

I wonder if the sqlite thing is related, since we know there have been issues with sqlite on the fuse system for some users for a long time, but recently it's become a major issue. Perhaps when the file system performance falls below some threshold, sqlite reacts poorly and corrupts instead of waiting for completion.

Maybe there is a latent bug in sqlite that is being triggered by i/o speed?

JorgeB · August 5, 2019

I would say it's very possible, since the array becomes almost completely unresponsive; even listing a folder's contents can take many seconds, so it might cause timeouts for other things.

Edited August 5, 2019 by johnnie.black

JonathanM · August 5, 2019

Could you please toggle this plugin and check status with all mitigations enabled and disabled?

JorgeB · August 5, 2019

28 minutes ago, jonathanm said:

Could you please toggle this plugin and check status with all mitigations enabled and disabled?

Can't right now since I'm at work, but I have a small server here, based on a Core2Duo 8400, which isn't affected and the behavior is exactly the same.

JonathanM · August 5, 2019

12 minutes ago, johnnie.black said:

I have a small server here, based on a Core2Duo 8400, which isn't affected and the behavior is exactly the same.

So toggling the mitigations doesn't change anything?

JorgeB · August 5, 2019

5 minutes ago, jonathanm said:

So toggling the mitigations doesn't change anything?

I had the idea that only Lynnfield and newer CPUs were affected by those bugs, but I was wrong: the E8400 is also affected. Still, I just tried the plugin, and disabling the mitigations didn't make any difference.

JonathanM · August 5, 2019

It was an idea anyway. My thought process was that even though the CPU may not be vulnerable, the mitigations would still be applied in the code regardless. Honestly I don't know enough low level coding to be able to figure it out for myself, so I just wanted to advance the theory.

All these issues seemed to start popping up at roughly the same time frame, so it's tough to distinguish what may or may not be truly causal, or just coincidental.

JorgeB · August 5, 2019

It was a good idea and good to rule out. Still, I would be surprised if the mitigations caused such a big performance loss; 10 or 15% OK, but in this issue performance goes from 70MB/s to <0.5MB/s.

Hoopster · August 5, 2019

I have seen similar behavior lately.

Beginning with any 6.7.* release, if I am doing heavy file transfers/disk writes, etc. any activity on another disk brings things to a crawl or, in the case of the operation in which I first noticed the issue, kills writes to the disk.

I say this began with 6.7.x because I kept my server up to date with each new release and all of the video capture activity I have done recently is in the 6.7.x time frame.

For example, my wife has a lot of videos shot on MiniDV tape which she wanted captured to the unRAID server. Anytime I started capturing, life was good if that was the ONLY activity on the server. If, while capturing, I attempted to stream a movie, browse files on the server through Windows Explorer, etc., the capture process would stop with the error "destination disk too slow." This is an error from the video editing/capture system. Each tape is about 90 minutes and capturing writes a 19-20GB file to the array.

I have write caching disabled for shares so the video capture is going straight to the parity-protected array.

Basically, if any heavy writing is going on on the array such as recording a show through Plex DVR, copying large files to the array, etc., browsing files in the array is VERY slow. It takes several seconds (in addition to any time required to spin up disk(s) if necessary) to populate the folder/file list and open files. Without other "heavy" activity, browsing the array is fairly snappy.

I do not recall seeing this behavior with prior versions. In fact, I am relatively certain this was not an issue previously although I have not run any tests to prove it by rolling back to 6.6.7. This is purely based on recent observations although I will run some tests later by rolling my backup server back to 6.6.7 to see if the same behavior occurs.

Edited August 7, 2019 by Hoopster

Videodr0me · August 6, 2019

You can try turning off direct i/o in the share settings. Maybe that will affect your observed behaviour.

JorgeB · August 6, 2019

35 minutes ago, Videodr0me said:

You can try turning off direct i/o in the share settings.

Direct i/o is more for user shares, for this I used disk shares, but I did try having direct i/o on or off and it didn't make any difference.

Marshalleq · August 6, 2019

Talking about differing CPUs - I have been having issues on my Threadripper system. I'm now going to perform the same test as above, as I hadn't noticed it exactly like that, but then I do have two disk controllers, which may change things a little. I hadn't realised it until just now, but on top of the normal Plex issues with mover, I had the Apple TV Plex client just dumping out of a movie yesterday while copying a large amount of data from my Unraid server to an iMac.

Another thing I'm trying to understand the cause of: since this version I've had two SSDs die (one enterprise), and the new enterprise one, which is only a month old and has written only 8TB (rated at 1TB per day for 5 years), already has re-allocated sectors on it. I'm pretty sure I've changed that cable, which is about the only thing left I can think of doing. I don't suppose it has anything to do with this, but thought I'd throw it out there. It does say it's had 11 unsafe shutdowns (which it definitely hasn't); however, a bad cable is a possibility, or maybe all these I/O problems are starving the SSD into thinking it's had a disconnection? Just throwing it out there, as 2 dead SSDs and a third new one with issues is not normal.

JorgeB · August 7, 2019

Just to add that this problem is also easy to notice for an array disk to disk copy (though with a smaller performance impact), e.g., copying 3 files totaling about 12GB, time spent:

            Read/Modify/Write    Reconstruct Write
v6.7.2      8m43s                6m17s
v6.6.7      5m37s                4m26s
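
To put those timings in relative terms, here is a minimal sketch that converts them to seconds and computes how much longer v6.7.2 takes than v6.6.7 for each write mode. The function and dictionary names are made up for illustration; only the four durations come from the table.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '8m43s' into whole seconds."""
    minutes, rest = duration.split("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


# Durations copied from the table; the keys are illustrative names only.
timings = {
    "read_modify_write": {"v6.7.2": "8m43s", "v6.6.7": "5m37s"},
    "reconstruct_write": {"v6.7.2": "6m17s", "v6.6.7": "4m26s"},
}


def slowdown(mode: str) -> float:
    """Return time on v6.7.2 divided by time on v6.6.7 for one write mode."""
    new = to_seconds(timings[mode]["v6.7.2"])
    old = to_seconds(timings[mode]["v6.6.7"])
    return new / old


for mode in timings:
    ratio = slowdown(mode)
    print(f"{mode}: {ratio:.2f}x the v6.6.7 time ({(ratio - 1) * 100:.0f}% longer)")
```

By this arithmetic the same copy takes roughly 1.55x as long with read/modify/write and roughly 1.42x as long with reconstruct write.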

Edited August 7, 2019 by johnnie.black

s.Oliver · August 7, 2019

i can add to this and it's a major setback for unRAID from 6.7 onward.

before i was reluctant to post about it, because too few tests had been done to be 100% sure i hadn't changed some setting somewhere…

but now, i'm sure. today i upgraded one more unRAID server from 6.6.x to 6.7.2 and do see the exact same behavior! so i do have 2 machines here, which haven't had a single change, except they were upgraded to 6.7.x (meanwhile all on 6.7.2).

in my book, it doesn't matter how you access the data: coming from network or locally on the server, using different machines to connect to the server… when one write into the array is ongoing, then any reads (even from cache SSDs/NVMe) – even the ones coming from data or cache devices which aren't being written to – are super slow. also, whenever a rebuild is happening now, you'd better not try to read any file...

also RAM amount doesn't change anything, nor the used controllers nor the cpu (with mitigations enabled or disabled). and while i can't back it by data, it seems that rebuilds are slower too.

this can have severe consequences where some services are continuously writing data into the array (like video surveillance for example).

hopefully we can find a fast fix for this, because going back to 6.6.x isn't a good option anymore.

@limetech what can we do to help debugging this?

Edited August 7, 2019 by s.Oliver

Kevek79 · August 7, 2019

I am running UnRaid 6.7.2

I have seen the same behavior on my box when streaming to one of my clients. Everything runs buttery smooth until I try to copy some new files to the array. If I write directly to the array while streaming the stream freezes.

I have mitigated the issue by caching my media share for the moment, so loading content does not interfere with streaming from the array (or an unassigned device for that matter).

But that can not be a viable solution. I normally do not stream when mover operations are running, so I can not say anything about the impact of mover. But streaming and writing from/to the array at the same time was definitely working in earlier Unraid versions and currently it does not.

I have just adjusted my scheduled times for mover and parity checks so I can make sure that they never run at the same time - just to be safe for the moment.

Edited August 7, 2019 by Kevek79

Marshalleq · August 7, 2019

I'm just going to downgrade until someone sorts something out, I think. There is a beta out with a newer kernel which could be worth a go though. Happy to help out with testing, but it doesn't seem like Limetech are listening for some reason. They're usually pretty good, right?

bytchslappa · August 8, 2019

Add me to the list as well - 6.6.7 and all is well - 6.7.x and it all turns to custard - there are a number of threads on this now.. copying a single file between disks or writing to the array via SMB should not slow the disk access down to the point where docker and VMs die and stop responding - this is not heavy IO - it's a single file.

I personally have not purchased unraid yet - and maybe won't, looking at the current state and lack of interest from the devs around this - but i've given it a lot of time for something that really shouldn't require messing around this much. With Freenas I can hammer the array while running on a low end CPU in a first gen HP microserver and dockers don't stop responding. unraid 6.7 with a way more powerful CPU, more RAM, and different controllers tried (SAS and SATA), complete with SAS and SATA disks and thus different cables etc. - just doesn't perform.. so do i buy into unraid but run 6.6.x and hope that whatever is busted in 6.7.x is fixed? this is a paid product - not a freebie which you can sorta give a little slack to.. I don't even have this sort of issue with the now dead FlexRAID (in which the array works very similarly to unraid, with all parity on a single disk)

JonathanM · August 8, 2019

22 minutes ago, bytchslappa said:

lack of interest from the devs around this

Trust me, @limetech is very interested. It's a pretty small team, and lack of regular updates on every forum thread does not equal neglect of the product. They are very actively working on isolating the issue so they can fix it.

testdasi · August 8, 2019

I can confirm this bug - but with a different conclusion.

1. cp from cache to disk2 (using console) reaches about 200MB/s, read from disk3 (via SMB) drops to 5MB/s
2. Once the disk2 write is done, read from disk3 immediately goes back up to 197MB/s
3. cp from cache to unassigned device (using console) reaches 500MB/s, read from disk 3 (via SMB) is still high around 172MB/s
4. To remove SMB as a variable, I have repeated the test using console only (2 simultaneous connections) and they have similar results
5. To remove console as a variable, I have repeated the test using SMB only and I can see write speed about 2x-3x read speed but the frequent fluctuation makes it hard to judge. However, it's clear read speed is in the double-digit (i.e. faster than case (1) above).
6. To remove write as a variable, I tested read (via SMB) from 3 disks, 2 disks and 1 disk and get 96-95-97, 141-143 and 210.
7. To remove read as a variable, I tested write (via SMB) from 3 disks, 2 disks and 1 disk and get similarly even splits.
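
For anyone wanting to repeat this kind of console-only comparison, below is a rough sketch of one way to script it: time a sequential read while idle, then time it again while a background thread keeps writing to a different disk. This is not the method used in the tests above, and the function names are hypothetical. Note that a file read recently may be served from the page cache, so the read file should be larger than RAM (or caches dropped first) for the numbers to mean anything.

```python
import os
import threading
import time

CHUNK = 1024 * 1024  # 1 MiB per read/write call


def timed_read(path: str) -> float:
    """Read the whole file sequentially and return throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    return total / 1e6 / max(elapsed, 1e-9)


def background_write(path: str, stop: threading.Event) -> None:
    """Keep appending zero-filled chunks to `path` until `stop` is set."""
    block = bytes(CHUNK)
    with open(path, "wb", buffering=0) as f:
        while not stop.is_set():
            f.write(block)
            os.fsync(f.fileno())  # push data to the device, not just the page cache


def read_speed_idle_vs_busy(read_path: str, write_path: str) -> tuple:
    """Return (idle_MBps, busy_MBps) for reading `read_path`.

    The second read runs while a thread writes to `write_path`, which is
    deleted afterwards.
    """
    idle = timed_read(read_path)
    stop = threading.Event()
    writer = threading.Thread(target=background_write, args=(write_path, stop))
    writer.start()
    try:
        busy = timed_read(read_path)
    finally:
        stop.set()
        writer.join()
        os.remove(write_path)
    return idle, busy
```

As an example call, assuming disk shares are mounted under /mnt and using placeholder file names: `read_speed_idle_vs_busy("/mnt/disk3/big_file.bin", "/mnt/disk2/write_test.bin")`. The write target keeps growing for as long as the read takes, so the destination disk needs enough free space.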

No parity. All mitigation disabled via Squid's plugin.

So it sounds to me like it's not necessarily an issue with concurrent performance but rather there's a speed limit to the array IO with incorrect prioritisation of write vs read.

For read/write to a single disk, it's limited by the maximum speed of the device (usually an HDD), which is usually lower than this overall speed limit.
When reading / writing to multiple disks, the total speed of the devices exceeds the speed limit, causing the overall limit to become apparent.
- If only read or only write, the limit is divided across multiple disks evenly
- If read + write, there appears to be significantly higher priority (and/or resources) given to write, crippling read speed.
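
One rough way to sanity-check this reading is to add up the per-disk figures posted above. The sketch below assumes the read-only figures are in MB/s (the post does not state units) and uses made-up variable names; it only sums the posted numbers.

```python
# Per-disk read speeds from the read-only test, keyed by number of disks read at once.
# Assumption: the posted figures are MB/s.
read_only = {1: [210], 2: [141, 143], 3: [96, 95, 97]}

# Aggregate throughput for each read-only scenario.
read_totals = {disks: sum(speeds) for disks, speeds in read_only.items()}

# The mixed case from the first test: write to disk2 while reading disk3.
mixed = {"write_disk2": 200, "read_disk3": 5}
mixed_total = sum(mixed.values())
read_share = mixed["read_disk3"] / mixed_total

for disks, total in read_totals.items():
    print(f"{disks} disk(s) read-only: {total} MB/s aggregate")
print(f"write + read: {mixed_total} MB/s aggregate, read gets {read_share:.1%}")
```

The read-only totals flatten out just under 290 MB/s once two or more disks are involved, which fits the idea of an overall cap. The mixed total of 205 MB/s sits well below that figure, with the read getting only a few percent of it, so a cap alone would not account for the slow read; that is where the prioritisation part of the explanation comes in.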

Edited August 8, 2019 by testdasi
have an epiphany

BiGs · August 8, 2019

Hey. I've been having similar issues from day dot of adding a cache, but I only recently purchased Unraid and started with 6.7. My system becomes unusable while mover is running. Any transfers slow to nearly nothing and, in some FTP cases, time out completely and fail. I run Shinobi as a docker with a CCTV system on 24h recordings, and I'm getting recording black spots during the scheduled mover time.

I troubleshot this a bit myself by moving the cache disks off the mobo SATA ports and onto the SAS/SATA ports on my RAID card with the rest of the array disks, thinking it might take unnecessary I/O CPU power to route it via the motherboard rather than keeping it on the PCIe slot/RAID card. It did improve things, in that some sort of recording now happens instead of nothing or 0 byte files being written, but the normally 15 minute blocks of recordings are still being interrupted and split into various chunks with minutes still missing. It also kills any GUI communication while mover is running.

I can't see much in terms of mover settings for troubleshooting this, but a speed limit or something might be good? I'm a bit of a layman with this stuff so I don't understand what's suggested in the above posts. Just thought I'd post my issues on this subject too.

Edit: Maybe an important note is that I run a two-parity-disk array with two SSDs in a cache pool (so maybe a higher than standard CPU requirement by Unraid for mover).

Edited August 8, 2019 by BiGs

ZataH · August 9, 2019

I had the exact same issue on 6.7.x with multiple local file transfers.

Downgraded to version 6.6.7 and it works fine

Marshalleq · August 10, 2019

I suddenly had the realisation that this bug is probably what's been causing me so many headaches with my Crashplan backup. I mean, I nearly cancelled the service because it was so slow and it kept crashing. So I had enough and downgraded. Yes, Crashplan (docker based) is now suddenly faster and so far working much better. Other things I noticed: it booted a lot quicker and didn't sort of pause before the login screen, Plex is more responsive, and the disks seem to be 'quieter'. Before, there were sort of random reads and writes happening which I couldn't track down, but they now seem to have disappeared. The Unraid GUI is much faster, and I'd even say my SSD is running cooler. (Call me paranoid, but I've had two SSDs unexpectedly die and this brand new one already has unrecoverable sectors after only a month.) Perhaps some of this is in my mind, but the primary function of a NAS is, well, to serve files to multiple people concurrently in a performant way. Right now that doesn't happen on 6.7. I'd bet many people have this bug and haven't realised it yet.

Edited August 10, 2019 by Marshalleq
Clarity.

trott · August 10, 2019

I'm now in the process of moving data to unraid from an old HDD to do some testing, and I noticed the same issue: when there is write activity going on, the read speed is extremely slow (3-4M/s, sometimes in KB), and no matter which disk the data is read from, the result is always the same;

when there is no write activity, the read speed returns to normal. I did 3 concurrent reads from 3 disk shares, and each read can reach 150-200M/s

Edited August 10, 2019 by trott

Maticks · August 10, 2019

I'm back on 6.6.7 really hoping for a fix soon.

s.Oliver · August 10, 2019

well, couldn't stand it anymore – so back to 6.6.7 and all is back to normal, expected behavior.

though, missing stuff from 6.7, so i'll hope they can identify/fix the problem really soon.
