[6.7.x] Very slow array concurrent performance

Lignumaqua · August 11, 2019

Add me to the list with the exact same issues. Tried to copy a 50 GB VM img file to the SSD cache disk today and it took nearly an hour! During that time processor usage spiked to over 90% (twin Xeon with 16 cores and 128GB of RAM so resources shouldn't be an isssue...). Copying seemed to run in cycles with 1GB transferring relatively quickly with less than 50% processor usage, then a minute or so of almost stationary transfer speed with the processor spiking up to 90% +. If it's relevant, the same as @BiGs above I run a two parity disk array with two ssd disks in a cache pool.

I'd noticed things were slow in the last few weeks, but hadn't connected the dots. This large file transfer was so painful I came to the forum and found this thread. Describes my issue perfectly. Going to revert to 6.6.7 and hope that fixes things. Unfortunately it seems there's a fundamental file handling issue with 6.7.

Limetech haven't responded to this thread at all. In the possibly related SQLite thread they said they couldn't duplicate that problem. I haven't experienced the SQLite issue either, so I'm hoping they can duplicate this manifestation instead. I understand they are a small company, but it would be really helpful if they were to do so and acknowledge the issue (or explain why they can't).

PS - The 'Minor' label on this thread is an understatement. This is way worse than a minor irritation.

Edited August 11, 2019 by Lignumaqua

Marshalleq · August 11, 2019

I think it's a weekend for @limetech, probably be another 24 hours before they notice this has blown up a bit.

Your post reminds me of another one too, where I had the same 'burst' file copying issue - I had forgotten about that, but exactly as you describe. I think I'll go hunt for my thread on it, pretty sure everyone thought I was mad!

Marshalleq · August 11, 2019

Found it copied relevant snippet below:

"....some observations that seem odd to me include at times the disk is reading and writing from the same disk at the same speed in both read and write columns of 75MB/s and simultaneously the drive it's copying from is only running at 10 or 20MB/s sometimes less. Other behaviour that seems odd to me, is it cycles between reading from the source drive (and not writing to the target), then not reading from the source drive and writing to the target. So it's like copying it to a buffer somewhere. Something I'm sure is not normal for a normal move or copy operation."

This was using unbalance.

However, more unusual stuff even on 6.6.7 shows I really don't know how unraids raid works.

I have disabled both the docker and the virtual machines services, so nothing is running. Doing a console copy from my VM drive (unassigned devices) to the Btrfs cache mirror, is running nice at 490MB/s, yet there is a raid array disk constantly at read of 245MB/s for the whole copy - it stopped when the copy stopped. And no, I am not writing to /mnt/usr/something I'm just writing directly to /mnt/cache.

Edited August 12, 2019 by Marshalleq

dustinr · August 12, 2019

is there a good guide to roll back to 6.6.x ? also what features will i lose (im on latest rc).

-DCR

jdmhammer · August 13, 2019

And here I thought I had a network issue, glad others are reporting this so it gets resolved.

Lignumaqua · August 13, 2019

Reporting back. Now reverted to V6.6.7 and everything works again. Much quicker transfers and the cache drive is behaving correctly.

StevenD · August 13, 2019

I, too, reverted to 6.6.7 last night. All is working as expected again.

sirkuz · August 13, 2019

Good to hear reverting back helps. I am trying to hold off for a fix but this sure is an annoying issue! Been following/researching it for awhile but now that more people are reporting it hopefully it will get some more attention/resolution

s.Oliver · August 14, 2019

funny thing, now another problem has disappeared (after going back to 6.6.7), which brought some serious brain smashing:

PLEX (docker) has some background tasks running (usually in the night), one is the media scanning job. this one regularly crashed and alot of people had this problem too and tried to find a solution. now after some days of up time with 6.6.7 i haven't seen one crash – YEAH!

in the nights i've some big backup jobs running, which are writing into the array. so i would guess, that PLEX has timed out on accessing data in the array (albeit, i just reads files).

Marshalleq · August 14, 2019

On 8/9/2019 at 8:28 AM, Warrentheo said:

I have an EVGA GTX1070, and have 6.7.3-rc2 installed... Has not given me an issue, I also would not expect the slightly updated linux kernel that came in rc1 to cause that sort of issue... Not much has changed in rc 1 & 2...

Yeah, I was having issues with backups crashing too. I'm not enjoying the downgraded KVM though.

sirkuz · August 15, 2019

I bit the bullet and have decided to revert one system as well, as reported many times things seem to be back to "normal". Not sure how long I will be able to hold of on the secondary reverting as well as it was quite simple.

s.Oliver · August 15, 2019

5 hours ago, sirkuz said:

I bit the bullet and have decided to revert one system as well, as reported many times things seem to be back to "normal". Not sure how long I will be able to hold of on the secondary reverting as well as it was quite simple.

maybe you don't have to, if limetech can identify the problem and fix it.

testdasi · August 15, 2019

8 hours ago, sirkuz said:

I bit the bullet and have decided to revert one system as well, as reported many times things seem to be back to "normal". Not sure how long I will be able to hold of on the secondary reverting as well as it was quite simple.

Upvote the bug report hopefully may help it gain a bit more traction.

Marshalleq · August 15, 2019

Forgive the dumb question, I don't actually see where I can upvote this, I can see others that have upvoted, but no option in there for me to do the same, other than to 'like' it.

Edit - found it. Hovering over the like button shows an upvote button. Not exactly intuitive but all good.

Edited August 15, 2019 by Marshalleq

s.Oliver · August 16, 2019

need to correct my last post: PLEX docker (media scan background task) did crash now once. so possible that this isn't related to the kernel, or whatever.

Marshalleq · August 16, 2019

I'm still not sure @limetech have actually seen this. While we're waiting, a good idea might be to all comment on what motherboards / sata cards we're using to see if there's any commonality. I'm surprised more people aren't reporting this to be honest, but maybe they've just not been that observant yet.

I've got an Asus Prime X399-A board with 6 SATA's on it and a Dell Perc 310 controller flashed to IT mode, detected as a Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-s [Falcon] (rev 03). This has an additional 8 SATA ports.

If we all had that exact card, all had AMD, or even all had threadripper for example, I'd be suspicious.

Also, has anyone uploaded diagnostics, looking back through the thread it seems even the original poster hasn't done this, though maybe I'm not looking properly... If not, we should attach, just make sure you're not on 6.6.7 when you do it.

Edited August 16, 2019 by Marshalleq

spgill · August 16, 2019

Pitching in my two cents...

Relatively new Unraid user here experiencing the exact same symptoms after upgrading from a 6.6.x release to 6.7.2. Considering rolling back until this is fixed because it is crippling my system.

My hardware is a Gigabyte GA-F2A88XM-D3HP motherboard, with an AMD A10-7860K CPU, and using the chipset SATA controller (AMD A88X) for my array. I have an attached LSI PCIe SAS card, but the problems didn't start until long after I installed it, and besides the only device plugged into it is a SAS tape drive which is turned off most of the time.

richsoft · August 16, 2019

I'll also add my experience of this into the mix, although I haven't done any measurements to quantify my findings.

I've been struggling to pin down a Windows 10 VM performance issue for a while, which I thought was network related, but appears that it may have been a symptom of this issue.

I downgraded to 6.6.7 this evening, and my problems seem to have been resolved. It would appear that the terrible, apparent, download speed may actually have been the VM struggling to save the file to the array (directly to an uncached share, not to the SSD where the VM is located). Although running speed checks seemed to show a good internet download speed (around 70Mb/s), when downloading a file in a browser I was getting speeds in the Kb/s range.

Downloading to the array was probably in itself exaggerating other general performance issues now realise I was seeing with general file share usage, which also seems to be much better now.

As a bonus, Plex can hardware transcode again on my J4105, which also stopped working in 6.7.x :)

Marshalleq · August 16, 2019

Doing a bit of googling - I did find the below, which seems to coincide with the kernel versions of the 6.7 series, similar symptoms and be related to ATA disks. I'll need to upgrade back to 6.7 to do testing to see if there are any IOWait related issues or if this is totally unrelated, but posting here so others can weigh in. Equally according to this thread, some things seem to wake up the bug and cause it to revert to normal performance again. With all the plugins and dockers within Unraid, I could see that happening quite consistently, making it hard to pin down. As far as I can tell, the kernel versions Unraid use, even in the latest RC3, do not include a fix for this issue.

https://bugzilla.kernel.org/show_bug.cgi?id=202353

Marshalleq · August 17, 2019

In an effort to begin some testing I have upgraded back to latest 'stable' - if we can call it that. Unfortunately, my mirrored cache has mounted one drive into unassigned devices and the other is being reported as having no file system. This is not the expectation I have of a set of mirrored SSDs. In addition, even though it is enabled, I have lost SSH access. Sure I can get in via the server, but this is really quite unexpected. I have no idea if I had data on my cache or not. Actually with the amount of issues with the BTRFS cache I've had with it not setting up a true mirror etc, I am now of the opinion it is safer to be unmirrored, because frankly it doesn't work for the purpose it was intended. This unraid has a few quirks doesn't it.

If it weren't for the excellent KVM GPU passthrough, I'd probably migrate to FreeNAS now. The silence from @limetech on this issue is just not OK.

Fixed SSH by deleting the keys in /boot/config/ssh/

Fixed BTRFS Cache, by stopping array, and changing cache to have two disks again (it had reset to 1). Then starting the array, confirming the data existed, then stopping array again, adding in second disk that had been ejected and starting array one more time.

An auto balance then automatically ran, which incorrectly turned the btrfs into a RAID 0. Running the below command balanced it back to raid 1. Done. Now to monitor to see what we can find in terms of iowait etc.

# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

Edited August 17, 2019 by Marshalleq

DanW · August 17, 2019

I'm brand new to using UnRaid as a NAS and I think I have this issue.

I have a 10G connection and it is a nightmare transferring files to / from the array in its current state.

Is this a regular occurrence with UnRaid releases? It seems very flaky.

Edited August 17, 2019 by DanW
A bit more detail.

trurl · August 17, 2019

46 minutes ago, DanW said:

I'm brand new to using UnRaid as a NAS and I think I have this issue.

Possibly you aren't having this issue but some other. This issue is related to 6.7 and later.

Have you tried an earlier version?

Marshalleq · August 17, 2019

So I'm now back on 6.7.2 and already I'm seeing issues again. Specifically, while playing something on Plex, the mover process created a repeating image freeze / resume scenario on the client. I had the opportunity therefore to look at top and saw the wa (I/O wait) reach approximately 0.20 vs the idle wait of 0.03. While this may be indicative of the issue below, I'm looking deeper into it as it's not entirely unusual for moving data from SSD to HDD to create a high I/o wait obviously. Perhaps someone can check it on 6.6.7 for me, my recollection was this only got to about 0.10 on that.

The patch mentioned above was put into mainline kernel from 4.19.1. So I've upgraded to the beta of unraid which has kernel 4.19.60. I assume that is later and therefore a good way to test if this resolves the issue. Will keep you posted.

Edited August 18, 2019 by Marshalleq

trott · August 18, 2019

3 hours ago, Marshalleq said:

So I'm now back on 6.7.2 and already I'm seeing issues again. Specifically, while playing something on Plex, the mover process created a repeating image freeze / resume scenario on the client. I had the opportunity therefore to look at top and saw the wa (I/O wait) reach approximately 0.20 vs the idle wait of 0.03. While this may be indicative of the issue below, I'm looking deeper into it as it's not entirely unusual for moving data from SSD to HDD to create a high I/o wait obviously. Perhaps someone can check it on 6.6.7 for me, my recollection was this only got to about 0.10 on that.

The patch mentioned above was put into mainline kernel from 4.19.1. So I've upgraded to the beta of unraid which has kernel 4.19.60. I assume that is later and therefore a good way to test if this resolves the issue. Will keep you posted.

6.7.2 user kernel version: 4.19.56, if the fix is in 4.19.1, then I don't think 4.19.60 will help

Marshalleq · August 18, 2019

That’s why I upgraded to the release candidate of the next version. Probably my post is a bit confusing cause at the beginning of it I wasn’t. But I try not to post too many messages in a row.

Edited August 18, 2019 by Marshalleq

[6.7.x] Very slow array concurrent performance

User Feedback

Recommended Comments

Lignumaqua 9

Link to comment

Marshalleq 139

Link to comment

Marshalleq 139

Link to comment

dustinr 0

Link to comment

jdmhammer 6

Link to comment

Lignumaqua 9

Link to comment

StevenD 88

Link to comment

sirkuz 6

Link to comment

s.Oliver 25

Link to comment

Marshalleq 139

Link to comment

sirkuz 6

Link to comment

s.Oliver 25

Link to comment

testdasi 501

Link to comment

Marshalleq 139

Link to comment

s.Oliver 25

Link to comment

Marshalleq 139

Link to comment

spgill 1

Link to comment

richsoft 0

Link to comment

Marshalleq 139

Link to comment

Marshalleq 139

Link to comment

DanW 2

Link to comment

trurl 2950

Link to comment

Marshalleq 139

Link to comment

trott 14

Link to comment

Marshalleq 139

Link to comment

Join the conversation