[6.7.x] Very slow array concurrent performance


    JorgeB
    Status: Solved, Priority: Urgent

    For as long as I can remember, Unraid has never been great at simultaneous array disk performance, but it was acceptable. Since v6.7, various users have complained about very poor performance, for example when running the mover while trying to stream a movie.

    I noticed this myself yesterday when I couldn't even start watching an SD video in Kodi, just because writes were going to a different array disk, and this server doesn't even have a parity drive. So I did a quick test on my test server: the problem is easily reproducible, and it started with the first v6.7 release candidate, rc1.

     

    How to reproduce:

     

    -The server just needs two assigned array data devices (no parity needed, but the same happens with parity) and one cache device; no encryption, all devices btrfs formatted

    -Used cp to copy a few video files from cache to disk2

    -While cp was running, tried to stream a movie from disk1: it took a long time to start and kept stalling/buffering
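    For anyone who wants to script the test above, here is a minimal sketch in plain sh. It is not the method used in the report (that was cp plus a Kodi stream and a manual file copy); the function name read_during_write and the example paths are hypothetical placeholders. It only measures how long one sequential read takes while a background cp is writing elsewhere.

```shell
#!/bin/sh
# Hypothetical helper (not an Unraid tool): time a sequential read of one file
# while a bulk copy runs in the background to another location.
#   $1 = directory whose contents are copied (e.g. a folder on /mnt/cache)
#   $2 = copy destination (e.g. a folder on /mnt/disk2)
#   $3 = file to read during the copy (e.g. a movie on /mnt/disk1)
# Prints the read time in whole seconds.
read_during_write() {
    local write_src="$1" write_dst="$2" read_file="$3"
    local writer start end

    mkdir -p "$write_dst"
    # Background writer, like the cp to disk2 in the report.
    cp -R "$write_src"/. "$write_dst"/ &
    writer=$!

    # Foreground reader, standing in for the stream from disk1.
    start=$(date +%s)
    cat "$read_file" > /dev/null
    end=$(date +%s)

    # Wait for the copy so no stray background job is left behind.
    wait "$writer"
    echo $((end - start))
}
```

    Example with placeholder paths: read_during_write /mnt/cache/videos /mnt/disk2/test /mnt/disk1/Movies/example.mkv. Pick a file that has not been read recently, otherwise the page cache may serve it from RAM and hide the problem.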

     

    Tried to copy one file from disk1 (still while cp was going on to disk2), with v6.6.7:

     

    [Screenshot: transfer speed copying a file from disk1 during the disk2 write, v6.6.7]

     

    with v6.7rc1:

     

    [Screenshot: same test, v6.7rc1]

     

    A few times the transfer goes higher for a couple of seconds, but most of the time it's at a few KB/s or completely stalled.

     

    Also tried with all devices unencrypted and XFS formatted, and it was the same:

     

    [Screenshot: same test with unencrypted XFS devices]

     

    The server where the problem was detected and the test server have no hardware in common: the former is based on an X11 Supermicro board and uses HDDs, while the test server is X9 series and uses SSDs, so this is very unlikely to be hardware related.




    User Feedback

    Recommended Comments



    Add me to the list with the exact same issues. I tried to copy a 50 GB VM img file to the SSD cache disk today and it took nearly an hour! During that time processor usage spiked to over 90% (twin Xeons with 16 cores and 128GB of RAM, so resources shouldn't be an issue...). Copying seemed to run in cycles: 1GB would transfer relatively quickly with under 50% processor usage, then there'd be a minute or so of almost no progress with the processor spiking to 90%+. If it's relevant, like @BiGs above, I run a two-parity-disk array with two SSDs in a cache pool.

     

    I'd noticed things were slow in the last few weeks, but hadn't connected the dots. This large file transfer was so painful I came to the forum and found this thread. Describes my issue perfectly. Going to revert to 6.6.7 and hope that fixes things. Unfortunately it seems there's a fundamental file handling issue with 6.7. 

     

    Limetech haven't responded to this thread at all. In the possibly related SQLite thread they said they couldn't duplicate that problem. I haven't experienced the SQLite issue either, so I'm hoping they can duplicate this manifestation instead. I understand they are a small company, but it would be really helpful if they were to do so and acknowledge the issue (or explain why they can't).

     

    PS - The 'Minor' label on this thread is an understatement.  This is way worse than a minor irritation.

    Edited by Lignumaqua

    I think it's a weekend for @limetech; it'll probably be another 24 hours before they notice this has blown up a bit.

     

    Your post reminds me of another one too, where I had the same 'burst' file copying issue - I had forgotten about that, but exactly as you describe.  I think I'll go hunt for my thread on it, pretty sure everyone thought I was mad!


    Found it; I've copied the relevant snippet below:

     

    "....some observations that seem odd to me include at times the disk is reading and writing from the same disk at the same speed in both read and write columns of 75MB/s and simultaneously the drive it's copying from is only running at 10 or 20MB/s sometimes less.  Other behaviour that seems odd to me, is it cycles between reading from the source drive (and not writing to the target), then not reading from the source drive and writing to the target.  So it's like copying it to a buffer somewhere.  Something I'm sure is not normal for a normal move or copy operation."

     

    This was while using the unBALANCE plugin.

     

    However, some even more unusual behaviour, even on 6.6.7, shows I really don't know how Unraid's RAID works.

     

    I have disabled both the Docker and the virtual machine services, so nothing is running. A console copy from my VM drive (unassigned devices) to the btrfs cache mirror runs nicely at 490MB/s, yet an array disk is constantly reading at 245MB/s for the whole copy; it stopped when the copy stopped. And no, I am not writing to /mnt/user/something; I'm writing directly to /mnt/cache.
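    Both observations in this post (the read/write columns in the quote, and an array disk reading at 245MB/s during an unrelated copy) can be cross-checked from the shell by diffing two snapshots of /proc/diskstats. A minimal sketch; the function name disk_throughput is hypothetical. It relies on the kernel's /proc/diskstats layout, where field 3 is the device name, field 6 is sectors read and field 10 is sectors written, with sectors always counted in 512-byte units.

```shell
#!/bin/sh
# Hypothetical helper: print per-device read/write rates (MB/s) between two
# snapshots of /proc/diskstats ($1, $2) taken $3 seconds apart.
# Devices with no activity in the interval are not printed.
disk_throughput() {
    awk -v secs="$3" '
        NR == FNR { rd[$3] = $6; wr[$3] = $10; next }   # first snapshot
        ($3 in rd) {                                     # second snapshot
            r = ($6 - rd[$3]) * 512 / 1000000 / secs
            w = ($10 - wr[$3]) * 512 / 1000000 / secs
            if (r > 0 || w > 0)
                printf "%s read=%.1fMB/s write=%.1fMB/s\n", $3, r, w
        }' "$1" "$2"
}

# Usage on a live system, e.g. while the copy is running:
#   cat /proc/diskstats > /tmp/ds1; sleep 5; cat /proc/diskstats > /tmp/ds2
#   disk_throughput /tmp/ds1 /tmp/ds2 5
```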

     

    Edited by Marshalleq

    Reporting back. Now reverted to V6.6.7 and everything works again. Much quicker transfers and the cache drive is behaving correctly. 


    Good to hear reverting helps. I am trying to hold off for a fix, but this sure is an annoying issue! I've been following/researching it for a while, but now that more people are reporting it, hopefully it will get some more attention and a resolution :)


    Funny thing: another problem has now disappeared (after going back to 6.6.7), one that caused me some serious head-scratching:

     

    PLEX (docker) has some background tasks running (usually at night); one is the media scanning job. This one crashed regularly, and a lot of people had this problem too and tried to find a solution. Now, after some days of uptime with 6.6.7, I haven't seen a single crash. YEAH!

     

    At night I have some big backup jobs running, which write to the array, so I would guess that PLEX timed out accessing data on the array (even though it only reads files).

    On 8/9/2019 at 8:28 AM, Warrentheo said:

    I have an EVGA GTX1070, and have 6.7.3-rc2 installed...  Has not given me an issue, I also would not expect the slightly updated linux kernel that came in rc1 to cause that sort of issue...  Not much has changed in rc 1 & 2...

    Yeah, I was having issues with backups crashing too.  I'm not enjoying the downgraded KVM though.


    I bit the bullet and decided to revert one system as well; as reported many times, things seem to be back to "normal". Not sure how long I will be able to hold off on reverting the second one as well, as it was quite simple.

    5 hours ago, sirkuz said:

    I  bit the bullet and have decided to revert one system as well, as reported many times things seem to be back to "normal".  Not sure how long I will be able to hold of on the secondary reverting as well as it was quite simple.

    Maybe you don't have to, if Limetech can identify the problem and fix it.

    8 hours ago, sirkuz said:

    I  bit the bullet and have decided to revert one system as well, as reported many times things seem to be back to "normal".  Not sure how long I will be able to hold of on the secondary reverting as well as it was quite simple.

    Upvote the bug report; hopefully that will help it gain a bit more traction.


    Forgive the dumb question: I don't actually see where I can upvote this. I can see others have upvoted, but there's no option for me to do the same, other than to 'like' it.

     

    Edit - found it.  Hovering over the like button shows an upvote button.  Not exactly intuitive but all good.

    Edited by Marshalleq

    I need to correct my last post: the PLEX docker (media scan background task) did crash once now, so it's possible that this isn't related to the kernel, or whatever.


    I'm still not sure @limetech have actually seen this.  While we're waiting, a good idea might be to all comment on what motherboards / sata cards we're using to see if there's any commonality.  I'm surprised more people aren't reporting this to be honest, but maybe they've just not been that observant yet.

     

    I've got an Asus Prime X399-A board with six SATA ports on it and a Dell Perc 310 controller flashed to IT mode, detected as a Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03). This adds another 8 SATA ports.

     

    If we all had that exact card, all had AMD, or even all had threadripper for example, I'd be suspicious. :D

     

    Also, has anyone uploaded diagnostics? Looking back through the thread, it seems even the original poster hasn't done this, though maybe I'm not looking properly... If not, we should attach them; just make sure you're not on 6.6.7 when you generate them.

    Edited by Marshalleq

    Pitching in my two cents...

     

    Relatively new Unraid user here experiencing the exact same symptoms after upgrading from a 6.6.x release to 6.7.2. Considering rolling back until this is fixed because it is crippling my system.

     

    My hardware is a Gigabyte GA-F2A88XM-D3HP motherboard, with an AMD A10-7860K CPU, and using the chipset SATA controller (AMD A88X) for my array. I have an attached LSI PCIe SAS card, but the problems didn't start until long after I installed it, and besides the only device plugged into it is a SAS tape drive which is turned off most of the time.


    I'll also add my experience of this into the mix, although I haven't done any measurements to quantify my findings.

     

    I've been struggling to pin down a Windows 10 VM performance issue for a while, which I thought was network related, but it appears it may have been a symptom of this issue.

     

    I downgraded to 6.6.7 this evening, and my problems seem to have been resolved. It appears that the terrible apparent download speed may actually have been the VM struggling to save the file to the array (directly to an uncached share, not to the SSD where the VM is located). Although speed tests showed a good internet download speed (around 70Mb/s), when downloading a file in a browser I was getting speeds in the Kb/s range.

     

    Downloading to the array was probably itself exaggerating other general performance issues that I now realise I was seeing with general file share usage, which also seem much better now.

     

    As a bonus, Plex can hardware transcode again on my J4105, which also stopped working in 6.7.x :)

     


    Doing a bit of googling, I found the bug report below, which seems to coincide with the kernel versions of the 6.7 series, has similar symptoms, and is related to ATA disks. I'll need to upgrade back to 6.7 to test whether there are any IOWait-related issues or whether this is totally unrelated, but I'm posting it here so others can weigh in. Also, according to that thread, some activity seems to wake the bug up and make performance return to normal. With all the plugins and dockers running within Unraid, I could see that happening quite regularly, which would make it hard to pin down. As far as I can tell, the kernel versions Unraid uses, even in the latest RC3, do not include a fix for this issue.

     

    https://bugzilla.kernel.org/show_bug.cgi?id=202353


    In an effort to begin some testing I have upgraded back to the latest 'stable', if we can call it that. Unfortunately, my mirrored cache has mounted one drive under unassigned devices, and the other is reported as having no file system. That is not what I expect from a set of mirrored SSDs. In addition, even though it is enabled, I have lost SSH access. Sure, I can get in at the server itself, but this is really quite unexpected. I have no idea whether I had data on my cache or not. Actually, given the number of issues I've had with the btrfs cache (not setting up a true mirror, etc.), I am now of the opinion it is safer to be unmirrored, because frankly it doesn't serve the purpose it was intended for. Unraid has a few quirks, doesn't it?

     

    If it weren't for the excellent KVM GPU passthrough, I'd probably migrate to FreeNAS now.  The silence from @limetech on this issue is just not OK.

     

    Fixed SSH by deleting the keys in /boot/config/ssh/

     

    Fixed the btrfs cache by stopping the array and changing the cache to have two disks again (it had reset to 1), then starting the array and confirming the data existed, then stopping the array again, adding back the second disk that had been ejected, and starting the array one more time.

    A balance then ran automatically, which incorrectly turned the btrfs pool into RAID 0. Running the command below balanced it back to RAID 1. Done. Now to monitor and see what we can find in terms of iowait etc.

     

    # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
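    After a conversion like this it is worth confirming the resulting profiles. `btrfs filesystem df /mnt/cache` prints one line per chunk type, e.g. `Data, RAID1: total=..., used=...`. A minimal sketch that checks that output; the function name cache_is_raid1 is hypothetical. It only checks the Data and Metadata lines and deliberately ignores GlobalReserve, which btrfs always reports as `single`.

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs filesystem df` output on stdin and succeed
# only if at least one Data/Metadata line exists and all of them say RAID1.
cache_is_raid1() {
    awk -F'[,:]' '
        $1 == "Data" || $1 == "Metadata" {
            seen++
            profile = $2
            gsub(/ /, "", profile)        # " RAID1" -> "RAID1"
            if (profile != "RAID1") bad = 1
        }
        END { exit (seen == 0 || bad) }
    '
}

# Usage:
#   btrfs filesystem df /mnt/cache | cache_is_raid1 && echo "cache is RAID1"
```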

    Edited by Marshalleq

    I'm brand new to using UnRaid as a NAS and I think I have this issue.

    I have a 10G connection and it is a nightmare transferring files to / from the array in its current state.

    Is this a regular occurrence with UnRaid releases? It seems very flaky.

    Edited by DanW
    A bit more detail.
    46 minutes ago, DanW said:

    I'm brand new to using UnRaid as a NAS and I think I have this issue.

    Possibly you aren't having this issue but a different one. This issue is related to 6.7 and later.

     

    Have you tried an earlier version?


    So I'm now back on 6.7.2 and already I'm seeing issues again. Specifically, while I was playing something on Plex, the mover caused repeated image freeze/resume cycles on the client. That gave me the opportunity to look at top, and I saw wa (I/O wait) reach approximately 0.20, versus 0.03 at idle. While this may be indicative of the issue below, I'm looking deeper into it, since it's obviously not unusual for moving data from SSD to HDD to create high I/O wait. Perhaps someone can check it on 6.6.7 for me; my recollection is that it only got to about 0.10 there.
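    top's `wa` figure is the share of CPU time spent waiting on I/O, and top derives it from the counters in /proc/stat. To compare releases with the same sampling interval each time, a minimal sketch that computes that share (as a percentage) from two samples of the aggregate `cpu` line; the function name iowait_pct is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: iowait as a percentage of all CPU time between two
# samples ($1, $2) of the aggregate "cpu" line of /proc/stat.
# Field order after "cpu": user nice system idle iowait irq softirq steal guest guest_nice.
# guest/guest_nice are already included in user/nice, so only fields 2-9 are summed.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            io[NR] = $6                  # iowait column
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (io[2] - io[1]) / dt
        }'
}

# Usage:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```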

     

    The patch mentioned above was put into mainline kernel from 4.19.1.  So I've upgraded to the beta of unraid which has kernel 4.19.60.  I assume that is later and therefore a good way to test if this resolves the issue.  Will keep you posted.

    Edited by Marshalleq
    3 hours ago, Marshalleq said:

    So I'm now back on 6.7.2 and already I'm seeing issues again.  Specifically, while playing something on Plex, the mover process created a repeating image freeze / resume scenario on the client.  I had the opportunity therefore to look at top and saw the wa (I/O wait) reach approximately 0.20 vs the idle wait of 0.03.  While this may be indicative of the issue below, I'm looking deeper into it as it's not entirely unusual for moving data from SSD to HDD to create a high I/o wait obviously.  Perhaps someone can check it on 6.6.7 for me, my recollection was this only got to about 0.10 on that.

     

    The patch mentioned above was put into mainline kernel from 4.19.1.  So I've upgraded to the beta of unraid which has kernel 4.19.60.  I assume that is later and therefore a good way to test if this resolves the issue.  Will keep you posted.

    6.7.2 uses kernel version 4.19.56; if the fix is in 4.19.1, then I don't think 4.19.60 will help.
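    The reasoning here is a version comparison: point releases within one stable series are normally cumulative, so if the fix landed in 4.19.1, then 4.19.56 should already contain it. A minimal sketch of that check; version_at_least is a hypothetical name, and it relies on GNU `sort -V` for version ordering.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 >= version $2.
# sort -V orders version strings numerically per component, so the smaller
# of the two comes first; $1 >= $2 exactly when $2 sorts first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   version_at_least "$(uname -r)" 4.19.1 && echo "kernel is at or past 4.19.1"
```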


    That's why I upgraded to the release candidate of the next version. My post is probably a bit confusing, because at the beginning of it I hadn't upgraded yet. But I try not to post too many messages in a row. :)

    Edited by Marshalleq





    Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.