• [6.7.x] Very slow array concurrent performance


    JorgeB
    • Solved Urgent

    Since I can remember, Unraid has never been great at simultaneous array disk performance, but it used to be acceptable. Since v6.7 there have been various users complaining of, for example, very poor performance when trying to stream a movie while the mover is running.

     

    I noticed this myself yesterday when I couldn't even start watching an SD video using Kodi just because there were writes going on to a different array disk, and this server doesn't even have a parity drive. So I did a quick test on my test server: the problem is easily reproducible and started with the first v6.7 release candidate, rc1.

     

    How to reproduce:

     

    -Server just needs 2 assigned array data devices (no parity needed, but the same happens with parity) and one cache device, no encryption, all devices btrfs-formatted

    -Use cp to copy a few video files from cache to disk2

    -While the cp is running, try to stream a movie from disk1: it takes a long time to start and keeps stalling/buffering (a console sketch of the test follows below)
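
    A minimal console sketch of the same test (assuming the stock /mnt/cache, /mnt/disk1 and /mnt/disk2 disk shares; the file names are placeholders):

        # terminal 1: sustained write from cache to disk2
        cp /mnt/cache/videos/*.mkv /mnt/disk2/videos/

        # terminal 2: while the copy runs, time a large read from disk1;
        # on v6.7.x this crawls at a few KB/s or stalls completely
        dd if=/mnt/disk1/videos/movie.mkv of=/dev/null bs=1M status=progress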

     

    Then I tried to copy one file from disk1 (still while the cp to disk2 was running), with v6.6.7:

     

    [Screenshot 2019-08-05 11:58: transfer speed on v6.6.7]

     

    with v6.7-rc1:

     

    [Screenshot 2019-08-05 11:54: transfer speed on v6.7-rc1]

     

    A few times the transfer goes higher for a couple of seconds, but most of the time it's at a few KB/s or completely stalled.

     

    I also tried with all devices unencrypted and xfs-formatted and it was the same:

     

    [Screenshot 2019-08-05 12:21: transfer speed with all-xfs devices]

     

    The server where the problem was detected and the test server have no hardware in common: one is based on a Supermicro X11 board, the test server on an X9-series board; the main server uses HDDs, the test server SSDs, so it's very unlikely to be hardware related.

    • Like 1
    • Upvote 22



    User Feedback

    Recommended Comments



    I wonder if the sqlite issue is related, since we know there have been problems with sqlite on the FUSE file system for some users for a long time, but recently it's become a major issue. Perhaps when file system performance falls below some threshold, sqlite reacts poorly and corrupts the database instead of waiting for the operation to complete.

     

    Maybe there is a latent bug in sqlite that is being triggered by slow I/O?
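
    One way to probe that theory would be to check an application database after a period of heavy array I/O, and to make the application wait longer for locks instead of erroring out. A rough sketch with the sqlite3 CLI (the database path is a placeholder; note busy_timeout is per-connection, so the application would have to set it on its own connection for it to matter):

        # check database integrity after a period of heavy array I/O
        sqlite3 /mnt/user/appdata/app/app.db 'PRAGMA integrity_check;'

        # wait up to 60s for locks instead of failing fast
        sqlite3 /mnt/user/appdata/app/app.db 'PRAGMA busy_timeout = 60000;'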

    Link to comment

    I would say it's very possible, since the array becomes almost completely unresponsive; even listing a folder's contents can take many seconds, so it might well cause timeouts for other things.

    Edited by johnnie.black
    Link to comment
    28 minutes ago, jonathanm said:

    Could you please toggle this plugin and check status with all mitigations enabled and disabled?

    Can't right now since I'm at work, but I have a small server here, based on a Core2Duo E8400, which isn't affected, and the behavior is exactly the same.

    Link to comment
    12 minutes ago, johnnie.black said:

    I have a small server here, based on a Core2Duo E8400, which isn't affected, and the behavior is exactly the same.

    So toggling the mitigations doesn't change anything?

    Link to comment
    5 minutes ago, jonathanm said:

    So toggling the mitigations doesn't change anything?

    I had the idea that only Lynnfield and newer CPUs were affected by those bugs, but I was wrong, the E8400 is also affected. Still, I just tried the plugin and disabling the mitigations didn't make any difference.
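
    For reference, the kernel reports the vulnerability and mitigation state directly, so the plugin's effect can be double-checked; a quick sketch (recent kernels; /boot/syslinux/syslinux.cfg is the stock Unraid boot config):

        # per-vulnerability status: "Mitigation: ...", "Not affected" or "Vulnerable"
        grep . /sys/devices/system/cpu/vulnerabilities/*

        # mitigations can also be disabled globally by adding
        #   mitigations=off
        # to the append line in /boot/syslinux/syslinux.cfg and rebooting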

    Link to comment

    It was an idea anyway. My thought process was that even though the CPU may not be vulnerable, the mitigations would still be applied in the code regardless. Honestly, I don't know enough about low-level coding to figure it out for myself, so I just wanted to advance the theory.

     

    All these issues seemed to start popping up at roughly the same time frame, so it's tough to distinguish what may or may not be truly causal, or just coincidental.

    Link to comment

    It was a good idea and good to rule out. Still, I would be surprised if the mitigations caused such a big performance loss: 10 or 15% would be plausible, but in this issue performance goes from 70MB/s to below 0.5MB/s.

    Link to comment

    I have seen similar behavior lately.

     

    Beginning with any 6.7.* release, if I am doing heavy file transfers, disk writes, etc., any activity on another disk brings things to a crawl or, in the case of the operation in which I first noticed the issue, kills writes to the disk.

     

    I say this began with 6.7.x because I kept my server up to date with each new release, and all of the video capture activity I have done recently is in the 6.7.x time frame.

     

    For example, my wife has a lot of videos shot on MiniDV tape which she wanted captured to the unRAID server. Any time I started capturing, life was good if that was the ONLY activity on the server. If, while capturing, I attempted to stream a movie, browse files on the server through Windows Explorer, etc., the capture process would stop with the error "destination disk too slow" (an error from the video editing/capture system). Each tape is about 90 minutes, and capturing writes a 19-20GB file to the array.

     

    I have write caching disabled for shares so the video capture is going straight to the parity-protected array.

     

    Basically, if any heavy writing is going on in the array, such as recording a show through Plex DVR or copying large files to the array, browsing files in the array is VERY slow. It takes several seconds (in addition to any time required to spin up disks if necessary) to populate the folder/file list and open files. Without other "heavy" activity, browsing the array is fairly snappy.

     

    I do not recall seeing this behavior with prior versions. In fact, I am relatively certain this was not an issue previously, although I have not run any tests to prove it by rolling back to 6.6.7. This is purely based on recent observations, although I will run some tests later by rolling my backup server back to 6.6.7 to see if the same behavior occurs.

    Edited by Hoopster
    Link to comment
    35 minutes ago, Videodr0me said:

    You can try turning off direct i/o in the share settings.

    Direct I/O applies more to user shares; for these tests I used disk shares, but I did try it both on and off and it didn't make any difference.
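
    For what it's worth, the page cache can also be taken out of the equation at the device level, since dd supports O_DIRECT; a quick sketch (the test file path is a placeholder):

        # buffered write through the page cache, flushed at the end
        dd if=/dev/zero of=/mnt/disk2/testfile bs=1M count=1024 conv=fsync

        # the same write with O_DIRECT, bypassing the page cache
        dd if=/dev/zero of=/mnt/disk2/testfile bs=1M count=1024 oflag=direct

        rm /mnt/disk2/testfile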

    • Like 1
    Link to comment

    Talking about differing CPUs: I have been having issues on my Threadripper system. I'm now going to perform the same test as above, as I hadn't noticed it in exactly that form, but then I do have two disk controllers, which may change things a little. I hadn't realised it until just now, but on top of the normal Plex issues with the mover, the Apple TV Plex client dumped out of a movie yesterday while I was copying a large amount of data from my Unraid server to an iMac.

     

    The other thing I'm trying to understand the cause of is that since this version I've had two SSDs die (one enterprise), and the new enterprise drive, which is only a month old with only 8TB written (rated at 1TB per day for 5 years), already has reallocated sectors. I'm pretty sure I've changed the cable, which is about the only thing left I can think of doing. I don't suppose it has anything to do with this, but thought I'd throw it out there. It does report 11 unsafe shutdowns (which it definitely hasn't had); a bad cable is a possibility, or maybe all these I/O problems are starving the SSD into thinking it's had a disconnection? Just throwing it out there, as 2 dead SSDs and a third new one with issues is not normal.

    Link to comment

    Just to add that this problem is also easy to notice with an array disk-to-disk copy (though with a smaller performance impact), e.g. copying 3 files totaling about 12GB; time spent (a quick console sketch of the method follows the table):

     

     

    		Read/Modify/Write	Reconstruct Write
    v6.7.2		8m43s			6m17s
    v6.6.7		5m37s			4m26s
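
    The timings above are from a plain timed copy; switching between the two write modes is an array-wide tunable. A rough sketch of the method (assuming the stock Unraid mdcmd tool, normally driven via Settings > Disk Settings; the test paths are placeholders):

        # 0 = read/modify/write (default), 1 = reconstruct ("turbo") write
        mdcmd set md_write_method 1

        # time the same array disk-to-disk copy under each write mode
        time cp /mnt/disk1/test/*.mkv /mnt/disk2/test/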

     

     

    Edited by johnnie.black
    Link to comment

    I can add to this; it's a major regression for unRAID going from 6.7 onward.

     

    Before, I was reluctant to post about it because I had done too few tests to be 100% sure I hadn't changed some setting somewhere…

     

    But now I'm sure. Today I upgraded one more unRAID server from 6.6.x to 6.7.2 and I see the exact same behavior! So I have 2 machines here which haven't had a single change, except being upgraded to 6.7.x (meanwhile all on 6.7.2).

     

    In my book, it doesn't matter how you access the data: coming from the network or locally on the server, using different machines to connect to the server… when one write into the array is ongoing, any reads (even from cache SSDs/NVMe, even the ones coming from data or cache devices which aren't being written to) are super slow. Also, whenever a rebuild is happening, you'd better not want to read any file...

     

    Also, the amount of RAM doesn't change anything, nor do the controllers used, nor the CPU (with mitigations enabled or disabled). And while I can't back it with data, it seems that rebuilds are slower too.

     

    This can have severe consequences in scenarios where services are continuously writing data into the array (like video surveillance, for example).

     

    Hopefully we can find a fast fix for this, because going back to 6.6.x isn't a good option anymore.

     

    @limetech what can we do to help debug this?

    Edited by s.Oliver
    • Upvote 2
    Link to comment

    I am running Unraid 6.7.2.

    I have seen the same behavior on my box when streaming to one of my clients. Everything runs buttery smooth until I try to copy some new files to the array. If I write directly to the array while streaming, the stream freezes.

    I have mitigated the issue by caching my media share for the moment, so loading content does not interfere with streaming from the array (or from an unassigned device, for that matter).

    But that cannot be a viable solution. I normally do not stream while mover operations are running, so I cannot say anything about the impact of the mover. But streaming and writing from/to the array at the same time definitely worked in earlier Unraid versions, and currently it does not.

    I have just adjusted my scheduled times for the mover and parity checks to make sure they never run at the same time, just to be safe for the moment.

    Edited by Kevek79
    Link to comment

    I'm just going to downgrade until someone sorts something out, I think. There is a beta out with a newer kernel which could be worth a go, though. Happy to help out with testing, but it doesn't seem like Limetech are listening for some reason. They're usually pretty good, right?

    • Upvote 1
    Link to comment

    Add me to the list as well. On 6.6.7 all is well; on 6.7.x it all turns to custard, and there are a number of threads on this now. Copying a single file between disks or writing to the array via SMB should not slow disk access down to the point where Docker containers and VMs die and stop responding. This is not heavy I/O; it's a single file.

     

    I personally have not purchased unraid yet, and maybe I won't, looking at the current state and the apparent lack of interest from the devs around this, but I've given it a lot of time for something that really shouldn't require messing around this much. With FreeNAS I can hammer the array while running on a low-end CPU in a first-gen HP MicroServer and Dockers don't stop responding; unraid 6.7, with a way more powerful CPU, more RAM, and different controllers tried (SAS and SATA), complete with SAS and SATA disks and thus different cables etc., just doesn't perform. So do I buy into unraid but run 6.6.x and hope that whatever is busted in 6.7.x gets fixed? This is a paid product, not a freebie which you can give a little slack to. I don't even have this sort of issue with the now-dead FlexRAID (whose array works very similarly to unraid's, with all parity on a single disk).

    Link to comment
    22 minutes ago, bytchslappa said:

    lack of interest from the devs around this

    Trust me, @limetech is very interested. It's a pretty small team, and lack of regular updates on every forum thread does not equal neglect of the product. They are very actively working on isolating the issue so they can fix it.

    • Upvote 1
    Link to comment

    I can confirm this bug, but with a different conclusion.

     

    1. cp from cache to disk2 (using the console) reaches about 200MB/s; a read from disk3 (via SMB) drops to 5MB/s
    2. Once the disk2 write is done, the read from disk3 immediately goes back up to 197MB/s
    3. cp from cache to an unassigned device (using the console) reaches 500MB/s; a read from disk3 (via SMB) stays high, around 172MB/s
    4. To remove SMB as a variable, I repeated the test using the console only (2 simultaneous connections), with similar results
    5. To remove the console as a variable, I repeated the test using SMB only; I see write speed about 2x-3x the read speed, but the frequent fluctuation makes it hard to judge. However, it's clear the read speed is in the double digits (i.e. faster than case (1) above)
    6. To remove writes as a variable, I tested reads (via SMB) from 3 disks, 2 disks and 1 disk and got 96-95-97, 141-143 and 210 MB/s respectively
    7. To remove reads as a variable, I tested writes (via SMB) to 3 disks, 2 disks and 1 disk and got similarly even splits

     

    No parity. All mitigations disabled via Squid's plugin.

     

    So it sounds to me like it's not necessarily an issue with concurrent performance per se, but rather that there's a speed limit on array I/O, combined with incorrect prioritisation of writes vs reads.

    • For reads/writes to a single disk, throughput is limited by the maximum speed of the device, usually an HDD, which is usually lower than this overall speed limit.
    • When reading/writing to multiple disks, the total speed of the devices exceeds the speed limit, so the overall limit becomes apparent.
      • If only reading or only writing, the limit is divided evenly across the disks.
      • If reading + writing, there appears to be significantly higher priority (and/or resources) given to writes, crippling read speed (a console sketch of this test follows below).
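
    A rough console-only version of the read-vs-write test (paths and sizes are placeholders; iostat comes from the sysstat package):

        # sustained write to disk2 in the background
        dd if=/dev/zero of=/mnt/disk2/writetest bs=1M count=20000 &

        # concurrent read from disk3; on 6.7.x this collapses while the
        # write is in flight and recovers the moment the write finishes
        dd if=/mnt/disk3/bigfile of=/dev/null bs=1M status=progress

        # optional, in another terminal: per-device throughput view
        iostat -mx 2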

     

     

    [Screenshot: write to disk2 + read from disk3]

    [Screenshot: no write + read from disk3]

    [Screenshot: write to unassigned device + read from disk3]

    Edited by testdasi
    have an epiphany
    • Like 1
    • Upvote 1
    Link to comment

    Hey. I've been having similar issues from day dot of adding a cache, but I only recently purchased Unraid and started with 6.7. My system becomes unusable while the mover is running: any transfers slow to nearly nothing, and in some FTP cases they time out completely and fail. I run Shinobi as a Docker container with a CCTV system doing 24h recordings, and I'm getting recording black spots during the mover's scheduled time.

    I troubleshot this a bit myself by moving the cache disks off the motherboard SATA ports and onto the SAS/SATA ports of my RAID card with the rest of the array disks, thinking routing it via the motherboard might be costing unnecessary I/O/CPU, and that keeping it on the PCIe slot/RAID card would help. It did improve, in that some sort of recording now happens instead of nothing or 0-byte files being written, but the normally 15-minute blocks of recordings are still being interrupted and split into various chunks of that 15-minute block, with minutes still missing. It also kills any GUI communication while the mover is running.

    I can't see much in the way of mover settings for troubleshooting this, but a speed limit or something might be good? I'm a bit of a layman with this stuff, so I don't understand what's suggested in the above posts. Just thought I'd post my issues on this subject too.

     

    Edit: Maybe an important note is that I run a two-parity-disk array with two SSDs in a cache pool (so maybe a higher-than-standard CPU requirement for the mover).
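
    One possible stopgap for scheduled jobs like the mover is to demote them to the idle I/O class so foreground reads win; whether that helps here is unclear, since this regression doesn't look scheduler-bound, but it's a cheap experiment (a sketch, assuming the stock mover script at /usr/local/sbin/mover):

        # run the mover with idle I/O priority and minimum CPU priority
        ionice -c 3 nice -n 19 /usr/local/sbin/mover

        # or demote an already-running mover by PID
        ionice -c 3 -p "$(pgrep -o -f mover)"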

     

    [Screenshot: CPU usage while the mover runs]

     

    Edited by BiGs
    Link to comment

    I had the exact same issue on 6.7.x with multiple local file transfers.

    Downgraded to version 6.6.7 and it works fine.

    Link to comment

    I suddenly had the realisation that this bug is probably what's been causing me so many headaches with my CrashPlan backup. I mean, I nearly cancelled the service because it was so slow and it kept crashing. So I'd had enough and downgraded. Yes, CrashPlan (Docker-based) is now suddenly faster and so far working much better. Other things I noticed: the server booted a lot quicker and didn't pause before the login screen, Plex is more responsive, the disks seem 'quieter' (before, there were random reads and writes happening which I couldn't track down, but they now seem to have disappeared), the Unraid GUI is much faster, and I'd even say my SSD is running cooler. (Call me paranoid, but I've had two SSDs unexpectedly die and this brand new one already has unrecoverable sectors after only a month.) Perhaps some of this is in my mind, but the primary function of a NAS is, well, to serve files to multiple people concurrently in a performant way. Right now that doesn't happen on 6.7. I'd bet many people have this bug and haven't realised it yet.

    Edited by Marshalleq
    Clarity.
    • Like 1
    Link to comment

    I'm now in the process of moving data to unraid from old HDDs to do some testing, and I noticed the same issue: when there is write activity going on, the read speed is extremely slow (3-4MB/s, sometimes KB/s), no matter which disk the data is read from; the result is always the same.

    When there is no write activity, the read speed returns to normal. I did 3 concurrent reads from 3 disk shares, and each read can reach 150-200MB/s.

    Edited by trott
    Link to comment

    Well, I couldn't stand it anymore, so back to 6.6.7, and everything is back to the normal, expected behavior.

    I do miss stuff from 6.7, though, so I hope they can identify and fix the problem really soon.

    Link to comment




