[6.7.x] Very slow array concurrent performance


    JorgeB
    Status: Solved, Priority: Urgent

    For as long as I can remember, Unraid has never been great at simultaneous array disk performance, but it used to be acceptable. Since v6.7, various users have complained of very poor performance, for example when running the mover while trying to stream a movie.

    I noticed this myself yesterday, when I couldn't even start watching an SD video in Kodi just because writes were going on to a different array disk, and that server doesn't even have a parity drive. So I did a quick test on my test server: the problem is easily reproducible, and it started with the first v6.7 release candidate, rc1.

    How to reproduce:

    - The server just needs 2 assigned array data devices (no parity needed, but the same happens with parity) and one cache device; no encryption, all devices btrfs formatted.
    - Used cp to copy a few video files from cache to disk2.
    - While cp was running, tried to stream a movie from disk1: it took a long time to start and kept stalling/buffering.

    I then tried to copy one file from disk1 (while the cp to disk2 was still running). With v6.6.7:

    [Screenshot, 2019-08-05: copying one file from disk1 on v6.6.7]

    With v6.7rc1:

    [Screenshot, 2019-08-05: the same copy on v6.7rc1]

    A few times the transfer speed goes up for a couple of seconds, but most of the time it sits at a few KB/s or is completely stalled.

    I also tried with all devices unencrypted and XFS formatted, and the result was the same:

    [Screenshot, 2019-08-05: the same copy with XFS-formatted devices]

    The server where the problem was detected and the test server have no hardware in common: the first is based on an X11 Supermicro board and uses HDDs, while the test server is X9 series and uses SSDs. So it's very unlikely to be hardware related.
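
The test above can be scripted so that the before/after comparison is a number rather than a judgment about buffering. The sketch below is a hypothetical helper, not part of Unraid: one thread keeps writing to a directory (standing in for the cp to disk2) while the main thread times a sequential read of a file elsewhere (standing in for the movie on disk1). The function names, the 1 MiB chunk size and the 256 MiB default write volume are assumptions. A file that was read recently may be served from the page cache, so on a real server point it at a file that has not been read since boot.

```python
import os
import threading
import time

CHUNK = 1024 * 1024  # 1 MiB per read/write call (assumed size)


def background_writer(dst_dir, total_mb, stop):
    """Keep writing to dst_dir until `stop` is set, simulating a fast cache -> array copy.

    Uses plain buffered writes, like cp; the kernel flushes them to the device
    in the background. The scratch file is rewritten in place once it reaches
    total_mb MiB, so disk usage stays bounded, and it is deleted on exit.
    """
    path = os.path.join(dst_dir, "writer.bin")  # hypothetical scratch file name
    buf = os.urandom(CHUNK)
    try:
        with open(path, "wb") as f:
            written = 0
            while not stop.is_set():
                if written >= total_mb:
                    f.seek(0)  # wrap around instead of growing the file
                    written = 0
                f.write(buf)
                written += 1
    finally:
        if os.path.exists(path):
            os.remove(path)


def read_throughput(src_path):
    """Sequentially read src_path and return the achieved rate in MiB/s."""
    start = time.perf_counter()
    total = 0
    with open(src_path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero on tiny files
    return total / CHUNK / elapsed


def concurrent_read_test(src_path, dst_dir, write_mb=256):
    """Measure the read rate of src_path while another thread keeps writing to dst_dir."""
    stop = threading.Event()
    writer = threading.Thread(target=background_writer, args=(dst_dir, write_mb, stop))
    writer.start()
    try:
        return read_throughput(src_path)
    finally:
        stop.set()
        writer.join()
```

For example, concurrent_read_test("/mnt/disk1/Movies/example.mkv", "/mnt/disk2") (the movie path is a placeholder) returns the read rate from disk1 in MiB/s while disk2 is being written; running it on v6.6.7 and again on v6.7 gives a direct comparison.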




    User Feedback

    Recommended Comments



    As this is apparently fixed in 6.8, would @limetech give a breakdown of what caused the issue and what the fix was, with a bit of technical detail? I'm sure many in this thread would be very interested.

    Edited by dgreig
    4 hours ago, limetech said:

    This is fixed in upcoming 6.8 release.

    @sittingmongoose see this post by limetech. It's fixed in 6.8. I would however be dubious about a fix being backported at this stage. Hopefully 6.8 will be out soon™️

    Edited by dgreig

    If it's fixed in 6.8, they should be able to push out a hotfix for this extremely serious issue. We are all bleeding from it, and I believe it's causing read errors and other things which are really challenging. Anyway, I can only assume that the reason they haven't is that it has proved difficult to backport. What's missing, though, is ANY information, which is absolutely APPALLING. Why they're not providing any, we can only guess. My guess is they know it's bad and don't want to admit it, because of the fallout of having to tell everyone that downgrading is the only option, or upgrading to a release which isn't yet available. This is really a big black mark on what for me has otherwise been an exceptional experience with Unraid. I expect things to break from time to time or not run properly; what I don't expect is almost total silence and lack of communication on critical issues.

    Edited by Marshalleq

    It's not a "show stopper" issue. Backporting would be silly...wasted dev time IMHO. The fact that they acknowledged and fixed in next major rev makes me happy. I don't have anything configured for 6.7.x that prevented a downgrade and I doubt most do.

    4 hours ago, sirkuz said:

    It's not a "show stopper" issue. Backporting would be silly...wasted dev time IMHO. The fact that they acknowledged and fixed in next major rev makes me happy. I don't have anything configured for 6.7.x that prevented a downgrade and I doubt most do.

    It is a showstopper. If this issue isn't fixed, then anyone who uses their Unraid server primarily for media streaming throughout a household will have to find another solution. The Unraid show is effectively stopped for them.

    By your argument that downgrading is a viable solution, no issue would ever be considered a showstopper.

    Limetech was happy to push out rc4 when they thought that would resolve the problem. It didn't. So if a resolution has been found, why hasn't an rc5 with that fix been released? Why has the approach to resolving this issue changed?

    Edited by dalben

    Yeah, and the one thing that stops me from downgrading is actually Unraid's killer feature: GPU passthrough for gaming. Tried that in 6.6.x? It's really quite bad compared to 6.7. My intent here is not to be negative, because a lot of hard and great work has gone in that we all appreciate and benefit from. My hope is that someone at Unraid will read this and say, 'Yeah, we could have communicated better, and we'll try to do that next time.' We're not asking for rocket science here, just a heads-up that the issue has been a bi**h to solve and they don't know when they're going to get there yet. Or 'sorry guys, you're going to have to wait till 6.8 because xyz means it's not viable to patch it.' Or just 'we can't figure it out yet, wait.' It would take about as long as it took me to write this one paragraph. Probably less. Done, and then everyone's happy.

    Edited by Marshalleq

    It's Saturday; it's possible @limetech is a person.

    They haven't failed to update us when they can.

    NO one works 24/7.

    Give them time to verify and test.

    6.8 has a lot of changes. What do you think will happen if they rush it and break something else?

    Be reasonable with your expectations.

    13 minutes ago, Dazog said:

    It's Saturday; it's possible @limetech is a person.

    They haven't failed to update us when they can.

    NO one works 24/7.

    Give them time to verify and test.

    6.8 has a lot of changes. What do you think will happen if they rush it and break something else?

    Be reasonable with your expectations.

    My only gripe here is that limetech released an RC (rc4) with an expected fix. Now they have stopped that and are telling us to wait for the next version. Why the change?

    We all know limetech's release timelines are as consistent as lotto numbers, so we're in a situation where this fix could be 1 day or 1 month away.

    I'm at breaking point where I have to do something now, as the family has had enough of the stalled streaming. If I have to downgrade, I will, but there's a good chance I'll have to find another solution going forward.

    Unraid was great when I was the only real user of the server. Now that it anchors the family's entertainment, I need something more reliable.
    25 minutes ago, Dazog said:

    It's Saturday; it's possible @limetech is a person.

    They haven't failed to update us when they can.

    NO one works 24/7.

    Give them time to verify and test.

    6.8 has a lot of changes. What do you think will happen if they rush it and break something else?

    Be reasonable with your expectations.

    LOL, just LOL. I suggest you look at the number of complaints elsewhere in this forum before this thread, then at the date this was logged, then count the number of responses from limetech in this thread and note those dates. Then you can comment. And I'll add an additional LOL (lots of love) for being so positive. :)
    1 hour ago, dalben said:

    I'm at breaking point where I have to do something now

    Change your mover schedule to outside viewing hours. This prevents writes to the array while reading (streaming) is going on.
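
The same idea of keeping bulk array writes out of viewing hours can be applied to copies started by your own scripts. A minimal sketch, assuming a hypothetical household viewing window of 18:00 to 01:00; the function names and the window are illustrative only, and the one subtlety is a window that wraps past midnight.

```python
from datetime import datetime, time


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside [start, end), handling windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 18:00 -> 01:00.
    return now >= start or now < end


# Hypothetical viewing hours; adjust to the household.
VIEWING_START = time(18, 0)
VIEWING_END = time(1, 0)


def ok_to_write_to_array(now: datetime) -> bool:
    """Allow bulk array writes (large copies, moves) only outside viewing hours."""
    return not in_window(now.time(), VIEWING_START, VIEWING_END)
```

The mover's own schedule is set in Unraid's settings, so a gate like this only matters for scripts you run yourself; a script would check ok_to_write_to_array(datetime.now()) before starting a large copy.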

    The issue is that writes originating from a fast source (as opposed to a slow source such as 1Gbit network) completely consume all available internal "stripe buffers" used by md/unraid to implement array writes.  When reads come in they get starved for resources to efficiently execute.  The method used to limit the number of I/O's directed to a specific md device in the Linux kernel no longer works in newer kernels, hence I've had to implement a different throttling mechanism.

    Changes to the md/unraid driver require exhaustive testing. All functional tests pass; however, driver bugs are notorious for showing up only under specific workloads, due to different timing. As I write this I'm 95% confident there are no new bugs. Since this will be a stable 'patch' release, I need to finish my testing.
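
The starvation described above can be illustrated with a toy model. This is not the md/unraid driver code and does not reproduce its new throttling mechanism, which the post does not describe; it only assumes a fixed pool of stripe buffers shared by reads and writes. Without a cap, a fast writer takes every buffer and a read cannot get one; capping how many buffers writes may hold at once (the write_cap parameter, a hypothetical name) always leaves some for reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StripePool:
    """Toy model: a fixed number of stripe buffers shared by reads and writes."""
    total: int
    write_cap: Optional[int] = None  # max buffers writes may hold at once; None = no throttle
    writes_in_flight: int = 0
    reads_in_flight: int = 0

    def free(self) -> int:
        return self.total - self.writes_in_flight - self.reads_in_flight

    def acquire(self, kind: str) -> bool:
        """Try to take one buffer for a 'read' or 'write'; False means the caller must wait."""
        if kind not in ("read", "write"):
            raise ValueError(f"unknown I/O kind: {kind!r}")
        if self.free() == 0:
            return False
        if kind == "write":
            if self.write_cap is not None and self.writes_in_flight >= self.write_cap:
                return False  # throttled: the writer waits for its own I/O to complete
            self.writes_in_flight += 1
        else:
            self.reads_in_flight += 1
        return True

    def release(self, kind: str) -> None:
        """Return one buffer once its I/O completes."""
        if kind == "write" and self.writes_in_flight > 0:
            self.writes_in_flight -= 1
        elif kind == "read" and self.reads_in_flight > 0:
            self.reads_in_flight -= 1
        else:
            raise ValueError(f"no {kind!r} buffer is in flight")


def flood_with_writes(pool: StripePool, attempts: int) -> int:
    """A fast source (e.g. a cache -> array copy) submits writes as fast as it can."""
    return sum(pool.acquire("write") for _ in range(attempts))


unthrottled = StripePool(total=8)
flood_with_writes(unthrottled, 100)  # the writer takes all 8 buffers
print(unthrottled.acquire("read"))   # prints False: the read is starved

throttled = StripePool(total=8, write_cap=6)
flood_with_writes(throttled, 100)    # the writer is held to 6 buffers
print(throttled.acquire("read"))     # prints True: a buffer is left for the read
```

A fixed write cap is just the simplest policy to show the effect; reserving a few buffers for reads, or limiting in-flight writes per destination disk, would demonstrate the same point.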
    1 hour ago, bonienl said:

    Change your mover schedule to outside viewing hours. This prevents writes to the array while reading (streaming) is going on.

     

    I don't use the mover. The stalls come when one of the media download dockers pulls down a file into the cache and then moves it into the library, which is on the array.

    Or if I'm just working on other stuff and accidentally start a file copy or move while media is streaming.

    39 minutes ago, limetech said:

    Changes to the md/unraid driver require exhaustive testing. All functional tests pass; however, driver bugs are notorious for showing up only under specific workloads, due to different timing. As I write this I'm 95% confident there are no new bugs. Since this will be a stable 'patch' release, I need to finish my testing.

    Fantastic update, thank you very much for that. I think we'd all agree that taking the time to do some good testing is preferable. Please let me know if you'd like any help with that (29 years in IT software/hardware here), and I'm sure others on here would offer the same.
    43 minutes ago, limetech said:

    The issue is that writes originating from a fast source (as opposed to a slow source such as 1Gbit network) completely consume all available internal "stripe buffers" used by md/unraid to implement array writes.  When reads come in they get starved for resources to efficiently execute.  The method used to limit the number of I/O's directed to a specific md device in the Linux kernel no longer works in newer kernels, hence I've had to implement a different throttling mechanism.

    Thanks for the update.

    Based on what you've found, is there a possibility this issue could trigger disk read errors? I've had a few recently, and while they could be coincidence or batch timing related, I'm wondering if there is a connection.
    4 minutes ago, dalben said:

    I dont use Mover. The stalls come when one of the media download dockers pulls down a file, into cache, then moves into the library, which is on the array.  

    I'm pretty sure that it's actually the mover that does that. I'm not aware of anything else doing it...
    1 minute ago, dalben said:

    Thanks for the update. 

     

    Based on what you've found, is there a possibility this issue could trigger disk read errors. Ive had a few recently and while they could be coincidence / batch timing related, I'm wondering if there is a connection. 

    I’d like to understand that too. I think you’ve seen my comments on it earlier. 

    18 minutes ago, dalben said:

    I dont use Mover. The stalls come when one of the media download dockers pulls down a file, into cache, then moves into the library, which is on the array. 

     

    Or if I am just working on other stuff and accidentally start a file copy or move while media is streaming. 

    Stop stealing movies then until 6.8 is out.

    2 hours ago, saarg said:

    Stop stealing movies then until 6.8 is out.

    Well done. You've been nominated for the most useful reply award. You must be proud. 

    5 hours ago, Marshalleq said:

    I'm pretty sure that it's actually the mover that does that. I'm not aware of anything else doing it...

    You can configure things either way. If the final destination is set to be only on the array, bypassing the cache, then when the download to the cache is done, the media manager immediately moves the file from the cache to the array without the mover being involved. Or you can have the final destination use the cache drive, in which case, after the download is done, the file is renamed into the final destination share and sits on the cache drive waiting for the mover to put it on the array.

    Normally both methods work well, but with a small cache it's better to write to the array immediately rather than risk filling up the download temp space, and having the mover scheduled every couple of hours is not a good solution.
    5 hours ago, jonathanm said:

    You can configure things either way. If the final destination is set to be only on the array bypassing the cache, then when the download to the cache is done, the media manager immediately moves the file from cache to array without the mover being activated. Or, you can have the final destination use the cache drive, in which case after the download is done the file is renamed to the final destination share and sits on the cache drive waiting for the mover to put it on the array.

     

    Normally both methods work well, but with a small cache it's better to write to the array immediately rather than risk filling up the download temp space, and having mover scheduled every couple hours is not a good solution.

    Thanks, that's interesting. (Just reading the cache options now.) So the options 'yes' and 'prefer' explicitly state they use the mover, so we're not talking about those. The option 'only' obviously keeps files on the cache, so you must be talking about the option 'no', which specifically states it prohibits files from being put on the cache pool. I can see why I'd be confused!





    Status Definitions

    Open = Under consideration.
    Solved = The issue has been resolved.
    Solved version = The issue has been resolved in the indicated release version.
    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.
    Retest = Please retest in latest release.

    Priority Definitions

    Minor = Something not working correctly.
    Urgent = Server crash, data loss, or other showstopper.
    Annoyance = Doesn't affect functionality but should be fixed.
    Other = Announcement or other non-issue.