• [6.7.x] Very slow array concurrent performance


    JorgeB
    • Solved Urgent

    For as long as I can remember, Unraid has never been great at simultaneous array disk performance, but it was pretty acceptable. Since v6.7, various users have complained of very poor performance, for example when running the mover while trying to stream a movie.

     

    I noticed this myself yesterday when I couldn't even start watching an SD video using Kodi just because there were writes going on to a different array disk, and this server doesn't even have a parity drive. So I did a quick test on my test server: the problem is easily reproducible, and it started with the first v6.7 release candidate, rc1.

     

    How to reproduce:

     

    - The server just needs 2 assigned array data devices (no parity needed, but the same happens with parity) and one cache device; no encryption, all devices btrfs formatted

    - Used cp to copy a few video files from cache to disk2

    - While cp was running, tried to stream a movie from disk1; it took a long time to start and kept stalling/buffering
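
    The steps above can be approximated without a media player. This is only a minimal sketch, not the test that was actually run for this report: it copies one file in a background thread and times a sequential read of another file while the copy is in flight. The function names and the example paths in the comment are hypothetical.

```python
import shutil
import threading
import time


def timed_read(path, chunk_size=1024 * 1024):
    """Read a file to the end and return (bytes_read, seconds_elapsed)."""
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start


def read_during_copy(copy_src, copy_dst, read_path):
    """Copy copy_src to copy_dst in a background thread and time a read
    of read_path while that copy is running."""
    writer = threading.Thread(target=shutil.copyfile, args=(copy_src, copy_dst))
    writer.start()
    try:
        return timed_read(read_path)
    finally:
        writer.join()


# Hypothetical usage (paths are assumptions, adjust to your own shares):
# n, secs = read_during_copy("/mnt/cache/a.mkv", "/mnt/disk2/a.mkv",
#                            "/mnt/disk1/movie.mkv")
# print(f"{n / secs / 1e6:.1f} MB/s read from disk1 during the copy")
```

    Note that a file read recently may be served from RAM rather than from the disk, so the read file should be one that has not been touched lately, and both files need to be large enough for the copy to still be running while the read is measured.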

     

    I then tried to copy one file from disk1 (while cp was still running on disk2). With v6.6.7:

     

    [Image: Screenshot 2019-08-05 11:58:06]

     

    with v6.7rc1:

     

    [Image: Screenshot 2019-08-05 11:54:15]

     

    Occasionally the transfer speed rises for a couple of seconds, but most of the time it sits at a few KB/s or stalls completely.

     

    Also tried with all unencrypted xfs formatted devices and it was the same:

     

    [Image: Screenshot 2019-08-05 12:21:37]

     

    The server where the problem was detected and the test server have no hardware in common: one is based on an X11 Supermicro board and uses HDDs, while the test server is X9 series and uses SSDs, so this is very unlikely to be hardware related.




    User Feedback

    Recommended Comments



    Right, so initial testing with the release candidate 6.7.3-rc2.  To test, I played a Plex video and preloaded many gigabytes of files to a cached share.  I then invoked the mover manually and added an additional copy of another large set of files from a disk to the cache, and with all this going on simultaneously I don't seem to have any issues. The wa (I/O wait) value in top only gets up to 8.0 instead of 20.0 under the previous kernel.  (I had to edit that, as I think I originally wrote 0.20, which was incorrect; it was 20.)  Also my write speed (from HDD to SSD) is about normal at 53MB/s - yes it's slow, it's always been slow, even with Seagate enterprise capacity disks - seems to be an overhead of the unraid parity.
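
    For anyone who wants to log the same wa figure rather than watch top, here is a rough sketch. It assumes a Linux host where the aggregate `cpu` line of /proc/stat lists user, nice, system, idle, iowait, irq, softirq and steal in that order; the function names are made up for illustration.

```python
import time


def parse_cpu_line(stat_text):
    """Return the first eight counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two counter samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter (user, nice, system, idle, iowait, ...).
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Take two samples of /proc/stat 'interval' seconds apart."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

    Calling `sample_iowait()` in a loop during a mover run should give numbers comparable to the wa column, although top may round or average slightly differently.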

     

    This is my first and only test so far (I'll try again tomorrow when someone is watching Plex on the Apple TV in the lounge, where the issue was visible today).  I'd be interested if anyone else can test though, by upgrading to 'next'.  There are very few changes in it, so it should be quite safe.

     

    If the problem goes away for you I'd say very lucky we are.  Otherwise we shall need to investigate further.  Fingers crossed!

    Edited by Marshalleq
    Link to comment
    In my testing, the problem should only happen when one write session runs simultaneously with another read/write session on the disk array; it shouldn't happen on the cache pool or UD.

     

    If the problem isn't triggered, disk array read/write speed should be expected to range from 190MB/s down to 90MB/s for a spinning disk in the array, with or without parity.

    Edited by Benson
    Link to comment
    8 hours ago, Marshalleq said:

    53MB/s - yes it's slow, it's always been slow, even with Seagate enterprise capacity disks - seems to be an overhead of the unraid parity.

    Another issue should be causing this.

    Link to comment

    Yes, a simultaneous write session during multiple read sessions to spinning disks is what I did.  There were multiple Plex sessions ongoing while doing a large multi-terabyte copy from SSD cache to the array.  But I could be a little harder on it and try again with even more writes and reads.

     

    Regarding the speed, my reading on this forum indicated that 53MB/s was fairly normal for writing with parity. If it's not, the only thing I can think of is a faulty cable, but I have run speed tests on all my drives and they perform at their rated speed individually, so I don't think it's that.  I'm doing another speed test now to make sure nothing has gone wrong.  I'd be interested in knowing what your configuration is. My drives are mainly on a Dell PERC310 in IT mode, which seems to have more than enough bandwidth for the job, but perhaps it's that.

     

    Edit: Quick calculation:

     

    The Dell Perc H310 supports 8 drives and runs on the PCI Express 2.0 bus.  PCI Express 2.0 supports 500MB/s.  So dividing by 8 means each drive would get a maximum of 62.5MB/s.  This could be the reason why I guess.  Individual drive speed tests wouldn't be restricted by the bus speed, so that would be why I hadn't seen the issue.  I also assume read would not be impacted as I don't think read needs to calculate across all drives.  Perhaps I should look into reconstruct write mode again.

    Edited by Marshalleq
    Update
    Link to comment
    51 minutes ago, Marshalleq said:

    I'd be interested in knowing what your configuration is. my drives are mainly on a Dell PERC310 in IT mode

    Nothing special: 16 disks, most of them shucked WD disks, a mix of 5400 and 7200 rpm. All connect through a SAS backplane to an LSI 9207-8i in IT mode on PCIe 3.0 x8, but I can confirm there is no difference when connected to a 9211 in IT mode (PCIe 2.0). The PCIe lanes come directly from the CPU. Changing to a different platform also gives the same speed, and there is not much difference with or without parity.

     

    Yes, must be reconstruct write mode.

    Edited by Benson
    Link to comment
    35 minutes ago, Marshalleq said:

    PCI Express 2.0 supports 500MB/s.

    That is the speed of a single lane (x1), so x8 will be 4GB/s, which gives each of 8 disks ~500MB/s of usable bandwidth.
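
    To make the two calculations easy to compare, here is the arithmetic as a small sketch. It only uses the figures quoted in this thread (500MB/s per PCIe 2.0 lane, 8 disks); the function name is hypothetical, and the even split across disks is a simplification that ignores protocol overhead.

```python
PCIE2_MB_PER_LANE = 500  # per-lane figure quoted in this thread


def per_disk_bandwidth_mb(lanes, disks, mb_per_lane=PCIE2_MB_PER_LANE):
    """Evenly split the total link bandwidth across all attached disks."""
    return lanes * mb_per_lane / disks


# Treating the whole card as a single lane, as in the earlier estimate:
single_lane = per_disk_bandwidth_mb(lanes=1, disks=8)

# Counting all eight lanes of an x8 card:
eight_lanes = per_disk_bandwidth_mb(lanes=8, disks=8)

print(single_lane, eight_lanes)
```

    With eight lanes the per-disk share is well above what a spinning disk can deliver, so the bus is unlikely to explain a 53MB/s write speed.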

    Edited by Benson
    Link to comment

    Of course!  So not that then.  The speed test came out OK.  Also @johnnie.black I'd suggest that an impact to performance that brings a system to its knees in the main area it is designed for should not be categorised as minor.  Perhaps we should increase the ticket rating, which may also get more visibility?

     

    [Image: Screen Shot 2019-08-19 at 10:47:33]

    Link to comment

    I agree, it is not minor.  I just moved from Ubuntu Server to Unraid and am still on the trial. I have finished moving my data to the array, and now when qBittorrent is downloading something I cannot watch a movie; it is always buffering.

    Link to comment

    Glad I found this as I thought I was going crazy.  I have the same symptoms where a video stream will hang/freeze if there is a background write happening on the array.  At first I thought it was a lag spinning up a disk for the write, so I set up one spin-up group that spins up all disks when even one is up.

     

    That didn't solve the problem.  I was about to spend the next weekend with fingers under the hood moving disks from the onboard sata to the LSI2008 SAS-2 card to see if I could find a sweet spot where the errors disappeared.

     

    If there have been some changes to address this in the latest Beta/RC, I'll happily try it out and see if it works.  I see it's flagged as minor.  Technically it is, but if you add WAF to the equation then it's a showstopper.

    Edited by dalben
    Link to comment

    Minor is the default when a bug report gets created. I can change it to urgent, but I'm sure LT has seen the bug report and is working on it. Minor or urgent, it's not going to make any difference to how long it takes to fix; I expect a new release as soon as there's a fix.

    Link to comment

    Sorry - our posts crossed.  @johnnie.black I love your confidence - have you had any confirmation that they've seen it?  They're usually pretty good at saying 'Hey we've seen it' I thought, but this one is stunningly quiet.  If I had the workload they did, I'd definitely be using the flags of minor / major to filter through everything.  That's just my 2c though - (born from 28 years in IT though!).. :D

    Link to comment
    24 minutes ago, Marshalleq said:

    WAF?  I can't see how anyone could ever see a bug that kills services on a server as minor though.

    WAF - The most important variable when building a home server used predominantly for media streaming

     

    Wife Acceptance Factor

    Link to comment
    27 minutes ago, Marshalleq said:

    They're usually pretty good at saying 'Hey we've seen it'

    Not in my experience, I would say the opposite is true, but since the bug reports board has so few posts it's hard to miss one, especially when there are multiple replies.

    Link to comment

    Well that's true enough.  I'm fairly well an Unraid Noob - but I've not seen anyone ignore posts while in the process of fixing them before.  Anyway, I am powerless to do anything.

    Link to comment
    19 hours ago, Marshalleq said:

    Let me know if the beta helps - it'd be great to disprove that theory....

    Installed the RC this morning.  Should be able to give a report tonight as to whether it helps..

    Link to comment

    I had one freeze last night where all of plex went offline for about 1 minute, but it was over wifi in dubious circumstances, so not exactly sure.  Definitely keen to hear your experience.

    Link to comment
    4 hours ago, dalben said:

    Installed the RC this morning.  Should be able to give a report tonight as to whether it helps..

    I did not think the RC was even trying to address this problem?   Instead it is focused on getting to the bottom of why some users are experiencing SQLite DB corruption.   Having said that I guess the two issues could be related in some way

    Link to comment
    25 minutes ago, itimpi said:

    I did not think the RC was even trying to address this problem?   Instead it is focused on getting to the bottom of why some users are experiencing SQLite DB corruption.   Having said that I guess the two issues could be related in some way

    I have been reading both threads for a while now, even though I have not experienced the SQLite bug yet.

    What seems to be common to both issues is that using a cache drive mitigates them in some way.

    The SQLite bug seems to affect only users that do not have their appdata on the cache drive.

    On the other hand, caching my media share helped a lot with transferring new content to the server while it is being streamed from. (And make sure that the mover only runs at times with no server usage.)

     

    I think it is highly likely that those two issues are connected and that a solution for one of them could maybe solve both.

     

    @Marshalleq Please keep us updated on what you find out. 

     

     

     

     

    Link to comment

    OK, ran some tests.

     

    Copied a 1GB file from cache to the array with no video stalling.

    Copied a 4.3GB file, and when it got to about 2.8GB the video stalled.  Then everything was slow for about 35-40 seconds before it all came good again.

     

    So the latest RC doesn't help with this problem.

    Edited by dalben
    Link to comment

    After pulling my hair out for the last week looking for what I originally assumed was probably a network issue, I found this thread which describes the issue I'm having exactly. 

     

    My system is a dual Xeon 2650 setup with 96GB of RAM, dual LSI2008SAS2 cards, and two cache drives in RAID1 connected to the onboard SATA controller (Intel C600/X79 chipset).  Mover is currently configured to run hourly, as my cache drives are relatively small at 120GB for the number of users within my household (8).  I was already planning to jump to a 1TB NVMe drive, but I guess I may need to seriously consider downgrading, as my wife's identical twin lives with us, which means WAFx2 is a major issue! 😱

     

    Is there anything major to look out for when downgrading?

    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.