• [6.7.x] Very slow array concurrent performance


    JorgeB
    • Solved Urgent

    Since I can remember, Unraid has never been great at simultaneous array disk performance, but it used to be acceptable. Since v6.7 various users have been complaining of very poor performance, for example when running the mover and trying to stream a movie at the same time.

     

    I noticed this myself yesterday when I couldn't even start watching an SD video in Kodi just because there were writes going on to a different array disk, and this server doesn't even have a parity drive. So I did a quick test on my test server: the problem is easily reproducible and started with the first v6.7 release candidate, rc1.

     

    How to reproduce:

     

    -Server just needs 2 assigned array data devices (no parity needed, but the same happens with parity) and one cache device; no encryption, all devices btrfs-formatted

     

    -Used cp to copy a few video files from cache to disk2

     

    -While cp was running, tried to stream a movie from disk1: it took a long time to start and kept stalling/buffering (a console sketch of the reproduction follows below)
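     

    For reference, here's roughly how the reproduction looks from the console (the mount points are the standard Unraid ones; the video paths and file names are just placeholders):

     

    # terminal 1: sustained writes from cache to disk2
    cp /mnt/cache/videos/*.mkv /mnt/disk2/videos/

    # terminal 2: concurrent read from disk1, standing in for the Kodi stream
    dd if=/mnt/disk1/videos/movie.mkv of=/dev/null bs=1M status=progress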

     

    Tried to copy one file from disk1 (still while cp was going on on disk2), with v6.6.7:

     

    [screenshot]

     

    with v6.7rc1:

     

    [screenshot]

     

    A few times the transfer will go higher for a couple of seconds, but most of the time it sits at a few KB/s or is completely stalled.
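     

    To put some numbers on the stall beyond the copy dialog, per-disk throughput can be watched while the test runs; iostat comes from the sysstat package (which may need installing separately on Unraid) and the device names below are placeholders:

     

    # watch rkB/s on the disk being read collapse while the concurrent write runs
    iostat -x 1 sdb sdc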

     

    Also tried with all devices unencrypted and XFS-formatted and it was the same:

     

    [screenshot]

     

    The server where the problem was detected and the test server have no hardware in common: one is based on a Supermicro X11 board, the test server on an X9 series; one uses HDDs, the test server SSDs. So it is very unlikely to be hardware related.

    • Like 1
    • Upvote 22



    User Feedback

    Recommended Comments



    I'm not that lucky with 6.8rc. Transfer rates got really bad, and I got *lots* of these on the parity drive when I stress the array in any way.

     

    Oct 14 08:19:32 Nasse kernel: sd 10:0:5:0: attempting task abort! scmd(0000000009d51915)
    Oct 14 08:19:32 Nasse kernel: sd 10:0:5:0: [sdg] tag#2081 CDB: opcode=0x12 12 01 00 00 fe 00
    Oct 14 08:19:32 Nasse kernel: scsi target10:0:5: handle(0x000b), sas_address(0x4433221107000000), phy(7)
    Oct 14 08:19:32 Nasse kernel: scsi target10:0:5: enclosure logical id(0x500605b005524f40), slot(0)
    Oct 14 08:19:32 Nasse kernel: sd 10:0:5:0: task abort: SUCCESS scmd(0000000009d51915)
    Oct 14 08:19:32 Nasse kernel: sd 10:0:5:0: Power-on or device reset occurred

     

    Swapped the parity drive to a different slot and the issue followed it, so I was afraid the drive was broken, but after going back to 6.7.3rc4 all is back to normal and transfer speeds are good again.

    Link to comment
    11 minutes ago, Ancan said:

    I'm not that lucky with 6.8rc. Transfer rates got really bad, and I got *lots* of these on the parity drive when I stress the array in any way.

     


    Need diagnostics

    Link to comment

    I haven't done specific testing yet - just letting everything settle. The system does not seem to have exhibited any issues so far. Thank you @limetech, I know this was a challenging one.

     

    I do have the recurring error below in the log, which I assume is unrelated and may have existed before, but it's hard to tell. It does have an open kernel.org ticket.

     

    Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: Multiple Corrected error received: 0000:00:00.0
    Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
    Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: device [1022:1453] error status/mask=00001180/00006000

     

    [1022:1453] 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
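     

    For reference, that device line can be re-checked with lspci, using the bus address from the log:

     

    # -nn prints the [vendor:device] IDs, here [1022:1453], alongside the name
    lspci -nn -s 40:03.1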

    Link to comment
    1 hour ago, Ancan said:

    Here you go. I upgraded to 6.8, and ran some jobs until the error started again.

    That looks like a hardware problem, possibly power related; replace/swap both cables on the parity disk.

     

    You should also update the LSI firmware since it's very old:

    LSISAS2008: FWVersion(10.00.08.00)

    current version is 20.00.07.00
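     

    Two ways to check what firmware the controller is actually running; the mpt2sas driver logs it at boot, while sas2flash is LSI's flash utility and isn't bundled with Unraid, so that one assumes it has been installed:

     

    # firmware version as logged by the driver at boot
    dmesg | grep -i fwversion

    # query the controller directly (also lists the BIOS version)
    sas2flash -list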

    • Like 1
    • Thanks 1
    Link to comment
    4 hours ago, johnnie.black said:

    That looks like a hardware problem, possibly power related; replace/swap both cables on the parity disk.

     


    As I wrote, I've already moved the disk to another slot and the issue follows the parity disk, so it shouldn't be cable/power related. On 6.7.3rc4 there's no problem; I've been running unbalance for hours now without a single hiccup, while on 6.8 it's fine for a while, then there are device resets all the time. Might be related to the kernel and not Unraid per se. Anyway, it's not directly related to this thread, so I'll continue the discussion elsewhere if needed.

     

    Thanks for the heads-up on the f/w. Will try to upgrade now. It's my only controller, so I hope nothing goes wrong.

    Link to comment
    3 minutes ago, Ancan said:

    Thanks for the heads-up on the f/w. Will try to upgrade now.

    You should do that; it could be caused by running the new driver with older firmware.

    Link to comment

    I don't think that issue is fixed with v6.8, at least not for me and some other users.

    I still have high iowait and a completely unresponsive array with rc5 when the mover is running.

    Streams / shares are not accessible or stop completely. I'm running two SSDs in RAID 1 with btrfs.

    Somebody said that it could be a btrfs issue, but I don't know.

    Link to comment
    3 minutes ago, GHunter said:

    Post it as a new issue so it won't be ignored.

    It's already posted and it's not being ignored. Sorry if your particular issue is not the current issue being looked at.

     

    • Like 1
    Link to comment

    I, and other users, seem to still be having this same performance issue in the latest RCs.

    Why has this been closed off as fixed? Or is it raised in another issue that I haven't found?

    Link to comment
    37 minutes ago, patchrules2000 said:

    Why has this been closed off as fixed?

    Because this specific issue is easily reproducible (see the original post) and has been fixed since rc1. There might be another one, but since I can't reproduce it I can't report it, so anyone still having issues needs to make a detailed report, especially detailing how to reproduce.

    • Thanks 1
    Link to comment

    Thanks for the quick reply, johnnie.

    Was about to get some data to open this back up and noticed rc6 had been released, so thought I might as well try it first.

    Turns out rc6 fixes something that previous RCs did not for me, and now everything seems to be running smoothly.

    Pushing 200+ MB/s read + write, CPU encoding at 90% usage while delivering 3+ media streams, and not a single slowdown or high iowait issue to report.

    Suffice to say I'm very pleased, as long as this doesn't resurface as an issue between rc6 and the final release.

     

    Keep the magic unraid sauce flowing please! :D 

     

    [screenshot]

    • Like 1
    Link to comment

    I recently upgraded to rc7 thinking this problem was behind me. It still persists, and it's very easy to reproduce: I copy several GB of files from an unassigned disk to a /mnt/user path. After the memory buffer fills and writes are flushed to disk, I see iowait shoot up to 45, disrupting all running dockers. It takes at least 5 minutes for the load to subside and the system to return to normal.
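     

    To watch the spike while the copy runs, something like the following works; iostat again assumes the sysstat package is available:

     

    # 'wa' column = % of CPU time stalled waiting on I/O
    vmstat 5

    # per-device utilisation, shows which disk is saturated
    iostat -x 5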

     

    I have a cache pool setup with 2 SSDs (no Samsung drives at this point).

     

    Is BTRFS the culprit?

     

    I'll have to go back to rc5 as the lack of nvidia drivers is killing my performance as well.

     

    tower-diagnostics-20191127-1917.zip

    Link to comment
    5 minutes ago, Carlos Talbot said:

    I copy several GB of files from an unassigned disk to a /mnt/user path. After the memory buffer fills and writes are flushed to disk, I see iowait shoot up to 45, disrupting all running dockers.

    Is the copy going to the array or the cache pool?

    Link to comment
    2 minutes ago, Carlos Talbot said:

    Array - /mnt/user/subfolder

    That doesn't really answer my question: is that share set to use cache?

    Link to comment
    21 minutes ago, johnnie.black said:

    That doesn't really answer my question: is that share set to use cache?

    Sorry, yes, it's set to Yes for cache.

     

    This got me thinking. I tried the same copy command to another share that is not using cache. Sure enough, the load held steady at 5 and never went higher (this was also with a Plex transcode running in the background). Containers were accessible without issue.

     

    So it does appear to be the cache that is affecting this.

     

    [screenshot]

     

    Edited by Carlos Talbot
    • Thanks 1
    Link to comment
    16 minutes ago, Carlos Talbot said:

    So it does appear to be the cache that is affecting this.

    Yep, you might want to try with a single XFS or btrfs cache device just to compare; some users have bad performance with a cache pool, possibly not just with Samsung devices, and this is a very old issue.

    Link to comment
    59 minutes ago, Carlos Talbot said:

    Array - /mnt/user/subfolder

    In case you still need something to fill in your understanding:

     

    Array = disks in the parity array

    /mnt/user/subfolder = a user share named subfolder

     

    User shares always include cache. Unless the share is set to cache-no, all new writes go to cache.
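     

    You can see this from the console (assuming a cache-enabled share named subfolder, as in your example):

     

    touch /mnt/user/subfolder/newfile     # write through the user share
    ls /mnt/cache/subfolder/              # the new file lands on the cache device
    ls /mnt/disk1/subfolder/ 2>/dev/null  # the new file is not on any array disk until the mover runs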

    Link to comment
    18 minutes ago, trurl said:

    In case you still need something to fill in your understanding

     


    Got it. I'm in the process of switching from 2 drives in the cache pool to 1, keeping it btrfs. I'm just surprised this issue is still ongoing, as it's very easy to reproduce.

    Edited by Carlos Talbot
    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.