• [6.11.5] Slow parity


    Synd

    Hi,

     

    Since upgrading to 6.11.5, each time I run a parity check, write to the array, or use the mover, parity operations run at 60-70MB/s, whereas before the upgrade they ran at 170-180MB/s.

     

    The system is all connected via SAS3 equipment, but we've seen the same issue on Discord with others running SAS2 or drives attached directly to the motherboard.

     

    I've included my diagnostics, and others will do the same; we agreed on Discord to gather as many data points about this as possible.

     

    I don't spin down drives or anything like that. I run reconstruct write all the time, but it's as if the setting no longer works at all.

     

    This bug impacts AMD only. There was a regression in kernel 5.19 involving "dummy waits" that affected AMD CPUs (https://www.phoronix.com/news/Linux-6.0-AMD-Chipset-WA).
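
    For context, the 6.0-era fix referenced in that Phoronix article limits the old chipset "dummy wait" workaround to Intel CPUs, so other vendors no longer pay for an extra port I/O read on every C-state entry. A sketch of that upstream change in drivers/acpi/processor_idle.c, paraphrased from memory rather than quoted verbatim from the commit:

        /* drivers/acpi/processor_idle.c - sketch of the 6.0 fix,
         * paraphrased; see the Phoronix link for the actual commit. */
        static void wait_for_freeze(void)
        {
        #ifdef CONFIG_X86
                /* No dummy wait is needed when running as a guest. */
                if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                        return;
                /*
                 * The motivating chipset issue was Intel-only, and
                 * modern Intel systems use intel_idle instead of this
                 * path, so skip the workaround for every other vendor,
                 * AMD included.
                 */
                if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
                        return;
        #endif
                /*
                 * Dummy wait: a useless PM-timer read after the P_LVL2
                 * read, because old chipsets could not guarantee that
                 * STPCLK# was asserted in time. This extra inl() on
                 * every idle entry is the overhead this report
                 * attributes to the 5.19 behaviour on AMD.
                 */
                inl(acpi_gbl_FADT.xpm_timer_block.address);
        }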

     

    Thanks.

    megumin-diagnostics-20221221-1116.zip




    User Feedback

    Recommended Comments

    As @Synd reported, I started seeing very slow writes with parity recently. I usually saw them in the 80-110MB/s range, but after some recent shuffling of data drives and upgrading parity to 20TB drives, they're now in the 30-50MB/s range, occasionally dropping below 10MB/s - I suspect that happened when data was being written to the array.

     

    While trying to figure it out, I tried moving the parity drives to the motherboard SATA ports, and also moving my ZFS pool to motherboard SATA instead of the HBA. Alas, amid all the changes my USB boot drive started throwing read errors, and I managed to lose the diagnostics I'd been grabbing along the way: I had copied them to the USB key and forgot to copy them off before formatting it to see if it was OK. It did re-format with no errors (full format, not quick).

     

    Yesterday (Dec 20th, 2022) I decided to proactively replace the USB key. I changed it to another of the Eluteng USB 3 to mSATA SSD adapters, like the one I've been using on my 2nd unRAID system for the last 5-6 months. I decided to do a 'clean' rebuild of my main unRAID, so I didn't restore my backup immediately; instead I manually re-installed all needed plugins and, where required, copied the config/support files for each plugin/container from the backup.

     

    Doing this returned my parity build speed (on the new 20TB drives) to ~100MB/s with single parity, and ~70MB/s with dual parity. Also of note: the 2nd unRAID system was upgraded with a new 16TB parity drive and new data drives, but its parity build stayed in the normal 100MB/s range.

     

    The main unRAID uses an LSI 9305-24i connected to a Supermicro CSE-847 36-bay enclosure that I've converted for use as a DAS shelf. It's been using this new HBA for about 5 months. The DAS conversion of the CSE-847 has been in use for over 2 years; it originally used two HBAs, one internal and one external, both since replaced by the single 9305-24i. The 2nd unRAID uses an LSI 9201-16i in a Supermicro CSE-846 24-bay enclosure. Specs of both systems are in my signature.

     

    One thing I did notice on the main unRAID is that the parity build seems to be single-threaded and is maxing out that one thread at 100%. Multi-threading would likely make little difference, as only one thread at a time can be reading the sectors from all drives. I did not notice this behavior on the 2nd unRAID system.

     

    I have currently stopped/cancelled the single-parity build, as I realized I had some more data to move between servers and writes are much faster when no parity calculation is involved. Once this data move is complete I will re-add a single 20TB parity drive and let it build. If any additional info is required, let me know.

     

    EDIT: I also have my disk config set to Reconstruct Write.

    Link to comment

    To add to this: after more discussion, we found that it's the people with EPYC systems who have slower parity since 6.11, exactly as this bug report describes. Xeon users are generally fine; the Discord mods ran tests together with some of the most active Discord users to confirm.

    Link to comment

    I don't have logs of my last parity check (it concluded November 26th), but I have the same issue. I'm also on EPYC (Gen 3 Milan), combined with an LSI 9500-8i HBA (SAS3) connected to a SAS3 expander, with 11 drives attached (various 10TB & 18TB) and dual parity configured.

     

    In the past (6.9.x etc.) a parity check would start at 220MB/s, and at the slowest point it would be about 140MB/s, which is normal for my disks.

     

    Under the current release (v6.11.5) it starts off fast as normal, but after a few hours I'm down to 40MB/s. If I then pause the parity check and resume it, it jumps straight back to full speed (i.e. whatever is appropriate for the position in the check, between the 220MB/s and 140MB/s figures above).

     

    I have nothing writing to the disks when this occurs (no shares, no VMs or Docker containers, nothing).
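
    If it helps with gathering data points: something as simple as periodically timestamping /proc/mdstat pins down the exact moment the speed drops, so it can be lined up against the syslog. A minimal sketch in C (assuming unRAID still exposes its md state through /proc/mdstat, in its own key=value flavour rather than the stock mdraid format):

        /* mdstat-log.c - dump /proc/mdstat with a timestamp every
         * 60 seconds; build with: gcc -O2 -o mdstat-log mdstat-log.c */
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[4096];
            for (;;) {
                FILE *f = fopen("/proc/mdstat", "r");
                if (!f) { perror("fopen /proc/mdstat"); return 1; }

                time_t now = time(NULL);
                printf("=== %s", ctime(&now)); /* ctime() ends with \n */

                size_t n;
                while ((n = fread(buf, 1, sizeof buf, f)) > 0)
                    fwrite(buf, 1, n, stdout);

                fclose(f);
                fflush(stdout);
                sleep(60);
            }
        }

    Redirecting its output to a file on the flash drive gives a timeline of the resync position that can be attached alongside the diagnostics.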

    Link to comment
    Quote

    I included my diagnostics

    These diags show writes going to 6 disks at the same time; that will be pretty slow, since parity needs to be updated for all of them simultaneously. Please post new diags with just a parity check running, or while writing to only one disk.

    Link to comment
    3 hours ago, JorgeB said:

    Also, I test the parity check speed for each new release, and except for the known v6.8 slowdown it's been pretty stable; there was even a small improvement for v6.10+.

     

    [attached image: imagem.png - parity check speed per release]

     

    Is your setup AMD EPYC or Intel? The issue appears to be isolated to EPYC setups.

    Link to comment
    1 hour ago, JorgeB said:

    Intel

    That matches what we found on Discord: Intel is not affected. @Fuggin and @Kilrah ran parity checks, and only the AMD builds are slowing down. @AgentXXL, @Pri and I are running EPYC, and I know @The_Mountain also has the issue on EPYC. After testing multiple setups in the #offtopic channel of the Discord, this looks like an AMD-specific bug.

    Link to comment
    1 hour ago, Synd said:

    but only AMD builds are slowing down

    OK, in that case I can't comment, as all my servers are Intel. You should probably add that info to the first post.

    Link to comment
    Quote

    I run write reconstruct all the time, but it's almost like the setting doesn't work at all anymore.

     

    Also note that turbo write does not work if there are writes going to multiple disks, as in the diags you posted originally.
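
    For anyone wondering why: with single parity the two write modes trade reads very differently, and a toy model makes it clear (one byte per "disk" and a hypothetical layout, purely illustrative - the real logic lives in unRAID's md driver and works on whole stripes):

        /* Toy model of the two single-parity write modes. */
        #include <stdio.h>
        #include <stdint.h>

        #define NDISKS 5                 /* data disks; parity separate */

        static uint8_t data[NDISKS];
        static uint8_t parity;

        /* Read/modify/write: touch only the target disk and parity
         * (2 reads + 2 writes); the other disks can stay idle. */
        static void rmw_write(int disk, uint8_t new_val)
        {
            uint8_t old = data[disk];    /* read old data        */
            parity ^= old ^ new_val;     /* read + update parity */
            data[disk] = new_val;        /* write new data       */
        }

        /* Reconstruct ("turbo") write: read every OTHER data disk and
         * recompute parity from scratch. Great for a single stream,
         * but concurrent writes to several disks all demand reads of
         * the disks the other streams are busy writing. */
        static void reconstruct_write(int disk, uint8_t new_val)
        {
            uint8_t p = new_val;
            for (int i = 0; i < NDISKS; i++)
                if (i != disk)
                    p ^= data[i];        /* read all other disks */
            data[disk] = new_val;
            parity = p;
        }

        int main(void)
        {
            for (int i = 0; i < NDISKS; i++) {
                data[i] = (uint8_t)(i * 17);
                parity ^= data[i];
            }

            rmw_write(2, 0xAB);
            reconstruct_write(4, 0xCD);

            /* Either mode must leave parity == XOR of all data disks. */
            uint8_t check = 0;
            for (int i = 0; i < NDISKS; i++)
                check ^= data[i];
            printf("parity %s\n", check == parity ? "consistent" : "BROKEN");
            return 0;
        }

    With several simultaneous write streams, each reconstruct write needs full-stripe reads of the disks the other streams are writing to, so the heads thrash and the "turbo" advantage disappears - which is why diags should be captured with a single writer.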

    Link to comment
    3 minutes ago, JorgeB said:

    OK, in that case cannot comment as all my servers are Intel, you probably should add that info to the 1st post.

    Done. I also found the regression in the kernel itself by going through our Git repos at work plus the ticketing system: dummy waits were applied to AMD with kernel 5.19 and were patched out for 6.0. I shared a lot more details with staff in a private Discord channel, since it meant bringing internal work data into the conversation.

    Link to comment

    Does this issue affect all EPYC CPUs? I just got a Rome EPYC and I'm not noticing any issues with parity check speed; curiously, v6.11.5 is faster than v6.9.2...

    Link to comment

    I've recently upgraded my main server from a Xeon E5-2650L v4 to an EPYC 7282 (Rome) and I've not seen any difference in parity check or write speeds.

    I've had to rebuild two drives (due to issues with a faulty cable) and also run a full parity check since upgrading, all without any slowdowns.

    Link to comment



