• [6.7.0-rc2] Reading all disks when writing to a single one


    hawihoney
    • Retest Minor

    After upgrading from 6.6.6 stable to 6.7.0-rc2, I see unusual reads whenever I write to a single disk.

     

    For example, in this case I write/copy to \\tower2\disk21 from my Windows 10 machine (SMB). During the whole copy, all other disks are spun up and read at low speed. In the attached picture a 40GB file is being written: disk21 and parity/parity2 show the usual write activity, but the other disks are spun up and read as well.

     

    After the file is written, the reads on the other disks stop as well.

     

    Diagnostics and image attached.

     

    *** Edit: The Main page shows the same read activity for the flash drive as well. Forgot to mention that.
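
    For anyone who wants to quantify these reads beyond the Main page counters, here is a minimal sketch that samples the per-disk read counters from /proc/diskstats while a copy runs. The device names are placeholders and need to be adjusted to the actual array members.

    ```python
    #!/usr/bin/env python3
    """Sample per-disk read counters from /proc/diskstats while a copy runs.

    Sketch only: set DISKS to the device names of your array disks.
    Sectors in /proc/diskstats are 512 bytes.
    """
    import time

    DISKS = {"sdb", "sdc", "sdd"}   # placeholder device names, adjust to your array

    def sectors_read():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                name, sectors = fields[2], int(fields[5])   # field 6 = sectors read
                if name in DISKS:
                    stats[name] = sectors
        return stats

    prev = sectors_read()
    while True:                      # Ctrl-C to stop
        time.sleep(5)
        cur = sectors_read()
        print("KiB read in last 5s:",
              {d: (cur[d] - prev[d]) * 512 // 1024 for d in cur})
        prev = cur
    ```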

     

     

    tower2-diagnostics-20190127-1031.zip

    Clipboard01.jpg




    User Feedback

    Recommended Comments



    1 hour ago, johnnie.black said:

    I can now reproduce this. The missing piece was that, for me at least, it only starts happening after a large transfer: after a fresh boot on my test server I start seeing the reads on all disks at around the 30% mark of a 30GB transfer, and it's easily reproducible after a reboot. With v6.6.6 I can complete the same transfer without the reads, so maybe this can help LT do the same.

    I can confirm I am seeing the same issue [rc4] when transferring a big file (14 GB). Initially only parity and the selected disk show read/write activity, but after a while other disks in the array start to show read activity. It is not always the same disks, but by the end of the transfer all disks in the array have been spun up due to this behavior.

    Edited by bonienl
    Link to comment

    Exactly. My first thought was "Give these machines more RAM" and I went from 8GB to 16GB. The only effect was that these read requests started later.

     

    And now think about my other report here: at some point, with really big files, the systems hang completely.

     

    IMHO, we found a memory leak.
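
    Whether this is a true leak or just page-cache pressure should be visible in /proc/meminfo. A rough sketch to log the relevant counters during a transfer (standard Linux counters, nothing Unraid-specific): if MemFree collapses while Cached and Dirty grow, it is cache pressure rather than a leak; MemAvailable shrinking without Cached growing would be more suspicious.

    ```python
    #!/usr/bin/env python3
    """Log memory counters from /proc/meminfo during a large transfer."""
    import time

    FIELDS = ("MemTotal", "MemFree", "MemAvailable", "Cached", "Dirty", "Writeback")

    def meminfo():
        out = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                if key in FIELDS:
                    out[key] = int(value.split()[0]) // 1024   # kB -> MiB
        return out

    while True:                      # Ctrl-C to stop
        print({k: f"{v} MiB" for k, v in meminfo().items()})
        time.sleep(10)
    ```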

     

    Link to comment

    I am getting the same effect when the mover is running.

     


    I have also noticed that writes continue after the mover has logged that it is finished; could this be the cache still being written to disk?

     

    I have 32GB in my machine. I have also noticed disks spin up more often than on 6.6.6.

    Link to comment
    35 minutes ago, SimonF said:

    I have also noticed that writes continue after the mover has logged that it is finished; could this be the cache still being written to disk?

    That's normal; by default 20% of free RAM is used as write cache.

    Link to comment

    I've read that several times now. I can't remember an application writing to the array reporting "OK" while the array was still writing for minutes. When was that enormous caching introduced?

     

    What happens if that write fails? My application already said "OK", and I'd have no clue that the array left a mess behind.

     

    Where can I switch that off?
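
    To be clear about the "my application said OK" concern: this is standard Linux page-cache behavior rather than something the array adds. A plain write() returns once the data is in RAM; an application that needs to know the data really reached the disk has to ask for it with fsync(). A minimal sketch (generic Python, not specific to any SMB client):

    ```python
    import os

    def durable_copy(src_path, dst_path, chunk=1024 * 1024):
        """Copy a file and only report success once the data is flushed to disk."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                buf = src.read(chunk)
                if not buf:
                    break
                dst.write(buf)           # this only lands in the page cache
            dst.flush()                  # push Python's own buffer to the kernel
            os.fsync(dst.fileno())       # block until the kernel has written it out
        print("OK - data is actually on disk, not just cached")
    ```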

     

    Link to comment
    9 minutes ago, hawihoney said:

    When was that enormous caching introduced?

    It's been like that for years, as long as I can remember, and it's not enormous: it's 20% of free RAM, though that can be a considerable amount on servers with lots of RAM.
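
    Assuming that 20% figure corresponds to the kernel's dirty-page limits (vm.dirty_ratio defaults to 20 on stock Linux and is applied against available memory), you can check what your box is actually configured for and roughly how much data is allowed to sit unflushed:

    ```python
    #!/usr/bin/env python3
    """Show the kernel's dirty-page limits and the resulting write-cache ceiling.

    Assumption: the '20% free RAM' write cache corresponds to vm.dirty_ratio.
    """

    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    def meminfo_kb(key):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(key + ":"):
                    return int(line.split()[1])
        raise KeyError(key)

    dirty_ratio = read_int("/proc/sys/vm/dirty_ratio")             # hard limit, %
    background  = read_int("/proc/sys/vm/dirty_background_ratio")  # async flush starts, %
    available_mb = meminfo_kb("MemAvailable") // 1024

    print(f"vm.dirty_ratio            = {dirty_ratio}%")
    print(f"vm.dirty_background_ratio = {background}%")
    print(f"~{available_mb * dirty_ratio // 100} MiB of dirty data may build up "
          f"before writers are throttled")
    ```

    Lowering vm.dirty_ratio (for example via sysctl) shrinks that window, which is about as close as you can get to "switching it off", at the cost of some write throughput.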

    Link to comment
    3 hours ago, johnnie.black said:

    I can now reproduce this

    One more data point: it doesn't happen to me with btrfs disks, only with XFS.
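
    To see which disks this would apply to on a given server, one option is to read the filesystem type of each /mnt/diskN mount from /proc/mounts (assuming the usual Unraid /mnt/disk* layout):

    ```python
    #!/usr/bin/env python3
    """List the filesystem type of each array disk, assuming /mnt/diskN mount points."""
    import re

    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype = line.split()[:3]
            if re.fullmatch(r"/mnt/disk\d+", mountpoint):
                print(f"{mountpoint:12s} {fstype:6s} ({device})")
    ```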

    Link to comment
    On 1/27/2019 at 5:05 AM, hawihoney said:

    No, I don't use something like Turbo Write. In fact I don't even know what it is.

     

    Are these Tunable Values?

     

    I changed md_num_stripes to 4096, md_sync_window to 2048, and md_sync_thresh to 2000. All others are at their defaults.

     

    ***Edit*** The three tunable values mentioned above were changed a year ago. I didn't change anything between 6.6.6 and 6.7.0-rc2.

     


    Please set these tunables back to their default values and let me know if that makes any difference.

    To do this, Stop array, go to Settings/Disk Settings and then set each field to blank and hit Apply - that should restore the defaults.
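
    If you want to note the current values before blanking the fields, they are persisted on the flash drive. The path and key names below (disk.cfg under /boot/config) are an assumption based on a typical install, so adjust if yours differ:

    ```python
    #!/usr/bin/env python3
    """Print the current md tunables from the flash config.

    Assumption: the tunables are stored as key="value" lines in
    /boot/config/disk.cfg; adjust the path if your install differs.
    """
    TUNABLES = ("md_num_stripes", "md_sync_window", "md_sync_thresh")

    with open("/boot/config/disk.cfg") as f:
        for line in f:
            key, _, value = line.strip().partition("=")
            if key in TUNABLES:
                cleaned = value.strip('"')
                print(f"{key} = {cleaned}")
    ```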

    Link to comment
    3 minutes ago, limetech said:

    Please set these tunables back to their default values and let me know if that makes any difference.

    To do this, Stop array, go to Settings/Disk Settings and then set each field to blank and hit Apply - that should restore the defaults.

    Mine are all default.

    Link to comment

    @Tom: I changed these values back and forth several times during the RCs. But not with RC4. Will do tomorrow.

     

    For some weeks now I haven't been able to stop the array cleanly, which is why I try to avoid stopping it. It always hangs on "Stopping Services" and I need to power cycle via IPMI. The reason is the mount points to external machines: every morning they are gone, and an ls on a mount point, or other commands, stalls the machine. I switched from Unassigned Devices to my own scripts - same result. I haven't been able to find the cause in all these weeks ...
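
    One way to check those remote mount points without wedging a shell is to stat them from a child process with a timeout, so a dead mount is reported instead of hanging whatever touches it. A sketch; the mount paths are placeholders:

    ```python
    #!/usr/bin/env python3
    """Probe remote mount points without hanging if one of them is dead."""
    import subprocess
    import time

    MOUNTS = ["/mnt/remotes/tower1_disk1", "/mnt/remotes/tower1_disk2"]  # placeholders

    def probe(path, timeout=5):
        # Run stat in a child so a dead mount blocks the child, not this script.
        proc = subprocess.Popen(["stat", "-t", path],
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        deadline = time.time() + timeout
        while time.time() < deadline:
            if proc.poll() is not None:
                return "OK" if proc.returncode == 0 else "ERROR (not mounted?)"
            time.sleep(0.2)
        # A child stuck in uninterruptible I/O may not die, but this script moves on.
        proc.kill()
        return "HUNG (stat did not return)"

    for path in MOUNTS:
        print(f"{path}: {probe(path)}")
    ```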

     

    Link to comment
    31 minutes ago, limetech said:

    Please set these tunables back to their default values and let me know if that makes any difference.

    My tests are also done with default tunables.

    Link to comment
    11 hours ago, johnnie.black said:

    One more data point: it doesn't happen to me with btrfs disks, only with XFS.

    Seems like a breakthrough; XFS-related.

    Link to comment

    Tunables back to default - same problem.

     

    Need to add something: I see these small reads even when reading/streaming from a single disk. This means that whenever all disks are spun down and I read from a single disk, all other disks that belong to the same User Share as that file will spin up too.

     

    Didn't notice that until today.

     

    Problem is XFS and/or MD related.

     

    Edited by hawihoney
    Link to comment

    Last post from me on that.

     

    I've set up a test: two parity drives and a User Share spanning disk1 and disk2. Writing to disk1:

     

    Blue: User Share (disk1, disk2)

    Green: Writing to disk1 --> ok.

    Red: Small reads on disk2 --> wrong.

     

    If I switch off User Shares completely, all remaining disks show small reads as well.

     

    Unbenannt.png

    Link to comment

    An update: FYI, we are actively working on this issue and I can reproduce it easily. No conclusions yet, but we're pretty sure this is a case of the kernel reclaiming pages that contain the disks' top-level directory entries as RAM fills up.
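
    If that theory holds (directory entries being evicted under memory pressure and re-read from disk), a related effect can be provoked on purpose by watching the dentry counters and dropping the reclaimable dentry/inode caches, after which the next user-share listing should spin drives up again. A sketch using only standard Linux interfaces:

    ```python
    #!/usr/bin/env python3
    """Watch the dentry cache and, optionally (as root), drop it.

    If directory entries are evicted, the next listing of a user share has to
    re-read metadata from the disks, which is enough to spin them up.
    Run with --drop to force the eviction.
    """
    import sys

    def dentry_state():
        # /proc/sys/fs/dentry-state: first two fields are total and unused dentries
        with open("/proc/sys/fs/dentry-state") as f:
            nr_dentry, nr_unused = map(int, f.read().split()[:2])
        return nr_dentry, nr_unused

    print("dentries (total, unused):", dentry_state())

    if "--drop" in sys.argv:
        # '2' asks the kernel to free reclaimable dentries and inodes (needs root)
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("2\n")
        print("after drop:", dentry_state())
    ```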

    • Like 1
    Link to comment

    Another update: looks like we have solved this, or rather the solution is in the latest Linux kernel patch release (4.19.24).  This issue was due to a set of kernel changes that just got reverted.  You can read the gory details here:

    https://lkml.org/lkml/2019/1/29/1508

     

    tldr: the Linux virtual memory subsystem got majorly borked, especially with XFS, but now it's fixed.

     

    We are doing some more testing and I think this reversion will fix a few other issues reported here in prerelease bug reports.  We should have an -rc5 published in the next day or two.
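
    Once -rc5 is out, a quick way to confirm the running kernel actually includes the revert is to compare the kernel release against 4.19.24 (sketch; the release suffix shown in the comment is just an example):

    ```python
    #!/usr/bin/env python3
    """Check whether the running kernel is at least 4.19.24 (the release with the revert)."""
    import platform

    release = platform.release()                      # e.g. "4.19.24-Unraid"
    major, minor, patch = (release.split("-")[0].split(".") + ["0", "0"])[:3]

    if (int(major), int(minor), int(patch)) >= (4, 19, 24):
        print(f"Running {release}: includes the 4.19.24 revert")
    else:
        print(f"Running {release}: older than 4.19.24, still affected")
    ```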

    • Like 2
    • Upvote 1
    Link to comment
    1 hour ago, limetech said:

    Another update: looks like we have solved this, or rather the solution is in the latest Linux kernel patch release (4.19.24)

    Gotta love troubleshooting other people's sloppy work.

    Link to comment
    1 hour ago, limetech said:

    Another update: looks like we have solved this, or rather the solution is in the latest Linux kernel patch release (4.19.24)

     

    My disks are all XFS, and every one of them was spun up when only parity and one data disk should have been involved, whether during large file transfers, backups being written to the array, or recordings via Plex DVR.

     

    I was just compiling a report after watching this happen for a few days. I updated my active server from 6.6.6 to RC4 five days ago, and that's when the behavior started. Fortunately, I came across this report first, which confirms what I saw. Glad to see it has already been identified and resolved.

    Link to comment

    I don't think it's fixed. I upgraded everything to 6.7.0-RC5 and did the same copies. It looked different, but in the end the result was the same: the other disks are read at small intervals, and the system had a hard time. RAM usage beyond 87%, CPU usage nearly 100%.

     

    Took a lot of screenshots and diagnostics. Will post them here in an hour.

     

    This time the sending unRAID server is bare metal with 128GB and 40 CPU threads, and the receiving server is an unRAID VM with 16GB and 8 CPU threads. If you need two bare-metal machines, other users will need to jump in.

     

     

    Link to comment

    It is fixed for me. My previous test of copying a large file to the server now shows the expected behavior, with no more reads from the "other" disks.

     

    Link to comment
    18 minutes ago, hawihoney said:

    RAM usage beyond 87%, CPU usage nearly 100%.

    Definitely something else going on. This is not normal behavior.

    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.