• Degraded array performance from RC3/4/5


    sittingmongoose
    • Closed

    Posting this here because issues are supposed to be separate from the beta announcement thread.

     

    I had serious array performance issues with 6.7.  6.8 RC1 seemed to completely fix those issues.  I could run mover and parity check and still be able to stream multiple 4K videos from my server.  

     

    With RC3-RC5 (both vanilla & Nvidia builds), if I have mover OR parity check running I cannot stream any videos off my server.  This was the issue with 6.7 that was heavily documented.  

     

    Something affecting disk performance changed after RC1 and appears to have reverted the fixes that were in place.  Again, RC1 works perfectly; RC3, RC4, and RC5 have the same issue as 6.7.

     

    Here are some diags from today, running Nvidia 6.8 RC5.  I have the same issues with vanilla (which I thoroughly tested).  The issue shows up a bit before 11:30am EST and lasts until I paused parity check.

    9900k-diagnostics-20191103-1935.zip





    Quote

     This was the issue with 6.7 that was heavily documented.  

    It was, and it was also easily reproducible. I can't reproduce the previous issue with rc5; I get the same performance as rc1.

     

    Not saying there isn't a problem, but if there is, it's likely a different one. You could see if you can reproduce it in the same way: start writing to one array disk (by invoking the mover or manually), then copy a file from another array disk to your desktop and note the copy speed with rc1 and rc5. If there's a significant difference, try for example doing the same in safe mode with all dockers/VMs stopped.
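
    For anyone who wants a repeatable number for the comparison described above, here is a minimal sketch in Python. It is an assumption-laden stand-in for the desktop copy test, not the exact procedure from this thread: it reads a file sequentially on the server itself and reports MB/s. The function name and the example path are hypothetical. Note that a file already in the page cache will read much faster than the disk can deliver, so use a large file that has not been read recently.

    ```python
    import time


    def measure_read_throughput(path, chunk_size=1024 * 1024):
        """Read a file sequentially; return (bytes_read, megabytes_per_second)."""
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
        elapsed = time.perf_counter() - start
        # Guard against a zero-length interval on very small files.
        rate = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
        return total, rate


    if __name__ == "__main__":
        # Hypothetical path: pick a large file on an array disk that the
        # mover or parity check is not writing to.
        size, mb_per_s = measure_read_throughput("/mnt/disk2/some-large-file.mkv")
        print(f"read {size} bytes at {mb_per_s:.1f} MB/s")
    ```

    Running it once while idle and once with the mover or a parity check active, on both rc1 and rc5, would give four figures to compare.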

    Link to comment

    I can't do any file transfers as the server is remotely located.  I actually never saw that file transfer issue.  I had/have a problem streaming off the array.  It's reproducible for me, and I have seen a handful of other users running newer RCs have the same issue: users that have posted in the slow array performance thread that was set to "solved", and also in the RC3/4/5 release threads.

     

    For me to reproduce it, all I have to do is start parity check and I can no longer stream videos without buffering.  If you want to be sure to trigger it, you can run mover and parity check at once and try to stream.  I am able to run both mover and parity check while still being able to use Plex on RC1.  

     

     

    Link to comment
    59 minutes ago, sittingmongoose said:

    If you want to be sure to trigger it, you can run mover and parity check at once and try to stream

    I don't have issues streaming video, and can copy a file from the server at a steady 100MB/s while the mover is running.

    Edited by johnnie.black
    Link to comment

    In your diags syslog there is a stack dump and multiple Q parity errors, along with a lot of messages I don't recognize from plugins I don't use.  Please reboot in 'Safe Mode' and retest.  Also, post-rc1 we changed kernel I/O scheduler from 'none' to 'mq-deadline' - you can change back to 'none' as a test on Settings/Disk Settings page - Tunable (scheduler).

    Link to comment
    2 hours ago, sittingmongoose said:

    I can't do any file transfers as the server is remotely located.  I actually never saw that file transfer issue.  I had/have a problem streaming off the array.  It's reproducible for me, and I have seen a handful of other users running newer RCs have the same issue: users that have posted in the slow array performance thread that was set to "solved", and also in the RC3/4/5 release threads.

     

    For me to reproduce it, all I have to do is start parity check and I can no longer stream videos without buffering.  If you want to be sure to trigger it, you can run mover and parity check at once and try to stream.  I am able to run both mover and parity check while still being able to use Plex on RC1.  

     

     

     

    I have 4 streams going right now including a 4k direct play (over 100Mbps bandwidth).  I started the mover and none of the streams are being affected.  I'm running rc3 currently.  I will test parity check next but need mover to finish because I have mover set to be disabled during parity checks.  I have the mover tuner plugin set to enable turbo write during moves so all my disks are spun up and being read at over 160MB/s during the mover process and still no streams are being affected.

     

    EDIT:  Can confirm parity check does not affect my streams either.

    Edited by IamSpartacus
    Link to comment
    1 hour ago, limetech said:

    Also, post-rc1 we changed kernel I/O scheduler from 'none' to 'mq-deadline'

    Are you sure about this? I was convinced rc1 doesn't have a scheduler setting but the default was still "mq-deadline", i.e. unless the user changes the setting in rc5, both will use the same I/O scheduler.

    Link to comment
    7 minutes ago, johnnie.black said:

    Are you sure about this? I was convinced rc1 doesn't have a scheduler setting but the default was still "mq-deadline", i.e. unless the user changes the setting in rc5, both will use the same I/O scheduler.

    Yes, there is some confusion about this. If you looked at /sys/block/<dev>/queue/scheduler it said 'none', but there was a bug integrating the mq code into the kernel and 'none' really selected 'mq-deadline' - though I did not go down that rabbit hole to see if that was really the case.
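
    Since the question of which scheduler is actually active comes down to reading /sys/block/<dev>/queue/scheduler, here is a small Python sketch for checking it. On Linux that file lists the available schedulers on one line with the active one in square brackets, for example "[mq-deadline] kyber bfq none". The function names are hypothetical, and the device name in the usage example is only a placeholder.

    ```python
    from pathlib import Path


    def parse_scheduler(text):
        """Split the contents of a queue/scheduler file into (active, available).

        The active scheduler is the entry wrapped in square brackets. If no
        entry is bracketed, active is returned as None.
        """
        active = None
        available = []
        for name in text.split():
            if name.startswith("[") and name.endswith("]"):
                name = name[1:-1]
                active = name
            available.append(name)
        return active, available


    def read_scheduler(dev):
        """Read and parse the scheduler file for a block device such as 'sdb'."""
        path = Path(f"/sys/block/{dev}/queue/scheduler")
        return parse_scheduler(path.read_text())


    if __name__ == "__main__":
        # 'sdb' is a placeholder; substitute one of your array devices.
        active, available = read_scheduler("sdb")
        print(f"active={active} available={available}")
    ```

    Running this on rc1 and on rc5 with the tunable left at its default would show directly whether the two releases end up with the same scheduler.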

    Link to comment
    3 minutes ago, limetech said:

    because if you looked at /sys/block/<dev>/queue/scheduler it said 'none'

    As I remember it looking at that showed "mq_deadline" as the active scheduler.

     

     

    Link to comment
    5 minutes ago, johnnie.black said:

    As I remember it looking at that showed "mq_deadline" as the active scheduler.

     

     

    Very well could be - those days were filled with lots of reversions and updates to track down sqlite corruption issue. :/

    Link to comment
    2 hours ago, limetech said:

    In your diags syslog there is a stack dump and multiple Q parity errors, along with a lot of messages I don't recognize from plugins I don't use.  Please reboot in 'Safe Mode' and retest.  Also, post-rc1 we changed kernel I/O scheduler from 'none' to 'mq-deadline' - you can change back to 'none' as a test on Settings/Disk Settings page - Tunable (scheduler).

    I upgraded back to RC5 and booted into safe mode.  I set mover tuner to low IO disk priority.  I also set Tunable (scheduler) to "None".  So far it is having problems with 2 streams with parity and mover running.  But it looks like the log file is maxing out?

     

    I included my diags here.

    9900k-diagnostics-20191104-1939.zip

    Link to comment
    14 minutes ago, IamSpartacus said:

    Is there any info on what each of the different options in `Tunable (scheduler)` are supposed to do?

    None, Auto, Kyber, BFQ, MQ-Deadline are the 5 options I see in RC5.

    Link to comment
    24 minutes ago, sittingmongoose said:

    I upgraded back to RC5 and booted into safe mode.  I set mover tuner to low IO disk priority.  I also set Tunable (scheduler) to "None".  So far it is having problems with 2 streams with parity and mover running.  But it looks like the log file is maxing out?

     

    I included my diags here.

    9900k-diagnostics-20191104-1939.zip

    I've never seen those messages, but they seem to indicate something is seriously messed up.

     

    Also what is "mover tuner"?

    Link to comment
    1 minute ago, limetech said:

    I've never seen those messages, but they seem to indicate something is seriously messed up.

     

    Also what is "mover tuner"?

    The CA Mover Tuning plugin.  I just installed it and set mover disk IO priority to low.

     

    As a follow-up to what I previously posted about safe mode and the log file: I rebooted into regular mode, RC5 Nvidia.  The log file is OK now. The settings I changed (scheduler to none, and mover disk priority to low) might have done the trick.  I ran parity check and mover while streaming two 4K streams along with four 1080p streams, and it seemed fine.  It also fixed my VM issue. 

     

    I am including my diags to see if there are still all the errors you previously saw.

    9900k-diagnostics-20191104-2003.zip

    Link to comment
    19 minutes ago, sittingmongoose said:

    I am including my diags to see if there are still all the errors you previously saw.

    FWIW, your latest syslog doesn't have any of this

    Nov  4 14:32:07 9900K kernel: x86/PAT: CPU 3/KVM:5062 conflicting memory types a2000000-a3000000 uncached-minus<->write-combining
    Nov  4 14:32:07 9900K kernel: x86/PAT: reserve_memtype failed [mem 0xa2000000-0xa2ffffff], track uncached-minus, req uncached-minus
    Nov  4 14:32:07 9900K kernel: ioremap reserve_memtype failed -16

    which flooded your previous one, but it still contains segfaults:

    Nov  4 14:51:02 9900K kernel: ffdetect[25520]: segfault at 38 ip 00000000004042af sp 00007ffdde2eeda0 error 4 in ffdetect[403000+c000]
    Nov  4 14:51:02 9900K kernel: Code: 0f b6 6d 00 40 84 ed 75 b7 48 8b 14 24 48 8d 35 ce ad 00 00 bf 01 00 00 00 31 c0 ff 15 92 59 01 00 48 89 df ff 15 c1 5a 01 00 <41> 0f b6 2c 24 40 84 ed 0f 84 9a 00 00 00 4c 8d 35 2c b1 00 00 eb

    Did you ever boot into safe mode?
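
    The comparison between the two syslogs above can be made repeatable with a short Python sketch that tallies the message types quoted in this post. The category labels and function name are hypothetical, and the substrings are taken only from the log lines shown here; a real syslog may contain other variants.

    ```python
    from collections import Counter

    # Substrings copied from the log excerpts quoted in this thread.
    # The labels on the left are made up for this sketch.
    PATTERNS = {
        "pat_conflict": "x86/PAT:",
        "ioremap_failed": "ioremap reserve_memtype failed",
        "segfault": "segfault at",
    }


    def tally_syslog(lines):
        """Count how many lines fall into each known message category."""
        counts = Counter()
        for line in lines:
            for label, needle in PATTERNS.items():
                if needle in line:
                    counts[label] += 1
                    break  # count each line at most once
        return counts


    if __name__ == "__main__":
        # Assumed location of the syslog inside an extracted diagnostics zip.
        with open("logs/syslog.txt", errors="replace") as f:
            for label, n in tally_syslog(f).items():
                print(f"{label}: {n}")
    ```

    Running it against each diagnostics file gives a quick before/after count instead of eyeballing a flooded log.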

    Link to comment
    34 minutes ago, sittingmongoose said:

    None, Auto, Kyber, BFQ, MQ-Deadline are the 5 options I see in RC5.

     

    Yes but what do they each do is my question?

     

    32 minutes ago, limetech said:

    I've never seen those messages, but they seem to indicate something is seriously messed up.

     

    Also what is "mover tuner"?

     

    It's a plugin that allows for further tuning of the mover.  These are the settings I'm using and I'm not having any issues with the mover affecting the rest of my system since upgrading to rc1.  Before 6.8rc1 I was having lots of issues.

     

    [Screenshot: CA Mover Tuning plugin settings]

    Link to comment
    35 minutes ago, John_M said:

    FWIW, your latest syslog doesn't have any of this

    
    Nov  4 14:32:07 9900K kernel: x86/PAT: CPU 3/KVM:5062 conflicting memory types a2000000-a3000000 uncached-minus<->write-combining
    Nov  4 14:32:07 9900K kernel: x86/PAT: reserve_memtype failed [mem 0xa2000000-0xa2ffffff], track uncached-minus, req uncached-minus
    Nov  4 14:32:07 9900K kernel: ioremap reserve_memtype failed -16

    which flooded your previous one, but it still contains segfaults:

    
    Nov  4 14:51:02 9900K kernel: ffdetect[25520]: segfault at 38 ip 00000000004042af sp 00007ffdde2eeda0 error 4 in ffdetect[403000+c000]
    Nov  4 14:51:02 9900K kernel: Code: 0f b6 6d 00 40 84 ed 75 b7 48 8b 14 24 48 8d 35 ce ad 00 00 bf 01 00 00 00 31 c0 ff 15 92 59 01 00 48 89 df ff 15 c1 5a 01 00 <41> 0f b6 2c 24 40 84 ed 0f 84 9a 00 00 00 4c 8d 35 2c b1 00 00 eb

    Did you ever boot into safe mode?

    My 2nd (of three) diag post was from safe mode running RC5.  It's the diag file ending in 1939.  That was the one where my log file filled with errors.

     

    So you're saying my most recent diag looks better but still has segfaults?  Can you explain what a segfault could indicate?

    Link to comment
    21 minutes ago, sittingmongoose said:

    So you're saying my most recent diag looks better but still has segfaults?  Can you explain what a segfault could indicate?

    The most recent diagnostics don't have the "conflicting memory types" errors, but they do have segfaults. A segmentation fault is an access violation, i.e. an attempt to access an area of memory in an illegal way, such as trying to write to read-only memory or trying to read memory that belongs to another process. It could be a programming bug, or I suppose some table of pointers could be getting corrupted, or it could be a hardware problem. Since your other errors are memory-related, I'd ask about your RAM. Have you run a MemTest on it? Are you operating it within spec?
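
    To make the segfault line quoted earlier a little less opaque, here is a hedged Python sketch that pulls the fields out of it. It assumes the x86 kernel message format seen in this thread, where the value after "error" is the page-fault error code in hex: bit 0 set means the page was present (a protection violation), bit 1 set means the access was a write, and bit 2 set means the fault happened in user mode. The function name and the keys of the returned dictionary are hypothetical.

    ```python
    import re

    SEGFAULT_RE = re.compile(
        r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
        r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    )


    def decode_segfault(line):
        """Return the decoded fields of a kernel segfault line, or None."""
        m = SEGFAULT_RE.search(line)
        if m is None:
            return None
        err = int(m.group("err"), 16)
        return {
            "program": m.group("prog"),
            "pid": int(m.group("pid")),
            "address": int(m.group("addr"), 16),
            "page_present": bool(err & 1),
            "access": "write" if err & 2 else "read",
            "mode": "user" if err & 4 else "kernel",
        }


    if __name__ == "__main__":
        line = (
            "Nov  4 14:51:02 9900K kernel: ffdetect[25520]: segfault at 38 "
            "ip 00000000004042af sp 00007ffdde2eeda0 error 4 in ffdetect[403000+c000]"
        )
        print(decode_segfault(line))
    ```

    For the line in this thread, "error 4" decodes to a user-mode read of a page that was not mapped, at address 0x38. A fault at such a low address is often what dereferencing a null pointer plus a small offset looks like, but that alone does not distinguish a software bug from corrupted memory.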

    Link to comment
    1 hour ago, IamSpartacus said:

    Yes but what do they each do is my question?

    Google 'Linux I/O schedulers'.  Your guess is as good as mine as to what they all do.  For the longest time we always used "none", but performance was actually better after they moved to 'mq' (multiqueue) and the default was set to 'mq-deadline'.

     

    In the tunable setting, as stated in the Help for that setting, "auto" selects whatever is the kernel default for that device, which is going to be 'mq-deadline'.

    Link to comment
    51 minutes ago, John_M said:

    The most recent diagnostics don't have the "conflicting memory types" errors, but they do have segfaults. A segmentation fault is an access violation, i.e. an attempt to access an area of memory in an illegal way, such as trying to write to read-only memory or trying to read memory that belongs to another process. It could be a programming bug, or I suppose some table of pointers could be getting corrupted, or it could be a hardware problem. Since your other errors are memory-related, I'd ask about your RAM. Have you run a MemTest on it? Are you operating it within spec?

    I just recently installed another 32 GB of RAM.  That could be the culprit.  
     

    I am also overclocked to 5.0 GHz all-core on my 9900K.  Didn’t really stress test it as thoroughly as I should have.  It’s getting upgraded to a 9900KS next Tuesday, so I’ll run memtest then and stress test.

     

    When installing my RAM sticks I did actually have problems posting until after I shuffled them around... so maybe I got bad new sticks.

    Link to comment
    5 minutes ago, sittingmongoose said:

    I am also overclocked to 5.0 GHz all-core on my 9900K.

    It's never wise to overclock a server.

    Link to comment
    1 hour ago, sittingmongoose said:

    I just recently installed another 32 GB of RAM.  That could be the culprit.  
     

    I am also overclocked to 5.0 GHz all-core on my 9900K.  Didn’t really stress test it as thoroughly as I should have.  It’s getting upgraded to a 9900KS next Tuesday, so I’ll run memtest then and stress test.

     

    When installing my RAM sticks I did actually have problems posting until after I shuffled them around... so maybe I got bad new sticks.

    I really hope this isn't the case, else you've wasted a lot of people's time.  If it sounds like I'm irritated - yes I am - there are enough problems already without having to deal with overclocked h/w and questionable RAM.

    Link to comment


