• Unraid slow VM boot times


    thenonsense
    • 6.7.0-rc1 Closed Minor
    Message added by limetech

    Please be aware that these comments were copied here from another source and that the date and time shown for each comment may not be accurate.

    Retrofitted from bug reports sent via the Unraid web UI:

     

    Bug Description:

    Unraid VMs are slow to boot; while they typically run normally after boot, a few times they lag during use. CPU usage is limited to a single core during boot, and that core is maxed at 100%.

    These slow boots can take upwards of several minutes, as opposed to normal boots of under 10 seconds.

    This issue did not exist on 6.5.3 and exists on every build from 6.6.0-rc1 onwards.

    This issue appears to be limited to Threadripper chips?

    The issue manifests in about 75% of boot sequences; the other 25% of the time the system boots normally.

    How to reproduce:

    Obtain a Threadripper build (possibly more than 8 cores are needed).

    Update Unraid to 6.5.3.

    Boot any VM (a Windows VM is recommended) and repeat this multiple times. Note that boot performance appears similar to bare metal.

    Update Unraid to >=6.6.0-rc1.

    Boot the same VM multiple times. Note that for the majority of boots, the spinning circle lags heavily and CPU usage hangs at 100% on one core.

     

    Please keep me posted if there is any testing I can do.  I'm looking into how to compile Unraid, but I believe I'm not incorporating all the needed packages, and I can't seem to find any newer documentation on the process.




    User Feedback

    Recommended Comments



    I did retest.  This bug report exists for the purpose of the conversation going on here:

    The bug is now logged as persistent from 6.6.0-rc1 through 6.6.2.

    Link to comment
    5 hours ago, thenonsense said:

    Adding sample XML for one of my tested VMs.

     

     

    sampleVM.txt

    @thenonsense
    CPU pairings look off: they're spread across both NUMA nodes and don't use any thread pairings, which may be impacting performance.

     

    This is a Win10 VM I have (also 1950x) that has no boot issues... sub 20 seconds.
    (note: created under 6.5.x but tested on 6.6.3)
    [screenshot: CPU pinning of the working VM]

     

    This is what your pairings look like in the GUI; note that they don't follow the same layout.

    [screenshot: thenonsense's CPU pinning as shown in the GUI]

     

    Try changing your pairings to follow a similar layout and see if that impacts boot time. Note I'm also still on the older machine type (i440fx).
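    For illustration only, and just a sketch (not your actual config): roughly what thread-paired pinning kept on one die could look like in the XML. The host CPU numbers assume a 1950X whose SMT siblings enumerate as N and N+16, which you should confirm with lscpu -e or numactl --hardware before borrowing any of it.

      <!-- hypothetical 8-vCPU layout using 4 core/thread pairs on one die;
           the cpuset values are placeholders for whatever your topology tools report -->
      <vcpu placement='static'>8</vcpu>
      <cputune>
        <vcpupin vcpu='0' cpuset='0'/>
        <vcpupin vcpu='1' cpuset='16'/>  <!-- assumed SMT sibling of host cpu 0 -->
        <vcpupin vcpu='2' cpuset='1'/>
        <vcpupin vcpu='3' cpuset='17'/>
        <vcpupin vcpu='4' cpuset='2'/>
        <vcpupin vcpu='5' cpuset='18'/>
        <vcpupin vcpu='6' cpuset='3'/>
        <vcpupin vcpu='7' cpuset='19'/>
      </cputune>
      <cpu mode='host-passthrough'>
        <topology sockets='1' cores='4' threads='2'/>
      </cpu>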

     

    I would also ensure you're on the latest BIOS, if you haven't updated already.
     

    Edited by tjb_altf4
    Link to comment

    I can't really reproduce that behaviour on my 1950X. No matter which machine type I use or which cores I give to a VM, I have never noticed extremely long boot times. The only thing I've had a couple of times is the first core maxed at 100% during boot; the VM never finishes booting in this case, and it only happens when I pass through a PCIe device. For example, when creating a fresh Windows VM with a GPU passed through, the VM sometimes shows this behaviour on boot if I give it more than one core. That bug has been reported a couple of times, and the workaround is to install Windows with only one core and add more later.

     

    EDIT:

      <numatune>
        <memory mode='strict' nodeset='0'/>
      </numatune>

    This can also cause your issue. You're telling the VM to use only RAM connected to the first node/die, but as tjb_altf4 already stated, you're mixing cores from both dies.
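    To make the two consistent options concrete, a rough sketch (the node numbers assume the usual two-node layout numactl reports on a 1950X in NUMA mode; this is an illustration, not a drop-in setting):

      <!-- Option A: keep memory strict to node 0, and pin the vCPUs only to cores on node 0 -->
      <numatune>
        <memory mode='strict' nodeset='0'/>
      </numatune>

      <!-- Option B: if the vCPUs stay spread across both dies, let memory come from both nodes too -->
      <numatune>
        <memory mode='interleave' nodeset='0-1'/>
      </numatune>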

    Edited by bastl
    Link to comment
    4 hours ago, bastl said:

    For example, when creating a fresh Windows VM with a GPU passed through, the VM sometimes shows this behaviour on boot if I give it more than one core.

    We have noticed this anomaly with multiple Windows versions and multiple CPU families.  Very strange.

    Link to comment
    On 10/31/2018 at 3:14 AM, tjb_altf4 said:

    CPU pairings look off: they're spread across both NUMA nodes and don't use any thread pairings, which may be impacting performance.

    I did the research, talked to Aorus, birthed the post discussing core and CCX assignments for Threadripper.  The pairings are correct.  The CCXs are correct.  The dies are correct.  Numactl confirms the CCXs and dies.  Good try though.

     

    I'm working off the theory that the 100% usage on one core is due to some race condition: on typical (fast) boots, CPU usage on one core always seems to spike briefly, but all cores see usage by the time the VM POSTs. That, combined with the randomness of whether a given boot is slow or fast, screams race condition 101.

    Edited by thenonsense
    Need more discovery
    Link to comment

    Followup, rude to omit screenshots:

    Our motherboards parse the core assignments differently.

    [screenshot: core assignment layout]

     

    This is shown in our VM configurators:

    [screenshot: VM configurator view]

     

    Edited by thenonsense
    Link to comment

    This is still an issue.  Incorrect core allocations and NUMA nodes are not the cause.

     

    On 10/31/2018 at 8:36 AM, limetech said:

    We have noticed this anomaly with multiple Windows versions and multiple CPU families.  Very strange.

    It seems you've also noticed this on other platforms.  I noted the fix applied to 6.5.3 outside of this bug report, but didn't include it here:
     

    "In terms of code changes, this is a very minor release; however, we changed a significant linux kernel CONFIG setting that changes the kernel preemption model.  This change should not have any deleterious effect on your server, and in fact may improve performance in some areas, certainly in VM startup (see below).  This change has been thoroughly tested - thank you! to all who participated in the 6.5.3-rc series testing.

     

    Background: several users have reported, and we have verified, that as the number of cores assigned to a VM increases, the POST time required to start a VM increases seemingly exponentially with OVMF and at least one GPU/PCI device passed through.  Complicating matters, the issue only appears for certain Intel CPU families.  It took a lot of work by @eschultz in consultation with a couple linux kernel developers to figure out what was causing this issue.  It turns out that QEMU makes heavy use of a function associated with kernel CONFIG_PREEMPT_VOLUNTARY=yes to handle locking/unlocking of critical sections during VM startup.  Using our previous kernel setting CONFIG_PREEMPT=yes makes this function a NO-OP and thus introduces serious, unnecessary locking delays as CPU cores are initialized.  For core counts around 4-8 this delay is not that noticeable, but as the core count increases, VM start can take several minutes(!)."

     

    From the 6.5.3 release notes here: 

     

    I asked this in the 6.6.2 announcement as well: do you think this is related to the issue arising after the move from 6.5.3?
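    For reference, the setting those notes describe looks like this in a kernel .config (the notes write "=yes", but Kconfig booleans are recorded as =y; I'm not claiming to know what any 6.6.x build actually ships, which is exactly the question above):

      # Preemption model described in the quoted 6.5.3 notes: voluntary preemption
      # keeps the scheduling points QEMU relies on during vCPU init, while full
      # preemption (CONFIG_PREEMPT=y) turns them into no-ops per the notes above.
      CONFIG_PREEMPT_VOLUNTARY=y
      # CONFIG_PREEMPT is not set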

    Link to comment
    18 minutes ago, thenonsense said:

    This still exists in 6.6.6.  Has this issue been looked at since then?

    Issue does not exist for me and hasn’t since they incorporated the fix in a previous version.

    • Upvote 1
    Link to comment

    I used to have slow boot issues a while ago, before I updated the BIOS on my MSI X399 Gaming Pro Carbon AC motherboard.  Those AGESA versions make all the difference.  Which motherboard are you using?

    Link to comment
    Just now, 1812 said:

    Issue does not exist for me and hasn’t since they incorporated the fix in a previous version.

    I'm not seeing a build in your signature/profile, can you tell us what you're running?

    Two people have confirmed it on ThreadRippers.  

     

    This bears a strong resemblance to the issue patched in 6.5.3, as mentioned earlier, and re-hashed here:

    On 10/31/2018 at 8:36 AM, limetech said:
    On 10/31/2018 at 4:17 AM, bastl said:

    For example, when creating a fresh Windows VM with a GPU passed through, the VM sometimes shows this behaviour on boot if I give it more than one core.

    We have noticed this anomaly with multiple Windows versions and multiple CPU families.  Very strange.

    Regarding this:

    2 minutes ago, eschultz said:

    I used to have slow boot issues a while ago, before I updated the BIOS on my MSI X399 Gaming Pro Carbon AC motherboard.  Those AGESA versions make all the difference.  Which motherboard are you using?

    I'm on an Aorus Gaming 7 X399, BIOS F11e, AGESA 1.1.0.1a.  Assuming you bumped on 11/15, you'd be running 1.1.0.2.

    Link to comment
    3 minutes ago, thenonsense said:

    Assuming you bumped on 11/15, you'd be running 1.1.0.2.  

    I'm a little behind and still running on BIOS 7B09v1B (08/14/2018) - AGESA Code 1.1.0.1A.  Should be the same version as your motherboard now it seems.

     

    Link to comment
    1 minute ago, eschultz said:

    I'm a little behind and still running on BIOS 7B09v1B (08/14/2018) - AGESA Code 1.1.0.1A.  Should be the same version as your motherboard now it seems.

    Darn, so it's not that.

    Link to comment
    1 hour ago, thenonsense said:

    I'm not seeing a build in your signature/profile, can you tell us what you're running?

    Two people have confirmed it on ThreadRippers.  

     

    Probably a Threadripper issue.

     

    The problem no longer occurs on the ProLiant servers or HP Z-series workstations that I run.

    Link to comment
    2 hours ago, eschultz said:

    Oddly, I only have one NUMA node:

    My observation has been that those with one node, i.e. running in UMA/distributed mode, aren't seeing the same issues.

    The 1950X often seems to default to UMA/distributed mode, whereas the 2xxx series seems to default to NUMA/local mode.
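    If anyone wants to check which mode their board came up in, numactl (already mentioned above) can tell you; exact output wording may vary by version:

      numactl --hardware
      # one node reported ("available: 1 nodes (0)") suggests UMA/distributed,
      # two nodes ("available: 2 nodes (0-1)") suggests NUMA/local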

    Link to comment

    Took a bit, but I finally did some testing.

    Using UMA, time to POST is faster, but time to login is still an eon.  A single core is still pegged at 100% usage.

    Using NUMA, time to POST is slower (probably because it has to work harder to find enough contiguous room for VMs), but otherwise unchanged.

     

    Conclusion:  NUMA is not the cause.  UMA is faster by about 5-10 seconds to POST, but Windows still takes several minutes to load.

     

    Edited by thenonsense
    Edit: superfluous info
    Link to comment

    Very excited for the new RC.  Unfortunately my mobo died, but I'll test this as soon as the new one comes in.  If anyone can test it in the meantime, it'd be appreciated.

    Link to comment




  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.