  • [6.7.0-rc5] Dashboard CPU erroneously stuck at 100%


    jbartlett
    • Closed Annoyance

    I noticed this back on RC3 but since RC5 came out, I updated and waited to see if it happened again. Back on RC3, it happened to CPU 0 which was assigned to UNRAID. Now it's showing up with CPU 14 which is the last core assigned to a Win10 VM. Task manager in Win10 shows no activity. Remoting into the server, htop does not show 100% utilization. I see it updating at the same rate as the other CPUs.

     

    [Attached images: cpu1.png, cpu3.png, cpu2.png]

     

    nas-diagnostics-20190302-0815.zip






    I did an "Inspect Element" on the bar and it's doing something odd. The other CPU graphs update once per second and only show integers, but this one goes from 99% to 100% in about half that time, stepping through several decimal values along the way.

     

    Animation of the change: https://gyazo.com/88ebee4954d9d3b7e9655c6f7e9f2a80

    Edited by jbartlett


    Rebooting makes the "stuck bar" go away. I've been really busy over the past few days and haven't been monitoring it closely, but now CPU 27 is pegged in the Dashboard, which is not reflected in htop. Only CPUs 2-15 are in use; the rest are available for Unraid's use. The average load includes the pegged VM. This issue has not appeared on my Intel backup system with a hex-core CPU.

     

    Side by side of the Dashboard & htop video

    https://gyazo.com/ff4b45e4173b48ed88667d637be33ad4

    Edited by jbartlett


    Experienced the same issue, phantom CPU in dashboard. Running rc-6.

    [Attached image: image.thumb.png]


    This happened to me the other day.

     

    I just noticed this behavior as well.  Running RC7.  To make things even weirder you can see the one CPU thread that is showing as pinned at 100% is isolated to a VM that is powered off.  I also noticed something else which could be completely unrelated.  But while this is happening if I switch tabs in Chrome so that Unraid is in the background, when I go back to that tab I'm at a blank (black) web page and I have to reload the page to access the WebGUI again.

     

    [Attached images: CPUIso.JPG, CPUError.JPG]

     

     

     

    spe-unraid01-diagnostics-20190413-0224.zip


    In your htop, the matching CPU from the Dashboard graph is also maxed, which isn't the scenario I reported here.

    I'd recommend isolating the VM's CPUs in the syslinux config to keep the OS away from them.

    E.g.: append isolcpus=12,13,14,15,28,29,30,31 initrd=/bzroot


    Are we just seeing a theme with Ryzen? Might be an unidentified bug.

    Edited by phbigred


    I had that same thought. I just built a 2950 system but it hasn't been on long enough at once to see if it shows up. It has not shown up on my Intel build.

    19 hours ago, jbartlett said:

    In your htop, the matching CPU from the Dashboard graph is also maxed, which isn't the scenario I reported here.

    I'd recommend isolating the VM's CPUs in the syslinux config to keep the OS away from them.

    E.g.: append isolcpus=12,13,14,15,28,29,30,31 initrd=/bzroot

     

    I already have the CPUs isolated, as you can see from my screenshot, but my syslinux file shows isolcpus=12-15,28-31 instead of isolcpus=12,13,14,15,28,29,30,31 like you described. Is there a difference there?


    Other than I didn't know about being able to hyphenate the range, no.
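
To confirm that the two spellings select the same cores, here is a minimal Python sketch. `expand_cpu_list` is a made-up helper for illustration, and it only handles plain numbers and hyphenated ranges, not any other syntax the kernel parameter may accept:

```python
def expand_cpu_list(spec):
    """Expand a CPU list such as "12-15,28-31" into a sorted list of ints.

    Illustrative helper only: supports single numbers and "first-last" ranges.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            # Ranges are inclusive on both ends.
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


hyphenated = expand_cpu_list("12-15,28-31")
spelled_out = expand_cpu_list("12,13,14,15,28,29,30,31")
print(hyphenated == spelled_out)
```

Both forms expand to the same eight CPUs, so which one ends up in the syslinux file should make no difference.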

    5 minutes ago, jbartlett said:

    Other than I didn't know about being able to hyphenate the range, no.

     

    I didn't do it manually. When I isolated the CPUs in the settings, it was added to the syslinux file automatically.

    53 minutes ago, IamSpartacus said:

    I didn't do it manually. When I isolated the CPUs in the settings, it was added to the syslinux file automatically.

    Oh neat, I never noticed the CPU Isolation part on the CPU Pinning page (never scrolled down far enough).

    On 4/17/2019 at 1:49 AM, phbigred said:

    Are we just seeing a theme with Ryzen? Might be an unidentified bug.

    If it is, it's probably tied to NUMA mode, as I don't see these problems on my 1950X in UMA mode.

    Given that the difference between htop and the Dashboard was the inclusion of iowait time, I wonder if processes are idling while waiting for resources they can't physically access due to enforced CPU node separation and isolation.

     

    Just a theory anyway.


    Is anyone in this thread having issues with RC8? My issue hasn't tripped yet, so I haven't seen the phantom CPU spikes in the GUI.


    If you're running Windows 10 version 1903/Insider Edition and you remote into it, it will peg 1 or 2 threads/CPUs once you disconnect. I found this out and had to remove the Insider Edition from my VM and go back to the previous version of Win10. As of today it's still doing it, even after the latest Insider update.

     

    If you remote in, disconnect, and then remote back in, it will settle, but once you disconnect again it'll spike. Maybe something to do with the new background blur in Win 10? Not sure.

    Edited by presence06

    44 minutes ago, presence06 said:

    If you're running Windows 10 version 1903/Insider Edition and you remote into it, it will peg 1 or 2 threads/CPUs once you disconnect. I found this out and had to remove the Insider Edition from my VM and go back to the previous version of Win10. As of today it's still doing it, even after the latest Insider update.

     

    If you remote in, disconnect, and then remote back in, it will settle, but once you disconnect again it'll spike. Maybe something to do with the new background blur in Win 10? Not sure.

    This isn't VM related; this is a general UI bug. I pin my CPUs and it happens to ones assigned to Unraid, outside of VM isolation.

    Quote

    I noticed this back on RC3 but since RC5 came out, I updated and waited to see if it happened again.

    Next time you see this happen, please type this command:

    date +%s ; cat /proc/stat

    Save that output.  Then type the command again, and save that output.  Finally post both sets of output.

     

    The code that generates the graphs is based on a daemon that polls /proc/stat every second to monitor CPU load.
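
As a rough illustration of what such a poller has to compute, per-CPU load can be derived from two `/proc/stat` samples taken about a second apart. This is only a sketch: the thread does not show the actual Dashboard daemon code, the function names `parse_proc_stat` and `cpu_load` are made up here, and the `iowait_is_busy` switch is an assumption based on the remark in this thread that the Dashboard counts iowait while htop does not.

```python
def parse_proc_stat(text):
    """Return {cpu_name: [counters...]} for the per-CPU lines of /proc/stat."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu26 user nice system idle iowait irq softirq ...".
        # The aggregate "cpu" line (no number) is skipped.
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            samples[fields[0]] = [int(value) for value in fields[1:]]
    return samples


def cpu_load(before, after, iowait_is_busy=True):
    """Percent load per CPU between two parsed samples.

    iowait_is_busy=True treats time spent in iowait as load (assumed
    Dashboard behaviour); False treats it as idle.
    """
    loads = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Sum user..steal; guest time is already included in user/nice.
        total = sum(delta[:8])
        if total <= 0:
            continue
        idle = delta[3]
        if not iowait_is_busy:
            idle += delta[4]
        loads[name] = 100.0 * (total - idle) / total
    return loads


def read_proc_stat(path="/proc/stat"):
    """Read one live sample (Linux only); call twice, about a second apart."""
    with open(path) as handle:
        return parse_proc_stat(handle.read())
```

With `iowait_is_busy=True`, a core that spent the whole interval in iowait reports 100%; with `False`, the same core reports 0%. That would match the Dashboard versus htop difference described in this thread.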

    bonienl

    Posted (edited)

    The calculation for CPU26 based on these two samples is correct: it comes out to 100%.

     

    One big difference between CPU26 and all the other CPUs is the iowait time:

          user     nice   system   idle       iowait     irq  softirq  steal  guest  guest_nice
    cpu25 784637   25697  1747440  295493912  189817     0    9588     0      0      0
    cpu26 1567648  96275  4562200  21299937   270014868  0    38855    0      0      0
    cpu27 782504   26063  1728628  295529225  183739     0    9300     0      0      0

    This implies CPU26 is spending most of its time waiting for disk I/O activity to finish.
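
To put numbers on that, here is a small Python sketch that computes what fraction of each core's accumulated time is iowait, using the three rows above. The helper name `iowait_share` is hypothetical. Note that these counters are cumulative since boot, so this is a long-run share, not the instantaneous load the Dashboard draws.

```python
# The three /proc/stat rows quoted above (cumulative counters since boot).
ROWS = """\
cpu25 784637  25697 1747440 295493912  189817    0   9588    0 0 0
cpu26 1567648 96275 4562200 21299937   270014868 0   38855   0 0 0
cpu27 782504  26063 1728628 295529225  183739    0   9300    0 0 0
"""


def iowait_share(line):
    """Return (cpu_name, iowait / total) for one per-CPU /proc/stat line."""
    name, *values = line.split()
    counters = [int(value) for value in values]
    # user, nice, system, idle, iowait, irq, softirq, steal
    total = sum(counters[:8])
    return name, counters[4] / total


for row in ROWS.splitlines():
    name, share = iowait_share(row)
    print(f"{name}: {share:.2%} of accounted time spent in iowait")
```

For cpu26 this comes out to roughly 91% of all accounted time, versus well under 0.1% for its neighbours.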

    Edited by bonienl


    I tried to identify what might be causing that: I stopped my VMs and the array, but the currently pegged CPU stayed pegged. I found that a WD NVMe drive hadn't unmounted and wouldn't unmount from "Unassigned Devices", nor could I run ls on the share; it just hung. It didn't look like I was actually using it, so I rebooted and unmounted it. If the pegging is not related to that, it'll show up again in a few days.


    Looks like the cause of the iowait is an fstrim being executed on a WD Black 256GB NVMe drive by the Trim plugin. The process gets stuck in an uninterruptible sleep.

     

    Not a bug in the Dashboard.
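
For anyone wanting to check for the same thing, a process stuck in uninterruptible sleep shows state `D` in `/proc/<pid>/stat`. The sketch below lists such processes. The helper names are hypothetical, and it is not taken from the Trim plugin or the Dashboard:

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D (Linux only)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `uninterruptible_processes()` on the server while a CPU is pegged would show whether something like fstrim is sitting in state `D`. A process that appears there briefly is normal; one that stays there across repeated checks is the suspicious case.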

    8 hours ago, jbartlett said:

    Looks like the cause of the iowait is an fstrim being executed on a WD Black 256GB NVMe drive by the Trim plugin. The process gets stuck in an uninterruptible sleep.

     

    Not a bug in the Dashboard.

    Thank you for the update. So I guess the issue is now with fstrim?


    I'm guessing it's a bad device. I rebooted to free it up and now the NVMe drive doesn't register.





