• 6.9.0 & 6.9.1 are not so stable...


    dojesus
    • Urgent

    I'm running an X570 board. I was waiting on 6.9.0 so that I could pass through my onboard audio to a VM, because doing so would lock up the server on 6.8.3.
    After the upgrade, audio IS being passed through, but random VM crashes followed. Every CPU core and thread (including the isolated ones) would max out at 100%, as would RAM usage, and everything would become unresponsive until the VM crashed, at which point the resources were returned to the server.
    I upgraded to 6.9.1 to see if this would resolve the problem...it didn't.
    There is something very wrong with the new kernel drivers, IMO.
    I've been forced to roll back to 6.8.3, because the crashes were failing the WAF.

    If you need anything from me that would help track down the issue, I'm happy to oblige.

    Screenshot_20210319-031542.jpg




    User Feedback

    Recommended Comments

    Perhaps my older logs are still there, but I only had today to roll back and finish parity sync.

    *edit* I did back up my flash drive before rolling back....would the appropriate info be contained in that backup?
     

    tower-diagnostics-20210321-1821.zip

    Edited by dojesus

    I'm seeing similar things. I ran 6.9.0 without problems for a long time, then upgraded to 6.9.1 a few days ago, after successfully configuring a swag docker container. After the upgrade, the server became completely unresponsive. I managed to roll back to 6.9.0. It's better, but it still looks like the screenshot below. I cannot access the docker tab at all, so there seems to be an issue with docker: when I try to reach the docker settings, the GUI hangs, although everything else in the GUI is fast and responsive. I guess I could boot into safe mode, but I haven't had the time. BTW, I can't download Diagnostics either: it seems to work, but no file is actually downloaded.

     

    image.png.e1700990adc19968dadd0e044446087c.png

    Edited by mrtrilby
    On 3/21/2021 at 6:24 PM, dojesus said:

    Perhaps my older logs are still there, but I only had today to roll back and finish parity sync.

    *edit* I did back up my flash drive before rolling back....would the appropriate info be contained in that backup?

    Diagnostics contains a lot of information about the current state of your server, and it also contains the syslog since the last reboot. Anything from before the reboot is not there, and unless you had Syslog Server set up to store older syslogs somewhere, there is no way to get it back.

     

    About the only way that backup would help is if you restored it, booted up, and got diagnostics.


    It was a royal pain rolling it back to 6.8.3 after having upgraded to 6.9.0, then 6.9.1.

    I may be adding a PCIe card to the mix next weekend. I may just jump to 6.9.1, get the crash, download the diagnostics, and roll back to 6.8.3...if y'all want me to do that to help discover what the problem is.


    Hi. I have now booted the server into Safe Mode w/GUI after rolling back from 6.9.1 to 6.9.0. I have stopped docker, and plugins are not running.

     

    I have 2 VMs with dedicated graphics cards, and this happens with both Nvidia (Pop!_OS) and AMD (Win10); the screenshot above shows that it is the same for both VMs. Pop!_OS boots fine and works for a while, then the same thing happens: first one of the cores goes to 100%, and it will likely stay that way for a while before all cores go to 100% and it crashes. I think the Win10 VM didn't start because I don't have enough keyboards and mice to go around.

     

    The Pop!_OS VM had a process running at 100% - io.elementary.appcenter, which is the app store. I managed to kill it, and have now restarted it - now the VM is running ok.

     

    The GUI is working fine. I have an AMD Ryzen 2400G with built-in graphics, so I have a third graphics output where I can see what's going on now.

     

    In summary, this is not due to docker, as I first thought, and it's not due to plugins (they are not running in safe mode). Both VMs were rock solid before the upgrade to 6.9.1. I think the next step is to create a new VM and see whether it works fine with a fresh start.

     

    I now started docker while in safe mode: everything seems to be working fine. 

     

    See diagnostics attached.

    t-tower-diagnostics-20210323-1016.zip

    10 hours ago, mrtrilby said:

    I think the Win10 didn't start because I don't have enough keyboards and mice to go around.

     

    You really ought to have started your own thread in General Support instead of hijacking someone else's. I think the Windows 10 VM failed to start because its AMD GPU is bound to the amdgpu driver you have loaded for the Raven Ridge integrated GPU. You need to isolate it by stubbing it, as you have done with the Nvidia card.

     

    01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] [1002:67ef] (rev e5)
    	Subsystem: Sapphire Technology Limited Baffin [Radeon RX 460] [1da2:e348]
    	Kernel modules: amdgpu
    ...
    0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208B [GeForce GT 710] [10de:128b] (rev a1)
    	Subsystem: ZOTAC International (MCO) Ltd. GK208 [GeForce GT 710B] [19da:7326]
    	Kernel driver in use: vfio-pci
    0b:00.1 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
    	Subsystem: ZOTAC International (MCO) Ltd. GK208 HDMI/DP Audio Controller [19da:7326]
    	Kernel driver in use: vfio-pci
    ...
    0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd] (rev c6)
    	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd]
    	Kernel modules: amdgpu

     

     

    AMD GPU driver integration was introduced with Unraid 6.9 but is disabled by default, which is why this was not an issue with Unraid 6.8.
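For anyone who wants to run the same check on their own listing, here is a minimal Python sketch. It assumes the text comes from `lspci -nnk` in the same layout as the excerpt above, with no PCI domain prefix on the addresses. The names `parse_lspci` and `gpus_not_on_vfio` are made up for this example, and the sample is the excerpt with the Subsystem lines trimmed. It lists every VGA device whose "Kernel driver in use" is not vfio-pci, which includes devices that show no driver line at all.

```python
import re

# Header line of an lspci -nnk entry, e.g.
# "01:00.0 VGA compatible controller [0300]: <vendor and model> [1002:67ef] (rev e5)"
HEADER = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<cls>[^\[]+) \[(?P<code>[0-9a-f]{4})\]: (?P<name>.*)$"
)

# Excerpt from the post above (Subsystem lines trimmed for brevity).
SAMPLE = """\
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] [1002:67ef] (rev e5)
	Kernel modules: amdgpu
...
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208B [GeForce GT 710] [10de:128b] (rev a1)
	Kernel driver in use: vfio-pci
0b:00.1 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
	Kernel driver in use: vfio-pci
...
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd] (rev c6)
	Kernel modules: amdgpu
"""


def parse_lspci(text):
    """Split lspci -nnk text into one dict per PCI device."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "addr": match.group("addr"),
                "class": match.group("cls"),
                "name": match.group("name"),
                "driver": None,   # filled from "Kernel driver in use:"
                "modules": [],    # filled from "Kernel modules:"
            }
            devices.append(current)
        elif current is not None and line.startswith("Kernel driver in use:"):
            current["driver"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Kernel modules:"):
            names = line.split(":", 1)[1].split(",")
            current["modules"] = [name.strip() for name in names]
    return devices


def gpus_not_on_vfio(devices):
    """Return VGA devices whose driver in use is anything other than vfio-pci."""
    return [
        dev for dev in devices
        if dev["class"] == "VGA compatible controller" and dev["driver"] != "vfio-pci"
    ]


if __name__ == "__main__":
    for dev in gpus_not_on_vfio(parse_lspci(SAMPLE)):
        print(dev["addr"], "driver:", dev["driver"], "modules:", dev["modules"])
```

A GPU showing up in this list is not necessarily misconfigured: the integrated Raven Ridge GPU is meant to stay with the host. The point is only to see at a glance which cards are not stubbed for passthrough.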

     

    Edited by John_M
    Added explanation
    1 hour ago, John_M said:

     

    You really ought to have started your own thread in General Support instead of hijacking someone else's. I think the Windows 10 VM failed to start because its AMD GPU is bound to the amdgpu driver you have loaded for the Raven Ridge integrated GPU. You need to isolate it by stubbing it, as you have done with the Nvidia card.

     

     

    I was not aware that I was hijacking someone's thread. Is it not helpful if somebody else has the same issue and can shed some light on it? I got the Win 10 VM up again and had the server running for a while on 6.9.0, but after a couple of hours I saw the same thing as the thread starter, dojesus: all cores at 100%. Finally the server rebooted by itself.

     

    54 minutes ago, mrtrilby said:

    I got the Win 10 VM up again, and had the server running for a while on 6.9.0, but after a couple of hours I see the same as threadstarter dojesus again: all cores at 100%. Finally the server rebooted by itself.

     

    Your Windows VM would have worked under Unraid 6.9 until you chose to enable the amdgpu driver, which bound to both your integrated AMD GPU and your discrete one. From then on the Windows VM would fail because its GPU could no longer be passed through. Did my suggestion help? Go to Tools -> System Devices, tick the Radeon GPU, click "Bind selected to VFIO at boot", then reboot.
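After the reboot, one way to confirm the binding took effect is to look at the device's `driver` symlink in sysfs. The following is a rough Python sketch, not an Unraid tool: `bound_driver` is a name invented for this example, and the address assumes the RX 460 sits at 01:00.0 in PCI domain 0000, as in the listing earlier in the thread.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    On Linux, a bound device has a "driver" symlink in its sysfs directory
    that points at the driver's directory, so the link's basename is the
    driver name.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Assumption: the discrete Radeon is at 01:00.0 in PCI domain 0000.
    addr = "0000:01:00.0"
    print(addr, "->", bound_driver(addr) or "no driver bound")
```

If the bind worked, the card should report vfio-pci here rather than amdgpu. The `sysfs_root` parameter is only there so the function can be pointed at a different directory for testing.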

     

    In fact, you don't need the amdgpu driver at all unless you plan to use the integrated GPU to accelerate video transcoding by Jellyfin or Plex.

     

    Edited by John_M
    Added uses for iGPU


    Status Definitions

    Open = Under consideration.
    Solved = The issue has been resolved.
    Solved version = The issue has been resolved in the indicated release version.
    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.
    Retest = Please retest in latest release.

    Priority Definitions

    Minor = Something not working correctly.
    Urgent = Server crash, data loss, or other showstopper.
    Annoyance = Doesn't affect functionality but should be fixed.
    Other = Announcement or other non-issue.