  • Random Crashes


    Edvard_Grieg
    • Urgent

    I'm not exactly sure what's going on, but I've been experiencing random crashes lately and I've been trying to isolate a common factor.

     

    HW:

    E5-2697v2 x2

    SuperMicro X9Dri LN4 Motherboard

    128GB DDR3 ECC

     

    Previous config:

    Unraid 6.12.10 - rock solid stable with infinite uptime other than version updates

     

    What changed:

    Added Sparkle A380 Elf Card

    Upgraded to Unraid 7 beta 2 to support the Arc card

     

    What I've been finding is that the server will crash: no terminal response, and its IP stops responding. The hardware is still running, but at the OS level it appears to be fully non-responsive. The added frustration is that logging does not appear to be working well. Logs are generated while the server is up, but I've yet to capture anything during the crash itself. I have logs set to mirror to flash, and syslog is pointed at the server IP.

     

    While the server is up, it's successfully transcoding via Plex, and Unmanic is bulk transcoding some less important files from H264 to H265. I've played with the number of worker processes, deliberately trying to get it to crash, and it seems fine, but then randomly I'll refresh a page and find it non-responsive; the server will have crashed, requiring a hard power cycle.

     

    I've run a memtest, which came back fine, and again, prior to adding the Intel card and upgrading to 7 beta 2, I had zero issues. I can obviously take out the Intel card and downgrade back to 6.12.10, but I'd like to see if I can pinpoint what is going on.

     

    Attached are the diagnostics files. Any insights towards getting this working would be great, or at least towards isolating the cause so I know whether I need to downgrade, replace the GPU, etc.

    atlas-diagnostics-20240813-1655.zip




    User Feedback

    Recommended Comments

    The diagnostics only show the log since last boot, you should set up a syslog server and post this file also after the next crash.
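
If the syslog target is the crashing server itself, the last messages before a hang may never reach disk. As a rough sketch of an off-box alternative, the receiver below could run on a second machine, assuming the server can forward syslog over UDP to a chosen host and port. The port `5514`, the file name `remote-syslog.log`, and the function names are hypothetical choices for illustration, not Unraid settings.

```python
import os
import socket
import time


def format_record(sender_ip, payload, now):
    """Prefix one raw syslog datagram with the receiver's own timestamp."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    return f"{stamp} {sender_ip} {text}\n"


def serve(bind_addr="0.0.0.0", port=5514,
          out_path="remote-syslog.log", max_messages=None):
    """Receive UDP syslog datagrams and force each one to disk.

    Runs until interrupted unless max_messages is given.
    Returns the number of datagrams written.
    """
    count = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(out_path, "a", encoding="utf-8") as out:
        sock.bind((bind_addr, port))
        while max_messages is None or count < max_messages:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            out.write(format_record(sender_ip, payload, time.time()))
            # Flush and fsync per message so nothing sits in a buffer
            # on the receiving side.
            out.flush()
            os.fsync(out.fileno())
            count += 1
    return count
```

Calling `serve()` on the second machine and pointing the server's remote syslog at that machine's IP and port would leave a timestamped file that survives the hang. The receiver adds its own timestamp, so the last line shows when the server went quiet even if the server's clock is wrong.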

     

    There are at least 3 changes between your stable system and the current status:

    - v7 beta

    - new hardware

    - new background transcoding tasks using this new HW

     

    If the syslog server does not provide interesting information, it would be helpful to check one thing at a time:

    - v7 without the ARC GPU → stable or not?

    - v7 with ARC GPU but no transcoding → stable or not?

    Link to comment
    7 hours ago, ChatNoir said:

    The diagnostics only show the log since last boot, you should set up a syslog server and post this file also after the next crash.

     

    There are at least 3 changes between your stable system and the current status:

    - v7 beta

    - new hardware

    - new background transcoding tasks using this new HW

     

    If the syslog server does not provide interesting information, it would be helpful to check one thing at a time:

    - v7 without the ARC GPU → stable or not?

    - v7 with ARC GPU but no transcoding → stable or not?

    Thanks. As I mentioned, logging is set up, but it's not capturing details. Attached are the syslog details in case there is something I misconfigured.

     

    I understand what changed and how to eliminate items, but I'd really rather catch what is happening to potentially fix it. I can yank the GPU and downgrade and I'm sure things will be fine, but I'd like to actually monitor what happened. What other ways of monitoring/logging are there?
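
One low-tech option, when nothing reaches syslog, is a heartbeat file: a small loop that appends a timestamp every second and forces it to disk, so the last complete line marks when the system stopped. The sketch below assumes a Linux host with `/proc/loadavg`. The file name `heartbeat.log` and the function names are hypothetical, and the output path would need to point at storage that survives the hard power cycle.

```python
import os
import time


def read_loadavg(path="/proc/loadavg"):
    """Return the kernel's load average line, or a placeholder if unreadable."""
    try:
        with open(path, encoding="ascii") as f:
            return f.read().strip()
    except OSError:
        return "loadavg-unavailable"


def write_heartbeat(out, now, extra):
    """Append one heartbeat line and force it to disk."""
    out.write(f"{now:.3f} {extra}\n")
    out.flush()
    os.fsync(out.fileno())


def run(out_path="heartbeat.log", interval_s=1.0, beats=None):
    """Write heartbeats until interrupted, or `beats` times if given."""
    n = 0
    with open(out_path, "a", encoding="ascii") as out:
        while beats is None or n < beats:
            write_heartbeat(out, time.time(), read_loadavg())
            n += 1
            if beats is None or n < beats:
                time.sleep(interval_s)
    return n


def last_heartbeat(path):
    """Return the timestamp of the last complete heartbeat line, or None."""
    last = None
    with open(path, encoding="ascii", errors="replace") as f:
        for line in f:
            # A line without a trailing newline was cut off mid-write.
            if not line.endswith("\n"):
                continue
            parts = line.split()
            if not parts:
                continue
            try:
                last = float(parts[0])
            except ValueError:
                continue
    return last
```

After the next hang and reboot, `last_heartbeat("heartbeat.log")` gives the approximate time of the freeze, which can then be lined up against transcode job start times or whatever the remote syslog captured.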

    Screenshot_20240814_071227_Firefox.jpg

    Link to comment

    So I've done some more testing and while I feel I've isolated the issue, I still can't get any more details.

     

    It specifically does appear to be tied to the GPU and potentially 'overtaxing' it.  

     

    This is a Sparkle Intel Arc A380 Genie card. The main way to reproduce has been running FileFlows or Unmanic with multiple concurrent threads and QSV processing. When enough threads are added, the server eventually hangs.
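
To take Unmanic and FileFlows out of the picture, the same load can be approximated with a small script that ramps up concurrent QSV transcodes and records each step to disk before starting it. This is only a sketch: it assumes an `ffmpeg` build with QSV support on the PATH and an H.264 source file, and the function names and `ramp.log` are hypothetical.

```python
import os
import subprocess
import time


def qsv_command(src):
    """ffmpeg command: decode H.264 and encode HEVC via QSV, discarding output."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-hwaccel", "qsv", "-c:v", "h264_qsv", "-i", src,
        "-c:v", "hevc_qsv", "-an", "-f", "null", "-",
    ]


def log_step(log, message):
    """Write one timestamped line and force it to disk before continuing."""
    log.write(f"{time.time():.3f} {message}\n")
    log.flush()
    os.fsync(log.fileno())


def ramp(src, max_workers, log_path="ramp.log", build=qsv_command):
    """Run 1, 2, ... max_workers concurrent jobs, logging before each step.

    Returns a list of (worker_count, exit_codes) for the steps that finished.
    """
    results = []
    with open(log_path, "a", encoding="utf-8") as log:
        for n in range(1, max_workers + 1):
            log_step(log, f"starting {n} concurrent job(s)")
            procs = [subprocess.Popen(build(src)) for _ in range(n)]
            codes = [p.wait() for p in procs]
            log_step(log, f"finished {n} job(s), exit codes {codes}")
            results.append((n, codes))
    return results
```

If the server hangs partway through, the last "starting" line in `ramp.log` without a matching "finished" line shows the concurrency level that was running at the time.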

     

    As confirmation, for the last few days I've had no concurrent flows running and the system has exhibited no stability issues.

     

    I initially thought it might be an issue with Unmanic, so I started rewriting processing into FileFlows and it was working, but when increasing the number of nodes, I saw the same behavior.

     

    A new data point, though: this time when I was able to reproduce it, I was at the terminal for my server with a tail -f syslog running on the monitor. Nothing was dumped to syslog or output to the screen. It appears the system is actually hanging and not necessarily crashing per se.

     

    I am going to try replacing the card with an A580 to see if it is something hardware related to the card. Regardless, something is neither being handled properly from an exception standpoint nor allowing logs to be captured properly.

    Link to comment
    1 minute ago, ChatNoir said:

    Could it be a Power Supply issue ?

    I would think that's doubtful. This is a Supermicro 3U with dual 920 PSUs, and it never had any stability issues before, even with CPUs at prolonged full tilt and all drives spun up. In these scenarios only the GPU is being taxed: CPUs are at ~25% utilization, memory use is low (128GB total and <25% utilization), and only a couple of drives are active at a time.

     

    The A380 does not use a separate PSU connector and just draws from the PCIe slot... I was going to pick up an A580 and see if extra cooling and RAM make a difference. If this bombs I may be going the Intel route.

     

    The only other thought I had was whether IO contention is being created by the multiple flows, but realistically I can't imagine this is any more than what gets triggered with NZBs etc.

    Link to comment

    So the ASRock A580 (aside from not fitting in the chassis) appears to produce a similar result, although possibly at a higher thread count. Note that the A580 was powered by a separate external PSU, so power should not have been a factor.

     

    So upping the number of concurrent threads still causes the system to freeze without producing any logs or any other indication. I could understand the container crashing or some other downstream event, but overall system instability as the result is what is concerning.

     

    Edit: The more I read on the Intel Arc side, the more it seems that it is the culprit, through some combination of drivers/kernel/software. There should be some big caveats on external Intel GPUs for this release. Not sure if the iGPU has the same issues, but at least the Arc seems to be troublesome. Different setup, but the behavior appears to be the same as described in this Intel thread: https://community.intel.com/t5/Intel-ARC-Graphics/Random-Freezes-with-Arc-Series-GPUs-Thread/m-p/1508559

     

    I don't know if there are some compounding issues, as my setup is PCIe 3.0, has no ReBAR support, etc.

    Edited by Edvard_Grieg
    Link to comment

    I have experienced this a few times lately as well. The other day, it happened 3 times. I was on a VM once and it locked up. It does seem to be related to the GPU. I only had the issue when messing with VMs. I have three GPUs and have been trying to get things configured properly.

    Link to comment
    2 hours ago, bobbintb said:

    I have experienced this a few times lately as well. The other day, it happened 3 times. I was on a VM once and it locked up. It does seem to be related to the GPU. I only had the issue when messing with VMs. I have three GPUs and have been trying to get things configured properly.

    I gave up on Arc and found a deal on a P2200; it's running at full tilt without any issues. In some ways it was trickier to set up some of the Nvidia parameters, but ultimately it's much more stable.

    Link to comment
    16 hours ago, Edvard_Grieg said:

    I gave up on Arc and found a deal on a P2200; it's running at full tilt without any issues. In some ways it was trickier to set up some of the Nvidia parameters, but ultimately it's much more stable.

    Mine are actually all Nvidia.

    Link to comment


