  • [SOLVED][UNRAID 6.10 RC2] - Server "dies" after being left alone?


    Biomatrix
    • Solved Minor

    So,
    this has happened before. I thought I had fixed it by removing the CPU FREQ plugin, but it seems I have not.
    In this particular instance, I left the server running a parity check and came back to the Main page showing 8 of the 10 disks in the array as "missing". A hard reboot brought everything back.

    Attached are the last diagnostics that were on the flash drive, as well as the syslog (mirrored to flash until the server is stable).
    The last time I touched the server was Jan 28, around 17:30.
    The manual/forced reboot was on Jan 29, around 11:30.

     

    syslog fermentor-diagnostics-20220128-1817.zip




    User Feedback

    Recommended Comments

    Does this consistently happen during a parity check?  If so, I would restart in Safe Mode and disable all VMs and docker apps from running and try a parity check again.

    Link to comment
    10 minutes ago, Squid said:

    Does this consistently happen during a parity check?  If so, I would restart in Safe Mode and disable all VMs and docker apps from running and try a parity check again.

    you know what - I don't know.

    I've only recently built this server and have been chasing little issues (I had zero issues on my Dell). I can't say it happens consistently during a parity check, because sometimes I start a parity check and come back to a clean run...

    but it's worth a shot for sure. I will attempt this and come back with more (or less?) information!

    Link to comment
    1 hour ago, Biomatrix said:

    you know what - I don't know.

    Trouble is that hard lockups / reboots are always a guessing game unless there's some error that precedes them.  Safe Mode (since you're using the nVidia plugin) and VMs with passthrough being disabled significantly narrows down issues

    Link to comment
    23 hours ago, Squid said:

    Trouble is that hard lockups / reboots are always a guessing game unless there's some error that precedes them.  Safe Mode (since you're using the nVidia plugin) and VMs with passthrough being disabled significantly narrows down issues

    yea...
    new syslog - this is where it crashes/reboots
    after 5 hours of idle...
     

    Jan 30 05:24:16 Fermentor emhttpd: read SMART /dev/sde
    Jan 30 05:25:16 Fermentor kernel: sd 11:0:3:0: attempting task abort!scmd(0x00000000de361d93), outstanding for 60341 ms & timeout 60000 ms
    Jan 30 05:25:16 Fermentor kernel: sd 11:0:3:0: [sde] tag#1003 CDB: opcode=0x4d 4d 00 40 00 00 00 00 00 04 00
    Jan 30 05:25:16 Fermentor kernel: scsi target11:0:3: handle(0x000d), sas_address(0x5000c50084f99f3d), phy(7)
    Jan 30 05:25:16 Fermentor kernel: scsi target11:0:3: enclosure logical id(0x50030480091bdf7f), slot(7) 
    Jan 30 05:25:16 Fermentor kernel: scsi target11:0:3: enclosure level(0x0000), connector name(     )
    Jan 30 05:25:16 Fermentor kernel: sd 11:0:3:0: task abort: SUCCESS scmd(0x00000000de361d93)
    Jan 30 05:30:14 Fermentor kernel: Linux version 5.14.15-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Thu Oct 28 09:56:33 PDT 2021
    Jan 30 05:30:14 Fermentor kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
    Jan 30 05:30:14 Fermentor kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
    Jan 30 05:30:14 Fermentor kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
    Jan 30 05:30:14 Fermentor kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
    Jan 30 05:30:14 Fermentor kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
    Jan 30 05:30:14 Fermentor kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
    Jan 30 05:30:14 Fermentor kernel: signal: max sigframe size: 1776
    Jan 30 05:30:14 Fermentor kernel: BIOS-provided physical RAM map:
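
For anyone chasing a similar crash: the kernel's "Linux version" banner is the first line logged after each boot, so the lines just before it in a mirrored syslog are the last thing written before the crash or reboot. Below is a minimal sketch of pulling those lines out with grep. The function name lines_before_boot is hypothetical, and the path in the usage line is an assumption about where the mirrored syslog lives, so adjust both as needed.

```shell
#!/bin/bash
# Print the N lines that precede every kernel boot banner in a syslog file.
# "kernel: Linux version" is the first kernel line after a (re)boot,
# as seen in the excerpt above. lines_before_boot is a hypothetical helper.
lines_before_boot() {
    local file=$1
    local n=${2:-10}   # number of context lines before each boot, default 10
    grep -B "$n" -- 'kernel: Linux version' "$file"
}
```

Usage would look something like `lines_before_boot /boot/logs/syslog 20` (path assumed). If the same kind of line, such as a task abort on one disk, shows up before every reboot, that is a lead worth following; if nothing shows up, it is back to the guessing game.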


    So... I'm going down the line; next I'm going to disable Nvidia.
     

     

    14 hours ago, WizADSL said:

    You may also want to run a memory test.

     

    Memory tested good - all 384GB of it... LOL
    I can redo the test.

    Nothing changed in the system's behaviour going from the previous 64GB of RAM to this 384GB from the Dell; the crashes happened then too.

    I do appreciate everyone!



     

    Link to comment

    UPDATE:
    So I have disabled 2 user scripts that I brought over from my older box.
    Disabling them seems to have fixed it (so far).
    Nvidia was the last piece that wasn't really disabled, so I disabled these scripts first...

    I don't even know if they are necessary anymore?

    Nvidia-Persistance-First 

    #!/bin/bash
    nvidia-smi --persistence-mode=1

     

     

     Nvidia-Power-Reduction

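    #!/bin/bash
    # If the GPU sits in P0 (max performance) with no compute apps running,
    # enable persistence mode and kill whatever still holds /dev/nvidia* open.
    gpupstate=$(nvidia-smi --query-gpu="pstate" --format=csv,noheader)
    gpupid=$(nvidia-smi --query-compute-apps="pid" --format=csv,noheader)
    if [ "$gpupstate" == "P0" ] && [ -z "$gpupid" ]; then
        nvidia-smi -pm 1; fuser -kv /dev/nvidia*
    fi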

     

    Either way, I will come back in a day or two to report, if nothing else.
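
If someone wants to see when the Nvidia-Power-Reduction condition would fire without actually killing anything, here is a rough dry-run sketch. It uses the same two nvidia-smi queries as the script above, but only lists the processes holding the device nodes (fuser -v without -k). The helper name gpu_idle_in_p0 is made up for illustration, and this is not a claim that the kill step is what caused the crashes.

```shell
#!/bin/bash
# Dry-run sketch: report whether the power-reduction condition WOULD fire.
# gpu_idle_in_p0 is a hypothetical helper, not part of nvidia-smi.
gpu_idle_in_p0() {
    local pstate=$1   # performance state string, e.g. "P0" or "P8"
    local pids=$2     # compute-app PIDs, empty when none are running
    [ "$pstate" = "P0" ] && [ -z "$pids" ]
}

# Only query the GPU if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    pstate=$(nvidia-smi --query-gpu=pstate --format=csv,noheader)
    pids=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
    if gpu_idle_in_p0 "$pstate" "$pids"; then
        echo "GPU idle in P0; the original script would kill these holders:"
        fuser -v /dev/nvidia* 2>&1   # list holders only, no -k
    else
        echo "GPU state: ${pstate:-unknown}; nothing to do"
    fi
fi
```

Running this on a schedule for a while and logging the output would show how often the original script was actually killing something.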

    Link to comment




