
Restarting your unRAID server when the NIC driver fails



So folks, my unRAID server has an Intel server NIC in it with dual XFP ports and runs the ixgbe driver.  Over the last few unRAID releases, the NIC driver has died at random (and yes, I uploaded the diagnostics files when it happened).  There's been no real fix for this issue, and of course it usually happens when I'm overseas for work, so it's 12+ hours before I can remote in via PiKVM and bounce the server to get the network back up.

 

I got a little tired of this, so here's the result of me and ChatGPT4 having a bit of a discussion about how to deal with it automatically.  The solution documented here is a pair of User scripts, one a "Ping Watchdog" and the other a supervisor for the watchdog (in case the watchdog dies).

Here is the watchdog script, called "ping_watchdog".  It pings a pair of IP addresses (my core switch and my gateway) so that a single IP being down doesn't trigger the reboot; both have to fail.  Sometimes my gateway is off the air for a while as I do some arcane Mikrotik things on it.

This script is set to run at the first array start only and stays running forever (unless it dies for some reason; see the supervisor script below).
 

#!/bin/bash

TARGET_IP_1="192.168.0.254" # Replace with your gateway router IP address
TARGET_IP_2="192.168.0.240" # Replace with your core switch IP address
PING_COUNT=4 # Number of pings to send
PING_TIMEOUT=5 # Timeout for each ping in seconds
FAIL_THRESHOLD=30 # Number of consecutive failed ping checks before restarting
CHECK_INTERVAL=60 # Time in seconds between ping checks

failed_pings=0

# Returns 0 if the target answered, non-zero otherwise
ping_check() {
  local target_ip=$1
  ping -c "$PING_COUNT" -W "$PING_TIMEOUT" "$target_ip" >/dev/null 2>&1
}

while true; do
  ping_check "$TARGET_IP_1"
  result1=$?

  ping_check "$TARGET_IP_2"
  result2=$?

  # Count a failure only when BOTH targets are unreachable; any reply resets the streak
  if [ "$result1" -ne 0 ] && [ "$result2" -ne 0 ]; then
    failed_pings=$((failed_pings + 1))
    echo "$(date) - Pings to $TARGET_IP_1 and $TARGET_IP_2 failed. Consecutive failed ping checks: $failed_pings"
  else
    failed_pings=0
  fi

  if [ "$failed_pings" -ge "$FAIL_THRESHOLD" ]; then
    echo "$(date) - Restarting unRAID server due to $FAIL_THRESHOLD consecutive failed ping checks"
    /usr/local/sbin/powerdown -r
    exit 0
  fi

  sleep "$CHECK_INTERVAL" # Wait for the specified time before the next iteration
done
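
To check the counting logic without pinging anything or rebooting, here is a minimal dry-run sketch.  It is not part of the original scripts: the simulate function and its two-digit argument format are made up for illustration, and FAIL_THRESHOLD is set to 3 only to keep the example short.

```shell
#!/bin/bash
# Dry-run sketch (hypothetical helper, not the real watchdog).
# Each argument stands for one check: two digits, one exit code per target.
# "11" = both targets failed, "01" = first target answered, and so on.

FAIL_THRESHOLD=3   # small value for illustration only

simulate() {
  local failed=0 round r1 r2
  for round in "$@"; do
    r1=${round:0:1}
    r2=${round:1:1}
    # Same rule as the watchdog: only a double failure extends the streak
    if [ "$r1" -ne 0 ] && [ "$r2" -ne 0 ]; then
      failed=$((failed + 1))
    else
      failed=0
    fi
    if [ "$failed" -ge "$FAIL_THRESHOLD" ]; then
      echo "reboot"
      return 0
    fi
  done
  echo "no reboot (streak: $failed)"
}

simulate 11 11 01 11 11   # prints: no reboot (streak: 2)
simulate 11 11 11         # prints: reboot
```

With the real values (FAIL_THRESHOLD=30, CHECK_INTERVAL=60) there are 29 sleeps of 60 seconds between the first and the thirtieth failed check, so the reboot comes after roughly half an hour of continuous outage, plus however long the pings themselves take.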


Next is the "ping_watchdog_supervisor", which is set to run every hour.  If the first script isn't running, it kicks it off again.

 

#!/bin/bash

PING_WATCHDOG_SCRIPT="ping_watchdog"

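# pgrep -f matches against the full command line.  The trailing "/" stops the
# pattern from also matching this supervisor (ping_watchdog_supervisor).  It assumes
# the plugin runs each script as /tmp/user.scripts/tmpScripts/<name>/script.
pid=$(pgrep -f "^/bin/bash.*/tmp/user.scripts/tmpScripts/$PING_WATCHDOG_SCRIPT/")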

if [ -z "$pid" ]; then
  echo "$(date) - Ping watchdog script not running. Restarting..."
  /usr/local/emhttp/plugins/user.scripts/start_script.sh "$PING_WATCHDOG_SCRIPT"
else
  echo "$(date) - Ping watchdog script running with PID $pid"
fi
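
A note on the pgrep pattern: "ping_watchdog" is a prefix of "ping_watchdog_supervisor", so a pattern that ends at the script name would also match the supervisor's own command line and always report the watchdog as running.  Ending the pattern with a "/" avoids that.  The sketch below compares the two with grep -E (pgrep also takes extended regular expressions).  The two command lines are assumptions about how the User Scripts plugin launches scripts, not captured output.

```shell
#!/bin/bash
# Hypothetical command lines, assuming the plugin runs each script as
# /tmp/user.scripts/tmpScripts/<name>/script
watchdog_cmd="/bin/bash /tmp/user.scripts/tmpScripts/ping_watchdog/script"
supervisor_cmd="/bin/bash /tmp/user.scripts/tmpScripts/ping_watchdog_supervisor/script"

loose='^/bin/bash.*/tmp/user.scripts/tmpScripts/ping_watchdog'
strict='^/bin/bash.*/tmp/user.scripts/tmpScripts/ping_watchdog/'

# usage: matches <pattern> <command line>
matches() {
  if printf '%s\n' "$2" | grep -Eq "$1"; then
    echo yes
  else
    echo no
  fi
}

echo "loose  pattern vs watchdog:   $(matches "$loose" "$watchdog_cmd")"
echo "loose  pattern vs supervisor: $(matches "$loose" "$supervisor_cmd")"
echo "strict pattern vs watchdog:   $(matches "$strict" "$watchdog_cmd")"
echo "strict pattern vs supervisor: $(matches "$strict" "$supervisor_cmd")"
```

To see what the real command lines look like on your own server, "pgrep -af tmpScripts" lists every matching process with its full command line.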

 

Coupled together, these two scripts mean that if my NIC ever dies, unRAID performs a clean reboot without hurting the array.

 

Edited by Kaldek
17 minutes ago, MrGrey said:

I've heard a lot about ChatGPT. Is it true?

Depends what the question is, but yes, it's amazing for turning ideas into code.  I don't trust it 100% of course, and I'm using it to give me ideas and examples.  I get to bypass all the grief I'd get by asking a human.

In my view, these generative AI models are necessary.  The amount of time we all burn on questions where the person answering has their own emotions around the question and how they want to answer it is utterly insane.  ChatGPT in particular gives a very simple, concise and objective response to everything asked of it.  The trick is knowing how to phrase your questions, hence the term "prompt engineering".  I'm much better at this than I ever was at "Google-Fu".
