Anyone have an Emergency unRAID checklist?



I was thinking it would probably be a good idea to print out a sheet that says "what to do if x happens" and tape it to the inside of my unRAID case.  Obviously it would need to be fairly short and easy to read.  Steps to take after a failed drive would be the primary information, possibly along with some important console commands (how to start/stop things) or basic troubleshooting ideas for common issues.

 

 

Does anyone already have something like this?  If so, would you mind sharing it?  It would be nice not to have to scramble to look up the procedure for swapping out a bad drive when in that situation.

 

 

Just a thought...


That is a good idea, because whenever I have a problem my first reaction is to restart, and then I realize that I didn't get a syslog.  Something that reminded me to grab it first would be great; maybe I'll create something.  Or better yet, automatically save a log when shutdown or restart is selected.
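
A rough sketch of that "save the log first" idea, not a tested unRAID tool. It assumes Python is available on the server, that the live log is at /var/log/syslog, and that the flash drive is mounted at /boot; the function name `save_syslog` and the `/boot/logs` folder are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def save_syslog(src="/var/log/syslog", dest_dir="/boot/logs"):
    """Copy the live syslog to a timestamped file and return the new path.

    Both default paths are assumptions about a typical unRAID layout.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest / f"syslog-{stamp}.txt"
    # copyfile copies only the contents, which avoids permission
    # errors on a FAT-formatted flash drive.
    shutil.copyfile(src, target)
    return target


if __name__ == "__main__":
    print("Saved", save_syslog())
```

Running something like this before any shutdown or restart would leave a copy of the log that survives the reboot.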

 

Maybe one of the most used lists would be one that covers what to do with a disabled, missing, or unformatted disk, etc.

 

1. Check data and power cables

2. Run a SMART check or other non-destructive drive tests.

3. If the disk is bad, what to do from there.


The most important thing is don't panic or rush. Too many times you see people post on here, then post again an hour later about how they messed things up even more because they weren't patient enough to wait.

 

A machine can sit (turned off) indefinitely until you know the right way to fix it. Don't panic and guess at solutions. That should be on there.


Of course... Panic is bad.  My main scenario is this: you have a red ball.  You go down, open the case to see what's up, and have a list like this:

 

 

Red Ball

  • Check Cables/Power Supply
  • Check syslog
  • Order of steps if it's determined the drive is bad

etc.

 

 

I've got a spare drive precleared and ready to go, but I'm not sure I know the exact order to do things, so I'd be on here searching for the steps.  Having that taped to the inside of my case seems like a good idea.

 

 


This is an excellent topic, one we have touched on in the past, but I don't remember where.  There are already some good ideas above, I'll throw out a few others.

 

It would be good to have 2 lists, one for the emergency (the red ball), and one for recommended practices, the preparation for the emergency that makes it easier to handle.  Some ideas for the prep list:

* Either have a diagram of your system with the location of every drive with its disk number and serial number, or apply labels or markings to every drive with their disk and serial numbers.  When you have a drive failure, it's often quite difficult to identify the physical drive at fault, especially when you feel a little panic.  You generally cannot see the serial numbers or makes and models without removing them, and you don't want to shut down or remove drives at first, not until you have gathered all relevant information.  It goes without saying that this needs to be updated with every drive or layout change.

* Have SMART reports prepared for ALL drives, preferably obtained within the last 6 months.  When there is a problem with a drive, you want to be able to compare a new report with your saved one, for important clues about its current status.

* Make sure that you are doing Parity Checks regularly, preferably scheduled monthly.  If a rebuild is necessary, it can only be completely reliable if parity is completely valid.

* Have a spare drive already Precleared and ready in case it's needed (already mentioned above)

* Have the Red ball list printed and located appropriately (obvious but ...)

 

Ideas for the red ball list:  (not complete)

1. There's no need to panic.  The issue could be as minor as a cable connection that has slipped, or yes, as 'major' as a drive that has completely failed, but then that is why you installed UnRAID, right?  You will get to replace it with the same or bigger drive, with no data loss(!), and perhaps end up with more space!

2. Do NOT shut the server down!  We need to gather important information first.

3. Capture the syslog.  It's often vital for diagnosing the drive issue.

4. Obtain a SMART report for the drive, if possible.  (It may quickly tell you there's nothing wrong with the drive itself!)

** If the drive has totally failed (not spinning or not recognized by the BIOS), obviously you won't be able to obtain a SMART report, but then who needs it, the drive is dead!

** There are other reasons the kernel may have dropped the drive, so if you cannot access the drive for a SMART report, wait until after the next reboot and try again.

5. Examine the new SMART report, then if necessary compare it with the saved one for the problem drive.  Ask for help if necessary.

** If the new SMART report looks good, then you know the drive is good(!), and the problem is elsewhere.

** If you see something questionable, compare it with the previous SMART report.  If what seems questionable is exactly the same on the previous report, then ignore it.  You are only interested in what is worsening.  A number of reallocated sectors or ATA errors may look bad, but if the numbers are the same as the previous, then they aren't recent!

** If you see issues that appear to be mechanical in nature, that appear to indicate the drive is failing, then ...  [replace and rebuild procedure]

** If you see issues with bad sectors (increased Current Pending Sectors), then ...  [deal with bad sectors]

6. Examine syslog for related drive issues.  Ask for help if necessary.  (If you create a post, attach entire syslog, zipped if necessary.)

** Drive issues generally fall into 2 separate categories, physical issues with the drive itself, and issues with the interface to the drive (such as power, cabling, controller, driver, etc). 

Use the UnRAID wiki page The Analysis of Drive Issues to help determine which it is.  (The UnRAID wiki page Drive Symbols may be helpful in determining the drive's associated ata and scsi symbols.)

** Specific error flags may point to the actual problem.  For example, if you see either BadCRC or ICRC, then almost certainly there is a bad SATA cable.  Just replace it!

** If the drive is missing on the initial drive inventory, then check cable connections.  One may have fallen off, or is dislodged just enough to lose either the power or data connection.

** If the drive was present, but the syslog shows loss of communication errors followed by the drive being disabled, then there are a number of possible causes.  The drive is usually fine.  Cabling or back frame connection may be loose.  Controller or port may be faulty.  Power may be insufficient or faulty.

** etc etc ...

7. etc etc ...
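
The "compare the new SMART report with the saved one" step above could be sketched roughly like this. It is only an illustration: it assumes both reports are saved text output from smartctl that includes the usual attribute table (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE), and the watch list and function names are my own picks, not an official set.

```python
# Attributes to watch; this list is an assumption, adjust to taste.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(report_text):
    """Return {attribute_name: raw_value} from a smartctl attribute table."""
    attrs = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10
        # columns; rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def worsened(old_text, new_text, watched=WATCHED):
    """Return {name: (old_raw, new_raw)} for watched attributes that increased."""
    old = parse_smart_attributes(old_text)
    new = parse_smart_attributes(new_text)
    return {
        name: (old[name], new[name])
        for name in watched
        if name in old and name in new and new[name] > old[name]
    }
```

An empty result means none of the watched counters went up since the saved report, which matches the "only what is worsening matters" rule in step 5. It does not replace reading the full report.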

 

The above is much more wordy than some would like, but the extra detail is designed to be both helpful and reassuring.  I'm sure some would prefer just a simple bullet list.  Admittedly, the wiki pages above are much more technical than many would like.

 

The initial steps will be the same, but once info is gathered, then the steps can diverge widely based on the symptoms detected.  Perhaps another approach would be helpful then, a symptom list with recommended steps for each.
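
As a starting point for that symptom list, here is a minimal sketch that scans a saved syslog for the two error flags mentioned above (BadCRC and ICRC) and groups the recommended step by ata port. The table `SYMPTOM_STEPS` and the function `scan_syslog` are hypothetical names, and the table would need more entries before it is really useful.

```python
import re
from collections import defaultdict

# Symptom -> recommended step. Only the two flags named in this thread
# are listed; anything else would be an addition of your own.
SYMPTOM_STEPS = {
    "BadCRC": "Replace the SATA cable",
    "ICRC": "Replace the SATA cable",
}

ATA_PORT = re.compile(r"\bata(\d+)")


def scan_syslog(text, symptoms=SYMPTOM_STEPS):
    """Return {ata_port: sorted list of recommended steps} for flagged lines."""
    found = defaultdict(set)
    for line in text.splitlines():
        port = ATA_PORT.search(line)
        if not port:
            continue
        for flag, step in symptoms.items():
            # Match the flag as a whole word within the line.
            if re.search(rf"\b{re.escape(flag)}\b", line):
                found[f"ata{port.group(1)}"].add(step)
    return {port: sorted(steps) for port, steps in found.items()}
```

The ata number still has to be matched to a physical drive, which is where the Drive Symbols wiki page and a labelled drive diagram come in.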

 

The one very important thing to remember here is that sometimes a red-balled drive is bad, but many times (perhaps even more often) the drive is good and it's just a problem with cabling, power, the controller, or the port.  There have been a lot of drives RMA'd with nothing wrong with them.  If you just create a simple emergency red ball list based on "if it's red, replace and rebuild", then you won't fix certain problems.  The new drive may then red ball too, if there wasn't anything wrong with the previous drive (and it wasn't a loose cable).

