Two Failed Disks - Need Advice Regarding Recovery



Hi. I thought I'd quickly check on my Unraid server before heading out for the (Canadian) Thanksgiving weekend. It turns out that today my server decided to have a mini-meltdown as I was greeted by two failed disks (yay!, lucky me).

 

- I was a bonehead and rebooted the system before grabbing a syslog or diagnostic output. Fortunately I left the Unmenu syslog window open, so I copied the syslog there, which covered the entire failure, and attached it as "Syslog Copied From Unmenu". My apologies for it not being the complete syslog or diagnostic report.

 

- After rebooting, I grabbed the diagnostic output (see attached).

 

From my observations, I suspect the following:

 

- Disk11 is failing. I could hear a disk "clicking"/retrying. At the same time, I could see Disk11 appearing and disappearing from the array. Many of the errors in "Syslog Copied From Unmenu" refer to disk11 (although not all of them).

 

- I suspect that the parity disk is OK, but I'd appreciate a second opinion on that from someone who would better understand the syslogs.

 

If my hunch is correct, and the parity disk isn't also failing, I'd like to use the parity drive to rebuild Disk11 onto a new disk. I'm not sure how to do this, since Unraid has kicked the parity drive out of the array. I'd appreciate it if someone could help me through this so I don't lose all of the data from Disk11.

 

Thanks for the advice and assistance, I really appreciate it!

tower-diagnostics-20151009-2147.zip

Syslog_Copied_From_Unmenu.zip


Link to comment

Hi trurl, thanks for taking a look, and sorry for my slow reply (Thanksgiving took up the whole weekend).

 

So I checked the mobo BIOS and the Supermicro AOC-SASLP-MV8 BIOS, and what I found was odd. At first I couldn't see disk11 (3TB Seagate Z1F322KH) during boot. I had to boot a few times to get into the Supermicro BIOS, though, because it POSTed so fast that I couldn't read the drive list or the command to enter the setup menu (I had to look it up... CTRL+m). After a couple of boots, disk11 appeared, and it has appeared consistently during boot since then (over half a dozen boots).

 

I then booted Unraid to see if this was just a fluke. I found that all of the drives are now showing. Disk11 is showing green, but the parity drive is showing a red 'x' (although it is detected). I've attached the diagnostic report.

 

I've started a reiserfsck status check from the gui in maintenance mode on Disk11 to see if there are any file system errors. I'm checking this due to the string of errors that were present in the syslog prior to failure. I'll post back when I get the results.

 

If things don't randomly crap out on me again, what else should I do? Smart tests? More file system checks? Should I preclear the 4TB parity as a stress test before adding it to the array?

 

Thanks for your thoughts.

tower-diagnostics-20151013-2149.zip

Link to comment

So the reiserfsck --check test of disk11 returned fine (see attached).

 

I ran a short SMART test on the parity drive and disk11. Disk11 is clean. For the parity drive, there was a "command timeout" row highlighted in yellow with a raw count of 1, but that doesn't strike me as worrisome (I've attached the SMART reports, although I guess they are also included in the diagnostic I added below).

 

After this, I decided to run a long SMART test on both drives, just to stress them a bit and see if any other errors might be produced. A couple minutes in I could hear the same disk click/retry sound. I checked the syslog and saw a string of errors for sdc (Disk11). Connection reset errors and such. See the attached diagnostic file for the syslog that includes the errors.
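For anyone wanting to put a number on errors like these, here is a minimal Python sketch that counts link-reset style messages per ATA port in a saved syslog. It assumes the usual libata kernel wording ("hard resetting link", "link is slow to respond", "failed command"); the sample lines are made up for illustration and are not taken from the attached diagnostics, and `count_link_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed patterns: common libata kernel messages. Adjust to match your syslog.
LINK_ERROR = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: .*"
    r"(?:hard resetting link|link is slow to respond|failed command)",
    re.IGNORECASE,
)


def count_link_errors(lines):
    """Return a Counter mapping ATA port name (e.g. 'ata3') to error count."""
    counts = Counter()
    for line in lines:
        match = LINK_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample only; these lines are NOT from the attached syslog.
SAMPLE = [
    "Oct 13 22:50:01 Tower kernel: ata3: hard resetting link",
    "Oct 13 22:50:06 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)",
    "Oct 13 22:50:11 Tower kernel: ata3.00: failed command: READ DMA EXT",
    "Oct 13 22:50:12 Tower kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
]

if __name__ == "__main__":
    for port, total in count_link_errors(SAMPLE).most_common():
        print(port, total)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_link_errors(open("syslog.txt"))` (the file name is a placeholder). A port that racks up far more resets than the others points to that drive, its cable, or its controller port.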

 

I am going to wait until the long SMART test finishes on the parity drive, but afterward, what should I try with regard to Disk11? Re-seating the SATA cable? Are any other tests recommended?

 

Thanks

tower-smart-20151013-2248.zip

reiserfsck_--check_-_ST3000DM001-1CH166-Z1F3ZZKH.txt

tower-diagnostics-20151013-2255.zip

Link to comment

So after a long SMART test of my parity drive, no new errors were found.

 

Otherwise, little new to report, other than Disk11 throwing the same bunch of link reset errors when being accessed. I've attached the latest diagnostic report.

 

So how can I recover my array?

 

- I don't feel like I can successfully rebuild the parity drive. If I attempt this, Disk11 will probably continue to throw link reset errors until Unraid boots the drive out of the array and/or the parity drive becomes invalid again.

 

- I can RMA Disk11, but without a valid parity, I can't rebuild Disk11.

 

I seem to be stuck. Is there any way to force Unraid to treat the parity drive as valid and allow me to rebuild Disk11?

tower-diagnostics-20151014-2111.zip

Link to comment

Checking the cables is the first thing you should have done when experiencing intermittent problems.

But as the drive is clicking, it seems to have a proper connection.

 

If I were you, I would stop stressing disk11 and try to extract the data on a local PC.

There is no telling how long it will keep running.

Once your data is safe, you can go ahead and stress it to death.

 

In practice, you could replace disk11 with a new (and precleared) disk.

Assuming there is no other problem with the controller or the parity disk, let unRAID build a new parity (new config).

I wouldn't trust the parity as the parity was also red at some point!

 

Then copy the former disk11 from a local PC to your unRAID.

It is quite possible that the drive will drop out and you will have to start over several times until all your data is copied.

If you happen to have a backup of the former disk11 you're better off.

Link to comment
  • 2 weeks later...

A quick update.

  • I later booted the server and found 4 (!) disks missing.
  • It turned out that all 4 disks were connected to my motherboard via SATA cables, while the drives connected to my Supermicro card via SAS-to-SATA cables were seemingly unaffected.
  • I figured it was either a SATA controller issue or a SATA cable issue. I tried replacing all of the SATA cables with older ones, and all of the drives appeared again, no issues!!!

 

It wasn't a single cable, it was ALL of my cables. I replaced all of my SATA cables about 6 months ago because I had one bad cable and thought I could probably get better quality cables than the ones you get for free with motherboards. I bought the cables from Monoprice, having been impressed with their cable quality in the past. Well... that was a mistake!

 

HOWEVER, I'm not out of the woods yet.

  • I rebuilt parity successfully, with no errors.
  • Next, I swapped in a new RMA disk for the disk that had initially failed (because I want to send the old disk back), and rebuilt it from parity. This was also seemingly successful, with no errors...
  • BUT 4 minutes after the rebuild, the parity drive goes down with a billion read errors and shows up as a failed disk in the array. Also, unbelievably, Unraid reported that the parity drive had 18 QUINTILLION writes [I feel like that deserves a Doctor Evil Muahahah].
  • Suspicious, I check a couple of videos on the rebuilt disk. Many of them fail to start playing... that's not normal. I mount the old disk (the one I replaced) with SNAP and try the same files, and they play fine. OK, so my parity drive was failing during the rebuild.
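As an aside on the "18 quintillion writes" figure: that number is very close to 2^64 (about 1.8 × 10^19), so it may be a counter artifact rather than a real count. The sketch below is plain arithmetic and assumes, without confirmation from the diagnostics, that the write counter is stored as an unsigned 64-bit integer that wrapped around when something was subtracted from zero.

```python
# Assumption: the write counter is an unsigned 64-bit integer.
UINT64_MODULUS = 2 ** 64


def wrap_uint64(value):
    """Reduce an integer to the range an unsigned 64-bit counter can hold."""
    return value % UINT64_MODULUS


# A counter at 0 that gets decremented once wraps to the largest 64-bit value.
wrapped = wrap_uint64(0 - 1)
print(wrapped)             # 18446744073709551615
print(wrapped // 10**18)   # 18, i.e. roughly 18 quintillion
```

If that is what happened, the huge write count is a symptom of the drive dropping out mid-operation, not a measure of how much was actually written.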

 

So now I want to do two things.

 

1) Copy the contents of my OLD disk11 to the RMA drive. To do this without parity, I understand I need to preclear the RMA drive, partition it with gdisk, create a file system with mkreiserfs, mount both drives with SNAP, and then copy the files with something like Midnight Commander. I need to send the old disk11 back soon, because my shipping window is rapidly closing.
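Since files rebuilt from the bad parity turned out to be corrupt, it may be worth verifying the copy rather than trusting it. Below is a minimal Python sketch of a copy-then-verify pass, as an alternative to a manual Midnight Commander copy. It assumes both drives are already partitioned, formatted, and mounted; `copy_and_verify` and `sha256_of` are hypothetical helper names, and any mount paths you pass in are your own.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_root, dest_root):
    """Copy every file under source_root to dest_root, then compare hashes.

    Returns a list of relative paths whose copies do not match the source.
    Empty directories are not recreated in this sketch.
    """
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    mismatched = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        dest_file = dest_root / relative
        dest_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_file, dest_file)  # copies data and timestamps
        if sha256_of(source_file) != sha256_of(dest_file):
            mismatched.append(str(relative))
    return mismatched
```

Usage would look something like `copy_and_verify("/mnt/disk/old_disk11", "/mnt/disk/rma_disk")`, where both paths are placeholders for wherever SNAP mounts the drives. Note that verifying reads each source file a second time, which is extra work for a disk that may be marginal; if the old disk starts throwing errors, it may be safer to copy first and verify only the files that matter most.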

 

2) What do I do about my 4TB disk?

  • I've been trying to do a preclear run on it, to see if it comes out error free (probably not)
  • However, despite starting the array with no parity disk assigned, preclear refuses to recognize the 4TB disk as an unassigned disk (even though it shows up that way on the Unraid main page).
  • Edit - I've realized that I needed to upgrade the preclear script. I imagine that should solve the issue.

 

I've attached a diagnostic report of my latest config. I have the diagnostic report for when the parity drive blew up, but it is too big to attach here. You can find it here: https://www.dropbox.com/s/o5hmfpns0fb3x7g/tower-diagnostics-20151025-0935.zip?dl=0

 

Thanks for the help

tower-diagnostics-20151029-2058.zip

Link to comment
