unRAID Stopped / Disk Missing = Bad Disk / Bad Backplane?



I'm running unRAID 5.0.5 and started up Plex today and got an error that the server wasn't available.  Server was working fine last night.  Pulled up the web interface and see the server is stopped and says Disk 1 is missing.  Here's my build thread with configuration.

 

 

Read through the Analysis of Drive Issues Wiki page and searched the Syslog (attached) for the provided keywords.  I found two from the list:

 

1) qc timeout

2) revalidation failed

 

Here's an excerpt from the Syslog with the keywords:

 

Jan  4 18:19:20 Tower kernel: sas: ata7: end_device-2:0: dev error handler

Jan  4 18:19:20 Tower kernel: ata7.00: ATA-8: ST31500541AS,            5XW0270P, CC32, max UDMA/133

Jan  4 18:19:20 Tower kernel: ata7.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32)

Jan  4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0x27)

Jan  4 18:19:20 Tower kernel: ata7.00: failed to read native max address (err_mask=0x4)

Jan  4 18:19:20 Tower kernel: ata7.00: HPA support seems broken, skipping HPA handling

Jan  4 18:19:20 Tower kernel: ata7.00: revalidation failed (errno=-5)

Jan  4 18:19:20 Tower kernel: mvsas 0000:02:00.0: Phy0 : No sig fis

Jan  4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1527:mvs_I_T_nexus_reset for device[0]:rc= 0

Jan  4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0xef)

Jan  4 18:19:20 Tower kernel: ata7.00: failed to set xfermode (err_mask=0x4)

Jan  4 18:19:20 Tower kernel: ata7.00: limiting speed to UDMA/133:PIO3

Jan  4 18:19:20 Tower kernel: mvsas 0000:02:00.0: Phy0 : No sig fis

Jan  4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1527:mvs_I_T_nexus_reset for device[0]:rc= 0

Jan  4 18:19:20 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1957:Release slot [0] tag[0], task [f75c3900]:

Jan  4 18:19:20 Tower kernel: sas: sas_ata_task_done: SAS error 8a

Jan  4 18:19:20 Tower kernel: ata7.00: failed to set xfermode (err_mask=0x11)

Jan  4 18:19:20 Tower kernel: ata7.00: disabled
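The keyword search described above can also be scripted. The following is only a minimal sketch: the two keywords are the ones matched in this thread (the wiki page lists more), and the `find_drive_errors` helper and its regular expression are illustrative names, not part of unRAID or any existing tool.

```python
import re

# Keywords taken from the two matches found in this thread; the wiki page
# lists more, which are not reproduced here.
KEYWORDS = ("qc timeout", "revalidation failed")

# Matches the "ataN:" or "ataN.NN:" prefix the kernel puts on libata messages.
ATA_PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def find_drive_errors(lines, keywords=KEYWORDS):
    """Return {ata_port: [matching lines]} for lines containing any keyword."""
    hits = {}
    for line in lines:
        lowered = line.lower()
        if not any(k in lowered for k in keywords):
            continue
        match = ATA_PORT.search(line)
        port = match.group(1) if match else "unknown"
        hits.setdefault(port, []).append(line.rstrip("\n"))
    return hits


# A few lines copied from the excerpt above.
sample = [
    "Jan  4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0x27)",
    "Jan  4 18:19:20 Tower kernel: ata7.00: failed to read native max address (err_mask=0x4)",
    "Jan  4 18:19:20 Tower kernel: ata7.00: revalidation failed (errno=-5)",
    "Jan  4 18:19:20 Tower kernel: ata7.00: qc timeout (cmd 0xef)",
]

for port, matched in find_drive_errors(sample).items():
    print(port, len(matched))
```

To scan a real log, pass an open file object instead of `sample`. Grouping by ATA port makes it easier to see whether the errors belong to a single drive or are spread across several.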

 

It points to Disk #1 as noted on the web interface.  Per the Analysis of Drive Issues:

 

Drive interface issue #4

 

This is an example of what is probably a loose backplane or cable connection issue: (could be either the SATA connection or the power connection or both)

 

[SNIP]

 

Note: There are no CRC errors here, which normally implicate a bad cable or two.

 

These problems are often related to a backplane, perhaps loose, perhaps vibration-related, perhaps defective. If the SATA link remains up for a while, but communications are clearly bad, then the emphasis should probably be on the power connection. The easiest way to test whether it is the fault of the backplane is to reinstall the drive outside of the backplane.

 

If there is no backplane involved, then the same considerations apply to the cable connections, each end of both the SATA and power cables, including any power cable splitters that may be involved. It is common after opening a computer case, to jostle the cables, and SATA cables are notorious for coming loose, if they aren't the locking type. It is a good habit to check all SATA connections just before closing a case up.

 

Good quality SATA and power cables and splitters are strongly recommended. Always make certain that they are firmly connected, and not subject to vibration. The same is even more important for backplanes: make sure that drives are firmly and well seated in their trays, and cannot be vibrated loose.

 

I haven't been in the case recently, but I'm assuming it's a bad connection or cable. 

 

1) Can I open the box and check connections while the server is still running or do I need to shut it off? 

2) If it is the backplane, wouldn't the other three disks on that row be showing missing?  Can I shut down the server and move the "missing" disk to an empty drive slot and see if it works?

 

Any other insight you folks can offer? 

 

Thanks

syslog-2017-01-04.txt.zip

Link to comment

Post the SMART report

 

I'm having difficulty figuring out the disk identifier for the "missing disk".  I think I screwed up.  I started the array in maintenance mode, which I believe disabled the missing disk.  I tried using ls -l /dev/disk/by-id in telnet to find the missing disk's identifier, but it doesn't show up.  Is there a way to obtain the disk identifier now?  Prior to starting in maintenance mode, the Main > Array Devices page in the web interface didn't list the disk's identifier.

Link to comment

It looks like it disappeared from your syslog after that snippet you posted.

 

Shutdown and check connections as you planned to do.

 

Shutdown, checked connections, powered on.  Main > Array Devices shows Disk 1 "not installed / unassigned".  I am able to see it as a selection in the pull down menu.  It also gives me the disk identifier as sdd.

 

Ran smartctl -a -d ata /dev/sdd.  SMART Report attached.  I reviewed the important attributes to check, as outlined in Obtaining a SMART Report, and didn't see anything that is a red flag for the disk, nor did I see any error codes that signal bad cabling.

 

Assuming the drive is ok, the next step is to reconstruct the drive, correct?

 

    You can re-enable the hard drive and reconstruct it as follows:

 

        Stop the array.

        Go to the Main page (Devices in version 4.7) and unassign the disk.

        Go to the Main page (Array Operations section) and start the array.

        Stop the array again.

        Go to the Main page (Devices in version 4.7) and re-assign the disk.

        Go to the Main page (Array Operations section) - the system should indicate there is a "new" drive to replace the disabled one. Check the confirmation box and click the Start button to start a reconstruct/rebuild of the disk.

 

Running smartctl -t long /dev/sdd for detailed report and will report back when complete.

 

Thanks for your help!

 

Link to comment

I'm not having success getting an extended SMART report.  When run, it says it will take 368 minutes but never gives me an update after that. 

 

root@Tower:~# smartctl -t long /dev/sdd

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".

Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 368 minutes for test to complete.

Test will complete after Thu Jan  5 16:01:27 2017

 

Use smartctl -X to abort test.

root@Tower:~#
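As far as the output above goes, returning straight to the prompt looks like normal behaviour: `smartctl -t long` only asks the drive to start the self-test, and the drive then runs it internally. The result is normally read back later from the self-test log, for example with `smartctl -l selftest /dev/sdd`. The sketch below just works through the timing printed above. It assumes the "will complete after" time is simply the start time plus the estimate.

```python
from datetime import datetime, timedelta

# Values copied from the smartctl output above.
ESTIMATED_MINUTES = 368
completes_after = datetime(2017, 1, 5, 16, 1, 27)

# Assumption: completion time = start time + estimate, so subtracting the
# estimate recovers the moment the test was started.
started_at = completes_after - timedelta(minutes=ESTIMATED_MINUTES)

hours, minutes = divmod(ESTIMATED_MINUTES, 60)
print(f"estimated duration: {hours} h {minutes} min")
print(f"test started at:    {started_at:%H:%M:%S}")
```

So no progress update is expected at the prompt for roughly six hours after the command returns.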

 

Does it matter if the array is started, stopped, or started in maintenance mode when performing the SMART test?

Link to comment

This is not an attribute we often monitor, but it does say failing:

184 End-to-End_Error        0x0032   084   084   099    Old_age   Always   FAILING_NOW 16
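To show why smartctl flags this line, here is a small sketch that splits the attribute row into its columns and compares the normalized value with the threshold. The `SmartAttribute` type and the two helper functions are made-up names for illustration. The rule applied (an attribute counts as failed when its normalized value is at or below a non-zero threshold) is the usual SMART convention.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw_value: str


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one row of the smartctl attribute table.

    Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split(None, 9)
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        when_failed=fields[8],
        raw_value=fields[9],
    )


def is_failing(attr: SmartAttribute) -> bool:
    """True when the normalized value is at or below a non-zero threshold."""
    return attr.thresh > 0 and attr.value <= attr.thresh


# The row quoted above.
row = "184 End-to-End_Error        0x0032   084   084   099    Old_age   Always   FAILING_NOW 16"
attr = parse_attribute(row)
print(attr.name, attr.value, attr.thresh, is_failing(attr))
```

Here the normalized value 84 sits below the threshold 99, which matches the FAILING_NOW column.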

 

Is the next step to re-enable the drive and reconstruct?

You have this exactly backwards. Rebuilding will re-enable the drive. I think I would use a new disk. And preclear it to test of course. Maybe someone else will have a different opinion on whether to re-use that disk or not.
Link to comment

I share trurl's concern about that drive.  You've got 5.73 years of 24/7 operation on it, and it's showing its age, with a number of troubling SMART attributes.  None are clearly critical right now, but they're worrisome.  184 is seriously lower than the maker ever thought it should go, and lower than I've ever seen.  I wouldn't trust that drive without some serious Preclear testing, 2 or 3 passes at least.

Link to comment

Thanks for the insight.  I guess this is a good time to shrink my array.  I've been planning on doing it anyway.  I have a total of 13.5TB and only about 7TB used (down from 10TB).  I'd like to reduce the number of drives and then upgrade to unRAID OS 6 with dual parity.  Guess it's time to do some reading:

 

Shrink Array

 

Clear an Array Drive Script

 

From my limited reading, I *believe* this is possible versus installing a new drive for the failing one.
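As a quick sanity check on the numbers in this post, the arithmetic can be sketched as follows. It assumes the 13.5TB figure is total data capacity and that Disk 1 is the 1.5TB ST31500541AS seen in the syslog excerpt. It only checks totals, not whether the files fit on each individual remaining disk.

```python
total_tb = 13.5    # total array capacity quoted in the post
used_tb = 7.0      # data currently stored (approximate figure from the post)
removed_tb = 1.5   # assumption: Disk 1 is the 1.5TB ST31500541AS

remaining_tb = total_tb - removed_tb
free_after_tb = remaining_tb - used_tb

print(f"capacity after removing Disk 1: {remaining_tb} TB")
print(f"free space after shrinking:     {free_after_tb} TB")
```

On those assumptions, dropping the drive still leaves several terabytes free.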

Link to comment

Your first post in the thread said the array was stopped with the disk missing. Can you start the array and read from the emulated drive? If so then you should be able to copy its contents to other drives then proceed with shrinking.
Link to comment

Correct.  After shutting down and checking connections, Main > Array Devices shows Disk 1 "not installed / unassigned".  The disk does show as a selection in the pull down menu.  If I add it to the array, it shows blue / "New Disk, Not In Array".  I'm uncertain of how to proceed from here.  If the drive is still accessible, it should have the original data on it.  If I can get the array to accept it as is (I haven't written anything to the disk in weeks), I can copy the files off the drive.  Is there a way to accomplish this or am I stuck having to install a new HD and rebuild the disk from the parity?

Link to comment

No, I'm not saying you should do anything with the disk at the moment. But it shouldn't say New Disk either if it is the same disk you had in that slot before.

 

What I was saying is that if you have parity, and you only have one missing disk, then unRAID should be able to emulate the disk's data, even if it is completely removed from the server. And you should be able to copy that data from the emulated disk without even worrying about, or even actually still possessing, the physical disk.

 

Post a screenshot.

Link to comment

Interesting.  This is the first time I've run into a disk issue.  I didn't realize the parity would emulate a single missing drive.  I haven't started the server except for maintenance mode at this point.  Screen shots attached.

 

XCxfGGe.png

W75jMpU.png

 

To clarify, if I remove the troubled disk 1 and bring the array on-line, the parity should emulate the missing drive and allow me to copy the missing data to the other drives?  Does the parity emulate a folder for "Disk 1"?  I ask because my server is set up with high-water allocation.  Sorry for the questions, just trying to wrap my mind around it.

Link to comment

When a disk is missing, or if it is just disabled (redball), unRAID can calculate the disk's data from parity plus all the other disks. This is how it is able to rebuild a disk. And if you try to read the disk unRAID calculates its data in just the same way. And if you try to write to the disk, unRAID will actually update parity as if the disk was written, so it can later read that written data by calculating it in just the same way.
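The calculation described above can be illustrated with a toy example. This is a sketch of single-parity XOR on made-up four-byte "disks", not unRAID's actual code. The `xor_blocks` helper is a hypothetical name.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" of four bytes each (made-up contents).
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x01\x02\x03\x04"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Disk 1 goes missing: reading it means XOR-ing parity with every surviving disk.
emulated_disk1 = xor_blocks([parity, disk2, disk3])
print(emulated_disk1 == disk1)

# Writing to the emulated disk: only parity is updated, as if the disk had been written.
new_disk1 = b"\xff\x20\x30\x40"
parity = xor_blocks([new_disk1, disk2, disk3])

# A later read recovers the written data the same way.
print(xor_blocks([parity, disk2, disk3]) == new_disk1)
```

The same property explains why only one missing disk can be emulated with single parity: with two disks gone, the XOR no longer has a unique solution.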

 

I'm not entirely clear about the colored indicators next to the disk in your screenshots. I don't think V5 was always consistent in the way these were used regardless of what the descriptions for these colors are. A better idea of what unRAID would do if you started the array can be seen in the Array Operation tab, which we don't have a screenshot for.

Link to comment

Here's a screen shot of the Array Operation tab with no Disk #1 selected (the red square screen shot previously posted).

 

JuO0OXL.png

 

Here's a screen shot of the Array Operation tab with Disk #1 selected (the blue circle screen shot previously posted).

 

E7NmqS5.png

 

So if I:

 

1) Bring the array on-line with the questionable Disk #1 removed from the server (or left as unselected as seen in the red square screen shot previously posted)

2) Copy the files from the emulated Disk #1 to existing hard drives

3) Is there another step after this to shrink the array and remove bad Disk #1 for good (effectively reducing the size of the array by 1.5TB)?

Link to comment

1 and 2 correct.

 

For 3, since you are already unprotected there is no point in trying to maintain parity, so the simple thing would be to go to Utils - New Config and reassign the drives you want to keep in the slots you want them to be in, and then when you Start it will rebuild parity. It is very important that you don't accidentally assign a data disk to the parity slot or it will be overwritten.

 

Here it is in the wiki: Shrink Array

Link to comment

Thanks for your patience and assistance, trurl.  Got everything transferred off the emulated drive, shrank the array, and rebuilt the parity.  All is working again.  Now to read up on upgrading to unRAID OS 6!

Link to comment
