Eek! My first disk failure, what do I do?


JustinAiken

A Samsung 1.5 TB drive died while I was copying files to it... or at least, I think it did. The light next to it in the web interface is red, its temperature shows up as 0 degrees, and under errors it says 1,538.

 

Thankfully it only has 150GB of stuff on it, since it died early on while I was copying files to it... Is this what I do?

 

* Request an RMA from Samsung

* Remove the device from the array

* Turn off the unRaid

* Swap out the drive in the same slot

* Turn on the unRaid

* Assign the drive in the array

* Let parity rebuild it?

 

Or is there a way to fix the drive? I'm surprised to see problems with this one, because the Samsung 1.5TBs have been running cooler than my WD Greens...

 

 

Link to comment

The red ball means that a write to that disk has failed and the server has disabled it.  The first thing you should do is power down, reseat the drive, then power back up.  It could be something as simple as a loose cable.

 

If that doesn't fix it then try running SMART on the drive.  Post the results here.
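A minimal sketch of pulling that report from a script, assuming the smartmontools `smartctl` utility is installed on the server and that the suspect drive shows up as `/dev/sdX` (a placeholder; the real device name isn't given in this thread). The helper names are made up for illustration.

```python
import subprocess

def smart_report_command(device):
    # "smartctl -a" asks smartmontools to print all SMART information
    # for the given device.
    return ["smartctl", "-a", device]

def read_smart_report(device):
    # smartctl uses a non-zero exit status to flag various conditions,
    # so the return code is not treated as a fatal error here.
    result = subprocess.run(
        smart_report_command(device), capture_output=True, text=True
    )
    return result.stdout

# "/dev/sdX" is a placeholder; substitute the real device name.
print(" ".join(smart_report_command("/dev/sdX")))
```

Typing the printed command at the console by hand gives the same report, which can then be pasted here.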

 

If SMART indicates that the drive has in fact failed, then the procedure you wrote out is correct.  Technically you can skip the second step (remove the device from the array), but it won't hurt anything to do that.

 

You will probably want to replace the drive as soon as possible, not wait weeks for the Samsung RMA to be processed.  As of now you haven't lost any data, but if another drive were to fail right now you likely would.  If you are extra paranoid you may want to keep the server powered off until you have your replacement drive ready to go.

 

If you can afford it, I would recommend buying a new 1.5 or 2 TB drive today and replacing this failed drive as soon as possible.  Then you can take your time with the 1.5 TB Samsung running SMART tests, etc, to see if it has actually failed.  If it has, RMA it.  If it hasn't, add it back to the array as a new disk or keep it as a spare.

Link to comment

Actually, looking, I only have 100 GB of data on there, 75 of which matters...

 

Would it be better to copy that 75GB off the disc, remove the disc from the array, and rebuild the parity without it? That way I wouldn't be at risk waiting even 2 days for a new drive to come from newegg...

Link to comment

That would work.

 

After removing the disk from the array you'll need to initialize a new disk configuration.

 

If you are on an older version of unRAID, when the array is stopped you'll see a button labeled "restore".  You'll need to check the box under it and then press the button.

 

If you are on one of the last couple of releases, the button was replaced with an exactly equivalent command, which you type after logging in as "root" via telnet or at the system console.

 

Either way, you'll first:

Stop the array

Go to the devices page and un-assign the drive you'll be removing

 

Then either:

  Log in and type:

  initconfig

  Answer Yes if it prompts you to confirm your desire to initialize a new configuration.

or

  Check the box under the button labeled "restore" and then press the "restore" button to initialize a new disk configuration.

 

Then go back to the main web-management page (or refresh it).  All the remaining assigned drives will appear as "blue".

Press "Start" to build a new set of parity data based on the new disk configuration.

 

 

Link to comment

Okay, turned off the machine, reseated the drive, turned it back on... the disk5 light was still red, with 0 reads, writes, or errors...

 

Here's the SMART report: http://pastebin.com/3dBySjgY

It is responding, therefore it might be/have been a loose cable, power splitter, drive cage, etc.

 

It was disabled because a "write" to it failed.  You must rebuild it from parity and the remaining drives if you want to recover the data that was not written. (And also if you want the disk's file system to be sane, as the block it could not write might be part of the file system itself or part of a file.)

 

So to get the disk to be recognized as its own replacement you must have the array forget the model/serial number of the existing drive.

 

To do that:

Stop the array.

Un-assign the drive that is marked as "red".

Start the array by pressing "Start".  (This causes the array to forget the model/serial number of the un-assigned drive.)

Stop the array once more.

Re-assign the drive that was marked as "red".

Start the array once more.  It will reconstruct the contents of the failed drive, including what it could not write originally.

Once the reconstruction is complete, the array will once again be protected against a single drive failure.

 

Joe L

Link to comment

Ok, thanks Joe, did that... it's rebuilding now, got past the point where it errored before, no errors now...

 

Now I just have to wait for it to finish its 7-hour rebuild (even though the drive was only 10 percent full :P)

 

I have the Norco case, so the SATA connection was over a SAS -> SAS cable, and the power was going to the whole backplane... any idea on what I should look out for?

Link to comment

Now I just have to wait for it to finish its 7-hour rebuild (even though the drive was only 10 percent full :P)

 

Every bit must be rebuilt no matter if it is a 1 or a 0.  Therefore empty space takes just as long to rebuild as your actual data.
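To see why, here is a toy sketch of a single-parity rebuild, assuming the parity is a plain bitwise XOR across the data drives. It is only an illustration, not unRAID's actual code; the `xor_blocks` and `rebuild` helpers and the byte values are made up. The rebuild walks every byte position of the missing drive, and a position that holds zeros takes exactly the same work as one that holds data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_disks, parity):
    """Recompute the missing disk from the parity plus the surviving disks."""
    return xor_blocks(surviving_disks + [parity])

# Toy 4-byte "disks"; disk2 is mostly empty (zeros), like a nearly empty drive.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0x00, 0x00, 0x00])
disk3 = bytes([0x0F, 0xF0, 0x00, 0xFF])
parity = xor_blocks([disk1, disk2, disk3])

# Pretend disk2 is the failed drive and rebuild it from the others.
recovered = rebuild([disk1, disk3], parity)
print(recovered == disk2)  # True
```

Nothing in `rebuild` knows or cares which bytes are "empty", which is why the rebuild time tracks the size of the drive rather than how full it is.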

 

I have the Norco case, so the SATA connection was over a SAS -> SAS cable, and the power was going to the whole backplane... any idea on what I should look out for?

 

And I assume that there are other drives on the same backplane that are working, right?  Maybe try a different drive bay?  I know there's one bay in my Norco 4220 in which the drive just doesn't want to seat quite right.  For now I'm just not using it, but I'm sure I'll be annoyed once I get to 19 drives...

 

Also, is your PSU powerful enough to run all your drives?  You should have a PSU with at least 30A on a single +12V rail.
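A rough back-of-the-envelope sketch of that check. The per-drive and "everything else" figures below are assumptions for illustration only (roughly 2 A per 3.5" drive on +12V during spin-up, plus an arbitrary 5 A allowance for the rest of the system); check the drive datasheets and the PSU label for real numbers.

```python
def required_12v_amps(drive_count, spinup_amps_per_drive=2.0, other_12v_amps=5.0):
    # Worst case: every drive spins up at the same moment.
    # Both default values are illustrative assumptions, not measured figures.
    return drive_count * spinup_amps_per_drive + other_12v_amps

for drives in (6, 12, 20):
    print(drives, "drives ->", required_12v_amps(drives), "A on +12V")
```

Under those assumptions a dozen drives already land close to the 30A figure, and a full 20-bay case would need more.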

Link to comment

Okay, so I rebuilt the drive from the parity, it is now green, 0 errors...

 

But if I try to write to it, it gives me permissions errors... I think the whole drive is stuck as read-only now... how do I fix that?

First, post a syslog.

http://lime-technology.com/wiki/index.php?title=Troubleshooting

 

Odds are you have some file-system corruption and it was mounted as read-only to prevent further damage.
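One quick way to confirm the read-only mount is to look at the mount options in `/proc/mounts`. A minimal sketch, assuming the standard Linux `/proc/mounts` layout (device, mount point, file-system type, options, ...); the sample lines, device names, and the `read_only_mounts` helper are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:
            found.append(mount_point)
    return found

# Hypothetical sample in /proc/mounts format; names and paths are made up.
sample = """\
/dev/md1 /mnt/disk1 reiserfs rw,noatime,nodiratime 0 0
/dev/md5 /mnt/disk5 reiserfs ro,noatime,nodiratime 0 0
"""
print(read_only_mounts(sample))  # ['/mnt/disk5']

# On the server itself, the live table could be read with:
# with open("/proc/mounts") as f:
#     print(read_only_mounts(f.read()))
```

If the disk shows up with `ro`, the syslog should say why it was mounted that way.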

 

Instructions on how to check the file-systems are in the wiki here:

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

Link to comment
