Help!....my disk has gone down!



Hey guys, I have a really weird but scary problem on my hands...

 

OK, I have 2 x 500GB HDDs (SATA). These are my data disks. I also have one parity disk, which is an IDE 500GB HDD.

 

I transferred all my data onto my 2 x 500GB HDDs today, and also built my parity today as well. All was fine.

 

I then started copying more files to my data disks, and somewhere along the way the whole thing froze! I couldn't get into the unRAID web interface at all. I restarted the box, and upon reboot I found a red light next to disk 1! Parity and disk 2 are still green though.

 

When I go to start up the array, it's unprotected, and it gives me the option to format and shows disk 1 as unformatted! Is this a bug or did my HDD just die on me? Please help ASAP! I will not proceed any further until given instructions. My unRAID box is powered down right now.

Link to comment

More news... I started the array, and although disk1 shows as red, I can see the full contents, and can also read from the share with disk1 data on it. So obviously it's showing red but should be green?

 

unRAID is designed to take a drive out of service if unRAID gets a WRITE error on a drive.  Write errors indicate a drive is in really bad shape.  But don't panic, more often than not the cause is a cable that is loose or bad, and not a failed drive.  It could also be something more serious like something wrong with the motherboard, bad memory, bad PSU, etc.  The first step is to look at your syslog.  Please click on the troubleshooting link in my sig for instructions on how to save and post.

Link to comment

More news... I started the array, and although disk1 shows as red, I can see the full contents, and can also read from the share with disk1 data on it. So obviously it's showing red but should be green?

Actually, no...

 

Because you have a parity disk you can still read and write to a failed disk.  This is exactly why you decided to put together an unRAID server.  A disk failure is less of an issue, as you can still get to your data, and you can replace the failed disk to be fully protected once more.  (Right now, if you had a second concurrent disk failure, you would lose the data on both failed disks... parity can only save you from 1 failed disk.)
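To see why one parity disk covers exactly one failure, here is a toy sketch. It assumes parity is the bitwise XOR of the data disks and models each "disk" as a single byte; the variable names are made up for illustration and this is not unRAID's actual code.

```shell
# Toy model: each "disk" holds one byte (decimal values chosen arbitrarily).
disk1=165
disk2=60

# Parity is the XOR of all data disks.
parity=$(( disk1 ^ disk2 ))

# Pretend disk1 has failed: XOR parity with the surviving disk to rebuild it.
rebuilt_disk1=$(( parity ^ disk2 ))

printf 'parity=%d rebuilt_disk1=%d\n' "$parity" "$rebuilt_disk1"
```

With two disks missing there would be one equation and two unknowns, which is why a second concurrent failure loses the data on both failed disks.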

 

Your original issue occurred with "unformatted" disks because one or more processes still had a file open when you tried to stop your array. (or you had logged in via telnet and you had "cd'd" to that drive and it was your current directory)  The drives that appeared as "unformatted" were those that could be un-mounted, the other was the one that could not because it was busy.

 

When you rebooted, the process keeping the disk busy was gone.

 

One of your disks has failed; the "red" indicator is not a display error.  But... as many have discovered, it could be as simple as a loose cable on that drive (either power or data).

 

So, first, capture a copy of your syslog and post it here (instructions in the wiki here). 

Then stop the array, power down, check your connectors, and then power up once more.  Detailed instructions here in the wiki

 

If it was a loose cable, you will need to un-assign the failed disk, reboot, and then re-assign the disk for unRAID to think it is a new drive. (Details also in the Hard-Drive-Failures page in the wiki)

 

Whatever you do, DO NOT press the button labeled "Restore" at this time without further specific instructions.  It does not restore data; it instead returns the array to the state it was in before you added any disks or data.  See this entry in the wiki about the Evils of the restore button.

 

The exception on use of the "Restore" button might apply if you have located a loose cable and you have not written anything to the array.  It is a special command used between pressing "Restore" and "Start" here.  If you use this procedure, let the parity check go to completion.

 

Joe L.

Link to comment

Hey guys,

 

I'm leaving my machine off as I don't have a spare drive lying around to use.

 

I'm going to try checking the power and data cables tonight. Is this the procedure for doing so?

 

Take the array offline, unassign the red-light disk, and then power down. Check cables, power back up, and reassign the disk to see whether it works?

 

One other thing I'd like to point out: I can't see my HDD temps from the unRAID web interface. It always shows a * in the temp field. Quite annoying, as I really need to see how hot these things are getting!

 

Also, if I were to connect the failed disk to another SATA channel on the mobo and reassign the disk, would this cause any complications?

Link to comment

Do NOT un-assign the drive, as you could lose your drive's data!  As long as it is a part of the array, the parity protection is preserving its contents.  Once you remove it, then all of the parity info is invalid.

 

For an accurate diagnosis of what happened, we really need the syslog captured when it first showed a RED ball, disabled.  Once you rebooted, that was lost.  We still need to see a syslog to advise you for the next step.  It is hard to tell anything without seeing that syslog.  The one step we are very unlikely to give you is un-assigning that drive.  Quite often, a bad cable or other very simple problem has been the cause of a disabled drive, really not the fault of the drive at all.  If it is something that simple, then there are only a couple of minutes of steps to take, and you will be back up fine.

 

For the temps, please see the FAQ :  http://lime-technology.com/wiki/index.php?title=FAQ#Why_is_a_temp_not_showing_for_a_drive.3F

 

Also, if I were to connect the failed disk to another SATA channel on the mobo and reassign the disk, would this cause any complications?

 

There's no problem re-connecting the drive to a different port, just DO NOT re-assign the drive, at least for now.  And preferably, don't move the drive yet, until we see a syslog.

 

I do echo the others, in recommending you take a little time to read the Troubleshooting page, and the FAQ, the Best of the Forums page, and other Wiki pages.  There's a lot of useful info in there.

Link to comment

I'm getting a little scared now. I unassigned my red drive, and also the parity drive, but assigned them back straight away.

 

I still have a red disk; parity and my other disk are still fine though. I can also still see all the contents of my unRAID share (good sign, hopefully?), and the red disk is showing the proper available space according to how much was transferred to it.

 

I'm going to buy a SATA cable today, and try another power cable to test. Would this be the correct procedure for checking then?

 

Stop Array

Power Down box

Check cables - power + sata

Power on box and see if it changes anything

 

If still no green light, how do i capture a log?

 

Thanks

Link to comment

Un-assigning one drive is exactly the same as if a drive had failed.  In other words, to simulate a failed drive you would un-assign that drive.

 

So, you are OK to un-assign the failed drive, however, do NOT press "Restore" to re-initialize the array, that WILL cause data loss.

 

If you attempt to un-assign a second drive the array will not start.

 

I think Rob is mistaken though. If the failed drive is actually working, and it was just a loose cable, you will need to un-assign the failed drive, reboot (to make unRAID forget the serial number of the drive in that slot), and then re-assign the drive to the failed slot (the one with the "red" ball).

 

unRAID will not, on its own, return a failed drive to service (that is, a drive it took out of service after a write to it failed).  You must take the corrective actions.

Link to comment

Thanks mate, so to clear things up.

 

If it is a faulty cable...

 

I unassign the failed drive.

Power Down machine

reseat cable

power up machine

reassign failed drive

start array

 

Will this mean it will rebuild the failed drive? Or will it remember the serial number of the drive and carry on as if nothing has happened?

 

 

It would help to know how Unraid works during an unassign and reassign of drives.

 

Thanks:)

Link to comment

Will this mean it will rebuild the failed drive?

It will rebuild the failed drive.

...or will it remember the serial number of the drive and carry on as if nothing has happened?

By un-assigning and rebooting it will forget the serial number. When you then re-assign it, it will treat it exactly as if it was a new replacement drive.

It would help to know how Unraid works during an unassign and reassign of drives.

 

Thanks:)

You are welcome.

 

Joe L.

Link to comment

Thanks to everyone, I have got my array back up and running. I changed nothing at all: power and data cables were all connected, with no loose cables.

 

Here's how I fixed the issue. There was lots of confusion in this area on this thread, so I went with my gut instinct on what someone had said, and all was good:

 

Stopped array

Unassigned Red Disk

Powered Down machine

Powered Up Machine

Reassigned Red Disk (which is now showing as blue)

Started array, and it's rebuilding the contents of the drive from Parity!

 

This is extremely annoying since I've had my array up for less than 2 days.

 

Some more questions:)

 

1) Will this mean I'll have to rebuild my disk every time it goes red (even if nothing was wrong)?

 

2) I have been advised that the next time an error occurs, I should capture a syslog before powering down the unRAID box. Is this possible using PuTTY, saving the syslog to a file on a remote PC?

 

3) What software is out there that I could use to test my HDDs from within unRAID? I would like to do this now just to be very safe.

 

Thanks again for all your help. I didn't see any other way to get my array online without unassigning and reassigning the red disk. This scares me, because what if my parity becomes a red disk?

Link to comment

Hi NSTA,

 

I'm a newbie too...  Just trying to learn and help if I can (always try to help a few when I post a few)...  As I understand it, there's an addon called unMenu that can help with two of your issues (see the documentation and look at the addons):

 

1.  I'll leave this to someone more proficient, but I recall reading a forum post somewhere that allowed skipping of the parity sync - so I think it's possible.  Someone more experienced can probably point us to that thread.

 

2.  You can view (and save - if I remember right - the last 10,000 lines) of your Syslog with your browser.

 

3.  Seems that unMenu also offers some hard drive testing - not sure what the features actually mean, but hopefully someone else can elaborate.

 

Russell

Link to comment

1. You do not need unMENU to stop a parity check. You can do that from the stock unRAID web interface: simply click on the "Cancel" (I believe) button during a parity check (only if you know for certain the data is safe and you don't need to check parity).

 

2. Yes, unMENU can make it easier to view your syslog and has some nice colouring features which allow for easier diagnostics, but unMENU is not necessary. There are commands all over the forum and in the wiki for capturing syslogs. The easiest way is to use telnet and send the command cp /var/log/syslog /boot/syslog.txt, which will save your current syslog to the root of the flash drive as a file called "syslog.txt". You can access it from the flash share on your server over your network.

 

3. unMENU does not provide hard drive "testing". The best way to test a hard drive is to use the preclearing script by Joe L.  You can exercise and pre-clear the drive to ensure a good drive prior to adding it to the array.  The closest unMENU comes to providing testing for the drives is through SMART reports, which don't actually force the drive to work or properly "test" the drive to ensure it won't fail.  SMART mostly just reports the changes that have occurred internally on the drive and errors the drive has encountered.  With SMART, you need to keep track of the drives yourself and make sure important values are not changing.
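The syslog copy in item 2 can be wrapped in a small helper so each capture gets its own timestamped file instead of overwriting syslog.txt. This is only a sketch: the function name save_syslog and the file naming are made up here, and the defaults simply reuse the /var/log/syslog and /boot paths from the command above.

```shell
# Copy the in-RAM syslog to the flash drive under a timestamped name.
# $1 = source log (default /var/log/syslog), $2 = destination dir (default /boot).
# Prints the path of the saved copy on success.
save_syslog() {
    src="${1:-/var/log/syslog}"
    dest_dir="${2:-/boot}"
    dest="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).txt"
    cp "$src" "$dest" && printf '%s\n' "$dest"
}
```

Run save_syslog with no arguments from a telnet or console session before rebooting, then pick the file up from the flash share.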

 

Hope this helps,

Matt

 

 

Link to comment

Matt,

 

Thanks for jumping in and helping...  Maybe I'm confused now...

 

1.  I thought he wasn't trying to CANCEL it...  but rather SKIP it, since he knew his data on all drives (including parity) had not truly changed.  Something like this:  http://lime-technology.com/forum/index.php?topic=2591.msg20919#msg20919

 

2.  Some of us prefer to avoid telnet/console for something that should be as simple to access as a syslog.  :)

 

3.  Other than the SMART technologies (however you access them) - is there a better way to watch your ACTIVE UNRAID drives for issues?

 

Thanks,

 

Russell

Link to comment

The preclear script is a destructive procedure that will wipe the disk clean. Look at the troubleshooting page for instructions to run the SMART tests. These can also be initiated in unmenu. The SMART tests are not destructive.

The preclear_disk.sh script cannot be used on any disk that is assigned as part of your array, or if currently mounted, even if outside of the array.  That is to protect you from yourself.  It is guaranteed to zero out the entire drive.  There is NO undoing this process.  If you write zeros to all of your data, it is gone. 

 

Its original purpose was to clear a drive before adding it to the unRAID array to eliminate the lengthy time the array is offline while a disk is cleared.  It still takes the same time (or longer) to clear a drive, but it occurs while the array is on-line and available.

 

A member of this forum asked for a tool to burn-in a drive.  It was a natural addition.  The change involved adding a pre-read, and a post-read of every block on the disk.

I even added a series of reads of random blocks, interspersed with a periodic read of the first and last block on the disk.  The whole idea was to move the disk heads a whole lot more than if they were just reading from the first block to the last in a linear fashion.  I wanted to put the drive through its paces.  I want it to fail early  if it has any tendency to have mechanical problems (before adding it to the array on the "Devices" page).

 

The "long" SMART test performs a non-destructive "read" of all the blocks on a disk.  It is much faster than the preclear_disk.sh script, as the data is read by the disk's firmware and never is sent outside of the hard-disk's case.  It probably does not seek randomly, and it does nothing to let you know of external issues.  Basically you request that a "long" test is run, and several hours later you can ask for a SMART "status-report" to see the results.   

 

You can "request" a "long" SMART test be performed using unmenu's Disk-Management page.  It is perhaps the easiest way for many, as long as they have installed the unmenu.awk package.  If you do not have unmenu.awk  installed, you can make the same SMART requests on the command line as described in the wiki here. (you will have to log in via telnet to get to the command line)
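For those going the command-line route, a rough sketch of the two SMART requests looks like this. It assumes the smartctl tool from smartmontools is available on the server; -t long and -a are standard smartctl options, while the wrapper function names, the SMARTCTL variable, and the /dev/sdX device name are placeholders made up for illustration.

```shell
# Allow the smartctl binary to be overridden (handy for a dry run).
SMARTCTL="${SMARTCTL:-smartctl}"

# Ask the drive's firmware to start a "long" (full-surface read) self-test.
start_long_test() {
    "$SMARTCTL" -t long "$1"
}

# Print the full SMART report, including self-test results, for a drive.
show_smart_report() {
    "$SMARTCTL" -a "$1"
}
```

Typical use would be start_long_test /dev/sdX, then show_smart_report /dev/sdX several hours later, with sdX replaced by the real device.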

 

To watch your active drives for issues you must monitor the lime-technology management page (at the least) or, if you installed some of the add-on utilities developed by members of this forum, the equivalents.  I strongly suggest installing one of the e-mail scripts to be pro-actively  alerted if something unusual is occurring with the disks in the array.

 

If you want ease of access to the SMART data, then the unmenu.awk scripts in combination with the awesome SMART view of disks in the "myMain" plug-in will do it.

Every once in a while take a look at the unmenu web-page; you can even view and download the syslog from there.  I don't know of an easier way to see (and monitor) the ongoing health of your disks.

 

Joe L.

PS. I think you were right, the other poster wanted to avoid a complete parity re-build.  He would have had to use the "Trust-My-Parity" procedure.  Hopefully, unRAID will soon have a true "Parity Check, but do not fix errors" that is different than the "Parity Check and Automatically fix errors" we have now.

 

 

Link to comment

1.  I thought he wasn't trying to CANCEL it...  but rather SKIP it, since he knew his data on all drives (including parity) had not truly changed.  Something like this:  http://lime-technology.com/forum/index.php?topic=2591.msg20919#msg20919

It looks like Joe L. beat me to the punch with a great post... here is what I was going to say. I'm not trying to advocate against using unMENU; it is a great add-on and is very valuable.  My points about running unMENU refer to your question about using telnet/console and my reasoning for why you would need to know how to use it and be familiar with it.

 

I just re-read the post and found I was incorrectly answering the question the first time...

From what I can tell, he was rebuilding a disk from parity (called a parity calculation procedure and not a sync/check).  As far as I know, there was no way for him to avoid the calculation, but I could be wrong. I will defer to those who have much more experience recovering/helping people recover from similar problems.

 

The link you provided would have been no help in this situation.  At least not as I understand the problem. I guess I was wrong, defer to the previous post by Joe L. on this matter.

2.  Some of us prefer to avoid telnet/console for something that should be as simple to access as a syslog.   :)

Well, here's the catch-22: after you install unMENU, you need to start it by running one command through telnet/console, or you will need to add a line to your go script to have it load on boot (or both, if you want it to load on boot and don't want to have to reboot to use it the first time).  After changing your go script, it's always a good idea to make sure it is in UNIX format and not DOS format, so you would run another command through telnet/console to do that (fromdos; details can be found in the wiki or through a forum search).
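As a rough illustration of what the DOS-versus-UNIX check amounts to, the sketch below detects and strips carriage returns using the standard tr utility. It is a stand-in for fromdos, not a description of how fromdos itself works, and the function names are made up here.

```shell
# Succeed if the file contains at least one carriage return (DOS line ending).
has_dos_endings() {
    [ -n "$(tr -cd '\r' < "$1")" ]
}

# Write a copy of the file ($1) to $2 with all carriage returns removed.
to_unix() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, has_dos_endings /boot/config/go && echo "needs converting" would flag a go script saved by a Windows editor; that path is the usual location on a stock install but is an assumption here.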

What happens if emhttp crashes and you can't use the web interface, but you can still get the syslog from the console/telnet? You will need to know how to do it.

 

It is still a good idea to learn to do things through the console/telnet, because chances are you will not be looking at your syslog when everything is fine (99% of the time) but when something has gone wrong and the system may not be operating normally.  The syslog is re-written every time you boot.  If your system will not let you do anything besides running console commands, you could save the syslog to the flash (with the command outlined above) and reboot without losing the syslog you wish to view.  You could also just pull your flash after turning off the server, plug it into your personal computer, and access the file that way (again assuming your system is not working/booting properly).

 

The main problem with direct access to the syslog is that it is on the ramfs and not on the key. The command I outlined above copies the file to the flash, where it is easily accessible via the network, or where it can be kept until the next time you boot up and are able to look at it.

 

3.  Other than the SMART technologies (however you access them) - is there a better way to watch your ACTIVE UNRAID drives for issues?

 


 

Not to my knowledge. Constant monitoring of the SMART reports is one of the best ways to monitor your active unRAID drives.  Only by comparing old vs new SMART reports will you see if anything has changed since the last time the report was collected, and then determine if the values are increasing/decreasing undesirably over time.  Any other monitoring software will likely be based on collecting SMART reports and then determining appropriate thresholds for certain values.  This removes the ability of the user to know exactly what is going on with the drive and to make informed decisions about the drive's reliability.

For instance, one of the thresholds in the program might be the number of reallocated sectors since the last check.  If that number is too large, then the program thinks the drive is losing a large number of sectors in a certain period of time and it will alert the user to replace the disk.  However, you can find on this forum a post by Joe L. where, on the first use of a disk, the number of reallocated sectors jumped sharply from one test to the next but has not changed in a very long time since.  Clearly, Joe L.'s case COULD have caused a false positive in the program, and would have cost him the price of a replacement disk for no reason.
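Comparing an old and a new report by hand can be sketched roughly as below. It assumes each report was saved as text in the usual smartctl attribute-table layout, where the attribute name is the 2nd column and the raw value starts at the 10th; the function names and file names are made up for illustration.

```shell
# Print the raw value of a named attribute from a saved SMART attribute table.
# $1 = report file, $2 = attribute name (e.g. Reallocated_Sector_Ct).
smart_raw_value() {
    awk -v name="$2" '$2 == name { print $10 }' "$1"
}

# Print how much an attribute's raw value changed between two saved reports.
# $1 = old report, $2 = new report, $3 = attribute name.
smart_raw_delta() {
    old=$(smart_raw_value "$1" "$3")
    new=$(smart_raw_value "$2" "$3")
    # Bail out if the attribute is missing from either report.
    [ -n "$old" ] && [ -n "$new" ] || return 1
    echo $(( new - old ))
}
```

For example, smart_raw_delta old.txt new.txt Reallocated_Sector_Ct prints the change in reallocated sectors. Deciding whether that change matters is still up to you, which is the point made above about one-off jumps.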

 

Cheers,

Matt

Link to comment
