Failing drive

January 1, 2009

Using unMENU with myMain, I discovered that one of my drives is failing and want to know what steps to take.

My disk2 is failing

disk9 is empty

Both are same size

I want to replace disk2 with disk9, RMA disk2, and then replace disk9 with the new RMA'd drive. What would be the best steps to take?

Here's a SS:

January 1, 2009

Sorry to see you are having problems with a drive, but this is actually the kind of thing that I hoped folks would be able to see BEFORE a drive failure when I created the myMain plugin.

There are a couple of different approaches you could take.

1 - You could simply copy the data from disk2 to disk9. unRAID will "correct" read errors from disk2, if any, as it goes. Once complete, you could remove disk2 from the array. (This would force a rebuild of parity afterwards, during which time you'd be "exposed" to drive failure. You clipped the very top of the display where it would have shown when the last parity check occurred. If you've done one recently, then the chances of a drive giving problems are pretty small.)

2 - You could power down, disconnect disk2, power up, and start the array. unRAID will simulate disk2 using parity and the other drives. You could then copy the data from the simulated disk2 to disk9. Like option 1, you'd have to rebuild parity afterwards. If the drive is in really bad shape (constantly clicking, running very hot, running very slowly, smoking, etc.) I'd recommend this over option 1. Otherwise it is likely to be slower.

3 - You could replace disk2 with ANOTHER 750G disk. (unRAID does not allow you to rebuild onto a disk that is already in the array, even if it is empty. The rebuild must go to a disk that is not currently part of the array.) Although you would not need to rebuild parity afterwards, you would be rebuilding a disk, during which time you'd be unprotected from another failure. So the risks here are similar to the risks above.

4 - You could replace disk2 with your parity disk and insert a larger parity disk using a process called "PARITY SWAP". (Very similar to 3)
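To illustrate what "simulate disk2 using parity and the other drives" means in option 2, here is a minimal Python sketch. It assumes single XOR parity across the data disks; the tiny byte strings are made-up stand-ins for real drives, and `xor_blocks` is a hypothetical helper, not anything unRAID exposes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; a real drive has billions of such positions.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xDE, 0xAD, 0xBE, 0xEF])
disk9 = bytes([0x00, 0x00, 0x00, 0x00])  # treated as all zeros for simplicity

# Parity is the XOR of every data disk.
parity = xor_blocks([disk1, disk2, disk9])

# With disk2 disconnected, its contents are recomputed from everything else.
simulated_disk2 = xor_blocks([parity, disk1, disk9])
print(simulated_disk2 == disk2)  # True
```

Every read of the simulated disk has to touch parity and all the other data disks, which is why option 2 tends to be slower and why it depends on parity being accurate.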

Post back if you need more specific instructions on these techniques.

Good luck!

January 1, 2009

Kudos to Brian! I think this is the first 'textbook' example of the value of the SMART overview screen, and with apologies to fatal for being the 'victim', this is really a great example. You caught it before it was completely lost.

I must caution you as to the steps to take. If all the drives were good, then there is a way to remove Disk 9, and there are other methods, but anything that involves modifying the parity drive, at the whole drive level, such as a drive rebuild, swap-disable, drive removal, etc, could mean reading from every sector of Disk 2, and you do not want to do that at all. And because you have a drive just itching to fail, you can't do anything that involves the temporary dropping of another drive.

I really think your only choice is to copy all files from Disk 2 to Disk 9. If Disk 2 fails before completing the migration, then you will have to wait for the arrival of the replacement drive, and do a rebuild onto it, in the Disk 2 position. Whatever you do, you cannot attempt any drive reorganization, such as moving Disk 9 elsewhere, because Disk 9 in its current state is vital to preserving parity. Your parity is so important at the moment, that you don't want to take any actions that could endanger it. In fact, I would temporarily disable any current operations to your unRAID server, such as scheduled backups. Once the system is completely fine, then you can make changes to Disk 9.

Once you have a replacement drive at hand, then Brian's choices 3 and 4 become available. Hmmm... I skimmed too quickly over Brian's post. His first 2 choices are good. I think I would recommend the first, as long as it is working, then go to the second. Both choices mention rebuilding parity, but I don't think that will be necessary, or desired.

January 1, 2009

Once you have a replacement drive at hand, then Brian's choices 3 and 4 become available. Hmmm... I skimmed too quickly over Brian's post. His first 2 choices are good. I think I would recommend the first, as long as it is working, then go to the second. Both choices mention rebuilding parity, but I don't think that will be necessary, or desired.

Sometimes in these short posts it is hard to say things the way you are trying to say them. With options 1 and 2, all the data copying can occur without rebuilding parity. But after you are done copying and are finally ready to remove disk2 from the array, you'd have to rebuild parity. Right?

January 1, 2009

I'm sorry, I misread option 1 even more than I thought, plus you were thinking farther ahead than I was. I was just trying to get the data safe, and the new drive into its place, and assumed that was what option 1 was trying to do.

How recent the last parity check was makes a difference here, I think, as option 2 carries even more risk if there has not been a recent parity check. Getting the data moved from Disk 2 to either Disk 9 or an external drive is the first priority. Then if he's happy with the data on Disk 9 and can adjust his User Shares, removing Disk 2 and rebuilding parity is a valid choice, and then adding the new drive as Disk 2 when it arrives. It carries extremely low risk. I was thinking that he would swap the bad drive with the new drive when it arrives, and rebuild Disk 2 onto it, which is also low risk once the data is safely copied elsewhere. Then once he's confident Disk 2 is fully restored, he can delete the files from Disk 9.

January 1, 2009

Author

Thanks for the help

I am currently in the process of copying all files from disk2 onto disk9, so far so good

After the copy and checking the contents of disk9, i plan on deleting all files from disk2, remove disk2, and rebuild the parity.

B/c if i wait and let the parity simulate disk2 while my RMA arrives, and another drive fails, won't i lose my data?

BTW, happy new year

January 1, 2009

Agree 100%. Getting the data off of disk2 and onto disk9 (or somewhere else) is the highest priority.

But I was thinking that fatal just wanted to just use his disk9 AS the replacement, with no immediate plans to buy another disk.

In that case ...

1. Use option 1 (or 2) to copy the disk2 data to disk9

2. Power down and remove disk2

3. Power up

4. Stop the array (if it started)

5. On the devices page, unassign the disk from slot 9, and then assign the disk to slot 2

6. On the main page, press the dreaded restore button (restore will drop parity protection - it is safe here because we've already rescued the disk2 data, but in general pressing restore is a bad idea)

7. Start the array and let unRAID build parity

8. Run a full parity check (don't forget this, building parity is quite different from checking parity)

fatal could then add a new drive (when/if he buys one) to the array in the normal way. It would not be an emergency.
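Before pulling disk2 in step 2, it may be worth confirming that everything really arrived on disk9. The following is a rough Python sketch of one way to do that, assuming Python 3 is available on some machine that can see both disks (stock unRAID may not include it). `find_copy_problems` and `file_digest` are hypothetical names made up for this example.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_copy_problems(source_root, dest_root):
    """Return (relative path, reason) for files missing or different under dest_root."""
    source_root, dest_root = Path(source_root), Path(dest_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = dest_root / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif file_digest(src) != file_digest(dst):
            problems.append((rel.as_posix(), "differs"))
    return problems
```

Calling `find_copy_problems("/mnt/disk2", "/mnt/disk9")` should return an empty list if the copy is complete, assuming the disks are mounted at those paths. Note that this reads every file on disk2 a second time, so on a drive in really bad shape a simple check of file names and sizes may be the safer trade-off.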

January 1, 2009

..., i plan on deleting all files from disk2, remove disk2, ...

As you probably will realize, there's no need to delete anything before removing it.

And Brian has produced another of his very good step-by-step procedures.

January 2, 2009

Author

it's still copying the files over ...which is taking forever...

after the copy is done and i power down to remove disk2, would it be ok if i removed disk9 from its physical position and position it to disk 2? I don't think it should be a problem.

Thanks for the steps Rob

January 2, 2009

No need to physically move disk9. All you have to do is reassign it on the devices page.

Refer to the step by step.

January 2, 2009

Author

No need to physically move disk9. All you have to do is reassign it on the devices page.

Refer to the step by step.

yeah but i'm just a lil anal about the drives being in the same spot in the trays as me assigning them

Thanks for all the help, i moved over all the files, removed disk2, put disk9 in disk2's spot, powered back up, assigned the correct disks, and built parity, which went fine w/ 0 errors.

Time to RMA the disk now

Edit: Also, before i installed unmenu/mymain, i was wondering why my parity check was only going at like 20MB/sec, after taking this drive out and rebuilding parity, i was back up to my normal 37-40MB/sec for the parity build.

January 2, 2009

Excellent!

If you haven't already, run a parity check to make sure that the parity is solid and that you are protected.

January 2, 2009

I had a 1tb drive start failing on me last night. I saw as I was writing a movie to it that I got 3 errors at the far right on that drive (unraid main page). I had seen some smart errors before on this drive so, since I had a spare, I decided it was time. For some screwball reason I decided to do a parity check first... :-\ well, 69 drive errors and 10 sync errors later, parity check finished... I shut down, pulled the drive, inserted spare, booted up and 3 hours later I had my stuff back... BUT, I guess there are at least 10 bad bits in those movies, right? :'(

January 2, 2009

I had a 1tb drive start failing on me last night. I saw as I was writing a movie to it that I got 3 errors at the far right on that drive (unraid main page). I had seen some smart errors before on this drive so, since I had a spare, I decided it was time. For some screwball reason I decided to do a parity check first... :-\ well, 69 drive errors and 10 sync errors later, parity check finished... I shut down, pulled the drive, inserted spare, booted up and 3 hours later I had my stuff back... BUT, I guess there are at least 10 bad bits in those movies, right? :'(

I think the answer is yes. The syslog (if you saved it) would tell you where on the disk the problems were, and perhaps give you some clues. (If the errors were confined to a short range of sectors, you'd likely have just 1 corrupted file. If they are more randomly distributed, the problems may be spread among several files.)

I wish that Tom would implement a parity check differently. It should "remember" where the parity problems are but NOT correct them on the fly. After the parity check is complete, it could prompt the user as to whether to correct the errors or not. unRAID could then recheck just the few sync error sectors and fix them. With this model, you'd be able to safely run parity checks even when a drive is suspect, and not risk corrupting parity in the process.
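As a toy Python sketch of that two-pass idea (this is not how unRAID is implemented; it assumes simple XOR parity, and the function names and tiny in-memory "disks" are made up for illustration):

```python
from functools import reduce

def expected_parity(data_disks, pos):
    """XOR of the byte at position pos across all data disks."""
    return reduce(lambda a, b: a ^ b, (d[pos] for d in data_disks))

def check_parity(data_disks, parity):
    """Pass 1: remember mismatching positions, but write nothing."""
    return [pos for pos in range(len(parity))
            if expected_parity(data_disks, pos) != parity[pos]]

def correct_parity(data_disks, parity, positions):
    """Pass 2: rewrite parity only at the remembered positions, once the user agrees."""
    for pos in positions:
        parity[pos] = expected_parity(data_disks, pos)

disks = [bytearray([1, 2, 3, 4]), bytearray([5, 6, 7, 8])]
parity = bytearray(a ^ b for a, b in zip(*disks))
disks[1][2] ^= 0xFF  # simulate a bad byte coming back from a suspect drive

mismatches = check_parity(disks, parity)
print(mismatches)  # [2]
```

After pass 1 the parity is still untouched, so the suspect drive could still be rebuilt from it. Only if the user decides the data disks are trustworthy would `correct_parity(disks, parity, mismatches)` be called.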

January 2, 2009

Because of this thread I downloaded Unmenu and it is a great tool, should become standard in UnRaid.....

Anyway, here is a screen shot of mine. At what point do you have to worry about reallocated sectors? I see that I have 31 on one of my new 1.5tb drives. But my old WD's are doing great, go figure

January 2, 2009

I see that I have 31 on one of my new 1.5tb drives.

I'd begin to worry...

January 2, 2009

Because of this thread I downloaded Unmenu and it is a great tool, should become standard in UnRaid.....

Anyway, here is a screen shot of mine. At what point do you have to worry about reallocated sectors? I see that I have 31 on one of my new 1.5tb drives. But my old WD's are doing great, go figure

The number of reallocated sectors is a little less important than how often the number increases. If you have 31 sector reallocations a year from now it was not anything to worry about. But if you see the number increasing then that's a bad sign. I would recommend a parity check and see if the number increases. If it does, I'd try to get it replaced. If it doesn't, I would monitor it over the coming days and weeks, running a few more parity checks. If it seems stable at 31, I would just keep an eye on it but it is likely fine. But if you see it increasing, even if only by a few sectors a week, the drive is going.
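If you jot down the count after each parity check, the "watch the trend, not the number" rule can be sketched in a few lines of Python. The function name, the three-check window, and the sample readings below are all made up for illustration, not a recommendation from any drive vendor.

```python
def assess_reallocations(counts, recent=3):
    """counts: reallocated sector readings, oldest first, one per check."""
    if len(counts) < 2:
        return "need more readings"
    # Compare the newest reading with the one taken `recent` checks earlier.
    window = counts[-(recent + 1):]
    growth = window[-1] - window[0]
    if growth == 0:
        return "stable: keep an eye on it"
    return f"growing by {growth} over the last {len(window) - 1} checks: consider replacing"

print(assess_reallocations([31, 31, 31, 31]))  # stable: keep an eye on it
print(assess_reallocations([31, 33, 36, 40]))  # growing by 9 over the last 3 checks: consider replacing
```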

January 3, 2009

I ripped a couple of more movies to it last night, now it is completely full, well only 1.2gig free. The count went to 32 now. I will keep an eye on it and see what happens, going to run a parity check tonight after I rip 2 more movies to completely fill the array.

I ordered another 1.5tb drive yesterday, so I hope that they are going to be OK.

January 6, 2009

Just to update I did run a parity check the other night and as of right now the reallocated_sector_ct is still 32 so far so good

Is there any way to find out what the other items listed in yellow are:

» attribute_189=70

» head_flying_hours=2.42318e+14

» attribute_241=2.2524e+09

» attribute_242=3.40341e+09

January 6, 2009

Are you using Smartctl 5.38?

Attribute 189 should be identified as "High Fly Writes". But it may mean something different for Seagate drives.

Refer to this link and read about them: LINK (scroll down a little past 1/2 way and you'll see a table where each of the attributes are explained).

I have no idea what 241 and 242 are, but given their large values I have to believe they are some type of bitmapped value that would mean nothing to a human. I would not worry about them.

I am not sure if a high fly write value of 70 is problematic or not. The failing drive also has a non-zero value of 25.

You might want to look here at someone with similar question on the Seagate forum.

January 7, 2009

I'd say it's time for smartmontools v5.39 or v5.40, with all of the new drives in its database, and more of the attributes detailed, including more of the vendor-specific ones. 241 through 249 are supposed to be vendor-specific, and it's up to the vendor as to whether they will be described or not. They have the right to keep them for internal use only.

As Brian has said earlier, I wouldn't be too concerned about a few flaws that are reported early, so long as they don't increase. I suspect the higher densities in these newer drives are pushing the tolerances too tightly, so it may be normal for them to find and record their flaws within the first weeks, and then operate fine for years. The key issue is watching the trends, and you can't detect a trend without sufficient time and SMART reports.

January 7, 2009

I'd say it's time for smartmontools v5.39 or v5.40, with all of the new drives in its database, and more of the attributes detailed, including more of the vendor-specific ones. 241 through 249 are supposed to be vendor-specific, and it's up to the vendor as to whether they will be described or not. They have the right to keep them for internal use only.

As far as I know, 5.38 is the most current release.

Might be able to get a more current version if you compile it yourself.

January 7, 2009

I am using what ever is in UnMenu

January 7, 2009

UnMenu uses whatever version of smartctl is on your server. It does not test the disks itself; it invokes the test using smartctl and shows you the output. It just makes it easier, since you do not have to go to the Linux command line to invoke the test and remember the syntax, etc.
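For anyone curious what "invoke smartctl and show the output" amounts to, here is a rough Python sketch of the same idea. It is not UnMenu's actual code; it assumes Python 3.7+ and smartmontools are installed, the function names are hypothetical, and every number in the sample report except the raw values 31 and 70 is made up.

```python
import subprocess

def parse_smart_attributes(report):
    """Map attribute ID -> (name, raw value text) from `smartctl -A` style output."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs

def read_smart_attributes(device):
    """Run smartctl against a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_smart_attributes(result.stdout)

# Hypothetical sample in the usual `smartctl -A` column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       31
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       70
"""
print(parse_smart_attributes(SAMPLE)[5])  # ('Reallocated_Sector_Ct', '31')
```

The attribute names come from whatever database the installed smartctl ships with, which is why the same drive can show a plain attribute number under one version and a descriptive name under another.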

Version 5.38 of smartctl is in the newer releases of unRAID, so that is probably what you are using.

The meaning of many of the "vendor specific" SMART attributes is not disclosed by the disk manufacturers. They use them in their own utilities.

Joe L.

January 7, 2009

Version 5.38 of smartctl is in the newer releases of unRAID, so that is probably what you are using.

Has the missing library problem been fixed in the newer releases?
