[SOLVED] Failing drive in middle of rebuild. Need to re-install old smaller d...

August 20, 2012

I recently returned from vacation to find that I had left my A/C off and one of my hard drives had overheated and was taken offline. I have had an issue with cooling on a couple of my drives, so this didn't surprise me. After looking into things and verifying it had overheated, I ran a SMART report and everything checked out fine. I removed the drive from the array and reassigned it again, and to be safe, had the drive get rebuilt. I then took care of my cooling problem once and for all. All of this went fine.

A couple days later, and while everything had been running just fine, I decided to go ahead and pre-clear a newish 2TB drive I've had in the machine for about 6 months now, but had never taken the time to pre-clear it and have it replace a smaller drive in my system. The drive pre-cleared fine, and I went to replace a smaller 500GB drive in my system last night. As soon as I went to have the new 2TB drive rebuilt with the 500GB drive's data, the drive that had overheated earlier started throwing out a ton of "handle_stripe read error" lines in the syslog. I reran a SMART report on the drive, and sure enough, it appears to be failing.

Now, the original 500GB drive is still in the machine, and I want to reinstate it where it was and use the 2TB drive (the one the 500GB drive's contents were being rebuilt onto) as the replacement for the failing drive. I canceled the rebuild and went to reassign the old drive back in its place, but this is where I should have done some research beforehand: I didn't stop to think that this meant replacing a 2TB drive with a 500GB drive. Sure enough, unRAID doesn't want to replace the drive with a smaller one. I know the data on the 500GB drive is fine, but when I replaced the 500GB with the 2TB, did this immediately change my parity drive so that I won't be able to go back? And if I can go back, will parity still be good for rebuilding the failing 2TB drive onto my replacement drive?

Thanks for the help!

August 20, 2012

As long as you have a copy of the config directory from before you swapped in the new/larger drive, no problem.

Copy back the old config folder, put the disks back as they were, and you will be back to where you were prior to the upgrade.

August 20, 2012

Author

As long as you have a copy of the config directory from before you swapped in the new/larger drive, no problem.

Copy back the old config folder, put the disks back as they were, and you will be back to where you were prior to the upgrade.

Sadly, I didn't make a copy of the config directory. This was the first time I have upgraded a drive, and I just followed the instructions in the FAQ / Manual:

http://lime-technology.com/wiki/index.php/UnRAID_Manual#Replace_a_single_disk_with_a_bigger_one

It didn't mention making a copy of the config directory, and I didn't think to do it.

Are there any other solutions?

August 20, 2012

Author

Maybe I should add that the drive that is failing still shows a green ball in the unRAID WebGUI, not a red ball. I should have made that clear. The drive is failing, but unRAID isn't yet reporting a failed drive in this case.

August 20, 2012

Author

So, I've been researching all day as to what I experienced last night, and still don't have a solution just yet. When I get home, I'll power up the machine, run a SMART report on the 2TB "failing" drive to post here (along with a syslog showing the errors), but the report last night showed it was failing due to "Reallocated Sectors Count" being 1260....which is a lot! And I'm assuming this is the cause of the "handle_stripe read" errors in the syslog. I figured the overheating must have caused this??? Although the rebuild on this overheated drive was slow, it completed and the system was in usable condition for a couple days before I decided to replace the small 500GB drive with my spare 2TB drive. That is when I saw all of the "handle_stripe read" errors. From what I have read, unRAID will only pull a drive offline if it can't write to the drive. I have also read that most large drives these days have thousands of reserved sectors for remap, and although 1260 is huge, and I feel the drive definitely needs to be replaced, I'm curious if I should just let the rebuild do its thing on the replaced 500GB drive.

I would really like to get the 500GB back in its place, but since I don't have a saved config, I have yet to find how to do this without invalidating parity.

August 20, 2012

Author

See attached for the SMART report, and the syslog as the upgraded drive is being rebuilt.

smartreport-disk5-2012-08-20.txt

syslog-2012-08-20.txt

August 20, 2012

disk5 has failed.

5 Reallocated_Sector_Ct 0x0033 042 042 140 Pre-fail Always FAILING_NOW 1260
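For anyone unfamiliar with the columns, the line above follows the usual smartctl attribute layout (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value). A minimal Python sketch of how to read it follows; the helper names `parse_smart_attribute` and `is_failing` are made up for illustration and are not part of any tool. It assumes an attribute counts as failing once its normalized value is at or below the threshold.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int       # normalized value (higher is healthier)
    worst: int
    threshold: int
    when_failed: str
    raw: int


def parse_smart_attribute(line: str) -> SmartAttribute:
    """Split one whitespace-separated attribute row into named fields."""
    f = line.split()
    return SmartAttribute(
        attr_id=int(f[0]),
        name=f[1],
        value=int(f[3]),
        worst=int(f[4]),
        threshold=int(f[5]),
        when_failed=f[8],
        raw=int(f[9]),
    )


def is_failing(attr: SmartAttribute) -> bool:
    """Assumption: failing means normalized value <= threshold."""
    return attr.value <= attr.threshold


line = "5 Reallocated_Sector_Ct 0x0033 042 042 140 Pre-fail Always FAILING_NOW 1260"
attr = parse_smart_attribute(line)
print(attr.name, "value", attr.value, "threshold", attr.threshold,
      "failing:", is_failing(attr))
```

Here the normalized value of 42 sits well under the threshold of 140, which is why the row is flagged FAILING_NOW regardless of how many spare sectors the drive may still have.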

August 21, 2012

Author

disk5 has failed.

5 Reallocated_Sector_Ct 0x0033 042 042 140 Pre-fail Always FAILING_NOW 1260

I figured as much. Nothing is writing to the array right now and disk5 is still showing green. The rebuild of the other drive is still going...sometimes slow, sometimes much faster. If I don't touch anything, is there any reason to believe the rebuild won't finish based off of the syslog and SMART report? I'm hoping it will, then I'll just drop another 2TB drive in to replace disk5. Otherwise, is there a way to reinstall the old 500GB drive without a previously saved config file and without invalidating parity?

August 21, 2012

The latest versions have a check box to indicate that parity is valid when the disks are assigned, once the configuration has been reset.

August 21, 2012

Author

The latest versions have a check box to indicate that parity is valid when the disks are assigned, once the configuration has been reset.

By later versions, I'm assuming you mean the beta and RCs for 5.0?

Taking a stab at what you are saying. If I were running the latest 5.0-rc, I could stop my array, assign my drives back the way they were (including a drive smaller than its intended replacement), run 'initconfig', possibly run a 'set invalidslot' command on a failing drive, and before starting my array, there would be a checkbox for "Trust the array/parity/etc" or something like that?

Is there not a way to do this with 4.7?

To simplify the discussion:

Before (all green, no errors, disk5 recently rebuilt successfully):

parity - 2TB

disk1 - 2TB

disk2 - 2TB

disk3 - 500GB

disk4 - 500GB

disk5 - 2TB

After (all green, disk4 replaced and currently rebuilding, disk5 read errors started with disk4 rebuild):

parity - 2TB

disk1 - 2TB

disk2 - 2TB

disk3 - 500GB

disk4 - 2TB (replaced 500GB)

disk5 - 2TB (still green, but thousands of read errors, SMART report shows failure)

I haven't been able to check on the system since I started the rebuild of disk4, but at the time, unRAID was not reporting disk5 as a failed disk. Rebuilding disk4 based off the "before" parity, is there a reason to believe it would not work as long as disk5 stays "green" throughout the rebuild process?

August 21, 2012

There is a "New config" button in V 5. No command line instructions are required. I don't know how to do this in 4.7. You could try upgrading, but that is dicing with an unhealthy array. I suggest that you reset the array config with the 500GB drive and recover the data on disk5 from backup.

August 21, 2012

Author

There is a "New config" button in V 5. No command line instructions are required. I don't know how to do this in 4.7. You could try upgrading, but that is dicing with an unhealthy array. I suggest that you reset the array config with the 500GB drive and recover the data on disk5 from backup.

This is where I'm confused. In 4.7, if I reset the array, won't that invalidate parity? Then I won't be able to recover disk5.

As it stands, I can't reassign the 500GB drive back to disk4 without resetting the array, because it is smaller than the current drive. But if I reset the array, I lose the parity I need to rebuild disk5.

August 21, 2012

You can do it. Here are the basic steps:

1. You have to re-assign the drives as you want them, i.e. assign the old drive back and assign the replacement for the failing drive.

2. You run initconfig.

3. Refresh the web page. I believe all drives will indicate blue.

4. You run the 'mdcmd set invalidslot' command and use the number of the drive that needs rebuilding.

5. Start the array.

6. Make damn sure the proper disk is being written.

Just to note, the replacement drive needs a valid unRAID partition on it before doing the above. I'm pretty sure that the preclear script does this. If the drive had been used before then it's partitioned. You can also temporarily assign it as cache and it will get partitioned.

I would not expect the current rebuild to work. unRAID can "rebuild" a sector on a healthy array. It can't do so with a replacement disk because it's missing the data on that disk.
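To illustrate the point about a rebuild needing every other disk to be readable, here is a toy Python sketch. It assumes a simple single-parity model in which parity is the XOR of all data disks; the names `xor_all` and `rebuild_sector` are hypothetical, the one-byte "sectors" are made up, and the real driver's handling of read errors is not modeled.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_all(blocks: Sequence[int]) -> int:
    """XOR a sequence of integer 'sectors' together."""
    return reduce(lambda a, b: a ^ b, blocks, 0)


def rebuild_sector(parity: int, others: Sequence[Optional[int]]) -> Optional[int]:
    """Reconstruct the missing disk's sector from parity plus the other disks.

    None in `others` marks a sector that could not be read. With single
    parity there is then nothing left to reconstruct from, so return None.
    """
    if any(block is None for block in others):
        return None
    return xor_all([parity, *others])


# Toy contents of the same sector on disk1..disk5.
data: List[int] = [0x11, 0x22, 0x33, 0x44, 0x55]
parity = xor_all(data)

# disk4 is being rebuilt onto a replacement; every other disk reads fine.
others_ok: List[Optional[int]] = [0x11, 0x22, 0x33, 0x55]
print(rebuild_sector(parity, others_ok) == 0x44)

# Same rebuild, but disk5 returns a read error on this sector.
others_bad: List[Optional[int]] = [0x11, 0x22, 0x33, None]
print(rebuild_sector(parity, others_bad))
```

In this model the second case has two unknowns (the disk4 sector and the disk5 sector) and only one parity equation, which is the situation described above.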

August 21, 2012

There is a "New config" button in V 5. No command line instructions are required. I don't know how to do this in 4.7. You could try upgrading, but that is dicing with an unhealthy array. I suggest that you reset the array config with the 500GB drive and recover the data on disk5 from backup.

This is where I'm confused. In 4.7, if I reset the array, won't that invalidate parity? Then I won't be able to recover disk5.

As it stands, I can't reassign the 500GB drive back to disk4 without resetting the array, because it is smaller than the current drive. But if I reset the array, I lose the parity I need to rebuild disk5.

Yes. Recover disk 5 from a backup.

August 21, 2012

Author

You can do it. Here are the basic steps;

1. You have to re-assign the drives as you want them. ie assign the old drive back and assign the replacement for the failing drive.

2. You run initconfig.

3. Refresh the web page. I believe all drives will indicate blue.

4. You run the set invalidslot and use the number of the drive that needs rebuilding.

5. Start the array.

6. Make damn sure the proper disk is being written.

Just to note, the replacement drive needs a valid unRAID partition on it before doing the above. I'm pretty sure that the preclear script does this. If the drive had been used before then it's partitioned. You can also temporarily assign it as cache and it will get partitioned.

I would not expect the current rebuild to work. unRAID can "rebuild" a sector on a healthy array. It can't do so with a replacement disk because it's missing the data on that disk.

So, again I may be a bit confused, but I'm working through a lot of reading. From everything that I have read, running an 'initconfig' command on 4.7 will reset the array and make the parity drive invalid, but I have also read that people do this immediately before the 'set invalidslot' command which would suggest that parity would stay valid for the purpose of the rebuild.

With that said, is it correct to say that running 'initconfig' by itself and then immediately starting the array would make parity invalid, whereas running 'initconfig' followed by the 'mdcmd set invalidslot' command to immediately disable a drive would leave parity valid for the purpose of rebuilding the disabled drive? Would this be like running 'mdcmd set invalidslot 99' to leave parity valid without disabling a disk?

Yes. Recover disk 5 from a backup.

There is no backup. If there were, I wouldn't be going the route of trying to revert disk4 and use the known good parity to rebuild a replacement drive. And before it is mentioned, I know, unRAID with a single parity drive is not itself a backup. Aside from needing an 8.5TB+ backup solution to accomplish this, everything that was on disk5 is recoverable by other means, but I would prefer not to have to, first, go through the process of identifying all of the Blu-rays and HD-DVDs I had taking up 2TB on this drive (the drive was easily 95%+ full), and second, re-rip all of them.

August 21, 2012

The internal coding in unRAID actually issues the 'mdcmd set invalidslot x' command as part of the rebuild process when you replace a drive. The difference is that the internal coding first partitions the drive before issuing the command to rebuild from parity. So, you need to get a proper partition on the drive first if you're doing it yourself.

Basically, the initconfig fixes the disk assignments but it uses 'mdcmd set invalidslot 0' to tell unRAID to build parity.

So, you issue 'mdcmd set invalidslot 5' to rebuild disk5 instead of the parity.

Hope that helps.
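The difference between 'invalidslot 0' and 'invalidslot 5' described above can be sketched with a toy Python model. It assumes single XOR parity, with slot 0 holding parity and slots 1 to 5 holding data; `start_array` and `xor_all` are hypothetical names for illustration, not unRAID internals. Whichever slot is marked invalid is the one that gets overwritten with the XOR of all the others.

```python
from functools import reduce
from typing import List, Sequence


def xor_all(values: Sequence[int]) -> int:
    return reduce(lambda a, b: a ^ b, values, 0)


def start_array(slots: Sequence[int], invalidslot: int) -> List[int]:
    """Return the slot contents after the invalid slot has been recomputed.

    slots[0] is parity, slots[1:] are data disks. Because parity is the XOR
    of the data disks, any single slot equals the XOR of all the others.
    """
    result = list(slots)
    others = [v for i, v in enumerate(slots) if i != invalidslot]
    result[invalidslot] = xor_all(others)
    return result


original_data = [0x11, 0x22, 0x33, 0x44, 0x55]   # disk1..disk5 before the failure
good_parity = xor_all(original_data)

# disk5 has been swapped for a blank replacement (0x00); parity is untouched.
slots = [good_parity, 0x11, 0x22, 0x33, 0x44, 0x00]

rebuilt = start_array(slots, invalidslot=5)   # write to disk5, trust parity
resynced = start_array(slots, invalidslot=0)  # write to parity, trust the disks

print(hex(rebuilt[5]))    # old disk5 content recovered from parity
print(hex(resynced[5]))   # disk5 stays blank; parity now describes the blank disk
```

In this model, starting with slot 0 invalid recomputes parity over the blank replacement, so the information needed to recover the old disk5 is gone. That is why step 6 in the earlier list matters.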

August 21, 2012

Author

Basically, the initconfig fixes the disk assignments but it uses 'mdcmd set invalidslot 0' to tell unRAID to build parity.
So, you issue 'mdcmd set invalidslot 5' to rebuild disk5 instead of the parity.

Exactly what I was looking for! Makes perfect sense! Thanks!

Also, I completely follow on the need for the drive to be partitioned correctly.

August 21, 2012

Author

Slightly concerned now. This is what I did.

1 - Stopped the array

2 - Reassigned old 500GB back into old disk4 slot

3 - Assigned replacement drive to disk5 (previously been pre-cleared and was being used for the rebuild of disk4)

4 - Ran 'initconfig'

5 - Refreshed page and everything is blue as expected

6 - Ran 'mdcmd set invalidslot 5' from /root, but I never got the response:

cmdOper=set

cmdResult=OK

Everything is still blue and I have not started my array. Since I didn't get the response I expected from the mdcmd command, is there any other way to check if disk5 was properly set as invalid? Could this indicate the drive is not properly partitioned? I assumed I wouldn't need to pre-clear the drive again since it had already been in the array and was undergoing a rebuild. Figured it would be flagged as a replacement drive and the rebuild of disk5 would simply overwrite any of the data that had been rebuilt from disk4. Here is the section of the syslog after initconfig. The last line shows the mdcmd command, but I'm concerned since I didn't get what I expected after hitting enter and everything is still blue:

Aug 21 19:06:26 Tower emhttp: shcmd (71): modprobe -rw md-mod 2>&1 | logger (Other emhttp)
Aug 21 19:06:26 Tower kernel: md: unRAID driver removed (System)
Aug 21 19:06:26 Tower emhttp: shcmd (72): modprobe md-mod super=/boot/config/super.dat slots=8,80,8,16,8,0,8,32,3,64,8,64 2>&1 | logger (unRAID engine)
Aug 21 19:06:26 Tower kernel: xor: automatically using best checksumming function: pIII_sse (System)
Aug 21 19:06:27 Tower kernel:    pIII_sse  : 11290.400 MB/sec (System)
Aug 21 19:06:27 Tower kernel: xor: using function: pIII_sse (11290.400 MB/sec) (System)
Aug 21 19:06:27 Tower kernel: md: unRAID driver 1.1.1 installed (System)
Aug 21 19:06:27 Tower kernel: read_file: error 2 opening /boot/config/super.dat (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: could not read superblock from /boot/config/super.dat (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: initializing superblock (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk0: [8,80] (sdf) ST32000542AS     6XW1A0JN size: 1953514552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk0 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk1: [8,16] (sdb) Hitachi HDS5C302 ML0220F3045MRD size: 1953514552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk1 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk2: [8,0] (sda) ST32000542AS     9XW0ADZH size: 1953514552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk2 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk3: [8,32] (sdc) ST3500630AS      6QG3NFD7 size: 488386552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk3 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk4: [3,64] (hdb) WDC WD5000AAJB-00UHA0 WD-WCAPW2413121 size: 488386552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk4 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: md: import disk5: [8,64] (sde) WDC WD2001FASS-0 WD-WMAY00471750 size: 1953514552 (Drive related)
Aug 21 19:06:27 Tower kernel: md: disk5 new disk (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (1): set md_num_stripes 1280 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (2): set md_write_limit 768 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (3): set md_sync_window 288 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (4): set spinup_group 0 0 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (5): set spinup_group 1 0 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (6): set spinup_group 2 0 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (7): set spinup_group 3 0 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (8): set spinup_group 4 0 (unRAID engine)
Aug 21 19:06:27 Tower kernel: mdcmd (9): set spinup_group 5 0 (unRAID engine)
Aug 21 19:07:13 Tower kernel: mdcmd (10): set invalidslot 5 (unRAID engine)
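One way to double-check the assignments before pressing Start is to tabulate the "md: import diskN" lines from the log above. The following Python sketch does that with a regular expression written against the line format shown here. The helper name `parse_imports` is hypothetical, and the pattern is an assumption that may not match other unRAID versions.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches e.g. "md: import disk5: [8,64] (sde) WDC ... size: 1953514552"
IMPORT_RE = re.compile(
    r"md: import disk(\d+): \[\d+,\d+\] \((\w+)\) (.+?)\s+size: (\d+)"
)


def parse_imports(lines: Iterable[str]) -> Dict[int, Tuple[str, str, int]]:
    """Map slot number -> (device, model/serial string, size as logged)."""
    slots: Dict[int, Tuple[str, str, int]] = {}
    for line in lines:
        m = IMPORT_RE.search(line)
        if m:
            slot, dev, ident, size = m.groups()
            # Collapse the padding between model and serial to one space.
            slots[int(slot)] = (dev, " ".join(ident.split()), int(size))
    return slots


log = [
    "Aug 21 19:06:27 Tower kernel: md: import disk4: [3,64] (hdb) WDC WD5000AAJB-00UHA0 WD-WCAPW2413121 size: 488386552 (Drive related)",
    "Aug 21 19:06:27 Tower kernel: md: disk4 new disk (unRAID engine)",
    "Aug 21 19:06:27 Tower kernel: md: import disk5: [8,64] (sde) WDC WD2001FASS-0 WD-WMAY00471750 size: 1953514552 (Drive related)",
    "Aug 21 19:06:27 Tower kernel: md: disk5 new disk (unRAID engine)",
]

slots = parse_imports(log)
for slot in sorted(slots):
    dev, ident, size = slots[slot]
    print(f"disk{slot}: {dev:4} {size:>12} {ident}")
```

For the two lines sampled here, this confirms that the 500GB WD drive is back in slot 4 and the 2TB WD drive sits in slot 5, the slot about to be rebuilt.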

August 22, 2012

Looks good so far. Older versions of mdcmd gave back a response. Don't think the current one does.

August 22, 2012

Author

Looks good so far. Older versions of mdcmd gave back a response. Don't think the current one does.

Should I be concerned that all drives are still showing as blue when I refresh the page and that unmenu is showing "NEW_ARRAY, unRAID ARRAY is STOPPED 6 disks in array. PARITY NOT VALID: DISK_NEW"? I came across an older post of yours suggesting that after running the set invalidslot command, you should see greens on all drives except the one set as invalid, and it should be red. This post: http://lime-technology.com/forum/index.php?topic=17617.msg158691#msg158691

August 22, 2012

You will get no response on 4.7. Just do the initconfig, refresh, mdcmd command and then press start. Look for the command in the syslog, last line that Joe quoted above if you want to make sure.

I'm not sure but refreshing the web interface again might screw it up.
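Since 4.7 prints nothing back, the syslog is the only confirmation. Here is a small Python sketch of that check, using the log lines posted earlier in the thread as sample input. `last_invalidslot` is a hypothetical helper, and the pattern is an assumption based only on the line format shown in this thread.

```python
import re
from typing import Iterable, Optional

INVALIDSLOT_RE = re.compile(r"mdcmd \(\d+\): set invalidslot (\d+)")


def last_invalidslot(lines: Iterable[str]) -> Optional[int]:
    """Return the slot from the most recent 'set invalidslot' line, if any."""
    slot: Optional[int] = None
    for line in lines:
        m = INVALIDSLOT_RE.search(line)
        if m:
            slot = int(m.group(1))
    return slot


sample = [
    "Aug 21 19:06:27 Tower kernel: mdcmd (9): set spinup_group 5 0 (unRAID engine)",
    "Aug 21 19:07:13 Tower kernel: mdcmd (10): set invalidslot 5 (unRAID engine)",
]

print(last_invalidslot(sample))        # slot that will be rebuilt
print(last_invalidslot(sample[:1]))    # None: command never reached the driver
```

On a live system the lines would come from the syslog file itself rather than a hard-coded list. If the result is None, or a slot other than the one intended, do not start the array.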

August 22, 2012

You will get no response on 4.7. Just do the initconfig, refresh, mdcmd command and then press start. Look for the command in the syslog, last line that Joe quoted above if you want to make sure.

I'm not sure but refreshing the web interface again might screw it up.

In 4.7 you can refresh the browser (in the 5.0 series you cannot), and all drives should then appear BLUE, because you are establishing a new configuration.

August 22, 2012

Author

Thanks everyone. Everything looks to be going along nicely. disk4 is back in place and disk5 is rebuilding. I'll report back once disk5 is rebuilt as to how successful I was in doing all of this.

August 23, 2012

Author

Drive appears to have been rebuilt just fine. I'm running a NOCORRECT Parity Check right now, but there is one problem that I thought an array stop/start or reboot would correct. disk5 is a 2TB drive showing a size of "1,953,514,552" but with a free amount of "-,997,659,240". The amount showing as free stayed the same amount during the entire rebuild and has not changed. In reality, the drive is almost full and should have less than 100GB remaining. I have played a number of videos directly from the drive and everything appears to be fine.

Any thoughts?

EDIT: Also, when viewed as a network drive on another machine, it claims to have over 3TB in space available. So far, I have not written anything to the drive to test it.

August 23, 2012

Author

also, df shows:

/dev/md5 1953454928 -1343853128 3297308056 - /mnt/disk5
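A quick arithmetic check on that df line, as a Python sketch. It assumes the usual df column order (filesystem, total, used, available, use%, mount point) and 1K blocks, and `parse_df_line` is a hypothetical helper. It only shows how the three numbers relate; it does not explain why the filesystem reports them.

```python
from typing import Dict, Union


def parse_df_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one df row into named fields (column order is an assumption)."""
    f = line.split()
    return {
        "filesystem": f[0],
        "total_k": int(f[1]),
        "used_k": int(f[2]),
        "avail_k": int(f[3]),
        "mount": f[5],
    }


row = parse_df_line("/dev/md5 1953454928 -1343853128 3297308056 - /mnt/disk5")

total_k = int(row["total_k"])
used_k = int(row["used_k"])
avail_k = int(row["avail_k"])

# "Used" is exactly total minus available, so the negative number is a
# by-product of the available figure being larger than the whole disk.
print(total_k - avail_k == used_k)

# Available space in decimal terabytes, assuming 1K (1024-byte) blocks.
avail_tb = avail_k * 1024 / 1e12
print(round(avail_tb, 2))
```

So the negative "used" value follows directly from the free-space figure, which comes out at over 3TB on a 2TB disk and matches what the network share reports.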
