Help! Failure + start (rebuild) + removal + start + stop



I have a mixed Pro setup, current state shown in the pic below. I had a 250GB drive with a couple of errors, so I decided to run a parity check to see if that would reallocate all the bad sectors. Well, it got progressively slower until it was crawling and essentially hung - I couldn't unmount, etc. So I removed the bad drive and started back up. It began a rebuild, but I saw how long it was going to take and got a bit scared that the orange parity ball plus the red missing ball meant I would lose data (dumb noob mistake - I've since read the Top Ten Ways to Lose Data). So I put the drive back in and tried to start up so I could probe it with SMART view in unmenu (which I installed right then). After it ran for a while with a bunch of errors (of course) on the bad drive, everything became totally unresponsive, so I shut everything down to see if I could restart the array as I had done before. No - now it shows as below. Doh! What should I do to minimize data loss? Thanks!!

 

Edit: here's the syslog I just grabbed, if it helps:

http://pastebin.com/f48235285

 

Picture2.png



According to your screenshot you have one drive that is not responding at all (disk3). You are currently "simulating" its contents using the parity drive in combination with the other drives.

 

You really have only one action available to you at this time: replace the defective disk. (Unless it is a cabling problem, in which case replace the defective cable, or re-seat it if it is loose.)

 

When you replace the drive use the "Start" button to re-start the array.  Do NOT use the button labeled "restore" as it does not restore any data, but instead resets your disk configuration to the way it was before you added any disks. It throws away all parity data and starts the process to calculate it again. If you were to use it now, all the data on disk3 would be lost, as parity would be thrown away and a new parity calculation without disk3 started.
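To see why a single missing disk can be "simulated" at all, here is a toy sketch of the underlying idea (illustration only, not unRAID's actual code): parity is the XOR of all the data disks, so any one disk's contents can be recomputed from parity plus the remaining disks.

```shell
# Toy illustration only - not unRAID's actual code. Parity is the XOR of all
# data disks, so a missing disk can be recomputed from parity + the others.
d1=$((0x5A)); d2=$((0x3C)); d3=$((0xF0))   # one byte from each of three data disks
p=$(( d1 ^ d2 ^ d3 ))                      # the corresponding parity byte
rebuilt=$(( p ^ d1 ^ d2 ))                 # "disk3" byte, reconstructed without reading disk3
echo "$rebuilt"                            # prints 240, i.e. 0xF0 - the original d3
```

This is also why losing parity *and* a data disk at the same time is fatal: the XOR can only fill in one unknown.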

 

 



 

The current array state is stopped, and the error message is about too many bad or missing disks.  The start button looks disabled.  If it was only disk3, wouldn't unRAID allow the array to be started and simulate it?

 

The parity disk is orange.  I am not quite sure I understand what that means.  The manual does not explain the orange status color, but I think unRAID's message about too many bad disks has to do with parity.

 

Not sure a simple drive rebuild is going to work at this point.

 

BTW - for future reference - if you see a disk turn red or notice any other unusual behavior, your first step should be to save a copy of the syslog. The attached syslog is from after a reboot, and although useful, it provides little help in determining why the drive was disabled.

 

I'd suggest running a smartctl report on the parity disk at this point.  If you've loaded unmenu, the myMain "smart view" will provide this information.

 

One more thing - users should realize that your name is embedded in the syslog.  If you obtain the syslog via unmenu, your name is automatically removed.  But if you get the syslog directly, you would need to manually edit to remove it.
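For that manual edit, a one-liner like the following works. The hostname "files" is the one visible in this thread's log lines, "tower" is an arbitrary stand-in, and the file is created here just so the example is self-contained - substitute your own names and the path of your saved syslog copy.

```shell
# Example of scrubbing a server name from a saved syslog before posting it.
# "files" is the hostname from this thread; "tower" is an arbitrary stand-in.
printf 'Apr 15 18:21:17 files kernel: ReiserFS warning\n' > syslog.txt  # stand-in for your saved copy
sed -i 's/files/tower/g' syslog.txt   # replace every occurrence in place
cat syslog.txt
```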



I missed that. I think you are right... we need to know a bit more about which disks are supposed to be there, and whether any cabling was changed to put drives on different ports, etc., which could cause the "too many bad or missing disks" message.


Good advice about saving the syslog early and scrubbing your name out of it.

 


Indeed, "Start" is disabled, I assume because of the orange parity disk. I had started a rebuild and then stopped it partway, unsure, so I don't know what state the parity drive is in.

 

Since it won't rebuild without the failed disk (which isn't plugged in because it hangs up Unraid), won't I lose the failed disk's data if I replace it with a new disk and rebuild? Or will it use as much of the "good" parity as it can?


Those smart errors don't look like a problem on the parity disk. 

 

Here is what I recommend.

 

(You'll need a replacement disk >=250G and <= 640G)

 

1.  Power down, remove the damaged disk, and install the new replacement disk (make sure the cabling is secure - use a fresh data cable)

2.  Power up the server; the array will not start (if it does, stop it immediately)

3.  Go to the devices page, unassign disk3, and assign your replacement disk to disk3

4.  Return to the main page

5.  Press the restore button (usually a bad idea, but necessary in your case).  Do not start the array.

6.  Open a telnet session, and enter the command ...

 

   mdcmd set invalidslot 3  (this tells unRAID that when the array is started, to rebuild disk3 instead of rebuilding parity, which is the default)

 

7. Go back to the unRAID Web GUI / main tab, and start the array.

8.  You should see the write count on disk3 growing, and the read counts on the other disks growing (if you refresh the page a few times)

9.  Let it complete.

 

If disk3 does not appear fully recovered, post back for further directions.

 

Note you'll need a replacement disk that is between 250G and 640G.  This won't work with a replacement disk larger than the parity disk.


Wow, thanks. I will get such a disk and report back - I probably would have tried for a new "biggest" disk as a replacement, so I'm glad you gave me this procedure.

 

If you are upsizing a disk or replacing a failed disk in the "normal" way, there is a parity-swap procedure that will allow you to replace parity with a larger parity, and use the old parity disk to replace the upsized / failed disk.  However, if you need to use the "set invalidslot" procedure, that automatic feature is not available to you.

 

I'm sure you could manually copy your parity disk to a new larger disk (by zeroing out the new parity disk and then copying the old parity disk sector by sector onto it), and then use the old parity disk as the replacement disk.  This has not been done before, but it would not be that hard to do.
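The zero-then-copy idea can be sketched with dd. The commands below use small image files as stand-in "disks" so the example is self-contained and touches nothing real; on actual hardware you would point if=/of= at the real /dev/sdX device nodes, and you must verify them carefully first, since dd will happily overwrite the wrong disk.

```shell
# Sketch of the manual parity-copy idea, using image files as stand-in disks.
# On real hardware, substitute the actual /dev/sdX devices - verify them first!
dd if=/dev/urandom of=old_parity.img bs=1M count=1 2>/dev/null  # stand-in for the old parity disk
dd if=/dev/zero    of=new_parity.img bs=1M count=2 2>/dev/null  # zero out the new, larger "disk"
dd if=old_parity.img of=new_parity.img conv=notrunc 2>/dev/null # copy old parity over the start
ls -l old_parity.img new_parity.img
```

conv=notrunc is the important part: it stops dd from truncating the larger target, so the region beyond the old parity's size stays zeroed, which is exactly what a freshly extended parity disk needs.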

 

The other thought is that you may be able to temporarily use a smaller disk for the recovery effort and, after you are sure the recovery was good, use the normal parity-swap procedure to upsize that disk.  Then you would return the temporary disk to its prior purpose / owner.


I'm trying your more conservative first suggestion  :)

 

So far, it is behaving as you said it would. I replaced the bad 250GB disk3 with a 500GB disk. Interestingly, while it was "Starting...", disk3 initially showed the same amount free as the old drive - 9GB or so. After the rebuild got going, it changed to 250GB free.

 

Starting:

th_before.png

 

Rebuilding:

th_after.png

 

I'll report back when it's done with the results!

 


Well, it seems to have completed successfully, all green.  8) There's a ton of data on the drive, including a sparse disk image, so hopefully nothing got corrupted. I'll give an update in a week or so.

 

Thanks for all the help! Especially the recovery procedure, bjp999!


Well, something's borked when I try to access the sparse disk image that is on my user share, which is actually split across several drives. I hope it's just the sparse disk image - it'd be a bummer, but it's my Time Machine backups, so I could delete them and start fresh from now. Here's what gets printed out in the syslog when accessing this sparse image (which is a directory structure) ... ideas? Should I force an fsck somehow?

 

Apr 15 18:21:17 files kernel: ReiserFS: warning: is_tree_node: node level 0 does not match to the expected one 2

Apr 15 18:21:17 files kernel: ReiserFS: md3: warning: vs-5150: search_by_key: invalid format found in block 2037165. Fsck?

Apr 15 18:21:17 files kernel: ReiserFS: md3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [4178 9205 0x0 SD]

[... the same three lines repeat over and over, with the stat data key advancing from [4178 9206 0x0 SD] through [4178 9241 0x0 SD], all referencing block 2037165 ...]

