February 25, 2009
I have a 1+4 disk array (one parity disk and four data disks). One disk failed; it is not even recognized by the BIOS. It had a red dot and my parity drive had a blue dot. And then I screwed up. I replaced the failed disk with a new one: I unassigned the old one, assigned the new one, put the checkmark on "I'm sure I want to do this", and then (stupid me) I typed "mdcmd set invalidslot 99". I started the array and formatted the new disk; it then proceeded to check parity. I stopped it as soon as I noticed. Is there any way I can still reconstruct the data for the failed drive? Please help.
February 25, 2009
(Before you do ANYTHING else, make a copy of the "config" folder on the flash drive for safekeeping.)
The only thing I can think of doing is to stop the array, then type
mdcmd set invalidslot XX
(where XX is the actual slot number of the FAILED drive), then check the checkbox below "Start" and press it. Let it rebuild the failed drive as best it can. Then un-mount the failed drive and run reiserfsck on it. It will probably request that you run a rebuild-tree on it, but let it guide you.
Whatever you do, save the old drive, just in case it can be read somehow. Depending on how long the parity check ran, you might get lucky if it did not get to any of your critical files.
Others may have other ideas... you might wait for concurrence from another experienced user before doing as I described above. I've never done it, and you might need to press Restore to get back to a known state...
Joe L.
February 25, 2009 Author
Hi. During the parity check it did 2,953,503 reads and 2,212,140 writes. The disks are 1 TB. How do I find the XX number for "invalidslot XX"? Forgive me, I'm quite a newbie here.
February 25, 2009
What disk failed? If it was disk1 that failed, then XX would be 1; if disk4, then XX would be 4. A mistake here will start writing a good drive with bad reconstructed data... Do not get it wrong, or your headache will get even bigger.
Oh yes, some friendly advice for you and any other newbie: if you ever get hold of some matches, do not go anywhere near gasoline, or dynamite, or propane...
To help others in the future, where did you read about "invalidslot 99"? Wherever it is, it needs to be surrounded by warnings that it is ONLY to be used when all disk hardware is KNOWN TO BE WORKING.
Joe L.
February 25, 2009
Consider the portion of the disk where parity was incorrectly calculated as lost forever. Over 2 million writes might involve a significant number of files. Sorry to be the bearer of bad news. We'll soon see how good reiserfs is at rebuilding a massively corrupted file system.
You might send an e-mail to [email protected] pleading for help and pledging your first-born. On second thought, just ask for help; Tom said he already has enough kids.
Joe L.
February 25, 2009 Author
I read it here: http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive%2C_Avoid_Rebuilding_Parity_Unnecessarily
I misunderstood it: after I replaced the failed drive with the new one, the parity disk still had a blue dot, and I misunderstood what that meant. Disk1 is the one that failed, so do I type: mdcmd set invalidslot 01 ?
February 25, 2009 Author
Should the green light beside disk1 change after mdcmd set invalidslot 01?
February 25, 2009
I would guess it will change to blue. If it is red, you will need to convince it the drive is no longer the one that failed.
Do NOT press Start unless you are back to where the failed disk is BLUE and the other data disks are green. (I'm not sure what color the parity disk will be.) If the indicator is red, post a screen shot and a syslog. Do not proceed. It will probably say that it will rebuild the failed disk too. It should look like this post.
Joe L.
February 25, 2009 Author
After stopping the array I typed: mdcmd set invalidslot 01 but all drives still have a green light. What should I do?
February 25, 2009
If you refresh the browser, does it stay green? If so, I'd try
mdcmd set invalidslot 1
(without the leading zero), then refresh the browser once more. If it is still all green, do not proceed. Wait for a response from Tom @ Lime-tech.
Joe L.
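A minimal dry-run sketch of the slot-number check discussed above, in bash. The helper name `invalidslot_cmd` is hypothetical. It assumes slots map to disk names as Joe describes (disk1 is slot 1, disk4 is slot 4) and that the number should be given without a leading zero, which Joe suggests but the thread does not confirm. It only prints the command for you to read; it never runs mdcmd.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the mdcmd line for a disk name such as
# "disk1" or "disk01". It never executes mdcmd itself.
invalidslot_cmd() {
    local name=$1
    # Accept only diskN names, as the data disks are labeled in the GUI.
    if [[ ! $name =~ ^disk([0-9]+)$ ]]; then
        echo "refusing: '$name' is not a diskN name" >&2
        return 1
    fi
    # Strip leading zeros; force base 10 so "08" is not parsed as octal.
    local slot=$((10#${BASH_REMATCH[1]}))
    # Data disks start at disk1; 99 is the special value from the
    # Trust My Array procedure, so never build it here.
    if (( slot == 0 || slot == 99 )); then
        echo "refusing: slot $slot is not a failed data disk" >&2
        return 1
    fi
    echo "mdcmd set invalidslot $slot"
}
```

For example, `invalidslot_cmd disk01` prints `mdcmd set invalidslot 1`, while `invalidslot_cmd disk99` refuses.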
February 25, 2009 Author
Yes, I tried it without the leading zero, and after refreshing the browser everything is still green. Does he check this forum, or do I have to send him an email?
February 25, 2009
Send him an e-mail.
Joe L.
February 25, 2009 Author
Thanks for now, Joe. I have to go to work. Panicking and rushing is not a good thing. STUPID ME. I've sent Tom an email; we'll see if I get anything when I come back from work. Thanks again.
Rob.
February 27, 2009 Author
Here is a detailed description of what happened, and a copy of the email I sent to Tom. Unfortunately I have not gotten any reply yet. If anyone has any ideas, please help. Basically, what put me in panic mode is that after plugging in the new replacement drive, the parity drive was still blue and it said there were too many disks missing.
"It is all SATA. I was writing a disk image from a computer on my network to disk1 of the unRAID array. I went to get a coffee, and when I came back I saw a message on my network computer that Acronis could not write the image. I opened a browser (unRAID menu) and saw that there were errors on disk1 and the parity drive. Disk1 had a red dot and was reported as missing, and the parity drive was blue. I restarted the system: same thing. I shut down the computer, checked all the cables, and restarted. Disk1 was not being recognized in the BIOS. I removed disk1 and checked it on 2 other computers. The drive spins up but is not recognized in any computer's BIOS.
Because of that and the blue dot by the parity drive, I panicked. I did not have a spare drive of the same size or bigger (I could not find any info on whether I could use a smaller drive; I could only find info that the parity drive has to be the same size or bigger, but nothing about data drives). In the morning I went and bought a 1TB drive (the same size as the failed disk1). I could not find exact steps explaining failed-drive replacement, and I was in panic mode and rushing instead of reading. That's where I came up with the idea of "mdcmd set invalidslot 99".
Oh, after unassigning disk1 and assigning the new one, I started the array, followed by formatting the new drive. I thought it was going to take a while, and when I came back to check it, the parity was being rebuilt. That did not look right, so I stopped it, and that's where I am now. I have tried "mdcmd set invalidslot 01" as Joe L. suggested, but disk1 is still green, so I stopped. If there is anything that can be done, please help. Thank you."
Anybody, PLEASE help.
Rob
February 27, 2009
I'm sorry I've been tied up; it looked like you were getting help, and I had nothing to add. I haven't had time for a full analysis, but it does not look good, and I can't hold out much hope. It really would have been wise to capture a syslog as soon as you saw the problem, before rebooting. I can't stress enough to users the importance of saving the syslog *before* you reboot. It held all of the evidence as to what went wrong and how serious the parity drive issues were: whether they were real problems or just 'collateral damage' from the other drive's problems.
The invalidslot command is a very dangerous one that should only be used in certain contexts, such as in the Trust My Array procedure, following the use of the Restore button. In your first post, you indicated you "put the checkmark on "I'm sure I want to do this"". Was that associated with the Restore button or the Start button? I wasn't sure from your comments.
In your first post, you indicated that it "proceeded to check parity", which would be normal and usually harmless, but in your last post, you indicated that "the parity was being rebuilt", which is more serious. I think you meant a parity check rather than a parity rebuild? I just want to make sure. Either way, the parity was being 'corrected' for those early gigabytes, which makes recovery of the data from the failed drive very problematic (but not completely impossible). You were right to stop it.
At this point, and based only on my limited understanding of exactly what has happened, I believe the first few gigabytes of the parity drive are based on the new, empty replacement drive, and the rest of the parity drive holds parity info based on the failed drive. So it may be possible to rebuild Disk 1 from it, knowing that the first part of the drive will be the file system of an empty drive, and the rest of it will be the data from the failed drive.
Then, if we can 'convince' reiserfsck to run a --rebuild-tree on the whole drive, ignoring the good but empty file system at the beginning of the drive, you *may* be able to recover most of the data on the drive. This is a new and dangerous procedure, and I would like Brian to review it for correctness, but I think Joe was on the right track with the "mdcmd set invalidslot 1" command. The only thing is, as the instructions in the Trust My Array procedure say, you have to use the dreaded Restore button first. That turns all of the drives blue. You will get very specific responses at each step of this procedure, and you need to verify them exactly. In the very last step, however, it will not start a parity check but a rebuild of Disk 1. Once that is complete, we will have to try to get reiserfsck to rebuild a file system based on the entire drive. I'll leave that for a later post. For now, I would recommend waiting for Brian to confirm my proposed plan. He's also better at laying out perfect step-by-step instructions, if he has time!
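RobJ's reasoning can be checked on a toy model. unRAID's single parity disk holds the XOR of the data disks at each position, so a missing disk is rebuilt as parity XOR the surviving disks. The bash sketch below is an illustration only, not how unRAID is implemented: four data disks of six small integer "sectors", with the interrupted parity pass assumed to have recomputed the first two sectors from the blank replacement (modeled as zeros) instead of the failed disk. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of single XOR parity: 4 data disks, 6 "sectors" each.
# Small integers stand in for sector contents.

# Rebuild one sector of the missing disk: parity XOR every surviving disk.
rebuild_sector() {
    local v=$1 d
    shift
    for d in "$@"; do
        v=$(( v ^ d ))
    done
    echo "$v"
}

parity_demo() {
    local -a d1=(11 12 13 14 15 16)   # failed disk1, original data
    local -a d2=(21 22 23 24 25 26)
    local -a d3=(31 32 33 34 35 36)
    local -a d4=(41 42 43 44 45 46)
    local -a blank=(0 0 0 0 0 0)      # replacement disk (assumed blank)
    local -a parity=() out=()
    local i

    # Parity as it stood before the failure: it covers the original disk1.
    for i in "${!d1[@]}"; do
        parity[i]=$(( d1[i] ^ d2[i] ^ d3[i] ^ d4[i] ))
    done

    # Assumption: the interrupted parity pass reached sectors 0 and 1 and
    # recomputed them from the blank replacement instead of disk1.
    for i in 0 1; do
        parity[i]=$(( blank[i] ^ d2[i] ^ d3[i] ^ d4[i] ))
    done

    # Rebuild disk1 from the (partly overwritten) parity plus disks 2-4.
    for i in "${!d1[@]}"; do
        out+=("$(rebuild_sector "${parity[i]}" "${d2[i]}" "${d3[i]}" "${d4[i]}")")
    done
    echo "${out[*]}"
}
```

`parity_demo` prints `0 0 13 14 15 16`: the overwritten sectors come back as the replacement's contents, while the rest of the original disk1 is recovered. That is the mix RobJ expects, an empty file system at the start and old data after it.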
February 27, 2009
I am always an optimist with data recovery, and seldom (never, actually) have I seen one of these cases where no recovery was possible. But I have to say that you have made a number of wrong turns, and your chances of recovery are certainly impacted.
You should not use "set invalidslot 99" unless you have ALL of the drives in your array and you KNOW that parity is correct, or at least very close to correct. If you have 3 out of 4 data drives and follow that procedure, it will assume that your parity is in sync with the 3 data drives and that the fourth data drive does not exist. It will proceed to update parity sector by sector to make that a reality, and each sector written is a sector that will not be recoverable. Hopefully you did not leave it in this mode for too long!
You should not use "set invalidslot xx" unless you press the Restore button first. We have no idea what the impact is of running it on an array whose drives have green balls. NO IDEA. Maybe it ignores it. That would be best for you.
Here is what you should do, IMO:
1. Make a backup of your flash drive's config directory.
2. Make sure that all of your "good" drives are in the array and that the replacement drive is also in the array. The failed disk should be out of the machine.
3. The array should not start on reboot. If it does, stop the ensuing parity check immediately.
4. Upon boot, ensure that the parity drive is assigned to the parity slot and the data drives are assigned to the data slots. The replacement disk should be assigned to the slot that the failed drive was assigned to (e.g., slot 3 / disk3).
5. Double check the setup.
6. Triple check the setup.
7. Quadruple check the setup.
8. (Just for you) Quintuple check the setup.
9. Press the Restore button (you may have to click a checkbox saying you're sure). DO NOT START THE ARRAY.
10. Go to a telnet prompt and enter the command mdcmd set invalidslot X, where X is the slot of the replacement disk (e.g., "mdcmd set invalidslot 3" if the replacement disk is in slot 3 (disk3)).
11. Go back to the unRAID console and start the array.
12. The contents of disk3 (or whatever disk you specified) should start to rebuild. You should see lots of writes to that drive in the unRAID GUI.
13. Let it finish.
14. If you hadn't made some of the prior errors (sorry, not trying to rub it in, but it is the truth), the drive would be fully restored and you would be good to go. As it is, chances are your drive will show as unformatted, or it will show corruption when you go to look at it. If you look at it and all looks fine, it will be a small miracle. (A small word of thanks to the deity of your choice may be in order.)
15. But chances are that the rebuild will result in a corrupted disk. It will likely show as unformatted. DO NOT PRESS THE FORMAT BUTTON.
16. Make sure your other disks look okay. If not, make note of the problem disk(s). Hopefully they are fine, and this should be a reassuring step for you.
17. You now need to run reiserfsck. Refer to the instructions here.
18. Reiserfsck may run very quickly and may prompt you to rerun it with various options. You might want to refer to this post and read about a nightmare that I had a while ago. I had to run reiserfsck with "rebuild tree" settings similar to what you may need to do.
19. Let it finish. When done, stop and restart the array.
Hopefully your data, or at least a lot of it, will be back. Mine got jumbled a bit and a bunch of trash got added, but I retrieved a very high percentage of my data.
Good luck!
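Step 18's "rerun it with various options" can be read off reiserfsck's exit status. The sketch below maps the exit codes listed in the reiserfsck(8) man page from reiserfsprogs to the suggested next step; check the man page shipped with your version before relying on it. The function name is hypothetical. Note that `--rebuild-tree` rewrites the file-system tree, so run it only when the read-only `--check` asks for it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a reiserfsck exit status to the next step, per
# the EXIT CODES section of reiserfsck(8). Verify against your version.
reiserfsck_next_step() {
    case $1 in
        0)  echo "no errors" ;;
        1)  echo "file system errors corrected" ;;
        2)  echo "reboot is needed" ;;
        4)  echo "fatal errors left: run reiserfsck --rebuild-tree" ;;
        6)  echo "fixable errors left: run reiserfsck --fix-fixable" ;;
        8)  echo "operational error" ;;
        16) echo "usage or syntax error" ;;
        *)  echo "unrecognized exit status $1" ;;
    esac
}

# Usage sketch (read-only check first; the device name is the one used in
# this thread, and reiserfsck asks for confirmation interactively):
#   reiserfsck --check /dev/md1
#   reiserfsck_next_step $?
```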
March 1, 2009 Author
Well, it did not work. After the rebuild finished, all disks are green and the new disk1 shows as empty. I was not able to run reiserfsck. I did:
1. samba stop
2. umount /dev/md1
It responds that the device is busy, so reiserfsck did not want to run, telling me that the partition is mounted with write permissions.
March 1, 2009
You need to run reiserfsck! You might need to wait for Joe L. or RobJ to give instructions on what to do so you can run reiserfsck. Be very careful. Do not write to the array in this state.
March 1, 2009 Author
In the browser window I pressed "Stop" and tried to power down the system, but after pressing "Stop" the browser did not want to refresh and I lost contact with "Tower". At the unRAID command line the system seems to hang (I was not able to power it down safely, without holding the power button). Should I boot an Ubuntu live CD and try to run reiserfsck from there?
March 1, 2009
If you have lost contact and can no longer telnet in either, you will need to reboot. Odds are good that after a reboot you will be able to stop samba AND un-mount disk1.
If you run reiserfsck from a live CD you will invalidate parity on your unRAID server, making it useless. (I'd save that step for another emergency; don't use the live CD at this time.) If you run reiserfsck on unRAID as instructed, the parity drive will be kept in sync with any repairs made.
If a Linux system says a device is "busy" and cannot be un-mounted, then some process has a file open on that file system, OR a process has a directory on it as its current directory. For example, if you were to telnet to the server and then "cd /mnt/disk1", you would not be able to un-mount disk1, since it is your current directory; once you "cd" elsewhere, the umount command can succeed. Depending on what programs are running, you will be unable to un-mount a disk if it is in use. Normally, as unRAID is distributed by Lime-tech, stopping samba is sufficient to allow you to un-mount a disk (as long as you did not log in and "cd" to it).
We figured that the data-recovery rebuild might result in massive corruption of the file system, but it does seem that it mounts, so there is still a chance to recover some data. Let's see what a reiserfsck does. Whatever you do, do not reformat the disk.
So, force a reboot of your server by power-cycling it or pressing the reset button, then telnet in, stop samba as instructed, and then start with a simple
reiserfsck /dev/md1
Let us know what it says before going further.
Joe L.
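Before running `reiserfsck /dev/md1`, it helps to confirm that /mnt/disk1 is really no longer mounted, since reiserfsck refused to run on a partition mounted read-write. A minimal bash sketch, assuming a Linux /proc/mounts in the usual "device mountpoint fstype options dump pass" format; the function name `mount_state` and its optional file argument (handy for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# Report how a mount point is mounted, reading the /proc/mounts format:
#   device mountpoint fstype options dump pass
# An optional second argument points it at another file in that format.
mount_state() {
    local target=$1 mounts=${2:-/proc/mounts}
    local dev mnt fstype opts rest
    while read -r dev mnt fstype opts rest; do
        if [[ $mnt == "$target" ]]; then
            # The comma-separated options field contains "rw" for
            # read-write mounts; anything else is reported as read-only.
            case ",$opts," in
                *,rw,*) echo "mounted rw ($dev)" ;;
                *)      echo "mounted ro ($dev)" ;;
            esac
            return 0
        fi
    done < "$mounts"
    echo "not mounted"
    return 1
}
```

After a successful `umount /dev/md1`, `mount_state /mnt/disk1` should print `not mounted`.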
March 1, 2009 Author
After rebooting I checked the browser and saw that unRAID was doing a parity check. I telnetted in and did:
1. "samba stop": it did not give me any message after typing it in.
2. "umount /dev/md1": it still says umount: /mnt/disk1: device is busy
Then I checked the browser again, and because it was doing a parity check I figured I should stop it. Then I telnetted in again and did:
1. "samba stop": response: smbd: no process killed nmbd: no process killed
2. "umount /dev/md1": response: umount: /mnt/disk1: device is busy umount: /mnt/disk1: device is busy
What steps should I take next? I really appreciate all your work on this, guys. I know it would be so much easier if I hadn't fu..ed up.
Rob
March 1, 2009
Do you have other programs loaded and running on your array (scripts you have loaded, perhaps)? Anything that might be reading disk1 or accessing it in some way? You did right by stopping the parity check... For now, we don't want that to happen.
To see the process ID holding the disk from being un-mounted, type
fuser -cu /mnt/disk1
It will print the process ID of anything using the disk. (It might print more than one.) Then
ps -p NNN
will print the actual process (where NNN = the process ID number). You can then type
kill NNN
to kill the process.
As an example, my disk8 is currently in use, as my wife is watching a movie. I can identify the process:
[pre]
fuser -cu /mnt/disk8
/mnt/disk8: 17895(root)
ps -p 17895
PID TTY TIME CMD
17895 ? 00:01:50 shfs
[/pre]
In my case, my array is online, and "shfs" (the User-Share File System) is using disk8. (And disk8 is where the movie we are watching is located.)
Joe L.
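Joe's fuser and ps steps can be combined. fuser writes the PIDs to standard output and the mount point, any access letters and the user name to standard error, so the sketch below parses the merged output and keeps only the leading digits of each entry. The function name is hypothetical, and it only lists processes; deciding what to kill is left to you (in Joe's example the holder is shfs, the user-share file system).

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract process IDs from "fuser -cu MOUNTPOINT 2>&1"
# output such as "/mnt/disk8: 17895(root)". Entries may also carry access
# letters (e.g. 123c(root)); only the digits at the start are kept.
fuser_pids() {
    local line=$1 tok
    local -a toks=()
    line=${line#*:}                  # drop the "MOUNTPOINT:" prefix, if any
    read -r -a toks <<< "$line"      # split on whitespace, no globbing
    for tok in "${toks[@]}"; do
        if [[ $tok =~ ^([0-9]+) ]]; then
            echo "${BASH_REMATCH[1]}"
        fi
    done
}

# Usage sketch: list the holders of /mnt/disk1, then inspect each one.
#   for pid in $(fuser_pids "$(fuser -cu /mnt/disk1 2>&1)"); do
#       ps -p "$pid"
#   done
```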
March 1, 2009 Author
Well, at this point I cannot telnet to "Tower", but I can still refresh the browser GUI. When I type fuser -cu /mnt/disk1 directly at the unRAID command line, I do not get any message back.