January 22, 2011

Hey all,

I had a near-catastrophic array loss last night on my unRAID box. Basic summary of what happened:

• Disk 7's status showed red when I turned on the array Friday evening after returning home from work.
• Shut down the array and restarted. Disk 7 was still showing red, with no temp showing on the unRAID menu. The system log showed a few occurrences of “I/O Buffer” errors on Disk 7.
• Shut down and restarted the array one more time. Pulled Disk 7 from the hot-swap enclosure and cleaned out the dust (a fair amount had built up), then restarted the server. Disk 7 came up BLUE, showing the correct size and serial (I was suspicious, however, as the temp was still not showing). Furthermore, I could browse the disk when the array was back up! In retrospect I believe the drive was actually completely disconnected at this point and I was browsing the disk via the parity information that was available(?).
• This is where I think I made the most critical of errors. Thinking that maybe it was the dust build-up or simply a one-off error, and given that I could browse the shares, I decided to run the ‘restore’ operation on the array. To my shock, not only did the parity drive start rebuilding, but Disk 7 was now showing as unformatted!
• Not having realized that the restore button was going to clear the parity drive, I immediately cancelled. Unfortunately I believe the damage was already done. If my understanding is correct, the parity is pretty much lost at this point with no way to recover(?). To make matters worse, I almost had a stroke when the unRAID menu showed ALL drives in the array as unformatted. Thank the gods that on the next restart, with Disk 7 pulled, all of the other drives were again showing as they should.
• Now if I leave Disk 7 in on start-up, the array simply shows it as unformatted and I am unable to mount it or even run reiserfsck on it (I am assuming this is because the disk is physically bad).

**NOTE: I very stupidly did not save the pre-crash syslog files, as I thought everything was going to be OK.

The next step I'll attempt is to dump a full image of the failing drive to a new 1TB drive via dd_rescue. I'm hoping that if I can get the good data from the failing drive onto the new drive, I will be able to recover the file system. If that is possible, I believe I could THEN use the new drive in the array and attempt to ‘restore’ again.

If anyone has any other advice or techniques that I could try, I would be keen to hear them. Thanks in advance for any and all help.

Cheers
-roguewarr

22_01_2011.txt
22_01_2011_Withdrive.txt
January 22, 2011

You can try connecting it to a Windows PC and see if you can get the data off of it, or even get it to work.
January 22, 2011

You are running the 4.5.3 version of unRAID. It has a serious bug: on re-start of the array it is possible for it to show all disks as un-formatted. If that occurs when you are adding a new drive and you press the "Format" button, you would have re-formatted ALL your drives, so... it could be worse.

Yes, the button labeled as "restore" does NOT restore data.

Making a copy of the failed drive is a good starting point.

Right now unRAID thinks disk7 is removed. You can try running reiserfsck on /dev/md7 and see if it will let you.

If all else fails, you "might" be able to rebuild it by running mkreiserfs on it (re-creating the blank/empty file-system) and then asking reiserfsck to rebuild the file-tree by scanning the entire disk (looking for files and fragments). It is messy.

Please... upgrade so you are not running 4.5.3. It will only cause you more confusion.

See here: http://lime-technology.com/forum/index.php?topic=6206.0

Joe L.
January 22, 2011

I would suggest you take some time to assess the options before doing anything.

The first thing to figure out: did the disk fail or not? If it didn't fail, you can get it working again and all would be well. If it failed (or is failing), you can try to recover data from the physical drive OR you could try to use the backup super.dat file. (There has been little or no discussion of anyone ever trying this after pressing restore, but I think it is a viable option.)

STEP A - TRIAGE - DID THE DISK FAIL?

1. Get a fresh USB drive (you could modify your existing one, but I'd just set it aside and use a fresh one). Prepare it for unRAID. You do not need a key for it or anything; the free version is fine.

2. With the server powered down, remove the suspect disk from its slot, and also remove another disk (one that you know works) from its slot. Clean up the suspect disk as best you can. Get some canned air and blow off all the dust. Put it in the other disk's slot. (You can leave the other disk outside of the array for now.)

3. Boot from the new USB. The array will be empty (the array definition is stored in config files on the other flash). Assign the ONE suspect disk to the bottom slot in the array (disk2). (I say bottom slot because if I said to assign it to the top slot you might accidentally assign it to the parity slot, which would be bad. Using the bottom slot makes that mistake harder to make.)

3b. If the suspect disk does not show up in the list of drives, you can shut down and try to clean it up better, or try another slot. If you can't get the drive recognized, that is strong evidence that the disk has really failed. Post back if this is what you find, and I'll give directions for the next step.

4. Start the "array" (the array only has one disk and no parity). Try to access disk2 over Samba. If the contents are there, that is a very good sign. If the drive appears unformatted, that is not a good sign, but better than 3b. Report back what you find.

5. Regardless of the results of step 4, get a syslog and a SMART report on the drive. See the troubleshooting link in my sig for instructions.
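For step 5, a minimal sketch of collecting the syslog and a SMART report in one go is shown below. It is only an illustration: the function name collect_diag is made up, it assumes the live log is at /var/log/syslog and the flash drive is mounted at /boot, and it assumes smartctl (from smartmontools) is installed. The device name must be checked against the Devices page first.

```shell
#!/bin/sh
# collect_diag SYSLOG DEVICE DEST
# Copies the syslog and, if smartctl is available, saves a SMART report.
# (Hypothetical helper; paths and names are assumptions, not unRAID tooling.)
collect_diag() {
    syslog=$1; dev=$2; dest=$3
    mkdir -p "$dest" || return 1
    # Keep a copy of the log before the next reboot wipes it.
    cp "$syslog" "$dest/syslog.txt" || return 1
    if command -v smartctl >/dev/null 2>&1; then
        # -a prints all SMART information the drive reports.
        smartctl -a "$dev" > "$dest/smart.txt" 2>&1 ||
            echo "smartctl exited non-zero for $dev; see $dest/smart.txt" >&2
    else
        echo "smartctl not found; skipping SMART report" >&2
    fi
}

# Example (not run here): collect_diag /var/log/syslog /dev/sde /boot/diag
```

Saving to the flash drive means the files survive a power-down and can be attached to a post afterwards.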
January 22, 2011 (Author)

Thanks to everyone who has chimed in so far. I'm currently running dd_rescue from the failed disk onto a new one (from what I've read this could take a while). It's late here in Australia, so I'll be leaving the array alone for the night.

I had tried to run reiserfsck when Disk 7 was installed in the array, but it failed to run, stating that the drive had errors. It is a bit strange that there is no drive installed in that slot now, yet the array appears to want me to run the format command on it.

Like I said, I am hoping to get a usable clone of the disk and then run reiserfsck on that, followed by potentially using the restore command after that.
January 22, 2011

Here's something to note, assuming you cancelled the parity build immediately (say within less than a minute) and you have not written anything new to the array since this occurred.

You did not lose much of the parity information. The short time the parity build was running likely damaged less than 1% of the parity data. So, you can install a new disk7 and tell unRAID that the new disk7 needs to be rebuilt. Almost all of the data will be written correctly to this new disk. Likely enough that you can run reiserfsck on it and get most of the data back. It has been done before.

If you want to try this approach, ask and someone will give you the steps to follow. I believe these would be the basic steps:

Upgrade to unRAID 4.6.
Assign the new drive to the cache slot (originally written as "parity", corrected later in the thread), start the array and format the disk.
Stop the array and assign the new drive to the disk7 slot.
Log in and run initconfig, answer Yes.
Use the trust my parity procedure, except use 7 instead of 99 for the invalid slot: "mdcmd set invalidslot 7". You should get a response "cmdOper=set" and "cmdResult=ok".
Refresh the web interface. It should say it's going to rebuild disk7. If so, check the box and press start. It should show the array is ready to start, so press start.
Run reiserfsck on disk7.

Just note that, more or less, the longer the parity build was running, the more data will be lost.

Peter
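To see why a briefly interrupted parity build costs only a small part of the rebuilt disk, here is a toy sketch of single (XOR) parity in plain shell arithmetic. Everything in it is made up for illustration: the byte values, the four "positions", and the function names. It also assumes the cancelled build recomputed the first parity position without disk7's contribution, which may or may not match what happened on the real array.

```shell
#!/bin/sh
# Toy model of single parity. Each "disk" is a space-separated list of byte
# values; all names and values here are invented for illustration.
disk1="12 200 7 99"
disk2="45 3 128 250"
disk7="77 10 64 1"
positions=4

# nth LIST POS -> the POS-th (1-based) value of LIST
nth() { printf '%s\n' "$1" | cut -d' ' -f"$2"; }

# xor_at POS LIST... -> XOR of the POS-th value of every LIST given
xor_at() {
    pos=$1; shift
    acc=0
    for list in "$@"; do
        acc=$(( acc ^ $(nth "$list" "$pos") ))
    done
    printf '%s\n' "$acc"
}

# xor_disks LIST... -> position-wise XOR of all lists. XOR of the data disks
# gives parity; XOR of parity with the surviving disks gives the missing disk.
xor_disks() {
    out=""; i=1
    while [ "$i" -le "$positions" ]; do
        out="$out$(xor_at "$i" "$@") "
        i=$((i + 1))
    done
    printf '%s\n' "${out% }"
}

# damage_parity PARITY UPTO -> PARITY with its first UPTO positions recomputed
# WITHOUT disk7, mimicking a parity build that was cancelled early.
damage_parity() {
    good=$1; upto=$2
    out=""; i=1
    while [ "$i" -le "$positions" ]; do
        if [ "$i" -le "$upto" ]; then
            out="$out$(xor_at "$i" "$disk1" "$disk2") "
        else
            out="$out$(nth "$good" "$i") "
        fi
        i=$((i + 1))
    done
    printf '%s\n' "${out% }"
}

parity=$(xor_disks "$disk1" "$disk2" "$disk7")
echo "rebuilt from intact parity:  $(xor_disks "$parity" "$disk1" "$disk2")"
echo "rebuilt from damaged parity: $(xor_disks "$(damage_parity "$parity" 1)" "$disk1" "$disk2")"
```

Only the positions whose parity was overwritten come back wrong; everything after the point where the build was cancelled is rebuilt correctly, which is why running reiserfsck on the rebuilt disk can still recover most files.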
January 22, 2011

"Like I said, I am hoping to get a usable clone of the disk and then hopefully run reiserfsck on that, followed by potentially using the restore command after that."

DO NOT PRESS THE BUTTON LABELED AS "Restore" without being instructed by someone here who is reviewing your steps.

Your step of making an image copy is a really smart move. We can attempt a recovery of your files on it without causing changes to the main array.

Without knowing what "reiserfsck" command you tried, on which "device", and what error it reported, it is impossible to guide you... please include as much detail as requested. The specific errors do matter.

Joe L.
January 22, 2011

"Here's something to note assuming you cancelled the parity build immediately (say within less than a minute) and you have not written anything new to the array since this occurred. You did not lose much of the parity information. The short time the parity build was running likely damaged less than 1% of the parity data. So, you can install a new disk7 and tell unRAID that the new disk7 needs to be rebuilt. Almost all of the data will be written correctly to this new disk. Likely enough that you can run reiserfsck on it and get most of the data back. It has been done before."

True.

"If you want to try this approach ask and someone will give you the steps to follow. I believe this would be the basic steps. Upgrade to unRAID 4.6."

Good idea.

"Assign the new drive to the parity slot and start the array and format the disk. Stop the array and assign the new drive to the disk7 slot. Log-in and run initconfig, answer Yes. Use the trust my parity procedure except use 7 instead of 99 for the invalid slot - 'mdcmd set invalidslot 7'. You should get a response 'cmdOper=set' and 'cmdResult=ok'. Refresh the web interface. It should say it's going to rebuild disk7. If so, check the box and press start. Run reiserfsck on the disk7."

I would NOT perform the first step in your description. There is NO need to mess with the parity drive, and no need to format the new disk.

"Just note that, more or less, the longer the parity build was running the more data will be lost. Peter"

That is true.
January 22, 2011

Sorry Joe, I meant the cache disk slot. I thought the partition info had to be added to the drive at some point, so I figured it wouldn't hurt to spend a minute and add it up front.

Peter
January 22, 2011

"Sorry Joe, I meant to the cache disk slot. I thought the partition info had to be added to the drive at some point so figured it won't hurt to spend a minute and add it up front. Peter"

No, there is NO need to partition OR format the disk. unRAID will do that for you when re-constructing a disk.
January 22, 2011

I'm quite sure that forcing the data rebuild using the "mdcmd set invalidslot" command will just put back the data stored on the first partition and not the MBR (since parity operates only on the 1st partition of each drive and not the whole drive). So, I would fully expect the drive will need the proper unRAID reiserfs MBR added at some point in the process.

Peter
January 22, 2011

"I'm quite sure that forcing the data rebuild using the 'mdcmd set invalidslot' command will just put back the data stored on the first partition and not the MBR (since the parity operates only on the 1st partition of each drive and not the whole drive). So, I would fully expect the drive will need the proper unRAID reiserfs MBR added at some point in the process. Peter"

I've no experience to draw upon there... so you may be right.

In any case, assigning the disk temporarily as the "cache" drive will do as you said and is probably the easiest way for most unRAID users.
January 22, 2011

"Refresh the web interface. It should say it's going to rebuild disk7. If so, check the box and press start."

Joe, the more I think about it, the more I think I got this line completely wrong. I'm now thinking the web interface will not say it's going to rebuild disk7. After all, "mdcmd set invalidslot 7" is just one of the low-level commands issued by the interface as part of the process during a normal disk replace-rebuild. So, I believe the interface will just show that the array is ready to be started. You will have to ensure disk7 is being written to once the array is running.

Peter
January 22, 2011

The one thing I would add (well, Brian touched on it in step 5) is: get a SMART report ASAP. Might as well ask the drive itself how it is doing, and get it 'from the horse's mouth'.

I also strongly agree with upgrading your unRAID software. While it is OK to skip newer releases and stay with what works, it is not OK to ignore the release announcements. (You probably didn't ignore them, but thankfully the Restore button and 'Unformatted' issues do not come up much any more, and are becoming easier to forget.) In fact, it may be more important for those choosing to stick with older versions to examine release notes for bug fixes, so that they fully understand what they are choosing to live with. For example, anyone using v4.5.3 or older should be very knowledgeable about the problems with the Restore button and the 'Unformatted' label, and there have been other fixes too.
January 23, 2011 (Author)

OK, here's an update on the issue so far.

Overnight I completed the copy from Disk 7 to the new disk (dd_rescue.log attached), which I did using a RIP boot rescue Linux disk that has dd_rescue built in. Everything copied without issue or errors, which was my first clue that it may not be the disk at all.

I then installed the new disk into Disk 7's previous slot in the array; however, this did not work as I had hoped. The enclosure LED on disk 7 stayed lit the entire time the array was powered on, and it took almost 10x longer than normal to boot (I actually thought it had frozen). When it finished booting I was not able to access the unRAID web management at all (Syslog_1009_23-01-11). I could, however, log in locally and via telnet. This has me now believing that perhaps it is an issue in the interface itself, either a cable or the enclosure; note, however, that leading up to this there were no recent hardware changes to the box.

To test this I again removed the disk from slot 7 and moved the OLD, presumably failing Disk 7 to the empty Slot 3 in the array (Syslog_1303_23-01-11). The array has now come good, web management is back, and what I thought was a failed drive is now being detected correctly. I tried to run reiserfsck on md3 but it could not detect any device on /dev/md3 (reiserfsck.txt). I've also included a SMART report from the drive and it appears good (smart_log).

Now I'm not completely sure how to proceed. As you can see from the web menu screencap (1255_23-01-11), my only option is to kick off a rebuild. However, I noticed that if I set host3 to disk7, I get the option to kick off a data rebuild...

Many of you have commented that if I stopped the parity build before it went on too long, I should be able to recover from it. This should be a viable option in that case, as I cancelled it before it had completed even a fraction of a percent. Does it matter that my parity is now showing orange and says it is up to date from the 22nd?

I'm also hesitant to upgrade my unRAID version until I can completely determine what issue(s) actually exist on the system. I mean, if the upgrade were to go badly there is potential to aggravate the situation, yeah?

Thanks again to all those that have lent their help.

dd_rescue.txt
Syslog_1009_23-01-11.txt
smart_log.txt
January 23, 2011

Did you move the drive on the devices page only, or did you do it in hardware too? You can change the hardware SATA port and still use the same disk slot on the devices page.

Did you try running reiserfsck using sdX, where X is the port this damaged disk7 is connected to? The true sdX will show on the devices tab for each drive.

Unfortunately, unRAID won't display whether the drive is showing as formatted with a partition until the array is started. You could do another "Restore" and start the array, then stop the parity build as fast as possible. Then you can see if the drive mounted and the data is still there. At this point, there isn't much that you can hurt if you get that parity build stopped very quickly.

I suspect the drive is OK and you just had a SATA port problem, and you'll find the drive is just fine.

Peter
January 23, 2011

"To test this I again removed the disk from slot 7 and moved the OLD presumably failing Disk 7 to one of the empty Slot 3 in the array"

You have provided a lot of info (thank you for that), but I'm not completely confident I understand the whole situation. I was a little confused by the way you used the term 'slot' above, but what I think you meant was that you physically removed the 'new disk' and unassigned it as Disk 7, then physically reinstalled the old Disk 7 and assigned it as Disk 3. Hopefully I have that correct? The new disk does not appear in the physical Disk Inventory, and the old disk 7 does (SN ending in 525), but assigned as Disk 3.

That SMART report looks very good, another very good sign. However, I cannot tell if the Reiser file system on it is corrupted or not. There was a problem mounting the Reiser file system in one of the earlier syslogs, but that may be because of a bad SATA port (more on that later).

I recommend stopping the array and running reiserfsck directly on the drive, by typing the following command at a console. At this point, we aren't worried about maintaining parity, so this method is a little different from the usually recommended method (Check Disk File systems). In the last syslog, your disk had the device ID of sde, but you should verify the correct ID from the Devices page.

reiserfsck --check /dev/sde1

Hopefully, the check is good, or only minor issues are fixed. If major problems are found, then you will have a lot more work to do.

Assuming it is good, you may be able to restore the system rather quickly by using the Trust My Array procedure, with your array just the way it is right now. The exact steps you take will depend on which unRAID version you are running, and I again recommend upgrading. Once the procedure completes, it will start a parity check, which will report a LOT of parity errors, but it will fix them as it goes, until it passes the point where you aborted. There will probably not be any more after that.

I see Peter has just commented, with some of the same suggestions. I agree about a suspicious SATA port. The 8th SATA port on the 8-port card is behaving badly, and it is where Disk 7 used to be connected (I believe). It also hung when you tried to boot in the third syslog, when the system took so long to boot, mainly because of that port. One other small possibility is a bad power connection on drives connected to that port.

Because it is late and I'm tired, I'd wait until Joe or others have a chance to review what I've said, and confirm or correct it.
January 23, 2011 (Author)

Thanks Peter

"You can change the hardware SATA port and still use the same disk slot on the devices page."

That's exactly what I have done. The disk is now in slot 3 but assigned as drive 7.

"Did you try running reiserfsck using sdX where X is the port this damaged disk7 is connected to."

I've tried this but received the output below:

root@Tower:~# reiserfsck --check /dev/sde
reiserfsck 3.6.19 (2003 www.namesys.com)
Will read-only check consistency of the filesystem on /dev/sde
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
reiserfs_open: the reiserfs superblock cannot be found on /dev/sde.
Failed to open the filesystem.
If the partition table has not been changed, and the partition is
valid and it really contains a reiserfs partition, then the
superblock is corrupted and you need to run this utility with
--rebuild-sb.

Then with the rebuild-sb option...

root@Tower:~# reiserfsck --rebuild-sb /dev/sde
reiserfsck 3.6.19 (2003 www.namesys.com)
Will check superblock and rebuild it if needed
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
reiserfs_open: the reiserfs superblock cannot be found on /dev/sde.
what the version of ReiserFS do you use[1-4]
(1) 3.6.x
(2) >=3.5.9 (introduced in the middle of 1999) (if you use linux 2.2, choose this one)
(3) < 3.5.9 converted to new format (don't choose if unsure)
(4) < 3.5.9 (this is very old format, don't choose if unsure)
(X) exit

So do I continue trying to recover the drive with reiserfsck, or do I try to let the parity drive rebuild it? Is there any way that I can verify if the parity information is even intact?
January 23, 2011

You have a copy of the disk, so I would try the --rebuild-tree option and see what happens. You will have to mount the drive afterwards to see what data exists. Do you have unMENU running? If yes, you can use it to easily mount and share the drive so you can look at the contents.

If the rebuild fails, you can try the steps I posted before to rebuild the drive from parity. You have damaged a small part of the parity information. The data stored in the first part of the partition will be damaged even if you let unRAID rebuild the disk from parity. So, you need to see how much is damaged after the rebuild-tree has completed. If it appears to be bad, you can try the parity rebuild and see what you get after that.

You should be able to update unRAID the next time you are powering the server down without any issues. Just replace the bzimage and bzroot files only (I would rename the old ones with a suffix, say -4.5.3, just in case you need them). It's not hard at all to go back again by just renaming the files again.

Peter
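The "keep the old bzimage and bzroot around" advice can be sketched as a small helper. This is an illustration only: the function name is invented, it copies rather than renames so the flash stays bootable until the new files are in place, and it assumes the flash drive is mounted at /boot.

```shell
#!/bin/sh
# backup_boot_files DIR SUFFIX
# Saves copies of bzimage and bzroot in DIR with SUFFIX appended, so the old
# release can be put back by copying them over the new files.
# (Hypothetical helper; not part of unRAID.)
backup_boot_files() {
    dir=$1; suffix=$2
    for f in bzimage bzroot; do
        [ -f "$dir/$f" ] || { echo "missing $dir/$f" >&2; return 1; }
        cp "$dir/$f" "$dir/$f$suffix" || return 1
    done
}

# Example (not run here): backup_boot_files /boot -4.5.3
```

Going back is then a matter of copying bzimage-4.5.3 and bzroot-4.5.3 over the new bzimage and bzroot and rebooting.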
January 23, 2011 (Author)

Thanks for the help guys. I'm currently running reiserfsck -S --rebuild-tree on /dev/sde, and it will probably take a few hours by the looks of it. I'll let you know how it goes after it completes. I'll also upgrade to the latest release after that.

Also, it definitely looks like there is an issue with a port on the SATA card. However, I will need to determine whether it is on the SATA card or within the enclosure itself.

-roguewarr
January 23, 2011

"reiserfs_open: the reiserfs superblock cannot be found on /dev/sde."

The correct device to use would be /dev/sde1 (note the trailing "1").

The file system is on the first partition, not the base device name. That is why no superblock was found.

To run reiserfsck you would need to use either the /dev/mdX device or the /dev/sdX1 device. If you use the former, parity will be maintained. If you use the latter, it will not.

Try
reiserfsck --check /dev/sde1

Joe L.
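A small guard can make the whole-disk mistake harder to repeat. The sketch below is hypothetical (the function name and the accepted name patterns are assumptions, and it only understands single-letter sdX/hdX names and mdN devices): it turns a base device name into its first partition and rejects anything it does not recognise.

```shell
#!/bin/sh
# first_partition DEV
# /dev/sde  -> /dev/sde1   (whole disk: point at the first partition)
# /dev/sde1 -> /dev/sde1   (already a partition: unchanged)
# /dev/md7  -> /dev/md7    (unRAID md device: unchanged)
# Anything else is rejected with a non-zero exit status.
first_partition() {
    case "$1" in
        /dev/sd[a-z]|/dev/hd[a-z])
            printf '%s1\n' "$1" ;;
        /dev/sd[a-z][0-9]*|/dev/hd[a-z][0-9]*|/dev/md[0-9]*)
            printf '%s\n' "$1" ;;
        *)
            echo "unrecognised device name: $1" >&2
            return 1 ;;
    esac
}

# Example (not run here):
#   reiserfsck --check "$(first_partition /dev/sde)"
```

The read-only --check run is the safe first step; the guard only fixes the device name, so the sdX letter still has to be confirmed on the Devices page.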
January 23, 2011

"I'm currently running reiserfsck -S --rebuild-tree on /dev/sde, and it will probably take a few hours by the looks of it."

Stop it.... before it does more damage.

You need to run it on the first partition on the disk: /dev/sde1

I have no idea what it will do... or what your rebuild of the superblock on the wrong device will do either.

Good luck. Use /dev/sde1

Joe L.
January 23, 2011

"I'm currently running reiserfsck -S --rebuild-tree on /dev/sde, and it will probably take a few hours by the looks of it."

Stop it.... before it does more damage, although it might already be too late.

You need to run it on the first partition on the disk: /dev/sde1

I have no idea what it will do when not run on the correct device... or what your rebuild of the superblock on the wrong device will do either.

Good luck. Use /dev/sde1

Joe L.
January 23, 2011 (Author)

"Stop it.... before it does more damage."

S#%T! I've stopped it, restarted the array and am currently running reiserfsck --check on /dev/sde1. Hopefully this didn't mess anything up too badly. Technically I could replace it with the cloned new disk that I made last night and try again if the parity info was wiped, right?