February 15, 2011
My system had been working for years. At the end of last year I removed a defective drive that was already empty and replaced it with a new, empty 1.5TB drive. The new drive was not part of the array, but it was ready to be added, and of course I had precleared it successfully. Because I ran out of free space, I decided to add this disk to the array. I ran a successful parity check first and did not touch the system otherwise.
Then, in this order, I:
- commented out the cache command in the go file, to have a chance to stop the array after a start without having to wait some hours
- stopped the array
- assigned the drive under devices
- started the array
- ticked the (I'm sure to do this) box and started the format
- waited until it was done
- stopped the array
Bad luck: I don't know why, but there was a job running on disk 10, and it would not unmount. I don't know how, but the cache was running, so I sent a cache -q over telnet, yet the disk still did not unmount. (Until now, all the problems I have met with unRAID have come from not being able to stop it correctly in a decent time.) Because I was in a hurry, I made an error, and the system stopped suddenly.
Now the system can boot, but unRAID doesn't start. After a reboot, the system is ready but shows "Stopped. Configuration valid." If I press Start, the disks are marked as "mounting", then it stops, always with the same message: "Stopped. Configuration valid."
In the syslog I can read just something like "start STOPPED" and then "do_run: lock_rdev error: -6". No more explanation at all...
Help please.
syslog-2011-02-15.txt
February 15, 2011 Author
Is it because of the new drive? At the end of the formatting process, everything looked fine. I've read somewhere that there is a procedure to remove an empty drive; is that a good idea?
And, by the way, what is the meaning of this:
Feb 15 23:56:25 Tower emhttp: shcmd (4): /usr/local/sbin/set_ncq sdq 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (5): /usr/local/sbin/set_ncq sdf 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (6): /usr/local/sbin/set_ncq sdg 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (7): /usr/local/sbin/set_ncq sdi 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (8): /usr/local/sbin/set_ncq sdh 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (9): /usr/local/sbin/set_ncq sdd 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (10): /usr/local/sbin/set_ncq sdc 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (11): /usr/local/sbin/set_ncq sde 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (12): /usr/local/sbin/set_ncq sdb 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (13): /usr/local/sbin/set_ncq sda 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (14): /usr/local/sbin/set_ncq sdl 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (15): /usr/local/sbin/set_ncq sdm 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (16): /usr/local/sbin/set_ncq sdn 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (17): /usr/local/sbin/set_ncq sdo 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (18): /usr/local/sbin/set_ncq sdp 1 >/dev/null
Feb 15 23:56:25 Tower emhttp: shcmd (19): /usr/local/sbin/set_ncq sdj 1 >/dev/null
It's very weird: all the data is still there (I hope) and all the HDDs are green, but the array stays stopped and just doesn't start.
Can anyone help? Have I lost everything?
February 16, 2011
Those commands in the log are merely setting the NCQ parameter on each respective drive.
February 16, 2011 Author
So, it's normal, right? What is wrong with my system? Everything looks fine but it won't start. I'm completely lost!
February 16, 2011
You've not lost anything. Most of the time, when people have not been able to start the array, it is because some add-on is preventing it, so step 1 is to disable all the extras and then reboot.
February 16, 2011 Author
> You've not lost anything. Most of the time, when people have not been able to start the array, it is because some add-on is preventing it, so step 1 is to disable all the extras and then reboot.
Thank you for your reassuring answer. After commenting out the ntfs-3g and the uu lines, the go file looks like a new one:
#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &
#cp /boot/config/samba/smb.conf /etc/samba
#smbcontrol smbd reload-config
#installpkg /boot/packages/ntfs-3g-2009.3.8-i486-1.tgz
#/boot/packages/cache_dirs -B -w
#/boot/unmenu/uu
But after the reboot, no change. I think there are no add-ons left beyond stock unRAID, but where should I look?
I have somewhere a copy of the USB boot key that works (from one month ago), and there was no change in the hardware in between, just the assignment of the disk described above and of course the format that was done. Is it a good idea to put that content back on the key? Normally it should work if there is no hardware failure now, shouldn't it?
EDIT: Of course, I understand that my array will be unprotected, because the parity will be wrong (with one drive less). Bad idea.
February 16, 2011
>> You've not lost anything. Most of the time, when people have not been able to start the array, it is because some add-on is preventing it, so step 1 is to disable all the extras and then reboot.
> Thank you for your reassuring answer. After commenting out the ntfs-3g and the uu lines, the go file looks like a new one.
Yes, I think you are right.
> #!/bin/bash
> # Start the Management Utility
> /usr/local/sbin/emhttp &
> #cp /boot/config/samba/smb.conf /etc/samba
> #smbcontrol smbd reload-config
> #installpkg /boot/packages/ntfs-3g-2009.3.8-i486-1.tgz
> #/boot/packages/cache_dirs -B -w
> #/boot/unmenu/uu
> But after the reboot, no change. I think there are no add-ons left beyond stock unRAID, but where should I look?
> I have somewhere a copy of the USB boot key that works (from one month ago), and there was no change in the hardware in between, just the assignment of the disk described above and of course the format that was done. Is it a good idea to put that content back on the key? Normally it should work if there is no hardware failure now, shouldn't it?
You can certainly try that. It would simply ignore the newly added drive, and it would not know the parity was out of date. (The current parity takes the formatting into account.) Expect many errors on a parity check as it corrects parity.
One thing you would want to double-check is whether your power supply is up to the task. Is it possible you've just gone over its limit to supply start-up current to the drives? What specific power supply make/model are you using?
Joe L.
February 16, 2011 Author
> One thing you would want to double-check is whether your power supply is up to the task. Is it possible you've just gone over its limit to supply start-up current to the drives? What specific power supply make/model are you using?
> Joe L.
Very improbable. I have already run two additional external drives from this power supply (through an external molex), and the HDD I just assigned was already powered before. As written in my signature, the power supply is a Qtec 500W. By the way, swapping this power supply is at least a full day of work.
> You can certainly try that. It would simply ignore the newly added drive, and it would not know the parity was out of date. (The current parity takes the formatting into account.) Expect many errors on a parity check as it corrects parity.
Isn't this a slightly risky step? I don't know well how the contents of the key relate to the parity at this moment.
February 16, 2011 Author
> You can certainly try that. It would simply ignore the newly added drive, and it would not know the parity was out of date. (The current parity takes the formatting into account.) Expect many errors on a parity check as it corrects parity.
Very disturbing. I've done it, and except that the drive added in between has disappeared, everything looks the same, always with the same problem. Just in case, I've attached the new syslog. Now I'm completely blocked.
syslog-2011-02-16.txt
February 16, 2011 Author
And, just in case, because the last drive wasn't needed, I've now disconnected the power to this HDD. No change. I can't start my array.
Tower kernel: md: do_run: lock_rdev error: -6
Is any more info needed?
February 16, 2011 Author
After reading this in the syslog:
Feb 15 23:56:25 Tower emhttp: get_fstype: open /dev/sda1: No such file or directory
I'm asking myself whether the problem lies with the sda1 drive. In that case, is the better solution to swap this drive for another one (I have one right here) and rebuild it with the correct configuration (the one from the start of the problem, including the new drive)? Must this new drive be in a special state (precleared, cleared, formatted in another unRAID system) or not? The current one is NTFS formatted.
February 16, 2011
The partition on sda is gone:
Feb 16 02:33:58 Tower kernel: sda: unknown partition table
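With sixteen or more drives in the log, it can help to pull out every device the kernel reported this way. The following is only a minimal sketch: the function name devices_without_partition_table is hypothetical, and it assumes the syslog lines look like the one quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of unRAID): print each device name that the
# kernel reported with "unknown partition table" in the given syslog file.
# Assumes lines shaped like:
#   Feb 16 02:33:58 Tower kernel: sda: unknown partition table
devices_without_partition_table() {
    sed -n 's/.*kernel: *\([a-z][a-z]*\): unknown partition table.*/\1/p' "$1" | sort -u
}

# Example call (the path is an assumption, use your own saved syslog):
# devices_without_partition_table /boot/syslog-2011-02-16.txt
```

This only reads the log; it changes nothing on the drives.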
February 16, 2011 Author
> The partition on sda is gone:
> Feb 16 02:33:58 Tower kernel: sda: unknown partition table
Yes, that's what I understand. Now, how do I get back to a working system? Simply remove sda, so that unRAID starts and runs slowly from the parity drive, and then rebuild onto another drive? Or try to rewrite the Master Boot Record on the failed drive (faster)?
And about the diagnosis: this drive was already unmounted when the array was brutally switched off. Is this normal behaviour?
February 16, 2011
That would do it, wouldn't it. You can repair it. Before you do, please post the output of the following commands:
fdisk -l -u /dev/sda
dd if=/dev/sda count=1 2>/dev/null | od -x -A d
Joe L.
February 16, 2011 Author
> That would do it, wouldn't it. You can repair it. Before you do, please post the output of the following commands:
> fdisk -l -u /dev/sda
root@Tower:~# fdisk -l -u /dev/sda
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00000000
Disk /dev/sda doesn't contain a valid partition table
> dd if=/dev/sda count=1 2>/dev/null | od -x -A d
root@Tower:~# dd if=/dev/sda count=1 2>/dev/null | od -x -A d
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000512
root@Tower:~#
> Joe L.
I can repair it, but how? If it's by writing the MBR, and knowing that I'm a noob in Linux, it will be tricky. Or maybe it's with a tool I don't know.
If it's by rebuilding, do I have to prepare the HDD? Should it be another one, or the disk already in place? Is it safe?
Thanks for your help
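The od output above (one row of zeros, a "*", then 0000512) means every byte of the first 512-byte sector is zero, so there is no MBR at all. The same read-only test can be sketched as a small function. The name first_sector_is_blank is hypothetical, and the sketch assumes 512-byte sectors as reported by fdisk above.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the first 512-byte sector of a
# device or image file contains only zero bytes. Read-only; nothing is written.
first_sector_is_blank() {
    local target="$1"
    local total nonzero
    # Make sure a full sector could actually be read.
    total=$(dd if="$target" bs=512 count=1 2>/dev/null | wc -c | tr -d ' ')
    [ "$total" = "512" ] || return 2
    # Delete every NUL byte and count what is left over.
    nonzero=$(dd if="$target" bs=512 count=1 2>/dev/null | tr -d '\000' | wc -c | tr -d ' ')
    [ "$nonzero" = "0" ]
}

# Example call, using the device name from this thread:
# first_sector_is_blank /dev/sda && echo "first sector is all zeros"
```

An exit status of 2 means a full sector could not be read, which is kept separate from "not blank" (status 1).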
February 16, 2011
> I can repair it, but how? If it's by writing the MBR, and knowing that I'm a noob in Linux, it will be tricky. Or maybe it's with a tool I don't know.
It is by use of a tool I wrote to help another user. Find it here:
http://lime-technology.com/forum/index.php?topic=5072.msg47122#msg47122
> If it's by rebuilding, do I have to prepare the HDD? Should it be another one, or the disk already in place? Is it safe?
> Thanks for your help
I think you can use the existing disk. The utility will only write the MBR and make it exactly as unRAID would have made it. After using it, the fdisk command should show a partition.
Unzip it to your flash drive, then type
/boot/unraid_partition_disk.sh /dev/sda
then, as long as you're sure you have the right disk, invoke it with the -p option:
/boot/unraid_partition_disk.sh -p /dev/sda
Joe L.
February 16, 2011 Author
Ok, I'll try this in a moment.
Just in case, and because I can normally trust the parity disk more than the failed sda one, is there a way to keep the system from running a parity check with correction? AFAIK, when I press the Start button, the parity check will also start automatically, and if there are sync errors, those will be written to the parity disk, right? But those errors are on the data disk, so the parity data would be changed from correct to incorrect.
In that case, isn't it a better idea to simply throw out the failed disk and replace it with another one, so that the data is rebuilt correctly?
Is there a way to start the array without an automatic parity check?
Thnx
February 16, 2011 Author
Bad news. I've made the correction to the MBR. Then I started the array, and of course the parity check started immediately. I didn't see any error counter, but the parity disk was being written to the whole time, so I stopped the parity check. It shows a lot of parity errors, as I suspected.
I've stopped the array now, and I don't know what to do. Rebuild onto a new disk, with the errors I created? Where did I make an error?
syslog attached.
syslog-2011-02-16a.txt
February 16, 2011
> Bad news. I've made the correction to the MBR. Then I started the array, and of course the parity check started immediately. I didn't see any error counter, but the parity disk was being written to the whole time, so I stopped the parity check. It shows a lot of parity errors, as I suspected.
> I've stopped the array now, and I don't know what to do. Rebuild onto a new disk, with the errors I created? Where did I make an error?
> syslog attached.
The parity checks that were started were in NOCORRECT mode, therefore no changes were actually written to the parity drive. From your log:
Feb 16 13:32:08 Tower kernel: mdcmd (69): check NOCORRECT
Can you get to the files on the drive that had the missing partition now?
Joe L.
February 16, 2011 Author
I issued the check NOCORRECT command both before starting the array and after stopping it. The first one, before the start, didn't work, and I don't know why:
Feb 16 13:30:08 Tower kernel: mdcmd (46): check NOCORRECT
Feb 16 13:30:08 Tower kernel:
Feb 16 13:30:08 Tower kernel: md: check_array: not started
The other one was launched after stopping the array:
Feb 16 13:32:08 Tower kernel: mdcmd (69): check NOCORRECT
But in between, there was the automatic parity check:
Feb 16 13:30:55 Tower kernel: mdcmd (51): check
And after that I can read:
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 128
Feb 16 13:30:56 Tower kernel: REISERFS (device md1): found reiserfs format "3.6" with standard journal
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 136
Feb 16 13:30:56 Tower kernel: REISERFS (device md1): using ordered data mode
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 144
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 152
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 160
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 168
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 176
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 184
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 192
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 200
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 208
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 216
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 224
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 232
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 240
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 248
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 256
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 264
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 272
Feb 16 13:30:56 Tower kernel: md: parity incorrect: 280
So, unless there is something I don't understand, the parity disk was corrupted when I started the array.
If that's not the case, I still have a chance to rebuild the data disk by replacing it now, don't I?
Yes, I have access to the files on the disk that had the missing partition. Those are mainly video files, and it seems that I can use some of them. Is there a utility to verify the integrity of the files on the disk that had the missing partition?
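Since the question of which check ran in which mode turns on a few mdcmd lines buried in the log, a short filter can list them side by side. This is only a sketch: classify_checks is a hypothetical name, it assumes the mdcmd line format shown in the excerpts above, and it labels a check without the NOCORRECT argument simply as "plain" rather than claiming what that check wrote.

```shell
#!/bin/bash
# Hypothetical helper: list every mdcmd "check" command found in a syslog
# file and label it by whether the NOCORRECT argument was present.
# Assumes lines shaped like:
#   Feb 16 13:30:55 Tower kernel: mdcmd (51): check
#   Feb 16 13:32:08 Tower kernel: mdcmd (69): check NOCORRECT
classify_checks() {
    grep -E 'mdcmd \([0-9]+\): check( |$)' "$1" | while IFS= read -r line; do
        case "$line" in
            *"check NOCORRECT"*) echo "NOCORRECT  $line" ;;
            *)                   echo "plain      $line" ;;
        esac
    done
}

# Example call (the path is an assumption, use your own saved syslog):
# classify_checks /boot/syslog-2011-02-16a.txt
```

Lines such as "md: check_array: not started" are not matched, so they still have to be read in the log itself to see whether a given command actually ran.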
February 16, 2011
I missed that additional start of the "check".
I would un-assign the parity drive (or un-plug it) so a parity check cannot occur. I would then perform a file system check on the drive that had the missing partition:
reiserfsck --check /dev/md9
(I think it is disk 9 that was the drive with the missing MBR; use the correct "md" device if I'm incorrect.)
The issue is that your existing parity data includes the formatting of the disk you have now removed from your array. I think you can probably revert to your "current" config, the one that includes the newly formatted drive. Then the parity is probably more correct (with the exception of the changes it just made when you started it most recently).
If you have a spare disk, I'd go ahead and replace the disk with the previously missing MBR and get as much as possible re-constructed onto the replacement.
February 17, 2011 Author
It's an all-day job...
The first thing I've done so far is simply to copy the data that can still be read from the disk that had the missing partition onto the new disk I've just added. Of course, this disk was nearly full, so as expected it's a fairly long process, with the added complication that I have to skip the lost files. After this, I made a quick check of the copy.
Now there are some errors in the syslog, around 300, like these:
Feb 16 17:10:02 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Feb 16 17:10:02 Tower kernel: REISERFS error (device md9): vs-5150 search_by_key: invalid format found in block 32769. Fsck?
I think those errors are the same as the sync errors that remain on the array (following the check NOCORRECT, around 40 sync errors still remain at the start of the check).
Now I have to re-create a clean disk (the md9) and finally recheck the parity completely. I've completely given up on the reconstruction solution; it's far too late now.
> I would un-assign the parity drive (or un-plug it) so a parity check cannot occur. I would then perform a file system check on the drive that had the missing partition:
> reiserfsck --check /dev/md9
So I:
- stopped the array
- unassigned the parity disk
- checked the "I'm sure..." box
- started the array
- then, in a PuTTY window:
root@Tower:~# reiserfsck --check /dev/md9
reiserfsck 3.6.21 (2009 www.namesys.com)
*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************
Will read-only check consistency of the filesystem on /dev/md9
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Thu Feb 17 12:49:05 2011
###########
Partition /dev/md9 is mounted with write permissions, cannot check it
root@Tower:~#
I cannot run this check.
I've had a look at the SMART status of the three drives involved so far, the parity, the md9 (failed) and the md15 (new one), and there is nothing special.
Now, some questions:
- Why do I have to do this without parity enabled? (I know the parity is nearly lost, but anyway, is it a good idea?)
- The data on the md9 is to be erased, so I want to build a clean file system on this disk (not a check, I know there are errors), but how?
Thanx
February 17, 2011
To run the file system check you must first un-mount the drive.
umount /mnt/disk9
February 17, 2011 Author
> To run the file system check you must first un-mount the drive.
> umount /mnt/disk9
If it's unmounted, it will be ready to format, which is a better idea.
Why should I do this without parity enabled? Anyway, I know I have to rebuild the parity disk, it's lost now, but I don't like to be a funambulist.
February 17, 2011
>> To run the file system check you must first un-mount the drive.
>> umount /mnt/disk9
> If it's unmounted, it will be ready to format, which is a better idea.
> Why should I do this without parity enabled? Anyway, I know I have to rebuild the parity disk, it's lost now, but I don't like to be a funambulist.
Typically, the answer would be yes, since the fixes it makes would also be applied to the parity disk.
The --check will not fix anything, but it will tell you which option to apply next to the reiserfsck command.
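The "mounted with write permissions, cannot check it" refusal seen earlier can be anticipated by looking at the mount table before calling reiserfsck. The following is a minimal sketch: is_mounted is a hypothetical name, and it assumes the array device appears in /proc/mounts under the same name used in this thread (/dev/md9).

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the given device appears as the
# first field of a line in the mount table. The second argument is optional
# and defaults to /proc/mounts.
is_mounted() {
    local dev="$1"
    local mounts="${2:-/proc/mounts}"
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# Example use, with the names from this thread:
# if is_mounted /dev/md9; then
#     echo "/dev/md9 is still mounted; run: umount /mnt/disk9"
# else
#     echo "/dev/md9 is not mounted; reiserfsck --check /dev/md9 can run"
# fi
```

The function only reads the mount table; the umount and reiserfsck commands stay manual steps, as in the posts above.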