July 12, 2009

Last night I booted up unRAID for the first time in a month or so (shameful, I know), and a few disks, including parity, showed as missing. I'm running 4.2.1 on the original starter pack hardware with 10 IDE disks and 2 SATA (including parity).

A quick check showed that the power cable to the parity disk was loose. I've always assumed that my Antec Neo HE500 PSU, which has modular power connectors, can support the configuration, but perhaps the distribution of disks across the power cabling was causing intermittent power problems to some of the drives. Previously the SATA drives ran off IDE-to-SATA power connectors, so they were sharing a power cable with a few other drives. So I dug out the SATA modular connectors and hooked up each SATA drive to a separate cable (because the distance between the two SATA connectors on one cable was too short).

When I booted up, all drives were OK except for my disk 6, which showed up as unformatted. A quick read through the forum indicated that it might be a umount issue, so I powered down (using the Web UI) and then powered the server up again. Unfortunately, I got the same problem.

In the hope that someone can help, I've attached a slightly edited syslog. (I removed many, many, many lines like this: "Jul 11 22:32:52 Tower shfs: make_link: real_path=/mnt/disk3/Movies/Kids/Alice in Wonderland/VIDEO_TS/VTS_01_0.BUP already exists".)
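For anyone else trimming a syslog before posting it, here is a minimal sketch of how the repetitive shfs lines could be filtered out. The function name `trim_syslog` is made up for illustration, and the pattern is taken only from the sample line quoted above, so it may need adjusting for other logs.

```shell
#!/bin/sh
# Hypothetical helper (not part of unRAID): read a syslog on stdin and drop
# the repetitive "shfs: make_link ... already exists" lines, keeping the rest.
trim_syslog() {
    grep -v 'shfs: make_link: .* already exists'
}

# Demo on two sample lines taken from this thread instead of a real log file.
printf '%s\n' \
  'Jul 11 22:32:52 Tower shfs: make_link: real_path=/mnt/disk3/Movies/Kids/Alice in Wonderland/VIDEO_TS/VTS_01_0.BUP already exists' \
  'Jul 11 22:32:50 Tower kernel: [ 75.880652] ReiserFS: md6: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md6' \
  | trim_syslog
```

On a real server this would be used as something like `trim_syslog < /var/log/syslog > /boot/syslog_trimmed.txt`, where both paths are assumptions about where the log lives and where the flash drive is mounted.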
July 12, 2009

Jul 11 22:32:50 Tower kernel: [ 75.880652] ReiserFS: md6: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md6

Unfortunately, this is serious. It causes the drive to appear truly Unformatted (because no file system was found), and it has to be handled with extreme caution in order to recover all data on the drive.

This is only the second time I have ever seen this error. The other was only a few days ago, with a much newer version of Linux and unRAID; thread is here. I advised him that he would need to rebuild the superblock for the Reiser file system on that drive (coincidentally Disk 6 too), which I still think was correct, but I had never seen the command performed, and was not aware that it would ask critical questions about how the drive was formatted, questions that have to be answered perfectly. I think the only one who had seen this situation was Brian, who wrote about it here, with a description of the questions and the correct answers. [Correction: another user has seen it too.]

The wiki page that describes what to do, at least to begin with, is this: Check Disk File systems. But I would strongly recommend that you use a Telnet or PuTTY console, cut and paste the dialogs and questions you see, and post them for us to examine and advise. First, do the steps through "reiserfsck /dev/md6", then show us the response.
July 12, 2009 Author

Thanks for replying, Rob. I ran reiserfsck, and got this:

===============================================
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
reiserfs_open: the reiserfs superblock cannot be found on /dev/md6.
Failed to open the filesystem.

If the partition table has not been changed, and the partition is
valid and it really contains a reiserfs partition, then the
superblock is corrupted and you need to run this utility with
--rebuild-sb.

root@Tower:~#
===============================================
July 12, 2009

That is what I expected to see, unfortunately, and thank you for capturing and posting it. From here on, it gets tricky, so please continue with extreme caution!

The next step, as you probably know if you have been reading the links I posted, especially Brian's, is to run:

reiserfsck --rebuild-sb /dev/md6

Before doing so, I would print out Brian's post and highlight the answers. Then run the preparative steps to take the drive offline, as found in Check Disk File systems, up to the first reiserfsck command. Then start the "reiserfsck --rebuild-sb /dev/md6", and answer the questions up to the point where you are asked to confirm that everything is OK ("Is this ok ? (y/n)[n]:"), but do NOT answer it yet, do NOT say Yes yet. In other words, finish step 7, but do NOT do step 8. Then capture the output and post it here. If you want to, you should be able to abort the command with Ctrl-C at any time, or answer with a No.

If anything is not clear, just ask us. Unfortunately, I'm going to be out of touch much of today, but will try to get on sometime this afternoon, and late tonight (it is morning right now for me).

Radiopaque: if you are reading this, I apologize for your difficulties; I feel partly responsible. When I saw what messages and questions you were getting, I did not have time to deal with it, and it already looked too late. When I have time, I'll try to help again. I *think* you still need to run rebuild-sb again, but with the accurate answers, then do a rebuild-tree. But something is different from Brian's case, and I suspect that the partition structure may have been slightly adjusted by the initial problem, which shifted the start of your Reiser file system, and that is critical. I have a couple of dangerous and tricky ideas, involving deleting and re-creating the partition table, and/or running TestDisk from a live CD, and/or running Tom's initial Reiser format command, but I need to think about it more. Perhaps someone else has ideas and time ...
July 12, 2009 Author

Rob, I really appreciate your help. Please take your time on this - I'm in no great rush, and I'm sure you have other priorities.

For the sake of documentation for other users who might find this useful, here's the full dialogue with what I hope are the correct answers. I exited using Ctrl-C.

========= Start ====================================
Tower login: root
Password:
[Disconnect bypassed -- root login allowed.]
Linux 2.6.22.5.
root@Tower:~# cd
root@Tower:~# samba stop
root@Tower:~# umount /dev/md6
umount: /dev/md6: not mounted
root@Tower:~# reiserfsck --rebuild-sb /dev/md6
reiserfsck 3.6.19 (2003 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************

Will check superblock and rebuild it if needed
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

reiserfs_open: the reiserfs superblock cannot be found on /dev/md6.

what the version of ReiserFS do you use[1-4]
(1) 3.6.x
(2) >=3.5.9 (introduced in the middle of 1999) (if you use linux 2.2, choose this one)
(3) < 3.5.9 converted to new format (don't choose if unsure)
(4) < 3.5.9 (this is very old format, don't choose if unsure)
(X) exit
1

Enter block size [4096]:
4096

No journal device was specified. (If journal is not available, re-run with --no-journal-available option specified).
Is journal default? (y/n)[y]: y

Did you use resizer(y/n)[n]: n
rebuild-sb: no uuid found, a new uuid was generated (b0894fe9-3850-4d57-b70b-a419cbf3823e)

rebuild-sb: You either have a corrupted journal or have just changed
the start of the partition with some partition table editor. If you are
sure that the start of the partition is ok, rebuild the journal header.
Do you want to rebuild the journal header? (y/n)[n]: y

Reiserfs super block in block 16 on 0x906 of format 3.6 with standard journal
Count of blocks on the device: 97677824
Number of bitmaps: 2981
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 0
Root block: 0
Filesystem is NOT clean
Tree height: 0
Hash function used to sort names: not set
Objectid map size 0, max 972
Journal parameters:
Device [0x0]
Magic [0x0]
Size 8193 blocks (including 1 for journal header) (first block 18)
Max transaction length 1024 blocks
Max batch size 900 blocks
Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x1:
some corruptions exist.
sb_version: 2
inode generation number: 0
UUID: b0894fe9-3850-4d57-b70b-a419cbf3823e
LABEL:
Set flags in SB:
Is this ok ? (y/n)[n]:
root@Tower:~#
======End=======================================
July 13, 2009

Sorry for the delay, but I've spent a number of hours researching this and Radiopaque's problem, and looking for general knowledge of ReiserFS, reiserfsck, debugreiserfs, and the superblock. Although I learned *some* things, I could not find better, definitive answers. For one thing, I hunted and hunted for an explanation of the 0x906 and 0x901, but could find absolutely nothing.

Your answers and the info shown look correct as far as I can tell, so repeat them and proceed with a Yes, Brian's 9 steps, and let's see what it reports. If and only if it appears to be completely successful, go ahead and run the reiserfsck check partition command again:

reiserfsck /dev/md6

We do not want to run the --rebuild-tree command, as Brian did, unless reiserfsck specifically instructs us to. If the --rebuild-sb is successful, it may be all that is necessary. However, you still need to run the reiserfsck /dev/md6 command, just to verify that all is well.
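As a rough cross-check of the figures in the proposed superblock, the block count and bitmap count can be verified with a little shell arithmetic. This is only a sketch: the numbers are copied from the rebuild-sb summary posted in this thread, and the rule that one bitmap block tracks blocksize × 8 blocks (one bit per block) is an assumption about the ReiserFS 3.6 layout, not something reiserfsck printed.

```shell
#!/bin/sh
# Figures copied from the rebuild-sb summary posted in this thread.
blocks=97677824          # "Count of blocks on the device"
blocksize=4096           # "Blocksize"
reported_bitmaps=2981    # "Number of bitmaps"

# Total size in bytes, and in decimal gigabytes as drive vendors count them.
bytes=$((blocks * blocksize))
gb=$((bytes / 1000000000))

# Assumption: each bitmap block holds one bit per block, so it covers
# blocksize*8 blocks; the expected count is the block count divided by
# that, rounded up.
per_bitmap=$((blocksize * 8))
expected_bitmaps=$(( (blocks + per_bitmap - 1) / per_bitmap ))

echo "size: $bytes bytes (~$gb GB)"
echo "bitmaps: expected $expected_bitmaps, reported $reported_bitmaps"
```

Under those assumptions the size works out to roughly 400 GB, which matches the 400GB drive size mentioned elsewhere in the thread, and the expected bitmap count agrees with the reported 2981. That does not prove the superblock is right, only that these fields are internally consistent.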
July 13, 2009 Author

I was going through reiserfsck --rebuild-sb /dev/md6 when I noticed something slightly odd. Here's what I saw:

root@Tower:~# reiserfsck --rebuild-sb /dev/md6
reiserfsck 3.6.19 (2003 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************

Will check superblock and rebuild it if needed
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

reiserfs_open: the reiserfs superblock cannot be found on /dev/md6.

what the version of ReiserFS do you use[1-4]
(1) 3.6.x
(2) >=3.5.9 (introduced in the middle of 1999) (if you use linux 2.2, choose this one)
(3) < 3.5.9 converted to new format (don't choose if unsure)
(4) < 3.5.9 (this is very old format, don't choose if unsure)
(X) exit
1

Enter block size [4096]:
4096

No journal device was specified. (If journal is not available, re-run with --no-journal-available option specified).
Is journal default? (y/n)[y]: y

Did you use resizer(y/n)[n]: n
rebuild-sb: no uuid found, a new uuid was generated (2ebc6adb-0101-4f6e-8ffa-ed7eb1f94e84)

Reiserfs super block in block 16 on 0x906 of format 3.6 with standard journal
Count of blocks on the device: 97677824
Number of bitmaps: 2981
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 0
Root block: 0
Filesystem is NOT clean
Tree height: 0
Hash function used to sort names: not set
Objectid map size 0, max 972
Journal parameters:
Device [0x0]
Magic [0x0]
Size 8193 blocks (including 1 for journal header) (first block 18)
Max transaction length 1024 blocks
Max batch size 900 blocks
Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x1:
some corruptions exist.
sb_version: 2
inode generation number: 0
UUID: 2ebc6adb-0101-4f6e-8ffa-ed7eb1f94e84
LABEL:
Set flags in SB:
Is this ok ? (y/n)[n]:
root@Tower:~#

The odd thing is that after this:

Did you use resizer(y/n)[n]: n
rebuild-sb: no uuid found, a new uuid was generated (2ebc6adb-0101-4f6e-8ffa-ed7eb1f94e84)

this bit was missing compared to what I got yesterday (this is Step 6 in Brian's post):

rebuild-sb: You either have a corrupted journal or have just changed
the start of the partition with some partition table editor. If you are
sure that the start of the partition is ok, rebuild the journal header.
Do you want to rebuild the journal header? (y/n)[n]: y

Instead it goes straight into "Reiserfs super block in block 16 on 0x906 of format 3.6 with standard journal", etc.

I'm a little nervous, as I don't understand why repeating the procedure would miss this part out. I've shut down and restarted three times with the same result. So I'm putting everything on hold for the moment.

Another question: what are the risks if I simply replace the drive (with a larger one) and rebuild it? Does it make too risky an assumption that parity is correct?
July 13, 2009

Quote: Another question: what are the risks if I simply replace the drive (with a larger one) and rebuild it? Does it make too risky an assumption that parity is correct?

To answer that question we must first know the answer to a question of our own... During this whole process, did the array ever start? More specifically, did it start a parity check and start "fixing" the parity errors it found?

Joe L.
July 13, 2009

You are indeed a computer systems professional or an engineer or a logician by nature, when an inconsistency bothers you, even when its latest result is in your favor!

I agree, it is odd, but I *may* have a cause. What we have here is a tool indicating a bad journal on its first run; then, with no apparent changes made, a second identical run says nothing about the journal, implying that the journal is now good. That leaves us with one of 2 possibilities: (1) the tool can't be trusted, or (2) something changed, i.e. the journal was repaired. I think it is the second, although I am slowly becoming less and less impressed with reiserfsck's abilities.

In my Reiser reading, there was discussion of mounting a Reiser file system in read-only mode, and it included complaints that this was not completely possible, because journal recovery would override the mode. That is, if there were transactions to be replayed, mounting in RO mode would cause the transactions to be replayed first, causing data modifications, and only afterward would the drive be available in read-only mode. Some complained, because there are times, such as for forensic purposes or for severe data recovery reasons, when you do not want any writes AT ALL.

I suspect that in our case an independent journaling subsystem jumped ahead and made changes, even though you aborted the rebuild of the superblock. The Reiser journaling is designed to maintain data integrity at all times, so it is trying to help you, in the background. Just my theory ...

Go ahead and run the rebuild of the superblock, and let's see the result. If good, then run the check, and let's see its results.
July 14, 2009 Author

Quote: During this whole process, did the array ever start? More specifically, did it start a parity check and start "fixing" the parity errors it found?

The array DID start, in that I could see the various drives and directories listed from Windows. But I didn't (consciously) read or write to it, and from the Web GUI it seems that NO parity check started. I do believe that the last time I ran a parity check there were no errors, but that might have been a while back. (I'm not at home at the moment, and can't confirm when it was.) I don't recall there being any read/write errors from the last time I used the array.

Quote: You are indeed a computer systems professional or an engineer or a logician by nature, when an inconsistency bothers you, even when its latest result is in your favor!

I'm more of a maths/stats guy, so perhaps more in the logician line, but in this case it's more a case of fear of avoidable data loss trumping my natural impatience. Your explanation for what might have happened does seem to make sense; when I get home later, I'll go ahead and do the rebuild of the superblock and see how it goes.

If this part is OK, how would I check that the data is correct? I could read a few files from disk 6 and see if they're OK, but it's not feasible to do a comprehensive check. Or is the superblock fix process robust enough that it either works or it doesn't? I.e., if some files are OK, they're all OK; if any file is not OK, they're ALL not OK.

If that's not the case, I could run a parity check, but I think this would only update the parity drive: if files are corrupted or missing on disk 6, then that's what the parity drive would calculate parity on, and thereafter there would be no way of getting the data back. I believe in later versions it's possible to do a read-only parity check. Can it be done in 4.2.1?

On the other hand, Plan B is: replace disk 6 WITHOUT doing a parity check after the superblock fix.
I should be able to recreate the data (assuming the parity drive is OK). If it turns out there are parity errors, would I then be able to put back the original disk 6 and use the data on that (not on the basis that it's guaranteed perfect, but on the basis that this is my last best hope)? Of course, these things are never straightforward, and I have an additional complication: in anticipation of upgrading my storage capacity - which is what started this whole thing off - I bought two 1.5TB WD drives. Assuming Plan B is the way to go, how would I replace disk 6, given what I have available? Absolute worst case, I could rip out the 400GB drive I have in my Win 7 test machine but that would be a real PITA.
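The parity worry in this post can be illustrated with a toy model. This is only a sketch, assuming single-disk XOR parity and shrinking each "disk" to one byte; the values and variable names are invented for illustration and say nothing about the actual contents of disk 6.

```shell
#!/bin/sh
# Toy model: each "disk" is a single byte, and parity is the XOR of all
# data disks (assumption: a single-parity XOR scheme).
d1=$((0x5A)); d2=$((0xC3)); d6=$((0x3C))
parity=$((d1 ^ d2 ^ d6))

# Rebuilding disk 6 from parity plus the other disks returns exactly what
# disk 6 held when parity was last in sync, whether that was good or bad.
rebuilt=$((parity ^ d1 ^ d2))

# If disk 6 becomes corrupted and parity is then "corrected" to match it,
# a later rebuild reproduces the corrupted value, not the original.
d6_bad=$((0x00))
parity_after_update=$((d1 ^ d2 ^ d6_bad))
rebuilt_after_update=$((parity_after_update ^ d1 ^ d2))

printf 'original d6=0x%02X rebuilt=0x%02X rebuilt after parity update=0x%02X\n' \
    "$d6" "$rebuilt" "$rebuilt_after_update"
```

In this model a rebuild onto a replacement drive recovers the old contents only as long as parity has not been rewritten since the damage, which is the reasoning behind avoiding a correcting parity check before trying Plan B.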
July 14, 2009 Author

The computer gods are against me today. I ran reiserfsck --rebuild-sb /dev/md6 and that seemed to be OK; it told me there might still be some problems (but I assume that's a standard disclaimer), and that I should run a check to be sure. So I started reiserfsck /dev/md6 and it said something about replaying the journal. Somehow it looked like something that would take a while, so I left it running. When I came back about 30 minutes later, the PC that I was telnetting FROM had crashed. It's an old Celeron that has been occasionally flaky.

Now my problem is how do I get back into the Telnet session (I can be safer and do it from my previously mentioned Win 7 machine) and check how the reiserfsck /dev/md6 is going? Or, alternatively, how long should I wait? Or is there another approach?
July 14, 2009

It is highly likely that when the telnet session was terminated, the reiserfsck also terminated. (I doubt it ignores hangup signals.)

You can telnet back in, and then issue a

ps -ef | grep reiserfsck

command. If the only line in the process list that matches is the one with the "grep" looking to match the process name, the reiserfsck process terminated. If you still see the line with the disk device named as an argument, it is still running.

You can telnet in from multiple machines... and even multiple times from one machine.

Joe L.
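The check described above could be wrapped up roughly as follows. This is a minimal sketch: the function name `reiserfsck_status` is invented here, and the `[r]` bracket in the pattern is a common trick so that the grep command's own entry in the process list does not count as a match.

```shell
#!/bin/sh
# Hypothetical helper: read a process listing on stdin and report whether a
# reiserfsck process appears in it. The pattern [r]eiserfsck matches the text
# "reiserfsck" but not the literal string "[r]eiserfsck" that shows up in the
# grep command's own process entry.
reiserfsck_status() {
    if grep -q '[r]eiserfsck'; then
        echo "still running"
    else
        echo "not running"
    fi
}

# Demo on a made-up process line rather than a live system.
printf '%s\n' 'root 1234 1 0 22:40 ? 00:00:05 reiserfsck /dev/md6' | reiserfsck_status
```

On the server itself the intended use would be `ps -ef | reiserfsck_status` from a fresh telnet session.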
July 14, 2009

Some reiserfsck operations are not safe to abort, but the check you were running should be. Just start it again.

All decisions as to the next step, and as to file integrity, are dependent on finishing the reiserfsck check. It should tell you whether there is more that needs to be fixed, or all is well. If it finds and fixes some minor issues itself, you may feel better by running it a second time. Once it has finished successfully, I don't think you need to worry about file integrity. I've never heard of it corrupting files. The rebuild-tree operation, on the other hand, may not always link files correctly back together, and may mix contents of similar or older versions of the same file.

I should have mentioned that the replaying of transactions is usually instantaneous or very short. Normally there aren't any, so it is over immediately. If the drive had been in the midst of heavy writing that was suddenly aborted by a power outage or system crash, then there could be a number of transactions to replay, but they don't take very long, a couple of minutes at most.
July 14, 2009 Author

Hmmm... things are not looking good. I couldn't telnet in as the server had lost its LAN connection. When I went to the console I saw a screen of messages ending in "Kernel panic - not syncing: Fatal exception in interrupt". The server, not surprisingly, wasn't responding to anything, so I rebooted. Disk 6 is still listed as unformatted and I'm now 7 minutes into "Replaying journal..." under reiserfsck.
July 15, 2009 Author

Here's a screenshot of the kernel panic screen. Apologies for the poor resolution from having to resize.
July 15, 2009 Author

Just noticed this thread on the same kernel panic message a few topics down. I'll run a memtest to check out the memory.
July 15, 2009 Author

Unfortunately I don't think the reiserfsck completed - telnet showed "Connection lost" and the console showed the kernel panic error again. I ran Memtest - four passes, no errors, so I don't think it's a memory problem.

Here's a screenshot of the kernel panic. There are some differences compared to the previous one. The penultimate line here is: "Recursive die() failure, output suppressed"

I've also attached the syslog following reboot.

(Edit: to add more information and attachments.)
July 16, 2009

The more relevant part of the panic screen had already scrolled off, in both screens. Try adding a vga parameter, such as vga=6, which is what I use, see this. It is probably going to indicate that it crashed while executing a Reiser function, but we should confirm that.

A fix-it tool that crashes while testing what it is supposed to fix is not particularly impressive! I would expect more robustness than this.

Since the syslog does show that reiserfsck was already identifying ordinary file system corruption before it crashed, you might go ahead and run reiserfsck with the --fix-fixable option, but I have no idea whether it will perform any better.
July 16, 2009 Author

Thanks for your patience, RobJ. Put in vga=6, ran reiserfsck /dev/md6, typed Yes when prompted and got this message:

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

The problem has occurred looks like a hardware problem. If you have bad blocks, we advise you to get a new hard drive, because once you get one bad block that the disk drive internals cannot hide from your sight,the chances of getting more are generally said to become much higher (precise statistics are unknown to us), and this disk drive is probably not expensive enough for you to you to risk your time and data on it. If you don't want to follow that follow that advice then if you have just a few bad blocks, try writing to the bad blocks and see if the drive remaps the bad blocks (that means it takes a block it has in reserve and allocates it for use for of that block number). If it cannot remap the block, use badblock option (-B) with reiserfs utils to handle this block correctly.

bread: Cannot read the block (2): (Input/output error).

Aborted
root@Tower:~#

If it was my old TV I would just thump it and try again, but that probably won't work with a disk drive.

Time to pray that parity is OK and replace the drive?
July 16, 2009

Have you run smartctl on this drive recently? (You'll need to do it on the /dev/sdX drive, not the md6 device.)

That error was as if it could not read the disk at all.

Joe L.
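To make the device distinction above concrete, here is a small sketch of a wrapper that refuses md devices before calling smartctl. The wrapper name `smart_report` is invented for illustration, and accepting /dev/hdX as well is an assumption on my part for IDE drives; `smartctl -a` is the standard "print all SMART information" option, though some older setups may need an extra device-type option that is not shown here.

```shell
#!/bin/sh
# Hypothetical wrapper: print a full SMART report for a physical drive,
# refusing md devices such as /dev/md6, which are not physical disks.
smart_report() {
    case "$1" in
        /dev/md*)
            echo "use the underlying /dev/sdX (or /dev/hdX) device, not $1" >&2
            return 1
            ;;
        /dev/sd*|/dev/hd*)
            # -a prints all SMART information for the given drive.
            smartctl -a "$1"
            ;;
        *)
            echo "usage: smart_report /dev/sdX" >&2
            return 2
            ;;
    esac
}
```

A call such as `smart_report /dev/sdX`, with X replaced by the letter of the physical drive behind disk 6, would then produce the report to post.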
July 17, 2009

Can you capture and post your syslog? It should contain more and better error messages. SMART tests would be very good to run now (see the Troubleshooting link in my sig for instructions on the syslog and the SMART report and tests). This error seems like something new. Have you seen any error messages concerning this drive before?
July 21, 2009 Author

Unfortunately I ran into other problems. First, I ran smartctl (had to install the package, as I'm on 4.2.1). That seemed to be OK, so I ran reiserfsck, but got a "Bad root block" error. (The two parts - my wrong typing included - are in the attachment.)

At this point I was stumped - the disk seemed to be OK, but reiserfsck seemed to be stuck. So I went for Plan B, and replaced that disk 6 with my other 400GB Seagate and let unRAID recreate the data on the disk. I rebooted the server after that and the syslog (attached) seemed to show that it mounted OK, complete with a filesystem.

So parity seemed to have rescued the day, but.... the Windows GUI showed the new disk 6 as unformatted. I rebooted again, same result. So I ran reiserfsck again (on the new disk, of course) and got the same "Bad root block" error. This seems to indicate that the problem with the disk (possibly the filesystem structure) already existed the last time the old drive was written to (and was therefore recorded by the parity drive).

Since the error message indicated that --rebuild-tree did not complete, I ran reiserfsck with this option and got some messages but nothing that seemed fatal. I rebooted again, and this time disk 6 was green-dotted. A quick check on the disk showed that the files seemed to be OK, and there was nothing in the lost+found directory, so everything seems to be OK.

Ironically, if I'd run "--rebuild-tree" on the old disk 6, I'd probably have fixed the disk - at least to the same extent.

After this, I ran a parity check and then upgraded to 4.4.2. Now to add some more capacity.....

Thanks for all the patient help on this, Rob and Joe.
July 21, 2009

Wow! Great news!

Probably unnecessary, but when you have a chance, I would run one last check (reiserfsck /dev/md6) on the drive, just to confirm all is clean.