ixnu Posted October 28, 2013

Ugly weekend. Server running for ~3 years on stock 5.0 with no plugins. Upgraded to 5.0 about a month ago.

- Red ball on a WD 2 TB (WD1) over the weekend.
- I shut down the box to check cables etc.
- Server refused to boot. No POST or even beeps after I removed cards and memory.
- Quick run to Fry's for a new motherboard. Replaced the motherboard and it booted fine.
- Replaced the red-balled drive with a new one (WD2), and kept the bad drive (WD1) unprotected but mounted.
- The old red-balled drive (WD1) could now run a SMART report, and all was fine according to SMART.
- The array appears to rebuild the disk at ~8 AM EST.
- New red ball appears on a different drive (WD3).
- Syslog appears to have filled up with read errors (at ~125 MB) and moved to syslog.1 (below) at ~5 AM, but the rotated syslog is now empty.
- Rebooted, and the SMART report on WD3 looks good after the reboot.

At this point I'm at a loss. I would assume that I should just replace and rebuild, but I'm starting to think that it might be my controller card.

Some concerns:
- Why didn't the syslog rotate properly?
- All of my files appear to be intact, but would read errors result in corruption on the rebuild?

Any ideas? Thanks for looking!

Syslogs:
http://pastebin.com/Bs82C60S (the one that filled up on the rebuild)
http://pastebin.com/k9yHySTf (after most recent reboot)

dmesg:
http://pastebin.com/TEUNMc4X

SMART report on suspect drive (WD3):
http://pastebin.com/VSDvMZtP

ixnu Posted October 28, 2013

This is getting worse. I have replaced the latest red-balled drive with a new one. During the rebuild, I'm getting numerous REISERFS errors.

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?
Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 1185 0x0 SD]
Oct 28 18:02:21 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3
Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?
Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 29459 0x0 SD]

Full error log: http://pastebin.com/KdJSMyMq
SMART report on md6 (a new drive): http://pastebin.com/1ePcxh0E

Help me Obi-Wan!

redlaws Posted October 29, 2013

reiserfsck --rebuild-tree needs to be run on those drives. Do a search on the forum.

ixnu Posted October 29, 2013

Yep. I have lost a significant amount of data and have no idea what caused it. The motherboard went bad and two drives were marked failed, so it might be the PSU...

The strange part is that the most recent failed drive (WD3) is readable in another box, but after the rebuild, some data is missing from its replacement. reiserfsck has found tons of errors and I'm doing a tree rebuild now.

dgaschk Posted October 29, 2013

redlaws wrote: "reiserfsck --rebuild-tree needs to run on those drives do a search on forum"

DO NOT do this. It is incorrect. See Check Disk FileSystems in my sig for the correct procedure.

ixnu Posted October 29, 2013

Any ideas how this much corruption could happen? I ran rebuild-tree after reiserfsck returned this type of error on the first disk. The output below is from the second disk, on which I'm about to run rebuild-tree. Both disks have significant lost data. I would assume this is the prudent course of action.

root@Tower:/var/log# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
########### reiserfsck --check started at Tue Oct 29 16:05:41 2013 ###########
Replaying journal: Trans replayed: mountid 32, transid 26785, desc 2509, len 1, commit 2511, next trans offset 2494
Trans replayed: mountid 32, transid 26786, desc 2512, len 1, commit 2514, next trans offset 2497
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 2 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit.
Checking internal tree..
\/ 2 (of 15-/ 25 (of 111// 84 (of 89/block 342786049: The level of the node (0) is not correct, (1) expected
 the problem in the internal node occured (342786049), whole subtree is skipped
/ 26 (of 111-block 342786056: The level of the node (46165) is not correct, (2) expected
 the problem in the internal node occured (342786056), whole subtree is skipped
/ 3 (of 15\/ 35 (of 162// 57 (of 90-block 354374318: The level of the node (16866) is not correct, (1) expected
 the problem in the internal node occured (354374318), whole subtree is skipped
/ 59 (of 162-block 351898852: The level of the node (33689) is not correct, (2) expected
 the problem in the internal node occured (351898852), whole subtree is skipped
/ 4 (of 15\block 334481327: The level of the node (26998) is not correct, (3) expected
 the problem in the internal node occured (334481327), whole subtree is skipped
finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
5 found corruptions can be fixed only when running with --rebuild-tree
########### reiserfsck finished at Tue Oct 29 16:09:28 2013 ###########

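The disagreement above is about order of operations: run the read-only check first, and move to --rebuild-tree only when the check report itself asks for it, as it does here. A minimal sketch of that decision follows. The function name next_step is made up for illustration, and the --fix-fixable branch assumes reiserfsck names that option in its report when only minor repairs are needed; the other two strings are taken from the reports in this thread.

```shell
# Hypothetical helper: read a saved `reiserfsck --check` report on stdin
# and print the follow-up that the report itself asks for.
next_step() {
    report=$(cat)
    case $report in
        # Seen in this thread: "... can be fixed only when running with --rebuild-tree"
        *--rebuild-tree*) echo "rebuild-tree" ;;
        # Assumption: the report mentions --fix-fixable for minor corruption
        *--fix-fixable*) echo "fix-fixable" ;;
        # Seen in this thread on a clean filesystem
        *"No corruptions found"*) echo "clean" ;;
        # Anything else needs a human to read the report
        *) echo "unknown" ;;
    esac
}
```

Usage would look like `next_step < check.log`, where check.log is a hypothetical file holding the saved output of the check. The helper only reads text; it never runs a repair itself.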
ixnu Posted October 30, 2013

Run with rebuild-tree. Yes, thanks.

The tree rebuild on the first drive recovered all (or >99%) of the data. There is about 100-150 GB missing from the second drive. The rebuild-tree is at about 50%, so we will see what happens in a few hours.

I'm still confounded by how the corruption happened. My working theory is that there was a read failure of a second drive (WD2) close to the end of the rebuild from parity of the first drive (WD1). This drive (WD2) then red-balled when there was a write failure after the restore had completed. This does not explain all of the corruption on the final disk (WD3) that was restored from parity. I'm very interested to see if reiserfsck will bring back any of the data.

Compounding all of this is the fact that the syslog filled before the end of the first restore, so I have no good diagnostics.

ixnu Posted October 30, 2013

Good news on the reiserfsck. It appears that I have recovered the majority of the data. I have to finish some old hashes, but the big things are still there. At most, I lost ~5 GB of data.

I still have no idea what happened or how, so I'm not sure whether to mark this [solved].

On to parity checks.

binhex Posted October 30, 2013

Hi ixnu, quick thought: I'm assuming that when you replaced your mobo you made sure you got the timings and voltage correct for your RAM, yes? Also, I don't think it would hurt to run memtest86, just to rule out any corruption of data due to bad memory module(s).

ixnu Posted October 30, 2013

That's a good idea. I have not run memtest on the new rig. I think I'm going to get ECC RAM to replace it anyway. Should have done it from the start.

Right now I have an additional issue. "du" does not report the correct sizes of directories with recovered files. "df" and "ls" agree, but "du" does not see the correct sizes of many of the recovered files. Running reiserfsck --check now.

ixnu Posted October 30, 2013

Reiserfsck came back fine:

# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and it fails **
** please email bug reports to [email protected], **
** providing as much information as possible -- your **
** hardware, kernel, patches, settings, all reiserfsck **
** messages (including version), the reiserfsck logfile, **
** check the syslog file for any related information. **
** If you would like advice on using this program, support **
** is available for $25 at www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
########### reiserfsck --check started at Wed Oct 30 08:39:09 2013 ###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..
finished
Comparing bitmaps..finished
Checking Semantic tree: finished
No corruptions found
There are on the filesystem:
 Leaves 282304
 Internal nodes 1762
 Directories 801
 Other files 6945
 Data block pointers 285114925 (0 of them are zero)
 Safe links 0
########### reiserfsck finished at Wed Oct 30 09:23:47 2013 ###########
root@Tower:/#

However, I still get this not-so-comfortable report:

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx 3 nobody users 296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx 2 nobody users 600 2013-08-13 03:58 .actors/
-rw-rw-rw- 1 nobody users 110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw- 1 nobody users 22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw- 1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw- 1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw- 1 nobody users 85K 2011-06-02 14:20 mymovies-front.jpg
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du . -h
5.3M ./.actors
14M .
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

This just feels bad.

Updated syslog: https://gist.github.com/anonymous/7233604

JonathanM Posted October 30, 2013

ixnu wrote: "I think I'm going to get ECC RAM to replace it anyway."

Be sure your CPU and motherboard can use it before you buy it. I don't think many of the consumer-type boards have ECC enabled.

ixnu Posted October 30, 2013

Thanks. That's one nice feature of ASUS AM3+ motherboards: they all support ECC. About to pull the trigger on a FreeBSD ZFS box, and AMD with ECC is so much cheaper than Intel.

dgaschk Posted October 30, 2013

ixnu wrote: "Reiserfsck came back fine [...] However, I still get this not-so-comfortable report: [ls -lah and du . -h output for The Great Dictator (1940), as posted above] This just feels bad."

This looks normal. ls -lah does not recursively sum directories and du . -h does recursively sum directories.

ixnu Posted October 30, 2013

Thanks for looking at this!

In the previous example, "ls" reports an aggregate of 8.6M and "du" gives 14M. I understand that the subdirectories account for that difference, but there is a 6.4G (!) file in the directory. "ls" reports the correct size of the individual files (but reports the aggregate incorrectly). "du" does not report the correct size of the individual files or of the directory.

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du -h ./"The Great Dictator (1940).mkv"
7.5M ./The Great Dictator (1940).mkv
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

The example below is a correct listing from the same drive. "ls" gives a total of 11G and "du" gives a total of 11G, which are both correct.

root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# ls -lah
total 11G
drwxrwxrwx 6 nobody users 968 2013-10-13 06:19 ./
drwxrwxrwx 172 nobody users 7.2K 2013-10-30 12:46 ../
drwxrwxrwx 2 user1 users 72 2013-10-09 04:20 .AppleDouble/
drwxrwxrwx 2 nobody users 224 2012-10-04 04:37 .actors/
-rw-rw-rw- 1 nobody users 286K 2012-10-03 23:34 Twelve\ Monkeys\ (1995)-fanart.jpg
-rw-rw-rw- 1 nobody users 106 2011-10-18 01:15 Twelve\ Monkeys\ (1995).dvdid.xml
-rw-rw-rw- 1 nobody users 33K 2011-08-29 18:50 Twelve\ Monkeys\ (1995).jpg
-rw-rw-rw- 1 nobody users 11G 2009-08-01 00:27 Twelve\ Monkeys\ (1995).mkv
-rw-rw-rw- 1 nobody users 7.4K 2012-12-26 09:35 Twelve\ Monkeys\ (1995).nfo
-rw-rw-rw- 1 nobody users 49K 2012-10-03 23:34 Twelve\ Monkeys\ (1995).tbn
-rw-rw-rw- 1 nobody users 63K 2010-12-07 15:22 backdrop.jpg
-rw-rw-rw- 1 nobody users 64K 2013-10-12 22:34 banner.jpg
-rw-rw-rw- 1 nobody users 76K 2012-08-11 19:27 clearart.png
-rw-rw-rw- 1 nobody users 538K 2012-08-11 19:27 disc.png
drwxrwxrwx 2 nobody users 96 2012-08-11 19:26 extrafanart/
drwxrwxrwx 2 nobody users 48 2012-08-11 19:26 extrathumbs/
-rw-rw-rw- 1 nobody users 71K 2012-02-21 21:11 fanart.jpg
-rw-rw-rw- 1 nobody users 35K 2011-10-18 01:17 folder.jpg
-rw-rw-rw- 1 nobody users 51K 2012-08-11 19:26 logo.png
-rw-rw-rw- 1 nobody users 143K 2012-02-21 21:11 movie.jpg
-rw-rw-rw- 1 nobody users 7.5K 2013-08-14 19:46 movie.nfo
-rw-rw-rw- 1 nobody users 143K 2012-02-21 21:11 movie.tbn
-rw-rw-rw- 1 nobody users 8.4K 2013-10-12 22:33 movie.xml
-rw-rw-rw- 1 nobody users 258K 2011-05-06 18:05 mymovies-back.jpg
-rw-rw-rw- 1 nobody users 141K 2011-05-06 18:05 mymovies-front.jpg
-rw-rw-rw- 1 nobody users 17K 2011-10-18 01:17 mymovies.xml
-rw-rw-rw- 1 nobody users 104K 2010-12-07 15:22 poster.jpg
-rw-rw-rw- 1 nobody users 194K 2013-10-12 22:34 thumb.jpg
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# du . -h
64K ./extrafanart
4.0K ./.AppleDouble
100K ./.actors
0 ./extrathumbs
11G .
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)#

Now, all of the directories recovered by reiserfsck have this same issue. None of my other directories do. This is troubling to me, but I'm not sure if it's a bug or a feature.

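One way to look at a single suspect file is to compare its apparent size with the space actually allocated to it. The per-file size column of ls -l shows the apparent size, while the "total" line of ls -l and the figures printed by du count allocated blocks, so a 6.4G file that du reports as 7.5M is a file whose recorded length is intact but whose data blocks are mostly not allocated. A minimal sketch, assuming GNU coreutils stat is available; the function name size_vs_alloc is made up for illustration:

```shell
# Hypothetical helper: print the apparent size of a file next to the
# number of bytes actually allocated to it on disk.
# GNU stat format codes: %s = size in bytes, %b = allocated blocks,
# %B = size in bytes of each block counted by %b.
size_vs_alloc() {
    # The output is three plain numbers, so word splitting is safe here
    set -- $(stat -c '%s %b %B' -- "$1")
    echo "apparent=$1 allocated=$(( $2 * $3 ))"
}
```

Usage would look like `size_vs_alloc "The Great Dictator (1940).mkv"`. GNU du can show the same contrast: `du -h --apparent-size FILE` reports the recorded length, and plain `du -h FILE` reports the allocated space. A large gap is not proof of corruption on its own, since deliberately sparse files look the same way.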
binhex Posted October 30, 2013

Do the movies with the wrong sizes actually play back, or are they corrupt?

ixnu Posted October 30, 2013

I was only working from a shell, so I could not test all of them. I just got home, and some of them are corrupt. :'(

Some will play for a while and then crash. I have not done a wholesale test, but it's obvious that my array is not in a good state.

dgaschk Posted October 30, 2013

Does "/mnt/user/media/Movies/The Great Dictator (1940)# ls -lah" show the same?

ixnu Posted October 31, 2013

Yes.

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx 3 nobody users 296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx 2 nobody users 600 2013-08-13 03:58 .actors/
-rw-rw-rw- 1 nobody users 110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw- 1 nobody users 22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw- 1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw- 1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw- 1 nobody users 85K 2011-06-02 14:20 mymovies-front.jpg

I have an ugly little script that will find my corrupt files:

find . -size +1050M | du -hs * | grep M

Basically, "find" will _find_ dirs bigger than a gig and du will then sort by MB. So, anything reported bigger than a GB by "find" but smaller than a GB by "du" will be reported.

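As posted, the pipeline does not quite do what the description says: du does not read file names from standard input, so the output of find is discarded and `du -hs *` simply lists everything in the current directory, and `grep M` then keeps any line containing a capital M, including an M in a file name. A sketch of a stricter variant follows, assuming GNU find (for -printf) and awk. The function name find_hollow_files, the 1 GiB default, and the "less than half allocated" cut-off are all arbitrary choices made for illustration.

```shell
# Hypothetical helper: list regular files under DIR that are larger than
# MIN_BYTES but have less than half of that size allocated on disk.
# GNU find -printf codes: %s = size in bytes, %b = allocated 512-byte
# blocks, %p = path.
find_hollow_files() {
    dir=$1
    min_bytes=${2:-1073741824}   # assumption: default threshold of 1 GiB
    find "$dir" -type f -size +"${min_bytes}c" -printf '%s %b %p\n' |
        # Keep a line when allocated bytes < half the apparent size, then
        # strip the two numeric fields so only the path is printed.
        awk '$2 * 512 < $1 / 2 { sub(/^[0-9]+ [0-9]+ /, ""); print }'
}
```

Usage would look like `find_hollow_files /mnt/disk10/media/Movies`. Because the comparison is made on numbers rather than on the letter M, file names do not cause false positives, although legitimately sparse files would still be listed and paths containing newlines are not handled.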
ixnu Posted October 31, 2013

So this is another wrinkle... These files on the old hard drives are NOT corrupt. I was able to pull everything off of the two old drives that unRAID had red-balled. The drives write and read fine mounted in an Ubuntu box.

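Since the old drives still hold good copies, one way to learn exactly which rebuilt files are damaged is to compare the two trees byte for byte. A minimal sketch using find and cmp follows; the function name compare_trees is made up, and the mount points in the usage note are hypothetical, not taken from this system.

```shell
# Hypothetical helper: for every regular file under the reference tree
# REF, check that a file with the same relative path exists under NEW
# and has identical contents.
compare_trees() {
    ref=$1
    new=$2
    # List paths relative to REF so they can be looked up under NEW
    ( cd "$ref" && find . -type f -print ) | while IFS= read -r rel; do
        if [ ! -f "$new/$rel" ]; then
            echo "MISSING $rel"
        elif ! cmp -s "$ref/$rel" "$new/$rel"; then
            # cmp -s is silent and returns non-zero when the files differ
            echo "DIFFERS $rel"
        fi
    done
}
```

Usage would look like `compare_trees /mnt/old_wd3/media /mnt/disk10/media`, with the old drive mounted read-only. Every file is read in full on both sides, so this is slow on multi-terabyte disks, but it does not depend on file sizes or block counts looking suspicious.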
ixnu Posted October 31, 2013

memtest ran for 14 hours with no issues. Starting a parity check now.

redlaws Posted November 1, 2013

So my answer was not totally incorrect, as I had been there and done that before. This error comes mainly from improper power-downs of the server (although I also lost Windows access at the time, mainly because it was accessing those 2 HDDs).

I had that issue a while back, so I pulled the drives out and made a new system with the existing disks by rebuilding parity. I could not find the answer back then (it would have been best to rebuild then with the existing disks, but I didn't want to lose too much data), but I kept the 2 problem disks and built an old 3-disk system alongside the existing one to recover data (recovered most).

I am building a new box and will relocate the old tower into it this month. I telnet in to keep an eye on my servers, but I only turn them on when I need them. One tower has a spare (an ex-red-ball HDD), while the other has 2 spares (plus the 3 HDDs from my old tower, which are not in the array).

ixnu Posted November 1, 2013

redlaws wrote: "So my answer was not totally incorrect. as I had been there & done that before. this error can mainly from non correct power downs of the server (although I also lost windows access at the time mainly because it was accessing those 2 hdd)."

Thanks for your input. Sorry that happened, but it sounds like your outcome was similar to mine. The failure condition and the inability to troubleshoot the root cause are what is troubling me. I have asked for an official support ticket, but have not heard back on anything.

To recap:
- I started with a system state of no (or virtually no) corrupt data. I know this because I can pull good data off the red-balled drives.
- I followed the correct procedure per the documentation for the system.
- The principal repair and reporting tool, "reiserfsck", found no corruption after completion of the rebuild; however, this was not true.
- As a result of following the correct procedure, data loss and corruption were introduced into the array. The recovery procedure put the array in a far worse situation.
- Without the "du" and "ls" mismatches, I would not have known about the corruption.

This situation sets up a real possibility of a future catastrophic system failure occasioned by data loss. My suggestion would be to have a script that looks for this type of corruption. With this information, at least you would know that your system is unreliable.

Thoughts?

Influencer Posted November 1, 2013

Does that script you're using not return a lot of false positives with ANY file that has an M in its name?