ixnu

Redballs & Bad Motherboards - Rated NC-17 for graphic violence and gore


Ugly weekend. Server has been running for ~3 years on stock unRAID with no plugins. Upgraded to 5.0 about a month ago.

 

- Red ball on a WD 2 TB (WD1) over the weekend.

- I shut down the box to check cables etc.

- Server refused to boot. No POST or even beeps after I removed cards and memory.

- Quick run to Fry's for a new motherboard. Replaced the motherboard and it booted fine.

- Replaced the red-balled drive with a new one (WD2), and kept the bad drive (WD1) unprotected but mounted.

- The old red-balled drive (WD1) could now run a SMART report, and all was fine according to SMART.

- The array appears to rebuild the disk at ~8 AM EST.

- New red ball appears on a different drive (WD3).

- Syslog appears to have filled up with read errors (at ~125 MB) and moved to syslog.1 (below) at ~5 AM, but the rotated syslog is now empty.

- Rebooted, and the SMART report (WD3) looks good after the reboot.

 

 

At this point I'm at a loss. I would assume that I should just replace and rebuild, but I'm starting to think that it might be my controller card.

 

Some concerns:

 

Why didn't the syslog rotate properly?

 

All of my files appear to be intact, but would read errors result in corruption on the rebuild?

 

Any ideas?

 

Thanks for looking!

 

Syslogs:

http://pastebin.com/Bs82C60S (the one that filled up on the rebuild)

 

http://pastebin.com/k9yHySTf (after most recent reboot)

 

 

dmesg:

http://pastebin.com/TEUNMc4X

 

SMART report on suspect drive (WD3)

http://pastebin.com/VSDvMZtP


This is getting worse.

 

I have replaced the latest red balled drive with a new one.

 

During the rebuild, I'm getting numerous REISERFS errors.

 

 

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 1185 0x0 SD]

Oct 28 18:02:21 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 29459 0x0 SD]

 

 

Full error log:

http://pastebin.com/KdJSMyMq

 

SMART report on md6 (a new drive)

 

http://pastebin.com/1ePcxh0E

 

Help me Obi-Wan!


Yep. I have lost a significant amount of data and have no idea what caused it.  :(

 

The motherboard went bad and two drives were marked failed, so it might be the PSU...

 

The strange part is that the most recent failed drive (WD3) is readable in another box, but after the rebuild, some data is missing from its replacement.

 

reiserfsck has found tons of errors and I'm doing a tree rebuild now.

 

 

 

 


reiserfsck --rebuild-tree

needs to run on those drives

 

do a search on forum

 

DO NOT do this. It is incorrect. See Check Disk FileSystems in my sig for the correct procedure.


Any ideas how this much corruption could happen?

 

I ran rebuild-tree after reiserfsck returned this type of error on the first disk.

The output below is from the second disk, which I'm about to run rebuild-tree on. Both disks have significant lost data. I would assume this is the prudent course of action.

 

root@Tower:/var/log# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Tue Oct 29 16:05:41 2013
###########
Replaying journal: Trans replayed: mountid 32, transid 26785, desc 2509, len 1, commit 2511, next trans offset 2494
Trans replayed: mountid 32, transid 26786, desc 2512, len 1, commit 2514, next trans offset 2497
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 2 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit.
Checking internal tree..
block 342786049: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (342786049), whole subtree is skipped
block 342786056: The level of the node (46165) is not correct, (2) expected
the problem in the internal node occured (342786056), whole subtree is skipped
block 354374318: The level of the node (16866) is not correct, (1) expected
the problem in the internal node occured (354374318), whole subtree is skipped
block 351898852: The level of the node (33689) is not correct, (2) expected
the problem in the internal node occured (351898852), whole subtree is skipped
block 334481327: The level of the node (26998) is not correct, (3) expected
the problem in the internal node occured (334481327), whole subtree is skipped
finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
5 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Tue Oct 29 16:09:28 2013
###########
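
The "check first, escalate only if the log says so" approach can be sketched as a small helper that reads a saved --check log and prints the follow-up the log itself asks for. This is only a sketch: next_reiserfsck_step is a made-up name, the --rebuild-tree and "No corruptions found" phrases are taken from the outputs quoted in this thread, and the --fix-fixable wording is an assumption about what reiserfsck prints in that case.

```shell
# Hypothetical helper: read a saved "reiserfsck --check" log on stdin and
# print which follow-up step the log asks for. It never runs reiserfsck itself.
next_reiserfsck_step() (
    log=$(cat)
    case $log in
        # Phrase as seen in the log quoted above.
        *'only when running with --rebuild-tree'*) echo "rebuild-tree" ;;
        # Assumed wording for the less drastic repair mode.
        *'when running with --fix-fixable'*)       echo "fix-fixable" ;;
        # Phrase as seen in a clean check.
        *'No corruptions found'*)                  echo "clean" ;;
        *)                                         echo "unknown" ;;
    esac
)
```

Usage would be something like `next_reiserfsck_step < check.log`, where check.log is a hypothetical file holding the saved output. Anything other than "clean" still deserves a read of the full log before running a repair.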


Run with rebuild-tree.

 

Yes, thanks.

 

The tree rebuild on the first drive recovered all (or >99%) of the data.

 

There is about 100-150GB missing from the second drive. The rebuild-tree is about 50% done, so we will see what happens in a few hours.

 

I'm still confounded by how the corruption happened.

 

My working theory is that there was a read failure of a second drive (WD2) close to the end of the rebuild from parity of the first drive (WD1). This drive (WD2) then redballed when there was a write failure after the restore had completed.
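
To illustrate why a bad read on another drive during a rebuild would matter: assuming single XOR parity (my understanding of how the array protects data, not something confirmed in this thread), a rebuilt block is the XOR of the parity block with the matching block on every other data disk. A toy sketch with one made-up byte per "disk" shows that wrong data returned by a surviving disk lands directly in the rebuilt disk:

```shell
# Toy model of single XOR parity. Each "disk" is one byte (0-255); all values
# are made up for illustration.
xor_all() (
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
)

d1=165; d2=60; d3=15                      # three data disks
parity=$(xor_all "$d1" "$d2" "$d3")       # what the parity disk would hold

# Rebuild d1 from parity plus the surviving disks: recovers the original.
good=$(xor_all "$parity" "$d2" "$d3")

# Same rebuild, but d2 silently returns one flipped bit during the rebuild.
bad=$(xor_all "$parity" "$(( d2 ^ 4 ))" "$d3")

echo "original=$d1 rebuilt_ok=$good rebuilt_with_bad_read=$bad"
# prints: original=165 rebuilt_ok=165 rebuilt_with_bad_read=161
```

The rebuilt value is wrong by exactly the bit that was misread, and nothing in the rebuild itself flags it.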

 

This does not explain all of the corruption on the final disk (WD3) that was restored from parity. I'm very interested to see if reiserfsck will bring back any of the data.

 

Compounding all of this is the fact that the syslog filled before the end of the first restore, so I have no good diags.

 

 


Good news on the reiserfsck. It appears that I have recovered the majority of the data. I have to finish some old hashes, but the big things are still there. At most, I lost ~5 GB of data.

I still have no idea what happened or how, so I'm not sure whether to mark this [solved].

 

On to parity checks.


Hi ixnu, quick thought: I'm assuming that when you replaced your mobo you made sure the timings and voltage were set correctly for your RAM? Also, I don't think it would hurt to run memtest86, just to rule out data corruption from bad memory module(s).


That's a good idea. I have not run memtest on the new rig.

 

I think I'm going to get ECC RAM to replace it anyway. Should have done it from the start.

 

Right now I have an additional issue.

 

"du" does not report the correct sizes of directories with recovered files.

 

"df" and "ls" agree, but "du" does not see the correct sizes of many of the recovered files.

 

Running reiserfsck --check now.


Reiserfsck came back fine

 

# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Oct 30 08:39:09 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 282304
        Internal nodes 1762
        Directories 801
        Other files 6945
        Data block pointers 285114925 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Wed Oct 30 09:23:47 2013
###########
root@Tower:/#

 

However, I still get this not-so-comfortable report:

 

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du . -h
5.3M    ./.actors
14M     .
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

 

This just feels bad.

 

Updated syslog:

https://gist.github.com/anonymous/7233604


I think I'm going to get ECC RAM to replace it anyway.

Be sure your CPU and motherboard can use it before you buy it. I don't think many of the consumer type boards have ECC enabled.


Thanks.

 

That's one nice feature of ASUS AM3+ motherboards: they all support ECC.

 

About to pull the trigger on a FreeBSD ZFS box and AMD with ECC is so much cheaper than Intel.


However, I still get this not-so-comfortable report:

 

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du . -h
5.3M    ./.actors
14M     .
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

 


This looks normal. ls -lah does not recursively sum directories and du . -h does recursively sum directories.


Thanks for looking at this!

 

In the previous example, "ls" reports an aggregate of 8.6M and "du" gives 14M. I understand that subdirectories are the difference, but there is a 6.4G (!) file in the directory.

 

"ls" reports the correct size of the individual files (but reports the agg incorrectly).

 

"du" does not report correct size of individual files or the directory.

 

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du -h ./"The Great Dictator (1940).mkv"
7.5M    ./The Great Dictator (1940).mkv
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

 

 

The example below is from a correct listing on the same drive.

 

"ls" gives a total of 11G and "du" gives a total of 11G - which are correct.

 

root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# ls -lah
total 11G
drwxrwxrwx   6 nobody users  968 2013-10-13 06:19 ./
drwxrwxrwx 172 nobody users 7.2K 2013-10-30 12:46 ../
drwxrwxrwx   2 user1  users   72 2013-10-09 04:20 .AppleDouble/
drwxrwxrwx   2 nobody users  224 2012-10-04 04:37 .actors/
-rw-rw-rw-   1 nobody users 286K 2012-10-03 23:34 Twelve\ Monkeys\ (1995)-fanart.jpg
-rw-rw-rw-   1 nobody users  106 2011-10-18 01:15 Twelve\ Monkeys\ (1995).dvdid.xml
-rw-rw-rw-   1 nobody users  33K 2011-08-29 18:50 Twelve\ Monkeys\ (1995).jpg
-rw-rw-rw-   1 nobody users  11G 2009-08-01 00:27 Twelve\ Monkeys\ (1995).mkv
-rw-rw-rw-   1 nobody users 7.4K 2012-12-26 09:35 Twelve\ Monkeys\ (1995).nfo
-rw-rw-rw-   1 nobody users  49K 2012-10-03 23:34 Twelve\ Monkeys\ (1995).tbn
-rw-rw-rw-   1 nobody users  63K 2010-12-07 15:22 backdrop.jpg
-rw-rw-rw-   1 nobody users  64K 2013-10-12 22:34 banner.jpg
-rw-rw-rw-   1 nobody users  76K 2012-08-11 19:27 clearart.png
-rw-rw-rw-   1 nobody users 538K 2012-08-11 19:27 disc.png
drwxrwxrwx   2 nobody users   96 2012-08-11 19:26 extrafanart/
drwxrwxrwx   2 nobody users   48 2012-08-11 19:26 extrathumbs/
-rw-rw-rw-   1 nobody users  71K 2012-02-21 21:11 fanart.jpg
-rw-rw-rw-   1 nobody users  35K 2011-10-18 01:17 folder.jpg
-rw-rw-rw-   1 nobody users  51K 2012-08-11 19:26 logo.png
-rw-rw-rw-   1 nobody users 143K 2012-02-21 21:11 movie.jpg
-rw-rw-rw-   1 nobody users 7.5K 2013-08-14 19:46 movie.nfo
-rw-rw-rw-   1 nobody users 143K 2012-02-21 21:11 movie.tbn
-rw-rw-rw-   1 nobody users 8.4K 2013-10-12 22:33 movie.xml
-rw-rw-rw-   1 nobody users 258K 2011-05-06 18:05 mymovies-back.jpg
-rw-rw-rw-   1 nobody users 141K 2011-05-06 18:05 mymovies-front.jpg
-rw-rw-rw-   1 nobody users  17K 2011-10-18 01:17 mymovies.xml
-rw-rw-rw-   1 nobody users 104K 2010-12-07 15:22 poster.jpg
-rw-rw-rw-   1 nobody users 194K 2013-10-12 22:34 thumb.jpg
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# du . -h
64K     ./extrafanart
4.0K    ./.AppleDouble
100K    ./.actors
0       ./extrathumbs
11G     .
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)#

 

Now, all of the recovered dirs from reiserfsck have this same issue. None of my other directories have this issue.

 

This is troubling to me, but I'm not sure if it's a bug or a feature.
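
For what it's worth, the two tools measure different things. The size column in "ls -l" is the length recorded in the file's metadata, while "du" (and the "total" line in "ls") counts the blocks actually allocated to the file. Below is a minimal sketch for checking both numbers on one file, assuming GNU stat and truncate are available. size_report is a made-up helper name, and the demo file is a deliberately sparse file, not one of the damaged movies.

```shell
# Hypothetical helper: print apparent vs. allocated size (bytes) for one file.
# GNU stat format: %s = apparent size, %b = allocated blocks, %B = bytes per block.
size_report() (
    stat -c '%s %b %B' -- "$1" | {
        read -r apparent blocks blksize
        echo "apparent=$apparent allocated=$(( blocks * blksize ))"
    }
)

# Demo on a sparse file: 10 MiB long according to its metadata, no data written.
demo_dir=$(mktemp -d)
truncate -s 10M "$demo_dir/hole.bin"
size_report "$demo_dir/hole.bin"
```

Run against the .mkv above, an allocated figure of a few MB next to an apparent size of 6.4G would match what "ls" and "du" are showing: the metadata says the file is full length, but most of its data blocks are not there.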

 


Do the movies with wrong sizes actually play back, or are they corrupt?


I only had shell access, so I could not test all of them.

I just got home, and some of them are corrupt.  :'(

Some will play for a while and then crash. I have not done a wholesale test, but it's obvious that my array is not in a good state.


yes.

 

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg

 

I have an ugly little script that will find my corrupt files:

 

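find . -type f -size +1050M -exec du -h {} + | grep -E '^([0-9.]+[KM]|0)[[:space:]]'

Basically, "find" picks out files whose reported size is over a gig, "du" prints how much disk space each one actually uses, and "grep" keeps only the ones that "du" reports in K or M (or as 0). So anything bigger than a GB according to "find" but smaller than a GB according to "du" gets reported.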
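
A slightly more careful version of the same idea, as a sketch only: compare each file's apparent size with its allocated size numerically instead of matching on the unit letter. This assumes GNU find (for -printf). find_short_files is a made-up name, and the "under half allocated" threshold is an arbitrary choice. Legitimately sparse files will also be flagged, and file names containing newlines are not handled.

```shell
# Hypothetical helper: list files whose allocated size is far below their
# apparent size.
#   $1 = directory to scan
#   $2 = minimum apparent size in bytes (default 1 GiB)
find_short_files() (
    dir=$1
    min=${2:-1073741824}
    # GNU find -printf: %s = apparent size in bytes, %b = allocated 512-byte
    # blocks, %p = path.
    find "$dir" -type f -size +"$(( min - 1 ))"c -printf '%s %b %p\n' |
    while read -r size blocks path; do
        # Flag the file when less than half of its apparent size is allocated.
        if [ "$(( blocks * 512 * 2 ))" -lt "$size" ]; then
            echo "$path"
        fi
    done
)
```

For example, `find_short_files /mnt/disk10/media/Movies` would print one suspect path per line, which is easier to feed into a restore list than the du output.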

 

 

 

 


So this is another wrinkle...

 

 

These files on the old hard drives are NOT corrupt.

 

I was able to pull everything off the two old drives that unRAID had red-balled.

The drives read and write fine mounted in an Ubuntu box.
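
Since the old drives still hold good copies, one way to pin down exactly which rebuilt files are damaged is a byte-for-byte comparison of the two trees. A minimal sketch, assuming both the old drive and the rebuilt disk are mounted somewhere readable. compare_trees is a made-up name, and file names containing newlines are not handled.

```shell
# Hypothetical helper: compare every file under OLD with the file at the same
# relative path under NEW, and report anything missing or different.
#   $1 = mount point of the old (known good) drive
#   $2 = mount point of the rebuilt disk
compare_trees() (
    new=$(cd "$2" && pwd) || exit 1     # resolve to an absolute path first
    cd "$1" || exit 1
    find . -type f | while IFS= read -r rel; do
        if [ ! -f "$new/$rel" ]; then
            echo "MISSING $rel"
        elif ! cmp -s "$rel" "$new/$rel"; then
            echo "DIFFERS $rel"
        fi
    done
)
```

Every line it prints is a file that would need to be copied back from the old drive.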


So my answer was not totally incorrect, as I have been there and done that before.

This error mainly comes from improper power-downs of the server (although I also lost Windows access at the time, mainly because it was accessing those 2 HDDs).

I had that issue a while back, so I pulled them out and made a new system with the existing drives by rebuilding parity. I could not find the answer back then (it would have been best to rebuild then with the existing drives, but I didn't want to lose too much data), but I kept the 2 problem disks and built an old 3-disk array in the existing box to recover data (recovered most). I am building a new box and will relocate the old tower into it this month.

I telnet in to keep an eye on my servers, but I only turn them on when I need them.

1 tower has a spare (an ex-red-ball HD), while the other has 2 spares (plus my old tower's 3 HDDs, which are not in the array).

 

 


So my answer was not totally incorrect, as I have been there and done that before.

This error mainly comes from improper power-downs of the server (although I also lost Windows access at the time, mainly because it was accessing those 2 HDDs).

 

Thanks for your input. Sorry that happened, but it sounds like your outcome was similar to mine.

 

The failure condition and the inability to troubleshoot the root cause are what trouble me. I have asked for an official support ticket, but have not heard back on anything.

 

To recap:

 

I started with a system state of no (or virtually no) corrupt data. I know this because I can pull good data off the red balled drives.

 

I followed the correct procedure per documentation for the system.

 

The principal repair and reporting tool, "reiserfsck", found no corruption after completion of the rebuild. However, this was not true.

 

As a result of following the correct procedure, data loss and corruption were introduced into the array. The recovery procedure put the array in a far worse situation.

 

Without the "du" and "ls" mismatches, I would not have known about the corruption. This situation sets up a real possibility of a future catastrophic system failure occasioned by data loss.

 

My suggestion would be to have a script that looks for this type of corruption. With this information, at least you would know that your system is unreliable.
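
One way such a script could look, as a sketch only. Note that this uses checksums rather than the "du"/"ls" size mismatch, so it is a different technique from the one described above: record a SHA-256 for every file while the array is known to be good, then re-verify after a rebuild. It assumes GNU coreutils sha256sum. make_manifest and verify_manifest are made-up names, and the manifest should be given as an absolute path outside the directory being scanned.

```shell
# Hypothetical helpers for a checksum manifest.

# make_manifest DIR MANIFEST: record a SHA-256 for every file under DIR.
make_manifest() (
    manifest=$2
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + > "$manifest"
)

# verify_manifest DIR MANIFEST: print only the files that no longer match;
# the exit status is non-zero if any file fails or is missing.
verify_manifest() (
    manifest=$2
    cd "$1" || exit 1
    sha256sum --quiet -c "$manifest"
)
```

Unlike the size check, this also catches files whose blocks are all present but hold the wrong data. The cost is a full read of every file each time it runs.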

 

Thoughts?

 

 

