Redballs & Bad Motherboards - Rated NC-17 for graphic violence and gore

ixnu · October 28, 2013

Ugly weekend. Server running for ~3 years on stock 5.0 with no plugins. Upgraded to 5.0 about a month ago.

- Red ball on a WD 2 TB (WD1) over the weekend.

- I shut down the box to check cables etc.

- Server refused to boot. No POST or even beeps after I removed cards and memory.

- Quick run to Fry's for a new motherboard. Replaced the motherboard and it booted fine.

- Replaced the red-balled drive with a new one (WD2), and kept the bad drive (WD1) unprotected but mounted.

- The old red-balled drive (WD1) could now run a SMART report, and all was fine according to SMART.

- The array appears to rebuild the disk ~ 8AM EST

- New Red Ball appears on a different drive (WD3).

- The syslog appears to have filled up with read errors (at ~125 MB) and rotated to syslog.1 (below) at ~5 AM, but the rotated syslog is now empty.

- Rebooted, and the SMART report for WD3 looks good.

At this point I'm at a loss. I would assume that I should just replace and rebuild, but I'm starting to think that it might be my controller card.

Some concerns:

Why didn't the syslog rotate properly?

All of my files appear to be intact, but would read errors result in corruption on the rebuild?

Any ideas?

Thanks for looking!

Syslogs:

http://pastebin.com/Bs82C60S (the one that filled up on the rebuild)

http://pastebin.com/k9yHySTf (after most recent reboot)

dmesg:

http://pastebin.com/TEUNMc4X

smart report on suspect drive (WD3)

http://pastebin.com/VSDvMZtP

ixnu · October 28, 2013

This is getting worse.

I have replaced the latest red balled drive with a new one.

During the rebuild, I'm getting numerous REISERFS errors.

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 1185 0x0 SD]

Oct 28 18:02:21 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-5150 search_by_key: invalid format found in block 335323833. Fsck?

Oct 28 18:02:21 Tower kernel: REISERFS error (device md6): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [211 29459 0x0 SD]

Full error log:

http://pastebin.com/KdJSMyMq

Smart report on md6 (a new drive)

http://pastebin.com/1ePcxh0E

Help me Obi-Wan!
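With a syslog this noisy, it can help to collapse the flood into a per-device count before reading individual lines. The following is a minimal sketch, assuming the kernel messages keep the "REISERFS error (device mdN)" shape shown above; count_reiserfs_errors is a hypothetical helper name, not an unRAID tool.

```shell
# count_reiserfs_errors LOGFILE
# Prints "<device> <count>" for every md device that logged a REISERFS error,
# busiest device first. Warnings and unrelated lines are ignored.
count_reiserfs_errors() {
    sed -n 's/.*REISERFS error (device \([^)]*\)).*/\1/p' "$1" |
        sort |
        uniq -c |
        awk '{ print $2, $1 }' |
        sort -k2,2nr
}
```

Usage would be something like: count_reiserfs_errors /var/log/syslog. If errors are spread over several md devices rather than one, that points away from a single bad disk.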

redlaws · October 29, 2013

reiserfsck --rebuild-tree

needs to run on those drives

do a search on forum

ixnu · October 29, 2013

Yep. I have lost a significant amount of data and have no idea what caused it.

The motherboard went bad and two drives were marked failed, so it might be the PSU...

The strange part is that the most recent failed drive (WD3) is readable in another box, but after the rebuild, some data is missing from its replacement.

reiserfsck has found tons of errors and I'm doing a tree rebuild now.

dgaschk · October 29, 2013

reiserfsck --rebuild-tree

needs to run on those drives

do a search on forum

DO NOT do this. It is incorrect. See Check Disk FileSystems in my sig for the correct procedure.

ixnu · October 29, 2013

Any ideas how this much corruption could happen?

Ran rebuild-tree after reiserfsck returned this type of error on the first disk.

This is from the second disk, on which I'm about to run rebuild-tree. Both disks have significant lost data. I would assume this is the prudent course of action.

root@Tower:/var/log# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Tue Oct 29 16:05:41 2013
###########
Replaying journal: Trans replayed: mountid 32, transid 26785, desc 2509, len 1, commit 2511, next trans offset 2494
Trans replayed: mountid 32, transid 26786, desc 2512, len 1, commit 2514, next trans offset 2497
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 2 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit.
Checking internal tree.. \/  2 (of  15-/ 25 (of 111// 84 (of  89/block 342786049: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (342786049), whole subtree is skipped                                        / 26 (of 111-block 342786056: The level of the node (46165) is not correct, (2) expected
the problem in the internal node occured (342786056), whole subtree is skipped                                        /  3 (of  15\/ 35 (of 162// 57 (of  90-block 354374318: The level of the node (16866) is not correct, (1) expected
the problem in the internal node occured (354374318), whole subtree is skipped                                        / 59 (of 162-block 351898852: The level of the node (33689) is not correct, (2) expected
the problem in the internal node occured (351898852), whole subtree is skipped                                        /  4 (of  15\block 334481327: The level of the node (26998) is not correct, (3) expected
the problem in the internal node occured (334481327), whole subtree is skipped                                                    finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
5 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Tue Oct 29 16:09:28 2013
###########

dgaschk · October 30, 2013

Run with rebuild-tree.
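The order used in this thread (read-only --check first, --rebuild-tree only when the check output asks for it) can be sketched as a small helper that reads a saved check log and names the next step. This is only an illustration: next_reiserfsck_step is a hypothetical name, the "--rebuild-tree" and "No corruptions found" strings are taken from the output posted in this thread, and the "--fix-fixable" branch assumes reiserfsck names that option in its summary in the same way.

```shell
# next_reiserfsck_step LOGFILE
# LOGFILE is saved output of a read-only "reiserfsck --check" run.
# Prints one of: rebuild-tree, fix-fixable, clean, unknown.
next_reiserfsck_step() {
    if grep -q -- '--rebuild-tree' "$1"; then
        # The check explicitly said only a tree rebuild can fix what it found.
        echo rebuild-tree
    elif grep -q -- '--fix-fixable' "$1"; then
        # Assumed wording: the check suggested the lighter repair option.
        echo fix-fixable
    elif grep -q 'No corruptions found' "$1"; then
        echo clean
    else
        # Anything else needs a human to read the log.
        echo unknown
    fi
}
```

The point of the design is that the destructive option is never chosen by default; it is only reported when the check itself recommends it.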

ixnu · October 30, 2013

Run with rebuild-tree.

Yes, thanks.

The tree rebuild on the first drive recovered all (or >99%) of the data.

There is about 100-150 GB missing from the second drive. The rebuild-tree is about 50% done, so we will see what happens in a few hours.

I'm still confounded by how the corruption happened.

My working theory is that there was a read failure of a second drive (WD2) close to the end of the rebuild from parity of the first drive (WD1). This drive (WD2) then redballed when there was a write failure after the restore had completed.

This does not explain all of the corruption on the final disk (WD3) that was restored from parity. I'm very interested to see if reiserfsck will bring back any of the data.

Compounding all of this is the fact that the syslog filled before the end of the first restore, so I have no good diags.
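The working theory above can be illustrated with single-parity arithmetic. This is a toy sketch, assuming the array uses plain XOR parity across the data disks; the byte values and the helper name rebuild_block are made up for illustration and say nothing about what actually happened on this server.

```shell
# rebuild_block FIRST OTHER...
# XORs all arguments together. Given the parity value plus every surviving
# data value for one position, the result is the missing disk's value.
# Given all data values, the result is the parity value.
rebuild_block() {
    result=$1
    shift
    for block in "$@"; do
        result=$(( result ^ block ))
    done
    echo "$result"
}

# Example (not run automatically):
#   parity=$(rebuild_block 165 60 15)   # parity over three data disks
#   rebuild_block "$parity" 60 15       # clean reads: recovers 165
#   rebuild_block "$parity" 61 15       # one misread survivor: wrong value
```

A rebuild trusts every surviving disk, so a single bad read on another drive during the restore ends up written to the replacement as if it were good data.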

ixnu · October 30, 2013

Good news on the reiserfsck. It appears that I have recovered the majority of data. I have to finish some old hashes, but the big things are still there. At most, I lost ~5 GB of data.

I still have no idea what or how it happened, so I'm not sure whether to mark this [solved].

On to parity checks.

binhex · October 30, 2013

Hi ixnu, quick thought: I'm assuming that when you replaced your mobo you made sure the timings and voltage were correct for your RAM? Also, I don't think it would hurt to run memtest86, just to rule out data corruption due to bad memory module(s).

ixnu · October 30, 2013

That's a good idea. I have not run memtest on the new rig.

I think I'm going to get ECC RAM to replace it anyway. Should have done it from the start.

Right now I have an additional issue.

"du" does not report the correct sizes of directories with recovered files.

"df" and "ls" agree, but "du" does not see the correct sizes of many of the recovered files.

Running reiserfsck --check now.

ixnu · October 30, 2013

Reiserfsck came back fine

# reiserfsck --check /dev/md10
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md10
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed Oct 30 08:39:09 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 282304
        Internal nodes 1762
        Directories 801
        Other files 6945
        Data block pointers 285114925 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Wed Oct 30 09:23:47 2013
###########
root@Tower:/#

However, I still get this not-so-comfortable report:

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du . -h
5.3M    ./.actors
14M     .
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

This just feels bad.

Updated syslog:

https://gist.github.com/anonymous/7233604

JonathanM · October 30, 2013

I think I'm going to get ECC RAM to replace it anyway.

Be sure your CPU and motherboard can use it before you buy it. I don't think many of the consumer type boards have ECC enabled.

ixnu · October 30, 2013

Thanks.

That's one nice feature of ASUS AM3+ mb's - they all support ECC.

About to pull the trigger on a FreeBSD ZFS box and AMD with ECC is so much cheaper than Intel.

dgaschk · October 30, 2013

However, I still get this not-so-comfortable report:

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du . -h
5.3M    ./.actors
14M     .
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

This just feels bad.

Updated syslog:

https://gist.github.com/anonymous/7233604

This looks normal. ls -lah does not recursively sum directories and du . -h does recursively sum directories.

ixnu · October 30, 2013

Thanks for looking at this!

In the previous example, "ls" reports an aggregate of 8.6M and "du" gives 14M. I understand that subdirectories are the difference, but there is a 6.4G file in the directory!

"ls" reports the correct size of the individual files (but reports the agg incorrectly).

"du" does not report correct size of individual files or the directory.

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# du -h ./"The Great Dictator (1940).mkv"
7.5M    ./The Great Dictator (1940).mkv
root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)#

The example below is from a correct listing on the same drive.

"ls" gives a total of 11G and "du" gives a total of 11G - which are correct.

root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# ls -lah
total 11G
drwxrwxrwx   6 nobody users  968 2013-10-13 06:19 ./
drwxrwxrwx 172 nobody users 7.2K 2013-10-30 12:46 ../
drwxrwxrwx   2 user1  users   72 2013-10-09 04:20 .AppleDouble/
drwxrwxrwx   2 nobody users  224 2012-10-04 04:37 .actors/
-rw-rw-rw-   1 nobody users 286K 2012-10-03 23:34 Twelve\ Monkeys\ (1995)-fanart.jpg
-rw-rw-rw-   1 nobody users  106 2011-10-18 01:15 Twelve\ Monkeys\ (1995).dvdid.xml
-rw-rw-rw-   1 nobody users  33K 2011-08-29 18:50 Twelve\ Monkeys\ (1995).jpg
-rw-rw-rw-   1 nobody users  11G 2009-08-01 00:27 Twelve\ Monkeys\ (1995).mkv
-rw-rw-rw-   1 nobody users 7.4K 2012-12-26 09:35 Twelve\ Monkeys\ (1995).nfo
-rw-rw-rw-   1 nobody users  49K 2012-10-03 23:34 Twelve\ Monkeys\ (1995).tbn
-rw-rw-rw-   1 nobody users  63K 2010-12-07 15:22 backdrop.jpg
-rw-rw-rw-   1 nobody users  64K 2013-10-12 22:34 banner.jpg
-rw-rw-rw-   1 nobody users  76K 2012-08-11 19:27 clearart.png
-rw-rw-rw-   1 nobody users 538K 2012-08-11 19:27 disc.png
drwxrwxrwx   2 nobody users   96 2012-08-11 19:26 extrafanart/
drwxrwxrwx   2 nobody users   48 2012-08-11 19:26 extrathumbs/
-rw-rw-rw-   1 nobody users  71K 2012-02-21 21:11 fanart.jpg
-rw-rw-rw-   1 nobody users  35K 2011-10-18 01:17 folder.jpg
-rw-rw-rw-   1 nobody users  51K 2012-08-11 19:26 logo.png
-rw-rw-rw-   1 nobody users 143K 2012-02-21 21:11 movie.jpg
-rw-rw-rw-   1 nobody users 7.5K 2013-08-14 19:46 movie.nfo
-rw-rw-rw-   1 nobody users 143K 2012-02-21 21:11 movie.tbn
-rw-rw-rw-   1 nobody users 8.4K 2013-10-12 22:33 movie.xml
-rw-rw-rw-   1 nobody users 258K 2011-05-06 18:05 mymovies-back.jpg
-rw-rw-rw-   1 nobody users 141K 2011-05-06 18:05 mymovies-front.jpg
-rw-rw-rw-   1 nobody users  17K 2011-10-18 01:17 mymovies.xml
-rw-rw-rw-   1 nobody users 104K 2010-12-07 15:22 poster.jpg
-rw-rw-rw-   1 nobody users 194K 2013-10-12 22:34 thumb.jpg
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)# du . -h
64K     ./extrafanart
4.0K    ./.AppleDouble
100K    ./.actors
0       ./extrathumbs
11G     .
root@Tower:/mnt/disk10/media/Movies/Twelve Monkeys (1995)#

Now, all of the recovered dirs from reiserfsck have this same issue. None of my other directories have this issue.

This is troubling to me, but I'm not sure if it's a bug or a feature.
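One way to read the mismatch: the per-file column of "ls -l" shows apparent size (the length recorded in the file's metadata), while the "total" line and "du" count blocks actually allocated on disk. A file that claims 6.4G but has only 7.5M allocated is mostly holes. The sketch below compares the two numbers for a single file, assuming GNU coreutils stat; the helper names and the "less than half allocated" threshold are arbitrary choices for illustration.

```shell
# size_vs_allocated FILE
# Prints "<apparent bytes> <allocated bytes>" using GNU stat:
# %s = apparent size, %b = allocated blocks, %B = bytes per reported block.
size_vs_allocated() {
    apparent=$(stat -c %s -- "$1") || return 1
    blocks=$(stat -c %b -- "$1") || return 1
    block_size=$(stat -c %B -- "$1") || return 1
    echo "$apparent $(( blocks * block_size ))"
}

# looks_truncated APPARENT_BYTES ALLOCATED_BYTES
# Succeeds when less than half of the apparent size is allocated
# (threshold is an assumption, chosen only for this example).
looks_truncated() {
    [ "$1" -gt 0 ] && [ $(( $2 * 2 )) -lt "$1" ]
}
```

GNU du can show the same contrast directly: "du -h FILE" versus "du -h --apparent-size FILE". Legitimately sparse files (some VM images, for instance) would also trip this check, so it is a hint rather than proof of corruption.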

binhex · October 30, 2013

Do the movies with wrong sizes actually playback or are they corrupt?

ixnu · October 30, 2013

I only had shell access, so I could not test all of them.

I just got home, and some of them are corrupt. :'(

Some will play for a while and then crash. I have not done a wholesale test, but it's obvious that my array is not in a good state.

dgaschk · October 30, 2013

Does "/mnt/user/media/Movies/The Great Dictator (1940)# ls -lah" show the same?

ixnu · October 31, 2013

yes.

root@Tower:/mnt/disk10/media/Movies/The Great Dictator (1940)# ls -lah
total 8.6M
drwxrwxrwx   3 nobody users  296 2013-10-13 04:48 ./
drwxrwxrwx 177 nobody users 7.4K 2013-10-30 08:22 ../
drwxrwxrwx   2 nobody users  600 2013-08-13 03:58 .actors/
-rw-rw-rw-   1 nobody users  110 2011-10-17 23:33 The\ Great\ Dictator\ (1940).dvdid.xml
-rw-rw-rw-   1 nobody users  22K 2011-08-30 05:36 The\ Great\ Dictator\ (1940).jpg
-rw-rw-rw-   1 nobody users 6.4G 2010-05-25 05:10 The\ Great\ Dictator\ (1940).mkv
-rw-rw-rw-   1 nobody users 994K 2013-10-12 22:36 backdrop.jpg
-rw-rw-rw-   1 nobody users  85K 2011-06-02 14:20 mymovies-front.jpg

I have an ugly little script that will find my corrupt files:

find . -size +1050M | du -hs * | grep M

Basically, "find" will _find_ dirs bigger than a gig and du will then sort by MB. So, anything reported bigger than a GB by "find" but smaller than a GB by "du" will be reported.
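As written, "du" does not read the list that "find" pipes into it, and "grep M" also matches any name containing an M. A tighter variant of the same idea is sketched below, assuming GNU find (for -printf) and awk; find_short_files is a hypothetical name, and the "less than half allocated" test is an arbitrary threshold, not an established rule.

```shell
# find_short_files DIR [MIN_BYTES]
# Lists regular files under DIR whose apparent size is at least MIN_BYTES
# (default 1 GiB) but whose allocated space is under half of that size.
# GNU find -printf: %s = size in bytes, %b = 512-byte blocks used, %p = path.
# Paths containing tabs or newlines are not handled.
find_short_files() {
    dir=$1
    min_bytes=${2:-1073741824}
    find "$dir" -type f -size +"$(( min_bytes - 1 ))"c -printf '%s\t%b\t%p\n' |
        awk -F '\t' '$2 * 512 * 2 < $1 { print $3 }'
}
```

Usage would be something like: find_short_files /mnt/disk10/media/Movies. Because the comparison is done on numbers rather than on the text of the "du" output, file names no longer affect the result.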

ixnu · October 31, 2013

So this is another wrinkle...

These files on the old hard drives are NOT corrupt.

I was able to pull everything off the two old drives that unRAID had red-balled.

The drives write and read fine mounted in an Ubuntu box.

ixnu · October 31, 2013

memtest ran for 14 hours with no issues.

Starting a parity check now.

redlaws · November 1, 2013

So my answer was not totally incorrect, as I had been there and done that before.

This error mainly comes from improper power-downs of the server

(although I also lost Windows access at the time, mainly because it was accessing those 2 HDDs).

I had that issue a while back, so I pulled them out and made a new system with the existing drives by rebuilding parity.

I could not find the answer back then (it would have been best to rebuild then with the existing drives, but I didn't want to lose too much data),

but I kept the 2 problem disks and built an old 3-disk setup in the existing box to recover data (recovered most).

I am building a new box and will relocate the old tower into it this month.

I telnet in to keep an eye on my servers, but I only turn them on when I need them.

1 tower has a spare (ex-red-ball HD), while the other has 2 spares (+ my old tower's 3 HDDs, not in the array).

ixnu · November 1, 2013

So my answer was not totally incorrect, as I had been there and done that before.

This error mainly comes from improper power-downs of the server

(although I also lost Windows access at the time, mainly because it was accessing those 2 HDDs).

Thanks for your input. Sorry that happened, but it sounds like your outcome was similar to mine.

The failure condition and the inability to troubleshoot the root cause are what trouble me. I have asked for an official support ticket, but have not heard back on anything.

To recap:

I started with a system state of no (or virtually no) corrupt data. I know this because I can pull good data off the red balled drives.

I followed the correct procedure per documentation for the system.

The principal repair and reporting tool, "reiserfsck", found no corruption after completion of the rebuild; however, this was not true.

As a result of following the correct procedure, data loss and corruption were introduced into the array. The recovery procedure put the array in a far worse situation.

Without the "du" and "ls" mismatches, I would not have known about the corruption. This situation sets up a real possibility of a future catastrophic system failure occasioned by data loss.

My suggestion would be to have a script that looks for this type of corruption. With this information, at least you would know that your system is unreliable.

Thoughts?

Influencer · November 1, 2013

Does that script you're using not return a lot of false positives with ANY file that has an M in its name?
