Parity last checked on 7/25/12, finding 66,423,368 errors



...and yet, parity is valid.  Does anyone have any clue what could have caused that insanely large number of errors, without a syslog to go on?  Away from my system right now (sending this by mobile), but slightly freaked out; syslog coming within the hour...

 

Edit: 4.7 box, Raj's 20-drive beast tower

 

Edit 2: this appeared immediately following a drive rebuild - but the rebuilt drive was #14 (no errors) and the error-ridden drive was #16...

 

Edit 3: Made it home, syslog attached and rebuilding #16 now...

syslog.txt


Post-rebuild syslog attached.  Last time, the 66,423,368 errors were directly associated with the dead disk #16; this time, the parity check results still list 66,423,368 errors (post-rebuild), but with no individual disk listed as bearing them (unlike last time with the dead #16).  Is this normal after a rebuild?

syslog.zip


The second parity check reported 6 errors fixed, but 2 hours later I'm seeing this:

 

Jul 26 21:46:29 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 432978720. Fsck?
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3719 3721 0x0 SD]
Jul 26 21:46:29 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 432978720. Fsck?
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3719 3721 0x0 SD]


OK, disk 14 (the first rebuilt disk) returned some errors:

 

Reiserfs journal '/dev/md14' in blocks [18..8211]: 0 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit.
Checking internal tree.. 1 (of 21) / 34 (of 86)...

 

[this led to]

 

"The level of the node (0) is not correct, (1) expected

the problem in the internal node occurred (462061569), whole subtree is skipped"

 

x5, basically, then

 

"Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

5 found corruptions can be fixed only when running with --rebuild-tree"

 

Same thing with the second rebuilt disk (disk16), which came back with 12 of those errors - but this time, it just dumped me back to the command line after the 12th error (no "Comparing bitmaps.." message and no recommendation of --rebuild-tree, unlike disk14).

 

I'm coming back here with this based on the recommendation in the instructions - is this a recommended time to run --rebuild-tree?  I've parity checked the tower to the point where it comes back with 0 errors, but these 5 corruptions give me pause...

 

[EDIT] - hitting the other disks now, and no problems on 1-3, but #4 came back with two errors (but no "Level of the node is not correct" messages, just the "Comparing bitmaps / differs" message - and this one tells me to use the "--fix-fixable" command)
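
For anyone finding this later, the checks described in this post boil down to a couple of reiserfsck invocations - a minimal sketch, assuming the array is started so the /dev/mdX devices exist (disk numbers taken from this thread):

# read-only check - reports whether --fix-fixable or --rebuild-tree is needed
reiserfsck --check /dev/md14

# only for the minor corruptions that reiserfsck says are fixable in place
reiserfsck --fix-fixable /dev/md4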


Always with a powerdown from the webGui... I think I've figured out the cause of the disk errors based on the current error spread (they're all disks that I'd moved files onto Thursday night - right after the parity check came back good, having corrected 6 errors).  I ran another parity check overnight (after receiving those errors), and now the webGui tells me parity's checked with 0 errors found.  ReiserFS, not so much...


OK, since no one recommended I run --rebuild-tree, and all of the files I'd moved Thursday night appear to be where they're supposed to be, I used unmenu to run a non-correcting parity check... and it came back with 0 errors.

 

I figured I was good, but JUST TO BE SAFE I went back and ran "reiserfsck --check" on disk14 (one of the disks with the Thursday-moved files)... and received the EXACT same five errors (and --rebuild-tree suggestion) I saw when I ran it last week.

 

If all the files look like they're where they're supposed to be, and if a parity check is coming back clean, but "reiserfsck --check" is still reporting errors on multiple disks, is this one of those rare "trust my array" moments, or is something really wrong here?



Do not confuse parity checks with file-system checks.

 

You can have parity errors (or not) even without having any file system.

 

You have to fix the file-system errors.  If you do not, you'll eventually hit corruption that causes you to lose files (or ends up corrupting them).

 

If you were to have a disk fail now, and re-constructed a replacement, the replacement would have exactly the same file-system corruption.

(since parity is in sync with the corruption)

 

When you fix the file-system errors by using /dev/md1, /dev/md2, etc. as the devices you are rebuilding, the parity disk will be kept in sync.

 

You should erase the words "trust my array" from your vocabulary.  It is almost never the right thing to do: if you are forcing a disk back into service AFTER A WRITE TO IT FAILED, you must realize it has the wrong contents.

 

If a reiserfsck recommends a --rebuild-tree, you MUST perform the --rebuild-tree - the file-system corruption is not fixed until you do.
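
A minimal sketch of Joe's point in practice (md device number taken from this thread; array started so the parity disk is updated as reiserfsck writes):

# run the repair against the md device, NOT the raw /dev/sdX partition,
# so every corrected block also updates parity
reiserfsck --rebuild-tree /dev/md14

Anything the rebuild cannot re-attach lands in lost+found on that disk, so check there afterwards.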


Thanks, Joe - that clarifies the situation considerably!

 

I ran reiserfsck --check on each drive last night and fixed the errors on md2 and md14 via --fix-fixable and --rebuild-tree, respectively (though d14's "5 corruptions" ended up with 4203 files in lost+found and 27458 deleted!), but I ran into a roadblock with md16.

 

md16 goes into the bitmap comparison stage ("phase one" of the reiserfsck, as far as I can tell) and immediately starts throwing back errors like:

 

"the problem in the internal node occurred (439552052), whole subtree is skipped"

 

I count 12 of these (along with "Zero bit found in on-disk bitmap after the last valid bit") before the reiserfsck ends with "Segmentation fault" and bumps me back to the command line.

 

Does anyone have any suggestions on how to handle this particular drive?  Or is it unfixable by reiserfsck (or any other method) at this point?
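
For whoever digs into this, capturing the full reiserfsck output to the flash drive keeps the segfault context around after a reboot - a minimal sketch (the log filename is just an example):

reiserfsck --check /dev/md16 2>&1 | tee /boot/reiserfsck-md16.log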


I've been looking all over for situations like mine (reiserfsck --check gives a segfault error), but it seems like most segfault problems are associated with --rebuild-tree.

 

So, I've taken a hard look at the data on d16 and determined it won't be the end of the world if I have to replace it all manually - maybe 400GB or so, most of it seeming to work when I load it through SMB, though I haven't tested every single file.  (I'm planning on moving the "corrupt" files to a known-good disk and testing them one by one until I find the trouble files.)  The problem is, I don't want to do anything to the file system that would impact the integrity of the rest of my drives.

 

Does anyone have any advice on how to reach a "clean slate" on an irreparably bad disk (like d16 seems to be) without affecting parity or unraid in general?
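
Judging by the follow-up below, the route taken was to preclear d16 and let unRAID format it afresh.  Assuming Joe L.'s preclear script is on the flash drive, that step would look roughly like this (the device name is a placeholder - triple-check it, since preclearing destroys the disk's contents):

# write and verify a cleared signature across the whole disk
preclear_disk.sh /dev/sdX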


OK, I took each of those steps - everything SEEMS fine, but I just noticed these errors in the syslog immediately after formatting the precleared d16:

 

 

Aug  4 18:20:33 Tower logger: mount: wrong fs type, bad option, bad superblock on /dev/md16,
Aug  4 18:20:33 Tower logger:        missing codepage or helper program, or other error
Aug  4 18:20:33 Tower logger:        In some cases useful info is found in syslog - try
Aug  4 18:20:33 Tower logger:        dmesg | tail  or so
Aug  4 18:20:33 Tower logger:
Aug  4 18:20:33 Tower emhttp: _shcmd: shcmd (298): exit status: 32
Aug  4 18:20:33 Tower emhttp: disk16 mount error: 32
Aug  4 18:20:33 Tower emhttp: shcmd (299): rmdir /mnt/disk16

 

Does anyone have any idea what caused these (and whether they're a real issue)?



Can't tell - you did not include the entire syslog, so we can't see those lines in context.  They are exactly what would be expected if the mount was attempted before the format completed.

 

Can you see the contents of /mnt/disk16 ??

 

Joe L.
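
A quick way to answer that question from the console, if it helps (disk number per this thread):

# is md16 actually mounted, and is anything visible on it?
grep md16 /proc/mounts
ls /mnt/disk16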


Hey, everyone - I'm sorry for disappearing, but I just made it through a major disruption in my personal life... Much as it was at the back of my mind constantly, I wasn't able to sit down at either of my unRAID boxes until last night - and luckily, it looks like everything worked out fine with disk16.  I ran reiserfsck --check on every drive, and they all came back solid; the parity check came back solid; I think I'm set to go.

 

I'll post the last 5000 lines of syslog when I make it home tonight if anyone is curious about this situation, but I'm leaning towards an "if it ain't broke" attitude at this point.

 

Thanks again for all of the help!

