Parity last checked on 7/25/12, finding 66,423,368 errors



...and yet, parity is valid.  Does anyone have any clue what could have caused that insanely large number of errors, without a syslog to go on?  Away from my system right now (sending this by mobile), but slightly freaked out; syslog coming within the hour...

 

Edit: 4.7 box, Raj's 20-drive beast tower

 

Edit 2: this appeared immediately following a drive rebuild - but the rebuilt drive was #14 (no errors) and the error-ridden drive was #16...

 

Edit 3: Made it home, syslog attached and rebuilding #16 now...

syslog.txt


Post-rebuild syslog attached.  Last time, the 66,423,368 errors were directly associated with the dead disk #16; this time, the parity check results still list 66,423,368 errors (post-rebuild), but with no individual disk listed as bearing them (unlike last time with the dead #16).  Is this normal after a rebuild?

syslog.zip


The second parity check reported 6 errors fixed, but 2 hours later I'm seeing this:

 

Jul 26 21:46:29 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 432978720. Fsck?
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3719 3721 0x0 SD]
Jul 26 21:46:29 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 3
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 432978720. Fsck?
Jul 26 21:46:29 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3719 3721 0x0 SD]


OK, disk 14 (the first rebuilt disk) returned some errors:

 

Reiserfs journal '/dev/md14' in blocks [18..8211]: 0 transactions replayed
Zero bit found in on-disk bitmap after the last valid bit.
Checking internal tree.. 1 (of 21) / 34 (of 86)...

 

[this led to]

 

"The level of the node (0) is not correct, (1) expected

the problem in the internal node occurred (462061569), whole subtree is skipped"

 

x5, basically, then

 

"Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

5 found corruptions can be fixed only when running with --rebuild-tree"

 

Same thing with the second rebuilt disk (disk16), which came back with 12 of those errors - but this time, it just dumped me back to the command line after the 12th error (no "Comparing bitmaps.." message and no recommendation of --rebuild-tree, unlike disk14).

 

I'm coming back here with this based on the recommendation in the instructions - is this a recommended time to run --rebuild-tree?  I've parity checked the tower to the point where it comes back with 0 errors, but these 5 corruptions give me pause...

 

[EDIT] - hitting the other disks now, and no problems on 1-3, but #4 came back with two errors (but no "Level of the node is not correct" messages, just the "Comparing bitmaps / differs" message - and this one tells me to use the "--fix-fixable" command)
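
For anyone finding this later, the checks described in this post boil down to a couple of reiserfsck invocations - a minimal sketch, assuming the array is started so the /dev/mdX devices exist (disk numbers taken from this thread):

# read-only check - reports whether --fix-fixable or --rebuild-tree is needed
reiserfsck --check /dev/md14

# only for the minor corruptions that reiserfsck says are fixable in place
reiserfsck --fix-fixable /dev/md4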


Always with a powerdown from the webGui... I think I've figured out the cause of the disk errors based on the current error spread (they're all disks that I'd moved files onto Thursday night - right after the parity check came back good, having corrected 6 errors).  I ran another parity check overnight (after receiving those errors), and now the webGui tells me parity's checked with 0 errors found.  ReiserFS, not so much...


OK, since no one recommended I run --rebuild-tree, and all of the files I'd moved Thursday night appear to be where they're supposed to be, I used unmenu to run a non-correcting parity check... and it came back with 0 errors.

 

I figured I was good, but JUST TO BE SAFE I went back and ran "reiserfsck --check" on disk14 (one of the disks with the Thursday-moved files)... and received the EXACT same five errors (and --rebuild-tree suggestion) I saw when I ran it last week.

 

If all the files look like they're where they're supposed to be, and if a parity check is coming back clean, but "reiserfsck --check" is still reporting errors on multiple disks, is this one of those rare "trust my array" moments, or is something really wrong here?



Do not confuse parity checks with file-system checks.

 

You can have parity errors (or not) even without having any file system.

 

You have to fix the file-system errors.  If you do not, you'll eventually hit corruption that causes you to lose files (or ends up corrupting them).

 

If you were to have a disk fail now, and re-constructed a replacement, the replacement would have exactly the same file-system corruption.

(since parity is in sync with the corruption)

 

When you fix the file-system errors by using /dev/md1, /dev/md2, etc. as the devices you are rebuilding, the parity disk will be kept in sync.

 

You should erase the words "trust my array" from your vocabulary.  It is almost never the right thing to do: if you are forcing a disk back into service AFTER A WRITE TO IT FAILED, you must realize it has the wrong contents.

 

If a reiserfsck recommends a --rebuild-tree, you MUST perform the --rebuild-tree - the file-system corruption is not fixed until you do.
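
A minimal sketch of Joe's point in practice (md device number taken from this thread; array started so the parity disk is updated as reiserfsck writes):

# run the repair against the md device, NOT the raw /dev/sdX partition,
# so every corrected block also updates parity
reiserfsck --rebuild-tree /dev/md14

Anything the rebuild cannot re-attach lands in lost+found on that disk, so check there afterwards.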


Thanks, Joe - that clarifies the situation considerably!

 

I ran reiserfsck --check on each drive last night and fixed the errors on md2 and md14 via --fix-fixable and --rebuild-tree, respectively (though d14's "5 corruptions" ended up with 4203 files in lost+found and 27458 deleted!), but I ran into a roadblock with md16.

 

md16 goes into the bitmap comparison stage ("phase one" of the reiserfsck, as far as I can tell) and immediately starts throwing back errors like:

 

"the problem in the internal node occurred (439552052), whole subtree is skipped"

 

I count 12 of these (along with "Zero bit found in on-disk bitmap after the last valid bit") before the reiserfsck ends with "Segmentation fault" and bumps me back to the command line.

 

Does anyone have any suggestions on how to handle this particular drive?  Or is it unfixable by reiserfsck (or any other method) at this point?
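
For whoever digs into this, capturing the full reiserfsck output to the flash drive keeps the segfault context around after a reboot - a minimal sketch (the log filename is just an example):

reiserfsck --check /dev/md16 2>&1 | tee /boot/reiserfsck-md16.log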


I've been looking all over for situations like mine (reiserfsck --check gives a segfault error), but it seems like most segfault problems are associated with --rebuild-tree.

 

So, I've taken a hard look at the data on d16 and determined it won't be the end of the world if I have to replace it all manually - maybe 400GB or so, most of it seeming to work when I load it through SMB, though I haven't tested every single file.  (I'm planning on moving the "corrupt" files to a known-good disk and testing them one by one until I find the trouble files.)  The problem is, I don't want to do anything to the file system that would impact the integrity of the rest of my drives.

 

Does anyone have any advice on how to reach a "clean slate" on an irreparably bad disk (like d16 seems to be) without affecting parity or unraid in general?
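
Judging by the follow-up below, the route taken was to preclear d16 and let unRAID format it afresh.  Assuming Joe L.'s preclear script is on the flash drive, that step would look roughly like this (the device name is a placeholder - triple-check it, since preclearing destroys the disk's contents):

# write and verify a cleared signature across the whole disk
preclear_disk.sh /dev/sdX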


OK, I took each of those steps - everything SEEMS fine, but I just noticed these errors in the syslog immediately after formatting the precleared d16:

 

 

Aug  4 18:20:33 Tower logger: mount: wrong fs type, bad option, bad superblock on /dev/md16,
Aug  4 18:20:33 Tower logger:        missing codepage or helper program, or other error
Aug  4 18:20:33 Tower logger:        In some cases useful info is found in syslog - try
Aug  4 18:20:33 Tower logger:        dmesg | tail  or so
Aug  4 18:20:33 Tower logger:
Aug  4 18:20:33 Tower emhttp: _shcmd: shcmd (298): exit status: 32
Aug  4 18:20:33 Tower emhttp: disk16 mount error: 32
Aug  4 18:20:33 Tower emhttp: shcmd (299): rmdir /mnt/disk16

 

Does anyone have any idea what caused these (and whether they're a real issue)?



Can't tell - you did not include the entire syslog, so we can't see those lines in context.  They are exactly what would be expected if the mount was attempted before the format completed.

 

Can you see the contents of /mnt/disk16 ??

 

Joe L.
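
A quick way to answer that question from the console, if it helps (disk number per this thread):

# is md16 actually mounted, and is anything visible on it?
grep md16 /proc/mounts
ls /mnt/disk16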


Hey, everyone - I'm sorry for disappearing, but I just made it through a major disruption in my personal life... Much as it was at the back of my mind constantly, I wasn't able to sit down at either of my unRAID boxes until last night - and luckily, it looks like everything worked out fine with disk16.  I ran reiserfsck --check on every drive, and they all came back solid; the parity check came back solid; I think I'm set to go.

 

I'll post the last 5000 lines of syslog when I make it home tonight if anyone is curious about this situation, but I'm leaning towards an "if it ain't broke" attitude at this point.

 

Thanks again for all of the help!

