Bad parity after rebuild? [4.5.3]

March 21, 2010

I recently had a drive that was taken out of the array because of write failures. After running SMART it did indeed fail the write test.

RMA'd it, popped in the new drive, and rebuilt it from parity. That all worked fine. Looked at the data on the disk and it appears to be completely intact.

A few days after rebuilding the drive though, I woke up to find the server unresponsive. I couldn't access it over the network (telnet or http), but hooking up the monitor showed a kernel panic message on the screen. I hard rebooted it, and it immediately started a parity check (as expected) and replayed some transactions (possibly mover-script related?). The issue is that the parity check is finding every bit to be in error (as far as I can tell). It seems to be completely recalculating parity.

I could be wrong, but my understanding is that after a rebuild, parity should still be valid.

Any ideas why it might do this and what problems I might have?

The syslog from the restart after the kernel panic is attached. Note that for some reason unraid is suddenly reporting the date as Dec 18, 2009 which is why the syslog is labeled as such.

Thanks,

Fatal_Flaw

syslog-2009-12-18.txt

March 21, 2010

I recently had a drive that was taken out of the array because of write failures. After running SMART it did indeed fail the write test.

RMA'd it, popped in the new drive, and rebuilt it from parity. That all worked fine. Looked at the data on the disk and it appears to be completely intact.

A few days after rebuilding the drive though, I woke up to find the server unresponsive. I couldn't access it over the network (telnet or http), but hooking up the monitor showed a kernel panic message on the screen. I hard rebooted it, and it immediately started a parity check (as expected) and replayed some transactions (possibly mover-script related?). The issue is that the parity check is finding every bit to be in error (as far as I can tell).

What are you using to tell you that? A parity check after a hard crash will always perform a full parity check...

It seems to be completely recalculating parity.

Yes, quite normal if you did not stop the array before the reboot.

I could be wrong, but my understanding is that after a rebuild, parity should still be valid.

Because of the hard crash, the parity cannot be trusted, therefore it all must be checked.

Any ideas why it might do this and what problems I might have?

So far, other than the kernel crash, all looks quite normal. Too bad you did not take a picture of the screen. It would have provided clues.

The syslog from the restart after the kernel panic is attached. Note that for some reason unraid is suddenly reporting the date as Dec 18, 2009 which is why the syslog is labeled as such.

Thanks,

Fatal_Flaw

Your MB battery could be getting weak. It is as if it reset the clock on it.

March 21, 2010

Author

Thanks for your prompt reply. I understand that it will check parity after a hard reboot, but is it normal to find every bit to be in error? It's about 16% done with the check and has found about 4 million errors.

Here's a screen shot of the parity check.

http://i44.tinypic.com/snmpsl.jpg

As for the CMOS battery being dead, that seems unlikely as the mobo is brand new (less than a week old) and was keeping time through resets just fine until now. But as I can't think of anything else that would cause that, I'll have to check it out.

Thanks again,

Fatal_Flaw

March 21, 2010

Now I see what you are saying. That is very unusual. It is as if the disk cannot be read (or was not able to be written).

Have you ever done a full parity check previously?

The most likely cause of errors like yours is a memory failure. All it would take is one bad bit and all of parity would look like it was bad.

I strongly suggest you run a memtest on the RAM for several passes, or overnight.

If it passes, then you must assume some kind of hardware is failing... more tests will be in order to isolate the culprit.

(previous users have found bad hard-disks, bad disk-controllers, and bad motherboards)

Joe L.

March 21, 2010

Author

Thanks again. I have run full parity checks before the hdd failure I mentioned in the first post, but none since then. The funny thing is, I just replaced pretty much the entire machine (mobo, cpu, video card, ps, memory) right before the hdd failure. Though everything except the hdds is new, it wouldn't be the first time someone shipped a defective unit. I have already run SMART tests on all my drives and they all came back with no errors. I realize SMART isn't 100% reliable but this does give some indication.

Something else just occurred to me though.

Both before and after replacing much of the machine, sometimes on boot I would get this message repeated several dozen times while unraid was booting.

usb 1-5.1: new full speed USB device using ehci_hcd and address 4

Unraid would still start and run. Without being able to find any concrete info on it, I ignored it (yeah, not the best thing to do, I know). Could this indicate a failing usb drive (my unraid drive) and could it be related to my current problems?

Thanks,

Fatal_Flaw

March 22, 2010

If you were able to rebuild a disk - and it worked - your parity must have been accurate at that time.

If you are now running the parity check and it is finding so many errors - I would guess that the parity is getting destroyed as it goes. (Remember, every time unRAID finds a parity sync error it updates parity to make it consistent with the data).

Bad memory, an incompatible motherboard, or faking out unRAID can all cause this behavior. (Faking out unRAID means writing to a disk outside of the array). Could also possibly be related to HPA.

Will be interesting to see if a second parity check gives few if any sync errors. If so, you could rule out memory errors and an incompatible motherboard. You should also check each one of your data drives to make sure that your data appears intact on EACH disk.

Is your new motherboard a Gigabyte?
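
To illustrate the point above that a parity check rewrites parity whenever it finds a mismatch, here is a minimal sketch. It assumes single-parity, byte-wise XOR across the data disks, and it models bad RAM very crudely as a stuck bit (`flaky_mask`) that corrupts every computed parity byte; real faults are usually intermittent. All function names are hypothetical and this is not unRAID's actual code.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(missing_index, blocks, parity):
    """Reconstruct one missing block from the surviving blocks plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_parity(survivors + [parity])


def check_and_correct(blocks, parity, flaky_mask=0):
    """Count mismatching parity bytes and return (errors, rewritten_parity).

    flaky_mask is an assumed, simplified stand-in for bad RAM: it is XOR-ed
    into every parity byte the check computes.
    """
    computed = bytes(p ^ flaky_mask for p in xor_parity(blocks))
    errors = sum(1 for a, b in zip(computed, parity) if a != b)
    return errors, computed


disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# With valid parity, a failed disk can be rebuilt exactly.
rebuilt = rebuild(1, disks, parity)

# A check on healthy hardware finds nothing to fix.
errors_ok, _ = check_and_correct(disks, parity)

# A check run through faulty memory "finds" errors and overwrites good parity.
errors_bad, parity_bad = check_and_correct(disks, parity, flaky_mask=0x01)

# Once the memory is fixed, the next check finds and repairs that damage.
errors_fix, parity_fixed = check_and_correct(disks, parity_bad)
```

In this toy model the data disks are never touched, so a successful rebuild followed by a check full of sync errors is consistent with parity being damaged by the check itself, and a later check on good hardware restores it.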

March 22, 2010

Author

It's actually an MSI P43-C51 LGA 775 Intel P43. I specifically picked it because of the P43 chipset which I've heard works w/o issue with Unraid.

The bad/corrupt/messed up parity has got me a little baffled but I'll have to wait and see what a second check turns up before drawing any conclusions. Like I said, after the one hdd crapped out on me, I ran SMART scans on all of them and they all passed. Though I doubt bad memory would cause this, it certainly can't hurt to run a scan on it anyway.

Thanks,

Fatal_Flaw

March 22, 2010

Focus on the memory test first!

It is very important to rule that out.

March 22, 2010

Author

I'll run the memory test after I get off work today.

When I woke up this morning and went to check that the parity check completed, the screen again had a kernel panic message on it. This time I took a picture.

http://i40.tinypic.com/vdg2eq.jpg

Thanks,

Fatal_Flaw

March 22, 2010

The message at the end seems to indicate that one of your "reiserfs" file-systems is in need of repair.

Now, that does not explain the massive number of parity errors, and if you had bad memory it could just be a side-effect of what was mangled in memory.

Time to test the memory. Then, if it is OK, go through the steps of testing and fixing, if needed, the corruption as described here in the wiki: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

March 23, 2010

Author

Well, you guys called it. After 12 passes with memtest86+, it found a total of 3 errors (I've never had such a subtly bad DIMM before). I unfortunately don't have any other DDR3 around that I can test with to make sure it's the DIMM and not the memory controller. The RAM of course is still under warranty so I'll RMA it and see if the new RAM works as expected.

Here's a picture of the memtest errors in case they are relevant.

http://i43.tinypic.com/2guci03.jpg

Do the memory errors explain both the bad parity and the kernel panics? Why does the disk rebuild appear to have worked?

Given the memory errors, I don't trust my parity right now.

Hopefully after replacing the RAM, I can recalculate parity and everything will just work, but I'm not sure if I'm that lucky.

Thanks for the help guys. I'm sure it will take a couple weeks before I receive the new memory. I'll let you know then what the results are and maybe further troubleshooting.

Thanks,

Fatal_Flaw

March 23, 2010

Before you RMA the memory, make certain the voltage, timing, and clock speed are set in the BIOS to the correct values for your specific make and model of RAM. Most motherboards attempt to set these automatically; some get it right, some get it wrong. It is far better to set them manually (if the BIOS lets you). If the voltage or timing is wrong, you'll get the same errors you are seeing, and it will be perfectly good RAM you are replacing (and you will likely see the same type of errors with the replacement).

Yes, the bad memory can cause both the kernel panics and the parity errors. Yes, you'll want to re-check parity once you replace the RAM and/or reset the BIOS voltage/timing/clock-speed parameters to match those needed by your RAM.

Glad you found the cause.

Joe L.

April 2, 2010

Author

UPDATE

I RMA'd the memory, and although I haven't received the replacement yet, I was able to borrow some from a friend (I got impatient). I haven't run memtest on it but he's been using it w/o issue so for the time being it seems safe to assume it's good.

Popped in the RAM, booted up and all drives show up, mount, etc. I automatically started a parity check of course and again immediately started finding sync errors. This was expected given that we thought the bad RAM messed up all the parity. I let it run the check overnight and woke up to find unraid had kernel panicked again. It is the exact same message mentioned previously in this thread (http://i40.tinypic.com/vdg2eq.jpg).

I've started the parity check over again and so far it hasn't found any sync errors. I don't know how far it got through the previous check before the kernel panic but I imagine once it hits that point it will find/fix the rest of the errors.

So with the parity/memory issue possibly solved, I'm still left with the kernel panic problem. So far it's always happened overnight. I don't recall if it's ever happened when a parity check wasn't running. I also have the mover scheduled to run at 3:40am so I don't know if that's related. I don't have much experience with Linux kernel panics but I know in OS X it's sometimes a borked file system or drive.

Also, as far as I can tell, all the data drives seem to be intact and all files accounted for.

I'd certainly appreciate any more help/direction the community can offer.

Thanks,

Fatal_Flaw

April 2, 2010

Rerun memtest. Your BIOS might not have the proper values for the different memory.

April 2, 2010

Kernel panics are almost always caused by bad memory.

April 3, 2010

Author

I ran memtest overnight and after 10 hours and 16 full passes it found 0 errors. At this point I think we need to explore other possibilities for the cause of the kernel panics. I'll start by running file system checks but would appreciate input/ideas.

Thanks,

Fatal_Flaw

April 3, 2010

Author

I've started running reiserfsck as per http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems. Disk 1 came back clean but disk 2 found a lot of bad pointers in the internal tree. It also said the on-disk and correct bitmaps differ and that 5 found corruptions can only be fixed by rerunning reiserfsck with --rebuild-tree specified.

The wiki suggests that this should only be done as a last resort so I'm asking if there is anything else I should do. I've uploaded the full log here http://jump.fm/IDSAY but I'll paste the last few lines as well for easy reading:

bad_indirect_item: block 96897172: The item (11947 11948 0x1 IND (1), len 1300, location 96 entry count 0, fsck need 0, format new) has the bad pointer (323) to the block (286318879), which is in tree already
bad_indirect_item: block 96897172: The item (11947 11948 0x1 IND (1), len 1300, location 96 entry count 0, fsck need 0, format new) has the bad pointer (324) to the block (286318880), which is in tree already
/149 (of 170)bad_node: vpf-10350: The block (286318898) is used more than once in the tree.
the problem in the internal node occured (286318898), whole subtree is skipped
finished                               
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
5 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Sat Apr  3 10:28:38 2010

I'm going to continue running reiserfsck on the rest of the drives until I can get some feedback from the community.

Thanks,

Fatal_Flaw

April 3, 2010

If a prior reiserfsck says a rebuild-tree is needed, then it is needed.
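
When checking several disks in a row, the rebuild-tree verdict quoted above can be picked out of a saved log automatically. A minimal sketch, assuming the check output has been saved as text and that the summary line is worded exactly as in the log pasted in this thread (other reiserfsck versions may phrase it differently); the function name is hypothetical:

```python
import re

# Pattern taken from the summary line in the log quoted in this thread.
REBUILD_HINT = re.compile(
    r"(\d+) found corruptions can be fixed only when running with --rebuild-tree"
)


def needs_rebuild_tree(log_text):
    """Return the reported corruption count if the log asks for --rebuild-tree, else None."""
    match = REBUILD_HINT.search(log_text)
    return int(match.group(1)) if match else None


sample_log = """\
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
5 found corruptions can be fixed only when running with --rebuild-tree
"""

count = needs_rebuild_tree(sample_log)
if count is not None:
    print(f"{count} corruptions reported; --rebuild-tree is required")
```

This only reads a log; it does not run reiserfsck or modify any disk.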
