[SOLVED] Increasing parity corrections over a month


Recommended Posts

I'm yet again confounded. This time deeply so.

 

As mentioned above, I ran reiserfsck (non fix) on /dev/md7 from a vterm at the console. The vterm results did not show up in syslog, and I wanted a complete record of the blocks and errors that had rolled off the vterm.

 

I quickly wrote down some of the errors:

 

The on_disk and correct bitmaps differs....bad_indirect_item block has the bad pointer (1005) to the block (1038113254)...

27 found corruptions can be fixed when running with --fix-fixable...

 

I wanted a clean boot and a complete log of the fsck errors, so I rebooted and ran reiserfsck (non fix) on /dev/md7, but this time from unMenu. However, reiserfsck reported NO Corruptions Found even though I had not run a --fix on the disk.
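
One way to keep the full output next time, rather than letting it roll off the vterm, would be to tee it to a file as well as the screen (the log path here is just an example):

reiserfsck --check /dev/md7 2>&1 | tee /boot/reiserfsck_md7.log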

 

Has anyone seen this behavior before?

 

I am tempted to run a reiserfsck --fix anyway.

Link to comment

http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580

 

I believe VCA had a whole thread about that as well.

 

After reading the thread by VCA, I'm interested, but my symptoms don't seem exactly the same. I do have similar matching blocks, but he had only a few sync errors - mine are in the 100K's. If the disk surface were the issue, I don't think I would be able to read all of the data from the disk. With this many errors, it seems it has to be a serious problem with either the parity calculation or the disk controller, wouldn't you say?

 

The suspect drive (the one that failed the reiserfsck and then passed it) is on a SIL controller. At this point, I don't trust the SIL controller or the /dev/md7 controller.

 

The memtest has run for ~13 hours (about 25 hours in total over the last 3 days) without an error. I will let it go another 5 hours before concluding that RAM is not the problem.

 

I have ordered an Adaptec 1430SA to replace the SIL, along with a slew of replacement SATA cables.

 

What is the best procedure to avoid data corruption and find the problem? Here are two that might work:

 

1.) Install the 1430SA and remove SIL controller

2.) Move SIL-based drives to the 1430SA and replace all SATA cables

 

Scenario A:

3.) Run parity with correction

4.) Run parity without correction and if 0 errors, keep suspect drive in array

- 4b.) If errors still exist - replace drive

 

Scenario B:

3.) Add additional drive to 1430SA

4.) Copy data from the suspect drive to different drive

5.) Run parity with correction

6.) Run parity without correction - if 0 errors, keep suspect drive

- 6b.) If errors still exist - remove suspect drive

 

My goals are to correctly diagnose and solve the problem. Please let me know your thoughts and if you agree with my logic.

 

Thanks!

 

 

 

Link to comment

At one time, I tried 2 different SiL3132?-based controllers (the chipset on the cheap 2-port boards available all over) and neither of them worked for me, giving lots of parity errors on each check. But they consistently didn't work. It seemed to be a motherboard vs. controller compatibility issue, since the board worked just fine in a Windows machine. I'm now using the JMicron (or its clone) based board, the one that was (or still is??) available from eBay for $8, without any issues.

 

The tests as outlined by VCA will still narrow down which drive is throwing the errors, regardless of whether it's a controller, drive, or cable issue. Knowing where to look is better than guessing where to look.

 

Peter

 

Link to comment

Forgive me, because this is pushing up against the limits of my Linux knowledge.

 

In VCA's test, it looks like he dd'd to stdout?

 

dd if=/dev/sdf skip=1471919504 count=204597073 | md5sum -b > sdf.log

 

and then piped it through md5sum, writing each hash to a respective log.

 

I assume the blocks just scrolled by on the term and then the log was written to the USB flash?
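
(Thinking about it more, I believe the data itself goes down the pipe to md5sum rather than to the screen; only dd's records in/out summary comes out on stderr, i.e. the terminal. That summary could be captured too by redirecting stderr - the filename here is just an example:)

dd if=/dev/sdf skip=1471919504 count=204597073 2>sdf_dd.err | md5sum -b > sdf.log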

 

I have attached my hits and ranges for sync errors.

 

My procedure would look something like this for my parity drive (sda).

 

cd
umount /dev/sda
dd if=/dev/sda skip=71552 count=100000000 | md5sum -b > sda.log

 

After this, I just walk down all sdx's
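
One note on units: dd defaults to 512-byte blocks when bs isn't given, so skip=71552 starts about 36.6 MB into the disk and count=100000000 reads roughly 51.2 GB. Writing the block size out explicitly makes that clearer (same read, just spelled out):

dd if=/dev/sda bs=512 skip=71552 count=100000000 | md5sum -b > sda.log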

 

Thanks again for your help!

parity_hits.zip

Link to comment

The memtest finished up with no errors. I ran a non-correcting parity sync and had very similar results, with even more matching blocks.

 

Currently running this little script, which takes 8 hash passes per drive and seems to work. Looks like it will take all night.

 

#!/bin/bash

LOG_DIR=/var/log/hashes
cd $LOG_DIR

for i in {1..8}
do
  echo "Begin sda for the $i time."
  dd if=/dev/sda skip=71552 count=200000000 | md5sum -b >> sda.log

  echo "Begin sdb for the $i time."
  dd if=/dev/sdb skip=71552 count=200000000 | md5sum -b >> sdb.log

  echo "Begin sdc for the $i time."
  dd if=/dev/sdc skip=71552 count=200000000 | md5sum -b >> sdc.log

  echo "Begin sdd for the $i time."
  dd if=/dev/sdd skip=71552 count=200000000 | md5sum -b >> sdd.log

  echo "Begin sde for the $i time."
  dd if=/dev/sde skip=71552 count=200000000 | md5sum -b >> sde.log

  echo "Begin sdf for the $i time."
  dd if=/dev/sdf skip=71552 count=200000000 | md5sum -b >> sdf.log

  echo "Begin sdg for the $i time."
  dd if=/dev/sdg skip=71552 count=200000000 | md5sum -b >> sdg.log

  echo "Begin sdh for the $i time."
  dd if=/dev/sdh skip=71552 count=200000000 | md5sum -b >> sdh.log

  echo "Begin sdi for the $i time."
  dd if=/dev/sdi skip=71552 count=200000000 | md5sum -b >> sdi.log
done

exit
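
For anyone reusing this, the same run can be written as a nested loop so the device list only appears once (just a sketch; it assumes the same skip and count are right for every drive):

#!/bin/bash
# Compact equivalent of the script above: loop over the devices instead of repeating the dd line.
LOG_DIR=/var/log/hashes
cd $LOG_DIR

for i in {1..8}
do
  for dev in sda sdb sdc sdd sde sdf sdg sdh sdi
  do
    echo "Begin $dev for the $i time."
    dd if=/dev/$dev skip=71552 count=200000000 | md5sum -b >> $dev.log
  done
done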

Link to comment

Well... not a single hash mismatch after 5 runs on each drive.

 

I'm going to run a parity update with corrections and then run the hashes again and then check parity and then run the hashes again. Rinse wash and repeat.

 

root@Tower:/var/log/hashs# cat sda3.log
78210de009cc7c759fc377c6f235e2f2 *-
78210de009cc7c759fc377c6f235e2f2 *-
78210de009cc7c759fc377c6f235e2f2 *-
78210de009cc7c759fc377c6f235e2f2 *-
78210de009cc7c759fc377c6f235e2f2 *-
root@Tower:/var/log/hashs# cat sdc3.log
b78d6acfc92d4d8ea1dd52bdfb9519a7 *-
b78d6acfc92d4d8ea1dd52bdfb9519a7 *-
b78d6acfc92d4d8ea1dd52bdfb9519a7 *-
b78d6acfc92d4d8ea1dd52bdfb9519a7 *-
b78d6acfc92d4d8ea1dd52bdfb9519a7 *-
root@Tower:/var/log/hashs# cat sdd3.log
4d10166502530286a017f41a481d5049 *-
4d10166502530286a017f41a481d5049 *-
4d10166502530286a017f41a481d5049 *-
4d10166502530286a017f41a481d5049 *-
4d10166502530286a017f41a481d5049 *-
root@Tower:/var/log/hashs# cat sde3.log
a42eed4b9302d291d28155d0972f9715 *-
a42eed4b9302d291d28155d0972f9715 *-
a42eed4b9302d291d28155d0972f9715 *-
a42eed4b9302d291d28155d0972f9715 *-
a42eed4b9302d291d28155d0972f9715 *-
root@Tower:/var/log/hashs# cat sdf3.log
5ef6be2d38a2c8c56c72618f0ff7fc42 *-
5ef6be2d38a2c8c56c72618f0ff7fc42 *-
5ef6be2d38a2c8c56c72618f0ff7fc42 *-
5ef6be2d38a2c8c56c72618f0ff7fc42 *-
5ef6be2d38a2c8c56c72618f0ff7fc42 *-
root@Tower:/var/log/hashs# cat sdg3.log
51bcff79432395fb97ec19d697d507a3 *-
51bcff79432395fb97ec19d697d507a3 *-
51bcff79432395fb97ec19d697d507a3 *-
51bcff79432395fb97ec19d697d507a3 *-
51bcff79432395fb97ec19d697d507a3 *-
root@Tower:/var/log/hashs# cat sdh3.log
b2c00b20c5cd1e0388c2fc63314299ce *-
b2c00b20c5cd1e0388c2fc63314299ce *-
b2c00b20c5cd1e0388c2fc63314299ce *-
b2c00b20c5cd1e0388c2fc63314299ce *-
b2c00b20c5cd1e0388c2fc63314299ce *-
root@Tower:/var/log/hashs# cat sdi3.log
d3a118891cae4d64d5a22c408bb1f8bb *-
d3a118891cae4d64d5a22c408bb1f8bb *-
d3a118891cae4d64d5a22c408bb1f8bb *-
d3a118891cae4d64d5a22c408bb1f8bb *-
d3a118891cae4d64d5a22c408bb1f8bb *-

 

Link to comment

Well... not a single hash mismatch after 5 runs on each drive.

 

I'm going to run a parity update with corrections and then run the hashes again and then check parity and then run the hashes again. Rinse wash and repeat.

 

Hmmm... I'm wondering now if it has to do with multiple processes communicating through the bus.

 

Do you think it would be worthwhile to run on each drive simultaneously?

Link to comment

Interesting idea.

 

My parity sync with corrections just finished with over 5M corrections. (shudder)

 

I immediately ran a sync without corrections and it found 10K errors in the first 30 seconds.

 

I'm now running the script without unmounting the disks, but nothing is writing to them either.

 

I will run simultaneously (unmounted) next.

Link to comment

Great idea WeeboTech!!!!!!  :o :o :o

 

I separated the hashes and ran them simultaneously. As you can see below, two disk hashes are not matching. These two disks have two things in common - they are both attached to the SIL controller and they are both WD20EVDS. /dev/sdc was also the disk that reiserfsck temporarily reported as corrupt. At this point, I suspect the controller, since the hashes came back consistent when the drives were hit individually.

 

Which reminds me! I did the original reiserfsck's all simultaneously from vterms! That was when it reported corruption. When I ran reiserfsck one-by-one, there were no errors. This really makes me suspect the controller.

 

Here are the results when run simultaneously:

 

root@Tower:/var/log/hashs# cat sda_sym.log
3acbfd032f6cb940e13003d045fea680 *-
14d584bcd27e3ad69887bbe461d27fce *-
b3ef108c703c4a41f6d8a2841ae48549 *-
51a6fc9501c8d8862891abf17b255dff *-
0c05e959f3a2c9f7b4cc229542ea321e *-


root@Tower:/var/log/hashs# cat sdc_sym.log
775e9351bb8b6a1ffe11fde1396a8d41 *-
f4cd927ce33785e20f4bd97a755088c6 *-
d1980745346128bc19de66ff4d93cd10 *-
93c331b810d677403d8cbeaf70fa8ec8 *-
74d614bc5cff5cb43e286a3092b5f5e7 *-
14e40296a7ee9a51c99bcec8201a242b *-

root@Tower:/var/log/hashs# cat sdd_sym.log
8c6e629c618f3c3078cc2fe4a7976da5 *-
8c6e629c618f3c3078cc2fe4a7976da5 *-
8c6e629c618f3c3078cc2fe4a7976da5 *-
8c6e629c618f3c3078cc2fe4a7976da5 *-
8c6e629c618f3c3078cc2fe4a7976da5 *-
8c6e629c618f3c3078cc2fe4a7976da5 *-

root@Tower:/var/log/hashs# cat sde_sym.log
4bfcdfcf620bedcf40e220204d0d8234 *-
4bfcdfcf620bedcf40e220204d0d8234 *-
4bfcdfcf620bedcf40e220204d0d8234 *-
4bfcdfcf620bedcf40e220204d0d8234 *-
4bfcdfcf620bedcf40e220204d0d8234 *-
4bfcdfcf620bedcf40e220204d0d8234 *-

root@Tower:/var/log/hashs# cat sdf_sym.log
80da97777f949afc589e0631054ecb16 *-
80da97777f949afc589e0631054ecb16 *-
80da97777f949afc589e0631054ecb16 *-
80da97777f949afc589e0631054ecb16 *-
80da97777f949afc589e0631054ecb16 *-
80da97777f949afc589e0631054ecb16 *-

root@Tower:/var/log/hashs# cat sdg_sym.log
9eea37af0538ee44f7567c6a7eb02bf4 *-
9eea37af0538ee44f7567c6a7eb02bf4 *-
9eea37af0538ee44f7567c6a7eb02bf4 *-
9eea37af0538ee44f7567c6a7eb02bf4 *-
9eea37af0538ee44f7567c6a7eb02bf4 *-
9eea37af0538ee44f7567c6a7eb02bf4 *-

root@Tower:/var/log/hashs# cat sdh_sym.log
2b384338ed4877ca5580fb9d49a82e15 *-
2b384338ed4877ca5580fb9d49a82e15 *-
2b384338ed4877ca5580fb9d49a82e15 *-
2b384338ed4877ca5580fb9d49a82e15 *-
2b384338ed4877ca5580fb9d49a82e15 *-
2b384338ed4877ca5580fb9d49a82e15 *-

root@Tower:/var/log/hashs# cat sdi_sym.log
0fc1cbeff54e0a1a10e2a4017860d640 *-
0fc1cbeff54e0a1a10e2a4017860d640 *-
0fc1cbeff54e0a1a10e2a4017860d640 *-
0fc1cbeff54e0a1a10e2a4017860d640 *-
0fc1cbeff54e0a1a10e2a4017860d640 *-
0fc1cbeff54e0a1a10e2a4017860d640 *-

 

Link to comment

So at this point you cannot trust parity to reliably rebuild a failed drive.

This is because all remaining drives are required for the rebuild.

 

If it were me I would probably remove these two drives from parity protection.

Recreate parity and do the parity nocorrect check again.

 

If you are good, then at least some of your drives are protected.

 

 

It all depends if those two drives are very important or not.

 

Have you ordered a new controller?

 

If so, then you can wait for it to come in and move these drives to the new controller then retest.

 

It's based on how long you will be keeping this controller in service.

 

Chances are single drive accesses are fine with writes and reads, but if you have to rebuild, then I'm sure corruption would be introduced on the drive being recreated.

Link to comment

Just for documentation for me and others:

 

Procedure:

 

This procedure is a test to find what is causing parity changes when a parity sync keeps finding issues and traditional tools (memtest, SMART, reiserfsck) can't identify which disk(s) are having the problem.

 

First, ID which blocks are causing the issue. After a parity sync, this can be found in the syslog:

 

Something like:

Jan  4 22:15:21 Tower kernel: md: parity incorrect: 279944 (Errors)

 

Most of my errors started showing up around block 200000, so that is where I started. I had tons of errors, so I only needed a 5GB read to catch them, which is why my count is 10000000.
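
To pull all of the offending block numbers out of the syslog at once, something like this works (the awk field number depends on the exact message format, so check a sample line first; the output filename is just an example):

grep "parity incorrect" /var/log/syslog | awk '{print $9}' | sort -n > parity_blocks.txt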

 

You have to build scripts that read from around these blocks for all disk devices.

 

The respective scripts will reference each /dev/sdX (found on the unMenu main page). In my setup, for example, under "Device" /dev/sdc corresponds to "Disk" /dev/md6 (Disk 6 in unRAID under "Status").

 

So, the script for that disk is:

 

#!/bin/bash
LOG_DIR=/var/log/hashes
cd $LOG_DIR

for i in {1..5}
  do

    echo "Begin sdc for the $i time."
    dd if=/dev/sdc skip=200000 count=10000000 | md5sum -b >> sdc_sim.log
  
  done
exit

 

Build a script for each respective disk. I named mine hashpipe_sda.sh, hashpipe_sdc.sh, ...
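
An alternative to keeping nine nearly identical copies (just a sketch of the idea, not what I actually ran) is a single script that takes the device name as an argument - call it hashpipe.sh:

#!/bin/bash
# Hypothetical parameterized version: invoke as "./hashpipe.sh sdc" for each device you want to hash.
DEV=$1
LOG_DIR=/var/log/hashes
cd $LOG_DIR

for i in {1..5}
  do
    echo "Begin $DEV for the $i time."
    dd if=/dev/$DEV skip=200000 count=10000000 | md5sum -b >> ${DEV}_sim.log
  done
exit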

 

To run them simultaneously, separate the scripts on the console by '&'

 


root@Tower:/var/log/hashes# chmod 555 ./hashpipe*.sh
root@Tower:/var/log/hashes# ./hashpipe_sda.sh & ./hashpipe_sdc.sh & ./hashpipe_sdd.sh & ...

 

This will run 5 times for each disk and append to the corresponding error log (sdx_sim.log)
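
One optional refinement: putting wait at the end of that command line holds the prompt until every background hash run has finished, e.g.:

root@Tower:/var/log/hashes# ./hashpipe_sda.sh & ./hashpipe_sdc.sh & ./hashpipe_sdd.sh & wait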

 

While it is running, the output will look something like this:

 

Begin sdc for the 2 time.
10000000+0 records in
10000000+0 records out
5120000000 bytes (5.1 GB) copied, 216.273 s, 23.7 MB/s
Begin sda for the 2 time.
10000000+0 records in
10000000+0 records out
5120000000 bytes (5.1 GB) copied, 216.397 s, 23.7 MB/s
Begin sde for the 2 time.

 

Once these have completed, you can cat the results and see if they are the same. I was paranoid and unmounted the devices prior to running the scripts. However, I forgot to unmount them the last time, and all went well.
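
A quick way to check each log without eyeballing it is to count the unique hashes per file; anything other than 1 means the reads did not match (just a convenience, the cat works fine too):

for f in *_sim.log; do echo "$f: $(sort -u $f | wc -l) unique hash(es)"; done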

 

This was a really tough issue for me since the SIL controller worked fine when I did my original testing. I suspect that the SIL (Silicon Image SIL3132) ran fine with only a single drive attached, or when only a single drive was being accessed at a time.

 

The real test will come when the Adaptec controller shows up (not soon enough).

 

Thanks again to VCA and WeeboTech for pointing me in the right direction.

 

Link to comment

They are both attached to the SIL controller and they are both WD20EVDS

 

I suppose a different test would be to put two other models of drives on the Silicon Image controller.

Then you could prove/disprove whether the problem is with the drive model and controller combination, or just the controller.

 

What about moving the controller to another slot?

 

What's the model of this suspect controller?

 

Link to comment

 

If it were me I would probably remove these two drives from parity protection.

Recreate parity and do the parity nocorrect check again.

 

I'm in the process right now.

 

 

 

Have you ordered a new controller?

 

Yep. The Adaptec 1430SA

 

 

Chances are single drive accesses are fine with writes and reads, but if you have to rebuild, then I'm sure corruption would be introduced on the drive being recreated.

 

I think you are correct. I actually did a rebuild that also increased the size of a disk when the SIL had a single drive attached. This rebuilt drive was not on the SIL, so the SIL behaved normally when only one was attached.  The SIL also behaved normally when it precleared the second drive attached to it. I did an immediate parity check, and everything was fine.

 

Thanks again.

Link to comment

Looks like I'm not the first to get bitten by this scenario:

 

http://bit.ly/x2xb24

 

...Then I noticed MD5 hashes of files on the drives, taken sequentially, differing without any changes to the files' contents. The final straw was that reading from each of the two separate drives simultaneously would result in silent read errors, to the point where the file would be heavily corrupted but the OS didn't care...

Link to comment

Great work on the troubleshooting! It sure looks like that SATA card is a POS.

 

I think the card I tried was also a Syba but it didn't even work right with a single drive attached. I got an RMA card but it didn't work either.

 

This info needs to be in the Wiki for future reference. I copied your post to the Wiki at the link below and started editing it to clean it up and explain it more clearly. There is still more work to do.

 

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

 

Peter

 

Link to comment

I have used quite a few of those Monoprice SIL cards in builds for customers and I have/had 2 in my old server.

 

Maybe I was lucky but I never had a problem with any of them.

 

That's why I asked if it could be tried in another slot.

Perhaps it's a chipset incompatibility.

However, from what I've read on Newegg and here, there's more to it if others are having very similar problems.

Link to comment

 

That's why I asked if it could be tried in another slot.

Perhaps it's a chipset incompatibility.

However, from what I've read on Newegg and here, there's more to it if others are having very similar problems.

 

It is sitting in the PCI Express x1 slot of my ASRock 880GM-LE. I'm very hesitant to do too much testing since this is my production rig. I will test the SIL in a dev box and let you know the results.

 

There is no way I could have found the issue without you guys' help. Thanks again.

Link to comment
