ixnu

[SOLVED] Increasing parity corrections over a month


Year-old 4.7 install

No short SMART test errors

I rebuilt a drive about 2 months ago.

 

I have been getting an increasing number of parity sync corrections.

 

It has grown from ~1500 to 25K yesterday, and now > 36K today.

 

What action (if any) should I take at this point?

 

Thanks for your help and Happy New Year!

syslog-2012-01-02.txt


Have you tried reseating your RAM and all connections to the hard drives?  Those tend to be the cause of most parity errors.

I also had an issue with thousands of errors; when I worked out the error rate once the number stopped increasing, it came out at 1 in 4.  After tearing my build apart and reinstalling everything I still got errors.  It was only by chance that I remembered I had unlocked the 2nd core on an AMD Sempron 140.  Since my system was stable before I unlocked it, I hadn't bothered to check afterwards. Luckily there was no data loss when I locked the core again.  Have you made any hardware changes recently?


Thanks for the reply.

 

I'm starting to think it might be the bug discussed here:

http://lime-technology.com/forum/index.php?topic=13866.0

 

This is especially concerning:

 

By updating to 5.0-beta8 or 4.7.1 (which isn't out yet).

 

Barring that, yes, with a data rebuild in progress, don't write.  With a "upgrade existing disk to larger one" operation, it's potentially unavoidable.

 

The only hardware changes I have made were swapping out a 1.5 TB for a 2TB drive about a month ago.

 

What are the steps to diagnose this conditional bug?

 

If this is the cause, what are the steps for recovery?

 

Thanks again for the response!

 

 

 


I wouldn't have thought it was that bug, since you are still getting errors after a second parity check.  Are you running non-correcting checks? That would explain why the errors keep repeating.  I'm not sure what else might be the problem.  I'm relatively new to unRAID and only know about the errors I got through experience.


I'm running correcting syncs.

 

The long SMART tests come back OK (well, I assume so for the Seagate since their SMART data is a mess that requires a special decoder ring).

 

I'm going to reseat the cables and run a memtest tonight.

 

I'm still very concerned about the behavior wrt the previously mentioned bug. I have searched the forums and don't really see a solid way to diagnose it or the behavior that is displayed or the best way to solve it.

 

This is the first real problem I've had with my Pro install, and the support has been less than ideal  :'(

 


Download md5deep and generate a hash file covering all the files on the drive in question.

Then you can see over time if the files are being corrupted.

 

Or to double check things.

 

Copy some files to the drive in question using teracopy in test mode.

This copies to the array, then reads it back.

If you are good, do an md5sum on the file, then test it every now and then.

 

 

You can also do an md5deep recursively down the whole drive.

If you see parity errors on the particular drive in question, do md5deep in verify mode
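For anyone who wants to see the idea without installing md5deep, here is a minimal Python sketch of the same approach: hash every file under a directory, keep the result, and compare it with a later run. The function names (`file_md5`, `build_manifest`, `changed_files`) are made up for this illustration and are not part of md5deep or unRAID.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_md5(full)
    return manifest


def changed_files(old, new):
    """Return sorted paths present in both manifests whose digests differ."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

Build a manifest once, store it somewhere off the array, and build another later; anything `changed_files` reports that you did not knowingly modify deserves a closer look. Files you edited on purpose will show up too, so this only flags candidates.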


Thanks for the response WeeboTech!

 

I'm currently running reiserfsck on the data disks.

 

However, I'm puzzled by your last statement: "If you see parity errors on the particular drive in question, do md5deep in verify mode."

 

I am ignorant of any report or log that allows one to see parity errors for a particular drive. I assumed that parity was a function of a data array, i.e.,  f(x)=parity; where x is the entire data array.

 

Is there a procedure where one can trace a suspected parity issue to an individual data disk in the array?

 

I assume that your hash scheme would attempt to isolate data corruption, but it would not be able to verify or diagnose the mechanics of the actual parity function/issue? I could see how this could be useful, but it could potentially take some time and logging.

 

I'm completely in the dark about the severity of this issue and I'm fearful that my array is quickly degrading and data corruption is a strong possibility. At the least, I assume that I'm not protected in the event of a HDD failure. The inability to diagnose the cause and potential effects of the parity issue is my biggest frustration. 

 

I will check back in when I complete the fsck's.

 

Thanks again.

 


I am ignorant of any report or log that allows one to see parity errors for a particular drive. I assumed that parity was a function of a data array, i.e.,  f(x)=parity; where x is the entire data array.

 

Is there a procedure where one can trace a suspected parity issue to an individual data disk in the array?

 

I assume that your hash scheme would attempt to isolate data corruption, but it would not be able to verify or diagnose the mechanics of the actual parity function/issue? I could see how this could be useful, but it could potentially take some time and logging.

 

You are correct that you cannot isolate with 100% certainty where the parity errors are occurring.

For some reason I had it in my mind that when you used a particular disk, you saw a bunch of parity errors.
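To illustrate why, here is a small Python sketch, assuming plain single-disk XOR parity (the f(x)=parity idea above). The three "disks" and their byte values are invented for the example. Flipping the same bit on two different disks produces exactly the same parity mismatch, so the mismatch alone cannot say which disk is wrong.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, one block per disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks", four bytes each (made-up values).
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# Flip one bit in the first byte of the second disk...
bad_second = [disks[0], bytes([disks[1][0] ^ 0x08]) + disks[1][1:], disks[2]]
# ...or flip the same bit in the first byte of the third disk instead.
bad_third = [disks[0], disks[1], bytes([disks[2][0] ^ 0x08]) + disks[2][1:]]

# Both corruptions disagree with the stored parity in an identical way.
same_mismatch = xor_parity(bad_second) == xor_parity(bad_third)

# With a known-missing disk, XOR of parity and the survivors rebuilds it.
rebuilt_second = xor_parity([parity, disks[0], disks[2]])
```

This is also why a rebuild is only trustworthy when parity and the remaining disks are all good: the rebuilt data is just the XOR of everything else.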

 

In any case, if you have the md5sum files for your data, you can verify if your data is getting corrupt or there is some other intermediate hardware error flipping bits.

 

In one personal situation, when I saw thousands and thousands of errors, it turned out to be a data drive failure.


I have followed the guidelines set out here:

 

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

 

Memtest - No Errors after 2 passes

SMART reports - No errors after extended tests

reiserfsck - no errors after 3 disks (4 remain)

 

Assuming the remaining reiserfsck runs return no errors, what are my options?

 

I'm still completely in the dark about two questions that seem critical to me:

 

1.) How serious is my problem? Am I at risk of losing everything by attempting additional diagnosis? Should I be thinking about a complete rebuild with fresh preclears while I still have the array? Should I consider an upgrade in place?

 

2.) What are the failure conditions and behaviors of the rebuild bug (the one awaiting a fix in 4.7.1 or the beta)?

 


How long did you let Memtest run?  You say 2 passes but how long a period of time was that?

 

I suggest running it at least overnight and, if possible, for 24 hours.


How long did you let Memtest run?  You say 2 passes but how long a period of time was that?

 

I suggest running it at least overnight and, if possible, for 24 hours.

 

~13 hours of running.


More info on my set up:

 

root@Tower:/var/log# lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] RS780 Host Bridge Alternate
00:01.0 PCI bridge: ASRock Incorporation Unknown device 9602
00:04.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 0)
00:05.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 1)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:12.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:13.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 3c)
00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:14.5 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc Unknown device 9715
01:05.1 Audio device: ATI Technologies Inc Unknown device 970f
02:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
root@Tower:/var/log#

 


The SMART data on the last disk doesn't look good to me. The raw read errors and hardware ECC recovered errors look bad to me.

 

What PS do you use? This has happened with a few people using Antec supplies though not as bad as you report.


The SMART data on the last disk doesn't look good to me. The raw read errors and hardware ECC recovered errors look bad to me.

 

To my executor, Lionel Hutz, I leave $50,000.

 

OK, well how about a special SMART Seagate decoder ring:

 

http://forums.seagate.com/t5/Barracuda-XT-Barracuda-Barracuda/Seagate-s-Seek-Error-Rate-Raw-Read-Error-Rate-and-Hardware-ECC/td-p/122382

 

Best I can tell, these RAW_VALUEs are normal.


What PS do you use? This has happened with a few people using Antec supplies though not as bad as you report.

 

1 x CORSAIR Builder Series CX430 CMPSU-430CX 430W ATX12V Active PFC Power Supply


A few errors on the Samsung F4 drive.. 

 

And the strange "program fail count"...guess I'd ignore that. ???


A few errors on the Samsung F4 drive.. 

 

And the strange "program fail count"...guess I'd ignore that. ???

 

mbryanr, thanks for taking the time to look at my issues!

 

I have been reading through the forums and found this by bjp999:

 

http://lime-technology.com/forum/index.php?topic=12049.0

 

"I am not all that trustful of the manufacturers and their "normalized" values.  They are, after all, interested in sold drives staying sold and not coming back for service.  But the only parameters we KNOW to look out for are the reallocated and pending sectors.  If they start heading north, we know we have a problem."

 

===end of quote===

 

From experience (and reading about Google's data http://www.usenix.org/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf), I agree with bjp999 that SMART diags are a bit of black magic beyond the reallocations and temps. But I'm willing to listen to any help at this point.
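As a rough illustration of watching only the two attributes bjp999 singles out, here is a Python sketch that pulls raw values out of a `smartctl -A` style attribute table. The sample rows and their numbers are invented for the example (they are not from my drives), and the ten-column layout is assumed to match what smartctl normally prints.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_raw_values(smart_text):
    """Return {attribute_name: raw_value} for rows of a smartctl -A style table."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values


def worrying(values):
    """Keep only the watched attributes whose raw value is above zero."""
    return {name: values[name] for name in WATCHED if values.get(name, 0) > 0}


# Made-up sample rows in the usual smartctl -A column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       148579764
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""

flagged = worrying(parse_raw_values(SAMPLE))
```

In this sample the huge Raw_Read_Error_Rate number is ignored on purpose, and only the nonzero pending-sector count is flagged; comparing the flagged values between runs shows whether they are heading north.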

 

I'm still waiting on an answer or any mention in regards to my original questions. Are the questions invalid or irrelevant?

 

Thanks again.


The issue isn't the rebuild bug. That would not cause so many errors and repeating errors through multiple checks.

 

I would say that you will certainly lose data if a disk fails during this, even if the disk that is acting up is the one that fails.

 

At this point, you either have to start swapping parts or do some much deeper troubleshooting. Here's an interesting post detailing how to check the integrity of the drives themselves.

 

http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580

 

I believe VCA had a whole thread about that as well.

 

Peter

 


The issue isn't the rebuild bug. That would not cause so many errors and repeating errors through multiple checks.

 

Peter,

 

Thanks very much for this information and the links. I will delve deeper and report back.

 

 


I should add that reiserfsck just completed on the F4 (/dev/sdi) without finding issues.


Ping! fsck found some issues on my last drive (md7)

 

Reiserfsck has found 27 corruptions on md7 that can be fixed by running --fix-fixable.

 

I assume:

 

1- that I should run this invoking --fix-fixable?

2- it should be run on /dev/md7 and not /dev/sdx.

3- it should be run again without --fix

 

 

