[SOLVED] Increasing parity corrections over a month

January 2, 2012

Year-old 4.7 install

No short SMART test errors

I rebuilt a drive about 2 months ago.

I have been getting an increasing number of parity sync corrections.

It has grown from ~1500 to 25K yesterday, and now > 36K today.

What action (if any) should I take at this point?

Thanks for your help and Happy New Year!

syslog-2012-01-02.txt

January 2, 2012

Have you tried reseating your RAM and all the connections to your hard drives? Those tend to be the cause of most parity errors.

I also had an issue with thousands of errors. Once the count stopped increasing, I worked out the error rate at 1 in 4. After tearing my build apart and reinstalling everything, I still got errors. It was only by chance that I remembered I had unlocked the second core on an AMD Sempron 140. Since my system had been stable before the unlock, I hadn't bothered to check afterwards. Luckily there was no data loss when I locked the core again. Have you made any hardware changes recently?

January 2, 2012

Author

Thanks for the reply.

I'm starting to think it might be the bug discussed here:

http://lime-technology.com/forum/index.php?topic=13866.0

This is especially concerning:

By updating to 5.0-beta8 or 4.7.1 (which isn't out yet).

Barring that, yes, with a data rebuild in progress, don't write. With a "upgrade existing disk to larger one" operation, it's potentially unavoidable.

The only hardware changes I have made were swapping out a 1.5 TB for a 2TB drive about a month ago.

What are the steps to diagnose this conditional bug?

If this is the cause, what are the steps for recovery?

Thanks again for the response!

January 2, 2012

I wouldn't have thought it was that bug, since you are still getting errors after a second parity check. Are you running non-correcting checks? That would explain why the errors keep repeating. I'm not sure what else might be the problem. I'm relatively new to unRAID and only know about the errors I ran into myself.

January 3, 2012

Author

I'm running correcting syncs.

The long SMART tests come back OK (well, I assume so for the Seagate since their SMART data is a mess that requires a special decoder ring).

I'm going to reseat the cables and run a memory test tonight.

I'm still very concerned about the behavior of the previously mentioned bug. I have searched the forums and don't see a solid way to diagnose it, a description of the behavior it displays, or the best way to resolve it.

This is the first real problem I've had with my Pro install, and the support has been less than ideal :'(

January 3, 2012

Author

Reseated all cables.

All SMART results are OK.

Ran 2 passes of Memtest without errors.

Followed all suggestions here:

http://lime-technology.com/forum/index.php?action=printpage;topic=15041.0

Sync errors are now over 100K with only 5% complete.

What are my next steps?

ZFS? Windows?

January 3, 2012

Download md5deep and generate a hash file covering every file on the drive in question.

Then you can see over time whether the files are being corrupted.

Or, to double-check things:

Copy some files to the drive in question using TeraCopy in test mode.

This copies to the array, then reads the data back.

If that checks out, run md5sum on the file, then retest it every now and then.

You can also run md5deep recursively down the whole drive.

If you see parity errors on the particular drive in question, run md5deep in verify mode.
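As a rough illustration of the hash-then-verify idea above, here is a minimal Python sketch. It is not md5deep itself and does not reproduce md5deep's file format; the function names (file_md5, build_manifest, verify_manifest) are made up for this example. It only shows the principle: record a digest per file once, then recompute later and list anything that changed.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose digest changed."""
    root = Path(root)
    changed = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_md5(target) != expected:
            changed.append(rel_path)
    return changed
```

Run build_manifest once while the data is believed good, store the result somewhere off the array, and call verify_manifest after later parity checks. Any path it returns is a file whose contents no longer match the recorded digest.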

January 3, 2012

Author

Thanks for the response WeeboTech!

I'm currently running reiserfsck on the data disks.

However, I'm puzzled by your last statement: "If you see parity errors on the particular drive in question, do md5deep in verify mode."

I am ignorant of any report or log that allows one to see parity errors for a particular drive. I assumed that parity was a function of a data array, i.e., f(x)=parity; where x is the entire data array.

Is there a procedure where one can trace a suspected parity issue to an individual data disk in the array?

I assume that your hash scheme would attempt to isolate data corruption, but it would not be able to verify or diagnose the mechanics of the actual parity function/issue? I could see how this could be useful, but it could potentially take some time and logging.

I'm completely in the dark about the severity of this issue and I'm fearful that my array is quickly degrading and data corruption is a strong possibility. At the least, I assume that I'm not protected in the event of a HDD failure. The inability to diagnose the cause and potential effects of the parity issue is my biggest frustration.

I will check back in when I complete the fsck's.

Thanks again.

January 3, 2012

I am ignorant of any report or log that allows one to see parity errors for a particular drive. I assumed that parity was a function of a data array, i.e., f(x)=parity; where x is the entire data array.

Is there a procedure where one can trace a suspected parity issue to an individual data disk in the array?

I assume that your hash scheme would attempt to isolate data corruption, but it would not be able to verify or diagnose the mechanics of the actual parity function/issue? I could see how this could be useful, but it could potentially take some time and logging.

You are correct that you cannot isolate with 100% certainty where the parity errors are coming from.

For some reason I had it in my mind that when you used a particular disk, you saw a bunch of parity errors.

In any case, if you have the md5sum files for your data, you can verify whether your data is getting corrupted or whether some other intermediate hardware error is flipping bits.

In one personal situation, when I saw thousands and thousands of errors, it turned out to be a data drive failure.
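To make the point about parity being a function of the whole array concrete, here is a small Python sketch. It assumes single parity computed as a byte-wise XOR across the data disks, and uses three made-up four-byte "disks". The helper names are hypothetical. A flipped bit on either of two different disks produces exactly the same mismatch, which is why a parity check alone cannot name the disk at fault.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length blocks byte by byte to get one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mismatched_offsets(blocks, parity):
    """Return byte offsets where stored parity disagrees with the data."""
    expected = xor_parity(blocks)
    return [i for i, (p, e) in enumerate(zip(parity, expected)) if p != e]


# Three toy "data disks" of four bytes each (made-up values).
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = xor_parity(disks)

# Flip the same bit at offset 1, once on disk 0 and once on disk 2.
bad_disk0 = [bytes([1, 2 ^ 0x10, 3, 4]), disks[1], disks[2]]
bad_disk2 = [disks[0], disks[1], bytes([9, 10 ^ 0x10, 11, 12])]

# Both corruptions show up as the same symptom: a mismatch at offset 1.
print(mismatched_offsets(bad_disk0, parity))
print(mismatched_offsets(bad_disk2, parity))

# Rebuilding disk 0 from parity plus the other disks only gives the right
# answer if parity and the other disks are themselves correct.
rebuilt_disk0 = xor_parity([parity, disks[1], disks[2]])
```

This is also why per-file checksums are a useful complement: they are tied to individual files on individual disks, so they can localize corruption that parity can only detect.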

January 3, 2012

Call me curious...post the smart test results.

January 3, 2012

Author

I have followed the guidelines set out here:

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

Memtest - No Errors after 2 passes

SMART reports - No errors after extended tests

reiserfsck - no errors after 3 disks (4 remain)

Assuming the remaining reiserfsck runs return no errors, what are my options?

I'm still completely in the dark about two questions that seem critical to me:

1.) How serious is my problem? Am I at risk of losing everything by attempting additional diagnosis? Should I be thinking about a complete rebuild with fresh preclears while I still have the array? Should I consider an upgrade in place?

2.) What are the failure conditions and behaviors of the rebuild bug (the one awaiting a fix in 4.7.1 or the beta)?

January 3, 2012

How long did you let Memtest run? You say 2 passes but how long a period of time was that?

I suggest running it at least overnight and, if possible, for 24 hours.

January 3, 2012

Author

Call me curious...post the smart test results.

No problem. Here they are except for the one that is currently running reiserfsck.

Thanks for taking a look!

Unraid_smart_data_3_Jan_2012.txt

January 3, 2012

Author

How long did you let Memtest run? You say 2 passes but how long a period of time was that?

I suggest running it at least overnight and, if possible, for 24 hours.

~13 hours of running.

January 3, 2012

Author

More info on my set up:

root@Tower:/var/log# lspci

00:00.0 Host bridge: Advanced Micro Devices [AMD] RS780 Host Bridge Alternate

00:01.0 PCI bridge: ASRock Incorporation Unknown device 9602

00:04.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 0)

00:05.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 1)

00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]

00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller

00:12.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller

00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller

00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller

00:13.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller

00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller

00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 3c)

00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller

00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)

00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller

00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge

00:14.5 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller

00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration

00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map

00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller

00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control

00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control

01:05.0 VGA compatible controller: ATI Technologies Inc Unknown device 9715

01:05.1 Audio device: ATI Technologies Inc Unknown device 970f

02:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

root@Tower:/var/log#

January 3, 2012

The SMART data on the last disk doesn't look good to me. The raw read errors and hardware ECC recovered errors look bad.

What PS do you use? This has happened with a few people using Antec supplies though not as bad as you report.

January 3, 2012

Author

The SMART data on the last disk doesn't look good to me. The raw read errors and hardware ECC recovered errors look bad.

To my executor, Lionel Hutz, I leave $50,000.

OK, well how about a special SMART Seagate decoder ring:

http://forums.seagate.com/t5/Barracuda-XT-Barracuda-Barracuda/Seagate-s-Seek-Error-Rate-Raw-Read-Error-Rate-and-Hardware-ECC/td-p/122382

Best I can tell, these RAW_VALUEs are normal.
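For anyone wondering what such a "decoder ring" might look like: one commonly repeated community interpretation, which is an assumption here and not something confirmed by Seagate documentation, is that these RAW_VALUEs are 48-bit numbers where the upper 16 bits count errors and the lower 32 bits count operations. Under that assumption, a large raw number with zero in the upper bits would just be a running operation count. A sketch, with a hypothetical helper name:

```python
def split_seagate_raw(raw_value):
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    ASSUMPTION: the upper 16 bits are an error count and the lower 32 bits
    an operation count. This follows community reports, not a vendor spec.
    """
    if not 0 <= raw_value < 1 << 48:
        raise ValueError("expected a 48-bit unsigned raw value")
    return raw_value >> 32, raw_value & 0xFFFFFFFF


# Made-up raw values, not taken from the attached SMART report.
print(split_seagate_raw(123456789))         # upper part is 0
print(split_seagate_raw((3 << 32) + 1000))  # upper part is 3
```

If the interpretation holds, only a non-zero first element would suggest actual errors. Treat the result as a hint, not a diagnosis.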

January 3, 2012

Author

What PS do you use? This has happened with a few people using Antec supplies though not as bad as you report.

1 x CORSAIR Builder Series CX430 CMPSU-430CX 430W ATX12V Active PFC Power Supply

January 3, 2012

A few errors on the Samsung F4 drive..

And the strange "program fail count"...guess I'd ignore that.

January 3, 2012

Author

A few errors on the Samsung F4 drive..

And the strange "program fail count"...guess I'd ignore that.

mbryanr, thanks for taking the time to look at my issues!

I have been reading through the forums and found this by bjp999:

http://lime-technology.com/forum/index.php?topic=12049.0

"I am not all that trustful of the manufacturers and their "normalized" values. They are, after all, interested in sold drives staying sold and not coming back for service. But the only parameters we KNOW to look out for are the reallocated and pending sectors. If they start heading north, we know we have a problem."

===end of quote===

From experience (and reading about Google's data http://www.usenix.org/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf), I agree with bjp999 that SMART diags are a bit of black magic beyond the reallocations and temps. But I'm willing to listen to any help at this point.
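Since reallocated and pending sectors are the two values singled out above, here is a minimal Python sketch that pulls just those raw values out of the attribute table printed by smartctl -A and flags any that have gone up between two snapshots. The sample text is made up for illustration (it is not from the attached report), and the function names are hypothetical. It assumes the usual ten-column attribute layout with RAW_VALUE in the last column.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Made-up sample in the usual smartctl attribute-table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def watched_raw_values(report_text, watched=WATCHED):
    """Return {attribute name: raw value} for the watched attributes."""
    values = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            values[fields[1]] = int(fields[9])
    return values


def attributes_heading_north(earlier, later):
    """Return watched attributes whose raw value rose between snapshots."""
    return sorted(
        name for name, value in later.items() if value > earlier.get(name, 0)
    )


current = watched_raw_values(SAMPLE)
baseline = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0}
print(attributes_heading_north(baseline, current))
```

Saving one snapshot per drive after each parity check would give a simple trend to look at, without relying on the vendor's normalized values.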

I'm still waiting on an answer to, or any mention of, my original questions. Are the questions invalid or irrelevant?

Thanks again.

January 3, 2012

The issue isn't the rebuild bug. That would not cause so many errors and repeating errors through multiple checks.

I would say that you will certainly lose data if a disk fails during this, even if the disk that is acting up is the one that fails.

At this point, you either have to start swapping parts or do some much deeper troubleshooting. Here's an interesting post detailing how to check the integrity of the drives themselves.

http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580

I believe VCA had a whole thread about that as well.

Peter

January 3, 2012

Author

The issue isn't the rebuild bug. That would not cause so many errors and repeating errors through multiple checks.

Peter,

Thanks very much for this information and the links. I will delve deeper and report back.

January 3, 2012

Author

I should add that reiserfsck just completed on the F4 (/dev/sdi) without finding issues.

January 3, 2012

Author

Ping! fsck found some issues on my last drive (md7)

Reiserfsck has found 27 corruptions on md7 that can be fixed by running --fix-fixable.

I assume:

1- that I should run this invoking --fix-fixable?

2- it should be run on /dev/md7 and not /dev/sdx.

3- it should be run again without --fix

January 3, 2012

See here: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems
