Lots of Parity Sync Errors after back to back Checks

November 2, 2009

Hi,

I had been humming along nicely with Unraid for a while now, until the other day when my monthly parity check kicked in.

To my surprise it showed over 900+ sync errors. First time I had a single one. The one thing that had changed recently is that I added a SIL PCI-e x2 SATA card and 2 hard drives. When I installed those two HDs I ran the pre-clear script and found no hardware problems with the disks. I have copied files to one of the disks and have not had issues reading them.

I quickly scanned the SMART statistics on all of my disks and didn't notice anything alarming (but after reading the board, it seems I need to run the long test and then review the smartctl report).

So I want to review what I should do:

1. I'm currently running back-to-back parity checks. The first one showed 1400+ errors, and it looks like the second one is reporting several as well so far.

2. I know people suggest running Memtest86+, which I will do, but I believe I ran it when I first built the box and had no issues.

3. Check smartctl reports and syslog (I didn't see anything of significance in it that I could tell, but I will post the logs).

4. Check cabling

5. Try unassigning the two most recent drives, restore and check parity

6. Run reiserfsck on data disks.

Anything else? Any thoughts as to what the likely culprit is? My first thought is a bad/loose SATA cable (can power cables be an issue as well, if I have a molex->2x SATA splitter?). Can the PCI-e card itself be bad? I just don't get what could have changed in the last month. The two new hard drives were put through their paces with the pre-clear script, and like I said, I haven't noticed any issues reading data from them.

Thanks in advance.

November 2, 2009

Parity sync errors are never good, and errors in back-to-back parity checks should never happen as I'm sure you know.

Problem is, it could be anything at all, from a disk drive not returning consistent results, to RAM not working properly, to a marginal power supply, or defective cabling to the drives.

I'd go on guessing, but since the only way to know what is happening is to look at the system log I'll suggest you start there. It would at least point out any obvious problems... The "others" you'll need to work through one at a time, eliminating hardware as possibilities.

If you are accomplished at interpreting a syslog file, look there. If you are not, then the next thing you should do, before rebooting, before making ANY changes, is to attach a copy to your next post.

Oh yes, to keep us from guessing, what is your hardware config? It might give somebody a clue. We know early nForce chipsets have issues with parity calcs... the only solution with those motherboards was to replace the motherboard.

Other than describing you have parity errors on subsequent runs, all we can know is something is wrong... It may mean a data disk cannot be read consistently, or a parity disk, or the MB/RAM/CPU, or the cabling, or a power supply issue.

Oh yes, when you get a chance, do a memory test... can't hurt to get it out of the way.

Joe L.
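For anyone wondering why a parity check can say that something is wrong but not which disk is wrong, here is a toy sketch. It assumes single XOR parity across the data disks and counts mismatches byte by byte, which is a simplification for illustration and not how unRAID itself reports sync errors. The function names and sample bytes are made up.

```python
from functools import reduce
from operator import xor


def compute_parity(data_blocks):
    """XOR equal-length blocks byte by byte to get the parity block."""
    return bytes(reduce(xor, column) for column in zip(*data_blocks))


def count_sync_errors(data_blocks, parity_block):
    """Count byte positions where the stored parity disagrees with the data."""
    expected = compute_parity(data_blocks)
    return sum(1 for e, p in zip(expected, parity_block) if e != p)


# Toy "disks": three data blocks plus the parity computed from them.
disks = [b"\x0f\xf0\xaa", b"\x01\x02\x03", b"\xff\x00\x55"]
parity = compute_parity(disks)
print(count_sync_errors(disks, parity))  # 0: parity matches the data

# Pretend the second disk returns one wrong byte on this read.
bad_read = [disks[0], b"\x01\x06\x03", disks[2]]
print(count_sync_errors(bad_read, parity))  # 1: a mismatch, but no culprit named
```

The mismatch count is the same whether the bad byte came from a data disk, the parity disk, or anything in between (RAM, controller, cabling), which is why the hardware has to be eliminated one piece at a time.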

November 2, 2009

Author

Thanks Joe.

I'm running:

Unraid 4.4.2

AMD A64 X2 BE2300 1.9G AM2 65N

1gb of CRUCIAL CT2KIT12864AA667

GIGABYTE GA-MA74GM-S2 740G RT

Antec Earthwatts 430W PSU

5x Seagate 1.5TBs (latest firmware)

2x Samsung 1.5TB 5900rpm models

I'm attaching the syslog (via unMENU), but the 2nd parity check is still running.

EDIT: I wanted to point out all those time related errors have always been happening for me with the same ~1.8s adjustment. No idea why or what causes that but it seems harmless enough I think. I have run many parity checks prior and had no issues.

November 2, 2009

The good news is I don't see anything glaringly wrong in the syslog (other than your system clock not keeping good time)

The bad news is that it narrows the problem down to "something in your server" :-[

If you already see parity errors, you can stop the second parity check. It will serve no other purpose at this time; you have marginal hardware somewhere. Now, as you said, time to isolate. Very first thing... check your BIOS settings for clock speed, voltage, and timing for your specific memory strips. If they are wrong (and many motherboards set them wrong), all bets are off. Next, run an extended memory test... It won't be the first time a RAM strip develops errors. Next, do some md5 checksums of the same file on different physical disks to try to determine what hardware is involved in the errors you are getting.

I'm a bit confused by this statement:

"To my surprise it showed over 900+ sync errors. First time I had a single one."

When you say "first time I had a single error", was that the only other time you performed a parity check, or the last time, or the previous monthly check?

You should never see any errors unless the array has an unexpected power failure while writing to the disks. Then, after power is restored, a few parity errors would be expected if the parity drive was being written to when the power went out. Other than that, you should never see any.

Joe L.
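The "md5 of the same file on different physical disks" step above could be scripted roughly like this. It is only a sketch: the `/mnt/diskN` paths and the file name are hypothetical placeholders, and it assumes the same test file has already been copied to each disk.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte files do not need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Map each checksum to the list of copies that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(str(path))
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical locations of the same test file on three disks.
    copies = [Path(f"/mnt/disk{n}/test/sample.bin") for n in (1, 2, 3)]
    existing = [p for p in copies if p.exists()]
    for checksum, members in group_by_checksum(existing).items():
        print(checksum, members)
```

If every copy lands in one group, the reads agreed this time. A copy that ends up in a group of its own points at that disk, or at the cable, port, or controller it hangs off.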

November 2, 2009

Author

Thanks again Joe. I will stop the parity check now, and check memory and go grab some other cables soon as well.

Sorry for my wording, I meant I had never seen a sync error before and I run monthly checks and have done so for 6+ months.

November 2, 2009

Author

Ran an md5 checksum on each disk with a small file and none showed any issues. Do I need to try copying files over to the disks multiple times (i.e. will it likely show every time if there is a bad disk)?

All motherboard BIOS settings for memory were on auto (and always had been), so I'm running memtest now for a little while.

I do have a drive (one of those lovely Seagates) that tends to make clicking sounds sometimes. It very well might be the parity drive, but I haven't confirmed this. It has done this for a long time, but I wonder if this might be the cause if it isn't a memory or cabling/power issue. It looks OK in the smartctl results, but I will run a long test as well and post it today.

November 2, 2009

English is a terrible language. (and my native language :-[ )

I interpreted "First time I had a single one. " as:

the first time I had an error, I had one error.

instead of

this is the first time I've ever had an error.

Good that the monthly check detected this, and you can work to get it resolved without having to deal with a disk failure at the same time.

Good luck with the tests... it sounds like you know what you are doing.

November 2, 2009

Author

Thanks again Joe, it's great to have someone help out with suggestions.

November 2, 2009

I hate to be the bearer of real bad news. I tried a Syba (I'm about 95% sure that was the make) SiL 3132 card in my ECS 740G board and had the same type of results. This was a cheap card and the card BIOS could not be re-flashed. I'm not sure if that made a difference, but I suspect the use of a read-only ROM chip on the card had an effect somehow, or there was just something fundamentally wrong with the card. I did try 2, so it wasn't a single defective card.

The data being read from the hard drive on this card would read corrupted, and each time I read the file it was corrupted in a different way. I'm a little fuzzy on what I tested, but I'm quite sure I tried a write too, and the written file was corrupted when read back. I never tried writing the file and then moving the drive to a motherboard port to see if I could read it back correctly, so I cannot confirm that the writes were OK but the reads were bad. I had no syslog errors or other indicators that there was a problem. At the time, I was running unRAID 4.4.2. I re-tested the memory and re-checked BIOS settings and didn't find anything. The hard drive could be connected back to the motherboard without changing anything else and the problems went away.

So, unfortunately, I'd be very afraid for all the data you've written to these new drives....

Next time, I'm trying a better make of card, such as the 4-port Rosewill RC-218 card, or at least a 3132 card that is flashable to the non-RAID firmware and has been reported as working here by a few users.

Peter
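The symptom described above (the same file reading back differently each time) can be checked for with a repeated-read test along these lines. This is a rough sketch with made-up function names. One caveat that matters: the OS may serve repeated reads from its RAM cache, so the test file should be considerably larger than the installed memory for the later passes to actually hit the disk.

```python
import hashlib
import os


def chunk_digests(path, chunk_size=4 * 1024 * 1024):
    """Return one MD5 hex digest per chunk of the file, in order."""
    digests = []
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).hexdigest())
    return digests


def unstable_chunks(path, passes=3, chunk_size=4 * 1024 * 1024):
    """Read the file several times and list chunk indices that did not match the first pass."""
    baseline = chunk_digests(path, chunk_size)
    unstable = set()
    for _ in range(passes - 1):
        current = chunk_digests(path, chunk_size)
        for index, (first, again) in enumerate(zip(baseline, current)):
            if first != again:
                unstable.add(index)
    return sorted(unstable)


if __name__ == "__main__":
    # Hypothetical large test file on a drive attached to the add-in card.
    target = "/mnt/disk6/test/big.bin"
    if os.path.exists(target):
        print(unstable_chunks(target, passes=5))
```

An empty list means every pass agreed with the first one. Any indices in the result mean the drive or its data path returned different bytes for the same region on different reads, which is not something a one-off md5 of a small file would necessarily catch.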

November 2, 2009

Author

Thanks Peter for the heads up. I bought the 2-port SiL card from Monoprice, and it was flashable (I did flash it to the non-RAID BIOS). Do you think that makes a difference vs. the one you mentioned? I just now tried a couple of md5 checksums and didn't see any discrepancy on any of the drives (including the two connected to the SiL card).

November 2, 2009

Are you seeing correct MD5 checksums when you write a file and then verify the written copy against the original on another PC? It's sounding better and better, since you seem to be able to confirm the data is being written and read correctly. Make sure you try it with some very large files too, multiple gigabytes in size.

Good to hear it sounds like your hardware is writing and reading the data correctly, now just to find why the parity is broken.

Peter

November 2, 2009

Author

Yes, I generated an md5 checksum for a 500 MB file on another PC, copied the file to one of the drives connected to the PCI-e card, and checked it there; the checksums were the same. I will try with another larger file to confirm on both drives tonight, as well as try switching out some cables.

The memtest check passed fine as well. Currently running a long SMART test on the parity drive.

November 2, 2009

Author

The md5 checksums on both disks connected to the SiL 3132 card were fine for a 4 GB file. So I'm running the long test on the parity drive.

Can I rule out the SiL 3132 and the two HDs connected to it if the md5 checksums were fine?

November 2, 2009

Author

Crap, I was trying to do a long test on the parity drive but kept getting "aborted by host" messages (maybe because of spin down).

So I turned off spin down and rebooted, and all but one drive was showing up as unformatted?!!??

Please help if you can. I stopped the array and rebooted again, and I'm just waiting for it to come up now.

EDIT: Upon reboot now all show correctly how much free space etc again. Weird.

November 3, 2009

Just for completeness, here is the related FAQ entry, but I think you have already heard or read the help there.

"Why am I getting repeated parity errors?"

November 3, 2009

Author

Thanks Rob, I definitely have consulted the guide.

I'm finishing up a long SMART test on the parity drive and will try to run one on the other disks as well.

Assuming the SMART tests are OK for the two drives on the PCI-e card, and considering the md5 checksums seem OK, can I rule out those drives?

Could it be the PSU?? When you run a parity check all drives are spun up, and I did add 2 new drives recently. It's an Antec EarthWatts 430 W. Maybe I'll swap it with a 500 W EarthWatts I have.

November 3, 2009

It could be as simple as that. Your supply apparently has a pair of 17 amp 12 volt rails, but one is used for the CPU, the other for the motherboard, cards, and disks. 7 disks pulling 2 amps each (average guesstimate) when spinning up are going to be mighty close to hitting the supply limit.

Joe L.
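The back-of-the-envelope numbers above work out as follows. This is just the arithmetic from the post (a 17 A rail, 7 disks, a guessed 2 A per disk at spin-up); the `other_load_amps` parameter is an added placeholder for whatever the motherboard and cards draw from the same rail, which the thread does not quantify.

```python
def rail_headroom(rail_limit_amps, disk_count, amps_per_disk, other_load_amps=0.0):
    """Estimate the 12 V rail load while all disks spin up at once."""
    disk_load = disk_count * amps_per_disk
    total = disk_load + other_load_amps
    return {
        "disk_load_amps": disk_load,
        "total_load_amps": total,
        "headroom_amps": rail_limit_amps - total,
        "utilisation": total / rail_limit_amps,
    }


# Figures quoted in the post: 17 A rail, 7 disks, roughly 2 A each at spin-up.
estimate = rail_headroom(rail_limit_amps=17.0, disk_count=7, amps_per_disk=2.0)
print(estimate["disk_load_amps"])         # 14.0 A for the disks alone
print(estimate["headroom_amps"])          # 3.0 A left for everything else on that rail
print(round(estimate["utilisation"], 2))  # 0.82 of the rail limit
```

With only about 3 A of headroom before the motherboard and cards are counted, a parity check that spins up every drive at once is the moment the rail is most likely to sag.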

November 3, 2009

My own thought on this repeated parity errors problem is that the first 3 suspects are memory, memory, and memory. After that, I suppose the top suspects would be power, heat, and a bad motherboard or controller. Hard drives and cables seem very unlikely to me for this one. If you have another PSU, then that is easy enough to test. But I would first run a long and thorough memory test.

I would also check for heat problems, such as a fan that has stopped spinning, or a heat sink that has been jarred loose. A failed fan or loose heat sink on RAM or chipset is worse than no heat sink or fan, since it is blocking or limiting natural air flow to the hot element.

November 3, 2009

In at least one case recently, it was a disk drive that otherwise tested good, but would return semi-random values at times.

See here: http://lime-technology.com/forum/index.php?topic=3642.msg38655#msg38655

It also caused random parity errors, and reads (and md5 checks) of big files would return random bits in the results.

It was the least likely suspect, but it can't be dismissed as a possibility.

November 3, 2009

Author

Just completed my long SMART tests and will attach them here. I didn't notice anything glaringly bad but would love it if anyone could take a look.

sdf is the parity drive and sdb and sdc are the two newer drives on the PCI-e card.

November 3, 2009

Author

And here are the SMART reports for the rest of the drives.

I'm going to swap out a couple of cables quick and run a parity check and will try to do more later tonight or tomorrow.

November 4, 2009

Author

Well, swapping out the power and SATA cables for the newest HDs I had installed didn't do anything, so I'm creating parity with just 1 disk and will test and keep adding disks from there.

November 4, 2009

The SMART reports all look good to me.

November 4, 2009

Author

Thanks Rob for confirming.

I'm thinking it's either the PCI-e controller card, PSU, or memory (I did run Memtest, but maybe not the extended test... it ran for about 1 hour, IIRC).

November 4, 2009

Author

Parity check with one of the original disks connected to the MB was fine.

Now creating parity with all the HDs connected to the MB.

Could the PCI-e card somehow be the cause, even though I had no issues with md5 checksums (I'll test this out more) or with reading/writing to the disks connected to it?
