[SOLVED] - Increasing Sync Errors on Monthly Parity check



Last month I had 660 sync errors; I'm at 1780 so far, 1/3 of the way through this parity check.

I have a UDMA CRC error count of 1 that intermittently shows up on the cache drive. Could this be causing it? I need to change the cache drive.

I have clock unsynchronized errors in my log; I assume that means I need to change the CMOS battery.

What are the consequences of these sync errors? Does it mean that the data was bad and has been fixed, or that the data on those sectors is bad after the parity check?

Can I interrupt the monthly parity check, change the cache drive, troubleshoot etc., and then restart the parity check? There are approx. 20 hours to go on it.

Update: I should specify that I have 2 servers. This is Pipe; the other problem I posted was on Tower, a different server.

 

Can someone have a look at my logs and advise me how to proceed?  Thanks in advance!

pipe-diagnostics-20190101-1036.zip

Edited by FrozenGamer
clarification and additional question.
Link to comment

I had a quick look at what I thought were your array drives.  They all looked fine.

 

What are you using as a SATA expansion card?  That would be my next suspect.  It might be bad, or it might be one of the boards with Marvell chipsets that seem to have problems when used on Unraid systems.  A parity check can really stress expansion cards, since all of the drives are running full bore and the time required to complete the check is quite long with 8TB drives.  Cheap cards can also have an undersized heat sink...

 

By the way, the cache drive problem has nothing to do with the parity sync errors.

Link to comment

Is there a way to tell which drives have the sync errors (to determine whether they are all on one drive)? I see log entries that have P corrections, Q corrections and PQ corrections, but they list only sectors, with no other indicator.

Example:
Jan 1 00:05:12 PIPE kernel: md: recovery thread: PQ corrected, sector=42826928
Jan 1 00:05:19 PIPE kernel: md: recovery thread: PQ corrected, sector=43760832
Jan 1 00:05:27 PIPE kernel: md: recovery thread: PQ corrected, sector=44727040
Jan 1 00:05:31 PIPE kernel: md: recovery thread: Q corrected, sector=45261232
Jan 1 00:05:31 PIPE kernel: md: recovery thread: Q corrected, sector=45262752
Jan 1 00:05:51 PIPE kernel: md: recovery thread: P corrected, sector=47864688
Jan 1 00:05:51 PIPE kernel: md: recovery thread: PQ corrected, sector=47885344
Jan 1 00:05:59 PIPE kernel: md: recovery thread: PQ corrected, sector=48868
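
For anyone wanting a quick tally of entries like these, here is a minimal sketch in Python. It only counts how many P, Q and PQ corrections appear and collects the sector numbers; it does not identify which drive is at fault. The function name `summarize_corrections` is made up for illustration, and the pattern assumes the syslog lines look exactly like the ones quoted above.

```python
import re
from collections import Counter

# Matches lines shaped like the "md: recovery thread" entries quoted above.
# "PQ" is listed first so it is tried before the shorter alternatives.
LINE_RE = re.compile(r"md: recovery thread: (PQ|P|Q) corrected, sector=(\d+)")

def summarize_corrections(log_text):
    """Return (counts per correction kind, list of (kind, sector) tuples)."""
    hits = [(m.group(1), int(m.group(2))) for m in LINE_RE.finditer(log_text)]
    counts = Counter(kind for kind, _ in hits)
    return counts, hits

# Three of the lines from the post above, used as sample input.
sample = """\
Jan 1 00:05:12 PIPE kernel: md: recovery thread: PQ corrected, sector=42826928
Jan 1 00:05:31 PIPE kernel: md: recovery thread: Q corrected, sector=45261232
Jan 1 00:05:51 PIPE kernel: md: recovery thread: P corrected, sector=47864688
"""

counts, hits = summarize_corrections(sample)
print(dict(counts))
print(hits[0])
```

Feeding it the full syslog text from a diagnostics zip would show whether the corrections cluster in one sector range or are spread across the whole check.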

Link to comment
12 minutes ago, FrozenGamer said:

Is there a way to tell which drives have the sync errors

No.

 

9 hours ago, FrozenGamer said:

Last month I had 660 sync errors; I'm at 1780 so far, 1/3 of the way through this parity check.

You really should have been trying to get to the bottom of this before now. The only successful parity check is one that has exactly zero errors. If you lose a data disk you won't be able to reliably rebuild it if parity isn't completely correct.

 

Have you done a memtest recently?

Link to comment

I will run memtest when the parity check completes in about 10 hours. I can't say for certain that I have run one on this box; I did one in April, prior to installing UPSs on both of my Unraid servers.  Before that I had a lot more problems with my servers. Installing UPSs helped a lot.  We have quite a lot of outages here, 10 to 20 a year.  At that point (and you had helped me in that thread) I had about 300 errors on each server after a few unclean shutdowns due to power outages/brownouts.  The box in question has been running for about 60 days without a reboot.  There have been a few brownouts/outages during that time frame, but not long enough to tell Unraid/the UPS to shut down the machine (less than 30 seconds).

 

"you won't be able to reliably rebuild it if parity isn't completely correct."

Does that mean it will rebuild unreliable data? I am currently at 3542 sync errors with about 33% left to go on the parity check.  I assume that means that I have some questionable data even once parity checks get back to no errors?

 

If I see any sync errors in the future, should I just do another parity check, even if it's only 1 or 2?

Link to comment

Every sync error means that the corresponding sector on a rebuilt disk is likely to be corrupt.   If those sectors correspond to ones that contain file system control information they can affect access even to sectors that do not correspond to the sync errors.   With the number of errors you quote I would think it very likely that any rebuilt disk could end up being unusable.

Link to comment
7 hours ago, FrozenGamer said:

I am currently at 3542 sync errors with about 33% left to go on the parity check.  I assume that means that I have some questionable data even once parity checks get back to no errors?

Assuming you haven't actually attempted to rebuild a data disk while you had incorrect parity then your data should be fine. Each of your disks is independent, and bad parity doesn't have any effect on the data on those disks.

 

But, when you need to rebuild a data disk, parity must be correct, and every bit of parity plus every bit of ALL other disks must be reliably read in order to reliably rebuild a disk.

 

This is why we are always concerned with the health of EVERY disk in the array, regardless of how important its data is. Every disk is required for rebuilding any disk.
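
To make that concrete, here is a minimal sketch of a rebuild using plain XOR parity. It is an illustration only: the three "disks" and their 4-byte "sectors" are made-up values, and it models a single XOR parity, not the second (Q) parity, which is calculated differently.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, one 4-byte "sector" each.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Parity is the XOR of the same sector on every data disk.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 needs parity plus EVERY other data disk.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == disk2)      # True: parity was correct

# Flip one bit in parity (a sync error) and rebuild again.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
rebuilt_bad = xor_blocks([bad_parity, disk1, disk3])
print(rebuilt_bad == disk2)  # False: the rebuilt sector is corrupt
```

The data disks themselves are untouched by the bad parity in this sketch; the damage only appears in the rebuilt copy, which matches the point above.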

 

7 hours ago, FrozenGamer said:

If I see any sync errors in the future, should I just do another parity check, even if it's only 1 or 2?

Yes.

 

The only correct answer is zero, and you need to get it to zero somehow. If another correcting parity check still doesn't fix it then you need to get to the bottom of the problem.

Link to comment

I started memtest, and it was running fine for a while, but then I decided to just swap out the box completely.  I'm still using the same 16-bay expansion as before, connected to the box through the same LSI 9207-8e and cable.  No other parts are the same.  The old box was an XPS 8700 with an Intel 4770 CPU and 8GB of RAM; the new one has the same specs with 16GB of different RAM.  I'm running another parity check now, which should be done in about 27 hours.

Link to comment

OK, I had 29 on that parity check and was down to zero on this last one.  I also ran another parity check on my other server, which had 3 errors on the last check, and it came back zero. How much would ECC memory help?  And if not that, what would be your top recommendation to improve the chances of fewer or no sync errors in the future?

Link to comment

Most folks very, very seldom have a sync error.  These seem to occur mostly after unclean shutdowns, usually caused by power outages on servers without UPSs or with bad (old) batteries in those UPSs.  The second cause is actually bad memory.  Any memory test should last at least 24 hours.  The more memory, the longer the test should be.  (You are going to get more test passes on each byte of memory in a 4GB server than in a 64GB server in a fixed time period!)

 

ECC memory is used by some folks, but the motherboard has to support it and most consumer motherboards don't.  Server-type motherboards usually do.  However, in my opinion, ECC memory will do more to protect against 'bitrot' than against sync errors.

 

By the way, I seem to recall that some folks have had issues with Marvell SATA chipsets producing random errors.  The usual solution for these folks was to replace those controllers with ones based on LSI chipsets.

Link to comment
  • 5 years later...

This theory of getting parity to 0 is all wrong.

You don't know whether the error is on the parity drives themselves or in what's on the array. If you keep getting parity errors and know your hardware is good, you can just rebuild your parity and hope that solves the issue.

 

Edited by CSIG1001
Link to comment
  • 2 weeks later...
On 4/5/2024 at 10:30 AM, CSIG1001 said:

This theory of getting parity to 0 is all wrong.

You don't know whether the error is on the parity drives themselves or in what's on the array. If you keep getting parity errors and know your hardware is good, you can just rebuild your parity and hope that solves the issue.

 

I think the real answer is to look at how there is a mismatch between the parity and the data.  Something happened, or they would match.

 

If you run successive parity checks WITHOUT PARITY CORRECTIONS BEING WRITTEN and get identical results, even with sync errors, then yes, I agree it looks like there is not currently a hardware issue. If the results are NOT consistent, then there probably is a hardware issue.
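
A rough way to do that comparison is to pull the reported sector numbers out of the syslog for each run and diff the two sets. The Python below is a sketch under assumptions: `error_sectors` and `compare_runs` are names invented for illustration, and the pattern assumes each reported error is an "md: recovery thread" line containing a `sector=<number>` field, as in the log lines quoted earlier in this thread. The sample text reuses those quoted lines as stand-ins for two runs.

```python
import re

# Relies only on the "md: recovery thread" prefix and the sector= field.
SECTOR_RE = re.compile(r"md: recovery thread: .*sector=(\d+)")

def error_sectors(log_text):
    """Collect the set of sector numbers reported in one parity check log."""
    return {int(m.group(1)) for m in SECTOR_RE.finditer(log_text)}

def compare_runs(first_log, second_log):
    """Compare the sectors reported by two successive parity checks."""
    first, second = error_sectors(first_log), error_sectors(second_log)
    return {
        "identical": first == second,
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }

# Stand-in logs for two runs (lines borrowed from earlier in the thread).
run_a = """\
Jan 1 00:05:12 PIPE kernel: md: recovery thread: PQ corrected, sector=42826928
Jan 1 00:05:19 PIPE kernel: md: recovery thread: PQ corrected, sector=43760832
"""
run_b = """\
Jan 1 00:05:12 PIPE kernel: md: recovery thread: PQ corrected, sector=42826928
Jan 1 00:05:27 PIPE kernel: md: recovery thread: PQ corrected, sector=44727040
"""

result = compare_runs(run_a, run_b)
print(result)
```

If "identical" comes back True across runs, the mismatch is stable on disk; sectors that appear in only one run point toward something flaky in the read path.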

 

Unless you have past logs to look at to determine where the error occurred, you really do not know whether it is a data drive or a parity drive that is in error.  However, there are tools people have used in the past to identify which files may be affected in such situations.  The data files can then be verified as correct or not, depending on how good your backup strategy is.  This way you can verify that your data drives are correct, then rebuild your parity drive(s).

 

Hardware issues and power bumps are the two main causes of bad data being written to either the data drives or the parity drives.  Another often-overlooked cause is timing and voltage settings on motherboards.  Some newer motherboards have default settings chosen with GAMERS in mind, favoring performance over reliability.  Many current ASUS motherboards are one example.  Pushing the timing and voltage settings for better gaming performance is the opposite of what we should be seeking on a data server.  We want stable, reliable, and repeatable results.

 

Regardless, after data is written to the array, and data and parity writes are complete, any and all parity checks afterwards should have NO sync errors.  If there are errors, something is wrong, no matter how much things seem to be ok.

 

For critical data, I even use PAR files to add protection and recovery options for sets of data files.  This allows me to verify the data files and to recover from damaged and even MISSING data files.  I then store all of them, the data files and the PAR files, on the data drives on Unraid.  There are many programs that work with PAR and PAR2 files.  The concept is similar to how the 2 parity drives work in Unraid, but at the file level instead of the drive level.  QuickPar is one such utility, though I have not used that one myself.

Link to comment
