Are a handful of sync errors normal?



First half of the story is here.

 

Precleared drives (no errors), started from scratch, moved ~2.7 TB of data from WHS to unraid.  Ran parity check just to be sure all is well before deleting data from WHS.  Came up with 12 sync errors.  I don't want to delete anything from the old server until I know it's safe on the new server.  Should I be concerned that I keep getting a few sync errors?  Is it just normal?

 

 

EDIT: Relevant section of syslog:

 

Feb 26 23:48:26 Tower kernel: md: recovery thread woken up ...
Feb 26 23:48:26 Tower kernel: md: recovery thread checking parity...
Feb 26 23:48:26 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Feb 27 00:13:52 Tower kernel: md: parity incorrect: 247669256
Feb 27 00:19:30 Tower kernel: md: parity incorrect: 302163048
Feb 27 00:27:30 Tower kernel: md: parity incorrect: 378208880
Feb 27 01:19:57 Tower kernel: md: parity incorrect: 853558232
Feb 27 01:51:18 Tower kernel: md: parity incorrect: 1109535408
Feb 27 02:00:59 Tower kernel: md: parity incorrect: 1185806184
Feb 27 02:30:43 Tower kernel: md: parity incorrect: 1405480344
Feb 27 02:31:18 Tower kernel: md: parity incorrect: 1409544312
Feb 27 02:36:35 Tower kernel: md: parity incorrect: 1446585832
Feb 27 03:12:19 Tower kernel: md: parity incorrect: 1680282472
Feb 27 04:16:51 Tower kernel: mdcmd (50): spindown 2
Feb 27 04:40:01 Tower logrotate: ALERT - exited abnormally.
Feb 27 04:53:57 Tower kernel: md: parity incorrect: 2527535128
Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: renewing lease of 192.168.1.123
Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: acknowledged 192.168.1.123 from 192.168.1.1
Feb 27 05:03:43 Tower dhcpcd[1046]: eth0: leased 192.168.1.123 for 86400 seconds
Feb 27 06:27:39 Tower kernel: md: parity incorrect: 3405040336
Feb 27 07:38:12 Tower kernel: md: sync done. time=28185sec
Feb 27 07:38:12 Tower kernel: md: recovery thread sync completion status: 0

Link to comment

Should I be concerned that I keep getting a few sync errors?  Is it just normal?

Unless you've powered down without first stopping the array, so that the disks could not be written before the power was lost, NO parity errors are normal.

 

I've had my server for over 5 years and, other than an unexpected powerdown before I had a UPS, have never seen a parity error.

 

 

Link to comment

I did that before I started.  I only ran through two passes, but no errors were found.  Rebooted server and am running memtest again.

 

I need to get this sorted out ASAP.  The next time my WHS shuts down, it will likely not come back up.  Is there any way to determine which disk is getting errors?  You'd think the log would specify where the errors were found, but it doesn't.

 

Link to comment

Is there any way to determine which disk is getting errors?  You'd think the log would specify where the errors were found, but it doesn't.

 

It is telling you the sector, relative to the start of the partition, where the parity error was detected.  The unRAID software has absolutely no way to know which bit(s) are wrong - just that an exclusive-or of all the bits of some word in the sector did not end up with an even number of bits set to "1".

 

It could be a bit set to zero when it should be a one, or a bit set to one when it should be zero, and it could be on any of your disks, or it could be flipped in RAM if memory is flaky, or in the disk controller chipset, or on the motherboard itself.

 

It is VERY hard to diagnose.  It is one reason why old Nforce4 chipsets are to be avoided.  They exhibited random parity errors that caused some to lose their hair.

 

Many times it is a single disk, and repeated md5sums of the same data on that disk give different results.  The fix: replace the disk.  These disks often show no errors otherwise.

 

Joe L.

Link to comment

Is there any way to determine which disk is getting errors?  You'd think the log would specify where the errors were found, but it doesn't.

 

You can run SMART reports on each drive and look for errors there.  You can also double-check the cabling to your drives.  Read on, however - I suspect your issues are not drive or cable related.

 

unRAID calculates parity by adding the corresponding bits of your data disks.  If the result is odd, the parity bit is set to one.  If the result is even, the parity bit is set to zero.  In this way, the corresponding bits added together across all the disks (parity included) always come out even.  So if unRAID finds that the sum is odd during a parity check, you have a sync error.  Very simple.  When unRAID finds this, it has no idea which disk changed, but what it will do is update parity to make the bits add up to even again.  Some people call this correcting the parity error - but "correct" may not be the best term.  It is a very syntactic "correction".
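To make that concrete, here is a minimal sketch (in Python, for illustration only - this is not unRAID's actual implementation) of even parity computed with XOR, which is exactly the "keep the bit-sums even" rule described above:

# Minimal illustration of even parity across data disks (not unRAID's code).
def compute_parity(blocks):
    # XOR corresponding bytes of each disk's block; afterwards every bit
    # position, summed across the data disks plus parity, is even.
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

data = [b"\x0f", b"\x35", b"\x5a"]   # one byte from each of three data disks
parity = compute_parity(data)        # 0x0f ^ 0x35 ^ 0x5a = 0x60

# A parity check XORs everything, parity disk included; any non-zero result
# is a sync error - but nothing in the math says WHICH disk is wrong.
assert compute_parity(data + [parity]) == b"\x00"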

 

Now let's say your array is perfect, parity is perfect, and you run a parity check.  The parity check will have zero sync errors.  But let's throw in a bad memory chip - say one that results in a flipped bit 1 in a trillion times.  So your parity check is running along perfectly, and the memory error occurs in the memory buffer holding disk contents.  unRAID trusts that the memory contains an exact copy of the data from the disk.  So unRAID will see a parity mismatch and update parity to be consistent with the bad read.  With this happening 1 in a trillion times and 10 2T drives in the array, you might expect to see 40 or so sync errors.  Now, after the parity check, parity is NOT perfect.  In 40 places on the disk, parity has been mistakenly updated because of the memory error.

 

Now let's say you buy new memory, run it through a weeklong memtest, and KNOW the memory is perfect.  You now run a parity check.  What will be the result?  40 sync errors.  What it did was to CORRECT the 40 mistakes it made the last time.

 

But what if you didn't replace the bad memory and ran it again?  The chances of the memory error affecting EXACTLY the same disk locations are staggeringly low.  So what will happen is: in about 40 new spots memory will get corrupted and unRAID will adjust parity (incorrectly), but in the spots that were memory-corrupted on the last parity check, unRAID will correct parity.  Net result, ~80 sync errors.  You are still left with ~40 bad parity situations on your disk.

 

It is possible that your system memory is fine and one of the disks, disk controllers, or the chipset is corrupting the information before it gets to your computer's memory.  Drives have protections to prevent this, but it is still possible.  It is far easier to rule out the memory first, though.  Bad memory is fairly common - especially if you are running 2 sticks.

 

One thing you can look at is the parity mismatch locations in your syslog.  If you are seeing the same sector marked bad and then marked good on the next parity check, it usually means you have a memory error.
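If you want to compare runs, a few lines of Python along these lines (a hypothetical helper, assuming the stock "md: parity incorrect: N" syslog lines shown earlier in the thread; the filenames are placeholders) will pull the locations out of two saved syslogs and diff them:

# Hypothetical helper: extract "md: parity incorrect: N" locations from two
# saved syslogs and show which locations repeat and which are new.
import re

def error_locations(syslog_path):
    pat = re.compile(r"md: parity incorrect: (\d+)")
    with open(syslog_path) as f:
        return {int(m.group(1)) for line in f if (m := pat.search(line))}

first = error_locations("syslog_check1.txt")    # placeholder filenames
second = error_locations("syslog_check2.txt")

print("in both checks:", sorted(first & second))
print("only in first: ", sorted(first - second))
print("only in second:", sorted(second - first))

Locations that repeat across runs point at stale parity from some earlier event; locations that wander between runs point at flaky hardware.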

 

And remember, when you finally fix the problem, you will still have one more parity check with sync errors - that is the one that truly corrects parity.  After that, sync errors should be zero.  Many users who find and replace a bad memory chip are frustrated when the next parity check still reports sync errors and conclude the replacement didn't fix anything.  They don't realize that this run really did correct parity, and the parity check after it will be perfect.

 

Hope this helps.  Good luck!

Link to comment


In this post:

 

http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580

 

I reported my experiences with resolving an ongoing parity error issue.  You might be experiencing something similar, but you need to do further tests.  First run some more parity checks in the "nocorrect" mode.  If these show exactly the same errors (to the block), then what you probably have is an issue where data was being written to the drives when the machine got shut down unexpectedly.  This will leave the parity drive out of sync with the data.  Doing a parity check (in the normal check-and-update mode) will bring the parity disk back up to date with the contents of the drives, and further parity checks should show no more errors.

 

If the errors go away by themselves (for a few parity checks) or change locations between checks - which was what was happening to me - then you have a harder-to-diagnose problem.  It might be as simple as a faulty cable/connector, or it might be flaky memory, the motherboard chipset, or a drive.  In my case it turned out to be one of my 8 drives, and the SMART tests (both long and short) gave no indication that the drive was acting up.

 

Anyway, don't panic.  Run one test at a time and keep a written log of what you test along the way and the relevant results (like SMART test logs and the reported locations of the parity errors), and you should sort it out in time.

 

Regards,

 

Stephen

 

Link to comment

Been running memtest since it was suggested.  On the 10th iteration and no errors yet.

 

I only have three drives in the system at the moment, so narrowing it down shouldn't be too complicated.  I guess I run memtest for a few more hours, and if there are no errors, I move to long SMART tests for each drive.  If that shows nothing, then I just run non-correcting parity sync checks until some sort of pattern emerges and replace the drive that keeps having problems.  If the problems are still random and affecting different disks, then I should assume the memory is the problem and replace it even though it tested fine.  Do I have that right?

Link to comment

If the problems are still random and affecting different disks, then I should assume the memory is the problem and replace it even though it tested fine.  Do I have that right?

 

Yes.  If you have two memory sticks, you should be able to run the system on one at a time to see if that improves things.  Most of the past recommendations are to run the memtests for at least a night, though in the two cases I have encountered personally where a memory stick went bad, memtest was able to detect the issue in one or two passes (usually one of the 8 tests it runs will fail).  So, if you've already done 10 full passes (say about 5 hours or so of testing), I would say your memory (and probably also the CPU) are fine.

 

Regards,

 

Stephen

 

Link to comment

I let it run 12 full passes worth of memtest, and no errors found.  

 

It's currently 36% through a "do not fix" sync, and it has already found 7 errors.  These errors are in the exact same spots as the 1st 7 errors of the last test.  Now, the last sync I ran was supposed to "fix" any errors in the parity as it found them.  If I'm understanding BJP999, then it actually screwed up the parity in those 12 places the 1st time, and this scan is finding those screwups.  It's still not telling me why it screwed up in the first place, though.

 

If this scan finds the exact same 12 errors as the first scan, then I guess I just run it again (without fixing) to be sure.  If that scan finds the exact same 12 errors again, then I run yet another scan (with fixing) to fix the original "fixes".  Once that's done, I should have a good parity.  The only problem is that I still won't know why I had errors in the first place.  What a pain.

Link to comment

If that scan finds the exact same 12 errors again, then I run yet another scan (with fixing) to fix the original "fixes".  Once that's done, I should have a good parity.

 

Yup, sounds correct.  The most likely cause of this sort of suddenly appearing parity error is an unclean machine shutdown - such as a power issue or someone just hitting ctrl-alt-del to get a Windows login prompt.  At this point it would be useful if there were a utility that could take the list of parity error locations and output a list of the files (on all your disks) that occupy those locations; then you would have a short list of things to check for possible corruption.  Other than that, unless you have a backup that you can compare to, you are probably out of luck.

 

Stephen
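A rough sketch of what such a utility might look like, assuming the syslog locations are 1 KiB blocks from the start of the partition (the "total of 1953514552 blocks" for a 2 TB drive suggests that unit) and a filesystem where filefrag can report physical extents - treat it as a starting point to verify, not a finished tool:

# Rough sketch only: walk a data disk and flag files whose extents cover a
# parity-error location.  Assumes syslog locations are 1 KiB blocks relative
# to the partition start, and that "filefrag" (from e2fsprogs) works on the
# filesystem in use - verify both before trusting any output.
import os, re, subprocess

ERROR_BYTES = {loc * 1024 for loc in (247669256, 302163048)}  # from the syslog

EXTENT = re.compile(r"^\s*\d+:\s+\d+\.\.\s*\d+:\s+(\d+)\.\.\s*(\d+):")

def extents(path):
    # Physical extents in bytes, parsed from "filefrag -v -b1024" output.
    out = subprocess.run(["filefrag", "-v", "-b1024", path],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if m := EXTENT.match(line):
            yield int(m.group(1)) * 1024, (int(m.group(2)) + 1) * 1024

for root, _, files in os.walk("/mnt/disk1"):
    for name in files:
        path = os.path.join(root, name)
        if any(start <= b < end
               for start, end in extents(path) for b in ERROR_BYTES):
            print("parity-error location falls inside:", path)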

 

Link to comment

The most likely cause of this sort of suddenly appearing parity error is an unclean machine shutdown

...

Other than that, unless you have a backup that you can compare to you are probably out of luck.

Neither machine has been improperly shut down since all this started.  Both servers are running on a UPS, so I know there haven't been any power issues (checked the log on my desktop).  I have shut the unraid box down a couple times, but that was always a proper shutdown through the web UI, and never while it was copying or syncing.

 

Actually, I still do have all the original files on the old WHS.  I've been too nervous to delete them until I figure out what the cause of the problem is on the new unraid server.

 

Ran a long smart scan on all three disks, and all came up clean.  Ran a non-repair scan twice, both times coming up with the same 11 errors.  All 11 errors coincide with errors that the first repair scan "fixed".  The original repair scan actually found 12 errors, so I guess the one error that hasn't popped up in subsequent scans was legitimate.

 

This stinks.  As far as I can tell, nothing is broken enough to be obvious.  I think my next step will be to switch around some of the SATA cables, just to see if that makes a difference.

 

EDIT:  Rather than just swap cables, I swapped the trays into another cage that is running through the SASLP-MV8, re-assigned the drives to the correct slots and started yet another non-repair sync.  This should rule out the MB SATA ports, the cables, and the backplane.  If this comes up with the same 11 errors, I'm just going to fix them, move on with migration, and hope for the best.  There's not much else I can do.

Link to comment

What disk drives are you running?  2T Samsungs have a bug called "silent corruption" that may be able to cause the symptoms you are seeing.

 

What motherboard are you running?  Some motherboards don't work well.  That doesn't always mean they won't run at all, but it can mean that they don't run reliably.  You may have one of those.

 

Suggest you post full system specs and a full syslog.

Link to comment

I was very careful to go with well-tested hardware.  The motherboard is a Biostar A760G M2+, CPU is a sempron 145, the SATA card is a SASLP-MV8, and the drives are two WD20EARS and one WD10EACS.  There are several more drives (mix of WD and Seagate) in my WHS box that I will be migrating over, but for now, it's just the 3 WD greens.  RAM is a pair of 1GB A-DATA DDR2 800 sticks and the boot drive is a 2GB tuff-n-tiny.  Running unraid pro 5.0b4 and the default drive format is set to 4k aligned.

 

Here's the full syslog since the last reboot earlier this evening when I moved the drives to a different cage: http://pastebin.com/EJZ8XUzQ

 

Currently 43% through the non-repair sync and it's found 8 errors that match up with the previous syncs.

Link to comment

Sync complete, and it's the same 11 errors as the last several passes.  I'm just going to go ahead and repair them and move on.

 

How many read-only parity checks did you run and get the same results?

 

Based on my read of the thread, I think you got 12 the first time, then ran it again and got 11, and then a third time and got 11.  Is that right?

Link to comment

Sync complete, and it's the same 11 errors as the last several passes.  I'm just going to go ahead and repair them and move on.

 

If you are any good at programming, you could simply use an md5 hash generator to calculate the hash of all files on the old and the new server, compare the two sets, and see if there are any files where the hash doesn't match. It's pretty easy to do in C# for example. Or write a small utility to compare the entire file content. Maybe you could even use one of the duplicate file finder tools and simply delete everything on the WHS server where there is an exact duplicate on the unraid server?

 

I'm currently working on a program to do that, but it will likely take another month or two until it's actually usable by anyone but me ;)
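Until then, a minimal sketch of the same idea (in Python rather than C#; the mount points are placeholders for wherever the two servers' shares are visible):

# Minimal sketch: md5 every file under two roots and report mismatches.
# Mount points are placeholders; this reads every byte, so expect it to
# take hours on multi-TB shares.
import hashlib, os

def md5_tree(root):
    hashes = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            hashes[os.path.relpath(path, root)] = h.hexdigest()
    return hashes

old = md5_tree("/mnt/whs_share")      # placeholder mount points
new = md5_tree("/mnt/unraid_share")
for rel in sorted(old.keys() & new.keys()):
    if old[rel] != new[rel]:
        print("MISMATCH:", rel)
for rel in sorted(old.keys() - new.keys()):
    print("missing on unraid:", rel)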

Link to comment

How many read-only parity checks did you run and get the same results?

The first sync was in repair mode.  It found and "fixed" 12 errors.  I have done three non-repair passes since then, and all three have found the same 11 errors.  These coincide with 11 of the 12 from the first pass.  Now I am running a repair-sync to fix the 11 errors that were broken by the very first sync.

 

That mysterious 12th error hasn't popped up in any of the subsequent passes, so I think it was legitimately fixed by the first repair-sync.

Link to comment

The repair run fixed the same 11 errors that the last several scans found.  With no more options for determining if I even have a problem (let alone determining the cause), I deleted the original data on the WHS that I had copied to the unraid server.  WHS is now shuffling the remaining data around to remove a couple drives so I can continue with the migration.

 

For better or worse, there's no going back now.

Link to comment

Oh dear god what just happened?

 

Last night I pulled two WD20EADS drives from my WHS box, stuck them in the unraid server, and started preclearing them.

 

I wake up to discover that the unraid server is completely non-responsive.  The web UI is down, and there's nothing on the local monitor.  Crap.  With no other options, I hit the reset button.

 

It booted, but somehow the server took it upon itself to assign one of the new WD20EADS drives to the slot formerly occupied by the WD10EACS.  The server is now stopped, the start button is grayed out, and it claims it is "upgrading disk".

 

First of all, what's up with the total system crash?  Even if linux shits the bed, there should still be something on the local monitor.  Second of all, why is unraid upgrading itself without my consent?  Third of all, why is it doing it with a drive that was, at best, only half precleared?  And fourth, where the hell is my WD10EACS?

 

Here's the log: http://pastebin.com/Wr0L810t

Link to comment

You are running 5.0b4, and it had issues with the way disks were recognized.  This was discussed ad nauseam in the announcements thread, and it is the main reason 5.0b5, etc., changed the way disks are discovered.

 

If I had to guess at what happened: your computer restarted, disks were discovered in a different order, unRAID found the "new" disk in the old one's spot, and thought you were trying to upgrade it.  Since it is an upgrade, preclear does not really matter for that drive - unRAID is copying over the data and everything else.  Using preclear for an upgrade drive does nothing more than torture-test the drive.

 

If this is not a test array, please consider running 4.7 until the 5.0 betas have been tested by those who have test arrays or are experienced with Linux/unRAID.

Link to comment
