Sync errors on parity check - what does that mean?



I've been running 4.7 Final for a long time now, and each month my parity check completes with zero errors.  For some reason, this month (today) I've got 7 errors so far, and the check still has another 120 minutes to run.  I'll check it when I get home from work to see if I have more than 7.

 

My question is "Should I be worried about this?"  I guess that is the reason the check runs in the first place, but should I look at any logs or confirm that the CORRECTED errors were actually errors and the corrections were good?  I read here somewhere a while back that not all corrections are what you may want, as parity may be wrong.

 

The only thing in the syslog is this,

 

Dec  1 00:00:01 Tower kernel: mdcmd (507): check NOCORRECT (unRAID engine)

Dec  1 00:00:01 Tower kernel:  (Routine)

Dec  1 00:00:01 Tower kernel: md: recovery thread woken up ... (unRAID engine)

Dec  1 00:00:01 Tower kernel: md: recovery thread checking parity... (unRAID engine)

Dec  1 00:00:02 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 7784 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 7792 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 7808 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 7816 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 8040 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 12368 (Errors)

Dec  1 00:00:30 Tower kernel: md: parity incorrect: 12376 (Errors)

 

Given this is the first time I've had this issue in 3 years of running my unRAID server, I thought I'd better ask.

 

Thanks!

 

Steve

Link to comment


That was run in NOCORRECT mode.  Regardless of what the unRAID display may say, the errors were not corrected.

 

Run another NOCORRECT check.  If the exact same errors occur in the same places, then parity is out of sync.

If different errors occur, then you have flaky hardware...  either memory, or a disk controller, or disk, or even a power supply acting up resulting in occasional incorrect data being read from the disks.
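One way to compare two runs is to pull the "parity incorrect" addresses out of a saved copy of the syslog and check whether both checks reported the same set. This is only a rough sketch: it assumes the log lines look like the ones pasted in this thread, the function names are made up for illustration, and it is meant to be run on a copy of the log on any machine with Python, not necessarily on the server itself.

```python
import re

# Patterns based on the syslog lines quoted in this thread (assumed format).
CHECK_START = re.compile(r"mdcmd \(\d+\): check (\w+)")
PARITY_ERR = re.compile(r"md: parity incorrect: (\d+)")


def parity_runs(syslog_text):
    """Return one (mode, [addresses]) tuple per parity check found in the log."""
    runs = []
    for line in syslog_text.splitlines():
        start = CHECK_START.search(line)
        if start:
            # A new check begins; remember its mode (CORRECT or NOCORRECT).
            runs.append((start.group(1), []))
            continue
        err = PARITY_ERR.search(line)
        if err and runs:
            runs[-1][1].append(int(err.group(1)))
    return runs


def same_errors(run_a, run_b):
    """True if both runs reported exactly the same addresses."""
    return sorted(run_a[1]) == sorted(run_b[1])


# Abridged sample taken from the logs in this thread.
SAMPLE_LOG = """\
Dec  1 00:00:01 Tower kernel: mdcmd (507): check NOCORRECT (unRAID engine)
Dec  1 00:00:30 Tower kernel: md: parity incorrect: 7784 (Errors)
Dec  1 00:00:30 Tower kernel: md: parity incorrect: 12376 (Errors)
Dec  1 12:14:20 Tower kernel: mdcmd (24): check CORRECT (unRAID engine)
Dec  1 12:14:20 Tower kernel: md: parity incorrect: 7784 (Errors)
Dec  1 12:14:20 Tower kernel: md: parity incorrect: 12376 (Errors)
"""

runs = parity_runs(SAMPLE_LOG)
for mode, addresses in runs:
    print(mode, addresses)
print("same addresses in both runs:", same_errors(runs[0], runs[1]))
```

If the two lists match, that points to parity being out of sync; if they differ from run to run, that points to hardware returning inconsistent data.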

 

 

Link to comment

Did you have a hard shutdown of the server since the last parity check?

 No, the past several parity checks have all been zero errors and no hard crashes.

 

 

Ok, so how can I find out what the errors are, meaning what files on what drives are not happy?  

I will run another check, as you have suggested, but how will I confirm the errors are the same, or new ones?

And last, how can I correct the errors so parity is happy again?

 

edit

 

Ok, so I powered down my server completely and restarted it.  I then kicked off another parity check, and here is what is in the syslog.  The status screen shows the parity check has only just started, but it already has the 7 errors again:

 

Dec  1 12:13:09 Tower ntpd[1484]: time reset -0.216085 s

Dec  1 12:14:20 Tower kernel: mdcmd (24): check CORRECT (unRAID engine)

Dec  1 12:14:20 Tower kernel: md: recovery thread woken up ... (unRAID engine)

Dec  1 12:14:20 Tower kernel: md: recovery thread checking parity... (unRAID engine)

Dec  1 12:14:20 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 7784 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 7792 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 7808 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 7816 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 8040 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 12368 (Errors)

Dec  1 12:14:20 Tower kernel: md: parity incorrect: 12376 (Errors)

Dec  1 12:16:45 Tower unmenu[1569]: awk: ./unmenu.awk:630: fatal: print to "/inet/tcp/8080/0/0" failed (Connection reset by peer) (Minor Issues)

 

help!

 

 

Link to comment


Last time you invoked it in NOCORRECT mode, this time you invoked it in CORRECT mode.

The same addresses were still incorrect, but this time they were corrected.

 

If you run a third parity check you'll probably find all is OK.

 

There is no easy way to determine which files correspond to the blocks with the errors.  It might be part of the empty space on the drive, or part of the file-system structures, or parts of one or more files.
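To illustrate why a parity error by itself cannot say which disk is at fault, here is a toy sketch. It assumes the usual single-parity scheme in which the parity disk holds the byte-wise XOR of all the data disks; the disk contents are made-up values, not anything read from a real array.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks, one block per data disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of the same block on three data disks.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_parity(data)

# Case 1: a data disk returns one flipped bit, parity disk is fine.
bad_data = [b"\x0e\xf0", data[1], data[2]]
diff_1 = bytes(a ^ b for a, b in zip(xor_parity(bad_data), parity))

# Case 2: all data disks are fine, parity disk holds a stale byte.
bad_parity = bytes([parity[0] ^ 0x01, parity[1]])
diff_2 = bytes(a ^ b for a, b in zip(xor_parity(data), bad_parity))

# Both cases produce a mismatch, and the mismatch looks identical.
print(diff_1 != b"\x00\x00", diff_2 != b"\x00\x00", diff_1 == diff_2)
```

The check only sees that the XOR no longer adds up at that address. A correcting check resolves it by rewriting parity to match the data disks, which is the right fix only when the data disks are the ones holding good data.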

Link to comment

Did you do any array maintenance since the last parity check?

 

- rebuild failed disk

- upsize a disk

- add a new disk

- add any addons

- ...

 

The fact that the block numbers of the parity errors are pretty small seems to indicate the parity issues are near the start of the disk, in the "housekeeping" area - a good thing because file data is not stored there.
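As a rough back-of-the-envelope check of how close to the start of the disk these addresses are, the numbers from the syslog can be converted to an offset. The unit of the reported address is an assumption here (the log does not say whether it counts 512-byte sectors or 1 KiB blocks), so the sketch simply tries both.

```python
# Addresses copied from the "parity incorrect" lines in the syslog.
ADDRESSES = [7784, 7792, 7808, 7816, 8040, 12368, 12376]


def offsets_mib(addresses, unit_bytes):
    """Convert reported addresses to offsets in MiB for an assumed unit size."""
    return [addr * unit_bytes / 2**20 for addr in addresses]


# The unit is an assumption: try 512-byte sectors and 1 KiB blocks.
for unit in (512, 1024):
    furthest = max(offsets_mib(ADDRESSES, unit))
    print(f"unit={unit} bytes: furthest error is about {furthest:.1f} MiB into the disk")
```

Under either assumption, all seven addresses fall within roughly the first dozen MiB of a 2 TB disk.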

Link to comment

Hi Joe!

 

You wrote:

Last time you invoked it in NOCORRECT mode, this time you invoked it in CORRECT mode.

The same addresses were still incorrect, but this time they were corrected.

I use the unmenu addon that runs the parity check monthly automatically.  I guess you are saying that package is configured to NOCORRECT.  Should it be?

When I restarted my server, I manually pushed the CHECK button on the main screen, and I guess based on what you are saying, that does a CORRECT check.  Again, should I change this setting?

 

If you run a third parity check you'll probably find all is OK.

Other than the packages in unmenu, what 3rd party checkers do you use?

 

There is no easy way to determine which files correspond to the blocks with the errors.  It might be part of the empty space on the drive, or part of the file-system structures, or parts of one or more files.

Ok, that isn't very good.  Not much use in reporting an error if you can't figure out what is causing it and what data is impacted.  I wish this service could provide more information about the errors.

 

 

Thanks again Joe!

Link to comment


 

Yes, here is the recent history.....

 

-last month the auto parity check ran with no errors, as it has for a long time.

 

-After that check, I replaced an old drive with a larger one, which I've done many times in the past.  The rebuild process worked with no errors.  I then ran a CHECK via the button on the main screen.  From Joe's message, that does a CORRECT check.  It resulted in NO errors.

-I then used my server all month with no issues, until the scheduled monthly unmenu service ran last night, resulting in these 7 errors.

 

No addons were added, no other new drives, etc.  A simple yank out the old, put in the new, clear it, and then start the rebuild.

 

What is the housekeeping area and should I do anything to clean that up?

 

Thanks

Link to comment

I said "Run a 3rd parity check" as in run once more.

not

run a "3rd party" parity checker.

 

On your version of unRAID, the button on the unRAID screen checks and corrects parity.    To perform a check only, you must either use the button on the unMENU web-page, or type the command on the command line.

/root/mdcmd check NOCORRECT

 

Link to comment


 

Ok, thanks Joe!  So would you recommend that I change the unmenu monthly parity checker to CORRECT when it runs?

Link to comment

-After that check, I replaced an old drive with a larger one, which I've done many times in the past.  The rebuild process worked with no errors.  I then ran a CHECK via the button on the main screen.  From Joe's message, that does a CORRECT check.  It resulted in NO errors.

-I then used my server all month with no issues, until the scheduled monthly unmenu service ran last night, resulting in these 7 errors.

 

I believe this is the reason for your sync errors.  Is it possible you didn't do that parity check after the rebuild?

 

Read THIS POST (from Tom acknowledging the bug and fix in 5.0b8d); and

THIS THREAD (from a user experiencing a similar problem to yours)

 

Your correcting parity check will fix the sync errors and you'll be back in business.  Although the bug COULD create data corruption, there have been no documented cases of it.  In the future, do not use your array while doing a rebuild and you should avoid this bug.  Or move to one of the later 5.0 betas.

 

Tom does say a 4.7.1 is coming, but it is not looking likely given the amount of time that has gone by.

 

No addons were added, no other new drives, etc.  A simple yank out the old, put in the new, clear it, and then start the rebuild.

 

What is the housekeeping area and should I do anything to clean that up?

 

I think this was answered in one of the linked posts. But quickly, the housekeeping area is a part of the disk where reiserfs keeps a journal and other structures related to the disk format, but this area does not contain actual data files.  The largest part of the housekeeping area is the journal area where it temporarily stores data pending time to flush it to the data sectors later on the disk.  Much of the time the journal area has already been flushed, so overwriting something there has no effect.

 

Thanks

 

yw

Link to comment

Yes, I clicked on the button on the main screen and the parity check was ok, zero errors.  I always do that after the rebuild is complete for replacement drives.  I do not use the server during the rebuild.  By that, I mean I don't connect with my PC or stream any media during that time.  I guess my Mac could auto connect and that would create a connection, but no real use.  In any case, it would be nice to get the update and close this gap.  :)

 

It's still running at 58% and still says 7 errors.  I hope you are right and this clears it up.  After this is done, I will run it yet again and it should come up with zero errors.  Fingers crossed!

 

update:  ok, the parity check completed and ended with the same 7 errors, but as pointed out, they were corrected this time.  I rebooted the server and then kicked off another parity check, resulting in zero errors.  So, while I don't understand why this happened or what was corrected, things seem to be back to normal.

 

So for the housekeeping junk, are you saying to just ignore all that mess?

 

Thanks,

 

Steve

 

 

Link to comment
