Parity check error


Recommended Posts

I had my monthly parity check and it came back w/ 1 error.  So i ran it again w/ correct error checked it and then ran it again w/o the error check and I'm still getting

 

Event: unRAID Parity check

Subject: Notice [TOWER] - Parity check finished (1 errors)

Description: Duration: 18 hours, 34 minutes, 18 seconds. Average speed: 119.7 MB/s

Importance: warning

 

I had this last month as well as I tend to run monthly parity checks.

 

I'm currently at work and will post syslog when I get home.    I'm not sure why i keep getting this.   

 

Thanks

Link to comment

Your syslog only shows two parity checks, not three as you suggested. First, a scheduled check runs at midnight:

 

Feb  1 00:00:01 Tower kernel: mdcmd (59): check

Feb  1 00:00:01 Tower kernel: md: recovery thread: check P ...

...

Feb  1 01:27:36 Tower kernel: md: recovery thread: P corrected, sector=1072051656

...

Feb  1 18:52:49 Tower kernel: md: sync done. time=67967sec

Feb  1 18:52:49 Tower kernel: md: recovery thread: completion status: 0

 

then a manual check is run:

 

Feb  1 19:41:34 Tower kernel: mdcmd (64): check correct

Feb  1 19:41:34 Tower kernel: md: recovery thread: check P ...

...

Feb  1 21:09:09 Tower kernel: md: recovery thread: P corrected, sector=1072051656

...

Feb  2 14:15:52 Tower kernel: md: sync done. time=66858sec

Feb  2 14:15:53 Tower kernel: md: recovery thread: completion status: 0

 

and the log ends a few hours later so I don't see your third, non-correcting check.

 

Now, what I find odd is that while your manual, correcting check puts

 

Feb  1 19:41:34 Tower kernel: mdcmd (64): check correct

 

in the log, your automated, non-correcting check puts

 

Feb  1 00:00:01 Tower kernel: mdcmd (59): check

 

in the log. Compare that with the output from my server running the same version 6.2.4 of unRAID when it starts its automatic monthly non-correcting check:

 

Feb  1 05:00:01 Northolt kernel: mdcmd (216): check NOCORRECT

Feb  1 05:00:01 Northolt kernel:

Feb  1 05:00:01 Northolt kernel: md: recovery thread: check P Q ...

 

Is that difference significant? Is it because you have single parity and I have dual? I can't find anything else of any relevance, though a couple of other, probably unrelated issues caught my eye.

 

Every day at 04:00 the mover tries to move your system folder from the array to the cache but fails due to files being in use:

 

Jan 31 04:00:01 Tower root: mover started

Jan 31 04:00:01 Tower root: moving "s..m" to cache

Jan 31 04:00:01 Tower shfs/user0: err: shfs_rmdir: rmdir: /mnt/disk1/system/docker (39) Directory not empty

Jan 31 04:00:01 Tower move: rmdir: /mnt/user0/./system/docker Directory not empty

Jan 31 04:00:01 Tower shfs/user0: err: shfs_rmdir: rmdir: /mnt/disk1/system/libvirt (39) Directory not empty

Jan 31 04:00:01 Tower move: rmdir: /mnt/user0/./system/libvirt Directory not empty

Jan 31 04:00:01 Tower shfs/user0: err: shfs_rmdir: rmdir: /mnt/disk1/system (39) Directory not empty

Jan 31 04:00:01 Tower move: rmdir: /mnt/user0/./system Directory not empty

Jan 31 04:00:01 Tower root: mover finished

 

I recommend that you put it out of its misery once and for all by stopping your dockers and then stopping the docker service before running the mover manually. Then you can re-enable the docker service and start your dockers again.

 

The other thing I see is a lot of this:

 

Jan 31 20:56:31 Tower shfs/user: err: shfs_rmdir: rmdir: /mnt/cache/appdata/FoldingAtHome/work/02 (39) Directory not empty

Jan 31 20:56:31 Tower shfs/user: err: shfs_rmdir: rmdir: /mnt/cache/appdata/FoldingAtHome/work/02 (39) Directory not empty

Jan 31 20:57:31 Tower shfs/user: err: shfs_rmdir: rmdir: /mnt/cache/appdata/FoldingAtHome/work/02 (39) Directory not empty

 

It appears hundreds and hundreds of times. Perhaps a restart will fix it.

 

I don't see any disk or controller issues. I suggest you reboot, sort out your system share, and then run another parity check. It might as well be a correcting one. When it has finished grab a new set of diagnostics and post them.

 

Link to comment

Both parity checks on the log are check correct, for some reason the scheduled one only shows check, but you can tell both were correct because the sync error was corrected, as in:

 

Feb  1 01:27:36 Tower kernel: md: recovery thread: P corrected, sector=1072051656

 

Same error being corrected twice rules out a memory issue and other random factors, but disks look fine so no clue.

 

PS: Unrelated but first time I've noticed this SMART attribute:

 

22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100

 

Curious to see how it holds up as years go by.

Link to comment

This flip-flopping of the same sector back and forth has been seen before.

 

If I read this correctly, a correcting check found one error and fixed it.

 

And a second correcting check ran and found the same error and fixed it.

 

I believe, in truth, the first error was a mis-detection. Could have been caused by a memory error, a disk error, cable issue, or something else.  But the misdetection caused parity to be wong. One mistake in how many memory write? Seems unlikely, but it does happen.

 

The second "fix" was correct, it was correctly undoing the first fix.

 

I think I remember this did turn out to be a memory error. ECC memory is a wonderful thing. :)

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.