[Solved] Parity Check error counter continues to grow. Cause for concern?



  For roughly the last month, maybe two, I've noticed that parity checks have started producing errors.  The error count started at 9, at which point I checked parity again and it was still 9 errors. A period of time later, maybe a week, I checked again and it had found more errors. As the weeks passed, and with my paranoid re-checking, the number kept increasing.  I just now read the wiki, and I now understand it's an incremental counter.  However, what has me spooked is that the checks weren't reporting errors in the past.

 

  I have run extended SMART self-tests on all my drives, and all of them completed without error.  To the best of my knowledge I have not experienced an unclean shutdown.  The system is on a UPS, but the UPS is not connected over USB for control (it's a Tripp Lite unit, not an APC).

 

  Per another forum post, I tried running a parity check without correcting errors, but the error count was higher than in the previous check.  Can parity get so far out of sync that it becomes "spoiled"?  Should I shut down my system, yank out my two parity drives and zero-wipe them, and then have Unraid rebuild my parity?  Are there S.O.P. instructions I've missed and should go read?

 

  Any help and advice would be appreciated.  Thank you for your time.

hydra-diagnostics-20190420-0251.zip

Link to comment

Any time you do not have zero errors on a parity check is a cause for concern as every error represents a sector that could end up corrupt if you have to rebuild a disk.    
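
To illustrate why a sync error matters at rebuild time, here is a minimal sketch. It assumes plain single-parity XOR over one-byte "sectors"; the disk contents and function names are made up for the example, and a second parity drive uses a different calculation that is not modelled here.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR the same sector from every disk together."""
    return reduce(lambda a, b: a ^ b, blocks)

def rebuild_missing(surviving, parity):
    """Reconstruct the one missing sector from the survivors plus parity."""
    return xor_parity(surviving + [parity])

# Hypothetical one-byte "sectors" on three data disks.
data = [0x5A, 0x3C, 0xF0]
good_parity = xor_parity(data)
stale_parity = good_parity ^ 0x01  # one flipped bit, i.e. one sync error

# The disk at index 1 fails; rebuild it from the other two disks plus parity.
survivors = [data[0], data[2]]
rebuilt_ok = rebuild_missing(survivors, good_parity)
rebuilt_bad = rebuild_missing(survivors, stale_parity)

print(hex(rebuilt_ok), hex(rebuilt_bad))
```

With in-sync parity the lost sector comes back intact; with the out-of-sync parity the rebuilt sector silently differs from the original.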

 

It is not clear to me from your description if you have been running correcting or non-correcting parity checks?    If non-correcting then the number would continue to increase, but correcting should put it back to zero on subsequent runs (after corrections applied).

Link to comment
3 hours ago, itimpi said:

It is not clear to me from your description if you have been running correcting or non-correcting parity checks?    If non-correcting then the number would continue to increase, but correcting should put it back to zero on subsequent runs (after corrections applied).

Mostly correcting, but the last one was non-correcting, as I thought that was supposed to reset the count. Some sort of bug... I should try to find that forum post.

 

25 minutes ago, johnnie.black said:

You should run memtest, also make sure RAM is not overclocked.

I don't overclock, and the RAM is using its XMP profile; the system has been stable, but I'll start a memtest after posting and report back.

 

29 minutes ago, johnnie.black said:
4 hours ago, Jcloud said:

and I now understand it's an incremental counter. 

This is not correct, errors are just for the last check.

I re-read the wiki, looks like I misread that. 

Link to comment
1 hour ago, Jcloud said:

don't overclock, and the RAM is using its XMP

That's good, but just to be clear, I also mean overclocked relative to what the platform supports. In this case anything above 2666MHz will be an overclock, even if it's not overclocking the RAM itself. Ryzen, for example, is known to give sync errors with overclocked RAM, and the problem is not always detectable with memtest.

Link to comment
1 hour ago, Jcloud said:

Mostly correcting, but the last one was non-correcting, as I thought that was supposed to reset the count. Some sort of bug... I should try to find that forum post.

It is a correcting check that clears the count for the next run.  After a correcting run the next run should give zero errors.

 

It is normally recommended that the regular scheduled parity checks be set to non-correcting by default.    That way, if any errors are reported, it gives you a chance to look into whether it is a genuine error or whether a disk is playing up and reporting a spurious error.   In the spurious error case you do not want to correct parity, as you would in fact end up corrupting it instead :(
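
As a rough illustration of the difference between the two modes, here is a toy model. It assumes single XOR parity over small integer "sectors"; `parity_check` is a made-up function for this sketch, not Unraid's actual implementation.

```python
def parity_check(disks, parity, correct):
    """Compare stored parity with parity recomputed from the data disks.

    Returns the number of mismatching sectors found in this run. When
    `correct` is True, each mismatching parity sector is overwritten with
    the recomputed value.
    """
    errors = 0
    for sector in range(len(parity)):
        expected = 0
        for disk in disks:
            expected ^= disk[sector]
        if parity[sector] != expected:
            errors += 1
            if correct:
                parity[sector] = expected
    return errors

# Two hypothetical data disks with four sectors each.
disks = [[1, 2, 3, 4], [5, 6, 7, 8]]
parity = [1 ^ 5, 2 ^ 6, 0, 0]  # sectors 2 and 3 are out of sync

first = parity_check(disks, parity, correct=False)   # finds the errors, changes nothing
second = parity_check(disks, parity, correct=False)  # finds the same errors again
third = parity_check(disks, parity, correct=True)    # finds them and rewrites parity
fourth = parity_check(disks, parity, correct=False)  # nothing left to find

print(first, second, third, fourth)
```

In this model the count belongs to each run on its own. If a run after a correcting check still reports errors, something other than stale parity is changing what gets read.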

Link to comment
That's good, but just to be clear I also mean the RAM for the platform, in this case anything above 2666Mhz will be an overclock, even if it's not overclocking the RAM, e.g.  Ryzen is known to give sync errors with overclocked RAM and not always detectable with memtest.

My system specs are in the URL of my forum signature (if that's useful). The RAM is much slower than 2666MHz, 2133 if my recollection is good. What I should have written is that I use the RAM's default XMP profile so that the chipset has the correct timings, rather than guessing them via "auto-detect."

 

Presently memtest is on hour seven; completed Passes 1, Errors: 0. Memtest is 5% in pass #2. 128GB of RAM takes time.

 

 

EDIT: My recall of RAM speed is bad, I screwed up, it is 2666MHz.

 

 


 

 

 

Link to comment
2 minutes ago, Squid said:

Diagnostics are always better than a simple syslog.

 

In your case though, the docker.img failed to umount, and I'm guessing that it's stored on disk #3 as that is the disk that's busy and refusing to umount and then stop the array.

My bad. I'll remember that for later.  I thought my docker image was on cache, but given your expertise you're probably right, I'll try to check that.

 

Currently booted into Safe-boot, and was starting to look around again.

Link to comment
4 hours ago, Squid said:

Diagnostics are always better than a simple syslog.

4 hours ago, Jcloud said:

Currently booted into Safe-boot, and was starting to look around again.

And it's always better to get diagnostics before rebooting, but since you posted the syslog before rebooting, that will be the main thing we would be missing when you get diagnostics now.

 

Go ahead and post diagnostics by going to Tools - Diagnostics and attaching the complete diagnostics zip file to your next post.

Link to comment
26 minutes ago, trurl said:
5 hours ago, Jcloud said:

Currently booted into Safe-boot, and was starting to look around again.

And its always better to get diagnostics before rebooting, but since you posted the syslog before rebooting, that will be the main thing we would be missing when you get diagnostics now.

I'm not helping you guys to help me, am I? . . .  I must be the impatient type.  I rebooted yet again, prior to your post, trurl.  Everything you wrote makes sense, so I'll try to do better.

 

Presently, I'm back in normal mode.  While in Safe-boot I ran xfs_repair on all my /dev/md*.   After that, I checked and found docker.img on Disk3 and moved it to cache.  I then tried to start my VM but got an error, which is when I opted to reboot.  Currently I'm running another parity check; it's about 47% completed.  My "soft-plan" was to run a second parity check shortly after this one completes, to see if there are errors on the second go.  As of now the status says 56 sync errors detected.

hydra-diagnostics-20190421-0217.zip

Link to comment

Since you aren't correcting parity:

Apr 20 14:53:31 HYDRA kernel: mdcmd (45): check nocorrect

then the next parity run will have the same errors.
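
If you want to pull these entries out of a syslog yourself, something like the sketch below would do it. The first sample line is the one quoted above; the second sample line is hypothetical, and treating any `check` entry without `nocorrect` as a correcting check is my assumption here, so compare the result against your own log.

```python
import re

# Matches entries such as:
#   Apr 20 14:53:31 HYDRA kernel: mdcmd (45): check nocorrect
MDCMD_CHECK = re.compile(r"kernel: mdcmd \(\d+\): check\b\s*(?P<arg>\S*)")

def classify_checks(syslog_lines):
    """Return (timestamp, mode) for every parity check started in the log."""
    results = []
    for line in syslog_lines:
        match = MDCMD_CHECK.search(line)
        if not match:
            continue
        # Assumption: only an explicit "nocorrect" means non-correcting.
        if match.group("arg").lower() == "nocorrect":
            mode = "non-correcting"
        else:
            mode = "correcting"
        results.append((line[:15], mode))  # first 15 chars hold the timestamp
    return results

sample = [
    "Apr 20 14:53:31 HYDRA kernel: mdcmd (45): check nocorrect",
    "Apr 21 02:30:00 HYDRA kernel: mdcmd (52): check",  # hypothetical line
    "Apr 21 02:30:01 HYDRA kernel: md: recovery thread: check P Q ...",  # hypothetical, ignored
]
checks = classify_checks(sample)
for stamp, mode in checks:
    print(stamp, mode)
```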

 

You have used 53G of your 80G docker image. And, as expected, it is corrupt.

 

If you have your docker applications set up correctly, it is extremely unlikely that you will ever need even 20G. Making the image larger won't fix anything; it will just make it take longer to fill.

 

And, your system and appdata shares have data on the parity array instead of all on cache where it belongs.

 

But, forget about even running dockers and VMs until you get your system stable. And until you have exactly zero parity errors, you can't consider your system stable.

14 minutes ago, Jcloud said:

I must be impatient type.

Don't do anything else without further advice.

 

I didn't notice any I/O errors in that previous syslog or in the syslog in these diagnostics.

 

Do you see anything in the Errors column in Main - Array Devices for any disks?

 

Do you see any SMART warnings for any disks on the Dashboard?

 

 

Link to comment
14 minutes ago, trurl said:

Do you see anything in the Errors column in Main - Array Devices for any disks?

0 errors across the board.  I will comment that disk2 has a temperature warning while parity checking, but it always has in the past (WD Black drive).  Disk2 is registering 46 degrees C (max of 50 C).

 

 

EDIT: 

>>>   You have used 53G of your 80G docker image. And, as expected, it is corrupt.

 

About a month ago my Docker image was corrupted. I nuked it and yes, I did make it larger.  I had run a scrub on the image and it reported no errors, so (for my own benefit) what information are you looking at to tell that my Docker image is currently corrupt?  Should I just go back to my Dockerapps and appdata folders and nuke them (an inconvenience, but I wouldn't mind)?  Come to think of it, I had an incident with bittorrent which filled my entire cache, and that is what corrupted Docker; I also stopped caching torrent downloads to prevent further incidents of that kind.

 

Edited by Jcloud
Link to comment

You should try to improve disk cooling. What I was more interested in though, was the warning for SMART for each disk, which is listed separately from the temperature warning.

 

OK, I just looked at SMART for each in the diagnostics, and they seem fine.

 

Cancel that parity check, and do a correcting parity check instead.

 

Link to comment
8 minutes ago, trurl said:

Cancel that parity check, and do a correcting parity check instead.

I'm 98%-positive this current parity check had the correcting-errors checked. However, I will comply. Just canceled parity-check; verified the check box was checked, started parity check again.

 

EDIT: as for the cooling. I'll have to check and see which bay it is in, the Icy-Dock or the ct1478.

Edited by Jcloud
Link to comment
11 minutes ago, Jcloud said:

I'm 98%-positive this current parity check had the correcting-errors checked. However, I will comply. Just canceled parity-check; verified the check box was checked, started parity check again.

Syslog from your diagnostics disagreed, as I already posted.

 

Post another diagnostic and let me see if for some reason it still isn't correcting.

Link to comment

Looks like it completed the correcting parity check, and you started another correcting parity check after that, which is still finding parity errors. They should have all been corrected by the previous check and you shouldn't be getting any now.

 

Have you done a memtest recently?

 

Link to comment
20 hours ago, Jcloud said:

EDIT: 

>>>   You have used 53G of your 80G docker image. And, as expected, it is corrupt.

 

About a month ago my Docker image was corrupted. I nuked it and yes I did make it larger.  I had ran a scrub on the image and it said no errors, so (for my own benefit) what information are you looking at to tell that my Docker image is currently corrupt?  Should I just go back to my Dockerapps and appdata folders and nuke them out (an inconvenience but I wouldn't mind)?  Come to think of it this I had an incident with bittorrent which filled my entire cache - at which point that corrupted Docker; I also took off caching torrent downloads to prevent further incidents of that kind.

 

I didn't see any of this until I was just now reviewing the thread. It's better to put new information in a new post. When I go to check threads for unread posts, the forum software takes me directly to the first unread post. And since you had put information in an old post I didn't see it.

 

But, I won't make any recommendations about all that except don't do anything else until you get to zero parity errors. We can deal with your dockers after.

Link to comment
Just now, trurl said:

So is this thing I deduced from your syslog true?

7 minutes ago, trurl said:

Looks like it completed the correcting parity check, and you started another correcting parity check after that, which is still finding parity errors.

Yes, correct, I did start a second parity check. I thought I had unchecked the box so that it would just look for errors reappearing; then again, morning was a very drowsy time, so I won't put a certainty percentage on that. However, looking at the syslog, yes, it is correcting.

Link to comment
  • Jcloud changed the title to [Solved] Parity Check error counter continues to grow. Cause for concern?
