[SOLVED] Parity sync errors



Hi all,

 

I've been running unRAID for about 10 years now,.. mostly on auto-pilot, and the only issues I've had are a few failed drives.

 

I run a monthly parity check,.. and up until the last few weeks have never had any sync errors with the check at all.

 

The last few parity checks have been showing up corrected errors.

 

Today 1st Nov,.. it did its normal monthly check and again found and corrected sync errors (217).

Nov  1 00:00:01 storage kernel: mdcmd (42): check CORRECT
Nov  1 00:00:01 storage kernel: 
Nov  1 00:00:01 storage kernel: md: recovery thread woken up ...
Nov  1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov  1 00:00:01 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov  1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov  1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122183264
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122187616
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122196320
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122213728
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122218080
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122252896
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122435680
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122440032
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122514016
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122540128
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122548832
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122557536
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122561888
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122566240
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122592352
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122596704
Nov  1 00:20:28 storage kernel: md: correcting parity, sector=270032168
 

After it had finished,.. I manually kicked off the check again a few hours later,.. and it found errors again (it's almost done and is at 218).

Nov  1 14:15:11 storage kernel: md: recovery thread woken up ...
Nov  1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov  1 14:15:11 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov  1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov  1 14:24:08 storage kernel: md: correcting parity, sector=117798280
Nov  1 14:24:13 storage kernel: md: correcting parity, sector=118812296
Nov  1 14:24:15 storage kernel: md: correcting parity, sector=119217032
Nov  1 14:24:15 storage kernel: md: correcting parity, sector=119260552
Nov  1 14:24:16 storage kernel: md: correcting parity, sector=119452040
Nov  1 14:24:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122183264
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122187616
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122196320
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122213728
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122218080
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122252896
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122435680
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122440032
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122514016
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122540128
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122548832
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122557536

 

Have I got a disk that is on its way out ? I see no other errors in the log,.. previously, when I was having disk errors, there'd be something in there.

 

I'm running 5.0.5 with the following layout:
4TB parity
2 x 2TB
2 x 4TB

 

I've looked at the smartctl details, and the only thing that looks out of place is in the attached file (WDC_WD40EFRX-68N32N0_3704261.txt). The events in there seem old, but I'm no expert here. The other 4 drives show output similar to the file 'WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt'.

 

Can the sectors associated with the errors be matched back to a disk ?

 

Hoping someone can shed some light on what's going on and help me identify if I do have a disk that needs to be replaced.

 

Thanks in advance,

 

Jim.....

WDC_WD40EFRX-68N32N0_3704261.txt WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt

Popular Posts

On the Main GUI page after you apply the new config. If you select "None", then the Main GUI with all your assignments would be blank, and you would need to recreate the proper assignments. If yo

Just an update,.. I replaced the parity disk and it looks like the problem has gone away from what I can see below. Am just doing another check, hopefully it comes back with 0 as well.


8 hours ago, trurl said:

Have you done memtest?

No I hadn't. I'll do that shortly.

 

Is there a way to run memtest while the server is running,.. as I run it headless ?

 

EDIT: memtest running now,.. 50% done, no errors yet.

On 11/1/2020 at 8:30 AM, jimbo123 said:

Can the sectors associated with the errors be matched back to a disk ?

No, but the fact that the checks were getting different sectors is what made me suspect RAM.
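
To illustrate why a sync error can't be pinned on one disk: with single parity, the parity sector is just the XOR of the same sector on every data disk, so a mismatch only tells you the XOR no longer comes out right. Below is a minimal sketch of that idea, not Unraid's actual code. The 4-byte "sectors" and the helper names are made up for the example.

```python
from functools import reduce
from operator import xor

def parity_of(blocks):
    """XOR the same byte position across every data disk."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

# Hypothetical 4-byte "sector" at the same offset on four data disks.
data = [
    bytes([0x11, 0x22, 0x33, 0x44]),
    bytes([0xA0, 0xB0, 0xC0, 0xD0]),
    bytes([0x01, 0x02, 0x03, 0x04]),
    bytes([0xFF, 0x00, 0xFF, 0x00]),
]
stored_parity = parity_of(data)

def mismatch(parity, blocks):
    """Difference between stored parity and parity recomputed from the data."""
    return bytes(a ^ b for a, b in zip(parity, parity_of(blocks)))

signatures = set()

# Flip the same bit on each data disk in turn and record what the check sees.
for i in range(len(data)):
    damaged = list(data)
    flipped = bytearray(damaged[i])
    flipped[0] ^= 0x01
    damaged[i] = bytes(flipped)
    signatures.add(mismatch(stored_parity, damaged))

# Now flip that bit on the parity disk instead.
bad_parity = bytearray(stored_parity)
bad_parity[0] ^= 0x01
signatures.add(mismatch(bytes(bad_parity), data))

print(signatures)
```

All five cases produce the same mismatch, so the check has nothing to go on when deciding which disk returned bad data. A correcting check simply rewrites parity to match whatever the data disks returned.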

 

You might try again after checking all connections.

 

Then the controller would be the next suspect, I think. Tell us more about your hardware.

 

Very difficult for us to support a version of Unraid we haven't seen in years. And latest version has a lot more to help us troubleshoot. When you get stable again you should upgrade.


Hardware is one of the early HP MicroServers,.. an N36L. Nothing has really changed, and it's running the onboard controller. I updated the BIOS firmware to let the SATA port meant for the CD/DVD drive take a disk, or something similar, from memory.

 

One parity disk and 4 data drives as mentioned earlier.

 

I'll update to current stable and see how things behave.


Is this a correcting parity check?

 

After it finishes and before rebooting,

 

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

 

In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.
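
For anyone wanting to do that comparison themselves, here is a rough sketch that groups the "correcting parity" lines by check and compares the sector sets. It assumes the syslog wording shown in the first post. The function name and the short sample (a few lines copied from that post) are only for illustration, and you would feed it the full syslog rather than an excerpt.

```python
import re

# Patterns based on the syslog lines quoted in the first post.
START_RE = re.compile(r"md: recovery thread checking parity")
SECTOR_RE = re.compile(r"md: correcting parity, sector=(\d+)")

def sectors_per_check(lines):
    """Return one set of corrected sectors for each parity check in the log."""
    checks = []
    for line in lines:
        if START_RE.search(line):
            checks.append(set())  # a new check starts here
            continue
        match = SECTOR_RE.search(line)
        if match and checks:
            checks[-1].add(int(match.group(1)))
    return checks

# A few lines taken from the two checks in the first post (not the full log).
sample = """\
Nov  1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov  1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov  1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov  1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov  1 14:24:08 storage kernel: md: correcting parity, sector=117798280
Nov  1 14:24:27 storage kernel: md: correcting parity, sector=122070112
""".splitlines()

first, second = sectors_per_check(sample)
repeated = sorted(first & second)       # corrected in both checks
new_in_second = sorted(second - first)  # only showed up the second time

print("repeated:", repeated)
print("new in second check:", new_in_second)
```

Sectors that get corrected again on the very next check, and sectors that only show up in one of the two checks, are the things worth looking at. Bear in mind that a partial log will make the "only in one check" list misleading.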

7 hours ago, trurl said:

In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.

Based on this, I'll kick off another parity check and provide the diagnostics zip when it's completed.


That Microserver uses ECC RAM, so it's unlikely to be a memory problem. It could be a disk, but IMHO it's most likely a board/controller issue. Do you have another controller you could use, or another board/PC you could transplant the disks to?


Hi JorgeB,

 

Yes, I have 2 others doing other things, one as a low-end ESX box (40L), which, by the looks of it, Unraid can now do, so it's probably the obvious candidate. Will just have to figure out the whole BIOS thing, as from memory, the onboard SATA port used for the CD/DVD drive at the top wouldn't support a disk and some Russian guy came up with some custom firmware to overcome this restriction,... or something similar.

 

The 2nd parity check just finished, results below,.. pretty consistent over the 2 runs, only fewer errors this time:

image.png.477d6041ac47704738f5bcd55f124fe0.png

 

I've also attached the diagnostic zip again.

 

Jim.....

 

storage-diagnostics-20201106-1936.zip

2 minutes ago, JorgeB said:

If there were issues before, this check is expected to correct some errors; if the issue is fixed, the next one should find 0 sync errors.

Clean shutdown,.. check was done earlier in the day with many corrected errors prior to replacing the board.

 

Will wait for this to finish and then run another tomorrow.

1 minute ago, jimbo123 said:

Clean shutdown,.. c

It's not about that. If there is a hardware problem, some sectors were being wrongly corrected, so once that's solved it's normal for the 1st check to still find errors. But if the 2nd one still finds more, then it's still not fixed.


OK,.. I'd say I've still got a problem here somewhere after the board replacement.

 

Currently doing another check,.. did a few yesterday also, 2 of which looked better, but the one going now is still correcting many errors.

 

Given I'm now running with a different board and different memory, is this now pointing to a disk ?

 

image.png.38ec3240ac9dca825ac0b386dd2115cb.png

 

image.png.711913a83c3bb0d890f1152efb1b9dc6.png

 

 
