Hi all,

 

I've been running unRAID for about 10 years now, mostly on auto-pilot; the only issues I've had are a few failed drives.

 

I run a monthly parity check, and up until the last few weeks it had never reported any sync errors at all.

 

The last few parity checks have been reporting corrected sync errors.

 

Today, 1st Nov, it ran its normal monthly check and again found and corrected sync errors (217).

Nov  1 00:00:01 storage kernel: mdcmd (42): check CORRECT
Nov  1 00:00:01 storage kernel: 
Nov  1 00:00:01 storage kernel: md: recovery thread woken up ...
Nov  1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov  1 00:00:01 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov  1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov  1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122183264
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122187616
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122196320
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122213728
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122218080
Nov  1 00:09:28 storage kernel: md: correcting parity, sector=122252896
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122435680
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122440032
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122514016
Nov  1 00:09:29 storage kernel: md: correcting parity, sector=122540128
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122548832
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122557536
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122561888
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122566240
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122592352
Nov  1 00:09:30 storage kernel: md: correcting parity, sector=122596704
Nov  1 00:20:28 storage kernel: md: correcting parity, sector=270032168
 

After it finished, I waited a few hours and manually kicked off the check again, and it found errors again (it's almost done and is at 218).

Nov  1 14:15:11 storage kernel: md: recovery thread woken up ...
Nov  1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov  1 14:15:11 storage kernel: md: using 1536k window, over a total of 3907018532 blocks.
Nov  1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov  1 14:24:08 storage kernel: md: correcting parity, sector=117798280
Nov  1 14:24:13 storage kernel: md: correcting parity, sector=118812296
Nov  1 14:24:15 storage kernel: md: correcting parity, sector=119217032
Nov  1 14:24:15 storage kernel: md: correcting parity, sector=119260552
Nov  1 14:24:16 storage kernel: md: correcting parity, sector=119452040
Nov  1 14:24:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122183264
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122187616
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122196320
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122213728
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122218080
Nov  1 14:24:28 storage kernel: md: correcting parity, sector=122252896
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122435680
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122440032
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122514016
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122540128
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122548832
Nov  1 14:24:29 storage kernel: md: correcting parity, sector=122557536
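
The two excerpts above can be compared mechanically rather than by eye. The following is a minimal sketch, assuming the syslog wording shown in these 5.0.5 excerpts (other releases may word the lines differently); the helper names `split_runs` and `compare` are made up for illustration, and the sample is a shortened copy of a few lines from the logs above.

```python
import re

SECTOR_RE = re.compile(r"md: correcting parity, sector=(\d+)")
RUN_START = "md: recovery thread checking parity"  # wording taken from the excerpts above

def split_runs(syslog_text):
    """Return one set of corrected sectors per parity-check run found in the text."""
    runs = []
    for line in syslog_text.splitlines():
        if RUN_START in line:
            runs.append(set())  # a new check started
            continue
        match = SECTOR_RE.search(line)
        if match and runs:
            runs[-1].add(int(match.group(1)))
    return runs

def compare(first, second):
    """Summarise which corrected sectors are shared between two runs."""
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }

# Shortened sample built from a few lines of the two excerpts above.
sample = """\
Nov  1 00:00:01 storage kernel: md: recovery thread checking parity...
Nov  1 00:05:16 storage kernel: md: correcting parity, sector=65721672
Nov  1 00:09:27 storage kernel: md: correcting parity, sector=122070112
Nov  1 14:15:11 storage kernel: md: recovery thread checking parity...
Nov  1 14:20:20 storage kernel: md: correcting parity, sector=65721672
Nov  1 14:24:08 storage kernel: md: correcting parity, sector=117798280
Nov  1 14:24:27 storage kernel: md: correcting parity, sector=122070112
"""

first_run, second_run = split_runs(sample)
result = compare(first_run, second_run)
print(result)
```

Sectors that appear in both runs were corrected by the first check and then flagged again, while sectors that appear in only one run were not reproducible. Note that the second excerpt above stops before the check finished, so a comparison on the full syslog would be more meaningful.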

 

Have I got a disk that is on its way out? I see no other errors in the log; previously, when I was having disk errors, there'd be something in there.

 

I'm running 5.0.5 with the following layout:
4TB parity
2 x 2TB
2 x 4TB

 

I've looked at the smartctl details, and the only thing that looks out of place is in the attached file (WDC_WD40EFRX-68N32N0_3704261.txt). The events in there seem old, but I'm no expert here. The other 4 drives look similar to the file 'WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt'.

 

Can the sectors associated with the errors be matched back to a disk ?

 

Hoping someone can shed some light on what's going on and help me identify if I do have a disk that needs to be replaced.

 

Thanks in advance,

 

Jim.....

WDC_WD40EFRX-68N32N0_3704261.txt WDC_WD40EFRX-68N32N0_WD-WCC7K5NPLNN1.txt

8 hours ago, trurl said:

Have you done memtest?

No I hadn't. I'll do that shortly.

 

Is there a way to run memtest while the server is running, as I run it headless?

 

EDIT: memtest running now,.. 50% done, no errors yet.

Edited by jimbo123


 

Memtest looks OK so far; the 2nd pass is now done, still with no errors. There's only 2GB in this system.

 

Will leave it going for a few more hours.

 

image.thumb.png.9152a15b4e164facc54af4ba184fb6ed.png

 

 


Several hours and passes later, still no errors detected. I did a couple of passes with ECC on, still no errors.

 

Will leave it running memtest for a few more hours.

 

Anything else I should check ?

On 11/1/2020 at 8:30 AM, jimbo123 said:

Can the sectors associated with the errors be matched back to a disk ?

No, but the fact that the checks were getting different sectors is what made me suspect RAM.
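
As a rough illustration of why a sync error cannot be traced to one disk: assuming single parity is a plain XOR across the data disks at each sector position (an assumption for this sketch; the byte values and the helper names `parity_of` and `sync_error` are made up), a bit flipped on any one disk, or on the parity disk itself, produces exactly the same mismatch.

```python
from functools import reduce
from operator import xor

def parity_of(blocks):
    """XOR all data blocks at one sector position into a single parity value."""
    return reduce(xor, blocks, 0)

# Toy array: four data disks, one byte per "sector" position for brevity.
data = [0x3C, 0xA5, 0x0F, 0x81]
parity = parity_of(data)

def sync_error(blocks, stored_parity):
    """A check can only ask: does the XOR of the data match the stored parity?"""
    return parity_of(blocks) != stored_parity

# Flip the same bit on each data disk in turn, then on the parity value itself.
symptoms = []
for i in range(len(data)):
    corrupted = list(data)
    corrupted[i] ^= 0x10
    symptoms.append(sync_error(corrupted, parity))
symptoms.append(sync_error(data, parity ^ 0x10))  # bad parity instead of bad data

print(symptoms)  # every entry is True: the check cannot tell the cases apart
```

The check only sees "mismatch at this sector", which is why other evidence (SMART data, swapping hardware, comparing runs) is needed to narrow down the culprit.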

 

You might try again after checking all connections.

 

Then controller would be next suspect I think. Tell us more about your hardware.

 

Very difficult for us to support a version of Unraid we haven't seen in years. And latest version has a lot more to help us troubleshoot. When you get stable again you should upgrade.


Hardware is one of the early HP MicroServers, an N36L. Nothing has really changed, and it's running the onboard controller. I updated the BIOS firmware to enable the CD/DVD drive to operate as a SATA disk, or something similar, from memory.

 

One parity disk and 4 data drives as mentioned earlier.

 

I'll update to current stable and see how things behave.


OK, so I unplugged all drives, board, cables, etc., plugged everything back in and kicked off a parity check. It still encounters sync errors.

 

While it goes through the check, I'll read up on the upgrade process and then get it current.


Have now upgraded to 6.8.3.

 

Seems like a similar number of errors are being corrected during the check:

image.png.a3356abf4524f2f4e5caf48d13065e9d.png

 

EDIT: Seems to have grown to more than before,..

image.png.6f514462583e5b6d416e7484655df353.png

 

Edited by jimbo123


Is this a correcting parity check?

 

After it finishes and before rebooting,

 

Go to Tools - Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

 

In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.

7 hours ago, trurl said:

In fact, don't reboot at all if you can help it while we are trying to track this down. We need to be able to compare this parity check with the next one in syslog, and syslog resets when you reboot.

Based on this, I'll kick off another parity check and provide the diagnostics zip when it's completed.


That Microserver uses ECC RAM, so it's unlikely to be a memory problem. It could be a disk, but IMHO it's most likely a board/controller issue. Do you have another controller you could use, or another board/PC you could transplant the disks to?


Hi JorgeB,

 

Yes, I have 2 others doing other things. One is a low-end ESX box (40L), which, by the looks of it, Unraid can now do, so it's probably the obvious candidate. I'll just have to figure out the whole BIOS thing, as from memory the onboard SATA port used for the CD/DVD drive at the top wouldn't support a disk, and some Russian guy came up with custom firmware to overcome this restriction, or something similar.

 

The 2nd parity check just finished, results below. They're pretty consistent over the 2 runs, only with fewer errors this time:

image.png.477d6041ac47704738f5bcd55f124fe0.png

 

I've also attached the diagnostic zip again.

 

Jim.....

 

storage-diagnostics-20201106-1936.zip


If there were issues before, this check is expected to correct some errors; if the issue is fixed, the next one should find 0 sync errors.

2 minutes ago, JorgeB said:

If there were issues before, this check is expected to correct some errors; if the issue is fixed, the next one should find 0 sync errors.

Clean shutdown,.. check was done earlier in the day with many corrected errors prior to replacing the board.

 

Will wait for this to finish and then run another tomorrow.

1 minute ago, jimbo123 said:

Clean shutdown,.. c

It's not about that. If there is a hardware problem, some sectors were being wrongly corrected, so once that's solved it's normal for the 1st check to still find errors. But if the 2nd one still finds more, then it's still not fixed.
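
A toy model of that reasoning, assuming XOR parity and a hypothetical fault that flips one bit when certain sectors of one disk are read (the `flaky_sectors` parameter, the `correcting_check` helper and all data values are invented for illustration). While the fault is present, a correcting check overwrites good parity with values computed from bad reads; once the fault is gone, the first check has to undo those writes, and only the second comes back clean.

```python
from functools import reduce
from operator import xor

def correcting_check(data_disks, parity, flaky_sectors):
    """Run one correcting check; return how many sectors were 'corrected'.

    flaky_sectors is a hypothetical stand-in for a hardware fault: reads of
    disk 0 at those sectors come back with one bit flipped.
    """
    corrections = 0
    for sector in range(len(parity)):
        blocks = [disk[sector] for disk in data_disks]
        if sector in flaky_sectors:
            blocks[0] ^= 0x01  # corrupted in transit, the stored data is fine
        expected = reduce(xor, blocks, 0)
        if expected != parity[sector]:
            parity[sector] = expected  # parity is overwritten with what was read
            corrections += 1
    return corrections

disks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
parity = [reduce(xor, column, 0) for column in zip(*disks)]  # starts in sync

with_fault = correcting_check(disks, parity, flaky_sectors={1, 3})
first_after_fix = correcting_check(disks, parity, flaky_sectors=set())
second_after_fix = correcting_check(disks, parity, flaky_sectors=set())

print(with_fault, first_after_fix, second_after_fix)
```

In a real array the bad reads may move around between runs, so the counts would not line up this neatly. The model only shows why one non-zero check after a hardware change is expected, and why a second non-zero check is not.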


OK, I'd say I've still got a problem here somewhere after the board replacement.

 

I'm currently doing another check. I did a few yesterday as well, 2 of which looked better, but the one running now is still correcting many errors.

 

Given I'm now running with a different board and different memory, is this now pointing to a disk?

 

image.png.38ec3240ac9dca825ac0b386dd2115cb.png

 

image.png.711913a83c3bb0d890f1152efb1b9dc6.png

 

 

