Lots of parity errors

So I recently migrated from FreeNAS using two new 10TB shucked WD white-label drives, 3 x 2TB WD Red drives, and an old 120GB SSD from my last setup. Everything seemed to work fine until the first scheduled parity check, which returned several tens of thousands of sync errors. I ran a full SMART check on the parity drive and no issues were reported. I have just run another parity check, and this time it came back with 59073 errors.

 

I know the SSD I am using for cache has some SMART errors reported, but I thought the cache drive was not part of the parity? I have also been using the /mnt/cache folder directly, so I'm not sure whether that could also be an issue. The next step I was going to take was to remove the cache drive entirely for now and see if the errors go away. Is there anything else I can do to narrow down the cause of these parity errors? I have attached my diagnostics file.

 

Thanks!

illmatic-diagnostics-20191204-0939.zip

25 minutes ago, threesquared said:

but I thought the cache drive was not part of the parity?

It's not, and this is unlikely to be a disk problem. IIRC those servers can use ECC or non-ECC RAM: if you're using non-ECC, run a memtest; if ECC, I would guess the board/controller as the most likely culprits.

 


Thanks for clarifying. I am using an HP N36L MicroServer with 16GB of ECC RAM at the moment. As luck would have it, I just got an N54L and was planning on moving everything over to that motherboard to upgrade the CPU. I suppose I will give that a go and see if the errors go away.

 

I assume that with that many errors I am not protected by parity in case of a disk failure? It is odd, as I never had any issues when running a 5 x 2TB ZFS RAIDZ pool on the same hardware, and I resilvered the array after a disk failure not that long ago without any data loss that I noticed...

 

Thanks again for your help.

1 minute ago, threesquared said:

I assume that with that many errors I am not protected by parity in case of a disk failure?

Correct.

 

1 minute ago, threesquared said:

It is odd, as I never had any issues when running a 5 x 2TB ZFS RAIDZ pool on the same hardware, and I resilvered the array after a disk failure not that long ago without any data loss that I noticed...

Yes, though hardware can fail at any time. It's also odd that the sync errors started on the same sectors in both runs, but if I understood correctly the total number of errors was very different, correct?

 

Dec  1 03:00:01 illmatic kernel: md: using 1536k window, over a total of 9766436812 blocks.
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194400
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194408
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194416
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194424
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194432
Dec  1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194440

 

Dec  3 09:30:01 illmatic kernel: md: using 1536k window, over a total of 9766436812 blocks.
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194400
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194408
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194416
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194424
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194432
Dec  3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194440
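A quick sanity check on those repeated sector numbers (a sketch, assuming the 512-byte sectors that the Linux md layer reports in its logs):

```python
SECTOR_BYTES = 512  # assumption: md logs use 512-byte sectors
sectors = [4194400, 4194408, 4194416, 4194424, 4194432, 4194440]

# Each corrected sector is 8 sectors after the previous one, i.e. one
# 4 KiB block per correction:
strides = {b - a for a, b in zip(sectors, sectors[1:])}
block_bytes = next(iter(strides)) * SECTOR_BYTES

# The run starts just past the 2 GiB point of the array address space:
start_offset = sectors[0] * SECTOR_BYTES
print(strides, block_bytes, start_offset)
```

So both runs begin correcting the same 4 KiB-aligned region just past the 2 GiB mark, which points to something systematic rather than random disk errors.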

 

All I can say is that I don't see how these sync errors could be a bug or a software issue; only a hardware problem makes sense.


OK, I will see if swapping out some hardware makes any difference. I think the first parity check definitely had more errors, but I couldn't be sure exactly how many more. Is there any way to work out which disk those sectors starting at 4194400 are on?

 

Edit: So actually the first check returned 392611 errors and the second one 59073 errors.

Edited by threesquared

3 minutes ago, threesquared said:

Is there any way to work out which disk those sectors starting at 4194400 are on?

Unfortunately no.

31 minutes ago, johnnie.black said:

I don't see how these sync errors can be a bug or software

Unless something is writing to an array device via sdX1 instead of the md device: for example, using dd to write to /dev/sdc1 instead of /dev/md1 (disk1) would put parity out of sync.
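As a toy illustration of why a direct sdX write desyncs parity (a sketch with made-up byte values; real Unraid parity works per sector across all array disks, not on three-byte lists):

```python
from functools import reduce
from operator import xor

# Toy single-parity model (an illustration, not Unraid's actual md
# driver): the parity byte for each stripe is the XOR of the data
# disks' bytes at that position.
def parity(disks):
    return [reduce(xor, stripe) for stripe in zip(*disks)]

disk1 = [0x11, 0x22, 0x33]
disk2 = [0x44, 0x55, 0x66]
p = parity([disk1, disk2])           # parity kept by the array

# A write through the md device updates data and parity together,
# so a parity check still passes:
disk1[0] = 0xAA
p = parity([disk1, disk2])           # md recomputes parity on write

# A write straight to /dev/sdX1 changes the data behind md's back and
# leaves the stored parity stale, so the next check flags that stripe:
disk2[1] = 0xBB                      # direct write, parity not updated
mismatches = [i for i, (stored, actual)
              in enumerate(zip(p, parity([disk1, disk2])))
              if stored != actual]
print(mismatches)                    # the stripe written behind md's back
```

Every stripe written this way shows up as one sync error on the next check, which is why a stray direct write can produce large error counts.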


Just a quick update in case anyone reads this. I swapped out the motherboard/CPU tray and also removed the cache drive. I re-ran the parity check, which found 20455 more errors, but the run after that came back with zero errors. Hopefully the issue is sorted now and was most likely hardware related.

