OK, there's definitely a problem; I managed to reproduce it with an identical array config:
My plan is to try to find out under which circumstances this happens, whether it's disk size or config related, and then create a bug report, but since I can't reproduce this with small disks it could be a long and tedious process. I expect @limetech is very busy at the moment with the latest beta and this isn't an urgent issue, but if you have any idea what might be causing this with the info below, please advise.
Quick summary:
-over the last couple of years or so, multiple users, more than could be explained by just user error or other issues, have complained of sync errors after a parity swap; the errors start on the extra parity section, i.e., if the old parity was 8TB and the new parity is 10TB, sync errors start once the check passes the 8TB mark
-I could never reproduce this before on my test server, but @Juniper was willing to repeat the procedure twice so he/she could post the diags, and both times it resulted in sync errors; looking at the diags, the procedure was done correctly as far as I could tell.
-I was finally able to reproduce this using the exact same array config as the OP; like all the other cases, the sync errors start immediately past the old parity size:
Aug 2 09:14:08 Tower2 kernel: mdcmd (42): check nocorrect
Aug 2 09:14:08 Tower2 kernel: md: recovery thread: check P ...
Aug 2 09:15:01 Tower2 sSMTP[4234]: Creating SSL connection to host
Aug 2 09:15:01 Tower2 sSMTP[4234]: SSL connection using TLS_AES_256_GCM_SHA384
Aug 2 09:15:03 Tower2 sSMTP[4234]: Sent mail for
[email protected] (221 2.0.0 closing connection d11sm13298600wrw.77 - gsmtp) uid=0 username=xxx outbytes=708
Aug 2 09:38:35 Tower2 kernel: mlx4_en: eth2: Link Down
Aug 2 10:18:41 Tower2 kernel: mlx4_en: eth2: Link Up
Aug 2 13:47:06 Tower2 kernel: mlx4_en: eth2: Link Down
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053064
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053072
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053080
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053088
Looking at a hex dump of this first block (after adding the 64 sectors from before the partition starts) you can confirm there's data there:
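For reference, that block can be read straight off the disk with something along these lines (same /dev/sdb device as in the dump further down; I'm not reproducing the output here):

dd if=/dev/sdb skip=$((15628053064 + 64)) count=8 | hexdump -C    # first flagged md sector + 64-sector partition offset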
Curiously there are some good blocks in between; notice the 20-block jump here:
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053488
Aug 3 00:10:09 Tower2 kernel: md: recovery thread: P incorrect, sector=15628053648
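That gap works out to exactly 20 of the 8-sector (4KiB) blocks the recovery thread reports in (my arithmetic):

echo $(( (15628053648 - 15628053488) / 8 ))    # 20 blocks between the two flagged sectors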
And of course the disk really is zeroed in these:
root@Tower2:~# dd if=/dev/sdb skip=15628053560 count=152 | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00013000
152+0 records in
152+0 records out
77824 bytes (78 kB, 76 KiB) copied, 0.000285448 s, 273 MB/s
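For clarity, the dd offsets above line up with the flagged sectors like this (again just my arithmetic):

echo $(( 15628053488 + 8 + 64 ))               # 15628053560 -> the skip= value (first good sector + partition offset)
echo $(( 15628053648 - (15628053488 + 8) ))    # 152 -> the count= value (length of the good gap in sectors)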
Also curious to me is that for the OP parity is wrong every third block (at least in the logged errors):
Jul 2 17:00:03 Schiethucken kernel: md: recovery thread: check P ...
Jul 2 17:00:09 Schiethucken emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Jul 3 03:40:01 Schiethucken root: mover: cache not present, or only cache present
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628053640
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628054664
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628055688
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628056712
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628057736
Jul 3 06:00:38 Schiethucken kernel: md: recovery thread: P corrected, sector=15628058760
The OP started with a precleared parity disk the first time and a parity-corrected one the second time, which means that part of the disk started as all zeros both times. From the logs/graphs it looked to me that when I did it the new parity disk was written completely to the end after the parity copy, so the most likely explanation is that the extra capacity of the new parity disk isn't being correctly zeroed after the parity copy. But I would assume you're using dd to zero out the extra capacity, so I can't imagine how this could happen, any ideas?
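In case anyone wants to check this on their own parity swap, a quick spot-check along these lines right after the parity copy (before starting the check) should show whether the extra capacity actually reads back as zeros; /dev/sdb and the sector numbers are from my setup, adjust as needed:

END=$(blockdev --getsz /dev/sdb)                        # total sectors on the new parity disk
for off in $((15628053064 + 64)) $((END - 2048)); do    # just past the old 8TB boundary, and near the end of the disk
    dd if=/dev/sdb skip=$off count=2048 2>/dev/null | hexdump -C
done
# all-zero output (hexdump collapses the zero run to a single '*' line) means that region was zeroed correctly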