threesquared Posted December 4, 2019

So I recently migrated from FreeNAS using two new 10TB shucked WD white label drives, 3 x 2TB WD Red drives, and an old 120GB SSD from my last setup. Everything seemed to work fine until I performed the first scheduled parity check, which returned several tens of thousands of sync errors. I ran a full SMART check on the parity drive and there was no issue reported. I have just run another parity check and this time it came back with 59073 errors.

I know the SSD I am using for cache has some SMART errors reported, but I thought the cache drive was not part of the parity? I have also been using the /mnt/cache folder directly, so not sure if that could also be an issue? The next step I was going to take was to remove the cache drive entirely for now and see if the errors go away. Is there anything else I can do to narrow down the cause of these parity issues? I have attached my diagnostics file. Thanks!

illmatic-diagnostics-20191204-0939.zip
JorgeB Posted December 4, 2019

25 minutes ago, threesquared said:
but I thought the cache drive was not part of the parity?

It's not, and this is unlikely to be a disk problem. IIRC those servers can use ECC or non-ECC RAM: if using non-ECC, run a memtest; if using ECC, I would guess the board/controller as the most likely culprits.
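To illustrate why the cache drive cannot cause sync errors: Unraid's single parity is, conceptually, a byte-wise XOR across the array data disks only, and a cache/pool device is simply never an input to that calculation. A minimal sketch (toy data, not Unraid's actual implementation):

```python
from functools import reduce

def parity(data_disks: list[bytes]) -> bytes:
    """Single parity: byte-wise XOR across all array data disks.
    A cache device is never in this list, so its contents (or its
    SMART state) cannot influence the parity result."""
    return bytes(reduce(lambda a, b: a ^ b, stripe) for stripe in zip(*data_disks))

# Three hypothetical data disks; note the cache SSD is not included
disks = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
p = parity(disks)

# A parity check passes when XOR of all data disks plus parity is zero
assert all(byte == 0 for byte in parity(disks + [p]))
```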
threesquared Posted December 4, 2019 (Author)

Thanks for clarifying. I am using an HP N36L MicroServer with 16GB of ECC RAM at the moment. As luck would have it, I just got an N54L and was planning on moving everything over to that motherboard to upgrade the CPU, so I will give that a go and see if the errors go away.

I assume that with that many errors I am not protected by parity in case of a disk failure? It is odd, as I never had any issues when running a 5 x 2TB ZFS RAIDZ pool on the same hardware, and I resilvered the array due to a disk failure not that long ago without any data loss that I noticed. Thanks again for your help.
JorgeB Posted December 4, 2019

1 minute ago, threesquared said:
I assume that with that many errors I can safely assume I am not protected by parity in case of a disk failure?

Correct.

1 minute ago, threesquared said:
It is odd as I never had any issues when running a 5 x 2TB ZFS RAIDZ pool on the same hardware and I resilvered the array due to a disk failure not that long ago and didn't experience any data loss that I noticed...

Yes, though hardware can fail at any time. It is also odd that the sync errors started on the same sectors in both runs, but if I understood correctly the total number was very different, correct?

Dec 1 03:00:01 illmatic kernel: md: using 1536k window, over a total of 9766436812 blocks.
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194400
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194408
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194416
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194424
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194432
Dec 1 03:00:19 illmatic kernel: md: recovery thread: P corrected, sector=4194440

Dec 3 09:30:01 illmatic kernel: md: using 1536k window, over a total of 9766436812 blocks.
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194400
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194408
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194416
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194424
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194432
Dec 3 09:30:19 illmatic kernel: md: recovery thread: P corrected, sector=4194440

All I can say is that I don't see how these sync errors can be a bug or software; only a hardware issue makes sense.
threesquared Posted December 4, 2019 (Author)

Ok, I will see if swapping out some hardware makes any difference. I think the first parity check definitely had more errors, but I couldn't be sure exactly how many more. Is there any way to work out which disk those sectors starting at 4194400 are on?

Edit: So actually the first check returned 392611 errors and the second one 59073 errors.
JorgeB Posted December 4, 2019

3 minutes ago, threesquared said:
Is there any way to work out what disk those sectors starting at 4194400 are on?

Unfortunately no.
JorgeB Posted December 4, 2019

31 minutes ago, johnnie.black said:
I don't see how these sync errors can be a bug or software

Unless something is writing to an array device using sdX1 instead of the md device. For example, using dd to write to /dev/sdc1 instead of /dev/md1 (disk1) would put parity out of sync.
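A toy model of that failure mode, using plain XOR parity (the function names and the two-disk setup are illustrative stand-ins, not Unraid internals): writes through the md device update parity atomically, while a write straight to the underlying partition leaves parity stale.

```python
from functools import reduce

def xor(blocks):
    """Byte-wise XOR across equal-length blocks (single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

disk1 = bytearray(b"\x00\x10\x20")
disk2 = bytearray(b"\x01\x02\x03")
parity = bytearray(xor([disk1, disk2]))

def write_md(disk, offset, value):
    """Model of writing via /dev/mdX: parity is updated in the same step."""
    parity[offset] ^= disk[offset] ^ value  # fold out old data, fold in new
    disk[offset] = value

def write_sdx(disk, offset, value):
    """Model of writing via /dev/sdX1 directly: the md driver never sees it."""
    disk[offset] = value

write_md(disk1, 0, 0xAA)
assert xor([disk1, disk2]) == bytes(parity)   # still in sync

write_sdx(disk2, 1, 0xBB)
assert xor([disk1, disk2]) != bytes(parity)   # a sync error on the next check
```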
threesquared Posted December 6, 2019 (Author)

Just a quick update in case anyone reads this. I swapped out the motherboard/CPU tray and also removed the cache drive. The next parity check found another 20455 errors, but when I ran it again after that it came back with zero errors. Hopefully the issue is sorted now and was most likely hardware related.