May 8, 2012

Unraid 4.6; 10 drives (first logical/physical 6 through onboard SATA, then 2 through Sil3132, then 2 through Sil3132).

I had no parity drive for a long time. (long story, no lectures please; I understand the value) But, with this config, I had previously used parity drives with parity checks, and it would check cleanly.

Now:
- I do a parity check, and I come up with 25 or so errors.
- I repeat the check, and there are 24 errors, in the same places.
- I change memory from 16GB down to a smaller 1GB config that previously also cleanly worked
- I do a parity check, and I get 7 errors
- Repeat the parity check, same 7 errors in the same locations:

root@Tower:/var/log# cat syslog | grep parity
May 7 22:28:52 Tower kernel: md: recovery thread checking parity...
May 7 22:29:06 Tower kernel: md: parity incorrect: 1457392
May 7 22:30:32 Tower kernel: md: parity incorrect: 11666128
May 7 22:31:07 Tower kernel: md: parity incorrect: 16181704
May 7 22:31:09 Tower kernel: md: parity incorrect: 16373960
May 7 22:34:02 Tower kernel: md: parity incorrect: 38137720
May 8 00:46:17 Tower kernel: md: parity incorrect: 1033482040
May 8 00:53:42 Tower kernel: md: parity incorrect: 1091494984

What should I look for? I'm grinding on the memory with memtest86+ right now, and I'm thinking later I should grind on it with google's stressapptest (perhaps it's the pathways at fault more than the memory itself).
May 8, 2012

How are you invoking the parity "check"? (I'm trying to determine if you are performing a "correcting" check or a "nocorrect" check.)

Your symptoms, a set of blocks found incorrect and then found incorrect again on a subsequent check, could be either of the following:

You are performing a NOCORRECT type of check, in which case the exact same blocks will show up again and again until you correct them.

OR you are performing a correcting type of check, and something in the hardware is intermittent, so the FIRST parity check changed parity to what it thought was correct based upon the incorrectly returned data, AND the second parity check read correct data and so changed parity once more on the same blocks to restore parity to what it should be. A third parity check would then find no errors (if this was the case).

Joe L.
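Both explanations hinge on whether consecutive checks flag the same blocks, so it can help to compare the block numbers mechanically instead of by eye. The following is a minimal sketch, assuming the syslog was saved to a separate file after each check; the function names and file names are hypothetical, and the only log format relied on is the "md: parity incorrect: N" line quoted in the first post.

```shell
# Print the block numbers from "md: parity incorrect: N" lines, one per line.
# $1: path to a saved copy of the syslog (assumption: one file per check run).
parity_blocks() {
    awk '/md: parity incorrect:/ { print $NF }' "$1" | sort -u
}

# Print the block numbers that were reported in BOTH runs.
# $1, $2: saved syslog copies from two different checks.
common_blocks() {
    _pb_a=$(mktemp) || return 1
    _pb_b=$(mktemp) || { rm -f "$_pb_a"; return 1; }
    parity_blocks "$1" > "$_pb_a"
    parity_blocks "$2" > "$_pb_b"
    # comm -12 keeps only lines present in both (identically sorted) inputs.
    comm -12 "$_pb_a" "$_pb_b"
    rm -f "$_pb_a" "$_pb_b"
}
```

For example, after copying /var/log/syslog to check1.log and check2.log following each run, `common_blocks check1.log check2.log` would list the blocks that both checks flagged. If every block appears in both lists, that fits either of the two cases above, so it is worth also confirming which type of check actually ran.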
May 8, 2012 Author

Actually, funny you mention it. The parity check started on its own when I booted the server. But at least one of them I invoked manually by pressing the big "check parity" button in the web GUI. Keep in mind this is 4.6. How do I know what kind of check I'm running? I don't think I saw that in /var/log/ ...

I'm starting to think that, in the case of the 2 consecutive checks, what I saw was a "NOCORRECT" check followed by an implicit "CORRECT" one (from the web GUI button). Perhaps a third test now will come back clean?
May 9, 2012 Author

Ok, as Joe suspected, I think what was happening was that unclean shutdowns were inducing a NOCORRECT parity check on startup. I see the command now in the log (when I press the "parity check" button in the GUI):

May 8 20:30:02 Tower kernel: mdcmd (26): check CORRECT

And this check ran with no errors. So, we're good here. Thanks :-)
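For anyone else wondering which kind of check ran, the mode can be pulled out of the syslog rather than hunted for by hand. This is a rough sketch based only on the "mdcmd (26): check CORRECT" line shown above; that a non-correcting check is logged in the same shape with NOCORRECT as the last word is an assumption, and the function name is hypothetical.

```shell
# List each logged check command with its timestamp and mode.
# $1: path to the syslog (e.g. /var/log/syslog).
# Assumption: NOCORRECT checks are logged like "mdcmd (N): check NOCORRECT".
check_modes() {
    awk '/mdcmd \([0-9]+\): check/ {
        # If nothing follows "check", the mode was not given in the log line.
        mode = ($NF == "check") ? "UNSPECIFIED" : $NF
        print $1, $2, $3, mode
    }' "$1"
}
```

Running `check_modes /var/log/syslog` on the log above should print one line per check, ending in CORRECT for the check started from the GUI button.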