weirdcrap Posted September 2, 2021
Good morning. My latest non-correcting parity check on NODE reported 2 parity errors:
Sep 1 01:04:47 Node kernel: md: recovery thread: P incorrect, sector=682504056
Sep 1 03:37:49 Node kernel: md: recovery thread: P incorrect, sector=3443896592
There were none reported during last month's check, and there have been no unclean shutdowns or power outages (the server is on a UPS of adequate size). When the check finished I started another (what I thought was) non-correcting parity check to see if it hit the same sectors again. Well, I screwed up and started a correcting check instead 🥴
The correcting check is at about 75% and has gone way past the previously reported sectors. Interestingly, the second run has found and corrected only one error, in an entirely different sector:
Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000
I'm gonna let the check finish and run a 3rd non-correcting check to see if any new (or the same) sectors are reported. If everything comes back clean on the 3rd check, should I just consider them legitimate corrected errors and move on? If it isn't clean (with the reported sectors seemingly changing at random so far), should I assume there is a disk reporting bad data? How would I go about identifying the culprit? What other causes should I look at?
node-diagnostics-20210902-0816.zip
Edited September 30, 2021 by weirdcrap
trurl Posted September 2, 2021
6 minutes ago, weirdcrap said: What other causes should I look at?
memtest
weirdcrap Posted September 3, 2021
23 hours ago, trurl said: memtest
An update on this. The 3rd non-correcting check (92% completed) has flagged the exact same sector that the second correcting check supposedly already repaired???
2nd (correcting):
Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000
3rd (non-correcting):
Sep 2 21:08:17 Node kernel: md: recovery thread: P incorrect, sector=8096460000
So the sector is staying consistent now, which I assume makes RAM less likely. Is this a disk issue? I've never seen a correcting check not actually correct the parity mismatch before. Or did it correct it and that sector has changed again?
node-diagnostics-20210903-0552.zip
Update: Completed, just the one incorrect sector again. Should I run another parity check? Do something else?
Edited September 3, 2021 by weirdcrap
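One way to compare the runs in this post is to pull the sector numbers out of the syslog lines and see which ones repeat. This is only a minimal Python sketch, assuming the log lines keep the format quoted in the thread; the function name `sectors_by_run` and the run labels are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000
PATTERN = re.compile(r"md: recovery thread: (\w+) (incorrect|corrected), sector=(\d+)")


def sectors_by_run(runs):
    """Map each sector to the list of (run label, status) pairs that reported it."""
    seen = defaultdict(list)
    for label, lines in runs.items():
        for line in lines:
            match = PATTERN.search(line)
            if match:
                _parity_disk, status, sector = match.groups()
                seen[int(sector)].append((label, status))
    return dict(seen)


# Log lines copied from the posts above; the labels are hypothetical.
runs = {
    "check 1 (non-correcting)": [
        "Sep 1 01:04:47 Node kernel: md: recovery thread: P incorrect, sector=682504056",
        "Sep 1 03:37:49 Node kernel: md: recovery thread: P incorrect, sector=3443896592",
    ],
    "check 2 (correcting)": [
        "Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000",
    ],
    "check 3 (non-correcting)": [
        "Sep 2 21:08:17 Node kernel: md: recovery thread: P incorrect, sector=8096460000",
    ],
}

report = sectors_by_run(runs)
# Keep only sectors that were reported by more than one run.
repeated = {sector: hits for sector, hits in report.items() if len(hits) > 1}
for sector, hits in sorted(repeated.items()):
    print(sector, hits)
```

With the three runs above, only sector 8096460000 shows up more than once; the two sectors from the first check never reappear.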
JorgeB Posted September 3, 2021
3 hours ago, weirdcrap said: which I assume makes RAM less likely
Not really. If the first time it was wrongly corrected due to a RAM bit flip, the second time it would be corrected again to return to the original state.
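The reasoning above can be illustrated with a toy model of single parity. This is a simplified Python sketch, not Unraid's md driver: it assumes P is a plain bytewise XOR of the data blocks, and the `check` function and its `flip_bit` parameter are hypothetical stand-ins for a parity check and a RAM fault.

```python
from functools import reduce


def parity(blocks):
    """Single parity (P): bytewise XOR across all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check(data, stored, correcting, flip_bit=None):
    """Return (status, parity on disk after the check).

    flip_bit simulates a RAM fault: that bit of the first byte of the
    freshly computed parity is flipped before it is compared.
    """
    computed = bytearray(parity(data))
    if flip_bit is not None:
        computed[0] ^= 1 << flip_bit
    computed = bytes(computed)
    if computed == stored:
        return "ok", stored
    if correcting:
        return "corrected", computed  # the mismatching value gets written
    return "incorrect", stored  # reported only, nothing written


data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]  # three made-up data blocks
good = parity(data)

# Correcting check with a bit flip: good parity is overwritten with bad parity.
s1, on_disk = check(data, good, correcting=True, flip_bit=3)
# Non-correcting check without a flip: the same sector is now flagged.
s2, on_disk = check(data, on_disk, correcting=False)
# Correcting check without a flip: parity returns to its original state.
s3, on_disk = check(data, on_disk, correcting=True)
# Any later check is clean.
s4, on_disk = check(data, on_disk, correcting=False)

print(s1, s2, s3, s4)
```

In this model a clean-looking "corrected" can actually be the moment the damage is written, which is why the same sector showing up in the next check does not rule out RAM.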
weirdcrap Posted September 3, 2021
18 minutes ago, JorgeB said: Not really. If the first time it was wrongly corrected due to a RAM bit flip, the second time it would be corrected again to return to the original state.
OK, so you would recommend my next step be memtest with a few passes?
JorgeB Posted September 3, 2021
It's not always easy to find errors with memtest when getting so few errors. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.
weirdcrap Posted September 4, 2021
9 hours ago, JorgeB said: It's not always easy to find errors with memtest when getting so few errors. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.
Really? Well damn, that makes diagnosing this significantly more inconvenient. I'll have to adjust docker memory allocations and shut some less essential dockers down if I'm going to have to cut my RAM in half. I wasn't aware memtest was so flawed when it comes to small intermittent errors; I thought that was the whole point of the software? I'll have to get someone to do this Tuesday when the office is open again.
weirdcrap Posted September 6, 2021
The server hard locked tonight. I have no idea what happened. I sent someone over there to check on it and they could get no output on the console screen, so I've got no logs or anything... No clue if it's related or not. It's running a non-correcting parity check now after a forced reboot.
EDIT: Same exact sector as the last two parity checks:
md: recovery thread: P incorrect, sector=8096460000
Edited September 6, 2021 by weirdcrap
weirdcrap Posted September 15, 2021
On 9/3/2021 at 10:34 AM, JorgeB said: It's not always easy to find errors with memtest when getting so few errors. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.
Two DIMMs removed and the first correcting check is running.
weirdcrap Posted September 17, 2021
The first run corrected the same sector and the second reported zero errors, so that's a good sign.
Correcting check #1
Sep 15 07:19:12 Node kernel: mdcmd (37): check
Sep 15 07:19:12 Node kernel: md: recovery thread: check P ...
Sep 15 15:01:07 Node kernel: md: recovery thread: P corrected, sector=8096460000
Sep 16 01:05:00 Node kernel: md: sync done. time=63948sec
Sep 16 01:05:00 Node kernel: md: recovery thread: exit status: 0
Sep 16 01:07:02 Node Parity Check Tuning: manual Correcting Parity Check finished (1 errors)
Sep 16 01:07:02 Node Parity Check Tuning: Elapsed Time 17 hr, 45 min, 48 sec, Runtime 17 hr, 45 min, 48 sec, Increments 1, Average Speed 125.1MB/s
Correcting check #2
Sep 16 05:13:29 Node kernel: mdcmd (38): check
Sep 16 05:13:29 Node kernel: md: recovery thread: check P ...
Sep 16 23:06:51 Node kernel: md: sync done. time=64402sec
Sep 16 23:06:51 Node kernel: md: recovery thread: exit status: 0
Sep 16 23:07:01 Node Parity Check Tuning: manual Correcting Parity Check finished (0 errors)
Sep 16 23:07:01 Node Parity Check Tuning: Elapsed Time 17 hr, 53 min, 22 sec, Runtime 17 hr, 53 min, 22 sec, Increments 1, Average Speed 124.2MB/s
Swapped the set of DIMMs and testing again.
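As a sanity check on the numbers in these logs, the `time=...sec` values can be converted to hours, minutes, and seconds and combined with the reported average speed. A small Python sketch follows; it assumes the plugin's "MB/s" means 10^6 bytes per second, which the thread does not confirm, and the helper names are made up.

```python
def hms(seconds):
    """Convert a duration in seconds to (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


def data_covered_tb(seconds, avg_mb_per_s):
    """Approximate data covered in TB, assuming 1 MB = 10**6 bytes."""
    return seconds * avg_mb_per_s * 1e6 / 1e12


# (label, "sync done" time in seconds, reported average speed in MB/s)
checks = [("check #1", 63948, 125.1), ("check #2", 64402, 124.2)]

for label, seconds, speed in checks:
    print(label, hms(seconds), round(data_covered_tb(seconds, speed), 2), "TB")
```

Both durations match the "Elapsed Time" lines (17 hr 45 min 48 sec and 17 hr 53 min 22 sec), and both runs work out to roughly 8 TB covered, so the two checks appear to have scanned the same amount of data at nearly the same speed.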
JorgeB Posted September 18, 2021
10 hours ago, weirdcrap said: so that's a good sign
yep.
weirdcrap Posted September 18, 2021
14 hours ago, JorgeB said: yep.
Well, it hard locked and crashed within a few hours of the second set being installed and a parity check started. I've got someone going over today to power it off and back on. If it continues to be unstable with this set of DIMMs, I wager a replacement is in order?
EDIT: Unclean shutdown. I'm letting it run its non-correcting check, and it's already found 73 errors. It was in the middle of importing a bunch of stuff from sonarr, but it was all going to the cache drive (mover doesn't run till 3AM), so that wouldn't be the cause of these new parity errors, right? After the non-correcting check is finished, should I continue with the correcting checks?
Edited September 18, 2021 by weirdcrap
weirdcrap Posted September 20, 2021
@JorgeB It finished the non-correcting check, 73 errors. Within a few hours of starting a new correcting check with the second set of RAM, it has again hard locked and the server is unresponsive. So at this point I've tried two correcting checks and it kernel panics each time with this RAM. Should I assume it's bad and replace? I find it interesting that it only happens during the correcting check. I had no problems with the first set of RAM.
Edited September 20, 2021 by weirdcrap
JorgeB Posted September 20, 2021
5 hours ago, weirdcrap said: Should I assume it's bad and replace?
You should try that if possible.
weirdcrap Posted September 20, 2021
7 hours ago, JorgeB said: You should try that if possible.
I was able to find the same kit NIB on eBay, so once it gets here I will replace and run a test with just the two new sticks to see if the panics stop.
weirdcrap Posted September 28, 2021
It seems like the RAM has done the trick. It's ~75% through a correcting check with no errors and it hasn't hard locked or crashed yet. However, I just got a random notification from the server letting me know my raw read error rate on disk 1 is some ridiculous number:
28-09-2021 05:27 PM Unraid Disk 1 SMART health [1] Warning [NODE] - raw read error rate is 65536 WDC_WD80EFAX-68KNBN0_VAJBBYUL (sdd)
Which is odd, because when I go to check the SMART stats in the GUI it says my raw read error rate is zero???
ChatNoir Posted September 29, 2021
65536 is exactly 2^16. That is odd that you get that value from a notification and nothing from the GUI. You might want to do an extended SMART test on that drive after the parity check.
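The observation that 65536 is exactly 2^16 can be checked in a couple of lines. The Python sketch below also splits the value into 16-bit fields, purely as an illustration: the idea that this drive packs several fields into the raw value is an assumption and is not confirmed anywhere in this thread, and `split_words` is a made-up helper.

```python
def split_words(raw, width=16, count=3):
    """Split a raw value into `count` fields of `width` bits, lowest field first."""
    mask = (1 << width) - 1
    return [(raw >> (i * width)) & mask for i in range(count)]


raw = 65536  # value from the notification

print(raw == 2 ** 16)        # exact power of two
print(hex(raw))              # 0x10000
print(bin(raw).count("1"))   # number of bits set
print(split_words(raw))      # hypothetical 16-bit fields, lowest first
```

A value with a single bit set and the low 16 bits all zero looks more like one bit toggling than like a count of 65,536 individual read errors, which would fit a reporting quirk rather than real errors, but that reading is speculative.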
JorgeB Posted September 29, 2021
8 hours ago, weirdcrap said: Which is odd because when I go to check the smart stats in the GUI it says my raw read error rate is zero???
Those are usually due to firmware issues: the value changed and then went back to 0. That should be safe to ignore.
weirdcrap Posted September 29, 2021
6 hours ago, ChatNoir said: 65536 is exactly 2^16. That is odd that you get that value from a notification and nothing from the GUI. You might want to do an extended SMART test on that drive after the parity check.
I ran a short SMART test yesterday, but I'll turn off drive sleep and run an extended one now that the parity check is finished.
5 hours ago, JorgeB said: Those are usually due to firmware issues, value changed and went back to 0, that should be safe to ignore.
That's good to hear. I know it isn't one of the default monitored SMART attributes (I assume for reasons like this), but I had enabled it after coming across a recommendation in another thread, when I was troubleshooting some disk issues, that it's useful for drives from certain vendors. I've never seen a WD drive with anything but zero for that attribute, but I have Seagate drives in my other server that all report a very high number for this attribute, so I don't monitor it on VOID.
EDIT: Oh, and the correcting check completed with zero errors! Thanks trurl and JorgeB for helping me figure out it was the RAM. I think this is the first time I've ever had a computer issue and it was actually a bad stick of RAM.
Edited September 29, 2021 by weirdcrap
weirdcrap Posted September 30, 2021
Apparently I'm just going to have to un-monitor that SMART attribute. Disk 1 just keeps randomly hitting me with notifications for the raw read error rate even though it isn't actually incrementing whatsoever.