[SOLVED] Parity errors without obvious cause


weirdcrap

Good morning.

 

On my latest non-correcting parity check on NODE I received 2 reported parity errors.

Sep 1 01:04:47 Node kernel: md: recovery thread: P incorrect, sector=682504056
Sep 1 03:37:49 Node kernel: md: recovery thread: P incorrect, sector=3443896592

 

There were none reported during last month's check and there have been no unclean shutdowns or power outages (server is on a UPS of adequate size).

 

When the check finished I started what I thought was another non-correcting parity check to see if it hit the same sectors again. Well, I screwed up and started a correcting check instead 🥴

 

The correcting check is at about 75% and has gone way past the previously reported sectors. Interestingly the second run has found and corrected only one error in an entirely different sector: 

 

Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000

 

I'm gonna let the check finish and run a 3rd non-correcting check to see if any new (or the same) sectors are reported.

 

If everything comes back clean on the 3rd check should I just consider them legitimate corrected errors and move on?

 

If it isn't clean (with the reported sectors seemingly changing at random so far) should I assume there is a disk reporting bad data? How would I go about identifying the culprit?

 

What other causes should I look at?

 

node-diagnostics-20210902-0816.zip

Edited by weirdcrap
23 hours ago, trurl said:

memtest

An update on this. The 3rd check (non-correcting, 92% completed) has flagged the exact same sector that the 2nd check (correcting) supposedly already repaired???

2nd (correcting):
Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000

3rd (non-correcting):
Sep  2 21:08:17 Node kernel: md: recovery thread: P incorrect, sector=8096460000

 

So the sector is staying consistent now, which I assume makes RAM less likely. Is this a disk issue?

 

I've never seen a correcting check not actually correct the parity mismatch before. Or did it correct it, and that sector has changed again?
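For anyone wanting to compare checks the same way, here is a minimal Python sketch that groups the sectors from the syslog lines quoted in this thread by the check that reported them. The run labels and the `sectors_by_run` helper are made up for illustration; only the log lines themselves come from the posts above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   "... md: recovery thread: P corrected, sector=8096460000"
MD_LINE = re.compile(r"md: recovery thread: (\w+) (incorrect|corrected), sector=(\d+)")


def sectors_by_run(runs):
    """Map each reported sector to the list of run labels that reported it."""
    seen = defaultdict(list)
    for label, lines in runs.items():
        for line in lines:
            match = MD_LINE.search(line)
            if match:
                seen[int(match.group(3))].append(label)
    return dict(seen)


# Log lines copied from the thread; the labels are hypothetical names.
runs = {
    "check 1 (non-correcting)": [
        "Sep 1 01:04:47 Node kernel: md: recovery thread: P incorrect, sector=682504056",
        "Sep 1 03:37:49 Node kernel: md: recovery thread: P incorrect, sector=3443896592",
    ],
    "check 2 (correcting)": [
        "Sep 2 02:39:36 Node kernel: md: recovery thread: P corrected, sector=8096460000",
    ],
    "check 3 (non-correcting)": [
        "Sep  2 21:08:17 Node kernel: md: recovery thread: P incorrect, sector=8096460000",
    ],
}

report = sectors_by_run(runs)
# Keep only sectors that show up in more than one check.
recurring = {sector: labels for sector, labels in report.items() if len(labels) > 1}
print(recurring)
```

A sector that shows up in only one run and one that keeps coming back are different symptoms, so keeping the per-run grouping makes that distinction easy to see.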

node-diagnostics-20210903-0552.zip

 

Update: Completed, just the one incorrect sector again. Should I run another parity check? Do something else?

Edited by weirdcrap

It's not always easy to find errors with memtest when there are so few of them. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.

9 hours ago, JorgeB said:

It's not always easy to find errors with memtest when there are so few of them. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.

Really? Well damn that makes diagnosing this significantly more inconvenient. I'll have to adjust docker memory allocations and shut some less essential dockers down if I'm going to have to cut my RAM in half.

 

I wasn't aware memtest was so flawed when it comes to small intermittent errors; I thought that was the whole point of the software?

 

I'll have to get someone to do this Tuesday when the office is open again.


The server hard locked tonight. I have no idea what happened. I sent someone over there to check on it and they could get no output on the console screen, so I've got no logs or anything...

 

No clue if it's related or not. It's running a non-correcting parity check now after a forced reboot.

 

EDIT: Same exact sector as the last two parity checks:

 

md: recovery thread: P incorrect, sector=8096460000
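As a rough way to see where that sector sits, the sketch below converts it to a byte offset. It assumes the md driver counts 512-byte sectors and that the parity disk is a nominal 8 TB; neither is stated in this thread, so treat both constants as assumptions.

```python
SECTOR_BYTES = 512        # assumption: md reports 512-byte sectors
DISK_BYTES = 8 * 10**12   # assumption: nominal 8 TB parity disk


def sector_to_offset(sector, sector_bytes=SECTOR_BYTES):
    """Return the byte offset at which the given sector starts."""
    return sector * sector_bytes


offset = sector_to_offset(8096460000)
fraction = offset / DISK_BYTES
print(f"offset = {offset} bytes ({offset / 1e12:.3f} TB), about {fraction:.0%} into the disk")
```

Keep in mind that a parity mismatch at a given position only says that parity and data disagreed there; on its own it does not identify which disk is at fault.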

Edited by weirdcrap
  • 2 weeks later...
On 9/3/2021 at 10:34 AM, JorgeB said:

It's not always easy to find errors with memtest when there are so few of them. Since you have 4 DIMMs, I would suggest removing a couple and running two correcting parity checks. If the issue persists with both sets, then it's likely not a RAM problem. Note that the first check after the problem is fixed might still find errors, but the next ones should always find 0.

Two DIMMs removed and first correcting check is running.


The first run corrected the same sector and the second reported zero errors so that's a good sign.

 

Correcting Check #1
Sep 15 07:19:12 Node kernel: mdcmd (37): check 
Sep 15 07:19:12 Node kernel: md: recovery thread: check P ...
Sep 15 15:01:07 Node kernel: md: recovery thread: P corrected, sector=8096460000
Sep 16 01:05:00 Node kernel: md: sync done. time=63948sec
Sep 16 01:05:00 Node kernel: md: recovery thread: exit status: 0
Sep 16 01:07:02 Node Parity Check Tuning: manual Correcting Parity Check finished (1 errors)
Sep 16 01:07:02 Node Parity Check Tuning: Elapsed Time 17 hr, 45 min, 48 sec, Runtime 17 hr, 45 min, 48 sec, Increments 1, Average Speed 125.1MB/s

Correcting Check #2
Sep 16 05:13:29 Node kernel: mdcmd (38): check 
Sep 16 05:13:29 Node kernel: md: recovery thread: check P ...
Sep 16 23:06:51 Node kernel: md: sync done. time=64402sec
Sep 16 23:06:51 Node kernel: md: recovery thread: exit status: 0
Sep 16 23:07:01 Node Parity Check Tuning: manual Correcting Parity Check finished (0 errors)
Sep 16 23:07:01 Node Parity Check Tuning: Elapsed Time 17 hr, 53 min, 22 sec, Runtime 17 hr, 53 min, 22 sec, Increments 1, Average Speed 124.2MB/s
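The figures in those two summaries can be cross-checked with a little arithmetic. A small Python sketch, assuming the parity disk is a nominal 8 TB (the thread does not say so outright) and that the reported speed uses decimal megabytes; the helper names are made up for illustration:

```python
PARITY_BYTES = 8 * 10**12  # assumption: nominal 8 TB parity disk


def hms(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


def avg_speed_mb_s(total_bytes, seconds):
    """Average throughput in decimal MB/s, rounded to one decimal place."""
    return round(total_bytes / seconds / 1e6, 1)


# "time=...sec" values taken from the two "md: sync done" lines above.
for label, secs in (("check #1", 63948), ("check #2", 64402)):
    print(label, hms(secs), avg_speed_mb_s(PARITY_BYTES, secs))
```

Under that 8 TB assumption both runs come out at the elapsed times and average speeds shown in the log, so the two checks covered the same amount of data at essentially the same pace.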

 

Swapped the set of DIMMs and testing again.

14 hours ago, JorgeB said:

yep.

Well, it hard locked and crashed within a few hours of the second set being installed and a parity check started.

 

I've got someone going over today to power it off and back on. If it continues to be unstable with this set of DIMMs, I wager a replacement is in order?

 

EDIT: Unclean shutdown. I'm letting it run its non-correcting check; it's already found 73 errors. It was in the middle of importing a bunch of stuff from Sonarr, but it was all going to the cache drive (mover doesn't run till 3AM), so that wouldn't be the cause of these new parity errors, right?

 

After the non-correcting check is finished should I continue with the correcting checks?

Edited by weirdcrap

@JorgeB It finished the non-correcting check, 73 errors.

 

Within a few hours of starting a new correcting check with the second set of RAM it has again hard locked and the server is unresponsive.

 

So at this point I've tried two correcting checks and it kernel panics each time with this RAM. Should I assume it's bad and replace?

 

I find it interesting that it only happens during the correcting check.

 

I had no problems with the first set of RAM.

Edited by weirdcrap
  • 2 weeks later...

It seems like the RAM has done the trick: it's ~75% through a correcting check with no errors and it hasn't hard locked or crashed yet.

 

However, I just got a random notification from the server letting me know the raw read error rate on disk 1 is some ridiculous number:

28-09-2021 05:27 PM	Unraid Disk 1 SMART health [1]	Warning [NODE] - raw read error rate is 65536	WDC_WD80EFAX-68KNBN0_VAJBBYUL (sdd)

Which is odd, because when I go to check the SMART stats in the GUI it says my raw read error rate is zero???

6 hours ago, ChatNoir said:

65536 is exactly 2^16.

 

That is odd that you get that value from a notification and nothing from the GUI.

You might want to do an extended SMART test on that drive after the parity check.
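To make the 2^16 observation concrete, here is a small Python sketch. SMART raw values are 48-bit fields, and some vendors pack several counters into them; how WD uses this particular attribute is not something this thread establishes, so the 16-bit split below is only an illustration and `split_raw48` is a hypothetical helper.

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, low word first."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    return [(raw >> shift) & 0xFFFF for shift in (0, 16, 32)]


raw = 65536  # value from the notification
print(raw == 1 << 16)     # is it exactly 2^16?
print(hex(raw))           # a single bit set just above the low 16-bit word
print(split_raw48(raw))   # low word is 0, middle word is 1
```

Seen this way, the low 16 bits are still zero and only the next word changed, which would fit a value that briefly flipped and went back rather than a steadily growing error count.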

I ran a short SMART test yesterday, but I'll turn off drive sleep and run an extended one now that the parity check is finished.

 

5 hours ago, JorgeB said:

Those are usually due to firmware issues, value changed and went back to 0, that should be safe to ignore.

That's good to hear. I know it isn't one of the default monitored SMART attributes (I assume for reasons like this), but I had enabled it after coming across a recommendation in another thread, when I was troubleshooting some disk issues, that it's useful for drives from certain vendors.

 

I've never seen a WD drive with anything but zero for that attribute, but I have Seagate drives in my other server that all report a very high number for it, so I don't monitor it on VOID.

 

EDIT: Oh, and the correcting check completed with zero errors! Thanks trurl and JorgeB for helping me figure out it was the RAM. I think this is the first time I've ever had a computer issue and it was actually a bad stick of RAM.

Edited by weirdcrap
  • weirdcrap changed the title to [SOLVED] Parity errors without obvious cause
