
[SOLVED] Monthly check - 3 Errors detected, no unclean shutdown (logs included); new errors and SMART issue a month later



Thanks!

 

In the new syslog, two match but one doesn't:

Nov 14 15:50:01 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 14 18:24:23 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
Nov 14 21:14:13 Cybertron kernel: md: recovery thread: P incorrect, sector=6924983712

Link to comment

Looks like a RAM problem to me, though an intermittent one. I see you have a Ryzen CPU, so it might be worth lowering the RAM clock speed, as I remember similar issues with high-clocked RAM on Ryzen, and/or testing with just one DIMM. It's possible, even likely, that some errors will be detected again on the next check, but the check needs to detect the exact same errors consistently.

Link to comment

Alright, just finished the parity check. Only 2 errors this time - the persistent ones, not the floating one:

Nov 16 15:39:33 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 16 18:13:54 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
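
For anyone comparing runs by eye, here is a minimal Python sketch of the same comparison. It only assumes the syslog line format quoted in this thread; the function and variable names are made up for illustration and are not part of any Unraid tool.

```python
import re

# Matches the kernel lines quoted in this thread, e.g.
# "... kernel: md: recovery thread: P incorrect, sector=2633952880"
SECTOR_RE = re.compile(r"md: recovery thread: P incorrect, sector=(\d+)")


def parity_error_sectors(syslog_text):
    """Return the set of sector numbers reported as parity errors."""
    return {int(m.group(1)) for m in SECTOR_RE.finditer(syslog_text)}


# The two checks quoted above, pasted in verbatim.
check_nov14 = """\
Nov 14 15:50:01 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 14 18:24:23 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
Nov 14 21:14:13 Cybertron kernel: md: recovery thread: P incorrect, sector=6924983712
"""
check_nov16 = """\
Nov 16 15:39:33 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 16 18:13:54 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
"""

first = parity_error_sectors(check_nov14)
second = parity_error_sectors(check_nov16)

# Sectors reported by both checks are the persistent errors.
persistent = sorted(first & second)
# Sectors reported by only one of the two checks are the floating ones.
floating = sorted(first ^ second)

print("persistent:", persistent)
print("floating:", floating)
```

Sectors that show up in every non-correcting check are the ones to treat as real parity mismatches; a sector that appears in one run only points more toward something transient.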

 

1) Do I run another non-correcting to repeat this, or run a correcting check?

2) The parity errors - likely to have messed up a file, or just parity, or no way to know?

3) Should I feel comfortable with the RAM ... under(?)clocked going forward? Am I okay to add more files (and a new drive, as I'm running low)?

 

Thanks again for all the help!

Link to comment
8 hours ago, Idaho121 said:

Do I run another non-correcting to repeat this, or run a correcting check?

One more non-correcting check, to confirm no extra errors appear, would be good.

 

8 hours ago, Idaho121 said:

The parity errors - likely to have messed up a file, or just parity, or no way to know?

Impossible to know unless you have checksums (or were using btrfs).
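
As a rough illustration of what "having checksums" could look like, here is a minimal Python sketch that records a SHA-256 digest per file and later lists the files whose contents no longer match. The function names and the manifest layout are assumptions made for this example, not an existing Unraid feature or plugin.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, root):
    """List files from the old manifest that are now missing or differ."""
    current = build_manifest(root)
    return sorted(
        name
        for name, digest in old_manifest.items()
        if current.get(name) != digest
    )
```

A manifest built while the data is known to be good can be saved (for example as JSON) and compared after a parity problem. Note that this also flags files that were edited on purpose, so it is most useful for data that is written once and then left alone.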

 

8 hours ago, Idaho121 said:

Should I feel comfortable with the RAM ... under(?)clocked going forward?

If no more new errors pop up it should be OK. Consider a board/CPU with ECC support for your next build.

Link to comment
55 minutes ago, Idaho121 said:

Thanks. I've used the exact same type myself on several Ryzen builds in Asus (X470, B350), Gigabyte (X370, B350) and ASRock (B350) motherboards. At one point it was slightly cheaper than the DDR-2666 rated kit on Amazon. It works fine at DDR-2933 with 2000-series chips. That said, I have had a faulty Vengeance LPX DIMM, which was replaced without question by Corsair (they replaced both DIMMs in the set). I would run MemTest86 again for a good long time - say, 48 hours or more. Remember, a pass doesn't guarantee that it's good. Use the free downloadable version and make a separate USB stick and boot it in UEFI mode.

Link to comment

MemTest86 version 7.5 found my faulty DIMM using the default settings. You may want to just change the setting for the number of times it cycles through the different tests. I think the default is four cycles of 13 tests.

 

Another thing you could try, if MemTest86 returns another pass, is to run on just one DIMM for a while in, say, the channel A socket. Run a parity check and, when it finishes, swap in the other DIMM in the same socket and repeat. If that reveals no difference, you might want to try each DIMM singly in the channel B socket. Label the DIMMs (or note their serial numbers) and make careful notes, and eventually you should be able to narrow the problem down to either one DIMM or one socket. It's all very time-consuming stuff, but it can run unattended and I'm sure you want to get to the bottom of this problem. It's annoying to have a potentially bad DIMM, but Corsair offer a lifetime warranty and I really can't fault their RMA process.
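
The one-DIMM-at-a-time procedure above can be written down as a small test plan plus a lookup over the notes. This is only a sketch: the DIMM labels and socket names are placeholders, and with intermittent errors a single clean run proves little, so treat the result as a hint rather than a verdict.

```python
from itertools import product

DIMMS = ["DIMM-1", "DIMM-2"]  # placeholder labels; use serial numbers if you like
SOCKETS = ["A", "B"]          # channel A socket first, then channel B


def build_test_plan():
    """Every (socket, dimm) combination, channel A runs first."""
    return list(product(SOCKETS, DIMMS))


def suspects(results):
    """Narrow down from notes: results maps (socket, dimm) -> True if
    that run showed new errors. All four combinations must be present."""
    # A DIMM is suspect if it produced errors in every socket it was tried in.
    bad_dimms = [d for d in DIMMS if all(results[(s, d)] for s in SOCKETS)]
    # A socket is suspect if it produced errors with every DIMM tried in it.
    bad_sockets = [s for s in SOCKETS if all(results[(s, d)] for d in DIMMS)]
    return bad_dimms, bad_sockets


# Example notes: DIMM-2 produced errors in both sockets, DIMM-1 in neither.
notes = {
    ("A", "DIMM-1"): False,
    ("A", "DIMM-2"): True,
    ("B", "DIMM-1"): False,
    ("B", "DIMM-2"): True,
}
print(build_test_plan())
print(suspects(notes))
```

If every combination shows errors, both lists fill up and the notes cannot separate DIMM from socket, which would point at something else (clock speed, CPU memory controller, board).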

Link to comment
  • 1 month later...

Circling back here because there were 7 errors found this month during the check. I will run the extended MemTest, as I should have done a month ago...

 

However, I've had a drive with a low-but-stable Raw Read Error Rate number (was at 5). I just checked, and it's up to 10. I'm doing an extended SMART test now on it to see if that moves again.
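
To track whether that number keeps moving, one option is to pull the raw value out of saved `smartctl -A` output and compare it with the last reading. The sketch below assumes the usual ten-column attribute table that `smartctl -A` prints (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value); the function names are invented for this example.

```python
def raw_attribute_value(smartctl_output, attribute_name):
    """Return the integer raw value of one SMART attribute, or None
    if the attribute is not listed or its raw value is not a plain number."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the attribute name is column 2 and the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute_name:
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def has_increased(previous, current):
    """True if a new reading exists and is higher than the previous one."""
    return current is not None and current > previous
```

With the readings mentioned above, `has_increased(5, 10)` is true. What matters more than the absolute number is whether it keeps climbing between checks.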

 

1) Could this be what's causing the parity errors?

2) If so, do I replace the drive and then rebuild from the current parity, or should I run another parity check, correct, and then replace it/rebuild?

3) Do I have a couple of borked bits from when I corrected the 2 parity errors last time, or is it still RAM as the likely culprit and this is a separate issue?

 

Thanks again, all!

Link to comment
