Idaho121 Posted November 7, 2018 (edited January 18, 2019 by Idaho121)
Hey all, As stated - monthly parity check found 3 errors. No unclean shutdowns (to my knowledge). Logs attached. Yes, I realize the error of my ways in having corrections written to parity. I've changed that and will be running another parity check. Thanks for the help!
Attachment: cybertron-diagnostics-20181107-0114.zip
JorgeB Posted November 7, 2018
Assuming you're not using ECC RAM, running memtest would be the first thing to do.
Idaho121 Posted November 8, 2018
Great, thanks! ...I know how to do that on Windows, but I'm unsure how to do it on unRaid.
JorgeB Posted November 8, 2018
It's available at boot as one of the boot menu options.
Idaho121 Posted November 14, 2018
Alright, finally back home from some travel. Ran it for 24 hours with no issues (then a few passes in SMP and everything was fine). What should I check next?
JorgeB Posted November 14, 2018
RAM is still the number one suspect, but it could have been an isolated issue. Try running one more parity check to see if the errors persist.
Idaho121 Posted November 14, 2018
Cool, thanks. On the Main page, the "Write corrections to parity" box is checked by default. I went into the schedule page, and it's got "No" selected. Should that box be checked by default?
JorgeB Posted November 14, 2018
Scheduled checks should be non-correcting. In this case, run another non-correcting check first. If it finds the same 3 errors, run a correcting check to fix them, and then another non-correcting check to confirm all is OK.
Idaho121 Posted November 15, 2018
How do I know if they're the same 3 errors? (Running now, done by tomorrow morning)
JorgeB Posted November 15, 2018
Check the syslog and compare the sectors.
Idaho121 Posted November 16, 2018
I'm having trouble finding where in the syslog that information is located. Thanks for the help here!
JorgeB Posted November 16, 2018
Nov 1 00:03:42 Cybertron kernel: md: recovery thread: P incorrect, sector=66514704
...
Nov 1 02:31:28 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
...
Nov 1 05:05:03 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
Idaho121 Posted November 16, 2018
Thanks! In the new syslog, two match but one doesn't:
Nov 14 15:50:01 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 14 18:24:23 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
Nov 14 21:14:13 Cybertron kernel: md: recovery thread: P incorrect, sector=6924983712
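
Comparing the sector numbers by eye works for three errors, but the same comparison can be scripted. The following is a minimal sketch in Python, assuming the syslog lines keep the "P incorrect, sector=N" wording quoted in this thread; the function name `parity_error_sectors` is made up for illustration, and the sample text is simply the lines posted above.

```python
import re

# Matches the syslog wording quoted in this thread, e.g.
#   "... kernel: md: recovery thread: P incorrect, sector=2633952880"
# Assumption: other Unraid versions may word this line differently.
SECTOR_RE = re.compile(r"recovery thread: P incorrect, sector=(\d+)")


def parity_error_sectors(syslog_text):
    """Return the set of sector numbers reported as parity errors."""
    return {int(match.group(1)) for match in SECTOR_RE.finditer(syslog_text)}


# Lines copied from the two checks posted in this thread.
first_check = """\
Nov 1 00:03:42 Cybertron kernel: md: recovery thread: P incorrect, sector=66514704
Nov 1 02:31:28 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 1 05:05:03 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
"""
second_check = """\
Nov 14 15:50:01 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 14 18:24:23 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
Nov 14 21:14:13 Cybertron kernel: md: recovery thread: P incorrect, sector=6924983712
"""

first = parity_error_sectors(first_check)
second = parity_error_sectors(second_check)

persistent = sorted(first & second)   # reported by both checks
only_first = sorted(first - second)   # seen only in the earlier check
only_second = sorted(second - first)  # seen only in the later check

print("persistent:", persistent)
print("only in first check:", only_first)
print("only in second check:", only_second)
```

Sectors that show up in both sets are the persistent errors; anything that appears in only one check is the kind of floating error that points toward an intermittent cause.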
JorgeB Posted November 16, 2018
Looks like a RAM problem to me, though an intermittent one. I see you have a Ryzen CPU. It might be worth lowering the RAM clock speed, as I remember similar issues with high-clocked RAM and Ryzen, and/or testing with just one DIMM. It's possible, even likely, that some errors will be detected again on the next check, but before correcting, it needs to consistently detect the exact same errors.
Idaho121 Posted November 16, 2018
So lower RAM clock speeds, test again to see if I get the same errors, then run a correcting check? Or get some new RAM? I'd lower clock speeds through the BIOS on boot, right?
Idaho121 Posted November 17, 2018
Alright, just finished the parity check. Only 2 errors this time - the persistent ones, not the floating one:
Nov 16 15:39:33 Cybertron kernel: md: recovery thread: P incorrect, sector=2633952880
Nov 16 18:13:54 Cybertron kernel: md: recovery thread: P incorrect, sector=4906237848
1) Do I run another non-correcting to repeat this, or run a correcting check?
2) The parity errors - likely to have messed up a file, or just parity, or no way to know?
3) Should I feel comfortable with the RAM ... under(?)clocked going forward? Am I okay to add more files (and a new drive, as I'm running low)?
Thanks again for all the help!
JorgeB Posted November 18, 2018
Idaho121 said: Do I run another non-correcting to repeat this, or run a correcting check?
One more non-correcting check, to confirm no extra errors appear, would be good.
Idaho121 said: The parity errors - likely to have messed up a file, or just parity, or no way to know?
Impossible to know unless you have checksums (or were using btrfs).
Idaho121 said: Should I feel comfortable with the RAM ... under(?)clocked going forward?
If no new errors pop up it should be OK. Consider a board/CPU with ECC support for your next build.
Idaho121 Posted November 18, 2018
What's the best app/docker to use for checksums?
JorgeB Posted November 18, 2018
The Dynamix File Integrity plugin; it's the only one for Unraid. You can also use an external utility like corz for Windows.
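
To illustrate what checksums buy you here, the sketch below records a SHA-256 hash for every file under a directory and later reports files whose contents no longer match. It is a generic Python illustration of the idea, not how the Dynamix File Integrity plugin or corz is implemented; the function names `sha256_of`, `build_manifest`, and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are not read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Return files present in both manifests whose digests differ."""
    return sorted(
        name
        for name, digest in old_manifest.items()
        if name in new_manifest and new_manifest[name] != digest
    )
```

A typical use would be to call `build_manifest` on a share while the data is known to be good, save the result (for example as JSON), and compare it against a fresh manifest after a parity problem. A mismatch only shows that a file's contents changed; it cannot by itself tell deliberate edits from corruption, so it is most useful for files that are not supposed to change.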
John_M Posted November 20, 2018
Just curious, what RAM are you using and is it on MSI's Qualified Vendor List?
Idaho121 Posted November 20, 2018
John_M said: Just curious, what RAM are you using and is it on MSI's Qualified Vendor List?
I believe so -
https://www.amazon.com/gp/product/B0134EW7G8/ref=oh_aui_detailpage_o04_s00?ie=UTF8&psc=1
https://www.msi.com/Motherboard/support/X470-GAMING-PLUS#support-mem-12
John_M Posted November 20, 2018
Idaho121 said: I believe so -
https://www.amazon.com/gp/product/B0134EW7G8/ref=oh_aui_detailpage_o04_s00?ie=UTF8&psc=1
https://www.msi.com/Motherboard/support/X470-GAMING-PLUS#support-mem-12
Thanks. I've used the exact same type myself on several Ryzen builds in Asus (X470, B350), Gigabyte (X370, B350) and ASRock (B350) motherboards. At one point it was slightly cheaper than the DDR-2666 rated kit on Amazon. It works fine at DDR-2933 with 2000-series chips.
That said, I have had a faulty Vengeance LPX DIMM, which was replaced without question by Corsair (they replaced both DIMMs in the set). I would run MemTest86 again for a good long time - say, 48 hours or more. Remember, a pass doesn't guarantee that it's good. Use the free downloadable version, make a separate USB stick, and boot it in UEFI mode.
John_M Posted November 20, 2018
MemTest86 version 7.5 found my faulty DIMM using the default settings. You may want to just change the setting for the number of times it cycles through the different tests. I think the default is four cycles of 13 tests.
Another thing you could try, if MemTest86 returns another pass, is to run on just one DIMM for a while in, say, the channel A socket. Run a parity check and, when it finishes, swap in the other DIMM in the same socket and repeat. If that reveals no difference, you might want to try each DIMM singly in the channel B socket. Label the DIMMs (or note their serial numbers) and make careful notes, and eventually you should be able to narrow the problem down to either one DIMM or one socket.
It's all very time-consuming stuff, but it can run unattended and I'm sure you want to get to the bottom of this problem. It's annoying to have a potentially bad DIMM, but Corsair offer a lifetime warranty and I really can't fault their RMA process.
Idaho121 Posted January 3, 2019
Circling back here because there were 7 errors found this month during the check. I will run the extended MemTest, as I should have done a month ago...
However, I've had a drive with a low-but-stable Raw Read Error Rate number (was at 5). I just checked, and it's up to 10. I'm doing an extended SMART test now on it to see if that moves again.
1) Could this be what's causing the parity errors?
2) If so, do I replace the drive and then rebuild from the current parity, or should I run another parity check, correct, and then replace it/rebuild?
3) Do I have a couple of borked bits from when I corrected the 2 parity errors last time, or is it still RAM as the likely culprit and this is a separate issue?
Thanks again, all!