[SOLVED] Parity errors galore on new unRAID box


Well, I have the new RAM (Crucial CT2CP25664BA1339 4GB 2GBx2 240-pin PC3-10600 DIMM DDR3 Memory Kit), and the results are bizarre.

 

The BIOS and Memtest86 agree that I'm running the RAM at 666MHz (DDR3-1333), CAS 9-9-9-24.  The BIOS confirms 1.504V.  I believe these values to be correct, but can't readily confirm them anywhere except the SPD data reported by the BIOS utility.

 

I'm now running 4GB RAM total instead of 2GB, and I don't like the fact that more than one variable has changed.  Bad science.

 

1. With the new RAM installed in the paired dual channel slots 1 and 2, the system boots, and my 10,000-pass MD5 test of 100,000 disk blocks returns about two or three times as many errors as it did with the original Kingston RAM.

 

2. When I fire up Memtest86, everything looks good until Test #6 [Moving inversions, 32 bit pattern], then many errors are reported in the first pass.

 

3. I remove the DIMM from slot 2 and fire up Memtest86 again.  No errors are reported in two passes.

 

4. I replace the DIMM in slot 1 with the other, then Memtest86 again.  No errors are reported in two passes.

 

5. I put the free DIMM into slot 2 (the two DIMMs are now back in slots 1 and 2, but swapped relative to step 1), and Memtest86 again.  No errors are reported in two passes, which is really unexpected, given step 2!

 

6. I boot unRAID and retry the 10,000-pass test from step 1.  No errors.

 

So, is this system haunted?  Was I having a physical seating problem with one of the slots, now accidentally remedied?  Should I still order a new power supply?

 

I'm happy the box currently appears to be working, but don't trust it at all, since an hour ago it wasn't working with the same two RAM sticks.  And, of course, I'm a little concerned that doubling the RAM might affect the outcome of the 10,000-pass test.

 

Thoughts?

Link to comment

My best guess? Your motherboard is a little too aggressive with something to do with the interleaving. Do you have the option of setting 10-10-10?  With your reported behaviours, I'd be afraid you were walking the fine line somewhere, and the least little degradation over time may start to cause silent errors that will bite you. The fact that you get similar results with totally different sticks of ram makes me think it's definitely motherboard, not memory stick related. I wouldn't rest easy until you get more error free tests.

Link to comment

run memtest for a day... or 2?

 

I agree. If you still have issues, the other thing you can try is changing DIMMs. According to your motherboard manual (http://download3.msi.com/files/downloads/mnu_exe/E7599v4.1.zip) the dual channel configuration is DIMM1+DIMM2 and DIMM3+DIMM4. Try running your new memory in DIMM3+DIMM4, and if you still have problems, disable dual channel mode, disable interleaving and/or loosen your memory timings a little, e.g. if it is rated at 9-9-9-24, try 10-10-10-28 or thereabouts. While you are testing these different configurations I would run memtest at least overnight to be sure it is working OK. Also, did you say you were running the latest BIOS? (v17.18 is the latest and can be downloaded from http://au.msi.com/service/download/bios-18237.html).

 

Edit: Saw a post which said you were already running the latest BIOS.

Link to comment

@Joe L.: Thanks for your feedback re the appropriateness of this thread, and for your snarky opinions about Windows ;-)

 

@jonathanm: I hadn't considered the possibility of loosening things up beyond the RAM specs...I guess there wouldn't be much performance hit, especially since this is a file server, not a gaming box.  That will be my next move, assuming I change anything at this point!
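
As a rough check on the "not much performance hit" guess, absolute CAS latency can be estimated as CAS cycles divided by the memory clock. The sketch below is only that arithmetic; the helper name `cas_ns` is made up for illustration, and real-world impact depends on much more than CAS latency alone.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): estimate CAS latency in nanoseconds.
# latency_ns = CAS cycles / memory clock in MHz * 1000
cas_ns() {
  awk -v cl="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", cl / mhz * 1000 }'
}

cas_ns 9 666.67    # 9-9-9-24 at DDR3-1333    -> 13.50
cas_ns 10 666.67   # 10-10-10-28 at DDR3-1333 -> 15.00
```

By this estimate, loosening CAS from 9 to 10 at the same clock adds about 1.5 ns per access, which is unlikely to matter much on a file server.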

 

@Johnm: That sounds smart, though I'm error-free and 20 hours in at this point, plus leaving town in a couple days, so I'm thinking I'll switch to some "real world" unRAID testing.  If that goes well, I'll run an exhaustive multi-day Memtest while I'm gone.  If that doesn't go well, I probably need to make some changes (10-10-10-28?) before continuing with Memtest.

 

@chickensoup: When you say "changing DIMMs", do you mean trying another set?  Or do you just mean moving them around?  I'm on my second set of DIMMs (Kingston 1GBx2, then Crucial 2GBx2), and I've already seen that with the Kingstons, putting them in slots 1 & 3 appears to eliminate the problem (see http://lime-technology.com/forum/index.php?topic=19936.msg179372#msg179372).  Also, I can find a BIOS setting to disable interleaving, but can't for the life of me find a dual-channel setting.

 

So, as I said to @Johnm above, I'm error-free after 20 hours of Memtest.  I don't know if I accidentally worked something out, or if a slight change in temperature, barometric pressure or the phase of the moon is going to put me right back where I started.  If everything had been just perfect with the new RAM, I'd feel great right now.  Given that things were initially worse with the new RAM, I don't know what to think.

 

Shall I loosen up the RAM timings?  Leave things alone?  Contact MSI?  Burn some sage?

 

Thanks very much for all the great feedback, everyone.  This is a great community!

Link to comment

If you are just joining us, the summary of my parity issues is this: RAM voodoo.

 

While it is possible that the power supply or some other actor created the RAM issues, I did buy a new pair of dual-channel-ready DIMMs, and did get them working somehow despite some initial weirdness.

 

Right now, my BIOS is set to Auto for all RAM timing, speed and voltage settings; the BIOS reports 1.504 volts, and Memtest86+ reports 666MHz (DDR3-1333), CAS 9-9-9-24.

 

Yesterday, after about 20 hours of successful Memtest86+ testing, I re-added my two drives to the array, restarted it, and checked parity, which took 500+ minutes.

 

The result was 111 sync errors, which initially horrified me, but then I realized that when I last checked (and repaired) parity a couple weeks ago, the RAM voodoo would've resulted in a bunch of erroneous parity corrections.

 

If you can't trust your RAM, all bets are off :-)

 

After the first pass of parity correction with the 111 sync errors (why aren't these called parity errors?), I ran it again.

 

The result was zero errors, and that's the first time I've ever seen that out of this server.

 

So, I'm going to spend the next week or so beating up on this thing before I start trusting it with real data.

 

If I see any more signs of RAM voodoo, I'm going to try loosening up the RAM timing to 10-10-10-28, as recommended by @jonathanm and @chickensoup.

 

If I get desperate, I might even buy a Corsair power supply to replace the HEC.

 

Can anyone think of anything else I should (or shouldn't!) do?

 

Thank you all.

Link to comment

If religious, a prayer or two can't hurt. :)

If superstitious, try not to walk under any ladders, or break any mirrors.  ;) 

 

Seriously, I'm very happy when issues like this are resolved.  You would think that motherboard manufacturers would not design boards where the needed timings are so critical.  (I must admit, though, that testing for the errors does take time, and they cannot test every possible combination of memory sticks you might install.)

 

 

Link to comment

@Joe L.: I'm with you there, Joe, and thanks for all your input.  Interestingly, in case you missed it, I currently have the motherboard set to Auto regarding all RAM timing, speed, and voltage!  Go figure.

 

And I want to throw two other things out there for anyone learning from this thread:

 

1) My MSI motherboard does the dual channel thing when paired DIMMs are either in slots 1 & 2 or 3 & 4.  Even when I was still using the original Kingston RAM, voodoo went away when I put the DIMMs in slots 1 & 3, thereby disabling dual channel mode.  I didn't find any way in the BIOS to turn it off.

 

2) With the original RAM, leaving the motherboard in Auto mode SOMETIMES got the timings and speed wrong, though it always got the voltage right.  It was sometimes underclocking the RAM (no harm done, except for lost speed), and sometimes tightening the timings all the way to 7-7-7 (that's no good at all).  Regardless of what the BIOS interface claimed were the current RAM settings, it was useful to fire up Memtest86+ just to see what was reported there.

 

If I learn anything else about my case, I'll post it here.

 

A good rule of thumb: it is never the hardware, except when it is.

Link to comment

One more thing...here is an updated and somewhat more flexible version of the troubleshooting script from http://lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors, with easy-to-adjust variables at the head of the file, plus additional reporting and logging:

 

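#!/bin/bash
# Repeatedly hash the same region of a drive.
# Every md5sum line in the log should be identical; a differing line means a bad read.
LOG_DIR=/var/log/hashes
TEST_DRIVE=sdb    # device name under /dev
SKIP=25000        # blocks to skip at the start of the drive (dd default: 512-byte blocks)
BLOCKS=100000     # blocks to read on each pass
PASSES=10000      # number of passes

mkdir -p "$LOG_DIR"
cd "$LOG_DIR" || exit 1

# Show the start time and parameters on screen and record them in the log.
date | tee -a "$TEST_DRIVE.log"
echo "PASSES=$PASSES, BLOCKS=$BLOCKS, SKIP=$SKIP" | tee -a "$TEST_DRIVE.log"

for i in $(seq 1 "$PASSES")
  do
    echo "Begin $TEST_DRIVE, pass $i."
    dd if="/dev/$TEST_DRIVE" skip="$SKIP" count="$BLOCKS" | md5sum -b >> "$TEST_DRIVE.log"
  done
exit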
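
Once the script above has run, the log still has to be checked for mismatches. A minimal sketch of one way to do that follows, assuming the log holds a single run (hashes from runs with different SKIP or BLOCKS values will legitimately differ). The function names `summarize_hashes` and `hashes_consistent` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original script) for reading its log.
# md5sum -b writes lines that start with 32 hex digits; the date and
# PASSES=... header lines do not, so they are ignored here.

# Print each distinct hash with the number of passes that produced it,
# most frequent first.
summarize_hashes() {
  grep -Eo '^[0-9a-f]{32}' "$1" | sort | uniq -c | sort -rn
}

# Succeed only if the log contains at most one distinct hash.
hashes_consistent() {
  local distinct
  distinct=$(grep -Eo '^[0-9a-f]{32}' "$1" | sort -u | wc -l)
  [ $((distinct)) -le 1 ]
}

# Example usage:
#   summarize_hashes /var/log/hashes/sdb.log
#   hashes_consistent /var/log/hashes/sdb.log && echo "All passes agree."
```

On a healthy system the summary should be a single line whose count equals the number of completed passes; any extra lines are passes that read different data from the same blocks.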

 

Enjoy!

Link to comment

@Johnm, the creepy bit is that I don't think I fixed it--I think it "just started working," which isn't ideal.  When I got the two new sticks of RAM, they made things worse, and wouldn't even pass Memtest.  When I tested them one at a time, they did pass Memtest, and when I put them back together again, everything seemed fine.

Link to comment

It is possible, however unlikely, that when you fitted the new RAM one of the sticks wasn't seated quite right and after re-fitting all is well. It does sound like your motherboard is just being stubborn though.

Link to comment

I've had uptime of about one week, and in that week I've added one more 1TB drive, and I've copied about 1.6TB to the array.  On occasion, I've run a Parity-Check, and until this morning, I've been clean.

 

This morning, I got one Sync Error which certainly wasn't due to an unclean shutdown, since I've had about a week of uptime.

 

So, now I'm back to mistrusting this system.

 

I'd greatly appreciate hearing who advises loosening up the RAM timing (maybe from 9-9-9-24 to 10-10-10-28, as recommended by @jonathanm and @chickensoup?), and who advises doing something else!

 

Thanks to everyone :-)

Link to comment

A little more info: after finding and fixing the single Sync Error, I started another Parity-Check, which found one more error, at an offset far, far from the previous.

 

For those new to the thread, the three hard drives were all successfully precleared multiple times, show no signs of SMART woes, and have passed manufacturer drive diagnostics.

 

RAM has already been replaced once, so my loose plan is to loosen RAM timings, then when that doesn't work, try a new power supply, then when that doesn't work, get a new motherboard.

 

If that doesn't work, I guess I'll try repainting the case.

Link to comment

Unfortunately, regardless of the OS, sometimes an intermittent or stubborn problem is just that. You know the problem exists, but it is extremely time-consuming and difficult to replicate, and it is very hard to tell if and when the problem has actually been fixed. In this case I would suggest trial and error. You have done all the diagnostics, so your best bet, if you have the parts or can afford them, would be to replace parts and see what happens. Obviously this approach doesn't work for everyone (expensive, time-consuming, system remains "untrustworthy" for a long period of time), but you are definitely running out of options.

 

You might be able to get some extra support if any of the unRAID gurus here could help by sifting through a few syslogs to see where the issue crops up. If you have a spare motherboard/CPU/RAM combo (even if the other PC is working), you could "rebuild" your unRAID onto a new setup for testing purposes. As painful as this sounds, it might help to determine whether the issue lies in your drives or not; depending on how many drives you have, this could be useful to know.

Link to comment

@chickensoup: Thanks for the feedback.  I've determined that it isn't drive problems--this computer even has Prime95 calculation problems.

 

When I underclock the RAM down to 1066, however, I don't see any more problems.  That doesn't, however, help me sleep at night.  I can give up the speed, but I can't trust my data to a box that is seemingly stable only when underclocked.

 

Though I don't consider it utterly proven that the problem is the motherboard, MSI has VERY responsive (email responses often in less than an hour!) customer service, and they've agreed to RMA the board.

 

They won't, however, ship a replacement until a week or two after they've received mine, so I've decided to go a different way: I just ordered an ASUS M5A78L-M LX PLUS motherboard.  I also eBay-ed an AMD Athlon II X2 240 2.8 GHz CPU to go with it.

 

Once those parts are in place, I will have replaced everything but the power supply.  I probably should've done this a couple weeks ago.

 

Fingers crossed.

Link to comment

Let us know how it goes :-)

 

MSI has fantastic customer service. Even though they say 1-2 weeks, in my experience if you send a motherboard in for warranty you will receive a replacement in a few days. The only delays appear to be related to shipping. Even if it does take 2 weeks, this is still much faster than any other supplier. Most take 6 weeks or more.

 

I'm usually a Gigabyte fanboy but MSI have come a long way in the last couple of years. I'm considering moving to MSI boards for more of my builds as they are generally cheaper for the same build quality and have much better support.

Link to comment

Intel branded boards will advance cross ship overnight through an approved vendor. They may offer it to end users directly as an option, I don't know. Anyway, if I call Intel and tell them I have a bad board, the replacement will show up the next day with a return shipping label for the old board.

Link to comment

@chickensoup: I like most things about my experience with MSI and the MSI motherboard, except that I have a computer that isn't very good at math ;-)  In all seriousness, though, it may not be the motherboard's fault, and I'm not now anti-MSI.  I do, however, have an underclocked-but-functional server at the moment, and I do wish MSI would've let me secure the replacement board with a credit card so that I could've hopefully had an hour or two of server downtime, instead of a week or two.

 

@jonathanm: That's impressive, re Intel-branded boards.  I will keep that in mind for my next build!

 

Thanks, guys!

Link to comment

It pains me to admit this publicly, but I owe it to the community that has been so helpful, and to anyone who might run into a problem like this down the road:

 

The (used) CPU I put into this build had one corner pin bent flat as a pancake.

 

So, I've also learned something about how my eyesight changes with age.

 

I've also learned that when my wife Googles the symptoms and asks me if maybe the CPU is missing a pin, I should listen.

 

Building your own system is easy, except when it isn't :D

Link to comment

With the very first CPU I ever installed, I guess I thought the pins were made of the stuff in Wolverine's claws. I tossed it in, clamped it down, and it wouldn't boot. I took it out, and 2 rows were bent flat.

 

Lesson #1: CPU pins are fragile. Take your time, insert the CPU slowly, and make sure you don't have to use force to get it in.

Lesson #2: Pay attention to HOW the CPU goes in.

Lesson #3: Build your first PC using a really old PC that should by all rights be in the trash, like one that Puppy Linux can't even run on, so when you DO screw it up, you just throw it away.

 

Not saying you did any of the above, but that's what I did :) Never did it again (and I'm still young but have badish eyesight).

Link to comment
