[SOLVED] Parity errors galore on new unRAID box


Summary

I recently built my first unRAID server, and I’m having bizarre parity issues, despite my best efforts and a lot of troubleshooting.

 

Configuration

unRAID 4.7

Motherboard: MSI 870S-G46

CPU: AMD Athlon II X2 215 2.7 GHz ADX215OCK22GQ

RAM: 2GB Kingston PC3-10600 DDR3 dual channel (KVR1333D3K2/2GR)

PS: hec Zephyr MX 750

Parity drive: 2TB Western Digital WD20EARS

Disk1 drive: 1TB Seagate 31000528AS

(I kept it to just the two drives at first, so I could get to know unRAID a bit.)

 

Problem

Here is the sequence: I precleared both drives, started the array, and copied ~100GB of big multimedia files to the server. The rightmost column on the Main page showed zero errors on both drives. I then clicked Check to start a Parity-Check, and I’m told there are dozens of parity errors.

 

Things you might want to know

I’m running unRAID 4.7

 

The drives are reporting 25-26°C

 

Both drives are “MBR: 4K-aligned”

 

I've attached SMART reports for both disks, as well as a lengthy, messy syslog (sorry about that)

 

Troubleshooting I’ve done

Memtest ran for over 24 hours with no errors reported.

 

Western Digital and Seagate drive utilities report nothing strange about either drive.

 

Each drive has been precleared at least three times, never with what I interpret as an error or failure report.

 

My testing

I’ve seen this problem consistently after preclearing both drives and starting from scratch three times--yes, that's many days of preclearing!

 

During the last preclear cycle, I precleared the 2TB drive once while preclearing the 1TB drive twice back-to-back, just to make sure there was plenty of activity involving both drives.

 

After that last round of preclears, I issued an “initconfig” before repopulating the devices.

 

No problems were reported during the initial Parity-Sync.

 

The copying of the 100GB of big files was started after Parity-Sync completed, and seemingly went smoothly--the Windows 7 box pushing the files didn’t complain, anyway.

 

After the copy was complete, I clicked the Check button, and eventually 39 parity sync errors were fixed!

 

My questions

What do I try next regarding the parity failures?

This is feeling like a hardware failure outside the drives. I’m fearing I’ve chosen the wrong motherboard, or have a lemon.

 

Am I right that it is fine to preclear all drives with the -A option?

My understanding is that the Seagate drive doesn’t need this, but it is fine to do so, other than some space lost if I have a huge number of files. For the sake of elegance, simplicity, and never getting it wrong on any given drive in the future, I’m hoping always doing this is fine.

bland328_SMART_sda.txt

bland328_SMART_sdb.txt

bland328_syslog_with_a_few_notes_at_the_top.txt

Link to comment

The first thing I would suggest is do a memory test. When first booting, there is the option to select the memory test instead of just letting unRAID boot. Try it for at least 3 or 4 passes and even better overnight.

 

If that doesn't show a problem then there is another much more complex test that can be done. Basically, it will do multiple reads from each drive and give a checksum which all should be the same.

 

 

Link to comment

You can preclear the disk with the partition starting on sector 64 for ANY disk.  Furthermore, since the file system internally uses 4096-byte blocks, I think the resulting space available for files is exactly the same either way.

 

As others have said, step 1 is to ensure your memory is working properly.  You should NEVER have parity errors unless you have disks with sectors going bad, OR have had a hard shutdown/power loss in which only some disks were written to.
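
To make the parity argument concrete, here is a minimal Python sketch of what a parity check compares, assuming single parity computed as the byte-wise XOR of the data disks. The function names and the tiny in-memory "disks" are purely illustrative, not unRAID code.

```python
from functools import reduce


def compute_parity(data_disks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_disks))


def count_parity_errors(data_disks, parity):
    """Count byte positions where stored parity disagrees with recomputed parity."""
    expected = compute_parity(data_disks)
    return sum(1 for stored, fresh in zip(parity, expected) if stored != fresh)


# Hypothetical one-data-disk array, like the two-drive setup in this thread.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
parity = compute_parity([disk1])

# A clean read agrees with parity; a read with one corrupted byte does not.
clean_errors = count_parity_errors([disk1], parity)
bad_read = bytes([0x10, 0x21, 0x30, 0x40])
corrupt_errors = count_parity_errors([bad_read], parity)
```

With only one data disk, XOR parity is simply a copy of that disk, so under this assumption any mismatch means a read or an earlier write was corrupted somewhere between the platters and memory.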

 

As far as the memory goes, if it does show errors, make sure it is set up properly in the BIOS.  It must have the correct clock speed, timing, and voltage.

 

Some BIOS set them all correctly automatically, some do not, especially if you have premium RAM, which often takes non-standard settings.

 

If you can run a memory test overnight and have no errors, then the next step is to get a SMART report from all your disks.  If they were recently pre-cleared, you'll know their initial status, so you can tell if additional sectors have been marked as unreadable or have been reallocated.
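
One way to do that before/after comparison is sketched below in Python. It assumes the reports are saved text in the usual `smartctl -A` attribute-table layout (raw value in the tenth column); the function names and the list of watched attributes are illustrative choices, and the two sample reports are made up for the demonstration rather than taken from the attachments in this thread.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(report_text):
    """Map attribute name -> raw value for rows of a `smartctl -A` style table."""
    values = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values


def changed_attributes(before_text, after_text, names=WATCHED):
    """Return {name: (before, after)} for watched attributes whose raw value changed."""
    before = parse_smart_attributes(before_text)
    after = parse_smart_attributes(after_text)
    return {
        name: (before[name], after[name])
        for name in names
        if name in before and name in after and before[name] != after[name]
    }


# Made-up sample reports, for illustration only.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""
AFTER = BEFORE.replace(
    "Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0",
    "Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3",
)

changes = changed_attributes(BEFORE, AFTER)
```

An empty result for every disk would suggest the drives themselves are not degrading, which points the search elsewhere.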

 

Then, the tests mentioned in the prior post can be run on each disk in turn: writing and reading each disk, looking for inconsistent checksums.

 

Joe L.

 

 

Link to comment

Thanks for your responses, gentlemen.

 

The first thing I did upon seeing this result the first time was to run a 24-hour Memtest; no errors were reported.

 

Also, I did attach SMART results to my original post.

 

I assume this more complex drive test you both refer to is "reiserfsck --check".  I just did four back-to-back runs on the data drive, and the results are disturbing:

 


 

Run 1:

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/105 (of 145\/ 68 (of 170\bad_indirect_item: block 212886871: The item (1163 1165 0x1d4cb001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (213008304), which is in tree already finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Checking Semantic tree:

finished

3 found corruptions can be fixed when running with --fix-fixable

 

Run 2:

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/ 64 (of 145// 30 (of 170\bad_indirect_item: block 47209215: The item (359 369 0xf104001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (47270912), w/ 95 (of 145-/152 (of 170-bad_indirect_item: block 182910983: The item (1151 1152 0x17d6001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (182919168), which is in tree already                        /102 (of 145-/ 45 (of 170\bad_indirect_item: block 212107611: The item (1163 1164 0x551de001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (212462848), which is in tree already finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Checking Semantic tree:

finished

5 found corruptions can be fixed when running with --fix-fixable

 

Run 3:

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/ 13 (of 145\/128 (of 170\bad_indirect_item: block 83331021: The item (5 15 0x44d1d001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (83616512), whi/ 71 (of 145-/  3 (of 170/bad_indirect_item: block 66715745: The item (428 429 0x1778c001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (66813956), which is in tree already / 95 (of 145-/154 (of 170|bad_indirect_item: block 182910985: The item (1151 1152 0x1fbe001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (117) to the block (182921216), which is in tree already /126 (of 145-/ 86 (of 170|bad_indirect_item: block 231112707: The item (1231 1257 0x1 IND (1), len 396, location 336 entry count 0, fsck need 0, format new) has the bad pointer (45) to the block (231118080), which is in tree already finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Checking Semantic tree:

finished

6 found corruptions can be fixed when running with --fix-fixable

 

Run 4:

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

        Leaves 24105

        Internal nodes 146

        Directories 46

        Other files 1500

        Data block pointers 24307524 (0 of them are zero)

        Safe links 0

 


 

So...3 unfixed corruptions found on the first run, then 5 on the next run, then 6 on the next run, then 0 on the last run.

 

I emphasize that I never repaired any of the reported file system problems, which likely means disk reads are sometimes corrupted, but that the disk itself is not actually corrupted.

 

The disk passes a long SeaTools test, and passes preclear.  RAM tested clean after 24 hours of continuous testing.

 

Replace the SATA cables?  Replace the power supply?  I'm a capable troubleshooter, but not a Linux or unRAID pro, and it feels to me like everything is suspect.

 

Any thoughts?

Link to comment

Well, that was some excellent advice....thanks, guys.  I learned a bunch, but I don't know what to do about it!

 

I have discovered that with either of my drives (2TB WD, 1TB Seagate) and with one other "junk drawer" drive (320GB Seagate), if I read 200,000 blocks 10,000 times, calculating an MD5 hash each time, about 0.1% of the MD5 hashes will be different (or, more simply, "wrong").

 

Even stranger, the hashes are not strictly random when they are wrong.  That is, if the hash is "60f3d5b4459a58ba0d4c57cf10e47a3a" 99.9% of the time, I may see that after a couple thousand hashes I get a "c8a48d2d3009d7c897a853a924904029", then the hashes may be right for a couple thousand more reads, and then I may get another "c8a48d2d3009d7c897a853a924904029" hash.  Sometimes I'll get a wrong hash that never does repeat itself, but most eventually do.
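
For anyone wanting to run a similar test, here is a rough Python sketch of the idea: hash the same byte range over and over and tally which digests appear. This is not the exact procedure used above; the function names, the chunk size, and the device path in the usage note are all assumptions.

```python
import hashlib
from collections import Counter


def hash_range(path, length, offset=0, chunk_size=1 << 20):
    """Return the MD5 hex digest of `length` bytes starting at `offset`."""
    md5 = hashlib.md5()
    remaining = length
    # buffering=0 avoids Python-level caching; the OS page cache may still
    # serve repeat reads unless the range exceeds RAM or caches are dropped.
    with open(path, "rb", buffering=0) as device:
        device.seek(offset)
        while remaining > 0:
            chunk = device.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached end of device or file
            md5.update(chunk)
            remaining -= len(chunk)
    return md5.hexdigest()


def tally_read_hashes(path, length, passes, offset=0, chunk_size=1 << 20):
    """Hash the same range `passes` times and count each digest seen."""
    digests = Counter()
    for _ in range(passes):
        digests[hash_range(path, length, offset, chunk_size)] += 1
    return digests


def summarize(digests):
    """Return (majority_digest, mismatching_reads, distinct_wrong_digests)."""
    majority, majority_count = digests.most_common(1)[0]
    total = sum(digests.values())
    return majority, total - majority_count, len(digests) - 1
```

Something like `summarize(tally_read_hashes("/dev/sdX", length, 10000))` (with a placeholder device path and a range size of your choosing) would report how many reads disagreed with the majority digest, and how many distinct wrong digests showed up, which is what separates repeating wrong hashes from one-off ones.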

 

Replacing the SATA cables doesn't make a difference.

 

I have also failed to reproduce these results when testing the same drives on another computer.

 

So...it appears I have built a shiny new nightmare of a file server that corrupts disk reads a statistically significant percentage of the time.

 

Any opinions on what I replace first?  RAM?  CPU?  Power supply?  Motherboard?

 

Or am I looking at it wrong?

Link to comment

Perhaps one other item you can look at is the version of the BIOS on your motherboard.

 

Did you check the MSI website to see if a newer BIOS version exists that might correct some of what you are experiencing?  It all sounds to me like you have an unstable system.

 

Link to comment

Any other thoughts?  Anyone?  I'm up against a machine that is fast, "stable" (in that it doesn't crash), and occasionally subtly corrupts disk reads.

As already suggested by others here.

 

I would say the memory module is the first suspect.  Have you tried fixed settings for your memory (1333 MHz - CL9 in your case) instead of the default "automatic" settings?

 

 

Link to comment

Be certain you do not have any overclocking or core-unlocking enabled in your BIOS.

^ This is what caused me hours of headaches when I first booted my unRAID box.  I think I was getting an error in about 1 of every 4 parity calculations because I had unlocked the second core on an AMD Sempron.

Link to comment

@bonienl: Good point...I did forget about the RAM timings recommendation, and now that I look closely at it, I'm a little confused.

 

The DIMMs don't have a timing sticker, unless the timings are somehow encoded in one of the long numbers on there.  At the very least there is no #-#-#-# sort of declaration.

 

I can't find any KVR1333D3K2/2GR timing documentation online, but I can find people who sound like they know of what they speak saying this RAM is 9-9-9-24.

 

In the BIOS, the "DIMM Memory SPD Information" page says both DIMMs are Cycle Time=1CLK; TCL=9CLK; TRCD=9CLK; TRP=9CLK; TRAS=24CLK.

 

The BIOS documentation says that if the DRAM Timing Mode is set to Auto, it gets the timing information from the SPD data.

 

So, the word on the street is that the RAM is 9-9-9-24, and the BIOS says it is 9-9-9-24.  Is there still a point to me manually setting it to 9-9-9-24?  I'm not resistant to doing this--I just want to make sure I'm doing the right thing.
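
For a sense of scale, the short Python sketch below converts those cycle counts into nanoseconds. It assumes the nominal DDR3-1333 rate of 1333 MT/s (so a clock of roughly 667 MHz); the function name is illustrative.

```python
def latency_ns(cycles, transfer_rate_mts):
    """Convert a DDR timing in clock cycles to nanoseconds.

    DDR memory transfers twice per clock, so the clock in MHz is half the
    transfer rate in MT/s.
    """
    clock_mhz = transfer_rate_mts / 2
    return cycles * 1000.0 / clock_mhz


# The 9-9-9-24 timings reported by the SPD page, in clock cycles.
timings = {"tCL": 9, "tRCD": 9, "tRP": 9, "tRAS": 24}
timings_ns = {name: latency_ns(cycles, 1333) for name, cycles in timings.items()}
```

Under that assumption each of the first three timings works out to about 13.5 ns and tRAS to about 36 ns, so every cycle shaved off a setting tightens the window by roughly 1.5 ns.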

 

@dgaschk: No, it is three drives, and I can't replicate the behavior on another system using those same drives.

 

@ljh89: There is no overclocking or core-unlocking enabled.

 

Thanks for all the feedback!

Link to comment

It is not just RAM timing, but RAM timing, clock speed, and voltage.  All three must be as the manufacturer specified.

 

According to this page:

http://www.shopping.com/Kingston-Kingston-2GB-Kit-2x1GBs-1333MHz-DDR3-Desktop-Memory-Retail-KVR1333D3K2-2GR/info?sb=1

 

Your RAM needs:

Kingston ValueRAM's KVR1333D3K2/2GR is a kit of two 128M x 64-bit 1GB (1024MB) DDR3-1333 CL8 SDRAM (Synchronous DRAM) memory modules, based on sixteen 64M x 8-bit DDR3-1333 FBGA components per module. The SPDs are programmed to JEDEC standard latency 1333Mhz timing of 8-8-8 at 1.5V.

 

You might just be supplying the wrong voltage or timing (most likely the wrong voltage, as the 1.5 volts it needs is higher than many RAM strips specify).

Link to comment

I'm glad you're identifying the issue but I do agree it's a tough one to figure out.

 

I agree with the other recommendations. Try different memory if possible, or a single RAM module at a time, and also try a different power supply if you have one available. Corsair, Seasonic and PC Power and Cooling seem to be the go-to power supply manufacturers. I'd be surprised if HEC makes a quality power supply.

 

 

Link to comment

@Joe L.: The BIOS reports that it is running the DRAM at 1333Mhz and 1.504V.  Though the SPD reports that this is 9-9-9 DRAM, I did try 8-8-8, and found that it destabilized the system--plenty of kernel panics.  Should I trust what SPD tells me?

 

@chickensoup and @lionelhutz: I don't have an appropriate alternate power supply handy, but will order one.  Does anyone recommend anything more than Corsair, Seasonic and PC Power?

 

@lionelhutz: I'll try one module at a time, then order some new RAM when I order a new power supply.

 

I'll have two servers soon... ;-)

Link to comment

If you are still looking, Thermaltake & Antec also make good power supplies. For unRAID I'd recommend a Corsair, but any brand-name power supply should be fine. What you are looking for is something with a single 12V rail. See this power supply thread for more info: http://lime-technology.com/forum/index.php?topic=12219.0

Link to comment

@Joe L.: The BIOS reports that it is running the DRAM at 1333Mhz and 1.504V.  Though the SPD reports that this is 9-9-9 DRAM, I did try 8-8-8, and found that it destabilized the system--plenty of kernel panics.  Should I trust what SPD tells me?

If you use 8-8-8 and you get kernel panics, and keep those settings, then you don't have to worry about parity errors any more  ;)

 

Seriously, it sounds as if you had correct timings and voltage.

 

Joe L.

Link to comment

@lionelhutz and @chickensoup: Thanks for the advice, guys!

 

@Joe L.: I like your positive attitude  :o

 

The situation at the moment is this: I pulled one of the two DIMMs, ran my 10,000-pass drive-reading test, and it passed!

 

So, I pulled that DIMM, put just the other DIMM in, and fired up the 10,000-pass test again, expecting (or, at least, hoping for) failures.  But it passed, too!

 

So...either my testing results are influenced by less or differently-configured RAM, or this system works well with one DIMM or the other, but not both.  They came packaged together as a "Dual Channel" kit, and have been installed in the appropriate slots, per the motherboard manual.

 

I've ordered a different brand of RAM to be delivered tomorrow, in hopes it will play nice with the motherboard.  I'm holding off on a new power supply for the moment.

 

Other thoughts?

Link to comment

Okay...I have a little more news...

 

If I put one of the two matched DIMMs in slot 1, everything works well.

 

If I put the other of the two matched DIMMs in slot 1, everything works well.

 

If I put the two matched DIMMs in slots 1 and 2, I get my MD5 data corruption problem.  This is the recommended Dual Channel configuration according to the motherboard documentation.

 

If I put the two matched DIMMs in slots 1 and 3, everything works well, much to my surprise.

 

So...it is looking like my motherboard has an issue with these specific DIMMs, or maybe with the Dual Channel configuration in general, or with something else I'm missing.

 

New DIMMs (Crucial instead of Kingston) are arriving tomorrow, but at this point I'd put money on them making no difference.

 

(Also, I fully recognize that my issue is no longer about unRAID, per se--this is now just about a motherboard that doesn't like the RAM I put in for some reason, and I'd likely be struggling even if this were a Windows box.  Is this thread now inappropriate for these forums?)

Link to comment

The thread is appropriate, as others in the future may run into similar issues, and this thread may guide them towards a solution.

 

If it were a Windows box, it would probably just crash with a blue screen of death, or programs would lock up, or you would lose your files in the middle of a task.  (In other words, you might not suspect anything unusual at all.)  ;)

 

It may still be a motherboard timing issue, probably not a voltage issue, since the strips work individually.  In any case, I'll be curious to see how the replacement RAM works.

 

Joe L.

Link to comment
