[Solved] - Repeating parity errors - PLEASE HELP!


Hi guys

 

I have been struggling for the last week or two to get a fully stable unRAID going, and unfortunately I have already suffered data loss due to a parity issue I am having.

 

Every time I check the parity (with correction enabled), it finds over 100,000 errors (all grouped, strangely, in the first 500MB of the discs).

 

Anyhow, after a drive failed and I was forced to rebuild from the "corrupt" parity, the new drive had lots of ReiserFS issues, which I hope I was able to correct (time will tell if files were damaged).

 

I thus went on an investigation to find out why the parity could be failing. One possible culprit, it seems, was cards with Sil3132 chipsets, as per the following thread:

http://lime-technology.com/forum/index.php?topic=21052.0

 

To avoid wasting time, I copied the data off the drives on that controller and removed them from the array (i.e. reset the configuration from scratch and rebuilt the array excluding the 2 drives).

 

Unfortunately, my parity errors persisted in the next two parity checks I performed (see attached syslogs).

 

I followed the steps as prescribed in the following wiki :

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

 

but unfortunately I am stuck.

This is the script I used (modified a little bit to allow parameter passing, since 22 scripts didn't make sense :P):

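#!/bin/bash
# Usage: pass the device name (e.g. sdb) as the first parameter.
# Reads the same region of the disk 25 times and appends an MD5 per pass
# to <device>.log; every checksum in the log should be identical.
for i in {1..25}
  do
    echo "Begin $1 for the $i time."
    # No bs= is given, so skip and count are in dd's default 512-byte blocks.
    dd if="/dev/$1" skip=500 count=1100000 | md5sum -b >> "$1.log"
    echo -e "\n\r" >> "$1.log"
  done
exit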

 

I based the skip and count on the syslog entries (although the errors didn't repeat, both sets seem to be below the 1,100,000 range).
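For anyone wanting to check which part of the disk those skip and count values cover, here is a minimal sketch of the arithmetic. It assumes dd's default 512-byte block size (the script sets no bs=); the function name dd_range is just a made-up helper for illustration.

```shell
#!/bin/bash
# Convert dd's skip/count (given in blocks) into the byte range actually read.
# Assumption: dd's default block size of 512 bytes unless a third argument is given.
dd_range() {
  local skip="$1" count="$2" bs="${3:-512}"
  local start=$(( skip * bs ))     # first byte read
  local length=$(( count * bs ))   # number of bytes read
  echo "start=${start} length=${length} end=$(( start + length ))"
}

dd_range 500 1100000
# prints: start=256000 length=563200000 end=563456000
```

So the test reads roughly 563 MB starting just after the first 256 KB of each disk, which lines up with the errors being grouped near the start of the discs.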

 

 

All the MD5 calculations over the 25 repetitions (performed simultaneously on the 22 discs) came out correctly...

 

This leads me to believe the discs are fine. (I also repeated this with a much larger data chunk and couldn't find any fault...)
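With 22 logs of 25 checksums each, comparing by eye gets tedious. A rough sketch of how the logs could be checked automatically follows; it assumes each log is the output of the script above (one md5sum line per pass plus the blank separator lines), and the function names are hypothetical.

```shell
#!/bin/bash
# Count the distinct MD5 sums in one log; a healthy disk should give exactly 1.
distinct_md5() {
  grep -oE '^[0-9a-f]{32}' "$1" | sort -u | wc -l | tr -d ' '
}

# Report, for each log given, whether all passes produced the same checksum.
check_logs() {
  local log n
  for log in "$@"; do
    n=$(distinct_md5 "$log")
    if [ "$n" -eq 1 ]; then
      echo "$log: OK"
    else
      echo "$log: $n distinct checksums"
    fi
  done
}
```

Usage would be something like check_logs *.log from the directory holding the results. Any log reporting more than one distinct checksum points at a disk (or its cable, port or controller) returning different data for the same region.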

 

I am running 5.0-rc11 with simplefeatures / Unmenu and, I think, SNAP as plugins.

 

My two main controllers are Supermicro PCI Express x4 Low Profile SAS RAID Controllers (AOC-SASLP-MV8), and the 6 other ports I am using are on the MB.

 

Are there any other tests I can perform to determine why my parity is in such a state?

 

 

Thx

 

Neo_x

 

PS

Syslogs, MD5 results and SMART reports are attached, just in case I overlooked something.

 

 

*********************************SOLVED*******************************************

It took quite a while, but I eventually managed to pinpoint the issue to two specific SATA ports on the MB (the SATA ports are grouped in two configurations: 4 standard ports, and then these two in some or other RAID-capable setup, thus possibly using another chipset).

I'm not exactly sure why they only start to act up when I connect drives via my Supermicro add-on cards (i.e. on their own I was able to generate a clean parity).

 

For now I have simply disconnected them from the setup and will look at upgrading to a more standardized setup when the time comes. I consolidated data and removed all the old drives from the setup, so for now all is well :)

 

 

 

 

 

syslog_parity_check_1.zip

syslog_parity_check_2.zip

md5.zip

smart.zip

Link to comment

See my sig to disable all add-ons for testing.

 

ok done

 

(guess you need me to redo parity checks with correction enabled)

 

PS

I notice RC12a has been released. Should I upgrade just to be safe (although I don't notice anything parity related)?

 

*edit*

OK, restarted stock with rc12a.

Running a parity check with corrections now. Once the second pass has completed, I will submit syslogs again.

(It's already finding errors though...)

Link to comment

Hi guys

 

Unfortunately, reverting to stock and re-correcting the parity didn't work. :( :(

 

The 2nd check is still going, but it doesn't seem to be getting better (strange part: "some" (not all) of the error blocks seem to be repeating between the two syslogs).
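One way to see exactly which error blocks repeat between two syslogs is to pull out the sector numbers and intersect them. This is only a sketch: the default pattern "parity incorrect: " is an assumption about how the md driver words these lines, not something quoted from the attached syslogs, so adjust it to whatever your log actually contains. The function names are made up for illustration.

```shell
#!/bin/bash
# Extract the unique sector numbers that follow a given pattern in a syslog.
# ASSUMPTION: errors are logged as "<pattern><sector>"; the default pattern
# below is a guess and can be overridden with the second argument.
error_sectors() {
  local file="$1" pattern="${2:-parity incorrect: }"
  grep -oE "${pattern}[0-9]+" "$file" | grep -oE '[0-9]+$' | sort -u
}

# Print the sectors reported in both syslogs, in numeric order.
common_sectors() {
  local a b
  a=$(mktemp)
  b=$(mktemp)
  error_sectors "$1" "$3" > "$a"
  error_sectors "$2" "$3" > "$b"
  comm -12 "$a" "$b" | sort -n   # comm -12 keeps only lines present in both files
  rm -f "$a" "$b"
}
```

Sectors that show up in both checks after a correcting pass suggest something is still returning bad data for those locations, while non-repeating ones look more like intermittent transfer errors.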

 

Any recommendations as to how I could find out what the cause of this issue could be??

 

*panicking here*

rc12a_syslog_1.zip

rc12a_syslog2.zip

Link to comment

*bump*

 

I am currently running a 1TB DD test with 5 repetitions. The problem is that on 22 drives it is pushing the CPU to 100% all the time, thus I am not sure how accurate it is going to be. It is going to take another 48 hours, if power holds ....

 

 

any other ideas please?

 

 

Link to comment

And so i am at a loss.

 

1TB DD test completed: no failures at all after almost 3 days of continuous reading ... a stress test which even a parity check can't match.

(see attached results)

 

Will try upgrading firmware on my Supermicro MV8 controllers now, just to be sure.

 

 

Still outstanding (the only ideas I have): SMART tests for the discs and possibly a memory test...

md5__1TB.zip

Link to comment

Any drive could be causing the errors. Post SMART reports from all drives. There is no need to wait.
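Collecting reports from 22 drives by hand is slow, so here is a small sketch that generates the commands instead. It assumes smartmontools is installed and that the drives appear as /dev/sdX; the function name and the smart_<device>.txt file names are made up for illustration.

```shell
#!/bin/bash
# Print one smartctl command per device; pipe the output to sh to run them.
# smartctl -a prints all SMART information for the device.
smart_report_cmds() {
  local dev
  for dev in "$@"; do
    echo "smartctl -a /dev/$dev > smart_$dev.txt"
  done
}
```

For example, smart_report_cmds sdb sdc | sh would write smart_sdb.txt and smart_sdc.txt. Printing the commands first makes it easy to review them before anything runs.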

 

My first post included smart reports if memory serves.

I will post updated reports if needed once I am back - currently away on  a business trip :-)

 


 

 

Link to comment

Most of the drives have been overheated. Run the dd test on all of the previously cooked drives to determine the culprit.

 

I know about the overheating issue. That was in a completely different case, and it didn't cause any issues on my previous controller (I had an Adaptec 24 port handling all the drives; the only negative part was that it didn't support spin-down and I was required to use JBOD to get the drives visible in unRAID).

 

In order to rule out hardware, I have run a long SMART test and a memory test. Both seem clear (see attached).

 

Next up I will repeat the parity test on the three different sections of the machine (both MV8 controllers and also the 6 ports I use on the MB).
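When splitting tests by controller, it helps to confirm which drives actually sit behind which controller. A minimal sketch, assuming a Linux sysfs layout where /sys/block/sdX resolves to a path containing the controller's PCI address; the function names are hypothetical.

```shell
#!/bin/bash
# Pull the last PCI address (e.g. 0000:02:00.0) out of a resolved sysfs path;
# that address identifies the controller the disk hangs off.
pci_of_path() {
  echo "$1" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | tail -n 1
}

# List each sdX disk next to the PCI address of its controller, grouped by controller.
list_by_controller() {
  local dev
  for dev in /sys/block/sd*; do
    [ -e "$dev" ] || continue
    echo "$(pci_of_path "$(readlink -f "$dev")") $(basename "$dev")"
  done | sort
}
```

Running list_by_controller prints lines such as an address followed by a device name; the addresses can be matched against lspci output to tell the add-on cards apart from the onboard ports.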

 

Will give feedback once I have located a possible cause for the issue.

 

smart_long_test.zip

mem_test.JPG.8267217434df9ebf530ffe6be9f7253e.JPG

Link to comment

Just wanted to leave an update. Hopefully someone can recommend something else I can check....

 

hi guys

 

some updates....

 

For the sake of simplification: I have two Supermicro SASLP-MV8 controllers, and then the MB, which offers 6 ports.

I ran various tests: clearing the configuration, building the parity and then checking it.

 

1 x MV8 controller : pass

2 x MV8 controller : pass

2 x MV8 + Motherboard : fail

Motherboard only : pass

Motherboard + 1 x MV8 : fail (although only 100 errors versus the 100,000 when using the full setup)

 

Syslog doesn't show any issues during parity generation or checks, thus I don't think it is SATA, power or drive related (keep in mind I performed a simultaneous DD + MD5 check for the 2 x MV8 + motherboard setup, and didn't pick up any issues...).

 

I don't want to keep stressing the system with data-carrying hardware, as a failure somewhere along the line will leave me very sad...

 

Considering maybe replacing MB / CPU / RAM, although it's a huge $400 expense which I don't want to incur unless it's the only option..

 

 

Any other recommendations are welcome...

Link to comment

What power supply do you have? 

 

I have a similar configuration to you and was getting similar errors until I upgraded the power supply to one capable of delivering more power.  I am guessing that under load my original power supply was struggling to handle the load.

Link to comment

If memory serves, I went with a Corsair 750 watt (100 watt higher than the recommended Corsair model for the beast builds...)

Link to comment

A simple test as to whether or not it's likely the power supply:    You've got 22 disks, so build a config with just 12 of them -- 4 plugged into each MV8 and 4 on the motherboard.  Do NOT apply power to the other 10 disks for this test.

 

If this works perfectly, then it's almost certainly a power issue.

 

Link to comment

Wow - excellent plan :D

Will probably repeat for both sets, which should give an indication if any connectivity issues exist as well

 

Thx for the idea:-)

 


 

 

Link to comment

Note that the test I just suggested will take a LONG time to conduct => hopefully UnRAID will recognize the 11 data disks you use; but it will still have to write the initial parity.    THEN you need to run a parity check (correcting would be best).    Then if that's good you may want to run another one just to build a bit of confidence in the setup.

 

If that works well (I suspect it will, based on what you've done to date), then you can replace the power supply with a higher-capacity unit and add the other 10 disks to the array.

 

FWIW a 750w single-rail PSU should be sufficient for a 22 disk setup;  but this can depend a lot on the specific disks, which have widely varying power requirements, spin-up demands, etc.    But if you're replacing it, I'd get a high-quality, single-rail 850w or better unit.  If you want to stay with Corsair, the HX850 would be a good choice.  I like the Seasonic units -- the modular M12II 850 is an excellent unit.

Link to comment

Not sure if power will prove to be the fault.

 

I used the same power supply (and thus the same power connectors as well) with the same drives (24 drives + 1 cache) without issue on my previous SATA controller (Adaptec 52445), which is a 28 port card. The only issue was that I couldn't resolve spin-down functionality with it.

 

Thus I recently swapped out the card for the two MV8s and utilized the remaining ports on the MB, and that's when the fun started.

 

The breakout cables towards the motherboard ports had to be replaced as well, since I discovered the hard way that breakout cables are directional (i.e. from controller to drives, or from breakout panel towards the SATA ports on the MB).

 

But Murphy can always teach us something new. I've been struggling for two weeks already and am almost willing to give up and replace MB / CPU etc. (the board is semi-old (model evades my memory now :P) with a Q6600 quad core CPU).

 

 

Link to comment

It may not be the issue, but I certainly suspect it, based on this logic ...

 

When I saw your note that:

 

1 x MV8 controller : pass

2 x Mv8 controller : pass

2 x MV8 + Motherboard : fail

 

... my initial thought was "aha -- motherboard problem".

 

Then, of course, I read the next line, where you noted:

 

Motherboard only  : pass

 

Then you noted that the motherboard plus one MV8 didn't work, but had far fewer problems.

 

Based on that, it seems your system plus two add-in controllers plus 22 drives may be overtaxing one of the buses on the PSU.    No guarantee of that, of course, but the test I suggested should give you a pretty good indication whether this is the issue.

 

In any event, I can't really think of anything else to try short of a different motherboard !!

 

Link to comment

Just thought I should post an update:

 

Tried the alternating drive recommendation by Garycase above and had some interesting results.

The configuration was split into two 11 drive setups (i.e. removing 11 drives from the setup).

The first set had 4 drives each on the MV8 controllers (testing the breakout panel for the Norco case on the first 4 slots). The last two breakout panels required SAS to SATA breakout cables towards the motherboard SATA ports. For the first pass I had the parity + data on slot 6, and one more drive on slot 5.

Building and checking parity twice revealed a huge reduction in parity errors (still not 0, but less than 20).

 

Repeating the same setup with all the other drives (leaving the parity on slot 6, with 3 data drives on slot 5 + 8 drives on the other slots) suddenly resulted in a huge jump in errors (I think it was in the region of 600).

 

Since I still encountered errors in both setups, it confirmed that the power supply is not a likely cause. Due to the higher number of errors with more drives on slot 5, I started to think that there is a problem specific to that slot.

 

I then swopped the breakout cable on slot 5 with one I had spare, and moved the parity to that slot (i.e. bypassing slot 6 as well). So far so good: running parity check 2, and no errors have been noted yet.

 

*HAPPY FACE*

 

Not sure why syslog won't show drive communication issues related to problematic breakout cables though, but it seems that I have found a possible culprit (or me fiddling around in the case has chased the gremlin somewhere else).

 

Will integrate the drives on slot 6 next, and should hopefully have a working setup again.

 

Needless to say, this was a massive monster to troubleshoot. Luckily I haven't incurred any additional expenses yet with regards to motherboards etc. (although I'm still tempted to; extra CPU power is always welcome :D:P :P)

 

 

 

 

 

 

Link to comment

Sounds like you've found the problem => now you just need to confirm that's the only bad breakout and you'll be home free  :)

 

... will, of course, take a while to do those confirming tests (depending on just how many drives you add after each iteration) ... but you're definitely on the home stretch.

 

I'm not surprised it wasn't a power issue ... modern drives simply don't use enough power to overtax a 750W single-rail supply !!

 

Link to comment
  • 4 weeks later...

Glad all's well.  FYI, based on your outline of the resolution, the issue was almost certainly an incompatibility between the chipset used for the extra SATA ports on the motherboard and the add-in SuperMicro cards.    It happens.  I've had numerous cards over the years that simply wouldn't work in specific motherboards for exactly that reason.

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
