[SOLVED] Possible causes for repeating parity sync errors?



I've been running my first Unraid setup for a few years under v4.4.2, gradually adding new disks without any major problems. A few months ago I added my 10th drive (a 2TB Samsung) and also needed to add an additional SATA card. I had some initial trouble getting the card working (see the post http://lime-technology.com/forum/index.php?topic=19408.0) but eventually got it working perfectly, or so it seemed. The 10th drive precleared nicely and there were no signs of problems. The new drive is the Samsung HD204UI, notorious for "silent corruption", but its manufacturing date is well past the fix date so I did not apply any firmware updates.

 

A few days ago I decided to manually run a parity check, which immediately started to accumulate sync errors: a total of ~600,000, most of them from the first half of the check. I then re-ran the parity check with almost identical results, over 700,000 sync errors. Now I'm really worried, and my eye doesn't catch anything suspicious in the syslog. The SMART reports also seem to be OK. Everything is attached to this post for expert eyes.

 

What are the possible causes for repeating parity sync errors? Could it be that one or more of the recent 2TB Samsungs were not fixed even though they should have been? Could it be a cabling issue? Could it be the second, identical SATA card?

 

What steps can I take to isolate the problem? Should I first remove the 10th disk from the array and then rebuild parity to narrow things down?

 

Any help is much appreciated!

 

-------------------------------------------------------------------------------------------------------

SOLUTION

In this case the problem was the recently purchased second SiI3132 2-port SATA card, which couldn't handle heavy load on both ports. A single port was fine, so preclear did not reveal any problems. I have one working and one non-working card from the same supplier, looking almost exactly the same, so it would be impossible to know beforehand which card make/model might cause problems. In this case random read errors were produced for the two attached disks. If someone purchased a new controller and one disk, the problem would emerge only when the second port was taken into use. The problem was also quite nasty to debug, so a non-pro user would most likely be lost.

 

Since there are at least a few identical reports (SiI3132 failing when both ports are loaded), I would not recommend using any low-cost SiI3132 SATA card in any environment. There are "safer", stable products available. On the other hand, if you have a working SiI3132 card it is unlikely to fail later; all this looks like a manufacturing quality control issue.

 

As part of the investigation I found the following threads/pages to be very informative:

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

http://lime-technology.com/forum/index.php?topic=18428.msg167122#msg167122 "{SOLVED-Beware SiL3132} - How to determine which drive is causing sync errors?"

http://lime-technology.com/forum/index.php?topic=17641.0 "[sOLVED] Increasing parity corrections over a month "

 

I stored some of my scripts used in the investigation in this post:

http://lime-technology.com/forum/index.php?topic=21052.msg187365#msg187365

syslog-2012-06-21.txt

smart.zip

Link to comment

A problem like this could easily be traced to ANY of multiple causes.

 

Easiest to test is memory.  Bad (or improperly configured) memory will cause random parity errors.  Reboot your server, and in the initial menu choose the memory test.  Let it run for several full passes, preferably overnight.  There should be NO errors.

 

If there are, confirm that your BIOS is set to the correct voltage, clock speed, and timings for your SPECIFIC make/model of memory sticks.  Most BIOSes set them automatically; many get them right, some do not.  If all is correct but errors still occur, try running with fewer memory sticks if you can, to isolate the bad one.

 

Next, make sure the power supply is up to the task of feeding 10 disks.  A multi-rail supply may not be, regardless of its wattage rating, as only 1 rail is typically used to feed all the disks + fans + motherboard.  (other rails are typically dedicated to CPU and PCIe cards)

 

If those are all in order, it is time to test each disk in turn for consistent data being returned (unless obvious media errors are occurring constantly; those would show in the SMART reports as sectors being re-allocated, or pending re-allocation).

 

Search for posts describing the use of "dd" and md5sum to repeatedly read from each disk in turn.  Constantly reading the same set of blocks should return the exact same checksum (as long as you are not writing to the disk).  If not, the disk or the disk controller may be defective, and that would easily cause random parity errors.  We've seen reports of both over the years... these are the hardest to locate, as they are often not accompanied by errors in the system log as the disks are accessed.
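
As a minimal sketch of the idea (the device name /dev/sdf and the block count are examples; pick your own idle disk, and don't write to it while testing):

for i in 1 2 3; do
    dd if=/dev/sdf skip=0 count=1000000 2>/dev/null | md5sum
done
# all three checksums should be identical; if they differ, the disk or its
# controller is returning inconsistent data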

 

 

Link to comment

Mem test now running, looking good after 1st pass.

 

The dd and md5sum usage was best described right in the wiki:

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

 

Unfortunately I'm running 4.4.2, which does not report the block position of sync errors, but since the sync errors emerge immediately after starting the parity check, I assume I can start from the beginning of the drives and work my way up until I see problems. However, I do not fully understand the point of running the test scripts on multiple drives in parallel to "stress the I/O system", as suggested by the wiki article; I would expect that to reveal problems with something other than a single drive. I will start with a single drive and, only if that does not show any problems, continue with parallel execution.

 

The PSU is an Antec TruePower Trio TP-650, which I investigated when installing the 10th drive:

"In the mean time I did some research on the Antec Truepower Trio TP-650. It seems like the unit is a single rail PSU after all even though it's marketed as a 3 rail unit. I found several reviews and tests where this was discovered (link to Jonny Guru review below). So it should provide around 50A of total power which should be more than plenty.

http://www.jonnyguru.com/modules.php?name=NDReviews&op=Story3&reid=1"

 

However, I have no reasonable way of confirming this. It would make sense that the problems emerge when running the parity check, when the power drain is at its maximum. Is there some way, besides detaching drives or replacing the PSU, to find out whether the PSU is powerful enough?

 

And thanks for a very speedy response!

Link to comment

Having spent many hours on this problem, I've made little progress.

 

This is what I've done so far:

- Ran mem test for 6 hours -> no errors

- Modified and ran the dd/md5 test individually on each disk (50GB) -> all md5 hashes matched

- Ran mem test a second time; 2 errors appeared immediately, then no additional errors after 8 more hours

- Manually changed the memory timings and voltage from auto (6-6-6-18) to match the Kingston recommended values (5-5-5-15) -> no effect

- Tried different combinations with the two memory sticks I have, one at a time in different slots -> no effect

 

A few questions, the first two being the most important:

 

1. If I let the parity check run for 5 minutes letting it fix errors, and then stop and restart the parity check, should the first 5 minutes be error free? I'm trying to find a quicker way than running the whole parity check to see if changes have any effect.

 

2. Should I have done something during preclear regarding HPA and/or partition alignment? I faintly remember investigating this when I got the first 2TB Samsung, but I think I did not do anything about it. The syslog is reporting an HPA on each 2TB Samsung.

 

3. Shouldn't the individual dd/md5 disk read tests also fail if the memory is failing?

 

4. Shouldn't the memory test produce more errors, given that the parity check generates ~200-300 sync errors per second?

Link to comment

2. Should I have done something during preclear regarding HPA and/or partition alignment? I faintly remember investigating this when I got the first 2TB Samsung, but I think I did not do anything about it. The syslog is reporting an HPA on each 2TB Samsung.

 

Using Joe's partition check tool (http://lime-technology.com/forum/index.php?topic=5072.msg47122#msg47122) I managed to verify that all the drives have a correct partition table. Below is the result from the newest, 10th drive. I will now continue with ReiserFS checks...

 

########################################################################
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9AB711408
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes

Disk /dev/sdh: 2000.3 GB, 2000398934016 bytes
1 heads, 63 sectors/track, 62016336 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdh1              63  3907029167  1953514552+  83  Linux
Partition 1 does not end on cylinder boundary.
########################################################################
============================================================================
==
== DISK /dev/sdh IS partitioned for unRAID properly
== expected start = 63, actual start = 63
== expected size = 3907029105, actual size = 3907029105
==
============================================================================

Link to comment

You don't have any HPAs; the kernel version you are running had a bug that wrongly reported HPAs.

 

For other reasons, my suggestion would be to upgrade to v4.7; I cannot think of any reason at all why it would not be better, and it will probably be faster at writing too.  You only need to extract bzimage, bzroot, and memtest and copy them to the root of the flash drive.  Of course, first copy the drive list with serial numbers from the Devices page, and make a complete backup of your flash drive, so you can revert later if you wish.  Then try booting v4.7, check the Devices page for correct drive assignments, do a quick check that your UnRAID system feels normal, then try a parity check again.  The syslog should have more info this time as well.
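
A rough sketch of those steps (paths are examples; on unRAID the flash drive is mounted at /boot, and I'm assuming here that you extracted the v4.7 release zip to /tmp/unraid-4.7):

mkdir -p /mnt/disk1/flash-backup
cp -r /boot/* /mnt/disk1/flash-backup/    # complete flash backup first
cp /tmp/unraid-4.7/bzimage /tmp/unraid-4.7/bzroot /tmp/unraid-4.7/memtest /boot/
# reboot, check the Devices page for correct drive assignments, then run a parity check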

 

When you have hundreds of thousands of errors like that, it has to be a far more drastic problem than random memory or power errors.  It feels more like an entire drive is missing, but there are no drive errors at all in your syslog.  Therefore, I cannot help wondering if you may be right about the second SiI3132 card, as if it were reading the drive on the second card as though it were on the first card, actually reading one drive twice.  Does not seem possible, but something is clearly very wrong.  Hopefully, v4.7 will tell us more.

Link to comment

I've been sticking to v4.4.2 because the server case is almost at its physical capacity, and to my understanding 4.4.2 is still regarded as quite solid. Based on previous experience no upgrade is easy, e.g. the new Linux kernel might introduce problems with my HighPoint RocketRAID card. As the system was rock solid until I added the second SATA card and the 10th drive, the problem must either be related to that or some hardware broke in the same time frame. I consider the latter unlikely but still possible.

 

So I ran reiserfsck on all drives and no errors were reported. Then I started testing with the dd/md5 method, running the test in parallel on all drives (I had run it individually earlier). To my understanding this pretty much simulates the parity check mechanism. Starting from block 0 with a block count of 1,500,000 I immediately got hash mismatches from drives sdf and sdg. These drives are attached to the 1st of the two identical add-on SATA cards, sdf being the parity drive. I repeated the test several times; it was always those two drives that had hash mismatches. The other drives were always fine. OK, some progress.

 

Then I started to narrow down the problem:

- Left out sdh, which is attached to the 2nd identical add-on SATA card -> sdf and sdg hash mismatch

- Ran the test only on sdf and sdg (i.e. without other drives testing in the background) -> sdf and sdg mismatch

- Left out sdh and sdg (i.e. sdf and the rest of the disks under test) -> no hash mismatch

 

So sdf and sdg produce hash mismatches whenever they are both active, and it does not matter whether sdh (attached to the 2nd SATA card) is active or not. Conclusion -> the two identical SATA cards somehow conflict with each other.

 

Next I will remove the second SATA-card from the system and repeat these tests.

 

For anyone interested, I used the following scripts to do the testing. They are not generic (they require system-specific tinkering) but should provide a quite nice basis for writing your own. I'm no bash guru and the scripts are far from optimised ;)

 

reiserfsck_all.sh. Runs reiserfsck in check mode silently on all drives. Adjust the drive count accordingly. Stops Samba and unmounts all disks. Any process (e.g. cache_dirs) accessing the mounts must be stopped prior to running this script.

#!/bin/bash
logDir=/boot/logs/reiserfsck_all
driveCount=10

mkdir -p $logDir

# just to make sure you are in /root
cd

# stop samba
samba stop

# for each drive unmount, run reiserfsck check (read-only) and mount again
for ((i=1;i<=driveCount;i++)); do
    drive=/dev/md$i
    mountPoint=/mnt/disk$i
    logFile=$logDir/md$i.log
    umount $drive
    reiserfsck --check --yes --logfile $logFile $drive
    mount $drive $mountPoint
done

# restart samba
/usr/sbin/smbd -D
/usr/sbin/nmbd -D

 

ddmd5check_all.sh. Runs the dd/md5 check simultaneously on all drives listed in a text file given as a parameter. The start block, block count, and repeat count are also given as parameters. Uses ddmd5check_drive.sh as a child script. Checks the hash results and prints out either OK or a hash mismatch. Logs are stored on the flash drive (/boot/logs/ddmd5check/$timeStamp).

#!/bin/bash
set -u
set -e
deviceListFile=$1
startBlock=$2
blockCount=$3
repeatCount=$4
timeStamp=`date +%F_%H%M%S`

logDir=/boot/logs/ddmd5check/$timeStamp
logFile=$logDir/ddmd5check.log

mkdir -p $logDir

echo "$timeStamp deviceListFile=$deviceListFile startBlock=$startBlock blockCount=$blockCount}" >> $logFile

# run all drive checks in parallel
for drive in $(cat $deviceListFile); do
    ./ddmd5check_drive.sh $startBlock $blockCount $repeatCount $drive $logDir &
done

exit

 

ddmd5check_drive.sh. Child script for ddmd5check_all.sh.

#!/bin/bash
set -u
set -e
startBlock=$1
blockCount=$2
repeatCount=$3
drive=$4
logDir=$5
logFile=$logDir/$drive.log  

for ((i=1;i<=$repeatCount;i++)); do
    echo "Begin $drive for the $i time. startBlock = $startBlock blockCount=$blockCount"
    dd if=/dev/$drive skip=$startBlock count=$blockCount status=noxfer 2>/dev/null | md5sum -b >> $logFile
done  
    
# check the hash result file
previousHash="empty"
hashFail=false
for hash in $(cat $logFile); do
    if [ "$hash" != "*-" ]; then
        if [ "$previousHash" != "empty" ]; then
            if [ "$previousHash" != "$hash" ]; then
                hashFail=true
            fi
        fi
        previousHash=$hash
    fi
done
if $hashFail; then
    echo Hash mismatch with drive $drive in $logFile
else
    echo Drive $drive OK
fi

exit

 

drivelist.txt. An example drive list for ddmd5check_all.sh, given as the first parameter:

sdf
sdj
sdk
sdl
sdb
sda
sdi
sdd
sdc
sdg
sdh
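
For reference, a hypothetical invocation (start block 0 and block count 1500000 are the values used in my testing above; the repeat count of 3 is just an example):

chmod +x ddmd5check_all.sh ddmd5check_drive.sh
./ddmd5check_all.sh drivelist.txt 0 1500000 3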

Link to comment

:o  ???  Thank you very much mryanr! The excerpt below from the linked thread matches my findings 100%. So instead of two identical SiI3132 cards interfering with each other, it could be a single card using both of its channels. I will test this by removing sdg (one of the data drives) from my 1st SiI3132 SATA card (instead of sdh from the 2nd one). I'll report back once finished.

 

It's also quite amazing how chuck23322 went down exactly the same path (and hit the same pitfalls) as I did, and yet I was not able to find that thread.

 

Testing complete.  Problem isolated and PROVEN.  Under heavy load, {created by DD/MD5 tests as suggested and discussed earlier in this thread}

 

1) Accessing two drives on any ONE side of the 3132 results in NO problems.

2) Accessing two drives (one on EACH side) of the 3132 results in returns of random data at random increments.  Silently.

3) It's not the drive combination that causes this, eg, the same tests - changing drive pairs - results in the problem only occurring when it's "EACH side" of the 3132 being used, not which drive pairs are used.

 

Ok, now we have - between my testing, previous person's testing, etc -- IMHO, more than anecdotal evidence of the 3132 - or at least some 3132s - having problems with accessing {under heavy load} both sides of the 3132 at the same time.

 

I'm pulling it from my build.  I've just got to decide now what to replace it (and my others) with.

...Chuck

Link to comment

:o  ???  Thank you very much mryanr! The excerpt below from the linked thread matches my findings 100%. So instead of two identical SiI3132 cards interfering with each other, it could be a single card using both of its channels. ...

 

You're welcome, I think ;)

 

As I remember, I did do some brief testing on whether two separate 3132s (using one channel on each) had the problem -- and I don't think they did.

 

But once I realized the 3132s had a proven issue in my build, I no longer felt safe with them in it.  I went with the 8-port SuperMicro card (SATA II) and now some additional ASMedia cards (SATA III) {now supported in 5-RC3 and above}...    You can get some ASMedia dual-port cards for about $30 or so -- but they won't do multi-port {in unRAID, though the specs say they can}.

 

...Chuck

Link to comment

Well, the plot thickens...

 

To keep everyone, including myself, on the same page, let's first define the test configuration items:

- CARD_A = 1st SiI3132-based two-port SATA card by Fujtech.

- CARD_B = 2nd SiI3132-based two-port SATA card by Fujtech. This one was added to the system 1-2 months ago when adding the 10th drive.

- PARITY = parity disk

- DATA_1 = 1st data drive

- DATA_2 = 2nd data drive

- 7 other drives attached to MB and RocketRAID2310. No changes to these.

 

I performed dd/md5 tests with the following configurations and changes:

- No changes

  CARD_A: PARITY & DATA_1

  CARD_B: DATA_2

  -> hash mismatches on PARITY and DATA_1

 

- removed DATA_1 from CARD_A

  CARD_A: PARITY

  CARD_B: DATA_2

  -> no hash mismatches

 

- attached DATA_1 to CARD_B

  CARD_A: PARITY

  CARD_B: DATA_2 & DATA_1

  -> no hash mismatches

 

- switched PARITY and DATA_1

  CARD_A: DATA_1

  CARD_B: DATA_2 & PARITY

  -> no hash mismatches

 

- moved DATA_2 to CARD_A

  CARD_A: DATA_1 & DATA_2

  CARD_B: PARITY

  -> hash mismatches on DATA_1 and DATA_2

 

- removed CARD_B from the system (PARITY left out since no more ports available)

  CARD_A: DATA_1 & DATA_2

  -> hash mismatches on DATA_1 and DATA_2

 

- moved CARD_A to a different PCI Express slot

  CARD_A: DATA_1 & DATA_2

  -> hash mismatches on DATA_1 and DATA_2

 

- replaced CARD_A with CARD_B

  CARD_B: DATA_1 & DATA_2

  -> no hash mismatches

 

- installed CARD_A again

  CARD_B: DATA_1 & DATA_2

  CARD_A: PARITY

  -> final configuration; parity sync started (results below)

 

So the read errors occur only on the 1st SiI3132 card (CARD_A) when it has two drives attached, and never on the 2nd card (CARD_B). It does not matter whether CARD_B is present or not. I also tried different PCI Express slot combinations, and it was always CARD_A failing regardless of the slot used. Since I know for certain that I had a working parity check for the last year or so, CARD_B must be the original card and CARD_A the new one; the cards were most likely swapped during the problem solving 2 months ago. CARD_A seems to be somehow defective, since it cannot support two drives at the same time. Next time I have identical cards I'll remember to label them. From these results it seems the new card is partly defective and there is no general problem with the SiI3132, but it could be that there are a lot of low-cost, low-quality cards around causing problems.

 

I left it in the configuration where DATA_1 and DATA_2 are on CARD_B and PARITY is on CARD_A. I have a parity sync running now; I guess I will have to do two runs. I'll report the final results, but it looks good: I restarted the parity check and there were no sync errors on the repeated portion.

 

In the meantime I will be hunting for a replacement. The SuperMicro AOC-SASLP-MV8 has been heavily recommended; I'm just trying to find a supplier that delivers here to Finland. I also have to make sure it works with 4.4.2 and find out what cables are required for individual drive installation (SAS -> SATA breakout).

 

I really kick myself for not running full tests, or even a parity check, when getting the 2nd SiI3132 working 2 months ago. I guess I was too tired after banging my head against the wall for several days just getting the system to boot and the drives detected with the 2nd SiI3132 card. The 10th drive's preclearing went nicely, so I didn't realise that the parity check would fail... I have to install the periodic parity check script to avoid anything like this in the future.

 

And thanks to Chuck23322 too! I just wish I had found your thread (and the others) two days ago. Like mbryanr, I didn't catch the SiI3132 connection and thus missed the link in the wiki. All the problem-hunting threads are invaluable, and it is also very good to see not only the end result but all the other steps taken. Nowadays I normally visit the Unraid forums only in problem-solving situations, so I really appreciate the work of the forum regulars. I'm just trying to chip in by documenting my efforts and findings.

Link to comment

From these results it seems the new card is partly defective and there is no general problem with the SiI3132, but it could be that there are a lot of low-cost, low-quality cards around causing problems.

 

Glad you narrowed it down.  It is a PITA when things are "silently" going on that can potentially destroy data.

 

Regarding your statement, I think it's semantics or nit-picking -- but I'm firmly on the side that SiI3132s should be avoided in Unraid systems.

 

While there is empirical evidence that they run fine in some builds, or that some of them run fine, I cannot see everybody having to do MD5 checks to "make sure" theirs works fine.  Or worse, not bothering to do MD5 checks and having a silent "killer" lurking in their build.

 

It also seems to be load-related.  Some of the empirical evidence is that in non-Unraid scenarios they're fine for general-purpose use -- but under heavy load (aka Unraid parity rebuilds/checks) they silently fail.

 

Since there are plenty of "nearly as cheap" cards out there that do not depend on the SiI3132 -- and work fine in UnRaid...  I think folks ought to seriously consider avoiding SiI3132s -- or at least be knowledgeable about the potential danger.

 

My 2 cents.

 

Link to comment

Regarding your statement, I think it's semantics or nit-picking -- but I'm firmly on the side that SiI3132s should be avoided in Unraid systems.

...

Since there are plenty of "nearly as cheap" cards out there that do not depend on the SiI3132 -- and work fine in UnRaid...  I think folks ought to seriously consider avoiding SiI3132s -- or at least be knowledgeable about the potential danger.

I was just stating that my particular problem is not related to a generic SiI3132 incompatibility with Unraid but to a single failing controller card. My other SiI3132 card has been working nicely for over a year, and there are other people reporting systematic, long-term good results using SiI3132 cards with Unraid. Problems are always reported far more often than successes; it could just be that SiI3132 cards are so popular that they therefore have more problem reports.

 

I will never buy a SiI3132 card again and I would not recommend them for Unraid usage. Perhaps a more noticeable warning/disclaimer could be added to the Unraid wiki (http://lime-technology.com/wiki/index.php/Hardware_Compatibility#PCI_SATA_Controllers) regarding potentially low-quality SiI3132 cards?

Link to comment

I was just stating that my particular problem is not related to a generic SiI3132 incompatibility with Unraid but to a single failing controller card. ... Perhaps a more noticeable warning/disclaimer could be added to the Unraid wiki regarding potentially low-quality SiI3132 cards?

 

I think we're in agreement ;)    And I agree that a more prominent warning probably ought to be in the Wiki.

 

One person's SiI3132 may work fine, another's may not -- but given the difficulty of telling the difference, and the danger involved, now with multiple reports, I'd be quite wary.  We now have several pieces of evidence to support this.

Link to comment

Regarding your statement, I think it's semantics or nit-picking -- but I'm firmly on the side that SiI3132s should be avoided in Unraid systems. ...

It also seems to be load-related.  Some of the empirical evidence is that in non-Unraid scenarios they're fine for general-purpose use -- but under heavy load (aka Unraid parity rebuilds/checks) they silently fail. ...

So you would use one in a "Windows" PC?    Incorrect data might not show itself as frequently, as Windows might not make as many simultaneous calls to the disks, but it only takes one occurrence to trash an important file.  I think I'd avoid a SiI3132-based controller regardless of the OS. 
Link to comment

So you would use one in a "Windows" PC?    Incorrect data might not show itself as frequently, as Windows might not make as many simultaneous calls to the disks, but it only takes one occurrence to trash an important file.  I think I'd avoid a SiI3132-based controller regardless of the OS.

 

Your point is well taken.  No, I wouldn't use them at all.    It's hard to quantify what the trigger point is -- and even harder to determine which SiI3132 cards will and won't exhibit the problem.

 

I was trying to be accommodating, I guess... to those who have argued {in other threads} that "SiI3132s are used all over the place, in hundreds of thousands of systems -- and just because a couple of us unraid folks have caused a few to break - well..."

 

That's not my argument though...  I can no longer recommend that a $10 SiI3132 card be used to drive what is likely a much larger investment in hardware, discs, etc. -- when there are plenty of other alternatives with no such reports of silent corruption.

Link to comment

Short update: the parity sync is now ~35% complete with 7,400 sync errors. I would expect this number to be way higher on this first run, since the last parity check ended with over 700,000 sync errors fixed. If they were due to random read errors on two disks, I would expect ~50% of them -> 350,000 errors by now, if my logic is right. Or it could be that the last parity sync was somehow better than the previous ones, actually fixing most of the previously generated sync errors. It just feels a bit mysterious.

 

I just hate situations like this; it's the same as with a car: you never trust it again once it breaks down and leaves you on the road. Every little weird sound makes you think it is going to break again  :-\ One good thing about it is that you learn to take preventive and damage-control measures...

 

Edit: The first run finished with 250,000 sync errors, which is in the expected ballpark. I just don't understand why the first 35% would generate only 7,400. Well, some things are not meant to be solved. I'll come back with the results of the 2nd run and hopefully we can close this case. I'll update the first post with a compact description of the solution and the steps taken. Hope it helps someone else.

Link to comment

Hope I never have this problem. It could really clobber a lot of data between parity checks.

 

 

The semi-good news is that the {empirical} evidence, for me at least, is that routine reading/writing doesn't cause the problem {I cannot 100% prove this for everybody's scenario, but it's what I saw}.    For me, it occurred during high-volume access -- largely created during a parity rebuild {or simulated by the simultaneous MD5 checking}.

 

So clobbering of data would more likely occur during a drive rebuild, and less likely during routine access. 

 

In my case, what I was getting:

 

1) Back-to-back parity checks {in maintenance mode to ensure no new writes} -- resulted in completely different counts of parity errors, and completely different lists of bad blocks in the syslog

2) Back-to-back parity rebuilds -- same

3) Once the problem was fixed,  one parity rebuild to fix the parity drive -- and a second parity check confirmed no errors

 

Now, of course, we STILL could have had a glitch accessing (read or write) a data disk -- but the empirical evidence is that it is much less likely to have occurred.    I know, I had a horrible feeling that my data drives had been trashed by this error -- until I realized it was during parity rebuilds that I was seeing it -- and thankfully, I had not had a drive failure requiring me to rely on the parity process to rebuild a failed drive.

 

Clearly, rebuilding a failed drive while this silent situation was going on would definitely result in a trashed rebuilt drive.  It would also result in a trashed "simulated" drive.

 

On a semi-related note: I really wish there were some way to convert an "error" listing a sector/block (whatever the number is) in the syslog back to the file(s) that use those blocks.    That would have told me which files to take a strong look at -- or reacquire, restore, whatever.

Link to comment

Hope I never have this problem. It could really clobber a lot of data between parity checks.

 

- until I realized that it was parity rebuilds when I was seeing it -- and thankfully, I had not had a drive failure necessitating me to rely on the parity process to rebuild a failed drive.

 

I really wish there was some way to convert a "error" - listing a sector/block (whatever the # is) in the syslog, back to what "file(s)" that use those blocks.    That would have told me what files to possibly take a strong look at -- or reacquire, restore, whatever.

That is what I was thinking. 

 

I agree... I'm terrible at indexing my files to know which files are on which disks; adding that error reporting could potentially decrease the number of files that have to be re-ripped.

Link to comment

(Last checked on 6/28/2012 2:32:50 PM, finding 0 errors.)

 

Well that's a sight for sore eyes  ;D

 

My parity drive is still on the SiI3132 SATA controller that cannot handle two loaded ports. I will put a periodic (nightly) parity check in place so that, while waiting for a replacement, I can sleep a little better.
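
As a rough sketch of what I have in mind (this assumes the /root/mdcmd helper shipped with unRAID, whose "check" command starts a parity check; verify it exists on your version first), a root crontab entry could look like:

# nightly parity check at 01:00 (added via `crontab -e` as root)
0 1 * * *  /root/mdcmd check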

 

I agree with Chuck23322 that my data is most likely intact, since my normal usage pattern is to write to a single disk and, once it's full, get a new one to write on. I keep all my transient, not-so-important data (like DVB recordings) on local disks, and reads are almost always very light 5-60 Mbps streams. I haven't spotted any signs of actual data corruption in normal use. In fact I cannot even be sure whether the problem affects writes at all; I'm quite sure I won't be doing any further tests to confirm that ;)

Link to comment

(Last checked on 6/28/2012 2:32:50 PM, finding 0 errors.)

Well that's a sight for sore eyes  ;D ...

 

WaHoo!  Excellent news. 

Link to comment
