{SOLVED-Beware SiL3132} - How to determine which drive is causing sync errors?



I know I've got a problem.  I have over 30k sync errors when I do a parity check (not repair).  Given that I trust my parity drive,  I'd like to identify the drive(s) that are at the root of the sync errors.  I'm not sure how to do that.

 

I assume this is a common question/issue...  so I don't mind following in the path of others/threads -- I just haven't been successful in finding it...  perhaps I'm just SMF challenged :)

 

Anyway...  Running Version 4.7

 

Attached is syslog

syslog-2012-02-10.txt

Link to comment

I couldn't get this to post for a few days.... 

 

Ah... Found the "How To Troubleshoot Recurring Parity Errors"

 

It describes a test to find what is causing parity changes when a parity sync is continuously finding issues and the traditional tools (memtest, SMART, reiserfsck) haven't turned up the cause.

 

Guess I need to try the other stuff first -- memtest... reiserfsck. {SMART isn't much help here -- ugh, I have several drives that I need to retire/pull, but none is an obvious culprit.}

 

Waiting for the latest parity check to complete... then I'll go into the reiserfsck diagnostics from...

 

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

 

{and a related user thread - http://lime-technology.com/forum/index.php?topic=16245.msg151003#msg151003 }

 

~Update:  I'm now past the memtest runs and a reiserfsck of every drive -- multiple times, and multiple runs in parallel {at the same time} across all drives -- no errors found.

 

Now I think I'm going to use the DD + MD5 checksum test across all drives simultaneously... I went back through several recent syslogs and several of the parity errors were at the same block or block range.

 

http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors
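The core of the method (my summary of the FAQ, not its exact commands): dd the same block range off a drive several times and checksum each pass -- a healthy drive and controller should return bit-identical data every time, so a pass with a different MD5 means something in the path is silently corrupting reads. Roughly:

# same range read twice; identical MD5s = consistent reads, different MD5s = trouble
# (sdX and the skip/count values are just placeholders)
dd if=/dev/sdX skip=200000 count=100000 | md5sum
dd if=/dev/sdX skip=200000 count=100000 | md5sum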

 

Once I find the 'bad' drive, I assume I can somehow get a directory listing off it, determine which folders/files I may need to consider 'bad', and decide what to do about them {reload from older backups, etc}.

 

I have a pair of pre-cleared 2TB drives waiting to go in,  once I determine which drive was/is throwing these parity errors.

 

Link to comment

Yes, go for the tests and let us know what you find.

 

In theory, the parity would be correct and you could just replace the drive and let it rebuild. However, the parity data will be hosed if you did a correcting parity check.

 

As for the directory listing.

The directory is /mnt/diskX and the command would be something like ls -lR > /boot/diskX.txt to get a listing of files. You could also just share the disks and get the listing on a networked PC.
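For example, something like this from the console (untested sketch -- adjust the paths to taste) would write a listing for every data disk to the flash:

for d in /mnt/disk*
do
  # one listing file per disk, e.g. /boot/listing-disk1.txt
  ls -lR "$d" > /boot/listing-$(basename $d).txt
done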

 

Peter

 

Link to comment

Yes, go for the tests and let us know what you find.

 

In theory, the parity would be correct and you could just replace the drive and let it rebuild. However, the parity data will be hosed if you did a correcting parity check.

 

As for the directory listing.

The directory is /mnt/diskX and the command would be something like ls -lR > /boot/diskX.txt to get a listing of files. You could also just share the disks and get the listing on a networked PC.

 

Peter

 

Once I find the bad drive -- any hope of taking it more granular than just all the "files on the bad drive" -- eg, is there a tool that could tell me based on bad-blocks (from syslog) what file(s) live on that block?  Or the other way around,  a list of all blocks being used by a given file?
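{Noting for my own follow-up, not tested yet: for the file -> blocks direction, filefrag might do it, if it's present on the box and reiserfs supports the block-mapping ioctl it uses. The block numbers it prints are relative to the start of the partition, so the partition's starting sector would still need to be added before comparing against the sector numbers in the syslog. Something like:}

# extents of a single file, in filesystem blocks (path is just an example)
filefrag -v /mnt/disk5/somefolder/somefile.mkv
# partition start sector, needed to convert filesystem blocks to device LBAs
fdisk -lu /dev/sdh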

 

 

Link to comment

Typically SMART reports can be used to determine a problem disk. Post SMART reports.
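If smartmontools is on the box, something like this (rough sketch -- adjust the device letters to your array) will dump one report per drive to the flash:

for d in /dev/sd[b-p]
do
  # full SMART report for each drive, e.g. /boot/smart-sdb.txt
  smartctl -a $d > /boot/smart-$(basename $d).txt
done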

 

Unfortunately, I have at least 3 drives with questionable SMART reports {reallocated sectors, etc} -- so I think I'm stuck with the brute-force DD/MD5 approach to narrow down my problem drive(s).

 

The DD/MD5 method will only detect the rare case where a drive returns inconsistent results. A drive may have reallocated sectors that cannot be read correctly, but it's very unlikely that a drive will return a zero on one read and a one on the next.

Link to comment
  • 2 weeks later...

Ok, I've had my unRAID server offline (other pressing personal matters) -- now I'm back to this saga. I created the attached scripts, based on block information shown in previous syslogs... ran them 100 times... and lo and behold, I have FIVE of 15 drives returning inconsistent MD5 results... ugh, triple ugh.

 

I'm re-running it now with 1000 passes to see if there are any other drives/connections/cables to check. In my case, it's flagging sdb, sdc, sdd, sde, and sdf -- drives from varying manufacturers. {sda is my flash device, over USB, and is not included in the testing.}

 

/dev/sdb 2.00T	ST32000542AS_5XW16JJG	
/dev/sdc 1.00T	ST31000520AS_5VX03T9G	
/dev/sdd 1.00T	00D7B0_WD-WCAU40138375	
/dev/sde 1.00T	Hitachi_HDT721010SLA360_STF607MH1M27RW	
/dev/sdf 1.00T	ST31000528AS_9VP2WM9Y	

 

DD/MD5 concept from - http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

* Script - getmd5.sh

#!/bin/bash
# getmd5.sh -- read six fixed block ranges from the given drive and append the
# MD5 of each to its own log; repeated passes should produce identical MD5s.

DRIVE=$1
LOG_DIR=/boot/chuck/md5

cd $LOG_DIR

for i in {1..50}
  do
    echo "Begin $DRIVE for the $i time."
    dd if=/dev/$DRIVE skip=12000 count=13000 | md5sum -b >> $DRIVE.1.log
    dd if=/dev/$DRIVE skip=200000 count=250000 | md5sum -b >> $DRIVE.2.log
    dd if=/dev/$DRIVE skip=200000 count=10000000 | md5sum -b >> $DRIVE.3.log
    dd if=/dev/$DRIVE skip=113000 count=110000 | md5sum -b >> $DRIVE.4.log
    dd if=/dev/$DRIVE skip=36000000 count=1000000 | md5sum -b >> $DRIVE.5.log
    dd if=/dev/$DRIVE skip=60000 count=20000 | md5sum -b >> $DRIVE.6.log
  done
exit

* Script getall.sh

#!/bin/bash
# getall.sh -- run getmd5.sh against all 15 array drives in parallel
# (sda, the USB flash device, is deliberately left out).
./getmd5.sh sdb & ./getmd5.sh sdc & ./getmd5.sh sdd & ./getmd5.sh sde &
./getmd5.sh sdf & ./getmd5.sh sdh & ./getmd5.sh sdi & ./getmd5.sh sdj &
./getmd5.sh sdk & ./getmd5.sh sdl & ./getmd5.sh sdm & ./getmd5.sh sdn &
./getmd5.sh sdo & ./getmd5.sh sdp & ./getmd5.sh sdg &
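To scan the logs without eyeballing all of them, the distinct MD5 values per log can simply be counted -- a clean test should show exactly one (a quick sketch, not part of the original scripts):

cd /boot/chuck/md5
for f in sd?.?.log
do
  # a healthy drive/test produces exactly 1 distinct MD5 per log file
  echo "$f: $(sort $f | uniq | wc -l) distinct md5(s)"
done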

 

Attaching an example of a "bad" result (F) and a "good" result (K)

 

Next steps (proposed)

 

1) Complete the 1000 pass tests {going in parallel}

2) Analyse the pass results -- see if I'm still dealing with just B-F or any others

3) Look for common elements -- cabled to same controller card?  same model/brand of controller card, etc, same power supply leg, etc...

4) Replace cabling -- I went ahead and bought higher-quality SATA3 cabling

 

Longer term:

 

5) Determine files/directories on each of the drives -- so I know what "might" be at corruption risk

6) Dunno....

 

~Update: NOTE -- the attached *_all.txt files are a concatenation of the six individual test logs for each drive, so seeing the MD5 change six times (once per test) is expected. But within one test's run of repeated MD5s, where the value changes to something random and then goes back to normal -- that's that test {one of the 6 in the file} returning bad data and then good data again.

f_all.txt

k_all.txt

Link to comment

 

While my 1000-pass MD5 check is still running, I noticed these in the syslog -- how can I map "ata4" to a given drive?... Is it one-to-one with /dev/md4? {If it's not obvious, I'm not Unix savvy}

 

Feb 24 21:19:34 unraid kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Feb 24 21:19:34 unraid kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: failed command: READ DMA (Minor Issues)
Feb 24 21:19:34 unraid kernel: ata4.00: cmd c8/00:00:28:af:06/00:00:00:00:00/e0 tag 0 dma 131072 in (Drive related)
Feb 24 21:19:34 unraid kernel:          res 50/00:00:27:af:06/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: status: { DRDY } (Drive related)
Feb 24 21:19:34 unraid kernel: ata4: hard resetting link (Minor Issues)
Feb 24 21:19:34 unraid kernel: mdcmd (6452): spindown 11 (Routine)
Feb 24 21:19:34 unraid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)
Feb 24 21:19:35 unraid kernel: ata4.00: configured for UDMA/133 (Drive related)
Feb 24 21:19:35 unraid kernel: ata4: EH complete (Drive related)

Feb 25 11:35:37 unraid kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Feb 25 11:35:37 unraid kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: failed command: READ DMA (Minor Issues)
Feb 25 11:35:37 unraid kernel: ata4.00: cmd c8/00:00:28:2e:74/00:00:00:00:00/e0 tag 0 dma 131072 in (Drive related)
Feb 25 11:35:37 unraid kernel:          res 50/00:00:27:2e:74/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: status: { DRDY } (Drive related)
Feb 25 11:35:37 unraid kernel: ata4: hard resetting link (Minor Issues)
Feb 25 11:35:37 unraid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)
Feb 25 11:35:38 unraid kernel: ata4.00: configured for UDMA/133 (Drive related)
Feb 25 11:35:38 unraid kernel: ata4: EH complete (Drive related)

Link to comment

Please post the FULL syslog for Feb 24th. The snippets you provided do not contain enough information to determine which drive(s) are on the ata4 interface.

 

Ah...  "Feb 22 00:38:02 unraid kernel: ata4.00: ATA-8: ST3750330AS, SD1A, max UDMA/133"
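{For anyone else chasing the same question -- those identification lines are logged at boot, so a grep along these lines pulls the model for every ata link, which can then be matched against the serial numbers in the web GUI (syslog path per my 4.7 box, I believe):}

# show the "ataN.00: ATA-x: <model>, <firmware>" line for every ATA link
grep 'ata[0-9]*\.00: ATA-' /var/log/syslog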

 

That's a really old 750GB drive that I plan to retire once I can get this system stable. I have two pre-cleared (5 passes) 2TB drives waiting to go in...

 

/dev/md5 /mnt/disk5 /dev/sdh ST3750330AS_9QK158NB 33°C 2230286 10 750.13G

 

{Syslog attached}

Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST3750330AS
Serial Number:    9QK158NB
Firmware Version: SD1A

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       105544716
  3 Spin_Up_Time            0x0003   098   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2348
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       4301169151
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7439
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       4
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       47
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   099   096   000    Old_age   Always       -       25770197016
189 High_Fly_Writes         0x003a   086   086   000    Old_age   Always       -       14
190 Airflow_Temperature_Cel 0x0022   063   057   045    Old_age   Always       -       37 (Min/Max 24/38)
194 Temperature_Celsius     0x0022   038   043   000    Old_age   Always       -       38 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   049   031   000    Old_age   Always       -       105544716
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       26

Syslog-2012-02-25.zip

Link to comment

Those aren't the kind of results I would expect. You are getting a very consistent change in the results every x number of tests. Are you 100% positive nothing is modifying the drive contents?

 

I'm 100% positive the drive contents are not changing.  The "writes" are not moving at all.

 

Every single drive has 10 writes - for 2 days now -  (100 for parity)

 

And every drive has over 2,000,000 reads so far, incrementing constantly, while I'm running the stress-test passes.

 

Ah, something just hit me -- I posted the script but failed to say that I had concatenated all SIX log files for a given drive together.

 

I did that so I only had to scan 15 files (one for each sd?) instead of 15 x 6 tests = 90 files ...

 

So when I was done, I FTP'd them back to my Windows host and did a "copy sdb.?.log+nul: b_all.txt" for each drive.

 

(So yes, they show a constant increment as each of the 6 tests was run.)

 

Within each test you can see where it has a run of consistent reads, then a bunch of bad reads, and then back to good reads again.

 

Sorry for leaving out that critical explanation. A good excerpt showing the misbehavior, from the "sdf" drive:

5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
9929750ab45e6ae9b465ac2ad1723240 *-
3fd4e34a0d707b3fc1547ed205788e1d *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
600bf668c22bc0d7de7c5f94528f1cd7 *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
212a16a4f1172214efa14f5f9f58adff *-
e7b21efc96f3809c49fd11f4e95f9882 *-
1df094df42aba5732bee21e4c48d7629 *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-

It's clear that when it reads correctly, its MD5 is 5c9ff... and other times it's gibberish -- for the same repetitive set of blocks.

 

Then it goes on, in the same 'f_all.txt' file,  to the next test

0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
10094b3f1af2ea313fc7b37019445160 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
128dcc44d1e21161f5c126c048d48319 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
f233575de053d37551c5641d09f951e5 *-
781290e56ba5266be5a7837986649e39 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *

 

And there, that test returns '0085ff' most of the time, and other times it returns random stuff. 

 

 

 

 

Link to comment

Has memtest been run for 24 hours?

 

Yes, first for 6 hours, then for 48 hours -- that was the first step, followed by a reiserfsck of all the drives. Passing both of those is what led me down the road of the deeper DD/MD5 checks...

 

 

I'm now past the memtest runs and a reiserfsck of every drive -- multiple times, and multiple runs in parallel {at the same time} across all drives -- no errors found.

 

 

Ok, 300 passes of my DD/MD5 script later, and it's consistent about which drives are showing problems -- it's sdb, sdc, sdd, sde & sdf only.

 

Shutting down the server {saved syslogs, drive location info, etc}

 

I'm off to trace it all out, and see what may be common about those drives.

Link to comment

I'm off to trace it all out, and see what may be common about those drives.

 

Well (insert Staples 'That was Easy') -- all 5 affected drives are in my external Rosewill RSV-S8 cage, driven by a PCIe x1 SiI3132 dual eSATA controller. (My understanding is that the RSV-S8 is a rebranded SansDigital TowerRAID TRM8-B.)

 

So, of course, the port multiplication/multi-lane work is being done by the card and the external cabinet. There are two eSATA feeds between the server and the external cabinet, each feeding one group of 4 drives.

 

Three drives are in the first 'channel' and two drives are in the second 'channel'. 

 

Power is supplied to all the drives by the external cabinet's built-in power supply.

 

~Update:  Ah, something just hit me -- I probably ought to do some more granular testing.  Instead of all 5 problem drives being hit (DD/MD5) at once -- try them one at a time.  Perhaps there's one that's dorking up the multi-lane channel....  Gots more testing to do.... 

Link to comment

Hm, one of the last threads with this issue was traced back to a SiL3132 card causing it when both channels were used at one time. I bet the drives will seem OK as long as you only use one of the 2 SATA channels on the card. So, you're likely looking at changing to a different SATA card that still supports PM.

 

Peter

Link to comment

Hm, one of the last threads with this issue was traced back to a SiL3132 card causing it when both channels were used at one time. I bet the drives will seem OK as long as you only use one of the 2 SATA channels on the card. So, you're likely looking at changing to a different SATA card that still supports PM.

 

Peter

 

Yup... seems so...

 

The footnote ("Bad SiL 3132 controller") in the FAQ refers to another related link -- http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

 

Just went through this thread more deeply -- SiL3132 issues with multiple drives... {in my case even worse, multiple drives via port multiplication}

 

Yet even more evidence that the SIL3132 can cause significant hair loss.

 

This poor guy (looks like Debian?) went through almost my exact scenario, step by step:

 

http://www.issociate.de/board/post/506879/Debugging_a_strange_array_corruption.html

 

Working now on proving to myself this is the issue. {Doing the same tests with only one drive, or two drives on the same side (port) of the 3132 -- then reproducing the problem across two ports of the 3132}
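The test matrix I have in mind (drive letters here are just placeholders -- I'll substitute whichever drives actually sit on each eSATA port):

# two drives on the SAME eSATA port of the 3132 -- expecting clean MD5s
./getmd5.sh sdb & ./getmd5.sh sdc & wait
# one drive from EACH eSATA port of the 3132 -- expecting corrupted reads
./getmd5.sh sdb & ./getmd5.sh sdf & wait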

 

And then, if I can prove the failure is dual-access, deciding on what to replace the controller(s) with.  I have two of these in my build, and was needing to upgrade a 3114 as well...

 

Ugh...  It's just money right?  No, in my case, I think I'd rather have stability now that this thing has grown to 15 drives...

Link to comment

Ah, something just hit me -- I probably ought to do some more granular testing.  Instead of all 5 problem drives being hit (DD/MD5) at once -- try them one at a time.  Perhaps there's one that's dorking up the multi-lane channel....  Gots more testing to do....

 

Working now on proving to myself this is the issue. {Doing the same tests with only one drive, or two drives on the same side (port) of the 3132 -- then reproducing the problem across two ports of the 3132}

 

And then, if I can prove the failure is dual-access, deciding on what to replace the controller(s) with.  I have two of these in my build, and was needing to upgrade a 3114 as well...

 

 

Testing complete. Problem isolated and PROVEN. Under heavy load {created by the DD/MD5 tests suggested and discussed earlier in this thread}:

 

1) Accessing two drives on any ONE side of the 3132 results in NO problems. 

2) Accessing two drives, one on EACH side of the 3132, results in random data being returned at random intervals. Silently.

3) It's not a particular drive combination that causes this; running the same tests with different drive pairs, the problem only occurs when EACH side of the 3132 is in use at once -- regardless of which drives are paired.

 

 

 

Ok, between my testing, the previous poster's testing, etc., we now have -- IMHO -- more than anecdotal evidence of the 3132 (or at least some 3132s) having problems when both sides of the card are accessed at the same time under heavy load.

 

I'm pulling it from my build.  I've just got to decide now what to replace it (and my others) with.

 

Thanks everybody for their patience and help here.

 

...Chuck

Link to comment

Wow, the SiL3132 controllers have proved to be real trouble for a few people. Good work troubleshooting and I hope the pulled-out hair grows back.  :D

 

I tried one once and it wouldn't even work with a single drive, so I use that JMicron card you could get from eBay for $8 -- but I'm not sure it can still be had, and it didn't work in certain motherboards either.

 

Peter

Link to comment

 

I'm pulling it from my build.  I've just got to decide now what to replace it (and my others) with.

 

Thanks everybody for their patience and help here.

 

...Chuck

 

Try the Supermicro 8-port card, the AOC-SASLP-MV8 -- they do pretty well. I replaced both of my 2-port SiL3132 cards with one of these about a year ago when I was trying to solve a bad drive problem, and the Supermicro card has performed perfectly ever since.

 

Regards,

 

Stephen

Link to comment