chuck23322 Posted February 10, 2012 Share Posted February 10, 2012 I know I've got a problem. I have over 30k sync errors when I do a parity check (not repair). Given that I trust my parity drive, I'd like to identify the drive(s) that are at the root of the sync errors. I'm not sure how to do that. I assume this is a common question/issue... so I don't mind following in the path of others/threads -- I just haven't been successful in finding them... perhaps I'm just SMF-challenged. Anyway... Running Version 4.7. Attached is the syslog: syslog-2012-02-10.txt Quote Link to comment
dgaschk Posted February 11, 2012 Share Posted February 11, 2012 Post SMART reports for all of the drives. Quote Link to comment
WeeboTech Posted February 11, 2012 Share Posted February 11, 2012 Read this thread -- it may provide a hint. http://lime-technology.com/forum/index.php?topic=17641.msg158817#msg158817 Quote Link to comment
chuck23322 Posted February 14, 2012 Author Share Posted February 14, 2012 I couldn't get this to post for a few days.... Ah... Found the "How To Troubleshoot Recurring Parity Errors" FAQ entry. It mentions: "This procedure is a test to find what is causing parity changes if a parity sync is continuously finding issues and traditional tools (Memcheck, SMART, reiserfsck..." Guess I need to try the other stuff first -- Memcheck... reiserfsck. {SMART is not really helpful; ugh, I have several drives that I need to retire/pull -- none are obviously the problem} Waiting for the latest parity check to complete... then I'll go into reiserfsck diags from... http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems {and a related user thread - http://lime-technology.com/forum/index.php?topic=16245.msg151003#msg151003 } ~Update: I'm now past memchecks and a reiserfsck of every drive -- multiple times, and multiple times in parallel {at the same time} across all drives - no errors found. Now I think I'm going to use the DD + MD5 checksum test across all drives, simultaneously... I went back and looked at several recent syslogs and several of the parity errors were at the same block or block range. http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors Once I find the 'bad' drive, I assume somehow I can get a directory listing off of it and determine what folders/files I may need to consider 'bad' and decide what I can do about that {reload from older backups, etc}. I have a pair of pre-cleared 2TB drives waiting to go in, once I determine which drive was/is throwing these parity errors. Quote Link to comment
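[Editor's note] The DD + MD5 test from the FAQ boils down to reading the same block range repeatedly and checking that the checksum never changes. A minimal sketch of the idea -- the real target would be a raw device such as /dev/sdb (hypothetical here); a scratch file stands in for the device so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Sketch of the FAQ's dd+md5 technique: read the same block range twice
# and compare checksums. Identical sums = consistent reads; differing
# sums mean something between the platter and the host is corrupting
# data. TARGET is a scratch file standing in for a raw device.
TARGET=$(mktemp)
dd if=/dev/zero of="$TARGET" bs=512 count=2048 2>/dev/null  # 1 MiB stand-in

read_range() {
    # read 500 sectors starting at sector 1000, checksum the stream
    dd if="$TARGET" bs=512 skip=1000 count=500 2>/dev/null | md5sum | cut -d' ' -f1
}

SUM1=$(read_range)
SUM2=$(read_range)
if [ "$SUM1" = "$SUM2" ]; then
    echo "consistent: $SUM1"
else
    echo "INCONSISTENT: $SUM1 vs $SUM2"
fi
rm -f "$TARGET"
```

On a healthy path the two sums always match; the interesting case in this thread is when they intermittently don't.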
dgaschk Posted February 14, 2012 Share Posted February 14, 2012 Typically SMART reports can be used to determine a problem disk. Post SMART reports. Quote Link to comment
lionelhutz Posted February 14, 2012 Share Posted February 14, 2012 Yes, go for the tests and let us know what you find. In theory, the parity would be correct and you could just replace the drive and let it rebuild. However, the parity data will be hosed if you did a correcting parity check. As for the directory listing: the directory is /mnt/diskX and the command would be something like ls -lR > /boot/diskX.txt to get a listing of files. You could also just share the disks and get the listing on a networked PC. Peter Quote Link to comment
chuck23322 Posted February 14, 2012 Author Share Posted February 14, 2012 Yes, go for the tests and let us know what you find. In theory, the parity would be correct and you could just replace the drive and let it rebuild. However, the parity data will be hosed if you did a correcting parity check. As for the directory listing. The directory is /mnt/diskX and the command would be something like ls -lR > /boot/diskX.txt to get a listing of files. You could also just share the disks and get the listing on a networked PC. Peter Once I find the bad drive -- any hope of taking it more granular than just all the "files on the bad drive" -- eg, is there a tool that could tell me based on bad-blocks (from syslog) what file(s) live on that block? Or the other way around, a list of all blocks being used by a given file? Quote Link to comment
chuck23322 Posted February 14, 2012 Author Share Posted February 14, 2012 Typically SMART reports can be used to determine a problem disk. Post SMART reports. Unfortunately, I have at least 3 drives which all have questionable SMART reports {reallocated sectors, etc} -- so I think I'm stuck with the brute-force DD/MD5 approach to narrow down my problem drive/drives. Quote Link to comment
lionelhutz Posted February 14, 2012 Share Posted February 14, 2012 Not that I know of or have read about. Peter Quote Link to comment
dgaschk Posted February 14, 2012 Share Posted February 14, 2012 Typically SMART reports can be used to determine a problem disk. Post SMART reports. Unfortunately, I have at least 3 drives which all have questionable smarts reports {reallocated sectors, etc} -- so I think I'm stuck with brute-force DD/MD5 to narrow down my problem drive/drives. The DD/MD5 method will only detect the remote case when a drive is returning inconsistent results. The drive may have reallocated sectors that cannot be read correctly, but it's very unlikely that the drives will return a zero on one read and a one on the next. Quote Link to comment
chuck23322 Posted February 25, 2012 Author Share Posted February 25, 2012 Ok, I've had my unRAID server offline (had other pressing personal matters) -- now I'm back to this saga. I created the attached scripts -- based on block information shown previously in syslogs... ran it 100 times... and lo and behold, I have FIVE of 15 drives returning inconsistent MD5 results... ugh, triple ugh. I'm re-running it now with 1000 passes to see if I have any other drives/connections/cables to check. In my case, it's flagging sdb, sdc, sdd, sde, sdf - and those are of varying manufacturers {sda is my flash device, not included in testing -- it's over USB}

/dev/sdb 2.00T ST32000542AS_5XW16JJG
/dev/sdc 1.00T ST31000520AS_5VX03T9G
/dev/sdd 1.00T 00D7B0_WD-WCAU40138375
/dev/sde 1.00T Hitachi_HDT721010SLA360_STF607MH1M27RW
/dev/sdf 1.00T ST31000528AS_9VP2WM9Y

DD/MD5 concept from - http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

* Script - getmd5.sh

DRIVE=$1
LOG_DIR=/boot/chuck/md5
cd $LOG_DIR
for i in {1..50}
do
echo "Begin $DRIVE for the $i time."
dd if=/dev/$DRIVE skip=12000 count=13000 | md5sum -b >> $DRIVE.1.log
dd if=/dev/$DRIVE skip=200000 count=250000 | md5sum -b >> $DRIVE.2.log
dd if=/dev/$DRIVE skip=200000 count=10000000 | md5sum -b >> $DRIVE.3.log
dd if=/dev/$DRIVE skip=113000 count=110000 | md5sum -b >> $DRIVE.4.log
dd if=/dev/$DRIVE skip=36000000 count=1000000 | md5sum -b >> $DRIVE.5.log
dd if=/dev/$DRIVE skip=60000 count=20000 | md5sum -b >> $DRIVE.6.log
done
exit

* Script - getall.sh

./getmd5.sh sdb &
./getmd5.sh sdc &
./getmd5.sh sdd &
./getmd5.sh sde &
./getmd5.sh sdf &
./getmd5.sh sdh &
./getmd5.sh sdi &
./getmd5.sh sdj &
./getmd5.sh sdk &
./getmd5.sh sdl &
./getmd5.sh sdm &
./getmd5.sh sdn &
./getmd5.sh sdo &
./getmd5.sh sdp &
./getmd5.sh sdg &

Attaching an example of a "bad" result (F) and a "good" result (K)

Next steps (proposed)
1) Complete the 1000 pass tests {going in parallel}
2) Analyse the pass results -- see if I'm still dealing with just B-F or any others
3) Look for common elements -- cabled to same controller card? same model/brand of controller card? same power supply leg? etc...
4) Replace cabling -- I went ahead and bought higher-quality SATA3 cabling
Longer term:
5) Determine files/directories on each of the drives -- so I know what "might" be at corruption risk
6) Dunno....

~Updated: NOTE -- the attached *all.txt files are a concatenated version of the individual 6 tests per drive. So the MD5 changing at six points {test boundaries} is correct. But within a repetitive list of MD5s, when it changes to random and goes back to normal -- that's an example of that test {of 6 in the file} returning bad data and then going back to normal. f_all.txt k_all.txt Quote Link to comment
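[Editor's note] Step 2 of the plan above ("Analyse the pass results") can be automated: a drive is suspect as soon as one of its pass logs contains more than one distinct checksum. A hedged sketch, assuming the $DRIVE.N.log naming from getmd5.sh; the two tiny demo logs generated inline are stand-ins so the script runs on its own:

```shell
#!/bin/sh
# Flag any pass log that contains more than one distinct MD5 line.
# A clean drive/range yields exactly one distinct checksum per log.
# Demo logs below are fabricated stand-ins for real getmd5.sh output.
printf 'aaaa *-\naaaa *-\naaaa *-\n' > sdk.1.log   # consistent drive
printf 'aaaa *-\nbbbb *-\naaaa *-\n' > sdf.1.log   # inconsistent drive
SUSPECTS=""
for f in sd?.?.log; do
    # count distinct lines in the log
    n=$(sort "$f" | uniq | wc -l | tr -d ' ')
    if [ "$n" -eq 1 ]; then
        echo "$f: OK"
    else
        echo "$f: SUSPECT ($n distinct sums)"
        SUSPECTS="$SUSPECTS$f "
    fi
done
rm -f sdk.1.log sdf.1.log
```

Run against the real /boot/chuck/md5 directory, this turns scanning 90 (or 15 concatenated) files into a single pass/fail list.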
chuck23322 Posted February 25, 2012 Author Share Posted February 25, 2012 While it's still running my 1000 pass MD5 check, I noticed these in the syslog -- how can I convert "ATA4" to a given drive?... Is it a one-to-one to /dev/md4 ? {If it's not obvious, I'm not unix savvy}

Feb 24 21:19:34 unraid kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Feb 24 21:19:34 unraid kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: failed command: READ DMA (Minor Issues)
Feb 24 21:19:34 unraid kernel: ata4.00: cmd c8/00:00:28:af:06/00:00:00:00:00/e0 tag 0 dma 131072 in (Drive related)
Feb 24 21:19:34 unraid kernel: res 50/00:00:27:af:06/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)
Feb 24 21:19:34 unraid kernel: ata4.00: status: { DRDY } (Drive related)
Feb 24 21:19:34 unraid kernel: ata4: hard resetting link (Minor Issues)
Feb 24 21:19:34 unraid kernel: mdcmd (6452): spindown 11 (Routine)
Feb 24 21:19:34 unraid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)
Feb 24 21:19:35 unraid kernel: ata4.00: configured for UDMA/133 (Drive related)
Feb 24 21:19:35 unraid kernel: ata4: EH complete (Drive related)
Feb 25 11:35:37 unraid kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: irq_stat 0x08000000, interface fatal error (Errors)
Feb 25 11:35:37 unraid kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC } (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: failed command: READ DMA (Minor Issues)
Feb 25 11:35:37 unraid kernel: ata4.00: cmd c8/00:00:28:2e:74/00:00:00:00:00/e0 tag 0 dma 131072 in (Drive related)
Feb 25 11:35:37 unraid kernel: res 50/00:00:27:2e:74/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)
Feb 25 11:35:37 unraid kernel: ata4.00: status: { DRDY } (Drive related)
Feb 25 11:35:37 unraid kernel: ata4: hard resetting link (Minor Issues)
Feb 25 11:35:37 unraid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) (Drive related)
Feb 25 11:35:38 unraid kernel: ata4.00: configured for UDMA/133 (Drive related)
Feb 25 11:35:38 unraid kernel: ata4: EH complete (Drive related)

Quote Link to comment
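[Editor's note] One way to answer the ataN-to-drive question -- assuming the kernel's boot-time probe lines are still in the syslog, as they turn out to be later in this thread -- is to grep for the ataN probe line, which names the drive model. A sketch using a canned syslog excerpt in place of the real /var/log/syslog:

```shell
#!/bin/sh
# Map an ataN port to a drive model by grepping the kernel's boot-time
# probe line. A canned excerpt stands in for the real syslog so this
# sketch runs anywhere; on the server you would grep the actual file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Feb 22 00:38:01 unraid kernel: ata3.00: ATA-8: ST31000528AS, CC38, max UDMA/133
Feb 22 00:38:02 unraid kernel: ata4.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
EOF
# keep only the model name from the matching ata4 line
MODEL=$(grep 'ata4\.00: ATA-' "$LOG" | sed 's/.*ATA-[0-9]*: //; s/,.*//')
echo "ata4 is: $MODEL"
rm -f "$LOG"
```

The model (and the unRAID web UI's model_serial column) then ties the ataN port back to a specific /dev/sdX disk.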
BRiT Posted February 25, 2012 Share Posted February 25, 2012 Please post the FULL syslog for Feb 24-25th. The snippets you provided do not contain enough information to determine which drive(s) are on ATA4 interface. Quote Link to comment
chuck23322 Posted February 25, 2012 Author Share Posted February 25, 2012 Please post the FULL syslog for Feb 24th. The snippets you provided do not contain enough information to determine which drive(s) are on ATA4 interface. Ah... "Feb 22 00:38:02 unraid kernel: ata4.00: ATA-8: ST3750330AS, SD1A, max UDMA/133" That's a really old 750GB that I plan to retire once I can get this system stable. I have two pre-cleared (5 passes) 2TB drives waiting to go in..

/dev/md5 /mnt/disk5 /dev/sdh ST3750330AS_9QK158NB 33°C 2230286 10 750.13G

{Syslog attached}

Model Family: Seagate Barracuda 7200.11 family
Device Model: ST3750330AS
Serial Number: 9QK158NB
Firmware Version: SD1A
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 116 099 006 Pre-fail Always - 105544716
3 Spin_Up_Time 0x0003 098 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 098 098 020 Old_age Always - 2348
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 4301169151
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7439
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 4
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 47
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 099 096 000 Old_age Always - 25770197016
189 High_Fly_Writes 0x003a 086 086 000 Old_age Always - 14
190 Airflow_Temperature_Cel 0x0022 063 057 045 Old_age Always - 37 (Min/Max 24/38)
194 Temperature_Celsius 0x0022 038 043 000 Old_age Always - 38 (0 20 0 0)
195 Hardware_ECC_Recovered 0x001a 049 031 000 Old_age Always - 105544716
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 26

Syslog-2012-02-25.zip Quote Link to comment
lionelhutz Posted February 25, 2012 Share Posted February 25, 2012 That's not the kind of result I would expect. You are getting a very consistent change in the results every x number of tests. Are you 100% positive nothing is modifying the drive contents? Quote Link to comment
chuck23322 Posted February 25, 2012 Author Share Posted February 25, 2012 That's not the kind of result I would expect. You are getting a very consistent change in the results every x number of tests. Are you 100% positive nothing is modifying the drive contents? I'm 100% positive the drive contents are not changing. The "writes" are not moving at all. Every single drive has 10 writes - for 2 days now - (100 for parity). And every drive has over 2,000,000 reads so far, incrementing constantly, while I'm running the stress-test passes. Ah, something just hit me: {I posted the script} but failed to say that I had concatenated all SIX log files for a given drive together. I did that so I only had to scan 15 files (one for each sd?) instead of 15 x 6 tests = 90 files... So when I was done, I FTP'd them back to my Windows host and did a copy sdb.?.log+nul: b_all.txt (So yes, they show a constant increment -- as each of the 6 tests was run. Within each test, you can see where it has a run of consistent reads, then a bunch of bad reads, and then back to good reads again.) Sorry for leaving out that critical explanation.
A good excerpt - showing the misbehavior - from the "sdf" drive is:

5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
9929750ab45e6ae9b465ac2ad1723240 *-
3fd4e34a0d707b3fc1547ed205788e1d *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
600bf668c22bc0d7de7c5f94528f1cd7 *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
212a16a4f1172214efa14f5f9f58adff *-
e7b21efc96f3809c49fd11f4e95f9882 *-
1df094df42aba5732bee21e4c48d7629 *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-
5c9ffab8c2d9b86b1d2b42615721911e *-

It's clear that when it reads correctly, its MD5 is 5c9ff.... and other times it's gibberish -- for the same repetitive set of blocks. Then it goes on, in the same 'f_all.txt' file, to the next test:

0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
10094b3f1af2ea313fc7b37019445160 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
128dcc44d1e21161f5c126c048d48319 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
f233575de053d37551c5641d09f951e5 *-
781290e56ba5266be5a7837986649e39 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-
0085ffb5782a72a508bdb8bc121a7ae1 *-

And there, that test returns '0085ff' most of the time, and other times it returns random stuff. Quote Link to comment
dgaschk Posted February 26, 2012 Share Posted February 26, 2012 Has memtest been run for 24 hours? Quote Link to comment
chuck23322 Posted February 26, 2012 Author Share Posted February 26, 2012 Has memtest been run for 24 hours? Yes, first for 6 hours, then for 48 hours -- that was the first step, then a reiserfsck of all the drives. Passing both of those is what led me down the road of the deeper DD/MD5 checks... I'm now past memchecks and a reiserfsck of every drive -- multiple times, and multiple times in parallel {at the same time} across all drives - no errors found. Ok, 300 passes of my DD/MD5 script later, and it's consistent on which drives are showing problems -- it's SDB, C, D, E & F only. Shutting down the server {saved syslogs, drive location info, etc}. I'm off to trace it all out, and see what may be common about those drives. Quote Link to comment
chuck23322 Posted February 26, 2012 Author Share Posted February 26, 2012 I'm off to trace it all out, and see what may be common about those drives. Well, (insert Staples 'That was Easy') -- all 5 affected drives are in my external Rosewill RSV-S8 cage driven by a PCIe x1 SiI3132 dual eSata controller. (My understanding is that the RSV-S8 is a rebranded SansDigital TowerRAID TRM8-B) So, of course, port multiplying/multi-lane is being done by the card and the external cabinet. There are two eSata feeds between the server and the external cabinet, each feeds one group of 4 drives. Three drives are in the first 'channel' and two drives are in the second 'channel'. Power is supplied to all the drives by the external cabinet's built-in power supply. ~Update: Ah, something just hit me -- I probably ought to do some more granular testing. Instead of all 5 problem drives being hit (DD/MD5) at once -- try them one at a time. Perhaps there's one that's dorking up the multi-lane channel.... Gots more testing to do.... Quote Link to comment
lionelhutz Posted February 26, 2012 Share Posted February 26, 2012 Hm, one of the last threads with this issue was traced back to a SiL3132 card causing it when both channels were used at one time. I bet the drives will seem OK as long as you only use one of the 2 SATA channels on the card. So, you're likely looking at changing to a different SATA card that still supports PM. Peter Quote Link to comment
chuck23322 Posted February 26, 2012 Author Share Posted February 26, 2012 Hm, one of the last threads with this issue was traced back to a SiL3132 card causing it when both channels were used at one time. I bet the drives will seem OK as long as you only use one of the 2 SATA channels on the card. So, you're likely looking at changing to a different SATA card that still supports PM. Peter Yup... seems so... The footnote (Bad SiL3132 controller) in the FAQ refers to another link -- http://lime-technology.com/wiki/index.php?title=FAQ#How_To_Troubleshoot_Recurring_Parity_Errors Just went through this thread deeper -- SiL3132 issues... with multiple drives... {in my case, even worse, multiple drives via Port-Multiplexing} Yet even more evidence that the SiL3132 can cause significant hair loss. This poor guy (looks like Debian?) went through almost my exact scenario step-by-step: http://www.issociate.de/board/post/506879/Debugging_a_strange_array_corruption.html Working now on proving to myself this is the issue. {Doing the same tests with only one drive, or two drives on the same side (port) of the 3132 -- then reproducing the problem across two ports of the 3132} And then, if I can prove the failure is dual-access, deciding on what to replace the controller(s) with. I have two of these in my build, and was needing to upgrade a 3114 as well... Ugh... It's just money, right? No, in my case, I think I'd rather have stability now that this thing has grown to 15 drives... Quote Link to comment
chuck23322 Posted February 27, 2012 Author Share Posted February 27, 2012 Ah, something just hit me -- I probably ought to do some more granular testing. Instead of all 5 problem drives being hit (DD/MD5) at once -- try them one at a time. Perhaps there's one that's dorking up the multi-lane channel.... Gots more testing to do.... Working now on proving to myself this is the issue. {Doing the same tests with only one drive, or two drives on the same side (port) of the 3132 -- then reproducing the problem across two ports of the 3132} And then, if I can prove the failure is dual-access, deciding on what to replace the controller(s) with. I have two of these in my build, and was needing to upgrade a 3114 as well... Testing complete. Problem isolated and PROVEN. Under heavy load {created by DD/MD5 tests as suggested and discussed earlier in this thread}:

1) Accessing two drives on any ONE side of the 3132 results in NO problems.
2) Accessing two drives (one on EACH side) of the 3132 results in returns of random data at random increments. Silently.
3) It's not the drive combination that causes this, eg, the same tests - changing drive pairs - result in the problem only occurring when it's "EACH side" of the 3132 being used, not which drive pairs are used.

Ok, now we have - between my testing, previous person's testing, etc -- IMHO, more than anecdotal evidence of the 3132 - or at least some 3132s - having problems with accessing {under heavy load} both sides of the 3132 at the same time. I'm pulling it from my build. I've just got to decide now what to replace it (and my others) with. Thanks everybody for their patience and help here. ...Chuck Quote Link to comment
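[Editor's note] The isolation procedure described in this post amounts to stressing targets in pairs and checking whether each target's checksums stay consistent. A rough sketch of that pattern -- scratch files stand in for the hypothetical /dev/sdX pair; on the real server you'd first pick two drives from one channel, then one from each channel of the controller:

```shell
#!/bin/sh
# Pairwise isolation sketch: hammer two targets concurrently with
# repeated dd|md5sum reads and count distinct checksums per target
# (1 = consistent). Scratch files stand in for real drives here.
stress() {   # $1 = target, $2 = result file; 5 repeated full reads
    for i in 1 2 3 4 5; do
        dd if="$1" bs=512 count=256 2>/dev/null | md5sum
    done | sort -u | wc -l | tr -d ' ' > "$2"
}
A=$(mktemp); B=$(mktemp)
RA=$(mktemp); RB=$(mktemp)
dd if=/dev/zero of="$A" bs=512 count=256 2>/dev/null
dd if=/dev/zero of="$B" bs=512 count=256 2>/dev/null
stress "$A" "$RA" &   # both targets stressed at the same time,
stress "$B" "$RB" &   # mirroring the two-channel load pattern
wait
NA=$(cat "$RA"); NB=$(cat "$RB")
echo "target A: $NA distinct sum(s), target B: $NB distinct sum(s)"
rm -f "$A" "$B" "$RA" "$RB"
```

With real drives on a misbehaving controller, the one-per-channel pairing is the case where the distinct-sum count climbs above 1.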
lionelhutz Posted February 27, 2012 Share Posted February 27, 2012 Wow, the SiL3132 controllers have proved to be real trouble for a few people. Good work troubleshooting, and I hope the pulled-out hair grows back. I tried one once and it wouldn't even work on a single drive, so I use that JMicron card you could get on eBay for $8, but I'm not sure it can still be had, and it didn't work in certain motherboards either. Peter Quote Link to comment
vca Posted February 27, 2012 Share Posted February 27, 2012 I'm pulling it from my build. I've just got to decide now what to replace it (and my others) with. Thanks everybody for their patience and help here. ...Chuck Try the Supermicro 8-Port card: AOC-SASLP-MV8, they do pretty well, I replaced both of my 2-port SIL3132 cards with one of these about a year ago when I was trying to solve a bad drive problem and the Supermicro card has performed perfectly ever since. Regards, Stephen Quote Link to comment
rick.p Posted March 6, 2012 Share Posted March 6, 2012 I tried one once and it wouldn't even work on a single drive so I use that Jmicron card you could get from Ebay for $8, but I'm not sure it can still be had and it didn't work in certain motherboards either. This one? http://www.ebay.com/itm/150719596527?ssPageName=STRK:MEWNX:IT&_trksid=p3984.m1439.l2649 I have some coming and I'm hoping this is the one you speak of. Quote Link to comment