June 28, 2009
I have tons of the messages below, one after another, in the syslog, but everything seems to work. The parity check also runs fine and comes back OK. sdc is my cache drive. The errors seem to appear, and repeat, when the mover starts: I see this error repeated 10-20 times, then some new data gets moved to the array, then the error appears another 10-20 times, then more data gets moved, and so on. The data moved to the array seems healthy after all. Could you please help me and point out where to look for the cause? Thank you in advance!
P.S.: For some reason, my SATA2 controller is recognized as SATA1.

Jun 27 03:41:32 Tower kernel: ata2.00: limiting speed to UDMA/33:PIO4
Jun 27 03:41:32 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x280000 action 0x6
Jun 27 03:41:32 Tower kernel: ata2.00: irq_stat 0x03020002, device error via SDB FIS
Jun 27 03:41:32 Tower kernel: ata2: SError: { 10B8B BadCRC }
Jun 27 03:41:32 Tower kernel: ata2.00: cmd 60/00:00:1f:a6:a6/02:00:11:00:00/40 tag 0 ncq 262144 in
Jun 27 03:41:32 Tower kernel:          res 41/84:5f:c0:a7:a6/00:00:11:00:00/40 Emask 0x410 (ATA bus error)
Jun 27 03:41:32 Tower kernel: ata2.00: status: { DRDY ERR }
Jun 27 03:41:32 Tower kernel: ata2.00: error: { ICRC ABRT }
Jun 27 03:41:32 Tower kernel: ata2: hard resetting link
Jun 27 03:41:34 Tower kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
Jun 27 03:41:34 Tower kernel: ata2.00: configured for UDMA/33
Jun 27 03:41:34 Tower kernel: ata2: EH complete
Jun 27 03:41:34 Tower kernel: sd 2:0:0:0: [sdc] 625142448 512-byte hardware sectors (320073 MB)
Jun 27 03:41:34 Tower kernel: sd 2:0:0:0: [sdc] Write Protect is off
Jun 27 03:41:34 Tower kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jun 27 03:41:34 Tower kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
June 28, 2009
RobJ is much more familiar with these types of syslog errors, so he may weigh in later with better information. But I believe these errors are typical of a cabling problem. By cabling I mean anything in the signal path from the controller to the drive: loose ports, bad or loose cables, or backplane issues. Although the data connection is the likely cause IMO, it could also be the power connection. I'd recommend running a smartctl report (see the Troubleshooting guide referenced in my sig, or use unMenu) on the drives in question to check for reallocated sectors or other problems that would make us suspect a bad drive. Otherwise, verify that all of your cable connections are good. Cabling problems are the most common cause and the hardest to track down. Post back the smartctl reports if you have any questions. Good luck!
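For anyone who wants to automate that check, here is a minimal Python sketch (not part of the original replies) that pulls the raw values out of the attribute table printed by smartctl -a and applies the rule of thumb from this thread: reallocated or pending sectors point at the drive itself, while UDMA CRC errors point at the signal path. The function names and triage wording are hypothetical; it assumes the standard ten-column attribute layout shown in the report posted below and reads only the leading integer of RAW_VALUE.

```python
import re


# Hypothetical helper: extract RAW_VALUE integers from the attribute table of
# `smartctl -a` output (assumes the usual 10-column layout shown in this thread).
def parse_smart_attributes(report: str) -> dict:
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; headers and log tables are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = re.match(r"\d+", fields[9])  # leading integer of RAW_VALUE
            if raw:
                attrs[fields[1]] = int(raw.group())
    return attrs


# Rule of thumb from the thread, not a definitive diagnosis.
def triage(attrs: dict) -> list:
    notes = []
    if attrs.get("Reallocated_Sector_Ct", 0) or attrs.get("Current_Pending_Sector", 0):
        notes.append("media problem suspected (drive)")
    if attrs.get("UDMA_CRC_Error_Count", 0):
        notes.append("signal path suspected (cables/connectors)")
    return notes


# Three rows copied from the sdc report posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       5098
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(attrs)
    print(triage(attrs))
```

On the sdc rows above, only the CRC counter is non-zero, so the sketch flags the signal path and not the drive, which matches the advice given here.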
June 28, 2009 (Author)
Thank you for your feedback, bjp999! Here is the smartctl report for sdc (it shows a lot of UDMA CRC errors). No reallocated sectors, though. You are absolutely right; I am going to check the cabling.

root@Tower:~# smartctl -a -d ata /dev/sdc
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Second Generation Serial ATA family
Device Model:     WDC WD3200AAKS-00L9A0
Serial Number:    WD-WMAV22131851
Firmware Version: 01.03E01
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 28 19:37:38 2009 GMT-2
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (5760) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (   2) minutes.
Extended self-test routine recommended polling time: (  70) minutes.
Conveyance self-test routine recommended polling time: (   5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   132   130   021    Pre-fail  Always       -       4383
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       290
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1277
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       290
194 Temperature_Celsius     0x0022   108   097   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       5098
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       606         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
June 29, 2009
Brian is right; this looks very much like a poor-quality cable. What I am finding characteristic of bad cables is the appearance of the BadCRC and ICRC flags in the syslog, together with a non-zero SMART 'UDMA CRC' error count. I'm guessing that a poor cable is not sufficiently resistant to interference or crosstalk, which corrupts the data in transit and so produces an incorrect CRC value.

Jun 27 03:41:32 Tower kernel: ata2: SError: { 10B8B BadCRC }
Jun 27 03:41:32 Tower kernel: ata2.00: error: { ICRC ABRT }
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       5098

The drive itself looks very good.
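To see which port is producing these CRC flags across a long log, a minimal Python sketch like the one below (not from the original posts) could tally matching syslog lines per ATA port. The function name is hypothetical, and it assumes kernel messages take the "kernel: ataN:" or "kernel: ataN.MM:" form seen in this thread; other kernels may format them differently.

```python
import re
from collections import Counter

# Matches the port name in kernel ATA messages such as "kernel: ata2:" or
# "kernel: ata2.00:"; the device suffix (".00") is folded into the port.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?:")
CRC_FLAGS = ("BadCRC", "ICRC")


def crc_lines_by_port(lines) -> Counter:
    """Count syslog lines per ATA port that carry a BadCRC or ICRC flag.

    One error event usually logs both flags on separate lines, so this counts
    lines, not distinct events.
    """
    counts = Counter()
    for line in lines:
        m = PORT_RE.search(line)
        if m and any(flag in line for flag in CRC_FLAGS):
            counts[m.group(1)] += 1
    return counts


# Lines copied from the syslog excerpt posted in this thread.
SAMPLE = [
    "Jun 27 03:41:32 Tower kernel: ata2: SError: { 10B8B BadCRC }",
    "Jun 27 03:41:32 Tower kernel: ata2.00: error: { ICRC ABRT }",
    "Jun 27 03:41:34 Tower kernel: ata2: EH complete",
]

if __name__ == "__main__":
    print(crc_lines_by_port(SAMPLE))
```

Because a file object iterates line by line, an open log file can be passed in directly; the log's location (for example /var/log/syslog) is an assumption and varies between systems.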
June 29, 2009 (Author)
Thank you both for the feedback and support. I can report that you were right: the cabling seems to have been the culprit. I had two SATA drives (one of them the cache drive mentioned above) hooked up with cheap eSATA cables. I replaced them with Addonics eSATA cables, and everything seems fine now. Man, they were expensive: 2 cables for 32 EUR... Duhhh... Thank you again for your excellent support and care. What would we do without you guys?