June 3, 2013

I recently upgraded a data disk, and after it finished rebuilding, ran two passes of a NOCORRECT parity check. There were no sync errors reported during these tests. When the monthly NOCORRECT parity check ran, it reported 32 sync errors. I ran two more NOCORRECT parity checks, and again, there were no errors reported. So I went ahead and ran a CORRECT parity check, but the 32 sync errors reappeared in a different range.

It seems as though the errors were on ata6.00, which corresponds to my parity disk.

May 31 20:06:17 Tower kernel: ata6.00: ATA-8: WDC WD3001FAEX-00MJRA0, 01.01L01, max UDMA/133

After the original 32 sync errors, I ran a long SMART test which didn't turn up anything unusual. I did recently upgrade the parity disk a few weeks back, but before putting it in the system I ran three passes of badblocks v1.42 and one pass of preclear with nothing of note.

The only thing I can think of that may have caused the sync errors was the MOVER script running during these checks, but that shouldn't interfere, should it?

If anybody has some further insight, I'd appreciate it. Thank you.

SMART report:

smartctl -a -d ata /dev/sdf
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: WDC WD3001FAEX-00MJRA0
Serial Number: WD-WCC130288340
Firmware Version: 01.01L01
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Jun 3 07:07:20 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (34740) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x70b5) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 167   167   021    Pre-fail Always  -           10608
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           29
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           445
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           10
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           3
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           25
194 Temperature_Celsius     0x0022 120   111   000    Old_age  Always  -           32
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           1
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed without error 00%       408             -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

First 32 sync errors:

Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608040
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608048
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608056
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608064
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608072
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608080
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608088
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608096
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608104
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608112
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608120
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608128
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608136
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608144
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608152
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608160
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608168
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608176
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608184
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608192
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608200
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608208
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608216
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608224
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608232
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608240
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608248
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608256
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608264
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608272
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608280
Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector=2654608288
Jun 1 06:25:12 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 1 06:25:12 Tower kernel: ata6.00: irq_stat 0x40000001
Jun 1 06:25:12 Tower kernel: ata6.00: failed command: READ DMA EXT
Jun 1 06:25:12 Tower kernel: ata6.00: cmd 25/00:00:d8:1d:3a/00:04:9e:00:00/e0 tag 0 dma 524288 in
Jun 1 06:25:12 Tower kernel: res 51/40:df:ec:1f:3a/00:01:9e:00:00/e0 Emask 0x9 (media error)
Jun 1 06:25:12 Tower kernel: ata6.00: status: { DRDY ERR }
Jun 1 06:25:12 Tower kernel: ata6.00: error: { UNC }
Jun 1 06:25:12 Tower kernel: ata6.00: configured for UDMA/133
Jun 1 06:25:12 Tower kernel: ata6: EH complete

Second 32 sync errors:

Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897096
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897104
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897112
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897120
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897128
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897136
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897144
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897152
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897160
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897168
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897176
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897184
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897192
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897200
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897208
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897216
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897224
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897232
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897240
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897248
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897256
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897264
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897272
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897280
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897288
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897296
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897304
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897312
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897320
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897328
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897336
Jun 3 01:45:06 Tower kernel: md: correcting parity, sector=1915897344
Jun 3 01:46:06 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 3 01:46:06 Tower kernel: ata6.00: failed command: FLUSH CACHE EXT
Jun 3 01:46:06 Tower kernel: ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 3 01:46:06 Tower kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 3 01:46:06 Tower kernel: ata6.00: status: { DRDY }
Jun 3 01:46:06 Tower kernel: ata6: hard resetting link
Jun 3 01:46:07 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 3 01:46:07 Tower kernel: ata6.00: configured for UDMA/133
Jun 3 01:46:07 Tower kernel: ata6.00: retrying FLUSH 0xea Emask 0x4
Jun 3 01:46:07 Tower kernel: ata6: EH complete

syslog-2013-06-03.zip
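
For anyone comparing the two batches of sync errors, here is a rough Python sketch that pulls the `sector=` values out of syslog text and collapses them into evenly spaced runs. The function name `sync_error_runs` and the default stride of 8 are my own assumptions, the stride being taken from the spacing visible in the log excerpts above; this is not part of unRAID or its tools.

```python
import re

# Matches md driver messages like the ones quoted in this thread:
#   "md: parity incorrect, sector=2654608040"
#   "md: correcting parity, sector=1915897096"
SECTOR_RE = re.compile(r"md: (?:parity incorrect|correcting parity), sector=(\d+)")


def sync_error_runs(log_text, stride=8):
    """Group reported sectors into runs of evenly spaced sectors.

    Returns a list of (first_sector, last_sector, count) tuples.
    stride=8 is an assumption based on the spacing seen in the posted log.
    """
    sectors = sorted({int(m.group(1)) for m in SECTOR_RE.finditer(log_text)})
    runs = []
    for sector in sectors:
        if runs and sector - runs[-1][1] == stride:
            # Extend the current run by one sector entry.
            first, _, count = runs[-1]
            runs[-1] = (first, sector, count + 1)
        else:
            # Gap larger than the stride: start a new run.
            runs.append((sector, sector, 1))
    return runs


if __name__ == "__main__":
    # Synthetic input mimicking the first batch of errors from the post.
    sample = "\n".join(
        f"Jun 1 06:25:12 Tower kernel: md: parity incorrect, sector={2654608040 + 8 * i}"
        for i in range(32)
    )
    for first, last, count in sync_error_runs(sample):
        print(f"{count} errors, sectors {first}..{last}")
```

Run against a full syslog, each batch in this thread should show up as a single contiguous run, which makes it easy to see that the two batches sit in different regions of the disk.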

June 3, 2013

It's difficult to say what causes these occasional sync errors, but a non-zero sync count now and then is not uncommon. I've had this perhaps 2-3 times in the 4 years I've been using UnRAID, but it's never represented a real problem (i.e. all the data was fine).

If you're concerned, run a comparison of all of your data against your backups. I've done that a couple of times when the sync count was non-zero, and the data has always matched perfectly, so I'm reasonably convinced that Tom's view is accurate: sync errors are always actually on the parity disk (which is why corrections are written to that disk).

In fact, I never do a "non-correcting" test ... why would you NOT want to correct one of these errors??
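
One way to do the comparison against backups is to hash every file on both sides. The sketch below is only an illustration under assumptions: the helper names `file_digest` and `compare_trees` are made up, and the two paths in the usage example are hypothetical mount points, not anything taken from this thread.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(data_root, backup_root):
    """Compare every file under data_root with its counterpart in backup_root.

    Returns (missing, different): relative paths that are absent from the
    backup, and relative paths whose contents differ.
    """
    data_root, backup_root = Path(data_root), Path(backup_root)
    missing, different = [], []
    for src in sorted(p for p in data_root.rglob("*") if p.is_file()):
        rel = src.relative_to(data_root)
        dst = backup_root / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            different.append(rel.as_posix())
    return missing, different


if __name__ == "__main__":
    # Hypothetical paths; substitute your own array share and backup location.
    missing, different = compare_trees("/mnt/user/share", "/mnt/backup/share")
    print("missing from backup:", missing)
    print("content differs:", different)
```

Hashing both copies reads all the data twice, so on a multi-terabyte array this takes a while, but it exercises the data disks rather than relying on parity alone.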

June 3, 2013 (Author)

garycase - I just found it strange to have two consecutive parity checks complete without error, and then to have the third report an error. It got stranger when two more checks and a long SMART test ran without error, and then the following check reported the exact same number of sync errors.

I was under the impression that one should be wary of sync errors. Plus, since there were UNC errors reported on the parity disk, I thought that the parity disk might have to be replaced. But it was odd that SMART reported no reallocated or pending sectors. I thought the rigorous exercise through badblocks and the preclear would have turned up any problems.

dgaschk - attached the syslog as requested to the original post.

Thank you all...
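
As a quick way to check the attributes mentioned here, the following Python sketch parses the attribute table from `smartctl -a` text and lists the watched attributes whose raw value is non-zero. The function names and the choice of watched IDs (5, 197, 198, 199) are assumptions for illustration; the sample rows are copied from the report in the first post.

```python
# Attribute IDs often checked when a disk is suspected (an illustrative choice).
WATCHED_IDS = (5, 197, 198, 199)


def parse_smart_attributes(report):
    """Return {attribute_id: (name, raw_value)} from smartctl attribute rows.

    A row is expected to have ten whitespace-separated columns:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
    Rows whose raw value is not a plain integer are skipped.
    """
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        try:
            raw_value = int(fields[9])
        except ValueError:
            continue
        attributes[int(fields[0])] = (fields[1], raw_value)
    return attributes


def nonzero_watched(attributes, watched=WATCHED_IDS):
    """Return {name: raw_value} for watched attributes with a non-zero raw value."""
    return {
        attributes[i][0]: attributes[i][1]
        for i in watched
        if i in attributes and attributes[i][1] != 0
    }


if __name__ == "__main__":
    sample = """\
      5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
    197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
    198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
    """
    for name, raw in nonzero_watched(parse_smart_attributes(sample)).items():
        print(f"{name}: {raw}")
```

On the posted report, the only watched attribute with a non-zero raw value is UDMA_CRC_Error_Count at 1. A non-zero CRC count is commonly associated with cabling or connection problems rather than with the platters, although a single count on its own proves little.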

June 6, 2013 (Author)

"Try a new parity disk and/or SATA cable. Run pre-clear on the parity drive."

I tried replacing the parity disk, but ran into a slew of sync errors (with a brand new hard disk that had been through 3x badblocks v1.42 and a preclear). So I opened up the case and checked things out. The breakout cable still seemed firmly seated and was connected directly to the mobo. I didn't know what else could have happened, so I went ahead and replaced the backplane with an unused one. I took the moment to upgrade one of the controller cards from an SASLP to a SAS2LP as well.

Parity was rewritten without a problem, and I'm starting the first parity check, which looks like it will complete without issue as well. I will start a second as soon as that is finished.

Unfortunately, it looks like some of my data on the other disks will be corrupted, as the parity disk was the first thing I upgraded, followed by some other 3TB data disks. No big deal, as there was nothing irreplaceable.

One benefit is that my parity check speed has increased from ~70 MB/s to ~100 MB/s!