eweitzman Posted January 14, 2013

parity drive failure after moving to SASLP-MV8

[The unRAID version is 5.0-rc6-r8168-test, which has run without incident for many months. The SASLP-MV8 firmware is 3.1.0.15N.]

In the process of updating my server, a parity check failed when the parity disk went offline. I'm uncertain whether to attribute the problem to the new SAS card, the cabling, or a disk failure, or whether everything is OK and I should follow the "trust my parity drive" procedure. SMART indicates it's a drive problem, though.

The plan was to update the server by replacing two old four-port PCI and PCIe Sil SATA cards with a SASLP-MV8, and adding a new 3TB drive. The steps I followed were:

Verify SAS board and new drive
1. Install SAS board
2. Connect new 3TB to SAS board
3. Preclear the drive

The drive precleared, verifying that the SAS card is okay. It also precleared in about half the time the other 3TB drive took.

Change drive connections
5. Disconnect drives from SATA cards, remove SATA cards
6. Reconnect drives to motherboard and SAS card so 2TB and 3TB drives are on SAS card, including the parity drive
7. Start server, check drive/disk assignments
8. Perform read-only parity check

Step 7 succeeded. Step 8 failed after around 8 hours. There were some device errors in the log an hour into the parity check, but it seemed to have recovered until the parity drive went offline. Note that the new, precleared 3TB drive has not been added to the array yet.

After the check proceeded past the ends of the smaller drives and they spun down, the parity check failed. The parity drive turned blue, and unMenu says the parity drive status is DISK_DSBL. After I checked all connectors and the SAS card seating and restarted the server, the parity drive status is DISK_DSBL_NEW.

I reviewed the smartctl status and ran several short tests. Each short test had a read failure at LBA 3126529216.
The short test fails at the same point even with the parity drive moved to a motherboard SATA port. unRAID reports sync errors starting just past this LBA. There are 3 unstable sectors waiting to be remapped (Current_Pending_Sector). Since the short test was run several times and hit the read error at the same LBA each time, I assume that the data there is lost and the sector will get remapped when the firmware decides to. So at a minimum, I will have to rebuild parity.

It seems to me that the drive failure is independent of any issue with the SAS card. But I'm uncertain, as this seems rather coincidental, and would appreciate it if someone could point out anything I may have missed. Could the error timeout of this consumer drive come into play now that it's connected to a server controller instead of a consumer controller? Can the SAS controller be flashed or configured to work better with longer drive timeouts?

Thanks,
- Eric

smartctl output (via unMenu)

smartctl -a -d ata /dev/sdn
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2391051
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jan 14 12:14:40 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity was aborted by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: (51180) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0027   148   143   021    Pre-fail  Always   -           9575
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           309
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           1138
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           114
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           14
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always   -           7704
194 Temperature_Celsius     0x0022   120   108   000    Old_age   Always   -           32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed: read failure  90%        1138             3126529216
# 2  Short offline     Completed: read failure  90%        1137             3126529216
# 3  Short offline     Completed: read failure  90%        1137             3126529216
# 4  Short offline     Completed without error  00%        1137             -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

partial log file:

Jan 14 01:05:07 Tower kernel: mdcmd (61): check NOCORRECT
Jan 14 01:05:07 Tower kernel: md: recovery thread woken up ...
Jan 14 01:05:07 Tower kernel: md: recovery thread checking parity...
Jan 14 01:05:07 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
Jan 14 02:01:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 02:05:33 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x780100 action 0x6
Jan 14 02:05:33 Tower kernel: ata8.00: irq_stat 0x08000000
Jan 14 02:05:33 Tower kernel: ata8: SError: { UnrecovData 10B8B Dispar BadCRC Handshk }
Jan 14 02:05:33 Tower kernel: ata8.00: failed command: READ DMA EXT
Jan 14 02:05:33 Tower kernel: ata8.00: cmd 25/00:00:5f:ca:db/00:04:11:00:00/e0 tag 0 dma 524288 in
Jan 14 02:05:33 Tower kernel: res 50/00:00:5e:ca:db/00:00:11:00:00/e0 Emask 0x10 (ATA bus error)
Jan 14 02:05:33 Tower kernel: ata8.00: status: { DRDY }
Jan 14 02:05:33 Tower kernel: ata8: hard resetting link
Jan 14 02:05:33 Tower kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 14 02:05:33 Tower kernel: ata8.00: configured for UDMA/133
Jan 14 02:05:33 Tower kernel: ata8: EH complete
Jan 14 02:06:03 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x180000 action 0x6 frozen
Jan 14 02:06:03 Tower kernel: ata7: SError: { 10B8B Dispar }
Jan 14 02:06:03 Tower kernel: ata7.00: failed command: READ DMA EXT
Jan 14 02:06:03 Tower kernel: ata7.00: cmd 25/00:40:1f:cf:db/00:03:11:00:00/e0 tag 0 dma 425984 in
Jan 14 02:06:03 Tower kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 02:06:03 Tower kernel: ata7.00: status: { DRDY }
Jan 14 02:06:03 Tower kernel: ata7: hard resetting link
Jan 14 02:06:03 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 14 02:06:03 Tower kernel: ata7.00: configured for UDMA/133
Jan 14 02:06:03 Tower kernel: ata7: EH complete
Jan 14 03:01:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 04:01:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 05:00:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 06:00:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 06:27:57 Tower kernel: mdcmd (62): spindown 15
Jan 14 07:00:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 07:45:26 Tower kernel: mdcmd (63): spindown 4
Jan 14 07:45:26 Tower kernel: mdcmd (64): spindown 7
Jan 14 08:00:01 Tower crond[1221]: failed parsing crontab for user root: cron=""
Jan 14 09:00:01 Tower crond[1221]: failed parsing crontab for user root: cron=""

problem with parity drive sdn starts here:

Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f4530240 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f3eaee40 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f3eae480 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f7407d80 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f76660c0 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f76e1540 timed out
Jan 14 09:24:17 Tower kernel: sas: Enter sas_scsi_recover_host busy: 6 failed: 6
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efb7c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efb7c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efb7c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efb7c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efbb80
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efbb80
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efbb80 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efbb80 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa500
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa500
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa500 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa500 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa140
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa140
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa140 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa140 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa3c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa3c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa3c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa3c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efb040
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efb040
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efb040 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efb040 is aborted
Jan 14 09:24:17 Tower kernel: sas: ata13: end_device-0:4: cmd error handler
Jan 14 09:24:17 Tower kernel: sas: ata9: end_device-0:0: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata10: end_device-0:1: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata11: end_device-0:2: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata12: end_device-0:3: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata13: end_device-0:4: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata14: end_device-0:5: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: exception Emask 0x0 SAct 0x3f SErr 0x0 action 0x6 frozen
Jan 14 09:24:17 Tower kernel: sas: ata15: end_device-0:6: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: sas: ata16: end_device-0:7: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0b:5b/02:00:ba:00:00/40 tag 0 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:04:40:c4:5a/00:00:ba:00:00/40 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0d:5b/02:00:ba:00:00/40 tag 1 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0f:5b/02:00:ba:00:00/40 tag 2 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:11:5b/02:00:ba:00:00/40 tag 3 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:13:5b/02:00:ba:00:00/40 tag 4 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:15:5b/02:00:ba:00:00/40 tag 5 ncq 262144 in
Jan 14 09:24:17 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13: hard resetting link
Jan 14 09:24:19 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1522:mvs_I_T_nexus_reset for device[4]:rc= 0
Jan 14 09:24:19 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Jan 14 09:24:19 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Jan 14 09:24:19 Tower kernel: ata13.00: both IDENTIFYs aborted, assuming NODEV
Jan 14 09:24:19 Tower kernel: ata13.00: revalidation failed (errno=-2)
Jan 14 09:24:19 Tower kernel: mvsas 0000:02:00.0: Phy4 : No sig fis
Jan 14 09:24:23 Tower kernel: sas: sas_form_port: phy4 belongs to port4 already(1)!
Jan 14 09:24:24 Tower kernel: ata13: hard resetting link
Jan 14 09:24:25 Tower kernel: ata13.00: configured for UDMA/133
Jan 14 09:24:25 Tower kernel: ata13.00: device reported invalid CHS sector 0
Jan 14 09:24:25 Tower last message repeated 4 times
Jan 14 09:24:25 Tower kernel: ata13: EH complete
Jan 14 09:24:25 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529904
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529912
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529920
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529928
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529936
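One way to cross-check the kernel log against the SMART self-test result is to decode the LBA range of the first timed-out `ata13.00: cmd` line (tag 0). The sketch below is not part of the original post. It assumes the libata taskfile print order `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`, and that for NCQ commands the sector count is carried in the feature fields; the function name `decode_ncq_taskfile` is hypothetical.

```python
def decode_ncq_taskfile(cmd):
    """Return (start_lba, sector_count) from a libata 'cmd' taskfile string.

    Assumed field order (hex bytes):
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, _device = cmd.strip().split("/")
    if int(command, 16) not in (0x60, 0x61):  # READ / WRITE FPDMA QUEUED
        raise ValueError("not an NCQ read/write command")
    feature, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    # 48-bit LBA, assembled from the high-order bytes down.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # For NCQ the sector count lives in the feature fields; 0 means 65536.
    count = ((hob_feature << 8) | feature) or 65536
    return lba, count


FAILING_LBA = 3126529216  # LBA_of_first_error from the SMART self-test log
start, count = decode_ncq_taskfile("60/00:00:b0:0b:5b/02:00:ba:00:00/40")
print(start, count, start <= FAILING_LBA < start + count)
```

Under these assumptions the tag 0 read starts at LBA 3126528944 and spans 512 sectors (262144 bytes, matching `ncq 262144 in`), so it covers LBA 3126529216. That would be consistent with the timeouts being triggered by the drive struggling to read that region, rather than by a link fault alone.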
eweitzman Posted January 14, 2013

I just moved the parity drive back to the SAS board. It was on the motherboard where I ran the final test shown in my initial post. Another short test, this time run from the webGui instead of unMenu, completed without error. Any ideas?

Thanks,
- Eric

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0027   148   143   021    Pre-fail  Always   -           9566
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           310
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           1141
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           115
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           14
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always   -           7715
194 Temperature_Celsius     0x0022   119   108   000    Old_age   Always   -           33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        1140             -
# 2  Short offline     Completed: read failure  90%        1138             3126529216
# 3  Short offline     Completed: read failure  90%        1138             3126529216
# 4  Short offline     Completed: read failure  90%        1137             3126529216
# 5  Short offline     Completed: read failure  90%        1137             3126529216
# 6  Short offline     Completed without error  00%        1137             -
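The self-test log in this post can also be tallied by failing LBA to confirm that every read failure points at the same spot. This is a minimal sketch, not from the original thread: it assumes rows shaped like the smartctl 5.40 self-test table pasted here, and the regex and the function name `failing_lbas` are illustrative.

```python
import re

# One self-test row: "# N  <description>  <status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+"
    r"(?P<desc>Short offline|Extended offline|Conveyance offline)\s+"
    r"(?P<status>.+?)\s+(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def failing_lbas(log_text):
    """Map each LBA_of_first_error to the number of tests that failed there."""
    counts = {}
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("lba") != "-":
            lba = int(match.group("lba"))
            counts[lba] = counts.get(lba, 0) + 1
    return counts


SAMPLE = """\
# 1 Short offline Completed without error 00% 1140 -
# 2 Short offline Completed: read failure 90% 1138 3126529216
# 3 Short offline Completed: read failure 90% 1138 3126529216
# 4 Short offline Completed: read failure 90% 1137 3126529216
# 5 Short offline Completed: read failure 90% 1137 3126529216
# 6 Short offline Completed without error 00% 1137 -
"""
print(failing_lbas(SAMPLE))
```

For the six rows above, all four failures land on LBA 3126529216, while the oldest and the most recent tests completed without error.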
dgaschk Posted January 15, 2013

Rebuild parity. Use the same drive, or use a 3TB drive and then run preclear on the flaky drive.