parity drive failure after moving to SASLP-MV8



 

[The unRAID version is 5.0-rc6-r8168-test, which has been running without incident for many months. The SASLP-MV8 firmware is 3.1.0.15N.]

 

While I was updating my server, a parity check failed when the parity disk went offline. I'm not sure whether to blame the new SAS card, the cabling, or a disk failure, or whether everything is actually OK and I should follow the "trust my parity drive" procedure. SMART points to a drive problem, though.

 

The plan was to update the server by replacing two old four-port Sil SATA cards (one PCI, one PCIe) with a SASLP-MV8, and adding a new 3TB drive. The steps I followed were:

 

Verify SAS board and new drive

1. Install SAS board

2. Connect new 3TB to SAS board

3. Preclear the drive

 

The drive precleared successfully, which suggests the SAS card is okay. It also finished in about half the time the other 3TB drive took.

 

Change drive connections

5. Disconnect drives from SATA cards, remove SATA cards

6. Reconnect drives to motherboard and SAS card so 2TB and 3TB drives are on SAS card, including the parity drive

7. Start server, check drive/disk assignments

8. Perform read-only parity check

 

Step 7 succeeded. Step 8 failed after around 8 hours. There were some device errors in the log an hour into the parity check, but the system seemed to have recovered until the parity drive went offline.

 

Note that the new, precleared 3TB drive has not been added to the array yet.

 

After the check proceeded past the ends of the smaller drives and they spun down, the parity check failed. The parity drive turned blue, and unMenu reports the parity drive status as DISK_DSBL. After I checked all connectors and the SAS card seating and restarted the server, the parity drive status became DISK_DSBL_NEW.

 

I reviewed the smartctl status and ran several short tests. Each short test had a read failure at LBA 3126529216. The short test fails at the same point even with the parity drive moved to a motherboard SATA port. unRAID reports sync errors starting just past this LBA.
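To put a number on "just past this LBA", here is a small sketch comparing the failing LBA from the self-test log with the first "parity incorrect" value from the syslog. It assumes both numbers count 512-byte sectors; if unRAID's md driver counts from the start of the partition rather than the start of the disk, the real gap shifts by the partition offset.

```python
SECTOR_BYTES = 512  # assumption: both logs count 512-byte sectors

smart_fail_lba = 3126529216      # LBA_of_first_error in the SMART self-test log
first_sync_error = 3126529904    # first "md: parity incorrect" value in the syslog
disk_bytes = 3_000_592_982_016   # User Capacity reported by smartctl

# Distance between the unreadable sector and the first reported sync error.
gap = first_sync_error - smart_fail_lba

# Rough position of the bad sector as a fraction of the whole disk.
fraction = smart_fail_lba * SECTOR_BYTES / disk_bytes

print(f"gap: {gap} sectors ({gap * SECTOR_BYTES // 1024} KiB)")
print(f"failing LBA is about {fraction:.1%} of the way into the disk")
```

Under those assumptions the first sync error is 688 sectors (344 KiB) after the sector the short test cannot read.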

 

There are 3 unstable sectors waiting to be remapped (Current_Pending_Sector). Since the short test was run several times and hit the read error at the same LBA each time, I assume that the data there is lost and that the sector will be remapped, or cleared, the next time it is written. So at a minimum, I will have to rebuild parity.
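For anyone who wants to track these counters across reboots or controller moves, here is a minimal sketch that pulls raw values out of the attribute table as smartctl prints it. The function name `parse_smart_attributes` is made up for this example, and the parser assumes the ten-column layout shown in the output below; the sample rows are copied from that output.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                # Some drives print non-numeric raw values; skip those rows.
                continue
    return attrs

sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

attrs = parse_smart_attributes(sample)
pending = attrs["Current_Pending_Sector"]
reallocated = attrs["Reallocated_Sector_Ct"]
print(f"pending={pending} reallocated={reallocated}")
```

A pending count that stays at 3 while the reallocated count stays at 0 would be consistent with sectors that have failed reads but have not yet been rewritten.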

 

It seems to me that the drive failure is independent of any issue with the SAS card. But I'm uncertain, as the timing seems rather coincidental, and I would appreciate it if someone could point out anything I may have missed. Could this consumer drive's error-recovery timeout come into play now that it's connected to a server controller instead of a consumer controller? Can the SAS controller be flashed or configured to work better with longer drive timeouts?
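On the timeout question, one thing that can be read straight from the smartctl output below is the SCT capabilities word, 0x3035. This sketch decodes it using the bit meanings of ATA-8 IDENTIFY DEVICE word 206 as I understand them (bit 3 being SCT Error Recovery Control); treat that bit table as an assumption to check against the spec.

```python
# Bit meanings assumed from ATA-8 IDENTIFY DEVICE word 206.
SCT_BITS = {
    0: "SCT Status",
    2: "SCT Write Same",
    3: "SCT Error Recovery Control",
    4: "SCT Feature Control",
    5: "SCT Data Table",
}

def decode_sct_capabilities(word):
    """Map each known SCT feature name to True/False for the given word."""
    return {name: bool(word & (1 << bit)) for bit, name in SCT_BITS.items()}

caps = decode_sct_capabilities(0x3035)  # value from the smartctl output
for name, supported in caps.items():
    print(f"{name}: {'yes' if supported else 'no'}")
```

If the bit table is right, the Error Recovery Control bit is clear, which matches smartctl listing only Status, Feature Control and Data Table. That would mean the drive's own recovery time cannot be shortened, leaving only the host side to tune; on Linux the per-device command timeout is exposed in seconds at /sys/block/<dev>/device/timeout.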

 

Thanks,

- Eric

 

smartctl output (via unMenu)

 

smartctl -a -d ata /dev/sdn
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ2391051
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jan 14 12:14:40 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
				was aborted by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (51180) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   148   143   021    Pre-fail  Always       -       9575
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       309
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1138
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7704
194 Temperature_Celsius     0x0022   120   108   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      1138         3126529216
# 2  Short offline       Completed: read failure       90%      1137         3126529216
# 3  Short offline       Completed: read failure       90%      1137         3126529216
# 4  Short offline       Completed without error       00%      1137         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

partial log file:

 

Jan 14 01:05:07 Tower kernel: mdcmd (61): check NOCORRECT
Jan 14 01:05:07 Tower kernel: md: recovery thread woken up ...
Jan 14 01:05:07 Tower kernel: md: recovery thread checking parity...
Jan 14 01:05:07 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
Jan 14 02:01:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 02:05:33 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x780100 action 0x6
Jan 14 02:05:33 Tower kernel: ata8.00: irq_stat 0x08000000
Jan 14 02:05:33 Tower kernel: ata8: SError: { UnrecovData 10B8B Dispar BadCRC Handshk }
Jan 14 02:05:33 Tower kernel: ata8.00: failed command: READ DMA EXT
Jan 14 02:05:33 Tower kernel: ata8.00: cmd 25/00:00:5f:ca:db/00:04:11:00:00/e0 tag 0 dma 524288 in
Jan 14 02:05:33 Tower kernel:          res 50/00:00:5e:ca:db/00:00:11:00:00/e0 Emask 0x10 (ATA bus error)
Jan 14 02:05:33 Tower kernel: ata8.00: status: { DRDY }
Jan 14 02:05:33 Tower kernel: ata8: hard resetting link
Jan 14 02:05:33 Tower kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 14 02:05:33 Tower kernel: ata8.00: configured for UDMA/133
Jan 14 02:05:33 Tower kernel: ata8: EH complete
Jan 14 02:06:03 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x180000 action 0x6 frozen
Jan 14 02:06:03 Tower kernel: ata7: SError: { 10B8B Dispar }
Jan 14 02:06:03 Tower kernel: ata7.00: failed command: READ DMA EXT
Jan 14 02:06:03 Tower kernel: ata7.00: cmd 25/00:40:1f:cf:db/00:03:11:00:00/e0 tag 0 dma 425984 in
Jan 14 02:06:03 Tower kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 02:06:03 Tower kernel: ata7.00: status: { DRDY }
Jan 14 02:06:03 Tower kernel: ata7: hard resetting link
Jan 14 02:06:03 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 14 02:06:03 Tower kernel: ata7.00: configured for UDMA/133
Jan 14 02:06:03 Tower kernel: ata7: EH complete
Jan 14 03:01:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 04:01:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 05:00:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 06:00:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 06:27:57 Tower kernel: mdcmd (62): spindown 15
Jan 14 07:00:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 07:45:26 Tower kernel: mdcmd (63): spindown 4
Jan 14 07:45:26 Tower kernel: mdcmd (64): spindown 7
Jan 14 08:00:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 
Jan 14 09:00:01 Tower crond[1221]: failed parsing crontab for user root: cron="" 

problem with parity drive sdn starts here:

Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f4530240 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f3eaee40 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f3eae480 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f7407d80 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f76660c0 timed out
Jan 14 09:24:17 Tower kernel: sd 0:0:4:0: [sdn] command f76e1540 timed out
Jan 14 09:24:17 Tower kernel: sas: Enter sas_scsi_recover_host busy: 6 failed: 6
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efb7c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efb7c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efb7c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efb7c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efbb80
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efbb80
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efbb80 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efbb80 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa500
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa500
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa500 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa500 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa140
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa140
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa140 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa140 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efa3c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efa3c0
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efa3c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efa3c0 is aborted
Jan 14 09:24:17 Tower kernel: sas: trying to find task 0xf3efb040
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: aborting task 0xf3efb040
Jan 14 09:24:17 Tower kernel: sas: sas_scsi_find_task: task 0xf3efb040 is aborted
Jan 14 09:24:17 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xf3efb040 is aborted
Jan 14 09:24:17 Tower kernel: sas: ata13: end_device-0:4: cmd error handler
Jan 14 09:24:17 Tower kernel: sas: ata9: end_device-0:0: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata10: end_device-0:1: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata11: end_device-0:2: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata12: end_device-0:3: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata13: end_device-0:4: dev error handler
Jan 14 09:24:17 Tower kernel: sas: ata14: end_device-0:5: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: exception Emask 0x0 SAct 0x3f SErr 0x0 action 0x6 frozen
Jan 14 09:24:17 Tower kernel: sas: ata15: end_device-0:6: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: sas: ata16: end_device-0:7: dev error handler
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0b:5b/02:00:ba:00:00/40 tag 0 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:04:40:c4:5a/00:00:ba:00:00/40 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0d:5b/02:00:ba:00:00/40 tag 1 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:0f:5b/02:00:ba:00:00/40 tag 2 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:11:5b/02:00:ba:00:00/40 tag 3 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:13:5b/02:00:ba:00:00/40 tag 4 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13.00: failed command: READ FPDMA QUEUED
Jan 14 09:24:17 Tower kernel: ata13.00: cmd 60/00:00:b0:15:5b/02:00:ba:00:00/40 tag 5 ncq 262144 in
Jan 14 09:24:17 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 09:24:17 Tower kernel: ata13.00: status: { DRDY }
Jan 14 09:24:17 Tower kernel: ata13: hard resetting link
Jan 14 09:24:19 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1522:mvs_I_T_nexus_reset for device[4]:rc= 0
Jan 14 09:24:19 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Jan 14 09:24:19 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Jan 14 09:24:19 Tower kernel: ata13.00: both IDENTIFYs aborted, assuming NODEV
Jan 14 09:24:19 Tower kernel: ata13.00: revalidation failed (errno=-2)
Jan 14 09:24:19 Tower kernel: mvsas 0000:02:00.0: Phy4 : No sig fis
Jan 14 09:24:23 Tower kernel: sas: sas_form_port: phy4 belongs to port4 already(1)!
Jan 14 09:24:24 Tower kernel: ata13: hard resetting link
Jan 14 09:24:25 Tower kernel: ata13.00: configured for UDMA/133
Jan 14 09:24:25 Tower kernel: ata13.00: device reported invalid CHS sector 0
Jan 14 09:24:25 Tower last message repeated 4 times
Jan 14 09:24:25 Tower kernel: ata13: EH complete
Jan 14 09:24:25 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529904
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529912
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529920
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529928
Jan 14 09:24:25 Tower kernel: md: parity incorrect: 3126529936
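The taskfile bytes in the `ata13.00: cmd` lines above can be decoded to see which sectors the timed-out reads covered. This sketch assumes libata's usual print order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that, for READ FPDMA QUEUED, the sector count travels in the feature fields. `decode_ncq_read` is a name made up for this example.

```python
def decode_ncq_read(cmd):
    """Decode a libata 'cmd' taskfile string into (start_lba, sector_count)."""
    command, low, high, _device = cmd.split("/")
    if int(command, 16) != 0x60:
        raise ValueError("not a READ FPDMA QUEUED taskfile")
    feature, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    # 48-bit LBA: the three "hob" bytes are the high half, the others the low half.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # Assumption: NCQ commands carry the sector count in the feature registers.
    count = (hob_feature << 8) | feature
    return lba, count

# Tag 0 from the 09:24:17 log entries.
start, count = decode_ncq_read("60/00:00:b0:0b:5b/02:00:ba:00:00/40")
bad_lba = 3126529216  # LBA_of_first_error from the SMART self-test log
covers_bad_lba = start <= bad_lba < start + count
print(start, count, covers_bad_lba)
```

With these assumptions the tag 0 read starts at LBA 3126528944 and spans 512 sectors, which agrees with the 262144 bytes shown on the log line, so it includes the sector the short self-tests fail on. That would fit the timeouts being triggered by the drive struggling to read that sector rather than by the controller itself, though it does not rule out the card handling the stall badly.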

 


I just moved the parity drive back to the SAS board. It had been on a motherboard port, where I ran the final test shown in my initial post. Another short test, this time run from the webGui instead of unMenu, completed without error. Any ideas?

 

Thanks,

- Eric

 


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   148   143   021    Pre-fail  Always       -       9566
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       310
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1141
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       115
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7715
194 Temperature_Celsius     0x0022   119   108   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1140         -
# 2  Short offline       Completed: read failure       90%      1138         3126529216
# 3  Short offline       Completed: read failure       90%      1138         3126529216
# 4  Short offline       Completed: read failure       90%      1137         3126529216
# 5  Short offline       Completed: read failure       90%      1137         3126529216
# 6  Short offline       Completed without error       00%      1137         -

 
