StevenD Posted February 3, 2013

The array has been working properly since I upgraded my motherboard a few weeks ago. The monthly parity check kicked off at midnight on the 2nd. I woke up to a red ball on Disk 7. I replaced Disk 7 with a new 4TB drive and it appeared to be rebuilding OK. This morning, I woke up to a red ball on Disk 6. I'm not sure what to do now.

In preparation for RMAing the 2TB Disk 7 that I pulled, I ran Hitachi's DFT utility on it and it passed all its tests. It appears there really isn't anything wrong with Disk 6 either.

Disk identity:
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK11A8B9J6HYEF
Firmware Version: JKAOA3MA
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Feb 3 11:46:32 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Disk attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       102
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       607 (Average 608)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1249
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       18235
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       111
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1294
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1294
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 19/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

Disk capabilities:
General SMART Values:
Offline data collection status: (0x80)
    Offline data collection activity was never started.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (21889) seconds.
Offline data collection capabilities: (0x5b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d)
    SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.

Disk self-test log:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     18235         -

Disk error log:
No Errors Logged

Thanks in advance for the help!

syslog-disk6.zip
syslog-disk7-parity-check.zip
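The claim that nothing looks wrong with Disk 6 can be checked mechanically against the attribute table. The following is a minimal sketch in Python, not part of the original post: the function names, the `WATCHED` set, and the flagging rule (normalized value at or below a non-zero threshold, or a non-zero raw count on a watched attribute) are assumptions chosen for illustration. The sample rows are copied from the report above.

```python
import re

# Hypothetical watch list: attributes whose raw count is normally expected to be 0.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time 0x0007 119 119 024 Pre-fail Always - 607 (Average 608)
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Return {attribute_name: {value, worst, thresh, raw}} for each table row."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group(2)] = {
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            }
    return rows


def suspicious(rows):
    """List attribute names that trip the (assumed) flagging rule."""
    flagged = []
    for name, r in rows.items():
        below_threshold = r["thresh"] > 0 and r["value"] <= r["thresh"]
        nonzero_watched = name in WATCHED and r["raw"] > 0
        if below_threshold or nonzero_watched:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

For the rows shown, nothing is flagged, which is consistent with the drive itself being healthy and the fault lying elsewhere (cabling, power, or controller).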
electron286 Posted February 3, 2013

It looks to me like you may have a bad power cable that is connected to multiple drives, or possibly a power supply problem (or a drive back-plane problem, if you have one).

There were two drive problems that seem to have occurred at about the same time:

- syslog-disk6 -
Feb 2 23:16:00 nas kernel: mdcmd (66): spindown 9
Feb 2 23:16:13 nas kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 2 23:16:13 nas kernel: ata5.00: failed command: SMART
Feb 2 23:16:13 nas kernel: ata5.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Feb 2 23:16:13 nas kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 2 23:16:13 nas kernel: ata5.00: status: { DRDY }
Feb 2 23:16:13 nas kernel: ata5: hard resetting link
Feb 2 23:16:13 nas kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 2 23:16:13 nas kernel: ata5.00: link online but device misclassified
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Feb 2 23:16:13 nas kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Feb 2 23:16:13 nas kernel: ata5.00: configured for UDMA/133
Feb 2 23:16:13 nas kernel: ata5: EH complete

This one reset successfully and came back online, but the next one was not so happy...
Feb 2 23:16:24 nas kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 2 23:16:24 nas kernel: ata10.00: failed command: SMART
Feb 2 23:16:24 nas kernel: ata10.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Feb 2 23:16:24 nas kernel: res 40/00:00:e0:94:1e/00:02:37:01:00/40 Emask 0x4 (timeout)
Feb 2 23:16:24 nas kernel: ata10.00: status: { DRDY }
Feb 2 23:16:24 nas kernel: ata10: hard resetting link
Feb 2 23:16:24 nas kernel: sas: ata11: end_device-0:4: dev error handler
Feb 2 23:16:26 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Feb 2 23:16:27 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Feb 2 23:16:32 nas kernel: ata10.00: qc timeout (cmd 0x27)
Feb 2 23:16:32 nas kernel: ata10.00: failed to read native max address (err_mask=0x4)
Feb 2 23:16:32 nas kernel: ata10.00: HPA support seems broken, skipping HPA handling
Feb 2 23:16:32 nas kernel: ata10.00: revalidation failed (errno=-5)
Feb 2 23:16:32 nas kernel: ata10: hard resetting link
Feb 2 23:16:35 nas kernel: mvsas 0000:05:00.0: Phy3 : No sig fis
Feb 2 23:16:35 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Feb 2 23:16:39 nas kernel: drivers/scsi/mvsas/mv_sas.c 1951:Release slot [0] tag[0], task [d98a6dc0]:
Feb 2 23:16:39 nas kernel: sas: sas_ata_task_done: SAS error 8a
Feb 2 23:16:39 nas kernel: ata10.00: failed to set xfermode (err_mask=0x11)
Feb 2 23:16:39 nas kernel: ata10.00: limiting speed to UDMA/133:PIO3
Feb 2 23:16:39 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Feb 2 23:16:41 nas kernel: ata10: hard resetting link
Feb 2 23:16:46 nas kernel: ata10.00: qc timeout (cmd 0xec)
Feb 2 23:16:46 nas kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Feb 2 23:16:46 nas kernel: ata10.00: revalidation failed (errno=-5)
Feb 2 23:16:46 nas kernel: ata10.00: disabled
Feb 2 23:16:46 nas kernel: ata10: hard resetting link
Feb 2 23:16:49 nas kernel: mvsas 0000:05:00.0: Phy3 : No sig fis
Feb 2 23:16:49 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0
Feb 2 23:16:49 nas kernel: ata10: EH complete
Feb 2 23:16:49 nas kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] READ CAPACITY(16) failed
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Result: hostbyte=0x04 driverbyte=0x00
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Sense not available.
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] READ CAPACITY failed
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Result: hostbyte=0x04 driverbyte=0x00
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Sense not available.
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Asking for cache data failed
Feb 2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Assuming drive cache: write through
Feb 2 23:16:49 nas kernel: sdm: detected capacity change from 2000398934016 to 0
Feb 2 23:16:53 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Feb 2 23:18:58 nas kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Feb 2 23:18:58 nas kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Feb 2 23:27:00 nas fan_speed.sh: Highest disk drive temp is: 38C
Feb 2 23:27:00 nas fan_speed.sh: Changing disk drive fan speed from: [232 (90% @ 3292 rpm) ] to: [FULL (100% @ 3308 rpm) ]
Feb 2 23:30:48 nas shfs/user: shfs_readdir: readdir_r: /mnt/disk6/TV/Big Brother US After Dark (5) Input/output error
Feb 2 23:30:48 nas kernel: md: disk6 read error
Feb 2 23:30:48 nas kernel: handle_stripe read error: 1532493840/6, count: 1
Feb 2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error
Feb 2 23:30:48 nas kernel: REISERFS (device md6): Remounting filesystem read-only
Feb 2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error
Feb 2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error
Feb 2 23:30:59 nas kernel: md: disk6 read error
Feb 2 23:30:59 nas kernel: handle_stripe read error: 1534066768/6, count: 1
Feb 2 23:30:59 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Feb 2 23:30:59 nas kernel: REISERFS (device md7): Remounting filesystem read-only
Feb 2 23:30:59 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Feb 2 23:30:59 nas last message repeated 8 times
Feb 2 23:31:00 nas kernel: md: disk6 read error
Feb 2 23:31:00 nas kernel: handle_stripe read error: 1534066768/6, count: 1
Feb 2 23:31:00 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Feb 2 23:31:01 nas last message repeated 129 times
Feb 2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error
Feb 2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error
Feb 2 23:31:01 nas kernel: md: disk6 read error
Feb 2 23:31:01 nas kernel: handle_stripe read error: 251209632/6, count: 1
Feb 2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error
Feb 2 23:31:03 nas last message repeated 129 times
Feb 2 23:31:17 nas kernel: md: disk6 read error
Feb 2 23:31:17 nas kernel: handle_stripe read error: 1534066768/6, count: 1
Feb 2 23:31:17 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Feb 2 23:31:17 nas last message repeated 4 times
Feb 2 23:31:17 nas shfs/user: shfs_readdir: readdir_r: /mnt/disk6/TV/Big Brother US After Dark (5) Input/output error
Feb 2 23:31:17 nas kernel: md: disk6 read error

Further on there are more disk6 errors and log notifications. (Also notice the md7 error at Feb 2 23:30:59.)

And the next log...

- syslog-disk7-parity-check -
Feb 2 04:48:14 nas kernel: sd 0:0:2:0: [sdl] command f3d8e780 timed out
Feb 2 04:48:14 nas kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Feb 2 04:48:14 nas kernel: sas: trying to find task 0xcb8db400
Feb 2 04:48:14 nas kernel: sas: sas_scsi_find_task: aborting task 0xcb8db400
Feb 2 04:48:14 nas kernel: sas: sas_scsi_find_task: task 0xcb8db400 is aborted
Feb 2 04:48:14 nas kernel: sas: sas_eh_handle_sas_errors: task 0xcb8db400 is aborted
Feb 2 04:48:14 nas kernel: sas: ata9: end_device-0:2: cmd error handler
Feb 2 04:48:14 nas kernel: sas: ata7: end_device-0:0: dev error handler
Feb 2 04:48:14 nas kernel: sas: ata8: end_device-0:1: dev error handler
Feb 2 04:48:14 nas kernel: sas: ata9: end_device-0:2: dev error handler
Feb 2 04:48:14 nas kernel: ata9.00: exception Emask 0x0 SAct 0x400000 SErr 0x0 action 0x6 frozen
Feb 2 04:48:14 nas kernel: ata9.00: failed command: READ FPDMA QUEUED
Feb 2 04:48:14 nas kernel: ata9.00: cmd 60/08:00:37:5a:ec/00:00:0f:00:00/40 tag 22 ncq 4096 in
Feb 2 04:48:14 nas kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 2 04:48:14 nas kernel: ata9.00: status: { DRDY }
Feb 2 04:48:14 nas kernel: ata9: hard resetting link
Feb 2 04:48:14 nas kernel: sas: ata10: end_device-0:3: dev error handler
Feb 2 04:48:14 nas kernel: sas: ata11: end_device-0:4: dev error handler
Feb 2 04:48:16 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[2]:rc= 0
Feb 2 04:48:17 nas kernel: sas: sas_ata_task_done: SAS error 8a
Feb 2 04:48:17 nas kernel: sas: sas_ata_task_done: SAS error 8a
Feb 2 04:48:17 nas kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
Feb 2 04:48:17 nas kernel: ata9.00: revalidation failed (errno=-2)
Feb 2 04:48:17 nas kernel: mvsas 0000:05:00.0: Phy2 : No sig fis
Feb 2 04:48:21 nas kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!
Feb 2 04:48:22 nas kernel: ata9: hard resetting link
Feb 2 04:48:27 nas kernel: ata9.00: qc timeout (cmd 0xec)
Feb 2 04:48:27 nas kernel: ata9.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Feb 2 04:48:27 nas kernel: ata9.00: revalidation failed (errno=-5)
Feb 2 04:48:27 nas kernel: ata9: hard resetting link
Feb 2 04:48:29 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[2]:rc= 0
Feb 2 04:48:29 nas kernel: sas: sas_ata_task_done: SAS error 8a
Feb 2 04:48:29 nas kernel: sas: sas_ata_task_done: SAS error 8a
Feb 2 04:48:29 nas kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
Feb 2 04:48:29 nas kernel: ata9.00: revalidation failed (errno=-2)
Feb 2 04:48:29 nas kernel: ata9.00: disabled
Feb 2 04:48:29 nas kernel: ata9.00: device reported invalid CHS sector 0
Feb 2 04:48:29 nas kernel: ata9: EH complete
Feb 2 04:48:29 nas kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0
Feb 2 04:48:29 nas kernel: sd 0:0:2:0: [sdl] Unhandled error code
Feb 2 04:48:29 nas kernel: sd 0:0:2:0: [sdl] Result: hostbyte=0x04 driverbyte=0x00
Feb 2 04:48:29 nas kernel: sd 0:0:2:0: [sdl] CDB: cdb[0]=0x28: 28 00 0f ec 5a 37 00 00 08 00
Feb 2 04:48:29 nas kernel: end_request: I/O error, dev sdl, sector 267147831
Feb 2 04:48:29 nas kernel: md: disk7 read error

From how this looks to me, the likely cause is either the power cables, a power supply problem, or possibly a bad drive back-plane or its connections.
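The pattern in these logs (several ports throwing exceptions within seconds of each other, with one recovering and another being disabled) can be pulled out of a syslog automatically. Below is a minimal Python sketch of that idea; it is an illustration, not a tool used in the thread. The function name `summarize`, the regular expression, and the three message patterns it counts are assumptions based only on the log lines quoted above.

```python
import re

# Matches kernel lines that start with an ATA port, e.g.
# "Feb 2 23:16:13 nas kernel: ata5.00: exception Emask ..."
# Lines prefixed with "sas:" are intentionally not matched.
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+(ata\d+)(?:\.\d+)?:\s+(.*)$"
)

SAMPLE_LOG = """\
Feb 2 23:16:13 nas kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 2 23:16:13 nas kernel: ata5: hard resetting link
Feb 2 23:16:13 nas kernel: ata5: EH complete
Feb 2 23:16:24 nas kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 2 23:16:24 nas kernel: ata10: hard resetting link
Feb 2 23:16:24 nas kernel: sas: ata11: end_device-0:4: dev error handler
Feb 2 23:16:46 nas kernel: ata10.00: disabled
"""


def summarize(lines):
    """Per ATA port: time of first exception, number of hard resets, disabled flag."""
    ports = {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        stamp, port, msg = m.group(1), m.group(2), m.group(3).strip()
        info = ports.setdefault(
            port, {"first_exception": None, "resets": 0, "disabled": False}
        )
        if msg.startswith("exception"):
            if info["first_exception"] is None:
                info["first_exception"] = stamp
        elif msg == "hard resetting link":
            info["resets"] += 1
        elif msg == "disabled":
            info["disabled"] = True
    return ports


if __name__ == "__main__":
    for port, info in summarize(SAMPLE_LOG.splitlines()).items():
        print(port, info)
```

Run against the full attached syslogs, a summary like this makes it easier to see whether failures cluster on ports that share a cable, back-plane, or controller, which is the kind of common cause suggested above.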
Joe L. Posted February 3, 2013

Many single-rail power supplies (or larger multi-rail ones) max out at about 6 or 7 drives. You might have that type of power supply and have reached its limit.

What exact make/model of power supply are you using?

Joe L.
StevenD (Author) Posted February 3, 2013

Thanks guys! It's the triple-redundant power supply that comes with the SuperMicro SC933 case.

Power Supply: 760W Triple-Redundant AC to DC power supply with PFC [ 24-pin, (8-pin, 4-pin)=12V ]
AC Voltage: 100 - 240V, 50-60Hz, 14 - 8 Amp
DC Output:
5V + 3.3V ≤ 200W
+5V: 36.0 Amp
+5V standby: 3.5 Amp
+12V: 50.0 Amp (combined)
-12V: 1.0 Amp
+3.3V: 36.0 Amp

http://www.newegg.com/Product/Product.aspx?Item=N82E16817377069
http://www.ebay.com/itm/SuperMicro-CSE-PT933-PD382-Power-Distributor-/120952062208

I'm only using two of the modules right now. I wonder if I should swap one of them out for the spare.
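As a rough sanity check on the drive-count question, the +12V rating can be compared against a simultaneous spin-up load. This is a back-of-the-envelope Python sketch only: the 50 A figure is the combined +12V rating listed above, while the per-drive spin-up current and the current reserved for the rest of the system are hypothetical placeholders, not values from this thread. Real numbers should come from the drive datasheets and from measurement.

```python
def max_simultaneous_spinups(rail_amps, reserved_amps, spinup_amps_per_drive):
    """How many drives could spin up at once on the +12V rail, under the given assumptions."""
    usable = rail_amps - reserved_amps
    if usable <= 0 or spinup_amps_per_drive <= 0:
        return 0
    return int(usable // spinup_amps_per_drive)


if __name__ == "__main__":
    RAIL_12V_AMPS = 50.0   # combined +12V rating from the spec quoted above
    RESERVED_AMPS = 15.0   # ASSUMPTION: CPU, fans, controller cards, etc.
    SPINUP_AMPS = 2.0      # ASSUMPTION: per-drive peak draw on +12V at spin-up
    print(max_simultaneous_spinups(RAIL_12V_AMPS, RESERVED_AMPS, SPINUP_AMPS))
```

A budget like this only speaks to the supply's rated capacity. It says nothing about a single marginal cable, splitter, or back-plane connector feeding several drives, so it does not rule out the cabling explanation raised earlier in the thread.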
StevenD (Author) Posted February 4, 2013

I ran a parity check overnight and all seems OK. I'm going to shut it down when I get home this evening and check all the connections.

I managed to pick up an entire power supply assembly (three power supply modules and the power distributor) for $75! At least I now have parts I can swap in for testing if necessary.