Two red ball drives in two days


Array has been working properly since I upgraded my motherboard a few weeks ago.  Monthly parity check kicked off at midnight on the 2nd. I woke up to a red ball on Disk 7.  I replaced Disk 7 with a new 4TB drive and it appeared to be re-building OK. 

 

This morning, I woke up to a red ball on Disk 6.

 

Not sure what to do now.  In preparation for RMAing the 2TB Disk 7 that I pulled, I ran Hitachi's DFT utility on it, and it passed all its tests. 

 

It appears there really isn't anything wrong with Disk 6 either.

 

Disk identity: 	

Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK11A8B9J6HYEF
Firmware Version: JKAOA3MA
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Feb  3 11:46:32 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Disk attributes: 	

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       102
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       607 (Average 608)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1249
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       18235
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       111
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1294
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1294
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 19/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

Disk capabilities: 	

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (21889) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

Disk self-test log: 	

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     18235         -

Disk error log: 	

No Errors Logged
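
For anyone who wants to repeat this check on other drives, here is a minimal Python sketch that scans an attribute table like the one above and reports anything suspicious. The set of watched IDs and the function name `check_attributes` are assumptions made for this example, not anything smartctl or unRAID defines; the "value at or below threshold" rule is the usual SMART failure condition.

```python
import re

# IDs watched in this sketch: reallocated, pending and offline-uncorrectable
# sectors, plus interface CRC errors. This selection is an assumption.
WATCHED_IDS = {5, 196, 197, 198, 199}

# One row of the attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def check_attributes(report):
    """Return a list of human-readable warnings for one attribute table."""
    warnings = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        attr_id, name, value, _worst, thresh, raw = match.groups()
        attr_id, value, thresh, raw = int(attr_id), int(value), int(thresh), int(raw)
        if thresh > 0 and value <= thresh:
            warnings.append(f"{name}: value {value} is at or below threshold {thresh}")
        if attr_id in WATCHED_IDS and raw != 0:
            warnings.append(f"{name}: raw value {raw} is non-zero")
    return warnings


# Four rows copied from the Disk 6 report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

print(check_attributes(SAMPLE))  # prints []
```

An empty list only means the drive itself is not reporting media or CRC problems; it says nothing about power or cabling faults that make the drive drop off the bus.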


 

 

Thanks in advance for the help!

syslog-disk6.zip

syslog-disk7-parity-check.zip

Link to comment

It looks to me like you may have a bad power cable that feeds multiple drives, or possibly a failing power supply (or a drive backplane problem, if you have one). There were two drive problems that seem to have occurred at about the same time:

 

- syslog-disk6 -

 

Feb  2 23:16:00 nas kernel: mdcmd (66): spindown 9

Feb  2 23:16:13 nas kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb  2 23:16:13 nas kernel: ata5.00: failed command: SMART

Feb  2 23:16:13 nas kernel: ata5.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in

Feb  2 23:16:13 nas kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb  2 23:16:13 nas kernel: ata5.00: status: { DRDY }

Feb  2 23:16:13 nas kernel: ata5: hard resetting link

Feb  2 23:16:13 nas kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Feb  2 23:16:13 nas kernel: ata5.00: link online but device misclassified

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Feb  2 23:16:13 nas kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Feb  2 23:16:13 nas kernel: ata5.00: configured for UDMA/133

Feb  2 23:16:13 nas kernel: ata5: EH complete

 

 

- this one successfully reset and came back on-line -

 

- but the next one not so happy...

 

Feb  2 23:16:24 nas kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb  2 23:16:24 nas kernel: ata10.00: failed command: SMART

Feb  2 23:16:24 nas kernel: ata10.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in

Feb  2 23:16:24 nas kernel:          res 40/00:00:e0:94:1e/00:02:37:01:00/40 Emask 0x4 (timeout)

Feb  2 23:16:24 nas kernel: ata10.00: status: { DRDY }

Feb  2 23:16:24 nas kernel: ata10: hard resetting link

Feb  2 23:16:24 nas kernel: sas: ata11: end_device-0:4: dev error handler

Feb  2 23:16:26 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!

Feb  2 23:16:27 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0

Feb  2 23:16:32 nas kernel: ata10.00: qc timeout (cmd 0x27)

Feb  2 23:16:32 nas kernel: ata10.00: failed to read native max address (err_mask=0x4)

Feb  2 23:16:32 nas kernel: ata10.00: HPA support seems broken, skipping HPA handling

Feb  2 23:16:32 nas kernel: ata10.00: revalidation failed (errno=-5)

Feb  2 23:16:32 nas kernel: ata10: hard resetting link

Feb  2 23:16:35 nas kernel: mvsas 0000:05:00.0: Phy3 : No sig fis

Feb  2 23:16:35 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0

Feb  2 23:16:39 nas kernel: drivers/scsi/mvsas/mv_sas.c 1951:Release slot [0] tag[0], task [d98a6dc0]:

Feb  2 23:16:39 nas kernel: sas: sas_ata_task_done: SAS error 8a

Feb  2 23:16:39 nas kernel: ata10.00: failed to set xfermode (err_mask=0x11)

Feb  2 23:16:39 nas kernel: ata10.00: limiting speed to UDMA/133:PIO3

Feb  2 23:16:39 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!

Feb  2 23:16:41 nas kernel: ata10: hard resetting link

Feb  2 23:16:46 nas kernel: ata10.00: qc timeout (cmd 0xec)

Feb  2 23:16:46 nas kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x5)

Feb  2 23:16:46 nas kernel: ata10.00: revalidation failed (errno=-5)

Feb  2 23:16:46 nas kernel: ata10.00: disabled

Feb  2 23:16:46 nas kernel: ata10: hard resetting link

Feb  2 23:16:49 nas kernel: mvsas 0000:05:00.0: Phy3 : No sig fis

Feb  2 23:16:49 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[3]:rc= 0

Feb  2 23:16:49 nas kernel: ata10: EH complete

Feb  2 23:16:49 nas kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] READ CAPACITY(16) failed

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm]  Result: hostbyte=0x04 driverbyte=0x00

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Sense not available.

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] READ CAPACITY failed

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm]  Result: hostbyte=0x04 driverbyte=0x00

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Sense not available.

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Asking for cache data failed

Feb  2 23:16:49 nas kernel: sd 0:0:3:0: [sdm] Assuming drive cache: write through

Feb  2 23:16:49 nas kernel: sdm: detected capacity change from 2000398934016 to 0

Feb  2 23:16:53 nas kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!

Feb  2 23:18:58 nas kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  2 23:18:58 nas kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Feb  2 23:27:00 nas fan_speed.sh: Highest disk drive temp is: 38C

Feb  2 23:27:00 nas fan_speed.sh: Changing disk drive fan speed from: [232 (90% @ 3292 rpm) ] to: [FULL (100% @ 3308 rpm) ]

Feb  2 23:30:48 nas shfs/user: shfs_readdir: readdir_r: /mnt/disk6/TV/Big Brother US After Dark (5) Input/output error

Feb  2 23:30:48 nas kernel: md: disk6 read error

Feb  2 23:30:48 nas kernel: handle_stripe read error: 1532493840/6, count: 1

Feb  2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error

Feb  2 23:30:48 nas kernel: REISERFS (device md6): Remounting filesystem read-only

Feb  2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error

Feb  2 23:30:48 nas kernel: REISERFS error (device md6): zam-7001 reiserfs_find_entry: io error

Feb  2 23:30:59 nas kernel: md: disk6 read error

Feb  2 23:30:59 nas kernel: handle_stripe read error: 1534066768/6, count: 1

Feb  2 23:30:59 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error

Feb  2 23:30:59 nas kernel: REISERFS (device md7): Remounting filesystem read-only

Feb  2 23:30:59 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error

Feb  2 23:30:59 nas last message repeated 8 times

Feb  2 23:31:00 nas kernel: md: disk6 read error

Feb  2 23:31:00 nas kernel: handle_stripe read error: 1534066768/6, count: 1

Feb  2 23:31:00 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error

Feb  2 23:31:01 nas last message repeated 129 times

Feb  2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error

Feb  2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error

Feb  2 23:31:01 nas kernel: md: disk6 read error

Feb  2 23:31:01 nas kernel: handle_stripe read error: 251209632/6, count: 1

Feb  2 23:31:01 nas shfs/user: shfs_read: read: (5) Input/output error

Feb  2 23:31:03 nas last message repeated 129 times

Feb  2 23:31:17 nas kernel: md: disk6 read error

Feb  2 23:31:17 nas kernel: handle_stripe read error: 1534066768/6, count: 1

Feb  2 23:31:17 nas kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error

Feb  2 23:31:17 nas last message repeated 4 times

Feb  2 23:31:17 nas shfs/user: shfs_readdir: readdir_r: /mnt/disk6/TV/Big Brother US After Dark (5) Input/output error

Feb  2 23:31:17 nas kernel: md: disk6 read error

 

 

- and further on... more disk6 errors and log notifications... (also notice the md7 error at Feb  2 23:30:59)
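
To pull the same timeline out of a longer syslog, a rough Python sketch like the following groups the libata "exception" and "disabled" messages by port. The regular expression is written against the line format quoted above and is an assumption about what the rest of the log looks like; `summarize_ata_events` is a name made up for this example.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Feb  2 23:16:13 nas kernel: ata5.00: exception Emask 0x0 ...
#   Feb  2 23:16:46 nas kernel: ata10.00: disabled
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(ata\d+)\.\d+:\s+(exception|disabled)\b"
)


def summarize_ata_events(log):
    """Map each ATA port to its list of (timestamp, event) pairs, in log order."""
    events = defaultdict(list)
    for line in log.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamp, port, what = match.groups()
            events[port].append((stamp, what))
    return dict(events)


# A few lines copied from syslog-disk6 above.
SAMPLE = """\
Feb  2 23:16:13 nas kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  2 23:16:13 nas kernel: ata5: EH complete
Feb  2 23:16:24 nas kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  2 23:16:46 nas kernel: ata10.00: disabled
"""

for port, seen in summarize_ata_events(SAMPLE).items():
    print(port, seen)
```

A port that shows an exception but no "disabled" entry recovered after the reset (ata5 here); a port that ends in "disabled" was dropped by the kernel (ata10 here). Several ports throwing exceptions within seconds of each other is what points toward something shared, such as power or a backplane, rather than one bad drive.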

 

 

And the next log... - syslog-disk7-parity-check

 

 

Feb  2 04:48:14 nas kernel: sd 0:0:2:0: [sdl] command f3d8e780 timed out

Feb  2 04:48:14 nas kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

Feb  2 04:48:14 nas kernel: sas: trying to find task 0xcb8db400

Feb  2 04:48:14 nas kernel: sas: sas_scsi_find_task: aborting task 0xcb8db400

Feb  2 04:48:14 nas kernel: sas: sas_scsi_find_task: task 0xcb8db400 is aborted

Feb  2 04:48:14 nas kernel: sas: sas_eh_handle_sas_errors: task 0xcb8db400 is aborted

Feb  2 04:48:14 nas kernel: sas: ata9: end_device-0:2: cmd error handler

Feb  2 04:48:14 nas kernel: sas: ata7: end_device-0:0: dev error handler

Feb  2 04:48:14 nas kernel: sas: ata8: end_device-0:1: dev error handler

Feb  2 04:48:14 nas kernel: sas: ata9: end_device-0:2: dev error handler

Feb  2 04:48:14 nas kernel: ata9.00: exception Emask 0x0 SAct 0x400000 SErr 0x0 action 0x6 frozen

Feb  2 04:48:14 nas kernel: ata9.00: failed command: READ FPDMA QUEUED

Feb  2 04:48:14 nas kernel: ata9.00: cmd 60/08:00:37:5a:ec/00:00:0f:00:00/40 tag 22 ncq 4096 in

Feb  2 04:48:14 nas kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb  2 04:48:14 nas kernel: ata9.00: status: { DRDY }

Feb  2 04:48:14 nas kernel: ata9: hard resetting link

Feb  2 04:48:14 nas kernel: sas: ata10: end_device-0:3: dev error handler

Feb  2 04:48:14 nas kernel: sas: ata11: end_device-0:4: dev error handler

Feb  2 04:48:16 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[2]:rc= 0

Feb  2 04:48:17 nas kernel: sas: sas_ata_task_done: SAS error 8a

Feb  2 04:48:17 nas kernel: sas: sas_ata_task_done: SAS error 8a

Feb  2 04:48:17 nas kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV

Feb  2 04:48:17 nas kernel: ata9.00: revalidation failed (errno=-2)

Feb  2 04:48:17 nas kernel: mvsas 0000:05:00.0: Phy2 : No sig fis

Feb  2 04:48:21 nas kernel: sas: sas_form_port: phy2 belongs to port2 already(1)!

Feb  2 04:48:22 nas kernel: ata9: hard resetting link

Feb  2 04:48:27 nas kernel: ata9.00: qc timeout (cmd 0xec)

Feb  2 04:48:27 nas kernel: ata9.00: failed to IDENTIFY (I/O error, err_mask=0x5)

Feb  2 04:48:27 nas kernel: ata9.00: revalidation failed (errno=-5)

Feb  2 04:48:27 nas kernel: ata9: hard resetting link

Feb  2 04:48:29 nas kernel: drivers/scsi/mvsas/mv_sas.c 1521:mvs_I_T_nexus_reset for device[2]:rc= 0

Feb  2 04:48:29 nas kernel: sas: sas_ata_task_done: SAS error 8a

Feb  2 04:48:29 nas kernel: sas: sas_ata_task_done: SAS error 8a

Feb  2 04:48:29 nas kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV

Feb  2 04:48:29 nas kernel: ata9.00: revalidation failed (errno=-2)

Feb  2 04:48:29 nas kernel: ata9.00: disabled

Feb  2 04:48:29 nas kernel: ata9.00: device reported invalid CHS sector 0

Feb  2 04:48:29 nas kernel: ata9: EH complete

Feb  2 04:48:29 nas kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0

Feb  2 04:48:29 nas kernel: sd 0:0:2:0: [sdl] Unhandled error code

Feb  2 04:48:29 nas kernel: sd 0:0:2:0: [sdl]  Result: hostbyte=0x04 driverbyte=0x00

Feb  2 04:48:29 nas kernel: sd 0:0:2:0: [sdl] CDB: cdb[0]=0x28: 28 00 0f ec 5a 37 00 00 08 00

Feb  2 04:48:29 nas kernel: end_request: I/O error, dev sdl, sector 267147831

Feb  2 04:48:29 nas kernel: md: disk7 read error
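
As a cross-check on the last few lines, the CDB bytes the kernel printed can be decoded by hand. Opcode 0x28 is a SCSI READ(10), whose layout is: opcode, flags, a 4-byte big-endian LBA, a group byte, a 2-byte big-endian transfer length, and a control byte. The small Python sketch below (the function name `decode_read10` is just for this example) does that decoding:

```python
def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as hex bytes; return (lba, block_count)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the "[sdl] CDB:" line above.
lba, blocks = decode_read10("28 00 0f ec 5a 37 00 00 08 00")
print(lba, blocks)  # prints 267147831 8
```

The decoded LBA matches the "I/O error, dev sdl, sector 267147831" line, and 8 blocks at an assumed 512 bytes each is the 4096-byte read shown in the "ncq 4096 in" line. So the failure was an ordinary 4 KB read that timed out, not a request for an out-of-range or otherwise unusual address.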

 

 

From what I see in the logs, the most likely causes are bad power cables, a power supply problem, or possibly a bad drive backplane or its connections.

 

Link to comment

Thanks guys!

 

It's the triple-redundant power supply that comes with the SuperMicro SC933 case.

 

Power Supply

760W Triple-Redundant AC to DC power supply with PFC

[ 24-pin, (8-pin, 4-pin)=12V ]

AC Voltage     100 - 240V, 50-60Hz, 14 - 8 Amp

DC Output    5V + 3.3V ≤ 200W

+5V    36.0 Amp

+5V standby    3.5 Amp

+12V    50.0 Amp (combined)

-12V    1.0 Amp

+3.3V    36.0 Amp

 

http://www.newegg.com/Product/Product.aspx?Item=N82E16817377069

 

http://www.ebay.com/itm/SuperMicro-CSE-PT933-PD382-Power-Distributor-/120952062208

 

I'm only using two of them right now. I wonder if I should swap one of them out for the spare.

 

Link to comment

I ran a parity check overnight and all seems OK.  I'm going to shut it down when I get home this evening and check all the connections.  I managed to pick up an entire power supply assembly (three power supply modules and the power distributor) for $75!  I at least have parts I can swap out if necessary for testing.

Link to comment
