SSD Posted October 26, 2014

I was doing some file rearranging last night, leaving a lengthy copy of over 200G running. This morning I woke up to a disk failure. The disk was non-responsive from the GUI / myMain (it would not pull a SMART report). It was being simulated by unRAID as expected, and the simulated contents looked fine.

I powered down, removed the disk (nothing burned or anything) and re-inserted it (aren't drive cages wonderful). Powered up the server. The disk was recognized. The SMART report was perfectly clean (see below): no reallocated or pending sectors.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       106
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       547 (Average 545)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       263
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       21777
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       374
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       374
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 15/36)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

This array has been running perfectly for months. It hasn't been rebooted in at least 2 weeks.
Since the disk looked perfect, I decided to attempt a rebuild onto itself. It is now over 200G in, running at a very fast 125 MB/sec.

I am attaching the syslog. It was truncated (got too long). Excitement begins on line 669. Hoping RobJ or someone else might take a look and see if there are any hints.

My guess is that something weird happened that caused the disk to drop offline from the controller: maybe a bad cable, a power glitch, etc. I'm hopeful that reseating the disk (in the drive cage) was enough. This is a relatively full disk that has seen limited I/O in normal operation (except for parity checks).

Thanks!

syslog-2014-10-26b.zip
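For anyone who wants to repeat the "is this SMART report clean?" check without eyeballing it, here is a minimal sketch. It assumes the report is in the usual smartctl attribute-table layout shown in the post. The sample rows are copied from that report; the choice of which attribute IDs to watch, and the names `parse_rows` and `suspicious`, are illustrative assumptions rather than any official rule.

```python
import re

# A few attribute rows copied from the SMART report in the post above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always   -   0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -   0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always   -   0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline  -   0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always   -   0
"""

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only the leading integer of the raw field is kept).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)


def parse_rows(text):
    """Parse attribute rows into dicts with integer fields."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def suspicious(rows, watch=(5, 196, 197, 198, 199)):
    """Names of watched attributes with a nonzero raw count or a
    normalized value at or below its threshold (watch list is an assumption)."""
    return [
        r["name"] for r in rows
        if r["id"] in watch and (r["raw"] > 0 or r["value"] <= r["thresh"])
    ]


rows = parse_rows(SMART_ROWS)
print(suspicious(rows) or "nothing flagged")
```

A clean result here only says the drive's own counters show no media or CRC errors. It does not rule out a cabling, backplane, or power problem, which is what the rest of the thread goes on to consider.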
dgaschk Posted October 26, 2014

Monitor Power-Off_Retract_Count. It is indicative of power issues.
SSD Posted October 26, 2014

The rebuild completed successfully.

Power-Off_Retract_Count is similar to or lower than that of other drives of the same vintage. It indicates the drive is spinning up 12 or so times a month, which seems accurate.
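The "12 or so times a month" figure can be reproduced from the raw values in the SMART report earlier in the thread. This is only a sketch of the arithmetic: it assumes a "month" means a month of powered-on time (derived from Power_On_Hours) and that each Power-Off_Retract_Count increment corresponds to one spin-down/spin-up cycle.

```python
# Raw values taken from the SMART report posted earlier in the thread.
power_on_hours = 21777        # attribute 9, Power_On_Hours
retract_count = 374           # attribute 192, Power-Off_Retract_Count

# Average month length in hours (assumption: 365.25-day year).
HOURS_PER_MONTH = 24 * 365.25 / 12

months_powered_on = power_on_hours / HOURS_PER_MONTH
retracts_per_month = retract_count / months_powered_on

print(f"{months_powered_on:.1f} powered-on months")
print(f"{retracts_per_month:.1f} retracts per powered-on month")
```

Running the same calculation against the other drives in the array gives a like-for-like comparison, which is more telling than the raw count alone.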
lionelhutz Posted October 26, 2014

That's not dramatic. With the last dramatic disk failure I had, I found the disk grinding, with the heads clunking as they kept returning to the park position every few seconds. When I opened it, one of the platters had one side ground to a dull surface and the case was full of magnetic dust.
WeeboTech Posted October 27, 2014

It's sun spots and cosmic rays I tell ya. Gary presented a case for us to use ECC for a reason. Now the sunspots and cosmic rays are rearing their ugly heads with your hard drive.

Actually I find the syslog to be an interesting one. Looks like the drive got into a bad state and did not return anything to the kernel at first. Then it was reset. I think it either got a new drive letter or you have two drives having issues (sdc/sdr).

I've seen situations in the past where use of a drive in a certain cage position would reveal errors and disconnects, i.e., the heavy vibrations from head movement would shake things loose.

Oct 25 22:49:38 Shark kernel: Sense Key : 0x0 [current]
Oct 25 22:49:38 Shark kernel: Info fld=0x0
Oct 25 22:49:38 Shark kernel: sd 2:0:0:1: [sdc]
Oct 25 22:49:38 Shark kernel: ASC=0x0 ASCQ=0x0
Oct 26 02:16:55 Shark kernel: sas: ata17: end_device-11:0: cmd error handler
Oct 26 02:16:55 Shark kernel: sas: ata17: end_device-11:0: dev error handler
Oct 26 02:16:55 Shark kernel: ata17.00: exception Emask 0x0 SAct 0x7f7f7fdf SErr 0x0 action 0x6 frozen
Oct 26 02:16:55 Shark kernel: ata17.00: failed command: WRITE FPDMA QUEUED
Oct 26 02:16:55 Shark kernel: ata17.00: cmd 61/c8:00:30:5b:9b/01:00:3e:01:00/40 tag 0 ncq 233472 out
Oct 26 02:16:55 Shark kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 26 02:16:55 Shark kernel: ata17.00: status: { DRDY }
Oct 26 02:16:57 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1528:mvs_I_T_nexus_reset for device[0]:rc= 0
Oct 26 02:16:57 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:16:57 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:16:57 Shark kernel: ata17.00: both IDENTIFYs aborted, assuming NODEV
Oct 26 02:16:57 Shark kernel: ata17.00: revalidation failed (errno=-2)
Oct 26 02:16:57 Shark kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Oct 26 02:17:02 Shark kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Oct 26 02:17:02 Shark kernel: ata17: hard resetting link
Oct 26 02:17:08 Shark kernel: ata17.00: qc timeout (cmd 0x27)
Oct 26 02:17:08 Shark kernel: ata17.00: failed to read native max address (err_mask=0x4)
Oct 26 02:17:08 Shark kernel: ata17.00: HPA support seems broken, skipping HPA handling
Oct 26 02:17:08 Shark kernel: ata17.00: revalidation failed (errno=-5)
Oct 26 02:17:08 Shark kernel: ata17: hard resetting link
Oct 26 02:17:10 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1528:mvs_I_T_nexus_reset for device[0]:rc= 0
Oct 26 02:17:10 Shark kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Oct 26 02:17:14 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1963:Release slot [0] tag[0], task [ffff88011cc812c0]:
Oct 26 02:17:14 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:17:14 Shark kernel: ata17.00: failed to set xfermode (err_mask=0x11)
Oct 26 02:17:14 Shark kernel: ata17.00: disabled
Oct 26 02:17:14 Shark kernel: ata17.00: device reported invalid CHS sector 0
Oct 26 02:17:14 Shark kernel: ata17.00: device reported invalid CHS sector 0
Oct 26 02:17:14 Shark kernel: ata17: EH complete
Oct 26 02:17:14 Shark kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Unhandled error code
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] CDB:
Oct 26 02:17:14 Shark kernel: cdb[0]=0x8a: 8a 00 00 00 00 01 3e 9b 31 30 00 00 02 00 00 00
Oct 26 02:17:14 Shark kernel: end_request: I/O error, dev sdr, sector 5345325360
Oct 26 02:17:14 Shark kernel: md: disk3 write error, sector=5345325296
Oct 26 02:17:14 Shark kernel: md: disk3 write error, sector=5345325304
...
Oct 26 02:17:14 Shark kernel: md: disk3 read error, sector=5339018376
Oct 26 02:17:14 Shark kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] READ CAPACITY(16) failed
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Sense not available.
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] READ CAPACITY failed
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Sense not available.
Oct 26 02:17:14 Shark kernel: sdr: detected capacity change from 3000592982016 to 0
Oct 26 02:17:28 Shark kernel: md: disk3 write error, sector=5339016336
Oct 26 02:17:28 Shark kernel: md: disk3 write error, sector=5339016344
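The numbers in the syslog excerpt can be cross-checked against each other, which helps confirm that the ata17 timeout, the sdr I/O error, and the md disk3 errors all describe the same event. Below is a rough sketch that decodes the WRITE(16) CDB and the NCQ taskfile from the log lines quoted above. It assumes 512-byte logical sectors. The reading of the 64-sector difference as the start of the data partition is an assumption, not something the log states, and the function name `decode_rw16` is illustrative.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# CDB bytes copied from the "cdb[0]=0x8a" syslog line.
CDB_HEX = "8a 00 00 00 00 01 3e 9b 31 30 00 00 02 00 00 00"


def decode_rw16(cdb_hex):
    """Decode a 16-byte SCSI READ/WRITE(16) CDB into (opcode, lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    opcode = cdb[0]                              # 0x8a is WRITE(16)
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")   # bytes 10..13: transfer length
    return opcode, lba, blocks


opcode, lba, blocks = decode_rw16(CDB_HEX)

# Values copied from the "end_request" and first "md: disk3 write error" lines.
logged_io_error_sector = 5345325360
first_md_error_sector = 5345325296

# Gap between the whole-device sector and the md sector number
# (assumed here to be the data partition's starting sector).
offset = lba - first_md_error_sector

# libata prints the NCQ sector count in the feature fields of the taskfile:
# "cmd 61/c8:.../01:..." gives low byte 0xc8 and high byte 0x01.
ncq_sectors = (0x01 << 8) | 0xC8
ncq_bytes = ncq_sectors * SECTOR_BYTES

print(hex(opcode), lba, blocks, offset, ncq_bytes)
```

The decoded LBA matches the sector in the `end_request: I/O error` line, and the decoded NCQ length matches the `ncq 233472 out` figure, so the three layers of the log are consistent with a single drive dropping off the controller mid-write.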
SSD Posted October 27, 2014

lionelhutz said:
"That's not dramatic. The last dramatic disk failure I had I found the disk grinding with the heads clunking as they kept returning to the park position every few seconds. When I opened it one of the platters had one side ground to a dull surface and the case was full of magnetic dust."

Guess my "dramatic" comment was more about the nearly 3000 lines in the syslog related to the drive failure! Yours does seem quite a bit more dramatic than mine!

WeeboTech said:
"It's sun spots and cosmic rays I tell ya. Gary presented a case for us to use ECC for a reason. Now the sunspots and cosmic rays are rearing their ugly heads with your hard drive. Actually I find the syslog to be an interesting one. Looks like the drive got into a bad state and did not return anything to the kernel at first. Then it was reset. I think it either got a new drive letter or you have two drives having issues (sdc/sdr). I've seen situations in the past where use of a drive in a certain cage position would reveal errors and disconnects, i.e., the heavy vibrations from head movement would shake things loose."

I'll keep an eye on it going forward. I'm hoping it was an oddity (I already have ECC, so I can't blame it on the cosmic rays!). I didn't see that a new drive assignment was made, but I didn't look too closely before rebooting.

Thanks for taking a look, Weebo!