
Dramatic Disk Failure


SSD


I was doing some file rearranging last night, leaving a lengthy copy of over 200G running.

 

This morning I woke up to a disk failure.

 

The disk was non-responsive from the GUI / myMain (it would not return a SMART report). unRAID was simulating it as expected, and the simulated contents looked fine.

 

I powered down, removed the disk (nothing burned or anything) and re-inserted it (aren't drive cages wonderful).

 

Powered up the server. The disk was recognized. The SMART report was perfectly clean (see below): no reallocated or pending sectors.

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       106
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       547 (Average 545)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       263
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       21777
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       374
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       374
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 15/36)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
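
For anyone who wants to check a report like this without eyeballing every row, here is a minimal Python sketch that parses the attribute table and flags the rows that usually matter for media trouble. The `WATCHED` set (IDs 5, 196, 197, 198, 199) and the helper names `parse_smart` and `suspicious` are assumptions made for illustration, not part of smartctl or unRAID. `SAMPLE` is a subset of the rows above.

```python
import re

# IDs often watched for media/cabling trouble (an assumption, not an official list).
WATCHED = {5, 196, 197, 198, 199}

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, leading integer of RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 15/36)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""


def parse_smart(text):
    """Return {id: (name, value, worst, thresh, raw)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            id_, name, value, worst, thresh, raw = m.groups()
            attrs[int(id_)] = (name, int(value), int(worst), int(thresh), int(raw))
    return attrs


def suspicious(attrs):
    """Flag rows at or below a nonzero threshold, or watched IDs with a nonzero raw value."""
    flagged = []
    for id_, (name, value, worst, thresh, raw) in sorted(attrs.items()):
        below = thresh > 0 and value <= thresh
        if below or (id_ in WATCHED and raw > 0):
            flagged.append((id_, name, raw))
    return flagged


print(suspicious(parse_smart(SAMPLE)))  # prints []
```

An empty list agrees with the "perfectly clean" reading above. It says nothing about link or power problems, which would not necessarily show up in these counters.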

 

This array has been running perfectly for months. It hasn't been rebooted in at least 2 weeks.

 

Since the disk looked perfect, I decided to attempt a rebuild onto itself. It is now over 200G in, running at a very fast 125 MB/sec.

 

I am attaching the syslog. The syslog was truncated (got too long). Excitement begins on line 669.

 

Hoping RobJ or someone else might take a look and see if there are any hints. My guess is that something weird happened that caused the disk to drop offline from the controller. Maybe a bad cable, power glitch, etc. I'm hopeful that reseating the disk (in the drive cage) was enough. This is a relatively full disk that has seen limited I/O in normal operation (except for parity checks).

 

Thanks!

syslog-2014-10-26b.zip


It's sun spots and cosmic rays I tell ya. Gary presented a case for us to use ECC for a reason.

Now the sunspots and cosmic rays are rearing their ugly heads with your hard drive. 

 

 

Actually I find the syslog to be an interesting one.

 

Looks like the drive got into a bad state and did not return anything to the kernel at first.

Then it was reset. I think it either got a new drive letter or you have two drives having issues (sdc/sdr).
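
One rough way to look at the sdc/sdr question is to group device names by the SCSI address the kernel prints next to them. The Python sketch below does that for three lines copied from the syslog excerpt further down in this post. The function name `devices_by_address` is made up for illustration, and the interpretation is only a heuristic: one address showing two names would hint at a re-detection under a new letter, while two addresses with one name each would hint at two separate devices.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "sd 11:0:0:0: [sdr] Unhandled error code".
SD_LINE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")

# Lines copied from the syslog excerpt in this post.
LINES = [
    "Oct 25 22:49:38 Shark kernel: sd 2:0:0:1: [sdc]  ",
    "Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Unhandled error code",
    "Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] READ CAPACITY(16) failed",
]


def devices_by_address(lines):
    """Map each SCSI address (host:channel:target:lun) to the sd names seen on it."""
    mapping = defaultdict(set)
    for line in lines:
        m = SD_LINE.search(line)
        if m:
            mapping[m.group(1)].add(m.group(2))
    return {addr: sorted(names) for addr, names in mapping.items()}


print(devices_by_address(LINES))
# prints {'2:0:0:1': ['sdc'], '11:0:0:0': ['sdr']}
```

In these three lines each address carries a single name, which leans toward two separate devices, but the full syslog would need the same treatment before drawing a conclusion.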

 

I've seen situations in the past where use of a drive in a certain cage position would reveal errors and disconnects.

I.e., the heavy vibrations from head movement would shake things loose.

 

 

 

 

Oct 25 22:49:38 Shark kernel: Sense Key : 0x0 [current] 
Oct 25 22:49:38 Shark kernel: Info fld=0x0
Oct 25 22:49:38 Shark kernel: sd 2:0:0:1: [sdc]  
Oct 25 22:49:38 Shark kernel: ASC=0x0 ASCQ=0x0

Oct 26 02:16:55 Shark kernel: sas: ata17: end_device-11:0: cmd error handler
Oct 26 02:16:55 Shark kernel: sas: ata17: end_device-11:0: dev error handler
Oct 26 02:16:55 Shark kernel: ata17.00: exception Emask 0x0 SAct 0x7f7f7fdf SErr 0x0 action 0x6 frozen
Oct 26 02:16:55 Shark kernel: ata17.00: failed command: WRITE FPDMA QUEUED
Oct 26 02:16:55 Shark kernel: ata17.00: cmd 61/c8:00:30:5b:9b/01:00:3e:01:00/40 tag 0 ncq 233472 out
Oct 26 02:16:55 Shark kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 26 02:16:55 Shark kernel: ata17.00: status: { DRDY }

Oct 26 02:16:57 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1528:mvs_I_T_nexus_reset for device[0]:rc= 0
Oct 26 02:16:57 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:16:57 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:16:57 Shark kernel: ata17.00: both IDENTIFYs aborted, assuming NODEV
Oct 26 02:16:57 Shark kernel: ata17.00: revalidation failed (errno=-2)
Oct 26 02:16:57 Shark kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Oct 26 02:17:02 Shark kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Oct 26 02:17:02 Shark kernel: ata17: hard resetting link
Oct 26 02:17:08 Shark kernel: ata17.00: qc timeout (cmd 0x27)
Oct 26 02:17:08 Shark kernel: ata17.00: failed to read native max address (err_mask=0x4)
Oct 26 02:17:08 Shark kernel: ata17.00: HPA support seems broken, skipping HPA handling
Oct 26 02:17:08 Shark kernel: ata17.00: revalidation failed (errno=-5)
Oct 26 02:17:08 Shark kernel: ata17: hard resetting link
Oct 26 02:17:10 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1528:mvs_I_T_nexus_reset for device[0]:rc= 0
Oct 26 02:17:10 Shark kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Oct 26 02:17:14 Shark kernel: drivers/scsi/mvsas/mv_sas.c 1963:Release slot [0] tag[0], task [ffff88011cc812c0]:
Oct 26 02:17:14 Shark kernel: sas: sas_ata_task_done: SAS error 8a
Oct 26 02:17:14 Shark kernel: ata17.00: failed to set xfermode (err_mask=0x11)
Oct 26 02:17:14 Shark kernel: ata17.00: disabled
Oct 26 02:17:14 Shark kernel: ata17.00: device reported invalid CHS sector 0

Oct 26 02:17:14 Shark kernel: ata17.00: device reported invalid CHS sector 0
Oct 26 02:17:14 Shark kernel: ata17: EH complete
Oct 26 02:17:14 Shark kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Unhandled error code
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]  
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] CDB: 
Oct 26 02:17:14 Shark kernel: cdb[0]=0x8a: 8a 00 00 00 00 01 3e 9b 31 30 00 00 02 00 00 00
Oct 26 02:17:14 Shark kernel: end_request: I/O error, dev sdr, sector 5345325360
Oct 26 02:17:14 Shark kernel: md: disk3 write error, sector=5345325296
Oct 26 02:17:14 Shark kernel: md: disk3 write error, sector=5345325304
...
Oct 26 02:17:14 Shark kernel: md: disk3 read error, sector=5339018376
Oct 26 02:17:14 Shark kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] READ CAPACITY(16) failed
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]  
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Sense not available.
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] READ CAPACITY failed
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr]  
Oct 26 02:17:14 Shark kernel: Result: hostbyte=0x04 driverbyte=0x00
Oct 26 02:17:14 Shark kernel: sd 11:0:0:0: [sdr] Sense not available.
Oct 26 02:17:14 Shark kernel: sdr: detected capacity change from 3000592982016 to 0
Oct 26 02:17:28 Shark kernel: md: disk3 write error, sector=5339016336
Oct 26 02:17:28 Shark kernel: md: disk3 write error, sector=5339016344
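
As a sanity check on the excerpt above, the CDB bytes can be decoded by hand. In a 16-byte READ(16)/WRITE(16) command, bytes 2 to 9 hold the starting LBA and bytes 10 to 13 hold the transfer length, both big-endian. Here is a small Python sketch; the function name `decode_rw16` is made up for illustration, and the CDB string and the md sector number are copied from the log lines above.

```python
def decode_rw16(cdb_hex):
    """Decode a 16-byte SCSI READ(16)/WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    op = "WRITE(16)" if cdb[0] == 0x8A else "READ(16)"
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # transfer length in blocks
    return op, lba, blocks


op, lba, blocks = decode_rw16("8a 00 00 00 00 01 3e 9b 31 30 00 00 02 00 00 00")
print(op, lba, blocks)  # prints: WRITE(16) 5345325360 512

# First md write error reported right after the I/O error, copied from the log.
MD_SECTOR = 5345325296
print(lba - MD_SECTOR)  # prints: 64
```

The decoded LBA matches the "I/O error, dev sdr, sector 5345325360" line, so the failed command was a 512-block write at that address. The 64-sector gap to the first md error would be consistent with the data partition starting at sector 64 of the disk, but that part is an assumption and is not confirmed by anything in the log.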


That's not dramatic.  ;)

 

The last dramatic disk failure I had, I found the disk grinding, with the heads clunking as they kept returning to the park position every few seconds. When I opened it, one of the platters had one side ground to a dull surface and the case was full of magnetic dust.

 

Guess my "dramatic" comment was more about the nearly 3000 lines in the syslog related to the drive failure!

 

Yours does seem quite a bit more dramatic than mine!

 

It's sun spots and cosmic rays I tell ya. Gary presented a case for us to use ECC for a reason.

Now the sunspots and cosmic rays are rearing their ugly heads with your hard drive. 

 

Actually I find the syslog to be an interesting one.

 

Looks like the drive got into a bad state and did not return anything to the kernel at first.

Then it was reset. I think it either got a new drive letter or you have two drives having issues (sdc/sdr).

 

I've seen situations in the past where use of a drive in a certain cage position would reveal errors and disconnects.

I.e., the heavy vibrations from head movement would shake things loose.

 

 

I'll keep an eye on it going forward. I'm hoping it was an oddity (I already have ECC so I can't blame it on the cosmic rays!)

 

I didn't see that a new drive assignment was made, but didn't look too closely before rebooting.

 

Thanks for taking a look, Weebo!


