[SOLVED] Pre-Emptive Action - SysLog Errors - Has My Disk Failed?

cfmjohn · September 14, 2011

Hi all, I'm new around these parts. I had been experimenting with unRAID on an old PC until last week, when I took the plunge and bought new server hardware. I've just spent the day filling the array and have run into a problem.

The web GUI says all disks are running fine, but the syslog has filled up with errors and minor issues that all read along the same lines. I've included a sample below:

Sep 14 00:14:29 Tower kernel:          res 51/40:06:ca:00:04/00:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Sep 14 00:14:29 Tower kernel: ata4.00: status: { DRDY ERR } (Drive related)
Sep 14 00:14:29 Tower kernel: ata4.00: error: { UNC } (Errors)
Sep 14 00:14:29 Tower kernel: ata4.00: configured for UDMA/133 (Drive related)
Sep 14 00:14:29 Tower kernel: ata4: EH complete (Drive related)
Sep 14 00:14:30 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Sep 14 00:14:30 Tower kernel: ata4.00: irq_stat 0x40000001 (Drive related)
Sep 14 00:14:30 Tower kernel: ata4.00: failed command: READ DMA (Minor Issues)
Sep 14 00:14:30 Tower kernel: ata4.00: cmd c8/00:08:c8:00:04/00:00:00:00:00/e0 tag 0 dma 4096 in (Drive related)
Sep 14 00:14:30 Tower kernel:          res 51/40:05:cb:00:04/00:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Sep 14 00:14:30 Tower kernel: ata4.00: status: { DRDY ERR } (Drive related)
Sep 14 00:14:30 Tower kernel: ata4.00: error: { UNC } (Errors)

From reading the syslog I think the disk on ata4 is failing or has failed, but I'm not proficient with unRAID or Linux, so I don't know what those errors really mean.

A bit of background:

I have been loading another disk with files through a TV Shows user share, which spanned disk 1 and disk 2 because both had a top-level TV Shows folder. The folder on disk 1 (sdd) was deleted earlier today, but it appears that it was somehow re-created and some files were added to it. I was adding files to disk 2 when I heard disk 1 make a loud clunk and spin up, which seems to be when the files were added to disk 1.

I've attached my full syslog and my device configuration is below.

I hope this is a panic over nothing, but I know from experience that HDDs making noises is a bad sign! I hope I've not gone overboard with the information and that someone may have the answer.

Thanks,

John

Devices:

parity device:	pci-0000:00:1f.2-scsi-0:0:0:0 host0 (sda) Hitachi_HDS5C3020ALA632_ML0220F30N6Z5D
disk1 device:	pci-0000:00:1f.2-scsi-3:0:0:0 host3 (sdd) SAMSUNG_HD103SJ_S246J9KB420338
disk2 device:	pci-0000:00:1f.2-scsi-1:0:0:0 host1 (sdb) Hitachi_HDS5C3020ALA632_ML0220F30GKJXD
disk3 device:	pci-0000:00:1f.2-scsi-2:0:0:0 host2 (sdc) SAMSUNG_HD103SJ_S246J9KB420330

syslog-2011-09-14.txt

mbryanr · September 14, 2011

Read errors on the drive. Run a SMART test and post the results. If the disk is starting to fail, you'll see sectors pending reallocation in the far-right column (RAW_VALUE). The media error is a hint.

Sep 14 00:14:30 Tower kernel: md: disk1 read error

Sep 14 00:14:30 Tower kernel: handle_stripe read error: 262280/1, count: 1
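
As a rough illustration of what to look for in that far-right column, here is a minimal Python sketch that scans the attribute table of smartctl output and pulls out the raw values of the sector-health attributes. The function name `sector_health` and the choice of attribute IDs (5, 196, 197, 198) are assumptions made for illustration; neither unRAID nor smartmontools provides them.

```python
# Attribute IDs commonly watched for failing media (an illustrative choice).
WATCHED_IDS = {5, 196, 197, 198}


def sector_health(smart_text):
    """Return {attribute_name: raw_value} for the watched SMART attributes.

    Expects rows in the 10-column layout smartctl prints:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in smart_text.splitlines():
        # Split into at most 10 fields so RAW_VALUE stays whole even when
        # it contains spaces.
        fields = line.split(None, 9)
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # not an attribute row
        if int(fields[0]) in WATCHED_IDS:
            # RAW_VALUE may carry trailing notes; the count is the first token.
            found[fields[1]] = int(fields[9].split()[0])
    return found


# Rows taken from the SMART report posted later in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       20 (Lifetime Min/Max 16/53)
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for name, raw in sector_health(SAMPLE).items():
        flag = "ok" if raw == 0 else "CHECK"
        print(f"{name}: {raw} ({flag})")
```

A non-zero raw value for any of these is the usual cue to look closer. All zeros, as in this thread, do not rule out a problem, which is why the drive's own error log is worth reading as well.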

cfmjohn · September 14, 2011

I've run the short SMART test and the report is below. I don't see any sectors pending reallocation, but there are 6 errors noted.

I've also started a long SMART test and will add the results when it completes.

Thanks

John

Statistics for /dev/sdd SAMSUNG_HD103SJ_S246J9KB420338
smartctl -a -d ata /dev/sdd
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103SJ
Serial Number:    S246J9KB420338
Firmware Version: 1AJ10001
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Wed Sep 14 11:08:26 2011 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (9420) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 157) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       29
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   070   069   025    Pre-fail  Always       -       9186
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       79
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       153
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   057   000    Old_age   Always       -       20 (Lifetime Min/Max 16/53)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       82

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 cb 00 04 e0  Error: UNC 5 sectors at LBA = 0x000400cb = 262347

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c8 00 04 e0 08      00:01:49.612  READ DMA
  ef 10 02 00 00 00 a0 08      00:01:49.612  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 08      00:01:49.612  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      00:01:49.612  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:01:49.612  SET FEATURES [set transfer mode]

Error 5 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 ca 00 04 e0  Error: UNC 6 sectors at LBA = 0x000400ca = 262346

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c8 00 04 e0 08      00:01:49.611  READ DMA
  ef 10 02 00 00 00 a0 08      00:01:49.611  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 08      00:01:49.611  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      00:01:49.611  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:01:49.611  SET FEATURES [set transfer mode]

Error 4 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 cb 00 04 e0  Error: UNC 5 sectors at LBA = 0x000400cb = 262347

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c8 00 04 e0 08      00:01:49.610  READ DMA
  ef 10 02 00 00 00 a0 08      00:01:49.610  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 08      00:01:49.610  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      00:01:49.610  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:01:49.610  SET FEATURES [set transfer mode]

Error 3 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 ca 00 04 e0  Error: UNC 6 sectors at LBA = 0x000400ca = 262346

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c8 00 04 e0 08      00:01:49.608  READ DMA
  ef 10 02 00 00 00 a0 08      00:01:49.608  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 08      00:01:49.608  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      00:01:49.608  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:01:49.608  SET FEATURES [set transfer mode]

Error 2 occurred at disk power-on lifetime: 150 hours (6 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 cb 00 04 e0  Error: UNC 5 sectors at LBA = 0x000400cb = 262347

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 c8 00 04 e0 08      00:01:49.606  READ DMA
  ef 10 02 00 00 00 a0 08      00:01:49.606  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 08      00:01:49.606  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      00:01:49.606  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:01:49.606  SET FEATURES [set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       153         -
# 2  Short offline       Completed without error       00%       152         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

mbryanr · September 14, 2011

Sep 14 00:21:40 Tower kernel: ata4.00: cmd 25/00:20:d8:a6:00/00:03:00:00:00/e0 tag 0 dma 409600 in

Sep 14 00:21:40 Tower kernel: res 51/40:cc:2c:a8:00/00:01:00:00:00/e0 Emask 0x9 (media error)

This helped me....

http://lime-technology.com/wiki/index.php?title=The_Analysis_of_Drive_Issues#Drive_media_issue_.231

Run a non-correcting parity check; more errors should appear. If they are recoverable, the drive will not show any more errors on the main page. If not, the parity check may slow to a crawl or the error count will keep increasing.
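
For anyone curious what the `res` line quoted above encodes, here is a small Python sketch that unpacks it. The field order (status/error:count:LBA low:mid:high, then the high-order bytes, then the device register) is my reading of how the kernel's libata driver prints its result taskfile, and the helper name `decode_res` is made up for illustration, so treat it as a sketch rather than a reference. The caller has to say whether the failed command was a 48-bit one (opcode 0x25, READ DMA EXT) or a 28-bit one (opcode 0xc8, READ DMA), because the LBA is assembled differently.

```python
import re

# Status and error register bits as named in the ATA specs (a subset).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Twelve hex bytes separated by '/' or ':' following the word "res".
RES_RE = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_res(line, lba48):
    """Decode the 'res' taskfile of a libata error line into names and an LBA."""
    match = RES_RE.search(line)
    if match is None:
        raise ValueError("no 'res' taskfile found in line")
    (status, error, _nsect, lbal, lbam, lbah,
     _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = (int(b, 16) for b in re.split(r"[/:]", match.group(1)))

    lba = (lbah << 16) | (lbam << 8) | lbal
    if lba48:
        # 48-bit commands keep the upper three LBA bytes in the HOB fields.
        lba |= (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
    else:
        # 28-bit commands keep LBA bits 24-27 in the low nibble of device.
        lba |= (device & 0x0F) << 24

    return {
        "status": [name for bit, name in STATUS_BITS.items() if status & bit],
        "error": [name for bit, name in ERROR_BITS.items() if error & bit],
        "lba": lba,
    }


if __name__ == "__main__":
    line = ("Sep 14 00:21:40 Tower kernel: res 51/40:cc:2c:a8:00/"
            "00:01:00:00:00/e0 Emask 0x9 (media error)")
    print(decode_res(line, lba48=True))
```

Run against the earlier 28-bit sample (`res 51/40:05:cb:00:04/00:00:00:00:00/e0`) with `lba48=False`, the same function gives LBA 262347, which agrees with the address smartctl reports in the drive's error log.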

cfmjohn · September 14, 2011

Thanks for the link, useful to know for the future. I've decided that since the drive is less than two weeks old I will return it to the shop I bought it from. I don't want marginal or problematic hardware running in a brand new server box.

Thanks again

John
