
kernel: md: disk4 read error


SLRist


Hi there.

 

I've had a new unRAID server up and running for 2 days. It contains 5 x 2TB Seagate LP drives. No cache drive (as yet). These drives are of various vintages, salvaged from a bunch of Seagate Expansion USB drives I had lying around.

 

Copying media onto the server went fine until the drives reached around 18% capacity (I'm splitting ISO files equally across the disks). Then my file copy suddenly failed and I noticed a bunch of error messages in the console. Looking through the syslog, I see lots of the following (full file attached):

 

Sep  5 19:02:37 UNRAID-01 kernel: md: disk4 read error

Sep  5 19:02:37 UNRAID-01 kernel: handle_stripe read error: 30064/4, count: 1
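To see whether the read errors really are confined to one disk, the attached syslog could be tallied per disk number. This is only a minimal sketch: the regular expression is an assumption based on the two lines quoted above, and `count_read_errors` is a hypothetical helper name, not part of unRAID.

```python
import re
from collections import Counter

# Matches the md driver message quoted in the post, e.g.
# "Sep  5 19:02:37 UNRAID-01 kernel: md: disk4 read error"
# (pattern assumed from the two sample lines only).
READ_ERROR = re.compile(r"kernel: md: disk(\d+) read error")

def count_read_errors(lines):
    """Count 'md: diskN read error' messages per disk number."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = [
    "Sep  5 19:02:37 UNRAID-01 kernel: md: disk4 read error",
    "Sep  5 19:02:37 UNRAID-01 kernel: handle_stripe read error: 30064/4, count: 1",
]
print(dict(count_read_errors(sample)))
```

On the real file, the list could be replaced by the lines of `syslog-2011-09-05_nopwd.txt`; if every count lands on disk 4, that supports the suspicion that a single drive is involved.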

 

Screenshot from unMENU:

 

unraid-01-unmenu-status-01.png

 

 

 

The SMART status report for disk4 is listed at the bottom (sorry, but it means nothing to me).

 

Any suggestions please?  Is it definitely Disk 4 (ST32000540AS_9WM037SE) that is at fault?

 

Should I just replace it, or is it worth running more tests?

 

Should I run a parity check?

 

If I should replace it, can you point me to a link describing the process?

 

Many thanks.

 

 

 

Statistics for /dev/sdd ST32000540AS_9WM037SE

smartctl -a -d ata /dev/sdd

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:     ST32000540AS

Serial Number:    9WM037SE

Firmware Version: CC83

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Mon Sep  5 19:24:45 2011 BST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

See vendor-specific Attribute list for marginal Attributes.

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 609) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (   1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (   2) minutes.

SCT capabilities:       (0x103b) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

 1 Raw_Read_Error_Rate     0x000f   094   090   006    Pre-fail  Always       -       92471940

 3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0

 4 Start_Stop_Count        0x0032   093   093   020    Old_age   Always       -       7686

 5 Reallocated_Sector_Ct   0x0033   074   074   036    Pre-fail  Always       -       1080

 7 Seek_Error_Rate         0x000f   037   037   030    Pre-fail  Always       -       14499830388675

 9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12328

10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       37

183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0

184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       150

188 Command_Timeout         0x0032   100   001   000    Old_age   Always       -       8989503719470

189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1

190 Airflow_Temperature_Cel 0x0022   062   030   045    Old_age   Always   In_the_past 38 (9 169 38 26)

191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13

193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       7694

194 Temperature_Celsius     0x0022   038   070   000    Old_age   Always       -       38 (0 9 0 0)

195 Hardware_ECC_Recovered  0x001a   049   026   000    Old_age   Always       -       92471940

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       195579925760904

241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3018733366

242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1448457851

 

SMART Error Log Version: 1

ATA Error Count: 108 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 108 occurred at disk power-on lifetime: 12328 hours (513 days + 16 hours)

 When the command that caused the error occurred, the device was active or idle.

 

 After command completion occurred, registers were:

 ER ST SC SN CL CH DH

 -- -- -- -- -- -- --

 40 51 00 1f 72 00 00  Error: UNC at LBA = 0x0000721f = 29215

 

 Commands leading to the command that caused the error were:

 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

 -- -- -- -- -- -- -- --  ----------------  --------------------

 25 00 00 07 72 00 e0 00   1d+22:03:36.879  READ DMA EXT

 27 00 00 00 00 00 e0 00   1d+22:03:36.878  READ NATIVE MAX ADDRESS EXT

 ec 00 00 00 00 00 a0 00   1d+22:03:36.876  IDENTIFY DEVICE

 ef 03 46 00 00 00 a0 00   1d+22:03:36.876  SET FEATURES [set transfer mode]

 27 00 00 00 00 00 e0 00   1d+22:03:36.854  READ NATIVE MAX ADDRESS EXT

 

Error 107 occurred at disk power-on lifetime: 12328 hours (513 days + 16 hours)

 When the command that caused the error occurred, the device was active or idle.

 

 After command completion occurred, registers were:

 ER ST SC SN CL CH DH

 -- -- -- -- -- -- --

 40 51 00 1f 72 00 00  Error: UNC at LBA = 0x0000721f = 29215

 

 Commands leading to the command that caused the error were:

 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

 -- -- -- -- -- -- -- --  ----------------  --------------------

 25 00 00 07 72 00 e0 00   1d+22:03:34.065  READ DMA EXT

 27 00 00 00 00 00 e0 00   1d+22:03:34.063  READ NATIVE MAX ADDRESS EXT

 ec 00 00 00 00 00 a0 00   1d+22:03:34.062  IDENTIFY DEVICE

 ef 03 46 00 00 00 a0 00   1d+22:03:34.061  SET FEATURES [set transfer mode]

 27 00 00 00 00 00 e0 00   1d+22:03:34.040  READ NATIVE MAX ADDRESS EXT

 

Error 106 occurred at disk power-on lifetime: 12328 hours (513 days + 16 hours)

 When the command that caused the error occurred, the device was active or idle.

 

 After command completion occurred, registers were:

 ER ST SC SN CL CH DH

 -- -- -- -- -- -- --

 40 51 00 1f 72 00 00  Error: UNC at LBA = 0x0000721f = 29215

 

 Commands leading to the command that caused the error were:

 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

 -- -- -- -- -- -- -- --  ----------------  --------------------

 25 00 00 07 72 00 e0 00   1d+22:03:29.982  READ DMA EXT

 27 00 00 00 00 00 e0 00   1d+22:03:29.960  READ NATIVE MAX ADDRESS EXT

 ec 00 00 00 00 00 a0 00   1d+22:03:29.959  IDENTIFY DEVICE

 ef 03 46 00 00 00 a0 00   1d+22:03:29.875  SET FEATURES [set transfer mode]

 27 00 00 00 00 00 e0 00   1d+22:03:29.874  READ NATIVE MAX ADDRESS EXT

 

Error 105 occurred at disk power-on lifetime: 12328 hours (513 days + 16 hours)

 When the command that caused the error occurred, the device was active or idle.

 

 After command completion occurred, registers were:

 ER ST SC SN CL CH DH

 -- -- -- -- -- -- --

 40 51 00 1f 72 00 00  Error: UNC at LBA = 0x0000721f = 29215

 

 Commands leading to the command that caused the error were:

 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

 -- -- -- -- -- -- -- --  ----------------  --------------------

 25 00 00 07 72 00 e0 00   1d+22:03:25.405  READ DMA EXT

 27 00 00 00 00 00 e0 00   1d+22:03:25.403  READ NATIVE MAX ADDRESS EXT

 ec 00 00 00 00 00 a0 00   1d+22:03:25.402  IDENTIFY DEVICE

 ef 03 46 00 00 00 a0 00   1d+22:03:25.401  SET FEATURES [set transfer mode]

 27 00 00 00 00 00 e0 00   1d+22:03:25.380  READ NATIVE MAX ADDRESS EXT

 

Error 104 occurred at disk power-on lifetime: 12328 hours (513 days + 16 hours)

 When the command that caused the error occurred, the device was active or idle.

 

 After command completion occurred, registers were:

 ER ST SC SN CL CH DH

 -- -- -- -- -- -- --

 40 51 00 1f 72 00 00  Error: UNC at LBA = 0x0000721f = 29215

 

 Commands leading to the command that caused the error were:

 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

 -- -- -- -- -- -- -- --  ----------------  --------------------

 25 00 00 07 72 00 e0 00   1d+22:03:18.876  READ DMA EXT

 27 00 00 00 00 00 e0 00   1d+22:03:18.875  READ NATIVE MAX ADDRESS EXT

 ec 00 00 00 00 00 a0 00   1d+22:03:18.873  IDENTIFY DEVICE

 ef 03 46 00 00 00 a0 00   1d+22:03:18.873  SET FEATURES [set transfer mode]

 27 00 00 00 00 00 e0 00   1d+22:03:18.852  READ NATIVE MAX ADDRESS EXT
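All five logged errors report the same address, which suggests the drive keeps failing on one unreadable sector rather than on scattered ones. As a rough illustration of where the figures in the log come from, the sketch below rebuilds the LBA from the register row and the day count from the power-on hours. It assumes the classic 28-bit register layout (low nibble of DH, then CH, CL, SN); `lba28_from_registers` is a hypothetical helper, and the summary log shows only these registers even though the failing command was READ DMA EXT.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register row from the log: ER ST SC SN CL CH DH = 40 51 00 1f 72 00 00
lba = lba28_from_registers(sn=0x1F, cl=0x72, ch=0x00, dh=0x00)
print(hex(lba), lba)  # 0x721f 29215

# Power-on lifetime from the log, split into days and hours.
days, hours = divmod(12328, 24)
print(days, hours)  # 513 16
```

Both results agree with what smartctl printed: LBA 29215 and 513 days + 16 hours.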

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

syslog-2011-09-05_nopwd.txt


Replace the drive; it has a very high number of reallocated sectors (1080).
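For anyone who finds the SMART table hard to read, the check behind this advice can be sketched as a small parser that pulls the raw values of a few sector-health attributes out of the report. Which attributes to watch, and treating any nonzero raw value as worth a look, are assumptions made for illustration; `parse_attributes` and `nonzero_watched` are hypothetical helpers, and the sample rows are copied from the report above.

```python
# Attribute IDs to inspect. This selection is an assumption for illustration.
WATCHED_IDS = {5, 187, 197, 198}

def parse_attributes(report):
    """Return {id: (name, raw_value)} for rows of a smartctl attribute table."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG in column 3.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def nonzero_watched(attrs):
    """Return {name: raw_value} for watched attributes with a nonzero raw value."""
    return {name: raw for attr_id, (name, raw) in attrs.items()
            if attr_id in WATCHED_IDS and raw > 0}

sample_report = """\
  5 Reallocated_Sector_Ct   0x0033   074   074   036    Pre-fail  Always       -       1080
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       150
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
print(nonzero_watched(parse_attributes(sample_report)))
```

For this drive the sketch reports 1080 reallocated sectors and 150 reported uncorrectable errors, the same figures the advice above rests on.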

 

I have another drive in the array with even more (1318) - but that one's not giving any read errors yet.

 

I think I'll replace drive 4 and see what happens with the other. Thanks.

