Hi all

 

I attempted to run a parity check today, but noticed it was running very slowly. The main screen showed 296 errors corrected.

 

I decided to stop and reboot after noticing what looked like drive errors in the syslog.

 

After the reboot, the syslog shows the following entries (the unRAID main page shows a green ball and the drive online):

 

Jul 24 15:04:46 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:04:46 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:04:46 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:04:46 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:04:46 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:04:46 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:04:46 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:04:46 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:04:49 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:04:49 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:04:49 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:04:49 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:04:49 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:04:49 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:04:49 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:04:49 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:04:53 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:04:53 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:04:53 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:04:53 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:04:53 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:04:53 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:04:53 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:04:53 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:04:56 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:04:56 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:04:56 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:04:56 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:04:56 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:04:56 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:04:56 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:04:56 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:04:59 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:04:59 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:04:59 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:04:59 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:04:59 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:04:59 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:04:59 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:04:59 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:05:02 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jul 24 15:05:02 Tower kernel: ata12.00: BMDMA stat 0x24 (Drive related)
Jul 24 15:05:02 Tower kernel: ata12.00: cmd 25/00:08:ff:89:c5/00:00:21:00:00/e0 tag 0 dma 4096 in (Drive related)
Jul 24 15:05:02 Tower kernel:          res 51/40:00:ff:89:c5/40:00:21:00:00/e0 Emask 0x9 (media error) (Errors)
Jul 24 15:05:02 Tower kernel: ata12.00: status: { DRDY ERR } (Drive related)
Jul 24 15:05:02 Tower kernel: ata12.00: error: { UNC } (Errors)
Jul 24 15:05:02 Tower kernel: ata12.00: configured for UDMA/133 (Drive related)
Jul 24 15:05:02 Tower kernel: sd 12:0:0:0: [sdj] Result: hostbyte=0x00 driverbyte=0x08 (System)
Jul 24 15:05:02 Tower kernel: sd 12:0:0:0: [sdj] Sense Key : 0x3 [current] [descriptor] (Drive related)
Jul 24 15:05:02 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Jul 24 15:05:02 Tower kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Jul 24 15:05:02 Tower kernel:         21 c5 89 ff 
Jul 24 15:05:02 Tower kernel: sd 12:0:0:0: [sdj] ASC=0x11 ASCQ=0x4 (Drive related)
Jul 24 15:05:02 Tower kernel: end_request: I/O error, dev sdj, sector 566594047 (Errors)
Jul 24 15:05:02 Tower kernel: ata12: EH complete (Drive related)
Jul 24 15:05:02 Tower kernel: md: disk5 read error (Errors)
Jul 24 15:05:02 Tower kernel: handle_stripe read error: 566593984/5, count: 1 (Errors)
....... (repeats)
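
For reference, the sector in the `end_request` line can be cross-checked against the `res` line of the same error. Below is a minimal sketch, assuming the usual libata layout of that line (status/error:count:LBA low:mid:high/high-order bytes/device); the function name is made up for illustration.

```python
def lba48_from_taskfile(taskfile):
    """Decode the 48-bit LBA from a libata 'res' string.

    Assumed field layout (not taken from this thread):
    status / error:nsect:lbal:lbam:lbah / hob_feat:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    parts = taskfile.split("/")
    low = parts[1].split(":")   # error, nsect, lbal, lbam, lbah
    hob = parts[2].split(":")   # high-order bytes of the same registers
    # Most significant byte first: hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal
    hex_digits = hob[4] + hob[3] + hob[2] + low[4] + low[3] + low[2]
    return int(hex_digits, 16)


# The 'res' value from the log above
print(lba48_from_taskfile("51/40:00:ff:89:c5/40:00:21:00:00/e0"))  # 566594047
```

The result matches the sector reported in the `I/O error, dev sdj, sector 566594047` line, so every retry in the log is hitting the same sector.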

 

So, I'm guessing this is a failing drive?

 

If so, as the parity check was showing errors, should I attempt to get the parity check to complete before replacing the drive?

Link to comment

If so, as the parity check was showing errors, should I attempt to get the parity check to complete before replacing the drive?

 

NO!

 

Why don't you get a SMART report from the apparently failing drive? I suspect the cabling rather than the drive itself, but the SMART report should be able to tell us.

Link to comment

I ran the short SMART test from myMain, but it never seemed to return (I left it for around 15 minutes). Should I leave it longer?

 

myMain does not run a short test, only a SMART report.

 

You can run a SMART report with this command from a telnet prompt:

 

smartctl -a -d ata /dev/sdX | todos >/boot/smart.txt

 

Replace sdX with the device name of the failing drive (for example, sdj in the syslog above).

 

If that doesn't work try ...

 

smartctl -a /dev/sdX | todos >/boot/smart.txt
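
If you would rather script the "try one, then the other" step, here is a rough sketch in Python. It assumes `smartctl` is on the PATH and that Python is available on the box, which it is not on a stock unRAID install; the function names and the `runner` parameter are only there for illustration.

```python
import subprocess


def smart_report(device, runner=subprocess.run):
    """Return smartctl's text report, trying '-d ata' first and plain '-a' second."""
    attempts = (
        ["smartctl", "-a", "-d", "ata", device],
        ["smartctl", "-a", device],
    )
    for cmd in attempts:
        result = runner(cmd, capture_output=True, text=True)
        # smartctl's exit status is a bit mask and can be non-zero even when the
        # report printed fine, so look for the data section instead.
        if "START OF READ SMART DATA SECTION" in result.stdout:
            return result.stdout
    return None


def save_dos_text(text, path):
    """Write the report with CRLF line endings, like piping through todos."""
    with open(path, "w", newline="\r\n") as handle:
        handle.write(text)


if __name__ == "__main__":
    report = smart_report("/dev/sdX")  # replace sdX with the real device name
    if report is not None:
        save_dos_text(report, "/boot/smart.txt")
```

The CRLF line endings are only there so the file opens cleanly in Notepad on a Windows PC.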

 

Link to comment

Results of that command.

 

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD5000AACS-00G8B1
Serial Number:    WD-WCAUH1136828
Firmware Version: 05.04C05
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jul 25 06:31:37 2011 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (12300) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 144) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   136   104   021    Pre-fail  Always       -       6166
 4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2306
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12405
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       816
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2306
194 Temperature_Celsius     0x0022   125   111   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   183   183   000    Old_age   Always       -       1392
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       7
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   037   001   000    Old_age   Offline      -       21851

SMART Error Log Version: 1
Warning: ATA error count 2349 inconsistent with error log pointer 1

ATA Error Count: 2349 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2349 occurred at disk power-on lifetime: 12405 hours (516 days + 21 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 42 00 f8 ea  Error: UNC at LBA = 0x0af80042 = 184025154

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 08 3f 00 f8 0a 0a      01:18:16.924  READ DMA
 c8 00 38 97 39 00 00 0a      01:18:16.923  READ DMA
 c8 00 08 3f 00 00 0b 0a      01:18:16.910  READ DMA
 ca 00 08 8f 39 00 00 0a      01:18:16.910  WRITE DMA
 c8 00 08 3f 00 04 0b 0a      01:18:16.469  READ DMA

Error 2348 occurred at disk power-on lifetime: 12405 hours (516 days + 21 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 57 3c 33 ee  Error: UNC at LBA = 0x0e333c57 = 238238807

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 08 57 3c 33 0e 0a      01:18:01.225  READ DMA
 ec 00 00 00 00 00 00 0a      01:18:01.205  IDENTIFY DEVICE
 ef 03 46 00 00 00 00 0a      01:18:01.205  SET FEATURES [set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12399         404773202
# 2  Short offline       Completed: read failure       90%     12398         404773202
# 3  Short offline       Completed: read failure       30%     12398         404773202
# 4  Short offline       Completed: read failure       10%     12398         404773202
# 5  Short offline       Aborted by host               10%     12398         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

I have a spare disk I can replace this one with (I have checked the cabling and it looks OK; I haven't changed anything on the server in months), but I don't want to lose any data. I notice you said "NO!" to the parity check, so were you concerned that parity might be updated with incorrect (corrupted) data?

Link to comment

Interesting SMART report.

 

You have two problems:

 

1 - There are 2349 ATA errors. ATA errors typically mean that you have a signal problem between the controller and the drive. Reseating or replacing the SATA data cable can fix this problem and stop future ATA errors.

 

2 - You have 1392 sectors pending reallocation. This tends to be a sign of a failing disk, but since we don't see even one sector actually reallocated, it is not 100% conclusive in my mind. We have seen pending sectors magically go away with no apparent negative effects.

 

I also see SMART self-test failures, which are also not reassuring and have nothing to do with cabling.
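
If you want to keep an eye on the same attributes in later reports, something like the following would pull them out of the saved text. This is only a sketch: the list of attribute IDs to watch is my own choice, the function names are made up, and the regular expression assumes the column layout of the report posted above.

```python
import re

# Attribute IDs picked for illustration: reallocated sectors, reallocation
# events, pending sectors, offline uncorrectable sectors, UDMA CRC errors.
WATCHED_IDS = (5, 196, 197, 198, 199)

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(report):
    """Map attribute ID -> (name, raw value) for every attribute row found."""
    attrs = {}
    for line in report.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(9)))
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {
        attrs[i][0]: attrs[i][1]
        for i in WATCHED_IDS
        if i in attrs and attrs[i][1] > 0
    }


# Rows copied from the report above
SAMPLE = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   183   183   000    Old_age   Always       -       1392
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       7
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(parse_attributes(SAMPLE)))
# {'Current_Pending_Sector': 1392, 'Offline_Uncorrectable': 7}
```

Running it against a report taken after the next parity check would show whether the pending count has moved.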

 

If RMAing the disk is an option, that would not be a bad way to go.

 

If not, I'd suggest replacing (or at least unplugging and securely replugging both ends of) the cables going to this drive. Moving it to another port would also help eliminate other factors that could cause signal problems. Then run a parity check and see if the pending sectors clear and the ATA error count holds constant.

 

Hope this helps. Good luck.

Link to comment

Based on those results, I would replace the disk with the spare and let unRAID reconstruct onto it.
Link to comment
