Syslog errors

December 24, 2009

I just noticed some read errors in my syslog during a data copy from my PC to my unRAID server. Should I be concerned?

Dec 24 14:11:15 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:15 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:15 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:15 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:15 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:15 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:15 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:15 Tower kernel: ata4: EH complete
Dec 24 14:11:16 Tower unmenu[1290]: gawk: ./08-unmenu-array_mgmt.awk:115: warning: escape sequence `\'' treated as plain `''
Dec 24 14:11:19 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:19 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:19 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:19 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:19 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:19 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:19 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:19 Tower kernel: ata4: EH complete
Dec 24 14:11:22 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:22 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:22 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:22 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:22 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:22 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:22 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:22 Tower kernel: ata4: EH complete
Dec 24 14:11:26 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:26 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:26 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:26 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:26 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:26 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:26 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:26 Tower kernel: ata4: EH complete
Dec 24 14:11:30 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:30 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:30 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:30 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:30 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:30 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:30 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:30 Tower kernel: ata4: EH complete
Dec 24 14:11:34 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:34 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:34 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:34 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:34 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:34 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:34 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Unhandled sense code
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x00 driverbyte=0x08
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Sense Key : 0x3 [current] [descriptor]
Dec 24 14:11:34 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Dec 24 14:11:34 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 24 14:11:34 Tower kernel: 00 00 48 54 
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] ASC=0x11 ASCQ=0x4
Dec 24 14:11:34 Tower kernel: end_request: I/O error, dev sdc, sector 18516
Dec 24 14:11:34 Tower kernel: ata4: EH complete
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18448/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18456/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18464/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18472/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18480/2, count: 1
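
For anyone trying to read these lines: the `cmd` and `res` fields look like the ATA taskfile registers as printed by the kernel's libata error handler. Assuming the layout `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device` (that is my reading of the libata output format, it is not stated anywhere in this thread), a small Python sketch can recover the starting LBA of the read and the LBA the drive reported as failing. The name `decode_taskfile` is made up for this example.

```python
def decode_taskfile(field: str) -> dict:
    """Decode a libata 'cmd' or 'res' taskfile string into its registers.

    Assumed layout (hex bytes):
      first/second:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    On a 'cmd' line first/second are command/feature; on a 'res' line they
    are the status and error registers.
    """
    first, low, high, device = field.strip().split("/")
    second, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _, _, hob_lbal, hob_lbam, hob_lbah = (int(b, 16) for b in high.split(":"))
    # Assemble the LBA from the low bytes and the "high order byte" registers.
    # Simplification: for 28-bit commands bits 24-27 live in the low nibble of
    # the device register; that nibble is zero in the lines above, so it is
    # ignored here.
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    return {
        "first": int(first, 16),
        "second": second,
        "nsect": nsect,
        "lba": lba,
        "device": int(device, 16),
    }


cmd = decode_taskfile("c8/00:b0:c7:47:00/00:00:00:00:00/e0")
res = decode_taskfile("51/40:00:54:48:00/00:00:00:00:00/00")
print(cmd["lba"], cmd["nsect"] * 512)  # start LBA and bytes requested
print(res["lba"])                      # LBA reported with the error
```

With the values from the log this gives a 176-sector (90112-byte) read starting at LBA 18375 and an error reported at LBA 18516, which lines up with the `dma 90112` and `sector 18516` values printed by the kernel.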

December 24, 2009

Check the cable and all connections, then see whether these errors continue. When I had a drive that did this, it was a sign of imminent failure: the reallocated sector count skyrocketed within a couple of days and I RMAed the drive.

If you can get the SMART status and run the short and long SMART tests, that would help a lot in seeing what, if anything, is going wrong.

December 24, 2009

Yes, you should be concerned, but first it is best to get a SMART report on the drive to learn its status.

For each "read" failure, unRAID used the other disks in combination with parity to supply the block of data it could not read. It also wrote the reconstructed block back to the disk that failed the read. If your disk has unreadable sectors, that write gives its internal firmware the chance to reallocate them. (But we won't know until you do a SMART report on the drive.)
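
To illustrate the reconstruction described above, here is a minimal Python sketch that assumes single XOR parity across the data disks. It only shows the idea and is not unRAID's actual code; the function names and the sample bytes are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity_block):
    """Rebuild the block of the one disk that could not be read."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


disk1 = b"\x10\x20\x30\x40"
disk2 = b"\xaa\xbb\xcc\xdd"  # pretend this read fails
disk3 = b"\x01\x02\x03\x04"
parity = xor_blocks([disk1, disk2, disk3])

rebuilt = rebuild_missing([disk1, disk3], parity)
print(rebuilt == disk2)
```

The rebuilt block is what would then be written back to the failing disk, so the drive gets a fresh write to the sector it could not read.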

The command to get the report is:

smartctl -a -d ata /dev/sdc

Joe L.

December 25, 2009

Author

OK, here's a copy of the SMART report for the drive. As always, thanks for the responses; I'm looking forward to your feedback.

December 25, 2009

The report looks OK, but you should probably run the smartctl short and long self-tests on the drive. They read from the drive and exercise it more than just fetching the SMART report does. After each test, compare the new report with the one posted above to see whether anything has changed. If it has, run another test and see whether the values keep changing.
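
One way to do that comparison is to pull the raw values out of two saved `smartctl -a` outputs and list the attributes that differ. The Python sketch below assumes the attribute table layout that smartctl prints (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). `parse_attributes` and `changed_attributes` are hypothetical helper names, and the "after" values in the example are invented purely for illustration.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only its leading integer is captured).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(report: str) -> dict:
    """Map attribute ID -> (name, raw value) for every attribute row found."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs


def changed_attributes(before: str, after: str) -> dict:
    """Return {id: (name, old_raw, new_raw)} for raw values that differ."""
    old, new = parse_attributes(before), parse_attributes(after)
    return {
        i: (new[i][0], old[i][1], new[i][1])
        for i in sorted(old.keys() & new.keys())
        if old[i][1] != new[i][1]
    }


before = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
# Invented "after" report: pretend eight sectors became pending.
after = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""
print(changed_attributes(before, after))
```

Attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable) are the ones I would watch first when a drive is throwing read errors.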

December 26, 2009

Author

Thanks for the info, but how do I get the short SMART test results? Every time I refresh the Smart Status Report in unmenu, it just seems to give me SMART statistics.

December 26, 2009

That is because of a bug in unMENU when it is used to submit long/short tests to SATA drives: it did not submit them properly, so they never ran.

Attached is a corrected disk-management plug-in file. Unzip it into your unmenu directory. If you have never had this plug-in, you'll need to stop and restart unMENU for it to see the corrected version. If it already appears as a plug-in on your "About" page before you download and unzip it, the new version will be used the next time you click on the link, with no need to restart unMENU.

Now, the output IS part of the SMART status report. It will be near the bottom of the report and looks like this:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9547         -
# 2  Extended offline    Completed without error       00%      3621         -
# 3  Extended offline    Completed without error       00%        10         -
# 4  Short offline       Completed without error       00%         6         -
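
If you would rather check this part of the report from a script than by eye, the following Python sketch picks the self-test rows out of saved `smartctl -a` output. It assumes the column layout shown above, with the description and status columns separated by at least two spaces; `parse_selftest_log` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Row layout: "# N  <description>  <status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\d+|-)\s*$"
)


def parse_selftest_log(report: str) -> list:
    """Return one dict per self-test row found in the report text."""
    results = []
    for line in report.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header lines and unrelated output are skipped
        num, test, status, remaining, hours, lba = m.groups()
        results.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return results


sample = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9547         -
# 2  Extended offline    Completed without error       00%      3621         -
"""
for entry in parse_selftest_log(sample):
    print(entry["num"], entry["test"], "|", entry["status"], "|", entry["first_error_lba"])
```

A row whose status is anything other than "Completed without error", or whose last column holds an LBA instead of "-", is the one to look at more closely.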

Joe L.

December 26, 2009

Author

Ok, here's what I got from the short test:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       359         -
# 2  Short offline       Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'll try a long test and report those results as well.

December 26, 2009

Don't forget to disable the spin-down timer, or the disk will be forced to spin down in the middle of the test and the test will abort.

Joe L.

December 26, 2009

Author

Thanks, Joe. Is there a way to do this in unmenu, or do I have to execute a command at the prompt?

EDIT: I found this thread http://lime-technology.com/forum/index.php?topic=4926.0 so I'll execute hdparm -S 0 /dev/sdc and then run the long test. Will this value only last until I reboot the server?

December 26, 2009

Author

Looks like the test is still running but this doesn't look good:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       360         276948556
# 2  Short offline       Completed without error       00%       359         -
# 3  Short offline       Completed without error       00%       356         -

December 26, 2009

It all depends on how many sectors the drive reallocated, or marked for reallocation when they are next written.
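
For a rough sense of where the first failing LBA from that extended test (276948556) sits on the drive, this Python snippet converts it to a byte offset and a fraction of the disk. It assumes 512-byte logical sectors, which the thread does not state, and uses the 2,000,398,934,016-byte capacity that smartctl reports for this drive. `lba_position` is just an illustrative name.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def lba_position(lba: int, capacity_bytes: int, sector_bytes: int = SECTOR_BYTES):
    """Return (byte offset, fraction of the disk) for a logical block address."""
    offset = lba * sector_bytes
    return offset, offset / capacity_bytes


offset, fraction = lba_position(276948556, 2_000_398_934_016)
print(offset)
print(f"{fraction:.1%}")
```

Under those assumptions the failing sector is about 7% of the way into the disk, which is at least consistent with the test stopping with 90% shown as remaining.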

Joe L.

December 26, 2009

That probably will NOT stop the spin-down that unRAID performs. On the "main" interface page in unRAID, click on the disk name in the far left column. It will open up a screen for just that disk where you can override the other spin-down settings. Set it to never spin down for that disk. (Or you can just set all your drives to "never" on the settings page.) Set it back once the test is done.

unRAID stopped relying on the disks to spin themselves down when it was learned that various disks did not do so consistently. It does its own timing and issues its own spin-up/spin-down commands.

Did you pre-clear the disk having the errors before adding it to your array? (Pre-clearing will frequently find these same unreadable sectors and reallocate them, but before you put the disk in your array for use with your data.)

Joe L.

December 26, 2009

Author

OK, I've attached the syslogs from when I originally pre-cleared all 3 of my drives. My build thread also includes some references to the pre-clearing that was done, as well as the results (http://lime-technology.com/forum/index.php?topic=4383.15).

I guess I'll need to restart the long test, although unRAID doesn't show the sdc disk as spun down, while my parity drive and the sdb disk currently are. Is there any way to know whether the test is still running?

December 27, 2009

Author

Ok, here's the output from the long test. Let me know what I should do next. Thanks!

Statistics for /dev/sdc ST32000542AS_5XW024G9

smartctl -a -d ata /dev/sdc
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST32000542AS
Serial Number:    5XW024G9
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 26 19:53:15 2009 GMT+5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 ( 633) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				No Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   094   006    Pre-fail  Always       -       201007447
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       114
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   049   049   030    Pre-fail  Always       -       176097237468
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       369
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       53
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   068   068   000    Old_age   Always       -       32
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Lifetime Min/Max 19/28)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   039   039   000    Old_age   Always       -       201007447
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       97693326115215
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1790065637
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3852326791

SMART Error Log Version: 1
ATA Error Count: 29 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f6 35 00 00  Error: UNC at LBA = 0x000035f6 = 13814

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:25:03.388  READ DMA
  27 00 00 00 00 00 e0 00      00:25:03.387  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:03.386  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:03.386  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:03.386  READ NATIVE MAX ADDRESS EXT

Error 28 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f5 35 00 00  Error: UNC at LBA = 0x000035f5 = 13813

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:24:58.598  READ DMA
  25 00 a8 ff ff ff ef 00      00:24:58.588  READ DMA EXT
  25 00 00 ff ff ff ef 00      00:24:58.572  READ DMA EXT
  35 00 e8 ff ff ff ef 00      00:24:58.570  WRITE DMA EXT
  35 00 00 ff ff ff ef 00      00:24:58.568  WRITE DMA EXT

Error 27 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:26:01.419  READ DMA
  27 00 00 00 00 00 e0 00      00:26:01.419  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:26:01.417  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:26:01.417  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:26:01.417  READ NATIVE MAX ADDRESS EXT

Error 26 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:57.640  READ DMA
  27 00 00 00 00 00 e0 00      00:25:57.639  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:57.638  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:57.638  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:57.638  READ NATIVE MAX ADDRESS EXT

Error 25 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:53.891  READ DMA
  27 00 00 00 00 00 e0 00      00:25:53.891  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:53.890  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:53.889  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:53.889  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       368         -
# 2  Extended offline    Completed: read failure       90%       360         276948556
# 3  Short offline       Completed without error       00%       359         -
# 4  Short offline       Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

December 27, 2009

Ok, here's the output from the long test. Let me know what I should do next. Thanks!

Statistics for /dev/sdc ST32000542AS_5XW024G9

smartctl -a -d ata /dev/sdc
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST32000542AS
Serial Number:    5XW024G9
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 26 19:53:15 2009 GMT+5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 ( 633) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				No Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   094   006    Pre-fail  Always       -       201007447
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       114
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   049   049   030    Pre-fail  Always       -       176097237468
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       369
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       53
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   068   068   000    Old_age   Always       -       32
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Lifetime Min/Max 19/28)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   039   039   000    Old_age   Always       -       201007447
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       97693326115215
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1790065637
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3852326791

SMART Error Log Version: 1
ATA Error Count: 29 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f6 35 00 00  Error: UNC at LBA = 0x000035f6 = 13814

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:25:03.388  READ DMA
  27 00 00 00 00 00 e0 00      00:25:03.387  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:03.386  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:03.386  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:03.386  READ NATIVE MAX ADDRESS EXT

Error 28 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f5 35 00 00  Error: UNC at LBA = 0x000035f5 = 13813

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:24:58.598  READ DMA
  25 00 a8 ff ff ff ef 00      00:24:58.588  READ DMA EXT
  25 00 00 ff ff ff ef 00      00:24:58.572  READ DMA EXT
  35 00 e8 ff ff ff ef 00      00:24:58.570  WRITE DMA EXT
  35 00 00 ff ff ff ef 00      00:24:58.568  WRITE DMA EXT

Error 27 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:26:01.419  READ DMA
  27 00 00 00 00 00 e0 00      00:26:01.419  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:26:01.417  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:26:01.417  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:26:01.417  READ NATIVE MAX ADDRESS EXT

Error 26 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:57.640  READ DMA
  27 00 00 00 00 00 e0 00      00:25:57.639  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:57.638  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:57.638  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:57.638  READ NATIVE MAX ADDRESS EXT

Error 25 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:53.891  READ DMA
  27 00 00 00 00 00 e0 00      00:25:53.891  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:53.890  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:53.889  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:53.889  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       368         -
# 2  Extended offline    Completed: read failure       90%       360         276948556
# 3  Short offline       Completed without error       00%       359         -
# 4  Short offline       Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The reports look pretty much the same so I would say that you need to replace the cable and see if that helps the situation. When I started having read failures on my drive it also produced a very high increase in the reallocated sector count and the pending reallocation sector count. Yours does not seem to have done that so I would check the cable, pay attention to the drive and see if anything else happens.

December 27, 2009

Your disk has no re-allocated sectors, and none pending re-allocation. Those are both really good signs.

About all you can do is double-check that the cables are plugged in securely; there does not seem to be anything wrong with the drive (or at least SMART does not show anything).

The drive did have an error reading a sector at one point:

Error: UNC at LBA = 0x000035f6 = 13814

Apparently, when it was re-written it corrected itself and did not need to be re-allocated.

All you can do at this point is keep an eye out for errors, and make sure the cabling to it is secure.

December 27, 2009

Author

Thanks guys! I really appreciate your help "translating" the reports. I'll double check the cabling and you can be sure that I'll post if I have more errors!

December 27, 2009

This is the most common type of "hard disk problem". If the connection from computer to drive is not solid, commands and responses get garbled causing these types of errors. (Similarly, if the connection from PSU to drive is not solid, the drive can lose power. Even very brief power outages can create big problems!) Sometimes you see traces of "cabling" problems in the syslog, other times you only see them in the drive's log. Sometimes the issue causes unRAID to kick the drive from the array (red ball), other times not.

There are a number of different causes of connection problems:

1 - Bad or loose SATA cables

2 - Bad or loose SATA ports (e.g., on MB or SATA card)

3 - Bad or loose connections inside of a backplane / drive cage

4 - Bad or loose POWER connections

5 - Bad power splitters

Every time you open up your case and jiggle cables, you create the opportunity for some connection to become marginal. Vibration and repeated heating / cooling of components can also loosen cables and trigger these types of problems even after months of error-free operation.

Although these problems are common, they can also be pretty hard and frustrating to isolate, diagnose, and fix. They are easily mistaken for bad drives.

Recommend using locking SATA cables where supported to minimize cable connection problems, running monthly parity checks, and running parity checks immediately before and after any maintenance inside the computer.

December 27, 2009

Author

Every time you open up your case and jiggle cables, you create the opportunity for some connection to become marginal. Vibration and repeated heating / cooling of components can also loosen cables and trigger these types of problems even after months of error-free operation.

Although these problems are common, they can also be pretty hard and frustrating to isolate, diagnose, and fix. They are easily mistaken for bad drives.

Recommend using locking SATA cables where supported to minimize cable connection problems, running monthly parity checks, and running parity checks immediately before and after any maintenance inside the computer.

I guess now would be a good time to mention that I had recently moved the server and opened the case! I'm half afraid to open it again to check the cables! Seriously though, I think your post is spot on, and this can be very confusing for any non-Windows OS based n00b like myself! This is a great forum and there's no way I would've been able to get my server running without it! Thanks again guys!

January 1, 2010

I really like all the advice about cabling issues above, and I'll probably try to incorporate some of it into the Wiki, so I don't want to disparage it in any way, but there is no evidence of cable related errors here, just ordinary bad sectors. The telltale evidence is highlighted below:

kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata4.00: irq_stat 0x40000001

kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in

kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)

kernel: ata4.00: status: { DRDY ERR }

kernel: ata4.00: error: { UNC }

kernel: ata4.00: configured for UDMA/133

kernel: ata4: EH complete

UNC indicates an UNCorrectable sector; that is, the error correction data stored with the sector could not recover the sector's exact original contents. The SMART report reflects this in the following line, which shows (I believe) 32 total uncorrectable sectors found over the history of this drive:

187 Reported_Uncorrect 0x0032 068 068 000 Old_age Always - 32

The SMART report shows an ATA Error Count of 29, with the last 5 displayed, and all 5 are related to UNC sectors. The syslog reports a read error in sector 18516, which corresponds to 3 of the ATA errors shown in the SMART report.
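For anyone who wants to check the sector number themselves, here is a minimal sketch (in Python) of how the `res` line from the syslog can be decoded. It assumes the field order that libata prints for the result taskfile (status/error:count:LBA low:LBA mid:LBA high, then the high-order bytes, then the device register) and simply joins the LBA bytes in the 48-bit layout; for a 28-bit command with address bits in the device register this would need adjusting. The function name `decode_res` is just made up for illustration.

```python
import re

# Field layout assumed from the kernel's "res" line:
# status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
RES_RE = re.compile(
    r"res (?P<status>[0-9a-f]{2})/(?P<error>[0-9a-f]{2})"
    r":(?P<nsect>[0-9a-f]{2}):(?P<lbal>[0-9a-f]{2})"
    r":(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})"
    r"/(?P<hob_feature>[0-9a-f]{2}):(?P<hob_nsect>[0-9a-f]{2})"
    r":(?P<hob_lbal>[0-9a-f]{2}):(?P<hob_lbam>[0-9a-f]{2})"
    r":(?P<hob_lbah>[0-9a-f]{2})/(?P<device>[0-9a-f]{2})"
)


def decode_res(line):
    """Return the failing LBA and the UNC/ERR flags from a 'res' line, or None."""
    m = RES_RE.search(line)
    if m is None:
        return None
    f = {name: int(value, 16) for name, value in m.groupdict().items()}
    # Join the six LBA bytes, high-order bytes first.
    lba = (f["hob_lbah"] << 40 | f["hob_lbam"] << 32 | f["hob_lbal"] << 24
           | f["lbah"] << 16 | f["lbam"] << 8 | f["lbal"])
    return {
        "lba": lba,
        "unc": bool(f["error"] & 0x40),   # bit 6 of the error register = UNC
        "err": bool(f["status"] & 0x01),  # bit 0 of the status register = ERR
    }


line = "kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)"
print(decode_res(line))  # {'lba': 18516, 'unc': True, 'err': True}
```

The bytes 00:48:54 read high to low give 0x004854 = 18516, the same LBA that SMART errors 25, 26 and 27 report.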

The fact that there are no remapped or pending sectors is very good news. There are actually 2 kinds of read errors, although both are reported the same way to us, at least at the application level. They both result in a failure to access the data stored in a particular sector. But to avoid confusion, it is good to understand the difference between the two, especially since we unRAID users are used to examining SMART reports, which report info from lower levels than other computer users see. The two types of read errors (also known as bad sectors) can be referred to as soft and hard errors.

A hard error is when the magnetic media surface below a sector can no longer be relied on to store data. If working correctly, the magnetic particles should be able to store and maintain, over reasonable conditions, the same polarity or state that was last written to them. If they become too weak to hold a state, then the original data cannot be read, and new data cannot be written to them and read back correctly. The sector of a hard error must be remapped, that is, replaced with a good sector from the spares available.

A soft error is when the magnetic particles are good, and can reliably hold their state no matter how they are tested, but the current sector data is too corrupted to be corrected by the ECC data stored with it. This could occur if there was a random power spike while the head was writing to the sector, or more commonly if power is lost during a write to a sector.

I believe this is what happened to you: one or more sectors were affected by a power glitch that damaged the data stored in those sectors. So once the data is reconstructed and written back to the sector, the problem is completely fixed. The physical sector surface was good, and once good data was stored there, the sector is no longer bad.

There is one item in the SMART report that does concern me, and I would recommend monitoring it.

7 Seek_Error_Rate 0x000f 049 049 030 Pre-fail Always - 171802258884

The number to the far right can be ignored. The Seek_Error_Rate for recent Seagates has generally started out and stayed within the 60's. Why they aren't near 100 I don't know, but this is characteristic of recent Seagates. I don't think I have seen a Seagate 2TB drive before, so perhaps 49 is OK for them, but it seems very low to me, especially for such a new drive, and especially when the failure threshold is 30. I would keep a close eye on it. And that huge number to the right does look awfully big! Without any experience as to what that number *should* be though, or how it should be interpreted, I would not give it too much consequence.
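If you want to keep an eye on these numbers between reports, something like the following rough sketch could pull them out of saved smartctl output. It assumes the ten-column attribute table layout shown in the report above; `parse_attributes` and `summarize` are hypothetical helper names, and the sample rows are copied from that report. It only compares the normalized VALUE against THRESH and reads the raw counts for attributes 5, 187 and 197; it makes no attempt to interpret the big Seek_Error_Rate raw number.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   049   049   030    Pre-fail  Always       -       176097237468
187 Reported_Uncorrect      0x0032   068   068   000    Old_age   Always       -       32
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(report):
    """Return {attribute id: row} for every attribute row found in the text."""
    rows = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and carry a 0x.... flag column.
        if len(parts) < 10 or not parts[0].isdigit() or not parts[2].startswith("0x"):
            continue
        rows[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[9],  # kept as text; some attributes add extra detail
        }
    return rows


def summarize(rows):
    """Collect the few numbers discussed in this thread."""
    seek = rows[7]
    return {
        "reallocated": int(rows[5]["raw"]),
        "pending": int(rows[197]["raw"]),
        "reported_uncorrectable": int(rows[187]["raw"]),
        # How far the normalized value sits above the failure threshold.
        "seek_margin": seek["value"] - seek["thresh"],
    }


print(summarize(parse_attributes(SAMPLE)))
```

For the report above this gives zero reallocated and zero pending sectors, 32 reported uncorrectable errors, and a Seek_Error_Rate margin of 19 (049 against a threshold of 030). Watching whether that margin shrinks from one report to the next is the point of the exercise.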

January 1, 2010

Author

Rob, thanks for the very insightful post. After replacing the cable, SATA port on MB, backplane SATA port etc., I'm continuing to get handle_stripe read errors when writing to this particular drive. Below is an updated SMART report for my sdc "problem" drive as well as for 2 other WD 2 TB drives in my array (for comparison). I've also attached a syslog from yesterday that shows the frequency of errors is increasing. I would appreciate your feedback and any course of action I can or should be taking at this point. Thanks again!

January 1, 2010

The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive.

I see, however, that there are no re-allocated sectors... This might indicate that these are all simply "soft" errors, and that they are being corrected as the "read" failures cause the sectors to be re-written based on the other disks in the array.

As an alternative to replacing the drive, you might see if the failures eventually slow down... (as it would not need to re-write the same sector twice)

Or, you could speed things up by stopping the array, un-assigning the drive that is failing, starting the array without it, then stopping it once more, re-assigning it, and re-starting it, having it completely write the contents back onto the failing drive. (basically it now thinks you replaced the drive with a new one) Since all the errors seem to be soft errors, it might fix itself.

That will cause you to be without parity protection during the time the rebuild takes... So if there are any critical files on your server, make copies of them on multiple disks... just in case a different disk fails too.

Thinking about it a bit more, you could perform almost the same re-write by simply moving the files on the failing disk to a different disk, and then moving them back... This would re-write everything while keeping parity protection.

Joe L.

January 1, 2010

Author

The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive.

Since the drive is fairly new, are these errors something that would warrant an RMA to Seagate? I'm trying to figure out what constitutes a "failed" drive. It would seem in this case that I don't "yet" have good cause to be able to return it?

January 1, 2010

The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive.

Since the drive is fairly new, are these errors something that would warrant an RMA to Seagate? I'm trying to figure out what constitutes a "failed" drive. It would seem in this case that I don't "yet" have good cause to be able to return it?

If they continue, yes, an RMA is in order, even if the SMART test does not indicate an imminent failure. Just include a printout of all the "media" read errors you are getting when you return the drive if they request proof.
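One way to put that printout together is a small script along these lines. This is only a sketch: it assumes each error shows up in the syslog as a group of kernel lines starting with "exception Emask" and ending with "EH complete", as in the excerpt at the top of this thread, and `media_error_blocks` is a made-up name.

```python
SAMPLE = """\
Dec 24 14:11:15 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:15 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:15 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:15 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:15 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:15 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:15 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:15 Tower kernel: ata4: EH complete
Dec 24 14:11:16 Tower unmenu[1290]: gawk: ./08-unmenu-array_mgmt.awk:115: warning
Dec 24 14:11:19 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:19 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:19 Tower kernel: ata4: EH complete
"""


def media_error_blocks(lines):
    """Group kernel lines into exception blocks and keep those with a media error."""
    blocks = []
    current = None
    for line in lines:
        if "kernel:" not in line:
            continue  # skip messages from other programs (e.g. unmenu)
        if "exception Emask" in line:
            current = [line]  # a new error report starts here
        elif current is not None:
            current.append(line)
            if "EH complete" in line:
                if any("(media error)" in entry for entry in current):
                    blocks.append(current)
                current = None
    return blocks


blocks = media_error_blocks(SAMPLE.splitlines())
print(len(blocks), "media error reports found")
for block in blocks:
    print("\n".join(block))
    print()
```

To run it against a real log, read the lines from your saved syslog file instead of `SAMPLE` and redirect the output to a text file that you can print.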
