December 24, 2009

I just noticed some read errors in my syslog on a data copy from my PC to my UnRAID server. Should I be concerned??

Dec 24 14:11:15 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:15 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:15 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:15 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:15 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:15 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:15 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:15 Tower kernel: ata4: EH complete
Dec 24 14:11:16 Tower unmenu[1290]: gawk: ./08-unmenu-array_mgmt.awk:115: warning: escape sequence `\'' treated as plain `''
Dec 24 14:11:19 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:19 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:19 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:19 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:19 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:19 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:19 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:19 Tower kernel: ata4: EH complete
Dec 24 14:11:22 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:22 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:22 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:22 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:22 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:22 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:22 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:22 Tower kernel: ata4: EH complete
Dec 24 14:11:26 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:26 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:26 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:26 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:26 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:26 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:26 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:26 Tower kernel: ata4: EH complete
Dec 24 14:11:30 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:30 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:30 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:30 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:30 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:30 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:30 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:30 Tower kernel: ata4: EH complete
Dec 24 14:11:34 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 24 14:11:34 Tower kernel: ata4.00: irq_stat 0x40000001
Dec 24 14:11:34 Tower kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
Dec 24 14:11:34 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:34 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:34 Tower kernel: ata4.00: error: { UNC }
Dec 24 14:11:34 Tower kernel: ata4.00: configured for UDMA/133
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Unhandled sense code
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x00 driverbyte=0x08
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] Sense Key : 0x3 [current] [descriptor]
Dec 24 14:11:34 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Dec 24 14:11:34 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 24 14:11:34 Tower kernel: 00 00 48 54
Dec 24 14:11:34 Tower kernel: sd 3:0:0:0: [sdc] ASC=0x11 ASCQ=0x4
Dec 24 14:11:34 Tower kernel: end_request: I/O error, dev sdc, sector 18516
Dec 24 14:11:34 Tower kernel: ata4: EH complete
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18448/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18456/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18464/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18472/2, count: 1
Dec 24 14:11:34 Tower kernel: md: disk2 read error
Dec 24 14:11:34 Tower kernel: handle_stripe read error: 18480/2, count: 1
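For anyone wanting to read the "cmd" and "res" lines in the log above, here is a minimal Python sketch that pulls the sector numbers out of them. The field order is an assumption based on the libata error message format (command or status first, then feature or error, sector count, and three LBA bytes, low byte first), and it only handles 28-bit commands such as the READ DMA (c8) seen here. The name decode_taskfile is made up for this example.

```python
import re

# Assumed field order (from the libata error-handler message format):
#   cmd: command/feature:nsect:lbal:lbam:lbah/<5 "hob" bytes>/device
#   res: status/error:nsect:lbal:lbam:lbah/<5 "hob" bytes>/device
TASKFILE = re.compile(
    r"\b(cmd|res) ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/([0-9a-f]{2})"
)


def decode_taskfile(line):
    """Decode a 28-bit LBA taskfile from one 'cmd' or 'res' syslog line."""
    match = TASKFILE.search(line)
    if match is None:
        raise ValueError("no cmd/res taskfile found")
    kind, first, low, device = match.groups()
    second, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    # 28-bit addressing: three LBA bytes plus the low nibble of the device byte.
    lba = lbal | (lbam << 8) | (lbah << 16) | ((int(device, 16) & 0x0F) << 24)
    return {
        "kind": kind,
        "first": int(first, 16),   # command (cmd) or status (res)
        "second": second,          # feature (cmd) or error register (res)
        "nsect": nsect,
        "lba": lba,
    }


cmd = decode_taskfile("ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in")
res = decode_taskfile("res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)")

print(cmd["lba"], cmd["nsect"], cmd["nsect"] * 512)  # start LBA, sector count, bytes
print(res["lba"], hex(res["second"]))                # failing LBA, error register
```

Under those assumptions the read starts at sector 18375 and covers 176 sectors (176 x 512 = 90112 bytes, matching "dma 90112 in"), and the result line points at sector 18516, the same sector the kernel reports in "end_request: I/O error, dev sdc, sector 18516".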
December 24, 2009

Check the cable and all connections to see if these errors continue. When I had a drive that did this, it was a sign of imminent failure: the reallocated sector count skyrocketed in a couple of days and I RMAed the drive. If you can get the SMART status and run the short and long SMART tests, that would help a lot to see what, if anything, might be going wrong.
December 24, 2009

Yes... you should be concerned... but first it is best to get a SMART report on the drive to learn its status.

For each "read" failure, unRAID used the other disks in combination with parity to supply the block of data it could not read. It also wrote the reconstructed block back to the disk it could not read. If your disk has unreadable sectors, this would let its internal firmware relocate them. (But we won't know until you do a SMART report on the drive.)

The command to get the report is:

smartctl -a -d ata /dev/sdc

Joe L.
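To illustrate how a block can be supplied from "the other disks in combination with parity", here is a small Python sketch. It assumes the parity disk holds the bitwise XOR of the data disks, which is the usual single-parity scheme; it is not unRAID's actual code, and the block contents and the names xor_blocks and rebuild_missing are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity_block):
    """Recover the one unreadable block from the surviving blocks plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Illustrative 4-byte "blocks" at the same offset on three data disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB1, 0xC2, 0xD3])  # pretend this one cannot be read
disk3 = bytes([0x0F, 0xF0, 0x55, 0xAA])

parity = xor_blocks([disk1, disk2, disk3])        # what the parity disk would hold
rebuilt = rebuild_missing([disk1, disk3], parity)  # computed without reading disk2

print(rebuilt == disk2)
```

Writing the rebuilt bytes back to the failing disk is what gives the drive's firmware the chance to remap the sector, as described above.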
December 25, 2009 Author

Ok, here's a copy of the SMART report for the drive. As always, thanks for the responses and looking forward to your feedback.
December 25, 2009

The report looks OK, but you should probably run a smartctl short and long test on the drive. That will read from the drive and test it a little more than just getting the SMART report. After you run the test, compare the report to the one posted above to see if anything has changed. If it has, run another one and see if things keep changing.
December 26, 2009 Author

Thanks for the info, but how do I get the short SMART test results? Every time I refresh the Smart Status Report in unmenu it just seems to give me SMART statistics??
December 26, 2009

It is because of a bug in unMENU when used to submit long/short tests to SATA drives. (It did not submit them properly, so they did not run.)

Attached is a corrected disk-management plug-in file. Un-zip it to your unmenu directory. (If you've never had it, you'll need to stop and re-start unMENU for it to see the corrected version. If it exists in your "About" page as a plug-in before you download and un-zip it, it will be used the next time you click on the link; no need to re-start unMENU.)

Now... the output IS part of the SMART status report. It will be near the bottom of the report and it looks like this:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9547         -
# 2  Extended offline    Completed without error       00%      3621         -
# 3  Extended offline    Completed without error       00%        10         -
# 4  Short offline       Completed without error       00%         6         -

Joe L.
December 26, 2009 Author

Ok, here's what I got from the short test:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       359         -
# 2  Short offline       Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'll try a long test and report those results as well.
December 26, 2009

Don't forget to disable the spin-down timer, or the disk will be forced to spin down in the middle of the test, and the test will abort.

Joe L.
December 26, 2009 Author

Thanks Joe... is there a way to do this in unmenu, or do I have to execute a command at the prompt?

EDIT: I found this thread http://lime-technology.com/forum/index.php?topic=4926.0 so I'll execute hdparm -S 0 /dev/sdc and then run the long test. Will this value only last until I reboot the server??
December 26, 2009 Author

Looks like the test is still running but this doesn't look good:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       360         276948556
# 2  Short offline       Completed without error       00%       359         -
# 3  Short offline       Completed without error       00%       356         -
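If you want to pick failed runs out of a self-test log like the one above without eyeballing it, a rough Python sketch follows. The regular expression is written against the rows shown in this thread ("Short offline" and "Extended offline" entries) and may need adjusting for other smartctl versions; the name parse_selftest_log is made up for the example.

```python
import re

# One log row: "# N  <type> offline|captive  <status text>  NN%  <hours>  <LBA or ->"
SELFTEST_ROW = re.compile(
    r"#\s*(\d+)\s+(\w+ (?:offline|captive))\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in smartctl's self-test log."""
    rows = []
    for m in SELFTEST_ROW.finditer(text):
        num, test, status, remaining, hours, lba = m.groups()
        rows.append({
            "num": int(num),
            "test": test,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return rows


log = """
# 1  Extended offline    Completed: read failure       90%       360         276948556
# 2  Short offline       Completed without error       00%       359         -
# 3  Short offline       Completed without error       00%       356         -
"""

for row in parse_selftest_log(log):
    if row["first_error_lba"] is not None:
        print(row["test"], row["status"], "at LBA", row["first_error_lba"])
```

A row that lists an LBA stopped at that sector. "90%" in the Remaining column means the extended test had only covered about the first tenth of its work when it hit the read failure.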
December 26, 2009

It will all depend on how many sectors it relocated, or marked for re-allocation when next written.

Joe L.
December 26, 2009

That probably will NOT stop the spin-down that unRAID performs.

On the "main" interface page in unRAID, click on the disk name in the far left column. It will open up a screen for just that disk to override the other spin-down settings. Set it to never spin down for that disk. (Or, you can just set it on the settings page for all your drives to "never".) Set it back later once the test is done.

unRAID stopped relying on the disk's own timer to spin it down when it was learned that various disks did not spin themselves down consistently. It does its own timing and issues its own spin-up/spin-down commands.

Did you pre-clear the disk having the errors before adding it to your array? (It will frequently find these same unreadable sectors and re-allocate them, but before you put the disk in your array for use with your data.)

Joe L.
December 26, 2009 Author

Ok, I've attached the syslogs from when I pre-cleared all 3 of my drives originally. My build thread also includes some references to the pre-clearing that was done, as well as the results (http://lime-technology.com/forum/index.php?topic=4383.15).

I guess I'll need to restart the long test, although unRAID doesn't show the sdc disk as spun down, while my parity drive and sdb disk are spun down currently. Is there any way to know if the test is still running?
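On the question of telling whether a test is still running: the SMART report includes a "Self-test execution status" line with a number in parentheses. The Python sketch below decodes that number on the assumption (taken from my reading of the ATA SMART data layout, so please verify against your drive's output) that the high four bits are a state code, with 15 meaning "in progress", and the low four bits are the remaining work in tenths. The names and the abbreviated state table are mine, not smartctl's.

```python
import re

# Assumed meanings of the high nibble; anything not listed is reported as "other".
STATES = {
    0x0: "completed without error, or no self-test has been run",
    0x1: "aborted by the host",
    0x2: "interrupted by a host reset",
    0x3: "fatal or unknown error, test could not complete",
    0x7: "completed, read element failed",
    0xF: "in progress",
}

STATUS_LINE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def decode_selftest_status(value):
    """Split the status byte into (state code, percent remaining, description)."""
    state = (value >> 4) & 0x0F
    remaining_pct = (value & 0x0F) * 10
    return state, remaining_pct, STATES.get(state, "other")


def status_from_report(report_text):
    """Find the status line in smartctl output and decode it."""
    match = STATUS_LINE.search(report_text)
    if match is None:
        raise ValueError("no 'Self-test execution status' line found")
    return decode_selftest_status(int(match.group(1)))


line = "Self-test execution status: ( 0) The previous self-test routine completed without error"
print(status_from_report(line))
print(decode_selftest_status(249))  # 0xF9: state 15 with 9 tenths left, under the assumption above
```

In practice the text smartctl prints after the number says the same thing in words, so reading that line after each refresh is usually enough.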
December 27, 2009 Author

Ok, here's the output from the long test. Let me know what I should do next. Thanks!

Statistics for /dev/sdc ST32000542AS_5XW024G9

smartctl -a -d ata /dev/sdc
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST32000542AS
Serial Number:    5XW024G9
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 26 19:53:15 2009 GMT+5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 633) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   094   006    Pre-fail  Always       -       201007447
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       114
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   049   049   030    Pre-fail  Always       -       176097237468
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       369
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       53
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   068   068   000    Old_age   Always       -       32
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Lifetime Min/Max 19/28)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   039   039   000    Old_age   Always       -       201007447
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       97693326115215
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1790065637
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3852326791

SMART Error Log Version: 1
ATA Error Count: 29 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f6 35 00 00  Error: UNC at LBA = 0x000035f6 = 13814

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:25:03.388  READ DMA
  27 00 00 00 00 00 e0 00      00:25:03.387  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:03.386  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:03.386  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:03.386  READ NATIVE MAX ADDRESS EXT

Error 28 occurred at disk power-on lifetime: 351 hours (14 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f5 35 00 00  Error: UNC at LBA = 0x000035f5 = 13813

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 8f 35 00 e0 00      00:24:58.598  READ DMA
  25 00 a8 ff ff ff ef 00      00:24:58.588  READ DMA EXT
  25 00 00 ff ff ff ef 00      00:24:58.572  READ DMA EXT
  35 00 e8 ff ff ff ef 00      00:24:58.570  WRITE DMA EXT
  35 00 00 ff ff ff ef 00      00:24:58.568  WRITE DMA EXT

Error 27 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:26:01.419  READ DMA
  27 00 00 00 00 00 e0 00      00:26:01.419  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:26:01.417  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:26:01.417  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:26:01.417  READ NATIVE MAX ADDRESS EXT

Error 26 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:57.640  READ DMA
  27 00 00 00 00 00 e0 00      00:25:57.639  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:57.638  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:57.638  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:57.638  READ NATIVE MAX ADDRESS EXT

Error 25 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 54 48 00 00  Error: UNC at LBA = 0x00004854 = 18516

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c7 47 00 e0 00      00:25:53.891  READ DMA
  27 00 00 00 00 00 e0 00      00:25:53.891  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:25:53.890  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:25:53.889  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:25:53.889  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       368         -
# 2  Extended offline    Completed: read failure       90%       360         276948556
# 3  Short offline       Completed without error       00%       359         -
# 4  Short offline       Completed without error       00%       356         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
December 27, 2009

The reports look pretty much the same, so I would say that you need to replace the cable and see if that helps the situation. When I started having read failures on my drive, it also produced a very large increase in the reallocated sector count and the pending sector count. Yours does not seem to have done that, so I would check the cable, pay attention to the drive, and see if anything else happens.
December 27, 200916 yr Your disk has no re-allocated sectors, and none pending re-allocation. Those are both really good signs. About all you can do is double check that the cables are plugged in securely; there does not seem to be anything wrong with the drive (or at least SMART does not show anything). The drive did have an error reading a sector at one point:

Error: UNC at LBA = 0x000035f6 = 13814

Apparently, when it was re-written it corrected itself and did not need to be re-allocated. All you can do at this point is keep an eye out for errors, and make sure the cabling to it is secure.
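As a rough illustration of the check described above, here is a minimal Python sketch that pulls the raw values for the re-allocated and pending sector counts out of a few attribute rows copied from the SMART report posted in this thread. The helper names (`parse_row`, `summarize`) are made up for this example, and the column positions assume the ten-column layout shown in that report.

```python
# Rows copied from the SMART report posted earlier in this thread.
ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   068   068   000    Old_age   Always       -       32
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_row(line):
    """Split one attribute row (hypothetical helper, assumes the layout above)."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized value
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "raw": int(f[9]),     # raw counter
    }


def summarize(attrs):
    """Collect the raw counters discussed in this thread."""
    return {
        "reallocated": attrs[5]["raw"],
        "pending": attrs[197]["raw"],
        "reported_uncorrectable": attrs[187]["raw"],
        "crc_errors": attrs[199]["raw"],
    }


attrs = {a["id"]: a for a in map(parse_row, ROWS.splitlines())}
summary = summarize(attrs)
print(summary)
```

With these rows the re-allocated and pending counters are both zero, which is the "good sign" mentioned above, while Reported_Uncorrect shows 32.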
December 27, 200916 yr Author Thanks guys! I really appreciate your help "translating" the reports. I'll double check the cabling and you can be sure that I'll post if I have more errors!
December 27, 200916 yr This is the most common type of "hard disk problem". If the connection from computer to drive is not solid, commands and responses get garbled, causing these types of errors. (Similarly, if the connection from PSU to drive is not solid, the drive can lose power. Even very brief power outages can create big problems!) Sometimes you see traces of "cabling" problems in the syslog, other times you only see them in the drive's log. Sometimes the issue causes unRAID to kick the drive from the array (red ball), other times not.

There are a number of different causes of connection problems:
1 - Bad or loose SATA cables
2 - Bad or loose SATA ports (e.g., on MB or SATA card)
3 - Bad or loose connections inside of a backplane / drive cage
4 - Bad or loose POWER connections
5 - Bad power splitters

Every time you open up your case and jiggle cables, you create the opportunity for some connection to become marginal. Vibration and repeated heating / cooling of components can also loosen cables and trigger these types of problems even after months of error-free operation. Although it is easy to say they are common, they can also be pretty hard and frustrating to isolate, diagnose, and fix. These problems are easily mistaken for bad drives.

I recommend using locking SATA cables where supported to minimize cable connection problems, running monthly parity checks, and running parity checks immediately before and after any maintenance inside the computer.
December 27, 200916 yr Author

Every time you open up your case and jiggle cables, you create the opportunity for some connection to become marginal. Vibration and repeated heating / cooling of components can also loosen cables and trigger these types of problems even after months of error-free operation. Although easy to say they are common, they can also be pretty hard and frustrating to isolate, diagnose, and fix. These problems are easily mistaken for bad drives. Recommend using locking SATA cables where supported to minimize cable connection problems, running monthly parity checks, and running parity checks immediately before and after any maintenance inside the computer.

I guess now would be a good time to mention that I had recently moved the server and opened the case! I'm half afraid to open it again to check the cables! Seriously though, I think your post is spot on, and these problems can be very confusing for any non-Windows OS based n00b like myself! This is a great forum and there's no way I would've been able to get my server running without it! Thanks again guys!
January 1, 201016 yr I really like all the advice about cabling issues above, and I'll probably try to incorporate some of it into the Wiki, so I don't want to disparage it in any way, but there is no evidence of cable related errors here, just ordinary bad sectors. The telltale evidence is highlighted below:

kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata4.00: irq_stat 0x40000001
kernel: ata4.00: cmd c8/00:b0:c7:47:00/00:00:00:00:00/e0 tag 0 dma 90112 in
kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
kernel: ata4.00: status: { DRDY ERR }
kernel: ata4.00: error: { UNC }
kernel: ata4.00: configured for UDMA/133
kernel: ata4: EH complete

UNC indicates an UNCorrectable sector, that is, the error correction stored with the sector could not successfully recover the exact original contents of the sector. The SMART report does indicate it in the following line, which shows 32 total uncorrectable sectors found in the history of this drive (I believe):

187 Reported_Uncorrect 0x0032 068 068 000 Old_age Always - 32

The SMART report shows an ATA Error Count of 29, with the last 5 displayed, and all 5 are related to UNC sectors. The syslog reports a read error in sector 18516, which corresponds to 3 of the ATA errors shown in the SMART report. The fact that there are no remapped or pending sectors is very good news.

There are actually 2 kinds of read errors, although both are reported the same way to us, at least at the application level. They both result in a failure to access the data stored in a particular sector. But to avoid confusion, it is good to understand the difference between the two, especially since we unRAID users are used to examining SMART reports, which report info from lower levels than other computer users see. The two types of read errors (also known as bad sectors) can be referred to as soft and hard errors.

A hard error is when the magnetic media surface below a sector can no longer be relied on to store data.
If working correctly, the magnetic particles should be able to store and maintain, over reasonable conditions, the same polarity or state that was last written to them. If they become too weak to hold a state, then the original data cannot be read, and new data cannot be written to the sector and read back correctly. The sector of a hard error must be remapped, that is, replaced with a new good sector from the spares available.

A soft error is when the magnetic particles are good, and can reliably hold their state no matter how they are tested, but the current sector data is too corrupted to be corrected by the ECC data stored with it. This could occur if there was a random power spike while the head was writing to the sector, or more commonly if power was lost during a write to a sector. I believe this is what happened to you: one or more sectors were affected by a power glitch that damaged the data stored in those sectors. So once the data is reconstructed and written back to the sector, the problem is completely fixed. The physical sector surface was good, and once good data was stored there, the sector is no longer bad.

There is one item in the SMART report that does concern me, and I would recommend monitoring it.

7 Seek_Error_Rate 0x000f 049 049 030 Pre-fail Always - 171802258884

The number to the far right can be ignored. The normalized Seek_Error_Rate value for recent Seagates has generally started out and stayed in the 60's. Why it isn't near 100 I don't know, but this is characteristic of recent Seagates. I don't think I have seen a Seagate 2TB drive before, so perhaps 49 is OK for them, but it seems very low to me, especially for such a new drive, and especially when the failure threshold is 30. I would keep a close eye on it. And that huge number to the right does look awfully big! Without any experience as to what that number *should* be though, or how it should be interpreted, I would not give it too much weight.
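To make the reading of the syslog entry above a little more concrete, here is a small Python sketch that decodes the "res" line. It assumes the first six hex bytes are, in order, status, error, sector count, and the low, middle, and high LBA bytes, which matches how the kernel's own decoded "status" and "error" lines line up with it in this log. The function name `decode_res` is invented for the example, and the sketch ignores the remaining bytes, which are all zero here.

```python
import re

# Error-register bits as the kernel names them when it prints "error: { ... }".
ERR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# status/error:nsect:lbal:lbam:lbah/...  (assumed field order, see lead-in)
RES_RE = re.compile(
    r"res ([0-9a-f]{2})/([0-9a-f]{2}):([0-9a-f]{2})"
    r":([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2})/"
)


def decode_res(line):
    """Return status, error flags, and LBA from a libata 'res' line, or None."""
    m = RES_RE.search(line)
    if m is None:
        return None
    status, error, _nsect, lbal, lbam, lbah = (int(x, 16) for x in m.groups())
    flags = [name for bit, name in ERR_BITS.items() if error & bit]
    # Only the low 24 bits are assembled; the upper bytes are zero in this log.
    lba = (lbah << 16) | (lbam << 8) | lbal
    return {"status": status, "error": error, "flags": flags, "lba": lba}


line = "kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)"
info = decode_res(line)
print(info["flags"], info["lba"])
```

For the line from this thread the decoded flag is UNC and the LBA is 18516, the same sector the SMART error log reports. An ICRC flag instead of UNC would point toward the interface or cabling rather than the media.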
January 1, 201016 yr Author Rob, thanks for the very insightful post. After replacing the cable, SATA port on MB, backplane SATA port etc., I'm continuing to get handle stripe read errors when writing to this particular drive. Below is an updated smart report for my sdc "problem" drive as well as for 2 other WD 2 TB drives in my array (for comparison). I've also attached a syslog from yesterday that shows the frequency of errors is increasing. I would appreciate your feedback and any course of action I can or should be taking at this point. Thanks again!
January 1, 201016 yr The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive.

I see, however, that there are no re-allocated sectors. This might indicate that these are all simply "soft" errors, and that they are being corrected as the "read" failures cause the sectors to be re-written based on the other disks in the array.

As an alternative to replacing the drive, you might see if the failures eventually slow down (as it would not need to re-write the same sector twice).

Or, you could speed things up by stopping the array, un-assigning the drive that is failing, starting the array without it, then stopping it once more, re-assigning it, and re-starting it, having it completely write the contents back onto the failing drive. (Basically it now thinks you replaced the drive with a new one.) Since all the errors seem to be soft errors, it might fix itself.

That will cause you to be without parity protection during the time the rebuild takes. So if there are any critical files on your server, make copies of them on multiple disks, just in case a different disk fails too.

Thinking about it a bit more, you could perform almost the same re-write by simply moving the files on the failing disk to a different disk, and then moving them back. This would re-write everything while keeping parity protection.

Joe L.
January 1, 201016 yr Author The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive. Since the drive is fairly new, are these errors something that would warrant an RMA to Seagate? I'm trying to figure out what constitutes a "failed" drive. It would seem in this case that I don't "yet" have good cause to be able to return it?
January 1, 201016 yr

The solution, if you continue to get un-correctable media failures while reading the drive, is to replace the drive.

Since the drive is fairly new, are these errors something that would warrant an RMA to Seagate? I'm trying to figure out what constitutes a "failed" drive. It would seem in this case that I don't "yet" have good cause to be able to return it?

If they continue, yes, an RMA is in order, even if the SMART test does not indicate an imminent failure. Just include a printout of all the "media" read errors you are getting when you return the drive, if they request proof.
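One possible way to gather that printout is to filter the syslog for the lines the kernel tags as "(media error)" and tally them per day. The Python sketch below works on syslog text passed in as a string; the function names are made up for this example, and the sample lines are copied from the first post in this thread.

```python
from collections import Counter


def media_error_lines(syslog_text):
    """Keep only the lines the kernel marked as media errors."""
    return [ln for ln in syslog_text.splitlines() if "(media error)" in ln]


def errors_per_day(lines):
    """Count media-error lines per 'Mon DD' syslog date prefix."""
    return Counter(" ".join(ln.split()[:2]) for ln in lines)


sample = """\
Dec 24 14:11:15 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
Dec 24 14:11:15 Tower kernel: ata4.00: status: { DRDY ERR }
Dec 24 14:11:19 Tower kernel: res 51/40:00:54:48:00/00:00:00:00:00/00 Emask 0x9 (media error)
"""

hits = media_error_lines(sample)
per_day = errors_per_day(hits)
for ln in hits:
    print(ln)
print(dict(per_day))
```

Running the same two functions over the full syslog text would give both the raw lines to print and a simple per-day count showing whether the errors are becoming more frequent.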