Drive failed preclear? Manufacturer diagnostics say it's fine.


magn2o

So I attempted to preclear my intended cache drive and, much to my surprise, it (apparently?) failed. I was hoping someone a bit more familiar with the preclear output could help me decipher what exactly failed in this case. The SAMSUNG diagnostics tool passed the drive (meaning no RMA will be issued for it), so I'm at a loss. Any insight would be appreciated!

 

Here's the preclear_disk.sh output:

===========================================================================

=                unRAID server Pre-Clear disk /dev/sde

=                      cycle 1 of 1

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Testing if the clear has been successful.    DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 32C, Elapsed Time:  20:51:02

============================================================================

==

== Disk /dev/sde has NOT been precleared successfully

== skip=30600 count=200 returned 02048 instead of 00000

== skip=66400 count=200 returned 02048 instead of 00000

== skip=66600 count=200 returned 02048 instead of 00000

============================================================================

 

Here is the relevant syslog information:

Jul  2 12:47:51 Tower preclear_disk-finish[29161]: smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Home page is http://smartmontools.sourceforge.net/
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: === START OF INFORMATION SECTION ===
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Device Model:     SAMSUNG HD103SJ
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Serial Number:    S246J90Z185959
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Firmware Version: 1AJ100E4
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: User Capacity:    1,000,204,886,016 bytes
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Device is:        In smartctl database [for details use: -P show]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ATA Version is:   8
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ATA Standard is:  Not recognized. Minor revision code: 0x28
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Local Time is:    Fri Jul  2 12:47:50 2010 GMT+8
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART support is: Available - device has SMART capability.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART support is: Enabled
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: === START OF READ SMART DATA SECTION ===
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART overall-health self-assessment test result: PASSED
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: General SMART Values:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Offline data collection status:  (0x00)^IOffline data collection activity
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^Iwas never started.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^IAuto Offline Data Collection: Disabled.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Self-test execution status:      (   0)^IThe previous self-test routine completed
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^Iwithout error or no self-test has ever 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^Ibeen run.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Total time to complete Offline 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: data collection: ^I^I (9420) seconds.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Offline data collection
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: capabilities: ^I^I^I (0x5b) SMART execute Offline immediate.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^IAuto Offline data collection on/off support.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISuspend Offline collection upon new
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^Icommand.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^IOffline surface scan supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISelf-test supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^INo Conveyance Self-test supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISelective Self-test supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART capabilities:            (0x0003)^ISaves SMART data before entering
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^Ipower-saving mode.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISupports SMART auto save timer.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Error logging capability:        (0x01)^IError logging supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^IGeneral Purpose Logging supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Short self-test routine 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: recommended polling time: ^I (   2) minutes.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Extended self-test routine
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: recommended polling time: ^I ( 157) minutes.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SCT capabilities: ^I       (0x003f)^ISCT Status supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISCT Feature Control supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^I^I^I^I^ISCT Data Table supported.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART Attributes Data Structure revision number: 16
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Vendor Specific SMART Attributes with Thresholds:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   3 Spin_Up_Time            0x0023   071   070   025    Pre-fail  Always       -       8927
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       45
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3094
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       7
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       62
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART Error Log Version: 1
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ATA Error Count: 3
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^ICR = Command Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^IFR = Features Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^ISC = Sector Count Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^ISN = Sector Number Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^ICL = Cylinder Low Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^ICH = Cylinder High Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^IDH = Device/Head Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^IDC = Device Command Register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^IER = Error register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: ^IST = Status register [HEX]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Powered_Up_Time is measured from power on, and printed as
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Error 3 occurred at disk power-on lifetime: 3052 hours (127 days + 4 hours)
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   When the command that caused the error occurred, the device was active or idle.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   After command completion occurred, registers were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ER ST SC SN CL CH DH
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- --
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   84 51 00 00 00 00 a0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   Commands leading to the command that caused the error were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- -- --  ----------------  --------------------
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ec 00 00 00 00 00 a0 08      00:00:00.369  IDENTIFY DEVICE
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   00 00 01 01 00 00 40 08      00:00:00.369  NOP [Abort queued commands]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   00 00 01 01 00 00 40 08      00:00:00.369  NOP [Abort queued commands]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   35 00 08 3f 00 c4 e0 08      00:00:00.368  WRITE DMA EXT
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   35 00 08 3f 00 c0 e0 08      00:00:00.368  WRITE DMA EXT
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Error 2 occurred at disk power-on lifetime: 42 hours (1 days + 18 hours)
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   When the command that caused the error occurred, the device was active or idle.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   After command completion occurred, registers were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ER ST SC SN CL CH DH
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- --
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   84 51 20 00 00 00 e0  Error: ICRC, ABRT 32 sectors at LBA = 0x00000000 = 0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   Commands leading to the command that caused the error were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- -- --  ----------------  --------------------
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   c8 00 20 00 00 00 e0 00      00:00:00.846  READ DMA
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   c8 00 18 20 00 00 e0 00      00:00:00.845  READ DMA
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   c8 00 20 50 00 00 e0 00      00:00:00.845  READ DMA
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   c8 00 08 48 00 00 e0 00      00:00:00.845  READ DMA
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   c8 00 08 b8 00 00 e0 00      00:00:00.845  READ DMA
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Error 1 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   When the command that caused the error occurred, the device was active or idle.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   After command completion occurred, registers were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ER ST SC SN CL CH DH
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- --
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   84 51 00 00 00 00 a0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   Commands leading to the command that caused the error were:
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   -- -- -- -- -- -- -- --  ----------------  --------------------
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ec 00 00 00 00 00 a0 00      00:00:01.931  IDENTIFY DEVICE
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   00 00 01 01 00 00 00 00      00:00:01.931  NOP [Abort queued commands]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   00 00 01 01 00 00 00 00      00:00:01.931  NOP [Abort queued commands]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   ec 00 00 00 00 00 a0 00      00:00:01.926  IDENTIFY DEVICE
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   00 00 01 01 00 00 40 00      00:00:01.925  NOP [Abort queued commands]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART Self-test log structure revision number 1
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: No self-tests have been logged.  [To run self-tests, use: smartctl -t]
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: SMART Selective self-test log data structure revision number 0
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Warning: ATA Specification requires selective self-test log data structure revision number = 1
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:     1        0        0  Completed [00% left] (0-65535)
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:     2        0        0  Not_testing
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:     3        0        0  Not_testing
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:     4        0        0  Not_testing
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:     5        0        0  Not_testing
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: Selective self-test flags (0x0):
Jul  2 12:47:51 Tower preclear_disk-finish[29161]:   After scanning selected spans, do NOT read-scan remainder of disk.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: If Selective self-test is pending on power-up, resume after 0 minute delay.
Jul  2 12:47:51 Tower preclear_disk-finish[29161]: 

That error indicates that three different ranges of 200 blocks, all written with zeros, did not return all zeros when read back.

 

These errors are often the most elusive: they do not show up in diagnostics, yet they cause random parity errors because the data written cannot be read back properly, and the drive gives no other indication.

 

Basically

skip=30600 count=200 returned 02048

skip=66400 count=200 returned 02048

skip=66600 count=200 returned 02048

All should have returned zero. 

 

They all ended up with the same checksum, so odds are it is a single bit that is flaky when reading the drive.

In the commands below you will need to replace UUU in

bs=UUU

with the correct "Unit size" for your drive, as my diagnostic output did not print it.  The "bs" value depends on your drive's geometry.

 

You can get the correct value for UUU for a specific disk by typing:

fdisk -l /dev/sde | grep Units | awk '{ print $9}'

 

Running it on one of my disks returns a unit size of 8225280 bytes, so in the following commands I'd use

bs=8225280

As I said, each disk is different, so you need to use fdisk to learn your disk's "Units" size.
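The fdisk one-liner above picks the ninth field, which matches a "Units" line of the form "Units = cylinders of 16065 * 512 = 8225280 bytes". As a hedged sketch (units_bytes is a made-up helper name, and it assumes the Units line ends in "<number> bytes"), taking the field just before the trailing "bytes" avoids depending on the exact field position:

```shell
# Print the unit size in bytes from "fdisk -l" output read on stdin.
# Assumption: the Units line ends with "<number> bytes".
units_bytes() {
    awk '/^Units/ { print $(NF-1); exit }'
}

# Example:
#   fdisk -l /dev/sde | units_bytes
```

Keep in mind that dd counts skip= in blocks of bs bytes, so the skip values from the preclear report only point at the same place on the disk when bs is the unit size the report was based on.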

 

You can then re-run these three commands to see if the data read is always bad, or just sometimes bad:

dd if=/dev/sde bs=UUU count=200 skip=30600 conv=noerror 2>/dev/null | sum | awk '{print $1}'

dd if=/dev/sde bs=UUU count=200 skip=66400 conv=noerror 2>/dev/null | sum | awk '{print $1}'

dd if=/dev/sde bs=UUU count=200 skip=66600 conv=noerror 2>/dev/null | sum | awk '{print $1}'

They should only take a few seconds to run.
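To check whether the bad reads are intermittent, the same pipeline can be repeated in a loop. This is only a sketch: reread_sums is a hypothetical helper name, and the default of five passes is arbitrary.

```shell
# Re-read one block range several times and print the checksum of each pass.
# Usage: reread_sums DEVICE BS SKIP COUNT [PASSES]
reread_sums() {
    dev=$1; bs=$2; skip=$3; count=$4; passes=${5:-5}
    i=1
    while [ "$i" -le "$passes" ]; do
        # Same pipeline as above; 00000 means every byte read back as zero.
        dd if="$dev" bs="$bs" count="$count" skip="$skip" conv=noerror 2>/dev/null |
            sum | awk '{print $1}'
        i=$((i + 1))
    done
}

# Example (substitute your own unit size for UUU):
#   reread_sums /dev/sde UUU 30600 200 | sort | uniq -c
```

Repeat reads may be served from the kernel's cache rather than from the platters, so identical results across passes do not prove much; results that vary between passes are the telling ones.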

 

It really does not matter if the reads return a checksum of zeros now (although it will be interesting to see).  The preclear failure already indicates that the drive will probably cause intermittent parity errors at random addresses when you perform a parity check, with no other symptoms or errors in the system log to tell you what is happening, or even which drive is causing the inconsistent parity results.  These are the most difficult problems to find and fix, since you have no clue which drive is defective.

 

If the three reads do still fail, try the following versions, which use the octal-dump command "od" to print the actual values returned (the -x option prints them in hexadecimal).  It might be a single bit that is always incorrect:

dd if=/dev/sde bs=UUU count=200 skip=30600 conv=noerror 2>/dev/null| od -x

dd if=/dev/sde bs=UUU count=200 skip=66400 conv=noerror 2>/dev/null| od -x

dd if=/dev/sde bs=UUU count=200 skip=66600 conv=noerror 2>/dev/null| od -x

 

Again, you'll need to use the correct value for "UUU"
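As an alternative to reading the od dump by eye, cmp -l can list just the bytes that differ from zero. A minimal sketch, assuming a cmp that accepts - for standard input (GNU and BSD cmp do); nonzero_bytes is a hypothetical helper name:

```shell
# List every non-zero byte in a block range.
# Usage: nonzero_bytes DEVICE BS SKIP COUNT
# Each output line: 1-based byte offset within the range (decimal),
# the byte read from the disk (octal), and the expected 0.
nonzero_bytes() {
    dd if="$1" bs="$2" skip="$3" count="$4" conv=noerror 2>/dev/null |
        cmp -l - /dev/zero 2>/dev/null
}

# Example (substitute your own unit size for UUU):
#   nonzero_bytes /dev/sde UUU 30600 200
```

No output means every byte in the range read back as zero.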

 

You'll need to argue this one with Samsung: if you can state that zeros written to the drive return non-zero results, they should RMA it.  Remember, it may be intermittent, and it only happened three times in the course of clearing your entire drive.

 

Joe L.
