wsume99 Posted September 27, 2010

I went back and did some more analysis of my syslog. On the first preclear run all of the errors were isolated to two sectors (1682918047 and 1683029711). It's a little strange that the post-clear SMART report showed a raw value of 3 current pending sectors. The second preclear run produced errors on five sectors (1682918047, 1683018743, 1683029711, 1683035192, and 1683062151). So the two suspect sectors from the first pass repeated and three new ones were added. This time the post-clear SMART report showed a raw value of 7 current pending sectors. Again, I'm not sure why it says 7 when there were errors on only 5.

The third preclear run is not complete yet (about 50% through the post-read process), but so far errors have been reported against four sectors (1683018743, 1683035191, 1683062152, and 1683090639). No errors have been reported yet against the repeats from the first two runs; maybe they've been reassigned. Also one repeat of a new sector from the second run and then three new sectors. But the funny thing is that the SMART report at the beginning of the run showed 8 current pending sectors when it only reported 7 at the end of the last run. How can that be? This drive is not assigned to my array, so it should not be read from. How could more pending sectors be identified between preclear runs?

Another strange occurrence: before starting the preclear process I used the WDIDLE3 utility to disable the head parking feature on this drive. I used the WDIDLE3 /D command and got a response that the head park time was set to something like 64.7 minutes, IIRC. Then I tried the WDIDLE3 /S0 command and got a response that said head parking was disabled. Well, looking at the results, it is clearly not disabled. The load cycle count started the first run at 8 and ended it at 9. In the hour and 25 minutes between preclear runs the load cycle count went from 9 to 96. It ended the second run at 97. In the 3 hours and 17 minutes between the 2nd and 3rd runs it went from 97 to 295. So clearly it has not been disabled even though it reported that it was. I'm really beginning to not like this drive.
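For anyone who wants to check the arithmetic on those load cycle counts, here is a minimal Python sketch. It only uses the counts and idle times quoted in the post; the function name is made up for illustration, and it says nothing about why the heads were parking.

```python
def cycles_per_hour(start_count, end_count, minutes):
    """Average load cycles per hour over an idle gap of the given length."""
    return (end_count - start_count) * 60.0 / minutes


# (label, count at start of gap, count at end of gap, gap length in minutes)
idle_gaps = [
    ("between runs 1 and 2", 9, 96, 85),    # 1 hour 25 minutes
    ("between runs 2 and 3", 97, 295, 197), # 3 hours 17 minutes
]

for label, start, end, minutes in idle_gaps:
    rate = cycles_per_hour(start, end, minutes)
    print(f"{label}: {end - start} load cycles in {minutes} min, about {rate:.1f} per hour")
```

Both gaps come out at roughly one load cycle per minute while the drive sat idle, compared with a single load cycle during each of the first two preclear runs.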
BRiT Posted September 27, 2010

I can't be certain, but you might need a complete power cycle (off, then on) for the setting to take effect. That's the case for firmware updates. Maybe that setting is the same?
wsume99 Posted September 27, 2010

I'm pretty sure, in fact I'm positive, that I did a power cycle. I know I did because I put the DOS bootable USB drive in the place of my unRAID USB drive to change the setting. After changing the IDLE3 setting I powered down the server, swapped USB drives, then powered up in unRAID. But just to be sure I'm going to check the setting again tonight. If that doesn't work I'll try the WDIDLE3 /D command and see how that works. I tried that the first time but it did not report that head parking was disabled (as the command is supposed to), just that it was a really long time (~64 minutes).
Joe L. Posted September 28, 2010

wsume99 wrote: "So the two suspect sectors from the first pass repeated and three new ones were added."

The same sectors repeating would, to me, indicate a physical issue with the sector, not a noise-sensitivity issue as I first suspected. Those sectors apparently should be re-allocated, but the drive keeps re-using them since it was able to read them when the write phase occurs. They are bad, but not bad enough.... argh......
wsume99 Posted September 28, 2010

Well, the third run is complete and there were no additional errors reported, so a total of four sectors had errors. Ironically, the current pending sector count went down from 8 to 2. Still no reallocated sectors. Maybe my prayers to the HDD gods are being answered. I swapped the power connection with one from another drive just to see what happens.

I also checked and, I don't know how, but the IDLE3 timer was set back to the factory default of 8 seconds. That would explain why the load cycle counts were incrementing between preclear cycles. I again disabled it using the WDIDLE3 /S0 command, rebooted, and verified with WDIDLE3 /R that head parking was indeed disabled. Perhaps I do have a power issue, as the load cycle count went up by two during the last preclear cycle. On the previous two runs it only changed by one, which is what I would expect. I launched another preclear cycle and I'm interested to see what happens with this one. I'll find out in about 30 hours.
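Comparing raw values by eye between runs gets tedious, so here is a rough Python sketch of pulling the raw values out of a saved smartctl attribute table and listing what changed. The sample rows are copied from the Hitachi report posted in this thread; the function names are made up for illustration, and the parsing assumes the ten-column layout shown in that report, which may differ on other smartctl versions.

```python
# Sample rows taken from the smartctl report posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Lifetime Min/Max 25/36)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
"""


def raw_values(report_text):
    """Map attribute name to integer raw value for every attribute row found."""
    values = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip this row
    return values


def changed(before, after, names):
    """Return {name: (before, after)} for the named attributes that differ."""
    return {n: (before.get(n), after.get(n))
            for n in names if before.get(n) != after.get(n)}


print(raw_values(SAMPLE))

# The counts reported above for the third run: pending went from 8 to 2,
# reallocated stayed at 0.
start_of_run = {"Current_Pending_Sector": 8, "Reallocated_Sector_Ct": 0}
end_of_run = {"Current_Pending_Sector": 2, "Reallocated_Sector_Ct": 0}
print(changed(start_of_run, end_of_run,
              ["Current_Pending_Sector", "Reallocated_Sector_Ct"]))
```

Only the attributes that actually moved are reported, which makes it easier to spot a pending count that changes between runs when nothing should be touching the drive.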
SSD Posted September 30, 2010

Thought these unusual results might be of interest to Joe and/or others. I ran a preclear on a disk in a port I had not used before on my backplane. The drive seemed to be recognized, but I noticed that SMART reports were failing (see bolded section below) while the disk was being precleared. Sometimes it worked, sometimes it got this error. It was moving along at a good clip and finished this morning. unRAID is reporting non-zero values on the drive. Finding issues like this is why we run preclear scripts! I am going to experiment further to see if I have a loose cable or something, or if the drive itself is bad.

Question about the non-zero values ... would preclear continue to search the entire drive before reporting non-zero values on the drive, or stop immediately when it hit one? Since it appeared to go all the way through the entire disk, is there any way to know how many or where these non-zero values are?

Thanks Joe for this great tool. Saved me from a nightmare if this had been added to the array!

Sep 30 07:54:32 Tower preclear_disk-finish[6363]: smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Home page is http://smartmontools.sourceforge.net/ Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: === START OF INFORMATION SECTION === Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Device Model: Hitachi HDS722020ALA330 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Serial Number: JK11A5YAKDWW3X Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Firmware Version: JKAOA3EA Sep 30 07:54:32 Tower preclear_disk-finish[6363]: User Capacity: 2,000,398,934,016 bytes Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Device is: Not in smartctl database [for details use: -P showall] Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ATA Version is: 8 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ATA Standard is: ATA-8-ACS revision 4 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Local Time is: Thu Sep 30 07:54:31 2010 EDT Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART support is: Available - device has SMART capability. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART support is: Enabled Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: === START OF READ SMART DATA SECTION === Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART overall-health self-assessment test result: PASSED Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: General SMART Values: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Offline data collection status: (0x84)^IOffline data collection activity Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^Iwas suspended by an interrupting command from host. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^IAuto Offline Data Collection: Enabled. 
Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Self-test execution status: ( 0)^IThe previous self-test routine completed Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^Iwithout error or no self-test has ever Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^Ibeen run. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Total time to complete Offline Sep 30 07:54:32 Tower preclear_disk-finish[6363]: data collection: ^I^I (23212) seconds. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Offline data collection Sep 30 07:54:32 Tower preclear_disk-finish[6363]: capabilities: ^I^I^I (0x5b) SMART execute Offline immediate. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^IAuto Offline data collection on/off support. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISuspend Offline collection upon new Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^Icommand. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^IOffline surface scan supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISelf-test supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^INo Conveyance Self-test supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISelective Self-test supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART capabilities: (0x0003)^ISaves SMART data before entering Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^Ipower-saving mode. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISupports SMART auto save timer. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Error logging capability: (0x01)^IError logging supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^IGeneral Purpose Logging supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Short self-test routine Sep 30 07:54:32 Tower preclear_disk-finish[6363]: recommended polling time: ^I ( 1) minutes. 
Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Extended self-test routine Sep 30 07:54:32 Tower preclear_disk-finish[6363]: recommended polling time: ^I ( 255) minutes. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SCT capabilities: ^I (0x003d)^ISCT Status supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISCT Feature Control supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ^I^I^I^I^ISCT Data Table supported. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART Attributes Data Structure revision number: 16 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Vendor Specific SMART Attributes with Thresholds: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 33 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 192 
Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART Error Log Version: 0 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: No Errors Logged Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART Self-test log structure revision number 1 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: No self-tests have been logged. [To run self-tests, use: smartctl -t] Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SMART Selective self-test log data structure revision number 1 Sep 30 07:54:32 Tower preclear_disk-finish[6363]: SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 1 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 2 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 3 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 4 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-finish[6363]: 5 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Selective self-test flags (0x0): Sep 30 07:54:32 Tower preclear_disk-finish[6363]: After scanning selected spans, do NOT read-scan remainder of disk. 
Sep 30 07:54:32 Tower preclear_disk-finish[6363]: If Selective self-test is pending on power-up, resume after 0 minute delay. Sep 30 07:54:32 Tower preclear_disk-finish[6363]: Sep 30 07:54:32 Tower preclear_disk-diff[6376]: ============================================================================ Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Disk /dev/sdb has NOT been successfully precleared Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Postread detected un-expected non-zero bytes on disk== Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Ran 1 preclear-disk cycle Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Using :Read block size = 8225280 Bytes Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Last Cycle's Pre Read Time : 6:34:23 (84 MB/s) Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Last Cycle's Zeroing time : 5:45:35 (96 MB/s) Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Last Cycle's Post Read Time : 20:32:46 (27 MB/s) Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Last Cycle's Total Time : 32:53:57 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Total Elapsed Time 32:53:57 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Disk Start Temperature: 34C Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Current Disk Temperature: 32C, Sep 30 07:54:32 Tower preclear_disk-diff[6376]: == Sep 30 07:54:32 Tower preclear_disk-diff[6376]: ============================================================================ Sep 30 07:54:32 Tower preclear_disk-diff[6376]: S.M.A.R.T. 
error count differences detected after pre-clear Sep 30 07:54:32 Tower preclear_disk-diff[6376]: note, some 'raw' values may change, but not be an indication of a problem Sep 30 07:54:32 Tower preclear_disk-diff[6376]: 15,25c15,85 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < Error SMART Status command failed Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < Please get assistance from http://smartmontools.sourceforge.net/ Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < Register values returned from SMART Status command are: Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < ST =0x50 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < ERR=0x00 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < NS =0x08 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < SC =0xa0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < CL =0x88 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < CH =0xe0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < SEL=0x40 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: < A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: --- Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > === START OF READ SMART DATA SECTION === Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART overall-health self-assessment test result: PASSED Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > General SMART Values: Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Offline data collection status: (0x84)^IOffline data collection activity Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^Iwas suspended by an interrupting command from host. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^IAuto Offline Data Collection: Enabled. 
Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Self-test execution status: ( 0)^IThe previous self-test routine completed Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^Iwithout error or no self-test has ever Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^Ibeen run. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Total time to complete Offline Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > data collection: ^I^I (23212) seconds. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Offline data collection Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > capabilities: ^I^I^I (0x5b) SMART execute Offline immediate. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^IAuto Offline data collection on/off support. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISuspend Offline collection upon new Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^Icommand. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^IOffline surface scan supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISelf-test supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^INo Conveyance Self-test supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISelective Self-test supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART capabilities: (0x0003)^ISaves SMART data before entering Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^Ipower-saving mode. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISupports SMART auto save timer. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Error logging capability: (0x01)^IError logging supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^IGeneral Purpose Logging supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Short self-test routine Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > recommended polling time: ^I ( 1) minutes. 
Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Extended self-test routine Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > recommended polling time: ^I ( 255) minutes. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SCT capabilities: ^I (0x003d)^ISCT Status supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISCT Feature Control supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ^I^I^I^I^ISCT Data Table supported. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART Attributes Data Structure revision number: 16 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Vendor Specific SMART Attributes with Thresholds: Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 193 
Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART Error Log Version: 0 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > No Errors Logged Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART Self-test log structure revision number 1 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > No self-tests have been logged. [To run self-tests, use: smartctl -t] Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SMART Selective self-test log data structure revision number 1 Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 1 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 2 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 3 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 4 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > 5 0 0 Not_testing Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Selective self-test flags (0x0): Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > After scanning selected spans, do NOT read-scan remainder of disk. Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > If Selective self-test is pending on power-up, resume after 0 minute delay. 
Sep 30 07:54:32 Tower preclear_disk-diff[6376]: > Sep 30 07:54:32 Tower preclear_disk-diff[6376]: ============================================================================ Sep 30 07:54:32 Tower preclear_disk-diff[6376]:
Joe L. Posted September 30, 2010

SSD wrote: "I am going to experiment further to see if I have a loose cable or something, or if the drive itself is bad."

Unlikely to be a cabling issue, but I can't predict what an intermittent connection would do.

SSD wrote: "Question about the non-zero values ... would preclear continue to search the entire drive before reporting non-zero values on the drive, or stop immediately when it hit one?"

It continues to the end.

SSD wrote: "Since it appeared to go all the way through the entire disk, is there any way to know how many or where these non-zero values are?"

Yes, in the /tmp directory you'll find a file named /tmp/postread_errors$disk_basename, where disk_basename is your disk under test. Check it out for the specific blocks and offsets. The test for non-zero bytes is fairly crude: it is just a sum of all the values returned when it reads a block of data. (The block size is set to the size, in bytes, of a cylinder as reported by fdisk -l /dev/sdX.) If the sum of the bytes is zero, then all the bytes read in that set of blocks were zero. I do not know which specific byte/sector was non-zero.

SSD wrote: "Thanks Joe for this great tool. Saved me from a nightmare if this had been added to the array!"

Yes, disks that occasionally return random values show up as parity errors when parity is checked, but unless you are doing a NOCORRECT check, the check then also modifies parity to reflect the bad data reported from the drives. They cause hair loss, because you'll pull your hair out trying to figure out the cause of the random parity errors. (You'll have no idea which disk is the cause, because the drives do not report these as errors; they think they are reading the platter correctly.)

If it is not an obviously loose cable, RMA the drive.

Joe L.
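To illustrate the sum-of-bytes idea described above, here is a minimal Python sketch. It is not the preclear script's own code, just the same principle: bytes are unsigned, so a block sums to zero only if every byte in it is zero. The function name, the tiny 512-byte block size, and the in-memory "disk" are assumptions made for the example.

```python
import io


def find_nonzero_blocks(stream, block_size):
    """Yield (block_index, byte_sum) for each block whose bytes do not sum to zero."""
    index = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break  # end of stream
        total = sum(block)  # iterating bytes yields unsigned ints 0..255
        if total != 0:
            yield index, total
        index += 1


# A small in-memory stand-in for a zeroed disk: four 512-byte blocks,
# with one stray non-zero byte planted in the third block (index 2).
fake_disk = bytearray(4 * 512)
fake_disk[2 * 512 + 7] = 0x01

bad_blocks = list(find_nonzero_blocks(io.BytesIO(bytes(fake_disk)), 512))
print(bad_blocks)  # [(2, 1)]
```

As with the real check, this tells you which block contained something other than zeros, but not which byte inside it was at fault.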
SSD Posted September 30, 2010

Contents of postread_errorssdb:

skip=135200 count=200 returned instead of 00000
skip=149000 count=200 returned instead of 00000

Also forgot to report: the disk was occasionally reporting in standby (not spinning) while preclear was occurring.
Joe L. Posted September 30, 2010

Returned "" (blank) instead of 00000 is more interesting. Glad you took the time to give me feedback. That would probably indicate the drive did not respond at all. Interesting... It could then be a cabling issue rather than a random errant bit, or a drive that occasionally likes to not respond. (It needs to respond to report its temperature or spin-up/down status.)

Joe L.
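For locating those two entries on the disk, here is a rough Python sketch that parses the postread_errors lines and converts them to byte ranges. It assumes, and this is only an assumption, that skip and count are dd-style values counted in units of the "Read block size" (8225280 bytes) printed in the preclear report earlier in the thread. The regular expression and function name are made up for illustration.

```python
import re

# "Read block size" from the preclear report posted earlier in the thread.
READ_BLOCK_SIZE = 8225280

# Matches entries like "skip=135200 count=200 returned instead of 00000";
# the returned value may be blank, as it was in this case.
ENTRY = re.compile(r"skip=(\d+)\s+count=(\d+)\s+returned\s*(\S*)\s*instead of 00000")


def parse_entries(text, block_size=READ_BLOCK_SIZE):
    """Return one dict per entry with the assumed byte range it covers."""
    entries = []
    for match in ENTRY.finditer(text):
        skip, count = int(match.group(1)), int(match.group(2))
        entries.append({
            "skip": skip,
            "count": count,
            "returned": match.group(3),         # "" means nothing came back
            "start_byte": skip * block_size,    # assumes skip is in read blocks
            "length_bytes": count * block_size, # assumes count is in read blocks
        })
    return entries


sample = ("skip=135200 count=200 returned instead of 00000 "
          "skip=149000 count=200 returned instead of 00000")

for entry in parse_entries(sample):
    print(entry)
```

If the assumption about units holds, both ranges fall well inside the 2,000,398,934,016-byte capacity the drive reports, each covering 200 read blocks.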
SSD Posted September 30, 2010

Ok - so I decided to run a short and a long SMART test. The short test ran and completed. The long test seemed to start, but when I checked on progress, it seemed as though it had forgotten about the request ...

root@Tower:~# smartctl -d ata -tlong /dev/sdb
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Thu Sep 30 16:01:42 2010
Use smartctl -X to abort test.

<about 20 minutes passed>

root@Tower:~# smartctl -a -d ata /dev/sdb
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS722020ALA330
Serial Number: JK11A5YAKDWW3X
Firmware Version: JKAOA3EA
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Sep 30 11:58:57 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity was aborted by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining.
Total time to complete Offline data collection: (23212) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 131 131 054 Pre-fail Offline - 109
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 121 121 020 Pre-fail Offline - 35
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 37
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Lifetime Min/Max 25/36)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 37 -

<Shouldn't there be a row here saying the Long test was running??>

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Joe L. Posted September 30, 2010

It says it is running, up at the top of the report:

Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
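For anyone wondering how the "( 249)" in that status line relates to "in progress" and "90%", here is a minimal sketch in Python. It assumes the usual ATA SMART layout of the self-test execution status byte: the upper four bits hold the status code (15 while a test is running) and the lower four bits hold the remaining portion of the test in tenths. The function and constant names are made up for illustration and are not part of smartctl.

```python
from typing import Tuple

# Status code reported in the upper nibble while a self-test is still running
# (assumption based on the ATA SMART self-test execution status layout).
IN_PROGRESS = 0x0F


def decode_self_test_status(status_byte: int) -> Tuple[int, int]:
    """Split the status byte into (status_code, percent_remaining)."""
    status_code = (status_byte >> 4) & 0x0F          # upper nibble
    percent_remaining = (status_byte & 0x0F) * 10    # lower nibble, in tenths
    return status_code, percent_remaining


if __name__ == "__main__":
    code, remaining = decode_self_test_status(249)   # value from the report
    state = "in progress" if code == IN_PROGRESS else "not running"
    print(f"self-test {state}, {remaining}% remaining")
```

With the value from the report, 249 is 0xF9, which splits into status code 15 and 9 tenths remaining, matching the text smartctl printed.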
SSD Posted September 30, 2010

Thanks! Will check back tonight and see how it did. I am doing a big copy operation on the array right now. Hope it will be done tonight so I can take the server down and check all of the connections. Trouble is, I don't have a very reliable way to tell if the problem is fixed.
Tight_wad Posted October 1, 2010

I ran a preclear on a Hitachi 2 TB drive yesterday, and towards the end I got this message:

Sorry /dev/sdj mbr was not precleared

or close to that. It made it through all but the final step. This is actually my third time trying to run preclear on this drive. The first time, my system locked up completely: I couldn't HTTP or telnet in, and the console was frozen, so I couldn't get a syslog. I was running 2 preclears at the same time, both on the SATA card ports. I ran it again, this time on just this drive, and it did the same thing. And no syslog, because everything froze.

So I thought the third time's the charm. I moved this drive off the SATA card and onto a motherboard port, thinking that might help. I also ran the preclear with the -n option, like this: preclear_disk.sh -n /dev/sdj. My thinking here was that it always made it up to the end, so if it worked this way, I would run it again. Well, it did and it didn't work. It failed, but this time nothing locked up, so I have a syslog.

I have cut the syslog off at the point where it starts repeating similar info. Otherwise it would be thousands of lines long and about 44 MB. I figure that I have a bad brand-new hard drive, but would like someone to take a look if possible.

syslog.txt
Tight_wad Posted October 1, 2010

I had to add the syslog to the above post.
Joe L. Posted October 1, 2010

Tight_wad wrote: "I ran a preclear on a Hitachi 2 TB drive yesterday, and towards the end I got this message: Sorry /dev/sdj mbr was not precleared, or close to that. It made it through all but the final step. This is actually my third time trying to run preclear on this drive. The first time, my system locked up completely: I couldn't HTTP or telnet in, and the console was frozen, so I couldn't get a syslog. I was running 2 preclears at the same time, both on the SATA card ports. I ran it again, this time on just this drive, and it did the same thing. And no syslog, because everything froze. So I thought the third time's the charm. I moved this drive off the SATA card and onto a motherboard port, thinking that might help. I also ran the preclear with the -n option, like this: preclear_disk.sh -n /dev/sdj. My thinking here was that it always made it up to the end, so if it worked this way, I would run it again. Well, it did and it didn't work. It failed, but this time nothing locked up, so I have a syslog. I have cut the syslog off at the point where it starts repeating similar info. Otherwise it would be thousands of lines long and about 44 MB. I figure that I have a bad brand-new hard drive, but would like someone to take a look if possible."

Since you've already moved the drive from one disk controller to another, that eliminates the disk controller as a possibility.

The drive initially responds when the SMART report is first performed, and then it times out. The OS resets it and tries again, and it still fails to respond. All the subsequent writes to it fail, with errors written to the syslog. Eventually the syslog grows to where it uses all memory, and your server becomes unresponsive, as you've discovered.

The only other possibility, besides the drive itself, would be a poor or intermittent power connection to the drive (try an alternate power connection). Other than that... I'd say RMA the drive.
Be thankful it was discovered before you tried using it in your array.

Oct 1 00:52:28 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x3f80000 SErr 0x80000 action 0x6 frozen
Oct 1 00:52:28 Tower kernel: ata9: SError: { 10B8B }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:98:20:9d:24/04:00:de:00:00/40 tag 19 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:a0:20:a1:24/04:00:de:00:00/40 tag 20 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:a8:20:a5:24/04:00:de:00:00/40 tag 21 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:b0:20:a9:24/04:00:de:00:00/40 tag 22 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:b8:20:ad:24/04:00:de:00:00/40 tag 23 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:c0:20:b1:24/04:00:de:00:00/40 tag 24 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Oct 1 00:52:28 Tower kernel: ata9.00: cmd 61/00:c8:20:b5:24/04:00:de:00:00/40 tag 25 ncq 524288 out
Oct 1 00:52:28 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 00:52:28 Tower kernel: ata9.00: status: { DRDY }
Oct 1 00:52:28 Tower kernel: ata9: hard resetting link
Oct 1 00:52:36 Tower kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 1 00:52:36 Tower kernel: ata9.00: configured for UDMA/133
Oct 1 00:52:36 Tower kernel: ata9.00: device reported invalid CHS sector 0
Oct 1 00:52:36 Tower last message repeated 6 times
Oct 1 00:52:36 Tower kernel: ata9: EH complete
Oct 1 01:00:12 Tower kernel: sd 4:0:0:0: [sdj] Unhandled error code
Oct 1 01:00:12 Tower kernel: sd 4:0:0:0: [sdj] Result: hostbyte=0x00 driverbyte=0x06
Oct 1 01:00:12 Tower kernel: sd 4:0:0:0: [sdj] CDB: cdb[0]=0x2a: 2a 00 de ab 6d 20 00 04 00 00
Oct 1 01:00:12 Tower kernel: end_request: I/O error, dev sdj, sector 3735776544
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972068
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972069
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972070
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972071
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972072
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972073
Oct 1 01:00:12 Tower kernel: lost page write due to I/O error on sdj
Oct 1 01:00:12 Tower kernel: Buffer I/O error on device sdj, logical block 466972074
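The numbers in the syslog excerpt above all describe the same failed write, which is easier to see with a little arithmetic. The sketch below decodes the logged CDB as a SCSI WRITE(10) command (opcode 0x2a, a 4-byte big-endian LBA in bytes 2 to 5, a 2-byte transfer length in bytes 7 and 8). The 512-byte sector size and 4096-byte buffer block size are assumptions; they are the values that make the logged figures agree with each other. The function name is made up for illustration.

```python
from typing import Tuple

SECTOR_BYTES = 512        # assumed logical sector size
BUFFER_BLOCK_BYTES = 4096 # assumed kernel buffer block size


def decode_write10(cdb_hex: str) -> Tuple[int, int]:
    """Return (starting LBA, sector count) from a WRITE(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    sectors = int.from_bytes(cdb[7:9], "big") # bytes 7-8: transfer length
    return lba, sectors


if __name__ == "__main__":
    lba, sectors = decode_write10("2a 00 de ab 6d 20 00 04 00 00")
    print("first sector :", lba)
    print("write size   :", sectors * SECTOR_BYTES, "bytes")
    print("buffer block :", lba * SECTOR_BYTES // BUFFER_BLOCK_BYTES)
```

Under those assumptions the CDB decodes to sector 3735776544 (the "I/O error ... sector" line), a 1024-sector write of 524288 bytes (the "ncq 524288 out" figure), and buffer block 466972068 (the first "Buffer I/O error" line), so the whole excerpt is one timed-out write rather than several unrelated faults.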
Tight_wad Posted October 1, 2010

Joe L. wrote: "Since you've already moved the drive from one disk controller to another, that eliminates the disk controller as a possibility. The drive initially responds when the SMART report is first performed, and then it times out. The OS resets it and tries again, and it still fails to respond. All the subsequent writes to it fail, with errors written to the syslog. Eventually the syslog grows to where it uses all memory, and your server becomes unresponsive, as you've discovered. The only other possibility, besides the drive itself, would be a poor or intermittent power connection to the drive (try an alternate power connection). Other than that... I'd say RMA the drive. Be thankful it was discovered before you tried using it in your array."

I will try it again tonight with a different power connection and see what it does. Right now it is on the last connector of 4 on the Corsair's power cable. Hopefully this is it. Thank you.
Tight_wad Posted October 2, 2010

I changed my power cable to one that I have been using on a drive. I wish I could report that it made the difference, but it didn't. Totally frozen by 96% complete of step 2. Time to RMA this drive.
Joe L. Posted October 2, 2010

Tight_wad wrote: "I changed my power cable to one that I have been using on a drive. I wish I could report that it made the difference, but it didn't. Totally frozen by 96% complete of step 2. Time to RMA this drive."

Sorry you have a bad drive, but better to learn it now than after you added it to your array.
SSD Posted October 2, 2010

SSD wrote: "Thanks! Will check back tonight and see how it did. I am doing a big copy operation on the array right now. Hope it will be done tonight so I can take the server down and check all of the connections. Trouble is, I don't have a very reliable way to tell if the problem is fixed."

The long SMART test ran successfully. I moved the disk to a different controller and ran preclear successfully (so the disk is okay).

The SMART report error I was getting suggested running with the "-T permissive" option. When I added that to the command, it worked. So I think that has more to do with the controller than with the preclear error (I confirmed this on other ports). My other controllers don't require this permissive option. Joe L., you might want to add the permissive option to your unMENU and preclear scripts.

I still don't know why I got those 2 errors preclearing the disk. But I've now reseated all of the cables and plan to continue to run preclear tests.

Here is the error I was seeing, in case someone is searching the forum for it:

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS722020ALA330
Serial Number: JK11A5YAKDBxxx
Firmware Version: JKAOA3EA
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Oct 2 13:52:13 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:
ST =0x50
ERR=0x00
NS =0x00
SC =0xc8
CL =0x43
CH =0x3b
SEL=0x40
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
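A rough illustration of why smartctl gives up on those register values: as far as I understand the ATA SMART RETURN STATUS command, the drive is supposed to answer with one of two fixed signatures in the cylinder-low/cylinder-high registers (the CL and CH lines in the output above). The Python sketch below only classifies a CL/CH pair. The function name is hypothetical, and it is not how smartctl itself is written.

```python
# Signatures the SMART RETURN STATUS command is expected to return in CL/CH
# (assumption: values as defined by the ATA specification).
SIGNATURE_PASSED = (0x4F, 0xC2)    # no attribute threshold exceeded
SIGNATURE_FAILING = (0xF4, 0x2C)   # a threshold has been exceeded


def classify_smart_status(cl: int, ch: int) -> str:
    """Classify the CL/CH registers returned by a SMART status query."""
    if (cl, ch) == SIGNATURE_PASSED:
        return "passed"
    if (cl, ch) == SIGNATURE_FAILING:
        return "threshold exceeded"
    # Anything else is not a valid answer; the registers were probably
    # not passed back intact by the controller or driver.
    return "invalid"


if __name__ == "__main__":
    # CL=0x43, CH=0x3b are the values from the error output above.
    print(classify_smart_status(0x43, 0x3B))
```

The pair 0x43/0x3b matches neither signature, which fits the observation that the same disk reports fine on a different controller: the answer is being mangled on the way back rather than the drive reporting a failure. The '-T permissive' option tells smartctl to carry on despite the failed mandatory command.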
GreggP Posted October 3, 2010

I recently purchased a couple of new 2TB Hitachi drives and tried replacing both my 1TB parity and disk1 drives. I followed the instructions from the "official" unRAID manual, which doesn't include several steps recommended by knowledgeable users like Joe L., including parity checks and running pre-clear on the new drives. After replacing the parity drive everything seemed fine, but after replacing the disk1 data drive, 151 errors were reported for the parity drive. More info is available in this thread: http://lime-technology.com/forum/index.php?topic=8096.msg78237#msg78237

Joe explained how to get my array back to its previous state with the 1TB drives. And, yesterday, I finished pre-clearing both of the new 2TB Hitachi drives. However, before I put them in my array, I wanted to ask a few questions about the results of the pre-clear.

I ran the pre-clears from 2 PuTTY session windows. At the end of the session for one of the disks, there were several errors listed. I don't know how to interpret these errors, so I tried copying and pasting the text to a txt file. Unfortunately, I didn't know that common Windows copy/paste techniques could result in pasting back into the PuTTY window. Afterwards, I also discovered the command for copying everything in the PuTTY window to the clipboard. Because of my sloppy copy/paste mess, it is hard to interpret my pre-clear results. I would get a syslog, but my server (and all other computers in my home) were shut down last night during a very rare power failure in our neighborhood.

I've attached the session text from both of the drives to this message in case someone can decipher these and help me out. The "sda" drive doesn't seem to have any errors. However, the "sdg" drive might have some problems. I'd like to know whether this drive needs to be returned to Newegg, because I want to make sure I do so within the terms of their return policy (not sure if it is 15 or 30 days).

Pre-Clear_session_for_sdg_drive.txt
Pre-Clear_session_for_sda_drive.txt
GreggP Posted October 3, 2010

I also just generated a SMART report that shows 4 sectors have been reallocated and none are pending reallocation. Before running pre-clear on this drive, the SMART report showed 1 reallocated sector with 3 pending reallocation. So it looks like those 3 were reallocated. Does this mean my drive is okay?

SMART_report_for_sdg_drive.txt
original_SMART_report_for_sdg_drive.txt
Joe L. Posted October 3, 2010

Drive /dev/sdg

Prior to the pre-clear there was 1 re-allocated sector.
< 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
---
After the pre-clear, there were 4 re-allocated sectors.
> 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 4

Prior to the pre-clear there was 1 re-allocated "event" (the one sector it had re-allocated).
< 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
Prior to the pre-clear there were 3 sectors pending re-allocation when next written.
< 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 3
---
After the pre-clear there were 4 re-allocated events (the 4 sectors it re-allocated).
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 4
After the pre-clear there are no more sectors pending re-allocation.
> 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0

Notice in all cases the normalized value of 100 is unchanged and nowhere near the "normalized" failure threshold of 5. Since most large disks have several thousand spare sectors, this is expected. There is nothing really wrong with the drive, and they would have every right to consider it to not be failing. An RMA is not in order unless you see the re-allocated sector count growing over the next months/years.

I would run a few more pre-clear cycles on the drive. If the re-allocated sector count continues to increment, then you have the ammunition to RMA the drive and defend the claim that it is defective. If the re-allocated sector count goes unchanged, you'll probably be fine for a very long time.

Joe L.
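The before/after comparison above can be automated when there are several drives to check. Below is a minimal Python sketch that pulls the raw value out of each attribute row of two smartctl reports and lists what changed. It assumes the standard ten-column attribute layout shown in this thread and skips rows whose raw value does not start with a plain integer. The function names are made up for illustration, and the sample rows are the ones quoted above.

```python
from typing import Dict


def parse_raw_values(report: str) -> Dict[int, int]:
    """Map SMART attribute ID -> raw value for each attribute row."""
    raw = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value
        # in the tenth column (ID, name, flag, value, worst, thresh, ...).
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            raw[int(fields[0])] = int(fields[9])
    return raw


def raw_changes(before: str, after: str) -> Dict[int, int]:
    """Return {attribute ID: change in raw value} for attributes that moved."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return {i: new[i] - old[i] for i in old if i in new and new[i] != old[i]}


BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age  Always - 1
197 Current_Pending_Sector  0x0022 100 100 000 Old_age  Always - 3
"""

AFTER = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 005 Pre-fail Always - 4
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age  Always - 4
197 Current_Pending_Sector  0x0022 100 100 000 Old_age  Always - 0
"""

if __name__ == "__main__":
    print(raw_changes(BEFORE, AFTER))  # {5: 3, 196: 3, 197: -3}
```

The result shows the three pending sectors (attribute 197) turning into three additional re-allocations (attributes 5 and 196). Re-running the comparison after each further pre-clear cycle is a quick way to see whether the re-allocated count keeps climbing.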
madpoet Posted October 4, 2010

Does preclear write a log anywhere? I've got what it says on the screen for the 5 drives I just ran, but it's a lot to copy out by hand. I did it from the root console. Got errors (at least I think they are errors) and wanted to check on what the heck they were.
Joe L. Posted October 4, 2010

madpoet wrote: "Does preclear write a log anywhere? I've got what it says on the screen for the 5 drives I just ran, but it's a lot to copy out by hand. I did it from the root console. Got errors (at least I think they are errors) and wanted to check on what the heck they were."

It writes its output to the syslog. In addition, the SMART reports for the drives are all in the /tmp directory. Both are available until you reboot.

Joe L.
madpoet Posted October 4, 2010

Ahhhh... shoot. OK, good to know for next time! Nothing looked particularly bad.