Helmonder (Author) Posted March 12, 2012

Ok, herewith. I tried it yesterday and it consistently gave the same information... I do however want to stress that the preclear script did NOT fail because of errors with the drive... It causes a large number of errors in the system; the new disk even became totally unavailable while the script was running. When clearing with unRAID itself all went without a problem (it took a good amount of time, but no errors whatsoever). I am happy with the result I have right now, and I have made a link to this post in the preclear thread for people who think some more investigation is needed.

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WMAZA5761786
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Mar 12 18:11:00 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (39900) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   208   171   021    Pre-fail  Always       -       4575
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5127
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2040
194 Temperature_Celsius     0x0022   123   107   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       26
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       171

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%       5127          168700144
# 2  Extended offline    Completed: read failure       90%       5080          168542561
# 3  Short offline       Completed: read failure       80%       5080          168542563
# 4  Short offline       Completed: read failure       90%       5080          168542560
# 5  Short offline       Completed: read failure       90%       5080          168542560

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
dgaschk Posted March 12, 2012

The pending sectors have been resolved and Offline_Uncorrectable has increased to 26. This indicates media problems on the HDD. These SMART values need to be monitored: Reallocated_Sector_Ct, Offline_Uncorrectable, Reallocated_Event_Count, and Current_Pending_Sector. Run pre-clear on this disk and observe these values. Then post a new SMART report. Pre-clear was reporting the HDD errors and the unRAID clearing did not (ignorance is bliss). Using this drive without further testing is a mistake.
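For anyone who wants to track the four attributes named above across successive reports, here is a minimal sketch in Python. It assumes the attribute table has the usual ten-column `smartctl -A` layout seen in the report earlier in this thread; the function names `watched_raw_values` and `changed_values` are made up for illustration and are not part of smartmontools or unRAID.

```python
import re
from typing import Dict, Optional, Tuple

# The four attributes suggested for monitoring in the post above.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Offline_Uncorrectable",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
)

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (leading digits only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def watched_raw_values(report: str) -> Dict[str, int]:
    """Return the RAW_VALUE of each watched attribute found in the report."""
    found: Dict[str, int] = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group(2) in WATCHED:
            found[match.group(2)] = int(match.group(10))
    return found


def changed_values(
    before: Dict[str, int], after: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], int]]:
    """Return {name: (old, new)} for every watched value that differs."""
    return {
        name: (before.get(name), new)
        for name, new in after.items()
        if before.get(name) != new
    }


# Rows copied from the report posted on March 12.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       26
"""

if __name__ == "__main__":
    print(watched_raw_values(SAMPLE))
```

Saving each report to a file and passing two of them through `watched_raw_values` and then `changed_values` would show which raw counts moved between runs. Whether a given change is alarming is still a judgment call.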
Helmonder (Author) Posted March 13, 2012

Thanks to everyone for assisting, and please do not take the following in a negative way: as I stated earlier, I CANNOT run preclear on this drive since it will crash my system... People seem to consistently think that my problem is that preclear finds errors on the drive. It is not... Preclear causes errors in the unRAID system to an extent that it crashes my system.

I have to admit however that I am getting a crash course in SMART values here, and indeed the link between Current_Pending_Sector and Offline_Uncorrectable points to possible issues in the disk surface. So indeed the disk does seem to have issues. I have set up the array in such a way that only a temporary drive will get data on it. I will do a bit of experimenting to see how the values change (not because I do not recognise the issue but because I want to learn on the way).
Joe L. Posted March 13, 2012

We understand you are trying to learn. The unRAID "clearing" process only writes to the disk. The preclear_disk.sh process reads and writes the disk. There is a difference.

If you only intend to write to your array, and never read data from it, the disk might be perfectly fine. If you intend to use it to read from, it may (or may not) show more errors. It is perfectly possible that once written the sectors will be readable. I suspect that will be the case, since I see no re-allocated sectors (all the bad sectors, so far, have been re-written in place).

Now that the disk is in your array, I might suggest one or more non-correcting parity checks. You can perform that through the button in unMENU or, lacking that add-on, log on via telnet or the system console and type:

/root/mdcmd check NOCORRECT

That will initiate a non-correcting parity check. If it is successful, you should be fine.

Do be aware that none of the "long" or "short" internal tests of that disk have ever completed. They've all aborted on a read failure.

Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%       5127          168700144
# 2  Extended offline    Completed: read failure       90%       5080          168542561
# 3  Short offline       Completed: read failure       80%       5080          168542563
# 4  Short offline       Completed: read failure       90%       5080          168542560
# 5  Short offline       Completed: read failure       90%       5080          168542560
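To see how close together the failing reads in that self-test log are, the first-error LBAs can be pulled out and compared. The following Python sketch does that; the helper names are hypothetical, and the regular expression only assumes the row shape shown in the log above (`# n`, description and status, `nn%`, lifetime hours, LBA).

```python
import re
from typing import List, Tuple

# One self-test log row: "# n  <description and status>  nn%  hours  lba".
LOG_ROW = re.compile(r"#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\d+|-)")


def first_error_lbas(log: str) -> List[int]:
    """Return LBA_of_first_error for every row that reports one."""
    return [
        int(match.group(5))
        for match in LOG_ROW.finditer(log)
        if match.group(5) != "-"
    ]


def lba_range(lbas: List[int]) -> Tuple[int, int, int]:
    """Return (lowest, highest, span in sectors) for a non-empty list."""
    lowest, highest = min(lbas), max(lbas)
    return lowest, highest, highest - lowest


# Rows copied from the self-test log quoted above.
SELF_TEST_LOG = """\
# 1 Short offline Completed: read failure 10% 5127 168700144
# 2 Extended offline Completed: read failure 90% 5080 168542561
# 3 Short offline Completed: read failure 80% 5080 168542563
# 4 Short offline Completed: read failure 90% 5080 168542560
# 5 Short offline Completed: read failure 90% 5080 168542560
"""

if __name__ == "__main__":
    lbas = first_error_lbas(SELF_TEST_LOG)
    print(lbas)
    print(lba_range(lbas))
```

On this log, four of the five failures fall within a few sectors of each other, and the most recent one is a little further along. That fits a small damaged region rather than errors scattered over the whole platter, but it is only a reading of five data points.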
Helmonder (Author) Posted March 15, 2012

Thanks again. Several of the disks in my array have come from my WHS system (which I have now retired) and have been running for some time. Several disks get red flags from smarthistory (just installed it) because of the number of hours run. I will wait for the next unRAID version to start replacing them with brand new 3TB drives; the 2TB drives I free up I will use for off-site backup storage.
Helmonder (Author) Posted March 15, 2012

Well... that was a golden bullet tip... I started the non-correcting parity check and the disk got flagged and taken out of the array within 30 minutes. I just removed it and replaced it with another one. I had that one lying around and have done several short and extended tests with it on my PC, with no errors. I am now running preclear on the new drive and it is again giving me a lot of red syslog errors. I'll do a SMART check.
Helmonder (Author) Posted March 15, 2012

The following red errors ended up in my syslog. The syslog has remained stable and quiet since, and preclear is still running. Can anyone here determine what these errors mean?

Mar 15 09:11:42 Tower kernel: ------------[ cut here ]------------
Mar 15 09:11:42 Tower kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() (Minor Issues)
Mar 15 09:11:42 Tower kernel: Hardware name: System Product Name
Mar 15 09:11:42 Tower kernel: Modules linked in: ntfs md_mod xor mvsas libsas scst scsi_transport_sas forcedeth sata_nv amd74xx (Drive related)
Mar 15 09:11:42 Tower kernel: Pid: 7683, comm: hdparm Not tainted 2.6.32.9-unRAID #8 (Errors)
Mar 15 09:11:42 Tower kernel: Call Trace: (Errors)
Mar 15 09:11:42 Tower kernel: [<c102449e>] warn_slowpath_common+0x60/0x77 (Errors)
Mar 15 09:11:42 Tower kernel: [<c10244c2>] warn_slowpath_null+0xd/0x10 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11b624d>] ata_qc_issue+0x10b/0x308 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11ba260>] ata_scsi_translate+0xd1/0xff (Errors)
Mar 15 09:11:42 Tower kernel: [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Mar 15 09:11:42 Tower kernel: [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Mar 15 09:11:42 Tower kernel: [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11bc6df>] ? ata_scsi_pass_thru+0x0/0x21d (Errors)
Mar 15 09:11:42 Tower kernel: [<f845769a>] sas_queuecommand+0x65/0x20d [libsas] (Errors)
Mar 15 09:11:42 Tower kernel: [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Mar 15 09:11:42 Tower kernel: [<c11a82c0>] scsi_dispatch_cmd+0x147/0x181 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11ace4d>] scsi_request_fn+0x351/0x376 (Errors)
Mar 15 09:11:42 Tower kernel: [<c1126798>] __blk_run_queue+0x78/0x10c (Errors)
Mar 15 09:11:42 Tower kernel: [<c1124446>] elv_insert+0x67/0x153 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11245b8>] __elv_add_request+0x86/0x8b (Errors)
Mar 15 09:11:42 Tower kernel: [<c1129343>] blk_execute_rq_nowait+0x4f/0x73 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11293dc>] blk_execute_rq+0x75/0x91 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11292cc>] ? blk_end_sync_rq+0x0/0x28 (Errors)
Mar 15 09:11:42 Tower kernel: [<c112636f>] ? get_request+0x204/0x28d (Errors)
Mar 15 09:11:42 Tower kernel: [<c11269d6>] ? get_request_wait+0x2b/0xd9 (Errors)
Mar 15 09:11:42 Tower kernel: [<c112c2bf>] sg_io+0x22d/0x30a (Errors)
Mar 15 09:11:42 Tower kernel: [<c112c5a8>] scsi_cmd_ioctl+0x20c/0x3bc (Errors)
Mar 15 09:11:42 Tower kernel: [<c11b3257>] sd_ioctl+0x6a/0x8c (Errors)
Mar 15 09:11:42 Tower kernel: [<c112a420>] __blkdev_driver_ioctl+0x50/0x62 (Errors)
Mar 15 09:11:42 Tower kernel: [<c112ad1c>] blkdev_ioctl+0x8b0/0x8dc (Errors)
Mar 15 09:11:42 Tower kernel: [<c1131e2d>] ? kobject_get+0x12/0x17 (Errors)
Mar 15 09:11:42 Tower kernel: [<c112b0f8>] ? get_disk+0x4a/0x61 (Errors)
Mar 15 09:11:42 Tower kernel: [<c101b028>] ? kmap_atomic+0x14/0x16 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf (Errors)
Mar 15 09:11:42 Tower kernel: [<c104a179>] ? filemap_fault+0xb8/0x305 (Errors)
Mar 15 09:11:42 Tower kernel: [<c1048c43>] ? unlock_page+0x18/0x1b (Errors)
Mar 15 09:11:42 Tower kernel: [<c1057c63>] ? __do_fault+0x3a7/0x3da (Errors)
Mar 15 09:11:42 Tower kernel: [<c105985f>] ? handle_mm_fault+0x42d/0x8f1 (Errors)
Mar 15 09:11:42 Tower kernel: [<c108b6c6>] block_ioctl+0x2a/0x32 (Errors)
Mar 15 09:11:42 Tower kernel: [<c108b69c>] ? block_ioctl+0x0/0x32 (Errors)
Mar 15 09:11:42 Tower kernel: [<c10769d5>] vfs_ioctl+0x22/0x67 (Errors)
Mar 15 09:11:42 Tower kernel: [<c1076f33>] do_vfs_ioctl+0x478/0x4ac (Errors)
Mar 15 09:11:42 Tower kernel: [<c105dcdd>] ? do_mmap_pgoff+0x232/0x294 (Errors)
Mar 15 09:11:42 Tower kernel: [<c1076f93>] sys_ioctl+0x2c/0x45 (Errors)
Mar 15 09:11:42 Tower kernel: [<c1002935>] syscall_call+0x7/0xb (Errors)
Mar 15 09:11:42 Tower kernel: ---[ end trace 80e02952ab951772 ]---

The following thread describes the same problem with another user: http://lime-technology.com/forum/index.php?topic=14946.0

The suggested solution is to do preclears from a motherboard SATA connector, meaning there is some kind of incompatibility between preclear and the expansion card.
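A trace like the one above is easier to read once the few identifying fields are pulled out: where the warning fired, which process triggered it, and which loadable modules appear in the call path. The Python sketch below extracts those from pasted syslog text. It is only a summary aid with a made-up function name, it does not diagnose the cause, and it assumes the 2.6-era `WARNING: at file:line func+off/len()` format shown in this post.

```python
import re
from typing import Any, Dict

# Frames that live in a loadable module end with "[module_name]".
MODULE_FRAME = re.compile(r"\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def summarize_warning(syslog: str) -> Dict[str, Any]:
    """Pull the warning site, triggering process and modules from a trace."""
    warning = re.search(r"WARNING: at (\S+):(\d+) (\w+)\+", syslog)
    process = re.search(r"comm: (\S+)", syslog)
    modules = re.search(r"Modules linked in: (.*)", syslog)
    return {
        "source": warning.group(1) if warning else None,
        "line": int(warning.group(2)) if warning else None,
        "function": warning.group(3) if warning else None,
        "process": process.group(1) if process else None,
        # Keep plain identifiers only; this drops viewer annotations
        # such as "(Drive related)" appended to the line.
        "modules": [
            token
            for token in (modules.group(1).split() if modules else [])
            if re.fullmatch(r"\w+", token)
        ],
        "module_frames": MODULE_FRAME.findall(syslog),
    }


# Lines copied from the syslog excerpt above.
SAMPLE = """\
Mar 15 09:11:42 Tower kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() (Minor Issues)
Mar 15 09:11:42 Tower kernel: Modules linked in: ntfs md_mod xor mvsas libsas scst scsi_transport_sas forcedeth sata_nv amd74xx (Drive related)
Mar 15 09:11:42 Tower kernel: Pid: 7683, comm: hdparm Not tainted 2.6.32.9-unRAID #8 (Errors)
Mar 15 09:11:42 Tower kernel: [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 (Errors)
Mar 15 09:11:42 Tower kernel: [<f845769a>] sas_queuecommand+0x65/0x20d [libsas] (Errors)
Mar 15 09:11:42 Tower kernel: [<c1076f93>] sys_ioctl+0x2c/0x45 (Errors)
"""

if __name__ == "__main__":
    for key, value in summarize_warning(SAMPLE).items():
        print(key, "=", value)
```

For this excerpt the summary points at an ioctl issued by `hdparm` that travelled through `libsas` before reaching `ata_qc_issue`. That is at least consistent with the linked thread's suggestion that the controller card, rather than the drive, is involved.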
Helmonder (Author) Posted March 15, 2012

A SMART check on the new drive fails to start with the following output:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA4913598
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Mar 15 13:56:51 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:
 ST =0x40
 ERR=0x00
 NS =0x04
 SC =0xe0
 CL =0x6c
 CH =0x28
 SEL=0x40
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Helmonder (Author) Posted March 15, 2012

From the console I was able to get output using the command:

smartctl -a -A -T permissive /dev/sdg

Output is as follows:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   191   172   021    Pre-fail  Always       -       5441
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3384
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   172   172   000    Old_age   Always       -       84663
194 Temperature_Celsius     0x0022   120   108   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

So the disk looks fine as far as I am concerned. Two questions remain:

1) Any idea about the errors I got in the syslog?
2) Is it possible to configure SMARTHISTORY and unRAID in such a way that they will work from unMENU?