June 10, 2014 please help, possibly 2 disabled disks.. i'm running out of work and heading home, will get SMART reports when i get there. can post the full syslog if needed; the attached excerpt is the only stuff out of the ordinary. syslog.txt
June 10, 2014 Author

disk3
Statistics for /dev/sdf SAMSUNG_HD753LJ_S13UJDWQ601930
smartctl -a -d ata /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWQ601930
LU WWN Device Id: 5 0000f0 003069103
Firmware Version: 1AA01112
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Tue Jun 10 00:20:36 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (11558) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 193) minutes.
Conveyance self-test routine recommended polling time: ( 21) minutes.
SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   067   067   011    Pre-fail  Always       -       10600
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       3786
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       45080
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       132
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       253
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   050   000    Old_age   Always       -       23 (Min/Max 23/23)
194 Temperature_Celsius     0x0022   075   050   000    Old_age   Always       -       25 (Min/Max 23/25)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       19704
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

disk4
Statistics for /dev/sde SAMSUNG_HD204UI_S2H7JD2ZB02704
smartctl -a -d ata /dev/sde
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD2ZB02704
LU WWN Device Id: 5 0024e9 0044e3cf9
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jun 10 00:21:06 2014 EDT

==> WARNING: Using smartmontools or hdparm with this drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (19440) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 324) minutes.
SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1530
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9748
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2429
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12741
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       131
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       26111667
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   Always       -       28 (Min/Max 14/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2449

SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
ef 03 45 00 00 00 a0 00      00:00:00.115  SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00      00:00:00.115  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
00 00 01 01 00 00 00 00      00:00:00.114  NOP [Abort queued commands]

Error 2 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
ec 00 00 00 00 00 a0 00      00:00:00.107  IDENTIFY DEVICE
00 00 01 01 00 00 40 00      00:00:00.107  NOP [Abort queued commands]
00 00 01 01 00 00 40 00      00:00:00.105  NOP [Abort queued commands]
00 00 01 01 00 00 40 00      00:00:00.097  NOP [Abort queued commands]
00 00 01 01 00 00 40 00      00:00:00.095  NOP [Abort queued commands]

Error 1 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
ec 00 01 00 00 00 e0 00      00:00:00.067  IDENTIFY DEVICE
00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
00 00 00 00 00 00 00 00      00:00:00.030  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

when i got home, i found disk3 set as NOT PRESENT (NP) and disk4 as DISABLED. I stopped the array, powered down and reseated the connections. Upon start up, disk3 was back online, 4 still disabled. I am seeing those errors in the SMART report and have another HD en route while i figure out if it's salvageable, but I'm wondering if I should also be worried about disk3? or was it just maybe a hiccup? is it OK to run with a disabled disk for however long it'll take to preclear the drive after getting it tomorrow?
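For anyone who wants to skim reports like the two above without reading every row, here is a minimal Python sketch. It assumes the attribute rows are available one per line, as smartctl prints them. The watch list (IDs 5, 197, 198 and 199) and the function names are choices made for this sketch, not something unRAID or smartctl defines, and a non-zero raw value is only a hint to look closer, not a verdict.

```python
import re
from typing import Dict, List, Tuple

# Watch list chosen for this sketch (an assumption, not a smartctl or unRAID rule):
# 5/197/198 usually hint at media trouble, 199 at link or cabling trouble.
WATCHED_IDS = (5, 197, 198, 199)

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (anything after the raw number is ignored).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(report: str) -> Dict[int, Tuple[str, int]]:
    """Map attribute ID -> (name, raw value) for every row that parses."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m["id"])] = (m["name"], int(m["raw"]))
    return attrs


def flag_watched(attrs: Dict[int, Tuple[str, int]]) -> List[Tuple[int, str, int]]:
    """Return (id, name, raw) for watched attributes whose raw value is non-zero."""
    return [(i, *attrs[i]) for i in WATCHED_IDS if i in attrs and attrs[i][1] > 0]


# A few rows copied from the disk4 report in this post.
DISK4_ROWS = """
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   Always       -       28 (Min/Max 14/47)
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
"""

flagged = flag_watched(parse_attributes(DISK4_ROWS))
print(flagged)
```

On the rows above, the only watched attribute with a non-zero raw value is UDMA_CRC_Error_Count, which counts transfer errors on the link between drive and controller rather than bad sectors.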
June 10, 2014 Could you provide a screenshot of the Main tab to clarify the current status? I did not notice anything in the SMART reports to indicate that either disk is actually failing.

It is perfectly possible to continue running with a single drive in a disabled state, as unRAID can emulate that drive using a combination of parity and the remaining data drives. However, another failure would almost certainly lead to data loss.

I think it is highly likely that disk4 is actually OK, but has been disabled by unRAID because it had a write error (possibly a side-effect of whatever took disk3 offline). Once that happens it stays disabled in unRAID until you take appropriate recovery action. If so, it might be possible to recover this without using an additional replacement disk.

Having said that, since you already have another disk on the way, the procedure I would recommend is:
1. Avoid writing new data to the array, if you can, until you have recovered it to a clean state. This may not be strictly necessary, as a clean recovery would include any newly written data, but if any issues arise then this is not guaranteed.
2. When the new disk arrives, remove the 'failed' disk and put it somewhere safe until recovery has finished. That way it is still available for data recovery purposes if anything goes wrong in the normal recovery process.
3. Pre-clear the new disk as an initial stress-test. This is not strictly necessary, but it does help confirm that the new disk is OK. The downside is that it takes time and extends the time your array is in an unprotected state.
4. Follow the unRAID process for rebuilding a failed disk onto the new replacement disk.

If any issues arise then check back here for advice. If no issues arise then the old 'failed' disk can be considered a potential spare. I would try pre-clearing it to see if that completes without errors. If it does then the disk is almost certainly fine, and you can then either keep the drive as a spare against another disk reporting issues, or add it to the array as an additional data disk (assuming you have space in your box and your unRAID license permits this).
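To make the "emulate that drive using parity and the remaining data drives" point concrete, here is a toy Python sketch of single-parity reconstruction with XOR. The disk names and byte values are invented for illustration, and the sketch assumes plain XOR parity computed across the same block position on every data disk; it says nothing about unRAID's actual on-disk layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Invented 4-byte "blocks" standing in for the same sector on each data disk.
data = {
    "disk1": b"\x10\x20\x30\x40",
    "disk2": b"\x0a\x0b\x0c\x0d",
    "disk3": b"\xff\x00\xff\x00",
    "disk4": b"\x55\xaa\x55\xaa",
}

# Parity is the XOR of every data disk.
parity = xor_blocks(list(data.values()))

# disk4 is disabled: its contents are the XOR of parity and all surviving disks.
survivors = [block for name, block in data.items() if name != "disk4"]
emulated_disk4 = xor_blocks(survivors + [parity])
assert emulated_disk4 == data["disk4"]

# If disk3 also becomes unreadable, the same XOR only yields disk3 ^ disk4,
# from which neither disk can be recovered on its own.
remaining = [data["disk1"], data["disk2"], parity]
mixed = xor_blocks(remaining)
assert mixed == xor_blocks([data["disk3"], data["disk4"]])
assert mixed != data["disk4"]
```

This is why one disabled disk is tolerable but a second unreadable disk is not: every read of the emulated disk needs every other disk to answer.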
June 10, 2014 Author please see attached. so nothing particularly bad in those SMART reports? good to know. i am curious what would make disk3 just drop off like that. disk4 i can understand, as i see the errors in the SMART report. as far as losing my data.. if disk3 craps out again before i've replaced my disk4 and am back in a clean state, do you know if i can plug an unraid disk into another computer (windows) and move the data from there back to the array? will windows recognize the filesystem?
June 10, 2014
[quote]please see attached. so nothing particularly bad in those SMART reports? good to know. i am curious what would make disk3 just drop off like that. disk4 i can understand, as i see the errors in the SMART report.[/quote]
No idea. It could be a cabling issue, something that upset the controller card, or a power glitch.
[quote]as far as losing my data.. if disk3 craps out again before i've replaced my disk4 and am in a clean state, do you know if i can plug an unraid disk into another computer (windows) and move the data from there back to the array? will it recognize the filesystem?[/quote]
One of the big strengths of unRAID is that each disk is a complete free-standing file system, so you can take a disk out of the array and read it elsewhere. As disks rarely fail from a physical perspective, this is a huge advantage in terms of data recovery. On Windows you need a tool that can understand the ReiserFS file system. I think DiskInternals provide a tool (Linux Reader) that can do this, but I do not have the details to hand. Another option is to boot the PC off a Linux 'live' CD so that you can get into a Linux environment with support for reiserfs built-in.
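The cabling theory can be checked against disk4's SMART error log, where all three errors show ER = 84 and ST = 51. Below is a small Python sketch that decodes those two bytes. The bit names follow the classic ATA register layout as best I know it (some of the low bits are obsolete in newer revisions of the spec), so treat the tables as an assumption of the sketch rather than a quotation of the standard.

```python
# Bit names for the ATA Error and Status registers (classic layout; assumed here).
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "NM", 0: "AMNF"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}


def decode(value, names):
    """List the names of the bits set in an 8-bit register, high bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


# Values copied from the "registers were" rows of disk4's error log.
error_flags = decode(0x84, ERROR_BITS)
status_flags = decode(0x51, STATUS_BITS)
print("ER 0x84 ->", error_flags)
print("ST 0x51 ->", status_flags)
```

ICRC is an interface CRC error, meaning the data was damaged in transit between drive and controller. Three logged errors of that kind alongside a UDMA_CRC_Error_Count of 3 would fit a cable, connector, controller or power problem better than a failing platter, though it does not prove it.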
June 10, 2014 Author so when they say if you lose more than 1 drive you will lose the data on those drives.. that's only if those drives actually die 100%.. otherwise you can still recover the data and move it back to the array?
June 11, 2014
[quote]so when they say if you lose more than 1 drive you will lose the data on those drives.. that's only if those drives actually die 100%.. otherwise you can still recover the data and move it back to the array?[/quote]
Yep. The drives are all individually formatted and readable with a reiserfs-capable system. If the drive will spin up and mount, the chances of recovering most or all of your data are very high, as long as you follow generally recommended drive recovery practices.
June 11, 2014 Author one last (probably) question.. unraid is up and running all good. i have the new drive preclearing in a different slot right now. when that's done, can i just stop the array, unassign disk4 (possibly bad drive), assign the new precleared disk to the disk4 slot, and rebuild? or do i need to reboot in there somewhere? i'm not removing any drives or anything, because once the new disk4 is rebuilt i'd like to try preclearing the old disk4 and see if i can bring it back to life, then either add it back to the array or keep it as a spare..
June 11, 2014
[quote]one last (probably) question.. unraid is up and running all good. i have the new drive preclearing in a different slot right now. when that's done, can i just stop the array, unassign disk4 (possibly bad drive) and reassign the new precleared disk4 to that assignment? rebuild. or do i need to reboot in there somewhere? i'm not removing any drives or anything because once the new disk4 is rebuilt i'd like to try preclearing the old disk4 and see if i can bring it back to life either add it back to the array, or keep it as a spare..[/quote]
Yes, this is correct. Stop the array and change the assignment for the disk in question to the new drive. unRAID should tell you that it will rebuild the disk when you start the array; then start the array. I tend to let it do its thing without heavy use of the array, but I am probably overly cautious.
June 11, 2014 I have found that sometimes (particularly if the old drive is still physically in the server) it is a good idea to start the array after unassigning the drive, and then stop the array and assign the new drive before continuing with the rebuild. I have never confirmed whether this is strictly necessary, but it does seem to make sure that unRAID has forgotten about the old drive before you assign the new one.
June 11, 2014 Author i probably have about 24 hours from now before the 3rd preclear cycle is done. in the meantime i just want to verify the steps..
1. stop array
2. unassign the drive i want to replace (it will remain in the system for now, unassigned, so i can preclear it and either add it back to the array or keep it as a spare)
3. start array.. it should start OK but with disk4 missing (is this correct?)
4. stop array.
5. assign newly precleared disk to disk4 slot.
6. hit start or equivalent to start data-rebuild on the drive.
June 12, 2014 Author had the issue happen again. not completely, but i'm getting a lot of read errors on disk3.. it's still up and green but when i access it via share it is coming up empty.

Jun 11 13:20:10 Tower kernel: ata2.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
Jun 11 13:20:10 Tower kernel: ata2: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Jun 11 13:20:10 Tower kernel: ata2.00: failed command: READ DMA
Jun 11 13:20:10 Tower kernel: ata2.00: cmd c8/00:08:57:fe:49/00:00:00:00:00/e4 tag 0 dma 4096 in
Jun 11 13:20:10 Tower kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun 11 13:20:10 Tower kernel: ata2.00: status: { DRDY }
Jun 11 13:20:10 Tower kernel: ata2: hard resetting link
Jun 11 13:20:10 Tower kernel: ata2: controller in dubious state, performing PORT_RST
Jun 11 13:20:12 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:17 Tower kernel: ata2: hard resetting link
Jun 11 13:20:19 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:24 Tower kernel: ata2: hard resetting link
Jun 11 13:20:26 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:26 Tower kernel: ata2.00: disabled
Jun 11 13:20:26 Tower kernel: ata2.00: device reported invalid CHS sector 0
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]
Jun 11 13:20:26 Tower kernel: Sense Key : 0xb [current] [descriptor]
Jun 11 13:20:26 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Jun 11 13:20:26 Tower kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 11 13:20:26 Tower kernel: 00 00 00 00
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]
Jun 11 13:20:26 Tower kernel: ASC=0x0 ASCQ=0x0
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] CDB:
Jun 11 13:20:26 Tower kernel: cdb[0]=0x28: 28 00 04 49 fe 57 00 00 08 00
Jun 11 13:20:26 Tower kernel: end_request: I/O error, dev sdc, sector 71958103
Jun 11 13:20:26 Tower kernel: md: disk3 read error, sector=71958040
Jun 11 13:20:26 Tower kernel: ata2: EH complete
Jun 11 13:20:26 Tower kernel: ata2.00: detaching (SCSI 3:0:0:0)
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] Synchronizing SCSI cache
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] Stopping disk
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] START_STOP FAILED
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:21:12 Tower kernel: ata1.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
Jun 11 13:21:12 Tower kernel: ata1: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Jun 11 13:21:12 Tower kernel: ata1.00: failed command: CHECK POWER MODE
Jun 11 13:21:12 Tower kernel: ata1.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Jun 11 13:21:12 Tower kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun 11 13:21:12 Tower kernel: ata1.00: status: { DRDY }
Jun 11 13:21:12 Tower kernel: ata1: hard resetting link
Jun 11 13:21:12 Tower kernel: ata1: controller in dubious state, performing PORT_RST
Jun 11 13:21:14 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:19 Tower kernel: ata1: hard resetting link
Jun 11 13:21:21 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:26 Tower kernel: ata1: hard resetting link
Jun 11 13:21:28 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:28 Tower kernel: ata1.00: disabled
Jun 11 13:21:28 Tower kernel: ata1: EH complete
Jun 11 13:21:28 Tower kernel: ata1.00: detaching (SCSI 1:0:0:0)
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb]
Jun 11 13:21:28 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] Stopping disk
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] START_STOP FAILED
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb]
Jun 11 13:21:28 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:22:50 Tower kernel: md: disk3 read error, sector=197482344
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=732168320
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=768810016
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=768815592
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=549113984
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673899160
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673903024
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673904400
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=57612720
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1453814560
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=767566792
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=937305640
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=238172048
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=428019760
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=349555864
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=978512992
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=16984392
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=16986072
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=844064360
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1237491808
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1131187520
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=265781248
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1294680296

full syslog attached. i am also seeing some errors for disk4, but that's currently disabled so i'm not sure if that's something to worry about:

Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:08:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:28 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:09:29 Tower kernel: md: disk3 read error, sector=20438296
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error

after some mild detective work, i found that both disk3 and disk4 (the two i'm having issues with) are on the same SATA card. i'm thinking it's not likely that both sata cables went bad at the same time. should i order a new card? currently have this card: http://www.monoprice.com/Product?c_id=104&cp_id=10407&cs_id=1040702&p_id=2530&seq=1&format=2

EDIT: for now i have moved disk3 to another slot and will see if the issue returns.. that card is only like $15 shipped so swapping it out is no issue at all.

syslog-2014-06-12.zip
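A full syslog is much longer than the excerpt in this post, so a quick tally can help show which ports and disks are involved. The Python sketch below counts SATA link-down events per ATA port, md read errors per array disk, and ReiserFS errors per md device. The sample lines are copied from the excerpt; the function name and the three patterns are assumptions made for this sketch and only cover the message shapes seen here.

```python
import re
from collections import Counter

# Patterns for the three message shapes that appear in the excerpt.
LINK_DOWN = re.compile(r"kernel: (ata\d+): SATA link down")
MD_READ_ERROR = re.compile(r"kernel: md: (disk\d+) read error, sector=(\d+)")
REISERFS_ERROR = re.compile(r"kernel: REISERFS error \(device (md\d+)\)")


def summarize(lines):
    """Return three Counters: link-down per port, read errors per disk, fs errors per device."""
    link_down, read_errors, fs_errors = Counter(), Counter(), Counter()
    for line in lines:
        if (m := LINK_DOWN.search(line)):
            link_down[m.group(1)] += 1
        elif (m := MD_READ_ERROR.search(line)):
            read_errors[m.group(1)] += 1
        elif (m := REISERFS_ERROR.search(line)):
            fs_errors[m.group(1)] += 1
    return link_down, read_errors, fs_errors


# Sample lines copied from the excerpt above.
SAMPLE = [
    "Jun 11 13:20:12 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)",
    "Jun 11 13:20:26 Tower kernel: md: disk3 read error, sector=71958040",
    "Jun 11 13:21:14 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)",
    "Jun 11 13:22:50 Tower kernel: md: disk3 read error, sector=197482344",
    "Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error",
]

link_down, read_errors, fs_errors = summarize(SAMPLE)
print(dict(link_down), dict(read_errors), dict(fs_errors))
```

To run it over a real log, pass the open file instead of `SAMPLE`, for example `summarize(open("syslog.txt"))`. One plausible reading of the md4 errors, offered as a guess: while disk4 is disabled its contents are computed from parity plus every other disk, so any read that disk3 cannot serve will also surface as an I/O error on the emulated disk4.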
June 12, 2014 Author Okay well there goes that idea. disk3 is still spitting the same errors even after moving to a different slot (different card, sata cable, port on backplane), so it looks like disk3 is definitely having issues. I'm at a loss for what to do. Disk4 is still disabled, due to a write error i'm assuming based on what i've read, and something is definitely up with disk3. I have a hdd preclearing (had to restart, about 3 days left for 3 cycles, 2tb). Thoughts on re-enabling disk4 and hoping for the best while I disable disk3 somehow, then, while praying I don't lose 4 again, finishing the preclear and replacing disk3 instead of 4?
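As a side check, the failed read in the June 11 syslog excerpt can be decoded by hand from its CDB line (`28 00 04 49 fe 57 00 00 08 00`). The Python sketch below unpacks it as a SCSI READ(10) command, which is what opcode 0x28 denotes, and compares the result with the sector numbers the kernel printed. The function name is made up for this sketch, and the reading of the 63-sector gap as a partition start offset is an assumption, not something the log states.

```python
def decode_read10(cdb: bytes):
    """Return (lba, sector_count) from a 10-byte SCSI READ(10) command block."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")    # bytes 7-8: transfer length in sectors
    return lba, count


# CDB copied from the syslog line "cdb[0]=0x28: 28 00 04 49 fe 57 00 00 08 00".
lba, count = decode_read10(bytes.fromhex("28 00 04 49 fe 57 00 00 08 00"))

# Sector numbers copied from the two log lines that follow the CDB.
device_sector = 71958103   # "end_request: I/O error, dev sdc, sector 71958103"
md_sector = 71958040       # "md: disk3 read error, sector=71958040"

print(lba, count, count * 512, device_sector - md_sector)
```

The decoded address matches the sector in the `end_request` line, and 8 sectors of 512 bytes is the 4096-byte transfer shown as "dma 4096 in". The md layer reports a number 63 lower, which would fit a data partition starting at sector 63 of the raw device. Nothing about the request itself is unusual, so the failure looks like it happened in transport ("ATA bus error") rather than at a particular bad spot on the disk.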
June 12, 2014 Author these are the same disk3 and disk4 SMART reports from a couple days ago (posted above on June 10).. i can rerun them if needed, but that would require a restart since disk3 is not being recognized properly right now.
SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 1530 2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0 3 Spin_Up_Time 0x0023 068 068 025 Pre-fail Always - 9748 4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2429 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12741 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 131 181 Program_Fail_Cnt_Total 0x0022 099 099 000 Old_age Always - 26111667 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 2 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0 194 Temperature_Celsius 0x0002 064 053 000 Old_age Always - 28 (Min/Max 14/47) 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 3 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 11710 223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 2449 SMART Error Log Version: 1 ATA Error Count: 3 CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] 
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 3 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:00.115 IDENTIFY DEVICE ef 03 45 00 00 00 a0 00 00:00:00.115 SET FEATURES [set transfer mode] 27 00 00 00 00 00 e0 00 00:00:00.115 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.115 IDENTIFY DEVICE 00 00 01 01 00 00 00 00 00:00:00.114 NOP [Abort queued commands] Error 2 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:00.107 IDENTIFY DEVICE 00 00 01 01 00 00 40 00 00:00:00.107 NOP [Abort queued commands] 00 00 01 01 00 00 40 00 00:00:00.105 NOP [Abort queued commands] 00 00 01 01 00 00 40 00 00:00:00.097 NOP [Abort queued commands] 00 00 01 01 00 00 40 00 00:00:00.095 NOP [Abort queued commands] Error 1 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 e0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 01 00 00 00 e0 00 00:00:00.067 IDENTIFY DEVICE 00 00 00 00 00 00 00 00 00:00:00.067 NOP [Abort queued commands] 00 00 00 00 00 00 00 00 00:00:00.067 NOP [Abort queued commands] 00 00 00 00 00 00 00 00 00:00:00.030 NOP [Abort queued commands] SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 0 Note: revision number not 1 implies that no selective self-test has ever been run SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Completed [00% left] (0-65535) 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
June 12, 2014
Disk 3 has a runtime bad block. It is unusual to get one, and it doesn't sound too good. Disk 4 has 3 UDMA CRC errors, which are usually associated with bad cabling at some point in its checkered past. Since these counters never clear back to zero, the value should just hold steady going forward if the cabling is now sound. I'd suggest running the reports again. Use the command arguments "-a -A" instead of "-a -d ata".
June 12, 201412 yr Author /disk3 root@Tower:~# smartctl -a -A /dev/sdm smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build) Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint F1 DT Device Model: SAMSUNG HD753LJ Serial Number: S13UJDWQ601930 LU WWN Device Id: 5 0000f0 003069103 Firmware Version: 1AA01112 User Capacity: 750,156,374,016 bytes [750 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b Local Time is: Thu Jun 12 15:47:04 2014 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (11558) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 193) minutes. Conveyance self-test routine recommended polling time: ( 21) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. 
SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0007 065 065 011 Pre-fail Always - 11190 4 Start_Stop_Count 0x0032 096 096 000 Old_age Always - 3791 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 0 9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 45142 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 134 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 255 13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 2 184 End-to-End_Error 0x0033 100 100 099 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 070 050 000 Old_age Always - 30 (Min/Max 28/30) 194 Temperature_Celsius 0x0022 069 050 000 Old_age Always - 31 (Min/Max 28/31) 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 163166 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0 201 Soft_Read_Error_Rate 0x000a 253 253 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 0 Warning: ATA Specification requires self-test log structure revision number = 1 No self-tests have been logged. 
[To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 0 Note: revision number not 1 implies that no selective self-test has ever been run SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. /disk4 root@Tower:~# smartctl -a -A /dev/sdl smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build) Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint F4 EG (AF) Device Model: SAMSUNG HD204UI Serial Number: S2H7JD2ZB02704 LU WWN Device Id: 5 0024e9 0044e3cf9 Firmware Version: 1AQ10001 User Capacity: 2,000,398,934,016 bytes [2.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: 5400 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 6 SATA Version is: SATA 2.6, 3.0 Gb/s Local Time is: Thu Jun 12 15:48:06 2014 EDT ==> WARNING: Using smartmontools or hdparm with this drive may result in data loss due to a firmware bug. ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ****** Buggy and fixed firmware report same version number! See the following web pages for details: http://knowledge.seagate.com/articles/en_US/FAQ/223571en http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. 
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (19440) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 324) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 1530 2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0 3 Spin_Up_Time 0x0023 068 068 025 Pre-fail Always - 9701 4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2431 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12803 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 133 181 Program_Fail_Cnt_Total 0x0022 099 099 000 Old_age Always - 26111667 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 2 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age 
Always - 0 194 Temperature_Celsius 0x0002 064 053 000 Old_age Always - 29 (Min/Max 14/47) 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 16 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 11710 223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 2451 SMART Error Log Version: 1 ATA Error Count: 9 (device log contains only the most recent five errors) CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 9 occurred at disk power-on lifetime: 12803 hours (533 days + 11 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:52.002 IDENTIFY DEVICE ef 03 45 00 00 00 a0 00 00:00:52.002 SET FEATURES [set transfer mode] 27 00 00 00 00 00 e0 00 00:00:52.002 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:52.002 IDENTIFY DEVICE 00 00 01 01 00 00 40 00 00:00:52.002 NOP [Abort queued commands] Error 8 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 08 00 00 00 e0 Error: ICRC, ABRT 8 sectors at LBA = 0x00000000 = 0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 00 00 00 e0 00 00:00:00.352 READ DMA 27 00 00 00 00 00 e0 00 00:00:00.352 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.352 IDENTIFY DEVICE ef 03 45 00 00 00 a0 00 00:00:00.352 SET FEATURES [set transfer mode] 27 00 00 00 00 00 e0 00 00:00:00.352 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] Error 7 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:00.157 IDENTIFY DEVICE ef 03 45 00 00 00 a0 00 00:00:00.157 SET FEATURES [set transfer mode] 27 00 00 00 00 00 e0 00 00:00:00.157 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.157 IDENTIFY DEVICE 00 00 01 01 00 00 00 00 00:00:00.157 NOP [Abort queued commands] Error 6 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:00.150 IDENTIFY DEVICE 00 00 00 00 00 00 00 00 00:00:00.150 NOP [Abort queued commands] 00 00 00 00 00 00 00 00 00:00:00.148 NOP [Abort queued commands] 60 00 08 00 00 00 40 00 00:00:00.000 READ FPDMA QUEUED 60 00 08 00 00 00 40 00 00:00:00.117 READ FPDMA QUEUED Error 5 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 00 00 00 00 a0 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ec 00 00 00 00 00 a0 00 00:00:00.110 IDENTIFY DEVICE ef 03 45 00 00 00 a0 00 00:00:00.110 SET FEATURES [set transfer mode] 27 00 00 00 00 00 e0 00 00:00:00.110 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.110 IDENTIFY DEVICE 00 00 01 01 00 00 00 00 00:00:00.110 NOP [Abort queued commands] SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 0 Note: revision number not 1 implies that no selective self-test has ever been run SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Completed [00% left] (0-65535) 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
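For anyone reading the error log entries above, here is a minimal sketch of how the ER and ST register bytes and the power-on lifetime figures can be decoded. The bit names are an assumption based on the usual ATA register layout (smartctl itself prints "ICRC, ABRT" for ER = 84 in Error 8), and the helper names `decode` and `hours_to_days` are made up for this illustration.

```python
# Assumed bit names for the ATA error and status registers. Only the bits
# needed for the values in this thread are listed; on Ultra DMA transfers
# bit 7 of the error register is reported as ICRC (interface CRC error).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x01: "ERR"}


def decode(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


def hours_to_days(hours):
    """Split a power-on hour count into (days, remaining hours)."""
    return divmod(hours, 24)


if __name__ == "__main__":
    print(decode(0x84, ERROR_BITS))   # error register from the log
    print(decode(0x51, STATUS_BITS))  # status register from the log
    print(hours_to_days(12803))       # lifetime at Error 9
```

ICRC together with ABRT points at the link between controller and drive rather than at the platters, which fits the rising UDMA_CRC_Error_Count on disk4.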
June 12, 2014 Author
of the two, which drive do you think is worse? disk4 has been disabled so i can't really tell what issues it is still having, but disk3 just drops off the face of the earth.

i'm thinking i should either:
1. re-enable disk4 and hope for the best
2. preclear a disk and rebuild disk3
3. preclear another disk (will order) and rebuild disk4 shortly after
4. either junk disk3 and disk4 or try to preclear them again to keep as spares

or:
1. remove both disk3 and disk4 (new config, i'm guessing?)
2. rebuild parity
3. add the new disk that i'm currently preclearing (when it's done, of course)
4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)
5. either junk disk3 and disk4 or try to preclear them again to keep as spares

i'm leaning towards the 2nd option unless there's a better way.
June 12, 2014
No smoking gun from the SMART reports.

Disk 3
Calibration retries 132 -> 134
Runtime bad blocks 1 -> 2

Disk 4
G-sense error rate 2 -> 2
UDMA CRC errors 3 -> 16
Multi-zone error rate 11710 -> 11710

The UDMA CRC error count is indicative of a cabling problem (you should replace the cable on disk 4).

I have little experience with Samsung drives and am not sure about the calibration retries or the runtime bad blocks. I know drives recalibrate as heat rises, so maybe that's normal, but the count went up by 2 in a very short time (about 62 power-on hours between the two reports). The runtime bad blocks sound bad, but I'm not sure what the attribute means.

Some hard-to-diagnose drive behavior turns out to be an issue with a power splitter. I might check that.

I might suggest running the SMART long tests on the drives, or locating a Samsung-specific diagnostic disk. If the drives are failing, those tests might fail and confirm it.

750G is pretty small in today's world. You might want to trade up to something bigger anyway. 2T is bigger and maybe fixing the cabling will get it working.
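The before/after comparison above can be sketched in a few lines of Python. This is only an illustration: it assumes the attribute rows have been copied out of the two smartctl reports one per line (the rows below are taken from the disk4 reports in this thread), and `parse_attributes` and `changed` are hypothetical helper names, not part of smartmontools.

```python
def parse_attributes(text):
    """Map attribute name -> raw value for each row of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A row has 10 or more columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def changed(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value differs."""
    old, new = parse_attributes(before), parse_attributes(after)
    return {name: (old[name], new[name])
            for name in old if name in new and old[name] != new[name]}


DISK4_JUNE_10 = """
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 2
199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 3
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 11710
"""

DISK4_JUNE_12 = """
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 2
199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 16
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 11710
"""

if __name__ == "__main__":
    for name, (old, new) in changed(DISK4_JUNE_10, DISK4_JUNE_12).items():
        print(f"{name}: {old} -> {new}")
```

Comparing raw values rather than the normalized VALUE column is a deliberate choice here, since the normalized values in both reports stayed at 100 while the raw counter moved.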
June 12, 2014 Author
750G is pretty small in today's world. You might want to trade up to something bigger anyway. 2T is bigger and maybe fixing the cabling will get it working.

at least for disk3 i know it is not a sata cable issue. i moved it to a different slot in my case, so that's a different SATA card and port, and a different power and sata port on the backplane. i will try the same with disk4, but i'm thinking i should remove disk3 and disk4 (see post above yours):
1. remove both disk3 and disk4 (new config, i'm guessing?)
2. rebuild parity
3. add the new disk that i'm currently preclearing (when it's done, of course)
4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)
5. either junk disk3 and disk4 or try to preclear them N times again to keep as spares
June 13, 2014
Hi, long time no post. I don't want to hijack the thread but I'm having a similar issue. One of my drives is not being detected at all and another has come up as unformatted. I'm currently moving files off a spare 2TB drive to replace the undetected one (hoping it's not a dead port, I know it's not a cable issue) but not sure how to proceed after that. I know I will need to preclear any replacement disk(s) but then what? I'm still on 4.7 Plus, there doesn't seem to be a sub-forum for that version. Happy to post my own thread so as not to confuse the issue the OP is having, let me know.

I would suggest that starting a new thread is a good idea so that things do not get confusing in terms of responses/advice.

A disk simply coming up as unformatted tends to mean that it failed to mount and there is some sort of file system corruption. The data on it is probably intact, but the mounting problem can only be fixed by running reiserfsck against the drive.

You want to take a structured approach to fixing these issues to minimize any chance of losing data. As long as there is a spare disk available, I like to take an approach that lets me put a problem disk aside while I work on recovering its data onto another disk. That way, if anything goes wrong, I still have the problem disk in the state it was in when the problem occurred, to attempt data recovery against.

My suggestion (I would be interested to see what others think) would be an approach along these lines:
1. Recover the disk that is currently not being detected to the new drive.
2. When that finishes, test the drive that has been replaced to see if it really has a fault. If it tests out OK it can become a new 'spare' drive.
3. Try to recover the drive showing as unformatted. In an ideal world I would first rebuild onto a spare disk to keep the original one unchanged until recovery has finished. However, if there is no spare drive, you can do it against the current drive as long as you have the array in maintenance mode and run reiserfsck against the relevant /dev/disk?? device (to maintain parity), possibly before trying to recover the 'faulty' disk mentioned above.
June 13, 2014
I would suggest starting a new thread is a good idea so that things do not get confusing in terms of responses/advice.

Thanks Itimpi, I have started a new thread here: http://lime-technology.com/forum/index.php?topic=33745.0

I have deleted my previous post from this thread and will take further discussion over to my thread.
June 13, 2014 Author
1. remove both disk3 and disk4 (new config, i'm guessing?)
2. rebuild parity
3. add the new disk that i'm currently preclearing (when it's done, of course)
4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)
5. either junk disk3 and disk4 or try to preclear them N times again to keep as spares

basically i'm thinking of doing this: http://blog.ktz.me/?p=243

will this not work? the data on these 2 disks is not super important, though of course i'd rather not lose it. but as it is right now, none of my user shares are accessible (they come up as empty; this only happens when disk3 starts seeing the read errors. after a reboot, before disk3 fails, it's fine), so it's more important to get the array and the other working disks back online. i'll troubleshoot disk3 and disk4 further when i'm back to a workable situation.

EDIT 6/13/2014 @ 9:51p: i already did this. it's rebuilding parity now. hopefully i made the right choice. once this is done i'll try to copy the stuff over from the other 2 drives, either by putting them in another computer or by mounting them and copying over with mc like the tutorial above (any suggestions?). after that i'll start preclearing the other 2 drives and moving them around to see what comes of that.
This topic is now archived and is closed to further replies.