[solved] please help, possibly 2 disabled disks

June 10, 2014

please help, possibly 2 disabled disks.. i'm running out of work and heading home; will post SMART reports when i get there.

can post full syslog if needed. this is the only stuff out of the ordinary.

June 10, 2014

Author

disk3

Statistics for /dev/sdf SAMSUNG_HD753LJ_S13UJDWQ601930
smartctl -a -d ata /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWQ601930
LU WWN Device Id: 5 0000f0 003069103
Firmware Version: 1AA01112
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Tue Jun 10 00:20:36 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(11558) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 193) minutes.
Conveyance self-test routine
recommended polling time: 	 (  21) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   067   067   011    Pre-fail  Always       -       10600
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       3786
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       45080
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       132
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       253
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   050   000    Old_age   Always       -       23 (Min/Max 23/23)
194 Temperature_Celsius     0x0022   075   050   000    Old_age   Always       -       25 (Min/Max 23/25)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       19704
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

disk4

Statistics for /dev/sde SAMSUNG_HD204UI_S2H7JD2ZB02704
smartctl -a -d ata /dev/sde
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD2ZB02704
LU WWN Device Id: 5 0024e9 0044e3cf9
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jun 10 00:21:06 2014 EDT

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(19440) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 324) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1530
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9748
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2429
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12741
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       131
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       26111667
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   Always       -       28 (Min/Max 14/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2449

SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:00.115  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:00.115  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
  00 00 01 01 00 00 00 00      00:00:00.114  NOP [Abort queued commands]

Error 2 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.107  IDENTIFY DEVICE
  00 00 01 01 00 00 40 00      00:00:00.107  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.105  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.097  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.095  NOP [Abort queued commands]

Error 1 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 01 00 00 00 e0 00      00:00:00.067  IDENTIFY DEVICE
  00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
  00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
  00 00 00 00 00 00 00 00      00:00:00.030  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

when i got home, i found disk3 set as NOT PRESENT (NP) and disk4 as DISABLED. I stopped the array, powered down and reseated the connections. Upon start up, disk3 was back online, 4 still disabled. I am seeing those errors in the SMART report and have another HD en route while i figure out if it's salvageable, but I'm wondering if I should also be worried about disk3? or was it just maybe a hiccup?

is it OK to run with a disabled disk for however long it'll take to preclear the drive after getting it tomorrow?

June 10, 2014

Could you provide a screenshot of the main tab to clarify the current status?

I did not notice anything in the smart reports to indicate that either disk is actually failing.
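
As a rough illustration of that reading of the reports, the sketch below scans `smartctl -a` attribute rows and reports only non-zero raw values for a few counters often associated with media problems (IDs 5, 197, 198), plus the link-level CRC counter (ID 199). The watch lists, the `check_attributes` name, and the `SAMPLE` rows (copied from the disk4 report above) are assumptions made for illustration, not an unRAID or smartmontools rule.

```python
import re

# Hypothetical watch lists, chosen for illustration only.
MEDIA_IDS = {5, 197, 198}   # reallocated / pending / offline-uncorrectable sectors
LINK_IDS = {199}            # UDMA CRC errors (data damaged on the cable or link)

# One attribute row of `smartctl -a` output:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def check_attributes(report):
    """Return (kind, attribute_name, raw_value) tuples worth a closer look."""
    findings = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, blank line, or non-attribute text
        attr_id, name = int(m.group(1)), m.group(2)
        when_failed, raw = m.group(8), int(m.group(9))
        if when_failed != "-":
            # The drive itself reports that a threshold was crossed.
            findings.append(("failed", name, raw))
        elif attr_id in MEDIA_IDS and raw > 0:
            findings.append(("media", name, raw))
        elif attr_id in LINK_IDS and raw > 0:
            findings.append(("link", name, raw))
    return findings


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
"""

print(check_attributes(SAMPLE))
```

On these disk4 rows the only finding is the UDMA CRC count, which is usually a cabling or controller symptom rather than a sign of failing media.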

It is perfectly possible to continue running with a single drive in a disabled state as unRAID can emulate that drive using a combination of parity and the remaining data drives. However another failure would almost certainly lead to data loss.
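
The emulation described above can be sketched with plain XOR parity. This is a simplified model that assumes a single parity drive holding the byte-wise XOR of all data drives; the drive contents and the `xor_blocks` helper are made up for illustration and are not unRAID code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of the same four-byte stripe on three data drives.
disk1 = b"\x10\x20\x30\x40"
disk3 = b"\x0f\x0e\x0d\x0c"
disk4 = b"\xde\xad\xbe\xef"

# The parity drive holds the XOR of every data drive.
parity = xor_blocks([disk1, disk3, disk4])

# With disk4 disabled, its contents are recomputed from parity plus the
# remaining data drives; this only works while every other drive is readable.
emulated_disk4 = xor_blocks([parity, disk1, disk3])
assert emulated_disk4 == disk4
```

If disk3 also becomes unreadable, the same equation has two unknowns, which is why a second failure during this period risks data loss.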

I think that it is highly likely that disk4 is actually OK, but has been disabled by unRAID because it had a write error (possibly a side-effect of whatever took disk3 offline). Once that happens it stays disabled in unRAID until you take appropriate recovery action. If so it might be possible to recover this without using an additional replacement disk.

Having said that since you already have another disk on the way, the procedure I would recommend is:

Avoid writing new data to the array if you can until you have recovered it to a clean state. This may not be strictly necessary as a clean recovery would include any newly written data, but if any issues arise then this is not guaranteed.
When the new disk arrives remove the 'failed' disk and put it somewhere safe until recovery has finished. That way it is still available for data recovery purposes if anything goes wrong in the normal recovery process.
Pre-clear the new disk as an initial stress-test. Not strictly speaking necessary but does help confirm that the new disk is OK. Has the downside that it takes time and extends the time your array is in an unprotected state.
Follow the unRAID process for rebuilding a failed disk onto the new replacement disk. If any issues arise then check back here for advice.
If no issues arise then the old 'failed' disk can be considered a potential spare. I would try pre-clearing it to see if that completes without errors. If it does then the disk is almost certainly fine and you can then either keep the drive as a spare against another disk reporting issues, or add it to the array as an additional data disk (assuming you have space in your box and your unRAID license permits this).

June 10, 2014

Author

please see attached.

so nothing particularly bad in those SMART reports? good to know. i am curious what would make disk3 just drop off like that. disk4 i can understand, as i see the errors in the SMART report.

as far as losing my data.. if disk3 craps out again before i've replaced my disk4 and am in a clean state, do you know if i can plug an unraid disk into another computer (windows) and move the data from there back to the array? will it recognize the filesystem?

June 10, 2014

please see attached.

so nothing particularly bad in those SMART reports? good to know. i am curious what would make disk3 just drop off like that. disk4 i can understand, as i see the errors in the SMART report.

No idea. Could be a cabling issue, something that upset the controller card, a power glitch.

as far as losing my data.. if disk3 craps out again before i've replaced my disk4 and am in a clean state, do you know if i can plug an unraid disk into another computer (windows) and move the data from there back to the array? will it recognize the filesystem?

One of the big strengths of unRAID is that each disk is a complete free-standing file system so you can take a disk out of the array and read it elsewhere. As disks rarely fail from a physical perspective this is a huge advantage in terms of data recovery.

On Windows you need a tool that can understand the ReiserFS file system. I think DiskInternals provides a tool (Linux Reader) that can do this, but I do not have the details to hand. Another option is to boot the PC off a Linux 'live' CD so that you can get into a Linux environment with support for ReiserFS built in.

June 10, 2014

Author

so when they say if you lose more than 1 drive you will lose the data on those drives.. that's only if those drives actually die 100%.. otherwise you can still recover the data and move it back to the array?

June 11, 2014

so when they say if you lose more than 1 drive you will lose the data on those drives.. that's only if those drives actually die 100%.. otherwise you can still recover the data and move it back to the array?

Yep. The drives are all individually formatted and readable with a reiserfs capable system. If the drive will spin up and mount, the chances of recovering most or all of your data are very high as long as you follow generally recommended drive recovery practices.

June 11, 2014

Author

one last (probably) question..

unraid is up and running all good. i have the new drive preclearing in a different slot right now.

when that's done, can i just stop the array, unassign disk4 (possibly bad drive) and reassign the new precleared disk4 to that assignment? rebuild. or do i need to reboot in there somewhere? i'm not removing any drives or anything because once the new disk4 is rebuilt i'd like to try preclearing the old disk4 and see if i can bring it back to life either add it back to the array, or keep it as a spare..

June 11, 2014

one last (probably) question..

unraid is up and running all good. i have the new drive preclearing in a different slot right now.

when that's done, can i just stop the array, unassign disk4 (possibly bad drive) and reassign the new precleared disk4 to that assignment? rebuild. or do i need to reboot in there somewhere? i'm not removing any drives or anything because once the new disk4 is rebuilt i'd like to try preclearing the old disk4 and see if i can bring it back to life either add it back to the array, or keep it as a spare..

Yes, this is correct. Stop the array, change the drive assignment of the drive in question to the new drive. UnRaid should tell you that when you start the array it will rebuild the disk, then start the array. I tend to let it do its thing without heavy use of the array, but I am probably overly cautious.

June 11, 2014

I have found that sometimes (particularly if the old drive is still physically in the server) it is a good idea to start the array after unassigning the drive, and then stop the array and assign the new drive before continuing with the rebuild.

Never confirmed whether this is strictly speaking necessary but it does seem to make sure that unRAID has forgotten about the old drive before you assign the new one.

June 11, 2014

Author

i probably have about 24 hours from now before the 3rd preclear cycle is done, in the meantime i just want to verify the steps..

1. stop array

2. unassign drive i want to replace (but will remain in the system for now as unassigned to preclear and either add back to the array or keep as a spare)

3. start array.. it should start OK but with disk4 missing (is this correct?)

4. stop array.

5. assign newly precleared disk to disk4 slot.

6. hit start or equivalent to start data-rebuild on the drive.

June 11, 2014

Looks good to me.

June 12, 2014

Author

had the issue happen again. not completely but getting a lot of read errors on disk3.. still up and green but when i access it via share it is coming up empty.

Jun 11 13:20:10 Tower kernel: ata2.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
Jun 11 13:20:10 Tower kernel: ata2: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Jun 11 13:20:10 Tower kernel: ata2.00: failed command: READ DMA
Jun 11 13:20:10 Tower kernel: ata2.00: cmd c8/00:08:57:fe:49/00:00:00:00:00/e4 tag 0 dma 4096 in
Jun 11 13:20:10 Tower kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun 11 13:20:10 Tower kernel: ata2.00: status: { DRDY }
Jun 11 13:20:10 Tower kernel: ata2: hard resetting link
Jun 11 13:20:10 Tower kernel: ata2: controller in dubious state, performing PORT_RST
Jun 11 13:20:12 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:17 Tower kernel: ata2: hard resetting link
Jun 11 13:20:19 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:24 Tower kernel: ata2: hard resetting link
Jun 11 13:20:26 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:26 Tower kernel: ata2.00: disabled
Jun 11 13:20:26 Tower kernel: ata2.00: device reported invalid CHS sector 0
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]  
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]  
Jun 11 13:20:26 Tower kernel: Sense Key : 0xb [current] [descriptor]
Jun 11 13:20:26 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Jun 11 13:20:26 Tower kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Jun 11 13:20:26 Tower kernel:         00 00 00 00 
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]  
Jun 11 13:20:26 Tower kernel: ASC=0x0 ASCQ=0x0
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] CDB: 
Jun 11 13:20:26 Tower kernel: cdb[0]=0x28: 28 00 04 49 fe 57 00 00 08 00
Jun 11 13:20:26 Tower kernel: end_request: I/O error, dev sdc, sector 71958103
Jun 11 13:20:26 Tower kernel: md: disk3 read error, sector=71958040
Jun 11 13:20:26 Tower kernel: ata2: EH complete
Jun 11 13:20:26 Tower kernel: ata2.00: detaching (SCSI 3:0:0:0)
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] Synchronizing SCSI cache
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]  
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] Stopping disk
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc] START_STOP FAILED
Jun 11 13:20:26 Tower kernel: sd 3:0:0:0: [sdc]  
Jun 11 13:20:26 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:21:12 Tower kernel: ata1.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
Jun 11 13:21:12 Tower kernel: ata1: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Jun 11 13:21:12 Tower kernel: ata1.00: failed command: CHECK POWER MODE
Jun 11 13:21:12 Tower kernel: ata1.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Jun 11 13:21:12 Tower kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun 11 13:21:12 Tower kernel: ata1.00: status: { DRDY }
Jun 11 13:21:12 Tower kernel: ata1: hard resetting link
Jun 11 13:21:12 Tower kernel: ata1: controller in dubious state, performing PORT_RST
Jun 11 13:21:14 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:19 Tower kernel: ata1: hard resetting link
Jun 11 13:21:21 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:26 Tower kernel: ata1: hard resetting link
Jun 11 13:21:28 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:21:28 Tower kernel: ata1.00: disabled
Jun 11 13:21:28 Tower kernel: ata1: EH complete
Jun 11 13:21:28 Tower kernel: ata1.00: detaching (SCSI 1:0:0:0)
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb]  
Jun 11 13:21:28 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] Stopping disk
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb] START_STOP FAILED
Jun 11 13:21:28 Tower kernel: sd 1:0:0:0: [sdb]  
Jun 11 13:21:28 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
Jun 11 13:22:50 Tower kernel: md: disk3 read error, sector=197482344
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=732168320
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=768810016
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=768815592
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=549113984
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673899160
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673903024
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=673904400
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=57612720
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1453814560
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=767566792
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=937305640
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=238172048
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=428019760
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=349555864
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=978512992
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=16984392
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=16986072
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=844064360
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1237491808
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1131187520
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=265781248
Jun 11 13:23:30 Tower kernel: md: disk3 read error, sector=1294680296

full syslog attached. i am also seeing some errors for disk4, but that's currently disabled so i'm not sure if that's something to worry about:

Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:08:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:08:35 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:28 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:09:29 Tower kernel: md: disk3 read error, sector=20438296
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk4/user (5) Input/output error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error
Jun 11 14:09:29 Tower kernel: REISERFS error (device md4): zam-7001 reiserfs_find_entry: io error

after some mild detective work, i found that both disk3 and disk4 (the two i'm having issues with) are on the same SATA card. i'm thinking it's not likely that both sata cables went bad at the same time. should i order a new card?
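
One way to do this kind of detective work on a long syslog is to tally which ATA ports dropped their link and which array disks logged read errors. The sketch below is a minimal example that assumes the log line formats shown in the excerpts above; the `summarize` function and the four-line `SAMPLE` are illustrative only.

```python
import re
from collections import Counter

# Patterns taken from the kernel messages quoted in this thread.
LINK_DOWN = re.compile(r"kernel: (ata\d+): SATA link down")
MD_READ_ERR = re.compile(r"kernel: md: (disk\d+) read error, sector=(\d+)")


def summarize(syslog):
    """Count SATA link-down events per port and md read errors per disk."""
    link_down = Counter()
    read_errors = Counter()
    for line in syslog.splitlines():
        if (m := LINK_DOWN.search(line)):
            link_down[m.group(1)] += 1
        elif (m := MD_READ_ERR.search(line)):
            read_errors[m.group(1)] += 1
    return link_down, read_errors


SAMPLE = """\
Jun 11 13:20:12 Tower kernel: ata2: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:20:26 Tower kernel: md: disk3 read error, sector=71958040
Jun 11 13:21:14 Tower kernel: ata1: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun 11 13:22:50 Tower kernel: md: disk3 read error, sector=197482344
"""

link_down, read_errors = summarize(SAMPLE)
print(dict(link_down))
print(dict(read_errors))
```

Several ports losing their link within about a minute of each other would be consistent with a shared component such as the controller card or power, though the counts alone cannot confirm which.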

currently have this card: http://www.monoprice.com/Product?c_id=104&cp_id=10407&cs_id=1040702&p_id=2530&seq=1&format=2

EDIT: for now i have moved disk3 to another slot and will see if the issue returns.. that card is only like $15 shipped so swapping it out is no issue at all.

syslog-2014-06-12.zip

June 12, 2014

Author

Okay well there goes that idea. disk3 still spitting the same errors even after moving to a different slot (different card, sata cable, port on backplane) so it looks like disk 3 is definitely having issues.

I'm at a loss for what to do. Disk4 is still disabled due to a write error i'm assuming based on what i've read, and something is definitely up with disk3. I have a hdd preclearing (had to restart, about 3 days left for 3 cycles, 2tb).

Thoughts on re-enabling disk4 and hoping for the best while I disable disk3 somehow, then, while praying I don't lose disk4 again, finishing the preclear and replacing disk3 instead of disk4?

June 12, 2014

Post smart reports for the 2 drives.

June 12, 2014

Author

these are reports from a couple days ago.. i can rerun them if needed, but would require a restart since disk3 is not being recognized properly right now.

disk3

Statistics for /dev/sdf SAMSUNG_HD753LJ_S13UJDWQ601930
smartctl -a -d ata /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWQ601930
LU WWN Device Id: 5 0000f0 003069103
Firmware Version: 1AA01112
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Tue Jun 10 00:20:36 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(11558) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 193) minutes.
Conveyance self-test routine
recommended polling time: 	 (  21) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   067   067   011    Pre-fail  Always       -       10600
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       3786
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       45080
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       132
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       253
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   050   000    Old_age   Always       -       23 (Min/Max 23/23)
194 Temperature_Celsius     0x0022   075   050   000    Old_age   Always       -       25 (Min/Max 23/25)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       19704
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

disk4

Statistics for /dev/sde SAMSUNG_HD204UI_S2H7JD2ZB02704
smartctl -a -d ata /dev/sde
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD2ZB02704
LU WWN Device Id: 5 0024e9 0044e3cf9
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jun 10 00:21:06 2014 EDT

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(19440) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 324) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1530
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9748
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2429
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12741
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       131
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       26111667
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   Always       -       28 (Min/Max 14/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2449

SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:00.115  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:00.115  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:00.115  IDENTIFY DEVICE
  00 00 01 01 00 00 00 00      00:00:00.114  NOP [Abort queued commands]

Error 2 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.107  IDENTIFY DEVICE
  00 00 01 01 00 00 40 00      00:00:00.107  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.105  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.097  NOP [Abort queued commands]
  00 00 01 01 00 00 40 00      00:00:00.095  NOP [Abort queued commands]

Error 1 occurred at disk power-on lifetime: 12740 hours (530 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 01 00 00 00 e0 00      00:00:00.067  IDENTIFY DEVICE
  00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
  00 00 00 00 00 00 00 00      00:00:00.067  NOP [Abort queued commands]
  00 00 00 00 00 00 00 00      00:00:00.030  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

June 12, 2014

Disk 3 has a runtime bad block. It's unusual to get one, and it doesn't sound too good.

Disk 4 has 3 UDMA CRC errors, which are usually associated with bad cabling at some point in its checkered past. Since this counter never resets to zero, it should just hold steady going forward if the cabling is now good.

I'd suggest running them again. Use the command arguments "-a -A" instead of "-a -d ata".

June 12, 2014

Author

disk3

root@Tower:~# smartctl -a -A /dev/sdm
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWQ601930
LU WWN Device Id: 5 0000f0 003069103
Firmware Version: 1AA01112
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7, ATA8-ACS T13/1699-D revision 3b
Local Time is:    Thu Jun 12 15:47:04 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (11558) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 193) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   065   065   011    Pre-fail  Always       -       11190
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       3791
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       45142
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       134
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       255
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       2
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   050   000    Old_age   Always       -       30 (Min/Max 28/30)
194 Temperature_Celsius     0x0022   069   050   000    Old_age   Always       -       31 (Min/Max 28/31)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       163166
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

disk4

root@Tower:~# smartctl -a -A /dev/sdl
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD2ZB02704
LU WWN Device Id: 5 0024e9 0044e3cf9
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Jun 12 15:48:06 2014 EDT

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (19440) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 324) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1530
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9701
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2431
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12803
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       133
181 Program_Fail_Cnt_Total  0x0022   099   099   000    Old_age   Always       -       26111667
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   053   000    Old_age   Always       -       29 (Min/Max 14/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       16
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2451

SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9 occurred at disk power-on lifetime: 12803 hours (533 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:52.002  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:52.002  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:52.002  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:52.002  IDENTIFY DEVICE
  00 00 01 01 00 00 40 00      00:00:52.002  NOP [Abort queued commands]

Error 8 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 00 00 00 e0  Error: ICRC, ABRT 8 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00      00:00:00.352  READ DMA
  27 00 00 00 00 00 e0 00      00:00:00.352  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:00.352  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:00.352  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:00.352  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 7 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.157  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:00.157  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:00.157  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:00.157  IDENTIFY DEVICE
  00 00 01 01 00 00 00 00      00:00:00.157  NOP [Abort queued commands]

Error 6 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.150  IDENTIFY DEVICE
  00 00 00 00 00 00 00 00      00:00:00.150  NOP [Abort queued commands]
  00 00 00 00 00 00 00 00      00:00:00.148  NOP [Abort queued commands]
  60 00 08 00 00 00 40 00      00:00:00.000  READ FPDMA QUEUED
  60 00 08 00 00 00 40 00      00:00:00.117  READ FPDMA QUEUED

Error 5 occurred at disk power-on lifetime: 12788 hours (532 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 00      00:00:00.110  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      00:00:00.110  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:00.110  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:00:00.110  IDENTIFY DEVICE
  00 00 01 01 00 00 00 00      00:00:00.110  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

June 12, 2014

Author

of the two, which drive do you think is worse? disk4 has been disabled so i can't really tell what issues it is still having, but disk3 just drops off the face of the earth.. i'm thinking i should either:

1. reenable disk4 and hope for the best

2. preclear a disk and rebuild disk3.

3. preclear another disk (will order) and rebuild 4 shortly after

4. either junk or try to preclear disk 3 and 4 again to keep as spare

or

1. remove both disk3 and 4 (new config i'm guessing?)

2. rebuild parity

3. add new disk that i'm currently preclearing (when it's done of course)

4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)

5. either junk or try to preclear disk 3 and 4 again to keep as spare

i'm leaning towards the 2nd option unless there's a better way.

June 12, 2014

No smoking gun from smart reports

Disk3

Calibration retries 132 -> 134

Runtime bad blocks 1 -> 2

Disk 4

G-sense error rate 2 -> 2

UDMA CRC errors 3 -> 16

Multi-zone error 11710 -> 11710
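
For anyone who wants to repeat this before/after comparison without eyeballing two full reports, here is a rough Python sketch. It assumes the attribute rows look like the smartctl 6.2 tables posted in this thread, and the names `raw_values` and `changed` are made up for this example. The sample rows are copied from the two disk4 reports.

```python
import re

# One row of the smartctl attribute table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def raw_values(report):
    """Map attribute name to the leading integer of its RAW_VALUE column."""
    values = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values

def changed(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    old, new = raw_values(before), raw_values(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }

# Three rows from the June 10 disk4 report.
DISK4_JUN10 = """\
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
"""

# The same three rows from the June 12 disk4 report.
DISK4_JUN12 = """\
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       16
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       11710
"""

print(changed(DISK4_JUN10, DISK4_JUN12))
```

On those three rows it reports only UDMA_CRC_Error_Count, going from 3 to 16. Feeding it the complete reports would also list counters that always move, such as Power_On_Hours, so those need to be ignored by eye.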

The UDMA CRC errors are indicative of a cabling problem (you should replace the cable on disk 4). I have little experience with Samsung drives and am not sure about the calibration retries or the runtime bad blocks. I know drives recalibrate as heat rises, so maybe that's normal, but the count went up by 2 in a very short time. The runtime bad blocks sound bad, but I'm not sure what they mean.
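
As a side note on the cabling theory: the error log entries for disk 4 all show ER = 84 and ST = 51, and smartctl itself labels Error 8 as "ICRC, ABRT". Below is a small sketch of how those register bytes break down into flags. It assumes the classic ATA bit assignments, the tables are deliberately partial, and the `decode` helper is hypothetical.

```python
# Partial bit tables for the ATA error and status registers (assumed
# classic ATA assignments; bits not listed here are simply ignored).
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error during an Ultra DMA transfer
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]

STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete (obsolete in later ATA revisions)
    (0x08, "DRQ"),   # data request
    (0x01, "ERR"),   # error bit: the error register holds the details
]

def decode(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]

# ER and ST as they appear in the disk 4 error log.
print(decode(0x84, ERROR_BITS))   # ['ICRC', 'ABRT']
print(decode(0x51, STATUS_BITS))  # ['DRDY', 'DSC', 'ERR']
```

ICRC is an error on the link between the controller and the drive rather than on the platters, which fits with the climbing UDMA_CRC_Error_Count.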

Some hard-to-diagnose drive behavior turns out to be caused by a power splitter. I would check that.

I'd suggest running the SMART long tests on the drives, or locating a Samsung-specific diagnostic disk. If the drives are failing, those tests might fail and confirm it.
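
A rough sketch of setting up the long test, assuming smartctl's `-t long` option and the device names from the June 12 reports (/dev/sdm for disk3, /dev/sdl for disk4; the letters changed between boots, so check them first). The helper names are invented for this example. It only builds the command line and reads the drive's own time estimate from a pasted report; it does not run anything. Keep in mind the firmware warning smartctl prints for the HD204UI before touching disk 4.

```python
import re

# Excerpt from the disk4 report posted in this thread.
DISK4_EXCERPT = """\
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 324) minutes.
"""

def extended_test_minutes(report):
    """Read the drive's estimate for the extended (long) self-test, in minutes."""
    match = re.search(
        r"Extended self-test routine\s+"
        r"recommended polling time:\s*\(\s*(\d+)\)\s*minutes",
        report,
    )
    return int(match.group(1)) if match else None

def long_test_command(device):
    """Build (but do not run) the smartctl command that starts a long self-test."""
    return ["smartctl", "-t", "long", device]

print(" ".join(long_test_command("/dev/sdl")))
print(extended_test_minutes(DISK4_EXCERPT), "minutes")
```

Once the estimated time has passed, the result should show up in the self-test log (`smartctl -l selftest` on the same device). The drive has to stay spun up for the whole test.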

750G is pretty small in today's world. You might want to trade up to something bigger anyway. 2T is bigger and maybe fixing the cabling will get it working.

June 12, 2014

Author

at least for disk3 i know it is not a sata cable issue. i moved it to a different slot in my case, so that's a different SATA card and port and different power and sata port on the backplane. i will try the same with disk4, but i'm thinking i should remove disk 3 and 4 (see post above yours)

1. remove both disk3 and 4 (new config i'm guessing?)

2. rebuild parity

3. add new disk that i'm currently preclearing (when it's done of course)

4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)

5. either junk or try to preclear N times disk 3 and 4 again to keep as spare

June 13, 2014

Hi, long time no post. I don't want to hijack the thread but I'm having a similar issue. One of my drives is not being detected at all and another has come up as unformatted. I'm currently moving files off a spare 2TB drive to replace the undetected one (hoping it's not a dead port, I know it's not a cable issue) but not sure how to proceed after that. I know I will need to preclear any replacement disk(s) but then what? I'm still on 4.7 Plus, there doesn't seem to be a sub-forum for that version. Happy to post my own thread so as not to confuse the issue the OP is having, let me know.

I would suggest that starting a new thread is a good idea, so that things do not get confusing in terms of responses/advice.

A disk simply coming up as unformatted tends to mean that it failed to mount and there is some sort of file system corruption. The data on it is probably intact, but the problem mounting can only be fixed by running reiserfsck against the drive.

You want to take a structured approach to fixing these issues to minimize any chance of losing data. As long as there is a spare disk available I like to take an approach that means I can put a problem disk aside while I work on trying to recover its data on to another disk. That way if anything goes wrong I still have the problem disk in the state that it was when the problem occurred to attempt data recovery against.

My suggestion (I would be interested to see what others think) would be an approach along the lines of:

1. Recover the disk that is currently not being detected onto the new drive. When that finishes, test the drive that has been replaced to see if it really has a fault. If it tests out OK it can become a new 'spare' drive.

2. You need to try and recover the drive showing as unformatted. In an ideal world I would first try and rebuild onto a spare disk to keep the original one unchanged until recovery has finished. However, if there is no spare drive, you can do it against the current drive, as long as you have the array in maintenance mode and run reiserfsck against the relevant /dev/disk?? device (to maintain parity). This could possibly be done before trying to recover the 'faulty' disk mentioned above.

June 13, 2014

Thanks Itimpi, I have started a new thread here :- http://lime-technology.com/forum/index.php?topic=33745.0

Have deleted my previous post from this thread and will take further discussion over to my thread.

June 13, 2014

Author

1. remove both disk3 and 4 (new config i'm guessing?)

2. rebuild parity

3. add new disk that i'm currently preclearing (when it's done of course)

4. connect these disks to another computer and copy the data to the array (i will have enough space for the contents of both these drives after the one i'm preclearing now is done)

5. either junk or try to preclear N times disk 3 and 4 again to keep as spare

basically i'm thinking of doing this: http://blog.ktz.me/?p=243

will this not work? the data on these 2 disks is not super important, but of course i'd rather not lose it. as it is right now, none of my user shares are accessible (they come up as empty. this only happens when disk3 starts seeing the read errors; after a reboot, before disk3 fails, it's fine), so it's more important to get the array and the other working disks back online. i'll troubleshoot disk3 and 4 further when i'm back to a workable situation.

EDIT 6/13/2014 @ 9:51p: i already did this. it's rebuilding parity now. hopefully i made the right choice. once this is done i'll try to copy the stuff over from the other 2 drives, either by putting them in another computer or by mounting them and copying over with mc like the tutorial above (any suggestions?).

after that i'll start preclearing the other 2 drives, moving them around to see what comes of that.
