Troubleshooting bad drive vs bad connections

June 2, 2013

Almost a month ago to the day, I had an HDD fail for the second time. I decided the disk was bad and tossed it. The new drive was working great but failed again today. I'm guessing it's not very likely that two separate drives failing in the exact same slot is just a coincidence. Both times it was ReiserFS errors. I am not near my computer now, but it is currently rebuilding the same drive again. I will see if I can post some syslogs later.

I followed the check file systems section of the wiki and found 0 errors on /dev/md2.

In any case, what are the best steps to isolate the issue? I guess SATA and power cables are cheap enough, but are there any other things I can do in the meantime to figure out exactly what the problem is?


June 2, 2013

Are you using high quality locking SATA cables?

You could also have an issue with unstable power. The Antec Neo's are "okay", but not really high quality units. You may want to replace it with a Corsair HX series or a Seasonic X series unit.

Finally, do you have a UPS? Unscheduled power outages, and even brownouts or spikes, can also cause a lot of issues. A good UPS with AVR can eliminate all of these problems.


June 2, 2013

Author

No UPS, but that's something to consider. Would the logs be able to tell me if I'm experiencing power issues? I'm leaning more towards cabling issues due to the fact that it's always the same slot. With my case, I yank out the old drive and pop in a new one, so if something was loose I wouldn't know.


June 2, 2013

Look at the SMART data for the disk and see if the Power-Off Retract Cycle count is non-zero. These are caused by non-normal drive shutdowns due to loss of power.

If the count is non-zero, it's a good indication that you're having some power issues.


June 2, 2013

Author

Thanks. I'm assuming that's the long SMART report?


June 2, 2013

Just the SMART status report. You don't need to run either of the SMART tests (Short or Long). The status report will show the Power-Off Retract Cycle counts.


June 2, 2013

An FYI, Power-Off Retract Cycle count will also increment if the heads are not 'parked' before the computer is turned off. It is not necessarily an accurate indicator of power or drive problems.

For the OP, run a memtest for 24 hours on the system. Make sure the power and data cables are not taut or being stressed in any way. If you're using a power splitter to increase the number of drive power connectors, be sure it is of good quality and not the cause of your problem. Change the cables if necessary. If you are in a humid environment, such as a basement, any loss of the gold flashing on the data connectors will lead to oxidation and signal integrity issues. It's best to keep a few new cables handy.
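
To help judge whether a cable or port is involved, one option is to count SATA link events per port in the syslog. The sketch below is only an illustration: the message substrings are assumptions based on typical Linux libata kernel output, the sample lines are made up, and the hostname and port numbers in them are hypothetical.

```python
import re
from collections import Counter

# Substrings typical of Linux libata messages when a SATA link misbehaves.
# This list is an illustrative assumption, not an exhaustive catalogue.
PATTERNS = (
    "hard resetting link",
    "link is slow to respond",
    "SATA link down",
    "exception Emask",
    "failed command",
)

# Matches the port prefix, e.g. "ata5:" or "ata5.00:", and captures "ata5".
PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")

def count_link_events(lines):
    """Count suspicious link messages per ATA port."""
    counts = Counter()
    for line in lines:
        if any(p in line for p in PATTERNS):
            m = PORT.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

# Hypothetical sample lines, written for this example only.
SAMPLE_LINES = [
    "Jun  2 10:01:02 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen",
    "Jun  2 10:01:02 Tower kernel: ata5: hard resetting link",
    "Jun  2 10:01:03 Tower kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Jun  2 10:05:00 Tower kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
]

if __name__ == "__main__":
    # For a real log, pass an open file instead, e.g. open("/var/log/syslog")
    # (the path is an assumption and may differ on your system).
    print(count_link_events(SAMPLE_LINES))
```

If the same port keeps collecting events no matter which drive is attached to it, that points toward the cable, backplane, or controller port rather than the drive.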


June 2, 2013

Author

I'm attaching my SMART report below. Can anyone assist in deciphering it? I do see a value of 6 for the Power-Off_Retract_Count, as someone mentioned earlier.

Statistics for /dev/sdd WDC_WD20EARX-00PASB0_WD-WCAZAL574391
smartctl -a -d ata /dev/sdd
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WCAZAL574391
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun  2 18:52:32 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (37800) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   177   021    Pre-fail  Always       -       6108
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       903
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1683
194 Temperature_Celsius     0x0022   116   115   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
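
For anyone reading a report like the one above, here is a minimal sketch that pulls the raw values out of the attribute table and groups the non-zero ones. The grouping of attribute IDs into "cable", "media", and "power" hints is a rule of thumb and an assumption on my part, not a vendor specification. The sample rows are copied from the report above.

```python
import re

# Rule-of-thumb groupings (assumptions, not a vendor specification).
CABLE_HINTS = {199}               # UDMA_CRC_Error_Count
MEDIA_HINTS = {5, 196, 197, 198}  # reallocated, pending, uncorrectable
POWER_HINTS = {192}               # Power-Off_Retract_Count

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def parse_attributes(report):
    """Return {id: (name, raw_value)} for each attribute row in the text."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs

def summarize(attrs):
    """List the names of non-zero attributes in each hint group."""
    def nonzero(ids):
        return sorted(attrs[i][0] for i in ids if i in attrs and attrs[i][1] > 0)
    return {
        "cable": nonzero(CABLE_HINTS),
        "media": nonzero(MEDIA_HINTS),
        "power": nonzero(POWER_HINTS),
    }

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(summarize(parse_attributes(SAMPLE)))
```

On these rows only Power-Off_Retract_Count is non-zero (6, against 16 power cycles). The reallocated, pending, and CRC counts are all 0, so this drive has logged no media errors or interface CRC errors. That alone neither confirms nor rules out a problem with the power connection to that slot.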

The first parity check (NOCORRECT) after re-enabling the drive found 486 errors. I've rebooted and I'm running another parity check (NOCORRECT); hopefully it comes up clean this time.

It looks like I have to reboot and run memtest outside of unRAID, which isn't possible right now as I need the array. I will try to do it very soon, though.

I'm thinking the same exact slot having the issue with two different drives can't be a coincidence. I was thinking of moving the disk to another slot, as I have plenty of unused but already connected bays. That would tell me whether or not it is a cabling issue with that specific slot, as long as I don't use that slot until it is repaired (assuming the drive no longer causes problems). Thoughts?
