June 2, 2013 Almost a month ago to the day, I had an HDD fail for the 2nd time. I decided the disk was bad and tossed it. The new drive was working great but failed again today. I'm guessing it's not very likely that 2 separate drives failing in the exact same slot is just a coincidence. Both times it was ReiserFS errors. I am not near my computer now, but it is currently rebuilding the same drive again. I will see if I can post some syslogs later. I followed the check file systems section of the wiki and found 0 errors on /dev/md2. In any case, what are the best steps to isolate the issue? I guess SATA/power cables are cheap enough, but are there any other things I can do in the meantime to figure out exactly what the problem is?
June 2, 2013 Are you using high-quality locking SATA cables? You could also have an issue with unstable power. The Antec Neo units are "okay", but not really high-quality power supplies. You may want to replace yours with a Corsair HX series or a Seasonic X series unit. Finally, do you have a UPS? Unscheduled power outages, and even brownouts or spikes, can also cause a lot of issues. A good UPS with AVR can eliminate all of these problems.
June 2, 2013 Author No UPS, but that's something to consider. Would the logs be able to tell me if I'm experiencing power issues? I'm leaning more towards cabling issues due to the fact it's always the same slot. With my case I yank out the old drive and pop in a new one, so if something was loose I wouldn't know.
June 2, 2013 Look at the SMART data for the disk and see if the Power-Off Retract Cycle count is non-zero. These are caused by non-normal drive shutdowns due to loss of power. If the count is non-zero, it's a good indication that you're having some power issues.
June 2, 2013 Author Thanks. I'm assuming that's the long SMART report?
June 2, 2013 Just the SMART status report. You don't need to run either of the SMART tests (Short or Long). The status report will show the Power-Off Retract Cycle counts.
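For anyone who would rather pull that one number out of the status report than read the whole thing, here is a minimal Python sketch. It assumes the ten-column attribute table that smartctl prints (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value). The function name `raw_values` is made up for illustration, and the sample rows are copied from the report posted in this thread.

```python
def raw_values(report_text, wanted_ids=(192, 199)):
    """Return {attribute_name: raw_value} for the requested SMART attribute IDs.

    Assumes each attribute row has the ten whitespace-separated columns
    smartctl prints: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE, and that the raw value is a plain integer.
    """
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in wanted_ids:
            # The raw value sits in the 10th column.
            found[fields[1]] = int(fields[9])
    return found


# Rows taken from the SMART report in this thread (drive /dev/sdd).
SAMPLE = """\
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(raw_values(SAMPLE))
```

To run it against a live drive, the text could come from the output of `smartctl -A /dev/sdd` instead of the pasted sample. Attribute 199 is included by default because a rising CRC error count generally points at the cable or connector rather than the disk itself.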
June 2, 2013 An FYI: the Power-Off Retract Cycle count will also increment if the heads are not 'parked' before the computer is turned off. It is not necessarily an accurate indicator of power or drive problems. For the OP, run a memtest for 24 hours on the system. Make sure the power and the data cables are not taut or being stressed in any way. If you're using a power splitter to increase the number of drive power connectors, be sure it is of good quality and not the cause of your problem. Change the cables if necessary. If you are in a humid environment, such as a basement, any loss of the gold flashing on the data connectors will lead to oxidation and signal integrity issues. Best to keep a few new cables handy.
June 2, 2013 Author I'm attaching my SMART report below. Can anyone assist in deciphering it? I do see a value of 6 for the Power-Off_Retract_Count, as someone mentioned earlier.

Statistics for /dev/sdd WDC_WD20EARX-00PASB0_WD-WCAZAL574391

smartctl -a -d ata /dev/sdd
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WCAZAL574391
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 2 18:52:32 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)  Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)  The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
Total time to complete Offline
data collection:                 (37800) seconds.
Offline data collection
capabilities:                    (0x7b)  SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         Conveyance Self-test supported.
                                         Selective Self-test supported.
SMART capabilities:            (0x0003)  Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:        (0x01)  Error logging supported.
                                         General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035)  SCT Status supported.
                                         SCT Feature Control supported.
                                         SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   177   021    Pre-fail  Always       -       6108
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       903
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1683
194 Temperature_Celsius     0x0022   116   115   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The first parity check (NOCORRECT) after re-enabling the drive found 486 errors. I've rebooted and I'm running another parity check (NOCORRECT); hopefully it comes up clean this time. It looks like I have to reboot and run the memtest outside of unRAID, which is not possible right now as I need the array. I will try to do it very soon, though. I still think the exact same slot having the issue with 2 different drives can't be a coincidence. I was thinking of moving the disk to another slot, as I have plenty of unused but already connected bays. That would tell me whether or not it is a cabling issue with that specific slot. I would then leave the suspect slot unused until it is repaired, assuming the drive no longer causes problems in its new slot. Thoughts?
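On the earlier question of whether the logs can help: one way to back up the slot-swap test is to count SATA link errors per port in the syslog before and after moving the drive. The Python sketch below is only an illustration. It assumes the kernel's libata messages carry an `ataN` port prefix, the keyword list is a guess and certainly incomplete, and the sample lines are made-up examples in that style, not taken from the poster's syslog. The name `errors_per_port` is hypothetical.

```python
import re
from collections import Counter

# Port prefix that libata puts on its kernel messages, e.g. "ata3:" or "ata3.00:".
ATA_PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")

# Phrases that tend to accompany link or transfer trouble (assumed, incomplete).
TROUBLE = ("exception", "hard resetting link", "failed command", "link is slow")


def errors_per_port(log_lines):
    """Count suspicious libata messages per ATA port."""
    counts = Counter()
    for line in log_lines:
        match = ATA_PORT.search(line)
        if match and any(phrase in line for phrase in TROUBLE):
            counts[match.group(1)] += 1
    return counts


# Illustrative lines only; real syslog content will differ.
SAMPLE_LOG = [
    "kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen",
    "kernel: ata3: hard resetting link",
    "kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "kernel: ata3.00: failed command: READ DMA EXT",
]

if __name__ == "__main__":
    for port, count in errors_per_port(SAMPLE_LOG).most_common():
        print(port, count)
```

If the errors stay on the same port after the drive moves to a different bay, that points at the cable, backplane, or controller port for that slot. If they follow the drive, the drive or its power feed is the more likely suspect. Which `ataN` number corresponds to which physical bay has to be worked out on the machine itself.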