December 30, 2009
Hi -- I'm having some problems and I'm not sure what to do. On Dec 25th I received an "unRAID Status: Array fault" email. The details show that Disk 12 is Not-Reported.

Disk 12 [sdp]: Not-Reported (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630)
diskNumber.12=12
diskName.12=md12
diskState.12=4
diskSize.12=976762552
diskModel.12=WDC WD10EACS-00D6B0
diskSerial.12= WD-WCAU40183630
diskNumReads.12=244556612
diskNumWrites.12=96
diskNumErrors.12=2
diskId.12=ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630
rdevNumber.12=12
rdevStatus.12=DISK_DSBL
rdevName.12=sdp
rdevSize.12=976762552
rdevModel.12=WDC WD10EACS-00D6B0
rdevSerial.12= WD-WCAU40183630
rdevId.12=ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630

The unRAID Main screen showed the disk as red and disabled, with 2 errors.

Here is the syslog. I had to make it an attachment because it was so long and I couldn't figure out how to make it more manageable. (Since this syslog I have deleted all of the .DS_Store files that were clogging up my syslog.)
http://www.mediafire.com/?zmmt2kmekjw

I had just run a successful parity check less than two days earlier. I tried to run SMART reports on the drive but it was unresponsive. I shut down the server after saving the syslog.

When I rebooted the server, the drive was still disabled (as expected). But now I could run SMART reports against the drive. I ran the short report and here is the output.
root@Tower:~# smartctl -a -d ata /dev/sdp
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00D6B0
Serial Number:    WD-WCAU40183630
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Dec 27 02:01:03 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (22200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   147   021    Pre-fail  Always       -       6608
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1765
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   173   173   051    Old_age   Always       -       440
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9669
 10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       269
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1728
194 Temperature_Celsius     0x0022   125   110   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I was in too much of a hurry to run the long report. (Next time I'll be more patient.) I couldn't see anything bad in the syslog but I really have no idea what to look for.
I've seen similar behavior before, and it has always meant the cable or connection is bad. So I went to the store and bought all new SATA cables with the locking tab on the clip. I re-wired the entire array with the new cables, making sure that the drives were on their original controller ports. Then I did the "trust your array" procedure and the system went through a parity check. This came up with the message "Parity updated 2 times to address sync errors." That doesn't feel right. All the other times I did this there were no errors.

After the parity check was done, I rebooted the server and all was good for a little while. Then I got another email. Same as before -- Disk 12 is disabled. The same two errors show on the unRAID main screen.

Here is a screenshot of my syslog with the only red entries --
http://img96.imageshack.us/img96/81/errorsi.png

Here is that syslog on pastebin:
http://pastebin.com/m7d3dff0a

Here's the SMART history log for that drive on pastebin:
http://pastebin.com/m3ca42433

Any ideas on what the safe route would be are greatly appreciated. I have a spare new 1TB drive if necessary.

Thanks very much -- especially to all those who created these great unRAID monitoring add-ons.
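For anyone else unsure what to look for in a SMART report like the one above, here is a minimal Python sketch that scans the attribute rows and lists the ones with a non-zero raw value. The watch list, the `flag_attributes` name, and the regular expression are assumptions made for illustration; they are not part of smartmontools or unRAID, and the pattern only assumes the column layout shown in this thread. A flagged attribute is a reason to look closer (for example by running the long self-test), not proof that the drive is failing.

```python
import re

# Attributes whose non-zero RAW_VALUE often hints at media or link trouble.
# This watch list is an assumption for illustration, not an official rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Rows copied from the smartctl -a report posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(report):
    """Return (name, raw_value) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skip headers and any line that is not an attribute row
        name, raw = match.group(2), int(match.group(6))
        if name in WATCHED and raw != 0:
            flagged.append((name, raw))
    return flagged


if __name__ == "__main__":
    for name, raw in flag_attributes(SAMPLE):
        print(f"{name}: raw value {raw}")
```

On the rows above, the only watched attribute with a non-zero raw value is Reallocated_Event_Count (7); the reallocated, pending, and CRC counters are all 0.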
December 30, 2009
First, try changing the SATA cable.... I've come to suspect them first, and the quality control on those generic red cables stinks...
January 3, 2010 (Author)
Thanks for the reply. I will change the cable, but these are not the generic red cables. I bought all new (blue) cables with the locking tabs. Should I replace the cable, then do the trust parity routine, and hope for the best? I'll post the results. Thanks
January 7, 2010 (Author)
I replaced the cable and made sure every other connection was tight. I booted, and during system POST I saw an error that looked like:

Port 05: Reset Port Error!!

I think Port 05 is actually SATA06 on my motherboard. SATA06 is of course disk12 - the disabled disk. I pulled the disk out and tried to view its contents using an external SATA-to-USB adapter on my Ubuntu laptop. The drive would not mount as it normally should. I hope it's not my motherboard (ABIT AB9 Pro). Should I try to put a new 1TB drive in place of the disabled drive and rebuild it from parity?
January 7, 2010 (Author)
I just saw that Joe L. provided this answer in a different post.
http://lime-technology.com/forum/index.php?topic=5027.0
I'll try that first. Thanks.
January 7, 2010 (Author)
As an update - the problem was the drive. The new replacement drive has been rebuilt and is working fine. Nice to have the server back up and running.
January 7, 2010
Thanks for the status update. I figure there are at least 4 or 5 loose or bad cables, and 1 or 2 bad power supplies (or bad splitters), for every bad drive reported in these threads. You just happened to be the one with the bad drive. Glad you are back up and running.
Joe L.