Disk disabled (twice)

December 30, 2009

Hi -- I'm having some problems and not sure what to do.

On Dec25th I received an unRAID Status: Array fault email. The details show that Disk 12 is Not-Reported.

Disk 12 [sdp]: Not-Reported (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630)
diskNumber.12=12
diskName.12=md12
diskState.12=4
diskSize.12=976762552
diskModel.12=WDC WD10EACS-00D6B0
diskSerial.12= WD-WCAU40183630
diskNumReads.12=244556612
diskNumWrites.12=96
diskNumErrors.12=2
diskId.12=ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630
rdevNumber.12=12
rdevStatus.12=DISK_DSBL
rdevName.12=sdp
rdevSize.12=976762552
rdevModel.12=WDC WD10EACS-00D6B0
rdevSerial.12= WD-WCAU40183630
rdevId.12=ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630
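
For anyone who wants to scan these status emails automatically, here is a minimal Python sketch that groups the `key.N=value` lines by disk number and lists the disks whose `rdevStatus` is `DISK_DSBL`. The function names `parse_status` and `disabled_disks` are made up for illustration. The only status value it knows about is the one shown in the email above, so treat it as a starting point rather than a full parser.

```python
def parse_status(text):
    """Group 'key.N=value' lines by disk number N."""
    disks = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip headings and free text
        key, value = line.split("=", 1)
        name, sep, index = key.rpartition(".")
        if not sep or not index.isdigit():
            continue  # not a per-disk field
        disks.setdefault(int(index), {})[name] = value.strip()
    return disks


def disabled_disks(disks):
    """Return the disk numbers whose rdevStatus is DISK_DSBL."""
    return sorted(n for n, fields in disks.items()
                  if fields.get("rdevStatus") == "DISK_DSBL")


# A few lines copied from the status email above.
SAMPLE = """\
diskNumber.12=12
diskName.12=md12
diskNumErrors.12=2
rdevStatus.12=DISK_DSBL
rdevName.12=sdp
rdevSerial.12= WD-WCAU40183630
"""

status = parse_status(SAMPLE)
for n in disabled_disks(status):
    f = status[n]
    print(f"disk {n} ({f['rdevName']}, serial {f['rdevSerial']}): "
          f"disabled, {f['diskNumErrors']} errors")
```

With the sample lines this prints `disk 12 (sdp, serial WD-WCAU40183630): disabled, 2 errors`.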

The unRAID Main screen showed the disk was red and disabled and that there were 2 errors.

Here is the syslog. I had to make it an attachment because it was so long and I couldn't figure out how to make it more manageable. (Since this syslog I have deleted all of the .DS_Store files that were clogging up my syslog.)

http://www.mediafire.com/?zmmt2kmekjw

I had just run a successful parity check less than two days earlier. I tried to run SMART reports on the drive but it was unresponsive. I shut down the server after saving the syslog.

When I rebooted the server, the drive was still disabled (as expected). But now I could run SMART reports against the drive. I ran the short report and here is the output.

root@Tower:~# smartctl  -a  -d  ata  /dev/sdp
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00D6B0
Serial Number:    WD-WCAU40183630
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Dec 27 02:01:03 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (22200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   147   021    Pre-fail  Always       -       6608
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1765
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   173   173   051    Old_age   Always       -       440
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9669
 10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       269
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1728
194 Temperature_Celsius     0x0022   125   110   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
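
A report like the one above is easy to skim past, so here is a rough Python sketch that pulls the attribute table out of saved `smartctl -a` text and lists the raw values that are not zero for a handful of attributes. The `WATCHED` set is my own assumption about which attributes are worth a second look; it does not come from smartctl or from this thread. The helper names are hypothetical, and the parser assumes the ten-column layout shown above.

```python
# Assumption: these are the attributes worth watching; adjust to taste.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(report):
    """Map attribute name -> normalized value, worst, threshold and raw value."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have a hex flag in column 3.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        raw = fields[9]
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": int(raw) if raw.isdigit() else None,
        }
    return attrs


def nonzero_watched(attrs):
    """Return (name, raw) pairs for watched attributes with a raw value above zero."""
    return sorted((name, a["raw"]) for name, a in attrs.items()
                  if name in WATCHED and a["raw"])


# Rows copied from the report above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(parse_attributes(SAMPLE)))
# [('Reallocated_Event_Count', 7)]
```

A non-zero raw value is only a hint, not a verdict. The report above still says PASSED overall.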

I was in too much of a hurry to run the long report. (Next time I'll be more patient.) I couldn't see anything bad in the syslog but I really have no idea what to look for.

I've seen similar behavior before, and it has always meant the cable or connection is bad. So I went to the store and bought all new SATA cables with the locking tab on the clip. I re-wired the entire array with the new cables, making sure that the drives were in the original controller ports.

Then I did the trust your array procedure and the system went through a parity check. This came up with the message "Parity updated 2 times to address sync errors." That doesn't feel right. All the other times I did this there were no errors.
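
To make the "sync errors" message a bit more concrete, here is a toy Python model of a single-parity check. It assumes parity is the XOR of the data disks at each offset, which is my understanding of how single parity works in general. This is not unRAID's actual code, and a real check works on whole blocks rather than single bytes. The function names are made up.

```python
from functools import reduce
from operator import xor


def expected_parity(data_disks):
    """XOR the bytes at each offset across all data disks."""
    return bytes(reduce(xor, column) for column in zip(*data_disks))


def check_and_correct(data_disks, parity):
    """Return (corrected_parity, number_of_offsets_that_disagreed)."""
    expected = expected_parity(data_disks)
    errors = sum(1 for have, want in zip(parity, expected) if have != want)
    return expected, errors


# Three tiny pretend data disks.
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

# Start from correct parity, then corrupt two offsets to mimic stale parity.
stale = bytearray(expected_parity(disks))
stale[0] ^= 0xFF
stale[3] ^= 0x01

fixed, errors = check_and_correct(disks, bytes(stale))
print(errors)  # 2
```

In this model, a count of 2 means two places where the stored parity no longer matched the data disks, and the check rewrote the parity to match.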

After the parity check was done, I rebooted the server and all was good for a little while.

Then I got another email.

Same as before -- Disk 12 is disabled. Same two errors show on unRAID main screen.

Here is a screen shot of my syslog with the only red entries --

http://img96.imageshack.us/img96/81/errorsi.png

Here is that syslog on pastebin

http://pastebin.com/m7d3dff0a

Here's the SMART history log for that drive in pastebin

http://pastebin.com/m3ca42433

Any ideas on what the safe route would be are greatly appreciated. I have a spare new 1TB drive if necessary.

Thanks very much -- especially to all those who created these great monitoring unRAID add-ons.

December 30, 2009

First, try changing the SATA cable.... I've come to suspect them first and the quality control on those generic red cables stinks...

January 3, 2010

Author

Thanks for the reply. I will change the cable but these are not the generic red cables. I bought all new (blue) cables with the locking tabs. Should I replace the cable, then do the trust parity routine, and hope for the best? I'll post the results.

Thanks

January 7, 2010

Author

I replaced the cable and made sure every other connection was tight. I booted and during system POST I saw an error that looked like this:

Port 05: Reset Port Error!!

I think Port 05 is actually SATA06 on my motherboard. SATA06 is of course disk12 - the disabled disk.

I pulled the disk out and tried to view its contents using an external SATA-to-USB adapter on my Ubuntu laptop. The drive would not mount as it normally should.

I hope it's not my motherboard (ABIT AB9 Pro).

Should I try to put a new 1TB in place of the disabled drive and rebuild the drive from parity?

January 7, 2010

Author

I just saw that Joe L. provided this answer in a different post.

http://lime-technology.com/forum/index.php?topic=5027.0

I'll try that first.

Thanks.

January 7, 2010

Author

As an update - the problem was the drive. The new replacement drive has been rebuilt and is working fine. Nice to have the server back up and running.
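
For readers wondering how a replacement drive can be rebuilt from parity, here is a toy Python sketch. It assumes single XOR parity, where each byte of the missing disk equals the parity byte XORed with the same byte on every surviving disk. It is a simplified illustration with a made-up function name, not unRAID's implementation.

```python
from functools import reduce
from operator import xor


def rebuild_missing(surviving_disks, parity):
    """Recreate the one missing disk from parity plus all surviving disks."""
    return bytes(reduce(xor, column)
                 for column in zip(parity, *surviving_disks))


# Three pretend data disks and their parity.
d1, d2, d3 = b"\x01\x02\x03", b"\x10\x20\x30", b"\x0a\x0b\x0c"
parity = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))

# Pretend d2 failed and rebuild it from the rest.
print(rebuild_missing([d1, d3], parity) == d2)  # True
```

This only works while parity is valid and every other disk reads correctly. That is why a clean parity check before a failure matters.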

January 7, 2010

Thanks for the status update. I figure there are at least 4 or 5 loose or bad cables, and 1 or 2 bad power supplies (or bad splitters) for every bad drive reported in these threads. You just happened to be the one with the bad drive.

Glad you are back up and running.

Joe L.
