Hard drive disconnecting


Hi,

 

I have a hard drive that keeps on disconnecting, but I am not sure what is going wrong.

 

Essentially, disk2 gets disconnected and disappears from the array. When I then look for the drive in the terminal, it is not there (i.e. it should be at /dev/sdf, but that is no longer a valid device). I cannot stop the array, because it keeps trying to unmount and sync the drive, which fails because the drive isn't there, so I have to force a shutdown.

 

After I boot the computer back up, the drives are back. For good measure, I also checked all the power and data cables. I checked the SMART health of the drives as well; they all passed, even after a long self-test.

 

In my syslog, these messages appear when the drive fails:

Jun  3 04:40:04 Tower kernel: mdcmd (28958): spindown 0
Jun  3 04:40:04 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
Jun  3 04:40:04 Tower kernel: mdcmd (28959): spindown 2
Jun  3 04:40:04 Tower emhttp: mdcmd: write: No such device or address
Jun  3 04:40:14 Tower emhttp: mdcmd: write: Input/output error
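For reference, the -5 in the "ioctl error" line is a negated kernel error number. A minimal Python sketch, assuming a Linux machine where the standard errno values apply, that decodes it alongside the "No such device or address" message from emhttp:

```python
import errno
import os

# The kernel reports failures as negated errno values, so -5 means errno 5.
ioctl_result = -5
code = -ioctl_result

# errno.errorcode maps the number back to its symbolic name.
name = errno.errorcode[code]

# os.strerror gives the platform's human-readable text for the code.
print(code, name, os.strerror(code))

# "No such device or address" is the usual text for ENXIO.
print(errno.ENXIO, errno.errorcode[errno.ENXIO], os.strerror(errno.ENXIO))
```

Both codes are consistent with the kernel no longer being able to talk to the device at all, rather than the drive reporting bad sectors.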


 

I am not sure what the problem could be.

The only other thing I could think of is the power supply failing, but I'm not sure why it would fail on only one particular drive.

 

Thanks for the help

Link to comment

Hi,

 

So this is the SMART history report.

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 10 09:10:53 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (37800) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       1008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1731
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6170
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1549
194 Temperature_Celsius     0x0022   111   105   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       7
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6169         -
# 2  Extended offline    Completed without error       00%      6167         -
# 3  Short offline       Completed without error       00%      6161         -
# 4  Extended offline    Aborted by host               90%      6160         -
# 5  Short offline       Aborted by host               60%      6160         -
# 6  Extended offline    Aborted by host               20%      6160         -
# 7  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

I have not added anything to my hardware.

Link to comment

Are you power cycling your server frequently? 

 

These two parameters seem to indicate frequent loss of power to the drive:

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      292

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      208

 

I don't know about your specific drive, but usually a "Power-Off-Retract" is in response to an unexpected loss of power.
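To make that comparison concrete, here is a minimal Python sketch that pulls the raw values of attributes 12 and 192 out of smartctl's attribute table and relates them. The two rows are copied from the report earlier in this thread; the regular expression assumes the ten-column layout shown there, and treating every retract as an unexpected power loss is only a rough assumption, since drives differ in how they count it.

```python
import re

# Two attribute rows copied from the SMART report earlier in this thread.
SMART_ROWS = """\
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208
"""

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(text):
    """Return {attribute_id: raw_value} for every attribute row in text."""
    values = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(3))
    return values


raw = raw_values(SMART_ROWS)
power_cycles = raw[12]
retracts = raw[192]
share = retracts / power_cycles
print(f"{retracts} power-off retracts over {power_cycles} power cycles ({share:.0%})")
```

A ratio this high suggests that most power cycles were not clean shutdowns, which would fit either repeated hard resets or a power or connection problem.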

 

Joe L.

Link to comment

I was power cycling at one point. I was trying to build a Slackware-based unRAID and power cycled quite a bit during that time because I wasn't able to build a proper kernel, but I am using the regular version of unRAID now.

 

Anyway, the hard drives are quite new, about a year old, perhaps 14 months. I want to RMA this one, but I don't know what to file it as.

Link to comment

The drive looks completely fine, no reason at all to RMA it.  And from the tiny bit of syslog info above, one or more drives have been disabled, which is usually true of a "loss of contact" situation.  This usually implies the drives are fine, but you have some sort of drive interface issue(s), which could be cabling, loose backplanes, power problems, bad disk controller, etc.  There should be a LOT more error messages in that syslog, and it would be very helpful to us if you would please post it here.

Link to comment

So, I figured out part of the problem. I had a faulty SATA-PCI card.

I have since replaced it, because I could not boot with my HDDs plugged into it, and it just so happened that the disks giving me problems were the ones connected to it.

 

Now I am having a different issue. My disk2 has a red ball next to it, and when I started the array, it was orange.

 

This is the smart history:

smartctl -a -d ata /dev/sdf (disk2)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun 18 10:40:22 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (37800) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       966
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1740
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6257
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       301
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       216
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1550
194 Temperature_Celsius     0x0022   115   104   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       30
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6169         -
# 2  Extended offline    Completed without error       00%      6167         -
# 3  Short offline       Completed without error       00%      6161         -
# 4  Extended offline    Aborted by host               90%      6160         -
# 5  Short offline       Aborted by host               60%      6160         -
# 6  Extended offline    Aborted by host               20%      6160         -
# 7  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

And this is the syslog:

Jun 18 10:33:55 Tower kernel: mdcmd (23): spinup 1
Jun 18 10:33:55 Tower kernel: mdcmd (24): spinup 2
Jun 18 10:33:55 Tower kernel: mdcmd (25): spinup 3
Jun 18 10:33:55 Tower kernel: mdcmd (26): spinup 4
Jun 18 10:34:00 Tower kernel: mdcmd (27): stop 
Jun 18 10:34:00 Tower kernel: md1: stopping
Jun 18 10:34:00 Tower kernel: md2: stopping
Jun 18 10:34:00 Tower kernel: md3: stopping
Jun 18 10:34:00 Tower kernel: md4: stopping
Jun 18 10:34:00 Tower kernel: md: unRAID driver removed
Jun 18 10:34:00 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:34:00 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:34:00 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:34:00 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:34:00 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:34:00 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:34:00 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (6): import 5 0,0
Jun 18 10:34:00 Tower emhttp: _shcmd: shcmd (93): exit status: 1
Jun 18 10:34:00 Tower avahi-daemon[2727]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
Jun 18 10:34:01 Tower kernel: md: unRAID driver removed
Jun 18 10:34:01 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:34:01 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:34:01 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:34:01 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:34:01 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:34:01 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:34:01 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (6): import 5 0,0
Jun 18 10:35:02 Tower kernel: md: unRAID driver removed
Jun 18 10:35:02 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:35:02 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:35:02 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:35:02 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:35:02 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:35:02 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:35:02 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (6): import 5 0,0
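As an aside, the "md: import" lines are what tie an array slot to a device node and a serial number. A minimal Python sketch, assuming the line format seen in the excerpt above (the function name `disk_slots` is made up for illustration), that turns them into a lookup table:

```python
import re

# Three lines copied from the syslog excerpt above.
SYSLOG = """\
Jun 18 10:34:00 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:34:00 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (6): import 5 0,0
"""

# slot number, [major,minor], (device name), model_serial, size
IMPORT = re.compile(
    r"md: import disk(\d+): \[(\d+),(\d+)\] \((\w+)\) (\S+) size: (\d+)"
)


def disk_slots(text):
    """Map each array slot to its device node, identifier and reported size."""
    slots = {}
    for match in IMPORT.finditer(text):
        slot, _major, _minor, dev, ident, size = match.groups()
        slots[int(slot)] = {
            "device": "/dev/" + dev,
            "id": ident,
            "size": int(size),
        }
    return slots


slots = disk_slots(SYSLOG)
for slot, info in sorted(slots.items()):
    print(f"disk{slot}: {info['device']} {info['id']}")
```

Checking the serial number this way helps confirm that a SMART report taken from /dev/sdf really belongs to disk2, since device letters can change between boots.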

 

I am not sure what to do next.

Link to comment

Sounds like you are on the right track.  We could advise better though, if we could see the full syslog.  That little piece above is completely fine, but there must be a number of errors in the rest of the syslog.  If you would rather not show it to us, search it yourself for ICRC and/or BadCRC.  I noticed that the latest SMART report, otherwise completely fine, is showing an increase in CRC errors of 23 (from 7 to 30).  That generally indicates a bad SATA cable to that drive, easily replaced.
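A minimal Python sketch of that search, for anyone who would rather script it than read the log by eye. The two error lines in the sample are illustrative only, written in the general style of libata messages and not taken from this poster's syslog; the third line is from the excerpt above and should not match.

```python
import re

# Match the two markers named above as whole words.
CRC_MARKERS = re.compile(r"\b(ICRC|BadCRC)\b")


def crc_lines(lines):
    """Return the lines that mention an interface CRC error marker."""
    return [line.rstrip("\n") for line in lines if CRC_MARKERS.search(line)]


# Hypothetical sample: the first two lines are made up for illustration.
sample = [
    "Jun  3 04:39:50 Tower kernel: ata5: SError: { UnrecovData 10B8B BadCRC }\n",
    "Jun  3 04:39:50 Tower kernel: ata5.00: error: { ICRC ABRT }\n",
    "Jun 18 10:34:00 Tower kernel: md: unRAID driver 2.1.3 installed\n",
]

hits = crc_lines(sample)
for line in hits:
    print(line)

# To scan a saved log instead (file name assumed):
# with open("syslog.txt") as handle:
#     hits = crc_lines(handle)
```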

Link to comment

So update:

 

It still doesn't work.

My disk2 is still red balling.

 

I attached the syslog this time. I guess I didn't attach them previously because they were either similar to this or it was filled with errors related to my SATA card breaking which I have now replaced.

 

After running "reiserfsck --rebuild-tree", it rebuilt the tree in 4 passes. I thought I had saved the output, but apparently I didn't. Just in case I missed something, I am running another reiserfsck on it to make sure everything is OK.

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 24 00:29:21 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (37800) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       925
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1749
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6390
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       308
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       218
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1557
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       30
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6374         -
# 2  Short offline       Completed without error       00%      6169         -
# 3  Extended offline    Completed without error       00%      6167         -
# 4  Short offline       Completed without error       00%      6161         -
# 5  Extended offline    Aborted by host               90%      6160         -
# 6  Short offline       Aborted by host               60%      6160         -
# 7  Extended offline    Aborted by host               20%      6160         -
# 8  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

I really don't know what else to do anymore. I even replaced the SATA cable on disk 2 in case it would help, but it hasn't.

 

Is there anywhere else I can check for problems?

syslog.txt

Link to comment

I have a Corsair TX 650 v2 http://www.corsair.com/power-supply-units/tx-series-power-supply-units/enthusiast-series-tx650-v2-80-plus-bronze-certified-650-watt-high-performance-power-supply.html

 

I only have 5 drives + 1 2.5" drive.

 

What would I check to see if it's the PSU that's causing the problem?

 

Also, I'm not sure why my syslog was getting filtered before, but here it is again.

syslog.txt

Link to comment

That is a good quality single-rail power supply.  It should be fine powering 5 disks.

 

Joe L.

Link to comment
I even replaced the SATA cable on disk 2 in case it would help, but it hasn't.

Actually, the new cable does appear to be working correctly now: there are no new CRC errors recorded in your latest SMART report, and the value is still 30.  I have no doubt that at least some of the previous problems were due to that bad cable (wish I could see the June 3 syslog).  Apparently though, it was not the only problem.  But your latest syslog does not show any problems with any of the drives, so there is no reason at all that you should not be able to start the array and either rebuild that drive or rebuild parity, restoring the array completely.
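The reasoning here is just a difference between successive readings of attribute 199 (UDMA_CRC_Error_Count). A small Python sketch using the three raw values and report dates from the SMART reports in this thread; the helper name `deltas` is made up for illustration:

```python
# (report date, raw value of attribute 199) from the three SMART reports above.
readings = [("Jun 10", 7), ("Jun 18", 30), ("Jun 24", 30)]


def deltas(series):
    """Return (date, increase since the previous report) for each later report."""
    return [
        (later_date, later_value - earlier_value)
        for (_, earlier_value), (later_date, later_value) in zip(series, series[1:])
    ]


changes = deltas(readings)
for date, increase in changes:
    status = "new CRC errors" if increase > 0 else "no new CRC errors"
    print(f"{date}: +{increase} ({status})")
```

The raw count never goes back down, so only the increase between reports matters: 23 new errors before the cable swap, none in the latest report.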

 

Your 2 syslogs are somewhat unusual.  They almost look like 2 different systems, except that they include the same set of drives, connected a little differently, and with a few other commonalities.  The second one, begun June 24 at 7am, is a clean v5.0-rc4 UnRAID system with one visible addon, Hamachi, and one of its dependencies.  The first syslog begins June 19 at 6:43am, but is missing at least the first 900 lines; perhaps it is a tail of the last 1115 lines.  It almost (but not quite) looks like a log rotation, this being a later section, perhaps the last.  Without all of the initial system setup logged, I'm a little handicapped.

What is there though does tell us some things, some of which concern me as to that system's stability.  There is a sequence of lines, included at the bottom of this post, that repeats relatively often and includes rather worrying issues, clearly not part of a normal UnRAID setup.  The first oddity: the first occurrence of this sequence is logged with the logging hour about 4 hours off (bizarre!).  Then it begins with pnp discovering memory region conflicts, and repeats this at each occurrence of this line sequence.  Then there are warnings about the use of the NVIDIA module(!), and that "oom_adj is deprecated", and other oddities.

 

Another repeating error is: "Tower emhttp: main: can't bind listener socket: Address already in use".  When the main function of emhttp cannot bind a listener, that sounds serious to me, although I'm in no way an expert here.
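For context, "Address already in use" is the generic error any program gets when it tries to bind a TCP port that another socket already holds. A minimal Python sketch that reproduces it on the loopback interface; this only illustrates the general socket behaviour and says nothing about what was holding the port on this server:

```python
import errno
import socket

# First listener: let the OS pick a free port on the loopback interface.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# Second listener: try to bind the very same address and port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
error_code = None
try:
    second.bind(("127.0.0.1", port))
except OSError as exc:
    error_code = exc.errno
finally:
    second.close()
    first.close()

print(errno.errorcode.get(error_code), "while binding port", port)
```

A common cause is a second copy of the same program being started while the first is still running, which would be worth ruling out here.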

 

Then there is the following sequence (an example with md3), which repeats several times for every drive including the Cache drive:

Tower kernel: REISERFS warning (device md3):  reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON

Tower kernel: REISERFS warning (device md3):  reiserfs_fill_super: - it is slow mode for debugging.

Tower kernel: reiserfs: using flush barriers

This indicates that the ReiserFS module was compiled with debugging turned on, specifically the Reiser debugging option that runs every check at every Reiser-related step, which makes the file system run somewhat slower.  This module does not appear to have been compiled correctly.
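To see how widespread the warning is, a short Python sketch can count it per device. The two sample lines are copied from this post; the regular expression assumes the message format stays exactly as shown.

```python
import re
from collections import Counter

# Device name is captured from "REISERFS warning (device NAME):".
DEBUG_WARNING = re.compile(
    r"REISERFS warning \(device (\w+)\):\s+reiserfs_fill_super: "
    r"CONFIG_REISERFS_CHECK is set ON"
)


def debug_mounts(lines):
    """Count how often each device was mounted with ReiserFS debug checks on."""
    counts = Counter()
    for line in lines:
        match = DEBUG_WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Tower kernel: REISERFS warning (device md3):  reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON",
    "Tower kernel: REISERFS warning (device md3):  reiserfs_fill_super: - it is slow mode for debugging.",
    "Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1):  reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON",
]

counts = debug_mounts(sample)
for device, count in sorted(counts.items()):
    print(device, count)
```

If a stock unRAID kernel is running, this count should presumably be zero for every device, so any hits point at a custom-built kernel or module.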

 

These previous issues may or may not be related to the problems you were having, but we do have to take them into account, especially when we don't have the syslog where the drive first red-balled (usually the most important one).  According to your newest syslog though, the system looks fine, and ready to rebuild that drive.  If for any reason the drive red-balls again, we need the syslog that covers the period when it went red.  There will be a set of errors within it that should tell us what is going wrong.

 

 

An example of the problematic sequence follows; it repeats almost identically numerous times (almost none of these lines appear in a normal UnRAID system after the initial setup):

Jun 23 08:52:32 Tower avahi-daemon[1860]: Disconnected from D-Bus, exiting.

Jun 23 08:52:32 Tower avahi-dnsconfd[1869]: read(): EOF

Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:01:00.0 BAR 6 [mem 0x00000000-0x0007ffff pref]

Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:03:06.0 BAR 6 [mem 0x00000000-0x0007ffff pref]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000d4400-0x000d7fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000f0000-0x000f7fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000f8000-0x000fbfff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000fc000-0x000fffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x00000000-0x0009ffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x00100000-0xcfceffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

Jun 23 08:57:46 Tower kernel: highmem bounce pool size: 64 pages

Jun 23 08:57:46 Tower kernel: Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)

Jun 23 08:57:46 Tower kernel: i8042: Failed to disable AUX port, but continuing anyway... Is this a SiS?

Jun 23 08:57:46 Tower kernel: i8042: If AUX port is really absent please use the 'i8042.noaux' option

Jun 23 08:57:46 Tower kernel: reiserfs: using flush barriers

Jun 23 08:57:46 Tower kernel: udevd (991): /proc/991/oom_adj is deprecated, please use /proc/991/oom_score_adj instead.

Jun 23 08:57:46 Tower kernel: nvidia: module license 'NVIDIA' taints kernel.

Jun 23 08:57:46 Tower kernel: Disabling lock debugging due to kernel taint

Jun 23 08:57:46 Tower kernel: NVRM: loading NVIDIA UNIX x86 Kernel Module  285.05.15  Mon Oct 17 19:35:44 PDT 2011

Jun 23 08:57:46 Tower kernel: reiserfs: enabling write barrier flush mode

Jun 23 08:57:46 Tower kernel: scsi: killing requests for dead queue

Jun 23 08:57:46 Tower last message repeated 2 times

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through

Jun 23 08:57:46 Tower kernel: scsi: killing requests for dead queue

Jun 23 08:57:46 Tower last message repeated 4 times

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present

Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through

Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1):  reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON

Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1):  reiserfs_fill_super: - it is slow mode for debugging.

Jun 23 08:57:46 Tower kernel: reiserfs: using flush barriers

Jun 23 08:57:46 Tower kernel: r8169 0000:02:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-2.fw (-2)

Jun 23 08:57:50 Tower ntpd[1717]: bind(21) AF_INET6 fe80::1e6f:65ff:fe5c:50c0%2#123 flags 0x1 failed: Cannot assign requested address

Jun 23 08:57:50 Tower ntpd[1717]: unable to create socket on eth0 (5) for fe80::1e6f:65ff:fe5c:50c0#123

Link to comment
