thejinx0r Posted June 8, 2012

Hi, I have a hard drive that keeps disconnecting, and I am not sure what is going wrong. What usually happens is that disk2 gets disconnected and disappears from the array. I then go to the terminal and look for the drive itself, but it's not there (i.e. it is normally at /dev/sdf, but that is no longer a valid location). I cannot stop the array because it keeps trying to unmount and sync the drive, which fails because the drive isn't there, so I have to force a shutdown. After I boot the computer back up, the drives are back. Just for good measure, I also checked all the power cables and the data cables. I checked the SMART health of the drives; they all passed, even after the long self-test. In my syslog, these messages appear when the drive fails:

Jun 3 04:40:04 Tower kernel: mdcmd (28958): spindown 0
Jun 3 04:40:04 Tower kernel: md: disk0: ATA_OP e0 ioctl error: -5
Jun 3 04:40:04 Tower kernel: mdcmd (28959): spindown 2
Jun 3 04:40:04 Tower emhttp: mdcmd: write: No such device or address
Jun 3 04:40:14 Tower emhttp: mdcmd: write: Input/output error

I am not sure what the problem could be. The only other thing I can think of is the power supply failing, but I'm not sure why it would fail on only one particular drive. Thanks for the help.
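For anyone reading along, the "-5" in the "ioctl error: -5" line is most likely a negated Linux errno value, which would make it EIO, the same "Input/output error" that emhttp reports a few lines later. A minimal Python sketch for decoding such codes follows; the helper name describe_kernel_error is made up for illustration, and the description text comes from the local C library, so it may be worded differently on non-Linux systems.

```python
import errno
import os


def describe_kernel_error(code):
    """Turn a negated errno such as -5 into (symbolic name, description).

    Hypothetical helper, not part of unRAID or smartmontools.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# The code taken from the "ATA_OP e0 ioctl error: -5" syslog line.
name, text = describe_kernel_error(-5)
print(name, "-", text)
```

On Linux, "No such device or address" is the description of ENXIO, which fits a device node that has vanished.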
Rajahal Posted June 8, 2012

Just because a drive passes the SMART test doesn't mean it is completely healthy. Can you post the SMART report for disk2? Have you added any new drives or other hardware to your server recently? Posting a full syslog might also be helpful.
thejinx0r Posted June 10, 2012

Hi, so this is the SMART history report.

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 10 09:10:53 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (37800) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       1008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1731
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6170
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1549
194 Temperature_Celsius     0x0022   111   105   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       7
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6169         -
# 2  Extended offline    Completed without error       00%      6167         -
# 3  Short offline       Completed without error       00%      6161         -
# 4  Extended offline    Aborted by host               90%      6160         -
# 5  Short offline       Aborted by host               60%      6160         -
# 6  Extended offline    Aborted by host               20%      6160         -
# 7  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I have not added anything to my hardware.
Joe L. Posted June 10, 2012

Are you power cycling your server frequently? These two parameters seem to indicate frequent loss of power to the drive:

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208

I don't know about your specific drive, but usually a "Power-Off-Retract" is in response to an unexpected loss of power.

Joe L.
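To put a number on the comparison Joe L. is making, here is a rough Python sketch that pulls the raw values out of smartctl-style attribute rows and relates retracts to power cycles. The function names and the regular expression are illustrative assumptions, not anything smartctl provides, and how a drive counts Power-Off_Retract_Count is vendor specific, so the ratio is only a hint.

```python
import re

# The two attribute rows quoted from the SMART report in this thread.
SMART_ROWS = """\
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208
"""

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text):
    """Map attribute name -> integer raw value for each attribute row."""
    values = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values


def retract_ratio(values):
    """Fraction of power cycles that were recorded as power-off retracts."""
    return values["Power-Off_Retract_Count"] / values["Power_Cycle_Count"]


values = parse_raw_values(SMART_ROWS)
print(f"retracts per power cycle: {retract_ratio(values):.2f}")
```

With the values above, roughly seven out of ten power cycles were logged as retracts, which is the pattern that prompted the question about frequent power loss.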
thejinx0r Posted June 12, 2012

Joe L. said: "Are you power cycling your server frequently? These two parameters seem to indicate frequent loss of power to the drive:

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       292
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       208

I don't know about your specific drive, but usually a "Power-Off-Retract" is in response to an unexpected loss of power."

I was power cycling at one point. I was trying to build a Slackware-based unRAID and power cycled quite a bit during that time because I wasn't able to build a proper kernel, but I am using the regular version of unRAID now. Anyway, the hard drives are quite new, about a year old, 14 months at most. I want to RMA the drive, but I don't know what to file it as.
thejinx0r Posted June 14, 2012

So, is there any advice as to what I can file when RMAing it?
RobJ Posted June 14, 2012

The drive looks completely fine, no reason at all to RMA it. And from the tiny bit of syslog info above, one or more drives have been disabled, which is usually true of a "loss of contact" situation. This usually implies the drives are fine, but you have some sort of drive interface issue(s), which could be cabling, loose backplanes, power problems, bad disk controller, etc. There should be a LOT more error messages in that syslog, and it would be very helpful to us if you would please post it here.
thejinx0r Posted June 18, 2012

So, I figured out part of the problem. I had a faulty SATA-PCI card, which I have since replaced. I replaced it because I wasn't able to boot with my HDDs plugged into it, and it just so happened that the disks that were giving me problems were on it. Now I am having a different issue. My disk2 has a red ball next to it, and when I started the array, it was orange. This is the SMART history:

smartctl -a -d ata /dev/sdf (disk2)

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun 18 10:40:22 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (37800) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       966
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1740
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6257
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       301
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       216
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1550
194 Temperature_Celsius     0x0022   115   104   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       30
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6169         -
# 2  Extended offline    Completed without error       00%      6167         -
# 3  Short offline       Completed without error       00%      6161         -
# 4  Extended offline    Aborted by host               90%      6160         -
# 5  Short offline       Aborted by host               60%      6160         -
# 6  Extended offline    Aborted by host               20%      6160         -
# 7  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And this is the syslog:

Jun 18 10:33:55 Tower kernel: mdcmd (23): spinup 1
Jun 18 10:33:55 Tower kernel: mdcmd (24): spinup 2
Jun 18 10:33:55 Tower kernel: mdcmd (25): spinup 3
Jun 18 10:33:55 Tower kernel: mdcmd (26): spinup 4
Jun 18 10:34:00 Tower kernel: mdcmd (27): stop
Jun 18 10:34:00 Tower kernel: md1: stopping
Jun 18 10:34:00 Tower kernel: md2: stopping
Jun 18 10:34:00 Tower kernel: md3: stopping
Jun 18 10:34:00 Tower kernel: md4: stopping
Jun 18 10:34:00 Tower kernel: md: unRAID driver removed
Jun 18 10:34:00 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:34:00 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:34:00 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:34:00 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:34:00 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:34:00 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:34:00 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:34:00 Tower kernel: mdcmd (6): import 5 0,0
Jun 18 10:34:00 Tower emhttp: _shcmd: shcmd (93): exit status: 1
Jun 18 10:34:00 Tower avahi-daemon[2727]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
Jun 18 10:34:01 Tower kernel: md: unRAID driver removed
Jun 18 10:34:01 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:34:01 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:34:01 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:34:01 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:34:01 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:34:01 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:34:01 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:34:01 Tower kernel: mdcmd (6): import 5 0,0
Jun 18 10:35:02 Tower kernel: md: unRAID driver removed
Jun 18 10:35:02 Tower kernel: md: unRAID driver 2.1.3 installed
Jun 18 10:35:02 Tower kernel: mdcmd (1): import 0 8,64 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4427079
Jun 18 10:35:02 Tower kernel: md: import disk0: [8,64] (sde) WDC_WD20EARS-00MVWB0_WD-WMAZA4427079 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (2): import 1 8,16 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4413315
Jun 18 10:35:02 Tower kernel: md: import disk1: [8,16] (sdb) WDC_WD20EARS-00MVWB0_WD-WMAZA4413315 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (3): import 2 8,80 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4426638
Jun 18 10:35:02 Tower kernel: md: import disk2: [8,80] (sdf) WDC_WD20EARS-00MVWB0_WD-WMAZA4426638 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (4): import 3 8,0 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4452077
Jun 18 10:35:02 Tower kernel: md: import disk3: [8,0] (sda) WDC_WD20EARS-00MVWB0_WD-WMAZA4452077 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (5): import 4 8,32 1953514552 WDC_WD20EARS-00MVWB0_WD-WMAZA4428878
Jun 18 10:35:02 Tower kernel: md: import disk4: [8,32] (sdc) WDC_WD20EARS-00MVWB0_WD-WMAZA4428878 size: 1953514552
Jun 18 10:35:02 Tower kernel: mdcmd (6): import 5 0,0

I am not sure what to do next.
thejinx0r Posted June 18, 2012

I started my array in maintenance mode and ran reiserfsck on it. It told me to rebuild the tree, which I am doing now.
RobJ Posted June 20, 2012

Sounds like you are on the right track. We could advise better though, if we could see the full syslog. That little piece above is completely fine, but there must be a number of errors in the rest of the syslog. If you would rather not show it to us, search it yourself for ICRC and/or BadCRC. I noticed that the latest SMART report, otherwise completely fine, is showing an increase in CRC errors of 23 (from 7 to 30). That generally indicates a bad SATA cable to that drive, easily replaced.
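For anyone who prefers a script to a manual search, the two checks RobJ describes could be sketched in Python roughly as follows. The function names are made up for illustration, and the ICRC line in SAMPLE is a hypothetical entry written for this sketch (the other two lines are copied from the syslog posted earlier); it is not taken from the poster's logs.

```python
import re

# Markers RobJ suggests searching the syslog for.
CRC_PATTERN = re.compile(r"ICRC|BadCRC")


def crc_lines(syslog_text):
    """Return every syslog line that mentions ICRC or BadCRC."""
    return [line for line in syslog_text.splitlines() if CRC_PATTERN.search(line)]


def crc_delta(old_raw, new_raw):
    """Increase in the UDMA_CRC_Error_Count raw value between two reports."""
    return new_raw - old_raw


# HYPOTHETICAL sample: the middle line is invented for illustration only.
SAMPLE = """\
Jun 18 10:33:55 Tower kernel: mdcmd (23): spinup 1
Jun 18 10:34:00 Tower kernel: ata5.00: error: { ICRC ABRT }
Jun 18 10:34:00 Tower kernel: md1: stopping
"""

print(crc_lines(SAMPLE))
print("CRC errors added between reports:", crc_delta(7, 30))
```

To run it against a real log, read the file first, for example crc_lines(open("syslog.txt").read()). The raw CRC count is cumulative for the life of the drive, so only the change between two reports says anything about the current cable.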
thejinx0r Posted June 24, 2012

So, an update: it still doesn't work. My disk2 is still red-balling. I attached the syslog this time. I guess I didn't attach the earlier ones because they were either similar to this one or filled with errors related to my SATA card breaking, which I have now replaced. After running "reiserfsck --rebuild-tree", it rebuilt the tree in 4 passes. I thought I had saved the output, but apparently I didn't. Just in case I missed something, I am running another reiserfsck on it to make sure everything is OK.

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4426638
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 24 00:29:21 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (37800) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       925
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1749
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6390
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       308
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       218
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1557
194 Temperature_Celsius     0x0022   112   104   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       30
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6374         -
# 2  Short offline       Completed without error       00%      6169         -
# 3  Extended offline    Completed without error       00%      6167         -
# 4  Short offline       Completed without error       00%      6161         -
# 5  Extended offline    Aborted by host               90%      6160         -
# 6  Short offline       Aborted by host               60%      6160         -
# 7  Extended offline    Aborted by host               20%      6160         -
# 8  Short offline       Aborted by host               80%      6156         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I really don't know what else to do anymore. I even replaced the SATA cable on disk 2 in case it would help, but it hasn't. Is there anywhere else I can check for problems?

Attachment: syslog.txt
PeterB Posted June 24, 2012

Your syslog appears to be incomplete/corrupted. What PSU are you using? I think I would start investigating there.
thejinx0r Posted June 24, 2012

I have a Corsair TX 650 v2: http://www.corsair.com/power-supply-units/tx-series-power-supply-units/enthusiast-series-tx650-v2-80-plus-bronze-certified-650-watt-high-performance-power-supply.html

I only have 5 drives plus one 2.5" drive. What would I check to see if it's the PSU that's causing the problem? Also, I'm not sure why my syslog was getting filtered before, but here it is again.

Attachment: syslog.txt
Joe L. Posted June 24, 2012

thejinx0r said: "I have a Corsair TX 650 v2 http://www.corsair.com/power-supply-units/tx-series-power-supply-units/enthusiast-series-tx650-v2-80-plus-bronze-certified-650-watt-high-performance-power-supply.html I only have 5 drives + 1 2.5" drive. What would I check to see if it's the PSU that's causing the problem? Also, I'm not sure why my syslog was getting filtered before, but here it is again."

That is a good quality single-rail power supply. It should be fine powering 5 disks.

Joe L.
RobJ Posted June 25, 2012

thejinx0r said: "I even replaced the SATA cable on disk 2 in case it would help, but it hasn't."

Actually, the new cable does appear to be working correctly now: there are no new CRC errors recorded in your latest SMART report, and the value is still 30. I have no doubt that at least some of the previous problems were due to that bad cable (wish I could see the June 3 syslog). Apparently, though, it was not the only problem. But your latest syslog does not show any problems with any of the drives, so there is no reason at all that you should not be able to start the array and either rebuild that drive or rebuild parity, restoring the array completely.

Your 2 syslogs are somewhat unusual. They almost look like 2 different systems, except that they include the same set of drives, connected a little differently, and with a few other commonalities. The second one, begun June 24 at 7am, is a clean v5.0-rc4 UnRAID system with one visible addon, Hamachi, and one of its dependencies. The first syslog begins June 19 at 6:43am, but is missing at least the first 900 lines; perhaps it is a tail of the last 1115 lines. It almost (but not quite) looks like a log rotation, this being a later section, perhaps the last. Without all of the initial system setup logged, I'm a little handicapped. What is there, though, does tell us some things, some of which concern me as to that system's stability.

There is a sequence of lines, included at the bottom of this post, that repeats relatively often and includes rather worrying issues, clearly not part of a normal UnRAID setup. The first oddity: the first occurrence of this sequence is logged with the logging hour about 4 hours off (bizarre!). Then it begins with pnp discovering memory region conflicts, and repeats this in each occurrence of the sequence. Then there are warnings about the use of the NVIDIA module(!), and that "oom_adj is deprecated", and other oddities.

Another repeating error is: "Tower emhttp: main: can't bind listener socket: Address already in use". When the main function of emhttp cannot bind a listener, that sounds serious to me, although I'm in no way an expert here.

Then there is the following sequence (an example with md3), which repeats several times for every drive including the Cache drive:

Tower kernel: REISERFS warning (device md3): reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON
Tower kernel: REISERFS warning (device md3): reiserfs_fill_super: - it is slow mode for debugging.
Tower kernel: reiserfs: using flush barriers

This indicates that the ReiserFS module was compiled with debugging turned on, specifically the Reiser debugging that runs every check at every Reiser-related step, which makes the file system run somewhat slower. This clearly does not appear to be compiled correctly.

These previous issues may or may not be related to the problems you were having, but we do have to take them into account, especially when we don't have the syslog where the drive first red-balled (usually the most important one). According to your newest syslog though, the system looks fine, and ready to rebuild that drive. If for any reason the drive red-balls again, we need the syslog that covers the period when it went red. There will be a set of errors within it that should tell us what is going wrong.

An example of the problematic sequence follows. It repeats almost identically numerous times (almost none of these lines appear in a normal UnRAID system after the initial setup):

Jun 23 08:52:32 Tower avahi-daemon[1860]: Disconnected from D-Bus, exiting.
Jun 23 08:52:32 Tower avahi-dnsconfd[1869]: read(): EOF
Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:01:00.0 BAR 6 [mem 0x00000000-0x0007ffff pref]
Jun 23 08:57:46 Tower kernel: pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:03:06.0 BAR 6 [mem 0x00000000-0x0007ffff pref]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000d4400-0x000d7fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000f0000-0x000f7fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000f8000-0x000fbfff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x000fc000-0x000fffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x00000000-0x0009ffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: pnp 00:0b: disabling [mem 0x00100000-0xcfceffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jun 23 08:57:46 Tower kernel: highmem bounce pool size: 64 pages
Jun 23 08:57:46 Tower kernel: Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Jun 23 08:57:46 Tower kernel: i8042: Failed to disable AUX port, but continuing anyway... Is this a SiS?
Jun 23 08:57:46 Tower kernel: i8042: If AUX port is really absent please use the 'i8042.noaux' option
Jun 23 08:57:46 Tower kernel: reiserfs: using flush barriers
Jun 23 08:57:46 Tower kernel: udevd (991): /proc/991/oom_adj is deprecated, please use /proc/991/oom_score_adj instead.
Jun 23 08:57:46 Tower kernel: nvidia: module license 'NVIDIA' taints kernel.
Jun 23 08:57:46 Tower kernel: Disabling lock debugging due to kernel taint
Jun 23 08:57:46 Tower kernel: NVRM: loading NVIDIA UNIX x86 Kernel Module 285.05.15 Mon Oct 17 19:35:44 PDT 2011
Jun 23 08:57:46 Tower kernel: reiserfs: enabling write barrier flush mode
Jun 23 08:57:46 Tower kernel: scsi: killing requests for dead queue
Jun 23 08:57:46 Tower last message repeated 2 times
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through
Jun 23 08:57:46 Tower kernel: scsi: killing requests for dead queue
Jun 23 08:57:46 Tower last message repeated 4 times
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through
Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1): reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON
Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1): reiserfs_fill_super: - it is slow mode for debugging.
Jun 23 08:57:46 Tower kernel: reiserfs: using flush barriers
Jun 23 08:57:46 Tower kernel: r8169 0000:02:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-2.fw (-2)
Jun 23 08:57:50 Tower ntpd[1717]: bind(21) AF_INET6 fe80::1e6f:65ff:fe5c:50c0%2#123 flags 0x1 failed: Cannot assign requested address
Jun 23 08:57:50 Tower ntpd[1717]: unable to create socket on eth0 (5) for fe80::1e6f:65ff:fe5c:50c0#123
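One way to see how often a sequence like this repeats is to strip the timestamp and hostname from each syslog line and tally the remaining messages. The Python sketch below does that under the assumption that lines follow the "Mon DD HH:MM:SS host message" layout seen in this thread; the function name message_counts is made up for illustration, and the sample is a handful of lines copied from the excerpt above.

```python
import re
from collections import Counter

# Matches the "Jun 23 08:57:46 Tower " prefix: month, day, time, hostname.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")


def message_counts(syslog_text):
    """Count messages after removing the timestamp and hostname prefix."""
    counts = Counter()
    for line in syslog_text.splitlines():
        message = PREFIX.sub("", line.strip())
        if message:
            counts[message] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = """\
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] Assuming drive cache: write through
Jun 23 08:57:46 Tower kernel: sd 10:0:0:0: [sdg] No Caching mode page present
Jun 23 08:57:46 Tower kernel: REISERFS warning (device sdd1): reiserfs_fill_super: CONFIG_REISERFS_CHECK is set ON
Jun 23 08:57:46 Tower kernel: reiserfs: using flush barriers
"""

for message, count in message_counts(SAMPLE).most_common(3):
    print(count, message)
```

Run over a full syslog, messages with a high count that do not appear in a clean boot are the ones worth reading first.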