February 1, 2014

It's been some time since I've had to post any questions (which is good, as it means everything has been rock solid). Unfortunately, we lost power while I was at work and I was unable to access my server to shut it down properly. I used to have it running on an APC UPS, but that died. The server is now on a CyberPower UPS, but I never explored whether it's possible to have it shut down automatically as it did with the APC. I digress...

The issue is that, now the server is powered back up, I have the evil 'red dot' beside my "disk 3". I powered everything down and tore my server apart (12 drives in 4-in-3 cages) to locate the faulty drive (lesson learned: I should have created a layout chart of which drive is where, as it was the darn 12th drive I checked). Once I had located the drive, I made sure the power and SATA cables were secure on both it and the board (the SATA cable is part of a break-out cable). This made no difference.

I'm running version 4.7 of unRAID. The drive is accessible. It passes the SMART status check.

Statistics for /dev/sdj ST3750330AS_5QK00GT5

smartctl -a -d ata /dev/sdj
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST3750330AS
Serial Number:    5QK00GT5
Firmware Version: SD04
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jan 31 21:41:27 2014 EST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 159) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       197610088
  3 Spin_Up_Time            0x0003   094   085   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1675
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       4333950815
  9 Power_On_Hours          0x0032   044   044   000    Old_age   Always       -       49849 (Over 5 years, non-stop..aww!)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       6
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       260
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   047   045    Old_age   Always       -       25 (Lifetime Min/Max 21/25)
194 Temperature_Celsius     0x0022   025   053   000    Old_age   Always       -       25 (0 5 0 0)
195 Hardware_ECC_Recovered  0x001a   040   023   000    Old_age   Always       -       197610088
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.
[To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

HDParm Info for /dev/sdj ST3750330AS_5QK00GT5

/dev/sdj:

ATA device, with non-removable media
    Model Number:       ST3750330AS
    Serial Number:      5QK00GT5
    Firmware Revision:  SD04
Standards:
    Supported: 7 6 5 4
    Likely used: 8
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 1465149168
    Logical/Physical Sector size:           512 bytes
    device size with M = 1024*1024:      715404 MBytes
    device size with M = 1000*1000:      750156 MBytes (750 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: 7200
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 16  Current = ?
    Recommended acoustic management value: 254, current value: 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    DOWNLOAD_MICROCODE
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
            General Purpose Logging feature set
       *    64-bit World wide name
       *    Write-Read-Verify feature set
       *    WRITE_UNCORRECTABLE_EXT command
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Long Sector Access (AC1)
       *    SCT Error Recovery Control (AC3)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
Security:
    Master password revision code = 65534
        supported
    not enabled
    not locked
    not frozen
    not expired: security count
        supported: enhanced erase
Logical Unit WWN Device Identifier: 5000c500028898de
    NAA        : 5
    IEEE OUI   : 000c50
    Unique ID  : 0028898de
Checksum: correct

If one of the skilled guys (Joe L., I always remember you, as your help getting me set up initially was incredibly helpful) could help interpret the info included, I would greatly appreciate it. I've got a 2TB drive standing by (granted, it never got tested, as I have no slots left in my tower), which I do want to replace the drive in question with at some point, but it would be nice to know whether the drive has a problem.

Please advise, guyz. Thanks

P.S. I had to remove a bunch of characters from the syslog as they made it huge (I think they came from me doing a file system check; they were almost like hashes showing its progress). No details were removed, as it was the same repeating character sequence (I left a few lines of it in the log).

syslog-2014-01-31.txt
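Editor's sketch: one way to read the attribute table in the report above is to check each normalized VALUE against its THRESH and to look at the raw counts of the few attributes that are plain counters. The Python below is a minimal sketch under assumptions, not a tool from this thread: the `WATCHED` list, the function names and the regular expression are all illustrative choices. On Seagate drives the large raw numbers for Raw_Read_Error_Rate and Seek_Error_Rate are generally understood to be vendor-encoded rather than simple error counts, so the sketch deliberately ignores them.

```python
import re

# Attributes whose raw value is a plain count; a non-zero value usually
# deserves a closer look. This watch list is an assumption of the sketch.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",
}

# One row of the "Vendor Specific SMART Attributes" table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in smartctl -a output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            })
    return rows


def flag_attributes(rows):
    """Return (name, reason) pairs for rows that look worth investigating."""
    flagged = []
    for r in rows:
        # A normalized value at or below a non-zero threshold counts as failed.
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            flagged.append((r["name"], "normalized value at or below threshold"))
        elif r["name"] in WATCHED and r["raw"] > 0:
            flagged.append((r["name"], "non-zero raw count"))
    return flagged


# A few rows copied from the report in this post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       197610088
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       4333950815
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       6
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, reason in flag_attributes(parse_attributes(SAMPLE)):
        print(name, "-", reason)
```

Run against the rows above, the sketch flags nothing: no attribute is at its threshold and the reallocated, pending, uncorrectable and CRC counters are all zero. That is consistent with the drive itself passing SMART, though it does not by itself explain the red dot.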
February 1, 2014

The disk has a bad or loose SATA cable. Also see here: http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F
February 1, 2014 (Author)

I'll give it another disconnect/reconnect, checking both ends. If it isn't one of the breakout cables from my SATA expansion card and it goes to the motherboard directly, I'll replace it. Considering that the only event between the server running fine and this error appearing was a power outage, it is somewhat hard to believe the cable is the problem (the power outage wasn't the result of an earthquake, lol). What I am saying is that the inside of the case was not disturbed until after the problem was discovered. Going to muck with it now and I'll report back!
February 1, 2014

A simple disconnect/re-seating will NOT put the drive back to OK status, even if it was just a bad connection. You must reconstruct the drive, because writes to it failed. It is guaranteed to have incorrect data. Follow the directions in the wiki.

When you lost power, the case probably cooled off. The thermal stress on the cable/connector could have caused a poor connection.
February 1, 2014 (Author)

I replaced the cable nonetheless, no harm, as it was one going right to my motherboard (not one of the breakout cables). After 5 minutes of "WTF", as my machine wouldn't boot up at all (no power), I realized that during all the digging around this time, I had pulled the connector that runs from the case's power switch to the board. lol Fun Fun Fun!

I was looking to replace the 750GB drive with a spare 2TB I have. Would now be as good a time as any, or would you reconstruct the drive first and then yank it and reconstruct again on the new drive?
February 1, 2014

"I was looking to replace the 750GB drive with a spare 2TB I have, would this be as good as any time or would you re-construct the drive first and then yank and reconstruct again on new drive?"

You will be running at risk until the drive is rebuilt, so putting in the 2TB and rebuilding onto that would probably be a good move, as long as the 2TB has been tested as good.
February 1, 2014 (Author)

"......as long as the 2TB has been tested as good."

That is where we have an issue. I have filled my server with 12 drives, leaving me no open drive bays to preclear the drive, as I have done with almost all drives up to now (thanks to Joe L.'s assistance in my starting days). I could hook it up to my Windows 7 machine, but I am not familiar with a test process there that is as extensive as preclear. I had this idea that I would just hook up a drive to the Windows machine and run it for a few months, make sure nothing odd happened with it, and then, if it seemed OK, toss it in the server to replace one of my smaller drives. Far from a sequential, structured test.

I have a couple of 2TB data drives in my main PC whose content I could transfer onto the new drive sitting on my desk. Then I could pull one of those drives and use it, but at best I could say it works and wasn't D.O.A.

EDIT: I rebuilt the original drive. I'm going to run with it until I can figure out a way to preclear a drive (or something similar) hooked to a machine running Windows. Not sure it's possible. If I can't find a decent way to prepare the replacement drive, then plan B is to swap it in anyway, but not write anything new to it for several days. Worst case, the drive is faulty and I either hear issues with it or the SMART reports identify something odd. If so, I still have the original drive with all the data intact, so I won't lose any data. (I'm really hoping to avoid plan B.)
February 2, 2014

This indicates a cable problem:

ata11.00: exception Emask 0x50 SAct 0x3 SErr 0x280900 action 0x6 frozen
Jan 31 21:05:14 Tower kernel: ata11.00: irq_stat 0x08000000, interface fatal error
Jan 31 21:05:14 Tower kernel: ata11: SError: { UnrecovData HostInt 10B8B BadCRC }
Jan 31 21:05:14 Tower kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 31 21:05:14 Tower kernel: ata11.00: cmd 60/40:00:60:00:00/00:00:00:00:00/40 tag 0 ncq 32768 in
Jan 31 21:05:14 Tower kernel:          res 40/00:08:a8:88:e0/00:00:e8:00:00/40 Emask 0x50 (ATA bus error)
Jan 31 21:05:14 Tower kernel: ata11.00: status: { DRDY }

Do these messages still appear in the log?
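Editor's sketch: the check asked for here (do the link errors still show up in a fresh syslog?) can be approximated with a few lines of Python. This is a minimal sketch under assumptions: the marker strings are taken from the excerpt in this post, while the function name and the idea of counting per `ataN` port are illustrative. Treating these markers as a cable or connector hint follows the reply above and is not a definitive diagnosis, and mapping a port such as ata11 back to a specific drive has to be done separately.

```python
import re

# Markers copied from the syslog excerpt in this post. Reading them as
# link-level (cable/connector) trouble is the assumption of this sketch.
LINK_MARKERS = ("BadCRC", "UnrecovData", "interface fatal error")

# Matches "ata11:" as well as "ata11.00:" and captures the port name.
ATA_PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def link_errors_by_port(log_text):
    """Count syslog lines carrying a link-error marker, keyed by ATA port."""
    counts = {}
    for line in log_text.splitlines():
        if any(marker in line for marker in LINK_MARKERS):
            m = ATA_PORT.search(line)
            port = m.group(1) if m else "unknown"
            counts[port] = counts.get(port, 0) + 1
    return counts


# Four lines copied from the excerpt above.
SAMPLE_LOG = """\
Jan 31 21:05:14 Tower kernel: ata11.00: irq_stat 0x08000000, interface fatal error
Jan 31 21:05:14 Tower kernel: ata11: SError: { UnrecovData HostInt 10B8B BadCRC }
Jan 31 21:05:14 Tower kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 31 21:05:14 Tower kernel: ata11.00: status: { DRDY }
"""

if __name__ == "__main__":
    # To scan a real file instead, read it first, for example:
    #   with open("syslog-2014-02-02.txt", errors="replace") as fh:
    #       text = fh.read()
    for port, n in sorted(link_errors_by_port(SAMPLE_LOG).items()):
        print(port, n)
```

An empty result on a log captured after the cable swap would suggest the link errors have stopped; it would not change the need to rebuild the red-balled disk.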
February 3, 2014 (Author)

I did a quick search of the syslog for "UnrecovData" and found no match. There is lots of colourful stuff in the log, but I have to admit there is so much in it that I only worry about it if I notice something not working during my regular usage (ya, I know, not the best approach, but understanding everything in the log is for advanced users).

Log attached, since everything is up and working again.

syslog-2014-02-02.txt
February 3, 2014

"I have filled my server with 12 drives, leaving me no open drive bays to preclear the drive, as I have done with almost all drives up to now (thanks to Joe L.'s assistance in my starting days)."

If you have an open SATA connector and a spare power connector, just open the case and leave the drive lying on the bottom of the case until it completes the preclear.