TheStapler Posted May 10, 2016

I put a new drive in, and within a day it went red X. The only errors on it were the UDMA errors. If the drive is "good", why did it fail? Should I RMA this drive? I haven't been able to get an answer as to what these errors really are. Some say it is a bad SATA cable, or just a communication error, but I don't know.

This drive was put in place of a failed 2 TB drive, which had some other errors on it as well as a tonne of UDMA errors. That drive is still connected to my server, so the new drive wasn't swapped into its port, and I have ensured that all the cables are connected and good. What should I do? :'(

Here is the general SMART report for the original, failed 2 TB drive:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA6115958
LU WWN Device Id: 5 0014ee 25b072d3b
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue May 10 14:47:25 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (38760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 374) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       11
  3 Spin_Up_Time            0x0027   253   171   021    Pre-fail  Always       -       1008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1390
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   056   056   000    Old_age   Always       -       32276
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       246
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   088   088   000    Old_age   Always       -       337956
194 Temperature_Celsius     0x0022   126   116   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   198   000    Old_age   Always       -       162944
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And here is the extended SMART report for the new 3 TB drive that has "failed":

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7FAT48N
LU WWN Device Id: 5 0014ee 2b7af7016
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 10 14:37:10 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39840) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 399) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   126   119   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       8424
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       106         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
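For readers comparing reports like the two above, the figures that matter here are the raw values of Power_On_Hours and UDMA_CRC_Error_Count. The following is a minimal sketch, assuming the report text has already been saved from smartctl in its usual one-attribute-per-line layout; the helper name `parse_attributes` and the regular expression are illustrative, not part of smartmontools.

```python
import re

# One attribute row has ten whitespace-separated columns:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(report: str) -> dict:
    """Map attribute name to integer raw value for every table row found."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(10))
    return attrs

# Two rows copied from the 3 TB drive's report.
sample = """\
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       8424
"""

attrs = parse_attributes(sample)
print(attrs["UDMA_CRC_Error_Count"], attrs["Power_On_Hours"])
```

Lines that are not attribute rows (headers, the self-test log) simply fail to match and are skipped, so the whole report can be passed in unchanged.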
gubbgnutten Posted May 10, 2016

See "Need help? Read me first!" and post diagnostics.
Squid Posted May 10, 2016

You've got about 8,400 CRC errors over the life of the drive (164 hours). Not surprised it redballed. Check the cabling and any power splitters, and post your diagnostics. (The original drive has about 163,000 errors over its lifetime. No surprise it failed either.)
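To put those two counts on the same footing, they can be divided by each drive's Power_On_Hours from the SMART reports. This is only a rough sketch: it assumes the errors were spread evenly over the drive's life, which SMART cannot confirm, since the counter is cumulative and the errors may well have arrived in bursts. The function name is illustrative.

```python
def crc_errors_per_hour(crc_errors: int, power_on_hours: int) -> float:
    """Average UDMA CRC errors per power-on hour (lifetime average only)."""
    return crc_errors / power_on_hours

# Raw values taken from the two SMART reports in the first post.
new_rate = crc_errors_per_hour(8424, 164)        # WD30EFRX, 3 TB
old_rate = crc_errors_per_hour(162944, 32276)    # WD20EARS, 2 TB

print(f"new 3 TB drive: {new_rate:.1f} errors/hour")
print(f"old 2 TB drive: {old_rate:.1f} errors/hour")
```

On these figures the new drive averages roughly ten times the old drive's rate, which fits a link problem that is already present on the slot rather than gradual wear of the drive.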
TheStapler Posted May 10, 2016 (Author)

gubbgnutten said: See "Need help? Read me first!" and post diagnostics.

If I had a syslog, I would have posted it. The server has rebooted since this issue happened, which is why I posted the 2 SMART reports. I have searched the forum, and I saw a post about it, but I didn't really understand the problem or the result.

I can add what my server is, though:

M/B: MSI - 970A-G46 (MS-7693)
CPU: AMD FX(tm)-8350 Eight-Core @ 4000
HVM: Enabled
IOMMU: Enabled
Cache: 384 kB, 8192 kB, 8192 kB
Memory: 16384 MB (max. installable capacity 32 GB)

There are 2 Supermicro RAID/JBOD cards as well.

Here is the output of lspci:

root@Tower:/mnt/user# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B) (rev 02)
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)
00:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port C)
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port D)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 42)
00:14.1 IDE interface: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller (rev 40)
00:14.2 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) (rev 40)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5
01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
02:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
03:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
04:05.0 VGA compatible controller: NVIDIA Corporation NV34 [GeForce FX 5200] (rev a1)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
TheStapler Posted May 10, 2016 (Author)

Oh, and I forgot: it is an older Supermicro case with 24 hot-swap bays, all with female SATA connectors on the backplane. The case has 4 x 500 W redundant power supplies, with supplies 1 & 3 and 2 & 4 running off 2 APC 1300XL UPSs.
Squid Posted May 10, 2016

It's bad cabling / loose cabling / data cabling tie-strapped too tight / unshielded cabling tied to other cables / unshielded cabling tied to power / poor power splitters (99% of them suck) / too many splitters / etc.

The number one mistake people make when building computers is cheaping out on data cables, or insisting on making everything look neat by using tie straps.
TheStapler Posted May 10, 2016 (Author)

Squid said: It's bad cabling / loose cabling / data cabling tie strapped too tight / unshielded cabling tied to others / unshielded cabling tied to power / poor power splitters (99 % of them suck) / too many splitters / etc etc etc. Number one mistake when people build computers is cheaping out on data cables or making everything look neat (and if they insist on neat by using tiestraps)

I've never touched the power that feeds the backplane. I believe there is only 1 power connector, but I could be wrong; I will double-check later.

My SATA cables are not tie-wrapped. They are decent-quality SFF-8087 fan-out cables, and the 6 on the motherboard are brand new and were pulled out of their bags the day the motherboard was installed. The fan-out cables were also brand new when I got the RAID cards, which was when I bought the motherboard. I replaced all the fans in the case at the same time too (most were dying, so it was safer to replace them all at once).

These 2 drives are on the same channel of the same RAID card, though (I just looked at my drive charts), so maybe that is somewhere to start. Is there any way to see if there are errors with the RAID card itself, like some kind of RAID card diagnostics?

Typically, I don't cheap out, which is why I have been switching to WD Red drives. I was having issues with WD Greens dying more often, and until this WD Red died, I had never had a WD Red fail. I don't spin down my drives. I used to use the cache drive and the mover script, but I don't use that any more. I am thinking about moving my "apps" off my cache drive and putting them on the array, so that I have some safety if the cache drive were to fail.
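The observation that both drives sit on the same channel of the same card can be checked systematically by totalling CRC counts per controller path from a drive chart. Below is a minimal sketch of that idea. The two WD rows use the counts from the SMART reports; the controller and channel labels (`card-A`, `card-B`) and the third row are hypothetical placeholders, since the thread does not give the actual chart.

```python
from collections import defaultdict

# (drive label, controller, channel, UDMA_CRC_Error_Count raw value)
# Controller/channel labels and the "example drive" row are hypothetical.
drive_chart = [
    ("WD20EARS 2 TB", "card-A", 0, 162944),
    ("WD30EFRX 3 TB", "card-A", 0, 8424),
    ("example drive", "card-B", 1, 0),
]

def crc_by_path(chart):
    """Total CRC errors per (controller, channel) pair."""
    totals = defaultdict(int)
    for _label, controller, channel, crc in chart:
        totals[(controller, channel)] += crc
    return dict(totals)

def suspect_paths(chart):
    """Paths with any CRC errors, worst first."""
    totals = crc_by_path(chart)
    return sorted((p for p, n in totals.items() if n > 0),
                  key=lambda p: -totals[p])

print(suspect_paths(drive_chart))
```

If every drive with CRC errors maps to one path, the cable, backplane connector, or port on that path is the first thing to swap; errors scattered across unrelated paths would point more towards power.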
gubbgnutten Posted May 10, 2016

TheStapler said: If I had a syslog, i would have posted it... the server has rebooted since this issue happened... this is why I posted the 2 SMART reports...

Well, sorry for not being a qualified mind reader. I'm working on it, but I'm starting to think that the course is not completely legit.

At least you now know that you should try to grab diagnostics (not only the syslog) before rebooting if you encounter problems in the future. Good luck with the troubleshooting.
RobJ Posted May 10, 2016

I just want to echo what has already been said: it's not the drives and it's not the controller. It's almost certainly the cables or their connectors, or, far less likely, flaky power. It doesn't matter how new the cables are; at least one of them is bad, or you have power issues. Even the newest can be lemons.

Oh, and don't staple the cables either!
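Because the raw UDMA_CRC_Error_Count is a lifetime counter that normally does not reset, the useful check after reseating or replacing a cable is whether the number keeps climbing, not whether it is zero. A minimal sketch of that comparison follows; the function names are illustrative, and the "after" reading in the usage line is a hypothetical value, not data from this thread.

```python
def crc_delta(before: int, after: int) -> int:
    """New CRC errors logged between two readings of the raw counter."""
    if after < before:
        # The counter is cumulative, so a lower value suggests the two
        # readings came from different drives.
        raise ValueError("raw CRC count decreased between readings")
    return after - before

def link_looks_clean(before: int, after: int) -> bool:
    """True if no new CRC errors appeared since the baseline reading."""
    return crc_delta(before, after) == 0

baseline = 8424          # 3 TB drive's count from the SMART report
later_reading = 8424     # hypothetical reading taken after recabling
print(link_looks_clean(baseline, later_reading))
```

A count that stays flat under load after a cable or slot change suggests the fix worked; any further increase means the faulty part is still in the path.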
Squid Posted May 10, 2016

RobJ said: Oh, and don't staple the cables either!

You would be amazed how often I trace customers' problems to that at work.