TheStapler

UDMA_CRC_Error_Count -- caused failed drive (with full SMART pass report)

I put a new drive in, and within a day it went RED X... the only errors on it were the UDMA CRC errors.

If the drive is "good", why did it fail? Should I RMA this drive? I haven't been able to get an answer as to what these errors really are... some say it is a bad SATA cable, or just a communication error... but I don't know.... :(  This drive replaced a failed 2TB drive, which had some other errors on it as well as a ton of UDMA errors. That old drive is still connected to my server, so the new drive did not go onto the same port, and I have made sure that all the cables are connected and good...

 

What should I do???? :'(

 

Here is the general SMART report for the 2tb original failed drive:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA6115958
LU WWN Device Id: 5 0014ee 25b072d3b
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue May 10 14:47:25 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(38760) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 374) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       11
  3 Spin_Up_Time            0x0027   253   171   021    Pre-fail  Always       -       1008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1390
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   056   056   000    Old_age   Always       -       32276
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       246
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   088   088   000    Old_age   Always       -       337956
194 Temperature_Celsius     0x0022   126   116   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   198   000    Old_age   Always       -       162944
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

and here is the extended SMART report for the new 3tb drive that has 'failed':

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.18-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7FAT48N
LU WWN Device Id: 5 0014ee 2b7af7016
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 10 14:37:10 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(39840) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 399) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   126   119   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       8424
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       106         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


You've got about 8,400 CRC errors over the life of the drive (164 hours), so I'm not surprised it redballed. Check the cabling and any power splitters, and post your diagnostics.

 

(The original drive has about 163,000 CRC errors over its lifetime. No surprise it failed either.)
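
To put those two counts on the same footing, here is a rough Python sketch (not something posted in the thread) that pulls the raw values of attribute 9 (Power_On_Hours) and attribute 199 (UDMA_CRC_Error_Count) out of a smartctl attribute table and divides one by the other. The function names are made up for illustration, and it assumes the ten-column layout shown in the reports above with a plain integer in RAW_VALUE.

```python
def raw_values(report):
    """Return {attribute_id: raw_value} for each attribute row in smartctl output."""
    values = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have 10 columns: ID, name, flag (0x....), value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            values[int(parts[0])] = int(parts[9])
    return values


def crc_errors_per_hour(report):
    """UDMA_CRC_Error_Count (199) divided by Power_On_Hours (9)."""
    vals = raw_values(report)
    return vals[199] / vals[9]


# Rows copied from the two reports in the first post.
OLD_2TB = """
  9 Power_On_Hours          0x0032   056   056   000    Old_age   Always       -       32276
199 UDMA_CRC_Error_Count    0x0032   200   198   000    Old_age   Always       -       162944
"""
NEW_3TB = """
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       164
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       8424
"""

print(f"old 2TB: {crc_errors_per_hour(OLD_2TB):.1f} CRC errors per power-on hour")
print(f"new 3TB: {crc_errors_per_hour(NEW_3TB):.1f} CRC errors per power-on hour")
```

On these numbers the new drive works out to roughly 51 errors per hour against about 5 for the old one, although the old drive's errors may well have been bunched into a recent stretch rather than spread evenly. As far as I know the raw count is cumulative and does not reset, so the useful check while troubleshooting is whether it keeps climbing after a cable is reseated or swapped.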


See Need help? Read me first! and post diagnostics.

 

If I had a syslog, I would have posted it... the server has rebooted since this issue happened... this is why I posted the 2 SMART reports...

 

I have searched the forum, and I saw a post about it, but I didn't really seem to see/understand the problem/result.

 

I can add what my server is though:

M/B: MSI - 970A-G46 (MS-7693)
CPU: AMD FX(tm)-8350 Eight-Core @ 4000
HVM: Enabled
IOMMU: Enabled
Cache: 384 kB, 8192 kB, 8192 kB
Memory: 16384 MB (max. installable capacity 32 GB)

There are also 2 Supermicro RAID/JBOD cards.

 

Here is the output of the lspci:

root@Tower:/mnt/user# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B) (rev 02)
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)
00:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port C)
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port D)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 42)
00:14.1 IDE interface: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller (rev 40)
00:14.2 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) (rev 40)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5
01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
02:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
03:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
04:05.0 VGA compatible controller: NVIDIA Corporation NV34 [GeForce FX 5200] (rev a1)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

 


Oh, and I forgot: it is an older Supermicro case with 24 hot-swap bays, all SATA female connectors on the backplane. The case has 4x 500 W redundant power supplies, with supplies 1 & 3 and 2 & 4 running off two APC 1300XL UPSes.


It's bad cabling / loose cabling / data cabling tie strapped too tight / unshielded cabling tied to others / unshielded cabling tied to power / poor power splitters (99 % of them suck) / too many splitters / etc etc etc. 

 

The number one mistake people make when building computers is cheaping out on data cables, or making everything look neat (especially when they do it with tie straps).

 



The power that feeds the backplane, I've never touched... I believe there is only 1 power connector, but I could be wrong... I will double check later.

 

My SATA cables are not tie-wrapped. They are decent-quality SFF-8087 fan-out cables, and the 6 on the motherboard are brand new and were pulled out of their bags the day the motherboard was installed. The fan-out cables were also brand new when I got the RAID cards, which was when I bought the mobo. I replaced all the fans in the case at the same time too (most were dying, so it was safer to replace them all at once). These 2 drives are on the same channel of the same RAID card, though (I just looked at my drive charts), so maybe that is somewhere to start...

 

Is there any way to see if there are errors with the raid card itself?  like some kind of raid card diagnostics?

 

Typically, I don't cheap out... which is why I have been switching to the WD RED drives. I was having issues with the WD GREENs dying more often, and until this WD RED died, I had never had a WD RED fail.

 

I don't spin down my drives, and I used to use the cache drive and the mover script, but I don't use that any more. I am thinking about moving my "apps" off of my cache drive and putting them on the array, so that I have some 'safety' if/when the cache drive were to fail.

 

 



Well, sorry for not being a qualified mind reader. I'm working on it, but I'm starting to think that the course is not completely legit ;D

 

At least you now know that you should try to grab diagnostics (not only the syslog) before rebooting if you encounter problems in the future. Good luck with the troubleshooting. :)
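
For anyone wondering what "grab it before rebooting" could look like in practice, here is a minimal Python sketch that copies a log file to a timestamped name in another directory. It is only an illustration: `preserve_log` is a made-up helper, and the paths in the usage comment are assumptions (that the live syslog sits at /var/log/syslog in RAM and that /boot is the flash drive that survives a reboot). It covers only the syslog, not the full diagnostics.

```python
import shutil
import time
from pathlib import Path


def preserve_log(log_path, dest_dir):
    """Copy a log file into dest_dir under a timestamped name and return the new path."""
    src = Path(log_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # "syslog" becomes e.g. "syslog-20160510-144725"; any suffix is kept at the end.
    target = dest / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)
    return target


# Hypothetical usage before a reboot (both paths are assumptions, adjust to your box):
#   preserve_log("/var/log/syslog", "/boot/logs")
```

The timestamp in the file name is there so that repeated saves do not overwrite each other when a problem comes and goes.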


I just want to echo what has already been said: it's not the drives and it's not the controller. It's almost certainly the cables or their connectors, or, much less likely, flaky power. It doesn't matter how new the cables are; at least one of them is bad, or you have power issues. Even brand-new cables can be lemons.
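
Since the original question was what these errors really are: attribute 199 counts transfers that failed a checksum on the link between the drive and the controller, which is why a drive can pass its own surface tests and still rack them up. The Python sketch below illustrates the idea only. It uses `zlib.crc32` as a stand-in and is not the exact CRC computation a SATA link performs; the helper names and the single flipped bit are assumptions made for the example.

```python
import zlib


def send_frame(payload):
    """Sender side: return the payload together with its CRC-32."""
    return payload, zlib.crc32(payload)


def corrupt(payload, bit):
    """Flip one bit, standing in for noise picked up on a marginal cable or connector."""
    data = bytearray(payload)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


def receive_frame(payload, crc):
    """Receiver side: True if the payload still matches the CRC it was sent with."""
    return zlib.crc32(payload) == crc


payload, crc = send_frame(b"sector data read correctly from the platter")
print(receive_frame(payload, crc))              # True: clean link
print(receive_frame(corrupt(payload, 5), crc))  # False: damaged in transit
```

In the second case the data was fine when it left the sender, the mismatch is caught at the far end, and a counter like UDMA_CRC_Error_Count goes up. Nothing in that path involves the disk surface, which fits the clean extended self-test on the new drive.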

 

Oh, and don't staple the cables either!  :D


 

 

You would be amazed how often I trace customers' problems to exactly that at work.

 

 
