Is one of my disks failing?


craigr


So a few days ago I noticed that one of my 3TB WD drives (WD30EZRX-00DC0B0) had 45 errors on the unRAID MAIN page.  This was just after upgrading my MB and CPU a few days earlier.

 

I unplugged and replugged the SATA and power cables on the drive.  I did a parity check that resulted in zero errors.  I then moved all the data off the drive and ran a preclear on it.  The preclear found no errors.  Below are the pertinent SMART details:

 

#    Attribute Name           Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
1    Raw Read Error Rate      0x002f  200    200    051        Pre-fail  Always   -       3156
3    Spin Up Time             0x0027  184    178    021        Pre-fail  Always   -       5800
4    Start Stop Count         0x0032  097    097    000        Old age   Always   -       3005
5    Reallocated Sector Ct    0x0033  200    200    140        Pre-fail  Always   -       0
7    Seek Error Rate          0x002e  200    200    000        Old age   Always   -       0
9    Power On Hours           0x0032  070    070    000        Old age   Always   -       22563 (2y, 210d, 3h)
10   Spin Retry Count         0x0032  100    100    000        Old age   Always   -       0
11   Calibration Retry Count  0x0032  100    100    000        Old age   Always   -       0
12   Power Cycle Count        0x0032  100    100    000        Old age   Always   -       371
192  Power-Off Retract Count  0x0032  200    200    000        Old age   Always   -       173
193  Load Cycle Count         0x0032  200    200    000        Old age   Always   -       2860
194  Temperature Celsius      0x0022  123    112    000        Old age   Always   -       27
196  Reallocated Event Count  0x0032  200    200    000        Old age   Always   -       0
197  Current Pending Sector   0x0032  200    200    000        Old age   Always   -       0
198  Offline Uncorrectable    0x0030  200    200    000        Old age   Offline  -       0
199  UDMA CRC Error Count     0x0032  200    200    000        Old age   Always   -       0
200  Multi Zone Error Rate    0x0008  200    200    000        Old age   Offline  -       5


 

What I find strange is that there are no reallocated sectors or events, but the raw read error rate is 3156 and the multi zone error rate is at 5.

 

I have seen no other errors since running the preclear and the drive passed preclear.  I have copied nearly 3TB of data to the drive and it's now nearly full again with no further errors.

 

Thoughts?

 

Thanks guys,

craigr

Link to comment

The raw values for Raw Read Error Rate and Multi Zone Error Rate are manufacturer-specific, so on their own they are effectively meaningless.

 

For those attributes, what you should be looking at are the Value and Threshold entries.

 

In both cases the Value is 200, against failure thresholds of 51 and 0 respectively, and the Worst value is also 200, so neither attribute has ever dropped below its current reading.

 

In other words, the drive is still in perfect shape.  If the value ever approaches the threshold then I would change the drive.
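
To make the Value-versus-Threshold comparison concrete, here is a minimal Python sketch. It assumes rows in the same whitespace-separated layout as the SMART report in the first post; the names `SmartAttr`, `parse_row`, `has_failed` and `margin` are made up for illustration and are not part of unRAID or smartmontools. It treats an attribute as failed when the normalized value is at or below a non-zero threshold, which is the usual SMART convention.

```python
from dataclasses import dataclass


@dataclass
class SmartAttr:
    attr_id: int
    name: str
    value: int      # normalized value (higher is healthier)
    worst: int      # lowest normalized value recorded so far
    threshold: int  # failure threshold set by the manufacturer
    raw: str        # vendor-specific raw value, kept as text


def parse_row(line: str) -> SmartAttr:
    """Parse one row: ID, name, flag, value, worst, threshold, type, updated, failed, raw."""
    tokens = line.split()
    # The flag column (e.g. 0x002f) marks where the multi-word name ends.
    flag = next(i for i, t in enumerate(tokens) if t.startswith("0x"))
    # The type column is one or two words ("Pre-fail" / "Old age"), so locate
    # the "Updated" column by its value instead of by a fixed offset.
    updated = next(i for i in range(flag + 4, len(tokens))
                   if tokens[i] in ("Always", "Offline"))
    return SmartAttr(
        attr_id=int(tokens[0]),
        name=" ".join(tokens[1:flag]),
        value=int(tokens[flag + 1]),
        worst=int(tokens[flag + 2]),
        threshold=int(tokens[flag + 3]),
        raw=" ".join(tokens[updated + 2:]),  # skip the "Failed" column
    )


def has_failed(attr: SmartAttr) -> bool:
    """A threshold of 0 means the attribute can never trip a failure."""
    return attr.threshold > 0 and attr.value <= attr.threshold


def margin(attr: SmartAttr) -> int:
    """Distance between the normalized value and the failure threshold."""
    return attr.value - attr.threshold


if __name__ == "__main__":
    rows = [
        "1 Raw Read Error Rate 0x002f 200 200 051 Pre-fail Always - 3156",
        "200 Multi Zone Error Rate 0x0008 200 200 000 Old age Offline - 5",
    ]
    for attr in map(parse_row, rows):
        status = "FAILED" if has_failed(attr) else "ok"
        print(f"{attr.name}: value={attr.value} threshold={attr.threshold} "
              f"margin={margin(attr)} raw={attr.raw} -> {status}")
```

The raw column is deliberately left as text and never used in the health check, since its meaning varies by manufacturer.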

Link to comment

There are other reasons for unRAID to report errors, and they may not be the fault of the drive at all.  Most commonly, it's a SAS card failure that causes loss of contact with the drive.  Any attempt to read from or write to the drive after that will fail and be reported as an error, even though the drive itself is not at fault.  There are other, similar causes too (power issues, loose cabling and splitters, controller issues, port issues, heat issues, etc).  The syslog from that session would have the full story.

Link to comment

Thanks guys.

 

The drive seems fine and has been performing flawlessly with no further errors.  I also noticed that four of my other 4TB drives had very high "UDMA CRC Error Count" as well in their SMART reports...

 

I upgraded and added my cold spare 4TB drive and two new 5TB drives to the array (3 more drives total).  After the upgrade unRAID would not see some drives once booted, and the drives it did see were different on each boot.  I split my 12 volt rails up differently and I have not seen any more errors since.  I suspect that I had a bad power connection to one of my splitters, or that my PS actually has more than one 12 volt rail even though it is specced as having a single 12 volt rail.

 

Thanks for all the input.

 

craigr

Link to comment


A high CRC error count almost always means that the SATA cable is not providing a solid connection.  It could be loose on one end, or the cable itself could be bad.  It is very easy to end up with a marginal data connection, and it is by far the most common cause of red balls, X's, or whatever we call them now.
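
One way to tell whether reseating or replacing a cable actually helped: as far as I know, the UDMA CRC Error Count raw value is a lifetime counter that does not reset, so what matters is whether it keeps climbing. Below is a rough Python sketch that compares two snapshots of that counter taken before and after a fix. The drive labels and counts are placeholders for illustration, not numbers from this thread, and `crc_growth` is a hypothetical helper name.

```python
def crc_growth(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return {drive: increase} for drives whose CRC error count went up.

    Drives that appear in only one snapshot are ignored, since there is
    nothing to compare them against.
    """
    return {
        drive: after[drive] - before[drive]
        for drive in after
        if drive in before and after[drive] > before[drive]
    }


if __name__ == "__main__":
    # Placeholder readings of attribute 199 (raw value) per drive.
    before = {"disk1": 120, "disk2": 0, "disk3": 4500}
    after = {"disk1": 120, "disk2": 0, "disk3": 4620}

    growing = crc_growth(before, after)
    if growing:
        for drive, delta in growing.items():
            print(f"{drive}: CRC errors still increasing (+{delta}), check cable and port")
    else:
        print("No new CRC errors since the last snapshot")
```

A drive with a large but unchanged count only shows that it had a bad connection at some point in the past.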

Link to comment

Or the SATA cables are picking up noise from adjacent cables (either power or other SATA cables).

 

This often occurs when a user attempts to make their server look neat by bundling all the SATA cables together.  Doing so creates a situation where induced noise is very likely.

 

Therefore, cut the tie-wraps bundling the cables together.  Yes, it looks less neat, but... you'll see far fewer noise-induced CRC errors.

 

Joe L.

Link to comment
