Is one of my disks failing?

craigr · September 5, 2015

A few days ago I noticed that one of my 3TB WD drives (WD30EZRX-00DC0B0) showed 45 errors on the unRAID MAIN page. This was just a few days after upgrading my MB and CPU.

I unplugged and replugged the SATA and power cables on the drive, then ran a parity check that found zero errors. I then moved all the data off the drive and ran a preclear on it. The preclear found no errors. Below are the pertinent SMART details:

#    Attribute Name           Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
1    Raw Read Error Rate      0x002f  200    200    051        Pre-fail  Always   -       3156
3    Spin Up Time             0x0027  184    178    021        Pre-fail  Always   -       5800
4    Start Stop Count         0x0032  097    097    000        Old age   Always   -       3005
5    Reallocated Sector Ct    0x0033  200    200    140        Pre-fail  Always   -       0
7    Seek Error Rate          0x002e  200    200    000        Old age   Always   -       0
9    Power On Hours           0x0032  070    070    000        Old age   Always   -       22563 (2y, 210d, 3h)
10   Spin Retry Count         0x0032  100    100    000        Old age   Always   -       0
11   Calibration Retry Count  0x0032  100    100    000        Old age   Always   -       0
12   Power Cycle Count        0x0032  100    100    000        Old age   Always   -       371
192  Power-Off Retract Count  0x0032  200    200    000        Old age   Always   -       173
193  Load Cycle Count         0x0032  200    200    000        Old age   Always   -       2860
194  Temperature Celsius      0x0022  123    112    000        Old age   Always   -       27
196  Reallocated Event Count  0x0032  200    200    000        Old age   Always   -       0
197  Current Pending Sector   0x0032  200    200    000        Old age   Always   -       0
198  Offline Uncorrectable    0x0030  200    200    000        Old age   Offline  -       0
199  UDMA CRC Error Count     0x0032  200    200    000        Old age   Always   -       0
200  Multi Zone Error Rate    0x0008  200    200    000        Old age   Offline  -       5

What I find strange is that there are no reallocated sectors or events, but the raw read error rate is 3156 and the multi zone error rate is at 5.

I have seen no other errors since running the preclear and the drive passed preclear. I have copied nearly 3TB of data to the drive and it's now nearly full again with no further errors.

Thoughts?

Thanks guys,

craigr

Squid · September 5, 2015

The raw values for Raw Read Error Rate and Multi Zone Error Rate are manufacturer-specific and are not meaningful on their own.

For those attributes, what you should be looking at are the Value and Threshold entries.

In both cases the value is 200, against failure thresholds of 51 and 0 respectively, and the worst recorded value is also 200.

In other words, the drive is still in perfect shape. If the value ever approaches the threshold then I would change the drive.
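The comparison described above can be sketched in Python. This is only an illustration: the regular expression assumes rows laid out exactly as in the table posted in this thread, the `headroom` helper is a hypothetical name, and treating "value at or below threshold" as failed (with a threshold of 0 meaning the attribute cannot fail) is an assumption about the usual convention, not something taken from the drive's documentation.

```python
import re

# Assumed row layout (as posted in this thread):
# ID, name, flag, value, worst, threshold, type, updated, failed, raw value.
ROW = re.compile(
    r"^(?P<id>\d+)\s+(?P<name>.+?)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>Pre-fail|Old age)\s+(?P<updated>\S+)\s+(?P<failed>\S+)\s+(?P<raw>.*)$"
)


def headroom(line):
    """Return value, worst, threshold and margin for one SMART row, or None."""
    m = ROW.match(line.strip())
    if m is None:
        return None  # not an attribute row (header, blank line, etc.)
    value = int(m.group("value"))
    worst = int(m.group("worst"))
    thresh = int(m.group("thresh"))
    return {
        "name": m.group("name"),
        "value": value,
        "worst": worst,
        "threshold": thresh,
        # How far the normalized value sits above the failure threshold.
        "margin": value - thresh,
        # Assumption: failed once value <= threshold; threshold 0 never fails.
        "failed": thresh > 0 and value <= thresh,
    }


rows = [
    "1 Raw Read Error Rate 0x002f 200 200 051 Pre-fail Always - 3156",
    "200 Multi Zone Error Rate 0x0008 200 200 000 Old age Offline - 5",
]
for row in rows:
    print(headroom(row))
```

For the two attributes in question the margins come out to 149 and 200, which is the "still in perfect shape" reading: the raw values of 3156 and 5 play no part in the check.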

craigr · September 6, 2015

Thank you. That is what I thought.

What I found alarming was that it seemed to happen suddenly along with unRAID reporting 45 errors.

Best,

craigr

RobJ · September 7, 2015

There are other reasons for unRAID to report errors, and they may not be the fault of the drive at all. Most commonly, it's a SAS card failure that causes loss of contact with the drive. Any attempt to read from or write to the drive after that will fail and be reported as an error, even though the drive itself is fine. Other, similar causes include power issues, loose cabling and splitters, controller issues, port issues, and heat issues. The syslog from that session would have the full story.

craigr · September 15, 2015

Thanks guys.

The drive seems fine and has been performing flawlessly with no further errors. I also noticed that four of my other 4TB drives had very high "UDMA CRC Error Count" as well in their SMART reports...

I upgraded and added my cold spare 4TB drive and two new 5TB drives to the array (3 more drives total). After the upgrade, unRAID would not see some drives once booted, and the drives it did see were different on each boot. I split my 12 volt rails up differently and have not seen any more errors since. I suspect that I had a bad power connection to one of my splitters, or that my PS actually has more than one 12 volt rail even though it is specced as having a single 12 volt rail.

Thanks for all the input.

craigr

SSD · September 15, 2015

Quoting craigr: I also noticed that four of my other 4TB drives had very high "UDMA CRC Error Count" as well in their SMART reports...

A high CRC error count almost always means that the SATA cable is not providing a solid connection. It could be loose on one end, or the cable itself could be bad. It is very easy to end up with a marginal data connection, and it is by far the most common cause of red balls, X's, or whatever we call them now.

Joe L. · September 15, 2015

Or the SATA cables are picking up noise from adjacent cables (adjacent power OR SATA cables).

This often occurs when a user attempts to make their server look neat by bundling all the SATA cables together.

Doing so creates a situation where induced noise is very likely.

Therefore, cut the tie-wraps bundling the cables together. Yes, it looks less neat, but you'll see far fewer noise-induced CRC errors.

Joe L.
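One way to tell whether reseating or unbundling the cables helped is to compare the raw UDMA CRC Error Count before and after the change. The raw count is typically cumulative and does not drop back once the cabling is fixed, so what matters is whether it keeps climbing. A minimal sketch, assuming the readings were copied by hand from each drive's SMART report; the drive names, the numbers, and the `crc_increase` helper are all hypothetical:

```python
def crc_increase(before, after):
    """Return drives whose raw UDMA CRC error count grew between two readings."""
    return {
        drive: after[drive] - count
        for drive, count in before.items()
        if drive in after and after[drive] > count
    }


# Hypothetical raw values of attribute 199, read before and after recabling.
before = {"disk3": 412, "disk4": 97, "disk5": 0}
after = {"disk3": 412, "disk4": 103, "disk5": 0}

# Only drives that are still accumulating errors are reported.
print(crc_increase(before, after))
```

A drive that shows up in the result is still logging CRC errors, so its cable or port deserves another look. A drive with a large but unchanged count needs no further action.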
