Did my disk die? "Device is disabled. Contents emulated."

mckenna654 · May 22, 2017

I received an email alert this afternoon stating the following:

Event: unRAID array errors
Subject: Warning [S-CHASSIS] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Disk 1 - ST3000VN000-1H4167_Z3101PN6 (sdb) (errors 15)

One of my disks now reads "Device is disabled. Contents emulated." and is marked as faulty (with a red cross) in the array status.

Should I attempt to rebuild the array onto the disk? What's the best way of doing so?

I am running a basic two-disk RAID 1 (mirror) array.

See attached for my server diagnostics zip.

Please let me know if there is any other info I can provide.

Thanks

s-chassis-diagnostics-20170522-2158.zip

Zonediver · May 22, 2017

The disk has a lot of Raw Read Errors, so it seems this disk is dying or already dead.

Also, the Seek Error Rate is extremely high.

Power On Hours: 21,068, which is also a lot.

I would change the disk as soon as possible.

EDIT: The second disk also has lots of errors and will fail soon, I guess...

Edited May 22, 2017 by Zonediver

mckenna654 · May 22, 2017

1 minute ago, Zonediver said:

Check/change your SATA cable first and see what happens.

No problem.

I will check that tomorrow. I have an HP N54L MicroServer, so I'm not 100% sure how it's all wired inside.

Would moving the drive to a different bay achieve the same result?

JorgeB · May 22, 2017

Move it to a different slot and rebuild to the same disk, if it fails again in the near future replace it.

mckenna654 · May 23, 2017

16 hours ago, johnnie.black said:

Move it to a different slot and rebuild to the same disk, if it fails again in the near future replace it.

Should I remove the troubled disk from the array first?

How do I unassign a disk?

JorgeB · May 23, 2017

1 hour ago, jakeandchase said:

How do I unassign a disk?

Shutdown the server, swap the disk to a different slot, power back up, unassign the disk, start the array, stop the array, reassign the disk, start the array to begin the rebuild.

SSD · May 23, 2017

On 5/22/2017 at 8:16 AM, Zonediver said:

The disk has a lot of Raw Read Errors, so it seems this disk is dying or already dead.

Also, the Seek Error Rate is extremely high.

Power On Hours: 21,068, which is also a lot.

I would change the disk as soon as possible.

EDIT: The second disk also has lots of errors and will fail soon, I guess...

The Raw Read Error Rates are fine. The large raw value means nothing. It is likely binary data that we humans have no way to understand. The "value" and "worst" are at safe levels.

Similarly, the Seek Error Rates are also fine. The "worst" has dipped to 60, but with a "thresh" of 30, it is far from dipping below the manufacturer's failure threshold. And although the normalized values are lower than I might expect for drives of this age, the fact that both drives have very similar values leads me to believe that this is normal for this drive model. (But I'd be monitoring this going forward.)

The attributes we look at closely are Reallocated Sectors and Pending Sectors. You also want to verify that none of the attributes show as failing "now" or "in the past". None of these are a concern here.
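The VALUE / WORST / THRESH comparison described above can be sketched in Python. This is a minimal illustration, not how unRAID or smartmontools implements it. It assumes the standard column layout of the `smartctl -A` attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and the usual convention that an attribute counts as failing when a normalized value is at or below a non-zero threshold. `SmartAttribute`, `parse_attribute_row` and `status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int   # current normalized value
    worst: int   # worst normalized value ever recorded
    thresh: int  # manufacturer failure threshold
    when_failed: str
    raw: str     # kept as text: its meaning is vendor-specific


def parse_attribute_row(row: str) -> SmartAttribute:
    """Parse one row of the attribute table printed by `smartctl -A`.

    Assumed column order:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = row.split()
    if len(fields) < 10:
        raise ValueError(f"unexpected attribute row: {row!r}")
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        when_failed=fields[8],
        # RAW_VALUE may contain spaces, e.g. "36 (Min/Max 20/45)".
        raw=" ".join(fields[9:]),
    )


def status(attr: SmartAttribute) -> str:
    """Compare the normalized values with the threshold.

    Assumption: a threshold of 0 never trips; otherwise the attribute is
    failing now if VALUE <= THRESH and failed in the past if WORST <= THRESH.
    """
    if attr.thresh == 0:
        return "ok"
    if attr.value <= attr.thresh:
        return "failing now"
    if attr.worst <= attr.thresh:
        return "failed in the past"
    return "ok"


# Illustrative row: WORST, THRESH and RAW echo numbers quoted in this thread;
# the FLAG and VALUE columns are made up.
row = "  7 Seek_Error_Rate         0x000f   060   060   030    Pre-fail  Always       -       91742336"
attr = parse_attribute_row(row)
print(attr.name, attr.worst, attr.thresh, status(attr))  # Seek_Error_Rate 60 30 ok
```

Note that the raw value is deliberately left as text: as discussed in this thread, for many attributes it is not a meaningful decimal number.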

The only thing that worries me about these drives is that they are 3TB Seagates, and those have a bad track record.

Zonediver · May 23, 2017

Raw Read Errors can't be fine...

One of my WD Reds has had lots of them for two weeks. I checked it outside the array and found that the transfer rate at that position is about 1.2 MB/s.

The disk has 11 block errors, so that's not fine and not normal.

Edited May 23, 2017 by Zonediver

SSD · May 23, 2017

@Zonediver

The SMART data is one dimension of drive health, and it has its limitations. Just because the SMART report looks OK doesn't mean the drive is healthy. If you saw something that led you to investigate further (which may or may not have been what the data was telling you), and in the process you identified some slow sectors that you believe are bad enough to warrant replacing the drive, there may have been some serendipity involved. But it is good that you found a drive issue.

The "raw" values that appear in the SMART report are, with a few notable exceptions, not something a person can understand without technical specs that the manufacturers do not make available. They are likely bit masks (the first 3 bits mean this, the next 2 bits mean that, etc.) that vary by manufacturer and even by model. Turning this concatenation of bits into a decimal number is meaningless. The raw values are translated into a normalized scale, where 100 is usually "good" and a threshold for going bad is defined. SMART tracks both the current normalized value and the worst the normalized value has ever been. Some attributes have raw values that are usable, such as reallocated sectors and current pending sectors: these are absolute counts and have been used in that consistent way on every drive I have looked at. They are some of the most useful attributes to track.

What matters most in the SMART attributes is not the absolute value but how it changes over time. For example, a drive with 1000 reallocated sectors that are rock solid and never increase is probably better than a drive with 20 reallocated sectors whose count grows with every parity check.
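To make the "comparison over time" idea concrete, here is a small Python sketch that flags any attribute whose raw count grew between snapshots. It assumes you have already collected the raw counts of count-style attributes (such as Reallocated_Sector_Ct and Current_Pending_Sector) after each parity check; the function name `growing_attributes` and the snapshot format are hypothetical.

```python
from typing import Dict, List


def growing_attributes(snapshots: List[Dict[str, int]]) -> Dict[str, int]:
    """Return {attribute: growth} for counts that grew between the first
    and the last snapshot.

    Each snapshot maps an attribute name to its raw count; snapshots are in
    chronological order (e.g. one per parity check). Only attributes present
    in both the first and the last snapshot are compared.
    """
    if len(snapshots) < 2:
        return {}
    first, last = snapshots[0], snapshots[-1]
    return {
        name: last[name] - first[name]
        for name in first.keys() & last.keys()
        if last[name] > first[name]
    }


# The two drives from the example above (counts are illustrative).
stable = [{"Reallocated_Sector_Ct": 1000}] * 3
growing = [{"Reallocated_Sector_Ct": n} for n in (20, 28, 35)]
print(growing_attributes(stable))   # {}
print(growing_attributes(growing))  # {'Reallocated_Sector_Ct': 15}
```

Comparing only the first and last snapshot keeps the sketch short; comparing the last two snapshots instead would tell you whether the count is still growing right now.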

I do not have a particularly warm place in my heart for WD or Seagate drives (although I have been pretty happy with the 8TB Seagate Archives, given their price and my experience with them to date). I believe the HGSTs are the highest quality, and if cost were no object, I'd be buying those.

Zonediver · May 23, 2017

Yes, that's possible. This WD Red is my first failing Red, and I use plenty of this type.

Over the last 17 years I have seen dying drives from IBM and, of course, Seagate, but only one WD.

All my retired WD Greens are still working well, but this Red failure tells me that WD's quality is going down. So yes, if I had the money, I would prefer HGST too.

Edited May 23, 2017 by Zonediver

mckenna654 · May 23, 2017

I have changed slots and rebuilt the array with no issues.

If it fails again, I will post back.

Thanks

seagate_surfer · May 31, 2017

Hi, we are sorry to hear that you're experiencing issues with your Seagate drive. Just in case you encounter further issues with one of our drives, you can always contact our Customer Support or look into any warranty information here. Please feel free to reach out if you have any questions!

limetech · May 31, 2017

1 hour ago, seagate_surfer said:

Hi, we are sorry to hear that you're experiencing issues with your Seagate drive

Hi surfer, please check your PM.

unevent · May 31, 2017

On 5/23/2017 at 1:01 PM, bjp999 said:

The Raw Read Error Rates are fine. The large raw value means nothing. It is likely binary data that we humans have no way to understand. The "value" and "worst" are at safe levels.

Similarly, the Seek Error Rates are also fine. The "worst" has dipped to 60, but with a "thresh" of 30, it is far from dipping below the manufacturer's failure threshold. And although the normalized values are lower than I might expect for drives of this age, the fact that both drives have very similar values leads me to believe that this is normal for this drive model. (But I'd be monitoring this going forward.)

The attributes we look at closely are Reallocated Sectors and Pending Sectors. You also want to verify that none of the attributes show as failing "now" or "in the past". None of these are a concern here.

The only thing that worries me about these drives is that they are 3TB Seagates, and those have a bad track record.

I'm not replying to bjp999 specifically; I'm using the post to bring the focus back to the Seagate error values, since the thread took a couple of turns. The seek error rate and the raw read error rate are 48-bit values. Convert the reported value to hexadecimal: the upper 16 bits are the number of errors and the lower 32 bits are the total number of seeks. So for the ST3000VN000-1H4167 drive, a seek error rate of 91742336 converts to 0x00000577E080. The upper 16 bits, 0x0000, mean zero seek errors over 0x0577E080 = 91,742,336 seeks. The high fly write count, however, is high on both Seagate drives. The advice from johnnie.black is good; alternatively, replace the drive, run preclear on this one, and see whether the high fly write count increases. As bjp999 mentioned, reallocated sectors and pending sectors are the values people typically watch, but on Seagate drives I also look at high fly writes.
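The decoding described above can be checked with a few lines of Python. This is a sketch based only on the layout given in the post (upper 16 bits = errors, lower 32 bits = operations); that layout comes from this thread rather than from a published Seagate specification, and `decode_seagate_rate` is a hypothetical name.

```python
from typing import Tuple


def decode_seagate_rate(raw: int) -> Tuple[int, int]:
    """Split a 48-bit Seagate 'rate' raw value into (errors, operations).

    Assumed layout: the upper 16 bits hold the error count and the lower
    32 bits hold the total operation count (seeks, for Seek_Error_Rate).
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("expected a 48-bit unsigned raw value")
    errors = raw >> 32               # upper 16 bits
    operations = raw & 0xFFFFFFFF    # lower 32 bits
    return errors, operations


raw = 91742336                       # Seek_Error_Rate raw value quoted above
print(format(raw, "#014x"))          # 0x00000577e080
errors, seeks = decode_seagate_rate(raw)
print(f"{errors} seek errors over {seeks:,} seeks")  # 0 seek errors over 91,742,336 seeks
```

Because the error count sits in the upper bits, any non-zero error count makes the decimal raw value jump above 4,294,967,295 (2^32 - 1), which is why a large-looking decimal number on its own says nothing about errors.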

SSD · May 31, 2017

2 hours ago, unevent said:

I'm not replying to bjp999 specifically; I'm using the post to bring the focus back to the Seagate error values, since the thread took a couple of turns. The seek error rate and the raw read error rate are 48-bit values. Convert the reported value to hexadecimal: the upper 16 bits are the number of errors and the lower 32 bits are the total number of seeks. So for the ST3000VN000-1H4167 drive, a seek error rate of 91742336 converts to 0x00000577E080. The upper 16 bits, 0x0000, mean zero seek errors over 0x0577E080 = 91,742,336 seeks. The high fly write count, however, is high on both Seagate drives. The advice from johnnie.black is good; alternatively, replace the drive, run preclear on this one, and see whether the high fly write count increases. As bjp999 mentioned, reallocated sectors and pending sectors are the values people typically watch, but on Seagate drives I also look at high fly writes.

Thanks for the insights on those attributes. Is that pretty generic across manufacturers, or strictly Seagate? I always see some high fly writes on Seagates and have never seen any correlation with drive failures. What do you look for with that attribute?

unevent · May 31, 2017

27 minutes ago, bjp999 said:

Thanks for the insights on those attributes. Is that pretty generic across manufacturers, or strictly Seagate? I always see some high fly writes on Seagates and have never seen any correlation with drive failures. What do you look for with that attribute?

I'm not sure how many other manufacturers use encoded values; I only know of Seagate. According to the Wikipedia article on high fly writes, there is a head flying-height sensor 'which detects when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive'. A handful or so seems normal to me, but when the number is as high (50+) as these two SMART reports show, I begin to wonder whether there are issues with the drive's mechanics and how many weak writes have been performed. If it were mine, I would run one or two full preclear cycles; whether the count increases, and by how much, would decide whether the drive gets shelved or stays in use.
