dying HDD?

aspik · January 25, 2014

I have an error on one of my disks. The log shows this:

Jan 25 16:38:00 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 25 16:38:00 Tower kernel: ata3.00: irq_stat 0x40000001
Jan 25 16:38:00 Tower kernel: ata3.00: failed command: READ DMA EXT
Jan 25 16:38:00 Tower kernel: ata3.00: cmd 25/00:08:60:e2:03/00:00:d6:00:00/e0 tag 0 dma 4096 in
Jan 25 16:38:00 Tower kernel: res 51/40:08:60:e2:03/00:00:d6:00:00/e0 Emask 0x9 (media error)
Jan 25 16:38:00 Tower kernel: ata3.00: status: { DRDY ERR }
Jan 25 16:38:00 Tower kernel: ata3.00: error: { UNC }
Jan 25 16:38:00 Tower kernel: ata3.00: configured for UDMA/133
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd] Unhandled sense code
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: Sense Key : 0x3 [current] [descriptor]
Jan 25 16:38:00 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Jan 25 16:38:00 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 25 16:38:00 Tower kernel: d6 03 e2 60
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: ASC=0x11 ASCQ=0x4
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd] CDB:
Jan 25 16:38:00 Tower kernel: cdb[0]=0x28: 28 00 d6 03 e2 60 00 00 08 00
Jan 25 16:38:00 Tower kernel: end_request: I/O error, dev sdd, sector 3590578784
Jan 25 16:38:00 Tower kernel: ata3: EH complete
Jan 25 16:38:00 Tower kernel: md: disk1 read error, sector=3590578720
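The failing sector can be read straight out of the "cmd" line above. A minimal sketch, assuming the usual libata field order (command/feature:count:LBA low:mid:high, then the high-order bytes in the same order, then device); the function name is made up for illustration:

```python
import re

# Assumed libata layout of the 12 hex bytes after "cmd" or "res":
# CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
TASKFILE = re.compile(r"(?:cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

def taskfile_lba(line):
    """Return the 48-bit LBA encoded in a libata 'cmd' or 'res' log line."""
    match = TASKFILE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    f = [int(b, 16) for b in re.split(r"[/:]", match.group(1))]
    lbal, lbam, lbah = f[3], f[4], f[5]            # low-order LBA bytes
    hob_lbal, hob_lbam, hob_lbah = f[8], f[9], f[10]  # high-order LBA bytes
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
            | lbah << 16 | lbam << 8 | lbal)

line = "ata3.00: cmd 25/00:08:60:e2:03/00:00:d6:00:00/e0 tag 0 dma 4096 in"
print(taskfile_lba(line))  # 3590578784
```

The result agrees with the "end_request: I/O error, dev sdd, sector 3590578784" line and with the "d6 03 e2 60" bytes in the sense data, so all three log entries describe the same failed read.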

The SMART test results show no errors, but maybe I'm reading them wrong. I can attach them if someone tells me which one is the right one. After I started the SMART test, the disk made some loud noises (something like buzzing); after a few seconds it went back to normal (the buzzing stopped). Should I replace the disk?

Thanks for the help!

dirtysanchez · January 25, 2014

Post a SMART report for that drive.

aspik · January 25, 2014

Attached are the SMART logs for the disk.

smart_sdd.txt

dirtysanchez · January 25, 2014

SMART report looks fine.

Reseat power and SATA connections to the drive and see if the problem persists.

aspik · January 25, 2014

I will do this later. I'm currently pre-clearing a disk, so I can't power down the server. I now see the same error on the second disk, along with a lot of read errors (130). Attached are the SMART logs from the second disk. I suppose they also look OK. I will reseat the cables on that one too and see if it helps.

Does this have an impact on the parity? How do I handle these errors? Should I do a parity check?

smart_sdf.txt

aspik · January 30, 2014

Hi dirtysanchez,

I disconnected and reconnected the power and SATA cables, but unfortunately I still get this error. My array has 4 disks: 2x WD Red 3TB and 2x old RE4-GP 2TB, and the error occurs only on the RE4 disks. Sometimes there is only 1 read error and sometimes as many as 150! I'm really afraid that something might go wrong on both disks and I'll lose the data on both...

I've attached the syslog. Do you have any other ideas about what's wrong?

syslog.txt

dirtysanchez · January 30, 2014

I'm not a syslog and drive error expert, but according to this that error is not a problem and can be ignored.

Hopefully an expert can chime in.

aspik · January 31, 2014

Thanks for the link! Which error do you mean? Because the log says:

Jan 30 20:27:50 Tower kernel: ata3.00: cmd 25/00:08:40:00:0c/00:00:6e:00:00/e0 tag 0 dma 4096 in
Jan 30 20:27:50 Tower kernel:          res 51/40:08:40:00:0c/00:00:6e:00:00/e0 Emask 0x9 (media error)
Jan 30 20:27:50 Tower kernel: ata3.00: status: { DRDY ERR }
Jan 30 20:27:50 Tower kernel: ata3.00: error: { UNC }

And this looks like a Drive media issue #1, which is not good at all…
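The first two bytes of the "res" line are the drive's status and error registers, which is where the kernel gets the "{ DRDY ERR }" and "{ UNC }" text. A minimal sketch of that decoding; the bit tables below are the commonly documented ATA names and are an assumption here, not taken from this thread:

```python
# Assumed ATA register bit names; check them against the ATA spec before
# relying on them. Unlisted bits (e.g. 0x10 in status) are ignored.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

def decode(value, table):
    """List the names of all bits from the table that are set in value."""
    return [name for bit, name in table.items() if value & bit]

def decode_res(res):
    """Decode the leading 'STATUS/ERROR' pair of a libata 'res' taskfile."""
    status, error = (int(x, 16) for x in res.split(":")[0].split("/"))
    return {"status": decode(status, STATUS_BITS),
            "error": decode(error, ERROR_BITS)}

print(decode_res("51/40:08:40:00:0c/00:00:6e:00:00/e0"))
# {'status': ['DRDY', 'ERR'], 'error': ['UNC']}
```

Read this way, DRDY ERR only says that the drive was ready and flagged an error; the specific complaint is UNC (uncorrectable data), which matches the "(media error)" label the kernel prints.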

itimpi · January 31, 2014

The error indicated is not fatal per se, as the system recovered, but it would not occur on a well-behaving system. It does suggest there is an underlying problem, probably with the cabling or power supply.

aspik · January 31, 2014

This is strange.. I use brand new Delock straight/straight 30cm cables (Item No. 82676). The PSU is also a new Corsair CX430M. Everything is in the Q25 case from Lian Li. Could it be something in the HDD cage of the case? The cables are not connected directly to the disks, but to the HDD cage.

DaleWilliams · January 31, 2014

Sounds like a very nice build!

I'd try changing the SATA cable routing, first. Unless they're shielded (which are hard to find), the SATA cables can interfere with each other. Don't tie them together, even though it does make a clean looking build.

Also swap the cables around. If the 'problem' moves from sdd to another drive (sdX), then I'd replace the cable.

dgaschk · January 31, 2014

Paste the SMART report for disk 1.

dirtysanchez · January 31, 2014

Regarding "Which error do you mean?": I was referring to the DRDY ERR.

Also, I have the same case and as you state it has a backplane. You can always try removing and reseating the drive.

aspik · February 10, 2014

Unfortunately the error still occurs. What I've done already:

- reseated the power and data cables: no effect

- removed the disks from the backplane and inserted them again: no effect

- changed the data cables to other new ones: no effect

Currently I have 5 disks (with parity) in the array, and the errors occur only on the RE4 disks. Attached are the syslog and the SMART report for disk1.

Does anyone have an idea?

syslog.txt

smart_disk1.txt

snowboardjoe · February 10, 2014

How often do you get the DRDY ERR? Sort of sounds like a condition where the disk is spun down and an event comes along to trigger it to spin up? Just a guess at this point. Could be certain hardware just reports this.

Do you see the error for other drives? Do you know if the disk was normally spun down before the error?
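To put numbers on both questions, the "md: diskN read error" lines in the syslog can be tallied per disk. A rough sketch, assuming the line format seen earlier in this thread; the function name is hypothetical:

```python
import re
from collections import Counter

# Matches array-level read errors such as "md: disk1 read error, sector=..."
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error")

def count_read_errors(lines):
    """Count md read-error lines per array disk in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the first post in this thread.
sample = [
    "Jan 25 16:38:00 Tower kernel: ata3: EH complete",
    "Jan 25 16:38:00 Tower kernel: md: disk1 read error, sector=3590578720",
]
print(count_read_errors(sample))  # Counter({'disk1': 1})
```

For a saved log, pass the open file instead of the sample list, e.g. `count_read_errors(open("syslog.txt"))`. The timestamps on the matching lines can then be compared against the spin-up times.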

Hopefully some others can chime in here on their experiences.

EDITED: Also found this...

http://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues#Physical_Drive_Issues

That indicates it's a true error or interface problem. Could be a long SMART test is in order here.

aspik · February 10, 2014

Thanks for the reply, snowboardjoe.

"Do you see the error for other drives?"

As I said before, it occurs only on the WD RE4 disks. I have two of them in my array.

"Do you know if the disk was normally spun down before the error?"

Yes, indeed: it happens when the disk was spun down and I started to play a file on the HTPC.

"That indicates it's a true error or interface problem. Could be a long SMART test is in order here."

This is what worries me: a physical drive issue :( I have already run a long SMART test from the web GUI, and the result was without errors.

aspik · April 19, 2014

FYI: if anyone else stumbles in here with a similar problem, I finally solved it.

It turns out it was bad cable management and the wrong cables. I used straight/straight SATA cables, and after closing the side panel the cables were squeezed too tightly and I got the errors. When I left the case open (without the side panel) the errors were gone. The solution is to buy down/straight cables and do better cable management...

DaleWilliams · April 19, 2014

There's a new one for the wiki!
