dying HDD?


aspik

I have an error on one of my disks. The log shows this:

Jan 25 16:38:00 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 25 16:38:00 Tower kernel: ata3.00: irq_stat 0x40000001
Jan 25 16:38:00 Tower kernel: ata3.00: failed command: READ DMA EXT
Jan 25 16:38:00 Tower kernel: ata3.00: cmd 25/00:08:60:e2:03/00:00:d6:00:00/e0 tag 0 dma 4096 in
Jan 25 16:38:00 Tower kernel: res 51/40:08:60:e2:03/00:00:d6:00:00/e0 Emask 0x9 (media error)
Jan 25 16:38:00 Tower kernel: ata3.00: status: { DRDY ERR }
Jan 25 16:38:00 Tower kernel: ata3.00: error: { UNC }
Jan 25 16:38:00 Tower kernel: ata3.00: configured for UDMA/133
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd] Unhandled sense code
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: Sense Key : 0x3 [current] [descriptor]
Jan 25 16:38:00 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Jan 25 16:38:00 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 25 16:38:00 Tower kernel: d6 03 e2 60
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd]
Jan 25 16:38:00 Tower kernel: ASC=0x11 ASCQ=0x4
Jan 25 16:38:00 Tower kernel: sd 3:0:0:0: [sdd] CDB:
Jan 25 16:38:00 Tower kernel: cdb[0]=0x28: 28 00 d6 03 e2 60 00 00 08 00
Jan 25 16:38:00 Tower kernel: end_request: I/O error, dev sdd, sector 3590578784
Jan 25 16:38:00 Tower kernel: ata3: EH complete
Jan 25 16:38:00 Tower kernel: md: disk1 read error, sector=3590578720
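For anyone reading the log above, here is a minimal sketch of how the failing sector can be recovered from the `cdb[0]=0x28` line. It assumes the standard READ(10) layout (opcode in byte 0, LBA in bytes 2-5, transfer length in bytes 7-8); the function name `decode_read10` is made up for illustration.

```python
# READ(10) CDB copied from the log line "cdb[0]=0x28: ..."
CDB_HEX = "28 00 d6 03 e2 60 00 00 08 00"

# Sector numbers copied from the "end_request" and "md: disk1 read error" lines
DEVICE_SECTOR = 3590578784
MD_SECTOR = 3590578720


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a 10-byte READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


lba, count = decode_read10(CDB_HEX)
print(lba, count)                  # 3590578784 8
print(DEVICE_SECTOR - MD_SECTOR)   # 64
```

The decoded LBA matches the sector in the `end_request` line, and 8 blocks of 512 bytes match the `dma 4096` in the `cmd` line. The difference of 64 between the device sector and the md sector would be consistent with a data partition that starts at sector 64, but that is my assumption and not something the log states.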

 

The SMART test results show no errors, but maybe I'm reading them wrong. I can attach them if someone tells me which one is the right one. After I started the SMART test, the disk made some loud noises (something like buzzing); after a few seconds it went back to its normal noise (the buzzing stopped). Should I replace the disk?

Thanks for the help!

I will do this later; I'm currently pre-clearing a disk, so I can't power down the server. Now I see the same error on the second disk, along with a lot of read errors (130). I've attached the SMART logs from the second disk. I suppose they also look OK. I will reseat the cables on that one too and see if it helps.

Does this have an impact on the parity? How do I handle these errors? Should I do a parity check?

smart_sdf.txt

Hi dirtysanchez,

I disconnected and reconnected the power and SATA cables, but unfortunately I still get this error. My array has 4 disks: 2x WD Red 3TB and 2x old RE4-GP 2TB, and the error occurs only on the RE4 disks. Sometimes there is only 1 read error and sometimes as many as 150! I'm really afraid that something might go wrong on both disks and I'll lose the data on both of them...

I've attached the syslog. Do you have any other ideas about what's wrong?

syslog.txt
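Since the error count varies so much between runs, a rough sketch like the one below can tally the read errors per array disk from a syslog. It only assumes the `md: diskN read error, sector=...` line format seen earlier in this thread; `count_read_errors` is a made-up helper name, and the `disk2` sample line is invented purely for illustration.

```python
import re
from collections import Counter

# Matches lines such as "md: disk1 read error, sector=3590578720"
MD_ERR_RE = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def count_read_errors(lines):
    """Count md read-error lines per array disk in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = MD_ERR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_LINES = [
    "Jan 25 16:38:00 Tower kernel: end_request: I/O error, dev sdd, sector 3590578784",
    "Jan 25 16:38:00 Tower kernel: md: disk1 read error, sector=3590578720",
    # Hypothetical line, not taken from the real syslog:
    "Jan 25 16:38:05 Tower kernel: md: disk2 read error, sector=1024",
]

print(dict(count_read_errors(SAMPLE_LINES)))
```

To run it against a real log, pass an open file instead of the sample list, for example `count_read_errors(open("syslog.txt"))`.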

Thanks for the link! Which error do you mean? Because the log says:

Jan 30 20:27:50 Tower kernel: ata3.00: cmd 25/00:08:40:00:0c/00:00:6e:00:00/e0 tag 0 dma 4096 in
Jan 30 20:27:50 Tower kernel:          res 51/40:08:40:00:0c/00:00:6e:00:00/e0 Emask 0x9 (media error)
Jan 30 20:27:50 Tower kernel: ata3.00: status: { DRDY ERR }
Jan 30 20:27:50 Tower kernel: ata3.00: error: { UNC }

 

And this looks like a Drive media issue #1, which is not good at all…
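To make the difference between `{ DRDY ERR }` and `{ UNC }` clearer, here is a minimal sketch that decodes the `res` line. It assumes the field order status/error:nsect:lbal:lbam:lbah/hob fields/device and uses the ATA status and error register bit names as I understand them; `decode_res` is a made-up helper, not a kernel or library function.

```python
import re

# Bit masks for the ATA status and error registers (only the commonly printed bits)
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Twelve two-digit hex fields separated by "/" or ":"
RES_RE = re.compile(r"res\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_res(line):
    """Decode status flags, error flags and the 48-bit LBA from a 'res' log line."""
    match = RES_RE.search(line)
    if not match:
        raise ValueError("no 'res' taskfile found in line")
    fields = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    (status, error, _nsect, lbal, lbam, lbah,
     _hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "status": [name for bit, name in STATUS_BITS.items() if status & bit],
        "error": [name for bit, name in ERROR_BITS.items() if error & bit],
        "lba": lba,
    }


SAMPLE = "res 51/40:08:40:00:0c/00:00:6e:00:00/e0 Emask 0x9 (media error)"
print(decode_res(SAMPLE))
```

Read this way, DRDY and ERR are only the status flags saying the command ended with an error, while UNC (uncorrectable data) in the error register is the part that describes what went wrong.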

This is strange. I use brand new Delock straight/straight 30cm cables (Item No. 82676). The PSU is also new, a Corsair CX430M. Everything is in the Lian Li Q25 case. Could it be something in the case's HDD cage? The cables are not connected directly to the disks, but to the HDD cage.

Sounds like a very nice build!

I'd try changing the SATA cable routing, first. Unless they're shielded (which are hard to find), the SATA cables can interfere with each other.  Don't tie them together, even though it does make a clean looking build. 

Also swap the cables around. If the errors follow the cable to a different drive (so the 'problem' drive is no longer sdd), then I'd replace the cable.


I was referring to the DRDY ERR. 

 

Also, I have the same case and as you state it has a backplane.  You can always try removing and reseating the drive.

  • 2 weeks later...

Unfortunately the error still occurs. What I've done already:

- reseated the power and data cables: no effect

- removed the disks from the backplane and inserted them again: no effect

- replaced the data cables with other new ones: no effect

Currently I have 5 disks (with parity) in the array, and the errors occur only on the RE4 disks. I've attached the syslog and the SMART report for disk1.

Does anyone have an idea?

syslog.txt

smart_disk1.txt

How often do you get the DRDY ERR?  Sort of sounds like a condition where the disk is spun down and an event comes along to trigger it to spin up?  Just a guess at this point.  Could be certain hardware just reports this.

 

Do you see the error for other drives?  Do you know if the disk was normally spun down before the error?

 

Hopefully some others can chime in here on their experiences. 

 

EDITED:  Also found this...

 

http://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues#Physical_Drive_Issues

 

That indicates it's a true error or interface problem.  Could be a long SMART test is in order here.

Thanks for the reply, snowboardjoe.

Do you see the error for other drives?

As I said before, it occurs only on the WD RE4 disks. I have 2 of them in my array.

 

Do you know if the disk was normally spun down before the error?

Yes, indeed. It happens when the disk is spun down and I start playing a file on the HTPC.

 

That indicates it's a true error or interface problem.  Could be a long SMART test is in order here.

This is what worries me: a physical drive issue :( I have already run a long SMART test from the web GUI, and the result showed no errors.

  • 2 months later...

FYI: if anyone else stumbles in here with similar problems, I finally solved it.

It turned out to be bad cable management and the wrong cables. I used straight/straight SATA cables, and after closing the side panel the cables were squeezed too tightly and I was getting the errors. When I left the case open (without the side panel), the errors were gone. The solution is to buy down/straight cables and do better cable management...

There's a new one for the wiki!
