Drives disappearing - Hard Resetting link


I recently swapped my motherboard for another one. The thread on that is below:

http://lime-technology.com/forum/index.php?topic=32426.0

 

In the Dynamix web GUI, there is a place to run a SMART short self-test and a SMART extended self-test. I ran the short one on the parity drive and it came back with no errors. I then ran the extended test on the parity drive and forgot to check back on it. When I came home today, I noticed that my parity drive was the only disk, excluding the cache, that hadn't spun down. That concerned me, so I hit the log button and saw "program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO", which stood out to me. I did some googling on it but couldn't find anything that helped, so I'm coming here. I've attached a syslog. I noticed in the log that I'm having some of the same issues I was having before the mobo swap. Since the excerpt below shows ata8, I would think it is SATA port 8, but I only have ports 0-7. Would this be SATA 7?

 

ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Mar 24 11:08:17 Server kernel: ata8.00: failed command: CHECK POWER MODE

Mar 24 11:08:17 Server kernel: ata8.00: cmd e5/00:00:00:00:00/00:00:00:00:00/00 tag 0

Mar 24 11:08:17 Server kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Mar 24 11:08:17 Server kernel: ata8.00: status: { DRDY }

Mar 24 11:08:17 Server kernel: ata8: hard resetting link

Mar 24 11:08:23 Server kernel: ata8: link is slow to respond, please be patient (ready=0)

Mar 24 11:08:27 Server kernel: ata8: COMRESET failed (errno=-16)

Mar 24 11:08:27 Server kernel: ata8: hard resetting link
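
For anyone reading a long syslog for the same symptoms, here is a minimal Python sketch that tallies "hard resetting link" messages per ATA port and translates any `errno=-N` value into its symbolic name. The function name `summarize_ata_events` and the regular expressions are illustrative assumptions, not part of unRAID or Dynamix; the sample lines are copied from the excerpt above.

```python
import errno
import re
from collections import Counter

# Matches kernel messages such as "ata8: hard resetting link" or
# "ata8.00: failed command: ...", with or without a syslog prefix.
ATA_LINE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")
ERRNO = re.compile(r"errno=-(\d+)")


def summarize_ata_events(lines):
    """Count hard link resets per ATA port and collect symbolic errno names."""
    resets = Counter()
    errnos = {}
    for line in lines:
        match = ATA_LINE.search(line.rstrip())
        if not match:
            continue
        port, message = match.groups()
        if "hard resetting link" in message:
            resets[port] += 1
        err = ERRNO.search(message)
        if err:
            name = errno.errorcode.get(int(err.group(1)), "unknown")
            errnos.setdefault(port, set()).add(name)
    return resets, errnos


SAMPLE = [
    "Mar 24 11:08:17 Server kernel: ata8.00: failed command: CHECK POWER MODE",
    "Mar 24 11:08:17 Server kernel: ata8: hard resetting link",
    "Mar 24 11:08:27 Server kernel: ata8: COMRESET failed (errno=-16)",
    "Mar 24 11:08:27 Server kernel: ata8: hard resetting link",
]

if __name__ == "__main__":
    resets, errnos = summarize_ata_events(SAMPLE)
    print(dict(resets))  # reset count per port
    print(errnos)        # symbolic errno names per port
```

On Linux, errno 16 is `EBUSY`, so the `COMRESET failed (errno=-16)` line means the kernel saw the device as still busy when it tried to reset the link. To scan a real log, pass an open file object instead of `SAMPLE`.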

syslog.zip

Link to comment

It's been two days and I have had no errors. I ran short and extended SMART tests on all 7 disks and all returned 0 errors. I'm not going to say my long-running initial problem is resolved, but I have a good feeling about it. This weekend, I'm going to put the HDD I pulled out into a Windows PC, run some SMART tests on it, and verify whether it was my problem. I have a feeling the disk is fine but the interface could be messed up somehow. I will mark this thread as solved once I've tested it in another system. I actually hope the drive is bad so I can rest knowing it was my issue.
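
For readers who prefer the command line to the web GUI, the same self-tests can be started with `smartctl` from smartmontools. Below is a minimal Python sketch under some assumptions: `smartctl` is installed and on the PATH, the script runs as root, and `/dev/sdh` is only an example device name. The helper names `smartctl_command` and `run_smartctl` are hypothetical.

```python
import subprocess


def smartctl_command(device, action):
    """Build a smartctl argument list for one of a few common actions."""
    actions = {
        "short": ["-t", "short"],       # start a short self-test
        "long": ["-t", "long"],         # start an extended self-test
        "results": ["-l", "selftest"],  # print the self-test log
        "health": ["-H"],               # print the overall health assessment
    }
    return ["smartctl", *actions[action], device]


def run_smartctl(device, action):
    """Run smartctl and return whatever it printed on stdout."""
    result = subprocess.run(
        smartctl_command(device, action),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return result.stdout


if __name__ == "__main__":
    # Example device name; substitute the disk you want to test.
    print(" ".join(smartctl_command("/dev/sdh", "long")))
```

The self-test runs inside the drive itself, so the "results" action has to be queried again later to see the outcome.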

Link to comment

I spoke too soon. I'm not sure if anyone is going to read my previous two threads about what all I have done so I'll tell you quickly here. 

 

I bought a new power supply, a good one. Just in case that was the problem because I had a cheap one.

I bought new locking SATA cables.

I swapped the motherboard for a known working one that had been running Windows 7.

I have verified multiple times all wires are pushed in without strain.

 

Now I have a drive on ATA 5 that disappeared. This port has a completely different cable on it than the one involved in the other two threads I've created about this.

Attached is the new syslog. The server is down, and I guess it's going to stay down until I can get some feedback from someone. I absolutely do not know what to do.

syslog.txt

Capture.PNG

Link to comment

FYI, I had 2 Seagate 1 TB drives and they would both log a double link reset every few days. They never went offline though. I have hot-swap bays, so I've never touched the internals of the server, but I've replaced those HDDs and I haven't found a single link reset in the log since. So in my case it was the Seagate drives.

 

Unfortunately, I just don't know what to suggest for your system but I can see it's kicking your ass.

 

It would help if you could post an up-to-date equipment list along with pictures of the server: some showing the overall build, and others showing the power cables between the PSU and the HDDs and the SATA cables between the motherboard and the HDDs.

 

Link to comment

I've attached a syslog. I noticed in the log that I'm having some of the same issues I was having before the mobo swap. Since that is showing ata 8 below I would think it is sata port 8 but I only have 0-7. Would this be sata 7?

 

ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Mar 24 11:08:17 Server kernel: ata8.00: failed command: CHECK POWER MODE

Mar 24 11:08:17 Server kernel: ata8.00: cmd e5/00:00:00:00:00/00:00:00:00:00/00 tag 0

Mar 24 11:08:17 Server kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Mar 24 11:08:17 Server kernel: ata8.00: status: { DRDY }

Mar 24 11:08:17 Server kernel: ata8: hard resetting link

Mar 24 11:08:23 Server kernel: ata8: link is slow to respond, please be patient (ready=0)

Mar 24 11:08:27 Server kernel: ata8: COMRESET failed (errno=-16)

Mar 24 11:08:27 Server kernel: ata8: hard resetting link

 

As you have already figured out, this is ata8.00, which corresponds to sdh, your parity drive.  It's associated with the second port on that nice 8 port SATA controller card you have.  The SATA link is still alive, so power is still connected, but the drive is not responding at all, and was quickly marked by the kernel as disabled.  Suspects would be a power spike, or vibration that shook the SATA cable only partly off, or crashed firmware on the drive, or perhaps an overheated drive?

 

Now I have a drive on ATA 5 that disappeared. This port has completely different cable in it than what my server had from the other two threads I've created over this.

Attached is the new syslog. Server down and I guess it's going to stay down until I can get some feed back from someone. I absolutely do not know what to do.

 

This appears to be sde, Disk 5, attached to the fifth motherboard SATA port.  This looks like a connection shook loose, and again the SATA link stayed up, but the drive was not responsive, and was disabled too.  Essentially the same suspects...
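
The ataN numbers in the log are assigned by the kernel and need not match the port labels printed on a board, so it can help to look the mapping up rather than guess. Here is a minimal Python sketch, assuming a kernel that includes the ATA port name in the resolved `/sys/block/sdX` path; some controller drivers may not expose it, in which case the mapping simply comes back empty. The function names and the example path in the comment are illustrative, not taken from this server.

```python
import os
import re

# Picks out a path component such as ".../ata8/..." from a resolved sysfs path.
ATA_IN_PATH = re.compile(r"/(ata\d+)/")


def ata_port_from_sysfs_path(path):
    """Return the 'ataN' component of a resolved /sys/block/sdX path, or None."""
    match = ATA_IN_PATH.search(path)
    return match.group(1) if match else None


def map_ata_ports(sys_block="/sys/block"):
    """Return {ataN: sdX} for every sd* device whose sysfs path names an ATA port."""
    mapping = {}
    if not os.path.isdir(sys_block):
        return mapping
    for dev in sorted(os.listdir(sys_block)):
        if not dev.startswith("sd"):
            continue
        resolved = os.path.realpath(os.path.join(sys_block, dev))
        port = ata_port_from_sysfs_path(resolved)
        if port:
            mapping[port] = dev
    return mapping


if __name__ == "__main__":
    # A resolved path might look like (hypothetical example):
    # /sys/devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:0/7:0:0:0/block/sdh
    for port, dev in sorted(map_ata_ports().items()):
        print(port, "->", dev)
```

Device letters can change between boots, so the mapping is best checked on the same boot as the log being read.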

Link to comment

 


 

I will check the wires again to be sure they are snug. On the idea of vibration shaking the SATA cables loose: would the spinning hard drives cause this? These are locking SATA cables.  I have the server behind a battery backup, so I would assume it would have absorbed any power spike. I also have many computers in the house that don't seem to be affected. I look at the server GUI often and, if the temperatures are reading right, they never seem to get very high.

 

I use the Dynamix web gui plugins and in the disk health section, somewhere in there it gives details on the individual hard drives. On one of those detail pages, it shows on all my Seagate drives that there is an updated firmware available. Since you mentioned something about firmware, this got me thinking about that. Could this be my issue?

Link to comment

That very definitely could be the issue, especially since you have just about ruled out everything else.  Be aware though, that some drive firmware updates can result in all data being lost, as if it were a new clean drive.  But I don't know the specifics of this particular drive firmware update.

Link to comment

There are no firmware updates for these drives. So I can cross that off the list.

 

I just thought of this.

On January 12th I had a disk actually die and I RMA'd it. On February 3rd, a friend gave me a used APC battery back up that he just put new batteries in. My first thread on here about this was February 16th.

 

I have now moved my server from the UPS my friend gave me back to my small one, which kept the server up for 6 months until the drive failure.

 

Could it be?

Link to comment
  • 2 weeks later...

Just to let everyone know, my server has been running for 16 days without error with the battery backup removed. That is the longest it has run without issues since I got this UPS. I'm pretty sure this was my problem now. Thanks to all who gave suggestions.

 

 


Link to comment
