Hardware Issue Causing unRAID to Crash (ata11: hard resetting link)


cjlucas

I just started getting this error in dmesg after a few weeks of running a stable unRAID system; the errors began when I added 2 hard drives. I'm currently running 13 hard drives on a 620W PSU (+5V@30A), which I assumed had enough juice to power all those drives, so I'm not sure what is causing the problem. The following output is from dmesg:

ata11: Unable to stop eDMA
ata11.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
ata11: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
ata11.00: cmd 25/00:c0:d7:5e:71/00:03:43:00:00/e0 tag 0 dma 491520 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
ata11.00: status: { DRDY }
ata11: hard resetting link
ata10: Unable to stop eDMA
ata10.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
ata10: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
ata10.00: cmd 25/00:00:17:83:6c/00:04:11:00:00/e0 tag 0 dma 524288 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
ata10.00: status: { DRDY }
ata10: hard resetting link
ata9: Unable to stop eDMA
ata9.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
ata9: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
ata9.00: cmd 25/00:08:27:95:9c/00:00:31:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
ata9.00: status: { DRDY }
ata9: hard resetting link

 

- Chris

Link to comment

Specifically, which brand/model power supply are you using?  A 30 Amp 5 volt supply does not help if the 12 volt supply cannot handle the 13 drives.

 

Do NOT assume your power supply is able to handle the new drives...

 

One 630 watt power supply I found on the web had these specs:

# 180W: 5V (30A) and 3.3V (28A)

# 500W: +12V1 (16A), +12V2 (25A), +12V3 (17A). +12V1 and +12V3 combined max 300W

It has three 12 volt rails...  With a max output on rail 1 and 3 combined of 300 watts.  Depending on the connectors used on that power supply, you might be overloading one of the rails with the total load from your MB, disks, and fans.  12V1 might be only used for the MB, 12v2 might be set up for the video card, leaving 12v3 for the disks...

 

17Amps on 12v3 might not be enough to spin up 13 drives at the same time, especially since apparently it plus 12v1 share the same pool of 300 watts capacity.

300 watts capacity is a real limit of 25 amps combined capacity... but that does not look as good in the marketing brochure... It is almost like a car salesman saying your new car will handle 25 passengers... just not all at the same time, and not when moving, and 1 seat is reserved for the driver... etc.
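
To make that arithmetic concrete, here is a rough sketch in Python. The rail figures are the ones from the 630 watt spec sheet quoted above; the variable and function names are made up for illustration, and it says nothing about how any particular supply is actually wired.

```python
# Figures from the 630 W spec sheet quoted above; names are illustrative only.
RAIL_AMPS = {"12V1": 16, "12V2": 25, "12V3": 17}
POOLED_RAILS = ("12V1", "12V3")   # the two rails that share one power budget
POOLED_LIMIT_WATTS = 300
VOLTS = 12


def pooled_amp_limit(limit_watts, volts=VOLTS):
    """Convert a combined wattage limit into the equivalent current."""
    return limit_watts / volts


# What the label suggests you can draw from the two pooled rails together.
label_sum = sum(RAIL_AMPS[rail] for rail in POOLED_RAILS)
# What the combined wattage limit really allows.
real_limit = pooled_amp_limit(POOLED_LIMIT_WATTS)

print(f"Label ratings for {' + '.join(POOLED_RAILS)}: {label_sum} A")
print(f"Combined limit: {real_limit:.0f} A")
print(f"Shortfall versus the label: {label_sum - real_limit:.0f} A")
```

So the two rails are labelled for 33 Amps between them, but the shared 300 watt budget caps them at 25 Amps.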

 

Now, this might not be your case, but it is suspicious when errors start after adding drives. 

 

Joe L.

Link to comment

Hey Joe,

 

My PSU is CORSAIR CMPSU-620HX 620W ( http://www.newegg.com/Product/Product.aspx?Item=N82E16817139002 ). Here are the specs for it: +3.3@24A,+5V@30A,+12V1@18A,+12V2@18A,+12V3@18A,- [email protected],+5VSB@3A. Though I don't see a max combined wattage for those 3 12V rails. Also is there a way to see what cables use which rails? (pardon my ignorance if what I'm asking makes no sense) From what I've read hard drives are powered by the 5V rail so I figured I was set there.

Link to comment

Drives use 12v.

 

Are you noticing the errors as you are spinning up drives?  Drives pull about 2x their normal power when spinning up, so a situation where all drives are spinning up at the same time (like when powering up the server) is the most stressful time for the PSU.

 

Cabling problems are the most common problem I see here with strange drive errors. I'd double check connections (data and power) to the new drives. Try replacing cables and using diff SATA ports if possible.

Link to comment

AFAIK this is a single-rail power supply and, in my personal experience, should be good for the drives. As stated, check cabling and connectors; they cause lots of problems...

 

Link to comment

Actually, I've read they are powered mostly from the 12 volt rails.

 

A good article is here: http://ixbtlabs.com/articles2/storage/hddpower.html

 

It shows many modern drives draw 2 amps or more on the 12 volt line when spinning up.    Your 13 drives * 2 Amps might be overloading the power supply if they are all on the same 12 volt rail.  (26 Amp peak, on a rail with an 18 Amp rating)

 

You apparently have 3 12 volt rails... each rated for 18 Amps.  If distributed across the rails, you do have the capacity for your disks.  You have purchased a very nice power supply...  but even they play a bit of "marketing" games...  The three 12 volt rails are rated at 18 Amps each... but the total output power of the entire supply, according to their manual, is 620 watts.  I don't know about you, but if you multiply 12 * 18 * 3 you get 648 watts... There is NO way to get the full output...  In fact, their own manual shows a max combined of 600 watts for the three 12 volt rails...  And if your 5 Volt and 3.3 Volt lines happen to be powering a motherboard, I'll bet the power available on the 12 volt rails is even less....
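
A quick back-of-the-envelope version of the numbers above, as a Python sketch. The 2 Amp spin-up figure is the rough value from the linked article, not a measurement of these particular drives, and the "spread evenly" case assumes the three rails really are wired to separate drive connectors, which is not documented for this supply. The function names are made up.

```python
import math

DRIVES = 13
SPINUP_AMPS_PER_DRIVE = 2.0        # rough per-drive figure from the linked article
RAIL_RATING_AMPS = 18              # per-rail rating on the label
RAILS = 3
COMBINED_12V_LIMIT_WATTS = 600     # combined 12 V limit quoted from the manual
VOLTS = 12


def spinup_amps(drives, amps_per_drive=SPINUP_AMPS_PER_DRIVE):
    """Total 12 V current if every drive spins up at the same moment."""
    return drives * amps_per_drive


def worst_rail_load(drives, rails, amps_per_drive=SPINUP_AMPS_PER_DRIVE):
    """Spin-up current on the busiest rail when drives are spread evenly."""
    return math.ceil(drives / rails) * amps_per_drive


one_rail = worst_rail_load(DRIVES, 1)        # everything on a single rail
spread = worst_rail_load(DRIVES, RAILS)      # evenly spread over three rails
label_watts = VOLTS * RAIL_RATING_AMPS * RAILS

print(f"All {DRIVES} drives on one rail: {one_rail:.0f} A (rating {RAIL_RATING_AMPS} A)")
print(f"Spread over {RAILS} rails: {spread:.0f} A on the busiest rail")
print(f"Label total: {label_watts} W, manual's combined limit: {COMBINED_12V_LIMIT_WATTS} W")
```

On one rail the spin-up peak is over the 18 Amp rating; spread across three rails it is comfortably under, which is why distributing the drives matters.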

 

As I said, you have a pretty decent supply... better than most out there... but they all play the same "marketing" games with "watts".... the only question is if your disks are distributed across its power supply rails to equalize the load.  (Just don't daisy-chain them all off the same connector)

 

As far as knowing which rails are on which connectors on the power supply....  Corsair conveniently leaves that information out of the manual...  All I can suggest is you use as many of the modular connectors on the power supply as possible... and space them so you don't use adjacent connectors, just in case they grouped the connectors by the rail that powers them.  They say the power supply uses circuitry to share potential across the connectors...  I've read internally it is a single rail supply...  Probably true.  The connectors probably all connect together in the supply case... (Their "advanced circuitry" )

 

Do not just use one or two connectors to feed all your drives... You are almost certain to be stressing the power supply connector wired that way.  Use as many as possible.

 

Joe L.

Link to comment

I had (still have) similar issues with my machine after I changed one drive from a 1TB drive to a 1.5TB drive and added a WD Green drive.  It seems the system would run stably all the way through the parity sync. My issue would occur when the drive was spun up after it had been put to standby.  The one drive would seem to cause issues and go offline. I moved the drive to an external case and the issue went away.

This made me think my seasonic 500W power supply is at the tipping point.

9 Drives.

2 1.5tb seagate 7200.11

1 1tb seagate 7200.11

6 WD green drives.

 

I've read the 1.5TB 7200.11 drives are a bit power hungry.

 

For the record, this did not occur at all during startup.

All the WD drives are set for Power Up in standby and wait for the kernel to start them.

So it's not the surge that's the issue. It's somewhere else.

Link to comment

I'd wonder if all of those 12V rails are even connected to the SATA/Molex connectors. I would not be surprised if the PCIe and 12V motherboard plugs get 2 of the rails and the third one is used for the SATA/Molex wiring. After all, I'm sure Corsair expects most people to use power hungry processors and video cards and also expects most people won't put in enough drives to overload a single 12V rail at 18A.

 

FYI, the Seagate 1.5T 7200.11 drives will pull max 2.8A at start-up. The Seagate LP drives are 2A. WD does not publish a start-up number for the Green drives that I could find.

 

Still, I would check and re-check and replace the SATA cables and also go over the power connections especially any power splitters you may have used.

 

Peter

 

Link to comment

What do you guys think about a faulty SATA controller causing this? I'm also dual booting Windows 7. Sometimes it recognizes the 3 SATA controllers installed in the system, but most of the time one of the controllers (the same one every time) doesn't even show up in the device manager. Last night all 3 cards were recognized, Windows 7 crashed sometime in the middle of the night and rebooted, and after rebooting only 2 of the cards were recognized. I haven't opened the box to see whether the 3 drives that got the "hard resetting link" message are all on the same SATA card, but I have a feeling they are because, according to dmesg, the numbers for those drives are sequential (ata9, ata10, ata11).

 

Thanks for all the information about PSU's but I'm just covering all my bases before I go out and start buying replacement hardware to fix the problem.

 

- Chris

Link to comment

The fact that it seemed to be reporting every possible error flag, for all 3 drives, and no other errors, looked very wrong to me, more like a buggy or crashed driver or controller, that is, returning a 0xff (all bits on).  And since the eDMA error was so unusual, I Google'd it, and found a similar case here.  Not entirely resolved, but the suggestion made was faulty hardware.  That eDMA error is rare, appears to be associated with the Marvell SATA driver, so this is probably the Adaptec 1430SA or the Rosewill equivalent.  Earlier cases of the error were attributed to a buggy Marvell driver included with earlier Linux versions, so make sure you try the latest unRAID release, with its relatively current kernel and drivers/modules.
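
To illustrate the "all bits on" point, here is a small Python sketch that decodes an SErr value into the flag names shown in the dmesg output. The bit positions are my understanding of the SATA SError register layout that the kernel decodes, so treat the table as an assumption and check it against the kernel source before relying on it. The function name is made up.

```python
# Assumed bit positions for the SError register; names match the dmesg output above.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of the SError flags set in a 32-bit SErr value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# The value from the log: every defined flag is set at once.
print(decode_serror(0xFFFFFFFF))
# For comparison, a value with only one bit set decodes to a single flag.
print(decode_serror(0x00200000))
```

With 0xffffffff every one of the 17 flags comes back, which is exactly the list printed for ata9, ata10 and ata11. A register that reads as all ones looks less like 17 simultaneous link faults and more like a controller that has stopped answering properly.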

 

If you look earlier in the syslog, you can identify the drives that are using ata9, ata10, and ata11, and thereby identify which controller this is.
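
One way to do that lookup, sketched in Python. It assumes the syslog contains the usual libata identification lines of the form "ataN.00: ATA-8: model, firmware, ..."; the sample lines and model names below are invented placeholders, not taken from this system, and the function name is made up.

```python
import re

# Matches lines such as "ata9.00: ATA-8: <model>, <firmware>, max UDMA/133".
PORT_LINE = re.compile(r"\b(ata\d+)\.\d+: ATA-\d+: ([^,]+),")

# Invented sample lines standing in for a real syslog.
SAMPLE_SYSLOG = """\
ata9.00: ATA-8: EXAMPLE MODEL A, FW01, max UDMA/133
ata10.00: ATA-8: EXAMPLE MODEL B, FW02, max UDMA/133
ata3.00: ATA-8: EXAMPLE MODEL C, FW03, max UDMA/133
"""


def drives_on_ports(syslog_text, ports):
    """Map each requested ata port to the drive model reported in the syslog."""
    found = {}
    for line in syslog_text.splitlines():
        match = PORT_LINE.search(line)
        if match and match.group(1) in ports:
            found[match.group(1)] = match.group(2).strip()
    return found


print(drives_on_ports(SAMPLE_SYSLOG, {"ata9", "ata10", "ata11"}))
```

Ports that never appear in an identification line (ata11 in the sample) are simply missing from the result, which is itself a hint that the drive was not detected at boot.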

 

This is not to say that it could not be related to power or cabling issues, those are still valid suspects, but this is one more possibility.

Link to comment
  • 1 month later...

Hello all.

 

I'd like to mark this issue solved and give an explanation as to what happened (It annoys me when people don't).

 

After a couple of weeks of downtime when my mobo failed from a completely unrelated issue (BIOS update gone bad), I hooked up the motherboard again with only a few HDDs plugged in, and the crashes were still occurring. I took out one of the controller cards, and after a few days with no crashes I deemed that card the culprit. I replaced it, and now I'm running all 14 drives with no issues.

 

Thanks for all your help guys.

 

- Chris

Link to comment
