[SOLVED] Errors in syslog - host bus errors with Gigabyte GA-X58A-UD3R_rev2



My rig was running great with no errors for months. A couple of weeks ago I started seeing cache drive errors on my console. You can read all about that in this thread: http://lime-technology.com/forum/index.php?topic=33856.new;topicseen#new  I never did get anyone to confirm whether the cache drive was the issue, but there are no more CRC errors and the system seems very happy. Well, except...

 

So I replaced my cache drive, ran preclear, and now have a working cache drive again.  I have confirmed I have no SMART errors on any of my drives and unraid shows no errors on the webgui.  No errors on the console either.  Ran a parity check, zero errors.  However, there are some errors in the syslog that concern me.

 

I built this new server several months ago and have had no issues (other than the cache drive mentioned above), and I have not changed the BIOS settings, drives or anything else. I went back and confirmed that the errors did not exist with my old mb setup; good to have backups of my flash drive with syslogs. This makes me wonder whether I have a BIOS setting wrong for the hard drive controllers. This mb has 3 of them. If it is a cable, it would be good to know which drive/cable is having the issue. The syslog refers to ata15 and ata16, but I need the drive number or drive ID. Perhaps there is a simple way to convert between them.

 

Before I go changing anything in the BIOS, I'd appreciate some advice on what these errors mean and if they warrant immediate action.  I kindly request a review of my syslog.

 

syslog errors that I found:

 

host bus error

ioapic: probe of 0000:00:13.0 failed with error -22

 

Jun 27 02:42:40 Tower kernel: ata16: illegal qc_active transition (00000001->ffffffff)

Jun 27 02:42:40 Tower kernel: ata16.00: exception Emask 0x2 SAct 0x0 SErr 0x0 action 0x6 frozen

Jun 27 02:42:40 Tower kernel: ata16.00: failed command: READ DMA EXT

Jun 27 02:42:40 Tower kernel: ata16.00: cmd 25/00:48:b0:3f:62/00:01:60:00:00/e0 tag 0 dma 167936 in

Jun 27 02:42:40 Tower kernel:          res 50/00:00:f7:40:62/00:00:60:00:00/e0 Emask 0x2 (HSM violation)

Jun 27 02:42:40 Tower kernel: ata16.00: status: { DRDY }

Jun 27 02:42:40 Tower kernel: ata16: hard resetting link

Jun 27 02:42:40 Tower kernel: ata16: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Jun 27 02:42:40 Tower kernel: ata16.00: configured for UDMA/133

Jun 27 02:42:40 Tower kernel: ata16: EH complete
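For anyone reading the log above: the "cmd" line can be decoded by hand. Below is a minimal sketch, assuming the usual libata layout of that line (command / feature:nsect:lbal:lbam:lbah / hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device). The function name decode_cmd is made up for illustration and is not part of any tool.

```python
def decode_cmd(taskfile: str) -> dict:
    """Decode the hex fields printed after 'cmd' in a libata error line.

    Assumed layout (all values hex):
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, device = taskfile.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )

    # 48-bit commands carry a 16-bit sector count; a count of 0 means 65536.
    sectors = ((hob_nsect << 8) | nsect) or 65536

    # The 48-bit LBA is spread over six one-byte registers.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )

    return {
        "command": int(command, 16),  # 0x25 is READ DMA EXT
        "sectors": sectors,
        "bytes": sectors * 512,       # assumes 512-byte logical sectors
        "lba": lba,
        "device": int(device, 16),
    }
```

Fed the string from the ata16 log line, 25/00:48:b0:3f:62/00:01:60:00:00/e0, this gives 328 sectors, i.e. 167936 bytes, which agrees with the "dma 167936 in" printed on the same line. The command that failed was therefore an ordinary read, not anything exotic.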

 

There may be more I missed.

 

Thank you

 

update: I said screw it and ordered enough new sata cables to replace them all, got GEAR GC18AKM12-SL.  Might as well rule out the cables.

syslog.zip

Link to comment

Your hard drive on ata16 is the issue, mate. It's probably on death row; have you checked SMART? ;) This may help you find the device. I had a much easier method but I forgot it. lol, I had my ata4 /dev/ssd die today, no joke.

 

http://askubuntu.com/questions/64351/how-to-match-ata4-00-to-the-apropriate-dev-sdx-or-actual-physical-disk

 

i can get away with

 

dmesg | grep ata

 

That's how I tracked down mine :)

 

Who knows, it could be a half-arsed SATA cable. Either way, this will track down the drive for you; then you can take action with a new cable and cross your fingers :)
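The approach in the linked askubuntu answer can be sketched in Python. This is only a sketch: it assumes a kernel where the resolved /sys/block/sdX path contains an ataN component, which is not the case on every kernel (older ones need the host number matching described in that link). The function names are made up for illustration.

```python
import os
import re
from typing import Dict, List, Optional


def ata_port_from_sysfs_path(path: str) -> Optional[int]:
    """Return N if the resolved sysfs path contains an '/ataN/' component."""
    match = re.search(r"/ata(\d+)/", path)
    return int(match.group(1)) if match else None


def map_ata_ports(sys_block: str = "/sys/block") -> Dict[str, List[str]]:
    """Map 'ataN' to the /dev/sdX names found under sys_block.

    Returns an empty dict if sys_block does not exist (e.g. not on Linux).
    """
    mapping: Dict[str, List[str]] = {}
    if not os.path.isdir(sys_block):
        return mapping
    for name in sorted(os.listdir(sys_block)):
        # Each entry is a symlink into /sys/devices/...; resolve it.
        real = os.path.realpath(os.path.join(sys_block, name))
        port = ata_port_from_sysfs_path(real)
        if port is not None:
            mapping.setdefault(f"ata{port}", []).append(f"/dev/{name}")
    return mapping
```

Calling map_ata_ports() on the server and printing the result should list each ataN next to its /dev/sdX name, where the kernel exposes that information. The serial number for that device can then be matched against the drive list in the webgui.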

Link to comment

Thanks and sorry to hear about your drive loss.

 

I need a way to ID the ata 15, 16, 17 drives, as all of those showed up in the syslog when I had my old bad cache drive installed. I will post another question just for that.  Since then I now have a new cache drive and no CRC errors and my above post with syslog shows only a few entries for ata16.  Therefore I was thinking this could be a cable issue, so I ordered all new cables - will replace them all.

 

I ran a parity check, no errors.  I did check the SMART data on all drives; none have any errors and all passed their SMART tests.

 

I'd like to better understand the errors in the syslog, especially those with the host bus errors.

Link to comment

Post the strings in a Google search and see what comes up. I had frozen errors on my drive that died today. Don't worry, and thanks; it was only a 160GB Maxtor being used as a cache :) I wish I had kept longer logs, but I had to power down because it was making a loud tapping noise, like someone was throwing rocks at my brick wall lol. The frozen errors kept coming up one after another as the tapping went on. I would just try the drive that is bringing up the error with a new SATA cable, or even just check the cable and jiggle it.

 

I had a quick search for you and, well, there are a lot of people stumped. One thing I found was someone getting it in a VM, so what controllers are you using? Has it all been working in the past? If so, what did you change? Did you add a new drive and it started happening then?

Link to comment

Ok, ata15 is my disk 8.  Ran smart test again, and this time I do have 4 errors.  This is an older disk, so I will replace it right away.

 

I recently swapped my cache drive due to CRC errors with that disk.  Before that, I had no errors or issues, nothing else had changed.  After swapping the cache drive, I have the errors I posted in this thread.  This is why I thought it could be cables.  However, now ata15 or disk 8 is showing smart errors.

 

I'm going to move my data off of disk8 and remove it from the array.  I'll swap after I get a new disk and have precleared it.

 

I'm also going to replace all cables - just to be sure.  Once I do all of this, I'll review the log again and see if any errors return.

 

Thanks!

Link to comment

Just before you close this: I have the unMENU GUI, and I love this one feature (I think your Dynamix has it too, not sure). It just uses smartmontools in a nice, easy way, with wiki help for the errors :) Yeah, the command line is fine, but having the wiki included allows you to understand each error. Here is a screenshot. Glad you got this done and dusted, bud :)

smart_NEW.jpg.66593be74cde18e2bf2dedeac323e530.jpg

Link to comment
  • 4 weeks later...

I still have an issue.  I've replaced all my SATA cables, updated the BIOS to the latest version, and installed a brand new pre-cleared WD Red 4TB drive in the ata15 spot.  During a parity check, I still see these entries in my syslog, and this drive has no pending reallocation errors or any other errors:

 

Jul 22 21:54:14 Tower kernel: ata15: illegal qc_active transition (00000001->ffffffff)

Jul 22 21:54:14 Tower kernel: ata15.00: exception Emask 0x2 SAct 0x0 SErr 0x0 action 0x6 frozen

Jul 22 21:54:14 Tower kernel: ata15.00: failed command: READ DMA EXT

Jul 22 21:54:14 Tower kernel: ata15.00: cmd 25/00:00:c8:da:b5/00:04:39:00:00/e0 tag 0 dma 524288 in

Jul 22 21:54:14 Tower kernel:          res 50/00:00:c7:de:b5/00:00:39:00:00/e9 Emask 0x2 (HSM violation)

Jul 22 21:54:14 Tower kernel: ata15.00: status: { DRDY }

Jul 22 21:54:14 Tower kernel: ata15: hard resetting link

Jul 22 21:54:14 Tower kernel: ata15: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Jul 22 21:54:14 Tower kernel: ata15.00: configured for UDMA/133

Jul 22 21:54:14 Tower kernel: ata15: EH complete
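The "SStatus 113 SControl 300" part of the link-up line can also be read field by field. A minimal sketch, assuming the standard SATA SStatus register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11) and that the kernel prints the value in hex; the function name is made up for illustration.

```python
# Field meanings per the SATA SStatus register definition.
DET_TEXT = {
    0: "no device detected",
    1: "device detected, no PHY communication",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD_TEXT = {
    0: "no negotiated speed",
    1: "1.5 Gbps (Gen1)",
    2: "3.0 Gbps (Gen2)",
    3: "6.0 Gbps (Gen3)",
}
IPM_TEXT = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}


def decode_sstatus(hex_text: str) -> dict:
    """Split an SStatus value (as printed in the syslog, in hex) into fields."""
    value = int(hex_text, 16)
    det = value & 0xF          # device detection
    spd = (value >> 4) & 0xF   # negotiated link speed
    ipm = (value >> 8) & 0xF   # interface power management state
    return {
        "det": det, "det_text": DET_TEXT.get(det, "reserved"),
        "spd": spd, "spd_text": SPD_TEXT.get(spd, "reserved"),
        "ipm": ipm, "ipm_text": IPM_TEXT.get(ipm, "reserved"),
    }
```

For the value in the log, decode_sstatus("113") reports a device present with the PHY up, a Gen1 link, and the interface active, which matches the "SATA link up 1.5 Gbps" text the kernel printed. So after each hard reset the link does come back up; the log shows a reset and recovery rather than a drive dropping off.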

 

Again, this error repeats and I also see the same error for ata16 drive, which is disk1.  I ran and confirmed that all of my drives do not have any smart errors or pending reallocation errors.  In the BIOS, all 3 of the onboard sata controllers are set for AHCI

 

Here is the description of the on-board controllers:

 

Storage Interface

- South Bridge:
  - 6 x SATA 3Gb/s connectors (SATA2_0~SATA2_5) supporting up to 6 SATA 3Gb/s devices
  - Support for SATA RAID 0, RAID 1, RAID 5, and RAID 10

- Marvell 9128 chip:
  - 2 x SATA 6Gb/s connectors (GSATA3_6, GSATA3_7) supporting up to 2 SATA 6Gb/s devices
  - Support for SATA RAID 0, and RAID 1

- GIGABYTE SATA2 chip:
  - 1 x IDE connector supporting ATA-133/100/66/33 and up to 2 IDE devices
  - 2 x SATA 3Gb/s connectors (GSATA2_8, GSATA2_9) supporting up to 2 SATA 3Gb/s devices
  - Support for SATA RAID 0, RAID 1, and JBOD

- JMicron JMB362 chip:
  - 2 x eSATA 3Gb/s connectors (eSATA/USB Combo) on the back panel supporting up to 2 SATA 3Gb/s devices
  - Support for SATA RAID 0, RAID 1, and JBOD

 

There is also an eSATA 2 port, but I'm not using it.

 

ata15 and ata16 are on different controllers.

 

ata15 is now my new drive.  Syslog attached.  I think it already rotated so it may not be a full log.

 

After doing some searching, I found some people say they get these same errors because of a bad PSU.  I have a brand new one, which should be good for the 10 drives I have.  Next I may try swapping the power cables around to see if the problem moves with the power cable.

syslog.txt

Link to comment

With no SMART errors, no errors during disk access, and a perfect parity check it's not clear whether this is an issue to be concerned about, or just a few resets during the process of initializing the controllers.    Perhaps Tom, Joe, or one of the other Linux gurus can comment on that.

 

One thing I'd try is moving the drives to a different controller to see if the issue moves with the drive, or stays with the controller.  Since you're on v5, you can freely move drives to different ports ... so just swap the drives currently on the 2nd & 3rd controllers with the primary controller on the motherboard.

 

Link to comment

Thanks Gary!  Actually, the ata15 is a brand new drive with a brand new cable.  The only thing that is exactly the same is the same controller and same power cable.  That said, I'll try moving the drives to one of the other controllers and also try swapping the power cables to see if that works.

 

You are correct in that I get zero sync errors on parity check and no CRC errors or smart errors on any disks now.  I only know that with my old motherboard, I never saw these errors.

 

Thanks again Gary!  I really appreciate your input.

 

@ Tom or Joe or anyone else - any other thoughts?

Link to comment

Update: I swapped the power cables and the issue remained with the same drives / ports, so it is not a power issue.  I then swapped the SATA cables and the errors stayed with the SATA ports, even though those ports are now connected to different drives.  No reason to post the syslog again, as the errors are exactly the same as in my previous posts.  Both ata15 and ata16 generate these errors.  These two ports are the onboard Gigabyte SATA2 ports 8 and 9.  It does not matter which drives I connect to these ports: when I run a parity check, the errors occur.
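One way to compare runs before and after a swap like this is to count the exception lines per port. A minimal sketch, assuming syslog lines in the same format as the ones posted in this thread; the function name and the regular expression are my own, not part of unRAID.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "kernel: ata15.00: exception Emask 0x2 SAct 0x0 ..."
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask (0x[0-9a-fA-F]+)")


def count_exceptions(lines: Iterable[str]) -> Counter:
    """Count libata 'exception' lines per ata port."""
    counts: Counter = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Run it over the log from each parity check, for example count_exceptions(open("/var/log/syslog")) (the path is an assumption and may differ on your system). If the same ataN names keep showing up after the drives and cables have been moved, the problem is following the port, which is what the swap test above showed.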

 

It is still not clear to me if these are errors to be concerned with.  I will send PM to Tom and Joe.

 

The specs for these two ports from the manual are:

 

GIGABYTE SATA2 chip:

1 x IDE connector supporting ATA-133/100/66/33 and up to 2 IDE devices

2 x SATA 3Gb/s connectors (GSATA2_8, GSATA2_9) supporting up to

2 SATA 3Gb/s devices

Support for SATA RAID 0, RAID 1, and JBOD

 

I've uploaded screenshots of the BIOS settings as well, as perhaps it's something I have set incorrectly.  There is one option, Firmware Selection for that SATA2 controller, which is set to Auto by default.  Auto tells it to update and use the newest firmware from the system BIOS.  The other option is to make it use the firmware on the chipset itself.  I have not tried that because I don't know enough about this stuff to know whether it's safe to try.

 

With any errors, I would like to figure out how to resolve these.  Any help would greatly be appreciated.

July_26_2014_022_copy4.jpg.10fecb754a0a8d8cae6f06bb483f0d03.jpg

Link to comment

You've clearly isolated this to the two GSata ports on the motherboard.

 

Whether it's anything to be concerned with I'll leave to Tom or Joe (or another Linux expert who can evaluate whether or not those resets are anything to be concerned about).    My guess is everything's fine ... I'd just leave everything as is and not worry about it.  Although the OCD in me would probably cause me to not use those ports unless I had to ... if you have spare ports on an add-in card, use those instead -- and then you won't have those extraneous lines in the syslog  :)

 

 

Link to comment

thanks Gary.  I'm using all 10 ports on the motherboard, no add on cards.  I sent PM to both Joe and Tom, hoping one of them would help me.

 

I did think about just downsizing my server, meaning going from a total of 10 disks to 8 and, as you said, not using those ports.  I'll have to buy more HDs, and then I guess there is a process to remove a drive from the array?  I was thinking I could just use MC to move the data to another drive, then stop the array and deselect the disks I will no longer use, then power down and disconnect those drives.  Unless there is an easier way of doing that.  Until I can afford to buy more drives, I'd like to know how to fix the errors.  :)

Link to comment

Looking at your syslog, it appears to be a bad controller, especially since the problem stays with the physical port and not any drive or cable.  One thing you could do, depending on how adventurous you are, is get another flash and boot unRaid basic.  Then install just two disks on those two ports (no parity), and run like that (to rule out a possible under-powered PSU issue).  As long as you don't write new data the parity would be ok, or use two other disks.  If the issue persists, you could run unraid6-beta6 and see if a similar issue happens; if it does, you could probably conclude there's a problem with those ports on your motherboard.

Link to comment

Interesting that it could still be a PSU or power issue.  Is there any other (easier) way to confirm it's not a power issue?  I had a 550 before buying the 650 and I was also running 10 drives with no issue.

 

As for running v6, I would have to use my same disks, so basically I would be upgrading from release v5 to beta v6.  Is that safe?  Can I go back if things don't work out and not risk my data?

 

so based on what you've seen, should I be worried about these errors in terms of data loss?

Link to comment

I learned long ago to never say never ... but it's VERY unlikely this is a power issue.

 

An easier way to effectively eliminate that as an issue (requires a keyboard attached to your server) is to start the boot process ... wait about 20 seconds; and then do a Ctrl-Alt-Del to restart it.  By that time all of your disks will be spun up, so the spin-up current draw is finished.  If this is by any chance power-related, you should NOT see the error messages again.

 

Another way to confirm it -- and probably eliminate the error messages as well -- is to buy one of these little 2-port controllers and use it instead of those 2 GSata ports:  http://www.newegg.com/Product/Product.aspx?Item=N82E16816124045

 

But I've used as many as 18 drives in systems with a good 650w supply with no problems ... and those were higher current drives than modern units.  I just don't think that's the issue.

 

I think it's almost certain the issue is the Gigabyte controller, but what I don't know is whether or not v6 will have the same issue.

 

Link to comment

Note:  If you don't want to upgrade your array to v6 to try that, I'd do the following (modeled after Tom's suggestion):

 

- Configure a new flash drive with UnRAID Basic ... first using the current v5.05 stable that you're already using.    Assign the two drives on the GSata ports as DATA drives (don't bother with a parity drive).    As Tom noted, as long as you don't write to the disks, you can just use the two already there ... or to be totally safe, just put two different disks on those ports for the test.

 

Then boot and see if you still get the errors in the syslog (I'd expect you will).

 

- Now change that UnRAID Basic disk to v6 and assign the same two data disks.    Then boot and look at the resulting syslog.    If it's now clean, you know you can solve your errors by upgrading to v6.  Otherwise, the same issue with the GSata ports is in v6.

 

As I noted earlier, however, I don't think it's a big deal ... once the ports are initialized they're clearly working just fine.    But the inexpensive little card may be worth it just for the peace of mind of a "clean" boot  :)

Link to comment

thanks Gary, I'll try that.  I also just found another psu, that I could setup next to my case and connect to the those drives / ports, totally eliminating the power as the problem.  I do have two old disks that I could use for the v6 test too. 

 

If it turns out to be the mb controller and those two ports, and if I decide to go with the inexpensive little card, which one would you recommend for my setup?

Link to comment

If it turns out to be the mb controller and those two ports, and if I decide to go with the inexpensive little card, which one would you recommend for my setup?

 

The little 2-port SYBA I suggested above would work fine.  They also make a 4-port version, in case you'd like a couple of additional ports.

 

http://www.newegg.com/Product/Product.aspx?Item=N82E16816124045

http://www.newegg.com/Product/Product.aspx?Item=N82E16816124064

Link to comment

thanks Gary!  Sorry I missed the first post / link to the addon card.  I read that one from my phone and missed it.  I like the idea of just putting in another small controller and just not using these two ports on the mb.

 

Note that I don't see the errors at boot up.  I only see them when running a parity check or during the rebuild process (after upgrading a disk).  I may have not made that part clear.

 

I also opened a case with Gigabyte, although I don't think they will help me much because I purchased this mb used on eBay.

Link to comment

Final update:  Tom and Gary were correct, it was a bad controller.  My thanks to both for helping me sort this out, especially Gary for all of the suggestions.  :)

 

I decided to buy the SYBA controller.  After installing that and disabling the two "bad" on-board ports, no more errors.  Gigabyte did offer to warranty the mb, but I decided to just keep the one I have.

Link to comment

Thanks for the update -- always nice to know the final resolution.  I wonder if the GB ports were actually "bad" or if there's just some issue with the initialization of the Linux drivers for those ports, since they seem to work fine once they're initialized.  But I agree it's nice to not have any errors buried in the logs  :)

Link to comment
