[solved] Frequent "IRQ errors" on 4.7 causing partial crashes



Hello,

 

I recently upgraded to a Zacate (E35M1-M) system and to 4.7 at the same time, and since then I have been seeing frequent partial crashes. First, I had an issue with the onboard Realtek NIC, so I replaced it with a D-Link DGE-528T, which works perfectly.

But now I frequently get this:

 

May 19 00:51:10 NAS kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
May 19 00:51:10 NAS kernel: Pid: 0, comm: swapper Tainted: G W 2.6.32.9-unRAID #8
May 19 00:51:10 NAS kernel: Call Trace:
May 19 00:51:10 NAS kernel: [] __report_bad_irq+0x2e/0x6f
May 19 00:51:10 NAS kernel: [] note_interrupt+0xf5/0x13c
May 19 00:51:10 NAS kernel: [] handle_fasteoi_irq+0x5f/0x9d
May 19 00:51:10 NAS kernel: [] handle_irq+0x1a/0x24
May 19 00:51:10 NAS kernel: [] do_IRQ+0x40/0x96
May 19 00:51:10 NAS kernel: [] common_interrupt+0x29/0x30
May 19 00:51:10 NAS kernel: [] ? acpi_idle_enter_simple+0xfe/0x12a
May 19 00:51:10 NAS kernel: [] cpuidle_idle_call+0x63/0x9b
May 19 00:51:10 NAS kernel: [] cpu_idle+0x3a/0x4e
May 19 00:51:10 NAS kernel: [] start_secondary+0x195/0x19a
May 19 00:51:10 NAS kernel: handlers:
May 19 00:51:10 NAS kernel: [] (ahci_interrupt+0x0/0x3df [ahci])
May 19 00:51:10 NAS kernel: [] (sil_interrupt+0x0/0x26d [sata_sil])
May 19 00:51:10 NAS kernel: Disabling IRQ #19

 

It is always the same trace, but never at the same time. After that, for some reason, the system never really recovers: the load, which was below 1, climbs above 5, and everything becomes very slow...
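
For readers decoding a trace like the one above: the kernel prints "nobody cared" when interrupts keep arriving on a line and none of the registered handlers claims them, after which it disables that line. The "handlers:" section names the drivers registered on it, here ahci and sata_sil. The following is only a minimal Python sketch of extracting those facts from a syslog excerpt. The function name summarize_bad_irq and both regular expressions are illustrative assumptions written against the line format shown in this post; they are not part of unRAID or of any kernel tool.

```python
import re

# Excerpt of the log posted above (addresses are already elided in the post).
LOG = """\
May 19 00:51:10 NAS kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
May 19 00:51:10 NAS kernel: handlers:
May 19 00:51:10 NAS kernel: [] (ahci_interrupt+0x0/0x3df [ahci])
May 19 00:51:10 NAS kernel: [] (sil_interrupt+0x0/0x26d [sata_sil])
May 19 00:51:10 NAS kernel: Disabling IRQ #19
"""

# Hypothetical patterns, matched to the format seen in this thread only.
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\)")


def summarize_bad_irq(log_text):
    """Return (irq, [(handler, module), ...]) for the first report, or None."""
    irq = None
    handlers = []
    in_handlers = False
    for line in log_text.splitlines():
        if irq is None:
            # Still looking for the "nobody cared" header line.
            match = IRQ_RE.search(line)
            if match:
                irq = int(match.group(1))
            continue
        if line.rstrip().endswith("kernel: handlers:"):
            in_handlers = True
            continue
        if in_handlers:
            match = HANDLER_RE.search(line)
            if not match:
                break  # first non-handler line ends the list
            handlers.append((match.group(1), match.group(2)))
    return None if irq is None else (irq, handlers)


if __name__ == "__main__":
    print(summarize_bad_irq(LOG))
```

Call-trace lines between the header and "handlers:" are simply skipped, so the full trace from the post can be passed in unchanged.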

 

Any idea?

Link to comment

Here is the complete logfile... I have AHCI selected in the BIOS for the mainboard SATA controller, but I also have a Sil 3114 PCI controller that, if I understand correctly, cannot be set to AHCI, and that sits on the same IRQ as my NIC. That is bad, and it cannot be changed in the BIOS either (ASUS E35 motherboard).

Link to comment

Sorry, correction: as far as I can see, the Sil 3114 SATA controller is on the same IRQ as the onboard SATA controller:

 

root@NAS:/proc# more interrupts

           CPU0       CPU1
  0:       6504     142202   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          3   IO-APIC-edge      i8042
 17:        136      11235   IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 18:       7028     959268   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, eth0
 19:       2265      80198   IO-APIC-fasteoi   ahci, sata_sil
NMI:          0          0   Non-maskable interrupts
LOC:     142490       6746   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
PND:          0          0   Performance pending work
RES:     156778      89635   Rescheduling interrupts
CAL:         23         55   Function call interrupts
TLB:       3595       8214   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          5          5   Machine check polls
ERR:          0
MIS:          0
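
The listing above shows three shared lines: IRQ 17 (USB EHCI), IRQ 18 (USB OHCI plus eth0) and IRQ 19 (ahci plus sata_sil). On a box with more devices it can be easier to let a script pick out the shared lines. The following is a rough Python sketch, assuming the column layout shown in this post (one count per CPU, then the controller type, then a comma-separated device list); newer kernels add extra columns, so treat the parsing as an assumption. The function name shared_irqs is made up for illustration.

```python
# A few rows copied from the /proc/interrupts output in this post.
SAMPLE = """\
           CPU0       CPU1
  0:       6504     142202   IO-APIC-edge      timer
 17:        136      11235   IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 18:       7028     959268   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, eth0
 19:       2265      80198   IO-APIC-fasteoi   ahci, sata_sil
NMI:          0          0   Non-maskable interrupts
ERR:          0
"""


def shared_irqs(interrupts_text):
    """Map IRQ number -> device list, for lines claimed by two or more devices."""
    lines = interrupts_text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and the other named counters
        # Per-CPU counts, then controller type, then the device list.
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue
        devices = [name.strip() for name in fields[-1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    # On a live system, pass open("/proc/interrupts").read() instead of SAMPLE.
    for irq, devices in sorted(shared_irqs(SAMPLE).items()):
        print(irq, devices)
```

A shared line is not a fault in itself, so this only narrows down which drivers to look at when the kernel disables an IRQ.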

 

Link to comment

You made the mistake of buying a brand-new, just-off-the-shelf hardware platform.

 

I always wonder what people are thinking when they do this. There are usually no problems when you run mainstream software (such as Windows), because the hardware and software vendors work hand in hand (exchanging beta hardware and software) to ensure that there are no major problems on the "launch" date, when the new hardware hits the shelves. But I am not sure how many Linux developers were given free hardware to test beforehand.

 

And then there are the many Linux "flavors", and Slackware, on which unRAID is based, is not the most popular or widely used nowadays. On top of that, unRAID development itself moves at an even slower pace and does not use the latest Slackware releases.

 

So buying a shiny, brand-new hardware platform and expecting it to work without any problems under custom software like unRAID is like hoping for a miracle. Miracles sometimes happen, but most of the time they do not. So the early adopters are now in the "sweet" troubleshooting territory all by themselves.

 

What can you do?

1. Always disable all unused hardware features on the motherboard (serial and parallel ports, audio, FireWire, floppy, and the IDE controller if you do not use any older PATA drives), as this frees up some resources.

2. Check your motherboard vendor's site daily for BIOS updates, apply them immediately when available, and then do not forget about #1 above. You may even complain to the vendor that this hardware does not work with that software...

3. Forget about 4.7 and go with the latest beta. If you have only just started with unRAID 4.7, then IMHO you should not have any issues with 5.0b6a.

 

If these do not help, your only choice is to try running the latest full Slackware distro.

 

And please do not forget to share your success (or pain) along the way.

 

PS. Read your PM  ;)

Link to comment

Thanks for the reply... I understand what you are saying; in hindsight it was indeed a mistake.

I have now booted with the irqpoll option and it looks better; I'll just have to assess the impact on performance. I think it's the Sil 3114 card that is causing the issue, plus the fact that the BIOS doesn't allow you to remap the IRQs...
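
For context, irqpoll is a kernel boot parameter: when an interrupt goes unhandled, the kernel searches all registered handlers for it, and it also checks all handlers on each timer interrupt, which is why it can cost performance. After editing the boot configuration it is worth confirming that the option actually reached the kernel. Here is a small Python sketch that checks the kernel command line; the helper names and the example command-line strings are hypothetical, and only /proc/cmdline itself is a standard Linux file.

```python
def has_boot_option(cmdline, option):
    """True if `option` appears as a bare flag or as option=value."""
    for token in cmdline.split():
        if token == option or token.startswith(option + "="):
            return True
    return False


def running_with_irqpoll(path="/proc/cmdline"):
    """Check the running kernel's command line for the irqpoll flag."""
    with open(path) as handle:
        return has_boot_option(handle.read(), "irqpoll")


if __name__ == "__main__":
    # Illustrative strings only; the real command line depends on your boot config.
    print(has_boot_option("initrd=bzroot irqpoll", "irqpoll"))
    print(has_boot_option("initrd=bzroot", "irqpoll"))
```

Matching whole tokens avoids a false positive from a longer option that merely contains the same text.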

Link to comment

Just an update:

* Booting with the irqpoll option works, but performance is then abysmal... Not a solution!

* I flashed the Sil 3114 card with the latest "IDE" (= non-RAID) BIOS. The flashing worked fine, but with that BIOS the unRAID server no longer boots!

* I then flashed the card with the latest RAID BIOS, and it has now been running smoothly for more than 12 hours. Let's see if it stays like this!

Link to comment

Yep, I've seen this. Unfortunately, I was too cheap to pay for the Pro version, and the non-Pro didn't get the same BIOS update. Damn!

 

On the current BIOS version, my Sil 3114 is back to square one, still giving me the same errors as reported above. How frustrating...

Link to comment

Good to hear.

 

PCIe cards seem better overall. I heard they were getting copious input from RAID-card owners. I'm only using two-port cards, but the old BIOS had a thing about trying to use my Monoprice cards for video. :/

 

I'm not sure shared IRQs were really the cause. Granted, it didn't behave normally, but even when I managed to isolate the eth0 or SATA IRQs, they would eventually have problems. It seemed more like a routing thing, or just big bugs.

 

Mine is better as well. Whatever the source, it has calmed down. Raw performance isn't quite as good, unfortunately: quick disk write numbers to the same array, without any work to correct it, look 25% slower.

 

The Realtek is still horrible and will still crash hard above ~20 MB/s. I haven't tried the Intel cards or other drivers yet.

 

BTW, in case I haven't mentioned it already, a full Slackware 13.1 install eventually died on the old BIOS with the same symptoms: it seems to work, then comes the familiar "Disabling IRQ", even when just idling. When time allows, I'll move my 13.1 array over for testing. It could mean more options for current drivers.

Link to comment
