Call Trace Error via Fix Common Problems plugin


Hi Everyone,

 

Current unRAID setup:

unRAID v6.2.4 - stable

Mobo: Intel BOXDH67CL LGA 1155 Intel H67

CPU: Intel Core i3-2105

RAM: Can't find the receipt at the moment, but I'm pretty sure it's 8GB Kingston (non-ECC)

HDDs: 2xWD20EFRX, 2xWD20EURS, 1xWD40EFRX (parity)

SSD: Crucial M4 64GB (cache)

 

I was updating plugins tonight and went to Fix Common Problems, ran the scan, and the message said: "Your server has issued one or more call traces. This could be caused by a Kernel Issue, Bad Memory, etc. You should post your diagnostics and ask for assistance on the unRaid forums."

 

I briefly looked through the forum, and ran the diagnostics.  Based on the forum searches, I looked for keywords like: "hot" "throttle" "trace" and "err/error" and they all came back pretty empty.  I will probably reapply thermal paste to the cpu later this week, but after that and a memtest, is there something that I should try?  I'd be glad to post the diagnostics, if needed - I just don't know what to look for.

 

Beyond that, do you think it's safe to have it on? I don't want to overheat anything or cause any irreparable damage if I can help it.

 

Also, I've had a few "Plex server is unavailable" from remote viewers in the last few days - any chance they're related?

 

Thanks very much for your help and time!

Jeff

 

Link to comment

Rob is way better than me at interpreting these things but I might be able to shed a little light. The trace was called because IRQ 16 was ignored for some reason.

 

Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)

Jan 29 11:24:56 Radagast kernel: CPU: 0 PID: 10784 Comm: smbd Not tainted 4.4.30-unRAID #2

Jan 29 11:24:56 Radagast kernel: Hardware name:                  /DH67GD, BIOS BLH6710H.86A.0105.2011.0301.1654 03/01/2011

Jan 29 11:24:56 Radagast kernel: 0000000000000000 ffff88021fa03e70 ffffffff8136f79f ffff8800cc873600

Jan 29 11:24:56 Radagast kernel: 0000000000000000 ffff88021fa03e98 ffffffff8107f8ce ffff8800cc873600

Jan 29 11:24:56 Radagast kernel: 0000000000000000 0000000000000010 ffff88021fa03ed0 ffffffff8107fb9b

Jan 29 11:24:56 Radagast kernel: Call Trace:

Jan 29 11:24:56 Radagast kernel: <IRQ>  [<ffffffff8136f79f>] dump_stack+0x61/0x7e

Jan 29 11:24:56 Radagast kernel: [<ffffffff8107f8ce>] __report_bad_irq+0x2b/0xb4

 

IRQ 16 is used by a USB 2 controller:

 

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00

 

This is where I struggle, I confess, trying to link the controller via a hub to a specific device. So I can't say what the device is. An obvious candidate would be your USB boot device, but I just can't say with any certainty. Maybe someone else can help here.

 

One thing I do notice is that your BIOS is dated 2011 so it might be worth looking to see if an update is available. Failing that, maybe try the "irqpoll" boot option that's suggested. I'm not sure why you think your CPU might be overheating. I can find no evidence of that.

 

Link to comment

Thanks for looking! 

 

Do you mean that the boot device itself is potentially bad or the port to which it's attached?  I just looked and there is a BIOS update from 2012, so I can try to apply that when I get home later today. 

 

I tried to look up what irqpoll is, and a lot of it goes over my head (sorry); however, it seems like when this is thrown, it indicates potential hardware failure.  Not sure which component, of course, haha.

 

In another thread (https://lime-technology.com/forum/index.php?topic=35590.0), they said that the problem was CPU overheating.  I didn't see any evidence when I looked through the logs, but figured it might be something good to do anyway and to reduce the number of potential variables.

 

I'm pretty woefully ignorant on log perusal, so if you don't mind me asking, what do you look for when going through a log? 

Link to comment

Most overheating problems are the result of a dirty case: dust-clogged fans, clogged fins on heatsinks, and blocked intake ventilation ports.  All of these causes can be addressed by a good cleaning of the entire inside of the case.  It is good practice to clean the case about once a year to prevent any potential problems.  Do it in a non-living area of the house (or outside) as you will be astonished at the amount of dirt and dust that can accumulate there.

Link to comment

Do you mean that the boot device itself is potentially bad or the port to which it's attached?

 

I think both are probably ok but an interrupt was ignored so it's possible the USB port stopped working as a result and it's possible that's the port to which your boot device is attached. I believe the cause was software-related so updating the BIOS and/or the Linux kernel might fix it.

 

I just looked and there is a BIOS update from 2012, so I can try to apply that when I get home later today. 

 

That might help. If not, you might try a newer kernel, i.e. a newer version of unRAID such as 6.3.0-rc9.

 

I tried to look up what irqpoll is, and a lot of it goes over my head (sorry); however, it seems like when this is thrown, it indicates potential hardware failure.  Not sure which component, of course, haha.

 

It's a boot option. You add it to your syslinux configuration. I'm not sure what it does but from the name I guess it actively polls for interrupts rather than just waiting for them to happen.

 

In another thread (https://lime-technology.com/forum/index.php?topic=35590.0), they said that the problem was CPU overheating.  I didn't see any evidence when I looked through the logs, but figured it might be something good to do anyway and to reduce the number of potential variables.

 

Cleaning out the heatsink fins would do no harm but I don't believe that your call trace was due to overheating.

 

I'm pretty woefully ignorant on log perusal, so if you don't mind me asking, what do you look for when going through a log?

 

In your case you said that Fix Common Problems had reported a call trace so I searched your syslog for "trace" and copied/pasted what I found. It mentions the BIOS version and a reference to IRQ 16. "irq 16: nobody cared" means that a device raised an interrupt but no driver handled it. So I looked for "irq 16" and found that it's owned by a USB 2.0 controller. What I wasn't able to do was determine what device is plugged into that controller. Your boot device is one candidate, a keyboard is another.
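For anyone who wants to repeat that search without reading the whole syslog by eye, here is a rough Python sketch of the two steps: find which IRQs were reported as "nobody cared", then find which driver claimed each of them at boot. The function names and regular expressions are my own assumptions, based only on the log lines quoted in this thread; other drivers may log their IRQ in a different format, so treat a miss as "not found", not "not in use".

```python
import re

# Matches lines such as: kernel: irq 16: nobody cared (try booting with ...)
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def find_unhandled_irqs(lines):
    """Return the set of IRQ numbers reported as 'nobody cared'."""
    irqs = set()
    for line in lines:
        match = NOBODY_CARED.search(line)
        if match:
            irqs.add(int(match.group(1)))
    return irqs


def find_irq_owners(lines, irq):
    """Return (driver, pci_address) pairs from lines that claim the given IRQ.

    Assumes the 'driver 0000:00:1a.0: irq 16, ...' layout seen in this syslog.
    """
    pattern = re.compile(rf"kernel: (\S+) ([0-9a-f:.]+): irq {irq},")
    owners = []
    for line in lines:
        match = pattern.search(line)
        if match:
            owners.append((match.group(1), match.group(2)))
    return owners


# Lines copied from the syslog excerpts in this thread.
sample = [
    "Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000",
    'Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jan 29 11:24:56 Radagast kernel: Call Trace:",
]

for irq in sorted(find_unhandled_irqs(sample)):
    print(irq, find_irq_owners(sample, irq))
```

To run it against a real log, replace `sample` with the lines of the syslog file from the diagnostics zip.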

 

Link to comment

If you want to try the boot option you can edit your syslinux configuration by going to the Main page of the web GUI and locating the Boot Device section and clicking the word "Flash". That opens a new page dedicated to your boot device. Scroll to the bottom and you'll see the Syslinux Configuration section. The area of interest looks like this:

 

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot

 

Edit it to look like this:

 

label unRAID OS
  menu default
  kernel /bzimage
  append irqpoll initrd=/bzroot

 

and click the Apply button. It will take effect when you reboot. I can't say it will fix it - it might even make things worse - so try the BIOS update first.
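After rebooting, one way to confirm that the option actually reached the kernel is to look at `/proc/cmdline`, which holds the command line the running kernel was booted with. A minimal sketch in Python follows; the helper names are hypothetical, and the example string is only an illustration of what the command line might look like, not output captured from this server.

```python
def has_boot_option(cmdline, option):
    """Return True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


# Illustrative string only; on the server itself use read_cmdline() instead.
example = "BOOT_IMAGE=/bzimage irqpoll initrd=/bzroot"
print(has_boot_option(example, "irqpoll"))
```

On the server you would call `has_boot_option(read_cmdline(), "irqpoll")`. Splitting on whitespace avoids a false match on some longer option that merely contains the same letters.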

 

Link to comment

Do it in a non-living area of the house (or outside) as you will be astonished at the amount of dirt and dust that can accumulate there.

Computers are very effective air cleaners.  ;D

 

You don't even want to touch the inside of a tobacco smokers computer.  :o

 

I try to clean my hardware 2x a year, once in spring, once in fall because I have two things keeping my apartment dust filled: forced air and a cat.  I do that ever since a friend of mine gave me some of his old hardware (which I used to make my first unRAID build) and it was pretty freaking gross (please see attached pic).  I can't even imagine a smoker's machine - yuck.  :-\

[Attached image: DustPic.png]

Link to comment

Thanks for the info! I'll try the BIOS update tonight or tomorrow.  I don't really have experience with unRAID RCs and I'm a little leery about RC software in general, so maybe that'll be a last ditch effort.

 

If the BIOS doesn't fix it, I'll try the irqpoll option.  Speaking of fixing, is this a persistent error (once it happens it 'stays on') or can it occur at any time? Would a server restart clear it from unRAID?

Link to comment

I understand your reluctance to use RCs. In practice the unRAID RCs are generally very stable, but try the BIOS update first, then the boot option if the BIOS doesn't help.

 

Once the IRQ has been ignored I believe it stays ignored. In some cases I've seen the IRQ in question was the one used by the SAS controller and disk performance plummets, so yours seems like a minor problem in comparison. Rebooting will fix it... until the next time. If it's your keyboard that's affected you might never notice. Do you have a USB keyboard connected, BTW? If your boot device is affected then it might be a big problem, but you have other USB sockets that you could try if necessary.

Link to comment

I need some advice then for FCP.

 

Since the determining factor in this error is "Call Trace:" being present in the syslog, should I instead try to determine whether the trace is from "irq 16: nobody cared" and, if so, drop the error down to an "other warning" so that this particular source basically does not trigger a notification?

 

The thing is that as it stands you definitely don't want to have FCP ignore Call Traces, but if this particular trace is harmless then it'll odds on pop up again on a reboot, and we can't have users ignoring legitimate ones.  I'll let the peanut gallery here decide  ;)

 

Additionally, this is a new test, and the only other recent posting about this came from another user who had different traces which were indeed likely caused by overheating, but the syslog in that case specifically had mce errors in it also specifying heat as the factor.  (mce errors are going to get flagged on this weekend's update to FCP)

Link to comment

I'm afraid I don't know whether this particular "nobody cared" is harmless or not. It really depends whether the IRQ is used by something important, like a disk controller or network card. In this case it's a USB controller, which may be significant if it affects access to the boot device. Nevertheless, it worried the OP and caused him to believe his CPU was overheating so it would be nice if FCP could determine the event that caused the call trace.
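To make the idea concrete, here is a rough sketch of how a checker could attach a likely trigger to each call trace by looking at the few lines logged just before it. It is written in Python purely for illustration and is not how Fix Common Problems is implemented; the function name, the `lookback` window, and the "unknown" label are all assumptions of mine. It only recognises the "nobody cared" case discussed in this thread.

```python
import re

NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def classify_call_traces(lines, lookback=10):
    """Pair each 'Call Trace:' line with the likely trigger logged just before it.

    Returns a list of (line_index, cause) tuples, where cause is either
    'unhandled irq N' or 'unknown'.
    """
    results = []
    for index, line in enumerate(lines):
        if "Call Trace:" not in line:
            continue
        cause = "unknown"
        # Walk backwards through the preceding lines, nearest first.
        for previous in reversed(lines[max(0, index - lookback):index]):
            match = NOBODY_CARED.search(previous)
            if match:
                cause = f"unhandled irq {match.group(1)}"
                break
        results.append((index, cause))
    return results


sample = [
    'Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jan 29 11:24:56 Radagast kernel: CPU: 0 PID: 10784 Comm: smbd Not tainted 4.4.30-unRAID #2",
    "Jan 29 11:24:56 Radagast kernel: Call Trace:",
]
print(classify_call_traces(sample))
```

A fixed look-back window is crude: two traces logged close together could be credited with the same trigger. Even so, reporting the cause next to the warning would let the user see at a glance that a trace came from an unhandled interrupt and not from, say, overheating.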

Link to comment

Ah-ha! From system/lsscsi.txt in the diagnostics:

 

[0:0:0:0]    disk    SanDisk  U3 Cruzer Micro  8.02  /dev/sda  /dev/sg0

  state=running queue_depth=1 scsi_level=0 type=0 device_blocked=0 timeout=30

  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3/1-1.3:1.0/host0/target0:0:0/0:0:0:0]

 

That matches the owner of IRQ 16:

 

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00

 

So the problem is likely to affect the boot device.
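The matching step can be done mechanically: pull the PCI addresses out of the sysfs path that lsscsi prints and check whether the IRQ owner's address is among them. A small Python sketch, where the helper name and the address pattern are my own assumptions and the path and address are copied from the output above:

```python
import re

# A PCI address looks like domain:bus:device.function, e.g. 0000:00:1a.0
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def pci_devices_in_path(sysfs_path):
    """Return the PCI addresses that appear as components of a sysfs device path."""
    return [part for part in sysfs_path.split("/") if PCI_ADDR.fullmatch(part)]


# Path from system/lsscsi.txt and address from the ehci-pci syslog line.
flash_path = (
    "/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3/1-1.3:1.0"
    "/host0/target0:0:0/0:0:0:0"
)
irq_owner = "0000:00:1a.0"

print(irq_owner in pci_devices_in_path(flash_path))
```

If the address is in the list, the device sits behind the controller that owns the ignored IRQ.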

 

Link to comment

So does that mean that all USB ports are affected or just that particular one?  If I switched the unRAID stick to another USB port would the problem go away or is this indicative of problems to come (like if one USB port goes bad, they'll all eventually go bad)? 

 

I couldn't find any of my flash drives to upgrade the BIOS last night, so I'll try to grab one today and update it later this evening.  I'll then restart the server and see if the problem comes back again.  Should I add any other steps?

Link to comment

I'd just update the BIOS for now and restart with the USB stick in the same socket and see how it goes. If it happens again, try the USB stick in a different socket.

 

Sorry for the delay in reply - work was a little crazy yesterday, so I just got to this today...aaaaand I've run into an issue. 

 

I ran the F7 BIOS update from here: https://downloadmirror.intel.com/22273/eng/BIOS%20Update%20Readme.pdf. I double checked the correct BIOS version and put the .BIO file onto a recently formatted 1GB flash drive.  I went through the utility, I got a message saying something like, "BIOS update successful, rebooting machine." Only...nothing came back up, just a black screen with a "cable not plugged in" message. I tried restarting multiple times, did a fair amount of googling, and settled on following Intel's advice (http://www.intel.com/content/www/us/en/support/boards-and-kits/desktop-boards/000005753.html).  I tried Option 1, but the computer didn't automatically boot into Maintenance mode; so I tried option 2.  Pulled the CMOS battery for ~20 minutes and tried again. Same result - no screens loaded.  Then I tried the BIOS recovery option (http://www.intel.com/content/www/us/en/support/boards-and-kits/000005630.html).  Still nothing. 

 

The only thing I can think of is that the onboard HDMI is not the default output, and that the onboard DVI is, resulting in no screen image and all kinds of insanity being inflicted upon the motherboard through my multiple 'failed' attempts.  Unfortunately, I don't have a DVI-to-HDMI cable to test this theory, even though it sounds weak to me. Why wouldn't the board output to either port?  I'll try to borrow a cable tomorrow to test.

 

Has anyone run into a problem like this?  I'm really lost and I'm REALLY worried that I just screwed my board and my server.

 

Thanks,

Jeff

Link to comment

I'm sorry to hear that. It seems you did it right and the update was successful. I hope your theory about the DVI connector is correct. Do you have a graphics card you could try?

 

Thanks - I borrowed a DVI to VGA cable today, so hopefully that theory can be tested.  I do have a video card and can try that next.  If that doesn't work, I'll try to google a little bit more, in case there's something I missed.  Otherwise, it might be new mobo time.

 

I'll try the cable and all that jazz later this afternoon and will report back.

Link to comment

Well, I tried the cable...no dice.

 

Did some more googling, however, and found this thread:

 

https://communities.intel.com/thread/30813

 

It basically describes my problem and troubleshooting attempts to a T (my board is DH67CL).  It turns out that you can brick your motherboard if you 'jump' too many BIOS update versions. Not sure if this is an Intel-only problem, but I tried to go from version 105 to the newest one, which is 160. I saw NO disclaimers on the Intel website warning about this problem, so if anyone uses Intel boards for their unRAID builds, this would probably be useful information to give to them.  Not sure where that should go, though...should I email an admin?  I may be able to "unbrick" it by using a next generation (compared to mine) processor, but I don't think I want to buy a processor just for a potential fix, only to find that it doesn't work and now I'm out the money for the processor, too.

 

I guess the hunt is on for a new motherboard, at least.  Unless now is a good time to upgrade other internals...

Link to comment

Yes, that's correct. Everything about your array is stored on the boot flash.

 

I hadn't heard that little gem about jumping too many BIOS versions. That really stinks. I've never owned an Intel motherboard and I wouldn't buy one, knowing that. I like Asus and Gigabyte, personally.

 

I'm sorry this has become such a pain for you.

 

Link to comment
