bonustreats Posted January 30, 2017

Hi Everyone,

Current unRAID setup:
unRAID v6.2.4 - stable
Mobo: Intel BOXDH67CL LGA 1155 Intel H67
CPU: Intel Core i3-2105
RAM: Can't find the receipt at the moment, but I'm pretty sure it's 8GB Kingston (non-ECC)
HDDs: 2xWD20EFRX, 2xWD20EURS, 1xWD40EFRX (parity)
SSD: Crucial M4 64GB (cache)

I was updating plugins tonight and went to Fix Common Problems, ran the scan, and the message said: "Your server has issued one or more call traces. This could be caused by a Kernel Issue, Bad Memory, etc. You should post your diagnostics and ask for assistance on the unRaid forums."

I briefly looked through the forum and ran the diagnostics. Based on the forum searches, I looked for keywords like "hot", "throttle", "trace" and "err/error", and they all came back pretty empty. I will probably reapply thermal paste to the CPU later this week, but after that and a memtest, is there something else that I should try? I'd be glad to post the diagnostics, if needed - I just don't know what to look for.

Beyond that, do you think it's safe to have it on? I don't want to overheat anything or cause any irreparable damage if I can help it. Also, I've had a few "Plex server is unavailable" reports from remote viewers in the last few days - any chance they're related?

Thanks very much for your help and time!
Jeff
RobJ Posted January 30, 2017

There's almost nothing any of us can do if you don't attach the diagnostics for us to check!
bonustreats Posted January 30, 2017 Author

Figured as much - just didn't want to spam all my stuff, if this was some common error that I had missed. Thanks!

radagast-diagnostics-20170129-2006.zip
John_M Posted January 30, 2017

Rob is way better than me at interpreting these things but I might be able to shed a little light. The trace was called because IRQ 16 was ignored for some reason.

Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan 29 11:24:56 Radagast kernel: CPU: 0 PID: 10784 Comm: smbd Not tainted 4.4.30-unRAID #2
Jan 29 11:24:56 Radagast kernel: Hardware name: /DH67GD, BIOS BLH6710H.86A.0105.2011.0301.1654 03/01/2011
Jan 29 11:24:56 Radagast kernel: 0000000000000000 ffff88021fa03e70 ffffffff8136f79f ffff8800cc873600
Jan 29 11:24:56 Radagast kernel: 0000000000000000 ffff88021fa03e98 ffffffff8107f8ce ffff8800cc873600
Jan 29 11:24:56 Radagast kernel: 0000000000000000 0000000000000010 ffff88021fa03ed0 ffffffff8107fb9b
Jan 29 11:24:56 Radagast kernel: Call Trace:
Jan 29 11:24:56 Radagast kernel: <IRQ> [<ffffffff8136f79f>] dump_stack+0x61/0x7e
Jan 29 11:24:56 Radagast kernel: [<ffffffff8107f8ce>] __report_bad_irq+0x2b/0xb4

IRQ 16 is used by a USB 2 controller:

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000
Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00

This is where I struggle, I confess, trying to link the controller via a hub to a specific device. So I can't say what the device is. An obvious candidate would be your USB boot device, but I just can't say with any certainty. Maybe someone else can help here.

One thing I do notice is that your BIOS is dated 2011 so it might be worth looking to see if an update is available. Failing that, maybe try the "irqpoll" boot option that's suggested.

I'm not sure why you think your CPU might be overheating. I can find no evidence of that.
bonustreats Posted January 30, 2017 Author

Thanks for looking! Do you mean that the boot device itself is potentially bad or the port to which it's attached? I just looked and there is a BIOS update from 2012, so I can try to apply that when I get home later today.

I tried to look up what irqpoll is, and a lot of it goes over my head (sorry); however, it seems like when this is thrown, it indicates potential hardware failure. Not sure which component, of course, haha.

In another thread (https://lime-technology.com/forum/index.php?topic=35590.0), they said that the problem was CPU overheating. I didn't see any evidence when I looked through the logs, but figured it might be something good to do anyway and to reduce the number of potential variables.

I'm pretty woefully ignorant on log perusal, so if you don't mind me asking, what do you look for when going through a log?
Frank1940 Posted January 30, 2017

Most overheating problems are the result of a dirty case: clogged fans, clogged fins on heatsinks, and blocked intake ventilation ports. All of these causes can be addressed by a good cleaning of the entire inside of the case. It is good practice to clean the case about once a year to prevent any potential problems. Do it in a non-living area of the house (or outside) as you will be astonished at the amount of dirt and dust that can accumulate there.
JonathanM Posted January 30, 2017

"Do it in a non-living area of the house (or outside) as you will be astonished at the amount of dirt and dust that can accumulate there."

Computers are very effective air cleaners. You don't even want to touch the inside of a tobacco smoker's computer.
JorgeB Posted January 30, 2017

"You don't even want to touch the inside of a tobacco smoker's computer."

lol, completely agree, they look all yellow, oh, and the smell...
John_M Posted January 30, 2017

"Do you mean that the boot device itself is potentially bad or the port to which it's attached?"

I think both are probably ok, but an interrupt was ignored, so it's possible the USB port stopped working as a result, and it's possible that's the port to which your boot device is attached. I believe the cause was software-related, so updating the BIOS and/or the Linux kernel might fix it.

"I just looked and there is a BIOS update from 2012, so I can try to apply that when I get home later today."

That might help. If not, you might try a newer kernel - i.e. a newer version of unRAID, 6.3.0-rc9.

"I tried to look up what irqpoll is, and a lot of it goes over my head (sorry); however, it seems like when this is thrown, it indicates potential hardware failure. Not sure which component, of course, haha."

It's a boot option. You add it to your syslinux configuration. I'm not sure what it does but from the name I guess it actively polls for interrupts rather than just waiting for them to happen.

"In another thread (https://lime-technology.com/forum/index.php?topic=35590.0), they said that the problem was CPU overheating. I didn't see any evidence when I looked through the logs, but figured it might be something good to do anyway and to reduce the number of potential variables."

Cleaning out the heatsink fins would do no harm but I don't believe that your call trace was due to overheating.

"I'm pretty woefully ignorant on log perusal, so if you don't mind me asking, what do you look for when going through a log?"

In your case you said that Fix Common Problems had reported a call trace, so I searched your syslog for "trace" and copied/pasted what I found. It mentions the BIOS version and a reference to IRQ 16. "irq 16: nobody cared" means that a device raised an interrupt but nothing handled it. So I looked for "irq 16" and found that it's owned by a USB 2.0 controller.

What I wasn't able to do was determine what device is plugged into that controller. Your boot device is one candidate, a keyboard is another.
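The search described above (find the "nobody cared" message, then find which driver registered that IRQ) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text is copied from the syslog excerpts in this thread, the function name `find_ignored_irqs` is made up for the example, and the regular expressions assume that other syslogs word these messages the same way.

```python
import re

# Sample lines copied from the syslog excerpts quoted in this thread.
SYSLOG = """\
Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000
Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00
Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan 29 11:24:56 Radagast kernel: Call Trace:
"""

# "irq 16: nobody cared ..." as it appears in the excerpt above.
NOBODY_CARED = re.compile(r"kernel: irq (\d+): nobody cared")
# Driver registration lines such as "ehci-pci 0000:00:1a.0: irq 16, ...".
IRQ_OWNER = re.compile(
    r"kernel: (\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): irq (\d+),"
)


def find_ignored_irqs(text):
    """Return {irq: [(driver, pci_address), ...]} for each 'nobody cared' IRQ."""
    lines = text.splitlines()
    owners = {}
    # First pass: collect every IRQ number the kernel complained about.
    for line in lines:
        match = NOBODY_CARED.search(line)
        if match:
            owners.setdefault(int(match.group(1)), [])
    # Second pass: find the drivers that registered those IRQ numbers.
    for line in lines:
        match = IRQ_OWNER.search(line)
        if match and int(match.group(3)) in owners:
            owners[int(match.group(3))].append((match.group(1), match.group(2)))
    return owners


if __name__ == "__main__":
    for irq, devices in find_ignored_irqs(SYSLOG).items():
        print(irq, devices)
```

On a real system the same function could be fed the contents of the syslog file from the diagnostics zip instead of the embedded sample.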
John_M Posted January 30, 2017

If you want to try the boot option you can edit your syslinux configuration by going to the Main page of the web GUI and locating the Boot Device section and clicking the word "Flash". That opens a new page dedicated to your boot device. Scroll to the bottom and you'll see the Syslinux Configuration section. The area of interest looks like this:

label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot

Edit it to look like this:

label unRAID OS
  menu default
  kernel /bzimage
  append irqpoll initrd=/bzroot

and click the Apply button. It will take effect when you reboot. I can't say it will fix it - it might even make things worse - so try the BIOS update first.
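For anyone who would rather see the edit as a text transformation, here is a minimal Python sketch of the same change. It works only on a string in memory and does not touch the flash drive; the web GUI route described above is still the way the post recommends making the change. The function name `add_boot_option`, the indentation of the sample, and the second label in the sample are all assumptions made for illustration.

```python
# The "unRAID OS" entry is taken from the post above; the indentation and the
# second entry are hypothetical, added only to show that other labels are
# left alone.
SAMPLE = """\
label unRAID OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Example Other Entry
  kernel /bzimage
  append initrd=/bzroot
"""


def add_boot_option(config_text, option, label="unRAID OS"):
    """Insert `option` right after 'append' in the named label's entry.

    Entries under other labels are untouched, and the option is not added
    twice if it is already present.
    """
    out = []
    in_label = False
    for line in config_text.splitlines():
        parts = line.split()
        if parts and parts[0].lower() == "label":
            # A new entry starts here; remember whether it is the target.
            in_label = " ".join(parts[1:]) == label
        elif (in_label and parts and parts[0].lower() == "append"
              and option not in parts[1:]):
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join([parts[0], option] + parts[1:])
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(add_boot_option(SAMPLE, "irqpoll"))
```

Running the function a second time on its own output leaves the text unchanged, which makes it safe to apply more than once.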
bonustreats Posted January 30, 2017 Author

"Do it in a non-living area of the house (or outside) as you will be astonished at the amount of dirt and dust that can accumulate there. Computers are very effective air cleaners. You don't even want to touch the inside of a tobacco smoker's computer."

I try to clean my hardware 2x a year, once in spring, once in fall, because I have two things keeping my apartment dust filled: forced air and a cat. I've done that ever since a friend of mine gave me some of his old hardware (which I used to make my first unRAID build) and it was pretty freaking gross (please see attached pic). I can't even imagine a smoker's machine - yuck.
bonustreats Posted January 30, 2017 Author

Thanks for the info! I'll try the BIOS update tonight or tomorrow. I don't really have experience with unRAID RCs and am a little leery about RC software in general, so maybe that'll be a last ditch effort. If the BIOS doesn't fix it, I'll try the irqpoll option.

Speaking of fixing, is this a persistent error (once it happens it 'stays on') or can it occur at any time? Would a server restart clear it from unRAID?
John_M Posted January 30, 2017

I understand your reluctance to use RCs. In practice the unRAID RCs are generally very stable, but try the BIOS update first, then the boot option if the BIOS doesn't help.

Once the IRQ has been ignored I believe it stays ignored. In some cases I've seen, the IRQ in question was the one used by the SAS controller and disk performance plummeted, so yours seems like a minor problem in comparison. Rebooting will fix it... until the next time. If it's your keyboard that's affected you might never notice. Do you have a USB keyboard connected, BTW? If your boot device is affected then it might be a big problem, but you have other USB sockets that you could try if necessary.
Squid Posted January 30, 2017

I need some advice then for FCP. Since the determining factor in this error is "Call Trace:" being present in the syslog, should I instead try to determine whether the trace is from "IRQ 16 nobody cared" and, if so, drop the error down to an "other warning" so that this particular source basically does not trigger a notification?

The thing is that as it stands you definitely don't want to have FCP ignore Call Traces, but if this particular trace is harmless then it'll odds on pop up again on a reboot, and we can't have users ignoring legitimate ones. I'll let the peanut gallery here decide.

Additionally, this is a new test, and the only other recent posting about this came from another user who had different traces which were indeed likely caused by overheating, but the syslog in that case also had mce errors in it specifying heat as the factor. (mce errors are going to get flagged in this weekend's update to FCP.)
John_M Posted January 30, 2017

I'm afraid I don't know whether this particular "nobody cared" is harmless or not. It really depends whether the IRQ is used by something important, like a disk controller or network card. In this case it's a USB controller, which may be significant if it affects access to the boot device. Nevertheless, it worried the OP and caused him to believe his CPU was overheating so it would be nice if FCP could determine the event that caused the call trace.
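One way the idea discussed above could look, sketched in Python: for every "Call Trace:" line, look a few lines back for an "irq N: nobody cared" message and report it as the likely cause. This is not how Fix Common Problems actually works; the function name `classify_call_traces` and the `lookback` window are assumptions for illustration, and whether a given cause deserves an error or only a warning is left to the caller, since (as noted above) it depends on what owns the IRQ.

```python
import re

NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")

# Lines copied from the syslog excerpt earlier in this thread.
SAMPLE_LINES = [
    'Jan 29 11:24:56 Radagast kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jan 29 11:24:56 Radagast kernel: CPU: 0 PID: 10784 Comm: smbd Not tainted 4.4.30-unRAID #2",
    "Jan 29 11:24:56 Radagast kernel: Call Trace:",
    "Jan 29 11:24:56 Radagast kernel: <IRQ> [<ffffffff8136f79f>] dump_stack+0x61/0x7e",
]


def classify_call_traces(lines, lookback=10):
    """Return one (line_index, cause) tuple per 'Call Trace:' line.

    cause is 'irq N nobody cared' when that message appears within
    `lookback` lines before the trace, otherwise 'unknown'.
    """
    results = []
    for index, line in enumerate(lines):
        if "Call Trace:" not in line:
            continue
        cause = "unknown"
        # Walk backwards so the nearest preceding message wins.
        for previous in reversed(lines[max(0, index - lookback):index]):
            match = NOBODY_CARED.search(previous)
            if match:
                cause = "irq %s nobody cared" % match.group(1)
                break
        results.append((index, cause))
    return results


if __name__ == "__main__":
    print(classify_call_traces(SAMPLE_LINES))
```

A trace with no recognizable preamble stays "unknown", so it would still be reported rather than silently ignored.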
John_M Posted January 31, 2017

Ah-ha! From system/lsscsi.txt in the diagnostics:

[0:0:0:0] disk SanDisk U3 Cruzer Micro 8.02 /dev/sda /dev/sg0
  state=running queue_depth=1 scsi_level=0 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3/1-1.3:1.0/host0/target0:0:0/0:0:0:0]

That matches the owner of IRQ 16:

Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xfe727000
Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00

So the problem is likely to affect the boot device.
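The match made above comes down to spotting the same PCI address (0000:00:1a.0) in both the sysfs path from lsscsi.txt and the syslog line that registers IRQ 16. A small Python sketch of that comparison follows; the two input strings are copied from the post, while the function names `pci_addresses` and `device_uses_controller` are made up for the example.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:00:1a.0
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")

# Both strings are copied from the post above.
SYSFS_PATH = (
    "/sys/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3/1-1.3:1.0/"
    "host0/target0:0:0/0:0:0:0"
)
IRQ_LINE = (
    "Jan 27 20:32:26 Radagast kernel: ehci-pci 0000:00:1a.0: "
    "irq 16, io mem 0xfe727000"
)


def pci_addresses(sysfs_path):
    """Return the path components that are complete PCI addresses."""
    return [part for part in sysfs_path.split("/") if PCI_ADDR.fullmatch(part)]


def device_uses_controller(sysfs_path, irq_line):
    """True if the controller named in irq_line lies on the device's sysfs path."""
    match = PCI_ADDR.search(irq_line)
    return bool(match) and match.group(0) in pci_addresses(sysfs_path)


if __name__ == "__main__":
    print(pci_addresses(SYSFS_PATH))
    print(device_uses_controller(SYSFS_PATH, IRQ_LINE))
```

Only whole path components are compared, so strings such as "pci0000:00" or the SCSI address "0:0:0:0" are not mistaken for PCI addresses.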
John_M Posted January 31, 2017

I spotted this in a completely unrelated thread. You don't want to be adding the irqpoll option if you can get by without it.
bonustreats Posted January 31, 2017 Author

So does that mean that all USB ports are affected or just that particular one? If I switched the unRAID stick to another USB port, would the problem go away, or is this indicative of problems to come (like if one USB port goes bad, they'll all eventually go bad)?

I couldn't find any of my flash drives to upgrade the BIOS last night, so I'll try to grab one today and update it later this evening. I'll then restart the server and see if the problem comes back again. Should I add any other steps?
John_M Posted January 31, 2017

I'd just update the BIOS for now and restart with the USB stick in the same socket and see how it goes. If it happens again, try the USB stick in a different socket.
bonustreats Posted February 2, 2017 Author

"I'd just update the BIOS for now and restart with the USB stick in the same socket and see how it goes. If it happens again, try the USB stick in a different socket."

Sorry for the delay in reply - work was a little crazy yesterday, so I just got to this today...aaaaand I've run into an issue.

I ran the F7 BIOS update from here: https://downloadmirror.intel.com/22273/eng/BIOS%20Update%20Readme.pdf. I double checked the correct BIOS version and put the .BIO file onto a recently formatted 1GB flash drive. I went through the utility and got a message saying something like, "BIOS update successful, rebooting machine." Only...nothing came back up, just a black screen with a "cable not plugged in" message.

I tried restarting multiple times, did a fair amount of googling, and settled on following Intel's advice (http://www.intel.com/content/www/us/en/support/boards-and-kits/desktop-boards/000005753.html). I tried Option 1, but the computer didn't automatically boot into Maintenance mode, so I tried Option 2: pulled the CMOS battery for ~20 minutes and tried again. Same result - no screens loaded. Then I tried the BIOS recovery option (http://www.intel.com/content/www/us/en/support/boards-and-kits/000005630.html). Still nothing.

The only thing I can think of is that the onboard HDMI is not the default output and the onboard DVI is, resulting in no screen image and all kinds of insanity being inflicted upon the motherboard through my multiple 'failed' attempts. Unfortunately, I don't have a DVI-to-HDMI cable to test this theory, even though it sounds weak to me. Why wouldn't the board output to either port? I'll try to borrow a cable tomorrow to test.

Has anyone run into a problem like this? I'm really lost and I'm REALLY worried that I just screwed my board and my server.

Thanks,
Jeff
John_M Posted February 2, 2017

I'm sorry to hear that. It seems you did it right and the update was successful. I hope your theory about the DVI connector is correct. Do you have a graphics card you could try?
bonustreats Posted February 2, 2017 Author

"I'm sorry to hear that. It seems you did it right and the update was successful. I hope your theory about the DVI connector is correct. Do you have a graphics card you could try?"

Thanks - I borrowed a DVI to VGA cable today, so hopefully that theory can be tested. I do have a video card and can try that next. If that doesn't work, I'll try to google a little bit more, in case there's something I missed. Otherwise, it might be new mobo time. I'll try the cable and all that jazz later this afternoon and will report back.
bonustreats Posted February 2, 2017 Author

Well, I tried the cable...no dice. Did some more googling, however, and found this thread: https://communities.intel.com/thread/30813

It basically describes my problem and troubleshooting attempts to a T (my board is DH67CL). It turns out that you can brick your motherboard if you 'jump' too many BIOS update versions. Not sure if this is an Intel-only problem, but I tried to go from version 105 to the newest one, which is 160. I saw NO disclaimers on the Intel website warning about this problem, so if anyone uses Intel boards for their unRAID builds, this would probably be useful information to give to them. Not sure where that should go, though...should I email an admin?

I may be able to "unbrick" it by using a next generation (compared to mine) processor, but I don't think I want to buy a processor just for a potential fix, only to find that it doesn't work and now I'm out the money for the processor, too. I guess the hunt is on for a new motherboard, at least. Unless now is a good time to upgrade other internals...
bonustreats Posted February 2, 2017 Author

Just to double check: the unRAID OS is agnostic to the hardware (for the most part), right? If I switch the hardware (mobo, processor, RAM, etc), unRAID won't even notice, as long as it's booting into the same flash drive/hard drive configuration as before, correct?
John_M Posted February 2, 2017

Yes, that's correct. Everything about your array is stored on the boot flash.

I hadn't heard that little gem about jumping too many BIOS versions. That really stinks. I've never owned an Intel motherboard and I wouldn't buy one, knowing that. I like Asus and Gigabyte, personally. I'm sorry this has become such a pain for you.