
Kernel Panic Galore!


MuppetRules


0. I think we need to address the IRQ16 issue.  Not sure how to do that.  Evidently it's a conflict between USB and MVSAS.  But since the MVSAS is at IRQ11 (see earlier posts of MVSAS BIOS), how could it be on the same IRQ as the USB?  Perhaps this is the main issue?  Not sure.  I've disabled everything I can in the BIOS, but I have not contacted ASUS to see if there is another way.  If I can, just one USB would be enough.

 

I'm really rusty on interrupt architecture, and over the years it has gotten quite a bit more complex.  In the beginning (according to my rusty memory), IBM set up the PC architecture with just 16 interrupts, IRQ 0 through IRQ 15, most of which had assigned or recommended tasks.  These are the physical interrupts, and that's all the early CPUs could manage.  But it wasn't very long before many wanted more, so CPUs were able to handle quite a few more 'software' interrupts, and then over the years more advanced schemes were created to share interrupts, overlay whole tables of interrupts, etc.  Just guessing, but the SAS card seems to be 'hardwired' or configured at IRQ 11, an actual physical IRQ, but then mapped into one of the software IRQ schemes, where it appears to share IRQ 16 with the first USB controller.  In your last syslog, the first USB controller had no connected devices at all; the second one (using IRQ 23) had your mouse, keyboard, and flash drive connected to it.  Just a guess, but do you possibly have all 3 connected in back, so that the first controller is handling the front ports, where nothing is plugged in?

 

Just a wild and crazy idea, try plugging something (not your flash drive) into a front port, which should cause a responsible handler to be attached.  No logical reason I can think of, that this should work, but perhaps having another handler hooked into the IRQ chain, may modify its behavior, stop the "nobody cared" error and disabling.  If however the bad IRQ call is caused by something defective in the SAS-related code, then this won't make any difference.
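
For anyone who wants to check which handlers really share an interrupt line, here is a minimal Python sketch that parses text laid out like /proc/interrupts and lists the IRQs with more than one handler. The sample text is made up for illustration (it is not taken from the system in this thread), the function name is hypothetical, and the column splitting is a heuristic that assumes the handler list is separated from the column before it by two or more spaces.

```python
import re

# Hypothetical sample in the layout of /proc/interrupts (NOT from this thread's system).
SAMPLE = """\
           CPU0       CPU1
  0:         45          0   IO-APIC-edge      timer
 16:        100          0   IO-APIC-fasteoi   ehci_hcd:usb1, mvsas
 23:       2041          0   IO-APIC-fasteoi   ehci_hcd:usb2
NMI:          0          0   Non-maskable interrupts
"""

def shared_irqs(text):
    """Return {irq_number: [handler, ...]} for IRQs that have two or more handlers."""
    shared = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+(.*)", line)
        if not m:
            continue  # header row, or named rows such as NMI/LOC
        irq = int(m.group(1))
        # Strip the per-CPU counters that follow the IRQ number.
        rest = re.sub(r"^(\d+\s+)+", "", m.group(2)).strip()
        # Heuristic: the handler list is the last column, set off by 2+ spaces.
        columns = re.split(r"\s{2,}", rest)
        if len(columns) < 2:
            continue  # no handler registered on this line
        handlers = [h.strip() for h in columns[-1].split(",")]
        if len(handlers) > 1:
            shared[irq] = handlers
    return shared

for irq, handlers in shared_irqs(SAMPLE).items():
    print("IRQ", irq, "is shared by:", ", ".join(handlers))
```

On a live box the same function could be fed the contents of /proc/interrupts instead of the sample string.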

Link to comment

RobJ, that is brilliant.  I remember struggling with IRQ settings across a variety of adapters and such back in the early days of the PC (1981!).  Manually setting each one, having to manually ensure they were all correct.  Painful.  And back then you had a 3270 card, a token ring card, a memory card, a sound card, a parallel port card and a video card.  Lots of IRQs to mismanage!

 

I've moved the keyboard and mouse to the front panel and booted.  Syslog attached.  I've opened the log in the webpage and perhaps it can catch something of value...

 

Only a few errors

May 11 23:54:40 HunterNAS-6 kernel: acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20141107/psargs-359) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88040e864528), AE_NOT_FOUND (20141107/psparse-536) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ata1.00: failed to get NCQ Send/Recv Log Emask 0x1 (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20141107/psargs-359) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88040e864528), AE_NOT_FOUND (20141107/psparse-536) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ata1.00: failed to get NCQ Send/Recv Log Emask 0x1 (Errors)

 

Only mention of IRQ 16

May 11 23:54:40 HunterNAS-6 kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xf7f04000

 

No mention of IRQ 11 in the log.
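
If it helps to double-check this kind of claim, here is a rough Python sketch that groups syslog lines by the IRQ number they mention. The function name is hypothetical, and the two sample lines are copied from the excerpts in this post; the pattern only catches the literal form "irq N", so other spellings would be missed.

```python
import re
from collections import defaultdict

# Two lines copied from the syslog excerpts in this post.
SAMPLE_LINES = [
    "May 11 23:54:40 HunterNAS-6 kernel: ata1.00: failed to get NCQ Send/Recv Log Emask 0x1 (Errors)",
    "May 11 23:54:40 HunterNAS-6 kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xf7f04000",
]

IRQ_PATTERN = re.compile(r"\birq\s+(\d+)\b", re.IGNORECASE)

def irq_mentions(lines):
    """Return {irq_number: [matching lines]} for every 'irq N' found."""
    found = defaultdict(list)
    for line in lines:
        for number in IRQ_PATTERN.findall(line):
            found[int(number)].append(line.rstrip("\n"))
    return dict(found)

mentions = irq_mentions(SAMPLE_LINES)
for irq in sorted(mentions):
    print("IRQ", irq, "appears in", len(mentions[irq]), "line(s)")
```

To run it over a whole log, pass an open file object (for example the attached syslog text file) instead of the sample list.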

 

Do you see anything of interest?  Diagnostic?

 

Fingers crossed!

 

Link to comment

I've moved the keyboard and mouse to the front panel and booted.  Syslog attached.  I've opened the log in the webpage and perhaps it can catch something of value...

No syslog attached, happens to all of us, but I don't have time at the moment anyway, back tomorrow...  a few quick comments...

 

Only a few errors

May 11 23:54:40 HunterNAS-6 kernel: acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20141107/psargs-359) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88040e864528), AE_NOT_FOUND (20141107/psparse-536) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ata1.00: failed to get NCQ Send/Recv Log Emask 0x1 (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20141107/psargs-359) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88040e864528), AE_NOT_FOUND (20141107/psparse-536) (Errors)
May 11 23:54:40 HunterNAS-6 kernel: ata1.00: failed to get NCQ Send/Recv Log Emask 0x1 (Errors)

Those are normal, ACPI errors are in almost all syslogs, usually harmless.

 

Only mention of IRQ 16

May 11 23:54:40 HunterNAS-6 kernel: ehci-pci 0000:00:1a.0: irq 16, io mem 0xf7f04000

That's the first USB controller.

 

Screenshot below, no syslog available. 

Sorry, I keep forgetting to mention one tip that expands the console - add vga=6 to the append lines in the syslinux.cfg; see the Upgrading to UnRAID v6 guide for more on that (search vga=6)...

 

Would there be SOME way to trap a log to catch what's going on in the crash?

At the console, login, then run tail -f --lines=120 /var/log/syslog.  Might show more...  gotta go...
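
Another option, if the box locks up before the console can be read, is to keep copying the log somewhere that survives a reboot. Below is a minimal Python sketch of that idea, not a tested unRAID recipe: it assumes Python is available (it is not part of a stock install, as far as I know), the function name is hypothetical, and it does not handle log rotation.

```python
import os

def copy_new_bytes(src_path, dst_path, offset=0):
    """Append whatever src_path has gained past `offset` to dst_path.

    Returns the new offset, so the caller can pass it back in next time.
    """
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Ask the OS to push the data to the device before a possible lock-up.
            os.fsync(dst.fileno())
    return offset + len(chunk)
```

A caller would loop, calling copy_new_bytes("/var/log/syslog", destination, offset) every second or so, where the destination is some path on persistent storage (which path to use is an assumption left to the reader).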

Link to comment

Hey RobJ,

 

Thanks for the tips. 

 

I've moved the keyboard and mouse to the front panel and booted.  Syslog attached.  I've opened the log in the webpage and perhaps it can catch something of value...

No syslog attached, happens to all of us, but I don't have time at the moment anyway, back tomorrow...  a few quick comments...

 

Yeah, old age and late at night.  Attached this time.

 

I'll give vga=6 a try...

 

Regarding the tail - that would certainly be a way, if I had a console.  Usually (but not always) locked up and messing with the LAN at that point.

syslog-2015-05-11.txt

Link to comment

Regarding the tail - that would certainly be a way, if I had a console.  Usually (but not always) locked up and messing with the LAN at that point.

 

Run tail via telnet. Additionally, there are 6 consoles available on the attached keyboard and screen.

Link to comment

@Robj - removed the ProFTPd plugin and rebooted.  System didn't crash immediately.  Ran about 18 hours, then crashed.  I had the syslog open, this time it caught something (attached).

 

Looking in the syslog, the parity sync completes at 4:53, then 4 minutes later there is a general protection fault.  It displays 32 GPF's and then locks up.

 

Here's a shot of the terminal:

http://my.jetscreenshot.com/12412/20150513-v3qn-192kb

 

Hopefully this caught something...

syslog-2015-05-13_0459.txt

Link to comment

Jeffery, please start a new thread. Post this syslog and one that shows the first 15 to 20 minutes.

 

It is very unlikely that these issues are related. But you should continue to follow each other's threads.

 

 

Link to comment

Looking in the syslog, the parity sync completes at 4:53, then 4 minutes later there is a general protection fault.  It displays 32 GPF's and then locks up.

 

Pretty soon, you won't be needing me, you are doing your own syslog analysis!  The only other thing I noticed is that CacheDirs was the item interrupted, and that it became corrupted.  There's nothing significant about it being CacheDirs, as it happens to be an easy target, always running in the background.  By definition, interrupts interrupt what's running!  If it was disabled, then something else would be interrupted.  Might as well disable it though, as I don't think it's actually doing anything for you, currently, and that could remove some of the stress on the disk I/O system.  What is more important is that when interrupted the first time, it was not 'tainted', then shortly afterward it *was* 'tainted'.  To the best of my knowledge, that means system corruption, and at that point you cannot trust the system at all, and need to shut it down quickly, if it doesn't mercifully crash.  You can try to get any available diagnostics, as you did, but even they aren't necessarily trustworthy.  More bugs and issues were showing up late in the GPF's, but you can't apply any significance to them, because the system itself was corrupted, and anything can happen then.
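
On the 'tainted' marker: as far as I understand the kernel documentation, the taint state is a bitmask (readable from /proc/sys/kernel/tainted), and each bit records a specific event; for instance the D flag is set once the kernel has hit an oops or BUG. Here is a small Python sketch for decoding a mask. The table reflects my reading of the kernel's tainted-kernels documentation for the first 16 bits, so verify it against your kernel version; the function name is hypothetical.

```python
# Bit position -> (flag letter, meaning). Assumption: matches the kernel's
# tainted-kernels documentation for bits 0-15; check your own kernel version.
TAINT_FLAGS = [
    ("P", "proprietary module loaded"),
    ("F", "module was force-loaded"),
    ("S", "kernel running on out-of-specification system"),
    ("R", "module was force-unloaded"),
    ("M", "machine check exception reported"),
    ("B", "bad page referenced"),
    ("U", "taint requested by userspace"),
    ("D", "kernel died recently (oops or BUG)"),
    ("A", "ACPI table overridden"),
    ("W", "kernel issued a warning"),
    ("C", "staging driver loaded"),
    ("I", "workaround for firmware bug applied"),
    ("O", "out-of-tree module loaded"),
    ("E", "unsigned module loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel live-patched"),
]

def decode_taint(mask):
    """Return the list of (letter, meaning) pairs whose bit is set in mask."""
    return [
        flag
        for bit, flag in enumerate(TAINT_FLAGS)
        if mask & (1 << bit)
    ]

# Example: a mask with only bit 7 set.
for letter, meaning in decode_taint(1 << 7):
    print(letter, "-", meaning)
```

Which flags were set in the trace discussed here is not something this sketch can tell you; it only translates a number into letters.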

 

The screen shot was similar to past ones, except the interrupt itself was clearly network related.  That may be significant, perhaps a bug in the NIC firmware or its driver.  Probably not what you wanted to hear...

Link to comment

@dgaschk - sure, I can do that.  The only reason I've been posting this here is that MuppetRules and I appear to have similar problems (Kernel Panics) with the same HBA (Supermicro AOC-SAS2LP-MV8).  Does it make sense to keep them together?  Let me know and I'm happy to start a new one...

 

@Robj - thanks, I am 'seeing' stuff there, but not a lot of helpful things.  Would be nice to have a function "Your problem is here"... ;)

 

I saw that "tainted" expression and probably thought it was important.

 

So the question is now what?  I need to get past this hardware stuff and move on...

 

I contacted Supermicro support asking them for diagnostic software/processes...

 

You can download the MegaRAID storage manager utility to monitor the RAID controller at this link: ftp://ftp.supermicro.com/driver/SAS/LSI/Tools/MegaRAID_Storage_Manager/Linux/2.35-01/

You can use this utility to view the status of the RAID controller and also generate the events log to see if there are any errors logged.

 

The readme.txt file seems to imply that it only works with SUSE and Red Hat. So I asked about that...

 

None of the utilities that we have has been certified with this particular operating system (unRAID/Slackware).  Usually our software is certified for more current operating systems, such as Red Hat and CentOS.  However, it might work with the operating system that you are using.  We do not have a manual for the utility.  Another utility, which is command prompt based, that you can use is the MegaCli utility, which you can download at this link: ftp://ftp.supermicro.com/driver/SAS/LSI/Tools/MegaCli/

 

Has anyone had experience doing diagnostics?

 

I have other motherboards in my 'pile', so I was considering building a 3rd system with a different board...but I hate 'shooting in the dark'...

Link to comment

So I've stripped the system down to basics.

> Fresh install on new USB (SanDisk Ultra Fit 32gb) running unRAID v6 RC 2.0

> Motherboard Asus P8Z68-V Pro

> Motherboard NIC activated

> Removed pro/1000 card

> Removed Supermicro AOC-SAS2LP-MV8

> Using 3 Different unused disks 1TB for Parity, 500GB and 1TB for data

> Not running unmenu yet.

 

System loads without issue (latest syslog attached).

(had some problems with the USB in USB 2.0 slots, but resolved when I put it in USB 3.0 slot (see http://lime-technology.com/forum/index.php?topic=40009.msg375704#msg375704))

 

Formatted data disks, Parity has completed with no errors.

 

I've not been able to run more than 24 hours without a Panic.  But that was with a lot more drives running.

 

So my questions to the collective wisdom. 

1. Would the additional load of 8 more 5, 4 and 2 TB drives potentially cause the IRQ issues (i.e. some marginal voltage...?)  I've got a 700W power supply, so I don't think this could be the issue - but I have seen power supplies do weird things.

2. The system does not have the MVSAS HBA.  I did not see any IRQ issues in the syslog.  So does this seem to imply that I have a bad AOC-SAS2LP-MV8 controller?

 

Other thoughts?

 

Continuing to let the system run just accessing webgui and telnet from time to time to wait for the Panic if it comes.  If after a couple days it does not crash, then I'll reintroduce the MVSAS controller and rebuild the array as it was...  as they say, time will tell...

 

On the good side of things, I've been able to build my 2nd system (Biostar based) without a single hiccup.  Guess you have to take the good with the bad!

syslog20150518.txt

Link to comment

Jeffery, please start a new thread. Post this syslog and one that shows the first 15 to 20 minutes.

 

It is very unlikely that these issues are related. But you should continue to follow each other's threads.

 

I posted the info as you asked, but not sure if I got the section about the "15 to 20 minutes" you wanted.  Let me know if I missed something.  Thanks in advance for your time to look at this.

 

As I mentioned, I've stripped the machine down and it's still doing similar things.  Hopefully this info will be enough...

Link to comment

May 18 20:57:04 HunterNAS kernel: awk[4298]: segfault at 7ffe7d2568f3 ip 00002aeae67caea7 sp 00007ffebca01f10 error 6 in ld-2.17.so[2aeae67c0000+23000]
May 18 20:57:04 HunterNAS kernel: sleep[4299]: segfault at 7ffd2d15e893 ip 00002b5cbe8c2ea7 sp 00007ffd6c909eb0 error 6 in ld-2.17.so[2b5cbe8b8000+23000]
May 18 20:57:04 HunterNAS kernel: awk[4300]: segfault at 7ffcab727613 ip 00002b09151d7ea7 sp 00007ffceaed2c30 error 6 in ld-2.17.so[2b09151cd000+23000]

 

These lines do not belong. Are you certain that this is a clean install of unRAID? Format the flash and reinstall unRAID. DO NOT install any additional software. The system must be completely stock. Run Memtest overnight.
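
A small worked example on the three segfault lines above, as a Python sketch (the function name is hypothetical). It pulls the fields out of each line and subtracts the library's load address from the faulting instruction pointer. For these three lines the offset into ld-2.17.so comes out the same each time, i.e. awk and sleep all fault at the same spot in the dynamic loader, which at least fits the idea that something shared (the install or the memory) is damaged rather than each program having its own bug. My understanding of the x86 page-fault error code is that 6 means a user-mode write to a page that was not present; treat that reading as an assumption.

```python
import re

# The three lines quoted above, copied from the post.
SAMPLE_LINES = [
    "May 18 20:57:04 HunterNAS kernel: awk[4298]: segfault at 7ffe7d2568f3 ip 00002aeae67caea7 sp 00007ffebca01f10 error 6 in ld-2.17.so[2aeae67c0000+23000]",
    "May 18 20:57:04 HunterNAS kernel: sleep[4299]: segfault at 7ffd2d15e893 ip 00002b5cbe8c2ea7 sp 00007ffd6c909eb0 error 6 in ld-2.17.so[2b5cbe8b8000+23000]",
    "May 18 20:57:04 HunterNAS kernel: awk[4300]: segfault at 7ffcab727613 ip 00002b09151d7ea7 sp 00007ffceaed2c30 error 6 in ld-2.17.so[2b09151cd000+23000]",
]

SEGFAULT = re.compile(
    r"(\S+)\[(\d+)\]: segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) "
    r"error (\d+) in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]"
)

def parse_segfault(line):
    """Return a dict of the fields in a kernel segfault line, or None if no match."""
    m = SEGFAULT.search(line)
    if not m:
        return None
    prog, pid, addr, ip, sp, error, module, base, size = m.groups()
    return {
        "prog": prog,
        "pid": int(pid),
        "fault_addr": int(addr, 16),
        "ip": int(ip, 16),
        "error": int(error),
        "module": module,
        # Offset of the faulting instruction inside the mapped module.
        "offset": int(ip, 16) - int(base, 16),
    }

for line in SAMPLE_LINES:
    info = parse_segfault(line)
    print(info["prog"], info["pid"], info["module"], hex(info["offset"]))
```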

Link to comment

MuppetRules,

Your thread has been thoroughly hijacked.

 

@dgaschk - sure, I can do that.  The only reason I've been posting this here is that MuppetRules and I appear to have similar problems (Kernel Panics) with the same HBA (Supermicro AOC-SAS2LP-MV8).  Does it make sense to keep them together?  Let me know and I'm happy to start a new one...

 

These are not the same issue. If we grouped all the threads containing "Supermicro AOC-SAS2LP-MV8 and kernel panics" together, about 10% of the threads on the forum would disappear.

 

Link to comment

@dgaschk - I created a new post as you suggested earlier and sent you a PM as well.

 

And to MuppetRules and to the community, I apologize for the faux pas.  Didn't know each thread was only meant to be for the originating creator.  Won't make that mistake again...

Well, that's certainly a rule that is not always followed. Wouldn't want this rule taken out of context either. I have seen a lot of new threads posted about some plugin where the poster had not only not posted in the plugin support thread, they hadn't even bothered to look through it. So it was a thread dedicated to that OP alright, but even less appropriate than this case.
Link to comment

Archived

This topic is now archived and is closed to further replies.
