Kernel panic nightly - how to diagnose?


oryhara

Updated:

Second crash in as many days. 

Exact text of kernel panic:

Kernel panic - not syncing: Fatal exception in interrupt

Kernel Offset: 0x0 from 0xfffffffff81000000 (relocation range: 0xfffffffff80000000-0xffffffffff9fffffff)

---[ end Kernel panic - not syncing: Fatal exception in interrupt

 

transcribed from a picture on my cell phone, so I might have the wrong number of f's and 0's. 

 

Woke up this morning to a stopped array with an unclean shutdown detected.  I had left a terminal open tailing the syslog, but all it showed was disks spinning down around 3 AM.  How can I find out what caused my raid to shut down uncleanly?  It's connected to a UPS, so if the power had gone out it should have shut down cleanly. 

 

It had been crashing nightly like this before the upgrade to 6.0, and I had hoped the upgrade would fix it. 

Link to comment

To me this looks like an unexpected self-reset caused by failing or flaky hardware...  ???

 

I would start with a thorough (read: long...) Memtest run and a double-check of the fans and the CPU temperatures under load...

 

Bad BIOS settings (e.g. overclocking, etc...) could be a cause of this too... (revert any changes, if there are any)

Link to comment

I did a long memtest before the 6.0 upgrade. 

There is no overclocking on my system.  Only change to stock BIOS was to boot from USB stick instead of a hard drive.

 

It happened sometime between 0314 and 0600.  Could the mover script run at 0340 have caused this? 

Link to comment

Mover doesn't do anything unusual. Just moves files from cache to other disk(s). Sounds like a hardware issue. What is the exact model of your power supply?
Link to comment

Woke up this morning to another crash, this time with a kernel panic showing on the monitor attached to the raid. 

Kernel panic - not syncing: Fatal exception in interrupt. 

 

 

Link to comment

I suspect you may have a SAS card or other disk controller installed?  Try making sure that you have the latest firmware for each disk controller, and that your BIOS is up to date.

 

You may also want to pin down exactly when these panics occur, to see if they correlate with the mover or anything else.
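As a rough illustration of that correlation, here is a minimal Python sketch. It takes the window between the last syslog entry and the time the crash was noticed, and lists which scheduled jobs start inside it. The times for the mover come from the posts above; the function name `jobs_in_window` and the `nightly_backup` entry are hypothetical and only there to show a job that falls outside the window.

```python
from datetime import datetime, timedelta

def jobs_in_window(last_log, discovered, jobs):
    """Return the names of jobs whose start time falls inside the crash window."""
    fmt = "%H:%M"
    start = datetime.strptime(last_log, fmt)
    end = datetime.strptime(discovered, fmt)
    if end < start:  # the window crosses midnight
        end += timedelta(days=1)
    hits = []
    for name, when in jobs.items():
        t = datetime.strptime(when, fmt)
        if t < start:  # treat earlier clock times as belonging to the next day
            t += timedelta(days=1)
        if start <= t <= end:
            hits.append(name)
    return sorted(hits)

# "mover" at 03:40 is from this thread; "nightly_backup" is a made-up example job.
schedule = {"mover": "03:40", "nightly_backup": "01:00"}
print(jobs_in_window("03:14", "06:00", schedule))
```

A job showing up in the window only makes it a suspect, not the cause. Repeating this over several nights is what would make the pattern meaningful.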

 

If nothing else works, then you may want to swap in another disk controller if that's doable, just to see if it's a specific card at fault.  With an extra controller, you can move drives around, and see if the behavior changes with any combination.

 

You say you have pictures, are there common code addresses listed between the different panics, and what are they?  You can attach pictures here if you want.
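Once two or more panics have been transcribed, comparing them can be as simple as intersecting the function names from each call trace. The Python sketch below assumes each trace has been typed in by hand as a list of names; `common_symbols` is a hypothetical helper and the names in the example are placeholders, not taken from any photo in this thread.

```python
def common_symbols(traces):
    """Return the function names that appear in every transcribed call trace."""
    if not traces:
        return set()
    shared = set(traces[0])
    for trace in traces[1:]:
        shared &= set(trace)
    return shared

# Placeholder names for illustration only, NOT from the attached pictures.
panic_1 = ["func_a", "func_b", "func_c"]
panic_2 = ["func_b", "func_c", "func_d"]
print(sorted(common_symbols([panic_1, panic_2])))
```

Names that survive the intersection across several panics are the ones worth looking up first.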

Link to comment

Here is a picture of the kernel panic.

 

I have a PCI port multiplier SATA card, and 3 internal bridge boards giving me 5 SATA ports each.  My raid isn't full yet, but I designed it for up to 20 drives: 4 port multipliers with 5 drives each, in 4x 5-in-3 Icy Dock sleds in a case with 12x 5 1/4" bays. 

 

Would it be a good idea to trade all of that for the X10SL7 motherboard and use SAS-to-SATA cables to give me the ability to mount 24 SATA drives?  And stick it all in a NORCO RPC-4224?

IMG_1543.JPG.1ffea3f47f32c4c91d4f57ad69c792f0.JPG

Link to comment

You can never draw any conclusions or even suspicions from a single data point, so all I can say is that this particular panic occurred during write-related I/O to a ReiserFS formatted disk, specifically during file management, possibly a file deletion.  Now what we need is more data points, more panic pics, even if mental pics.  Would you say that the attached picture is (1) exactly the same as all other panics, (2) very similar to the others (all Reiser disk write I/O), (3) similar (all have disk I/O), or (4) not at all alike?  And roughly how many pics do you have, mental or camera?

 

I assume you will be doing further testing without that Cache drive, to prove it only happens when it's connected?

 

It's always nice to talk about new hardware, but in a way it's not relevant, if you have no idea yet what the real issue is, which component is faulty.

Link to comment

Mental pictures would say they are all alike, at least insofar as Kernel panic - not syncing.  I'll take another phone pic if it happens again tonight. 

 

I left a terminal open tailing the syslog, and its last entry was at 3:13 AM, which led me to suspect the mover script and/or the cache drive, but invoking the mover manually did not cause a problem. 

 

I need to move the plex config off of the cache drive (it is installed to a folder named .plex) to another disk before I disable the cache drive, since its docker won't let me use a share for config.  At least I think that is correct; I confess I did not fully understand the instructions for setting that up with docker.  But it does work, and this crash was happening in version 5, so I think it is not related to my plex install, which I also had in version 5. 

 

I might also try moving the cache drive from a bridgeboard to the main motherboard's SATA port, but only if disabling the cache drive fixes this kernel panic. 

 

Link to comment

Update:

Monday morning I saw this:

 

I also left a tail on the syslog, and the last entry was dated 01:53 AM.  I believe plex starts its routine maintenance at 2 AM, so perhaps it is the culprit here.  But this system worked for years with plex and the three other apps (sabnzbd, sickbeard, couchpotato), so something else must have changed for it to be failing now.   

IMG_1553.JPG.48a47ab149f7c9b70f54042f1e004760.JPG

Link to comment

I don't have a spare lying around; so I'll have to buy one. 

Do you think I need a larger one?

How much power would I need for 24 3.5" drives?

 

I have used this website in the past to get an idea. Keep in mind that it gives you a minimum guesstimate; I would err on the side of caution and get a bigger PSU than it tells you.

 

http://www.extreme.outervision.com/psucalculatorlite.jsp

Link to comment

According to that calculator, I need 671 watts for 20 drives.  I've only got 16 in there now, and a 950W supply. 
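For what it's worth, the headroom works out as below. This is just the arithmetic on the two figures above, in Python; `psu_headroom` is a hypothetical helper, and the 671 W figure is the calculator's estimate for 20 drives, not a measured load.

```python
def psu_headroom(supply_watts, estimated_load_watts):
    """Return (spare watts, estimated load as a fraction of the PSU rating)."""
    spare = supply_watts - estimated_load_watts
    return spare, estimated_load_watts / supply_watts

# 950 W supply and 671 W estimate are the figures quoted in this post.
spare, load_fraction = psu_headroom(950, 671)
print(f"{spare} W spare, load is {load_fraction:.0%} of the rating")
# prints "279 W spare, load is 71% of the rating"
```

That says nothing about how the load is split across rails or how the supply behaves as it ages, so it is a sanity check rather than a verdict.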

The system is still on after these kernel panics, which leads me to doubt that the PSU is causing any problem. 

But it's still under warranty, so I'll contact the manufacturer. 

Link to comment

Running reiserfsck on the disks. Disks 1-3 are fine. 

Disk 4:

 

Fatal corruptions were found, Semantic pass skipped

6 found corruptions can be fixed only when running with --rebuild-tree

###########

reiserfsck finished at Mon Jun 22 22:36:54 2015

###########

 

I suppose I should run this with the --rebuild-tree option, then?

Link to comment

Archived

This topic is now archived and is closed to further replies.
