
Kernel Panic Galore!


MuppetRules

Recommended Posts

My unRAID server (Intel Xeon X3470 / SuperMicro X8SIL-F-O / 16GB 1333MHz DDR3 ECC / SuperMicro AOC-SASLP-MV8) running 6b15 Pro has crashed twice now. First, while I was transferring files over the network. Second, while trying to rebuild the array after the crash. Both times, I've received the following error:

 

Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
Shutting down cpus with NMI
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

 

I've tested this server extensively: Memtest ran for 3+ days, the drives were precleared 3 times, and the CPU was tested thoroughly for stability.

Link to comment

The motherboard is already flashed with the latest BIOS (1.2a), and the SAS card is flashed with the latest firmware as well (RAID is disabled). I'm using Arctic Silver 5 as the thermal compound and the Intel stock cooler at 100% speed. I have two 120mm fans in the front to cool the drives and a 120mm fan on the back pulling out the hot air. The 120mm fans are running on 'High' (85%-100%) speed. Temperatures hover around 24°C to 35°C.

 

I do keep getting the "No sensors found!" error. I've run the sensors-detect command and got the following:

 

#---cut here---

#Chip drivers

modprobe coretemp

modprobe jc42

modprobe w83627ehf

/usr/bin/sensors -s

#---cut here---

 

I used vi and added the text above to /etc/rc.d/rc.local, however the changes don't stick after a reboot. Not sure if the two issues are connected?

Link to comment

I used vi and added the text above to /etc/rc.d/rc.local, however the changes don't stick after a reboot. Not sure if the two issues are connected?

unRAID runs from RAM, so it is not surprising that the information is lost on a reboot.  If you want the changes to survive a reboot then you need to make entries in the 'go' file (/boot/config/go) to reapply them on each boot.
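For example, a minimal sketch of the kind of entries that could be appended to the 'go' file, reusing the module names from the sensors-detect output above (adjust to whatever your own sensors-detect reports):

# load the hardware-monitoring drivers found by sensors-detect
modprobe coretemp
modprobe jc42
modprobe w83627ehf
# apply the 'set' statements from the sensors configuration
/usr/bin/sensors -s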
Link to comment

Just had another kernel panic in the middle of a parity check. This time around I monitored the temperatures closely and the system remained under 35°C. The CPU temperature touched 40°C, but that's well within the limits. The odd thing is I can leave the system running for days and it doesn't crash. I'm guessing it's either the SAS card or the SAS cable. The hard drives went through 3 rounds of preclearing, so that can't be it.

 

For what it's worth, I added the modprobe lines to the 'go' file and placed a sensors.conf file with appropriate labels in the dynamix folder (/boot/config/plugins/dynamix). I doubt temperatures or sensors have anything to do with this issue.
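For reference, a minimal sensors.conf along those lines might look like the sketch below. The chip names come from the sensors-detect output earlier in the thread; the wildcard matching and the label text are just assumptions to adapt to your board:

# /boot/config/plugins/dynamix/sensors.conf (hypothetical labels)
chip "coretemp-*"
    label temp1 "CPU Temp"
chip "w83627ehf-*"
    label temp1 "MB Temp"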

Link to comment

The sensor thing was a separate minor issue, unrelated, which you have now resolved.  Machine check events can be very hard to solve, if they can be solved at all.  You do appear to have ruled out the CPU, memory, drives, and heat.  The SAS subsystem is a possible suspect, but hard to test, unless you are in a position to remove or replace it and test without it.  And make sure there's no overclocking, and that all BIOS settings are set to conservative choices.

 

If you are comfortable at the command line, and have a CD drive installed, you might try a StressLinux live CD (or a bootable USB flash drive with it).

Link to comment

I'm having similar crashes and I have the same SAS card.  Could we be looking at a driver issue with this SAS card?

 

When my server panics, it's completely locked.  I've captured several screenshots and attached them FYI.  Also included is a video of a boot that just happened to crash almost instantly.  For whatever reason, it now will run a day or two before crashing.

 

https://www.dropbox.com/s/b2sk6giaxtpwkcc/20150410_104554.mp4?dl=0

 

Here are a bunch of screenshots I took from some of the panics and crashes.  I'm sorry, I didn't keep a log, but these should all be very similar.  They occurred at some point, minutes or hours after boot.  Some runs even lasted a couple of days, long enough to do a preclear on a 4TB drive.

 

https://www.dropbox.com/s/tm16lze8ruas15s/2015-04-09%2022.29.05.jpg?dl=0

 

https://www.dropbox.com/s/a8cajsi2pinbmzs/2015-04-16%2013.49.22.jpg?dl=0

 

https://www.dropbox.com/s/1mquv84p7zx091l/2015-04-16%2018.00.50.jpg?dl=0

 

https://www.dropbox.com/s/dur79khsqyulmrj/2015-04-17%2008.41.39.jpg?dl=0

 

https://www.dropbox.com/s/7x8eygghjf2hj8b/2015-05-02%2020.00.22.jpg?dl=0

 

I've been operating under the impression that this is a NIC issue, because of the network flooding when the server crashes (no devices can access the LAN).

 

http://lime-technology.com/forum/index.php?topic=39215.msg371937#msg371937

 

Would love to find a cause for these panics...

 

Link to comment

I connected the drives directly to the motherboard, ran a parity check and for the first time the system did not crash.

 

I'm having similar crashes and I have the same SAS card.  Could we be looking at a driver issue with this SAS card?

[...]

Would love to find a cause for these panics...

 

Right now I'm transferring close to 2TB of data over the network, so I guess the NIC theory will be tested. Though I'm fairly certain it's the SAS card. Another similarity I noticed between my setup and yours is that we are both using Hitachi drives; mine are 4TB HGST Deskstar NAS. Again, my bet is on the SAS card, as I tested these drives thoroughly and found no issues whatsoever.

Link to comment

Update:

Transferred close to 2TB of data while the drives were connected directly to the motherboard and no kernel panics whatsoever.

 

Ok, I've been coming to that realization as well.  I'm building a second server right now and will be launching that/copying many TB of files.  If that works flawlessly, then I'm going to suspect the SAS adapter.

 

So now the question is, assuming it is the adapter, what do we do about it?

 

I really don't know.

Link to comment

Jeff, if you get a chance, can you verify if your controller settings are similar to mine?

 

BIOS version: 3.1.0.21

Raid Mode: JBOD

 

INT 13h: [Disable]
Silent Mode: [Disable]
Halt On Error: [Disable]
Staggered Spin-Up:
Number of devices in group [8]
Group Spin Up Delays [0]
HDD Detect Time(s): [10] *

 

* Default is 16

 

 

Possible options, now that I've had some time to think about it:

 

1. Hope this issue is resolved under unRAID v6 as MV8 works fine under unRAID v5

2. Use unRAID v5? Though I'd like to avoid ReiserFS at this point. Not to mention, KVM and Docker were what sold me on unRAID v6

3. Worst case: I'll move away from unRAID since I don't have the cash to buy a new controller right now. SnapRAID under Windows is one option and I have a couple of free Windows Server 2012 R2 & Windows 7 licenses from DreamSpark.

Link to comment

1. Hope this issue is resolved under unRAID v6 as MV8 works fine under unRAID v5

Both my brother and I use the SASLP-MV8 controller and it is working fine under v6 for us.  Assuming that you do not have a faulty controller, there is some other factor at play rather than just the presence of the SASLP-MV8.
Link to comment

@MuppetRules.  First off, what does the INT13 option do if enabled?

 

Adapter (AOC-SAS2LP-MV8) is at Firmware 1812

Original config: http://my.jetscreenshot.com/12412/20150510-ppba-103kb

 

My original configuration was 'stock'.  Int13 was enabled - http://my.jetscreenshot.com/12412/20150510-2yir-107kb

 

Note that the IRQ Number is 0B (12) - In my syslog (attached), I found an entry (May  9 21:54:07 HunterNAS-6 kernel: serio: i8042 AUX port at 0x60,0x64 irq 12 (System)) which seems to indicate that another device is sitting on IRQ 12?  Confusing.  And no mention in the syslog about mvsas on IRQ 16.  However, I did a cat /proc/interrupts and saw this... http://my.jetscreenshot.com/12412/20150510-4j8w-83kb

Red arrows are IRQ 12, yet the MVSAS shows up on IRQ 16 (blue arrows) - which is shared by USB1?  Also confusing.  Someone else mentioned this at some point and said I was lucky that I didn't have a conflict.
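For what it's worth, a quick filter of /proc/interrupts makes it easier to see which lines mvsas and the USB controllers actually ended up on (assuming the driver registers under the name mvsas, as the syslog suggests):

grep -E 'mvsas|usb' /proc/interrupts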

 

Looking at the BIOS for the card (Ctrl-M while the card boots) I see that my HDD detect time was at 6, not 16 as you mentioned (typo?), and I can only go up to 10 - http://my.jetscreenshot.com/12412/20150510-0dtw-74kb

 

After a reboot I checked the SAS2LP BIOS again, still at IRQ 12.  Again, confusing.

http://my.jetscreenshot.com/12412/20150510-rdjp-108kb

Link to comment

0B is actually 11, not 12.  And the HDD detect time is in seconds, not an IRQ, and I don't think it is terribly important.  The USB controllers often share IRQs with other services.  The INT 13 service is an old one from DOS days, used for disk I/O, but it is usually quickly replaced with much more powerful disk I/O functions.  Nowadays, I believe it's only used at boot time.  If disabled, the drives are not bootable.  Since you aren't booting from them, you don't care, so you might as well have it disabled.
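A quick sanity check of that conversion from the console, for anyone following along:

printf '%d\n' 0x0B    # prints 11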

Link to comment

Thanks for the correction, it's been a while since I worked with hexadecimal!

 

Regardless of my hexadecimal skills, I continue to have these panics.

 

I changed the IRQ setting in the adapter as discussed previously.  The system ran for about 26 hours, finished the parity check, then crashed again...

http://my.jetscreenshot.com/12412/20150510-ccnc-177kb

 

Just before the crash, the system displayed a message stating that IRQ 16 had been disabled.

 

I'm also seeing a lot of SMTP errors - not sure if they are related?  May 10 07:47:02 HunterNAS-6 sSMTP[19442]: 501 Syntax error in parameters or arguments (Errors)

 

Syslog attached.  Any insight?  Wondering if I should start over?  Find a different HBA adapter?  Frustrated...

syslog-2015-05-10.txt

Link to comment

Various comments -

 

* The panic pic shows a probable bug related to an interrupt routine, but can't tell much about it.  A possible clue, the instruction pointer is in a function called ip_rcv, which may or may not be network related (and may or may not be related to the interrupt call).

 

* The syslog was not helpful; it does not have any related errors or clues.  The parity check started at 3:33pm, and the last message posted to the syslog was at 4:21pm, with no further activity, so I can't even tell when you captured it.

 

* Email is having a problem, with the SSL configuration I think.  I don't know enough to help you with that.  It's a minor issue, unrelated to the system troubles, just means no emails sent.

 

May 10 15:33:14 HunterNAS-6 kernel: capability: warning: `proftpd' uses 32-bit capabilities (legacy support in use)

* I noticed the above message, concerning the ProFTP plugin.  It's a 64 bit plugin, but apparently has 32 bit functions in it.  I'm not qualified to say much about that, but it's my understanding that 32 bit compatibility is unreliable.  Perhaps others with more expertise in either 32 bit compatibility or the ProFTP plugin can comment.  I would personally want a plugin that is all 64 bit.

 

* Just a comment, you aren't using 5 of your fastest and most trouble-free SATA ports, the ones on the motherboard.  Perhaps moving some of the drives on the SAS card to the motherboard would make the SAS card happier, lower the stress on it.  Sad if that helped though!

 

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: ---------- caching directories ---------------

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: Disk1(BTRFS)

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: Disk4(XFS)

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: NDH

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: appdata

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: documents

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: movies

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: music

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: ndh

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: pictures

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: tv

May 10 15:33:25 HunterNAS-6 cache_dirs.sh: ----------------------------------------------

* This is completely unrelated, but appeared noteworthy.  The first 2 items look like mistakes?  If not, sorry.  Then there are both an NDH and an ndh, which is going to be a big source of confusion.  In Linux, those are separate names, therefore separate shares; Linux is case-sensitive.  Windows is not, and I don't know how it's going to handle this!  You may want to merge them, renaming the one that's wrong, in both the Shares configuration and the top level folder names on your drives.

 

I recommend caching only what you know needs caching, not everything.  That is, only what is being constantly polled by external applications.

 

Frustrated...

* I don't doubt it, I'd be beyond that by now.  I'm really sorry we haven't been able to help you solve this thing.

Link to comment

Hey RobJ - appreciate your help.  At the end of the day, I REALLY want this to work.  I've got challenges because I'm not a Linux guru.  Windows yes, web yes, heck DOS yes (shows my age), but Linux no...  Linux is not a complete black box, because I'm not afraid of command lines and config files.  But not having years of experience, I just don't know how to fix some things... ugh...  I have learned a lot in the last couple of weeks, but I'm not making progress because I can't get past basic startup.  It has been a lot of fun... to a point.  But play time is over and I need to get a couple of servers up.

 

=====

Regarding the NDH share.  I have no idea how that got created.  I created the ndh share, but not the NDH share.  They point to the exact same locations.

Shares: http://my.jetscreenshot.com/12412/20150511-dzwy-51kb

ndh details: http://my.jetscreenshot.com/12412/20150511-w0cp-40kb

NDH details: http://my.jetscreenshot.com/12412/20150511-136z-40kb

 

=====

Regarding Cache - I'm confused on this point.  I thought I wanted cache for all operations to ensure speedy transfer of files (i.e. file syncs, client backups, etc) so I wouldn't be bogged down by slow transfers.  Sounds like Cache has some specific use cases?

 

=====

One interesting point.  My second server, a BioStar TA880GU3+ (6 SATA III ports) using the mobo ethernet and a Syba dual-port SSD caddy with 2 additional SATA II ports, has been running flawlessly for a couple of days with 9 drives attached, waiting to get the registration ID from Tom so I can set it up.  Can't really move forward with that.  However, the fact that it is running without issues is promising.  Makes me feel like we have a number of incompatibilities to deal with.  A challenge, but we should be able to discover those with a good diagnostic approach.

 

=====

So all that said - I'm happy (masochistic?) to start over and build this in layers to get it stable and determine what the problem is and hopefully learn a lot.  Would this be a good approach?  Or should I just give up and go WHS?  :'(

 

0. I think we need to address the IRQ 16 issue.  Not sure how to do that.  Evidently it's a conflict between USB and MVSAS.  But since the MVSAS is at IRQ 11 (see earlier posts of the MVSAS BIOS), how could it be on the same IRQ as the USB?  Perhaps this is the main issue?  Not sure.  I've disabled everything I can in the BIOS, but I have not contacted ASUS to see if there is another way.  If I can, just one USB would be enough.

=====

Assuming step 0 is unaddressable...

1. Start with a fresh USB build

2. Remove the SAS hardware and ONLY use the SATA ports on the mobo.  Remove the Intel Pro/1000 and only use the mobo ethernet.  I have 2 Marvell 6Gb/s, 2 Intel 6Gb/s and 4 Intel 3Gb/s ports.  A total of 8 drives.  That should be a good test.  If it panics with the mobo ethernet, replace it with the Pro/1000 and try again.  If it panics with the Pro/1000, the problem has to be with the ethernet driver (agreed?).

3. Setup the following:

  > Parity - 5TB Seagate ST5000DM000-1FK178

  > No Cache (yet)

  > Disk1 4TB - Seagate ST4000DM000-1F2168 on Marvell Port 1

  > Disk2 2TB - Hitachi_HDS5C3020ALA632_ML0220 - on Marvell Port 2

  > Disk3 2TB - Hitachi_HDS5C3020ALA632_ML0220 - Intel SATA III Port 1

  > Disk4 2TB - Hitachi_HDS5C3020ALA632_ML0220 - Intel SATA III Port 2

  > Disk5 2TB - Hitachi_HDS5C3020ALA632_ML0220 - Intel SATA II Port 1

  > Disk6 2TB - Hitachi_HDS5C3020ALA632_ML0220 - Intel SATA II Port 2

  > Disk7 2TB - Hitachi_HDS5C3020ALA632_ML0220 - Intel SATA II Port 3

  > Disk8 2TB - Seagate ST2000DL003-9VT166 - Intel SATA II Port 4

4. Set up the basic system (format drives, set shares) but don't install any plugins or apps yet.  Format all the drives except parity with XFS (getting away from the BTRFS discussion we had).

5. Setup shares (no cache)

  > Movies

  > Documents

  > NDH

  > Pictures

  > Client Backups

6. Copy files - I'll start copying files to the Pictures share.  It will be slow, but good for a test, and it's only 150GB or so.  Run a few CrystalDiskMark tests (comparing SATA II vs SATA III ports would be interesting)

=====

If I make it this far without a panic, then we should have some confidence in the system.

=====

7. Introduce the AOC-SAS2LP-MV8 with the existing drives.  No load or activity, but run a few CrystalDiskMark tests on each drive.  Let it run for a day or so to see if it panics.  I assume we would not see a parity rebuild at this point since physical location shouldn't matter.

=====

If we make it this far (and I would be surprised at this point if we haven't seen a panic)

=====

8. Set up Cache - so at this point, I could set up the SSD cache drive (or should I go BTRFS and a RAID 1 2TB cache? or ?)

9. Copy files - copy all the files

 

If we make it this far, then we should be golden...if we get here.

 

Thoughts?

 

Link to comment

...Regarding the NDH share.  I have no idea how that got created.  I created the ndh share, but not the NDH share.  They point to the exact same locations...

As you may know, Linux is case-sensitive regarding file and folder names. So NDH and ndh are not the exact same locations. All your screenshots show is that the NDH share and the ndh share are configured the same.

 

NDH will be at /mnt/user/NDH

ndh will be at /mnt/user/ndh

 

Not at all the same location.

 

And /mnt/user is just the aggregate of the top level folders on cache and the array disks.

 

Look at cache and each of the array disks. Some of them will have a top level folder named NDH and some of them will have a top level folder named ndh. Some may have both.
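A minimal sketch of how to check that from the console, assuming the standard unRAID mount points (/mnt/cache and /mnt/diskN):

ls -d /mnt/cache/NDH /mnt/cache/ndh /mnt/disk*/NDH /mnt/disk*/ndh 2>/dev/null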

 

Normally I would say this was caused by a misconfigured application, where some application was configured to write to NDH instead of ndh. However, this would not cause these different shares to have identical configuration. Any share that you haven't explicitly configured would have default configuration, and your screenshots were definitely not defaults.

 

Any share that you configure will have a .cfg file in config/shares on the flash drive. From the unRAID command line, these are in /boot/config/shares. From the network, these would be in /flash/config/shares. Did you somehow copy or edit ndh.cfg to NDH.cfg?
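From the unRAID console, something like this would show whether both .cfg files exist and when each was last modified (path as described above):

ls -l /boot/config/shares/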

Link to comment
I recommend caching only what you know needs caching, not everything.  That is, only what is being constantly polled by external applications.
Regarding Cache - I'm confused on this point.  I thought I wanted cache for all operations to ensure speedy transfer of files (i.e. file syncs, client backups, etc) so I wouldn't be bogged down by slow transfers.  Sounds like Cache has some specific use cases?
I sense some confusion here. The cache_dirs command != cache drive. They are totally separate, and only interact coincidentally.

 

Cache_dirs is a script that continually reads the file and folder list of all the array contents that it is pointed at, with the intention of keeping that list in memory at all times, so as to dramatically speed up file name listing. If you configure it to keep too many items cached, it defeats the purpose as each pass runs out of RAM and keeps the disks active as it rereads the whole list. Ideally, it keeps the disks spun down until the actual file content is called for by some app, as the full cached list stays in RAM so the continual reads of the list never spins up the drives. It's a balancing act of figuring out which shares need to be cached in RAM vs how much will fit. If cache_dirs is keeping your disks spun up, you need to reduce the number of locations it is polling.
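Conceptually, what cache_dirs does boils down to something like the toy loop below (a sketch only, not the actual script; the share names are examples taken from the log earlier in the thread). Each pass walks the directory tree so the entries stay in the kernel's dentry cache and listings can be served from RAM:

while true; do
    find /mnt/user/movies /mnt/user/music -noleaf >/dev/null 2>&1
    sleep 10
done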

 

The cache drive is a location that does not participate in the real time parity calculation when it is written to, so theoretically writes can happen faster.  It participates in the user share structure seamlessly, so any top level folders on the cache drive will show up as user shares, and the contents will be fused with any identically named top level folders on any other array drives. User shares can be configured to automatically put any new files on the cache drive instead of the array drive, or configured to ignore the array disks for new writes. The mover script will move the entire contents of the cache drive onto the array, unless it is told to ignore that user share.

 

Did I clear anything up? Or did I confuse things further?

Link to comment
...Regarding Cache - I'm confused on this point.  I thought I wanted cache for all operations to ensure speedy transfer of files (i.e. file syncs, client backups, etc) so I wouldn't be bogged down by slow transfers...
I don't actually cache anything; i.e., I don't use my cache drive to cache writes.

 

Most things that are written to my unRAID are the result of automatic processes, so I seldom get bogged down since I am not involved. Client backups happen at night automatically, most other things originate on the internet with my unRAID doing the downloading, automatically, so nothing written by me there either.

 

The writes may be slower, but the speed doesn't matter for my use cases, and the writes are done directly to the parity-protected array instead of waiting on the mover to move them there. Your use cases may be different, of course.

Link to comment

@jonathanm - thanks for the clarification - you confirmed my understanding. 

Regarding cache_dirs - I've got 16GB of memory, so I had hoped that cache_dirs would have plenty of room.  Perhaps not?  How should one size memory?

Regarding the cache drive - The discussion I had regarding cache was only focused on the cache drive (apologies, I should have mentioned 'drive').  Conceptually I was banking on the cache drive making file transfers speedy.  However, even in my own testing direct to the drives, I was seeing 100MB/s transfers.  So perhaps cache is not needed?

 

@Trurl - yes, I'm aware of Linux caring about case.  Initially I created the ndh share and started moving files when I realized it was using all the drives.  So I deleted the files/directory, then the share, and recreated it with disk-specific limitations.  So perhaps in the back and forth with that, I accidentally created the all-caps one.  I normally don't, but that doesn't mean I didn't!

 

So these are good discussion points and I appreciate the input, but has anyone got some input on the plan to figure out this "kernel: disabling IRQ 16" followed by a crash/panic issue?

 

 

Link to comment

...  However, even in my own testing direct to the drives, I was seeing 100MB/s transfers.  So perhaps cache is not needed?

Writing with parity-protection typically gives 30-40MB/s. This is because writing with parity requires reading the data to be overwritten, reading parity, calculating the parity change, writing the data, and writing parity. So, 2 reads and 2 writes rather than a single write. If you were seeing 100MB/s for writing with parity, you were likely only seeing the speed of memory buffering rather than the actual disk writing. Linux will use all free memory for buffering, so you might have to do some fairly large writes to see real disk writing speeds.
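As a toy illustration of that read-modify-write update (assuming simple XOR parity, which is what a single parity drive uses), the new parity is the old parity with the old data removed and the new data folded in:

old_data=0xA5; new_data=0x3C; old_parity=0x5A    # example byte values only
new_parity=$(( old_parity ^ old_data ^ new_data ))
printf 'new parity byte: 0x%02X\n' "$new_parity"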
Link to comment

@Trurl - Is there a way to determine how much memory is committed to memory buffering?  In moving large amounts of files (> 1TB) I've seen two behaviors.  In one case, transfers slowed to 20-30MB/s.  In others, I've seen it maintain > 100MB/s for the entire transfer.  I've not been able to determine why it's different...

Link to comment

Archived

This topic is now archived and is closed to further replies.
