Kernel Panic - How to Diagnose

Kees Fluitman · February 26

Hi Guys,

I've been getting some regular crashes lately. Mostly when my server seems a bit loaded with applications.

The specific error is Kernel Panic. But the thing is, since logs are going to the usb-drive, I have no way to see what happened or what caused it.
Any tips on how to better diagnose it?
I've attempted to go through the logs but to no avail. Should I log to some external source, or would the diagnostics help here? I'll post them.

I added diagnostics from a couple of weeks ago, when it also happened. Maybe it will help to give you a glance at my system.

- memtest passed.
- I've set syslog to a separate share. So let's see once a crash happens.

server-diagnostics-20240115-1739.zip

Edited February 27 by Kees Fluitman

JorgeB · February 26

4 hours ago, Kees Fluitman said:

- I've set syslog to a separate share. So let's see once a crash happens

Post that after the next crash.

Kees Fluitman · February 27

21 hours ago, JorgeB said:

Post that after the next crash.

This is all I've got for now. It seems it crashed in the middle of the night (after I logged out late).
I don't know if the error is related to the crash, or why there are so few entries. I'll connect a monitor to see if I get something on screen the next time it crashes, so I can compare it to my syslog.
The device in the last crash is

Quote

02:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)

But that error comes back frequently.

syslog-127.0.0.1.log

JorgeB · February 27

53 minutes ago, Kees Fluitman said:

But that error comes back frequently.

That should not be the cause of the problem. You can try again to see if the kernel panic gets logged. Failing that, one thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Kees Fluitman · February 27

3 hours ago, JorgeB said:

That should not be the cause of the problem. You can try again to see if the kernel panic gets logged. Failing that, one thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Crashed again shortly after. I got this from my monitor, but otherwise the server is unresponsive to any input from my keyboard.

I must say, the crashes started after the 6.12.6 upgrade, so now I'll try to upgrade to 6.12.8 and see again.

I added the image. The logs again show nothing after I closed the SSH session this morning... really weird, to be honest.

JorgeB · February 27

Very difficult to tell based on that, since it doesn't show the beginning of the call trace. If it doesn't get logged to the remote persistent syslog, try mirroring it to the flash drive.

Kees Fluitman · February 27

1 hour ago, JorgeB said:

Very difficult to tell based on that, since it doesn't show the beginning of the call trace. If it doesn't get logged to the remote persistent syslog, try mirroring it to the flash drive.

I will do that right now. There are other signs of struggle before the crash comes as well, such as a general slowdown of applications, but not always. I will keep this topic up to date. After the update just now, it seems good. I'm sending syslogs to an external server and mirroring the files to flash.
One thing I noticed this morning was that htop showed a lot more activity from Crowdsec than usual. I also had more frequent crashes when I had my monitoring stack (Grafana, Loki, etc.) running. But I already turned that off until I figure out a more efficient way to handle diagnosis. I find they use quite a lot of resources for just diagnosis/observation. They pushed my RAM to the limit (no more free space, with everything used by cache)... but as far as I know, Linux just uses cache freely, so having 2GB less cache than usual isn't all that important when you have 47GB in total.
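
To put numbers on the free-versus-cache point, here is a rough Python sketch, assuming Python 3 is available on the box (it is not part of a stock install as far as I know). It reads `/proc/meminfo` and reports `MemAvailable`, which is the kernel's own estimate of how much memory can be handed to new workloads without swapping, reclaimable cache included. The function names `parse_meminfo` and `summarize` are made up for illustration.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Return total, available and cache figures in kB, plus the available share in percent."""
    total = values["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree in that case.
    available = values.get("MemAvailable", values.get("MemFree", 0))
    cache = values.get("Cached", 0) + values.get("Buffers", 0)
    return {
        "total_kb": total,
        "available_kb": available,
        "cache_kb": cache,
        "available_pct": round(100 * available / total, 1),
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(summarize(parse_meminfo(f.read())))
```

A low `MemFree` next to a healthy `available_pct` would support the reading that the cache is not the problem; a low `available_pct` would point the other way.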

So if I understand correctly, mirroring will keep the syslog not only in RAM but also on the flash drive... so you can actually debug it better?

Edited February 27 by Kees Fluitman

itimpi · February 27

2 minutes ago, Kees Fluitman said:

So if I understand correctly, mirroring will keep the syslog not only in RAM but also on the flash drive... so you can actually debug it better?

Yes - this is why it survives a crash.

It does, however, mean you are doing a lot more writes to the flash, so it is probably not something you want left running if you are not investigating a problem.

Kees Fluitman · February 28

20 hours ago, itimpi said:

Yes - this is why it survives a crash.

It does, however, mean you are doing a lot more writes to the flash, so it is probably not something you want left running if you are not investigating a problem.

I could stress my server more to see if the crash occurs, but as of the update yesterday, it has been very stable, with enough free memory, no slowdowns of GUI or apps, etc. I will turn on my monitoring stack and observe how more stress is handled by my server. It's been running clean for 24 hours now though.

Kees Fluitman · March 6

On 2/27/2024 at 3:51 PM, JorgeB said:

Very difficult to tell based on that, since it doesn't show the beginning of the call trace. If it doesn't get logged to the remote persistent syslog, try mirroring it to the flash drive.

After the update I had no issues whatsoever anymore. It's been online for 7 days now, so I'll turn off "mirror syslog to flash drive" again. I only saw a few of these errors in my logs:

Quote

2024-03-03 21:22:36.558 Mar 3 20:05:52 server kernel: atlantic 0000:02:00.0: device [1d6a:07b1] error status/mask=00000001/0000a000
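
For reference, a small Python sketch that decodes the `status/mask` pair in a line like this. It assumes the line belongs to a PCIe AER correctable-error report; the quoted line does not state the severity itself, so check the neighbouring "PCIe Bus Error: severity=..." line before trusting the bit names. The bit table follows the AER Correctable Error Status register layout, and `decode` / `decode_line` are hypothetical helper names.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status register.
# Assumption: the logged line is a correctable-error report.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

LINE_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode(value):
    """Return the names of all known bits set in a 32-bit register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items()) if value & (1 << bit)]


def decode_line(line):
    """Extract status and mask from a kernel log line; None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return {"status": decode(status), "mask": decode(mask)}


line = "server kernel: atlantic 0000:02:00.0: device [1d6a:07b1] error status/mask=00000001/0000a000"
print(decode_line(line))
# {'status': ['Receiver Error'], 'mask': ['Advisory Non-Fatal Error', 'Header Log Overflow']}
```

The mask lists error types that are suppressed from reporting, so only the status field says what was actually seen.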

JorgeB · March 6

Those should not be a problem.

Kees Fluitman · April 5

On 3/6/2024 at 2:52 PM, JorgeB said:

Those should not be a problem.

Crashes suddenly started again. The only reasons I can think of are a new Docker stack I've started running (possibly too high a load on memory or CPU?) or the update... Or maybe the hardware (motherboard/CPU) is slowly showing its age.

Specs:
Asus X99-A
Intel® Core™ i7-5820K CPU @ 3.30GHz
Memory: 48 GiB DDR4

Syslog shows nothing. I've attached it.

I'll take down my logging stack (Prometheus, Grafana, etc.) to see if it stops happening under less load.

syslog-till-crash.txt

Edited April 5 by Kees Fluitman

JorgeB · April 5

If there's nothing logged, you can always try disabling all containers and then start them one by one and re-test.
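
A rough Python sketch of that one-by-one approach, assuming Python 3 and the `docker` CLI are available on the host. It starts one container, waits a soak period, then starts the next, logging each step. `staged_start` and its parameters are made-up names, and the soak time is whatever you think is long enough to trigger a crash.

```python
import subprocess
import time


def list_containers():
    """Return the names of all containers, running or stopped, via the docker CLI."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [name for name in out.splitlines() if name]


def start_container(name):
    """Start a single existing container by name."""
    subprocess.run(["docker", "start", name], check=True)


def staged_start(names, soak_seconds, start=start_container, sleep=time.sleep, log=print):
    """Start containers one at a time, waiting soak_seconds after each one.

    The start, sleep and log callables are injectable so the loop can be dry-run.
    """
    started = []
    for name in names:
        start(name)
        started.append(name)
        log(f"started {name}; running so far: {', '.join(started)}")
        sleep(soak_seconds)
    return started
```

For example, `staged_start(list_containers(), 6 * 3600)` would add one container every six hours. Since the script dies with the server, send its output somewhere that survives a crash; the last "started" line then names the most recent addition, which is the first suspect but not proof.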

Kees Fluitman · May 1

I probably have some motherboard problem. The crashes kept returning, up to the point where booting failed. CPU stress tests were fine, memtest was fine. But I could only boot again and run those tests after cutting down to just 2x8GB memory modules. So I'll replace the mainboard and see.

The error you see here is the one that comes back most often at the moment. I don't know if it's an issue. I also had some hard drive metadata errors, but those were solved after moving my files properly on the drives (cache pools and array).

May  1 19:54:25 server kernel: WARNING: CPU: 3 PID: 0 at net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]

server-diagnostics-20240501-1956.zip

JorgeB · May 2

18 hours ago, Kees Fluitman said:

May  1 19:54:25 server kernel: WARNING: CPU: 3 PID: 0 at net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]

These should go away if you change the docker network to ipvlan.

Kees Fluitman · May 4

On 5/2/2024 at 2:40 PM, JorgeB said:

These should go away if you change the docker network to ipvlan.

I've fixed the networks, I think. But I do run a macvlan network; I need it for my AdGuard Home instance. I followed the troubleshooting guide to turn off bridging.

I changed the motherboard and CPU, since I got several crashes and couldn't even boot with more than 16GB of memory. It runs smoothly now, but I still got a crash just before (that was probably due to the weird macvlan settings I'd changed for testing; now it should be fine).

I also need to run a parity check, since all the crashes have kind of mixed up the parity. But before that, I saw checksum and scrub errors on my cache pool:

May  4 15:57:38 server kernel: BTRFS warning (device loop2): checksum error at logical 9563754496 on dev /dev/loop2, physical 10645884928, root 414, inode 996, offset 12288, length 4096, links 1 (path: usr/lib/apt/methods/gpgv)
May  4 15:57:38 server kernel: BTRFS warning (device loop2): checksum error at logical 9563754496 on dev /dev/loop2, physical 10645884928, root 413, inode 996, offset 12288, length 4096, links 1 (path: usr/lib/apt/methods/gpgv)
May  4 15:57:38 server kernel: BTRFS warning (device loop2): checksum error at logical 9563754496 on dev /dev/loop2, physical 10645884928, root 412, inode 996, offset 12288, length 4096, links 1 (path: usr/lib/apt/methods/gpgv)
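
To see at a glance which files are affected, a short Python sketch that groups warnings of this shape by device and path and collects the root IDs each one was reported under. The regular expression is written against the three lines above only, so treat it as an assumption about the message format, and `group_checksum_errors` is a made-up name.

```python
import re
from collections import defaultdict

# Pattern fitted to the checksum warnings quoted above; other kernel versions may word it differently.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<device>[^)]+)\): checksum error at logical (?P<logical>\d+)"
    r".*?root (?P<root>\d+), inode (?P<inode>\d+).*?\(path: (?P<path>[^)]+)\)"
)


def group_checksum_errors(lines):
    """Map (device, path) to the sorted list of btrfs root IDs that reported a checksum error."""
    grouped = defaultdict(set)
    for line in lines:
        match = CSUM_RE.search(line)
        if match:
            grouped[(match["device"], match["path"])].add(int(match["root"]))
    return {key: sorted(roots) for key, roots in grouped.items()}
```

Fed the three lines above, this collapses them to a single entry: one file, `usr/lib/apt/methods/gpgv` on `loop2`, seen from roots 412, 413 and 414. That pattern suggests one damaged block shared by several subvolumes rather than three separate errors. The path is a file shipped with apt, so it most likely sits inside a Debian- or Ubuntu-based image on that loop device, though which image file backs `loop2` is not visible from the log.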

These errors are probably due to the crash. But I wonder what these files belong to. Maybe Docker or the VMs? If so, I'd recreate the Docker image. I just want to avoid any further crashes so I can run the parity check without trouble.

I think it's from Docker. I started the array, then started the VMs: nothing. Then I started Docker and got the errors again:

May  4 16:44:49 server kernel: BTRFS error (device loop3): tree first key mismatch detected, bytenr=778682368 parent_transid=253 key expected=(5040,84,18446612688220248328) has=(5040,84,3585323317)

But dmesg does not show corresponding error messages this time. So that's a good thing, I guess?

Now dmesg does show occasional similar errors:

[Sat May  4 17:28:11 2024] BTRFS error (device loop3): tree first key mismatch detected, bytenr=778682368 parent_transid=253 key expected=(5040,84,18446612688220248328) has=(5040,84,3585323317)

Would you suggest rebuilding my docker image?

Edited May 4 by Kees Fluitman

JorgeB · May 5

18 hours ago, Kees Fluitman said:

Would you suggest rebuilding my docker image?

Yep

Kees Fluitman · May 5

I completely wiped my SATA cache pool and NVMe cache drive and reformatted the pool. After deleting the Docker image and rebuilding it, I've been error-free for the last 10 hours. Let's see how long it holds.

I'll keep this topic updated.
