Help troubleshooting Kernel Panics


Solved by rwdesigner


Howdy!

 

I've really been enjoying unraid, and have recently migrated my install to new hardware.

 

Previously, I was using a Supermicro board with dual Xeons, and it was working pretty well, with the occasional freeze-up.

 

I was trying to use the Unraid server for many things, but because of the occasional freeze-ups I started removing tasks, drives, and hardware until I was down to just a Quadro P2000, a 400GB PCIe NVMe (cache), and two 16TB HDDs. The software was all Dockers: Plex, Tautulli, and Pi-hole.

 

I wanted to move to a board with a newer-generation processor with Quick Sync and an M.2 SSD, and also to something quieter.

 

So, I upgraded to: 

  • Intel Core i7-11700K
  • ASUS PRIME B560-PLUS 
  • 16GB DDR4 2133MHz (4x4GB)
  • Noctua NH-U9S
  • SAMSUNG 980 PRO M.2 2280 2TB PCIe Gen 4.0 x4, NVMe  (cache)
  • Quadro P2000 (still installed)

 

The hardware upgrade came with some issues: the old motherboard had IPMI, and I had set up nerdtools to check it for stats and report to the dashboard. The new MB doesn't have IPMI, and the log file was filling up with IPMI failures to connect or something like that. So, I removed the IPMI tool.

 

And I successfully migrated the cache drive from the 400GB PCIe NVMe to a SAMSUNG 980 PRO M.2 2280 2TB PCIe Gen 4.0.

 

Unfortunately, I'm still having issues where the system will abruptly reboot. The last time I was just running a parity check and dockers were disabled.

 

To troubleshoot, I've:

  • Run MEMTEST86 - 2 passes - 0 errors (2hrs)
  • Replaced SATA cables to HDDs
  • Removed all other SATA SSDs
    • the only drives connected are: USB (unraid), m.2 cache drive, 2x 16TB HDDs

 

Some of my research suggested that "dockers using custom br0 network interface can cause kernel panics", so I disabled Pi-hole.

 

Yesterday I was trying to get OpenVPN working and I enabled IPv6 (previously was disabled... maybe due to previous troubleshooting with this issue?)

 

Currently: removed USB wireless keyboard (maybe this was causing the kernel panic when it went into sleep mode?), attached PS/2 keyboard. Dockers disabled. Running a parity check.

...and it crashed/rebooted again, with no Dockers running, while it was just running a parity check.

 

I really feel like just starting over and seeing if that fixes anything... maybe just get a new test unraid usb going with some test HDDs and see if that configuration crashes....

 

TLDR: Unraid system randomly kernel panics. 

 

Please and Thank you for the help. I'm at my wit's end here =\

aor7u5yzza681[1].jpg

Edited by rwdesigner

There's nothing obvious logged, which is kind of expected if it's a hardware problem. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.


Less than an hour later...

 

Kernel panic - not syncing : Fatal exception in interrupt

 

If this is not a software issue with unraid, jeez, I don't even know where to start. 

Should I try a bootable USB with SeaTools and run a SMART check there? If that crashes, it's the CPU/MB/RAM/HDD, right?

 

The CPU, MB, and PSU are new. The HDDs are about a year old. The RAM is used, but it passed a couple hours of memtest.

 

Thank you for the help

panic.jpg


Is there any chance this is related to the system time?

 

Quote

Dec 31 12:50:54 Helix ntpd[2023]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891215144
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891215152
Dec 31 13:14:59 Helix kernel: md: recovery thread: P corrected, sector=891220760
Dec 31 13:23:26 Helix kernel: general protection fault, probably for non-canonical address 0x7063742f30373409: 0000 [#1] SMP NOPTI
Dec 31 13:23:26 Helix kernel: CPU: 10 PID: 0 Comm: swapper/10 Not tainted 5.10.28-Unraid #1
Dec 31 13:23:26 Helix kernel: Hardware name: ASUS System Product Name/PRIME B560-PLUS, BIOS 0820 04/27/2021
Dec 31 13:23:26 Helix kernel: RIP: 0010:bio_endio+0x50/0xc7
Dec 31 13:23:26 Helix kernel: Code: 01 75 0b 48 8b 45 08 48 85 c0 75 17 eb 2d 48 83 7d 58 00 74 ee 48 89 ef e8 02 8e 02 00 84 c0 75 e2 eb 7b 48 8b 80 a8 03 00 00 <48> 8b 78 28 48 85 ff 74 08 48 89 ee e8 a2 7a 01 00 48 81 7d 38 20
Dec 31 13:23:26 Helix kernel: RSP: 0018:ffffc90000334eb0 EFLAGS: 00010286
Dec 31 13:23:26 Helix kernel: RAX: 7063742f30373409 RBX: ffff888104ef7800 RCX: 0000000000000001
Dec 31 13:23:26 Helix kernel: RDX: 0000000000000000 RSI: ffff8881039a0fa0 RDI: ffff8881039a0f28
Dec 31 13:23:26 Helix kernel: RBP: ffff8881039a0f28 R08: ffff888104ef7800 R09: 0000000000000200
Dec 31 13:23:26 Helix kernel: R10: 0000000000000002 R11: ffffffff8251e750 R12: 000000000006f000
Dec 31 13:23:26 Helix kernel: R13: 0000000000039000 R14: 0000000000000000 R15: 0000000000001000
Dec 31 13:23:26 Helix kernel: FS: 0000000000000000(0000) GS:ffff88844f680000(0000) knlGS:0000000000000000
Dec 31 13:23:26 Helix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 31 13:23:26 Helix kernel: CR2: 0000150982217ff8 CR3: 000000000400a006 CR4: 0000000000770ee0
Dec 31 13:23:26 Helix kernel: PKRU: 55555554
Dec 31 13:23:26 Helix kernel: Call Trace:
Dec 31 13:23:26 Helix kernel: <IRQ>
Dec 31 13:23:26 Helix kernel: blk_update_request+0x1f9/0x2ad
Dec 31 13:23:26 Helix kernel: scsi_end_request+0x22/0xda
Dec 31 13:23:26 Helix kernel: scsi_io_completion+0x146/0x3bf
Dec 31 13:23:26 Helix kernel: blk_done_softirq+0x7c/0x99
Dec 31 13:23:26 Helix kernel: __do_softirq+0xc4/0x1c2
Dec 31 13:23:26 Helix kernel: asm_call_irq_on_stack+0xf/0x20
Dec 31 13:23:26 Helix kernel: </IRQ>
Dec 31 13:23:26 Helix kernel: do_softirq_own_stack+0x2c/0x39
Dec 31 13:23:26 Helix kernel: __irq_exit_rcu+0x45/0x80
Dec 31 13:23:26 Helix kernel: common_interrupt+0x119/0x12e
Dec 31 13:23:26 Helix kernel: asm_common_interrupt+0x1e/0x40
Dec 31 13:23:26 Helix kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Dec 31 13:23:26 Helix kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Dec 31 13:23:26 Helix kernel: RSP: 0018:ffffc9000014bea0 EFLAGS: 00000246
Dec 31 13:23:26 Helix kernel: RAX: ffff88844f6a2380 RBX: 0000000000000003 RCX: 000000000000001f
Dec 31 13:23:26 Helix kernel: RDX: 0000000000000000 RSI: 00000000238e38e3 RDI: 0000000000000000
Dec 31 13:23:26 Helix kernel: RBP: ffffe8ffffc99000 R08: 0000021c3524ab48 R09: 000000000000023b
Dec 31 13:23:26 Helix kernel: R10: 0000000000000252 R11: 071c71c71c71c71c R12: 0000021c3524ab48
Dec 31 13:23:26 Helix kernel: R13: ffffffff820c5dc0 R14: 0000000000000003 R15: 0000000000000000
Dec 31 13:23:26 Helix kernel: cpuidle_enter_state+0x101/0x1c4
Dec 31 13:23:26 Helix kernel: cpuidle_enter+0x25/0x31
Dec 31 13:23:26 Helix kernel: do_idle+0x1a6/0x214
Dec 31 13:23:26 Helix kernel: cpu_startup_entry+0x18/0x1a
Dec 31 13:23:26 Helix kernel: secondary_startup_64_no_verify+0xb0/0xbb
Dec 31 13:23:26 Helix kernel: Modules linked in: xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding wmi_bmof x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd nvme cryptd i2c_i801 nvme_core i2c_smbus i2c_core video input_leds glue_helper led_class ahci wmi e1000e libahci backlight thermal acpi_pad button fan
Dec 31 13:23:26 Helix kernel: ---[ end trace 3004778bc7b6ba3c ]---
Dec 31 13:23:26 Helix kernel: RIP: 0010:bio_endio+0x50/0xc7
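
One thing worth checking in a trace like this is the faulting address itself. Below is a minimal Python sketch, assuming a standard x86-64 system with 48-bit virtual addresses; the helper names `is_canonical` and `as_memory_bytes` are made up for illustration. It tests whether the RAX value from the log above is a canonical address and shows its bytes in the order they would sit in memory.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def as_memory_bytes(value: int) -> bytes:
    """The 8 bytes of a 64-bit value as stored in little-endian x86 memory."""
    return value.to_bytes(8, "little")


rax = 0x7063742F30373409  # RAX from the general protection fault above
rbx = 0xFFFF888104EF7800  # RBX from the same dump, for comparison

print(is_canonical(rax))     # False
print(is_canonical(rbx))     # True
print(as_memory_bytes(rax))  # b'\t470/tcp'
```

The bad value is made up entirely of printable text bytes rather than anything that looks like a kernel pointer. That would be consistent with a pointer having been overwritten by unrelated data, for example through bad RAM, although on its own it does not prove where the corruption came from.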

 


I feel like I'm in crazy land here... 

 

When trying to update, I see two lines -- one with the current installed (6.9.2) version and one with 6.9.1.... I click " Branch - Next "  (that line disappears and all I see is 6.9.1 with the Restore option) - then after I click " Check for Updates " - the popup window appears and in the background I see v6.10-rc2 and under Status, I see the Install button, but when I close the popup window, that changes back to "6.9.2 - Up to Date" - I tried this with both the array started and stopped. 

 

 

Okay after trying the same procedure about 10 times, after I close the popup window, the Install button persisted... weird weird weird... 

 

I'll report back on whether upgrading helps. Thank you

  • Installed Win10 on a SATA SSD & updated Windows
  • Downloaded & Installed SeaTools for Windows
    • SMART test - both drives passed
    • Short DST - both drives passed
    • Short Generic Read Test - both drives passed
    • Long Generic Read Test - both drives passed
Edited by rwdesigner
long test finished
6 hours ago, JorgeB said:

Unlikely to be a disk issue, but could be hardware related, board, RAM, etc.

 

I know how to run tests for the HDDs, RAM, and CPU, but I have no idea how to run any motherboard tests.

 

And if the system is fully functional under Windows, perhaps that indicates a compatibility issue between Unraid and this hardware. That is very frustrating, because Unraid has a reputation for not caring about what hardware it is running on.

 

The long generic HDD test has been running for almost 24 hours. I don't see any issues reported yet.

 

I just bought all of this new hardware and damn is this disheartening...  

 

I guess I'll try setting up a new unraid install with different HDDs and see if this piece of junk continues to crash. Maybe it's something corrupt with the Unraid files/config....

 

I would pay for help at this point...

7 hours ago, JorgeB said:

Unlikely to be a disk issue

Agree

 

On 1/1/2022 at 2:51 AM, rwdesigner said:

But if I try to just do a parity check, I know it will crash.

That's great: it crashes right away, which makes it easier to troubleshoot and is a lot better than an intermittent problem. I would use some dummy disks for the test.

 

BTW, it still looks like a memory issue. I suggest running memtest for longer, or simply testing with one RAM module at a time.

  • 2 weeks later...
  • Solution

Well, I have good news to report. After purchasing some RAM (Corsair Dominator Platinum 16GB (2x8GB) DDR4 Gen6 3200MHz) from the motherboard's QVL, the parity check completed successfully! No crashing. No reboots.

 

Quote

Last check completed on Fri 14 Jan 2022 10:02:52 AM CST (today)
Finding 40 errors
Duration: 21 hours, 41 minutes, 54 seconds. Average speed: 204.8 MB/sec

Uptime: 

1 day, 2 hours, 45 minutes!
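
As a quick sanity check on the parity check figures quoted above, here is a small Python sketch. It assumes the parity disk is a nominal 16 TB (16 x 10^12 bytes), since the exact byte count is not given in this thread, and shows that the reported average speed matches one full pass over a disk of that size.

```python
# Nominal capacity of the parity disk (an assumption: 16 TB in decimal units).
disk_bytes = 16 * 10**12

# Duration from the report: 21 hours, 41 minutes, 54 seconds.
duration_s = 21 * 3600 + 41 * 60 + 54

# Average throughput in decimal megabytes per second.
avg_mb_per_s = disk_bytes / duration_s / 10**6

print(f"{duration_s} s, {avg_mb_per_s:.1f} MB/s")  # 78114 s, 204.8 MB/s
```

So the check ran the whole disk at full speed without the system going down, which it never managed before the RAM swap.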

 

Just started Dockers and... I'll report back, but it seems that:

 

SOLUTION: Install RAM which is listed on the motherboard's QVL

 
