random kernel panics



Hey guys,

 

This is my last attempt to fix these random kernel panics. I have been fighting this issue for years but never had a hint where to start.

I never got anything in the logs. I set up an external rsyslog server for this, but the last entry was always minutes before the blackout and had nothing to do with a kernel panic.

The screen was always black, and the IPMI log contained only this:

grafik.thumb.png.e509b523d2f2dad4d46973b6c66137e8.png

Not really helpful for finding the root cause, but I could react to the "OS Critical Stop" | "kernel panic" event and send this command to reboot automatically:

ipmitool -I lanplus -H 192.168.2.241 -U user -P pass power cycle
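The automatic reboot described above could be scripted roughly as follows. This is only a sketch under assumptions: it reads the BMC event log with `ipmitool ... sel list`, it treats any line containing "OS Critical Stop" as a panic, and the function names `has_critical_stop`, `power_cycle_command` and `check_and_reboot` are made up for illustration. Host and credentials are the placeholders from the command in the post.

```python
import subprocess

# Connection options copied from the command in the post; host and credentials are placeholders.
IPMI_BASE = ["ipmitool", "-I", "lanplus", "-H", "192.168.2.241", "-U", "user", "-P", "pass"]


def has_critical_stop(sel_text):
    """Return True if any SEL line mentions an OS critical stop (assumed marker text)."""
    return any("os critical stop" in line.lower() for line in sel_text.splitlines())


def power_cycle_command():
    """Build the power-cycle command from the post as an argument list."""
    return IPMI_BASE + ["power", "cycle"]


def check_and_reboot():
    """Read the SEL once and power-cycle the host if a critical stop is logged."""
    sel = subprocess.run(IPMI_BASE + ["sel", "list"],
                         capture_output=True, text=True, check=True)
    if has_critical_stop(sel.stdout):
        subprocess.run(power_cycle_command(), check=True)
        return True
    return False
```

Nothing runs until `check_and_reboot()` is called, for example from a cron job on another machine. A real script would also have to remember which events it has already handled (or clear the SEL afterwards), otherwise the same old entry would trigger a power cycle on every run.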
 

But today I played around with the IPMI tools and, for the first time, got something that might help. So that's my hope. :)

 

grafik.thumb.png.34b928264d2f71e64924fb6fd8ef4568.png

 

The string is incomplete, and I can't get more information out of the IPMI tool, but maybe this helps.

Any ideas?

 

Hardware:

AMD Ryzen 7 3700X 8-Core 3600 MHz @ ASRockRack X470D4U2-2T
American Megatrends International, LLC., Version P4.10

BIOS dated: Tuesday, 18.05.2021

Unraid 6.12.8

1x GeForce RTX 2060

1x Radeon RX 470 (bound to VFIO)

8x HDD, 84 TB, XFS RAID

5x SSD cache, 1.3 TB

1x Samsung_SSD_970_EVO_500GB NVMe for Docker

1x SanDisk_SDSSDH3_2T00, 2 TB, for VM

Full Hardware:

 


IOMMU group 0:[1022:1482] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 1:[1022:1483] 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

IOMMU group 2:[1022:1483] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

IOMMU group 3:[1022:1482] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 4:[1022:1482] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 5:[1022:1483] 00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

IOMMU group 6:[1022:1483] 00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

IOMMU group 7:[1022:1482] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 8:[1022:1482] 00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 9:[1022:1482] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 10:[1022:1484] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]

IOMMU group 11:[1022:1482] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge

IOMMU group 12:[1022:1484] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]

IOMMU group 13:[1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)

[1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)

IOMMU group 14:[1022:1440] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0

[1022:1441] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1

[1022:1442] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2

[1022:1443] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3

[1022:1444] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4

[1022:1445] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5

[1022:1446] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6

[1022:1447] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7

IOMMU group 15:[8086:1563] 01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)

IOMMU group 16:[8086:1563] 01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)

IOMMU group 17:[1022:43d0] 03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)

Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 001 Device 002 Port 1-3 ID 1b1c:1c0b Corsair RM750i Power Supply

Bus 001 Device 003 Port 1-4 ID 2109:2812 VIA Labs, Inc. VL812 Hub

Bus 001 Device 004 Port 1-14 ID 046b:ff01 American Megatrends, Inc. Virtual Hub

Bus 001 Device 005 Port 1-4.1 ID 0001:0000 Fry's Electronics MEC0003

Bus 001 Device 006 Port 1-14.3 ID 046b:ffb0 American Megatrends, Inc. Virtual Ethernet

Bus 001 Device 007 Port 1-4.2 ID 2109:2812 VIA Labs, Inc. VL812 Hub

Bus 001 Device 008 Port 1-14.5 ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse

Bus 001 Device 009 Port 1-4.3 ID 2109:2812 VIA Labs, Inc. VL812 Hub

Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

Bus 002 Device 002 Port 2-4 ID 2109:0812 VIA Labs, Inc. VL812 Hub

Bus 002 Device 003 Port 2-4.2 ID 2109:0812 VIA Labs, Inc. VL812 Hub

Bus 002 Device 004 Port 2-4.3 ID 2109:0812 VIA Labs, Inc. VL812 Hub

Bus 002 Device 005 Port 2-4.2.4 ID 18d1:9302 Google Inc.

[1022:43c8] 03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)

[1:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdj 500GB

[2:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdk 500GB

[3:0:0:0] disk ATA SanDisk SDSSDH3 30RL /dev/sdl 2.00TB

[4:0:0:0] disk ATA SanDisk SDSSDH3 20RL /dev/sdm 500GB

[7:0:0:0] disk ATA SanDisk SDSSDH3 40RL /dev/sdn 500GB

[8:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdo 500GB

[1022:43c6] 03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)

[1022:43c7] 20:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1022:43c7] 20:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1022:43c7] 20:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1022:43c7] 20:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1022:43c7] 20:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1022:43c7] 20:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)

[1a03:1150] 21:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)

[1a03:2000] 22:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)

[1b4b:9215] 25:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)

[10de:1e89] 26:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2060] (rev a1)

[10de:10f8] 26:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)

[10de:1ad8] 26:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)

Bus 003 Device 001 Port 3-0 ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 004 Device 001 Port 4-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

[10de:1ad9] 26:00.3 Serial bus controller: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)

[144d:a808] 2a:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

[N:0:4:1] disk Samsung SSD 970 EVO 500GB__1 /dev/nvme0n1 500GB

IOMMU group 18:[1000:0087] 2b:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

[9:0:0:0] disk ATA ST14000NE0008-2J EN01 /dev/sdb 14.0TB

[9:0:1:0] disk ATA ST16000NM001G-2K SN02 /dev/sdc 16.0TB

[9:0:2:0] disk ATA ST16000NM001G-2K SN03 /dev/sdd 16.0TB

[9:0:3:0] disk ATA ST16000NM001G-2K SN02 /dev/sde 16.0TB

[9:0:4:0] disk ATA ST12000VN0007-2G SC60 /dev/sdf 12.0TB

[9:0:5:0] disk ATA ST10000VN0008-2J SC60 /dev/sdg 10.0TB

[9:0:6:0] disk ATA ST8000AS0002-1NA AR17 /dev/sdh 8.00TB

[9:0:7:0] disk ATA ST8000AS0002-1NA AR17 /dev/sdi 8.00TB

IOMMU group 19:[1002:67df] 2c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)

[1002:aaf0] 2c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]

IOMMU group 20:[1022:148a] 2d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function

IOMMU group 21:[1022:1485] 2e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP

IOMMU group 22:[1022:1486] 2e:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP

IOMMU group 23:[1022:149c] 2e:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller

Bus 005 Device 001 Port 5-0 ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 005 Device 002 Port 5-2 ID 0930:6544 Toshiba Corp. TransMemory-Mini / Kingston DataTraveler 2.0 Stick

Bus 006 Device 001 Port 6-0 ID 1d6b:0003 Linux Foundation 3.0 root hub

IOMMU group 24:[1022:1487] 2e:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller

 

Running apps:

grafik.thumb.png.687603f12941283a8335fb2e4c26e61c.png

The RTX is passed through to Shinobi.


I have had this problem for nearly as long as the server has existed, about 3-4 years. At the beginning it happened only once a month, but for the past year it has happened about once a week.

It never happens while the server is checking parity after a power cycle, only 1-2 days after the parity check, which takes around 19 hours, has finished.

 

Here is my uptime diagram from the last year. Every drop was a black screen.

grafik.thumb.png.44f99f82337b1f9712945544368a4556.png

 

So, any ideas what to check, or what the root cause of this problem might be?
Thanks in advance, Stefan


 

serva4-diagnostics-20240228-2204.zip raw.txt sel.txt

Link to comment

I noticed some very strange settings for the Parity Check Tuning plugin: it is set to resume at 0:15 and pause at 0:30, a total runtime of only 15 minutes. I mention this because one of the last things that happened seemed to be a parity check running. Can your PSU reliably handle the load of all drives being active at the same time?
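To put that window in numbers, here is a small arithmetic sketch. It assumes the increments run once per day and that the check progresses at the same rate as the roughly 19-hour uninterrupted check mentioned in the first post; the function names are made up for illustration.

```python
import math
from datetime import datetime


def window_minutes(resume, pause):
    """Minutes between a resume time and a pause time given as 'H:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(pause, fmt) - datetime.strptime(resume, fmt)
    return int(delta.total_seconds() // 60)


def days_to_finish(total_hours, minutes_per_day):
    """Number of daily increments needed to cover the full check duration."""
    return math.ceil(total_hours * 60 / minutes_per_day)


per_day = window_minutes("0:15", "0:30")
print(per_day)                      # 15
print(days_to_finish(19, per_day))  # 76
```

Under those assumptions a single full check would be spread over about 76 daily increments, which is why the setting looks unintended.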

Link to comment

After I removed the second GPU (the ATI card), I had hoped that the system would be more stable.

Sadly, it crashed again after 3 days of uptime.

 

The stop was at 21:02.
IPMI log:

grafik.thumb.png.8b67e25665847f8ed9b5d855e65c47e0.png

The last Unraid log entry was 6 minutes earlier, so it looks like nothing was written at the time of the crash.

The IPMI log says that a video controller had a failure, but not which one.

I have an RTX 2060 and the onboard VGA in the system.

 

grafik.thumb.png.6ffc4af75f21eacbc604348ae78d865f.png

 

grafik.thumb.png.183349fabce9167d1ab8186ce20a76b1.png

 

How can I deactivate the onboard VGA for Unraid? I don't think I can do that in the BIOS.

 

greetz Stefan

 

grafik.png

Link to comment
  • 2 weeks later...

Hey all..

 

After removing the GPU driver, the GPU errors in the IPMI logs are gone. That's good.

But tonight the server crashed again, this time with a different error.

The server was not reachable from the network, with 100% packet loss.

I could still log in via the onboard KVM and use the terminal. Strangely, a ping from the server to the network was successful.

 

Now I have a ton of these errors in the log.

 

Quote

Mar 17 04:00:31 serva4 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar 17 04:00:31 serva4 kernel: rcu:     0-...!: (0 ticks this GP) idle=bba0/0/0x0 softirq=79212217/79212217 fqs=2 (false positive?)
Mar 17 04:00:31 serva4 kernel:     (detected by 14, t=60002 jiffies, g=98489137, q=1178260 ncpus=16)
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
Mar 17 04:00:31 serva4 kernel: NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x7/0xc
Mar 17 04:00:31 serva4 kernel: rcu: rcu_preempt kthread timer wakeup didn't happen for 60002 jiffies! g98489137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Mar 17 04:00:31 serva4 kernel: rcu:     Possible timer handling issue on cpu=0 timer-softirq=5734298
Mar 17 04:00:31 serva4 kernel: rcu: rcu_preempt kthread starved for 60003 jiffies! g98489137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
Mar 17 04:00:31 serva4 kernel: rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Mar 17 04:00:31 serva4 kernel: rcu: RCU grace-period kthread stack dump:
Mar 17 04:00:31 serva4 kernel: task:rcu_preempt     state:I stack:0     pid:15    ppid:2      flags:0x00004000
Mar 17 04:00:31 serva4 kernel: Call Trace:
Mar 17 04:00:31 serva4 kernel: <TASK>
Mar 17 04:00:31 serva4 kernel: __schedule+0x5b2/0x612
Mar 17 04:00:31 serva4 kernel: ? _raw_spin_unlock_irqrestore+0x24/0x3a
Mar 17 04:00:31 serva4 kernel: ? __mod_timer+0x207/0x232
Mar 17 04:00:31 serva4 kernel: ? rcu_gp_init+0x494/0x494
Mar 17 04:00:31 serva4 kernel: schedule+0x8e/0xcc
Mar 17 04:00:31 serva4 kernel: schedule_timeout+0x9d/0xd7
Mar 17 04:00:31 serva4 kernel: ? __next_timer_interrupt+0xf6/0xf6
Mar 17 04:00:31 serva4 kernel: rcu_gp_fqs_loop+0x12d/0x475
Mar 17 04:00:31 serva4 kernel: rcu_gp_kthread+0x151/0x16d
Mar 17 04:00:31 serva4 kernel: kthread+0xe4/0xef
Mar 17 04:00:31 serva4 kernel: ? kthread_complete_and_exit+0x1b/0x1b
Mar 17 04:00:31 serva4 kernel: ret_from_fork+0x1f/0x30
Mar 17 04:00:31 serva4 kernel: </TASK>
Mar 17 04:00:31 serva4 kernel: rcu: Stack dump where RCU GP kthread last ran:
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
Mar 17 04:00:31 serva4 kernel: NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x7/0xc

 

The CPU number (as in "CPU 14 to CPUs 0:") iterated from 0 to 15.
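To check that pattern across the whole syslog, the detecting and targeted CPUs can be tallied with a few lines of Python. This is a minimal sketch that assumes the messages keep the exact wording shown in the quote and that each NMI line names a single target CPU; `nmi_pairs` is a made-up helper name, and the sample text is copied from the quoted log.

```python
import re
from collections import Counter

# Matches e.g. "Sending NMI from CPU 14 to CPUs 0:" as seen in the quoted log.
# Lines that name a CPU range instead of a single CPU are skipped (assumption).
NMI_RE = re.compile(r"Sending NMI from CPU (\d+) to CPUs (\d+):")


def nmi_pairs(log_text):
    """Return (detecting_cpu, target_cpu) tuples for every matching NMI line."""
    return [(int(src), int(dst)) for src, dst in NMI_RE.findall(log_text)]


sample = """\
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
Mar 17 04:00:31 serva4 kernel: NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x7/0xc
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
"""

pairs = nmi_pairs(sample)
print(Counter(src for src, _ in pairs))  # how often each CPU detected a stall
print(Counter(dst for _, dst in pairs))  # how often each CPU was the NMI target
```

Run against the full log, the two counters would show whether the detecting CPU, the target CPU, or both cycle through 0 to 15.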

 

I then wanted to stop the array from the command line and reboot, but the server got stuck and hung for over an hour.
Then I sent CTRL+ALT+DEL, but it got stuck at the "starting diagnostic collection" step. So I had to send a "power cycle" via IPMI to reboot, and now I have to do the parity check again.

 

 

 

grafik.png

Link to comment
