corgan Posted February 28

Hey guys,

this is my last attempt to fix these random kernel panics. I have been fighting this issue for years but never had a hint where to start, because nothing ever showed up in the logs. I set up an external rsyslog server for this, but the last entry was logged minutes before the blackout and had nothing to do with kernel panics. The screen was always black, and the IPMI log contained only this:

[screenshot]

Not really helpful for finding the root cause, but I could react to "OS Critical Stop" | "kernel panic" events and send an

ipmitool -I lanplus -H 192.168.2.241 -U user -P pass power cycle

command to reboot automatically.

But today I played around with the IPMI tools and, for the first time, got something that may help. So that's my hope. The string is not complete, and I can't get more information out of the IPMI tool. But maybe this helps. Any idea?

Hardware:
AMD Ryzen 7 3700X 8-Core 3600 MHz @ ASRockRack X470D4U2-2T
American Megatrends International, LLC., Version P4.10, BIOS dated: Tuesday, 18.05.2021
Unraid 6.12.8
1x GeForce RTX 2060
1x Radeon RX 470 (bound to VFIO)
8x HDD 84 TB - xfs RAID
5x SSD cache 1.3 TB
1x Samsung_SSD_970_EVO_500GB NVMe for Docker
1x SanDisk_SDSSDH3_2T00 2 TB for VM

Full hardware:

IOMMU group 0:[1022:1482] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 1:[1022:1483] 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
IOMMU group 2:[1022:1483] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
IOMMU group 3:[1022:1482] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 4:[1022:1482] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 5:[1022:1483] 00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
IOMMU group 6:[1022:1483] 00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
IOMMU group 7:[1022:1482] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 8:[1022:1482] 00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 9:[1022:1482] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 10:[1022:1484] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
IOMMU group 11:[1022:1482] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
IOMMU group 12:[1022:1484] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
IOMMU group 13:[1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
    [1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
IOMMU group 14:[1022:1440] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
    [1022:1441] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
    [1022:1442] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
    [1022:1443] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
    [1022:1444] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
    [1022:1445] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
    [1022:1446] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
    [1022:1447] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7
IOMMU group 15:[8086:1563] 01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
IOMMU group 16:[8086:1563] 01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
IOMMU group 17:[1022:43d0] 03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)
        Bus 001 Device 001 Port 1-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
        Bus 001 Device 002 Port 1-3 ID 1b1c:1c0b Corsair RM750i Power Supply
        Bus 001 Device 003 Port 1-4 ID 2109:2812 VIA Labs, Inc. VL812 Hub
        Bus 001 Device 004 Port 1-14 ID 046b:ff01 American Megatrends, Inc. Virtual Hub
        Bus 001 Device 005 Port 1-4.1 ID 0001:0000 Fry's Electronics MEC0003
        Bus 001 Device 006 Port 1-14.3 ID 046b:ffb0 American Megatrends, Inc. Virtual Ethernet
        Bus 001 Device 007 Port 1-4.2 ID 2109:2812 VIA Labs, Inc. VL812 Hub
        Bus 001 Device 008 Port 1-14.5 ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse
        Bus 001 Device 009 Port 1-4.3 ID 2109:2812 VIA Labs, Inc. VL812 Hub
        Bus 002 Device 001 Port 2-0 ID 1d6b:0003 Linux Foundation 3.0 root hub
        Bus 002 Device 002 Port 2-4 ID 2109:0812 VIA Labs, Inc. VL812 Hub
        Bus 002 Device 003 Port 2-4.2 ID 2109:0812 VIA Labs, Inc. VL812 Hub
        Bus 002 Device 004 Port 2-4.3 ID 2109:0812 VIA Labs, Inc. VL812 Hub
        Bus 002 Device 005 Port 2-4.2.4 ID 18d1:9302 Google Inc.
    [1022:43c8] 03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
        [1:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdj 500GB
        [2:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdk 500GB
        [3:0:0:0] disk ATA SanDisk SDSSDH3 30RL /dev/sdl 2.00TB
        [4:0:0:0] disk ATA SanDisk SDSSDH3 20RL /dev/sdm 500GB
        [7:0:0:0] disk ATA SanDisk SDSSDH3 40RL /dev/sdn 500GB
        [8:0:0:0] disk ATA SanDisk SDSSDH3 00RL /dev/sdo 500GB
    [1022:43c6] 03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
    [1022:43c7] 20:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1022:43c7] 20:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1022:43c7] 20:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1022:43c7] 20:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1022:43c7] 20:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1022:43c7] 20:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
    [1a03:1150] 21:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
    [1a03:2000] 22:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
    [1b4b:9215] 25:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)
    [10de:1e89] 26:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2060] (rev a1)
    [10de:10f8] 26:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
    [10de:1ad8] 26:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
        Bus 003 Device 001 Port 3-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
        Bus 004 Device 001 Port 4-0 ID 1d6b:0003 Linux Foundation 3.0 root hub
    [10de:1ad9] 26:00.3 Serial bus controller: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
    [144d:a808] 2a:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
        [N:0:4:1] disk Samsung SSD 970 EVO 500GB__1 /dev/nvme0n1 500GB
IOMMU group 18:[1000:0087] 2b:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
        [9:0:0:0] disk ATA ST14000NE0008-2J EN01 /dev/sdb 14.0TB
        [9:0:1:0] disk ATA ST16000NM001G-2K SN02 /dev/sdc 16.0TB
        [9:0:2:0] disk ATA ST16000NM001G-2K SN03 /dev/sdd 16.0TB
        [9:0:3:0] disk ATA ST16000NM001G-2K SN02 /dev/sde 16.0TB
        [9:0:4:0] disk ATA ST12000VN0007-2G SC60 /dev/sdf 12.0TB
        [9:0:5:0] disk ATA ST10000VN0008-2J SC60 /dev/sdg 10.0TB
        [9:0:6:0] disk ATA ST8000AS0002-1NA AR17 /dev/sdh 8.00TB
        [9:0:7:0] disk ATA ST8000AS0002-1NA AR17 /dev/sdi 8.00TB
IOMMU group 19:[1002:67df] 2c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
    [1002:aaf0] 2c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
IOMMU group 20:[1022:148a] 2d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
IOMMU group 21:[1022:1485] 2e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
IOMMU group 22:[1022:1486] 2e:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
IOMMU group 23:[1022:149c] 2e:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
        Bus 005 Device 001 Port 5-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
        Bus 005 Device 002 Port 5-2 ID 0930:6544 Toshiba Corp. TransMemory-Mini / Kingston DataTraveler 2.0 Stick
        Bus 006 Device 001 Port 6-0 ID 1d6b:0003 Linux Foundation 3.0 root hub
IOMMU group 24:[1022:1487] 2e:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller

Running apps:
The RTX is passed through to Shinobi.

I have had this problem for nearly 3-4 years, as long as the server has existed. In the beginning it happened only once a month, but for the past year it has happened about once a week. It never happens while the server is checking parity after a power cycle, only 1-2 days after the parity check has finished, which takes around 19 hours.

Here is my uptime diagram from the last year. Every drop was a black screen.

So, any ideas what to check, or what the root cause of this problem is?

Thanks in advance,
Stefan

serva4-diagnostics-20240228-2204.zip raw.txt sel.txt
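For readers who want to try the same workaround, here is a minimal sketch of the "watch the IPMI event log, then power cycle" idea described in the post. It is not the poster's actual script. The `ipmitool` arguments mirror the command quoted above, but the layout of `sel list` output differs between BMCs, so the plain substring match, the function names, and the `seen` set are all assumptions made for illustration.

```python
import subprocess

# Event strings quoted in the post. Matching them as case-insensitive
# substrings is an assumption; SEL wording varies between BMC firmwares.
PANIC_MARKERS = ("OS Critical Stop", "kernel panic")


def panic_events(sel_text: str) -> list[str]:
    """Return the SEL lines that mention one of the panic markers."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if any(marker.lower() in line.lower() for marker in PANIC_MARKERS)
    ]


def ipmi_command(host: str, user: str, password: str, *action: str) -> list[str]:
    """Build an ipmitool argument list for a remote BMC over lanplus."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *action]


def check_and_cycle(host: str, user: str, password: str, seen: set[str]) -> bool:
    """Power-cycle the host once for every panic event not handled before.

    `seen` holds SEL lines that already triggered a reboot (hypothetical
    bookkeeping, so that an old entry does not cause a reboot loop).
    """
    sel = subprocess.run(
        ipmi_command(host, user, password, "sel", "list"),
        capture_output=True, text=True, check=True,
    ).stdout
    new_events = [line for line in panic_events(sel) if line not in seen]
    if not new_events:
        return False
    seen.update(new_events)
    subprocess.run(ipmi_command(host, user, password, "power", "cycle"), check=True)
    return True
```

Called from a loop or a cron job on a second machine, `check_and_cycle("192.168.2.241", "user", "pass", seen)` would only reboot the server when a new matching entry appears. Keeping `seen` across runs (for example in a file) is left out of the sketch.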
JorgeB Posted March 2

Did you try enabling the syslog server to see if it catches anything? Also make sure this has been taken care of:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
itimpi Posted March 2

I noticed some very strange settings for the Parity Check Tuning plugin: it is set to resume at 0:15 and pause at 0:30, a total runtime of only 15 minutes. I mention this because one of the last things that happened before the crash seemed to be a parity check running.

Can your PSU reliably handle the load of all drives being active at the same time?
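To show why a 0:15 to 0:30 window looks odd, here is a small arithmetic sketch. It assumes one resume/pause window per day and a constant check speed, and it uses the roughly 19 hours that the original post gives for a full parity check. The function names are made up for illustration and have nothing to do with the plugin itself.

```python
import math


def window_minutes(resume: str, pause: str) -> int:
    """Minutes between resume and pause (H:MM), wrapping past midnight."""
    def to_minutes(clock: str) -> int:
        hours, minutes = clock.split(":")
        return int(hours) * 60 + int(minutes)

    return (to_minutes(pause) - to_minutes(resume)) % (24 * 60)


def days_to_finish(total_hours: float, resume: str, pause: str) -> int:
    """Number of daily windows needed to complete a check of total_hours."""
    per_day = window_minutes(resume, pause)
    if per_day == 0:
        raise ValueError("resume and pause times are identical")
    return math.ceil(total_hours * 60 / per_day)


print(window_minutes("0:15", "0:30"))      # 15
print(days_to_finish(19, "0:15", "0:30"))  # 76
```

Under these assumptions, a 19-hour check run in 15-minute increments would need 76 nights to complete.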
corgan Posted March 7

After I removed the second (ATI) GPU, I had hoped the system would be more stable. Sadly, it crashed again after 3 days of uptime. The stop was at 21:02.

IPMI log:

[screenshot]

The last Unraid log entry was written 6 minutes earlier, so it looks like nothing was logged at the time of the crash. The IPMI log says that the video controller had a failure, but not which one. I have an RTX 2060 and the onboard VGA in the system.

How can I deactivate the onboard VGA for Unraid? I don't think I can do that in the BIOS.

Greetz,
Stefan
JorgeB Posted March 8

12 hours ago, corgan said:
How can I deactivate the onboard VGA for unraid?

You can blacklist the ast driver:

https://docs.unraid.net/unraid-os/release-notes/6.10.0#linux-kernel
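As a rough illustration of what blacklisting a driver involves, the sketch below writes a standard modprobe `blacklist` line to a config file. The directory `/boot/config/modprobe.d` is an assumption about where Unraid picks up such files at boot, so check it against the linked release notes first. The helper names are hypothetical.

```python
from pathlib import Path


def blacklist_conf(module: str) -> tuple[str, str]:
    """Return (file name, file content) for a modprobe blacklist entry."""
    return f"{module}.conf", f"blacklist {module}\n"


def write_blacklist(module: str, conf_dir: str = "/boot/config/modprobe.d") -> Path:
    """Write the blacklist file into conf_dir.

    The default directory is an assumption, not something confirmed in
    this thread.
    """
    name, content = blacklist_conf(module)
    target = Path(conf_dir) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```

For the onboard ASPEED graphics this would be `write_blacklist("ast")`, followed by a reboot so the driver is no longer loaded.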
corgan Posted March 17

Hey all,

after removing the GPU driver, the GPU errors in the IPMI logs are gone. That's good.

But tonight the server crashed again, this time with a different error. The server was not reachable from the network (100% packet loss), but I could log in via the onboard KVM and use the terminal. Strangely, a ping from the server to the network was successful.

Now I have a ton of these errors in the log:

Quote
Mar 17 04:00:31 serva4 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar 17 04:00:31 serva4 kernel: rcu: 0-...!: (0 ticks this GP) idle=bba0/0/0x0 softirq=79212217/79212217 fqs=2 (false positive?)
Mar 17 04:00:31 serva4 kernel: (detected by 14, t=60002 jiffies, g=98489137, q=1178260 ncpus=16)
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
Mar 17 04:00:31 serva4 kernel: NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x7/0xc
Mar 17 04:00:31 serva4 kernel: rcu: rcu_preempt kthread timer wakeup didn't happen for 60002 jiffies! g98489137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Mar 17 04:00:31 serva4 kernel: rcu: Possible timer handling issue on cpu=0 timer-softirq=5734298
Mar 17 04:00:31 serva4 kernel: rcu: rcu_preempt kthread starved for 60003 jiffies! g98489137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
Mar 17 04:00:31 serva4 kernel: rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Mar 17 04:00:31 serva4 kernel: rcu: RCU grace-period kthread stack dump:
Mar 17 04:00:31 serva4 kernel: task:rcu_preempt state:I stack:0 pid:15 ppid:2 flags:0x00004000
Mar 17 04:00:31 serva4 kernel: Call Trace:
Mar 17 04:00:31 serva4 kernel: <TASK>
Mar 17 04:00:31 serva4 kernel: __schedule+0x5b2/0x612
Mar 17 04:00:31 serva4 kernel: ? _raw_spin_unlock_irqrestore+0x24/0x3a
Mar 17 04:00:31 serva4 kernel: ? __mod_timer+0x207/0x232
Mar 17 04:00:31 serva4 kernel: ? rcu_gp_init+0x494/0x494
Mar 17 04:00:31 serva4 kernel: schedule+0x8e/0xcc
Mar 17 04:00:31 serva4 kernel: schedule_timeout+0x9d/0xd7
Mar 17 04:00:31 serva4 kernel: ? __next_timer_interrupt+0xf6/0xf6
Mar 17 04:00:31 serva4 kernel: rcu_gp_fqs_loop+0x12d/0x475
Mar 17 04:00:31 serva4 kernel: rcu_gp_kthread+0x151/0x16d
Mar 17 04:00:31 serva4 kernel: kthread+0xe4/0xef
Mar 17 04:00:31 serva4 kernel: ? kthread_complete_and_exit+0x1b/0x1b
Mar 17 04:00:31 serva4 kernel: ret_from_fork+0x1f/0x30
Mar 17 04:00:31 serva4 kernel: </TASK>
Mar 17 04:00:31 serva4 kernel: rcu: Stack dump where RCU GP kthread last ran:
Mar 17 04:00:31 serva4 kernel: Sending NMI from CPU 14 to CPUs 0:
Mar 17 04:00:31 serva4 kernel: NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x7/0xc

The CPU number (as in "Sending NMI from CPU 14 to CPUs 0:") iterated from 0 to 15.

I then tried to stop the array from the command line and reboot, but the server got stuck and hung for over an hour. Then I sent CTRL+ALT+DEL, but it got stuck at the "starting diagnostic collection" step.

So I had to send a "power cycle" via IPMI to reboot, and now I have to run the parity check again.
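With "a ton" of these messages, it can help to tally which CPUs were reported as stalled and which ones detected the stall, rather than reading the log by eye. The sketch below does that for syslog text in the format quoted above. The regular expressions are written against those quoted lines only, so other kernel versions may word the messages differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# "rcu: 0-...!: (0 ticks this GP) ..." -> the stalled CPU is the leading number.
STALLED_RE = re.compile(r"rcu:\s+(\d+)-\S+: \(")
# "(detected by 14, t=60002 jiffies, ..." -> detecting CPU and stall length.
DETECTED_RE = re.compile(r"\(detected by (\d+), t=(\d+) jiffies")


def summarize_stalls(log_text: str) -> dict:
    """Count stalled and detecting CPUs and collect the stall lengths."""
    stalled: Counter = Counter()
    detectors: Counter = Counter()
    jiffies: list[int] = []
    for line in log_text.splitlines():
        match = STALLED_RE.search(line)
        if match:
            stalled[int(match.group(1))] += 1
        match = DETECTED_RE.search(line)
        if match:
            detectors[int(match.group(1))] += 1
            jiffies.append(int(match.group(2)))
    return {"stalled": stalled, "detectors": detectors, "jiffies": jiffies}
```

Feeding the saved syslog through `summarize_stalls(open("syslog").read())` (the file name is a placeholder) would show whether the reports are spread evenly over CPUs 0 to 15 or concentrated on a few of them.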