
Unraid Causing NMI Uncorrectable PCI Express Error


36ve


I have a DL380p G8 with 12 drives installed.

 

Over the last month, the server has occasionally crashed after a few hours of operation. The iLO log shows:

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 0, Error status 0x00000020)

Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible
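For anyone wanting to read the error status value, here is a minimal sketch that lists which bits are set. It assumes (this is not confirmed anywhere in the iLO message) that the value mirrors the PCIe AER Uncorrectable Error Status register, so the bit names below only apply if that assumption holds. The names `UNCORRECTABLE_BITS` and `decode_status` are made up for this example.

```python
# Assumption: the iLO "Error status" value follows the layout of the
# PCIe AER Uncorrectable Error Status register. If iLO uses a different
# encoding, the names below do not apply.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_status(status):
    """Return a name for every bit that is set in a 32-bit status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits missing from the table are reported by position only.
            names.append(UNCORRECTABLE_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


if __name__ == "__main__":
    # Value taken from the iLO entry quoted above.
    print(decode_status(0x00000020))
```

Under that assumption, the single set bit would point at a link to the device dropping unexpectedly rather than at a data error, but treat this as a hint only.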

 

I have tried the drives, together with the USB stick that contains Unraid, in another DL380p G8 with completely different hardware, and the same issue still occurs. I also replaced the USB stick, since the old one was suspected to be the cause, but the issue persists.

 

We ran Unraid in safe mode for all testing, so that only our Docker containers were running and every other plugin/app was disabled.

 

Any help is much appreciated, as I cannot rebuild the current system: I do not have enough drives to move some of the data off to a new system.

galar-diagnostics-20220202-0909.zip

Feb  2 12:54:17 Galar kernel: ------------[ cut here ]------------
Feb  2 12:54:17 Galar kernel: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Feb  2 12:54:17 Galar kernel: WARNING: CPU: 22 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0xcf/0x12b
Feb  2 12:54:17 Galar kernel: Modules linked in: xt_mark veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap macvlan xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding tg3 sb_edac ipmi_ssif i2c_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd hpsa glue_helper rapl scsi_transport_sas intel_cstate intel_uncore acpi_power_meter ata_piix acpi_ipmi thermal button ipmi_si [last unloaded: tg3]
Feb  2 12:54:17 Galar kernel: CPU: 22 PID: 0 Comm: swapper/22 Tainted: G          I       5.10.28-Unraid #1
Feb  2 12:54:17 Galar kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 05/24/2019
Feb  2 12:54:17 Galar kernel: RIP: 0010:dev_watchdog+0xcf/0x12b
Feb  2 12:54:17 Galar kernel: Code: 79 b7 00 00 75 38 48 89 ef c6 05 63 79 b7 00 01 e8 79 dd fc ff 44 89 e1 48 89 ee 48 c7 c7 ef 7f de 81 48 89 c2 e8 50 16 10 00 <0f> 0b eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9d eb 16 48 8b
Feb  2 12:54:17 Galar kernel: RSP: 0018:ffffc90006898ed8 EFLAGS: 00010286
Feb  2 12:54:17 Galar kernel: RAX: 0000000000000000 RBX: ffff88812613c438 RCX: 0000000000000027
Feb  2 12:54:17 Galar kernel: RDX: 00000000ffffbfff RSI: 0000000000000001 RDI: ffff888a17918920
Feb  2 12:54:17 Galar kernel: RBP: ffff88812613c000 R08: 0000000000000000 R09: 00000000ffffbfff
Feb  2 12:54:17 Galar kernel: R10: ffffc90006898d08 R11: ffffc90006898d00 R12: 0000000000000000
Feb  2 12:54:17 Galar kernel: R13: ffffc90006898f10 R14: ffffc90006898f10 R15: ffffffff820060c8
Feb  2 12:54:17 Galar kernel: FS:  0000000000000000(0000) GS:ffff888a17900000(0000) knlGS:0000000000000000
Feb  2 12:54:17 Galar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  2 12:54:17 Galar kernel: CR2: 0000146096259a24 CR3: 000000000200a001 CR4: 00000000001706e0
Feb  2 12:54:17 Galar kernel: Call Trace:
Feb  2 12:54:17 Galar kernel: <IRQ>
Feb  2 12:54:17 Galar kernel: call_timer_fn.isra.0+0x12/0x6f
Feb  2 12:54:17 Galar kernel: ? netif_tx_lock+0x7a/0x7a
Feb  2 12:54:17 Galar kernel: __run_timers.part.0+0x144/0x185
Feb  2 12:54:17 Galar kernel: ? update_process_times+0x68/0x6e
Feb  2 12:54:17 Galar kernel: ? hrtimer_forward+0x73/0x7b
Feb  2 12:54:17 Galar kernel: ? tick_sched_timer+0x5a/0x64
Feb  2 12:54:17 Galar kernel: ? timerqueue_add+0x62/0x68
Feb  2 12:54:17 Galar kernel: ? recalibrate_cpu_khz+0x1/0x1
Feb  2 12:54:17 Galar kernel: run_timer_softirq+0x21/0x43
Feb  2 12:54:17 Galar kernel: __do_softirq+0xc4/0x1c2
Feb  2 12:54:17 Galar kernel: asm_call_irq_on_stack+0x12/0x20
Feb  2 12:54:17 Galar kernel: </IRQ>
Feb  2 12:54:17 Galar kernel: do_softirq_own_stack+0x2c/0x39
Feb  2 12:54:17 Galar kernel: __irq_exit_rcu+0x45/0x80
Feb  2 12:54:17 Galar kernel: sysvec_apic_timer_interrupt+0x87/0x95
Feb  2 12:54:17 Galar kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb  2 12:54:17 Galar kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Feb  2 12:54:17 Galar kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Feb  2 12:54:17 Galar kernel: RSP: 0018:ffffc90006377ea0 EFLAGS: 00000246
Feb  2 12:54:17 Galar kernel: RAX: ffff888a17922380 RBX: 0000000000000004 RCX: 000000000000001f
Feb  2 12:54:17 Galar kernel: RDX: 0000000000000000 RSI: 000000002dd30fdd RDI: 0000000000000000
Feb  2 12:54:17 Galar kernel: RBP: ffffe8f5feb3fa00 R08: 00000c6a45ea0e72 R09: 000000000000038d
Feb  2 12:54:17 Galar kernel: R10: 000000000000038d R11: 071c71c71c71c71c R12: 00000c6a45ea0e72
Feb  2 12:54:17 Galar kernel: R13: ffffffff820c5dc0 R14: 0000000000000004 R15: 0000000000000000
Feb  2 12:54:17 Galar kernel: cpuidle_enter_state+0x101/0x1c4
Feb  2 12:54:17 Galar kernel: cpuidle_enter+0x25/0x31
Feb  2 12:54:17 Galar kernel: do_idle+0x1a6/0x214
Feb  2 12:54:17 Galar kernel: cpu_startup_entry+0x18/0x1a
Feb  2 12:54:17 Galar kernel: secondary_startup_64_no_verify+0xb0/0xbb
Feb  2 12:54:17 Galar kernel: ---[ end trace ab6d36e9d5980c46 ]---
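To find traces like the one above in a longer syslog, a small script can help. This is only a sketch: the function names `find_watchdog_timeouts` and `loaded_modules` are made up for the example, the sample lines are shortened copies of the log above, and the regular expression is written against the message format seen in this log, so other kernel versions may word it differently.

```python
import re

# Matches lines such as:
#   NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return (interface, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((m.group("iface"), m.group("driver"), int(m.group("queue"))))
    return hits


def loaded_modules(lines):
    """Collect module names from any 'Modules linked in:' lines."""
    marker = "Modules linked in:"
    mods = set()
    for line in lines:
        if marker in line:
            tail = line.split(marker, 1)[1]
            # Drop the trailing "[last unloaded: ...]" note if present.
            tail = tail.split("[", 1)[0]
            mods.update(tail.split())
    return mods


# Shortened sample taken from the trace above.
SAMPLE = [
    "Feb  2 12:54:17 Galar kernel: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out",
    "Feb  2 12:54:17 Galar kernel: Modules linked in: xt_mark veth macvlan bonding tg3 [last unloaded: tg3]",
]

if __name__ == "__main__":
    print(find_watchdog_timeouts(SAMPLE))
    print("macvlan loaded:", "macvlan" in loaded_modules(SAMPLE))
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of `SAMPLE`.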

 

Quote

macvlan

I guess that you are using custom IP addresses for one or more Docker containers.

 

I would suggest updating to 6.10.0 RC2 and switching your Docker network from MACVLAN to IPVLAN (Settings > Docker).


  

32 minutes ago, Squid said:

Why are you running with the VM settings set to ACS override = downstream if you're not running any VMs with passthrough? The most stable way to set up a server is with ACS override disabled unless you absolutely need it.

I had removed the VMs that used passthrough, as they were no longer used or needed; the setting has simply been left like this for a while now. I will try changing it and see if that makes a difference.
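A quick way to confirm whether the override is still active after changing the setting and rebooting is to look at the kernel command line. The sketch below assumes the setting shows up as a `pcie_acs_override=` parameter on the boot line, which is how I understand Unraid applies it; the function name `acs_override_value` and the sample command line are made up for the example.

```python
def acs_override_value(cmdline):
    """Return the value of pcie_acs_override from a kernel command line,
    or None if the parameter is not present."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "pcie_acs_override" and sep:
            return value
    return None


# Hypothetical command line, for illustration only.
EXAMPLE_CMDLINE = "BOOT_IMAGE=/bzimage initrd=/bzroot pcie_acs_override=downstream"

if __name__ == "__main__":
    try:
        # On a running Linux system the active command line is in /proc/cmdline.
        with open("/proc/cmdline") as fh:
            current = fh.read()
    except OSError:
        # Fall back to the example when not running on Linux.
        current = EXAMPLE_CMDLINE
    print("pcie_acs_override:", acs_override_value(current))
```

If the function returns `None` after the reboot, the override is no longer being passed to the kernel.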
