Jump to content

[6.10.3] Unraid freeze, only way unclean shutdown


Wirman

Recommended Posts

Hi,

 

i have a problem, every once in a while (about 3-4 times a month), my Unraid System freezes.

Freeze means: I am not able to remotely connect to the WebGUI, and the VMs and Dockers are also not reachable. If i put on the monitor connected to the server, it does not recognize the source. If i replug the monitor, i see the terminal log in screen, but it does not respond to keyboard inputs. The fans are working normally, the computer is still on. All i can do, is a hard shutdown by long pressing the power button. I have attached my log file. I have no idea what triggers that behavior, sometimes I notice that the system is frozen only after several hours.

 

Can someone read out of the diagnostics file, why the server freezes? Or what kind of freeze that is?

Is there a way to do a clean shutdown with a short press of the power button? Is there a setting for that?

 

Thanks in advance

Wirman

jarvis-diagnostics-20220917-2139.zip

Link to comment
  • 3 weeks later...

Okay, now i had the freeze again, with Mirror syslog to flash enabled.

But where can i find the logs now?

This thread:

says, that they are in the /logs folder, but i do not have such folder on the usb drive.

grafik.thumb.png.d1ce3e69870f38a1a49c262ea6cd3cf1.png

 

A "find -name logs" in root directory only shows me some log folders in docker and the boot/logs, but there are only the previous diagnostics i made in there. and the syslog file (/boot/logs/syslog) starts also with the time after reboot, so this is not the appended file.

Am i missing something?

 

Thanks in advance

Link to comment

Okay, the syslog was appended, i don't know, why i haven't seen this before.

Attached is the syslog. I restarted the server on Oct 11 00:26:13. Before i have a self detected stall on cpu and the interface cannot find the router. These messages started to show on Oct 10 21:43:39. In the log before there is always a renewing of adresses on the interfaces, is this normal behavior?

syslog

Link to comment

If you mean these:

 

Oct 15 20:07:20 jarvis avahi-daemon[3083]: Registering new address record for 2a02:810d:4b40:496d::13da on br0.10.*.
Oct 15 20:07:20 jarvis avahi-daemon[3083]: Withdrawing address record for 2a02:810d:4b40:496d::2000 on br0.10.
Oct 15 20:07:30 jarvis avahi-daemon[3083]: Withdrawing address record for 2a02:810d:4b40:496f::2000 on br0.100.
Oct 15 20:07:31 jarvis dhcpcd[1705]: br0.100: DHCPv6 REPLY: Address not bound to this interface.
Oct 15 20:07:32 jarvis dhcpcd[1705]: br0.100: ADV 2a02:810d:4b40:496f::13da/128 from fe80::21b:21ff:feaf:6054
Oct 15 20:07:32 jarvis dhcpcd[1705]: br0.100: REPLY6 received from fe80::21b:21ff:feaf:6054
Oct 15 20:07:32 jarvis dhcpcd[1705]: br0.100: adding address 2a02:810d:4b40:496f::13da/128

 

They are normal.

Link to comment

okay, so you think, the switch to ipvlan will resolve the error, why the cpu is stalling?

Oct 10 21:43:39 jarvis kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Oct 10 21:43:39 jarvis kernel: rcu: 	12-....: (60000 ticks this GP) idle=03f/1/0x4000000000000000 softirq=10533211/10533211 fqs=14583 
Oct 10 21:43:39 jarvis kernel: 	(t=60000 jiffies g=36180177 q=58701)
Oct 10 21:43:39 jarvis kernel: NMI backtrace for cpu 12
Oct 10 21:43:39 jarvis kernel: CPU: 12 PID: 16137 Comm: kworker/u40:6 Tainted: G        W  O      5.15.46-Unraid #1
Oct 10 21:43:39 jarvis kernel: Hardware name: Gigabyte Technology Co., Ltd. W480M VISION W/W480M VISION W, BIOS F21 11/23/2021
Oct 10 21:43:39 jarvis kernel: Workqueue: events_power_efficient gc_worker [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: Call Trace:
Oct 10 21:43:39 jarvis kernel: <IRQ>
Oct 10 21:43:39 jarvis kernel: dump_stack_lvl+0x46/0x5a
Oct 10 21:43:39 jarvis kernel: nmi_cpu_backtrace+0xae/0xd2
Oct 10 21:43:39 jarvis kernel: ? lapic_can_unplug_cpu+0x93/0x93
Oct 10 21:43:39 jarvis kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Oct 10 21:43:39 jarvis kernel: rcu_dump_cpu_stacks+0xc3/0xea
Oct 10 21:43:39 jarvis kernel: rcu_sched_clock_irq+0x22d/0x631
Oct 10 21:43:39 jarvis kernel: ? trigger_load_balance+0x7a/0x292
Oct 10 21:43:39 jarvis kernel: ? tick_sched_do_timer+0x3e/0x3e
Oct 10 21:43:39 jarvis kernel: update_process_times+0x8c/0xab
Oct 10 21:43:39 jarvis kernel: tick_sched_timer+0x38/0x65
Oct 10 21:43:39 jarvis kernel: __hrtimer_run_queues+0xf8/0x18a
Oct 10 21:43:39 jarvis kernel: hrtimer_interrupt+0x92/0x160
Oct 10 21:43:39 jarvis kernel: __sysvec_apic_timer_interrupt+0x96/0xdb
Oct 10 21:43:39 jarvis kernel: sysvec_apic_timer_interrupt+0x61/0x7d
Oct 10 21:43:39 jarvis kernel: </IRQ>
Oct 10 21:43:39 jarvis kernel: <TASK>
Oct 10 21:43:39 jarvis kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Oct 10 21:43:39 jarvis kernel: RIP: 0010:gc_worker+0x140/0x30b [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: Code: 00 0f 9e c1 e9 83 01 00 00 41 8b 45 08 48 8b 15 b1 04 eb e1 29 d0 85 c0 7f 10 4c 89 ef 41 ff c6 e8 57 fb ff ff e9 d8 00 00 00 <41> 8b 45 08 48 8b 15 90 04 eb e1 29 d0 ba 00 00 00 00 0f 48 c2 ba
Oct 10 21:43:39 jarvis kernel: RSP: 0018:ffffc9000293fe48 EFLAGS: 00000202
Oct 10 21:43:39 jarvis kernel: RAX: 000000000ed17370 RBX: 00000000000493df RCX: ffff8881ad600000
Oct 10 21:43:39 jarvis kernel: RDX: 000000010ed39b66 RSI: ffffc9000293fe54 RDI: ffff8885c23d6a48
Oct 10 21:43:39 jarvis kernel: RBP: 0000000000015b2d R08: 0000000000000000 R09: 0000000080190017
Oct 10 21:43:39 jarvis kernel: R10: ffff88810319e3c0 R11: ffff88810319e3c0 R12: ffffffffa036d620
Oct 10 21:43:39 jarvis kernel: R13: ffff8885c23d6a00 R14: 0000000000000030 R15: ffff8885c23d6a48
Oct 10 21:43:39 jarvis kernel: ? gc_worker+0xb2/0x30b [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: process_one_work+0x195/0x27a
Oct 10 21:43:39 jarvis kernel: worker_thread+0x19c/0x240
Oct 10 21:43:39 jarvis kernel: ? rescuer_thread+0x28b/0x28b
Oct 10 21:43:39 jarvis kernel: kthread+0xdc/0xe3
Oct 10 21:43:39 jarvis kernel: ? set_kthread_struct+0x32/0x32
Oct 10 21:43:39 jarvis kernel: ret_from_fork+0x1f/0x30
Oct 10 21:43:39 jarvis kernel: </TASK>

 

Link to comment

and again i had to shutdown the system. Attached ist the log since the relevant errors. Unfortunately i don't understand these errors, i have no idea about the kernel.

It starts with this warning:

Oct 17 00:27:59 jarvis kernel: WARNING: CPU: 4 PID: 27239 at arch/x86/kvm/mmu/mmu.c:3835 kvm_mmu_page_fault+0x1e8/0x4b5 [kvm]

after that, the system was still working 13 hours, but then the following appears:

Oct 17 13:32:38 jarvis kernel: Invalid SPTE change: cannot replace a present leaf
Oct 17 13:32:38 jarvis kernel: SPTE with another present leaf SPTE mapping a
Oct 17 13:32:38 jarvis kernel: different PFN!
Oct 17 13:32:38 jarvis kernel: as_id: 0 gfn: 127644 old_spte: ffff8881057ddd58 new_spte: 6000003f6e44b77 level: 1
Oct 17 13:32:38 jarvis kernel: ------------[ cut here ]------------
Oct 17 13:32:38 jarvis kernel: kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:446!
Oct 17 13:32:38 jarvis kernel: invalid opcode: 0000 [#1] SMP NOPTI
Oct 17 13:32:38 jarvis kernel: CPU: 8 PID: 29233 Comm: CPU 0/KVM Tainted: G        W  O      5.15.46-Unraid #1
Oct 17 13:32:38 jarvis kernel: Hardware name: Gigabyte Technology Co., Ltd. W480M VISION W/W480M VISION W, BIOS F21 11/23/2021
Oct 17 13:32:38 jarvis kernel: RIP: 0010:__handle_changed_spte+0x113/0x42d [kvm]
...

there it says something of a kernel bug. But what can i do now with this information? After that, the cpu is again on stall

Oct 17 13:33:38 jarvis kernel: rcu: INFO: rcu_sched self-detected stall on CPU

That's really annoying and i am really lost.

 

In the kernel code i found this:

/**
 * __handle_changed_spte - handle bookkeeping associated with an SPTE change
 * @kvm: kvm instance
 * @as_id: the address space of the paging structure the SPTE was a part of
 * @gfn: the base GFN that was mapped by the SPTE
 * @old_spte: The value of the SPTE before the change
 * @new_spte: The value of the SPTE after the change
 * @level: the level of the PT the SPTE is part of in the paging structure
 * @shared: This operation may not be running under the exclusive use of
 *	    the MMU lock and the operation must synchronize with other
 *	    threads that might be modifying SPTEs.
 *
 * Handle bookkeeping that might result from the modification of a SPTE.
 * This function must be called for all TDP SPTE modifications.
 */
static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
				  u64 old_spte, u64 new_spte, int level,
				  bool shared)
{
	bool was_present = is_shadow_present_pte(old_spte);
	bool is_present = is_shadow_present_pte(new_spte);
	bool was_leaf = was_present && is_last_spte(old_spte, level);
	bool is_leaf = is_present && is_last_spte(new_spte, level);
	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);

	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
	WARN_ON(level < PG_LEVEL_4K);
	WARN_ON(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));

	/*
	 * If this warning were to trigger it would indicate that there was a
	 * missing MMU notifier or a race with some notifier handler.
	 * A present, leaf SPTE should never be directly replaced with another
	 * present leaf SPTE pointing to a different PFN. A notifier handler
	 * should be zapping the SPTE before the main MM's page table is
	 * changed, or the SPTE should be zeroed, and the TLBs flushed by the
	 * thread before replacement.
	 */
	if (was_leaf && is_leaf && pfn_changed) {
		pr_err("Invalid SPTE change: cannot replace a present leaf\n"
		       "SPTE with another present leaf SPTE mapping a\n"
		       "different PFN!\n"
		       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
		       as_id, gfn, old_spte, new_spte, level);

		/*
		 * Crash the host to prevent error propagation and guest data
		 * corruption.
		 */
		BUG();
	}

 

What is the SPTE? Line 446 is the BUG(); line of course.

What could cause this problem? Is it a hardware thing? Is it a software setting? Is it in the software itself?

Is it meaningful to upgrade to the new Unraid version, when in the new kernel version the  __handle_changed_spte function hasn't changed until the point, where BUG() occurs?

 

syslog

Link to comment
On 10/16/2022 at 10:06 AM, Wirman said:

so you think, the switch to ipvlan will resolve the error, why the cpu is stalling?

Nope, that's a different issue.

 

9 hours ago, Wirman said:

What could cause this problem? Is it a hardware thing? Is it a software setting? Is it in the software itself?

Difficult to say, one thing you can try is to boot the server in safe mode with all docker/VMs disable, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...