Call Traces during Parity Sync

boxer74 · August 7, 2018

Last night my server crashed: Docker containers non-responsive, no local terminal, no web UI. I did a hard reset, and during the subsequent parity sync I'm getting call traces. While these occur, my Windows 10 VM is unresponsive. The rest of the server seems to work fine. Any thoughts as to why this would be happening all of a sudden?

Quote

Aug 7 11:56:46 unRAID kernel: INFO: rcu_sched self-detected stall on CPU
Aug 7 11:56:46 unRAID kernel: 29-...: (60000 ticks this GP) idle=17e/140000000000001/0 softirq=1555787/1555787 fqs=14876
Aug 7 11:56:46 unRAID kernel: (t=60001 jiffies g=1556548 c=1556547 q=27686)
Aug 7 11:56:46 unRAID kernel: NMI backtrace for cpu 29
Aug 7 11:56:46 unRAID kernel: CPU: 29 PID: 5876 Comm: unraidd Not tainted 4.14.49-unRAID #1
Aug 7 11:56:46 unRAID kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C602, BIOS P1.80 12/09/2013
Aug 7 11:56:46 unRAID kernel: Call Trace:
Aug 7 11:56:46 unRAID kernel: <IRQ>
Aug 7 11:56:46 unRAID kernel: dump_stack+0x5d/0x79
Aug 7 11:56:46 unRAID kernel: nmi_cpu_backtrace+0x9b/0xba
Aug 7 11:56:46 unRAID kernel: ? irq_force_complete_move+0xf3/0xf3
Aug 7 11:56:46 unRAID kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Aug 7 11:56:46 unRAID kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Aug 7 11:56:46 unRAID kernel: rcu_check_callbacks+0x212/0x5f0
Aug 7 11:56:46 unRAID kernel: update_process_times+0x23/0x45
Aug 7 11:56:46 unRAID kernel: tick_sched_timer+0x33/0x61
Aug 7 11:56:46 unRAID kernel: __hrtimer_run_queues+0x78/0xc1
Aug 7 11:56:46 unRAID kernel: hrtimer_interrupt+0x87/0x157
Aug 7 11:56:46 unRAID kernel: smp_apic_timer_interrupt+0x75/0x85
Aug 7 11:56:46 unRAID kernel: apic_timer_interrupt+0x7d/0x90
Aug 7 11:56:46 unRAID kernel: </IRQ>
Aug 7 11:56:46 unRAID kernel: RIP: 0010:memcmp+0x7/0x1d
Aug 7 11:56:46 unRAID kernel: RSP: 0018:ffffc9000718bcd0 EFLAGS: 00000283 ORIG_RAX: ffffffffffffff10
Aug 7 11:56:46 unRAID kernel: RAX: 0000000000000000 RBX: ffff8808482542f0 RCX: 0000000000000283
Aug 7 11:56:46 unRAID kernel: RDX: 0000000000001000 RSI: ffff88085b504000 RDI: ffff880859e0b000
Aug 7 11:56:46 unRAID kernel: RBP: 00000000ffffffff R08: 000000000000006d R09: ffff880848254358
Aug 7 11:56:46 unRAID kernel: R10: 0000000000000fd0 R11: 0000000000000ff0 R12: ffff880847940800
Aug 7 11:56:46 unRAID kernel: R13: 0000000000000001 R14: ffff880848254330 R15: 0000000000000008
Aug 7 11:56:46 unRAID kernel: check_parity+0x206/0x30b [md_mod]
Aug 7 11:56:46 unRAID kernel: ? autoremove_wake_function+0x9/0x2a
Aug 7 11:56:46 unRAID kernel: ? __wake_up_common+0xb2/0x126
Aug 7 11:56:46 unRAID kernel: handle_stripe+0xefc/0x1293 [md_mod]
Aug 7 11:56:46 unRAID kernel: unraidd+0xb8/0x111 [md_mod]
Aug 7 11:56:46 unRAID kernel: ? md_open+0x2c/0x2c [md_mod]
Aug 7 11:56:46 unRAID kernel: ? md_thread+0xbc/0xcc [md_mod]
Aug 7 11:56:46 unRAID kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Aug 7 11:56:46 unRAID kernel: md_thread+0xbc/0xcc [md_mod]
Aug 7 11:56:46 unRAID kernel: ? wait_woken+0x68/0x68
Aug 7 11:56:46 unRAID kernel: kthread+0x111/0x119
Aug 7 11:56:46 unRAID kernel: ? kthread_create_on_node+0x3a/0x3a
Aug 7 11:56:46 unRAID kernel: ? SyS_exit_group+0xb/0xb
Aug 7 11:56:46 unRAID kernel: ret_from_fork+0x35/0x40

Frank1940 · August 7, 2018

Post the entire diagnostics file in your next post (Tools >>> Diagnostics).

boxer74 · August 7, 2018

Attached.

unraid-diagnostics-20180807-1653.zip

Squid · August 7, 2018

FYI, there are no call traces in the diagnostics. Maybe start a parity check, see if a trace shows up, and then repost the diagnostics.
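For anyone wanting to check a syslog for these traces before reposting, here is a minimal Python sketch. It assumes the log lines look like the ones quoted in the first post; the function name `summarize_stalls` and the regular expressions are illustrative, not part of unRAID or any of its tools.

```python
import re
from collections import Counter

# Syslog prefix as seen in the quoted log, e.g. "Aug 7 11:56:46 unRAID kernel: "
PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s?")
# Stack frame such as "check_parity+0x206/0x30b [md_mod]", optionally led by "? "
FRAME = re.compile(
    r"^(\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(\s+\[(?P<mod>\w+)\])?$"
)
NMI = re.compile(r"^NMI backtrace for cpu (\d+)")


def summarize_stalls(lines):
    """Count RCU stall reports, the CPUs they name, and modules in the traces.

    Frames prefixed with "? " are skipped, since the kernel marks those as
    unreliable stack entries.
    """
    stalls = 0
    cpus = Counter()
    modules = Counter()
    for raw in lines:
        msg = PREFIX.sub("", raw.strip()).strip()
        if "self-detected stall on CPU" in msg:
            stalls += 1
        nmi = NMI.match(msg)
        if nmi:
            cpus[int(nmi.group(1))] += 1
        frame = FRAME.match(msg)
        if frame and frame.group("mod") and not frame.group(1):
            modules[frame.group("mod")] += 1
    return stalls, cpus, modules
```

Pass it any iterable of lines, for example an open file handle on the syslog extracted from the diagnostics zip. A stall count of zero would match the observation that the uploaded diagnostics contain no traces.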

boxer74 · August 7, 2018

Here's a proper file.

unraid-diagnostics-20180807-1808.zip

Squid · August 7, 2018

Just for giggles, under Settings, Disk Settings, Tunable (nr_requests), try setting it to 8 instead of the default of 128.

IIRC, when self-detected stalls were a big issue on unRAID, the original setting was 8, and the fix was 128. Try setting it to the reverse and see if any improvement happens.
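To confirm what value is actually in effect after changing the setting, a rough Python sketch follows. It reads the standard Linux sysfs attribute `/sys/block/<device>/queue/nr_requests`; that the unRAID "Tunable (nr_requests)" setting maps onto this attribute for the array disks is an assumption here, and the function name `read_nr_requests` is made up for illustration.

```python
from pathlib import Path


def read_nr_requests(sys_block="/sys/block"):
    """Return {device_name: nr_requests} for every block device exposing it.

    Assumption: the GUI tunable ends up in this per-device sysfs attribute.
    Devices without a readable queue/nr_requests file are skipped.
    """
    result = {}
    for dev in sorted(Path(sys_block).iterdir()):
        attr = dev / "queue" / "nr_requests"
        try:
            result[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue
    return result
```

Calling `read_nr_requests()` before and after the change shows whether the disks picked up 8 or are still at 128.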

boxer74 · August 7, 2018

Seems like that made it worse.

boxer74 · August 8, 2018

Think it's bad RAM?

JorgeB · August 8, 2018

There was a user with similar issues who got rid of most of them by reducing the tunables values (not nr_requests, the md tunables). Even if it helps, though, I think this is just a workaround and there's something wrong with the hardware.

boxer74 · August 8, 2018

I set the tunables back to defaults and it seems to have stopped the call traces. How could I tell what hardware issues there may be?
