chesh Posted November 30, 2018 (edited)

I've been troubleshooting an issue with my Unraid server for the last month and have mostly been avoiding it by not running a parity check. At the beginning of the month, I noticed my Docker containers and VMs were running like crap while a parity check was running. At least, that's what I eventually figured out after downgrading back to 6.5.3, thinking it was an issue with the new 6.6.x releases. It started out with my Windows 7 VM being unresponsive and my containers having timeout issues. I eventually found the following in my logs:

Nov 29 11:22:08 Tower kernel: INFO: rcu_sched self-detected stall on CPU
Nov 29 11:22:08 Tower kernel: 30-...: (60000 ticks this GP) idle=a26/140000000000001/0 softirq=1132375/1132375 fqs=14147
Nov 29 11:22:08 Tower kernel: (t=60001 jiffies g=52263 c=52262 q=104379)
Nov 29 11:22:08 Tower kernel: NMI backtrace for cpu 30
Nov 29 11:22:08 Tower kernel: CPU: 30 PID: 11752 Comm: unraidd Not tainted 4.14.49-unRAID #1
Nov 29 11:22:08 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C602-4L/D16, BIOS P1.80 01/16/2014
Nov 29 11:22:08 Tower kernel: Call Trace:
Nov 29 11:22:08 Tower kernel: <IRQ>
Nov 29 11:22:08 Tower kernel: dump_stack+0x5d/0x79
Nov 29 11:22:08 Tower kernel: nmi_cpu_backtrace+0x9b/0xba
Nov 29 11:22:08 Tower kernel: ? irq_force_complete_move+0xf3/0xf3
Nov 29 11:22:08 Tower kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Nov 29 11:22:08 Tower kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Nov 29 11:22:08 Tower kernel: rcu_check_callbacks+0x212/0x5f0
Nov 29 11:22:08 Tower kernel: update_process_times+0x23/0x45
Nov 29 11:22:08 Tower kernel: tick_sched_timer+0x33/0x61
Nov 29 11:22:08 Tower kernel: __hrtimer_run_queues+0x78/0xc1
Nov 29 11:22:08 Tower kernel: hrtimer_interrupt+0x87/0x157
Nov 29 11:22:08 Tower kernel: smp_apic_timer_interrupt+0x75/0x85
Nov 29 11:22:08 Tower kernel: apic_timer_interrupt+0x7d/0x90
Nov 29 11:22:08 Tower kernel: </IRQ>
Nov 29 11:22:08 Tower kernel: RIP: 0010:memcmp+0x2/0x1d
Nov 29 11:22:08 Tower kernel: RSP: 0018:ffffc900077c7cd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Nov 29 11:22:08 Tower kernel: RAX: 0000000000000000 RBX: ffff881015ec0ce8 RCX: 0000000000000fd7
Nov 29 11:22:08 Tower kernel: RDX: 0000000000001000 RSI: ffff88103b417000 RDI: ffff881015ed7000
Nov 29 11:22:08 Tower kernel: RBP: ffff881015ed7000 R08: 00000000000000b6 R09: ffff881015ec0d88
Nov 29 11:22:08 Tower kernel: R10: 0000000000000fd0 R11: 0000000000000ff0 R12: ffff88103856c800
Nov 29 11:22:08 Tower kernel: R13: 0000000000000000 R14: ffff881015ec0d60 R15: 000000000000000f
Nov 29 11:22:08 Tower kernel: check_parity+0x27c/0x30b [md_mod]
Nov 29 11:22:08 Tower kernel: ? ttwu_do_wakeup.isra.4+0xd/0x84
Nov 29 11:22:08 Tower kernel: handle_stripe+0xefc/0x1293 [md_mod]
Nov 29 11:22:08 Tower kernel: unraidd+0xb8/0x111 [md_mod]
Nov 29 11:22:08 Tower kernel: ? md_open+0x2c/0x2c [md_mod]
Nov 29 11:22:08 Tower kernel: ? md_thread+0xbc/0xcc [md_mod]
Nov 29 11:22:08 Tower kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Nov 29 11:22:08 Tower kernel: md_thread+0xbc/0xcc [md_mod]
Nov 29 11:22:08 Tower kernel: ? wait_woken+0x68/0x68
Nov 29 11:22:08 Tower kernel: kthread+0x111/0x119
Nov 29 11:22:08 Tower kernel: ? kthread_create_on_node+0x3a/0x3a
Nov 29 11:22:08 Tower kernel: ret_from_fork+0x35/0x40
Nov 29 11:22:12 Tower kernel: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 30-... } 63749 jiffies s: 7381 root: 0x2/.
Nov 29 11:22:12 Tower kernel: blocking rcu_node structures: l=1:16-31:0x4000/.
Nov 29 11:22:12 Tower kernel: Task dump for CPU 30:
Nov 29 11:22:12 Tower kernel: unraidd R running task 0 11752 2 0x80000008
Nov 29 11:22:12 Tower kernel: Call Trace:
Nov 29 11:22:12 Tower kernel: ? md_open+0x2c/0x2c [md_mod]
Nov 29 11:22:12 Tower kernel: ? md_thread+0xbc/0xcc [md_mod]
Nov 29 11:22:12 Tower kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Nov 29 11:22:12 Tower kernel: ? md_thread+0xbc/0xcc [md_mod]
Nov 29 11:22:12 Tower kernel: ? wait_woken+0x68/0x68
Nov 29 11:22:12 Tower kernel: ? kthread+0x111/0x119
Nov 29 11:22:12 Tower kernel: ? kthread_create_on_node+0x3a/0x3a
Nov 29 11:22:12 Tower kernel: ? ret_from_fork+0x35/0x40

Is this a bad SATA/Molex power connector or a bad cable to this part of my backplane? Do I possibly have some ports going out? Any help would be much appreciated. Thanks!

Attachment: tower-diagnostics-20181130-1023.zip

Edited November 30, 2018 by chesh: uploaded diagnostics
JorgeB Posted November 30, 2018

This can usually be fixed by lowering the tunables, likely just md_sync_thresh. Start lowering it and keep trying lower numbers until the call traces stop.
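
To tell whether the call traces have stopped after each change, one option is to count the rcu_sched stall reports in the syslog after a parity check. The snippet below is only a minimal sketch: it assumes log lines in the same format as the excerpt in the first post, and the function name `count_rcu_stalls` and the sample text are illustrative, not part of Unraid.

```python
import re

# Matches the two kinds of stall header seen in the log excerpt above:
#   "... kernel: INFO: rcu_sched self-detected stall on CPU"
#   "... kernel: INFO: rcu_sched detected expedited stalls on CPUs/tasks: ..."
STALL_RE = re.compile(
    r"kernel: INFO: rcu_sched (self-detected stall|detected expedited stalls)"
)


def count_rcu_stalls(log_text: str) -> int:
    """Count rcu_sched stall reports in a block of syslog text (hypothetical helper)."""
    return sum(1 for line in log_text.splitlines() if STALL_RE.search(line))


# Sample lines copied from the log in the first post.
sample = """\
Nov 29 11:22:08 Tower kernel: INFO: rcu_sched self-detected stall on CPU
Nov 29 11:22:08 Tower kernel: NMI backtrace for cpu 30
Nov 29 11:22:12 Tower kernel: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 30-... } 63749 jiffies s: 7381 root: 0x2/.
"""

print(count_rcu_stalls(sample))
```

On a real server you would pass in the contents of the syslog file instead of `sample`. A count of zero after a full parity check would suggest the current md_sync_thresh value is low enough.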
chesh Posted December 3, 2018

That totally fixed my issue. Set it back to defaults and ran a parity check. No slowdowns and no errors in the logs. Thanks for the help!