Self-detected CPU stall errors when parity check invoked



See here for full syslog: http://lime-technology.com/forum/index.php?topic=27720.msg245463#msg245463

 

Jun  6 21:20:28 Tower1 logger:   /usr/sbin/rpc.mountd
Jun  6 21:20:28 Tower1 mountd[9899]: Kernel does not have pseudo root support.
Jun  6 21:20:28 Tower1 mountd[9899]: NFS v4 mounts will be disabled unless fsid=0
Jun  6 21:20:28 Tower1 mountd[9899]: is specfied in /etc/exports file.
Jun  6 21:20:28 Tower1 emhttp: shcmd (106): /usr/local/sbin/emhttp_event svcs_restarted
Jun  6 21:20:28 Tower1 emhttp_event: svcs_restarted
Jun  6 21:25:37 Tower1 kernel: mdcmd (61): check NOCORRECT
Jun  6 21:25:37 Tower1 kernel: md: recovery thread woken up ...
Jun  6 21:25:37 Tower1 kernel: md: recovery thread checking parity...
Jun  6 21:25:38 Tower1 kernel: md: using 1536k window, over a total of 2930273228 blocks.
Jun  6 22:33:49 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=17792 c=17791 q=12897)
Jun  6 22:33:49 Tower1 kernel: Pid: 9006, comm: unraidd Not tainted 3.9.3-unRAID #8
Jun  6 22:33:49 Tower1 kernel: Call Trace:
Jun  6 22:33:49 Tower1 kernel:  [<c1062abc>] print_cpu_stall+0xbc/0x107
Jun  6 22:33:49 Tower1 kernel:  [<c1062d4c>] __rcu_pending+0x4f/0x12a
Jun  6 22:33:49 Tower1 kernel:  [<c1062e9a>] rcu_check_callbacks+0x73/0x9b
Jun  6 22:33:49 Tower1 kernel:  [<c1032e89>] update_process_times+0x2d/0x53
Jun  6 22:33:49 Tower1 kernel:  [<c10550db>] tick_sched_timer+0x77/0xa1
Jun  6 22:33:49 Tower1 kernel:  [<c1040d4a>] ? __remove_hrtimer+0x25/0x7a
Jun  6 22:33:49 Tower1 kernel:  [<c1040e8d>] __run_hrtimer+0x45/0xaf
Jun  6 22:33:49 Tower1 kernel:  [<c10411f5>] hrtimer_interrupt+0xf1/0x1e7
Jun  6 22:33:49 Tower1 kernel:  [<c101c426>] smp_apic_timer_interrupt+0x6d/0x7f
Jun  6 22:33:49 Tower1 kernel:  [<c14030f9>] apic_timer_interrupt+0x2d/0x34
Jun  6 22:33:49 Tower1 kernel:  [<c12f007b>] ? ide_dump_status+0xab/0x14a
Jun  6 22:33:49 Tower1 kernel:  [<c1247446>] ? blk_update_request+0x12f/0x308
Jun  6 22:33:49 Tower1 kernel:  [<c124762d>] blk_update_bidi_request+0xe/0x4f
Jun  6 22:33:49 Tower1 kernel:  [<c1248045>] blk_end_bidi_request+0x1d/0x53
Jun  6 22:33:49 Tower1 kernel:  [<c12480ba>] blk_end_request+0x12/0x14
Jun  6 22:33:49 Tower1 kernel:  [<c12fbc45>] scsi_end_request+0x1f/0x70
Jun  6 22:33:49 Tower1 kernel:  [<c12fbfb0>] scsi_io_completion+0x1b0/0x421
Jun  6 22:33:49 Tower1 kernel:  [<c12fbfb0>] ? scsi_io_completion+0x1b0/0x421
Jun  6 22:33:49 Tower1 kernel:  [<c12fbd49>] ? scsi_device_unbusy+0x7c/0x82
Jun  6 22:33:49 Tower1 kernel:  [<c12f6de4>] scsi_finish_command+0x91/0x97
Jun  6 22:33:49 Tower1 kernel:  [<c12fc2f5>] scsi_softirq_done+0xc5/0xcd
Jun  6 22:33:49 Tower1 kernel:  [<c124c87a>] blk_done_softirq+0x4a/0x57
Jun  6 22:33:49 Tower1 kernel:  [<c102e980>] __do_softirq+0x8d/0x145
Jun  6 22:33:49 Tower1 kernel:  [<c102ea16>] ? __do_softirq+0x123/0x145
Jun  6 22:33:49 Tower1 kernel:  [<c102ea98>] irq_exit+0x33/0x6c
Jun  6 22:33:49 Tower1 kernel:  [<c100367e>] do_IRQ+0x87/0x9b
Jun  6 22:33:49 Tower1 kernel:  [<c1403a2c>] common_interrupt+0x2c/0x31
Jun  6 22:33:49 Tower1 kernel:  [<c125e07e>] ? memcmp+0x17/0x25
Jun  6 22:33:49 Tower1 kernel:  [<f882484f>] handle_stripe+0xa53/0xcf6 [md_mod]
Jun  6 22:33:49 Tower1 kernel:  [<c1044e4b>] ? __wake_up+0x3b/0x42
Jun  6 22:33:49 Tower1 kernel:  [<f8824b63>] unraidd+0x71/0xb5 [md_mod]
Jun  6 22:33:49 Tower1 kernel:  [<f8821b7a>] md_thread+0xd3/0xea [md_mod]
Jun  6 22:33:49 Tower1 kernel:  [<c103ef79>] ? wake_up_bit+0x5b/0x5b
Jun  6 22:33:49 Tower1 kernel:  [<c103eb39>] kthread+0x90/0x95
Jun  6 22:33:49 Tower1 kernel:  [<f8821aa7>] ? import_device+0x166/0x166 [md_mod]
Jun  6 22:33:49 Tower1 kernel:  [<c1403537>] ret_from_kernel_thread+0x1b/0x28
Jun  6 22:33:49 Tower1 kernel:  [<c103eaa9>] ? kthread_freezable_should_stop+0x4a/0x4a
Jun  6 22:34:59 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=17793 c=17792 q=19669)
Jun  6 22:34:59 Tower1 kernel: Pid: 9006, comm: unraidd Not tainted 3.9.3-unRAID #8
Jun  6 22:34:59 Tower1 kernel: Call Trace:
Jun  6 22:34:59 Tower1 kernel:  [<c1062abc>] print_cpu_stall+0xbc/0x107
Jun  6 22:34:59 Tower1 kernel:  [<c1062d4c>] __rcu_pending+0x4f/0x12a
Jun  6 22:34:59 Tower1 kernel:  [<c1062e9a>] rcu_check_callbacks+0x73/0x9b
Jun  6 22:34:59 Tower1 kernel:  [<c1032e89>] update_process_times+0x2d/0x53
Jun  6 22:34:59 Tower1 kernel:  [<c10550db>] tick_sched_timer+0x77/0xa1
Jun  6 22:34:59 Tower1 kernel:  [<c1040d4a>] ? __remove_hrtimer+0x25/0x7a
Jun  6 22:34:59 Tower1 kernel:  [<c1040e8d>] __run_hrtimer+0x45/0xaf
Jun  6 22:34:59 Tower1 kernel:  [<c10411f5>] hrtimer_interrupt+0xf1/0x1e7
Jun  6 22:34:59 Tower1 kernel:  [<c101c426>] smp_apic_timer_interrupt+0x6d/0x7f
Jun  6 22:34:59 Tower1 kernel:  [<c14030f9>] apic_timer_interrupt+0x2d/0x34
Jun  6 22:34:59 Tower1 kernel:  [<c104007b>] ? cpu_timer_fire+0x35/0x5c
Jun  6 22:34:59 Tower1 kernel:  [<c1402b53>] ? _raw_spin_unlock_irqrestore+0x8/0xa
Jun  6 22:34:59 Tower1 kernel:  [<c1044e4b>] __wake_up+0x3b/0x42
Jun  6 22:34:59 Tower1 kernel:  [<f8820f33>] md_done_sync+0x2b/0x2f [md_mod]
Jun  6 22:34:59 Tower1 kernel:  [<f8824947>] handle_stripe+0xb4b/0xcf6 [md_mod]
Jun  6 22:34:59 Tower1 kernel:  [<c1044e4b>] ? __wake_up+0x3b/0x42
Jun  6 22:34:59 Tower1 kernel:  [<f8824b63>] unraidd+0x71/0xb5 [md_mod]
Jun  6 22:34:59 Tower1 kernel:  [<f8821b7a>] md_thread+0xd3/0xea [md_mod]
Jun  6 22:34:59 Tower1 kernel:  [<c103ef79>] ? wake_up_bit+0x5b/0x5b
Jun  6 22:34:59 Tower1 kernel:  [<c103eb39>] kthread+0x90/0x95
Jun  6 22:34:59 Tower1 kernel:  [<f8821aa7>] ? import_device+0x166/0x166 [md_mod]
Jun  6 22:34:59 Tower1 kernel:  [<c1403537>] ret_from_kernel_thread+0x1b/0x28
Jun  6 22:34:59 Tower1 kernel:  [<c103eaa9>] ? kthread_freezable_should_stop+0x4a/0x4a
Jun  6 22:36:25 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=17799 c=17798 q=8142)
Jun  6 22:36:25 Tower1 kernel: Pid: 9006, comm: unraidd Not tainted 3.9.3-unRAID #8
Jun  6 22:36:25 Tower1 kernel: Call Trace:
Jun  6 22:36:25 Tower1 kernel:  [<c1062abc>] print_cpu_stall+0xbc/0x107
Jun  6 22:36:25 Tower1 kernel:  [<c1062d4c>] __rcu_pending+0x4f/0x12a
Jun  6 22:36:25 Tower1 kernel:  [<c1062e9a>] rcu_check_callbacks+0x73/0x9b
Jun  6 22:36:25 Tower1 kernel:  [<c1032e89>] update_process_times+0x2d/0x53
Jun  6 22:36:25 Tower1 kernel:  [<c10550db>] tick_sched_timer+0x77/0xa1
Jun  6 22:36:25 Tower1 kernel:  [<c1040d4a>] ? __remove_hrtimer+0x25/0x7a
Jun  6 22:36:25 Tower1 kernel:  [<c1040e8d>] __run_hrtimer+0x45/0xaf
Jun  6 22:36:25 Tower1 kernel:  [<c10411f5>] hrtimer_interrupt+0xf1/0x1e7
Jun  6 22:36:25 Tower1 kernel:  [<c101c426>] smp_apic_timer_interrupt+0x6d/0x7f
Jun  6 22:36:25 Tower1 kernel:  [<c14030f9>] apic_timer_interrupt+0x2d/0x34
Jun  6 22:36:25 Tower1 kernel:  [<c124007b>] ? crypto_aes_expand_key+0x123/0x39b
Jun  6 22:36:25 Tower1 kernel:  [<c12fb5a7>] ? scsi_request_fn+0x33a/0x371
Jun  6 22:36:25 Tower1 kernel:  [<c1246d9b>] __blk_run_queue+0x28/0x31
Jun  6 22:36:25 Tower1 kernel:  [<c12470dd>] blk_run_queue+0x1b/0x2c
Jun  6 22:36:25 Tower1 kernel:  [<c12fadf2>] scsi_run_queue+0xe4/0x151
Jun  6 22:36:25 Tower1 kernel:  [<c12fb771>] scsi_next_command+0x28/0x34
Jun  6 22:36:25 Tower1 kernel:  [<c12fbc8c>] scsi_end_request+0x66/0x70
Jun  6 22:36:25 Tower1 kernel:  [<c12fbfb0>] scsi_io_completion+0x1b0/0x421
Jun  6 22:36:25 Tower1 kernel:  [<c1246d9b>] ? __blk_run_queue+0x28/0x31
Jun  6 22:36:25 Tower1 kernel:  [<c12fbd49>] ? scsi_device_unbusy+0x7c/0x82
Jun  6 22:36:25 Tower1 kernel:  [<c12f6de4>] scsi_finish_command+0x91/0x97
Jun  6 22:36:25 Tower1 kernel:  [<c12fc2f5>] scsi_softirq_done+0xc5/0xcd
Jun  6 22:36:25 Tower1 kernel:  [<c12fbc8c>] ? scsi_end_request+0x66/0x70
Jun  6 22:36:25 Tower1 kernel:  [<c124c87a>] blk_done_softirq+0x4a/0x57
Jun  6 22:36:25 Tower1 kernel:  [<c102e980>] __do_softirq+0x8d/0x145
Jun  6 22:36:25 Tower1 kernel:  [<c12fbd49>] ? scsi_device_unbusy+0x7c/0x82
Jun  6 22:36:25 Tower1 kernel:  [<c102ea98>] irq_exit+0x33/0x6c
Jun  6 22:36:25 Tower1 kernel:  [<c100367e>] do_IRQ+0x87/0x9b
Jun  6 22:36:25 Tower1 kernel:  [<c12fc2f5>] ? scsi_softirq_done+0xc5/0xcd
Jun  6 22:36:25 Tower1 kernel:  [<c1403a2c>] common_interrupt+0x2c/0x31
Jun  6 22:36:25 Tower1 kernel:  [<c104007b>] ? cpu_timer_fire+0x35/0x5c
Jun  6 22:36:25 Tower1 kernel:  [<c1245100>] ? xor_avx_5+0x6e/0x34c
Jun  6 22:36:25 Tower1 kernel:  [<c124315e>] xor_blocks+0x74/0x7c
Jun  6 22:36:25 Tower1 kernel:  [<f8823ce2>] check_parity+0x96/0xcc [md_mod]
Jun  6 22:36:25 Tower1 kernel:  [<f882482b>] handle_stripe+0xa2f/0xcf6 [md_mod]
Jun  6 22:36:25 Tower1 kernel:  [<c1044e4b>] ? __wake_up+0x3b/0x42
Jun  6 22:36:25 Tower1 kernel:  [<f8824b63>] unraidd+0x71/0xb5 [md_mod]
Jun  6 22:36:25 Tower1 kernel:  [<f8821b7a>] md_thread+0xd3/0xea [md_mod]
Jun  6 22:36:25 Tower1 kernel:  [<c103ef79>] ? wake_up_bit+0x5b/0x5b
Jun  6 22:36:25 Tower1 kernel:  [<c103eb39>] kthread+0x90/0x95
Jun  6 22:36:25 Tower1 kernel:  [<f8821aa7>] ? import_device+0x166/0x166 [md_mod]
Jun  6 22:36:25 Tower1 kernel:  [<c1403537>] ret_from_kernel_thread+0x1b/0x28
Jun  6 22:36:25 Tower1 kernel:  [<c103eaa9>] ? kthread_freezable_should_stop+0x4a/0x4a
Jun  6 22:42:06 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=17986 c=17985 q=10992)
Jun  6 22:42:06 Tower1 kernel: Pid: 9006, comm: unraidd Not tainted 3.9.3-unRAID #8
Jun  6 22:42:06 Tower1 kernel: Call Trace:
Jun  6 22:42:06 Tower1 kernel:  [<c1062abc>] print_cpu_stall+0xbc/0x107
Jun  6 22:42:06 Tower1 kernel:  [<c1062d4c>] __rcu_pending+0x4f/0x12a
Jun  6 22:42:06 Tower1 kernel:  [<c1062e9a>] rcu_check_callbacks+0x73/0x9b
Jun  6 22:42:06 Tower1 kernel:  [<c1032e89>] update_process_times+0x2d/0x53
Jun  6 22:42:06 Tower1 kernel:  [<c10550db>] tick_sched_timer+0x77/0xa1
Jun  6 22:42:06 Tower1 kernel:  [<c1040d4a>] ? __remove_hrtimer+0x25/0x7a
Jun  6 22:42:06 Tower1 kernel:  [<c1040e8d>] __run_hrtimer+0x45/0xaf
Jun  6 22:42:06 Tower1 kernel:  [<c10411f5>] hrtimer_interrupt+0xf1/0x1e7
Jun  6 22:42:06 Tower1 kernel:  [<c12fadf2>] ? scsi_run_queue+0xe4/0x151
Jun  6 22:42:06 Tower1 kernel:  [<c101c426>] smp_apic_timer_interrupt+0x6d/0x7f
Jun  6 22:42:06 Tower1 kernel:  [<c14030f9>] apic_timer_interrupt+0x2d/0x34
Jun  6 22:42:06 Tower1 kernel:  [<c12f007b>] ? ide_dump_status+0xab/0x14a
Jun  6 22:42:06 Tower1 kernel:  [<c12f6dc3>] ? scsi_finish_command+0x70/0x97
Jun  6 22:42:06 Tower1 kernel:  [<c12fc2f5>] scsi_softirq_done+0xc5/0xcd
Jun  6 22:42:06 Tower1 kernel:  [<c12fc2f5>] ? scsi_softirq_done+0xc5/0xcd
Jun  6 22:42:06 Tower1 kernel:  [<c10482c5>] ? sched_clock_cpu+0x3f/0x13f
Jun  6 22:42:06 Tower1 kernel:  [<c124c87a>] blk_done_softirq+0x4a/0x57
Jun  6 22:42:06 Tower1 kernel:  [<c102e980>] __do_softirq+0x8d/0x145
Jun  6 22:42:06 Tower1 kernel:  [<c102ea16>] ? __do_softirq+0x123/0x145
Jun  6 22:42:06 Tower1 kernel:  [<c102ea98>] irq_exit+0x33/0x6c
Jun  6 22:42:06 Tower1 kernel:  [<c100367e>] do_IRQ+0x87/0x9b
Jun  6 22:42:06 Tower1 kernel:  [<c100367e>] ? do_IRQ+0x87/0x9b
Jun  6 22:42:06 Tower1 kernel:  [<c1403a2c>] common_interrupt+0x2c/0x31
Jun  6 22:42:06 Tower1 kernel:  [<f8823cb8>] ? check_parity+0x6c/0xcc [md_mod]
Jun  6 22:42:06 Tower1 kernel:  [<f882482b>] handle_stripe+0xa2f/0xcf6 [md_mod]
Jun  6 22:42:06 Tower1 kernel:  [<c1044e4b>] ? __wake_up+0x3b/0x42
Jun  6 22:42:06 Tower1 kernel:  [<f8824b63>] unraidd+0x71/0xb5 [md_mod]
Jun  6 22:42:06 Tower1 kernel:  [<f8821b7a>] md_thread+0xd3/0xea [md_mod]
Jun  6 22:42:06 Tower1 kernel:  [<c103ef79>] ? wake_up_bit+0x5b/0x5b
Jun  6 22:42:06 Tower1 kernel:  [<c103eb39>] kthread+0x90/0x95
Jun  6 22:42:06 Tower1 kernel:  [<f8821aa7>] ? import_device+0x166/0x166 [md_mod]
Jun  6 22:42:06 Tower1 kernel:  [<c1403537>] ret_from_kernel_thread+0x1b/0x28
Jun  6 22:42:06 Tower1 kernel:  [<c103eaa9>] ? kthread_freezable_should_stop+0x4a/0x4a
Jun  6 22:45:11 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=18128 c=18127 q=9065)

 

In the line "comm: unraidd Not tainted", "unraidd" is the name of the thread that was running, and "Not tainted" means the kernel has no taint flags set, for example from loading a proprietary (non-GPL) module.  If you submit an error report to the kernel guys and they see you're using a "tainted" kernel, they generally won't help you.  This is not the case with the unRAID kernel build.
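
For anyone who wants to check the taint state on a running box rather than wait for a trace, here is a minimal sketch. It assumes a Linux system that exposes the taint bitmask at /proc/sys/kernel/tainted. The function names are made up for illustration, and the table covers only a handful of the taint bits, so anything not listed is silently ignored.

```python
# Sketch only: decodes a few well-known kernel taint bits.
# The table is deliberately partial; unlisted bits are ignored.
TAINT_BITS = {
    0: "proprietary module loaded (P)",
    1: "module force-loaded (F)",
    7: "kernel oops occurred (D)",
    9: "kernel warning issued (W)",
    12: "out-of-tree module loaded (O)",
}


def decode_taint(value):
    """Return descriptions of the known taint bits set in `value`."""
    return [desc for bit, desc in sorted(TAINT_BITS.items()) if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the taint bitmask from procfs (assumes Linux)."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling decode_taint(read_taint()) on a kernel that reports "Not tainted" should give an empty list, since the bitmask is 0 in that case.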

 

As for the stall: I have not reproduced this yet.  The message is "informational", meaning nothing is crashing, but it still should not happen, and I'll get to this issue at some point.

  • 2 weeks later...

I used to get a lot of these CPU stall errors when running a parity check, and the value of md_sync_window seemed to affect how often they occurred.

 

I did not get any CPU stall errors on RC15 when I ran a non-correcting parity check.
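
If anyone wants to compare stall frequency between md_sync_window settings, something like the rough sketch below could count the stall messages in a saved syslog and measure the gaps between them. The function names are hypothetical, the regex is matched only against the message format shown in the log above, and the year is an arbitrary placeholder because syslog timestamps do not include one.

```python
import re
from datetime import datetime

# Matches lines like:
# "Jun  6 22:33:49 Tower1 kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies ...)"
STALL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"INFO: rcu_sched self-detected stall on CPU.*?\(t=(?P<t>\d+) jiffies"
)


def find_stalls(lines, year=2000):
    """Return (timestamp, jiffies) for each stall message in `lines`.

    `year` is a placeholder assumption: syslog lines carry no year.
    """
    stalls = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            when = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            stalls.append((when, int(m.group("t"))))
    return stalls


def gaps_seconds(stalls):
    """Seconds between consecutive stall messages."""
    times = [when for when, _ in stalls]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Run find_stalls(open("syslog")) once per parity check (the file name is just an example) and compare the counts and gaps for each md_sync_window value. It only reports how often the message appears; it says nothing about why the stall happens.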
