642 Posted January 29, 2018
I'm currently running a parity check on my unRAID 5.0.5 NAS. It is unusually slow, about 20 MB/s. The main page shows nothing special, but in the syslog everything is red. Here is a sample:
Quote
Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=24003 jiffies g=11097 c=11096 q=665)
Jan 29 12:37:58 Tower kernel: Pid: 2017, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:37:58 Tower kernel: Call Trace: (Errors)
Jan 29 12:37:58 Tower kernel: [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:37:58 Tower kernel: [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:37:58 Tower kernel: [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:37:58 Tower kernel: [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:37:58 Tower kernel: [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:37:58 Tower kernel: [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:37:58 Tower kernel: [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:37:58 Tower kernel: [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:37:58 Tower kernel: [<c12fc7f4>] ? scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:37:58 Tower kernel: [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:37:58 Tower kernel: [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:37:58 Tower kernel: [<c102e99c>] ? __do_softirq+0x65/0x151 (Errors)
Jan 29 12:37:58 Tower kernel: [<c1044cfb>] ? check_preempt_curr+0x29/0x64 (Errors)
Jan 29 12:37:58 Tower kernel: [<c105dc57>] ? irq_to_desc+0xf/0x11 (Errors)
Jan 29 12:37:58 Tower kernel: [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:37:58 Tower kernel: [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:37:58 Tower kernel: [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:37:58 Tower kernel: [<c125e504>] ? memcmp+0x15/0x25 (Errors)
Jan 29 12:37:58 Tower kernel: [<f87bdc1f>] handle_stripe+0xa4d/0xceb [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel: [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:37:58 Tower kernel: [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel: [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel: [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:37:58 Tower kernel: [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:37:58 Tower kernel: [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel: [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:37:58 Tower kernel: [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=6001 jiffies g=11098 c=11097 q=924)
Jan 29 12:39:32 Tower kernel: Pid: 2017, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:39:32 Tower kernel: Call Trace: (Errors)
Jan 29 12:39:32 Tower kernel: [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:39:32 Tower kernel: [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:39:32 Tower kernel: [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:39:32 Tower kernel: [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:39:32 Tower kernel: [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:39:32 Tower kernel: [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:39:32 Tower kernel: [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:39:32 Tower kernel: [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:39:32 Tower kernel: [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:39:32 Tower kernel: [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:39:32 Tower kernel: [<f845331a>] ? sas_queuecommand+0x199/0x1bf [libsas] (Errors)
Jan 29 12:39:32 Tower kernel: [<c12f7735>] scsi_dispatch_cmd+0xfa/0x125 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fbd04>] scsi_request_fn+0x253/0x371 (Errors)
Jan 29 12:39:32 Tower kernel: [<c124721b>] __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel: [<c124755d>] blk_run_queue+0x1b/0x2c (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fb636>] scsi_run_queue+0xe4/0x151 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fbfb5>] scsi_next_command+0x28/0x34 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fc4d0>] scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fc7f4>] scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:39:32 Tower kernel: [<c124721b>] ? __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12f7628>] scsi_finish_command+0x91/0x97 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fcb39>] scsi_softirq_done+0xc5/0xcd (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fc4d0>] ? scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel: [<c124ccfa>] blk_done_softirq+0x4a/0x57 (Errors)
Jan 29 12:39:32 Tower kernel: [<c102e9cb>] __do_softirq+0x94/0x151 (Errors)
Jan 29 12:39:32 Tower kernel: [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel: [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:39:32 Tower kernel: [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:39:32 Tower kernel: [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:39:32 Tower kernel: [<c124419c>] ? xor_sse_5_pf64+0x182/0x32c (Errors)
Jan 29 12:39:32 Tower kernel: [<c12435de>] xor_blocks+0x74/0x7c (Errors)
Jan 29 12:39:32 Tower kernel: [<f87bd0b8>] check_parity+0x96/0xcc [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel: [<f87bdbfb>] handle_stripe+0xa29/0xceb [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel: [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:39:32 Tower kernel: [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel: [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel: [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:39:32 Tower kernel: [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:39:32 Tower kernel: [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel: [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:39:32 Tower kernel: [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
I don't know what to do: stop unRAID, or stop the parity check? Where are these errors coming from?
trurl Posted January 29, 2018
Syslog snippets are seldom sufficient. Stop the parity check, get the complete syslog, and post it.
642 Posted January 29, 2018
Thanks trurl, you're right. I stopped the parity check, rebooted the server, and then started the parity check again. Every 1-10 min there is a bunch of red lines with "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)".
In fact, all I've done recently is move a lot of files in and out of the server.
Attached: syslog-2018-01-29 (after reboot check parity 4gb) red bunch of lines every 5-10min.txt
642 Posted January 30, 2018
I've run memtest. I checked the two banks of memory: singly, in pairs, ganged, unganged, interleaved or not, underclocked or not. Not a hitch over at least 3 passes, no errors.
I retried the server without the unmenu plugin, no dircache, nothing. In the go file, only one line is active: "/usr/local/sbin/emhttp &"
It is always the same sequence: start the array, start the parity check (usual stuff in the syslog). After somewhere between 2 and 15 minutes, a bunch of red lines appears, starting with "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)". Those red lines then repeat every 2 to 15 minutes, and the network response becomes very slow. Meanwhile the parity check continues at approximately 30 MB/s.
I've disabled "Cool'n'Quiet" and "AMD C1E" support in the BIOS.
...
OK: FOUND
I've found the "core" of the problem, if not the solution. I disabled three of the four cores of the Athlon II X4, and everything seemed fine. Then I disabled only the 4th one, and that seems to be sufficient.
I don't know why, but with all the cores enabled, just doing a parity check gives me errors in the syslog (see above) and a speed of around 20-35 MB/s. With the 4th core disabled (I have not tried disabling only the 2nd or only the 3rd), there are no errors like the above in the syslog during the parity check, and it runs at 60-75 MB/s.
If anyone has an idea (is the Athlon dying?), thanks.
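Both stall headers in the first post name CPU { 3}. Linux numbers CPUs from 0, so that would be the fourth core, which at least fits the observation that disabling the 4th core makes the errors go away (whether the BIOS's "4th core" maps to the kernel's CPU 3 is an assumption). The following is only a minimal sketch for tallying stalls per CPU and the gaps between them from a saved syslog. The function names are hypothetical, the year is assumed because syslog timestamps omit it, and the pattern is written only against the header format shown in this thread.

```python
import re
from collections import Counter
from datetime import datetime

# Matches stall headers in the format seen in this thread, e.g.
# "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=24003 jiffies ..."
STALL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+INFO:\s+"
    r"rcu_sched self-detected stall on CPU\s*\{\s*(?P<cpu>\d+)\s*\}\s*"
    r"\(t=(?P<jiffies>\d+) jiffies"
)


def parse_stalls(lines, year=2018):
    """Return (timestamp, cpu, jiffies) for every stall header found.

    The year is an assumption: classic syslog timestamps do not carry one.
    """
    stalls = []
    for line in lines:
        m = STALL_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            stalls.append((ts, int(m["cpu"]), int(m["jiffies"])))
    return stalls


def summarize(stalls):
    """Count stalls per CPU and list the seconds between consecutive stalls."""
    per_cpu = Counter(cpu for _, cpu, _ in stalls)
    times = sorted(ts for ts, _, _ in stalls)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return per_cpu, gaps


# The two stall headers quoted in the first post, plus one unrelated trace line.
SAMPLE = [
    "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=24003 jiffies g=11097 c=11096 q=665)",
    "Jan 29 12:37:58 Tower kernel: Call Trace: (Errors)",
    "Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=6001 jiffies g=11098 c=11097 q=924)",
]

if __name__ == "__main__":
    per_cpu, gaps = summarize(parse_stalls(SAMPLE))
    print(dict(per_cpu))  # {3: 2}
    print(gaps)           # [94.0]
```

To run it on a full log, pass the open file instead of SAMPLE, for example `parse_stalls(open(path))` with `path` pointing at the saved syslog. If the stalls always land on the same CPU number, that points at one core (or whatever that core is servicing) rather than at memory or the array as a whole, but it does not by itself prove the CPU is failing.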
This topic is now archived and is closed to further replies.