Currently parity checking : syslog full of errors.


I'm currently running a parity check on my unRAID 5.0.5 NAS.

It is very slow, about 20 MB/s, which is unusual.

The main page shows nothing special.

However, in the syslog everything is red. Here is a sample:

Quote

Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies g=11097 c=11096 q=665)
Jan 29 12:37:58 Tower kernel: Pid: 2017, comm: unraidd Tainted: G           O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:37:58 Tower kernel: Call Trace: (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:37:58 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c12fc7f4>] ? scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c102e99c>] ? __do_softirq+0x65/0x151 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1044cfb>] ? check_preempt_curr+0x29/0x64 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c105dc57>] ? irq_to_desc+0xf/0x11 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:37:58 Tower kernel:  [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c125e504>] ? memcmp+0x15/0x25 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bdc1f>] handle_stripe+0xa4d/0xceb [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=6001 jiffies g=11098 c=11097 q=924)
Jan 29 12:39:32 Tower kernel: Pid: 2017, comm: unraidd Tainted: G           O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:39:32 Tower kernel: Call Trace: (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:39:32 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f845331a>] ? sas_queuecommand+0x199/0x1bf [libsas] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12f7735>] scsi_dispatch_cmd+0xfa/0x125 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fbd04>] scsi_request_fn+0x253/0x371 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124721b>] __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124755d>] blk_run_queue+0x1b/0x2c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fb636>] scsi_run_queue+0xe4/0x151 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fbfb5>] scsi_next_command+0x28/0x34 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc4d0>] scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc7f4>] scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124721b>] ? __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12f7628>] scsi_finish_command+0x91/0x97 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fcb39>] scsi_softirq_done+0xc5/0xcd (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc4d0>] ? scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124ccfa>] blk_done_softirq+0x4a/0x57 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c102e9cb>] __do_softirq+0x94/0x151 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124419c>] ? xor_sse_5_pf64+0x182/0x32c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12435de>] xor_blocks+0x74/0x7c (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bd0b8>] check_parity+0x96/0xcc [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bdbfb>] handle_stripe+0xa29/0xceb [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)

I don't know what to do: stop unRAID? Stop the parity check? Where are these errors coming from?
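
For a long syslog it can help to tally the stall reports per CPU instead of reading every call trace. The following is a minimal sketch, assuming the header line format shown in the sample above; the helper name `count_stalls` and the `SAMPLE` list are made up for illustration.

```python
import re
from collections import Counter

# Matches the header line of each stall report, e.g.
# "... kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies ...)"
STALL_RE = re.compile(r"rcu_sched self-detected stall on CPU \{\s*(\d+)\}")

def count_stalls(lines):
    """Return a Counter mapping CPU number -> number of stall reports."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# The two header lines from the sample above (hypothetical variable name).
SAMPLE = [
    "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies g=11097 c=11096 q=665)",
    "Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=6001 jiffies g=11098 c=11097 q=924)",
]

print(dict(count_stalls(SAMPLE)))
```

To run it against a whole file, pass the open file object instead of `SAMPLE`, for example `count_stalls(open(path, errors="replace"))`, where `path` is wherever the syslog was saved. In the sample above both reports name CPU 3, and both traces belong to the `unraidd` thread.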

Link to comment

Thanks trurl, you're right.

I stopped the parity check, rebooted the server, then restarted the array and started the parity check again.

Every 1-10 minutes, a bunch of red lines appears with "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)"

In fact, recently I've done nothing more than move a lot of files in and out of the server.

syslog-2018-01-29 (after reboot check parity 4gb) red bunch of lines every 5-10min.txt

Link to comment

I've run memtest. I checked the two banks of memory: single, in pairs, ganged, unganged, interleaved or not, underclocked or not. Not a hitch for at least 3 passes, no errors.

I've retried the server without the unMENU plugin, no dircache, nothing. In the go file, only one line is active: "/usr/local/sbin/emhttp &"

It is always the same process: start the array, start the parity check (the usual stuff in the syslog).

After a while, between 2 and 15 minutes, a bunch of red lines appears, starting with "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)".

Those red lines then repeat every 2 to 15 minutes, and the network response becomes very slow.

Meanwhile, the parity check continues at approximately 30 MB/s.
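
The repeat interval described above could be measured from the saved syslog rather than estimated by eye. This is a rough sketch, assuming the classic syslog timestamp prefix seen in the quoted sample and an English/C locale for month names; the year is not in the log, so it is passed in (2018 here, taken from the attachment name), and the helper name `stall_gaps` is hypothetical.

```python
import re
from datetime import datetime

STALL_MARK = "rcu_sched self-detected stall"
# Classic syslog prefix without a year, e.g. "Jan 29 12:37:58 "
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")

def stall_gaps(lines, year=2018):
    """Return the gaps, in seconds, between consecutive stall reports."""
    times = []
    for line in lines:
        if STALL_MARK not in line:
            continue
        match = TS_RE.match(line)
        if match:
            # Collapse the double space syslog uses before single-digit days.
            stamp = " ".join(match.group(1).split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]

# Header lines of the two stall reports quoted in the first post.
SAMPLE = [
    "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies g=11097 c=11096 q=665)",
    "Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=6001 jiffies g=11098 c=11097 q=924)",
]

print(stall_gaps(SAMPLE))  # [94.0]
```

The sketch assumes the log stays within one calendar year; a log spanning New Year would need extra handling.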

I've disabled "Cool'n'Quiet" and "AMD C1E" support in the BIOS.

...

 

OK : FOUND

I've found the "core" of the problem if not the solution.

I disabled three of the four cores of the Athlon II X4, and everything seems fine. I then disabled only the 4th one, and that seems to be sufficient.

I don't know why, but when just doing a parity check with all the cores enabled, I get errors in the syslog (see above) and a speed of around 20-35 MB/s.

With the 4th core disabled (I have not tested disabling only the 2nd or only the 3rd), there are no errors like those above in the syslog during the parity check, and it runs at 60-75 MB/s.
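
To confirm which cores the kernel actually sees after changing the BIOS setting, the sysfs file `/sys/devices/system/cpu/online` can be read. The sketch below assumes a Python interpreter is available on the machine (not a given on a stock unRAID 5 install) and that a core disabled in the BIOS simply does not appear in that list; the function names are hypothetical. The kernel numbers CPUs from 0, so the "CPU { 3}" in the stall reports is the fourth logical CPU.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-2" or "0,2-3" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the list of online CPUs from sysfs (Linux only)."""
    with open(path) as f:
        return parse_cpu_list(f.read())

# With the 4th core disabled, one would expect the file to contain "0-2".
print(parse_cpu_list("0-2\n"))  # [0, 1, 2]
```

On the server itself, `online_cpus()` would return the list to compare against the CPU number named in the stall reports.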

 

If someone has an idea (is the Athlon dying?), thanks.

Link to comment

Archived

This topic is now archived and is closed to further replies.
