Parity check in progress: syslog full of errors


I'm currently running a parity check on my unRAID 5.0.5 NAS.

It's very slow, about 20 MB/s, which is unusual.

The main page shows nothing special.

The syslog, however, is all red. Here is a sample:

Quote

Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies g=11097 c=11096 q=665)
Jan 29 12:37:58 Tower kernel: Pid: 2017, comm: unraidd Tainted: G           O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:37:58 Tower kernel: Call Trace: (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:37:58 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c12fc7f4>] ? scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c102e99c>] ? __do_softirq+0x65/0x151 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1044cfb>] ? check_preempt_curr+0x29/0x64 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c105dc57>] ? irq_to_desc+0xf/0x11 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:37:58 Tower kernel:  [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c125e504>] ? memcmp+0x15/0x25 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bdc1f>] handle_stripe+0xa4d/0xceb [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:37:58 Tower kernel:  [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:37:58 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:37:58 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=6001 jiffies g=11098 c=11097 q=924)
Jan 29 12:39:32 Tower kernel: Pid: 2017, comm: unraidd Tainted: G           O 3.9.11p-unRAID #5 (Errors)
Jan 29 12:39:32 Tower kernel: Call Trace: (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1062c2a>] print_cpu_stall+0xbc/0x107 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1062eba>] __rcu_pending+0x4f/0x12a (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1063008>] rcu_check_callbacks+0x73/0x9b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1032ed9>] update_process_times+0x2d/0x53 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c105520b>] tick_sched_timer+0x77/0xa1 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1040e02>] ? __remove_hrtimer+0x25/0x7a (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1040f45>] __run_hrtimer+0x45/0xaf (Errors)
Jan 29 12:39:32 Tower kernel:  [<c10412ad>] hrtimer_interrupt+0xf1/0x1e7 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c101c43a>] smp_apic_timer_interrupt+0x6d/0x7f (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401411>] apic_timer_interrupt+0x2d/0x34 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f845331a>] ? sas_queuecommand+0x199/0x1bf [libsas] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12f7735>] scsi_dispatch_cmd+0xfa/0x125 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fbd04>] scsi_request_fn+0x253/0x371 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124721b>] __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124755d>] blk_run_queue+0x1b/0x2c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fb636>] scsi_run_queue+0xe4/0x151 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fbfb5>] scsi_next_command+0x28/0x34 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc4d0>] scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc7f4>] scsi_io_completion+0x1b0/0x421 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124721b>] ? __blk_run_queue+0x28/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12f7628>] scsi_finish_command+0x91/0x97 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fcb39>] scsi_softirq_done+0xc5/0xcd (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc4d0>] ? scsi_end_request+0x66/0x70 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124ccfa>] blk_done_softirq+0x4a/0x57 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c102e9cb>] __do_softirq+0x94/0x151 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12fc58d>] ? scsi_device_unbusy+0x7c/0x82 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c102eae8>] irq_exit+0x33/0x6c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c100367e>] do_IRQ+0x87/0x9b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401d2c>] common_interrupt+0x2c/0x31 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c124419c>] ? xor_sse_5_pf64+0x182/0x32c (Errors)
Jan 29 12:39:32 Tower kernel:  [<c12435de>] xor_blocks+0x74/0x7c (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bd0b8>] check_parity+0x96/0xcc [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bdbfb>] handle_stripe+0xa29/0xceb [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1044f5f>] ? __wake_up+0x3b/0x42 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bdf2e>] unraidd+0x71/0xb5 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87bacb2>] md_thread+0xd3/0xea [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103f031>] ? wake_up_bit+0x5b/0x5b (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103ebf1>] kthread+0x90/0x95 (Errors)
Jan 29 12:39:32 Tower kernel:  [<f87babdf>] ? import_device+0x166/0x166 [md_mod] (Errors)
Jan 29 12:39:32 Tower kernel:  [<c1401837>] ret_from_kernel_thread+0x1b/0x28 (Errors)
Jan 29 12:39:32 Tower kernel:  [<c103eb61>] ? kthread_freezable_should_stop+0x4a/0x4a (Errors)
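
To see at a glance which CPU reports the stalls and how often, the header line of each trace can be picked out of the syslog. This is only a minimal Python sketch: the function name `find_stalls` and the default year are my own assumptions (syslog timestamps carry no year), and the `t=` value is left in jiffies because the log does not say what tick rate (HZ) the kernel was built with.

```python
import re
from datetime import datetime

# Header line of one stall report, e.g.
# "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies ..."
STALL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+INFO:\s+rcu_sched\s+"
    r"self-detected stall on CPU\s*\{\s*(?P<cpu>\d+)\s*\}\s+\(t=(?P<jiffies>\d+) jiffies"
)


def find_stalls(lines, year=2018):
    """Return (timestamp, cpu, jiffies) for every stall header found in lines.

    year is an assumption: it is not present in the log lines themselves.
    """
    stalls = []
    for line in lines:
        m = STALL_RE.match(line)
        if m:
            when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            stalls.append((when, int(m["cpu"]), int(m["jiffies"])))
    return stalls


# The two header lines from the sample above.
SAMPLE = [
    "Jan 29 12:37:58 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=24003 jiffies g=11097 c=11096 q=665)",
    "Jan 29 12:39:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3}  (t=6001 jiffies g=11098 c=11097 q=924)",
]

for when, cpu, jiffies in find_stalls(SAMPLE):
    print(when.time(), "CPU", cpu, "stalled for", jiffies, "jiffies")
```

To run it on a whole file, pass the open file object instead of `SAMPLE`, for example `find_stalls(open("syslog.txt"))`, where `syslog.txt` stands for wherever the syslog was saved.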

I don't know what to do: shut down unRAID, or just stop the parity check? And where are these errors coming from?


Thanks trurl, you're right.

I stopped the parity check, rebooted the server, then started the parity check again.

Every 1 to 10 minutes, a burst of red lines appears, containing "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)".

In fact, all I've done recently is move a lot of files in and out of the server.

syslog-2018-01-29 (after reboot check parity 4gb) red bunch of lines every 5-10min.txt


I ran memtest. I tested the two memory banks one module at a time and both together, ganged and unganged, interleaved or not, underclocked or not: at least 3 passes each time, without a single error.

I retried the server with no unMENU plugin, no dircache, nothing extra. The go file has only one active line: "/usr/local/sbin/emhttp &"

The procedure is always the same: start the array, then start the parity check (the usual messages appear in the syslog).

After somewhere between 2 and 15 minutes, a burst of red lines appears, starting with "Tower kernel: Pid: 1484, comm: unraidd Tainted: G O 3.9.11p-unRAID #5 (Errors)".

Those red lines then repeat every 2 to 15 minutes, and the network response becomes very slow.

Meanwhile, the parity check continues at approximately 30 MB/s.

I've disabled "Cool'n'Quiet" and "AMD C1E" support in the BIOS.

...

 

OK: FOUND

I've found the "core" of the problem, if not the solution.

I disabled three of the four cores of the Athlon II X4, and everything seemed fine. Then I disabled only the 4th core, and that seems to be sufficient.
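
To confirm from the running system which cores the kernel actually sees after such a BIOS change, Linux normally lists the online CPUs in `/sys/devices/system/cpu/online`, in a range format such as `0-2`. A minimal Python sketch for reading it follows; the function names are my own, and I'm assuming the unRAID kernel exposes this sysfs file (I have not checked it on 5.0.5).

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-2,4" into [0, 1, 2, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the list of online CPUs; the path is assumed to exist on this kernel."""
    return parse_cpu_list(Path(path).read_text())
```

CPUs are numbered from 0, so with one of four cores disabled the file would be expected to read `0-2`, and `online_cpus()` would return three entries.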

I don't know why, but with all cores enabled a simple parity check produces the syslog errors shown above and runs at only around 20-35 MB/s.

With the 4th core disabled (I haven't tried disabling only the 2nd or only the 3rd), those errors no longer appear in the syslog during the parity check, and it runs at 60-75 MB/s.
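
For a rough sense of what those speeds mean in wall-clock time, here is a small Python sketch. The 2 TB disk size is purely a hypothetical figure (the disk sizes are not stated here), and the calculation assumes a constant rate over the whole disk, which a real parity check does not hold.

```python
def check_hours(disk_bytes, rate_mb_per_s):
    """Estimate parity check duration in hours at a constant rate (MB = 10**6 bytes)."""
    seconds = disk_bytes / (rate_mb_per_s * 1_000_000)
    return seconds / 3600


HYPOTHETICAL_DISK_BYTES = 2 * 10**12  # 2 TB, an assumed size for illustration only

# Rates taken from the observations above: 20-35 MB/s with errors, 60-75 MB/s without.
for rate in (20, 35, 60, 75):
    hours = check_hours(HYPOTHETICAL_DISK_BYTES, rate)
    print(f"{rate} MB/s: about {hours:.1f} hours")
```

Whatever the real disk size, the duration scales inversely with the rate, so going from 20 MB/s to 60 MB/s cuts the check to a third of the time.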

 

If anyone has an idea (is the Athlon dying?), thanks.

