JuliusZet Posted July 16, 2018

Hello everyone,

I am running an HP ProLiant DL380p Gen8 server with the following hardware configuration:
- CPU: 2x Intel® Xeon® CPU E5-2680 v2 @ 2.80 GHz (10 cores / 20 threads per CPU)
- RAM: 64 GB single-bit ECC (8x 8 GB DDR3-1333)
- Storage controllers:
  - HP Smart Array P420i Controller (embedded, not in use)
  - HP Smart HBA H240 (in PCIe x8 slot number 6/6, in use)
- Storage:
  - 4x Seagate IronWolf Pro 2 TB (server HDDs)
  - 2x Samsung SM863 480 GB (server SSDs)

Last weekend I got myself a new storage controller, the HP Smart HBA H240. Previously I was using the embedded controller (HP Smart Array P420i) in HBA mode. However, speeds were not what I expected, which is why I bought a plain HBA. With its firmware updated to the latest version it works great!

But today, along with my first scheduled Parity Check at 4:00 CEST, I ran into a problem: disk speeds were below 10 MB/sec and CPU usage was very high. The WebGUI, the terminal and all VMs were therefore very unresponsive.

This is how it looks when I start a Parity Check manually: at first everything looks normal, but not even a minute into the Parity Check the disk speeds suddenly drop below 10 MB/sec while at the exact same time the CPU usage rises.

Here is an excerpt from my syslog from the time I started the Parity Check:

Jul 16 12:03:18 unRAID-Server emhttpd: req (7): startState=STARTED&file=&cmdCheck=Check&optionCorrect=correct&csrf_token=****************
Jul 16 12:03:18 unRAID-Server kernel: mdcmd (52): check correct
Jul 16 12:03:18 unRAID-Server kernel: md: recovery thread: check P ...
Jul 16 12:03:18 unRAID-Server kernel: md: using 1536k window, over a total of 1953514552 blocks.
Jul 16 12:04:01 unRAID-Server sSMTP[25194]: Creating SSL connection to host
Jul 16 12:04:01 unRAID-Server sSMTP[25194]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 16 12:04:03 unRAID-Server sSMTP[25194]: Sent mail for [email protected] (221 2.0.0 fwd30.t-online.de closing. / Closing.) uid=0 username=root outbytes=760
Jul 16 12:06:01 unRAID-Server kernel: INFO: rcu_sched self-detected stall on CPU
Jul 16 12:06:01 unRAID-Server kernel: 35-...: (59999 ticks this GP) idle=216/140000000000001/0 softirq=24868/24868 fqs=13592
Jul 16 12:06:01 unRAID-Server kernel: (t=60001 jiffies g=118852 c=118851 q=22580)
Jul 16 12:06:01 unRAID-Server kernel: NMI backtrace for cpu 35
Jul 16 12:06:01 unRAID-Server kernel: CPU: 35 PID: 4029 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 16 12:06:01 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Jul 16 12:06:01 unRAID-Server kernel: Call Trace:
Jul 16 12:06:01 unRAID-Server kernel: <IRQ>
Jul 16 12:06:01 unRAID-Server kernel: dump_stack+0x5d/0x79
Jul 16 12:06:01 unRAID-Server kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 16 12:06:01 unRAID-Server kernel: nmi_cpu_backtrace+0x9b/0xba
Jul 16 12:06:01 unRAID-Server kernel: ? irq_force_complete_move+0xf3/0xf3
Jul 16 12:06:01 unRAID-Server kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Jul 16 12:06:01 unRAID-Server kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Jul 16 12:06:01 unRAID-Server kernel: rcu_check_callbacks+0x212/0x5f0
Jul 16 12:06:01 unRAID-Server kernel: update_process_times+0x23/0x45
Jul 16 12:06:01 unRAID-Server kernel: tick_sched_timer+0x33/0x61
Jul 16 12:06:01 unRAID-Server kernel: __hrtimer_run_queues+0x78/0xc1
Jul 16 12:06:01 unRAID-Server kernel: hrtimer_interrupt+0x87/0x157
Jul 16 12:06:01 unRAID-Server kernel: smp_apic_timer_interrupt+0x75/0x85
Jul 16 12:06:01 unRAID-Server kernel: apic_timer_interrupt+0x7d/0x90
Jul 16 12:06:01 unRAID-Server kernel: </IRQ>
Jul 16 12:06:01 unRAID-Server kernel: RIP: 0010:xor_avx_4+0x53/0x2d8
Jul 16 12:06:01 unRAID-Server kernel: RSP: 0018:ffffc9000909bca0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff10
Jul 16 12:06:01 unRAID-Server kernel: RAX: ffff880809239000 RBX: 0000000000000000 RCX: ffff880809237000
Jul 16 12:06:01 unRAID-Server kernel: RDX: ffff880809236000 RSI: ffff880809239000 RDI: 0000000000001000
Jul 16 12:06:01 unRAID-Server kernel: RBP: ffff880809237000 R08: ffff880809238000 R09: ffff880809238000
Jul 16 12:06:01 unRAID-Server kernel: R10: ffff880809237000 R11: ffff880809236000 R12: ffff880809236000
Jul 16 12:06:01 unRAID-Server kernel: R13: ffff880809239000 R14: 0000000000000003 R15: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: check_parity+0x125/0x30b [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 16 12:06:01 unRAID-Server kernel: kthread+0x111/0x119
Jul 16 12:06:01 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 16 12:06:01 unRAID-Server kernel: ? SyS_exit_group+0xb/0xb
Jul 16 12:06:01 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 16 12:06:01 unRAID-Server kernel: 35-...: (59999 ticks this GP) idle=216/140000000000001/0 softirq=24868/24868 fqs=13593
Jul 16 12:06:01 unRAID-Server kernel: (detected by 3, t=60011 jiffies, g=118852, c=118851, q=22604)
Jul 16 12:06:01 unRAID-Server kernel: Sending NMI from CPU 3 to CPUs 35:
Jul 16 12:06:01 unRAID-Server kernel: NMI backtrace for cpu 35
Jul 16 12:06:01 unRAID-Server kernel: CPU: 35 PID: 4029 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 16 12:06:01 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Jul 16 12:06:01 unRAID-Server kernel: task: ffff88081b485100 task.stack: ffffc90009098000
Jul 16 12:06:01 unRAID-Server kernel: RIP: 0010:memcmp+0x7/0x1d
Jul 16 12:06:01 unRAID-Server kernel: RSP: 0018:ffffc9000909bcd0 EFLAGS: 00000287
Jul 16 12:06:01 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080a1fcc68 RCX: 00000000000000eb
Jul 16 12:06:01 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff880809239008 RDI: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: R10: ffff880809238000 R11: ffff880809237000 R12: ffff880819073c00
Jul 16 12:06:01 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: FS: 0000000000000000(0000) GS:ffff88103f7c0000(0000) knlGS:0000000000000000
Jul 16 12:06:01 unRAID-Server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 16 12:06:01 unRAID-Server kernel: CR2: 000014fa5b793000 CR3: 0000000001c0a004 CR4: 00000000001606e0

I have also attached my diagnostics as well as my full syslog.
If you need further details, please let me know. I would be very thankful to everybody helping me out here!

Best regards
JuliusZet

Attachments: unraid-server-diagnostics-20180716-1232.zip, unraid-server-syslog-20180716-1228.zip
JorgeB Posted July 16, 2018

Various NMI events; these are usually hardware related. You can try looking for a BIOS update, using the controller in a different slot, or replacing the controller with a different model.
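To get a feel for how often these NMIs fire, and on which cores, you can count them in the saved syslog. A quick sketch — "syslog.txt" is a placeholder name, point it at the extracted unraid-server-syslog-*.zip contents:

```shell
# Tally "NMI backtrace for cpu N" events per CPU in a saved syslog.
# "syslog.txt" is a placeholder path; substitute your extracted syslog.
grep -oE 'NMI backtrace for cpu [0-9]+' syslog.txt | sort | uniq -c | sort -rn
```

If the backtraces cluster on the same CPU every time, that points more at one busy kernel thread (like unraidd) than at randomly failing hardware.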
JuliusZet Posted July 16, 2018 (Author)

32 minutes ago, johnnie.black said: Various NMI events; these are usually hardware related. You can try looking for a BIOS update, using the controller in a different slot, or replacing the controller with a different model.

Thank you very much for your reply! When I get home, I am going to:
- update my BIOS
- see if that fixes the issue; if not:
- put the controller in a different slot

Have a great day!
JuliusZet Posted July 16, 2018 (Author)

5 hours ago, johnnie.black said: Various NMI events, these are usually hardware related, you can try looking for a bios update, using the controller in a different slot or replacing the controller by a different model.

Updating the BIOS and changing the PCIe slot of the controller did not fix the issue. The problem still persists. Could someone explain what I am experiencing here? Could this be a driver issue?
JorgeB Posted July 16, 2018

I would recommend getting one of the recommended LSI HBA models: any LSI with a SAS2008/2308/3008 chipset in IT mode, e.g. 9201-8i, 9211-8i, 9207-8i, 9300-8i, etc., or clones like the Dell H200/H310 and IBM M1015; these latter ones need to be crossflashed.
JuliusZet Posted July 17, 2018 (Author)

12 hours ago, johnnie.black said: I would recommend getting one of the recommended LSI HBA models: any LSI with a SAS2008/2308/3008 chipset in IT mode, e.g. 9201-8i, 9211-8i, 9207-8i, 9300-8i, etc., or clones like the Dell H200/H310 and IBM M1015; these latter ones need to be crossflashed.

Yesterday I uninstalled the HP H240 HBA and connected the SFF-8087 cables from the backplane to the embedded HP P420i (which I previously configured to operate in HBA mode). Overnight I successfully completed a Parity Check, however I noticed some strange "spikes":

It's the same thing happening here as with the HP H240 HBA: the disks' read speeds suddenly drop while the CPU usage rises. The only difference is that with the HP H240 HBA things remained bad, whereas with the HP P420i it looks like it could sort of "recover" somehow.

I suspect the new SFF-8087 cables, since I had to swap the original ones that worked perfectly with the embedded controller for the ones that came with the new HP H240 HBA. (The original cables had angled connectors, so they didn't fit in the ports of the HBA.)

Now I have two questions:
- Could this issue be related to defective cables / loose connections?
- I'm thinking about buying an LSI SAS 9207-8i. Are there chances that this issue still persists with the LSI HBA?
JorgeB Posted July 17, 2018

8 minutes ago, JuliusZet said: Could this issue be related to defective cables / loose connections?

Very unlikely.

10 minutes ago, JuliusZet said: I'm thinking about buying an LSI SAS 9207-8i. Are there chances that this issue still persists with the LSI HBA?

Possible. Post diagnostics that cover the last parity check, so we can see if the problem is the same.
JuliusZet Posted July 17, 2018 (Author)

8 minutes ago, johnnie.black said: Very unlikely. Possible, post diagnostics that cover the last parity check, to see if the problem is the same.

Sorry, I totally forgot about this. I attached my diagnostics and my syslog:

Attachments: unraid-server-diagnostics-20180717-0838.zip, unraid-server-syslog-20180717-0839.zip
JorgeB Posted July 17, 2018

NMIs are still happening during the check, but the current controller uses the same driver as the previous one. LSI will use a different driver, so if the issues are related to the controller the LSI should work without problems, but I can't say for sure.
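For context: both the embedded P420i and the H240 are driven by the hpsa driver, while the LSI cards use mpt2sas/mpt3sas. You can confirm which driver each controller is bound to with `lspci -k` (a copy of its output is also saved inside the Unraid diagnostics zip). A sketch that filters a saved copy — the file name lspci.txt is an assumption:

```shell
# Print each storage controller and the "Kernel driver in use" line
# that follows it; lspci.txt is an assumed saved copy of `lspci -k`.
grep -iA2 -E 'RAID|SAS' lspci.txt | grep -E 'RAID|SAS|Kernel driver'
```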
JuliusZet Posted July 26, 2018 (Author)

On 7/17/2018 at 8:58 AM, johnnie.black said: NMIs are still happening during the check, but the current controller uses the same driver as the previous one. LSI will use a different driver, so if the issues are related to the controller the LSI should work without problems, but I can't say for sure.

Hello again, my LSI SAS 9207-8i just arrived. I have installed it and it works perfectly. Except during Parity Checks... NMIs are still there. Even the plugin system.stats is outputting impossible values because of the "system overload". I don't even know what to do anymore. I'm very frustrated right now.

Attachments: unraid-server-diagnostics-20180726-1558.zip, unraid-server-syslog-20180726-1558.zip
JorgeB Posted July 26, 2018

That's bad news. I would then guess it's a problem with the server/board. Check to see if there's a system event log; server boards usually have them, and there might be more info there.
pwm Posted July 26, 2018 (edited)

Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation.

Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

Edited July 26, 2018 by pwm
JuliusZet Posted July 26, 2018 (Author)

9 minutes ago, pwm said: Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation.

Yes, I did this with my HP H240 before. Same results.

10 minutes ago, pwm said: Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

Yes, that's a good idea! I will give this a try now. Thank you for your participation and have a great day!
JuliusZet Posted July 26, 2018 (Author)

2 hours ago, johnnie.black said: That's bad news. I would then guess it's a problem with the server/board. Check to see if there's a system event log; server boards usually have them, and there might be more info there.

I can only find two kinds of logs:
- The "iLO Event Log", which shows me stuff like "Server reset." or "Power on request received by: Automatic Power Recovery."
- The "Integrated Management Log", which shows me stuff like "Firmware flashed (ProLiant System BIOS - P70 05/21/2018)" or "Maintenance note: Intelligent Provisioning was loaded."

I can not find anything unusual related to NMIs there. Where would I find the system event log you mentioned earlier?
JorgeB Posted July 26, 2018

10 minutes ago, JuliusZet said: Where would I find the system event log you mentioned earlier?

It's likely the iLO Event Log. I wouldn't expect NMIs to be logged, but there could be some other hardware issue logged.
JuliusZet Posted July 26, 2018 (Author, edited)

30 minutes ago, johnnie.black said: It's likely the iLO Event Log. I wouldn't expect NMIs to be logged, but there could be some other hardware issue logged.

Nope, there is nothing unusual there.

1 hour ago, pwm said: Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation. Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

What I did in the meantime:
- BIOS reset
- Deactivated all unnecessary devices (on-board SATA controller + on-board RAID controller)
- Re-created the unRAID USB flash device

But that didn't seem to help at all. An excerpt from my syslog during the parity check (started at 19:41:20):

Jul 26 19:41:20 unRAID-Server emhttpd: req (2): startState=STARTED&file=&cmdCheck=Check&optionCorrect=correct&csrf_token=****************
Jul 26 19:41:20 unRAID-Server kernel: mdcmd (40): check correct
Jul 26 19:41:20 unRAID-Server kernel: md: recovery thread: check P ...
Jul 26 19:41:20 unRAID-Server kernel: md: using 1536k window, over a total of 1953514552 blocks.
Jul 26 19:42:12 unRAID-Server sSMTP[5210]: Creating SSL connection to host
Jul 26 19:42:12 unRAID-Server sSMTP[5210]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 26 19:42:14 unRAID-Server sSMTP[5210]: Sent mail for [email protected] (221 2.0.0 fwd26.t-online.de closing. / Closing.) uid=0 username=root outbytes=760
Jul 26 19:42:46 unRAID-Server kernel: INFO: rcu_sched self-detected stall on CPU
Jul 26 19:42:46 unRAID-Server kernel: 21-...: (59999 ticks this GP) idle=1ee/140000000000001/0 softirq=1792/1792 fqs=13455
Jul 26 19:42:46 unRAID-Server kernel: (t=60001 jiffies g=6532 c=6531 q=24126)
Jul 26 19:42:46 unRAID-Server kernel: NMI backtrace for cpu 21
Jul 26 19:42:46 unRAID-Server kernel: CPU: 21 PID: 3954 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 26 19:42:46 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 05/21/2018
Jul 26 19:42:46 unRAID-Server kernel: Call Trace:
Jul 26 19:42:46 unRAID-Server kernel: <IRQ>
Jul 26 19:42:46 unRAID-Server kernel: dump_stack+0x5d/0x79
Jul 26 19:42:46 unRAID-Server kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 26 19:42:46 unRAID-Server kernel: nmi_cpu_backtrace+0x9b/0xba
Jul 26 19:42:46 unRAID-Server kernel: ? irq_force_complete_move+0xf3/0xf3
Jul 26 19:42:46 unRAID-Server kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Jul 26 19:42:46 unRAID-Server kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Jul 26 19:42:46 unRAID-Server kernel: rcu_check_callbacks+0x212/0x5f0
Jul 26 19:42:46 unRAID-Server kernel: update_process_times+0x23/0x45
Jul 26 19:42:46 unRAID-Server kernel: tick_sched_timer+0x33/0x61
Jul 26 19:42:46 unRAID-Server kernel: __hrtimer_run_queues+0x78/0xc1
Jul 26 19:42:46 unRAID-Server kernel: hrtimer_interrupt+0x87/0x157
Jul 26 19:42:46 unRAID-Server kernel: smp_apic_timer_interrupt+0x75/0x85
Jul 26 19:42:46 unRAID-Server kernel: apic_timer_interrupt+0x7d/0x90
Jul 26 19:42:46 unRAID-Server kernel: </IRQ>
Jul 26 19:42:46 unRAID-Server kernel: RIP: 0010:memcmp+0x2/0x1d
Jul 26 19:42:46 unRAID-Server kernel: RSP: 0018:ffffc9000727bcd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Jul 26 19:42:46 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080578bd20 RCX: 0000000000000409
Jul 26 19:42:46 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff8808057cc008 RDI: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: R10: ffff8808057cb000 R11: ffff8808057ca000 R12: ffff88081a045800
Jul 26 19:42:46 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: check_parity+0x14f/0x30b [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 26 19:42:46 unRAID-Server kernel: kthread+0x111/0x119
Jul 26 19:42:46 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 26 19:42:46 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 26 19:42:46 unRAID-Server kernel: 21-...: (59999 ticks this GP) idle=1ee/140000000000001/0 softirq=1792/1792 fqs=13456
Jul 26 19:42:46 unRAID-Server kernel: (detected by 26, t=60005 jiffies, g=6532, c=6531, q=24126)
Jul 26 19:42:46 unRAID-Server kernel: Sending NMI from CPU 26 to CPUs 21:
Jul 26 19:42:46 unRAID-Server kernel: NMI backtrace for cpu 21
Jul 26 19:42:46 unRAID-Server kernel: CPU: 21 PID: 3954 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 26 19:42:46 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 05/21/2018
Jul 26 19:42:46 unRAID-Server kernel: task: ffff88081ad53600 task.stack: ffffc90007278000
Jul 26 19:42:46 unRAID-Server kernel: RIP: 0010:memcmp+0x2/0x1d
Jul 26 19:42:46 unRAID-Server kernel: RSP: 0018:ffffc9000727bcd0 EFLAGS: 00000246
Jul 26 19:42:46 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080578bd20 RCX: 0000000000000fba
Jul 26 19:42:46 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff8808057cc008 RDI: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: R10: ffff8808057cb000 R11: ffff8808057ca000 R12: ffff88081a045800
Jul 26 19:42:46 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: FS: 0000000000000000(0000) GS:ffff88081f8c0000(0000) knlGS:0000000000000000
Jul 26 19:42:46 unRAID-Server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 19:42:46 unRAID-Server kernel: CR2: 0000151a721d77a0 CR3: 0000000001c0a001 CR4: 00000000001606e0
Jul 26 19:42:46 unRAID-Server kernel: Call Trace:
Jul 26 19:42:46 unRAID-Server kernel: check_parity+0x14f/0x30b [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 26 19:42:46 unRAID-Server kernel: kthread+0x111/0x119
Jul 26 19:42:46 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 26 19:42:46 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 26 19:42:46 unRAID-Server kernel: Code: 48 63 c1 4c 39 c0 73 19 49 8b 3c c1 48 85 ff 74 10 4c 89 d6 e8 71 ff ff ff 84 c0 75 09 ff c1 eb df b9 ea ff ff ff 89 c8 c3 31 c9 <48> 39 d1 74 13 0f b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74
Jul 26 19:43:14 unRAID-Server emhttpd: req (3): startState=STARTED&file=&csrf_token=****************&cmdNoCheck=Cancel
Jul 26 19:43:14 unRAID-Server kernel: mdcmd (41): nocheck
Jul 26 19:43:15 unRAID-Server kernel: md: md_do_sync: got signal, exit...
Jul 26 19:43:15 unRAID-Server kernel: md: recovery thread: completion status: -4
Jul 26 19:44:01 unRAID-Server sSMTP[5754]: Creating SSL connection to host
Jul 26 19:44:01 unRAID-Server sSMTP[5754]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 26 19:44:02 unRAID-Server sSMTP[5754]: Sent mail for [email protected] (221 2.0.0 fwd14.t-online.de closing. / Closing.) uid=0 username=root outbytes=78

Edit: I saw some "Advanced CPU Settings" in the BIOS but I did not touch them. Maybe one of these settings is causing those errors?

Edited July 26, 2018 by JuliusZet
JuliusZet Posted July 26, 2018 Author Share Posted July 26, 2018 (edited) When I put some real load on the server like starting many VMs simultaniously I see this in my syslog: Jul 26 21:09:36 unRAID-Server kernel: perf: interrupt took too long (4229 > 2500), lowering kernel.perf_event_max_sample_rate to 47000 Jul 26 21:09:37 unRAID-Server kernel: perf: interrupt took too long (6320 > 5286), lowering kernel.perf_event_max_sample_rate to 31000 Jul 26 21:09:46 unRAID-Server kernel: perf: interrupt took too long (8617 > 7900), lowering kernel.perf_event_max_sample_rate to 23000 Jul 26 21:09:51 unRAID-Server kernel: perf: interrupt took too long (12258 > 10771), lowering kernel.perf_event_max_sample_rate to 16000 Jul 26 21:09:56 unRAID-Server kernel: perf: interrupt took too long (16051 > 15322), lowering kernel.perf_event_max_sample_rate to 12000 Jul 26 21:10:08 unRAID-Server kernel: perf: interrupt took too long (21657 > 20063), lowering kernel.perf_event_max_sample_rate to 9000 Jul 26 21:10:25 unRAID-Server kernel: perf: interrupt took too long (27495 > 27071), lowering kernel.perf_event_max_sample_rate to 7000 Jul 26 21:11:06 unRAID-Server kernel: perf: interrupt took too long (35995 > 34368), lowering kernel.perf_event_max_sample_rate to 5000 Jul 26 21:12:45 unRAID-Server kernel: perf: interrupt took too long (46427 > 44993), lowering kernel.perf_event_max_sample_rate to 4000 Jul 26 21:15:29 unRAID-Server kernel: perf: interrupt took too long (58952 > 58033), lowering kernel.perf_event_max_sample_rate to 3000 Jul 26 21:19:25 unRAID-Server kernel: perf: interrupt took too long (76236 > 73690), lowering kernel.perf_event_max_sample_rate to 2000 Maybe that has got something to do with my NMIs? Edited July 26, 2018 by JuliusZet Quote Link to comment
JuliusZet Posted July 27, 2018 (Author)

When all of my VMs are running (mostly game servers) and a Parity Check is started, my system stats look like this: during the Parity Check all my game servers have very bad performance drops. When I stop all VMs and start a Parity Check, my graphs look normal. I have no clue what is going on here.

Attachment: unraid-server-diagnostics-20180727-0908.zip
JuliusZet Posted July 27, 2018 (Author)

@johnnie.black Do you think that this issue could be related to Tunables? I did not change them because I do not understand what they do, and the wiki and forum posts are outdated (http://lime-technology.com/wiki/Improving_unRAID_Performance#User_Tunables). So mine are all set to default values. Are there Tunables that I could change to try and fix this issue?

I am sorry if I am bothering you, but I really want to fix this issue and I don't know how to get started or who else I could talk to.
JorgeB Posted July 27, 2018

56 minutes ago, JuliusZet said: Do you think that this issue could be related to Tunables?

Not likely, but it won't hurt to try. Use these:
Tunable (md_num_stripes): 4096
Tunable (md_sync_window): 2048
Tunable (md_sync_thresh): 2000
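For reference, these tunables are normally changed in the WebGUI under Settings → Disk Settings and take effect when the array is started. A sketch of how the corresponding entries might look in /boot/config/disk.cfg — the exact file layout is an assumption, so prefer the GUI over editing the file by hand:

```
# /boot/config/disk.cfg (excerpt, assumed layout)
md_num_stripes="4096"
md_sync_window="2048"
md_sync_thresh="2000"
```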
JuliusZet Posted July 27, 2018 (Author, edited)

1 hour ago, johnnie.black said: Not likely, but it won't hurt to try. Use these: Tunable (md_num_stripes): 4096, Tunable (md_sync_window): 2048, Tunable (md_sync_thresh): 2000

Hi johnnie.black, I just figured it out!

What I just did: I applied the Tunables from your previous post and started a Parity Check. The NMIs started immediately. I canceled the Parity Check and started it again just to double-check ... yep, disk speeds are bad and CPU usage is high right from the start of the Parity Check. Then I reset the Tunable values to defaults, started the Parity Check again, and ... it happened like before: at first everything looks normal (disk speeds as expected from 4 HDDs and CPU usage ~2 %), but after a few seconds the NMIs start.

So I thought: hmm, when I increase the Tunables it gets worse. What if I decreased them instead? Taking the default values and halving them seemed like a good start:
Tunable (nr_requests): 64
Tunable (md_num_stripes): 640
Tunable (md_sync_window): 192
Tunable (md_sync_thresh): 96

Wow! I did not think that this would work! Look at the system stats graphs: the syslog shows no NMIs, no errors, nothing! The only thing I need to figure out now is what these strange "spikes" are.

Don't get me wrong, I'm really very glad that it is finally working! However, these spikes are not normal. I think it would help if I had some up-to-date information about the Tunables: what exactly they stand for and what they do. I'm sure this was discussed somewhere on the forum already, but I can't find it. Could you link a post where it is explained? That would be great! Thank you very much!

Edited July 27, 2018 by JuliusZet
JorgeB Posted July 27, 2018

Original tunables: https://lime-technology.com/forums/topic/4473-unraid-server-release-45-beta8-available/?do=findComment&comment=41691

And a description of sync_thresh that was added later: https://lime-technology.com/forums/topic/42369-unraid-server-release-614-available/?do=findComment&comment=416597

Try to keep the tunables as high as possible without triggering the issue; with everything working correctly, lower tunables will usually decrease performance, especially once you start adding more disks.