JuliusZet Posted July 16, 2018

Hello everyone,

I am running an HP ProLiant DL380p Gen8 server with the following hardware configuration:
- CPU: 2x Intel® Xeon® CPU E5-2680 v2 @ 2.80 GHz (10 cores / 20 threads per CPU)
- RAM: 64 GB single-bit ECC (8x 8 GB DDR3-1333)
- Storage controllers:
  - HP Smart Array P420i Controller (embedded, not in use)
  - HP Smart HBA H240 (in PCIe x8 slot number 6/6, in use)
- Storage:
  - 4x Seagate IronWolf Pro 2 TB (server HDDs)
  - 2x Samsung SM863 480 GB (server SSDs)

Last weekend I got myself a new storage controller, the HP Smart HBA H240. Previously I was using the embedded controller (HP Smart Array P420i) in HBA mode. However, speeds were not what I expected, which is why I bought a plain HBA. With its firmware updated to the latest version it works great!

But today, along with my first scheduled Parity Check at 4:00 CEST, I ran into a problem: disk speeds were below 10 MB/sec and CPU usage was very high. The WebGUI, the terminal and all VMs were therefore very unresponsive.

This is how it looks when I start a Parity Check manually: at first everything looks normal, but not even a minute into the Parity Check the disk speeds suddenly drop below 10 MB/sec while at the exact same time the CPU usage rises.

Here is an excerpt from my syslog from the time I started the Parity Check:

Jul 16 12:03:18 unRAID-Server emhttpd: req (7): startState=STARTED&file=&cmdCheck=Check&optionCorrect=correct&csrf_token=****************
Jul 16 12:03:18 unRAID-Server kernel: mdcmd (52): check correct
Jul 16 12:03:18 unRAID-Server kernel: md: recovery thread: check P ...
Jul 16 12:03:18 unRAID-Server kernel: md: using 1536k window, over a total of 1953514552 blocks.
Jul 16 12:04:01 unRAID-Server sSMTP[25194]: Creating SSL connection to host
Jul 16 12:04:01 unRAID-Server sSMTP[25194]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 16 12:04:03 unRAID-Server sSMTP[25194]: Sent mail for [email protected] (221 2.0.0 fwd30.t-online.de closing. / Closing.) uid=0 username=root outbytes=760
Jul 16 12:06:01 unRAID-Server kernel: INFO: rcu_sched self-detected stall on CPU
Jul 16 12:06:01 unRAID-Server kernel: 35-...: (59999 ticks this GP) idle=216/140000000000001/0 softirq=24868/24868 fqs=13592
Jul 16 12:06:01 unRAID-Server kernel: (t=60001 jiffies g=118852 c=118851 q=22580)
Jul 16 12:06:01 unRAID-Server kernel: NMI backtrace for cpu 35
Jul 16 12:06:01 unRAID-Server kernel: CPU: 35 PID: 4029 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 16 12:06:01 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Jul 16 12:06:01 unRAID-Server kernel: Call Trace:
Jul 16 12:06:01 unRAID-Server kernel: <IRQ>
Jul 16 12:06:01 unRAID-Server kernel: dump_stack+0x5d/0x79
Jul 16 12:06:01 unRAID-Server kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 16 12:06:01 unRAID-Server kernel: nmi_cpu_backtrace+0x9b/0xba
Jul 16 12:06:01 unRAID-Server kernel: ? irq_force_complete_move+0xf3/0xf3
Jul 16 12:06:01 unRAID-Server kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Jul 16 12:06:01 unRAID-Server kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Jul 16 12:06:01 unRAID-Server kernel: rcu_check_callbacks+0x212/0x5f0
Jul 16 12:06:01 unRAID-Server kernel: update_process_times+0x23/0x45
Jul 16 12:06:01 unRAID-Server kernel: tick_sched_timer+0x33/0x61
Jul 16 12:06:01 unRAID-Server kernel: __hrtimer_run_queues+0x78/0xc1
Jul 16 12:06:01 unRAID-Server kernel: hrtimer_interrupt+0x87/0x157
Jul 16 12:06:01 unRAID-Server kernel: smp_apic_timer_interrupt+0x75/0x85
Jul 16 12:06:01 unRAID-Server kernel: apic_timer_interrupt+0x7d/0x90
Jul 16 12:06:01 unRAID-Server kernel: </IRQ>
Jul 16 12:06:01 unRAID-Server kernel: RIP: 0010:xor_avx_4+0x53/0x2d8
Jul 16 12:06:01 unRAID-Server kernel: RSP: 0018:ffffc9000909bca0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff10
Jul 16 12:06:01 unRAID-Server kernel: RAX: ffff880809239000 RBX: 0000000000000000 RCX: ffff880809237000
Jul 16 12:06:01 unRAID-Server kernel: RDX: ffff880809236000 RSI: ffff880809239000 RDI: 0000000000001000
Jul 16 12:06:01 unRAID-Server kernel: RBP: ffff880809237000 R08: ffff880809238000 R09: ffff880809238000
Jul 16 12:06:01 unRAID-Server kernel: R10: ffff880809237000 R11: ffff880809236000 R12: ffff880809236000
Jul 16 12:06:01 unRAID-Server kernel: R13: ffff880809239000 R14: 0000000000000003 R15: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: check_parity+0x125/0x30b [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 16 12:06:01 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 16 12:06:01 unRAID-Server kernel: kthread+0x111/0x119
Jul 16 12:06:01 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 16 12:06:01 unRAID-Server kernel: ? SyS_exit_group+0xb/0xb
Jul 16 12:06:01 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 16 12:06:01 unRAID-Server kernel: 35-...: (59999 ticks this GP) idle=216/140000000000001/0 softirq=24868/24868 fqs=13593
Jul 16 12:06:01 unRAID-Server kernel: (detected by 3, t=60011 jiffies, g=118852, c=118851, q=22604)
Jul 16 12:06:01 unRAID-Server kernel: Sending NMI from CPU 3 to CPUs 35:
Jul 16 12:06:01 unRAID-Server kernel: NMI backtrace for cpu 35
Jul 16 12:06:01 unRAID-Server kernel: CPU: 35 PID: 4029 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 16 12:06:01 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Jul 16 12:06:01 unRAID-Server kernel: task: ffff88081b485100 task.stack: ffffc90009098000
Jul 16 12:06:01 unRAID-Server kernel: RIP: 0010:memcmp+0x7/0x1d
Jul 16 12:06:01 unRAID-Server kernel: RSP: 0018:ffffc9000909bcd0 EFLAGS: 00000287
Jul 16 12:06:01 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080a1fcc68 RCX: 00000000000000eb
Jul 16 12:06:01 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff880809239008 RDI: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: R10: ffff880809238000 R11: ffff880809237000 R12: ffff880819073c00
Jul 16 12:06:01 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff880809239000
Jul 16 12:06:01 unRAID-Server kernel: FS: 0000000000000000(0000) GS:ffff88103f7c0000(0000) knlGS:0000000000000000
Jul 16 12:06:01 unRAID-Server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 16 12:06:01 unRAID-Server kernel: CR2: 000014fa5b793000 CR3: 0000000001c0a004 CR4: 00000000001606e0

I have also attached my diagnostics as well as my full syslog.
If you need further details, please let me know. I would be very thankful to everybody helping me out here!

Best regards
JuliusZet

Attachments: unraid-server-diagnostics-20180716-1232.zip, unraid-server-syslog-20180716-1228.zip
JorgeB Posted July 16, 2018

Various NMI events; these are usually hardware related. You can try looking for a BIOS update, using the controller in a different slot, or replacing the controller with a different model.
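To get a feel for how often these NMIs fire, and on which cores, you can count them in the saved syslog. A quick sketch — "syslog.txt" is a placeholder name, point it at the extracted unraid-server-syslog-*.zip contents:

```shell
# Tally "NMI backtrace for cpu N" events per CPU in a saved syslog.
# "syslog.txt" is a placeholder path; substitute your extracted syslog.
grep -oE 'NMI backtrace for cpu [0-9]+' syslog.txt | sort | uniq -c | sort -rn
```

If the backtraces cluster on the same CPU every time, that points more at one busy kernel thread (like unraidd) than at randomly failing hardware.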
JuliusZet Posted July 16, 2018 (Author)

32 minutes ago, johnnie.black said: Various NMI events; these are usually hardware related. You can try looking for a BIOS update, using the controller in a different slot, or replacing the controller with a different model.

Thank you very much for your reply! When I get home, I am going to:
- update my BIOS
- see if that fixes the issue; if not:
- put the controller in a different slot

Have a great day!
JuliusZet Posted July 16, 2018 (Author)

5 hours ago, johnnie.black said: Various NMI events, these are usually hardware related, you can try looking for a bios update, using the controller in a different slot or replacing the controller by a different model.

Updating the BIOS and changing the PCIe slot of the controller did not fix the issue. The problem still persists. Could someone explain what I am experiencing here? Could this be a driver issue?
JorgeB Posted July 16, 2018

I would recommend getting one of the recommended LSI HBA models: any LSI with a SAS2008/2308/3008 chipset in IT mode, e.g. 9201-8i, 9211-8i, 9207-8i, 9300-8i, etc., or clones like the Dell H200/H310 and IBM M1015; these latter ones need to be crossflashed.
JuliusZet Posted July 17, 2018 (Author)

12 hours ago, johnnie.black said: I would recommend getting one of the recommended LSI HBA models: any LSI with a SAS2008/2308/3008 chipset in IT mode, e.g. 9201-8i, 9211-8i, 9207-8i, 9300-8i, etc., or clones like the Dell H200/H310 and IBM M1015; these latter ones need to be crossflashed.

Yesterday I uninstalled the HP H240 HBA and connected the SFF-8087 cables from the backplane to the embedded HP P420i (which I previously configured to operate in HBA mode). Overnight I successfully completed a Parity Check, however I noticed some strange "spikes":

It's the same thing happening here as with the HP H240 HBA: the disks' read speeds suddenly drop while the CPU usage rises. The only difference is that with the HP H240 HBA things remained bad, whereas with the HP P420i it looks like it could sort of "recover" somehow.

I suspect the new SFF-8087 cables, since I had to swap the original ones that worked perfectly with the embedded controller for the ones that came with the new HP H240 HBA. (The original cables had angled connectors, so they didn't fit in the ports of the HBA.)

Now I have two questions:
- Could this issue be related to defective cables / loose connections?
- I'm thinking about buying an LSI SAS 9207-8i. Are there chances that this issue still persists with the LSI HBA?
JorgeB Posted July 17, 2018

8 minutes ago, JuliusZet said: Could this issue be related to defective cables / loose connections?

Very unlikely.

10 minutes ago, JuliusZet said: I'm thinking about buying an LSI SAS 9207-8i. Are there chances that this issue still persists with the LSI HBA?

Possible. Post diagnostics that cover the last parity check, so we can see if the problem is the same.
JuliusZet Posted July 17, 2018 (Author)

8 minutes ago, johnnie.black said: Very unlikely. Possible, post diagnostics that cover the last parity check, to see if the problem is the same.

Sorry, I totally forgot about this. I attached my diagnostics and my syslog:

Attachments: unraid-server-diagnostics-20180717-0838.zip, unraid-server-syslog-20180717-0839.zip
JorgeB Posted July 17, 2018

NMIs are still happening during the check, but the current controller uses the same driver as the previous one. LSI will use a different driver, so if the issues are related to the controller the LSI should work without problems, but I can't say for sure.
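For context: both the embedded P420i and the H240 are driven by the hpsa driver, while the LSI cards use mpt2sas/mpt3sas. You can confirm which driver each controller is bound to with `lspci -k` (a copy of its output is also saved inside the Unraid diagnostics zip). A sketch that filters a saved copy — the file name lspci.txt is an assumption:

```shell
# Print each storage controller and the "Kernel driver in use" line
# that follows it; lspci.txt is an assumed saved copy of `lspci -k`.
grep -iA2 -E 'RAID|SAS' lspci.txt | grep -E 'RAID|SAS|Kernel driver'
```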
JuliusZet Posted July 26, 2018 (Author)

On 7/17/2018 at 8:58 AM, johnnie.black said: NMIs are still happening during the check, but the current controller uses the same driver as the previous one. LSI will use a different driver, so if the issues are related to the controller the LSI should work without problems, but I can't say for sure.

Hello again, my LSI SAS 9207-8i just arrived. I have installed it and it works perfectly. Except during Parity Checks... NMIs are still there. Even the plugin system.stats is outputting impossible values because of the "system overload". I don't even know what to do anymore. I'm very frustrated right now.

Attachments: unraid-server-diagnostics-20180726-1558.zip, unraid-server-syslog-20180726-1558.zip
JorgeB Posted July 26, 2018

That's bad news. I would then guess it's a problem with the server/board. Check to see if there's a system event log; server boards usually have them, and there might be more info there.
pwm Posted July 26, 2018 (edited)

Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation.

Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

Edited July 26, 2018 by pwm
JuliusZet Posted July 26, 2018 (Author)

9 minutes ago, pwm said: Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation.

Yes, I did this with my HP H240 before. Same results.

10 minutes ago, pwm said: Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

Yes, that's a good idea! I will give this a try now. Thank you for your participation and have a great day!
JuliusZet Posted July 26, 2018 (Author)

2 hours ago, johnnie.black said: That's bad news. I would then guess it's a problem with the server/board. Check to see if there's a system event log; server boards usually have them, and there might be more info there.

I can only find two kinds of logs:
- The "iLO Event Log", which shows me stuff like "Server reset." or "Power on request received by: Automatic Power Recovery."
- The "Integrated Management Log", which shows me stuff like "Firmware flashed (ProLiant System BIOS - P70 05/21/2018)" or "Maintenance note: Intelligent Provisioning was loaded."

I can not find anything unusual related to NMIs there. Where would I find the system event log you mentioned earlier?
JorgeB Posted July 26, 2018

10 minutes ago, JuliusZet said: Where would I find the system event log you mentioned earlier?

It's likely the iLO Event Log. I wouldn't expect NMIs to be logged, but there could be some other hardware issue logged.
JuliusZet Posted July 26, 2018 (Author, edited)

30 minutes ago, johnnie.black said: It's likely the iLO Event Log. I wouldn't expect NMIs to be logged, but there could be some other hardware issue logged.

Nope, there is nothing unusual there.

1 hour ago, pwm said: Have you tried to move the card to another slot? Maybe you are getting an interrupt collision where the wrong driver gets activated and starts looking at hardware not even involved in the disk copy operation. Edit: And do you have hardware on the motherboard that you don't need and can turn off in the BIOS: audio? serial ports? An additional SATA controller? ...

What I did in the meantime:
- BIOS reset
- Deactivated all unnecessary devices (on-board SATA controller + on-board RAID controller)
- Re-created the unRAID USB flash device

But that didn't seem to help at all. An excerpt from my syslog during the parity check (started at 19:41:20):

Jul 26 19:41:20 unRAID-Server emhttpd: req (2): startState=STARTED&file=&cmdCheck=Check&optionCorrect=correct&csrf_token=****************
Jul 26 19:41:20 unRAID-Server kernel: mdcmd (40): check correct
Jul 26 19:41:20 unRAID-Server kernel: md: recovery thread: check P ...
Jul 26 19:41:20 unRAID-Server kernel: md: using 1536k window, over a total of 1953514552 blocks.
Jul 26 19:42:12 unRAID-Server sSMTP[5210]: Creating SSL connection to host
Jul 26 19:42:12 unRAID-Server sSMTP[5210]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 26 19:42:14 unRAID-Server sSMTP[5210]: Sent mail for [email protected] (221 2.0.0 fwd26.t-online.de closing. / Closing.) uid=0 username=root outbytes=760
Jul 26 19:42:46 unRAID-Server kernel: INFO: rcu_sched self-detected stall on CPU
Jul 26 19:42:46 unRAID-Server kernel: 21-...: (59999 ticks this GP) idle=1ee/140000000000001/0 softirq=1792/1792 fqs=13455
Jul 26 19:42:46 unRAID-Server kernel: (t=60001 jiffies g=6532 c=6531 q=24126)
Jul 26 19:42:46 unRAID-Server kernel: NMI backtrace for cpu 21
Jul 26 19:42:46 unRAID-Server kernel: CPU: 21 PID: 3954 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 26 19:42:46 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 05/21/2018
Jul 26 19:42:46 unRAID-Server kernel: Call Trace:
Jul 26 19:42:46 unRAID-Server kernel: <IRQ>
Jul 26 19:42:46 unRAID-Server kernel: dump_stack+0x5d/0x79
Jul 26 19:42:46 unRAID-Server kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 26 19:42:46 unRAID-Server kernel: nmi_cpu_backtrace+0x9b/0xba
Jul 26 19:42:46 unRAID-Server kernel: ? irq_force_complete_move+0xf3/0xf3
Jul 26 19:42:46 unRAID-Server kernel: nmi_trigger_cpumask_backtrace+0x56/0xd4
Jul 26 19:42:46 unRAID-Server kernel: rcu_dump_cpu_stacks+0x8e/0xb8
Jul 26 19:42:46 unRAID-Server kernel: rcu_check_callbacks+0x212/0x5f0
Jul 26 19:42:46 unRAID-Server kernel: update_process_times+0x23/0x45
Jul 26 19:42:46 unRAID-Server kernel: tick_sched_timer+0x33/0x61
Jul 26 19:42:46 unRAID-Server kernel: __hrtimer_run_queues+0x78/0xc1
Jul 26 19:42:46 unRAID-Server kernel: hrtimer_interrupt+0x87/0x157
Jul 26 19:42:46 unRAID-Server kernel: smp_apic_timer_interrupt+0x75/0x85
Jul 26 19:42:46 unRAID-Server kernel: apic_timer_interrupt+0x7d/0x90
Jul 26 19:42:46 unRAID-Server kernel: </IRQ>
Jul 26 19:42:46 unRAID-Server kernel: RIP: 0010:memcmp+0x2/0x1d
Jul 26 19:42:46 unRAID-Server kernel: RSP: 0018:ffffc9000727bcd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Jul 26 19:42:46 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080578bd20 RCX: 0000000000000409
Jul 26 19:42:46 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff8808057cc008 RDI: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: R10: ffff8808057cb000 R11: ffff8808057ca000 R12: ffff88081a045800
Jul 26 19:42:46 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: check_parity+0x14f/0x30b [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 26 19:42:46 unRAID-Server kernel: kthread+0x111/0x119
Jul 26 19:42:46 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 26 19:42:46 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 26 19:42:46 unRAID-Server kernel: 21-...: (59999 ticks this GP) idle=1ee/140000000000001/0 softirq=1792/1792 fqs=13456
Jul 26 19:42:46 unRAID-Server kernel: (detected by 26, t=60005 jiffies, g=6532, c=6531, q=24126)
Jul 26 19:42:46 unRAID-Server kernel: Sending NMI from CPU 26 to CPUs 21:
Jul 26 19:42:46 unRAID-Server kernel: NMI backtrace for cpu 21
Jul 26 19:42:46 unRAID-Server kernel: CPU: 21 PID: 3954 Comm: unraidd Not tainted 4.14.49-unRAID #1
Jul 26 19:42:46 unRAID-Server kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 05/21/2018
Jul 26 19:42:46 unRAID-Server kernel: task: ffff88081ad53600 task.stack: ffffc90007278000
Jul 26 19:42:46 unRAID-Server kernel: RIP: 0010:memcmp+0x2/0x1d
Jul 26 19:42:46 unRAID-Server kernel: RSP: 0018:ffffc9000727bcd0 EFLAGS: 00000246
Jul 26 19:42:46 unRAID-Server kernel: RAX: 0000000000000000 RBX: ffff88080578bd20 RCX: 0000000000000fba
Jul 26 19:42:46 unRAID-Server kernel: RDX: 0000000000000ff8 RSI: ffff8808057cc008 RDI: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: RBP: 0000000000000258 R08: 0000000000000000 R09: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: R10: ffff8808057cb000 R11: ffff8808057ca000 R12: ffff88081a045800
Jul 26 19:42:46 unRAID-Server kernel: R13: 0000000000000001 R14: 0000000000000003 R15: ffff8808057cc000
Jul 26 19:42:46 unRAID-Server kernel: FS: 0000000000000000(0000) GS:ffff88081f8c0000(0000) knlGS:0000000000000000
Jul 26 19:42:46 unRAID-Server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 19:42:46 unRAID-Server kernel: CR2: 0000151a721d77a0 CR3: 0000000001c0a001 CR4: 00000000001606e0
Jul 26 19:42:46 unRAID-Server kernel: Call Trace:
Jul 26 19:42:46 unRAID-Server kernel: check_parity+0x14f/0x30b [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: handle_stripe+0xefc/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: unraidd+0xb8/0x111 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_open+0x2c/0x2c [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? handle_stripe+0x1293/0x1293 [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: md_thread+0xbc/0xcc [md_mod]
Jul 26 19:42:46 unRAID-Server kernel: ? wait_woken+0x68/0x68
Jul 26 19:42:46 unRAID-Server kernel: kthread+0x111/0x119
Jul 26 19:42:46 unRAID-Server kernel: ? kthread_create_on_node+0x3a/0x3a
Jul 26 19:42:46 unRAID-Server kernel: ret_from_fork+0x35/0x40
Jul 26 19:42:46 unRAID-Server kernel: Code: 48 63 c1 4c 39 c0 73 19 49 8b 3c c1 48 85 ff 74 10 4c 89 d6 e8 71 ff ff ff 84 c0 75 09 ff c1 eb df b9 ea ff ff ff 89 c8 c3 31 c9 <48> 39 d1 74 13 0f b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74
Jul 26 19:43:14 unRAID-Server emhttpd: req (3): startState=STARTED&file=&csrf_token=****************&cmdNoCheck=Cancel
Jul 26 19:43:14 unRAID-Server kernel: mdcmd (41): nocheck
Jul 26 19:43:15 unRAID-Server kernel: md: md_do_sync: got signal, exit...
Jul 26 19:43:15 unRAID-Server kernel: md: recovery thread: completion status: -4
Jul 26 19:44:01 unRAID-Server sSMTP[5754]: Creating SSL connection to host
Jul 26 19:44:01 unRAID-Server sSMTP[5754]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Jul 26 19:44:02 unRAID-Server sSMTP[5754]: Sent mail for [email protected] (221 2.0.0 fwd14.t-online.de closing. / Closing.) uid=0 username=root outbytes=78

Edit: I saw some "Advanced CPU Settings" in the BIOS but I did not touch them. Maybe one of these settings is causing those errors?

Edited July 26, 2018 by JuliusZet
JuliusZet Posted July 26, 2018 Author Share Posted July 26, 2018 (edited) When I put some real load on the server like starting many VMs simultaniously I see this in my syslog: Jul 26 21:09:36 unRAID-Server kernel: perf: interrupt took too long (4229 > 2500), lowering kernel.perf_event_max_sample_rate to 47000 Jul 26 21:09:37 unRAID-Server kernel: perf: interrupt took too long (6320 > 5286), lowering kernel.perf_event_max_sample_rate to 31000 Jul 26 21:09:46 unRAID-Server kernel: perf: interrupt took too long (8617 > 7900), lowering kernel.perf_event_max_sample_rate to 23000 Jul 26 21:09:51 unRAID-Server kernel: perf: interrupt took too long (12258 > 10771), lowering kernel.perf_event_max_sample_rate to 16000 Jul 26 21:09:56 unRAID-Server kernel: perf: interrupt took too long (16051 > 15322), lowering kernel.perf_event_max_sample_rate to 12000 Jul 26 21:10:08 unRAID-Server kernel: perf: interrupt took too long (21657 > 20063), lowering kernel.perf_event_max_sample_rate to 9000 Jul 26 21:10:25 unRAID-Server kernel: perf: interrupt took too long (27495 > 27071), lowering kernel.perf_event_max_sample_rate to 7000 Jul 26 21:11:06 unRAID-Server kernel: perf: interrupt took too long (35995 > 34368), lowering kernel.perf_event_max_sample_rate to 5000 Jul 26 21:12:45 unRAID-Server kernel: perf: interrupt took too long (46427 > 44993), lowering kernel.perf_event_max_sample_rate to 4000 Jul 26 21:15:29 unRAID-Server kernel: perf: interrupt took too long (58952 > 58033), lowering kernel.perf_event_max_sample_rate to 3000 Jul 26 21:19:25 unRAID-Server kernel: perf: interrupt took too long (76236 > 73690), lowering kernel.perf_event_max_sample_rate to 2000 Maybe that has got something to do with my NMIs? Edited July 26, 2018 by JuliusZet Quote Link to comment
JuliusZet Posted July 27, 2018 (Author)

When all of my VMs are running (mostly game servers) and a Parity Check is started, my system stats look like this: during the Parity Check all my game servers have very bad performance drops. When I stop all VMs and start a Parity Check, my graphs look normal. I have no clue what is going on here.

Attachment: unraid-server-diagnostics-20180727-0908.zip
JuliusZet Posted July 27, 2018 (Author)

@johnnie.black Do you think that this issue could be related to Tunables? I did not change them because I do not understand what they do, and the wiki and forum posts are outdated (http://lime-technology.com/wiki/Improving_unRAID_Performance#User_Tunables). So mine are all set to default values. Are there Tunables that I could change to try and fix this issue?

I am sorry if I am bothering you, but I really want to fix this issue and I don't know how to get started or who else I could talk to.
JorgeB Posted July 27, 2018

56 minutes ago, JuliusZet said: Do you think that this issue could be related to Tunables?

Not likely, but it won't hurt to try. Use these:
Tunable (md_num_stripes): 4096
Tunable (md_sync_window): 2048
Tunable (md_sync_thresh): 2000
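For reference, these tunables are normally changed in the WebGUI under Settings → Disk Settings and take effect when the array is started. A sketch of how the corresponding entries might look in /boot/config/disk.cfg — the exact file layout is an assumption, so prefer the GUI over editing the file by hand:

```
# /boot/config/disk.cfg (excerpt, assumed layout)
md_num_stripes="4096"
md_sync_window="2048"
md_sync_thresh="2000"
```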
JuliusZet Posted July 27, 2018 (Author, edited)

1 hour ago, johnnie.black said: Not likely, but it won't hurt to try. Use these: Tunable (md_num_stripes): 4096, Tunable (md_sync_window): 2048, Tunable (md_sync_thresh): 2000

Hi johnnie.black, I just figured it out!

What I just did: I applied the Tunables from your previous post and started a Parity Check. The NMIs started immediately. I canceled the Parity Check and started it again just to double-check ... yep, disk speeds are bad and CPU usage is high right from the start of the Parity Check. Then I reset the Tunable values to defaults, started the Parity Check again, and ... it happened like before: at first everything looks normal (disk speeds as expected from 4 HDDs and CPU usage ~2 %), but after a few seconds the NMIs start.

So I thought: hmm, when I increase the Tunables it gets worse. What if I decreased them instead? Taking the default values and halving them seemed like a good start:
Tunable (nr_requests): 64
Tunable (md_num_stripes): 640
Tunable (md_sync_window): 192
Tunable (md_sync_thresh): 96

Wow! I did not think that this would work! Look at the system stats graphs: the syslog shows no NMIs, no errors, nothing! The only thing I need to figure out now is what these strange "spikes" are.

Don't get me wrong, I'm really very glad that it is finally working! However, these spikes are not normal. I think it would help if I had some up-to-date information about the Tunables: what exactly they stand for and what they do. I'm sure this was discussed somewhere on the forum already, but I can't find it. Could you link a post where it is explained? That would be great! Thank you very much!

Edited July 27, 2018 by JuliusZet
JorgeB Posted July 27, 2018

Original tunables: https://lime-technology.com/forums/topic/4473-unraid-server-release-45-beta8-available/?do=findComment&comment=41691

And a description of sync_thresh that was added later: https://lime-technology.com/forums/topic/42369-unraid-server-release-614-available/?do=findComment&comment=416597

Try to keep the tunables as high as possible without triggering the issue; with everything working correctly, lower tunables will usually decrease performance, especially once you start adding more disks.