Help to Troubleshoot Memory-Related Errors - Any/All Ideas Welcomed!



So..... this is starting to drive me batshit crazy, so any input from anyone is really appreciated.

If I cannot get to the bottom of it, I'm saying F*** it and buying a different motherboard (the original one died; this one is the replacement). Also, screw Gigabyte lately; they used to be so much more awesome than my experience with their recent releases.

 

I have these kinds of errors in my syslog:

Jul 27 15:56:46 Server kernel: WARNING: CPU: 11 PID: 16664 at arch/x86/kernel/cpu/perf_event_intel_ds.c:334 reserve_ds_buffers+0x110/0x33d()
Jul 27 15:56:46 Server kernel: alloc_bts_buffer: BTS buffer allocation failure
Jul 27 15:56:46 Server kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net tun vhost macvtap macvlan xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod it87 hwmon_vid mxm_wmi x86_pkg_temp_thermal coretemp kvm_intel kvm e1000e i2c_i801 ahci ptp pps_core libahci wmi
Jul 27 15:56:46 Server kernel: CPU: 11 PID: 16664 Comm: qemu-system-x86 Not tainted 4.4.15-unRAID #1
Jul 27 15:56:46 Server kernel: Hardware name: Gigabyte Technology Co., Ltd. Default string/X99-SLI-CF, BIOS F22 06/13/2016
Jul 27 15:56:46 Server kernel: 0000000000000000 ffff880580de7920 ffffffff81369dfe ffff880580de7968
Jul 27 15:56:46 Server kernel: 000000000000014e ffff880580de7958 ffffffff8104a31d ffffffff81020923
Jul 27 15:56:46 Server kernel: 0000000000000000 0000000000000001 0000000000000009 ffff880125248700
Jul 27 15:56:46 Server kernel: Call Trace:
Jul 27 15:56:46 Server kernel: [<ffffffff81369dfe>] dump_stack+0x61/0x7e
Jul 27 15:56:46 Server kernel: [<ffffffff8104a31d>] warn_slowpath_common+0x8f/0xa8
Jul 27 15:56:46 Server kernel: [<ffffffff81020923>] ? reserve_ds_buffers+0x110/0x33d
Jul 27 15:56:46 Server kernel: [<ffffffff8104a379>] warn_slowpath_fmt+0x43/0x4b
Jul 27 15:56:46 Server kernel: [<ffffffff810f6bc3>] ? __kmalloc_node+0x22/0x153
Jul 27 15:56:46 Server kernel: [<ffffffff81020923>] reserve_ds_buffers+0x110/0x33d
Jul 27 15:56:46 Server kernel: [<ffffffff8101b3e0>] x86_reserve_hardware+0x135/0x147
Jul 27 15:56:46 Server kernel: [<ffffffff8101b442>] x86_pmu_event_init+0x50/0x1c9
Jul 27 15:56:46 Server kernel: [<ffffffff810ae054>] perf_try_init_event+0x41/0x72
Jul 27 15:56:46 Server kernel: [<ffffffff810ae4a5>] perf_event_alloc+0x420/0x66e
Jul 27 15:56:46 Server kernel: [<ffffffffa0837596>] ? kvm_dev_ioctl_get_cpuid+0x1c0/0x1c0 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffff810b041b>] perf_event_create_kernel_counter+0x22/0x112
Jul 27 15:56:46 Server kernel: [<ffffffffa08376e1>] pmc_reprogram_counter+0xbf/0x104 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0837933>] reprogram_fixed_counter+0xc7/0xd8 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e941>] intel_pmu_set_msr+0xe0/0x2ca [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0837b34>] kvm_pmu_set_msr+0x15/0x17 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0819a5a>] kvm_set_msr_common+0x921/0x983 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e3ba>] vmx_set_msr+0x2ec/0x2fe [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0816427>] kvm_set_msr+0x61/0x63 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b479ba>] handle_wrmsr+0x3b/0x62 [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4c5f9>] vmx_handle_exit+0xfbb/0x1053 [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e0bf>] ? vmx_vcpu_run+0x30e/0x31d [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa081ff9c>] kvm_arch_vcpu_ioctl_run+0x38a/0x1080 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa081a93b>] ? kvm_arch_vcpu_load+0x6b/0x173 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa081a9b8>] ? kvm_arch_vcpu_load+0xe8/0x173 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa08120ec>] kvm_vcpu_ioctl+0x178/0x499 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa082e7d9>] ? em_rsm+0x14d/0x14d [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffff81117b8f>] do_vfs_ioctl+0x3a3/0x416
Jul 27 15:56:46 Server kernel: [<ffffffff8111fba5>] ? __fget+0x72/0x7e
Jul 27 15:56:46 Server kernel: [<ffffffff81117c40>] SyS_ioctl+0x3e/0x5c
Jul 27 15:56:46 Server kernel: [<ffffffff81622f6e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jul 27 15:56:46 Server kernel: ---[ end trace 8f5773cb964683c2 ]---
Jul 27 15:56:46 Server kernel: qemu-system-x86: page allocation failure: order:4, mode:0x260c0c0
Jul 27 15:56:46 Server kernel: CPU: 11 PID: 16664 Comm: qemu-system-x86 Tainted: G        W       4.4.15-unRAID #1
Jul 27 15:56:46 Server kernel: Hardware name: Gigabyte Technology Co., Ltd. Default string/X99-SLI-CF, BIOS F22 06/13/2016
Jul 27 15:56:46 Server kernel: 0000000000000000 ffff880580de7798 ffffffff81369dfe 0000000000000001
Jul 27 15:56:46 Server kernel: 0000000000000004 ffff880580de7830 ffffffff810bcc1f 0260c0c000000010
Jul 27 15:56:46 Server kernel: ffff880600000040 0000000400000040 0000000000000004 0000000000000004
Jul 27 15:56:46 Server kernel: Call Trace:
Jul 27 15:56:46 Server kernel: [<ffffffff81369dfe>] dump_stack+0x61/0x7e
Jul 27 15:56:46 Server kernel: [<ffffffff810bcc1f>] warn_alloc_failed+0x10f/0x127
Jul 27 15:56:46 Server kernel: [<ffffffff810bfc36>] __alloc_pages_nodemask+0x870/0x8ca
Jul 27 15:56:46 Server kernel: [<ffffffff810bfe3a>] alloc_kmem_pages_node+0x4b/0xb3
Jul 27 15:56:46 Server kernel: [<ffffffff810f4424>] kmalloc_large_node+0x24/0x52
Jul 27 15:56:46 Server kernel: [<ffffffff810f6bc3>] __kmalloc_node+0x22/0x153
Jul 27 15:56:46 Server kernel: [<ffffffff8102099f>] reserve_ds_buffers+0x18c/0x33d
Jul 27 15:56:46 Server kernel: [<ffffffff8101b3e0>] x86_reserve_hardware+0x135/0x147
Jul 27 15:56:46 Server kernel: [<ffffffff8101b442>] x86_pmu_event_init+0x50/0x1c9
Jul 27 15:56:46 Server kernel: [<ffffffff810ae054>] perf_try_init_event+0x41/0x72
Jul 27 15:56:46 Server kernel: [<ffffffff810ae4a5>] perf_event_alloc+0x420/0x66e
Jul 27 15:56:46 Server kernel: [<ffffffffa0837596>] ? kvm_dev_ioctl_get_cpuid+0x1c0/0x1c0 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffff810b041b>] perf_event_create_kernel_counter+0x22/0x112
Jul 27 15:56:46 Server kernel: [<ffffffffa08376e1>] pmc_reprogram_counter+0xbf/0x104 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0837933>] reprogram_fixed_counter+0xc7/0xd8 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e941>] intel_pmu_set_msr+0xe0/0x2ca [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0837b34>] kvm_pmu_set_msr+0x15/0x17 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0819a5a>] kvm_set_msr_common+0x921/0x983 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e3ba>] vmx_set_msr+0x2ec/0x2fe [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0816427>] kvm_set_msr+0x61/0x63 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b479ba>] handle_wrmsr+0x3b/0x62 [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4c5f9>] vmx_handle_exit+0xfbb/0x1053 [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa0b4e0bf>] ? vmx_vcpu_run+0x30e/0x31d [kvm_intel]
Jul 27 15:56:46 Server kernel: [<ffffffffa081ff9c>] kvm_arch_vcpu_ioctl_run+0x38a/0x1080 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa081a93b>] ? kvm_arch_vcpu_load+0x6b/0x173 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa081a9b8>] ? kvm_arch_vcpu_load+0xe8/0x173 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa08120ec>] kvm_vcpu_ioctl+0x178/0x499 [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffffa082e7d9>] ? em_rsm+0x14d/0x14d [kvm]
Jul 27 15:56:46 Server kernel: [<ffffffff81117b8f>] do_vfs_ioctl+0x3a3/0x416
Jul 27 15:56:46 Server kernel: [<ffffffff8111fba5>] ? __fget+0x72/0x7e
Jul 27 15:56:46 Server kernel: [<ffffffff81117c40>] SyS_ioctl+0x3e/0x5c
Jul 27 15:56:46 Server kernel: [<ffffffff81622f6e>] entry_SYSCALL_64_fastpath+0x12/0x6d
Jul 27 15:56:46 Server kernel: Mem-Info:
Jul 27 15:56:46 Server kernel: active_anon:1844977 inactive_anon:10104 isolated_anon:0
Jul 27 15:56:46 Server kernel: active_file:555155 inactive_file:761395 isolated_file:0
Jul 27 15:56:46 Server kernel: unevictable:4571771 dirty:629 writeback:53 unstable:0
Jul 27 15:56:46 Server kernel: slab_reclaimable:261411 slab_unreclaimable:31708
Jul 27 15:56:46 Server kernel: mapped:31517 shmem:100649 pagetables:16889 bounce:0
Jul 27 15:56:46 Server kernel: free:90006 free_pcp:64 free_cma:0
Jul 27 15:56:46 Server kernel: Node 0 DMA free:15892kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 27 15:56:46 Server kernel: lowmem_reserve[]: 0 1979 31930 31930
Jul 27 15:56:46 Server kernel: Node 0 DMA32 free:128068kB min:8372kB low:10464kB high:12556kB active_anon:501832kB inactive_anon:3032kB active_file:29564kB inactive_file:26644kB unevictable:1411940kB isolated(anon):0kB isolated(file):0kB present:2174356kB managed:2164640kB mlocked:1411940kB dirty:12kB writeback:0kB mapped:9220kB shmem:25704kB slab_reclaimable:46352kB slab_unreclaimable:4568kB kernel_stack:912kB pagetables:4532kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jul 27 15:56:46 Server kernel: lowmem_reserve[]: 0 0 29951 29951
Jul 27 15:56:46 Server kernel: Node 0 Normal free:216064kB min:126728kB low:158408kB high:190092kB active_anon:6878076kB inactive_anon:37384kB active_file:2191056kB inactive_file:3018936kB unevictable:16875144kB isolated(anon):0kB isolated(file):0kB present:31195136kB managed:30670976kB mlocked:16875144kB dirty:2504kB writeback:212kB mapped:116848kB shmem:376892kB slab_reclaimable:999292kB slab_unreclaimable:122264kB kernel_stack:15408kB pagetables:63024kB unstable:0kB bounce:0kB free_pcp:256kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:64 all_unreclaimable? no
Jul 27 15:56:46 Server kernel: lowmem_reserve[]: 0 0 0 0
Jul 27 15:56:46 Server kernel: Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 3*4096kB (M) = 15892kB
Jul 27 15:56:46 Server kernel: Node 0 DMA32: 809*4kB (UME) 936*8kB (UME) 1036*16kB (UME) 571*32kB (UME) 261*64kB (UME) 118*128kB (UM) 36*256kB (UM) 13*512kB (UME) 16*1024kB (UME) 5*2048kB (M) 2*4096kB (M) = 128068kB
Jul 27 15:56:46 Server kernel: Node 0 Normal: 26167*4kB (UME) 9745*8kB (UME) 2095*16kB (U) 14*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 216596kB
Jul 27 15:56:46 Server kernel: 1417236 total pagecache pages
Jul 27 15:56:46 Server kernel: 0 pages in swap cache
Jul 27 15:56:46 Server kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul 27 15:56:46 Server kernel: Free swap  = 0kB
Jul 27 15:56:46 Server kernel: Total swap = 0kB
Jul 27 15:56:46 Server kernel: 8346367 pages RAM
Jul 27 15:56:46 Server kernel: 0 pages HighMem/MovableOnly
Jul 27 15:56:46 Server kernel: 133490 pages reserved
Jul 27 15:56:46 Server kernel: qemu-system-x86: page allocation failure: order:4, mode:0x260c0c0

 

Sometimes these repeat LOTS of times.
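For what it's worth, the "order:4" in the failure line means the kernel wanted 2^4 contiguous pages (64kB in one block), and the "Node 0 DMA32/Normal" buddy lists above show the Normal zone with zero free blocks at 64kB and up, so this looks like memory fragmentation rather than actually running out of RAM. A couple of commands I've been using to watch it (just my own sketch, not something anyone official suggested):

# Free blocks per order; columns run from order 0 (4kB) upward.
# When the high-order columns for "Normal" hit zero, an order:4 request fails.
cat /proc/buddyinfo

# Ask the kernel to compact memory (needs CONFIG_COMPACTION, which kernels
# of this era should have); the higher-order columns refill, at least for a while.
echo 1 > /proc/sys/vm/compact_memory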

 

They've been there for quite some time throughout the 6.2 betas, though I believe they started to appear when the motherboard was swapped for the replacement board (the one currently installed).

 

I initially reported it here: http://lime-technology.com/forum/index.php?topic=48193.msg471875#msg471875

then followed up here: https://lime-technology.com/forum/index.php?topic=49705.msg481602#msg481602.

I opened a support request with LT, which resulted in this (I think they're a little busy lately!!):

I've reviewed your post and the log and unfortunately don't have much to comment on at this time.  I would suggest trying the latest RC to see if any improvements are there.  This definitely feels hardware-specific as it's the first case of this I've seen.  We'll keep our eyes open and see if we can recreate this ourselves, but the initial look is that it's hardware-related which means it will be difficult to reproduce.

 

I followed up hoping to be pointed in a direction for figuring this out, but that was two weeks ago and I've heard nothing back.  :-[

 

I've played with the memory timings just in case something is odd... Nada.

Newest BIOS, check... New fancy power supply because it was my birthday (HX850i), check (no difference, but I didn't figure it was related).

Both the included Memtest and the newest PassMark version passed multiple runs.

Tried various XMP and "optimization" RAM settings in the BIOS; no difference.

Everything else is set to Auto, and I lowered the RAM from 2400 to 2133 just in case.
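For anyone comparing notes, I've been double-checking what the board actually trained the DIMMs to after each change from the console (assuming dmidecode is present on the box):

# Dump the DMI "Memory Device" entries; "Configured Clock Speed" is what the
# BIOS actually trained the DIMMs to, regardless of the XMP profile selected.
dmidecode --type 17 | grep -E 'Speed|Part Number'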

 

I was thinking this was an OOM issue related to KVM or QEMU, but no one who has looked at the syslog entries thinks that is the case, so I guess it is not.
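That said, the trace itself does point into KVM's virtual PMU: the failing allocation comes from reserve_ds_buffers, reached via kvm_pmu_set_msr -> pmc_reprogram_counter, i.e. a guest programming its performance counters makes the host try to grab a 64kB BTS buffer. If that reading is right, hiding the PMU from a guest should make the warning vanish for that VM. Something I may try (assuming unRAID's libvirt is new enough, as the <pmu> element needs libvirt 1.2.12+; "Windows10" is just a placeholder VM name):

virsh edit Windows10

# then add inside the domain XML:
#   <features>
#     <pmu state='off'/>
#   </features>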

These parts always look suspicious to me:

Jul  4 12:47:20 Server kernel: 0 pages in swap cache
Jul  4 12:47:20 Server kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul  4 12:47:20 Server kernel: Free swap  = 0kB
Jul  4 12:47:20 Server kernel: Total swap = 0kB
Jul  4 12:47:20 Server kernel: 8338252 pages RAM
Jul  4 12:47:20 Server kernel: 0 pages HighMem/MovableOnly

 

While these don't always make unRAID unstable, they certainly aren't supposed to be there.

Sometimes, however, they get bad enough that a VM will shut down. I have also recently seen an OOM-related issue, which did force a shutdown of 3 of my 4 active VMs.

 

Current diagnostics are attached, along with the log from a fluke OOM condition I had the other day, which is probably unrelated to this (though I was playing with RAM settings a little at the time, so it may be related after all).

 

If you have some thoughts, great. If not, I am about to just buy a new motherboard (I don't really WANT to do this!), as I have wasted too much time on this..  :-\

 

The WAF also went down, to the level of "your other hardware seemed to be fine, why did you get this new stuff you've had a lot of issues with?" (no comment)..  :P

server-diagnostics-20160727-1728.zip

server-syslog-20160725-1930.zip


I didn't go through your diagnostics or follow the links you gave but, given that your memory passes the MemTest, I would strip everything back to the basic unRAID NAS functions - i.e. no VMs and no Dockers. If that proved to be stable I'd reinstate the Dockers and check again, then the VMs, one at a time.


Yeah, I'm trying to avoid that, but I understand the troubleshooting reasons for the suggestion.

You would think that these traces and events in the syslog would mean something useful to someone other than me, but I have yet to find that person, or that information, through my searches.

The thing is, I think the messages WILL go away without any VMs running, as the "tainted" CPU is always one with a VM pinned to it. So if I see the line for CPU 3, I also see its thread pair that I pass through as well (which would be CPU 9 in this case) in a message a little later.
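For anyone following along, the pairing is easy to confirm from the host console (just my own quick check, nothing unRAID-specific):

# Map logical CPUs to physical cores; entries sharing a CORE value are HT pairs.
lscpu -e=CPU,CORE

# Or ask for a specific CPU's hyperthread sibling directly, e.g. CPU 3:
cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list   # prints e.g. "3,9"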

Hmmm, so at that point we're thinking VM-specific issues, which could be the case..

Maybe I'll switch one or two of them (would be nice if I could just "switch" them... ;) ) from SeaBIOS to OVMF/UEFI and see if it helps.

Link to comment

I'm afraid I don't know how to interpret them either, but I see "kvm" crop up time and time again. Someone like RobJ might be able to help if he reads your post. I'm afraid I have nothing else to offer, other than to repeat what I would do: you need to simplify the problem, and the virtualisation part is a major complication.


I get it, and thanks for looking!

I have already had some help from RobJ in the B23 thread, but nothing was conclusive; he believes it is a bug in the low-level memory management.

I posted on the VFIO forum under a similar title, and no one found it interesting enough, or had an idea either.

I will do some more shuffling soon and will try OVMF instead on a VM or two.

I'm nearly certain that if I remove the VMs, or even just stop using VT-d (meaning nothing passed through), this issue will "fix" itself.

However, that would only tell me it is something related to memory management within the IOMMU or shadow pages, and then I'm still at a loss....  Lame!
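When I run that experiment I'll watch for the signatures in real time rather than waiting for a VM to fall over; something like this (my own sketch):

# Follow the syslog and surface only the two signatures of this problem.
tail -f /var/log/syslog | grep -E 'BTS buffer allocation failure|page allocation failure'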

 

