bungee91

alloc_bts_buffer: BTS buffer allocation failure-page allocation failure: order:4


For the life of me I cannot get rid of these errors.

 

This is what I know, and I'm certain of it at this point: (yelling) THIS IS NOT A HARDWARE FAULT (got that off my chest! @LT, I need some further help)  :P

I've been getting messages like these for a while now:

Server kernel: CPU: 10 PID: 11439 Comm: qemu-system-x86 Tainted: G        W      4.4.15-unRAID #1

 

They all seem to start with this message first:

Server kernel: alloc_bts_buffer: BTS buffer allocation failure
Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net tun vhost macvtap macvlan xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod hwmon_vid igb ptp pps_core mxm_wmi fbcon bitblit fbcon_rotate fbcon_ccw fbcon_ud fbcon_cw softcursor font ast drm_kms_helper cfbfillrect cfbimgblt x86_pkg_temp_thermal cfbcopyarea ttm coretemp drm kvm_intel kvm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops fb ahci i2c_i801 fbdev sata_mv i2c_algo_bit libahci ipmi_si wmi [last unloaded: pps_core]
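For anyone tallying how often this fires and on which cores, here's a rough shell sketch (purely illustrative; the sample line is the one quoted above, so the extracted values should be CPU 10 and PID 11439):

```shell
# Pull the CPU and PID fields out of a kernel warning line.
# The sample is copied verbatim from the log excerpt above.
line='Server kernel: CPU: 10 PID: 11439 Comm: qemu-system-x86 Tainted: G        W      4.4.15-unRAID #1'
cpu=$(printf '%s\n' "$line" | sed -n 's/.*CPU: \([0-9]*\).*/\1/p')
pid=$(printf '%s\n' "$line" | sed -n 's/.*PID: \([0-9]*\).*/\1/p')
echo "CPU=$cpu PID=$pid"
```

Feeding `grep 'CPU:' /var/log/syslog` through the same `sed` expressions would show whether the failures always land on the cores pinned to one VM.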

 

So I replaced my motherboard with a snazzy new one!

Memtest86 (newest PassMark version) ran for 19 hours, 6 passes, no errors.

 

Replaced all VMs with newly installed UEFI ones (Windows 10, newest virtio).

 

In my opinion this is a memory allocation issue involving KVM. Maybe it has to do with hugepages, or shared memory, IDK.
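For what it's worth, the hugepages side of that theory is easy to peek at on the host. A minimal sketch (standard Linux paths; it falls back to a message rather than erroring if a path is absent):

```shell
# Print hugepage counters if /proc/meminfo is readable (standard on Linux).
if [ -r /proc/meminfo ]; then
    grep -i '^HugePages_Total\|^AnonHugePages' /proc/meminfo || true
fi
# Transparent hugepage mode, e.g. "[always] madvise never".
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then thp_status=$(cat "$thp"); else thp_status="unavailable"; fi
echo "THP: $thp_status"
```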

 

I reboot, it goes away for a day or so, then comes back.

A similar issue was fixed in kernel v4.0: https://bugzilla.kernel.org/show_bug.cgi?id=93251
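A quick way to sanity-check that the running kernel is already past that fix (assumes `sort -V` is available, as it is on unRAID's GNU userland):

```shell
# Strip the "-unRAID" suffix, then compare against 4.0 with a
# version-aware sort; the newest of the two versions sorts last.
running=$(uname -r | cut -d- -f1)
newest=$(printf '%s\n%s\n' "$running" "4.0" | sort -V | tail -n1)
if [ "$newest" = "$running" ]; then
    echo "kernel $running is >= 4.0, so that fix should already be in"
else
    echo "kernel $running predates the 4.0 fix"
fi
```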

 

I'm honestly shocked no one else is having this issue; is it something unique to my processor?

Looking for any help, or pointers.

I'm considering disabling Hyper-V (I have two Nvidia cards in use) and going back to 6.1.9, as this issue was never present on that version.

 

I guarantee that turning off VMs completely will keep the error from occurring (not exactly something that fixes the issue).

@RobJ any chance you can look at the new logs and provide any input? Thanks for looking before.

server-diagnostics-20160808-1928.zip


I've reverted to 6.1.9 for testing purposes.

 

As others may already know, this led to all installed Dockers no longer being shown (I assume due to the version change in 6.2), and all VMs created within 6.2 no longer being listed.

Not really a big issue, I just wasn't aware; luckily I had a backup of all the Windows UUIDs, or activation would have been lost.

Dockers are easy to reinstall thanks to the saved templates!

 

It is my assumption the problem will be resolved (I'm also crossing fingers, throwing pinches of salt over shoulders, etc.), as I definitely started noticing this toward the end of the 6.2 beta cycle.

If I get to a week of uptime without this showing, I will consider the issue confirmed as caused by a change within the kernel update included in 6.2.

Lately I have only been able to go a day or two without it appearing.

 

If it does show up, I'm seriously out of ideas.

Memory tested good, the timings are even relaxed (running 2400 RAM at 2133), and the motherboard has been replaced with a completely different model.

In my RMA'ing (dead MB and GPU), the only other thing replaced was the GPU for my main VM; however, the messages certainly sound more memory/CPU related.

This would only leave the CPU, and that is highly unlikely given no other related issues.

Power supply also replaced, more as an "I want this shiny new toy" purchase than for problem resolution.

 


Bungee,

 

Can you take a look at your motherboard BIOS settings and tell me what options you have regarding memory, virtualization, and IOMMU in there?  This still seems like something that may be amiss between hardware and software configuration.



Called in the big guns huh?!  ;D

 

Let me take a look once I get home this evening. The only difference between this and my previous board: Above 4G Decoding had to be disabled on the Gigabyte board or it reverted to BIOS default settings (no idea why), so it was always set to off.

On the SuperMicro I have now, I was able to enable it (even though it's supposedly only useful for Grid or fancy CUDA cards, or something of that nature) and pass all related Memtests with it on, so I left it that way.

Will have all the information you can shake a stick at this evening.  ;)


Well, you're right.... (it's not personal)  :(

There is still something wrong with my setup, AND it's likely hardware related..  >:(

 

I've seen this with the 6.1.9 installation, and I know for a fact I never had these issues prior to the MB and GPU that died and needed to be replaced.

Server kernel: WARNING: CPU: 11 PID: 24446 at arch/x86/kernel/cpu/perf_event_intel_ds.c:315 reserve_ds_buffers+0x10e/0x347()
Aug  9 02:00:23 Server kernel: alloc_bts_buffer: BTS buffer allocation failure
Aug  9 02:00:23 Server kernel: Modules linked in: kvm_intel kvm vhost_net vhost macvtap macvlan md_mod xt_nat veth xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables tun ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat hwmon_vid igb i2c_algo_bit ahci ptp sata_mv i2c_i801 libahci pps_core ipmi_si [last unloaded: md_mod]
Aug  9 02:00:23 Server kernel: CPU: 11 PID: 24446 Comm: qemu:Main Not tainted 4.1.18-unRAID #1

 

I attached that syslog, but it looks the same as the others.

I also saw this toward the bottom; I know I have seen it before, and it may be harmless, but I thought I'd mention it:

kernel: perf interrupt took too long (2563 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
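(For reference, that particular message is just the kernel throttling its own perf sampling and is generally harmless. The current cap can be read back like this; the sysctl path is standard Linux, with a fallback if it's absent:)

```shell
# Read the cap the kernel says it lowered to 50000.
f=/proc/sys/kernel/perf_event_max_sample_rate
if [ -r "$f" ]; then rate=$(cat "$f"); else rate="unavailable"; fi
echo "perf_event_max_sample_rate: $rate"
```

If desired, root can raise it again with `sysctl -w kernel.perf_event_max_sample_rate=100000`, though the kernel will simply lower it once more if perf interrupts keep running long.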

 

All photos of my BIOS setup screens are here: https://photos.google.com/share/AF1QipNcs6rSDd3rAgkKe9LpxOV5cUn3gt1zkZkjJqQK93oiUmbsW6LCHZTsGMcnZHqqeg?key=MzdVOXI3aXUtVkVwTTlMa1ZqSjB5Q2lFN1RsU0R3 They're legible, but they're phone pics taken while holding a keyboard and mouse (this BIOS cannot be navigated with just a keyboard, stupid!).

 

So, I now continue down this path. I know I have seen the tainted message on CPUs other than the ones assigned to my primary VM, but recently it has only been that one in particular (cores 4,5,10,11).

Knowing this, I removed the RMA'd 260X from the server, and have that VM running over VNC for now.

I will likely switch it over to a spare GT720 I have in there at some point; for tonight I don't think I care.

 

Look forward to any specifics you may see in the BIOS screens, but again, I never had these issues initially before that MB died, and while I thought it was tied to it at first, the more I tried other things the less likely it appeared to be a hardware-related case. Prior to the MB dying, it took out my initial 260X GPU, prompting an RMA; then a week later the MB died.

I wouldn't think that type of failure is good for any component that was on the MB, so all the GT720s I have, and even the RAM (even though it won't fail Memtest), are suspect.

 

I have one odd thing with the newest Memtest version (PassMark): if I run it in multiple-CPU mode it freezes up pretty quickly.

While I know this can be an issue with the older Memtest distributed with unRAID, I didn't expect it with the newest PassMark version.

If this is any indication of anything, well, I've now shared that experience.

 

Looking forward to your input; I'm back on 6.2RC3 with my replacement AMD card removed, just in case.

If there was something odd with it (or any of the GPUs installed), since their lanes go directly to the CPU, I could see it causing issues, though I'm uncertain to what extent.

I fully understand that in troubleshooting, removing all but the required components is key, and if that means losing the other GPUs as I test, so be it. It just sucks with so much reliance on one box; that, and I was in denial to a point.  :P

 

Thanks,

Jeff

 

 

Edit: If you care to read about my initial misery (it sickens me to know I've been fighting this on and off for 5 months now), the thread with details is here: https://lime-technology.com/forum/index.php?topic=48024.msg460209#msg460209 It has some back story that may be helpful, but mainly just failure..  :-\

server-syslog-20160809-1846.zip


Need to update this, and ask for continued support.. Yes, I know I'm the only one seeing this. Reminder: Memtest ran 24+ hours with 0 errors, and there are no actual instability issues. It also (somewhat) makes sense that I only started seeing these after they replaced my GPU with a newer refurbished one (a different model, R260X).

 

This is what I know now: this is only evident (for me, with the components I have here) when I pass through either my replacement R260X or my GTX950; basically, "gaming-capable" GPUs.

 

From extensive Googling, the "alloc_bts_buffer" error indicates memory fragmentation, and in the past there have been requests to stop it from "spamming the logs".
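For context: an order:4 failure means the kernel could not find 2^4 = 16 contiguous 4 KiB pages (64 KiB). `/proc/buddyinfo` shows how many free blocks of each order each zone still has; here's a sketch of reading it, using a made-up sample line, so the numbers are illustrative only:

```shell
# In /proc/buddyinfo, columns 5 onward are free-block counts for
# orders 0..10; an order-4 request needs a block from column 9 or later.
# The sample values below are invented for illustration.
sample='Node 0, zone   Normal   2040   1541    968    620      0      0      0      0      0      0      0'
free_ge4=$(printf '%s\n' "$sample" | awk '{ s = 0; for (i = 9; i <= NF; i++) s += $i; print s }')
echo "free blocks of order >= 4: $free_ge4"
```

`cat /proc/buddyinfo` on the live host gives the real numbers; zeros in the higher-order columns while VMs are running would line up with the fragmentation theory.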

 

If I remove either of those cards and assign a GT720 to this VM, the issue is completely gone: 8+ days on 6.2 final (there have been frequent updates, causing me to reboot).

 

I removed the R260X and tried my new GTX950 (same issue); I then assigned a GT720 and the problem is completely nonexistent (it's almost too peaceful in my syslog!)  ;D

I'm currently running the following:

3 GT 720s (Win 8.1 MCE, Win 10 x2), 1 GT 710 (LibreELEC), and one headless, all working as expected.

 

 

So, if I keep it just like this, this is likely solved...

However, if I add back either GPU, I'm pretty confident these messages will appear again within 24 hours, always following the VM the card is assigned to.

 

I've attached a clean log from my 8+ days using the 720s and 710, taken prior to upgrading to 6.3 RC1.

Other logs to compare are above; I don't see anything obviously different between the two to help diagnose this.

server-diagnostics-20161009-1000_Clean_8_Days_GT720s.zip


I never had these issues initially before that MB died, and while I thought it was tied to it at first, the more I tried other things the less likely it appeared to be a hardware-related case. Prior to the MB dying, it took out my initial 260X GPU, prompting an RMA; then a week later the MB died.  I wouldn't think that type of failure is good for any component that was on the MB, so all the GT720s I have, and even the RAM (even though it won't fail Memtest), are suspect.

 

You are correct in that all of your components are suspect.  I know you already replaced the board, but perhaps "damage has been done", or perhaps the board wasn't really to blame at all.  In order of least to most costly, here are tests you can perform:

 

1 - Try reducing the amount of total system memory you have installed and if the issues persist, switch the chips that you removed for the ones that you left in and try again.

2 - Try replacing the power supply and then repeat test 1 if issues persist

3 - Replace all memory on the system with new memory

4 - Try a different GPU (GTX 960, 970, 980, or 1080 have all been tested by us here at LT)

 

If this was an issue in software, I think it would be more widespread and reported.  The fact that we can't recreate it and you seem to be isolated in this situation points very heavily at this still being a hardware related issue.  Memtest is a great tool, but is not the end-all be-all for diagnosing memory issues.  In fact, there are no tools that can fully diagnose each hardware component in a computer (certainly not any available to the consumer market).  So while memtest can reveal glaring problems like a chip that has been physically damaged, if there are other issues related to the memory controller or the data paths in between, these may not be as easily detected.



Thanks John, but unfortunately I've been through all of that and here I am...  Let me go through this line by line.

 

1 - Try reducing the amount of total system memory you have installed and if the issues persist, switch the chips that you removed for the ones that you left in and try again.

Not exactly what you asked, but I changed out the memory completely (even though there were no errors); I went from 32GB (4x8GB, Crucial) to 64GB (4x16GB, G.Skill) with the exact same issue.

Memtest ran 24 hours, 9+ successful passes.

 

2 - Try replacing the power supply and then repeat test 1 if issues persist

Did that; replaced a perfectly good Corsair TX650 with a (fancy!) HX850i.

 

3 - Replace all memory on the system with new memory

See #1

 

4 - Try a different GPU (GTX 960, 970, 980, or 1080 have all been tested by us here at LT)

I kind of did; I had an R260X and bought a second-hand EVGA GTX950, with the same results.  Then I used my 3rd GT720, and the issue was resolved.

 

Adding one more:

5 - I thought maybe my CPU was damaged when my motherboard died (RIP), so I borrowed a 5920k, and received the exact same issues on the exact same CPU cores/VMs with it installed.

 

So, knowing all of that: everything is either new or has been tested against a new part, and still I see these messages. I've looked extensively at memory mapping and various kernel parameters (related to memory, page file, MTRR, etc.), with nothing making any change.

 

I'm honestly happier than a pig in shit that removing these GPUs from my system has solved this issue; however, something still isn't right.

I understand what you're saying; I'm just completely stumped. I'm normally pretty good with this kind of stuff, but this one is just plain strange!

I also completely wiped my flash drive and started fresh (with basically everything), just in case, and it made no difference.

 

I'm pretty confident that if you had my GPU, you'd test it and tell me it is fine, no issues... However, with them removed, the issue is gone.

 

Edit: IDK, does this sound similar (albeit quite old)? https://groups.google.com/forum/#!topic/linux.kernel/scXkSlZ5EMQ


Have you tried the 6.3-rc track yet?  That has a newer kernel and QEMU as well.


Nope, but I am now.  ;)

 

Changes from my "good" setup: removed one GT720, removed the USB3 card (in the slot that the double-width card blocks).

Installed the GTX950, removed the USB3 card from the XML, switched to i440fx-2.7 (the newest now included in 6.3), and assigned the MB's USB controller to the VM.

Started the VM, ran DDU to clean out any old drivers, installed fresh drivers, and turned MSI interrupts on.

 

Now we wait... Give me a few days and I'll let you know how it goes.

I grabbed a clean diagnostics from my 3 days on 6.3 RC1 prior to shutdown.

If it does reappear (I would like to be confident, but my previous experiences make that difficult), maybe something will be more obvious in comparison.

 

 

Also (previously done testing): while ruling things out and testing my hardware, I thought maybe some specific CPU PCIe lanes/traces were damaged, so I moved the GTX950 to the next PCIe x16 (length, not wiring) slot that other VMs use without issue. The log entries were the same as before, following the VM/GPU; the other VM, now using the primary PCIe slot, worked as well as it always had.


Howdy bungee, I don't see anything in particular either.  I was going to weakly suggest the PassMark version of MemTest, then discovered you had tried it too; besides, this doesn't look like a RAM failure.  It looks like an allocation failure, which would be a bug somewhere.

 

The only advice I have is to keep checking for BIOS updates for the motherboard, firmware updates for the cards, and keep trying newer kernels.

 

The kernel bug you mentioned is, I think, resolved and included in current kernels, so it's probably no longer applicable.


Thanks for taking another look Rob, and as always the continued support from LT/Jon.

 

Well, it may be a little early to drink to it (I can and will drink to other random things, however), but I think this issue is resolved in the newest kernel/QEMU!!

I have not seen any indication of this fragmentation in my logs lately; it has been rather peaceful (and somewhat boring, to be honest).  ;D

 

Anyhow, I still don't know why only I have had these issues, but if they are truly gone I couldn't give 2 shits as to the answer to that question!

 

So I'll give it some more time/usage, and if it doesn't come back up again I'll then update this as solved.

No idea what commits in the most recent kernel may have addressed this, but it's looking like a winner so far.  ;)


Well, that was short lived... I'm out of ideas..

 

It looks different now, but it's back.  :-[

Oct 18 16:12:59 Server kernel: qemu-system-x86: page allocation failure: order:4, mode:0x260c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO|__GFP_NOTRACK)
Oct 18 16:12:59 Server kernel: CPU: 5 PID: 11355 Comm: qemu-system-x86 Not tainted 4.7.6-unRAID #1

 

and

Oct 18 16:13:28 Server kernel: WARNING: CPU: 5 PID: 11355 at arch/x86/events/intel/ds.c:334 reserve_ds_buffers+0x119/0x353
Oct 18 16:13:28 Server kernel: alloc_bts_buffer: BTS buffer allocation failure
Oct 18 16:13:28 Server kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net tun vhost macvtap macvlan xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod ipmi_devintf igb mxm_wmi x86_pkg_temp_thermal coretemp kvm_intel kvm fbcon bitblit fbcon_rotate fbcon_ccw fbcon_ud fbcon_cw softcursor font ast e1000e drm_kms_helper cfbfillrect cfbimgblt cfbcopyarea ttm drm i2c_i801 agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops fb ptp fbdev pps_core i2c_algo_bit i2c_core ahci libahci sata_mv ipmi_si wmi [last unloaded: igb]

 

This repeats 3-4 times. Newest diagnostics attached; no idea. I will switch back to the GT720 and see what happens.

server-diagnostics-20161019-1309.zip


Alright, this is now very interesting!!  >:(

 

I've taken the advice that I didn't want to be true, but I know damn well something is not right with my setup..

These errors started to occur with the GT720 assigned (so my previous thoughts were incorrect).

 

So, I purchased a new MB and CPU, and I'm using my "new" RAM (a recent replacement from Crucial; I'd thought the old set was related to my issue), which is also on the QVL for this board (currently an ASRock Extreme4, a 5820k, and Crucial 3x8GB DDR4). I will say I didn't run Memtest on this new setup; I just left it at auto and booted up.

All BIOS settings were left at default/auto, with the exception of VT-x and VT-d, which are enabled.

 

The exact same order-4 page allocation failure is showing up!  :o

 

Considering it always seems to be associated with only the CPU cores of a specific VM titled "Main" (cores 4,5,10,11), I'm going to make a new VM from scratch and see if this resolves it.

I can't for the life of me think what the VM could be doing to cause these errors!?!

 

My previous (and keeper) MB/CPU/RAM hardware is currently undergoing grueling stress testing and Memtests, likely to find zero issues, since it is highly unlikely (IMHO) that the exact same error would be present on completely different CPU/MB/RAM.

 

Always open to opinions and insight, I will figure this out!  :P

Attached newest diagnostics with new hardware installed.

It could be some kind of bug with Haswell-E; however, no one else seems to have this issue, so that is pretty unlikely.

 

Current theory: somehow, some way, my VM seems to be violating a RAM-related rule, causing this to occur. I don't know of any particular process within Windows that is triggered to make this happen. The "Main" VM only has 4GB assigned with this new hardware, vs. the 12GB prior on my regular CPU/MB/RAM. No other changes, with the exception of removing the passthrough of a USB controller.
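One small cross-check on the memory side of that theory: libvirt's `<memory>` element is specified in KiB, so the 4GB assignment should appear in the Main XML as the value computed below (simple arithmetic, nothing unRAID-specific):

```shell
# KiB value libvirt uses in <memory unit='KiB'> for a 4 GiB guest.
mem_gib=4
mem_kib=$(( mem_gib * 1024 * 1024 ))
echo "$mem_kib"
```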

 

Edit: Added my XML for the "Main" VM (renamed to .txt); nothing special going on within it, and the editor was used for the most recent edits.

I'm uncertain which virtio drivers are currently installed, but fast startup is off, high performance is on, all drivers were installed at install time in the recommended order, and all wiki best practices were followed.

server-diagnostics-20161026-1311.zip

Main.txt


I wanted to give an update to this, and thank everyone for the help and diagnosing of this issue!  :-*

 

This was entirely a software issue (yes, crazy, huh!? I seriously wouldn't have thought so either).

Somehow/some way the VM was causing this issue; I'm not certain how exactly (corrupt subsystem files within the VM?), I just know the issue is completely gone after a fresh Windows 10 VM install.

If I turn that specific VM back on, they come back. I tested changing Windows drivers, virtio drivers, etc., and it would always come back on that specific VM (odd).

 

I even decided to push it, and installed both my R260X on the one VM and the GTX950 on my primary, and all has been well for some time now.

If I had known a software/VM install was causing all of this trouble (it doesn't seem too plausible), I would have scorched the earth and everything it touched long ago.  >:(

 

All of my hardware passed extensive testing: 48+ hours of PassMark Memtest86, HCI MemTest within Windows heavily stressing the IMC and CPU, and most other tests/tasks I could throw at it.

 

Anyhow, thanks again, this is marked as solved!

