Nested Virtualization Issues on rc18f

December 31, 2017

Hi everyone,

I'm running into an issue and wanted to share it, mainly to contribute to the 6.4 QA, but if anyone has any suggestions, that'd be nice too.

Short Description

I upgraded from rc10b to rc18f this morning and am now no longer able to run nested virtualization.

Long Description

My labs require nested virtualization (Cisco VIRL, if anyone is interested). I run several VMs on UNRAID but use nested virtualization on only one of them; the others appear to be operating fine after the upgrade. The guest hypervisor OS is Ubuntu 14.04 LTS (3.19.0-74-generic) running KVM/QEMU version 2.2.0. The only passthrough I'm doing from the UNRAID system is CPU host-passthrough, and I pin/isolate the vCPUs for this guest from the others. The physical hardware is an AMD Threadripper 1950X on an ASUS Zenith Extreme X399 board (if the rest of the peripherals are relevant, let me know).
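For anyone trying to reproduce a setup like this, here is a minimal Python sketch (not part of the original report) of two sanity checks: whether the host's kvm_amd module has nesting switched on, and whether the guest hypervisor actually sees the svm/vmx CPU flag. It assumes the usual Linux locations /sys/module/kvm_amd/parameters/nested and /proc/cpuinfo; the function names are made up for illustration.

```python
from pathlib import Path


def nested_enabled(param_text):
    """Interpret the contents of a KVM 'nested' module parameter file.

    Older kernels expose the value as 1/0, newer ones as Y/N, so accept both.
    """
    return param_text.strip() in {"1", "Y", "y"}


def has_virt_flag(cpuinfo_text):
    """Return True if any 'flags' line lists svm (AMD) or vmx (Intel)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "svm" in flags or "vmx" in flags:
                return True
    return False


if __name__ == "__main__":
    # Run on the host: is nesting enabled in kvm_amd? (path assumed, AMD only)
    param = Path("/sys/module/kvm_amd/parameters/nested")
    if param.exists():
        print("host nested:", nested_enabled(param.read_text()))
    # Run inside the guest hypervisor: is the virtualization flag passed through?
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("virt flag visible:", has_virt_flag(cpuinfo.read_text()))
```

If both checks pass and the nested guest still crashes, the problem is more likely in the host kernel than in the VM configuration.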

After upgrading from rc10b to rc18f, the guest hypervisor now consistently crashes whenever a nested guest tries to start. Unfortunately, I don't know where in the changes between rc10b and rc18f the issue was introduced since I simply made the jump from 10b to 18f.

On UNRAID, the logs in /var/log/libvirt/* and /var/log/syslog weren't much help. The logs on the guest hypervisor provide some info, but not enough (at least as far as I can tell) to identify the root cause and a fix. Here are logs from a couple of the times the guest hypervisor crashed:

Some of the logs were cut short (the guest hypervisor's name is 'virl'):

Dec 31 10:23:28 virl kernel: [  398.872252] audit_printk_skb: 135 callbacks suppressed
Dec 31 10:23:28 virl kernel: [  398.872255] audit: type=1400 audit(1514737408.461:76): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt
Dec 31 10:23:28 virl kernel: [  398.872401] audit: type=1400 audit(1514737408.461:77): apparmor="STATUS" operation="profile_load" profile="unconfined" name="qemu_br
Dec 31 10:23:32 virl kernel: [  402.671699] BUG: unable to handle kernel paging request at ffff9008bf81eea0
Dec 31 10:23:32 virl kernel: [  402.671703] IP: [<ffffffff811a7542>] handle_mm_fault+0x132/0x10e0
Dec 31 10:23:32 virl kernel: [  402.671709] PGD 0
Dec 31 10:23:32 virl kernel: [  402.671710] Oops: 0000 [#1] SMP
Dec 31 10:23:32 virl kernel: [  402.671712] Modules linked in: xt_REDIRECT nf_nat_redirect xt_mark vxlan ip6_udp_tunnel udp_tunnel xt_comment iptable_raw xt_CHECKSU
Dec 31 10:23:32 virl kernel: [  402.671739] CPU: 15 PID: 3044 Comm: kvm.real Tainted: G           OE  3.19.0-74-generic #82~14.04.1-Ubuntu
Dec 31 10:23:32 virl kernel: [  402.671741] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014

Here's another sample (I captured more in my console this time):

Dec 31 10:34:48 virl kernel: [   42.351596] init: plymouth-stop pre-start process (17226) terminated with status 1
Dec 31 10:36:42 virl kernel: [  156.341944] audit_printk_skb: 135 callbacks suppressed
Dec 31 10:36:42 virl kernel: [  156.341947] audit: type=1400 audit(1514738202.678:69): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvir
Dec 31 10:36:42 virl kernel: [  156.342073] audit: type=1400 audit(1514738202.678:70): apparmor="STATUS" operation="profile_load" profile="unconfined" name="qemu_b
Dec 31 10:36:46 virl kernel: [  159.996795] BUG: Bad page map in process kvm.real  pte:ffff8808c986c429 pmd:8c986c067
Dec 31 10:36:46 virl kernel: [  159.996799] addr:0000564a43abf080 vm_flags:08100073 anon_vma:ffff8800bba5e870 mapping:          (null) index:564a43abf
Dec 31 10:36:46 virl kernel: [  159.996803] CPU: 3 PID: 1917 Comm: kvm.real Tainted: G           OE  3.19.0-74-generic #82~14.04.1-Ubuntu
Dec 31 10:36:46 virl kernel: [  159.996804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Dec 31 10:36:46 virl kernel: [  159.996805]  0000000000000000 ffff8808e96c7ce8 ffffffff817b61b3 0000564a43abf080
Dec 31 10:36:46 virl kernel: [  159.996807]  ffff880a0ef69080 ffff8808e96c7d38 ffffffff811a37ca ffff8808c986c429
Dec 31 10:36:46 virl kernel: [  159.996808]  0000000564a43abf ffff880a0e76c000 0000000000000000 ffff8808c986c429
Dec 31 10:36:46 virl kernel: [  159.996810] Call Trace:
Dec 31 10:36:46 virl kernel: [  159.996816]  [<ffffffff817b61b3>] dump_stack+0x63/0x81
Dec 31 10:36:46 virl kernel: [  159.996818]  [<ffffffff811a37ca>] print_bad_pte+0x1aa/0x250
Dec 31 10:36:46 virl kernel: [  159.996820]  [<ffffffff811a463e>] vm_normal_page+0x8e/0xa0
Dec 31 10:36:46 virl kernel: [  159.996822]  [<ffffffff811a7ba2>] handle_mm_fault+0x792/0x10e0
Dec 31 10:36:46 virl kernel: [  159.996824]  [<ffffffff81202e70>] ? poll_select_copy_remaining+0x130/0x130
Dec 31 10:36:46 virl kernel: [  159.996827]  [<ffffffff81062d64>] __do_page_fault+0x1c4/0x5a0
Dec 31 10:36:46 virl kernel: [  159.996830]  [<ffffffff810f263a>] ? do_futex+0x10a/0x630
Dec 31 10:36:46 virl kernel: [  159.996832]  [<ffffffff810e470e>] ? ktime_get_ts64+0x4e/0xf0
Dec 31 10:36:46 virl kernel: [  159.996834]  [<ffffffff81202e41>] ? poll_select_copy_remaining+0x101/0x130
Dec 31 10:36:46 virl kernel: [  159.996835]  [<ffffffff81063171>] do_page_fault+0x31/0x70
Dec 31 10:36:46 virl kernel: [  159.996837]  [<ffffffff817bfe28>] page_fault+0x28/0x30
Dec 31 10:36:46 virl kernel: [  159.996838] Disabling lock debugging due to kernel taint
Dec 31 10:36:46 virl kernel: [  159.996840] kvm.real: Corrupted page table at address 564a43abf080
Dec 31 10:36:46 virl kernel: [  159.996841] PGD 90bee0067 PUD 915452067 PMD 8c986c067 PTE ffff8808c986c429
Dec 31 10:36:46 virl kernel: [  159.996843] Bad pagetable: 000d [#1] SMP
Dec 31 10:36:46 virl kernel: [  159.996845] Modules linked in: xt_REDIRECT nf_nat_redirect xt_mark vxlan ip6_udp_tunnel udp_tunnel xt_comment iptable_raw xt_CHECKS
Dec 31 10:36:46 virl kernel: [  159.996880] CPU: 3 PID: 1917 Comm: kvm.real Tainted: G    B      OE  3.19.0-74-generic #82~14.04.1-Ubuntu
Dec 31 10:36:46 virl kernel: [  159.996881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Dec 31 10:36:46 virl kernel: [  159.996882] task: ffff8808e979a740 ti: ffff8808e96c4000 task.ti: ffff8808e96c4000
Dec 31 10:36:46 virl kernel: [  159.996883] RIP: 0033:[<00007f0061219404>]  [<00007f0061219404>] 0x7f0061219404
Dec 31 10:36:46 virl kernel: [  159.996886] RSP: 002b:00007ffd9c4c9ab0  EFLAGS: 00010202
Dec 31 10:36:46 virl kernel: [  159.996887] RAX: 0000564a43abf070 RBX: 0000000000000001 RCX: 0000000000000000
Dec 31 10:36:46 virl kernel: [  159.996887] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000564a43abf070
Dec 31 10:36:46 virl kernel: [  159.996888] RBP: 00007ffd9c4c9ae4 R08: 0000564a431bea00 R09: 0000000000000000
Dec 31 10:36:46 virl kernel: [  159.996889] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000020230d00
Dec 31 10:36:46 virl kernel: [  159.996889] R13: 0000000000000001 R14: 000000000000000f R15: 0000564a43ab1330
Dec 31 10:36:46 virl kernel: [  159.996891] FS:  00007f006a925980(0000) GS:ffff880a3fc60000(0000) knlGS:0000000000000000
Dec 31 10:36:46 virl kernel: [  159.996892] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 31 10:36:46 virl kernel: [  159.996893] CR2: 0000564a43abf080 CR3: 000000090208e000 CR4: 00000000003407e0
Dec 31 10:36:46 virl kernel: [  159.996895]
Dec 31 10:36:46 virl kernel: [  159.996896] RIP  [<00007f0061219404>] 0x7f0061219404
Dec 31 10:36:46 virl kernel: [  159.996897]  RSP <00007ffd9c4c9ab0>
Dec 31 10:36:46 virl kernel: [  159.996899] ---[ end trace e5ed6cb101eeea59 ]---
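Since most of each dump is register and stack noise, here is a rough Python sketch (an illustration, not something used in the original report) that pulls out only the kernel lines carrying a fault marker, together with the uptime stamp. The marker list is an assumption based on the strings visible in the excerpts above and is not exhaustive; the function name is hypothetical.

```python
import re

# Markers seen in the excerpts above; an assumed, non-exhaustive list.
FAULT_MARKERS = ("BUG:", "Oops:", "Bad pagetable:", "Corrupted page table")

# Matches the "kernel: [  402.671699] message" part of a syslog line.
LINE_RE = re.compile(r"kernel: \[\s*(?P<uptime>\d+\.\d+)\]\s?(?P<msg>.*)$")


def kernel_faults(log_text):
    """Return (uptime_seconds, message) for each kernel line with a fault marker."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m and any(marker in m.group("msg") for marker in FAULT_MARKERS):
            hits.append((float(m.group("uptime")), m.group("msg").strip()))
    return hits
```

Feeding it the guest's syslog text makes it easier to compare the first fault of each crash side by side.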

For now, I'll revert to UNRAID rc10b.

Thanks!

January 1, 2018

Most likely due to the experimental AMD kernel patches.

January 1, 2018

Author

Yeah, that's kind of what I was thinking too. That's why I thought it was important to note my jump from 10b to 18f - the kernel changes in 15e or 16b would probably cause the same issues for me too. I suppose I could test that if the dev folks found that helpful. Otherwise I'll stick with 10b for the time being.

Thanks for the reply.

January 2, 2018

On 12/31/2017 at 7:08 PM, realies said:

Most likely due to the experimental AMD kernel patches.

No such patches are in -rc18f.

January 3, 2018

On 12/31/2017 at 1:55 PM, zblue.h said:

For now, I'll revert back to UNRAID rc10b.

Does rc14 still work? That's the last release using the 4.13.x kernel before we moved to 4.14.x.

January 4, 2018

Author

Eschultz,

Great point. I can certainly try upgrading from 10b to 14. I'll do that and follow up shortly.

Thanks for the support!

January 4, 2018

Author

On second thought, my KVM is down so I'll wait until I'm home before I try to jump to 14.

January 5, 2018

Author

I upgraded from 10b to 14 and can confirm that nested virtualization still works, as expected.
