pfSense VM fails to boot after upgrade to 6.8.0-rc8



This is going to be a terrible post because I do not have a lot of information.  I upgraded to 6.8.0-rc8 today from 6.8.0-rc7.  I used the Unraid Nvidia plugin, but I don't think that was the issue.  After the upgrade my pfSense VM would halt while booting and never come up.  I used the same Unraid Nvidia plugin to roll back to 6.8.0-rc7 and the VM booted without issue.  If there are logs I can provide, I will, but I am not sure where they would be, or whether they even exist at this point, since I rolled back and rebooted my server.  I didn't see this issue reported in a quick search and wanted to see if anyone else is having it.

Link to comment

I have exactly the same problem as described by user beaverly72.

Since I updated to 6.8.0-rc8 (and continued with rc9, same story), my pfSense VM stalls during boot and pins one core at 100% utilization.

I had to force it to shut down; it never came up.

 

Rolling back to 6.8.0-rc7 works for me, but I decided to leave the testing branch and roll back to 6.7 stable.

Any suggestions to fix my issue? I'm worried since the next stable version is close to release and I really would like to use 6.8 as my daily driver.

Thanks for your input!
KR

Link to comment

I’m not sure I follow this.  6.8.0-rc7 has the 5.x kernel, which it sounds like is being dropped from 6.8.0 and will return in 6.9.0-rc1.  6.8.0-rc8 contains a kernel version that I believe was used in 6.7.2, which my pfSense VM booted with just fine.  So how does Unraid rolling back the kernel to a version that previously worked with my VM now cause the VM not to boot, while the 5.x kernel, which apparently is causing the issues noted by Unraid, allows my VM to boot without issue?  I’m sure there’s a logical explanation, I’m just not getting it.

Edited by beaverly72
Link to comment

I just dug up some numbers to compare.

 

You are basically right. 6.8.0-rc8 (and rc9) uses the same major kernel version (4.19) as the stable release 6.7.2 (where pfSense was working fine for me).

 

So let me list the kernel versions for comparison.

 

6.7.2 (pfSense boots): Linux kernel 4.19.56
6.8.0-rc9 (pfSense fails): Linux kernel 4.19.88
6.8.0-rc8 (pfSense fails): Linux kernel 4.19.87
6.8.0-rc7 (pfSense boots fine!): Linux kernel 5.3.12

 

So one possibility is that some change between 4.19.56 and 4.19.88 (inclusive) introduced the issue with pfSense (and this "problem" was then fixed again in the 5.x kernel),

 

OR

 

the fault is not the kernel itself but some packages that were up- or downgraded.

 

I will try to list some more changes between those versions if it helps.

 

KR,

Edited by MrSmith3101
Link to comment

I noticed the same yesterday when firing up a pfSense VM that I use from time to time to test some stuff. No device passthrough; I only use 2 virtual NICs, one on br0 as WAN for pfSense and one on an internal virbr that usually has a VM connected to it to generate some traffic. It never halted on startup before, nor did I install anything inside or change any config. I played around for a couple of hours yesterday but wasn't able to start the VM at all. Trying different Q35 versions and switching the vdisk type from qcow2 to raw and the bus between virtio, SCSI, and SATA changed nothing. Restoring an old backup didn't help either, and neither did creating a new VM with the same vdisk attached. It always halts directly after the boot selection, with one core pinned at 100%.
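
(In case anyone wants to reproduce the qcow2 to raw test: the conversion itself is done with qemu-img; the filenames below are just examples.)

  # convert a qcow2 vdisk to a raw image (source file stays untouched)
  qemu-img convert -f qcow2 -O raw vdisk1.qcow2 vdisk1.img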

Link to comment

Thanks for your input, bastl. I appreciate your testing time.

So different VM configurations do not help in this case. =(

 

I should add that my pfSense VM uses two "real" NICs which are passed through from the host. (It is an Intel Pro 1000 series card with 4 NICs.)

But as noted by bastl, it does not look like it makes any difference.
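
(For reference, each passed-through NIC shows up in the VM XML as a hostdev entry roughly like the one below; the PCI address is just a placeholder, the real one comes from the Unraid device list or lspci.)

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </source>
  </hostdev>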

Link to comment
17 minutes ago, bastl said:

@MrSmith3101 Are you using OVMF or SeaBIOS for your VM? I only have an OVMF setup and haven't had the time yet to test a fresh install on SeaBIOS.

Hmm, that's a good question.

Sadly, I "destroyed" my first pfSense VM when I tried to switch its version branch to experimental. I thought this would fix the boot issue, but it didn't (exactly the same behavior). On top of that, the experimental build of pfSense also broke support for my Intel NICs, so I wasn't able to reach the GUI anymore.
Long story short, I had to delete my pfSense VM and create a new one.
And now to answer your question: I think (90% sure) that my old pfSense VM was configured with SeaBIOS. (I previously had boot problems, but different from the current "stall" issue.) I'm not 100% sure though, sorry =(

My "new" pfSense VM is now configured with OVMF (as this is the standard config).

 

So it looks like it does not make any difference whether you choose OVMF or SeaBIOS.


Sorry that I can't verify this 100% now =/
 

Edited by MrSmith3101
Link to comment

Same issue here. Went from 6.7.2 stable to 6.8. pfSense VM stuck at 100% during initial boot. Went back to 6.7.2 for now. I do apologise for not having any diagnostics to help the cause. However, I did find something interesting during my troubleshooting. I suspected the issue was with the vdisks, so I started playing around with them and noticed that if I create just a dummy file using <touch vdisk1.img>, I get past the 100% hang state. Obviously this won't work long term, since that vdisk1.img is not really a vdisk, but the test seems to confirm my suspicion. Oh well, 6.9 will probably address it, I am hoping.

Link to comment

I found a workaround for this!

 

The culprit is the cpu-mode "host-passthrough". If I switch to "Emulated QEMU64" the VM boots up again. Switching it in the GUI should work if you haven't set up any special CPU flags. Another way is to edit the XML as follows:

 

change

  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='2' threads='1'/>
  </cpu>

to

  <cpu>
    <topology sockets='1' cores='2' threads='1'/>
  </cpu>

This also forces the CPU into emulated QEMU64 mode.
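
If you prefer the command line over the Unraid GUI, the same edit should also be possible with virsh (the domain name "pfSense" is only an example, use whatever your VM is called):

  virsh list --all         # find the exact domain name
  virsh edit pfSense       # open the domain XML and strip the mode/check attributes from <cpu>
  virsh destroy pfSense    # force off the hung VM if it is still stuck
  virsh start pfSense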

 

Another option is to emulate an Intel Skylake CPU, for example with the following:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client</model>
    <topology sockets='1' cores='2' threads='1'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='pcid'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='erms'/>
    <feature policy='disable' name='invpcid'/>
    <feature policy='disable' name='rtm'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='disable' name='spec-ctrl'/>
  </cpu>

 

Edit:

"AES-NI CPU Crypto" isn't supported on "Emulated QEMU64" mode. For future Pfsense versions this is a requirement if I remember correctly.

Edited by bastl
Link to comment
4 hours ago, bastl said:

@Farrukh Can you explain a bit further what you did? Only creating a dummy vdisk inside the VM folder without attaching it to the VM shouldn't do anything. What format is your pfSense vdisk using? Qcow2 or RAW?

I am using RAW for my pfSense vdisk. After upgrading to 6.8 I thought my vdisk had gotten corrupted. That happened to me once before, hence I suspected the vdisk. I went through a process of elimination to see what would get the VM past the 100% stuck state.

This is what I did:

Force stop the pfSense VM.

Then, using the CLI:

  cd /mnt/user/domain/pfsense/
  mv vdisk1.img vdisk1.old    # keep the RAW-based vdisk as a backup
  touch vdisk1.img            # create an empty placeholder file; it is not RAW or Qcow2, just enough to let the VM start

Even though this got me past the 100% stuck state, it doesn't actually help, since it means there is no real disk attached to the VM.
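
(If you want to rule out actual vdisk corruption, qemu-img can at least sanity-check the original image; the filename matches the example above.)

  qemu-img info vdisk1.old     # reports the detected image format (raw or qcow2) and the size
  qemu-img check vdisk1.old    # consistency check (qcow2 only; raw images have no metadata to check)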

Link to comment
2 hours ago, bastl said:

I found a workaround for this!

The culprit is the cpu-mode "host-passthrough". If I switch to "Emulated QEMU64" the VM boots up again. [...]

"AES-NI CPU Crypto" isn't supported in "Emulated QEMU64" mode.

Good job and thanks, I will try this out. I believe you are right about the AES-NI requirement, but I think we should be good for 2.5; for now they have removed the requirement. Here is the article I was reading:

 

https://www.netgate.com/blog/pfsense-2-5-0-development-snapshots-now-available.html

Link to comment

@joelones For some CPU-heavy operations you could see performance decreases if a specific CPU feature is emulated. I think that if you're not routing hundreds of clients through pfSense or handling a couple of incoming VPN connections at the same time, you won't notice it. I did a couple of tests with a Linux and a Win7 VM in parallel using my test pfSense environment, and it was stable with full download speeds and no packet drops. It might be different on a larger scale; for me it's fine for testing.

Link to comment
36 minutes ago, bastl said:

@joelones For some CPU-heavy operations you could see performance decreases if a specific CPU feature is emulated. [...]

Ok thanks. I may just wait for 6.9 at this point knowing that we'll have to upgrade back to the v5 kernel anyway. Hopefully the GSO bug is squashed as well, but this is definitely a viable option at this point thanks to your testing.

Edited by joelones
Link to comment


I am also experiencing the same issue going from 6.7.2 to 6.8.0 on an ASRock B450M Pro4 with an AMD Ryzen 5 1600 and one quad-port NIC: pfSense does not boot! But with @bastl's workaround of replacing

 <cpu mode='host-passthrough' check='none'>

with a plain <cpu> tag, pfSense boots like before. Big thanks for sharing that workaround, @bastl.

Link to comment

@bastl

So I updated to 6.8 stable and decided to try this workaround. I did try the Skylake emulation for my AMD FX-8320, but it didn't seem to like it very much and gave an unsupported CPU error when I tried to start the VM. I guess my CPU is either too old or lacks the instructions to emulate Skylake properly. Maybe I need to model an older Intel CPU, like SandyBridge or something? I know my model is Opteron_G5.
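
(Side note, in case it helps anyone else picking a model: virsh can list which CPU models libvirt knows about and which ones are actually usable on a given host; this is generic, not specific to my board.)

  virsh cpu-models x86_64    # every CPU model name libvirt knows about
  virsh domcapabilities      # shows which of those are usable on this host (look for usable='yes')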

 

I had no choice but to opt for Emulated QEMU64 mode; hopefully the lack of AES-NI won't impact overall CPU performance too much with respect to VPN usage.

 

EDIT: I seem to have gotten pfSense to boot with AES-NI on my AMD with this:

 

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Opteron_G5</model>
    <vendor>AMD</vendor>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='bmi1'/>
    <feature policy='require' name='mmxext'/>
    <feature policy='require' name='fxsr_opt'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='require' name='cr8legacy'/>
    <feature policy='require' name='osvw'/>
    <feature policy='disable' name='rdtscp'/>
    <feature policy='disable' name='svm'/>
  </cpu>
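
To double check that the guest actually sees it: the pfSense dashboard has an "AES-NI CPU Crypto" entry, and from the pfSense shell something like the following should show AESNI among the CPU feature flags.

  grep AESNI /var/run/dmesg.boot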

 

Edited by joelones
Link to comment
