
Unraid 'crash' after swapping hardware


Go to solution: Solved by Caennanu

Recommended Posts

G'day all,

 

Recently I've set up a virtual machine on Unraid that uses a Coral TPU.

I had to pass the TPU through to the VM by editing the XML, as it wasn't listed as a passable device.
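For reference, the host PCI address needed for such a manual entry can be looked up with lspci. A minimal example, assuming the PCIe variant of the Coral, which typically identifies itself under vendor 'Global Unichip Corp.':

# find the Coral's bus/slot/function (the part like "03:00.0");
# these numbers go into the <source> address of the VM's XML
lspci -nn | grep -i 'global unichip'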

 

This all worked fine and dandy: I set up my new VM with the TPU, installed Shinobi, tested it for a while, and it works great.

So, time to dismantle the old VM, which was using GPU acceleration for object detection.

Removing the GPU from the system, however, seems to have had a bigger impact than I anticipated.

 

When booting the Unraid machine, everything seems to work just fine, except that the VMs do not actually boot.

Trying to analyse what is happening, I notice the GUI is freezing. SSH'ing into the system fails as well, and the command-line interface on the physical machine doubles nearly every keystroke, making it impossible to do anything. The system becomes unresponsive.

 

After a couple of reboots I manage to catch the logs. It seems it's trying to find the TPU, which makes sense.

But probably due to the hardware swap, the device's ID has changed...

 

So... I turn the system off and disable automatic array start (via disk.cfg on the USB stick), so that I can boot and start the array manually without the VMs starting automatically after the array comes up... but that does not seem to work...
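For reference, the array auto-start toggle lives in disk.cfg on the flash drive. A minimal sketch, assuming the default Unraid layout and key name:

# /boot/config/disk.cfg on the USB flash drive
# "no" should keep the array from starting automatically at boot
startArray="no"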

 

So the questions are:

Is there a way to edit the VM file to temporarily remove the TPU until I know the new ID, so I can re-add it?

Is there a way to actually stop the VMs from auto-starting? (Would be nice to know in general whether this can be done via the CLI or by editing files.)

 

And before you ask:

with a system that is almost unresponsive, attaching logs will be a challenge.

Unraid version: I'm on RC4.

Link to comment

After reading some threads on the forum about similar issues, I haven't been able to find a resolution yet other than perhaps renaming the images.

Luckily I don't keep the images on a disk that is part of the array (only their backups), so if needed I could just pull that disk and edit the XMLs from the GUI.

I would still prefer a less drastic solution, though.

Link to comment

Update:

Pulling the disk with the VMs on it also causes the system to crash.

I can't pull logs from Unraid, but IPMI now shows an MCE I haven't had before...

 

CPU 2: Machine Check: 0 Bank 17: dc2040000000011b

TSC 0 ADDR 28b880 SYND 149901000a800500 IPID 9600450f00

PROCESSOR 2:830f10 TIME 1664694930 SOCKET 0 APIC 4 microcode 8301055
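Raw machine-check records like this can usually be decoded into something human-readable. A sketch, assuming the mcelog utility is available on the host (its --ascii mode parses machine-check text from stdin):

# save the raw MCE text to a file, then decode it
mcelog --ascii < mce.txt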

 

Link to comment

Update 2:
After re-arranging hardware in the PCIe slots, I've managed to give the TPU its 'configured' ID again.

The system now boots, and the VM manager starts.

Would love to know if someone knows how to change this ID when it has been added manually to a VM's XML, because this isn't particularly user friendly.

Link to comment
35 minutes ago, Caennanu said:

Update 2:
After re-arranging hardware in the PCIe slots, I've managed to give the TPU its 'configured' ID again.

The system now boots, and the VM manager starts.

Would love to know if someone knows how to change this ID when it has been added manually to a VM's XML, because this isn't particularly user friendly.

The XML files are stored in the following path, but VM Manager has to be active for it to be mounted.

To stop a VM from autostarting, you have to remove the symlink in the autostart directory.

 

root@computenode:/etc/libvirt/qemu# ls
Debian.xml  Linux.xml   Linux3.xml  Linux5.xml  Unraid-VM.xml             Windows\ 10.xml  autostart/  nvram/   swtpm/
HA.xml      Linux2.xml  Linux4.xml  Ubuntu.xml  Windows\ 10\ test\ 7.xml  Windows\ 11.xml  networks/   rpi.xml
root@computenode:/etc/libvirt/qemu# ls autostart
HA.xml@
root@computenode:/etc/libvirt/qemu# 
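Concretely, that removal can be done by hand or through libvirt itself. A minimal sketch, using the HA domain from the listing above as the example:

# delete the autostart symlink directly...
rm /etc/libvirt/qemu/autostart/HA.xml
# ...or, equivalently, let libvirt manage it:
virsh autostart --disable HA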

 

What were you looking to change?

Link to comment
  • Solution

@SimonF

Thanks for the reply.

 

The thing is, I couldn't start the VM manager without making Unraid unresponsive. That is why I was looking for a way to at least disable the auto-starting of the VMs within the manager.

The reason I wanted to turn it off is that one VM is non-default: I had added the block below to its XML to load up the Coral TPU, which wasn't a passable PCIe device.

Since I moved hardware around, or rather removed some, it seemed likely that the ID of the Coral TPU had changed, and it had.

 

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>

The bus in question changed from 0x03 to 0x01, and I had no way to edit this.

That seemed the most logical cause of Unraid becoming unstable.
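For reference, when libvirt is actually reachable, the persistent definition can be changed with virsh. A minimal sketch, assuming a domain named 'shinobi' (the name is illustrative); note that the <source> address is the host-side location of the device, while the separate <address type='pci'> line is the slot the guest sees:

# open the VM's persistent XML in $EDITOR
virsh edit shinobi
# then point the <source> address at the device's new location, e.g.:
#   <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>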

Link to comment
7 hours ago, Caennanu said:

@SimonF

Thanks for the reply.

 

The thing is, I couldn't start the VM manager without making Unraid unresponsive. That is why I was looking for a way to at least disable the auto-starting of the VMs within the manager.

The reason I wanted to turn it off is that one VM is non-default: I had added the block below to its XML to load up the Coral TPU, which wasn't a passable PCIe device.

Since I moved hardware around, or rather removed some, it seemed likely that the ID of the Coral TPU had changed, and it had.

 

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>

The bus in question changed from 0x03 to 0x01, and I had no way to edit this.

That seemed the most logical cause of Unraid becoming unstable.

The file system will need to be mounted so you can update it. Not sure of the process to mount the libvirt image without starting libvirt.
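One possible approach, offered as a sketch rather than a verified procedure: the libvirt store is a loop-mountable filesystem image, so something along these lines might work while the VM manager is stopped, assuming the default Unraid location of /mnt/user/system/libvirt/libvirt.img:

# mount the libvirt image somewhere temporary
mkdir -p /mnt/libvirt-temp
mount -o loop /mnt/user/system/libvirt/libvirt.img /mnt/libvirt-temp
# the VM definitions live under qemu/ inside the image
ls /mnt/libvirt-temp/qemu
# edit as needed, then unmount before starting the VM manager again
umount /mnt/libvirt-temp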

Link to comment

@SimonF

Makes sense in a way. Still, I believe there should be a way to turn off auto-start of VMs when starting the VM manager, specifically for issues like the one I experienced.

We shall see; maybe I should log a bug report, or a feature request to add a toggle to the VM manager menu, something like 'Disable VM auto start'.

Link to comment
On 10/3/2022 at 9:42 AM, Caennanu said:

@SimonF

Makes sense in a way. Still, I believe there should be a way to turn off auto-start of VMs when starting the VM manager, specifically for issues like the one I experienced.

We shall see; maybe I should log a bug report, or a feature request to add a toggle to the VM manager menu, something like 'Disable VM auto start'.

Raise a feature request.

 

Do you know which device was trying to be mapped when the system was crashing?

Link to comment
On 10/4/2022 at 8:53 PM, SimonF said:

Do you know which device was trying to be mapped when the system was crashing?

I do not know which device it was trying to map at the time, as I fail to see the logic of the allocation.

The only thing I found is that the TPU was allocated on bus 00, while it was trying to allocate it at 03.

Whether anything was connected / assigned to 03 at the time, I do not know. If anything was, it would have been one of the following:

LSI HBA, Adaptec HBA, dual 10Gb SFP NIC, or GT710.

Link to comment
