Installing New Hardware Stops Docker & VMs



I have an unRAID server that has been running without any faults for over 2 years. It's running 6.8.3 and has an array of 5 drives plus a cache pool of 2 SSDs for VMs and Docker containers.

 

I decided to install an M.2 NVMe drive and move all my VMs onto it, leaving the cache pool for Docker.

 

I installed the drive and rebooted. The login screen displayed a white background where it is normally black. I could log in, but the unRAID UI, which is normally black, was now white, and the machine logo in the description box had changed. Docker and VMs also will not run, and there is a parsing error.

I rebooted the server again and was then unable to even see the unRAID UI, although the shares were all running and available on the network.

 

I also looked in the BIOS and made sure that the NVMe drive was not in any of the boot priority menus, just in case that was causing the problem. It made no difference.

 

I decided to take the NVMe drive out of the server and rebooted, and everything went back to normal: colour scheme, server picture, VMs and Docker containers all running perfectly.

 

I am completely baffled! I have added the log files to this post.

 

Could anyone help at all? It would be really appreciated.

pcserver-diagnostics-20210104-1742.zip

Link to comment

That’s the last log file I have; I thought there might be something in there. I took a screenshot of the computer shutting down when I had the NVMe drive in. Looking at it, it says the .zip file is not created. The photo is attached to this post.

 

The whole server behaved so strangely. It was almost like it could not find some settings or access a part of the system. It was exactly the same when I installed a second graphics card. 
 

Could it be a power supply issue? I am just on the cusp of the PSU's power capacity; could that be why it behaves like this? I am clutching at straws.
 

Thanks for taking the time.

6182E2EB-5F5D-4E17-B6F7-5D678EF7BCB8.jpeg

Link to comment
13 hours ago, RallyGallery said:

.zip file is not created?

Seems like a flash drive problem. Put it in your PC and run checkdisk on it. While you're there, make a backup.

 

The syslogs from the diagnostics you posted earlier have a lot of messages related to Nvidia. Have you tried without the Nvidia build? You would want to make sure you have a backup of the flash drive before trying that, since there isn't any way to reinstall the build now that it has been pulled.

Link to comment

I have a backup of the USB drive so that's not a problem. I will run the USB drive through checkdisk. I wonder if it's worth getting a new USB stick and moving the system onto it?

 

I was also going to update to 6.9. It's at RC2 but looks pretty stable. It also has built-in Nvidia drivers, so that would be better than the build I am using which, as you say, has been discontinued. I wonder if the NVMe may work if I upgrade and try it again?

 

Thanks again for your time. 

Link to comment

A quick update: I powered down the server and checked the USB stick, and it had no errors. I then rebooted the server, removed the previous Nvidia build, installed 6.9 RC2 and got everything working. All the Nvidia drivers, VMs and containers work perfectly.

 

I am going to install the NVMe drive and see what happens. I will keep you updated.

 

I have kept the USB stick in the same port as you suggested.

Link to comment

I'm assuming that you are passing some hardware to the VMs, such as a GPU. I don't see any mention of stubbing hardware via the vfio-pci.cfg file, but I can only assume that you are (I can't remember if that shows up in the diagnostics). You will most likely need to update that config file, as well as your VMs, once you install your NVMe. Whenever you install new hardware that interfaces with the PCIe bus, it can shift the existing allocations around. This can cause issues with stubbed hardware, where something that shouldn't have been stubbed suddenly is (e.g. USB ports with the unRAID flash drive on them, disk controllers, etc.). It can also cause the VM to try to access hardware at its previous address, which is now occupied by something else.

 

Long story short - this happened to me when I installed a second GPU; all of the PCIe assignments changed and I had to recreate my vfio-pci.cfg file to stub the correct hardware.
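
To illustrate what that file looks like (purely an example - the address below is not taken from these diagnostics, and the exact format differs slightly between the VFIO-PCI Config plugin and the tool built into unRAID 6.9, so treat it as a sketch only): the stub list lives at /boot/config/vfio-pci.cfg on the flash drive, and at its simplest it is just a BIND line naming the devices to hand over to the vfio-pci dummy driver at boot:

BIND=0000:0a:00.3

The catch, as above, is that those bus addresses are exactly what can shift when a new PCIe device such as an NVMe drive is added, so the file needs re-checking (for example against fresh lspci -nn output) after any hardware change.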

Link to comment

Sorry my apologies. 

 

I have installed the NVMe drive and had another go. Same errors; however, I have captured as much data as I can. I downloaded a diagnostics file with the NVMe drive installed; it's attached below. I also had a look at the USB information - screenshots below. It looks like a USB problem? I took the NVMe drive out and the system works fine.

 

The only USB 2.0 ports are on the front of the case, so I can give those a try, unless you think there is anything else I should do based on the latest diagnostics report and screenshots.

 

Thanks ever so much for your time and help.

20210118_195854773_iOS.jpg

20210118_200641304_iOS.jpg

pcserver-diagnostics-20210118-2005.zip

Link to comment
36 minutes ago, civic95man said:

I'm assuming that you are passing some hardware to the VMs, such as a GPU. I don't see any mention of stubbing hardware via the vfio-pci.cfg file, but I can only assume that you are (I can't remember if that shows up in the diagnostics). You will most likely need to update that config file, as well as your VMs, once you install your NVMe. Whenever you install new hardware that interfaces with the PCIe bus, it can shift the existing allocations around. This can cause issues with stubbed hardware, where something that shouldn't have been stubbed suddenly is (e.g. USB ports with the unRAID flash drive on them, disk controllers, etc.). It can also cause the VM to try to access hardware at its previous address, which is now occupied by something else.

 

Long story short - this happened to me when I installed a second GPU; all of the PCIe assignments changed and I had to recreate my vfio-pci.cfg file to stub the correct hardware.

Thanks for the reply. I am passing a sound card through to my VM, plus an Unassigned Devices hard drive for use with Blue Iris. No GPU. It's interesting you mentioned installing a second GPU, as I had the same error then as now: the changed colour scheme, and no Docker or VM Manager running. I wonder if this is indeed the problem? The unRAID USB stick shows no errors as such, and as soon as I take the NVMe out, all is well. I did want to add a second GPU to pass through to the Windows VM, but after all the hassle a year ago I decided against it. If this error can be corrected, I would give it another go!

 

I have attached the latest diagnostics file, which I created with the problems present and the NVMe drive in.

 

So my next question: what do I need to write in the vfio-pci.cfg file?

 

Thanks for the help and reading the post.

pcserver-diagnostics-20210118-2005.zip

Link to comment
1 hour ago, RallyGallery said:

I have attached the latest diagnostics file, which I created with the problems present and the NVMe drive in.

 

So my next question: what do I need to write in the vfio-pci.cfg file?

So it looks like we've found the root cause of the problem.  By any chance did you start a VM in those diagnostics?

 

What it looks like to me in this snippet is that something took control of the USB controller on 0000:09:00.3, which appears to be where your unraid flash drive is located:

Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
Jan 18 19:57:21 PCServer kernel: usb usb6: USB disconnect, device number 1
Jan 18 19:57:21 PCServer kernel: usb 6-4: USB disconnect, device number 2
Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: USB bus 6 deregistered
Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
Jan 18 19:57:21 PCServer kernel: usb usb5: USB disconnect, device number 1

 

So what happens is that when you added the NVMe to your system, it attached to your PCIe bus, which changed the prior assignments. It inserted itself here:

01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
	Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
	Kernel driver in use: nvme
	Kernel modules: nvme

 

which caused everything to move down and thus:

09:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
	Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
	Kernel driver in use: vfio-pci
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
	Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
0a:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
	Subsystem: ASUSTeK Computer Inc. FCH SATA Controller [AHCI mode] [1043:8747]
	Kernel driver in use: ahci
	Kernel modules: ahci
0a:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
	Subsystem: ASUSTeK Computer Inc. Device [1043:8797]

 

09:00.3 has now become the USB controller and 0a:00.3 is now your audio device (which you were trying to pass to the VM). Basically, when you started the VM, it tried to take control of the device on the PCIe bus at 09:00.3 - which is now your USB controller, not the audio device that previously lived there. And since the unRAID flash is on that USB controller, your system loses its connection to it.

 

So to fix this, edit the Windows VM XML to pass through the correct device (audio device) at the correct address (0a:00.3).  Repeat this for anything else that is passed through to other VMs.  
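
As a rough illustration of that edit (a sketch only - a real VM definition will have more attributes around it, including the guest-side address element), the relevant hostdev entry in the XML just needs its source address moved from the old 09:00.3 location to the audio device's new one at 0a:00.3:

<!-- Illustrative hostdev stanza: source now points at bus 0x0a, slot 0x00,
     function 0x3, i.e. the audio device at its new address. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0a' slot='0x00' function='0x3'/>
  </source>
</hostdev>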

 

Finally, it would be best to stub those devices first - which prepares them for use in a VM by assigning a "dummy" driver and prevents unRAID from using them. I believe there is a VFIO-PCI plugin in CA which lets you select what to stub (isolate for VM use). That would be the easiest route - then assign each stubbed device to the appropriate VM.

Link to comment
1 hour ago, RallyGallery said:

It's interesting you mentioned installing a second GPU, as I had the same error then as now

 

Yes, the problem is that I had "stubbed" several components, and when the new GPU was added, the PCIe assignments changed but the stubbed assignments didn't - meaning that several items (disk controllers, network adapters) disappeared. I just had to edit my vfio-pci.cfg file and I was good.

Link to comment
10 hours ago, civic95man said:

 

Yes, the problem is that I had "stubbed" several components, and when the new GPU was added, the PCIe assignments changed but the stubbed assignments didn't - meaning that several items (disk controllers, network adapters) disappeared. I just had to edit my vfio-pci.cfg file and I was good.

Thanks ever so much for your help! This is exactly the problem. I've been doing some research based on your comments. unRAID 6.9 has the plugin functionality you were talking about built in, which you can see when you inspect the IOMMU groups. The sound card is passed through to the VM and the VM is set to autostart. I could get rid of the sound card passthrough, but first I will try to split it out in the IOMMU groups. I am also setting the VM to manual start, just in case.

 

When looking at the groups, there is a green dot next to the devices passed through to the VM, so I will attach them to the vfio-pci.cfg file and see what happens. It's all done via the IOMMU page and is much easier now that the plugin functionality is incorporated. Screenshot below.

 

If that all works I will install the nvme and try again.

 

Will keep you updated with the progress!

 

If that works I may try the second GPU!!

 

Thanks again!

 

image.thumb.png.6f69151ac26edec6d6ea26a63762a875.png

 

Link to comment
6 hours ago, RallyGallery said:

unRAID 6.9 has the plugin functionality you were talking about built in, which you can see when you inspect the IOMMU groups.

I'm still on the 6.8 series but the 6.9 seems to have what you need.  

 

6 hours ago, RallyGallery said:

The sound card is passed through to the VM and the VM is set to autostart. I could get rid of the sound card passthrough, but first I will try to split it out in the IOMMU groups.

Autostarting the VM is always a risky proposition as you could run into problems as you've seen.  My personal preference is that unless it's running some kind of critical task (such as pfsense) then I don't see any reason to autostart.  Again, that is just my personal preference.

 

Last I looked at your logs (and in the screenshot), it looked like the audio device is already split into its own group.

 

6 hours ago, RallyGallery said:

When looking at the groups, there is a green dot next to the devices passed through to the VM, so I will attach them to the vfio-pci.cfg file and see what happens. It's all done via the IOMMU page and is much easier now that the plugin functionality is incorporated. Screenshot below.

I would probably make sure the VM is set to start manually, then install the NVMe (do not adjust the PCIe stubs yet). With the new hardware installed how you want it, adjust the stubs on the IOMMU page and reboot for them to take effect. Then you can configure the VMs to pass through those stubbed components.

 

Remember to repeat the process with any new hardware you add - in my case, I had forgotten what I had done, so it took me by surprise.

 

6 hours ago, RallyGallery said:

Thanks again!

No problem! Sounds like you have a good grasp on how this all works now.

Link to comment
2 hours ago, civic95man said:

I'm still on the 6.8 series but the 6.9 seems to have what you need.  

 

Autostarting the VM is always a risky proposition as you could run into problems as you've seen.  My personal preference is that unless it's running some kind of critical task (such as pfsense) then I don't see any reason to autostart.  Again, that is just my personal preference.

 

Last I looked at your logs (and in the screenshot), it looked like the audio device is already split into its own group.

 

I would probably make sure the VM is set to start manually, then install the NVMe (do not adjust the PCIe stubs yet). With the new hardware installed how you want it, adjust the stubs on the IOMMU page and reboot for them to take effect. Then you can configure the VMs to pass through those stubbed components.

 

Remember to repeat the process with any new hardware you add - in my case, I had forgotten what I had done, so it took me by surprise.

 

No problem! Sounds like you have a good grasp on how this all works now.

Great advice! I have an update and success - well, nearly success. I had a play around with changing the IOMMU groups and tried stubbing, using the built-in ability in the devices section, and wrote the config file. I pass an external USB drive through to the VM to record video from Blue Iris. I got to the stage where I didn't need the sound card, so I didn't pass that through, and I could see the NVMe and also the external drive, and the system did not crash with the 'USB error'. My main VM refused to work - it kept crashing the system - but by not autostarting the VM (advice noted and it will always be followed!) it was easy to reboot and try something else. My other VM worked a treat and didn't crash, so I have moved to that, set up Blue Iris (I had already backed up the settings) and it now works. I will just delete the other VM which was causing problems (see the screenshot below of it trying to start).

 

One final ask, if I can? What is the best procedure to move the VMs, which are held on my cache pool (2 x SSD in a RAID 1 mirror), onto the now working and visible NVMe? The NVMe drive is formatted and mounted and not attached to the cache pool or array; it is basically sitting there waiting to be used. I want to ensure I get this right! You have been so helpful and this is the final piece of the jigsaw.

 

Thanks ever so much for all your help, time and assistance!

20210119_103823483_iOS.jpg

Link to comment
1 hour ago, RallyGallery said:

One final ask, if I can? What is the best procedure to move the VMs, which are held on my cache pool (2 x SSD in a RAID 1 mirror), onto the now working and visible NVMe? The NVMe drive is formatted and mounted and not attached to the cache pool or array.

Looks like you've made a few steps in the right direction.  

 

Now for the bad news: I see an issue right away with that screenshot. The PHP warning at the top (Warning: parse_ini_file.........) seems to indicate that your flash drive dropped offline again. One of two things comes to mind: 1. you should use a USB 2.0 port for the flash drive; 2. when you passed through the USB device for Blue Iris to your Windows VM, you accidentally passed through the entire USB controller that the flash drive is on, or the flash drive itself.

 

If you just need a single USB device, then you should be able to pass through JUST that device by itself to the VM; this works well for mice and keyboards. If you need more control of the USB functionality, or want the entire USB controller to appear in the VM, then yes, you pass the controller through - but anything connected to that controller will no longer be available to unRAID (e.g. the flash drive). If that is the case, then you need to find a USB controller that is in its own IOMMU group AND does not have the unRAID flash drive attached to it.
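
For reference, the "just the USB device" route (the USB Devices checkbox list in the unRAID VM form) ends up in the XML as a hostdev keyed on the device's vendor and product IDs rather than on a PCIe address, so it is not affected by the kind of address shuffle above. The IDs here are placeholders only - lsusb will show the real pair for the Blue Iris drive:

<!-- Illustrative USB hostdev: 0xabcd / 0x1234 are placeholder IDs;
     substitute the vendor:product pair reported by lsusb. -->
<hostdev mode='subsystem' type='usb' managed='yes'>
  <source>
    <vendor id='0xabcd'/>
    <product id='0x1234'/>
  </source>
</hostdev>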

 

Now, to answer your original question - it depends on how you want to utilize the NVMe in your VM(s). You can just store all of your VM disks on the NVMe and point the VMs at that location, or you could pass the NVMe through to Windows and let it use the drive directly. The latter means that ONLY that Windows VM will see and be able to use it.

Link to comment

The good news about the screenshot is that it’s of the ‘old’ VM, which I am going to delete and not worry about. I just included it for info, but it actually caused more confusion - apologies for that! I have had no more issues as such. The external hard drive shows as missing in the historical section of Unassigned Devices, but it’s passed through to the VM, so that all works perfectly.
 

In terms of the two options, I want to place all the VMs on the NVMe and run them from there. Can you assist with how to move them over? I can normally work this stuff out, but you are such a big help and it really saves time and stress!
 

So to recap: the unRAID server boots up perfectly, the NVMe drive is installed, working and mounted, and I have a Windows VM that works fine with no errors and sees the external hard drive for Blue Iris use.
 

I just want to move all the VMs onto the NVMe disk and run them all from there, leaving my cache pool just for Docker containers.
 

As ever, thank you!

Link to comment
6 minutes ago, RallyGallery said:

The good news about the screenshot is that it’s of the ‘old’ VM, which I am going to delete and not worry about. I just included it for info, but it actually caused more confusion - apologies for that! I have had no more issues as such. The external hard drive shows as missing in the historical section of Unassigned Devices, but it’s passed through to the VM, so that all works perfectly.

That's good then. Sounds like you're all set.

 

I could try to describe the process, but it's easier to just reference another post.

I am assuming that you have the NVMe mounted by Unassigned Devices. Basically, just copy the VM image file from the cache to the NVMe using your method of choice, although I recommend cp with the --sparse=always option as it keeps the image size down. Then edit the VM to point to the new disk location (the XML editor may be easier).
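
As a concrete sketch of that copy (the paths are examples only - they assume the stock domains share on the cache and an Unassigned Devices mount point for the NVMe, so adjust both to your own names, and shut the VM down first):

# Create the destination folder first if it doesn't exist, then copy the
# vdisk sparsely so unallocated space inside the image is not expanded
# out to full size on the destination drive.
cp --sparse=always /mnt/cache/domains/Windows10/vdisk1.img \
   /mnt/disks/nvme/domains/Windows10/vdisk1.img

Then point the VM's primary vDisk at the new path (either the disk source line in the XML view or the vDisk location field in the form editor), boot it to confirm it runs from the NVMe, and only then remove the old copy from the cache.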

 

If you have any questions, feel free to come back here

Link to comment

Update! The server is working well, with no more failures. I decided to create a new VM on the NVMe disk, because the main one I use was causing issues for some reason (as discussed above; it will be deleted). I also have another VM running on the cache disk which just runs Blue Iris. The new VM is causing a strange problem: it creates and starts with no problem, Windows 10 setup loads, I can select the virtio storage driver and choose where to create the primary partition, but when I click 'Next' it freezes. In unRAID it says the VM is paused, with an orange dot next to its icon in the GUI. I have also ensured that the primary disk location is set to 'manual' and selected 'remote' for the unassigned NVMe, so it's not being created on the cache pool.

 

I have fiddled around with the Unassigned Devices settings for the NVMe drive, and at times, on the screen where you select the partition to install Windows 10 on, you see both the actual physical NVMe disk and the 30 GB of space unRAID assigned on the NVMe.

 

The NVMe is mounted, not shared (I only want to use it to store VMs on) and not passed through. I will also create a test VM on the cache drive to see if I can reproduce this same behaviour.

 

Any thoughts?

Link to comment
