civic95man

Posts posted by civic95man

  1. On 1/23/2021 at 5:54 PM, Aceriz said:

    I wondered if it might be one of the software applications.  Specifically... I am wondering if it is the installed iCUE Corsair software for my H150i CPU pump.

    I would be surprised if that was the cause; the VM is isolated from your unraid system.  That Corsair pump is just another USB peripheral as far as unraid, the VM, and Windows are concerned (much like a mouse or keyboard).  If you want to pursue this route, it can't hurt anything.

  2. 12 hours ago, Aceriz said:

    My system has 32GB of ECC memory  so in the FLASH section behind the append I put in the following.   "append hugepagesz=2M hugepages=12096 vfio-pci.ids=1b73:1100,10de:1ad8,10de:1e87,10de:10f8,10de:1ad9, initrd=/bzroot,/bzroot-gui nomodeset"

    I chose the 12096 as I had tried putting it at the full 16128   but my system would not boot past the initial selection page for the gui safemode etc... froze on the countdown...  so I reset and put the lower number in. 

    I was thinking of trying the 1G option since your processor supports it, but 2M still works.  You're basically telling the kernel to set aside that many contiguous 2MB blocks of memory (instead of the default 4KB page size).  The kicker here is that some applications besides the VM can and will use the hugepages, so plan accordingly.  If you want 16GB set aside as hugepages then you would put "hugepages=8192", since 16GB / 2MB = 8192.
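
    As a sketch, the append line from your quote above would only change in the hugepages count - double-check it against the append line on your Flash settings page (stored in /boot/syslinux/syslinux.cfg) before rebooting:

    # 16 GB of 2 MB hugepages: 16384 MB / 2 MB = 8192
    append hugepagesz=2M hugepages=8192 vfio-pci.ids=1b73:1100,10de:1ad8,10de:1e87,10de:10f8,10de:1ad9 initrd=/bzroot,/bzroot-gui nomodeset

    # or, since the CPU supports 1 GB pages, the equivalent would be
    # append hugepagesz=1G hugepages=16 vfio-pci.ids=... initrd=/bzroot,/bzroot-gui nomodeset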

     

    I think the misconfigured hugepages were causing the OOM killing spree in those diagnostics.

     

     

    12 hours ago, Aceriz said:

    Just so I know, is this something that will erase the data off of the SSD cache if I look at updating as part of the problem solving? Sorry if it's a stupid question...

    This is in regard to the transition from prior versions of unraid to 6.9-rc, where there were excessive writes to the SSD and part of the solution was to align the partition to 1MB.  That would require repartitioning the SSD - but you would have to invoke it manually.  The catch is that, supposedly, 6.8 and earlier don't recognize this layout (but I could be mistaken).  In either case, just don't format the SSD in 6.9 and you'll be fine.

     

    Storage pools were introduced in 6.9 and there was a process to revert back to 6.8 and earlier, but I think those notes were lost somewhere in the beta release updates.  As always, back up your flash drive before updating.

     

    At this point, I think the 6.9 route would be the best choice since it utilizes a newer kernel which should better support your MB and CPU, and possibly get rid of that BTS buffer allocation failure.

  3. 6 minutes ago, RallyGallery said:

    The good news for the screenshot is that it’s for the ‘old’ VM which I am going to delete and not worry about. I just included it for info but it actually caused more confusion! Apologies for that! I have had no more issues as such. The external hard drive says missing in the historical section of  unassigned devices but it’s passed through to the vm so that all works perfectly. 

    That's good then.  Sounds like you're all set.

     

    I could try to describe the process here, but it's easier to just reference another post:

    I am assuming that you have the nvme mounted by Unassigned Devices.  Basically, just copy the VM image file from the cache to the nvme using your choice of methods, although I recommend cp with the --sparse=always option as it keeps the image size smaller.  Then edit the VM to point to the new disk location (the XML editor may be easier).  A rough sketch is below.
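
    Something like this, assuming hypothetical paths (a Win10 VM on the cache, and the nvme mounted by Unassigned Devices at /mnt/disks/nvme):

    # stop the VM first, then copy the vdisk while preserving sparseness
    mkdir -p /mnt/disks/nvme/domains/Win10
    cp --sparse=always /mnt/user/domains/Win10/vdisk1.img /mnt/disks/nvme/domains/Win10/vdisk1.img
    # then edit the VM so its disk points at the new location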

     

    If you have any questions, feel free to come back here

  4. 1 hour ago, RallyGallery said:

    Final ask, if I can? What is the best procedure to move the VMs, which are held on my cache (2 x SSD in a mirror), to the now working and visible nvme? The nvme drive is formatted and mounted and not attached to the cache pool or array.

    Looks like you've made a few steps in the right direction.  

     

    Now for the bad news: I see an issue right away with that screenshot.  The php warning at the top (Warning: parse_ini_file.........) seems to indicate that your flash drive dropped offline again.  One of two things comes to mind: 1. you should use a USB2 port for the flash drive; 2. when you passed through the USB device for Blue Iris in your Windows VM, you accidentally passed through either the entire USB controller that the flash drive is on, or the flash drive itself.

     

    If you just need a single USB device, then you should be able to pass through JUST that device by itself to the VM.  This works well for mice/keyboards.  If you need more control of the USB functionality or want the entire USB controller to appear in the VM, then yes, you pass the controller through - but anything connected to that controller will no longer be available to unraid (including the flash drive).  If that is the case, then you need to find a USB controller that is in its own IOMMU group AND does not have the unraid flash drive attached to it; the snippet below is one way to check.
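
    From the unraid console, something like this lists every IOMMU group and the devices in it, so you can see which USB controllers are isolated (read-only, nothing to break):

    for g in /sys/kernel/iommu_groups/*; do
      echo "IOMMU group ${g##*/}:"
      for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
      done
    done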

     

    Now, to answer your original question - it depends on how you want to utilize that nvme drive in your VM(s).  You can store all of your VM disks on the nvme and simply point each VM at it.  Or you could pass the nvme through to Windows and let it use the drive directly.  The latter means that ONLY that Windows VM will see and be able to use it.

  5. Looked through your diagnostics (both) and still see the OOM errors.  The first set of diagnostics, where the system ran for about 2 days, was full of them as you stated.  I find it very odd that your memory seems so fragmented that it can't allocate an order-4 block of contiguous memory - especially after a fresh reboot.
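
    If you're curious, you can see the fragmentation for yourself from the console (read-only checks):

    # free blocks per zone, one column per order; order 4 = 16 contiguous 4K pages (64K)
    cat /proc/buddyinfo
    # overall memory picture, including any hugepages already reserved
    grep -i huge /proc/meminfo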

     

    Here is a suggestion: have you tried using Hugepages for your VM?  It's typically only needed for very large capacities, or if you are suffering performance issues; however, in this case, it's worth a shot.  Here is a post about how to utilize it:

     

    If that doesn't work then I would suggest either trying the 6.9-rc or adding more memory.  The 6.9 series added pools and changed the partition layout option for the SSD cache, so while it's not a one-way trip, reverting to the prior 6.8 or 6.7 release isn't as simple.  With that said, the 6.9-rc seems very stable and should work fine.

  6. 6 hours ago, RallyGallery said:

    unRaid 6.9 has the plugin you were talking about installed when you inspect the IOMMU groups.

    I'm still on the 6.8 series but the 6.9 seems to have what you need.  

     

    6 hours ago, RallyGallery said:

    The sound card is passed through to the VM and the VM is set to autostart. I could get rid of the sound card passthrough but first I will try and split it out in the IOMMU groups.

    Autostarting the VM is always a risky proposition, as you can run into problems like you've seen.  My personal preference is that unless it's running some kind of critical task (such as pfSense), there's no reason to autostart.  Again, that is just my personal preference.

     

    Last I looked at your logs (and in the screenshot), it looked like the audio device is already split into its own group.

     

    6 hours ago, RallyGallery said:

    When looking at the groups there is a green dot next to devices passed through to the VM, so I will attach them to the vfio-pci file and see what happens. All done via the IOMMU page and much easier now that the plugin is incorporated. Screenshot below.

    I would probably make sure the VM is set to start manually.  Then install the nvme (do not adjust the PCIe stubbing yet).  Once the new hardware is how you want it, adjust the stubbing on the IOMMU page and reboot for it to take effect.  Then you can configure the VMs to pass through those stubbed components.

     

    Remember to repeat the process whenever you add new hardware - in my case, I had forgotten what I had done, so it took me by surprise.

     

    6 hours ago, RallyGallery said:

    Thanks again!

    No problem! Sounds like you have a good grasp on how this all works now.

  7. 1 hour ago, RallyGallery said:

    Interesting what you said about when you installed a second GPU, as I have the same error now

     

    Yes, the problem was that I had "stubbed" several components, and when the new GPU was added, the PCIe assignments changed but the stubbed assignments didn't - meaning that several items (disk controllers, network adapters) disappeared.  I just had to edit my vfio-pci file and I was good.

  8. 1 hour ago, RallyGallery said:

    I have attached the latest diagnostics file which I created with the problems and the nvme drive in.

     

    So my next question, what do I need to write in the vfio-pci.cfg file? 

    So it looks like we've found the root cause of the problem.  By any chance did you start a VM in those diagnostics?

     

    What it looks like to me in this snippet is that something took control of the USB controller on 0000:09:00.3, which appears to be where your unraid flash drive is located:

    Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
    Jan 18 19:57:21 PCServer kernel: usb usb6: USB disconnect, device number 1
    Jan 18 19:57:21 PCServer kernel: usb 6-4: USB disconnect, device number 2
    Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
    Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
    Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: USB bus 6 deregistered
    Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
    Jan 18 19:57:21 PCServer kernel: usb usb5: USB disconnect, device number 1

     

    So what happened is that when you added the nvme to your system, it attached to your PCIe bus, which changed the prior assignments. It inserted itself here:

    01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
    	Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
    	Kernel driver in use: nvme
    	Kernel modules: nvme

     

    which caused everything to move down and thus:

    09:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
    	Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
    	Kernel driver in use: vfio-pci
    0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
    	Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
    0a:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
    	Subsystem: ASUSTeK Computer Inc. FCH SATA Controller [AHCI mode] [1043:8747]
    	Kernel driver in use: ahci
    	Kernel modules: ahci
    0a:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
    	Subsystem: ASUSTeK Computer Inc. Device [1043:8797]

     

    09:00.3 now became the USB controller and 0a:00.3 is now your audio device (which you were trying to pass to the VM).  Basically, when you started the VM, it tried to take control of the device on the PCIe bus at 09:00.3 - which is now your USB controller, not the audio device that used to live there.  And since the unraid flash drive is on that USB controller, your system lost its connection to it.

     

    So to fix this, edit the Windows VM XML to pass through the correct device (audio device) at the correct address (0a:00.3).  Repeat this for anything else that is passed through to other VMs.  

     

    Finally, it would be best to stub those devices first - which prepares them for use in a VM by assigning a "dummy" driver and prevents unraid from using them.  I believe there is a VFIO-PCI Config plugin in CA which lets you select what to stub (isolate for VM use).  This would be the easiest route - then assign those devices to the VM.  A rough sketch of what ends up in the config file is below.
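
    To loosely answer the earlier question about vfio-pci.cfg: the plugin normally writes this file for you, but as a sketch, assuming it's the audio device at its new 0a:00.3 address that you want stubbed (verify the address first, and note the exact file format can vary between unraid/plugin versions):

    # confirm 0a:00.3 really is the HD Audio controller after the hardware change
    lspci -nns 0a:00.3

    # /boot/config/vfio-pci.cfg (sketch)
    BIND=0000:0a:00.3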

  9. On 1/15/2021 at 6:35 PM, Aceriz said:

    Over the last 4-5 days I have also had the time to really try and isolate the problem.  I am able to run the system in Safe Mode with Docker and VMs off without any of the errors.  When I run in Safe Mode with VMs on and Dockers off, I get errors.

    So it seems to be VM related.  Might be good to post diagnostics AFTER this happens again.  That snippet of the syslog left out a lot of details and I saw reference to another OOM error. 

     

    Also, I looked into your previous OOM error from the first post one last time and I can *kinda* see how it gave you the error.  If anyone is curious: technically, you ran out of memory in the Normal zone and couldn't allocate a contiguous block (order 4).  I don't know why it didn't fall back to the DMA32 zone.  Maybe someone else can answer that.

     

    On 1/15/2021 at 6:35 PM, Aceriz said:

    not sure if this line was significant: "Jan 15 20:05:59 UNRAID kernel: type mismatch for 8c000000,4000000 old: write-back new: write-combining"

    That seems to be related to Intel integrated graphics. For now, just assume it's nothing.

     

    I did find reference to your current issues in an older forum post: 

    Their solution was to nuke the offending VM and start over.  I guess you could try that.  You could first try removing the XML but keeping the vdisk (assuming you're using vdisks for the VM).  If that doesn't work, then try creating a new vdisk while keeping the old one.

     

    If that doesn't work either, then you could always go with the latest 6.9.0 release candidate.  The new kernel might help things out and I think there is a newer release of qemu rolled up in there as well.

  10. I'm assuming that you are passing some hardware to the VMs, such as a GPU.  I don't see any mention of stubbing hardware via the vfio-pci.cfg file, but I can only assume that you are (can't remember if that shows up in the diagnostics).  You will most likely need to update that config file as well as your VMs once you install your nvme.  Whenever you install new hardware that interfaces with the PCIe bus, it can shift the existing allocations around.  This can cause issues with stubbed hardware, where something that shouldn't have been stubbed suddenly is (e.g. USB ports with the unraid flash drive on them, disk controllers, etc.).  It can also cause the VM to try to access hardware at its previous address, which is now occupied by something else.

     

    Long story short - this happened to me when I installed a second GPU; all of the PCIe assignments changed and I had to recreate my vfio-pci.cfg file to stub the correct hardware.
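
    One low-risk way to catch this kind of shift is to capture the PCI layout before and after the hardware change and compare; just a sketch, with hypothetical file names:

    # before installing the new card
    lspci -nn > /boot/pci-before.txt
    # install the hardware, reboot, then:
    lspci -nn > /boot/pci-after.txt
    diff /boot/pci-before.txt /boot/pci-after.txt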

  11. 26 minutes ago, TheSnotRocket said:

    So.. maybe this will help - this is one of the latest diagnostic logs that I have - it's NOT from the pastebin from the OP of this post.. but still contains:

     

    CPU: 11 PID: 16390 Comm: php-fpm Tainted: P        W  O      4.19.107-Unraid #1

     

    When my server hangs like this, it's unable to create or collect logfiles... I've left the diagnostics command running from my BMC window for an hour or more... nothing happens.

    The "tainted" just means that you're using an OOT driver which isn't "officially" supported by the kernel, hence OOT.  The usual culprits are the intel igb and the nvidia drivers.

     

    You may need to set up a syslog server so you can see what happens when you lose the system.

  12. Well, I just looked at your older diagnostics from the 5th of October.  Nothing stands out in the configuration, but your syslog is seriously spammed with drive connection errors/resets.  Not sure if it's an actual connection issue or your HBA card - you should look into upgrading its firmware.  You also have a lot of kernel panics towards the end which seemingly result in an OOM condition (odd since you have so much RAM).  Maybe try booting in safe mode and see if you're stable, then slowly enable dockers/VMs until you find the cause.

  13. No, it's under the SMART stats for the drive.  The extended test reads the entire drive; if it can read everything successfully it will report PASS, otherwise it will report a failure and you'll know to toss it.
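
    If you'd rather kick it off from the command line than the GUI, a sketch using smartmontools (replace sdX with the actual device):

    # start an extended self-test; it runs in the background on the drive itself
    smartctl -t long /dev/sdX
    # check progress and, eventually, the result
    smartctl -a /dev/sdX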

  14. 13 minutes ago, Djinn said:

    I tried to toss 3 other drives I had laying around in there

    That's your problem.  You shouldn't throw drives around. Handle them gently.

     

    Seriously though, we would need diagnostics to know what is going on.  Post them in your next reply.

     

  15. 8 minutes ago, Alessio Liscietti said:

    I take it there is no way to "recover" the data through other methods?

    Google may be your friend here, but if you do want to try to recover anything then you need to take your array offline so nothing else gets written to it.  Writes to that disk will overwrite data that *may* be recoverable.

  16. 2 minutes ago, trurl said:

    Unraid doesn't format the drive for rebuild, so I don't know what you mean here. The filesystem (format) is part of the bits of the rebuild. Suspect you have an incorrect understanding of the meaning of "format". Format means "write an empty filesystem (of some type) to this disk". That is what it has always meant in every operating system you have ever used.

     

    When you format a disk assigned to the parity array, Unraid treats that write operation exactly as it does any other write operation, by updating parity. So after the format, parity agrees the disk has an empty filesystem. Rebuilding a formatted disk results in a formatted disk.

     

    And your diagnostics confirms disk1 is empty.

    Yes, the syslog shows the disk was formatted.  I suspect the emulated disk was showing as unmountable, so they formatted it since it was a given option.  That format option really needs to be moved somewhere else, maybe within the individual drive's options; at least that way you'd have to work to find it!