civic95man

Everything posted by civic95man

  1. I'm not sure how true this is, but I was under the impression that it was broadcast traffic, in conjunction with static IPs, that was causing macvlan to s*** the bed and causing kernel panics. Now I'm not sure if this is true with every vendor, but Ubiquiti switches/routers do not route broadcast packets between networks (so I was told/read), which was the idea behind creating vlans for docker containers with static IPs. I also think it was thrown out there that certain network adapters may be more prone to this than others. This is a very good point, since it could be an external device on the network that is hammering docker and therefore the macvlan interface, which isn't present in a "test lab". Just my $0.02
  2. That is very strange, but if it seems to be the root cause then it's good to know. I guess if you stress test your system without it installed and don't see any call traces, then you've found your solution!
  3. I would be surprised if that was the cause; specifically, the VM is isolated from your unraid system. That corsair pump is just another USB peripheral as far as unraid, the VM, and Windows are concerned (much like a mouse or keyboard). If you want to pursue this route then it can't hurt anything.
  4. Just do a flash backup from the web UI ([main] -> [boot device -> flash] -> [flash backup]) and keep it somewhere safe. If things go sideways, you just copy everything back to the flash drive. I would then update from within unraid.
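     If you ever need to restore manually, a rough sketch of the idea (assuming the backup zip from the web UI and the flash drive mounted somewhere like /mnt/usb - names here are illustrative only):

        # extract the backup onto the (FAT32-formatted) flash drive
        unzip unraid-flash-backup.zip -d /mnt/usb
        # then run the bundled make_bootable script for your OS so the stick boots again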
  5. I was thinking of trying the 1G option since your processor supports it, but 2M still works. You basically tell the kernel to set aside a given number of contiguous blocks of memory, 2M in size (instead of the default 4K pages). The kicker here is that some applications besides the VM can and will use the hugepages, so plan accordingly. If you want 16G set aside as hugepages then you would put "hugepages=8192", since 16GB / 2MB = 8192. I think the misconfigured hugepages were causing the OOM killing spree in those diagnostics.
     This is in regard to the transition from prior versions of unraid to 6.9-rc, where there were excessive writes to the SSD and part of the solution was to align the partition to 1MB. This would require repartitioning the SSD - but you would have to manually invoke it. The issue is that, supposedly, 6.8 and earlier doesn't recognize this layout (but I could be mistaken). In either case, just don't format the SSD in 6.9 and you'll be fine. Storage pools were introduced in 6.9 and there was a process to revert back to 6.8 and earlier, but I think those notes were lost in the beta release updates somewhere. As always, back up your flash drive before updating.
     At this point, I think the 6.9 route would be the best choice since it uses a newer kernel which should better support your MB and CPU, and possibly get rid of that BTS buffer allocation failure.
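     For reference, a minimal sketch of where that parameter goes on unraid - the append line in syslinux.cfg on the flash drive (your existing append line may differ, so treat this as illustrative only):

        # /boot/syslinux/syslinux.cfg - add hugepages to the kernel append line
        label Unraid OS
          menu default
          kernel /bzimage
          append hugepages=8192 initrd=/bzroot
        # for 1G pages instead: append hugepagesz=1G hugepages=16 initrd=/bzroot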
  6. That's good then. Sounds like you're all set. I could try to describe the process, but it's easier to just reference another post. I am assuming that you have the nvme mounted by unassigned devices. Basically, just copy the VM image file from the cache to the nvme using your method of choice, although I recommend using the --sparse=always option as it keeps the image size smaller. Then edit the VM to point to the new disk location (the XML editor may be easier). If you have any questions, feel free to come back here.
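     A minimal sketch of the copy step, assuming the vdisk lives at /mnt/cache/domains/Windows10/vdisk1.img and the nvme is mounted at /mnt/disks/nvme (both paths are hypothetical):

        # copy the image while keeping it sparse (unused space stays unallocated)
        mkdir -p /mnt/disks/nvme/domains/Windows10
        cp --sparse=always /mnt/cache/domains/Windows10/vdisk1.img /mnt/disks/nvme/domains/Windows10/
        # then point the VM's disk source at the new path before starting it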
  7. Looks like you've made a few steps in the right direction. Now for the bad news: I see an issue right away with that screenshot. The php warning at the top (Warning: parse_ini_file.........) seems to indicate that your flash drive dropped offline again. One of two things comes to mind: 1. you should use a USB2 port for the flash drive; 2. when you passed through the USB device for blue iris in your windows VM, you passed through the entire USB controller that the flash drive is on, or the flash drive itself, by accident.
     If you just need a single USB device, then you should be able to pass through JUST the USB device by itself to the VM; this works well for mice/keyboards. If you need more control of the USB functionality, or want the entire USB controller to appear in the VM, then yes, you pass the controller through - but anything connected to that controller will not be available to unraid (i.e. the flash drive). If that is the case, then you need to find a USB controller in its own IOMMU group AND not attached to the unraid flash drive.
     Now, to answer your original question - it depends on how you want to utilize that nvme in your VM(s). You can store all of your VM disks on the nvme and just point the disks to it. Or you could pass the nvme through to windows and let it use it directly. The latter means that ONLY that windows VM will see and be able to use it.
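     For the single-device route, a minimal sketch of what the hostdev entry looks like in the VM's XML (the vendor/product IDs below are made up - use the ones lsusb reports for your device):

        <hostdev mode='subsystem' type='usb' managed='no'>
          <source>
            <vendor id='0x1234'/>
            <product id='0x5678'/>
          </source>
        </hostdev>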
  8. I looked through your diagnostics (both) and still see the OOM errors. The first diagnostics, where the system ran for about 2 days, was full of them as you stated. I find it very odd that your memory seems very fragmented, and thus why it can't allocate an order-4 block of contiguous memory - especially after a fresh reboot. Here is a suggestion: have you tried using Hugepages for your VM? It's typically only needed for very large capacities, or if you are suffering performance issues; however, in this case, it's worth a shot. Here is a post about how to utilize it: If that doesn't work then I would suggest either trying the 6.9-rc or adding more memory. The 6.9 series has added pools and changed the format option of the SSD cache, so while it's not a one-way trip, it's not as simple to revert to the prior 6.8 or 6.7 release. With that said, the 6.9-rc seems very stable and should work fine.
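     If you do try hugepages, the VM side is just a memoryBacking element in its XML (a minimal sketch - the kernel also needs hugepages reserved at boot, as covered in the post referenced above):

        <memoryBacking>
          <hugepages/>
        </memoryBacking>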
  9. I'm still on the 6.8 series but 6.9 seems to have what you need. Autostarting the VM is always a risky proposition, as you could run into problems as you've seen. My personal preference is that unless it's running some kind of critical task (such as pfsense), I don't see any reason to autostart. Again, that is just my personal preference. Last I looked at your logs (and in the screenshot), it looked like the audio device is already split into its own group. I would probably make sure the VM is set to manually start. Then install the nvme (do not adjust the pcie stub yet). With the new hardware how you want it, adjust the pcie stub on the iommu group and reboot for those changes to take effect. Then you can configure the VMs to pass through those stubbed components. Remember to repeat the process with any new hardware you've added - in my case, I had forgotten what I had done so it took me by surprise. No problem! Sounds like you have a good grasp on how this all works now.
  10. Yes, the problem is that I had "stubbed" several components, and when the new GPU was added, the PCIe assignments changed but the stubbed assignments didn't - meaning that several items (disk controllers, network adapters) disappeared. I just had to edit my vfio-pci file and I was good.
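      For anyone following along, the stub list lives in vfio-pci.cfg on the flash drive; a rough sketch of what an entry looks like (the address and IDs below are made up, and the exact format depends on your unraid version / the VFIO-PCI Config plugin):

         # /boot/config/vfio-pci.cfg
         BIND=0000:03:00.0|abcd:1234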
  11. So it looks like we've found the root cause of the problem. By any chance did you start a VM in those diagnostics? What it looks like to me in this snippet is that something took control of the USB controller at 0000:09:00.3, which appears to be where your unraid flash drive is located:

         Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
         Jan 18 19:57:21 PCServer kernel: usb usb6: USB disconnect, device number 1
         Jan 18 19:57:21 PCServer kernel: usb 6-4: USB disconnect, device number 2
         Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
         Jan 18 19:57:21 PCServer kernel: sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
         Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: USB bus 6 deregistered
         Jan 18 19:57:21 PCServer kernel: xhci_hcd 0000:09:00.3: remove, state 1
         Jan 18 19:57:21 PCServer kernel: usb usb5: USB disconnect, device number 1

      So what happens is that when you added the nvme to your system, it interfaces with your PCIe bus, which changed the prior assignments. It inserted itself here:

         01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
           Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
           Kernel driver in use: nvme
           Kernel modules: nvme

      which caused everything to move down, and thus:

         09:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
           Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
           Kernel driver in use: vfio-pci
         0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
           Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
         0a:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
           Subsystem: ASUSTeK Computer Inc. FCH SATA Controller [AHCI mode] [1043:8747]
           Kernel driver in use: ahci
           Kernel modules: ahci
         0a:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
           Subsystem: ASUSTeK Computer Inc. Device [1043:8797]

      09:00.3 is now the USB controller and 0a:00.3 is now your Audio Device (which you were trying to pass to the VM). Basically, when you started the VM, it tried to take control of the device on the PCIe bus at 09:00.3 - which is your USB controller, not the audio device it previously was. And since the unraid flash is on that USB controller, your system loses connection to it.
      So to fix this, edit the Windows VM XML to pass through the correct device (audio device) at the correct address (0a:00.3). Repeat this for anything else that is passed through to other VMs. Finally, it would be best to stub those devices first - which prepares them for use in a VM by assigning a "dummy" driver and prevents unraid from using them. I believe there is a VFIO-PCI plugin in CA which would let you select what to stub (isolate for VM use). This would be the easiest route - then assign those respectively to the VM.
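      A minimal sketch of the relevant part of the VM XML - the hostdev source address is what needs to point at the audio device's new location (0a:00.3 in this case):

         <hostdev mode='subsystem' type='pci' managed='yes'>
           <source>
             <address domain='0x0000' bus='0x0a' slot='0x00' function='0x3'/>
           </source>
         </hostdev>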
  12. So it seems to be VM related. It might be good to post diagnostics AFTER this happens again; that snippet of the syslog left out a lot of details and I saw reference to another OOM error. Also, I looked into your previous OOM error from the first post one last time and I can *kinda* see how it gave you the error. If anyone is curious: technically, you ran out of memory in the Normal zone and couldn't assign a contiguous block (order of 4). I don't know why it didn't use the DMA32 zone - maybe someone else can answer that. Seems to be related to intel integrated graphics??? For now, just assume it's nothing. I did find reference to your current issues in an older forum post: Their solution was to nuke the offending VM and start over. I guess you could try that. You could first try removing the xml but keeping the vdisk (assuming you're using vdisks for the VM). If that doesn't work then try creating a new vdisk, keeping the old one. If that doesn't work either, then you could always go with the latest 6.9.0 release candidate. The new kernel might help things out, and I think there is a newer release of qemu rolled up in there as well.
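      If you try the "remove the xml, keep the vdisk" route, a rough command-line sketch (assuming the VM is named Windows10; the VM page's plain "Remove VM" option, as opposed to "Remove VM & Disks", does the same thing):

         # drop the VM definition only; the vdisk image files are left alone
         virsh undefine Windows10 --nvram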
  13. I'm assuming that you are passing some hardware to the VMs, such as a GPU. I don't see any mention of stubbing hardware via the vfio-pci.cfg file but I can only assume that you are (can't remember if that shows up in the diagnostics). You will most likely need to update that config file as well as your VMs once you install your nvme. Whenever you install new hardware that interfaces with the PCIe bus, it can shift the existing allocations around. This can cause issues with stubbed hardware, where something that shouldn't have been stubbed suddenly is (i.e. USB ports with the unraid flash drive on them, disk controllers, etc). This can also cause the VM to try to access hardware at the previous address, which now is occupied by something else. Long story short - this happened to me when I installed a second GPU; all of the PCIe assignments changed and I had to recreate my vfio-pci.cfg file to stub the correct hardware.
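      Once the nvme is in, a quick sanity check is to compare the current PCIe addresses against what is actually stubbed (a sketch; the config path is the unraid default):

         lspci -nn                        # list current PCIe addresses and vendor:device IDs
         cat /boot/config/vfio-pci.cfg    # compare against the addresses you have stubbed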
  14. You'll want to look into the 6.9 beta, which allows multiple pools. Although it's still a beta, it seems really stable, with many people using it.
  15. That's good to know, and good to point out. I don't boot into GUI mode, but it is a nice option to have if required. I run an X10SRA-F and remember seeing a note that a new VGA driver was required when updating the BMC to 3.80 or later (I'm on 3.88 now).
  16. The "tainted" just means that you're using an OOT driver which isn't "officially" supported by the kernel, hence OOT. The usual culprits are the intel igb and the nvidia drivers. You may need to setup a syslog server so you can see what happens when you lose your system
  17. Well, I just looked at your older diagnostics from the 5th of October. Nothing stands out in the configuration. Your syslog is spammed with multiple drive connections/resets/etc. Not sure if it's an actual connection issue or maybe your HBA card - you should look into upgrading its firmware. Your syslog is seriously filled with those messages. You also have a lot of kernel panics towards the end which seemingly result in an OOM condition (odd since you have so much RAM). Maybe try booting in safe mode and see if you're stable, then slowly enable dockers/VMs until you find the cause.
  18. Still, diagnostics right now could help shed some light on why it hung originally. At least give us an idea of your hardware (AMD needs some specific workarounds for example)
  19. I like the unbalance plugin. It lets you move data from one drive to another.
  20. But it will take a long time to complete (several hours)
  21. No, it's under the SMART stats for the drive. It will test the entire drive, and if it can read the entire drive successfully it will report PASS; otherwise it will report a failure and you'll know to toss it.
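      The same test can also be kicked off from the command line if you prefer (a sketch - replace sdX with the actual device):

         smartctl -t long /dev/sdX     # start the extended (long) self-test
         smartctl -a /dev/sdX          # check the self-test log and result once it finishes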
  22. You could try an extended test and see if it passes
  23. That's your problem. You shouldn't throw drives around. Handle them gently. Seriously though, we would need diagnostics to know what is going on. Post them in your next reply.
  24. Never used it, but scanning over the differences, I would think just the Recovery Explorer Standard should be fine since each disk is its own file system, so you don't need a lot of those features like the RAID version. Let us know how it goes.