harperhendee

  1. When I refer to VM HDD, I mean the individual VM disk allocation size. My theory was that an intermittent cache SSD caused occasional system hangs, and system hangs caused data corruption that wasn't caught by parity, since the hang occurred before parity was updated. Bad SSD was my last hope of a "fixable" problem with my system to run 2x VR sessions on VMs. Intermittent hangs that happen at different times, regardless of whether PCI is remapped or not... If the disks aren't bad, the only culprits left are Unraid, the KVM SW, and the actual virtualization tech in the chip. In any case, the system is running solidly on the former cache disk as a standalone Win10 gaming build. I may give this one more try using my 256GB SATA SSD as new cache. But I'm pretty tired of debugging this issue--it's simply not converging. Much as I like virtualization tech, I like platforms that don't hang even better. --Brad
  2. I've been chasing stability problems in my unraid system for almost a year now. I have replaced almost every component of the system, but still get hangs on VMs, especially at load time and when detecting new devices. I have one final theory: The cache drive is unreliable. My array is as follows:
     Cache: Intel 1 TB M.2 SSD
     Disk 1: 2 TB Western Digital Red
     Disk 2: 1 TB Western Digital Black (high performance)
     Parity: 2 TB Western Digital Red
     I noticed a while back that almost no data was actually consumed in either of the WD disks. My VM HDD allocation was about 1.5 TB before I recently deleted and rebuilt my VMs. I noticed that the cache was shown as fully used, but the array disks showed very low utilization. Yesterday, I decided to do a Windows 10 installation on the cache drive. I deleted all the partitions from the Windows installation SW and installed it in the resulting unallocated space. Things worked fine for the installation and upgrade to Win10 fall update. Then I decided to plug in my other 256GB SSD with Win10 on it. I figured I could mount the disk and re-install SW and drivers from its "downloads" directory. The system hung at a black screen with spinning circles. I told it to boot from the 256GB SSD. Boots fine. So my thought is the SSD is actually bad, but I am not sure how to confirm this other than replacing the component. Replacing a 1TB M.2 SSD is not cheap, so I was hoping some diagnostics might help me determine if I need to or not. --Harper
  3. I've been using unraid for a little over a year now. My goal has been to build a multi-headed VR gaming rig. I had many trials and tribulations to get the system to a stable point with all the PCIE pass through and USB devices. It was pretty stable for about 6 months. Then I started getting intermittent hangs on my two main gaming VMs (one for Oculus, one for VIVE). The hangs became more frequent until the system was unusable. I ended up booting back to a regular Windows image without unraid and everything worked fine. A few weeks back, I made major HW changes. Maybe my problem was HW related. I changed my Xeon 22-core processor for a new i9 7900x. I had to upgrade the motherboard from my Gigabyte x99 Aorus Designaire to a MSI x299 Xpower AC board. With these two major upgrades done, I brought up the whole system again with a new install of Windows on a clean SSD. Everything works fine in windows. I booted back to unraid and had a look at my old VMs. They all had problems. The two main VMs would intermittently boot, then hang after 1-10 minutes of usage. I had a number of relatively "pristine" VMs that I had only done a basic windows install with variations of BIOS and processor model. These VMs also would hang, especially during loading. I never got any error messages that I could sniff out. The VM would just be shown as "paused" in the web interface. If I force killed the VM, about 50% of the time it would no longer even launch, giving me some message about execution errors. So I decided that perhaps these VMs had become corrupted by having survived all the hangs in the past. I deleted all VMs and built up 3 new ones. I had no problems with these VMs during installation. Then I decided to run some stress tests. I ran combinations of 7zip, cinebench, and passmark. All three VMs performed flawlessly for a good 24 hours of grueling CPU loads. I brought the VMs down, increased disk space, and then launched them again to fix the disks in windows. 
My first VM hung at BIOS. I killed it and relaunched. It hung during windows init. I rebooted and tried the other VMs. I was able to get hangs during BIOS and windows logo as well. Also, I was able to fatally hang a VM by plugging/unplugging USB devices into a PCIE mapped USB card. What do I mean by "hang"? The VM becomes unresponsive to input and doesn't update its output, which remains stuck. Sometimes HTOP will show CPU utilization approaching 0 on all cores. Sometimes, it will show all cores pegged at 100% (especially for launch hangs). Sometimes, a single thread continues even though the VM is unresponsive. Sometimes, unraid will hang as well and requires a hard reset to recover. After a hang, sometimes I can relaunch the VM, and sometimes I get execution errors when I attempt it. Reboot resolves these issues.
So my conclusion is:
  1) CPU virtualization is fine
  2) Memory virtualization is fine (I suspected memory corruption as issue early on)
  3) VM launch has serious problems
  4) USB discovery via PCIE passthru has problems
And I have some big open questions:
  1) Why do my VMs get worse over time?
  2) What does a "VM Hang" actually involve?
  3) Is there an architectural problem with KVM, virtio, or Intel's virtualization tech?
I'm going to give this a few more days to resolve, and then I'm wiping the whole thing and building up a couple of dual-boot windows builds. I may not be able to run 2 parallel sessions, but at least it will be stable. --Brad
  4. This should work. I have two Fresco 1100 USB cards in my rig. I run two gaming VMs in parallel running Oculus and Vive. My setup isn't perfect. I get occasional hangs, especially when playing "switchboard operator" on the USB ports. I think this is due to my own problems with PCI initialization. My cards reside at 4e and 54. Here's where they are discovered in syslog.txt: Similar stuff on 54. Later, it is revisited at 4c (which is the PCI bridge chip on MOBO) And then I get a non-fatal error: Then another non-fatal error: And then this message: And finally this message: USB devices seem to work fine in Unraid after this point. I can even boot from the card (better than my mobo usb, actually). However, when I launch my VM, I get one more non-fatal error: There's a warning in the VM log file as well: After this point, USB mostly works. But I get flaky behavior that I think is related to the non-fatal errors. Some of the flaky behavior results in hard hangs, which is not nice at all! I've been doing the research on how PCI enumeration works so that I can steer things into a more predictable and stable configuration. This thread is pure gold! --Harper
  5. I get similar errors on my two Innatek USB3 cards, but they seem to work in spite of the warnings. I'm wondering if it contributes to some of the system stability issues I see around USB. Here's what my errors are on bootup:
     Mar 15 19:09:05 Yggrasil kernel: Unpacking initramfs...
     Mar 15 19:09:05 Yggrasil kernel: Freeing initrd memory: 139516K (ffff88005bccc000 - ffff88006450b000)
     Mar 15 19:09:05 Yggrasil kernel: DMAR: [Firmware Bug]: RMRR entry for device 4e:00.0 is broken - applying workaround
     Mar 15 19:09:05 Yggrasil kernel: DMAR: [Firmware Bug]: RMRR entry for device 54:00.0 is broken - applying workaround
     Mar 15 19:09:05 Yggrasil kernel: DMAR: dmar0: Using Queued invalidation
     Mar 15 19:09:05 Yggrasil kernel: DMAR: dmar1: Using Queued invalidation
     When I launch a VM, I get another virtlog error about not being able to map the BAR on the USB device. I don't have it handy, I'm afraid.
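To see at a glance which PCI devices the kernel flags this way, the DMAR lines quoted in the post above could be pulled out of a saved syslog with a few lines of Python. This is only a minimal sketch: the function name `broken_rmrr_devices` and the `SAMPLE` text are illustrative, and the pattern assumes the message wording stays exactly as it appears in the log excerpt.

```python
import re

# Matches kernel lines such as:
#   DMAR: [Firmware Bug]: RMRR entry for device 4e:00.0 is broken - applying workaround
# An optional 4-digit PCI domain prefix (e.g. 0000:) is tolerated as an assumption.
RMRR_RE = re.compile(
    r"DMAR: \[Firmware Bug\]: RMRR entry for device "
    r"((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) is broken"
)


def broken_rmrr_devices(syslog_text):
    """Return PCI addresses named in RMRR firmware-bug lines, in order, without repeats."""
    seen = []
    for line in syslog_text.splitlines():
        match = RMRR_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three of the lines quoted in the post, used here as sample input.
SAMPLE = """\
Mar 15 19:09:05 Yggrasil kernel: DMAR: [Firmware Bug]: RMRR entry for device 4e:00.0 is broken - applying workaround
Mar 15 19:09:05 Yggrasil kernel: DMAR: [Firmware Bug]: RMRR entry for device 54:00.0 is broken - applying workaround
Mar 15 19:09:05 Yggrasil kernel: DMAR: dmar0: Using Queued invalidation
"""

if __name__ == "__main__":
    print(broken_rmrr_devices(SAMPLE))
```

The resulting addresses can then be compared against the slots the USB cards occupy, to check whether the warnings line up with the devices that misbehave.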
  6. I have been dealing with some system stability issues since my first installation 6 months ago. They fall into a couple of different buckets:
     Bucket 1 - Windows 10 VM fails to launch
       1a) Hangs with no display from VM
       1b) Hangs with VM static display of windows logo, but no spinning dots
       1c) Goes to Windows recovery
       1d) Kill and relaunch fixes it about 50% of the time
     Bucket 2 - Windows 10 VM hangs during usage
       2a) VM becomes unresponsive when USB devices plugged/unplugged
       2b) Other VMs continue to function
     Bucket 3 - Unraid hard hang
       3a) Sometimes bucket 2 problems also hang the Unraid server.
       3b) Screen is frozen for Unraid and VMs
       3c) Server will not respond to SSH, ping, or short HW power button press
     Debugging these things is difficult:
     Bucket 1 - There isn't any debug trail that I can find. I don't see any errors associated with the failure in syslog or virtlogd. I can usually spot that the issue has occurred based on CPU usage. Normal behavior is all CPUs at 100% for a time, then one CPU at 80-100% while others are idling from 5-50%. Failure modes are all CPUs 100% or 1 CPU at 100% and all others at 0%.
     Bucket 2 - I occasionally get a "fatal error" message in VM log. Sometimes nothing. When I try to restart the VM, I usually get some execution error pop up. All CPUs are at 0%.
     Bucket 3 - I have no idea how to debug this. Once it hangs, I reboot the system and lose my logs from previous run. I was thinking that I might use a second computer to ssh and run "tail -f" on the syslogd file.
     Are there other debug messages I can get to? I read about MCE logs as a possible debug path. I'm not sure if those are already going to show up in syslogd or my remote SSH console. What low level information is exposed with unraid? Is there a HW observation point where I could get lower level debug information over and above what unraid supplies? HW diagram of system is attached. --Brad
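The "tail -f" idea for bucket 3 could also be approximated on the server itself by mirroring the log to storage that survives a reboot. The sketch below is only an illustration under assumptions: the function names are made up, and the paths in the usage comment (`/var/log/syslog` as the live log, a directory under `/boot` as persistent storage) are guesses about the layout rather than something confirmed in the post.

```python
import os
import time


def copy_new_bytes(src_path, dst_path, offset):
    """Append whatever was written to src_path after `offset` to dst_path.

    The copy is flushed and fsync'ed so it has the best chance of surviving
    a hard hang. Returns the new offset to pass to the next call.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        # Source was truncated or rotated: start again from the beginning.
        offset = 0
    if size == offset:
        return offset
    with open(src_path, "rb") as src:
        src.seek(offset)
        data = src.read()
    with open(dst_path, "ab") as dst:
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())
    return offset + len(data)


def mirror_forever(src_path, dst_path, interval=1.0):
    """Poll src_path and keep dst_path up to date until interrupted."""
    offset = 0
    while True:
        offset = copy_new_bytes(src_path, dst_path, offset)
        time.sleep(interval)


# Hypothetical usage (paths are assumptions, adjust to the actual system):
#   mirror_forever("/var/log/syslog", "/boot/logs/syslog-mirror.txt")
```

Polling once a second means the last few lines before a hang can still be lost, and frequent small writes add wear to a flash drive, so a remote machine receiving the log over the network may be the better home for the copy.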
  7. What is your use case here? Do you want a thumb drive available to two separate VMs?
  8. You could always try using an Ethernet-connected USB hub, like the SEH myUTN, to dynamically attach to the USB device. myUTN is kind of expensive for what you get, but it is very useful when trying to share a USB device with multiple VMs. One of my use cases is to use a lab probe with multiple VMs. In this case, I install the myUTN SW on each VM, then activate the probe on the VM I want. There is some lag due to network, but it works remarkably well. --Harper
  9. I don't have this handy, but I enabled it by going to Steam and selecting "enable beta". This is somewhere in the Steam app, not the SteamVR app. Go to the SteamVR page inside the Steam App, then hunt down this switch. --Harper
  10. My Vive setup has been pretty solid ever since I updated to the beta drivers. But I'm using a PCIE USB card to pass in the USB devices. There are a lot of USB endpoints in the Vive controller:
      Bus 009 Device 005: ID 0bb4:2744 HTC (High Tech Computer Corp.)
      Bus 009 Device 006: ID 0bb4:2134 HTC (High Tech Computer Corp.)
      Bus 009 Device 007: ID 0bb4:0306 HTC (High Tech Computer Corp.)
      Bus 009 Device 008: ID 0424:274d Standard Microsystems Corp.
      Bus 009 Device 009: ID 0d8c:0012 C-Media Electronics, Inc.
      Bus 009 Device 010: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      Bus 009 Device 011: ID 28de:2101
      Bus 009 Device 012: ID 28de:2101
      Bus 009 Device 013: ID 28de:2000
      Bus 009 Device 014: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      There are several problems:
      1) The wireless controllers share the same vendor/device ids
      2) The wireless controllers will disconnect when powered down or sometimes during gameplay
      The real issue is that the Vive is a mini subsystem held together by a USB hub.
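The shared-ID problem in the post above is easy to confirm mechanically. The following is a small sketch, not part of the original post: it parses `lsusb`-style lines and reports every vendor:product ID that appears more than once. The function name `shared_ids` is made up for illustration, and `VIVE_LISTING` is simply the listing quoted in the post.

```python
import re
from collections import defaultdict

# One lsusb line looks like: "Bus 009 Device 011: ID 28de:2101 <optional name>"
LSUSB_RE = re.compile(r"Bus (\d{3}) Device (\d{3}): ID ([0-9a-f]{4}):([0-9a-f]{4})")


def shared_ids(lsusb_text):
    """Map each vendor:product ID seen more than once to its (bus, device) addresses."""
    by_id = defaultdict(list)
    for bus, dev, vid, pid in LSUSB_RE.findall(lsusb_text):
        by_id[f"{vid}:{pid}"].append((int(bus), int(dev)))
    return {usb_id: addrs for usb_id, addrs in by_id.items() if len(addrs) > 1}


VIVE_LISTING = """\
Bus 009 Device 005: ID 0bb4:2744 HTC (High Tech Computer Corp.)
Bus 009 Device 006: ID 0bb4:2134 HTC (High Tech Computer Corp.)
Bus 009 Device 007: ID 0bb4:0306 HTC (High Tech Computer Corp.)
Bus 009 Device 008: ID 0424:274d Standard Microsystems Corp.
Bus 009 Device 009: ID 0d8c:0012 C-Media Electronics, Inc.
Bus 009 Device 010: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
Bus 009 Device 011: ID 28de:2101
Bus 009 Device 012: ID 28de:2101
Bus 009 Device 013: ID 28de:2000
Bus 009 Device 014: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
"""

if __name__ == "__main__":
    for usb_id, addrs in sorted(shared_ids(VIVE_LISTING).items()):
        print(usb_id, addrs)
```

Any ID reported here cannot be told apart by vendor/product alone, so passing such devices to a VM by ID is ambiguous; passing the whole PCIE USB controller, as the post describes, sidesteps that.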
  11. Last night, I was running with my two headed VR rig in some games. I had two players running through the Rec Room, doing some of the latest content. Both VMs were working fine for about 20 minutes, when I got unrecoverable errors on both VMs simultaneously. I had to reboot the system. I wasn't able to observe the GPU temperatures directly, but I felt some real heat when I opened the box. It got me to wondering how to manage the various temperature controls spread across systems. For the GPUs, I assume that the VM which controls that GPU will control the fans on that GPU only. For the CPU cooler, I'm not sure who is the controller. I have the NZXT Kraken x62 AIO cooler, which has its own controls for fans. It is attached as a USB device to the mobo. If I unplug the USB connector, the cooler defaults to "max" settings. How do I monitor the temperature of my system as a whole? How do I make sure the CPU cooler is reacting appropriately without running NZXT SW? How can I diagnose whether an error is thermal related? --Brad
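For the host side of the monitoring question, one possible starting point is the Linux hwmon sysfs interface, where sensor drivers publish temperatures in millidegrees Celsius. The sketch below assumes the relevant drivers are loaded on the host and that sensors appear under `/sys/class/hwmon`; the function names are illustrative. GPUs passed through to a VM would presumably not show up here, since the host no longer drives them.

```python
import glob
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/chip/label": degrees_C} for every temp*_input file found."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        hwmon = os.path.basename(chip_dir)
        chip = _read(os.path.join(chip_dir, "name")) or "unknown"
        stem = path[: -len("_input")]
        # Fall back to the file stem (e.g. "temp2") when no label file exists.
        label = _read(stem + "_label") or os.path.basename(stem)
        raw = _read(path)
        if raw is None:
            continue
        try:
            temps[f"{hwmon}/{chip}/{label}"] = int(raw) / 1000.0
        except ValueError:
            continue
    return temps


if __name__ == "__main__":
    for sensor, celsius in sorted(read_temps().items()):
        print(f"{sensor}: {celsius:.1f} C")
```

Logging these readings periodically to persistent storage would give a record to check after a crash, which is one way to judge whether a failure correlates with heat.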
  12. I just did a bunch of experiments with this last night. The difference between the passing case and the failing case is that in the failing case, I cannot boot to the USB device, regardless of what I do in BIOS. BIOS simply doesn't see it as an option, even over multiple power cycles. You'll notice that in both cases, the Kingston DataTraveler has the same device credentials. I purposely connected it to the MB ports that went directly to the chipset so that it would be enumerated as simply as possible. All the "basic" systems are this way--hard wired mouse+keyboard, controls to power supply and cooler, the unraid thumb drive. I have a few thoughts on what is going on...
      1) The enumeration sequence in unraid is different than the enumeration sequence in early boot, or in BIOS.
         In early boot, there is only one USB 2.0 port that is monitored so that FW patches can be applied. Early boot is controlled by the PCH (the motherboard chipset) and runs on an embedded CPU that cannot be controlled by the user.
         During BIOS execution, all the USB devices are discovered and enumerated. This is how BIOS sees the USB world.
         During OS execution, all the USB devices are rediscovered and enumerated. This is how Unraid sees the USB world.
      2) I only have visibility into the mapping during OS execution. Perhaps the BIOS execution phase experiences discovery failures that go unreported.
      3) BIOS seems to always prefer the "last" USB device. When I have three USB keyboards plugged in:
         a) Kensington direct to Motherboard
         b) Logitech A on PCIE USB card at 4e
         c) Logitech B on PCIE USB card at 54
         BIOS will only recognize Logitech B in this case. I'm not sure what I can infer from that, other than the USB stack that BIOS uses is more primitive than what an OS provides.
      Remember, USB enumeration is an OS level task. The code that discovers USB during BIOS is entirely different than the code that does it again using OS runtime. 
It is entirely possible that some topologies will confound the BIOS code. BIOS code is written by a handful of SW programmers using "generic" lab builds that have few tested configurations. There isn't a lot of feedback from end users to those BIOS writers. So I can easily believe there are "fatal" USB topologies. In any case, I hope this info is useful to the folks at Limetech. I think there is a real danger in relying on USB devices for boot. The USB topology is fussy and dynamic. It is hard to guarantee a deterministic and reliable boot recipe including USB. I think the solution is to boot from a PCIE mapped device and use the USB thumb drive as a license dongle. The USB stick as OS works brilliantly when it works, but it is a huge problem if something goes wrong here. The problems found in USB enumeration are hard to understand and harder to solve. This is not a good place for Limetech to provide technical support, but it prevents users from using Unraid, so it is a problem for the company.
  13. First things first--I have a lot of USB devices. Take a look at this giant list. In particular, have a look at what the Vive adds to the mix:
      USB Devices
      Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 001 Device 002: ID 8087:800a Intel Corp.
      Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 002 Device 002: ID 8087:8002 Intel Corp.
      Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 003 Device 002: ID 047d:2043 Kensington
      Bus 003 Device 003: ID 8087:0a2b Intel Corp.
      Bus 003 Device 004: ID 1e71:170e NZXT
      Bus 003 Device 005: ID 045b:0209 Hitachi, Ltd
      Bus 003 Device 006: ID 045b:0209 Hitachi, Ltd
      Bus 003 Device 007: ID 1b1c:1c08 Corsair
      Bus 003 Device 008: ID 0930:6544 Toshiba Corp. TransMemory-Mini / Kingston DataTraveler 2.0 Stick (2GB)
      Bus 003 Device 010: ID 045e:00cb Microsoft Corp. Basic Optical Mouse v2.0
      Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 004 Device 002: ID 045b:0210 Hitachi, Ltd
      Bus 004 Device 003: ID 045b:0210 Hitachi, Ltd
      Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 006 Device 002: ID 2833:0211
      Bus 006 Device 003: ID 2833:0211
      Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 007 Device 002: ID 045e:02e6 Microsoft Corp.
      Bus 007 Device 003: ID 1a40:0101 Terminus Technology Inc. Hub
      Bus 007 Device 004: ID 2109:2812 VIA Labs, Inc. VL812 Hub
      Bus 007 Device 005: ID 2833:0211
      Bus 007 Device 006: ID 046d:c52b Logitech, Inc. Unifying Receiver
      Bus 007 Device 007: ID 2833:2031
      Bus 007 Device 008: ID 2833:0031
      Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 008 Device 002: ID 2109:0812 VIA Labs, Inc. VL812 Hub
      Bus 008 Device 003: ID 2833:3031
      Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 009 Device 002: ID 28de:1142
      Bus 009 Device 003: ID 2109:2812 VIA Labs, Inc. VL812 Hub
      Bus 009 Device 004: ID 046d:c52b Logitech, Inc. Unifying Receiver
      Bus 009 Device 005: ID 0bb4:2744 HTC (High Tech Computer Corp.)
      Bus 009 Device 006: ID 0bb4:2134 HTC (High Tech Computer Corp.)
      Bus 009 Device 007: ID 0bb4:0306 HTC (High Tech Computer Corp.)
      Bus 009 Device 008: ID 0424:274d Standard Microsystems Corp.
      Bus 009 Device 009: ID 0d8c:0012 C-Media Electronics, Inc.
      Bus 009 Device 010: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      Bus 009 Device 011: ID 28de:2101
      Bus 009 Device 012: ID 28de:2101
      Bus 009 Device 013: ID 28de:2000
      Bus 009 Device 014: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 010 Device 002: ID 2109:0812 VIA Labs, Inc. VL812 Hub
      Notice that there are 10 USB buses, and 36 devices! If I plug the Vive into a motherboard USB slot, I get the following:
      Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 001 Device 002: ID 8087:800a Intel Corp.
      Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 002 Device 002: ID 8087:8002 Intel Corp.
      Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 003 Device 002: ID 047d:2043 Kensington
      Bus 003 Device 003: ID 8087:0a2b Intel Corp.
      Bus 003 Device 004: ID 1e71:170e NZXT
      Bus 003 Device 005: ID 045b:0209 Hitachi, Ltd
      Bus 003 Device 006: ID 045b:0209 Hitachi, Ltd
      Bus 003 Device 007: ID 1b1c:1c08 Corsair
      Bus 003 Device 008: ID 0930:6544 Toshiba Corp. TransMemory-Mini / Kingston DataTraveler 2.0 Stick (2GB)
      Bus 003 Device 009: ID 0bb4:2744 HTC (High Tech Computer Corp.)
      Bus 003 Device 010: ID 045e:00cb Microsoft Corp. Basic Optical Mouse v2.0
      Bus 003 Device 011: ID 0bb4:2134 HTC (High Tech Computer Corp.)
      Bus 003 Device 012: ID 0bb4:0306 HTC (High Tech Computer Corp.)
      Bus 003 Device 013: ID 0424:274d Standard Microsystems Corp.
      Bus 003 Device 014: ID 28de:2000
      Bus 003 Device 015: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      Bus 003 Device 016: ID 0d8c:0012 C-Media Electronics, Inc.
      Bus 003 Device 017: ID 0bb4:2c87 HTC (High Tech Computer Corp.)
      Bus 003 Device 018: ID 28de:2101
      Bus 003 Device 019: ID 28de:2101
      Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 004 Device 002: ID 045b:0210 Hitachi, Ltd
      Bus 004 Device 003: ID 045b:0210 Hitachi, Ltd
      Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 006 Device 002: ID 2833:0211
      Bus 006 Device 003: ID 2833:0211
      Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Bus 007 Device 002: ID 045e:02e6 Microsoft Corp.
      Bus 007 Device 003: ID 1a40:0101 Terminus Technology Inc. Hub
      Bus 007 Device 004: ID 2109:2812 VIA Labs, Inc. VL812 Hub
      Bus 007 Device 005: ID 2833:0211
      Bus 007 Device 006: ID 046d:c52b Logitech, Inc. Unifying Receiver
      Bus 007 Device 007: ID 2833:2031
      Bus 007 Device 008: ID 2833:0031
      Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
      Bus 008 Device 002: ID 2109:0812 VIA Labs, Inc. VL812 Hub
      Bus 008 Device 003: ID 2833:3031
      Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
      Plugging into the PCIE card causes the Vive USB devices to show up earlier in the enumeration sequence. I'm not sure why this affects things, but it appears to work. If I want to boot from a USB drive, I need to unplug the VIVE from the PCIE card. This is kind of a pain in the ass. But I don't know that my BIOS can cope with this mysterious enumeration sequence.
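Comparing two listings this long by eye is error-prone. A short script can report which device IDs landed on a different bus between the two captures. This is a sketch with made-up function names (`buses_by_id`, `moved_devices`); the two sample strings are short excerpts of the PCIE-card and motherboard listings in the post above, not the full output.

```python
import re
from collections import defaultdict

LSUSB_RE = re.compile(r"Bus (\d{3}) Device \d{3}: ID ([0-9a-f]{4}:[0-9a-f]{4})")


def buses_by_id(lsusb_text):
    """Map each vendor:product ID to the sorted list of bus numbers it appears on."""
    buses = defaultdict(set)
    for bus, usb_id in LSUSB_RE.findall(lsusb_text):
        buses[usb_id].add(int(bus))
    return {usb_id: sorted(found) for usb_id, found in buses.items()}


def moved_devices(before_text, after_text):
    """Return {id: (buses_before, buses_after)} for IDs whose bus set changed."""
    before = buses_by_id(before_text)
    after = buses_by_id(after_text)
    return {
        usb_id: (before[usb_id], after[usb_id])
        for usb_id in before
        if usb_id in after and before[usb_id] != after[usb_id]
    }


# Excerpts only: Vive plugged into the PCIE card vs. a motherboard port.
ON_PCIE_CARD = """\
Bus 003 Device 004: ID 1e71:170e NZXT
Bus 009 Device 005: ID 0bb4:2744 HTC (High Tech Computer Corp.)
Bus 009 Device 013: ID 28de:2000
"""
ON_MOTHERBOARD = """\
Bus 003 Device 004: ID 1e71:170e NZXT
Bus 003 Device 009: ID 0bb4:2744 HTC (High Tech Computer Corp.)
Bus 003 Device 014: ID 28de:2000
"""

if __name__ == "__main__":
    for usb_id, (old, new) in sorted(moved_devices(ON_PCIE_CARD, ON_MOTHERBOARD).items()):
        print(f"{usb_id}: bus {old} -> bus {new}")
```

Running it on full `lsusb` captures taken in each configuration would list exactly which endpoints move when the headset is replugged, without relying on a manual count.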
  14. Can I create VMs that use the same image with different PCI devices mapped? This might be a way for me to isolate the installation vs. the HW. If the Vanaheim image is able to run fine with all the Muspelheim dedicated HW, then the problem is probably with the installation.
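In libvirt terms, the experiment asked about above amounts to two domain definitions that point at the same disk image but carry different `<hostdev>` entries. The sketch below only generates the `<hostdev>` fragments. The function names and the VM-to-address mapping in `PROFILES` are hypothetical: the addresses 4e:00.0 and 54:00.0 are borrowed from the USB card posts purely for illustration, and PCI domain 0000 is assumed.

```python
import xml.etree.ElementTree as ET


def pci_hostdev(address):
    """Build a libvirt <hostdev> element for a PCI address written as 'bb:ss.f'.

    PCI domain 0x0000 is assumed.
    """
    bus, rest = address.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain="0x0000",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return hostdev


def hostdev_xml(addresses):
    """Return the <hostdev> fragments for a list of PCI addresses, one per line."""
    return "\n".join(ET.tostring(pci_hostdev(a), encoding="unicode") for a in addresses)


# Hypothetical assignment of devices to the two hardware sets.
PROFILES = {
    "Vanaheim-hw": ["4e:00.0"],
    "Muspelheim-hw": ["54:00.0"],
}

if __name__ == "__main__":
    for name, addresses in PROFILES.items():
        print(f"<!-- {name} -->")
        print(hostdev_xml(addresses))
```

Two definitions that share one writable disk image should never be started at the same time, since concurrent writes from two guests would corrupt the image. Windows may also re-detect drivers the first time it boots with the other hardware set.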
  15. I have two main gaming VMs that I have been using to run multi-player VR sessions, Vanaheim (Vive oriented) and Muspelheim (Oculus oriented). The two VMs are almost identical in composition and creation, except for what PCI devices are assigned for each VM. Vanaheim always boots up successfully. But Muspelheim will often (~50%) hang at either the BIOS screen or the Windows 10 Loading screen, prior to the spinning circles. On a failing run, I don't see anything in the VM log files to indicate something bad has happened. The CPU utilization shows all 8 CPUs running near 100% during initialization, then all go to zero while one stays up at near 100%. This CPU stays active forever, but the boot never advances. I've left it in this state for hours without resolution. What can I do to characterize and debug this issue?
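Since the only visible symptom is the CPU utilization pattern, one way to characterize the failure is to log per-core utilization during every boot and flag runs that settle into the stuck signature. The sketch below classifies snapshots of per-core percentages. The thresholds (90% for busy, 2% for idle), the 30-sample window, and the function names are all assumptions chosen for illustration, not measured values.

```python
def classify_snapshot(core_utils, busy=90.0, idle=2.0):
    """Classify one per-core utilization snapshot (values in percent).

    Returns "all-busy", "all-idle", "one-spinning" (one busy core, the rest
    idle), or "mixed" for anything else.
    """
    if not core_utils:
        raise ValueError("need at least one core reading")
    total = len(core_utils)
    busy_count = sum(1 for u in core_utils if u >= busy)
    idle_count = sum(1 for u in core_utils if u <= idle)
    if busy_count == total:
        return "all-busy"
    if idle_count == total:
        return "all-idle"
    if busy_count == 1 and idle_count == total - 1:
        return "one-spinning"
    return "mixed"


def looks_hung(snapshots, window=30):
    """True if the last `window` snapshots all share one non-"mixed" class.

    A brief all-busy phase is expected during a healthy boot, so only a
    pattern that persists for the whole window is treated as suspicious.
    """
    recent = [classify_snapshot(s) for s in snapshots[-window:]]
    return len(recent) == window and recent[0] != "mixed" and len(set(recent)) == 1


if __name__ == "__main__":
    stuck = [[100, 0, 0, 0, 0, 0, 0, 0]] * 30
    healthy = [[100] * 8] * 10 + [[95, 20, 10, 30, 5, 8, 12, 40]] * 20
    print(looks_hung(stuck), looks_hung(healthy))
```

Recording the timestamp at which a boot first turns "one-spinning" would show whether the hang always happens at the same point, which could help separate a guest-side problem from a device initialization problem.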