Team_Dango

Members
  • Content Count: 12
  • Joined
  • Last visited

Community Reputation: 2 Neutral

About Team_Dango
  • Rank: Newbie
  1. Thank you for the suggestion. I gave that a shot and it seemed to help. The VM did not crash for several hours, and for a moment I even thought it might have been fixed. But eventually it crashed again, same as before, much to my disappointment. After that initial success I was not able to achieve the same level of stability on subsequent reboots. I also tried adding both "video=vesafb:off" and "video=efifb:off" to the syslinux config, which is something I saw suggested in a few places (a rough example of the append line is sketched after this list). This did not help at all. If anything it was less stable. I should perhaps mention that I alre
  2. Update: I found the relevant messages in the logs when a crash happens (a sketch of how to pull these out of the syslog is after this list):
     Apr 27 21:59:49 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
     ...
     Apr 27 21:59:52 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
     Apr 27 21:59:53 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
     Apr 27 21:59:54 Tower kernel: vfio-pci 0000:02:00.0: not ready 1023ms after FLR; waiting
     Apr 27 21:59:55 Tower kernel: vfio-pci 0000:02:00.0: not ready 2047ms after FL
  3. I don't have a way to monitor the passed-through GPU's temps from within Unraid (I didn't think that was possible; please correct me if I'm wrong). Within Windows, I haven't noticed any abnormal temperatures leading up to a crash. As I mentioned, it seems to happen most reliably when the GPU is under load, but it has happened at other times as well. I really appreciate your help. I agree with your diagnostics. The root problem is whatever is causing the crash. From what I can tell, the "internal error: ..." seems like a reasonable error to get after an unclean force-st
  4. I have a Windows 10 HTPC/gaming VM set up on my Unraid server. It has a dedicated Nvidia RTX 2070 Super. It worked fine for months, but lately it has been having issues where it suddenly stops outputting a signal to the TV. It seems to mostly happen when the GPU is under load or when an application is starting up, though I have had it happen as soon as Windows starts. Sometimes after the VM has crashed the GPU fans ramp to 100% and stay there until the server is rebooted. Also, after a VM crash, the Unraid GUI reports that all CPU threads allocated to the VM are at 100% for a coupl
  5. It turns out the problem was not with the server but rather with Chrome. I was able to get noVNC to connect using Edge, which prompted me to restart Chrome and then it worked. I had tried using Edge to connect previously with no luck, but I don't think I had tried again since restoring the libvirt image. Glad it is working now, but I still do not know why adding the new VM broke things in the first place.
  6. I've had an Ubuntu VM running on my server for several months now. I connect to it using the noVNC option built into Unraid. Yesterday I tried adding a second Ubuntu VM for a new task. After starting the new VM I was unable to connect to either it or the old VM. noVNC simply reports "Failed to connect to server". I deleted the new VM but still could not connect to the old VM. I tried restoring from a saved libvirt image, which involved resetting the server altogether, but still could not connect. My only other VM is a Windows machine hooked up to a GPU and monitor; that one works fine. (Some basic libvirt checks are sketched after this list.)
  7. Thank you for the suggestion. I'll check for that next time I reboot the server.
  8. After doing some digging I believe I have solved my issue. It seems to be something of a known bug on Asus X99 motherboards. Mine is an Asus X99-WS/IPMI. I am already on the latest BIOS, so updating was not an option. The solution was to add "pcie_aspm=off" to my syslinux configuration (a sketch of the resulting append line is after this list). After a reboot I appear to no longer be getting errors. Fingers crossed it stays fixed. If anyone has anything to add, feel free to chime in. If I don't have any errors tomorrow morning I'll mark this solved.
  9. I came home to an error saying my log file was full. Turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend. The first device that is throwing errors is one of two GPUs in the system. The errors look like this (a sketch of how to inspect the device's AER status is after this list):
     Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
     Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
     Tower kernel: vfio-pci 0000:01:00.0: device [10de:1e84] error status/mask=00100000/00000000
     T
  10. I believe this is the full diagnostics. tower-diagnostics-20200914-1233.zip
  11. My server experienced an unexpected shutdown over the weekend and now one of the cache drives is acting up. Unfortunately I was not home, so I don't know for sure what happened. The server is on a UPS and there was no power outage as far as I can tell. When I booted the server back up, the first thing I noticed was all of my dockers and VMs were missing. This freaked me out, then I noticed that my first cache drive was listed as "Unmountable: No file system". I run two 250GB SSD's in BTRFS RAID 1 as a cache. The drive looks fine when the array is stopped, it lists its file system as BTRFS, it