Team_Dango

Members
  • Posts

    14

Posts posted by Team_Dango

  1. Just installed the Baikal docker and I am hoping for advice on getting it correctly routed through NPM. The official baikal installation docs have an example Nginx config and mention needing to keep the "Specific" directory from being accessible. Is this handled by the docker container (it appears to have nginx internally), or do I need to add additional configuration in NPM? Thank you!

  2. I have an instance of Postgres11 set up and working with one application. Now I am looking to install a second application (Joplin) that also needs its own PostgreSQL database. What is the best way to do this? Having two apps share one database feels like a bad idea. Can I create a second database within one instance of Postgres11, or do I need to install a second instance?

    Thanks!
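For reference, a single PostgreSQL instance can hold several independent databases, each owned by its own role. Below is a minimal sketch of the statements involved, not a definitive setup. The helper `second_database_sql` and the names `joplin`, `joplin_user` and the password are hypothetical placeholders, and how the statements get run (for example through `psql` inside the container) is left out.

```python
import re

# Only plain lowercase identifiers are accepted so that no quoting is needed.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def second_database_sql(db_name, role_name, password):
    """Build the SQL for a separate database owned by a dedicated login role."""
    for ident in (db_name, role_name):
        if not _IDENT.match(ident):
            raise ValueError(f"unsupported identifier: {ident!r}")
    # Single quotes inside a SQL string literal are escaped by doubling them.
    escaped = password.replace("'", "''")
    return [
        f"CREATE ROLE {role_name} WITH LOGIN PASSWORD '{escaped}';",
        f"CREATE DATABASE {db_name} OWNER {role_name};",
    ]


# Hypothetical names, for illustration only.
for statement in second_database_sql("joplin", "joplin_user", "change-me"):
    print(statement)
```

With this layout each application connects with its own role to its own database, so the two applications do not share tables even though they share one server process.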

  3. On 4/28/2021 at 7:00 PM, joecool169 said:

    Have you tried "video=efifb:off" in the syslinux config?

     

    Thank you for the suggestion. I gave that a shot and it seemed to help. The VM did not crash for several hours. I even for a moment thought it may have been fixed. But eventually it crashed again same as before, much to my disappointment. After that initial success I was not able to achieve the same level of stability on subsequent reboots.

     

    I also tried adding both "video=vesafb:off" and "video=efifb:off" to the syslinux config, which is something I saw suggested a few places. This did not help at all. If anything it was less stable.

     

    I should perhaps mention that I already have one extra parameter in my syslinux config: "pcie_aspm=off" which solved an error I started getting after adding the LSI card. I do not know if it could somehow have anything to do with the other issue.

     

    I have by now also tried downgrading my motherboard BIOS. The latest is 4001 (what I was using) so I tried the two previous releases, 3901 and 3803. Neither caused any change in behavior that I could see. After those tests I reset back to 4001.

     

    I tried a fresh install of a new Windows 10 VM with the same GPU. This proved difficult as I was getting more errors.

    Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
    Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
    Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: No more image in the PCI ROM
    Apr 30 15:02:53 Tower kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x19@0x168
    Apr 30 15:02:53 Tower kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x1e@0x190

    Device 01:00.0 is the GPU and 04:00.0 is the USB controller passed through to the VM to which I connect the keyboard and mouse.

     

    The errors meant the keyboard and mouse were not recognized inside the VM, which made installing Windows impossible. I was able to work around this by passing through the keyboard and mouse directly as USB devices. This allowed me to get through the Windows installation, and after a few reboots I was able to pass through the USB controller without error. However, after installing the graphics drivers and the Heaven benchmark, the VM again crashed as soon as the benchmark started.

     

    I am again very much out of ideas. As always, any help would be very much appreciated.

     

    Thank you.

  4. Update:

    I found the relevant messages in the logs when a crash happens:

     

    Apr 27 21:59:49 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    ...
    Apr 27 21:59:52 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 21:59:53 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
    Apr 27 21:59:54 Tower kernel: vfio-pci 0000:02:00.0: not ready 1023ms after FLR; waiting
    Apr 27 21:59:55 Tower kernel: vfio-pci 0000:02:00.0: not ready 2047ms after FLR; waiting
    Apr 27 21:59:57 Tower kernel: vfio-pci 0000:02:00.0: not ready 4095ms after FLR; waiting
    Apr 27 22:00:01 Tower kernel: vfio-pci 0000:02:00.0: not ready 8191ms after FLR; waiting
    Apr 27 22:00:10 Tower kernel: vfio-pci 0000:02:00.0: not ready 16383ms after FLR; waiting
    Apr 27 22:00:27 Tower kernel: vfio-pci 0000:02:00.0: not ready 32767ms after FLR; waiting
    Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: not ready 65535ms after FLR; giving up
    Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: can't change power state from D0 to D3hot (config space inaccessible)
    Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: can't change power state from D0 to D3hot (config space inaccessible)
    Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:01:25 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
    Apr 27 22:01:26 Tower kernel: vfio-pci 0000:02:00.0: not ready 1023ms after FLR; waiting
    Apr 27 22:01:27 Tower kernel: vfio-pci 0000:02:00.0: not ready 2047ms after FLR; waiting
    Apr 27 22:01:29 Tower kernel: vfio-pci 0000:02:00.0: not ready 4095ms after FLR; waiting
    Apr 27 22:01:34 Tower kernel: vfio-pci 0000:02:00.0: not ready 8191ms after FLR; waiting
    Apr 27 22:01:42 Tower kernel: vfio-pci 0000:02:00.0: not ready 16383ms after FLR; waiting
    Apr 27 22:01:59 Tower kernel: vfio-pci 0000:02:00.0: not ready 32767ms after FLR; waiting
    Apr 27 22:02:35 Tower kernel: vfio-pci 0000:02:00.0: not ready 65535ms after FLR; giving up
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
    Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
    ...
    Apr 27 22:02:41 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs

     

    The most useful message seems to be "vfio_bar_restore: reset recovery - restoring BARs" (note: this message is repeated many times; I clipped most of the repeats for readability). Googling it returns a decent number of results, though from what I can tell they mostly involve AMD cards. Most of the solutions appear to involve the motherboard.
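Because the same message repeats so often, one way to read a log like this is to collapse it into per-device counts. The following is a minimal sketch, assuming log lines shaped like the excerpt above; the function name `summarize` and the regular expression are illustrative, not part of any Unraid tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# "Apr 27 21:59:49 Tower kernel: vfio-pci 0000:02:00.0: <message>"
LINE_RE = re.compile(
    r"kernel: (?P<driver>[\w-]+) "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<message>.+)$"
)


def summarize(log_text):
    """Count how often each (device, message) pair occurs in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            counts[(match["device"], match["message"])] += 1
    return counts


# A few lines copied from the excerpt above.
sample = """\
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:25 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
"""

for (device, message), n in summarize(sample).most_common():
    print(f"{n:4d}x {device} {message}")
```

Run against the full syslog, this kind of summary makes it easier to see which PCI function logs the resets first and how often, without scrolling through the repeats.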

     

    I tried changing which slot the 2070S was in. Originally I had it in the primary (#2) slot with my M2000 (for Plex transcoding) in the #4 slot. I also have an LSI card in the #5 slot. I tried having the M2000 in the primary slot and the 2070S in the #3 slot. (This is how the MB manual recommends arranging three PCIe devices; I originally had the M2000 a slot lower for better airflow to the 2070S.) Unfortunately this did not fix the problem. I thought it had, because the system was stable for a few hours, but it eventually did crash. (The error log above is from that crash, which is why the GPU is now device 02:00.0.)

     

    I have not yet tried downgrading to an older MB BIOS. That will be my next step, unless there are any other suggestions.

     

    If anyone has had experience with any of the above errors with Nvidia cards, please let me know.

     

    Thank you

     

  5. 21 hours ago, joecool169 said:

    Are you monitoring temps when the problem occurs?

     

    I don't have a way to monitor the passed-through GPU's temps from within Unraid (I didn't think that was possible, please correct me if I'm wrong). Within Windows, I haven't noticed any abnormal temperatures leading up to a crash. As I mentioned, it seems to happen most reliably when the GPU is under load, but it has happened other times as well.

     

     

    21 hours ago, joecool169 said:

    Keep in mind that I am by no means an expert, but I think that Windows is crashing and then the gpu has not reset properly due to the crash. I would be trying to troubleshoot the cause of the crash, gpu temp, cpu temp, windows logs.

     

    There are a couple forum threads about switching slots and downgrading mobo bios for similar problems. I googled "internal error: Unknown PCI header type '127' for device" and then sorted to show only results from unraid.net

     

    I really appreciate your help. I agree with your diagnostics. The root problem is whatever is causing the crash. From what I can tell the "internal error: ..." seems like a reasonable error to get after an unclean force-stop of the VM. I included that info because I have been having a hard time finding any other error messages or other indicators in the logs of either Unraid or the VM to help diagnose the issue. I will try to recreate the crash tonight and check the logs again to see if I missed anything.

     

    Assuming no other suggestions, I'm going to try fully reinstalling Windows on the VM. I thought for sure it was a hardware issue with the 2070S until I got the same error on the 3070, so now I am guessing it is a software issue. I suppose it could still be a hardware error at the motherboard level. For the record I am running an Asus X99 WS/IPMI LGA2011 with an Intel Xeon E5-2680 v3. The GPU is in the second slot, which appears to be the primary slot according to the MB manual. I am on the latest BIOS AFAIK. If a Windows reinstall doesn't change anything I'll try an older BIOS, though I would be surprised if that was the issue since the problem only started happening recently.

  6. I have a Windows 10 HTPC/gaming VM set up on my Unraid server. It has a dedicated Nvidia RTX 2070 Super. It worked fine for months, but lately it has been having issues where it suddenly stops outputting a signal to the TV. It seems to mostly happen when the GPU is under load or when an application is starting up, though I have had it happen as soon as Windows starts.

     

    Sometimes after the VM has crashed the GPU fans ramp to 100% and stay there until the server is rebooted. Also, after a VM crash, the Unraid GUI reports that all CPU threads allocated to the VM are at 100% for a couple of seconds, then just the first thread sits at 100% until the VM is stopped. The VM will not stop cleanly; it has to be force stopped. After this it cannot be started again until the entire server is rebooted. Trying to start the VM before rebooting gives this error (the device ID is the GPU):

    internal error: Unknown PCI header type '127' for device '0000:01:00.0'

     

    The VM issue does not appear to affect any other component running on the server. All docker applications and my other VMs continue running as normal after the Windows VM has crashed.

     

    Some searching indicated this might be a VBIOS issue. I was originally using a VBIOS from techpowerup that was modified as explained in SpaceInvaderOne's 2017 GPU passthrough video. I tried the userscript technique from SpaceInvaderOne's newer video to dump my own VBIOS. Using this VBIOS file did not fix the issue.

     

    Finally, I tried swapping out the 2070 for the 3070 I have in my desktop machine. I used SpaceInvaderOne's script to dump the VBIOS and did a clean GPU driver reinstall on the VM. At first it seemed like the issue had been resolved, but after a few minutes running the Heaven benchmark the VM crashed exactly as it had with the 2070.

     

    I am now out of ideas. Any advice would be much appreciated. Thank you.

  7. It turns out the problem was not with the server but rather with Chrome. I was able to get noVNC to connect using Edge, which prompted me to restart Chrome and then it worked. I had tried using Edge to connect previously with no luck, but I don't think I had tried again since restoring the libvirt image. Glad it is working now, but I still do not know why adding the new VM broke things in the first place. 

  8. I've had an Ubuntu VM running on my server for several months now. I connect to it using the noVNC option built into Unraid. Yesterday I tried adding a second Ubuntu VM for a new task. After starting the new VM I was unable to connect to either it or the old VM. noVNC simply reports "Failed to connect to server". I deleted the new VM but still could not connect to the old VM. I tried restoring from a saved libvirt image, which involved resetting the server altogether, but still could not connect. My only other VM is a Windows machine hooked up to a GPU and monitor, and it works fine.

     

    I get the following error in the Unraid logs each time I attempt to connect to the VM via noVNC:

     

    Tower nginx: 2021/03/15 09:30:55 [error] 10278#10278: *1827 recv() failed (104: Connection reset by peer) while reading upstream, client: 10.10.20.164, server: , request: "GET //wsproxy/5700/ HTTP/1.1", upstream: "http://127.0.0.1:5700/", host: "10.10.20.2"

     

    Any advice on how to fix this would be much appreciated.

     

    I am running Unraid 6.9

     

    Thank you

  9. After doing some digging I believe I have solved my issue. It seems to be a somewhat known bug on Asus X99 motherboards. Mine is an Asus X99-WS/IPMI. I am on the latest BIOS so updating was not an option.

     

    The solution was to add "pcie_aspm=off" to my syslinux configuration. After a reboot I appear to no longer be getting errors. Fingers crossed it stays fixed. 
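As an illustration of the change, here is a minimal sketch that appends a kernel parameter to the `append` line of one boot entry in syslinux configuration text. The helper `add_kernel_param`, the sample file contents and the label name "Unraid OS" are assumptions for the example; the real syslinux.cfg on a given system may be laid out differently.

```python
def add_kernel_param(cfg_text, param, label="Unraid OS"):
    """Return cfg_text with `param` added to the `append` line under `label`."""
    out = []
    in_label = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("label "):
            # Track whether we are inside the boot entry we want to change.
            in_label = stripped[len("label "):].strip() == label
        elif in_label and stripped.startswith("append "):
            # Skip the edit if the parameter is already present.
            if param not in stripped.split():
                line = f"{line.rstrip()} {param}"
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed layout of a syslinux.cfg; the real file may differ.
sample_cfg = """\
default menu.c32
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
"""

print(add_kernel_param(sample_cfg, "pcie_aspm=off"))
```

Only the entry whose label matches is touched, and running the helper twice does not add the parameter a second time. The new parameter takes effect on the next boot of that entry.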

     

    If anyone has anything to add feel free to chime in. If I don't have any errors tomorrow morning I'll mark this solved.

  10. I came home to an error saying my log file was full. Turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend.

     

    The first device that is throwing errors is one of two GPUs in the system. The errors look like:

    Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
    Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
    Tower kernel: vfio-pci 0000:01:00.0:   device [10de:1e84] error status/mask=00100000/00000000
    Tower kernel: vfio-pci 0000:01:00.0:    [20] UnsupReq               (First)
    Tower kernel: vfio-pci 0000:01:00.0: AER:   TLP Header: 40000001 00000003 000be7c0 f7f7f7f7
    Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful

     

    The second device that is throwing errors is my LSI card. This is new. It is an LSI 9207-8i purchased from The Art of the Server on eBay. It is in a PCIe slot that was previously occupied by an NVMe SSD in a PCIe adapter. Those errors look like:

    Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
    Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
    Tower kernel: mpt3sas 0000:04:00.0:   device [1000:0087] error status/mask=00000001/00002000
    Tower kernel: mpt3sas 0000:04:00.0:    [ 0] RxErr 

     

    Despite these errors, both devices are acting normally. The GPU is passed through to a VM and behaves as expected even under full load. The LSI card also appears fully functional. I went through an entire parity check which passed with zero errors. I am currently running through a drive rebuild (not because of drive failure, just swapping it out) and would rather not have to abort, but I also do not know how severe these errors are and if I need to take immediate action. 
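To get a sense of how the two kinds of errors compare, the "PCIe Bus Error" lines can be tallied by device and severity. In AER terms, corrected errors are recovered by the hardware itself, while uncorrected non-fatal errors mean a transaction failed although the link stayed up. The sketch below assumes log lines shaped like the two excerpts above; `tally_aer` and the regular expression are illustrative names, not an existing tool.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line of an AER report, e.g.
# "Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)"
AER_RE = re.compile(
    r"kernel: (?P<driver>[\w-]+) (?P<device>[0-9a-f:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def tally_aer(log_text):
    """Count AER reports per (device, severity, layer)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_RE.search(line)
        if match:
            counts[(match["device"], match["severity"], match["layer"])] += 1
    return counts


# Lines copied from the two excerpts above.
sample = """\
Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Tower kernel: vfio-pci 0000:01:00.0:   device [10de:1e84] error status/mask=00100000/00000000
Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Tower kernel: mpt3sas 0000:04:00.0:    [ 0] RxErr
"""

for (device, severity, layer), n in sorted(tally_aer(sample).items()):
    print(f"{device}  {severity:<24} {layer:<18} {n}")
```

A tally like this does not say whether the errors are harmful, but it shows at a glance which device produces which class of error and how quickly the counts grow.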

     

    I am attaching my full diagnostics dump.

     

    Any advice would be much appreciated.

     

    Thank you.

    tower-diagnostics-20210310-1757.zip

  11. My server experienced an unexpected shutdown over the weekend and now one of the cache drives is acting up. Unfortunately I was not home, so I don't know for sure what happened. The server is on a UPS and there was no power outage as far as I can tell. When I booted the server back up, the first thing I noticed was that all of my dockers and VMs were missing. This freaked me out. Then I noticed that my first cache drive was listed as "Unmountable: No file system". I run two 250GB SSDs in BTRFS RAID 1 as a cache. The drive looks fine when the array is stopped: it lists its file system as BTRFS, it can be selected as a cache drive, and Unraid says "Configuration valid". The array even starts fine, but once started the drive shows the error. I ran a SMART self test and everything looks alright, though I'm not really sure what I'd be looking for, so I'm attaching the full text.

     

    My question is, where do I go from here? Is it safe to rebuild the BTRFS array using the second drive, or do I need to replace the bad drive altogether? How would I go about rebuilding the array safely? This isn't even getting to the issue of the missing dockers, I'm really hoping that will all come back once I fix the cache. If not I do have backups, but it'd be a pain. Finally, any ideas why the server crashed in the first place? I'm using desktop hardware mostly, an Asrock Z75 Pro3 motherboard, i7 2600K, and 16GB of DDR3 memory at 1600MHz. The RAM is not ECC, so maybe a bit flipped and broke everything? Any advice would be much appreciated.

     

    Thanks so much,
    TD

     

     

    Annotation 2020-09-14 110238.png

    Samsung_SSD_840_EVO_250GB_S1DBNSBF440459L-20200913-2132.txt