November 2, 2025

Hi all,

I looked around for why this might happen and for possible fixes, but I can't figure out where to start with this ... I need the collective knowledge of you fine people.

I've got a GTX 1660, using the Nvidia drivers and a certain patch applied at first array start only. (The setup worked flawlessly for years; the issue appeared about 6-8 months ago.)

Everything works fine for a good while, then randomly my GPU is not found.
It disappears from my transcode GPU choices in Emby.
GPU Statistics throws "Vendor command returned unparseable data."
nvidia-smi returns "No devices were found"

Under Tools > System Devices I still get:

IOMMU group 2:
[8086:1901] 00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
[10de:2184] 01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660] (rev a1)
[10de:1aeb] 01:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)
[10de:1aec] 01:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1)
    Bus 003 Device 001 Port 3-0 ID 1d6b:0002 Linux Foundation 2.0 root hub
    Bus 004 Device 001 Port 4-0 ID 1d6b:0003 Linux Foundation 3.0 root hub
[10de:1aed] 01:00.3 Serial bus controller: NVIDIA Corporation TU116 USB Type-C UCSI Controller (rev a1)

After a simple software reboot, everything comes back and works fine.

I attached anonymized diagnostics: untheran-diagnostics-20251101-2135.zip
November 2, 2025

I am having the exact same issue, but mine is happening more frequently, about once a week. I am getting all of the same symptoms. The odd thing is that it only happens with my 3060, which is used for Ollama; my P2000, used for encoding and other random tasks, never does it. This started happening about a month ago. Both video cards are always still visible in devices, but I am unable to use the 3060 when this happens.

I did follow an old SpaceInvader video for the very old AMD bug, and the script did work for my card: it came back after the sleep cycle was completed. Not sure if that helps or not.

raidered-diagnostics-20251101-2052.zip
November 3, 2025 (Author)

OK, so fun fact: ChatGPT (v5, thinking mode) is REALLY good at extracting and analyzing the diagnostics zip file.

Prompt (with the diagnostics.zip uploaded in the same prompt):

    In Unraid OS, I have an Nvidia GTX 1660 GPU.
    All works fine for a good while, then randomly, my gpu is not found.
    When I reboot, everything comes back all right and working for a while.
    I rebooted this morning, and right now the GPU is not found.
    Analyze all the files and logs in this compressed zip archive.
    Extract the file's content, analyze all the logs and configurations, report back with an explanation of this behavior and provide a step by step solution.

Here is what it reported:

What’s happening (from your logs)

Driver loads at boot:
At boot the NVIDIA kernel modules load cleanly and bind to the 1660 at PCI 0000:01:00.0:
    nvidia: loading ...
    NVRM: loading NVIDIA UNIX x86_64 Kernel Module 580.95.05
    DRM attaches: [drm] Initialized nvidia-drm ... for 0000:01:00.0

A few minutes later the GPU stops initializing
Starting at 10:40:20 (same boot), the kernel begins spamming:
    NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1623)
repeated many times as
    RmInitAdapter failed! (0x62:0x40:2661)
These lines continue for minutes. When this starts, nvidia-smi shows “No devices were found” (your system/nvidia-smi.txt). The gpustat JSON also reports vendor command errors and N/A for clocks/temps.

You have three display adapters in this system
    ASPEED BMC (IPMI): AST driver (BMC console)
    Intel iGPU: i915 driver, device 00:02.0
    NVIDIA GTX 1660 (TU116): device 01:00.0
Motherboard: Supermicro X11SSH-LN4F, BIOS 2.0b (2017-07-27).

PCIe power-saving is enabled
Your bridges show ASPM L1/L0s enabled (system/aspm-status.txt). ACPI reports AER unsupported (“platform does not support [AER]”), so the OS can’t log link-layer errors even if the link is flaking.

Timing correlation
Docker/containers (including Tdarr, which can use NVENC) start around 10:36–10:37.
A few minutes later (~10:40) the NVIDIA driver begins failing to (re)initialize the adapter. This pattern is classic for a card that’s slipped into a bad PCIe power state or is being poked while the driver isn’t “persistent.”

Diagnosis in one line:
Your GTX 1660 is on an older Supermicro platform (2017 BIOS) with ASPM power-saving enabled and two other GPUs present (ASPEED + Intel iGPU). Under load/idle transitions the GPU drops into a PCIe power state the driver can’t recover from; when a container touches it, the driver tries to (re)initialize and you get RmInitAdapter failed storms. A reboot resets the link, so it “comes back”, until the next time it falls into that state.

Why this causes the exact symptoms you see

ASPM + older platform firmware: Aggressive PCIe link power management (L1/L0s) on some older C236/X11 boards is notorious for GPUs “falling into a ditch.” Without AER, the kernel can’t even record proper link errors, so all you see is NVIDIA’s RM failing.

Multiple GPUs: With ASPEED + Intel iGPU + NVIDIA in the same machine, BAR allocation and power policies get more complex on old firmware. Even if initial BAR assignment is OK (it is in your log), subsequent power-state transitions can leave the device non-responsive to RM (NVIDIA’s resource manager), which matches your RmInitAdapter failed! loops.

No persistence mode: If nothing keeps the device initialized between client uses, transitions in and out of low-power + module refcounts can amplify the chance of landing in that bad state.

Containers hitting the GPU while it’s not “ready”: Tdarr/Plex/etc. may probe NVENC very early after boot; if the card hasn’t fully settled (or slipped into ASPM L1.2), that’s when the first RM failures typically appear.

Fix plan (do in order)
You don’t have to do every step, but the more of these you apply, the more bullet-proof it gets. I’ve split them into BIOS, Unraid boot parameters, and service ordering.
None of these change your data or array.

1) BIOS / firmware settings (Supermicro X11SSH-LN4F)

Update BIOS to the latest available for X11SSH-LN4F.
    You are on 2.0b (2017-07-27); several later revs improve PCIe compatibility and resource handling.
Disable ASPM in BIOS (PCIe power management).
    Set PCIe ASPM/Power Saving to Disabled (global and per-slot if offered).
Force slot speed for the x16 slot hosting the 1660 to Gen3 (not Auto).
    (Your logs show 8.0 GT/s caps on bridges; forcing speed avoids training oddities.)
Enable “Above 4G Decoding” (a.k.a. 64-bit BAR decoding).
    Even though initial BARs look fine, this reduces pressure with 3 GPUs + NVMe/SATA HBAs.
Primary display: Set Onboard/BMC as primary; if you don’t need Intel QuickSync, disable the Intel iGPU entirely.
    If you do use QuickSync in containers, keep iGPU on, but still set BMC as primary.
(If you can only do a couple things: update BIOS, disable ASPM, force Gen3.)

Here is where I might practice caution with what it is suggesting:

2) Unraid boot (Syslinux) kernel parameters

Edit: Main → Flash → Syslinux Configuration (apply to the boot stanza you use).
Add to the append line:
    pcie_aspm=off
Optional but sometimes helpful on X11 when not using Intel iGPU:
To keep the iGPU from binding at all (if you don’t need it):
    modprobe.blacklist=i915
Only do this if you are not using QuickSync.
Blacklisting i915 removes Intel iGPU from the OS.
If you see framebuffer clashes (rare here), you can also add:
    video=efifb:off
Apply and reboot.

3) Make the NVIDIA device “persistent” and start GPU containers after it’s ready

Enable persistence mode at boot
Add these lines near the bottom of /boot/config/go (before the last line is fine):

    # Ensure NVIDIA driver is initialized & kept warm between clients
    modprobe nvidia
    /usr/bin/nvidia-smi -pm 1
    # Optional warm-up query (fails fast if device absent)
    /usr/bin/nvidia-smi -L || true

This keeps the device initialized so it doesn’t slip into a bad low-power state between workloads.

Stagger GPU containers (Tdarr, Plex/Emby/Jellyfin, etc.) until the device node exists.
If you use the User Scripts plugin, create one named “Start GPU containers” with:

    #!/bin/bash
    # Wait up to ~2 minutes for NVIDIA device to appear
    for i in {1..60}; do
      if [ -e /dev/nvidia0 ]; then
        break
      fi
      sleep 2
    done
    # Start your GPU-using containers (edit list)
    docker start tdarr 2>/dev/null || true
    # docker start plex 2>/dev/null || true
    # docker start emby 2>/dev/null || true

Set it to At Startup of Array. Also set those containers’ own setting “Start when array starts” = No, so your script controls the timing.

Double-check container GPU flags
In each GPU container template:
    NVIDIA_VISIBLE_DEVICES = your GPU UUID or all
    NVIDIA_DRIVER_CAPABILITIES = compute,video,utility (for NVENC)
    Use the NVIDIA runtime (Unraid’s NVIDIA Driver plugin does this automatically when set).

4) (If not using QuickSync) keep Intel iGPU out of the way

If you don’t need QuickSync:
Either disable Intel iGPU in BIOS or
Create /boot/config/modprobe.d/blacklist-i915.conf with:
    blacklist i915
(This keeps i915 from binding, reducing multi-GPU interaction.
Reboot to apply.)

5) Physical basics (worth a quick check)

Ensure the GTX 1660 has all PCIe power plugs fully seated.
If using a riser, test directly in the board slot.
Try another slot if available (on X11 it may share lanes with other controllers).

How we know this is the root cause

Kernel shows successful bind at boot and repeated RmInitAdapter failed a few minutes later (10:40:20+). That pattern is not a driver install issue; it’s a device/link power or initialization issue.
nvidia-smi has “No devices were found” at the time your diagnostics were captured, which is exactly what RM-init failure yields.
ASPM is enabled and AER is not available on this platform, which hides link-error breadcrumbs and is a well-known recipe for flaky GPU links on older server boards.
Three GPUs present (BMC + Intel + NVIDIA) on an old BIOS (2017) increases the chance of a device getting power-gated or left in D3 during runtime.

After you make the changes

Reboot.
On the Unraid console or SSH:
    nvidia-smi
    watch -n1 nvidia-smi
You should see the 1660 listed and it should remain visible indefinitely.
Start a GPU workload (Tdarr/Emby NVENC) and watch for any new NVRM lines in Tools → System Log. There shouldn’t be any RmInitAdapter failed after these fixes.
If anything still misbehaves, the next most impactful single change is disabling the Intel iGPU (if you don’t use it) and ensuring ASPM is truly off (both BIOS and kernel pcie_aspm=off).

TL;DR
Your GTX 1660 is fine; it’s the platform power/PCIe policy. Update the X11SSH BIOS, disable ASPM, set the GPU slot to Gen3, enable Above-4G, keep the card “awake” with nvidia-smi -pm 1, and start GPU containers only after /dev/nvidia0 exists. This stops the RmInitAdapter failed loops and the random “GPU not found” episodes without needing to reboot.

So, I'll try the BIOS fixes first; they should fix the PCIe power management issues.
Any thoughts on the rest of the suggestions?
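
For anyone following along, the "watch the log for RmInitAdapter failed" and "make sure pcie_aspm=off is really set" checks from the report could be scripted roughly as below. This is only a minimal sketch and not part of the ChatGPT report: the function names are made up for illustration, and /var/log/syslog as the live log path is an assumption about a stock Unraid install.

```shell
#!/bin/bash
# Sketch only: helper names and default paths are assumptions, not from the report.

# Print how many lines in a log file contain "RmInitAdapter failed".
# $1 = log file (default: /var/log/syslog, assumed to be the live Unraid syslog)
count_rm_failures() {
    local logfile="${1:-/var/log/syslog}"
    if [ ! -r "$logfile" ]; then
        echo 0
        return 1
    fi
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    grep -c 'RmInitAdapter failed' "$logfile" || true
}

# Succeed if the running kernel was booted with pcie_aspm=off.
# $1 = kernel command line file (default: /proc/cmdline)
aspm_disabled_on_cmdline() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -q 'pcie_aspm=off' "$cmdline_file"
}

# Print a short status summary; meant to be run by hand after a reboot.
# $1 = optional log file, passed through to count_rm_failures
gpu_health_report() {
    echo "RmInitAdapter failures in log: $(count_rm_failures "$@")"
    if aspm_disabled_on_cmdline; then
        echo "pcie_aspm=off: present on kernel command line"
    else
        echo "pcie_aspm=off: not present on kernel command line"
    fi
}
```

Calling gpu_health_report after a few hours of uptime gives a quick view of whether the failures have come back. It only checks the kernel parameter, so an ASPM setting changed in the BIOS alone will not show up here.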
November 10, 2025 (Author, Solution)

So, I updated my BIOS and disabled any and all power management settings ... Rock solid for the last week.
Note that I did nothing past my "Here is where I might practice caution with what it is suggesting" comment in the middle of the instructions.

Edit: the BIOS changes alone didn't fix it for good in the end (they made it last longer, though). I had to edit my /boot/config/go file and add these lines at the end (or wherever makes the most sense for you):

    # Ensure NVIDIA driver is initialized & kept warm between clients
    modprobe nvidia
    /usr/bin/nvidia-smi -pm 1
    # Optional warm-up query (fails fast if device absent)
    /usr/bin/nvidia-smi -L || true

Edited December 29, 2025 by Altheran: Solution was incomplete
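
A slightly more defensive take on those go-file lines might look like the sketch below: it waits until nvidia-smi actually lists a GPU before turning on persistence mode, and says so when it gives up. This is an untested-on-Unraid sketch, not what was actually run in this thread. The names NVIDIA_SMI, wait_for_gpu and enable_persistence are invented for illustration, and the retry counts are arbitrary; only nvidia-smi -L and nvidia-smi -pm 1 come from the lines above.

```shell
#!/bin/bash
# Sketch only: NVIDIA_SMI, wait_for_gpu and enable_persistence are
# illustrative names, not part of Unraid or the NVIDIA driver plugin.

NVIDIA_SMI="${NVIDIA_SMI:-/usr/bin/nvidia-smi}"

# Poll "nvidia-smi -L" until it lists a GPU or the attempts run out.
# $1 = attempts (default 30), $2 = seconds between attempts (default 2)
wait_for_gpu() {
    local attempts="${1:-30}" delay="${2:-2}" i=0
    while [ "$i" -lt "$attempts" ]; do
        # A visible card is listed as a line starting with "GPU <index>:"
        if "$NVIDIA_SMI" -L 2>/dev/null | grep -q '^GPU '; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Turn on persistence mode only once the GPU is visible, and say what happened.
enable_persistence() {
    if wait_for_gpu "$@"; then
        "$NVIDIA_SMI" -pm 1 >/dev/null && echo "persistence mode enabled"
    else
        echo "GPU not visible to nvidia-smi; persistence mode not set" >&2
        return 1
    fi
}
```

In a go file this would go after modprobe nvidia, with the functions pasted in and then a call such as enable_persistence 30 2. With those values it can block for up to about a minute when the card is missing, so running it in the background (enable_persistence 30 2 &) may be preferable.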
December 28, 2025

I followed the entire guide to make it work, but unfortunately it doesn't work. Can you help me?
December 29, 2025 (Author)

On 12/28/2025 at 11:55 AM, rickjb29 said:
    I followed the entire guide to make it work, but unfortunately it doesn't work. Can you help me?

Do you lose your GPU some time after starting your server, and does it come back after a reboot?
If you have the same symptoms as me: I edited my solution, as I also had to tweak my go file.
Also, upload your diagnostics zip file to ChatGPT and see if it thinks you have the same issue as me.