JesterEE

Everything posted by JesterEE

  1. With the Unraid 6.12 series running Linux kernel 6.1 natively, I decided to finally revisit this topic after my update to 6.12.8. After the OS update, I checked the lspci output to see whether the OS was assigning the correct memory size for my ASUS KO GeForce RTX 3070 V2 OC Edition 8GB. I was pleasantly surprised that, without my doing anything, it was sizing the resource space to the maximum video memory my card can provide (i.e. 8GB) (see full lspci output at bottom of this post).

     # lspci -vvvs 0c:00.0
     0c:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Lite Hash Rate] (rev a1) (prog-if 00 [VGA controller])
       Subsystem: ASUSTeK Computer Inc. GA104 [GeForce RTX 3070 Lite Hash Rate]
       Capabilities: [bb0 v1] Physical Resizable BAR
         BAR 0: current size: 16MB, supported: 16MB
         BAR 1: current size: 8GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB
         BAR 3: current size: 32MB, supported: 32MB

     Note the BAR 1 size is set to 8GB. Before the kernel update (and with the kernel patch referenced in the earlier pages of this thread), it was set to a default of 256MB. All is looking good so far!

     I followed these baseline steps:

     ✅ Host BIOS UEFI w/o CSM
     ✅ Host BIOS Enable ReBAR support
     ✅ Host BIOS Enable 4G Decoding
     ⬛ Enable & Boot Custom Kernel syslinux configuration (near beginning of this thread): not needed anymore

     Before modifying my Windows 10 Pro VM configuration, I booted up the VM to see if anything was needed for the Guest OS to recognize ReBAR. I did make sure my guest BIOS was set to OVMF TPM (regular OVMF gave the same result as described below, though). Windows booted without issue and I ran both GPU-Z 2.57.0 and the NVIDIA Control Panel to check ReBAR support.

     This is what I saw: GPU-Z reported ReBAR as Enabled, but when I went into the Advanced settings, 4G Decode was shown as Disabled in BIOS.
     NVIDIA Control Panel shows ReBAR as Enabled and shows it correctly allocating 8GB of dedicated video memory with an additional 16GB of shared memory, for 24GB total. If I close the apps and relaunch them, GPU-Z reports differently, showing ReBAR as Disabled with the same advanced details (NVIDIA Control Panel keeps reporting ReBAR as Enabled with the same details).

     I shut down the VM and tried the XML edits noted in this thread and in other online spaces discussing VFIO ReBAR:

     <domain type='kvm'> ➡️ <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

     <qemu:commandline>
       <qemu:arg value='-fw_cfg'/>
       <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
     </qemu:commandline>

     After relaunching the VM, I found the results to be the same. So, this is interesting in that the XML may not be required for ReBAR anymore either. However, since I'm getting inconsistent reporting between GPU-Z and the NVIDIA Control Panel, I can't be sure. I think I trust NVIDIA Control Panel more than GPU-Z on this one, even though GPU-Z has never steered me wrong in the past. I figure the hardware vendor's own driver software probably knows better, and GPU-Z is reading some inconsistent information and reporting incorrectly. Still, I think a synthetic benchmark comparing the Host BIOS settings (ReBAR and 4G Decoding On vs Off) is called for in this scenario. I'll report some of that in a follow-up post.

     Anyone else see something similar to what I'm seeing and have verified ReBAR functional in their VM?

     -JesterEE

     # lspci -vvvs 0c:00.0
     0c:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Lite Hash Rate] (rev a1) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc.
GA104 [GeForce RTX 3070 Lite Hash Rate] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 141 IOMMU group: 30 Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at 7c00000000 (64-bit, prefetchable) [size=8G] Region 3: Memory at 7e00000000 (64-bit, prefetchable) [size=32M] Region 5: I/O ports at f000 [size=128] Expansion ROM at fc000000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00000 Data: 0000 Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: 
Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [b4] Vendor Specific Information: Len=14 <?> Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 Status: NegoPending- InProgress- Capabilities: [258 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=255us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Capabilities: [128 v1] Power Budgeting <?> Capabilities: [420 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900 v1] Secondary PCI Express LnkCtl3: 
LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [bb0 v1] Physical Resizable BAR BAR 0: current size: 16MB, supported: 16MB BAR 1: current size: 8GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB BAR 3: current size: 32MB, supported: 32MB Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?> Capabilities: [d00 v1] Lane Margining at the Receiver <?> Capabilities: [e00 v1] Data Link Feature <?> Kernel driver in use: vfio-pci Kernel modules: nvidia_drm, nvidia
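For anyone who wants to check the host-side BAR size without reading the whole dump, here is a minimal sketch that pulls the "current size" of one BAR out of lspci text. The function name `rebar_current_size` is made up for illustration, and it assumes the "BAR N: current size: ..., supported: ..." line format shown in the output above; other lspci versions may print it differently.

```shell
#!/bin/bash
# Print the current size of one BAR from `lspci -vv` text read on stdin.
# Assumes lines shaped like the output in this post:
#   BAR 1: current size: 8GB, supported: 64MB 128MB ... 8GB
# rebar_current_size is a hypothetical helper name, not an existing tool.
rebar_current_size() {
    local bar="$1"
    # Keep only the text between "current size: " and the first comma.
    sed -n "s/^[[:space:]]*BAR ${bar}: current size: \([^,]*\),.*/\1/p"
}

# Example usage (run as root so lspci can read the capability list):
#   lspci -vvs 0c:00.0 | rebar_current_size 1
```

On the card in this post the usage line should print the BAR 1 size reported by the host, which only tells you what the host assigned and not what the guest ends up seeing.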
  2. Has this plugin been delisted from CA? Fix Common Problems is showing this after upgrade to 6.12.8 from 6.11.
  3. Here is a link to the libtorrent bug tracker for this issue: https://github.com/arvidn/libtorrent/issues/6952
  4. Been running it continuously since my 11/18/2022 post. No issues.
  5. Yes.
     Above 4G Decoding: Enabled
     Resize BAR Support: Auto (other option is Disabled)
  6. Started using a new video card in Unraid this week and noticed that the card name and PCIe Gen columns on the first line overlap for my card, which has a long name. Can the card name be truncated depending on the width of the window (and subsequently the column)?
  7. Yup, messed that up in the copypasta while experimenting. Anyway, not a big deal ... it works for me if I want to set the ReBAR to acceptable values up to the default 256MB (for my card [64MB, 128MB, 256MB]) ... but it will not set them higher (for my card [512MB, 1GB, 2GB, 4GB, 8GB]). If I try to set it to a value lower than 64MB or higher than 256MB, I get this error:

     # -bash: echo: write error: Device or resource busy

     Here is the memory allocation info for my card:

     # lspci -vvvs 0b:00.0
     0b:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Lite Hash Rate] (rev a1) (prog-if 00 [VGA controller])
     ...
     Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
     Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
     Region 3: Memory at c8000000 (64-bit, prefetchable) [size=32M]
     ...
     Physical Resizable BAR
       BAR 0: current size: 16MB, supported: 16MB
       BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB
       BAR 3: current size: 32MB, supported: 32MB

     Thanks for publishing the patch and modified kernel even though it didn't work for me completely. Hope others give it a shot too and report their mileage.
  8. No, I changed the addresses on my side, but I posted the command as written so it could easily be referenced from your post. Here is my version:

     #!/bin/bash
     echo -n "0000:0b:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
     echo 14 > /sys/bus/pci/devices/0000\:0b\:00.0/resource1_resize # <<<< Gets stuck here
     echo -n "10de 2488" > /sys/bus/pci/drivers/vfio-pci/new_id || echo -n "0000:0b:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
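A side note on the number written to resource1_resize in the script above. As far as I understand the kernel's sysfs interface, the value is the base-2 logarithm of the BAR size in MB, so n requests 2^n MB. Treat that encoding as an assumption to check against the kernel documentation; the helper name below is made up for illustration. A minimal sketch of the conversion:

```shell
#!/bin/bash
# Convert a BAR size in MB to the value written to resourceN_resize.
# Assumption (verify against the kernel's sysfs-bus-pci documentation):
# writing n requests a BAR of 2^n MB, i.e. n = log2(size in MB).
# rebar_size_to_value is a hypothetical helper name.
rebar_size_to_value() {
    local mb="$1" n=0
    # Reject non-positive sizes and sizes that are not a power of two.
    if [ "$mb" -le 0 ] || [ $(( mb & (mb - 1) )) -ne 0 ]; then
        echo "size must be a power of two, in MB" >&2
        return 1
    fi
    # Count how many times the size can be halved before reaching 1 MB.
    while [ "$mb" -gt 1 ]; do
        mb=$(( mb >> 1 ))
        n=$(( n + 1 ))
    done
    echo "$n"
}

# Example usage:
#   rebar_size_to_value 256     # default BAR 1 size on this card
#   rebar_size_to_value 8192    # 8GB, the largest size the card lists
```

If that encoding is right, 8GB corresponds to 13, and 14 would ask for 16GB, which is not in the supported list lspci shows for this card. That may or may not be related to the write error reported in this thread.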
  9. @Skitals So I tried the script commands you specified in your previous post, but got stuck when actually sizing the ReBAR with:

     # echo 14 > /sys/bus/pci/devices/0000\:0d\:00.0/resource1_resize
     # -bash: echo: write error: Device or resource busy

     Did some searching and I couldn't find a way to correct this. Not looking for tech support necessarily, just reporting my experience. On my system, the video card is bound to VFIO and the system is booting with a syslinux config including ... video=efifb:off ...
  10. Tried your patch today with my ASUS KO RTX 3070 and a Windows 10 VM. GPU-Z is still reporting Resizable BAR as Disabled. Was there any additional setup needed to set the initial state of the BAR, or should it be on by default with the patched kernel?
  11. I checked the repo again just now; this is still the latest LSIO release on libtorrent v1.

     I use the Gluetun container for my VPN and I've never seen this issue. Actually, just the opposite. I intentionally test this from time to time to see if I'm leaking my IP, and when the VPN is off, traffic does not revert to the default internet connection (essentially a built-in kill switch).

     I do not create a custom docker network as this write-up has shown. Instead, in the template for the container you want to put on the VPN network, I set Network Type: None and add --network=container:VPN_CONTAINER_NAME on the extra parameters line. I'm pretty sure this is essentially doing the same thing except without naming the network, so I'm not sure why we have different experiences with dropped connections.

     What is important to note: doing it this way requires the client containers to be rebuilt when the VPN container is updated. This is because docker needs to point the clients (deluge, etc.) to the new endpoint, since the updated VPN container has a new hash associated with it. When you update your VPN container via the WebUI, since Unraid 6.9 (I think), the OS has been smart enough to rebuild the attached containers automatically, and after a minute or so for rebuilding and restarting the client containers, all is well. However, if the VPN container gets updated automatically by the Auto Update Applications plugin, the rebuild will not be triggered (since this rebuild control is implemented in the Docker WebUI php code), and all clients will lose their network connection. This will still not leak my IP or revert to the default network; the client containers will just have no network connectivity. So, in the Auto Update Applications settings, I turn autoupdate off for the VPN container and update that one manually from time to time.

     Hope this helps!
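To make the setup above a bit more concrete, here is a minimal sketch of the extra parameter and of a check for the stale-attachment case. The helper names and the container names `gluetun` and `deluge` in the usage comments are placeholders I made up. The check also assumes that `docker inspect` reports a client's network mode as `container:<ID of the VPN container>`, which is worth verifying on your own system.

```shell
#!/bin/bash
# Build the "Extra Parameters" value that attaches a client container to the
# network stack of a VPN container. vpn_network_param is a hypothetical helper.
vpn_network_param() {
    local vpn_name="$1"
    if [ -z "$vpn_name" ]; then
        echo "VPN container name required" >&2
        return 1
    fi
    echo "--network=container:${vpn_name}"
}

# Succeed only if the client's recorded network mode still names the VPN
# container's current ID. Assumption: the mode is stored as "container:<ID>".
client_attached_to() {
    local client_mode="$1" vpn_id="$2"
    [ "$client_mode" = "container:${vpn_id}" ]
}

# Example usage (container names are placeholders):
#   vpn_network_param gluetun
#   mode=$(docker inspect -f '{{.HostConfig.NetworkMode}}' deluge)
#   vpn_id=$(docker inspect -f '{{.Id}}' gluetun)
#   client_attached_to "$mode" "$vpn_id" || echo "deluge needs a rebuild"
```

If the assumption holds, a mismatch after an automatic update of the VPN container would be the "no network connectivity" state described above, and rebuilding the client container should clear it.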
  12. 3+ days torrent uptime, no crashes in 6.11.5. I'm satisfied with the current resolution that libtorrent v2 is to blame. If you all want to get back on v2, I suggest following the open issue on the libtorrent tracker to see when they correctly support transparent hugepages.
  13. Cross linking to an Unraid support thread where libtorrent (used by deluge) is causing a kernel error on the 6.11 release series. If you are on 6.11.* and see your syslog contain the error shown in this thread accompanied by your deluge/Unraid webUI being unresponsive, you may also be experiencing the same issue. This is not unique to Unraid, but seemingly all distros utilizing newer versions of the Linux v5 kernel.
  14. Quick note to deluge users that are now using the older version I linked a couple days ago, I updated the post with a newer release since I had an issue with state corruption between restarts. See the updated comment here. I did a couple quick tests (starting/stopping/restarting the container) and I don't see the same corruption occurring on the newer version of deluge still with libtorrent v1.
  15. @binhex you run one of the more widely used qbittorrent images in the community. Have you seen many more reports on your support channels?
  16. Day 4 and I got this somewhat related error spammed in the deluge docker container log. The client is not accessible, but Unraid itself seems unaffected (WebUI, SSH, etc.). I tried to restart the docker container, and the deluge daemon is not starting correctly, with the same error reported right away. I'm going to restart the server and see what happens. I'll edit this post when I test out the container after the restart.

     POST RESTART UPDATE

     The Unraid OS restart stopped my array without issue, so no parity check on reboot! I'm very happy about that; I was getting sick of dirty restarts. After restarting my deluge container, I was still getting the same error as above. I did some debugging, and it looks like my session.state file in the appdata got corrupted (this was a known issue in the before-days, which I haven't had in a while, maybe about a year). Once I pulled in a copy of the state file from the backup, everything returned to normal.

     FYI for those that don't use deluge: it makes one backup of the settings and state files in the app directory, but in my experience, it is not good about backing up only known good files ... I've had the built-in backup store a corrupted file before, so mileage may vary. It would be a good idea to get the state file from the Unraid CA Backup / Restore Appdata plugin if you have it installed (which, come on, we all should).

     I'll restart my container and see how long it goes till this comes up again. But overall, this was a much nicer bad experience in that libtorrent didn't trigger a kernel error that took down the server. 😆 First, I'm going to upgrade to 6.11.5 and go from there.
  17. 72 hours of running Deluge on libtorrent v1 and no crashes. In every other test it has crashed before this point, so all signs point to this being the true issue. Looking at this thread on the kernel bug tracker, it looks like some sort of operating system configuration bug. I guess we are at the mercy of the Linux gods. What I find interesting is that this should be a very widespread and somewhat prohibitive issue, with an internet full of BitTorrent users on newer Linux distros. Yet the chatter is relatively low on +1s and reports of the same issue. Is it not affecting everyone the same way?
  18. Looking at your syslog, this doesn't look like the same problem as this thread is addressing. You have a ton of BTRFS errors. Maybe search that on the forum and see what you find.
  19. You know, now that this has all come to light ... it makes sense that this is a libtorrent issue because the only people seeing it are running either deluge or qbittorrent, the 2 most popular clients that use this library (Libtorrent - Projects).
  20. UPDATE: I had an issue where the 'old' version of deluge (2.0.5) was causing state corruption on docker power off. I know this has been resolved more recently (maybe by libtorrent v2, maybe in deluge, I'm not sure) as I have not had this problem in months running the latest LSIO docker. So, I looked in the source repo, again, and found a newer version of deluge (2.1.1) built with libtorrent-rasterbar v1. This may fix the issue, I do not know yet. Either way, it's newer than the link from the original post. 2.1.1-r3-ls179 - Dockerhub Link ORIGINAL POST: For those of us that use Deluge (from Linuxserver), I snooped the source and found that the last libtorrent-rasterbar with a version < 2 was 2.0.5-0202202181752ubuntu20.04.1-ls140 - Dockerhub Link. This was released 9 months ago! At that point they switched the base image to Alpine and pointed to either 3.16 or edge; which now only has libtorrent-rasterbar v2 in the Alpine package index. I've updated this docker daily since forever, so I have definitely been using libtorrent-rasterbar v2 for months. So, @Altwazar and @binhex are probably right in that there is some new Linux kernel interaction in the 6.11 series that is triggering the error. I'll try and roll another container with this version over my lunch break and see if the kernel gets mad. Update: Pulled and running. We'll see what we see 😆.
  21. I think the most important part is this is seemingly not an Unraid specific issue and seems like some funky interaction between libraries. Does it still make sense to track this issue here?
  22. I use Deluge, and in the linuxserver/deluge:latest docker it's pulling libtorrent 2.0.8.0 which causes the same issue. I'm at work, but later I'm going to try and load up some older versions that use libtorrent v1 and see if it persists.
  23. Thank you very much for joining our community forum just to let us know that this is repeatable on other Linux systems and it's an application error rather than a kernel error!
  24. 48 hours of continuous RW and still not triggering the error. 24 more and I'm gonna call it a pass. Still looking for insight on network stress testing.
  25. I was playing with fio some and I settled on this command to test the system for random reads/writes, to emulate how I think a torrent client should perform while simultaneously downloading/seeding multiple files:

     fio --directory=/torrents --name=iops-test-job --ioengine=libaio --rw=randrw --bs=4k --iodepth=256 --direct=1 --group_reporting --eta-newline=1 --end_fsync=1 --time_based --size=2GB --numjobs=25 --runtime=86400

     This creates 25 2GB files (50GB total) in the /torrents directory (must be mapped in the docker template) and tests random read/write performance for the duration of runtime (in seconds). I used timeanddate.com to create meaningful durations for my schedule. I'll run this for a few days and ... we'll see what we see. 😉

     If this turns up a big nothing-burger, I think the next logical test would be the docker network interface. I have never done this, but found that either netstress or iperf may make 2 good candidates. Does anyone have experience with these tools or Linux network stress testing in general?
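The two numbers in the fio command that take a little arithmetic are the total file footprint and the runtime in seconds. A minimal sketch of both follows; the helper names are made up for illustration, and the footprint assumes one file of --size per job, as described in the post.

```shell
#!/bin/bash
# Convert days, hours and minutes into seconds for fio's --runtime.
# duration_to_seconds is a hypothetical helper name.
duration_to_seconds() {
    local days="${1:-0}" hours="${2:-0}" minutes="${3:-0}"
    echo $(( days * 86400 + hours * 3600 + minutes * 60 ))
}

# Total space taken by the job files, assuming one file of size_gb per job.
# fio_total_gb is a hypothetical helper name.
fio_total_gb() {
    local numjobs="$1" size_gb="$2"
    echo $(( numjobs * size_gb ))
}

# Example usage:
#   duration_to_seconds 1     # one day, the --runtime used in the command
#   duration_to_seconds 3     # three days
#   fio_total_gb 25 2         # --numjobs=25 with --size=2GB
```

The value can be dropped straight into the command, for example `--runtime=$(duration_to_seconds 3)` for a three-day run.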