-
[Plugin] Nvidia-Driver
GOTCHA: First I tried pci=realloc, but then Unraid did not boot up anymore (dmesg error attached). Then I activated 4G decoding in the BIOS. I couldn't find rbar or anything related. Can I do something else to tweak this? The system hangs a short while after boot-up.

dmesg after pci=realloc:

[ 397.564440] BUG: Bad rss-counter state mm:00000000afb2a49e type:MM_ANONPAGES val:1
[ 455.429619] BUG: kernel NULL pointer dereference, address: 0000000000000051
[ 455.431042] #PF: supervisor read access in kernel mode
[ 455.432405] #PF: error_code(0x0000) - not-present page
[ 455.433788] PGD 8000000168d08067 P4D 8000000168d08067 PUD 1826a3067 PMD 0
[ 455.435145] Oops: 0000 [#47] PREEMPT SMP PTI
[ 455.436461] CPU: 5 PID: 8222 Comm: monitor Tainted: P D IO 6.1.74-Unraid #1
[ 455.437791] Hardware name: Supermicro Super Server/X11SSH-F, BIOS 2.6 06/12/2021
[ 455.439078] RIP: 0010:kmem_cache_alloc+0xa4/0x14d
[ 455.440381] Code: 04 24 74 05 48 85 c0 75 1a 45 89 f0 4c 89 f9 83 ca ff 44 89 e6 48 89 ef e8 2a fc ff ff 48 89 04 24 eb 25 8b 4d 28 48 8b 7d 00 <48> 8b 1c 08 48 8d 8a 00 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74
[ 455.441817] RSP: 0018:ffffc9000d6dfe00 EFLAGS: 00010202
[ 455.443194] RAX: 0000000000000001 RBX: ffff8881055d9c00 RCX: 0000000000000050
[ 455.444579] RDX: 0000000000076b05 RSI: 0000000000000cc0 RDI: 0000000000032230
[ 455.445999] RBP: ffff8881001e8500 R08: 0000000000000cc0 R09: 0000000000020004
[ 455.447362] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000cc0
[ 455.448767] R13: ffff8881001e8500 R14: 00000000000000a8 R15: ffffffff810901b4
[ 455.450133] FS: 000014fcbcf47640(0000) GS:ffff889055340000(0000) knlGS:0000000000000000
[ 455.451521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 455.452946] CR2: 0000000000000051 CR3: 00000001846f6002 CR4: 00000000003706e0
[ 455.454344] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 455.455787] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 455.457168] Call Trace:
[ 455.458542] <TASK>
[ 455.459968] ? __die_body+0x1a/0x5c
[ 455.461353] ? page_fault_oops+0x329/0x376
[ 455.462787] ? do_user_addr_fault+0x12e/0x48d
[ 455.464176] ? exc_page_fault+0xfb/0x11d
[ 455.465563] ? asm_exc_page_fault+0x22/0x30
[ 455.467001] ? prepare_creds+0x21/0xc8
[ 455.468392] ? kmem_cache_alloc+0xa4/0x14d
[ 455.469827] prepare_creds+0x21/0xc8
[ 455.471210] prepare_exec_creds+0xb/0x4c
[ 455.472591] bprm_execve+0x63/0x52b
[ 455.474014] do_execveat_common.isra.0+0x1a6/0x1cf
[ 455.475395] __x64_sys_execve+0x38/0x44
[ 455.476782] do_syscall_64+0x68/0x81
[ 455.478080] entry_SYSCALL_64_after_hwframe+0x64/0xce
[ 455.479378] RIP: 0033:0x14fcbffffa87
[ 455.480669] Code: 48 8d 3d fc 3f 11 00 5b 5d 41 5c 41 5d 41 5e 41 5f e9 0d bd fa ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 61 c3 10 00 f7 d8 64 89 01 48
[ 455.482095] RSP: 002b:000014fcbc28fe68 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
[ 455.483473] RAX: ffffffffffffffda RBX: 00007ffe8b1df690 RCX: 000014fcbffffa87
[ 455.484906] RDX: 0000000001657da0 RSI: 00007ffe8b1df970 RDI: 000014fcc00ca1af
[ 455.486284] RBP: 000014fcbc28fff0 R08: 0000000000000000 R09: 0000000000000000
[ 455.487616] R10: 0000000000000008 R11: 0000000000000246 R12: 000014fcc00c7a78
[ 455.488944] R13: 00007ffe8b1df990 R14: 0000000001730600 R15: 0000000000000001
[ 455.490180] </TASK>
[ 455.491341] Modules linked in: ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc igb intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp nvidia_drm(PO) nvidia_modeset(PO) kvm_intel ipmi_ssif i915 kvm nvidia(PO) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi drm_buddy sha256_ssse3 drm_display_helper ast sha1_ssse3 drm_vram_helper drm_ttm_helper aesni_intel ttm crypto_simd nvme cryptd drm_kms_helper rapl intel_cstate intel_gtt i2c_i801 intel_uncore nvme_core i2c_smbus drm agpgart ahci i2c_algo_bit i2c_core libahci mei_me syscopyarea sysfillrect sysimgblt input_leds mei joydev intel_pch_thermal led_class fb_sys_fops thermal fan acpi_ipmi video ipmi_si wmi backlight intel_pmc_core acpi_power_meter acpi_pad button unix
[ 455.491400] [last unloaded: igb]
[ 455.499372] CR2: 0000000000000051
[ 455.500693] ---[ end trace 0000000000000000 ]---
[ 455.505168] RIP: 0010:kmem_cache_alloc+0xa4/0x14d
[ 455.506489] Code: 04 24 74 05 48 85 c0 75 1a 45 89 f0 4c 89 f9 83 ca ff 44 89 e6 48 89 ef e8 2a fc ff ff 48 89 04 24 eb 25 8b 4d 28 48 8b 7d 00 <48> 8b 1c 08 48 8d 8a 00 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74
[ 455.507916] RSP: 0018:ffffc90003a07e80 EFLAGS: 00010202
[ 455.509301] RAX: 0000000000000001 RBX: ffffc90003a07f58 RCX: 0000000000000050
[ 455.510745] RDX: 0000000000076905 RSI: 0000000000000cc0 RDI: 0000000000032230
[ 455.512131] RBP: ffff8881001e8500 R08: 0000000000000cc0 R09: 0000000000000000
[ 455.513505] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000cc0
[ 455.514917] R13: ffff8881001e8500 R14: 00000000000000a8 R15: ffffffff810901b4
[ 455.516286] FS: 000014fcbcf47640(0000) GS:ffff889055340000(0000) knlGS:0000000000000000
[ 455.517697] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 455.519104] CR2: 0000000000000051 CR3: 00000001846f6002 CR4: 00000000003706e0
[ 455.520487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 455.521903] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 455.523260] note: monitor[8222] exited with irqs disabled
[ 463.482733] BUG: Bad rss-counter state mm:0000000041167a8a type:MM_ANONPAGES val:1
-
[Plugin] Nvidia-Driver
Yes, sure: dsunraid-diagnostics-20240301-2115.zip. It seems the issue is related to this: https://forums.developer.nvidia.com/t/nvrm-this-pci-i-o-region-assigned-to-your-nvidia-device-is-invalid/229899
-
[Plugin] Nvidia-Driver
Sure, many thanks for looking into it. dsunraid-diagnostics-20240301-2032.zip
-
[Plugin] Nvidia-Driver
I am trying to use a Tesla T4 (which is Turing-based), but with this plugin no GPU is found, whichever driver version I pick. The host can see the card. Can you include the Data Center Driver? https://www.nvidia.de/download/driverResults.aspx/221557/de

01:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
    Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 11
    IOMMU group: 2
    Region 0: Memory at 6f000000 (32-bit, non-prefetchable) [size=16M]
    Region 3: Memory at 70000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Null
    Capabilities: [78] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
            10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
        Vector table: BAR=0 offset=00b90000
        PBA: BAR=0 offset=00ba0000
    Capabilities: [100 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
            Status: NegoPending- InProgress-
    Capabilities: [250 v1] Latency Tolerance Reporting
        Max snoop latency: 71680ns
        Max no snoop latency: 71680ns
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
        L1SubCtl2:
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
        IOVSta: Migration-
        Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 4, stride: 1, Device ID: 0000
        Supported Page Size: 00000573, System Page Size: 00000001
        Region 0: Memory at 6c000000 (32-bit, non-prefetchable)
        Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
        Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
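In the lspci output above, the Resizable BAR capability lists BAR 0, BAR 1 and BAR 3, but only Region 0 and Region 3 show up among the card's own memory regions. Below is a rough Python sketch for spotting that kind of mismatch in an lspci -vv dump. The function name unassigned_bars and the SAMPLE string (a trimmed copy of the output above) are just illustrative, and the check is only a text heuristic based on how this particular dump looks. It does not prove that the missing region is what stops the driver from finding the GPU.

```python
import re

# Trimmed excerpt of the lspci output from the post (illustrative sample only).
SAMPLE = """01:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Region 0: Memory at 6f000000 (32-bit, non-prefetchable) [size=16M]
Region 3: Memory at 70000000 (64-bit, prefetchable) [size=32M]
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Vector table: BAR=0 offset=00b90000
Capabilities: [bb0 v1] Physical Resizable BAR
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 256MB, supported: 64MB 128MB 256MB
BAR 3: current size: 32MB, supported: 32MB
Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
"""


def unassigned_bars(lspci_text):
    """Return BAR numbers listed under Resizable BAR that have no assigned Region.

    Heuristic: the device's own Region lines are printed before the first
    'Capabilities:' entry, so anything after that (for example the SR-IOV
    virtual function regions) is ignored when collecting assigned regions.
    """
    head = lspci_text.split("Capabilities:", 1)[0]
    assigned = {
        int(m.group(1))
        for m in re.finditer(r"Region (\d+): Memory at ([0-9a-f]+)", head)
        if int(m.group(2), 16) != 0  # an all-zero address means nothing was assigned
    }
    listed = {
        int(m.group(1))
        for m in re.finditer(r"BAR (\d+): current size", lspci_text)
    }
    return sorted(listed - assigned)


if __name__ == "__main__":
    print(unassigned_bars(SAMPLE))  # prints [1] for the sample above
```

To try it on a live system, the output of lspci -vv for the card could be saved to a file and passed in as a string instead of SAMPLE.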
DirkS