
[Plugin] Nvidia-Driver


ich777


@ich777

- In the BIOS, the only options for Resizable BAR are "disable" or "auto", and it is set to auto.

- "Above 4G Decoding" is ON.

- I thought I had disabled the C-states, but they were actually enabled... Corrected that now.

- The PSU is 550 watts, from the original build of this PC. Even with the GPU, the whole thing should not need more than 220 watts. So plenty of wattage :)

- Pressing the power button twice didn't work :( As I don't have a speaker in this case (Jonsbo V4), or at least I didn't find cables for a case speaker, I waited at the case to see if anything happened... unfortunately nothing did. If I shut down or restart from the GUI, everything is fine.

 

Yes, I did not unplug the card in between the several installation attempts. And yes, the plugin was installed when I generated the diagnostics. I thought this would provide better information. But I can unplug the card, uninstall the plugin, and then generate new diagnostics if that would be helpful.

 

I now have a few minutes and will do the BIOS update, check all BIOS settings, and see if anything happens in the GUI.

Link to comment
1 hour ago, BastiKA84 said:

I thought this would provide better information. But I can unplug the card, uninstall the plugin, and then generate new diagnostics if that would be helpful.

Try to set up a remote syslog server or connect a monitor so that you can actually see the console output when it crashes. Take a picture and post it here; that would be super helpful.

 

But as said above, this looks like some kind of hardware compatibility issue and may be solved by a BIOS update.

 

1 hour ago, BastiKA84 said:

I now have a few minutes and will do the BIOS update, check all BIOS settings, and see if anything happens in the GUI.

Please don't forget to disable C-States again after the BIOS update and validate all other settings.

Link to comment

The BIOS update is done. All settings should be correct:

image.thumb.png.64f72d08d602172e8a55c266e9e1698b.png

 

image.thumb.png.5bd5217f9135400305828000e8b85627.png

 

 

After starting the server and entering the GUI, I stopped all Docker containers and installed the plugin again. The installation window states that everything should be fine:

image.thumb.png.c13188467de231792f5cd317702dcab4.png

 

But unfortunately, I still cannot open the plugin... The GUI just stops working.

Here, 5 minutes after trying to enter the plugin:

image.thumb.png.6aed68bd2a11665b654f537c56c54d75.png

 

As the server is still connected to a display, I can see inputs from the keyboard, which is also still connected.

-------------------------------

Wow... What's this?

I had tried to open the plugin while typing here, and I did not do a hard reset because I was looking up how to connect via SSH (yes, I am that kind of noob). Then I noticed that the Unraid tab in my browser was no longer loading. It showed the page with all my installed plugins. I pressed the plugin button again, and now I am in the plugin...

image.thumb.png.16feaf0bdd4b60a25deb0c286b4653d2.png

image.thumb.png.37ac70e0c200413a64bd717fa2330312.png

 

It seems to be running now. Thank you @ich777 for your support :) Now I will try to add the card to Jellyfin for transcoding. So maybe I will be back in a few minutes 😅
Link to comment

Hi All, 

 

I've replaced my old P400 with a T4, but the new card is not recognized.

 

Apr 23 19:08:28 littleboy kernel: NVRM: GPU at PCI:0000:af:00: GPU-36d51216-544d-71c6-0604-11d08f217cd0
Apr 23 19:08:28 littleboy kernel: NVRM: Xid (PCI:0000:af:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
Apr 23 19:08:28 littleboy kernel: NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x62:0x40:2523)
Apr 23 19:08:28 littleboy kernel: NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 0
Apr 23 19:08:28 littleboy kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000af00] Failed to allocate NvKmsKapiDevice
Apr 23 19:08:28 littleboy kernel: [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000af00] Failed to register device

  [...]

Apr 23 19:12:10 littleboy kernel: NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x62:0x40:2523)
Apr 23 19:12:10 littleboy kernel: NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 0
Apr 23 19:12:10 littleboy kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Apr 23 19:12:10 littleboy kernel: nvidia-uvm: Loaded the UVM driver, major device number 235.
Apr 23 19:12:10 littleboy kernel: NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x62:0x40:2523)
Apr 23 19:12:10 littleboy kernel: NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 0


Running the nvidia-smi command gives this output:

 

# nvidia-smi
No devices were found

 

# lsmod | grep nvidia
nvidia_uvm           4644864  0
nvidia_drm             90112  0
nvidia_modeset       1347584  1 nvidia_drm
nvidia              54116352  2 nvidia_uvm,nvidia_modeset
video                  61440  1 nvidia_modeset
drm_kms_helper        167936  4 mgag200,nvidia_drm
drm                   499712  6 drm_kms_helper,drm_shmem_helper,nvidia,mgag200,nvidia_drm
backlight              20480  3 video,drm,nvidia_modeset
i2c_core               86016  9 drm_kms_helper,i2c_algo_bit,igb,nvidia,mgag200,i2c_smbus,i2c_i801,ipmi_ssif,drm

 

I deleted the GPU Stats plugin, reinstalled drivers, and rebooted a couple of times, but the card is not recognized.
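
To see how often the driver failed to initialise the card, the RmInitAdapter lines from the log above can be counted per PCI address. This is only a rough sketch: `rm_init_failures` is a hypothetical helper name (not part of the plugin), and it assumes the kernel log text is piped in on stdin.

```shell
# Count "RmInitAdapter failed" lines per PCI address.
# Reads kernel log text on stdin and prints "<pci-address> <count>" per device.
rm_init_failures() {
    # Matches lines such as:
    #   NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x62:0x40:2523)
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

Assuming the log lives in /var/log/syslog, it could be called as `rm_init_failures < /var/log/syslog`, or fed from `dmesg`.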

 

1750210899_Screenshot2024-04-23at19_18_57.thumb.png.d942e967bbd1d7388d14d3c4c7a87bd3.png

 

I read a bit around the forum and I have:

754218808_Screenshot2024-04-23at19_15_50.thumb.png.2142c87c5e45556edd6bc3192e66e12e.png

 

The GPU is also recognized in the BIOS/iDRAC:

1280133543_Screenshot2024-04-23at19_31_52.thumb.png.97621e85a0ae201c5d932079ded96eec.png

 

131385308_Screenshot2024-04-23at19_32_04.thumb.png.c4b06d320122054ddf8f1ff6448fe94e.png

 

Is it possible to fix this? Could the GPU be broken?

 

 

 

littleboy-diagnostics-20240417-1506.zip

Edited by skler
Link to comment
2 hours ago, skler said:

I've replaced my old P400 with a T4, but the new card is not recognized.

This is a really bad sign:

Apr 23 19:08:28 littleboy kernel: NVRM: Xid (PCI:0000:af:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0

 

It might be that your card is defective or your Firmware/BIOS is simply not compatible with your Nvidia card.
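
The Xid number in the log line quoted above (140 here, reported as an uncorrectable ECC error) is the part to look up in NVIDIA's Xid documentation. A minimal sketch for pulling every Xid event out of a log, assuming the kernel log is piped in on stdin; the function name `xid_events` is made up for this example:

```shell
# Print "<pci-address> <xid-code>" for every NVRM Xid line on stdin.
# Expects lines in the form shown above:
#   NVRM: Xid (PCI:0000:af:00): 140, pid=..., name=..., <message>
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}
```

Something like `dmesg | xid_events | sort | uniq -c` would then show which codes repeat.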

 

If possible, please try driver version 550.67 in the plugin, reboot, and see if that helps.

Can you maybe also try another slot in your Server?

 

It is also possible that the GPU is not compatible with your server. This would not be the first time that I have seen a card fail to work because it's not Dell, HP, ... certified.

 

I assume auxiliary power is also connected properly?

 

For testing purposes, I would also recommend that you remove the file /boot/config/modprobe.d/nvidia.conf

Just for your information, because I saw it: if you want to use more than one option, put them on a single line like this:

options nvidia-drm modeset=1 fbdev=1
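
The original contents of nvidia.conf are not shown in the thread, so the following is only a sketch under the assumption that the file had one `options nvidia-drm` line per option. It rewrites such input into the single-line form recommended above; `merge_modprobe_options` is a hypothetical helper, not something the plugin ships.

```shell
# Merge all "options <module> ..." lines for one module into a single line.
# Reads modprobe-style config on stdin; unrelated lines pass through unchanged,
# and the merged line is printed last.
merge_modprobe_options() {
    awk -v mod="$1" '
        $1 == "options" && $2 == mod {
            # Collect every parameter after the module name.
            for (i = 3; i <= NF; i++) merged = merged " " $i
            found = 1
            next
        }
        { print }
        END { if (found) print "options " mod merged }
    '
}
```

For example, `merge_modprobe_options nvidia-drm < /boot/config/modprobe.d/nvidia.conf` prints the merged result without touching the file.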

 

Link to comment
7 minutes ago, ich777 said:

This is a really bad sign:

Apr 23 19:08:28 littleboy kernel: NVRM: Xid (PCI:0000:af:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0

 

It might be that your card is defective or your Firmware/BIOS is simply not compatible with your Nvidia card.

 

If possible, please try driver version 550.67 in the plugin, reboot, and see if that helps.

 

Done, diagnostics attached (without the nvidia.conf parameters).

 

7 minutes ago, ich777 said:

Can you maybe also try another slot in your Server?

 

It is the only PCIe x16 slot.

 

7 minutes ago, ich777 said:

 

It is also possible that the GPU is not compatible with your server. This would not be the first time that I have seen a card fail to work because it's not Dell, HP, ... certified.

 

It should be compatible:

 

https://docs.nvidia.com/certification-programs/nvidia-certified-systems/index.html

 

664682785_Screenshot2024-04-23at22_17_26.thumb.png.74432c8fbf7d46c9c0178119a91fd9ef.png

 

 

7 minutes ago, ich777 said:

 

I assume auxiliary power is also connected properly?

 

The T4 doesn't have one.

 

7 minutes ago, ich777 said:

 

For testing purposes, I would also recommend that you remove the file /boot/config/modprobe.d/nvidia.conf

 

Done.

 

7 minutes ago, ich777 said:

Just for your information, because I saw it: if you want to use more than one option, put them on a single line like this:

options nvidia-drm modeset=1 fbdev=1

 

 

OK, thanks.

 

Still no device found.

littleboy-diagnostics-20240423-2225.zip

Link to comment

I've tried the open-source drivers too:

 

[  256.084894] nvidia-uvm: Loaded the UVM driver, major device number 236.
[  257.340047] NVRM: kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[  257.340054] NVRM: kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[  257.340060] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[  257.343106] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x62:0x40:1784)
[  257.345186] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 0

 

Do you think that the GPU could be broken?

 

littleboy-diagnostics-20240423-2250.zip

 

Link to comment
9 hours ago, skler said:

Done, diagnostics attached (without the nvidia.conf parameters).

Did you already try it with the driver version that I recommended above?

 

9 hours ago, skler said:

It should be compatible:

But do you maybe have to buy a license to make use of it? Such servers can be horrible when it comes to hardware support.

I can't help much with real server stuff because I only use consumer hardware on my system.

 

9 hours ago, skler said:

I've tried the open-source drivers too:

Thank you. Since this is a real datacenter card, you don't have to use the option in nvidia.conf as pointed out on the plugin page, but it shouldn't make a real difference.

 

9 hours ago, skler said:

Do you think that the GPU could be broken?

Maybe, yes, but it is really hard to tell. Are you also sure that the PCIe x16 slot can deliver enough power? I think someone with a Dell PowerEdge had an issue where the PCIe slot didn't deliver enough power because of a riser card.

 

Can you maybe test the card in another system and install the Nvidia driver?

Link to comment

Hi everyone.  I'm having a similar issue with the error:

Quote

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

 

I've got a Tesla P4, and oddly enough it was working fine. But it was getting really hot, and then I saw this error. So I took it out and got a fan adapter for it, which I'm powering with a different power supply, and I still have this nvidia-smi issue.

 

I've attached diagnostics, but I hope it doesn't suggest there is an issue with the actual card.

 

I've reinstalled the driver and moved the card to a different PCIe slot, and I'm still not able to get past this error.

 

Any thoughts or suggestions would be greatly appreciated.

odyssey-diagnostics-20240425-1029.zip

Link to comment
24 minutes ago, mew0717 said:

I've reinstalled the driver and moved the card to a different PCIe slot, and I'm still not able to get past this error.

Sorry, but I don't see an Nvidia GPU on your PCIe bus.

 

Are you sure the card isn't defective? Please try to reseat the card, and if it needs auxiliary power, make sure that it is plugged in properly.
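
One quick way to check whether the card shows up on the PCIe bus at all is to filter `lspci -nn` output for NVIDIA's PCI vendor ID, 10de. A minimal sketch; the function name `nvidia_pci_devices` is only illustrative:

```shell
# Keep only lines from `lspci -nn` output (on stdin) whose
# [vendor:device] pair uses the NVIDIA vendor ID 10de.
# Exits non-zero when no such device is listed.
nvidia_pci_devices() {
    grep -i '\[10de:[0-9a-f]\{4\}\]'
}
```

Typical use would be `lspci -nn | nvidia_pci_devices || echo "no NVIDIA device on the PCIe bus"`. If nothing is printed, the problem sits below the driver (seating, power, slot, or the card itself), so reinstalling the driver is unlikely to help.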

Link to comment
39 minutes ago, ich777 said:

Sorry, but I don't see an Nvidia GPU on your PCIe bus.

 

Are you sure the card isn't defective? Please try to reseat the card, and if it needs auxiliary power, make sure that it is plugged in properly.

 

I wonder if it got fried. It showed up fine at first, then it got to around 96 °C, and that's when I first saw the error. Does / can that happen?

Link to comment
