[Plugin] Nvidia-Driver


ich777


1 hour ago, Tithonius said:

I assume the command I want to run is "tail -f /var/log/syslog", yeah?

No, you don't need to run anything, the Kernel Panic is displayed automatically.

 

1 hour ago, Tithonius said:

Or is there a command that will show me a more verbose output? I wouldn't be opposed to having the most info possible if that's a thing. 

No. What more do you want to see? You are already seeing the entire syslog...

 

1 hour ago, Tithonius said:

What command should I be running to see the console output?

Leave it as it is and it will display the Kernel Panic if one occurs.

 

52 minutes ago, Tithonius said:

How possible is it that it's the GPU statistics plugin causing this crash?

If you are not on the Dashboard, that is very unlikely...

 

52 minutes ago, Tithonius said:

Also, is it possible that Tdarr is causing the crash? 

That could also be the cause of the issue. Back in the day Tdarr was notorious for crashing servers, which is why I recommend using Unmanic.

Link to comment
On 3/2/2023 at 4:08 AM, ich777 said:

Then the crash is most certainly not related to the Nvidia Driver.

 

Have you changed MACVLAN to IPVLAN in your Docker settings yet? MACVLAN is notorious for crashing servers with similar kernel panics.

 

May I ask why? The open source module has no real benefit...

Had another crash, this time when I know for sure there was a lot of transcoding happening with Tdarr. Perhaps it only crashes when the GPU is being utilized?

 

Yes I changed it to MACVLAN and noticed no difference and the crash still occurred.

 

No reason, was just interested in it.

 

Is there anything else I can try? Can you link me to some general troubleshooting tips? Sorry for the late reply, I've been busy.

Link to comment
6 hours ago, LimesKey said:

Is there anything else I can try,

Mirror the syslog to flash; that may give better information about what is happening during the crash.

Attach a monitor to see on screen what is happening during the crash.

 

That would be my approach now ... and it is common advice when you read a little about troubleshooting.

 

And may I just ask: is your PSU powerful enough? If yes, forget the question ;) I just remembered that my 2070S liked to crash my Unraid server while gaming ... but I knew the PSU was "on the edge" ;) With a new PSU, the crashes were gone.

  • Thanks 1
Link to comment
On 3/17/2023 at 12:56 AM, alturismo said:

Mirror the syslog to flash; that may give better information about what is happening during the crash.

Attach a monitor to see on screen what is happening during the crash.

 

That would be my approach now ... and it is common advice when you read a little about troubleshooting.

 

And may I just ask: is your PSU powerful enough? If yes, forget the question ;) I just remembered that my 2070S liked to crash my Unraid server while gaming ... but I knew the PSU was "on the edge" ;) With a new PSU, the crashes were gone.

I have the syslog mirrored to Unraid, not flash; will that work? Actually, the last time it crashed I put the GPU on another 500W PSU, so I now have 2 x 500W PSUs: one for the CPU and 2 HDDs, the other for the GPU and another 2 HDDs. In a bit I will get a real server PSU with 1000W+ that would be a lot more efficient. It's a little janky, but the GPU barely uses more than 70W while transcoding, and overall the system should not exceed the 500W of my PSU even if everything were on one PSU. Both PSUs are pretty good quality Corsair units, I believe, and they are on a UPS.

Link to comment
2 hours ago, LimesKey said:

I have the syslog mirrored to Unraid, not flash; will that work?

Basically yes, but then it's running through the local syslog server ... which is usually dead before you see the real errors ... so for debugging right now I can only recommend mirroring to flash.
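Once the syslog is mirrored to flash, the lines written just before a crash survive the reboot and can be searched afterwards. A minimal sketch of that search, assuming a handful of common kernel crash markers; the marker list and the function name `crash_context` are illustrative assumptions, not something from the plugin or from Unraid:

```python
import re
from collections import deque
from typing import Iterable, List

# Strings the Linux kernel commonly prints around a crash.
# Assumption: this list is illustrative, not exhaustive.
CRASH_MARKERS = re.compile(
    r"Kernel panic|Oops|BUG:|Call Trace|general protection fault"
)


def crash_context(lines: Iterable[str], before: int = 20) -> List[str]:
    """Return the first marker line plus up to `before` lines preceding it.

    Returns an empty list when no marker is found.
    """
    recent = deque(maxlen=before)  # rolling window of the latest lines
    for raw in lines:
        line = raw.rstrip("\n")
        if CRASH_MARKERS.search(line):
            return list(recent) + [line]
        recent.append(line)
    return []
```

To use it, open the mirrored log after the reboot and pass the file object in, for example `crash_context(open(path, errors="replace"))`. The location of the mirrored file is an assumption to verify on your own server; on Unraid it is usually somewhere under the flash drive's `logs` folder.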

 

2 hours ago, LimesKey said:

Both PSUs are pretty good quality Corsair units, I believe, and they are on a UPS

Then you should definitely be more than safe ;) But maybe also try without your UPS ... just to make sure it's not a power hiccup from there ...

Link to comment
28 minutes ago, emrepolat7 said:

Do you think this can be implemented as a plugin?

I will look into that.

 

Just a question: may I ask what you want to accomplish with that? The exact use case, if possible.

 

EDIT: I've now gone through the tutorial and saw that you need the vGPU unlock which ultimately violates the Nvidia EULA.

 

Maybe post that in the Feature Request sub forums, I really don't want to support such things. Sorry

  • Like 1
Link to comment
1 hour ago, emrepolat7 said:

I perfectly understand and respect that. What I don't understand is that on their web page they claim it is Open-Source GPU Virtualization. So I am confused about whether it is legal or not.

 

https://www.arccompute.io/solutions/open-source-gpu-virtualization

Open-Source does not imply legality. Nor is an EULA necessarily legally binding. 
From a quick glance at the referenced project though it looks like their "Open-Source GPU Virtualization" is mostly the upper level virtualization stuff and they still require you to use closed source virtualization drivers for the actual hardware.

 

Edit: 
Just to clarify a bit: an EULA is meant to be legally binding. Whether or not a particular one actually is gets complicated quickly, and can depend on things like jurisdiction, the specific language of the document, how the user acknowledges it, etc. Enforceability is yet another matter.
This post does not constitute legal advice. #notalawyer

Edited by primeval_god
  • Like 2
Link to comment
On 3/14/2023 at 1:32 PM, ich777 said:

No, you don't need to run anything, the Kernel Panic is displayed automatically.

 

No. What more do you want to see? You are already seeing the entire syslog...

 

Leave it as it is and it will display the Kernel Panic if one occurs.

 

If you are not on the Dashboard, that is very unlikely...

 

That could also be the cause of the issue. Back in the day Tdarr was notorious for crashing servers, which is why I recommend using Unmanic.

Okay, so I finally got a crash to happen while I had a screen attached and this is the output that I can see:

 

PXL_20230321_010847780.thumb.jpg.a1705269fe77c053ab36323219b3e1dd.jpg

 

Obviously this isn't the whole crash, but does this give any insight into what's going on?

 

Link to comment
10 hours ago, Tithonius said:

Also, when I rebooted this time my flash drive was not detected... I'm backing it up now on a Windows machine, but I wonder if something is just wrong with my flash drive

I think you are running Tdarr, or am I wrong?

Back in the day Tdarr was notorious for crashing servers, but I think they fixed that.

 

Maybe try to boot into safe mode if possible and see if the server crashes there too.

You can also try to uninstall the Nvidia Driver plugin and see if that helps (of course you have to reboot to fully uninstall it).

Link to comment
1 hour ago, ich777 said:

I think you are running Tdarr, or am I wrong?

Back in the day Tdarr was notorious for crashing servers, but I think they fixed that.

 

Maybe try to boot into safe mode if possible and see if the server crashes there too.

You can also try to uninstall the Nvidia Driver plugin and see if that helps (of course you have to reboot to fully uninstall it).

So this time, before the last crash, I switched to Unmanic per your recommendation, and I like it better. Tdarr isn't even installed... I also thought to myself: what has changed since I started getting crashes? Then I realized that I had moved my HBA down a x16 slot to install the GPU, which meant that the HBA was only running at x4 speed. It "shouldn't" matter, but some people online were saying it might. So last night I swapped the GPU and HBA on the motherboard. I don't care if the GPU is in an x4 slot, as transcoding isn't any slower, and the HBA can have a full-bandwidth x16 slot. We have been crash-free overnight so far, but I still have all my monitoring in place to keep an eye on it.

 

I also ordered a new flash drive off the recommended list from Limetech; it should be here soon, just in case that's my issue. It's time for a flash drive change anyway, as this one is pretty old.

Link to comment
27 minutes ago, Tithonius said:

I don't care if the GPU is at a x4 slot as transcoding isn't any slower

It won't be, at least for home use.

 

Please keep me updated how things are going.

Are you sure that the HBA is cooled well and not getting too hot?

Link to comment
10 hours ago, ich777 said:

It won't be, at least for home use.

 

Please keep me updated how things are going.

Are you sure that the HBA is cooled well and not getting too hot?

I had another crash today, with the same-looking log on the screen. But no, I hadn't thought about the HBA getting too hot... It hasn't been a problem before, but then again, when I'm transcoding I wonder if that's actually the problem.... I wonder if there is a way to monitor that?

 

Edit: It doesn't look like my 9201-8i has temperature monitoring, so I am just going to follow a Reddit guide on a good-looking 40mm fan mod for it and repaste it. Even if that's not the problem, it won't hurt to do.

Edited by Tithonius
Link to comment
6 hours ago, Tithonius said:

Doesn't look like my 9201-8i has temperature monitoring

It is very rare that HBAs have temperature monitoring anyways.

 

6 hours ago, Tithonius said:

It hasn't been a problem before, but then again, when I'm transcoding I wonder if that's actually the problem

The HBA was only a guess. Maybe try not transcoding for a few days and see if the issue still occurs; maybe uninstall the Nvidia Driver too.

You can only do step by step troubleshooting.

Link to comment
39 minutes ago, ich777 said:

It is very rare that HBAs have temperature monitoring anyways.

 

The HBA was only a guess. Maybe try not transcoding for a few days and see if the issue still occurs; maybe uninstall the Nvidia Driver too.

You can only do step by step troubleshooting.

Sooooo I may have found what it might be...

 

https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/page/6/?tab=comments#comment-21671

 

This bug with libtorrent v2 in qBittorrent and Deluge is scarily close to the issues that I have been having. It explains the randomness of the crashes, and the logs people have been posting seem very similar to the snippets of logs that I have been getting.

 

I did have the server crash while I had Tdarr off and nothing transcoding (that was at the time; I have Unmanic now). But this whole time I have had qBittorrent set to auto-start on server boot, so it would explain all of that.

 

I am still in hardcore troubleshooting mode, and if this does in fact fix the problem then I will come back and post my entire journey here for others as well. I'm also still going to put the small fan on my HBA because I already placed the Amazon order for the supplies 😛

 

Also, I just want to say, you guys here have been so helpful. This most likely isn't even an issue with your software, and you are all doing everything you can to help, and I am so thankful for that. So, thanks.

  • Like 1
Link to comment
28 minutes ago, Bademeister said:

Hello everyone

Only support in English over here...

 

28 minutes ago, Bademeister said:

I have also attached my diagnostics files ....

I am very grateful for any help ....

You are running a Quadro 4000 (please keep in mind this is not a Quadro P4000). This is a pretty old card and is not supported by this driver or even by the old legacy driver 470.xx.xx

 

You can't even use the card for Docker containers, because you need at least a Kepler-based card and the Quadro 4000 is Fermi-based.
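The architecture requirement can be written down as a simple ordering check. This is only a sketch: the generation list reflects NVIDIA's release order, the Kepler minimum is taken from the reply above, and the helper name `meets_minimum` is hypothetical rather than part of any plugin or driver API.

```python
# NVIDIA GPU architectures in release order, oldest first.
ARCHITECTURES = [
    "Tesla", "Fermi", "Kepler", "Maxwell",
    "Pascal", "Volta", "Turing", "Ampere",
]

# Minimum architecture for use in Docker containers, per the reply above.
MIN_ARCH = "Kepler"


def meets_minimum(arch: str, minimum: str = MIN_ARCH) -> bool:
    """True if `arch` is the same generation as `minimum` or newer.

    Raises ValueError for a name that is not in ARCHITECTURES.
    """
    return ARCHITECTURES.index(arch) >= ARCHITECTURES.index(minimum)
```

With this ordering, `meets_minimum("Fermi")` (the Quadro 4000) is false, while `meets_minimum("Pascal")` (the Quadro P4000) is true.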

 

Hope that answers your question. :)

  • Like 2
Link to comment

Okay, so... I had another crash, this time while the syslog was being mirrored to flash. At the bottom of that syslog you can see that the USB device is getting reset over and over, so it seems like maybe those USB ports on my motherboard are bad? I just replaced the flash drive with a new one, as I figured this might be the case. Is that what others see in this log as well?

 

I really hope this is actually the issue...

syslog.txt
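One way to see how often the resets occur, and on which port, is to count the reset messages in the saved log. A minimal sketch, assuming the kernel's usual wording for these messages (for example `usb 1-2: reset high-speed USB device number 3 using xhci_hcd`); the exact wording in the attached log may differ, and `count_usb_resets` is a hypothetical helper name:

```python
import re
from collections import Counter
from typing import Iterable

# Assumed wording of the kernel's USB reset message, e.g.
# "usb 1-2: reset high-speed USB device number 3 using xhci_hcd".
USB_RESET = re.compile(
    r"usb (\S+): reset .+? USB device number \d+ using (\S+)"
)


def count_usb_resets(lines: Iterable[str]) -> Counter:
    """Count reset messages per (port, host controller driver) pair."""
    counts = Counter()
    for line in lines:
        match = USB_RESET.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Calling `count_usb_resets(open("syslog.txt", errors="replace"))` returns a `Counter` keyed by port and controller driver. Many resets concentrated on the port that holds the flash drive would point toward that port or the drive, but the count alone does not tell the two apart.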

Link to comment
