Tithonius

April 1, 2023

19 minutes ago, hexfury said:

I'm no pro on this, but that kernel dump to me looks like a thermal failure. Your card (or motherboard?) overheated and got a bad memory call. The nvidia-smi called by your dashboard is what crashed, and then nginx died trying to query it.

This is definitely not a thermal crash. The card runs at a nice cool 45C all the time under full load transcoding. Also, to be clear, the server didn't crash, just the GUI.

I'm not overclocking anything, so the motherboard should easily be within thermal limits (it's just an i3-10100), and the CPU and the rest of the system have never had a thermal crash or overheating issue before.
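For anyone who wants to back a temperature claim like this with data, here is a minimal sketch that logs the GPU temperature so the last readings before a crash can be checked afterwards. The function names and the log location are made up for illustration, and it assumes a single GPU; `temperature.gpu` is a standard `nvidia-smi` query field.

```shell
# Sketch only: function names and the log location are hypothetical.

# Append one timestamped GPU temperature reading (in C) to a log file.
# Needs the Nvidia driver's nvidia-smi on the PATH; assumes a single GPU.
log_gpu_temp() {
    logfile=$1
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits) || return 1
    echo "$(date '+%F %T') $temp" >> "$logfile"
}

# Print the highest temperature found in a log written by log_gpu_temp.
# Each line looks like: "2023-04-01 10:00:00 45"
max_logged_temp() {
    awk '$3 ~ /^[0-9]+$/ && $3 + 0 > max { max = $3 + 0 } END { print max + 0 }' "$1"
}
```

Calling `log_gpu_temp` once a minute from cron, with a log file on persistent storage rather than in /var/log (which lives in RAM on Unraid), would leave a record that survives a hard lockup. `max_logged_temp` then gives the peak reading from that file.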

April 1, 2023

Okay, I'm back, and GOOD NEWS EVERYONE! I fixed the random hard crashes that I've been having! It seems that Unraid didn't like the "mismatched" RAM in my system (same model number, same timings, the works, but 2 of my 4 sticks had different PCB layouts). That being said, I'm not out of the woods yet. I had a good run of about 10 days in a row with no crashes, and now I'm back to where I was before all this started.

So now I am back to the issue where, every so often, I get home, hit refresh on my web UI, and am met with a 500 internal server error. But now I have proper logging set up, and was able to capture the error in a syslog. I think this shows what's going on (hopefully).

Would love if you guys could take a look. It does look to me like an nvidia issue, hence why I posted here.

Thanks again.

syslog-192.168.1.10.log

March 26, 2023

Okay, so my journey with my server crashing starts with installing a GPU. About a month ago I replaced the 1050 Ti in my gaming rig and decided it would be nice to put that GPU in my server. The server ran just fine for a few weeks, but then crashed. I thought, well crap, maybe Unraid doesn't like my GPU. So I uninstalled the Nvidia driver plugin (I was using the GPU for transcoding in Unmanic) and removed the GPU from the system. For a couple more days I didn't have any more crashes. Then out of nowhere, back in a known working config, the server crashed again.

So far I have replaced all of the RAM with new sticks (which passed multiple 14-hour memtests), replaced the USB flash drive with a new one, checked all connectors in the machine, and reseated my HBA card. I am running out of things to try or replace. After every crash, I can reboot the system and be back to a "working" machine with no problem.

As of now I have reinstalled the GPU and reinstalled the Nvidia plugin, as it seems to crash with or without it.

I have syslog running, but can't really see anything in the syslog after a crash. I have taken a few pictures of the screen output during a crash, but am unsure if that is helpful at all.

I'll attach everything here, but as of now I'm not sure what to even do going forward.

syslog.txt eos-diagnostics-20230326-1026.zip

March 25, 2023

Welp, we crashed again with the USB drive in a USB 3 slot... time to add a PCI USB card...

This is REALLY getting old... I'm ordering a new motherboard...

March 25, 2023

The diagnostics after the reboot:

eos-diagnostics-20230324-1921.zip

March 25, 2023

Okay, so... I had another crash, this time while the syslog was being mirrored to flash. At the bottom of that syslog you can see that the USB is getting reset over and over, so it seems like maybe those USB ports on my motherboard are bad? I had just replaced the flash drive with a new one, as I figured this might be the case. Is that what others see in this log as well?
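To put a number on "reset over and over", a small sketch like the following can count the reset messages in a saved syslog and group them by port. The function names are hypothetical, and the pattern assumes the usual kernel wording ("usb 1-5: reset high-speed USB device number 2 using xhci_hcd"); it is not taken from the attached log, so adjust it if the lines there look different.

```shell
# Sketch only: function names are hypothetical.

# Count kernel "reset ... USB device" messages in a saved syslog file.
count_usb_resets() {
    grep -cE 'usb [0-9.-]+: reset .*USB device' "$1"
}

# Show how many resets each USB device path (e.g. "usb 1-5") produced,
# most frequent first.
usb_resets_by_port() {
    grep -oE 'usb [0-9.-]+: reset' "$1" | sort | uniq -c | sort -rn
}
```

If nearly all resets come from one device path and that path follows the flash drive when it is moved to another port, the drive is the more likely suspect; if the resets stay with the port, the port or controller is.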

I really hope this is actually the issue...

syslog.txt

March 22, 2023

Welp, my server crashed again overnight after making the change to libtorrent v1... oof. My new flash drive is here, so I'm gonna replace that and do a fresh install of Unraid, just dragging over my config folder... see if that fixes this.

I'm so tired of crashing...

March 22, 2023

39 minutes ago, ich777 said:

It is very rare that HBAs have temperature monitoring anyways.

The HBA was only a guess. Maybe try not to transcode for a few days and see if the issue still occurs, and maybe uninstall the Nvidia Driver too.

You can only do step by step troubleshooting.

Sooooo I may have found what it might be...

https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/page/6/?tab=comments#comment-21671

This bug with libtorrent v2 in qBittorrent and Deluge is scarily close to the issues that I have been having. It explains the randomness of the crashes, and the logs people have been posting look very similar to the snippets of logs that I have been getting.

I did have the server crash while Tdarr was off and nothing was transcoding (that was at the time; I run Unmanic now). But this whole time I have had qBittorrent set to auto-start on server boot, so that would explain all of it.

I am still in hardcore troubleshooting mode, and if this does in fact fix the problem then I will come back and post my entire journey here for others as well. I'm also still gonna put the small fan on my HBA because I already made the Amazon order for the supplies 😛

Also, I just wanna say, you guys here have been so helpful. This most likely isn't even an issue with your software, and you are all doing everything you can to help, and I am so thankful for that. So, thanks.

March 21, 2023

10 hours ago, ich777 said:

It won't be, at least for home use.

Please keep me updated how things are going.

Are you sure that the HBA is cooled well and not getting too hot?

I had another crash today, same looking log on the screen. But no, I hadn't thought about the HBA getting too hot... It hasn't been a problem before, but then again, when I'm transcoding, I wonder if that's actually the problem... I wonder if there is a way to monitor that?

Edit: It doesn't look like my 9201-8i has temperature monitoring, so I am just gonna follow a Reddit guide for a good-looking 40mm fan mod and repaste it. Even if that's not the problem, it won't hurt to do.

March 21, 2023

1 hour ago, ich777 said:

I think you are running Tdarr or am I wrong?

Back in the day Tdarr was notorious for crashing servers but I think they fixed that.

Maybe try to boot into safe mode if possible and see if the server crashes there too.

You can also try to uninstall the Nvidia Driver plugin and see if that helps (of course you have to reboot to fully uninstall it).

So this time, before the last crash, I switched to Unmanic per your recommendation, and I like it better. Tdarr isn't even installed... I also thought to myself, what has changed since I started getting crashes? Then I realized that I moved my HBA down one x16 slot to install the GPU, and that meant that the HBA was only running at x4 speed. It "shouldn't" matter, but some people online were saying it might. So last night I swapped the GPU and HBA on the motherboard. I don't care if the GPU is in an x4 slot, as transcoding isn't any slower, and the HBA can have a full x16 bandwidth slot. We are good without crashes overnight for now, but I do still have all my monitoring in place to keep an eye on it.

I also ordered a new flash drive off the recommended list from Limetech; it should be here soon, just in case that's my issue. It's time for a flash drive change anyway, as this one is pretty old.

March 21, 2023

Also, when I rebooted this time my flash drive was not detected... I'm backing it up now on a Windows machine, but I wonder if something is just wrong with my flash drive.

March 21, 2023

On 3/14/2023 at 1:32 PM, ich777 said:

No, you don't need to run anything, the Kernel Panic is displayed automatically.

No, what do you want to see, you are seeing the entire syslog...

Leave it as it is and it will display the Kernel Panic if one occurs.

If you are not on the Dashboard very unlikely...

Could be also the cause of the issue, back in the days Tdarr was notorious for crashing servers, that's why I recommend to use Unmanic.

Okay, so I finally got a crash to happen while I had a screen attached and this is the output that I can see:

Obviously this isn't the whole crash, but does this give any info into what's going on?

March 14, 2023

How possible is it that it's the GPU statistics plugin causing this crash?

That's the only thing that has been new since these crashes have been happening...

Also, is it possible that Tdarr is causing the crash?

March 14, 2023

11 hours ago, ich777 said:
I completely forgot that you have to execute this command so that the screen doesn't go blank (or, better said, go to sleep):
setterm --blank 0
You don't have to be logged in, just log in once, execute the command and maybe log out again (but you don't have to log out strictly speaking).

Okay, so last night I set the pcie gen to 3 instead of auto, reseated the GPU in the slot, and reseated all the memory. It passed a 13 hour memory test with 0 errors, and seems to be okay. When I woke up again this morning I saw that it had hard crashed again... The screen went to sleep so when I get home from work today I'll set the screen to not sleep like you said and see if the console can show me what's going on. I assume the command I want to run is the "tail -f /var/log/syslog" command yeah? But that would seem to just show me what the syslog shows, and the syslog didn't have anything last time..

Or is there a command that will show me a more verbose output? I wouldn't be opposed to having the most info possible if that's a thing.

What command should I be running to see the console output?

I also set the docker network to IPVLAN and restarted docker, but didn't reboot after that change. Should I have rebooted?

On the plus side the GPU is showing at a x16 lane now so that's progress I guess... Still crashing though.

March 14, 2023

3 hours ago, ich777 said:

This is caused because your server is crashing really hard and even the syslog server won't work anymore when the crash happens. Have you tried connecting a monitor to the GPU yet to see the console output? The only thing you can really do to capture what's happening is to connect a screen and wait for a crash to happen, then take a picture of the output.

Okay, so I have a screen attached now and can see the login prompt. I assume that I just log in to the non-GUI console on the screen and then leave it? Or do I have to make it show me console output somehow?

March 13, 2023

The GPU was pulled from a working gaming rig, has never been abused, and doesn't seem to be failing. Its transcodes all look great, and I'm having no other issues with it while it is working.

The card is also only showing as being on a Gen 3 x2 connection in GPU Statistics even though it is installed in a Gen 3 x16 slot... could that be pointing to a potential problem? The only other card in the system is my LSI HBA, so it can't be a lanes issue with the CPU; there should be plenty available. I may try reseating or moving the GPU to a different slot, but I think that may be the only slot on that board it will fit in.
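One way to double-check what GPU Statistics reports is to ask the driver directly and compare the current link width with the maximum. The sketch below does that; the function name is hypothetical, while the `pcie.link.*` query fields are ones `nvidia-smi --help-query-gpu` lists. Many cards drop the link generation at idle to save power, so the sketch only flags the width.

```shell
# Sketch only: the function name is hypothetical.

# Input: one CSV line "gen_current, gen_max, width_current, width_max",
# as printed by the nvidia-smi query shown in the usage comment below.
check_pcie_link() {
    printf '%s\n' "$1" | awk -F', *' '
        NF == 4 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
            if ($3 + 0 < $4 + 0) { print "degraded: x" $3 " of x" $4 }
            else                 { print "ok: x" $3 }
            next
        }
        { print "unparsed: " $0 }'
}

# Usage on the server (single GPU assumed):
#   check_pcie_link "$(nvidia-smi \
#     --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#     --format=csv,noheader)"
```

A result such as `degraded: x2 of x16` in a slot that is wired for x16 would point towards seating, dirty contacts, or a slot that shares lanes with another device, which is worth ruling out before blaming the card itself.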

I guess if it passes the memtest that is running atm then I'll try reseating the GPU to see if that's a problem. I already replaced all the memory in the system and upgraded to 64GB, so it's not a memory space issue, and I replaced all the RAM after these problems started occurring as a troubleshooting step. I am running a memtest just to be sure, but early signs are showing that the memory is fine.

I'm kinda at my wits end, and really am having trouble as I have never had problems with this machine until I added the gpu. It's super discouraging...

When I get home from work hopefully the memtest will be done and I'll boot the system again and do another diagnostic as well as post my syslog just for y'all to look over. I'm also going to make sure that some idle power stuff in bios isn't the problem as I see that can be an issue for others here.

idk tbh I'm kinda just guessing here...

March 13, 2023

So my problem has progressed now. As a recap, I have a 1050 Ti installed in my server with the latest updates to everything: all drivers, OS updates and such. I have a problem where the server will work fine for days at a time, then randomly crash. I used to just have the GUI crash, but now I am getting entire server lockups. I enabled syslog logging earlier, but when the problem happened again I didn't see anything in the syslog that was concerning, and the helpful peeps over on the Unraid Discord didn't see anything either. The diagnostic doesn't look any different than it did last time... I can't get a diagnostic while it's crashing anymore due to the server being hard locked. When I reboot and do diagnostics everything goes back to being normal. This has only been happening since I added the GPU to the system, so it has to be related to this somehow. Is there another way for me to get different logs or something that will help me narrow down the issue?
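When a syslog does survive the reboot (for example one mirrored to flash or sent to a remote syslog server), the interesting part is whatever was logged just before the machine came back up. Here is a minimal sketch that prints those lines. The function name is hypothetical, and it assumes each boot starts with the kernel's usual "Linux version" banner line.

```shell
# Sketch only: the function name is hypothetical.

# Print the N lines (default 20) logged just before the most recent boot
# banner in a syslog file that spans at least one reboot.
lines_before_last_boot() {
    file=$1
    n=${2:-20}
    # Line number of the last boot banner; empty if the file has none.
    boot=$(grep -n 'kernel: Linux version' "$file" | tail -n 1 | cut -d: -f1)
    [ -n "$boot" ] || return 1
    [ "$boot" -gt 1 ] || return 0
    if [ "$boot" -gt "$n" ]; then start=$((boot - n)); else start=1; fi
    sed -n "${start},$((boot - 1))p" "$file"
}
```

If those last lines are ordinary housekeeping messages with nothing unusual, that in itself is a hint: a hard lockup often leaves no trace in the log, which is why watching the physical console is the usual fallback.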

March 8, 2023

Okay, did that, but would that really be the reason for the 500 internal server error?

Also, just to clarify, I have uninstalled the app, rebooted, reinstalled the app, and rebooted again, and I haven't had the error since, so maybe it was just a weird thing. But I was just curious if there was anything in the diagnostic that was alarming.

March 8, 2023

How would I delete the bind?

March 8, 2023

46 minutes ago, alturismo said:

Maybe delete the vfio bind. It looks like it is left over from another device for which you had a vfio bind active for VM passthrough (possibly another GPU from earlier times?).

A vfio bind should only be active for devices which you want to pass through to a VM...

I did have an AMD GPU I did some testing with, but never anything more than plugging it in, really.
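To see which binds are still configured, a sketch like this can list them. It rests on an assumption: that the binds live in /boot/config/vfio-pci.cfg as a single `BIND=` line of space-separated `address|vendor:device` entries, which is how recent Unraid releases appear to store them. Check your own file before relying on it. The function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical, and the file location
# and BIND= format are assumptions about how Unraid stores vfio binds.

# Print one configured vfio bind per line.
list_vfio_binds() {
    cfg=${1:-/boot/config/vfio-pci.cfg}
    [ -f "$cfg" ] || { echo "no vfio config at $cfg"; return 0; }
    sed -n 's/^BIND=//p' "$cfg" | tr ' ' '\n' | sed '/^$/d'
}
```

An entry whose address no longer matches any installed device is a stale leftover. Unticking it on the System Devices page and saving is probably safer than editing the file by hand; a reboot is needed either way for the change to take effect.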

March 8, 2023

In lspci.txt I can see the GPU there.

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
    Subsystem: ASUSTeK Computer Inc. GP107 [GeForce GTX 1050 Ti] [1043:85d1]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

But I also see in the log that it's unable to bind device? I'm confused.
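A quick way to confirm which driver actually owns the card is to pull the "Kernel driver in use" field out of `lspci -k`. The sketch below does only that; the function name is hypothetical, and the slot address is the one from the lspci output above.

```shell
# Sketch only: the function name is hypothetical.

# Print the kernel driver bound to a device, given `lspci -k` output on stdin.
driver_in_use() {
    awk -F': ' '/Kernel driver in use:/ { print $2; exit }'
}

# Usage on the server:
#   lspci -k -s 01:00.0 | driver_in_use
```

If this prints `nvidia`, the card is bound to the Nvidia driver as expected, and an "unable to bind" message in the log most likely refers to a different device or a leftover bind entry. If it printed `vfio-pci`, the card would be reserved for VM passthrough and unavailable to the plugin.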

March 8, 2023

This is a diagnostic from just now, with the plugin installed, and everything working.

eos-diagnostics-20230307-1941.zip

March 8, 2023

I have a 1050ti installed... And the latest driver installed. I'll do another diagnostic and post it, I took that while a crash was happening and I'd be willing to bet that might be why that was missing. The igpu in the system is not nearly as fast as the 1050ti at least in my testing. I have almost 30TB of data I need to transcode through, so I was looking for the higher fps while transcoding.

March 7, 2023

I have been having some issues recently with the plugin. From time to time I come home and find that my server GUI is showing a 500 internal server error. SSH into the server works fine, and all dockers and VMs are running, but the web GUI just fails to load. If I SSH in and run "/etc/rc.d/rc.php-fpm restart" the GUI will come back, sometimes for a short time and sometimes for a while, but eventually I find myself back at the 500 internal server error. I have this plugin as well as GPU Statistics by b3rs3rk installed. The other weird thing is that when I do get the GUI back, the statistics on the main page for my GPU are all blank.
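As a stopgap while the root cause is unknown, the manual restart described above could be automated. This is a minimal sketch, not a fix: the function names and the logger tag are hypothetical, it assumes the GUI answers on http://localhost/ from the server itself, and it uses the same restart command as above.

```shell
# Sketch only: function names and the logger tag are hypothetical.

# Succeed (exit 0) only for HTTP 5xx status codes.
is_server_error() {
    case "$1" in
        5[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# Probe the web GUI and restart php-fpm if it answers with a 5xx.
# Meant to be run on the server itself, e.g. from cron.
check_webgui() {
    url=${1:-http://localhost/}
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url") || code=000
    if is_server_error "$code"; then
        logger -t webgui-watch "web GUI returned $code, restarting php-fpm"
        /etc/rc.d/rc.php-fpm restart
    fi
}
```

The `logger` line matters more than the restart: it timestamps each failure in the syslog, which makes it easier to line the 500 errors up against whatever else was logged around the same moment.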

I have an ASUS Strix 1050 Ti installed, with the latest official driver installed through the plugin. I only use the GPU for Tdarr transcoding, and have had no issues while the GPU is doing that task. The output files also look fine, so I don't think that my GPU is dying. Any thoughts?

eos-diagnostics-20230101-1425.zip

