BomB191

Members
  • Posts: 104
  • Joined
  • Last visited
  • Days Won: 1

Posts posted by BomB191

  1. On 2/23/2023 at 9:46 PM, juan11perez said:

    Thank you for sharing this info. I also have the Asus 470 and have had the problem for months.

    I ended up installing a fan controller.

     

    I'll try this fix

     

    Edit: so I've added the fix. The Asus sensors are gone, but I also lost CPU fan speeds etc.

    Is this the case for others?

    Yes, this removes Unraid's ability to see or touch any of that. I personally figure not knowing the temps is better than knowing my fans will randomly shut off and cause thermal shutdowns.

     

    Just set the BIOS to the fan levels you need and call it a day.

  2. On 2/22/2023 at 9:28 AM, general18 said:

      

     

    Do you put anything in the file or what?

    I've tried putting the text from Reddit inside the disable-asus-wmi.conf file (plain TXT format), i.e. this text: # Workaround broken firmware on ASUS motherboards / blacklist asus_wmi_sensors

     

    I then rebooted the server and went to bed with a noisy server. When I woke up, it was completely silent and was running around 85-95 °C on a 3800X.

    Motherboard is: ASUSTeK COMPUTER INC. PRIME X470-PRO

     

    And to be 100% sure: Can I just add the file to the USB drive through the share, or do I have to take the USB drive into my PC and do it that way?

    [screenshot]

     

    Just open your flash share and go to *\flash\config\modprobe.d

     

    Create whatever file you like with a .conf extension; mine is "disable-asus-wmi.conf".

     

    In that file, paste:

     

    # Workaround broken firmware on ASUS motherboards
    blacklist asus_wmi_sensors

     

    Reboot, and the random fan stops are resolved.

    I'm not the right person to explain exactly what it does or how it works, but I assume it just stops Unraid/Linux from touching that fan controller.
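
    If you'd rather do it from the Unraid terminal instead of over the share, something like this should create the same file (a rough sketch; it assumes the flash drive is mounted at /boot, which is where Unraid keeps it):

    # create the modprobe config on the flash drive so it survives reboots
    mkdir -p /boot/config/modprobe.d
    printf '%s\n' '# Workaround broken firmware on ASUS motherboards' 'blacklist asus_wmi_sensors' > /boot/config/modprobe.d/disable-asus-wmi.conf
    # then reboot so the module is never loaded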

     

     

     

     

  3. On 1/27/2023 at 7:08 PM, ich777 said:

    I really don't know what causes that issue because I can barely reproduce it, and I'm pretty much clueless about what it could be since Unraid runs from RAM and the package is also installed on boot, so basically everything is installed fresh on each reboot.

     

    I can only imagine that something in Docker prevents it from running properly because that's the only place in the chain where something is stored across reboots.

     

    Anyways, glad that everything is now working for you again!

    Hey I found the root cause!

    Krusader grabs something, locking out /proc/sys/kernel/overflowuid.

    I haven't dug into it too much, but basically if I run Krusader I cannot restart anything that uses the GPU without a system reboot.

     

    This is only with the one from your repository; the binhex one is OK even when it has root privileges. (I did have yours on root privileges with the advised settings.)
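
    If anyone wants to dig further, one way I'd check what is actually holding that file open (just a guess at a useful check; lsof may need to be installed first, as I'm not sure it ships with stock Unraid):

    # show which processes currently have the file open (run as root)
    lsof /proc/sys/kernel/overflowuid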

  4. 58 minutes ago, ich777 said:

    What did you change to fix the fan issues that you were having? Is there maybe a setting that you've changed that could affect the Nvidia driver?

    It was a fan issue with Asus X470 boards: the fan controller driver that's now in the base Unraid/Linux image bugs out, causing all fans to stop after a random amount of time, usually within a week or so. I have a file in 'config\modprobe.d' named 'disable-asus-wmi.conf'

    with the text:

    '# Workaround broken firmware on ASUS motherboards
    blacklist asus_wmi_sensors'
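
    To confirm the blacklist actually took effect after the reboot, a quick sanity check from the terminal (just how I'd verify it, nothing required):

    # should print nothing if the module was kept from loading
    lsmod | grep asus_wmi_sensors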

     

    1 hour ago, ich777 said:

    This happened once to me, but a force update of the container fixed the issue for me back then on my test server.

    Back then I also tried downgrading to a previous version of Unraid and upgrading again, but I couldn't reproduce it.

    So, an update while I was awaiting a reply.

    I downgraded back to Unraid 6.10.3, tried the usual stuff, and failed.

    Then I updated back to 6.11.5, did the usual thing, and now it works. I have no idea why, but it has persisted through several reboots, and transcoding now works on the GPU.

     

    Best guess: something just got stuck initially and fixed itself when I went through the downgrade/upgrade cycle.

  5. 6 hours ago, ich777 said:

    What exactly do you mean by that? Do you have nvidia-persistenced enabled? If so, you can kill it by running:

    kill $(pidof nvidia-persistenced)

    from an Unraid terminal.

    I enabled it, then killed it. (Someone else was having power-state issues.)

     

    6 hours ago, ich777 said:

    Can you please double-check that the UUID of the GPU matches the one in the template? Can you also maybe post a screenshot of the container template so that I can see which parameters you've added for the Nvidia driver to work?

    Please note that on most newer driver versions the key "all" for the GPU UUID causes issues, and you should always put in the UUID of your card.

    Also please add a variable, as described in the second post of this thread, with the Key: "NVIDIA_CAPABILITIES" and as Value: "all"

    This should fix the issue

    [screenshots of the container template]

     

    Still failed.

    [screenshot of the error]

     

    Tried uninstalling and then reinstalling as per the instructions, and this also failed.

    Same with a force update.

    [screenshot]

     

     

    The weird thing is everything else works.

    [screenshot]

    The driver picks it up OK and so does nvidia-smi (I did try downgrading versions).

    [screenshot]

     

    Same issue with the Plex docker from Plex.

    I would expect this to work, but something's weird :(

     

    Update: I think it's something to do with /proc/sys/kernel/overflowuid. It's weird; even logged in as root I don't have permission to delete it or modify its permissions.

  6. I think I'm finally stuck.

     

    Updated to 6.11.5 (after some system fan issues) and updated the Nvidia driver to 525.85.05.

     

    I can see it in the plugin, but I cannot use '--runtime=nvidia'; it just fails with a bad parameter, or if I check it I get the below:

    [screenshot of the error]

     

    I have tried a full uninstall/reinstall with reboots, changing versions, etc. It did work once when I was looking around at other people's issues and ran 'nvidia-persistenced' then 'kill $(pidof nvidia-persistenced)', but now it just won't take at all.

     

    I'm sure it's something silly like it always is, haha. Any help would be huge. Diagnostics attached :)

    tower-diagnostics-20230127-0816.zip
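
    For anyone else hitting this, a couple of terminal checks that should show whether Docker registered the nvidia runtime at all (general Docker commands, nothing Unraid-specific; the cuda image tag is just an example of a small image that includes nvidia-smi):

    # list the runtimes Docker knows about; 'nvidia' should appear here
    docker info --format '{{json .Runtimes}}'
    # quick end-to-end test of the runtime against the driver
    docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi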

  7. On 12/26/2022 at 2:59 PM, Minijuice said:

    Only up for 12 hours but so far we're stable. 

     

    I have yet to have a random fan stop/power-down since downgrading to 6.10.3; it would have happened by now. And with, I think, 4 other people who have spoken up with the exact same problem and the exact same resolution, it has to be something related to Unraid and the airflow monitor. There's no way this many people on the same-ish hardware are all having hardware problems that go away after downgrading.

  8. 25 minutes ago, ConnerVT said:

     

    Your hardware *not* shutting down when the fans stop?  (Sorry, a moment of levity seemed needed)

     

    Have you asked around in the ASUS forum?  Perhaps it is more widespread than Unraid (such as across several distros/kernels)

     

    Keep cool.

     

    Ha, well yes, that's very true. The magic smoke would be released if that happened.

     

    I did have a quick Google, but most of what I found was rookies not explaining it too well, plus a ton of recommendations to get AI tuner running on the OS (Asus fan controller software), so kinda useless for us, unfortunately.

  9. 9 hours ago, Haldanite said:

    This has me thinking that there might be a common denominator with the MB. The BIOS on my ROG STRIX X470 is version 5220 from 9/12/2019. There have been several updates since this version and nothing specifically noting any issues with the system fans shutting down, but it might be worth trying to update to the latest BIOS version 6042, upgrading Unraid back to 6.11.5, and seeing what happens. Not really a fan of knowing that my system shuts down due to hitting high temps; I really don't want to do that too many times.

    So that's exactly what I did after discovering it was the fans: latest BIOS and latest Unraid. The fans still shut off randomly; I got 18 hours and 26 hours.

    I have done the same as you and rolled Unraid back to 6.10.3. It's only been 9 hours so far.

    I also couldn't think of anything worse for my hardware than having to shut down due to temperature protections. So now we wait.

     

     

    Interesting: so we have 3 servers on Asus ROG STRIX X470-F Gaming motherboards, all with intermittent total fan failure.

     

    I'm running 6.11.5 and had the issue on 6.11.2, and I think the version before that too. (I initially thought I had RAM issues, before I saw the fans stop in real time; the logs show nothing.)

    It would be one hell of a coincidence for 3 motherboards to have intermittent fan controller problems. My wild guess is maybe some weird power management issue between Unraid and the BIOS.

     

    Do keep us posted on 6.10.3. I have managed to go anywhere from 18 hours all the way up to 1.5 weeks.

  11. 10 minutes ago, ich777 said:

    What did you do exactly?

    Did you remove the entry with 'all' or did you remove your UUID?

    Yes, when I went to create a fresh container I noticed it under 'Show more settings'.

    So on my container I had 2x NVIDIA_VISIBLE_DEVICES: one with my GPU and one with 'all' in the field.

    I deleted the variable I created and used the one already in the container, setting it to my GPU.

    So the container now has the below in the settings regarding the GPU:

    [screenshots of the container settings]
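
    For anyone else who hits this, the duplicate variable is also easy to spot from the terminal (using the unmanic container from earlier as an example; swap in your own container name):

    # print every environment variable the container was created with;
    # there should be exactly one NVIDIA_VISIBLE_DEVICES entry
    docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' unmanic | grep NVIDIA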

  12. 5 minutes ago, ich777 said:

    No; have you yet tried changing a value in the container to basically recreate it on your server and see if that fixes your issue?

     

    Can you maybe post your Diagnostics?

    I require a dunce hat for tonight.

     

    I went to make a new container and noticed these 2 parameters hiding under 'Show more settings'.

    [screenshot]

     

    Figures it would be something extremely stupid. I didn't even contemplate checking in there.

    The disappointment in myself is immeasurable.

     

    TIL: check 'Show more settings ...' :( Sorry for wasting your time, and thank you immensely for the assistance.

  13. 48 minutes ago, ich777 said:

    Do you have nvidia-persistenced enabled? If yes, please disable it with:

    kill $(pidof nvidia-persistenced)

     

    and try it again after you've disabled it.

     

    Do you have all the variables in the Docker template as described in the second post of this thread?

    After running 'kill $(pidof nvidia-persistenced)'

    I get the same error:

    docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: device error: false: unknown device: unknown.

     

    I can also confirm both required variables are in the Docker template:

    Key: NVIDIA_VISIBLE_DEVICES
    Value: GPU-9ef5c7e3-966f-cd37-8881-73507c0b7e0a

    Key: NVIDIA_DRIVER_CAPABILITIES
    Value: all

     

    This is in the Unmanic container. I assume I'm not yet at the point where the container itself is having issues with it.
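
    For context, those two variables plus the '--runtime=nvidia' extra parameter end up being roughly equivalent to a docker run like the one below (a simplified sketch, not the exact command Unraid generates; <image> stands for whatever repository the template points at):

    docker run -d --name=unmanic \
      --runtime=nvidia \
      -e NVIDIA_VISIBLE_DEVICES=GPU-9ef5c7e3-966f-cd37-8881-73507c0b7e0a \
      -e NVIDIA_DRIVER_CAPABILITIES=all \
      <image>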

     

    I am on Version: 6.10.3; should I hop onto 6.11.0-rc3?

  14. 3 hours ago, ich777 said:

    A user already had the same issue, and I was able to reproduce this on my test server while up- and downgrading the Unraid version.

     

    I was able to solve it here:

     

    Please also look at the following post, since the user had many packages installed from the NerdPack, and in his case it seems that this caused the issue.

     

    ...please report back if you got it working again.

    Unfortunately I attempted those fixes before posting.

     

    The only Nerd Pack item I had installed was Perl (can't even remember what I installed it for, to be fair).

    But it has all been removed completely and the server rebooted. I also tried reinstalling the driver after this - same result.

     

    nvidia-persistenced on the command line is accepted, but no change.

     

    NVIDIA_VISIBLE_DEVICES is where I think my issue might be.

    I'm copying the information from:

    [screenshot]

    Confirmed there are no spaces; tried re-copy-pasting.

     

    Correct "GPU-9ef5c7e3-966f-cd37-8881-73507c0b7e0a"

    Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: device error: false: unknown device: unknown.

     

    incorrect "asfa" I triel 'all' also as i saw that somewhere when I was searching.

    Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: device error: false: unknown device: unknown.

     

    The item is as per the instructions in the first post.

    [screenshot]

     

    NVIDIA_DRIVER_CAPABILITIES, however, spits out a different error when I set it to 'some':

     

    Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    unsupported capabilities found in 'some' (allowed ''): unknown.

     

    With the correct 'all' I get:

    Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: device error: false: unknown device: unknown.

     

     

    My final attempt was:

    put '--runtime=nvidia' in the Extra Parameters,

    fail the 'save/compile',

    go back in, edit the template, and re-paste the 'GPU-9ef5c7e3-966f-cd37-8881-73507c0b7e0a'.

    It failed with the same NVIDIA_VISIBLE_DEVICES error as above.
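
    For anyone wanting to rule out a UUID typo, the UUID can also be pulled straight from the card in the terminal rather than copied from the plugin page (the model name below is just a placeholder):

    # prints each GPU with its UUID; that UUID goes into NVIDIA_VISIBLE_DEVICES
    nvidia-smi -L
    # example output: GPU 0: <model name> (UUID: GPU-9ef5c7e3-966f-cd37-8881-73507c0b7e0a)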

     

  15. On 8/3/2022 at 8:23 PM, ich777 said:

     

    On 8/3/2022 at 7:43 PM, Scootter said:

    First, thanks for all the work you do to provide this plugin! I have suddenly started having issues with any Docker container that uses --runtime=nvidia. I first noticed after a system reboot and saw that Plex had not started up. When I tried to start it I immediately got "Execution Error, Bad parameter".

    This usually indicates that the runtime is not working properly, and it is also logged in your syslog.

     

    What packages have you installed from the Nerd Pack? I can only imagine that you have something installed that is interfering with the Nvidia Driver.

    Have you changed anything in your system recently, be it hardware or software (Docker, plugins, ...)?

    So I appear to be having this issue.

    It's a fresh install though, so it has never worked before.

    I just uninstalled Nerd Pack and rebooted.

     

    I'm getting:

    docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: device error: false: unknown device: unknown.

    I'm sure I'm missing something.

     

    Done the usual reboot/reinstall etc. (Initially I was having VFIO problems; the system used to pass it through, now it doesn't.)

    The driver shows the below and appears to be A-OK:

    [screenshot]

     

    GPU stats are getting pulled correctly too:

    [screenshots]

     

    I'm like 99% sure I'm missing something dumb. What logs would you need?

     

    Edit: also confirmed these are OK too:

    [screenshot]
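
    For the record, the quick checks I'd run from the Unraid terminal before blaming the container itself (nothing fancy, just confirming the driver and device nodes are actually there):

    # driver loaded and the card visible to the host
    nvidia-smi
    # the device nodes that get passed through to containers should exist
    ls -l /dev/nvidia*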