[Support] Data-Monkey - netdata



I really like the new Netdata Cloud feature. I signed in, claimed my node, and ran the script they provide for Docker. But whenever the container gets an automatic update, the claimed node becomes unreachable and I have to go through the claiming process over, and over, and over. How can I configure Netdata so its name persists, so that I can get updates but also keep it synced with their cloud?

Link to comment

Started getting fork errors and tons of odd behavior... turned out something was creating a massive amount of processes.

That something is netdata :/ I've run this for months with no issue, but since the switch it seems things aren't working the same.

Note that I'm using the default config and haven't touched any settings.

 

User 201 is netdata, and I waited 5 seconds between the two commands below. If I leave this going, it just keeps creating processes until my machine starts to throw errors after a few days.

 

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
      2 100
      2 daemon
      2 message+
      3 ntp
      4 102
      4 avahi
      4 rpc
      5 103
     27 472
     33 101
    168 nobody
    194 sshd
    260 201
   1677 root

 

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
      2 100
      2 daemon
      2 message+
      3 ntp
      4 102
      4 avahi
      4 rpc
      5 103
     27 472
     33 101
    168 nobody
    194 sshd
    352 201
   1672 root

 

I checked what the processes are and I see hundreds of these:

201      12273  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12470  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12711  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12895  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13054  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13235  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13415  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
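The `[timeout] <defunct>` entries above are zombie processes (state `Z`), which stay in the process table until their parent reaps them. A minimal sketch for finding which parent is leaving them behind, assuming a procps-style `ps` on the host; the function name `count_zombies_by_parent` is made up for illustration:

```shell
#!/bin/sh
# Sketch: count zombie (defunct) processes per parent PID.
# Expects lines of "PPID STAT COMMAND" on stdin.
count_zombies_by_parent() {
  # Keep only rows whose state starts with Z, tally them per parent PID,
  # then print "count ppid" with the worst offender first.
  awk '$2 ~ /^Z/ { n[$1]++ }
       END { for (p in n) print n[p], p }' | sort -rn
}

# Usage on the host (not run here):
#   ps -e -o ppid= -o stat= -o comm= | count_zombies_by_parent
```

The parent PID at the top of the output can then be looked up with `ps -p <pid> -o args=` to see which program is not reaping its children.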

 

I'm not sure where netdata logs are so I don't know what it's trying to do that keeps spinning... any thoughts?

Link to comment

I have come across the same issue as above. I originally posted in the Dynamix forum because of the errors that I saw. They looked at my diagnostics and saw the same process issue that you are seeing. This is what I originally posted there:

Quote

Hello, I hope this is the right place to post this. I have searched and have been unable to find a solution. In the last few days I added a second cache drive, identical to the existing one. I added this to create a cache pool. Since then I noticed occasional weird messages in my email that don't seem to make sense:

Subject is: 

cron for user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null

and the body consists of:

/bin/sh: fork: retry: Resource temporarily unavailable

 

Typically I get several in a row and then they stop for 12-24 hours. If I leave them, it seems to only get worse, and the server has become unresponsive twice in the last few days. I was able to reboot it from the GUI once, but the second time I had to do a hard reboot. I tried uninstalling and reinstalling the SSD Trim plugin, but that did not seem to make a difference. The server came back up without issue and the errors seemed to be cleared, but about 24 hours later they started happening again. Everything seems to be working OK otherwise, and I am not sure what is causing this. One thought I had is that one of the cache drives is on an HBA and the other is connected directly to the motherboard; I'm not sure if that would make a difference. I have attached the diagnostics. Let me know what you guys think. The server has been running great otherwise and I have really been enjoying UNRAID. Thanks for the support!

And then one of the guys there replied:

Quote

Hmm, your diagnostics show that you have a netdata container that is not properly reaping its finished processes:


201      15538  0.2  0.0  33416 22352 ?        SNl  03:01   2:23  |   |       \_ /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
201        300  0.0  0.0      0     0 ?        ZNs  05:34   0:00  |   |       \_ [timeout] <defunct>
201        301  0.0  0.0      0     0 ?        ZNs  06:28   0:00  |   |       \_ [timeout] <defunct>
201        302  0.0  0.0      0     0 ?        ZNs  04:32   0:00  |   |       \_ [timeout] <defunct>
  ... snip ...
201      32766  0.0  0.0      0     0 ?        ZNs  09:32   0:00  |   |       \_ [timeout] <defunct>
201      32767  0.0  0.0      0     0 ?        ZNs  08:54   0:00  |   |       \_ [timeout] <defunct>

So your server is running out of process ids to run new processes. You should check with the support thread for the netdata container you are running

I have attached my diagnostics if you want to take a look. Going to turn off the netdata container for now and see if I am seeing any more of these issues. Thanks in advance for the support!

hulk-diagnostics-20200519-2143.zip
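The PID exhaustion described in the quoted reply can be checked directly. The listing there reaches PID 32767, which is consistent with the common Linux default `pid_max` of 32768. A rough sketch, assuming a Linux host with `/proc` available; `pid_headroom` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report how many process IDs are in use versus the kernel limit.
# Zombies still hold a PID, so they count as "used" here.
pid_headroom() {
  used="$1"   # number of PIDs currently taken (processes and threads)
  max="$2"    # kernel limit, normally read from /proc/sys/kernel/pid_max
  echo "$used of $max PIDs in use, $((max - used)) free"
}

# Usage on the host (not run here):
#   pid_headroom "$(ps -e -L -o lwp= | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

When the free count approaches zero, new processes fail with the "fork: retry: Resource temporarily unavailable" message shown above.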

Link to comment
9 hours ago, primeval_god said:

@melmurp @muslimsteel You might want to consider raising an issue over at https://github.com/netdata/netdata/ where the developers of Netdata and the Docker container reside. 

 

9 hours ago, muslimsteel said:

@primeval_god Thanks, looks like they might have an issue open for this same thing on GitHub:

https://github.com/netdata/netdata/issues/9084

 

Thanks guys, it seems they have fixed it, so we just need to wait for the next release:

https://github.com/netdata/netdata/pull/9107

Link to comment

I'm trying to get notifications to work, following the instructions here:

  1. https://hub.docker.com/r/titpetric/netdata
  2. https://learn.netdata.cloud/docs/agent/step-by-step/step-05

I have added my target email using "./edit-config health_alarm_notify.conf" for (1).

 

And added the following parameters (with my emails) for (2):

 

-e SMTP_TO=user@gmail.com -e SMTP_USER=user -e SMTP_PASS=password

 

And I have generated an app password for the sending gmail account and am using that above. I get:

 

# SENDING TEST CLEAR ALARM TO ROLE: sysadmin
2020-05-23 11:24:44: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
sendmail: can't connect to remote host (127.0.0.1): Connection refused

I am sure I am doing something silly - Thanks!

Link to comment
  • 2 weeks later...
On 6/2/2020 at 9:19 PM, BeerNut said:

Does anyone have UPS communication working on this new version?

According to the docs, Netdata needs certain UPS tools installed. Can you check if something like `upsc -l` works on the system? If it does but still no UPS charts are created, please submit a bug about it on GitHub with the relevant details.

Link to comment

Hope you all are finding Netdata easy to use, despite some of the changes to the official Docker image. If you are running into problems, you will get the fastest response if you submit an issue (or feature request) on our GitHub repo and provide the relevant logs and info. We try to respond as needed, but other channels take longer! Feedback and pull requests are also much appreciated :)

So let me know if you have any questions about Netdata. My turn to ask questions will be a few months from now once I've had enough with my dual-boot setup and have some time to tinker with hardware.

- Zack, dev advocate with Netdata

Link to comment
  • 4 weeks later...
On 5/15/2020 at 2:25 PM, Boldly_Goes said:

I really like the new Netdata Cloud feature. I signed in, claimed my node, and ran the script they provide for Docker. But whenever the container gets an automatic update, the claimed node becomes unreachable and I have to go through the claiming process over, and over, and over. How can I configure Netdata so its name persists, so that I can get updates but also keep it synced with their cloud?

Did you manage to resolve this?

Link to comment
  • 1 month later...

I have noticed that netdata is not saving all the logs / graphs. It will save the last few hours but if I keep zooming out it will have large gaps where all the graphs are blank but then possibly days before the graphs will pop back into existence, then disappear again etc.

 

The server has been on the whole time and the docker running non-stop with no issues I am aware of.

 

Any ideas on how to get it to keep all the graphs/logs?

Link to comment
  • 4 weeks later...
1 hour ago, TexasUnraid said:

I did more research and it seems that the docker is set up to use RAM for the logs, and for some reason the dbengine option is disabled?

 

I tried to enable it, but it just errors and says it is not supported on this platform? Any way to enable it? I would like to keep a few days of logs if possible.

It works in the netdata/netdata image. You also have to bind mount /var/cache/netdata/dbengine/ to a folder in your appdata to ensure it persists across container upgrades.
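A sketch of the bind mount described above, expressed as the extra `docker run` argument. The host path `/mnt/user/appdata/netdata` and the helper name `dbengine_mount_arg` are assumptions for illustration; only the container path comes from the post:

```shell
#!/bin/sh
# Sketch: print the extra docker argument that maps the dbengine folder
# to a host directory so the database survives container upgrades.
dbengine_mount_arg() {
  # Host-side folder; the default below is an assumed Unraid appdata path.
  appdata="${1:-/mnt/user/appdata/netdata}"
  printf '%s\n' "-v $appdata/dbengine:/var/cache/netdata/dbengine"
}

# Usage (not run here): add the printed argument to your existing
# docker run command or container template, e.g.
#   dbengine_mount_arg /mnt/user/appdata/netdata
```

In the Unraid template editor the same mapping would be entered as a Path with the container path `/var/cache/netdata/dbengine` and the host path of your choice.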

Link to comment
1 minute ago, primeval_god said:

It works in the netdata/netdata image. You also have to bind mount /var/cache/netdata/dbengine/ to a folder in your appdata to ensure it persists across container upgrades.

That is the container I am using. This is the error I get after changing the netdata.conf file to enable dbengine:

2020-09-12 16:46:41: netdata FATAL : MAIN : RRD_MEMORY_MODE_DBENGINE is not supported in this platform. # : Invalid argument

2020-09-12 16:46:41: netdata INFO : MAIN : EXIT: netdata prepares to exit with code 1...
2020-09-12 16:46:41: netdata INFO : MAIN : EXIT: cleaning up the database...
2020-09-12 16:46:41: netdata INFO : MAIN : Cleaning up database [0 hosts(s)]...

I mapped the dbengine folder to appdata along with the etc/netdata folder (after copying it to appdata first) to make updating netdata.conf easier. I have never been able to figure out the text editor in Alpine.

 

firefox_9GB0dBOnaW.jpg

Link to comment

Ok, I just wiped the entire netdata docker and did a clean install and now it seems to be using the dbengine by default? I can't explain it, I had not messed with anything on the old container prior to this.

 

I then made the same changes I did before, mapping the config and dbengine folder to appdata and bingo, seems to be working. Will have to wait and see if it holds onto the data for longer this time but database files are showing up in the folder.

 

Although I did notice that under dbengine compression savings ratio, it shows 0?

 

From what I have read, it seems it should be at least 50% or higher? Any idea why it is not being compressed? According to the calculator, without compression it is going to eat a LOT of space.
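For a rough idea of what "a LOT of space" means, the uncompressed size can be estimated by hand. This is a back-of-the-envelope sketch, not the official calculator: it assumes 4 bytes per stored sample, and the metric count, collection interval, and retention used in the example are made-up inputs:

```shell
#!/bin/sh
# Sketch: estimate uncompressed dbengine disk usage in MiB.
# Assumption: each stored sample takes 4 bytes before compression.
dbengine_uncompressed_mib() {
  metrics="$1"          # number of metrics (dimensions) being collected
  update_every="$2"     # seconds between samples
  retention_days="$3"   # how many days of history to keep
  samples=$(( retention_days * 86400 / update_every ))
  echo $(( metrics * samples * 4 / 1048576 ))
}

# Example: 2000 metrics, collected every second, kept for 3 days.
#   dbengine_uncompressed_mib 2000 1 3
```

With those example inputs the estimate comes to roughly 2 GiB, which is why the compression ratio matters for multi-day retention.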

Link to comment
  • 2 weeks later...
On 6/2/2020 at 11:19 PM, BeerNut said:

Does anyone have UPS communication working on this new version?

I have the same problem so I dug into it 😉

It seems that the "apcupsd" package is missing, and the Netdata GitHub team seems to be struggling with lots of issues...

While waiting for their solution, I built a custom script you can use to add this missing package to the netdata container automatically:

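#!/bin/bash
#description=This script adds APC UPS support to Netdata Container

# If apcaccess cannot be run inside the container, install the missing
# package and restart the container so Netdata picks it up.
if ! docker exec netdata apcaccess >/dev/null 2>&1; then
  docker exec netdata apk add apcupsd
  docker restart netdata
fi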

Quite simple, as you can see. Remember to rerun it every time the netdata container is updated, because the package is installed inside the container and is lost when the image is replaced.

Link to comment
On 9/15/2020 at 4:58 PM, TexasUnraid said:

The DBengine is working good now, saving data as expected. Although I am now getting a new issue, my docker stats stopped showing up? I used to be able to see all the stats for my dockers but now they are simply missing?

 

I have restarted netdata a few times but they are still gone. Any ideas?

I had the same issue. From inspecting the logs, it seems to be related to some db files still being locked when the netdata container is restarted. I suspect my drive is not fast enough, even though it's a nice Samsung SSD...

So the workaround I found is to stop the netdata container, wait a minute or two, and then start it again. If the container status doesn't show up at that point, just refresh the netdata page after a few seconds...

 

Now I'm not a Netdata expert so I may be wrong and you may face another issue. 
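The stop, wait, start workaround above can be scripted. A minimal sketch, assuming the container is named `netdata` and that a 90 second pause is long enough; both values and the function name `restart_with_pause` are illustrative rather than anything Netdata prescribes:

```shell
#!/bin/sh
# Sketch: stop a container, pause, then start it again, instead of using
# "docker restart", so files have time to be released in between.
DOCKER="${DOCKER:-docker}"   # override (e.g. DOCKER=echo) for a dry run

restart_with_pause() {
  name="${1:-netdata}"   # container name (assumed)
  pause="${2:-90}"       # seconds to wait between stop and start (assumed)
  $DOCKER stop "$name" && sleep "$pause" && $DOCKER start "$name"
}

# Usage (not run here):
#   restart_with_pause netdata 120
```

Setting `DOCKER=echo` before calling the function prints the commands without touching any container, which is a convenient way to check it first.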

Link to comment
On 9/25/2020 at 10:49 AM, mmz06 said:

I have the same problem so I dug into it 😉

It seems that "apcupsd" package is missing, and netdata GitHub team seems to be struggling with lots of issues...

While waiting for their solution, I built a custom script you can use to add this missing package to the netdata container automatically:

Thank you so much for this!  

Link to comment
