[Support] Data-Monkey - netdata

I really like the new Netdata Cloud feature. I signed in, claimed my node, and ran the script they provide for Docker. But whenever the container updates automatically, the claimed node becomes unreachable and I have to go through the claiming process over and over. How can I configure Netdata so its name persists, so that I can get updates and still keep it synced with their cloud?


I started getting fork errors and a lot of odd behavior. It turned out something was creating a massive number of processes.

That something is netdata :/ I've run this for months with no issues, but since the switch things aren't working the same.

Note that I'm using the default config and haven't touched any settings.

 

UID 201 is netdata, and I waited 5 seconds between the two commands below. If I leave this running, it just keeps creating processes until my machine starts throwing errors after a few days.

 

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
      2 100
      2 daemon
      2 message+
      3 ntp
      4 102
      4 avahi
      4 rpc
      5 103
     27 472
     33 101
    168 nobody
    194 sshd
    260 201
   1677 root

 

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
      2 100
      2 daemon
      2 message+
      3 ntp
      4 102
      4 avahi
      4 rpc
      5 103
     27 472
     33 101
    168 nobody
    194 sshd
    352 201
   1672 root
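For anyone who wants to repeat this comparison without eyeballing two listings, here is a minimal shell sketch. The function names `count_by_user` and `growth_between` are made up for illustration, and the snapshot file paths in the usage comment are placeholders. It relies only on `ps`, `awk`, and `sort`.

```shell
#!/bin/sh
# Count ps entries per user. Reads ps output on stdin, where the first
# column is the user name or numeric UID, and prints "USER COUNT" lines.
count_by_user() {
    awk '{ n[$1]++ } END { for (u in n) print u, n[u] }' | sort
}

# Compare two "USER COUNT" snapshots and print "USER DELTA" for every user
# in the second snapshot whose count changed since the first.
# Usage: growth_between BEFORE_FILE AFTER_FILE
growth_between() {
    awk 'NR == FNR { before[$1] = $2; next }
         { d = $2 - before[$1]; if (d != 0) print $1, d }' "$1" "$2"
}

# Example (placeholder paths):
#   ps --no-headers auxwwwm | count_by_user > /tmp/before
#   sleep 5
#   ps --no-headers auxwwwm | count_by_user > /tmp/after
#   growth_between /tmp/before /tmp/after
```

With the two listings above, this would report a change of 92 for user 201 and -5 for root, which makes the runaway user easy to spot.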

 

I checked what the processes are and saw hundreds of these:

201      12273  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12470  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12711  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      12895  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13054  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13235  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>
201      13415  0.0  0.0      0     0 ?        ZNs  21:08   0:00 [timeout] <defunct>

 

I'm not sure where netdata logs are so I don't know what it's trying to do that keeps spinning... any thoughts?
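Since defunct entries stay in the process table until their parent collects them with wait(), grouping them by parent PID is one way to see which process is leaving them behind. A minimal sketch follows; `zombies_by_parent` is a made-up helper name, and the `ps` invocation in the comment assumes procps on Linux.

```shell
#!/bin/sh
# Reads "PPID STAT" lines on stdin and prints "COUNT PPID" for every
# parent that has defunct (state Z) children, largest count first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Example:
#   ps -e -o ppid= -o stat= | zombies_by_parent
# Then inspect the top parent, replacing PID with the number reported:
#   ps -o pid,user,args -p PID
```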


I have come across the same issue as above. I originally posted it in the Dynamix forum because of the errors I saw; they looked at my diagnostics and saw the same process issue you are seeing. This is what I originally posted:

Quote

Hello, I hope this is the right place to post this. I have searched and have been unable to find a solution. In the last few days I added a second cache drive, identical to the existing one. I added this to create a cache pool. Since then I noticed occasional weird messages in my email that don't seem to make sense:

Subject is: 

cron for user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null

and the body consists of:

/bin/sh: fork: retry: Resource temporarily unavailable

 

Typically I get several in a row and then they stop for 12-24 hours. If I leave them, it seems to only get worse, and the server has become unresponsive twice in the last few days. I was able to reboot it from the GUI once, but the second time I had to do a hard reboot. I tried uninstalling and reinstalling the SSD Trim plugin, but that did not seem to make a difference. The server came back up without issue and the errors seemed to be cleared, but about 24 hours later they started happening again. Everything seems to be working OK otherwise, and I am not sure what is causing this. One thought I had is that one of the cache drives is on an HBA and the other is connected directly to the motherboard; I am not sure if that would make a difference. I have attached the diagnostics. Let me know what you guys think. The server has been running great otherwise and I have really been enjoying UNRAID. Thanks for the support!

And then one of the guys there replied:

Quote

Hmm, your diagnostics show that you have a netdata container that is not properly reaping its finished processes:


201      15538  0.2  0.0  33416 22352 ?        SNl  03:01   2:23  |   |       \_ /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
201        300  0.0  0.0      0     0 ?        ZNs  05:34   0:00  |   |       \_ [timeout] <defunct>
201        301  0.0  0.0      0     0 ?        ZNs  06:28   0:00  |   |       \_ [timeout] <defunct>
201        302  0.0  0.0      0     0 ?        ZNs  04:32   0:00  |   |       \_ [timeout] <defunct>
  ... snip ...
201      32766  0.0  0.0      0     0 ?        ZNs  09:32   0:00  |   |       \_ [timeout] <defunct>
201      32767  0.0  0.0      0     0 ?        ZNs  08:54   0:00  |   |       \_ [timeout] <defunct>

So your server is running out of process IDs for new processes. You should check with the support thread for the netdata container you are running.
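To get a rough idea of how close a box is to that limit, the number of tasks in use can be compared with the kernel's PID ceiling. This is only a sketch: `pid_usage` is a made-up helper, and the example reads `/proc/sys/kernel/pid_max`, which is where Linux exposes the limit. Fork failures can also come from per-user or cgroup limits, so treat the result as one data point.

```shell
#!/bin/sh
# Print how many PIDs are in use relative to a limit, with a percentage.
# Usage: pid_usage USED LIMIT
pid_usage() {
    awk -v used="$1" -v max="$2" \
        'BEGIN { printf "%d of %d (%.1f%%)\n", used, max, 100 * used / max }'
}

# Example on Linux (threads are counted too, since each one takes an ID):
#   pid_usage "$(ps -eL --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```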

I have attached my diagnostics if you want to take a look. I am going to turn off the netdata container for now and see whether the errors come back. Thanks in advance for the support!

hulk-diagnostics-20200519-2143.zip

9 hours ago, primeval_god said:

@melmurp @muslimsteel You might want to consider raising an issue over at https://github.com/netdata/netdata/ where the developers of Netdata and the Docker container reside. 

 

9 hours ago, muslimsteel said:

@primeval_god Thanks, looks like they might have an issue open for this same thing on GitHub:

https://github.com/netdata/netdata/issues/9084

 

Thanks guys, it seems they have fixed it, so we just need to wait for the next release:

https://github.com/netdata/netdata/pull/9107


I'm trying to get notifications to work, following the instructions here:

  1. https://hub.docker.com/r/titpetric/netdata
  2. https://learn.netdata.cloud/docs/agent/step-by-step/step-05

I have added my target email using "./edit-config health_alarm_notify.conf" for (2).

And added the following parameters (with my emails) for (1):

 

-e SMTP_TO=user@gmail.com -e SMTP_USER=user -e SMTP_PASS=password

 

I have also generated an app password for the sending Gmail account and am using it above. When I send a test alarm, I get:

 

# SENDING TEST CLEAR ALARM TO ROLE: sysadmin
2020-05-23 11:24:44: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
sendmail: can't connect to remote host (127.0.0.1): Connection refused

I am sure I am doing something silly - Thanks!


Hi and thanks for this docker!

 

Would you be able to suggest how to make Netdata automatically recognise HDDtemp's data coming from Atribe's HDDtemp docker image?
