[Support] Data-Monkey - netdata

Boldly_Goes · May 15, 2020

I really like the new Netdata cloud feature. I signed in, claimed my node, and ran the script they provide for Docker. Then I get an automatic update, the claimed node becomes unreachable, and I have to go through the process over, and over, and over. How can I configure Netdata so its name persists, so that I can get updates but also keep it synced with their cloud?

melmurp · May 18, 2020

Started getting fork errors and tons of odd behavior... turned out something was creating a massive number of processes.

That something is netdata. I've run it for months with no issues, but since the switch things don't seem to be working the same.

Note that I'm using the default config and haven't touched any settings.

User 201 is netdata, and I waited 5s between the two commands below. If I leave this going it'll just keep creating processes until my machine starts to throw errors after a few days.

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
2 100
2 daemon
2 message+
3 ntp
4 102
4 avahi
4 rpc
5 103
27 472
33 101
168 nobody
194 sshd
260 201
1677 root

ps --no-headers auxwwwm | cut -f1 -d' ' | sort | uniq -c | sort -n
2 100
2 daemon
2 message+
3 ntp
4 102
4 avahi
4 rpc
5 103
27 472
33 101
168 nobody
194 sshd
352 201
1672 root

I checked what the processes are and I see hundreds of these:

201 12273 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 12470 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 12711 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 12895 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 13054 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 13235 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>
201 13415 0.0 0.0 0 0 ? ZNs 21:08 0:00 [timeout] <defunct>

I'm not sure where netdata logs are so I don't know what it's trying to do that keeps spinning... any thoughts?
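To see which parent is leaving the defunct `timeout` processes behind, one option is to group the zombies by parent PID. This is only a minimal sketch: the helper name `zombies_by_parent` is made up for illustration, and it assumes a procps-style `ps` on the host.

```shell
#!/bin/bash
# Group zombie (state Z) processes by parent PID and owner.
# Expects "STAT PPID USER COMMAND" lines on stdin, one process per line.
# zombies_by_parent is a hypothetical helper name, not a Netdata tool.
zombies_by_parent() {
  awk '$1 ~ /^Z/ { count[$2 " " $3]++ }
       END { for (key in count) print count[key], key }' | sort -rn
}

# Example (procps ps; separate -o options keep the header suppressed):
#   ps -e -o stat= -o ppid= -o user= -o comm= | zombies_by_parent
# Each output line is: <zombie count> <parent PID> <user>
```

The parent PID at the top of the list is the process that is not reaping its children; `ps -o pid,user,args -p <PID>` shows what it is.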

muslimsteel · May 20, 2020

I have come across the same issue as above. I originally posted it in the Dynamix forum because of the errors that I saw; they looked at my diagnostics and saw the same process issue that you are seeing. This is what I originally posted:

Quote

Hello, I hope this is the right place to post this. I have searched and have been unable to find a solution. In the last few days I added a second cache drive, identical to the existing one. I added this to create a cache pool. Since then I noticed occasional weird messages in my email that don't seem to make sense:

Subject is:

cron for user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null

and the body consists of:

/bin/sh: fork: retry: Resource temporarily unavailable

Typically I get several in a row and then they stop for 12-24 hours. If I leave them it seems to only get worse, and the server has become unresponsive twice now in the last few days. I was able to reboot it from the GUI once, but the second time I had to do a hard boot. I tried uninstalling and reinstalling the SSD Trim plugin but it did not seem to make a difference. It came back up without issue and the errors seemed to be cleared, but then about 24 hours later they started happening again. Everything seems to be working ok otherwise, so I am not sure what is causing this. One thought I had is that one of the cache drives is on an HBA and the other is connected directly to the motherboard; not sure if that would make a difference. I have attached the diagnostics. Let me know what you guys think. The server has been running great otherwise and I have really been enjoying UNRAID. Thanks for the support!

And then one of the guys there replied:

Quote

Hmm, your diagnostics show that you have a netdata container that is not properly reaping its finished processes:


201      15538  0.2  0.0  33416 22352 ?        SNl  03:01   2:23  |   |       \_ /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
201        300  0.0  0.0      0     0 ?        ZNs  05:34   0:00  |   |       \_ [timeout] <defunct>
201        301  0.0  0.0      0     0 ?        ZNs  06:28   0:00  |   |       \_ [timeout] <defunct>
201        302  0.0  0.0      0     0 ?        ZNs  04:32   0:00  |   |       \_ [timeout] <defunct>
  ... snip ...
201      32766  0.0  0.0      0     0 ?        ZNs  09:32   0:00  |   |       \_ [timeout] <defunct>
201      32767  0.0  0.0      0     0 ?        ZNs  08:54   0:00  |   |       \_ [timeout] <defunct>

So your server is running out of process ids to run new processes. You should check with the support thread for the netdata container you are running

I have attached my diagnostics if you want to take a look. Going to turn off the netdata container for now and see if I am seeing any more of these issues. Thanks in advance for the support!

hulk-diagnostics-20200519-2143.zip

primeval_god · May 20, 2020

@melmurp @muslimsteel You might want to consider raising an issue over at https://github.com/netdata/netdata/ where the developers of Netdata and the Docker container reside.

muslimsteel · May 20, 2020

@primeval_god Thanks, looks like they might have an issue open for this same thing on GitHub:

https://github.com/netdata/netdata/issues/9084

melmurp · May 20, 2020

9 hours ago, primeval_god said:

@melmurp @muslimsteel You might want to consider raising an issue over at https://github.com/netdata/netdata/ where the developers of Netdata and the Docker container reside.

9 hours ago, muslimsteel said:

@primeval_god Thanks, looks like they might have an issue open for this same thing on GitHub:

https://github.com/netdata/netdata/issues/9084

Thanks guys, seems they fixed it so just need to wait for the next release

https://github.com/netdata/netdata/pull/9107

TexasDave · May 23, 2020

Trying to get notifications to work....Following instructions here:

I have added my target email using "./edit-config health_alarm_notify.conf" for (1).

And added the following parameters (with my emails) for (2):

-e [email protected] -e SMTP_USER=user -e SMTP_PASS=password

And I have generated an app password for the sending gmail account and am using that above. I get:

# SENDING TEST CLEAR ALARM TO ROLE: sysadmin
2020-05-23 11:24:44: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
sendmail: can't connect to remote host (127.0.0.1): Connection refused

I am sure I am doing something silly - Thanks!

OdinEidolon · May 24, 2020

Hi and thanks for this docker!

Would you be able to suggest how to make Netdata automatically recognise HDDtemp's data coming from Atribe's HDDtemp docker image? (support here:

)

BeerNut · June 2, 2020

Does anyone have UPS communication working on this new version?

Zack · June 3, 2020

On 6/2/2020 at 9:19 PM, BeerNut said:

Does anyone have UPS communication working on this new version?

According to the docs Netdata needs certain UPS tools installed. Can you check if something like `upsc -l` works on the system? If it does but still no UPS charts are created, please submit a bug about it on github with the relevant details.

BeerNut · June 4, 2020

I'm just having trouble with the configuration. I'll mess with it some more when I have time to see if I can get it working.

Zack · June 5, 2020

Hope you all are finding Netdata easy to use, despite some of the changes to the official docker image. If you are running into problems, you will get fastest response times if you submit an issue (or feature request) on our github repo and provide the relevant logs and info. We try to respond as needed but other channels take longer! Feedback and pull requests are also much appreciated

So let me know if you have any questions about Netdata. My turn to ask questions will be a few months from now once I've had enough with my dual-boot setup and have some time to tinker with hardware.

- Zack, dev advocate with Netdata

ONI Assassin · July 1, 2020

On 5/15/2020 at 2:25 PM, Boldly_Goes said:

I really like the new Netdata cloud feature. I signed in, claimed my node, and ran the script they provide for Docker. Then I get an automatic update, the claimed node becomes unreachable, and I have to go through the process over, and over, and over. How can I configure Netdata so its name persists, so that I can get updates but also keep it synced with their cloud?

Did you manage to resolve this?

TexasUnraid · August 14, 2020

I have noticed that netdata is not saving all the logs / graphs. It will save the last few hours but if I keep zooming out it will have large gaps where all the graphs are blank but then possibly days before the graphs will pop back into existence, then disappear again etc.

The server has been on the whole time and the docker running non-stop with no issues I am aware of.

Any ideas on how to get it to keep all the graphs/logs?

TexasUnraid · September 12, 2020

Did more research and it seems that the docker is set up to use RAM for the logs and for some reason the dbengine option is disabled?

I tried to enable it but it just errors and says it is not supported on this platform? Any way to enable it? I would like to keep a few days of logs if possible.

primeval_god · September 12, 2020

1 hour ago, TexasUnraid said:

Did more research and it seems that the docker is set up to use RAM for the logs and for some reason the dbengine option is disabled?

I tried to enable it but it just errors and says it is not supported on this platform? Any way to enable it? I would like to keep a few days of logs if possible.

It works in the netdata/netdata image. You also have to bind mount /var/cache/netdata/dbengine/ to a folder in your appdata to ensure it persists across container upgrades.
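As a rough illustration of that bind mount, the sketch below starts the image with the dbengine directory mapped to the host. The host path `/mnt/user/appdata/netdata`, the container name, and the helper name `start_netdata` are placeholders assumed for the example; any other options your setup needs (ports, capabilities, `/proc` and `/sys` mounts) are left out.

```shell
#!/bin/bash
# Sketch: run netdata/netdata with the dbengine directory bind-mounted to
# the host so the database survives container upgrades.
# ASSUMPTIONS: APPDATA_DIR and the container name are placeholders, and
# every option not related to the mount is omitted.
APPDATA_DIR="${APPDATA_DIR:-/mnt/user/appdata/netdata}"  # hypothetical host path
DOCKER="${DOCKER:-docker}"                               # set DOCKER=echo for a dry run

# start_netdata is a hypothetical helper name.
start_netdata() {
  $DOCKER run -d --name netdata \
    -v "$APPDATA_DIR/dbengine:/var/cache/netdata/dbengine" \
    netdata/netdata
}
```

Running `DOCKER=echo start_netdata` prints the command instead of executing it, which is a cheap way to check the mount path before touching the real container.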

TexasUnraid · September 12, 2020

1 minute ago, primeval_god said:

It works in the netdata/netdata image. You also have to bind mount /var/cache/netdata/dbengine/ to a folder in your appdata to ensure it persists across container upgrades.

That is the container I am using. This is the error I get after changing the netdata.conf file to enable dbengine:

2020-09-12 16:46:41: netdata FATAL : MAIN : RRD_MEMORY_MODE_DBENGINE is not supported in this platform. # : Invalid argument

2020-09-12 16:46:41: netdata INFO : MAIN : EXIT: netdata prepares to exit with code 1...
2020-09-12 16:46:41: netdata INFO : MAIN : EXIT: cleaning up the database...
2020-09-12 16:46:41: netdata INFO : MAIN : Cleaning up database [0 hosts(s)]...

I mapped the dbengine folder to appdata along with the /etc/netdata folder (after copying it to appdata first) to make updating netdata.conf easier, since I have never been able to figure out the text editor in Alpine.

primeval_god · September 12, 2020

Do you have all the dbengine configuration options uncommented?

	memory mode = dbengine
	page cache size = 200
	dbengine disk space = 2048
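A quick way to answer that question is to grep the config for the three options and flag any that are missing or still commented out. This is a minimal sketch; `check_dbengine_conf` is a made-up helper name, and the path to netdata.conf is whatever you mapped on your own system.

```shell
#!/bin/bash
# Report which of the three dbengine settings are active (present and not
# commented out) in a netdata.conf. Returns 0 only if all three are active.
# check_dbengine_conf is a hypothetical helper name.
check_dbengine_conf() {
  local conf="$1" status=0 option
  # Lines starting with '#' are comments, so anchoring at optional leading
  # whitespace only matches active settings.
  grep -Eq '^[[:space:]]*memory mode[[:space:]]*=[[:space:]]*dbengine' "$conf" \
    || { echo "memory mode is not set to dbengine"; status=1; }
  for option in 'page cache size' 'dbengine disk space'; do
    grep -Eq "^[[:space:]]*$option[[:space:]]*=" "$conf" \
      || { echo "$option is missing or commented out"; status=1; }
  done
  return "$status"
}
```

For example, `check_dbengine_conf /path/to/appdata/netdata.conf` prints nothing when all three lines are active.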

TexasUnraid · September 12, 2020

Yes, although in my case the netdata.conf file was empty by default and asked me to wget it.

I did that and it was then filled in, but the default memory mode was save and the file did not include the page cache or disk space options.

I manually added them, though, and still got the same error.

TexasUnraid · September 13, 2020

Ok, I just wiped the entire netdata docker and did a clean install and now it seems to be using the dbengine by default? I can't explain it, I had not messed with anything on the old container prior to this.

I then made the same changes I did before, mapping the config and dbengine folder to appdata and bingo, seems to be working. Will have to wait and see if it holds onto the data for longer this time but database files are showing up in the folder.

Although I did notice that under dbengine compression savings ratio, it shows 0?

From what I have read, it should be 50% or higher? Any idea why it is not being compressed? According to the calculator, without compression it is going to eat a LOT of space.

TexasUnraid · September 13, 2020

Ok, after it ran for a while the dbengine compression savings suddenly jumped up to 77%. So looks like it is working as expected!
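To get a feel for how much the compression ratio matters, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: 4 bytes per stored sample before compression, one sample per metric per second, and a made-up figure of 2000 metrics kept for 3 days. The helper name `dbengine_mib` is hypothetical, and the official calculator may account for overheads this ignores.

```shell
#!/bin/bash
# Rough dbengine disk estimate in MiB.
# ASSUMPTIONS: 4 bytes per sample before compression and one sample per
# metric per second; check both against your own setup.
# dbengine_mib is a hypothetical helper name.
dbengine_mib() {
  local metrics="$1" days="$2" savings_pct="$3"
  local raw_bytes=$(( metrics * days * 86400 * 4 ))
  echo $(( raw_bytes * (100 - savings_pct) / 100 / 1048576 ))
}

dbengine_mib 2000 3 0    # hypothetical workload, no compression
dbengine_mib 2000 3 77   # same workload with the 77% savings reported above
```

Comparing the two numbers against the configured `dbengine disk space` value gives a rough idea of how many days of history will fit.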

TexasUnraid · September 15, 2020

The dbengine is working well now, saving data as expected. Although I am now getting a new issue, my docker stats stopped showing up? I used to be able to see all the stats for my dockers but now they are simply missing?

I have restarted netdata a few times but they are still gone. Any ideas?

mmz06 · September 25, 2020

On 6/2/2020 at 11:19 PM, BeerNut said:

Does anyone have UPS communication working on this new version?

I have the same problem so I dug into it 😉

It seems that "apcupsd" package is missing, and netdata GitHub team seems to be struggling with lots of issues...

While waiting for their fix, I built a custom script you can use to add this missing package to the netdata container automatically:

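#!/bin/bash
#description=This script adds APC UPS support to Netdata Container

# If apcaccess does not run successfully inside the container,
# install the apcupsd package and restart the container.
if ! docker exec netdata apcaccess >/dev/null; then
  docker exec netdata apk add apcupsd
  docker restart netdata
fi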

Quite simple, as you can see. Keep in mind that you need to rerun it every time the netdata container is updated.

mmz06 · September 25, 2020

On 9/15/2020 at 4:58 PM, TexasUnraid said:

The DBengine is working good now, saving data as expected. Although I am now getting a new issue, my docker stats stopped showing up? I used to be able to see all the stats for my dockers but now they are simply missing?

I have restarted netdata a few times but they are still gone. Any ideas?

I had the same issue, and from inspecting the logs, it seems to be related to some db files still being locked when the netdata container is restarted. I suspect my drive is not fast enough, even though it's a nice Samsung SSD...

So the workaround I found is to stop the netdata container, wait a minute or two, and then start it again. If the container stats don't show up at that point, just refresh the netdata page after a few seconds...
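The stop, wait, start sequence can be scripted roughly as follows. This is a sketch, not a confirmed fix: the 90 second pause is an arbitrary value in the "minute or two" range, and `restart_with_pause` is a made-up helper name.

```shell
#!/bin/bash
# Stop the container, pause, then start it again.
# ASSUMPTIONS: the container is named "netdata" and a 90 second pause is
# long enough; both are placeholders to adjust.
DOCKER="${DOCKER:-docker}"            # set DOCKER=echo for a dry run
PAUSE_SECONDS="${PAUSE_SECONDS:-90}"

# restart_with_pause is a hypothetical helper name.
restart_with_pause() {
  local name="${1:-netdata}"
  $DOCKER stop "$name" && sleep "$PAUSE_SECONDS" && $DOCKER start "$name"
}
```

Call it as `restart_with_pause netdata`; the container is only started again if the stop succeeded.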

Now I'm not a Netdata expert so I may be wrong and you may face another issue.

BeerNut · September 26, 2020

On 9/25/2020 at 10:49 AM, mmz06 said:

I have the same problem so I dug into it 😉

It seems that "apcupsd" package is missing, and netdata GitHub team seems to be struggling with lots of issues...

While waiting for their fix, I built a custom script you can use to add this missing package to the netdata container automatically:

Thank you so much for this!
