codefaux

Members
  • Posts

    129
  • Joined

  • Last visited

  • Days Won

    1

codefaux last won the day on April 29 2021

codefaux had the most liked content!

1 Follower

Converted

  • Gender
    Male
  • ICQ
    was shut down lol
  • AIM
    also sunlit
  • YIM
    doesn't even exist bro
  • MSN Messenger
    also dead. srsly?

Recent Profile Visitors

The recent visitors block is disabled and is not being shown to other users.

codefaux's Achievements

Apprentice

Apprentice (3/14)

19

Reputation

  1. That doesn't support any of the other metrics from any of the other dozen or so scripts at the link I mentioned, a few of which I'm interested in/using now manually. That's why I suggested adding the text collector as a supported feature on the plugin by default. It won't impact anyone except people who want to use it if it merely watches a folder for scripts to execute. Could you be bothered to explain why? I don't mean to sound dismissive or aggressive or anything, but would you follow a recommendation from a stranger with no justification or explanation? I'm not looking for suggestions; I'm not the plugin author here lol I'll allow them to be run from anywhere you require them to be placed. I'm suggesting that you add in support for the text collector to your prometheus_node_exporter plugin, since prometheus_node_exporter already supports it. I'm suggesting where you should place them. You disagree; place them anywhere you like.
  2. Could you be convinced to add a variable to pipe stdout to a specific path? My target: prometheus_node_exporter contains a textfile collector. There are a handful of community scripts designed to - be run from cron (ie your plugin on a custom schedule) - collect useful (I'd argue critical) system metrics (ie smartmon stats like temps and failure indicators) for prometheus-node-exporter - output data to stdout, to be piped to a file The inconvenience is, to use these is not straightforward. They would require extensive modification most users can't grok. Thanks to your plugin, an alternative is to manually save them to disk someplace, then use your plugin to write a script which runs them with stdout piped to a file, scheduled cron-alike. EDIT: These are the same script, ignore the name difference I used an old screenshot for the one by mistake This is workable but some users may still find this unapproachable. If the user could paste the entire script (in my case, /boot/config/scripts/smartmon.sh aka smartmon.sh [raw] among others) into a User Script, schedule them as cron-alikes, and mark their output to be sent to a specific path, it would make these more approachable. It could be implemented similar to the variables you've recently added; an STDOUT variable could be changed to request redirection. Regardless of your decision, keep up the great work. The plugin has been quite valuable for many of us!
  3. Hey here's another suggestion for an easy addition full of feature delivery. There exists a handful of community scripts to ammend data provided by node_exporter. Most of them seem to be intended to pipe to a text file and read via node_exporter's textfile collector. The metrics go into a file wherever, then to the exporter as a launch parameter, you add `--collector.textfile.directory=/var/lib/node_exporter` for example, and all of the readable files within /var/lib will be exported as metrics if they're in the correct format. For example, smartmon.sh writes smartmon statistics such as temperature and unallocated blocks. nvme_metrics.sh might be of interest, btrfs_stats.py, maybe even directory-size.sh for some folks. The most simple way I can think of is for your plugin to create a directory in the ram filesystem at /var/log and add `--collector.textfile.directory=/var/log/node_exporter` and suggest users execute the desired scripts, writing into /var/log/node_exporter in per-script files. I can see two ways of doing this. - One, users copy script files to someplace like /boot/config/scripts (one-time write for the scripts, no flash wear) and execute them via the User Scripts plugin as such; Scheduled similarly; The /var/log filesystem will exist on any system, won't cause flash wear, is wiped on reboot. The path should have plenty of space for a few kb (or even dozen MB) of metrics being rewritten every few minutes. If it doesn't, failure case is that logging fails on the system -- not ideal but it's mounted with 128MB by Unraid and should never be near full unless a user has a serious problem elsewhere. If it's filling, the absense of this plugin's proposed metrics won't prevent it or even delay it by much. These metrics are designed to be overwritten, not appeneded, so they should never grow more than a few dozen MB in the most obscene scenario. Plugins seem to run as root, so permissions shouldn't be a problem. I'm also going to ping the user scripts developer to allow stdout to be piped to a file per-script, so users can simply paste the scripts into User Scripts and forward the stdout, instead of needing to save them to /boot/config/scripts manually and write a User Script to run it manually.
  4. Might be worth running it from the console to see if there's any ignored error spam like in my comment. Could be an error being dropped instead of forwarded to system logs, like on my (and everyone else's) system using the Node Exporter.
  5. lol it's ok, I was just very confused. I'm not even using the Docker containers, I JUST needed the exporter; I'm running the rest on another system. That's the main reason I went into this so sideways; I only wanted the exporter, but it had all these extra steps and I had to assume which went to what. I did so incorrectly, but that's ok. For sure! I thought the file went with the plugin, that's all. Yup, but I didn't know that and it was all in one giant block so I assumed (incorrectly) and here we are, lol That's very much what I had expected, literally everyone using this plugin is silently ignoring repeated errors because the stdout isn't directed anywhere people know about... Synonymous with "remember that" -- it's something you say to people who know the thing you're about to say. COMPLETELY did not know that, but given the reply upstream that does not surprise me anymore lol. I'll remember that. Since the logs aren't kept, it won't be filling up a logfile. I haven't restarted my array in a few months and I don't intend to soon, so I'll likely just leave them running in a detached tmux until the issue is properly resolved. I'd already opened a bug report upstream, I'll link them here and add this information but it seems for now the best bet would be to patch the plugin to disable the md collector on your end. Edit: Github issue https://github.com/prometheus/node_exporter/issues/2642
  6. I already explained that, but I'll be more verbose. I downloaded the plugin before I wrote its config file. You wrote the instructions. Steps four and five. I did not perform steps four and five until after I had installed the plugin. I also did not do anything beyond those steps, regrading modifying settings, configuration or parameters. There's no "restart" control that I could find, and I didn't feel like restarting my entire server (or the array and all of my services) simply to restart a single plugin. Thus, I used the console, killed the running process, and restarted it by hand. No custom parameters, I didn't change some settings. You'll note that I never said there was a user-facing issue, or that I couldn't connect or report metrics from it. It functions just fine, but on my system it's burping a line at the console about an internal (non-critical) error, every time Prometheus connects to it and retrieves metrics. The only difference between now and if I uninstall/reinstall/reboot is that the errors will be sent to a logfile someplace or discarded entirely -- I have no idea which -- instead of being sent to the console, since I ran it by hand. What I'm realizing though, is this is above your head, so to speak. If you run it by hand yourself, does it throw an error every time Prometheus polls it? I'll file a bug upstream with the actual node collector's maintainer, as it's now clear to me that the actual collector is mishandling the output of /proc/mdstat on my system and it has nothing to do with the small wrapper plugin you wrote. Mmmmmmm no, though. It's not "only meant to be installed." It's meant to be learned about, your system is meant to be manually prepared for it, and then it's meant to be installed. Steps four and five could/should be handled by either the previous step's container, or this plugin itself if it notices the configuration file is not present. Furthermore, installing the plugin starts the collector, which already expects the config file to be present, so steps four and five should actually be before step three. If the installation process weren't so complicated, I would've noticed that this wasn't your problem earlier. I installed the plugin by finding it in the CA app and going "hey that does what I need" and then discovering that it wasn't working. And in NO situation do you merely "install the plugin and that's it" so that's just a flat inaccurate thing to claim.
  7. Literally just run the binary from the ssh terminal. I hadn't written its config file yet so I noted how it was executed, killed it, wrote the config file, and executed it by hand in a terminal.
  8. Bug report. Prometheus Node Exporter throws an error every single time it's polled by Prometheus. EDIT: As of version 2023.02.26 -- first time using this plugin, unsure when bug first appeared Text is; ts=2023-03-24T17:51:45.859Z caller=collector.go:169 level=error msg="collector failed" name=mdadm duration_seconds=0.000470481 err="error parsing mdstatus: error parsing mdstat \"/proc/mdstat\": not enough fields in mdline (expected at least 3): sbName=/boot/config/super.dat"
  9. Glad you solved it, something to always keep in mind when building a system. To the general public who winds up here looking for similar solutions; (Posting it here because they won't go back and experience it like you did. Seriously not trying to rub it in. This needs to be seen *anywhere* the conclusion is "overclocked RAM" and is *NOT* JUST FOR RYZEN SYSTEMS but they do it worst. We've had this talk, I know you know.) Never overclock RAM, at all, regardless of what it says on the box, INCLUDING XMP, without testing it. Gaming system, Facebook system, ESPECIALLY A SERVER. XMP IS OVERCLOCKING. I'll say it again because people argue it so much; before you argue, Google it. XMP IS OVERCLOCKING. It's a "factory supported overclock" but it IS OVERCLOCKING, and you HAVE TO TEST FOR STABILITY when overclocking. Read Intel's documentation on the XMP spec. I DO NOT CARE if your BIOS defaults it on, VIGILANTLY turn it back *OFF* unless you're going to run an extremely extended *(DOZENS OF HOURS) RAM test when you're building a server. RAM overclocking leads to SILENT DATA CORRUPTION, which is *RIDICULOUS* on a SERVER, which is explicitly present to HANDLE DATA. I should also note that I personally have never seen a literal server-intended motherboard which supports *any* means of overclocking, and I feel like that's due to the causal link between overclocking and data corruption. Overclocking server hardware is NOT a good decision, unless you also test for stability. Overclocking RAM without testing it is literally identical to knowingly using a stick of failing RAM. You're running a piece of engineered electronics faster than the engineers who built it said to, often at higher voltages than it's designed for, to get it to go fractionally faster. Does that move a bottleneck? Not even a little bit, not since the numbered Pentium era. It HELPS on a high end system, IF YOU TEST IT THOROUGHLY, but I would NEVER overclock RAM on a server. I feel like NOT having silently corrupted data on a randomly unstable system is better overall than "the benchmark number is a little higher and I SWEAR programs load faster!" Stop thinking overclocking RAM is safe. Stop using XMP on "server" systems. Any potential speed gain is not worth it. Be safer with your systems, and your data.
  10. You fundamentally misunderstand Host Access. Host Access allows the HOST (unRAID) to reach the Docker container; otherwise the HOST (and only the HOST) cannot reach the container with Host Access off. The containers can get to the outside Internet with no host access, and no container-specific IP. The containers can receive connections from anywhere EXCEPT the host, with Host Access turned off. The containers can receive connections from anywhere INCLUDING the host, with Host Access on. When it crashes, bring the array up in maintenance mode, do a manual xfs_repair on every drive (/dev/md* from terminal, or manually per drive from GUI) -- I used to, every time unRAID crashed, and do still when we have power interruptions -- and now no longer have to fix parity. I still scan as a test every so often on my 113TB 30-disk array but it no longer requires sync fixes ever. I'm unsubscribing from this thread; I've posted my fix (still 100% stable on my hardware, where it was unstable regularly and repeatedly, and verified by reverting into instability and re-applying to regain stability) and it's not helping and/or nobody is listening. If anyone needs direct access to my information or wishes to work one-on-one with me to look into issues privately, or verify this is the same crash, etc, send me a private message and I'll gladly help; the chatter here is going in circles and I've said my part. I hope at least someone was helped by my presence; good luck.
  11. As an afterthought; because it's probably relevant, here's my network config page. Heading down for the night, will check in when I wake up.
  12. Try 6.8.3, enable vlans even if you're not using them. That seems to be the part which fixed my issues. If that isn't stable I'll screenshot my configurations and we'll try to figure it out -- I was crashing every few days with five or so containers. Now I'm stable with easily a dozen running right now. Good catch, I forgot about that actually. I had disabled host access at the time, expecting it to have been a part of the fix. Currently host access is enabled, still stable, on 6.8.3 -- other things may have changed since then since I accidentally reverted a version, but stability is unaffected. Here's a screenshot of my currently running system.
  13. Up until a week ago I was still running 6.9.x as mentioned earlier, with host access, with flawless uptime, given the workaround I indicated. I recently accidentally downgraded to 6.8.3 (flash device problems, long story) and I'm still stable. Perhaps try 6.8.3? PS, anyone on the 6.10 rc series who can verify stability? I'm unwilling to touch anything until Limetech has this issue figured out. I tried briefly with a brand new flash device but it issued me a trial license without telling me what's what it was going to do, and since keys are no longer locally managed I can't fix it without contacting support (GREAT JOB LIMETECH) so capital F that, until I know it's gonna work.
  14. I still haven't updated, because the workaround I posted still works for me. I'm using Docker with 20+ containers, each with a dedicated IP, all from one NIC. I never even tried the 6.10-rc because after it was released, I read a few posts from folks using the 6.10-rc that the problem still existed even with ipvlan. Honestly, I'm stable, and I'm not going to upgrade until I stop hearing about this bug.
  15. Sorry to hear it -- I'm still running the configuration posted and still stable save the recent power bumps in our area, but I had nearly a month of uptime at one point and even that power cycle was scheduled. I've also moved additional Docker containers onto the system so I could shut down the other for power/heat savings during the heat wave. I haven't had the spoons to convince myself to change anything since it became stable, so I'm not on the RC yet. I might suggest that if there is a panic, it could be unrelated -- post its details just to be sure. Various hardware have errata that could cause panics, including both Intel and AMD C-state bugs on various generations, and even filesystem corruption from previous panics -- something Unraid doesn't forward to the UI from the kernel logs, by default. Good luck all.