Iker Posted September 22, 2021 (edited)

Overview: I started using ZFS a little while ago, but the lack of ZFS-related information in the GUI really bothered me. My Unraid is mainly monitored from Grafana, so I went looking for monitoring solutions, but found nothing that works consistently well on Unraid. This is a guide for users of the ZFS plugin who want to monitor their ZFS pools using Grafana, Prometheus and zfs_exporter. The only goal is a nice dashboard with the general stats of the pools (ARC stats not included; those are possible with Telegraf if you like).

Requirements:
Unraid <= 6.9.2
Grafana
Basic knowledge of PromQL

Install zfs_exporter

There is a limitation on which properties you can monitor with Telegraf on Linux: no health status, no vdev/dataset/volume info, no snapshots, no size of the pools and datasets, etc. This is where zfs_exporter comes into play: it's a ZFS Prometheus exporter, a single executable written in Go that plays nicely with Unraid. Download the Linux x64 version from the releases page (https://github.com/pdf/zfs_exporter) and place it in a directory, preferably on one of your pools:

```shell
tar -xf zfs_exporter-2.2.1.linux-amd64.tar.gz -C /ssdnvme/scripts
```

In the User Scripts plugin, create a new script named zfs_exporter with the following contents:

```shell
#!/bin/bash
echo "/ssdnvme/scripts/zfs_exporter --properties.dataset-filesystem=available,logicalused,quota,referenced,used,usedbydataset,written --collector.dataset-snapshot --properties.dataset-snapshot=logicalused,referenced,used,written --exclude=^ssdnvme/dockerfiles/" | at NOW -M > /dev/null 2>&1
```

Configure the script to be executed At Startup of Array, and you're done. The location of my Docker folder is "/ssdnvme/dockerfiles" and it creates a lot of snapshots; you can exclude paths like that with the "--exclude" parameter.
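As a side note, the User Scripts entry above can be hardened slightly so that re-running it doesn't spawn a second exporter. This is only a sketch built around the same binary location and flags used in this guide (adjust the /ssdnvme/scripts path to your pool):

```shell
#!/bin/bash
# Sketch only: same binary location and flags as in the guide above.
EXPORTER=/ssdnvme/scripts/zfs_exporter
FLAGS='--properties.dataset-filesystem=available,logicalused,quota,referenced,used,usedbydataset,written'
FLAGS="$FLAGS --collector.dataset-snapshot"
FLAGS="$FLAGS --properties.dataset-snapshot=logicalused,referenced,used,written"
FLAGS="$FLAGS --exclude=^ssdnvme/dockerfiles/"

if pgrep -x zfs_exporter > /dev/null 2>&1; then
    # don't start a second copy on re-run
    STATUS="already running"
elif [ -x "$EXPORTER" ]; then
    # 'at NOW' detaches the exporter from the User Scripts session
    echo "$EXPORTER $FLAGS" | at NOW -M > /dev/null 2>&1
    STATUS="started"
else
    STATUS="binary not found at $EXPORTER"
fi
echo "zfs_exporter: $STATUS"
```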
If everything is OK, you should be able to access "http://YOURUNRAIDIP:9134/metrics" and see the exported metrics. If you are unable to access it, check whether zfs_exporter is running:

```shell
ps aux | grep zfs_exporter
```

Install Prometheus & Configure

From Community Applications, install the Prometheus Docker container and modify "prometheus.yml" to collect metrics from "YOURUNRAIDIP:9134". Restart Prometheus; if everything went OK, you should be able to test some expressions.

Create a Dashboard

With Grafana the process is very straightforward: add the Prometheus data source and start building the panels to your preference.

Final Words

I don't expect this configuration to be necessary on Unraid in the near future, but it is useful to me. Unraid 6.10 ships with at least ZFS 2.1, which includes zpool_influxdb (https://openzfs.github.io/openzfs-docs/man/8/zpool_influxdb.8.html), making the process of exporting metrics much easier. Hope this helps some ZFS users in the Unraid community; let me know if you have any questions or run into any issues getting this up and running.
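As a footnote to the "Install Prometheus & Configure" step: the prometheus.yml change boils down to one scrape job pointing at the exporter. A minimal sketch; the job name and 15-second interval are my own illustrative choices, only the target address comes from this guide:

```yaml
scrape_configs:
  - job_name: 'zfs_exporter'     # hypothetical name, pick anything
    scrape_interval: 15s
    static_configs:
      - targets: ['YOURUNRAIDIP:9134']
```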
ich777 Posted September 24, 2021

On 9/23/2021 at 12:09 AM, Iker said:
"There is a limitation on what properties you could monitor with Telegraf on Linux... here is where zfs_exporter comes into play, it's a ZFS Prometheus exporter, single executable written in go that plays nice with Unraid."

I could create a plugin for this if necessary, so users won't have to deal with the command line and can just install the plugin from the CA App. Have you seen my thread here yet?
Iker Posted September 24, 2021

Hi ich777, of course I have checked your post; it was the first thing I found when I was looking for a Prometheus ZFS exporter plugin. Creating a plugin for zfs_exporter would be amazing for all ZFS Unraid users, thank you very much! However, it will only apply to versions <= 6.9.2; as I stated in the post, with Unraid 6.10 things should be much easier with zpool_influxdb. Let me know if you need anything from me for the plugin.
ich777 Posted September 24, 2021

20 minutes ago, Iker said:
"Let me know if you need anything from me for the plugin."

I'll have to look into it... I don't use ZFS currently and I'm not really familiar with ZFS and the exporter. Is it necessary to create an entry for each mountpoint, or is this discovered automatically?
Iker Posted September 24, 2021

Pool discovery is completely automatic; the only configuration for the exporter is its command-line arguments, which control which properties and pools are included and which datasets are excluded. This is an example of the command line:

```shell
zfs_exporter --collector.dataset-snapshot --properties.dataset-snapshot="logicalused,referenced,used,written" --exclude=^ssdnvme/dockerfiles/
```

--collector.dataset-snapshot: include the properties for any snapshots that are present
--properties.dataset-snapshot: (depends on the previous flag) which properties to include for snapshots
--exclude: a regular expression for datasets/snapshots/volumes to exclude

There are some other options that may be interesting for you; they control the web UI, log level, etc. You can check all of them on the zfs_exporter GitHub page (https://github.com/pdf/zfs_exporter).
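Since --exclude takes a regular expression, it can be handy to dry-run a pattern against dataset names before handing it to the exporter. A minimal sketch with grep -Ev against made-up sample names (on a real system you would pipe in `zfs list -H -o name` instead; note grep uses POSIX ERE while the exporter uses Go's regexp, but simple patterns like this behave the same):

```shell
# The pattern is the one used in this guide; the dataset names are hypothetical.
PATTERN='^ssdnvme/dockerfiles/'
KEPT=$(printf '%s\n' \
    'ssdnvme/appdata' \
    'ssdnvme/dockerfiles/abc123' \
    'hddmain/media' \
  | grep -Ev "$PATTERN")
# datasets that would still be collected
echo "$KEPT"
```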
ich777 Posted September 25, 2021

11 hours ago, Iker said:
"The discovery of the pools is completely automatic"

Thank you for the write-up, I will look into it ASAP, but give me a little time since this seems a little more complicated than I first thought... 😅 But I have some ideas...
TheSkaz Posted October 4, 2021

@Iker would you be able to share the JSON for your dashboard as a starting point? I have everything else working.
Iker Posted October 5, 2021

No problem; keep in mind the panels are a mix of Prometheus and InfluxDB 2 (Telegraf) sources, and the syntax of Flux is very different. ZFS.json
TheSkaz Posted October 5, 2021

Thank you. Quick question: for the InfluxDB queries, mine show "select measurement" on all of them, so I assume I don't have the corresponding metrics in the DB? My telegraf.conf has zfs enabled with pool and dataset metrics set to true.
Iker Posted October 6, 2021 (edited)

As I stated in my answer, I use InfluxDB 2, so the query language is Flux, not InfluxQL. The data is still present in the database as long as you are using Telegraf, but the queries are very different. For example:

Pool Writes (Flux):

```flux
from(bucket: "Telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "zfs_pool")
  |> filter(fn: (r) => r["_field"] == "nwritten")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
  |> derivative(unit: 1s, nonNegative: true)
  |> map(fn: (r) => ({ _value: r._value, _time: r._time, _field: r.pool }))
```

Pool Writes (InfluxQL):

```sql
SELECT non_negative_derivative(mean("nwritten"), 1s) AS "writes" FROM "zfs_pool" WHERE $timeFilter GROUP BY time($__interval), "pool" fill(none)
```

I'm not completely sure about the InfluxQL query (I don't have any InfluxDB 1.8 database available right now), but it should be something along those lines (https://docs.influxdata.com/influxdb/v1.8/flux/flux-vs-influxql/). In the next few days I'll come back to you with all the queries translated to their InfluxQL equivalents; just let me spin up an InfluxDB 1.8 and translate them.
TheSkaz Posted October 6, 2021

Took your sample and edited it for the other graphs. Currently working on ARC size and demand.
TheSkaz Posted October 6, 2021 Share Posted October 6, 2021 from(bucket: v.bucket) |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r[\"_measurement\"] == \"zfs\") |> filter(fn: (r) => r[\"pools\"] == \"hddmain::ssdnvme::ssdsata\") |> filter(fn: (r) => r[\"_field\"] == \"arcstats_size\" or r[\"_field\"] == \"arcstats_data_size\" or r[\"_field\"] == \"arcstats_metadata_size\" or r[\"_field\"] == \"arcstats_mfu_size\" or r[\"_field\"] == \"arcstats_dnode_size\" or r[\"_field\"] == \"arcstats_mru_size\") |> aggregateWindow(every: v.windowPeriod, fn: mean) |> map(fn: (r) => ({ _value: r._value, _time:r._time, _field : r._field}))", Arc Size, converted to this so far: SELECT mean("arcstats_size") as Size, mean("arcstats_data_size") as Data, mean("arcstats_metadata_size") as Metadata, mean("arcstats_mfu_size") as MFU, mean("arcstats_dnode_size") as DNODE, mean("arcstats_mru_size") as MRU FROM "zfs" WHERE $timeFilter GROUP BY time($__interval) fill(none) Quote Link to comment
Iker Posted October 6, 2021

Wow, that's great; it seems you have most of the panels working by now. Let me know if you need any help.
TheSkaz Posted October 6, 2021

The ARC Demand is throwing me off. I can get the Data Hit Ratio, I think, but the pivot that you have going on, I have no clue about.
Iker Posted October 7, 2021 (edited)

There you go, check it out; any questions, happy to help: ZFS-Influx1.8.json. Also, check the parameters in the non_negative_derivative calls, because your latencies are way over the roof; maybe compare with the output of "zpool iostat -l".
TheSkaz Posted October 8, 2021

Attached is my updated one, using the 1.8.4 Influx. You were right, my non_negative_derivatives were set at 1s instead of 1ms. ZFS-1.8.json
Iker Posted October 10, 2021 (edited)

As an update: I was checking out ZFS version 2.1, but it was marked as "unstable" on Unraid 6.9.2, so I simply YOLO'd it and upgraded to 6.10-rc1. The result: zfs_exporter works just fine, but Telegraf is not polling any metrics from the pools. However, I found that zpool_influxdb works without much trouble, at least for pools (not datasets). I'm going to work with just zfs_exporter & zpool_influxdb for the next guide.
TheSkaz Posted October 11, 2021

On 10/9/2021 at 10:30 PM, Iker said:
"...simply YOLO it and upgrade to 6.10rc1..."

I did that and was getting kernel panics from ZFS hourly... I had to downgrade.
Iker Posted October 11, 2021 (edited)

Hope the rc2 is more stable for you. ZFS 2.1 introduced dRAID, and even if you are not using it, it's nice to have. On the monitoring side there are more options with zpool_influxdb, and some of the existing metrics are different; when my dashboard is more mature I will write another post about it. This is a little sneak peek:
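For anyone wanting to try the zpool_influxdb route on 6.10 in the meantime: the tool prints InfluxDB line protocol on stdout, so its man page pairs it with Telegraf's exec input. A minimal sketch of that telegraf.conf fragment; the binary path is a common install location, not guaranteed, so check where your build puts it:

```toml
[[inputs.exec]]
  # zpool_influxdb emits pool metrics as InfluxDB line protocol on stdout
  commands = ["/usr/libexec/zfs/zpool_influxdb"]
  timeout = "5s"
  data_format = "influx"
```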
norsemanGrey Posted July 29, 2022

On 9/23/2021 at 12:09 AM, Iker said:
"With Grafana the process is very straightforward, add the Prometheus source, and start work the panels to your preference"

Very nice dashboard! Can I ask how you are able to get named datasets for your Docker containers? My container datasets are created with the full container ID. Also, does the dataset size reflect only the container-layer dataset, or all the image-layer datasets for that container as well?
Iker Posted July 31, 2022

The datasets are just the containers' data, not the image layers; in fact, I excluded all the information from /<pool>/dockerfiles/, it's just way too much to deal with.
boomam Posted May 31, 2023

Old thread, but any chance we can get this JSON sanitized and pushed to the Grafana library? It would allow others to easily use the template without having to hack away at it for their own purposes.
Iker Posted May 31, 2023

Hi my friend; unfortunately, I'm no longer using this dashboard, as I ditched InfluxDB and Telegraf from my stack in favor of VictoriaMetrics. I have plans to write a new guide, including a template for the Grafana dashboard, but it will take me a while.
boomam Posted June 2, 2023

That's a shame. If you don't mind me asking (it is slightly off-topic), why Victoria and not Influx/Telegraf?
Iker Posted June 2, 2023

Well, the InfluxDB migration from v1.8 to 2.x alone was wild, and when you have a lot of data the performance is far from good; they have had to rewrite the entire engine twice now (https://www.influxdata.com/products/influxdb-overview/#influxdb-edge-cluster-updates). So I decided to move to Prometheus, and VictoriaMetrics as long-term storage makes a lot of sense: queries are really fast, and Victoria is compatible with the Telegraf line protocol. Overall I'm happy with my decision, and my dashboards load much faster now.
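For context on how those pieces fit together: a stock Prometheus can ship its samples to a single-node VictoriaMetrics instance with a single remote_write entry in prometheus.yml. A minimal sketch; the hostname is a placeholder, and 8428 is VictoriaMetrics' default HTTP port:

```yaml
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
```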