Iker Posted September 22, 2021 (edited)

Overview: I started using ZFS a little while ago, but the lack of ZFS-related information in the GUI really bothered me. My Unraid is mainly monitored from Grafana, so I went looking for monitoring solutions, but found nothing that works consistently well on Unraid. This is a guide for users of the ZFS plugin who want to monitor their ZFS pools using Grafana, Prometheus and zfs_exporter. The only goal is a nice dashboard with the general stats of the pools (ARC stats not included; those are possible with Telegraf if you like).

Requirements:
Unraid <= 6.9.2
Grafana
Basic knowledge of PromQL

Install zfs_exporter

There is a limitation on which properties you can monitor with Telegraf on Linux: no health status, no vdev/dataset/volume info, no snapshots, no size of the pools and datasets, etc. This is where zfs_exporter comes into play: it's a ZFS Prometheus exporter, a single executable written in Go that plays nicely with Unraid. Download the Linux x64 version from the releases page (https://github.com/pdf/zfs_exporter) and place it in a directory, preferably on one of your pools:

```shell
tar -xf zfs_exporter-2.2.1.linux-amd64.tar.gz -C /ssdnvme/scripts
```

In the User Scripts plugin, create a new script named zfs_exporter with the following contents:

```shell
#!/bin/bash
echo "/ssdnvme/scripts/zfs_exporter --properties.dataset-filesystem=available,logicalused,quota,referenced,used,usedbydataset,written --collector.dataset-snapshot --properties.dataset-snapshot=logicalused,referenced,used,written --exclude=^ssdnvme/dockerfiles/" | at NOW -M > /dev/null 2>&1
```

Configure the script to be executed At Startup of Array, and you're done. The location of my Docker folder is "/ssdnvme/dockerfiles" and it creates a lot of snapshots; you can exclude paths like that with the "--exclude" parameter.
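As a side note, the User Scripts entry above can be hardened slightly so that re-running it doesn't spawn a second exporter. This is only a sketch built around the same binary location and flags used in this guide (adjust the /ssdnvme/scripts path to your pool):

```shell
#!/bin/bash
# Sketch only: same binary location and flags as in the guide above.
EXPORTER=/ssdnvme/scripts/zfs_exporter
FLAGS='--properties.dataset-filesystem=available,logicalused,quota,referenced,used,usedbydataset,written'
FLAGS="$FLAGS --collector.dataset-snapshot"
FLAGS="$FLAGS --properties.dataset-snapshot=logicalused,referenced,used,written"
FLAGS="$FLAGS --exclude=^ssdnvme/dockerfiles/"

if pgrep -x zfs_exporter > /dev/null 2>&1; then
    # don't start a second copy on re-run
    STATUS="already running"
elif [ -x "$EXPORTER" ]; then
    # 'at NOW' detaches the exporter from the User Scripts session
    echo "$EXPORTER $FLAGS" | at NOW -M > /dev/null 2>&1
    STATUS="started"
else
    STATUS="binary not found at $EXPORTER"
fi
echo "zfs_exporter: $STATUS"
```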
If everything is OK, you should be able to access "http://YOURUNRAIDIP:9134/metrics" and see the exported metrics. If you are unable to access it, check whether zfs_exporter is running:

```shell
ps aux | grep zfs_exporter
```

Install Prometheus & Configure

From Community Applications, install the Prometheus Docker container and modify "prometheus.yml" to collect metrics from "YOURUNRAIDIP:9134". Restart Prometheus; if everything went OK, you should be able to test some expressions.

Create a Dashboard

With Grafana the process is very straightforward: add the Prometheus data source and start building the panels to your preference.

Final Words

I don't expect this configuration to be necessary on Unraid in the near future, but it is useful to me. Unraid 6.10 ships with at least ZFS 2.1, which includes zpool_influxdb (https://openzfs.github.io/openzfs-docs/man/8/zpool_influxdb.8.html), making the process of exporting metrics much easier. Hope this helps some ZFS users in the Unraid community; let me know if you have any questions or run into any issues getting this up and running.
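As a footnote to the "Install Prometheus & Configure" step: the prometheus.yml change boils down to one scrape job pointing at the exporter. A minimal sketch; the job name and 15-second interval are my own illustrative choices, only the target address comes from this guide:

```yaml
scrape_configs:
  - job_name: 'zfs_exporter'     # hypothetical name, pick anything
    scrape_interval: 15s
    static_configs:
      - targets: ['YOURUNRAIDIP:9134']
```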
ich777 Posted September 24, 2021

On 9/23/2021 at 12:09 AM, Iker said:
"There is a limitation on what properties you could monitor with Telegraf on Linux... here is where zfs_exporter comes into play, it's a ZFS Prometheus exporter, single executable written in go that plays nice with Unraid."

I could create a plugin for this if necessary, so users won't have to deal with the command line and can just install the plugin from the CA App. Have you seen my thread here yet?
Iker Posted September 24, 2021

Hi ich777, of course I have checked your post; it was the first thing I found when I was looking for a Prometheus ZFS exporter plugin. Creating a plugin for zfs_exporter would be amazing for all ZFS Unraid users, thank you very much! However, it will only apply to versions <= 6.9.2; as I stated in the post, with Unraid 6.10 things should be much easier with zpool_influxdb. Let me know if you need anything from me for the plugin.
ich777 Posted September 24, 2021

20 minutes ago, Iker said:
"Let me know if you need anything from me for the plugin."

I'll have to look into it... I don't use ZFS currently and I'm not really familiar with ZFS and the exporter. Is it necessary to create an entry for each mountpoint, or is this discovered automatically?
Iker Posted September 24, 2021

Pool discovery is completely automatic; the only configuration for the exporter is its command-line arguments, which control which properties and pools are included and which datasets are excluded. This is an example of the command line:

```shell
zfs_exporter --collector.dataset-snapshot --properties.dataset-snapshot="logicalused,referenced,used,written" --exclude=^ssdnvme/dockerfiles/
```

--collector.dataset-snapshot: include the properties for any snapshots that are present
--properties.dataset-snapshot: (depends on the previous flag) which properties to include for snapshots
--exclude: a regular expression for datasets/snapshots/volumes to exclude

There are some other options that may be interesting for you; they control the web UI, log level, etc. You can check all of them on the zfs_exporter GitHub page (https://github.com/pdf/zfs_exporter).
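Since --exclude takes a regular expression, it can be handy to dry-run a pattern against dataset names before handing it to the exporter. A minimal sketch with grep -Ev against made-up sample names (on a real system you would pipe in `zfs list -H -o name` instead; note grep uses POSIX ERE while the exporter uses Go's regexp, but simple patterns like this behave the same):

```shell
# The pattern is the one used in this guide; the dataset names are hypothetical.
PATTERN='^ssdnvme/dockerfiles/'
KEPT=$(printf '%s\n' \
    'ssdnvme/appdata' \
    'ssdnvme/dockerfiles/abc123' \
    'hddmain/media' \
  | grep -Ev "$PATTERN")
# datasets that would still be collected
echo "$KEPT"
```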
ich777 Posted September 25, 2021

11 hours ago, Iker said:
"The discovery of the pools is completely automatic"

Thank you for the write-up, I will look into it ASAP, but give me a little time since this seems a little more complicated than I first thought... 😅 But I have some ideas...
TheSkaz Posted October 4, 2021

@Iker would you be able to share the JSON for your dashboard as a starting point? I have everything else working.
Iker Posted October 5, 2021

No problem; keep in mind the panels are a mix of Prometheus and InfluxDB 2 (Telegraf) sources, and the syntax of Flux is very different. ZFS.json
TheSkaz Posted October 5, 2021

Thank you. Quick question: for the InfluxDB queries, mine show "select measurement" on all of them, so I assume I don't have the corresponding metrics in the DB? My telegraf.conf has zfs enabled with pool and dataset metrics set to true.
Iker Posted October 6, 2021 (edited)

As I stated in my answer, I use InfluxDB 2, so the query language is Flux, not InfluxQL. The data is still present in the database as long as you are using Telegraf, but the queries are very different. For example:

Pool Writes (Flux):

```flux
from(bucket: "Telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "zfs_pool")
  |> filter(fn: (r) => r["_field"] == "nwritten")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
  |> derivative(unit: 1s, nonNegative: true)
  |> map(fn: (r) => ({ _value: r._value, _time: r._time, _field: r.pool }))
```

Pool Writes (InfluxQL):

```sql
SELECT non_negative_derivative(mean("nwritten"), 1s) AS "writes" FROM "zfs_pool" WHERE $timeFilter GROUP BY time($__interval), "pool" fill(none)
```

I'm not completely sure about the InfluxQL query (I don't have any InfluxDB 1.8 database available right now), but it should be something along those lines (https://docs.influxdata.com/influxdb/v1.8/flux/flux-vs-influxql/). In the next few days I'll come back to you with all the queries translated to their InfluxQL equivalents; just let me spin up an InfluxDB 1.8 and translate them.
TheSkaz Posted October 6, 2021

Took your sample and edited it for the other graphs. Currently working on ARC size and demand.
TheSkaz Posted October 6, 2021 Share Posted October 6, 2021 from(bucket: v.bucket) |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r[\"_measurement\"] == \"zfs\") |> filter(fn: (r) => r[\"pools\"] == \"hddmain::ssdnvme::ssdsata\") |> filter(fn: (r) => r[\"_field\"] == \"arcstats_size\" or r[\"_field\"] == \"arcstats_data_size\" or r[\"_field\"] == \"arcstats_metadata_size\" or r[\"_field\"] == \"arcstats_mfu_size\" or r[\"_field\"] == \"arcstats_dnode_size\" or r[\"_field\"] == \"arcstats_mru_size\") |> aggregateWindow(every: v.windowPeriod, fn: mean) |> map(fn: (r) => ({ _value: r._value, _time:r._time, _field : r._field}))", Arc Size, converted to this so far: SELECT mean("arcstats_size") as Size, mean("arcstats_data_size") as Data, mean("arcstats_metadata_size") as Metadata, mean("arcstats_mfu_size") as MFU, mean("arcstats_dnode_size") as DNODE, mean("arcstats_mru_size") as MRU FROM "zfs" WHERE $timeFilter GROUP BY time($__interval) fill(none) Quote Link to comment
Iker Posted October 6, 2021

Wow, that's great; it seems you have most of the panels working by now. Let me know if you need any help.
TheSkaz Posted October 6, 2021

The ARC Demand is throwing me off. I can get the Data Hit Ratio, I think, but the pivot that you have going on, I have no clue about.
Iker Posted October 7, 2021 (edited)

There you go, check it out; any questions, happy to help: ZFS-Influx1.8.json. Also, check the parameters in the non_negative_derivative calls, because your latencies are way over the roof; maybe compare with the output of "zpool iostat -l".
TheSkaz Posted October 8, 2021

Attached is my updated one, using the 1.8.4 Influx. You were right, my non_negative_derivatives were set at 1s instead of 1ms. ZFS-1.8.json
Iker Posted October 10, 2021 (edited)

As an update: I was checking out ZFS version 2.1, but it was marked as "unstable" on Unraid 6.9.2, so I simply YOLO'd it and upgraded to 6.10-rc1. The result: zfs_exporter works just fine, but Telegraf is not polling any metrics from the pools. However, I found that zpool_influxdb works without much trouble, at least for pools (not datasets). I'm going to work with just zfs_exporter & zpool_influxdb for the next guide.
TheSkaz Posted October 11, 2021

On 10/9/2021 at 10:30 PM, Iker said:
"...simply YOLO it and upgrade to 6.10rc1..."

I did that and was getting kernel panics from ZFS hourly... I had to downgrade.
Iker Posted October 11, 2021 (edited)

Hope the rc2 is more stable for you. ZFS 2.1 introduced dRAID, and even if you are not using it, it's nice to have. On the monitoring side there are more options with zpool_influxdb, and some of the existing metrics are different; when my dashboard is more mature I will write another post about it. This is a little sneak peek:
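For anyone wanting to try the zpool_influxdb route on 6.10 in the meantime: the tool prints InfluxDB line protocol on stdout, so its man page pairs it with Telegraf's exec input. A minimal sketch of that telegraf.conf fragment; the binary path is a common install location, not guaranteed, so check where your build puts it:

```toml
[[inputs.exec]]
  # zpool_influxdb emits pool metrics as InfluxDB line protocol on stdout
  commands = ["/usr/libexec/zfs/zpool_influxdb"]
  timeout = "5s"
  data_format = "influx"
```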
norsemanGrey Posted July 29, 2022

On 9/23/2021 at 12:09 AM, Iker said:
"With Grafana the process is very straightforward, add the Prometheus source, and start work the panels to your preference"

Very nice dashboard! Can I ask how you are able to get named datasets for your Docker containers? My container datasets are created with the full container ID. Also, does the dataset size reflect only the container-layer dataset, or all the image-layer datasets for that container as well?
Iker Posted July 31, 2022

The datasets are just the containers' data, not the image layers; in fact, I excluded all the information from /<pool>/dockerfiles/, it's just way too much to deal with.
boomam Posted May 31, 2023

Old thread, but any chance we can get this JSON sanitized and pushed to the Grafana library? It would allow others to easily use the template without having to hack away at it for their own purposes.
Iker Posted May 31, 2023

Hi my friend; unfortunately, I'm no longer using this dashboard, as I ditched InfluxDB and Telegraf from my stack in favor of VictoriaMetrics. I have plans to write a new guide, including a template for the Grafana dashboard, but it will take me a while.
boomam Posted June 2, 2023

That's a shame. If you don't mind me asking (it is slightly off-topic), why Victoria and not Influx/Telegraf?
Iker Posted June 2, 2023

Well, the InfluxDB migration from v1.8 to 2.x alone was wild, and when you have a lot of data the performance is far from good; they have had to rewrite the entire engine twice now (https://www.influxdata.com/products/influxdb-overview/#influxdb-edge-cluster-updates). So I decided to move to Prometheus, and VictoriaMetrics as long-term storage makes a lot of sense: queries are really fast, and Victoria is compatible with the Telegraf line protocol. Overall I'm happy with my decision, and my dashboards load much faster now.
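For context on how those pieces fit together: a stock Prometheus can ship its samples to a single-node VictoriaMetrics instance with a single remote_write entry in prometheus.yml. A minimal sketch; the hostname is a placeholder, and 8428 is VictoriaMetrics' default HTTP port:

```yaml
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
```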