ZFS Monitoring Dashboard


Iker


Overview:

 

I started using ZFS a little while ago, but the lack of ZFS-related information in the GUI really bothered me. My Unraid is mainly monitored from Grafana, so I started looking for monitoring solutions, but found nothing that works consistently well on Unraid.

 

This is a guide for users of the ZFS plugin who want to monitor their ZFS pools using Grafana, Prometheus, and zfs_exporter. The only purpose is to build a nice dashboard with the general stats of the pools (ARC stats not included; those are possible with Telegraf if you like).

 

Requirements:

 

Unraid <= 6.9.2 

Grafana

Basic knowledge of PromQL

 

Install zfs_exporter

 

There is a limit to which properties you can monitor with Telegraf on Linux: no health status, no vdev, dataset, or volume info, no snapshots, no sizes for pools or datasets, etc. This is where zfs_exporter comes into play: it's a ZFS Prometheus exporter, a single executable written in Go that plays nicely with Unraid.

 

Download the Linux x64 version from the releases page and place it in a directory, preferably on one of your pools:

 

https://github.com/pdf/zfs_exporter
zfs_exporter-2.2.1.linux-amd64.tar.gz
tar -xf zfs_exporter-2.2.1.linux-amd64.tar.gz -C /ssdnvme/scripts
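For reference, grabbing and unpacking the release entirely from the shell might look like the sketch below. The download URL follows GitHub's usual release-asset pattern for v2.2.1, and --strip-components drops the archive's top-level folder so the binary lands at /ssdnvme/scripts/zfs_exporter; verify both against the actual releases page:

cd /ssdnvme/scripts
wget https://github.com/pdf/zfs_exporter/releases/download/v2.2.1/zfs_exporter-2.2.1.linux-amd64.tar.gz
tar -xf zfs_exporter-2.2.1.linux-amd64.tar.gz --strip-components=1
chmod +x zfs_exporter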

 

In the User Scripts plugin, create a new script named zfs_exporter. Piping the command through "at NOW" runs the exporter detached from the plugin, so the script returns immediately; the script contents are:

 

#!/bin/bash
echo '/ssdnvme/scripts/zfs_exporter --properties.dataset-filesystem="available,logicalused,quota,referenced,used,usedbydataset,written" --collector.dataset-snapshot --properties.dataset-snapshot="logicalused,referenced,used,written" --exclude=^ssdnvme/dockerfiles/' | at NOW -M > /dev/null 2>&1

 

Configure the script to be executed "At Startup of Array", and you're done. The location of my Docker folder is "/ssdnvme/dockerfiles", and it creates a lot of snapshots; you can exclude paths like that with the "--exclude" parameter.

 

If everything is OK, you should be able to access "http://YOURUNRAIDIP:9134/metrics" and see something like this:

 

[Screenshot: example zfs_exporter /metrics output]

 

If you are unable to access it, check whether zfs_exporter is running ("ps aux | grep zfs_exporter").
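You can also sanity-check the endpoint from a shell on the server itself; a quick one-liner (assuming curl is available, as it is on stock Unraid):

curl -s http://localhost:9134/metrics | head -n 20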

 

Install Prometheus & Configure

 

From Community Applications, install the Prometheus Docker container and modify "prometheus.yml" to collect metrics from "YOURUNRAIDIP:9134".
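The scrape job itself is only a few lines; here is a minimal sketch of the relevant prometheus.yml fragment (the job name is arbitrary, and YOURUNRAIDIP is a placeholder for your server's address):

scrape_configs:
  - job_name: "zfs_exporter"
    scrape_interval: 30s
    static_configs:
      - targets: ["YOURUNRAIDIP:9134"]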

 

[Screenshots: editing prometheus.yml with the new scrape target]

 

Restart Prometheus; if everything went OK, you should be able to test some expressions.
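For example, something along these lines; note that the metric names below are illustrative guesses, so copy the exact names from your /metrics page:

# Pool health as reported by zfs_exporter (the exact metric name may differ by version)
zfs_pool_health

# Bytes used summed per pool, assuming a zfs_dataset_used_bytes metric with a "pool" label
sum by (pool) (zfs_dataset_used_bytes)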

 

[Screenshot: testing expressions in the Prometheus web UI]

 

 

Create a Dashboard

With Grafana the process is very straightforward: add the Prometheus data source and start building the panels to your preference. Example:

 

[Screenshot: example Grafana dashboard]

 

Final Words

 

I don't expect this configuration to be necessary for Unraid in the near future, but it's useful for me. Unraid 6.10 comes with at least ZFS 2.1, which includes zpool_influxdb (https://openzfs.github.io/openzfs-docs/man/8/zpool_influxdb.8.html), making the process of exporting the metrics much easier.
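For the curious: zpool_influxdb prints pool statistics in InfluxDB line protocol on stdout, so hooking it into Telegraf can be as simple as an exec input. A minimal sketch, with the binary path being an assumption (locate yours with "which zpool_influxdb"):

[[inputs.exec]]
  # Path is an assumption; Unraid 6.10 may install the binary elsewhere.
  commands = ["/usr/libexec/zfs/zpool_influxdb"]
  timeout = "5s"
  data_format = "influx"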

 

Hope it helps some ZFS users in the Unraid community; let me know if you have any questions or run into any issues getting this up and running.

On 9/23/2021 at 12:09 AM, Iker said:

There is a limit to which properties you can monitor with Telegraf on Linux: no health status, no vdev, dataset, or volume info, no snapshots, no sizes for pools or datasets, etc. This is where zfs_exporter comes into play: it's a ZFS Prometheus exporter, a single executable written in Go that plays nicely with Unraid.

I could create a plugin for this if necessary...

That way users won't have to deal with the command line and can just install the plugin from the CA App.

 

Have you seen my thread here yet:

 


Hi ich777, of course I have checked your post; it was the first thing I found when I was looking for a Prometheus ZFS exporter plugin. A plugin for zfs_exporter would be amazing for all ZFS Unraid users, thank you very much! However, it will only apply to versions <= 6.9.2; as I stated in the post, with Unraid 6.10 things should be much easier with zpool_influxdb.

 

Let me know if you need anything from me for the plugin.


Discovery of the pools is completely automatic; the only configuration the exporter takes is its command-line arguments, which control which properties, pools, and other options are included, and whether to exclude certain datasets. This is an example of the command line:

 

zfs_exporter --collector.dataset-snapshot --properties.dataset-snapshot="logicalused,referenced,used,written" --exclude=^ssdnvme/dockerfiles/

 

--collector.dataset-snapshot: include properties for any snapshots that may be present

--properties.dataset-snapshot: (depends on the previous flag) which properties to include for snapshots

--exclude: regular expression for which datasets/snapshots/volumes must be excluded

 

There are some other options that may interest you, controlling the web GUI, log level, etc.; you can check all of them on the zfs_exporter GitHub page (https://github.com/pdf/zfs_exporter).

 

  • 2 weeks later...

As I stated in my answer, I use InfluxDB 2, so the query language is Flux, not InfluxQL. The data is still present in the database as long as you are using Telegraf, but the queries are very different; for example:

 

Pool Writes (Flux):

from(bucket: "Telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "zfs_pool")
  |> filter(fn: (r) => r["_field"] == "nwritten")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
  |> derivative(unit: 1s, nonNegative: true)
  |> map(fn: (r) => ({ _value: r._value, _time:r._time, _field : r.pool}))

 

Pool Writes (InfluxQL):

 

SELECT non_negative_derivative(mean("nwritten"), 1s) AS "writes" FROM "zfs_pool" WHERE $timeFilter GROUP BY time($__interval), "pool" fill(none)

 

I'm not completely sure about the InfluxQL query (I don't have any InfluxDB 1.8 instance available right now), but it should be something along those lines (https://docs.influxdata.com/influxdb/v1.8/flux/flux-vs-influxql/).

 

In the next few days I'll come back to you with all the queries translated to their InfluxQL equivalents; just let me spin up an InfluxDB 1.8 instance and translate them.

 

ARC Size (Flux):

from(bucket: v.bucket)
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "zfs")
  |> filter(fn: (r) => r["pools"] == "hddmain::ssdnvme::ssdsata")
  |> filter(fn: (r) => r["_field"] == "arcstats_size" or
      r["_field"] == "arcstats_data_size" or
      r["_field"] == "arcstats_metadata_size" or
      r["_field"] == "arcstats_mfu_size" or
      r["_field"] == "arcstats_dnode_size" or
      r["_field"] == "arcstats_mru_size")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
  |> map(fn: (r) => ({ _value: r._value, _time: r._time, _field: r._field}))

 

Converted to InfluxQL so far:

 

SELECT mean("arcstats_size") as Size, mean("arcstats_data_size")  as Data, mean("arcstats_metadata_size") as Metadata, mean("arcstats_mfu_size") as MFU, mean("arcstats_dnode_size") as DNODE, mean("arcstats_mru_size") as MRU FROM "zfs" WHERE $timeFilter GROUP BY time($__interval) fill(none)

 

 

 


As an update: I was checking out ZFS 2.1, but it was marked as "unstable" on Unraid 6.9.2, so I simply YOLOed it and upgraded to 6.10-rc1. The result: zfs_exporter works just fine, but Telegraf is not polling any metrics from the pools. However, I found that zpool_influxdb works without much trouble, at least for pools (not datasets), so I'm going to build the next guide around zfs_exporter & zpool_influxdb.

On 10/9/2021 at 10:30 PM, Iker said:

As an update: I was checking out ZFS 2.1, but it was marked as "unstable" on Unraid 6.9.2, so I simply YOLOed it and upgraded to 6.10-rc1. The result: zfs_exporter works just fine, but Telegraf is not polling any metrics from the pools. However, I found that zpool_influxdb works without much trouble, at least for pools (not datasets), so I'm going to build the next guide around zfs_exporter & zpool_influxdb.

 

I did that and was getting kernel panics from ZFS hourly... had to downgrade


Hope rc2 is more stable for you. ZFS 2.1 introduced dRAID, and even if you are not using it, it's a nice-to-have. On the monitoring side there are more options with zpool_influxdb, and some of the existing metrics are different; when my dashboard is more mature I will write another post about it. This is a little sneak peek:

 

[Screenshot: sneak peek of the new dashboard]

  • 9 months later...
On 9/23/2021 at 12:09 AM, Iker said:

With Grafana the process is very straightforward: add the Prometheus data source and start building the panels to your preference. Example:

 

Very nice dashboard! 

 

Can I ask how you are able to get named datasets for your Docker containers? My container datasets are created with the full container ID. Also, does the dataset size reflect only the container-layer dataset, or all the image-layer datasets for that container as well?

  • 10 months later...

Hi my friend; unfortunately, I'm no longer using this dashboard, as I ditched InfluxDB and Telegraf from my stack in favor of VictoriaMetrics. I have plans to write a new guide, including a template for the Grafana dashboard, but it will take me a while.


Well, the InfluxDB migration from v1.8 to 2.x alone was wild, and when you have a lot of data the performance is far from good; they have had to rewrite the entire engine twice now (https://www.influxdata.com/products/influxdb-overview/#influxdb-edge-cluster-updates), so I decided to move to Prometheus. VictoriaMetrics as long-term storage makes a lot of sense: queries are really fast, and it's compatible with the Telegraf line protocol. Overall I'm happy with my decision, and my dashboards load much faster now.

