Monitoring ZFS on Unraid (2023)



Back in the day, I wrote a post about monitoring and gathering some of the most relevant stats from ZFS; however, my stack has changed quite a bit since then, so this is an updated and refined version of how to monitor ZFS on Unraid.

Tech Stack

  • Unraid 6.9+
    • Prometheus Node Exporter - Plugin
    • User Scripts - Plugin
  • VictoriaMetrics (Docker Compose provided)
  • Grafana

VictoriaMetrics?

Including VictoriaMetrics is not a common choice; what about InfluxDB and Prometheus? It is pretty simple: InfluxDB has been going nowhere for quite some time, to the point that they rewrote the entire thing for InfluxDB 3.0 (the OSS version is not released yet), which is not exactly inspiring confidence for a long-term time series DB. As for Prometheus, it is not designed for long-term storage, since there is no sharding or replication built in; but we can always look for Prometheus-compatible alternatives. In this case, my choice was VictoriaMetrics (VM), because it is easy to deploy and maintain and offers compatibility with Prometheus queries, remote write, InfluxDB's Line Protocol, and a ton of other things (check here for details).

Installation

We must install the Prometheus Node Exporter and User Scripts plugins; both run directly on Unraid (not Docker) and require no further configuration. This is a very simple step; just go to the Apps tab and look for the plugins:
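
Before moving on, it is worth confirming that the exporter actually answers; here is a quick sanity check from the Unraid terminal:

#!/bin/bash
# The Node Exporter listens on port 9100 by default; on a ZFS system it
# already exposes ARC counters through its node_zfs_* collector.
curl -s http://localhost:9100/metrics | grep -m 5 '^node_zfs_arc'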

 


 

Deploying Grafana is beyond the scope of this post, but there is no shortage of tutorials on how to deploy the app; keep in mind that we are only interested in Grafana itself, not the other components those tutorials often bundle (Prometheus, InfluxDB, Telegraf, etc.).

For VictoriaMetrics, use the following Docker Compose file to deploy VM as a single instance with everything packed up, listening on Unraid_IP:8428:

 

services:
  victoriametrics:
    container_name: victoriametrics
    image: victoriametrics/victoria-metrics:latest
    volumes:
      - /mnt/user/appdata/victoriametrics:/victoria-metrics-data
    command:
      - "--retentionPeriod=6" # In Months
      - "--selfScrapeInterval=30s" # Self monitoring
      - "--loggerTimezone=America/Bogota" 
      - "--maxLabelsPerTimeseries=60" # Increase for complex queries
      - "--promscrape.config=/victoria-metrics-data/prometheus.yml" 
    restart: unless-stopped
    ports:
      - 8428:8428
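
To bring the container up and confirm it is healthy, something like this should do (a minimal sketch; it assumes the Compose file above is saved on the Unraid server and that 192.168.1.100 is your Unraid IP):

#!/bin/bash
# Start VM in the background, then hit its /health endpoint,
# which returns "OK" once the instance is ready to accept data.
docker compose up -d
curl -s http://192.168.1.100:8428/health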

Configuration

Once everything is installed, we need to configure a few things for the scraping to be successful; let's start with the VM promscrape configuration file (prometheus.yml, which, given the Compose file above, lives at /mnt/user/appdata/victoriametrics/prometheus.yml):

global:
  scrape_interval:     30s # By default, scrape targets every 30 seconds

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'vmetrics'

scrape_configs:
# Here we configure Unraid scraping; this gives us multiple stats, not just ZFS
  - job_name: 'Unraid'
    scheme: http
    static_configs:
      - targets: ['192.168.1.100:9100']
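
Once the file is in place and VM has been restarted to pick it up, you can check that the Unraid target is actually being scraped; VM exposes the Prometheus-compatible targets API, so a quick check looks like this (same example IP as above):

#!/bin/bash
# List the active scrape targets; the "Unraid" job should report health "up".
curl -s http://192.168.1.100:8428/api/v1/targets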

 

This should get us a good amount of information, not only about ZFS but about the entire system. Our next step is to export ZFS-specific information, which can be achieved with multiple tools; examples include Prometheus ZFS exporter, Prometheus exporter for ZFS stats, and many others. The downside is that those exporters put some load on the system, mainly because of the parsing work they do; that was the main reason I chose another option, zpool_influxdb (zpool_influxdb.8 — OpenZFS documentation). This has the benefit of already shipping with ZFS, so you don't have to install anything else; just use the following user script, "zpool_influxdb redirector":

#!/bin/bash
# Pipe zpool_influxdb's Line Protocol output straight into VM's /write API.
# Plain HTTP: the single-instance VM above does not terminate TLS.
/usr/libexec/zfs/zpool_influxdb | curl -X POST 'http://<UNRAID_IP>:8428/write?extra_label=job=Unraid' --data-binary @-

 

The script should run every minute; even though the data comes in Influx's Line Protocol format, VM can parse it without trouble and ingest it into the DB. We add an extra label, "job=Unraid", just in case we need to identify the source host (multiple Unraid servers, anyone?).
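
To confirm the zpool metrics actually landed in VM, you can list the metric names it knows about; a rough check, assuming the example IP from earlier (zpool_influxdb's measurements should show up with names starting with zpool_):

#!/bin/bash
# Dump all known metric names and keep only the zpool_* series.
curl -s http://192.168.1.100:8428/api/v1/label/__name__/values | tr ',' '\n' | grep zpool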

At this point, we should have everything correctly set up for our monitoring; the only missing piece now is creating our dashboard and, if needed, alerts.

Dashboard

The dashboard I currently use is published here: ZFS Details | Grafana Labs; let's explore some of the panels and clarify what they are for.

General


 

This panel helps us keep an eye on basic stats like health, size, and used and free space, as well as more advanced ones such as fragmentation and checksum errors, which are some of the most important indicators for a pool. Depending on your pool topology, you may want to change panels here and there.
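
As an example of the kind of query behind these panels, here is a sketch of pulling the used capacity per pool straight from VM's query API; the zpool_stats_alloc and zpool_stats_size names are my assumption of how the zpool_stats measurement gets ingested, so verify them against the names on your instance:

#!/bin/bash
# Percentage of each pool's capacity in use (metric names are assumptions).
curl -s http://192.168.1.100:8428/api/v1/query \
  --data-urlencode 'query=100 * zpool_stats_alloc{job="Unraid"} / zpool_stats_size{job="Unraid"}'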

ARC


 

One of the most confusing features of ZFS is the ARC (Adaptive Replacement Cache); with this panel, you have the information needed to optimize, grow, or even shrink your ARC based on its utilization and hit ratio.
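
If you want to compute the hit ratio outside the dashboard, the ARC counters are already there courtesy of the Node Exporter; a minimal sketch against VM's query API (same example IP):

#!/bin/bash
# ARC hit ratio over the last 5 minutes, from node_exporter's node_zfs_* metrics.
curl -s http://192.168.1.100:8428/api/v1/query \
  --data-urlencode 'query=rate(node_zfs_arc_hits[5m]) / (rate(node_zfs_arc_hits[5m]) + rate(node_zfs_arc_misses[5m]))'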

Pool Stats


 

Pool stats give direct information about writes and reads on the pool, including the latency of those read/write requests; in some cases, you will be able to spot slow processes and devices based on the latency and on how the pool performs under certain workloads.

IO Stats


 

IO stats are all about latency, covering multiple operations such as Trim, Scrub, Sync, and Async; again, this comes in handy for diagnosing problems with specific workloads and for complementing the other stats for whatever you may be running on the pools.

What about my Disks?

You may notice that there are no stats for specific disks. The Node Exporter plugin includes disk-specific information; you can find some of those metrics in the Node Exporter Full | Grafana Labs dashboard and import the ones you find helpful. Personally, I rarely use my other disk dashboard, as it is no longer helpful to me beyond checking how new disks are performing.
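
If you only want a couple of per-disk panels instead of importing the full dashboard, the underlying metrics are standard Node Exporter ones; for example, per-device read throughput (again, adjust the example IP to your setup):

#!/bin/bash
# Bytes read per second for each disk, averaged over the last 5 minutes.
curl -s http://192.168.1.100:8428/api/v1/query \
  --data-urlencode 'query=rate(node_disk_read_bytes_total{job="Unraid"}[5m])'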

Final Words

This has been my go-to dashboard for the last year, it works very well for me, and I sincerely hope you find it useful. One last thing you can add to your stack, and which I strongly recommend you explore, is alerting, which you can implement in multiple ways.

 

Best Regards!

Edited by Iker

This seems great!

I'm struggling with a piece here, the user-script:

#!/bin/bash
/usr/libexec/zfs/zpool_influxdb | curl -X POST 'https://<VMIP/HVMHOST>:<PORT>/write?extra_label=job=Unraid' --data-binary @-

 

What should I put in here:
https://<VMIP/HVMHOST>:<PORT>

 

I don't understand what you mean by HVMHOST, and which port?

 

Edit:
I tried with the VictoriaMetrics IP and port 8428, and I get this in the script log:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  144k    0     0  100  144k      0 20.2M --:--:-- --:--:-- --:--:-- 23.5M

Script Finished Aug 24, 2023 11:33.02

 

But no data is populating the ZFS dashboard for pools etc.?

 

 

Edit 2: Got it working!

The extra_label=job=Unraid in the script needs to match the job_name set in VictoriaMetrics' prometheus.yml file, so in my setup it had to be changed to:

extra_label=job=unRaid

Edited by Allram

How do I get my `<VMIP/HVMHOST>:<PORT>`?

`ip a` in the VM Container returns

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
102: eth0@if103: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:13:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.2/16 brd 172.19.255.255 scope global eth0
       valid_lft forever preferred_lft forever

 

Edited by Roalkege

Maybe you should also change the IP in the VM config.

 

Another thing: I added

http://unraid-ip/metrics

as a Prometheus datasource. After that, I imported the dashboard and selected Prometheus as the source, but the dashboard shows "No Data". "http://unraid-ip" shows a website and "http://unraid-ip/metrics" shows a lot of things.

Edited by Roalkege
4 hours ago, Roalkege said:

How do I get my `<VMIP/HVMHOST>:<PORT>` 


 

See our docker-compose example https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/docker-compose.yml#L9

