ZFS scrub speeds slow nvme

Unoid · February 4

Unraid 6.12.6

Epyc 7302, 16c 3.3ghz

256 GB pc3200 ecc

pool in question: 4x4TB team group MP43 nvme pcie 3 4x4x4x4 bifurcation raid z1. 32GB Ram Cache. pool at about 50% use. Compression=on

Scrub speeds range from 90MB/s to 250MB/s. shouldn't scrub speeds be a good bit faster? These Team group mp43's do 3000 read, 2400ish writes sequential. TLC NAND with DRAM. 5 hour 11 minutes for 6.3TB data used of 11.8TB

  pool: speedteam
 state: ONLINE
  scan: scrub repaired 0B in 05:11:58 with 0 errors on Thu Feb  1 06:11:59 2024
config:

	NAME                STATE     READ WRITE CKSUM
	speedteam           ONLINE       0     0     0
	  raidz1-0          ONLINE       0     0     0
	    /dev/nvme0n1p1  ONLINE       0     0     0
	    /dev/nvme1n1p1  ONLINE       0     0     0
	    /dev/nvme2n1p1  ONLINE       0     0     0
	    /dev/nvme3n1p1  ONLINE       0     0     0

errors: No known data errors

image.png.56189361c228657705338e4fa1e2130d.png

Edited February 5 by Unoid

JorgeB · February 5

That is low, this is what I get during a scrub, EPYC 7232P

Unoid · February 5

I have compression=on, I'm running a scrub again to gather metrics. I don't see any extra logs to give me more insight. NVME temps are all under 50C, (I just added heatsinks to them)

I captured the first 5 minutes of scrub AFTER it does the initial "indexing" or whatever operation:

  pool: speedteam
 state: ONLINE
  scan: scrub in progress since Mon Feb  5 10:30:29 2024
	7.92T scanned at 0B/s, 188G issued at 421M/s, 7.92T total
	0B repaired, 2.32% done, 05:20:51 to go

Disk IO: Writes are all less than 150kb/s. Reads in screenshot.

image.png.eeab03cbf16927d8dbf3b2c0b1623b0c.png

CPU is barely being taxed. mostly sitting around 3-10% across all threads, with occasional spikes to 50-100% Total CPU average under 6%.

image.png.fae6e4c0f7ea5de6050d9c9bd995f9d7.png

Edited February 5 by Unoid

JorgeB · February 5

50 minutes ago, Unoid said:

I have compression=on

That should not have a big impact, if any, but not sure what the issue could be, is the pool fast when copying for example from within the pool?

Unoid · February 5

1 hour ago, JorgeB said:

That should not have a big impact, if any, but not sure what the issue could be, is the pool fast when copying for example from within the pool?

Last I tested it was doing many GB/s writes copying large sequential.

Just ran a test copying a 9GB blue ray rip (h265) I have.

rsync:

sent 8,868,009,936 bytes received 35 bytes 311,158,244.60 bytes/sec

Grafana only showing peak of 129MB/s write for the rsync.

running a dd if=/dev/zero of=/mnt/user/speedtest oflag=direct bs=128k count=32k

reported 685MB/s in terminal, but grafana showed 2.03GB/s write to one nvme.

a VM with a vdisk mounted of the same zpool share running kdiskmark showed similar speeds to the screenshot in orig post.

JorgeB · February 5

dd is not a very good test, try copying one or more large files inside the pool, you can use Windows explorer for example, the data will be copied locally, it won't transverse the LAN, typically speeds should be around 1 to 2GB/s.

Unoid · February 5

I copied a 27GB file to and from the same zpool mount. my desktop with smb mount to it is limited to 5GBps which this result is earily like reading and writing over the network making out the 5GBps.

These speeds should be bottlenecked by write speed of around 2GB/s...

JorgeB · February 5

2 minutes ago, Unoid said:

I copied a 27GB file to and from the same zpool mount. my desktop with smb mount to it is limited to 5GBps which this result is earily like reading and writing over the network making out the 5GBps.

If you use Windows explorer (with Windows 8 or newer), the data won't use the network, the copy is made locally with Samba server side copy, like this, I'm currently using a gigabit LAN connection, so it would be 115MB/s max:

Unoid · February 5

1 minute ago, JorgeB said:

If you use Windows explorer (with Windows 8 or newer), the data won't use the network, the copy is made locally with Samba server side copy, like this, I'm currently using a gigabit LAN connection, so it would be 115MB/s max:

I did the same on a windows 11 gaming desktop. same speed I showed

JorgeB · February 5

So that suggests the pool may just be slow, and not just a scrub issue, could be device related, you'd need to test with different ones if possible.

Unoid · February 5

JorgeB: Thank you for helping walking me through troubleshooting steps.

At this point I'm going to have to set the shares to send data to my HDD main array and run mover. then remake the zpool and run more tests.

extended smart test = 0 errors.

PCI-E link is accurate at 8GT/s x4 lanes (PCI-E 3.0 nvme's in a carrier card in a PCI-E 4.0 x16 slot).

I can't tell what the issue may be. When data is moved I'll try disk benchmarks on each nvme separately.

Edited February 5 by Unoid

Unoid · February 6

Random unraid question. the tunables for zfs in /sys/module/zfs/parameters/*

Am I able to set each in the /boot/modprobe.d/zfs.conf

I'm thinking of changing settings for ashift and default recordsize etc.

Also can I use the CLI as root to zfs create? instead fo using the GUI?

JorgeB · February 6

The default ashift should be fine, recordsize can be changed at any time, it will only affect newly written data:

zfs set recordsize=1M pool_name

I use 1M for all my pools, in my tests it performs better with large files, especially with raidz.

Unoid · February 6

5 hours ago, JorgeB said:
The default ashift should be fine, recordsize can be changed at any time, it will only affect newly written data:
zfs set recordsize=1M pool_name
I use 1M for all my pools, in my tests it performs better with large files, especially with raidz.

I've been doing a LOT of reading on zfs on NVME. my drives only expose 512 sector size, not 4k which seems weird for a newish PCI-E 3.0 4TB device. ashift=9 is what I've read 2^9=512.

I have a spreadsheet of a lot of tests to run in different configurations. hopefully I'll find out what is slowing these nvme's down so horribly.

JorgeB, If I create zfs vdevs/pools with various options in bash, does the /boot/ OS know how to persist what I did? That's why I asked if I need to modify the zfs.conf file on /boot

Edited February 6 by Unoid

JorgeB · February 6

7 minutes ago, Unoid said:

ashift=9 is

Correct, but Unraid uses by default ashift=12, which is 4K and the current default recommendation by ZFS.

10 minutes ago, Unoid said:

JorgeB, If I create zfs vdevs/pools with various options in bash, does the /boot/ OS know how to persist what I did?

The ashift used when creating the pool is always the one used, it cannot be changed after, recordsize can be changed at anytime (for new data)

Unoid · February 6

35 minutes ago, JorgeB said:

Correct, but Unraid uses by default ashift=12, which is 4K and the current default recommendation by ZFS.

The ashift used when creating the pool is always the one used, it cannot be changed after, recordsize can be changed at anytime (for new data)

May I ask what topography you have in your nvme zpool? z1? mirrored vdevs? how many disks

JorgeB · February 6

I have one with a 5 device raiz1 and another with 4 devices, also raidz1, both using all default settings except recordsize=1M.

Unoid · February 6

Are they 4k sector size disks?

JorgeB · February 6

They are NMVe devices, they report 512B sectors, but the current recommendation AFAIK, is to always use ashift=12 with flash based devices.

JorgeB · February 6

Even if the devices were really 512B it should not affect performance using a large ashift, though it could waste a little more space.

Unoid · February 7

I ran tests after backing up data to HDD. I only played with raid type [0, z1, 2vdev mirror] I left ashift at 12 since that's the default the GUI uses.

I changed recordsizees=[16k,128k,256k,1M]

I noticed that per raid type running the fio command, the first test is at default pool of 128k, it takes a while to write the 8 job blocks, but every test with different recordsize set to the pool the 8 job chunks of 10GB were still written at the initial 128K recordsize. It introduces error into these results.

fio command taken from reading this nvme article on benchmarking nvme and keeping ARC from fudging the numbers.

https://pv-tech.eu/posts/common-pitfall-when-benchmarking-zfs-with-fio/

Sharing the results anyways:

fio command:   fio --rw=read --bs=1m --direct=1 --ioengine=libaio --size=10G  --group_reporting --filename=/mnt/user/speedtest/bucket --name=job1 --offset=0G  --name=job2 --offset=10G --name=job3 --offset=20G --name=job4 --offset=30G  --name=job5 --offset=40G --name=job6 --offset=50G --name=job7 --offset=60G  --name=job8 --offset=70G

4x4TB TeamGroup mp34 
---

Type_(recordsize, ashift)

r0_(16K,12):
READ:  bw=3049MiB/s (3197MB/s), 3049MiB/s-3049MiB/s (3197MB/s-3197MB/s), io=80.0GiB (85.9GB), run=26867-26867msec
WRITE: bw=778MiB/s (816MB/s), 778MiB/s-778MiB/s (816MB/s-816MB/s), io=80.0GiB (85.9GB), run=105330-105330msec

r0_(128K,12): 
READ:  bw=3057MiB/s (3206MB/s), 3057MiB/s-3057MiB/s (3206MB/s-3206MB/s), io=80.0GiB (85.9GB), run=26796-26796msec
WRITE: bw=6693MiB/s (7018MB/s), 6693MiB/s-6693MiB/s (7018MB/s-7018MB/s), io=80.0GiB (85.9GB), run=12239-12239msec

r0_(512k,12): 
READ:  bw=3063MiB/s (3212MB/s), 3063MiB/s-3063MiB/s (3212MB/s-3212MB/s), io=80.0GiB (85.9GB), run=26746-26746msec
WRITE: bw=3902MiB/s (4092MB/s), 3902MiB/s-3902MiB/s (4092MB/s-4092MB/s), io=80.0GiB (85.9GB), run=20994-20994msec

r0_(1M,12):  
READ:  bw=3059MiB/s (3208MB/s), 3059MiB/s-3059MiB/s (3208MB/s-3208MB/s), io=80.0GiB (85.9GB), run=26776-26776msec
WRITE: bw=3969MiB/s (4162MB/s), 3969MiB/s-3969MiB/s (4162MB/s-4162MB/s), io=80.0GiB (85.9GB), run=20639-20639msec

---

z1_(16k,12):
READ: bw=3050MiB/s (3198MB/s), 3050MiB/s-3050MiB/s (3198MB/s-3198MB/s), io=80.0GiB (85.9GB), run=26860-26860msec 
WRITE: bw=410MiB/s (430MB/s), 410MiB/s-410MiB/s (430MB/s-430MB/s), io=80.0GiB (85.9GB), run=199875-199875msec

z1_(128K,12): 
READ: bw=2984MiB/s (3129MB/s), 2984MiB/s-2984MiB/s (3129MB/s-3129MB/s), io=80.0GiB (85.9GB), run=27456-27456msec
WRITE: bw=5873MiB/s (6158MB/s), 5873MiB/s-5873MiB/s (6158MB/s-6158MB/s), io=80.0GiB (85.9GB), run=13949-13949msec

z1_(512K,12): 
READ: bw=2990MiB/s (3135MB/s), 2990MiB/s-2990MiB/s (3135MB/s-3135MB/s), io=80.0GiB (85.9GB), run=27402-27402msec
WRITE: bw=1596MiB/s (1674MB/s), 1596MiB/s-1596MiB/s (1674MB/s-1674MB/s), io=80.0GiB (85.9GB), run=51318-51318msec

z1_(1M,12): 
READ: bw=1086MiB/s (1139MB/s), 1086MiB/s-1086MiB/s (1139MB/s-1139MB/s), io=80.0GiB (85.9GB), run=75447-75447msec
WRITE: bw=1949MiB/s (2043MB/s), 1949MiB/s-1949MiB/s (2043MB/s-2043MB/s), io=80.0GiB (85.9GB), run=42039-42039msec

---

2vdev mirror_(16K,12):
READ: bw=3091MiB/s (3241MB/s), 3091MiB/s-3091MiB/s (3241MB/s-3241MB/s), io=80.0GiB (85.9GB), run=26506-26506msec
WRITE: bw=1521MiB/s (1595MB/s), 1521MiB/s-1521MiB/s (1595MB/s-1595MB/s), io=80.0GiB (85.9GB), run=53867-53867msec

2vdev mirror_(128K,12):
READ: bw=3085MiB/s (3234MB/s), 3085MiB/s-3085MiB/s (3234MB/s-3234MB/s), io=80.0GiB (85.9GB), run=26558-26558msec
WRITE: bw=4421MiB/s (4636MB/s), 4421MiB/s-4421MiB/s (4636MB/s-4636MB/s), io=80.0GiB (85.9GB), run=18529-18529msec

2vdev mirror_(512K,12):
READ: bw=3090MiB/s (3240MB/s), 3090MiB/s-3090MiB/s (3240MB/s-3240MB/s), io=80.0GiB (85.9GB), run=26510-26510msec
WRITE: bw=3486MiB/s (3655MB/s), 3486MiB/s-3486MiB/s (3655MB/s-3655MB/s), io=80.0GiB (85.9GB), run=23500-23500msec

2vdev mirror_(1M,12):
READ: bw=3104MiB/s (3255MB/s), 3104MiB/s-3104MiB/s (3255MB/s-3255MB/s), io=80.0GiB (85.9GB), run=26393-26393msec
WRITE: bw=3579MiB/s (3753MB/s), 3579MiB/s-3579MiB/s (3753MB/s-3753MB/s), io=80.0GiB (85.9GB), run=22891-22891msec
deleted fio bucket file re-run in case setting recordsize=1M but bucket wrote on first default run of 128K
READ: bw=3258MiB/s (3416MB/s), 3258MiB/s-3258MiB/s (3416MB/s-3416MB/s), io=80.0GiB (85.9GB), run=25145-25145msec
WRITE: bw=4440MiB/s (4656MB/s), 4440MiB/s-4440MiB/s (4656MB/s-4656MB/s), io=80.0GiB (85.9GB), run=18451-18451msec
^^^ Significant difference confirming running this test without delete the fio bucket file used for testing affects speed.

I want to gather data points on trying ashift=[9,12,13]. However this isn't exposed to the gui on zpool creation. I may get time to just create the pool in bash and set ashift there, then do the format and mount (unsure if the GUI can pick it up if I do it via CLI).

edit:

I remade the pool in my desired style of raid z1. immediately set recordsize=1M

check out the fio bucket written at 128k vs 1M when running.

z1_(1M,12): 
128k fio bucket:
READ: bw=1086MiB/s (1139MB/s), 1086MiB/s-1086MiB/s (1139MB/s-1139MB/s), io=80.0GiB (85.9GB), run=75447-75447msec
WRITE: bw=1949MiB/s (2043MB/s), 1949MiB/s-1949MiB/s (2043MB/s-2043MB/s), io=80.0GiB (85.9GB), run=42039-42039msec
1M bucket:
READ: bw=3221MiB/s (3378MB/s), 3221MiB/s-3221MiB/s (3378MB/s-3378MB/s), io=80.0GiB (85.9GB), run=25432-25432msec
WRITE: bw=6124MiB/s (6422MB/s), 6124MiB/s-6124MiB/s (6422MB/s-6422MB/s), io=80.0GiB (85.9GB), run=13376-13376msec

Makes me wish I had deleted the fio bucket after every run.

I'm settling on 1M and z1. I may still try ashift changes.

Edited February 7 by Unoid

Unoid · February 9

I loaded up 800GB of movies (loves 1M sectorsize)to the zpool. scrubs average over 11GB/s reads (4 nvmes at 2.9GB/s). I'm curious if the last zpool when loaded up with 55% capacity could really slow the scrub down to 50-100MB/s?

JorgeB · February 10

Seems unlikely to me, zfs pools do slow down after 90% capacity, but AFIK it does not affect the scrubbing speed.

Unoid · February 10

Update after loading 6.5TB of movies back to the zpool

  pool: speedteam
 state: ONLINE
  scan: scrub in progress since Sat Feb 10 12:50:25 2024
	8.40T scanned at 0B/s, 355G issued at 2.01G/s, 8.40T total
	0B repaired, 4.13% done, 01:08:28 to go

image.png.d3dc4723952e5d4eb87616c7ec39c357.png

ZFS scrub speeds slow nvme

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation