ZFS plugin for unRAID


steini84

Recommended Posts

Hi,

 

I just set up an unRAID server with ZFS and ran some benchmarks.

The performance of the array is significantly lower than what I was expecting.

 

Eventually the array will consist of 8 drives in raidz2, but at the moment I'm waiting for 4 more SATA cables to be delivered, so for now my system is as follows:

 

Gigabyte C246M-WU4

Intel Core i3-9100

Kingston 64GB 2666Mhz ECC RAM

4x WD Red Pro NAS 4TB 7200RPM

 

The ZFS array currently consists of 4 drives in raidz2.

 

I just ran some benchmarks with fio and this is what I get:

fio --direct=1 --name=test --bs=256k --filename=/zfs/zfs/test/whatever.tmp --thread --size=64G --iodepth=64 --readwrite=randrw
test: (g=0): rw=randrw, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=psync, iodepth=64
fio-3.23
Starting 1 thread
test: Laying out IO file (1 file / 65536MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=44.5MiB/s,w=47.5MiB/s][r=177,w=189 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2196: Sun May 29 11:44:57 2022
  read: IOPS=225, BW=56.4MiB/s (59.1MB/s)(31.0GiB/580821msec)
    clat (usec): min=25, max=765897, avg=4299.50, stdev=18500.73
     lat (usec): min=25, max=765897, avg=4299.87, stdev=18500.73
    clat percentiles (usec):
     |  1.00th=[    41],  5.00th=[    71], 10.00th=[   126], 20.00th=[   129],
     | 30.00th=[   133], 40.00th=[   139], 50.00th=[   147], 60.00th=[   159],
     | 70.00th=[  6390], 80.00th=[  9503], 90.00th=[ 11207], 95.00th=[ 12649],
     | 99.00th=[ 26084], 99.50th=[ 39584], 99.90th=[299893], 99.95th=[530580],
     | 99.99th=[658506]
   bw (  KiB/s): min=  512, max=130048, per=100.00%, avg=58788.18, stdev=27067.79, samples=1141
   iops        : min=    2, max=  508, avg=229.63, stdev=105.74, samples=1141
  write: IOPS=225, BW=56.4MiB/s (59.2MB/s)(32.0GiB/580821msec); 0 zone resets
    clat (usec): min=23, max=23560, avg=115.99, stdev=251.13
     lat (usec): min=25, max=23570, avg=125.35, stdev=251.63
    clat percentiles (usec):
     |  1.00th=[   28],  5.00th=[   41], 10.00th=[   98], 20.00th=[  101],
     | 30.00th=[  103], 40.00th=[  105], 50.00th=[  108], 60.00th=[  111],
     | 70.00th=[  117], 80.00th=[  130], 90.00th=[  143], 95.00th=[  149],
     | 99.00th=[  174], 99.50th=[  212], 99.90th=[  799], 99.95th=[ 5342],
     | 99.99th=[12649]
   bw (  KiB/s): min=  512, max=125952, per=100.00%, avg=59016.54, stdev=28070.89, samples=1137
   iops        : min=    2, max=  492, avg=230.52, stdev=109.65, samples=1137
  lat (usec)   : 50=4.79%, 100=5.49%, 250=73.76%, 500=0.26%, 750=0.04%
  lat (usec)   : 1000=0.06%
  lat (msec)   : 2=0.02%, 4=0.03%, 10=7.27%, 20=7.41%, 50=0.71%
  lat (msec)   : 100=0.05%, 250=0.05%, 500=0.03%, 750=0.03%, 1000=0.01%
  cpu          : usr=0.51%, sys=6.56%, ctx=44312, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=131040,131104,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=56.4MiB/s (59.1MB/s), 56.4MiB/s-56.4MiB/s (59.1MB/s-59.1MB/s), io=31.0GiB (34.4GB), run=580821-580821msec
  WRITE: bw=56.4MiB/s (59.2MB/s), 56.4MiB/s-56.4MiB/s (59.2MB/s-59.2MB/s), io=32.0GiB (34.4GB), run=580821-580821msec

 

The dataset used to run the benchmark was created with those parameters:

zfs create zfs/test -o casesensitivity=insensitive -o compression=off -o atime=off -o sync=standard

 

Both the read/write speeds and the IOPS are much lower than what those drives should be capable of.

 

@ich777 are there some settings I'm missing or am I doing something wrong?

 

Thanks

Andrea

Link to comment
11 minutes ago, Andrea3000 said:

@ich777 are there some settings I'm missing or am I doing something wrong?

I'm not really the ZFS pro, I only compile the packages for each Unraid version so that they are available for everyone. I'm only using ZFS for my cache, my appdata and VMs, without any snapshots whatsoever, because I'm not really a big fan of ZFS anyway: it's the complete opposite of what Unraid stands for and is not really my use case except for the "Cache".

 

The right persons to ask are @steini84, @BVD, @Iker,...

 

But from what I see your read/write speeds are really slow, so there must be a bottleneck somewhere; on my NVMe drives I reach almost the specified speeds of about 2,500MB/s.

Link to comment

Right out of the gate I can say that you'd be better off with 4 drives in two mirrors. You get the same space and more speed vs 4 drives in raidz2. Secondly, I assume that an i3 is powerful enough to calculate the parity needed without bottlenecking, but it might pay to double check that.

 

In the above configuration, do you have plenty of RAM spare, or not much? I have seen cases where having little RAM spare slows things down a lot - this can somewhat be mitigated by limiting the ARC size in the go file.
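For reference, capping the ARC on Unraid is usually done with a line like the following in /boot/config/go (a sketch; the 8GiB value is an arbitrary example, adjust for your system):

```shell
# Limit the ZFS ARC to 8 GiB (8 * 1024^3 bytes) so more RAM stays free
# for VMs and containers; takes effect at boot when placed in /boot/config/go.
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
```

The same value can be written at runtime to shrink the ARC without a reboot.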

 

It may also pay to performance-test each drive individually, in case one of them is slowing the others down. I had this same problem on a Thunderbolt-connected ZFS cabinet the other day, then found out that running a non-ZFS filesystem on a single disk (or multiple disks) was also slow. I have yet to get to the bottom of it, but I suspect one drive is playing up. It's surprising (and annoying) how often this turns out to be a faulty SATA cable.

 

ZFS will be slower than other RAIDed filesystems, but not by that much, so I agree something is wrong. It's probably close to unRAID array performance though, as that is actually very slow.

 

Can't think of anything else right now sorry, and I appreciate you may have thought of these things already - but sometimes it can trigger a thought for a solution right?  Hope you figure it out and let us know - it might help me for my one!

 

Link to comment

4 drives in z2 is always going to be awful. As mentioned above, if only using 4 drives, you should mirror them so you don't have the z2 overhead in a place where it doesn't make any sense. Once you get the other 4 drives, then you can redeploy with z2.

 

I'd personally recommend spending some time doing a bit more research before jumping into ZFS though - jumping in with both feet is fine, but if unprepared, you'll likely fail to realize the filesystem's full benefits.

Link to comment

Thank you @Marshalleq and @BVD for your replies.

 

The only reason I chose raidz2 with only 4 drives is that I know I will be running raidz2 with 8 drives, and I wanted to test a similar parity layout while waiting for the SATA cables to arrive.

I agree that it doesn’t make any sense.

 

The RAM is mostly free; there was around 60GB free when I ran the fio test.

 

As for the CPU, I assumed that an i3 with turbo would be sufficient to compute parity (with compression off for this test). I can see one core at 100% during the test, the others at around 20%. Could the CPU be the bottleneck?

 

Last night I started testing the drives individually using the DiskSpeed docker.

I tested the first which gave me 250MB/s at the outer diameter and 150MB/s at the inner diameter.

I will test the others and report back.

 

Thanks

Andrea

Link to comment
Posted (edited)

@Arragon

 

On 3/24/2022 at 11:39 PM, gyto6 said:

Thanks again @BVD ! I'm finally able to run the docker.img inside of a ZVOL.

 

For those interested:

 

zfs create -V 20G pool/docker # -V creates a zvol instead of a dataset
cfdisk /dev/pool/docker # Easy way to create a partition on the zvol
mkfs.btrfs -q /dev/pool/docker-part1 # Format the partition with the desired filesystem
mount /dev/pool/docker-part1 /mnt/pool/docker # Mount at the expected mount point
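For anyone following along, a quick sanity check afterwards looks something like this (a sketch; pool/docker and the mount point are the example names from above):

```shell
# Confirm the zvol exists with the expected size and block size
zfs get volsize,volblocksize pool/docker

# Confirm the btrfs filesystem is mounted where Docker expects it
findmnt /mnt/pool/docker
```

Note the mount is not persistent by itself; on Unraid it would need to be repeated from the go file or a startup script.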

 

 

 

Edited by gyto6
Link to comment
Posted (edited)
On 5/29/2022 at 4:55 AM, Andrea3000 said:

 

fio --direct=1 --name=test --bs=256k --filename=/zfs/zfs/test/whatever.tmp --thread --size=64G --iodepth=64 --readwrite=randrw
....
Run status group 0 (all jobs):
   READ: bw=56.4MiB/s (59.1MB/s), 56.4MiB/s-56.4MiB/s (59.1MB/s-59.1MB/s), io=31.0GiB (34.4GB), run=580821-580821msec
  WRITE: bw=56.4MiB/s (59.2MB/s), 56.4MiB/s-56.4MiB/s (59.2MB/s-59.2MB/s), io=32.0GiB (34.4GB), run=580821-580821msec

 

 

This is not low by any means; in fact it is quite good given that you are testing random access (randrw) on HDDs with a 256KB block size.

 

https://www.anandtech.com/show/8265/wd-red-pro-review-4-tb-drives-for-nas-systems-benchmarked/4

 

You should probably look for the correct fio settings ("--direct=1" is no bueno in this context); there are a couple of additional variables when you try to benchmark ZFS. This is a good starting point:

 

https://serverfault.com/questions/1021251/fio-benchmarking-inconsistent-and-slower-than-anticipated-are-my-raids-miscon
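As a concrete illustration (a sketch along the lines of the linked thread, not a definitive recipe - sizes and paths are placeholders): drop --direct=1, since ZFS doesn't honour O_DIRECT the way block devices do and it skews results, and test sequential and random workloads separately. Note that the default psync ioengine ignores --iodepth, so parallelism has to come from --numjobs:

```shell
# Sequential throughput, buffered like real workloads
fio --name=seq --bs=1M --size=64G --readwrite=write \
    --filename=/zfs/zfs/test/seq.tmp

# Random mixed I/O; numjobs provides real concurrency since
# psync ignores --iodepth
fio --name=rand --bs=128k --size=16G --readwrite=randrw \
    --numjobs=4 --group_reporting --filename=/zfs/zfs/test/rand.tmp
```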

 

Edited by Iker
Link to comment
12 hours ago, gyto6 said:

@Arragon

 

 

Hi thanks for posting, I'm interested in what advantage this is to you.  I have run the img file on a zfs dataset and also run it as a folder which creates a host of annoying datasets within it.  Does this somehow get the best of both worlds?  Thanks.

Link to comment

The biggest advantage is performance. I've been able to get north of 1.1M IOPS out of mine, something that'd be nearly impossible with any btrfs setup. The disadvantage though is that if one simply dives into ZFS without doing any tuning (or worse, just applies something they read online without understanding the implications), they're likely to end up worse off than they would've been with virtually any other filesystem/deployment.

  • Like 1
Link to comment
Posted (edited)
16 hours ago, Marshalleq said:

Hi thanks for posting, I'm interested in what advantage this is to you.  I have run the img file on a zfs dataset and also run it as a folder which creates a host of annoying datasets within it.  Does this somehow get the best of both worlds?  Thanks.

You're welcome!

 

As for me, my config doesn't cause me any trouble, probably because I'm running entirely on SSDs.

 

Some ZFS tuning, such as volblocksize, is necessary to avoid running into trouble.
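For instance (an illustration only - the 16K value is a hypothetical choice that should be matched to the workload on top of the zvol):

```shell
# volblocksize must be set at creation time; it cannot be changed later.
zfs create -V 20G -o volblocksize=16k pool/docker
```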

 

In any case, I'm never going back now: I don't have any performance issues and, furthermore, I no longer get any "snapshots/legacy snapshots" flooding from the docker dataset.

 

I'd recommend this setup instead of using the docker folder, for the simple reason that you're never really expected to edit the docker folder, and it's easier to copy a simple .img file for backup.

Edited by gyto6
Link to comment
On 5/29/2022 at 9:59 PM, Marshalleq said:

Right out of the gate I can say that you'd be better off with 4 drives in two mirrors. [...]

On 5/30/2022 at 12:52 AM, BVD said:

4 drives in z2 is always going to be awful. [...]

On 5/30/2022 at 8:24 PM, Iker said:

This is not low by any means [...]


Ok, I have benchmarked the drives individually using the DiskSpeed docker and each drive performs more or less the same with roughly 250MB/s read at the outer diameter and 140-150MB/s at the inner diameter.

 

Maybe my expectations were too high for a random read/write benchmark. I set my expectations based on this post/guide from the Level1Techs forum: https://forum.level1techs.com/t/zfs-on-unraid-lets-do-it-bonus-shadowcopy-setup-guide-project/148764, which shows results of 150-160MB/s for randrw with 4x 5400RPM drives in raidz1, with average IOPS just short of 700... which is huge compared to what I get.

Based on what @Iker said, and the disk review posted, are the results obtained on the Level1Techs forum unrealistically high for 4x HDD?!

 

The other day, before I posted my benchmark, I also tried a sequential read and write, and the performance was still very poor, less than 100MB/s, which is what made me conclude that something wasn't right.

I repeated the benchmark yesterday and I now get 280-320MB/s for sequential read/write in raidz2. I was hoping to be more in the 400-500MB/s range, but it's still respectable.
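As a rough sanity check (back-of-envelope only; real raidz throughput depends on recordsize, sector alignment and drive zones), raidz streaming speed scales with the data disks, i.e. drives minus parity:

```shell
# Per-drive streaming speed from the individual DiskSpeed tests (MB/s)
per_drive=250

# 4-drive raidz2: 4 - 2 parity = 2 data drives
echo "4-drive raidz2 ceiling: $(( (4 - 2) * per_drive )) MB/s"   # 500 MB/s

# 8-drive raidz2: 8 - 2 parity = 6 data drives
echo "8-drive raidz2 ceiling: $(( (8 - 2) * per_drive )) MB/s"   # 1500 MB/s
```

By that yardstick, 280-320MB/s sequential on a 4-drive raidz2 is in the right ballpark rather than broken.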

 

@BVD I've done a bit of research and I'm pretty much sold on ZFS. The features that attract me the most are snapshots, data integrity, bit-rot protection and decent performance.

As far as tuning is concerned, I think I need to study a bit, because I wouldn't know where to start. I know what kind of data I will be storing, so I can at least choose the block size, but I wouldn't know what else can be tuned.

 

Are there any guidelines/tutorials on how to tune a ZFS pool to maximise performance? Or at least to know what I should expect from a zpool of 8x WD Red Pro NAS drives in raidz2?

 

Link to comment
Posted (edited)

I can tell you what I use, then you can go and read up on those bits. To me these are the key bits in setting up an array. I'm not going to disagree entirely with BVD, but really, like all things technical, research and experimentation are valuable, and that's no reason to fear ZFS and not use it. There are some people that dive in without even a minor bit of forethought, and I assume his commentary is really aimed at that rarer group who will likely get themselves into trouble with everything, not just ZFS.

 

Anyway, here's the basic command I use first. If it's an SSD pool, I add -o autotrim=on; some people are still scared of this, but I've never had even one issue with it - compare that to btrfs, where the issues were quite a few - though that was years ago now.

 

zpool create HDDPool1 -f -o ashift=12 -m /mnt/HDDPool1 raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

 

See below about special vdev before creating the above.

 

Then the optional tweaking stuff I use:

zfs set xattr=sa atime=off compression=zstd-1 HDDPool1

 

And depending on your media, a dataset could be changed thus:

zfs set recordsize=1M HDDPool1

The default is 128K. (Don't get caught up in this too much - I keep forgetting ZFS does variable record sizes - 1M might be good if you have a lot of large video files, for example.)

dedup=on (I use this a lot, but only because I have a special vdev, which IMO means no significant extra RAM is required - however, I've had quite the discussion about that, so not all will agree; definitely do your own research before enabling this one). Works great if you have lots of VMs and ISOs. My RAM is not used in any way that I've ever been able to notice.

 

More options

zfs set primarycache=metadata HDDPool1/dataset

zfs set secondarycache=all HDDPool1/dataset

Some of the cache options are actually dealt with automatically. The promise with them is to optimise how much of your data will be cached in RAM, depending on e.g. whether you have big files or not, and whether it is valuable or even possible to cache them.

 

And finally, the special vdev mentioned above is very cool. It will store metadata on a second set of disks assigned to the array. So for example, if you have slow hard drives in a raidz2, you could have 3 SSDs (for the same redundancy level), which speeds up seeking and such. Optionally it will also store on the SSDs any small file up to the size you specify (which must be less than the recordsize, or you'd be storing everything). As you probably know, small files on HDDs aren't very fast to read and write, so the advantage here is obvious if you have that kind of setup.

 

zpool add HDDPool1 -o ashift=12 special mirror /dev/sdp /dev/sdr /dev/sdq

 

I also set up a fast SSD/NVMe as a level 2 cache - this can be done at any time, and its advantage is just that anything that doesn't get a hit in RAM fails over to the SSD, so again it's a way of speeding up reads from HDDs.

 

zpool add  HDDPool1 cache /dev/nvme0n1

useful commands:

arc_summary

arcstat

 

So what you can probably see is that there is a default way of doing things, and the 'tweaking' mentioned above is really more about understanding your data and how you want to address it via ZFS. Some settings need to be done at array creation and some can be done later. Most settings that are done later will only apply to newly written data, so you end up having to copy the data off and on again if you get it wrong.

 

I found it super fun to go on the journey and learn it all, I expect you will too.  If you're like me, you'll want to be doing some more reading now! :D

 

Have a great day.

 

Marshalleq.

Edited by Marshalleq
  • Thanks 1
Link to comment
7 hours ago, Marshalleq said:

I can tell you what I use, then you can go and read up on those bits. [...]


Thank you very much @Marshalleq for all the information.

Is there a particular reason why you chose zstd compression instead of lz4?

Is it because you have a CPU powerful enough to compensate for the slower compression/decompression?

 

As part of the “performance” tuning I’m expecting today the delivery of an Intel Optane P1600X 118GB NVMe that I plan to partition with a 10GB partition to use as SLOG and a 108GB partition to use as L2ARC.

But I will wait before installing it to make sure that first I’m happy with the performance of the zpool as it is and to find optimal settings for my use case.
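For reference, the plan above would look roughly like this once the drive arrives (a sketch; the pool name tank and device names like /dev/nvme0n1p1 are placeholders, and the partitioning step itself is elided):

```shell
# After partitioning the Optane into ~10GB and ~108GB partitions:

# Attach the small partition as a log (SLOG) device
zpool add tank log /dev/nvme0n1p1

# Attach the large partition as an L2ARC cache device
zpool add tank cache /dev/nvme0n1p2
```

Both can also be removed later with `zpool remove`, so this is a low-risk experiment.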

 

Thanks

Andrea

Link to comment
56 minutes ago, Andrea3000 said:


Thank you very much @Marshalleq for all the information.

Is there a particular reason why you chose zstd compression instead of lz4?

Is it because you have a CPU powerful enough to compensate for the slower compression/decompression?

If I recall correctly, the performance is better and it's multi-threaded, whereas lz4 is very old and single-threaded. Could be wrong about the threads. Zstd obviously has differing levels of performance you can set. I just read up on it at the time and chose it.
 

58 minutes ago, Andrea3000 said:

As part of the “performance” tuning I’m expecting today the delivery of an Intel Optane P1600X 118GB NVMe that I plan to partition with a 10GB partition to use as SLOG and a 108GB partition to use as L2ARC.

But I will wait before installing it to make sure that first I’m happy with the performance of the zpool as it is and to find optimal settings for my use case.

 

Thanks

Andrea

I don’t use SLOGs - I thought they were really only beneficial in rare cases, and would need more redundancy because it’s writes? I don’t know much about them, sorry. Very nice drive though!

Link to comment
1 hour ago, Andrea3000 said:


Thank you very much @Marshalleq for all the information. Is there a particular reason why you chose zstd compression instead of lz4? [...]

 

For NVMe you're probably better off using lz4 by default, and zstd for datasets with very compressible data (large log/text files), or for datasets you don't read often (archives/backups). Some insights here:

https://news.ycombinator.com/item?id=23210491

Link to comment

Wow, that was a lot to take in. On that two-sided argument, I'm going for the awesome read performance mentioned, and I think it isn't going to make much difference for writes. Especially when I am using a 32-thread Threadripper and a 24-thread dual Xeon setup.

Link to comment
Posted (edited)
10 minutes ago, Marshalleq said:

Wow that was a lot to take in. On that two sided argument I’m going for the awesome read performance mentioned and think it isn’t going to make much difference for writes. Especially when I am using a 32 thread threadripper and a 24 thread dual xeon setup. 

 

Yes, for HDD pools zstd makes more sense, as you are IO-bound by slow spinning rust. Needing to read and write less data to spinning rust is more likely to be beneficial than any increase in computation required for compression/decompression.

 

For NVMe you are less bound by IO, so the increase in computation required by zstd is more likely to impact performance negatively. On modern computers it probably makes very little difference; if you're rocking 10-year-old Xeons (like me), then it might. That said, zstd also gives better compression (to varying degrees, depending on the type of data), so if that's important to you, it's another thing to consider.

 

Edited by jortan
Link to comment
Posted (edited)
1 hour ago, Marshalleq said:

I don’t use SLOGs - I thought they were really only beneficial in rare cases, and would need more redundancy because it’s writes? I don’t know much about them, sorry. Very nice drive though!

An SLOG is relevant for databases and network shares. It does not need to be mirrored: if an SLOG fails, you'd need an unexpected power loss during the brief window while the degraded SLOG still holds unflushed records to lose data. When an SLOG is degraded, sync writes fall back to the normal transaction path, and the user/database waits for a real answer to each sync request.

 

But even without a power outage, depending on SLOG usage (if database queries or network shares are heavily used, by more than 10 people for example), the data registered in the SLOG and not yet delivered to the pool before the SLOG failure would be lost. If it's a modification to an Excel file, the modifications are lost, not the file. It must be understood that it's only an interval of transactions that is lost.

 

A special vdev must be mirrored: your metadata is stored on it, so if the special vdev is lost, the whole pool is lost with it.

Edited by gyto6
  • Like 1
Link to comment
Posted (edited)
30 minutes ago, gyto6 said:

Slog is relevant for database and network shares.

 

To clarify, it's relevant for any application doing synchronous writes to your ZFS pool. For these writes, the filesystem won't confirm to the application that the write has completed until the data is confirmed as written to the pool. Because writes to the SLOG are much faster, there can be significant improvements in sync write performance. The SLOG is not a read cache - ZFS will never read from the SLOG except in very rare circumstances (mostly after a power failure). Even if it could, it would be useless, as the SLOG is regularly flushed and any data in it is already in memory (ARC).

 

If your application is doing asynchronous writes, ZFS stores the write in memory (ARC) and reports to the application that the write is complete (it then flushes those writes from ARC to your pool - by default every 5 seconds, I think?). The SLOG has zero benefit here.

 

I have a feeling QEMU/VMs will do synchronous writes also, but I don't have an SLOG running to test.  From memory NFS shares default to synchronous and SMB shares will default to asynchronous?
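The behaviour per dataset is controlled by the sync property, which is also a handy way to test how much an SLOG matters for a given workload (a sketch; tank/dataset is a placeholder name):

```shell
# standard: honour what the application requests (the default)
# always:   force every write through the ZIL/SLOG path
# disabled: acknowledge immediately from RAM (fast, but unsafe on power loss)
zfs get sync tank/dataset
zfs set sync=always tank/dataset
```

Comparing a benchmark with sync=always vs sync=disabled brackets the best and worst case an SLOG could ever deliver.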

 

>>Intel Optane P1600X 118GB

 

That said if you have/are getting this for L2ARC anyway, you may as well slice off 10GB for a SLOG.  I was running a P4800X in exactly the same way a while ago.

Edited by jortan
  • Upvote 1
Link to comment
Posted (edited)
52 minutes ago, jortan said:

 

To clarify, it's relevant for any application doing synchronous writes to your ZFS pool. 

 

Indeed. 😉

 

52 minutes ago, jortan said:

 

SLOG is not a read cache

 

L2ARC is a read cache, for those wondering. The Most Recently Used (MRU) and Most Frequently Used (MFU) lists composing the ARC hold data in RAM. Records evicted from these components due to the RAM size limit, usage cycles and ZFS parameters are then stored in the L2ARC, which should sit on an SSD (even SATA, but prefer NVMe).

Note that the L2ARC index is itself stored in RAM, reducing your potential MRU/MFU size!
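Rough numbers for that overhead (an estimate; ~70 bytes per cached record is an approximate figure for current OpenZFS, and the record count depends on your actual record sizes):

```shell
# 100 GiB of L2ARC filled entirely with 128 KiB records
l2arc_bytes=$(( 100 * 1024 * 1024 * 1024 ))
recordsize=$(( 128 * 1024 ))
records=$(( l2arc_bytes / recordsize ))            # 819200 records
header_bytes=70                                    # approx. per-record index cost in RAM
overhead_mb=$(( records * header_bytes / 1024 / 1024 ))
echo "L2ARC index RAM cost: ~${overhead_mb} MiB"
```

With large records the cost is tiny; with small records (e.g. 8K zvol blocks) it grows 16x and starts to matter.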

 

52 minutes ago, jortan said:

 

(it then flushes those writes from ARC to your pool - by default every 5 seconds I think?)

 

Correct about the 5 seconds

52 minutes ago, jortan said:

 

SMB shares will default to asynchronous?

 

Indeed, you need to set this parameter in your share config to enable synchronous writes. Asynchronous writes are used for "efficiency": they go to a memory buffer, letting the disks handle their other activity. (Thanks for the correction @jortan)

“strict sync = yes”
52 minutes ago, jortan said:

That said if you have/are getting this anyway, you may as well slice off 10GB for a SLOG.

OpenZFS recommendations are even 4GB: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html?highlight=slog#optane-3d-xpoint-ssds

Edited by gyto6
Link to comment
4 hours ago, gyto6 said:

 

Indeed. 😉 [...] L2ARC is a read cache for those wondering. [...]

I’m going for a 10GB SLOG based on the fact that I will connect to the server via 10GbE, which means that in 5 seconds I can send roughly 6GB. Rounding up to 10GB gives me some additional margin.

The P1600X has power loss protection, therefore I should be able to recover everything that wasn’t yet committed to disk in case of power failure.
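The sizing arithmetic behind that (back-of-envelope; assumes line-rate 10GbE and the default ~5-second transaction group interval):

```shell
# 10 GbE line rate in MB/s (10,000 Mbit/s divided by 8 bits per byte)
rate_mb=$(( 10000 / 8 ))                  # 1250 MB/s
txg_seconds=5                             # default zfs_txg_timeout
slog_mb=$(( rate_mb * txg_seconds ))      # 6250 MB of in-flight sync data
echo "worst-case in-flight sync data: ${slog_mb} MB"
```

So ~6.25GB is the theoretical worst case, and a 10GB partition covers it comfortably.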

Link to comment
Posted (edited)
20 minutes ago, Andrea3000 said:

I’m going for a 10GB SLOG based on the fact that I will connect to the server via 10gbe, which means that in 5 seconds I can send up to 5GB. I’m doubling that to have some additional margin.

The P1600X has power loss protection, therefore I should be able to recover everything that wasn’t yet committed to disk in case of power failure.

 

Sadly, in reality I wouldn't expect the same result for many small files as for one huge one.

 

Whatever happens, test your config to determine where you're facing a real bottleneck and what you can do without. 🙂 
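One way to test that is to repeat the fio run with O_SYNC forced, which makes every write wait on the ZIL/SLOG (paths are placeholders):

```
# Forced sync writes: every 128K block must reach stable storage
fio --name=synctest --filename=/zfs/test/sync.tmp \
    --rw=write --bs=128k --size=4G --sync=1

# Re-run without --sync=1 and compare, to isolate the cost of sync semantics
```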

Edited by gyto6
Link to comment
Posted (edited)

I've done some more benchmarking and, if anything, I'm more puzzled than before.

 

I created 4 zpools (one at a time), each with just a single drive. I ran a sequential write test with fio and got 250MB/s from each drive, as expected from the manufacturer's specifications.

 

I then created a striped zpool with no parity (the equivalent of RAID0) using 3 drives, re-ran the sequential write test, and obtained this:
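For reference, the layouts being compared can be created along these lines (a sketch; pool and device names are placeholders):

```
# 3-wide stripe, no parity (RAID0 equivalent)
zpool create -f test /dev/sdb /dev/sdc /dev/sdd

# 4-wide raidz1 (single parity), used in the later test
zpool destroy test
zpool create -f test raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```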

fio --direct=1 --name=test --bs=128k --filename=/zfs/test/whatever.tmp --size=64G --iodepth=64 --readwrite=write
test: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=64
fio-3.23
Starting 1 process
test: Laying out IO file (1 file / 65536MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=709MiB/s][w=5670 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=28574: Thu Jun  2 09:48:20 2022
  write: IOPS=5745, BW=718MiB/s (753MB/s)(64.0GiB/91257msec); 0 zone resets
    clat (usec): min=10, max=25115, avg=168.47, stdev=97.66
     lat (usec): min=11, max=25118, avg=171.59, stdev=98.28
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   60], 10.00th=[  111], 20.00th=[  141],
     | 30.00th=[  157], 40.00th=[  165], 50.00th=[  169], 60.00th=[  174],
     | 70.00th=[  178], 80.00th=[  188], 90.00th=[  206], 95.00th=[  245],
     | 99.00th=[  453], 99.50th=[  562], 99.90th=[  799], 99.95th=[  963],
     | 99.99th=[ 2008]
   bw (  KiB/s): min=564736, max=5577216, per=100.00%, avg=736542.42, stdev=364660.92, samples=182
   iops        : min= 4412, max=43572, avg=5754.23, stdev=2848.91, samples=182
  lat (usec)   : 20=3.44%, 50=0.57%, 100=5.08%, 250=86.32%, 500=3.88%
  lat (usec)   : 750=0.58%, 1000=0.09%
  lat (msec)   : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=4.56%, sys=30.31%, ctx=506955, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
  WRITE: bw=718MiB/s (753MB/s), 718MiB/s-718MiB/s (753MB/s-753MB/s), io=64.0GiB (68.7GB), run=91257-91257msec

 

Up until this point, everything makes sense. I get 750MB/s, which is the sum of the 3 drives' individual speeds.

 

I then decided to create a 4-disk-wide raidz1 pool. This should give me the same sequential write performance as the 3-wide striped pool. Instead, this is what I get:

fio --direct=1 --name=test --bs=128k --filename=/zfs/test/whatever.tmp --size=64G --iodepth=64 --readwrite=write --idle-prof=percpu
test: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=64
fio-3.23
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=378MiB/s][w=3027 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12882: Thu Jun  2 10:19:12 2022
  write: IOPS=3322, BW=415MiB/s (436MB/s)(64.0GiB/157791msec); 0 zone resets
    clat (usec): min=11, max=101426, avg=299.37, stdev=168.44
     lat (usec): min=12, max=101426, avg=300.34, stdev=168.46
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[  174], 10.00th=[  258], 20.00th=[  285],
     | 30.00th=[  297], 40.00th=[  302], 50.00th=[  310], 60.00th=[  314],
     | 70.00th=[  318], 80.00th=[  330], 90.00th=[  355], 95.00th=[  371],
     | 99.00th=[  420], 99.50th=[  537], 99.90th=[  922], 99.95th=[ 1123],
     | 99.99th=[ 1893]
   bw (  KiB/s): min=266752, max=5665536, per=100.00%, avg=425720.69, stdev=299525.60, samples=315
   iops        : min= 2084, max=44262, avg=3325.94, stdev=2340.04, samples=315
  lat (usec)   : 20=3.80%, 50=0.17%, 100=0.08%, 250=5.44%, 500=89.95%
  lat (usec)   : 750=0.38%, 1000=0.11%
  lat (msec)   : 2=0.07%, 4=0.01%, 10=0.01%, 20=0.01%, 250=0.01%
  cpu          : usr=0.77%, sys=5.58%, ctx=505109, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=415MiB/s (436MB/s), 415MiB/s-415MiB/s (436MB/s-436MB/s), io=64.0GiB (68.7GB), run=157791-157791msec

CPU idleness:
  system: 95.21%
  percpu: 92.62%, 96.21%, 95.96%, 96.08%
  unit work: mean=26.64us, stddev=0.53

 

Instead, I only get 436MB/s. As you can see, I added the "--idle-prof=percpu" option to report CPU idleness, which shows values well above 90%.

From this I concluded that I'm not bottlenecked by the CPU for the parity calculation.

But if this is true, why is the performance so much worse, considering that the number of "data" drives is the same between the 4-wide raidz1 and the 3-wide striped pool?
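As a naive streaming model of the comparison (a sketch; the 250MB/s per drive comes from the single-disk tests above):

```python
# Naive model: sequential write throughput ~= data disks * per-disk rate.
def expected_write_mbs(total_disks, parity_disks, per_disk_mbs=250):
    data_disks = total_disks - parity_disks
    return data_disks * per_disk_mbs

stripe3 = expected_write_mbs(3, 0)  # 750 -- close to the measured 753MB/s
raidz1 = expected_write_mbs(4, 1)   # 750 -- yet only 436MB/s was measured
print(stripe3, raidz1)
```

The gap between the naive 750MB/s and the measured 436MB/s is what raidz allocation overhead and seek behaviour cost beyond raw parity math.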

 

How can I determine what the bottleneck is?

Edited by Andrea3000
Link to comment
15 minutes ago, Andrea3000 said:

 

Instead, I only get 436MB/s ... But if this is true, why are the performance so much worse considering that the number of "data" drives are the same between the 4-wide raidz1 and 3-wide striped pool?

 

 

This seems about what I would expect.  You're not streaming data directly and uninterrupted to your spinning rust as you would be in a RAID0-like configuration.  For every write to one disk, ZFS has to store that data and metadata redundantly on another disk. Then the first disk gets interrupted because it needs to write data/metadata to provide redundancy for another disk. You're not streaming data neatly in a row; there are seek times involved.

 

If you want performance with spinning rust, get more spindles and ideally switch to RAID10 (mirrored pairs).
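Such a layout would be created along these lines (pool and device names are placeholders):

```
# RAID10-style pool: two mirrored pairs striped together
zpool create -f tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
```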

Link to comment
