
Dedup in zfs


ReneV


(This is a follow-up to http://lime-technology.com/forum/index.php?topic=4604.0.)

 

De-duplication in zfs has just made its way into the public builds of OpenSolaris. My first impression: it is computationally expensive!

 

Making a duplicate of a 7 GB file that takes up 4 GB of HDD space (because of file-system-level compression) pegs the CPU at 100% for around a quarter of the copying time on a quad-core (Q9550). The concrete example took 45 secs. The same copy without de-duplication took 3 mins at <10% CPU usage. [Both numbers include de- and re-compression, I would imagine.] The good news: the copy with de-duplication consumes only 0.1 GB of space.
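
For anyone wanting to repeat this kind of measurement, a rough sketch follows; the dataset and file paths are made up for illustration, and the CPU profile is something to watch separately in prstat or top rather than anything the script itself measures.

# Minimal sketch of the copy timing described above. Paths are hypothetical --
# point SRC/DST at a file inside an actual deduped/compressed zfs dataset.
import shutil
import time

SRC = "/tank/deduptest/big.img"        # hypothetical ~7 GB source file
DST = "/tank/deduptest/big-copy.img"   # duplicate written into the same dataset

start = time.time()
shutil.copyfile(SRC, DST)              # plain read/write copy, much like cp(1)
print("copy took %.1f s" % (time.time() - start))

# The space the copy actually consumed can then be checked with, e.g.:
#   zfs get used,compressratio tank/deduptest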

 

Additional bad news: even though straight reading from the de-duplicating file system does not appear to take substantially more CPU resources [EDIT: the preceding statement is probably wrong; see also below], there was a read-speed penalty of around a third compared to the non-de-duplicated case.

 

 

My tentative conclusion is (either that the current dedup implementation in zfs can be improved, or) that de-duplication is only going to pay for itself in highly specialised cases. RAM caching does seem to respect de-duplication, for example, so one clear case that looks set to benefit is running several copies of a given virtual machine. For my part, I won't jump into anything just yet.


How does it fare if you disable compression?

 

52 secs for the copy, with what appears to be the same CPU-usage profile, and 0.2 GB of space consumed by the copy. Similar speed penalty for straight reading.

 

This is all very uncontrolled, though, and with two variables and only a single test it's not really a basis for any sort of decision making.

 

 

 

EDIT: straight reading does appear to take somewhat more resources in the de-duplicating case, both with and without compression; sorry about that.

  • 3 weeks later...

As it turns out, the numbers I reported above are an apples-to-oranges comparison.

 

The issue is this: by default, zfs checksums every block, originally to provide filesystem-level error detection. For dedup, duplicates are identified by comparing those checksums (by default, without a byte-for-byte verification), which is entirely reasonable as long as the checksum has essentially no collisions. Accordingly, turning on zfs dedup switches the dataset to sha256 checksums, whereas zfs otherwise defaults to the fletcher checksums, which are much cheaper to compute. This accounts for a part of the performance differences I reported.
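
To get a feel for how large the gap between the two checksum families can be, here is a rough userland comparison. Python's standard library has no fletcher4, so zlib.adler32 stands in as a similarly cheap checksum; none of this touches the in-kernel zfs code, so only the ratio between the two timings is meant to be indicative.

# Compare the cost of sha256 against a cheap additive checksum.
# zlib.adler32 is a stand-in for fletcher, which Python does not ship.
import hashlib
import os
import time
import zlib

block = os.urandom(128 * 1024)   # one 128 KiB block, the default zfs recordsize
N = 10000                        # roughly 1.2 GiB of data in total

start = time.time()
for _ in range(N):
    hashlib.sha256(block).digest()
print("sha256 : %.2f s" % (time.time() - start))

start = time.time()
for _ in range(N):
    zlib.adler32(block)
print("adler32: %.2f s  (fletcher-like stand-in)" % (time.time() - start))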

 

However, it appears to be even more important that, up to the still-to-be-released snv_131 build, zfs uses its own non-optimised sha256 code. The speed-ups reported in various Sun blogs for switching to the optimised sha256 code in the kernel are based on artificial data sets, so it's entirely unclear what the real-world effect will be, but it could be substantial. Additionally, the kernel code can take advantage of hardware acceleration, where available.

 

More later ...

