ZFS: To Dedupe or Not to Dedupe…

…that is the question.

Ever since the introduction of deduplication into ZFS, users have been divided into two camps: One side enthusiastically adopted deduplication as a way to save storage space, while the other remained skeptical, pointing out that dedupe has a cost, and that it may not always be the best option.

Let’s look a little deeper into the benefits of ZFS deduplication as well as the cost, because ultimately it boils down to running a cost/benefit analysis of ZFS deduplication. It’s that simple.

ZFS Deduplication: What Value Do You Get?

ZFS dedupe will discard any data block that is identical to an already written block, while keeping a reference so that it can always reproduce the same block when read. You can read more about ZFS deduplication and how it works here.

Before you decide to use deduplication, it’s good to know what value you’ll get out of it. Here are a few options to figure out how much space you’ll save as a result of using ZFS deduplication:

Test it with some real data. This is the most accurate and straightforward option: Set up a test pool, enable ZFS deduplication on it, then copy a representative amount of the data you are considering onto it. Then use zpool list and look at the DEDUP column for the deduplication ratio. The important thing here is to use a representative amount of data, so you get an accurate estimate of how much savings to expect.
Simulate it by applying the zdb -S command to an existing pool with the data you want to deduplicate. This option is less accurate than using real data with a real deduped zpool, but it can provide you with a ballpark estimate based on your existing data.
Guess it, based on the knowledge you have of your data. Not the best option, but sometimes setting up a test pool is not feasible, and simulating dedupe on existing data doesn’t work because you simply don’t have any data to analyze. For example, if you plan to run a storage server for virtual machines: How many machines do you support? How often are they patched? How likely will people apply the same software/patches/data to your machines? How many GB of dedupe-able data is this likely to generate? Can you come up with a representative test case after all to make the guess less guessy?

In any case, you’ll end up having an expected deduplication ratio for the data: For every GB of data you actually store, how many GBs of retrievable data will you get? This number can have any value: Some people see a value of 1.00 (no duplicates whatsoever), others see some moderate savings like 1.5 (store 2 GB, get one free), and some very lucky people can see as much as 20x, for example a virtualization storage server with a very repetitive usage profile.

Now take your total amount of storage and divide it by the dedup ratio, then subtract the result from your total amount of storage. That is your expected storage savings as a result of deduplication:

Total Storage - ( Total Storage / Expected Dedupe ratio ) = Expected Storage Savings

As a fictional example, let’s assume that we’re looking at a 10 TB storage pool to be used for storing virtual machine images in a virtual desktop scenario. In a quick test, we set up a 1 TB pool and copy some existing VM data to it, which yields a dedup ratio of 2. This means that we only need about 5 TB of capacity to provide the 10 TB of data thanks to deduplication, hence we would save 5 TB of disk storage.

Let’s assume that the average cost of 1 TB of disk (including controller, enterprise class drives, etc.) is at $1000 for the sake of simplicity: Then dedup would save us $5000 in this particular example.
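The savings formula and the $1000-per-TB example above can be sketched in a few lines of Python. The function name and the price constant are made up for this illustration; plug in your own numbers.

```python
def expected_savings_tb(total_tb: float, dedup_ratio: float) -> float:
    """Total Storage - (Total Storage / Expected Dedupe ratio)."""
    if dedup_ratio < 1.0:
        raise ValueError("a dedup ratio below 1.0 makes no sense")
    return total_tb - total_tb / dedup_ratio


# Fictitious price from the example above: $1000 per TB of disk.
DISK_USD_PER_TB = 1000.0

saved_tb = expected_savings_tb(total_tb=10.0, dedup_ratio=2.0)
print(f"saved: {saved_tb} TB, worth ${saved_tb * DISK_USD_PER_TB:.0f}")
```

A ratio of 1.00 returns zero savings, which is a quick sanity check that the formula is wired up correctly.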

So what do we need to spend in order to realize these cost savings?

ZFS Dedupe: The Cost

Saving space through deduplication doesn’t come for free. There is a cost. In the case of ZFS, it’s memory: ZFS keeps a dedup table in which it stores all the checksums of all the blocks that were written after deduplication was enabled. When writing new blocks, it uses this table to determine whether a block has been written yet, or not.

Over time, the table becomes larger and larger, and since every write operation has to use it, it should be kept in main memory to avoid unnecessary extra reads from disk. To be clear: ZFS can work perfectly well even if the table is not in memory. But the bigger the deduplication table grows, the slower write performance will become, as more and more writes trigger more and more extra reads for dedup table lookups.

How much memory does one need to keep the ZFS dedup table in memory, and hence your system happy?

According to the ZFS dedup FAQ (no link, opensolaris.org no longer exists), each entry in the dedup table costs about 320 Bytes of memory per block. To estimate the size of the dedup table, we need to know how many blocks ZFS will need to store our data. This question can be tricky: ZFS uses a variable block size between 512 bytes and 128K, depending on the size of the files it stores. So we can’t really know in advance how many blocks ZFS will use for storing our data.

If we mainly store large files (think videos, photos, etc.), then the average block size will be closer to 128K. If we store small files (source code, emails, other data), we’ll probably be closer to a few K. Here are some ways to find out for sure:

Option 1: Count Your Blocks With ZDB

The most accurate way to determine the number of blocks is to use the zdb -b <poolname> command, which will print out detailed block statistics. But beware: This command may take a long time to complete as it will scan all of the metadata blocks in your pool:

  constant@walkuere:~$ zdb -b oracle

  Traversing all blocks to verify nothing leaked ...

    No leaks (block sum matches space maps exactly)

    bp count:          306575
    bp logical:    19590419968      avg:  63900
    bp physical:   17818332160      avg:  58120     compression:   1.10
    bp allocated:  17891765760      avg:  58360     compression:   1.09
    bp deduped:             0    ref>1:      0   deduplication:   1.00
    SPA allocated: 17891765760     used: 52.48%

The number to look for here is bp count, which is the number of block pointers and hence the number of blocks in the pool.

In this example, the dedup table would take up 306575 blocks * 320 bytes, which yields around 100 MB. This file system has a size of 32GB, so we can assume that the dedup table can only grow to about 200 MB, since it’s about 50% full now.
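If you want to script this step, here is a minimal Python sketch that pulls bp count out of saved zdb -b output and multiplies it by the 320 bytes per entry quoted above. It assumes the output looks like the transcript shown here; other zdb versions may format it differently, and the function name is invented for this example.

```python
import re

# Per-entry figure quoted from the ZFS dedup FAQ in this article.
DDT_ENTRY_BYTES = 320


def ddt_bytes_from_zdb(zdb_output: str) -> int:
    """Estimate dedup table size from the 'bp count' line of `zdb -b`."""
    match = re.search(r"bp count:\s+(\d+)", zdb_output)
    if match is None:
        raise ValueError("no 'bp count' line found in zdb output")
    return int(match.group(1)) * DDT_ENTRY_BYTES


# Two lines copied from the transcript above.
sample = """
    bp count:          306575
    bp logical:    19590419968      avg:  63900
"""
print(ddt_bytes_from_zdb(sample))  # 98104000 bytes, roughly 100 MB
```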

Option 2: Estimate Your Average Block Size, Then Divide

If you don’t have your data in a ZFS pool already, you’ll have to guess what your average block size is, then divide your expected storage capacity by the average block size to arrive at an estimated number of ZFS blocks. For most cases of mixed data like user home directories, etc., we can assume an average block size of 64K, which is in the middle between the minimum of 512 bytes and 128K. This works reasonably well for my own example:

  constant@walkuere:~$ zpool list oracle                                          
  NAME     SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
  oracle  31.8G  16.7G  15.1G  52%  1.00x  ONLINE  -

We see that the average block size of my work pool is 16.7G divided by 306575, which yields about 57K (the zdb output above reports an average allocated size of 58360 bytes per block). Your mileage will vary, especially if you use different kinds of data such as VM images, databases, pictures, etc.

Example: If a zpool stores 5 TB of data at an average block size of 64K, then 5TB divided by 64K yields 78125000 blocks. Multiplied by 320 Bytes, we get a dedup table size of 25 GB!
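The same estimate as a small Python sketch. It follows the arithmetic of the example above, which treats TB and K as decimal units (1 TB = 10^12 bytes, 64K = 64000 bytes); the function name and defaults are assumptions for illustration only.

```python
def estimate_ddt_bytes(data_bytes: int,
                       avg_block_bytes: int = 64_000,
                       entry_bytes: int = 320) -> int:
    """Estimated dedup table size: number of blocks times bytes per entry."""
    # Ceiling division, so a partially filled last block still counts.
    blocks = -(-data_bytes // avg_block_bytes)
    return blocks * entry_bytes


TB = 10**12  # decimal terabyte, as used in the example above
ddt = estimate_ddt_bytes(5 * TB)
print(f"{ddt / 10**9:.0f} GB")  # 25 GB
```

Passing a smaller avg_block_bytes shows how quickly the table grows for pools full of small files.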

The Total RAM Cost of Deduplication

But knowing the size of your deduplication table is not enough: ZFS needs to store more than just the dedup table in memory, such as other metadata and of course cached block data. There’s a limit to how much of the ZFS ARC cache can be allocated for metadata (and the dedup table falls under this category), and it is capped at 1/4 the size of the ARC.

In other words: Whatever your estimated dedup table size is, you’ll need at least four times that much RAM if you want to keep all of your dedup table in memory. Add to that any extra RAM you want to devote to other metadata, such as block pointers and other data structures, so ZFS doesn’t have to figure out the path through the on-pool data structure for every block it wants to access.

RAM Rules of Thumb

If this is all too complicated for you, then let’s try to find a few rules of thumb:

For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K.
This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS.
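These two rules of thumb fit into one short Python function. This is only a sketch of the reasoning above: the 5 GB per TB figure assumes a 64K average block size, the 1/4 metadata limit is the one described in this article, and all names are hypothetical.

```python
def ram_for_ddt_gb(pool_data_tb: float,
                   ddt_gb_per_tb: float = 5.0,
                   arc_metadata_fraction: float = 0.25,
                   os_gb: float = 1.0) -> float:
    """RAM needed to keep the whole dedup table in the ARC metadata share."""
    ddt_gb = pool_data_tb * ddt_gb_per_tb
    # Only a fraction of the ARC may hold metadata, so scale up accordingly.
    return ddt_gb / arc_metadata_fraction + os_gb


for tb in (1, 5, 10):
    print(f"{tb} TB of pool data -> {ram_for_ddt_gb(tb):.0f} GB RAM")
```

The result does not include the extra memory for other metadata mentioned above, so treat it as a lower bound.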

The Alternative: L2ARC

So far, we have assumed that you want to keep all of your dedup table in RAM at all times, for maximum ZFS performance. Given the potentially large amount of RAM that this can mean, it is worth exploring some alternatives.

Fortunately, ZFS allows the use of SSDs as a second level cache for its RAM-based ARC cache. Such SSDs are then called “L2ARC”. If the RAM capacity of the system is not big enough to hold all of the data that ZFS would like to keep cached (including metadata and hence the dedup table), then it will spill over some of this data to the L2ARC device. This is a good alternative: When writing new data, it’s still much faster to consult the SSD based L2ARC for determining if a block is a duplicate, than having to go to slow, rotating disks.

So, for deduplicated installations that are not performance-critical, using an SSD as an L2ARC instead of pumping up the RAM can be a good choice. And you can mix both approaches, too.

Adding up the Dedup Cost

Back to our example: Our 10 TB pool is expected to utilize just 5 TB which, at an average block size of 64K, would need a dedup table that is approximately 25GB in size. Last time I checked, an enterprise-class SSD with 32GB was in the range of $400, while 25GB of memory was in the range of $1000. Since only 1/4 of the ARC is available for metadata, keeping the whole table in RAM takes about 100GB, or roughly $4000. So this is the range of what using deduplication will actually cost in terms of extra SSD and/or RAM needed.

Putting Together the Business Case

There you have it: For our fictitious example of a 10 TB storage pool for VDI with an expected dedup savings of 5 TB, which translates into $5000 in disk space saved, we’d need to invest in $400 worth of SSD or, better, $4000 of RAM. That still leaves us with at least $1000 in net savings, which means that in this case dedup is a (close) winner!

But if we assume the same amount of raw data (10 TB) but only a dedup savings factor of 1.1, then our equation would be different: We’d still save close to 1 TB of disk storage (ca. $1000) but we’d need to build up a dedup table that can manage 9 TB of data, which would be in the range of 45GB. That means about $600 in SSD capacity for storing the dedup table. For RAM, we’d need 4 times that amount (180GB), since only 1/4 of ZFS ARC RAM is available for metadata. That doesn’t look very attractive to me.
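Both scenarios can be compared side by side with a rough Python sketch. It uses the fictitious prices from this section ($1000 per TB of disk, $400 per 32GB SSD, $1000 per 25GB of RAM), pro-rated per GB, so the SSD cost comes out a little lower than buying a whole device. The function and key names are invented for this example.

```python
def dedup_business_case(logical_tb: float, dedup_ratio: float,
                        disk_usd_per_tb: float = 1000.0,
                        ssd_usd_per_gb: float = 400.0 / 32,
                        ram_usd_per_gb: float = 1000.0 / 25,
                        ddt_gb_per_tb: float = 5.0,
                        arc_metadata_fraction: float = 0.25) -> dict:
    """Disk savings versus the cost of holding the dedup table."""
    stored_tb = logical_tb / dedup_ratio
    disk_savings = (logical_tb - stored_tb) * disk_usd_per_tb
    ddt_gb = stored_tb * ddt_gb_per_tb
    ssd_cost = ddt_gb * ssd_usd_per_gb
    # RAM must be sized so the table fits into the ARC metadata share.
    ram_cost = ddt_gb / arc_metadata_fraction * ram_usd_per_gb
    return {
        "ddt_gb": ddt_gb,
        "disk_savings_usd": disk_savings,
        "net_with_ssd_usd": disk_savings - ssd_cost,
        "net_with_ram_usd": disk_savings - ram_cost,
    }


for ratio in (2.0, 1.1):
    case = dedup_business_case(logical_tb=10.0, dedup_ratio=ratio)
    print(ratio, {key: round(value) for key, value in case.items()})
```

A negative net value means the extra SSD or RAM costs more than the disk space it saves.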

So it really boils down to what you get (= amount of space saved) and what you need to spend in order to get that (= extra SSD and/or RAM for the DDT), depending on whether you want to sacrifice some performance (by going SSD only) or not (by adding enough RAM).

Now, you can do your own calculations on a case by case basis.

Finding the Break-Even Point

Given some standard cost for disk space, SSDs and RAM, you can calculate a break even point. That way, you can just ask: What’s the expected dedup savings factor? Then you can instantly decide whether it’s worth deduplicating or not.

At our fictitious values of $1000 per TB of disk space, $400 for a 32 GB SSD and $1000 for 24GB of memory, and assuming an average block size of 64K, we can derive two break-even dedup ratios depending on the performance requirements:

For applications where performance isn’t critical, you can get away with no extra RAM for the DDT, but with some extra space for storing the DDT at least on an SSD in L2ARC. Each TB of pool capacity will cost 5GB of dedup table, no matter how much dedup will save. 5GB of dedup table will cost $62.50 when stored on a standard SSD ($400 for 32GB). Hence, for each TB of pool capacity, at least 62.5 GB need to be saved through dedup for the SSD to pay for itself (1000 GB cost $1000, so 62.5 GB saved will save $62.50, the price of having that dedup table stored on SSD). That translates into a minimum dedup factor of 1.0625 to warrant the extra SSD capacity needed.
For applications that are more performance-sensitive, you’ll need the same amount of memory for the DDT per TB (5GB), but this time you want to store it fully in RAM. ZFS limits metadata use in RAM to 1/4 of total ARC size, so we need to make sure our system has at least 20GB of extra RAM per TB of stored data. That means each TB of deduped pool data will cost us approx. $834 in x86 memory for storing its dedup table, so the minimum dedup savings factor needs to be 1.834 here.
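Here is the break-even reasoning from the two cases above as a hedged Python sketch. It assumes the fictitious prices of this section and an average block size of 64K (5GB of dedup table per TB); the function name is hypothetical.

```python
def break_even_ratio(cache_gb_per_tb: float,
                     cache_usd_per_gb: float,
                     disk_usd_per_tb: float = 1000.0) -> float:
    """Minimum dedup ratio at which saved disk pays for the DDT storage."""
    cost_per_tb = cache_gb_per_tb * cache_usd_per_gb
    return 1.0 + cost_per_tb / disk_usd_per_tb


# SSD only: 5 GB of dedup table per TB at $400 per 32 GB.
ssd = break_even_ratio(cache_gb_per_tb=5.0, cache_usd_per_gb=400.0 / 32)
# RAM: 20 GB per TB (table limited to 1/4 of ARC) at $1000 per 24 GB.
ram = break_even_ratio(cache_gb_per_tb=20.0, cache_usd_per_gb=1000.0 / 24)
print(f"SSD break-even: {ssd:.4f}, RAM break-even: {ram:.3f}")
```

The RAM figure lands at about 1.833 rather than 1.834 only because the text rounds the memory cost up to $834.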

Rules of Thumb

These are all fictitious numbers, YMMV, but I think some good rules of thumb are:

If performance isn’t critical, and if you expect to save more than 20% of storage capacity through deduplication, then go for it but add at least 5GB of L2ARC SSD capacity per TB of pool data to store the dedup table in.
If performance is important, wait until you expect at least a 2:1 reduction in storage space through deduplication, and add 30GB of RAM to your system for every TB of disk capacity to make sure that the dedup table is always in memory, so that optimal write performance is ensured.

Why the extra 10GB of RAM in the latter rule of thumb? You don’t want to fill up ZFS’ metadata RAM cache entirely with the dedup table. The other metadata you want to have quick access to are ZFS’ pointer data structures, so it knows where each data block is stored on disk etc. That can be estimated at 1% of total pool capacity, which is 10GB per TB of pool data.

Conclusion

The decision to use ZFS deduplication or not is almost always a simple cost/benefit analysis. When using deduplication, one needs to plan for at least some extra L2ARC SSD requirements, or better some extra RAM for storing the dedup table in a manner that doesn’t negatively impact write performance.

RAM in particular can become the decisive factor for or against deduplication, so usually a dedup savings factor of 2 is a necessary threshold for deduplication to become a real cost saver.

Your Take

So here are some fictitious numbers. Did you do your own dedup analysis? What were your results? What do you base your decision to use deduplication on? Place a comment below and share your own dedup cost/benefit analysis cases.

This post is obsolete