Script for “Delta Compressed and Deduplicated Storage Using Stream-informed Locality”

Slide 2 Motivation and Approach

For backup storage, increasing compression allows users to store more data without increasing their costs. Compression also decreases data center space, power, and management overheads. Our approach to increasing compression is to combine deduplication with delta compression. Deduplication finds identical chunks and replaces them with references. Delta compression finds similar chunks and stores only the changed bytes.

Slide 3 Previous Work on Similarity Indexing

A key challenge is the large number of data regions that must be indexed so that similar data regions can be found. Previous work assumed that version information was available so that an earlier version of a file could be used to calculate a difference. Other work kept the similarity index in memory, but this is impractical for large storage systems. The index could be stored on disk, but then there is random I/O for every lookup. We presented a paper at FAST earlier this year that addressed this issue by leveraging stream-informed locality, which I will describe in more detail shortly. That paper was focused on WAN replication, so throughput requirements were low, and it did not store delta compressed data. In this work, unlike our earlier work, we actually store delta compressed data and have to deal with the impact on throughput, cleaning, referential complexities, and data integrity.

Slide 4 Contributions

Our first contribution is that we built an experimental prototype that uses stream-informed locality to support efficient deduplication and delta compression. Using this prototype, we quantify throughput and make suggestions for future improvement. Adding delta compressed chunks introduces a new level of indirection in the storage system, which impacts data integrity and cleaning. Adding delta compression was actually the easy part; getting everything else to work was the hard part. We also vary the chunk size and report how deduplication and delta compression change.

Slide 5 Stream-informed Deduplication

First, I will show an example of stream-informed deduplication and then show how to add delta compression. A backup file enters the storage system, and we break it into variable-sized, content-defined chunks. We represent each chunk with a secure fingerprint such as SHA-1. Since this is the first backup to an empty system, we store the data to a container on disk. Container C1 has a metadata section with fingerprints 1 through 6 and a data section with the chunks. On the left side, we have an index mapping from fingerprint to container. A week later, we receive another full backup that has been slightly modified in chunk 4, shown in red. Again we calculate fingerprints, and chunk 4 has a different fingerprint because a few bytes have changed. We start the deduplication process by looking up fingerprint 1 in the index, which leads us to load container 1’s metadata into an in-memory cache.

Then we get a match for fingerprints 1, 2, 3, 5, and 6, so those chunks are deduplicated and do not need to be stored. Fingerprint 4 does not match, so that chunk is stored.
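
To make the lookup path concrete, here is a minimal Python sketch of stream-informed deduplication as just described. The names (Container, fp_index, dedup_stream, and the set-based cache) are invented for illustration and are not the prototype’s actual code; the point is simply that one index hit loads a whole container’s metadata, so neighboring chunks in the stream deduplicate against the cache without further index I/O.

```python
import hashlib

class Container:
    """Hypothetical on-disk container: a metadata section of fingerprints and a data section of chunks."""
    def __init__(self, cid):
        self.cid = cid
        self.fingerprints = []   # metadata section
        self.chunks = {}         # fingerprint -> chunk bytes (data section)

    def add(self, fp, data):
        self.fingerprints.append(fp)
        self.chunks[fp] = data

def fingerprint(chunk):
    # Secure fingerprint of the chunk contents (SHA-1, as in the talk).
    return hashlib.sha1(chunk).hexdigest()

def dedup_stream(chunks, fp_index, containers, cache, new_container):
    """fp_index maps fingerprint -> container id and is the only on-disk index."""
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in cache and fp in fp_index:
            # One index lookup loads the whole container's metadata into the
            # in-memory cache, so nearby chunks hit without further index I/O.
            cache.update(containers[fp_index[fp]].fingerprints)
        if fp in cache or fp in fp_index:
            continue                       # duplicate: keep only a reference
        new_container.add(fp, chunk)       # new chunk: store it
        fp_index[fp] = new_container.cid
        cache.add(fp)
```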

Slide 6 Stream-informed Deduplication and Delta Compression

Starting the process from the beginning, we receive a backup file, divide it into chunks, and this time we calculate both fingerprints used for deduplication and sketches used for delta compression. I will explain sketches on the next slide, but they are a resemblance hash designed to find similar, non-identical chunks. This time, we store both fingerprints and sketches in the metadata section of container 1. The on-disk index still maps from fingerprint to container. Note that we are not creating a sketch index. A week later another backup takes place, the file is divided into chunks, and chunk 4 has been slightly modified. Its fingerprint is different, but notice that its sketch is the same as the earlier version of chunk 4. We look up the first fingerprint in the index and load container 1’s fingerprints and sketches into a cache. All of the chunks deduplicate except chunk 4, but its sketch is a match, so we can delta encode chunk 4 relative to the earlier version and store only the difference.
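
A hedged sketch of this delta extension follows, building on the previous example. Here sketch, diff, and read_chunk are placeholders, and the container metadata is assumed to carry a sketches list and an add_delta method; none of these names come from the prototype. The key point from the slide is that sketches enter the cache alongside fingerprints from container metadata, so no separate sketch index is needed.

```python
def dedup_and_delta(chunks, fp_index, containers, fp_cache, sketch_cache, new_container):
    """fp_cache is a set of cached fingerprints; sketch_cache maps a cached
    sketch to the fingerprint of a base chunk that is already stored."""
    for chunk in chunks:
        fp, sk = fingerprint(chunk), sketch(chunk)
        if fp not in fp_cache and fp in fp_index:
            meta = containers[fp_index[fp]]
            # Fingerprints and sketches load together from container metadata.
            for f, s in zip(meta.fingerprints, meta.sketches):
                fp_cache.add(f)
                sketch_cache.setdefault(s, f)
        if fp in fp_cache or fp in fp_index:
            continue                                 # exact duplicate
        if sk in sketch_cache:
            base_fp = sketch_cache[sk]
            base = read_chunk(base_fp)               # extra read of the base from disk
            new_container.add_delta(fp, base_fp, diff(base, chunk))
        else:
            new_container.add(fp, chunk)             # no match: store the full chunk
        fp_index[fp] = new_container.cid
        fp_cache.add(fp)
```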

Slide 7 Deduplication and Delta Compression

Now, I will go through deduplication and delta compression in a bit more detail. When a chunk is being processed, we calculate a fingerprint, shown in red. If the fingerprint does not match anything, we try to perform delta compression. We start by calculating a sketch. Our technique for sketching is to calculate small rolling fingerprints across the chunk, and we maintain the 4 maximal feature values. We then calculate a Rabin fingerprint over these features to create a super-feature, which is a 4-byte value. Our sketch consists of multiple super-features, each used for indexing. In this case, there is no match for the chunk, so it is stored. The next chunk is processed, and it has the same fingerprint as the earlier chunk, so it is a duplicate and does not need to be stored. Then we process a chunk that is similar to the first chunk, but it has a few bytes modified, shown in red. We calculate a fingerprint over the chunk, but it does not match, so we calculate a sketch. The 4 maximal values are the same as those of the earlier, base chunk, so the sketch is a match. We then calculate the difference relative to the base chunk and store those modified bytes along with a fingerprint reference to the base chunk. This is how delta compression leads to space savings.
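
The following is a minimal, illustrative version of this style of sketching, not the prototype’s implementation. An Adler-32 over each window stands in for a true rolling Rabin fingerprint, SHA-1 truncated to 4 bytes stands in for the Rabin fingerprint computed over the features, and the window size, transform constants, and number of super-features are made-up parameters.

```python
import hashlib
import zlib

WINDOW = 32            # bytes per rolling-fingerprint window (assumed)
FEATURES_PER_SF = 4    # keep the 4 maximal feature values
NUM_SF = 3             # number of super-features in a sketch (assumed)
TRANSFORMS = [(2654435761, 0x1234), (2246822519, 0x5678), (3266489917, 0x9abc)]
MASK = 0xFFFFFFFF

def rolling_fingerprints(chunk):
    # Stand-in for a rolling Rabin fingerprint: one 32-bit value per window.
    for i in range(0, max(len(chunk) - WINDOW, 0) + 1):
        yield zlib.adler32(chunk[i:i + WINDOW]) & MASK

def sketch(chunk):
    """Return a tuple of super-features, each a 4-byte value derived from the
    4 maximal (transformed) feature values seen across the chunk."""
    fps = list(rolling_fingerprints(chunk))
    super_features = []
    for m, a in TRANSFORMS[:NUM_SF]:
        transformed = [(m * f + a) & MASK for f in fps]
        features = sorted(transformed, reverse=True)[:FEATURES_PER_SF]
        # Hash the features down to a 4-byte super-feature (SHA-1 as a stand-in).
        digest = hashlib.sha1(b"".join(f.to_bytes(4, "big") for f in features)).digest()
        super_features.append(digest[:4])
    return tuple(super_features)
```

In practice, a match on any one super-feature is treated as a sketch match, which is why each super-feature is indexed (here, cached) separately.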

Slide 8 Backup Datasets

In our experiments, we used 4 backup datasets, typically consisting of full and incremental backups. They vary in size from 2 TB up to 5 TB and were collected over 4 to 6 months.

Slide 9 Compression Results

We built an experimental prototype and measured compression after storing each dataset.

Deduplication varies from 5x to 25x. This means that we only have to store 1/5th to 1/25th of the data, and higher compression factors indicate greater space savings. Next, we attempted to delta compress all chunks that were not duplicates, and delta compression varied from 2.6x up to 4x. Next, we applied LZ compression to both delta compressed and non-delta compressed chunks, with LZ adding another 1.5x to 2.5x compression. The total compression is then the product of these three compression factors. As discussed in previous work, there is some overlap between the compression achieved with delta compression and LZ, because they both reduce redundancy within a chunk. So, we reran the experiment with delta turned off as a baseline, and we found that the compression improvement with delta was between 1.4x and 3.5x beyond deduplication and LZ.
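
As a worked example of how the factors combine, the numbers below are illustrative values within the ranges just mentioned, not the results for any particular dataset.

```python
# Illustrative factors only, chosen within the ranges reported on this slide.
dedup_factor = 10.0   # deduplication stores 1/10th of the incoming data
delta_factor = 3.0    # delta compression shrinks the remaining chunks 3x
lz_factor = 2.0       # LZ shrinks the stored bytes another 2x

total = dedup_factor * delta_factor * lz_factor   # 60x total compression
stored_gb = 3 * 1024 / total                      # a 3 TB dataset needs about 51 GB on disk
print(total, stored_gb)
```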

Slide 10 Throughput

An important consideration for a storage system is throughput. Delta compression requires extra computation and I/O. The computation is related to creating a sketch and computing differences between chunks. The larger overhead is related to I/O. In order to perform delta compression, we need to read the base chunk from disk and then calculate the difference, which is the set of modified bytes plus a reference to the base chunk, and that is what gets stored. We performed an experiment measuring the throughput of our prototype with delta compression both on and off. With delta compression off, deduplication still runs and serves as our baseline. During the first full backup, most of the chunks are sketched, and there is some delta compression, which leads to our throughput running at 74% of the baseline system. After the first full backup, much of the data is removed as duplicate, but delta compression increases, and our throughput is 50% of the baseline deduplicated storage system.

Slide 11 Throughput Stages

We investigated throughput performance in more detail by timing each stage of delta compression in a single-stream experiment. For reading chunks, we present read numbers using either hard drives or solid state drives. One large oddity is that lookup throughput varies from 30 megabytes per second up to 1,500 megabytes per second, a 50X difference. We discovered that for the bottom two datasets, there were large numbers of duplicate sketches in the cache, and the timing difference was related to handling long chains in our hash table. A simple solution is to prevent duplicates in the sketch cache or to stop the lookup early when a sufficiently good match is found. As expected, the slowest stage is reading chunks from storage. Throughputs were as low as 1 megabyte per second when there was a random I/O for each base chunk. Throughputs were higher for the Workstations dataset, where there was better locality for the reads and multiple base chunks were loaded together. We compare these read numbers to replacing hard drives with solid state drives. I want to be clear that we did not actually measure performance with solid state drives, but used reported read times and applied the same locality as the hard drive system to estimate these values. I also want to emphasize that aggregate throughput is higher than this table might indicate for multiple reasons: deduplication acts as a multiplier on throughput, only two thirds of the chunks have sketch matches and result in disk reads, and pipelining and asynchronous reads across multiple disks will improve overall throughput.
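
One way to picture the "no duplicate sketches" fix is a cache keyed by super-feature that keeps at most one base per key and stops at the first hit. This is only a sketch of the idea under assumed names, not the prototype’s data structure.

```python
class SketchCache:
    """Minimal illustration: at most one base fingerprint per super-feature,
    so lookups never walk long hash-table chains of duplicate sketches."""
    def __init__(self):
        self.by_super_feature = {}   # super-feature -> base fingerprint

    def insert(self, super_features, base_fp):
        for sf in super_features:
            # First writer wins; duplicate sketches are simply not inserted.
            self.by_super_feature.setdefault(sf, base_fp)

    def lookup(self, super_features):
        for sf in super_features:
            base_fp = self.by_super_feature.get(sf)
            if base_fp is not None:
                return base_fp       # stop early at the first match
        return None
```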

Slide 12 Indirection Complexities

Adding delta compressed chunks creates new complexity in the storage system related to the delta references. One new issue to consider is what happens if duplicate chunks enter the system. There have been several papers that improved the throughput of their systems by allowing some duplicates to be written. In this example, cat is written to disk. Then catch is stored as a delta of cat. Sometime later, catch is seen again. While it could be deduplicated, it may be written as a duplicate. Then when cat is seen again, it is similar to catch and could be written as a delta version. Now, to read back cat, we could find the plain version, which is fastest to read. Alternatively, we may find the delta encoded version, which references catch. Unfortunately, there are two versions of catch. One is a plain version, and the other is delta encoded relative to cat and requires another read. It is even possible for a loop to happen.

Another issue to consider is the number of delta levels a system supports. Suppose catcher is written to the system. The most space efficient way to store it is as a delta of catch, which is itself a delta of cat. This requires multiple reads on the write path to reconstruct catch, and it makes catcher a 2-level delta. Our prototype only allows one level of delta encoding to reduce complexity, though we do trade off some potential compression.
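
A one-level policy can be expressed as a simple check at write time: only delta encode against a base that is itself stored plain. The sketch below uses hypothetical names (chunk_store, is_delta) and illustrates the policy rather than the prototype’s code.

```python
def choose_encoding(sketch_match, chunk_store):
    """Illustrative write-time policy: delta encode a chunk only when its
    candidate base is a plain, non-delta chunk, keeping every delta at level one."""
    if sketch_match is None:
        return "store_plain"
    base = chunk_store.get(sketch_match)   # metadata for the candidate base chunk
    if base is None or base.is_delta:
        # The base is missing or is itself a delta: encoding against it would
        # create a 2-level delta (or, with duplicates, possibly a loop),
        # so store this chunk in full instead.
        return "store_plain"
    return "store_delta"
```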

Slide 13 Indirection Complexities

We are focused on backup storage, and the integrity of data is of highest importance. We need to have end-to-end validity checks to confirm that files can be reconstructed. A new complexity is that reconstructing a delta chunk requires reading the base, which may be stored elsewhere in the system. To verify catcher, we read the delta encoded version and then need to read catch, which is somewhere else on disk. We implemented the simple version, which performs the read immediately, but a more efficient solution is needed.
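
A minimal version of such a validity check might look like the following, where read_chunk, apply_patch, and the delta_record fields are placeholders rather than the prototype’s API.

```python
def verify_delta_chunk(delta_record, read_chunk, apply_patch, fingerprint):
    """Minimal end-to-end check for a delta encoded chunk. The helpers
    (read_chunk, apply_patch) and the delta_record fields are placeholders."""
    base = read_chunk(delta_record.base_fp)         # extra read; the base may be far away on disk
    if base is None:
        return False                                # dangling base reference
    rebuilt = apply_patch(base, delta_record.diff)  # reconstruct the original chunk
    return fingerprint(rebuilt) == delta_record.fp  # compare against the stored fingerprint
```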

Another complexity is related to cleaning up deleted files and chunks, which we call garbage collection. An incorrect implementation can cause dangling references, multi-level deltas, and loops. In this example, we deleted the duplicate versions of cat and catch, and all three needed chunks, cat, catch, and catcher, still exist. Unfortunately, they cannot be fully reconstructed. Catcher references catch, which references cat, which references catch, and we have data loss.

Slide 14 Garbage Collection

Let’s investigate garbage collection in more detail. We have a log-structured file system, and there are generally two cleaning techniques. Reference counting keeps a count for each chunk of how many times it is referenced, whether directly by a file or indirectly as a delta base. There are many issues with reference counting. There are too many references to keep in memory, so they need to be stored on disk, which increases I/O. Also, system errors can cause incorrect counts and data loss.

We implemented a mark-and-sweep algorithm that walks the live files and marks live chunks. In this example, we have live chunks apple, banana, and zebra. On disk, we have two containers containing live and dead chunks. An issue specific to stream-informed delta compression is that garbage collection can change locality on disk. When we run cleaning, live chunks from two different containers will be copied forward into a new container. Since we do not have a sketch index, this can impact delta compression.
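
The essential detail for delta storage is that the mark phase must treat a delta’s base as live. A minimal sketch of that mark step follows, with assumed structures (chunk_fps per file, base_fp per chunk) that are not the prototype’s.

```python
def mark_live_chunks(live_files, chunk_meta):
    """Mark phase sketch: walk the live files and mark their chunks as live.
    chunk_meta maps fingerprint -> metadata whose base_fp field is None for
    plain chunks (an assumed structure)."""
    live = set()
    stack = [fp for f in live_files for fp in f.chunk_fps]
    while stack:
        fp = stack.pop()
        if fp in live:
            continue
        live.add(fp)
        base_fp = chunk_meta[fp].base_fp
        if base_fp is not None:
            stack.append(base_fp)   # a delta chunk keeps its base live as well
    return live
```

The sweep then copies the marked chunks forward into new containers, which is exactly the step that can change on-disk locality and affect later sketch matches.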

Slide 15 Garbage Collection Impact on Delta Compression

We investigated this experimentally with three datasets. We wrote the data with a 4 week retention policy. After we wrote each week of data, we measured delta compression. We then deleted data older than 4 weeks and ran garbage collection. We repeated this experiment without running garbage collection as a baseline for the maximum delta compression of our system, and this figure shows how much of the delta compression we achieved with garbage collection turned on. For Email and Source Code, there was almost no impact on delta compression. For Workstations, cleaning caused delta compression to decrease about 20% from the baseline, with the week to week variance shown. We believe that this is an acceptable amount of delta compression, but it does deserve further investigation.

Slide 16 Compression vs. Chunk Size

Finally, we explored how compression is impacted by chunk size. Our previous experiments used an average chunk size of 8KB, but in this experiment, we vary the average chunk size from 1KB up to 1MB on the workstations dataset. The vertical axis shows compression factors, and higher is better.

Deduplication is highest with small chunks and slowly decreases as the system fails to find identical chunks. Adding delta compression, you can see that the delta factor steadily increases, basically finding compression that deduplication was unable to find. Then we add LZ compression, which grows slightly but is fairly flat. Total compression, shown in purple, is the product of these three factors. Finally, the black line shows how much compression is achieved if delta is turned off. There are two important results from this experiment. First, delta compression adds significant compression beyond the standard technique of deduplication and LZ compression. Second, surprisingly, total compression is relatively flat over a large range of chunk sizes because delta compression is able to compensate for compression that deduplication misses. This suggests that we could increase our chunk size to improve throughput or reduce the memory needed for tracking chunks. The paper shows results for all 4 datasets, which are consistent with this result. We leave an analysis of how throughput varies with chunk size to future work.

Slide 17 Conclusion and Future Work

In conclusion, we built an experimental storage prototype with deduplication and delta compression using stream-informed locality to remove the need for a sketch index. Delta compression adds significant additional compression on backup datasets. We studied the throughput of delta storage and showed that it is about 50% of the underlying deduplicated storage system because of extra computation and I/O, and we present several suggestions for improving throughput. We also explore complexities related to delta references and discuss the impact on garbage collection and data integrity. Finally, we show that delta compression helps maintain a high level of compression across a broad range of chunk sizes.

Slide 18 Questions?

I would be happy to answer any questions.
