Document 14106748

advertisement
MIGRATORY COMPRESSION
Coarse-grained Data Reordering to Improve Compressibility
Xing Lin*, Guanlin Lu, Fred Douglis, Philip Shilane,
Grant Wallace
*University
of Utah
EMC Corporation
Data Protection and Availability Division
© Copyright 2014 EMC Corporation. All rights reserved.
1
Overview
Ÿ  Compression: finds redundancy among strings
within a certain distance (window size)
–  Problem: window sizes are small, and similarity across a
larger distance will not be identified
Ÿ  Migratory compression (MC): coarse-grained
reorganization to group similar blocks to improve
compressibility
–  A generic pre-processing stage for standard compressors
–  In many cases, improve both compressibility and
throughput
–  Effective for improving compression for archival storage
© Copyright 2014 EMC Corporation. All rights reserved.
2
Background
•  Compression Factor (CF) = ratio of original / compressed
•  Throughput = original size / (de)compression time
Compressor MAX Window size Techniques
gzip
64 KB
LZ; Huffman coding
bzip2
900 KB
Run-length encoding;
Burrows-Wheeler Transform;
Huffman coding
7z
1 GB
LZ; Markov chain-based
range coder
© Copyright 2014 EMC Corporation. All rights reserved.
3
Background
•  Compression Factor (CF) = ratio of original / compressed
•  Throughput = original size / (de)compression time
Compressor MAX Window size Techniques
gzip
64 KB
LZ; Huffman coding
bzip2
900 KB
Run-length encoding;
Burrows-Wheeler Transform;
Huffman coding
7z
1 GB
LZ; Markov chain-based
range coder
© Copyright 2014 EMC Corporation. All rights reserved.
4
Background
•  Compression Factor (CF) = ratio of original / compressed
•  Throughput = original size / (de)compression time
MAX Window size
Techniques
gzip
64 KB
LZ; Huffman coding
bzip2
900 KB
Run-length encoding; BurrowsWheeler Transform; Huffman
coding
7z
1 GB
LZ; Markov chain-based range
coder
Example Throughput vs. CF
Compression Tput (MB/s)
Compressor
14
gzip
12
10
8
bzip2
6
4
7z
2
0
2.0
2.5
3.0
3.5
4.0
4.5
5.0
Compression Factor (X)
The larger the window,
the better the compression
but slower
© Copyright 2014 EMC Corporation. All rights reserved.
5
Background
•  Compression Factor (CF) = ratio of original / compressed
•  Throughput = original size / (de)compression time
MAX Window size
Techniques
gzip
64 KB
LZ; Huffman coding
bzip2
900 KB
Run-length encoding; BurrowsWheeler Transform; Huffman
coding
7z
1 GB
LZ; Markov chain-based range
coder
rzip (from rsync):
intra-file deduplication;
bzip2 the remainder
Example Throughput vs. CF
Compression Tput (MB/s)
Compressor
14
gzip
12
10
8
bzip2
6
rzip
7z
4
2
0
2.0
2.5
3.0
3.5
4.0
4.5
5.0
Compression Factor (X)
The larger the window,
the better the compression
but slower
© Copyright 2014 EMC Corporation. All rights reserved.
6
Motivation
Ÿ  Compress a single, large file
–  Traditional compressors are unable to exploit redundancy
across a large range of data (e.g., many GB)
gz window
A B
A’ C … (GBs) A” B’ … 7z window
compress
transform
A A’ A” B B’ C
… … Standard compressor
© Copyright 2014 EMC Corporation. All rights reserved.
…
compress
…
mzip
7
Motivation
Ÿ  Better compression for long-term retention
–  Data migration from backup to archive tier
–  Observation: similar blocks could be scattered across
“compression regions” (CRs) in the backup tier
CR1
A B
C CR: unit of data that is
compressed together
(e.g. 128 KB)
CR2 A’ B’ C’ …
Backup tier
© Copyright 2014 EMC Corporation. All rights reserved.
8
Motivation
Ÿ  Better compression for long-term retention
–  Data migration from backup to archive tier
–  Observation: similar blocks could be scattered across
“compression regions” (CRs) in the backup tier
CR1
A B
CR1
C A B
C Migrate
CR2 A’ B’ C’ CR2 A’ B’ C’ …
…
Backup tier
Archive tier
Current practice
© Copyright 2014 EMC Corporation. All rights reserved.
9
Motivation
Ÿ  Better compression for long-term retention
–  Data migration from backup to archive tier
–  Observation: similar blocks could be scattered across
“compression regions” (CRs) in the backup tier
–  Proposal: fill each CR with similar data
CR1
A B
C CR1
A A’ B B’
CR2
C C’ …
Migrate
CR2 A’ B’ C’ …
…
Backup tier
Archive tier
Migratory compression
© Copyright 2014 EMC Corporation. All rights reserved.
10
Recap
Ÿ  Migratory compression
–  Idea: improve compression by grouping similar blocks
–  Use cases
▪  mzip: compress a single, large file
▪  Data migration for archival storage
Ÿ  Considerations
▪  How to identify similar blocks?
▪  How to reorganize blocks efficiently?
▪  Other tradeoffs (refer to the paper)
© Copyright 2014 EMC Corporation. All rights reserved.
11
mzip Workflow
File
Segmentation
Similarity detector
Ÿ  Segmentation: partition into blocks,
calculate similarity “features”
Reorganizer
Ÿ  Similarity detector: group by
content and identify duplicate and
similar blocks; output migrate and
restore recipe
Reorganized
file
Ÿ  Reorganizer: rearrange the input file
Compressor
–  Reorganized file may exist only as a pipe
into compressor
Compressed
file
© Copyright 2014 EMC Corporation. All rights reserved.
12
mzip Workflow
File
Compressed
file
Segmentation
Decompressor
Similarity detector
Reorganized
file
Reorganizer
Reorganized
file
Compressor
Compressed
file
© Copyright 2014 EMC Corporation. All rights reserved.
Extract
restore
recipe
Restorer
File
13
mzip Example
migrate recipe
0 3 1 5 2 4 .
.
Block ID
Similarity
detector
A
1
B
2
C
3
A’
4
D
5
B’
…
File
0
…
0 2 4 1 5 3 .
.
restore recipe
© Copyright 2014 EMC Corporation. All rights reserved.
14
mzip Example
migrate recipe
0 3 1 5 2 4 .
Reorganize
Block ID
A
0
1
B
A’
1
2
C
B
2
3
A’
B’
3
4
D
C
4
5
B’
D
5
…
…
Reorganized
File
A
…
0
…
File
.
Restore
0 2 4 1 5 3 .
.
restore recipe
© Copyright 2014 EMC Corporation. All rights reserved.
15
Similarity Detection
Similarity feature (based on [Broder 1997])
•  A strong hash for duplication detection
•  Features: weak hashes provide hints about similarity among
blocks
•  Two blocks are similar when they share values for some
weak hashes
Mature technique
•  REBL: Redundancy Elimination at Block Level [Kulkarni et al.
2004]
•  WAN Optimized Replication [Shilane et al. 2012]
© Copyright 2014 EMC Corporation. All rights reserved.
16
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
B
2
C
3
A’
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
17
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
B
2
C
3
A’
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
18
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
B
2
C
A’
3
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
19
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
A’
2
B
C
3
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
20
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
A’
2
B
C
B’
3
4
D
5
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
21
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
0
A
1
A’
2
B
3
B’
4
C
D
5
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
22
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
recipe
0 3 1 5 2 4
–  Block-level
0
A
1
A’
2
B
3
B’
4
C
5
D
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
23
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
recipe
0 3 1 5 2 4
–  Block-level
▪  Random IOs
▪  Fine for memory and SSD
0
A
1
A’
2
B
3
B’
4
C
5
D
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
24
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
▪  Random IOs
▪  Fine for memory and SSD
1st scan
–  Multipass (HDD)
0
A
1
B
2
C
3
A’
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Buffer
Block-level
Multipass
25
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
▪  Random IOs
▪  Fine for memory and SSD
1st scan
–  Multipass (HDD)
0
A
1
A’
2
B
C
3
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Buffer
Block-level
Multipass
26
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
0 3 1 5 2 4
recipe
–  Block-level
–  Multipass (HDD)
0
2nd scan
▪  Random IOs
▪  Fine for memory and SSD
1
2
Buffer
C
3
4
D
5
B’
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
Multipass
27
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
recipe
0 3 1 5 2 4
–  Block-level
–  Multipass (HDD)
2nd scan
▪  Random IOs
▪  Fine for memory and SSD
0
B’
1
C
2
D
Buffer
3
4
5
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
Multipass
28
Data Reorganization
Ÿ  Re-arrange the input file, according to a recipe
–  Reorganizer & Restorer
Ÿ  Options
recipe
0 3 1 5 2 4
–  Block-level
–  Multipass (HDD)
▪  Convert random IOs into
sequential scans
2nd scan
▪  Random IOs
▪  Fine for memory and SSD
0
B’
1
C
2
D
Buffer
3
4
5
input
© Copyright 2014 EMC Corporation. All rights reserved.
Block-level
Multipass
29
Evaluation
Ÿ  mzip:
–  How much can compression be improved?
–  What is the complexity (runtime overhead)?
–  How does SSD or HDD affect data reorganization
performance? For HDD, does multi-pass help?
–  How does MC perform, compared with delta compression?
(refer to our paper)
–  What are the configuration options for MC? (refer to our
paper)
Ÿ  Data migration for archival storage
–  Compression and performance
© Copyright 2014 EMC Corporation. All rights reserved.
30
Datasets
Type
Name
Workstation
backup
WS1
Email server
backup
EXCHANGE2
VM image
UbuntuVM
Compression Factor
(standalone - baseline)
Dedupe
Factor
(X)
gzip
bzip2
7z
rzip
17.36
1.69
2.70
3.22
4.44
4.46
17.32
1.02
2.78
3.13
4.75
4.79
6.98
1.51
3.90
4.26
6.71
6.69
Size
(GB)
* Only one dataset for each type is shown here. Refer to our paper for others.
© Copyright 2014 EMC Corporation. All rights reserved.
31
Compression Factor vs. Compression Throughput
Compression Tput (MB/s)
Workstation1
25
MC improves both CF and
compression throughput
ü  Deduplication
ü  Re-organization
gzip(mc)
20
gzip
gzip
gz(mc)
15
bzip2
bz(mc)
rzip
10
7z
7z(mc)
bzip2
5
7z
rzip
0
0
2
© Copyright 2014 EMC Corporation. All rights reserved.
4
6
8
Compression Factor (X)
10
rz(mc)
32
Compression Factor vs. Compression Throughput
MC improves CF but slightly
reduces compression throughput
Compression Tput (MB/s)
Exchange2
25
gzip
20
bzip2
15
7z
rzip
10
gz(mc)
5
bz(mc)
7z(mc)
0
0
2
4
6
Compression Factor (X)
© Copyright 2014 EMC Corporation. All rights reserved.
8
rz(mc)
33
Maximal Compression
Compressors can be run at various compression levels and
window sizes, making a tradeoff between compression and
runtime
Compression Tput (MB/s)
Workstation1
7
7z-DEF(MC)
6
5
4
7z-DEF
3
7z-MAX(MC)
MC benefits
maximal compression
7z-MAX
2
1
0
0
2
4
6
8
Compression Factor (X)
© Copyright 2014 EMC Corporation. All rights reserved.
10
Results for other
compressors have
same trends
34
Compression Factor Breakdown
MC improves compression
to varying extents
beyond dedup and gzip
8
Compresion Factor (X)
7
6
5
4
reorg
3
gzip
dedup
2
1
0
EX2
WS1
UBUN
Dataset
© Copyright 2014 EMC Corporation. All rights reserved.
35
Data Reorganization
SSD: reasonably good
HDD: multipass helps
Ÿ  Configurations
–  mem: input and output stored in tmpfs
–  SSD/HDD: used to store input; output to HDD; 8 GB mem
UBUN: 7 G
Multipass window:
4.8 G (2 scans) Compression Tput
(MB/s)
30
25
20
mem
ssd-block
15
hdd-multipass
10
hdd-block
5
0
EX1
© Copyright 2014 EMC Corporation. All rights reserved.
WS1
UBUN
36
Data Migration for Archival Storage
Ÿ  DataDomain Filesystem (DDFS)
–  Backup tier: LZ (fast)
–  Archive tier: recompresses with GZ (25-44% better CF)
Ÿ  With
MC,
–  Identify and sort by cluster sizes of similar blocks
–  Migrate in 3 stages: top third largest clusters, middle third,
bottom third (including distinct blocks)
Ÿ  Datasets
–  WORKSTATIONS: many backups of several workstations
–  Exchange[123]: many backups of 3 exchange email servers
© Copyright 2014 EMC Corporation. All rights reserved.
37
Effect of MC on Archival Migration
25
gzip
Compression Factor (X)
mc_total
top1/3
20
mid1/3
ü  MC improves CF
•  [44% (EX3), 157%
(WS)]
•  Top 2/3 compresses
very well
end1/3
15
10
5
0
WS
© Copyright 2014 EMC Corporation. All rights reserved.
EX1
EX2
EX3
38
Archival Migration – Costs
Ÿ  Migration Runtime
–  Exchange1: 3X longer
Ÿ  Read Performance
–  As data is scattered in file system, read performance
suffers [Lillibridge 2013]
–  entire EXCHANGE1: 1.3 X longer
–  final backup: 7X longer (1.24 X longer if just the top third
of similar clusters are reorganized)
Ÿ  Memory overheads
© Copyright 2014 EMC Corporation. All rights reserved.
39
Related Work
Ÿ  Improving traditional compressors
–  LZMA algorithm with extra-large window (7z)
–  rzip rolling hash to deduplicate with a large window
–  Burrows-Wheeler Transform to reorder data (bzip2)
Ÿ  Delta compression (MC is slightly better in CF and throughput)
–  Similar to MC when limited to a single file
–  Bookkeeping complexities in a file system
Ÿ  Similarity detection
–  WAN Optimized Replication: delta-compress similar chunks for replication
–  REBL: Redundancy Elimination at Block Level (Never applied practically at
scale, or in self-contained file)
© Copyright 2014 EMC Corporation. All rights reserved.
40
Conclusions
Ÿ  Migratory Compression preprocesses data to make it
more compressible
–  Identify and cluster similar data
Ÿ  mzip
Ÿ  Archival storage
–  MC reduces $/GB further
© Copyright 2014 EMC Corporation. All rights reserved.
Throughput
–  Improves existing compressors, in both compressibility
and frequently runtime
–  Redraw the performance-compression curve!
15
10
5
0
2.0
3.0 4.0
CF
5.0
41
Thank you!
Questions?
© Copyright 2014 EMC Corporation. All rights reserved.
42
Download