IBM ProtecTIER Deduplication Solutions
Stanislav Dzúrik
IBM FTSS Storage
stanislav_dzurik@sk.ibm.com
Got data? Too much?
And not enough ( blank ) to store it all?
( blank ): Time, Money, People, Floor Space, Electricity, Air Conditioning
Protect More. Store Less.®
The tidal wave of data continues …
• The amount of digital information continues to grow exponentially
• And we need to keep more of it, longer
• And the costs of losing data are increasingly unacceptable
o Lost revenues
o Lost customer confidence
o Embarrassment in the market
o Fines from contracts, government agencies
o CEO and CFO could go to jail
• But budgets are not increasing
[Chart, 2005 to 2010] Data created and copied is expected to grow at 48% CAGR through 2010
We Need to do More with Less,
and we need to do it smarter
Source: Various external consultant reports
Survey - what are your two biggest storage pain points?
* TheInfoPro Storage Study: F1000 Sample. n=149. Other n=14. *Multiple responses recorded
Storage efficiency strategies and best practices
• Stop storing so much
• Move data to the right place
• Store more with what’s on the floor
A set of essential technologies enables storage efficiency
• Stop storing so much
o Data Compression
o Data Deduplication
• Move data to the right place
o Automated Tiering
o Automated Data Migration
• Store more with what’s on the floor
o Storage Virtualization
o Thin Provisioning
The pressures on backup administrators are growing
• Growth: More new data coming
• Backup: Backup takes longer
• Manage: Can’t buy more storage
• Recover: Recovery takes longer
Using the right balance of high density tape and high performance disk will help . . .
• Long Term Retention
o Cost effective capacity
o Removable & transportable
• Compliance
o Meet financial & regulatory requirements
o Data encryption, WORM
• Short Term Retention
o Use disk for daily backup & restore operations
• Performance
o Fast backups
o Even faster restores
o Meet “backup windows”
Compression and Deduplication use less physical storage
• Store data more efficiently
• Lower Operating Expenses: Power, cooling, floor space
• Keep more data online for analytics and fast restores
And data deduplication is the key to using more disk more cost effectively!
Data Deduplication Overview
Deduplication Architectures
[Diagram: Client, LAN, Server, SAN, Storage Devices]
Client side
• Reduces load on server
• Reduces bandwidth on LAN
• Adds load to client
• No cross correlation among multiple clients
Server side
• Allows cross correlation of data among multiple clients
• Adds load to server
Block Storage Device
• Transparent to clients and servers
• Reduces load on server and client
• Adds load to storage device
• No file or format awareness
Data Deduplication Process (simplified)
Assume a data object or stream as the subject for deduplication
• The data object is split into chunks (fixed or variable size)
• For each chunk an identity characteristic is determined
• Duplicate chunks are identified
• Identical chunks are referenced with pointers
• Non-identical chunks or single instances are effectively stored
• Compression may be performed in addition
Example: the stream A B C D A E F F D B A F is reduced to the stored chunks A B C D E F
Required disk cache is reduced
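The process above can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: it assumes a SHA-256 digest as the "identity characteristic" and uses the twelve-chunk example stream from the slide. The names `deduplicate` and `reconstruct` are hypothetical.

```python
import hashlib

def deduplicate(chunks):
    """Store each distinct chunk once and describe the stream as references."""
    store = {}        # fingerprint -> chunk payload (stored once)
    references = []   # one fingerprint per chunk of the original stream
    for chunk in chunks:
        # Assumption: a SHA-256 digest serves as the identity characteristic.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = chunk      # first instance: store the data
        references.append(fingerprint)      # every instance: store a pointer
    return store, references

def reconstruct(store, references):
    """Rebuild the original chunk sequence by following the references."""
    return [store[fingerprint] for fingerprint in references]

# The example stream from the slide, one byte string per chunk.
stream = [c.encode() for c in "A B C D A E F F D B A F".split()]
store, refs = deduplicate(stream)
print(len(stream), "chunks received,", len(store), "chunks stored")
```

Twelve chunks arrive but only the six distinct ones (A to F) are kept; the reference list is what allows the original stream to be rebuilt on restore.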
Methods for Data Chunking
1. File based
o One chunk is one file, most appropriate for file systems
o E.g. TSM Incremental Backup forever helps eliminate redundant data
2. Fixed block
o Data object is split into fixed blocks
o Used by block storage devices
3. Format Aware
o Understands explicit data formats and chunks the data object according to format
o Example: breaking a PowerPoint deck into separate slides
4. Format agnostic
o Chunking is based on an algorithm that looks for logical breaks or similar elements within a data object
• Chunking method influences deduplication ratio
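Why the chunking method influences the deduplication ratio can be seen by comparing fixed blocks with a content-defined ("format agnostic") split after a one-byte insertion at the front of the data. The sketch below is a toy under stated assumptions: the boundary rule (cut where the sum of the last few bytes is divisible by a constant) is a stand-in for the rolling hashes real systems tend to use, and the function names, window and divisor are hypothetical.

```python
def fixed_chunks(data: bytes, size: int):
    """Split into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 3, divisor: int = 8):
    """Cut after position i when the sum of the last `window` bytes is
    divisible by `divisor`. Boundaries depend only on nearby content,
    not on the absolute offset within the stream."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if sum(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])   # trailing remainder
    return chunks

def shared(a, b):
    """Number of distinct chunks that two chunk lists have in common."""
    return len(set(a) & set(b))

data = bytes(range(64))     # sample payload
edited = b"X" + data        # same payload with one byte inserted in front

shared_fixed = shared(fixed_chunks(data, 8), fixed_chunks(edited, 8))
shared_cdc = shared(content_defined_chunks(data), content_defined_chunks(edited))
print("fixed blocks still matching:", shared_fixed)
print("content-defined chunks still matching:", shared_cdc)
```

With fixed blocks the insertion shifts every block boundary, so no block of the edited data matches the original. With content-defined boundaries only the first chunk changes and the rest line up again.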
Method for Determining Duplicates
[Figure: stream A B C D A E F F D reduced to the unique chunks A B C D E F]
1. Hashing
o Computes a hash (MD5, SHA-256) for each data chunk
o Compares the hash with all hashes of existing data
o Identical hash means most likely identical data
o Potential (small) risk of hash collisions: identical hash and non-identical data
o Must be prevented through secondary comparison (additional metadata, second hash method, binary comparison)
2. Binary Comparison
o Compares all bits of similar chunks
3. Delta Differencing
o Computes a “delta” between two “similar” chunks of data, where one chunk is the baseline and the second is stored as a delta against it
o Since each delta is unique there is no possibility of collision
o To reconstruct the original chunk the delta(s) have to be re-applied to the baseline chunk
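Two of the methods above can be illustrated together: a hash lookup guarded by a binary comparison, and a delta computed against a baseline chunk. This is a hedged sketch, not how any particular product works. `VerifiedChunkStore`, `make_delta` and `apply_delta` are hypothetical names, and the deliberately weak hash (first byte only) is used only to force a collision for demonstration.

```python
import hashlib
from difflib import SequenceMatcher

class VerifiedChunkStore:
    """Hash lookup followed by a byte-for-byte check before declaring a duplicate."""

    def __init__(self, hash_fn=lambda chunk: hashlib.sha256(chunk).digest()):
        self._hash_fn = hash_fn
        self._by_hash = {}   # hash value -> list of distinct chunks with that hash

    def add(self, chunk: bytes) -> bool:
        """Return True if the chunk was already stored (a verified duplicate)."""
        candidates = self._by_hash.setdefault(self._hash_fn(chunk), [])
        if any(existing == chunk for existing in candidates):  # binary comparison
            return True
        candidates.append(chunk)   # new data, or a hash collision: keep it
        return False

    def unique_chunks(self) -> int:
        return sum(len(chunks) for chunks in self._by_hash.values())

def make_delta(baseline: bytes, target: bytes):
    """Describe `target` as copies from `baseline` plus literal inserts."""
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))            # reuse baseline bytes
        else:
            delta.append(("insert", target[j1:j2]))   # bytes only in the target
    return delta

def apply_delta(baseline: bytes, delta) -> bytes:
    """Re-apply the delta to the baseline to reconstruct the target chunk."""
    out = bytearray()
    for op in delta:
        out += baseline[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

# A weak hash makes b"AB" and b"AC" collide; the binary check keeps both.
weak = VerifiedChunkStore(hash_fn=lambda chunk: chunk[:1])
print(weak.add(b"AB"), weak.add(b"AC"), weak.add(b"AB"))

base, changed = b"the quick brown fox", b"the quick red fox jumps"
print(apply_delta(base, make_delta(base, changed)) == changed)
```

The secondary comparison costs an extra read of the stored chunk, which is the price of ruling out a false match.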
In-Line Deduplication
• Data is deduplicated before it is actually stored
• Deduplication is performed as data flows into the secondary storage system
[Diagram labels: Backup, Deduplication, VTL, Primary Storage, Secondary Storage]
• Advantages
o Processes data once, eliminates additional post-processing tasks
• Disadvantages
o CPU intensive deduplication process can create performance bottleneck
o One process per I/O stream
Out-Band Deduplication (Post-Processing)
• Data is first stored and deduplicated in the background
[Diagram labels: Backup, Primary Storage, Secondary Storage, Deduplication, VTL, Secondary Storage]
• Advantages
o De-duplication CPU overhead no longer affects backup window
o Supports multiple I/O streams
o Potentially faster restore for first version (not deduplicated)
• Disadvantages
o Data is written, read and written, thus more I/O intensive
o Deduplication window must be coordinated with backup window as it typically takes longer than in-line processing
o Requires larger secondary storage because first version is not deduplicated
3 × Deduplication in the IBM Portfolio
[Diagram: three deduplication offerings by interface (Tape, File, LUN): ProtecTIER TS7650G; TSM R6 (TSM API); A-SIS on N series Gateway]
ProtecTIER Overview
ProtecTIER reduces the required backup disk capacity by up to 25 times!
IBM ProtecTIER Deduplication Innovation and Leadership
[Timeline, 2003 to 2011]
• 6 PhDs begin researching massively scalable deduplication algorithms
• First non-hash deduplication algorithm developed, designed for 100% data integrity
• First Deduplication Virtual Tape Library deployed into production
• First to deliver VTL solutions for both Open and Mainframe environments
• First single node system to store over 1PB of deduplicated data
• First true clustered system with Global Deduplication
• IBM acquires Diligent
• Fastest single node inline deduplication solution
• IBM’s first midrange solution released
• First Deduplication solution for System z
• Fastest restore speed: up to 2800 MB/sec!
• First to deliver Many-to-Many replication
• The only “true” enterprise-class deduplication solution on the market today
• Installed in all major industries
o Over 1,400 ProtecTIER systems sold to date
o Production systems range in size from 5TB to over 700TB
o Over 90 PB of physical disk capacity behind ProtecTIER servers in production protecting thousands of PBs of backup data
IBM’s Virtual Tape De-duplication SW Products
• ProtecTIER
ProtecTIER VT is a scalable and robust virtual tape solution that emulates
tape libraries, enabling existing backup applications to send data to the
ProtecTIER disk-based platform, rather than directly to tape.
• HyperFactor
HyperFactor is a revolutionary de-duplication solution which eliminates
redundant data, enabling customers to increase their effective capacity by
up to 25 times. ProtecTIER is powered by HyperFactor and can radically
reduce both physical disk capacity and total storage costs.
How ProtecTIER works
[Diagram: Backup Servers → New Data Stream → ProtecTIER™ Server (HyperFactor™, Memory Resident Index) → “Filtered” data → Repository]
• Only 4 GB needed to map 1 PB of physical disk!
• Backup with inline deduplication: up to 1400 MB/sec per server or 2000 MB/sec with 2 node cluster!
ProtecTIER Deduplication Operation and Results Example
• Backup application writes data to ProtecTIER as it would to tape
• Only unique data is stored, existing duplicate data is referenced
• When data objects expire, references are removed and free space is reclaimed and reused

Backup Event             Amount Received   Amount Stored   Dedupe Ratio
First Full Backup        1 TB              250 GB          4:1
Incremental Backup       100 GB            10 GB           4.2:1
Incremental Backup       100 GB            10 GB           4.4:1
Second Full Backup       1 TB              10 GB           7.8:1
Incremental Backup       100 GB            10 GB           8:1
Third Full Backup        1 TB              10 GB           11:1
After two months . . .   7.8 TB            350 GB          22:1
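The ratio column in the table above appears to be cumulative: total data received so far divided by total data stored so far. The short sketch below recomputes it from the per-event amounts, assuming decimal units (1 TB = 1000 GB). The function name `cumulative_ratios` is hypothetical.

```python
# (event, amount received in GB, amount stored in GB), taken from the table;
# assumption: 1 TB = 1000 GB.
events = [
    ("First Full Backup", 1000, 250),
    ("Incremental Backup", 100, 10),
    ("Incremental Backup", 100, 10),
    ("Second Full Backup", 1000, 10),
    ("Incremental Backup", 100, 10),
    ("Third Full Backup", 1000, 10),
]

def cumulative_ratios(events):
    """Return (event, total received, total stored, running ratio) per event."""
    received = stored = 0
    rows = []
    for name, got, kept in events:
        received += got
        stored += kept
        rows.append((name, received, stored, received / stored))
    return rows

for name, received, stored, ratio in cumulative_ratios(events):
    print(f"{name:<20} {received:>5} GB received {stored:>4} GB stored {ratio:5.2f}:1")

# The two-month summary row, computed the same way.
two_month_ratio = 7800 / 350
print(f"After two months: {two_month_ratio:.1f}:1")
```

The computed values track the slide to within rounding (for example 2200 GB / 280 GB is about 7.86, shown as 7.8:1). The ratio keeps rising because each repeated full backup adds a full terabyte to the numerator but only a few gigabytes to the denominator.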
Storage Impact from ProtecTIER Deduplication
[Diagram: Master Server and Backup Server feed the ProtecTIER Server; represented capacity compared with physical capacity]
Store up to 25 times more backup data on a given physical storage capacity
Significantly Reduces Replication Bandwidth
[Diagram: Primary Site (Backup Servers, ProtecTIER Gateway, represented and physical capacity) connected over an IP-based WAN link to the Secondary Site (Backup Server, ProtecTIER Gateway, physical capacity, Tape library)]
• Deduplication enables large amounts of data to be replicated with significantly less bandwidth
• Virtual cartridges can be cloned to tape at DR site
ProtecTIER Many-to-One Replication Overview
• Up to 12 Branch Offices (spokes): Gateways and/or Appliances
• 1 target (hub): Appliance, Gateway, single or two-node cluster
• IP based NR links
• Virtual cartridges can be cloned to tape by the Main Site B/U server
[Diagram: Central / DR Site with Backup Server, ProtecTIER Gateway, physical capacity and Tape library]
ProtecTIER Many-to-Many Native Replication Grid
• Up to 4 hubs in a grid
• Supports any combination of Gateways, Appliances, single or two-node clusters
[Diagram: Site A, Site B, Site C and Site D, each with Backup Server, ProtecTIER Gateway and physical capacity]
ProtecTIER Support for Symantec OpenStorage (OST)
• OST API separates the backup logic from the storage appliance logic and implementation
• IBM ProtecTIER: Backup storage appliance with Deduplication and Native Replication
[Diagram: NetBackup Server (NetBackup Policy and Control, OpenStorage API, ProtecTIER OST Plugin) connected to the ProtecTIER Server]
IBM ProtecTIER® Deduplication Family
Scalable Capacity and Performance
• TS7610 ProtecTIER Appliance Express
o Good Performance, Entry Level, Easy to Install
o Up to 100 MB/sec
o 4 TB and 5.4 TB Useable Capacity
• TS7650 ProtecTIER Appliances
o Better Performance, Larger Capacity, Scalable
o Up to 500 MB/sec
o 7 TB to 36 TB Useable Capacity
• TS7650G & TS7680 ProtecTIER Gateways
o Highest Performance, Largest Capacity, High Availability
o Backup: Up to 2000 MB/sec
o Restore: Up to 2800 MB/sec
o Up to 1 PB Useable Capacity
ProtecTIER Differentiation
ProtecTIER Advantage: Data Integrity
• Unique and patented HyperFactor® deduplication
technology
• The only production proven deduplication solution not
based on a hash algorithm
• Designed for 100% data integrity
• Bit for bit comparison of data to ensure data is a
duplicate
• Can NEVER lose data due to a hash collision
Although the chance of losing data from a hash collision is
low, it is NOT ZERO as it is with a ProtecTIER solution
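How low "low" is can be estimated with the standard birthday-bound approximation for an ideal hash. The sketch below is illustrative only: it says nothing about any specific vendor's design, and the chunk size and repository size used in the example are assumptions chosen for round numbers.

```python
import math

def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation for an ideal hash:
    p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1))."""
    exponent = -num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)   # expm1 keeps precision for tiny probabilities

# Assumption: 1 PB of unique data split into 8 KiB chunks.
chunks_in_1pb = 10**15 // 8192

for bits in (128, 160, 256):
    p = collision_probability(chunks_in_1pb, bits)
    print(f"{bits}-bit hash, {chunks_in_1pb:.2e} chunks: p ~ {p:.1e}")
```

Under these assumptions the probability is extremely small for any modern digest size, but it is strictly greater than zero. That is the distinction the slide draws against a bit-for-bit comparison.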
ProtecTIER Advantage: Restore Performance
• Restoring data from a ProtecTIER solution is even FASTER
than backing up
• ProtecTIER can easily restore at 2800MB/sec!
• High restore performance not limited to certain backup
applications or specific data sets like other vendors
• High restore performance achieved on real data with
realistic 20% change rate in production environments
• Never requires agents on backup servers
Other vendors’ “CPU-centric” architectures are optimized for processing hashes, not moving data
ProtecTIER Advantage: Scalability
• A single ProtecTIER system can support up to 1 Petabyte of
useable capacity
• ProtecTIER supports the use of any IBM storage system
(DS8000, DS5000, XIV, etc.) and most third party storage
systems for the repository
• IBM has hundreds of ProtecTIER systems with over 100TBs of
useable capacity in production environments throughout the
world
• IBM always states “Useable Capacity” and never uses the
deceptive “RAW capacity” terms like other vendors
The hidden costs associated with managing, maintaining, powering and cooling multiple appliances are significant and should not be ignored!
ProtecTIER Advantage: Global Deduplication
• ProtecTIER Cluster with true Global Deduplication has been
Generally Available and in production since 2008
• Supported with all major backup applications and available for all Open Systems, System z and System i platforms
• No agents or backup server upgrades required
• Other vendors’ Global Deduplication capabilities are immature and incomplete, with very few if any systems in production
• Other vendors’ Global Dedupe is restricted to certain models, works only with NetBackup OST and requires agents to be installed
Many vendors claim to have Global Deduplication but create multiple separate
repositories that may contain redundant data!
ProtecTIER Advantage: Inline Deduplication
Example: Disk activity needed to ingest and deduplicate 10 TBs of backup data
Post Process Approach: Deduplicate after Storing
10 TB Data → Hash-based Post Process: Write 10 TB, Read 10 TB (2x)
Requires: more storage, more I/Os, more time, more effort, more admin
ProtecTIER Inline Approach: Deduplicate before Storing
10 TB Data → HyperFactor: Read or Write 10 TB (1x)
Results: simple, faster, easier, cheaper, efficient
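The 1x versus 2x comparison above is simple arithmetic, sketched here together with the landing capacity each approach needs before space is reclaimed. This is a simplified model under stated assumptions: it counts only full-size passes over the data, as the slide does, and the 10:1 deduplication ratio is a hypothetical input. Function names are illustrative.

```python
def disk_activity_tb(ingest_tb: float, inline: bool) -> float:
    """Full-size passes over disk as counted on the slide:
    post-process writes everything and then reads it all back (2x),
    inline handles the stream once (1x)."""
    passes = 1 if inline else 2
    return passes * ingest_tb

def landing_capacity_tb(ingest_tb: float, dedupe_ratio: float, inline: bool) -> float:
    """Disk space needed to accept the backup before any space is reclaimed."""
    return ingest_tb / dedupe_ratio if inline else ingest_tb

ingest = 10.0   # TB of backup data, from the slide
ratio = 10.0    # assumption: hypothetical 10:1 deduplication ratio

for label, inline in (("post-process", False), ("inline", True)):
    print(f"{label:>12}: {disk_activity_tb(ingest, inline):4.0f} TB disk activity, "
          f"{landing_capacity_tb(ingest, ratio, inline):4.1f} TB landing capacity")
```

The model leaves out the write of the deduplicated result and any reads used for verification, so it should be read as a lower bound on real disk traffic for both approaches.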
ProtecTIER Advantage: Inline Deduplication
Inline Processing
[Timeline 8:00 PM, 2:00 AM, 8:00 AM, 8:00 PM: Backup Server, ProtecTIER VT with Dedupe, Tape Library, Truck; SLA is Met]
Post Processing
[Timeline 8:00 PM, 2:00 AM, 8:00 AM, 8:00 PM: Backup Server, VTL, Dedupe, Tape Library, Truck; Backup and Dedupe Overlap]
With an IBM ProtecTIER Solution you can . . .
• Store up to 25 times more data on disk
o Up to 25:1 reduction with 100% data integrity
• Reduce backup and restore times
o Fast inline deduplication up to 2000 MB/sec
o Even faster restores up to 2800 MB/sec
• Improve the reliability of backup operations
o Eliminates mechanical & handling failures
• Drive the cost of disk based backup down
o Reduces energy, cooling, and space required
• Increase data retention
o Store more backup data on disk for a longer time with very little additional cost
For More Information on IBM’s ProtecTIER
IBM Customers
The main ProtecTIER Web Page
www.ibm.com/systems/storage/tape/protectier
Trademarks and Disclaimers
© IBM Corporation 1994-2011. All rights reserved.
References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at
http://www.ibm.com/legal/copytrade.shtml.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Information is provided "AS IS" without warranty of any kind.
The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does
not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including
vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other
claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance,
function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to
communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user
will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration,
and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios
stated here.
Photographs shown may be engineering prototypes. Changes may be incorporated in production models.
Thank you for your attention