ZL UNIFIED ARCHIVE Global Single Instance Storage

ZL TECHNOLOGIES | White Paper
GLOBAL SINGLE INSTANCE STORAGE
ZL UA’s Global Single Instance Storage (Global SIS) technology is a multi-layered
de-duplication technology that enables ZL UA to achieve industry leading storage
reduction and TCO for e-mail and files.
Effective single instancing of e-mail and files is extremely challenging due to the
high volumes of unique, duplicate, and near-duplicate content that exist within an
enterprise. E-mail comprises the greatest volume of content in many organizations
and has unique challenges for SIS, including handling e-mail variation across
different users, different mail servers, PST and NSF archive files, departments,
capture mechanisms, and date ranges. File system files have their own issues
including versioning and SIS between attachments and stand-alone files for
retention and disposition purposes.
These challenges are great enough that even the largest software providers have
moved away from single instance storage to less effective compression-only
solutions. ZL UA offers industry-leading storage reduction rates by addressing
these challenges directly with advanced R&D.
This white paper explains how ZL UA performs the widest scale and most granular
hashing in the industry for true storage optimization, enforcing both message-level
and attachment-level Global SIS for up to 80% data reduction from the production
servers.
SIS Overview
Studies have shown that in large enterprises, each e-mail can be replicated an
average of 8 times throughout the corporate network. This is mainly due to large
distribution lists, “FYI” or courtesy carbon copies (CCs), and other mass mailings
within firms. In addition to the e-mail messages themselves, attachments make up
a large portion of the storage requirements and are often duplicated across many
e-mails and file systems.
The basic idea behind SIS is to compute a digest or hash of the content of the
item being single instanced (e.g. e-mail, attachment, file system file, etc.) and then
look up that hash value in a SIS lookup table. If a match is found, the item already
exists in the vault, and an additional pointer is added to the existing content. If no
match is found, the item does not yet exist within the archive, so the new content
is written to the vault, and a row is added to the SIS table.
For each item, the database will keep a running reference count for the number
of copies of the content that exist, and the items will be purged only after the
reference count reaches 0.
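
The lookup-and-reference-count flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not ZL UA's implementation: the class name, the dictionaries standing in for the vault and the SIS table, and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory SIS vault: one stored copy per distinct content hash."""

    def __init__(self):
        self.vault = {}       # digest -> content bytes (stands in for the vault)
        self.ref_counts = {}  # digest -> reference count (stands in for the SIS table)

    def add(self, content: bytes) -> str:
        # SHA-256 is an assumption; the paper does not name the hash function.
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.ref_counts:
            # Match found: keep the existing content, add one more reference.
            self.ref_counts[digest] += 1
        else:
            # No match: write the new content and add a row to the table.
            self.vault[digest] = content
            self.ref_counts[digest] = 1
        return digest

    def release(self, digest: str) -> None:
        # Purge the content only when the last reference is gone.
        self.ref_counts[digest] -= 1
        if self.ref_counts[digest] == 0:
            del self.ref_counts[digest]
            del self.vault[digest]


store = SingleInstanceStore()
key_a = store.add(b"quarterly report")
key_b = store.add(b"quarterly report")  # duplicate: no second copy is written
```

After the two `add` calls the vault holds a single copy with a reference count of 2; the content is removed only after `release` has been called twice.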
Global vs. Siloed and Hybrid SIS
A key component of SIS scalability is the size of the SIS lookup table. The most
effective Global SIS uses a single large lookup table for speed and consistency.
This is the approach used by ZL UA, and it supports billions of items. ZL UA can
support this across an enterprise because it is built on an elastic computing grid
architecture paired with a very large database architecture.
The other technique is to use multiple siloed SIS lookup tables. Siloed lookup
tables work well when the content can be effectively segregated or partitioned;
however, this has proved extremely challenging, and organizations supporting this
model have reported little to no SIS storage savings for large deployments. The
ineffectiveness of this approach has even led to some organizations removing SIS
entirely in favor of less effective compression-only approaches.
Finally, some organizations that use Siloed SIS have attempted to compensate
for decreasing returns by coordinating many siloed databases. This is a new
introduction and requires complicated coordination between the separate
databases, which can lead to performance and data loss problems.
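
To illustrate why siloed lookup tables lose savings, the sketch below counts how many copies would be stored under each model for the same set of items. The silo names, the sample items, and the use of SHA-256 are assumptions made for the example; it does not model ZL UA's grid or database architecture.

```python
import hashlib


def stored_copies_global(items):
    """Count stored copies when every item shares one lookup table."""
    return len({hashlib.sha256(content).hexdigest() for _, content in items})


def stored_copies_siloed(items):
    """Count stored copies when each silo keeps its own lookup table."""
    tables = {}
    for silo, content in items:
        tables.setdefault(silo, set()).add(hashlib.sha256(content).hexdigest())
    return sum(len(table) for table in tables.values())


# (silo, content) pairs; the same memo lands on three hypothetical servers.
items = [
    ("server-1", b"all-hands memo"),
    ("server-2", b"all-hands memo"),
    ("server-3", b"all-hands memo"),
    ("server-1", b"budget draft"),
]
global_copies = stored_copies_global(items)
siloed_copies = stored_copies_siloed(items)
```

With one global table the memo is stored once, so two copies are kept in total. With per-server tables the memo is stored once per server, so four copies are kept, because duplicates that cross a silo boundary are never detected.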
ZL UA supports Global SIS, as it is the most effective approach from both storage
effectiveness and reliability perspectives.
SIS Matching Techniques
ZL UA offers several levels of SIS that, when taken together, achieve
industry-leading levels of storage savings.
Exact Duplicate Elimination
In some instances, many exact duplicates can exist throughout an organization.
One area this occurs is with some e-mail journaling systems that create many
exact copies on many different servers. The archiving and processing of these files
can require significant resources, so ZL processes these and eliminates them as
early as possible to lower operational and storage costs.
Basic hash validation of file content is popular for simple exact duplicate
identification and file integrity validation. For archiving purposes, however,
differences in metadata outside of the file itself, such as department information
derived from an e-mail’s sender and recipients, can require separate tracking for
retention and disposition policy purposes. In these cases, the duplicate will be
archived under the separate policies and single instanced using message SIS.
ZL UA’s Exact Duplicate Elimination handles both cases: those where exact duplicates
can be discarded, and those where duplicates need to be archived to ensure
compliance with both internal and external policies.
Message SIS for Near Duplicates
E-mail is unique in that mail servers and clients alter e-mail during the e-mail flow.
As a result, a single e-mail can produce many near duplicates: copies that are
nearly identical but differ due to mail server handling and transformations.
Effective handling of these near
duplicates is essential for SIS due to the high volumes of e-mail generated by
many organizations.
To handle near duplicates, ZL uses a flexible matching technique where ZL
first normalizes comparison values before performing the hash calculations.
Normalizing the values includes lower-casing all text, eliminating extra spaces,
eliminating html tags, etc. ZL also normalizes the date fields to GMT to account
for time zone variations. Inconsequential headers (e.g., spam checking or antivirus
checking) are ignored for the hash computation as they provide little value in an
organization’s archival requirements.
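
The normalization steps above might look like the following in Python. This is a rough sketch under stated assumptions: the function names and the list of ignored header prefixes are hypothetical, the tag stripping is a crude regular expression rather than a real HTML parser, and date headers are assumed to carry an explicit UTC offset.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical examples of "inconsequential" headers; the paper names only
# spam-checking and antivirus-checking headers in general terms.
IGNORED_HEADER_PREFIXES = ("x-spam-", "x-virus-")


def normalize_text(text: str) -> str:
    """Strip HTML tags, lower-case, and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)  # crude tag removal
    return " ".join(without_tags.lower().split())


def normalize_date(date_header: str) -> str:
    """Convert an RFC 5322 style date header to a GMT/UTC timestamp string."""
    parsed = parsedate_to_datetime(date_header)
    return parsed.astimezone(timezone.utc).isoformat()


def significant_headers(headers: dict) -> dict:
    """Drop ignored headers and normalize the names and values that remain."""
    return {
        name.lower(): normalize_text(value)
        for name, value in headers.items()
        if not name.lower().startswith(IGNORED_HEADER_PREFIXES)
    }
```

Under this sketch, `Mon, 06 Jan 2020 09:00:00 -0500` and `Mon, 06 Jan 2020 14:00:00 +0000` normalize to the same timestamp, so two copies of a message that differ only in how the mail server expressed the time zone would hash identically.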
ZL performs a number of hashes across different fields of the e-mail:
1. Body digest - a hash value taken from the body of an e-mail.
2. Attachment digest - a hash value taken from all attachments of an e-mail.
3. Basic Header digest - a hash value taken from various header fields.
4. Envelope Header digest - a hash value taken from the sender and recipient fields.
Hash values for all four must be identical for Message SIS to occur.
For greater matching flexibility, ZL offers two types of SIS: aggressive and
non-aggressive. In non-aggressive SIS, the date field is included in the basic header
digest. In aggressive SIS, the date field is not included in the basic header hash
calculation, and any message that is subsequently Single Instanced will be stored
separately in the database.
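
A four-part message key with an aggressive switch could be sketched as follows. Everything named here is an assumption for illustration: the message is a plain dictionary whose values are taken to be already normalized, the basic header digest uses only the subject (plus the date in non-aggressive mode) because the paper does not list the exact header fields, recipients are sorted before hashing, and SHA-256 stands in for the unnamed hash function.

```python
import hashlib
from typing import NamedTuple


class MessageKey(NamedTuple):
    body: str
    attachments: str
    basic_header: str
    envelope_header: str


def _digest(*parts: bytes) -> str:
    hasher = hashlib.sha256()
    for part in parts:
        # Length prefix keeps field boundaries unambiguous across parts.
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.hexdigest()


def message_key(message: dict, aggressive: bool = False) -> MessageKey:
    """Build the four digests; two messages match only if all four are equal."""
    basic_fields = [message["subject"]]  # hypothetical choice of header fields
    if not aggressive:
        basic_fields.append(message["date"])  # date counts only when non-aggressive
    envelope_fields = [message["sender"], *sorted(message["recipients"])]
    return MessageKey(
        body=_digest(message["body"].encode()),
        attachments=_digest(*message["attachments"]),
        basic_header=_digest(*(field.encode() for field in basic_fields)),
        envelope_header=_digest(*(field.encode() for field in envelope_fields)),
    )


original = {
    "subject": "status",
    "date": "2020-01-06T14:00:00+00:00",
    "sender": "a@example.com",
    "recipients": ["b@example.com", "c@example.com"],
    "body": "all good",
    "attachments": [b"\x00\x01"],
}
redelivered = dict(original, date="2020-01-06T14:00:05+00:00")
```

In non-aggressive mode the two sample messages differ in their basic header digest, so they are not single instanced. In aggressive mode the date is left out and all four digests match.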
After a successful SIS match, non-duplicate information that needs to be recorded
can be stored using ZL UA’s Delta Fields technology that allows certain information
to be stored outside of the ZL UA Vault and added back to the item when it is
retrieved.
Attachment SIS across E-mail Messages and Stand-Alone Files
While e-mail messages are largely responsible for the growth in content that
must be archived and placed under retention management policies at firms,
attachments are largely responsible for the extreme growth in average mail size.
Common file types including PowerPoint presentations, PDFs, and other large file
types are increasingly sent by e-mail and can easily push an e-mail into the
multi-megabyte range. Additionally, attachments often exist as stand-alone documents
on various file systems, saved by senders or recipients who want to read
and edit the files outside of the e-mail message. These factors combine to make
attachment single instancing a dramatic storage optimization mechanism across
e-mail and stand-alone files.
When messages are ingested by ZL, they are parsed for attachments. To optimize
single instance storage, ZL allows organizations to configure the minimum size of
an attachment upon which to perform attachment SIS. This eliminates SIS for very
small attachments as more storage could be used to SIS the attachment than to
store it with the e-mail. All attachments beyond this size will be extracted from the
message and will be stored separately.
The extracted attachments are then subject to single instancing. The attachments
are extracted in raw binary, and then the hash is computed for each. Similar to the
process for Message SIS, ZL looks up the hash in the Attachment SIS table. If the table
contains the hash, the attachment is already in the archive, and a pointer is added.
Otherwise, the attachment is stored in raw binary form, and a record is entered
into the Attachment SIS table.
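
The attachment flow, including the configurable minimum size, can be sketched as below. The threshold value, the function name, and the dictionaries standing in for the Attachment SIS table and the vault are assumptions for illustration, as is the use of SHA-256 over the raw bytes.

```python
import hashlib

MIN_ATTACHMENT_SIS_BYTES = 1024  # hypothetical threshold; configurable per the paper


def ingest_message(attachments, attachment_table, vault):
    """Split attachments into those kept inline and those single instanced.

    Returns (inline, pointers): raw bytes kept with the message, and digests
    pointing at attachments stored separately in the vault.
    """
    inline, pointers = [], []
    for raw in attachments:
        if len(raw) < MIN_ATTACHMENT_SIS_BYTES:
            # Too small to be worth a table row: keep it with the e-mail.
            inline.append(raw)
            continue
        digest = hashlib.sha256(raw).hexdigest()  # hash of the raw binary
        if digest not in attachment_table:
            vault[digest] = raw           # first copy: store the raw bytes
            attachment_table[digest] = 0  # and add a table record
        attachment_table[digest] += 1     # one more message points at it
        pointers.append(digest)
    return inline, pointers


table, vault = {}, {}
deck = b"x" * 4096  # large attachment sent in two different messages
logo = b"y" * 100   # small attachment below the threshold
first = ingest_message([deck, logo], table, vault)
second = ingest_message([deck], table, vault)
```

After both messages are ingested the vault holds one copy of the large attachment with two references, while the small one stays inline with its message.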
File System SIS Considerations
As organizations look to use their archives to manage file system files for
retention, disposition, e-Discovery, and legal hold purposes, the special
requirements for file systems must be taken into account. These requirements
include file names, file locations, modification dates and access control. For
example, the same file is often saved with different file names by different users.
ZL UA File System Archiving transparently handles this metadata by discounting it
for SIS purposes to achieve storage savings while including the information in the
archive so no information is lost.
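
One way to picture this is to hash only the file content while recording the metadata in a separate catalog. The sketch below is an assumption-laden illustration: the function name, the catalog layout, the sample paths, and the SHA-256 hash are not taken from ZL UA.

```python
import hashlib


def archive_file(path, modified, content, content_table, catalog):
    """Store content once per digest; always record the file's metadata."""
    digest = hashlib.sha256(content).hexdigest()  # metadata is not hashed
    content_table.setdefault(digest, content)     # stored only if new
    catalog.append({"path": path, "modified": modified, "digest": digest})
    return digest


content_table, catalog = {}, []
report = b"final numbers"
archive_file("/home/alice/report.docx", "2020-01-06", report, content_table, catalog)
archive_file("/home/bob/report-copy.docx", "2020-01-09", report, content_table, catalog)
```

Both files share one stored copy of the content, yet each keeps its own name, location, and modification date in the catalog, so no metadata is lost.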
ZL UA File System Archiving also supports file versioning so multiple versions of
the same file can be captured, searched, and browsed for content management
purposes.
Conclusion
As organizations manage their growing amounts of e-mail and other
unstructured content, they must contend with duplicate content that exists at
many levels and in many forms. ZL UA effectively and transparently manages
single instancing at a very granular level to ensure highly accurate, reliable, and
effective single instancing of all files, covering exact duplicates, near duplicates,
attachments, and versioning across the organization, supporting up to 80%
SIS reduction across billions of items. Configuration options, including ZL UA’s
delta fields capability, allow organizations to transparently fine-tune their single
instancing for more or less aggressive savings depending on the data they wish to
preserve. ZL’s unified, fully integrated, highly scalable architecture allows Global
SIS without imposing penalties in database storage, processing throughput, or
accountability. Global SIS is a key factor in making ZL the fastest, most scalable,
and lowest TCO archiving solution on the market.