ZL UNIFIED ARCHIVE
Global Single Instance Storage
ZL TECHNOLOGIES | White Paper

ZL UA’s Global Single Instance Storage (Global SIS) technology is a multi-layered de-duplication technology that enables ZL UA to achieve industry-leading storage reduction and total cost of ownership (TCO) for e-mail and files.

Effective single instancing of e-mail and files is extremely challenging due to the high volumes of unique, duplicate, and near-duplicate content that exist within an enterprise. E-mail comprises the greatest volume of content in many organizations and poses unique challenges for SIS, including handling variation in e-mail across different users, mail servers, PST and NSF archive files, departments, capture mechanisms, and date ranges. File system files have their own issues, including versioning and single instancing between attachments and stand-alone files for retention and disposition purposes. These challenges are great enough that even the largest software providers have moved away from single instance storage to less effective compression-only solutions.

ZL UA offers industry-leading storage reduction rates by addressing these challenges directly with advanced R&D. This white paper explains how ZL UA performs the widest-scale and most granular hashing in the industry for true storage optimization, enforcing both message-level and attachment-level Global SIS for up to 80% data reduction compared with the production servers.

SIS Overview

Studies have shown that in large enterprises, each e-mail can be replicated 8 times on average throughout the corporate network. This is mainly due to large distribution lists, “FYI” or courtesy carbon copies (CCs), and other mass mailings within firms. In addition to the e-mail messages themselves, attachments comprise a large portion of the storage requirements and are often duplicated across many e-mails and file systems.
The basic idea behind SIS is to compute a digest, or hash, of the content of the item being single instanced (e.g., an e-mail, attachment, or file system file) and then look up that hash value in a SIS lookup table. If a match is found, the item already exists in the vault, and an additional pointer is added to the existing content. If no match is found, the item does not yet exist within the archive, so the new content is written to the vault and a row is added to the SIS table. The database keeps a running reference count of the number of copies of each item's content, and the content is purged only after the reference count reaches 0.

Global vs. Siloed and Hybrid SIS

A key component of SIS scalability is the size of the SIS lookup table. The most effective approach, Global SIS, uses a single large lookup table for speed and consistency. This is the approach used by ZL UA, and it supports billions of items. ZL UA can support this across an enterprise because it is built on an elastic computing grid architecture paired with a very large database architecture.

The other technique is to use multiple siloed SIS lookup tables. Siloed lookup tables work well when the content can be effectively segregated or partitioned; however, this has proved extremely challenging, and organizations supporting this model have reported little to no SIS storage savings for large deployments. The ineffectiveness of this approach has even led some organizations to remove SIS entirely in favor of less effective compression-only approaches.

Finally, some organizations that use Siloed SIS have attempted to compensate for diminishing returns by coordinating many siloed databases. This hybrid approach is a recent introduction and requires complicated coordination between the separate databases, which can lead to performance and data loss problems.
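The lookup-and-reference-count scheme from the SIS overview, and the difference between one global table and several siloed tables, can be illustrated with a minimal sketch. It is not ZL UA's implementation: the `SisStore` class, the use of SHA-256 as the digest, the in-memory dictionaries, and the department names are all assumptions made for illustration.

```python
import hashlib


class SisStore:
    """Toy single-instance store (hypothetical): one lookup table keyed by content hash."""

    def __init__(self):
        self.vault = {}     # digest -> content, written only once per unique item
        self.refcount = {}  # digest -> number of archived items pointing at the content

    def add(self, content: bytes) -> str:
        # SHA-256 is an assumption; the white paper does not name the hash function.
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.refcount:
            self.refcount[digest] += 1      # match: add another pointer only
        else:
            self.vault[digest] = content    # no match: write the new content
            self.refcount[digest] = 1
        return digest

    def remove(self, digest: str) -> None:
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:      # purge only when nothing references it
            del self.refcount[digest]
            del self.vault[digest]

    def stored_bytes(self) -> int:
        return sum(len(content) for content in self.vault.values())


memo = b"Quarterly results attached. " * 100
departments = ["sales", "legal", "finance", "engineering"]

# Global SIS: every department shares one lookup table.
global_store = SisStore()
for _ in departments:
    memo_digest = global_store.add(memo)

# Siloed SIS: one lookup table per department, so each silo keeps its own copy.
silos = {name: SisStore() for name in departments}
for store in silos.values():
    store.add(memo)

global_bytes = global_store.stored_bytes()
siloed_bytes = sum(store.stored_bytes() for store in silos.values())
print(global_bytes, siloed_bytes)
```

In this toy setup the global table stores the memo once with a reference count of four, while the four silos each store their own copy, which is the loss of savings described above when content cannot be cleanly partitioned.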
ZL UA supports Global SIS, as it is the most effective approach in terms of both storage efficiency and reliability.

SIS Matching Techniques

ZL UA offers several levels of SIS that, when taken together, achieve industry-leading levels of storage savings.

Exact Duplicate Elimination

In some instances, many exact duplicates can exist throughout an organization. One place this occurs is with e-mail journaling systems that create many exact copies on many different servers. Archiving and processing these files can require significant resources, so ZL identifies and eliminates them as early as possible to lower operational and storage costs.

Basic hash validation of file content is popular for simple exact duplicate identification and file integrity validation. For archiving purposes, however, differences in metadata outside of the file itself, such as department information derived from an e-mail’s sender and recipients, can require separate tracking for retention and disposition policy purposes. In these cases, the duplicate is archived under the separate policies and single instanced using Message SIS.

ZL UA’s Exact Duplicate Elimination handles both the case where exact duplicates can be discarded and the case where duplicates need to be archived to ensure compliance with internal and external policies.

Message SIS for Near Duplicates

E-mail is unique in that mail servers and clients alter messages during the e-mail flow. As a result, a single e-mail often yields many near duplicates: copies that are nearly identical but differ due to mail server handling and transformations. Effective handling of these near duplicates is essential for SIS due to the high volumes of e-mail generated by many organizations.
To handle near duplicates, ZL uses a flexible matching technique in which it first normalizes comparison values before performing the hash calculations. Normalization includes lower-casing all text, eliminating extra spaces, eliminating HTML tags, etc. ZL also normalizes the date fields to GMT to account for time zone variations. Inconsequential headers (e.g., spam checking or antivirus checking) are ignored for the hash computation, as they provide little value for an organization’s archival requirements.

ZL performs a number of hashes across different fields of the e-mail:

1. Body digest: a hash value taken from the body of an e-mail.
2. Attachment digest: a hash value taken from all attachments of an e-mail.
3. Basic Header digest: a hash value taken from various header fields.
4. Envelope Header digest: a hash value taken from the sender and recipient fields.

Hash values for all four must be identical for Message SIS to occur.

For greater matching flexibility, ZL offers two types of SIS: aggressive and non-aggressive. In non-aggressive SIS, the date field is included in the basic header digest. In aggressive SIS, the date field is not included in the basic header hash calculation, and any message that is subsequently single instanced will be stored separately in the database.

After a successful SIS match, non-duplicate information that needs to be recorded can be stored using ZL UA’s Delta Fields technology, which allows certain information to be stored outside of the ZL UA Vault and added back to the item when it is retrieved.

Attachment SIS across E-mail Messages and Stand-Alone Files

While e-mail messages are largely responsible for the growth in content that must be archived and placed under retention management policies at firms, attachments are largely responsible for the extreme growth in average mail size.
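Looking back at Message SIS, the normalize-then-hash comparison and the aggressive/non-aggressive switch can be sketched as follows. This is a minimal illustration under stated assumptions, not ZL UA's code: the message is modeled as a plain dictionary, SHA-256 stands in for the unspecified hash, the basic header digest is assumed to cover only the subject and date, and recipients are assumed to be order-insensitive. The function names are hypothetical.

```python
import hashlib
import re
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_text(text: str) -> str:
    """Strip HTML tags, lower-case, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.lower().split())


def normalize_date(value: str) -> str:
    """Convert an RFC 5322 date string (with an explicit offset) to GMT/UTC."""
    return parsedate_to_datetime(value).astimezone(timezone.utc).isoformat()


def digest(*parts: bytes) -> str:
    """Hash several fields together; length prefixes keep field boundaries unambiguous."""
    h = hashlib.sha256()  # assumption: the white paper does not name the hash
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def message_digests(msg: dict, aggressive: bool = False) -> tuple:
    """Return (body, attachment, basic header, envelope header) digests."""
    body = digest(normalize_text(msg["body"]).encode())
    attachments = digest(*msg["attachments"])
    # Assumption: "various header fields" is modeled as subject plus date.
    header_fields = [normalize_text(msg["subject"])]
    if not aggressive:  # non-aggressive SIS includes the date field
        header_fields.append(normalize_date(msg["date"]))
    basic_header = digest(*(field.encode() for field in header_fields))
    envelope = digest(
        normalize_text(msg["sender"]).encode(),
        *(normalize_text(r).encode() for r in sorted(msg["recipients"])),
    )
    return (body, attachments, basic_header, envelope)


original = {
    "subject": "Q3 Review", "sender": "Ann@Example.com",
    "recipients": ["bob@example.com", "cy@example.com"],
    "date": "Mon, 01 Jan 2024 09:00:00 -0500",
    "body": "<p>Hello   World</p>", "attachments": [b"deck"],
}
# Same message after passing through another server: plain text, GMT date.
near_copy = dict(original, body="hello world",
                 date="Mon, 01 Jan 2024 14:00:00 +0000")
# Same content, but stamped one hour later.
later_copy = dict(original, date="Mon, 01 Jan 2024 10:00:00 -0500")

print(message_digests(original) == message_digests(near_copy))
```

All four digests must match for two messages to be single instanced. Here `original` and `near_copy` match in either mode, while `later_copy` matches `original` only when `aggressive=True`, since that mode leaves the date out of the basic header digest.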
Common file types, including PowerPoint presentations, PDFs, and other large files, are increasingly sent by e-mail and can easily push an e-mail into the multi-megabyte range. Additionally, attachments often exist as stand-alone documents on various file systems, saved by either senders or recipients so that they can read and edit the files outside of the e-mail message. These factors combine to make attachment single instancing a dramatic storage optimization mechanism across e-mail and stand-alone files.

When messages are ingested by ZL, they are parsed for attachments. To optimize single instance storage, ZL allows organizations to configure the minimum attachment size at which attachment SIS is performed. This avoids SIS for very small attachments, where single instancing the attachment could use more storage than storing it with the e-mail. All attachments at or above this size are extracted from the message and stored separately. The extracted attachments are then subject to single instancing.

The attachments are extracted in raw binary, and a hash is computed for each. Similar to the process for Message SIS, ZL looks up the hash in the Attachment SIS table. If the table contains the hash, the attachment is already in the archive, and a pointer is added. Otherwise, the attachment is stored in raw binary form, and a record is entered into the Attachment SIS table.

File System SIS Considerations

As organizations look to use their archives to manage file system files for retention, disposition, e-Discovery, and legal hold purposes, the special requirements of file systems must be taken into account. These requirements include file names, file locations, modification dates, and access control. For example, the same file is often saved with different file names by different users.
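The attachment flow described above (size threshold, raw-binary hash, table lookup) also shows why differing file names need not defeat single instancing. The sketch below is illustrative only: the `AttachmentSis` class, the 1024-byte threshold, the SHA-256 digest, and the source and file names are assumptions, not ZL UA settings or APIs.

```python
import hashlib

MIN_ATTACHMENT_BYTES = 1024  # hypothetical value for the configurable minimum size


class AttachmentSis:
    """Toy Attachment SIS table (hypothetical) keyed by the hash of the raw bytes."""

    def __init__(self, min_bytes: int = MIN_ATTACHMENT_BYTES):
        self.min_bytes = min_bytes
        self.vault = {}     # digest -> raw binary content, stored once
        self.pointers = {}  # digest -> list of (source, file name) references

    def ingest(self, source: str, name: str, data: bytes):
        """Return the digest if the item was single instanced, else None."""
        if len(data) < self.min_bytes:
            return None  # too small: the caller keeps it with the e-mail
        # Only the raw content is hashed; the file name is not part of the digest.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.vault:
            self.vault[digest] = data       # first copy: store the content
            self.pointers[digest] = []
        self.pointers[digest].append((source, name))  # every copy: add a pointer
        return digest


sis = AttachmentSis()
deck = b"\x00" * 5000         # stands in for a large presentation
logo = b"\x01" * 200          # stands in for a small signature image

from_mail = sis.ingest("message-1", "Q3.pptx", deck)
from_share = sis.ingest("file-share", "Q3-final-v2.pptx", deck)
small = sis.ingest("message-1", "logo.png", logo)

print(from_mail == from_share, len(sis.vault), small)
```

Because the digest is taken over the raw bytes alone, the e-mail attachment and the renamed stand-alone copy resolve to the same stored content, while each name is still recorded alongside its pointer.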
ZL UA File System Archiving transparently handles this metadata by discounting it for SIS purposes to achieve storage savings, while still including the information in the archive so that nothing is lost. ZL UA File System Archiving also supports file versioning, so multiple versions of the same file can be captured, searched, and browsed for content management purposes.

Conclusion

As organizations manage their growing volumes of e-mail and other unstructured content, they must contend with duplicate content that exists at many levels and in many forms. ZL UA effectively and transparently manages single instancing at a very granular level to ensure highly accurate, reliable, and effective single instancing of all files, covering exact duplicates, near duplicates, attachments, and versioning across the organization, and supporting up to 80% SIS reduction across billions of items. Configuration options, including ZL UA’s Delta Fields capability, allow organizations to transparently fine-tune their single instancing for more or less aggressive savings, depending on the data they wish to preserve.

ZL’s unified, fully integrated, highly scalable architecture allows Global SIS without imposing penalties in database storage, processing throughput, or accountability. Global SIS is a key factor in making ZL the fastest, most scalable, and lowest-TCO archiving solution on the market.