Performance Metrics and Recommendations

Version 8.2 | March 4, 2016

For the most recent version of this document, visit our documentation website.

Table of Contents

1 Performance metrics and recommendations
2 Performance baselines - EDRM data set
3 Assisted Review performance
   3.0.1 Sample set creation results
   3.0.2 Saving categorization results
4 dtSearch performance
   4.0.1 Results summary
5 Export performance
   5.0.1 Results summary
   5.0.2 Export fields
6 Imaging performance
   6.0.1 Results summary
7 Import performance
   7.0.1 Results summary
   7.0.2 Import fields
8 OCR performance
   8.0.1 Results summary
9 Processing performance
   9.1 Inventory performance
   9.2 Processing performance
   9.3 End to end performance
10 Production and branding performance
   10.0.1 Results summary
11 Relativity Analytics performance
   11.0.1 Results summary
12 Structured Analytics performance
   12.0.1 Results summary

1 Performance metrics and recommendations

This document provides metrics and recommendations for all Relativity functions that saw a performance increase in Relativity 8.2, including:

- Assisted Review
- dtSearch
- Export
- Imaging
- Import
- OCR
- Processing
- Production/branding
- Relativity Analytics
- Structured Analytics

2 Performance baselines - EDRM data set

This section provides baselines for certain Relativity functions, as performed on the EDRM data set.

- Export
  - The export consisted of both natives and full text.
  - A total of 1,104,990 records were exported.
  - The average file size of exported documents was 0.11 MB.
- Import - Native/Full Text
  - The load file contained 50 fields of metadata.
  - The load file contained pointers to the extracted text of the documents.
  - The total file size of all natives loaded equaled 118.10 GB.
  - The average file size of imported documents was 0.11 MB.
- Basic Imaging - TIFF on-the-fly
  - The average number of images per document equaled 1.35.
  - The average size of images equaled 0.03 MB.
  - A total of 135,100 images were created (100,000 records).
- Native Imaging
  - The average number of images per document equaled 1.25.
  - The average size of images equaled 0.03 MB.
  - A total of 124,586 images were created (100,000 records).
- OCR
  - The average number of images per document equaled 4.1.
  - The Accuracy Level was set to 'Medium' in the OCR Profile.
  - A total of 233,680 images were OCR'd (100,000 records).
- Production/branding
  - The productions consisted of images only.
  - A total of 100,000 records were produced.
  - A total of 353,326 images were produced.
  - Each document was branded with 2 endorsements.
- Relativity Analytics - Indexing
  - A total of 1,104,990 records were indexed (Search).
  - The total size of Full Text fields was 8.03 GB.
- dtSearch - Indexing
  - A total of 1,104,990 records were indexed.
  - The total size of Full Text fields was 8.03 GB.
- Categorization
  - A total of 1,104,990 records were categorized.
  - A total of 5 categories were created with 5,000 examples.
  - The Number of Dimensions was 100.
  - The total size of the Extracted Text of the examples was 32.87 MB.
- Structured Analytics
  - A total of 1,104,990 records were used for these tests.
  - Export batch size was set to the default value of 1,000 documents.
  - Import batch size was set to the default value of 2,000 documents.
- Processing
  - The data set consisted of the raw EDRM v2 data set, which contains 90 PST files contained in zip files.
  - Ingestion was performed with 4 worker servers.
  - Publish was performed using 4 threads.
  - A total of 1,104,949 records were discovered and published to Relativity.

The following table provides a summary of the performance of several common operations in Relativity. For details, see the section for each piece of functionality. All results are for the EDRM Enron v2 data set, which consists of 1.1 million documents.

Functionality          Operation                                            Rate
Assisted Review        Round start - statistical sampling (1.1M documents)  60 seconds
                       Round start - stratified sampling (1.1M documents)   17 minutes
                       Finalization - all documents (1.1M documents)        1,700,000 records/hr
dtSearch               Index builds (1 worker agent)                        1,400,000 records/hr
                       Search Terms Reports (4 search threads)              61,000 terms/hr
Export                 Natives + metadata                                   404,000 records/hr
                       Images + metadata                                    399,000 images/hr
Import                 Natives + metadata (no copy to repository)           383,000 records/hr
                       Natives + metadata (copy to repository)              279,000 records/hr
                       Image append (no copy to repository)                 95,400 records/hr
Imaging                Basic imaging - TIFF (8 worker agents)               20,500 images/hr
                       Basic imaging - JPEG (8 worker agents)               19,800 images/hr
                       Native imaging (2 worker servers)                    43,000 images/hr
OCR                    Low accuracy (8 worker agents)                       15,000 images/hr
                       Medium accuracy (8 worker agents)                    7,800 images/hr
                       High accuracy (8 worker agents)                      6,600 images/hr
Processing             Inventory (4 worker machines)                        221 GB/hr
                       Ingestion (4 worker machines)                        514,000 records/hr
                       Publish                                              358,000 records/hr
                       Total rate (4 worker machines)                       210,000 records/hr
Production/branding    8 worker agents                                      141,000 images/hr
Relativity Analytics   Indexing                                             1,670,000 records/hr
                       Categorization (5,000 examples)                      1,090,000 records/hr
                       Clustering                                           1,100,000 records/hr
Structured Analytics   Email threading                                      1,000,000 records/hr
                       Near duplicate identification                        940,000 records/hr
                       Repeated content identification                      1,450,000 records/hr
                       Language identification                              820,000 records/hr
                       All operations                                       520,000 records/hr

The following table provides the Relativity environment server specifications:

Server              Memory (GB)   Cores
SQL*                128           32
Web                 4             4
Agent               4             4
File                4             4
Search/CA           64            4
Loading             4             4
Processing Worker   16            8 physical / 16 logical

* The SQL Server is a physical machine. All other servers are VMs.

The SQL Server runs on the following hardware:

- Dell PowerEdge R720
- 2 x Intel E5-2650 2GHz 8-core processors
- 8 x 16GB (128GB) 1600MHz RDIMM memory
- 2 x Intel i350 quad-port Gb network adapters
- PERC H710P RAID controller

The virtual machines are hosted on the following hardware:

- Dell PowerEdge R820
  - 4 x Intel E5-4620 2.2GHz 8-core processors
  - 8 x 16GB (128GB) 1600MHz RDIMM memory
  - Intel i350 quad-port Gb network adapter
  - Brocade 825 dual-port 8Gb Fibre Channel host bus adapter
  - PERC H710P RAID controller
- Dell PowerEdge R420
  - 2 x Intel E5-2420 1.9GHz 6-core processors
  - 2 x 8GB (16GB) 1600MHz RDIMM memory
  - Intel i350 quad-port Gb network adapter

3 Assisted Review performance

All sample sets were created in a workspace with no prior Assisted Review projects or rounds, using the same 1,104,190-document set with no prior coding. Statistical samples were created with a 95% confidence level and a 2.5% margin of error; a back-of-the-envelope check of the resulting sample size appears below.
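The 1,535-document statistical sample reported in the results below is consistent with the standard sample-size calculation. The following sketch is an illustration only, assuming Cochran's formula with a finite-population correction; it is not necessarily the exact formula Relativity uses:

    import math

    def sample_size(population, z=1.96, margin=0.025, p=0.5):
        """Cochran's formula with a finite-population correction (illustrative)."""
        n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population size
        return math.ceil(n0 / (1 + (n0 - 1) / population))  # correct for the finite universe

    print(sample_size(1_104_190))  # -> 1535, matching the statistical sample set below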
Categorization counts were artificially inflated in the first round by editing the categorization set associated with the round. All tests were performed with eight Assisted Review worker agents, and times were measured through Assisted Review audits.

Note: Stratified sampling is data dependent. Increasing the size of the data set, the number of files contained in the data set, or the variance between files in the data set will have a significant performance impact.

3.0.1 Sample set creation results

The following table provides a breakdown of sample set creation performance:

Test case name                            Sample size   Time (mm:ss)   Docs/second (universe)   Docs/second (sample set)
Sample set creation - statistical         1,535         01:00          18,403                   26
Sample set creation - fixed size          50,000        03:00          6,134                    278
Sample set creation - stratified sample*  1,454         17:00          1,082                    1.43

* New in Relativity 8.2

3.0.2 Saving categorization results

The following table provides a breakdown of categorization save performance:

Test case name                  Categorized document count   Save duration (hh:mm:ss)   Categorization results saved/hr
Saving categorization results   1,039,739                    0:36:00                    1,732,898

4 dtSearch performance

The index was created with one dtSearch index worker agent using the All Documents in Workspace searchable set. This 1,104,990-document set included three fields: Extracted Text, Document Identifier, and Group Identifier.

A 10,000-term list was selected based on terms that occurred in at least 1,000 of the searchable documents. Each term was a simple keyword; no advanced searching functionality was present (proximity searches, Boolean conditions, or wildcards). The test was performed with one Search Terms Report manager agent and one dtSearch Search manager agent (hosted on a unique server).

4.0.1 Results summary

The following table provides a breakdown of dtSearch performance:

Test case name                       Execution time (h:mm:ss)   Documents/hr   Terms/hr
dtSearch Index Build                 0:47:25                    1,398,230      N/A
Search Terms Report - 10,000 terms   0:09:52                    6,719,534      60,810
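The throughput figures throughout this document follow directly from the record counts and the measured durations. The helper below is a hypothetical sketch of that conversion, not part of Relativity:

    def per_hour(count, duration):
        """Convert an h:mm:ss duration into an hourly throughput rate."""
        h, m, s = (int(x) for x in duration.split(":"))
        return count / (h + m / 60 + s / 3600)

    print(f"{per_hour(1_104_990, '0:47:25'):,.0f}")  # ~1,398,230 documents/hr (index build row)
    print(f"{per_hour(10_000, '0:09:52'):,.0f}")     # ~60,810 terms/hr (Search Terms Report row)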
5 Export performance

All exports were completed in direct mode, copying files directly from the repository. Export was the only action executing on the machine. Images were exported as single-page TIFFs. Of the 1,104,990 documents tested, 100,000 had images. Extracted Text was exported into separate files. All metrics are calculated using the total number of documents.

5.0.1 Results summary

The following tables provide a breakdown of export performance:

Test case name       Images    .dat file size (MB)   .opt size (MB)   File size, natives + images (MB)   Export time (hh:mm:ss)
Natives + metadata   0         1,050                 N/A              133,351                            2:44:10
Images + metadata    353,326   260                   32               353,326                            0:53:09

Test case name       Records/hr   Images/hr   MB/hr     Average file transfer rate (MB/sec)   Metadata transfer rate (MB/sec)
Natives + metadata   403,854      N/A         49,121    22.06                                 9.72
Images + metadata    1,247,402    398,862     399,192   4.3                                   0.478

5.0.2 Export fields

- Natives + metadata: 46 fields were exported:
  - 12 long text fields (including Extracted Text)
  - 18 fixed-length text fields
  - 2 whole number fields
  - 7 date fields
  - 1 yes/no field
  - 2 single-choice fields
  - 2 decimal fields
  - 1 multi-choice field
  - 1 single-object field
- Images + metadata: 11 fields were exported:
  - 5 fixed-length text fields
  - 2 long text fields
  - 3 date fields
  - 1 single-choice field

6 Imaging performance

Basic imaging tests were set up to utilize two agent servers, each with four imaging worker agents. Each set output original-sized images at 300 DPI. Native imaging tests used eight imaging worker agents per imaging worker server. The default imaging profile was used, and images were generated in TIFF format. The document pool for all tests was 100,000 documents.

All imaging tests were submitted through an imaging set, and times were recorded from audits on the imaging set. No other actions were taking place in the environment during the imaging scenarios, and memory on all machines was cleared immediately beforehand to eliminate the effects of data caching.

6.0.1 Results summary

The following table provides a breakdown of imaging performance:

Test case name                      Images created   Total size of images (MB)   Imaging time (hh:mm:ss)   Images/hr
Basic imaging - TIFF                135,110          4,149.81                    6:35:22                   20,478.97
Basic imaging - JPEG                135,110          108,996.12                  6:47:57                   19,849.66
Native imaging - 1 worker server    124,586          4,321.28                    5:45:42                   21,623.26
Native imaging - 2 worker servers   124,586          4,321.28                    2:53:01                   43,204.86

7 Import performance

All imports were completed in direct mode, copying files directly from the repository. Import was the only action executing on the machine. Imported images were single-page TIFFs. Of the 1,104,990 documents tested, 100,000 had images. All metrics are calculated using the total number of documents. The native append imports loaded Extracted Text stored in separate files; no Extracted Text was imported when appending images.

7.0.1 Results summary

The following tables provide a breakdown of import performance:

Test case name    Copy to repository   Images    .dat file size (MB)   .opt size (MB)   File size, natives + images (MB)   Import time (hh:mm:ss)
Native append 1   No                   N/A       1,041.17              N/A              130,048                            2:53:17
Native append 2   Yes                  N/A       1,041.17              N/A              130,048                            3:57:50
Image append      No                   353,326   N/A                   32.79            18,739                             1:02:52

Test case name    Records/hr   Images/hr   MB/hr    Metadata transfer rate (MB/sec)   File transfer rate (MB/sec)   SQL process rate (docs/sec)
Native append 1   382,607      N/A         45,390   13.66                             N/A                           N/A
Native append 2   278,764      N/A         33,071   13.50                             38.66                         N/A
Image append      95,440       337,215     17,916   11.18                             N/A                           566.00

7.0.2 Import fields

46 fields were imported:

- 12 long text fields (including Extracted Text)
- 18 fixed-length text fields
- 2 whole number fields
- 7 date fields
- 1 yes/no field
- 2 single-choice fields
- 2 decimal fields
- 1 multi-choice field
- 1 single-object field

8 OCR performance

All OCR tests were performed with eight OCR Worker agents using the default OCR set settings. The document pool contained 100,000 documents consisting of 233,680 TIFF images. All times were taken from the audits for the OCR set.

OCR time measures how much time is spent reading image files from disk, running the OCR process for each page, and saving the text for a page to the database. Compilation time measures how much time is spent combining page-level OCR results into document-level results and saving them to the Document table.

8.0.1 Results summary

The following table provides a breakdown of OCR performance:

Test case name          Documents to OCR   Images    OCR time (hh:mm:ss)   Compilation time (hh:mm:ss)   Total time (hh:mm:ss)   Images/hr
OCR - Low Accuracy      100,000            233,680   15:09:52              0:19:25                       15:31:06                15,058
OCR - Medium Accuracy   100,000            233,680   29:13:30              0:26:13                       29:41:28                7,870
OCR - High Accuracy     100,000            233,680   34:54:43              0:26:10                       35:23:16                6,603

9 Processing performance

The following tables break down performance metrics for Relativity Processing, as run on the EDRM data set. The EDRM data set has the following specifications:

- Source size - 42 GB
- Source file count - 90 PST files
- Expanded size* - 118 GB
- Published documents - 1.1 million
- Fileshare size* - 82 GB

* When processing duplicates, Relativity doesn't store multiple copies of the same file in the fileshare. Instead, it stores one copy of the file and creates multiple links to it. When exporting all natives using the RDC, Relativity exports all duplicates, so exporting all natives from a workspace containing EDRM results in 118 GB of data (the expanded data set). In the fileshare, each duplicated file exists as only a single copy, so a workspace containing EDRM occupies 81 GB; roughly 37 GB of duplicate files are not stored. A small illustrative calculation appears below.
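The gap between the expanded size and the fileshare footprint is the duplicate content that single-instance storage avoids re-writing. This is an illustrative calculation only, using the figures above and assuming the entire gap is duplicate content:

    # EDRM workspace sizes reported above, in GB
    expanded = 118   # every duplicate materialized, as in an RDC export of all natives
    fileshare = 81   # single-instance storage: one copy per unique file, plus links

    savings = expanded - fileshare
    print(f"~{savings} GB of duplicates not re-stored "
          f"({savings / expanded:.0%} of the expanded data set)")  # ~37 GB, ~31%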
9.1 Inventory performance

Test case name          Custodians   Inventory time (mm:ss)   Finalize time (mm:ss)   Total time (mm:ss)   GB/hr
Inventory (4 workers)   1            31:32                    00:26                   31:58                221.48
Inventory (4 workers)   50           32:22                    00:28                   32:50                215.63
Inventory (4 workers)   200          43:03                    00:31                   43:34                162.51

9.2 Processing performance

All time entries in the following table appear in the hh:mm format.

Test case name   Ingestion time   Text extraction time   Discovery time   Discovery GB/hr   Publish time   Publish GB/hr
4 Workers        00:55            1:13                   2:09             54.86             N/A            N/A
2 Workers        1:19             2:19                   3:39             32.30             N/A            N/A
1 Worker         2:15             5:11                   7:26             15.84             N/A            N/A
2 Threads        N/A              N/A                    N/A              N/A               3:14           36.39

9.3 End to end performance

Test case name   Total time (hh:mm)   Total GB/hr
4 Workers        05:23                21.88
2 Workers        06:39                17.72
1 Worker         10:27                11.28

10 Production and branding performance

Eight Branding Manager agents were used in all tests, with four agents per agent server. Each production consisted of images only, no natives, and no redactions were present in the markup set used for the scenarios. The document pool contained 100,000 documents and consisted of 353,326 TIFF images.

Page-level numbering was used in all scenarios, with two stamps: the Production Bates Number in the left header and the Control Number in the right footer. Scenarios with relational attachments used the relational field Group Identifier. Families ranged in size from 1 to 327 documents; the median family size was 1 document, and the mean family size was 1.567 documents.

No other actions were taking place in the environment during the production or branding steps, and memory on all machines was cleared immediately beforehand to eliminate the effects of data caching. All times were measured using the Windows Event Log on the agent servers. Branding time was measured as the difference between when the first agent started work on the first document and when the last production image had been created.

10.0.1 Results summary

The following table provides a breakdown of production/branding performance:

Test case name                              Documents   Images    Average family size   Production time (hh:mm:ss)   Branding time (hh:mm:ss)   Images/hr
New Bates                                   100,000     353,326   N/A                   0:10:44                      2:19:15                    141,346.11
New Bates with Relational Attachment        100,000     353,326   1.567                 0:09:44                      2:19:48                    141,771.47
Original Bates                              100,000     353,326   N/A                   0:11:13                      2:26:32                    134,387.07
Original Bates with Relational Attachment   100,000     353,326   1.567                 0:10:17                      2:21:38                    139,547.30
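The Images/hr column appears to divide the image count by the combined production and branding time, so it reflects the full job rather than branding alone. A minimal sketch of that check, using a hypothetical helper:

    def images_per_hour(images, production, branding):
        """Rate over the combined production + branding wall-clock time."""
        def hours(t):
            h, m, s = (int(x) for x in t.split(":"))
            return h + m / 60 + s / 3600
        return images / (hours(production) + hours(branding))

    # New Bates row: 0:10:44 production + 2:19:15 branding
    print(f"{images_per_hour(353_326, '0:10:44', '2:19:15'):,.2f}")  # -> 141,346.11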
11 Relativity Analytics performance

The Analytics index build was performed using the default searchable and training sets. Start time was measured as the time the first document was sent to the Analytics server, and end time was measured as the time the index of 1,104,990 documents was active and searchable. Indexes were built without automatically removing English email signatures and footers.

Analytics clustering was performed on all documents in the workspace, in a new cluster. Start time was measured as the time the mass operation was initiated, and end time was measured as the time the job was removed from the relevant queue in the database.

11.0.1 Results summary

The following tables provide a breakdown of Analytics performance:

Test case name          Operation time (h:mm:ss)   Documents/hr
Analytics Index Build   0:39:43                    1,669,309.27
Analytics Clustering    0:59:31                    1,107,759.40

Each Analytics categorization was run using a new categorization set consisting of all 1,104,990 documents. Samples were taken by selecting the first 1,000 or 5,000 documents with extracted text, sorted by ArtifactID ascending, and coding each to one of the categories. Times were recorded from audits of the categorization set.

Test case name     Examples   Categories   Operation time (h:mm:ss)   Documents/hr
Categorization 1   1,000      2            0:24:59                    2,653,745.16
Categorization 2   5,000      2            1:00:51                    1,089,554.64
Categorization 3   5,000      5            1:00:30                    1,095,857.85

12 Structured Analytics performance

All 1,104,990 documents in the workspace were used in each job. Each job was executed with eight Structured Analytics Worker agents, and times were measured by polling the Structured Analytics job queue.

Each Structured Analytics operation consists of three distinct phases:

1. Export - relevant document metadata and text is exported from Relativity to the Analytics engine.
2. Structured Analytics operations - the Analytics engine processes all documents and determines the relevant information to send back to Relativity.
3. Importing results - the data that the Analytics engine extracted from the documents is sent back to Relativity through the Import API. This phase also includes job finalization and cleanup.

12.0.1 Results summary

The following table provides a breakdown of Structured Analytics performance. All durations in the following table are in h:mm:ss format.
Test case name                          Export duration   Structured Analytics operations duration   Import results duration   Total duration   Documents/hr
Language Identification                 0:16:56           0:37:04                                    0:26:39                   1:20:39          822,063
Textual Near Duplicate Identification   0:17:11           0:24:08                                    0:29:19                   1:10:38          938,642
Repeated Content Identification         0:16:56           0:13:57                                    0:14:37                   0:45:30          1,457,130
Email Threading                         0:18:22           0:15:22                                    0:32:45                   1:06:29          997,233
All operations                          0:18:17           1:13:07                                    0:36:38                   2:08:02          517,829

Proprietary Rights

This documentation ("Documentation") and the software to which it relates ("Software") belongs to kCura LLC and/or kCura's third party software vendors. kCura grants written license agreements which contain restrictions. All parties accessing the Documentation or Software must: respect proprietary rights of kCura and third parties; comply with your organization's license agreement, including but not limited to license restrictions on use, copying, modifications, reverse engineering, and derivative products; and refrain from any misuse or misappropriation of this Documentation or Software in whole or in part. The Software and Documentation are protected by the Copyright Act of 1976, as amended, and the Software code is protected by the Illinois Trade Secrets Act. Violations can involve substantial civil liabilities, exemplary damages, and criminal penalties, including fines and possible imprisonment.

©2016. kCura LLC. All rights reserved. Relativity® and kCura® are registered trademarks of kCura LLC.