Relativity Performance Metrics and Recommendations - 8.2

Performance Metrics and Recommendations
Version 8.2 | March 4, 2016
For the most recent version of this document, visit our documentation website.
Table of Contents
1 Performance metrics and recommendations
2 Performance baselines - EDRM data set
3 Assisted Review performance
   3.0.1 Sample set creation results
   3.0.2 Saving categorization results
4 dtSearch performance
   4.0.1 Results summary
5 Export performance
   5.0.1 Results summary
   5.0.2 Export fields
6 Imaging performance
   6.0.1 Results summary
7 Import performance
   7.0.1 Results summary
   7.0.2 Import fields
8 OCR performance
   8.0.1 Results summary
9 Processing performance
   9.1 Inventory performance
   9.2 Processing performance
   9.3 End to end performance
10 Production and branding performance
   10.0.1 Results summary
11 Relativity Analytics performance
   11.0.1 Results summary
12 Structured Analytics performance
   12.0.1 Results summary
1 Performance metrics and recommendations
This document provides metrics and recommendations for all Relativity functions that saw a performance increase in Relativity 8.2, including:
- Assisted Review
- dtSearch
- Export
- Imaging
- Import
- OCR
- Processing
- Production/branding
- Relativity Analytics
- Structured Analytics
2 Performance baselines - EDRM data set
This section provides baselines for certain Relativity functions, as performed on the EDRM data set.
- Export
  - The export consisted of both natives and full text.
  - A total of 1,104,990 records were exported.
  - Average file size of documents exported was 0.11 MB.
- Import Native/Full Text
  - The load file contained 50 fields of metadata.
  - The load file contained pointers to the extracted text of the documents.
  - The total file size of all natives loaded equaled 118.10 GB.
  - Average file size of documents imported was 0.11 MB.
- Basic Imaging - TIFF on-the-fly
  - Average number of images per document equaled 1.35.
  - Average size of images equaled 0.03 MB.
  - A total of 135,100 images were imaged (100,000 records).
- Native Imaging
  - Average number of images per document equaled 1.25.
  - Average size of images equaled 0.03 MB.
  - A total of 124,586 images were imaged (100,000 records).
- OCR
  - Average number of images per document equaled 4.1.
  - The Accuracy Level was set to ‘Medium’ in the OCR Profile.
  - A total of 233,680 images were OCR’d (100,000 records).
- Production/branding
  - The productions consisted of images only.
  - A total of 100,000 records were produced.
  - A total of 353,326 images were produced.
  - Each document was branded with 2 endorsements.
- Relativity Analytics - Indexing
  - A total of 1,104,990 records were indexed (Search).
  - The total size of Full Text fields was 8.03 GB.
- dtSearch - Indexing
  - A total of 1,104,990 records were indexed.
  - The total size of Full Text fields was 8.03 GB.
- Categorization
  - A total of 1,104,990 records were categorized.
  - A total of 5 categories were created with 5,000 examples.
  - Number of Dimensions was 100.
  - The total size of Extracted Text of examples was 32.87 MB.
- Structured Analytics
  - A total of 1,104,990 records were used for these tests.
  - Export batch size was set to the default value of 1,000 documents.
  - Import batch size was set to the default value of 2,000 documents.
- Processing
  - The data set consisted of the raw EDRM v2 data set, which contains 90 PST files stored in zip files.
  - Ingestion was performed with 4 worker servers.
  - Publish was performed using 4 threads.
  - A total of 1,104,949 records were discovered and published to Relativity.
The following table provides a summary of the performance of several common operations in Relativity. For details, see the section for each piece of functionality.
All results are for the EDRM Enron V2 data set, which consists of 1.1 million documents.
Functionality            Operation                                              Rate
Assisted Review          Round start - statistical sampling (1.1m documents)    60 seconds
                         Round start - stratified sampling (1.1m documents)     17 minutes
                         Finalization - all documents (1.1m documents)          1,700,000 records/hr
dtSearch                 Index builds (1 worker agent)                          1,400,000 records/hr
                         Search Terms Reports (4 search threads)                61,000 terms/hr
Export                   Natives + metadata                                     404,000 records/hr
                         Images + metadata                                      399,000 images/hr
Import                   Natives + metadata (no copy to repository)             383,000 records/hr
                         Natives + metadata (copy to repository)                279,000 records/hr
                         Image append (no copy to repository)                   95,400 images/hr
Imaging                  Basic imaging - TIFF (8 worker agents)                 20,500 images/hr
                         Basic imaging - JPEG (8 worker agents)                 19,800 images/hr
                         Native imaging (2 worker servers)                      43,000 images/hr
OCR                      Low accuracy (8 worker agents)                         15,000 images/hr
                         Medium accuracy (8 worker agents)                      7,800 images/hr
                         High accuracy (8 worker agents)                        6,600 images/hr
Processing               Inventory (4 worker machines)                          221 GB/hr
                         Ingestion (4 worker machines)                          514,000 records/hr
                         Publish                                                358,000 records/hr
                         Total rate (4 worker machines)                         210,000 records/hr
Production/branding      8 worker agents                                        141,000 images/hr
Relativity Analytics     Indexing                                               1,670,000 records/hr
                         Categorization (5,000 examples)                        1,090,000 records/hr
                         Clustering                                             1,100,000 records/hr
Structured Analytics     Email threading                                        1,000,000 records/hr
                         Near duplicate identification                          940,000 records/hr
                         Repeated content identification                        1,450,000 records/hr
                         Language identification                                820,000 records/hr
                         All operations                                         520,000 records/hr
The following table provides the Relativity environment server specifications:

Server              Memory (GB)   Cores
SQL *               128           32
Web                 4             4
Agent               4             4
File                4             4
Search/CA           64            4
Loading             4             4
Processing Worker   16            8 physical / 16 logical

Host details:
- SQL: Dell PowerEdge R720 with 2 x Intel E5-2650 2GHz 8-core processors, 8 x 16GB (128GB) 1600MHz RDIMM memory, 2 x Intel i350 quad-port Gb network adapters, and a PERC H710P RAID controller.
- Web, Agent, File, Search/CA, and Loading: Dell PowerEdge R820 with 4 x Intel E5-4620 2.2GHz 8-core processors, 8 x 16GB (128GB) 1600MHz RDIMM memory, an Intel i350 quad-port Gb network adapter, a Brocade 825 dual-port 8Gb Fibre Channel host bus adapter, and a PERC H710P RAID controller.
- Processing Worker: Dell PowerEdge R420 with 2 x Intel E5-2420 1.9GHz 6-core processors, 2 x 8GB (16GB) 1600MHz RDIMM memory, and an Intel i350 quad-port Gb network adapter.

* The SQL Server is a physical machine. All others are VMs.
3 Assisted Review performance
All sample sets were created in a workspace with no prior Assisted Review projects or rounds, using the same 1,104,190 document set with no prior coding. Statistical samples were created with a 95% confidence level and a 2.5% margin of error. Categorization counts were artificially inflated in the first round by editing the categorization set associated with the round.
All tests were performed with eight Assisted Review worker agents, and times were measured through Assisted Review audits.
Note: Stratified sampling is data dependent. Increasing the size of the data set, the size of the files it contains, or the variance between files in the data set will have a significant performance impact.
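
The statistical sample size is consistent with the standard formula for estimating a proportion with a finite population correction. The document does not state the exact formula Relativity uses, so the following is only a sketch under that assumption; it reproduces the 1,535 document sample from the 1,104,190 document universe at a 95% confidence level (z = 1.96) and a 2.5% margin of error:

```python
import math

def sample_size(population, z=1.96, margin=0.025, p=0.5):
    """Cochran's sample size for a proportion, with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # shrink for a finite universe

# 95% confidence and 2.5% margin of error over the 1,104,190 document universe
print(sample_size(1_104_190))  # -> 1535, matching the statistical sample below
```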
3.0.1 Sample set creation results
The following table provides a breakdown of sample set creation performance:
Test case name                             Sample size   Time (mm:ss)   Docs/second (universe)   Docs/second (sample set)
Sample set creation - statistical          1,535         01:00          18,403                   26
Sample set creation - fixed size           50,000        03:00          6,134                    278
Sample set creation - stratified sample*   1,454         17:00          1,082                    1.43
* New in Relativity 8.2
3.0.2 Saving categorization results
The following table provides a breakdown of categorization save performance:
Test case name                  Categorized document count   Save duration (hh:mm:ss)   Categorization results saved/hr
Saving categorization results   1,039,739                    0:36:00                    1,732,898
4 dtSearch performance
The index was created with one dtSearch index worker agent using the All Documents in Workspace
searchable set. Included in this 1,104,990 document set were three fields: Extracted Text, Document
Identifier, and Group identifier.
Relativity | Performance Metrics and Recommendations - 9
A 10,000 term list was selected based on terms that occurred in at least 1,000 of the searchable
documents. Each term was a simple keyword, no advanced searching functionality was present
(proximity search, Boolean conditions, or wildcards). The test was performed with one Search Terms
Report manager agent and one dtSearch Search manager agent (hosted on a unique server).
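
The document does not describe how the term list was built, so the following is only a minimal sketch of one way to select such a list from per-document extracted text. The helper name and the naive whitespace tokenizer are illustrative assumptions, not part of Relativity:

```python
from collections import Counter

def build_term_list(extracted_texts, min_doc_freq=1_000, limit=10_000):
    """Select simple keywords that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in extracted_texts:
        doc_freq.update(set(text.lower().split()))  # count each term once per document
    # keep the most widespread terms, capped at the report size
    return [term for term, n in doc_freq.most_common(limit) if n >= min_doc_freq]
```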
4.0.1 Results summary
The following table provides a breakdown of dtSearch performance:
Test case name                       Execution time (h:mm:ss)   Documents/hr   Terms/hr
dtSearch Index Build                 0:47:25                    1,398,230      N/A
Search Terms Report - 10,000 terms   0:09:52                    6,719,534      60,810
5 Export performance
All exports were completed in direct mode, copying files directly from the repository. Export was the only action executing on the machine. Images were exported as single-page TIFFs. Of the 1,104,990 documents tested, 100,000 had images. Extracted Text was exported into separate files. All metrics are calculated using the total number of documents.
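
The hourly rates in the tables that follow are simply the record and file counts divided by the export duration. The helper below is not part of Relativity; it only makes the arithmetic explicit, and it reproduces the natives + metadata row to within rounding of the underlying timings:

```python
def hourly_rate(count, duration_hms):
    """Convert a count and an hh:mm:ss duration into an hourly rate."""
    h, m, s = (int(part) for part in duration_hms.split(":"))
    return count / (h + m / 60 + s / 3600)

# Natives + metadata: 1,104,990 records exported in 2:44:10
print(round(hourly_rate(1_104_990, "2:44:10")))        # ~403,856 records/hr (table: 403,854)
# The MB/hr column appears to include the .dat file: 133,351 MB of files + 1,050 MB .dat
print(round(hourly_rate(133_351 + 1_050, "2:44:10")))  # ~49,122 MB/hr (table: 49,121)
```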
5.0.1 Results summary
The following tables provide a breakdown of export performance:
Test case name       Images    .dat file size (MB)   .opt size (MB)   File size, natives + images (MB)   Export time (hh:mm:ss)
Natives + metadata   0         1,050                 N/A              133,351                            2:44:10
Images + metadata    353,326   260                   32               353,326                            0:53:09

Test case name       Records/hr   Images/hr   MB/hr     Average file transfer rate (MB/sec)   Metadata transfer rate (MB/sec)
Natives + metadata   403,854      N/A         49,121    22.06                                 9.72
Images + metadata    1,247,402    398,862     399,192   4.3                                   0.478
5.0.2 Export fields
- Natives + metadata - 46 fields were exported:
  - 12 long text fields (including Extracted Text)
  - 18 fixed-length text fields
  - 2 whole number fields
  - 7 date fields
  - 1 yes-no field
  - 2 single-choice fields
  - 2 decimal fields
  - 1 multi-choice field
  - 1 single-object field
- Images + metadata - 11 fields were exported:
  - 5 fixed-length text fields
  - 2 long text fields
  - 3 date fields
  - 1 single-choice field
6 Imaging performance
Basic imaging tests were set up to utilize two agent servers, each with four imaging worker agents. Each set output original-sized images at 300 DPI. Native imaging tests used eight imaging worker agents per imaging worker server. The default imaging profile was used, and images were generated in TIFF format. The document pool for all tests was 100,000 documents.
All imaging tests were submitted through an imaging set, and times were recorded from audits on the imaging set. No other actions were taking place in the environment during the imaging scenarios, and memory on all machines was cleared immediately beforehand to eliminate the effects of data caching.
6.0.1 Results summary
The following table provides a breakdown of imaging performance:

Test case name                      Images created   Total size of images (MB)   Imaging time (hh:mm:ss)   Images/hr
Basic imaging - TIFF                135,110          4,149.81                    6:35:22                   20,478.97
Basic imaging - JPEG                135,110          108,996.12                  6:47:57                   19,849.66
Native imaging - 1 worker server    124,586          4,321.28                    5:45:42                   21,623.26
Native imaging - 2 worker servers   124,586          4,321.28                    2:53:01                   43,204.86
7 Import performance
All imports were completed in direct mode, copying files directly from the repository. Import was the only action executing on the machine. Imported images were single-page TIFFs. Of the 1,104,990 documents tested, 100,000 had images. All metrics are calculated using the total number of documents.
The native append tests imported Extracted Text stored in separate files. No Extracted Text was imported when appending images.
7.0.1 Results summary
The following tables provide a breakdown of import performance:
Test case name    Copy to repository   Images    .dat file size (MB)   .opt size (MB)   File size, natives + images (MB)   Import time (hh:mm:ss)
Native append 1   No                   N/A       1,041.17              N/A              130,048                            2:53:17
Native append 2   Yes                  N/A       1,041.17              N/A              130,048                            3:57:50
Image append      No                   353,326   N/A                   32.79            18,739                             1:02:52

Test case name    Records/hr   Images/hr   MB/hr    Metadata transfer rate (MB/sec)   File transfer rate (MB/sec)   SQL process rate (docs/sec)
Native append 1   382,607      N/A         45,390   13.66                             N/A                           N/A
Native append 2   278,764      N/A         33,071   13.50                             38.66                         N/A
Image append      95,440       337,215     17,916   11.18                             N/A                           566.00
7.0.2 Import fields
46 fields were imported:
- 12 long text fields (including Extracted Text)
- 18 fixed-length text fields
- 2 whole number fields
- 7 date fields
- 1 yes-no field
- 2 single-choice fields
- 2 decimal fields
- 1 multi-choice field
- 1 single-object field
8 OCR performance
All OCR tests were performed with eight OCR Worker agents using the default OCR set settings. The document pool contained 100,000 documents consisting of 233,680 TIFF images. All times were taken from the audits for the OCR set.
OCR time measures the time spent reading image files from disk, running the OCR process for each page, and saving the text for each page to the database. Compilation time measures the time spent combining page-level OCR results into document-level results and saving them to the Document table.
8.0.1 Results summary
The following table provides a breakdown of OCR performance:
Test case name          Documents to OCR   Images    OCR time (hh:mm:ss)   Compilation time (hh:mm:ss)   Total time (hh:mm:ss)   Images/hr
OCR – Low Accuracy      100,000            233,680   15:09:52              0:19:25                       15:31:06                15,058
OCR – Medium Accuracy   100,000            233,680   29:13:30              0:26:13                       29:41:28                7,870
OCR – High Accuracy     100,000            233,680   34:54:43              0:26:10                       35:23:16                6,603
9 Processing performance
The following tables break down performance metrics for Relativity Processing, as run on the
EDRM data set.
The EDRM data set has the following specifications:
- Source size - 42 GB
- Source file count - 90 PST files
- Expanded size* - 118 GB
- Published documents - 1.1 million
- Fileshare size* - 82 GB

* When processing duplicates, Relativity doesn't store multiple copies of the same file in the fileshare. Instead, it stores one copy of the file and creates multiple links to that file. When exporting all natives using the RDC, Relativity exports all duplicates, so exporting all natives from a workspace containing EDRM results in 118 GB of data (the expanded data set). In the fileshare of the workspace, each duplicated file is present as only a single copy; the size of a workspace containing EDRM is 81 GB, and roughly 37 GB of duplicate files are not present.
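
As an illustration only, and not Relativity's actual implementation, this single-instance storage behavior can be sketched as a content-addressed store: the first copy of a file is written to the fileshare, and every later duplicate resolves to that stored copy.

```python
import hashlib
import os
import shutil

def store_single_instance(src_path, fileshare, index):
    """Keep one physical copy per unique file; duplicates resolve to that copy."""
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()  # identify files by content
    if digest not in index:                            # first occurrence: store the bytes
        stored_path = os.path.join(fileshare, digest)
        shutil.copyfile(src_path, stored_path)
        index[digest] = stored_path
    return index[digest]                               # duplicates reuse the stored copy
```

Export then resolves every record's pointer, which is why exporting all natives yields the full 118 GB expanded data set even though the fileshare itself holds roughly 81 GB.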
9.1 Inventory performance
Test case name          Custodians   Inventory time (mm:ss)   Finalize time (mm:ss)   Total time (mm:ss)   GB/hr
Inventory (4 Workers)   1            31:32                    00:26                   31:58                221.48
Inventory (4 Workers)   50           32:22                    00:28                   32:50                215.63
Inventory (4 Workers)   200          43:03                    00:31                   43:34                162.51
9.2 Processing performance
All time entries in the following table appear in the (hh:mm) format.

Test case name   Ingestion time   Text extraction time   Discovery time   Discovery GB/hr   Publish time   Publish GB/hr
4 Workers        00:55            1:13                   2:09             54.86             N/A            N/A
2 Workers        1:19             2:19                   3:39             32.30             N/A            N/A
1 Worker         2:15             5:11                   7:26             15.84             N/A            N/A
2 Threads        N/A              N/A                    N/A              N/A               3:14           36.39
9.3 End to end performance
Test case name   Total time (hh:mm)   Total GB/hr
4 Workers        05:23                21.88
2 Workers        06:39                17.72
1 Worker         10:27                11.28
10 Production and branding performance
Eight Branding Manager agents were used in all tests, with four agents per agent server. Each production consisted of images only, no natives. No redactions were present in the markup set used for the scenarios. The document pool contained 100,000 documents and consisted of 353,326 TIFF images. Page-level numbering was used in all scenarios with two stamps: the Production Bates Number in the left header and the Control Number in the right footer.
Scenarios with Relational Attachment used the relational field Group Identifier. Families ranged in size from 1 to 327 documents per family; the median family size was 1 document per family, and the mean family size was 1.567 documents per family.
No other actions were taking place in the environment during the production or branding steps, and memory on all machines was cleared immediately beforehand to eliminate the effects of data caching. All times were measured using the Windows Event Log on the agent servers. Branding time was measured from when the first agent started work on the first document until the last production image had been created.
10.0.1 Results summary
The following table provides a breakdown of production/branding performance:
Test case name                              Documents   Images    Average family size   Production time (hh:mm:ss)   Branding time (hh:mm:ss)   Images/hr
New Bates                                   100,000     353,326   N/A                   0:10:44                      2:19:15                    141,346.11
New Bates with Relational Attachment        100,000     353,326   1.567                 0:09:44                      2:19:48                    141,771.47
Original Bates                              100,000     353,326   N/A                   0:11:13                      2:26:32                    134,387.07
Original Bates with Relational Attachment   100,000     353,326   1.567                 0:10:17                      2:21:38                    139,547.30
11 Relativity Analytics performance
The Analytics index build was performed using the default searchable and training sets. Start time was measured as the time the first document was sent to the Analytics server, and end time was measured as when the index of 1,104,990 documents was active and searchable. Indexes were built with the option to automatically remove English email signatures and footers disabled.
Analytics clustering was performed on all documents in the workspace, in a new cluster. Start time was measured as the time the mass operation was initiated, and end time was measured as the time the job was removed from the relevant queue in the database.
11.0.1 Results summary
The following tables provide a breakdown of Analytics performance:
Test case name          Operation time (h:mm:ss)   Documents/hr
Analytics Index Build   0:39:43                    1,669,309.27
Analytics Clustering    0:59:31                    1,107,759.40

Each Analytics categorization was run using a new categorization set consisting of all 1,104,990 documents. Samples were taken by selecting the first 1,000 or 5,000 documents with extracted text, sorted by ArtifactID ascending, and coding them to one of the categories. Times were recorded from audits of the categorization set.

Test case name     Examples   Categories   Operation time (h:mm:ss)   Documents/hr
Categorization 1   1,000      2            0:24:59                    2,653,745.16
Categorization 2   5,000      2            1:00:51                    1,089,554.64
Categorization 3   5,000      5            1:00:30                    1,095,857.85
12 Structured Analytics performance
All 1,104,990 documents in the workspace were used in each job. Each job was executed with eight
Structured Analytics Worker agents and times were measured by polling the Structured Analytics job
queue.
Each Structured Analytics operation consists of three distinct phases:
1. Export - relevant document metadata and text is exported from Relativity to the Analytics engine.
2. Structured Analytics operations - the Analytics engine processes all documents, and determines the relevant information to be sent back to Relativity.
3. Importing results - the data that the Analytics engine extracted from the documents is sent back to
Relativity through the Import API. This phase also includes job finalization and cleanup.
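
In the results table below, the total duration is the sum of the three phase durations, and documents/hr is the document count divided by that total. The following is a minimal sketch of that arithmetic; the helper is illustrative, not a Relativity API:

```python
def job_summary(documents, phase_seconds):
    """Sum per-phase durations and derive an hourly document rate."""
    total_seconds = sum(phase_seconds.values())
    return total_seconds, documents * 3600 / total_seconds

# Language identification: 0:16:56 export, 0:37:04 operations, 0:26:39 import
phases = {"export": 16 * 60 + 56, "operations": 37 * 60 + 4, "import": 26 * 60 + 39}
total, rate = job_summary(1_104_990, phases)
print(total, round(rate))  # -> 4839 seconds (1:20:39) and ~822,063 documents/hr
```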
12.0.1 Results summary
The following table provides a breakdown of Structured Analytics performance. All durations in the
following table are in (h:mm:ss) format.
Test case name                          Export duration   Structured Analytics operations duration   Import results duration   Total duration   Documents/hr
Language Identification                 0:16:56           0:37:04                                    0:26:39                   1:20:39          822,063
Textual Near Duplicate Identification   0:17:11           0:24:08                                    0:29:19                   1:10:38          938,642
Repeated Content Identification         0:16:56           0:13:57                                    0:14:37                   0:45:30          1,457,130
Email Threading                         0:18:22           0:15:22                                    0:32:45                   1:06:29          997,233
All operations                          0:18:17           1:13:07                                    0:36:38                   2:08:02          517,829
Proprietary Rights
This documentation (“Documentation”) and the software to which it relates (“Software”) belong to kCura LLC and/or kCura’s third party software vendors. kCura grants written license agreements which contain restrictions. All parties accessing the Documentation or Software must: respect the proprietary rights of kCura and third parties; comply with their organization’s license agreement, including but not limited to license restrictions on use, copying, modifications, reverse engineering, and derivative products; and refrain from any misuse or misappropriation of this Documentation or Software in whole or in part. The Software and Documentation are protected by the Copyright Act of 1976, as amended, and the Software code is protected by the Illinois Trade Secrets Act. Violations can involve substantial civil liabilities, exemplary damages, and criminal penalties, including fines and possible imprisonment.
©2016. kCura LLC. All rights reserved. Relativity® and kCura® are registered trademarks of kCura LLC.