Venkat Rangan

Role of Enterprise Search in E-Discovery
June 18, 2008
© 2008 Clearwell Systems, Inc. Confidential
Enterprise E-Discovery is a business process
Search is central to E-Discovery
Electronic Discovery Reference Model (www.edrm.net)
Stages shown in the diagram: Information Management, Identification, Preservation, Collection, Processing, Review, Analysis, Production, Presentation
VOLUME
Identification Search
• Custodians
• Meta-Data
• Date Range
• Media Type
• Data Type
RELEVANCE
Collection Search
• By Custodian
• By Operator
• By Data Type
• By keyword, phrase, concept
• By Project
Analysis/Review Search
• Responsiveness
• Privilege Determination
• Review Grouping
• Near-duplicates
• Quality Control
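The identification-search criteria on this slide (custodians, date range, data type) amount to a filter over file metadata. The following is a minimal sketch under stated assumptions: the `ItemMeta` record, its field names, and the sample items are hypothetical and are not taken from any product described in the deck.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ItemMeta:
    """Hypothetical metadata record for one file or message."""
    path: str
    custodian: str
    modified: date
    data_type: str  # for example "email" or "document"

def identify(items, custodians, start, end, data_types):
    """Return the items that satisfy every identification criterion."""
    return [
        it for it in items
        if it.custodian in custodians
        and start <= it.modified <= end
        and it.data_type in data_types
    ]

# Illustrative items only.
items = [
    ItemMeta("mail/a.msg", "alice", date(2007, 3, 1), "email"),
    ItemMeta("docs/b.doc", "bob", date(2007, 5, 9), "document"),
    ItemMeta("mail/c.msg", "alice", date(2005, 1, 2), "email"),
]

hits = identify(items, {"alice"}, date(2007, 1, 1), date(2007, 12, 31), {"email"})
print([it.path for it in hits])
```

Because only metadata is consulted, a filter of this kind can run before any content is collected or indexed.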
FRCP Rules governing E-Discovery
Rule             Summary Reading
Rule 16(b)       Outline plans for e-discovery and document production
Rule 26(f)       Procedures and Protocols to govern e-discovery
Rule 16(b)(5)    Courts to include scheduling orders
Rule 26(a)       Expansion on definition of ESI
Rule 26(b)(2)    E-Discovery Scope Cost-Shifting arguments – Burden of reasonableness moving to Requesting Party
Rule 26(b)(5)    Inadvertently disclosed ESI and Privilege Claw-back agreements
Rule 34(b)       Specify forms of production (Native, Image etc.)
Rule 37(f)       Disallow sanctions when ESI lost due to retention policy and good faith efforts
FRCP Rules and Their Impact on E-Discovery
• Emphasis on co-operation during E-Discovery
• Sedona Principles as a Guide for E-Discovery
• Early Discovery Planning Conferences
• No “Gaming” of E-Discovery
• Prepare for Meet and Confer
  • Organizational Structure
  • Information Assets and Data Map
  • ILM Policies and Procedures
  • Backup and Disaster Recovery Practices
  • Preservation Hold/Legal Hold Policies and Actions
• Establish E-Discovery Scope
  • Estimate Review Size from automated Search Results
  • Raw Volume, Processed Volume, Review Volume
  • Substantiate “Not Reasonably Accessible” Claims
  • Move burden of “cost provability” to the Requesting Party
Enabling E-Discovery within an Enterprise
Diagram labels: Analysis, Culling, Review; Organizational Data; Digital Asset Database; IT Personnel; Legal IT Personnel; Legal Search/Analysts; ECM/ILM Policies; File Shares; Messaging servers; Meta-Data Index; Data Map; Case Data; Keyword Index; CMS; Enterprise Intranet; Preservation Hold
E-Discovery Search Characteristics
Theme
• Produce Entire Results – not sufficient to only produce Top N
• No Estimates of Counts – must provide accurate, actual counts
• Stability of Results
• Very large Result Sets
• Fast Query Response Time
Relevance
• Activity Based Relevance – Responsiveness Search vs. Privilege Search
• Meta-Data based Relevance – Timeliness, People, Connection to other data
• Review-directed Relevance
• Traditional TF/IDF based Relevance
• Provide Complete Hit Context
Results Management
• Complete Auditing of all Searches
• Document Hit Count Reports
• Tie back to original Document Meta-Data
• EDRM XML-2 Export to downstream processes
• Group Near-Duplicates, Concept Clusters for Review Efficiency
Data Types
• Many data formats – 10,000 formats
• New communication formats – Wiki, Blogs, SMS, IM, Unified Messaging
• Remove impediments to search – ACLs, Encryption, Container Extraction
• ESI from old, legacy applications
• Incomplete and Corrupt data (Deleted Files, raw disk blocks)
• Handle Multi-language ESI
• Handle Low-fidelity documents – OCR-scanned images
Flexibility
• Advanced Search/Query Language
• Iterative Search and Search Refinement
• Guided Navigation, one-click Filtering
• Saving and Sharing Searches
• Real-time updates for Tagging, Classifying Results
Workflow
• Incremental ESI Collections (Batches)
• Multi-level Review
• Multi-person Review
• Rolling Productions
• Activity Reports
• Outside Counsel, Opposing Counsel interactions
• Project Management
Search Effectiveness
Techniques to improve Precision and Recall
Precision
• Pre-filtering wildcard expansions
• Boolean Queries
• Proximity Specification
• Keyword Scope (Sentence, Paragraph)
• Meta-Data Context Search
• Entity based Search
Recall
• Misspellings/Fuzzy Search
• Wildcard Specifications
• Synonyms
• Related Terms
• Concept Search
• Bayesian Search
Figure: Precision (vertical axis, 0.0 to 1.0) plotted against Recall (horizontal axis, 0.0 to 1.0)
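To make the precision and recall tradeoff concrete, here is a small sketch that compares an exact keyword query with a wildcard query. It uses the standard definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that are retrieved). The toy corpus and the set of relevant documents are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy corpus (illustrative only): document id -> text.
docs = {
    1: "contract termination notice",
    2: "terminate the supply contract",
    3: "bus terminal schedule",
    4: "contract renewal",
}
relevant = {1, 2}  # assumed reviewer judgment for this example

def search(match):
    """Return ids of documents with at least one word accepted by `match`."""
    return {doc_id for doc_id, text in docs.items()
            if any(match(word) for word in text.split())}

exact = search(lambda w: w == "termination")         # exact keyword
wildcard = search(lambda w: w.startswith("termin"))  # termin*

print("exact   ", precision_recall(exact, relevant))
print("wildcard", precision_recall(wildcard, relevant))
```

In this toy case the wildcard query finds both relevant documents but also pulls in "terminal", so recall rises while precision falls.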
E-Discovery Search: Typical measures and outcomes
Number of truly responsive documents in Retrieved Collection:
Search Method        Retrieved Documents   Sample Size   Responsive in Sample   Estimate in Retrieved   Precision
Keyword Search       14846                 1537          940                    9080                    0.61
Discussion Threads   16515                 1537          1069                   11486                   0.70
Concept Search       18554                 1537          1128                   13617                   0.73

Number of truly responsive documents in Un-Retrieved Collection:
Search Method        Unretrieved Documents   Sample Size   Responsive in Sample   Estimate in Unretrieved   Recall
Keyword Search       1076258                 1537          29                     20307                     0.31
Discussion Threads   1074589                 1537          28                     19576                     0.37
Concept Search       1072550                 1537          26                     18143                     0.43
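The figures on this slide are consistent with a simple sample-based estimate: scale the responsive rate seen in each sample up to the full set, then take precision as the estimated responsive count in the retrieved set divided by the retrieved count, and recall as that estimate divided by the total estimated responsive count (retrieved plus unretrieved). The slide does not spell out its method, so the sketch below is a reconstruction; the function names are made up for illustration.

```python
SAMPLE_SIZE = 1537  # sample size used for every row on the slide

def estimate_responsive(population, responsive_in_sample, sample_size=SAMPLE_SIZE):
    """Scale the responsive rate observed in a random sample up to the whole set."""
    return round(population * responsive_in_sample / sample_size)

def measure(retrieved, hits_in_retrieved_sample, unretrieved, hits_in_unretrieved_sample):
    """Return (estimate in retrieved, estimate in unretrieved, precision, recall)."""
    est_ret = estimate_responsive(retrieved, hits_in_retrieved_sample)
    est_unret = estimate_responsive(unretrieved, hits_in_unretrieved_sample)
    precision = est_ret / retrieved
    recall = est_ret / (est_ret + est_unret)
    return est_ret, est_unret, round(precision, 2), round(recall, 2)

# Inputs copied from the slide:
# (retrieved, responsive in sample, unretrieved, responsive in sample)
rows = {
    "Keyword Search": (14846, 940, 1076258, 29),
    "Discussion Threads": (16515, 1069, 1074589, 28),
    "Concept Search": (18554, 1128, 1072550, 26),
}

results = {name: measure(*args) for name, args in rows.items()}
for name, result in results.items():
    print(name, result)
```

Run against the slide's inputs, this reproduces the estimate, precision, and recall columns of both tables.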
Interactive Search: Key to Search Efficiency
Interactive wildcard, stemming expansion selection
• Removes precision-recall tradeoff by enabling interactive review and removal of false positive expansions
• Saves thousands of dollars per search
Search Report
• Detailed, interactive keyword search report results for iterative large query execution
• Full transparency and auditing
• Significant time savings
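The interactive expansion step can be sketched as follows: expand a wildcard against the index vocabulary, let a reviewer reject the false-positive expansions, then run the search with only the selected terms. This is an illustration under assumptions, not the product's implementation; the inverted index, its terms, and the rejected set are hypothetical.

```python
import fnmatch

def expand_wildcard(pattern, vocabulary):
    """Return every indexed term that matches a shell-style wildcard pattern."""
    return sorted(t for t in vocabulary if fnmatch.fnmatchcase(t, pattern))

def search(index, terms):
    """Return the union of the posting lists of the selected terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

# Hypothetical inverted index: term -> ids of documents containing it.
index = {
    "terminate": {1, 4},
    "termination": {2},
    "terminal": {3, 5},
    "termite": {6},
    "contract": {1, 2, 7},
}

expansions = expand_wildcard("termin*", index)

# A reviewer inspects the expansions and rejects the false positives.
rejected = {"terminal"}
selected = [t for t in expansions if t not in rejected]

print(expansions)
print(sorted(search(index, selected)))
```

Dropping "terminal" before execution keeps documents 3 and 5 out of the result set, so they never reach review.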
E-Discovery is about extracting Relevant Content
Figure labels (stages): Archive and Store, Preservation Store, Collect and Preserve, Analyze and Review
Figure labels (data volumes): 100-1000 TB, 50 TB, 500 GB, 1-2 GB
Enterprise Case Study – Global Media Conglomerate
Case Data
• Chart values: 456,448; 208,628; 74,713; 417
• Eliminating the need to process and review 456,000 documents saved $175,000
• Data culling based on query permutations reduced data set by 99% to 417
• Time = 2.5 days
E-Discovery - Workflow
Diagram labels: SOURCES, Source SCAN, Meta-Data (Shallow Index), SQL, Full Text Indexer, Copy Engine, Deep Index Full-Text, Case ESI Store, Processing, Case Mgmt, Processing Manifest, Full Text
Rate of Ingestion
• 1M files/hour
• 10K directory scans
• 1 TB/hour
Rate of Indexing
• 100K files/hour
• 10-20 GB/hour
Rate of Processing
• 100 custodians
• 10K files/hour
• 1 GB PST/custodian
Rate of Extraction
• 20K files/hour
• 2-4 GB/hour
Size of Index
• 0.2 TB
• 1 billion rows
• 10K/s Bulk Load
Size of Index
• 1 TB
• 10 billion objects
Size of Index
• 1 TB (each partition)
• Up to 100 index partitions
• 10 billion objects
• 200-400 file types
• Includes meta-data
Size of Store
• 32 TB FC/SCSI
• 4 TB NTFS
• 300 GB/custodian
• 100 custodians
Size of Manifest
• 10 million items
E-Discovery Search: Collection Workflow
Diagram labels: SOURCES, Source SCAN, Meta-Data (Shallow Index), Search, Case Document Collection
Search Scope
• Owners/SID
• Last Modification Date
• Creation Date
• Author/Title
• Department
Search Technology
• Keyword Search
• Parameterized Date Range
Copy of Original
• Maintain Original Locations
• Hash with Meta-Data for content and location integrity
• Hash without Meta-Data for content integrity
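The two hashes listed under Copy of Original can be sketched as below. The slide names neither the hash algorithm nor the metadata fields, so SHA-256, the choice of original path and modification timestamp, and both function names are assumptions made for illustration.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash of the file content only: detects any change to the content."""
    return hashlib.sha256(data).hexdigest()

def content_and_location_hash(data: bytes, original_path: str, modified_iso: str) -> str:
    """Hash of content plus selected metadata: also changes when the
    recorded location or timestamp changes."""
    h = hashlib.sha256()
    for field in (original_path.encode("utf-8"), modified_iso.encode("utf-8"), data):
        # A length prefix keeps field boundaries unambiguous.
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()

# Illustrative values only.
data = b"Quarterly forecast draft"
copy_a = content_and_location_hash(data, r"\\fs01\finance\forecast.doc", "2008-01-15T09:30:00")
copy_b = content_and_location_hash(data, r"\\fs02\archive\forecast.doc", "2008-01-15T09:30:00")

print(content_hash(data) == content_hash(b"Quarterly forecast draft"))  # same content
print(copy_a == copy_b)  # same content, different original location
```

Keeping both values lets a collection show that two copies hold identical content (useful for de-duplication) while still distinguishing where each copy was found.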
E-Discovery Search: Analysis Workflow
Diagram labels: Case Document Collection, Search, Search Refinement, Potentially Responsive Documents, Non-Responsive Documents, Potentially Privileged Documents, Privileged Documents, Responsive Documents, Production Documents, Privilege Review, Document Review, Privilege “Misses” Review, Responsive Misses “Recall”, Sampling Engine, Sample Non-Responsive Documents
Search Scope
• Documents
• Emails
Search Technology
• Keywords
• Boolean Search
• Proximity Search
• Fuzzy Search
• Concept Search
• Tagged Search
Search Refinement
• Additional Keywords
• Additional Search Methods
Privilege Search
• Documents
• Emails
Quality Control
• Documents
• Emails
• Tags
Sampling Engine
Confidence Level   Sample Size
95                 1537
99                 66358
Reports
• Search Reports
• Activity Reports
• QC Reports
• Project Review Reports
• Privilege Log
• Exceptions Reports
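The two sample sizes shown for the Sampling Engine match the standard sample-size formula for estimating a proportion, n = z² · p(1 − p) / e², with p = 0.5 and rounded z-scores. The slide does not state the margins of error; ±2.5% at 95% confidence and ±0.5% at 99% confidence are inferred here because they reproduce the figures, and should be read as an assumption.

```python
import math

# Rounded z-scores commonly used for these confidence levels.
Z_SCORES = {95: 1.96, 99: 2.576}

def sample_size(confidence_pct, margin_of_error, p=0.5):
    """Sample size for estimating a proportion in a large population.

    p=0.5 is the most conservative choice (it maximizes p * (1 - p)).
    """
    z = Z_SCORES[confidence_pct]
    return math.ceil(z * z * p * (1 - p) / margin_of_error ** 2)

print(sample_size(95, 0.025))  # inferred margin: plus or minus 2.5%
print(sample_size(99, 0.005))  # inferred margin: plus or minus 0.5%
```

This version ignores the finite-population correction, which has little effect when the collection is far larger than the sample.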