Technology-Assisted Review Can Be More Effective and More Efficient Than Exhaustive Manual Review

Maura R. Grossman, Wachtell, Lipton, Rosen & Katz, mrgrossman@wlrk.com, (212) 403-1391
Gordon V. Cormack, University of Waterloo, gvcormac@uwaterloo.ca, (519) 888-4567 x34450

Watson Versus Jennings and Rutter

Debunking the Myth of Manual Review
The Myth:
- That "eyeballs-on" review of each and every document in a massive collection of ESI will identify essentially all responsive (or privileged) documents; and
- That computers are less reliable than humans in identifying responsive (or privileged) documents.
The Facts:
- Humans miss a substantial number of responsive (or privileged) documents;
- Computers, aided by humans, find at least as many responsive (or privileged) documents as humans alone; and
- Computers, aided by humans, make fewer errors on responsiveness (or privilege) than humans alone, and are far more efficient than humans.

Human Assessors Disagree!
Suppose two assessors, A and B, review the same set of documents. Then:
Overlap = (# documents coded responsive by both A and B) / (# documents coded responsive by A or B, or both)
Example: Primary and secondary assessors both code 2,504 documents as responsive. One or both code 2,531 + 2,504 + 463 = 5,498 documents as responsive.
Overlap = 2,504 / 5,498 = 45.5%.

More Human Assessors Disagree Even More!
Suppose three assessors, A, B, and C, review the same set of documents. Then:
Overlap = (# documents coded responsive by A and B and C) / (# documents coded responsive by one or more of A, B, or C)
Example: Primary, secondary, and tertiary assessors all code 1,972 documents as responsive. One or more code 1,482 + 532 + 224 + 1,972 + 1,049 + 239 + 522 = 6,020 documents as responsive.
Overlap = 1,972 / 6,020 = 32.8%.

Pairwise Assessor Overlap in the TREC 4 IR Task (Voorhees 2000)

Assessor Overlap With the Original Response to a DOJ Second Request (Roitblat et al. 2010)

Assessor Overlap: IR Versus Legal Tasks

What is the "Truth"?
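The overlap measure defined above is the Jaccard coefficient of the assessors' sets of "responsive" calls. A minimal sketch, using synthetic document IDs chosen only to reproduce the counts on the slide:

```python
def overlap(*assessor_sets):
    """Documents coded responsive by ALL assessors, divided by
    documents coded responsive by AT LEAST ONE assessor (Jaccard)."""
    agreed = set.intersection(*assessor_sets)
    combined = set.union(*assessor_sets)
    return len(agreed) / len(combined)

# Synthetic IDs reproducing the slide's counts: both assessors agree on
# 2,504 documents; A alone adds 2,531 and B alone adds 463, so one or
# both code 5,498 documents as responsive.
a = set(range(0, 2504 + 2531))                    # A's responsive calls
b = set(range(0, 2504)) | set(range(5035, 5498))  # B's responsive calls
print(f"{overlap(a, b):.1%}")  # → 45.5%
```

The same function handles the three-assessor case, e.g. `overlap(a, b, c)`, which is how the 32.8% figure for three reviewers arises.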
Option #1: Deem Someone Correct
Deem the primary reviewer's coding as the gold standard (Voorhees 2000).

What is the "Truth"?
Option #2: Take the Majority Vote
Deem the majority vote as the gold standard.

What is the "Truth"?
Option #3: Have All Disagreements Adjudicated by a Topic Authority
Have a senior attorney adjudicate all, but only, cases of disagreement (Roitblat et al. 2010; TREC Interactive Task 2009).

How Good Are Human Eyeballs?
What do we mean by "how good"?
- Recall;
- Precision; and
- F1.

Measures of Information Retrieval
Recall = (# of responsive documents retrieved) / (total # of responsive documents in the entire document collection)
("How many of the responsive documents did I find?")
Precision = (# of responsive documents retrieved) / (total # of documents retrieved)
("How much of what I retrieved was actually responsive, as opposed to junk?")
F1 = the harmonic mean of Recall and Precision.

Recall and Precision

The Recall-Precision Trade-Off
[Chart: Precision (0-100%) plotted against Recall (0-100%). Marked points include "Perfection" (100% Recall at 100% Precision), Blair & Maron (1985), a typical result in a manual responsiveness review, and the TREC Best Benchmark (best performance on Precision at a given Recall).]

How Good Is Manual Review?

Effectiveness of Manual Review

How Good Is Technology-Assisted Review?

What is "Technology-Assisted Review"?

Defining "Technology-Assisted Review"
The use of machine-learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the collection. These technologies typically rank the documents from most to least likely to be responsive to a specific information request. This ranking can then be used to "cut" or partition the documents into one or more categories, such as potentially responsive or not, in need of further review or not, etc.
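The ranking-and-cutoff step just described can be sketched in a few lines. The scores and thresholds below are hypothetical, standing in for whatever relevance scores a real learning system would produce:

```python
# Hypothetical relevance scores produced by a machine-learning system,
# one per document (higher = more likely responsive).
scores = {"doc1": 0.92, "doc2": 0.15, "doc3": 0.66, "doc4": 0.40, "doc5": 0.81}

# Rank documents from most to least likely to be responsive.
ranked = sorted(scores, key=scores.get, reverse=True)

# "Cut" the ranking into categories; the thresholds are illustrative only.
review, questionable, excluded = [], [], []
for doc in ranked:
    if scores[doc] >= 0.75:
        review.append(doc)        # likely responsive: route to reviewers
    elif scores[doc] >= 0.35:
        questionable.append(doc)  # uncertain: in need of further review
    else:
        excluded.append(doc)      # likely non-responsive
print(review, questionable, excluded)
# → ['doc1', 'doc5'] ['doc3', 'doc4'] ['doc2']
```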
Think of a spam filter that reviews and classifies email into "ham," "spam," and "questionable."

Types of Machine Learning
SUPERVISED LEARNING: the human chooses the document exemplars (the "seed set") to feed to the system, and asks the system to rank the remaining documents in the collection according to their similarity to, or difference from, those exemplars (i.e., "find more like this").
ACTIVE LEARNING: the system chooses the document exemplars to feed to the human, and asks the human to make responsiveness determinations; the system then learns from those determinations and applies that learning to the remaining documents in the collection.

Machine Learning Step #1: Achieving High Precision
Document Set for Review
Source: Servient Inc., http://www.servient.com/

Machine Learning Step #2: Improving Recall
Document Set Excluded From Review
Source: Servient Inc., http://www.servient.com/

How Do We Evaluate Technology-Assisted Review?

The Text REtrieval Conference ("TREC"): Measuring the Effectiveness of Technology-Assisted Review
- International, interdisciplinary research project sponsored by the National Institute of Standards and Technology (NIST), part of the U.S. Department of Commerce.
- Designed to promote research into the science of information retrieval.
- The first TREC conference was held in 1992; the TREC Legal Track began in 2006.
- The Legal Track is designed to evaluate the effectiveness of search technologies in the context of e-discovery.
- It employs hypothetical complaints and requests for production drafted by members of The Sedona Conference®.
- For the first three years (2006-2008), documents came from the publicly available 7-million-document tobacco-litigation Master Settlement Agreement database. Since 2009, publicly available Enron data sets have been used.
- Participating teams of information scientists from around the world and U.S.
litigation support service providers have contributed computer runs attempting to identify responsive (or privileged) documents.

The TREC Interactive Task
- The Interactive Task was introduced in 2008 and repeated in 2009 and 2010.
- It models a document review for responsiveness.
- It begins with a mock complaint and associated requests for production ("topics").
- It has a single Topic Authority ("TA") for each topic; teams may interact with the Topic Authority for up to 10 hours.
- Each team must submit a binary ("responsive" / "unresponsive") decision for each and every document in the collection for its assigned topic(s).
- It provides a two-step assessment and adjudication process for the gold standard: where the team and the assessor agree on coding, the coding decision is deemed correct; where they disagree, appeal is made to the Topic Authority, who determines which coding decision is correct.

Effectiveness of Technology-Assisted Review at TREC 2009

Manual Versus Technology-Assisted Review

But!
Roitblat, Voorhees, and the TREC 2009 Interactive Task all used different data sets, different topics, and different gold standards, so we cannot compare them directly. While technology-assisted review appears to be at least as good as manual review, we need to control for these differences.

Effectiveness of Manual Versus Technology-Assisted Review

So, Technology-Assisted Review Is at Least as Effective as Manual Review; But Is It More Efficient?

Efficiency of Technology-Assisted Versus Exhaustive Manual Review
Exhaustive manual review involves coding 100% of the documents, while technology-assisted review involves coding between 0.5% (Topic 203) and 5% (Topic 207) of the documents. Therefore, on average, technology-assisted review is 50 times more efficient than exhaustive manual review.

Why Are Humans So Lousy at Document Review?
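The efficiency comparison above is simple arithmetic: if technology-assisted review codes only a fraction f of the collection, it codes 1/f times fewer documents than an exhaustive manual review. A sketch, using the Topic 203 and Topic 207 fractions from the slide:

```python
# Fraction of the collection coded, per the slide: manual review codes 100%;
# technology-assisted review coded 0.5% (Topic 203) to 5% (Topic 207).
fractions = {"Topic 203": 0.005, "Topic 207": 0.05}

for topic, frac in fractions.items():
    print(f"{topic}: {frac:.1%} coded -> {1 / frac:.0f}x fewer documents reviewed")

# An average of roughly 2% of the collection coded corresponds to the
# 50x efficiency figure quoted on the slide.
print(f"average: {1 / 0.02:.0f}x")
```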
Topic 204 (TREC 2009)
Document Request: All documents or communications that describe, discuss, refer to, report on, or relate to any intentions, plans, efforts, or activities involving the alteration, destruction, retention, lack of retention, deletion, or shredding of documents or other evidence, whether in hard-copy or electronic form.
Topic Authority: Maura R. Grossman (Wachtell, Lipton, Rosen & Katz)

Inarguable Error for Topic 204

Interpretation Error for Topic 204

Arguable Error for Topic 204

Topic 207 (TREC 2009)
Document Request: All documents or communications that describe, discuss, refer to, report on, or relate to fantasy football, gambling on football, and related activities, including but not limited to, football teams, football players, football games, football statistics, and football performance.
Topic Authority: K. Krasnow Waterman (LawTechIntersect, LLC)

Inarguable Error for Topic 207

Interpretation Error for Topic 207

Arguable Error for Topic 207

Types of Manual Coding Errors

Take-Away Messages
- Technology-assisted review finds at least as many responsive documents as exhaustive manual review (meaning that recall is at least as good).
- Technology-assisted review is more accurate than exhaustive manual review (meaning that precision is much better).
- Technology-assisted review is orders of magnitude more efficient than manual review (meaning that it is quicker and cheaper).

Measurement is Key
- Not all technology-assisted review (and not all exhaustive manual review) is created equal.
- Measurement is important in selecting and defending an e-discovery strategy.
- Measurement also is critical in discovering better search methods and tools.
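The Recall, Precision, and F1 measures that underpin this kind of measurement can be computed directly from their definitions. A minimal sketch, using toy, hypothetical document IDs:

```python
def recall(retrieved: set, responsive: set) -> float:
    """Fraction of all responsive documents that were retrieved."""
    return len(retrieved & responsive) / len(responsive)

def precision(retrieved: set, responsive: set) -> float:
    """Fraction of retrieved documents that are responsive."""
    return len(retrieved & responsive) / len(retrieved)

def f1(retrieved: set, responsive: set) -> float:
    """Harmonic mean of Recall and Precision."""
    r, p = recall(retrieved, responsive), precision(retrieved, responsive)
    return 2 * r * p / (r + p)

# Toy example: 5 truly responsive documents, 8 retrieved, 3 in common.
responsive = {1, 2, 3, 4, 5}
retrieved = {3, 4, 5, 6, 7, 8, 9, 10}
print(recall(retrieved, responsive))     # → 0.6
print(precision(retrieved, responsive))  # → 0.375
```

F1 rewards a review strategy only when Recall and Precision are both high, which is why it is a common single-number summary for comparing review efforts.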
Additional Resources
- TREC: http://trec.nist.gov/
- TREC Legal Track: http://trec-legal.umiacs.umd.edu/
- TREC 2008 Overview: http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf
- TREC 2009 Overview: http://trec.nist.gov/pubs/trec18/papers/LEGAL09.OVERVIEW.pdf
- TREC 2010 Overview: forthcoming (April 2011) at http://trec-legal.umiacs.umd.edu/
- Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII:3 Richmond Journal of Law & Technology (Spring 2011) (in press).

Questions?

Thank You!