Can case citations tell us what a legal opinion is about? If so, why not use them to sort legal information?

Submitted by: Eric Dunn and Michael Ruiz
May 31, 2014
Instructor: Professor Vogl and Professor Genesereth
Course: Legal Informatics (Law 729), Spring 2013-2014

Introduction

Justice Oliver Wendell Holmes described lawyering as a practice of “gather[ing] the scattered prophecies of the past” to bear on a particular case.1 In legal opinions these “prophecies” are gathered in the form of citations to precedent – prior cases that provide the basis for an argument. These citations are signals that transmit information in a standard format. For a lawyer, a citation means a particular argument depends upon another. Citations reference a larger “legal network” in which “a web of opinions [are] connected to each other through stacked sets of citations.”2 Our paper seeks to leverage citations to precedent in order to intuit the subject of a legal opinion based on how it taps into these “legal network[s].”

In Part I, we describe why leveraging citations can and should be done automatically. In Part II, we describe the algorithm we developed to automatically categorize legal opinions based on the citations they make to other cases, and we test this approach to see how well it sorts recently decided Federal Circuit cases. In Part III, we discuss the limitations of our initial observations and make recommendations for future research. Our results suggest that algorithms can accurately predict a case's subject matter based purely on the cases cited within the opinion. Applying these algorithms to the growing amount of legal information available to the public will help make legal information accessible – not just available, but useful.

1 Oliver Wendell Holmes Jr., The Path of the Law, 10 Harv. L. Rev. 457 (1897).
2 Jay P. Kesan, David L. Schwartz & Ted Sichelman, Paving the Path to Accurately Predicting Legal Outcomes: A Comment on Professor Chien's Predicting Patent Litigation, 90 Tex. L. Rev. 97 (2012).

Part I: The value of automatically sorting legal information

The amount of legal information available for free has increased, but it will remain inaccessible until it is organized. The first obstacle to obtaining access to the law is being able to gather, as Justice Holmes would say, the precedents and statutes that give the law its shape. While legal information was once “buried in dimly lit basements of federal courthouses,” it is now possible for “anyone with a computer, Internet connection, and credit card” to access legal information.3 For most of the twentieth century Westlaw and Lexis, which provide online access to legal opinions, drove this increase in access.4 In the past decade the competition to provide basic access to legal information has intensified and the cost of access has fallen.5 Large companies, start-ups, and open source projects have exploded the amount of open, available legal information.6

More information is not necessarily better, however. Access to legal information is meaningless if the information is not organized and searchable. Growing the haystack without making the needle easier to find does not increase accessibility; it simply provides access to those who know where to look. This problem has already been solved, at least partially. The primary tools that modern lawyers use to navigate the law combine access to information with tools that make this information accessible. For example, Westlaw and Lexis allow users to search for legal information, quickly summarize it, and discover whether it is still applicable law. This information is sorted by subject, court, terms, and other pieces of metadata. But these tools exist behind a paywall. They are the product of “an army of pricey legal experts [who] manually sift through, summarize, and classify each source before making it available online.”7 Other projects, such as Casetext, have used “crowd-sourcing” to fuel the process of organizing legal information. These projects ultimately still depend on an army of unpaid legal experts who sift, summarize, and classify.8

Resolving the tension between growing the size of available information and making the information accessible need not depend on pricey legal experts. Well-tuned algorithms provide an opportunity to resolve this tension because they make information accessible independent of its size. In other words, finding the needle in a larger haystack merely requires more computing power rather than more manpower.

3 Dru Stevenson and Nicholas J. Wagoner, Bargaining in the Shadow of Big Data, 66 Fla. L. Rev. 22-23 (forthcoming 2014), available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2325137.
4 Paul Lomio, Lecture at Stanford Law School (Apr. 3, 2014), available at http://goo.gl/vbdDTJ.
5 Stevenson, supra note 3, at 25.
6 Id. Additionally, several examples of emerging legal technology illustrate the ongoing revolution. Ravel Law, available at ravellaw.com, provides access to legal opinions and legal analytics. Casetext, available at casetext.com, provides access to legal opinions and “crowd-sourced” tags, which attempt to replicate Lexis and Westlaw headnotes. Google has also made some court documents and legal documents (such as patents) available via Google Scholar, available at http://scholar.google.com. The Library of Congress now maintains a free online database of U.S. statutes, available at http://beta.congress.gov/. PACER, maintained by the U.S. courts, is accessible for a fee, but projects such as RECAP, available at recapthelaw.org, have emerged to make even these documents accessible for free.
Combined with existing tools that cull cases from court websites and statutes from government websites, these tools could grow the pool of available information and automatically sort it. Legal opinions are ripe for automatic sorting because lawyers already pair structured, machine-readable metadata with their arguments. Computer algorithms that analyze language thrive on pattern recognition.9 Although computers do not speak the language of humans, at least not yet, programs can be trained to recognize patterns and leverage those patterns to analyze text. For example, e-discovery tools prove that supervised algorithms, which analyze text (keywords, etc.) and metadata (information about the text), can help sort legal information.10 Scholars have begun to apply these natural language processing approaches to legal arguments,11 but our paper argues that the most useful patterns in legal opinions remain underutilized. It is often said that learning the law is like learning a different language.12 Lawyers learn how to understand and parse legal language, but they also learn another language that lawyers speak.

7 Dru Stevenson and Nicholas J. Wagoner, Bargaining in the Shadow of Big Data, 66 Fla. L. Rev. 24 (forthcoming 2014), available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2325137.
8 During our research we found the available tags at Casetext inadequate. Some cases have well-developed tags, suggesting that increased use could allow Casetext to develop into a valuable legal resource, but many cases, especially critical intellectual property cases, were untagged and unsorted.
9 See generally Christopher Bishop, PATTERN RECOGNITION AND MACHINE LEARNING (Springer-Verlag New York, Inc. 2006).
This language is highly structured, governed by a single set of rules, and largely uniform.13 This “second language” is the language of legal citations – often called “blue booking” among lawyers, scholars, and exhausted law students.14 Legal citations are highly structured signals that link together legal arguments, connecting a legal argument to one that has been made previously. When lawyers “gather the scattered prophecies of the past” they cite to precedent, cases in which a court has previously decided and articulated a legal rule.15 A citation hints that an argument being provided in the current opinion is substantively similar, or identical, to the argument to which the lawyer (or judge) has cited. To allow the reader to easily refer to the previous argument, citations appear in a structured format governed by a universal style manual.16 Figure 1.1 shows an example of a legal citation.17

The value of legal citations is that they offer a way to judge the substance of an argument without reading the opinion or processing unstructured text. Figure 1.1 shows the type of information available in a citation, including the court that decided the case and the year it was decided. To sort the contents of a legal opinion an algorithm need only know some information about a subset of cases. Then, if a future case cites to a “known” case, the algorithm can learn about the new case based on the “known” case. Finding these citations can be done automatically, leveraging the structured format lawyers use to communicate legal information to each other. This approach does not eliminate the need for an “army” of legal experts, but it dramatically reduces the amount of manpower necessary to sort legal information. From a small pool of cases, hand-coded by humans, an algorithm can expand outwards – searching new documents for references to known cases and tagging these cases appropriately.

10 Daniel M. Katz, Quantitative Legal Prediction – or – How I Learned to Stop Worrying and Start Preparing for the Data Driven Future of the Legal Services Industry, 62 Emory L. J. 946 (2011).
11 Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau & Chris Reed, Automatic detection of arguments in legal texts, Address at the 11th International Conference on Artificial Intelligence and Law (June 4, 2007), available at http://dl.acm.org/citation.cfm?id=1276362.
12 William M. Sullivan, Anne Colby, Judith Welch Wegner, Lloyd Bond & Lee S. Shulman, EDUCATING LAWYERS: PREPARATION FOR THE PROFESSION OF LAW 67 (Jossey-Bass, 2007).
13 It is worth noting that the Bluebook provides universal citations in most federal courts, but all courts (especially at the state level) have subtle differences. Nevertheless these variations are systematic. Expanding a citator to read different formats would be a matter of programming specific use cases rather than departing from the value that structured citations offer.
14 The term makes reference to the Bluebook. THE BLUEBOOK: A UNIFORM SYSTEM OF CITATION (Columbia Law Review Ass'n et al. eds., 19th ed. 2010).
15 Oliver Wendell Holmes Jr., The Path of the Law, 10 Harv. L. Rev. 457 (1897).
16 Even where The Bluebook does not provide the universally followed set of citation rules, citations merely follow a subtly different set of uniform standards. These standards would be available on the website of the relevant court. For example, New York courts have specific citation rules which subtly vary from The Bluebook; these rules are available at http://www.courts.state.ny.us/reporter/new_styman.htm.
17 THE BLUEBOOK: A UNIFORM SYSTEM OF CITATION 87 (Columbia Law Review Ass'n et al. eds., 19th ed. 2010).

Part II: Testing an algorithm to sort Federal Circuit cases

Designing the algorithm

The goal of our project was to design a program to parse citations to precedent and use these citations to approximate the subject of a legal opinion.
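The machine-readable structure of citations is what makes this parsing step tractable. As an illustrative sketch (simplified well beyond any production parser, and covering only a handful of reporter abbreviations), a single regular expression can pull the volume, reporter, and page out of running text:

```python
import re

# Simplified pattern for Bluebook-style reporter citations such as
# "550 U.S. 398" or "35 F.3d 1435". Real citations involve many more
# reporters and edge cases; this sketch covers only a few.
CITATION_RE = re.compile(
    r"(?P<volume>\d{1,4})\s+"
    r"(?P<reporter>U\.S\.|S\. Ct\.|F\.2d|F\.3d|F\. Supp\. 2d)\s+"
    r"(?P<page>\d{1,5})"
)

def extract_citations(text):
    """Return (volume, reporter, page) tuples for each citation found."""
    return [(m.group("volume"), m.group("reporter"), m.group("page"))
            for m in CITATION_RE.finditer(text)]

sample = ("The court applied the test from KSR Int'l Co. v. Teleflex Inc., "
          "550 U.S. 398 (2007), and distinguished 35 F.3d 1435.")
print(extract_citations(sample))  # [('550', 'U.S.', '398'), ('35', 'F.3d', '1435')]
```

Because the format is uniform, extending such a pattern to a new court's citation style is a matter of adding reporter abbreviations rather than redesigning the approach.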
In essence, the program finds the propositions used to support a lawyer's argument and leverages its knowledge about those propositions to understand the underlying argument. Given the information contained in case citations and the machine-readable pattern in which case citations appear, our first step was to write code to parse a legal opinion and pull out citations to previous cases. Once we developed a parsing algorithm we needed to “teach” the algorithm about precedent. To do this we used Casetext, a website which provides access to legal opinions and the ability to “tag” legal opinions with descriptive terms. Based on a combination of substantive expertise and Casetext's ability to sort opinions by the number of citations they have received, we collected and tagged thirty “critical cases” to feed into the program. Each page of a case was tagged with descriptive information, such as labels indicating whether the page discussed patents. The critical cases included key cases from the Federal Circuit and the Supreme Court, the two appellate courts with subject matter jurisdiction in patent cases. Figure 2.1 shows an example of a page from Casetext with tags applied in the right-hand margin.

Next, we “trained” the algorithm by integrating the tags applied through Casetext into the program. This allowed us to match pages in the Federal Reporter to substantive tags in a dataset. Leveraging our existing parsing algorithm, we calibrated the program to read a legal opinion, find any references to a page in the Federal Reporter (a citation to a Supreme Court or Federal Circuit case), and match each reference against the dataset described above in order to apply tags from “known” cases. Based on the applied tags, the program calculated whether to classify a case as a patent case or not.18

Testing the algorithm

We tested the efficacy of the algorithm by measuring its accuracy via a binary classification test.
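The matching-and-thresholding step described above can be sketched in a few lines. The citations and tags below are illustrative placeholders, not entries from our actual Casetext dataset:

```python
# Hypothetical mapping from critical-case citations to substantive tags.
# In our project the tags came from Casetext; these entries are
# illustrative stand-ins.
CRITICAL_CASES = {
    "550 U.S. 398": ["patent", "obviousness"],
    "383 U.S. 1": ["patent", "obviousness"],
    "52 F.3d 967": ["patent", "claim construction"],
}

def classify(citations, threshold=1):
    """Count references to tagged critical cases and classify the
    opinion as a patent case if the count meets the threshold."""
    hits = [tag for cite in citations
            for tag in CRITICAL_CASES.get(cite, [])
            if tag == "patent"]
    return "patent" if len(hits) >= threshold else "non-patent"

opinion_citations = ["550 U.S. 398", "389 U.S. 347"]  # second is unknown
print(classify(opinion_citations, threshold=1))  # patent
print(classify(opinion_citations, threshold=2))  # non-patent
```

Repeated citations to the same critical case count separately, so an opinion that cites one critical case twice clears the second threshold, mirroring the trial design described below.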
First, we drew a sample of 100 random legal opinions issued by the United States Court of Appeals for the Federal Circuit in the final quarter of 2013. The Court of Appeals for the Federal Circuit decides more patent cases than any other single U.S. court, with approximately half of its cases being patent cases.19 We chose to limit our sample to this court in particular to ensure that our sample contained a critical mass of patent cases for the algorithm to detect. We examined these cases manually and sorted them into “patent” and “non-patent” categories. We categorized a case as a patent case if the legal issue was substantially related to patent subject matter. Opinions based overwhelmingly on procedural grounds without discussion of underlying patent issues were categorized as non-patent cases even if the original complaints were based on patent claims. These cases might contain useful general precedent regarding subjects like venue change or court filing deadlines, but the lack of reference to patent doctrine or patent issues grants them little value to an attorney searching for patent cases. After adjusting for errors (explained below), our final sample contained 38 patent cases and 51 non-patent cases.

After manually sorting the cases, we compared the algorithm's categorization for each case against our manual categorization. By comparing the differences we determined how often the algorithm correctly identified what we determined to be patent cases. We ran the binary classification test for the algorithm under two different thresholds.

18 The program has additional functionality not explored in this project. Essentially, it is able to carry forward all the tags and make more complicated conclusions – such as whether to apply test-specific tags.
19 United States Court of Appeals for the Federal Circuit, Appeals Filed, by Category FY 2013, available at http://www.cafc.uscourts.gov/the-court/statistics.html.
In the first trial (threshold = 1), the algorithm categorized a case as “patent” if the body of the text contained at least one reference to patents. For example, if the opinion cited a “critical” case it was assigned as a “patent” case under this first threshold. In the second trial (threshold = 2), the body of the text had to contain at least two references to patents. For example, if the opinion cited a critical case multiple times, cited at least two critical cases, or cited a page in a critical case that had been tagged as especially relevant to patents, the case was designated as a “patent” case under this second threshold.

Results

Five cases were removed from the analysis for having no citations. These opinions were short orders or affirmances without fully written decisions, and no cases were cited in the body of the text. We removed these cases from the analysis because they were not the type of case we were testing for, since they have little precedential value. An additional 6 cases had to be removed for technical reasons: for these cases, either the URL was unstable and produced an error or the webpage was incompatible with the algorithm and produced an error.

The first trial (threshold = 1, N = 89) tested at .75 accuracy with a negative predictive value of .70 and a positive predictive value of 1. See Figure 2.2 for a breakdown of individual case results.

Figure 2.2

                  Algorithm Test Outcome
Patent Case       Positive    Negative
Positive          16          22
Negative          0           51

The second trial (threshold = 2, N = 89) tested at .67 accuracy with a negative predictive value of .63 and a positive predictive value of 1. See Figure 2.3 for a breakdown of individual case results.

Figure 2.3

                  Algorithm Test Outcome
Patent Case       Positive    Negative
Positive          9           29
Negative          0           51

Figure 2.4 displays the results of trial 1 as a proportion of the total sample: true positives 16 (18%), false negatives 22 (25%), true negatives 51 (57%), and false positives 0 (0%).
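The reported accuracy and predictive values follow mechanically from the confusion matrices above. As a quick sketch (standard binary-classification formulas, not code from our system), the figures can be reproduced:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, positive predictive value (PPV), and negative
    predictive value (NPV) from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return accuracy, ppv, npv

# Trial 1 (threshold = 1): 16 true positives, 22 false negatives,
# 0 false positives, 51 true negatives (N = 89).
acc1, ppv1, npv1 = binary_metrics(16, 22, 0, 51)
print(round(acc1, 2), ppv1, round(npv1, 2))  # 0.75 1.0 0.7

# Trial 2 (threshold = 2): 9 true positives, 29 false negatives.
acc2, ppv2, npv2 = binary_metrics(9, 29, 0, 51)
print(round(acc2, 2), ppv2)  # 0.67 1.0
```

With zero false positives, PPV is 1 in both trials by construction; accuracy and NPV fall at the higher threshold because true positives are converted into false negatives.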
At a threshold value of one, the algorithm correctly categorized 16 of the cases as patent cases (true positives) and 51 cases as non-patent (true negatives). The algorithm also miscategorized 22 cases as non-patent cases when they were in fact patent cases (false negatives). In proportionate terms, the algorithm correctly identified 75% of all cases but misidentified 58% of the pool of patent cases. The algorithm set to a threshold value of two performed strictly worse than the threshold-one trial: the number of false negatives rose to 29, with the algorithm correctly tagging only 24% of patent cases. Unsurprisingly, the number of false positives remained zero. Given these results, it is clear that a threshold value of one critical case is a more accurate filter for the algorithm under our trial conditions.

The algorithm successfully categorized many of the cases, but there is still room for improvement. Even at a threshold value of 1 the algorithm missed over half the patent cases in the sample, meaning that those patent cases did not cite any of the critical cases that the algorithm is built to recognize. Despite these errors, it is encouraging that the algorithm did not produce any false positives. Although the sensitivity of our algorithm is lower than we would prefer, its reliability is much higher than expected. The algorithm may have missed several patent cases, but it was correct about every case that it did categorize as a patent case. This indicates that none of the non-patent cases cited critical patent cases in their opinions. If this pattern holds, then our automated case tagging system has only a minimal chance of being thrown off by cases citing precedent from other bodies of case law. This is an important success because it suggests that our automated sorting system can be used without fear of erroneously tagging cases.
At worst, the algorithm will leave some cases untagged that would have remained untagged in the absence of our automated system.

Part III: Limitations on our test and future adjustments

Our conclusions are limited by a small critical case pool and the narrow conditions of our initial test. The primary limitation on the accuracy of the sorting algorithm is the small number of critical cases that the system is based on. The 22 patent cases that were erroneously flagged as non-patent in our test contain citations to other patent cases, but not to the critical cases that our algorithm relies on. If we build up the underlying web of critical cases, we should see a corresponding increase in the number of patent cases that our sorting system catches.

A second limitation is that our sample was drawn from a single court that hears only limited types of cases. The Court of Appeals for the Federal Circuit has subject matter jurisdiction over several areas of law, but the majority of its cases are patent and administrative law cases. While our sample is an accurate representation of the cases before the Court of Appeals for the Federal Circuit, the population from which our sample is drawn is not necessarily a good representation of the larger body of case law across different courts. It is possible that our sorting algorithm will be less accurate at separating patent cases from types of cases not present in our sample, such as tech transactions or licensing contract disputes, or even unexpected case types such as civil rights or general torts cases.

As we move toward applying the algorithm to different case types, another possible limitation is that the lack of false positives in this test may be unique to patent law. Our trial did not produce any false positives because the non-patent cases in our sample did not cite our critical patent cases, but this effect may depend on an insular quality of patent law.
Patent law is a highly technical and specific body of law, so it makes sense that other types of cases would have little use for findings from such a specialized legal field. This might not be the case for more general bodies of law, such as tort. For example, one might imagine that an opinion issued in a contract case discussing negligence might cite a seminal tort case on the nature of the negligence standard. Under our current model, our algorithm would erroneously flag this contract case as a tort case based on that citation – at least at the low threshold used in our first trial. One possible solution to this problem is adjusting the algorithm's threshold values based on the type of tag we want to apply; a higher threshold will have less of an effect on false negatives as the pool of critical cases grows. Additionally, layering page-specific tags will help address this problem: if a human can distinguish the parts of an opinion that reflect one area of law from another, then the algorithm can trace a citation not just to a specific opinion but to a particular page in that opinion. Expanding our critical case pool and moving into additional bodies of law will allow us to sort more varied and more granular information.

Our next steps involve expanding the scope of our proposed sorting system by addressing the limitations of this first test. We first have to expand our selection of critical opinions for patent case identification before establishing pools of critical cases in other bodies of law. The algorithm's sorting accuracy and its optimal threshold value will have to be tested separately for each area of law to ensure that our automated system remains an effective tool for categorizing types of cases other than patent. After collecting a robust selection of critical cases, we can work to create more granular tags beyond simple categorization of case types.
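The combination of per-tag thresholds and page-level tags can be sketched briefly. Every page, tag, and threshold value below is hypothetical, chosen only to show how a single stray citation to a broad body of law would no longer trigger a tag:

```python
# Hypothetical page-level tag data: (reporter, page) -> tags. A citation
# to one page of an opinion can carry different tags than a citation to
# another page of the same opinion. All entries are illustrative.
PAGE_TAGS = {
    ("56 F.3d", "1538"): ["patent"],
    ("56 F.3d", "1543"): ["patent", "claim construction"],
    ("398 U.S.", "235"): ["tort"],
}

# Per-tag thresholds: a broad, frequently cross-cited area such as tort
# might require more corroborating citations than an insular field.
THRESHOLDS = {"patent": 1, "tort": 2}

def tag_opinion(cited_pages):
    """Apply every tag whose citation count meets that tag's threshold."""
    counts = {}
    for page in cited_pages:
        for tag in PAGE_TAGS.get(page, []):
            counts[tag] = counts.get(tag, 0) + 1
    return {tag for tag, n in counts.items()
            if n >= THRESHOLDS.get(tag, 1)}

# A contract opinion citing one seminal tort page is not tagged "tort",
# because a single hit falls below tort's higher threshold.
print(tag_opinion([("398 U.S.", "235")]))  # set()
```

Under this design, the false-positive scenario described above (a contract case citing one tort precedent) is filtered out, while a single citation to a patent-tagged page still suffices in the insular patent field.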
Because our algorithm bases its conclusions on case citations, the system could theoretically identify cases that are often cited for specific propositions. Ideally, our automated system will be able not only to categorize the type of case that it is reading but also to link often-cited cases with particular doctrines so that specific passages can be categorized as well. For example, the algorithm might tag a case generally as a Title VII case while also tagging specific pages or paragraphs with doctrines specific to Title VII cases such as “tiers of scrutiny.” In experiments with our algorithm we were able to successfully deploy this approach in the area of patent law, tagging specific passages not only as “patent” but with tags such as “patentable subject matter” – refining our algorithm to recognize where a test was used in an opinion.

One step we could take to improve the functionality of our system is to have the system automatically integrate the cases it tags into its base of critical cases. We did not employ this approach for this trial because we wanted to strictly test whether citation categorization was useful in the first instance. If done successfully, every case that our system reads and tags as a patent case, for example, would immediately be used to identify future patent cases that cite it. This machine learning would improve accuracy by increasing the size of the critical case pool, and it may also keep the system up to date as more recent legal opinions are issued. We will have to approach this form of machine learning with caution, however, because any cases that are erroneously tagged and placed into the critical case pool could perpetuate other erroneous tags.
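The self-expanding pool described above amounts to a fixed-point computation over a citation graph: keep tagging cases that cite enough already-tagged cases until nothing changes. A sketch, with an illustrative graph rather than real cases:

```python
# Sketch of the bootstrapping idea: newly tagged cases join the critical
# pool so that cases citing them can be tagged later. The citation graph
# and case names are illustrative. Note that an erroneous early tag would
# propagate through this loop, which is why the step warrants caution.
def bootstrap(critical, citations_of, frontier, threshold=1):
    """Repeatedly tag cases that cite at least `threshold` critical
    cases, folding each newly tagged case back into the pool."""
    critical = set(critical)
    changed = True
    while changed:
        changed = False
        for case in frontier:
            if case in critical:
                continue
            hits = sum(1 for c in citations_of.get(case, []) if c in critical)
            if hits >= threshold:
                critical.add(case)
                changed = True
    return critical

graph = {
    "case_B": ["case_A"],   # cites a seed critical case
    "case_C": ["case_B"],   # cites only the newly tagged case
    "case_D": ["case_X"],   # cites nothing in the pool
}
pool = bootstrap({"case_A"}, graph, ["case_B", "case_C", "case_D"])
print(sorted(pool))  # ['case_A', 'case_B', 'case_C']
```

Here case_C is only reachable because case_B was folded back into the pool, which is exactly the accuracy gain, and the contagion risk, discussed above.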
It is possible that a few initial false positives could spiral outward into more false positives once those tags are used to identify further cases, seriously hampering the algorithm's accuracy in a systemic fashion. Currently, the lack of false positives in our initial test lessens these concerns, but it is a possibility we will have to monitor.

Conclusion

Legal data is increasingly available, but the sheer volume of legal information becoming available makes it unwieldy to use. As with any large body of data, properly sorting and labeling this data is an essential first step before useful analysis becomes possible. Unfortunately, the sheer amount of legal text produced every year requires a large number of legal professionals to read and annotate the massive body of information. As an alternative to costly legal analysts, we propose an automated system that reads, categorizes, and labels legal opinions. If we are to depend on automatic algorithms to cull legal information, we must develop automatic algorithms to sort this information as well.

Our initial test shows that an automated system built to categorize legal opinions based on the recognition of case citations can accurately sort legal information. Our algorithm correctly labeled 75% of the cases in our sample as patent or non-patent, and we expect this accuracy rate to improve as we add more cases to the critical pool of citations that the algorithm recognizes. Additionally, our algorithm was 100% correct whenever it classified a case as a patent case. This lack of false positives is important because it suggests that while our automated classification system may miss some cases and leave them untagged, it does not erroneously classify a case with the wrong tag. If this trend holds, our automated system's positive tags can be relied on as a completely accurate way to categorize large bodies of cases.
While our test results are somewhat limited by the homogeneity of the population from which our sample was drawn and by our exclusive focus on patent cases, our findings suggest that our automated method is already a useful tool for sorting data. Ideally, our system will eventually be able to tag specific doctrines within legal opinions and automatically incorporate newly tagged cases into its own underlying web of critical cases. Regardless of these possible future features, we are confident that our current approach provides an efficient and accurate way to categorize legal information.