Record Linkage, Text-Mining, and e-Discovery Paul Thompson Dartmouth College A primary focus of the TREC Legal Track is to determine whether more advanced information retrieval technologies can better support the e-discovery process than Boolean retrieval, which is the predominant form of information retrieval in use in the legal domain. The ICAIL 2007 DESI workshop is exploring whether artificial intelligence technologies can support e-discovery better not only than Boolean, but also better than statistical ranked retrieval. This paper considers the problem of e-discovery in a broader context. First, Dartmouth’s Language Technologies Group plans to build a retrieval system incorporating advanced information extraction technology, in particular cross-document co-reference of named entities extracted from text. Similar technology has also been considered in the domain of national security, e.g., see the 2005 SIAM International Conference on Data Mining Workshop on Link Analysis Counterterrorism and Security Analysis (SIAM 2005 ), where workshop participants were encouraged to use the Enron e-mail dataset as a surrogate for a counterterrorism database. One technology that has been used for link analysis was provided in a commercial product, Non-Obvious Relationship Analysis (NORA) of SRD. This product was developed originally to detect fraudulent collusion in the gaming industry in Nevada. SRD was since acquired by IBM. In a 2006 presentation to the National Academy of Sciences Jeff Jonas, formerly of SRD, now at IBM, described a way to perform record linkage with encrypted data, thus preserving privacy (Jonas 2006). Other users of NORA, however, claim that record-linkage effectiveness would be degraded were analysis to be done with encrypted data. Dartmouth’s approach to record linkage and text mining for e-discovery will also include process query system technology to model dynamic relationships among entities in text (Cybenko and Berk 2007, Patman and Thompson 2003, Thompson 2005) In future e-discovery, when the new retrieval methods pioneered by the TREC Legal Track, the ICAIL community, or the commercial e-discovery sector have matured, protecting privacy in e-discovery may be even more important than it is now. For example, question answering, text mining, or multimedia data mining technologies may be used to sample documents, broadly defined, from the total information repository of an organization. Our research proposes to compare the performance of a retrieval system augmented by record linkage and text mining capabilities on the TREC Legal Track corpus under two conditions. The same retrieval system will be used first with unencrypted data and then with encrypted data. The degradation in performance, if any, will be measured. References SIAM. 2005. SIAM International Conference on Data Mining Workshop on Link Analysis Counterterrorism and Security Analysis http://www.cs.queensu.ca/~skill/proceedings/ Cybenko, G. and Berk, V. 2007. “Process Query Systems” IEEE Computer, January 2007, p. 62-70 Jonas, J. 2006. “The Information Sharing Paradox And Thoughts on Maximizing Discovery while Minimizing Disclosure” presentation to the National Academies Project on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals Patman, F. and Thompson, P. “Names: A New Frontier in Text Mining” NSF / NIJ Symposium on Intelligence and Security Informatics, Lecture Notes in Computer Science, vol. 2665, Berlin: Springer-Verlag, June 1-3, 2003, Tucson, Arizona, p. 27-38 Thompson, P. “Text Mining, Names, and Security” Journal of Database Management, vol. 16, no. 1, January-March, 2005, special issue on database technology for enhancing national security, p. 54-59