Mining Patterns for Document Retrieval

by Karin Cheung

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2001

Copyright 2001 Karin Cheung. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, January 25, 2001
Certified by: Rakesh Agrawal, VI-A Company Supervisor
Certified by: Kenneth Haase, M.I.T. Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Mining Patterns for Document Retrieval

by Karin Cheung

Submitted to the Department of Electrical Engineering and Computer Science on February 6, 2001, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

The task of information discovery in text databases becomes increasingly complex as larger corpora are used. This thesis presents ways of creating a richer understanding of the relationships between documents in such corpora by automatically assembling a collection of cross-references for any document in the corpus, through the discovery of word patterns. More specifically, the system explores the use of data mining technologies, in particular association and sequential mining, to identify features for text categorization.
Acknowledgements

First and foremost, I would like to thank the entire Data Mining group of the IBM Almaden Research Center for their insights and assistance during these past several months. Thanks especially to Rakesh Agrawal, whose patience and understanding allowed me to familiarize myself with the field of data mining and ultimately decide what I personally wanted to accomplish in this thesis. To my mentor Daniel Gruhl, a big thanks for answering countless questions, reading numerous thesis drafts, and always looking out for me. Thanks also to Ramakrishnan Srikant for his invaluable insights on data mining and his always-helpful advice. On the MIT side, I would like to thank my thesis advisor Kenneth Haase for his guidance. Thanks, Ken, for allowing me to follow my interests. To Jack Driscoll, thanks for giving me a new perspective on the newspaper industry. Thanks also to Walter Bender for his support and interest in my work. I would also like to thank the MIT VI-A Internship Program, for allowing me the wonderful and unique opportunity of conducting my research jointly at ARC and the MIT Media Laboratory. Lastly, thanks to Salvador Alvarez for his ever-present support and encouragement. And thanks ultimately to my family, who have taught me everything I ever needed to know.

Contents

1 Introduction
  1.1 Motivation for the Ariel system
    1.1.1 Reader Assistance
    1.1.2 Newspaper Creation
  1.2 Thesis Goals
  1.3 Definitions
  1.4 Organization
2 Prior Research
  2.1 News of the Future
  2.2 The Search for Similarity
    2.2.1 Traditional IR Methods
    2.2.2 Current IR Precision/Recall Results
    2.2.3 Using Phrases in Text Retrieval
  2.3 Data Mining
    2.3.1 Data Mining Techniques
    2.3.2 Applications to Text Databases
3 Mining Story Patterns
  3.1 Motivations
    3.1.1 Shared Writing Styles
    3.1.2 Time Clusters
    3.1.3 Topic Recognition
  3.2 A Story Pattern
  3.3 Generating Patterns
    3.3.1 A Formal Problem Statement
    3.3.2 The Algorithm
    3.3.3 Results of Pattern Generation
  3.4 The Search for Related Stories
  3.5 Other Considered Techniques
    3.5.1 Looking at Sequential Patterns
    3.5.2 General Corpus Trends
    3.5.3 Conclusions
4 Existing Resources
  4.1 The Milan News System
    4.1.1 Gathering and storing information
    4.1.2 Making it useful - experts and the blackboard system
  4.2 The Vinci System
  4.3 Milan Services
  4.4 Where Ariel Fits
5 Ariel: Design and Implementation
  5.1 General System Flow
  5.2 Story Analysis
  5.3 Pattern Mining
  5.4 Indexing for Searches
  5.5 A Ranking Schema
  5.6 The User Interface
  5.7 Conclusions
6 An Evaluation of Ariel
  6.1 Experimental Protocol
    6.1.1 Graph Characteristics
    6.1.2 Performance Considerations
  6.2 Milan Experiment Results
  6.3 Reuters Experiment Results
  6.4 Final Observations
7 Conclusions and Future Work
  7.1 Contributions
  7.2 Future Work
    7.2.1 Incorporating Concepts into Patterns
    7.2.2 Recognizing Topic Progression
    7.2.3 Story Summarization
    7.2.4 Story Merging
    7.2.5 Expanding the Scope of Ariel

List of Figures

2-1 Statistical analysis on text is traditionally done by first creating a document-word matrix. Each element represents the number of occurrences of the corresponding word in the given document.
3-1 An example of pattern generation using frequent itemsets.
3-2 Pattern generation for the Hubble Space Telescope article.
3-3 Pattern generation for the Camp David article.
3-4 Pattern generation for the Al Gore article.
4-1 Milan is a personalized information portal for the Vinci system.
4-2 A frame describing an article on the Vietnam Memorial, represented both graphically and in XML format.
4-3 A graphical representation of the Vinci system.
4-4 An illustration of the Milan system, which is built on top of the Vinci base services. Milan relies on several other Vinci services, including Ariel and various news channels.
5-1 Overall system diagram for the approach presented in this thesis.
5-2 A Milan news article.
5-3 A Milan page displaying the top stories of the "Reintroducing Gore" article.
6-1 Precision/Recall graph for Article 1.
6-2 Precision/Recall graph for Article 2.
6-3 Precision/Recall graph for Article 3.
6-4 Precision/Recall graph for Article 4.
6-5 Precision/Recall graph for Article 5.
6-6 Precision/Recall graph for Article 6.

List of Tables

2.1 Recall Level Precision Averages for TREC-8 Ad Hoc Retrieval.
3.1 The preliminary testing environment.
3.2 Preliminary Precision/Recall Evaluation for the Hubble Space Telescope article.
3.3 Preliminary Precision/Recall Evaluation for the Camp David article.
3.4 Preliminary Precision/Recall Evaluation for the Al Gore article.
3.5 Precision/Recall values for articles retrieved using association patterns and sequential patterns.
3.6 An experiment in discovering general corpus trends using pattern mining techniques.
5.1 A sample dictionary mapping each word to its unique numeric identifier.
6.1 The chosen articles for this testing environment.
6.2 A comparison of average precision values between Milan and Reuters pattern mining and cosine similarity results.

Chapter 1
Introduction

October 10, 2000. Kellie, a college student at Berkeley, has just returned home from a long day of classes. She drops her books on the bed, grabs an apple from the fridge, and sits down at her computer to read the daily news. Browsing idly through the articles, she notices a headline: "U.S. Focus Turns to Middle East Violence". She clicks on the link to read the article, and learns of President Clinton's urgent attempt to negotiate a cease-fire between Israel and Palestine. She recalls hearing about the recent uprising in the Middle East, but cannot remember what sparked the conflict, or why the Israelis and Palestinians have always been so hostile to each other. Curious, Kellie decides to call up Ariel, a computer software agent that searches for news stories similar to the current one. Within moments, the current article is statistically analyzed for frequent word patterns and then matched with other stories in the corpus. A new window pops up on the computer screen:

Clinton Follows Carter's Footsteps to Camp David, Process Begun in '78
Arab Uprising Spreads to Israel; Death Toll Rises to More Than 30
Middle East Clash Takes U.S.
Role to a New Height
Middle East Violence May Push Parties Back to Peace Talks
Arafat and Barak Agree to Emergency Summit Meeting

Browsing through the list, Kellie can now read through articles regarding the Middle East violence. Here she finds the key issues of the Camp David summit, the outbreaks of violence, and the attempts to reach a peace agreement. Ariel can perform a similarity analysis on any news article requested by the user, including those retrieved via a previous query. If Kellie wants to see more articles about Camp David, for example, she can ask Ariel to retrieve stories similar to the article "Clinton Follows Carter's Footsteps to Camp David".

1 These are actual results from the Ariel system given the article "U.S. Focus Turns to Middle East Violence" from the Milan news corpus.

1.1 Motivation for the Ariel system

1.1.1 Reader Assistance

Trying to keep up with current events is a time-consuming task. People want to be informed about the events that affect them, but at the same time the average person has many other responsibilities and interests. As a result, only a small percentage of the day can be spent on the news. The Ariel system allows more time to be spent on background information and less time on searching for it. More specifically, Ariel introduces a new, efficient method of discovering similar articles in a news corpus, using statistical data mining techniques. Integrated with the personalized newspaper Milan, Ariel can find related news stories for any article requested by the user. This automated system eliminates the need for cumbersome query-directed searches, and allows readers to make better use of their time. For major events, such as coverage of the Middle East crisis, news providers occasionally include background articles for the topic.
The CNN article about an emergency Middle East summit might have links to several detailed documents, including biographies of the Israeli and Palestinian leaders, an explanation of their perspectives and demands, and an account of Camp David's decades of peace efforts. Similarly, an article on the 2000 Olympic Games in Sydney might include a link to pictures of the opening ceremonies, a list of athletes and their performances, and a story of the Olympic Games' origins. These documents give the reader a good overall summary of the event, and leave the reader with a well-established foundation with which to better understand the current article. However, since these précis are compiled and summarized by human editors, only a handful of events can receive such careful attention. For example, a reader who cares about commercial airline news might want information on recent airline mergers, employee strikes, and competitive pricing. Another reader might be interested in the latest fashions, desiring reports from top clothing designers, tips on interior decorating, and secret beauty treatments. A researcher may want articles on the effect of natural disasters on the chip industry. Or, a child could demand stories on Pokemon trading cards. A news system that relies on the skills of humans certainly cannot accommodate all possible topics of interest. Thus only those stories widely held in public interest are maintained historically. Less popular topics are often left as standalone articles, with little or no direction toward similar articles. When a reader is strongly motivated to find related articles, however, there is always the possibility of querying a search engine. For example, if no Ariel system existed, Kellie could search the CNN web site for articles similar to "U.S. Focus Turns to Middle East Violence".

2 The articles shown here are actual results retrieved from the CNN.com search utility.
Using the article title as the search phrase, the system returns 4 articles:

Building a New House (27-Sept-97)
Oklahoma Bombing Trial (16-Dec-97)
Oklahoma Bombing Trial (03-Nov-97)
Oklahoma Bombing Trial (26-Nov-97)

Kellie decides to try again, this time using "Middle East Violence" as the search phrase. As a result, CNN.com retrieves 595 articles about the Middle East crisis. While some of these articles are useful, many are completely unrelated. The articles range from "Palestinian violence halts Barak's talks with Clinton (21-May-00)" to "Peru's President pulls off another coup (06-Dec-99)". While searching for information can produce useful results, it can also be a tedious and frustrating task. Search engines sometimes return a plethora of articles, many of which are unrelated to the reader's interests. Other times they return nothing at all. Neither of these results is very appealing. It is therefore not surprising that users prefer articles chosen by human editors rather than those chosen by Boolean searches.

1.1.2 Newspaper Creation

The Ariel system assists readers in finding background information for any given article, so that they can learn about a topic more effectively than traditional means have allowed. However, while this thesis is targeted largely towards news consumers, journalists and editors could also benefit from such a system. Before writing an article, a writer will frequently search for stories similar to the one being written, to see how other journalists have approached the story. In many media organizations, including Wired.com, this research is conducted by scanning through stories from their own archives using simple Boolean searches [12]. With the Ariel system integrated into the newspaper archives, no cumbersome searches are needed. The writer simply chooses one article, and asks Ariel to find stories related to it. Within seconds, Ariel can scan through the archives and retrieve the requested articles.
While some stories are written by a newspaper's own journalists, many articles come in from other sources. A large newspaper company, such as The Boston Globe, has approximately 10 outside news sources, including the Associated Press, United Press, Reuters, and Knight-Ridder [12]. Frequently, several versions of the same story exist, written by journalists from different news sources. A newspaper editor must sort through these stories to determine which are important enough to be printed. An editor may choose a version of the story directly from the news source, or perhaps combine two versions into one. Approximately three hours of an editor's time is spent in this filtering and selection process for a single newspaper issue [12]. Ariel can help editors by processing stories as soon as they enter the news system. By simply choosing a single article, the editor can have Ariel retrieve similar stories without sorting through them by hand.

1.2 Thesis Goals

This thesis is concerned with the development and evaluation of an autonomous news system that helps a reader better understand a news topic, without being limited to the scarce resource of human editors and the frustrating option of search engines. This system, named Ariel, analyzes a story and provides the reader with a collection of related articles. To determine which articles are most relevant, Ariel searches for features, which we call story patterns, within a given article, and then matches those features to other articles in a news corpus. We will refer to this retrieval technique as pattern mining. Story patterns are patterns of words that appear frequently among articles pertaining to the same news topic. This thesis illustrates that the topic of a news article can be revealed through an analysis of the patterns displayed throughout the article.
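The idea of mining frequent word patterns inside an article can be sketched with a generic frequent-itemset miner, treating each sentence as a "transaction" of distinct words. This is only an illustration under assumed parameters (support threshold, maximum pattern size, no stopword removal); the actual algorithm Ariel uses is the subject of Chapter 3.

```python
from itertools import combinations
from collections import Counter

def frequent_word_sets(sentences, min_support=2, max_size=3):
    """Find sets of words that co-occur in at least min_support sentences.
    Each sentence is treated as a transaction of distinct words."""
    transactions = [set(s.lower().split()) for s in sentences]
    vocab = set().union(*transactions)
    patterns = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            for combo in combinations(sorted(t & vocab), size):
                counts[combo] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        if not frequent:
            break
        patterns.update(frequent)
        # Apriori-style pruning: only words appearing in some frequent
        # set of this size can take part in larger frequent sets.
        vocab = {w for combo in frequent for w in combo}
    return patterns

# Hypothetical mini-article, three sentences:
sentences = [
    "the telescope captured new images",
    "the hubble telescope images amazed astronomers",
    "astronomers studied the images",
]
patterns = frequent_word_sets(sentences, min_support=2)
print(patterns[("images", "telescope")])  # 2: the pair co-occurs in two sentences
```

In practice a stopword filter would keep uninformative words such as "the" out of the patterns; it is omitted here to keep the sketch short.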
The analysis of story patterns is performed using statistical data mining techniques, namely association and sequential mining [5][7][8][45]. This thesis explores the use of both association mining and sequential mining techniques to discover these patterns. As a result, two different uses of patterns are considered in this research. One usage involves searching for common patterns of words within a single article, while the other involves searching for patterns throughout the entire news corpus. This thesis will demonstrate that the former technique allows the construction of unique story descriptors, while the latter retrieves far less interesting patterns. Once these features are identified, the system cross-references articles with similar patterns, and stores these results with the article in the news database. Thus, when a user requests related stories for a particular article, Ariel simply refers to its database to determine which articles may provide the most useful information. By analyzing each story before a user request arrives and storing the results in a database, the system can employ fairly computation-intensive techniques without forcing the user to wait through the processing time. Miller's work has shown that a 2-second response time is the limit for a user's flow of thought to stay uninterrupted [32]. For longer delays, the user will want to engage in other tasks while waiting for the computer to finish. Pre-computation prevents this time constraint from limiting the quality of retrieval.

1.3 Definitions

In this discussion, a news article refers to a single piece of writing about an event in the news. A news story, on the other hand, is considered a collection of articles about the same topic. For example, the news story on the 2000 Olympic Games includes the news articles "An Olympic ode from down under", "The Drug Games: the legacy Sydney doesn't want", "First Sydney gold for Marion Jones", and "The Olympics conclude".
A news topic is a single descriptive word or phrase about a given news story. In this case, the topic might be "The 2000 Olympic Games". A story pattern refers to a selection of words that identifies an article as belonging to a particular news topic.

1.4 Organization

Chapter 1 - Introduction
Chapter 2 - Prior Art
Chapter 3 - Mining Story Patterns
Chapter 4 - Existing Resources
Chapter 5 - Ariel: Design and Implementation
Chapter 6 - An Evaluation of Ariel
Chapter 7 - Conclusions and Future Work
Chapter 8 - References

Chapter 2
Prior Art

This thesis incorporates ideas from online news, information retrieval, and data mining research. This section provides an overview of related research done in these particular fields, much of which has influenced the development of this thesis.

2.1 News of the Future

The realm of online news has flourished in recent years, with services such as CNN.com and my.Yahoo.com becoming more popular every day. In fact, the percentage of Americans getting their news from the Internet at least once a week has grown from 20% in 1998 to 33.3% in 2000 [36]. Furthermore, approximately 99% of the nation's largest newspapers now have an online presence. There are more than 4,000 newspapers online worldwide, half of them from the United States. As online news sources are becoming more widely read, there has been a growing interest in creating a computer system that can develop meaning in a corpus of news. These systems would be able to choose relevant articles based on an understanding of article and user contexts, to assist the work of human editors. The News in the Future research consortium provides a forum for the MIT Media Lab and member companies to explore technologies that will affect the collection and dissemination of news [35]. NiF's work in image, text, and audio understanding is improving the machine understanding of text and enabling computer systems to better understand the context of news.
Furthermore, in terms of information retrieval, the consortium's work in managing data has contributed greater precision levels to database queries and filtering. FramerD is a distributed object-oriented system designed for the maintenance and sharing of knowledge bases [23]. This system, developed by Kenneth Haase and the Machine Understanding Group of the Media Lab, is a prevalent database tool used in NiF applications. The system incorporates knowledge base applications such as a multi-lingual ontology and a sentence parser. At the same time, FramerD is a frame-based transport system based on d-types [1]. One application built on FramerD is PLUM, a system that adds contextual information to disaster news stories [13]. PLUM uses augmented text systems to combine the understanding of news with the understanding of the reader's context. The result is the original news article, supplemented by facts relating the disaster to the user's own community. Such annotations may include "The affected area is about the size of Cambridge," or "Sorghum is 75% of the country's agricultural output". The use of augmented news has also been prevalent in online newspaper developments. MyZWrap and Panorama are two personalized newspapers, developed at the MIT Media Lab and IBM Almaden Research Center respectively, which use augmented text to develop meaning in news [19]. The goal was to develop a system that could assist, simplify, and automate the types of editorial decisions that a human editor faces. The personalized news portal Milan emerged from these two systems, and is currently the context used by Ariel. Milan supplies Ariel with a news corpus and an online newspaper to work with, as well as the functionality to display news articles for the reader. Further details on Milan will be discussed in Chapter 4. Some researchers have also experimented with integrating different news media into a single news system.
For example, Bender's work in Network Plus is an investigation in combining news wire services with network television news [9]. The resulting model is a joint television viewing by both the consumer and a personal computer. The computer system analyzes the incoming television broadcast, and retrieves further data from both local and remote databases. The resulting data may then be used to augment the current television broadcast, or to create an augmented newspaper from the broadcast. Similar research has been done at the MITRE Corporation in recent years. At AAAI-97, they presented the BNA and BNN systems (Broadcast News Analysis and Broadcast News Navigator, respectively) [31]. The system develops a broadcast news video corpus, and provides techniques such as story segmentation, proper name extraction, and visualization of associated metadata.

2.2 The Search for Similarity

While this thesis arose from interests in contextualizing news, the foundations of this work lie in information retrieval. Simply put, this thesis illustrates a technique for document retrieval based on a user query. Every information retrieval system consists of information items, a set of requests, and some mechanism for determining which, if any, of the items meet the requirements of the requests [41]. This thesis shows exactly these elements. The items, or documents, are the news articles of the news corpus. A request is a user indication that they want more stories related to a specific article that they are reading. And, finally, the similarity metric is the use of story patterns to discover similarities.

[Figure 2-1: a document-word matrix with rows for words (CAT, BEAR, ..., DOG) and columns for documents (DOC1, DOC2, DOC3).]

Figure 2-1: Statistical analysis on text is traditionally done by first creating a document-word matrix. Each element represents the number of occurrences of the corresponding word in the given document.

2.2.1 Traditional IR Methods

Statistical techniques have a long history of use in the context of information retrieval.
In order to apply these techniques, a mapping is usually needed between the articles and a vector representation of the articles. This idea was first proposed by Gerard Salton and the SMART retrieval system in 1971 [40]. The most common such mapping is one that takes the list of words that appear in a document and maps them to individual elements in a vector. These documents can then be collected into a single matrix that represents the corpus, as seen in Figure 2-1. Consequently, many retrieval systems refer to this matrix to determine document similarities, rather than processing information from the document corpus. For example, a Boolean search on "cat + dog" looks up the indexes for the words "cat" and "dog", and determines which of the documents have matches in both rows. Cosine similarity is another well-known similarity metric that utilizes vector spaces. Given two document vectors, the cosine method computes their normalized dot product, the cosine of the angle between them. The result is a similarity value between 1 and 0, with 1 meaning that the documents have exactly the same content and 0 meaning the two are nothing alike. Many systems also utilize word frequencies as a similarity metric, also known as tf/idf (term frequency/inverse document frequency). A writer generally repeats certain words throughout a document as the subject is explained in further detail. This word emphasis is well-recognized as "an indicator of significance" [30]. However, neither high-frequency nor low-frequency words are good content identifiers, as explained by the "principle of least effort" [49]. This principle claims that writers generally use a smaller common vocabulary. Consequently, the most frequent words tend to be the least informative. As a result, many systems weight a term proportionally to the frequency with which it appears in a document, but inversely proportionally to the number of documents it appears in.
While statistical techniques perform well in text retrieval, ambiguities in word or phrase meanings may lead to retrieval errors. In the sentences "Spring has arrived!" and "The watch has a broken spring," the word spring has two completely different meanings. However, a system that only considers single words for context description may very well classify these two sentences as similar. This weakness can be mitigated, of course, by using term co-occurrence statistics and word adjacency operators, but statistical techniques are still imperfect. In particular, a statistical system has no way of distinguishing phrases such as "Venetian blind" and "blind Venetian." Online lexical reference systems, such as WordNet, provide tools to disambiguate text, by organizing information in terms of word meanings instead of word forms [18][48]. Using WordNet, the words spring, fountain, outflow, and outpouring in the appropriate senses could all be reduced to the same concept. Thus, concepts instead of words could be stored in the vector space model and used to describe documents in place of words. However, issues like the Venetian blind example cannot be solved without a more complex representation of documents. Natural Language Processing (NLP) provides such a representation, allowing document analyses to include structured descriptions such as noun-verb-noun or adjective-noun combinations [41]. These more meaningful larger units intuitively should create substantial improvements in text analysis and information retrieval. At this time, however, the performance of statistical and syntactic methods is domain-dependent.

2.2.2 Current IR Precision/Recall Results

To provide a better idea of current IR capabilities, Table 2.1 provides the evaluation results for the TREC-8 runs for Carnegie Mellon University, Johns Hopkins University, and Oracle [47].
TREC-8 is the 8th annual Text REtrieval Conference, held in November 1999; the latest of a series of workshops designed to promote research in text retrieval. The table shows precision/recall results for the main TREC task, ad hoc retrieval. The ad hoc retrieval task is similar to how a researcher might use a library. A question is asked and a collection is available, but the answer is unknown. This problem implies that the input topic has no training material. The same problem is addressed in this thesis, where a reader requests articles similar to a given article, or topic. While Ariel has access to the Milan news corpus with which to discover the requested information, the solution itself is unavailable. The precision/recall numbers allow a clear comparison between the Ariel system and other state-of-the-art systems, such as the SMART retrieval system, thus providing us with a better foundation for evaluating Ariel.

Recall     Precision (CMU)   Precision (JHU)   Precision (Oracle)
0.00       0.5705            0.7677            0.9279
0.10       0.4092            0.5496            0.7747
0.20       0.3308            0.4641            0.6610
0.30       0.2987            0.4070            0.5448
0.40       0.2415            0.3570            0.4542
0.50       0.2118            0.3139            0.3882
0.60       0.1710            0.2663            0.3168
0.70       0.1380            0.2244            0.2565
0.80       0.1037            0.1522            0.2067
0.90       0.0601            0.1017            0.1332
1.00       0.0315            0.0410            0.0791
Average    0.2175            0.3126            0.4130

Table 2.1: Recall Level Precision Averages for TREC-8 Ad Hoc Retrieval

2.2.3 Using Phrases in Text Retrieval

Phrases have traditionally been regarded as precision-enhancing devices in text retrieval systems. The exploration of phrases as terms in a vector-space-based retrieval system, first introduced by Salton [40][41], has received careful attention in the past few decades. Statistically, phrases may be thought of simply as combinations of terms that are used in place of the individual terms they are composed of. The best phrases have a larger than expected joint frequency of occurrence, given the frequencies of the individual terms.
Phrases can be defined in a variety of ways. At one extreme, a system may define two terms appearing in the same document as a phrase. However, better precision results are found when more restrictions are in place. For example, phrases may be limited to terms occurring in the same sentence, in the same sentence in adjoining positions, or in the same sentence with at most k items separating them. Phrases may also be defined syntactically. However, syntactic phrase detection is much more time-consuming and computationally expensive, since additional tests are needed. Given these definitions, the story patterns described in this thesis may be thought of as phrases within a document. Thus, a clearer understanding of phrase usage and its resulting performance informs this thesis. In most phrase-based statistical systems, a phrase is defined as a pair of non-function words that occur contiguously frequently enough in a corpus. Similarly, in phrase-based syntactic systems, a phrase is defined as any set of words that satisfy certain syntactic relations or constitute specific syntactic structures [33]. Salton et al.'s 1975 work has shown that including statistical phrases as terms in vector space models increases precision by 17% to 39% [40]. Furthermore, the work of Lewis and Croft in 1990 illustrates that the quality of text categorization is clearly improved by using word phrases [29]. Renouf's 1993 work also demonstrates that phrases can be reliably used as search terms. According to his evaluations, sequences of two or more nouns can effectively identify concepts found within a document [38]. The use of phrases has also become prevalent in related areas of study. The automatic determination of text themes within a document has been explored by Salton et al., using phrases to describe document themes.
More recently, phrases have been used as a means for automatically deriving a hierarchy of concepts from a specified set of documents [42], instead of using traditional clustering techniques. And lastly, the recently developed Phrasier system provides interactive browsing through related documents using automatically extracted key phrases [26]. However, some groups have argued that the use of phrases is not useful in enhancing precision in document retrieval. In 1989, Fagan ran the same experiments as Salton's 1975 work, but used larger document collections of about 10 MB [15]. His reports show that average precision improvements ranged from about 11% to 20%. This downward trend continued in 1997, with Mitra et al. replicating the same experiments on a 655 MB collection and reporting a 1% precision improvement when phrases are used as terms. Mitra's experiments also found that syntactic and statistical methods for recognizing phrases yielded comparable performance. And most recently, in 1999, statistical phrase analyses by Turpin et al. confirmed Mitra's results, adding further evidence to the argument that phrases are not useful precision-enhancing devices [46]. These studies have argued that advances in single-term precision have outpaced the usefulness of phrases in text retrieval problems. While there exist some arguments against the use of phrases, the common belief is that word patterns are good precision-enhancing devices. In fact, as this thesis will illustrate, statistical word patterns are certainly advantageous to the field of information retrieval when the phrases used are generated using traditional association mining techniques.

2.3 Data Mining

Data mining is often referred to as knowledge discovery in databases, or KDD. A combination of statistical and machine learning techniques, data mining is the process of identifying useful and understandable patterns in data.
The primary difference between statistical analysis, machine learning, and data mining is the issue of scalability [44]. While statistical analysis and machine learning deal with smaller data sets, data mining techniques are applied to voluminous databases. In the media, data mining is most often associated with consumer-oriented applications. Data mining software can help businesses obtain insightful information about their customers and business practices by identifying customer behaviors, recognizing market trends, and minimizing costs [20]. If this information is used effectively, data mining technologies give businesses a clear advantage over their competitors. In addition to consumer markets, data mining can have a variety of different applications. For example, in astronomy, the SKICAT system performs image analysis, classification, and cataloging of sky objects from survey images [16]. In sports, IBM's Advanced Scout aids NBA basketball coaches in organizing and interpreting data from games [16]. And in more recent years, a growing interest in applying data mining to document analysis has emerged. Applications include categorizing text into pre-defined topics and discovering trends in text databases.

2.3.1 Data Mining Techniques

Data mining techniques can be categorized into four general classes: associations, sequential patterns, classification, and clustering [4][20]. While this thesis only concerns itself with association and sequential mining, classifiers and clustering will also be mentioned for completeness. Each of these methods will receive a brief description, along with some examples of applications for which these functions are useful. The next section will describe data mining work related to textual analysis, which is closer to the focus of this thesis.

Associations and Sequential Patterns

Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X => Y.
For example, a clothing retailer might infer the rule "38% of all people who bought jackets and umbrellas also bought sweaters". Here, 38% is referred to as the confidence of the rule. Associations can involve any number of items on either side of the rule. The problem of association mining is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints. A more restrictive form of association mining is the problem of mining sequential patterns. Sequential patterns are simply associations found in a particular sequence. In most retail organizations, records of customer purchases typically consist of the transaction date and the items bought in the transaction. Very frequently these records also contain customer IDs, particularly when the purchase has been made using a credit card, a membership card, or a frequent buyer club. Given this data, an analysis can be made by relating the purchases with the identity of the customer. For example, in a bookstore's database, one sequential pattern might be "10% of customers who bought The Sorcerer's Stone in one transaction bought The Chamber of Secrets in a later transaction, and The Prisoner of Azkaban and The Goblet of Fire in a third."³ Each transaction is referred to as an element in the sequence. As you can see, elements may be single items or sets of items. Furthermore, these purchases need not be consecutive. Customers who purchase other books in between also support this sequential pattern. Sequential mining may be used in any application involving frequently occurring patterns in time.

³ From the Harry Potter series by J.K. Rowling.

Classification

The problem of classification involves finding rules that partition data into distinct categories. In classification, a set of example records is given, each containing a number of attributes. One of these attributes, called the classifying attribute, identifies the class to which each sample record belongs.
The records are used as a training set on which to build a model of the classifying attribute based on the others [43]. Credit analysis is a prime classification application. For example, credit card companies have detailed records about their customers, with a number of attributes regarding each record. Each record is generally tagged with a GOOD, AVERAGE, or BAD attribute, representing the credit risk of each customer. A classifier could examine each of these tagged records, and produce a description of the set of GOOD customers as those "between the ages of 40 and 55, with an annual income over $45,000, living in suburban areas". Models may be built using a variety of techniques, including decision trees and neural networks.

Clustering

Clustering techniques are similar to classification, but for one major difference. While records contain classifying attributes for classifiers, clustering uses a set of records without tagged attributes. The goal of a clustering function is to identify a finite set of categories or clusters to describe the data. Data can be clustered to identify market segments or consumer affinities, for example. A furniture retailer might use clustering to determine which groups of consumers would be more likely to purchase a new product, and organize their advertisement campaign accordingly. Given a set of customer transaction records, the retailer may separate customers into those interested in expensive furniture sets, moderately priced furnishings, and unfinished decor.

2.3.2 Applications to Text Databases

One major data mining interest is in the field of text classification. Text classification is the problem of automatically assigning predefined categories to free text documents. In text searching and browsing, text classification systems may be used to create a topic hierarchy for user navigation through a text database.
As shown in [10], using statistical pattern recognition allows discriminants to be efficiently separated at each node of a taxonomy. These discriminants are then used to build a multi-level classifier. The Athena system is another text classification system, built to maintain a hierarchy of documents through interactive mining technologies [3]. Currently, Athena is implemented on Lotus Notes to support e-mail management and discussion databases. Using Naïve Bayes classifiers and clustering algorithms, Athena performs topic discovery, hierarchy reorganization, inbox browsing, and hierarchy maintenance. While other classification models have been developed for e-mail routing purposes, Athena is the only such system to address text database organization outside the routing of incoming mail. Similarly, association mining has been explored as a method of discovering news topics for the online newspaper ZWrap [19]. The ZWrap system seeks to automatically identify rules about various news topics and bring them to the attention of an editor who can decide whether the presented hypotheses are valid. The use of association-rule data mining on news features, such as locations or people, may discover features with high rates of co-occurrence. For example, association techniques applied to people may discover that 80% of the time Al Gore is mentioned in a news article, George W. Bush also appears. Using these techniques, nodes of topics may be formed, where each node contains several rules pertaining to a particular person or event. In this way, a better understanding of news contexts may be created. Sequential mining has also been incorporated in IR research, using statistical phrases to discover trends in text databases [27]. Here, phrase identification is transformed into the problem of mining sequential patterns. The pattern mining techniques described in this thesis are most comparable to this approach. Instead of finding common transaction sequences for consumer purchases, for example, the system finds common sequences of words and calls them phrases. Associated with each phrase is a history of the frequency of occurrence for that phrase. Thus, when a user queries the database for a certain trend, these histories may be consulted to determine which pattern histories meet the specified requirements. Trends are defined as specific subsequences of a phrase history that satisfy a user query, for example a spike-shaped trend or a bell-shaped trend. In recent years, the combining of data mining techniques with information retrieval problems has shown promising results. The next chapter will introduce the basic ideas of this thesis, which illustrates a new application of data mining to high-precision document retrieval. This thesis incorporates many of the ideas described above in the realms of news, IR, and data mining research.

Chapter 3

Mining Story Patterns

The Ariel system generates story patterns for each article in the news corpora. Each set of story patterns may be thought of as a unique signature for a particular story, which allows for the retrieval of related articles. This section explains the techniques used to create these patterns, as well as some techniques that were explored but eventually discarded. Furthermore, this section attempts to provide some insight behind the usefulness of these patterns.

3.1 Motivations

3.1.1 Shared Writing Styles

Journalists generally use similar wording in their writings when referring to the same event or topic. As a result, patterns of words are repeated frequently. For example, articles on the 2000 presidential election might frequently use the words Gore, Bush, president, election, and race.
Or, the topic of the 2000 Olympics may include Olympics, Sydney, athlete, drug, and medal. As discussed in the previous chapter, Section 2.2.3, these word patterns can help IR systems identify related documents.

3.1.2 Time Clusters

From a reader's standpoint, news topics come and go. Some topics emerge as minor tidbits of news, gathering momentum until they peak as the nation's hottest news stories. Once the excitement dies, however, the topic soon disappears from the public eye. The 2000 presidential election is one such example, with early campaign news emerging in the summer and culminating in the tight election race. Weeks later, election results are resolved and public interest subsides. Other topics may appear out of nowhere, creating immediate headlines in the news and then dying away. One example is the Federal Reserve's surprise decision on January 3, 2001 to slash rates by half a percentage point, creating a surge of activity in the stock market. Or, consider the Middle East crisis, which attained high public interest during the Camp David summit, and then re-emerged months later in news of the Israeli-Palestinian uprisings. News topics such as these revolve around clusters of time. Like waves rising and falling, topics emerge in the news for a certain period of time, and then diminish from public view. Some topics, such as the Camp David story, may later re-emerge in an updated context. Patterns are strongly correlated to the time clusters in which they appear. In one experiment, Haase demonstrated the importance of numbers in online newspapers. If one article claimed "43 people died in a train crash yesterday," a search on the number 43 in that period of time would most likely produce articles regarding that accident [17]. The number 43 can be viewed as part of a unique signature for the train accident topic. However, such signatures are only valid within the time cluster, as the number 43 may have had other significance in the past.
We then considered what other types of signatures could be used to describe news topics. Patterns of words can be thought of as unique signatures for a particular news topic. Word patterns are more complex than numbers; however, they can be used in a much broader range of news stories. For example, an article on the Middle East crisis may support the pattern {Clinton, Israel, Palestine, peace}. This pattern encompasses a key idea of the topic, namely the involvement of the U.S. in the Middle East peace process. Furthermore, the pattern encapsulates the topic's representation in time. In addition, more than one pattern may represent a news topic. Groups of patterns for the Middle East crisis may include {Barak, Arafat, meet}, {Camp, David, peace}, and {Clinton, Israel, Palestine, peace}. The uniqueness of a pattern, or group of patterns, allows it to surface only when the topic is present. For example, the words Clinton, Israel, Palestine, and peace will only appear together during Clinton's time as president. This relationship marks a period of time in which the pattern is useful. In fact, the pattern will predominantly appear during the Camp David summits and periods of violence, while virtually disappearing during peaceful times.

3.1.3 Topic Recognition

In many retrieval systems, a major concern is the difficulty of discovering when new topics emerge, when old ones become obsolete, and when interest in an old topic rises again. A news system may take several days to discover that Florida is playing a major role in the 2000 presidential election, that interest in Elian Gonzales has died away, or that enthusiasm for the Olympics has risen again. Due to the nature of patterns, however, these concerns are insignificant. Since patterns are immediately discovered within a given article, Ariel recognizes the importance of the term "Florida" as soon as it appears in a story pattern.
Furthermore, the topic of Elian will become unimportant when the topic's patterns stop appearing in daily news. And, lastly, patterns generated for the Olympics will match both current and past Olympic stories. Therefore, patterns provide a quick and effective method of finding related stories without worrying about explicit topic recognition.

3.2 A Story Pattern

A story pattern is a set of words that appear together within the same sentence of a given article. A k-item pattern is defined as a pattern containing k elements. No particular ordering of the words is necessary, since the flexibility of writing allows a given concept to be described using completely different word orderings. For example, one article may state "President Clinton met with Palestinian leader, Yasser Arafat, today" while another article says "Arafat spoke with Clinton briefly on Tuesday morning." In both cases, the 2-item pattern {Clinton, Arafat} is present. This pattern is a set, and thus is equivalent to the {Arafat, Clinton} pattern. Sequential patterns were also examined, since a case could be made for the importance of word ordering in patterns. It was observed that this technique creates slightly lower precision values and significantly diminished recall values (see Table 3.5), leading us to believe that restricting patterns by word ordering is too inflexible to find similarities in journalistic writing. This technique will be discussed in more detail later in the chapter.

3.3 Generating Patterns

Patterns are generated using association mining techniques [5][7], which are traditionally used to mine association rules for sales transactions. The major components of association mining are a record of transactions, the items purchased in each transaction, and the ID of the customer who made the transaction. The problem of mining association rules can be broken down into two stages: generating frequent itemsets and generating rules from these itemsets.
A frequent itemset is a set of items whose support, the number of transactions that contain it, is above a minimum support threshold. To generate rules, a straightforward algorithm might be: For every frequent itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a => (l − a) if the ratio of support(l) to support(a) is at least minimum confidence. A rule X => Y holds if at least minimum-confidence percent of the transactions that contain X also contain Y. In the case of text mining, we substitute a transaction record with a record of a single news article. Each sentence of the article is viewed as one transaction, where a sentence is defined as the set of words contained in that sentence. The items of each transaction correspond to the words found in the sentence. And, finally, the customer identifier is simply a unique sentence identifier. In creating patterns, we generate frequent itemsets from the article. No rule generation is used. The result is a set of patterns, the itemsets, where each pattern is a set of words frequently appearing together in sentences of the article.

3.3.1 A Formal Problem Statement

Let W = {w1, w2, ..., wm} be a set of all words. Let A be a news article, which can be viewed as a set of sentences, where each sentence S is itself a set of words such that S ⊆ W. Associated with each sentence is a unique identifier, called its SENT ID. A sentence S is said to contain P, a set of some words in W, if P ⊆ S. P is considered a story pattern for news article A if the support threshold is satisfied, meaning that at least support sentences contain P. Given a set of sentences A, the problem of finding story patterns is to generate all patterns P that have support greater than a user-specified minimum support.
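As an illustration, the support computation from the problem statement and the straightforward rule-generation algorithm quoted above can be sketched in Python. This is a sketch only: Ariel itself stops after generating frequent itemsets and performs no rule generation, and the function names here are illustrative.

```python
from itertools import chain, combinations

def support(itemset, sentences):
    """Number of sentences (transactions) that contain every word of the
    itemset, i.e. sentences S with P <= S."""
    p = set(itemset)
    return sum(1 for s in sentences if p <= set(s))

def generate_rules(frequent, min_conf):
    """The straightforward rule-generation algorithm described above.

    `frequent` maps each frequent itemset (a frozenset) to its support
    count; every subset of a frequent itemset is also a key, as the
    downward-closure (Apriori) property guarantees. Emits each rule
    a => (l - a) whose confidence, support(l) / support(a), is at
    least min_conf."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        # every non-empty proper subset a of l
        for a in chain.from_iterable(combinations(l, r) for r in range(1, len(l))):
            a = frozenset(a)
            if sup_l / frequent[a] >= min_conf:
                rules.append((a, l - a))
    return rules
```

For instance, if {Clinton} and {Clinton, peace} both appear in two sentences, the rule {Clinton} => {peace} has confidence 2/2 = 1.0 and would be emitted at any confidence threshold.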
3.3.2 The Algorithm

The methodology described here is a general approach that can be applied to text databases of varying complexity. The original algorithm is taken from Rakesh Agrawal and Ramakrishnan Srikant's work on fast algorithms for mining association patterns [5][7]. This algorithm was carefully chosen for its performance advantages. It has been shown to consistently outperform previously known algorithms such as AIS and SETM. The performance gap increases with problem size, ranging from a factor of three for smaller problems to more than an order of magnitude for larger problems [7]. Other association rule mining algorithms also exist [44]. However, these algorithms generally focus on generating longer patterns and/or minimizing the number of passes for disk-resident data. Since this thesis involves relatively short patterns and documents, the performance of Agrawal and Srikant's algorithm is more than sufficient. The result of the mining is a set of patterns that occur frequently within a given news article. The general approach is given here, adjusted for text retrieval applications. In order to analyze a given article A, the text must first be broken down into sentences and words. After removing all stop words from A, each remaining word is stemmed using the Porter stemming algorithm [11][37]. Stemming allows the system to recognize the same word in different linguistic forms. Next, each stemmed word is given a unique identifier, since comparing numbers is less computationally expensive than comparing strings. Now, A may be represented as a set of sentences (transactions), which in turn are sets of words (items). In each article, both the headline and body are considered. The headline is simply represented as the first sentence of the article. Without word stemming, the Ariel system would fail to recognize many similar words and thus fail to recognize similar stories containing those words.
For example, consider the sample article "Yesterday, Clinton welcomed the visiting Prime Minister Barak. Clinton's welcome was warm as he received Barak". With word stemming, the words "Clinton" and "Clinton's" are both reduced to "Clinton", while "welcomed" and "welcome" are both recognized as forms of the stem "welcom". Thus, a system that incorporates stemming would recognize the common pattern {Clinton, welcom, Barak} within the article, assuming that the minimum support is two sentences. Without stemming, however, only {Barak} could be extracted from these two sentences. Clearly, the former pattern presents a better description of the two sentences. In document retrieval, the former pattern allows for higher precision, since the latter pattern will retrieve all articles mentioning Barak. While recall may be slightly lower with word stemming, the precision gains make it the better choice in document retrieval.

Figure 3-1: An example of pattern generation using frequent itemsets

Pattern generation involves making multiple passes over A. In the first pass, the support of each word is determined, based upon how many sentences that word appears in. Out of these words, the system then decides which are frequent (having minimum support), and which are not. In each subsequent pass, the large items, or words, found in the previous pass are used as the initial set of items. This set is used for generating new potentially large patterns, called candidates, whose support is also counted. At the end of each pass we determine which patterns are actually large. These large patterns become the seed for the next pass. This process continues until no new large patterns are found.
By only considering patterns used in the previous pass, this algorithm reduces the number of candidates. The basic idea is that any subset of a large pattern must also be large. Therefore, candidate patterns having k items can be generated in two phases: joining large patterns with k−1 items, the join procedure, and deleting those patterns that contain any subset that is not large, the prune procedure. For example, consider the article in Figure 3-1, assuming that the minimum support is 2 sentences. In the first pass, the algorithm determines the support of each item, eliminating all small items. The remaining large items, shown in L1, represent the initial 1-item set. To create the next candidate set, these 1-item sets are joined together in all possible ways to form the 2-item candidate set C2. Out of these, the small itemsets are pruned away and the large itemsets are chosen to form L2. The process is repeated to create the large 3-item set, L3. When C4 is generated using L3, it turns out to be empty and the algorithm terminates. The result is the set of frequent itemsets, or patterns, L1 ∪ L2 ∪ L3, which can be translated to frequent patterns found in the article. Algorithm details may be found in Agrawal and Srikant [7]. There are several advantages to choosing this particular approach. Most importantly, the algorithm is highly efficient in comparison to other algorithms. Every article varies in size and content, producing anywhere from 5 patterns to 50,000 patterns in each run. In addition, over 500,000 articles exist in the Milan news corpus. As a result, the speed of the algorithm is an important consideration. Furthermore, since the algorithm has already been implemented in the IBM Intelligent Miner for Data [24], a product of the Data Mining group at the IBM Almaden Research Center, the code is readily accessible. This saves a great deal of time and resources that would have been spent in implementing a pattern generation algorithm from scratch.
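The multi-pass join-and-prune procedure described above can be sketched compactly in Python. This is a simplified illustration of the Apriori-style approach, not the Intelligent Miner implementation; stop-word removal and stemming are assumed to have already been applied to the input sentences.

```python
from itertools import combinations

def mine_story_patterns(sentences, min_support):
    """Frequent itemset mining over the sentences of a single article.

    Each sentence is a transaction; its (stemmed) words are the items.
    Returns every pattern whose support, the number of sentences that
    contain it, is at least min_support."""
    sentences = [frozenset(s) for s in sentences]

    def support(itemset):
        return sum(1 for s in sentences if itemset <= s)

    # Pass 1: find the frequent single words (the initial 1-item sets).
    words = {w for s in sentences for w in s}
    level = {frozenset([w]) for w in words if support(frozenset([w])) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # Join: unite (k-1)-item patterns into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        # Keep the candidates that meet the minimum support.
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent
```

On the two-sentence Clinton/Barak example from the previous section, with a minimum support of two sentences, this sketch yields {Clinton, welcom, Barak} along with all of its subsets.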
3.3.3 Results of Pattern Generation

The resulting patterns represent the key concepts for a given article. Here, we have chosen three articles from the news corpus to illustrate pattern generation results. Figures 3-2, 3-3, and 3-4 provide excerpts from the three stories, as well as the most frequent patterns selected from each pattern size. First of all, each of the patterns, including the 1-item patterns, seems to grasp the theme of the article quite well. Furthermore, as the patterns increase in complexity, seemingly important keywords drop out of the pattern. For example, in the Hubble telescope article, the word "Hubble" should clearly play an important role in describing the story. However, while both the one- and two-item patterns contain this word, the term disappears in longer patterns. This occurrence may be attributed to the frequent use of pronouns within the article. While Hubble may be mentioned throughout the article, several sentences use the words "it" and "they" to describe the Hubble telescope and Hubble astronomers, respectively. Thus, in longer patterns the term drops out, even though it is implicitly referenced throughout the article.

Figure 3-2: Pattern generation for the Hubble Space Telescope article
Figure 3-3: Pattern generation for the Camp David article
Figure 3-4: Pattern generation for the Al Gore article

3.4 The Search for Related Stories

Now that the patterns have been generated, the system requires a scheme for finding similar articles given these patterns. One option is to find the same patterns within the other articles of the news corpus, with the requirement that all words contained in the pattern must be from the same sentence. An article that contains at least one pattern is considered a match. We will call this option sentence-level matching. However, this metric may be too restricting, since the focus of one article may not be the same as another even though they speak of the same topic.
Therefore, another option is also explored; when a story contains all the elements of a pattern, not necessarily from the same sentence, it is considered a match. This option will be called article-level matching. Using both metrics, preliminary tests are run on the above three articles, to provide a rough idea of the usefulness of patterns in text retrieval. The Milan corpus is used for this evaluation; however, the number of considered stories is limited for the sake of efficiency. A good article is one whose content is related to the topic of the given article. The total set of relevant articles is chosen from the Milan corpus by hand. This data is used to calculate precision/recall values. A summary of this testing environment is given in Table 3.1. For this preliminary test set, we compare patterns of 3 items or more with Boolean searches and word pair searches. Boolean searches correspond to 1-item patterns, and word pair searches correspond to 2-item patterns. The intuition in this case is that more complex patterns offer a more defined topic. Only the top 5 most frequent patterns, single words, and word pairs are used in this examination. In other words, the 5 most frequent patterns, words, and bigrams are used in retrieving relevant stories. Tables 3.2, 3.3, and 3.4 show the precision/recall results for the Hubble space article, Camp David article, and Al Gore article respectively. Precision is defined as the ratio of the number of relevant articles retrieved to the total number of articles retrieved. Recall is defined as the ratio of the number of relevant articles retrieved to the total number of relevant articles in the corpus. Note that Boolean precision/recall values will be the same in both sentence and article matching, since a single word will obviously appear in the same sentence as well as the same article.
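The two matching schemes can be sketched as follows. This is a minimal illustration (function names are ours, not Ariel's); each article is assumed to be tokenized into sentences of stemmed words.

```python
def sentence_level_match(pattern, sentences):
    """Sentence-level matching: every word of the pattern must appear
    together within at least one sentence of the article."""
    p = set(pattern)
    return any(p <= set(s) for s in sentences)

def article_level_match(pattern, sentences):
    """Article-level matching: every word of the pattern must appear
    somewhere in the article, in any sentences."""
    p = set(pattern)
    return p <= {w for s in sentences for w in s}
```

Every sentence-level match is necessarily an article-level match, but not the reverse, which is why article-level matching trades a little precision for substantially better recall.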
Article             Relevant articles   Articles considered
Hubble Telescope    33                  8,500
Camp David          53                  18,000
Al Gore             126                 19,000

Table 3.1: The preliminary testing environment

                    Sentence-level          Article-level
                    Precision   Recall      Precision   Recall
Boolean Search      22.5%       48.5%       22.5%       48.5%
Word-Pair Search    100%        27.3%       100%        27.3%
Pattern Search      100%        9.1%        100%        30.3%

Table 3.2: Preliminary Precision/Recall Evaluation for the Hubble Space Telescope article

                    Sentence-level          Article-level
                    Precision   Recall      Precision   Recall
Boolean Search      9.2%        96.2%       9.2%        96.2%
Word-Pair Search    38.9%       92.5%       35.5%       94.3%
Pattern Search      76.0%       68.6%       77.8%       92.4%

Table 3.3: Preliminary Precision/Recall Evaluation for the Camp David article

                    Sentence-level          Article-level
                    Precision   Recall      Precision   Recall
Boolean Search      4.2%        94.4%       4.2%        94.4%
Word-Pair Search    14.0%       45.2%       13.1%       63.5%
Pattern Search      100%        5.6%        83.6%       40.1%

Table 3.4: Preliminary Precision/Recall Evaluation for the Al Gore article

These early results illustrate the promising performance of pattern mining, with complete article matching performing better than sentence matching. Retrieving articles based upon a pattern's presence within a single sentence is too restrictive. In most cases, article-matching significantly improves Ariel's recall performance while only slightly weakening precision. Furthermore, the precision of patterns of 3 or more items far surpasses that of word pairs and Boolean searches. From a reader's standpoint, the importance of precision outweighs that of recall: a reader would much rather receive a small number of relevant stories than a larger number of stories from which relevant articles need to be weeded out. Thus, while longer patterns have poorer recall numbers, their precision results outweigh these recall weaknesses. It should also be noted that the procedure of using the 5 most frequent patterns is only used for this preliminary test set.
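The precision and recall values in these tables reduce to two ratios over the set of retrieved articles and the hand-chosen relevant set. A small sketch of that computation (the OIDs and class name are illustrative, not from Ariel):

```java
import java.util.*;

public class PrecisionRecall {
    // precision = relevant articles retrieved / total articles retrieved
    static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / retrieved.size();
    }

    // recall = relevant articles retrieved / relevant articles in the corpus
    static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0;
        Set<Integer> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> retrieved = Set.of(1, 2, 3, 4);       // OIDs the system returned
        Set<Integer> relevant = Set.of(2, 3, 5, 6, 7, 8);  // hand-chosen relevant OIDs
        System.out.println(precision(retrieved, relevant)); // 0.5
        System.out.println(recall(retrieved, relevant));    // ~0.333
    }
}
```

A very selective search drives precision toward 100% at the cost of recall, which is exactly the trade-off the pattern searches above exhibit.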
In normal Ariel processes, all existing patterns of 3 items or more will be used in the retrieval process. Thus, recall performance should increase with more rigorous evaluations. More extensive testing is described in Chapter 6, the evaluation section of this thesis, along with a comparison of pattern mining with common story retrieval techniques. As a result of these tests, the Ariel system retrieves articles based upon all patterns of 3 items or more. An article is considered relevant if every item of the pattern is present within the article, regardless of its sentence position.

3.5 Other Considered Techniques

While researching pattern generation, several different approaches were explored. The technique described above involves generating frequent itemsets from an article chosen by the user. Additionally, this thesis considers two other approaches: substituting association mining with sequential mining, and replacing the specific article with an analysis of the entire news corpus.

3.5.1 Looking at Sequential Patterns

Generating sequential patterns is very similar to the association mining approach described above. However, instead of using sets of words from each sentence (itemsets), lists of words are used (itemlists) [8][45]. This means that ordering now becomes significant. The advantage is that sequential patterns reduce the chance of retrieval error. For example, consider one article containing the sentence "The Camp David summit opens today", and another including the sentence "David went to camp this summer." Since both sentences contain the itemset {Camp, David}, association mining techniques would group these two articles as similar. However, sequential mining would identify two distinct patterns, {Camp, David} and {David, Camp}, respectively, and the error would be avoided. The disadvantage, however, is that we lose the flexibility of association mining.
Whereas the sentences "Clinton and Arafat met" and "Arafat and Clinton met" would be considered matches by association techniques, sequential methods identify two distinct patterns, {Clinton, Arafat, met} and {Arafat, Clinton, met}, and would therefore fail to recognize their similarity. To test this sequential approach, sequential patterns were generated for the three stories described above, using the same minimum support of 2 and the same testing environment described in Table 3.1. The experiment is identical to the article-level-matching association mining tests (Tables 3.2, 3.3, and 3.4), using the same Milan corpus; however, instead of being unordered, the generated patterns are sequential. Table 3.5 shows the resulting precision/recall values alongside the article-level-matching association mining results.

In general, sequential patterns produce moderately lower precision values and significantly lower recall values. These results make sense, since the restriction of ordered items allows fewer story retrievals. In the case of Article 2, however, the precision/recall values are identical: only two patterns are generated using association mining, and the same patterns were picked up by sequential techniques, so the results are the same. Although sequential mining allows one to distinguish between the phrases "blind Venetian" and "Venetian blind", for example, these results emphasize the need for flexibility in word usage. Thus, the restrictions of sequential mining weaken story retrieval rather than aid it. Compared to association mining, little advantage is seen in using sequential story patterns.

Article                     Assoc. Precision   Assoc. Recall   Seq. Precision   Seq. Recall
1. Hubble space telescope   73.3%              86.8%           64.7%            28.9%
2. Camp David               86.0%              76.6%           86.0%            76.6%
3. Al Gore                  75.8%              35.3%           75.0%            6.8%

Table 3.5: Precision/Recall values for articles retrieved using association patterns and sequential patterns.
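The difference between the two mining styles comes down to how a sentence is compared against a pattern: as an unordered set, or as an ordered subsequence. A sketch of the two checks (class and method names are hypothetical):

```java
import java.util.*;

public class SeqVsAssoc {
    // Association view: the sentence must contain the pattern's items in any order.
    static boolean assocMatch(List<String> sentence, List<String> pattern) {
        return new HashSet<>(sentence).containsAll(pattern);
    }

    // Sequential view: the pattern's items must appear in the same relative order.
    static boolean seqMatch(List<String> sentence, List<String> pattern) {
        int i = 0;
        for (String word : sentence)
            if (i < pattern.size() && word.equals(pattern.get(i))) i++;
        return i == pattern.size();
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("the", "camp", "david", "summit", "opens", "today");
        List<String> s2 = List.of("david", "went", "to", "camp", "this", "summer");
        List<String> pattern = List.of("camp", "david");
        System.out.println(assocMatch(s1, pattern) + " " + assocMatch(s2, pattern)); // true true
        System.out.println(seqMatch(s1, pattern) + " " + seqMatch(s2, pattern));     // true false
    }
}
```

The second sentence illustrates the Camp David example from the text: the unordered check accepts it, while the ordered check correctly rejects it.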
3.5.2 General corpus trends

Another approach involves searching through the entire corpus for story patterns, instead of focusing on the chosen article. The basic intuition is that retrieved patterns encapsulate common themes found within the corpus. If several articles discuss the 2000 Olympic games, for example, these articles will likely share common word choices, and thus share common patterns. If enough stories reflect a certain pattern, that pattern may be viewed as a news topic descriptor. Then, when a user chooses one of these articles, the pattern is recognized and all other stories containing this pattern will be retrieved. In this approach, a pattern's support is defined by the number of articles containing it, rather than the number of sentences. Patterns are generated using the association mining algorithm; however, the transaction record now covers the entire news corpus, instead of a single news article. Since an article is still divided into its sentence components, patterns from each sentence may be joined together to form a set of patterns. For example, the structure < {Olympics, Sydney} {athlete, compete} > corresponds to "Olympics" and "Sydney" occurring in one sentence and "athlete" and "compete" occurring in another sentence, with both sentences coming from the same article. This pattern is accepted only if a minimum support is satisfied.

While this technique appears promising, a brief examination of Table 3.6 illustrates that the results are poor. Instead of retrieving useful patterns that describe various topics within the corpus, only the most common patterns are gathered. For example, the first pattern < {page, open, browser, window} > emerges from the common phrase at the end of each CNN.com article, "Pages will open in a new browser window". Similarly, the pattern < {web, post, gmt} > appears in regard to phrases like "April 20, 2000 Web posted at 12:33PM EDT (1633 GMT)".
These results indicate that discovering useful patterns among a large set of articles is more difficult than it appears. This complexity is attributed to the varied writing styles and topic discussions throughout the corpus. By adhering to a single article, the writing style does not change and the focus is distinct, allowing a more meaningful pattern generation.

Patterns generated from 1,000 articles:
< {page, open, browser, window} >
< {web, post, gmt} >
< {cnn} {search, new} >
< {people, business} {new, year} >
< {internet, headline} >

Table 3.6: An experiment in discovering general corpus trends using pattern mining techniques

3.5.3 Conclusions

While various approaches to pattern generation exist, association mining techniques appear to provide the most informative patterns. The use of frequent itemsets allows for flexibility in writing styles, while capturing the main ideas of the article. At the same time, by concentrating on a single article, the system takes advantage of language redundancy to provide a clear description of the article's meaning.

Chapter 4 Existing Resources

This chapter describes the resources Ariel was built upon. Milan is the personalized news portal that supplies Ariel with a news corpus and a user interface to work with. More specifically, Ariel is one of the many services built on top of Milan. This chapter describes the underlying structure of the news database, the data transport system, and the user interface. Furthermore, this chapter provides a glimpse of IBM's Vinci project [2], a local-area service-oriented architecture created for the development of distributed data-centric applications. Milan is one such application built using Vinci.

4.1 The Milan News System

Milan is an online newspaper service similar to MyYahoo.com and CNN.com. It obtains its articles from a variety of sources including Factiva, CNN.com, and net news, and maintains a corpus of news articles that is built on the IBM DB2 database [6].
Furthermore, Milan provides each user with a personalized newspaper, which allows users to choose news topics, stock updates, weather reports, and other options to display. Figure 4-1 displays a sample Milan page.

Figure 4-1: Milan is a personalized information portal for the Vinci system.

This news system, however, differs from other personalized papers in that it allows users to create and modify their own news channels, instead of having to select topics from a pre-designed list of choices. News channels are generally pre-defined categories of news from which a user may choose sections of their personalized newspaper. For example, a user of MyYahoo may select the "Top World News Stories", "U.S. Market News", "Health from Reuters", and "Entertainment from Reuters" news channels for their newspaper. In the case of creating one's own channels, someone with family in Taiwan may be very interested in reading all the articles regarding Taiwan and its disputes with China. Normally, that person would be forced to add the "Top World News Stories" channel to their newspaper, since this is the closest channel that exists. However, with Milan, users can create their own channels, perhaps using the search "Taiwan + China" to meet their specific needs.
Furthermore, Milan provides the ability for users to communicate their interests to each other by sharing channels. With the creator's permission, other users may use or modify the Taiwan news channel and/or add it to their own newspapers.

Gruhl originally described this system in his Ph.D. thesis entitled "The Search for Meaning in Large Text Databases" [19]. Milan emerged from the MyZWrap and Panorama systems, both designed to simplify and partially automate the editorial decisions of a human news editor by gaining a more thorough understanding of the news articles within the database. Milan differs from the two prototype systems by offering a decentralized and more modular structure, a characteristic inherent to all Vinci-based systems. Instead of requiring components to exchange information through a centralized database, Milan allows services to communicate among one another directly. Furthermore, Milan components can be built on various platforms and deployed among any number of machines, creating a flexible and robust architecture.

<vinci:FRAME>
  <OID>3325</>
  <DATE>November 12, 2000</>
  <HEADLINE>The Making of a Memorial</>
  <BODY>On Veterans Day, the memorial to the 57,000 Americans missing or killed in Vietnam...</>
</>

Figure 4-2: A frame describing an article on the Vietnam Memorial, represented both graphically and in XML format

4.1.1 Gathering and storing information

News is transferred to Milan through wire services, net news, and Internet news providers. Approximately 10,000 news articles are gathered every day and stored in the Milan news corpus. Currently, the corpus contains approximately 500,000 news stories, dating from August 1999 to the present. Milan uses frames as its news storage medium, which provides the system with both speed and flexibility [19]. A frame is a single-level-depth XML document. For example, a frame describing a news article might include a headline, body, date, and news source.
Figure 4-2 illustrates a sample news article stored in a frame. These frames are stored in a PISA database. Created for the Vinci system, PISA is a relational database that utilizes a vertical format for storing objects while maintaining a horizontal view [6]. Once news articles enter Milan, they are immediately processed into frames. Each article is given a unique Object IDentification number (OID) and stored in a simple frame consisting of the OID and the text of the article [23]. This OID allows the frame to be referred to in a concise and unambiguous manner. Since OIDs are never recycled, they can be used as pointers to an article with the assurance that they will never point to another article. However, since storing the article as a string of text and tagging it with an OID is not very useful for understanding its content, Milan uses several software agents, or what Gruhl calls experts, to extract more detailed information from these articles.

4.1.2 Making it useful - experts and the blackboard system

A single news article is of little value unless its meaning is understood. While articles are generally represented as a single block of text, it is very useful to understand what people, locations, and dates were mentioned in the article. For this purpose, Milan uses various experts organized into a blackboard system to extract useful information from each article [14]. Each expert is specialized in only one task. One might find a people-spotter, location-finder, or date-locator expert working on a particular news article. These experts work together to create understanding about the article, and store each of their findings in the frame itself. In the case of Milan, thousands of blackboards exist, each board representing a single news article. Hence, instead of several experts studying one blackboard, these experts can walk around a room full of boards and make contributions to each board as they see fit.
If an expert is working on an article, another expert can just pass it by, and come back to it later, when it's free. Imagine that each expert writes on the blackboard with his own piece of chalk. If each expert has a different colored piece of chalk, the contributions of the various experts are easily viewed by noting the different colors on the blackboard. As a result, those agents that depend on the work of others can simply glance at a board to see if the appropriate colors exist yet. Each blackboard can be implemented as a frame. When an article arrives, it is processed into a new frame, which is visited by various experts. Experts can be divided into two groups, structural experts and information experts. Structural experts extract the headline, author, body, and the time an article was posted. After these agents have visited, the second group of experts then performs tasks such as word stemming, people spotting, and location finding for a more thorough understanding of the article. Once the information has been extracted, various searching techniques are used to select stories for the newspaper, choosing which stories to present and in how much detail. Milan explores the use of Boolean searches, AltaVista-like searches, and relevance feedback to present the most satisfactory articles for the reader's perusal. However, these techniques lie outside the scope of this thesis, and will not be discussed in detail. For more details, see Gruhl's Ph.D. thesis [19]. 4.2 The Vinci System In the larger scheme of things, Milan is only one application of several built on the Vinci project. As stated before, Vinci is a local area service-oriented architecture that provides both standard and specialized building blocks with which to develop data-centric applications such as Milan. As shown in Figure 4-3, the Vinci project consists of several base services, including a database, web crawler, and data mining tools. 
On top of this foundation, almost any application can be plugged in and registered as a Vinci service. Services may communicate with one another via the xtalk protocol: Vinci requires components and services to communicate by exchanging encoded XML documents, a protocol called xtalk [2]. xtalk creates a decentralized environment in which services can directly communicate with one another, as well as with users. For instance, a system such as Ariel may want to communicate with Milan to request a particular news article.

Figure 4-3: A graphical representation of the Vinci system, showing base services (database, web crawler, data mining tools, data extraction, personal databases) supporting value-added services (query mapping, search, catalog management, user profiling, personalization, news and community, privacy, monitoring, realtime remote querying).

For this communication to take place, Ariel must send the following xtalk document:

<vinci:FRAME>
  <vinci:COMMAND>getFrame</>
  <OID>3325</>
</>

Milan evaluates this document and returns:

<vinci:FRAME>
  <OID>3325</>
  <DATE>November 12, 2000</>
  <HEADLINE>The Making of a Memorial</>
  <BODY>On Veterans Day, the memorial to the 57,000 Americans missing or killed in Vietnam is one of the most visited monuments in Washington...</>
</>

This frame-transfer protocol exists for all Vinci services, making communication between services both flexible and efficient. In fact, this idea is very similar to systems making method calls over a network. As long as the data transfer schema is known for the desired service, communication between systems is very simple.

4.3 Milan Services

Just as Milan is a service of Vinci, Milan maintains services of its own to help with its tasks. For example, many readers like to see weather reports for their hometown. Instead of searching through the Internet for up-to-date weather information, Milan simply calls the weather service for help, which in turn searches for the appropriate information.
Now, Milan simply has to establish a connection with the service and send a frame such as:

<vinci:FRAME>
  <vinci:COMMAND>getWeather</>
  <ZIPCODE>02139</>
</>

The weather service returns:

<vinci:FRAME>
  <CITY>BOSTON</>
  <TEMP>32 F</>
  <PIC>http://www.intellicast.com/images/icons.66 wtext.jpg</>
</>

This information then becomes represented as a channel on Milan that gives current weather conditions for a given city. An illustration of this channel is provided in Figure 4-1, the sample Milan newspaper. Other services are involved in finding the word of the day, providing stock quotes, preparing comic strips, allowing AltaVista-like searches, and maintaining a user database. These services hold similar conversations with Milan, so Milan can obtain the information it needs to maintain its users' newspapers.

4.4 Where Ariel Fits

The Ariel system aids Milan users in finding related stories for any news story of interest. Therefore, it makes sense to depict Ariel as both an expert and a service in Milan. As an expert, Ariel can analyze a news article and discover its similar stories before the article is actually requested. Then, as a service, Ariel can retrieve the appropriate articles for Milan when a user requests information, just as a weather service would relay requested weather reports or a stock service would provide stock updates. Similarly, Ariel can use any Milan and Vinci resources (e.g. via the frame-transfer system described above) to collect the data it needs. Figure 4-4 provides an illustration of the Milan system with the addition of Ariel's components.

Figure 4-4: An illustration of the Milan system, which is built on top of the Vinci base services. Milan relies on several other Vinci services, including Ariel (pattern generation, article retrieval, similarity ranking) and various news channels (comics, stock quotes, word of the day).
Also, where the available Milan and Vinci services are insufficient for Ariel's purposes, new services and experts can be built to aid in needed tasks. For example, the pattern generator can easily be made into a Milan service. Feature spotters, on the other hand, may also be useful to incorporate among Milan's news experts. Spotters that identify people, places, and times may provide Ariel with important information about its news articles. These services, and many others, can be built and integrated into the Milan system, to aid this thesis' study of mining patterns for text retrieval.

Chapter 5 Ariel: Design and Implementation

The ideas set out in the previous chapters have been explored in the development of Ariel, a news retrieval system based upon user queries for related news articles. Ariel utilizes the Milan news corpus, currently comprising over 500,000 stories, with new articles added daily. The system is fully integrated into Milan's newspaper, such that Milan readers can make requests to Ariel while browsing through articles. The Ariel system consists of approximately 2,500 lines of code, not including the pattern mining algorithm, and is written mainly in Java with some use of Perl and JavaScript.

5.1 General System Flow

The Ariel system is composed of six distinct sections: preprocessing articles, indexing article words, generating patterns, finding related stories, ranking stories by similarity, and providing a user interface. The general system flow is depicted in Figure 5-1. News enters Milan through news streams, and is reformatted into frames. Milan experts then examine and augment these frames and store them in a news corpus. Next, the Ariel system parses the headline and body of the articles based upon stemmed words. Since each word is mapped to a unique integer identifier, a binary format of the story is created.
This binary data file is fed into a Vinci service built for pattern generation, which produces patterns for each article (this service is discussed in greater detail in Section 5.3). Once Ariel identifies these patterns, related stories are discovered and ranked for user retrieval, using indexes that have been developed. A user may request related stories directly from the Milan newspaper, which contacts an Ariel JSP. This new web page presents the user with a list of similar stories, as well as a method for browsing through them.

Figure 5-1: Overall system diagram for the approach presented in this thesis

5.2 Story Analysis

Consider the following article, shortened for the purposes of this discussion, which has just entered the Milan system from the CNN news stream:

How do designers define luxury?*

It can be a cashmere kimono. A bed made with embroidered sheets. The detailing on a handmade pair of shoes. An unexpected mix of beautiful colors. Designers have different concepts of luxury.

* Modified version of "How do designers define luxury?", CNN.com, December 29, 2000. http://www.cnn.com/2000/STYLE/fashion/12/29/luxury/index.html

First, Milan reformats the article into a frame and structural experts extract the headline and body. Next, the information experts extract stemmed words, people, locations, and other features from the text. However, since Ariel only requires knowledge of the headline, body, and stemmed words, only these features are discussed here. In the Ariel system, each word is represented by its stemmed form, using the Porter stemming algorithm [37]. In this way, the words "design", "designing", and "designer", for example, appear identical to the system. Without such stemming, each word would be viewed as unique, and no correlation would be drawn between them. Furthermore, each stemmed word is identified by a unique number. Whenever a new word is discovered, it is assigned an identifier and added to the dictionary.
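The dictionary just described can be sketched as a map that hands out the next free integer the first time a stem is seen (the class name is hypothetical, and the Porter stemming itself is assumed to have happened beforehand; "luxuri" below is the Porter stem of "luxury"):

```java
import java.util.HashMap;
import java.util.Map;

public class WordDictionary {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the identifier for a stemmed word, assigning a new one on first sight.
    int idOf(String stem) {
        return ids.computeIfAbsent(stem, w -> ids.size() + 1);
    }

    public static void main(String[] args) {
        WordDictionary dict = new WordDictionary();
        System.out.println(dict.idOf("design")); // 1
        System.out.println(dict.idOf("define")); // 2
        System.out.println(dict.idOf("luxuri")); // 3
        System.out.println(dict.idOf("design")); // 1 (already known)
    }
}
```

Because the same stem always maps to the same identifier, "design", "designing", and "designer" all collapse to a single number once stemmed.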
Also, Ariel does not consider stop words. A list of these words is provided by Cornell University [11]. By removing stop words, Ariel prevents the generation of many useless patterns that would otherwise lead to greater article mismatches. For example, consider the sample article: "Clinton will move out of office in the middle of January. Then, Bush will be in control of the White House." If stop words such as will, of, in, the, then, and be were included, a minimum support of two sentences would allow pattern generation to discover {will, of, in, the}, {will, in, the}, {will, of, the}, {will, of, in}, and {of, in, the} as valid patterns. Clearly, these patterns would retrieve numerous articles that have little to do with the given topic. Thus, removing stop words allows more effective patterns to be retrieved, while also speeding up pattern generation: since stop words are not considered, fewer word combinations exist for the system to examine.

The dictionary is a simple mapping between each stemmed word and its unique identifier. Since a string requires 2 bytes per Unicode character in Java, while an entire integer fits in 4 bytes, this conversion from word to number significantly decreases memory usage and thus speeds up system processes. Ariel then translates each stemmed word in the headline and body into its numeric representation. The headline may be viewed as a sentence in the article, while the body is divided into its sentence components. As a result, given the dictionary shown in Table 5.1, the fashion article shown above may be expressed as:

Headline: 1 2 3
Body:
  Sent1: 4 5
  Sent2: 6 7 8 9
  Sent3: 10 11 12 13
  Sent4: 14 15 16 17
  Sent5: 1 18 19 3

These results are not actual Ariel results, and are only provided as a means of explanation. Finally, the article's sets of word IDs, or sentences, are combined into a single string representation to be used in pattern mining.
The example above would be expressed as:

3 1 2 3 2 4 5 4 6 7 8 9 4 10 11 12 13 4 14 15 16 17 4 1 18 19 3

Each sentence is preceded by a count of its items. Thus, since the string begins with 3, we know that the first sentence in the article contains three items, which are listed immediately afterwards. The next sentence contains 2 items, 4 and 5, the third sentence contains 4 items, and so on. Using this format, we can express article components using a simple string representation.

 1. Design      8. Embroider   15. Mix
 2. Define      9. Sheet       16. Beautiful
 3. Luxury     10. Detail      17. Color
 4. Cashmere   11. Handmade    18. Different
 5. Kimono     12. Pair        19. Concept
 6. Bed        13. Shoe
 7. Made       14. Unexpect

Table 5.1: A sample dictionary mapping each word to its unique numeric identifier

5.3 Pattern Mining

Patterns are generated using the IBM DB2 Intelligent Miner for Data [24], built in 1996 in the context of the Quest project at the IBM Almaden Research Center. Specifically, Ariel utilizes the Associations & Sequential Patterns algorithm of the Intelligent Miner, which is written in C++ [7]. As described in Chapter 3, the Ariel system utilizes frequent itemset generation from the data mining system. To incorporate these techniques into Ariel, pattern generation is implemented as a Vinci service. A conversation between Ariel and the pattern generation service begins with the following request:

<vinci:FRAME>
  <vinci:COMMAND>disc</>
  <DATAFILE>3 1 2 3 2 4 5 4 6 7 8 9 4 10 11 12 13 4 14 15 16 17 4 1 18 19 3</>
  <MINSUPCNT>2</>
</>

After evaluating the provided data, the service responds:

<vinci:FRAME>
  <PATTERN>
    <SET>[1] [3] [18]</>
    <NUMITEMS>3</>
    <SUPPORT>28.3</>
  </>
  <PATTERN>
    <SET>[16] [17]</>
    <NUMITEMS>2</>
    <SUPPORT>15.4</>
  </>
</>

The service requires the string representation of the article, described above, along with a minimum support count.
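The length-prefixed DATAFILE string used in the request can be produced mechanically from the per-sentence ID lists. A sketch (class name hypothetical):

```java
import java.util.List;

public class ArticleEncoder {
    // Encodes an article (headline first, then body sentences) as Ariel's
    // length-prefixed string: each sentence's item count, then its word IDs.
    static String encode(List<List<Integer>> sentences) {
        StringBuilder sb = new StringBuilder();
        for (List<Integer> sentence : sentences) {
            sb.append(sentence.size());
            for (int id : sentence) sb.append(' ').append(id);
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<List<Integer>> article = List.of(
            List.of(1, 2, 3),         // headline: design define luxury
            List.of(4, 5),            // Sent1
            List.of(6, 7, 8, 9),      // Sent2
            List.of(10, 11, 12, 13),  // Sent3
            List.of(14, 15, 16, 17),  // Sent4
            List.of(1, 18, 19, 3));   // Sent5
        System.out.println(encode(article));
        // prints: 3 1 2 3 2 4 5 4 6 7 8 9 4 10 11 12 13 4 14 15 16 17 4 1 18 19 3
    }
}
```

Running this on the fashion article reproduces the DATAFILE string exactly, which makes the length-prefix convention easy to verify.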
The service considers all patterns that have appeared in at least two sentences of the article, since Ariel uses a minimum support count of two. After analyzing the data, patterns are generated using the given support count, and the resulting patterns are returned along with their support and size. Each pattern's support is actually its frequency of occurrence within the article. While this information is not used in the Ariel system, future systems may want to incorporate it in sorting or ranking patterns.

5.4 Indexing for Searches

Performance is a key issue with any system that interacts with human users. In searching for related articles, an article that shares at least one pattern with the original article is considered a match. Furthermore, only patterns of three items or more are considered; if no 3-item patterns exist, then 2-item patterns are used. As a result, the Ariel system requires a method for determining whether an article contains a particular pattern. However, since every article in the corpus needs to be examined in order to collect a complete set of similar stories, an efficient search technique is required. One cumbersome method would be to examine each article in the database for patterns, simply by extracting the headline and body of the article at every query and looking for the appropriate features within them. A faster method would be to store every word contained in the article, in its numeric form, as key/value pairs, such as:

<vinci:FRAME>
  <OID>80764</>
  <HEADLINE>Hubble image sheds light on darkness within galaxies</>
  <BODY>The chance alignment of two spiral galaxies...</>
  <WORD>356</>
  <WORD>1255</>
  <WORD>1307</>
  <WORD>1905</>
</>

In this way, the system could query each article for patterns by extracting all WORD features from each article, and then comparing the individual items with the given pattern. For example, compared with the pattern {356 1307 1905}, the above article would be considered a match.
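This per-article set-containment check can be made fast with an inverted index that maps each word ID to the OIDs containing it, so a pattern lookup becomes an intersection of posting lists. A minimal sketch of the idea (class name hypothetical; the second OID is invented for the example):

```java
import java.util.*;

public class WordIndex {
    private final Map<Integer, Set<Integer>> postings = new HashMap<>(); // word ID -> OIDs

    void add(int oid, Collection<Integer> wordIds) {
        for (int w : wordIds)
            postings.computeIfAbsent(w, k -> new TreeSet<>()).add(oid);
    }

    // OIDs of articles containing every item of the pattern (postings intersection).
    Set<Integer> matches(Collection<Integer> pattern) {
        Set<Integer> result = null;
        for (int w : pattern) {
            Set<Integer> p = postings.getOrDefault(w, Set.of());
            if (result == null) result = new TreeSet<>(p);
            else result.retainAll(p);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.add(80764, List.of(356, 1255, 1307, 1905)); // the Hubble article above
        index.add(4467, List.of(356, 1307));              // hypothetical second article
        System.out.println(index.matches(List.of(356, 1307, 1905))); // [80764]
    }
}
```

Only the first article contains all three pattern items, so the intersection leaves a single OID; no article outside the posting lists is ever examined.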
However, the need to examine every article in the corpus still appears rather cumbersome. A better solution is to create a set of indices which map frame features to vectors of OIDs where they appear. For example, the system may record that <WORD> 356 </>, where the number 356 represents the word image, appears in article frames 4467, 5580, 80764, 97840, and 112283. Thus, instead of searching through the entire news corpus, a search for articles containing pattern {356 1307 1905} simply requires an index lookup of the three items, followed by an examination of which articles all three items have in common. These common articles are considered matches to the original article. While this type of index works well in Ariel, the approach has some drawbacks. For example, the indices do not support positional searching (e.g., find "Bill" within two words of "Clinton"). One fix might be to maintain multiple indices: one for most searches and one for positional searches.

5.5 A Ranking Schema

Retrieving all matching stories is useful in evaluation suites. However, in most cases a user does not want to sort through all of the matching articles, even if all were related to the main story in some way. Occasionally, several hundred matches may be retrieved by the system. An ideal system would retrieve only the number of stories desired by the user. Thus, a ranking scheme is needed in order to rank stories from most similar to least similar. In this way, the system has a foundation upon which to choose the best stories, regardless of the number of results. In this ranking process, we establish that larger patterns produce better article matches than smaller ones. This statement is based on the assumption that rarer patterns, the longer ones, provide more valuable information than their shorter, more common counterparts. Salton's research reaffirms this statement.
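A sketch of such an inverted index, assuming articles are already reduced to sets of word IDs (the OIDs and word IDs below come from the example above):

```python
from functools import reduce

def build_index(articles):
    """Map each word ID to the set of article OIDs that contain it."""
    index = {}
    for oid, words in articles.items():
        for w in words:
            index.setdefault(w, set()).add(oid)
    return index

def articles_matching(index, pattern):
    """An article matches when it contains every item of the pattern,
    so we intersect the posting sets of the pattern's items."""
    postings = [index.get(item, set()) for item in pattern]
    return reduce(set.intersection, postings) if postings else set()

articles = {
    80764: {356, 1255, 1307, 1905},
    5580:  {356, 1307, 1905, 42},
    4467:  {356, 99},
}
index = build_index(articles)
print(sorted(articles_matching(index, {356, 1307, 1905})))  # → [5580, 80764]
```

Article 4467 contains the word 356 but not the full pattern, so the intersection correctly excludes it.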
In his research, term weighting is done by assigning a high degree of importance to terms occurring in only a few documents of the collection. Thus, the rarer a term, the more likely it is an important feature of the document [41]. Furthermore, this assumption is reinforced by the preliminary precision/recall results discussed above. Thus, a story containing a 4-item pattern from the main story is ranked higher than one containing a 3-item or 2-item pattern. However, each group of patterns still retrieves a large number of articles. For example, the 4-item patterns may retrieve 20 articles altogether, while the 3-item patterns retrieve another 50 from the corpus. Within these groups, a further ranking scheme is needed to determine which articles are the better matches. A variety of similarity comparisons exist; in this thesis, we choose the cosine similarity metric. We use the following cosine ranking scheme:

1) Construct a dictionary of all terms in the corpus selection (i.e., the group of articles under consideration).
2) Construct a normalized vector for the main article.
3) Construct a normalized vector for each of the articles under consideration.
4) Compute the similarity between the main article and each test article using the following cosine similarity metric [39]. The result is a measurement of similarity, from 1 to 0, determined by the degree to which the articles DOC_i and DOC_j share the terms k:

   \text{Cosine Similarity}(DOC_i, DOC_j) = \frac{\sum_{k=1}^{t} (TERM_{ik} \cdot TERM_{jk})}{\left[ \sum_{k=1}^{t} (TERM_{ik})^2 \cdot \sum_{k=1}^{t} (TERM_{jk})^2 \right]^{1/2}}

5) Rank the articles from highest to lowest similarity.

Once a set of similar articles has been retrieved, the above metric is used to sort the articles from most to least similar. In this way, an Ariel user may specify the number of articles to receive. For example, if a user requests 10 similar articles, the system will retrieve the top 10 most similar articles found.
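The metric transcribes directly into code; the term-frequency vectors below are invented purely for illustration:

```python
import math

def cosine_similarity(doc_i, doc_j):
    """Sum of term-weight products divided by the product of the two
    vector norms; between 0 and 1 for non-negative term weights."""
    shared = set(doc_i) & set(doc_j)
    dot = sum(doc_i[t] * doc_j[t] for t in shared)
    norm_i = math.sqrt(sum(w * w for w in doc_i.values()))
    norm_j = math.sqrt(sum(w * w for w in doc_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Invented term-frequency vectors for a main article and a candidate:
main = {"gore": 3, "campaign": 2, "voter": 1}
candidate = {"gore": 1, "campaign": 1, "bush": 2}
print(round(cosine_similarity(main, candidate), 3))  # → 0.546
```

Sorting candidates by this score, descending, yields step 5 of the scheme.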
Here, pattern length and cosine similarity together form the ranking scheme. Pattern mining allows Ariel to quickly isolate a set of similar articles before analyzing them more carefully with the cosine metric. Other, more expensive methods could also be used in place of cosine; the modularity of Ariel allows any ranking scheme to be easily integrated with the rest of the system. Another concern of article retrieval is deciding which stories to retrieve. One option is to gather all articles pertaining to the given subject matter. A second option is to collect only those articles appearing earlier in time than the given article. Under the former alternative, computation becomes extremely time consuming: every time a new article enters the corpus, Ariel must identify which articles are similar to it and recalculate ranking information for all articles in that category. In the legal profession, for example, this extra computation may be highly desirable. In patent searches, a lawyer's goal is to find all inventions related to a given topic, so re-computation is necessary to make sure all searches are up-to-date. In the realm of news, however, these extra computations are not as crucial. Since news is generally presented in a historic context, retrieving only those articles that provide a historic background for the original article is a more suitable alternative.

5.6 The User Interface

The Ariel interface may be described from two perspectives: as another Vinci service, or as a reader of the Milan newspaper. Any Vinci service may request similar stories from the Ariel system via the xtalk protocol. This communication is easy, as long as the client knows the appropriate schema for interacting with Ariel. A sample communication follows.
The Vinci service sends the request:

    <vinci:FRAME>
      <vinci:COMMAND>patternsearch</>
      <OID>80764</>
      <COUNT>8</>

This request simply asks Ariel to complete a pattern mining search for the article identified by the OID 80764, in this case the Hubble space telescope article mentioned earlier. Once the search is completed, the top 8 most similar articles are transmitted from Ariel to the client. In response, Ariel sends the frame:

    <vinci:FRAME>
      <OID>97840</>
      <OID>170536</>
      <OID>96061</>
      <OID>80380</>
      <OID>139884</>
      <OID>112283</>
      <OID>5580</>
      <OID>4467</>

The OIDs in the response frame identify the 8 articles most similar to the Hubble space story. From a user perspective, however, this form of communication provides little information about the requested topic. Therefore, for the average user, Ariel provides a web interface that fits seamlessly into Milan's existing newspaper interface. As a result, the user may browse through related articles without requiring any knowledge of Ariel's existence. The following is a brief walkthrough of the system's interface. Consider our Berkeley student Kellie again, a Milan user who is browsing idly through a newspaper much like the example shown in Figure 4-1. In particular, Kellie is looking for stories about the 2000 presidential campaign for her upcoming school report. Eventually, she comes across the article "Reintroducing Mr. Gore," which appears to provide some interesting information for her report. Clicking on the article link reveals Figure 5-2, a web page containing the full news story. The main frame presents the article as it appears from the Factiva news wire. The left-hand column, on the other hand, lists various pieces of information, including a section on "Related Stories". In this section, Ariel displays the top three most similar articles in relation to the Gore article in the main frame.
Figure 5-2: A Milan news article

If further information is requested, Ariel pops open a new window, as seen in Figure 5-3. In this window, the main frame contains the most similar story, in this case "For Bush, It's Time to Show His Mettle". The left side displays a list of similar articles. Currently, the system retrieves 20 similar articles from the corpus. Kellie may browse through each of these articles at her own convenience, with each article popping up in the main frame when selected. Thus, without the need of search engines or any significant amount of time, Kellie can collect several key resources for her report in just a few clicks, simply by using the Milan online newspaper.

5.7 Conclusions

A few observations may be drawn from this brief tour of the Ariel system. First of all, Ariel allows for the construction of a useful Milan service that is seamlessly integrated with the Milan system. A Milan user can utilize the features of Ariel without needing to realize its presence in the Milan newspaper.
Figure 5-3: A Milan page displaying the top stories of the "Reintroducing Gore" article

Furthermore, the methods described in this chapter provide a fairly accurate retrieval system that utilizes both pattern mining and cosine similarity to retrieve and rank stories. While pattern mining requires a larger amount of article preprocessing time than cosine similarity, these calculations can be performed prior to any user queries to maintain an efficient retrieval system. The Ariel approach also encourages incremental improvements of existing components, as long as the network interface remains unchanged. Milan and Vinci components do not require any knowledge of how Ariel accomplishes its task.

Chapter 6

An Evaluation of Ariel

The Ariel system is an information retrieval system designed to retrieve similar news stories. While the success of Ariel is ultimately based upon its usefulness to readers, an evaluation of usefulness is difficult to obtain.
Therefore, this thesis will rely upon traditional IR precision and recall analyses to determine the accuracy of the retrieval system. Precision is defined here as the ratio of the number of similar stories retrieved to the total number of stories retrieved. Recall is defined as the ratio of the number of similar stories retrieved to the total number of similar stories in existence. In online news, readers generally prefer fewer errors to more article matches. Thus, this thesis considers precision a higher priority than recall. This differs from the legal field, for example: lawyers searching for prior art want the highest recall possible, since they want to find as many documents about their case as possible. Therefore, from a legal perspective, recall would be a higher priority than precision.

6.1 Experimental Protocol

The evaluation of Ariel includes a rigorous analysis of three stories chosen from the Milan data set and three stories chosen from the Reuters-21578 collection [28]. See Table 6.1 for details on the six chosen articles. Milan is a general news source, including articles from technology, politics, science, and art. The data set contains news articles dating from August 1999 to the present. Currently, Milan consists of over 500,000 articles, with approximately 10,000 articles added daily. The Reuters collection is also a news data set, specifically compiled from the 1987 Reuters news wire. Containing 21,577 articles, Reuters is significantly smaller than Milan. It is also a more specialized collection, with articles chosen from various economic reports. These reports mainly include market analyses and trading speculations.

ID  Article Title                                         Corpus   Articles considered
1   Hubble Image Sheds Light on Darkness                  Milan    8,500 articles, June 2000
2   Next in the Middle East                               Milan    18,000 articles, July to August 2000
3   Reintroducing Mr. Gore                                Milan    19,000 articles, August to September 2000
4   Bahia Cocoa Review                                    Reuters  21,577 articles, entire corpus
5   If Dollar Follows Wall Street, Japanese will Divest   Reuters  21,577 articles, entire corpus
6   Tower Report Diminishes Reagan's Hopes of Rebound     Reuters  21,577 articles, entire corpus

Table 6.1: The chosen articles for this testing environment

All six stories undergo the same evaluation process. First, each article is manually matched with a set of similar articles. For the Milan corpus, this master set is chosen from a specified section of the corpus, since the corpus itself is too large for a manual search. In Reuters, however, the articles are chosen from the entire corpus. Next, Ariel retrieves and ranks all the articles it believes to be similar, filtering out those articles not included in the pre-selected group. This process is identical to that described in Chapters 3 and 5. In addition to the graphs, each article evaluation contains an excerpt from the story, the number of patterns found, and a list of several generated patterns. These pieces of information provide the reader with a better idea of the article topic, as well as more familiarity with the common patterns found in the story.

6.1.1 Graph Characteristics

Precision/recall graphs are then obtained by the following method.

Let C be the master set of similar articles, chosen manually.
Let A = {a1, a2, ..., am} be Ariel's list of retrieved articles in ranked order, where a1 is the most similar article and am is the least similar.
Let S be a set of articles, where S ⊆ A.
Let P be the set of (Recall, Precision) points.

1. S = {}, an empty set
2. P = {}, an empty set
3. for (i = 1; i ≤ m; i++) do begin
4.     add ai to S
5.     Precision_i = (number of elements s ∈ S where s ∈ C) / (number of elements in S)
6.     Recall_i = (number of elements s ∈ S where s ∈ C) / (number of elements in C)
7.     add (Recall_i, Precision_i) to P
8. end
   Result = P

Each point in P is displayed on the precision/recall graphs, connected by lines for easier viewing. As a comparison, these precision/recall graphs are then matched against graphs generated from a cosine similarity metric. The cosine graphs are obtained via the same method, where A represents a list of related articles retrieved by cosine similarity.

One characteristic of these graphs is that the numbers are almost never monotonically decreasing (see Figures 6-1 through 6-6). In other words, as recall increases, the precision sometimes increases and sometimes decreases. The reason is that each new precision/recall point is calculated by adding a new document to the set, as shown in the above algorithm. Thus, with each new article, the recall will never decrease, since the number of relevant articles retrieved will never decrease. However, the precision may vary with each addition, since the new article may be either relevant or not. A relevant article will increase both precision and recall, while a non-relevant article will decrease precision and leave recall stagnant. As a result, a monotonically decreasing graph could only be obtained by maintaining a constant recall value of 100% throughout the graph.

One may also note from the graphs that the shapes of the pattern and cosine results show striking similarities. As stated before, pattern mining uses a ranking scheme involving both pattern length and cosine similarity results. The cosine graph is, of course, created purely from cosine results. Thus, these schemes create comparable graph shapes.
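The point-generation loop above translates almost directly into code (a hypothetical helper, with an invented ranked list for illustration):

```python
def precision_recall_points(ranked, relevant):
    """After each article is added from the ranked list, record one
    (recall, precision) point, mirroring steps 1-8 above."""
    points = []
    hits = 0  # running count of elements of S that are also in C
    for i, article in enumerate(ranked, start=1):
        if article in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

# Precision dips whenever a non-relevant article is added,
# so the curve is not monotonic:
ranked = ["a", "x", "b", "c", "y"]   # retrieved list A, best first
relevant = {"a", "b", "c"}           # manually chosen master set C
for recall, precision in precision_recall_points(ranked, relevant):
    print(round(recall, 2), round(precision, 2))
```

Running this, precision goes 1.0, 0.5, 0.67, 0.75, 0.6 while recall climbs from 0.33 to 1.0, reproducing the non-monotonic behavior described below.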
6.1.2 Performance Considerations

In these precision/recall evaluations, it is also important to note that pattern mining uses significantly less CPU time than cosine similarity. We assume that both techniques require article preprocessing, namely headline and body extraction, word stemming, and dictionary creation. In both cases, the article is represented in numeric form: as a string of numbers in pattern mining, and as a matrix in cosine similarity. In pattern mining, story patterns are generated via the association mining scheme described in Section 3.3 and used to isolate a set of relevant articles. Since patterns are simply sets of frequent words, relevant articles are identified by the number of these frequent words they contain. Indices help speed up this search process (see Section 5.4). In the case of cosine similarity, however, every word in every article needs to be accounted for and compared. Thus, indices are of little use, since all articles in the corpus need to be examined. However, the use of cosine similarity inherently ranks the articles by relevancy without further computation. Pattern mining, on the other hand, isolates a set of similar articles which are unranked. Thus, pattern mining uses cosine similarity to rank a significantly smaller article set, whose members have already been deemed relevant. In terms of the Milan corpus, over 500,000 articles exist in the database. Cosine similarity requires each of these 500,000 articles to be compared to the original article, word for word, before ranking them. Pattern mining, on the contrary, analyzes the original article and discovers several story patterns. Assuming that these patterns consist of 12 unique word items (the average number of words across 20 chosen articles), only those articles that contain any of these 12 words would ever be noticed by the Ariel system. Furthermore, out of this set, perhaps 500 articles are found to contain the correct patterns.
As a result, instead of requiring a cosine similarity ranking of 500,000 articles, only 500 are required. The Ariel system clearly requires less computational power than the cosine similarity metric, thus making pattern mining evaluations run much more efficiently than cosine evaluations.

6.2 Milan Experiment Results

Article 1: Hubble Image Sheds Light on Darkness

The chance alignment of two spiral galaxies offers a rare glimpse of elusive galactic material, according to Hubble Space Telescope astronomers, who released a striking image of the pair on Thursday...

41 patterns found.

{image shed light space}  {help dust scientist interstellar}  {image light space telescope}  {march april image heritage}  {interstellar dust bright stars}  {matter hold mass}  {dust galaxy silhouette}  {spiral dark space}

Figure 6-1: Precision/recall graph for Article 1 (pattern mining vs. cosine similarity)

Article 2: Next in the Middle East

Prime Minister Ehud Barak and Palestinian leader Yasser Arafat returned home from the Camp David summit to very different welcomes. Mr. Arafat was received as a hero who stood up to pressure and did not forsake Palestinian claims over Jerusalem...

2 patterns found.

{prime minister opposition}  {prime minister barak}

Figure 6-2: Precision/Recall graph for Article 2 (pattern mining vs. cosine similarity)

Article 3: Re-Introducing Mr. Gore

Last night Vice President Gore delivered what had been called, more times than he probably cared to hear, the most important speech of his political career. At long last the Democratic nominee, no longer in Bill Clinton's shadow, Mr. Gore sought to convince American voters that he has the strength and the suppleness to lead...

2 patterns found.
{vice president voter}  {american sought gore}

Figure 6-3: Precision/Recall graph for Article 3 (pattern mining vs. cosine similarity)

6.3 Reuters Experiment Results

Article 4: Bahia Cocoa Review

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review...

40 patterns found.

{comissaria smith york time}  {bahia cocoa review}  {bahia total estimate}  {bag mln crop}  {sale dlr port}  {sept york time}

Figure 6-4: Precision/Recall graph for Article 4 (pattern mining vs. cosine similarity)

Article 5: If Dollar Follows Wall Street, Japanese will Divest

If the dollar goes the way of Wall Street, Japanese will finally move out of dollar investments in a serious way, the dominant Japanese investment managers say. The Japanese, the dominant foreign investors in U.S. dollar securities, have already sold U.S. equities...

29 patterns found.

{manage market stock bond general}  {manage invest wall japanese}  {stock bond general}  {manage invest international}  {manage international department}  {market stock bond general}

Figure 6-5: Precision/Recall graph for Article 5 (pattern mining vs. cosine similarity)

Article 6: Tower Report Diminishes Reagan's Hopes of Rebound

The Tower Commission report, which says President Reagan was ignorant about much of the Iran arms deal, just about ends his prospects of regaining political dominance in Washington, political analysts said...

23 patterns found.
{analyst public reagan back}  {public point person popular}  {report told tower}  {expect white house}  {institute reagan washington}  {point person popular}  {analyst tower politic}

Figure 6-6: Precision/Recall graph for Article 6 (pattern mining vs. cosine similarity)

6.4 Final Observations

The Milan data set performed extremely well. Precision and recall values for pattern mining dominated those of the cosine similarity metric in virtually all graph areas. Furthermore, in both Articles 1 and 2, pattern mining is able to maintain consistently high precision levels up until an 80% recall value. While Article 3 also performs well, its results are not quite as high. Rather, Article 3 shows a steady drop in precision, almost from the very beginning of the experiment. This phenomenon can most likely be attributed to the large number of stories related to the Gore topic. While the Hubble and Camp David stories included 34 and 54 relevant articles, respectively, the Gore topic consisted of 117 relevant articles. Therefore, it makes sense that the recall numbers are low during much of the experiment, pushing the (recall, precision) points closer together. If we count the number of documents added before precision falls below 80%, for example, we find that Article 1 retrieves 38 stories, Article 2 retrieves 70 stories, and Article 3 retrieves 58 stories. Therefore, even though the graphs may indicate a lower performance for Article 3, further analysis shows that the results are quite similar. In the Reuters data set, the graphs are almost identical in all three cases. This result indicates that the use of pattern mining does little to affect the results of the cosine similarity metric. However, this in itself is an achievement. As stated in Section 6.1.2, a pure cosine metric requires a thorough analysis of each article in the corpus, using the painstaking process of comparing every single term to that of the query article.
Pattern mining, on the other hand, extracts the unique patterns of the query article and uses only those items present to isolate a set of similar articles. As a result, pattern mining performs significantly faster than the cosine metric. Moreover, graph similarities indicate that pattern mining successfully chose a set of articles similar to that chosen by cosine similarity. Therefore, while producing similar precision/recall values, the speed of the Ariel system makes it the better choice. We can also view our results side by side, as shown in Table 6.2. Here, the precision values for the Milan and Reuters data sets are averaged for conciseness. For example, the numbers from the Milan/Pattern column are the averaged results of the three Milan articles (Articles 1, 2, and 3) using pattern mining techniques. While evaluations on both data sets show promising results, the Milan corpus clearly performed better than Reuters. Pattern mining results on the Milan database show an average precision of 75.21% while the same evaluations on Reuters produce an average 22.81% precision. This observation may be attributed to differences in the content of the two databases. Milan contains a wide range of news topics, including stories on politics, space, fashion, and world news. Reuters, on the other hand, contains a specific set of topics involving various economic reports, mostly trading speculations and money market concerns. Therefore, the patterns created for Articles 4, 5, and 6 were not as unique as they would have been in the Milan database, causing Reuters patterns to identify many more non-relevant articles. 
Recall     Milan Precision        Reuters Precision
           Pattern    Cosine      Pattern    Cosine
0.00       1.0000     1.0000      1.0000     1.0000
0.10       0.9762     0.9583      0.3915     0.3850
0.20       0.9479     0.9167      0.2856     0.3149
0.30       0.9444     0.8817      0.2397     0.2160
0.40       0.8997     0.8335      0.1710     0.1616
0.50       0.8589     0.7418      0.1265     0.1202
0.60       0.7963     0.6805      0.1056     0.0972
0.70       0.7267     0.5052      0.0737     0.0714
0.80       0.6404     0.2701      0.0541     0.0423
0.90       0.4453     0.0960      0.0386     0.0280
1.00       0.0369     0.0345      0.0226     0.0122
Averages   0.7521     0.6289      0.2281     0.2226

Table 6.2: A comparison of average precision values between Milan and Reuters pattern mining and cosine similarity results

Comparing pattern mining and cosine similarity results, pattern mining produces better results in both cases. In Milan, the average difference in precision is 12.32%. In Reuters, this difference is much more subtle: 0.55%. Again, the variance in performance is most likely attributable to the differences between the data sets. The patterns generated in the Milan corpus provided a clearer distinction between articles than the Reuters patterns. However, since patterns produce similar or greater results than cosine similarity, this suggests that pattern mining techniques identify sets of stories very similar in characteristics to those collected by cosine similarity, with greater efficiency. Thus, pattern mining appears to be the better choice, both for precision and for efficiency.

Chapter 7

Conclusions and Future Work

This thesis was motivated by the observation that online news lacks an efficient and flexible way of retrieving similar news stories. Readers would benefit from an automated news system that possesses both the understanding of a human editor and the ability to provide this service for any given article in the news corpus. Journalists and newspaper editors can also benefit from such a system, allowing research and editing to proceed with more effective tools than Boolean searches.
The Ariel system was developed to address the issue of news retrieval using data mining technologies. The result is a novel approach to text retrieval using pattern mining.

7.1 Contributions

In creating Ariel, this thesis has developed an autonomous news retrieval system that works in conjunction with the personalized news portal, Milan. The system has been specifically built for users of Milan's online newspaper. When a reader desires more information on a specific article, he/she calls upon Ariel to gather similar articles from the news corpus. In addition, Ariel could easily be applied to journalistic and editorial situations. Journalists very often require background information on the topic they are writing about. Using Ariel, these writers simply need to identify a sample article, and similar articles will be retrieved. Newspaper editing is a slightly different problem. Editors are generally presented with a plethora of articles, from which they must choose a selection to print. Since several articles are often written on the same topic, Ariel can extract the similar articles from the news collection so that the editor may pick and choose among them. Furthermore, the Ariel architecture provides a modular, easily reusable structure that can be used in many different experiments. While Ariel is currently integrated with the Milan news system, other text databases may also be used. In the case of newspaper journalism and editing, Ariel could operate over various news archives and live news streams. Additionally, Ariel is not limited to news corpora. Any text database, including patent, scientific journal, and medical research databases, may be integrated with the Ariel system. Furthermore, various components of the Ariel system may be replaced or removed at any time without the rest of the system's knowledge.
Since Ariel communicates with the rest of the Vinci system through a well-defined network interface, any changes may be made without notification. Therefore, a new searching technique or ranking scheme may be incorporated with ease. Moreover, the architecture allows a computationally expensive investigation of the articles to be performed ahead of time, with the results stored for future use. This technique allows for a more in-depth examination of the articles at search time, without the need for the user to wait through these computations. Ariel also demonstrates that patterns of words, traditionally called phrases, are still useful to IR research despite many claims that their value has all but diminished. Using pattern mining, a novel text retrieval method based upon statistical data mining, Ariel provides the mechanics for discovering patterns within a given article and then comparing these patterns to the rest of the corpus. Created with association mining technologies, story patterns may be viewed as unique signatures of a news topic or event. While various document retrieval functions already exist, they involve very time consuming calculations. Functions such as the cosine similarity metric require a detailed analysis of each document before isolating the similar ones. Pattern mining, on the other hand, simply scans each article for the appropriate features, much like a Boolean search. Ariel's performance surpasses that of cosine metric, word pair, and Boolean searches in precision and recall analyses. Furthermore, pattern mining is significantly more efficient than cosine similarity, while providing comparable or better retrieval results. Boolean and word pair searches, on the other hand, are more efficient than pattern mining, although the quality of text retrieval drops significantly.
In addition, Ariel's precision/recall performance appears to be comparable to the TREC-8 ad hoc retrieval results, despite differences in the testing environment and corpus used.

7.2 Future Work

On the surface, the Ariel system provides a simple, yet effective, text retrieval tool for the domain of online news. Underneath this application, however, is a brand new technique waiting to be explored. This thesis presents a mere glimpse of the field of pattern mining. With further exploration, this technique could make significant contributions to the IR field. Below, several ideas are outlined for future work on the Ariel system.

7.2.1 Incorporating Concepts into Patterns

Currently, Ariel patterns are composed of stemmed words. In the future, more informative patterns could be created by using concepts in addition to words. For example, one may use the words "angry" or "mad" or "furious" with exactly the same meaning intended. However, to Ariel, each one of these words is completely different from the next. To resolve this problem, Ariel can easily be extended to recognize synonyms by incorporating a tool such as WordNet [48]. In this way, we can produce a concept to describe anger, which recognizes all synonyms and associates them with the same meaning. Similarly, people and locations could also be represented as concepts. President Clinton, for example, may appear in the news as Bill Clinton, Clinton, William Jefferson Clinton, and, soon, former President Clinton. To the current system, these phrases have no relation to one another, other than sharing the word "Clinton". However, in the same way an anger concept can be created, we can produce a Clinton concept that identifies each of his different names as an alias for the same person.
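As a rough sketch of this extension (the concept table here is invented; a real implementation would derive it from WordNet synsets or a named-entity alias list):

```python
# Hypothetical concept table mapping surface words to concept IDs.
CONCEPTS = {
    "angry": "ANGER", "mad": "ANGER", "furious": "ANGER",
    "clinton": "CLINTON",
}

def to_concepts(tokens):
    """Replace each token with its concept ID, leaving unmapped words
    unchanged. (Multi-word aliases such as 'William Jefferson Clinton'
    would additionally require phrase matching before this step.)"""
    return [CONCEPTS.get(t.lower(), t) for t in tokens]

print(to_concepts(["furious", "voters", "mad"]))  # → ['ANGER', 'voters', 'ANGER']
```

Feeding the concept stream, rather than raw stems, into the pattern miner would let "angry voters" and "furious voters" support the same pattern.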
7.2.2 Recognizing Topic Progression

By incorporating a more complex understanding of article content, as described above, Ariel can begin to notice topic differences within relevant articles. For example, consider an analysis of the article "Thousands pause to remember when Hiroshima became hell on earth", a story about the 55th anniversary of the atomic bombing of Hiroshima. Currently, Ariel would search for relevant articles and rank them by similarity. However, Ariel could be built to recognize more specialized topics within the articles, including World War II events, the Manhattan Project, and the victims of the bombing. As a result, Ariel could categorize the similar stories both by relevance and by topic. The ability to distinguish between different article focuses and evaluate their relevance to the user's interest is one possible extension of the Ariel project.

7.2.3 Story Summarization

Instead of providing a list of similar articles, as Ariel currently does, summarization techniques may be used to condense the information in each relevant article into a single synopsis. In this way, the user could browse the highlights from each article without needing to scan through the entire selection. A process known as semantic encoding [34], developed at the IBM Tokyo Research Laboratory, allows a computer system to automatically analyze text and condense it into a single description. Perhaps these same techniques could be applied to Ariel.

7.2.4 Story Merging

Similar to summarization, Ariel would also benefit from the ability to scan through the collection of retrieved documents and merge the entire collection into a single synopsis about the topic at hand. One approach would be to locate common themes within the group of articles and generate a synopsis using these themes. Themes could possibly be discovered using the approach described in Section 3.5.2, searching for general corpus trends.
While this approach was unsuccessful when searching through an entire news corpus, perhaps more interesting results could be obtained when searching through a limited number of stories regarding the same topic.

7.2.5 Expanding the Scope of Ariel

In addition, Ariel should be extended to different types of users and text databases to further evaluate its success. Since Ariel has already been applied to news, a natural progression would be to incorporate Ariel into a news archive used by journalists and editors. While this application has been discussed throughout this thesis, Ariel has never actually been implemented as such a tool. Another possibility is to apply Ariel's pattern mining techniques to a patent database, such as IBM's Intellectual Property Network [25], to help IP lawyers conduct research for prior art. Or, Ariel could be incorporated into various types of journals and placed in a library for students and researchers.

References

1. N. Abramson. Design, specification, and implementation of a movie server. Bachelor's Thesis, Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1990.
2. R. Agrawal, R.J. Bayardo, D. Gruhl, S. Papadimitriou. Vinci: A Service-Oriented Architecture for Rapid Development of Web Applications. In Proceedings of the 10th World Wide Web Conference, Hong Kong, May 2001.
3. R. Agrawal, R.J. Bayardo, and R. Srikant. Athena: Mining-based Interactive Management of Text Databases. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT), Konstanz, Germany, March 2000.
4. R. Agrawal, T. Imielinski, A. Swami. Database Mining: A Performance Perspective. In IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, December 1993.
5. R. Agrawal, T. Imielinski, A. Swami. Mining Association Rules between Sets of Items in Large Databases. In ACM SIGMOD International Conference on Management of Data, Washington D.C., May 1993.
6. R. Agrawal, A. Somani, Y. Xu.
Storage and Querying of E-Commerce Data. Submitted to ACM SIGMOD International Conference on Management of Data, December 2000.
7. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, September 1994.
8. R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, March 1995.
9. W. Bender and P. Chesnais. Network Plus. In SPIE Electronic Imaging Devices and Systems Symposium, Vol. 900, pages 81-86, Los Angeles, California, January 1988.
10. S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, August 1997.
11. Cornell University, CS Department. Cornell list of stop words. Found at ftp://ftp.cs.cornell.edu/pub/smart/english.stop in January 2001.
12. Jack Driscoll. Visiting Scholar at the MIT Media Laboratory; former editor of The Boston Globe. Interview, December 20, 2000.
13. Sara Elo. Plum: Contextualizing news for communities through augmentation. Master's Thesis, Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1995.
14. R.S. Engelmore, A.J. Morgan, and H.P. Nii. Introduction. In Blackboard Systems. Addison-Wesley Publishers Ltd., 1988.
15. J.L. Fagan. The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115-132, 1989.
16. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery in Databases. In Proceedings of the American Association for Artificial Intelligence, Fall 1996.
17. Simson Garfinkel and Kenneth Haase. Reference Points. Interview, January 31, 2001.
18. Julio Gonzalo, Felisa Verdejo, Irina Chugur, Juan Cigarran.
Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, 1998.
19. Daniel Gruhl. The Search for Meaning in Large Text Databases. PhD Thesis, Massachusetts Institute of Technology, February 2000.
20. Data Mining: Extending the Information Warehouse Framework. IBM Data Mining Whitepaper. Found at http://www.almaden.ibm.com/cs/quest/papers/whitepaper.html in January 2001.
21. Kenneth B. Haase. Analogy in the Large. In Proceedings of ACM SIGIR Conference, 1995.
22. Kenneth B. Haase. Do Experts Need Understanding. In IEEE Expert, Spring 1997.
23. Kenneth B. Haase. FramerD: Representing Knowledge in the Large. Technical Report, MIT Media Lab, 1996.
24. IBM DB2 Intelligent Miner for Data. Found at http://www-4.ibm.com/software/data/iminer/fordata/index.html in January 2001.
25. IBM Intellectual Property Network. Found at http://www.almaden.ibm.com/cs/patent.html in January 2001.
26. Steve Jones and Mark S. Staveley. Phrasier: a system for interactive document retrieval using keyphrases. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160-167, August 1999.
27. B. Lent, R. Agrawal, and R. Srikant. Discovering Trends in Text Databases. In Proceedings of the 3rd International Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August 1997.
28. David Lewis. The Reuters-21578 text categorization test collection. AT&T Labs, September 1997. Found at http://www.research.att.com/lewis/reuters21578.html in January 2001.
29. D. Lewis and W. Croft. Term clustering of syntactic phrases. In the 13th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 385-404, 1990.
30. H.P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, Vol. 2, No. 2, April 1958.
31. M. Maybury, A. Merlino, J. Rayson.
Segmentation, Content Extraction and Visualization of Broadcast News Video using Multistream Analysis. In Proceedings of the American Association for Artificial Intelligence, 1997.
32. R.B. Miller. Response time in user-system conversational transactions. In Proceedings of the AFIPS Fall Joint Computer Conference, pages 267-277, 1968.
33. Mandar Mitra, Chris Buckley, Amit Singhal, Claire Cardie. An Analysis of Statistical and Syntactic Phrases. In Proceedings of RIAO97, Computer-Assisted Information Searching on the Internet, pages 200-214, Montreal, Canada, June 1997.
34. K. Nagao, S. Hosoya, Y. Shirai, and K. Squire. Semantic Transcoding: Making the World Wide Web More Understandable and Usable with External Annotations. IBM Research, Tokyo Research Laboratory.
35. News in the Future research consortium, MIT Media Laboratory. Found at http://nif.media.mit.edu/nif.html in January 2001.
36. The Pew Research Center. Internet now sapping broadcast news audience. Found at http://www.people-press.org/media00rpt.htm in January 2001.
37. M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980. Downloadable from the Porter Stemming Algorithm home page: http://open.muscat.com/stemming.
38. A. Renouf. Making sense of text: automated approaches to meaning extraction. In the 17th International Online Information Meeting Proceedings, pages 77-86, 1993.
39. Mark Rorvig. Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of TREC topic-document sets. In The Journal of the American Society for Information Science, January 1998.
40. Gerald Salton, editor. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, NJ, 1971.
41. Gerald Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill Book Co., New York, 1983.
42. Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text.
In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206-231, August 1999.
43. J. Shafer, R. Agrawal, M. Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In the 22nd International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996.
44. R. Srikant. Member of the Data Mining group at the IBM Almaden Research Center. Interview, January 25, 2001.
45. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
46. Andrew Turpin, Alistair Moffat. Statistical Phrases for Vector-Space Information Retrieval. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 309-310, August 1999.
47. E.M. Voorhees and D. Harman. Overview of the Eighth Text Retrieval Conference (TREC-8). Including Appendix A, TREC-8 Results. Found at http://trec.nist.gov/pubs/trec8/t8_proceedings.html in January 2001.
48. WordNet. A Lexical Database for English. Cognitive Science Library, Princeton University. Found at http://www.cogsci.princeton.edu/~wn/ in January 2001.
49. G.K. Zipf. Human Behavior and the Principle of Least Effort. Addison Wesley Publishing, Reading, Massachusetts, 1949.