Mining Patterns for Document Retrieval
by
Karin Cheung
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2001
Copyright 2001 Karin Cheung. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author ............................................................
Department of Electrical Engineering and Computer Science
January 25, 2001

Certified by ......................................................
Rakesh Agrawal
VI-A Company Supervisor

Certified by ......................................................
Kenneth Haase
M.I.T. Thesis Supervisor

Accepted by .......................................................
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Mining Patterns for Document Retrieval
by
Karin Cheung
Submitted to the
Department of Electrical Engineering and Computer Science
February 6, 2001
In partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
ABSTRACT
The task of information discovery in text databases becomes increasingly complex as larger
corpora are used. This thesis presents ways of creating a richer understanding of the relationships
between documents in such corpora by automatically assembling a collection of cross-references
for any document in the corpus, through the discovery of word patterns. More specifically, the
system explores the use of data mining technologies, in particular association and sequential
mining, to identify features for text categorization.
Acknowledgements
First and foremost, I would like to thank the entire Data Mining group of the IBM Almaden Research
Center for their insights and assistance during these past several months. Thanks especially to Rakesh
Agrawal, whose patience and understanding allowed me to familiarize myself with the field of data mining
and ultimately decide what I personally wanted to accomplish in this thesis. To my mentor Daniel Gruhl, a
big thanks for answering countless questions, reading numerous thesis drafts, and always looking out for
me. Thanks also to Ramakrishnan Srikant for his invaluable insights on data mining and his always-helpful
advice.
On the MIT side, I would like to thank my thesis advisor Kenneth Haase for his guidance.
Thanks, Ken, for allowing me to follow my interests. To Jack Driscoll, thanks for giving me a new
perspective on the newspaper industry. Thanks also to Walter Bender for his support and interest in my
work. I would also like to thank the MIT VI-A Internship Program, for allowing me the wonderful and
unique opportunity of conducting my research jointly at ARC and the MIT Media Laboratory.
Lastly, thanks to Salvador Alvarez for his ever-present support and encouragement. And thanks
ultimately to my family, who have taught me everything I ever needed to know.
Contents

1  Introduction
   1.1  Motivation for the Ariel system
        1.1.1  Reader Assistance
        1.1.2  Newspaper Creation
   1.2  Thesis Goals
   1.3  Definitions
   1.4  Organization
2  Prior Research
   2.1  News of the Future
   2.2  The Search for Similarity
        2.2.1  Traditional IR Methods
        2.2.2  Current IR Precision/Recall Results
        2.2.3  Using Phrases in Text Retrieval
   2.3  Data Mining
        2.3.1  Data Mining Techniques
        2.3.2  Applications to Text Databases
3  Mining Story Patterns
   3.1  Motivations
        3.1.1  Shared Writing Styles
        3.1.2  Time Clusters
        3.1.3  Topic Recognition
   3.2  A Story Pattern
   3.3  Generating Patterns
        3.3.1  A Formal Problem Statement
        3.3.2  The Algorithm
        3.3.3  Results of Pattern Generation
   3.4  The Search for Related Stories
   3.5  Other Considered Techniques
        3.5.1  Looking at Sequential Patterns
        3.5.2  General Corpus Trends
        3.5.3  Conclusions
4  Existing Resources
   4.1  The Milan News System
        4.1.1  Gathering and storing information
        4.1.2  Making it useful - experts and the blackboard system
   4.2  The Vinci System
   4.3  Milan Services
   4.4  Where Ariel Fits
5  Ariel: Design and Implementation
   5.1  General System Flow
   5.2  Story Analysis
   5.3  Pattern Mining
   5.4  Indexing for Searches
   5.5  A Ranking Schema
   5.6  The User Interface
   5.7  Conclusions
6  An Evaluation of Ariel
   6.1  Experimental Protocol
        6.1.1  Graph Characteristics
        6.1.2  Performance Considerations
   6.2  Milan Experiment Results
   6.3  Reuters Experiment Results
   6.4  Final Observations
7  Conclusions and Future Work
   7.1  Contributions
   7.2  Future Work
        7.2.1  Incorporating Concepts into Patterns
        7.2.2  Recognizing Topic Progression
        7.2.3  Story Summarization
        7.2.4  Story Merging
        7.2.5  Expanding the Scope of Ariel
List of Figures

2-1  Statistical analysis on text is traditionally done by first creating a document-word matrix. Each element represents the number of occurrences of the corresponding word in the given document.
3-1  An example of pattern generation using frequent itemsets.
3-2  Pattern generation for the Hubble Space Telescope article.
3-3  Pattern generation for the Camp David article.
3-4  Pattern generation for the Al Gore article.
4-1  Milan is a personalized information portal for the Vinci system.
4-2  A frame describing an article on the Vietnam Memorial, represented both graphically and in XML format.
4-3  A graphical representation of the Vinci system.
4-4  An illustration of the Milan system, which is built on top of the Vinci base services. Milan relies on several other Vinci services, including Ariel and various news channels.
5-1  Overall system diagram for the approach presented in this thesis.
5-2  A Milan news article.
5-3  A Milan page displaying the top stories of the "Reintroducing Gore" article.
6-1  Precision/Recall graph for Article 1.
6-2  Precision/Recall graph for Article 2.
6-3  Precision/Recall graph for Article 3.
6-4  Precision/Recall graph for Article 4.
6-5  Precision/Recall graph for Article 5.
6-6  Precision/Recall graph for Article 6.
List of Tables

2.1  Recall Level Precision Averages for TREC-8 Ad Hoc Retrieval.
3.1  The preliminary testing environment.
3.2  Preliminary Precision/Recall Evaluation for the Hubble Space Telescope article.
3.3  Preliminary Precision/Recall Evaluation for the Camp David article.
3.4  Preliminary Precision/Recall Evaluation for the Al Gore article.
3.5  Precision/Recall values for articles retrieved using association patterns and sequential patterns.
3.6  An experiment in discovering general corpus trends using pattern mining techniques.
5.1  A sample dictionary mapping each word to its unique numeric identifier.
6.1  The chosen articles for this testing environment.
6.2  A comparison of average precision values between Milan and Reuters pattern mining and cosine similarity results.
Chapter 1
Introduction
October 10, 2000. Kellie, a college student at Berkeley, has just returned home from a long day of classes.
She drops her books on the bed, grabs an apple from the fridge, and sits down at her computer to read the
daily news. Browsing idly through the articles, a headline catches her eye: "U.S. Focus Turns to Middle
East Violence". She clicks on the link to read the article, and learns of President Clinton's urgent attempt
to negotiate a cease-fire between Israel and Palestine. She recalls hearing about the recent uprising in the
Middle East, but cannot remember what sparked the conflict, or why the Israelis and Palestinians have
always been so hostile to each other.
Curious, Kellie decides to call up Ariel, a computer software agent that searches for news stories similar to
the current one. Within moments, the current article is statistically analyzed for frequent word patterns and
then matched with other stories in the corpus. A new window pops up on the computer screen.
Clinton Follows Carter's Footsteps to Camp David, Process Begun in '78
Arab Uprising Spreads to Israel; Death Toll Rises to More Than 30
Middle East Clash Takes U.S. Role to a New Height
Middle East Violence May Push Parties Back to Peace Talks
Arafat and Barak Agree to Emergency Summit Meeting¹
Browsing through the list, Kellie can now read through articles regarding the Middle East violence. Here
she finds the key issues of the Camp David summit, the outbreaks of violence, and the attempts to reach a
peace agreement. Ariel can perform a similarity analysis on any news article requested by the user,
including those retrieved via a previous query. If Kellie wants to see more articles about Camp David, for
example, she can ask Ariel to retrieve stories similar to the article "Clinton Follows Carter's Footsteps to
Camp David".
¹ These are actual results from the Ariel system given the article "U.S. Focus Turns to Middle East Violence" from the Milan news corpus.
1.1
Motivation for the Ariel system
1.1.1
Reader Assistance
Trying to keep up with current events is a time-consuming task. People want to be informed about the
events that affect them, but at the same time the average person has many other responsibilities and
interests. As a result, only a small percentage of the day can be spent on the news.
The Ariel system allows more time to be spent on background information and less time on
searching for it. More specifically, Ariel introduces a new, efficient method of discovering similar articles
in a news corpus, using statistical data mining techniques. Integrated with the personalized newspaper
Milan, Ariel can find related news stories for any article requested by the user. This automated system
discards the need for cumbersome query-directed searches, and allows the reader to make better use of their
time.
For major events, such as coverage of the Middle East crisis, news providers occasionally include
background articles for the topic. The CNN article about an emergency Middle East summit might have
links to several detailed documents, including biographies of the Israeli and Palestinian leaders, an
explanation of their perspectives and demands, and an account of Camp David's decades of peace efforts.
Similarly, coverage of the 2000 Olympic Games in Sydney may include a link to pictures of the opening ceremonies, a
list of athletes and their performances, and a story of the Olympic Games' origins. These documents give
the reader a good overall summary of the event, and leave the reader with a well-established foundation
with which to better understand the current article.
However, since these précis are compiled and summarized by human editors, only a handful of
events can receive such careful attention. For example, a reader who cares about commercial airline news
might want information on recent airline mergers, employee strikes, and competitive pricing. Or, another
reader might be interested in the latest fashions, desiring reports from top clothing designers, tips on
interior decorating, and secret beauty treatments. A researcher may want articles on the effect of natural
disasters on the chip industry. Or, a child could demand stories on Pokemon trading cards. A news system
that relies on the skills of humans certainly cannot accommodate all possible topics of interest. Thus only
those stories widely held in public interest are maintained historically. Less popular topics are often left as
standalone articles, with little or no direction toward similar articles.
When a reader is strongly motivated to find related articles, however, there is always the
possibility of querying a search engine. For example, if no Ariel system existed, Kellie could search the
CNN web site for articles similar to "U.S. Focus Turns to Middle East Violence".²

² The articles shown here are actual results retrieved from the CNN.com search utility.
Using the article title as the search phrase, the system returns 4 articles:
TIME Asia TIME Finance Building a New House (27-Sept-97)
Oklahoma Bombing Trial (16-Dec-97)
Oklahoma Bombing Trial (03-Nov-97)
Oklahoma Bombing Trial (26-Nov-97)
Kellie decides to try again, this time using "Middle East Violence" as the search phrase. As a
result, CNN.com retrieves 595 articles about the Middle East crisis. While some of these articles are useful,
many are completely unrelated. The articles range from "Palestinian violence halts Barak's talks with
Clinton (21-May-00)" to "Peru's President pulls off another coup (06-Dec-99)".
While searching for information can produce useful results, it can also be a tedious and frustrating
task.
Search engines sometimes return a plethora of articles, many of which are unrelated to the reader's
interests. Other times they return nothing at all. Neither of these results is very appealing. Therefore, it's
not surprising that users prefer articles chosen by human editors rather than those chosen by Boolean
searches.
1.1.2
Newspaper Creation
The Ariel system assists readers in finding background information for any given article, so that they can
learn about a topic more effectively than traditional means have allowed. However, while this thesis is
targeted largely towards news consumers, journalists and editors could also benefit from such a system.
Before writing an article, a writer will frequently search for stories similar to the one being written, to see
how other journalists have approached the story. In many media organizations, including Wired.com, this
research is conducted by scanning through stories from their own archives using simple Boolean searches
[12]. With the Ariel system integrated into the newspaper archives, no cumbersome searches are needed.
The writer simply chooses one article, and asks Ariel to find stories related to it. Within seconds, Ariel can
scan through the archives and retrieve the requested articles.
While some stories are written by a newspaper's own journalists, many articles come in from other
sources.
A large newspaper company, such as The Boston Globe, has approximately 10 outside news
sources, including the Associated Press, United Press, Reuters, and Knight-Ridder [12].
Frequently,
several versions of the same story exist, written by journalists from different news sources. A newspaper
editor must sort through these stories to determine which stories are important enough to be printed. An
editor may choose a version of the story directly from the news source, or perhaps combine two versions
into one. Approximately three hours of an editor's time is spent in this filtering and selection process for a
single newspaper issue [12]. Ariel can help editors by processing stories as soon as they enter the news
system. Therefore, by simply choosing a single article, the editor can have Ariel retrieve similar stories
without sorting through the stories by hand.
1.2 Thesis Goals
This thesis is concerned with the development and evaluation of an autonomous news system that helps a
reader better understand a news topic, without being limited to the scarce resource of human editors and the
frustrating option of search engines. This system, named Ariel, analyzes a story and provides the reader
with a collection of related articles. To determine which articles are most relevant, Ariel searches for
features, which we call story patterns, within a given article, and then matches those features to other
articles in a news corpus. We will refer to this retrieval technique as pattern mining. Story patterns are
patterns of words that appear frequently among articles pertaining to the same news topic.
This thesis
illustrates that the topic of a news article can be revealed through an analysis of the patterns displayed
throughout the article.
The analysis of story patterns is performed using statistical data mining techniques, namely
association and sequential mining [5][7][8][45]. This paper explores the use of both association mining
and sequential mining techniques to discover these patterns. As a result, two different uses of patterns are
considered in this research. One usage involves searching for common patterns of words within a single
article, while the other involves searching for patterns throughout the entire news corpus. This thesis will
demonstrate that the former technique allows the construction of unique story descriptors, while the latter
retrieves far less interesting patterns.
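The within-article use of association mining can be made concrete with a toy example. The sketch below (an illustration only, not Ariel's implementation; the whitespace tokenization and the minimum-support threshold of 2 are assumptions) counts word pairs that co-occur in multiple sentences of an article, keeping only those frequent enough to act as candidate story patterns:

```python
from itertools import combinations

def frequent_pairs(sentences, min_support=2):
    """Count word pairs that co-occur in at least `min_support` sentences."""
    counts = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for pair in combinations(sorted(words), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

sentences = [
    "clinton visits camp david",
    "clinton negotiates at camp david",
    "violence erupts in the middle east",
]
# Pairs such as ("camp", "david") survive the support threshold;
# words appearing in only one sentence do not.
print(frequent_pairs(sentences))
```

The same counting idea extends to larger itemsets via the Apriori candidate-generation step, which is what the association-mining literature cited above formalizes.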
Once these features are identified, the system cross-references articles with similar patterns, and
stores these results with the article in the news database. Thus, when a user requests related stories for a
particular article, Ariel simply refers to its database to determine which articles may provide the most
useful information.
By analyzing each story before a user request arrives and storing the results in a
database, the system can employ fairly computation-intensive techniques without forcing the user to wait
through the processing time. Miller's work has shown that a 2-second response time is the limit for a
user's flow of thought to stay uninterrupted [32]. For longer delays, the user will want to engage in other
tasks while waiting for the computer to finish. Pre-computation prevents this time constraint from
limiting the quality of retrieval.
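The pre-computation strategy amounts to trading offline work for an instant lookup at request time. A minimal sketch (the function names and the stand-in similarity function are assumptions, not Ariel's actual code):

```python
def build_related_index(articles, find_related):
    """Run the expensive similarity analysis once, offline, per article."""
    return {art_id: find_related(art_id) for art_id in articles}

def related_stories(index, art_id):
    """At request time the answer is a cheap dictionary lookup,
    well under the 2-second flow-of-thought limit."""
    return index.get(art_id, [])

# Toy corpus; the lambda stands in for the pattern-mining analysis.
articles = {"a1", "a2", "a3"}
index = build_related_index(articles, lambda a: sorted(articles - {a}))
print(related_stories(index, "a1"))  # ['a2', 'a3']
```

However long `find_related` takes, the user-facing path never pays for it.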
1.3 Definitions
In this discussion, a news article refers to a single piece of writing about an event in the news. A news
story, on the other hand, is considered a collection of articles about the same topic. For example, the news
story on the 2000 Olympic games includes the news articles "An Olympic ode from down under", "The
Drug Games: the legacy Sydney doesn't want", "First Sydney gold for Marion Jones", and "The Olympics
conclude". A news topic is a single descriptive word or phrase about a given news story. In this case, the
topic might be "The 2000 Olympic Games". A story pattern refers to a selection of words that identifies an
article as belonging to a particular news topic.
1.4 Organization
Chapter 1 - Introduction
Chapter 2 - Prior Art
Chapter 3 - Mining Story Patterns
Chapter 4 - Existing Resources
Chapter 5 - Ariel: Design and Implementation
Chapter 6 - An Evaluation of Ariel
Chapter 7 - Conclusions and Future Work
Chapter 8 - References
Chapter 2
Prior Art
This thesis incorporates ideas from online news, information retrieval, and data mining research. This
section provides an overview of related research done in these particular fields, many of which have
influenced the development of this thesis.
2.1
News of the Future
The realm of online news has flourished in recent years, with services such as CNN.com and my.Yahoo.com
becoming more popular every day. In fact, the percentage of Americans getting their news from the
Internet at least once a week has grown from 20% in 1998 to 33.3% in 2000 [36].
Furthermore,
approximately 99% of the nation's largest newspapers now have an online presence. There are more than
4,000 newspapers online worldwide, half of them from the United States. As online news sources are
becoming more widely read, there has been a growing interest in creating a computer system that can
develop meaning in a corpus of news. These systems would be able to choose relevant articles based on an
understanding of article and user contexts, to assist the work of human editors.
The News in the Future research consortium provides a forum for the MIT Media Lab and
member companies to explore technologies that will affect the collection and dissemination of news [35].
NiF's work in image, text, and audio understanding is improving the machine understanding of text and
enabling computer systems to better understand the context of news. Furthermore, in terms of information
retrieval, the consortium's work in managing data has contributed greater precision levels to database
queries and filtering.
FramerD is a distributed object-oriented system designed for the maintenance and sharing of
knowledge bases [23]. This system, developed by Kenneth Haase and the Machine Understanding Group
of the Media Lab, is a prevalent database tool used in NiF applications. The system incorporates knowledge
base applications such as a multi-lingual ontology and a sentence parser. At the same time, FramerD is a
frame-based transport system based on d-types [1].
One application built on FramerD is PLUM, a system that adds contextual information to disaster
news stories [13].
PLUM uses augmented text systems to combine the understanding of news with the
understanding of the reader's context.
The result is the original news article, supplemented by facts
relating the disaster with facts about the user's own community.
Such annotations may include "The
affected area is about the size of Cambridge," or "Sorghum is 75% of the country's agricultural output".
The use of augmented news has also been prevalent in online newspaper developments.
MyZWrap and Panorama are two personalized newspapers developed at the MIT Media Lab and IBM
Almaden Research Center, respectively, which use augmented text to develop meaning in news [19]. The
goal was to develop a system that could assist, simplify, and automate the types of editorial decisions that a
human editor faces. The personalized news portal Milan emerged from these two systems, and is currently
the context used by Ariel. Milan supplies Ariel with a news corpus and an online newspaper to work with,
as well as the functionality to display news articles for the reader. Further details on Milan will be
discussed in Chapter 4.
Some researchers have also experimented with integrating different news media into a single news
system. For example, Bender's work in Network Plus is an investigation in combining news wire services
with network television news [9]. The resulting model is a joint television viewing by both the consumer
and a personal computer. The computer system analyzes the incoming television broadcast, and retrieves
further data from both local and remote databases. The resulting data may then be used to augment the
current television broadcast, or create an augmented newspaper from the broadcast. Similar research has
been done at the MITRE Corporation in recent years.
In AAAI-97, they present the BNA and BNN systems, Broadcast News Analysis and Broadcast News
Navigator, respectively [31]. The system develops a broadcast news video corpus, and provides techniques
such as story segmentation, proper name extraction, and visualization of associated metadata.
2.2
The Search for Similarity
While this thesis arose from interests in contextualizing news, the foundations of this work lie in
information retrieval. Simply put, this thesis illustrates a technique for document retrieval based on a user
query.
Every information retrieval system consists of information items, a set of requests, and some
mechanism for determining which, if any, of the items meet the requirements of the requests [41]. This
thesis shows exactly these elements. The items, or documents, are the news articles of the news corpus. A
request is a user indication that they want more stories related to a specific article that they are reading.
And, finally, the similarity metric is the use of story patterns to discover similarities.
[Document-word matrix with one row per word (CAT, BEAR, DOG, ...) and one column per document (DOC1, DOC2, DOC3, ...); each cell holds a word count.]

Figure 2-1: Statistical analysis on text is traditionally done by first creating a document-word matrix. Each
element represents the number of occurrences of the corresponding word in the given document.
2.2.1
Traditional IR Methods
Statistical techniques have a long history of use in the context of information retrieval. In order to apply
these techniques, a mapping is usually needed between the articles and a vector representation of the
articles. This idea was first proposed by Gerald Salton and the SMART retrieval system in 1971 [40]. The
most common such mapping is one that takes the list of words that appear in a document and maps them to
individual elements in a vector. These documents can then be collected into a single matrix that represents
the corpus, as seen in Figure 2-1.
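Building such a document-word matrix takes only a few lines. The sketch below (an illustration; the whitespace tokenization and sorted vocabulary ordering are assumptions) maps each document to a vector of word counts:

```python
from collections import Counter

def document_word_matrix(docs):
    """Map each document to a row of word counts over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, count in Counter(doc.lower().split()).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix

vocab, matrix = document_word_matrix(["cat dog dog", "dog", "cat cat"])
print(vocab)   # ['cat', 'dog']
print(matrix)  # [[1, 2], [0, 1], [2, 0]]
```

Each row is one document's vector; stacking the rows gives the corpus matrix of Figure 2-1.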
Consequently, many retrieval systems refer to this matrix to determine document similarities,
rather than processing information from the document corpus. For example, a boolean search on "cat +
dog" looks up the indexes for the words "cat" and "dog", and determines which of the documents have
matches in both rows. Cosine similarity is another well-known similarity metric that utilizes vector spaces.
Given two document vectors, the cosine method performs a dot product function on the vectors. The result
is a similarity value between 1 and 0, with 1 meaning that the documents have exactly the same content and
0 meaning the two are nothing alike.
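The cosine metric can be computed directly from two such count vectors; note that the dot product is normalized by the vector lengths so that the result falls between 0 and 1. A minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Length-normalized dot product of two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([1, 2, 0], [1, 2, 0]))  # 1.0 (identical content)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (nothing shared)
```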
Many systems also utilize word frequencies as a similarity metric, also known as tf/idf (term
frequency/inverse document frequency). A writer generally repeats certain words throughout a document as
the subject is explained in further detail.
This word emphasis is well-recognized as "an indicator of
significance" [30]. However, neither high-frequency nor low-frequency words are good content identifiers,
as explained by the "principle of least effort" [49]. This principle claims that writers generally use a smaller
common vocabulary. Consequently, the most frequent words tend to be the least informative. As a result,
many consider term frequency to be proportional to the frequency with which a word appears in a
document, but inversely proportional to the number of documents it appears in.
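The resulting tf/idf weight can be sketched as follows (a minimal illustration; real systems differ in the exact log scaling and smoothing):

```python
import math

def tf_idf(term, doc, corpus):
    """Weight a term by its in-document frequency, discounted by how
    many documents in the corpus contain it."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["cat", "dog"], ["dog", "dog"], ["bear"]]
# "dog" appears in 2 of 3 documents, so it is discounted relative to
# "bear", which appears in only 1.
print(tf_idf("dog", corpus[1], corpus))
```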
While statistical techniques perform well in text retrieval, ambiguities in word or phrase meanings
may lead to retrieval errors. In the sentences "Spring has arrived!" and "The watch has a broken spring,"
the word spring has two completely different meanings. However, a system that only considers single
words for context description may very well classify these two sentences as similar. This weakness can be
fixed, of course, by using term co-occurrence statistics and word adjacency operators, but statistical
techniques are still imperfect. In particular, a statistical system has no way of distinguishing phrases such
as "Venetian blind" and "blind Venetian."
Online lexical reference systems, such as WordNet, provide tools to de-ambiguate text, by
organizing information in terms of word meanings instead of word forms [18][48].
Using WordNet, the
words spring, fountain, outflow, and outpouring in the appropriate senses could all be broken down into the
same concept. Thus, concepts instead of words could be stored in the vector space model and used to
describe documents in place of words.
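The word-to-concept substitution that WordNet enables can be illustrated with a toy synonym table (WordNet itself provides these mappings at scale; the concept identifier below is invented for illustration):

```python
# A hand-built toy concept table; WordNet supplies such mappings at scale.
CONCEPTS = {
    "spring": "flow-source", "fountain": "flow-source",
    "outflow": "flow-source", "outpouring": "flow-source",
}

def to_concepts(words):
    """Replace each word with its concept identifier when one is known."""
    return [CONCEPTS.get(word, word) for word in words]

print(to_concepts(["the", "fountain", "and", "the", "spring"]))
# ['the', 'flow-source', 'and', 'the', 'flow-source']
```

Storing the concept identifiers in the vector space model then lets documents match on meaning rather than on surface form.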
However, issues like the Venetian blind example cannot be solved without a more complex
representation of documents. Natural Language Processing (NLP) provides such a representation, allowing
document analyses to include structured descriptions such as noun-verb-noun or adjective-noun
combinations [41]. Intuitively, these more meaningful larger units should create substantial improvements
in text analysis and information retrieval. At this time, however, the performance of statistical and syntactic
methods is domain-dependent.
2.2.2
Current IR Precision/Recall Results
To provide a better idea of current IR capabilities, Table 2.1 shows the evaluation results of the TREC-8
runs for Carnegie Mellon University, Johns Hopkins University, and Oracle [47]. TREC-8 is the 8th annual
Text REtrieval Conference, held in November 1999, the latest in a series of workshops designed to
promote research in text retrieval. The table shows precision/recall results for the main TREC task, ad hoc
retrieval. The ad hoc retrieval task is similar to how a researcher might use a library. A question is asked
and a collection is available, but the answer is unknown.
This problem implies that the input topic has no
training material. The same problem is addressed in this thesis, where a reader requests articles similar to a
given article, or topic.
While Ariel has access to the Milan news corpus with which to discover the
requested information, the solution itself is unavailable.
The precision/recall numbers allow a clear
comparison between the Ariel system and other state-of-the-art systems, such as the SMART retrieval
system, thus providing us with a better foundation for evaluating Ariel.
2.2.3
Using Phrases in Text Retrieval
Phrases have traditionally been regarded as precision-enhancing devices in text retrieval systems. The
exploration of phrases as terms in a vector-space-based retrieval system, first introduced by Salton [40][41],
has received careful attention in the past few decades. Statistically, phrases may be thought of simply as
Recall      Precision (CMU)    Precision (JHU)    Precision (Oracle)
0.00        0.5705             0.7677             0.9279
0.10        0.4092             0.5496             0.7747
0.20        0.3308             0.4641             0.6610
0.30        0.2987             0.4070             0.5448
0.40        0.2415             0.3570             0.4542
0.50        0.2118             0.3139             0.3882
0.60        0.1710             0.2663             0.3168
0.70        0.1380             0.2244             0.2565
0.80        0.1037             0.1522             0.2067
0.90        0.0601             0.1017             0.1332
1.00        0.0315             0.0410             0.0791
Average:    0.2175             0.3126             0.4130

Table 2.1: Recall Level Precision Averages for TREC-8 Ad Hoc Retrieval
combinations of terms that are used in place of the individual terms they are composed of. The best
phrases have a larger than expected joint frequency of occurrence, given the frequency of the individual
terms.
Phrases can be defined in a variety of ways.
At one extreme, a system may define two terms
appearing in the same document as a phrase. However, better precision results are found when more
restrictions are in place. For example, phrases may be limited to terms occurring in the same sentence, in
the same sentences and adjoining positions, or in the same sentences with at most k items separating them.
Phrases may also be defined syntactically. However, syntactic phrase detection becomes much more time-consuming and computationally expensive, since additional tests are needed.
Given these definitions, the story patterns described in this thesis may be thought of as phrases
within a document. Thus, a clearer understanding of phrase usage and its resulting performance benefits
this thesis. In most phrase-based statistical systems, a phrase is defined as a pair of non-function words that
occur contiguously often enough in a corpus. Similarly, in phrase-based syntactic systems, a phrase is
defined as any set of words that satisfy certain syntactic relations or constitute specific syntactic structures
[33].
Salton et al.'s 1975 work has shown that including statistical phrases as terms in vector space
models increases precision by 17% to 39% [40]. Furthermore, the work of Lewis and Croft in 1990
illustrates that the quality of text categorization is clearly improved by using word phrases [29].
Renouf's 1993 work also demonstrates that phrases can be reliably used as search terms. According to
these evaluations, sequences of two or more nouns can effectively identify concepts found within a document
[38].
The use of phrases has also become prevalent in related areas of study.
The automatic
determination of text themes within a document has been explored by Salton et al., by using phrases to
describe document themes. More recently, phrases have been used as a means for automatically deriving a
hierarchy of concepts from a specified set of documents [42], instead of using traditional clustering
techniques.
And lastly, the Phrasier system has been recently developed, an interactive system for
browsing through related documents using automatically extracted key phrases [26].
However, some groups have argued that the use of phrases is not useful in enhancing precision in
document retrieval.
In 1989, Fagan ran the same experiments as Salton's 1975 work, but used larger
document collections of about 10MB. His reports show that average precision improvements ranged from
11% to 20% [15]. This downward trend continued in 1997, with Mitra et al. replicating the same
experiments on a 655 MB collection and reporting only a 1% precision improvement when phrases are used as
terms.
Mitra's experiments also found that syntactic and statistical methods for recognizing phrases
yielded comparable performance. And most recently, in 1999, statistical phrase analyses by Turpin et al.
confirmed Mitra's results, adding further evidence to the argument that phrases are not useful precision-enhancing devices [46].
These studies have argued that advances in single-term retrieval have outpaced the usefulness of phrases
in text retrieval problems.
While some arguments exist against the use of phrases, the common belief is that word
patterns are good precision-enhancing devices. In fact, as this thesis will illustrate, statistical word patterns
are certainly advantageous to the field of information retrieval when the phrases used are generated using
traditional association mining techniques.
2.3
Data Mining
Data mining is often referred to as knowledge discovery in databases, or KDD. A combination of statistical
and machine learning techniques, data mining is the process of identifying useful and understandable
patterns in data. The primary difference between statistical analysis, machine learning, and data mining is
the issue of scalability [44]. While statistical analysis and machine learning deal with smaller data sets, data
mining techniques are applied to voluminous databases.
In the media, data mining is most often associated with consumer-oriented applications. Data
mining software can help businesses obtain insightful information about their customers and business
practices by identifying customer behaviors, recognizing market trends, and minimizing costs [20]. If this
information is used effectively, data mining technologies give businesses a clear advantage over their
competitors.
In addition to consumer markets, data mining can have a variety of different applications. For
example, in astronomy, the SKICAT system performs image analysis, classification, and cataloging of sky
objects from survey images [16].
In sports, IBM's Advanced Scout aids NBA basketball coaches in
organizing and interpreting data from games [16]. And in more recent years, a growing interest in applying
data mining to document analysis has emerged. Applications include categorizing text into pre-defined
topics and discovering trends in text databases.
2.3.1
Data Mining Techniques
Data mining techniques can be categorized into four general classes: Associations, Sequential Patterns,
Classification, and Clustering [4][20]. While this thesis only concerns itself with association and sequential
mining, classifiers and clustering will also be mentioned for completeness. Each of these methods will
receive a brief description, and some examples of applications for which these functions are useful. The
next section will describe data mining work related to textual analysis, which is more the focus of this
thesis.
Associations and Sequential Patterns
Given a set of transactions, where each transaction is a set of items, an association rule is an expression of
the form X => Y. For example, a clothing retailer might infer the rule "38% of all people who bought
jackets and umbrellas also bought sweaters." Here, 38% is referred to as the confidence of the rule.
Associations can involve any number of items on either side of the rule. The problem of association
mining is to find all association rules that satisfy user-specified minimum support and minimum confidence
constraints.
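The problem statement above can be sketched as a brute-force search over all itemsets, which makes the roles of minimum support and minimum confidence concrete. This is an illustrative sketch only (the function name is mine), not the efficient Apriori algorithm this thesis actually uses, which is described in Chapter 3:

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """Brute-force association mining: enumerate every itemset, keep those
    meeting min_support, then emit rules meeting min_confidence."""
    items = sorted({i for t in transactions for i in t})
    def support(itemset):
        # support = number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)
    frequent = [frozenset(c) for k in range(2, len(items) + 1)
                for c in combinations(items, k)
                if support(frozenset(c)) >= min_support]
    rules = []
    for fs in frequent:
        for k in range(1, len(fs)):
            for lhs in map(frozenset, combinations(fs, k)):
                conf = support(fs) / support(lhs)   # confidence of lhs => (fs - lhs)
                if conf >= min_confidence:
                    rules.append((lhs, fs - lhs, conf))
    return rules
```

On the clothing-retailer example, a rule like {jacket, umbrella} => {sweater} would be reported with its confidence, the fraction of jacket-and-umbrella transactions that also contain a sweater.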
A more restrictive form of association mining is the problem of mining sequential patterns.
Sequential patterns are simply associations found in a particular sequence. In most retail organizations,
records of customer purchases typically consist of the transaction date and the items bought in the
transaction. Very frequently these records also contain customer ID's, particularly when the purchase has
been made using a credit card, membership card, or a frequent buyer club. Given this data, an analysis can
be made by relating the purchases with the identity of the customer.
For example, in a bookstore's database, one sequential pattern might be "10% of customers who
bought The Sorcerer's Stone in one transaction later bought The Chamber of Secrets in a later transaction,
and The Prisoner of Azkaban and The Goblet of Fire in a third."³ Each transaction is referred to as an
element in the sequence. As you can see, elements may be single items or sets of items. Furthermore, these
purchases need not be consecutive: customers who purchase other books in between also support this
sequential pattern.
Sequential mining may be used in any application involving frequently occurring
patterns in time.
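The containment test for sequential patterns, in which elements must appear in order but other transactions may intervene, can be sketched as follows (the function name and item names are illustrative):

```python
def supports_sequence(customer_transactions, pattern):
    """True if the customer's ordered transactions contain the pattern's
    elements in order, allowing unrelated transactions in between.
    Each transaction and each pattern element is a set of items."""
    it = iter(customer_transactions)
    # For each element, scan forward until some later transaction contains it;
    # the shared iterator enforces the ordering constraint.
    return all(any(element <= t for t in it) for element in pattern)
```

A customer who buys an unrelated book between the pattern's elements still supports the pattern, exactly as described above, while a customer who buys the books out of order does not.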
Classification
The problem of classification involves finding rules that partition data into distinct categories. In
classification, a set of example records is given, each containing a number of attributes. One of these
attributes, called the classifying attribute, identifies the class to which each sample record belongs. The
records are used as a training set on which to build a model of the classifying attribute based on the others
[43].

³ From the Harry Potter series by J.K. Rowling.
Credit analysis is a prime classification application. For example, credit card companies have
detailed records about their customers, with a number of attributes regarding each record. Each record is
generally tagged with a GOOD, AVERAGE, or BAD attribute, representing the credit risk of each
customer. A classifier could examine each of these tagged records, and produce a description of the set of
GOOD customers as those "between the ages of 40 and 55, with an annual income over $45,000, living in
suburban areas". Models may be built using a variety of techniques, including decision trees and neural
networks.
Clustering
Clustering techniques are similar to classification, but for one major difference. While records contain
classifying attributes for classifiers, clustering uses a set of records without tagged attributes. The goal of a
clustering function is to identify a finite set of categories or clusters to describe the data.
Data can be clustered to identify market segments or consumer affinities, for example. A furniture
retailer might use clustering to determine which groups of consumers would be more likely to purchase a
new product, and organize their advertisement campaign accordingly. Given a set of customer transaction
records, the retailer may separate customers demographically by those interested in expensive furniture
sets, moderately priced furnishings, and unfinished decor.
2.3.2
Applications to Text Databases
One major data mining interest is in the field of text classification. Text classification is the problem of
automatically assigning predefined categories to free text documents. In text searching and browsing, text
classification systems may be used to create a topic hierarchy for user navigation through a text database.
As shown in [10], using statistical pattern recognition allows discriminants to be efficiently separated at the
node of each taxonomy. Then, these discriminants are used to build a multi-level classifier.
The Athena system is another text classification system, built to maintain a hierarchy of
documents through interactive mining technologies [3]. Currently, Athena is implemented on Lotus Notes
to support e-mail management and discussion databases. Using Naïve Bayes classifiers and clustering
algorithms, Athena performs topic discovery, hierarchy reorganization, inbox browsing, and hierarchy
maintenance. While other classification models have been developed for e-mail routing purposes, Athena
is the only such system to address text database organizations outside the routing of incoming mail.
Similarly, association mining has been explored as a method of discovering news topics for the
online newspaper ZWrap [19]. The ZWrap system seeks to automatically identify rules about various news
topics and bring them to the attention of an editor who can decide whether the presented hypotheses are
valid. The use of association-rule data mining on news features, such as locations or people, may discover
features with high rates of co-occurrence. For example, association techniques applied to people may
discover that 80% of the time Al Gore is mentioned in a news article, George W. Bush also appears. Using
these techniques, nodes of topics may be formed, where each node contains several rules pertaining to a
particular person or event. In this way, a better understanding of news contexts may be created.
Sequential mining has also been incorporated in IR research, using statistical phrases to discover
trends in text databases [27]. Here, phrase identification is transformed into the problem of mining
sequential patterns. The pattern mining techniques described in this thesis are most comparable to this
approach.
Instead of finding common transaction sequences for consumer purchases, for example, the
system finds common sequences of words and calls them phrases. Associated with each phrase is a history
of the frequency of occurrence for that phrase. Thus, when a user queries the database for a certain trend,
these histories may be consulted to determine which pattern histories meet the specified requirements.
Trends are defined as specific subsequences of a phrase history that satisfy a user query, for example a
spike-shaped trend, or a bell-shaped trend.
In recent years, the combination of data mining techniques and information retrieval problems has
shown promising results. The next chapter will introduce the basic ideas of this thesis, which illustrates a
new application of data mining to high-precision document retrieval. This thesis incorporates many of the
ideas described above in the realms of news, IR, and data mining research.
Chapter 3
Mining Story Patterns
The Ariel system generates story patterns for each article in the news corpora. Each set of story patterns
may be thought of as a unique signature for a particular story, which allows for the retrieval of related
articles. This section explains the techniques used to create these patterns, as well as some techniques that
were explored but eventually discarded. Furthermore, this section attempts to provide some insight behind
the usefulness of these patterns.
3.1
Motivations
3.1.1
Shared Writing Styles
Journalists generally use similar wording in their writings, when referring to the same event or topic. As a
result, patterns of words are repeated frequently. For example, articles on the 2000 presidential election might
frequently use the words Gore, Bush, president, election, and race, while coverage of the 2000 Olympics may
include Olympics, Sydney, athlete, drug, and medal. As discussed in the previous chapter, Section 2.2.3,
these word patterns can help IR systems identify related documents.
3.1.2
Time Clusters
From a reader's standpoint, news topics come and go. Some topics emerge as minor tidbits of news,
gathering momentum until they peak as the nation's hottest news stories. Once the excitement dies,
however, the topic soon disappears from the public eye. The 2000 presidential election is one such
example, with early campaign news emerging early in the summer and culminating into the tight election
race. Weeks later, election results are resolved and public interest subsides. Other topics may appear out
of nowhere, creating immediate headlines in the news and then dying away. One example is the Federal
Reserve's surprise decision on January 3, 2001 to slash rates by half a percentage point, creating a spurt of
activity in the stock market. Or, consider the Middle East crisis, which attained high public interest during
the Camp David summit, and then re-emerged months later in news of the Israeli-Palestine uprisings.
News topics such as these revolve around clusters of time. Like waves rising and falling, topics emerge in
the news for a certain period of time, and then diminish from public view. Some topics, such as the Camp
David story, may later re-emerge in an updated context.
Patterns are strongly correlated with the time clusters in which they appear. In one experiment, Haase
demonstrated the importance of numbers in online newspapers. If one article claimed "43 people died in a
train crash yesterday," a search on the number 43 in that period of time would most likely produce articles
regarding that accident [17]. This number 43 can be viewed as part of a unique signature for the train
accident topic. However, such signatures are only valid within the time cluster, as the number 43 may have
had other significance in the past.
We then considered what other types of signatures could be used to describe news topics. Patterns
of words can be thought of as unique signatures for a particular news topic. Word patterns are more
complex than numbers. However, they can be used in a much broader range of news stories. For example,
an article on the Middle East crisis may support the pattern {Clinton, Israel, Palestine, peace}. This pattern
encompasses a key idea of the topic, namely the involvement of the U.S. in the Middle East peace process.
Furthermore, the pattern encapsulates the topic's representation in time. In addition, more than one pattern
may represent a news topic. Groups of patterns for the Middle East crisis may include {Barak, Arafat,
meet}, {Camp, David, peace}, and {Clinton, Israel, Palestine, peace}.
The uniqueness of a pattern, or group of patterns, allows it to surface only when the topic is
present. For example, the words Clinton, Israel, Palestine, and peace will only appear together during
Clinton's time as president. This relationship marks a period of time in which the pattern is useful. In fact,
the pattern will predominantly appear during the Camp David summits and periods of violence, while
virtually disappearing during peaceful times.
3.1.3
Topic Recognition
In many retrieval systems, a major concern is the difficulty of discovering when new topics emerge, when
old ones become obsolete, and when interest in an old topic rises again. A news system may take several
days to discover that Florida is playing a major role in the 2000 presidential election, that interest in Elian
Gonzales has died away, or that enthusiasm for the Olympics has risen again.
Due to the nature of patterns, however, these concerns are insignificant.
Since patterns are
immediately discovered within a given article, Ariel recognizes the importance of the term "Florida" as
soon as it appears in a story pattern. Furthermore, the topic of Elian will become unimportant when the
topic's patterns stop appearing in daily news. And, lastly, patterns generated for the Olympics will match
both current and past Olympic stories. Therefore, patterns provide a quick and effective method of finding
related stories without worrying about explicit topic recognition.
3.2
A Story Pattern
A story pattern is a set of words that appear together within the same sentence of a given article. A k-item
pattern is defined as a pattern containing k elements. No particular ordering of the words is necessary,
since the flexibility of writing allows a given concept to be described using completely different word
orderings.
For example, one article may state: "President Clinton met with Palestinian leader, Yasser
Arafat, today" while another article says "Arafat spoke with Clinton briefly on Tuesday morning." In both
cases, the 2-item pattern {Clinton, Arafat} is present. This pattern is a set, and thus is equivalent to the
{Arafat, Clinton} pattern.
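Representing patterns as sets makes this order-independence automatic, as the following small sketch shows (the helper name and the light punctuation handling are illustrative choices):

```python
def sentence_supports(pattern, sentence):
    """A sentence supports a story pattern if it contains every word of the
    pattern, in any order; patterns are sets, so ordering is irrelevant."""
    words = set(sentence.lower().replace(",", "").split())
    return pattern <= words

# The pattern is a set, so the two ways of writing it are one and the same.
assert frozenset({"clinton", "arafat"}) == frozenset({"arafat", "clinton"})
```

Both example sentences above support the same 2-item pattern despite their completely different word orderings.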
Sequential patterns were also examined, since a case could be made for the importance of word
ordering in patterns.
It was observed that this technique creates slightly lower precision values and
significantly diminished recall values (see Table 3.5), leading us to believe that restricting patterns by word
ordering is too inflexible to find similarities in journalistic writing. This technique will be discussed in
more detail later in the chapter.
3.3
Generating Patterns
Patterns are generated using association mining techniques [5][7], which are traditionally used to mine
association rules for sales transactions.
The major components of association mining are a record of
transactions, the items purchased in each transaction, and the ID of the customer who made the transaction.
The problem of mining association rules can be broken down into two stages, generating frequent itemsets
and generating rules from these itemsets. A frequent itemset is a set of items whose transaction support
exceeds a minimum support, where the support for an itemset is the number of transactions that contain it.
To generate rules, a straightforward algorithm might be:
For every frequent itemset l, find all non-empty subsets of l. For every such subset
a, output a rule of the form a => (l - a) if the ratio of support(l) to support(a)
is at least minimum confidence. A rule X => Y is valid if minimum confidence% of
transactions in the transaction set that contain X also contain Y.
In the case of text mining, we substitute a transaction record with a record of a single news article.
Each sentence of the article is viewed as one transaction, where a sentence is defined as the set of words
contained in that sentence. The items of each transaction correspond to the words found in the sentence.
And, finally, the customer identifier is simply a unique sentence identifier. In creating patterns, we generate
frequent itemsets from the article. No rule generation is used. The result is a set of patterns, the itemsets,
where each pattern is a set of words frequently appearing together in sentences of the article.
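The substitution described in this paragraph, treating each sentence as a transaction of word-items, can be sketched as follows; the naive period-based sentence splitting is a simplification of real tokenization, and the function name is mine:

```python
def article_to_transactions(article_text):
    """Map a news article to association-mining input: each sentence becomes
    one transaction, identified by a unique SENT ID, whose items are the
    words of the sentence."""
    sentences = [s.strip() for s in article_text.split(".") if s.strip()]
    return [(sent_id, set(s.lower().split()))
            for sent_id, s in enumerate(sentences)]
```

Running frequent-itemset generation over these transactions, with no rule-generation step, yields exactly the story patterns described above: sets of words that appear together in enough sentences.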
3.3.1
A Formal Problem Statement
Let W = {w1, w2, ..., wm} be a set of all words.
Let A be a news article, which can be viewed as a set of sentences, where each sentence S is itself a set
of words such that S ⊆ W. Associated with each sentence is a unique identifier, called its SENT ID.
A sentence S is said to contain P, a set of some words in W, if P ⊆ S.
P is considered a story pattern for news article A if the support threshold is satisfied, meaning
that at least support sentences contain P.
Given a set of sentences A, the problem of finding story patterns is to generate all patterns P that have
support greater than a user-specified minimum support.
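The formal statement above translates directly into a brute-force search, shown here for illustration only; the efficient level-wise algorithm actually used is described in the next section:

```python
from itertools import combinations

def story_patterns(article_sentences, k, min_support):
    """All k-word patterns P contained in at least min_support sentences,
    per the formal statement: a sentence S contains P if P is a subset of S.
    Brute force over every k-subset of the article's vocabulary."""
    words = sorted({w for s in article_sentences for w in s})
    return [set(p) for p in combinations(words, k)
            if sum(1 for s in article_sentences if set(p) <= s) >= min_support]
```

Each sentence is modeled as a set of words, and the support of a candidate pattern is simply the number of sentence-sets that include it.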
3.3.2
The Algorithm
The methodology described here is a general approach that can be applied to text databases of varying
complexity. The original algorithm is taken from Rakesh Agrawal and Ramakrishnan Srikant's work on
discovering fast algorithms for mining association patterns [5][7]. This algorithm was carefully chosen for
its performance advantages. It has been shown to always outperform previously known algorithms such as
AIS and SETM. The performance gap increases with problem size, ranging from a factor of three for
smaller problems to more than an order of magnitude for larger problems [7].
Other association rule mining algorithms also exist [44]. However, these algorithms generally
focus on generating longer patterns and/or minimizing the number of passes for disk-resident data. Since
this thesis involves relatively short patterns and documents, the performance of Agrawal and Srikant's
algorithm is more than sufficient.
The result of the mining is a set of patterns that occur frequently within a given news article. The
general approach is given here, adjusted for text retrieval applications.
In order to analyze a given article A, the text must first be broken down into sentences and words.
After removing all stop words from A, each remaining word is stemmed using the Porter Stemming
algorithm [11][37].
Stemming allows the system to recognize the same word in different linguistic forms.
Next, each stemmed word is given a unique identifier, since the use of numbers is less computationally
expensive than strings.
Now, A may be represented as a set of sentences (transactions), each of which is in turn
a set of words (items). In each article, both the headline and body are considered. The headline is
simply represented as the first sentence of the article.
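The preprocessing pipeline just described can be sketched as below. The toy_stem function is a deliberately simplified stand-in for the Porter stemmer cited above, handling only the suffixes needed by the example in the next paragraph; the stop-word set and function names are likewise illustrative:

```python
def preprocess(article_text, stem, stop_words):
    """Break an article into sentence transactions: drop stop words, stem
    each remaining word, and replace stems with unique integer IDs (numbers
    are cheaper to compare than strings)."""
    word_ids, transactions = {}, []
    for sentence in article_text.lower().split("."):
        tokens = (w.split("'")[0].strip(",") for w in sentence.split())
        stems = {stem(t) for t in tokens if t and t not in stop_words}
        if stems:
            transactions.append({word_ids.setdefault(s, len(word_ids)) for s in stems})
    return transactions, word_ids

def toy_stem(word):
    """Toy suffix-stripper standing in for the Porter algorithm."""
    if word.endswith("ed"):
        return word[:-2]
    if word.endswith("e"):
        return word[:-1]
    return word
```

On the Clinton/Barak example below, this pipeline reduces "welcomed" and "welcome" to the shared stem "welcom", so both sentences yield the common pattern {Clinton, welcom, Barak}.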
Without word stemming, the Ariel system would lack the recognition of many similar words and
thus fail to recognize similar stories containing these words. For example, consider the sample article
"Yesterday, Clinton welcomed the visiting Prime Minister Barak. Clinton's welcome was warm as he
received Barak".
In the case of word stemming, the words "Clinton" and "Clinton's" are both translated to
"Clinton" while "welcomed" and "welcome" are recognized as different forms of "welcom".
Thus, a
system that incorporates stemming would recognize the common pattern {Clinton, welcom, Barak} within
Article A
Sent 1: items {1 3 4}
Sent 2: items {2 3 5}
Sent 3: items {1 2 3 5}
Sent 4: items {2 5}

L1 (large 1-itemsets):  {1}:2  {2}:3  {3}:3  {5}:3
C2 (candidates):        {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
L2 (large 2-itemsets):  {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2
C3 (candidates):        {2 3 5}:2
L3 (large 3-itemsets):  {2 3 5}:2
C4 (candidates):        empty

Figure 3-1: An example of pattern generation using frequent itemsets
the article, assuming that the minimum support is two sentences. Without stemming, however, only
{Barak} could be extracted from these two sentences. Clearly, the former pattern presents
a better description of the two sentences. In document retrieval, the former pattern allows for higher
precision, since the latter pattern will retrieve all articles mentioning Barak. While recall may be slightly
lower in word stemming, precision values make it the better choice in document retrieval.
Pattern generation involves making multiple passes over A. In the first pass, the support of each
word is determined, based upon how many sentences that word appears in. Out of these words, the system
then decides which are frequent (having minimum support), and which are not. In each subsequent pass,
the large items, or words, found in the previous pass are used as the initial set of items. This set is used for
generating new potentially large patterns, called candidates, whose support is also counted. At the end of
each pass we determine which patterns are actually large. These large patterns become the seed for the next
pass. This process continues until no new large patterns are found.
By only considering patterns found large in the previous pass, this algorithm reduces the number of
candidates. The basic idea is that any subset of a large pattern must also be large. Therefore, candidate
patterns having k items can be generated in two phases: by joining large patterns with k-1 items, the join
procedure, and deleting those patterns that contain any subset that is not large, the prune procedure.
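The join and prune steps can be sketched as follows; this is an illustrative rendering of the candidate-generation idea, not the production code from Agrawal and Srikant's implementation:

```python
from itertools import combinations

def apriori_gen(large_k_minus_1):
    """Candidate generation for the next pass. Join: combine two large
    (k-1)-itemsets that agree on all but their last (sorted) item.
    Prune: drop any candidate with a (k-1)-subset that is not large."""
    prev = {tuple(sorted(s)) for s in large_k_minus_1}
    candidates = set()
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:              # join step
                c = a + (b[-1],)
                if all(sub in prev                               # prune step
                       for sub in combinations(c, len(c) - 1)):
                    candidates.add(frozenset(c))
    return candidates
```

On large 2-itemsets {1 3}, {2 3}, {2 5}, and {3 5}, the only surviving 3-item candidate is {2 3 5}, and generating 4-item candidates from {2 3 5} alone yields nothing, so the level-wise process terminates.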
For example, consider the article in Figure 3-1, assuming that the minimum support is 2 sentences.
In the first pass, the algorithm determines the support of each item, eliminating all small items. The
remaining large items, shown in L1, represent the initial 1-item sets. To create the next candidate set, these
1-item sets are joined together in all possible ways to form the 2-item candidate set C2. Out of these, the
small itemsets are pruned away and the large itemsets are chosen to form L2. The process is repeated to
create the large 3-item set, L3. When C4 is generated using L3, it turns out to be empty and the algorithm
terminates. The result is a set of frequent itemsets, or patterns, L1 ∪ L2 ∪ L3, which can be translated to
frequent patterns found in the article. Algorithm details may be found in Agrawal and Srikant [7].
There are several advantages to choosing this particular approach.
Most importantly, the
algorithm is highly efficient in comparison to other algorithms. Every article varies in size and content,
producing anywhere from 5 patterns to 50,000 patterns in each run. In addition, over 500,000 articles exist
in the Milan news corpus.
As a result, the speed of the algorithm is an important consideration.
Furthermore, since the algorithm has already been implemented in the IBM Intelligent Miner for data [24],
a product of the Data Mining group at the IBM Almaden Research Center, the code is readily accessible.
This saves a great deal of time and resources that would have been spent in implementing a pattern
generation algorithm from scratch.
3.3.3
Results of Pattern Generation
The resulting patterns represent the key concepts for a given article. Here, we have chosen three articles
from the news corpus to illustrate pattern generation results. Figures 3-2, 3-3, and 3-4 provide excerpts
from the three stories, as well as the most frequent patterns selected from each pattern size.
First of all, each of the patterns, including the
1-item patterns, seems to grasp the theme of the
article quite well. Furthermore, as the patterns increase in complexity, seemingly important keywords drop
out of the pattern. For example, in the Hubble telescope article, the word "Hubble" should clearly play an
important role in describing the story. However, while both the one and two item patterns contain this
word, the term disappears in longer patterns. This occurrence may be attributed to the frequent use of
pronouns within the article. While Hubble may be mentioned throughout the article, several sentences use
the words "it" and "they" to describe the Hubble telescope and Hubble astronomers, respectively. Thus, in
longer patterns the term drops out, even though it is implicitly referenced throughout the article.
Figure 3-2: Pattern generation for the Hubble Space Telescope article

Figure 3-3: Pattern generation for the Camp David article

Figure 3-4: Pattern generation for the Al Gore article
3.4
The Search for Related Stories
Now that the patterns have been generated, the system requires a scheme for finding similar articles given
these patterns. One option is to find the same patterns within the other articles of the news corpus, with the
requirement that all words contained in the pattern must be from the same sentence.
An article that
contains at least one pattern is considered a match. We will call this option sentence-level-matching.
However, this metric may be too restricting, since the focus of one article may not be the same as another
even though they speak of the same topic. Therefore, another option is also explored; when a story
contains all the elements of a pattern, not necessarily from the same sentence, it is considered a match.
This option will be called article-level-matching.
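The two matching schemes differ only in where the pattern's words must co-occur, which the following sketch makes explicit (the function name and the sentence-as-set representation are illustrative):

```python
def matches(pattern, article_sentences, level):
    """Does an article match a story pattern? Sentence-level matching
    requires every word of the pattern in a single sentence; article-level
    matching lets the words appear anywhere in the article.
    Each sentence is represented as a set of words."""
    if level == "sentence":
        return any(pattern <= s for s in article_sentences)
    article_words = set().union(*article_sentences)
    return pattern <= article_words
```

Any sentence-level match is necessarily also an article-level match, so article-level matching can only retrieve more articles, trading precision for recall.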
Using both metrics, preliminary tests are run on the above three articles, to provide a rough idea of
the usefulness of patterns in text retrieval.
The Milan corpus is used for this evaluation; however, the
number of considered stories is limited for the sake of efficiency. A relevant article is one whose content is
related to the topic of the given article. The total set of relevant articles is chosen from the Milan corpus by
hand. This data is used to calculate precision/recall values. A summary of this testing environment is
given in Table 3.1.
For this preliminary test set, we compare patterns of 3 items or more with Boolean searches and
word pair searches. Boolean searches correspond to 1-item patterns, and word pair searches correspond to
2-item patterns. The intuition in this case is that more complex patterns offer a more defined topic. Only
the top 5 most frequent patterns, single words, and word pairs are used in this examination; that is, only
the 5 most frequent of each are used in retrieving relevant stories. Tables 3.2, 3.3, and
3.4 show the precision/recall results for the Hubble space article, Camp David article, and Al Gore article
respectively. Precision is defined as the ratio of the number of relevant articles retrieved to the total
number of articles retrieved. Recall is defined as the ratio of the number of relevant articles retrieved to
the number of relevant articles in the corpus. Note that Boolean precision/recall values will be the same in
both sentence and article matching, since a single word will obviously appear in the same sentence as well
as the same article.
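As a concrete check of the standard definitions, consider a small worked example (the counts here are hypothetical and are not taken from the tables below): a search retrieves 15 articles, 7 of which are relevant, from a corpus containing 33 relevant articles.

```java
public class PrecisionRecall {

    // Precision: fraction of retrieved articles that are relevant.
    static double precision(int relevantRetrieved, int totalRetrieved) {
        return (double) relevantRetrieved / totalRetrieved;
    }

    // Recall: fraction of the relevant articles in the corpus that were retrieved.
    static double recall(int relevantRetrieved, int relevantInCorpus) {
        return (double) relevantRetrieved / relevantInCorpus;
    }

    public static void main(String[] args) {
        // Hypothetical run: 15 articles retrieved, 7 of them relevant,
        // with 33 relevant articles in the corpus.
        System.out.printf("precision = %.1f%%%n", 100 * precision(7, 15)); // 46.7%
        System.out.printf("recall    = %.1f%%%n", 100 * recall(7, 33));    // 21.2%
    }
}
```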
Article            Number of relevant articles   Total number of articles considered
Hubble Telescope   33                            8,500
Camp David         53                            18,000
Al Gore            126                           19,000

Table 3.1: The preliminary testing environment
                   Sentence-level          Article-level
                   Recall     Precision    Recall     Precision
Boolean Search     22.5%      48.5%        22.5%      48.5%
Word-Pair Search   100%       27.3%        100%       27.3%
Pattern Search     100%       9.1%         100%       30.3%

Table 3.2: Preliminary Precision/Recall Evaluation for the Hubble Space Telescope article
                   Sentence-level          Article-level
                   Precision  Recall       Precision  Recall
Boolean Search     9.2%       96.2%        9.2%       96.2%
Word-Pair Search   38.9%      92.5%        35.5%      94.3%
Pattern Search     76%        68.6%        77.8%      92.4%

Table 3.3: Preliminary Precision/Recall Evaluation for the Camp David article
                   Sentence-level          Article-level
                   Precision  Recall       Precision  Recall
Boolean Search     4.2%       94.4%        4.2%       94.4%
Word-Pair Search   14.0%      45.2%        13.1%      63.5%
Pattern Search     100%       5.6%         83.6%      40.1%

Table 3.4: Preliminary Precision/Recall Evaluation for the Al Gore article
These early results illustrate the promising performance of pattern mining, with complete article
matching performing better than sentence matching. Retrieving articles based upon a pattern's presence
within a single sentence is too restrictive. In most cases, article-matching significantly improves Ariel's
recall performance while only slightly weakening precision.
Furthermore, the precision of 3 or more item patterns far surpasses that of word pairs and Boolean
searches. From a reader standpoint, the importance of precision outweighs that of recall. A reader would
much rather receive a small number of relevant stories rather than a larger number of stories in which
relevant articles need to be weeded out. Thus, while longer patterns have poorer recall numbers, the
precision results outweigh recall weaknesses.
It should also be noted that the procedure of using the 5 most frequent patterns is only used for this
preliminary test set. In normal Ariel processes, all existing patterns of 3 items or more will be used in the
retrieval process.
Thus, recall performance should increase with more rigorous evaluations.
More
extensive testing is described in Chapter 6, the evaluation section of this paper, along with a comparison of
pattern mining with common story retrieval techniques.
As a result of these tests, the Ariel system will retrieve articles based upon all patterns with
3 items or more. An article is considered relevant if every item of the pattern is present within the article,
regardless of its sentence position.
3.5
Other Considered Techniques
While researching pattern generation, several different approaches were explored. The technique described
above involves generating frequent itemsets from an article chosen by the user. Additionally, this thesis
considers two other approaches, substituting association mining with sequential mining, and replacing the
specific article with an analysis of the entire news corpus.
3.5.1
Looking at Sequential Patterns
Generating sequential patterns is very similar to the association mining approach described above.
However, instead of using sets of words from each sentence (itemsets), lists of words are used
(itemlists)[8][45].
This means that ordering now becomes significant.
The advantage is that sequential
patterns reduce the chance of retrieval error. For example, consider one article containing the sentence
"The Camp David summit opens today", and another including the sentence "David went to camp this
summer." Since both sentences contain the itemset {Camp, David}, association mining techniques would
group these two articles as similar. However, sequential mining would identify two distinct patterns,
{Camp, David} and {David, Camp}, respectively, and the error would be avoided.
However, the disadvantage is that we lose the flexibility of association mining. Whereas the
sentences "Clinton and Arafat met" and "Arafat and Clinton met" would be considered matches by
association techniques, sequential methods identify two distinct patterns {Clinton, Arafat, met} and
{Arafat, Clinton, met}, and would therefore fail to recognize their similarity.
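The distinction can be sketched in a few lines: association mining compares a sentence's words as an unordered set, while sequential mining compares them as an ordered list. This is an illustrative sketch, not the Intelligent Miner's implementation:

```java
import java.util.*;

public class SequenceVsSet {

    // Association mining: a sentence's words form an unordered itemset.
    static boolean setMatch(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    // Sequential mining: a sentence's words form an ordered itemlist,
    // so the comparison is element by element.
    static boolean sequenceMatch(List<String> a, List<String> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> summit = List.of("camp", "david");  // "The Camp David summit..."
        List<String> camper = List.of("david", "camp");  // "David went to camp..."
        System.out.println(setMatch(summit, camper));      // true:  same itemset
        System.out.println(sequenceMatch(summit, camper)); // false: different order
    }
}
```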
To test this sequential approach, sequential patterns were generated for the three stories described
above, using the same minimum support of 2, and the same testing environment described in Table 3.1.
The experiment is identical to the article-level-matching association mining tests (Tables 3.2, 3.3,
and 3.4), using the same Milan corpus. However, instead of being unordered, the generated patterns are
sequential.
Table 3.5 illustrates the resulting precision/recall values in relation to article-level-matching
association mining results. In general, sequential patterns create moderately lower precision values and
significantly lower recall values. These results make sense since the restriction of ordered items allows
fewer story retrievals. In the case of Article 2, however, precision/recall values are identical. Since only
two patterns are generated using association mining and the same patterns were picked up by sequential
techniques, the results are the same.
Although sequential mining allows one to distinguish between the phrases "blind Venetian" and
"Venetian blind" for example, these results emphasize the need for flexibility in word usage. Thus, the
restrictions of sequential mining weaken story retrieval rather than aid it. Compared to association mining,
little advantage is seen in using sequential story patterns.
Article                     Assoc. Precision   Assoc. Recall   Seq. Precision   Seq. Recall
1. Hubble space telescope   73.3%              86.8%           64.7%            28.9%
2. Camp David               86.0%              76.6%           86.0%            76.6%
3. Al Gore                  75.8%              35.3%           75.0%            6.8%

Table 3.5: Precision/Recall values for articles retrieved using association patterns and sequential patterns.
3.5.2
General corpus trends
Another approach involves searching through the entire corpus for story patterns, instead of focusing on the
chosen article. The basic intuition is that retrieved patterns encapsulate common themes found within the
corpus. If several articles discuss the 2000 Olympic games, for example, these articles will likely share
common word choices, and thus share common patterns. If enough stories reflect a certain pattern, that
pattern may be viewed as a news topic descriptor. Then, when a user chooses one of these articles, the
pattern is recognized and all other stories containing this pattern will be retrieved.
In this approach, a large pattern is defined by the number of articles containing it, rather than
sentences. Patterns are generated using the association mining algorithm; however, the transaction record
is now a record of the entire news corpus, instead of a single news article.
Since an article is still divided
into its sentence components, patterns from each sentence may be joined together to form a set of patterns.
For example, the structure < {Olympics, Sydney} {athlete, compete} > corresponds to "Olympics" and
"Sydney" occurring in one sentence and "athlete" and "compete" occurring in another sentence, with both
sentences coming from the same article. This pattern is accepted only if a minimum support is satisfied.
While this technique appears promising, a brief examination of Table 3.6 illustrates that results are
poor. Instead of retrieving useful patterns that describe various topics within the corpus, only the most
common patterns are gathered. For example, the first pattern < {page, open, browser, window} > emerges
from the common phrase at the end of each CNN.com article "Pages will open in a new browser window".
Similarly, the pattern < {web, post, gmt} > appears in regard to phrases like "April 20, 2000 Web posted at
12:33PM EDT (1633 GMT)". These results indicate that discovering useful patterns among a large set of
articles is more difficult than it appears. This complexity is attributed to the various writing styles and
topic discussions throughout the corpus. By adhering to a single article, the writing style does not change
and the focus is distinct, allowing a more meaningful pattern generation.
Patterns generated from 1000 articles
< {page, open, browser, window} >
< {web, post, gmt} >
< {cnn} {search, new} >
< {people, business} {new, year} >
< {internet, headline}>
Table 3.6: An experiment in discovering general corpus trends using pattern mining techniques
3.5.3
Conclusions
While various approaches to pattern generation exist, association mining techniques appear to
provide the most informative patterns. The use of frequent itemsets allows for flexibility in writing styles,
while capturing the main ideas of the article. At the same time, by concentrating on a single article, the
system takes advantage of language redundancy to provide a clear description of the article's meaning.
Chapter 4
Existing Resources
This section describes the resources Ariel was built upon. Milan is the personalized news portal that
supplies Ariel with a news corpus and a user interface to work with. More specifically, Ariel is one of the
many services built on top of Milan. This chapter describes the underlying structure of the news database,
the data transport system, and the user interface. Furthermore, this section provides a glimpse of IBM's
Vinci project [2], a local area service-oriented architecture created for the development of distributed data-centric applications. Milan is one such application built using Vinci.
4.1
The Milan News System
Milan is an online newspaper service similar to MyYahoo.com and CNN.com. It obtains its articles from a
variety of sources including Factiva, CNN.com, and net news, and maintains a corpus of news articles that
is built on the IBM DB2 database [6].
Furthermore, Milan provides each user with a personalized
newspaper, which allows users to choose news topics, stock updates, weather reports, and other options to
display. Figure 4-1 displays a sample Milan page.
This news system, however, differs from other personalized papers in that it allows users to create
and modify their own news channels, instead of having to select topics from a pre-designed list of choices.
News channels are generally pre-defined categories of news from which a user may choose sections of their
personalized newspaper. For example, a user of MyYahoo may select the "Top World News Stories", "U.S.
Market News", "Health from Reuters", and "Entertainment from Reuters" news channels for their
newspaper. In the case of creating one's own channels, someone with family in Taiwan may be very
interested in reading all the articles regarding Taiwan and its disputes with China. Normally, that person would
be forced to choose the "Top World News Stories" news channel to add to their newspaper, since this is the
Figure 4-1: Milan is a personalized information portal for the Vinci system.
closest channel that exists. However, with Milan, the user can create his own channel, perhaps using the
search "Taiwan + China" to meet his/her specific needs. Furthermore, Milan provides the ability for users
to communicate their interests to each other, by sharing channels with each other.
With the creator's
permission, other users may use or modify the Taiwan news channel and/or add it to their own newspapers.
Gruhl originally described this system in his Ph.D. thesis entitled "The Search for Meaning in
Large Text Databases" [19].
Milan emerged from the MyZWrap and Panorama systems, both designed to
simplify and partially automate the editorial decisions of a human news editor by gaining a more thorough
understanding of the news articles within the database. Milan differs from the two prototype systems by
offering a decentralized and more modular structure, a characteristic inherent to all Vinci-based systems.
Instead of requiring components to exchange information through a centralized database, Milan allows
services to communicate among one another directly.
Furthermore, Milan components can be built on
various platforms and deployed among any number of machines, creating a flexible and robust architecture.
<vinci:FRAME>
  <OID>3325</>
  <DATE>November 12, 2000</>
  <HEADLINE>The Making of a Memorial</>
  <BODY>On Veterans Day, the
    memorial to the 57,000
    Americans missing or
    killed in Vietnam...
  </>
</>
Figure 4-2: A frame describing an article on the Vietnam Memorial, represented both graphically
and in XML format
4.1.1
Gathering and storing information
News is transferred to Milan through wire services, net news, and Internet news providers. Approximately
10,000 news articles are gathered every day and stored in the Milan news corpus.
Currently, the corpus
contains approximately 500,000 news stories, dating from August 1999 to the present.
Milan uses frames as its news storage medium, which provides the system with both speed and
flexibility [19].
A frame is a single-level depth XML document. For example, a frame describing a news
article might include a headline, body, date, and news source. Figure 4-2 illustrates a sample news article
stored in a frame. These frames are stored in a PISA database. Created for the Vinci system, PISA is a
relational database that utilizes a vertical format for storing objects while maintaining a horizontal view [6].
Once the news articles enter Milan, they are immediately processed into frames. Each article is
given a unique Object IDentification number (OID) and stored in a simple frame consisting of the OID and
the text of the article [23].
This OID allows the frame to be referred to in a concise and unambiguous
manner. Since OIDs are never recycled, they can be used as pointers to an article with the assurance that it
will never point to another article. However, since storing the article as a string of text and tagging it with
an OID is not very useful for understanding its content, Milan uses several software agents, or what Gruhl
calls experts, to extract more detailed information from these articles.
4.1.2
Making it useful - experts and the blackboard system
A single news article is of little value unless its meaning is understood.
While articles are generally
represented as a single block of text, it is very useful to understand what people, locations, and dates were
mentioned in the article. For this purpose, Milan uses various experts organized into a blackboard system
to extract useful information from each article [14]. Each expert is specialized in only one task. One might
find a people-spotter,
location-finder, or date-locator expert working on a particular news article. These
experts work together to create understanding about the article, and store each of their findings in the frame
itself.
In the case of Milan, thousands of blackboards exist, each board representing a single news article.
Hence, instead of several experts studying one blackboard, these experts can walk around a room full of
boards and make contributions to each board as they see fit. If an expert is working on an article, another
expert can just pass it by, and come back to it later, when it's free. Imagine that each expert writes on the
blackboard with his own piece of chalk.
If each expert has a different colored piece of chalk, the
contributions of the various experts are easily viewed by noting the different colors on the blackboard. As
a result, those agents that depend on the work of others can simply glance at a board to see if the
appropriate colors exist yet.
Each blackboard can be implemented as a frame. When an article arrives, it is processed into a
new frame, which is visited by various experts. Experts can be divided into two groups, structural experts
and information experts. Structural experts extract the headline, author, body, and the time an article was
posted. After these agents have visited, the second group of experts then performs tasks such as word
stemming, people spotting, and location finding for a more thorough understanding of the article.
Once the information has been extracted, various searching techniques are used to select stories
for the newspaper, choosing which stories to present and in how much detail. Milan explores the use of
Boolean searches, AltaVista-like searches, and relevance feedback to present the most satisfactory articles
for the reader's perusal.
However, these techniques lie outside the scope of this thesis, and will not be
discussed in detail. For more details, see Gruhl's Ph.D. thesis [19].
4.2
The Vinci System
In the larger scheme of things, Milan is only one application of several built on the Vinci project. As stated
before, Vinci is a local area service-oriented architecture that provides both standard and specialized
building blocks with which to develop data-centric applications such as Milan. As shown in Figure 4-3, the
Vinci project consists of several base services, including a database, web crawler, and data mining tools.
On top of this foundation, almost any application can be plugged in and registered as a Vinci service.
Services may communicate with one another via the xtalk
protocol.
Vinci requires components and services to communicate by exchanging encoded XML
documents, a protocol called xtalk
[2].
xtalk creates a decentralized environment in which services
can directly communicate with one another, as well as with users.
For instance, a system such as Ariel
may want to communicate with Milan to request a particular news article.
Figure 4-3: A graphical representation of the Vinci system. Value-added services such as query mapping,
search, catalog management, user profiling, personalization, news and community, privacy, monitoring,
data extraction, and personal databases are built on top of the base services.
For this communication to take place, Ariel must send the following xtalk document:
<vinci:FRAME>
  <vinci:COMMAND>getFrame</>
  <OID>3325</>
</>
Milan evaluates this document and returns:
<vinci:FRAME>
  <OID>3325</>
  <DATE>November 12, 2000</>
  <HEADLINE>The Making of a Memorial</>
  <BODY>On Veterans Day, the memorial to
    the 57,000 Americans missing or
    killed in Vietnam is one of the
    most visited monuments in Washington...
  </>
</>
This frame-transfer protocol exists for all Vinci services, making communication between services both
flexible and efficient. In fact, this idea is very similar to systems making method calls over a network. As
long as the data transfer schema is known for the desired service, communication between systems is very
simple.
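As a rough sketch of how such a request frame might be assembled on the client side (purely illustrative; the real xtalk protocol exchanges encoded XML documents over the network [2], and the helper method below is hypothetical):

```java
public class FrameBuilder {

    // Assembles a one-level frame in the abbreviated </> closing style
    // shown in the examples above, from alternating key/value arguments.
    static String frame(String... keyValuePairs) {
        StringBuilder sb = new StringBuilder("<vinci:FRAME>\n");
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            sb.append("  <").append(keyValuePairs[i]).append('>')
              .append(keyValuePairs[i + 1]).append("</>\n");
        }
        return sb.append("</>").toString();
    }

    public static void main(String[] args) {
        // The getFrame request from the example above.
        System.out.println(frame("vinci:COMMAND", "getFrame", "OID", "3325"));
    }
}
```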
4.3
Milan Services
Just as Milan is a service of Vinci, Milan maintains services of its own to help with its tasks. For example,
many readers like to see weather reports for their hometown. Instead of searching through the Internet for
up-to-date weather information, Milan simply calls the weather service for help, which in turn searches for
the appropriate information. Now, Milan simply has to establish a connection with the service and send a
frame such as:
<vinci:FRAME>
  <vinci:COMMAND>getWeather</>
  <ZIPCODE>02139</>
</>
The weather service returns:
<vinci:FRAME>
  <CITY>BOSTON</>
  <TEMP>32 F</>
  <PIC>http://www.intellicast.com/images/icons.66wtext.jpg</>
</>
This information then becomes represented as a channel on Milan that gives current weather
conditions for a given city. An illustration of this channel is provided in Figure 4-1, the sample Milan
newspaper.
Other services are involved in finding the word of the day, providing stock quotes, preparing
comic strips, allowing AltaVista-like searches, and maintaining a user database.
These services hold
similar conversations with Milan, so Milan can obtain the information it needs to maintain its users'
newspapers.
4.4
Where Ariel Fits
The Ariel system aids Milan users in finding related stories for any news story of interest.
Therefore, it
makes sense to depict Ariel as both an expert and a service in Milan. As an expert, Ariel can analyze a
news article and discover its similar stories before the article is actually requested.
Then, as a service,
Ariel can retrieve the appropriate articles for Milan when a user requests information, just as a weather
Figure 4-4: An illustration of the Milan system, which is built on top of the Vinci base services. Milan
relies on several other Vinci services, including Ariel (pattern generation, article retrieval, similarity
ranking) and various news channels such as comics, stock quotes, and the word of the day.
service would relay requested weather reports or a stock service would provide stock updates. Similarly,
Ariel can use any Milan and Vinci resources (e.g. via the frame-transfer system described above) to collect
the data it needs. Figure 4-4 provides an illustration of the Milan system with the addition of Ariel's
components.
Also, where the available Milan and Vinci services are insufficient for Ariel's purposes, new
services and experts can be built to aid in needed tasks. For example, the pattern generator can easily be
made into a Milan service. Feature spotters, on the other hand, may also be useful to incorporate among
Milan's news experts. Spotters that identify people, places, and times may provide Ariel with important
information about its news articles. These services, and many others, can be built and integrated into the
Milan system, to aid this thesis' study of mining patterns for text retrieval.
Chapter 5
Ariel: Design and Implementation
The ideas set out in the previous chapters have been explored in the development of Ariel, a news retrieval
system based upon user queries for related news articles. Ariel utilizes the Milan news corpus, and is
currently considering over 500,000 stories, with new articles added daily. The system is fully integrated
into Milan's newspaper, such that Milan readers can make requests to Ariel while browsing through
articles. The Ariel system consists of approximately 2,500 lines of code, not including the pattern mining
algorithm, and is written mainly in Java with some use of Perl and JavaScript.
5.1
General System Flow
The Ariel system is composed of six distinct sections: preprocessing articles, indexing article words,
generating patterns, finding related stories, ranking stories by similarity, and providing a user interface. The
general system flow is depicted in Figure 5-1. News enters Milan through news streams, and is reformatted
into frames. Milan experts then examine and augment these frames and store them in a news corpus. Next,
the Ariel system parses the headline and body of the articles based upon stemmed words. Since each word
is mapped to a unique integer identifier, a binary format of the story is created. This binary data file is fed
into a Vinci service built for pattern generation, which produces patterns for each article (This service is
discussed in greater detail in Section 5.3).
Once Ariel identifies these patterns, related stories are
discovered and ranked for user retrieval, using indexes that have been developed. A user may request
related stories directly from the Milan newspaper, which contacts an Ariel JSP. This new web page
presents the user with a list of similar stories, as well as a method for browsing through them.
Figure 5-1: Overall system diagram for the approach presented in this thesis
5.2
Story Analysis
Consider the following article, shortened for the purposes of this discussion, which has just entered the
Milan system from the CNN news stream:
How do designers define
luxury? 4
It can be a cashmere kimono. A bed made with embroidered sheets.
The detailing on a handmade pair of shoes. An unexpected mix of
beautiful colors. Designers have different concepts of luxury.
First, Milan reformats the article into a frame and structural experts extract the headline and body.
Next, the information experts extract stemmed words, people, locations, and other features from the text.
However, since Ariel only requires knowledge of the headline, body, and stemmed words, only these
features are discussed here.
In the Ariel system, each word is represented by its stemmed form, using the Porter stemming
algorithm [37].
In this way, the words "design", "designing", and
"designer", for example, appear
identical to the system. Without such stemming, each word would be viewed as unique, and no correlation
4 Modified version of "How do designers define luxury?" CNN.com, December 29, 2000.
http://www.cnn.com/2000/STYLE/fashion/12/29/luxury/index.html
would be drawn between them. Furthermore, each stemmed word is identified by a unique number.
Whenever a new word is discovered, it is assigned an identifier and added to the dictionary.
Also, Ariel does not consider stop words. A list of these words is provided by Cornell University
[11]. By removing stop words, Ariel prevents the generation of many useless patterns that would otherwise
lead to greater article mismatches. For example, consider the sample article: "Clinton will move out of
office in the middle of January. Then, Bush will be in control of the White House." By including stop
words, such as will, of, in, the, then, and be, a minimum support of two sentences will allow pattern generation
to discover {will, of, in, the}, {will, in, the}, {will, of, the}, {will, of, in}, and {of, in, the} as valid
patterns. Clearly, these patterns will retrieve numerous articles that have little to do with the given topic.
Thus, a removal of stop words allows more effective patterns to be retrieved, while also speeding up pattern
generation processes.
Since stop words are not considered, fewer word combinations will exist for the
system to examine.
The dictionary is a simple mapping between each stemmed word and its unique identifier. Since
a string format requires 4 bytes per Unicode character, and an integer requires 4 bytes, this conversion from
word to number significantly decreases memory usage and thus speeds up system processes.
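The dictionary and stop-word handling described above might be sketched as follows. This is a hypothetical illustration: the class and method names are assumptions, the words are assumed to be already stemmed, and the real Cornell stop list [11] is not reproduced here.

```java
import java.util.*;

public class StemDictionary {

    private final Map<String, Integer> ids = new HashMap<>();
    private final Set<String> stopWords;
    private int nextId = 1;

    StemDictionary(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns the identifier for an already-stemmed word, assigning a fresh
    // identifier on first sight; stop words are dropped (-1 here for illustration).
    int idFor(String stemmedWord) {
        if (stopWords.contains(stemmedWord)) return -1;
        return ids.computeIfAbsent(stemmedWord, w -> nextId++);
    }

    public static void main(String[] args) {
        StemDictionary dict = new StemDictionary(Set.of("the", "of", "will"));
        System.out.println(dict.idFor("design")); // 1
        System.out.println(dict.idFor("define")); // 2
        System.out.println(dict.idFor("design")); // 1 again: same stem, same identifier
        System.out.println(dict.idFor("the"));    // -1: stop word, never indexed
    }
}
```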
Ariel then translates each stemmed word in the headline and body into its numeric representation.
The headline may be viewed as a sentence in the article while the body is divided into its sentence
components. As a result, the fashion article shown above may be expressed as:
Headline: 1 2 3
Body:
    Sent1: 4 5
    Sent2: 6 7 8 9
    Sent3: 10 11 12 13
    Sent4: 14 15 16 17
    Sent5: 1 18 19 3

given the dictionary shown in Table 5.1. These results are not actual Ariel results, and are only provided
as a means for explanation.
Finally, the article's sets of word ID's, or sentences, are combined into a single string
representation to be used in pattern mining. The example above would be expressed as:
3 1 2 3 2 4 5 4 6 7 8 9 4 10 11 12 13 4 14 15 16 17 4 1 18 19 3

The leading number of each sentence group (3, 2, 4, 4, 4, and 4 above) indicates how many items exist in
the following sentence. Thus, if a string begins with 3, we know that the first sentence in the article
contains three items, which are listed immediately afterwards. The next sentence contains 2 items, 4 and 5,
the third sentence contains 4 items, and so on. Using this format, we can express article components using
a simple string representation.
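The count-prefixed format can be sketched with a small encoder (an illustrative sketch; the class and method names are assumptions, not Ariel's code):

```java
import java.util.*;

public class SentenceEncoder {

    // Encodes each sentence as "<item count> <id> <id> ...", concatenated
    // in article order, producing the length-prefixed string format above.
    static String encode(List<List<Integer>> sentences) {
        StringBuilder sb = new StringBuilder();
        for (List<Integer> sentence : sentences) {
            sb.append(sentence.size());
            for (int id : sentence) {
                sb.append(' ').append(id);
            }
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<List<Integer>> article = List.of(
                List.of(1, 2, 3),      // headline: 3 items
                List.of(4, 5),         // sentence 1: 2 items
                List.of(6, 7, 8, 9));  // sentence 2: 4 items
        System.out.println(encode(article)); // 3 1 2 3 2 4 5 4 6 7 8 9
    }
}
```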
 1. Design       8. Embroider   15. Mix
 2. Define       9. Sheet       16. Beautiful
 3. Luxury      10. Detail      17. Color
 4. Cashmere    11. Handmade    18. Different
 5. Kimono      12. Pair        19. Concept
 6. Bed         13. Shoe
 7. Made        14. Unexpect

Table 5.1: A sample dictionary mapping each word to its unique numeric identifier
5.3
Pattern Mining
Patterns are generated using the IBM DB2 Intelligent Miner for Data [24], built in 1996 in the context of
the Quest project at the IBM Almaden Research Center. Specifically, Ariel utilizes the "Associations &
Sequential Patterns algorithm" of the Intelligent Miner, which is written in C++ [7].
As described in Chapter 3, the Ariel system utilizes frequent itemset generation from the data
mining system. To incorporate these techniques into Ariel, pattern generation is implemented as a Vinci
service.
A conversation between Ariel and the pattern generation service begins with the following request:
<vinci:FRAME>
  <vinci:COMMAND>disc</>
  <DATAFILE>3 1 2 3 2 4 5 4 6 7 8 9 4 10 11 12 13 4 14 15 16 17 4 1 18 19 3</>
  <MINSUPCNT>2</>
</>
After evaluating the provided data, the service responds.
<vinci:FRAME>
  <PATTERN>
    <SET>[1] [3] [18]</>
    <NUMITEMS>3</>
    <SUPPORT>28.3</>
  </>
  <PATTERN>
    <SET>[16] [17]</>
    <NUMITEMS>2</>
    <SUPPORT>15.4</>
  </>
</>
The service requires the string representation of the article, described above, along with a
minimum support count. The service considers all patterns that have appeared in at least two sentences of
the article, since Ariel uses a minimum support count of two.
After analyzing the data, patterns are
generated using the given support count, and the resulting patterns are returned, along with their support
and size. Each pattern's support is actually its frequency of occurrence within the article.
While this
information is not used in the Ariel system, future systems may want to incorporate it in sorting or ranking
patterns.
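Ariel obtains its itemsets from the Intelligent Miner service, so the following sketch is only illustrative of the kind of per-sentence support counting involved: a naive Apriori-style pass, with support counted as the number of sentences containing an itemset and a minimum support count of two. The function name and structure are assumptions for illustration, not Ariel's implementation.

```python
from itertools import combinations

def frequent_itemsets(sentences, min_sup=2):
    """Return every itemset appearing in at least min_sup sentences,
    mapped to its support count."""
    sets_ = [frozenset(s) for s in sentences]
    # Level 1: candidate single-item sets.
    items = sorted({w for s in sets_ for w in s})
    current = [frozenset([w]) for w in items]
    result = {}
    while current:
        counts = {c: sum(1 for s in sets_ if c <= s) for c in current}
        frequent = [c for c, n in counts.items() if n >= min_sup]
        for c in frequent:
            result[c] = counts[c]
        # Join frequent k-sets that differ in one element into (k+1)-set candidates.
        current = sorted({a | b for a, b in combinations(frequent, 2)
                          if len(a | b) == len(a) + 1}, key=sorted)
    return result

# The six example sentences from Section 5.2, as word IDs.
sentences = [[1, 2, 3], [4, 5], [6, 7, 8, 9],
             [10, 11, 12, 13], [14, 15, 16, 17], [1, 18, 19, 3]]
for pattern, support in frequent_itemsets(sentences).items():
    print(sorted(pattern), support)   # e.g. items 1 and 3 co-occur in 2 sentences
```

On the example article, only words 1 and 3 recur, so the surviving itemsets are {1}, {3}, and {1, 3}, each with support 2.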
5.4 Indexing for Searches
Performance is a key issue with any system that interacts with human users.
In searching for related
articles, an article that shares at least one pattern with the original article is considered a match.
Furthermore, only patterns of three items or more are considered. If no 3-item patterns exist then 2-item
patterns are used.
As a result, the Ariel system requires a method for determining whether an article
contains a particular pattern. However, since every article in the corpus needs to be examined in order to
collect a complete set of similar stories, an efficient search technique is required.
One cumbersome method would be to examine each article in the database for patterns, simply by
extracting the headline and body of the article at every query, and looking for the appropriate features
within them. A faster method would be to store every word contained in the article in its numeric form
as a key-value pair, such as:
<vinci: FRAME>
<OID> 80764 </>
<HEADLINE> Hubble image sheds light on darkness within
galaxies </>
<BODY> The chance alignment of two spiral galaxies...</>
<WORD> 356 </>
<WORD> 1255 </>
<WORD> 1307 </>
<WORD> 1905 </>
In this way, the system could query each article for patterns by extracting all WORD features from
each article, and then comparing the individual items with the given pattern. For example, compared with
the pattern {356 1307 1905}, the above article would be considered a match.
However, the need to examine every article in the corpus still appears rather cumbersome.
A
better solution is to create a set of indices which map frame features to vectors of OID's where they appear.
For example, the system may record that <WORD> 356 </>, where the number 356 represents the word
image, appears in article frames 4467, 5580, 80764, 97840, and 112283.
Thus, instead of searching
through the entire news corpus, a search for articles containing pattern {356 1307 1905} simply requires
an index lookup of the 3 items, and then an examination of which articles all three items have in common.
These common articles are therefore considered matches to the original article.
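The index lookup described above can be sketched as a standard inverted index: each word ID maps to the set of article OIDs containing it, and a pattern lookup is a set intersection over its items. The OIDs and word IDs echo the examples in the text; the helper names are illustrative, not Ariel's actual interface.

```python
from collections import defaultdict

def build_index(articles):
    """articles: mapping of OID -> iterable of word IDs.
    Returns word ID -> set of OIDs containing that word."""
    index = defaultdict(set)
    for oid, words in articles.items():
        for w in words:
            index[w].add(oid)
    return index

def articles_with_pattern(index, pattern):
    """Intersect the posting sets of every item in the pattern."""
    postings = [index.get(w, set()) for w in pattern]
    return set.intersection(*postings) if postings else set()

# A tiny hypothetical corpus; only the first two articles contain all
# three items of the example pattern.
articles = {
    80764: [356, 1255, 1307, 1905],
    4467:  [356, 1307, 1905, 42],
    5580:  [356, 99],
}
index = build_index(articles)
print(sorted(articles_with_pattern(index, [356, 1307, 1905])))
# [4467, 80764]
```

Only the postings of the pattern's items are ever touched, which is what lets Ariel avoid scanning the full corpus at query time.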
While this type of index works well in Ariel, there are some drawbacks to the system. For
example, the indices do not support positional searching (e.g., finding "Bill" within two words of
"Clinton"). One fix might be to maintain multiple indices: one for standard word searches and one for positional searches.
5.5 A Ranking Scheme
Retrieving all matching stories is useful in evaluation suites. However, in most cases, a user does not want
to sort through all of the matching articles, even if all were related to the main story in some way.
Occasionally, several hundred matches may be retrieved by the system.
An ideal system would only
retrieve the number of stories desired by the user. Thus, a ranking scheme is needed in order to rank stories
from most similar to least similar. In this way, the system has a foundation upon which to choose the best
stories, regardless of the number of results.
In this ranking process, we establish that larger patterns produce better article matches
than smaller ones. This statement is based on the assumption that rarer patterns, the longer ones,
provide more valuable information than their shorter, more common counterparts.
Salton's research
reaffirms this statement. In his research, term weighting is done by assigning a high degree of importance
to terms occurring in only a few documents in the collection. Thus, the rarer a term, the more likely it is an
important feature of the document [41]. Furthermore, this assumption is reinforced by the preliminary
precision/recall results discussed above. Thus, a story containing a 4-item pattern from the main story is
ranked higher than one containing a 3-item or 2-item pattern.
However, each group of patterns still retrieves a large number of articles. For example, the 4-item
patterns may retrieve 20 articles all together, while the 3-item patterns retrieve another 50 from the corpus.
Out of these groups, a new ranking scheme is needed to determine which articles are the better matches. A
variety of similarity comparisons exist. In this thesis, we choose the cosine similarity metric.
We use the following cosine ranking scheme:

1) Construct a dictionary of all terms in the corpus selection (i.e. the group of articles under consideration)
2) Construct a normalized vector for the main article
3) Construct a normalized vector for each of the articles under consideration
4) Generate a similarity function between the main article and each test article given the following cosine similarity metric [39]. The result is a measurement of similarity, from 1 to 0, as calculated by the degree to which the articles DOC_i and DOC_j share the terms k:

\text{CosineSimilarity}(DOC_i, DOC_j) = \frac{\sum_{k=1}^{t} TERM_{ik} \cdot TERM_{jk}}{\left[ \sum_{k=1}^{t} (TERM_{ik})^2 \cdot \sum_{k=1}^{t} (TERM_{jk})^2 \right]^{1/2}}

5) Rank the articles based on highest to lowest similarity.
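The cosine metric of step 4 can be sketched in Python, assuming simple term-frequency vectors over the shared dictionary. The helper below is illustrative, not taken from the Ariel source.

```python
import math
from collections import Counter

def cosine_similarity(doc_i, doc_j):
    """doc_i, doc_j: lists of term IDs. Returns a value in [0, 1]:
    the dot product of term-frequency vectors over the product of
    their Euclidean norms."""
    ti, tj = Counter(doc_i), Counter(doc_j)
    dot = sum(ti[k] * tj[k] for k in ti.keys() & tj.keys())
    norm = math.sqrt(sum(v * v for v in ti.values()) *
                     sum(v * v for v in tj.values()))
    return dot / norm if norm else 0.0

# Rank two hypothetical candidate articles against a main article.
main = [1, 2, 3, 3]
candidates = {"a": [1, 2, 3], "b": [4, 5]}
ranked = sorted(candidates,
                key=lambda k: cosine_similarity(main, candidates[k]),
                reverse=True)
print(ranked)  # ['a', 'b'] -- "b" shares no terms with the main article
```

Because the frequency vectors are normalized inside the metric, step 2 and step 3 need not store explicitly normalized vectors for this sketch to match the formula.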
Once a set of similar articles has been retrieved, the above metric is used to sort the articles from
most to least similar.
In this way, an Ariel user may specify the number of articles to receive.
For
example, if a user requests 10 similar articles, the system will retrieve the top 10 most similar articles
found. Here, pattern length and cosine similarity are used as the ranking scheme. Pattern mining allows
Ariel to quickly isolate a set of similar articles before analyzing them more carefully with the cosine metric.
In addition, other more expensive methods could also be used in place of cosine. The modularity of Ariel
allows for any ranking scheme to be easily integrated with the rest of the system.
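The combined ranking just described, pattern length first and cosine similarity second, can be sketched as a two-key sort. The tuple layout below is an assumption for illustration, not Ariel's actual data structure.

```python
def rank_articles(candidates, count):
    """candidates: list of (oid, max_shared_pattern_len, cosine_score).
    Sort by pattern length first, breaking ties with cosine similarity,
    and return the top `count` OIDs."""
    ordered = sorted(candidates,
                     key=lambda c: (c[1], c[2]),  # pattern length dominates
                     reverse=True)
    return [oid for oid, _, _ in ordered[:count]]

# Hypothetical matches: a 4-item pattern outranks a 3-item pattern even
# when the latter has the higher cosine score.
candidates = [(97840, 4, 0.61), (170536, 3, 0.88), (96061, 4, 0.47)]
print(rank_articles(candidates, 2))  # [97840, 96061]
```

Swapping the key function is all it would take to plug in a different ranking scheme, which mirrors the modularity claim above.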
Another concern of article retrieval is deciding which stories to retrieve. One option is to gather
all articles pertaining to the given subject matter. Or, a second option is to collect only those articles
appearing at an earlier time than the given article.
Considering the former alternative, computation
becomes extremely time consuming. Every time a new article enters the corpus, Ariel must identify which
articles are similar to it, and recalculate ranking information for all articles in this category.
However, in
the legal profession, for example, this extra computation may be highly desirable. In patent searches, a
lawyer's goal is to find all inventions related to a given topic. Thus, a re-computation is necessary to make
sure all searches are up-to-date.
However, in the realm of news, these extra computations are not as crucial.
Since news is generally presented in a historic context, retrieving only those articles that provide a
historic background for the original article is a more suitable alternative.
5.6 The User Interface
The Ariel interface may be described from two perspectives, as another Vinci service or as a reader of the
Milan newspaper.
Any Vinci service may request similar stories from the Ariel system via the xtalk
protocol. This communication is easy, as long as the client knows the appropriate schema for interacting
with Ariel.
A sample communication follows.
The Vinci service sends the request:
<vinci: FRAME>
<vinci:COMMAND>patternsearch</>
<OID>80764</>
<COUNT>8</>
This request simply asks Ariel to complete a pattern mining search for the article identified by the
OID 80764, in this case the Hubble space telescope article mentioned earlier. Once this search is
completed, the top 8 most similar articles should be transmitted from Ariel to the client.
In response, Ariel sends the frame:
<vinci: FRAME>
<OID>97840</>
<OID>170536</>
<OID>96061</>
<OID>80380</>
<OID>139884</>
<OID>112283</>
<OID>5580</>
<OID>4467</>
The OID's in the response frame identify the 8 articles most similar to the Hubble space story.
From a user perspective, however, this form of communication provides little information about
the requested topic. Therefore, for the average user, Ariel provides a web interface that seamlessly fits into
Milan's existing newspaper interface.
As a result, the user may browse through related articles without
requiring any knowledge of Ariel's existence.
The following is a brief walkthrough of the system's interface:
Consider our Berkeley student Kellie again, a Milan user who is browsing idly through a
newspaper much like the example shown in Figure 4-1. In particular, Kellie is looking for stories about the
2000 presidential campaign for her upcoming school report.
Eventually, she comes across the article
"Reintroducing Mr. Gore" which appears to provide some interesting information for her report. Clicking
on the article link reveals Figure 5-2, a web page containing the full news story.
The main frame presents the article as it appears from the Factiva news wire.
The left-hand
column, on the other hand, lists various pieces of information, including a section on "Related Stories".
In
this section, Ariel displays the top three most similar articles in relation to the Gore article in the main
frame.
[Screenshot omitted: the Factiva story "Reintroducing Mr. Gore" as displayed in Milan's main frame.]
Figure 5-2: A Milan news article
If further information is requested, however, Ariel pops open a new window, as seen in Figure 5-3. In this window, the main frame contains the most similar story, in this case "For Bush, it's time to show
his mettle".
The left side displays a list of similar articles. Currently, the system retrieves 20 similar
articles from the corpus.
Kellie may browse through each of these articles at her own convenience, with
each article popping up into the main frame when selected.
Thus, without search engines or
time-consuming queries, Kellie can collect several key resources for her report in just a few clicks, simply by
using the Milan online newspaper.
5.7 Conclusions
A few observations may be drawn from this brief tour of the Ariel system.
First of all, Ariel allows for the
construction of a useful Milan service that is seamlessly integrated with the Milan system. A Milan user
can utilize the features of Ariel without needing to realize its presence in the Milan newspaper.
[Screenshot omitted: the Ariel results window, with "For Bush, It's Time To Show His Mettle" in the main frame and a list of related stories in the left-hand column.]
Figure 5-3: A Milan page displaying the top stories of the "Reintroducing Mr. Gore" article
Furthermore, the methods described in this chapter provide a fairly accurate retrieval system that
utilizes both pattern mining and cosine similarity to retrieve and rank stories. While pattern mining
requires a larger amount of article preprocessing time than cosine similarity, these calculations can be
performed prior to any user queries to maintain an efficient retrieval system.
The Ariel approach also encourages incremental improvements of existing components, as long as
the network interface remains unchanged. Milan and Vinci components do not require any knowledge of
how Ariel accomplishes its task.
Chapter 6
An Evaluation of Ariel
The Ariel system is an information retrieval system designed to retrieve similar news stories.
While the
success of Ariel is ultimately based upon its usefulness to readers, an evaluation of usefulness is difficult to
obtain.
Therefore, this thesis will rely upon traditional IR precision and recall analyses to determine the
accuracy of the retrieval system.
Precision is defined here as the ratio of the number of similar stories
retrieved to the total number of stories retrieved. Recall is defined as the ratio of the number of similar
stories retrieved to the total number of similar stories in existence.
In online news, readers generally prefer to get fewer errors rather than more article matches.
Thus, this thesis considers precision of higher priority than recall.
This consideration differs from the field
of legal concerns, for example. Lawyers searching for prior art want the highest recall value possible, since
they want to find as many documents about their case as possible. Therefore, from a legal perspective,
recall would be of a higher priority than precision.
6.1 Experimental Protocol
The evaluation of Ariel includes a rigorous analysis of three stories chosen from the Milan data set and
three stories chosen from the Reuters-21578 collection [28]. See Table 6.1 for details on the six chosen
articles. Milan is a general news source, including articles from technology, politics, science, and art. The
data set contains various news articles dating from August 1999 to the present. Currently, Milan consists of
over 500,000 articles with approximately 10,000 articles added daily. The Reuters collection is also a news
data set, specifically compiled from the 1987 Reuters news wire. Containing 21,577 articles, Reuters is
significantly smaller than Milan. It is also a more specialized collection, with articles chosen from various
economic reports. These reports mainly include market analyses and trading speculations.
ID | Article Title                                       | Corpus  | Articles considered
1  | Hubble Image Sheds Light on Darkness                | Milan   | 8,500 articles, June 2000
2  | Next in the Middle East                             | Milan   | 18,000 articles, July to August 2000
3  | Reintroducing Mr. Gore                              | Milan   | 19,000 articles, August to September 2000
4  | Bahia Cocoa Review                                  | Reuters | 21,577 articles, entire corpus
5  | If Dollar Follows Wall Street, Japanese will Divest | Reuters | 21,577 articles, entire corpus
6  | Tower Report Diminishes Reagan's Hopes of Rebound   | Reuters | 21,577 articles, entire corpus

Table 6.1: The chosen articles for this testing environment.
All six stories undergo the same evaluation process. First, each article is manually matched with a
set of similar articles. For the Milan corpus, this master set is chosen from a specified section of the
corpus, since the corpus itself is too large for a manual search. In Reuters, however, the articles are chosen
from the entire corpus. Next, Ariel retrieves and ranks all the articles it believes to be similar, filtering out
those articles not included in the pre-selected group. This process is identical to that described in chapters 3
and 5.
In addition to the graphs, each article evaluation contains an excerpt from the story, the number of
patterns found, and a list of several generated patterns. These pieces of information provide the reader with
a better idea of the article topic, as well as more familiarity with the common patterns found in the story.
6.1.1 Graph Characteristics
Precision/Recall graphs are then obtained by the following method:

Let C be the master set of similar articles, chosen manually.
Let A = {a1, a2, ..., am} be Ariel's list of retrieved articles in ranked order, where a1 is the most similar article and am is the least similar.
Let S be a set of articles, where S ⊆ A.
Let P be the set of (Recall, Precision) points.

1. S = {}, an empty set
2. P = {}, an empty set
3. for (i = 1; i <= m; i++) do begin
4.     add ai to S
5.     Precision_i = (number of elements s ∈ S where s ∈ C) / (number of elements in S)
6.     Recall_i = (number of elements s ∈ S where s ∈ C) / (number of elements in C)
7.     add (Recall_i, Precision_i) to P
8. end
Result = P;
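The procedure above translates directly into Python. The names C and A follow the definitions in the text; the function itself is illustrative, and it assumes A contains no duplicate articles, so cumulative counting matches the set-based formulation.

```python
def precision_recall_points(C, A):
    """C: set of relevant article IDs; A: ranked list of retrieved IDs.
    Returns the list of (recall, precision) points, one per rank."""
    points, hits = [], 0
    for i, a in enumerate(A, start=1):
        if a in C:
            hits += 1
        points.append((hits / len(C),   # Recall_i
                       hits / i))       # Precision_i
    return points

# A small hypothetical run: three relevant articles, five retrieved.
C = {"d1", "d3", "d4"}
A = ["d1", "d2", "d3", "d5", "d4"]
for recall, precision in precision_recall_points(C, A):
    print(f"({recall:.2f}, {precision:.2f})")
```

Each retrieved non-relevant article leaves recall flat and pulls precision down, which is exactly the non-monotonic behavior discussed below.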
Each point in P is displayed on the precision/recall graphs, connected by lines for easier viewing.
As a comparison, these precision/recall graphs are then matched with graphs generated from a cosine
similarity metric. The cosine graphs are also obtained via the above method where A represents a list of
related articles retrieved by cosine similarity.
One characteristic of these graphs is that the precision values are almost never monotonically decreasing
(see Figures 6-1 through 6-6). In other words, as recall increases, the precision sometimes
increases and sometimes decreases. The reason is that each new precision/recall point is calculated by
adding a new document to the set, as shown in the above algorithm. Thus, with each new article, the recall
will never decrease, since the number of relevant articles will never decrease. However, the precision may
vary with each addition, since the new article may either be relevant or not. A relevant article will increase
both precision and recall, while a non-relevant article will decrease precision and keep recall stagnant. As a
result, a monotonically decreasing graph could only be obtained by maintaining a constant recall value of
100% throughout the graph.
One may also note from the graphs that the shapes of pattern and cosine results show striking
similarities.
As stated before, pattern mining uses a ranking scheme involving both pattern length
considerations and cosine similarity results. The cosine graph is, of course, created purely from cosine
results. Thus, these schemes create comparable graph shapes.
6.1.2 Performance Considerations
In these precision/recall evaluations, it is also important to note that pattern mining uses significantly less
CPU time than cosine similarity.
We will assume that both techniques require article preprocessing,
namely headline and body extraction, word stemming, and dictionary creation. In both cases, the article is
represented in numeric form, as a string of numbers in pattern mining and as a matrix in cosine similarity.
In pattern mining, story patterns are generated via the association mining scheme described in
Section 3.3, and used to isolate a set of relevant articles. Since patterns are simply sets of frequent words,
relevant articles are identified by the number of these frequent words they contain. Indices help speed up
this search process (see Section 5.4). In the case of cosine similarity, however, every word in every article
needs to be accounted for and compared. Thus, indices are of little use, since all articles in the corpus need
to be examined.
However, the use of cosine similarity inherently ranks the articles by relevancy without further
computation. Pattern mining, on the other hand, has isolated a set of similar articles, which are unranked.
Thus, pattern mining uses cosine similarity to rank a significantly smaller article set, which have already
been deemed relevant.
In terms of the Milan corpus, over 500,000 articles exist in the database. Cosine similarity
requires each of these 500,000 articles to be compared to the original article, word for word, before ranking
them. Pattern mining, on the contrary, analyzes the original article and discovers several story patterns.
Assuming that these patterns consist of 12 unique word items (the average number of words given 20
chosen articles), only those articles that contain any of these 12 words would ever be noticed by the Ariel
system. Furthermore, out of this set, perhaps 500 articles are found to contain the correct patterns. As a
result, instead of requiring a cosine similarity ranking on 500,000 articles, only 500 articles are required.
The Ariel system clearly requires less computational power than the cosine similarity metric, thus making
pattern mining evaluations run much more efficiently than cosine evaluations.
6.2 Milan Experiment Results
Article 1: Hubble Image Sheds Light on Darkness
The chance alignment of two spiral galaxies offers a rare
glimpse of elusive galactic material, according to Hubble
Space Telescope astronomers, who released a striking image
of the pair on Thursday...
41 patterns found.
{image shed light space}
{help dust scientist interstellar}
{image light space telescope}
{march april image heritage}
{interstellar dust bright stars}
{matter hold mass}
{dust galaxy silhouette}
{spiral dark space}
Figure 6-1: Precision/Recall graph for Article 1
Article 2: Next in the Middle East
Prime Minister Ehud Barak and Palestinian leader Yasser
Arafat returned home from the Camp David summit to very
different welcomes. Mr. Arafat was received as a hero
who stood up to pressure and did not forsake Palestinian
claims over Jerusalem...
2 patterns found.
{prime minister opposition}
{prime minister barak}
Figure 6-2: Precision/Recall graph for Article 2
Article 3: Reintroducing Mr. Gore
Last night Vice President Gore delivered what had been
called, more times than he probably cared to hear, the
most important speech of his political career. At long
last the Democratic nominee, no longer in Bill Clinton's
shadow, Mr. Gore sought to convince American voters that
he has the strength and the suppleness to lead...
2 patterns found.
{vice president voter}
{american sought gore}
Figure 6-3: Precision/Recall graph for Article 3
6.3 Reuters Experiment Results
Article 4: Bahia Cocoa Review
Showers continued throughout the week in the Bahia cocoa
zone, alleviating the drought since early January and
improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria
Smith said in its weekly review...
40 patterns found.
{comissaria smith york time}
{bahia cocoa review}
{bahia total estimate}
{bag mln crop}
{sale dlr port}
{sept york time}
Figure 6-4: Precision/Recall graph for Article 4
Article 5: If Dollar Follows Wall Street, Japanese will Divest
If the dollar goes the way of Wall Street, Japanese will
finally move out of dollar investments in a serious way,
Japanese investment managers say. The Japanese, the dominant
foreign investors in U.S. dollar securities, have already
sold U.S. equities...
29 patterns found.
{manage market stock bond general}
{manage invest wall japanese}
{stock bond general}
{manage invest international}
{manage international department}
{market stock bond general}
Figure 6-5: Precision/Recall graph for Article 5
Article 6: Tower Report Diminishes Reagan's Hope of Rebound
The Tower Commission report, which says President
Reagan was ignorant about much of the Iran arms
deal, just about ends his prospects of regaining
political dominance in Washington, political
analysts said...
23 patterns found.
{analyst public reagan back}
{public point person popular}
{report told tower}
{expect white house}
{institute reagan washington}
{point person popular}
{analyst tower politic}
Figure 6-6: Precision/Recall graph for Article 6
6.4 Final Observations
The Milan data set performed extremely well. Precision and recall values for pattern mining dominated
those of the cosine similarity metric in virtually all graph areas. Furthermore, in both Articles 1 and 2,
pattern mining is able to maintain consistently high precision levels up until an 80% recall value. While
Article 3 also performs well, its results are not quite as high. Rather, Article 3 shows a steady drop in
precision, almost from the very beginning of the experiment.
This phenomenon can most likely be attributed to the large number of related stories to the Gore
topic. While the Hubble and Camp David stories included 34 and 54 relevant articles, respectively, the
Gore topic consisted of 117 relevant articles. Therefore, it makes sense that the recall numbers are low
during much of the experiment, pushing (recall, precision) points closer together. If we count the number
of documents added before precision falls below 80%, for example, we would find that Article 1 retrieves
38 stories, Article 2 retrieves 70 stories, and Article 3 retrieves 58 stories. Therefore, even though the
graphs may indicate a lower performance for Article 3, further analysis shows that the results are quite
similar.
In the Reuters data set, the graphs are almost identical in all three cases. This result indicates that
the use of pattern mining does little to affect the results of the cosine similarity metric. However, this in
itself is an achievement. As stated in Section 6.1.2, a pure cosine metric requires a thorough analysis of
each article in the corpus, using the painstaking process of comparing every single term to that of the query
article. Pattern mining, on the other hand, extracts the unique patterns of the query article and uses only
those items present to isolate a set of similar articles. As a result, pattern mining performs significantly
faster than the cosine metric. Moreover, graph similarities indicate that pattern mining successfully chose a
set of articles similar to that chosen by cosine similarity.
Therefore, while the two methods produce similar
precision/recall values, the speed of the Ariel system makes it the better choice.
We can also view our results side by side, as shown in Table 6.2. Here, the precision values for
the Milan and Reuters data sets are averaged for conciseness.
For example, the numbers from the
Milan/Pattern column are the averaged results of the three Milan articles (Articles 1, 2, and 3) using pattern
mining techniques.
While evaluations on both data sets show promising results, the Milan corpus clearly performed
better than Reuters.
Pattern mining results on the Milan database show an average precision of 75.21%
while the same evaluations on Reuters produce an average 22.81% precision. This observation may be
attributed to differences in the content of the two databases. Milan contains a wide range of news topics,
including stories on politics, space, fashion, and world news. Reuters, on the other hand, contains a
specific set of topics involving various economic reports, mostly trading speculations and money market
concerns. Therefore, the patterns created for Articles 4, 5, and 6 were not as unique as they would have
been in the Milan database, causing Reuters patterns to identify many more non-relevant articles.
Recall     | Milan Precision       | Reuters Precision
           | Pattern   | Cosine    | Pattern   | Cosine
0.00       | 1.0000    | 1.0000    | 1.0000    | 1.0000
0.10       | 0.9762    | 0.9583    | 0.3915    | 0.3850
0.20       | 0.9479    | 0.9167    | 0.2856    | 0.3149
0.30       | 0.9444    | 0.8817    | 0.2397    | 0.2160
0.40       | 0.8997    | 0.8335    | 0.1710    | 0.1616
0.50       | 0.8589    | 0.7418    | 0.1265    | 0.1202
0.60       | 0.7963    | 0.6805    | 0.1056    | 0.0972
0.70       | 0.7267    | 0.5052    | 0.0737    | 0.0714
0.80       | 0.6404    | 0.2701    | 0.0541    | 0.0423
0.90       | 0.4453    | 0.0960    | 0.0386    | 0.0280
1.00       | 0.0369    | 0.0345    | 0.0226    | 0.0122
Averages   | 0.7521    | 0.6289    | 0.2281    | 0.2226

Table 6.2: A comparison of average precision values between Milan and Reuters pattern mining
and cosine similarity results
Comparing pattern mining and cosine similarity results, pattern mining produces better results in
both cases. In Milan, the average difference in precision is 12.32%. In Reuters, this difference is much
more subtle, 0.55%. Again, the variance in performance is most likely attributed to the differences in the
data sets. The patterns generated in the Milan corpus provided a clearer distinction between articles than
the Reuters patterns. However, since patterns produce similar or greater results than cosine similarity, this
suggests that pattern mining techniques are identifying sets of stories very similar in characteristics to those
stories collected by cosine similarity, with greater efficiency. Thus, pattern mining appears to be
the better choice, both for precision and efficiency.
Chapter 7
Conclusions and Future Work
This thesis was motivated by the observation that online news lacks an efficient and flexible way of
retrieving similar news stories. Readers would benefit from an automated news system that possesses both
the understanding of a human editor and the ability to provide this service for any given article in the news
corpus. Journalists and newspaper editors can also benefit from such a system, allowing research and
editing to proceed with more effective tools than Boolean searches.
The Ariel system was developed to address the issue of news retrieval using data mining
technologies. The result is a novel approach to text retrieval using pattern mining.
7.1
Contributions
In creating Ariel, this thesis has developed an autonomous news retrieval system that works in conjunction
with the personalized news portal, Milan. The system has been specifically built for users of Milan's
online newspaper. When a reader desires more information on a specific article, he/she calls upon Ariel to
gather similar articles from the news corpus. In addition, Ariel could easily be applied to journalistic and
editorial situations too. Journalists very often require background information on the topic they are writing
about.
Using Ariel, these writers simply need to identify a sample article and similar articles will be
retrieved.
Newspaper editing is a slightly different problem.
Editors are generally presented with a
plethora of articles, from which they must choose a selection of articles to print. Since several articles are
often written on the same topic, Ariel can extract the similar articles from the news collection so that the
editor may pick and choose from them.
Furthermore, the Ariel architecture provides a modular, easily reusable structure that can be used
in many different experiments. While Ariel is currently integrated with the Milan news system, other text
databases may also be used. In the case of newspaper journalism and editing, Ariel could exist upon
various news archives and live news streams. Additionally, Ariel is not limited to news corpora. Any text
database, including patent, scientific journal, and medical research databases, may be integrated with the
Ariel system. Furthermore, various components of the Ariel system may be replaced or removed at any
time without the rest of the system's knowledge.
Since Ariel communicates with the rest of the Vinci
system through a well-defined network interface, any changes may be made without notification.
Therefore, a new searching technique or ranking scheme may be incorporated with ease.
Moreover, the architecture allows for a computationally expensive investigation of the articles to
be performed ahead of time, and the results stored for future use. This technique allows for a more in-depth
examination of the articles at search time, without the need for the user to wait through these computations.
Ariel also demonstrates that patterns of words, traditionally called phrases, are still useful to IR
research, despite many claims that their value has all but disappeared. Using pattern mining, a novel text
retrieval method based upon statistical data mining, Ariel provides the mechanics for discovering patterns
within a given article and then comparing these patterns to the rest of the corpus. Created with association
mining technologies, story patterns may be viewed as unique signatures of a news topic or event.
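The idea of story patterns can be sketched in miniature: treat each sentence of an article as a transaction of words, and keep the word sets that co-occur in enough sentences. Ariel itself relies on Apriori-based association miners, so the brute-force enumeration, function name, and thresholds below are purely illustrative.

```python
from itertools import combinations

def mine_story_patterns(sentences, min_support=2, max_size=3):
    """Return word sets that co-occur in at least `min_support` sentences."""
    # Each sentence is a transaction of distinct (lowercased) words.
    transactions = [set(s.lower().split()) for s in sentences]
    counts = {}
    for size in range(2, max_size + 1):
        for t in transactions:
            for combo in combinations(sorted(t), size):
                counts[combo] = counts.get(combo, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}

article = [
    "mir space station deorbit",
    "russia plans mir deorbit",
    "the mir deorbit is scheduled",
]
print(mine_story_patterns(article))  # {('deorbit', 'mir'): 3}
```

In this toy article the pair ("deorbit", "mir") survives the support threshold, acting as a tiny signature of the story's topic in the sense described above.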
While various document retrieval functions already exist, they require very time-consuming
calculations. Functions such as the cosine similarity metric require a detailed analysis of each document
before isolating the similar ones. Pattern mining, on the other hand, simply scans each article for the
appropriate features, much like a Boolean search. Ariel's performance surpasses that of the cosine metric,
word pair, and Boolean searches in precision and recall analyses.
Furthermore, pattern mining is significantly
more efficient than cosine similarity, while providing comparable or better retrieval results. Boolean and
word pair searches, on the other hand, are more efficient than pattern mining, although the quality of text
retrieval drops significantly. In addition, Ariel's precision/recall performance appears to be comparable to
the TREC-8 ad hoc retrieval results, despite differences in the testing environment and corpus used.
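The cost difference can be seen in a small sketch: cosine similarity must build and compare a term vector for every document, while a pattern scan is a simple set-containment test. The code below uses raw term frequencies and made-up example strings; Ariel's actual metric and feature sets are described in earlier chapters.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def matches_pattern(doc, pattern):
    """Pattern scan: does the document contain every word of the pattern?"""
    return set(pattern) <= set(doc.split())

query = "mir deorbit plan"
doc = "russia announced a deorbit plan for mir"
print(round(cosine_similarity(query, doc), 3))
print(matches_pattern(doc, ("mir", "deorbit")))  # True
```

The pattern check touches only the words in the pattern, which is why a feature scan can remain close to Boolean-search speed while the vector computation grows with document length.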
7.2 Future Work
On the surface, the Ariel system provides a simple, yet effective, text retrieval tool for the domain of online
news.
Underneath this application, however, is a brand new technique waiting to be explored. This thesis
presents a mere glimpse of the field of pattern mining. With further exploration, this technique could make
significant contributions to the IR field. Below, several ideas are outlined for future work on the Ariel
system.
7.2.1 Incorporating Concepts into Patterns
Currently, Ariel patterns are composed of stemmed words. In the future, more informative patterns could
be created, by using concepts in addition to words.
For example, one may use the words "angry," "mad," or "furious" with exactly the same meaning
intended. However, to Ariel, each of these words is completely different from the next. To resolve this
problem, Ariel can easily be extended to recognize
synonyms, by incorporating a tool such as WordNet [48].
In this way, we can produce a concept to
describe anger, which recognizes all synonyms and associates them with the same meaning.
Similarly, people and locations could also be represented as concepts.
President Clinton, for example, may appear in the news as Bill Clinton, Clinton, William Jefferson
Clinton, and, soon, former President Clinton. To the current system, these phrases have no relation to one
another, other than sharing the word "Clinton".
However, in the same way an anger concept can be created, we can produce a Clinton
concept that identifies each of his different names as an alias for the same person.
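A minimal sketch of this idea, assuming a hand-built concept table (a real system would draw synonyms from WordNet and aliases from a named-entity resolver, so every entry below is a hypothetical placeholder):

```python
# Illustrative concept table: surface form -> concept token.
CONCEPTS = {
    "angry": "ANGER", "mad": "ANGER", "furious": "ANGER",
    "bill clinton": "CLINTON", "william jefferson clinton": "CLINTON",
    "president clinton": "CLINTON", "clinton": "CLINTON",
}

def to_concepts(text):
    """Replace known surface forms with concept tokens, longest match first,
    so that "president clinton" wins over the bare "clinton"."""
    text = text.lower()
    for surface in sorted(CONCEPTS, key=len, reverse=True):
        text = text.replace(surface, CONCEPTS[surface])
    return text

print(to_concepts("President Clinton looked furious"))  # CLINTON looked ANGER
```

After such a pass, pattern mining would operate on concept tokens rather than stemmed words, so all aliases of a person or all synonyms of a feeling contribute to the same pattern.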
7.2.2 Recognizing Topic Progression
By incorporating a more complex understanding of article content, as described above, Ariel can begin to
notice topic differences within relevant articles. For example, consider an analysis of the article
"Thousands pause to remember when Hiroshima became hell on earth", a story about the 55th anniversary
of the atomic bombing of Hiroshima. Currently, Ariel would search for relevant articles and rank them by
similarity. However, Ariel can be built to recognize more specialized topics within the articles, including
World War II events, the Manhattan Project, and the victims of the bombing. As a result, Ariel could
categorize the similar stories by both relevance and by topic. The ability to distinguish between different
article focuses and evaluate their relevance to the user's interest is one possible extension of the Ariel
project.
7.2.3 Story Summarization
Instead of providing a list of similar articles, as Ariel currently does, summarization techniques may be
used to condense the information in each relevant article into a single synopsis. In this way, the user could
browse through the highlights from each article without actually needing to scan through the entire
selection. A process known as semantic encoding [34], developed at the IBM Tokyo Research Laboratory,
allows a computer system to automatically analyze text and shrink it into a single description. Perhaps
these same techniques could be applied to Ariel.
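Semantic encoding itself is beyond the scope of a sketch, but a much simpler frequency-based extractive summarizer in the spirit of Luhn [30] shows the shape such a component might take; the scoring scheme and example story below are illustrative only.

```python
from collections import Counter

def summarize(sentences, n=1):
    """Score each sentence by the frequency of its words across the whole
    article (after Luhn, 1958) and keep the top-n sentences in original order."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    scored = sorted(enumerate(sentences),
                    key=lambda p: -sum(freq[w] for w in p[1].lower().split()))
    keep = sorted(i for i, _ in scored[:n])
    return [sentences[i] for i in keep]

story = [
    "Mir will be deorbited in March.",
    "The weather was pleasant.",
    "Russia confirmed the Mir deorbit schedule.",
]
print(summarize(story, n=1))
```

Applied to each retrieved article, even this crude scheme would let a user skim one salient sentence per article instead of the full selection.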
7.2.4 Story Merging
Similar to summarization, Ariel would also benefit from the ability to scan through the collection of
retrieved documents, and merge the entire collection into a single synopsis about the topic at hand. One
approach would be to locate common themes within the group of articles and generate a synopsis using
these themes. Themes could possibly be discovered using the approach described in Section 3.5.2,
searching for general corpus trends. While this approach was unsuccessful in searching through an entire
news corpus, perhaps more interesting results could be obtained when searching through a limited number
of stories regarding the same topic.
7.2.5 Expanding the Scope of Ariel
In addition, Ariel should be expanded to different types of users and text databases to further evaluate its
success. Since Ariel has already been applied to news, a natural progression would be to incorporate Ariel
into a news archive used by journalists and editors. While this application has been discussed throughout
this thesis, Ariel has never actually been implemented as such a tool. Another possibility is to apply Ariel's
pattern mining techniques to a patent database, such as IBM's Intellectual Property Network [25], to
help IP lawyers conduct research for prior art. Or, Ariel could be incorporated into various types of
journals and placed in a library for students and researchers.
References
1. N. Abramson. Design, specification, and implementation of a movie server. Bachelor's Thesis.
Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1990
2. R. Agrawal, R.J. Bayardo, D. Gruhl, S. Papadimitriou. Vinci: A Service-Oriented Architecture for
Rapid Development of Web Applications. In Proceedings of the 10th World Wide Web Conference.
Hong Kong, May 2001.
3. R. Agrawal, R.J. Bayardo, and R. Srikant. Athena: Mining-based Interactive Management of Text
Databases. In Proceedings of the 7th International Conference on Extending Database Technology
(EDBT). Konstanz, Germany, March 2000.
4. R. Agrawal, T. Imielinski, A. Swami. Database Mining: A Performance Perspective. In IEEE
Transactions on Knowledge and Data Engineering, Vol. 5, No. 6. December 1993.
5. R. Agrawal, T. Imielinski, A. Swami. Mining Association Rules between Sets of Items in Large
Databases. In ACM SIGMOD International Conference on Management of Data, Washington D.C.,
May 1993.
6. R. Agrawal, A. Somani, Y. Xu. Storage and Querying of E-Commerce Data. Submitted to ACM
SIGMOD International Conference on Management of Data. December 2000.
7. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th
International Conference on Very Large Databases. Santiago, Chile, September 1994.
8. R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 11th International
Conference on Data Engineering, Taipei, Taiwan, March 1995.
9. W. Bender and P. Chesnais. Network Plus. In SPIE Electronic Imaging Devices and Systems
Symposium, Vol. 900, pages 81-86. Los Angeles, California, January 1988.
10. S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan. Using Taxonomy, Discriminants, and Signatures
for Navigating in Text Databases. In Proceedings of the 23rd International Conference on Very Large
Data Bases (VLDB). Athens, Greece, August 1997.
11. Cornell University, CS Department. Cornell list of stop words. Found at
ftp://ftp.cs.cornell.edu/pub/smart/english.stop in January 2001.
12. Jack Driscoll. Visiting Scholar to the MIT Media Laboratory. Former editor of The Boston Globe.
Interview. December 20, 2000.
13. Sara Elo. Plum: Contextualizing news for communities through augmentation. Masters Thesis for the
Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1995
14. R.S. Engelmore, A.J. Morgan, and H.P. Nii. Introduction. In Blackboard Systems. Addison-Wesley
Publishers Ltd., 1988.
15. J.L. Fagan. The effectiveness of a non-syntactic approach to automatic phrase indexing for document
retrieval. Journal of the American Society for Information Science, 40(2):115-132, 1989.
16. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From Data Mining to Knowledge Discovery in Databases.
In Proceedings of the American Association for Artificial Intelligence. Fall 1996.
17. Simson Garfinkel and Kenneth Haase. Reference Points. Interview. January 31, 2001.
18. Julio Gonzalo, Felisa Verdejo, Irina Chugur, Juan Cigarran. Indexing with WordNet synsets can
improve text retrieval. In Proceedings of the COLING/ACL Workshop on Usage of WordNet in
Natural Language Processing Systems. Montreal, 1998.
19. Daniel Gruhl, The Search for Meaning in Large Text Databases. PhD Thesis for the Massachusetts
Institute of Technology, February 2000
20. Data Mining: Extending the Information Warehouse Framework. IBM Data Mining Whitepaper found
at http://www.almaden.ibm.com/cs/quest/papers/whitepaper.html in January 2001.
21. Kenneth B. Haase. Analogy in the Large. In Proceedings of ACM SIGIR Conference. 1995.
22. Kenneth B. Haase. Do Experts Need Understanding. In IEEE Expert, Spring 1997
23. Kenneth B. Haase, FramerD: Representing Knowledge in the Large. Technical Report, MIT Media
Lab, 1996
24. IBM DB2 Intelligent Miner for Data.
Found at http://www-4.ibm.com/software/data/iminer/fordata/index.html in January 2001
25. IBM Intellectual Property Network.
Found at http://www.almaden.ibm.com/cs/patent.html in January 2001.
26. Steve Jones and Mark S. Staveley. Phrasier: a system for interactive document retrieval using
keyphrases. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 160-167. August 1999.
27. B. Lent, R. Agrawal, and R. Srikant. Discovering Trends in Text Databases. In Proceedings of the 3rd
International Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach,
California, August 1997.
28. David Lewis. The Reuters-21578 text categorization test collection. AT&T Labs. September 1997.
Found at http://www.research.att.com/~lewis/reuters21578.html in January 2001.
29. D. Lewis and W. Croft. Term clustering of syntactic phrases. In the 13th International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 385-404. 1990.
30. H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal of Research and
Development, Vol. 2, No. 2, April 1958
31. M. Maybury, A. Merlino, J. Rayson. Segmentation, Content Extraction and Visualization of Broadcast
News Video using Multistream Analysis. In Proceedings of the American Association for Artificial
Intelligence. 1997.
32. R.B. Miller. Response time in user-system conversational transactions. In Proceedings of the AFIPS
Fall Joint Computer Conference, pages 267-277. 1968.
33. Mandar Mitra, Chris Buckley, Amit Singhal, Claire Cardie. An Analysis of Statistical and Syntactic
Phrases. In Proceedings of RIAO 97, Computer-Assisted Information Searching on the Internet, pages
200-214, Montreal, Canada, June 1997.
34. K. Nagao, S. Hosoya, Y. Shirai, and K. Squire. Semantic Transcoding: Making the World Wide Web
More Understandable and Usable with External Annotations. IBM Research, Tokyo Research
Laboratory.
35. News in the Future research consortium, MIT Media Laboratory. Found at
http://nif.media.mit.edu/nif.html in January 2001.
36. The Pew Research Center. Internet now sapping broadcast news audience. Found at
http://www.people-press.org/mediaOOrpt.htm in January 2001.
37. M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
Downloadable from the Porter Stemming Algorithm home page. http://open.muscat.com/stemming
38. A. Renouf. Making sense of text: automated approaches to meaning extraction. In the 17th
International Online Information Meeting Proceedings, pages 77-86. 1993.
39. Mark Rorvig. Images of similarity: A visual exploration of optimal similarity metrics and scaling
properties of TREC topic-document sets. In The Journal of the American Society for Information
Science. January 1998.
40. Gerald Salton, editor. The SMART Retrieval System - Experiments in Automatic Document Processing.
Prentice Hall Inc., Englewood Cliffs, NJ, 1971.
41. Gerald Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill
Book Co., New York, 1983
42. Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206-231.
August 1999.
43. J. Shafer, R. Agrawal, M. Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In the 22nd
International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996.
44. R. Srikant. Member of the Data Mining group at the IBM Almaden Research Center. Interview.
January 25, 2001.
45. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance
improvements. In Proceedings of the 5th International Conference on Extending Database Technology
(EDBT). Avignon, France, March 1996.
46. Andrew Turpin, Alistair Moffat. Statistical Phrases for Vector-Space Information Retrieval. In
Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 309-310, August 1999.
47. E.M. Voorhees and D. Harman. Overview of the Eighth Text Retrieval Conference (TREC-8).
Including Appendix A, TREC-8 Results. Found at http://trec.nist.gov/pubs/trec8/t8_proceedings.html
in January 2001.
48. WordNet. A Lexical Database for English. Cognitive Science Library, Princeton University. Found at
http://www.cogsci.princeton.edu/~wn/ in January 2001.
49. G.K. Zipf. Human Behavior and the Principle of Least Effort, Addison Wesley Publishing, Reading,
Massachusetts, 1949.