NLP and the Web
Mining the Web to Create Specialized Glossaries
Paola Velardi, Roberto Navigli, and Pierluigi D’Amadio, University of Rome “La Sapienza”
TermExtractor and GlossExtractor, two Web-mining-based applications, support glossary building by exploiting the Web’s evolving nature to allow continuous updating of an emerging community’s vocabulary.
A first step in establishing a Web community’s knowledge domain is to collect a glossary of domain-relevant terms that constitute the linguistic surface
manifestation of domain concepts. A glossary is a list of domain terms and their textual definitions. For instance, Yahoo’s financial glossary defines security as a “piece of
paper that proves ownership of stocks, bonds, and
other investments.” Glossaries are helpful for integrating information, reducing semantic heterogeneity, and facilitating communication between information systems. However, although established
domains largely have glossaries, new communities
typically have trouble providing them. Even when glossaries are accessible, they often provide emerging
knowledge in an unsystematic, incomplete manner.
Manually constructing glossaries requires the
cooperative effort of a team of domain experts and
involves several steps, including identifying consistent domain terminology, producing textual definitions of terms, and harmonizing the results. This
procedure is time-consuming and costly,1 often requiring the support of a collaborative platform to
facilitate shared decisions among experts. It’s also
an incremental process that must be continuously
iterated as emerging terms come into use or as authoritative definitions for existing terms in the domain are produced. To help alleviate the issues of
manual construction, you can partly automate the
process. Glossary extraction systems should speed
up the three steps of manually constructing glossaries by
• minimizing the time spent actually building the
domain terminology and glossary,
• ensuring continuous maintenance and updates,
and
• providing support for collaborative validation
and refinement of the glossary extraction results.
Unfortunately, existing glossary extraction systems don’t truly meet these objectives. In this article,
we present a comprehensive, fully engineered set of
natural language tools that assist an online community in the entire domain glossary extraction process. (For a discussion of other tools and methodologies, see the “Related Work in Glossary Extraction”
sidebar.) We based our methodology on machine
learning and Web-mining techniques and implemented it in two tools, called TermExtractor (http://
lcl.uniroma1.it/termextractor) and GlossExtractor
(http://lcl.uniroma1.it/glossextractor). The tools acquire a glossary’s two basic components: terms and
definitions. The former are harvested from domain
text corpora, and the latter are extracted from different types of Web pages (such as online glossaries
and Web documents).
The tools use text-processing techniques to
Related Work in Glossary Extraction
In the past few years, the Web has been a valuable resource
for extracting several kinds of information (such as terminologies, facts, and social-networking information) to aid text-mining methods. Nonetheless, automatic-glossary-extraction
literature, especially from the Web, isn’t very rich. Atsushi
Fujii and Tetsuya Ishikawa presented a method that extracts
fragments of Web pages using patterns typically used to describe terms.1 They used an n-gram model to quantify how
well formed a given text fragment was. Although they used
a clustering approach to identify the most representative
definitions, they didn’t perform domain-based selection.
Youngja Park, Roy Byrd, and Branimir Boguraev’s method,
implemented in the GlossEx system, extracts glossary items
(that is, a terminology) on the basis of a pipeline architecture.2 This method performs domain-oriented filtering that
improves over the state of the art in terminology extraction; nonetheless, it provides no means for learning definitions for the extracted terms. Judith Klavans and Smaranda
Muresan’s text-mining method, the Definder system, extracts
definitions embedded in online texts.3 The system is based
on pattern matching at the lexical level and guided by cue
phrases such as “is called” and “is defined as.” However, as in
Fujii and Ishikawa’s method, the Definder system applies patterns in an unconstrained manner, which implies low precision when applied to the Web, rather than a domain corpus.
A closely related problem to glossary extraction is open-domain definitional question answering (QA), which focuses
on unconstrained “who is X?” and “what is X?” questions.
Many QA approaches—especially those evaluated in the
TREC (Text Retrieval Conference, http://trec.nist.gov) evaluation competitions—require the availability of training data,
such as large collections of sentences tagged as “definitions”
or “nondefinitions.” Applying supervised approaches 4,5 to
glossary extraction in emerging domains would be costly
and time-consuming. Ion Androutsopoulos and Dimitrios
Galanis propose a weakly supervised approach in which feature vectors associated with each candidate definition are
augmented with automatically learned patterns.6 Patterns
are n-grams that surround a term for which a definition is
sought. Horacio Saggion’s method extracts definitions from
a large corpus using lexical patterns coupled with a search
for cue terms that recur in dictionary and encyclopedic definitions of the target terms.7
A recent article proposes using probabilistic lexico-semantic patterns (called soft patterns) for definitional QA.8 These
patterns cope better with the diversity of definition sentences than predefined patterns (or hard patterns) can. The
authors evaluated the method on data sets made available
through TREC. However, we must outline some important
differences between glossary extraction and the TREC “what is” task.
First, in TREC, the goal is to strike the best balance between precision and recall; when the objective is an emerging domain,
recall is the most relevant performance figure. For certain
novel concepts, there might be very few definitions (or just
one). The target is to capture the majority of them because
rejecting a bad definition takes just seconds, whereas manually creating a definition takes several minutes. (We consulted
a professional lexicographer, Orin Hargraves, to estimate the
manual effort of glossary development.)
Second, TREC questions span heterogeneous domains—
that is, there’s no domain relation between definitions, in
contrast with glossaries.
Third, the TREC evaluation data set doesn’t have true definitions but rather sentence “nuggets” that provide some
relevant fact about a target. The following example shows
a list of answers, classified as “vital” or just “okay,” for the
query, “What is Bollywood?” from TREC 2003.
Qid 2201: What is Bollywood?
2201 1 vital is Bombay-based film industry
2201 2 okay bollywood is indian version of Hollywood
2201 3 okay name bollywood derived from Bombay
2201 4 vital second largest movie industry outside of hollywood
2201 5 okay the bollywood movies in UK’s top ten this year
2201 6 okay empty theaters because films shown on cable
2201 7 okay Bollywood awards equivalent to Oscars
2201 8 vital 700 or more films produced by india with 200 or more from bollywood
2201 9 vital Bolywood—an industry reeling from its worst year ever
This example shows that manually validated answers (even
those classified as “vital”) aren’t always true definitions (for
example, answer 8) and aren’t professionally written (for
example, answer 9). Wikipedia gives a true definition of Bollywood: “Bollywood is the informal term popularly used for
the Mumbai-based Hindi-language film industry in India.”
Given the TREC task objectives, the use of soft matching
criteria is certainly crucial, but in glossary definitions, style
constraints should be followed closely to provide clarity and
knowledge sharing, as well as to facilitate validation and
agreement among specialists.
In summary, the state of the art on glossary and definition
extraction only partially satisfies the objectives of the glossary extraction life cycle. In the main article, we present our
glossary extraction system, which employs natural language
techniques to exploit the Web’s potential.
References

1. A. Fujii and T. Ishikawa, “Utilizing the World Wide Web as an Encyclopedia: Extracting Term Descriptions from Semi-Structured Texts,” Proc. 38th Ann. Meeting Assoc. for Computational Linguistics, Morgan Kaufmann, 2000, pp. 488–495.
2. Y. Park, R. Byrd, and B. Boguraev, “Automatic Glossary Extraction: Beyond Terminology Identification,” Proc. 19th Int’l Conf. Computational Linguistics, Howard Int’l House and Academia Sinica, 2002, pp. 1–7.
3. J. Klavans and S. Muresan, “Evaluation of the Definder System for Fully Automatic Glossary Construction,” Proc. American Medical Informatics Association Symp., 2001, pp. 324–328.
4. S. Miliaraki and I. Androutsopoulos, “Learning to Identify Single-Snippet Answers to Definition Questions,” Proc. 20th Int’l Conf. Computational Linguistics, Morgan Kaufmann, 2004, pp. 1360–1366.
5. H.T. Ng, J.L.P. Kwan, and Y. Xia, “Question Answering Using a Large Text Database: A Machine Learning Approach,” Proc. Conf. Empirical Methods in Natural Language Processing, Assoc. for Computational Linguistics, 2001, pp. 67–73.
6. I. Androutsopoulos and D. Galanis, “A Practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the Web,” Proc. Human Language Technology Conf. and Conf. Empirical Methods in Natural Language Processing, Assoc. for Computational Linguistics, 2005, pp. 323–330.
7. H. Saggion, “Identifying Definitions in Text Collections for Question Answering,” Proc. Language Resources and Evaluation Conf., European Language Resources Assoc., 2004.
8. H. Cui, M.-Y. Kan, and T.-S. Chua, “Soft Pattern Matching Models for Definitional Question Answering,” ACM Trans. Information Systems, vol. 25, no. 2, 2007, pp. 1–30.
Figure 1. The glossary extraction process. The process comprises two phases—terminology extraction and glossary extraction. [The figure shows domain documents feeding TermExtractor (terminology extraction/updating and validation), which outputs a terminology; the terminology feeds GlossExtractor (glossary extraction/updating and validation), which queries the Web and outputs a glossary.]
analyze both the documents’ content and
structure. They exploit the Web’s highly
evolving nature to allow for continuous,
incremental updating of the domain lexicon and the corresponding definitions. On the one hand, in emerging communities, terms often gain and lose popularity in a short time. On the other hand, new definitions of
established terms might become available
over time.
System architecture
Figure 1 shows our glossary extraction process, which comprises two phases.
In terminology extraction, the user collects domain-relevant documents (such as a
scientific community’s publications, a commercial organization’s product descriptions,
technical manuals, or a company’s best
practices) and submits them to the TermExtractor system. This produces a domain terminology (that is, a flat list of terms) that
the user receives for single-user or group
validation. (We’ve described the terminology extraction phase in past work.2)
In glossary extraction, the user submits
terminology to the GlossExtractor system,
which explores the Web for one or more
definition sentences for each term. Finally,
an evaluation session occurs, and the validated glossary is ready for download.
(We’ve previously described a basic version
of GlossExtractor,3 but both the filtering
techniques and the experiments we present
here are novel.)
The user can repeat the extraction process many times, incrementally, to update
and maintain the glossary. The two Web
applications give the user several options to
tune the run for the community’s needs, but
describing these options in detail is outside
this article’s scope (although the interested
reader can access the Web applications and
follow the instructions).
TermExtractor
For completeness, and to provide a clear picture of the whole glossary-learning procedure, we summarize the system’s algorithms and features here.
First, the user sets a list of options (including minimum and maximum length of
terminological strings, named-entity detection, updating of an existing terminology or
creation of a new one, single-user or collaborative validation, and many others) or accepts the default settings.
Second, the user submits a set of documents in any common format (such as doc,
txt, pdf, or latex) either one by one, or as a
compressed archive.
Third, the terminology extraction process
begins by applying part-of-speech tagging
and chunking to the input documents (we
used TreeTagger; www.ims.uni-stuttgart.de/
projekte/corplex/TreeTagger). The tool then selects nominal chunks (typically
including adjectives and nouns) as term candidates. Filtering of domain-pertinent terminological strings from extracted candidates
is based on the following measures:2
• domain relevance, a measure for identifying patterns that are frequent in the
domain corpus and infrequent in a predefined user-adjustable set of contrastive
domains;
• domain consensus, an entropy-based
measure that determines whether terms
are evenly distributed across the domain
documents;
• lexical cohesion, a measure that aims to
discriminate between the chance of cooccurrence and real semantic cohesion
by analyzing the component words of
candidate terms;
• structural relevance, a set of heuristics
that assign a higher weight to terms occurring in specific formatting structures,
such as titles, abstracts, or boldface; and
• heuristics, a set of additional heuristic
rules that prune out certain strings, such
as those including very common adjectives, numbers, or strings. Heuristic rules
also extract single-word terms, which are
often subject to noise.
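As an illustration of how such measures can be combined, the sketch below computes simplified versions of the first three measures for a candidate term. The exact formulas and weights TermExtractor uses are given in our earlier work;2 the definitions below (a frequency ratio against contrastive corpora, a normalized entropy across domain documents, and a simple cohesion ratio), as well as all function and variable names, are illustrative assumptions.

```python
from math import log

def domain_relevance(term, domain_freq, contrastive_freqs):
    """Assumed variant: ratio of the term's frequency in the domain corpus
    to its highest frequency across the domain and contrastive corpora."""
    f_dom = domain_freq.get(term, 0)
    f_max = max([f_dom] + [c.get(term, 0) for c in contrastive_freqs])
    return f_dom / f_max if f_max else 0.0

def domain_consensus(term, per_document_freqs):
    """Normalized entropy of the term's distribution across the domain
    documents: close to 1 when the term is evenly distributed."""
    counts = [d.get(term, 0) for d in per_document_freqs]
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log(p, 2) for p in probs) / log(len(counts), 2)

def lexical_cohesion(term, term_freq, word_freq):
    """Compares the frequency of the whole multiword term with the summed
    frequencies of its component words (chance co-occurrence vs. cohesion)."""
    words = term.split()
    if len(words) < 2:
        return 1.0
    denom = sum(word_freq.get(w, 0) for w in words)
    return len(words) * term_freq.get(term, 0) / denom if denom else 0.0

# A candidate term would then be kept if a weighted combination of these
# scores (plus structural relevance and the heuristic rules) exceeds a threshold.
```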
Finally, when extraction and filtering are
complete, the tool automatically sends an
email to the user, directing him or her to the
validation page, which supports both single-user and group validation.
As an example, Figure 2 shows the result
of terminology extraction when submitting
a draft version of this article. Terms can be
marked for rejection (as done in the figure
for the term “threshold embed”). When validation completes, the user can download
the result in one of several formats (such as
plain text, Excel, or XML).
TermExtractor has been online for nearly
two years and has been evaluated both in
the context of the Interop European project
(www.interop-vlab.eu) and by many users
across the world. We performed an evaluation on different domains involving 12 research and industrial institutions,
which led to an average precision of 80 percent in terminology extraction.2
GlossExtractor
Here, we describe our glossary extraction
methodology and its implementation as a
Web application.
The gloss extraction algorithm
The algorithm takes as input a list T of
terms and outputs a (possibly empty) set of
definitions for each term t ∈ T. Possibly, T
is the result of the previous terminology extraction process.
The algorithm comprises three phases:
candidate extraction, stylistic-based filtering, and domain-based filtering.
Candidate extraction. First, for each term
t ∈ T, the tool searches for definition sentences in online glossaries and Web documents. To retrieve candidate definitions in
Web documents, it queries Google with the
following set of manually defined patterns:
“t is a,” “t is an,” “t are the,” “t defines,”
“t refers to,” “t concerns,” “t is the,” “t is
any”
These patterns, inspired by previous work
on lexico-syntactic patterns,4 are intentionally very simple to reduce search and retrieval time. The tool saves the first five
pages returned for each query (and converts them to plain text format). Pages are
retrieved on the basis of a single text snippet matching one of our patterns. However,
other potential definitions might be available in the document. Thus, from each page
we extract all the sentences (that is, our candidate definitions) that contain our target
term t. This strategy results in a considerable amount of noise in our corpus of candidate definitions. Noise typically is generated
by two factors:
• Nondefinitional sentences. Given their
generality, our simple patterns match definitional as well as nondefinitional sentences, such as, “Bollywood is a great
inspiration for young actors.” The “X is
a” pattern is a major source of noise, but
many definitional sentences follow this
style. In other words, this pattern’s precision is low, but its recall is high. To remove nondefinitional candidates matching our regular expressions, we resort to a stylistic filter.
• Out-of-domain definitions. The sentence could indeed be a definition, but one that isn’t pertinent to our domain. For example, “A model is a person who poses for a photographer, painter, or sculptor” is pertinent to the fashion domain but not to the software engineering domain. To discard out-of-domain definitions, we use a domain filter.

Figure 2. Terminology extraction results from a draft version of this article.
The two subsequent steps of our glossary
extraction methodology, which we describe
next, will help eliminate these sources of
noise.
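A minimal sketch of the candidate extraction step, under stated assumptions, is shown below. The definitional patterns are the ones listed above; the web_search function is a placeholder for the search API the system queries, and the naive sentence splitting is only for illustration.

```python
import re

# The manually defined definitional patterns listed above ({t} is the target term).
PATTERNS = ['"{t} is a"', '"{t} is an"', '"{t} are the"', '"{t} defines"',
            '"{t} refers to"', '"{t} concerns"', '"{t} is the"', '"{t} is any"']

def web_search(query, max_pages=5):
    """Placeholder: return the plain text of the first max_pages results.
    GlossExtractor queries Google here; plug in any search backend."""
    return []

def candidate_definitions(term):
    """Collect every sentence containing the target term from the pages
    returned for each definitional query."""
    candidates = set()
    for pattern in PATTERNS:
        for page_text in web_search(pattern.format(t=term)):
            # Naive splitting; a real system would use a proper sentence tokenizer.
            for sentence in re.split(r'(?<=[.!?])\s+', page_text):
                if term.lower() in sentence.lower():
                    candidates.add(sentence.strip())
    return candidates
```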
Stylistic filter. The stylistic filter aims to
select well-formed definitions—that is, definitions expressed in terms of genus (the
kind a concept belongs to) and differentia
(what specializes the concept with respect
to its kind). For instance, in the definition,
“Database integration is the process of integrating systems at the database level,”
process is the genus, and the rest of the sentence is the differentia. Not all definitions
are well formed in this sense (for example,
“component integration is obtained by composing the component’s refinement structures together”), and many sentences that
are ill-formed aren’t even definitions (for
example, “Component integration has been
recently proposed to provide a solution for
those issues”).
Stylistic filtering is a novel criterion with
respect to the related definition extraction
literature, and it has several advantages:
• to prefer definitions adhering to a uniform style, commonly adopted by professional lexicographers;
• to distinguish between definitions and
nondefinitions (especially when candidate definitions are extracted from free
texts, rather than glossaries); and
• to extract kind-of information from definitions, which is used to arrange terms taxonomically (for details on automatic taxonomy construction, see our previous
work3,5).
To learn a stylistic filter for well-formed
definitions, we employed a decision tree
machine learning algorithm. We used
the J48 algorithm from the weka (www.cs.waikato.ac.nz/ml/weka) suite of machine learning software, trained on a data
set of positive and negative examples from
four domains—arts, tourism, computer networks, and enterprise interoperability (our
training set, or TS). We acquired positive
examples from professional online sources,
whereas we obtained negative examples by
manually collecting nondefinitional sentences, including the domain terms from
the four domains, from Web search results
and available data of a previous experiment
conducted with the Interop project members. Table 1 illustrates the training-set
composition. To make it a suitable input to
the decision tree algorithm, we tagged and
chunked each sentence in the TS
by parts of speech and transformed them
into a vector of five elements, one for each
of the first five part-of-speech/chunk tag
pairs.

Table 1. Training set used to learn a stylistic filter for definition sentences.

Domains                        Positive    Negative
Arts*                          310 + 415   80
Tourism†                       270 + 270   60
Computer networks‡             450         50
Enterprise interoperability§   1,435       2,032

*AAT (www.getty.edu/research/conducting_research/vocabularies/aat) and WordNet (http://wordnet.princeton.com)
†WordNet (http://wordnet.princeton.com) and STB (http://app.stb.com.sg/asp/tou/tou08.asp)
‡Geek.com (www.geek.com/glossary)
§Interop glossary (http://interop-vlab.eu/backoffice/IEKR/iekr.2007-05-31.1222018345/?searchterm=glossary)

For instance, our earlier definition of
database integration produced the following output:

Database integration is the process of integrating systems at the database level
[NN NN] VBZ [DT NN] IN [VBG NNS] IN [DT NN NN],

where we enclose a chunk in square brackets (for example, the first chunk includes the first two nouns). Next, we transform the definition into a vector of the first five elements (we exclude the term being defined), plus the instance classification (“yes” for positive, “no” for negative):

((VBZ, -), (DT, chunk2), (NN, chunk2), (IN, -), (VBG, chunk3), YES).
As a result of training, we learn a decision tree model that we can use to filter
future candidate definitions according to
their style. Although the recognition of stylistic patterns is triggered by a supervised-learning algorithm, note that
• style is mostly a domain-independent criterion, so the training phase doesn’t need
to be repeated; and
• we needed limited manual effort to create the training set.
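The feature extraction just described can be sketched as follows. We used the J48 decision tree from weka; the sketch below substitutes scikit-learn’s decision tree purely for illustration, assumes the tagged input is a list of (POS tag, chunk label) pairs, and uses invented helper names and a two-sentence toy training set.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def style_vector(tagged_sentence, term_length):
    """Build the five (POS tag, chunk label) pairs that follow the defined term,
    e.g. 'Database integration is the process of integrating ...' ->
    (VBZ,-), (DT,chunk2), (NN,chunk2), (IN,-), (VBG,chunk3)."""
    tail = tagged_sentence[term_length:term_length + 5]
    tail += [('PAD', '-')] * (5 - len(tail))               # pad short sentences
    return [feature for pair in tail for feature in pair]  # flatten to 10 strings

# Toy training data: a well-formed definition ("yes") and an ill-formed one ("no").
X_raw = [
    style_vector([('NN', 'chunk1'), ('NN', 'chunk1'), ('VBZ', '-'), ('DT', 'chunk2'),
                  ('NN', 'chunk2'), ('IN', '-'), ('VBG', 'chunk3'), ('NNS', 'chunk3')], 2),
    style_vector([('NN', 'chunk1'), ('NN', 'chunk1'), ('VBZ', '-'), ('VBN', '-'),
                  ('RB', '-'), ('VBN', '-'), ('TO', '-'), ('VB', '-')], 2),
]
y = ['yes', 'no']

encoder = OneHotEncoder(handle_unknown='ignore')
model = DecisionTreeClassifier().fit(encoder.fit_transform(X_raw), y)

def is_well_formed(tagged_sentence, term_length):
    """Apply the learned stylistic filter to a new candidate definition."""
    features = encoder.transform([style_vector(tagged_sentence, term_length)])
    return model.predict(features)[0] == 'yes'
```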
Domain filter. The domain filter prunes
out candidate definitions that aren’t pertinent to the domain of interest. We obtain a
probabilistic domain model by analyzing
the domain terminology. As we explained
earlier, the input to the glossary extraction algorithm is a terminology T—that
is, a flat list of single-word and multiword
terms. Starting from the set of single words
constituting the terminology T, we learn a
probabilistic model of the domain, which
assigns a probability that each single word
occurs in the terminology T. Let T be the
set of domain terms, and let W be the set
of single words appearing in T. Given a
word w ∈ W, we estimate its probability
of occurring in our domain of interest as

\[ P(w) = \frac{freq(w)}{\sum_{w' \in W} freq(w')} , \]
where freq(w) is the frequency of w in the terms of T. For example, if T = {distributed
system integration, database integration},
then W = {distributed, system, integration,
database} and P(integration) = 2/5 because
out of five single words in T, integration appears twice.
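A few lines of code reproduce this toy computation (the variable and function names are ours, not part of GlossExtractor’s interface):

```python
from collections import Counter

T = ["distributed system integration", "database integration"]

# W is the set of single words occurring in the terminology T;
# word_counts records how often each of them appears.
word_counts = Counter(w for term in T for w in term.split())
W = set(word_counts)
total = sum(word_counts.values())         # 5 single-word occurrences in T

def P(w):
    """Probability that the word w occurs in the terminology T."""
    return word_counts[w] / total

print(sorted(W))                          # ['database', 'distributed', 'integration', 'system']
print(P("integration"))                   # 2/5 = 0.4
```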
Now, given a term t, let Def(t) be the
set of candidate definitions for t, and let
d ∈ Def(t) be a candidate definition for t.
Finally, let Wd be the set of single words in
d that also occur in W (Wd ⊆ W)—that is,
Wd is the set of single words occurring both
in d and in our initial terminology T. We define d’s domain pertinence as a function of
the occurrence probabilities of the terminological words occurring in d and those composing term t. Formally, we define the formula as
\[ DP(d) = \sum_{w \in W_d} P(w) \log \frac{N_t}{N_{tw}} + \alpha \sum_{w \in t} P(w) , \]
where N_t is the number of definitions extracted for t (that is, N_t = |Def(t)|), and N_tw is the number of such definitions including the word w. The log factor, called inverse document frequency in the information retrieval literature, gives less weight to words that have a very high probability of occurring in any definition, regardless of the domain (for example, “system”). The additional sum in this formula gives more weight to those sentences that include some of the words that compose the term t to be defined (α is an adjustable parameter).
For example, the following definition of database integration includes two occurrences of the term’s words (marked in italics—we don’t count the target term), plus the word system (with regard to the previous example of set W):

Database integration is the process of integrating systems at the database level.

The system applies the measure of domain pertinence we defined earlier to candidates that the decision tree classifies as definitions. Finally, it applies a term-dependent threshold θ to each definition to select only those definitions d for which weight(d) ≥ θ. The system computes the threshold as the difference between the maximum and minimum values of domain pertinence of the definitions in Def(t), averaged over the number of definitions.
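Putting these pieces together, the following sketch scores and filters a term’s candidate definitions using the domain pertinence and threshold defined above. The word probabilities P and the word set W come from the previous sketch; α, the plain whitespace tokenization (with no stemming), and the function names are our own illustrative choices.

```python
from math import log

def domain_pertinence(definition, term, P, W, definitions, alpha=0.5):
    """DP(d) = sum_{w in Wd} P(w) * log(Nt / Ntw) + alpha * sum_{w in t} P(w)."""
    W_d = set(definition.lower().split()) & W       # terminological words in d
    N_t = len(definitions)                          # candidate definitions for t
    score = 0.0
    for w in W_d:
        N_tw = sum(1 for d in definitions if w in d.lower().split())
        score += P(w) * log(N_t / N_tw)             # inverse-document-frequency factor
    score += alpha * sum(P(w) for w in term.lower().split())
    return score

def threshold(scores):
    """Term-dependent threshold: (max - min) domain pertinence, averaged
    over the number of candidate definitions."""
    return (max(scores) - min(scores)) / len(scores)

def filter_definitions(definitions, term, P, W, alpha=0.5):
    """Keep only the candidates whose weight reaches the threshold."""
    scores = {d: domain_pertinence(d, term, P, W, definitions, alpha)
              for d in definitions}
    theta = threshold(list(scores.values()))
    return [d for d, s in scores.items() if s >= theta]
```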
Web application
We’ve implemented this methodology in the
GlossExtractor Web application (accessible
from http://lcl.uniroma1.it/glossextractor).
First, the user uploads a terminology T
(possibly the result of a TermExtractor
run) or runs a demo session, where he or
she specifies only one term. If users upload
a terminology, they can select a number of
options, including the sources from which
glossaries must be extracted (such as Web
glossaries, Google’s “define” feature,
and Web pages, which GlossExtractor
searches). As with TermExtractor, the user
can select single-user or group validation.
Finally, the user assigns a name to the glossary and starts the extraction task. (Gloss
search over the Web is time consuming because often several thousand queries must
be issued. Furthermore, there’s a limit
to the daily number of queries; we thank
Google’s team for letting us extend this
limit. Depending on the terminology’s size
and the number of active users, the process
might not complete in one day, although
usually it’s a matter of dozens of minutes).
The glossary extraction process is asynchronous: the Web application ends and, as
soon as the process is terminated, the user
receives an email and is directed to a validation page. Figure 3 shows a screen dump
of a validation page for the term “Bollywood.”
Table 2. The algorithm’s performance (%) when searching Web glossaries and Web documents.

Data set         Number of glosses   Accuracy   Precision   Recall   F1 measure
Web glossaries   1,978               86.2       89.7        88.9     89.3
Web documents    359                 85.1       92.5        81.0     86.4
The validation page lists both the definitions above the filtering threshold θ
and those below. For each definition, the
system shows the computed weight and the
source type from which the definition has
been extracted (a Web glossary, Google’s
“define,” or a Web document). In our experiments, we noticed that the distribution
of definition sources varied significantly
depending on the domain’s novelty.
By clicking on the pencil icon to the left
of each definition, the user can modify a
definition’s text if the definition is good but
not fully satisfactory. The system tracks
manual changes.
In a group validation, the coordinator
has a different view of the validation page,
where he or she can inspect the global votes
each definition receives and make the final
decisions.
Evaluation
We used TermExtractor and GlossExtractor
in the Interop European project,3 and because they’re freely available, several users
around the world are using them in experiments in various domains. Here, we describe a set of experiments that have measured the glossary extraction methodology’s
performance. Few data are available on the
evaluation of glossary extraction systems.
Large-scale evaluations are available for
the TREC (Text Retrieval Conference) task,
but as we’ve explained, we have reasons for
not using these data sets.
We conducted the evaluations as follows:
first, we performed a direct evaluation of
GlossExtractor to assess our stylistic and domain filters’ performance on a gold-standard
data set. Then we conducted a real-world
system evaluation, following the entire terminology and glossary extraction process
from the Web.
Direct evaluation
To perform a direct objective evaluation of
the core system’s methodology, we assembled a set of definitions from online glossaries and Web documents. To obtain examples of good definitions, we extracted definitions from professional glossaries on the Web (partially from the same sources as for the learning set in Table 1, but choosing different definitions). Let G1 be this initial set. Then, we used GlossExtractor to search the Web for additional definitions of the terms in G1. We manually compared the set of extracted definitions, G2, with G1 to reliably separate good and bad definitions. Let G2g and G2b be the good and bad definitions, respectively. The resulting test set is then TS = G1 ∪ G2g ∪ G2b.

Figure 3. Screen dump of a validation session. Highlighted definitions are those the system proposes to reject.
Because good definitions (at least those
in G1) are professionally created, the test set
is reasonably close to a gold standard. Using
this test set, we ran several experiments to
evaluate the system’s ability to
• correctly classify good and bad defini-
tions, when definitions are extracted only
from glossaries;
• correctly classify good and bad definitions, when these are extracted only from
Web documents; and
• prune out definitions that are good but
not pertinent to the selected domain.
While the first two experiments evaluate the entire glossary extraction system (both style and domain filters are applied), the third experiment focuses on the
domain filter’s ability to discard out-of-domain definitions. Table 2 shows the result of the first two experiments. We assessed the system’s accuracy, precision,
recall, and F1 measure (the latter three calculated on good definitions). Both experiments highlight encouraging results, even
in comparison with the few data available
in the literature.6
To perform the third experiment, we
created a test set of 1,000 economy definitions and 250 medicine definitions. We applied our domain filter and ordered the set
of definitions, based only on their value of
domain pertinence, computed on the economy terminology. This experiment was
sufficiently realistic because, most of the
time, technical terminology is meaningful
in a restricted set of domains. The system
generates an ordered list of definitions,
where the first nonpertinent definition (for
example, a medical definition) is found in
position 571, and only 72 economy definitions appear in positions 571 to 1,000. So,
regardless of the threshold value, the domain filter performs a good separation
between pertinent and nonpertinent definitions. Of course, an appropriate selection of the threshold can at best balance precision and recall; however, in new domains, high recall is often preferable to high precision.
Table 3. The extraction algorithm’s performance on a live search with postevaluation.

Statistics                                                   Interoperability   Medicine
Total number of submitted terms (T)                          100                100
Total number of extracted sentences (from all sources) (E)   774                1,137
Sentences over the threshold (C)                             517                948
Sentences over the threshold accepted by evaluators (A)      448                802
Terms with at least one positive definition (N)              51                 96

Performance measures                                         Interoperability   Medicine
Precision on positive (A/C)                                  86.85              84.59
Coverage (N/T)                                               51.00              96.00
Table 4. Glossary extraction performance for the terms in Figure 2.

         Positive   Negative   Total
True     14         106        120
False    5          25         30
Total    19         131        150

Precision on positive: 73.68 (14/19)
Precision on negative: 80.92 (106/131)
Accuracy: 80.00 (120/150)
Coverage: 64.70 (11/17)
Also, the domain filter’s ability to discard inappropriate definitions depends merely on the domain model, which is acquired exclusively on the basis of the input terminology T. Thus, no knowledge of other domains is required to perform this task.
Real-world evaluation
Table 2 provides a reasonably objective
evaluation of the system, given that “good”
examples come mostly from professional
glossaries or have been compared with professional definitions. However, we didn’t
perform the evaluation on a real task because we asked the system to analyze a predefined set of candidate definitions and to
classify them as good or bad using the style
and domain filters.
To obtain an evaluation closer to the system’s intended use, we ran a real-world experiment in which the system actually searched the Web. In this case, we performed validation manually using a “three
judges with adjudication” approach (that is,
the adjudicator made the final choice in case
of disagreement between the judges). To exploit our experience, we selected an enterprise interoperability domain. We repeated
the experiment on a medical domain because we also had expertise in this domain.
In this experiment, we provided the system with only a set of terms, and all definitions were automatically extracted from
online glossaries or Web documents. Table
3 shows the results. Overall, the system performed similarly to the gold standard we
used earlier (see Table 2). However, in this
case we couldn’t measure the recall (which
would require annotating all candidate definitions available online).
Performance is similar across the two domains, but coverage (that is, the percentage
of terms for which at least one good definition was found) is considerably lower for interoperability terms, as expected. In new or
relatively recent domains, it’s more difficult
to find definitions, both in glossaries and in
Web documents.
Finally, to test the complete terminology- and glossary-learning procedure, we performed a small experiment that any interested reader might replicate. First, we submitted a draft (extended) version of this
article to TermExtractor. (When you submit a single document to TermExtractor in
demo mode, it is automatically partitioned
into sections [depending on the document
length] to simulate a corpus and compute
a consensus measure.) Figure 2 shows the
results: TermExtractor extracted 18 terms,
of which only one was rejected (probably
owing to a file conversion error). Also, a
few of the terms weren’t so relevant to this
article’s research domain (for example,
“professional lexicographer”). However,
TermExtractor’s domain consensus filter works best when applied to more than one document at a time.
After validation, TermExtractor submitted the terminology file of 17 terms to GlossExtractor. The system produced 150 definitions (65 from Web glossaries and 85 from
Web documents). Table 4 summarizes the
evaluation’s results. The system found definitions for all but four of the 17 terms: professional lexicographer, definition sentence,
style filter, and domain filter. However, owing to false positives, some additional terms
had poor definitions (although we found
good definitions among those rejected, which
we counted as false negatives).
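For readers who want to check the arithmetic behind Table 4, the reported figures follow directly from the confusion counts (the variable names below are ours):

```python
true_pos, false_pos = 14, 5        # accepted definitions: correct vs. spurious
true_neg, false_neg = 106, 25      # rejected definitions: correct vs. good ones lost
terms_covered, terms_total = 11, 17

precision_positive = true_pos / (true_pos + false_pos)             # 14/19   -> 73.68%
precision_negative = true_neg / (true_neg + false_neg)             # 106/131 -> 80.92%
accuracy = (true_pos + true_neg) / (true_pos + false_pos
                                    + true_neg + false_neg)        # 120/150 -> 80.00%
coverage = terms_covered / terms_total                             # 11/17   -> 64.70%
```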
The terms “style filter” and “domain filter” certainly are relevant to the domain,
but because they’re new terms that we defined, GlossExtractor couldn’t find a definition for them on the Web.
To the best of our knowledge, no other system in the literature covers the
entire glossary extraction process (acquisition of a domain terminology, extraction of
definitions, evaluation, updating, and maintenance). Our methodology exploits the
evolving nature of the Web, which is seen
as a huge corpus of texts that is processed
to create and continuously update specialized glossaries. In fact, the procedure can
be repeated incrementally with the same
average effort-per-term, both to extend the
set of definitions for existing terms and to extract definitions for new emerging terms. Therefore, the complete procedure is under the control of the interested
community, including the glossary maintenance and updating problem. According
to our experiments, especially those in the
context of the Interop European project,
the average speedup factor with respect to
manual glossary building varies from one-fifth to one-half of the time employed by a
professional lexicographer.
Real-world evaluation of GlossExtractor
showed that coverage might vary significantly, depending on the domain’s novelty.
We’ve identified three strategies that could
increase coverage: including other formats,
such as non-HTML documents, in our Web
searches; letting users explicitly provide
URLs of existing online glossaries or document archives for the system to search;
and extending our style filter using a much
larger training set composed of Wikipedia
definitions. We will include these features
in a future release of GlossExtractor.
More important, we’re studying the development of novel, motivated algorithms that
we hope will further improve our style and
domain filters. A comparison of their performance with our current implementation
will eventually reveal whether textual definitions’ style might vary (slightly) depending on the domain at hand.
Acknowledgments
We thank Orin Hargraves for his very helpful
professional advice.
References

1. E.P. Bontas and M. Mochol, “Towards a Cost Estimation Model for Ontologies,” Proc. 3rd Berliner XML Tage, Humboldt-Universität zu Berlin and Freie Univ. Berlin, 2005, pp. 153–160.
2. F. Sclano and P. Velardi, “TermExtractor: A Web Application to Learn the Common Terminology of Interest Groups and Research Communities,” Proc. 9th Conf. Terminology and Artificial Intelligence, 2007.
3. P. Velardi, R. Navigli, and M. Pétit, “Semantic Indexing of a Competence Map to Support Scientific Collaboration in a Research Community,” Proc. 20th Int’l Joint Conf. Artificial Intelligence, AAAI Press, 2007, pp. 2897–2902.
4. M.A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora,” Proc. 14th Int’l Conf. Computational Linguistics, Assoc. for Computational Linguistics, 1992, pp. 539–545.
5. R. Navigli and P. Velardi, “Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites,” Computational Linguistics, vol. 30, no. 2, 2004, pp. 151–179.
6. J. Klavans and S. Muresan, “Evaluation of the Definder System for Fully Automatic Glossary Construction,” Proc. Am. Medical Informatics Assoc. Symp., 2001, pp. 324–328.

The Authors

Paola Velardi is a full professor in the Department of Computer Science at the University of Rome “La Sapienza.” Her research interests include natural language processing, machine learning, and the Semantic Web. Velardi received her Laurea degree in electrical engineering from “La Sapienza.” Contact her at velardi@di.uniroma1.it.

Roberto Navigli is an assistant professor in the Department of Computer Science at the University of Rome “La Sapienza.” His research interests include natural language processing and the Semantic Web and its applications. Navigli received his PhD in computer science from “La Sapienza.” Contact him at navigli@di.uniroma1.it.

Pierluigi D’Amadio is an IT consultant in system architecture, software design, and telecommunication systems. His research interests are intelligent systems, natural language processing, and information retrieval techniques. D’Amadio received his degree in computer science from the University of Rome “La Sapienza.” Contact him at damadio@di.uniroma1.it.