Indexing and Searching Timed Media: the Role of Mutual Information Models

Tony Davis (StreamSage/Comcast)
IPAM, UCLA
1 Oct. 2007
A bit about what we do…

StreamSage (now a part of Comcast) focuses on indexing and
retrieval of “timed” media (video and audio, aka “multimedia” or
“broadcast media”)

A variety of applications, but now centered on cable TV

This is joint work with many members of the research team:
Shachi Dave
Abby Elbow
David Houghton
I-Ju Lai
Hemali Majithia
Phil Rennert
Kevin Reschke
Robert Rubinoff
Pinaki Sinha
Goldee Udani
Overview

Theme: use of term association models to address challenges of
timed media

Problems addressed:
 Retrieving all and only the relevant portions of timed media for a
query
 Lexical semantics (WSD, term expansion, compositionality of
multi-word units)
 Ontology enhancement
 Topic detection, clustering, and similarity among documents
Automatic indexing and retrieval of
streaming media


Streaming media presents particular difficulties
 Its timed nature makes navigation cumbersome, so the system
must extract relevant intervals of documents rather than present
a list of documents to the user
 Speech recognition is error-prone, so the system must
compensate for noisy input
We use a mix of statistical and symbolic NLP
 Various modules factor into calculating relevant intervals for
each term:
1. Word sense disambiguation
2. Query expansion
3. Anaphor resolution
4. Name recognition
5. Topic segmentation and identification
Statistical and NLP
Foundations: the COW
and YAKS Models
Two types of models

COW (Co-Occurring Words)
 Based on proximity of terms, using a sliding window

YAKS (Yet Another Knowledge Source)
 Based on grammatical relationships between terms
The COW Model

Large mutual information model of word co-occurrence

MI(X,Y) = P(X,Y) / (P(X) P(Y))

Thus, COW values greater than 1 indicate correlation (tendency
to co-occur); less than 1, anticorrelation

Values are adjusted for centrality (salience)

Two main COW models:
 New York Times, based on 325 million words (about 6 years)
of text
 Wikipedia, more recent, roughly the same amount of text
 We also have specialized COW models (for medicine, business, and others), as well as models for other languages
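
A minimal sketch of a COW-style table as described above (co-occurrence in a sliding window, MI as a ratio of probabilities); the window size is an illustrative assumption, and the centrality (salience) adjustment is not included:

```python
# Sketch only: the window size and corpus representation are assumptions, and
# the centrality/salience adjustment mentioned above is omitted.
from collections import Counter
from itertools import combinations

def cow_table(tokenized_docs, window=50):
    """Return MI(x, y) = P(x, y) / (P(x) * P(y)) for co-occurring word pairs."""
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for tokens in tokenized_docs:
        for start in range(max(len(tokens) - window + 1, 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(frozenset(p) for p in combinations(win, 2))
    mi = {}
    for pair, joint in pair_counts.items():
        x, y = tuple(pair)
        # values > 1 indicate correlation, < 1 anticorrelation
        mi[(x, y)] = (joint / n_windows) / ((word_counts[x] / n_windows) *
                                            (word_counts[y] / n_windows))
    return mi
```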
The COW (Co-Occurring Words) Model

The COW model is at the core of what we do

Relevance interval construction

Document segmentation and topic identification

Word sense disambiguation (large-scale and unsupervised,
based on clustering COWs of an ambiguous word)

Ontology construction
 Determining semantic relatedness of terms
 Determining specificity of a term
The COW (Co-Occurring Words) Model

An example: top 10 COWs of railroad//N

  post//J         165.554
  shipper//N      123.568
  freight//N      121.375
  locomotive//N   119.602
  rail//N          73.7922
  railway//N       64.6594
  commuter//N      63.4978
  pickups//N       48.3637
  train//N         44.863
  burlington//N    41.4952
Multi-Word Units (MWUs)

Why do we care about MWUs? Because they act like single words
in many cases, but also:

MWUs are often powerful disambiguators of words within them (see, e.g., Yarowsky (1995), Pedersen (2002) for WSD methods that exploit this):
 ‘fuel tank’, ‘fish tank’, ‘tank tread’
 ‘indoor pool’, ‘labor pool’, ‘pool table’

Useful in query expansion
 ‘Dept. of Agriculture’ → ‘Agriculture Dept.’
 ‘hookworm in dogs’ → ‘canine hookworm’

Provide many terms that can be added to ontologies
 ‘commuter railroad’, ‘peanut butter’
Multi-Word Units (MWUs) in our models

MWUs in our system

We extract nominal MWUs, using a simple procedure based on POS-tagging. Example patterns:
1. ({N,J}) ({N,J}) N
2. N Prep (‘the’) ({N,J}) N
   where Prep is ‘in’, ‘on’, ‘of’, ‘to’, ‘by’, ‘with’, ‘without’, ‘for’, or ‘against’

For the most common 100,000 or so MWUs in our corpus, we calculate COW values, as we do for words.
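
A rough sketch of the two patterns above, run over POS-tagged sentences given as (token, tag) pairs; the coarse tags 'N', 'J', 'P' and the handling of the optional slots are assumptions about one reasonable implementation:

```python
PREPS = {'in', 'on', 'of', 'to', 'by', 'with', 'without', 'for', 'against'}

def nominal_mwus(tagged_sentence):
    """Yield candidates for pattern 1: ({N,J}) ({N,J}) N
       and pattern 2: N Prep ('the') ({N,J}) N."""
    words = [w.lower() for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    n = len(tagged_sentence)
    for i in range(n):
        # Pattern 1: one or two N/J modifiers followed by a head noun.
        for length in (2, 3):
            j = i + length
            if j <= n and tags[j - 1] == 'N' and all(t in ('N', 'J') for t in tags[i:j - 1]):
                yield ' '.join(words[i:j])
        # Pattern 2: N Prep ('the') ({N,J}) N
        if tags[i] == 'N' and i + 2 < n and tags[i + 1] == 'P' and words[i + 1] in PREPS:
            k = i + 2
            if k < n and words[k] == 'the':
                k += 1
            if k + 1 < n and tags[k] in ('N', 'J') and tags[k + 1] == 'N':
                yield ' '.join(words[i:k + 2])      # with an optional modifier
            elif k < n and tags[k] == 'N':
                yield ' '.join(words[i:k + 1])      # e.g. 'hookworm in dogs'
```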
COWs of MWUs

An example: top ten COWs of ‘commuter railroad’

  post//J             1234.47
  pickups//N           315.005
  rail//N              200.839
  sanitation//N        186.99
  weekday//N           135.609
  transit//N           134.329
  commuter//N          119.435
  subway//N             86.6837
  transportation//N     86.487
  railway//N            86.2851
COWs of MWUs

Another example: top ten COWs of ‘underground railroad’

  abolitionist//N     845.075
  slave//N            401.732
  gourd//N            266.538
  runaway//J          226.163
  douglass//N         170.459
  slavery//N          157.654
  harriet//N          131.803
  quilt//N            109.241
  quaker//N            94.6592
  historic//N          86.0395
The YAKS model


Motivations
 COW values reflect simple co-occurrence or association, but no
particular relationship beyond that
 For some purposes, it’s useful to measure the association between
two terms in a particular syntactic relationship
Construction
 Parse a lot of text (the same 325 million words of New York Times
used to build our NYT COW model); however, long sentences (>25
words) were discarded, as parsing them was slow and error-prone
 The parser’s output provides information about grammatical
relations between words in a clause; to measure the association of
a verb (say ‘drink’) and a noun as its object (say ‘beer’), we
consider the set of all verb-object pairs, and calculate mutual
information over that set
 Calculate MI for broader semantic classes of terms, e.g.: food,
substance. Semantic classes were taken from Cambridge
International Dictionary of English (CIDE); there are about 2000 of
them, arranged in a shallow hierarchy
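
The core of the YAKS construction, sketched for one relation (OBJ): mutual information over the set of verb-object pairs. Parsing is omitted; `pairs` is assumed to be a list of (verb, object) tuples extracted by the parser:

```python
from collections import Counter

def yaks_obj_mi(pairs):
    """MI over the set of all verb-object pairs, as described above."""
    verb_counts = Counter(v for v, _ in pairs)
    obj_counts = Counter(o for _, o in pairs)
    pair_counts = Counter(pairs)
    n = len(pairs)
    return {(v, o): (c / n) / ((verb_counts[v] / n) * (obj_counts[o] / n))
            for (v, o), c in pair_counts.items()}
```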
YAKS Examples

Some objects of ‘eat’ (OBJ head=eat):

  arg1=hamburger    139.249
  arg1=pretzel       90.359
  arg1=:Food         18.156
  arg1=:Substance     7.89
  arg1=:Sound         0.324
  arg1=:Place         0.448
Relevance Intervals
Relevance Intervals (RIs)

Each RI is a contiguous segment of audio/video deemed relevant to a term.
1. RIs are calculated for all content words (after lemmatization) and multi-word expressions.
2. RI basis: the sentence containing the term.
3. Each RI is expanded forward and backward to capture relevant material, using techniques including:
    Topic boundary detection by changes in COW values across sentences
    Topic boundary detection via discourse markers
    Synonym-based query expansion
    Anaphor resolution
4. Nearby RIs for the same term are merged.
5. Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI and the COW values of other words in the RI with the term.
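
A simplified sketch of steps 4 and 5 above (merging nearby RIs for a term and assigning a magnitude); the gap threshold and the exact magnitude formula are illustrative assumptions, not the production system's:

```python
def merge_nearby(intervals, max_gap=10.0):
    """intervals: list of (start, end) times, in seconds, relevant to one term."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))   # extend the last RI
        else:
            merged.append((start, end))
    return merged

def ri_magnitude(term_occurrences, other_word_cows):
    """Combine the two factors named above: occurrences of the term in the RI and
    the COW values of the RI's other words with the term (weighting is assumed)."""
    return term_occurrences + sum(v for v in other_word_cows if v > 1.0) / 100.0
```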
Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast.
In South Africa the government is struggling to contain a growing demand
for land from its black citizens.
Authorities have vowed to crack down and arrest squatters illegally
occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans
have seized government and privately-owned property.
Hundreds were arrested earlier this week and the government hopes to
move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer loudly as
their leaders deliver angry speeches against whites and landlessness in
South Africa.
“Must give us a place…”

We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast.
In South Africa the government is struggling to contain a growing demand
for land from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally
occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans
have seized government and privately-owned property. [cow-expand]
Hundreds were arrested earlier this week and the government hopes to
move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer loudly as
their leaders deliver angry speeches against whites and landlessness in
South Africa.
“Must give us a place…” [topic segment boundary]

We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment
boundary]
In South Africa the government is struggling to contain a growing demand for land
from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally occupying land
near Johannesburg.
In a most serious incident today more than 10,000 black South Africans have seized
government and privately-owned property. [cow-expand]
Hundreds were arrested earlier this week and the government hopes to move the rest
out in the next two days. [merge nearby intervals]
NPR’s Kenneth Walker has a report. [merge nearby intervals]
Thousands of squatters in a suburb outside Johannesburg cheer loudly as their
leaders deliver angry speeches against whites and landlessness in South Africa.
“Must give us a place…” [topic segment boundary]

Two occurrences of squatter produce a complete merged interval.
Relevance Intervals and Virtual Documents

The set of all the RIs for a term in a document constitutes the virtual
document for that term

In effect, the VD for a term is intended to approximate a document
that would have been produced had the authors focused solely on
that term

A VD is assigned a magnitude equal to the highest magnitude of the
RIs in it, with a bonus if more than one RI has a similarly high
magnitude
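
A sketch of the VD magnitude rule just stated; the size of the bonus and the notion of "similarly high" are assumptions:

```python
def vd_magnitude(ri_magnitudes, similar=0.9, bonus=1.2):
    """Highest RI magnitude, with an (assumed) bonus when several RIs are close to it."""
    top = max(ri_magnitudes)
    strong = [m for m in ri_magnitudes if m >= similar * top]
    return top * bonus if len(strong) > 1 else top
```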
Merging RIs for multiple terms

[Figure: timelines of occurrences of the original terms ‘Russia’ and ‘Iran’ and their RIs; spreading activation merges nearby RIs into RIs for ‘Russia and Iran’.]

Note that this can only be done at query time, so it needs to be fairly quick and simple.
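
One simple, query-time-friendly way to combine two terms' RIs (keep each term's RIs that lie near an RI of the other term, then merge); this sketches the effect shown in the figure, not the spreading-activation procedure itself:

```python
def combine_query_ris(ris_a, ris_b, max_gap=15.0):
    """ris_a, ris_b: lists of (start, end) RIs for the two query terms."""
    def near(x, y):
        return x[0] <= y[1] + max_gap and y[0] <= x[1] + max_gap
    kept = sorted([a for a in ris_a if any(near(a, b) for b in ris_b)] +
                  [b for b in ris_b if any(near(b, a) for a in ris_a)])
    merged = []
    for start, end in kept:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```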
Evaluating RIs and VDs

Evaluation of retrieval effectiveness in timed media raises further
issues:

Building a gold standard is painstaking, and potentially more subjective
 It’s necessary to measure how closely the system’s RIs match the gold standard’s
 What’s a reasonable baseline?

We created a gold standard of about 2300 VDs with about 200
queries on about 50 documents (NPR, CNN, ABC, and business
webcasts), and rated each RI in a VD on a scale of 1 (highly
relevant) to 3 (marginally relevant).

Testing of the system was performed on speech recognizer output
Evaluating RIs and VDs

Measure amounts of extraneous and missed content.

[Figure: an ideal RI compared with a system RI; material in the system RI but not the ideal RI is extraneous, and material in the ideal RI but not the system RI is missed.]
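
A sketch of this measure for one ideal/system RI pair, each given as (start, end) in seconds; normalizing by the ideal RI's length is one plausible choice:

```python
def extraneous_and_missed(ideal, system):
    """Return (extraneous %, missed %) for a single ideal RI and system RI."""
    overlap = max(0.0, min(ideal[1], system[1]) - max(ideal[0], system[0]))
    extraneous = (system[1] - system[0]) - overlap   # in the system RI but not the ideal RI
    missed = (ideal[1] - ideal[0]) - overlap         # in the ideal RI but not the system RI
    ideal_len = ideal[1] - ideal[0]
    return 100.0 * extraneous / ideal_len, 100.0 * missed / ideal_len
```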
Evaluating RIs and VDs

Comparison of percentages of median extraneous and missed content over all queries, between the system using COWs and a system using only sentences with query terms present:

relevance        most relevant    relevant        most relevant     marginally
                                                  and relevant      relevant
system           extra   miss     extra   miss    extra   miss      extra   miss
With COWs         9.3    12.7     39.2    27.5    29.9    21.6      61.7    15.0
Without COWs      0.0    57.7      0.3    64.7     0.0    63.4      18.8    60.2
MWUs and
compositionality
MWUs, idioms, and compositionality

Several partially independent factors are in play here (Calzolari, et al. 2002):
1. reduced syntactic and semantic transparency;
2. reduced or lack of compositionality;
3. more or less frozen or fixed status;
4. possible violation of some otherwise general syntactic patterns
or rules;
5. a high degree of lexicalization (depending on pragmatic factors);
6. a high degree of conventionality.
MWUs, idioms, and compositionality

In addition, there are two kinds of “mixed” cases

Ambiguous MWUs, with one meaning compositional and the other not:
‘end of the tunnel’, ‘underground railroad’

“Normal” use of some component words, but not others:
‘flea market’ (a kind of market)
‘peanut butter’ (a spread made from peanuts)
Automatic detection of non-compositionality

Previous work

Lin (1999): “based on the hypothesis that when a phrase is noncompositional, its mutual information differs significantly from the mutual
informations [sic] of phrases obtained by substituting one of the words in the
phrase with a similar word.”

For instance, the distribution of ‘peanut’ and ‘butter’ should differ from
that of ‘peanut’ and ‘margarine’

Results are not very good yet, because semantically-related words often
have quite different distributions, and many compositional collocations
are “institutionalized”, so that substituting words within them will
change distributional statistics.
Automatic detection of non-compositionality

Previous work

Baldwin, et al. (2002): “use latent semantic analysis to determine the
similarity between a multiword expression and its constituent words”;
“higher similarities indicate greater decomposability”

“Our expectation is that for constituent word-MWE pairs with higher
LSA similarities, there is a greater likelihood of the MWE being a
hyponym of the constituent word.” (for head words of MWEs)

“correlate[s] moderately with WordNet-based hyponymy values.”
Automatic detection of non-compositionality


We use the COW model for a related approach to the problem

COWs (and COW values) of an MWU and its component words will be
more alike if the MWU is compositional.
We use a measure of occurrences of a component word near an MWU as
another criterion of compositionality

The more often words in the MWU appear near it, but not as a part of it, the
more likely it is that the MWU is compositional.
COW pair sum measure

Get the top n COWs of an MWU, and of one of its component words.
For each pair of COWs (one from each of these lists), find their COW value.

[Figure: the top COWs of railroad (post//J, shipper//N, freight//N, …) alongside the top COWs of commuter railroad (post//J, pickups//N, rail//N, …).]
COW pair sum measure



Get the top n COWs of an MWU, and of one of its component words.
For each pair of COWs (one from each of these lists), find their COW value.
Then sum up these values. This provides a measure of how similar the contexts
in which the MWU and its component word appear are.
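
A sketch of the COW pair sum measure; `cow(x, y)` is assumed to be a lookup returning the (adjusted) COW value for a pair of terms, and the measure simply sums these over all cross pairs of the two top-n lists:

```python
def cow_pair_sum(mwu_cows, word_cows, cow):
    """mwu_cows, word_cows: top-n COW term lists for the MWU and a component word."""
    return sum(cow(x, y) for x in mwu_cows for y in word_cows)
```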
Feature overlap measure

Get the top n COWs (and values) of an MWU, and of one of its component words.
For each COW with a value greater than some threshold, treat that COW as a feature of the term.
Then compute the overlap coefficient (Jaccard coefficient); for two sets of features A and B:

  |A ∩ B| / |A ∪ B|
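
A sketch of the feature overlap measure; the threshold value is an illustrative assumption:

```python
def feature_overlap(mwu_cow_values, word_cow_values, threshold=10.0):
    """Inputs: dicts mapping COW terms to COW values, for the MWU and a component word."""
    a = {t for t, v in mwu_cow_values.items() if v > threshold}
    b = {t for t, v in word_cow_values.items() if v > threshold}
    return len(a & b) / len(a | b) if (a or b) else 0.0   # |A ∩ B| / |A ∪ B|
```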
Occurrence-based measure


For each occurrence of an MWU, determine whether a given component word occurs in a window around that occurrence, but not as part of that MWU.
Calculate the proportion of occurrences for which this is the case, out of all occurrences of the MWU.
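
A sketch of the occurrence-based measure over one tokenized document; the window size (in tokens) is an assumption:

```python
def occurrence_measure(tokens, mwu_tokens, component, window=25):
    """Percentage of MWU occurrences with `component` nearby but outside the MWU."""
    m = len(mwu_tokens)
    hits = [i for i in range(len(tokens) - m + 1) if tokens[i:i + m] == mwu_tokens]
    if not hits:
        return 0.0
    nearby = 0
    for i in hits:
        lo, hi = max(0, i - window), min(len(tokens), i + m + window)
        context = tokens[lo:i] + tokens[i + m:hi]   # window minus the MWU itself
        if component in context:
            nearby += 1
    return 100.0 * nearby / len(hits)
```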
Testing the measures


We extracted all MWUs tagged as idiomatic in the Cambridge International
Dictionary of English (about 1000 expressions).
There are about 112 of these that conform to our MWU patterns and occur with
sufficient frequency in our corpus that we have calculated COWs for them.
fashion victim
flea market
flip side
Testing the measures


We then searched the 100,000 MWUs for which we have COW values,
choosing compositional MWUs containing the same terms.
In some cases, this is difficult or impossible, as no appropriate MWUs are
present. About 144 MWUs are on the compositional list.
  fashion victim → fashion designer, crime victim
  flea market → [flea collar], market share
  flip side → [coin flip], side of the building
Results: basic statistics

The idiomatic and compositional sets are quite different in aggregate, though
there is a large variance:
                  COW pair sum      Feature overlap     Occurrence measure
Non-idiomatic     mean 575.478      mean 0.297          mean 37.877
                  s.d. 861.754      s.d. 0.256          s.d. 23.470
Idiomatic         mean -236.92      mean 0.109          mean 16.954
                  s.d. 502.436      s.d. 0.180          s.d. 16.637
Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?
COW pair sum:
                  negative    positive
Non-idiomatic        75          213
Idiomatic           178           46
Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?
feature overlap:
                  < 0.12     >= 0.12
Non-idiomatic       100         188
Idiomatic           175          49
Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?
occurrence-based measure:
                  < 25%      >= 25%
Non-idiomatic        94         194
Idiomatic           174          50
Results: discriminating the two sets


Can we do better by combining the measures? We used the decision-tree software C5.0 to check.

Rule: if COW pair sum <= -216.739, or
      COW pair sum <= 199.215 and occ. measure < 27.74%,
      then idiomatic; otherwise non-idiomatic

                  yes     no
Non-idiomatic      50    238
Idiomatic         184     40
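
The C5.0-derived rule above, written out as a function:

```python
def predicted_idiomatic(cow_pair_sum, occurrence_pct):
    """Apply the rule stated above to one MWU's two measure values."""
    return (cow_pair_sum <= -216.739 or
            (cow_pair_sum <= 199.215 and occurrence_pct < 27.74))
```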
Results: discriminating the two sets


Some cases are “split”—classified as idiomatic with respect to one component
word but not the other word:
bear hug is idiomatic w.r.t. bear but not hug
flea market is idiomatic w.r.t. flea but not market
Other methods to improve performance on this task

MWUs often come in semantic “clusters”:
‘almond tree’, ‘peach tree’, ‘blackberry bush’, ‘pepper plant’, etc.

Corresponding components in these MWUs can be localized in a small area
of WordNet (Barrett, Davis, and Dorr (2001)) or UMLS (Rosario, Hearst,
and Fillmore (2002)).

“Outliers” that don’t fit the pattern are potentially idiomatic or non-compositional (‘plane tree’, but not ‘rubber tree’, which is compositional).
Clustering and topic
detection
Clustering by similarities among segments

Content of a segment is represented by its topically salient terms.

The COW model is used to calculate a similarity measure for each
pair of segments.

Clustering on the resulting matrix of similarities (using the CLUTO
package) yields topically distinct clusters of results.
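
A sketch of this pipeline. The talk uses the CLUTO package; scipy's hierarchical clustering stands in here, and the similarity function (mean pairwise COW value between the two segments' salient terms) is an assumption about one reasonable realization:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_segments(segment_terms, cow, n_clusters=2):
    """segment_terms: list of sets of topically salient terms, one per segment.
       cow(x, y): an assumed lookup returning the COW value for a term pair."""
    n = len(segment_terms)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            pairs = [(x, y) for x in segment_terms[i] for y in segment_terms[j] if x != y]
            sim[i, j] = np.mean([cow(x, y) for x, y in pairs]) if pairs else 0.0
    dist = sim.max() - sim                      # turn similarities into distances
    condensed = dist[np.triu_indices(n, k=1)]   # condensed upper-triangle form for linkage()
    return fcluster(linkage(condensed, method='average'), n_clusters, criterion='maxclust')
```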
Clustering Results

Example: crane

10 segments form 2 well-defined clusters, one relating to birds, the
other to machinery (and cleanup of the WTC debris in particular)
Cluster Labeling

Topic labels for clusters improve usability.

Candidate cluster labels can be obtained from:
 Topic terms of segments in a cluster
 Multi-word units containing the query term(s)
 Outside sources (taxonomies, Wikipedia, …)
Topics through latent Dirichlet allocation

LDA (Blei, Ng, and Jordan 2003) models each document as a probability distribution over a set of underlying topics, with each topic defined as a probability distribution over terms.
We’ve generated topic models for the SR output of about 20,000 news programs (from 400 to 1,000 topics).
Nouns, verbs, and MWUs only; all other words are discarded.
Impressionistically, the “best” topics seem to be those in which a third or more of the probability mass is in 10-50 terms.
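
A minimal sketch of the topic-model training step; gensim stands in here (the talk does not name an implementation), and documents are assumed to be pre-filtered to nouns, verbs, and MWUs as described above:

```python
from gensim import corpora, models

def train_lda(filtered_docs, num_topics=400):
    """filtered_docs: list of token lists (nouns, verbs, and MWUs only)."""
    dictionary = corpora.Dictionary(filtered_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in filtered_docs]
    lda = models.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary
```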
Topics through latent Dirichlet allocation

Are there better ways to gauge topic coherence (and relatedness)? We’ve tried two:
 COW sum measure (Wikipedia COW table on the 20 most probable words in a topic)
 Average “distance” between CIDE semantic domain codes for terms (also top 20, using the similarity measure of Wu and Palmer 1994)
Excerpt of CIDE hierarchy:
43 Building and Civil Engineering
66 Buildings
68 Buildings: names and types of
754 Houses and homes
755 Public buildings
365 Rooms
194 Furniture and Fittings
834 Bathroom fixtures and fittings
811 Curtains and wallpaper
804 Tables
805 Chairs and seats
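
A sketch of the Wu and Palmer (1994) similarity over a shallow hierarchy like the CIDE excerpt above, given a child-to-parent map (the root has no entry):

```python
def wu_palmer(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    def path_to_root(c):
        path = [c]
        while c in parent:
            c = parent[c]
            path.append(c)
        return path
    pa, pb = path_to_root(a), path_to_root(b)
    lcs = next(c for c in pa if c in pb)        # lowest common subsumer
    depth = lambda c: len(path_to_root(c))      # the root has depth 1
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

# e.g. with parent = {'Tables': 'Furniture and Fittings',
#                     'Chairs and seats': 'Furniture and Fittings'},
# wu_palmer('Tables', 'Chairs and seats', parent) == 0.5
```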

Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics)
 COW sum measure
Rank in WikiCow= 1 Rank in cide= 232
return home war talk stress month post-traumatic rack deal think
stress$disorder night psychological face understand pt experience
combat series mental$health
Rank in WikiCow= 2 Rank in cide= 773
research cell embryo stem$cell disease destroy embryonic$stem
scientist parkinson researcher federal funding potential cure today
human$embryo medical human$life california create
Rank in WikiCow= 3 Rank in cide= 924
giuliani new$hampshire run talk gingrich february former$new
thompson social mormon massachusetts mayor iowa newt opening
york choice rudy$giuliani jim committee
Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics)
 COW sum measure
Rank in WikiCow= 4 Rank in cide= 650
palestinian israeli prime$minister israelis arafat west$bank jerusalem peace
settlement gaza$strip government violence israeli$soldier hama leader
move force security peace$process downtown
Rank in WikiCow= 5 Rank in cide= 283
japan japanese tokyo connecticut lieberman lucy run lose joe$lieberman
support lemont disaster war twenty-first$century anti-war define
peterson$cbs senator$joe future figure
Rank in WikiCow= 6 Rank in cide= 710
california schwarzenegger union break office view arnold$schwarzenegger
tuesday politician budget rosie poll agenda o'donnell political battle
california$governor help rating maria
Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics)
 CIDE distance measure
Rank in cide= 1 Rank in WikiCow= 278
play show broadway actor stage star performance theatre perform
audience act career tony sing character production young role
review performer
Rank in cide= 2 Rank in WikiCow= 476
war u.n. refugee united$nation international defend support kill u.s.
week flee allow chief board cost remain desperate innocent
humanitarian conference
Rank in cide= 3 Rank in WikiCow= 151
competition win team skate compete figure stand finish performance
gold watch champion hard-on score skater talk meter lose fight
gold$medal
Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics)
 CIDE distance measure
Rank in cide= 4 Rank in WikiCow= 858
set move show pass finish mind send-up expect outcome address
defence person responsibility wish independent system woman
salute term pres
Rank in cide= 5 Rank in WikiCow= 478
character play think kind film actor mean mind scene act guy script
role good$morning real read nice wonderful stuff interesting
Rank in cide= 6 Rank in WikiCow= 892
warren purpose rick hunter week drive hundred influence allow poor
peace night leader train sign walk training goal team rally
YAKS and Ontology
Enhancement
YAKS (Yet another knowledge source)




What is YAKS:
Like the COW model: measures how much more (or less) frequently
words occur in certain environments than would be expected by
chance.
Unlike the COW model: does not measure co-occurrence within a
fixed window size. Measures how often words are in one of the
syntactic relations tracked by the model.
Example: Subject(eat,dog) would be an entry in the YAKS table
recording how many times the verb eat occurs in the corpus with
dog as its subject.
Syntax and Semantics: YAKS



YAKS infrastructure:
Corpus: New York Times corpus (six years), all sentences under 25
words (~60% of sentences, ~50% of words); about 160 million
words.
Need good parsing. We used the Xerox Linguistics Environment
(XLE) parser from PARC, based on lexical-functional grammar.
Syntax and Semantics: YAKS

Sample YAKS relations (YAKS tracks 13 different relations):

Relation         Meaning                                     Example
SUBJ(V, X)       X is subject of verb V.                     “Dan ate soup”: SUBJ(eat, Dan)
OBJ(V, X)        X is direct object of verb V.               “Dan ate soup”: OBJ(eat, soup)
COMP(V, X)       Verb V and complementizer X.                “Dan believes that soup is good”: COMP(believe, that)
POBJ(P, X)       Preposition P and object X.                 “Eli sat on the ground”: POBJ(on, the ground)
CONJ(C, X, Y)    Coordinating conjunction C, and two         “Dan ate bread and soup”: CONJ(and, bread, soup)
                 conjuncts (either nouns or verbs) X and Y.

For ontology extension, we use only the OBJ relation.
YAKS in ontology enhancement

[Figure: seed items hat, sock, shirt, dress; items from one neighborhood in the ontology.]

YAKS in ontology enhancement

Which verbs are most associated with these nouns, in a large corpus?
[Figure: the seed nouns hat, sock, shirt, dress linked to the verbs wear, wash, take off, iron.]

YAKS in ontology enhancement

Which other nouns are most associated as objects of those verbs?
[Figure: those verbs linked to new candidates for that area of the ontology: sweater, clothes, pants, jacket, blouse, pajamas.]
YAKS in ontology enhancement

YAKS technique:
The noun-verb-noun cycle reflects the selectional restrictions of verbs that are strongly associated with the seed nouns.
This lets us use the statistical properties of the large corpus to find nouns that are semantically close to the seeds in the ontology.
This is why we use OBJ (the object relation) for this YAKS technique. (SUBJ found the same information, but with slightly more noise.)
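
A sketch of the noun-verb-noun cycle: score candidate verbs by their YAKS OBJ association with the seed nouns, then score other nouns as objects of the best verbs. The cutoffs and the `yaks_obj(verb, noun)` lookup are assumptions:

```python
def induce_candidates(seeds, verbs, nouns, yaks_obj, top_verbs=10, top_nouns=30):
    """Return candidate nouns for the seeds' neighborhood of the ontology."""
    verb_scores = {v: sum(yaks_obj(v, s) for s in seeds) for v in verbs}
    best_verbs = sorted(verb_scores, key=verb_scores.get, reverse=True)[:top_verbs]
    noun_scores = {n: sum(yaks_obj(v, n) for v in best_verbs)
                   for n in nouns if n not in seeds}
    return sorted(noun_scores, key=noun_scores.get, reverse=True)[:top_nouns]
```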
YAKS in ontology enhancement

YAKS technique: finds new items for a neighborhood, but where exactly do they go? X belongs in here . . .

YAKS in ontology enhancement

[Figure: the new term X placed in the ontology neighborhood, surrounded by question marks. But where, exactly?]
YAKS in ontology enhancement



Techniques to precisely locate a candidate term in the ontology:
Wikipedia check (already described)
(Does the Dog page contain “dog is a mammal”? Does the Mammal
page contain “mammal is a dog”?)
Yahoo pattern technique
YAKS in ontology enhancement

Yahoo check:
1. Use known pattern information to form a Yahoo search.
   • If we suspect that a dog is-a mammal, search Yahoo for “dog is a mammal” and “mammal is a dog”. Also search using other is-a patterns (e.g., from Hearst 1994).
2. Use relative counts (absolute counts hard to interpret).
YAKS in ontology enhancement: evaluation

Four levels of is-a: Beef, meat, food, substance
1. Seed pairs were beef-meat, meat-food, and food-substance.
2. YAKS induction and siblings added to produce candidate pool.
3. Candidate list was then pared using a version of the Wikipedia check:
YAKS in ontology enhancement: evaluation

Beef, meat, food, substance — Wikipedia check:
Where might candidate term T fit, relative to seed term S?
 Count mentions of T on the S page (M_T) and mentions of S on the T page (M_S), weighting mentions in the first paragraph doubly.
 Subtract: M_T - M_S.
 If T is an S, then M_T - M_S should be > 0.
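
A sketch of this check; fetching pages is omitted, and `page_paragraphs(title)` is a hypothetical helper returning a page's paragraphs as a list of strings:

```python
def mention_count(term, paragraphs):
    counts = [p.lower().count(term.lower()) for p in paragraphs]
    return 2 * counts[0] + sum(counts[1:]) if counts else 0  # first paragraph counts double

def wikipedia_check(t, s, page_paragraphs):
    m_t = mention_count(t, page_paragraphs(s))   # mentions of T on the S page
    m_s = mention_count(s, page_paragraphs(t))   # mentions of S on the T page
    return m_t - m_s                             # > 0 suggests that T is an S
```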
YAKS in ontology enhancement: evaluation

Beef, meat, food, substance — Results:
After YAKS induction and expansion via siblings, we had >800 candidate terms to add to the ontology.
The Wikipedia check pared this to 38 terms, each with its best is-a relationship.
Of these, on hand inspection, all but two were very high quality.
YAKS in ontology enhancement: evaluation

YAKS testing: MILO transportation ontology
1. In each trial, seed terms were three items from near the bottom of the ontology, linked by is-a. For instance, sedan, car, and vehicle. Selecting terms from the bottom of the ontology decreases the lexicalization problems inherent in MILO content.
2. Induced new candidate terms using the YAKS technique, adding siblings as well.
YAKS in ontology enhancement: evaluation
3.
Candidate list was then pared using a Yahoo is-a pattern check:
A. For a candidate term T, compare it to each of the three seeds
– i.e., is it a sedan? Is it a car? Is it a vehicle?
B. Do this via a Yahoo search for Hearst is-a patterns involving T
and each of the seeds. This trial used the patterns Y such as X
and X and other Y.
– i.e., search for “. . . sedans such as T . . .”, “. . . cars such as
T . . .”, “. . . vehicles such as T . . .”, “. . . T and other sedans
...”, and so on.
YAKS in ontology enhancement: evaluation

3. Yahoo pattern check, continued:
   C. Does the candidate have a significant number of Yahoo hits for one of the seeds with one of the is-a patterns?
      a. Is there a significant difference between that highest number of hits and the number of hits with some other seed?
      b. That is, is there a clear indication that the candidate is-a for one particular seed?
   D. If not, discard this candidate.
   E. If so, then this candidate most likely is-a member of the class named by that seed.
YAKS in ontology enhancement: evaluation

Results — MILO transportation ontology:

candidate       Yahoo hits for is-a:               result
                sedan    car      vehicle
truck             0      532      152000           a truck is a vehicle
van               0       95        5590           a van is a vehicle
automobile        0       63       14700           an automobile is a vehicle
minivan           1       76         561           a minivan is a vehicle
BMW               3      360          66           a BMW is a car
Jeep              0     3600         849           a Jeep is a car
Corolla           6       66           3           a Corolla is a car
convertible       0       25          76           a convertible is a vehicle
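
A sketch of the decision step behind the results above: accept the seed with the most pattern hits only if it clearly dominates the others. The thresholds are illustrative assumptions:

```python
def pick_is_a(hit_counts, min_hits=50, min_ratio=3.0):
    """hit_counts: seed -> total hits, e.g. {'sedan': 3, 'car': 360, 'vehicle': 66}."""
    ranked = sorted(hit_counts, key=hit_counts.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if hit_counts[best] >= min_hits and hit_counts[best] >= min_ratio * max(hit_counts[runner_up], 1):
        return best    # for the example counts above: 'car' (a BMW is a car)
    return None        # no clear winner; discard the candidate
```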
Conclusions and future
work
Conclusions and future work



MI models can help compensate for the noisiness in timed media
search and retrieval.
MI models can help in building knowledge sources from large text
collections.
We’re still looking for better ways to combine handcrafted semantic
resources with statistical ones.