
Concept Hierarchy
Induction
by Philipp Cimiano
Objective

Structure information into categories

Provide a level of generalization to define
relationships between data

Application: Backbone of any ontology
Overview
- Different approaches to acquiring concept hierarchies from text corpora
- Various clustering techniques
- Evaluation
- Related Work
- Conclusion

Machine Readable Dictionaries

Entries: ‘a tiger is a mammal’, or ‘mammals
such as tigers, lions or elephants’.

Exploit the regularity of dictionary entries.

The head of the first NP of the definition is taken as the hypernym.
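As a rough illustration (not from the slides), the head-of-first-NP heuristic can be sketched with a naive tokenizer. The boundary-word list and the glosses below are my own toy assumptions; a real system would use a parser over actual dictionary entries.

```python
def gloss_hypernym(gloss):
    """Head of the first NP of a dictionary gloss, used as hypernym
    candidate.  The first NP is assumed to end at the first preposition,
    relative pronoun, or comma; English NPs are right-headed, so the
    last token before that boundary is taken as the head."""
    boundaries = {"of", "that", "which", "who", "with", "in", "for", ","}
    np = []
    for tok in gloss.lower().replace(",", " , ").split():
        if tok in boundaries:
            break
        np.append(tok)
    return np[-1] if np else None

gloss_hypernym("a large feline mammal that hunts at night")  # 'mammal'
gloss_hypernym("the part of a flower")  # 'part' -- the invalid-relation case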
Example
Exception
is-a(corolla, part): NOT VALID
is-a(republican, member): NOT VALID
is-a(corolla, flower): NOT VALID
is-a(republican, political party): NOT VALID
Alshawi's solution
Results using MRDs

Dolan et al. - 87% of the hypernym
relations extracted are correct

Calzolari cites a precision of > 90%

Alshawi - precision of 77%
Strengths And Weaknesses

Correct, explicit knowledge

Robust basis for ontology learning

Weakness: dictionaries are domain-independent, so domain-specific terms are poorly covered
Lexico-Syntactic patterns

Task: automatically learning hyponym
relations from the corpora.
'Such injuries as bruises, wounds and broken bones'
hyponym (bruise, injury)
hyponym (wound, injury)
hyponym (broken bone, injury)
Hearst patterns
'Such injuries as bruises, wounds and broken bones'
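The pattern above can be sketched as a regular expression. This covers only the "such NP as NP, ..." variant; the crude NP approximation (runs of word characters) and the helper name `hearst_matches` are illustrative assumptions, not the original implementation:

```python
import re

# "such NP as NP, NP ... and/or NP" -> hyponym(NP_i, NP_0).  NPs are
# crudely approximated as runs of words; a real system would chunk NPs.
PATTERN = re.compile(
    r"such (?P<hyper>\w+(?: \w+)*?) as "
    r"(?P<hypos>\w+(?: \w+)*(?:, \w+(?: \w+)*)*(?:,? (?:and|or) \w+(?: \w+)*)?)")

def hearst_matches(text):
    relations = []
    for m in PATTERN.finditer(text.lower()):
        hyper = m.group("hyper")
        for hypo in re.split(r", | and | or ", m.group("hypos")):
            relations.append((hypo, hyper))
    return relations

hearst_matches("Such injuries as bruises, wounds and broken bones.")
# [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]
```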
Requirements

Occur frequently in many text genres.

Accurately indicate the relation of interest.

Be recognizable with little or no pre-encoded knowledge
Strengths And Weaknesses

Patterns are identified easily and are accurate
Weakness:
- Patterns appear rarely
- Many is-a relations do not appear in a Hearst-style pattern
Distributional Similarity

'you shall know a word by the company it
keeps’ [Firth, 1957].

Semantic similarity of words is reflected in the similarity of the contexts they occur in.
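Firth's slogan can be sketched directly: collect a bag-of-words context vector per word and compare vectors with the cosine. The window size, corpus sentences, and helper names below are toy assumptions for illustration:

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Bag-of-words context vector per word: 'the company it keeps'."""
    vecs = {}
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = context_vectors([
    "the hotel offers comfortable rooms",
    "the inn offers comfortable rooms",
    "the new tax applies today",
])
# 'hotel' and 'inn' share contexts, so they come out more similar
# to each other than 'hotel' and 'tax'
```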
Using distributional similarity
Strengths And Weaknesses

Produces a reasonable concept hierarchy.
Weakness:
- Cluster tree lacks a clear and formal interpretation
- Does not provide any intensional description of concepts
- Similarities may be accidental (sparse data)
Formal Concept Analysis (FCA)
FCA output
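To make the FCA output concrete, here is a minimal sketch of computing all formal concepts from a toy formal context (objects are terms, attributes are parsed verb contexts, in the spirit of FCA-based hierarchy induction). The context data and function names are invented for illustration; the naive enumeration is only feasible for toy inputs:

```python
from itertools import combinations

# Invented toy formal context: terms x verb contexts
context = {
    "hotel":     {"bookable", "rentable"},
    "apartment": {"bookable", "rentable", "inhabitable"},
    "car":       {"bookable", "rentable", "driveable"},
    "bike":      {"rentable", "driveable"},
}

def formal_concepts(ctx):
    """All (extent, intent) pairs: extent = objects sharing the intent,
    intent = attributes common to the extent.  Candidate intents are the
    intersections of object intents (plus the full attribute set)."""
    all_attrs = frozenset(set().union(*ctx.values()))
    intents = {all_attrs}
    for r in range(1, len(ctx) + 1):
        for objs in combinations(ctx, r):
            intents.add(frozenset(set.intersection(*(set(ctx[o]) for o in objs))))
    def extent(intent):
        return frozenset(o for o, attrs in ctx.items() if intent <= attrs)
    return {(extent(i), i) for i in intents}

lattice = formal_concepts(context)
```

Reading the lattice as a hierarchy: inclusion of extents gives the subconcept relation (e.g. the concept with extent {car} lies below the one with extent {car, bike}), and the intents supply the intensional description FCA is credited with.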
Similarity measures
Smoothing
Evaluation

Semantic cotopy (SC).

Taxonomy overlap (TO)
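A simplified reading of these measures can be sketched as follows: the semantic cotopy of a concept is the set of its super- and subconcepts, and taxonomic overlap averages the cotopy overlap over shared concepts. This is my own simplification (it does not ignore leaf nodes, and the hierarchies are toy parent maps), so the exact evaluation formulas differ in details:

```python
def ancestors(c, parent):
    out = set()
    while c in parent:           # parent maps concept -> superconcept
        c = parent[c]
        out.add(c)
    return out

def cotopy(c, parent):
    """Semantic cotopy: c together with all its super- and subconcepts."""
    subs = {x for x in parent if c in ancestors(x, parent)}
    return {c} | ancestors(c, parent) | subs

def taxonomic_overlap(p1, p2):
    """Average cotopy overlap over the concepts shared by both hierarchies."""
    concepts = (set(p1) | set(p1.values())) & (set(p2) | set(p2.values()))
    scores = [len(cotopy(c, p1) & cotopy(c, p2)) / len(cotopy(c, p1) | cotopy(c, p2))
              for c in concepts]
    return sum(scores) / len(scores)

# invented toy hierarchies: misplacing 'inn' lowers the overlap
reference = {"hotel": "accommodation", "inn": "accommodation", "accommodation": "root"}
learned = {"hotel": "accommodation", "inn": "root", "accommodation": "root"}
```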
Evaluation Measure
(figure: example taxonomy comparisons illustrating 100% precision and recall, low recall, and low precision)
Results
Strengths And Weaknesses
FCA generates formal concepts
- Provides an intensional description
Weakness:
- Size of the lattice can get exponential in the size of the context
- Spurious clusters
- Finding appropriate labels for the clusters
Problems with Unsupervised
Approaches to Clustering
Data sparseness leads to spurious syntactic similarities
Produced clusters can't be appropriately labeled
Guided Clustering

Hypernyms directly used to guide the clustering
- WordNet
- Hearst patterns
Agglomerative clustering
Similarity Computation
Ten most similar terms of the tourism reference taxonomy
The Hypernym Oracle

Three sources:
- WordNet
- Hearst patterns matched in a corpus
- Hearst patterns matched in the World Wide Web
Record hypernyms and the amount of evidence found in support of each hypernym.
WordNet
Collect hypernyms found in any dominating synset containing the term t
Include the number of times the hypernym appears in a dominating synset
Hearst Patterns (Corpus)

Record number of isa-relations found
between two terms
Hearst Patterns (WWW)

Download 100 Google abstracts for each
concept and clue:
Evidence
Total Evidence for Hypernyms:
•time: 4
•vacation: 2
•period: 2
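Aggregating the oracle amounts to summing per-source evidence counts. A minimal sketch with `collections.Counter`; the per-source numbers are invented so that the totals match the slide (time: 4, vacation: 2, period: 2):

```python
from collections import Counter

# Hypothetical per-source hypernym evidence for one term; the three
# sources mirror the oracle (WordNet, corpus Hearst patterns, web
# Hearst patterns).  Counts are illustrative only.
wordnet = Counter({"time": 2, "period": 2})
corpus = Counter({"vacation": 1})
web = Counter({"time": 2, "vacation": 1})

total = wordnet + corpus + web   # Counter addition sums evidence per hypernym
```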
Clustering Algorithm
1. Input a list of terms.
2. Calculate the similarity between each pair of terms and sort from highest to lowest.
3. For each potential pair to be clustered, consult the oracle.
Consulting the Oracle case 1

If term 1 is a hypernym of term 2 or vice versa:
- Create the appropriate subconcept relationship.
Consulting the Oracle case 2
Find the common hypernym h of both terms with the greatest evidence.
If one term has already been classified under a concept t', distinguish the cases:
- t' = h
- h is a hypernym of t'
- t' is a hypernym of h
Consulting the Oracle case 3

Neither term has been classified:
- Each term becomes a subconcept of the common hypernym.
Consulting the Oracle case 4

The terms do not share a common hypernym:
- Set aside the terms for further processing.
r-matches

For all unprocessed terms, check for r-matches (e.g. 'credit card' matches 'international credit card')
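The r-match test can be sketched as a token-suffix check; this is my reading of the slide's example, with the function name as an assumption:

```python
def r_match(t1, t2):
    """t1 r-matches t2 if the tokens of t2 end in the tokens of t1,
    i.e. t1 is the head of the longer multi-word term (a reading of
    the 'credit card' / 'international credit card' example)."""
    w1, w2 = t1.lower().split(), t2.lower().split()
    return len(w2) > len(w1) and w2[-len(w1):] == w1

r_match("credit card", "international credit card")  # True
r_match("card", "cardboard box")                     # False: no token match
```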
Further Processing
If either term in a pair is already classified as t', the other term is classified under t' as well.
Otherwise place both terms under the hypernym of either term with the most evidence.
Any unclassified terms are added under the root concept.
Evaluation

Taxonomic overlap (TO)
- ignores leaf nodes
Sibling overlap (SO)
- measures quality of clusters
Evaluation

Tourism domain:
- Lonely Planet
- Mecklenburg
Finance domain:
- Reuters-21578
Tourism Results—TO
Finance Results—TO
Tourism Results—SO
Finance Results—SO
Human Evaluation
Future Work

Take word sense into consideration for the
WordNet source.
Summary

Hypernym-guided agglomerative clustering works well.
- Better than the "gold standard"
- Good human evaluation
- Provides labels for clusters
- No spurious similarities
- Faster than agglomerative clustering

Learning from Heterogeneous
Sources of Evidence
Many ways to learn concept hierarchies
Can we combine the different paradigms?
- Any manual attempt to combine strategies would be ad hoc
- Use supervised learning to combine the techniques
Determining relationships with
machine learning

Example: Determine if a pair of words has
an “isa” relationship
Feature 1:
Matching patterns in a corpus
Given two terms t1 and t2, we record how many times a Hearst pattern indicating an isa-relation between t1 and t2 is matched in the corpus.
Normalize by the maximum number of Hearst-pattern matches found for t1.
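The normalization can be sketched in a few lines; the count table and function name are invented for illustration:

```python
def hearst_feature(t1, t2, counts):
    """counts[(x, y)] = number of corpus matches of a Hearst pattern
    indicating isa(x, y).  Normalized by the maximum count found for
    t1 with any candidate hypernym, as described on the slide."""
    m = max((c for (x, _), c in counts.items() if x == t1), default=0)
    return counts.get((t1, t2), 0) / m if m else 0.0

# invented counts for illustration
counts = {("bruise", "injury"): 6, ("bruise", "symptom"): 2, ("wound", "injury"): 3}
hearst_feature("bruise", "injury", counts)   # 1.0
hearst_feature("bruise", "symptom", counts)  # 2/6
```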

Example

This provided the best F-measure with a
single-feature classifier
Feature 2:
Matching patterns on the web

Use the Google API to count the matches
of a certain expression on the Web
Feature 3:
Downloading webpages




- Allows for matching expressions with a more complex linguistic structure
- Assign functions to each of the Hearst patterns to be matched
- Use these "clues" to decide which pages to download
- Download 100 abstracts matching the query "such as conferences"
Example
Feature 4:
WordNet – All senses
Is there a hypernym relationship between t1 and t2?
There can be more than one path from the synsets of t1 to the synsets of t2

Feature 5:
WordNet – First sense

Only consider the first sense of t1
Feature 6:
"Head" heuristic
If t1 r-matches t2, we derive the relation isa(t2, t1)
- e.g. t1 = "conference", t2 = "international conference" gives isa_head("international conference", "conference")
Feature 7:
Corpus-based subsumption

t1 is a subclass of t2 if all the syntactic
contexts in which t1 appears are also
shared by t2
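The strict reading of this feature is a set-inclusion test over syntactic contexts; the context data is invented, and practical systems relax the inclusion to a proportion:

```python
def corpus_subsumes(t1, t2, contexts):
    """t1 isa t2 if every syntactic context of t1 is also a context of
    t2 (the strict reading of the slide)."""
    return contexts[t1] <= contexts[t2]

# invented dependency contexts
contexts = {
    "apple": {"obj-of:eat", "obj-of:peel"},
    "fruit": {"obj-of:eat", "obj-of:peel", "obj-of:dry"},
}
corpus_subsumes("apple", "fruit", contexts)  # True
corpus_subsumes("fruit", "apple", contexts)  # False
```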
Feature 8:
Document-based subsumption

t1 is a subclass of term t2 if t2 appears in all documents in which t1 appears
Feature value: (# of pages where t1 and t2 occur) / (# of pages where t1 occurs)
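The ratio above can be computed directly from document-occurrence sets; the document IDs are invented, and the 0.8 threshold is an illustrative relaxation of the slide's strict "all documents" condition (which is threshold = 1.0):

```python
def doc_subsumption(t1, t2, docs_with, threshold=0.8):
    """Fraction of documents containing t1 that also contain t2; t1 is
    taken as a subclass of t2 when the fraction clears the threshold."""
    d1 = docs_with[t1]
    if not d1:
        return False
    return len(d1 & docs_with[t2]) / len(d1) >= threshold

docs_with = {"tiger": {1, 2, 3}, "mammal": {1, 2, 3, 7, 9}}
doc_subsumption("tiger", "mammal", docs_with)  # True: 3/3
doc_subsumption("mammal", "tiger", docs_with)  # False: 3/5
```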
Example
Naïve Threshold Classifier
Used as a baseline
- Classify an example as positive if the value of a given feature is above some threshold t
- For each feature, the threshold has been varied from 0 to 1 in steps of 0.01
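The threshold sweep can be sketched as follows; the feature values and gold labels are toy data, and ties in F-measure resolve to the largest threshold here:

```python
def best_threshold(values, labels):
    """Sweep t from 0 to 1 in steps of 0.01, classify value > t as
    positive, and return (best F-measure, threshold)."""
    best = (0.0, 0.0)
    for step in range(101):
        t = step / 100
        tp = sum(1 for v, y in zip(values, labels) if v > t and y)
        fp = sum(1 for v, y in zip(values, labels) if v > t and not y)
        fn = sum(1 for v, y in zip(values, labels) if v <= t and y)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f, t))
    return best

# invented feature values for candidate pairs, with gold isa labels
f, t = best_threshold([0.9, 0.7, 0.4, 0.2], [True, True, False, False])
```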

Baseline Measures
Evaluation

Classifiers:
- Naïve Bayes
- Decision Tree
- Perceptron
- Multi-layer perceptron
Evaluation Strategies

Undersampling
- Remove a number of majority-class examples (non-isa examples)
Oversampling
- Add additional examples to the minority class
Varying the classification threshold
- Try different threshold values other than 0.5
Introducing a cost matrix
- Different penalties for different types of misclassification
One-class SVMs
- Only considers positive examples
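Undersampling and oversampling can be sketched with plain random resampling; this assumes non-isa examples are the majority class, as in the dataset described here, and the function names are my own:

```python
import random

def undersample(examples, labels, seed=0):
    """Remove majority-class (non-isa) examples until classes balance."""
    rng = random.Random(seed)
    pos = [e for e, y in zip(examples, labels) if y]
    neg = rng.sample([e for e, y in zip(examples, labels) if not y], len(pos))
    return pos + neg, [True] * len(pos) + [False] * len(neg)

def oversample(examples, labels, seed=0):
    """Duplicate minority-class (isa) examples until classes balance."""
    rng = random.Random(seed)
    pos = [e for e, y in zip(examples, labels) if y]
    neg = [e for e, y in zip(examples, labels) if not y]
    pos = pos + [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + neg, [True] * len(pos) + [False] * len(neg)
```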
Results
Results (cont.)
Discussion

The best results were achieved with the one-class SVM (F = 32.96%)
- More than 10 points above the baseline classifier average (F = 21.28%) and maximum (F = 21%) strategies
- More than 14 points better than the best single-feature classifier (F = 18.84%), which uses the isa_www feature
Second-best results were obtained with a multi-layer perceptron using oversampling or undersampling
Discussion


Gained insight by finding which features were most used by the classifiers
Used this information to modify the features and rerun the experiments
Summary





Using different approaches is useful
Machine learning approaches outperform naïve averaging
The unbalanced character of the dataset poses a problem
SVMs (which are not affected by the imbalance) produce the best results
This approach can show which features are the most reliable predictors
Related Work

Taxonomy Construction
- Lexico-syntactic patterns
- Clustering
- Linguistic approaches
Taxonomy Refinement
Taxonomy Extension
Lexico-syntactic patterns







Hearst
Iwańska et al. – added extra patterns
Poesio et al. – anaphora resolution
Ahmad et al. – applying patterns to specific domains
Etzioni et al. – patterns matched on the WWW
Cederberg and Widdows – precision improved with Latent Semantic Analysis
Others work on learning patterns automatically
Clustering

Hindle
- group nouns semantically
- derive verb-subject and verb-object dependencies from a 6 million word sample of Associated Press news stories
Pereira et al.
- top-down soft clustering algorithm with deterministic annealing
- words can appear in different clusters (multiple meanings of words)
Caraballo
- bottom-up clustering approach to build a hierarchy of nouns
- uses conjunctive and appositive constructions for nouns derived from the Wall Street Journal corpus
Clustering (cont.)
The ASIUM System
The Mo'K Workbench
Grefenstette
Gasperin et al.
Reinberger et al.
Lin et al.
CobWeb
Crouch et al.
Haav
Curran et al.
Terascale Knowledge Acquisition
Linguistic Approaches

Linguistic analysis exploited more directly rather than just for feature extraction
- OntoLT – uses a shallow parser to label parts of speech and grammatical relations (e.g. HeadNounToClass-ModToSubClass, which maps a common noun to a concept or class)
- OntoLearn – analyzes multi-word terms compositionally with respect to an existing semantic resource (WordNet)
- Morin et al. – tackle the problem of projecting semantic relations between single terms to multiple terms (e.g. project the isa-relation between apple and fruit to an isa-relation between apple juice and fruit juice)
Linguistic Approaches


Sanchez and Moreno – download the first n hits for a search word and process the neighborhood linguistically to determine candidate modifiers for the search term
Sabou – induces concept hierarchies for the purpose of modeling web services (applies the methods not to full text but to the Java documentation of web services)
Taxonomy Refinement
Hearst and Schütze
Widdows
Maedche, Pekar and Staab
Alfonseca et al.

Taxonomy Extension
Agirre et al.
 Faatz and Steinmetz
 Turney

Conclusions

Compared different hierarchical clustering approaches with respect to:
- effectiveness
- speed
- traceability
Set-theoretic approaches, such as FCA, can outperform similarity-based approaches.
Conclusions
Presented an algorithm for clustering guided by a hypernym oracle.
More efficient than agglomerative clustering.

Conclusions
Used machine learning techniques to effectively combine different approaches for learning taxonomic relations from text.
A learned model indeed outperforms all single approaches.

Open Issues






Which similarity or weighting measure should be chosen?
Which features should be considered to represent a certain term?
Can features be aggregated to represent a term at a more abstract level?
How should we model polysemy of terms?
Can we automatically induce lexico-syntactic patterns (unsupervised!)?
What other approaches are there for combining different paradigms, and how can we compare these?
Questions