Unsupervised Ontology Acquisition from plain texts:
The OntoGain method
Efthymios Drymonas
Kalliopi Zervanou
Euripides G.M. Petrakis
Intelligent Systems Laboratory http://www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Greece
A platform for unsupervised ontology acquisition from text
Application independent
Ontology of multi-word term concepts
Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts
Outputs ontology in OWL
Good results on Medical, Computer science corpora
Majority of terminological expressions
Convey classificatory information, expressed as modifiers
e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”
Leads to more expressive and compact ontology lexicon
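The modifier structure above can be illustrated with a minimal sketch. It assumes that the head of an English multi-word term is its last word and that each broader term is obtained by dropping the leftmost modifier; `broader_chain` is a hypothetical helper name, not part of OntoGain.

```python
def broader_chain(term: str) -> list[str]:
    """Return the term followed by its successively broader suffixes."""
    words = term.split()
    # Dropping modifiers from the left keeps the head (last word) intact.
    return [" ".join(words[i:]) for i in range(len(words))]


chain = broader_chain("carotid artery disease")
print(" -> ".join(chain))
```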
Concept Extraction
C/NC-value
Taxonomy Induction
Clustering, Formal Concept Analysis
Non-taxonomic Relations
Association Rules, Probabilistic algorithm
[Frantzi et al., 2000]
Identifies multi-word term phrases denoting domain concepts
Noun phrases are extracted first
((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
C-Value : Term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms
NC-Value : Uses context information
(valid terms tend to appear in specific context and co-occur with other terms)
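The noun-phrase extraction step can be sketched as a regular expression over part-of-speech tags. This is an illustration under stated assumptions: the coarse tag names (`adj`, `noun`, `prep`), the one-letter encoding, and the restriction to candidates of at least two words are choices made here, and only maximal matches are returned (nested candidates are not enumerated).

```python
import re

# Assumed coarse tag set; a real tagger's tags would be mapped onto these.
TAG_CODE = {"adj": "A", "noun": "N", "prep": "P"}

# Encodes the filter ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
# with one letter per token, so match offsets are token offsets.
NP_FILTER = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def candidate_terms(tagged):
    """Return multi-word candidate terms from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    terms = []
    for match in NP_FILTER.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        if len(words) >= 2:  # assumption: keep only multi-word candidates
            terms.append(" ".join(words))
    return terms


sentence = [
    ("the", "other"), ("carotid", "noun"), ("artery", "noun"),
    ("disease", "noun"), ("is", "other"), ("a", "other"),
    ("common", "adj"), ("condition", "noun"),
]
print(candidate_terms(sentence))
```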
For candidate term a
f(a): Total frequency of occurrence
f(b): Frequency of a longer candidate term b in which a appears nested
T_a: the set of these longer terms
P(T_a): number of these longer terms
|a|: The length of the candidate string
\[
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
\]
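A minimal sketch of the C-value computation, assuming |a| is counted in words and that the frequencies and the set of longer terms containing a have already been collected. The function name, argument layout, and the toy frequencies are illustrative only.

```python
import math


def c_value(term, freq, longer_terms):
    """C-value of `term`.

    freq: dict mapping each candidate term to its frequency f(.)
    longer_terms: candidate terms that contain `term` (the set T_a)
    """
    length_weight = math.log2(len(term.split()))
    if not longer_terms:  # a is not nested
        return length_weight * freq[term]
    # Subtract the mean frequency of the longer terms containing a.
    nested_mean = sum(freq[b] for b in longer_terms) / len(longer_terms)
    return length_weight * (freq[term] - nested_mean)


# Toy frequencies, invented for illustration.
freq = {
    "artery disease": 10,
    "carotid artery disease": 4,
    "coronary artery disease": 2,
}
nested = c_value("artery disease", freq,
                 ["carotid artery disease", "coronary artery disease"])
print(nested)
```

With these toy numbers, "artery disease" has |a| = 2, so the length weight is 1 and the score is 10 minus the mean of 4 and 2.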
Concept Extraction
output term                    c-nc value
web page                       1740.11
information retrieval          1274.14
search engine                  1103.99
machine learning               727.70
computer science               723.82
experimental result            655.125
text mining                    645.57
natural language processing    582.83
world wide web                 557.33
large number                   530.67
artificial intelligence        515.73
relevant document              468.22
similarity measure             464.64
information extraction         443.29
knowledge discovery            435.79
Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms
Two methods in OntoGain
Agglomerative clustering
Formal Concept Analysis (FCA)
Proceeds bottom-up: at each step, the most similar clusters are merged
Initially each term is considered a cluster
Similarity between all pairs of clusters is computed
The most similar clusters are merged as long as they share terms with common heads
Group-average similarity for clusters, Dice-like formula for terms
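The clustering loop above can be sketched as follows. This is a simplified illustration, not the OntoGain implementation: a plain Dice coefficient over word sets stands in for the Dice-like term similarity, the head of a term is assumed to be its last word, and merging continues until no two clusters share a head.

```python
from itertools import combinations


def dice(t1, t2):
    """Plain Dice coefficient over the word sets of two terms."""
    w1, w2 = set(t1.split()), set(t2.split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))


def head(term):
    return term.split()[-1]  # assumption: the head is the last word


def group_average(c1, c2):
    """Mean pairwise term similarity between two clusters."""
    return sum(dice(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def share_head(c1, c2):
    return bool({head(t) for t in c1} & {head(t) for t in c2})


def cluster(terms):
    clusters = [[t] for t in terms]  # initially each term is a cluster
    merges = []                      # merge history, bottom-up
    while True:
        candidates = [
            (group_average(a, b), i, j)
            for (i, a), (j, b) in combinations(enumerate(clusters), 2)
            if share_head(a, b)
        ]
        if not candidates:
            return clusters, merges
        _, i, j = max(candidates)    # most similar pair
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)


clusters, merges = cluster([
    "hierarchical method", "agglomerative method",
    "clustering method", "web page",
])
print(clusters)
```

The merge history records which clusters were joined at each step, which is the information a taxonomy builder would need to derive broader and narrower terms.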
[Ganter et al., 1999]
FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)
Finds common attributes (verbs) between objects and forms object clusters that share common attributes
Formal concepts are connected with the sub-concept relationship
\[
(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \ (\iff A_1 \supseteq A_2)
\]
Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)
Attributes (columns): submit, test, describe, print, compute, search
Objects (rows): html form, hierarchical clustering, text retrieval, root node, single cluster, web page
[Table: term-by-verb incidence matrix; cell marks not recoverable]
Formal concepts
({hierarchical clustering, root node, single cluster}, {compute, search})
({html form, web page}, {print, search})
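A minimal sketch of how formal concepts can be derived from such a matrix. The toy context below is an assumption: it contains only the term-verb associations implied by the two formal concepts listed above, not the full matrix. Concepts are enumerated by brute force over attribute subsets, which is adequate only for very small contexts.

```python
from itertools import combinations

# Toy context (assumed): term -> verbs it co-occurs with.
context = {
    "hierarchical clustering": {"compute", "search"},
    "root node": {"compute", "search"},
    "single cluster": {"compute", "search"},
    "html form": {"print", "search"},
    "web page": {"print", "search"},
}
ALL_ATTRS = set().union(*context.values())


def extent(attrs):
    """Objects that have every attribute in `attrs`."""
    return frozenset(o for o, a in context.items() if attrs <= a)


def intent(objs):
    """Attributes shared by every object in `objs`."""
    return frozenset(a for a in ALL_ATTRS if all(a in context[o] for o in objs))


def formal_concepts():
    concepts = set()
    for r in range(len(ALL_ATTRS) + 1):
        for attrs in combinations(sorted(ALL_ATTRS), r):
            objs = extent(set(attrs))
            concepts.add((objs, intent(objs)))  # closed (extent, intent) pair
    return concepts


def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]


concepts = formal_concepts()
for objs, attrs in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(objs), sorted(attrs))
```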
Not all dependencies (c, v) are interesting
\[
P(c \mid v) = \frac{f(c, v)}{f(v)} > t
\]
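The filtering of concept-verb dependencies by conditional probability can be sketched as below. The strict comparison against t, the input format (one (concept, verb) pair per observed dependency), and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def select_dependencies(pairs, t):
    """Keep (concept, verb) pairs whose P(c | v) = f(c, v) / f(v) exceeds t."""
    pair_freq = Counter(pairs)
    verb_freq = Counter(verb for _, verb in pairs)
    selected = {}
    for (concept, verb), n in pair_freq.items():
        p = n / verb_freq[verb]
        if p > t:
            selected[(concept, verb)] = p
    return selected


# Toy observations, invented for illustration.
pairs = (
    [("web page", "print")] * 3
    + [("html form", "print")]
    + [("root node", "compute")] * 2
)
kept = select_dependencies(pairs, t=0.3)
print(kept)
```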
Concepts are also characterized by attributes and relations to other concepts in the hierarchy
Typically expressed by a verb relating a pair of concepts
Two approaches
Association rules
Probabilistic
Introduced to predict the purchase behavior of customers
Extract terms connected with some relation subject-verb-object
Enhance with general terms from the taxonomy
Eliminate redundant relations: predictive accuracy < t
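The pruning step can be sketched with the usual support and confidence measures from association rule mining. This sketch assumes that "predictive accuracy" corresponds to rule confidence and covers only the pruning step; enrichment with general terms from the taxonomy is omitted. The triples are toy data.

```python
from collections import Counter


def mine_rules(triples, min_confidence):
    """Score rules subject => object from (subject, verb, object) triples."""
    pair_count = Counter((s, o) for s, _, o in triples)
    subject_count = Counter(s for s, _, _ in triples)
    total = len(triples)
    rules = []
    for (s, o), n in pair_count.items():
        support = n / total
        confidence = n / subject_count[s]
        if confidence >= min_confidence:  # drop rules with confidence < t
            rules.append((s, o, support, confidence))
    return rules


# Toy triples, invented for illustration.
triples = [
    ("patient", "suffer_from", "ache"),
    ("patient", "suffer_from", "ache"),
    ("patient", "receive", "treatment"),
    ("physician", "perform", "arteriography"),
]
rules = mine_rules(triples, min_confidence=0.5)
for rule in rules:
    print(rule)
```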
Domain                          Range                          Label
chiasmal syndrome               pituitary disproportion        cause by
medial collateral ligament      surgical treatment             need
blood transfusion               antibiotic prophylaxis         result
lipid peroxidation              cardiopulmonary bypass         lead to
prostate specific antigen       prostatectomy                  follow
chronic fatigue syndrome        cardiac function               yield
right ventricular infraction    radionuclide ventriculography  analyze by
creatinine clearance            arteriovenous hemofiltration   achieve
cardioplegic solution           superoxide dismutase           give
bacterial translocation         antibiotic prophylaxis         decrease
accurate diagnosis              clinical suspicion             depend
ultrasound examination          clinical suspicion             give
total body oxygen consumption   epidural analgesia             attenuate by
coronary arteriography          physician                      perform by
Collect verbal relations from the corpus
Find the most general relation with respect to the verb, using frequency of occurrence
Suffer_from(man, head_ache)
Suffer_from(woman, stomach_ache)
Suffer_from(patient,ache)
Select relationships satisfying a conditional probability measure
Associations > t become accepted
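One way to read the steps above is sketched below. It is an interpretation, not the documented OntoGain algorithm: every observed argument pair is generalized through a taxonomy, the generalized pairs are counted, and a pair is accepted when its relative frequency for the verb exceeds t. The `PARENT` taxonomy is hypothetical and chosen to reproduce the Suffer_from example.

```python
from collections import Counter

# Hypothetical taxonomy: child -> parent.
PARENT = {
    "man": "patient", "woman": "patient",
    "head_ache": "ache", "stomach_ache": "ache",
}


def ancestors_and_self(concept):
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(relations, verb, t):
    """Return generalized (subject, object) pairs with relative frequency > t."""
    observed = [(s, o) for v, s, o in relations if v == verb]
    counts = Counter()
    for s, o in observed:
        # Each observation also counts for every generalization of its arguments.
        for gs in ancestors_and_self(s):
            for go in ancestors_and_self(o):
                counts[(gs, go)] += 1
    return {
        pair: n / len(observed)
        for pair, n in counts.items()
        if n / len(observed) > t
    }


relations = [
    ("suffer_from", "man", "head_ache"),
    ("suffer_from", "woman", "stomach_ache"),
]
accepted = generalize(relations, "suffer_from", t=0.5)
print(accepted)
```

In this toy case only the pair (patient, ache) covers both observations, so it is the only one whose score exceeds 0.5.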
Relevance judgments are provided by humans
Precision - Recall
We examined the 200 top-ranked concepts and their respective relations in 500 lines
Results from the OhsuMed and Computer Science corpora
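Precision and recall against human relevance judgments can be computed as in the following sketch. The item names are placeholders and the function is illustrative, not the evaluation script used for the reported results.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human-judged relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Placeholder items: 4 extracted, 5 judged relevant, 3 in common.
extracted = ["t1", "t2", "t3", "t4"]
relevant = ["t1", "t2", "t3", "t5", "t6"]
precision, recall = precision_recall(extracted, relevant)
print(precision, recall)
```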
Processing Layer         Method                    Precision (OhsuMed)  Recall (OhsuMed)  Precision (Comp. Science)  Recall (Comp. Science)
Concept Extraction       C/NC-Value                89.7%                91.4%             86.7%                      89.6%
Taxonomic Relations      Formal Concept Analysis   47.1%                41.6%             44.2%                      48.6%
Taxonomic Relations      Hierarchical Clustering   71.2%                67.3%             71.3%                      62.7%
Non-Taxonomic Relations  Association Rules         71.8%                67.7%             72.8%                      61.7%
Non-Taxonomic Relations  Probabilistic             62.7%                55.9%             61.6%                      49.4%
[Cimiano & Volker, 2005]
Huge lists of plain single-word terms, and relations lacking semantic meaning
Text2Onto cannot work with large texts
Cannot export results in OWL
OntoGain
Multi-word term concepts
Exports ontology in OWL
Domain independent
Results
C/NC-Value yields good results
Clustering outperforms FCA
Association Rules perform better than Verbal Expressions
Explore more methods / combinations
e.g., clustering, FCA
Hearst patterns for discovering additional relation types (Part-of)
Discover attributes and cardinality constraints
Incorporate term similarity information from WordNet, MeSH
Resolve term ambiguities
Questions?
Tokenization, POS tagging, Shallow parsing (OpenNLP suite)
Lemmatization (WordNet Java Library)
Applied to all steps of OntoGain
Shallow parsing is used in relations acquisition for the detection of verbal dependencies
Terms sharing a head tend to be similar
e.g. hierarchical method and agglomerative method are both methods
Nested terms are related to each other
(e.g. agglomerative clustering method and clustering method should be associated)