NLDB10-OntoGain - Intelligent Systems Laboratory


Unsupervised Ontology Acquisition from plain texts:

The OntoGain method

Efthymios Drymonas

Kalliopi Zervanou

Euripides G.M. Petrakis

Intelligent Systems Laboratory http://www.intelligence.tuc.gr

Technical University of Crete (TUC), Chania, Greece

OntoGain

 A platform for unsupervised ontology acquisition from text

 Application independent

 Ontology of multi-word term concepts

 Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts

 Outputs ontology in OWL

 Good results on Medical, Computer science corpora


Why multi-word term concepts?

 Majority of terminological expressions

 Convey classificatory information, expressed as modifiers

 e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”

 Leads to more expressive and compact ontology lexicon
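The modifier-based classification above can be illustrated with a minimal sketch. It assumes the head of an English noun phrase is its rightmost word and that dropping leading modifiers yields a broader term; the function name `broader_terms` is hypothetical and not part of OntoGain.

```python
def broader_terms(term):
    """Return progressively broader terms obtained by dropping leading modifiers.

    Assumption: the head is the rightmost word, so every suffix of the
    word sequence names a broader concept.
    """
    words = term.split()
    return [" ".join(words[i:]) for i in range(1, len(words))]


chain = broader_terms("carotid artery disease")
print(chain)  # ['artery disease', 'disease']
```

A real system would still check that each suffix is itself a valid term before adding it to the lexicon.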


Ontology Learning Steps

 Concept Extraction

 C/NC-value

 Taxonomy Induction

 Clustering, Formal Concept Analysis

 Non-taxonomic Relations

 Association Rules, Probabilistic algorithm


The C/NC-Value method

[Frantzi et al., 2000]

 Identifies multi-word term phrases denoting domain concepts

 Noun phrases are extracted first

((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun

 C-Value: Term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms

 NC-Value: Uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
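A minimal sketch of how a part-of-speech filter of the form ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun could be applied to tagged phrases. The tag names, the one-letter encoding and the two-word minimum are assumptions made for illustration; the slides do not describe the actual implementation.

```python
import re

# Hypothetical one-letter codes for the tags used by the filter.
TAG_CODES = {"adj": "A", "noun": "N", "prep": "P"}

# The linguistic filter rewritten over the one-letter codes.
NP_FILTER = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def is_candidate(tags):
    """True if a POS tag sequence matches the noun-phrase filter.

    Assumption: only phrases of two or more words are kept, since the
    method targets multi-word terms.
    """
    if len(tags) < 2:
        return False
    encoded = "".join(TAG_CODES.get(tag, "x") for tag in tags)
    return NP_FILTER.fullmatch(encoded) is not None


print(is_candidate(["adj", "noun"]))           # e.g. "hierarchical clustering"
print(is_candidate(["noun", "prep", "noun"]))  # e.g. "retrieval of information"
print(is_candidate(["verb", "noun"]))          # rejected: contains a verb
```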


C-Value: Statistical Part

 For candidate term a

 f(a): Total frequency of occurrence of a

 f(b): Frequency of a longer candidate term b that contains a

 T_a: The set of these longer terms; P(T_a): their number

 |a|: The length of the candidate string, in words

$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
$$
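The C-value computation can be sketched as follows, using the standard definition from Frantzi et al.: a term that is not nested scores log2|a| · f(a); a nested term has the mean frequency of the longer terms containing it subtracted first. The frequencies below are invented toy values, and the caller is assumed to supply the list of longer terms.

```python
import math


def c_value(term, freq, longer_terms=()):
    """C-value of `term`.

    freq: dict mapping each candidate term to its corpus frequency.
    longer_terms: candidate terms that contain `term` (the set T_a).
    """
    length_weight = math.log2(len(term.split()))  # |a| counted in words
    if not longer_terms:
        return length_weight * freq[term]
    nested_mean = sum(freq[b] for b in longer_terms) / len(longer_terms)
    return length_weight * (freq[term] - nested_mean)


# Invented toy frequencies, for illustration only.
freq = {
    "artery disease": 10,
    "carotid artery disease": 4,
    "coronary artery disease": 2,
}
nested = c_value(
    "artery disease", freq,
    ["carotid artery disease", "coronary artery disease"],
)
print(nested)  # 1 * (10 - (4 + 2) / 2) = 7.0
print(c_value("carotid artery disease", freq))  # log2(3) * 4
```

Note that a single-word term always scores 0 under this formula, because log2(1) = 0.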

Concept Extraction

C/NC-Value sample results

output term                    c-nc value
web page                       1740.11
information retrieval          1274.14
search engine                  1103.99
machine learning               727.70
computer science               723.82
experimental result            655.125
text mining                    645.57
natural language processing    582.83
world wide web                 557.33
large number                   530.67
artificial intelligence        515.73
relevant document              468.22
similarity measure             464.64
information extraction         443.29
knowledge discovery            435.79


Ontology Learning Steps

Preprocessing

Concept Extraction

Taxonomy Induction

Non-taxonomic Relations


Taxonomy Induction

 Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms

 Two methods in OntoGain

 Agglomerative clustering

 Formal Concept Analysis (FCA)

Agglomerative Clustering

 Proceeds bottom-up: at each step, the most similar clusters are merged

 Initially each term is considered a cluster

 Similarity between all pairs of clusters is computed

 The most similar clusters are merged as long as they share terms with common heads

 Similarity: group average between clusters, Dice-like formula between terms
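The merge loop described above can be sketched as follows. The exact "Dice-like" formula is not given on the slide, so this sketch assumes the plain Dice coefficient over the words of two terms; the stopping threshold and all function names are likewise assumptions.

```python
from itertools import combinations


def dice(t1, t2):
    """Dice coefficient over the word sets of two terms (assumed formula)."""
    w1, w2 = set(t1.split()), set(t2.split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))


def head(term):
    """Head of a term, assumed to be its last word."""
    return term.split()[-1]


def group_average(c1, c2):
    """Mean pairwise term similarity between two clusters."""
    return sum(dice(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def share_head(c1, c2):
    return bool({head(t) for t in c1} & {head(t) for t in c2})


def cluster(terms, threshold=0.3):
    """Bottom-up clustering; merges only clusters that share a head."""
    clusters = [(t,) for t in terms]  # initially each term is a cluster
    merges = []
    while True:
        candidates = [
            (group_average(a, b), a, b)
            for a, b in combinations(clusters, 2)
            if share_head(a, b)
        ]
        if not candidates:
            break
        score, a, b = max(candidates, key=lambda c: c[0])
        if score < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, score))
    return clusters, merges


terms = ["hierarchical method", "agglomerative method",
         "clustering method", "web page"]
clusters, merges = cluster(terms)
print(clusters)
```

The recorded merge order in `merges` is what would be read off as the hierarchy; "web page" stays alone here because it shares no head with the other terms.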


Formal Concept Analysis (FCA)

[Ganter et al., 1999]

 FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)

 Finds common attributes (verbs) between objects and forms object clusters that share common attributes

 Formal concepts are connected with the sub-concept relationship

$$
(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \; (\iff A_1 \supseteq A_2)
$$
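A minimal sketch of how formal concepts and the sub-concept order can be computed from a term-verb context. It uses brute-force closure over all object subsets, which is exponential and suitable only for toy inputs; the context below is a small hypothetical example, not the matrix from the talk.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all formal concepts of a context as (extent, intent) pairs.

    context: dict mapping each object (term) to its set of attributes (verbs).
    """
    objects = set(context)
    all_attrs = set().union(*context.values())
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(sorted(objects), size):
            # Intent: attributes shared by every object in the subset.
            intent = set(all_attrs)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = {o for o in objects if intent <= context[o]}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]


# Hypothetical toy context.
context = {
    "html form": {"print", "search"},
    "web page": {"print", "search"},
    "root node": {"compute", "search"},
    "single cluster": {"compute", "search"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Production FCA tools use incremental algorithms instead of subset enumeration; the brute-force version is chosen here only because it follows the definition directly.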

FCA Example

 Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)

Attributes (columns): submit, test, describe, print, compute, search

Terms (rows): Html form, Hierarchical clustering, Text retrieval, Root node, Single cluster, Web page

[Matrix: an asterisk marks each term-attribute association]

FCA Taxonomy

 Formal concepts

 ({hierarchical clustering, root node, single cluster}, {compute, search})

 ({html form, web page}, {print, search})

 Not all dependencies (c, v) are interesting; only those satisfying a conditional probability threshold t are kept:

$$
P(c \mid v) = \frac{f(c, v)}{f(v)} \ge t
$$
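The filtering of term-verb dependencies by conditional probability, P(c | v) = f(c, v) / f(v), can be sketched as follows. The observation list and the threshold value are invented, and treating the comparison as "greater than or equal to t" is an assumption.

```python
from collections import Counter


def select_dependencies(pairs, t):
    """Keep (concept, verb) pairs whose P(concept | verb) reaches t.

    pairs: list of (concept, verb) observations from the corpus.
    """
    pair_freq = Counter(pairs)
    verb_freq = Counter(verb for _, verb in pairs)
    return {
        (concept, verb)
        for (concept, verb), n in pair_freq.items()
        if n / verb_freq[verb] >= t
    }


# Invented toy observations.
pairs = (
    [("web page", "search")] * 3
    + [("html form", "search")]
    + [("html form", "submit")] * 2
)
kept = select_dependencies(pairs, t=0.5)
print(sorted(kept))
```

The surviving pairs would then form the object-attribute matrix that FCA takes as input.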


Non-Taxonomic Relations extraction phase

Concept Extraction

Taxonomy Induction

Non-Taxonomic Relations


Non-Taxonomic Relations

 Concepts are also characterized by attributes and relations to other concepts in the hierarchy

 Typically expressed by a verb relating a pair of concepts

 Two approaches

 Association rules

 Probabilistic

Association Rules [Agrawal et al., 1993]

 Introduced to predict the purchase behavior of customers

 Extract terms connected by a subject-verb-object relation

 Enhance with general terms from the taxonomy

 Eliminate redundant relations: predictive accuracy < t
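A minimal sketch of rule mining over subject-verb-object triples, using the usual support and confidence measures from association rule mining. The slide's "predictive accuracy" criterion is not defined in detail, so confidence is used here as a stand-in; the triples, counts and thresholds are invented.

```python
from collections import Counter


def mine_rules(triples, min_support, min_confidence):
    """Return domain -> range rules with their (support, confidence).

    triples: list of (subject, verb, object) tuples.
    support = f(subject, object) / N
    confidence = f(subject, object) / f(subject)
    """
    n = len(triples)
    pair_count = Counter((s, o) for s, _, o in triples)
    subject_count = Counter(s for s, _, _ in triples)
    rules = {}
    for (s, o), count in pair_count.items():
        support = count / n
        confidence = count / subject_count[s]
        if support >= min_support and confidence >= min_confidence:
            rules[(s, o)] = (support, confidence)
    return rules


# Invented toy triples.
triples = (
    [("coronary arteriography", "perform by", "physician")] * 3
    + [("coronary arteriography", "need", "surgical treatment")]
    + [("blood transfusion", "need", "surgical treatment")]
)
rules = mine_rules(triples, min_support=0.2, min_confidence=0.5)
for (domain, rng), (support, confidence) in sorted(rules.items()):
    print(domain, "->", rng, support, confidence)
```

Enhancing the transactions with more general terms from the taxonomy, as the slide describes, would amount to adding each concept's ancestors to the triples before counting.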

Association Rules: Example

Domain                          Range                           Label
chiasmal syndrome               pituitary disproportion         cause by
medial collateral ligament      surgical treatment              need
blood transfusion               antibiotic prophylaxis          result
lipid peroxidation              cardiopulmonary bypass          lead to
prostate specific antigen       prostatectomy                   follow
chronic fatigue syndrome        cardiac function                yield
right ventricular infraction    radionuclide ventriculography   analyze by
creatinine clearance            arteriovenous hemofiltration    achieve
cardioplegic solution           superoxide dismutase            give
bacterial translocation         antibiotic prophylaxis          decrease
accurate diagnosis              clinical suspicion              depend
ultrasound examination          clinical suspicion              give
total body oxygen consumption   epidural analgesia              attenuate by
coronary arteriography          physician                       perform by


Probabilistic approach [Cimiano et al., 2006]

 Collect verbal relations from the corpus

 Find the most general relation for each verb, using frequency of occurrence

 Suffer_from(man, head_ache)

 Suffer_from(woman, stomach_ache)

 Suffer_from(patient,ache)

 Select relationships satisfying a conditional probability measure

 Associations > t become accepted
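The generalization step can be sketched by propagating each observed relation to the ancestors of its arguments and keeping the generalized relations whose conditional probability given the verb exceeds t. This is an illustrative simplification, not the exact procedure of Cimiano et al.; the `PARENT` taxonomy and the threshold are hypothetical.

```python
from collections import Counter

# Hypothetical taxonomy fragment: child -> parent.
PARENT = {
    "man": "patient",
    "woman": "patient",
    "head_ache": "ache",
    "stomach_ache": "ache",
}


def ancestors_and_self(concept):
    """The concept followed by its chain of ancestors."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(relations, t):
    """Keep (verb, subject, object) generalizations with P(. | verb) > t."""
    verb_freq = Counter(verb for verb, _, _ in relations)
    generalized = Counter()
    for verb, subj, obj in relations:
        for s in ancestors_and_self(subj):
            for o in ancestors_and_self(obj):
                generalized[(verb, s, o)] += 1
    return {
        rel for rel, n in generalized.items()
        if n / verb_freq[rel[0]] > t
    }


relations = [
    ("suffer_from", "man", "head_ache"),
    ("suffer_from", "woman", "stomach_ache"),
]
accepted = generalize(relations, t=0.5)
print(accepted)  # {('suffer_from', 'patient', 'ache')}
```

Only the generalized relation covers both observations, so it is the only one whose probability (1.0) exceeds the threshold; each specific relation reaches just 0.5.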


Evaluation

 Relevance judgments are provided by humans

 Precision - Recall

 We examined the 200 top-ranked concepts and their respective relations in 500 lines

 Results from the OhsuMed and Computer Science corpora
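For reference, precision and recall against human relevance judgments can be computed as in this minimal sketch. The two term sets are invented and do not reproduce the evaluation data.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human judgments."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example sets.
extracted = {"web page", "search engine", "large number", "text mining"}
relevant = {"web page", "search engine", "text mining",
            "machine learning", "information retrieval"}
precision, recall = precision_recall(extracted, relevant)
print(precision, recall)  # 0.75 0.6
```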


Results

Processing Layer          Method                    Precision   Recall      Precision        Recall
                                                    (OhsuMed)   (OhsuMed)   (Comp. Science)  (Comp. Science)
Concept Extraction        C/NC-Value                89.7%       91.4%       86.7%            89.6%
Taxonomic Relations       Formal Concept Analysis   47.1%       41.6%       44.2%            48.6%
Taxonomic Relations       Hierarchical Clustering   71.2%       67.3%       71.3%            62.7%
Non-Taxonomic Relations   Association Rules         71.8%       67.7%       72.8%            61.7%
Non-Taxonomic Relations   Probabilistic             62.7%       55.9%       61.6%            49.4%


Comparison with Text2Onto

[Cimiano & Völker, 2005]

 Produces huge lists of plain single-word terms, and relations lacking semantic meaning

 Text2Onto cannot work with big texts

 Cannot export results in OWL


Conclusions

 OntoGain

 Multi-word term concepts

 Exports ontology in OWL

 Domain independent

 Results

 C/NC-Value yields good results

 Clustering outperforms FCA

 Association Rules perform better than Verbal Expressions


Future Work

 Explore more methods / combinations

 e.g., clustering, FCA

 Hearst patterns for discovering additional relation types (Part-of)

 Discover attributes and cardinality constraints

 Incorporate term similarity information from WordNet, MeSH

 Resolve term ambiguities


Thank you!

Questions ?


Preprocessing

 Tokenization, POS tagging, Shallow parsing (OpenNLP suite)

 Lemmatization (WordNet Java Library)

 Apply to all steps of OntoGain

 Shallow parsing is used in relations acquisition for the detection of verbal dependencies

 Terms sharing a head tend to be similar

 e.g. hierarchical method and agglomerative method are both methods

 Nested terms are related to each other

 e.g. agglomerative clustering method and clustering method should be associated

