From Enumerative Structures in Texts towards Hierarchical Structures in Ontologies
Nathalie Aussenac-Gilles
Mouna KAMEL – Bernard ROTHENBURGER
IC3 team at IRIT, Toulouse
ILIKS 2011, Aix-en-Provence
Semantic relations
The remaining elite American companies are Allen Edmonds and Alden Shoe Company.
High-heeled footwear is footwear that raises the heels, …
Slingbacks are shoes which are secured by a strap behind the heel, …
Court shoes, known in the US as pumps, are typically high-heeled, …
Platform shoe: shoe with very thick soles and heels
24/05/2011
ILIKS 2011 - Hierarchical structures
Semantic relations
Variants include kitten heels (typically 1½-2 inches high), stiletto heels (with a very narrow heel post)
and wedge heels (with a wedge-shaped sole rather than a heel post).
Ballet flats, known in the UK as ballerinas, ballet pumps or skimmers, are shoes with a very low heel …
What's new?
Going beyond the sentence
Using typo-dispositional clues
Logical structure of a text, Textual Architecture Model …
Typology of Enumerative Structures (ES)
Discourse analysis (Rhetorical Structure Theory, Segmented Discourse Representation Theory, …)
Vertical and paradigmatic ES
Translation Enumerative Structure > Hierarchical structure
Evaluation
Discourse Structure
The key U.S. and foreign annual interest rates below are a guide
to general levels but don't always represent actual transactions.
PRIME RATE: 10 1/2%. The base rate on corporate loans at
large U.S. money center commercial banks.
FEDERAL FUNDS:
8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16%
offered. Reserves traded among commercial banks for overnight
use in amounts of $1 million or more.
Source: Fulton Prebon (U.S.A.) Inc.
DISCOUNT RATE: 7%. The charge on loans to depository
institutions by the New York Federal Reserve Bank.
CALL MONEY: 9 3/4% to 10%. The charge on loans to
brokers on stock exchange collateral.
COMMERCIAL PAPER : placed directly by General Motors
Acceptance Corp.:
8.50% 30 to 44 days;
8.25% 45 to 65 days;
8.375% 66 to 89 days;
8% 90 to 119 days;
7.875% 120 to 149 days;
7.75% 150 to 179 days;
7.50% 180 to 270 days.
(190) [The key U.S. and foreign annual interest rates below are a guide to general levels but don't always represent actual transactions. A]
[PRIME RATE: 10 1/2%. The base rate on corporate loans at large U.S. money center commercial banks. B]
[FEDERAL FUNDS: 8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16% offered. Reserves traded among commercial banks for overnight use in amounts of $1 million or more. Source: Fulton Prebon (U.S.A.) Inc. C]
[DISCOUNT RATE: 7%. The charge on loans to depository institutions by the New York Federal Reserve Bank. D]
[CALL MONEY: 9 3/4% to 10%. The charge on loans to brokers on stock exchange collateral. E]
[COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.50% 30 to 44 days; 8.25% 45 to 65 days; 8.375% 66 to 89 days; 8% 90 to 119 days; 7.875% 120 to 149 days; 7.75% 150 to 179 days; 7.50% 180 to 270 days. F]
wsj_0602
Carlson, L. and Marcu, D. (2001). Discourse Tagging Manual. Unpublished manuscript,
http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf.
Elaboration-Set-Member
List: B, C, D, E, F
Typo-dispositional markers
Typographical markers
Dispositional markers
Enumerative Structure
The act of enumerating: stating the successive elements of the same conceptual domain,
these elements being hierarchically linked, directly or indirectly, to a classifying concept
Can take several forms (examples from the "Shoe" Wikipedia page)
Enumerative Structures (ES)
PRIMER {ITEM} CONCLUSION
• Horizontal vs. Vertical
• Syntagmatic vs. Paradigmatic
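The PRIMER {ITEM} CONCLUSION pattern and its two typology axes can be written down as a small data structure. This is only a sketch: the class and field names are hypothetical, and treating the conclusion as optional is an assumption based on the shoe example in this deck, which has none.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One enumerated element; `children` holds nested items, if any."""
    text: str
    children: List["Item"] = field(default_factory=list)


@dataclass
class EnumerativeStructure:
    """PRIMER {ITEM} CONCLUSION: the primer introduces the items.

    The conclusion is optional here (an assumption, not stated on the slide).
    """
    primer: str
    items: List[Item]
    conclusion: Optional[str] = None
    layout: str = "vertical"      # "horizontal" or "vertical"
    kind: str = "paradigmatic"    # "syntagmatic" or "paradigmatic"


# The shoe example from this deck, encoded by hand.
es = EnumerativeStructure(
    primer="A shoe is mainly composed of:",
    items=[Item("the sole"), Item("the vamp")],
)
```

Keeping layout and kind as separate fields mirrors the two independent axes listed on the slide.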
Enumerative Structures (ES)
Erich von Hornbostel and Curt Sachs adopted Mahillon's scheme and published an extensive
new scheme for classification in Zeitschrift für Ethnologie in 1914. Hornbostel and Sachs used
most of Mahillon's system, but replaced the term autophone with idiophone.
Primer:
The original Hornbostel-Sachs system classified instruments into four main groups:
Items:
• Idiophones, which would be an instrument that you could hit, strike, shake or scrape –
such as the xylophone and rattle. They produce sound by vibrating themselves; they are
sorted into concussion, percussion, shaken, scraped, split, and plucked idiophones.
• Membranophones, which would be an instrument that uses a stretched skin, or
membrane (key word being "stretched") such as drums or kazoos, produce sound by a
vibrating membrane; they are sorted into predrum membranophones, tubular drums,
friction idiophones, kettledrums, friction drums, and mirlitons.
• Chordophones, which would be an instrument that uses stretched string or cord – such
as the piano or cello, produce sound by vibrating strings; they are sorted into zithers,
keyboard chordophones, lyres, harps, lutes, and bowed chordophones.
• Aerophones, which would be an instrument that you produce a sound by blowing air into
– such as the pipe organ or oboe, produce sound by vibrating columns of air; they are
sorted into free aerophones, flutes, organs, reedpipes, and lip-vibrated aerophones.
Conclusion:
Sachs later added a fifth category, electrophones, such as theremins, which produce sound by
electronic means. Within each category are many subgroups. The system has been criticised and
revised over the years, but remains widely used by ethnomusicologists and organologists.
From the "Musical instrument" Wikipedia page
Horizontal vs. Vertical ES
Under IAU definitions, there are eight planets in the Solar System. In order of
increasing distance from the Sun, there are four terrestrial planets, Mercury,
Venus, Earth, and Mars, then the four gas-giant ones, Jupiter, Saturn, Uranus, and
Neptune.
versus
Under IAU definitions, in the Solar System and in order of
increasing distance from the Sun, there are eight planets:
• four terrestrial planets:
- Mercury,
- Venus,
- Earth,
- Mars.
• four gas-giant planets:
- Jupiter,
- Saturn,
- Uranus,
- Neptune.
→ less ambiguous
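One reason the vertical layout is less ambiguous is that its typo-dispositional markers can be read off mechanically. The sketch below is not the authors' tool. It assumes three conventions that the slides do not state: the first non-empty line starts the primer, "•" opens a top-level item, and "-" opens a sub-item. The function name `parse_vertical_es` is hypothetical.

```python
def parse_vertical_es(text):
    """Split a vertically laid-out ES into a primer and nested items.

    Assumed conventions (not from the slides): the first non-empty line
    starts the primer, "•" opens a top-level item and "-" opens a
    sub-item of the most recent top-level item.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    primer, items = lines[0], []
    for ln in lines[1:]:
        if ln.startswith("•"):
            items.append({"text": ln[1:].strip(" :,."), "children": []})
        elif ln.startswith("-") and items:
            items[-1]["children"].append(ln[1:].strip(" :,."))
        elif items:
            # Wrapped line belonging to the current top-level item.
            items[-1]["text"] += " " + ln
        else:
            # Wrapped line belonging to the primer.
            primer += " " + ln
    return {"primer": primer, "items": items}


planets = """Under IAU definitions, in the Solar System and in order of
increasing distance from the Sun, there are eight planets:
• four terrestrial planets:
- Mercury,
- Venus,
- Earth,
- Mars.
• four gas-giant planets:
- Jupiter,
- Saturn,
- Uranus,
- Neptune."""

es = parse_vertical_es(planets)
```

The horizontal version of the same text has no such markers, so recovering the two groups there would need syntactic analysis.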
False enumerative structures?
This overconsumption is mainly responsible for the growing resistance of bacteria
to antibiotics:
• The more a country consumes antibiotics the more resistant the bacteria
become: in France, Staphylococcus aureus is resistant to methicillin in 57%
of the cases, though the observed frequency is only 1% in Denmark and 9% in
Germany.
• and to each noticeable and lasting decrease of antibiotic consumption
corresponds a decrease of this resistance phenomenon.
[This overconsumption A] [is mainly responsible for the growing resistance of bacteria
to antibiotics B]. [The more a country consumes antibiotics the more resistant the
bacteria become C]: [In France, Staphylococcus aureus is resistant to methicillin in
57% of the cases D], [though the observed frequency is only 1% in Denmark and 9%
in Germany E]. [and to each noticeable and lasting decrease of antibiotic
consumption F] [corresponds a decrease of this resistance phenomenon. G]
justify(non-volitional-cause(A,B),
sequence(motivation(contrast(D,E),C),
non-volitional-cause(F,G)))
ES(Primer([A,B]),enum(item([C,E,D]),item([F,G])))
From "The Sunday Times" Wikipedia page
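The rhetorical term and the enumerative term on this slide can be compared by writing both as trees over the segments A to G. The nesting below is one reading of the two formulas, and the tuple encoding and the helper `leaves` are illustrative assumptions rather than anything defined in the talk.

```python
def leaves(node):
    """Collect segment labels from a (label, child, ...) tuple tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [seg for child in node[1:] for seg in leaves(child)]


# Rhetorical analysis, as read from the slide.
rst = ("justify",
       ("non-volitional-cause", "A", "B"),
       ("sequence",
        ("motivation", ("contrast", "D", "E"), "C"),
        ("non-volitional-cause", "F", "G")))

# Enumerative analysis, as read from the slide.
es = ("ES",
      ("primer", "A", "B"),
      ("enum", ("item", "C", "E", "D"), ("item", "F", "G")))

same_segments = sorted(leaves(rst)) == sorted(leaves(es))
```

Both trees cover the same seven segments; they differ only in how the segments are grouped, which is the question the slide title raises.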
Syntagmatic vs. Paradigmatic ES
Character shoes have:
• a one to three inch heel
• which is usually made of leather
there are dependencies between items
versus
Men's shoes can also be decorated in various ways:
• Plain-toes: have a sleek appearance and no extra decorations on the vamp.
• Cap-toes: have an extra layer of leather that "caps" the toe. This is possibly the most popular decoration.
heads of items are syntactically equivalent
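A very rough way to approximate the distinction above is to check whether a later item opens with a word that ties it to a previous item. This is a hypothetical heuristic, not the criterion used in the talk: the word list `DEPENDENCY_OPENERS` is an assumption, and a real test of syntactic equivalence between item heads would need a parser.

```python
# Words that, at the start of an item, suggest it depends on an earlier
# item. This list is an illustrative assumption, not from the slides.
DEPENDENCY_OPENERS = {"which", "that", "whose", "and", "but", "or", "then"}


def looks_syntagmatic(items):
    """Return True if any item after the first opens with a dependency word."""
    for item in items[1:]:
        words = item.split()
        if words and words[0].lower().strip(",;:") in DEPENDENCY_OPENERS:
            return True
    return False


character_shoes = [
    "a one to three inch heel",
    "which is usually made of leather",
]
mens_shoes = [
    "Plain-toes: have a sleek appearance and no extra decorations on the vamp.",
    "Cap-toes: have an extra layer of leather that \"caps\" the toe.",
]
```

On these two examples the heuristic flags the character-shoes list and leaves the men's-shoes list alone, but it would miss dependencies that are not signalled by an opening word.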
Paradigmatic ES > Hierarchical Structure
A shoe is mainly composed of:
• the sole, which protects the bottom of the feet, more or less raised on
the back of the heel and
• the vamp, upper part that wraps the foot
Sole Part-of Shoe
Vamp Part-of Shoe
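The translation from this paradigmatic ES to a hierarchy can be sketched as follows. The head extraction is deliberately crude and is an assumption of this example (text before the first comma, minus a leading determiner). The root concept and the relation are supplied by hand, and the names `item_head` and `es_to_triples` are hypothetical.

```python
DETERMINERS = {"the", "a", "an"}


def item_head(item_text):
    """Crude head guess: words before the first comma, minus a leading determiner."""
    words = item_text.split(",")[0].split()
    if words and words[0].lower() in DETERMINERS:
        words = words[1:]
    return " ".join(words)


def es_to_triples(root, relation, items):
    """Link the head of each item to the root concept with a single relation."""
    return [(item_head(item), relation, root) for item in items]


items = [
    "the sole, which protects the bottom of the feet",
    "the vamp, upper part that wraps the foot",
]
triples = es_to_triples("shoe", "part-of", items)
```

Every item yields one triple with the same relation and the same root, which is what makes a paradigmatic ES a convenient source of hierarchical structure.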
Typology of primers
The primer is incomplete (a proposition which is syntactically incomplete
and for which the missing components are given by the items)

Women's dress shoes:
• Pumps
• Slingbacks
• Loafers
• Mules
• Ballet flats
• Sandals
The primer is a noun phrase:
root of the tree > noun phrase
relation > is-a

The sole of a sandal can be made of:
• rubber,
• leather,
• wood,
• tatami, or
• rope
The primer is only composed of a noun phrase and of a verb:
root of the tree > noun phrase
relation > meaning of the verb
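The two rules for incomplete primers can be sketched as a single function. It assumes an upstream syntactic analysis has already isolated the noun phrase and, when present, the verb. The function name and the direction chosen for the triples are assumptions of this example.

```python
def hierarchy_from_incomplete_primer(noun_phrase, items, verb=None):
    """Apply the two rules for incomplete primers.

    Noun phrase only: each item is-a noun phrase.
    Noun phrase + verb: the noun phrase is linked to each item by the verb.
    The triple direction is this sketch's reading of the slide.
    """
    if verb is None:
        return [(item, "is-a", noun_phrase) for item in items]
    return [(noun_phrase, verb, item) for item in items]


dress_shoes = hierarchy_from_incomplete_primer(
    "women's dress shoes", ["pumps", "slingbacks", "loafers"]
)
sandal_sole = hierarchy_from_incomplete_primer(
    "sole of a sandal", ["rubber", "leather", "wood"], verb="made of"
)
```

In both cases the root of the tree is the noun phrase; only the relation changes.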
Typology of primers
The primer is complete (a proposition which is syntactically complete)

Some shoes are exclusively worn by women:
• pumps
• Stiletto heels
• Ballet flats
root of the tree > subject or object ???
relation > is-a

Snowshoes today are divided into three types:
• aerobic/running (small and light; not intended for backcountry use);
• recreational (a bit larger; meant for use in gentle-to-moderate walks of 3–5 miles (4.8–8.0 km));
• mountaineering (the largest, meant for serious hill-climbing, long-distance trips and off-trail use).
The primer contains a numeral or a linguistic clue such as "categories", "types", "following", etc.
root of the tree > term which cooccurs with this marker
relation > is-a
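The three primer types used in this typology can be told apart with a simple marker check once syntactic completeness is known. The sketch below takes completeness as an input rather than computing it. The clue words come from the slide, while the list of spelled-out numerals and the function names are assumptions.

```python
import re

# Clue words quoted on the slide.
MARKERS = {"categories", "types", "following"}
# Spelled-out numerals: an assumed, non-exhaustive list.
NUMERALS = {"two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}


def has_marker(primer):
    """True if the primer contains a numeral or one of the clue words."""
    words = re.findall(r"[a-z]+|\d+", primer.lower())
    return any(w.isdigit() or w in NUMERALS or w in MARKERS for w in words)


def primer_type(primer, syntactically_complete):
    """Classify a primer; completeness is supplied by the caller."""
    if not syntactically_complete:
        return "incomplete"
    if has_marker(primer):
        return "complete with marker"
    return "complete without marker"


snowshoes = primer_type("Snowshoes today are divided into three types:", True)
womens = primer_type("Some shoes are exclusively worn by women:", True)
dress = primer_type("Women's dress shoes:", False)
```

When a marker is found, the slide's rule takes the term that cooccurs with it as the root of the tree; locating that term is left out of this sketch.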
Application
Enrichment of the OntoTopo ontology
(ANR-07-MDCO-005, http://www.lri.fr/geonto)
Domain: map data description – geography
Resources:
Application
Source Ontology: 728 concepts
Wikipedia: 402 pages (183 Enumerative Structures)
Module for annotating > 383 Parallel ES (PES)
Module for extracting hierarchical structures > ~300 Hierarchical Structures
~400 new concepts
~300 new instances

Type of primer            Rate
Incomplete                27%
Complete with marker      54%
Complete without marker   18%
Conclusion
Results:
• PES are increasingly found in electronic documents
• Additional tools based on the layout are possible
• A translation process by successive annotations exists; this process depends on the type of the primer
• It can dramatically improve an ontology
Perspectives:
• Combine structure with the lexicon and syntax for ontology learning
• Tackle more complex grammatical constructions and spelling variations
• Improve ontology enrichment with hierarchical structures