From Enumerative Structures in Texts towards Hierarchical Structures in Ontologies

Nathalie Aussenac-Gilles, Mouna Kamel, Bernard Rothenburger
IC3 team at IRIT, Toulouse
ILIKS 2011, Aix-en-Provence
24/05/2011 ILIKS 2011 - Hierarchical structures

Semantic relations

The remaining elite American companies are Allen Edmonds and Alden Shoe Company.
High-heeled footwear is footwear that raises the heels, …
Slingbacks are shoes which are secured by a strap behind the heel, …
Court shoes, known in the US as pumps, are typically high-heeled, …
Platform shoe: shoe with very thick soles and heels
Variants include kitten heels (typically 1½-2 inches high), stiletto heels (with a very narrow heel post) and wedge heels (with a wedge-shaped sole rather than a heel post).
Ballet flats, known in the UK as ballerinas, ballet pumps or skimmers, are shoes with a very low heel …

What's new?

Going beyond the sentence
Using typo-dispositional clues
Logical structure of a text, Textual Architecture Model …
Typology of Enumerative Structures (ES)
Discourse analysis (Rhetorical Structure Theory, Segmented Discourse Representation Theory, …)
Vertical and paradigmatic ES
Translation: Enumerative Structure > Hierarchical structure
Evaluation

Discourse Structure

The key U.S. and foreign annual interest rates below are a guide to general levels but don't always represent actual transactions.
PRIME RATE: 10 1/2%. The base rate on corporate loans at large U.S. money center commercial banks.
FEDERAL FUNDS: 8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16% offered. Reserves traded among commercial banks for overnight use in amounts of $1 million or more.
Source: Fulton Prebon (U.S.A.) Inc.
DISCOUNT RATE: 7%. The charge on loans to depository institutions by the New York Federal Reserve Bank.
CALL MONEY: 9 3/4% to 10%. The charge on loans to brokers on stock exchange collateral.
COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.50% 30 to 44 days; 8.25% 45 to 65 days; 8.375% 66 to 89 days; 8% 90 to 119 days; 7.875% 120 to 149 days; 7.75% 150 to 179 days; 7.50% 180 to 270 days.

(190)
[The key U.S. and foreign annual interest rates below are a guide to general levels but don't always represent actual transactions.A]
[PRIME RATE: 10 1/2%. The base rate on corporate loans at large U.S. money center commercial banks.B]
[FEDERAL FUNDS: 8 3/4% high, 8 11/16% low, 8 5/8% near closing bid, 8 11/16% offered. Reserves traded among commercial banks for overnight use in amounts of $1 million or more. Source: Fulton Prebon (U.S.A.) Inc.C]
[DISCOUNT RATE: 7%. The charge on loans to depository institutions by the New York Federal Reserve Bank.D]
[CALL MONEY: 9 3/4% to 10%. The charge on loans to brokers on stock exchange collateral.E]
[COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.50% 30 to 44 days; 8.25% 45 to 65 days; 8.375% 66 to 89 days; 8% 90 to 119 days; 7.875% 120 to 149 days; 7.75% 150 to 179 days; 7.50% 180 to 270 days.F]
wsj_0602

Elaboration-Set-Member

Carlson, L. and Marcu, D. (2001). Discourse Tagging Manual. Unpublished manuscript, http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf.
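
The bracketed annotation of example (190) can be read mechanically. The following is a minimal sketch, not the authors' tooling: it assumes each segment is written as "[text LABEL]" with a single capital-letter label, that the first segment is the nucleus of the Elaboration-Set-Member relation, and that the remaining segments form its list of satellites. The names parse_segments and Relation are hypothetical, and the segment texts are shortened for illustration.

```python
import re
from dataclasses import dataclass

# A segment is "[text LABEL]" where LABEL is one capital letter, e.g. "[... banks.B]".
SEGMENT = re.compile(r"\[(?P<text>[^\[\]]+?)\s*(?P<label>[A-Z])\s*\]")


@dataclass
class Relation:
    name: str          # discourse relation label
    nucleus: str       # label of the segment being elaborated (assumed: the first one)
    satellites: list   # labels of the listed segments


def parse_segments(annotated):
    """Map each segment label to its text."""
    return {m["label"]: m["text"].strip() for m in SEGMENT.finditer(annotated)}


# Shortened versions of the segments of example (190).
example = (
    "[The key U.S. and foreign annual interest rates below are a guide.A] "
    "[PRIME RATE: 10 1/2%.B] [FEDERAL FUNDS: 8 3/4% high.C] "
    "[DISCOUNT RATE: 7%.D] [CALL MONEY: 9 3/4% to 10%. E ] "
    "[COMMERCIAL PAPER: 8.50% 30 to 44 days.F]"
)

segments = parse_segments(example)
labels = sorted(segments)
relation = Relation("Elaboration-Set-Member", labels[0], labels[1:])
print(relation)
```

Keeping only the labels in the relation, with the texts in a separate dictionary, keeps the discourse structure independent of the wording of each segment.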
[Tree diagram: List of segments B, C, D, E, F]

Typo-dispositional markers

Typographical markers
Dispositional markers

Enumerative Structure

The act of enumerating: stating the successive elements of a single conceptual domain, these elements being directly or indirectly linked, in a hierarchy, to a classifying concept.
It can take several forms (examples from the "Shoe" Wikipedia page).

Enumerative Structures (ES)

PRIMER {ITEM} CONCLUSION
Horizontal vs. Vertical
Syntagmatic vs. Paradigmatic

Enumerative Structures (ES)

Erich von Hornbostel and Curt Sachs adopted Mahillon's scheme and published an extensive new scheme for classification in Zeitschrift für Ethnologie in 1914. Hornbostel and Sachs used most of Mahillon's system, but replaced the term autophone with idiophone. The original Hornbostel-Sachs system classified instruments into four main groups:
Idiophones, which would be an instrument that you could hit, strike, shake or scrape – such as the xylophone and rattle. They produce sound by vibrating themselves; they are sorted into concussion, percussion, shaken, scraped, split, and plucked idiophones.
Membranophones, which would be an instrument that uses a stretched skin, or membrane (key word being "stretched") such as drums or kazoos, produce sound by a vibrating membrane; they are sorted into predrum membranophones, tubular drums, friction idiophones, kettledrums, friction drums, and mirlitons.
Chordophones, which would be an instrument that uses stretched string or cord – such as the piano or cello, produce sound by vibrating strings; they are sorted into zithers, keyboard chordophones, lyres, harps, lutes, and bowed chordophones.
Aerophones, which would be an instrument that you produce a sound by blowing air into – such as the pipe organ or oboe, produce sound by vibrating columns of air; they are sorted into free aerophones, flutes, organs, reedpipes, and lip-vibrated aerophones.
Sachs later added a fifth category, electrophones, such as theremins, which produce sound by electronic means. Within each category are many subgroups. The system has been criticised and revised over the years, but remains widely used by ethnomusicologists and organologists.
From the "Musical instrument" Wikipedia page

Annotation of this passage:
Primer: the introductory text ending with "The original Hornbostel-Sachs system classified instruments into four main groups:"
Items: the four entries Idiophones, Membranophones, Chordophones, Aerophones
Conclusion: the closing text beginning with "Sachs later added a fifth category, …"

Horizontal vs. Vertical ES

Under IAU definitions, there are eight planets in the Solar System. In order of increasing distance from the Sun, there are the four terrestrial planets, Mercury, Venus, Earth, and Mars, then the four gas-giant ones, Jupiter, Saturn, Uranus, and Neptune.

versus

Under IAU definitions, in the Solar System and in order of increasing distance from the Sun, there are eight planets:
• four terrestrial planets:
  - Mercury,
  - Venus,
  - Earth,
  - Mars.
• four gas-giant planets:
  - Jupiter,
  - Saturn,
  - Uranus,
  - Neptune.
(less ambiguous)

False enumerative structures?

This overconsumption is mainly responsible for the growing resistance of bacteria to antibiotics:
• The more a country consumes antibiotics the more resistant the bacteria become: in France, Staphylococcus aureus is resistant to methicillin in 57% of the cases, though the observed frequency is only 1% in Denmark and 9% in Germany.
• and to each noticeable and lasting decrease of antibiotic consumption corresponds a decrease of this resistance phenomenon.

[This overconsumption A][is mainly responsible for the growing resistance of bacteria to antibiotics B]. [The more a country consumes antibiotics the more resistant the bacteria become C]: [In France, Staphylococcus aureus is resistant to methicillin in 57% of the cases D], [though the observed frequency is only 1% in Denmark and 9% in Germany E]. [and to each noticeable and lasting decrease of antibiotic consumption F][corresponds a decrease of this resistance phenomenon. G]

justify(non-volitional-cause(A,B), sequence(motivation(contrast(D,E),C), non-volitional-cause(F,G)))
ES(Primer([A,B]), enum(item([C,E,D]), item([F,G])))

From "The Sunday Times" Wikipedia page

Syntagmatic vs. Paradigmatic ES

Character shoes have:
• a one to three inch heel
• which is usually made of leather
(there are dependencies between items)

versus

Men's shoes can also be decorated in various ways:
• Plain-toes: have a sleek appearance and no extra decorations on the vamp.
• Cap-toes: has an extra layer of leather that "caps" the toe. This is possibly the most popular decoration.
(heads of items are syntactically equivalent)

Paradigmatic ES > Hierarchical Structure

A shoe is mainly composed of:
• the sole, which protects the bottom of the feet, more or less raised on the back of the heel, and
• the vamp, upper part that wraps the foot

[Diagram: Shoe, with Part-of links to Sole and Vamp]

Typology of primers

The primer is incomplete (a proposition which is syntactically incomplete and whose missing components are given by the items)

Women's dress shoes:
Pumps
Slingbacks
Loafers
Mules
Ballet flats
Sandals
The primer is a noun phrase: root of the tree > noun phrase; relation > is-a

The sole of a sandal can be made of:
• rubber,
• leather,
• wood,
• tatami, or
• rope
The primer is composed only of a noun phrase and a verb: root of the tree > noun phrase; relation > meaning of the verb

Typology of primers

The primer is complete (a proposition which is syntactically complete)

Some shoes are exclusively worn by women:
pumps
Stiletto heels
Ballet flats
root of the tree > subject or object ???; relation > is-a

Snowshoes today are divided into three types:
• aerobic/running (small and light; not intended for backcountry use);
• recreational (a bit larger; meant for use in gentle-to-moderate walks of 3–5 miles (4.8–8.0 km));
• mountaineering (the largest, meant for serious hill-climbing, long-distance trips and off-trail use).
The primer contains a numeral or a linguistic clue such as "categories", "types", "following", etc.: root of the tree > term which co-occurs with this marker; relation > is-a
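
The typology of primers above amounts to a small decision rule: the primer type determines the root of the tree and the relation that links it to the items. The following is a minimal sketch of that rule, assuming the root term, the items and (where needed) the meaning of the verb have already been identified by some earlier analysis step. The function translate, the class HierarchicalStructure, the primer-kind names and the relation label "made-of" are all hypothetical illustrations, not the authors' implementation. The case of a complete primer without marker is left out because the slide itself leaves its root open ("subject or object ???").

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalStructure:
    root: str                                      # concept at the top of the tree
    relation: str                                  # relation between root and children
    children: list = field(default_factory=list)   # one child per item of the ES


def translate(primer_kind, root_term, items, verb_meaning=None):
    """Build a one-level tree for a paradigmatic ES from its primer type."""
    if primer_kind == "noun_phrase":
        # Incomplete primer made of a noun phrase: relation is is-a.
        relation = "is-a"
    elif primer_kind == "noun_phrase_and_verb":
        # Incomplete primer made of a noun phrase and a verb:
        # the relation is the meaning of the verb.
        if verb_meaning is None:
            raise ValueError("verb_meaning is required for this primer kind")
        relation = verb_meaning
    elif primer_kind == "complete_with_marker":
        # Complete primer with a numeral or a clue such as "types":
        # root is the term co-occurring with the marker, relation is is-a.
        relation = "is-a"
    else:
        raise ValueError(f"unsupported primer kind: {primer_kind}")
    return HierarchicalStructure(root_term, relation, list(items))


dress_shoes = translate("noun_phrase", "Women's dress shoes",
                        ["Pumps", "Slingbacks", "Loafers", "Mules"])
sandal_sole = translate("noun_phrase_and_verb", "sole of a sandal",
                        ["rubber", "leather", "wood", "tatami", "rope"],
                        verb_meaning="made-of")
snowshoes = translate("complete_with_marker", "Snowshoes",
                      ["aerobic/running", "recreational", "mountaineering"])
print(dress_shoes, sandal_sole, snowshoes, sep="\n")
```

The sketch only covers the last step, from an already annotated ES to a tree; recognising the primer type and extracting the root term from text is the harder part and is not attempted here.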
Application

Enrichment of the OntoTopo ontology (ANR-07-MDCO-005, http://www.lri.fr/geonto)
Domain: map data description, geography
Resources:

Application

[Pipeline diagram; labels recovered:]
Source ontology: 728 concepts
WIKIPEDIA
Module for annotating: 402 pages (183 Enumerative Structures), 383 Parallel ES
Module for extracting hierarchical structures: ~300 Hierarchical Structures, ~400 new concepts, ~300 new instances

Type of primer             Rate
Incomplete                 27%
Complete with marker       54%
Complete without marker    18%

Conclusion

Results:
PES are increasingly found in electronic documents
Additional tools based on the layout are possible
A translation process by successive annotations exists; this process depends on the type of the primer
It can dramatically improve an ontology

Perspectives:
Combine structure with lexicon and syntax for ontology learning
Tackle more complex grammatical constructions and spelling variations
Improve ontology enrichment with hierarchical structures