Building a Terminological Database from Heterogeneous Definitional Sources

Smaranda Muresan, Peter T. Davis, Samuel D. Popper, Judith L. Klavans
Columbia University
May 21, 2003

Why Is Terminology Important?
- Each agency and each department might have different ways to define the same concept
- Working with multiple databases requires understanding the data across multiple agencies and domains

What’s an Employee?
- An appointed officer or employee of USDA including special Government employees (collaborators, consultants and panel members). The term excludes independent contractors. (US Department of Agriculture)
- A person who works for wages or salary in the service of an employer. (Mine Safety and Health Administration)
- An individual who is engaged or compensated by a railroad or by a contractor to a railroad, who is authorized by a railroad to use its wireless communications in connection with railroad operations. (Federal Railroad Administration)
- The term "employee" does not include a director, trustee, or officer. (US SEC)

Desiderata for Terminological Resources
- Capture the ongoing evolution of language
- Provide consistency, ease of sharing and integration across agencies

Architecture
- Collection: GetGloss and Definder (dynamic sources) produce the Heterogeneous Definitional Corpus
- Building the terminological DB
  – Semantic Analysis: ParseGloss (relations among concepts and their attributes)
  – Database Building: Terminological Database (fast access, flexibility, sharing)
- Use: database query

Collection
- Motivation
  – Definitions are rich in terminological knowledge
  – On-line dictionaries are static and generally incomplete
  – Need to capture the evolution of language
- Solution: acquisition of a Heterogeneous Definitional Corpus
  – GetGloss – identification and extraction of glossaries
  – Definder – extraction of definitions from online free text

Building the Terminological DB
- Motivation
  – Need to identify relationships among concepts, e.g. synonyms, hypernyms, cross-reference
  – Need to store this conceptual information for easy and fast access and integration
- Solution: transform definitional text into conceptual data
  – Semantic Analysis: ParseGloss – partial semantic analysis of definitions to identify the relations between concepts
  – Database Building: store data into a relational database
- Input (glossary text)
  Gasoline: See Motor Gasoline (Finished).
  Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines. Motor gasoline, as defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline" includes conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes aviation gasoline. ...
- Output (ParseGloss)
  Term: Motor Gasoline (Finished)
  Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …)
  Paren-modifier: Finished
  Full Definition: A complex mixture... for use in ….
  Core Definition: A complex mixture ... for use in spark-ignition engines
  Genus Phrase: A complex mixture of relatively volatile hydrocarbons
  Head Genus Word: hydrocarbons
  Properties:
    UsedIn: spark-ignition engines
    Excludes-Includes:
      includes conventional gasoline
      includes gasohol
      excludes aviation gasoline
      ...

Database Use
- Motivation
  – Enable the user to access the richness of terminological knowledge
  – Assure easy and fast access to data
  – Enable data sharing and integration across agencies
  – Enable dynamic update of data
- Solution
  – Query module for the relational database (SQL)
- Example: SQL query for inflammation
  1. Redness, swelling, heat and pain resulting from injury to tissue (parts of the body underneath the skin). Also known as swelling.
  2. A characteristic reaction of tissues to disease or injury; it is marked by four signs: swelling, redness, heat, and pain.
  3. The reaction of tissue to injury.
  4. A response to irritation, infection, or injury, resulting in pain, redness, and swelling.
  …

Putting It All Together
Collection (GetGloss, Definder) -> Building the terminological DB (ParseGloss, Database Building) -> Use (database query)

GetGloss – Automatic Glossary Extraction
- DGRC project
- Given a URL find the glossary file
- Challenges:
  – a glossary can be a small part embedded inside a web page
  – there is no standard HTML tag formatting for marking <term,definition> pairs
  – a web page can contain <term,information> pairs, where information is not a definition
[Figure: example pages, one True Positive and one False Positive]

Algorithm
Two step algorithm
- Identification Component – find candidate glossaries
  – Keyword + rule-based algorithm (6 rules)
    "glossary", "dictionary" in HTML tags
    Terms in alphabetical order
    …
- Classification Component – filter out false positives
  – Rule-based method (9 rules), e.g. filter if term is a Named Entity (e.g. California)
  – Statistical method using SVM

Evaluating the Identification Component
- 10,000+ pages from 5 different sites
  – 1,000 page sample: no glossaries (n=13)
- 286,579 page sample from 268 domains
  – P=53%
- Estimating recall is hard
- Precision and Recall both very sensitive to perturbations (p=0 vs. p=53%)
Klavans et al. (dg.o 2002)

Evaluating the Classification Components
- GetGloss Categorizer assigns a score to each candidate based on a linear combination of weighted features
- Corpus: 2400 glossary candidates
- Test: 300 randomly chosen, manually categorized glossary candidates

Classification Component Performance
[Chart: Precision, Recall and F1 (y-axis, 0 to 1) by Score Range (x-axis, 0 to 5)]
Score Range:
  0 if Score < -100
  1 if -100 ≤ Score < -50
  2 if -50 ≤ Score < 0
  3 if 0 ≤ Score < 50
  4 if 50 ≤ Score < 100
  5 if 100 ≤ Score

Definder – Automatic Extraction of Definitions from Text

Definder
- Part of NSF funded digital library project
  – Medical domain
  – Extract definitions from consumer oriented medical text
- Corpus
  – Medical articles written by doctors for lay audience
  – Different genre (articles, manual chapters)

Algorithm
- Shallow parsing
  – Simple definitions (e.g. NP-NP pairs for synonyms)
  – Candidate complex definitions
- Full parsing (Charniak’00 parser)
  – appositions, relative clauses, complex definitional sentences

Definition Patterns and Examples
Simple definitions
- NP ( NP )
  moving x-ray pictures ( angiograms )
  hypertension ( high blood pressure )
- NP -- NP --
  tachycardia -- racing heartbeat -- ….
- NP of NP ( NP )
  enlargement of the heart muscle ( hypertrophy )

Definition Patterns and Examples
Complex definitions
- [S [CNP NP , [CNP CNP CNP] CNP] [VP … VP] S]
  Angina, the pressing chest pain most people associate with heart problems, ….
- [S … [CNP NP -- [CNP CNP CNP] CNP] S]
  … atherosclerosis – the progressive narrowing of the heart’s own arteries by cholesterol plaque buildups, which starves the heart itself for oxygen and nutrients.
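
The simple definition patterns listed above (NP ( NP ), NP -- NP --, NP of NP ( NP )) can be illustrated with a minimal sketch. It assumes the text has already been chunked by a shallow parser into noun-phrase chunks and plain tokens; the `Chunk` representation and the function name `find_simple_definitions` are hypothetical and are not taken from Definder itself.

```python
from typing import List, Tuple

# Hypothetical chunk format: (tag, text), where tag is "NP" for a
# noun-phrase chunk and "TOK" for any other token (punctuation, "of", ...).
Chunk = Tuple[str, str]


def _is_tok(chunk: Chunk, text: str) -> bool:
    """True if the chunk is the plain token `text`."""
    return chunk == ("TOK", text)


def find_simple_definitions(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Return (term, gloss) pairs matching the three simple patterns:
    NP ( NP ), NP -- NP --, and NP of NP ( NP )."""
    pairs = []
    n = len(chunks)
    for i, (tag, text) in enumerate(chunks):
        if tag != "NP":
            continue
        for opener, closer in (("(", ")"), ("--", "--")):
            if (i + 3 < n
                    and _is_tok(chunks[i + 1], opener)
                    and chunks[i + 2][0] == "NP"
                    and _is_tok(chunks[i + 3], closer)):
                term = text
                # NP of NP ( NP ): extend the term leftwards over "of".
                if (opener == "(" and i >= 2
                        and _is_tok(chunks[i - 1], "of")
                        and chunks[i - 2][0] == "NP"):
                    term = f"{chunks[i - 2][1]} of {text}"
                pairs.append((term, chunks[i + 2][1]))
    return pairs


if __name__ == "__main__":
    example = [("NP", "hypertension"), ("TOK", "("),
               ("NP", "high blood pressure"), ("TOK", ")")]
    print(find_simple_definitions(example))
```

Matching over chunks rather than raw strings keeps the sketch close to the shallow-parsing step named on the slide; the complex patterns, which rely on full parse trees, are out of scope here.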
Evaluation
- Quantitative
  – 4 human subjects, 10 articles
  – 53 definitions – gold standard
  – DEFINDER – p=86.27%, r = 84.60%
- Qualitative
  – Usefulness and readability (non-specialists)
  – Completeness and accuracy (medical specialists)
Klavans and Muresan (JCDL’01)
Muresan and Klavans (LREC’02)

Coverage of On-line Dictionaries
[Chart: percentage of terms that are defined, undefined, or absent in UMLS, OMD and GPTMT; data labels in the original chart: 78.5, 76, 60, 24, 24, 21.5, 16, 0, 0]

Characteristics of Automatically Built Definitional Corpus
- Large amount of data
- Heterogeneous
  – Structure
  – Language
  – Semantics
- Multiple definitions of the same term

Environment domain
  Ozone: (O3) A colorless gas with a pungent odor, having the molecular form of O3, found in two layers of the atmosphere, the stratosphere (about 90% of the total atmospheric loading) and the troposphere (about 10%). Ozone is a form of oxygen found naturally in the stratosphere that provides a protective layer shielding the Earth from ultraviolet radiation's harmful health effects on humans and the environment. In the troposphere, ozone is a chemical oxidant and major component of photochemical smog. Ozone can seriously affect the human respiratory system. See atmosphere, ultraviolet radiation.
  <http://www.epa.gov/globalwarming/glossary.html.xml>
  Column Ozone: ozone between the Earth's surface and outer space. Ozone levels can be described in several ways. One of the most common measures is how much ozone is in a vertical column of air. The dobson unit is a measure of column ozone. Other measures include partial pressure, number density, and concentration of ozone, and can represent either column ozone or the amount of ozone at a particular altitude.

Medical domain
  Addison Disease
  - a rare disease that results from a deficiency in adrenocortical hormones
  - an endocrine disorder that affects about 1 in 100,000 people
  - a degenerative disease that is characterized by low blood pressure and dark brown pigmentation of the skin
  Arrhythmia: A disturbance in the beating pattern of the heart.
  Ozone: A form of oxygen in which atoms combine in groups of three.
  <1006_Oxygen_therapy>

Putting It All Together
Collection (GetGloss, Definder) -> Building the terminological DB (ParseGloss, Database Building) -> Use (database query)

ParseGloss – Partial Semantic Analysis
- Challenges
  – heterogeneous collection of definitions
- Focus on identifying the main semantic relations among concepts
- Partial Semantic Analysis based on shallow parsing
  – Genus phrase and genus term
  – Synonyms, cross-reference
  – Other common relations between the defined term and the concepts inside the definition

Example
  Term: Motor Gasoline (Finished)
  Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …)
  Paren-modifier: Finished
  Full Definition: A complex mixture... for use in ….
  Core Definition: A complex mixture ... for use in spark-ignition engines
  Genus Phrase: A complex mixture of relatively volatile hydrocarbons
  Head Genus Word: hydrocarbons
  Properties:
    UsedIn: spark-ignition engines
    Excludes-Includes:
      includes conventional gasoline
      includes gasohol
      excludes aviation gasoline
      …

Evaluation
- Task: user based evaluation to
  – build a gold standard for genus phrase
  – ask people to identify the most important properties inside each definition
- Data
  – 100 term-definition pairs
  – 7 different glossaries
  – 26 subjects

Results
- Complex evaluation
- Different notions of agreement and overlap:
  – Head only – 64% precision
  – Genus phrase – 59% precision

Building the Database
- Relational database
- XML – data transfer
- Statistics
  – ~8000 terms
  – ~12,500 definitions
  – ~2000 different sources (web pages, articles, etc.)
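
As a rough illustration of how the ParseGloss fields from the Motor Gasoline example could be stored in a relational database, here is a minimal sketch using SQLite. The table and column names (`source`, `term`, `definition`, `relation`) and the helper functions are assumptions made for this example; the slides do not describe the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per source, term, and definition, plus a
# generic relation table for properties such as UsedIn, includes, excludes.
SCHEMA = """
CREATE TABLE source (
    id INTEGER PRIMARY KEY,
    agency TEXT,
    resource TEXT,
    url TEXT
);
CREATE TABLE term (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE definition (
    id INTEGER PRIMARY KEY,
    term_id INTEGER NOT NULL REFERENCES term(id),
    source_id INTEGER NOT NULL REFERENCES source(id),
    core_definition TEXT NOT NULL,
    genus_phrase TEXT,
    head_genus_word TEXT
);
CREATE TABLE relation (
    definition_id INTEGER NOT NULL REFERENCES definition(id),
    kind TEXT NOT NULL,
    target TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the Motor Gasoline example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO source (id, agency, resource) VALUES (1, ?, ?)",
        ("EIA", "Gasoline Glossary"),
    )
    conn.execute(
        "INSERT INTO term (id, name) VALUES (1, ?)",
        ("Motor Gasoline (Finished)",),
    )
    conn.execute(
        "INSERT INTO definition "
        "(id, term_id, source_id, core_definition, genus_phrase, head_genus_word) "
        "VALUES (1, 1, 1, ?, ?, ?)",
        (
            "A complex mixture ... for use in spark-ignition engines",
            "A complex mixture of relatively volatile hydrocarbons",
            "hydrocarbons",
        ),
    )
    conn.executemany(
        "INSERT INTO relation (definition_id, kind, target) VALUES (1, ?, ?)",
        [
            ("UsedIn", "spark-ignition engines"),
            ("includes", "conventional gasoline"),
            ("includes", "gasohol"),
            ("excludes", "aviation gasoline"),
        ],
    )
    return conn


def relations_for(conn: sqlite3.Connection, term_name: str):
    """Return (kind, target) rows for every definition of a term."""
    return conn.execute(
        "SELECT r.kind, r.target FROM relation r "
        "JOIN definition d ON d.id = r.definition_id "
        "JOIN term t ON t.id = d.term_id "
        "WHERE t.name = ? ORDER BY r.kind, r.target",
        (term_name,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for kind, target in relations_for(db, "Motor Gasoline (Finished)"):
        print(kind, "->", target)
```

Keeping definitions in their own table, separate from terms, lets one term carry several definitions from different sources, which matches the statistics above (more definitions than terms).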
Distribution of terms with multiple definitions

Putting It All Together
Collection (GetGloss, Definder) -> Building the terminological DB (ParseGloss, Database Building) -> Use (database query)

Query the terminological Database
- Term: Audit
  Definition: An examination of the financial statements, accounting records, and other supporting evidence of an institution…
  Source: www.fdic.gov
- Term: Industrial radiography
  Definition: means an examination of the structure of materials…
  Source: www.nrc.gov
- Term: Medical Surveillance
  Definition: is the systematic examination of medical monitoring data to determine…
  Source: www.osha.gov

Conclusions
Proposed a framework for solving the heterogeneous terminology problem
  – Automatically building a heterogeneous collection of definitions from dynamic sources
  – Partial semantic analysis of the definitions to identify main semantic relations between concepts
  – Building a database for fast, easy access, dynamic update of data, sharing across agencies

Future Work
- Deep domain specific semantic analysis
- But, how to automatically classify the glossary entries and definitions?
  – Based on the classification of their source (sites, articles, etc.)
- When and how to merge different definitions for the same term?
- Integrate this acquired terminological knowledge into the DGRC system