Using Natural Language Processing To Lead the User to Data
Judith Klavans, Walter Bourne, Brian Whitman, Deniz Sarioz
Columbia University
Digital Government Research Center
Columbia University
University of Southern California

What’s the Problem?
Data & metadata that can be hard to use
• Proliferation of terms across many domains.
• Different definitions across Agencies for similar concepts.
• Lengthy, dense, and technical definitions.
• Buried information in notes, appendices, or cross-references.
• Metadata not linked to data or other metadata.

A Solution: Automation, Standardization, Organization… Ontology
Automation: Machine-read metadata terms: definitions, documentation, etc.
Standardization: Merge terms into a standardized terminology of data terms, automatically.
Organization: Organize terms into an “ontology” that links definitions into a conceptual view of a domain of knowledge, e.g., the available data.

What We Have To Work With
• Natural Language Definitions
  – Lengthy, dense, and technical.
  – Complex.
  – But, lots of information.
For example ...

“Natural language” definitions
• Gasoline: See Motor Gasoline (Finished).
• Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines. Motor gasoline, as defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline" includes conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes aviation gasoline. Note: Volumetric data on blending components, such as oxygenates, are not counted in data on finished motor gasoline until the blending components are blended into the gasoline.
• Regular Gasoline: Gasoline having an antiknock index, i.e., octane rating, greater than or equal to 85 and less than 88. Note: Octane requirements may vary by altitude. See Gasoline Grades.

A Machine-Analyzed Definition
Term: Motor Gasoline (Finished)
Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …)
Paren-modifier: Finished
Full Definition: A complex mixture... for use in …. Note: Volumetric data on ….
Core Definition: A complex mixture ... for use in spark-ignition engines
Genus Phrase: A complex mixture of relatively volatile hydrocarbons
Head Genus Word: mixture
Properties: for use in spark-ignition engines
Excludes-Includes:
  includes conventional gasoline
  includes gasohol
  excludes aviation gasoline
Codes and Documents:
  Specification D 4814
  Specification VV-G-1690
Note: Volumetric data on blending components ....

System Architecture
[Diagram labels: Heterogeneous Data & Meta-data Sources (EPA, Census, Labor, EIA, ???); Data Integration; Data Definitions (Ontology); Metadata mediates; Information Access; User Interface]

Using Organized Metadata
• Read definitions into an organized network of conceptual relationships.
• Use the structure to follow relationships.
• Consider alternate meanings and topics.
• Navigate to data of interest.
A graphical representation ...

Start from the defined term
Gasoline (from EIAGG)
See Motor Gasoline (Finished).
Examine its characteristics

Follow to a related term
Regular Gasoline
Gasoline having an antiknock index, ….
Results of reading another definition
Follow a link to gasoline as a product
product.gasoline.regular (from SIMS)

If it is a measurement, it’s data!
Regular Gasoline Data
Product.gasoline.regular (from SIMS)
Product-of: EIA-OGIRS-C120050061FM-MEASUREMENT (from SIMS)

It’s a SERIES
From EIA
For California
It’s monthly
It’s seasonally adjusted
Unit: Thousands of gallons …

Regular Gasoline Gallons/Day
A lot of things you want to know about this data.
Data and Metadata, Together
[Screen labels: Query History Data; List of definition components; Data footnotes; Look up in ontology; Sources]

Analyzing Other Domains
• Biomedical terminology
• Environmental terms
• Computer technology

A Biomedical Definition Read
Cultures and Stocks: Infectious agents and associated biologicals including: cultures from medical and pathological laboratories; cultures and stocks of infectious agents from research and industrial laboratories; waste from the production of biologicals; discarded live and attenuated vaccines; and culture dishes and devices used to transfer, inoculate, and mix cultures. (See 'regulated medical waste')
Term: Cultures and Stocks
Source: (source (agency "OTA") (resource "Mapping Our Genes ... OTABA-373") )
Cross-reference: 'regulated medical waste'
Full Definition: Infectious agents .... (See 'regulated medical waste')
Core Definition: Infectious agents ... and mix cultures.
Genus Phrase: Infectious agents
Head Genus Word: agents

An Environmental Definition Read
baghouse: a dust-collection chamber containing numerous permeable fabric filters through which the exhaust gases pass. Finer particulates entrained in the exhaust gas stream are collected in the filters for subsequent treatment/disposal.
Term: baghouse
Source: (source (agency "EPA") (resource "Terms of Environment") ...)
Full Definition: a dust-collection chamber ... treatment/disposal.
Core Definition: a dust-collection chamber ... gases pass.
Genus Phrase: a dust-collection chamber
Head Genus Word: chamber
Properties:
Excludes-Includes: contains numerous permeable fabric filters

A Computer Definition Read
input class: The class of filters that provide an interface for HID hardware, including USB and legacy devices, plus proprietary and other HID hardware, under the WDM HID architecture.
[Callouts: Skips ‘class’ to find ‘filters’; Acronym]
Term: input class
Source: (source (agency "MSC") (resource "Microsoft Glossary") ...)
Full Definition: The class of filters that provide an interface
Genus Phrase: The class of filters
Head Genus Word: filters
Properties:
Excludes-Includes: includes USB

Acronym Identification
Acronym: HID
Acrocat: Human Interface Device Human Interface Device HID Hitachi D9000 development Hue is defined
Acronym: USB
Acrocat: United States Sponsor USB Universal Serial USB Universal Serial Bus United States A

Futures
• Refine parser to draw out more meaning.
• Extend parser to differently structured metadata like data definitions, codebooks, …
• Expand output options to structured documentation, e.g., DDI.
• Support varying user modes/types: novice, student, expert, citizen, ….

The real question...
What do you need?
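
Illustration (not from the talk)
The machine-analyzed definitions on the earlier slides can be approximated, very roughly, with a few regular-expression heuristics. The sketch below is only an assumption-laden illustration of the idea, not the parser described in this presentation: the function names, the cue-word list that ends a genus phrase, the "empty head" list (used to skip ‘class’ and find ‘filters’), and the dictionary field names are all hypothetical.

```python
import re

# Hypothetical cue words that end the genus phrase in this sketch.
GENUS_END = re.compile(r",| with | containing | having | that | which ")
# Hypothetical "empty" heads to skip, e.g. "the class of filters" -> "filters".
EMPTY_HEADS = {"class", "kind", "type", "sort", "set"}

# Shortened copy of the Motor Gasoline (Finished) definition from the slides.
MOTOR_GASOLINE = (
    "A complex mixture of relatively volatile hydrocarbons with or without "
    "small quantities of additives, blended to form a fuel suitable for use "
    'in spark-ignition engines. "Motor Gasoline" includes conventional '
    "gasoline; all types of oxygenated gasoline, including gasohol; and "
    "reformulated gasoline, but excludes aviation gasoline. Note: Volumetric "
    "data on blending components, such as oxygenates, are not counted in data "
    "on finished motor gasoline until the blending components are blended "
    "into the gasoline."
)


def head_genus_word(genus_phrase):
    """Pick the head word of a genus phrase, skipping empty heads before 'of'."""
    if not genus_phrase.strip():
        return ""
    parts = genus_phrase.split(" of ")
    for part in parts:
        head = part.split()[-1]
        if head.lower() not in EMPTY_HEADS:
            return head
    return parts[-1].split()[-1]


def analyze_definition(term, text):
    """Split a glossary definition into a few labeled components."""
    entry = {"term": term, "full_definition": text}

    # Paren-modifier: a trailing parenthetical in the term, e.g. "(Finished)".
    match = re.search(r"\(([^)]+)\)\s*$", term)
    if match:
        entry["paren_modifier"] = match.group(1)

    # Note: everything after the first "Note:" marker.
    body, _, note = text.partition("Note:")
    if note.strip():
        entry["note"] = note.strip()

    # Core definition: the first sentence of the body, without its final period.
    core = re.split(r"(?<=\.)\s+", body.strip(), maxsplit=1)[0].rstrip(".")
    entry["core_definition"] = core

    # Genus phrase: the core definition up to the first cue word or comma.
    genus = GENUS_END.split(core, maxsplit=1)[0].strip()
    entry["genus_phrase"] = genus
    entry["head_genus_word"] = head_genus_word(genus)

    # Excludes-Includes: phrases introduced by includes/including/excludes/excluding.
    pairs = re.findall(r"\b(includ|exclud)(?:es|ing)\s+([^;,.:]+)", text, flags=re.I)
    entry["excludes_includes"] = [f"{verb.lower()}es {what.strip()}" for verb, what in pairs]
    return entry


if __name__ == "__main__":
    for key, value in analyze_definition("Motor Gasoline (Finished)", MOTOR_GASOLINE).items():
        print(f"{key}: {value}")
```

Such surface heuristics only handle simply phrased definitions like the gasoline, baghouse, and input class examples; cross-references, codes and documents, and acronym expansion are left out of this sketch.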