Using Natural Language Processing
To Lead the User to Data
Judith Klavans, Walter Bourne, Brian Whitman, Deniz Sarioz
Columbia University
Digital Government Research Center
Columbia University
University of Southern California
1
What’s the Problem?
Data & metadata that can be hard to use
• Proliferation of terms across many domains.
• Different definitions across agencies for similar concepts.
• Lengthy, dense, and technical definitions.
• Information buried in notes, appendices, or cross-references.
• Metadata not linked to data or other metadata.
2
A Solution:
Automation, Standardization, Organization…
Ontology
• Automation: Machine-read metadata terms (definitions, documentation, etc.).
• Standardization: Automatically merge terms into a standardized terminology of data terms.
• Organization: Organize terms into an “ontology” that links definitions into a conceptual view of a domain of knowledge, e.g., the available data.
3
What We Have To Work With
• Natural Language Definitions
– Lengthy, dense, and technical.
– Complex.
– But, lots of information.
For example ...
4
“Natural language” definitions
• Gasoline: See Motor Gasoline (Finished).
• Motor Gasoline (Finished): A complex mixture of relatively volatile
hydrocarbons with or without small quantities of additives, blended to
form a fuel suitable for use in spark-ignition engines. Motor gasoline,
as defined in ASTM Specification D 4814 or Federal Specification
VV-G-1690C, is characterized as having a boiling range of 122 to 158
degrees Fahrenheit at the 10 percent recovery point to 365 to 374
degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline"
includes conventional gasoline; all types of oxygenated gasoline,
including gasohol; and reformulated gasoline, but excludes aviation
gasoline. Note: Volumetric data on blending components, such as
oxygenates, are not counted in data on finished motor gasoline until
the blending components are blended into the gasoline.
• Regular Gasoline: Gasoline having an antiknock index, i.e., octane
rating, greater than or equal to 85 and less than 88. Note: Octane
requirements may vary by altitude. See Gasoline Grades.
5
A Machine-Analyzed Definition
Term: Motor Gasoline (Finished)
Source: (source (agency "EIA") (resource "Gasoline Glossary") (url …)
Paren-modifier: Finished
Full Definition: A complex mixture... for use in …. Note: Volumetric data
on ….
Core Definition: A complex mixture ... for use in spark-ignition engines
Genus Phrase: A complex mixture of relatively volatile hydrocarbons
Head Genus Word: mixture
Properties:
for use in spark-ignition engines
Excludes-Includes:
includes conventional gasoline
includes gasohol
excludes aviation gasoline
Codes and Documents:
Specification D 4814
Specification VV-G-1690
Note: Volumetric data on blending components ....
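A minimal sketch of how a few of these fields could be pulled from a definition, assuming simple pattern matching. The slides do not describe the actual parser, so the function name `analyze_definition`, the regular expressions, and the "first sentence is the core definition" rule are illustrative assumptions. The sample text is abridged from the entry quoted earlier.

```python
import re

# Abridged from the Motor Gasoline (Finished) entry; the note is truncated.
SAMPLE = (
    "A complex mixture of relatively volatile hydrocarbons with or without "
    "small quantities of additives, blended to form a fuel suitable for use "
    'in spark-ignition engines. "Motor Gasoline" includes conventional '
    "gasoline; all types of oxygenated gasoline, including gasohol; and "
    "reformulated gasoline, but excludes aviation gasoline. "
    "Note: Volumetric data on blending components ...."
)

def analyze_definition(term, text):
    """Split one glossary entry into a few of the fields shown above.

    Hypothetical helper: the rules here are assumptions, not the original system.
    """
    record = {"term": term, "full_definition": text}

    # A trailing parenthetical in the term, e.g. "(Finished)", is a modifier.
    match = re.search(r"\(([^)]+)\)\s*$", term)
    record["paren_modifier"] = match.group(1) if match else None

    # Everything after "Note:" is kept apart from the body of the definition.
    body, sep, note = text.partition("Note:")
    record["note"] = note.strip() if sep else None

    # Simplification: treat the first sentence of the body as the core definition.
    record["core_definition"] = re.split(
        r"(?<=[a-z])\.\s+", body.strip(), maxsplit=1
    )[0]

    # "includes X", "including X", "excludes X": keep X up to the next delimiter.
    record["excludes_includes"] = [
        f"{stem}es {item.strip()}"
        for stem, item in re.findall(
            r"\b(includ|exclud)(?:es|ing)\s+([^;,.]+)", body
        )
    ]
    return record

record = analyze_definition("Motor Gasoline (Finished)", SAMPLE)
for line in record["excludes_includes"]:
    print(line)
```

On this sample the pattern recovers the three Excludes-Includes lines listed above; definitions phrased differently would need more rules.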
6
System Architecture
[Diagram labels: User Interface; Information Access; Metadata mediates; Data Definitions (Ontology); Data Integration; Heterogeneous Data & Meta-data Sources: EPA, Census, Labor, EIA, ???]
7
Using Organized Metadata
• Read definitions into an organized network
of conceptual relationships.
• Use the structure to follow relationships.
• Consider alternate meanings and topics.
• Navigate to data of interest.
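The steps above can be sketched as a small labeled graph, assuming definitions have already been analyzed into term-to-term relations. The class `TermNetwork` and the relation names ("see", "genus", "includes", "excludes") are hypothetical names chosen for illustration; the slides do not specify the ontology's actual representation.

```python
from collections import defaultdict

class TermNetwork:
    """Hypothetical container for relations read from definitions."""

    def __init__(self):
        # term -> list of (relation, target term)
        self.links = defaultdict(list)

    def add(self, source, relation, target):
        self.links[source].append((relation, target))

    def follow(self, term, relation):
        """Return the targets reachable from `term` by one `relation` link."""
        return [t for r, t in self.links[term] if r == relation]

    def resolve_see(self, term):
        """Follow 'see' cross-references until a term has none (or a cycle)."""
        seen = set()
        while term not in seen:
            seen.add(term)
            targets = self.follow(term, "see")
            if not targets:
                break
            term = targets[0]
        return term

# Relations taken from the gasoline definitions quoted earlier.
net = TermNetwork()
net.add("Gasoline", "see", "Motor Gasoline (Finished)")
net.add("Motor Gasoline (Finished)", "includes", "conventional gasoline")
net.add("Motor Gasoline (Finished)", "includes", "gasohol")
net.add("Motor Gasoline (Finished)", "excludes", "aviation gasoline")
net.add("Regular Gasoline", "genus", "Gasoline")
net.add("Regular Gasoline", "see", "Gasoline Grades")

start = net.follow("Regular Gasoline", "genus")[0]   # "Gasoline"
print(net.resolve_see(start))
```

Starting from Regular Gasoline, following its genus and then the "see" link lands on the full Motor Gasoline (Finished) entry.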
A graphical representation ...
8
Gasoline (from EIAGG)
See Motor Gasoline (Finished).
• Start from the defined term.
• Examine its characteristics.
9
Regular Gasoline
Gasoline having an antiknock index, ….
• Follow to a related term.
• Results of reading another definition.
• Follow a link to gasoline as a product: product.gasoline.regular (from SIMS)
10
If it is a measurement, it’s data!
Regular Gasoline
Data
Product.gasoline.regular (from SIMS)
Product-of: EIA-OGIRS-C120050061FM-MEASUREMENT (from SIMS)
11
It’s a SERIES
• From EIA
• For California
• It’s monthly
• It’s seasonally adjusted
• Unit: Thousands of gallons …
12
Data and Metadata, Together
Regular Gasoline, Gallons/Day
A lot of things you want to know about this data.
[Screen labels: Query; History; Data; List of definition components; Data footnotes; Look up in ontology; Sources]
13
Analyzing Other Domains
• Biomedical terminology
• Environmental terms
• Computer technology
14
A Biomedical Definition Read
Cultures and Stocks: Infectious agents and associated biologicals
including: cultures from medical and pathological laboratories; cultures
and stocks of infectious agents from research and industrial laboratories;
waste from the production of biologicals; discarded live and attenuated
vaccines; and culture dishes and devices used to transfer, inoculate, and
mix cultures. (See 'regulated medical waste')
Term: Cultures and Stocks
Source: (source (agency "OTA") (resource "Mapping Our Genes ... OTA-BA-373"))
Cross-reference: 'regulated medical waste'
Full Definition: Infectious agents .... (See 'regulated medical waste')
Core Definition: Infectious agents ... and mix cultures.
Genus Phrase: Infectious agents
Head Genus Word: agents
15
An Environmental Definition Read
baghouse: a dust-collection chamber containing numerous
permeable fabric filters through which the exhaust gases pass.
Finer particulates entrained in the exhaust gas stream are collected
in the filters for subsequent treatment/disposal.
Term: baghouse
Source: (source (agency "EPA") (resource "Terms of Environment") ...)
Full Definition: a dust-collection chamber ... treatment/disposal.
Core Definition: a dust-collection chamber ... gases pass.
Genus Phrase: a dust-collection chamber
Head Genus Word: chamber
Properties:
Excludes-Includes:
contains numerous permeable fabric filters
16
A Computer Definition Read
input class: The class of filters that provide an interface for
HID hardware, including USB and legacy devices, plus
proprietary and other HID hardware, under the WDM HID
architecture.
Skips ‘class’ to find ‘filters’
Acronym
Term: input class
Source: (source (agency "MSC") (resource "Microsoft Glossary") ...)
Full Definition: The class of filters that provide an interface
Genus Phrase: The class of filters
Head Genus Word: filters
Properties:
Excludes-Includes:
includes USB
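A rough sketch of the "skips ‘class’ to find ‘filters’" behavior, assuming the genus phrase ends at the first comma or modifier word and that the head is the noun just before "of". The word lists `CUT_WORDS` and `CONTAINER_NOUNS` are hypothetical; the slides do not give the parser's actual rules.

```python
import re

# Hypothetical word lists, chosen only to cover the examples in these slides.
CUT_WORDS = ("with", "that", "which", "containing", "having", "through")
CONTAINER_NOUNS = {"class", "type", "kind", "group", "set"}

def genus_phrase(core_definition):
    """Keep the text before the first comma or modifier word."""
    pattern = r",|\s+(?:" + "|".join(CUT_WORDS) + r")\b"
    return re.split(pattern, core_definition, maxsplit=1)[0].strip()

def head_genus_word(phrase):
    """Pick the noun before 'of', skipping container nouns such as 'class'."""
    head_part, _, rest = phrase.partition(" of ")
    words = head_part.split()
    if not words:
        return ""
    head = words[-1].lower()
    if head in CONTAINER_NOUNS and rest:
        # "The class of filters": look inside the "of" phrase instead.
        return head_genus_word(rest)
    return head

for definition in (
    "The class of filters that provide an interface for HID hardware",
    "A complex mixture of relatively volatile hydrocarbons with or without "
    "small quantities of additives",
    "a dust-collection chamber containing numerous permeable fabric filters",
):
    phrase = genus_phrase(definition)
    print(phrase, "->", head_genus_word(phrase))
```

The heuristic agrees with the heads shown for these three definitions; coordinated phrases such as "Infectious agents and associated biologicals" would need further rules.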
17
Acronym Identification
Acronym: HID
Acrocat: Human Interface Device
Human Interface Device HID
Hitachi D9000 development
Hue is defined
Acronym: USB
Acrocat: United States Sponsor
USB Universal Serial
USB Universal Serial Bus
United States A
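One simple way to screen candidate expansions like those listed above is to compare word initials with the acronym. This is an illustrative check only, not a description of how Acrocat works; the function names are hypothetical, and the candidate strings are taken from the listing above.

```python
def initials(phrase):
    """First letter of each word, upper-cased: 'Universal Serial Bus' -> 'USB'."""
    return "".join(word[0].upper() for word in phrase.split())

def matching_expansions(acronym, candidates):
    """Keep candidates whose initials spell the acronym exactly."""
    return [c for c in candidates if initials(c) == acronym.upper()]

hid = matching_expansions(
    "HID", ["Human Interface Device", "Hitachi D9000 development"]
)
usb = matching_expansions(
    "USB", ["United States Sponsor", "Universal Serial", "Universal Serial Bus"]
)
print(hid)
print(usb)
```

An exact-initials test is strict: it drops partial candidates such as "Universal Serial" and would miss expansions that contain function words or use more than one letter per word.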
18
Futures
• Refine the parser to draw out more meaning.
• Extend the parser to differently structured metadata, such as data definitions, codebooks, …
• Expand output options to structured documentation, e.g., DDI.
• Support varying user modes/types: novice, student, expert, citizen, …
19
The real question...
What do you need?
20