DG Project Overview - Columbia University

The Energy Data
Collection Project
1
The Vision:
Ask the Government...
We’re thinking of moving to
Denver...What are the
schools like there?
How many people had
breast cancer in the area
over the past 30 years?
Is there an orchestra? An
art gallery? How far are
the nightclubs?
2
How have property values in
the area changed over the
past decade?
Census
Labor
Stats
The Vision:
Ask the Government...
We’re thinking of moving to
Cambridge…How much
does gas cost there?
Which state has the
highest oil production?
How long has the
nuclear plant been in
service?
3
Are alternative energy
sources any cheaper to
use?
Census
Labor
Stats
The problem and the solution
• Problem: FedStats has thousands of databases
in over seventy Government agencies:
– data is duplicated and near-duplicated,
– even Government officials and specialists cannot
find it
• Solution: Create a system to provide easy
standardized access:
– need multi-database access engine,
– need powerful user interface,
– need terminology standardization mechanism.
4
The purpose of DGRC
To Make Digital Government Happen
• Advance information systems research
• Bring the benefits of cutting edge IS research to
government systems
• Help educate government and the community
• Learn needs from government partners to drive next
stage system development
• Build pilot systems as part of new infrastructure
5
Research challenges
• Scale to incorporate many databases
… build data models automatically
• Process large and disparate data efficiently
… develop fast processing techniques
… create aggregation and substitution operators
• Integrate data models across sources and agencies
…take a large ontology and link the models into it automatically
• Incorporate additional information that is available from text
…use language processing tools to extract it
• Display complex information from distributed sources
…develop and evaluate new presentation techniques
6
System Architecture

Construction phase:
• Deploy DBs
• Extend ontol.
User phase:
• Compose query
• Present results
Access phase:
• Create DB query
• Retrieve data

Integrated ontology
- global terminology
- source descriptions
- integration axioms
User Interface
- ontology browser
- query constructor
Query processor
- reformulation
- cost optimization
Databases
- DB analysis
- text analysis
- query substitution
- rapid analysis tools
Sources: Text, Tables, Data
7
Columbia’s Team Approach
• User Interface
– Year One: Hatzivassiloglou, Sandhaus
– Year Two: Feiner, Temiyabutr
• Database Aggregation
– Year One: Gravano, Singla
– Year Two: Ross, Zaman
• Automatic Inter-Agency Ontologies
– Years One and Two: Klavans, Whitman
8
System interface – Year One Progress
Vasileios Hatzivassiloglou
Jay Sandhaus

Components:
1. Query formation
2. Ontology/glossary browsing for concept navigation
3. Answer display, interaction history

GUI incorporates key technologies for facilitating user
access to diverse databases:
– Context-sensitive menu-based input mechanism
– Visualization and navigation of results and the ontology
– Lightweight client runs on multiple platforms without downloads
– Java/Swing implementation allows client-side processing
9
Information Aggregation – Yr. 1 Progress
Luis Gravano
Anurag Singla

Problem: Data is not in exactly the form the user needs
(monthly, not annually; actual values, not averaged)
Solution: Attempt to provide a unified view of data of
various granularities:
– time period
– geographical region
– product
– …

Example over BLS data:
– View: monthly data available for all geographical regions
– Query: monthly prices for LA in 1979
– Answer: yearly price for LA in 1979
10
10
Aggregation challenges
• Different coverage along these dimensions across
data sets
• Users see a simple, unified view of the data; if a
query cannot be answered, we answer the closest
query that we have data for
• Answers are always exact
• Key challenges:
– defining query proximity (default vs. user-specific)
– communicating ‘query relaxation’ to users
– defining and navigating the space of ‘answerable’ queries
efficiently
11
Extracting and Structuring Information
from Definitions – Yr 1
Judith Klavans
Brian Whitman

Problems:
– Proliferation of terms in domain
– Agencies define terms differently
– Many refer to the same or related entity
– Lengthy and dense term definitions often contain
important information which is buried
12
Glossary analysis framework
• Gather glossaries, thesauri, definitions from govt agencies
• Create framework into which text will be analyzed
• Extract ontological information by applying
language-sensitive analysis tools
• Structure and deliver to ISI for access and display
• Based on past projects:
– analysis of definitions in machine-readable dictionaries
• Original – domain-specific glossaries
13
DGRC-EDC Plans for Year Two
• User Interface
– Incorporate new presentation approaches
– Link ontology access mechanisms to query input
– Incorporate other DG research (Marchionini)
• Database
– Integrate existing aggregation prototype
– Main memory for fast performance
• Lexical Knowledge Bases
– Incorporate into SENSUS
– Add web crawler to extend coverage
– Develop mechanisms to merge definitions
14
End of Part I : DGRC – EDC
• Reviewed goals of DGRC Energy Data
Collection Project
• Showed first year progress
• Gave early second year results
• Presented Columbia’s team approach
• Set out future goals
But what is next?
15
Next Steps for DGRC Growth
• Ambitious two-pronged plan:
– Additional funding for DGRC – TRADE (NSF)
– Independent foundation funding (leverage NSF investment)
16
One Facet: From DGRC-EDC to
DGRC-TRADE
• Builds on past successes
• Brings in a new domain – trade data
• Adds three new enhancements
– User Needs and Evaluation
• Electronic Data Service at Columbia
• Users and Experts to test usefulness and usability
– Database – incorporate cross data set aggregation
– Ontology – add multilingual capability
17
[Diagram, slide 18: heterogeneous data sources (EPA, Census, Labor, EIA)
feed data integration, a definition ontology, information access, and the
user interface]

[Diagram, slide 19: the same architecture extended for TRADE, adding
multilingual access, task-based evaluation, main-memory query processing,
trade data, and user evaluation]
Columbia’s Electronic Data Service
• Established to serve social science
researchers
• Operational unit of the Libraries
• Excellent relationship with faculty, staff and
students
• Capable of supporting many levels of
development and testing
• Evaluation effort led by Walter Bourne
20
Partners – DGRC Trade
• Evaluation experts from the US and Canada
– Cognitive evaluation
– User needs evaluation
– User interface evaluation
• Social scientists
– ISERP and CIESEN at Columbia
– Public Health
– Policy research
21
Facet Two: Building the DGRC
• Seek substantial Foundation support
• Pursue a large vision
• Involvement of high level Columbia and ISI
administration
• Gather an advisory board to develop a
sustainable plan
22
What do we need from the NSF?
1. Information
– Ways to interact with portals
• E.g. firstgov.com
• Private companies delivering (free) government data
2. Contacts
– Leverage peer-review process of NSF to
establish key contacts
23
To Sum
• DGRC – Energy Data Collection (EDC)
– Progress from Year One
– Plans and early results from Year Two
• Larger Plans for Growing DGRC
– Trade Proposal – NSF
– Plans for other funding
24
Today’s Plan: Focus on DGRC-EDC
Major research challenges:
• Building and structuring the ontology
• Automated data aggregation
• Presentation of complex information
Major practical challenges:
• Getting more data into the system
• Understanding users’ needs
25
Thank you!
Any questions?
26
Information Integration:
Heterogeneity in Aggregation
Luis Gravano
Assistant Professor, Columbia U.
(joint work with Anurag Singla and Vasilis Vassalos)
27
Information Integration
Data Sets/Sources:
Tables with statistical data, potentially
produced by different organizations
Goal:
To Provide Single-Stop Access to Multiple
Distributed Autonomous Data Sets
28
My Research Background
• Databases
• Distributed search and retrieval over text
sources
29
Metasearchers: Single-Stop Access to
Heterogeneous Text Sources
Source 1
Query
User
Unified
Results
Meta
Searcher
Source 2
...
Source n
30
Main Metasearcher Tasks
• Selects good text sources for query
(source discovery)
• Evaluates query at these sources
(query translation)
• Combines query results from sources (result
merging)
31
Some of my Previous Work on Metasearchers
• GlOSS: a scalable source discovery system that
selects relevant text sources
• STARTS: a protocol that facilitates metasearching
(Participants included Infoseek, Microsoft, Hewlett-Packard, Fulcrum, Verity, and Netscape.)
32
Challenges for
Information Integration
• “Semantic” Heterogeneity of Data Sets
• “Syntactic” Heterogeneity of Data Sets
• Varying Granularity of Data Sets
• Varying Data Coverage
• Number of Available Data Sets
33
Challenges for
Information Integration
• “Semantic” Heterogeneity of Data Sets (ISI’s SIMS)
• “Syntactic” Heterogeneity of Data Sets (ISI’s SIMS)
• Varying Granularity of Data Sets
• Varying Data Coverage
• Number of Available Data Sets (future work)
34
Challenges for
Information Integration
• “Semantic” Heterogeneity of Data Sets
• “Syntactic” Heterogeneity of Data Sets
• Varying Granularity of Data Sets (last year’s focus)
• Varying Data Coverage (last year’s focus)
• Number of Available Data Sets
35
Mediators: Single-Stop Access to
Heterogeneous Statistical Sources
[Diagram: the user’s query goes to a mediator, which returns unified
results drawn from a main-memory DBMS, a traditional DBMS, and other
sources]
36
Varying Data Coverage and
Granularity
• Time period
• Geographical region
• Products
Average Price of Gasoline from BLS
37
Varying Data Coverage (I)
Region : US Average
– Product : Leaded Regular Gasoline
• Time Period: Oct 1973 to Mar 1991
• Source: BLS Series APU000074712
– Product: Leaded Premium Gasoline
• Time Period: Oct 1973 to Dec 1983
• Source: BLS Series APU000074713
38
Varying Data Coverage (II)
Product: Leaded Regular Gasoline
– Region: San Diego, CA
• Time Period : Jan 1978 to Dec 1986
• Source: BLS Series APUA42474712
– Region: Boston, Massachusetts
• Time Period : Jan 1978 to Jan 1989
• Source: BLS Series APUA10374712
39
Varying Data Coverage (III)
• Geographical coverage varies for different data
fields
(even for same gasoline type)
• Not all data fields available for all gasoline types
(e.g., Consumer Price Index available for
Unleaded Regular but not for Leaded Premium)
40
Varying Data Granularity
Granularity “hierarchies” for:
– Time period
– Geographical region
– Products
41
Granularity Hierarchy for Time Period
Year – Quarter – Week – Month – Day

Granularity Hierarchy for Geographical Region
World – Country – Region (spanning cities or states) – State – City

Granularity Hierarchy for Products
Gasoline
– Leaded Gasoline (Premium, Midgrade, Regular)
– Unleaded Gasoline
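These hierarchies can be sketched as simple parent maps. This is an illustrative reading only (a single-parent simplification; the real hierarchies may be lattices, e.g. Week vs. Month as alternative paths under Year):

```python
# Sketch only: one single-parent reading of the slide's granularity
# hierarchies. Parent of None marks the coarsest level.
TIME = {"Day": "Month", "Month": "Quarter", "Quarter": "Year",
        "Week": "Year", "Year": None}
REGION = {"City": "State", "State": "Region", "Region": "Country",
          "Country": "World", "World": None}

def coarser_levels(hierarchy, level):
    """All levels at or above `level`, finest first."""
    out = []
    while level is not None:
        out.append(level)
        level = hierarchy[level]
    return out
```

Walking such a map upward is the basic move behind rolling a query up to a coarser granularity.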
Some BLS Data Sets for our Demo
(Gasoline Unleaded Regular, Average Price)
• US; Monthly; 10/1973 to 3/1991
Source: APU000074712
• San Diego; Monthly; 1/1978 to 12/1986
Source: APUA42474712
• Los Angeles; Monthly; 1/1986 to 4/1991
Source: APUA42174712
• Los Angeles; Yearly; 1978 to 1985
Source: APUA42174712 (aggregated)
45
What Do We Show Users as Data Sets
Available for Querying?
46
What Do We Show Users as Data Sets
Available for Querying?
Possibility 1: All the details!
• US; Monthly; 10/1973 to 3/1991
• San Diego; Monthly; 1/1978 to 12/1986
• Los Angeles; Monthly; 1/1986 to 4/1991
• Los Angeles; Yearly; 1978 to 1985
47
What Do We Show Users as Data Sets
Available for Querying?
Possibility 1: All the details!
Advantages: Users can exploit all data sets
48
What Do We Show Users as Data Sets
Available for Querying?
Possibility 1: All the details!
Advantages: Users can exploit all data sets
Disadvantages: …if they don’t get overwhelmed first.
49
What Do We Show Users as Data Sets
Available for Querying?
Possibility 2: “Least common denominator” of data
sets
E.g., “only yearly data available”
50
What Do We Show Users as Data Sets
Available for Querying?
Possibility 2: “Least common denominator” of data
sets
Advantages: Users get a unified view of the data.
51
What Do We Show Users as Data Sets
Available for Querying?
Possibility 2: “Least common denominator” of data
sets
Advantages: Users get a unified view of the data.
Disadvantages: Almost nothing is left!
52
What Do We Show Users as Data Sets
Available for Querying?
Possibility 3 (our approach): Define a reasonably
expressive, unified view
53
Our Approach
• Users have a simple, unified view of the data.
• If a query cannot be answered, we answer the
closest query that we have data for.
• Answers are always exact.
54
Example over BLS Sources
• View: monthly data available for all
geographical regions
• Query: monthly prices for LA in 1979
• Answer: yearly price for LA in 1979
55
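The relaxation behind this example can be sketched as follows. The coverage ranges and the proximity rule here are hypothetical stand-ins (coarsen the time granularity until a covering data set exists), not the project's actual proximity measure:

```python
# Sketch of "answer the closest query we have data for" over the BLS
# example. Coverage ranges are illustrative simplifications.
COVERAGE = {
    ("Los Angeles", "monthly"): range(1986, 1992),
    ("Los Angeles", "yearly"):  range(1978, 1986),
    ("San Diego",   "monthly"): range(1978, 1987),
    ("US",          "monthly"): range(1973, 1992),
}
RELAX = {"monthly": "yearly", "yearly": None}  # coarsen time granularity

def closest_query(region, granularity, year):
    """Return the nearest answerable (region, granularity, year), if any."""
    g = granularity
    while g is not None:
        if year in COVERAGE.get((region, g), ()):
            return (region, g, year)
        g = RELAX[g]
    return None
```

Under these assumed ranges, a request for monthly LA prices in 1979 relaxes to the yearly LA series, matching the slide's example.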
What Do We Show Users as Data Sets
Available for Querying?
Possibility 3 (our approach): Define a reasonably
expressive, unified view
Advantages: Users get a unified view of the data; most data sets exploited.
Disadvantages: Sometimes user queries cannot be answered.
56
Key Challenges
• Defining query proximity
(“default” vs. user-specific)
• Communicating “query relaxation” to users
• Defining and navigating the space of
“answerable queries” efficiently
57
Proof-of-Concept Demo
• Four BLS sources
• Simple integrated view
• Results for “closest” query when original
answer cannot be computed
http://db-pc01.cs.columbia.edu/digigov/Main.html
59
Some Open Issues
• Definition of “right view”
• Interaction with user interface
• Addition of aggregation into ISI’s SIMS
system
60
Aggregation in Main Memory
Kenneth A. Ross
Kazi A. Zaman
Columbia University, New York
61
Research Experience
• Complex query processing
• Data Warehousing
• Main memory databases
62
[Diagram, repeated from slide 36: query → mediator → unified results,
over a main-memory DBMS, a traditional DBMS, and other sources]
63
Outline
• Introduction to Datacubes
• Frameworks for querying cubes
• The Main Memory based framework
• Experimental Results
• Conclusions and Plan
64
The CUBE BY Operator

Input table:
State  Year  Grade    Sales
CA     1997  Regular  90
NY     1997  Premium  70
CA     1998  Premium  65
NY     1998  Premium  95

CUBE BY (sum Sales) adds aggregate records with ALL’s, e.g.:
CA     1997  ALL      90
CA     ALL   ALL      155
ALL    1997  ALL      160
ALL    ALL   ALL      320
…

Large increase in total size,
especially with many dimensions
65
Lattice Representation
State
Year
Grade
State, Year
State, Grade
Year, Grade
State, Year, Grade
66
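The operator above can be sketched in a few lines: every base record contributes to one aggregate cell per subset of dimensions replaced by ALL (2^d cells for d dimensions). A toy sketch, not production cube code:

```python
from collections import defaultdict
from itertools import product

# Toy table from the CUBE BY slide: (State, Year, Grade, Sales).
ROWS = [("CA", 1997, "Regular", 90), ("NY", 1997, "Premium", 70),
        ("CA", 1998, "Premium", 65), ("NY", 1998, "Premium", 95)]

def cube(rows):
    """CUBE BY (sum Sales): sum the measure into every ALL-combination."""
    totals = defaultdict(int)
    for *dims, sales in rows:
        for mask in product([False, True], repeat=len(dims)):
            key = tuple("ALL" if m else d for d, m in zip(dims, mask))
            totals[key] += sales
    return dict(totals)
```

This also makes the size blow-up concrete: 4 base records expand into up to 4 × 2^3 cube cells before deduplication.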
Modeling Queries
Slice Queries ask for a single aggregate record

SELECT    State, year, sum(sales)
FROM      BLS-12345
GROUP BY  State, year
HAVING    State = “NY” AND year = “1998”
67
Existing Frameworks
Choose a subset of the cube to materialize on disk based on workload.
The appropriate record is recovered or computed for an incoming slice query.

[Lattice: State; Year; Grade; (State, Year); (State, Grade);
(Year, Grade); (State, Year, Grade)]

Drawbacks:
– Ignores clustering of relation on disk.
– Smallest unit of materialization is too big.
68
Our approach
The full cube is often larger than available memory, but
the finest granularity aggregate may fit.
Any record can be computed without having to go to disk.
How should the finest granularity be organized?
69
Framework
[Diagram: a query q first probes the Level-1 Store (selected coarse
records in a hash table); on a miss it goes through the slot directory
of the Level-2 Store to the finest granularity cuboid, whose records
sit in linked lists]
70
The Level-1 Store
Records are <Key, Value> pairs stored in a hash table.
Records can contain ALL’s.

Key  Value
a1   55
b2   34
c2   12
…    …

Given query Q, form composite key and check level-1 store (constant time).
If not found, use level-2 store.
71
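The probe can be sketched directly (toy keys and values, not the system's records):

```python
# Sketch of the two-level probe: the level-1 store holds selected coarse
# aggregates keyed by composite key; a miss falls through to level-2.
level1 = {("a1", "ALL", "ALL"): 55,
          ("ALL", "b2", "ALL"): 34,
          ("ALL", "ALL", "c2"): 12}

def lookup(query, level2_fallback):
    try:
        return level1[query]           # constant-time hash probe
    except KeyError:
        return level2_fallback(query)  # aggregate from finest granularity
```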
The Level-2 Store
The slot directory is organized as a multidimensional array:
level2[sz1][sz2][sz3][sz4]
Each slot points to a linked list of finest-granularity cuboid records.
Records are placed according to a set of mapping functions H.
72
Using the Level-2 store
Query Q without ALL’s:
(a3, b4, c2, d5) → Slot 4, Slot 3, Slot 7, Slot 1
Access the list denoted by level2[4][3][7][1];
aggregate those records matching (a3, b4, c2, d5).
73
Using the Level-2 store
Query Q with ALL’s:
(a3, ALL, c2, ALL) → Slot 4, list of slots, Slot 7, list of slots
Access the lists matching level2[4][*][7][*];
aggregate those records matching (a3, *, c2, *).
74
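The slot-directory lookup with wildcards can be sketched in two dimensions. The mapping functions and array sizes here are illustrative, not the paper's tuned choices:

```python
from itertools import product

# Sketch: a 2-d level-2 store. A concrete value maps to one slot; an ALL
# expands to every slot along that dimension.
SIZES = (4, 4)                                # level2[sz1][sz2], toy sizes
H = [lambda v: hash(v) % 4, lambda v: hash(v) % 4]   # assumed mapping fns
slots = {s: [] for s in product(range(4), range(4))}

def insert(record):                           # record = (dims..., measure)
    *dims, _ = record
    slots[tuple(h(d) for h, d in zip(H, dims))].append(record)

def query(q):                                 # q uses "ALL" as a wildcard
    axes = [range(sz) if v == "ALL" else [h(v)]
            for v, h, sz in zip(q, H, SIZES)]
    total = 0
    for slot in product(*axes):               # visit matching slots only
        for *dims, m in slots[slot]:
            if all(qv in ("ALL", dv) for qv, dv in zip(q, dims)):
                total += m
    return total
```

A query with k ALL's visits the cross product of full ranges in the wildcard dimensions and single slots elsewhere, which is exactly the T^(k/d) behavior analyzed later in the talk.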
Experimental Results
Query Processing Time vs Additional Memory Used
(real dataset, 10^6 records, 8 dimensions)
[Chart: average time per query in milliseconds (0–15) vs. additional
memory used in MB (0–80); query cost falls as memory grows]
Scanning all records takes 194 ms.
75
Importance of Work
• Aggregation is fundamental to analysis.
• Make analysis interactive.
• Make a variety of aggregate granularities available, where possible.
76
Contributions and Plan
• A Main Memory based framework for
answering datacube queries efficiently.
• Query Performance in the 2-4 ms range
which is more efficient than going to disk.
• Goal: Integrate within Columbia/ISI system
to facilitate interactive analysis.
77
Experimental Results
Distribution of tuples in Level-1 store
[Chart: number of tuples in the Level-1 store (0–1,200,000) vs.
additional memory used in MB (0–80), broken down by records with
4, 5, 6, 7, and 8 ALL’s]
78
Workload Based Distribution
Each possible query record is assigned a probability.
(Nonzero probability for some records not in cube.)
Uniform Cuboid:
Each cuboid has equal probability
Each record in cuboid has same probability
Count Based:
Each cuboid has equal probability
Probability proportional to count of record
79
Existing Cost Models
Linear Cost Model:
Cost proportional to number of records read.
If cuboid is not materialized, use smallest materialized
ancestor.
Drawbacks:
Ignores Clustering of Relation on disk
Smallest unit of materialization is a cuboid
80
Design Decisions
• Mapping functions H
• level2[sz1][sz2][sz3][sz4] : choice of sz
values.
• Size of level-1 store vs level-2
• Choice of level-1 records
81
Choice of Mapping Functions
Mapping functions aim for uniform
distribution.
We know the single attribute
distribution in advance.
Exact problem is intractable,
use heuristics.
82
Example
Attribute  Frequency
a1         5
a2         10
a3         5

If the range size is 2, a2 maps to one slot, and a1 and a3 to the other.
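One simple heuristic that reproduces this example is greedy bin packing: assign the heaviest values first, each to the currently lightest slot. This is a sketch of the idea, not the paper's actual heuristic:

```python
# Sketch: balance attribute values across range slots so record counts
# per slot come out roughly uniform (the exact problem is intractable).
def balance(freqs, nslots):
    """freqs: {value: count}. Returns {value: slot index}."""
    load = [0] * nslots
    mapping = {}
    for value, count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        slot = min(range(nslots), key=load.__getitem__)  # lightest slot
        mapping[value] = slot
        load[slot] += count
    return mapping
```

On the slide's frequencies {a1: 5, a2: 10, a3: 5} with two slots, a2 lands alone and a1, a3 share the other slot.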
Level-2 choice of Range Sizes
Level 2 slot array: level2[sz1][sz2][sz3][sz4]
Given slot array size T:
T = sz1 × sz2 × sz3 × sz4
If all cuboids are equiprobable, pick uniform range sizes:
sz1 = sz2 = sz3 = sz4 = T^(1/4)
In general, a nonlinear integer programming problem.
83
Optimizing Slot Array Size
If T is too big,
Too many empty slots are checked.
Less space available for level-1 store.
If T is too small,
Too many records examined for each query.
84
Cost of using Slot Array
T: number of slots
n: number of finest granularity records
s: cost of slot access
l: cost of list element access
d: dimensionality of dataset
A query with k ALL’s accesses T^(k/d) slots, at cost:
T^(k/d) (s + nl/T)
85
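The formula can be unpacked from the definitions above (a reconstruction consistent with the slide's symbols): with uniform range sizes $sz_i = T^{1/d}$, a query with $k$ ALL's leaves $k$ dimensions unconstrained and fixes one slot in each of the remaining $d-k$, so it visits $\left(T^{1/d}\right)^{k} = T^{k/d}$ slots; each visited slot costs one directory access $s$ plus, on average, $n/T$ list elements at cost $l$ each:

```latex
\mathrm{cost}(k) \;=\; T^{k/d}\left(s + \frac{n\,l}{T}\right)
```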
Average Cost of level-2 store
p_i: probability of record i being queried
k_i: number of ALL’s in record i
Average cost of lookup over all non-materialized records:
A = Σ_{i: records not in level-1 store} p_i T^(k_i/d) (s + nl/T)
86
Benefit of a level-1 record
p_i: probability of record i being queried
k_i: number of ALL’s in record i
Expected benefit of materializing record i:
B = p_i T^(k_i/d) (s + nl/T)
Exponential in k_i, linear in p_i
87
The Tradeoff Equation
Given unit extra space, do we increase the slot array size
or that of the level-1 store?
Pick the option which provides the greater average reduction
in query time.
Level-1 Benefit: B
Level-2 Benefit: A′ (dA/dT)
88
The Tradeoff Equation
Level-1 Storage cost : q
Level-2 Slot Size: z
Benefit per unit space
Level-1 : B/q
Level-2 : -A’/z
Allocate memory to obtain larger benefit.
Repeat for next unit of memory (parameters have changed)
89
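The greedy allocation loop on this slide can be sketched directly. The benefit functions and unit costs below are hypothetical stand-ins for the slide's quantities (B/q for level-1, -A'/z for level-2), chosen only to show the mechanics:

```python
# Sketch: each extra unit of memory goes to whichever store currently
# offers the larger benefit per unit space; parameters change after
# each allocation, so the comparison is repeated per unit.
def allocate(units, level1_benefit, level2_benefit, q, z):
    """Return (units given to level-1 store, units given to slot array)."""
    l1 = l2 = 0
    for _ in range(units):
        if level1_benefit(l1) / q >= level2_benefit(l2) / z:
            l1 += 1            # grow the level-1 store
        else:
            l2 += 1            # grow the slot array
    return l1, l2
```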
Experimental Setup
8 dimensional dataset of Cloud coverage data
64 bits for CUBE BY attributes
32 bits for aggregate
1015367 base records (12 Mbytes)
Size of total datacube = 102745662 records (1.2 Gbytes)
Algorithms implemented in C
300 MHz Sun Ultra-2 running Solaris 5.6
Results shown for count based distribution
90
Experimental Results
Size of Slot Array
[Chart: slot array size in MB (0–3) vs. additional memory used in MB (0–100)]
91
Experimental Results
Size of Level-1 Store
[Chart: size of the level-1 store in MB (0–120) vs. additional memory
used in MB (0–100)]
92
Experimental Results
Levelwise Breakup of tuples in level-1 store
[Chart: percentage of tuples per level (0–100) vs. additional memory
used in MB (0–100), by number of ALL’s (4–8)]
93
Experimental Results
Update Costs
[Chart: average update cost in milliseconds (0–3) vs. additional memory
used in MB (0–100), for “Cuboid Info” and “Independent” strategies]
94
See paper for…
• More details on updates.
• Hierarchies in attributes.
• Range queries.
• More experiments.
95
User Interfaces for
DGRC-EDC
Steven Feiner
Surabhan Temiyabutr
Department of Computer Science
Columbia University
New York, NY 10027
Supported by NSF Grant EIA-9876739
96
Approach
• Redesign current UI
– Heuristic analysis and informal experiments
• Formal experiments and feedback
97
Redesign: First Steps
98
Redesign: Next Steps
• Potential problem areas
– Query
• Alleviate “peep-hole” confusion of walking menu
– Results
• May interface with Marchionini et al. table browser
– Ontology
• Explore graph presentation strategies: layout, distortion
viewing (e.g., fisheye), hierarchy, filtering
99
Redesign: Next Steps
• Potential problem areas (cont.)
– History
• Support reuse and modification of previous queries
– Metainformation
• Determine utility and presentation approaches
– Integration
• Maintain consistency/linkage across displays
– Substitution
• Leverage ontology
100
Experiments
• Design/perform/analyze formal user
experiments at BLS et al.
• Feed back experimental results to UI design
101
Extracting Information from Domain
Specific Glossaries
Judith L. Klavans
Brian Whitman
102
System Architecture
(repeat of the Part I architecture slide: construction phase deploys DBs
and extends the ontology; user phase composes queries and presents
results; access phase creates DB queries and retrieves data, drawing on
the integrated ontology, databases, user interface, and query processor)
103
Extracting and Structuring Metadata
from Text
Judith Klavans, Dir of CRIA, Columbia
Brian Whitman, GRA, Columbia

Problems:
– Proliferation of terms in domain
– Agencies define terms differently
– Many refer to the same or related entity
– Lengthy and dense term definitions often contain
important information which is buried
104
Gasoline: Sample Definitions
• Gasoline:
A volatile mixture of flammable liquid hydrocarbons
derived chiefly from crude petroleum and used principally
as a fuel for internal-combustion engines and as a solvent,
an illuminant, and a thinner.
(The American Heritage® Dictionary of the English Language, Third Edition)
• Gasoline:
See regular gasoline.
(Energy Information Administration, Gasoline Glossary 2000)
105
[Figure build-up across slides 106–110: definitions flow from the data
sources, through LKB analysis, into the large ontology (SENSUS) at the
core]

Definitions from the data sources:
Regular Gasoline: Gasoline having an antiknock index, i.e., octane
rating, greater than or equal to 85 and less than 88. Note: Octane
requirements may vary by altitude. See Gasoline Grades.
(EIA Edited Gasoline Glossary)
Motor Gasoline (Finished): A complex mixture of relatively volatile
hydrocarbons with or without small quantities of additives, blended to
form a fuel suitable for use in spark-ignition engines.
(EIA Online Energy Glossary)

The corresponding SENSUS entry at the core:
gasoline { petrol [N], gas [N], gasolene [N] }

The concept extracted from the glossaries by the LKB:
(Regular Gasoline (source …)
  (xref "Gasoline Grades") (full-def … )
  (core-def … )
  (genus-phrase "gasoline")
  (head-word "gasoline")
  (properties
    (contains "an antiknock index"))
  (quantifiers (less-than "88"))
  (note … ))

Full picture: data sources feed domain-specific ontologies (SIMS models)
and LKB-extracted concepts, linked into SENSUS by linguistic and logical
mappings.
110
Glossary analysis framework
• Gather glossaries, thesauri, definitions from govt agencies
• Create framework into which text will be analyzed
• Extract ontological information by applying
language-sensitive analysis tools
• Structure and deliver to ISI for access and display
• Based on past projects:
– analysis of definitions in machine-readable dictionaries
• Original – domain-specific glossaries
111
Columbia’s Lexical Knowledge
Base (LKB) Tool
Combines statistical and
linguistic methods:
• identifies topics with high
accuracy
• provides complete coverage
• useful for any subject area
• produced over 6,000 concepts
in current domain
112
Extraction of Information from
Domain Specific Glossaries
• Year One
– Built a definition analyzer using a combination of novel and
known techniques
– Analyzed and structured over 6000 entries from 4 sites
– Tested on medical glossaries also
• Year Two
– Build a crawler to identify glossaries across sites
– Analyze additional data
– Evaluate nodes by social science experts
– Link our output to ISI’s ontology (SENSUS)
– Develop first-stage merging representation structure
113
Thank you!
Any questions?
114
Lexical Knowledge Base Generation
from Glossaries
CARDGIS / Digital Government Project
Brian Whitman
Columbia University
115
The Problem
• Unstructured glossaries provide needed
information
• Create a common ontology across many
datasets
116
An Example from EIA
• Input
Motor Gasoline Blending Components: Naphthas (e.g., straight-run gasoline,
alkylate, reformate, benzene, toluene, xylene) used for blending or compounding into finished motor
gasoline. These components include reformulated gasoline blendstock for oxygenate blending
(RBOB) but exclude oxygenates (alcohols, ethers), butane, and pentanes plus. Note: Oxygenates are
reported as individual components and are included in the total for other hydrocarbons, hydrogens,
and oxygenates.
Output
(Motor Gasoline Blending Components
(isa "Naphthas")
(used-for "blending")
(!contains "oxygenates")
)
117
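A definition-to-attributes transformation like the one above can be sketched with simple patterns. This is an illustration of the predefined-attribute idea, not the ALKB implementation; the patterns and attribute names are assumptions:

```python
import re

# Sketch: pull a few predefined semantic attributes out of a glossary
# definition. Real ALKB analysis uses parsing, tagging, and NP
# identification rather than bare regexes.
PATTERNS = {
    "used-for": re.compile(r"used for (\w+)"),
    "!contains": re.compile(r"exclude[s]? (\w+)"),
}

def extract(definition):
    out = {}
    for attr, pat in PATTERNS.items():
        m = pat.search(definition)
        if m:
            out[attr] = m.group(1)
    return out
```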
In the Way
• Lack of standards
• Complex input
• Automatic extraction
• Acronyms
118
The ALKB System
Components: ALKB, SLKB, Acrocat
119
What’s a LKB?
• “Highly structured isomorph of a published dictionary”
– Klavans, Boguraev, Byrd (1990)
• Definition → Structured tree
120
LKB Example
[Diagram: LKB entry for “gasoline”, with senses from the American
Heritage dictionary (Sense 1; Sense 2, pronunciation gas’e-len’;
used for: fuel for internal combustion engines, propellant; cross-ref)
and senses from the Columbia Encyclopedia]
121
Step-by-step Process - Demo One
• Parses
• POS Tagging
• NP Identification
• Bigram frequency
• Two attribute types
– predefined
– automatic
122
Demo One
123
Predefined Semantic Attributes
• Developed after an analysis of the source
material
• Examples:
– contains, includes, excludes
– less than, greater than, more than
– used for
124
Automatic Semantic Attributes
• Also uses the probability material to
determine additional attributes
Head Term:
Motor Gasoline (Finished)
Cross-reference
Genus Term:
A complex mixture of relatively volatile hydrocarbons
Head Genus Word:
mixture
Properties
for use in: spark-ignition engines
Excludes-Includes:
includes conventional gasoline
Acronym: ASTM [list]
Data on: all types, aviation gasoline, gasohol, gasoline
finished motor: gasoline, blending components
In data: aviation gasoline, gasohol
125
Acrocat - Acronym Cataloguer
• Glossaries full of acronyms
• Acrocat ‘dereferences’
• Guesses difficult acronyms
– RBOB = reformulated gasoline blendstock for
oxygenate blending
126
What Acrocat Does
• Salience measures:
– Distance from named reference
– Capital letter match
– Length
– Crawls all pages within a domain
127
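The capital-letter and length measures can be sketched as a scoring function. The weights and the scoring scheme below are assumptions for illustration, not Acrocat's actual measures:

```python
# Sketch: score a candidate expansion against an acronym using a
# capital-letter match and a length check (hypothetical weights).
def salience(acronym, candidate):
    initials = "".join(w[0].upper() for w in candidate.split())
    score = 0
    if initials == acronym.upper():
        score += 2                  # exact letter-per-word match
    elif acronym.upper() in initials:
        score += 1                  # partial match
    if len(candidate.split()) == len(acronym):
        score += 1                  # one word per acronym letter
    return score
```

Difficult cases like RBOB = "reformulated gasoline blendstock for oxygenate blending" defeat a naive initial-letter match, which is why Acrocat also uses distance-from-reference and domain crawling.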
Demo Two
128
Future Work
• ALKB:
– Integration of data into ISI ontology
– Research on the semantic attribute set
– Tests over non-glossary data
• Acrocat:
– Evaluation
– Building with known dictionaries
129
EDS
October 30, 2000
Walter Bourne
Assistant Director
Academic Information Systems (AcIS)
walter@columbia.edu
130
EDS History & Operation
• BASR (Bureau of Applied Social Research), ’40s
• DARTS (Data Archive) est’d 1970’s as part of the
Center for the Social Sciences.
• EDS replaced DARTS in 1992.
• EDS is a joint operation of the Libraries and AcIS
(Academic Information Systems).
• 4 full-time staff; 4 graduate assistants.
• 10 librarians in Social Science Division provide
extended subject expertise.
131
EDS Services
• Data Library
• Data Finding
• Data Access
• Data Consulting
• Data Acquisition
• Statistical Programming Assistance
• Instructional Assistance
132
Who Uses EDS
[Bar chart: visits/contacts at EDS by discipline, 0–350; disciplines
include Economics, Political Science, International, Social Work,
Health, Sociology, Teachers, Urban, and Other]
• EDS serves a wide variety of disciplines.
• 1,089 contacts were recorded in the past year.
• Economics and Political Science are the most frequent users.
• Others: undergrads, Journalism, Statistics, …
133
Data Library
• 1,285 studies are maintained online.
• 60 GB of data, including many gzipped files.
• A variety of sample and access programs for
SPSS and SAS.
• A library of codebooks and manuals.
• Extensive local how-to documentation,
http://www.columbia.edu/acis/eds
134
Data Access
The EDS DataGate
• Full-text search over abstracts and titles of
studies.
• Abstracts are from ICPSR (Inter-University
Consortium for Political and Social Research).
• System reflects current status of data by
nightly update from the files on disk.
• A combination of an SQL DB, indexer, and
CGI scripts.
135
The EDS DataGate
•
•
•
•
•
The principal finding tool.
Originally developed in 1994.
Uses OpenText’s PAT indexing engine.
Ingres is current DBMS.
cron scripts update DB and Web nightly.
• http://www.columbia.edu/acis/eds/dgate.html
136
EDS and DGRC
1. Evaluation
• EDS provides access to a pool of data users
(seekers?) of varying sophistication.
• EDS and Libraries staff have decades of
experience doing and supporting social
science research.
• An ideal combination for DGRC evaluation.
137
EDS and DGRC
2. Futures
• DGRC vision promises relief to EDS users
and staff from the tough job of finding,
preparing and analyzing data.
• Users will be able to concentrate on their
analysis.
• Staff will be able to work on improving
DGRC tools; incorporating their expertise.
138
The DGRC First-Year User Interface
Vasilis Hatzivassiloglou
Jay Sandhaus
139
Goals
• Provide a means for uniform access to
heterogeneous, distributed statistical
databases.
• Communicate in real time with the ontology
and information mediator (SIMS) over the
internet.
140
Tasks for First Year
• Select appropriate platform that combines
ease of access, interactivity, and capability
of advanced graphical displays.
• Develop communication APIs between the
ontology and the UI and between the
information manager and the UI.
• Design and implement prototype interfaces.
141
Interface platform
• A tradeoff between computational
capabilities and accessibility to the casual
user
• We experimented with several prototypes:
– Stand-alone application
– JAVA AWT applet
– JAVA Swing applet
142
Two layers
• Application layer
– Communicate with information manager using
HTTP
– Combine information from multiple relational
tables (joins, identification of common
information)
• Presentation layer
– What the user sees
143
User Interface Components
• Query specification
• Error handling
• Ontology browsing
• Information display
• Representation of integrated information
144
User Interface Visual Structure
• Four main panels:
– Query specification + error handling
– Result display
– Ontology navigation and display
– Help and documentation
• User can switch between panels at any time.
145
Query Specification
• Three modes considered
– Restricted natural language input
– Direct SQL entry (for expert users)
– Guided selection of terms from context-sensitive menus
(e.g., product, time period, location)
• In all cases, an SQL query is formulated and
sent to the information manager (SIMS)
146
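The guided-menu mode amounts to assembling menu selections into the SQL sent to SIMS. A minimal sketch, with hypothetical table and column names (the real schema comes from the integrated ontology):

```python
# Sketch: turn context-sensitive menu selections (product, location,
# time period) into an SQL query string. Names are illustrative only.
def build_query(product, location, year):
    return ("SELECT month, price FROM gasoline_prices "
            f"WHERE product = '{product}' "
            f"AND location = '{location}' AND year = {year}")

q = build_query("Unleaded Regular", "Los Angeles", 1979)
```

A production version would use parameterized queries rather than string interpolation; the point here is only the menu-to-SQL mapping.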
Error Handling
• Syntactic checks (SQL and NL)
• Terms in the ontology (highlight unknown
terms, allow browsing of the ontology for
replacement)
• User can consult the ontology from any
panel with a simple right-click on any term
147
Ontology browsing
• Graphical display of the ontology as a
tree/directed graph
• Navigation capabilities for parent, children,
and other related nodes
• Display of information associated with each
node (e.g., definitions, source)
148
Result display
• Display of integrated result as a table
• Ability to refer back to source documents
• Display of extracted footnotes in separate
area
• Display of relevant ontology terms
149
Data Granularity
• An issue that cuts across the interface and
data integration
• Open questions:
– Do we impose a unified view?
– Do we allow the user to ever give a query with
no answer available?
150
Other interface components
• What to show from the system’s integration
operations?
• Possibilities:
– Data granularity level selected
– Sources relevant to the query
– Ontology terms used in answering the query
151
Current interface status
• Prototypes as stand-alone application and JAVA
AWT and Swing applets
• Input: Menu navigation, SQL query, simple one-term query
• Output: Table returned over the internet by SIMS,
along with associated displays of footnotes,
sources, etc.
• Ontology browsing and display of properties
• SQL query construction and data exchanges with
SIMS shown in log window
152