2 page description of DGRC

The Energy Data Collection Project (EDC)
José Luis Ambite*, Yigal Arens*, Walter Bourne†, Steven Feiner†,
Eduard Hovy*, Judith Klavans†, Andrew Philpot*, Ken Ross†, Sal Stolfo†
* Digital Government Research Center
Information Sciences Institute
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292-6695
arens@isi.edu
† Digital Government Research Center
Center for Research on Information Access
Department of Computer Science
Columbia University
535 West 114th Street, MC 1103
New York, NY 10027
klavans@cs.columbia.edu
Goals
The massive amount of statistical and text data available from government agencies has created a set of daunting
challenges for both the research and analysis communities. These problems include heterogeneity, size,
distribution, and control of terminology. At the Digital Government Research Center we are investigating solutions
to these key problems, focusing on (1) the use of an ontology for terminology standardization, (2) information
integration across databases and other data sources with high speed query processing, and (3) interfaces for query
input and presentation of results. This collaborative effort between researchers from Columbia University and the
Information Sciences Institute of the University of Southern California employs technology developed at both
locations, in particular the SENSUS ontology, the SIMS multi-database access planner, the LEXING automated
dictionary and terminology analysis system, and Datacube for very fast access to huge databases. The pilot EDC
application targets gasoline data.
Government Partners - Energy Information Administration of the Department of Energy, Bureau of Labor
Statistics, Census Bureau, the California Energy Commission and other government agencies.
[Figure: system diagram. Recoverable labels: Heterogeneous Data Sources (EPA, Labor, Census, Trade, EIA); Data
Integration; Information Access; Ontology; Query Definition; Main Memory; User Interface; Multilingual Access;
Task-based Evaluation; User Evaluation.]
Research Demonstrations:
Dynamic Unified Access to Distributed Data
Information about gasoline prices and energy in general is available to the public on several government web sites.
However, no particular effort has been made to correlate them in ways that would make the data easy to integrate
and analyze further. For example, the EIA’s gasoline site (see http://www.eia.doe.gov) receives hundreds of
thousands of hits a month, but most of the information is available only as standard HTML pages or as prepared
PDF documents. The EDC project is working to make this and related gasoline data (distributed by BLS, Census,
and the California Energy Commission) accessible in a much more flexible way. This supports users in exploring
the many different variations of terminology and definition, facilitating query formation, and making visible the
many footnotes that explain the complex nature of the data—whose varying definitions can make incomparable
figures appear comparable. EDC has demonstrated initial results with running prototype systems.
Information Integration. We have developed effective methods for identifying and describing the contents of
over 30,000 data series so that useful information can be accurately and efficiently located even when precise
answers are unavailable. We have performed research on computational properties of data aggregation, and
investigated the extraction of information from footnotes embedded in text.
Ontology Construction. We have extended USC/ISI’s 90,000-node terminology taxonomy SENSUS to
incorporate new energy-related domain models, and have developed automated concept-to-ontology alignment
algorithms. Columbia’s LEXING system extracts terms from glossaries: it automatically analyzes over 6,000
terms used across agencies (EIA, Census SIC and NAICS codes, EPA) and automatically handles acronyms,
working toward a cross-agency ontology.
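As a rough illustration of what name-based concept-to-ontology alignment can look like, the sketch below scores a glossary term against candidate ontology node labels by token overlap (Jaccard similarity) after expanding acronyms. It is not the SENSUS or LEXING algorithm, which this description does not specify; the acronym table, node labels, threshold, and function names are all hypothetical.

```python
import re

# Hypothetical acronym table; a real glossary would supply these expansions.
ACRONYMS = {"lpg": "liquefied petroleum gas", "rfg": "reformulated gasoline"}

def tokens(text, acronyms=ACRONYMS):
    """Lowercase the text, split it into words, and expand known acronyms."""
    expanded = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        expanded.extend(acronyms.get(word, word).split())
    return set(expanded)

def jaccard(a, b):
    """Intersection size divided by union size for two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(term, node_labels, threshold=0.3):
    """Return (best_label, score), or (None, score) when no label is close enough."""
    if not node_labels:
        return (None, 0.0)
    term_tokens = tokens(term)
    score, label = max((jaccard(term_tokens, tokens(lbl)), lbl) for lbl in node_labels)
    return (label, score) if score >= threshold else (None, score)
```

For example, with the assumed labels `["reformulated gasoline", "motor gasoline", "natural gas"]`, the term `"RFG"` aligns to `"reformulated gasoline"` because the acronym is expanded before comparison. A production aligner would likely also use definitions and taxonomy position, not labels alone.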
User Interface Development. We have designed and are implementing an intuitive user interface that
handles both integrated querying and the presentation of results.
Fast Access to Large Amounts of Pre-Indexed Data. We are developing Datacube, a technology that allows
the user to access and manipulate very large databases in milliseconds. With Datacube, one can explore the nature
of subsets of the data in real or near-real time by dynamically varying access parameters.
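The general idea behind this kind of fast access can be illustrated with a small in-memory data cube: aggregates are precomputed for every combination of dimensions, so a later query is a single dictionary lookup. This is a minimal sketch of the generic technique, not the DGRC Datacube implementation; the dimension names, record layout, and function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(records, dimensions, measure):
    """Precompute the sum and count of `measure` for every subset of dimensions."""
    cube = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for rec in records:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # A frozenset key makes lookups independent of argument order.
                key = frozenset((d, rec[d]) for d in dims)
                cell = cube[key]
                cell[0] += rec[measure]
                cell[1] += 1
    return dict(cube)

def average(cube, **filters):
    """Return the precomputed average for the given dimension values, or None."""
    cell = cube.get(frozenset(filters.items()))
    if cell is None:
        return None
    total, count = cell
    return total / count
```

After `cube = build_cube(records, ["region", "grade", "year"], "price")`, a call such as `average(cube, region="West")` or `average(cube, grade="regular", year=2000)` costs one hash lookup regardless of how many records were loaded. The trade-off is memory: the number of cells grows with the number of dimension subsets, so real systems restrict or compress what they precompute.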
Data Format Conversion and Manipulation. We have employed our technology to convert data from one
format (HTML, PDF) into other desired formats, representing the results as web pages or in databases.
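To make the HTML-to-database direction concrete, here is a small sketch that pulls the rows out of an HTML table and loads them into SQLite using only the Python standard library. It is an illustration under stated assumptions, not the EDC conversion tooling: the class and function names and the table name `series` are hypothetical, the first row is assumed to be a header, nested tables are not handled, and PDF input is out of scope.

```python
import sqlite3
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_table_to_rows(html):
    """Return the table contents as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

def load_into_sqlite(rows, conn, table="series"):
    """Create `table` from the header row and insert the remaining rows.

    Column names come straight from the header, so the input is assumed trusted.
    """
    header, body = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', body)
```

Typical use would be `rows = html_table_to_rows(page_source)` followed by `load_into_sqlite(rows, sqlite3.connect(":memory:"))`. Values are stored as text; converting them to numbers or attaching footnote markers would be a separate step.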
Biodiversity Polyclave. We have employed the ontology representation technology to classify and store a
collection of around 10,000 species of plants, helping biologists identify species using an interface that supports
induction from partial specification.
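Identification from a partial specification can be pictured as filtering a set of candidates by whichever characters the user has observed so far, in any order. The sketch below shows that idea only; it does not reflect how the Polyclave system stores or reasons over species, and the character names, placeholder species, and function name are hypothetical.

```python
def identify(species, **observed):
    """Return the names of species consistent with every observed character.

    `species` maps a name to a dict of known characters. A character missing
    from a species record is treated as unknown, so it never rules that
    species out.
    """
    return [
        name
        for name, traits in species.items()
        if all(key not in traits or traits[key] == value
               for key, value in observed.items())
    ]
```

With placeholder records such as `{"species A": {"leaf": "opposite", "flower": "white"}, "species B": {"leaf": "alternate", "flower": "white"}}`, calling `identify(records, leaf="opposite")` narrows the candidates to `["species A"]`, and each further observation can only shrink the list.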
Publications
Technical Papers and Reports
• Paper at dg.o 2001 National Conference on Digital Government, 2001
• Paper at AFCEA Database Conference, 2001 (to appear)
• Overview article in IEEE Computer, Feb 2001
• Technical article in Digital Government book, Kluwer, 2001 (in press)
• Paper at Joint Statistical Conference, Aug 2000
• EDC Project Annual Report to NSF, 2000
Conferences and Workshops
• dg.o 2001 Conference, May 2001: http://www.dgrc.org/dgo2001
• dg.o 2000 Workshop, May 2000: http://www.isi.edu/dgrc/dgo2000
DGRC
The Digital Government Research Center (DGRC; http://www.dgrc.org) was established in 1999 to help
Government make full use of cutting-edge information technology. The DGRC consists of faculty, staff, and
students at the Information Sciences Institute (ISI) of the University of Southern California and Columbia
University’s Computer Science Department and its Center for Research on Information Access.
The mandate of DGRC is to conduct and support research in key areas of information systems, develop
standards/interfaces and infrastructure, build pilot systems, and collaborate closely with Government
service/information providers and users.
DGRC organizes the annual dg.o National Conference on Digital Government Research and publishes the quarterly
DG Online, the Magazine of Digital Government Research. See http://www.dgrc.org/conferences and
http://www.dgrc.org/dg-online.
http://www.dgrc.org