The Energy Data Collection Project (EDC)

José Luis Ambite*, Yigal Arens*, Walter Bourne†, Steven Feiner†, Eduard Hovy*, Judith Klavans†, Andrew Philpot*, Ken Ross†, Sal Stolfo†

* Digital Government Research Center
Information Sciences Institute
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292-6695
arens@isi.edu

† Digital Government Research Center
Center for Research on Information Access
Department of Computer Science
Columbia University
535 West 114th Street, MC 1103
New York, NY 10027
klavans@cs.columbia.edu

Goals

The massive amount of statistical and text data available from government agencies has created a set of daunting challenges for both the research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center we are investigating solutions to these key problems, focusing on (1) the use of an ontology for terminology standardization, (2) information integration across databases and other data sources with high-speed query processing, and (3) interfaces for query input and presentation of results. This collaborative effort between researchers from Columbia University and the Information Sciences Institute of the University of Southern California employs technology developed at both locations, in particular the SENSUS ontology, the SIMS multi-database access planner, the LEXING automated dictionary and terminology analysis system, and Datacube for very fast access to huge databases. The pilot EDC application targets gasoline data.

Government Partners: Energy Information Administration of the Department of Energy, Bureau of Labor Statistics, Census Bureau, the California Energy Commission, and other government agencies.

[Figure: system overview diagram. Labels: Heterogeneous Data Sources; EPA; Labor; Census; Trade; EIA; Data Integration; Ontology; Main Memory; Query Definition; Information Access; User Interface; Multilingual Access; Task-based Evaluation; User Evaluation.]

Research Demonstrations: Dynamic Unified Access to Distributed Data

Information about gasoline prices and energy in general is available to the public on several government web sites. However, no particular effort has been made to correlate these sources in ways that would make the data easy to integrate and analyze further. For example, the EIA’s gasoline site (see http://www.eia.doe.gov) receives hundreds of thousands of hits a month, but most of the information is available only as standard HTML pages or as prepared PDF documents. The EDC project is working to make this and related gasoline data (distributed by BLS, Census, and the California Energy Commission) accessible in a much more flexible way. This supports users in exploring the many different variations of terminology and definition, facilitating query formation, and making visible the many footnotes that explain the complex nature of the data, whose varying definitions can make incomparable figures appear comparable. EDC has demonstrated initial results with running prototype systems.

Information Integration. We have developed effective methods for identifying and describing the contents of over 30,000 data series so that useful information can be accurately and efficiently located even when precise answers are unavailable. We have performed research on computational properties of data aggregation, and investigated the extraction of information from footnotes embedded in text.

Ontology Construction. We have extended USC/ISI’s 90,000-node terminology taxonomy SENSUS to incorporate new energy-related domain models, and have developed automated concept-to-ontology alignment algorithms.
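
As a rough illustration only of what aligning a domain term to an existing ontology label can look like, the sketch below scores candidate labels by word overlap (Jaccard similarity). This is not the project's alignment algorithm, which is not described in this summary; the function names, the threshold, and the sample labels are all hypothetical.

```python
import re
from typing import Optional

def tokens(label: str) -> set[str]:
    """Lowercase a label and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def align(term: str, ontology_labels: list[str],
          threshold: float = 0.5) -> Optional[tuple[str, float]]:
    """Return (best_label, score) for the closest label, or None if no label
    reaches the (hypothetical) similarity threshold."""
    term_tokens = tokens(term)
    best_label, best_score = None, 0.0
    for label in ontology_labels:
        score = jaccard(term_tokens, tokens(label))
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label, best_score
    return None

# Hypothetical ontology labels, invented for this example.
labels = ["motor gasoline", "crude oil", "retail price", "distillate fuel oil"]

print(align("Gasoline, motor", labels))      # word order and punctuation are ignored
print(align("natural gas liquids", labels))  # no shared words, so no alignment
```

A word-overlap score is the simplest possible baseline. It cannot relate synonyms or acronyms that share no words with a label, which is one reason glossary analysis and acronym handling matter when building a cross-agency ontology.
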
Columbia’s LEXING system for term extraction from glossaries performs automatic analysis of over 6000 terms used across agencies (EIA, Census SIC and NAICS codes, EPA) and automatic handling of acronyms, working toward the creation of a cross-agency ontology.

User Interface Development. We have designed and are implementing a powerful and intuitive user interface capable of handling integrated querying and presentation of results.

Fast Access to Large Amounts of Pre-Indexed Data. We are developing Datacube, a technology that allows the user to access and manipulate very large databases in milliseconds. With Datacube, one can explore the nature of subsets of the data in real or near-real time by dynamically varying access parameters.

Data Format Conversion and Manipulation. We have employed our technology to convert data from one format (HTML, PDF) into other desired formats, representing the results as web pages or in databases.

Biodiversity Polyclave. We have employed the ontology representation technology to classify and store a collection of around 10,000 species of plants, helping biologists identify species using an interface that supports induction from partial specification.

Publications

Technical Papers and Reports
Paper at dg.o 2001 National Conference on Digital Government, 2001
Paper at AFCEA Database Conference, 2001 (to appear)
Overview article in IEEE Computer, Feb 2001
Technical article in Digital Government book, Kluwer, 2001 (in press)
Paper at Joint Statistical Conference, Aug 2000
EDC Project Annual Report to NSF, 2000

Conferences and Workshops
dg.o 2001 Conference, May 2001: http://www.dgrc.org/dgo2001
dg.o 2000 Workshop, May 2000: http://www.isi.edu/dgrc/dgo2000

DGRC

The Digital Government Research Center (DGRC; http://www.dgrc.org) was established in 1999 to help Government make full use of cutting-edge information technology.
The DGRC consists of faculty, staff, and students at the Information Sciences Institute (ISI) of the University of Southern California and at Columbia University’s Computer Science Department and its Center for Research on Information Access. The mandate of DGRC is to conduct and support research in key areas of information systems, develop standards, interfaces, and infrastructure, build pilot systems, and collaborate closely with Government service and information providers and users. DGRC organizes the annual dg.o National Conference on Digital Government Research and publishes the quarterly DG Online, the Magazine of Digital Government Research. See http://www.dgrc.org/conferences and http://www.dgrc.org/dg-online.