Federal Committee on Statistical Methodology
Statistical Policy Seminar: Integrating Federal Statistical Information and Processes
The Fifth in a Series of Seminars Hosted by COPAFS (Council of Professional Associations on Federal Statistics)
November 8-9, 2000
Holiday Inn, Bethesda, MD

Session 12: Software Issues for Disseminating Statistical Information on CD-ROM
Organizers: Charles Paulter, BOC and Cathryn Dippo, BLS

To Make or to Buy: This is the Question
Brand Niemann, U.S. EPA and the FedStats.Net Team*

Abstract

The concern over different ways to access and use Federal statistical data on CD-ROMs also extends to Federal Web sites. Indeed, users would like Federal statistical data to be interoperable and integrated across CD-ROMs and Web sites. The goal of the Product Concepts Group of the FedStats Task Force is to provide interoperable and integrated statistical products on both the FedStats.gov site and the new FedStats.Net collaboration site. The FedStats.gov content and several Census CD-ROMs have been re-purposed using XML and state-of-the-art Web server software to build an integrated and functional network of content that includes delivery of data tables in XML and XLS formats, sortable tables, and queryable documents and databases. So the answer to the question “to make or to buy” is to do both, and even more. The paper includes a description of the methods, initial results of prototyping at the new FedStats.Net site, and some next steps.

* See Acknowledgments

1. Introduction

Users are faced with the difficulty of wanting to use several CD-ROMs, each of which has a different software package associated with it. This is confusing and time-consuming for users whose only goal is to efficiently find and use the information they need. Federal agencies should be responsive to the questions: “Why do I have to learn all these different ways to access and use Federal data? Will the various agencies ever talk to each other and work out these differences?
Surely there is a common way that meets users' needs.” With the evolution of the Web to the use of the eXtensible Markup Language (XML), the “common way” now is to re-purpose both Web and CD-ROM content and to integrate and deliver it on both the Web and on CD/DVD, as well as in print and by other means such as the wireless appliances that are becoming increasingly popular in our mobile society. In addition, usability studies and common sense dictate that statistical data tables can and should be delivered in a format that can be readily used for creating graphs, maps, and even statistical analyses. Statistical data tables can now be delivered on the Web as real data using XML and XLS (Excel, for which there is a free viewer), which in turn can be input (uploaded) to statistical analysis software on the Web.

The Product Concepts Group of the FedStats Task Force has developed a vision for Next Generation Products by Type and How Generated (see table below) (Capps et al., 1999; DeBerry et al., 2000; Wallace and Cortez, 2000; Wallace and Highsmith, 2000; and Wallace and Sperling, 2000). The third reference describes the new Census QuickFacts (http://quickfacts.census.gov/qfd/) and FedStats QuickStats (in review) applications, which use a MySQL database to store the state and county data elements selected from eight FedStats agencies along with their associated metadata, and provide a front end written in Perl/CGI that generates the tables dynamically via SQL queries.

Product Type
-Quick Facts (Stats) – tables of results
-“Hot Reports” – dynamically created documents
-Dynamic Mapping – documents, tables, graphs, and maps together (XSLT)

How Generated
-Manual cache – pull together manually
-Auto Cache – pull together programmatically
-Distributed – leave it where it is and create a distributed Content Network!
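
The QuickFacts approach described above (a relational store of state and county data elements, with a front end that builds each table on request from an SQL query) can be illustrated with a short sketch. This is not the Census implementation: Python and an in-memory SQLite database stand in for Perl/CGI and MySQL, and the table name, column names, place names, and values are all hypothetical placeholders.

```python
import sqlite3
from html import escape


def build_demo_db():
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_element ("
        "geo_name TEXT, item TEXT, value REAL, source_agency TEXT)"
    )
    rows = [
        # Hypothetical geography, item, value, and source agency.
        ("Example County A", "Population", 1000, "Agency X"),
        ("Example County A", "Median income", 30000, "Agency Y"),
        ("Example County B", "Population", 2000, "Agency X"),
    ]
    conn.executemany("INSERT INTO data_element VALUES (?, ?, ?, ?)", rows)
    return conn


def render_table(conn, geo_name):
    """Run a parameterized query and return the result as an HTML table."""
    cursor = conn.execute(
        "SELECT item, value, source_agency FROM data_element "
        "WHERE geo_name = ? ORDER BY item",
        (geo_name,),
    )
    lines = ["<table>", "<tr><th>Item</th><th>Value</th><th>Source</th></tr>"]
    for item, value, source in cursor:
        lines.append(
            f"<tr><td>{escape(item)}</td><td>{value:,.0f}</td>"
            f"<td>{escape(source)}</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(build_demo_db(), "Example County A"))
```

Because the table is assembled at request time, adding a data element or correcting a value only requires a change in the database; no static page has to be regenerated.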
It was determined that the NXT 3 software from NextPage provided the functionality needed to begin to realize this vision (http://www.nextpage.com/products/NXT3/econtent/econtent_tour.asp). It was also determined that the StatServer software from MathSoft (http://www.splus.mathsoft.com/products/statserver/statserverintro.html) would provide the desired functionality to select pre-loaded databases and statistical analyses, or to upload additional databases, to produce statistical analyses of the data on the Web.

Recent FedStats.gov usability testing of a redesigned site suggested the need for a more clearly defined conceptual framework for the site, which in turn suggested the need for a larger redesign effort. One suggestion was to reorganize the FedStats.gov site around topics, each of which would have its own agency list as well as a consistently defined topical index of key statistics from each agency, a list of international comparisons, and MapStats and QuickStats, among other features. This would necessitate overhauling the many lists and links in the site. In addition, the major statistical compendia like the Statistical Abstract (http://landview.census.gov/statab/www/) and other agency reports normally delivered in PDF could and should be made more accessible (http://www.section508.gov/) and functional, and the MapStats feature could be supplemented with other tools like StatServer and LandView: The Federal Geographic Data Viewer (http://landview.census.gov/) with its closely-coupled and Web-connected features (Niemann, 1999).

The next sections include a description of the methods, initial results of prototyping at the new FedStats.Net site, and some next steps.

2. Methodology

Some of the principal concepts and methods for building a content network using the NXT 3 software are listed below:

•A site contains one or more content collections organized into hierarchies using folders.
•A content collection is composed of multiple documents.
•A document is a textual data stream that is either authored or constructed dynamically and corresponds to a single page in a Web browser.
•A view is a subset of the hierarchy of a site.
•A Web server may host multiple sites, each with multiple views (peer-to-peer).
•The Content Network Manager (CNM) allows one to organize the site and the Manage Content (MC) allows one to dynamically update a site.
•HTML templates with special replaceable comment tags are used to customize the display.
•Searches can be made for the entire site, specific content collections, or sub-content components, and highlights appear in virtually all documents.
•Sub-content components (fields) are text strings assigned semantic meaning and indexed separately.
•Content can include file systems, ODBC databases, and external Web sites.
•Multiple CN servers are linked and integrated by the Content Network Adapter (CNA).
•User preferences can be stored by associating an XML document with a specific user to carry information from one session to another, which can include saved searches and individualized content.
•Metadata is stored as a document property that describes the document and its children and is used for searching and resource discovery using W3C RDF and Dublin Core standards.
•A content service can be a set of local files, an ODBC database, or an external Web site.
•An NXT 3 content service does not store the content; it simply indexes a structure for the NXT 3 search engine and table of contents.
•Content services are updated manually or periodically by a “build schedule” (see first screen capture example below).
•SQL queries entered by the administrator generate a table of contents from the ODBC database and either generate XML documents from the field data or retrieve any discrete documents stored as “blobs” (see second screen capture example below).
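
The last concept in the list, an administrator-supplied SQL query that yields both a table of contents and one XML document per row, can be sketched roughly as follows. This does not use or reproduce the NXT 3 software or its interfaces: an in-memory SQLite database stands in for the ODBC source, and the table, column, and element names and the sample rows are hypothetical. Each database field becomes its own XML element, loosely mirroring the idea of fields that carry semantic meaning and can be searched separately.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical administrator-supplied query; the column order matters below.
QUERY = (
    "SELECT chapter, title, body FROM tables_catalog "
    "ORDER BY chapter, title"
)


def build_demo_db():
    """Create an in-memory stand-in for the ODBC source (placeholder rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tables_catalog (chapter TEXT, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO tables_catalog VALUES (?, ?, ?)",
        [
            ("Population", "Example table 1", "placeholder text"),
            ("Population", "Example table 2", "placeholder text"),
            ("Education", "Example table 3", "placeholder text"),
        ],
    )
    return conn


def build_content(conn, query):
    """Return (table of contents, XML documents) built from the query rows."""
    toc = {}        # chapter -> list of document titles, in query order
    documents = []  # one serialized XML document per row
    for chapter, title, body in conn.execute(query):
        toc.setdefault(chapter, []).append(title)
        doc = ET.Element("document")
        ET.SubElement(doc, "chapter").text = chapter
        ET.SubElement(doc, "title").text = title
        ET.SubElement(doc, "body").text = body
        documents.append(ET.tostring(doc, encoding="unicode"))
    return toc, documents


def search_field(documents, field, term):
    """Return titles of documents whose named field contains the term."""
    hits = []
    for xml_text in documents:
        doc = ET.fromstring(xml_text)
        value = doc.findtext(field, default="")
        if term.lower() in value.lower():
            hits.append(doc.findtext("title"))
    return hits


if __name__ == "__main__":
    toc, docs = build_content(build_demo_db(), QUERY)
    print(toc)
    print(search_field(docs, "chapter", "population"))
```

In this sketch the source rows stay in the database and are only read at build time, which is in the spirit of a content service that indexes a structure without storing the content itself; rerunning the build on a schedule would pick up new rows.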
The MathSoft StatServer allows the user to choose the data source, the table, the columns of that table, and the statistical analysis to be performed on the selected subset of data.

3. Results

The FedStats.Net content network has been initially organized by a series of topic nodes as follows:

•Introduction
•Topics by Agency
•Sources of Environmental Statistics
•Statistical Abstracts
  –Integrated 31 chapters and 1450 data tables for 1999
  –Demo of PDF-to-HTML conversion
•Databases
  –USA Counties 1998
  –REIS 1969-1998
•Metadata
•LandView
•Statistical Reports (CEQ 1997)
•Partners (HUD, BEA, SBA: on another FedStats.Net server)
•International (CIA): on NextPage Server in Utah
•EPA Integrated Registry and Repository: on another interagency server

In addition, an environmental content network node described in a recent paper (Niemann, 2000) was also included as an example of what an individual agency or organization might do on its own.

Some screen captures from the prototypes to become available at the FedStats.Net site are included below. These include advanced features and functionality for tables recommended recently by others (Marchionini and Hert, 2000; and Wallman, Zawitz, Blessing, and Treadwell, 1999). The first example is the CIA Country Profiles (CIA, 2000), delivered from a query of an XML document as an XML table that can be sorted by column label links and linked by row names to individual country data. The second is the Statistical Abstract, with all 31 chapters and nearly 1500 tables integrated in an XML database with properties sheets for, and links to, the nearly 1500 tables in both HTML (XML) and Excel (XLS) formats. The third is a view, from the LandView IV Federal Geographic Data Viewer (http://landview.census.gov/) on DVD, of Alabama state demographic statistics that can be queried, summarized, mapped, and linked to on the Internet.

The Default (opening web page) contains the following principal links: 1. Welcome 2.
Search for statistics by agency and topic 3. Search and use the Statistical Abstract chapters and data tables 4. Search and use the agencies' major statistical reports and data tables 5. Do your own statistical calculations with StatServer

4. Some Next Steps

Since usability testing suggested some of the methods and results reported in this paper, it is appropriate to submit these methods and results to additional usability testing. There is certainly more CD-ROM and Web database content that can and should be added as part of the ongoing effort to form more content partnerships to build out the FedStats.Net content network. Partners with major content holdings may wish to adopt the same NXT 3 content network technology so that their content can be integrated and searched more seamlessly with that of the FedStats.Net site. The FedStats.Net content network has the potential to change the paradigm for publishing and maintaining major statistical compendia and reports on the Web, on CD-ROM, and in print, in a way that could be more effective and efficient.

5. Acknowledgments

The FedStats.Net Team is led by Jack Marshall and includes Andy Reamer, Amar Talwar, Marianne Thrift, Richie Wang, and members of the FedStats New Product Concepts Group, chaired by Mark Wallace. The team was supported by a NextPage Technical Assistance Team consisting of Bill Donellan and Garth Despain, a Consulting Services Team led by Julie Wood with major technical contributions from Jan Nakashima and Jared Saxton, and NextPage Training by Reed Farnsworth. The team was also supported by the Interagency LandView IV Team, including major technical contributions from Peter Gatusso and Jerry McFaul. The team acknowledges the benefits of its participation in the FedStats Tables Group led by Gary Marchionini and Carol Hert.
Lastly, the team acknowledges networking and related support from Forrest Houston of the University of Southern California’s Information Sciences Institute, and the temporary loan of some equipment from IBM at an early stage of prototyping.

6. References

Capps, C., Green, A. and Wallace, M., 1999: The Vision of Integrated Access to Statistics: the Data Web, Of Significance, Volume 1, Number 2, 42-47, Association of Public Data Users.

Central Intelligence Agency, 2000: A Brief History of Basic Intelligence and The World Factbook, http://www.odci.gov/cia/publications/factbook/docs/history.html

DeBerry, M., Gregg, V., and Taylor, R., 2000: FedStats: Creating the U.S. National Statistical Information Infrastructure of the 21st Century, UN/ECE Seminar on Integrated Statistical Information Systems (ISIS), Riga, Latvia, 29-31 May.

Marchionini, G. and Hert, C., 2000: Interacting with Tabular Data Through the WWW, in proceedings of this conference (http://istweb.syr.edu/~tables/).

Niemann, B., 1999: A Closely-coupled Web-connected DVDs, NIST DVD’99 Workshop, November 30-December 1, 19 pp.

Niemann, B., 2000: An Integrated Repository and Registry for Environmental Information: An XML Portal for Public Access and Data Exchange, Discussion paper for INFOTERRA 2000-Global Conference on Access to Environmental Information, Dublin, Ireland, 11-15 September. http://www.unep.org/infoterra/infoterra2000/Brand-rev.doc

Wallace, M. and Cortez, J.A., 2000: Developing and Promoting Next-Generation Information Products for WWW.CENSUS.GOV and WWW.FEDSTATS.GOV, Census Advisory Committee of Professional Associations, October 19-20, 12 pp.

Wallace, M. and Highsmith, S., 1999: Use of Metadata for the Effective Integration of Data from Multiple Sources, ICES paper, 8 pp.

Wallace, M. and Sperling, J., 2000:
User-Driven Integrated Statistical Solutions (Version 08/28/00), accepted for publication in URISA Journal, http://www.urisa.org/Journal/accepted/sperling/user_drivewn_integrated_statistical_solutions.htm

Wallman, K.K., Zawitz, M.W., Blessing, C., and Treadwell, W., 1999: Making Things Add Up For the End Uses: Issues in Statistical Literacy, Of Significance, Volume 1, Number 2, 14-16, Association of Public Data Users.