Federal Committee on Statistical Methodology

Statistical Policy Seminar: Integrating Federal Statistical Information and Processes
The Fifth in a Series of Seminars Hosted by COPAFS (Council of Professional
Associations on Federal Statistics)
November 8-9, 2000
Holiday Inn Bethesda, MD
Session 12: Software Issues for Disseminating Statistical Information on CD-ROM
Organizers: Charles Paulter, BOC, and Cathryn Dippo, BLS
To Make or to Buy: This is the Question
Brand Niemann, U.S. EPA and the FedStats.Net Team*
Abstract
The concern over different ways to access and use Federal statistical data on CD-ROMs
also extends to Federal Web sites. Indeed, users would like Federal statistical data to be
interoperable and integrated across CD-ROMs and Web sites. The goal of the Product
Concepts Group of the FedStats Task Force is to provide interoperable and integrated
statistical products on both the FedStats.gov site and the new FedStats.Net collaboration
site. The FedStats.gov content and several Census CD-ROMs have been re-purposed using
XML and state-of-the-art Web server software to build an integrated and functional
network of content that includes delivery of data tables in XML and XLS formats,
sortable tables, and queryable documents and databases. So the answer to the question
“to make or to buy” is to do both, and even more. The paper includes a description of the
methods, initial results of prototyping at the new FedStats.Net site, and some next steps.
* See Acknowledgments
1. Introduction
Users face the difficulty of wanting to use several CD-ROMs, each of which
has a different software package associated with it. This is confusing and time
consuming for users whose only goal is to efficiently find and use the information they
need. Federal agencies should be responsive to the questions: “Why do I have to learn all
these different ways to access and use Federal data? Will the various agencies ever talk to
each other and work out these differences? Surely there is a common way that meets
users’ needs.”
With the evolution of the Web to the use of the eXtensible Markup Language (XML), the
“common way” now is to re-purpose both Web and CD-ROM content and to integrate
and deliver it on both the Web and on CD/DVD, as well as in print and other means like
wireless appliances that are becoming increasingly popular in our mobile society. In
addition, usability studies and common sense dictate that statistical data tables can and
should be delivered in a format that can be readily used for creating graphs and maps and
even statistical analyses. Statistical data tables can now be delivered on the Web as real
data using XML and XLS (Excel, for which there is a free viewer), which in turn can be
input (uploaded) to statistical analysis software on the Web.
The Product Concepts Group of the FedStats Task Force has developed a vision for Next
Generation Products by Type and How Generated (see table below) (Capps et al., 1999;
DeBerry et al., 2000; Wallace and Cortez, 2000; Wallace and Highsmith, 2000; and
Wallace and Sperling, 2000). The third reference describes the new Census QuickFacts
(http://quickfacts.census.gov/qfd/) and FedStats QuickStats (in review) applications,
which use a MySQL database to store the state and county data elements selected from
eight FedStats agencies along with their associated metadata, and provide a front end
written in Perl/CGI that generates the tables dynamically via SQL queries.
Product Type:
-Quick Facts (Stats) – tables of results
-“Hot Reports” – dynamically created documents
-Dynamic Mapping – documents, tables, graphs, and maps together (XSLT)
How Generated:
-Manual cache – pull together manually
-Auto Cache – pull together programmatically
-Distributed – leave it where it is and create a distributed Content Network!
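As an illustration of the QuickFacts/QuickStats approach described above, the following
is a minimal sketch of a front end that builds a table from an SQL query at request time.
It is not the Census implementation, which uses MySQL and Perl/CGI. Python and an
in-memory SQLite database are substituted here so that the example is self-contained,
and the schema, the function names build_demo_db and render_area_table, and the
placeholder values are assumptions made for illustration only.

```python
import sqlite3
from html import escape


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE data_element (
            element_id INTEGER PRIMARY KEY,
            label      TEXT NOT NULL,   -- what the number measures
            source     TEXT NOT NULL    -- metadata: contributing agency
        );
        CREATE TABLE area_value (
            area       TEXT NOT NULL,   -- state or county name
            element_id INTEGER NOT NULL REFERENCES data_element(element_id),
            value      REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO data_element VALUES (?, ?, ?)",
        [(1, "Indicator A", "Agency X"), (2, "Indicator B", "Agency Y")],
    )
    conn.executemany(
        "INSERT INTO area_value VALUES (?, ?, ?)",
        [("Area 1", 1, 10.0), ("Area 1", 2, 20.5), ("Area 2", 1, 30.0)],
    )
    return conn


def render_area_table(conn: sqlite3.Connection, area: str) -> str:
    """Query the data elements for one area and return them as an HTML table."""
    rows = conn.execute(
        """
        SELECT e.label, v.value, e.source
        FROM area_value AS v
        JOIN data_element AS e USING (element_id)
        WHERE v.area = ?
        ORDER BY e.element_id
        """,
        (area,),  # bound parameter, so user input never enters the SQL text
    ).fetchall()
    body = "".join(
        f"<tr><td>{escape(label)}</td><td>{value}</td><td>{escape(source)}</td></tr>"
        for label, value, source in rows
    )
    header = "<tr><th>Element</th><th>Value</th><th>Source</th></tr>"
    return f"<table>{header}{body}</table>"
```

Calling render_area_table(build_demo_db(), "Area 1") returns one header row and one row
per stored data element for that area. Keeping the source agency in the same query is
one simple way to show the associated metadata next to each value.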
It was determined that the NXT 3 software from NextPage provided the functionality
needed to begin to realize this vision
(http://www.nextpage.com/products/NXT3/econtent/econtent_tour.asp). It was also
determined that the StatServer software from MathSoft
(http://www.splus.mathsoft.com/products/statserver/statserverintro.html) would provide
the desired functionality to select pre-loaded databases and statistical analyses or to
upload additional databases to produce statistical analyses of the data on the Web.
Recent FedStats.gov usability testing of a redesigned site suggested the need for a more
clearly defined conceptual framework for the site, which in turn suggested the need for a
larger redesign effort. One suggestion was to reorganize the FedStats.gov site around
topics, each of which would have its own agency list as well as a consistently defined
topical index of key statistics from each agency, a list of international comparisons, and
MapStats and QuickStats among other features. This would necessitate overhauling the
many lists and links in the site. In addition, the major statistical compendia like the
Statistical Abstract (http://landview.census.gov/statab/www/) and other agency reports
normally delivered in PDF could and should be made more accessible
(http://www.section508.gov/) and functional. The MapStats feature could also be
supplemented with other tools like StatServer and LandView: The Federal Geographic
Data Viewer (http://landview.census.gov/), with its closely coupled and Web-connected
features (Niemann, 1999).
The next sections include a description of the methods, initial results of prototyping at the
new FedStats.Net site, and some next steps.
2. Methodology
Some of the principal concepts and methods for building a content network using the
NXT 3 software are listed below:
•A site contains one or more content collections organized into hierarchies using folders.
•A content collection is composed of multiple documents.
•A document is a textual data stream that is either authored or constructed dynamically
and corresponds to a single page in a Web browser.
•A view is a subset of the hierarchy of a site.
•A Web server may host multiple sites, each with multiple views (peer-to-peer).
•The Content Network Manager (CNM) allows one to organize the site, and Manage
Content (MC) allows one to dynamically update a site.
•HTML templates with special replaceable comment tags are used to customize the
display.
•Searches can cover the entire site, specific content collections, or sub-content
components, and highlighted hits appear in virtually all documents.
•Sub-content components (fields) are text strings assigned semantic meaning and indexed
separately.
•Content can include file systems, ODBC databases, and external Web sites.
•Multiple CN servers are linked and integrated by the Content Network Adapter (CNA).
•User preferences can be stored by associating an XML document with a specific user;
it carries information from one session to another, including saved searches and
individualized content.
•Metadata is stored as a document property that describes the document and its
children and is used for searching and resource discovery based on the W3C RDF and
Dublin Core standards.
•A content service can be a set of local files, an ODBC database, or an external Web
site.
•An NXT 3 content service does not store the content; it simply indexes a structure for
the NXT 3 search engine and table of contents.
•Content services are updated manually or periodically by a “build schedule” (see first
screen capture example below).
•SQL queries entered by the administrator generate a table of contents from the ODBC
database and either generate XML documents from the field data or retrieve any
discrete documents stored as “blobs” (see second screen capture example below).
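The last item in the list above can be sketched in a few lines. The example below is an
illustration under stated assumptions, not NXT 3 code: SQLite stands in for the ODBC
database, and the table name records, the column names doc_id, title, region, and
body_blob, and the function names are all hypothetical. Each row returned by the
administrator's query yields one table-of-contents entry; if the row holds a stored
document it is returned as is, otherwise an XML document is built from the field data.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_demo_source() -> sqlite3.Connection:
    """Stand-in for an ODBC data source, with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (doc_id INTEGER, title TEXT, region TEXT, body_blob BLOB)"
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        [
            (1, "Table 1", "Area 1", None),                            # fields only
            (2, "Report 2", "Area 2", b"<report>stored text</report>"),  # stored blob
        ],
    )
    return conn


def index_content(conn: sqlite3.Connection, toc_query: str):
    """Return (toc, documents) for the rows selected by toc_query.

    toc is a list of (doc_id, title) pairs in query order; documents maps each
    doc_id to XML text. The query must return doc_id and title columns, and the
    remaining column names must be valid XML element names.
    """
    cursor = conn.execute(toc_query)
    columns = [description[0] for description in cursor.description]
    toc, documents = [], {}
    for row in cursor:
        record = dict(zip(columns, row))
        doc_id = record["doc_id"]
        toc.append((doc_id, record["title"]))
        blob = record.get("body_blob")
        if blob is not None:
            # A discrete document stored as a blob is retrieved unchanged.
            documents[doc_id] = blob.decode("utf-8")
            continue
        # Otherwise build an XML document with one element per field.
        root = ET.Element("document", id=str(doc_id))
        for name, value in record.items():
            if name not in ("doc_id", "body_blob"):
                ET.SubElement(root, name).text = "" if value is None else str(value)
        documents[doc_id] = ET.tostring(root, encoding="unicode")
    return toc, documents
```

In this sketch only the table of contents and the generated documents are derived from
the query; the source rows stay in the database, which mirrors the point above that a
content service indexes a structure rather than storing the content.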
The MathSoft StatServer allows the user to choose the data source, the table, and the
columns of that table and the statistical analysis to be performed on the selected subset of
data.
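The selection sequence just described (data source, then table, then columns, then
analysis) can be sketched as follows. This is not the StatServer interface, which is not
documented in this paper; the nested-dictionary layout, the ANALYSES registry, and the
function name run_analysis are assumptions chosen only to make the sequence concrete,
using Python's standard statistics module.

```python
import statistics
from typing import Callable, Dict, List

# Hypothetical layout: source name -> table name -> column name -> values.
Table = Dict[str, List[float]]
Sources = Dict[str, Dict[str, Table]]

# Analyses the user may choose from; each maps one column of values to a number.
ANALYSES: Dict[str, Callable[[List[float]], float]] = {
    "mean": statistics.mean,
    "median": statistics.median,
    "stdev": statistics.stdev,  # sample standard deviation, needs 2+ values
}


def run_analysis(
    sources: Sources, source: str, table: str, columns: List[str], analysis: str
) -> Dict[str, float]:
    """Apply the chosen analysis to each selected column of the chosen table."""
    if analysis not in ANALYSES:
        raise ValueError(f"unknown analysis: {analysis!r}")
    data = sources[source][table]  # KeyError if the source or table is missing
    func = ANALYSES[analysis]
    return {column: func(data[column]) for column in columns}


# Placeholder data standing in for a pre-loaded or uploaded database.
DEMO_SOURCES: Sources = {
    "demo": {"counties": {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 9.0]}}
}
```

For example, run_analysis(DEMO_SOURCES, "demo", "counties", ["x", "y"], "median")
computes the median of each of the two selected columns. Uploading an additional
database would correspond to adding another entry to the sources dictionary.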
3. Results
The FedStats.Net content network has been initially organized by a series of topic nodes
as follows:
•Introduction
•Topics by Agency
• Sources of Environmental Statistics
•Statistical Abstracts
–Integrated 31 chapters and 1450 data tables for 1999
–Demo of PDF-to-HTML conversion
•Databases
–USA Counties 1998
–REIS 1969-1998
• Metadata
•LandView
•Statistical Reports (CEQ 1997)
•Partners (HUD, BEA, SBA: on another FedStats.Net server)
•International (CIA): on NextPage Server in Utah
• EPA Integrated Registry and Repository: on another interagency server
In addition, an environmental content network node described in a recent paper
(Niemann, 2000) was included as an example of what an individual agency or
organization might do on its own.
Some screen captures from the prototypes that are to become available at the
FedStats.Net site are included below. These include advanced features and functionality
for tables recommended recently by others (Marchionini and Hert, 2000; Wallman et al.,
1999).
The first example is the CIA Country Profiles (CIA, 2000), delivered from a query of an
XML document as an XML table that can be sorted by column label links and linked by
row names to individual country data. The second is the Statistical Abstract, with all 31
chapters and nearly 1500 tables integrated in an XML database, with properties sheets for
and links to the tables in both HTML (XML) and Excel (XLS) formats. The third is a view,
from the LandView IV Federal Geographic Data Viewer (http://landview.census.gov/) on
DVD, of Alabama state demographic statistics that can be queried, summarized,
mapped, and linked to on the Internet.
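The sortable XML table in the first example can be illustrated with a minimal sketch. It
assumes a simple hypothetical layout (a table element holding row elements, each with
one child element per column) and placeholder values; the prototype's actual XML schema
and sorting mechanism are not given in this paper. Clicking a column label link would
correspond to calling sort_rows with that column name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Placeholder table in an assumed layout; names and numbers are not real data.
SAMPLE_XML = """<table>
  <row><name>Country A</name><population>300</population></row>
  <row><name>Country B</name><population>1200</population></row>
  <row><name>Country C</name><population>45</population></row>
</table>"""


def sort_rows(xml_text: str, column: str, descending: bool = False) -> List[Dict[str, str]]:
    """Parse the XML table and return its rows sorted by the given column.

    Cells that parse as numbers are compared numerically and placed before
    text cells, which are compared case-insensitively.
    """
    root = ET.fromstring(xml_text)
    rows = [{cell.tag: (cell.text or "") for cell in row} for row in root.findall("row")]

    def sort_key(row: Dict[str, str]):
        value = row[column]
        try:
            return (0, float(value), "")
        except ValueError:
            return (1, 0.0, value.lower())

    return sorted(rows, key=sort_key, reverse=descending)
```

Sorting numerically matters for statistical tables: a plain text sort would place
"1200" before "300" and "45", which is rarely what a reader of a population column
wants.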
The default (opening) Web page contains the following principal links:
1. Welcome
2. Search for statistics by agency and topic
3. Search and use the Statistical Abstract chapters and data tables
4. Search and use the agencies’ major statistical reports and data tables
5. Do your own statistical calculations with StatServer
4. Some Next Steps
Since usability testing suggested some of the methods and results reported in this paper, it
is appropriate to submit these methods and results to additional usability testing. There is
certainly more CD-ROM and Web database content that can and should be added as part
of the ongoing effort to form more content partnerships to build out the FedStats.Net
content network. Partners with major content holdings may wish to adopt the same NXT
3 content network technology so that their content can be integrated and searched more
seamlessly with that of the FedStats.Net site. The FedStats.Net content network has the
potential to change the paradigm for publishing and maintaining major statistical
compendia and reports on the Web, on CD-ROM, and in print, and to make that work
more effective and efficient.
5. Acknowledgments
The FedStats.Net Team is led by Jack Marshall and includes Andy Reamer, Amar
Talwar, Marianne Thrift, Richie Wang, and members of the FedStats New Product
Concepts Group, chaired by Mark Wallace. The team was supported by a NextPage
Technical Assistance Team consisting of Bill Donellan and Garth Despain, a Consulting
Services Team led by Julie Wood with major technical contributions from Jan Nakashima
and Jared Saxton, and NextPage Training by Reed Farnsworth. The team was also
supported by the Interagency LandView IV Team, including major technical
contributions from Peter Gatusso and Jerry McFaul. The team acknowledges the benefits
of its participation in the FedStats Tables Group led by Gary Marchionini and Carol Hert.
Lastly, the team acknowledges networking and related support from Forrest Houston of
the University of Southern California’s Information Sciences Institute, and the temporary
loan of some equipment from IBM at an early stage of prototyping.
6. References
Capps, C., Green, A. and Wallace, M., 1999: The Vision of Integrated Access to
Statistics: the Data Web, Of Significance, Volume 1, Number 2, 42-47, Association of
Public Data Users.
Central Intelligence Agency, 2000: A Brief History of Basic Intelligence and The World
Factbook, http://www.odci.gov/cia/publications/factbook/docs/history.html
DeBerry, M., Gregg, V., and Taylor, R., 2000: FedStats: Creating the U.S. National
Statistical Information Infrastructure of the 21st Century, UN/ECE Seminar on Integrated
Statistical Information Systems (ISIS), Riga, Latvia, 29-31 May.
Marchionini, G. and Hert, C., 2000: Interacting with Tabular Data Through the WWW, in
proceedings of this conference (http://istweb.syr.edu/~tables/).
Niemann, B., 1999: A Closely-coupled Web-connected DVDs, NIST DVD’99 Workshop,
November 30-December 1, 19 pp.
Niemann, B. 2000: An Integrated Repository and Registry for Environmental
Information: An XML Portal for Public Access and Data Exchange, Discussion paper for
INFOTERRA 2000-Global Conference on Access to Environmental Information, Dublin,
Ireland, 11-15 September. http://www.unep.org/infoterra/infoterra2000/Brand-rev.doc
Wallace, M. and Cortez, J.A., 2000: Developing and Promoting Next-Generation
Information Products for WWW.CENSUS.GOV and WWW.FEDSTATS.GOV, Census
Advisory Committee of Professional Associations, October 19-20, 12 pp.
Wallace, M. and Highsmith, S., 1999: Use of Metadata for the Effective Integration of
Data from Multiple Sources, ICES paper, 8 pp.
Wallace, M. and Sperling, J., 2000: User-Driven Integrated Statistical Solutions (Version
08/28/00), accepted for publication in URISA Journal,
http://www.urisa.org/Journal/accepted/sperling/user_drivewn_integrated_statistical_solutions.htm
Wallman, K.K., Zawitz, M.W., Blessing, C., and Treadwell, W., 1999: Making Things
Add Up For the End Uses: Issues in Statistical Literacy, Of Significance, Volume 1,
Number 2, 14-16, Association of Public Data Users.