The Role of Metadata in the Statistical Knowledge Network: An

Forthcoming, Spring 2004, Social Science Computing Reviews The Role of Metadata in the Statistical Knowledge Network: An Emerging Research Agenda Carol A. Hert School of Information Studies, Syracuse University, USA cahert@syr.edu Sheila Denn, Stephanie Haas School of Information and Library Science, University of North Carolina at Chapel Hill, USA {denns, haas}@ils.unc.edu keywords: statistical metadata, digital government, statistical knowledge network, user studies 1. Abstract Metadata (data about data) is integral to many processes in the Statistical Knowledge Network (SKN). Recently, efforts such as the Semantic Web (Berners-Lee et al., 2001), the joint work of ISI and Columbia University (Ambite et al., 2002), and ours (Marchionini, et al., 2003) express the importance of metadata in supporting integration of information across multiple sources. Critical challenges in building the SKN are identifying what metadata is needed and at what point in the cycle of production and use of statistical information it must be available, and establishing an architecture that supports metadata acquisition and use throughout the SKN. This paper provides a research agenda in those areas, building on our current work in helping users “find the statistical information they need and understand what they find.” 2. Envisioning the Statistical Knowledge Network Statistical information plays an important role in the lives of many people. Deciding whether to move for a new job, determining health risks for various diseases, or considering retirement options often involves using statistical data. This can be a challenge to people because such information is located in multiple places, due to the highly decentralized statistical information system in the United States, and exists in large volumes (for example, at the Federal level, over 70 agencies are charged with providing statistical information). This complexity results in the potential for statistics to be difficult to find, integrate, and understand. Efforts such as FedStats (www.fedstats.gov) and FirstGov (www.firstgov.gov) address this complexity by providing gateways to a variety of information. Our project, funded by the National Science Foundation’s Digital Government Initiative and sponsored by a number of Federal statistical agencies (project information at http://ils.unc.edu/govstat) takes these efforts further by proposing the Statistical Knowledge Network (SKN). We envision a seamless network, in which users have transparent access to statistical information from a variety of sources (Federal, state, and local agencies as well as citizen-produced data and commentary). End-users will have access to a variety of tools enabling them to explore and locate statistical information and then use it upon retrieval. These could include exploration tools such as the Relationship Browser (Marchionini & Brunk, 2003), the FedStats Browser (Kules & Shneiderman, 2003) and interpretation tools such as the Statistical Interactive Glossary (Haas, Pattuelli, & Brown, 2003) and Ymap (Kung, et al., 2003; Zhao et al., 2003). The SKN would enable people to find statistics without having to know particular sources, and provide the context necessary to enable people to interpret and use these statistics appropriately. For example, a user interested in determining whether to move to a new town could find information on cost of living and employment in that location from the Bureau of Labor Statistics, general demographic information from the Census Bureau, health information from the National Center for Health Statistics and so on, without having to know that these agencies exist or performing separate searches. The SKN would present these in an integrated fashion and provide explanatory information and manipulation tools (via interfaces) that would help the user to understand and use the information. Figure 1 shows the general architecture of the Statistical Knowledge Network (SKN). Agencies (at Federal, state, local, tribal, etc.) would be linked into the network and their publicly available data and metadata will be accessible to end-users (indicated by the shaded portions of the figure) throughout the system. A distributed, virtual public intermediary will provide the conduit by which agency data and metadata are found, extracted, pre-integrated as appropriate and transferred to specific user tools via a set of interfaces. These interfaces might be specific to a given user or group of users. The public intermediary might also include derived metadata, such as a set of rules concerning appropriate comparisons to be consulted when data with discrepancies or missing values are retrieved. Particular end-user tools would incorporate selected metadata and be able to present them to users. (Figure 1 goes here) Metadata is a critical component of the SKN, as any effort designed to integrate data for users from across different sources and domains necessarily depends upon the effective use of metadata to provide linkages between sources and to allow the user to orient herself within the data. By metadata, we mean information that provides the context necessary to enable understanding and use. For example, to understand what it means that the unemployment rate in the United States for July 2003 was 6.2% (from www.bls.gov), one might need to know that this is a seasonally adjusted number (and the meaning of seasonal adjustment), based on data collected by the Census Bureau for the Bureau of Labor Statistics in June 2003 via a household survey, how the BLS defines unemployment for this survey, and so on. This contextual information is metadata. Thus a major research thrust in our project is to identify the metadata necessary to support end-user functions and services in the SKN and to develop an architecture to support metadata acquisition, storage, and use. It is on the metadata component of our project that this paper focuses. Metadata for statistical information has been under consideration for upwards of twenty years. Statistical agencies have long been interested in streamlining their processes, and retention of metadata forms an important component of that goal. Efforts at the level of individual agencies (such as the metadata repository efforts at the Census Bureau (Highsmith, 2000)), the Federal government (www.FedStats.gov), and international organizations (see, for example, the work of MetaNet at www.epros.ed.ac.uk/metanet/) are being leveraged by our project. An international standard, ISO/IEC 11179 (http://metadata-stds.org/11179/index.html), exists to support the development of metadata registries. The data archives community has been interested in providing greater access to datasets. The Data Document Initiative (DDI) (http://www.icpsr.umich.edu/DDI/index.html) reflects their metadata efforts. The library and information science community has developed the Dublin Core (www.dublincore.org), a general metadata standard to support discovery and use of a wide variety of materials, including statistical information. 3. Which Metadata In our efforts toward building the SKN, we need to concentrate on those metadata elements that are most important to the end user in supporting the integration process. This suggests research that starts with user needs and problems. To this end, we conducted a series of studies in which users interact with statistical information and from those interactions, we identified user issues and challenges (see Hert & Hernandez, 2001; Denn, Haas, & Hert, 2003; Hert & Haas, 2003). We have associated metadata elements with these issues and challenges and begun to construct a metadata architecture that prominently features the elements identified. To illustrate these user issues, consider the scenario of a student who needs to find out how the economies of her county and state compare to that of the United States as a whole. Table 1 indicates some statistics that she would have found on March 3, 2003. The student needs to understand what statistics might help, as well as understand the statistics retrieved. She must begin to understand what makes the numbers different, whether she can compare 1999 with 2003 data, whether seasonally-adjusted numbers can be compared to non-adjusted, and so on. (Table 1 about here) Observing users engaged in scenarios such as the one above provides us with a rich picture of user behavior and associated challenges from which we can derive relevant metadata elements. Our analysis identified the following problems for which metadata might be useful:  Mapping of user terms to agency terms (for example, a mapping between a user’s term “inflation” to the term used by BLS, “cost of living”)  Definitions of statistical/survey terms (such as “seasonal adjustment”)  Comparability of statistics (when given for multiple geographic units, or similar statistics from different sources), for example whether statistics for Yakima County can be compared to or substituted for statistics for the Yakima Metropolitan Statistical Area (MSA)  Help with finding and interpreting statistics, such as the meaning of the values of the Consumer Price Index  Information about recency of statistics (what is the latest number available), update schedule (when will a new number be available), and when updates are available on website  For specific statistics, geographic levels at which that statistic available (for example, can I get the Consumer Price Index for an individual city?)  Navigation by means of something other than large lists of text links  Column headings that are always visible, rather than scrolling off the page The behaviors and problems provide a critical link to the statistical metadata necessary in the SKN. Problems with meaning of terms can be resolved with definitional metadata, geographic relationships need geographical coordinates but also definitional information (how does Yakima county overlap with the MSA), questions about methodology need technical documentation, and so forth. 4. Metadata Elements From the problems above, we have begun to derive a short list of metadata elements (to be expressed in XML) that can enable discovery and use of statistical information. These elements are grouped into several “components” to address particular aspects of statistical information: descriptive information about particular statistical products (e.g., a table with aggregate data, a microdata set, an analytical report with data embedded), groupings of these products (e.g., a statistical almanac), the survey/census efforts from which the products are derived and concept and variable descriptions. Detailed geographic- and time-related attributes are associated with elements throughout the various components. This initial set is drawn partially from a set developed by the Data Documentation Initiative (DDI, http://www.icpsr.umich.edu/DDI/index.html) as well as from an ontology element set, the DAML+ OIL (DARPA Agent Markup Language +Ontology Inference Layer, http://www.daml.org/2001/03/daml=oil-index.html), and a classification element set, the Neuchatel Terminology Model (2003). Several considerations are driving these development efforts. Our user studies show that time and geography are at the core of many user problems. For example, end-users will need to be able to find the dates of statistics, or the most current statistic, or the frequency with which a given statistic is updated. The elements (and their attributes) must be capable of representing complex relationships in these areas among statistical products, the surveys from which they are derived, the variables (which may have been modified throughout a survey’s history), etc. Expressing the relationships among the products themselves is also challenging. A table may stand alone, be part of a news release, present data from one survey, or incorporate data from several surveys. All these relationships must be expressable in order to provide accurate metadata for end-user discovery and use. We are addressing this by allowing any “entity” (the specific unit being described) to be “contained” in any other entity. A third challenge has been to structure the element set to facilitate its integration with key statistical metadata standards, the DDI and ISO/IEC 11179 (Metadata Registries, http://metadata-stds.org/11179/index.html) as well as with ontology and classification standards. Where possible, we are directly incorporating elements from DDI and other standards but developing new elements for some of the components. A particularly thorny area has been the representation of tables as data cubes. The DDI has a developing specification in this area and we are currently exploring it for incorporation into our model. 5. A Research Agenda Our work to date has addressed several areas of the large challenge of metadata in the SKN. We have a rich working knowledge of user behavior and a developing structure in XML to represent the metadata. To actualize the SKN, further efforts in several areas are anticipated: Identification of additional metadata: As we continue to elaborate on the SKN’s functionality, it is likely that we will identify additional end-user activities to support. As these are identified, we will define relevant metadata. Further user studies (including usability tests, ethnographic analyses, etc.) should be undertaken. An area of particular interest to our project is enabling users to strategize about how to find and use statistics and helping them make appropriate comparisons. Supporting these activities may require metadata not currently available in statistical agencies. Accessing, extracting, and managing metadata: An important aspect of building the SKN is working with statistical information suppliers to understand their processes and data management strategies. These represent both technical and social challenges. The SKN metadata infrastructure must have knowledge of internal metadata sources and structures, the ability to extract from them, and reformulate if necessary. Additional work on understanding metadata initiatives (for example, the Census Bureau’s metadata repository efforts) occurring within various datasupplying agencies is thus necessary. The problems of legacy data and metadata sources will also need to be addressed. From a social perspective, integration of sources and metadata will require negotiation among stakeholders. Rules about how to handle discrepancies in data and metadata, determinations about how to recognize acceptable data and metadata, and management of information in the public intermediary will all be needed. Developing these rules requires understanding corporate cultures and how agencies fit into their political worlds Haggerty, 2001). For example, specific metadata may be useful to have but not feasible to gather from an agency because it has a low priority in terms of the mission of the agency or what may be mandated by funding agencies. Additionally, some metadata will cross agency boundaries (such as that related to making comparisons among data from multiple sources) and determining how and who creates and maintains such metadata will need to be considered. An understanding of the social aspects of the system is essential to devising a technical system to access and retrieve metadata. Further specification of XML metadata structures: We have already identified the representation of time and spatial dimensions as problematic with existing statistical metadata schemes, and further efforts to understand the role these play in user activities and how to model them efficiently and effectively are necessary. New metadata (identified in user studies) may entail new XML structures. Tools such as ontologies and geographic information systems are not currently represented in our metadata element set (nor in many other statistical information metadata sets). Harmonization of metadata standards: Integration of information across agencies and even domains results in an increased need to harmonize across metadata standards or find ways to make them work together without actually merging them. How will we be able to maintain the specificity needed within a particular domain or a particular application and still enable communication? The Dublin Core is currently vying to be a lingua franca but the extent to which it can represent statistical metadata remains to be seen (Woodley, 2000). Developing software to manipulate and transfer metadata in the SKN: Structured metadata is helpful only if it can be extracted from sources and transferred to relevant tools (both within the network and for end-users). Software will need to be developed that can exploit the structured metadata (and vice versa, the structure needs to support identified functionality necessary in the SKN). 6. Conclusion Metadata forms the backbone of functions and services throughout the Statistical Knowledge Network. Supporting end-users will require getting metadata (and data) to them in the context of tools that support their situations and tasks. Our work is designed to provide insight into end-user activities and from that, derive an appropriate architecture for SKN metadata. As part of that process, we are developing and executing a research agenda in that area. Acknowledgement This material is based upon work supported by the National Science Foundation (NSF) under Grant EIA 0131824. Project information is available at http://ils.unc.edu/govstat. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. We thank Dan Gillman of BLS and Gary Marchionini, UNC-Chapel Hill for their comments on drafts of this paper. References Ambite, José Luis, Yigal Arens, Eduard H. Hovy, Andrew Philpot, Luis Gravano, Vasileios Hatzivassiloglou and Judith Klavans (2002). “Simplifying Data Access: The Energy Data Collection Project”. IEEE Computer 34(2): 47-54. Berners-Lee, T.; Hendler, J.; Lassila, O. (2001) The Semantic Web. Scientific American. May 2001. Available online at http://www.sciam.com/article.cfm?colID=1&articleID=00048144-10D2-1C7084A9809EC588EF21 (accessed 3-5-03). Denn, S., Haas, S.W., & Hert, C.A. (2003). Statistical metadata needs during integration tasks. 2003 Dublin Core Conference (to appear).Paper currently available from first author. Haas, S.W., Pattuelli, M.C. & Brown, R.T. (2003). Understanding statistical concepts and terms in context: The GovStat Ontology and Statistical Interactive Glossary. Proceedings of the American Society for Information Science and Technology Annual Meeting. (to appear). Haggerty, K.D. (2001) Negotiated measures: The Institutional micropolitics of official criminal justice statistics. Studies in History and Philosophy of Science 32(4): 705-722. Hert, C.A. and Haas, S.W. (2003). Supporting end-users of statistical information: the Role of statistical metadata in the statistical knowledge network. DGO2003: National Conference on Digital Government Research (May 18-21, Boston, MA). Pp. 127-130. Hert, C.A. and Hernandez, N. (2001). User Uncertainties With Tabular Statistical Data: Identification And Resolution: Available at http://ils.unc.edu/govstat/fedstats/uncertaintiespaper.htm Highsmith, S. (2000). Building a Corporate Metadata Repository at the U.S. Bureau of the Census. (Seminar on Integrated Statistical Information Systems and Related Matters, ISIS 2000, Riga, Latvia, 29-31 May 2000) Available at http://www.unece.org/stats/documents/ces/sem.43/10.s.e.pdf (accessed 8/26/2003) Kang, H; Plaisant, C.; & Shneiderman, B. (2003). New approaches to help users get started with visual interfaces: multi-layered interfaces and integrated initial guidance. DGO2003: National Conference on Digital Government Research (May 18-21, Boston, MA). Pp. 141-145 Kules, W. and Shneiderman, B. (2003). Designing a Metadata -Driven Visual Information Browser for Federal Statistics. DGO2003: National Conference on Digital Government Research (May 18-21, Boston, MA). Pp. 117-122. Marchionini, G., and Brunk, B. (2003). Toward a General Relation Browser: A GUI for Information Architects. Journal of Digital Information 4(1) available online http://jodi.ecs.soton.ac.uk/Articles/v04/i01/Marchionini/ (accessed 8/20/2003) Marchionini, G.; Haas, S.W.; Plaisant, C.; Shneiderman, B.; Hert, Carol A. (2003) Towards a Statistical Knowledge Network. DGO2003: National Conference on Digital Government Research (May 18-21, Boston, MA). Pp. 27-34. Neuchâtel Terminology: Classification database object types and their attributes, Version 2.0 (2003). Currently not publicly available but available from Carol A. Hert. Woodley, M. (2000). Crosswalks: the path to universal access. In: Introduction to metadata: pathways to digital information. (Los Angeles, Calif.: Getty Information Institute). Available from: <URL:http://www.getty.edu/research/institute/standards/intrometadata/2_articles/ woodley/> accessed 8/20/2003 Zhao, H.; Shneiderman, B.; & Plaisant, C. (2003). Improving accessibility and usability of geo-referenced statistical data. DGO2003: National Conference on Digital Government Research (May 18-21, Boston, MA). Pp. 147-150. Author Biographies Carol A. Hert is a Research Professor at Syracuse University’s School of Information Studies. Her work focuses on statistical information seeking behavior and the design of metadata systems to support that behavior. Stephanie W. Haas is an associate professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill. Her research interests include language usage, terminology and controlled vocabularies, and concept modeling. Sheila Denn is a doctoral student at the School of Information and Library Science at the University of North Carolina – Chapel Hill. Her research interests are information seeking behavior, information integration and metadata. end user Agency data with integrated metadata Domain Ontologies end user end user Distributed public intermediary: variable/concept level, XML-based incorporating ISO11179 and DDI, providing java-based statistical literacy tools to user interface agency backend data and metadata User Agency with mutliple metadata respositories Interfac es Statistical Ontology end user end user end users: interact with data from information/concept perspective, not just agency perspective end user agency backend data and metadata Domain Experts End User Communities firewall Figure 1: The Statistical Knowledge Network end user Economic Indicator Unemployment rate Yakima County Yakima MSA 11.1% (2000 Census) 12.1% Dec. 2002 not seasonally adjusted (BLS) Wash. State 6.2% (2000 Census) 6.6% Jan. 2003 12.7% (Jan. 2003 prelim. seasonally prelim, not adjusted (Wash. seasonally Website) adjusted) 7.4% Jan 2003 not (Washington State adjusted (Wash. website) website) 7.0% Dec. 2002 seasonally adjusted (BLS) Poverty Rate 21.9% (1999, for 7.3% (1999, for families, from families, from 2000 Census data) 2000 Census data) Table 1: Example Economic Indicators for Yakima, Washington State, and the U.S. United States 5.8% (2000 Census) 5.7% Jan 2003, seasonally adjusted (BLS) 9.2% (1999, for families, from 2000 Census data)

The Role of Metadata in the Statistical Knowledge Network: An

Related documents

Products

Support

The Role of Metadata in the Statistical Knowledge Network: An

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib