DINO Discussion Paper: Common Metadata Standards and a Centralized Web-based Data Extraction and Analysis System for Ontario

Prepared by DINO (Data In Ontario), a sub-committee of OCUL
April 2006

Introduction

One of the main discussion items at a recent OCUL Data In Ontario (DINO) meeting (held 7-8 December 2005 at Ryerson University) was where the needs of the community lie in terms of providing easy access to our collections of electronic data. The discussion focused primarily on the part of the collection obtained through the DLI (Data Liberation Initiative) program, a strong and well-established partnership between Canadian post-secondary institutions and Statistics Canada. Beyond this collection, the growing availability of other data resources has been matched by increasing expectations from an ever more sophisticated community of users.

The discussion centred on what various institutions are currently doing and where they think they might be in the future. There is currently significant variation in access to data resources across institutions, and certain institutions have raised questions about the sustainability of their local solutions. Considering the success of Scholars Portal and the new potential this has created for collaboration, the idea emerged of providing a centralized, web-based data extraction/analysis system in Ontario. The topic generated a wide-ranging discussion, with many ideas, some of which will need further development. One point of agreement was to begin seeding this idea with OCUL Directors. To 'jump-start' this process, the group agreed that a discussion paper should be written to review the current situation and outline the benefits of OCUL moving in this direction.

Three Major Hurdles: Access, Use, and Standards

Access

Over the years, researchers have faced two major hurdles when it comes to data. The first was the prohibitive cost of obtaining access to electronic data. Although there is still a lot of work to be done, many of these financial constraints have been removed. Consortial programs like DLI and ICPSR have essentially decreased the marginal cost of research data to zero. In addition, there is hope that national programs, such as CRKN, will open up further opportunities.

Use

The second hurdle concerns how researchers actually use the data to generate empirical results. In the past there were significant up-front costs to learning how to read and manipulate complex microdata files: it takes time to develop the skills needed to understand data structures and to write efficient, effective programs that achieve research goals. In the last decade or so, a number of easy-to-use, homegrown, web-based data extraction/analysis systems were developed. A viable commercial solution was slow to appear, so much of this development took place in academic institutions, in Canada and around the world. These systems went a long way toward eliminating the steep learning curve historically associated with using large microdata files, and they have been extremely effective and have served the community well. Recently, however, a number of commercial extraction systems have reached the point where they can easily meet the needs of a broad range of clients.

Standards

A key element in the development of these commercial systems has been the development and promotion of metadata[3] standards, as defined by initiatives such as DDI[4]. Using these standards, our data collections become portable: they can be plugged into, and moved between, DDI-compliant analysis/extraction systems. The potential of this, in terms of data sharing and migration to evolving systems, should not be underestimated. The traditional library world crossed this bridge years ago with MARC and the myriad of Library Management Systems that support that standard.
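As a concrete illustration of what such 'data about data' looks like, the fragment below sketches DDI-style markup for a single survey variable. This is an illustrative sketch only: the element names (codeBook, stdyDscr, dataDscr, var, labl, catgry, catValu) follow DDI Codebook conventions, but the study title, variable, and category values are invented, and the fragment is not a complete DDI instance.

```xml
<!-- Illustrative sketch only: element names follow DDI Codebook conventions;
     the study, variable, and values are invented for this example. -->
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Household Survey, 2005</titl></titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="PROV">
      <labl>Province of residence</labl>
      <catgry><catValu>35</catValu><labl>Ontario</labl></catgry>
      <catgry><catValu>24</catValu><labl>Quebec</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
```

Because every compliant system reads the same elements, a data file described this way once can be loaded into any DDI-compliant extraction system, and the description travels with the data.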
The data world faces a similar decision now: do we continue to maintain and develop homegrown solutions, or do we move toward common metadata standards and compliant data extraction systems?

An ad hoc group called CANDDI (Canadian Data Documentation Initiative) was established in 2004 to look at adopting metadata standards in the Canadian context. While structured informally, CANDDI has three clear goals: (i) to coordinate efforts to create DDI 'marked-up' metadata for Canadian studies; (ii) to act as a clearinghouse for the products of these efforts and the discussions surrounding them; and (iii) to create a made-in-Canada version of the DDI (i.e. a subset of DDI well suited to Canadian studies, along with standards and best practices for marking up data). If successful, CANDDI can lay the groundwork for any OCUL initiative by contributing to a rich collection of metadata that is easily shareable amongst our community. More information can be found at http://blogs.uoguelph.ca/canddi/blog/.

Current Situation

Recognized Value

Many OCUL institutions represented at the DINO meetings expressed strong support for the web-based services they currently use. For smaller institutions in particular, the ability to rely on one of the homegrown systems has virtually eliminated the need to download and process data files for researchers; Data Librarians at these institutions have instead been able to focus on the outreach and teaching side of promoting 'numeracy'. It is important to note that shared web-based data systems are in no way a substitute for local data expertise. As expressed in Libraries and E-Learning – Final Report of the CARL E-Learning Working Group:

“CARL libraries played a pivotal role in negotiating free access to census data and online maps for Canadian researchers and students. Such databases are key to research success.
’The Fulbright Study showed a strong association between high levels of local data support and good performance. Nearly all researchers [supported the] view that local data services are a key factor in the production of highly cited research publications... There are also strong a priori grounds for associating data support with good research performance. The evidence for growth in quantity and quality of empirical research is very strong.’ ” [emphasis added]

[3] Metadata: data about data files – analogous to cataloguing records for books.
[4] DDI (Data Documentation Initiative): an international initiative to establish metadata standards for describing data files.

Three Systems in Current Use

Web-based data extraction and analysis services are currently provided in three ways at OCUL institutions. Twelve institutions subscribe to one of two homegrown systems: IDLS (Internet Data Library System, developed and maintained at UWO) and QWIFS (Queen's Web Interface For SPSS, developed and maintained at Queen's University). For the most part, subscribers to the UWO and Queen's systems are newer providers of data services; for them, IDLS and QWIFS provide an efficient means of improving data services for their users. Eight institutions use commercial software (Nesstar or SDA): five run the full Nesstar system (Publisher, Server, and WebView, described more fully below), while U of T and Ryerson use SDA (U of T also uses Nesstar Publisher).

OCUL Institutions by Type of Data Extraction/Analysis System

  IDLS (UWO):      Brock, Lakehead, Laurentian, Nipissing, Trent, UOIT, Western, Windsor
  QWIFS (Queen's): RMC, McMaster, Ryerson, Queen's
  Nesstar or SDA:  Guelph, Waterloo, Wilfrid Laurier, U of T, Carleton, Ottawa, Windsor, Ryerson
  No system:       York

IDLS and QWIFS have been in service for the past ten years or so. Both systems continue to serve researchers well, but inherent in any homegrown, institution-specific system are concerns about sustainability. In addition, there is a desire to remain at the cutting edge of functionality and a need to follow emerging standards. All of these things consume resources.

The University of Guelph, along with Waterloo and WLU, has already taken the step of eliminating the risks and costs of supporting a homegrown solution. Instead, the Tri-University group has focused on developing standards-based metadata that can be shared among institutions, and on delivering this information to students and researchers using a commercial solution (Nesstar). The University of Ottawa and Carleton University have also adopted the Nesstar system. Windsor has purchased Nesstar and intends to use it to load locally-produced files.

Looking Forward

Recognizing Need

While generally pleased with the functionality of their current homegrown solutions, the broader DINO group recognized (i) the need for more, and ongoing, development as metadata standards emerge, and (ii) the reality that homegrown systems now provide fewer features and are increasingly difficult to sustain relative to commercial solutions. Above all, the group felt that emerging metadata standards are driving a level of software development that is difficult to sustain at an individual institution, and that the added value of commercial software development now surpasses the value of local customization.

Building on the Scholars Portal Model

The group also recognized the success of the OCUL Scholars Portal, which has changed the library landscape in terms of inter-university cooperation. Central provision of citation, abstracting, and full-text databases, via the CSA Illumina interface, has altered the way we look at the purchase, provision, and preservation of these essential resources. Those of us on the 'front lines' of library service know first-hand the benefits of the Scholars Portal to our users, in both the short and long term.
There is no reason why taking a similar approach with numeric data files would not produce similar results. (Geospatial data files were also discussed, and need to be dealt with, but they are beyond the scope of this discussion paper.)

The Case for Adopting Metadata Standards

Ultimately, the decision to adopt a centralized data system is secondary to the decision to adopt metadata standards. Data extraction/analysis systems may come and go (much as Library Management Systems do), but the underlying 'data about the data' stays the same and provides the underpinnings of a sustainable, accessible, and sharable data collection. The emergence of DDI as the de facto metadata standard for data files sets the stage for OCUL institutions to move to a new level of cooperation and collaboration. Being able to share the work of creating metadata is an important, practical outcome of this approach.

The Case for a Centralized Data Extraction/Analysis System

Once a common metadata standard has been adopted, the case for a shared data extraction and analysis system becomes compelling. From an administrative perspective, a commercial solution provides a more sophisticated, sustainable system that will adapt to meet changing user needs (or be replaced). From a research and teaching perspective, OCUL institutions will benefit from a common data interface, shared metadata, a broader base of expertise, and a richer body of teaching materials.

Choosing a DDI-compliant Data Extraction and Analysis System

There are a number of DDI-compliant extraction and analysis systems on the market. One example is Nesstar, already in use by the Tri-University group, Carleton University, and the University of Ottawa. Nesstar is a well-developed system consisting of three main components:

- Nesstar Publisher – streamlines the loading of files from existing sources (SPSS, SAS) and the addition of DDI-compliant metadata.
- Nesstar Server – provides a robust, scalable platform for 'publishing' survey and aggregate data on the web.
- Nesstar WebView – offers users an intuitive, web-based interface for manipulating and visualizing data resources.

In addition to the features of these components (see http://www.nesstar.com/ for more details), the following points should be considered when evaluating a product such as Nesstar:

- Statistics Canada is using Nesstar to mark up and deliver data through the DLI (Data Liberation Initiative); see http://www.nesstar.com/news/press3.shtml.
- International data archives (ICPSR, the UK Data Archive, and most recently the Zentralarchiv in Cologne) are providing shared access to data and metadata using the Nesstar system.
- Nesstar accepts and generates DDI-compliant metadata, which is a big plus in terms of improved access and also ensures that migration of data to future systems will be straightforward. A Canadian subset of the DDI tag set is being developed by the CANDDI group.
- Nesstar's DDI compliance will facilitate the cooperative 'cataloguing' of data files and the subsequent sharing of this metadata. Institutions will be able to contribute to the cataloguing of major files as needed, while concentrating on the unique files held at their own institution; this will help minimize duplication of effort.
- Nesstar is an 'off-the-shelf' product, supported independently of any given institution.
- Having Nesstar as a common interface at all OCUL institutions will promote the development of transferable research skills and teaching resources.
- Nesstar has well-developed access control and security features – important in a shared environment where not all institutions will be licensed to access all data files.

There are many data files beyond the Statistics Canada/DLI holdings that could be loaded into Nesstar (e.g. ICPSR data, polling data, data from researchers).
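To illustrate why DDI compliance matters for sharing and migration, the minimal sketch below uses only Python's standard library to pull variable names and labels out of a DDI-Codebook-style fragment. The element names (var, labl, catgry) follow DDI Codebook conventions, but the fragment itself is invented for illustration; a production system such as Nesstar would of course consume complete DDI instances rather than this toy example.

```python
# A hedged sketch: any DDI-compliant tool can recover the same variable-level
# metadata from the same document. Element names follow DDI Codebook
# conventions; the survey fragment below is invented for illustration.
import xml.etree.ElementTree as ET

DDI_FRAGMENT = """
<codeBook>
  <dataDscr>
    <var name="PROV">
      <labl>Province of residence</labl>
      <catgry><catValu>35</catValu><labl>Ontario</labl></catgry>
    </var>
    <var name="AGEGRP">
      <labl>Age group of respondent</labl>
    </var>
  </dataDscr>
</codeBook>
"""

def variable_labels(ddi_xml: str) -> dict:
    """Map each variable name to its human-readable label."""
    root = ET.fromstring(ddi_xml)
    labels = {}
    for var in root.iter("var"):
        # find() looks only at direct children, so category labels
        # nested inside <catgry> elements are not picked up here
        labl = var.find("labl")
        labels[var.get("name")] = labl.text if labl is not None else None
    return labels

print(variable_labels(DDI_FRAGMENT))
# {'PROV': 'Province of residence', 'AGEGRP': 'Age group of respondent'}
```

The point is not the dozen lines of code but the interoperability they stand in for: because the element names and structure are standardized, the same metadata can be created once at one institution, loaded into a system such as Nesstar, and reused by any other compliant tool.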
Other Issues

Potential cost savings

As with other 'consortial' purchases, it is hoped that the negotiated purchase price and annual maintenance fee for an OCUL licence would be lower than they would be for individual institutions; in reality, academic pricing for Nesstar will not be the binding constraint. There is also the potential to save the cost of local hardware, installation, and maintenance.

Local needs will still prevail

The adoption of a centralized server model will not prevent individual institutions from tailoring their services to meet local needs. Local access to IDLS or QWIFS need not be discontinued under this model, and we suspect that several institutions may choose to install Nesstar (Publisher, Server, WebView) or SDA locally to better meet their users' needs. This type of distributed model will also help facilitate the sharing of the many local data collections owned by various institutions.

Archiving locally-produced data

For many institutions, the ability to start archiving locally-produced data sets was seen as important. There is growing recognition on campuses of the amount of research data that disappears as people retire or move elsewhere; in other cases data is destroyed because there is no place to store it permanently. Metadata standards and a tool like Nesstar will facilitate local archiving and feed into a province-wide (and possibly national) repository of research data.

New members of DLI

Colleges are starting to join DLI. While not part of the OCUL mandate, there is perhaps a role to be played here in promoting a 'data culture' throughout post-secondary education in Ontario. From what we have seen, this model should scale fairly easily.

DINO Discussion Paper Subcommittee:
Jeff Moon
Bo Wandschneider
Wendy Watkins