DINO Discussion Paper:
Common Metadata Standards and
A Centralized Web-based Data Extraction and
Analysis System for Ontario
Prepared by
DINO – Data In Ontario
Sub-committee of OCUL
April, 2006
Introduction
One of the main discussion items at a recent OCUL Data In Ontario (DINO) meeting[1] was to
look at where the needs of the community were in terms of providing easy access to our
collections of electronic data. The discussion primarily focused on the part of the collection
which includes data obtained through the DLI[2] program. This program is a strong and
well-established partnership between Canadian post-secondary institutions and Statistics Canada.
In addition to this collection, there has been growth in the availability of other data
resources, matched by increasing expectations from an ever more sophisticated
community of users.
The discussion focused on what various institutions were currently doing and where they
thought they might be in the future. Currently there is significant variation in access to data
resources across institutions and certain institutions have raised questions about the
sustainability of their local solutions.
Considering the success of the Scholars Portal and the new potential this has created for
collaboration, an idea was developed to provide a centralized, web-based data
extraction/analysis system in Ontario. This topic generated a wide-ranging discussion with
many ideas, some of which will need further development. One area of agreement was to
begin seeding this idea with the OCUL Directors. To ‘jumpstart’ this process, the group
agreed that a discussion paper should be written to review the current situation and outline
the benefits of OCUL moving in this direction.
Three Major Hurdles: Access, Use, and Standards
Access
Over the years, researchers have faced major hurdles when it comes to data. The first was
the prohibitive cost of obtaining access to electronic data. Although there is still a lot of work
to be done, many of these financial constraints have been removed. Consortial programs like
DLI and ICPSR have reduced the marginal cost of research data essentially to zero. In
addition, there is hope that national programs, such as CRKN, will open up more
opportunities.
Use
The second hurdle deals with how researchers are able to actually use the data to generate
empirical results. In the past, there were significant up-front costs to learning how to read and
manipulate complex microdata files. It takes time to develop the necessary skills to
[1] 7-8 December 2005, at Ryerson University.
[2] Data Liberation Initiative.
understand data structures and develop efficient and effective programs to achieve research
goals.
In the last decade or so, a number of easy-to-use, homegrown, web-based data
extraction/analysis systems were developed. A viable commercial solution was slow to
appear, so much of this development took place in academic institutions, in Canada and
around the world. These systems went a long way toward eliminating the steep learning
curve historically associated with using large microdata files. These systems have been
extremely effective and have served the community well. Recently, however, a number of
commercial extraction systems have matured to the point where they can easily meet the
needs of a broad range of clients.
Standards
A key element in the emergence of these commercial systems has been the development
and promotion of metadata[3] standards, as defined by initiatives such as DDI[4]. Using these
standards, our data collections become portable and can be plugged into and moved
between DDI-compliant analysis/extraction systems. The potential of this, in terms of data
sharing and migration to evolving systems, should not be underestimated.
The traditional library world traversed this bridge years ago with MARC and the myriad of
Library Management Systems that support this standard. The data world is facing a similar
decision now. Do we continue to maintain and develop homegrown solutions, or do we move
toward common metadata standards and compliant data extraction systems?
An ad hoc group called CANDDI (Canadian Data Documentation Initiative) was established in
2004 to look at adopting metadata standards in the Canadian context. While structured
informally, CANDDI has three clear goals: (i) to coordinate efforts to create DDI ‘marked up’
metadata for Canadian Studies, (ii) to act as a clearinghouse for the products of these efforts
and the discussions surrounding them, and (iii) to create a made-in-Canada version of the
DDI (i.e. a subset of DDI well-suited for Canadian studies along with some standards and
best practices for marking up data). If successful, CANDDI can lay the groundwork for any
OCUL initiative by contributing to a rich collection of metadata that is easily shareable
amongst our community. More information can be found at:
http://blogs.uoguelph.ca/canddi/blog/.
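To make the notion of DDI ‘marked up’ metadata concrete, the sketch below parses a small DDI Codebook fragment. The element names (codeBook, stdyDscr, dataDscr, var) follow the DDI Codebook convention, but the study title and variables are invented for illustration, and real DDI records carry far more detail.

```python
# Minimal sketch of why DDI metadata is portable: any DDI-aware tool can
# read the same study description.  The fragment below is hypothetical;
# only the element names follow the DDI Codebook convention.
import xml.etree.ElementTree as ET

DDI_FRAGMENT = """
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Hypothetical Ontario Household Survey, 2005</titl>
      </titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="age"><labl>Age of respondent</labl></var>
    <var name="prov"><labl>Province of residence</labl></var>
  </dataDscr>
</codeBook>
"""

def summarize(ddi_xml: str) -> dict:
    """Extract the study title and variable labels from a DDI codebook."""
    root = ET.fromstring(ddi_xml)
    title = root.findtext("stdyDscr/citation/titlStmt/titl")
    variables = {v.get("name"): v.findtext("labl")
                 for v in root.findall("dataDscr/var")}
    return {"title": title, "variables": variables}

summary = summarize(DDI_FRAGMENT)
```

Because every DDI-compliant extraction system reads the same structure, the same record can be loaded into Nesstar today or migrated to a successor system later without re-cataloguing.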
Current Situation
Recognized Value
Many OCUL institutions represented at the DINO meetings expressed their strong support for
the web-based services they are currently using. For smaller institutions in particular, the
ability to rely on one of the homegrown systems has virtually eliminated the need for these
institutions to download and process data files for researchers. Rather, Data Librarians at
these institutions have been able to focus on the outreach and teaching part of promoting
‘numeracy’ at their institutions.
It is important to note here that shared web-based data systems are in no way a substitute for
local data expertise. As expressed in Libraries and E-Learning – Final Report of the
CARL E-Learning Working Group:
“CARL libraries played a pivotal role in negotiating free access to census data and online maps for
Canadian researchers and students. Such databases are key to research success.
’The Fulbright Study showed a strong association between high levels of local data support and
good performance. Nearly all researchers [supported the] view that local data services are a key
factor in the production of highly cited research publications... There are also strong a priori
grounds for associating data support with good research performance. The evidence for growth
in quantity and quality of empirical research is very strong.’ ”
[emphasis added]

[3] Metadata = data about data files – analogous to cataloguing records for books.
[4] Data Documentation Initiative – an international initiative to establish metadata standards
for describing data files.
Three Systems in Current Use
Web-based data extraction and analysis services are currently provided in three ways at
OCUL institutions. Twelve institutions subscribe to one of two homegrown systems, IDLS[5]
and QWIFS[6], developed by UWO and Queen’s, respectively. For the most part, subscribers
to the UWO and Queen’s systems are newer providers of data services. For them, IDLS and
QWIFS provide an efficient means of improving data services for their users. Eight institutions
use commercial software (Nesstar or SDA). Five institutions each run the full Nesstar system
(Publisher, Server, and Webview, described more fully below). U of T and Ryerson use SDA;
U of T also uses Nesstar Publisher.
OCUL Institutions by Type of Data Extraction/Analysis System

IDLS (UWO): Brock, Lakehead, Laurentian, Nipissing, Trent, UOIT, Western, Windsor
QWIFS (Queen’s): RMC, McMaster, Ryerson, Queen’s
Nesstar or SDA: Guelph, Waterloo, Wilfrid Laurier, U of T, Carleton, Ottawa, Windsor, Ryerson
No System: York
IDLS and QWIFS have been in service for the past ten years or so. Both systems continue
to serve researchers well, but inherent in any homegrown, institution-specific system are
concerns about sustainability. In addition, there is a desire to remain at the cutting edge of
functionality and a need to follow emerging standards. All of these things consume
resources. The University of Guelph, along with Waterloo and WLU, has already taken the
step of eliminating the risks and costs of supporting a homegrown solution. Instead, the
Tri-University group has focused on developing standards-based metadata that can be shared
among institutions, and delivering this information to students and researchers using a
commercial solution (Nesstar). The University of Ottawa and Carleton University have also
adopted the Nesstar System. Windsor has purchased Nesstar and intends to use it to load
locally-produced files.
Looking Forward
Recognizing Need
While generally pleased with the functionality of their current homegrown solutions, the
broader DINO group recognized (i) the need for more, and ongoing, development as metadata
standards emerge, and (ii) the reality that we have reached a point where homegrown
systems provide fewer features and are increasingly difficult to sustain relative to
commercial solutions. Primarily, the group felt that emerging metadata standards are driving
a level of software development that is difficult to sustain at an individual institution, and
that, as opportunities arise, the added value of commercial software development surpasses
the value of local customization.
[5] Internet Data Library System – developed and maintained at UWO.
[6] Queen’s Web Interface For SPSS – developed and maintained at Queen’s University.
Building on the Scholars Portal Model
The group also recognized the success of the OCUL “Scholars Portal” which has changed
the Library landscape in terms of inter-university cooperation. Central provision of citation,
abstracting, and full-text databases, via the CSA Illumina interface, has altered the way we
look at the purchase, provision, and preservation of these essential resources. Those of us
on the ‘front lines’ of Library service know first-hand the benefits of the Scholars Portal to our
users, in the short and long-term. There is no reason why taking a similar approach with
numeric7 data files would not produce similar results.
The Case for Adopting Metadata Standards
Ultimately, the decision to adopt a centralized data system is secondary to the decision to
adopt metadata standards. Data extraction/analysis systems may come and go (much as
Library Management Systems do), but the underlying ‘data about the data’ stays the same
and provides the underpinnings of a sustainable, accessible, and sharable data collection.
The emergence of DDI as the de facto metadata standard for data files sets the stage for
OCUL institutions to move to a new level of cooperation and collaboration. Being able to
share the work of creating metadata is an important, practical outcome of this approach.
The Case for a Centralized Data Extraction/Analysis System
Once a common metadata standard has been adopted, the case for a shared data extraction
and analysis system becomes compelling. From an administrative perspective, a commercial
solution will provide a more sophisticated, sustainable solution that will adapt to meet
changing user needs (or it will be replaced). From a research and teaching perspective,
OCUL institutions will benefit from a common data interface, shared metadata, a broader
base of expertise, and a richer body of teaching materials.
Choosing a DDI-compliant Data Extraction and Analysis System
There are a number of DDI-compliant extraction and analysis systems on the market. One
example of such a product is Nesstar, which is already in use by the Tri-University group,
Carleton University, and the University of Ottawa. Nesstar is a well-developed system
consisting of three main components:
- Nesstar Publisher – streamlines loading of files from existing sources (SPSS,
SAS) and adding additional DDI-compliant metadata.
- Nesstar Server – provides a robust, scalable platform for the ‘publishing’ of survey
and aggregate data on the web.
- Nesstar Webview – offers an intuitive, web-based interface to users for the
manipulation and visualization of data resources.
In addition to the features of each of these components (see URL: http://www.nesstar.com/
for more details), the following items should also be considered when looking at a product
such as Nesstar:
- Statistics Canada is using Nesstar to mark up and deliver data through the DLI (Data
  Liberation Initiative); see URL: http://www.nesstar.com/news/press3.shtml .
- In addition, international data archives (ICPSR, the UK Data Archive, and most recently,
  the Zentralarchiv in Cologne) are providing shared access to data and metadata using
  the Nesstar system.
- Nesstar accepts and generates DDI-compliant metadata, which is a big plus in terms of
  improved access, but also ensures that migration of data to future systems will be
  straightforward. A Canadian subset of the DDI tag set is being developed by the
  CANDDI group.
[7] Geospatial data files were also discussed, and need to be dealt with, but are beyond the
scope of this discussion paper.
- Nesstar’s DDI compliance will facilitate the cooperative ‘cataloguing’ of data files and
  the subsequent sharing of this metadata. Institutions will be able to contribute to the
  cataloguing of major files, as needed, but will be in a position to concentrate on unique
  files held by their own institution; this will help minimize duplication of effort.
- Nesstar is an ‘off-the-shelf’ product supported independently of any given institution.
- Having Nesstar as a common interface at all OCUL institutions will promote the
  development of transferable research skills and teaching resources.
- Nesstar has well-developed access control and security features – important in a shared
  environment where not all institutions will be licensed to access all data files. There are
  many data files beyond Statistics Canada/DLI that could be loaded into Nesstar (e.g.
  ICPSR, polling data, data from researchers).
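The cooperative cataloguing idea can be sketched in a few lines: each institution contributes the DDI records it has produced, and a shared catalogue keeps one record per study, so effort concentrates on files not yet documented. The institution names, study identifiers, and record shape below are invented for illustration.

```python
# Hypothetical sketch of cooperative cataloguing: per-institution
# contributions are merged into one shared catalogue, keyed by study ID,
# so no study is marked up twice.  All names and IDs are invented.
def merge_catalogues(*contributions: dict) -> dict:
    """Merge {study_id: record} dicts; the first contribution of a study wins."""
    shared = {}
    for catalogue in contributions:
        for study_id, record in catalogue.items():
            shared.setdefault(study_id, record)  # skip already-catalogued studies
    return shared

queens = {
    "census-2001": {"title": "Census of Canada, 2001", "marked_up_by": "Queen's"},
}
guelph = {
    "census-2001": {"title": "Census of Canada, 2001", "marked_up_by": "Guelph"},
    "local-farm-survey": {"title": "Local Farm Survey", "marked_up_by": "Guelph"},
}

catalogue = merge_catalogues(queens, guelph)
# Guelph's duplicate census record is skipped; its unique local file is kept.
```

The policy for resolving duplicates (first-in wins here) is purely illustrative; in practice a shared catalogue would coordinate assignments before mark-up begins.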
Other Issues
Potential cost savings
As is the case with other ‘consortial’ purchases, it is hoped that the negotiated purchase price
and annual maintenance fee for an OCUL licence would be lower than it would be for
individual institutions. In practice, academic pricing for Nesstar is unlikely to be the binding
constraint. In addition, there is the potential to save the cost of local hardware, installation,
and maintenance.
Local needs will still prevail
The adoption of a centralized server model will not prevent individual institutions from
tailoring their services to meet local needs. Local access to IDLS or QWIFS need not be
discontinued under this model. We suspect that several institutions may choose to install
Nesstar (Publisher, Server, WebView) or SDA locally to better meet their users’ needs. This
type of distributed model will also help facilitate the sharing of the many local data collections
owned by various institutions.
Archiving Locally-produced Data
For many institutions the ability to start archiving locally produced data sets was seen as
important. There is growing recognition on campuses of the amount of research data that is
disappearing as people retire or move elsewhere; in other cases data is destroyed because
there is no place to store it permanently. Metadata standards and a tool like Nesstar will
facilitate local archiving, and feed into a province-wide (and possibly national) repository of
research data.
New members of DLI
Colleges are starting to join the DLI. While not part of the OCUL mandate, there is perhaps a
role to be played here in promoting the ‘data culture’ throughout post-secondary education in
Ontario. From what we have seen, this model should scale fairly easily.
DINO Discussion Paper Subcommittee:
Jeff Moon
Bo Wandschneider
Wendy Watkins