OECD Project on Enhanced Access to Public Data for Science,
Technology and Innovation: South African Case Study: The DataFirst
African Research Data Repository
Lynn Woolfrey 1
1. Overview of the Initiative
Name of initiative: DataFirst Open African Research Data Repository (DataFirst), University of Cape Town
Objective: Most Open African Data repositories do not provide (i) data at a granular level, or (ii) data that is quality assured, or (iii) data that is well-described. Thus, data from these sites is only marginally useful for sound policy research. Our repository ensures discovery of and open access to well-documented African government microdata and microdata from African research institutions to support data-intensive African policy research.
Type (strategy, policy, bill of law): University of Cape Town public good initiative funded by the Mellon Foundation. The data repository was originally set up to provide data for African researchers being trained in quantitative skills as a component of the Mellon-funded project. DataFirst now has a wider role, to provide high-quality primary African social science data to policy researchers worldwide.
Responsible policy making bodies: DataFirst’s Advisory Board, representing the SA Department of Science and Technology, Statistics SA, academics from leading SA universities and African think-tanks, and the Directors of internationally-certified data repositories in the UK and US.
Responsible implementing bodies: DataFirst, University of Cape Town
International reference framework (if relevant):
Open definition 2.1 (2005)
OECD principles and guidelines for access to research data from public funding (2007)
Sebastopol principles for open government data (2007)
Sunlight Foundation open government data principles (2010)
Open archival information system ISO reference model (ISO 14721, 2012)
FAIR data principles (2014)
Principles for the data revolution for sustainable development (UN, 2014)
ICSU-World Data System data sharing principles (2015)
Target audience: The scientific community
Total duration of initiative (years): Ongoing, initiated in 2001
Total budget of initiative (in national currency): 2 000 000 per annum
Sectoral focus (if relevant): Social science and health data
Type of data concerned (data from research, public sector information, private sector information): Primary public-sector data (from government administrative databases and censuses and surveys), primary data from large-scale research projects of African universities and research institutes.
Target audience (scientific community, business, civil society, general public): Scientific community
Expected results: Optimal use of publicly funded research data, better quantitative African policy research, and, ultimately, better African data because of exposure and data quality feedback from academia

1 Manager, DataFirst, University of Cape Town
2. Rationale, motives and key drivers
2.1 Policy context prior to the initiative
The South African Apartheid state regularly collected demographic and income data, but usually only
on its “white” population. Even when all citizens were enumerated, for example during the
quinquennial census, this was done with different questionnaires for each population group. It was
therefore difficult for policy researchers to compare households to obtain an accurate picture of South
Africa’s social and economic situation. After the installation of the new democratic government in
1994, quantitative social research in the country burgeoned, partly because of demand from the new
government for empirical data on all communities in South Africa for policy formulation (Seekings,
2008, pp. 1-3). These new quantitative datasets were seldom re-used for further research because in
the early 1990s South Africa lacked the policy and technological infrastructure to enable formal
research data sharing. For example, in 1993 the University of Cape Town undertook a comprehensive
World Bank sponsored survey of all South African households. The data informed the new
government’s reconstruction and development plans (Parliament of South Africa, 1994) but the
dataset eventually had to be archived and shared by the World Bank which had the necessary research
data sharing infrastructure.
2.2 Objectives and expected results
South Africa’s 1999 Statistics Act mandated the state to collect data on its citizens, and to improve the
quality of national data, but did not make provision for the sharing of public sector information
(Government of South Africa, 1999). South Africa needed a research data repository that would be
more than an archive: a full data service to actively promote the long-term preservation,
sharing and analysis of data from publicly-funded research. The South African Data Archive (SADA)
was established in the 1990s for this purpose and was modelled on the archives of the Consortium of
European Social Science Data Archives (CESSDA). This archive is now housed at South Africa’s
National Research Foundation and provides free access to a limited amount of data from government-sponsored research projects. However, it struggles to draw the funding and skills needed to curate
research data according to international standards.
Thus, the team at the Southern Africa Labour and Development Research Unit at the University of
Cape Town, who were responsible for the seminal 1993 survey, set up the DataFirst data repository in
2001. The Mellon Foundation provided seed funding in a programme to build quantitative skills
among African researchers. The initiative has three strategic components to advance data-intensive
South African research: the repository, which enables access to quality-assured primary data for
research; a training unit, which builds the quantitative skills of African researchers; and a research unit,
which undertakes research on data quality issues in South Africa. The motivations for a university-based repository are, firstly, that proximity to researchers keeps the repository responsive to research
data needs. Secondly, such a service is seen as independent and largely free from political influence
and not susceptible to funding changes as political regimes change. Thirdly, repositories can draw on
the skills base at universities, such as a constant supply of student interns to work on data cleaning.
This is an important requirement in the skills-constrained environments of low- and middle-income countries (LMICs). Thus, a university
base can ensure that data repositories can maintain a high level of service in the long term.
2.3 International standards and best practice as drivers
From the outset DataFirst needed to build a reputation as a trusted digital repository to change
mindsets in South Africa around sharing public sector data. This was done, firstly, through adherence
to open data principles; secondly, through adoption of international standards for repository processes,
including ethics standards; and thirdly, through ongoing consultation with stakeholders.
2.3.1 Adoption of Open Data Principles
Table 1 lists commonalities among established data sharing principles. DataFirst subscribes to these
principles, including those concerning government data, in the belief that an open data approach is the
best way to ensure that quality data is freely available for South African policy research.
Frameworks compared (date, name, data covered):
1996, Bermuda Principles, genomics data
1997, Data quality principles, government data
2007, OECD Principles and Guidelines for Access to Research Data from Public Funding, data from publicly funded research
2007, Sebastopol Principles (8 fundamental principles for open government data), government data
2010, 10 Open Government Data principles, government data
2014, FAIR Data Principles (Findable, Accessible, Interoperable, Reusable), research data
2015, ICSU-WDS Data Sharing Principles, research data
Principles compared:
Accessibility, Discoverable: Data should be online and easy to find
Nondiscriminatory: Access should be on equal terms
Complete: Data should be made available in their entirety
Primary: Primary, not aggregate data
Interoperable: Standardised open formats
Machine readable: Structured to allow automated processing
Interpretable: Good data documentation
Timely: Shared in a timely manner
Secure: Protection of privacy of data subjects
Permanent: Preserved and shared in the long term
Table 1. Commonalities among data sharing principles (Woolfrey, 2017)
Table 2 shows some examples of how the repository functions are aligned to open government data
principles (Sebastopol principles of open government data, 2007; 10 principles for opening up
government information, 2010).
Principle: Accessibility
Definition: Data must be easily discoverable and downloadable
Repository compliance measure(s): Online metadata supports discovery and data can be downloaded

Principle: No usage costs
Definition: Data must be free
Repository compliance measure(s): Depositors are charged for data preparation and data hosting (on a sliding scale) but charges are not passed on to the data users who are DataFirst’s clients

Principle: Nondiscrimination
Definition: Anyone should be able to access the data for any purpose
Repository compliance measure(s): Public access data is available to anyone for any purpose. Researchers are required to register once on our data site and to tell us how they intend to use the data, but this is for record-keeping purposes and does not affect their access rights

Principle: Completeness
Definition: Datasets must be released in their entirety and not as data sub-sets
Repository compliance measure(s): Entire datasets or dataset panels are released

Principle: Primacy
Definition: Data must be released at a primary, unit-record level
Repository compliance measure(s): Data is shared as microdata, not as aggregated data

Principle: Clear licensing
Definition: Data must be clearly labelled as being in the public domain
Repository compliance measure(s): Usage licenses must be completed with data requests

Principle: Use of Open Standards
Definition: Data must be in Open formats, i.e. not dependent on proprietary software for their analysis
Repository compliance measure(s): Open source-software-based dissemination software and free, XML-compliant metadata tools are used. Data is shared in proprietary formats, but also in software-agnostic formats. Free courses are offered in open source data analysis software

Principle: Machine-readability
Definition: Data must be stored in widely-used file formats that can be computer-processed
Repository compliance measure(s): Use of international standards for machine-readable and comparable metadata, and the use of standard file formats and syntax to allow data linkages

Principle: Timeliness
Definition: Data collected by the government must be released as soon after collection as possible
Repository compliance measure(s): Government agencies are encouraged to deposit data in a timely manner

Principle: Permanence
Definition: Data must be available in the long-term, online and versioned
Repository compliance measure(s): Ongoing data preservation, migration and dissemination is supported by DataFirst's infrastructure and processes. Dataset DOIs support long-term identification and access

Table 2. Examples of DataFirst’s compliance with Open Government Data principles
Figure 1. DataFirst’s repository data life-cycle model (Woolfrey, 2018)
2.3.2 Adherence to International Standards
DataFirst’s repository processes are modelled on the ISO standard for Open Archival Information
Systems (International Organization for Standardization, 2018). Figure 1 shows this life-cycle model.
One addition to the ISO standard is the inclusion of data rescue processes, in which repository staff
source and digitise “at-risk” South African survey data. “At-risk” data is data that is in danger of being
lost or becoming unusable because it is in non-digital formats, has insufficient metadata, or is in repositories
threatened with closure. DataFirst has undertaken several data rescue
projects to save such threatened datasets and make them available for new research. The model reflects
the virtuous cycle of re-use whereby public-sector data is improved through feedback from clients
who can easily access and analyse this data. Figure 2 indicates standards that are considered for each
stage of the data life-cycle at the repository.
Figure 2. Standards for data life-cycle stages (Woolfrey, unpublished)
DataFirst complies with international metadata standards to create clear descriptions of datasets to
help researchers to use the data in an informed manner. This metadata includes information on the
quality and usability of the data. Metadata is created with Nesstar Publisher, which is free,
community-developed data mark-up software for the creation of XML-compliant metadata following
the Data Documentation Initiative (DDI) metadata standard. Compliance with this standard supports
machine readability and data comparability. Metadata published by the repository includes citation
information, and DataFirst encourages data citation by those using the data to ensure research
reproducibility. The repository uses the DataCite citation standard and provides linked data citations
to assist researchers to know what has been published based on the data. All datasets have permanent
identifiers in the form of Digital Object Identifiers (DOIs) as DataFirst is a member of the DataCite
consortium and mints DOIs using the DataCite Fabrica tool.
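To make the idea of machine-readable, XML-based study metadata concrete, the following is a minimal sketch only. It uses Python's standard library rather than Nesstar Publisher, and it builds a heavily simplified study description using a handful of DDI Codebook element names (codeBook, stdyDscr, citation, titlStmt, titl, IDNo, rspStmt, AuthEnty, stdyInfo, abstract). The dataset title, producer, abstract and DOI are hypothetical placeholders, and the output is not validated against the DDI schema, so it should not be read as DataFirst's actual metadata workflow.

```python
import xml.etree.ElementTree as ET

DDI_NS = "ddi:codebook:2_5"  # namespace used by DDI Codebook 2.5


def build_study_record(title, producer, doi, abstract):
    """Return a simplified DDI-Codebook-style study description as XML text."""
    # Serialise DDI elements without a prefix (default namespace).
    ET.register_namespace("", DDI_NS)

    def tag(name):
        return f"{{{DDI_NS}}}{name}"

    code_book = ET.Element(tag("codeBook"))
    study = ET.SubElement(code_book, tag("stdyDscr"))
    citation = ET.SubElement(study, tag("citation"))

    title_stmt = ET.SubElement(citation, tag("titlStmt"))
    ET.SubElement(title_stmt, tag("titl")).text = title
    # The persistent identifier is stored alongside the title.
    ET.SubElement(title_stmt, tag("IDNo"), {"agency": "DOI"}).text = doi

    resp_stmt = ET.SubElement(citation, tag("rspStmt"))
    ET.SubElement(resp_stmt, tag("AuthEnty")).text = producer

    study_info = ET.SubElement(study, tag("stdyInfo"))
    ET.SubElement(study_info, tag("abstract")).text = abstract

    return ET.tostring(code_book, encoding="unicode")


# Hypothetical values, for illustration only.
record = build_study_record(
    title="Example Household Survey 2018",
    producer="Example Statistics Agency",
    doi="10.1234/example-doi",
    abstract="Illustrative abstract for a household survey dataset.",
)
print(record)
```

Because the record is structured XML, a harvester or catalogue can read the title, producer and identifier without human intervention, which is the property the DDI standard is relied on for here.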
Ethics concerns must be covered in repository processes. DataFirst complies with data sharing ethics
requirements by, firstly, undertaking disclosure control on deposited data using a clearly defined
process, to prevent information being published that could lead to identification of individuals.
Secondly, data depositors must provide proof of ethics clearance for them to collect the data. Ethics
clearance information is included in the metadata provided for each dataset.
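The text does not describe the steps of DataFirst's disclosure control process, so the following is only an illustrative sketch of one widely used check, not the repository's own procedure. It flags records whose combination of indirectly identifying variables (quasi-identifiers) is shared by fewer than k records, since such rare combinations carry a higher re-identification risk. The variable names, sample records and threshold are all hypothetical.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=3):
    """Return records whose quasi-identifier combination occurs fewer than k times."""

    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    # Count how many records share each combination of values.
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Hypothetical survey records, for illustration only.
sample = [
    {"district": "A", "age_group": "30-39", "occupation": "teacher"},
    {"district": "A", "age_group": "30-39", "occupation": "teacher"},
    {"district": "A", "age_group": "30-39", "occupation": "teacher"},
    {"district": "B", "age_group": "70-79", "occupation": "surgeon"},
]

flagged = risky_records(sample, ["district", "age_group", "occupation"], k=3)
print(flagged)
```

Flagged records would typically be handled before release, for example by coarsening categories or suppressing values; which treatment is appropriate depends on the dataset and the licence under which it is shared.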
2.4 Building data science skills in African institutions
In line with the unit’s original mandate, DataFirst continues to build data science skills at African
institutions. Data scientists have been described as those who are involved in “the four A’s of data”:
data architecture, data acquisition, data analysis and data archiving (Stanton, 2013, p. 4). DataFirst
gives training in all four dimensions of data science, to alleviate the shortage of such skills in the
region. Since 2008 a DataFirst-World Bank collaboration has installed data curation software and
trained data managers in African National Statistics Agencies as an initiative of the PARIS21
consortium. DataFirst also runs regular data curation workshops for librarians and data stewards of
large data projects to help them manage their data throughout the entire data life-cycle. Data analysis
training is offered through regular workshops aimed at postgraduates and other researchers.
3. Governance
DataFirst’s repository is located within the University of Cape Town’s governance structure. The unit
is based in the Faculty of Commerce, as an independent unit, funded in part by the university and in
part by grants. DataFirst’s Director, who is also a full-time professor in the School of Economics, has
overall responsibility for the unit and DataFirst’s research initiatives. DataFirst’s Manager is
responsible for operations of the repository and other data services and supervision of data service
staff. Figure 3 shows where the repository responsibilities are in an organogram of DataFirst’s staff
structure.
Figure 3. DataFirst’s organogram
A Governing Board provides oversight of the work of DataFirst. The Board meets annually to review
DataFirst’s annual report and provide input to operations and to discuss relevant scientific
developments. The Board includes representatives from South Africa’s Department of Science and
Technology, SA’s National Statistics Agency and other government departments, research-intensive
universities, and the African Economic Research Consortium (a policy think-tank). International
Board members include the Directors of two well-established data archives, the UK Data Archive at
the University of Essex and the ICPSR at the University of Michigan. DataFirst’s strategic goal is to
promote the efficient and equitable sharing of South Africa’s public-sector data for research. The
governance model adopted allows feedback from clients in government and academia on adherence to
this goal. DataFirst’s Board also advises on data science and policy developments.
4. Process and Timeline
The process of establishing the repository involved consultation with key stakeholders such as data
experts in other countries, the local quantitative research community, and data stewards in public
sector institutions. The Mellon Foundation was approached for funding to train South African
researchers in quantitative analysis and to establish the repository as a source of accessible
government data for this analysis. Negotiations were also held with the University of Cape Town’s
administration to fund an Operations Manager for the repository. Deans of Faculties at the University
of Cape Town were also consulted on existing research data needs within their faculties. Research
groupings across the campus were also approached for input to the initiative. Consultation with senior
academics proved useful and much of their advice was incorporated into the final repository setup.
Stakeholder consultation is ongoing and includes annual meetings with DataFirst’s Board and
biannual meetings of an Advisory Committee which represents senior quantitative researchers from
each of UCT’s faculties and the faculty heads of research. Finally, DataFirst has set up an online
support site for researchers to give feedback on the repository services. The opportunity to liaise
between producers and users of public sector information and to work closely with both stakeholder
groups enables repository staff to know how to advocate for data sharing and to identify researchers’
evolving data needs.
5. Adoption and implementation
5.1 The Repository
Data held in DataFirst’s repository originated from Statistics SA and other agencies within the
national statistical system, and projects of SA universities. In 2003 there were 114 datasets in the
repository. Datasets were initially shared within a secure space in the University of Cape Town’s
School of Economics which was open to academics and postgraduates at SA universities, and the
centre had 190 clients by the end of 2003. DataFirst’s physical location in the Economics Department
was reflected in its client-base, with economists and economics postgraduates making up about 40%
of registered clients. This also reflected the compulsory quantitative instruction given to economics
students at the university. However, sociologists and political scientists were also using the data
resources. Figure 4 shows the client base of the Centre in 2003.
Figure 4. Registered clients by discipline 2003
DataFirst’s strategic aim was to give wider access to the data through an online platform. In 2003
DataFirst’s Director and Manager undertook a training visit to the University of Michigan’s archive at
the Inter-University Consortium for Political and Social Research (ICPSR) to see how this could be
achieved. In 2004 DataFirst’s repository began using the proprietary, community-developed Nesstar
Server software as a dissemination platform. Initially access was for researchers at South African
universities, but 2006 saw the next phase of DataFirst’s development, with the extension of access to
academics across Africa, and encouraging deposits of data from other African countries. Full
realisation of the value of the repository depended on African researchers having access to training in
quantitative analysis. DataFirst therefore partnered with SALDRU to run training programmes in
quantitative analysis of survey data to ensure that there was a core cadre of data-savvy academics to
analyse the data. The repository also trained up student interns in data curation techniques.
A key performance indicator for DataFirst was the repository’s progress in widening access to African
data for research. However, resource constraints meant that the use of proprietary data dissemination
software was not feasible in the long term. In 2008 DataFirst therefore teamed up with the World
Bank’s International Household Survey Data Network (IHSN) to disseminate all data in the repository
using free, IHSN-developed data-dissemination software. Hardware for this initiative was funded by
UCT’s Central Equipment Facility. The new data site went live in 2009. The platform facilitates the
publishing of DDI-compliant metadata and dissemination of data files and documents. The reporting
function allows for the collection of statistics at dataset and client level.
Widening access to data holdings highlighted DataFirst’s role in promoting research data, and the
repository began to receive deposits from across Anglophone Africa. All datasets were uploaded with
keyword-searchable metadata, study designs and questionnaires. The online site also played a
discovery role by listing metadata for African datasets held online elsewhere.
5.2 The Secure Research Data Centre
Demand for data from the repository continued to grow, as research funders and public-sector
institutions increasingly saw the value in sharing sponsored data. However, policy researchers needed
data at the level of local municipality or even at the level of households to obtain an accurate picture
of their economies. In South Africa this data has not been available to researchers. DataFirst
responded to these demands by creating a “safe room” at the university where disaggregated public
data can be analysed by vetted researchers for cutting-edge research. The service has been set up to
prevent the risks of respondents’ identities being revealed and published. This is done through
ensuring a safe space, accrediting “safe” researchers and only allowing “safe” (non-disclosive)
information to leave the Centre. Policies and procedures for the service were informed by
consultations with experts in the field from the UK.
6. Monitoring and Evaluation
Impact evaluation of the repository is multi-dimensional and includes usage metrics such as client
registrations and data downloads, as well as citation counts of publications based on data obtained
from the repository. Access statistics show increasing use of the repository: In 2012 registered clients
totalled 2229 and this figure now stands at 10183 (2018). Table 4 indicates Google Analytics usage
statistics for DataFirst’s repository from 2012-2016 which show a healthy growth in DataFirst’s
online presence. The 2014 statistics understate usage because two months of site activity are missing
following a change in our URL.
Table 4. Repository usage statistics 2012-2016
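As a rough illustration of the growth in registered clients quoted above (2229 in 2012, 10183 in 2018), the sketch below computes the overall growth factor and an average annual growth rate. Treating the increase as steady compound growth is an assumption made purely for illustration; the text gives only the two endpoint figures, not year-by-year registrations.

```python
def growth_summary(start_count, end_count, years):
    """Return (overall growth factor, average annual compound growth rate)."""
    factor = end_count / start_count
    annual_rate = factor ** (1 / years) - 1
    return factor, annual_rate


# Registered clients: 2229 in 2012 and 10183 in 2018 (figures from the text).
factor, annual_rate = growth_summary(2229, 10183, 2018 - 2012)
print(f"growth factor: {factor:.2f}, average annual growth: {annual_rate:.1%}")
```

On these figures the client base grew a little more than fourfold over six years, or somewhat under 30% per year on average under the compound-growth assumption.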
A similar pattern emerged for clients who downloaded data. Figure 5 shows the distribution of
countries from which people downloaded datasets in the period from 2010-2016.
Figure 5. Distribution of countries for data downloads 2010-2016
DataFirst was also given membership of international expert data agencies, which indicates the
repository’s growing reputation as a trustworthy data source. DataFirst is a member of the World Data
System (WDS), a body established by the International Council for Science's General Assembly in
2008 to support the preservation of high-quality scientific data, and to promote data standards and
open data practices. The repository’s services are also measured against best practice at other social
science research data repositories.
Community-developed trusted digital repository standards can be used to benchmark services.
DataFirst has held one such certification, the Data Seal of Approval (DSA) since 2014. In 2018 the
DSA and WDS instituted a new repository certification, the CoreTrustSeal. The CoreTrustSeal is
awarded to repositories that have a clear data preservation and dissemination mission, overt licensing
requirements, show ethics compliance, are sustainable in terms of funding and skills, and which draw
on guidance from the expert community. The certification also requires repositories to support data
quality and have clearly defined workflows for each stage of the data life-cycle. Certified repositories
must also demonstrate that they have the technical infrastructure and security arrangements in place to
curate data optimally. DataFirst’s repository received this certification in 2018 and is the only African
data repository awarded this certification to date. Figure 6 shows repositories which hold the WDS,
DSA and CoreTrustSeal certifications in 2018.
Figure 6. WDS, DSA and CoreTrustSeal certified repositories 2018
7. Lessons Learned and Challenges Ahead
The objectives of establishing DataFirst and the repository were to advance quantitative skills of
African researchers and provide open access to South African public-sector data for the data-intensive
policy research these researchers were expected to undertake. The repository was also tasked with
helping Statistics South Africa and other producers of official data to improve the quality of their data
products. DataFirst has achieved these goals, through adherence to international standards and by
building relationships between data producer and user communities in South Africa. DataFirst has
also supported open data practices further afield, in other African countries, through data advocacy
work. This work has been grounded in the example of the repository as an open data success story in a
relatively resource-constrained environment.
One lesson learned was the importance of documenting policies, processes and service
developments. Applying for certifications showed gaps in repository documentation, particularly
regarding records of solutions for challenges that arose with new data formats or data demands.
Marketing and promotion have been minimal because of the repository’s small staff complement. This
means potential depositors and data users may not be aware of the services on offer. A future
challenge is how to promote the repository given our limited resources. This of course is a two-edged
sword: DataFirst is not government funded and UCT only part-funds the repository, so the repository
must attract grants, and this requires it to publish its successful
outcomes.
Lessons learned from this case study on DataFirst’s repository may be useful for repositories in resource-constrained OECD partner countries. Firstly, the repository’s successes can show the possibilities of building
functional research data infrastructure with limited means. Secondly, such successes indicate the value of
situating data services close to data users in academia, rather than within government agencies. Thirdly, local
success stories can encourage projects that incorporate South-South cooperation in building data services where
none exist. Finally, DataFirst’s work highlights the importance of a multi-dimensional approach to improving
the reuse of public sector data in LMICs and LICs, which should include making data accessible, supporting
data quality, and building skills for data usage.
References
10 principles for opening up government information. (2010). Retrieved from
https://sunlightfoundation.com/policy/documents/ten-open-data-principles/
Government of South Africa. (1999). Statistics Act. Retrieved 12 11, 2018, from
https://sagc.org.za/pdf/legislation/Statistics%20Act%206%20of%201999.pdf
International Organization for Standardization. (2018). ISO 14721:2012. Retrieved 12 05, 2018, from
https://www.iso.org/standard/57284.html
Parliament of South Africa. (1994). White paper on reconstruction and development. Parliament of
SA. Cape Town: Government printer. Retrieved 12 05, 2018, from
https://www.gov.za/sites/default/files/governmentgazetteid16085.pdf
Sebastopol principles of open government data. (2007). Retrieved 12 06, 2018, from
https://opengovdata.org/
Seekings, J. (2008). The uneven development of quantitative social science in South Africa. Social
Dynamics, 27(1), 1-36. doi:10.1080/02533950108458702
Stanton, J. (2013). An introduction to data science (Version 3 ed.). Syracuse University. Retrieved 07
31, 2018, from
https://ia802501.us.archive.org/4/items/DataScienceBookV3/DataScienceBookV3.pdf
Woolfrey, L. (2017). Data Management Principles and Trusted Data Repositories. CODATA-ASSAF
Workshop on Open Data for Sustainable Development Goals in Developing Countries.
Antananarivo: African Open Science Platform. Retrieved 02 28, 2019, from
http://africanopenscience.org.za/?p=485