OECD Project on Enhanced Access to Public Data for Science, Technology and Innovation: South African Case Study: The DataFirst African Research Data Repository

Lynn Woolfrey 1

1 Manager, DataFirst, University of Cape Town

1. Overview of the Initiative

Name of initiative: DataFirst Open African Research Data Repository (DataFirst), University of Cape Town

Objective: Most open African data repositories do not provide (i) data at a granular level, (ii) data that is quality assured, or (iii) data that is well described. Thus, data from these sites is only marginally useful for sound policy research. Our repository ensures discovery of and open access to well-documented African government microdata and microdata from African research institutions to support data-intensive African policy research.

Type (strategy, policy, bill of law): University of Cape Town public good initiative funded by the Mellon Foundation. The data repository was originally set up to provide data for African researchers being trained in quantitative skills as a component of the Mellon-funded project. DataFirst now has a wider role: to provide high-quality primary African social science data to policy researchers worldwide.

Responsible policy making bodies: DataFirst’s Advisory Board, representing the SA Department of Science and Technology, Statistics SA, academics from leading SA universities and African think-tanks, and the Directors of internationally certified data repositories in the UK and US.

Responsible implementing bodies: DataFirst, University of Cape Town

International reference framework (if relevant):
Open definition 2.1 (2005)
OECD principles and guidelines for access to research data from public funding (2007)
Sebastopol principles for open government data (2007)
Sunlight Foundation open government data principles (2010)
Open archival information system ISO reference model (ISO 14721, 2012)
FAIR data principles (2014)
Principles for the data revolution for sustainable development (UN, 2014)
ICSU-World Data System data sharing principles (2015)

Total duration of initiative (years): Ongoing, initiated in 2001

Total budget of initiative (in national currency): 2 000 000 per annum

Sectoral focus (if relevant): Social science and health data

Type of data concerned (data from research, public sector information, private sector information): Primary public-sector data (from government administrative databases, censuses and surveys) and primary data from large-scale research projects of African universities and research institutes.

Target audience (scientific community, business, civil society, general public): Scientific community

Expected results: Optimal use of publicly funded research data, better quantitative African policy research, and, ultimately, better African data because of exposure and data quality feedback from academia.

2. Rationale, motives and key drivers

2.1 Policy context prior to the initiative

The South African Apartheid state regularly collected demographic and income data, but usually only on its “white” population. Even when all citizens were enumerated, for example during the quinquennial census, this was done with different questionnaires for each population group. It was therefore difficult for policy researchers to compare households to obtain an accurate picture of South Africa’s social and economic situation.
After the installation of the new democratic government in 1994, quantitative social research in the country burgeoned, partly because of demand from the new government for empirical data on all communities in South Africa for policy formulation (Seekings, 2008, pp. 1-3). These new quantitative datasets were seldom re-used for further research because in the early 1990s South Africa lacked the policy and technological infrastructure to enable formal research data sharing. For example, in 1993 the University of Cape Town undertook a comprehensive World Bank-sponsored survey of all South African households. The data informed the new government’s reconstruction and development plans (Parliament of South Africa, 1994), but the dataset eventually had to be archived and shared by the World Bank, which had the necessary research data sharing infrastructure.

2.2 Objectives and expected results

South Africa’s 1999 Statistics Act mandated the state to collect data on its citizens and to improve the quality of national data, but did not make provision for the sharing of public sector information (Government of South Africa, 1999). South Africa needed a research data repository that would be not merely an archive but a full data service that actively promotes the long-term preservation, sharing and analysis of data from publicly funded research. The South African Data Archive (SADA) was established in the 1990s for this purpose and was modelled on the archives of the Consortium of European Social Science Data Archives (CESSDA). This archive is now housed at South Africa’s National Research Foundation and provides free access to a limited amount of data from government-sponsored research projects. However, it struggles to draw the funding and skills needed to curate research data according to international standards.
Thus, the team at the Southern Africa Labour and Development Research Unit at the University of Cape Town, who were responsible for the seminal 1993 survey, set up the DataFirst data repository in 2001. The Mellon Foundation provided seed funding in a programme to build quantitative skills among African researchers. The initiative has three strategic components to advance data-intensive South African research: the repository, which enables access to quality-assured primary data for research; a training unit, which builds quantitative skills of African researchers; and a research unit, which undertakes research on data quality issues in South Africa.

The motivations for a university-based repository are, firstly, that proximity to researchers keeps the repository responsive to research data needs. Secondly, such a service is seen as independent and largely free from political influence, and not susceptible to funding changes as political regimes change. Thirdly, repositories can draw on the skills base at universities, such as a constant supply of student interns to work on data cleaning. This is an important requirement in the skills-constrained environments of LMICs. Thus, a university base helps data repositories to maintain a high level of service in the long term.

2.3 International standards and best practice as drivers

From the outset DataFirst needed to build a reputation as a trusted digital repository to change mindsets in South Africa around sharing public sector data. This was done, firstly, through adherence to open data principles; secondly, through adoption of international standards for repository processes, including ethics standards; and thirdly, through ongoing consultation with stakeholders.

2.3.1 Adoption of Open Data Principles

Table 1 lists commonalities among established data sharing principles.
DataFirst subscribes to these principles, including those concerning government data, in the belief that an open data approach is the best way to ensure that quality data is freely available for South African policy research.

[Table 1 is a matrix that marks, for each set of data sharing principles, which of the common principles it includes. The row and column labels are listed here.]

Sets of principles compared (date, name, type of data):
1996/1997: Bermuda Principles (genomics data)
2007: OECD Principles and Guidelines for Access to Research Data from Public Funding (data from publicly funded research)
2007: Sebastopol Principles - 8 fundamental principles for open government data (government data)
2010: 10 Open Government Data principles (government data)
2014: FAIR Data Principles (Findable, Accessible, Interoperable, Reusable) (research data)
2015: ICSU-WDS Data Sharing Principles (research data)

Principles compared, grouped under Accessibility, Interoperable and Data quality principles:
Discoverable: data should be online and easy to find
Nondiscriminatory: access should be on equal terms
Complete: data should be made available in their entirety
Primary: primary, not aggregate data
Standardised: standardised open formats
Machine readable: structured to allow automated processing
Timely: shared in a timely manner
Secure: protection of privacy of data subjects
Interpretable: good data documentation
Permanent: preserved and shared in the long term

Table 1. Commonalities among data sharing principles (Woolfrey, 2017)

Table 2 shows some examples of how the repository functions are aligned to open government data principles (Sebastopol principles of open government data, 2007; 10 principles for opening up government information, 2010).
Principle: Accessibility
Definition: Data must be easily discoverable and downloadable
Repository compliance measure(s): Online metadata supports discovery and data can be downloaded

Principle: No usage costs
Definition: Data must be free
Repository compliance measure(s): Depositors are charged for data preparation and data hosting (on a sliding scale), but charges are not passed on to the data users who are DataFirst’s clients

Principle: Nondiscrimination
Definition: Anyone should be able to access the data for any purpose
Repository compliance measure(s): Public access data is available to anyone for any purpose. Researchers are required to register once on our data site and to tell us how they intend to use the data, but this is for record-keeping purposes and does not affect their access rights

Principle: Completeness
Definition: Datasets must be released in their entirety and not as data sub-sets
Repository compliance measure(s): Entire datasets or dataset panels are released

Principle: Primacy
Definition: Data must be released at a primary, unit-record level
Repository compliance measure(s): Data is shared as microdata, not as aggregated data

Principle: Clear licensing
Definition: Data must be clearly labelled as being in the public domain
Repository compliance measure(s): Usage licenses must be completed with data requests

Principle: Use of Open Standards
Definition: Data must be in open formats, i.e. not dependent on proprietary software for their analysis
Repository compliance measure(s): Open-source dissemination software and free, XML-compliant metadata tools are used. Data is shared in proprietary formats, but also in software-agnostic formats. Free courses are offered in open-source data analysis software

Principle: Machine-readability
Definition: Data must be stored in widely used file formats that can be computer-processed
Repository compliance measure(s): Use of international standards for machine-readable and comparable metadata, and the use of standard file formats and syntax to allow data linkages

Principle: Timeliness
Definition: Data collected by the government must be released as soon after collection as possible
Repository compliance measure(s): Government agencies are encouraged to deposit data in a timely manner

Principle: Permanence
Definition: Data must be available in the long term, online and versioned
Repository compliance measure(s): Ongoing data preservation, migration and dissemination is supported by DataFirst's infrastructure and processes. Dataset DOIs support long-term identification and access

Table 2. Examples of DataFirst’s compliance with Open Government Data principles

Figure 1. DataFirst’s repository data life-cycle model (Woolfrey, 2018)

2.3.2 Adherence to International Standards

DataFirst’s repository processes are modelled on the ISO standard for Open Archival Information Systems (International Organization for Standardization, 2018). Figure 1 shows this life-cycle model. An addition to the ISO standard is the inclusion of data rescue processes, in which repository staff source and digitise “at-risk” South African survey data. “At-risk” data is data that is in danger of being lost or becoming unusable because it is in non-digital formats, has insufficient metadata, or is in repositories threatened with closure. DataFirst has undertaken several data rescue projects to save such threatened datasets and make them available for new research. The model reflects the virtuous cycle of re-use whereby public-sector data is improved through feedback from clients who can easily access and analyse this data. Figure 2 indicates standards that are considered for each stage of the data life-cycle at the repository.

Figure 2. Standards for data life-cycle stages (Woolfrey, unpublished)

DataFirst complies with international metadata standards to create clear descriptions of datasets that help researchers to use the data in an informed manner. This metadata includes information on the quality and usability of the data.
Metadata is created with Nesstar Publisher, which is free, community-developed data mark-up software for the creation of XML-compliant metadata following the Data Documentation Initiative (DDI) metadata standard. Compliance with this standard supports machine readability and data comparability.

Metadata published by the repository includes citation information, and DataFirst encourages data citation by those using the data to ensure research reproducibility. The repository uses the DataCite citation standard and provides linked data citations to help researchers find what has been published based on the data. All datasets have permanent identifiers in the form of Digital Object Identifiers (DOIs), as DataFirst is a member of the DataCite consortium and mints DOIs using the DataCite Fabrica tool.

Ethics concerns must be covered in repository processes. DataFirst complies with data sharing ethics requirements by, firstly, undertaking disclosure control on deposited data using a clearly defined process, to prevent the publication of information that could lead to the identification of individuals. Secondly, data depositors must provide proof that they had ethics clearance to collect the data. Ethics clearance information is included in the metadata provided for each dataset.

2.4 Building data science skills in African institutions

In line with the unit’s original mandate, DataFirst continues to build data science skills at African institutions. Data scientists have been described as those who are involved in “the four A’s of data”: data architecture, data acquisition, data analysis and data archiving (Stanton, 2013, p. 4). DataFirst gives training in all four dimensions of data science, to alleviate the shortage of such skills in the region. Since 2008 a DataFirst-World Bank collaboration has installed data curation software and trained data managers in African National Statistics Agencies as an initiative of the PARIS21 consortium.
DataFirst also runs regular data curation workshops for librarians and data stewards of large data projects to help them manage their data throughout the entire data life-cycle. Data analysis training is offered through regular workshops aimed at postgraduates and other researchers.

3. Governance

DataFirst’s repository is located within the University of Cape Town’s governance structure. The unit is based in the Faculty of Commerce, as an independent unit, funded in part by the university and in part by grants. DataFirst’s Director, who is also a full-time professor in the School of Economics, has overall responsibility for the unit and DataFirst’s research initiatives. DataFirst’s Manager is responsible for operations of the repository and other data services and for supervision of data service staff. Figure 3 shows where the repository responsibilities lie in an organogram of DataFirst’s staff structure.

Figure 3. DataFirst’s organogram

A Governing Board provides oversight of the work of DataFirst. The Board meets annually to review DataFirst’s annual report, provide input to operations and discuss relevant scientific developments. The Board includes representatives from South Africa’s Department of Science and Technology, SA’s National Statistics Agency and other government departments, research-intensive universities, and the African Economic Research Consortium (a policy think-tank). International Board members include the Directors of two well-established data archives, the UK Data Archive at the University of Essex and the ICPSR at the University of Michigan.

DataFirst’s strategic goal is to promote the efficient and equitable sharing of South Africa’s public-sector data for research. The governance model adopted allows feedback from clients in government and academia on adherence to this goal. DataFirst’s Board also advises on data science and policy developments.

4. Process and Timeline

The process of establishing the repository involved consultation with key stakeholders such as data experts in other countries, the local quantitative research community, and data stewards in public sector institutions. The Mellon Foundation was approached for funding to train South African researchers in quantitative analysis and to establish the repository as a source of accessible government data for this analysis. Negotiations were also held with the University of Cape Town’s administration to fund an Operations Manager for the repository. Deans of Faculties at the University of Cape Town were consulted on existing research data needs within their faculties, and research groupings across the campus were approached for input to the initiative. Consultation with senior academics proved useful and much of their advice was incorporated into the final repository setup.

Stakeholder consultation is ongoing and includes annual meetings with DataFirst’s Board and biannual meetings of an Advisory Committee which represents senior quantitative researchers from each of UCT’s faculties and the faculty heads of research. Finally, DataFirst has set up an online support site for researchers to give feedback on the repository services. The opportunity to liaise between producers and users of public sector information, and to work closely with both stakeholder groups, enables repository staff to advocate effectively for data sharing and to identify researchers’ evolving data needs.

5. Adoption and implementation

5.1 The Repository

Data held in DataFirst’s repository originated from Statistics SA and other agencies within the national statistical system, and from projects of SA universities. In 2003 there were 114 datasets in the repository. Datasets were initially shared within a secure space in the University of Cape Town’s School of Economics which was open to academics and postgraduates at SA universities, and the centre had 190 clients by the end of 2003.
DataFirst’s physical location in the Economics Department was reflected in its client base, with economists and economics postgraduates making up about 40% of registered clients. This also reflected the compulsory quantitative instruction given to economics students at the university. However, sociologists and political scientists were also using the data resources. Figure 4 shows the client base of the Centre in 2003.

Figure 4. Registered clients by discipline 2003

DataFirst’s strategic aim was to give wider access to the data through an online platform. In 2003 DataFirst’s Director and Manager undertook a training visit to the University of Michigan’s archive at the Inter-University Consortium for Political and Social Research (ICPSR) to see how this could be achieved. In 2004 DataFirst’s repository began using the proprietary, community-developed Nesstar Server software as a dissemination platform. Initially access was for researchers at South African universities, but 2006 saw the next phase of DataFirst’s development, with the extension of access to academics across Africa and the encouragement of data deposits from other African countries.

Full realisation of the value of the repository depended on African researchers having access to training in quantitative analysis. DataFirst therefore partnered with SALDRU to run training programmes in quantitative analysis of survey data, to ensure that there was a core cadre of data-savvy academics to analyse the data. The repository also trained student interns in data curation techniques.

A key performance indicator for DataFirst was the repository’s progress in widening access to African data for research. However, resource constraints meant that the use of proprietary data dissemination software was not feasible in the long term.
In 2008 DataFirst therefore teamed up with the World Bank’s International Household Survey Network (IHSN) to disseminate all data in the repository using free, IHSN-developed data-dissemination software. Hardware for this initiative was funded by UCT’s Central Equipment Facility. The new data site went live in 2009. The platform facilitates the publishing of DDI-compliant metadata and the dissemination of data files and documents. The reporting function allows for the collection of statistics at dataset and client level.

Widening access to data holdings highlighted DataFirst’s role in promoting research data, and the repository began to receive deposits from across Anglophone Africa. All datasets were uploaded with keyword-searchable metadata, study designs and questionnaires. The online site also played a discovery role by listing metadata for African datasets held online elsewhere.

5.2 The Secure Research Data Centre

Demand for data from the repository continued to grow, as research funders and public-sector institutions increasingly saw the value in sharing sponsored data. However, policy researchers needed data at the level of local municipality or even at the level of households to obtain an accurate picture of their economies. In South Africa this data has not been available to researchers. DataFirst responded to these demands by creating a “safe room” at the university where disaggregated public data can be analysed by vetted researchers for cutting-edge research. The service has been set up to prevent respondents’ identities from being revealed or published. This is done by ensuring a safe space, accrediting “safe” researchers and only allowing “safe” (non-disclosive) information to leave the Centre. Policies and procedures for the service were informed by consultations with experts in the field from the UK.

6. Monitoring and Evaluation

Impact evaluation of the repository is multi-dimensional and includes usage metrics such as client registrations and data downloads, as well as citation counts of publications based on data obtained from the repository. Access statistics show increasing use of the repository: in 2012 registered clients totalled 2229 and this figure now stands at 10183 (2018). Table 4 gives Google Analytics usage statistics for DataFirst’s repository from 2012 to 2016, which show healthy growth in DataFirst’s online presence. The 2014 statistics are misleading because two months of site activity are missing as a result of a change in our URL.

Table 4. Repository usage statistics 2012-2016

A similar pattern emerged for clients who downloaded data. Figure 5 shows the distribution of countries from which people downloaded datasets in the period 2010 to 2016.

Figure 5. Distribution of countries for data downloads 2010-2016

DataFirst has also been granted membership of international expert data agencies, which indicates the repository’s growing reputation as a trustworthy data source. DataFirst is a member of the World Data System (WDS), a body established by the International Council for Science's General Assembly in 2008 to support the preservation of high-quality scientific data and to promote data standards and open data practices.

The repository’s services are also measured against best practice at other social science research data repositories. Community-developed trusted digital repository standards can be used to benchmark services. DataFirst has held one such certification, the Data Seal of Approval (DSA), since 2014. In 2018 the DSA and WDS instituted a new repository certification, the CoreTrustSeal.
The CoreTrustSeal is awarded to repositories that have a clear data preservation and dissemination mission and overt licensing requirements, show ethics compliance, are sustainable in terms of funding and skills, and draw on guidance from the expert community. The certification also requires repositories to support data quality and to have clearly defined workflows for each stage of the data life-cycle. Certified repositories must also demonstrate that they have the technical infrastructure and security arrangements in place to curate data optimally. DataFirst’s repository received this certification in 2018 and is the only African data repository awarded this certification to date. Figure 6 shows repositories which held the WDS, DSA and CoreTrustSeal certifications in 2018.

Figure 6. WDS, DSA and CoreTrustSeal certified repositories 2018

7. Lessons Learned and Challenges Ahead

The objectives of establishing DataFirst and the repository were to advance the quantitative skills of African researchers and to provide open access to South African public-sector data for the data-intensive policy research these researchers were expected to undertake. The repository was also tasked with helping Statistics South Africa and other producers of official data to improve the quality of their data products. DataFirst has achieved these goals through adherence to international standards and by building relationships between data producer and user communities in South Africa. DataFirst has also supported open data practices further afield, in other African countries, through data advocacy work. This work has been grounded in the example of the repository as an open data success story in a relatively resource-constrained environment.

One lesson learned was the importance of documenting policies, processes and service developments.
Applying for certifications showed gaps in repository documentation, particularly regarding records of solutions for challenges that arose with new data formats or data demands.

Marketing and promotion have been minimal because of the repository’s small staff complement. This means potential depositors and data users may not be aware of the services on offer. A future challenge is how to promote the repository given our limited resources. This of course is a two-edged sword: DataFirst is not government funded and UCT only part-funds the repository, so it must attract grants, and this requires the repository to publicise its successful outcomes.

Lessons learned from this case study on DataFirst’s repository may be useful for repositories in resource-constrained OECD partner countries. Firstly, the repository’s successes can show the possibilities of building functional research data infrastructure with limited means. Secondly, such successes indicate the value of situating data services close to data users in academia, rather than within government agencies. Thirdly, local success stories can encourage projects that incorporate South-South cooperation in building data services where none exist. Finally, DataFirst’s work highlights the importance of a multi-dimensional approach to improving the reuse of public sector data in LMICs and LICs, which should include making data accessible, supporting data quality, and building skills for data usage.

References

10 principles for opening up government information. (2010). Retrieved from https://sunlightfoundation.com/policy/documents/ten-open-data-principles/

Government of South Africa. (1999). Statistics Act. Retrieved 12 11, 2018, from https://sagc.org.za/pdf/legislation/Statistics%20Act%206%20of%201999.pdf

International Organization for Standardization. (2018). ISO 14721:2012. Retrieved 12 05, 2018, from https://www.iso.org/standard/57284.html

Parliament of South Africa. (1994). White paper on reconstruction and development. Parliament of SA. Cape Town: Government printer. Retrieved 12 05, 2018, from https://www.gov.za/sites/default/files/governmentgazetteid16085.pdf

Sebastopol principles of open government data. (2007). Retrieved 12 06, 2018, from https://opengovdata.org/

Seekings, J. (2008). The uneven development of quantitative social science in South Africa. Social Dynamics, 27(1), 1-36. doi:10.1080/02533950108458702

Stanton, J. (2013). An introduction to data science (Version 3 ed.). Syracuse University. Retrieved 07 31, 2018, from https://ia802501.us.archive.org/4/items/DataScienceBookV3/DataScienceBookV3.pdf

Woolfrey, L. (2017). Data Management Principles and Trusted Data Repositories. CODATA-ASSAF Workshop on Open Data for Sustainable Development Goals in Developing Countries. Antananarivo: African Open Science Platform. Retrieved 02 28, 2019, from http://africanopenscience.org.za/?p=485