IPUMS & AICMD Add Value to African Census Microdata
Robert McCaa and Patricia Kelly-Hall
ASSD VII, January, 2012, Cape Town, South Africa

***
ipums.org/international   ecastats.uneca.org/aicmd   rmccaa@umn.edu
for additional details, please see: www.hist.umn.edu/~rmccaa/ipums-africa

IPUMS-International: Free, Worldwide Microdata Access Now for Censuses of 62 Countries
“Dissemination [means] opening up the value inherent in our data”
--Walter Radermacher (President, Eurostat) and Pieter Everaers (Director, Eurostat)
IPUMS+AICMD open up the value inherent in microdata for censuses throughout Africa.

The purpose of this talk: “value added by 3rd parties”
1. Encourage National Statistical Offices to entrust census microdata samples to the IPUMS-International project
   » 1960-2000 samples
   » 2010 round samples
2. Describe some of the value that IPUMS-International adds to integrated microdata and metadata
   » Free access to the microdata for bona fide researchers
   » Extensive analysis of data quality before the samples are released
   » Integrated metadata (compare questions in 1, 2, … many censuses)
   » Integrated, pooled microdata (multiple censuses, countries)
3. Encourage usage of integrated samples by African researchers
   » Usage is relatively low, but increasing quickly as more samples become available

Central Statistics Office-Ireland
Deirdre Cullen, Senior Statistician, testimonial (not in the paper): Advantages of IPUMS for Ireland
• Bonus for CSO: as a result of this project, our historic data sets are now in a much more usable format
• IPUMS allows a mix of Census years to be available in 1 file
• Comparability with other countries
• Ease of access for users
• Positive publicity for Census in Ireland

Outline
Introduction
   » When NSOs disseminate microdata, the task is costly, risky and often unsatisfactory
   » IPUMS+AICMD partnership offers a solution for African countries
   » Invitation to participate: entrust microdata for 2010 and earlier censuses without undue delay
IPUMS+AICMD adds value to population microdata:
1. Statistical confidentiality and security: disclosure controls, restricted access
2. Integration: census microdata and metadata
3. Dissemination: custom-tailored extracts by country(ies), census(es), populations, variables, sample density, metadata
4. Ethics: statistical transparency, academic freedom, responsible use, sharing of results
Reflections

Why Statistical Offices entrust Responsibility of Disseminating Census Microdata to IPUMS-International
» NSO dissemination is costly, risky and often unsatisfactory
» Costly: scarce human resources are needed to prepare the sample, assure statistical confidentiality, and manage access for relatively few users (however important they may be!)
» Risky: little experience in anonymizing and managing access to microdata, yet great responsibility
   » A US Census Bureau anonymization protocol egregiously corrupted ages for the elderly in ACS microdata; it took 5 years to discover the error!
» Unsatisfactory: excessive anonymization, slow to provide access. Troublesome for NSO statisticians, who do not wish to risk their jobs for some academic. Most deny access to all but the most persistent, influential would-be users. Complaints (of a large European NSO):
   » “I haven't used the [microdata]; the bureaucracy was just too slow to get much use out of it.”
   » “[Access] is unbelievably bureaucratic and difficult – this discourages people from using it. It took me 6 months to get the data.”

IPUMS-International assumes responsibilities and risks for integrating & disseminating microdata and metadata
» Uniform Memorandum of Understanding with each NSO:
   » Founding partners (2001): Kenya, South Africa, Ghana, Egypt, France, Spain, China, Vietnam, Colombia, Mexico, USA … now almost 100 countries
   » Specific conditions of access: ownership of data (NSO), use, access, restrictions, confidentiality, security, publication, violations, sharing, jurisdiction, and precedence
» Almost 100 countries entrust census microdata to IPUMS-I.
» 6 most populous countries NOT entrusting census microdata to IPUMS: India, *Nigeria, Russian Federation, Japan, Algeria, *Korea (RO; may join at the UNSC in New York)
   » * = negotiating
» No data: Congo (DR), Myanmar, Afghanistan, Uzbekistan, Somalia

90+ National Statistics Offices have endorsed the IPUMS-International Memorandum of Understanding

IPUMS-International results posted at http://bibliography.ipums.org

IPUMS Milestones
» 1995: IPUMS-USA first release of integrated microdata
» 1999: IPUMS-International funded by NSF & NIH
» 2002: 1st International launch: 7 countries, 25 samples
» 2007 launch (56th ISI): 32 countries, 89 samples
» 2009 launch (57th ISI): 44 countries, 130 samples
   » ~279 million person records
   » ~3,000 registered users
» 2011 launch (58th ISI): 62 countries, 185 samples
   » 397 million person records
   » 5,000 registered users
» 2013 (ISI Hong Kong!): ~70 countries, ~225 samples
   » ~500 million person records
   » ~7,000 registered users

Cartogram of IPUMS+AICMD partners weighted by population (dark green = integrated and disseminating, 2002-2011)
Open Invitation to Cooperate, Entrust and Access Microdata
[Map legend: Disseminating; Integrating; None entrusted; None inventoried]

The IPUMS-International team (includes National Science Foundation Board)
Steven Ruggles, inventor of IPUMS, Professor of History, and Director of the Minnesota Population Center
(Not present: some computer gurus, researchers, research assistants, civil service employees, and others who were not at the NSF Board meeting)

I. Statistical Confidentiality and Security
1. Statistical Confidentiality and Microdata Security
2. Statistical disclosure control protections
3. Restricted access
See pp. 3-5 of “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January 2012.

1. Statistical Confidentiality and security. Trusted researcher receives customized extracts
[Diagram: flow of microdata from the statistical offices to researchers]
» NSI 1 … NSI 62+: each NSI entrusts census metadata and anonymized microdata to MPC
» MPC: integrates metadata and confidentializes microdata samples
» IPUMS-International: manages access and entrusts researchers with custom-tailored <ddi>, SAS, STATA, and SPSS metadata and microdata extracts for any combination of countries, censuses, sub-populations, and variables
» Trusted researcher … Trusted researcher

Dennis Trewin on-site evaluation
(former: Australian Statistician; chair: Conference of European Statisticians Task Force on Microdata and Confidentiality)
» “...the best practice for an international repository of microdata”
» “The security of IPUMS is first class…the standard of the best national statistical offices”
» “...a valuable and trustworthy microdata service. It meets the fundamental principles of good practice with respect to confidentiality and microdata.”
» “in full compliance with the principles and recommendations of the CES [Conference of European Statisticians]”

2. Statistical Disclosure controls
1. Microdata are anonymized by suppressing any names, addresses, or precise geographic identifiers.
2. Sample is drawn so that researchers have access to only a minor fraction of the complete dataset.
3. Disclosure protections are imposed on the sample, variable-by-variable and code-by-code.
4. A small fraction of households is swapped across geographic boundaries.
• See case of Switzerland with 5% household samples for four censuses.
• Suppression thresholds are set by each NSO.
• Great satisfaction from NSOs and researchers

3. Restricted access: Thwarting intruders by legal and administrative procedures
» Usage is restricted to bona-fide researchers who agree to stringent conditions of use to protect statistical confidentiality
   » 1,100-word application form (versus the <5,300-word Facebook policy)
   » Agree to 8 specific conditions of use
   » Supply extensive personal and institutional details
   » Identify your employer’s Office for Protection of Human Subjects, IRB, etc.
   » Describe research, detailing need for access
» Rogue intruders face legal and institutional sanctions
   » University attorney’s office is obligated to initiate sanctions against both the individual and the institution, similar to NIH probationary status
Despite the “P” (Public) in IPUMS, access to the microdata is restricted.
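The four disclosure controls listed under "2. Statistical Disclosure controls" above can be illustrated with a small sketch. This is not the IPUMS-International implementation: the record layout, the field names (`name`, `address`, `geo`, `occupation`), the function names, and the default sampling fraction, suppression threshold and swap rate are all assumptions made for illustration. In practice the thresholds are set by each NSO.

```python
import random
from collections import Counter

DIRECT_IDENTIFIERS = {"name", "address"}  # hypothetical field names


def anonymize(household):
    """Control 1: drop names, addresses and other direct identifiers."""
    return {
        "geo": household["geo"],
        "persons": [
            {k: v for k, v in person.items() if k not in DIRECT_IDENTIFIERS}
            for person in household["persons"]
        ],
    }


def draw_household_sample(households, fraction, rng):
    """Control 2: sample whole households, so users see only part of the census."""
    k = round(len(households) * fraction)
    return rng.sample(households, k)


def suppress_rare_codes(households, variable, threshold, other_code="other"):
    """Control 3: recode categories that have fewer than `threshold` cases."""
    counts = Counter(p[variable] for h in households for p in h["persons"])
    rare = {code for code, n in counts.items() if n < threshold}
    return [
        {
            "geo": h["geo"],
            "persons": [
                {**p, variable: other_code if p[variable] in rare else p[variable]}
                for p in h["persons"]
            ],
        }
        for h in households
    ]


def swap_geography(households, swap_fraction, rng):
    """Control 4: exchange geographic codes between randomly paired households."""
    swapped = [dict(h) for h in households]  # shallow copies; input is untouched
    n_pairs = int(len(swapped) * swap_fraction / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        swapped[a]["geo"], swapped[b]["geo"] = swapped[b]["geo"], swapped[a]["geo"]
    return swapped


def protect(households, fraction=0.10, threshold=25, swap_fraction=0.02, seed=0):
    """Run the four controls in sequence (all default values are illustrative)."""
    rng = random.Random(seed)
    sample = draw_household_sample([anonymize(h) for h in households], fraction, rng)
    sample = suppress_rare_codes(sample, "occupation", threshold)
    return swap_geography(sample, swap_fraction, rng)
```

Sampling and swapping operate on whole households rather than on persons, so household composition stays intact in the protected sample. A pair drawn from the same geographic unit is swapped with no visible effect, which a production procedure would presumably avoid.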
Restricted Access: User Registration and Login
Links to Partner Statistical Agency Websites

Application form for IPUMS-I requesting information on institutional affiliation

Conditions of use: must agree to each one--no exceptions
√ Data must not be redistributed without authorization. All data extracted from the IPUMS-International database are intended solely for the use of the licensee. Under IPUMS-International agreements with collaborating agencies, redistribution of the data to third parties is prohibited. Each member of a research team using the data must apply for access and be licensed individually.
√ The microdata are intended only for scholarly research and educational purposes. These microdata are provided for the exclusive purposes of teaching and scholarly research, and may not be used for any other purposes without explicit written approval from the relevant official statistical authority.
√ Commercial use and redistribution of the microdata is strictly prohibited. Users are prohibited from using microdata acquired from the Integrated Public Use Microdata Series International or other authorized distributors in the pursuit of any commercial or income-generating venture either privately, or otherwise.
√ Use of the microdata must follow strict rules of confidentiality. Users will maintain the confidentiality of persons and households. Any attempt to ascertain the identity of persons or households from the microdata is prohibited. Alleging that a person or household has been identified in these data is also prohibited. Statistical results that might reveal the identity of persons or entities may not be reported or published in any form.
√ The microdata must always be safely secured. Users will implement security measures to prevent unauthorized access to microdata acquired from Integrated Public Use Microdata Series International, its partners or authorized distributors. Upon the completion of this research, data may be retained only if they can be safely secured. If security cannot be guaranteed, the microdata must be destroyed.
√ Scholarly publications are permitted, and must be cited appropriately. The publishing of research results based on IPUMS-International microdata is permitted in communications such as scholarly papers, journals and the like. The authors of these communications are required to cite Integrated Public Use Microdata Series-International and the relevant official statistical authority as the source of the microdata, and to indicate that the results and views expressed are those of the author. Users are requested to provide the IPUMS-International staff with a full citation for any publications resulting from their work with these data.
√ Any violation of this license agreement will result in disciplinary action, including possible loss of employment. Violation of this agreement will lead to revocation of this license, recall of all microdata acquired, a motion of censure to the relevant professional organization(s) and civil prosecution under national or international statutes, at the discretion of the Regents of the University of Minnesota and the official statistical agencies. Sanctions likewise may be taken against the institution with which the violator is affiliated.
√ User agrees to notify ipums@pop.umn.edu regarding errors in the data.

II. Integration
4. Comprehensive Source Metadata
5. Integrated, DDI Compatible Metadata
6. Integrated Microdata
7. IPUMS-I Value-Added Variables
8. Integrated Boundary Files
See pp. 6-8 of “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January 2012.

4. Comprehensive Source Documents (forms, instruction manuals) --for integrated censuses
Links to Official Statistical Agency Partners
Bibliography: view cites, link to publications

5. DDI Compatible Metadata (we share!)
http://microdata.worldbank.org: Mapped in DDI; compatible with IHSN Microdata toolkit; copies entered into the NADA catalog and archive

6. Integrated Metadata (Browse and Select Data)
[Screenshot callouts:]
» User Registration, conditions of use license
» Download Data Extract (and <ddi> codebook)
» Source documents (forms, instruction manuals)
» Link to Official Statistical Agency home pages
» Bibliography: view cites, link to publications

Integrated metadata: open access, dynamically constructed. Example: Marital Status
Page is constructed dynamically

Integrated IPUMS-I Metadata: Codes and Frequencies
Detailed, Case-Count View
2 rules:
1. Retain details
2. Harmonize everything
Page is constructed dynamically
Displays currently selected samples

Integrated IPUMS-I Metadata: Enumeration text
View text in English for any combination of countries and censuses.
2 documents: first, the form; then, the enumeration instructions (scroll down for more)
Page is constructed dynamically
Displays currently selected samples

7. Integrated Microdata
Table 2. 32 most popular integrated variables in IPUMS-International (85,505 Sample Extracts)

Rank  Label                              Extracts  Mnemonic  Comment
 1    Educational attainment             19,307    EDATTAN
 2    Age (single years to 85+)          19,009    AGE       Grouped age n=3,838
 3    Employment status                  18,490    EMPSTAT
 4    Marital status                     18,214    MARST
 5    Person weight                      17,511    WTPER     Technical variable
 6    Relationship to head               15,783    RELATE
 7    Sex                                14,595    SEX
 8    Class of work                      12,583    CLASSWK
 9    Ownership of dwelling               8,050    OWNRSHP
10    Occupation ISCO recode              8,004    OCCISCO
11    School attendance                   7,919    SCHOOL
12    Years of schooling                  7,576    YRSCHL
13    Literate                            7,290    LIT
14    Urban/rural                         7,098    URBAN
15    Industry-general code               7,044    INDGEN
16    Household weight                    6,656    WTHH      Technical variable
17    Children ever born                  6,363    CHBORN
18    Nativity (native/foreign born)      6,332    NATIVTY
19    Occupation                          6,246    OCC
20    Country of birth                    6,153    BPLCTRY
21    Religion                            6,075    RELIG
22    Industry                            5,670    IND
23    Location of spouse in household     5,007    SPLOC
24    Rule for locating spouse            4,171    SPRULE
25    Location of mother in hh            4,153    MOMLOC
26    Number of children surviving        4,074    CHSURV
27    Place of residence 5 years ago      4,064    MGRATE5
28    Location of father in household     3,983    POPLOC
29    Total household income              3,965    INCTOT
30    Earned income                       3,655    INCEARN
31    Number of rooms                     3,465    ROOMS
32    Consensual union                    3,443    CONSENS

Comments for ranks 17-32, in source order (row alignment not recoverable): IPUMS unique; IPUMS unique; IPUMS unique; IPUMS unique; Household variable; IPUMS unique

Appendix D. 42 (of 60) Integrated Household Variables: Availability for 13 African Countries (25 Censuses)

Appendix E. 88 (of 108) Integrated Person Variables: Availability for 13 African Countries (25 Censuses)

8. GIS Boundary files (and other Data Files)
[Screenshot callouts:]
» Source documents (forms, instruction manuals)
» Link to Official Statistical Agency home pages
» Bibliography: view cites, link to publications

III. Dissemination
9. Trans-border Access
10. Custom-Tailored Extracts
11. Usage
12. 2010 Round Census Microdata
See pp. 9-10 of “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January 2012.

9. Transborder access. IPUMS-I Extracts by researcher’s place of identity

Place of Identity   Extracts (N)   Samples Extracted (mean)   Institutions (N)
United States       14,669         3.43                       295
France                 973         2.95                        39
Spain                  972         8.34                        23
United Kingdom         961         2.74                        41
Canada                 671         2.35                        35
Colombia               627         2.04                        16
Brazil                 598         2.60                        22
Mexico                 507         3.33                        28
Singapore              494         1.49                         4
Germany                420         3.83                        31
Austria                403         4.77                         8
Italy                  377         3.03                        27
Chile                  318         6.33                         6
Argentina              310         3.79                        18
Switzerland            283         3.92                        10
Belgium                250         2.85                         3
Australia              229         2.17                        12
Netherlands            192         7.58                         8
China                  184         2.32                        25
Japan                  170         1.68                        19

Top 20 institutions using IPUMS-I (Appendix 4)

Rank  Institution                                      N
 1    University of Michigan                           742
 2    Columbia University                              701
 3    Universitat de Barcelona, Spain                  615
 4    Harvard University                               589
 5    Inter-American Development Bank                  499
 6    Arizona State University                         495
 7    National University of Singapore, Singapore      467
 8    World Bank                                       408
 9    University of California - Berkeley              362
10    Universidade Federal de Minas Gerais, Brazil     314
11    University of Chicago                            285
12    Universidad del Valle, Colombia                  270
13    Institute for Health Metrics & Evaluation        260
14    Princeton University                             237
15    University of Wisconsin - Madison                234
16    Brown University                                 229
17    University of Vienna, Austria                    229
18    University of Pittsburgh                         227
19    University of Delaware                           213
20    El Colegio de México, México                     214

Dissemination of microdata and metadata extracts
» The massive scale of IPUMS requires users to be selective:
   » Select country (or countries)
   » Select samples (census years)
   » Select variables (e.g., age, sex, educational attainment, etc.)
   » Select sub-populations (e.g., nurses)
   » Select sample density
» Once an extract request is submitted, the IPUMS extract engine:
   » Constructs the microdata extract
   » Constructs the metadata
   » Emails the researcher to retrieve the extract
   » The extract is password protected; transmission is encrypted with 128-bit SSL
» The researcher downloads the extract, un-zips and analyzes
Extract system validated as usage has soared

10. Custom tailored extracts. www.ipums.org/international:
a. Login with password
b-1. Study documentation
b-2. Create extract
c. Receive email; log on with password
d-1. Download extract (SSL encrypted)
d-2. UnZip data
e. Analyze using own software

Use the extract system to “Select Cases”. Example: Disability
Second: Click the box to include the variable
Third: Click “select cases” box
Fourth: Scroll down, select “disabled”, then “Continue to next step”
Click here to select every person in households containing an individual with employment disability

2010 round censuses. Minimum Standards for Samples Entrusted to IPUMS for dissemination
1. Household samples
2. High precision: 5% minimum, 10% preferred
3. Broad set of variables: omit only those required for statistical confidentiality (low-level geography, low frequency attributes)
4. Detailed codes
   » Age: single year to 85
   » Occupation, industry: 3 digit ISCO, ISIC
   » Country of birth: detail individual countries, consistent with statistical confidentiality
Thanks to INSEE France for sample of recensement rénové, 2004-2008: 20 million person records launched in IPUMS-I

IV. Ethics
13. Statistical Transparency
14. Academic Freedom
15. Reduce Research Fraud and Exaggeration of Results
16. Share Research Results
See p. 11 of “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January 2012.

“IPUMS-I is an excellent resource for teaching…”
-- Dr. David Lam, president, Population Association of America
1. Free, easy access to data for many countries and censuses
2. Large sample sizes:
   • Make it possible to include many different variables in a regression… multi-level model…
   • Produce separate estimates for population sub-groups
   • Easy to extract samples with a target sample size (e.g., 50 MB)
   • Easy to revise an extract for a larger size or to include more countries, censuses, variables or sub-populations
3. Students show a great deal of creativity in using IPUMS-I
4. Skills acquired have an immediate pay-off when applying for jobs (e.g., World Bank), graduate school, etc.

Africa Mirror Site: http://ecastats.uneca.org/aicmd/

IPUMS-International: Free, Worldwide Microdata Access Now for Censuses of 62 Countries--80 by 2015
“Dissemination [means] opening up the value inherent in our data”
--Walter Radermacher (President, Eurostat) and Pieter Everaers (Director, Eurostat)
Robert McCaa, Steven Ruggles, Matt Sobek and Wendy L. Thomas
IPUMS opens up the value inherent in census microdata:
» for the 2010 round
» for the 2000, 1990 and earlier rounds (where microdata exist)
» and for many countries
Session STS065: The Future of Microdata Access
58th International Statistical Institute, Dublin, Ireland, 26 August, 2011

Thank you
To discuss cooperation, please speak with Dr. Patricia Kelly-Hall or email: rmccaa@umn.edu
To use integrated census microdata, see: ipums.org/international or ecastats.uneca.org/aicmd
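
The selections a researcher makes in the extract system (countries, samples, variables, sub-populations and sample density, as listed under "Dissemination of microdata and metadata extracts") can be sketched as a small data structure plus a filter. This is an illustration only, not the IPUMS extract engine. `ExtractRequest`, `build_extract` and the `DISABLED` flag are hypothetical names, and the sample identifiers in the usage note are toy values. `AGE` and `SEX` are mnemonics from Table 2; `COUNTRY`, `YEAR` and `SERIAL` (a household serial number) are assumed here as record identifiers.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExtractRequest:
    """One custom-tailored extract request (all field names are illustrative)."""
    samples: set                      # (country, census year) pairs
    variables: list                   # variables to keep in the output
    select_cases: dict = field(default_factory=dict)  # variable -> accepted codes
    whole_households: bool = False    # keep everyone living with a selected case
    density: float = 1.0              # share of households to keep


def household_key(record):
    """Identify a household within a sample (assumed identifier fields)."""
    return (record["COUNTRY"], record["YEAR"], record["SERIAL"])


def build_extract(records, request, seed=0):
    """Apply the selections in order: samples, density, cases, variables."""
    rng = random.Random(seed)

    # 1. Countries and census years.
    rows = [r for r in records if (r["COUNTRY"], r["YEAR"]) in request.samples]

    # 2. Sample density: thin out whole households, never single persons.
    households = sorted({household_key(r) for r in rows})
    kept = {h for h in households if rng.random() < request.density}
    rows = [r for r in rows if household_key(r) in kept]

    # 3. Sub-populations ("select cases").
    def matches(r):
        return all(r.get(var) in codes for var, codes in request.select_cases.items())

    if request.whole_households:
        hits = {household_key(r) for r in rows if matches(r)}
        rows = [r for r in rows if household_key(r) in hits]
    else:
        rows = [r for r in rows if matches(r)]

    # 4. Variables.
    return [{v: r[v] for v in request.variables} for r in rows]
```

For example, a request with `samples={("Kenya", 1999)}`, `variables=["AGE", "SEX"]`, `select_cases={"DISABLED": {1}}` and `whole_households=True` returns every person in a household that contains at least one person flagged as disabled, which mirrors the "select every person in households containing an individual with employment disability" option shown on the Select Cases slide.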