A Crash Course in Secondary Data Sources for Berkeley Researchers
Jon Stiles
D-Lab

Lesson 1 & 2: Plan Ahead!
Take a little time to check out the landscape and see what you might want to look for.

Lesson 1 & 2: Look where you’re going!
…and don’t go so fast that you lose control of what you’re doing!

Road Map for Today
Secondary Data Resources
- Secondary data: what is it and where does it come from?
- Why and how would you want to use it?
- Secondary data: where can you find it?
  - Sites (archives, research organizations, government agencies)
  - Strategies (keyword, literature, snowball)
- Tools to help you extract and use secondary data
- Local resources to help you

Secondary data: what is it and where does it come from?

Secondary data: what is it?
Data: plural of "datum" … from the Latin "something given."
- Plural: Right on!
- Something given: Not so much…

Secondary data: what is it?
Primary data
- “New” data
- Typically collected to answer specific questions or serve specific needs
- Known universe/sample, intentional design
- Tailored data items
Secondary data
- “Recycled” data
- Collected by others and re-used
- Often (but not always) collected for a different use
- Value reliant on meta-data (information about the data)

Secondary data: basic characteristics
Secondary data tend to emerge from three kinds of collection processes:
- Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
- Administrative data: collection for program administration or routine record-keeping
- Digital exhaust: an electronic byproduct or residue of activities
Secondary data may be available either as:
- Microdata: individual-level records for a unit of analysis
- Aggregate data: summary counts or statistics across multiple units
Secondary data may be available either as:
- Cross-sectional data: data collected at a single point in time
- Longitudinal data: data collected for the same unit of observation at multiple points in time

Data Characteristics

Survey Data
Characteristics
- Well-defined sampling process
- Usually fewer observations
  - American Community Survey (~200K/month)
  - GSS (~1500-6000)
  - Public Opinion (~1200)
- Individual opinions and characteristics often gathered

Administrative data characteristics
- Restricted universe, but can have large amounts of data (millions of observations)
- Data collected only for program administration
- Other data spotty, even if described in program
- Often linkable to other data
- Rarely includes participant opinion

“Data Exhaust” Characteristics
- Often very large
- Skewed populations – unclear sampling frame
- Uncertain but developing capacity to link

Secondary data: origins
Secondary data emerge from several kinds of collection processes:
- Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
  Examples: General Social Survey (GSS), National Health Interview Surveys (NHIS), Current Population Survey (CPS)
- Administrative data: collection for program administration or routine record-keeping
  Examples: Marriage Records, Property Sales, Hospital Discharge Records, Court Records
- Data exhaust: byproduct or residue of activities
  Examples: Twitter collections, cell phone location data, newspaper articles

Advantages of Secondary Data
- Cost: the original data collector bears the burden
- Comparability: results may be contrasted with others using the same or similar sources
- Chronology: the research process can be shortened dramatically
- Coverage: data may address points in time or geographies not directly available to the researcher
- Credibility: data collection may use specially trained/knowledgeable staff

Disadvantages/Concerns about Secondary Data
- Sample design may be unknown/undocumented
- Quality of data elements may vary dramatically
- Data collection challenges may be difficult to ascertain
- Data may be gathered for different purposes or coded in inappropriate ways
- Data may be outdated
- Cost/Availability: proprietary or confidential data

Break & Introduction

Next we are going to
talk about places that serve as repositories for data, and how to locate data. But before we do that, let’s take a break and talk about your interests and needs.

Secondary data: where can you find it?

Archives: Academic
ICPSR (Inter-University Consortium for Political and Social Research) is a membership-based organization which collects data from individual researchers, polling agencies, and governmental and international agencies. Data sets cover areas such as political attitudes and behavior patterns, crime and criminal justice, state and national voting records, election studies, census enumerations, economic behavior, family studies, and social attitudes. Holdings at ICPSR are available to UCB subject to IP verification. (www.icpsr.umich.edu)

Archives: Polling Data
Roper Center: The Roper Center archives data from thousands of surveys with national adult, state, foreign, and special subpopulation samples conducted by Gallup, NORC, CBS, ABC, Harris, the LA Times, the NY Times, and many other polling organizations. Polls are available from as far back as the mid-1930s. Holdings at the Roper Center are, effective as of this month, also available via IP screening. (www.ropercenter.uconn.edu)

Government: NCES
http://nces.ed.gov/

NCES: Data Access
http://nces.ed.gov/edat/

Government: NCHS
http://www.cdc.gov/nchs/surveys.htm

Government: NSF - College, Doctoral, Post-Doctoral
http://www.nsf.gov/statistics/data-tools.cfm#micro-data

Government: BEA

Government: BLS
http://www.bls.gov/data/

Government: USDA

UKDA: General Purpose Archive
http://discover.ukdataservice.ac.uk/

IEA: TIMSS

OECD: PISA
http://www.oecd.org/pisa/
http://www.asdfree.com/2013/12/analyze-program-for-international.html

Other Archives/Data Resources on the net
Office of Population Research at Princeton
http://opr.princeton.edu/archive/
This archive focuses on data of interest to demographers: data about fertility, mortality, and migration.
The Mexican Migration Project (MMP), an ongoing multidisciplinary study of migration from Mexico to the United States, has released data for 93 communities in 17 states in Mexico. The Latin American Migration Project (LAMP), which extends the MMP design to a study of migration flows originating in other Latin American countries, has now released data for the Dominican Republic, Nicaragua, Costa Rica, Haiti, Peru and Paraguay.

Demographic and Health Surveys
http://www.measuredhs.com/
Surveys from Central and South America, Africa, and Asia dealing with health, family planning, education, and household characteristics. Free, but registration required.

Archives: Distributed
http://thedata.harvard.edu/dvn/
http://thedata.harvard.edu/dvn/faces/site/BrowseDataversesPage.xhtml?initialSort=Released

Add Health
http://www.cpc.unc.edu/projects/addhealth/data

Other Archives/Data Resources on the net
Integrated Public Use Microdata Series (IPUMS)
http://www.ipums.umn.edu/
This is THE starting place if you have any interest in using microdata from the decennial censuses of the US. The documentation provides wording/context, extractions are straightforward, and multiple statistical packages are supported.

General Social Survey
http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10
The GSS (General Social Survey) is an almost annual "omnibus" personal-interview survey of U.S. households conducted by the National Opinion Research Center (NORC) since 1972. It covers a broad range of topics, with a strong core of items replicated each year and, since the mid-1980s, modules that are fielded concurrently in many other countries.

Other Archives/Data Resources on the net
Panel Study of Income Dynamics (PSID)
http://psidonline.isr.umich.edu/
The PSID is a longitudinal survey of a representative sample of US individuals and their families, ongoing since 1968. The data were collected each year through 1997, and every other year starting in 1999.
Topics include income and wealth, expenses, education, and health care.

The National Survey of Families and Households (NSFH)
http://www.ssc.wisc.edu/nsfh/home.htm
The NSFH fielded three waves of interviews between 1987 and 2002, covering family structure, household division of labor, employment, cohabitation, parenting, health and well-being, etc.

Other Archives/Data Resources on the net
National Historical Geographic Information System
http://www.nhgis.org/
Provides, free of charge, aggregate census data and GIS-compatible boundary files for the United States between 1790 and 2000.

International Social Survey Programme (ISSP)
http://www.gesis.org/en/data_service/issp/data/list_quest_pdf.htm
The ISSP topical modules have focused on the Role of Government (1985, 1990, 1996), Family (1988, 1994, 2002), Social Inequality (1987, 1992, 1999), Social Networks (1986, 2001), Religion (1991, 1998), National Identity (1995, 2003), the Environment (1993, 2000) and Work Orientations (1989, 1997). The most recent modules are fielded in almost 40 countries.

Other Archives/Data Resources on the net
National Bureau of Economic Research
http://nber.org/data/
Downloadable macro and microdata. Includes Consumer Expenditures data, Survey of Program Dynamics (SPD), Survey of Income and Program Participation (SIPP), natality and mortality files from NCHS, a long time series on segregation, and more.
http://nber.org/data/cps.html
A very nice description and organization of the topical supplements to the CPS, as well as the data, documentation, and (in many cases) SAS, SPSS, and Stata syntax to read in the data.
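
Read-in syntax of the kind mentioned above is needed because survey microdata are often distributed as fixed-format or delimited text. As a rough sketch of what such syntax does, the Python below slices one fixed-format record by column position, parses a delimited equivalent, and then types the variables. The column layout, variable names, and sample records are all invented for illustration; real column positions always come from the codebook for the file in question.

```python
# Hypothetical layout: (variable name, start, end), 0-based, end-exclusive.
# These positions are made up; a real layout comes from the data's codebook.
LAYOUT = [("hh_id", 0, 5), ("age", 5, 7), ("sex", 7, 8), ("weight", 8, 14)]


def parse_fixed(line, layout=LAYOUT):
    """Slice one fixed-format record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


def parse_delimited(line, names, sep=","):
    """Split one delimited record and pair each field with its variable name."""
    fields = (field.strip() for field in line.split(sep))
    return dict(zip(names, fields))


def type_record(raw):
    """Variable typing: numbers become int/float, blanks become None (missing)."""
    return {
        "hh_id": raw["hh_id"],  # identifiers stay strings to keep leading zeros
        "age": int(raw["age"]) if raw["age"] else None,
        "sex": raw["sex"],
        "weight": float(raw["weight"]) if raw["weight"] else None,
    }


names = [name for name, _, _ in LAYOUT]
fixed_line = "0000134M0125.5"        # invented record, 14 columns wide
delim_line = "00001, 34, M, 125.5"   # the same invented record, comma-delimited

from_fixed = type_record(parse_fixed(fixed_line))
from_delim = type_record(parse_delimited(delim_line, names))
print(from_fixed)
print(from_fixed == from_delim)
```

Keeping the identifier as a string while converting age and weight to numbers is the "string vs. numeric" decision that read-in syntax makes for each variable.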
Other Archives/Data Resources on the net
American Religion Data Archive
http://www.arda.tm/

Consortium for Earth Science Information Network (CIESIN)
http://sedac.ciesin.org/data.html
- 1980/1990/2000 Census summary files in easily usable format
- Boundary files in popular GIS formats

University of Wisconsin-Madison Center for Demography and Ecology ftp site
http://www.ssc.wisc.edu/cde/library/cdeftp.htm

University of Virginia Library
http://fisher.lib.virginia.edu/

Other Data Resources/tools on the net
The DataFerrett
http://dataferrett.census.gov/TheDataWeb/index.html
A collaboration between the CDC and Census Bureau which allows you to extract and download data from:
- American Community Survey (ACS)
- American Housing Survey (AHS)
- Behavioral Risk Factor Surveillance System (BRFSS)
- Consumer Expenditure Survey (CES)
- Current Population Survey (CPS)
- Decennial Census of Population and Housing (Census2000)
- National Ambulatory Medical Care Survey (NAMCS)
- National Center for Health Statistics Mortality-Underlying Cause-of-Death (MORT)
- National Health and Nutrition Examination Survey (HANES)
- National Health Interview Survey (NHIS)*
- National Hospital Ambulatory Medical Care Survey (NHAMCS)
- National Survey of Fishing, Hunting, and Wildlife-Associated Recreation (FHWAR)
- Survey of Income and Program Participation (SIPP)
- Survey of Program Dynamics (SPD)

Tools to help you extract and use secondary data
www.socialexplorer.com

Local resources to help you

Selected Data Resources at Berkeley
- D-Lab: http://dlab.berkeley.edu/
- UC DATA: http://ucdata.berkeley.edu/
- California Census Research Data Center: http://www.census.gov/ces/
- Library Data Lab: http://www.lib.berkeley.edu/wikis/datalab/
- SDA (Survey Documentation & Analysis): http://sda.berkeley.edu/
- Geospatial Innovation Facility: http://gif.berkeley.edu/

Thank you. (Slides will be posted.)

Road Map (I)
- Research Design & Implementation
- Data Collection
- Data Entry
Primary or Secondary – or both?
(& Documentation) (& Documentation) (& Documentation)

Road Map (II)
- Data Cleaning (& Documentation)
- Reading Data In (& Documentation)
- Labelling (& Documentation)
- Edit Checks (More Cleaning) (& Documentation)
- Weighting (& Documentation)

Road Map (III)
- Descriptive Statistics (& Documentation)
- Data Transformation (& Documentation)
- Record Matching (& Documentation)
- Aggregation/Collapsing (& Documentation)

First Stops
- Data Cleaning: Skip Patterns, Missing Data, Range Checks
- Reading Data In: Fixed format / Delimited / Hierarchical; Variable Typing (String/Numeric)
- Labelling: Variables & Values
- Edit Checks (More Cleaning): Consistency / Imputation
- Weighting: Sampling Probability, Non-Response, Population

Along the way
- Descriptive Statistics: Min/Mean/Percentiles/Max/Valid N
- Data Transformation: Recoding, Complex Scales & Indices
- Record Matching: Linking (1-1) / (1-Many)
- Aggregation/Collapsing: Summary Statistics

Planning Your Trip and Getting on the Road
Research Design & Implementation
- What do you want to be able to say at the end?
- Who/what are your units of analysis?
- What is the universe of the units you want to talk about?
- How are the units you observe selected from the universe?
- What are the instruments used to collect data?
Data Collection
- How was the sampling strategy implemented?
- Non-response – unit-level, item-level – and followup
Data Entry
- Coding, Collapsing, Open-ended
- Validation
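
Several of the road-map stops (missing data, range checks, weighting, valid N, and aggregation/collapsing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the missing-value codes, the plausible age range, and the weights are all invented, and real surveys document their own codes and weighting procedures in the codebook.

```python
from collections import defaultdict

# Assumption: -9 and 99 are this (imaginary) survey's codes for missing age.
MISSING_CODES = {-9, 99}

# Invented microdata: one record per person, with a sampling weight.
records = [
    {"state": "CA", "age": 34, "weight": 120.0},
    {"state": "CA", "age": 99, "weight": 80.0},    # missing-value code
    {"state": "NY", "age": 51, "weight": 200.0},
    {"state": "NY", "age": 140, "weight": 100.0},  # fails the range check
    {"state": "NY", "age": 29, "weight": 100.0},
]


def clean_age(value, low=0, high=110):
    """Missing-data and range check: return None for codes or implausible ages."""
    if value in MISSING_CODES or not (low <= value <= high):
        return None
    return value


def weighted_mean(pairs):
    """Weighted mean over (value, weight) pairs, skipping missing values."""
    valid = [(v, w) for v, w in pairs if v is not None]
    total_weight = sum(w for _, w in valid)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in valid) / total_weight


def aggregate_by(rows, key):
    """Collapse microdata to one summary row per group (valid N, weighted mean)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append((clean_age(row["age"]), row["weight"]))
    return {
        group: {
            "valid_n": sum(1 for v, _ in pairs if v is not None),
            "weighted_mean_age": weighted_mean(pairs),
        }
        for group, pairs in groups.items()
    }


summary = aggregate_by(records, "state")
for state, stats in sorted(summary.items()):
    print(state, stats)
```

Cleaning before weighting matters here: if the 99 and 140 values were left in, they would be averaged as if they were real ages. Reporting valid N next to each statistic shows how many records survived the checks.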