Power Point Slides Secondary Data Resources - D-Lab

A Crash Course in Secondary Data Sources for Berkeley Researchers
Jon Stiles
D-Lab
Lesson 1 & 2:
Plan Ahead!
Take a little time to check out the landscape and see what you might want to look for.
Lesson 1 & 2:
Look where you’re going!
…and don’t go so fast that you lose control of what you’re doing!
Road Map for Today
Secondary Data Resources
Secondary data: what is it and where does it come from?
Why and how would you want to use it?
Secondary data: where can you find it?
  Sites (archives, research organizations, government agencies)
  Strategies (keyword, literature, snowball)
Tools to help you extract and use secondary data
Local resources to help you
Secondary data: what is it and where does it come from?
Secondary data: what is it?
Data: plural of "datum," from the Latin for "something given."
Plural: Right on!
Something given: Not so much...
Secondary data: what is it?
Primary data
  “New” data
  Typically collected to answer specific questions or serve specific needs
  Known universe/sample, intentional design
  Tailored data items
Secondary data
  “Recycled” data
  Collected by others and re-used
  Often (but not always) collected for a different use
  Value reliant on metadata (information about the data)
Secondary data: basic characteristics
Secondary data tend to emerge from three kinds of collection processes:
  Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
  Administrative data: collection for program administration or routine record-keeping
  Digital exhaust: an electronic byproduct or residue of activities
Secondary data may be available either as:
  Microdata: individual-level records for a unit of analysis
  Aggregate data: summary counts or statistics across multiple units
Secondary data may be available either as:
  Cross-sectional data: data collected at a single point in time
  Longitudinal data: data collected for the same unit of observation at multiple points in time
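To make the microdata/aggregate and cross-sectional/longitudinal distinctions concrete, here is a minimal sketch in Python. The records, field names, and helper functions are all hypothetical and are not drawn from any of the data sources discussed in these slides.

```python
from collections import Counter, defaultdict

# Hypothetical microdata: one record per person per survey year.
microdata = [
    {"person_id": 1, "year": 2010, "employed": True},
    {"person_id": 2, "year": 2010, "employed": False},
    {"person_id": 3, "year": 2010, "employed": True},
    {"person_id": 1, "year": 2012, "employed": True},
    {"person_id": 2, "year": 2012, "employed": True},
]


def aggregate_by_year(records):
    """Collapse microdata into aggregate data: total and employed counts per year."""
    totals = Counter()
    employed = Counter()
    for rec in records:
        totals[rec["year"]] += 1
        if rec["employed"]:
            employed[rec["year"]] += 1
    return {
        year: {"n": totals[year], "employed": employed[year]}
        for year in sorted(totals)
    }


def longitudinal_units(records):
    """Return the ids of units observed at more than one point in time."""
    years_seen = defaultdict(set)
    for rec in records:
        years_seen[rec["person_id"]].add(rec["year"])
    return sorted(pid for pid, years in years_seen.items() if len(years) > 1)


aggregate = aggregate_by_year(microdata)
panel_ids = longitudinal_units(microdata)
print(aggregate)
print(panel_ids)
```

Once the microdata are collapsed, the individual records cannot be recovered from the aggregate table, which is one reason the form in which a source is released matters.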
Data Characteristics
Survey Data Characteristics
Well defined sampling process
Usually fewer observations
American Community Survey (~200K/mon)
GSS (~1500-6000)
Public Opinion (~1200)
Individual opinions and characteristics often gathered
Administrative data characteristics
Restricted universe, but can have large amounts of data (millions of observations)
Data collected only for program administration
Other data spotty, even if described in program
Often linkable to other data
Rarely includes participant opinion
“Data Exhaust” Characteristics
Often very large
Skewed populations – unclear sampling frame
Uncertain but developing capacity to link
Secondary data: origins
Secondary data emerge from several kinds of collection processes:
Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
  Examples:
    General Social Survey (GSS)
    National Health Interview Surveys (NHIS)
    Current Population Survey (CPS)
Administrative data: collection for program administration or routine record-keeping
  Examples:
    Marriage Records
    Property Sales
    Hospital Discharge Records
    Court Records
Data exhaust: byproduct or residue of activities
  Examples:
    Twitter collections
    Cell phone location data
    Newspaper articles
Advantages of Secondary Data
Cost: original data collector bears the burden
Comparability: results may be contrasted with others using same/similar sources
Chronology: research process can be shortened dramatically
Coverage: data may address points in time or geographies not directly available to researcher
Credibility: data collection may use specially trained/knowledgeable staff
Disadvantages/Concerns about Secondary Data
Sample design may be unknown/undocumented
Quality of data elements may vary dramatically
Data collection challenges may be difficult to ascertain
Data may be gathered for different purposes/coded in inappropriate ways
Data may be outdated
Cost/Availability: proprietary or confidential data
Break & Introduction

Next we are going to talk about places
which serve as repositories for data, and
how to locate data….
But before we do that, let’s take a break and
talk about your interests and needs.
Secondary data: where can you find it?
Archives: Academic
ICPSR (Inter-University Consortium for Political and Social Research)
is a membership-based organization which collects data from
individual researchers, polling agencies, and governmental and
international agencies. Data sets cover areas such as political
attitudes and behavior patterns, crime and criminal justice, state
and national voting records, election studies, census
enumerations, economic behavior, family studies, and social
attitudes. Holdings at ICPSR are available to UCB subject to
IP verification. (www.icpsr.umich.edu)
Archives: Polling Data
Roper Center:
The Roper Center archives data from thousands
of surveys with national adult, state, foreign, and special
subpopulation samples conducted by Gallup, NORC, CBS, ABC,
Harris, the LA Times, the NY Times, and many other polling
organizations. Polls are available from as far back as the
mid-1930s. Holdings at the Roper Center are, effective as of this
month, also available via IP screening.
(www.ropercenter.uconn.edu)
Government: NCES
http://nces.ed.gov/
NCES: Data Access
http://nces.ed.gov/edat/
Government: NCHS
http://www.cdc.gov/nchs/surveys.htm
Government: NSF
- College, Doctoral, Post-Doctoral
http://www.nsf.gov/statistics/data-tools.cfm#micro-data
Government: BEA
Government: BLS
http://www.bls.gov/data/
Government: USDA
UKDA: General Purpose Archive
http://discover.ukdataservice.ac.uk/
IEA: TIMSS
OECD: PISA
http://www.oecd.org/pisa/
http://www.asdfree.com/2013/12/analyze-program-for-international.html
Other Archives/Data Resources on the net
Office of Population Research at Princeton
http://opr.princeton.edu/archive/
This archive focuses on data of interest to demographers:
data about fertility, mortality, and migration.
The Mexican Migration Project (MMP), an ongoing multidisciplinary study
of migration from Mexico to the United States, has released data for 93
communities in 17 States in Mexico.
The Latin American Migration Project (LAMP), which extends the MMP
design to a study of migration flows originating in other Latin American
countries, has now released data for Dominican Republic, Nicaragua, Costa
Rica, Haiti, Peru and Paraguay.
Demographic and Health Surveys
http://www.measuredhs.com/
Surveys from Central and South America, Africa, and
Asia dealing with health, family planning, education, and
household characteristics. Free, but registration required.
Archives: Distributed
http://thedata.harvard.edu/dvn/
http://thedata.harvard.edu/dvn/faces/site/BrowseDataversesPage.xhtml?initialSort=Released
Add Health
http://www.cpc.unc.edu/projects/addhealth/data
Other Archives/Data Resources on the net
Integrated Public Use Microdata Series (IPUMS)
http://www.ipums.umn.edu/
This is THE starting place if you have any interest in using
microdata from the decennial censuses of the US. The
documentation provides question wording and context, extractions
are straightforward, and multiple statistical packages are supported.
General Social Survey
http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10
The GSS (General Social Survey) is an almost-annual
"omnibus" personal-interview survey of U.S. households,
conducted by the National Opinion Research Center (NORC)
since 1972. It covers a broad range of topics, with a strong
core of items replicated each year and modules that have been
fielded concurrently in many other countries since the mid-1980s.
Other Archives/Data Resources on the net
Panel Study of Income Dynamics (PSID)
http://psidonline.isr.umich.edu/
The PSID is a longitudinal survey of a representative sample
of US individuals and their families, ongoing since 1968.
The data were collected each year through 1997, and every other
year starting in 1999. Topics include income and wealth, expenses,
education, and health care.
The National Survey of Families and Households (NSFH)
http://www.ssc.wisc.edu/nsfh/home.htm
The NSFH fielded three waves of interviews between 1987
and 2002, covering family structure, household division of labor,
employment, cohabitation, parenting, health and well-being, and more.
Other Archives/Data Resources on the net
National Historical Geographic Information System (NHGIS)
http://www.nhgis.org/
Provides, free of charge, aggregate census data and GIS-compatible
boundary files for the United States between 1790 and 2000.
International Social Survey Programme (ISSP)
http://www.gesis.org/en/data_service/issp/data/list_quest_pdf.htm
The ISSP topical modules have focused on the Role of Government
(1985, 1990, 1996), Family (1988, 1994, 2002), Social Inequality (1987,
1992, 1999), Social Networks (1986, 2001), Religion (1991, 1998),
National Identity (1995, 2003), the Environment (1993, 2000) and Work
Orientations (1989, 1997). The most recent modules are fielded in
almost 40 countries.
Other Archives/Data Resources on the net
National Bureau of Economic Research
http://nber.org/data/
Downloadable macro and microdata. Includes Consumer
Expenditures data, Survey of Program Dynamics (SPD),
Survey of Income and Program Participation (SIPP), natality
and mortality files from NCHS, a long time series on
segregation, and more.
http://nber.org/data/cps.html
A very nice description and organization of the topical
supplements to the CPS, as well as the data, documentation,
and (in many cases) SAS, SPSS, and Stata syntax to read in
the data.
Other Archives/Data Resources on the net
American Religion Data Archive
http://www.arda.tm/
Center for International Earth Science Information Network (CIESIN)
http://sedac.ciesin.org/data.html
1980/1990/2000 Census summary files in easily usable format
Boundary files in popular GIS formats
University of Wisconsin-Madison
Center for Demography and Ecology ftp site
http://www.ssc.wisc.edu/cde/library/cdeftp.htm
University of Virginia Library
http://fisher.lib.virginia.edu/
Other Data Resources/tools on the net
DataFerrett
http://dataferrett.census.gov/TheDataWeb/index.html
A collaboration between the CDC and Census Bureau which allows you
to extract and download data from:
American Community Survey (ACS)
American Housing Survey (AHS)
Behavioral Risk Factor Surveillance System (BRFSS)
Consumer Expenditure Survey (CES)
Current Population Survey (CPS)
Decennial Census of Population and Housing (Census2000)
National Ambulatory Medical Care Survey (NAMCS)
National Center for Health Statistics Mortality-Underlying Cause-of-Death (MORT)
National Health and Nutrition Examination Survey (HANES)
National Health Interview Survey (NHIS)*
National Hospital Ambulatory Medical Care Survey (NHAMCS)
National Survey of Fishing, Hunting, and Wildlife-Associated Recreation (FHWAR)
Survey of Income and Program Participation (SIPP)
Survey of Program Dynamics (SPD)
Tools to help you extract and use secondary data
www.socialexplorer.com
Local resources to help you
Selected Data Resources at Berkeley
D-Lab
http://dlab.berkeley.edu/
UC DATA
http://ucdata.berkeley.edu/
California Census Research Data Center
http://www.census.gov/ces/
Library Data Lab
http://www.lib.berkeley.edu/wikis/datalab/
SDA (Survey Documentation & Analysis)
http://sda.berkeley.edu/
Geospatial Innovation Facility
http://gif.berkeley.edu/
Thank you. (Slides will be posted.)
Road Map (I)
Research Design & Implementation (& Documentation)
Data Collection (& Documentation)
Data Entry (& Documentation)
Primary or Secondary, or both?
Road Map (II)
Data Cleaning (& Documentation)
Reading Data In (& Documentation)
Labelling (& Documentation)
Edit Checks (More Cleaning) (& Documentation)
Weighting (& Documentation)
Road Map (III)
Descriptive Statistics (& Documentation)
Data Transformation (& Documentation)
Record Matching (& Documentation)
Aggregation/Collapsing (& Documentation)
First Stops
Data Cleaning
  Skip Patterns
  Missing Data
  Range Checks
Reading Data In
  Fixed format / Delimited / Hierarchical
  Variable Typing (String / Numeric)
Labelling
  Variables & Values
Edit Checks (More Cleaning)
  Consistency / Imputation
Weighting
  Sampling Probability
  Non-Response
  Population
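Several of these first stops can be sketched together in a few lines of Python. This is an illustration only: the fixed-format column layout, the missing-value codes, the valid age range, and the sample records are invented for the example and would in practice come from the codebook of the data set at hand.

```python
# Hypothetical fixed-format layout: (name, start, end, type),
# with 0-based, end-exclusive column positions.
LAYOUT = [
    ("id", 0, 3, str),
    ("age", 3, 6, int),
    ("income", 6, 12, int),
    ("weight", 12, 17, float),
]

# Assumed codebook conventions: these values mean "missing".
MISSING_CODES = {"age": {999}, "income": {999999}}
# Assumed plausible range for a range check.
VALID_RANGE = {"age": (0, 120)}

RAW_LINES = [
    "001 34 52000 1.50",
    "002999 41000 0.75",   # age carries the missing code
    "003 27999999 1.25",   # income carries the missing code
    "004150 38000 1.00",   # age fails the range check
]


def parse_line(line):
    """Read one fixed-format record, typing each variable as string or numeric."""
    return {
        name: convert(line[start:end].strip())
        for name, start, end, convert in LAYOUT
    }


def clean(rec):
    """Set missing-value codes and out-of-range values to None."""
    out = dict(rec)
    for var, codes in MISSING_CODES.items():
        if out[var] in codes:
            out[var] = None
    for var, (low, high) in VALID_RANGE.items():
        if out[var] is not None and not (low <= out[var] <= high):
            out[var] = None
    return out


def weighted_mean(records, var):
    """Weighted mean of var over records where var is not missing."""
    pairs = [(r[var], r["weight"]) for r in records if r[var] is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


records = [clean(parse_line(line)) for line in RAW_LINES]
mean_income = weighted_mean(records, "income")
print(records)
print(mean_income)
```

Forgetting the missing-value step would let the code 999999 enter the income mean as if it were a real value, which is why cleaning comes before any statistics. The weights here are simply read from the file; constructing them from sampling probabilities, non-response, and population totals is a separate task.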
Along the way
Descriptive Statistics
  Min/Mean/Ptiles/Max/Valid N
Data Transformation
  Recoding
  Complex
  Scales & Indices
Record Matching
  Linking (1-1) / (1-Many)
Aggregation/Collapsing
  Summary Statistics
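The steps along the way can be sketched the same way. In the hypothetical example below, a household file is linked one-to-many to a person file, age is recoded into groups, and the linked records are collapsed back to household-level summary statistics. All names and values are invented for illustration.

```python
from statistics import mean

# Hypothetical files: one record per household, many person records per household.
households = {
    "H1": {"region": "West"},
    "H2": {"region": "South"},
}
persons = [
    {"hh_id": "H1", "age": 41, "income": 50000},
    {"hh_id": "H1", "age": 39, "income": 30000},
    {"hh_id": "H1", "age": 7, "income": 0},
    {"hh_id": "H2", "age": 68, "income": 22000},
    {"hh_id": "H3", "age": 25, "income": 18000},  # no matching household record
]


def recode_age(age):
    """Recode a continuous age into a categorical age group."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def link_one_to_many(households, persons):
    """Attach household variables to each person; keep track of non-matches."""
    linked, unmatched = [], []
    for person in persons:
        household = households.get(person["hh_id"])
        if household is None:
            unmatched.append(person)
        else:
            linked.append(
                {**person, **household, "age_group": recode_age(person["age"])}
            )
    return linked, unmatched


def collapse_to_household(linked):
    """Aggregate person records into one summary record per household."""
    members_by_hh = {}
    for rec in linked:
        members_by_hh.setdefault(rec["hh_id"], []).append(rec)
    return {
        hh_id: {
            "n_persons": len(members),
            "total_income": sum(m["income"] for m in members),
            "mean_age": mean(m["age"] for m in members),
        }
        for hh_id, members in members_by_hh.items()
    }


def describe(values):
    """Min / mean / max / valid N, ignoring missing (None) values."""
    valid = [v for v in values if v is not None]
    return {"min": min(valid), "mean": mean(valid), "max": max(valid), "valid_n": len(valid)}


linked, unmatched = link_one_to_many(households, persons)
household_summary = collapse_to_household(linked)
income_stats = describe(rec["income"] for rec in linked)
print(household_summary)
print(unmatched)
print(income_stats)
```

Reporting the unmatched records, rather than silently dropping them, is part of the documentation that every step in the road map calls for.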
Planning Your Trip and Getting on the Road
Research Design & Implementation
  What do you want to be able to say at the end?
  Who/what are your units of analysis?
  What is the universe of the units you want to talk about?
  How are the units you observe selected from the universe?
  What instrument(s) are used to collect data?
Data Collection
  How was the sampling strategy implemented?
  Non-response (unit-level, item-level) and follow-up
Data Entry
  Coding, Collapsing, Open-ended
  Validation
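The difference between unit-level and item-level non-response can be shown with a small, hypothetical sample. The records and the simple rate definitions below are assumptions made for illustration; published surveys usually report response rates using more detailed disposition categories.

```python
# Hypothetical sample: "answers" is None when the sampled unit never responded
# (unit non-response); a None inside "answers" is a skipped question
# (item non-response).
sample = [
    {"id": 1, "answers": {"age": 34, "income": 52000}},
    {"id": 2, "answers": {"age": 51, "income": None}},
    {"id": 3, "answers": None},
    {"id": 4, "answers": {"age": None, "income": None}},
    {"id": 5, "answers": None},
]


def unit_response_rate(sample):
    """Share of sampled units that responded at all."""
    responded = [s for s in sample if s["answers"] is not None]
    return len(responded) / len(sample)


def item_nonresponse_rates(sample):
    """Among respondents, the share missing each individual item."""
    respondents = [s["answers"] for s in sample if s["answers"] is not None]
    items = sorted({item for answers in respondents for item in answers})
    return {
        item: sum(1 for answers in respondents if answers.get(item) is None)
        / len(respondents)
        for item in items
    }


unit_rate = unit_response_rate(sample)
item_rates = item_nonresponse_rates(sample)
print(unit_rate)
print(item_rates)
```

Both figures belong in the documentation of a data set, since a high unit response rate can still hide heavy item non-response on sensitive questions such as income.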