SS4_10.1

Getting to know your variables
Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
Overview
• Why “get to know” your variables?
• Unit of analysis
• Restrictions on your analytic sample
• Information about each variable
– Level of measurement
– Missing values
– Valid range of values
– Substantive interpretation of values
– Distribution of observed values in your data set
Why is it important
to get to know your variables?
• Each variable measures
– A specific concept
• Numeric values have particular meanings that differ
depending on the nature of that concept
– In a particular context
• When, where, to whom do those numbers pertain?
– Collected with a specific study design
• Need to understand why some values are missing
– By design
– Due to non-response
Example of failing
to get to know variables
• In a nationally representative survey sample from a
developing country circa 2002, birth weight in grams had an
observed range up to 9999.
– Data set downloaded from a research data web site; not
cleaned or evaluated before use.
– Mean birth weight over 8000 in the sample
• First red flag: Implausible as an actual birth weight,
given its meaning and units. 9,999 grams ~= 22 lbs.
– 9999 was a code for missing value
• Lesson learned: Must become familiar with what a
particular value means for that concept and context.
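The pounds figure can be checked with a one-line unit conversion. This is a minimal sketch: the function name is illustrative, and the conversion factor is the standard definition of the international pound.

```python
GRAMS_PER_POUND = 453.59237  # definition of the international avoirdupois pound

def grams_to_pounds(grams):
    """Convert a weight in grams to pounds."""
    return grams / GRAMS_PER_POUND

# 9999 grams comes out at roughly 22 lbs, far above any plausible birth weight.
print(f"{grams_to_pounds(9999):.1f} lbs")
```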
Second red flag
• 2/3 of sample had a birth weight value of 9999
– Very high value for a substantial share of the sample,
unlikely to be explained solely by
• outliers
• data entry errors
• Lesson learned: Look at study documentation and
questionnaire to find out why this distribution was
observed.
– Occurred due to a skip pattern designed to minimize recall
bias in birth weight reporting.
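Both red flags can be caught with a quick tabulation before any analysis. The sketch below uses a small made-up list of birth weights (not the survey data) and assumes, as in the example, that 9999 is the missing value code. The function and variable names are hypothetical.

```python
from statistics import mean

MISSING_CODE = 9999  # missing value code, as described in the example

# Hypothetical birth weights in grams; NOT the actual survey data.
birth_weights = [3200, 9999, 9999, 2850, 9999, 9999, 3400, 9999, 9999]

def share_equal_to(values, code):
    """Fraction of cases whose recorded value equals the given code."""
    return sum(1 for v in values if v == code) / len(values)

def valid_only(values, code):
    """Drop cases that carry the missing value code."""
    return [v for v in values if v != code]

naive_mean = mean(birth_weights)  # wrongly treats 9999 as a real weight
share_missing = share_equal_to(birth_weights, MISSING_CODE)
valid_mean = mean(valid_only(birth_weights, MISSING_CODE))

print(f"Mean with 9999 left in: {naive_mean:.0f} g")
print(f"Share of cases coded 9999: {share_missing:.0%}")
print(f"Mean of valid cases only: {valid_mean:.0f} g")
```

An implausibly high mean together with a large share of cases at one value suggests checking the codebook and questionnaire, not just looking for outliers.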
Resources needed for this exercise
• Documentation on the data source
– Description of study design
– Questionnaire
– Codebook for electronic data file
• Electronic file of database
• Statistical software
• Your research question
• Articles, books, web sites etc. on your topic
– Dependent and key independent variables
Time needed for this exercise
• This is not an assignment that can be done overnight
• Multi-step process involving multiple resources
– About data source
– About topic
• Results from early steps will inform later steps in the
exercise
• May also need feedback from
– Mentors
– Colleagues
– Persons involved in original data collection
Getting to know your variables
is project-specific
• Several issues that inform this assignment are
specific to research question and data set.
– Unit of analysis
– Restrictions on your analytic sample
– Roles of different variables in your analysis
• Dependent, independent, control, filter
• Even experienced researchers should
complete this assignment when undertaking a
project with a new topic or data set.
Valuable information for all parts
of a well-written research paper
• Reading the literature on your topic will provide
information needed for
– Introduction
– Literature review
– Discussion
• Detailed knowledge of study design and variables
from documentation, questionnaire and codebook
will provide information for
– A comprehensive data section
– Appropriate model specification
– Interpretation of statistical results
Unit of analysis
• Do data pertain to
– Individual person?
– Family?
– Census tract?
– Institution?
• Knowing unit of analysis helps ascertain
plausible range of values, e.g.,
– mean number of family members will be much lower
than population of a census tract or a school
Analytic sample
• Before you acquaint yourself with the range of
values for each variable in your analysis,
impose any limits related to your research
question.
• Exclude cases
– to whom the topic does not pertain.
– that are part of a group with too few cases
– for whom a key variable was not collected
Restrictions on your analytic sample
• Limit your analytic sample to cases that meet certain
criteria related to your research question, e.g.,
– particular demographic traits
– minimum test scores
– a specific disease
• Exclude subgroups that don’t meet minimum sample
size if
– there aren’t enough cases in one or more subgroups of a
key variable to provide sufficient statistical power
– it would not be theoretically sensible to combine them
with other subgroups used in your analysis
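The two kinds of restriction above can be sketched as a filter followed by a subgroup size check. The records, field names, age criterion, and minimum subgroup size are all hypothetical; a real threshold would come from a statistical power calculation.

```python
from collections import Counter

# Hypothetical records; field names are assumptions for illustration.
cases = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 2, "age": 16, "region": "North"},
    {"id": 3, "age": 52, "region": "South"},
    {"id": 4, "age": 41, "region": "South"},
    {"id": 5, "age": 29, "region": "West"},
    {"id": 6, "age": 47, "region": "North"},
]

MIN_SUBGROUP_SIZE = 2  # illustrative only; not a recommended value

def restrict_sample(records, min_age, group_key, min_group_size):
    """Keep cases meeting the age criterion, then drop subgroups that are too small."""
    eligible = [r for r in records if r["age"] >= min_age]
    group_sizes = Counter(r[group_key] for r in eligible)
    return [r for r in eligible if group_sizes[r[group_key]] >= min_group_size]

analytic_sample = restrict_sample(
    cases, min_age=18, group_key="region", min_group_size=MIN_SUBGROUP_SIZE
)
print([r["id"] for r in analytic_sample])
```

Subgroup sizes are counted after the first restriction is applied, since a subgroup can fall below the minimum only once ineligible cases are removed.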
Attributes of each variable to familiarize
yourself with prior to analysis
Labeling, coding, and missing value
information for your variables
• To help you create a comprehensive record of
information on each of the variables in your analysis,
fill out a grid like this one
– An electronic version can be found online in the “getting to
know your variables” assignment.
| Variable name | Variable label (e.g. acronym on your data set) | Type of variable (nominal, ordinal, interval, or ratio) | Coding (for categorical variables) OR Units (for continuous variables) | Plausible range of values (excluding missing values) | Missing value codes (if any) | Skip pattern?* (e.g., conditions under which variable not collected) | Original or created variable?† |
|---|---|---|---|---|---|---|---|
| Saw doctor last year | DOCLY | Nominal | 1 = yes; 2 = no | 1, 2 | 7 = refused; 8 = don’t know; 9 = missing | None for this variable | Original |
| Birth weight | BWGRMS | Ratio | Grams | 0–6000 | 9999 = missing | Asked only about children under age 5 years at time of survey. | Original |
“Old” and “new” variables
• Familiarize yourself with all variables to be used in
your analysis
– Variables analyzed in the same form in which they
appeared in the original data set
– Variables you created from those variables, e.g.,
• Dummy variables created from categorical variables
• Categorical versions of continuous variables
• Aggregated variables, e.g.,
– income calculated from several sources
– scales that combine responses to multiple items
• Calculated variables
– E.g., body mass index calculated from weight and height
• Transformed variables (logged, standardized)
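A few of these created variables can be sketched in code. The function names, cut points, and sample incomes below are hypothetical and chosen only for illustration; standardization here uses the population standard deviation, which is one of several conventions.

```python
import math
from statistics import mean, pstdev

def yes_no_dummy(code):
    """Dummy from a 1 = yes / 2 = no item: 1 for yes, 0 for no.
    Missing value codes must be handled before calling this."""
    return {1: 1, 2: 0}[code]

def age_group(age_years):
    """Categorical version of a continuous variable (cut points are illustrative)."""
    if age_years < 18:
        return "<18"
    if age_years < 65:
        return "18-64"
    return "65+"

def bmi(weight_kg, height_m):
    """Calculated variable: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def standardize(values):
    """Transformed variable: distance from the mean in standard deviation units."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

incomes = [20000, 40000, 60000]            # hypothetical annual incomes
logged = [math.log(x) for x in incomes]    # natural log transformation
z_scores = standardize(incomes)
```

Each created variable has its own units and plausible range (for example, z-scores center on 0), so it needs its own row in the grid.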
Organizing variables within the grid
• Using major row headings, label sections for
each of the following, based on their role in
your analysis
– Dependent variable(s)
– Key independent variables
– Control variables
– Sampling weights
– Filter questions (e.g., used to restrict sample)
Variable names and labels
• For each variable, fill in
– Variable name: a short (up to ~8 character)
acronym used to identify the variable in the
software program you are using
– Variable label: a descriptive phrase of up to 40
characters that helps convey the meaning of the
variable
• If you rename an item with a more informative variable
name (e.g., “gender” instead of Q117), include the
original question name in the variable label
Level of measurement
• Categorical variables are those that are
classified into ranges or categories.
• Continuous variables
– Measured in numeric units, but not grouped.
– Two types of continuous variables:
• Interval
– Zero is not lowest possible value
– e.g., temperature °Fahrenheit
• Ratio
– Zero is lowest possible value
– e.g., temperature °Kelvin, height, weight
• Helps to anticipate limits on range of values
Categorical variables
Nominal variables
• No inherent order to the categories
• Numeric value labels have NO mathematical interpretation
• Examples:
– Gender
– Race
– Geographic region
Ordinal variables
• Categories have an inherent numeric order
• Examples:
– Letter grades
– Age group
– Likert scale items
• E.g., from strongly disagree to strongly agree
Units of measurement
• System of measurement
– E.g., Metric or British or other?
• income in dollars or euros or pesos?
• Level of aggregation
– E.g., income per hour or per week or per year?
• Scale
– E.g., income in dollars or thousands of dollars or
millions of dollars?
• See also podcast on “reporting one number”
Missing values
• Missing values on a variable can occur
because they are
– Not applicable
– Missing by design
– Non-response
• See chapters 4 and 13 of The Chicago Guide to
Writing about Multivariate Analysis, 2nd
Edition for more on missing values and
missing by design.
Not applicable
• Some questions are not asked of specific
respondents because they don’t pertain.
– E.g., if someone reports that they are unemployed, they
wouldn’t be asked about their current job type or earnings
• Look
– At the questionnaire or form used to collect the data for
• a filter question
• a skip pattern
– At the codebook for one or more missing value codes for
non-response
Missing by design
• Some topics are not asked of specific subgroups due
to concern about the accuracy of their responses.
– E.g., to minimize recall bias, mothers asked birth weight
only for children under age 10 years.
• Surveys sometimes administer specialized topic
modules only to a randomly selected subsample of
respondents.
– Used to obtain a smaller representative sample that meets
statistical power requirements while reducing study costs.
• Read the study design documentation to find out
whether these pertain to the question you are using.
Item non-response
• Another reason for missing values is when a
respondent does not answer a question that was
asked of them.
• Item non-response is particularly common for
– stigmatized topics
– questions that require complex or detailed answers
– questions with unclear instructions about number of allowed responses
• Examples: Respondent was asked
– to report income, but didn’t know it
– about immigration status, but had concerns about deportation
Types of non-response
• Don’t know
• Refused to answer the question
• Didn’t answer the question (unspecified reason)
• Other
– marked too many answers to a single response question
– wrote an illegible answer
• Look up the pertinent missing value codes for each
of your variables in the codebook for your data set.
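Once the codes are known, missing values can be tabulated by reason and kept apart from valid answers. The sketch below reuses the codes from the DOCLY row of the example grid; the response list is made up, and in practice the code-to-label mappings would come from your own codebook.

```python
from collections import Counter

# Codes from the DOCLY example row; replace with those in your codebook.
VALUE_LABELS = {1: "yes", 2: "no"}
MISSING_LABELS = {7: "refused", 8: "don't know", 9: "missing"}

responses = [1, 2, 2, 7, 1, 9, 8, 2, 1, 7]  # hypothetical responses

def tabulate(values):
    """Count valid answers and missing values separately, by label."""
    valid, missing = Counter(), Counter()
    for v in values:
        if v in MISSING_LABELS:
            missing[MISSING_LABELS[v]] += 1
        elif v in VALUE_LABELS:
            valid[VALUE_LABELS[v]] += 1
        else:
            # A code absent from the codebook is itself worth investigating.
            raise ValueError(f"Code {v!r} is not in the codebook")
    return valid, missing

valid_counts, missing_counts = tabulate(responses)
print(dict(valid_counts))
print(dict(missing_counts))
```

Raising an error on an unlisted code is a deliberate choice here: it surfaces undocumented values instead of silently counting them as valid.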
Valid range of values
• Definitional limits
• Conceptually plausible range
• Context of measurement
• Observed range
• Watch for numeric values for missing values
– Label them in your electronic database, so they
are treated correctly during analysis.
Definitional limits on values
• Some variables by definition have limits on the range
of values they can assume:
– A percentage share of a whole must fall between 0 and 100
• Likewise, a proportion must fall between 0 and 1
– However, a percentage change can be
• Negative (<0)
• Greater than 100
– Variables at the ratio level of measurement cannot take on
negative values
– Other topic- or field-specific variables also have such
restrictions
• E.g., a Gini coefficient must fall between 0 and 1
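Definitional limits like these lend themselves to an automated check. The sketch below encodes only the limits named above; the dictionary keys and function name are hypothetical, and percentage change is left unbounded as described on this slide.

```python
# Definitional limits as (lower, upper); None means no limit on that side.
DEFINITIONAL_LIMITS = {
    "percentage_share": (0, 100),
    "proportion": (0, 1),
    "gini": (0, 1),
    "percentage_change": (None, None),  # may be negative or exceed 100
}

def out_of_bounds(values, kind):
    """Return the values that violate the definitional limits for this kind of variable."""
    lower, upper = DEFINITIONAL_LIMITS[kind]
    return [
        v for v in values
        if (lower is not None and v < lower) or (upper is not None and v > upper)
    ]

print(out_of_bounds([0.2, 1.4, -0.1], "proportion"))
print(out_of_bounds([-35, 250], "percentage_change"))
```

Any value this returns is either a missing value code, a data error, or a sign that the variable is not what you assumed it to be.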
Plausible range of values
for the concept being measured
A value of 10,000
• Makes sense in at least some contexts for
– Annual family income in dollars
– Population of a census tract
– An annual death rate per 100,000 persons
• Does NOT make sense for
– Hourly income in dollars
– Birth weight in grams
– Number of persons in a family
– A Likert scale item
– A proportion
– An annual death rate per 1,000 persons
Another example of
plausible range of values
A value of –1
• Makes sense in at least some contexts for
– Temperature in degrees Fahrenheit or Celsius
– Change in rating on a 5 point scale
– Change in death rate
– Percentage change in income
• Does NOT make sense for
– Temperature in degrees Kelvin
– Number of persons in a family
– Death rate
– A Likert scale item
– A proportion
Context of measurement
• When, where, who, e.g., family income will be
– Higher now than it was 200 years ago in a given place and
group
– Higher in a currently developed country than in a developing country
– Higher in a sample of all households than in a sample of
low-income households
• Remember to take into account restrictions you
imposed on the analytic sample to fit your research
question and data.
Descriptive statistics
on your variables
• After you have
– Imposed restrictions on your analytic sample
– Filled in missing value codes for each variable
• Complete a grid like the one below, with descriptive
statistics on each of the variables in your analysis.
• An electronic version of this grid can be found online in the
“getting to know your variables” assignment.
Grid columns (“Observed values from data set” is a group heading over the observed statistics):
| Variable name | Variable label (e.g. acronym on your data set) | Number of valid cases for that variable (excl. missing values) | Minimum | Maximum | Mean (for continuous variables) | Modal value | Reference value from external source | Values & range consistent w/ codebook? |
|---|---|---|---|---|---|---|---|---|
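The observed-value columns of this grid can be computed per variable once missing value codes are excluded. This is a minimal sketch with made-up data; the function name and dictionary keys are hypothetical, and it assumes each variable has at least one valid case.

```python
from statistics import mean, multimode

def describe(values, missing_codes, continuous):
    """Observed-value columns of the grid for one variable."""
    valid = [v for v in values if v not in missing_codes]
    return {
        "n_valid": len(valid),
        "minimum": min(valid),
        "maximum": max(valid),
        "mean": mean(valid) if continuous else None,  # mean only for continuous variables
        "modal_values": multimode(valid),             # a list, since ties are possible
    }

# Hypothetical data using the variable names and codes from the example grid.
bwgrms = [3200, 2850, 9999, 3400, 3200, 9999]
docly = [1, 2, 2, 7, 1, 2, 9]

bw_row = describe(bwgrms, missing_codes={9999}, continuous=True)
doc_row = describe(docly, missing_codes={7, 8, 9}, continuous=False)
print(bw_row)
print(doc_row)
```

The last two grid columns (reference value and consistency with the codebook) still have to be filled in by hand from external sources.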
Familiarizing yourself with the
concepts under study
• To identify plausible ranges of values for each of your
variables, read the literature on your dependent and
key independent variables.
• Read for
– how each concept is operationalized in the data set
– standards, cutoffs, or transformations commonly used for
that variable in your field
– range of values observed
• but pay attention to differences in who, when, where studied
Check each distribution against the
codebook for the original source
• Codebooks for some data sets provide information on
– frequency distribution of categorical variables
– range and/or mean values for continuous variables
– number of cases with missing values, by reason for missing
value (not applicable, refused, etc.)
• Check the distribution of values observed in your
analytic sample for each variable against the
codebook for your data set.
• If any distributions are inconsistent, do NOT analyze
the data until you have resolved the discrepancies!
Identify reasons for inconsistencies
• Review your answers to the previous steps in this
exercise to identify possible reasons for discrepancies
between your statistics and the codebook, such as:
– Units of analysis
• e.g., family instead of individual
– Restrictions on your analytic sample
• e.g., excluding a subgroup that is included in the statistics shown
in the overall codebook
– Scale
• e.g., grams instead of kilograms
– Transformations you have made to the variables, e.g.,
• logged values
• multiples of standard deviations rather than original units
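A scale mismatch such as grams versus kilograms is one discrepancy that can be screened for mechanically. The sketch below flags a mean that differs from the codebook value and tests whether a power-of-ten factor would reconcile them. The function names, the 10% tolerance, and the example means are assumptions for illustration, not recommended thresholds.

```python
def flag_discrepancy(observed, reference, tolerance=0.10):
    """True if observed differs from reference by more than the relative tolerance."""
    return abs(observed - reference) > tolerance * abs(reference)

def possible_scale_factor(observed, reference, factors=(1000, 100, 10)):
    """Return a factor f such that observed is close to reference * f, or None."""
    for f in factors:
        if not flag_discrepancy(observed, reference * f):
            return f          # observed is on a scale f times larger
        if not flag_discrepancy(observed * f, reference):
            return 1 / f      # observed is on a scale f times smaller
    return None

# Hypothetical: your mean is in grams, the codebook reports kilograms.
my_mean, codebook_mean = 3150, 3.2
if flag_discrepancy(my_mean, codebook_mean):
    print("Scale factor to investigate:", possible_scale_factor(my_mean, codebook_mean))
```

A suggested factor is only a lead to follow up in the documentation; differences in unit of analysis, sample restrictions, or transformations will not show up as a clean power of ten.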
Check each distribution against the
literature on similar variables
• Track down information in the published literature on
each of your main variables for a similar population.
• Check the distribution of values for each variable in
your data set against the values from the external
source of information about that variable.
• Again, if the values of your data are substantially
different from those used in other studies of the
same concepts, do NOT analyze the data until you
have resolved the discrepancies!
Identify reasons for inconsistencies
• Review your answers to the previous steps in this
assignment to explain possible reasons for
discrepancies between your data and other similar
data sets, such as:
– Population studied, e.g., substantially different time, place,
and/or subgroup
– Units of analysis, e.g., family instead of individual
– Units of measurement, e.g., metric instead of British units
– Scale, e.g., grams instead of kilograms
– Transformations of the variables, e.g., percentiles instead
of original value
Summary
• Before you conduct your analyses, it is critical that
you and other members of your research team
become familiar with the following for each variable
– Levels of measurement
– Units and categories
– Plausible and observed ranges of values
– Missing values and their reasons
• Compare observed values against
– Documentation for the data set you use in your analysis
– The published literature on your topic
Summary, continued
• These attributes are essential information for
– Data preparation
• Inclusion criteria for your analytic sample
• Creation of new variables
– Choice of pertinent descriptive and multivariate statistics
– Design of correct charts and tables
– Writing correct prose descriptions for the data and
methods and results sections of your paper.
• Even experienced researchers should complete this
assignment when they undertake a project with a
new topic or data set.
Reasons for getting to know your
variables, redux
• Exercises in this podcast are time-consuming but
very valuable for generating in-depth knowledge
needed for your paper
• Reading the literature on your topic will yield
information needed for the introduction, literature
review and discussion sections.
• Detailed knowledge of study design and variables
from documentation, questionnaire and codebook
will yield information for the data and methods and
results sections.
Resources on your topic
• Articles, books, reports, or web sites related to the
main independent and dependent variables in your
data
– Definitions of concepts under study
– Operationalization (how those concepts are actually
measured in a particular data set)
– Observed distributions in populations similar to those from
which your sample is drawn
– Commonly used transformations of those variables prior to
analysis
Resources on your data set
• Documentation on study design
– Context (who, when, where)
– Sampling
– Unit of analysis
• Questionnaire or other data collection instrument
– Modules
– Wording of questions
– Skip patterns
• Codebook
– Levels of measurement, categories, units
– Missing value codes
– Distribution of observed values for the study sample
Suggested readings
• Miller, J. E. 2013. The Chicago Guide to Writing about
Multivariate Analysis, 2nd Edition.
– chapter 4 on levels of measurement, units, standards and
cutoffs
– chapters 7 and 10 on choice of contrasts to suit the variable
– chapter 13 on data and methods
– chapters 4 and 13 on missing values and missing by design
• Chambliss, Daniel F., and Russell K. Schutt. 2012. Making
Sense of the Social World: Methods of Investigation, 4th
Edition. Thousand Oaks, CA: Sage Publications, or other
research methods book for information on
– study design, conceptualization, and measurement
Suggested online resources
• Podcasts on
– Reporting one number (re: units)
– Comparing two numbers or series of numbers (re:
levels of measurement)
– Defining the Goldilocks problem
Suggested practice exercises
• Study guide to The Chicago Guide to Writing about
Multivariate Analysis, 2nd Edition.
– Problem sets for
• chapter 4, questions #6 and 13
• chapter 10, questions #1 and 3
– Suggested course extensions for
• chapter 4
– “Reviewing” exercise #1
– “Estimating statistics and writing” exercises #1 and 2
• chapter 10
– “Reviewing” exercise #1
– “Estimating statistics and writing” exercises #1 and 2
– “Revising” exercises #1, 2 and 3
Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at
http://press.uchicago.edu/books/miller/multivariate/index.html