Social Data Analysis
Qualitative and Quantitative Approaches

MIKAILA MARIEL LEMONIK ARTHUR AND ROGER CLARK

KALEIGH POIRIER

PROVIDENCE, RI

Social Data Analysis by Mikaila Mariel Lemonik Arthur and Roger Clark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
Contents

Acknowledgements
How to Use This Book

Section I. Introducing Social Data Analysis

1. Introducing Social Data Analysis: An Overview (Roger Clark)
   Quantitative or Qualitative Data Analysis?

Section II. Quantitative Data Analysis

2. Preparing Quantitative Data and Data Management (Mikaila Mariel Lemonik Arthur)
   Data Cleaning & Working With Data

3. Univariate Analysis (Roger Clark)
   Univariate Analyses in Context
   A Word About Univariate Inferential Statistics

4. Bivariate Analyses: Crosstabulation (Roger Clark)
   Crosstabulation

5. Hypothesis Testing in Quantitative Research (Mikaila Mariel Lemonik Arthur)
   A Brief Review of Probability
   Null Hypothesis Significance Testing
   What Does Significance Testing Tell Us?

6. An In-Depth Look At Measures of Association (Mikaila Mariel Lemonik Arthur)
   General Interpretation of Measures of Association
   Details on Measures of Association

7. Multivariate Analysis (Roger Clark)
   A Word About Causation
   What Happens When You Control for a Variable and What Does it Mean?
   A Quick Word About Significance Levels

8. Correlation and Regression (Roger Clark)
   Dummy Variables
   Correlation Analysis
   Regression Analysis
   Multiple Regression

9. Presenting the Results of Quantitative Analysis (Mikaila Mariel Lemonik Arthur)
   Writing the Quantitative Paper
   Creating Effective Tables

Section III. Qualitative Data Analysis

10. The Qualitative Approach (Mikaila Mariel Lemonik Arthur)
    Types of Qualitative Data
    Paradigms of Research
    Inductive and Deductive Approaches
    Research Standards
    The Process of Qualitative Research

11. Preparing and Managing Qualitative Data (Mikaila Mariel Lemonik Arthur)
    Data Management
    Preparing Data
    Data Reduction
    Qualitative Data Analysis Software

12. Qualitative Coding (Mikaila Mariel Lemonik Arthur)
    Developing a Coding System
    The Process of Coding
    Coding and What Comes After
    Becoming a Coder

13. From Qualitative Data to Findings (Mikaila Mariel Lemonik Arthur)
    Theoretical Memos
    Data Displays
    Narrative Approaches
    Making Conclusions
    Testing Findings
    Thinking Like a Researcher

14. Presenting the Results of Qualitative Analysis (Mikaila Mariel Lemonik Arthur)
    Audience and Voice
    Making Data Come Alive
    The Genre of Research Writing
    Concluding Your Work

Section IV. Quantitative Data Analysis With SPSS

15. Quantitative Analysis with SPSS: Getting Started (Mikaila Mariel Lemonik Arthur)
    Importing Data Into SPSS
    Using SPSS
    Getting More Out of SPSS

16. Quantitative Analysis with SPSS: Univariate Analysis (Mikaila Mariel Lemonik Arthur)
    Producing Descriptive Statistics
    Graphs

17. Quantitative Analysis with SPSS: Data Management (Mikaila Mariel Lemonik Arthur)
    Working With Datasets
    Working With Variables

18. Quantitative Analysis with SPSS: Bivariate Crosstabs (Mikaila Mariel Lemonik Arthur)

19. Quantitative Analysis with SPSS: Multivariate Crosstabs (Mikaila Mariel Lemonik Arthur)

20. Quantitative Analysis with SPSS: Comparing Means (Mikaila Mariel Lemonik Arthur)
    Comparing Means
    T-Tests For Statistical Significance
    ANOVA

21. Quantitative Analysis with SPSS: Correlation (Mikaila Mariel Lemonik Arthur)
    Scatterplots
    Correlation
    Partial Correlation

22. Quantitative Analysis with SPSS: Bivariate Regression (Mikaila Mariel Lemonik Arthur)

23. Quantitative Analysis with SPSS: Multivariate Regression (Mikaila Mariel Lemonik Arthur)
    Dummy Variables
    Regression Modeling
    Notes on Advanced Regression

Section V. Qualitative and Mixed Methods Data Analysis with Dedoose

24. Qualitative Data Analysis with Dedoose: Data Management (Mikaila Mariel Lemonik Arthur)
    Getting Started With a New Project
    Working With Data
    Backing Up Your Data & Managing Dedoose

25. Qualitative Data Analysis with Dedoose: Coding (Mikaila Mariel Lemonik Arthur)
    The Code Tree
    Coding in Dedoose
    Working with Codes

26. Qualitative Data Analysis with Dedoose: Developing Findings (Mikaila Mariel Lemonik Arthur)
    Using the Analysis Tools
    Conducting and Concluding Analysis

Glossary

Modified GSS Codebook for the Data Used in this Text
    The General Social Survey
    The Codebook Guide to GSS Variables

Works Cited

About the Authors
Acknowledgements
Dragan Gill worked tirelessly to gain access to the platform that made this text possible and
provided technical and other support throughout the process of creating the text.
Kaleigh Poirier made an invaluable contribution to this book by translating many of the
formulas it contains into LaTeX code.
This text was made possible, in part, by a Rhode Island College Committee for Faculty
Scholarship Major Grant Award. Funding from Rhode Island College, the Rhode Island College Foundation, and the Rhode Island College Alumni Affairs Office provided essential editorial support.
Finally, thanks to the many students who have enrolled in Sociology 404 with Drs. Arthur
and Clark over the years. Their questions, struggles, and successes have shaped the material presented in this text in vital ways.
How to Use This Book
This book is divided into four parts:
1. A conceptual section on conducting quantitative data analysis
2. A conceptual section on conducting qualitative data analysis
3. A practical section on conducting quantitative data analysis using SPSS
4. A practical section on conducting qualitative data analysis using Dedoose
Each part can be used separately by those interested in developing the relevant skills.
Each chapter includes suggested exercises at the end of the chapter for those seeking
practice with the ideas, concepts, and skills introduced in the chapter. There is also a hyperlinked glossary of terms. Bibliographic information is available in a separate bibliography
rather than in each individual chapter.
For users who prefer to download or print the text, go to the text homepage and click
“Download This Book.” It is available in PDF and other formats for use in e-readers; those
who want to print can bring the download to a local print or office store. The text is
designed to be compatible with screen readers; where applicable, image descriptions for
software screenshots provide instructions for how to use keycodes to access key functions.
Should users discover any screenreader compatibility problems, they are welcome to email
Mikaila Mariel Lemonik Arthur to get them corrected. Note that while SPSS is basically
screenreader compatible (plugins may be required depending on a user’s specific system
configuration), Dedoose is not.
In addition, as an Open Educational Resources text, the authors encourage others to
develop equivalent practical sections using other software packages, like Atlas.ti, Nvivo,
Stata, SAS, R, and Excel. This project can be forked to add such sections, or those interested
in collaborating to incorporate new sections into this base text are welcome to reach out to
Mikaila Mariel Lemonik Arthur to discuss.
SECTION I
INTRODUCING SOCIAL DATA ANALYSIS
1. Introducing Social Data Analysis
An Overview
ROGER CLARK
Social data analysis enables you, as a researcher, to organize the facts you collect during
your research. Your data may have come from a questionnaire survey, a set of interviews, or
observations. They may be data that have been made available to you from some organization, national or international agency or other researchers. Whatever their source, social
data can be daunting to put together in a way that makes sense to you and others.
This book is meant to help you in your initial attempts to analyze data. In doing so it will
introduce you to ways that others have found useful in their attempts to organize data. You
might think of it as like a recipe book, a resource that you can refer to as you prepare data
for your own consumption and that of others. And, like a recipe book that teaches you to
prepare simple dishes, you may find this one pretty exciting. Analyzing data in a revealing
way is at least as rewarding, we’ve found, as it is to cook up a yummy cashew carrot paté or
a steaming corn chowder. We’d like to share our pleasure with you.
Quantitative or Qualitative Data Analysis?
Our book is divided into two parts. One part focuses on what researchers call quantitative
data analysis; the other, on qualitative data analysis. These two types of analysis are often
complementary: the same project can employ both of them. But for now we’d like to look
at the main distinction between the two. In general, quantitative data analysis focuses
on variables and/or the relationships among variables. This analysis involves the statistical
summary of such variables and those relationships. Roger recently completed a study, with
two students, of the relationship between Americans’ gender and their party affiliation
(Petit, Mellor and Clark, 2020). We were interested in the relationship between two variables: gender and party affiliation. Women are more likely to identify as Democrats in the
United States than men. We found that we could largely explain the emergence and maintenance of this relationship since the 1970s in terms of three other variables: the increased
participation in paid labor by women, the decreasing likelihood that both men and women
are married, and the declining participation of Americans in labor unions. To examine the
relationship among these five variables (gender, political affiliation, labor force participation, marital status and attachment to labor unions) we relied almost exclusively on quantitative, or statistical, analysis.
Qualitative data analysis, on the other hand, focuses on the interpretation of action
or the representation of meaning. Roger did another study with a student in which we
watched YouTube recordings of Trump and Clinton rallies during the 2016 presidential
campaign (Fernandez and Clark, 2019). A careful examination of these rallies led us to the
conclusion that Trump rallies typically looked more like quasi-religious events—with participants displaying quasi-sacred objects (like Make America Great Again caps), participating in quasi-religious rituals (like shouting rhythmically and in unison, “Lock Her Up”), and
cheering quasi-religious beliefs (like how valuable it would be to slow immigration)—than
Clinton rallies did. In this study, we were focused on both interpreting the actions of rally
participants and on trying to represent what they meant by those actions.
In practice, researchers often employ both quantitative and qualitative data analyses in the same study.[1] For instance, Roger recently did another project with students (Gauthier
et al. 2020), one of whose goals was to discern the decades (one variable) since the 1930s
in which children’s picture books were most likely to depict characters in gender stereotyped ways (another variable). Put this way, the study looks like one that required quantitative data analysis, because it was examining the relationship between two variables—the
time in which books were created and the degree to which they depicted characters in
gender stereotyped ways. But to discern whether an individual book used gender stereotypes, we had to interpret the actions and thoughts of individual characters in terms of a
number of characteristics we viewed as gender stereotyped. For instance, we had to decide
whether a character was nurturing (a stereotypically feminine characteristic) and whether
they seemed competitive (a stereotypically masculine characteristic). Such decisions are
essentially qualitative in their nature.
Consequently, the distinction we’ve used to organize this text—quantitative vs. qualitative
data analysis—is a little misleading. Researchers often employ both kinds of research in the
same project. Still, it is conventional for teachers to teach quantitative and qualitative analyses as if they were distinct and who are we to defy convention? Thus, this text includes both
chapters about quantitative data analyses and those about qualitative data analyses.
1. Roger is actually familiar with research that others have done. He’s getting on in years, though, and now mostly trusts himself not to misrepresent his own work.

Exercises

1. For this exercise we’d like you to use data from the General Social Survey (GSS), a survey which
has been executed about every other year (sometimes more frequently) since 1972. The GSS is a
nationally representative survey of American adults. What we’d like you to do is to use the data it
produces, made available for us to work with by the University of California, Berkeley, to check
whether men or women have been more likely to participate in the GSS over the years. We’ll be
using this source a lot in this book, so getting a feel for its use is worthwhile here.
The data are available at https://sda.berkeley.edu/ (you may have to copy and paste this address
to request the website). What we’d first like you to do is connect to this link, then go down to the
second full paragraph and click on the “SDA Archive” link you’ll find there. Then scroll down to the
section labeled “General Social Surveys” and click on the first link there: General Social Survey
(GSS) Cumulative Datafile 1972-2018 –release.
Now type “sex” into the “row” box and hit “run the table.” What percentage of GSS respondents
have been female? What kind of analysis—quantitative or qualitative—have you done? What
makes you say so?
2. Watch the first commercial (about the Toyota Highlander) in this YouTube recording of the 10 Best Super Bowl Commercials of 2020:

[Video: available online at https://pressbooks.ric.edu/socialdataanalysis/?p=76#oembed-1]
Which character in this commercial, would you say, is the main one? What one word, would you
say, sums up the personality of this character best? What kind of analysis—quantitative or qualitative—have you done? What makes you say so?
SECTION II
QUANTITATIVE DATA ANALYSIS
2. Preparing Quantitative Data and Data Management
MIKAILA MARIEL LEMONIK ARTHUR
The process of research design and data collection is beyond the scope of this book, but it
is worth spending some time on the steps required to get quantitative data ready for data
analysis. Social science researchers who are working with quantitative data may have collected that data themselves, or they may have obtained that data from another researcher
or from a data repository such as the General Social Survey, a national census bureau
or other government data source (e.g. the U.S. Census Bureau), or the Institute for Social
Research at the University of Michigan. Preparing data for analysis requires different steps
depending on the initial source and format of the data.
When a researcher has collected their own data, they need to enter that data into a
computer file in a machine-readable format. Some online survey software systems permit
survey data to be downloaded in an appropriate format, but not all do—and if data was
collected on paper or face-to-face, it needs additional processing. Typically, research teams
enter data into a spreadsheet program like Microsoft Excel or Google Sheets. But doing so
requires the creation of a codebook, or a document in which numerical codes are assigned
to all answer choices or data entry elements.
Figure 1 provides an example of what a codebook for survey data entry might look like, drawing on a survey a group of students created and administered as part of a research methods course. Each question is assigned a column, and each answer choice is assigned a numerical code, with a special code for missing or unusable data (often 9, 99, 999, or -1). Note that in circumstances where a survey question asked respondents to “check all that apply,” each answer choice must be converted into a separate question, with selected and not selected as the coded answer choices. This is one reason why downloaded survey data must often still be prepared for use, as survey software like Google Forms may not reliably process “check all that apply” questions or automatically convert multiple-choice questions to the type of numeric answers statistical software requires.

Figure 1. An Example of a Codebook
Figure 2 shows what completed data entry might look like; it is taken from the same
survey and shows the data after student survey-takers entered it into Excel. Each survey
response, coded text, or other unit of analysis in the quantitative project has its data
entered on a particular row. Note that without the codebook, it is not possible to understand the data displayed on the screen. When researchers perform data analysis directly
in spreadsheet software, they may need to rely on the codebook to convert data back and
forth from machine-readable (numerical) codes to human-language response categories.
However, when data is imported into statistical analysis software, codebook information
can be entered directly into the software, as will be discussed in the chapter Quantitative
Analysis with SPSS: Data Management.
Figure 2. Survey Data After Entry Into Excel
When obtaining data from elsewhere, many sites will provide the option of downloading
data in a variety of file formats. In that case, researchers should choose—if possible—the
appropriate file format for the software they are using, and should also download any codebook, readme, or help files that will explain the data and coding. Sometimes data is not
available in a given file format and will need to be converted or imported, which will be discussed in the chapter Quantitative Analysis with SPSS: Data Management.
Note that most statistical analysis software is not cloud-resident, so it is important that
researchers save their datasets after creating, importing, or modifying them; keep good
backups; and keep records of all tests and procedures run, modifications made, etc. during
the data analysis process.
Data Cleaning & Working With Data
Aside from preparing data for analysis, the other crucial step researchers need to take prior
to beginning their analysis is data cleaning. Data cleaning is the process of examining data
to find any errors, mistakes, duplications, corruptions, omissions, or other issues. Where
possible, researchers can correct these issues; in other cases, certain data may need to be
omitted from analysis.
Researchers may also need to modify variables or datasets in various ways. For example,
many studies involve the creation of an index variable, or a composite measure created by
combining information from multiple variables. For instance, a study might involve administering a self-esteem inventory consisting of a number of different multiple-choice questions getting at various elements of self-esteem. Then, researchers combine the answers
to all of these questions using a scoring system to create one variable representing the
score on the self-esteem index. In other cases, researchers need to reduce the number of
response categories a variable has or convert a continuous variable into an ordinal variable.
Or a researcher might be working with a dataset that includes respondents of all ages for a study interested only in 18-29 year olds, and thus may need to filter the dataset. As one
final example, researchers may have data from the same study stored in multiple spreadsheets and may need to combine or merge that data.
These are only a few examples of the tasks researchers face. The practical how-to of carrying out these tasks will be discussed in the chapter Quantitative Analysis with SPSS: Data
Management — but before trying to carry them out, researchers need to take the time to
think through their projects, determine which steps are necessary, and plan carefully.
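As a rough sketch of what these tasks look like in practice, here is an example using Python’s pandas library on an invented miniature dataset. The column names, scoring rule, and age cutoffs are all hypothetical, and SPSS accomplishes the same things through its own menus and syntax.

```python
import pandas as pd

# Invented example data: three self-esteem items scored 1-4, plus age.
df = pd.DataFrame({
    "esteem1": [3, 4, 2, 1],
    "esteem2": [4, 4, 1, 2],
    "esteem3": [3, 3, 2, 2],
    "age": [19, 45, 23, 61],
})

# Index variable: combine several items into one composite score.
df["esteem_index"] = df[["esteem1", "esteem2", "esteem3"]].sum(axis=1)

# Recode a continuous variable into ordinal categories.
df["age_group"] = pd.cut(df["age"], bins=[17, 29, 49, 120],
                         labels=["18-29", "30-49", "50+"])

# Filter the dataset down to the cases a study needs (here, 18-29 year olds).
young = df[df["age"].between(18, 29)]
print(young)

# Merging data stored in multiple tables would use a shared identifier, e.g.:
# combined = df.merge(other_df, on="respondent_id")
```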
Exercises
1. Write five basic multiple-choice survey questions (they do not have to be anything fancy–consider asking questions like age and favorite color). Create a codebook for your survey. Then, ask ten
people you know to answer the questions, without using survey software. Finally, enter the data
into Excel or another spreadsheet program of your choice, following your codebook.
2. Choose one of the data sources noted at the top of this chapter. Visit the website for the data
source and learn as much as you can about it, then write a paragraph summarizing how the data
is collected and what the data focuses on.
Media Attributions
• codebook-example-1 © Mikaila Mariel Lemonik Arthur, in conjunction with Fall 2015
students in Sociology 302 at Rhode Island College.
• survey data entry © Mikaila Mariel Lemonik Arthur, in conjunction with Fall 2015 students in Sociology 302 at Rhode Island College.
3. Univariate Analysis
ROGER CLARK
Univariate Analyses in Context
This chapter will introduce you to some of the ways researchers use statistics to organize
their presentation of individual variables. In Exercise 1 of Introducing Social Data Analysis,
you looked at one variable from the General Social Survey (GSS), “sex” or gender, and found
that about 54 percent of respondents over the years have been female while about 46 percent have been male. You in fact did an analysis of one variable, sex or gender, and hence
did an elementary univariate analysis.
Before we go further into your introduction to univariate analyses, we’d like to provide a
somewhat larger context for it. In doing so, we begin with a number of distinctions. One
distinction has to do with the number of variables that are involved in an individual analysis. In this book you’ll be exposed to three kinds of analysis: univariate, bivariate and mul­
tivariate analyses. Univariate analyses are ones that tell us something about one variable.
You did one of these when you discovered that there have been more female than male
respondents to the GSS over the years. Bivariate analyses, on the other hand, are analyses
that focus on the relationship between two variables. We have just used the GSS source
we guided you to (Thomas 2020/2021) to discover that over the years men have been much
more likely to work full time than women—roughly 63 percent of male respondents have
done so since 1972, while only about 40 percent of female respondents have. This finding results from a bivariate analysis of two variables: gender and work status. Multivariate
analyses, then, are ones that permit the examination of the relationship between two variables while investigating the role of other variables as well. Thus, for instance, when we look
at the relationship between gender and work status for White Americans and Black Americans separately, we are involving a third variable: race. For White Americans, the GSS tells
us, about 63 percent of males have held full time jobs over time, while only about 39 percent of females have done so. For Black Americans, the difference is smaller: 56 percent of
males have worked full time, while 44 percent of females have done so. We thus did a multivariate analysis, in which we examined the relationship between gender and work status,
while also examining the effect of race on that relationship.
Another important distinction is between descriptive and inferential statistics. This distinction calls into play another: that between samples and populations. Many times
researchers will use data that have been collected from a sample of subjects from a larger
population. A population is a group of cases about which researchers want to learn something. These cases don’t have to be people; they could be organizations, localities, fire or
police departments, or countries. But in the case of the GSS, the population of interest is in
fact people: all adults in the United States. Very often, it is impractical or undesirable for
researchers to gather information about every subject in the population. You can imagine
how much time and money it would cost for those who run the GSS, for instance, to contact every adult in the country. So what researchers settle for is information from samples
of the larger population. A sample is a number of cases drawn from a larger population.
In 2018, for instance, the organization that runs the GSS collected information on just over
2300 adult Americans.
Now we can address the distinction between descriptive and inferential statistics.
Descriptive statistics are statistics used to describe a sample. When we learned, for
instance, that the GSS reveals that about 63 percent of male respondents worked full time,
while about 40 percent of female respondents worked full time, we were getting a description of the sample of adult Americans who had ever participated in the GSS. (And you’d
be right if you added that this is a case of bivariate descriptive statistics, since the percentages describe the relationship between two variables in the sample—gender and work
status. You’re so smart!) Inferential statistics, on the other hand, are statistics that permit
researchers to make inferences about the larger populations from which the sample was
drawn. Without going into too much detail here about the requirements for using inferential statistics[1] or how they are calculated, we can tell you that our analysis generated statistics that suggested we’d be on solid ground if we inferred from our sample data that a
relationship between gender and work status not only exists in the sample, but also in the
larger population of American adults from which the sample was drawn.
In this chapter we will learn something about both univariate descriptive statistics (statistics that describe single variables in a sample) and univariate inferential statistics (statistics
that permit inferences about those variables in the larger population from which the sample was drawn).
Levels of Measurement of Variables
Now we can get down to basics. We’ve been throwing around the term variable as if it were
second nature to you. (If it is, that’s great. If not, here we go.) A variable is a characteristic
that can vary from one subject or case to another or for one case over time. In the case of
the GSS data we’ve presented so far, one variable characteristic has been gender or sex. A human adult responding to the GSS may indicate that they are male or female. (They could also identify with other genders, of course, but the GSS hasn’t permitted this so far.) Gender is a variable because it is a characteristic that can vary from one human to another. If we were studying countries, one variable characteristic that might be of interest is the size of the population. Variables, we said, can also vary for one subject over time. Thus, for instance, your age is in one category today, but will be in another next year and in yet another in two years.

1. Note that while you can use many statistical methods to analyze data about populations, there are some differences in how they are employed, as will be discussed later in this chapter.
The nature of the kinds of categories is crucial to the understanding of the kinds of statistical analysis that can be applied to them. Statisticians refer to these “kinds” of categories
as levels of measurement. There are four such levels or kinds of variables: nominal level
variables, ordinal level variables, interval level variables, and ratio level variables. And, as
you’ll see, the term “level” of measurement makes sense because each level requires that
an additional criterion is met for distinguishing it from the previous “level.” The most basic
level of measurement is that of the nominal level variable, or a variable whose categories
have names. (The word “nominal” has the Latin root nomen, or name.) We say the nominal level is the most basic because every variable is at least a nominal variable. The variable
“gender,” when it has the two categories, male and female, has categories that have names
and is therefore nominal. So is “religion,” when it has categories like Protestant, Catholic,
Jew, Muslim, and other. And so is the variable “age,” when it has categories from 1 and 2 to, potentially, infinity. Each one of its categories (1, 2, 3, etc.) has a name, even though the name
is a number. In other words, again, every variable is a nominal level variable. There are some
nominal level variables that have the special property of only consisting of two categories,
like yes and no or true and false. These variables are called binary variables (also known as
dichotomous variables).
To be an ordinal level variable, a variable must have categories that can be ordered in some
sensible way. (The word “ordinal” has the Latin root ordinalis, or order.) Said another way, an
ordinal level variable is a variable whose categories have names and whose categories can
be ordered in some sensible way. An example would be the variable “height,” when the categories are “tall,” “medium,” and “short.” Clearly these categories have names (tall, medium
and short), but they also can be ordered: tall implies more height than medium, which, in
turn, implies more height than short. The variable “gender” would not qualify as an ordinal
level variable, unless one were an inveterate sexist, thinking that one gender is somehow
a superior category to the others. Both nominal and ordinal level variables can be called
discrete variables, which means they are variables measured using categories rather than
numbers.
To be an interval level variable, a variable must be made up of adjacent categories that
are a standard distance from one another, typically as measured numerically. Fahrenheit
temperatures constitute an interval level variable because the difference between 78 and
79 degrees (1 degree) is seen as the same as the difference between 45 and 46 degrees.
But because all those categories (78 degrees, etc.) are named and can be ordered sensibly, it’s pretty easy to see that all interval level variables could be measured at the ordinal
level—even while not all nominal and ordinal level variables could be measured at the interval level.
Finally, we come to ratio level variables. Ratio variables are like interval level variables, but
with the addition of an absolute zero, a category that indicates the absence of the phenomenon in question. And while some interval level variables cannot be multiplied and divided,
ratio level variables can be. Age is an example of a ratio variable because the category, zero,
indicates a person or thing has no age at all (while, in contrast, “year of birth” in the calendar system used in the United States does not have an absolute zero, because the year
zero is not the absence of any years). But, while interval and ratio variables can be distinguished from each other, we are going to assert that, for the purposes of this book, they are
so similar that the distinction isn’t worth insisting upon. As a result, for practical purposes,
we will simply call all interval and ratio variables interval-ratio variables, or just interval variables. Both ratio and interval level variables can also be referred to as scale or continuous variables, as their (numerical) categories can be placed on a continuous scale.
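As a small illustration of the hierarchy, here is a sketch in Python (with invented variables) that records each variable’s level of measurement and exploits the fact that every variable also counts as every level below its own:

```python
# The levels form a hierarchy from least to most informative.
LEVELS = ["nominal", "ordinal", "interval-ratio"]

# Invented variables tagged with their levels of measurement.
variables = {
    "religion": "nominal",             # named categories only
    "height_category": "ordinal",      # named categories that can be ordered
    "age_in_years": "interval-ratio",  # ordered, evenly spaced, absolute zero
}

def measured_at_least(variable, level):
    """True if the variable sits at or above the given level in the hierarchy."""
    return LEVELS.index(variables[variable]) >= LEVELS.index(level)

print(measured_at_least("age_in_years", "nominal"))            # True
print(measured_at_least("height_category", "interval-ratio"))  # False
```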
But what are those practical purposes for which we need to know a variable’s level of
measurement? Let’s just see . . .
Measures of Central Tendency
Roger likes to say, “All statistics are designed with particular levels of measurement in mind.”[2] What’s this mean? Perhaps the easiest way to illustrate is to refer to what statisticians call “measures of central tendency” or what we laypersons call “averages.” You may
have already learned about three of these averages before: the mean, the median, and the
mode. But have you asked yourself why we need three measures of central tendency or
average?
The answer lies in the level of measure required by each kind of average. The mean
(which is what people most typically refer to when they use the term “average”), you may
recall, is the sum of all the categories (or values) in your sample divided by the number of
such categories (or values). Now, stop and think: what level of measurement (nominal, ordinal or interval) is required for you to calculate a mean?

If your answer was “interval,” you should give yourself a pat on the back.[3]

2. Besides the fact that he’s getting increasingly senile?
3. Something that’s increasingly difficult for Roger to do as he gets up in years.

You need a variable whose categories may legitimately be added to one another in order to calculate a
mean. You could do this with the variable “age,” whose categories were 0, 1, 2, 3, etc. But you
couldn’t, say, with “height,” if the only categories available to you were tall, medium, and
short (if you had actual height in inches or centimeters, of course, that would be a different
story).
But if your variable of interest were like that height variable—i.e., an ordinal level variable,
statisticians have cooked up another “average” or measure of central tendency just for you:
the median. The median is the middle category (or value) when all categories (or values) in
the sample are arranged in order. Let’s say your five subjects had heights that were classified as tall, short, tall, medium and tall. If you wanted to calculate the median, you’d first
arrange these in order as, for instance, short, medium, tall, tall and tall. You’d then pick the
one in the middle—i.e., tall—and that would be your median. Now, stop and think: could you
calculate the median of an interval level variable, like the age variable we just talked about?

If your answer was “yes,” you should give yourself a hearty slap on the knee.[4] The median
can be used to analyze an interval level variable, as well as ordinal level variables, because
all interval level variables are also ordinal. Right?
OK, you say, the mean has been designed to summarize interval level variables and the
median has been fashioned to handle ordinal level variables. “I’ll bet,” you say, “the mode
is for analyzing nominal level variables.” And you’re right! The mode is the category of a
variable in a sample that occurs most frequently. This can be calculated for nominal level
variables because nominal level variables, whatever else they have, have categories (with
names). Let’s say the four cars you were studying had the colors of blue, red, green and blue.
The mode would be blue, because it’s the category of colors that occurs most frequently.
Before you take these averages out for a spin, we’d like you to try another question. Can a
mode be calculated on an ordinal or an interval level variable?
If you answer “yes,” you should be very proud. Because you’ve probably seen that ordinal
and interval variables could also be treated like nominal level variables and therefore can
have modes. (That is, categories that occur most frequently). Note, though, that the mode
is unlikely to be a helpful measure in instances where continuous variables have many
possible numerical values, like annual income in dollars, because in these cases the mode
might just be some dollar amount made by three people in a sample where everyone else’s
income is unique.
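For a small demonstration, Python’s built-in statistics module will compute all three averages; the data below are invented. Note that the software will happily compute statistics that make no sense for a variable’s level of measurement; deciding which ones to use is your job.

```python
import statistics

colors = ["blue", "red", "green", "blue"]              # nominal: mode only
heights = ["short", "medium", "tall", "tall", "tall"]  # ordinal: median and mode
ages = [19, 20, 19, 21, 19]                            # interval: all three apply

print(statistics.mode(colors))   # 'blue'
print(statistics.mode(heights))  # 'tall'
print(statistics.mean(ages))     # 19.6
print(statistics.median(ages))   # 19
print(statistics.mode(ages))     # 19
# The median of an ordinal variable must respect the categories' substantive
# ordering (short < medium < tall), which plain sorting of strings won't give you.
```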
4. Unless you’ve got arthritis there like you know who.

Your Test Drive
Examine the following sample data for five students (A through E). Calculate as many of the measures of central tendency (or average) as you can for each of the three variables: religion, height and age. (See this footnote for the correct answer once you’re done.)[5]
Student   Religion     Height   Age
A         Catholic     Tall     19
B         Protestant   Short    20
C         Jewish       Medium   19
D         Catholic     Short    21
E         Catholic     Short    19
How do you know which measure of central tendency or average (mode, median or mean)
to use to describe a given variable in a report? The first rule is a negative: do NOT report a
measure that is not suitable for your variable’s level of measurement. Thus, you shouldn’t
report a mean for the religion or height variables in the “test drive” above, because neither
of them is an interval level variable.
You might well ask, “How could I possibly report a mean religion, given the data above?”
This is a good question and leads us to mention, in passing, that when researchers set
up computer files to help them analyze data, they will almost always code variable categories using numbers so that the computer can recognize them more easily. Coding is the
process of assigning observations to categories—and, for computer usage, this often means
changing the names of variables’ categories to numbers. Perhaps you recall doing Exercise
1 at the end of Introducing Social Data Analysis—the one that asked you to determine the
percentage of respondents who were female over the years (about 54 percent). Well, to set
up the computer to do this analysis, the folks who created the file (and who supplied us
with the data) coded males as 1 and females as 2. So the computer was faced with over
34,000 1s and 2s rather than with over 34,000 “males” and “females.” Computers like this
kind of help. But computers, while very good at computing, are often a little stupid when it comes to interpreting their computations.[6] So when I went in and asked the computer
to add just a few more statistics, including the mean, median and mode, about the sex or
gender of GSS respondents, it produced this table. (Don’t worry, I’ll show you how to produce a table like this in Exercise 3 of this chapter.)
5. The mode of religion is Catholic. No other average is applicable. The median of height is short, and so is the mode. The mean of height can’t be calculated. The mean age is 19.6. Its median is 19, as is its mode.
6. No offense to you, my faithful laptop, without which I couldn’t bring you, my readers, this cautionary tale.
18 | Univariate Analysis
Table 1: Univariate Statistics Associated with “Sex” in the GSS

Summary Statistics

Mean = 1.54        Std Dev = .50       Coef var = .32
Median = 2.00      Variance = .25      Min = 1.00
Mode = 2.00        Skewness = -.17     Max = 2.00
Sum = 99,993.48    Kurtosis = -1.97    Range = 1.00
What this table effectively and quickly tells us is that the mode of “sex” (really gender) is
2, meaning “female.” Part of your job as a social data analyst is to translate codes like this
back into English—and report that the mode, here, is “female,” not “2”. But another important part, and something the computer also cannot do, is recognizing the level of measure
of the variable concerned—in this case, nominal—and realize which of the reported statistics is relevant given that level. And in terms of “sex,” as reported in Table 1, only you can
know how silly it would be to report that the mean “sex” is 1.54 (notice the computer can’t see that silliness) or that its median is 2.00. When Roger was little,[7] Smokey the Bear used
to tell kids “Only YOU can prevent forest fires.” But Roger is here to tell you, “Only YOU can
prevent statistical reporting travesties.” So, again, you do not want to report statistics that
aren’t designed for the level of measure of your variables.
In general, though, when you ARE dealing with an interval variable, like age in years, you
really have three choices about which to report: the mean, the median and the mode. For
the moment, we’re going to recommend that, in such cases, you might consider that the
reading public is likely to be most familiar with the mean and, for that reason, you might
report the mean. (We’ll get to qualifications of that recommendation a little later.)
Variation
Measures of central tendency are often useful for summarizing variables, but they can sometimes be misleading. Roger just Googled[8] the average life expectancy for men in the
United States and discovered it was about 76.5 years. (Pretty clearly a mean, not a mode or median, right?) At this sitting, he is about 71.5 years old. Does this mean he has exactly 5 years left of life to live? Well, probably not. Given his health, educational level, etc., he’s likely to live considerably longer…unless COVID-19 gets him tomorrow. The point is that for life expectancy, as for other variables, there’s variation around the average. And sometimes knowing something about that variation is at least as important as the average itself—sometimes more important.

7. Many years ago.
8. In 2020.
We can learn a lot about a variable, for instance, simply by showing how its cases are distributed over its categories in a sample. Exercise 1 at the end of Introducing Social Data
Analysis actually told you the modal gender of respondents to the GSS survey. (“Modal” is
the adjectival form of mode.) Do you recall what that was? It was “female,” right? What this
tells you is that the “average” respondent over the years has been a female. But the mode,
being what it is, doesn’t tell you whether 100 percent of respondents were female or 50.1
percent were female. And that’s an important difference.
One of the most commonly used ways of showing variation is what’s called a frequency
distribution. A frequency distribution shows the number of times cases fall into each category in a sample. I’ve just called up the table you looked at in Exercise 1 of Introducing
Social Data Analysis and plunked it down here as Table 2. What this table shows is that
while about 35,179 females had participated in the GSS since 1972, 29,635 males had done so
as well. The table further tells us that while about 54 percent of the sample is female, about
46 percent has been male. The distribution has been much closer to 50-50 than 100-0. And
this extra information about the variable is a significant addition to the fact that modal “sex”
was female.
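Frequency distributions are easy to produce in code as well as by hand; here is a minimal sketch using Python’s collections.Counter on invented data with roughly the same split as the GSS sample:

```python
from collections import Counter

# Invented responses with roughly the GSS's 54/46 split.
responses = ["female"] * 543 + ["male"] * 457

counts = Counter(responses)
total = sum(counts.values())
for category, count in counts.items():
    print(f"{category}: {count} ({100 * count / total:.1f}%)")
# female: 543 (54.3%)
# male: 457 (45.7%)
```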
Table 2. The Frequency Distribution Associated with “Sex” in the GSS as of 2018

SEX (cells contain column percent and weighted N)

1: MALE       45.7     29,635.4
2: FEMALE     54.3     35,179.1
COL TOTAL     100.0    64,814.4
“Sex” is a nominal level variable, and frequency distributions have been designed for displaying the variation of nominal level variables. But, of course, because ordinal and interval
variables are also nominal level variables, frequency distributions can be used to describe
their variation as well. And this often makes sense with ordinal level variables. Thus, for
instance, we used a frequency distribution of respondents’ confidence in the military (“conarmy”) to show that there was relatively little variation in Americans’ confidence in that institution in 2018 (Table 3, below). Almost 61 percent of respondents said they had a “great deal
of confidence” in the military that year, while only about 39 percent said they had “only
some” or “hardly any” confidence. In other words, at least in comparison with the variation
in “sex,” variation in confidence in the military, which, after all, has three categories, seems
limited. In other words, this kind of confidence seems more concentrated in one category
(“great deal of confidence”) than you might expect.
Quiz at the End of the Paragraph: Can you see what the median and the mode of confidence in the military were?
Bonus Trick Question: What was its mean?
Table 3. The Frequency Distribution and Other Statistics Related to Americans’ Confidence in the Military, 2018 General Social Survey Data

CONARMY (Confidence in the U.S. Military; cells contain column percent and weighted N)

1: A GREAT DEAL    60.6     940.7
2: ONLY SOME       32.5     504.3
3: HARDLY ANY      7.0      108.0
COL TOTAL          100.0    1,553.0

Summary Statistics

Mean = 1.46       Std Dev = .62      Coef var = .43
Median = 1.00     Variance = .39     Min = 1.00
Mode = 1.00       Skewness = 1.00    Max = 3.00
Sum = 2,273.32    Kurtosis = -.06    Range = 2.00
Measures of Variation for Interval Level Variables
Looking at frequency distributions is a pretty good way of getting a sense of the variation in
nominal and ordinal variables. But it would be a fairly awkward way of doing so for interval
variables, many of which, if you think about it, would have many categories. (Can you imagine a frequency distribution for the variable “age” of respondents in the GSS?) Statisticians
have actually given us some pretty elegant ways of dealing with the description of variation
in interval variables and we’d now like to illustrate them with simple examples.
Roger’s daughter, Wendy, was a day care provider for several years and could report that
variation in the ages of preschool children made a tremendous difference in the kinds of
things you can do with them. Imagine, if you will, that you had two groups of four preschool
children, one of which had four 3-year-olds in it and one of which had two 5-year-olds and
two 1-year-olds. Can you calculate the mean age of each group?
If you found that the mean age of both groups was 3 years old, you did a fine job. Now,
if you were inclined to think that any two groups with the same mean age were likely to
be similar, think of these two from a day care provider’s point of view. Figuring out what to
do for a day with two 1-year-olds and two 5-year-olds would be a much more daunting task
than planning for four 3-year-olds. Wouldn’t it?
Statisticians have given us one particularly simple measure of spread or variation for
interval level variables: the range. The range is simply the highest category in your sample
minus the lowest category. For the group with four 3-year-olds, the range would be (3-3=)
zero years. There is no variation in age for this group. For the group with two 1-year-olds and
two 5-year-olds, the range would be (5-1=) four years. A substantial, and important, difference, especially if you, like Roger’s daughter, were a day care provider. Means don’t always
tell the whole story, do they?
Perhaps the more commonly used statistic for describing the variation or spread of an
interval level variable, however, is the standard deviation. The range only gives you a sense
of how spread out the extreme values or categories are in your sample. The standard deviation is a measure of variation that takes into account every value’s distance from the
sample mean. The usefulness of such a measure can be illustrated with another simple
example. Imagine, for instance, that your two groups of preschool children had the following ages: 1, 1, 5, 5, on the one hand, and 1, 3, 3, and 5, on the other.
The mean of these two groups is 3 years and the range is 4 years. But are they identical?
No. You may notice that each of the individual ages in the first group is a “distance” of 2
away from the mean of 3. (The two 1s are each 2 away from 3 and the two 5s are also 2 away
from 3.) So the average “distance” of each age from the mean is 2 for group 1. But that’s not
true for the second group. The 1 and the 5 are both 2 away from the mean of 3, but the two
3s are both no distance away. So the average distance of ages from the mean in this group
is something less than 2. Hence, the average distance of ages from the mean in the first
group is larger than the average distance in the second group. The standard deviation is a
way of capturing a difference like this—one that is not captured by the range.
It does this by using a formula that essentially adds up the individual squared “distances” of categories or values from the mean and then divides that number by the number of cases (minus one). We think of it as being very similar to the computation of the mean itself: a sum divided by the number of cases involved. The computational formula is:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{N - 1}}$

where

$s$ stands for the standard deviation
$\sqrt{\;}$ stands for the square root of the entire expression that follows
$\sum$ means to add up the sequence of numbers produced by the expression that follows
$x_i$ stands for each value or category in the sample
$\bar{x}$ stands for the sample mean
$N$ stands for the number of sample cases

The formula may look daunting, but it’s not very difficult to compute with just a few cases—and we’ll never ask you to use anything other than a computer to compute the standard deviation with more cases. Note that to calculate the standard deviation for an entire population, rather than a sample, we use $N$ rather than $N-1$ in the denominator. And also note that the quantity under the square root—$\sum (x_i - \bar{x})^2 / (N-1)$—is referred to as the variance.
Notice first that the formula asks you to compute the sample mean. For the second sample of ages above—the one with ages 1, 3, 3, 5—the mean is 3. It then asks you to take the
difference between each category in the sample and the mean and square the differences.
Univariate Analysis | 25
1-3, for instance, is -2 and its square is 4. 3-3 is 0 and its square is 0. And 5-3 is 2 and its square
is 4. The formula then asks you to add these squared values up: 4+0+0+4=8. Then it says to divide by the number of cases minus 1, which is 3: 8/3=2.67. It then asks you to take the square root of 2.67, or about 1.6. So the standard deviation of this sample is about 1.6 years.
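Worked in code, those same steps look like this; a short Python sketch using the ages 1, 3, 3, and 5 from the example (the built-in statistics module produces the same answer directly):

```python
import math
import statistics

ages = [1, 3, 3, 5]
mean = sum(ages) / len(ages)                    # 3.0
squared_devs = [(x - mean) ** 2 for x in ages]  # [4.0, 0.0, 0.0, 4.0]
variance = sum(squared_devs) / (len(ages) - 1)  # 8 / 3 = 2.67 (sample variance)
sd = math.sqrt(variance)                        # about 1.63

print(round(sd, 2))                      # 1.63
print(round(statistics.stdev(ages), 2))  # same result, computed for us
print(max(ages) - min(ages))             # the range: 4
```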
Can you calculate the standard deviation for the first sample of ages above: 1, 1, 5, 5?

Did you get 2.3? If so, give yourself another pat on the back.[9]
Measures of Deviation from the Normal Distribution
We’ve suggested that, other things being equal, the mean is a good way of describing
the central tendency or average of an interval level variable. But other things aren’t always
equal. The mean is an excellent measure of central tendency, for instance, when the interval
level variable conforms to what is called a normal distribution. A normal distribution of a
variable is one that is symmetrical and bell-shaped (otherwise called a bell curve), like the
one in Figure 2.1. This image suggests what is true when the distribution of a variable is normally distributed: that 68 percent of cases fall within one standard deviation on either side
of the mean; that 95 percent of the cases fall within two standard deviations on either side;
and that 99.7 percent of the cases fall within three standard deviations on either side. Note
that the symbol $\sigma$ (the Greek letter sigma) is used to indicate standard deviation in many statistical contexts.
Figure 2.1. The Normal Curve
9. You already know Roger can’t do this for himself.
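The 68-95-99.7 pattern can be checked numerically. Here is a sketch using Python’s built-in statistics.NormalDist (Python 3.8 or later), which models a normal curve:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # a standard normal curve

for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)  # proportion of cases within k standard deviations
    print(f"within {k} standard deviation(s): {share:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```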
One example that is frequently cited as a normally distributed variable is height. For American men, the average height in 2020 is about 69 inches,[10] where “average” here refers to
the mean, the median and the mode, if, in fact, height is normally distributed. The peak of
the curve (can you see it in your mind?) would be at 69 inches, which would be the most
frequently occurring category, the one in the middle of the distribution of categories and
the arithmetic mean.
But what happens when a variable is not normally distributed? We asked the Social Data
Archive to use GSS data from 2010 to tell us what the distribution of the number of children
respondents had looked like, and we got these results (see Table 4):
Table 4. Number of Children Reported by General Social Survey Respondents (2010)

Summary Statistics

Mean = 1.91       Std Dev = 1.73     Coef var = .91
Median = 2.00     Variance = 2.99    Min = .00
Mode = .00        Skewness = 1.05    Max = 8.00
Sum = 3,894.30    Kurtosis = 1.39    Range = 8.00
As you might have expected, the greatest number of respondents said they had zero, one
or two children. But then the number of children tails off pretty quickly as you get into categories that represent respondents with 3 or more children. This variable, then, is not normally distributed. Most of the cases are concentrated in the lowest categories. When an interval level variable looks like this, it is said to have right, or positive, skewness, and this is
reflected in the report that “number of children” has a skewness of positive 1.05. Skewness
refers to an asymmetry in a distribution in which a curve is distorted either to the left or the
right. The skewness statistic can take on values from negative infinity to positive infinity,
with positive values indicating right skewness (with “tails” to the right) and negative values
indicating left skewness (when “tails” are to the left). A skewness statistic of zero would indicate that a variable is perfectly symmetrical.
Our rule of thumb is that when the skewness statistic gets near to 1 or near -1, the variable
has more than enough skewness (either to the right or to the left) to be disqualified as
a normally distributed variable. And in such cases, it’s probably useful to report both the
mean and the median as measures of central tendency, since the relationship of the two will give some idea to readers of the nature of the variable’s skewness. If the median is greater than the mean (as it is in the case of “number of children”), it’s a sign that the author means to convey that the variable is right skewed. If it’s less than the mean, the implication is that it’s left skewed.

10. 5 feet, 9 inches.
Figure 2.2 Negative Skew
Kurtosis refers to how sharp the peak of a frequency distribution is. If the peak is too
pointed to be a normal curve, it is said to have positive kurtosis (or “leptokurtosis”). The kurtosis statistic of “number of children” is 1.39, indicating that the variable’s distribution has
positive kurtosis (or leptokurtosis). If the peak of a distribution is too flat to be normally distributed, it is said to have negative kurtosis (or platykurtosis), as seen in Figure 2.3.
Figure 2.3. Kurtosis
A rule of thumb for the kurtosis statistic: if it gets near to 1 or near -1, the variable has more than enough kurtosis (either positive or negative) to be disqualified as a normally distributed variable.
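To apply both rules of thumb in practice, one can compute the two statistics and flag variables that fall outside the cutoffs. Here is a sketch using the scipy library on invented data (note that scipy’s kurtosis function reports excess kurtosis, the same normal-curve-is-zero convention used in the tables above):

```python
from scipy.stats import skew, kurtosis

# Invented right-skewed data, loosely like "number of children".
children = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 8]

s = skew(children)
k = kurtosis(children)  # "excess" kurtosis: 0 for a normal curve

print(f"skewness = {s:.2f}, kurtosis = {k:.2f}")
if abs(s) >= 1 or abs(k) >= 1:
    print("Rule of thumb: too skewed or peaked to treat as normally distributed;")
    print("consider reporting both the mean and the median.")
```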
For a fascinating, personal lecture about the importance of being wary about reports
using only measures of central tendency or average (e.g., means and medians), however,
we encourage you to listen to the following talk by Stephen Jay Gould:
[Video: a talk by Stephen Jay Gould, available online at https://pressbooks.ric.edu/socialdataanalysis/?p=99#oembed-1]
A Word About Univariate Inferential Statistics
Up to this point, we’ve only talked about univariate descriptive statistics, or statistics that
describe one variable in a sample. When we learned that 54 percent of GSS respondents
over the years have been women, we were simply learning about the (large) sample of
people who have responded to the GSS over the years. And when we learned that the
mean number of children that respondents had in 2010 was about 1.9 and the median was
2.0, those too were descriptions of the sample that year. One of the purposes of sampling,
though, is that it can provide us some insight into the population from which the sample
was drawn. In order to make inferences about such populations from sample data we
need to use inferential statistics. Inferential statistics, as we said before, are statistics that
permit researchers to make inferences about the larger population from which a sample is
drawn.
We’ll be spending more time on inferential statistics in other chapters, but now we’d like
to introduce you to a statistical concept that frequently comes up in relation to political polls:
margin of error. To appreciate the concept of the margin of error, we need to understand
the difference between two important concepts: statistics and parameters. A statistic is
a description of a variable (or the relationship between variables) in a sample. The mean,
median, mode, range, standard deviation and skewness are all types of statistics. A parameter, on the other hand, is a description of a variable (or the relationship between variables)
in a population; many (but not all) of the same tools used as statistics when analyzing data
from samples can be used as parameters when analyzing data on populations. A margin of
error, then, is a suggestion of how far away from the actual population parameter a statistic
is likely to be. Thus political polling can tell you precisely what percentage of the sample say
they are going to vote for a candidate, but it can’t tell you precisely what percentage would
say the same thing in the larger population from which the sample was drawn.
BUT, when a sample is a probability sample of the larger population, we can estimate
how close the population percentage is likely to be to the sample percentage. A full discussion of the different kinds of samples is beyond the scope of this book, but let’s just say that
a probability sample is one that has been drawn to give every member of the population
a known (non-zero) chance of inclusion. Inferential statistics of all kinds assume that one is
dealing with a probability sample of the larger population to which one would like to generalize (though, sometimes, inferential statistics are calculated even when this fundamental assumption of inferential statistics has not been met).
Most frequently, a margin of error is a statement of the range around the sample percentage in which there is a 95 percent chance that the population percentage will fall. The
pre-election polls before the 2016 election are frequently criticized for how badly they got it
wrong when they predicted Hillary Clinton would get a higher percentage of the vote than
Donald Trump—and win the election. But in fact most of the national polls came remarkably close to predicting the election outcome perfectly. Thus, for instance, an ABC News/
Washington Post poll, collected between November 3rd and November 6th (two days before
the election), and involving a sample of 2,220, predicted that Clinton would get 49 percent
of the vote, plus or minus 2.5 percentage points (meaning that she’d likely get somewhere
between 46.5 percent and 51.5 percent of the vote), and that Trump would get 46 percent,
plus or minus 2.5 percentage points (meaning that he’d likely get somewhere between 43.5
percent and 48.5 percent of the vote). The margin of error in this poll, then, was plus or
minus 2.5 percentage points. And, in fact, Clinton won 48.5 percent of the actual vote (well
within the margin of error) and Trump won 46.4 percent (again, well within the margin
of error) (CNN Politics, 2020). This is just one poll that got the election precisely right with
respect to the total vote (if not the crucial electoral vote) count in advance of the election.
We haven't shown you how to calculate a margin of error here but, as you'll see in Exercise
4 at the end of the chapter, it is not hard to get a computer to spit one out. One thing to
keep in mind is that the size of a margin of error is a function of the size of the sample: the
larger the sample, the smaller the margin of error. In fact, all inferences using inferential statistics become more accurate as the sample size increases.
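If you're curious what the computer is doing when it "spits out" a margin of error, here is a minimal sketch in Python (a language this book doesn't otherwise use) of the usual normal-approximation formula for a proportion. The 1.96 multiplier corresponds to 95 percent confidence, and the numbers come from the ABC News/Washington Post poll described above.

```python
import math

def margin_of_error(p, n, z=1.96):
    """95 percent margin of error for a sample proportion:
    z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

# Clinton's 49 percent in a sample of 2,220:
moe = margin_of_error(0.49, 2220)
print(f"+/- {moe * 100:.1f} percentage points")  # about +/- 2.1
```

Note that this simple formula yields about 2.1 points rather than the 2.5 the pollsters reported; published polls typically report somewhat larger margins because they also account for weighting and survey design.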
So, welcome to the world of univariate statistics! Now let’s try some exercises to see how
they work.
11. And we hope you'll always say "naughty, naughty," when you know this has been done.

Exercises
1. Which of the measures of central tendency has been designed for nominal level variables? For ordinal level variables? For interval level variables? Why can all three measures be applied to interval level variables?

2. Which way of showing the variation of nominal and ordinal level variables have we examined in this chapter? What measures of variation for interval level variables have we encountered?

3. Return to the Social Data Archive we explored in Exercise 1 of Introducing Social Data Analysis. The data, again, are available at https://sda.berkeley.edu/. Again, go down to the second full paragraph and click on the "SDA Archive" link you'll find there. Then scroll down to the section labeled "General Social Surveys" and click on the first link there: General Social Survey (GSS) Cumulative Datafile 1972-2018 release. Now type "religion" in the row box, hit "output options," click on "summary statistics," then click on "run the table." See if you can answer these questions:
▪ What level of measurement best characterizes "religion"? What is this variable measuring?
▪ What's the only measure of central tendency you can report for "religion"? Report this measure, in English, not as a number.
▪ What's a good way you can describe "religion"'s variation? Describe its variation.
Now type "happy" in the row box, hit "output options," click on "summary statistics," then click on "run the table." See if you can answer these questions:
▪ What level of measurement best characterizes "happy"? What is this variable measuring?
▪ What are the only measures of central tendency you can report for "happy"? Report these measures, in English, not as numbers.
▪ What's a good way you can use to describe "happy"'s variation? Describe its variation.
Now type "age" in the row box, hit "output options," click on "summary statistics," then click on "run the table." See if you can answer these questions:
▪ What level of measurement best describes "age"? What is this variable measuring?
▪ What are all the measures of central tendency you could report for "age"? Report these measures, in English, not simply as numbers.
▪ What are two good statistics for describing "age"'s variation? Describe its variation.
▪ Is it your sense that "age" is essentially normally distributed? Why or why not? (What statistics did you check for this?)
4. Return to the Social Data Archive. The data, again, are available at https://sda.berkeley.edu/ (you may have to copy and paste this address into your browser). Again, go down to the second full paragraph and click on the "SDA Archive" link you'll find there. Then scroll down to the section labeled "American National Election Studies (ANES)" and click on the first link there: American National Election Study (ANES) 2016. These data come from a survey done after the 2016 election. Type "Trumpvote" in the row box, hit "output options," and hit "confidence intervals," then hit "run the table." What percentage of respondents, after the election, said they had voted for Trump? What was the "95 percent" confidence interval for this percentage? Check the end of this chapter for the actual percentage of the vote that Trump got. Does it fall within this interval?
Media Attributions
• Standard_deviation_diagram.svg © M. W. Toews is licensed under a CC BY (Attribution)
license
• Negative Skew © Diva Dugar adapted by Roger Clark is licensed under a CC BY-SA
(Attribution ShareAlike) license
• Kurtosis © Mikaila Mariel Lemonik Arthur
4. Bivariate Analyses: Crosstabulation
Crosstabulation
ROGER CLARK
In most research projects involving variables, researchers do indeed investigate the central
tendency and variation of important variables, and such investigations can be very revealing. But the typical researcher, using quantitative data analysis, is interested in testing
hypotheses or answering research questions that involve at least two variables. A relationship is said to exist between two variables when certain categories of one variable are associated, or go together with, certain categories of the other variable. Thus, for example, one
might expect that in any given sample of men and women (assume, for the purposes of
this discussion, that the sample leaves out nonbinary folks), men would tend to be taller
than women. If this turned out to be true, one would have shown that there is a relationship
between gender and height.
But before we go further, we need to make a couple of distinctions. One crucial distinction is that between an independent variable and a dependent variable. An independent variable is a variable a researcher suspects may affect or influence another variable.
A dependent variable, on the other hand, is a variable that a researcher suspects may be
affected or influenced by (or dependent upon) another variable. In the example of the previous paragraph, gender is the variable that is expected to affect or influence height and
is therefore the independent variable. Height is the variable that is expected to be affected
or influenced by gender and is therefore the dependent variable. Any time one states an
expected relationship between two (or more) variables, one is stating a hypothesis. The
hypothesis stated in the second-to-last sentence of the previous paragraph is that men will
tend to be taller than women. We can map two-variable hypotheses in the following way
(Figure 3.1):
Figure 3.1: A Mapping of the Hypothesis That Men Will Tend To Be Taller Than Women
When mapping a hypothesis, we normally put the variable we think to be affecting the
other variable on the left and the variable we expect to be affected on the right and then
draw arrows between the categories of the first variable and the categories of the second
that we expect to be connected.
Quiz at the End of The Paragraph
Read the following report by Annie Lowrey about a study done by two researchers, Kearney and
Levine. What is the main hypothesis, or at least the main finding, of Kearney and Levine’s study on the
effects of Watching 16 and Pregnant on adolescent women? How might you map this hypothesis (or
finding)?
https://www.nytimes.com/2014/01/13/business/media/mtvs-16-and-pregnant-derided-by-some-may-resonate-as-a-cautionary-tale.html
We’d like to say a couple of things about what we think Kearney and Levine’s major hypothesis was and then introduce you to a way you might analyze data collected to test the
hypothesis. Kearney and Levine’s basic hypothesis is that adolescent women who watched
16 and Pregnant were less likely to become pregnant than women who did not watch it.
They find some evidence not only to support this basic hypothesis but also to support the
idea that the ones who watched the show were less likely to get pregnant because they
were more likely to seek information about contraception (and presumably to use it) than
others. Your map of the basic hypothesis, at least as it applied to individual adolescent
women, might look like this:
Figure 3.2: A Mapping of Kearney and Levine's Hypothesis
Let's look at a way of showing a relationship between two nominal level variables: crosstabulation. Crosstabulation is the process of making a bivariate table for nominal level variables to show their relationship. But how does crosstabulation work?
Suppose you collected data from 8 adolescent women and the data looked like this:
Table 3.1: Data from Hypothetical Sample A

Person      Watched 16 and Pregnant    Got Pregnant
Person 1    yes                        no
Person 2    yes                        no
Person 3    yes                        no
Person 4    yes                        yes
Person 5    no                         yes
Person 6    no                         yes
Person 7    no                         yes
Person 8    no                         no
Quick Check: What percentage of those who have watched 16 and Pregnant in the sample have
become pregnant? What percentage of those who have NOT watched 16 and Pregnant have
become pregnant?
If you found that 25 percent of those who had watched the show became pregnant, while
75 percent of those who had not watched it did so, you have essentially done a crosstabulation in your head. But here’s how you can do it more formally and more generally.
First you need to take note of the number of categories in your independent variable (for
“Watched 16 and Pregnant” it was 2: Yes and No). Then note the number of categories in
your dependent variable (for "Got Pregnant" it was also 2: again, Yes and No). Now you prepare a "2 by 2" table like the one in Table 3.2, labeling the columns with the categories of the independent variable and the rows with the categories of the dependent variable.
Then decide where the first case should be put, as we’ve done, by determining which cell
is where its appropriate row and column “cross.” We’ve “crosstabulated” Person 1’s data by
putting a mark in the box where the “Yes” for “watched” and the “No” for “got pregnant”
cross.
Table 3.2. Crosstabulating Person 1's Data from Table 3.1 Above

                      Watched 16 and Pregnant
                      Yes    No
Got Pregnant   Yes
               No     I
We’ve “crosstabulated” the first case for you. Can you crosstabulate the other seven cases?
We’re going to call the cell in the upper left corner of the table cell “A,” the one in the upper
right, cell “B,” the one in the lower left, cell “C,” and the one in the lower right, cell “D.” If
you’ve finished your crosstabulation and had one case in cell A, 3 in cell B, 3 in cell C, and 1
in cell D, you’ve done great!
In order to interpret and understand the meaning of your crosstabulation, you need to
take one more step, and that is converting those tally marks to percentages. To do this, you
add up all the tally marks in each column, and then you determine what percentage of the
1. If one of your variables had three categories, it might be a “2 by 3” table. If both variables had 3 categories,
you’d want a 3 by 3 table, etc.
column total is found in each cell in that column. You'll see what that looks like in Table 3.3
below.
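If you'd like to see the same bookkeeping done by machine, here is a minimal sketch using Python's pandas library (not software this book otherwise teaches); the eight cases are those of Table 3.1.

```python
import pandas as pd

# The eight hypothetical cases from Table 3.1.
data = pd.DataFrame({
    "watched":  ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "pregnant": ["no", "no", "no", "yes", "yes", "yes", "yes", "no"],
})

# Raw cell counts: rows are the dependent variable, columns the independent.
print(pd.crosstab(data["pregnant"], data["watched"]))

# Column percentages: "percentaging in the direction of the independent variable."
print(pd.crosstab(data["pregnant"], data["watched"], normalize="columns") * 100)
```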
Direction of the Relationship
Now, there are three characteristics of a crosstabulated relationship that researchers are
often interested in: its direction, its strength, and its generalizability. We’ll define each of
these in turn, and as we come to it. The direction of a relationship refers to how categories
of the independent variable are related to categories of the dependent variable. There are
two steps involved in working out the direction of a crosstabulated relationship… and these
are almost indecipherable until you’ve seen it done:
1. Percentage in the direction of the independent variable.
2. Compare percentages in one category of the dependent variable.
The first step actually involves three substeps. First you change the tally marks to numbers. Thus, in the example above, cell A would get a 1, B, a 3, C, a 3, and D, a 1. Second, you’d
add up all the numbers in each category of the independent variable and put the total
at the end of that column. Third, you would calculate the percentage
of that total that falls into each cell along that column (as noted above). Once you’d done all
that with the data we gave you above, you should get a table that looks like this (Table 3.3):
Table 3.3. Crosstabulation of Our Imaginary Data from a 16 and Pregnant Study

                      Watched 16 and Pregnant
                      Yes         No
Got Pregnant   Yes    1 (25%)     3 (75%)
               No     3 (75%)     1 (25%)
Total                 4 (100%)    4 (100%)
Step 2 in determining the direction of a crosstabulated relationship involves comparing
percentages in one category of the dependent variable. When we look at the “yes” category, we find that 25% of those who watched the show got pregnant, while 75% of those
who did NOT watch the show got pregnant. Turning this from a percentage comparison
to plain English, this crosstabulation would have shown us that those who did watch the
show were less likely to get pregnant than those who did not. And that is the direction of
the relationship.
Note: because we are designing our crosstabulations to have the independent variable in
the columns, one of the simplest ways to look at the direction or nature of the relationship is
to compare the percentages across the rows. Whenever you look at a crosstabulation, start
by making sure you know which is the independent and which is the dependent variable
and comparing the percentages accordingly.
Strength of the Relationship
When we deal with the strength of a relationship, we’re dealing with the question of how
reliably we can predict a sample member's value or category on the dependent variable
based on knowledge of that member's value or category on the independent variable,
just knowing the direction of the relationship. Thus, for the table above, it’s clear that if you
knew that a person had watched 16 and Pregnant and you guessed she’d not gotten pregnant, you’d have a 75% (3 out of 4) chance of being correct; if you knew she hadn’t watched,
and you guessed she had gotten pregnant, you’d have a 75% (3 out of 4) chance of being
correct. Knowing the direction of this relationship would greatly improve your chances of
making good guesses…but they wouldn’t necessarily be perfect all the time.
There are several measures of the strength of association and, if they’ve been designed
for nominal level variables, they all vary between 0 and 1. When one of the measures is
0.00, it indicates that knowing a value of the independent variable won’t help you at all
in guessing what a value of the dependent variable will be. When one of these measures
is 1.00, it indicates that knowing a value of the independent variable and the direction of
the relationship, you could make perfect guesses all the time. One of the simplest of these
measures of strength, which can only be used when you have 2 categories in both the independent and dependent variables, is the absolute value of Yule’s Q. Because the “absolute
value of Yule’s Q” is so relatively easy to compute, we will be using it a lot from now on, and
it is the one formula in this book we would like you to learn by heart. We will be referring to
it simply as |Yule’s Q|—note that the “|” symbols on both sides of the ‘Yule’s Q’ are asking us
to take whatever Yule’s Q computes to be and turn it into a positive number (its absolute
value). So here's the formula for Yule's Q, using the cell labels A, B, C, and D defined above:

Yule's Q = (A×D − B×C) / (A×D + B×C)

For the crosstabulation of Table 3.3,

|Yule's Q| = |(1×1 − 3×3) / (1×1 + 3×3)| = |−8/10| = 0.80

In other words, the |Yule's Q| is 0.80, much closer to the upper limit of |Yule's Q| (1.00) than it
is to its lower limit (0.00). So the relationship is very strong, indicating, as we already knew,
that, given knowledge of the direction of the relationship, we could make a pretty good
guess about what value on the dependent variable a case would have if we knew what
value on the independent variable it had.
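For readers who like to check a formula by machine, here is a minimal sketch of the Yule's Q computation in Python, using the A, B, C, D cell labels defined above.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2 x 2 table with cells A (upper left), B (upper right),
    C (lower left), and D (lower right)."""
    return (a * d - b * c) / (a * d + b * c)

# The cells of Table 3.3: A = 1, B = 3, C = 3, D = 1.
print(abs(yules_q(1, 3, 3, 1)))  # 0.8, the |Yule's Q| of 0.80 reported above
```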
Practice Exercise
Suppose you took three samples of four adolescent women apiece and obtained the following data
on the 16 and Pregnant topic:
Sample 1               Sample 2               Sample 3
Watched   Pregnant     Watched   Pregnant     Watched   Pregnant
Yes       No           Yes       No           Yes       Yes
Yes       No           Yes       Yes          Yes       Yes
No        Yes          No        Yes          No        No
No        Yes          No        No           No        No
See if you can determine both the direction and strength of the relationship between having
watched "16 and Pregnant" and becoming pregnant in each of these imaginary samples. In what ways does each sample, other than sample size, differ from Sample A above? Answers are to be found in the footnote.
Roger now wants to share with you a discovery he made using crosstabulation, after analyzing some data that two of his now-graduated students, Angela Leonardo and Alyssa Pollard, collected. At the time of this writing, they had just coded their first night of TV commercials, looking for the gender of the authoritative "voice-over"—the disembodied voice that tells viewers key stuff about the product. It's been generally found in gender studies that these voice-overs are overwhelmingly male (e.g., O'Donnell and O'Donnell, 1978; Lovdal, 1989; Bartsch et al., 2001), even though the percentage of such voice-overs that were male had dropped from just over 90 percent in the 1970s and 1980s to just over 70 percent in 1998. We will be looking at considerably more data, but so far things are so interesting that Roger wants to share them with you…and you're now sophisticated enough about
2. Answers: In Sample 1, the direction of the relationship is the same as it was in Sample A (those who watched the show were less likely to get pregnant than those who didn't), but its strength is greater (|Yule's Q| = 1.00, rather than 0.80). In Sample 2, there is no direction to the relationship (those who watched the show were just as likely to get pregnant as those who didn't) and its strength is as weak as it could be (Yule's Q = 0.00). In Sample 3, the direction of the relationship is the opposite of what it was in Sample A. In this case, those who watched the show were more likely to get pregnant than those who didn't. And the strength of the relationship was as strong as it could be (|Yule's Q| = 1.00).
crosstabs (shorthand for crosstabulations) to appreciate them. Thus, Table 3.4 suggests that
things have changed a great deal. In fact the direction of the relationship between the time
period of the commercials and the gender of the voice-over is clearly that more recent commercials are much more likely to have a female voice-over than older ones. While only 29
percent of commercials in 1998 had a female voice-over, 71 percent in 2020 did so. And a
Yule’s Q of .72 indicates that the relationship is very strong.
Table 3.4. Crosstabulation of Year of Commercial and Gender of the Voice-Over

                          Year of Commercial
Gender of Voice-Over      1998          2020
Male                      432 (71%)     14 (29%)
Female                    177 (29%)     35 (71%)

Notes: |Yule's Q| = 0.72; 1998 data from Bartsch et al., 2001.
Yule's Q, while relatively easy to calculate, has a couple of notable limitations. One is that if
one of the four cells in a 2 x 2 table (a table based on an independent variable with 2 categories and a dependent variable with 2 categories) has no cases, the calculated |Yule's Q| will
be 1.00, even if the relationship isn't anywhere near that strong. (Why don't you try it with a
sample that has 5 cases in cell A, 5 in cell B, 5 in cell C, and 0 in cell D?)
Another problem with Yule's Q is that it can only be used to describe 2 x 2 tables. But not
all variables have just 2 categories. As a consequence, there are several other measures of
strength of association for nominal level variables that can handle bigger tables. (One that
we recommend for sheep farmers is lambda. Bahhh!) But, we most typically use one called
Cramer’s V, which shares with Yule’s Q (and lambda) the property of varying between 0 and
1. Roger normally advises students that values of Cramer's V between 0.00 and 0.10 suggest that the relationship is weak; between 0.11 and 0.30, that the relationship is moderately strong; between 0.31 and 0.59, that the relationship is strong; and between 0.60
and 1.00, that the relationship is very strong. Associations (a fancy word for the strength of
the relationship) above 0.59 are not so common in social science research.
An example of the use of Cramer’s V? Roger used statistical software called the Statistical Package for the Social Sciences (SPSS) to analyze the data Angela, Alyssa and he collected about commercials (on one night) to see whether men or women, both or neither,
were more likely to appear as the main characters in commercials focused on domestic
goods (goods used inside the home) and non-domestic goods (goods used outside the
home). Who (men or women or both) would you expect to be such (main) characters in
commercials involving domestic products? Non-domestic products? If you guessed that
females might be the major characters in commercials for domestic products (e.g., food,
laundry detergent, and home remedies) and males might be major characters in com-
mercials for non-domestic products (e.g., cars, trucks, cameras), your guesses would be
consistent with findings of previous researchers (e.g., O’Donnell and O’Donnell, 1978; Lovdal, 1989; Bartsch et al., 2001). The data we collected on our first night of data collection
suggest some support for these findings (and your expectations), but also some support
for another viewpoint. Table 3.5, for instance, shows that women were, in fact, the main
characters in about 48 percent of commercials for domestic products, while they were the
main characters in only about 13 percent of commercials for non-domestic products. So
far, so good. But males, too, were more likely to be main characters in commercials for
domestic products (they were these characters about 24 percent of the time) than they
were in commercials for non-domestic products (for which they were the main character
only about 4 percent of the time). So who were the main product “representatives” for
non-domestic commercials? We found that in these commercials at least one man and
one woman were together the main characters about 50 percent of the time, while men
and women together were the main characters only about 18 percent of the time in
commercials for domestic products.
But the analysis involving gender of main character and whether products were domestic or non-domestic involved more than a 2 x 2 table. In fact, it involved a 2 x 4 table
because our dependent variable, gender of main character, had four categories: female,
male, both, and neither. Consequently, we couldn’t use Yule’s Q as a measure of strength
of association. But we could ask, and did ask (using SPSS), for Cramer’s V, which turned
out to be about 0.53, suggesting (if you re-examine Roger’s advice above) that the relationship is a strong one.
Table 3.5. Crosstabulation of Type of Commercial and Gender of Main Character

                            Type of Commercial
Gender of Main Character    For Domestic Product    For Non-Domestic Product
Female                      18 (47.4%)              3 (12.5%)
Male                        9 (23.7%)               1 (4.2%)
Both                        7 (18.4%)               12 (50.0%)
Neither                     4 (10.5%)               8 (33.3%)

Notes: Cramer's V = 0.53
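For readers curious how SPSS arrives at a number like 0.53, here is a minimal sketch in Python using the scipy library (software this book doesn't otherwise assume). It leans on the chi-square statistic introduced in the next section; the counts are those of Table 3.5, and the "minus 1" term uses the smaller of the table's two dimensions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 3.5.
# Rows: female, male, both, neither; columns: domestic, non-domestic.
observed = np.array([
    [18, 3],
    [9, 1],
    [7, 12],
    [4, 8],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
n = observed.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
print(round(chi2_stat, 1), round(cramers_v, 2))  # about 17.5 and 0.53
```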
Generalizability of the Relationship
When we speak of the generalizability of a relationship, we’re dealing with the question
of whether something like the relationship (in direction, if not strength) that is found in
the sample can be safely generalized to the larger population from which the sample was
drawn. If, for instance, we drew a probability sample of eight adolescent women like the
ones we pretended to draw in the first example above, we’d know we have a sample in
which a strong relationship existed between watching “16 and Pregnant” and not becoming pregnant. But how could one tell that this sample relationship was likely to be representative of the true relationship in the larger population?
If you recall the distinction we drew between descriptive and inferential statistics in the
Chapter on Univariate Analysis, you won’t be surprised to learn that we are now entering
the realm of inferential statistics for bivariate relationships. When we use percentage comparisons within one category of the dependent variable to determine the direction of a
relationship and measures like Yule’s Q and Cramer’s V to get at its strength, we’re using
descriptive statistics—ones that describe the relationship in the sample. But when we talk
about Pearson’s chi-square (or Χ²), we’re referring to an inferential statistic—one that can
help us determine whether we can generalize that something like the relationship in the
sample exists in the larger population from which the sample was drawn.
But, before we learn how to calculate and interpret Pearson’s chi-square, let’s get a feel
for the logic of this inferential statistic first. Scientists generally, and social scientists in particular, are very nervous about inferring that a relationship exists in the larger population
when it really doesn’t exist there. This kind of error—the one you’d make if you inferred that
a relationship existed in the larger population when it didn't really exist there—has a special name: a Type 1 error. Social scientists are so anxious about making Type 1 errors that
they want to keep the chances of making them very low, but not impossibly low. If they
made them impossibly low, then they’d risk making the opposite of a Type 1 error: a Type
2 error—the kind of error you’d make when you failed to infer that a relationship existed
in the larger population when it really did exist there. The chances, or probability, of something happening can vary from 0.00 (when there’s no chance at all of it happening) to 1.00,
when there’s a perfect chance that it will happen. In general, social scientists aim to keep
the chances of making a Type 1 error below .05, or below a 1 in 20 chance. They thus aim for
a very small, but not impossibly small, chance of making the inference that a relationship
exists in the larger population when it doesn’t really exist there.
Karl Pearson, the statistician whose name is associated with Pearson's chi-square, studied the statistic's properties in about 1900. He found, among other things, that crosstabulations of different sizes (i.e., different numbers of cells) required a different chi-square to be associated with a .05 chance, or probability (p), of making a Type 1 error or less. As the number of cells increases, the required chi-square increases as well. For a 2 x 2 table, the critical chi-square is 3.84 (that is, the computed chi-square value should be 3.84 or more for you to infer that a relationship exists in the larger population with only a .05 chance, or less, of being wrong); for a 2 x 3 table, the critical chi-square is 5.99, and so on. Before we were able to use statistical processing software like SPSS, researchers relied on tables that outlined the critical values of chi-square for different size tables (degrees of freedom, to be discussed below) and different probabilities of making a Type 1 error. A truncated (shortened) version of such a table can be seen in Table 3.6.

Figure 3.3: Karl Pearson in 1910
Table 3.6: Table of Critical Values of the Chi-Square Distribution

Probability less than the critical value
Degrees of Freedom    0.90      0.95      0.975     0.999
1                     2.706     3.841     5.024     10.828
2                     4.605     5.991     7.378     13.816
3                     6.251     7.815     9.348     16.266
4                     7.779     9.488     11.143    18.467
5                     9.236     11.070    12.833    20.515
6                     10.645    12.592    14.449    22.458
7                     12.017    14.067    16.013    24.322
And so on…
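These days you don't need a printed table of critical values; statistical libraries will generate them. Here is a minimal sketch using Python's scipy library, which reproduces the rows of Table 3.6 (chi2.ppf returns the critical value for a given cumulative probability and degrees of freedom).

```python
from scipy.stats import chi2

# Critical value for a 2 x 2 table (1 degree of freedom) at the .05 level:
print(round(chi2.ppf(0.95, df=1), 3))  # 3.841

# Regenerate the body of Table 3.6:
for df in range(1, 8):
    row = [round(chi2.ppf(p, df), 3) for p in (0.90, 0.95, 0.975, 0.999)]
    print(df, row)
```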
Now you're ready to see how to calculate chi-square. The formula for chi-square (Χ²) is:

Χ² = Σ (O − E)² / E

where O is the observed number of cases in a cell and E is the number of cases you would expect in that cell if there were no relationship between the variables.
Let’s see how this would work with the example of the imaginary data in Table 3.3. This
table, if you recall, looked (mostly) like this:
Table 3.7. (Slightly Revised) Crosstabulation of Our Imaginary Data from a "16 and Pregnant" Study

                          Watched 16 and Pregnant
                          Yes    No    Row Marginals
Got Pregnant       Yes    1      3     4
                   No     3      1     4
Column Marginals          4      4     N=8
How do you figure out what the expected number of cases would be in each cell? You use
the following formula:

expected number of cases = (row marginal × column marginal) / N
A row marginal is the total number of cases in a given row of a table. A column marginal
is the total number of cases in a given column of a table. For this table, the N is 8, the total
number of cases involved in the crosstabulation. For cell A, the row marginal is 4 and the
column marginal is 4, which means its expected number of cases would be (4 × 4)/8 = 16/8 =
2. In this particular table, all the cells would have had an expected frequency (or number
of cases) of 2. So now all we have to do to compute χ2 is to make a series of calculation
columns:
Cell    Observed Number of Cases in Cell    Expected Number of Cases in Cell    (O−E)    (O−E)²    (O−E)²/E
A       1                                   2                                   −1       1         ½
B       3                                   2                                   1        1         ½
C       3                                   2                                   1        1         ½
D       1                                   2                                   −1       1         ½
And the sum of all the numbers in the (O−E)²/E column is 2.00. This is less than the 3.84 that
χ² needs to be for us to conclude that the chances of making a Type 1 error are less than .05
(see Table 3.6), so we cannot safely generalize that something like the relationship in this
small sample exists in the larger population. Aren’t you glad that these days programs like
SPSS can do these calculations for us? Even though they can, it’s important to go through
the process a few times on your own so that you understand what it is that the computer is
doing.
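If you'd like a computer to mimic the hand calculation rather than hide it, here is a minimal sketch in Python; the observed counts are the A, B, C, D cells of Table 3.7, and every expected count is 2, as computed above.

```python
# Observed counts from Table 3.7, keyed by the cell labels used above.
observed = {"A": 1, "B": 3, "C": 3, "D": 1}
expected = {cell: 2 for cell in observed}  # (4 * 4) / 8 = 2 for every cell

chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed
)
print(chi_square)  # 2.0, below the 3.84 critical value for 1 degree of freedom
```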
Chi-square varies based on three characteristics of the sample relationship. The first of
these is the number of cells. Higher chi-squares are more easily achieved in tables with
more cells; hence the 3.84 standard for 2 x 2 tables and the 5.99 standard for 2 x 3 tables.
You’ll recall from Table 3.6 that we used the term degrees of freedom to refer to the calculation of table size. To figure out the degrees of freedom for a crosstabulation, you simply count the number of columns in the table (only the columns with data in them, not
columns with category names) and subtract one. Then you count the number of rows in
the table, again only those with data in them, and subtract one. Finally, you multiply the
two numbers you have computed. Therefore, the degrees of freedom for a 2×2 table will be
1 [(2-1)*(2-1)], while the degrees of freedom for a 4×6 table will be 15 [(4-1)*(6-1)].
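The degrees-of-freedom rule is simple enough to put into a one-line function; this sketch, in Python, just restates the arithmetic of the paragraph above.

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom for a crosstab: (rows - 1) * (columns - 1)."""
    return (rows - 1) * (columns - 1)

print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(4, 6))  # 15
```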
Higher chi-squares will also be achieved when the relationship is stronger. If, instead of
the 1, 3, 3, 1 pattern in the four cells above (a relationship that yields a |Yule's Q| of 0.80), one
had a 0, 4, 4, 0 pattern (a relationship that yields a |Yule's Q| of 1.00), the chi-square would
be 8.00, considerably greater than the 3.84 standard, and one could then generalize that
something like the relationship in the sample also existed in the larger population.
But chi-square also varies with the size of the sample. Thus, if instead of the 1, 3, 3, 1 pattern
above, one had a 10, 30, 30, 10 pattern—both of which would yield a Yule’s Q of 0.80 and are
therefore of the same strength, and both of which have the same number of cells (4)—the
chi-square would compute to be 20, instead of 2, and give pretty clear guidance to infer
that a relationship exists in the larger population. The message of this last co-variant of chi-square—that it grows as the sample grows—implies that researchers who want to find generalizable results do well to increase sample size. A sample relationship that we can safely generalize in this way is said to be significant—sometimes a desirable and often an interesting thing. Incidentally, SPSS computed the chi-square for the crosstabulation in Table 3.5, the one that showed the relationship between type of product advertised
(domestic or non-domestic) and the gender of the product representative, to be 17.5. Even
for a 2 x 4 table like that one, this is high enough to infer that a relationship exists in the
larger population, with less than a .05 chance of being wrong. In fact, SPSS went even further, telling us that the chances of making a Type 1 error were less than .001. (Aren’t computers great?)
3. Can you double-check Roger’s calculation of chi-square for this arrangement to make sure he’s right? He’d
appreciate the help.
4. Of course, with very large samples, like the entire General Social Survey (GSS) since it was begun, it is sometimes possible to uncover significant relationships—i.e., ones that almost surely exist in the larger population—that aren't all that strong. Does that make sense?
Crosstabulation with Two Ordinal Level Variables
We've introduced crosstabulation as a technique designed for the analysis of the relationship between two nominal level variables. But because all variables are at least nominal level, one could theoretically use crosstabulation to analyze the relationship between variables of any scale. In the case of two interval level variables, however, there are much more elegant techniques for doing so, and we'll be looking at those in the chapter on correlation and regression. If one were looking into the relationship between a nominal level variable (say, gender, with the categories male and female) and an ordinal level variable (say, happiness with marriage, with the three categories very happy, pretty happy, and not too happy), one could simply use all the same techniques for determining the direction, strength, and generalizability we've discussed above.
If we chose to analyze the relationship between two ordinal level variables, however, we
could still use crosstabulation, but we might want to use a more elegant way of determining direction and strength of relationship than by comparing percentages and seeing
what Cramer’s V tells us. One very cool statistic used for determining the direction and
strength of a relationship between two ordinal level variables is gamma. Unlike Cramer’s V
and Yule’s Q, whose values only vary between 0.00 and 1.00, and therefore can only speak to
the strength of a relationship, gamma’s possible values are between -1.00 and 1.00. This one
statistic can tell us about both the direction and the strength of the relationship. Thus, a
gamma of zero still means there is no relationship between the two variables. But a gamma
with a positive sign not only reveals strength (a gamma of 0.30 indicates a stronger relationship than one of 0.10), but it also says that as values of the independent variable increase,
so do values of the dependent variable. And a gamma with a negative sign not only reveals
strength (a gamma of -0.30 indicates a stronger relationship than one of -0.10), but also
says that as values of the independent variable increase, values of the dependent variable
decrease. But what exactly do we mean by “values,” here?
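Although this chapter doesn't walk through gamma's formula, the idea behind it can be sketched in a few lines of Python. Gamma compares pairs of cases that are "in order" (concordant) with pairs that are "out of order" (discordant); the function below is a minimal illustration of that pair-counting logic, assuming a contingency table whose rows and columns are both sorted from low to high coded values. It is not the routine SDA itself uses.

```python
def gamma(table):
    """Goodman and Kruskal's gamma for a contingency table (a list of lists)
    whose rows and columns are ordered by their coded values:
    gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

# The 2 x 2 table from our 16 and Pregnant example:
print(gamma([[1, 3], [3, 1]]))  # -0.8
```

Run on the 2 x 2 table from our 16 and Pregnant example, it returns −0.8, which squares with a point made later in this chapter: for 2 x 2 tables, gamma is Yule's Q with its sign intact.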
Let's explore a couple of examples from the GSS (via the Social Data Archive, or SDA).
Table 3.8 shows the relationship between the happiness of GSS respondents' marriages
(HAPMAR) and their general happiness (HAPPY) over the years. Using our earlier way of
determining direction, we can see that 90 percent of those who are "very happy" generally
are also very happy in their marriages, while only 35 percent of those who are "not too happy"
5. You would generate some pretty gnarly tables that would be very hard to interpret, though.
6. While there are clearly more than two genders, we are at the mercy of the way the General Social Survey asked
its questions in any given year, and thus for the examples presented in this text only data for males and
females is available. While this is unfortunate, it's also an important lesson about the limits of existing survey
data and the importance of ensuring proper survey question design.
generally are very happy in their marriages. Pretty clear that marital happiness and general happiness are related, right?
Table 3.8. Crosstabulation of Marital Happiness and General Happiness, GSS data from SDA

Cells contain the column percent and the N of cases.

                                       HAPPY
HAPMAR               1: very happy   2: pretty happy   3: not too happy   ROW TOTAL
1: very happy        90.0            46.5              35.0               63.0
                     11,666          7,938             894                20,498
2: pretty happy      9.6             51.0              45.5               34.0
                     1,237           8,617             1,120              10,974
3: not too happy     .4              2.4               19.5               2.9
                     51              433               503                987
COL TOTAL            100.0           100.0             100.0              100.0
                     12,954          16,988            2,517              32,459
Means                1.10            1.56              1.84               1.40
Std Devs             .32             .54               .72                .55
Unweighted N         12,954          16,988            2,517              32,459

Summary Statistics:
Eta* = .46        Gamma = .75     Rao-Scott-P: F(4,2360) = 1,807.32 (p= 0.00)
R = .46           Tau-b = .45     Rao-Scott-LR: F(4,2360) = 1,709.73 (p= 0.00)
Somers' d* = .42  Tau-c = .35     Chisq-P(4) = 8,994.28; Chisq-LR(4) = 8,508.63
*Row variable treated as the dependent variable.
The more elegant way is to look at the statistics at the bottom of the table. Most of these
statistics aren’t helpful to us now. But one, gamma, certainly is. You’ll note that gamma is
0.75. There are two important attributes of this statistic: its sign (positive) and its magnitude (0.75). The former tells you that as coded values of marital happiness—1=very happy; 2=pretty happy; 3=not too happy—go up, coded values of general happiness—1=very happy; 2=pretty happy; 3=not too happy—tend to go up as well. We can interpret this by saying that respondents who are
less happy with their marriages are likely to be less happy generally than others. (Notice
that this also means that people who are happy in their marriages are also likely to be more
generally happy than others.) But the 0.75, independent of the sign, means that this relationship is very strong. By the way, you might also notice that there is a little parenthetical
expression at the end of the row gamma is on in the statistics box—(p=0.00). The “p” stands
for the chances (probability) of making a Type 1 error, and is sometimes called the “p value”
or the significance level. The fact that the “p value” here is 0.00 does NOT mean that there is
zero chance of making an error if you infer that there is a relationship between marital happiness and general happiness in the larger population. There will always be such a chance.
But the SDA printouts of such values give up after two digits to the right of the decimal
point. All one can really say is that the chances of making a Type 1 error, then, are less than
0.01 (which itself is less than 0.05)—and so researchers would conclude that they could reasonably generalize.
To emphasize the importance of the sign of gamma (+ or -), let's have a look at Table 3.9,
which displays the relationship between job satisfaction, whose coded values are 1=very dissatisfied; 2=a little dissatisfied; 3=moderately satisfied; 4=very satisfied, and general happiness, whose codes are the same as they were in Table 3.8. You can probably tell from looking
at the internal percentages of the table that as job satisfaction increases so does general
happiness—as one might expect. But the sign of the gamma of -0.43 might at first persuade
you that there is a negative association between job satisfaction and happiness, until you
remember that what it’s really telling you is that when the coded values of job satisfaction
go up, from 1 (very dissatisfied) to 4 (very satisfied), the coded values of happiness go down,
from 3 (not too happy) to 1 (very happy). Which really means that as job satisfaction goes up,
happiness goes up as well, right? Note, however, that if we reversed the coding for the job
satisfaction variable, so that 1 represented being very satisfied with your job while 4 represented being very dissatisfied, the direction of gamma would reverse. Thus, it is essential
that data analysts do not stop by looking at whether gamma is positive or negative, but
rather also ensure they understand the way the variable is coded (its attributes).
Also note here that the 0.43 portion of the gamma tells you how strong this relationship
is—it’s strong, but not as strong as the relationship between marital happiness and general
happiness (which had a gamma of 0.75). The “p value” here again is .00, which means that
it’s less than .01, which of course is less than .05, and we can infer that there’s very probably a relationship between job satisfaction and general happiness in the larger population
from which this sample was drawn.
Table 3.9. Crosstabulation of Job Satisfaction and General Happiness, GSS data from SDA

Cells contain the column percent and the weighted N.

                                           satjob2
happy                1: Very         2: A Little      3: Moderately   4: Very        ROW TOTAL
                     Dissatisfied    Dissatisfied     Satisfied       Satisfied
1: very happy        15.1            15.6             23.9            44.9           32.8
                     283.0           722.3            4,317.4         10,134.3       15,457.0
2: pretty happy      51.1            62.1             64.9            48.7           56.3
                     955.7           2,877.8          11,716.0        10,982.8       26,532.3
3: not too happy     33.8            22.3             11.3            6.4            10.9
                     631.3           1,034.4          2,032.6         1,448.8        5,147.1
COL TOTAL            100.0           100.0            100.0           100.0          100.0
                     1,870.0         4,634.5          18,066.0        22,566.0       47,136.4
Means                2.19            2.07             1.87            1.62           1.78
Std Devs             .67             .61              .58             .60            .62
Unweighted N         1,907           4,539            17,514          22,091         46,051

Summary Statistics:
Eta* = .28         Gamma = -.43    Rao-Scott-P: F(6,3396) = 584.48 (p= 0.00)
R = -.28           Tau-b = -.26    Rao-Scott-LR: F(6,3396) = 545.83 (p= 0.00)
Somers' d* = -.25  Tau-c = -.23    Chisq-P(6) = 4,310.95; Chisq-LR(6) = 4,025.87
*Row variable treated as the dependent variable.
We haven’t shown you the formula for gamma, but it’s not that difficult to compute. In fact,
when you have a 2 x 2 table gamma is the same as Yule’s Q, except that it can take on both
positive and negative values. Obviously, Yule’s Q could do that as well, if it weren’t for the
absolute value symbols surrounding it. As a consequence, you can use gamma as a substitute for Yule’s Q for 2 x 2 tables when using the SDA interface to access GSS data—as long
as you remember to take the absolute value of the gamma that is calculated for you. Thus, in
Table 3.10, showing the relationship between gender and whether or not a respondent was
married, the absolute value of the reported gamma—that is, |-0.11| = 0.11—is the |Yule's Q| for
the relationship. And it is clearly weak. By the way, the p value here, 0.07, indicates that we
cannot safely infer that a similar relationship existed in the larger population in 2010.
Table 3.10. Crosstabulation of Gender and Marital Status in 2010, GSS data from SDA

Cells contain the column percent and the weighted N.

                               sex
married             1: male    2: female    ROW TOTAL
0: not married      45.4       50.7         48.3
                    420.9      565.9        986.8
1: married          54.6       49.3         51.7
                    506.1      549.5        1,055.6
COL TOTAL           100.0      100.0        100.0
                    927.0      1,115.4      2,042.4
Means               .55        .49          .52
Std Devs            .50        .50          .50
Unweighted N        891        1,152        2,043

Summary Statistics:
Eta* = .05         Gamma = -.11    Rao-Scott-P: F(1,78) = 3.29 (p= 0.07)
R = -.05           Tau-b = -.05    Rao-Scott-LR: F(1,78) = 3.29 (p= 0.07)
Somers' d* = -.05  Tau-c = -.05    Chisq-P(1) = 5.75; Chisq-LR(1) = 5.76
*Row variable treated as the dependent variable.
One problem with an SDA output is that none of the statistics reported (not the Eta, the
R, the Tau-b, etc.) are actually designed to measure the strength of relationship between
two purely nominal level variables—Cramer’s V and Yule’s Q, for instance, are not provided
in the output. All of the measures that are provided, however, do have important uses. To
learn more about these and other measures of association and the circumstances in which
they should be used, see the chapter focusing on measures of association.
Exercises

1. Write definitions, in your own words, for each of the following key concepts from this chapter:
◦ independent variable
◦ dependent variable
◦ crosstabulation
◦ direction of a relationship
◦ strength of a relationship
◦ generalizability of a relationship
◦ Yule's Q
◦ Cramer's V
◦ Type 1 error
◦ Type 2 error
◦ Pearson's chi-square
◦ gamma
◦ hypothesis
◦ null hypothesis

2. Use the following (hypothetical) data, meant to test the hypothesis (with a hypothetically random sample) that adults tend to be taller than children. Create a crosstabulation of the data that enables you to determine the direction, strength and generalizability of the relationship, as well as what determinations you can make in relation to the null and research hypotheses. Present the statistics that permit you to describe these characteristics:

Case        Age Group    Height
Person 1    Child        Short
Person 2    Adult        Tall
Person 3    Child        Short
Person 4    Adult        Tall
Person 5    Child        Short
Person 6    Adult        Tall
Person 7    Child        Short
Person 8    Adult        Tall

3. Return to the Social Data Archive we've explored before. The data, again, are available at https://sda.berkeley.edu/. Go down to the second full paragraph and click on the "SDA Archive" link you'll find there. Then scroll down to the section labeled "General Social Surveys" and click on the first link there: General Social Survey (GSS) Cumulative Datafile 1972-2021 release.
◦ Now type "hapmar" in the row box and "satjob" in the column box. Hit "output options" and find the "percentaging" options and make sure "column" is clicked. (Satjob will be our independent variable here, so we want column percentages.) Now click on "summary statistics," under "other options." Hit "run the table," examine the resulting printout and write a short paragraph in which you use gamma and the p-value to evaluate the hypothesis that people who are more satisfied with their jobs are more likely to be happily married than those who are less satisfied with their jobs. Your paragraph should mention the direction, strength and generalizability of the relationship as well as what determinations you can make in terms of the null and research hypotheses.
Media Attributions
• A Mapping of the Hypothesis that Men Will Tend to be Taller than Women © Mikaila
Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial
ShareAlike) license
• A Mapping of Kearney and Levine’s Hypothesis © Mikaila Mariel Lemonik Arthur is
licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
5. Hypothesis Testing in Quantitative Research
MIKAILA MARIEL LEMONIK ARTHUR
Statistical reasoning is built on the assumption that data are normally distributed, meaning that they will be distributed in the shape of a bell curve as discussed in the chapter
on Univariate Analysis. While real life often—perhaps even usually—does not resemble a
bell curve, basic statistical analysis assumes that if all possible random samples from a
population were drawn and the mean taken from each sample, the distribution of sample
means, when plotted on a graph, would be normally distributed (this assumption is called
the Central Limit Theorem). Given this assumption, we can use the mathematical techniques developed for the study of probability to determine the likelihood that the relationships or patterns we observe in our data occurred due to random chance rather than due to
some actual real-world connection, which we call statistical significance.
Statistical significance is not the same as practical significance. The fact that we have
determined that a given result is unlikely to have occurred due to random chance does not
mean that this given result is important, that it matters, or that it is useful. Similarly, we
might observe a relationship or result that is very important in practical terms, but that we
cannot claim is statistically significant—perhaps because our sample size is too small, for
instance. Such a result might have occurred by chance, but ignoring it might still be a mistake. Let’s consider some examples to make this a bit clearer. Assume we were interested
in the impacts of diet on health outcomes and found the statistically significant result that
people who eat a lot of citrus fruit end up having pinky fingernails that are, on average, 1.5
millimeters longer than those who tend not to eat any citrus fruit. Should anyone change
their diet due to this finding? Probably not, even though it is statistically significant. On the
other hand, if we found that the people who ate the diets highest in processed sugar died
on average five years sooner than those who ate the least processed sugar, even in the
absence of a statistically significant result we might want to advise that people consider
limiting sugar in their diet. This latter result has more practical significance (lifespan matters more than the length of your pinky fingernail) as well as a larger effect size or association (5 years of life as opposed to 1.5 millimeters of length), a factor that will be discussed in
the chapter on association.
While people generally use the shorthand of “the likelihood that the results occurred by
chance” when talking about statistical significance, it is actually a bit more complicated
than that. What statistical significance is really telling us is the likelihood (or probability)
that a result equal to or more "extreme" than the one we observed reflects something true in the real world, rather than our results having occurred due to random chance or sampling error. Testing for statistical significance,
then, requires us to understand something about probability.
A Brief Review of Probability
You might remember having studied probability in a math class, with questions about
coin flips or drawing marbles out of a jar. Such exercises can make probability seem very
abstract. But in reality, computations of probability are deeply important for a wide variety
of activities, ranging from gambling and stock trading to weather forecasts and, yes, statistical significance.
Probability is represented as a proportion (or decimal number) somewhere between 0
and 1. At 0, there is absolutely no likelihood that the event or pattern of interest would occur;
at 1, it is absolutely certain that the event or pattern of interest will occur. We indicate that
we are talking about probability by using the symbol p. For example, if something has a
50% chance of occurring, we would write p = 0.5. If we want to represent the likelihood of something not occurring, we can write 1 − p, or 1 − 0.5 = 0.5.
Check your thinking: Assume you were flipping coins, and you called heads. The probability of
getting heads on a coin flip using a fair coin (in other words, a normal coin that has not been
weighted to bias the result) is 0.5. Thus, in 50% of coin flips you should get heads. Consider the following probability questions and write down your answers so you can check them against the discussion below.
•
Imagine you have flipped the coin 29 times and you have gotten heads each time. What is
the probability you will get heads on flip 30?
•
What is the probability that you will get heads on all of the first five coin flips?
•
What is the probability that you will get heads on at least one of the first five coin flips?
1. One way to think about this is to imagine that your result has been plotted on a bell curve. Statistical signifi­
cance tells us the probability that the "real" result—the thing that is true in the real world and not due to ran­
dom chance—is at the same point as or further along the skinny tails of the bell curve than the result we have
plotted.
There are a few basic concepts from the mathematical study of probability that are
important for beginner data analysts to know, and we will review them here.
Probability over Repeated Trials: The probability of the outcome of interest is the same in
each trial or test, regardless of the results of the prior test. So, if we flip a coin 29 times and
get heads each time, what happens when we flip it the 30th time? The probability of heads
is still 0.5! The belief that “this time it must be tails because it has been heads so many
times” or “this coin just wants to come up heads” is simply superstition, and—assuming a
fair coin—the results of prior trials do not influence the results of this one.
Probability of Multiple Events: The probability that the outcome of interest will occur
repeatedly across multiple trials is the product of the probability of the outcome on each
individual trial. This is called the multiplication theorem. Thinking about the multiplication
theorem requires that we keep in mind the fact that when we multiply decimal numbers
together, those numbers get smaller—thus, the probability that a series of outcomes will
occur is smaller than the probability of any one of those outcomes occurring on its own. So,
what is the probability that we will get heads on all five of our coin flips? Well, to figure that
out, we need to multiply the probability of getting heads on each of our coin flips together.
The math looks like this (and produces a very small probability indeed):

0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125
Probability of One of Many Events: Determining the probability that the outcome of interest will occur on at least one out of a series of events or repeated trials is a little bit more complicated. Mathematicians use the addition theorem to refer to this, because the basic way to calculate it is to calculate the probability of each sequence of events (say, heads-heads-heads, heads-heads-tails, heads-tails-heads, and so on) and add them together. But the greater the number of repeated trials, the more complicated that gets, so there is a simpler way to do it. Consider that the probability of getting no heads is the same as the probability of getting all tails (which would be the same as the probability of getting all heads that we calculated above). And the only circumstance in which we would not have at least one flip resulting in heads would be a circumstance in which all flips had resulted in tails. Therefore, what we need to do in order to calculate the probability that we get at least one heads is to subtract the probability that we get no heads from 1—and as you can imagine, this procedure shows us that the probability of the outcome of interest occurring at least once over repeated trials is higher than the probability of the occurrence on any given trial. The math would look like this:

1 − (0.5 × 0.5 × 0.5 × 0.5 × 0.5) = 1 − 0.03125 = 0.96875
So why is this digression into the math of probability important? Well, when we test for
2. In other words, what you get when you multiply.
statistical significance, what we are really doing is determining the probability that the outcome we observed—or one that is more extreme than that which we observed—occurred
by chance. We perform this analysis via a procedure called Null Hypothesis Significance
Testing.
Null Hypothesis Significance Testing
Null hypothesis significance testing, or NHST, is a method of testing for statistical signif­
icance by comparing observed data to the data we would expect to see if there were no
relationship between the variables or phenomena in question. NHST can take a little while
to wrap one’s head around, especially because it relies on a logic of double negatives: first,
we state a hypothesis we believe not to be true (there is no relationship between the variables in question) and then, we look for evidence that disconfirms this hypothesis. In other
words, we are assuming that there is no relationship between the variables—even though
our research hypothesis states that we think there is a relationship—and then looking to
see if there is any evidence to suggest there is not no relationship. Confusing, right?
So why do we use the null hypothesis significance testing approach?
• The null hypothesis—that there is no relationship between the variables we are exploring—would be what we would generally accept as true in the absence of other information,
• It means we are assuming that differences or patterns occur due to chance unless there is strong evidence to suggest otherwise,
• It provides a benchmark for comparing observed outcomes, and
• It means we are searching for evidence that disconfirms our hypothesis, making it less likely that we will accept a conclusion that turns out to be untrue.
Thus, NHST helps us avoid making errors in our interpretation of the result. In particular, it helps us avoid Type 1 error, as discussed in the chapter on Bivariate Analyses. As a reminder, Type 1 error occurs when you reject the null hypothesis when in fact it was true—in other words, you accept a finding as real when it actually occurred by chance (a false positive)—while Type 2 error occurs when you fail to reject the null hypothesis when in fact it was false—in other words, you miss a relationship that really exists (a false negative). For example, you are making a Type 2 error if you decide not to study for a test because you assume you are so bad at the subject that studying simply cannot help you, when in fact we know from research that studying does lead to higher grades. And you are making a Type 1 error if your boss tells you that she is going to promote you if you do enough overtime and you then work lots of overtime in response, when actually your boss is just trying to make you work more hours and already had someone else in mind to promote.
We can never remove all sources of error from our analyses, though larger sample sizes help reduce error. Looking at the formula for computing the standard error,

SE = σ / √n

where σ is the symbol we use to represent the standard deviation and n is the sample size, we can see that the standard error (SE) gets smaller as the sample size (n) gets larger.
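A brief illustrative sketch in Python (the standard deviation of 10 is invented for the example) shows how the standard error shrinks as the sample grows:

import math

sigma = 10  # hypothetical population standard deviation
for n in (25, 100, 400):
    print(n, sigma / math.sqrt(n))  # prints 2.0, 1.0, 0.5: quadrupling n halves the SE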
Besides making our samples larger, another thing that we can do is choose whether we are more willing to accept Type 1 error or Type 2 error and adjust our strategies accordingly. In most research, we would prefer to accept more Type 2 error, because we are more willing to miss out on a finding than we are to make a finding that turns out later to be inaccurate (though, of course, lots of research does eventually turn out to be inaccurate).
Performing NHST
Performing NHST requires that our data meet several assumptions:

1. Our sample must be a random sample—statistical significance testing and other inferential and explanatory statistical methods are generally not appropriate for nonrandom samples (they are also not appropriate for censuses, but you do not need inferential statistics in a census because you are looking at the entire population rather than a sample, so you can simply describe the relationships that do exist)—as well as representative and of a sufficient size (see the Central Limit Theorem above).
2. Observations must be independent of other observations, or else additional statistical manipulation must be performed. For instance, a dataset of data about siblings would need to be handled differently due to the fact that siblings affect one another, so data on each person in the dataset is not truly independent.
3. You must determine the rules for your significance test, including the level of uncertainty you are willing to accept (significance level) and whether or not you are interested in the direction of the result (one-tailed versus two-tailed tests, to be discussed below), in advance of performing any analysis.
4. The number of significance tests you run should be limited, because the more tests you run, the greater the likelihood that one of your tests will result in an error. To make this clearer: if you are willing to accept a 5% probability of accepting a hypothesis as true when it is really false, and you run 20 tests, one of those tests (5% of them!) is pretty likely to have produced an incorrect result, as the sketch below illustrates.
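Here is a hedged sketch of why Assumption 4 matters, using the addition-theorem logic from earlier in the chapter and assuming, for simplicity, that the tests are independent (real tests often are not):

alpha = 0.05
for k in (1, 5, 20):
    # probability of at least one false positive = 1 minus the probability of none
    print(k, round(1 - (1 - alpha) ** k, 3))  # 1 -> 0.05, 5 -> 0.226, 20 -> 0.642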
If our data has met these assumptions, we can move forward with the process of conducting an NHST. This requires us to make three decisions: determining our null hypothesis, our confidence level (or acceptable significance level), and whether we will conduct a one-tailed or a two-tailed test. In keeping with Assumption 3 above, we must make these decisions before performing our analysis. The null hypothesis is the hypothesis that there is no relationship between the variables in question. So, for example, if our research hypothesis was that people who spend more time with their friends are happier, our null hypothesis would be that there is no relationship between how much time people spend with their friends and their happiness.
Our confidence level is the level of risk we are willing to accept that our results could have occurred by chance. Typically, in social science research, researchers use p<0.05 (we are willing to accept up to a 5% risk that our results occurred by chance), p<0.01 (we are willing to accept up to a 1% risk that our results occurred by chance), and/or p<0.001 (we are willing to accept up to a 0.1% risk that our results occurred by chance). P, as was noted above, is the mathematical notation for probability, and that's why we use a p-value to indicate the probability that our results may have occurred by chance. Choosing a higher threshold increases the likelihood that we will accept as accurate a result that really occurred by chance; choosing a lower threshold increases the likelihood that we will assume a result occurred by chance when actually it was real. Remember, what the p-value tells us is not the probability that our own research hypothesis is true, but rather this: assuming that the null hypothesis is correct, what is the probability that the data we observed—or data more extreme than the data we observed—would have occurred by chance?
Whether we choose a one-tailed or a two-tailed test tells us what we mean when we say
“data more extreme than.” Remember that normal curve? A two-tailed test is agnostic as to
the direction of our results—and many of the most common tests for statistical significance
that we perform, like the Chi square, are two-tailed by default. However, if you are only interested in a result that occurs in a particular direction, you might choose a one-tailed test. For
instance, if you were testing a new blood pressure medication, you might only care whether the blood pressure of those taking the medication is significantly lower than that of those not taking the medication—having blood pressure significantly higher would not be a good or helpful result, so you might not want to test for that.
Having determined the parameters for our analysis, we then compute our test of statistical significance. There are different tests of statistical significance for different variables (for example, the Chi square discussed in the chapter on bivariate analyses), as you will see in other chapters of this text, but all of them produce results in a similar format. We then compare this result to the p value we already selected. If the p value produced by our analysis is lower than the significance level we selected, we can reject the null hypothesis, as the probability that our result occurred by chance is very low. If, on the other hand, the p value produced by our analysis is higher than the significance level we selected, we fail to reject the null hypothesis, as the probability that our result occurred by chance is too high to accept. Keep in mind this is what we do even when the p value produced by our analysis is quite close to the threshold we have selected. So, for instance, if we have selected the significance level of p<0.05 and the p value produced by our analysis is p=0.0501, we still fail to reject the null hypothesis and proceed as if there is not any support for our research hypothesis.
I actually like to think of the null hypothesis as 'innocent until proven guilty': the null hypothesis (innocence) is assumed to be true as long as there isn't enough evidence to reject it. –Patrick Altmeyer @paltmey via Twitter, 09/13/2022, 3:55 pm.
Thus, the process of null hypothesis significance testing proceeds according to the following steps:

1. Determine the null hypothesis
2. Set the confidence level and whether this will be a one-tailed or two-tailed test
3. Compute the test value for the appropriate significance test
4. Compare the test value to the critical value of that test statistic for the confidence level you selected
5. Determine whether or not to reject the null hypothesis

Your statistical analysis software will perform steps 3 and 4 for you (before there was computer software to do this, researchers had to do the calculations by hand and compare their results to figures on published tables of critical values). But you as the researcher must perform steps 1, 2, and 5 yourself.
Confidence Intervals & Margins of Error

When talking about statistical significance, some researchers also use the terms confidence intervals and margins of error. Confidence intervals are ranges of values within which we can assume the true population parameter lies. Most typically, analysts aim for 95% confidence intervals, meaning that in 95 out of 100 cases, the population parameter will lie within the upper and lower bounds specified by the confidence interval. These are calculated by your statistics software as well. The margin of error, then, is the distance between the point estimate and either end of the confidence interval. So, for instance, a 2021 survey of Americans conducted by the Robert Wood Johnson Foundation and the Harvard T.H. Chan School of Public Health found that 71% of respondents favor substantially increasing federal spending on public health programs. This poll had a 95% confidence interval with a +/- 3.6 margin of error. What this tells us is that there is a 95% probability (19 in 20) that between 67.4% (71-3.6) and 74.6% (71+3.6) of Americans favored increasing federal public health spending at the time the poll was conducted. When a figure reflects an overwhelming majority, such as this one, the margin of error may seem of little relevance. But consider a similar poll with the same margin of error that sought to predict support for a political candidate and found that 51.5% of people said they would vote for that candidate. In that case, we would have found that there was a 95% probability that between 47.9% and 55.1% of people intended to vote for the candidate—which means the race is a total tossup and we really would have no idea what to expect. For some people, thinking in terms of confidence intervals and margins of error is easier to understand than thinking in terms of p values; confidence intervals and margins of error are more frequently used in analyses of polls while p values are found more often in academic research. But basically, both approaches are doing the same fundamental analysis—they are determining the likelihood that the results we observed, or a similarly meaningful result, would have occurred by chance.
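As a hedged sketch of where such numbers come from, the 95% margin of error for a proportion can be approximated as 1.96 times the standard error of the proportion. The sample size below is invented, since the actual poll's n is not given in this chapter:

import math

p_hat = 0.71   # the reported proportion
n = 1000       # hypothetical sample size
moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # about 0.028, or +/- 2.8 points
print(f"95% CI: {p_hat - moe:.3f} to {p_hat + moe:.3f}")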
What Does Significance Testing Tell Us?
One of the most important things to remember about significance testing is that, while
the word “significance” is used in ordinary speech to mean importance, significance testing
does not tell us whether our results are important—or even whether they are interesting.
A full understanding of the relationship between a given set of variables requires looking
at statistical significance as well as association and the theoretical importance of the findings. Table 1 provides a perspective on using the combination of significance and association to determine how important the results of statistical analysis are—but even using
Table 1 as a guide, evaluating findings based on theoretical importance remains key. So:
make sure that when you are conducting analyses, you avoid being misled into assuming
that significant results are sufficient for making broad claims about the importance and
meaning of results. And remember as well that significance only tells us the likelihood that
the pattern of relationships we observe occurred by chance—not whether that pattern is
causal. For, after all, quantitative research can never eliminate all plausible alternative explanations for the phenomenon in question (one of the three elements of causation, along
with association and temporal order).
Table 1. Significance and Association

                       Significant                                               Not Significant
Strong Association     Something's happening here!                               Could be interesting, but might have occurred by chance
Weak Association       Probably did not occur by chance, but not interesting     Nothing's happening here
Exercises

1. Using the approach described in this chapter, calculate the probability of the following coin flip scenarios:
   ◦ Getting 7 heads on 7 coin flips
   ◦ Getting 5 heads on 7 coin flips
   ◦ Getting 1 head on 10 coin flips
   Then check your work using the Coin Flip Probability Calculator.
2. Write the null hypothesis for each of the following research hypotheses:
   ◦ As the advertised hourly pay for a job goes up, the number of job applicants increases.
   ◦ Teenagers who watch more hours of makeup tutorial videos on TikTok have, on average, lower self-esteem.
   ◦ Couples who share hobbies in common are less likely to get divorced.
3. Assume a researcher conducted a study that found that people wearing green socks type on average one word per minute faster than people who are not wearing green socks, and that this study found a p value of p<0.01. Is this result statistically significant? Is this result practically significant? Explain your answers.
4. If we conduct a political poll and have a 95% confidence interval and a margin of error of +/- 2.3%, what can we conclude about support for Candidate X if 49.3% of respondents tell us they will vote for Candidate X? If 24.7% do? If 52.1% do? If 83.7% do?
6. An In-Depth Look At Measures
of Association
MIKAILA MARIEL LEMONIK ARTHUR
Measures of association are statistics that tell analysts the strength of the relationship
between two (or more) variables, as well as in some cases the direction of that relationship.
There are a variety of measures of association; choosing the correct one for any given analysis requires understanding the nature of the variables being used for that analysis. This
chapter will detail a number of measures of association that are used by quantitative analysts, though there are others that will not be covered here. While the chapter will not provide full instructions for calculating most measures of association, it aims to give those who
are new to quantitative analysis a general understanding of how calculations of measures
of association work, how to interpret and understand the results, and how to choose the
correct measure of association for a given analysis.
To start, then, what do measures of association tell us? Remember that they do not tell
us whether a result is statistically significant, as discussed in the chapter on statistical
significance. Instead, they are designed to tell us about the nature and strength of the
observed relationship between the variables, whether or not that relationship is likely to
have occurred by chance. There are different ways of thinking about what association
means: for instance, two variables that are strongly associated are those in which the values
of one variable tend to co-occur with the values of the other variable. Or we might say that
strongly associated variables are those in which variation in one variable can explain much
of the variation in another variable. In addition, for analyses using only ordinal and/or continuous variables, some measures of association can tell us about the direction of the relationship—are we observing a direct (positive) relationship, where as the value of x goes up
the value of y also goes up, or are we observing an inverse (indirect or negative) relationship, where as the value of x goes up the value of y goes down?
Keep in mind that it is possible for a relationship to appear to have a moderate or even strong association but for that association to not be meaningful in explaining the world. This can occur for a variety of reasons. The relationship may not be significant, and thus the likelihood that the observed pattern occurred by chance could be high—note that even at p<0.001 there is a one-in-one-thousand likelihood that the result occurred by chance! Or the relationship may be spurious, and thus while it appears that the two variables are associated, this apparent association is only a reflection of the fact that each variable is separately associated with some other variable. Or the strong association may be due to the fact that both variables are basically measuring the same underlying phenomenon, rather than measuring separate but related phenomena (for instance, one would observe a very strong relationship between year of birth and age).
There is one other important difference between statistical significance and measures of
association: while the computation of statistical significance assumes that data has been
collected using a random sample, measures of association do not necessarily require that
the data be from a random sample. Thus, for instance, measures of association can be computed for data from a census.
Preparing to Choose a Measure of Association

When choosing a measure of association, analysts must begin by ensuring that they understand how their variables are measured as well as the nature of the question they are asking about their data so that they can choose the measure of association that is best suited to these variables and this question. There are a number of relevant factors to consider.
First, the levels of measurement of the variables that are being used: different measures
of association are appropriate for variables of different levels of measurement.
Second, whether information about the direction of the relationship is important to the
research question. Some measures of association provide direction and others do not.
Third, whether a symmetric or an asymmetric measure is required. Symmetric measures
consider the impact of each variable upon the other, while asymmetric measures are used
in circumstances where the analyst wants to use an independent variable to explain or predict variation in a dependent variable. Note that when producing asymmetric measures of
association in statistical software, the software will typically produce multiple versions, and
the analyst must ensure that they use the one for the correct independent/dependent variable.
Fourth, the number of attributes of each variable (for non-continuous variables). Some
measures of association are only appropriate for variables with few attributes—or for
crosstabulations in which the resulting tables are relatively small—while others are appropriate for greater numbers of attributes and larger tables.
There are also specific circumstances that are especially suited to particular measures of association based on the nature of the research question or characteristics of the variables being used. And, as will be discussed below, it is essential to understand the way attributes are coded. It is especially important in the case of ordinal and continuous variables to understand whether increasing numerical values of the variable represent an increase or a decrease in the underlying concept being measured. Finally, there are a variety of factors other than the actual relationship between the variables that can impact the strength of association, including the sample size, unreliable measurements, the presence of outliers, and data that are restricted in range (for instance, a study looking at the relationship between age and health that only included people between the ages of 23 and 27 would be restricted in range in terms of age). Analysts should explore their data using descriptive statistics to see if any of these issues might impact the analysis.

Keep in mind that while it is sometimes appropriate to produce more than one measure of association as part of an analysis, it is not appropriate to simply run all of them and select the one that provides the most desirable result. Instead, the analyst should carefully consider the variables, their question, and the options and choose the one or two most appropriate to the situation to produce and interpret.
General Interpretation of Measures of Association

When interpreting measures of association, there are two pieces of information to look for: (1) strength and (2) direction.
Table 1. Strength of Association

Strength             Value
None                 0
Weak/Uninteresting   ±0.01–0.09
Moderate             ±0.10–0.29
Strong               ±0.30–0.59
Very Strong          ±0.60–0.99
Perfect Identity     ±1

The strength of nearly all measures of association ranges from 0 to 1. Zero means there is no observed relationship at all between the two (or more) variables in question—in other words, their values are completely randomly distributed with respect to each other. One would represent what we call a complete identity—in other words, the two variables are measuring the exact same thing and all values line up perfectly. This would be the situation, for instance, if we looked at the association between height in inches and height in centimeters, which are after all just two different ways of measuring the same value. While different researchers do use different scales for assessing the strength of association, Table 1 provides one approach for doing so. Note that very strong values are quite rare in social science, as most social phenomena are too complex for the types of simple explanations where one variable explains most of the variation in another.
The direction of association, where applicable, is determined by whether the measure of association is a positive or negative number—whether the number is positive or negative does not tell us anything about strength (in other words, +0.5 is not bigger than -0.5—they are the same strength but a different direction). Positive numbers mean a direct association, while negative numbers mean an inverse relationship. Direction cannot be determined when examining relationships involving nominal variables, since nominal variables themselves do not have direction. Keep in mind that it is essential to understand how a variable is coded in order to interpret the direction. For example, imagine we have a variable measuring self-perceived health status. That variable could be coded as 1:poor, 2:fair, 3:good, 4:excellent. Or it could be coded as 1:excellent, 2:good, 3:fair, 4:poor. If we looked at the relationship between the first version of our health variable and age, we might expect that it would be negative, as the numerical value of the health variable would decline as age increased. And if we looked at the relationship between the second version of our health variable and age, we might expect that it would be positive, as the numerical value of the health variable would increase as age increased. The actual health data could be exactly the same in both cases—but if we change the direction of how our variable is coded, this changes the direction of the relationship as well, as the brief sketch below illustrates.
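A short sketch in Python (with invented values) makes the point concrete: reversing the coding flips the sign of a directional measure without changing its strength.

from scipy.stats import spearmanr

age = [20, 30, 40, 50, 60]
health = [4, 4, 3, 2, 1]                   # hypothetical data coded 1:poor ... 4:excellent
health_reversed = [5 - h for h in health]  # recoded as 1:excellent ... 4:poor

rho_original, _ = spearmanr(age, health)
rho_recoded, _ = spearmanr(age, health_reversed)
print(rho_original, rho_recoded)  # same magnitude, opposite signs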
Details on Measures of Association
In this section, we will review a variety of measures of association. For each one, we will provide information about the circumstances in which it is most appropriately used and other
information necessary to properly interpret it.
Phi

Phi is a measure of association that is used when examining the relationship between two binary variables. Cramer's V and Pearson's r, discussed below, will return values identical to Phi when computed for two binary variables, but it is still more appropriate to use Phi. It is a symmetric measure, meaning it treats the two variables identically rather than assuming one variable is the independent variable and the other is the dependent variable. It can indicate direction, but given that binary variables are often assigned numerical codes somewhat at random (should yes be 0 and no 1, or should no be 0 and yes 1?), interpretation of the direction may not be of much use. The computation of Phi is the square root of the Chi square value divided by the sample size. While Phi is the most commonly used measure of association for relationships between two binary variables in social science data, there are other measures designed for such relationships, including some used in other fields (for instance, risk ratios in epidemiology) that are asymmetric; Yule's Q, discussed in several other chapters, is another example. These will not be discussed here.
Cramer’s V
If there is any “default” measure of association, it is probably Cramer’s V. Cramer’s V is used
in situations involving pairs of nominal, ordinal, or binary variables, though not in situations
with two binary variables (then Phi is used) and it is less common in situations where both
variables are ordinal. It is symmetric and non-directional. The size of the table/number of
attributes of each variable does not matter. However, if there is a large difference between
the number of columns and the number of rows, Cramer’s V may overestimate the association between the variables. It is calculated by dividing the Chi square by the sample size
multiplied by whichever is smaller, the number of rows in the table minus one or the number of columns in the table minus one, and then taking the square root of the resulting
number.
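Here is a hedged sketch in Python of both calculations just described, using an invented 2×2 table (for which, as noted above, Phi and Cramer's V coincide):

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[25, 15],   # hypothetical crosstab counts
                  [10, 30]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()

phi = np.sqrt(chi2 / n)                              # square root of Chi square over n
r, c = table.shape
cramers_v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))  # Cramer's V formula from above
print(phi, cramers_v)  # identical for a 2x2 table, since min(r-1, c-1) = 1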
Contingency Coefficient

The Contingency Coefficient is used for relationships in which at least one of the variables is nominal. It is symmetric and non-directional, and is especially appropriate for large tables (those 5×5 or larger—in other words, circumstances in which both variables have five or more attributes). This is because, for smaller tables, the Contingency Coefficient is not mathematically able to get close to one. It is computed by dividing the Chi square by the number of cases plus the Chi square, and then taking the square root of the result.
Lambda and Goodman & Kruskal's Tau

Lambda is a measure of association used when at least one variable is nominal. It is asymmetric and non-directional. Some statisticians believe that Lambda is not appropriate for circumstances in which the dependent variable's distribution is skewed. Unlike measures based on the Chi square, Lambda is based on calculating what is called "the proportional reduction in error" (PRE) when one uses the values of the independent variable to predict the values of the dependent variable. The formula for doing this is quite complex, and involves the number of columns and rows in the table, the number of observations in a given row and column, the number of observations in the cell where that row and column intersect, and the total number of observations.

Goodman & Kruskal's Tau works according to similar principles as Lambda, but without consideration of the number of columns and rows. Thus, it is generally advised to use it only for fairly small tables. Like Lambda, it is asymmetric and non-directional. In some statistical software packages (including SPSS), Goodman & Kruskal's Tau is produced when Lambda is produced rather than it being possible to select it separately.
Uncertainty Coefficient
The Uncertainty Coefficient is also used when at least one variable is nominal. It is asymmetric and directional. Conceptually, it measures the reduction in prediction error (or
uncertainty) that occurs when one variable is used to predict the other. Some analysts
prefer it to Lambda because it better accounts for the entire distribution of the variable,
though others find it harder to interpret. As you can imagine, this makes the formula even
more complicated than the formula for Lambda; it relies on information about the total
number of observations in each row, each column, and each cell.
Spearman

Spearman is used when both variables are ordinal. It is symmetric and directional and can be used for large tables. In SPSS, it can be found under "correlations." Computing Spearman requires converting values into ranks and using the difference in ranks and the sample size in the formula. Note that if there are tied values, or if the data is truncated or reduced in range, Spearman may not be appropriate.
Gamma and Kendall’s Tau (b and c)
The two Kendall’s Tau measures are both symmetric and directional and are used for relationships involving two ordinal variables. However, Kendall’s Tau b is used when tables are
square, meaning that they have the same number of rows and columns, while Kendall’s
Tau c is used when tables are not square. Like Spearman, Kendall’s Tau is based on looking
at the relationship between ranks. After converting values to ranks, one counts pairs of values that are in agreement below a given rank (concordant pairs) and how many are not in
agreement (discordant pairs). The formula, then, involves subtracting the number of discordant pairs from the number of concordant pairs, then dividing this number by the number
of discordant pairs plus the number of concordant pairs.
Gamma is similar–also symmetric and directional and used for relationships involving
two ordinal variables, and with a similar method of calculation, except using same-order
and different-order (ranking high or low on both variables versus ranking high on one and
70 | Measures of Association
low on the other) instead of concordant and discordant pairs. Gamma is preferred when
many of the observations in an analysis are tied, as ties are discounted in the computation
of Kendall’s tau and thus Kendall’s tau will produce a more conservative (in other words,
lower) value in such cases. However, Gamma may overestimate association for larger tables.
Kappa

Kappa is a measure of association that is especially likely to be used for testing interrater reliability, as it is designed for use when both variables are ordinal with the same categories. It measures agreement between the two variables and is symmetric. Kappa is calculated by subtracting the degree of agreement between the variables that would be expected by chance from the degree of agreement that is observed; subtracting the degree of agreement that would be expected by chance from one; and dividing the former by the latter.
Somers’ D
Somers’ D is designed for use in examining relationship involving two ordinal variables
and is directional, but unlike the other ordinal x ordinal measures of association discussed
above, Somers’ D is asymmetric. As such, it measures the extent to which our ability to predict values of the dependent variable is improved by knowing the value of the independent
variable. It is a conservative measure, underestimating the actual extent to which two variables are associated, though this underestimation declines as table size increases.
Eta

Eta is a measure of association that is used when the independent variable is discrete and the dependent variable is continuous. It is asymmetric and non-directional, and is primarily used as part of a statistical test called ANOVA, which is beyond the scope of this text. In circumstances where independent variables are discrete but not binary, many analysts choose to recode those variables to create multiple dummy variables, as will be discussed in the chapter on multivariate regression, and then use Pearson's r as discussed below.
Pearson’s r
Pearson’s r is used when examining relationships between two (or more) continuous variables and can also be used in circumstances where an independent variable is binary and
a dependent variable is continuous. It is symmetric and directional. The calculation of Pearson’s r is quite complex, but conceptually, what this calculation involves is plotting the data
on a graph and then finding the line through the graph that best fits this data, a topic that
will be further explored in the chapter on Correlation and Regression.
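While this book computes these measures in SPSS, a hedged sketch in Python shows Pearson's r alongside the rank-based measures discussed above (the data are invented):

from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6]   # hypothetical continuous or ordinal scores
y = [2, 1, 4, 3, 6, 5]

print(pearsonr(x, y))    # continuous by continuous
print(spearmanr(x, y))   # rank-based (ordinal)
print(kendalltau(x, y))  # concordant/discordant pairs (ordinal); tau-b by default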
Other Situations
Attentive readers will have noticed that not all possible variable combinations have been
addressed above. In particular, circumstances in which the independent variable is continuous and the dependent variable is not continuous have not been addressed. For beginning analysts, the most straightforward approach to measuring the association in such
relationships is to recode the continuous variable to create an ordinal variable and then proceed with crosstabulation. However, there are a variety of more advanced forms of regression that are beyond the scope of this book, such as logistic regression, that can also handle
relationships between these sorts of variables, and there are various pseudo-R measures of
association that can be used in such analyses.
Exercises

1. Determine the strength and direction for each of the following measure of association values:
   ◦ -0.06
   ◦ 0.54
   ◦ 0.13
   ◦ -0.27
2. Select the most appropriate measure of association for each of the following relationships, and explain why it is the most appropriate:
   ◦ Age, measured in years, and weight, measured in pounds
   ◦ Opinion about the local police on a 5-point agree/disagree scale and highest educational degree earned
   ◦ Whether or not respondents have health insurance (yes/no) and whether or not they have been to a doctor in the past 12 months (yes/no)
   ◦ Letter grade on Paper 1 and letter grade on Paper 2 in a first-year composition class
3. Explain, in your own words, the difference between association and significance.
7. Multivariate Analysis
ROGER CLARK
We saw, in our discussion of bivariate analysis, how crosstabulation can be used to examine
bivariate relationships, like the one Kearney and Levine discovered between watching 16
and Pregnant and becoming pregnant for teenaged women. In this chapter, we’ll be investigating how researchers gain greater understanding of bivariate relationships by controlling for other variables. In other words, we’ll begin our exploration of multivariate analyses,
or analyses that enable researchers to investigate the relationship between two variables
while examining the role of other variables.
You may recall that Kearney and Levine claim to have investigated the relationship
between watching 16 and Pregnant and becoming pregnant, and thought it might have
been at least partly due to the fact that those who watched were more likely to seek out
information about (and perhaps use) contraception. Researchers call a variable that they
think might affect, or be implicated in, a bivariate relationship a control variable. In the case
of Kearney and Levine’s study, the control variable they thought might be implicated in the
relationship between watching 16 and Pregnant and becoming pregnant was seeking out
information about (or using) contraception.
Before we go further we'd like to introduce you to three kinds of control variables: intervening, antecedent, and extraneous control variables. An intervening control variable is a variable a researcher believes is affected by an independent variable and in turn affects a dependent variable. The Latin root of "intervene" is intervenire, meaning "to come between"—and that's what intervening variables do. They come between, at least in the researcher's mind, the independent and dependent variables.

For Kearney and Levine, seeking information about contraception was an intervening variable: it's a variable they thought was affected by watching 16 and Pregnant (their independent variable) and in turn affected the likelihood that a young woman would become pregnant (their dependent variable). More precisely, their three-variable hypothesis goes something like this: a young woman who watched 16 and Pregnant was more likely to seek information than a woman who did not watch it, and a woman who sought information about contraception was less likely to get pregnant than a woman who did not seek information about contraception. One quick way to map such a hypothesis is the following:
Figure 1. A Depiction of Kearney & Levine’s Hypothesis
Importantly, researchers who believe they’ve found an intervening variable linking an independent variable and a dependent variable don’t believe they are challenging the possibility that the independent variable may be a cause of variation in the dependent variable.
(More about “cause” in a second.) They are simply pointing to a possible way, or mechanism
through which, the independent variable may cause or affect variation in the dependent
variable.
A second kind of control variable is an antecedent variable. An antecedent variable is a
variable that a researcher believes affects both the independent variable and the dependent variable. Antecedent has a Latin root that translates into “something that came
before.” And that’s what researchers who think they’ve found an antecedent variable
believe: that they’ve found a variable that not only comes before and affects both the independent variable and the dependent variable, but also, in some real sense, causes them to
go together, or to be related.
Example
For an example of a researcher/theorist who thinks he may have found an antecedent variable that explains
a relationship, think about what Robert Sternberg is saying about the correlation between the attractiveness
of children and the care their parents give them in this article on the research by W. Andrew Harrell.
Quiz at the end of the article: What two variables constituted the independent and dependent variables of
the finding announced by researchers at the University of Alberta? How did they show these two variables
were related? What variable did Robert Sternberg suspect might have been an antecedent variable for the
independent and dependent variables found to be related by the U. of Alberta researchers? How did he think
this variable might explain the relationship?
If you said that the basic relationship discovered by the University of Alberta researchers was that ugly children get poorer care from their parents than pretty children, you were right on the money. (It’s back-patting
time!) Here the proposed independent variable was the attractiveness of children and the dependent variable
was the parental care they received.
If you said that the socioeconomic status or wealth of the parents was what Sternberg thought might be an antecedent variable for these two variables (attractiveness and care), then you should glow with pride. Sternberg suggested that wealthier parents can both make their children look more attractive than poorer parents can and give their children better care than poorer parents can. One quick way to map such a hypothesis is like this:
Figure 2. Sternberg’s Hypothesis
A Word About Causation

Importantly, a researcher who thinks they have found an antecedent variable for a relationship implies that they have found a reason why the original relationship might be non-causal. Spurious is a word researchers use to describe non-causal relationships. Philosophers of science have told us that in order for a relationship between an independent variable and a dependent variable to be causal, three conditions must obtain:
1. The independent and dependent variables must be related. We demonstrated ways,
using crosstabulation, that such relationships can be established with data. The Alberta
researchers did show that the attractiveness of children was associated with how well they
were treated (cared for or protected) in supermarkets. This condition is sometimes called
association.
2. Instances of the independent variable occurring must come before, or at least not after,
instances of the dependent variables. The attractiveness of the children in the Alberta study
almost certainly preceded their treatment by their parents during the shopping expeditions observed by the researchers. This factor is often called temporal order.
3. There can be NO antecedent variable that creates the relationship between the independent variable and the dependent variable. This is the really tough condition for researchers to demonstrate, because, in principle, there could be an infinite number of antecedent variables that create such a relationship. This factor is often called elimination of alternatives. There is one research method—the controlled laboratory experiment—that theoretically eliminates this difficulty, but it is beyond the scope of this book to show you how. Yet it is not beyond our scope to show you how an antecedent variable might be shown, with data, to throw real doubt on the notion that an independent variable causes a dependent variable. And we'll be doing that shortly.
Back to Our Main Story
A third kind of control variable is an extraneous variable. An extraneous variable is a variable
that has an effect on the dependent variable that is separate from the effect of the independent variable. One can easily imagine variables that would affect the chances of an adolescent woman’s getting pregnant (the dependent variable for Kearney and Levine) that
have nothing to do with her having watched, or not watched, the TV show 16 and Preg­
nant. Whether or not friends are sexually active, and whether or not she defines herself as a
lesbian, are two such variables. Sexual experience of her friendship group and sexual orientation, then, might be considered extraneous variables when considering the relationship
between watching 16 and Pregnant and becoming pregnant. One might map the relationship among these four variables in the following way:
Figure 3. A Hypothetical Extraneous Variable Relationship

What Happens When You Control for a Variable and What Does it Mean?

You may be wondering how one could confirm any three-variable hypothesis with data. Let's look at an example using data from eight imaginary adolescent women: whether they watched 16 and Pregnant, got pregnant, and sought information about contraception:
Case   Watched Show   Got Pregnant   Sought Contraceptive Information
1      Yes            No             Yes
2      Yes            No             Yes
3      Yes            No             Yes
4      Yes            Yes            Yes
5      No             Yes            No
6      No             Yes            No
7      No             Yes            No
8      No             No             No
Checking out a three-variable hypothesis requires, first, that you determine the relationship
between the independent and dependent variables: in this case, between having watched
16 and Pregnant and pregnancy status. Do you recall how to do that? In any case, we’ve
done it in Table 1.
Table 1. Crosstabulation of Watching 16 and Pregnant and Pregnancy Status

                Watched 16 and Pregnant
Got Pregnant    Yes        No
Yes             1 (25%)    3 (75%)
No              3 (75%)    1 (25%)
|Yule's Q| = 0.80
You’ll note that the direction of this relationship, as expected by Kearny and Levine, is that
women who had watched the show were less likely to get pregnant than those who had
not. And a Yule’s Q of 0.80 suggests the relationship is strong.
What controlling a relationship for another variable means is that one looks at the original relationship (in this case between watching the show and becoming pregnant) after
eliminating variation in the control variable. We eliminate such variation by separating out
the cases that fall into each category of the control variable, and examining the relationship
between the independent and dependent variables in each category. In this case, what this
means is that we first look at the relationship between watching and getting pregnant for
those who have sought contraceptive information and then look at it for those who have
not sought such information. To do this we create two more tables that have the same form
as Table 1, one into which only those who fell into the “yes” category of having sought contraceptive information are put, the other into which only those cases that fell into the “no”
category of having sought contraceptive information are put. Doing this, we’ve created two
more tables, Tables 2 and 3. Table 2 looks at the relationship between having watched the
show and having become pregnant only for the four cases that sought contraceptive information; Table 3 does this only for the four cases that didn’t seek contraceptive information.
Table 2. Crosstabulation of Watching 16 and Pregnant and Pregnancy Status For Those Who Sought Contraceptive Information

                Watched 16 and Pregnant
Got Pregnant    Yes    No
Yes             1      0
No              3      0
|Yule's Q| = 0.00
Table 3. Crosstabulation of Watching 16 and Pregnant and Pregnancy Status For Those Who Did Not Seek Contraceptive Information

                Watched 16 and Pregnant
Got Pregnant    Yes    No
Yes             0      3
No              0      1
|Yule's Q| = 0.00
We call the relationship between an independent and dependent variable for the part of a sample that falls into one category of a control variable a partial relationship, or simply a partial. What is notable about the partial relationships in both Tables 2 and 3 is that they are as weak as they could possibly be (both Yule's Qs are equal to 0.00); both are much weaker than the original relationship between watching the show and becoming pregnant. In fact, in the context of controlling a relationship between two variables for a third, the relationship between the independent variable and the dependent variable, before the control, is often called an original relationship. (The short sketch below works through these computations.)
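A minimal sketch in Python of the computations behind Tables 1 through 3. Yule's Q for a 2×2 table is (ad − bc)/(ad + bc); here we follow the book's convention of reporting 0.00 when a row or column of the table is empty:

def yules_q(a, b, c, d):
    # a, b = top row of the 2x2 table; c, d = bottom row
    numerator, denominator = a * d - b * c, a * d + b * c
    return 0.0 if denominator == 0 else numerator / denominator

# Original relationship (Table 1): rows = got pregnant (yes/no), columns = watched (yes/no)
print(abs(yules_q(1, 3, 3, 1)))  # 0.8
# Partial for those who sought contraceptive information (Table 2)
print(abs(yules_q(1, 0, 3, 0)))  # 0.0
# Partial for those who did not seek contraceptive information (Table 3)
print(abs(yules_q(0, 3, 0, 1)))  # 0.0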
It may not surprise you to learn that controlling a relationship for a third variable does
not always yield partials that are all weaker than the original. In fact, a famous methodologist, Paul Lazarsfeld (see Rosenberg, 1968), identified four distinct possibilities and others
have called the resulting typology the elaboration model. Elaboration, in fact, is the term
used by researchers for the process of controlling a relationship for a third variable. Table
4 outlines the basic characteristics of Lazarsfeld’s four types of elaboration, with one more
thrown in because, as we’ll show, this fifth one is not only a logical, but also a practical, possibility.
Table 4. The Elaboration Model: Five Kinds of Elaboration

Type of Elaboration   Kind of Control Variable   Relationship of Partials to the Original
Interpretation        Intervening                All partials weaker than the original
Replication           Doesn't Matter             All partials about the same as the original
Explanation           Antecedent                 All partials weaker than the original
Specification         Doesn't Matter             Some, but not all, partials different (stronger and/or weaker) than the original
Revelation            Doesn't Matter             All partials stronger than the original
Quiz at the end of the table: What kind of elaboration is demonstrated, in your view, in Tables 1 to 3?
You may recall that Kearney and Levine saw seeking contraceptive information as an intervening variable between the watching of 16 and Pregnant and pregnancy status. Moreover, the partial relationships (shown in Tables 2 and 3) are both weaker than the original (shown in Table 1), so the elaboration shown in Tables 1 through 3 is an interpretation. If this kind of elaboration occurred in the real world, one could be pretty sure that seeking contraceptive information was indeed a mechanism through which watching the show affected a teenage woman's pregnancy status. Note: while the quantitative results in cases of interpretation and explanation are the same, the explanations for the processes at work are different, and this means the researcher must rely on their own knowledge of the variables at hand to determine which is at work. In cases of interpretation, an intervening variable is at work, and thus the relationship between the independent and dependent variables is a real relationship—it's just that the intervening variable is the mechanism through which this relationship occurs. In contrast, for cases of explanation, an antecedent variable is responsible for the apparent relationship between the independent and dependent variables, and thus this apparent relationship does not really exist. Rather, it is spurious.
In the real world, things don’t usually work out quite as neatly as they did in this example,
where an original relationship completely “disappears” in the partials. If one finds evidence
of an interpretation, it’s likely to be more subdued. Tables 5 and 6 demonstrate this point.
Here, the researcher’s (Roger’s) hypothesis had been that people who are more satisfied
with their finances are generally happier than people who are less satisfied. Table 5 uses
General Social Survey (GSS) data to provide support for this hypothesis. Comparing percentages, about 47.5 percent of people who are satisfied with their finances claimed to be
very happy, while only 17.3 percent who have claimed to be not at all satisfied with their
finances said they were very happy. Moreover, a gamma of 0.42 suggests this relationship
is in the hypothesized direction and that it is strong.
Table 5. Crosstabulation of Satisfaction with Finances (satfin) and General Happiness (happy), GSS Data from SDA

Cells contain column percent and weighted N.

General Happiness   Pretty Well    More or Less   Not Satisfied   Row
                    Satisfied      Satisfied      At All          Total
Very happy          47.5           30.4           17.3            32.1
                    (8,964.6)      (8,731.4)      (2,831.3)       (20,527.3)
Pretty happy        47.0           60.1           57.2            55.5
                    (8,874.7)      (17,228.1)     (9,345.6)       (35,448.4)
Not too happy       5.4            9.5            25.4            12.4
                    (1,023.9)      (2,721.7)      (4,148.7)       (7,894.3)
Column Total        100.0          100.0          100.0           100.0
                    (18,863.2)     (28,681.2)     (16,325.6)      (63,870.0)
Means               1.58           1.79           2.08            1.80
Std Devs            .59            .60            .65             .64
Unweighted N        18,761         28,254         16,824          63,839

Summary statistics: Gamma = .42; Tau-b = .26; Tau-c = .24; Somers' d = .25*; Eta = .29*; R = .29; Chisq-P(4) = 6,058.38; Chisq-LR(4) = 5,797.08; Rao-Scott-P: F(4,2420) = 1,287.60 (p = 0.00); Rao-Scott-LR: F(4,2420) = 1,232.07 (p = 0.00).
*Row variable treated as the dependent variable.
Roger also introduced a control variable, the happiness of respondents' marriages, believing that this variable might be an intervening variable for the relationship between financial satisfaction and general happiness. In fact, he hypothesized that people who are more satisfied with their finances would be happier in their marriages than people who were not satisfied with their finances, and that happily married people would be more generally happy than people who are not happy in their marriages. In terms of the elaboration model, he was expecting that the relationships between financial satisfaction and general happiness for each part of the sample defined by a level of marital happiness (i.e., the partials) would be weaker than the original relationship between financial satisfaction and general happiness. And (hallelujah!) he was right. Table 6 shows that the relationship between financial satisfaction and general happiness for those with very happy marriages yielded a gamma of 0.34; for those with pretty happy marriages, 0.33; and for those with not too happy marriages, 0.31. All three of the partial relationships were weaker than the original, which we showed in Table 5 had a gamma of 0.42.
Table 6. Gammas for the Relationship Between Satisfaction with Finances and General Happiness for People with Different Degrees of Marital Happiness

Very Happy   Pretty Happy   Not Too Happy
0.34         0.33           0.31
Because the partials for each level of marital happiness are only somewhat weaker than
the original relationship between financial satisfaction and general happiness, they don’t
suggest that marital satisfaction is the only reason for the relationship between financial
satisfaction and general happiness, but they do suggest it is probably part of the reason.
A curious researcher might look for others. But you get the idea: data can be used to shed
light on the meaning of basic, two-variable relationships.
Perhaps more interesting still is that data can be used to resolve disputes about basic relationships. To illustrate, let's return to the "Ugly Children" study from Alberta, discussed in the chapter on bivariate analysis. One of the Alberta researchers, a Dr. Harrell, essentially said the fact that prettier children got better care than uglier children was causal: parents with prettier children are propelled by evolutionary forces, in his view, to protect their children (and, one assumes, parents of uglier children are not). A Dr. Sternberg, however, didn't see this relationship as causal. Instead, he saw it as the spurious result of wealth: wealthier parents can feed and clothe their kids better than others and are more likely to be caught up on supermarket etiquette associated with child care than others. Who's right?
One way one could check this out is by collecting and analyzing data. Suppose, for
instance, that a researcher replicated the Alberta study (following parent/child dyads
around supermarkets to determine the attractiveness of the children and how well they
were cared for), but added observations about the cars the parent/child couples came in.
Late-model cars might be used as an indicator of relative wealth; beat-up dentmobiles (like
Roger’s), of relative poverty. Then one could see how much of the relationship between
attractiveness and care “disappeared” in the parts of the sample that were defined by
wealth and poverty. Suppose, in fact, the data so collected looked like this:
Table 7. Hypothetical Data to Test Dr. Sternberg's Hypothesis

Case   Attractiveness   Care   Wealth
1      Ugly             Good   Poor
2      Ugly             Bad    Poor
3      Ugly             Bad    Poor
4      Ugly             Bad    Poor
5      Pretty           Bad    Rich
6      Pretty           Good   Rich
7      Pretty           Good   Rich
8      Pretty           Good   Rich
Quiz about these data: Can you figure out the direction and strength of the relationship between
the attractiveness of children and their care in this sample? What is the strength of this relationship
within each category (rich and poor) of the control variable? What kind of elaboration did you
uncover?
If you found that the original relationship was that pretty children got better care than ugly children (75% of the former did so, while only 25% of the latter did), you should be glowing with pride. If you found that the strength of the relationship (Yule's Q = 0.80) was strong, your brilliance is even more evident. And if you found that this strength "disappeared" (Yule's Qs = 0.00) within each category of wealth, you're a borderline genius. If you decided that the elaboration is an "explanation," because the partials are both weaker than the original and you've got an antecedent variable (at least according to Sternberg), you've crossed the border into genius.
Now referring back to the criteria for demonstrating causation (above), you’ll note that
the third criterion was that there must not be any antecedent variable that creates the relationship between the independent and dependent variables. What this means in terms of
data analysis is that there can’t be any antecedent variables whose control makes the relationship “disappear” within each of the parts of the sample defined by its categories. But
that’s exactly what has happened above. In other words, one can show that a relationship
is non-causal (or spurious) by showing, through data, that there is an antecedent variable
whose control, as in the example we’ve just been working with, makes the relationship “disappear.” Pretty cool, huh?
On the other hand, while it's impossible to use data to show that a relationship is causal (for every two-variable relationship there are an infinite number of possible antecedent variables, and we just don't live long enough to test all the possibilities), it is possible to show that any single third variable that others hypothesize is creating the relationship between the relevant independent and dependent variables isn't really creating that relationship. Thus, for example, Harrell and his Alberta team might have heard Sternberg's claim that the wealth of "families" is the real reason why the attractiveness and care of children are related. And if they'd collected data like the following, they could have shown this claim was false. Can you use the data to do so? See if you can analyze the data and figure out what kind of elaboration Harrell et al. would have discovered.
Table 8. Hypothetical Data to Test Dr. Harrell
Case   Attractiveness   Care   Wealth
1      Pretty           Good   Poor
2      Pretty           Good   Poor
3      Ugly             Good   Poor
4      Ugly             Bad    Poor
5      Pretty           Good   Rich
6      Pretty           Good   Rich
7      Ugly             Good   Rich
8      Ugly             Bad    Rich
If you found that these data yielded a “replication,” you’re clearly on the brink of mastering
the elaboration model. The original relationship between attractiveness and care was that
pretty kids got better care than ugly kids (100% of pretty kids got it, compared to 50% of ugly
kids who did) and this relationship was strong (Yule’s Q= 1.00). But each of the partials was
just as strong (Yule’s Qs = 1.00), and had the same direction, as the original. What a replication shows is that the variable that was conceived of as an antecedent variable (wealth)
does not “explain” the original relationship at all. The relationship is just as strong in each
part of the sample defined by categories of the antecedent variable as it was before variation in this variable was controlled.
A Quick Word About Significance Levels
This chapter’s focus on the elaboration model and controlling relationships has been all
about making comparisons: primarily about comparing the strength of partial relationships
to the strength of original relationships (but sometimes, as you’ll soon see, comparing the
strength of partials to one another). We haven’t said a thing about comparing inferential
statistics and the resulting information about whether one dare generalize from a sample
to the larger population from which the sample has been drawn. This has been intentional.
You may recall (from the chapter on bivariate analyses) that the magnitude of chi-square
is directly related to the size of the sample: the larger the sample, given the same relationship, the greater the chi-square. When one controls a relationship between an independent
and dependent variable, however, one is dividing the sample into at least two parts, and,
depending on the number of categories of the control variable, potentially more. So comparing the chi-squares, and therefore the significance levels, of partials to that of an original is hardly a fair fight. The originals will always involve more cases than the partials. So
we usually limit our comparisons to those of strength (and sometimes direction), though if
a relationship loses its statistical significance when examining the partials this does mean
that the relationship cannot necessarily be generalized in its partial form.
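A quick numerical illustration of this point may help; this is our own toy example, not data from the text. The same pattern of association produces a chi-square exactly proportional to the number of cases, so splitting a sample into partials automatically shrinks chi-square:

from scipy.stats import chi2_contingency

table = [[30, 10],
         [10, 30]]                                       # some 2x2 relationship
bigger = [[cell * 10 for cell in row] for row in table]  # same pattern, ten times the cases

chi2_small, p_small, _, _ = chi2_contingency(table, correction=False)
chi2_big, p_big, _, _ = chi2_contingency(bigger, correction=False)

print(chi2_small)  # 20.0
print(chi2_big)    # 200.0 -- ten times larger, though the relationship is identical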
Having made this important point, however, we’ll let you loose on the two quizzes that
will end this chapter, each of which will introduce you to a new kind of elaboration.
Quiz #1 at the End of the Chapter
Show that the (hypothetical) sample data, below, conceivably collected to test Kearney and Levine's three-variable hypothesis (that adolescents who watched the show were more likely to seek contraceptive information than others, and that those who sought information were less likely to get pregnant than others) are illustrative of a "specification." For which category of the control variable (sought contraceptive information) is the relationship between having watched 16 and Pregnant and having gotten pregnant stronger? For which is it weaker?[2] Why would such data NOT support Kearney and Levine's hypothesis?
Case  Watched Show  Got Pregnant  Sought Contraceptive Information
1     Yes           No            Yes
2     Yes           No            Yes
3     No            Yes           Yes
4     No            Yes           Yes
5     Yes           Yes           No
6     Yes           No            No
7     No            Yes           No
8     No            No            No
Quiz #2 at the End of the Chapter

Suppose the data you collected to test Sternberg's hypothesis (that the relationship between the attractiveness of children and their care is a result of family wealth or social class) really looked like this:

Case  Attractiveness  Care  Wealth
1     Pretty          Good  Rich
2     Pretty          Good  Rich
3     Ugly            Bad   Rich
4     Ugly            Bad   Rich
5     Pretty          Bad   Poor
6     Pretty          Bad   Poor
7     Ugly            Good  Poor
8     Ugly            Good  Poor

What kind of elaboration would you have uncovered? What makes you say so? (Doesn't it seem odd that partial relationships can be stronger than original relationships? But they sure can. That's what Roger calls a "revelation.")
2. The original relationship, in this case, would be strong (|Yule's Q| = 0.80). The partial relationship for those who had sought contraceptive information, however, would be stronger (|Yule's Q| = 1.00), while that for those who had not would be very weak (|Yule's Q| = 0.00). You can specify, therefore, that the original relationship is particularly strong for those who'd sought contraceptive information and particularly weak for those who had not. Kearney and Levine's hypothesis had anticipated an "interpretation," but these data yield a specification. So the data would prove their hypothesis wrong.
Exercises

1. Write definitions, in your own words, for each of the following key concepts from this chapter:
◦ multivariate analysis
◦ antecedent variable
◦ intervening variable
◦ control variable
◦ extraneous variable
◦ spurious
◦ original relationship
◦ partial relationship
◦ elaboration
◦ interpretation
◦ replication
◦ explanation
◦ specification
◦ revelation
2. Below are real data from the GSS. See what you can make of them.
Who do you think would be more fearful of walking in their neighborhoods at night: males or females? Recalling that gamma = Yule's Q for 2 x 2 tables, what does the following table, and its accompanying statistics, tell you about the actual direction and strength of the relationship? Support your answer with details from the table.
Table 9. Crosstabulation of Gender (Sex) with Whether Respondent Reports Being Fearful of
Walking in the Neighborhood at Night (Fear), GSS Data from SDA
Frequency Distribution
Cells contain: column percent / weighted N

fear           sex: 1 (male)    sex: 2 (female)    ROW TOTAL
1: yes         22.2             51.6               38.0
               4,468.4          12,027.9           16,496.2
2: no          77.8             48.4               62.0
               15,620.2         11,294.1           26,914.3
COL TOTAL      100.0            100.0              100.0
               20,088.6         23,322.0           43,410.6
Means          1.78             1.48               1.62
Std Devs       .42              .50                .49
Unweighted N   19,213           24,158             43,371

Summary Statistics
Eta* = .30          Gamma = -.58    Rao-Scott-P: F(1,590) = 1,776.84 (p = 0.00)
R = -.30            Tau-b = -.30    Rao-Scott-LR: F(1,590) = 1,828.11 (p = 0.00)
Somers' d* = -.29   Tau-c = -.29    Chisq-P(1) = 3,936.97
                                    Chisq-LR(1) = 4,050.56
*Row variable treated as the dependent variable.
We controlled this relationship for “race,” a variable that had three categories: Whites, Blacks,
and others. Suppose you learned that the gamma for this relationship among Whites was -0.61,
among Blacks was -0.53 and among those identifying as members of other racial groups was
-0.44. What kind of elaboration, in your view, would you have uncovered? Justify your answer.
3. Please read the article by Robert Bartsch et al., entitled "Gender Representation in Television Commercials: Updating an Update" (Sex Roles, Vol. 43, Nos. 9/10, 2000: 735-743).[3] What is the main point of the article, in your view? What is the significance, according to Bartsch et al., of the gender of the voice-over in a commercial? Please examine Table 1 on page 739. Describe the overall gender breakdown of the voice-overs in 1998. Which gender was more represented in the voice-overs? Now look at the gender breakdown for voice-overs for domestic products and nondomestic products separately. Which of these is the stronger relationship: the one for domestic or the one for nondomestic products? What kind of elaboration would you say Bartsch et al. uncovered when they controlled the gender of voice-over for type of product (domestic or nondomestic)? How might you account for this finding?

3. If the link below doesn't work, perhaps you can hunt down an electronic copy of the article through your college's library service.
Media Attributions
• Figure 4.1
• Figure 4.2
• Diagramming an Extraneous Variable Relationship © Mikaila Mariel Lemonik Arthur
8. Correlation and Regression
ROGER CLARK
The chapter on bivariate analyses focused on ways to use data to demonstrate relationships between nominal and ordinal variables, and the chapter on multivariate analysis on controlling these relationships for other variables. This chapter will introduce you to the ways
scholars show relationships between interval variables and control those relationships with
other interval variables.
It turns out that the techniques presented in this chapter are by far the most likely ones
you’ll see used in research articles in the social sciences. There are a couple of reasons for
this popularity. One is that the techniques we’ll show you here are much less clumsy than
the ones we showed you in prior chapters. The other is that, despite what we led you to
believe in the chapter on univariate analysis (in our discussion of levels of measurement),
all variables, whatever their level of measurement, can, via an ingenious method, be converted into interval-level variables. This method may strike you at first as having a very modest name for an ingenious method: dummy variable creation. That is, until you realize that dummy does not always refer to a dumb person (a dated and offensive expression in any case); sometimes dummy refers to a "substitute," as it does in this case.
Dummy Variables
In fact, a dummy variable is a two-category variable that is used as an ordinal or interval
level variable. To understand how any variable, even a nominal-level variable, can be treated
as an ordinal or interval level variable, let’s recall the definitions of ordinal and interval level
variables.
An ordinal level variable is a variable whose categories can be ordered in some sensible
way. The General Social Survey (GSS) measure of “general happiness” has three categories:
very happy, happy, and not too happy. It’s easy to see how these three categories can
be ordered sensibly: “very happy” suggests more happiness than “happy,” which in turn
implies more happiness than “not too happy.” But we’d normally say that the variable “gender,” when limited to just two categories (female and male), is merely nominal. Neither category seems to have more of something than the other.
Not until you do a little conceptual blockbusting and think of the variable gender as a measure of how much "maleness" or "femaleness" a person has. If we coded, as the
GSS does, males as 1 and females as 2, we could say that a person's "gender," really "femaleness," is greater any time a respondent gets coded 2 (or female) than when s/he gets coded 1 (or male).[1] Then one could, as we've done in Table 1, ask for a crosstabulation of sex (really
“gender”) and happy (really level of unhappiness) and see that females, generally, were a
little happier than males in the U.S. in 2010, either by looking at the percentages or the
gamma—a measure of relationship generally reserved for two ordinal level variables. The gamma for this relationship is -0.09, indicating that, in 2010, as femaleness went up, unhappiness went down. Pretty cool, huh?
Table 1. Crosstabulation of Gender (Sex) and Happiness (Happy), GSS data from SDA,
2010
Frequency Distribution
Cells contain: column percent / weighted N

happy              sex: 1 (male)    sex: 2 (female)    ROW TOTAL
1: very happy      26.7             29.8               28.4
                   247.0            330.8              577.8
2: pretty happy    57.5             57.5               57.5
                   532.3            639.3              1,171.5
3: not too happy   15.9             12.7               14.1
                   146.9            141.0              287.8
COL TOTAL          100.0            100.0              100.0
                   926.1            1,111.0            2,037.1
Means              1.89             1.83               1.86
Std Devs           .64              .63                .64
Unweighted N       890              1,149              2,039
1. Professional statistical analysts usually use 0 and 1 rather than 1 and 2 when making dummy variables. This is
due to the fact that the numbers used can impact the interpretation of the regression constant, which is not
something beginning quantitative analysts need to worry about. Therefore, in this text, both approaches are
used interchangeably.
Summary Statistics
Eta* = .05          Gamma = -.09    Rao-Scott-P: F(2,156) = 1.95 (p = 0.15)
R = -.05            Tau-b = -.05    Rao-Scott-LR: F(2,156) = 1.94 (p = 0.15)
Somers' d* = -.05   Tau-c = -.05    Chisq-P(2) = 5.31
                                    Chisq-LR(2) = 5.30
*Row variable treated as the dependent variable.
We hope it’s now clear why and how a two-category (dummy) variable can be used as an
ordinal variable. But why and how can it be used as an interval variable? The answer to
this question also lies in a definition: this time, of an interval level variable. An interval level
variable, you may recall, is one whose adjacent categories are a standard or fixed distance
from each other. For example, on the Fahrenheit temperature scale, we think of 32 degrees
being the same distance from 33 degrees as 82 degrees is from 83 degrees. Returning to
what we previously might have said was only a nominal-level variable, gender (using here
just two categories: female and male), statisticians now ask us to ask: what is the distance between categories here? They answer: who really cares? Whatever it is, it's a standard distance because there's only one length of it to be covered. Every male, once coded, say, as a
“1,” is as far from the female category, once coded as a “2,” as every other male. And every
two-category (dummy) variable similarly consists of categories that are a “fixed” distance
from each other. We hope this kind of conceptual blockbusting isn’t as disorienting for you
as it was for us when we first had to wrap our heads around it.
But this leaves the question of how “every” nominal-level variable can become an ordinal
or interval level variable. After all, some nominal level variables have more than two categories. The GSS variable “labor force status” (wrkstat) has eight usable categories: working
full time, working part time, temporarily not working, unemployed, retired, school, keeping
house, and other.
But even this variable can become a two-category variable through recoding. Roger, for instance, was interested in seeing whether people who work full time were happier than other people, so he recoded so that there were only two categories: working full time and not working full time (wrkstat1). Then, using the Social Data Archive facility, he asked for the following crosstab (Table 2):
Table 2. Crosstabulation of Whether a Respondent Works Full Time (Wrkstat1) and Happiness (Happy), GSS Data via SDA
Frequency Distribution
Cells contain: column percent / weighted N

happy              1: Working Full Time    2: Not Working Full Time    ROW TOTAL
1: very happy      33.2                    32.8                        33.0
                   9,963.5                 9,860.2                     19,823.8
2: pretty happy    57.7                    53.2                        55.4
                   17,300.1                15,986.4                    33,286.5
3: not too happy   9.1                     14.0                        11.6
                   2,738.6                 4,223.8                     6,962.4
COL TOTAL          100.0                   100.0                       100.0
                   30,002.2                30,070.5                    60,072.7
Means              1.76                    1.81                        1.79
Std Devs           .60                     .66                         .63
Unweighted N       29,435                  30,604                      60,039
Summary Statistics
Eta* = .04          Gamma = .06     Rao-Scott-P: F(2,1132) = 120.24 (p = 0.00)
R = .04             Tau-b = .03     Rao-Scott-LR: F(2,1132) = 121.04 (p = 0.00)
Somers' d* = .04    Tau-c = .04     Chisq-P(2) = 368.93
                                    Chisq-LR(2) = 371.39
*Row variable treated as the dependent variable.
The gamma here (0.06) indicates that those working full time do tend to be happier than
others, but that the relationship is a weak one.
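If you were doing Roger's recode outside of SDA, it might look something like this minimal sketch in Python with the pandas library; the labels below are stand-ins for the actual GSS wrkstat codes:

import pandas as pd

# a few hypothetical responses standing in for the eight GSS wrkstat categories
wrkstat = pd.Series(["working full time", "retired", "working part time",
                     "working full time", "keeping house"])

# wrkstat1: 1 = working full time, 0 = anything else
wrkstat1 = (wrkstat == "working full time").astype(int)
print(wrkstat1.tolist())  # [1, 0, 0, 1, 0]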
We’ve suggested that dummy variables, because they are interval-level variables, can
be used in analyses designed for interval-level variables. But we haven’t yet said anything
about analyses aimed at looking at the relationship between interval level variables. Now
we will.
Correlation Analysis
The examination of relationships between interval-level variables is almost necessarily different from that of nominal or ordinal level variables. Doing crosstabulations of many interval level variables, for one thing, would involve very large numbers of cells, since almost every case would have its own distinct category on the independent and on the dependent variables.[2] Roger hypothesized, for instance, that as the percentage of residents who own guns in a state rises, the number of gun deaths in a year per 100,000 residents would also increase. He just went to a couple of websites and downloaded information about both variables for all 50 states in 2017. Here's how the data look for the first six cases:
Table 3. Gun Ownership and Shooting Deaths by State, 2016[3]

State        Percent of Residents Who Own Guns    Gun Shooting Deaths Per 100,000 Population
Alabama      52.8                                 21.5
Alaska       57.2                                 23.3
Arizona      36                                   15.2
Arkansas     51.8                                 17.8
California   16.3                                 7.9
Colorado     37.9                                 14.3
…            …                                    …
One could of course recode all this information, so that each variable was reduced to two or three categories. For example, we could say any state whose number of gun shooting deaths per 100,000 was less than 13 fell into a "low gun shooting category," and any state whose number was 13 or more fell into a "high gun shooting category." And do something like this for the percentage of residents who own guns as well. Then you could do a crosstabulation.
But think of all the information that’s lost in the process. Statisticians were dissatisfied with
this solution and early on noticed that there was a better way than crosstabulation to depict
the relationship between interval level variables.
That better way was to use what is called a scatterplot. A scatterplot is a visual depiction of the relationship between two interval level variables, in which the cases are represented as points on a graph with an x-axis and a y-axis. Thus, Figure
1 shows the “scatter” of the states when plotted with the percent of residents who are gun
owners along the x-axis and the gun shooting death per 100,000 along the y-axis.
2. Note that this would impact statistical significance, too, since there would be many cells in the table but few
cases in each cell.
3. Gun ownership data from Schell et al. 2020; death data from National Center for Health Statistics 2022.
Figure 1. Scatterplot of Gun Ownership Rates and Per Capita Gun Deaths by State
Each state is a distinct point in this plot. We've pointed in the figure to Massachusetts, the state with the lowest value on each of the variables (9 percent of residents own guns and there are 3.4 gun deaths per 100,000 population), but each of the 50 states is represented
by a point or dot on the graph. You’ll note that, in general, as the gun ownership rate rises,
the gun death rate does as well.
Karl Pearson (inventor of chi-square, you may recall) created a statistic, Pearson’s r, which
measures both the strength and direction of a relationship between two interval level variables, like the ones depicted in Figure 1. Like gamma, Pearson’s r can vary between 1 and -1.
The farther away Pearson’s r is from zero, or the closer it is to 1 or -1, the stronger the relationship. And, like gamma, a positive Pearson’s r indicates that as one variable increases,
the other tends to as well. And this is the kind of relationship depicted in Figure 1: as gun
ownership rises, the gun death rates tend to rise as well.
When the sign of Pearson’s r (or simply “r”) is negative, however, this means that as one
variable rises in values, the other tends to fall in values. Such a relationship is depicted in
Figure 2. Roger had expected that drug overdose death rates in states (measured as the
number of deaths due to drug overdoses per 100,000 people) would be negatively associated with the percentage of states’ residents reporting a positive sense of overall well being
in 2016. Figure 2 provides visual support for this hypothesis. Note that while in Figure 1 the
plot of points tends to move from bottom left to upper right on the graph (typical of positive relationships), the plot in Figure 2 tends to move from top left to bottom right (typical
of negative relationships).
Figure 2. Scatterplot of the Relationship Between Drug Overdose Death Rates and Population Wellbeing, by State
A Pearson’s r of 1.00 would not only mean that the relationship was as strong as it could
be, that as one variable goes up, the other goes up, but also that all points fall on a line
from bottom left to top right. A Pearson’s r of -1.00 would mean that the relationship was as
strong as it can be, that as one variable goes up, the other goes down, and that all points fall
on a line from top left to bottom right. Figure 3 illustrates what various graphs producing
various Pearson’s r values would look like.
Figure 3. Examples of Scatterplots with Different Values of the Pearson Correlation Coefficient (r)
The formula for calculating Pearson's r is not much fun to use by hand, but any statistical software will calculate it rapidly. The computer calculated the Pearson's r for the relationship between gun ownership and gun death rates for the 50 states, and it is indeed positive: Pearson's r is equal to 0.76. So it's a strong positive relationship. In contrast, the r for the relationship between the overall wellbeing of state residents and their overdose death rates is negative. The r turns out to be -0.50. So it's a strong negative relationship…though not quite as strong as the one for gun ownership and gun shootings. 0.76 is farther from zero than -0.50.
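For the curious, here is a sketch of checking such a correlation by machine. It uses only the six states shown in Table 3, so the resulting r (about 0.97) differs from the full 50-state value of 0.76 reported above:

import numpy as np

ownership = np.array([52.8, 57.2, 36.0, 51.8, 16.3, 37.9])  # percent owning guns
deaths = np.array([21.5, 23.3, 15.2, 17.8, 7.9, 14.3])      # gun deaths per 100,000

r = np.corrcoef(ownership, deaths)[0, 1]  # Pearson's r from the correlation matrix
print(round(r, 2))  # about 0.97 for these six states alone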
Most statistical packages will also calculate several correlations at once. Roger asked SPSS to calculate the correlations among three characteristics of states: drug overdose death rates, percent of residents saying they have high levels of overall well being, and whether a state is in the southeast or southwest of the country. (Roger thought states in the southeast and southwest—the South, for short—might have higher rates of drug overdose deaths than other states.) The results of this request are shown in Table 4. This table shows what is called a correlation matrix, and it's worth a moment of your time.
One reads the correlation between two variables by finding the cell where the column headed by one of the variables crosses the row headed by the other variable. The top number in the resulting box is the Pearson correlation for the two variables. Thus, if one goes down the column in Table 4 headed by "Drug Overdose Death Rate" and sees where it crosses the row headed by "Percent Reporting High Overall Well Being," one sees that their correlation is -0.495, which rounds to -0.50. (Research reports always round correlation coefficients to two digits after the decimal point.) This is what we reported above.
Table 4. Correlations Among Drug Overdose Death Rates, Levels of Overall Well Being, and Whether a State is in the American Southeast or Southwest
Correlations
                                             Drug Overdose    Percent Reporting High    South or
                                             Death Rate       Overall Well Being        Other
Drug Overdose Death Rate
    Pearson Correlation                      1                -.495**                   .094
    Sig. (2-tailed)                                           .000                      .517
    N                                        50               50                        50
Percent Reporting High Overall Well Being
    Pearson Correlation                      -.495**          1                         -.351*
    Sig. (2-tailed)                          .000                                       .012
    N                                        50               50                        50
South or Other
    Pearson Correlation                      .094             -.351*                    1
    Sig. (2-tailed)                          .517             .012
    N                                        50               50                        50
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Quiz at the end of the table: What is the correlation between the drug overdose death rate and
whether or not a state is in the South of the United States? And what does it mean?
If you answered, "I'm not sure," you're right! Whether a state is in the South is a dummy variable: the state can be either in the South, on the one hand, or in the rest of the country, on the other. But since we haven't told you how this variable is coded, you couldn't possibly know what the Pearson's r of 0.09 means. But once we tell you that Southern states were coded 1 and all others were coded 0, you should be able to see that Southern states tended to have higher drug overdose rates than others, but that the relationship isn't very strong. Then you'll also realize that the Pearson's r relating region and overall well being (-0.35) suggests that overall well being tends to be lower in Southern states than in others.
One other thing is worth mentioning about a correlation matrix yielded by SPSS…and about Pearson's r's. If you look at the box (there are actually two such boxes; can you find the other?) telling you the correlation between overdose rates and overall well being, you'll see two other numbers in it. The bottom number (50) is of course the number of cases in the sample (there are, after all, 50 states). But the one in the middle (0.000) gives you some idea
of the generalizability of the relationship (if there were, in fact, more states). A significance
level or p-value of 0.000 does NOT mean there is no chance of making a Type 1 error (i.e.,
the error we make when we infer that a relationship exists in the larger population from
which a sample is drawn when it does not), just that it’s lower than can be shown in an SPSS
printout. It does mean it is lower than 0.001 and therefore than 0.05, so inferring that such
a relationship would exist in a larger population is reasonably safe. Karl Pearson was, after
all, the inventor of chi-square and was always looking for inferential statistics. He found one
in Pearson’s r itself (imagine his surprise!) and figured out a way to use it to calculate the
probability of making a Type 1 error (or p value) for values of r with various sample sizes. We
don’t need to show you how this is done but we do want you to marvel at this: Pearson’s r
is a measure of direction, strength, and generalizability of the relationship all wrapped into
one.
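Here is a sketch of that three-in-one quality, using made-up data rather than the actual state file; scipy's pearsonr returns both r and its p value in a single call:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = -0.5 * x + rng.normal(scale=0.9, size=50)  # a built-in negative relationship

r, p = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.4f}")  # direction, strength, and generalizability at once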
There are several assumptions one makes when doing a correlation analysis of two variables. One, of course, is that both variables are interval-level. Another is that both are normally distributed. One can, with most statistical packages, do a quick check of the skewness of both variables. If the skewness of one or both is greater than 1.00 or less than -1.00, it is advisable to make a correction. Such corrections are pretty easy, but showing you how to do them is beyond our scope here. Roger did check on the variables in Table 4, found that the drug overdose rate was slightly skewed, corrected for the skewness, and found the correlations among the variables were very little changed.
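If you wanted to run that quick skewness check yourself, a minimal sketch (with made-up rates, and a log transformation as one common correction) might look like this:

import numpy as np
from scipy.stats import skew

rates = np.array([3.4, 5.1, 6.2, 7.0, 8.3, 9.9, 12.4, 15.8, 22.7, 41.5])  # made-up rates

skewness = skew(rates)
print(skewness)  # about 1.6 here, beyond the +/- 1.00 rule of thumb
if abs(skewness) > 1:
    print(skew(np.log(rates)))  # a log transform pulls in the long right tail (well under 1)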
A third assumption of correlation is that the relationship between the two variables is linear. A linear relationship is one in which a good description of its scatterplot is that it tends to conform to a straight line, rather than some other figure, like a U-shape or an upside-down U-shape. This seems to be true of the relationships shown in Figures 1 and 2. One can almost imagine that a line from bottom left to top right, for instance, is a pretty good way of describing the relationship in Figure 1, and we've done so in Figure 4. It's certainly not easy to see that any curved line would "fit" the points in that figure any better than a straight line does.
Figure 4. Scatterplot of Gun Ownership Rates and Per Capita Gun Deaths by State, with Trendline
Regression Analysis
But the assumption of a linear relationship raises the question of “Which line best describes
the relationship?” It may not surprise you to learn that statisticians have a way of figuring
out what that line is. It’s called regression. Regression is a technique that is used to see how
an interval-level dependent variable is affected by one or more interval-level independent
variables. For the moment, we're going to leave aside the very tantalizing "or more" part of that definition and focus on how regression analysis can provide even more insight into the relationship between two variables than correlation analysis does.
We call regression simple linear regression when we’re simply examining the relationship between two variables. It’s called multiple regression or multivariate regression when
we’re looking at the relationship between a dependent variable and more than one independent variable. Correlation, as we’ve said, can tell us about the strength, direction, and
generalizability of the relationship between two interval level variables. Simple linear
regression can tell us the same things, while adding information that can help us use an
independent variable to predict values of a dependent variable. It does this by telling us the
formula for the line of best fit for the points in a scatterplot. The line of best fit is a line that
minimizes the distance between itself and all of the points in a scatterplot.
To get the flavor of this extra benefit of regression, we need to recall the formula for a line:
y = a + bx

where a is the y-intercept (the value of y when x equals zero) and b is the slope (the change in y for each one-unit increase in x).
What simple linear regression does, in the first instance, is find the line that comes closest
to all of the points in the scatterplot. Roger, for instance, used SPSS to do a regression of the
gun shooting death rate by state on the percentage of residents who own guns (this is the
vernacular used by statisticians: they regress the dependent variable on the independent
variable[s]). Part of the resulting printout is shown in Table 5.
Table 5. Partial Printout from Request for Regression of Gun Shooting Death Rate Per 100,000 on Percentage of Residents Owning Guns

Coefficients(a)
                     Unstandardized Coefficients    Standardized Coefficients
Model                B           Std. Error         Beta                         t        Sig.
1   (Constant)       2.855       1.338                                           2.134    .038
    Gun Ownership    .257        .032               .760                         8.111    <.001
a. Dependent Variable: Gun Shooting Death Rate
Note that under the column labeled "B" under "Unstandardized Coefficients," one gets two numbers: 2.855 in the row labeled (Constant) and 0.257 in the row labeled "Gun Ownership." The (Constant) 2.855, rounded to 2.86, is the y-intercept[4] (the "a" in the equation above) for
the line of best fit. The 0.257, rounded to 0.26, is the slope for that line. So what this regression tells us is that the line of best fit for the relationship between gun shooting deaths and
gun ownership is:
Gun shooting death rate = 2.86 + 0.26 * (% of residents owning guns)
Correlation, we’ve noted, provides information about the strength, direction and generalizability of a relationship. By generating equations like this, regression gives you all those
things (as you’ll see in a minute), but also a way of predicting what (as-yet-incompletelyknown) subjects will score on the dependent variable when one has knowledge of the their
4. This constant is the number that is impacted by whether we choose to code our dummy variable as 0 and 1 or
as 1 and 2. As you can see, this choice impacts the equation of the line, but otherwise does not impact our
interpretation of these results.
values on the independent variable. It permitted us, for instance, to draw the line of best fit
into Figure 4, one that has a y-intercept of about 2.86 and a slope of about 0.26. And suppose you knew that about 50 percent of a state’s residents owned guns. You could predict
the gun death rate of the state by substituting “50” for the “% of residents owning guns”
and get a prediction that:
Gun shooting death rate = 2.86 + 0.26 (50) = 2.86 + 13 = 15.86
Or that 15.86 per 100,000 residents will have experienced gun shooting deaths. Another output of regression is something called R squared. R squared x 100 (because we are converting from a decimal to a percentage) tells you the approximate percentage of variation in the dependent variable that is "explained" by the independent variable. "Explained" here is a slightly fuzzy term that can be thought of as referring to how closely the points on a scatterplot come to the line of best fit. In the case of the gun shooting death rate and gun ownership, the R squared is 0.578, meaning that about 58 percent of the variation in the gun
percentage of variance explained, by sociology and justice studies standards, but would
mean one’s predictions using the regression formula are likely to be off by a bit, sometimes
quite a bit.
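To make the mechanics concrete, here is a sketch that recovers a line of best fit, an R squared, and a prediction by machine. It uses only the six states listed in Table 3, so its coefficients differ from the full-sample results reported above:

import numpy as np

ownership = np.array([52.8, 57.2, 36.0, 51.8, 16.3, 37.9])
deaths = np.array([21.5, 23.3, 15.2, 17.8, 7.9, 14.3])

slope, intercept = np.polyfit(ownership, deaths, 1)  # degree-1 polynomial = a line
predicted = intercept + slope * ownership
r_squared = 1 - ((deaths - predicted) ** 2).sum() / ((deaths - deaths.mean()) ** 2).sum()

print(f"deaths = {intercept:.2f} + {slope:.2f} * ownership; R squared = {r_squared:.2f}")
print(intercept + slope * 50)  # predicted death rate where 50 percent own guns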
The prediction bonus of regression is very profitable in some disciplines, like finance and
investing. And predictions can even get pretty good in the social sciences if more variables are brought into play. We’ll show you how this gets done in a moment, but first a
word about how regression, like correlation, also provides information about the direction,
strength and generalizability of a two-variable relationship. If you return to Table 5, you'll find in the column labeled "Standardized Coefficients," beta (sometimes represented as β),
the number 0.760. You may recall that the Pearson’s r of the relationship between the gun
shooting death rate and the percentage of residents who own guns was also 0.76, and
that’s no coincidence. The beta in simple regression is always the same as the Pearson’s r
for the bivariate relationship. Moreover, you’ll find at the end of the row headed by “Gun
Ownership” a significance level (<0.001)—which was exactly the same as the one for the
original Pearson’s r. In other words, through beta and the significance level associated with
an independent variable we can, just as we could with Pearson’s r, ascertain the direction,
strength and generalizability of a relationship.
But beta’s meaning is just a little different from Pearson’s r’s. Beta actually tells you the
correlation between the relevant independent variable and the dependent variable when
all other independent variables in the equation or model are controlled. That’s a mouthful,
we know, but it’s a magical mouthful, as we’re about to show you. In fact, the reason that
the beta in the regression above is the same as the relevant Pearson’s r is that there are no
other independent variables involved. But let’s now see what happens when there are….
Multiple Regression
Multiple regression (also called multivariate regression), as we’ve said before, is a technique
that permits the examination of the relationship between a dependent variable and several
independent variables. But to put it this way is somehow to diminish its magic. This magic is one reason that, of all the quantitative data analytic techniques we've talked about in this book, multiple regression is probably the most popular among social researchers. Let's see
why with a simple example.
Roger’s taught classes in the sociology of gender and has long been interested in the
question of why women are better represented in some countries’ governments than in
others. For example, why are women better represented in the national legislatures in
many Scandinavian countries than they are, say, in the United States? In 2020, the United
States achieved what was then a new high in female representation in the House of Representatives—23.4 percent of the House’s seats were held by women after the election of
the year before—while in Sweden 47 percent of the seats (almost twice as many) were held by women (Inter-Parliamentary Union 2022).[5] He also knew that women were better represented in the legislatures of certain countries where some kind of quota for women's representation in politics had been established. Thus, in Rwanda, where a bitter civil war tore the
country apart in 1994, a new leader, Paul Kagame, felt it wise to bring women into government and established a law that women should constitute at least 30 percent of all government decision-making bodies. In 2019, 61 percent of the members of Rwanda’s lower house
were women—by far the greatest percentage in the world and more than two and a half
times as many as in the United States. Most Scandinavian countries also have quotas for
women’s representation.
In any case, Roger and three students (Rebecca Teczar, Katherine Rocha, and Joseph
Palazzo), being good social scientists, wondered whether the effect of quotas might be at
least partly attributable to cultural beliefs—say, to beliefs that men are better suited for politics than women. And, lo and behold, they found an international survey that measured
such an attitude in more than 50 countries: the 2014 World Values Survey. They (Teczar et al.) found that for those countries, the correlation between the presence of some kind of quota and the percentage of women in the national legislature was pretty strong (r = 0.31), but that the correlation between the percentage of the population that thought men were better suited for politics and the presence of women in the legislature was even stronger (r = -0.46).
Still, they couldn’t be sure that the correlation of one of these independent variables with
5. When the U.S. Congress goes into session in 2023, the House of Representatives will be 28.5% women (Center
for American Women in Politics 2022). Sweden has stayed about the same, while in Rwanda, women now
make up 80% of members of the lower house (Inter-Parliamentary Union 2022).
the dependent variable wasn’t at least partly due to the effects of the other variable. (Look
out: we’re about to use the language of the elaboration model outlined in the chapter on
multivariate analysis.)
One possibility, for instance, was that the relationship between the presence of quotas and women's participation in legislatures was the spurious result of the effects of attitudes about women's (or men's) suitability for office on both the creation of quotas promoting women's access to legislatures and on that access itself. If this position had been proven correct, they would
have discovered that there was an “explanation” for the relationship between quotas and
women’s representation. But they would have had to see the correlation between the presence of quotas and women’s participation in legislatures drop considerably when attitudes
were controlled for this position to be borne out.
On the other hand, it might have been that attitudes about women’s suitability made
it more (or less) likely that countries would adopt quotas, which in turn made it more
likely that women would be elected to parliaments. Had the data supported this view, if,
that is, the controlled association between attitudes and women’s presence in parliaments
dropped when the presence of quotas was controlled, we would have discovered an “interpretation” and might have interpreted the presence of quotas as the main way in which
positive attitudes towards women in politics affected their presence in parliaments.
As it turns out, there was support, though nowhere near complete support, for both positions. Thus, the beta for the attitudinal question (-0.41) is slightly weaker than the original correlation (-0.46), suggesting that some of the effect of cultural attitudes on women's parliamentary participation may be accounted for by their effects on the presence of quotas and
the quotas’ effects on participation. But the beta for the presence of quotas (0.23) is also
weaker than its original correlation with women in parliaments (0.31), suggesting that some
of its association with women in parliament may be due to the direct effects of attitudes
on both the presence of quotas and on women in parliament. The R squared for this model
(0.26) involving the two independent variables is considerably greater than it was for models involving each independent variable alone (0.20 for the attitudinal variable; 0.09 for the
quota variable), so the two together explain more of the variance in women’s presence in
parliaments than either does alone.
But an R squared of 0.26 suggests that even if we used the formula that multiple regression gives us for predicting women’s percentage of a national legislature from knowledge
of whether a country had quotas for women and the percentage agreeing that men are
better at politics, our prediction might not be all that good. That formula, though, is again
provided by the numbers in the column headed by "B" under "Unstandardized Coefficients." That column yields the formula for the three-dimensional analogue of a line (a plane), if you can imagine:
Fraction of Legislature that is Female = 0.286 + 0.067 (Presence of Quota) – 0.002 (Percentage Thinking Men More Suitable)
If a country had a quota and 10% of the population thought men were better suited for
politics, we would predict that the fraction of the legislature that was female would be
0.286 + 0.067 (1) – 0.002 (10) = 0.333 or that 33.3 percent of the legislature would be female.
Because such a prediction would be so imperfect, though, social scientists usually wouldn’t
make too much of it. It’s frequently the case that sociologists and students of justice are
more interested in multiple regression for its theory testing, rather than its predictive function.
Table 6. Regression of Women in the Legislature by Country on the Presence of a Quota for Women in Politics and the Percent of the Population Agreeing that Men Are More Suitable for Politics than Women

Coefficients(a)
                                                    Unstandardized          Standardized
                                                    Coefficients            Coefficients
Model                                               B         Std. Error    Beta      t        Sig.
1   (Constant)                                      .286      .049                    5.828    .000
    Presence of a quota for women in politics       .067      .037          .228      1.819    .075
    Percent agreeing that men are more suitable
    for politics than women                         -.002     .001          -.412     -3.289   .002
a. Dependent Variable: women in legislature 2017
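If you want to verify the worked prediction above by machine, the arithmetic is easy to check; the coefficients come from Table 6:

b0, b_quota, b_attitude = 0.286, 0.067, -0.002   # unstandardized coefficients from Table 6
prediction = b0 + b_quota * 1 + b_attitude * 10  # a quota, 10 percent agreeing
print(round(prediction, 3))  # 0.333, i.e., about 33.3 percent of legislative seats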
These are the kinds of lessons one can learn from multiple regression. Two things here
are worthy of note. First, the variable “presence of a quota for women in parliament” is a
dummy variable treated in this analysis just as seriously as one would any other interval
level variable. Second, we could have added any number of other independent variables
into the model, as you’ll see when you read the article referred to in Exercise 4 below. And
any of them could have been a dummy variable. (We might, for instance, have included
a dummy variable for whether a country was Scandinavian or not.) Multiple regression, in
short, is a truly powerful, almost magical technique.
Exercises

1. Write definitions, in your own words, for each of the following key concepts from this chapter:
◦ dummy variable
◦ scatterplot
◦ Pearson's r
◦ linear relationship
◦ regression
◦ line of best fit
◦ simple linear regression
◦ multiple regression
◦ R squared
◦ beta
2. Return to the Social Data Archive we've explored before. The data, again, are available at https://sda.berkeley.edu/. (You may have to copy this address and paste it into your browser.) Again, go down to the second full paragraph and click on the "SDA Archive" link you'll find there. Then scroll down to the section labeled "General Social Surveys" and click on the first link there: General Social Survey (GSS) Cumulative Datafile 1972-2021 release.
For this exercise, you’ll need to come up with three hypotheses:
1) Who do you think will have more offspring: older or younger adults?
2) People with more education or less?
3) Protestants or other Americans?
Now you need to test these hypotheses from the GSS, using correlation analysis. To do this, you’ll first
need to make a dummy variable of religion. First, put “relig” in the “Variable Selection” box on the left and
hit “View.” How many categories does “relig” have? This is how to reduce those categories to just two.
First, hit the “create variables” button at the upper left. Then, on the right, name the new variable something like “Protestant.” (Some other student may have done this first. If so, you may want to use “their”
variable.) The label for the new variable could be something like “Protestant or other.” Then put “relig” in
the “Name(s) of existing variables” box and click on the red lettering below. There should be a bunch of
boxes down below. Put a “1” in the first box on the left, give the category a name like “Protestant,” and
put “1” for the Protestant category of “relig” on the right. Then go down one row and put “0” in the first
box on the left in the row, label the category “other,” and put “2-13” in the right-hand box of the row. This
will put all other religions listed in “relig” in the “other” category of “Protestant.” Then go to the bottom
and hit “Start recoding.” If no one else has done this yet, you should see a frequency distribution for your
new variable. If someone else has done it, you may use their variable for the rest of this exercise.
Now hit the “analysis” button at the upper left. Choose “Correl. Matrix” (for “correlation matrix”) for the
kind of analysis. Now put the four variables of interest for this exercise (“childs,” “age,” “educ,” and “Protestant”) in the first four “Variables to Correlate” boxes. Now go to the bottom and hit “Run correlations.”
Report the correlations between the three independent variables (age, educ and Protestant) and your
dependent variable (childs). Do the correlations support your hypotheses? Which hypothesis receives the
strongest support? Which the weakest? Were any of your hypotheses completely falsified by the analysis?
3.
Now let’s use the same data that we used in Exercise 1 to do a multiple regression analysis. You’ll first
need to leave the Social Data Archive and get back in again, returning to the GSS link. This time, instead
of hitting “Correl. Matrix,” hit “Regression.” Then put “Childs” in the “Dependent” variable box and “Age,”
“Educ,” and “Protestant” in three of the “Independent variables” boxes. Hit “Run regression.” Which of the
independent variables retains the strongest association with the number of children a respondent has
when all other variables in the model are controlled? What is that association? Which has the weakest
when other variables are controlled?
4.
Please read the following article:
Teczar, Rebecca, Katherine Rocha, Joseph Palazzo, and Roger Clark. 2018. “Cultural Attitudes towards
Women in Politics and Women’s Political Representation in Legislatures and Cabinet Ministries.” Sociol­
ogy Between the Gaps: Forgotten and Neglected Topics 4(1):1-7.
In the article, Teczar et al. use a multiple regression technique, called stepwise regression, which in this
case only permits those variables that have a statistically significant (at the 0.05 level) controlled association into the model.
a. What variables do Teczar et al. find have the most significant controlled associations with women in
national parliaments? Where do you find the relevant statistics in the article?
b. What variables do Teczar et al. find have the most significant controlled association with women in
ministries? Where do you find the relevant statistics in the article?
c. Which model—the one for parliaments or the one for ministries (or cabinets)—presented in the article has the greater explanatory power? (i.e., which one explains more of the variation in the dependent
variable?) How can you tell?
d. Do you agree with the authors’ point (at the end) that political attitudes, while tough to change, are
not unchangeable? Can you think of any contemporary examples not mentioned in the conclusion that
might support this point?
Media Attributions
• Scatterplot of Gun Ownership Rates and Per Capita Gun Deaths by State © Mikaila
Mariel Lemonik Arthur
• Scatterplot of the Relationship Between Drug Overdose Death Rates and Population Wellbeing, By State © Roger Clark
• correlation-coefficients-1 © Kiatdd adapted by Mikaila Mariel Lemonik Arthur is
licensed under a CC BY-SA (Attribution ShareAlike) license
• Scatterplot of Gun Ownership Rates and Per Capita Gun Deaths by State, with Trendline © Mikaila Mariel Lemonik Arthur
9. Presenting the Results of
Quantitative Analysis
MIKAILA MARIEL LEMONIK ARTHUR
This chapter provides an overview of how to present the results of quantitative analysis, in
particular how to create effective tables for displaying quantitative results and how to write
quantitative research papers that effectively communicate the methods used and findings
of quantitative analysis.
Writing the Quantitative Paper
Standard quantitative social science papers follow a specific format. They begin with a title
page that includes a descriptive title, the author(s)’ name(s), and a 100 to 200 word abstract
that summarizes the paper. Next is an introduction that makes clear the paper’s research
question, details why this question is important, and previews what the paper will do. After
that comes a literature review, which ends with a summary of the research question(s) and/
or hypotheses. A methods section, which explains the source of data, sample, and variables
and quantitative techniques used, follows. Many analysts will include a short discussion of
their descriptive statistics in the methods section. A findings section details the findings
of the analysis, supported by a variety of tables, and in some cases graphs, all of which
are explained in the text. Some quantitative papers, especially those using more complex
techniques, will include equations. Many papers follow the findings section with a discussion section, which provides an interpretation of the results in light of both the prior literature and theory presented in the literature review and the research questions/hypotheses.
A conclusion ends the body of the paper. This conclusion should summarize the findings,
answering the research questions and stating whether any hypotheses were supported,
partially supported, or not supported. Limitations of the research are detailed. Papers typically include suggestions for future research, and where relevant, some papers include policy implications. After the body of the paper comes the works cited; some papers also have
an Appendix that includes additional tables and figures that did not fit into the body of
the paper or additional methodological details. While this basic format is similar for papers
regardless of the type of data they utilize, there are specific concerns relating to quantitative research in terms of the methods and findings that will be discussed here.
Methods
In the methods section, researchers clearly describe the methods they used to obtain and
analyze the data for their research. When relying on data collected specifically for a given
paper, researchers will need to discuss the sample and data collection; in most cases,
though, quantitative research relies on pre-existing datasets. In these cases, researchers
need to provide information about the dataset, including the source of the data, the time
it was collected, the population, and the sample size. Regardless of the source of the data,
researchers need to be clear about which variables they are using in their research and
any transformations or manipulations of those variables. They also need to explain the specific quantitative techniques that they are using in their analysis; if different techniques
are used to test different hypotheses, this should be made clear. In some cases, publications will require that papers be submitted along with any code that was used to produce
the analysis (in SPSS terms, the syntax files), which more advanced researchers will usually
have on hand. In many cases, basic descriptive statistics are presented in tabular form and
explained within the methods section.
Findings
The findings sections of quantitative papers are organized around explaining the results as
shown in tables and figures. Not all results are depicted in tables and figures—some minor
or null findings will simply be referenced—but tables and figures should be produced for
all findings to be discussed at any length. If there are too many tables and figures, some
can be moved to an appendix after the body of the text and referred to in the text (e.g. “See
Table 12 in Appendix A”).
Discussions of the findings should not simply restate the contents of the table. Rather,
they should explain and interpret it for readers, and they should do so in light of the hypothesis or hypotheses that are being tested. Conclusions—discussions of whether the hypothesis or hypotheses are supported or not supported—should wait for the conclusion of the
paper.
Creating Effective Tables
When creating tables to display the results of quantitative analysis, the most important
goals are to create tables that are clear and concise but that also meet standard conventions in the field. This means, first of all, paring down the volume of information produced
in the statistical output to just include the information most necessary for interpreting the
results, but doing so in keeping with standard table conventions. It also means making
tables that are well-formatted and designed, so that readers can understand what the
tables are saying without struggling to find information. For example, tables (as well as figures such as graphs) need clear captions; they are typically numbered and referred to by
number in the text. Columns and rows should have clear headings. Depending on the content of the table, formatting tools may need to be used to set off header rows/columns and/
or total rows/columns; cell-merging tools may be necessary; and shading may be important in tables with many rows or columns.
Here, you will find some instructions for creating tables of results from descriptive,
crosstabulation, correlation, and regression analysis that are clear, concise, and meet normal standards for data display in social science. In addition, after the instructions for creating tables, you will find an example of how a paper incorporating each table might describe
that table in the text.
Descriptive Statistics
When presenting the results of descriptive statistics, we create one table with columns for
each type of descriptive statistic and rows for each variable. Note, of course, that depending on level of measurement only certain descriptive statistics are appropriate for a given
variable, so there may be many cells in the table marked with an — to show that this statistic is not calculated for this variable. So, consider the set of descriptive statistics below, for
occupational prestige, age, highest degree earned, and whether the respondent was born
in this country.
Table 1. SPSS Output: Selected Descriptive Statistics

Statistics
                            R's occupational prestige score (2010)    Age of respondent
N            Valid          3873                                      3699
             Missing        159                                       333
Mean                        46.54                                     52.16
Median                      47.00                                     53.00
Std. Deviation              13.811                                    17.233
Variance                    190.745                                   296.988
Skewness                    .141                                      .018
Std. Error of Skewness      .039                                      .040
Kurtosis                    -.809                                     -1.018
Std. Error of Kurtosis      .079                                      .080
Range                       64                                        71
Minimum                     16                                        18
Maximum                     80                                        89
Percentiles  25             35.00                                     37.00
             50             47.00                                     53.00
             75             59.00                                     66.00

R's highest degree
                                       Frequency    Percent    Valid Percent    Cumulative Percent
Valid    less than high school         246          6.1        6.1              6.1
         high school                   1597         39.6       39.8             46.0
         associate/junior college      370          9.2        9.2              55.2
         bachelor's                    1036         25.7       25.8             81.0
         graduate                      760          18.8       19.0             100.0
         Total                         4009         99.4       100.0
Missing  System                        23           .6
Total                                  4032         100.0

Statistics: R's highest degree
N Valid 4009; Missing 23; Median 2.00; Mode 1; Range 4; Minimum 0; Maximum 4

Was r born in this country
                  Frequency    Percent    Valid Percent    Cumulative Percent
Valid    yes      3516         87.2       88.8             88.8
         no       444          11.0       11.2             100.0
         Total    3960         98.2       100.0
Missing  System   72           1.8
Total             4032         100.0

Statistics: Was r born in this country
N Valid 3960; Missing 72; Mean 1.11; Mode 1
To display these descriptive statistics in a paper, one might create a table like Table 2. Note that for discrete variables, we use the value label in the table, not the value.

Table 2. Descriptive Statistics

                      Occupational      Age          Highest Degree            Born in This
                      Prestige Score                 Earned                    Country?
Mean                  46.54             52.16        —                         1.11
Median                47                53           2: Associate's (9.2%)     1: Yes (88.8%)
Mode                  —                 —            1: High School (39.8%)    —
Standard Deviation    13.811            17.233       —                         —
Variance              190.745           296.988      —                         —
Skewness              0.141             0.018        —                         —
Kurtosis              -0.809            -1.018       —                         —
Range                 64 (16-80)        71 (18-89)   Less than High School     —
                                                     (0) – Graduate (4)
Interquartile Range   35-59             37-66        —                         —
N                     3873              3699         4009                      3960
If we were then to discuss our descriptive statistics in a quantitative paper, we might write
something like this (note that we do not need to repeat every single detail from the table,
as readers can peruse the table themselves):
This analysis relies on four variables from the 2021 General Social Survey: occupational prestige score, age, highest degree earned, and whether the respondent was
born in the United States. Descriptive statistics for all four variables are shown in
Table 2. The median occupational prestige score is 47, with a range from 16 to 80.
50% of respondents had occupational prestige scores between 35 and 59. The median age of respondents is 53, with a range from 18 to 89. 50% of respondents are between ages 37 and 66. Both variables have little skew. Highest degree earned
ranges from less than high school to a graduate degree; the median respondent
has earned an associate’s degree, while the modal response (given by 39.8% of the
respondents) is a high school degree. 88.8% of respondents were born in the United
States.
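Although this chapter presents SPSS output, the same descriptive statistics can be produced in other environments. Below is a minimal Python sketch using the pandas library; the DataFrame gss and the column names (prestige, age, degree) are hypothetical stand-ins for however the analyst has loaded and named their own data.

```python
import pandas as pd

# Hypothetical setup: `gss` is a DataFrame already loaded from the dataset,
# e.g. gss = pd.read_csv("gss2021.csv")  # file name is illustrative only

def describe_continuous(series):
    """Descriptive statistics appropriate for a continuous variable."""
    return pd.Series({
        "Mean": series.mean(),
        "Median": series.median(),
        "Std. Deviation": series.std(),
        "Variance": series.var(),
        "Skewness": series.skew(),
        "Kurtosis": series.kurt(),
        "Minimum": series.min(),
        "Maximum": series.max(),
        "25th Percentile": series.quantile(0.25),
        "75th Percentile": series.quantile(0.75),
        "N": series.count(),
    })

# One column per continuous variable, one row per statistic
print(pd.DataFrame({
    "Occupational Prestige": describe_continuous(gss["prestige"]),
    "Age": describe_continuous(gss["age"]),
}).round(3))

# For a discrete variable, report a frequency distribution instead
print(pd.DataFrame({
    "Frequency": gss["degree"].value_counts(),
    "Valid Percent": (gss["degree"].value_counts(normalize=True) * 100).round(1),
}))
```

The resulting output would still need to be reformatted by hand into a publication-style table like Table 2.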
Crosstabulation
When presenting the results of a crosstabulation, we simplify the table so that it highlights
the most important information—the column percentages—and include the significance
and association below the table. Consider the SPSS output below.
Table 3. R’s highest degree * R’s subjective class identification Crosstabulation

Cells show Count (% within R’s subjective class identification).

R’s highest degree | Lower class | Working class | Middle class | Upper class
less than high school | 65 (18.8%) | 106 (7.1%) | 68 (3.4%) | 7 (4.2%)
high school | 217 (62.9%) | 800 (53.7%) | 551 (27.6%) | 23 (13.9%)
associate/junior college | 30 (8.7%) | 191 (12.8%) | 144 (7.2%) | 3 (1.8%)
bachelor’s | 27 (7.8%) | 269 (18.1%) | 686 (34.4%) | 49 (29.5%)
graduate | 6 (1.7%) | 123 (8.3%) | 546 (27.4%) | 84 (50.6%)
Total | 345 (100.0%) | 1489 (100.0%) | 1995 (100.0%) | 166 (100.0%)
Chi-Square Tests

Test | Value | df | Asymptotic Significance (2-sided)
Pearson Chi-Square | 819.579a | 12 | <.001
Likelihood Ratio | 839.200 | 12 | <.001
Linear-by-Linear Association | 700.351 | 1 | <.001
N of Valid Cases | 3995 | |
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.22.

Symmetric Measures

Measure | Value | Asymptotic Standard Errora | Approximate Tb | Approximate Significance
Interval by Interval: Pearson’s R | .419 | .013 | 29.139 | <.001c
Ordinal by Ordinal: Spearman Correlation | .419 | .013 | 29.158 | <.001c
N of Valid Cases | 3995 | | |
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
Table 4 shows how a table suitable for inclusion in a paper might look if created from the SPSS output in Table 3. Note that we use asterisks to indicate the significance level of the results: * means p < 0.05; ** means p < 0.01; *** means p < 0.001; and no asterisks mean p > 0.05 (and thus that the result is not significant). Also note that N is the abbreviation for the number of respondents.
Table 4. Crosstabulation of Highest Degree Earned by Respondent’s Subjective Class Identification

Highest Degree Earned | Lower Class | Working Class | Middle Class | Upper Class | Total
Less than High School | 18.8% | 7.1% | 3.4% | 4.2% | 6.2%
High School | 62.9% | 53.7% | 27.6% | 13.9% | 39.8%
Associate’s / Junior College | 8.7% | 12.8% | 7.2% | 1.8% | 9.2%
Bachelor’s | 7.8% | 18.1% | 34.4% | 29.5% | 25.8%
Graduate | 1.7% | 8.3% | 27.4% | 50.6% | 19.0%
N: 3995. Spearman Correlation 0.419***
If we were going to discuss the results of this crosstabulation in a quantitative research
paper, the discussion might look like this:
A crosstabulation of respondents’ class identification and their highest degree earned, with class identification as the independent variable, is significant, with a Spearman correlation of 0.419, as shown in Table 4. Among lower-class and working-class respondents, more than 50% had a high school degree as their highest degree earned. Less than 20% of lower-class respondents and less than 40% of working-class respondents had earned more than a high school degree. In contrast, the majority of middle-class and upper-class respondents had earned at least a bachelor’s degree. In fact, just over 50% of upper-class respondents had earned a graduate degree.
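For readers who work in Python rather than SPSS, a comparable crosstabulation can be sketched with pandas and scipy. This is an illustration only: the DataFrame gss and the column names are hypothetical, and the two ordinal variables must be numerically coded before a Spearman correlation can be computed.

```python
import pandas as pd
from scipy import stats

def stars(p):
    """Significance asterisks following the convention described above."""
    return "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""

# Hypothetical columns: `degree` (row variable) and `class_id` (column variable)
observed = pd.crosstab(gss["degree"], gss["class_id"])

# Column percentages, the key numbers highlighted in Table 4
print((pd.crosstab(gss["degree"], gss["class_id"], normalize="columns") * 100).round(1))

# Pearson chi-square test of independence
chi2, p, df, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p:.3g}")

# Spearman correlation on the numeric codes for the two ordinal variables
rho, p_rho = stats.spearmanr(gss["degree_code"], gss["class_code"], nan_policy="omit")
print(f"Spearman correlation = {rho:.3f}{stars(p_rho)}")
```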
Correlation
When presenting a correlating matrix, one of the most important things to note is that we
only present half the table so as not to include duplicated results. Think of the line through
the table where empty cells exist to represent the correlation between a variable and itself,
and include only the triangle of data either above or below that line of cells. Consider the
output in Table 5.
Table 5. SPSS Output: Correlations

Pearson correlations, with two-tailed significance and N for each pair:

Variable | Age of respondent | R’s occupational prestige score (2010) | Highest year of school R completed | R’s family income in 1986 dollars
Age of respondent | 1 (N = 3699) | .087** (p < .001, N = 3571) | .014 (p = .391, N = 3683) | .017 (p = .314, N = 3336)
R’s occupational prestige score (2010) | .087** (p < .001, N = 3571) | 1 (N = 3873) | .504** (p < .001, N = 3817) | .316** (p < .001, N = 3399)
Highest year of school R completed | .014 (p = .391, N = 3683) | .504** (p < .001, N = 3817) | 1 (N = 3966) | .360** (p < .001, N = 3497)
R’s family income in 1986 dollars | .017 (p = .314, N = 3336) | .316** (p < .001, N = 3399) | .360** (p < .001, N = 3497) | 1 (N = 3509)
**. Correlation is significant at the 0.01 level (2-tailed).
Table 6 shows what the contents of Table 5 might look like when a table is constructed in a
fashion suitable for publication.
Table 6. Correlation Matrix

Variable | Age | Occupational Prestige Score | Highest Year of School Completed | Family Income in 1986 Dollars
Age | 1 | | |
Occupational Prestige Score | 0.087*** | 1 | |
Highest Year of School Completed | 0.014 | 0.504*** | 1 |
Family Income in 1986 Dollars | 0.017 | 0.316*** | 0.360*** | 1
If we were to discuss the results of this bivariate correlation analysis in a quantitative paper,
the discussion might look like this:
Bivariate correlations were run among variables measuring age, occupational prestige, the highest year of school respondents completed, and family income in constant 1986 dollars, as shown in Table 6. Correlations between age and highest year of
school completed and between age and family income are not significant. All other
correlations are positive and significant at the p<0.001 level. The correlation between
age and occupational prestige is weak; the correlations between income and occupational prestige and between income and educational attainment are moderate,
and the correlation between education and occupational prestige is strong.
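A correlation matrix like the one in Table 6 can likewise be sketched in Python; again, the DataFrame gss and its column names are hypothetical. Note that pandas computes correlations with pairwise deletion of missing values, as SPSS does by default, but it does not report significance, which would need to be computed separately (for instance, with scipy.stats.pearsonr for each pair of variables).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns for the four variables
variables = gss[["age", "prestige", "educ", "income86"]]

# Pairwise Pearson correlations
r = variables.corr(method="pearson")

# Keep only the lower triangle (including the diagonal of 1s), as in Table 6
mask = np.tril(np.ones(r.shape, dtype=bool))
print(r.where(mask).round(3))
```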
Regression
To present the results of a regression, we create one table that includes all of the key information from the multiple tables of SPSS output. This includes the R2 and significance of
the regression, either the B or the beta values (different analysts have different preferences
here) for each variable, and the standard error and significance of each variable. Consider
the SPSS output in Table 7.
Table 7. SPSS Output: Regression

Model Summary

Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .395a | .156 | .155 | 36729.04841
a. Predictors: (Constant), Highest year of school R completed, Age of respondent, R’s occupational prestige score (2010)

ANOVAa

Model 1 | Sum of Squares | df | Mean Square | F | Sig.
Regression | 805156927306.583 | 3 | 268385642435.528 | 198.948 | <.001b
Residual | 4351948187487.015 | 3226 | 1349022996.741 | |
Total | 5157105114793.598 | 3229 | | |
a. Dependent Variable: R’s family income in 1986 dollars
b. Predictors: (Constant), Highest year of school R completed, Age of respondent, R’s occupational prestige score (2010)

Coefficientsa

Model 1 | B | Std. Error | Beta | t | Sig. | Tolerance | VIF
(Constant) | -44403.902 | 4166.576 | | -10.657 | <.001 | |
Age of respondent | 9.547 | 38.733 | .004 | .246 | .805 | .993 | 1.007
R’s occupational prestige score (2010) | 522.887 | 54.327 | .181 | 9.625 | <.001 | .744 | 1.344
Highest year of school R completed | 3988.545 | 274.039 | .272 | 14.555 | <.001 | .747 | 1.339
a. Dependent Variable: R’s family income in 1986 dollars
The regression output shown in Table 7 contains a lot of information. We do not include all of this information when making tables suitable for publication. As can be seen in Table 8, we include the Beta (or the B), the standard error, and the significance asterisks for each variable; the R2 and significance for the overall regression; the degrees of freedom (which tell readers the sample size, or N); and the constant, along with the key to p/significance values.
Table 8. Regression Results for Dependent Variable Family Income in 1986 Dollars

Variable | Beta & SE
Age | 0.004 (38.733)
Occupational Prestige Score | 0.181*** (54.327)
Highest Year of School Completed | 0.272*** (274.039)
R2 | 0.156***
Degrees of Freedom | 3229
Constant | -44,403.902
* p<0.05 ** p<0.01 *** p<0.001
If we were to discuss the results of this regression in a quantitative paper, the results might
look like this:
Table 8 shows the results of a regression in which age, occupational prestige, and
highest year of school completed are the independent variables and family income
is the dependent variable. The regression results are significant, and all of the independent variables taken together explain 15.6% of the variance in family income. Age
is not a significant predictor of income, while occupational prestige and educational
attainment are. Educational attainment has a larger effect on family income than
does occupational prestige. For every year of additional education attained, family
income goes up on average by $3,988.545; for every one-unit increase in occupational prestige score, family income goes up on average by $522.887.¹

1. Note that the actual numerical increase comes from the B values, which are shown in the SPSS output in Table 7 but not in the reformatted Table 8.
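As a point of comparison for readers who do not use SPSS, here is a minimal sketch of the same kind of regression in Python using the statsmodels library; the DataFrame gss and its column names are hypothetical. The summary output reports the R2 and F test for the overall regression and the B, standard error, t, and significance for each predictor; standardized betas can be obtained by z-scoring all of the variables before fitting.

```python
import statsmodels.api as sm

# Hypothetical columns; listwise-delete cases with any missing values
data = gss[["income86", "age", "prestige", "educ"]].dropna()

# Unstandardized (B) coefficients; the constant is added explicitly
X = sm.add_constant(data[["age", "prestige", "educ"]])
model = sm.OLS(data["income86"], X).fit()
print(model.summary())  # R-squared, F, and B, SE, t, p for each predictor

# Standardized (beta) coefficients: z-score everything, then refit
z = (data - data.mean()) / data.std()
print(sm.OLS(z["income86"], sm.add_constant(z[["age", "prestige", "educ"]])).fit().params)
```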
Exercises
1. Choose two discrete variables and three continuous variables from a dataset of your choice. Produce appropriate descriptive statistics on all five of the variables and create a table of the results suitable for inclusion in a paper.
2. Using the two discrete variables you have chosen, produce an appropriate crosstabulation, with significance and measure of association. Create a table of the results suitable for inclusion in a paper.
3. Using the three continuous variables you have chosen, produce a correlation matrix. Create a table of the results suitable for inclusion in a paper.
4. Using the three continuous variables you have chosen, produce a multivariate linear regression. Create a table of the results suitable for inclusion in a paper.
5. Write a methods section describing the dataset, analytical methods, and variables you utilized in questions 1, 2, 3, and 4 and explaining the results of your descriptive analysis.
6. Write a findings section explaining the results of the analyses you performed in questions 2, 3, and 4.
SECTION III
QUALITATIVE DATA ANALYSIS
10. The Qualitative Approach
MIKAILA MARIEL LEMONIK ARTHUR
At the most basic level, qualitative research is research that emphasizes data that is not
numerical in nature, data like words, pictures, and ideas. In contrast, quantitative research emphasizes numbers, or at least variables that can relatively easily be translated into
numerical terms. In other words, quantitative data is about quantities, while qualitative
data is about qualities. Beyond this basic distinction, qualitative research can look very similar to quantitative research or it can take a very different approach. Later in this chapter, you
will learn more about different ways of thinking about data and how they might apply to
qualitative data analysis. When people talk about qualitative approaches to research, however, they are often focused on those approaches that are distinct from what quantitative
researchers do.
So what are some of the unique features of qualitative research? First of all, qualitative
research tends to rely on the use of rich, thick description. In other words, qualitative
research does not just provide summaries of data and findings, it really takes the reader
or consumer of research there and lets them explore the situation and make conclusions
for themselves by drawing on extended descriptions and excerpts from the data. Qualitative research also leaves room for focus on feelings and emotions, elements of the social
world that can be harder to get at with quantitative data. For the qualitative researcher,
data should not just depict specific actions or occurrences, but rather the contexts and
backgrounds that lead up to what happened. More broadly, qualitative research tends to
focus on a deep understanding of a specific place or organization or of a particular issue,
rather than providing a wider but shallower understanding of an area of study.
Among the strengths of qualitative research are that it provides for the development of
new theories and the exploration of issues that people do not know much about. It is very
high in validity since it is so connected to real life. And it permits the collection and analysis of more detailed, contextual, and complex kinds of data and information. Of course,
with strengths come limitations. The higher validity of qualitative data is matched with
lower reliability due to the unique circumstances of data collection and the impact of interviewer effect. The greater ability to develop new theories is matched with a greater difficulty testing existing theories, especially causal ones, given the impossibility of eliminating
alternative explanations for the phenomena under investigation. The ability to collect more
detailed and complex information comes in large part due to the focus on a much smaller
number of participants or cases, which in turn limits generalizability and in some cases can
limit representativeness. And while there is no reason to conclude that any of these factors make qualitative research more prone to bias than quantitative research, which after
all can be profoundly impacted by slight variations in survey question wording or sample
design, those who are not well informed about research methodology may discount the
strengths of qualitative research by suggesting that the lack of numbers or the close interaction between participants and researchers biases the results.
In their classic text on qualitative data analysis, Miles and Huberman (1994) present the
following as among the key elements of qualitative data analysis:
• It involves more prolonged contact with more ordinary aspects of human life;
• It has a holistic rather than a particularistic focus, aiming to keep data and findings in
context;
• Multiple interpretations and understandings of data are possible, and researchers
should preserve respondents’ own understandings of their worlds and lives;
• There is a lack of standardization and measurement, with the researcher themselves
becoming the primary measurement instrument; and
• Analysis is done primarily with words.
For the purposes of this text on data analysis, which focuses on what we do after we collect
data rather than on how we go about obtaining the data in the first place, the last of these
elements is most important. However, other scholars would argue that qualitative analysis
is not limited to words—it may also involve visual ways of engaging with and presenting
data.
Types of Qualitative Data
The data that we analyze in qualitative research consist primarily of words and images
drawn from observation, interaction, interviewing, or existing documents. In particular, the
types of data collection that tend to result in qualitative data include interviews and focus
groups, ethnography and participant observation, and the analysis of existing documents.
These different data collection strategies imply a variety of analytical strategies as well, and
indeed qualitative data analysis relies on a breadth of techniques. Thus, part of the process
of analyzing qualitative data is selecting the right kinds of strategies to
apply to the particular data being utilized.
One of the most common ways in which qualitative data is collected is through talking
to people. We often refer to these people as respondents or participants. Sometimes, they
may be called subjects, though many qualitative researchers find that term to be inappropriate. In contrast to respondents or participants, subjects implies a more passive kind of
relationship to the research process, a relationship in which research is done to a person
rather than one in which a person is a party to the research process.
Research involving talking to people usually involves interviews of various kinds, whether
they be in-person or via video chat, short and structured or long oral histories, of an individual or of a larger focus group. The data collected from interviews may include interview
notes and audio or video recordings. Alternatively, researchers may conduct observational
research or participant-observation (often called ethnography). In this method,
researchers observe real social life in all its detail, either with or without participating in it.
Typically, the data collected from observation and ethnography entails detailed fieldnotes
recording what has been encountered in the setting.
It is beyond the scope of this text to discuss the process of data collection. However, the
next chapter will detail some of the strategies that researchers may want to consider in
designing their studies and collecting their data in order to ensure that data is obtained in
a form that is useful for analysis.
There are other kinds of qualitative data that do not involve talking to people. These
include trace analysis, or observing the traces of life that people have left behind (this is
what archeologists do), as well as the use of existing documents or images as a data source.
For example, researchers might collect social media posts, photographs of social events,
newspaper articles, or archival materials like letters, journals, and meeting minutes.
Paradigms of Research
Researchers approach their research from different perspectives or paradigms. A paradigm
is a set of assumptions, values, and practices that shapes the way that people see, understand, and engage with the world, and thus the particular paradigm that a researcher
inhabits shapes the fashion in which they carry out their research. Philosophers use the
term epistemology to refer to the study of the nature of knowledge, and thus we can take
an epistemological perspective to understanding how paradigms of research might vary.
Two paradigms that commentators often juxtapose are the positivist and interpretivist
approaches. Positivism assumes that there is a real, verifiable reality and that the purpose
of research is to come as close to it as possible. Thus, a positivist would argue that we can
understand the world, subject it to prediction and control, and—through the processes of
research and data analysis—empirically verify our claims. Positivist research projects can
utilize a variety of methods, but experimental and quantitative survey data are especially
likely. Among qualitative approaches, positivism is often associated with the type of observational study once common in anthropology, which aimed at uncovering the “real” social
practices of a group. These methods tend to involve keeping some degree of distance
between the researcher and the participants and positioning the researcher as the expert
on both research methods and the participants’ own lives. From a positivist perspective,
standards of rigor like reliability, validity, and generalizability are important and attainable
markers of good research as they contribute to the likelihood that the research arrives at
the right answer. As this suggests, objectivity is an essential goal of positivist research. Good
research, to a positivist, is that which is valid, reliable, generalizable, and has strong, significant results.
In contrast, interpretivism suggests that our knowledge of the world is created by our
own individual experiences and interactions, and thus that reality cannot be understood as
existing on its own in a form separate from our distinct existences. Thus, an interpretivist
would argue that understandings are always based in a particular time and on a particular
interpreter and are always open to reinterpretation. Interpretivist research projects utilize
naturalistic research methods that are rooted in real social contexts, especially in-depth
interviewing and participant-observation. These methods tend to involve a closer and more
reciprocal relationship between the researcher and the participants, with a greater concern
for ethical treatment and in some cases an emphasis on possibilities for social change.
Interpretivist researchers also value participants’ expertise and their understandings of
their own lives rather than assuming the researcher’s perspective is necessarily more accurate. From an interpretivist perspective, validity may not be attainable due to the fact that
truth is not certain, and in any case standards of rigor are far less important than considerations like ethics, morality, the degree to which biases are made clear, and what the world
can learn from the research. As this might suggest, interpretivists would tend to believe
that objectivity is probably not attainable, and that even if it is, the pursuit of it may not be
worthwhile. To an interpretivist, good research is that which is done in a careful, respectful
manner, contributes to knowledge, is reflective, and takes appropriate political and ethical
considerations into account.
Lisa Pearce (2012) has outlined a paradigm she calls pragmatist. This approach is sometimes understood as a kind of middle position between positivist and interpretivist ways
of thinking. Thus, its proponents neither believe that strict objectivity is possible nor abandon efforts to seek objectivity at all, instead engaging in reflexivity as they consider how
researchers influence both research participants and research findings. While pragmatist
approaches can be used with various methods of data collection, they tend to be employed
by those using mixed-methods approaches, especially those combining quantitative and
qualitative strategies.
Another paradigm of research is feminist in nature. While there are of course many
ways to do research from a feminist perspective, one of the most important elements of
feminist epistemology is the idea that everyone comes to research—whether they are a
researcher, a research participant, or a consumer of research—from their own standpoint.
In other words, each person’s individual life experiences and social positions shape their
point of view on the world, and this point of view will in turn impact how the individual
understands and interprets phenomena they encounter, including those that are part of
research. Dorothy Smith (1987), one of the figures associated with feminist standpoint
approaches, notes that this approach to methods requires that we be able to describe the
social “in ways that can be checked back to how it actually is” (1987:122). Such approaches
are powerful not only for understanding the experiences of women, but also for understanding the experiences of other minoritized, marginalized, and/or oppressed groups,
including people who are Black, Indigenous, or of color, and those living with disabilities.
Feminist research has much in common with the broader paradigm of interpretivist
research, but it pays greater attention to the importance of standpoints and of inequality
and oppression in shaping the dynamics of research.
While the discussion of paradigms here is not exhaustive—there are many other
approaches to research, many other epistemologies—it does provide an overview of some
of the possible ways to think about research and data analysis. One important thing to
remember is that while there are criteria for good research, criteria that will be further
outlined in subsequent chapters of this text, there are no objective or empirical standards
for which paradigm is “correct.” In other words, individual researchers or research teams
approach their research from the perspective or philosophy that makes sense to them, and
while others may have reasons for disapproving, they cannot say that such a choice is right
or wrong. Researchers must make these sorts of decisions for themselves.
Inductive and Deductive Approaches
Another question we might ask about the epistemology of research processes is whether
our data emerges from our analysis or whether our data generates our analysis. If you
argue that data emerges from analysis, you are suggesting that you begin the research
process with a theory and then look to the data you have collected to see whether or
not you can find support for your theory. This approach enables the testing of theories.
It is typically understood as a deductive approach to research. In deductive approaches,
researchers develop a theory, collect data, analyze the data, and use their analysis to test
their theory. Positivist research is often deductive in its approach.
Instead, if you argue that data generates analysis, you are suggesting that you begin the
research process by collecting data and you then look to see what you can find within it.
This approach enables the building of theories. It is typically understood as an inductive
approach. In inductive approaches, researchers begin by collecting data. Then they analyze
that data and use that analysis to build new understandings. Interpretivist and feminist
research are often inductive in their approach.
While qualitative research can be conducted using both deductive and inductive
approaches, it is a bit more common for qualitative researchers to use inductive
approaches. Such approaches are far less possible in quantitative analysis because of the
need for more precisely-designed data collection techniques. Thus, one advantage of qualitative research is that it permits an inductive approach and is thus especially useful in contexts in which very little is already known or where new explanations need to be
uncovered. It is also possible to conduct research using what some call abduction, or an
interplay between deductive and inductive approaches (Pearce 2012). Such an approach
may also be found in mixed-methods research. This text will focus primarily on inductive
approaches to qualitative data analysis, given that they are far more common. But deductive approaches do exist. For example, consider a researcher who is interested in what sorts
of circumstances give rise to nonprofit organization boards deciding to replace the organization’s director. More typically, a qualitative researcher with this question would interview a wide variety of nonprofit board members and, based on the responses, would build
a theory—an inductive approach. In contrast, the researcher could choose to conduct her
study deductively. Then, she would read the prior literature on management and organizational decision-making and develop one or more hypotheses about the circumstances that
give rise to leadership changes. She would then interview board members looking specifically for the constellation of circumstances she hypothesized to test whether these circumstances were associated with the decision to replace the director.
Research Standards
As researchers design and carry out their data collection and data analysis strategies, there
are a variety of issues they must consider in terms of ensuring their research meets appropriate disciplinary and professional standards for quality and rigor. These include considerations of generalizability or representativeness, reliability and validity, and ethics and
social responsibility. It is also important to note that researchers must be attentive to
ensuring that they are not overstating the degree to which their research can demonstrate
evidence of causation. It is only possible for research to demonstrate causation if it meets
three essential criteria:
• Association, which means that there must be clear empirical evidence of a relationship between the factor understood as the cause and the factor understood as the
effect,
• Temporal order, which means that it must be known that the causal factor happened
earlier in time than the effect, and
• Elimination of alternatives, which means that the research must have eliminated all
possible alternative explanations for the effect.
It is not generally possible to eliminate all possible alternative explanations—even if a
research project is able to eliminate all the ones the researcher thought of, there are still
other possibilities. Thus, research can only make true causal claims if its findings come from a properly-controlled laboratory experiment in which the only element that could possibly have changed the outcome was the one under examination. If research does not involve
a properly-controlled laboratory experiment, researchers must be cautious about the way
they describe their findings. They cannot say that their study has proven anything or that it
shows that A causes B. Instead, they can say something like “these findings are consistent
with the hypothesis that A causes B.” Qualitative research cannot conclusively show causal
relationships, even though it can be suggestive of them.
Generalizability refers to whether the research findings from a particular study can be
assumed to hold true for the larger population. Research can only be generalized if it is the
result of a properly-conducted random sample, also called a probability sample, and then
only to the population that was sampled from. In other words, if I conduct a random sample of students at a particular college, I can only assume my findings will hold true for students at that college—I cannot assume they accurately reflect dynamics at other colleges
or among people who are not college students. Furthermore, because probability sampling
can involve what is called sampling error, even it cannot guarantee generalizability to the
population from which the sample has been drawn. It simply optimizes the chance that
such generalizability exists.
And if my sample was not random, I cannot assume that my findings reflect the broader
dynamics of that college. This is because the randomization that is part of developing a random sample is designed to eliminate the potential for sample bias that might shape the
results. For example, if I conduct a non-random sample of college students by posting an
ad on a social media site, then my participants will only be those who saw or heard about
the ad, and they may be different in some way from students who did not see or hear about
the ad.
While it is possible to conduct qualitative research using a random sample, a considerable portion of qualitative research projects do not use random sampling. This is because
it is only possible to develop a random sample if you have a list of all possible people in the
population (or can use sampling methods like cluster sampling that allow you to randomize without such a list). Clearly, if I want to study students at a particular college, I can get a
list of all possible students at that college. But what if I wanted to study people who play the
video game Fortnite? Or individuals who enjoy using contouring makeup? Or parents who
have a child with autism as well as a child who is neurotypical? There are no lists of people
in these categories, and thus a random sample is not possible. In addition, it can be hard
to use random sampling for studies in which the researcher will ask participants for more
lengthy time commitments, such as in-depth interviewing and ethnographic observation.
Where generalizability is not possible, researchers can instead strive for representativeness. Having a representative sample means having a sample that includes a sufficient
number of people from various subgroups within the population such that the research
can understand whether the dynamics it uncovers are applicable broadly across groups
or whether they only apply to specific subgroups. Which characteristics must be reflected
to ensure representativeness will vary depending on the study in question. A study of students’ participation in extracurricular activities probably should consider both residential
students and those who commute to campus. A study of retail employees might need to
include both full-time and part-time workers as well as those who do and do not hold managerial positions. Race, gender, and class, as well as other axes of inequality, are very common subgroups used to ensure representativeness. Note that it is entirely OK to exclude various subgroups from a study, as long as the study makes clear who and what it is studying. In other words, it would be reasonable to conduct a study of mothers of children with autism. It would not be acceptable to conduct a study of parents of children with autism but only include mothers in the sample.
Reliability refers to the extent to which repeated measures produce consistent results.
Usually, discussions of reliability refer to the consistency of specific measures. For instance,
if I ask you what you ate for breakfast on Wednesday in a conversation on Wednesday
evening and then on Friday morning, will you give me the same answer? Or if I administer
two different self-esteem scales, do you come out with similar results? Changes in the way
questions are asked, the context in which they are asked, or who is doing the asking can
have remarkable impacts on the responses, and these impacts mean reliability is reduced.
Some concerns about reliability, such as that illustrated with the self-esteem scale, refer to
consistency between different approaches for measuring the same underlying idea. Others have to do with repeatability, replicability, or reproducibility (Plesser 2017). An example
of the issue of repeatability is the question about what you ate for breakfast—if the same
researcher repeats the same measurement, do they get the same results? Replicability
refers to situations in which a different researcher uses the same measurement approaches
on the same type of population, though the research may take place in a different location.
While a researcher can never ensure that their research will be replicable, researchers who
strive to ensure replicability do endeavor to make their research process as clear as possible in any publications so that others will be able to take the same exact steps in trying to
replicate it. However, this can be difficult in qualitative studies as the impact the researcher
has on the context through phenomena such as interviewer effect may mean that a different researcher or research team cannot exactly replicate the original conditions of data
collection. Finally, reproducibility refers to whether a different research team can develop
its own methodological approach to answering the research question but still find results
consistent with those in the original study. It is always possible that a study fails to reproduce not because the findings are inherently irreproducible but rather because some variation in the population or setting is responsible for the different results. Another element
of reliability is inter-rater reliability. To understand inter-rater reliability, consider a study in
which a researcher is trying to determine whether the degree of sexism displayed in advertisements differs depending on the type of product being advertised. In order to collect
this data, a team of research assistants has to examine each advertisement and rate, on a scale of 1 to 5, how sexist the advertisement is. It’s not surprising that different research assistants might judge the same advertisement differently—and this can impact the results
of the study. Measuring inter-rater reliability helps determine how different these multiple
raters’ ratings are from one another, and if the differences are large, the researcher can go
back and retrain the research assistants so they can more consistently apply the intended
rating scale.
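To make inter-rater reliability concrete, the following is a minimal Python sketch, using made-up ratings, of two common measures for a pair of raters: simple percent agreement and Cohen’s kappa, which corrects for the agreement that would be expected by chance alone.

```python
from collections import Counter

def percent_agreement(rater1, rater2):
    """Share of items the two raters scored identically."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

def cohens_kappa(rater1, rater2):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater1)
    observed = percent_agreement(rater1, rater2)
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters independently choose each category
    expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)

# Two raters scoring ten advertisements on a 1-5 sexism scale (invented data)
r1 = [1, 2, 5, 4, 4, 2, 1, 3, 5, 2]
r2 = [1, 2, 4, 4, 5, 2, 1, 3, 5, 1]
print(percent_agreement(r1, r2))  # 0.7
print(cohens_kappa(r1, r2))
```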
Validity refers to the extent to which research measurements accurately reflect the
underlying reality. Well-designed qualitative approaches, especially in-depth interviewing
and participant-observation, tend to be high in validity. This is because such methods come
the closest of all social science methods to reflecting real life in all of its complexity. Validity
can be increased by careful attention to research design, the use of method triangulation
(multiple research methods or approaches), and deep reflection on process and findings.
While a full treatment of research ethics is beyond the scope of this book, it is essential to
remember that good research always attends to the highest ethical standards. People who
talk about research ethics often focus their primary attention to the treatment of human
subjects, or the people who actually participate in the research project. Ethical treatment
of participants includes ensuring that any risks they face are limited, that they have given
fully-informed consent to their participation in research, that their identity will be protected,¹ and that they do not experience coercion to participate. An interesting example of
the kinds of issues that a commitment to research ethics raises has to do with the legal risks
inherent in research. Shamus Khan, a researcher studying sexual assault, has written about
an instance in which he became embroiled in the court process after his research materials
were subpoenaed in a lawsuit. The subpoena would have entitled the litigants to materials that would have disclosed confidential personal information, information research participants were assured would remain confidential. Khan details the lengths that he had
to go to in order to protect participants’ information and the complex ethical questions his case raises, ultimately concluding that a real commitment to research ethics requires some changes in how the institutions that sponsor research think about and manage their responsibilities (Khan 2019).

1. However, there are research participants who wish to disclose their real identity, and some qualitative researchers argue that truly ethical research gives participants the option to make informed decisions about such disclosure.
Many commentators who discuss research ethics suggest that researchers’ ethical
responsibility goes much further. For example, feminist researchers often suggest that
research participants be given the opportunity to review interview transcripts for errors,
omissions, or statements they would have preferred not to make and issue corrections,
even if their words and experiences will be used anonymously. Attention should also be
paid to ensuring that people and communities who participate in research are able to share
in the benefits of that research. For example, if a program is developed through research
on a particular community of homeless people, those people should be among the first to
be able to access the new program. If researchers profit financially from the research they
have done, they might consider sharing the profits with those they have studied.
While traditional treatments of research ethics consider only the researcher’s responsibility to research participants, a broader treatment of ethics—in keeping with interpretivist
and feminist paradigms—would also include social responsibility as an ethical touchstone.
Researchers concerned with social responsibility might consider whether their approach to
publication or the content of their publications might have harmful impacts on the populations they have studied, stigmatizing them or exposing them to disadvantageous policy consequences. For example, Robert Putnam, a political scientist, conducted a study
that examined the impact of neighborhood diversity on social cohesion and trust. When
he found that diversity can reduce trust, he worried that his findings would be used as a
political weapon by those opposed to diversity, racial equity, and immigration. Thus, while
he made some data available to other researchers, he withheld publication for several
years while he developed policy proposals designed to mitigate the potential harm of his
findings. Some commentators felt that withholding publication was itself unethical, while
Putnam felt that publishing without due consideration of the impact of his findings was
the unethical thing. A commitment to social responsibility might also include attention to
ensuring equity in citation practices, an issue that has been brought to the fore by the social
media campaign #CiteBlackWomen, which urges scholars and teachers to ensure that
they read publications by Black women, acknowledge Black women’s scholarship through
citation as well as inclusion in course reading lists, and ensure that Black women are represented as speakers at conferences, among other things (Cite Black Women Collective n.d.).
As noted above, research paradigms influence the particular qualities that researchers
value in their research. In addition, it is not always realistic or even possible to maximize all
of these qualities in a given project. Thus, most research, including most excellent research,
will emphasize some of these standards and not others. This does not mean the research
is lacking in rigor. Good research, however, is always explicit about its own limitations. Thus,
researchers should indicate whether or not their results can be generalized, and if so, to
whom. They should be clear on which subgroups they included in their efforts to ensure
representativeness.
The Process of Qualitative Research
So, how does one go about conducting an inductive qualitative research project? Well,
there are a series of steps researchers follow. However, it is important to note that qualitative research and data analysis involve a high degree of fluidity and are typically iterative,
meaning that they involve repeatedly returning to prior steps in the process.
First, researchers design their data collection process, which includes developing any
data collection instruments such as interview guides and locating participants. Then, they
collect their data. To collect data researchers might conduct interviews, observations, or
ethnography, or they might locate documents or other sources of textual or visual data.
While deductive quantitative approaches require that researchers collect all their data and only
then analyze it, inductive qualitative approaches provide the opportunity for more of a cyclical process in which researchers collect data, begin to analyze it, and then use what they
have found so far to reshape their further data collection.
Once data is collected, researchers need to ensure that their data is usable. This may
require the transcription of audio or video recordings, the scanning or photocopying of documents, typing up handwritten fieldnotes, or other processes designed to move raw data
into a more manipulable form.
Next, researchers engage in data reduction. Research projects typically entail the collection of really large quantities of data, more data that can possibly be managed or utilized in
the context of one paper. This is especially likely in the case of qualitative research because
of the richness and complexity of the data that is collected. Therefore, once data collection
is completed, researchers use strategies and techniques to reduce the hundreds or thousands of pages of fieldnotes or interview transcripts or documents into a manageable form.
Activities involved in data reduction, which will be taken up in a later chapter, include coding, summarization, the development of data displays, and categorization.
Once data reduction has made data more usable, researchers can develop conclusions
based on their data. Remember, however, that this process is iterative, which means that it
is a continuing cycle. So, when researchers make conclusions, they also go back to earlier
stages to refine their approaches. In addition, the process of developing conclusions also
requires careful consideration to limitations of the data and analytical approaches, such as
those discussed earlier in this chapter.
Finally, researchers present their findings. During each project, researchers must determine how best to disseminate results. Factors influencing this determination include the
research topic, the audience, and the intended use of the results—for instance, are these
the results of basic research, designed to increase knowledge about the phenomena under
study, or are they the results of applied research, conducted for a specific audience to
inform the administration of a policy or program? Findings might be disseminated in a
graphical form like an infographic or a series of charts, a visual form like a video or animation, an oral form like a lecture, or a written form like a scholarly article or a report. Of course,
many projects incorporate multiple forms of dissemination.
While this chapter is titled “The Qualitative Approach,” it is actually inaccurate to suggest
that there is just one overall approach to qualitative research. As this chapter has shown,
there are some core characteristics that qualitative approaches to research have in common, such as data that relies on words or images rather than numbers and a richer, more
contextual understanding of the phenomena under study. But there are also many ways in
which qualitative approaches to research vary. They use different methods of data collection. They take place within different paradigms and epistemologies. They focus their attention on emphasizing different standards for research quality. And, as the following chapters
will show, they utilize different methods for preparing and managing data, analyzing that
data, and disseminating their findings.
Exercises
1. Find a few grocery store circulars from your area. The ones that get delivered with your mail are fine, or you can locate them online on the website of your local grocery stores. Spend some time examining the circulars. Look at the words and images, the types of items represented, the fonts and layouts, anything that catches your eye, and then answer two questions: first, what do the circulars tell you about the lives of people today who live in your area, and second, what did you do, cognitively, to figure that out?
2. Locate a recent scholarly journal article in your field of study and read it. Do you think this article used a more positivist or more interpretivist paradigm of knowledge? Explain how you know, drawing on the key elements of these paradigms.
3. What do you think it means to do good research? Which of the various standards for good research do you think are most important to the topics or issues you are interested in? And what are some of the strategies you might employ to be sure your research lives up to these standards?
11. Preparing and Managing
Qualitative Data
MIKAILA MARIEL LEMONIK ARTHUR
When you have completed data collection for a qualitative research project, you will likely
have voluminous quantities of data—thousands of pages of fieldnotes, hundreds of hours
of interview recordings, many gigabytes of images or documents—and these quantities of
data can seem overwhelming at first. Therefore, preparing and managing your data is an
essential part of the qualitative research process. Researchers must find ways to organize
their data into a form that is useful and workable. This chapter will
explore data management and data preparation as steps in the research process, steps that
help facilitate data analysis. It will also review methods for data reduction, a step designed
to help researchers get a handle on the volumes of data they have collected and coalesce
the data into a more manageable form. Finally, it will discuss the use of computer software
in qualitative data analysis.
Data Management
Even before the first piece of data is collected, a data management system is a necessity
for researchers. Data management helps to ensure that data remain safe, organized, and
accessible throughout the research process and that data will be ready for analysis when
that part of the project begins. Miles and Huberman (1994) outline a series of processes and
procedures that are important parts of data management.
First, researchers must attend to the formatting and layout of their data. Developing
a consistent template for storing fieldnotes, interview transcripts, documents, and other
materials, and including consistent metadata (data about your data) such as time, date,
pseudonym of interviewee, source of document, person who interacted with the data, and
other details will be of much use later in the research process.
Similarly, it is essential to keep detailed records of the research process and all research
decisions that are made. Storing these inside one’s head is insufficient. Researchers should
keep a digital file or a paper notebook in which all details and decisions are recorded. For
instance, how was the sample conducted? Which potential respondents never ended up
going through with the interview? What software decisions were made? When did the digital voice recorder fail, and for how long? What day did the researcher miss going into the
field because they were ill? And, going forward, what decisions were made about each step
in the analytical process?
As data begin to be collected, it is necessary to have appropriate, well-developed physical
and/or digital filing systems to ensure that data are safely stored, well-organized, and easy
to retrieve when needed. For paper storage, it is typical to use a set of file folders organized chronologically, by respondent, or by some other meaningful system. For digital storage, researchers might use a similar set of folders or might keep all data in a single folder
but use careful file naming conventions (e.g. RespondentPseudonym_Date_Transcript) to
make it easy to find each piece of data. Some researchers will keep duplicate copies of all
data and use these copies to begin to sort, mark, and organize data in ways that enable
relationships and themes to emerge. For instance, researchers might sort
interview transcripts by the way respondents answered a particular key question. Or they
might sort fieldnotes by the central activities that took place in the field that day. Activities such as these can be facilitated by the use of index cards, color-coding systems, sticky
notes, marginal annotations, or even just piles. Cross-referencing systems may be useful to
ensure that thematic files can be connected to respondent-based files or to other relevant
thematic files. Finally, it is essential that researchers develop a system of backups to ensure
that data is not lost in the event of a catastrophic hard drive failure, a house fire, lack of
access to the office for an extended period, or some other type of disaster.
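Consistent file naming conventions like the one mentioned above are easy to standardize with a small script. The following Python sketch is purely illustrative; the folder structure, function name, and pseudonym are hypothetical.

```python
from datetime import date
from pathlib import Path

def transcript_filename(pseudonym, interview_date):
    """Build a name following the RespondentPseudonym_Date_Transcript convention."""
    return f"{pseudonym}_{interview_date:%Y-%m-%d}_Transcript.docx"

# One folder per data type, with consistent names within it
folder = Path("data/transcripts")
folder.mkdir(parents=True, exist_ok=True)
print(folder / transcript_filename("Marisol", date(2023, 4, 12)))
```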
One more issue to attend to in data management is research ethics. It is essential to
ensure that confidential data is protected from disclosure; that identifying information
(including signed consent forms) is not kept with or linkable to data; and that all
researchers, analysts, interns, and administrative personnel involved in a study sign statements of confidentiality to ensure they understand the importance of nondisclosure (Berg
2009). Note that such documents will not protect researchers and research personnel from
subpoena by the courts—if research documents will contain information that could expose
participants to criminal or legal liability, there are additional concerns to consider and
researchers should do due diligence to protect themselves and their respondents (see, e.g.,
Khan 2019), though the methods and mechanisms for doing so are beyond the scope of this
text. Researchers must attend to data security protocols, many of which were likely agreed
to in the IRB submission process. For example, paper research records should be locked
securely where they cannot be seen by visitors or by personnel or accessed by accident.
Digital records should be securely stored in password protected files that meet current
standards for strong passwords. Cloud storage or backups should have similar protections,
and researchers should carefully review the terms of service to ensure that they continue to
own their data and that the data are protected from disclosure.
Preparing Data
In most cases, data are not entirely ready for analysis at the moment at which they are
collected. Additional steps must be taken to prepare data for analysis, and these steps are
somewhat different depending on the form in which the data exists and the approach to
data collection that was used: fieldnotes from observation or ethnography, interviews and
other recorded data, or documentary data like texts and images.
Fieldnotes
When researchers conduct ethnographic or observational research, they typically do not
have the ability to maintain verbatim recordings. Instead, they maintain fieldnotes. Maintaining fieldnotes is a tricky and time-consuming process! In most instances, researchers
cannot take notes—at least not too many—while present in the research site without making themselves conspicuous. Therefore, they need to limit themselves to a couple of jotted words or sentences to help jog their memories later on, though the quantity of notes
that can be taken in the field is higher these days because of the possibility of taking notes
via smartphone, a notetaking process largely indistinguishable from the socially-ubiquitous practices of text messaging and social media posts. Immediately after leaving the
site, researchers use the skeleton of notes they have taken to write up full notes recording
everything that happened. And later, within a day or so, many researchers go back over
the fieldnotes to edit and refine the fieldnotes into a useful document for later analysis. As
this process suggests, analysis is already beginning even while the research is ongoing, as
researchers make notes and annotations about theoretical ideas, connections to explore,
potential answers to their research questions, and other things in the process of refining
their fieldnotes.
When fleshing out fieldnotes, researchers should be attentive to the distinctions between
recollections they believe are accurate, interpretations and reflections they have made,
and analytical thoughts that develop later through the process of refining the fieldnotes.
It is surprisingly easy for a slight mistake in recording, say, which people did what, or in
what sequence a series of events occurred, to entirely change the interpretation of circumstances observed in the field. To demonstrate how such issues can arise, consider the following two hypothetical fieldnote excerpts:
Excerpt A: Sarah walked into the living room and before she knew what happened, she found Marisol on the floor in tears, surrounded by broken bits of glass. “What did you do?” Sarah said, her voice thick with emotion. Marisol covered her face and cried louder.

Excerpt B: Her voice thick with emotion, Sarah said, “What did you do?” Before she knew what happened, she found Marisol on the floor in tears, surrounded by bits of broken glass. Sarah walked into the living room. Marisol covered her face and cried louder.
In Excerpt A, the most reasonable interpretation of events is probably that Sarah walked
into the room and found Marisol, the victim of an accident, and was concerned about her.
In Excerpt B, in contrast, Sarah probably caused the accident herself. Yet the words are
exactly the same in both excerpts—they have just been slightly rearranged. This example
highlights how important careful attention to detail is in recording, refining, and analyzing
fieldnotes (and other forms of qualitative data, for that matter).
Fieldnotes contain within them a vast array of different types of data: records of verbal
interactions between people, observations about social practices and interactions,
researchers’ inferences and interpretations of social meanings and understandings, and
other thoughts (Berg 2009). Therefore, as researchers work to prepare their fieldnotes for
analysis, they may need to work through them again to organize and categorize different
types of notes for different uses during analysis. The data collected from ethnographic or
observational research can also include documents, maps, images, and recordings, which
then need to be prepared and managed alongside the fieldnotes.
Interviews & Other Recordings
First of all, interview researchers need to think carefully about the form in which they will
obtain their data. While most researchers audio- or video-record their interviews, it is useful to keep additional information alongside the recordings. Typically, this might include a
form for keeping track of themes and data from each interview, including details of the
context in which the interview took place, such as the location and who was present; biographical information about the participant; notes about theoretical ideas, questions, or
themes that occur to the researcher during the interview; and reminders of particularly
notable or valuable points during the interview. These information sheets should also contain the same pseudonym or respondent number that is used during the interview recording, and thus can be helpful in matching biographical details to participant quotes at the
time of ultimate writeup. Interviewers may also want to consider taking notes throughout
the interview, as notes can highlight elements of body language, facial expression, or more
subtle comments that might not be picked up on audio recordings. While video recordings
can pick up such details, they tend to make participants more self-conscious than do audio
recordings.
Once the interview has concluded, recordings need to be transcribed. While automated
transcription has improved in recent years, it still falls far short of what is needed to make
an accurate transcript. Transcription quality is typically assessed using a metric called the Word Error Rate, calculated by dividing the number of incorrect words by the number of words that should appear in the passage; there are other, more complex assessment metrics that take individual words' importance to meaning into consideration. As of 2020, automated transcription services still tended to have Word Error Rates of over 10%, which may be sufficient for general understanding (such as in the case of apps that convert voicemails to text) but which is definitely too high an error rate for use in data analysis. And
error rates increase when audio recordings contain background noise, accented speech, or
the use of dialects other than Standard American English (SAE). There can also be ethical
concerns about data privacy when automated services are used (Khamsi 2019). However,
automated services can be cost-effective, with a typical cost of about 25 cents per minute
of audio (Brewster 2020). For a typical study involving 40 interviews averaging 90 minutes
each, this would come to a total cost of about $900, far less than the cost of human transcription, which averages about $1 per minute these days. Human transcription is far more
accurate, with extremely low Word Error Rates, especially for words essential to meaning.
But human transcribers also suffer from increased error when transcribing audio with noisy
backgrounds, where multiple speakers may be interrupting one another (for instance in
recordings of focus groups), or in cases where speakers have stronger accents or speak
in dialects other than Standard American English. For example, a study examining court
reporters—professional transcribers with special experience and training at transcribing
speech in legal contexts—working in Philadelphia who were assigned to transcribe African
American English had average Word Error Rates of above 15%, and these errors were significant enough to fundamentally alter meaning in over 30% of the speech segments they
transcribed (Jones et al. 2019).
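To make the Word Error Rate concrete, here is a minimal sketch in Python of the simplified calculation described above (incorrect words divided by the number of words that should appear). Production systems compute WER from substitutions, deletions, and insertions via edit distance, so treat this as an illustration only; the function name and sample sentences are hypothetical.

    def simple_word_error_rate(reference: str, hypothesis: str) -> float:
        # Simplified WER: incorrect words / words in the reference transcript.
        # Real WER uses edit distance: (substitutions + deletions + insertions) / N.
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        # Count positions where the transcript differs from the reference;
        # words missing from the end of the transcript also count as errors.
        errors = sum(
            1 for i, word in enumerate(ref_words)
            if i >= len(hyp_words) or hyp_words[i] != word
        )
        return errors / len(ref_words)

    # Hypothetical example: 2 wrong words out of 10 yields a 20% error rate
    reference = "we argued with our housemates over chores last week too"
    transcript = "we argued with our housemaids over choices last week too"
    print(f"WER: {simple_word_error_rate(reference, transcript):.0%}")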
Researchers can, of course, transcribe their recordings themselves, an option that vastly
reduces cost but adds an enormous amount of time to the data preparation process. The
use of specialized software or devices like foot-pedal-controlled playback can ease the
work of transcription, but it can easily take up to four hours to complete the transcription
of one hour of recordings. This is because people speak far faster than they type—a typical person speaks at a rate of about 150 words per minute and types at a rate more like
30-60 words per minute. Another possibility is to use a hybrid approach in which the researcher uses automated transcription or voice recognition to get a basic, if error-laden, transcript and then corrects it by hand. Given the time that will be invested in correcting the transcript while listening to the recording, even lower-quality transcription services may be acceptable, such as the automated captioning that video services like YouTube offer, though of course these services also present data privacy concerns. Alternatively, researchers might use voice-recognition software. The accuracy of
such software can typically be improved by training it on the user’s voice. This approach can
be especially helpful when interview respondents speak with accents, as the researcher can
re-record the interview in their own voice and feed it into software that is already trained to
understand the researcher’s voice.
Table 1 below compares different approaches to transcription in terms of financial cost,
time, error rate, and ethical concerns. Costs for transcription by the researcher and hybrid
approaches are typically limited to the acquisition of software and hardware to aid the transcription process. For a new researcher, this might entail several hundred dollars of cost
for a foot pedal, a good headset with microphone, and software, though these costs are
often one-time costs not repeated with each project. In contrast, even automated transcription can cost nearly a thousand dollars per project, with costs far higher for the hired
human transcriptionists who have much better accuracy. In terms of time, though, automated and hired services require far less of the researchers' time. Hired services will require
some time for turnaround, more if the volume of data is high, but the researcher can work
on other things during that time. For self and hybrid transcription approaches, researchers
can expect to put in much more time on transcription than they did conducting interviews.
For a typical project involving 40 interviews averaging 90 minutes each, the time required
to conduct the interviews and transcribe them—not including time spent preparing for
interviews, recruiting participants, traveling, analyzing data, or any other task—can easily
exceed 300 hours. If you assume a researcher has 10 hours per week to devote to their project, that would mean it would take over 30 weeks just to collect and transcribe the data
before analysis could begin. And after transcription is complete, most researchers find it
useful to listen to the recordings again, transcript in hand, to correct any lingering errors
and make notes about avenues for exploration during data analysis.
Table 1. Comparing Transcription Approaches for a Typical Interview-Based Research Project
Approach | Cost | Time | Error Rate | Ethical Concerns
Automated | $900 | A few hours turnaround | High | High
Hired | $3,600 | At least several days turnaround | Low for SAE | Probably Low
Self | Minimal | About 240 hours active | Varies | Low
Hybrid | Minimal | Varies, likely at least 120 hours active | Low for SAE | Varies
Note: this table assumes a project involving 40 interviews, all conducted by the main
researcher, averaging 90 minutes in length. Time costs do not include interviewing itself,
which would add an additional 60 hours to the time required to complete the project.
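The figures in Table 1 follow from simple arithmetic, and readers planning their own projects can adjust the inputs. Below is a minimal sketch assuming the chapter's running example (40 interviews at 90 minutes each), the per-minute rates quoted above, and a 4:1 ratio of transcription time to recording time for self-transcription; the 2:1 ratio for the hybrid approach is likewise an assumption.

    # Parameters from the chapter's hypothetical project
    interviews = 40
    minutes_each = 90
    total_minutes = interviews * minutes_each   # 3,600 minutes of audio
    total_hours = total_minutes / 60            # 60 hours of interviewing

    automated_cost = total_minutes * 0.25       # $0.25 per minute -> $900
    hired_cost = total_minutes * 1.00           # $1.00 per minute -> $3,600

    self_hours = total_hours * 4                # 4 hours per recorded hour -> 240
    hybrid_hours = total_hours * 2              # assumed 2:1 ratio -> 120

    print(f"Automated: ${automated_cost:,.0f}; Hired: ${hired_cost:,.0f}")
    print(f"Self: {self_hours:.0f} hours; Hybrid: {hybrid_hours:.0f} hours")
    print(f"Interviewing plus self-transcription: {total_hours + self_hours:.0f} hours")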
Documents and Images
Data preparation is far different when data consists of documents and images, as these
already exist in fixed form and do not need to be transcribed. Here, concerns are more likely to revolve around storage, filing,
and organization, which will be discussed later in this chapter. However, it can be important to conduct a preliminary review of the data to better understand what is there. And for
visual data, it may be especially useful to take notes on the content in and the researcher’s
impressions of each visual as a starting point to thinking about how to further work with
the materials (Saldaña 2016).
There are special concerns about research involving documents and images that are
worth noting here. First of all, it is important to attend to sampling
issues in relation to the use of documents. Sampling is not always a concern—for instance,
research involving newspaper articles may involve a well-conducted random sample, or
photographs may have been taken by the researcher themselves according to a clear purposive sampling process—but many projects involving textual data have used sampling
procedures where it remains unclear how representative the sample is of the universe of
data. Researchers must keep careful notes on where the documents and images included
in their data came from and what sorts of limitations may exist in the data and include a
discussion of these issues in any reporting on their research.
When writing about interview data, it is typical to include excerpts from the interview
transcripts. Similarly, when using documents or visual materials, it is preferable to include
some of the original data. However, this can be more complex due to copyright concerns.
When using published works, there are real legal limits on the quantity of text that you
can include without getting permission from the copyright owner, who may make you pay
for the privilege. This is not an issue for works that were created or published more than
95 years ago, as their copyrights have expired. For works more recent than that, the use of
more than a small portion of the work typically violates copyright, and the use of an image
is almost never permitted unless it has been specifically released from copyright (or created
by the researcher themselves). Archival data may be subject to specific usage restrictions
imposed by the archive or donor. Copyright can make the goal of providing the data in a
form useful to the reader very difficult, so researchers might need to obtain copyright clearance or
find other creative ways of providing the data.
Data Reduction
In qualitative data analysis, data collection and data analysis are often not two distinct
research phases. Rather, as researchers collect data, they begin to develop themes, ask analytical questions, write theoretical memos, and otherwise begin the work of analysis. And
when researchers are analyzing data, they may find they need to go back and collect more
to flesh out certain areas that need further elaboration (Taylor, Bogdan, and DeVault 2016).
But as researchers move further towards analysis, one of the first steps is reading through
all of the data they have collected. Many qualitative researchers recommend taking notes
on the data and/or annotating it with simple notations like circles or highlighting to focus
your attention on those passages that seem especially fruitful for later focus (Saldaña 2016).
This is often called “pre-coding.” Other approaches to pre-coding include noting hypotheses about what might emerge elsewhere in the data, summarizing the main ideas of each
piece of data and annotating it with details about the respondent or circumstances of its
creation, and taking preliminary notes about concepts or ideas that emerge.
This sort of work is often called “preliminary analysis,” as it enables researchers to start
making connections and working with themes and theoretical ideas before they get
to the point of drawing actual conclusions. It is also a form of data reduction. In qualitative
analysis, the volume of data collected in any given research project is often enormous, far
more than can be productively dealt with in any particular project or publication. Thus, data
reduction refers to the process of reducing large volumes of data such that the more meaningful or important parts are accessible. As sociologist Kristin Luker points out in her text
Salsa Dancing into the Social Sciences (2008), what we are really trying to do is recognize
patterns, and data reduction is a process of sifting through, digesting, and thinking about
our data until we can see the patterns we might not have seen before. Luker argues that
one important way to help ourselves see patterns is to talk about our data with others—lots
of others, and not just other social scientists—until what we are explaining starts to make
sense.
There are a variety of approaches to data reduction. Which of these are useful for a particular project depends on the type and form of data, the priorities of the researcher, and
the goals of the research project, and so each researcher must decide for themselves how
to proceed. One approach is summarization. Here, researchers write short summaries of
the data—summaries of individual interview transcripts, of particular days or weeks of fieldnotes, or of documents. Then, these summaries can be used for preliminary analysis rather
than requiring full engagement with the larger body of data. Another approach involves
writing memos about the data in which connections, patterns, or theoretical ideas can be
laid out with reference to particular segments of the data. A third approach is annotation,
in which marginal notes are used to highlight or draw attention to particularly important or
noteworthy segments of the data. And Luker’s suggestion of conversations about our data
with others can be understood as a form of data reduction, especially if we record notes
about our conversations.
One of the approaches to data reduction which many analysts find most useful is the
creation of typologies, or systems by which objects, events, people, or ideas can be classified into categories. In constructing typologies, researchers develop a set of mutually exclusive categories—no item can be placed into more than one category of the typology (Berg
2009)—that are, ideally, also exhaustive, so that no item is left out of the set of categories
(an “other” category can always be used for those hard to classify). They then go through
all their pieces of data or data elements, be they interview participants, events recorded
in fieldnotes, photographs, tweets, or something else, and place each one into a category.
Then, they examine the contents of each category to see what common elements and analytical ideas emerge and write notes about these elements and ideas.
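For researchers working with digital data, a typology's requirements translate directly into a simple classification routine: rules checked in order guarantee mutual exclusivity, and a trailing "other" category guarantees exhaustiveness. Here is a minimal Python sketch using invented categories for the course-listing exercise at the end of this chapter; the rules are hypothetical, not a recommended typology.

    def classify_course(title: str) -> str:
        # The first matching rule wins, so each course lands in exactly one
        # category (mutual exclusivity); "other" catches everything else
        # (exhaustiveness).
        title = title.lower()
        if "seminar" in title:
            return "discussion-based"
        if "lab" in title or "workshop" in title:
            return "hands-on"
        if "introduction" in title or "intro" in title:
            return "survey/introductory"
        return "other"

    courses = ["Introduction to Sociology", "Qualitative Methods Lab",
               "Senior Seminar in Social Theory", "Social Statistics"]
    for course in courses:
        print(f"{course}: {classify_course(course)}")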
One approach to data reduction which qualitative researchers often fall back on but
which they should be extremely careful with is quantification. Quantification involves the
transformation of non-numerical data into numerical data. For example, if a researcher
counts the number of interview respondents who talk about a particular issue, that is
a form of quantification. Some limited quantification is common in qualitative analysis,
though its use should be particularly rare in ethnographic research given the fact that
ethnographic research typically relies on one or a very small number of cases. However, the
use of quantification should be constrained to those circumstances where it provides particularly useful or illuminating descriptive information about the data, and not as a core
analytical tool. In addition, given that it is exceptionally uncommon for qualitative research
projects to produce generalizable findings, any discussion of quantified data should focus
on numbers rather than percents. Numbers are descriptive—“35 out of 40 interview respondents said they had argued with housemates over chores in the past week”—while percents suggest broader and more generalizable claims (“87.5% of respondents said they had
argued with housemates over chores in the past week”).
Qualitative Data Analysis Software
As part of the process of preparing data for analysis and planning an analysis strategy,
many—though not all—qualitative researchers today use software applications to facilitate
their work. The use of such technologies has had a profound impact on the way research
is carried out, as have many technological changes throughout history. Take a much older example: the development of technology permitting the audio recording of interviews. This
technology made it possible to develop verbatim transcripts, whereas prior interview-based
research had to rely on handwritten notes conveying the interview content—or, if the interviewer had significant financial resources, perhaps a stenographer. Recordings and verbatim transcripts also made it possible for researchers to minutely analyze speech patterns,
specific word choices, tones of voice, and other elements that could not previously have
been preserved.
Today’s technologies make it easier to store and retrieve data, make it faster to process
and analyze data, and provide access to new analytical possibilities. On a basic level, software can allow for more sophisticated possibilities for linking data to memos and other
documents. And there are a variety of other benefits (Adler and Clark 2008) to the use of
software-aided analysis (often referred to as CAQDAS, or computer-aided qualitative data
analysis software). It can allow for more attention to detail, more systematic analysis, and
the use of more cases, especially when dealing with large data sets or in circumstances
where some quantification is desirable. The use of CAQDAS can enhance the perception of
rigor, which can be useful when bringing qualitative data to bear in settings where those
using data are more used to quantitative analysis. When coding (to be discussed further in
the chapter on qualitative coding), software enhances flexibility and complexity, and may
enliven the coding process. And software can provide complex relational analysis tools that
go well beyond what would be possible by hand.
However, there are limitations to the use of CAQDAS as well (Adler and Clark 2008). Software can promote ways of thinking about data that are disconnected from qualitative
ideals, whether through reductions in the connection between data and context or the
increased pressure to quantify. Each individual software application creates a specific
model of the architecture of data and knowledge, and analysis may become shaped or constrained by this architecture. Coding schemes, taxonomies, and strategies may reflect the
capacities available in and the structures prioritized by the software rather than reflecting
what is actually happening in the data itself, and this can further homogenize research,
as researchers draw from a few common software applications rather than from a wide
variety of personal approaches to analysis. Software can also increase the psychic distance
between the researcher or analyst and their data and reduce the likelihood of researchers
understanding the limitations of their data. The tools available in CAQDAS applications
tend to emphasize typical data rather than unusual data, and so outliers or negative cases
may be missed. Finally, CAQDAS does not always reduce the amount of time that a research
project takes, especially for newer users and in cases with smaller sets of data. This is
because there can be very steep learning curves and prolonged set-up procedures.
The fact that this list of limitations is somewhat longer than the list of positives should not
be understood as suggesting that researchers avoid CAQDAS-based approaches. Software
truly does make forms of research possible that would otherwise be impossible, speeds
data processing tasks, and makes a variety of analytical tasks much easier to do, especially
when they require attention to detail. And digital technologies, including both software
applications and hardware devices, facilitate so much about how qualitative researchers
work today. There are a wide variety of technological aids to the qualitative research
process, each with different functions.
First of all, digital technologies can be used for capturing qualitative data. This may seem
obvious, but as the example of audio recording above suggests, the development of technologies like audio and film recording, especially via cellphone or other small personal
devices, led to profound changes in the way qualitative research is carried out as well as
an expansion in the types of research that are possible. Other technologies that have had
similar impacts include the photocopier and scanner, and more recently the possibility to
use a cell phone to capture photographs of documents in archives (without the flash on
to avoid damaging delicate items). Finally, videoconferencing software makes it possible to
interview people who are halfway around the world, and most videoconferencing platforms
have a built-in option to save a video record of the conversation, and potentially autocaption it. It’s also worth noting that digital technologies provide access to sources of data that
simply did not exist in the past, whether interviewing via videoconferencing, content analysis of social media, or ethnography of massively-multiplayer online games or worlds.
Software applications are very useful for data management tasks. The ability to store,
file, and search electronic documents makes the management of huge quantities of data
much more feasible. Storing metadata with files can help enormously with the management of visual data and other files. Word processing programs are also relevant here. They
help us produce and revise text and reports, compile and edit our fieldnotes and transcriptions, write memos, make tables, count words, and search for and count specific words and
phrases. Graphics programs can also facilitate the creation of graphs, charts, infographics,
and other data displays. Finally, speech recognition programs aid our transcription process
and, for some of us, our writing process.
Coding programs fall somewhere between data reduction and data analysis in their functions. Such software applications typically provide researchers with the ability to apply one
or more codes to specific segments of text, search for and retrieve all segments that have
had particular codes applied to them, and look at relationships between different codes.
Some also provide data management features, allowing researchers to store memos, documents, and other materials alongside the coded text, and allow for interrater reliability
testing (to be discussed in another chapter). Finally, there are a variety of data analysis tools.
These tools allow researchers to carry out functions like organizing coded data into maps or
diagrams, testing hypotheses, merging work carried out by different researchers, building
theory, utilizing formal comparative methods, creating diagrams of networks, and others.
Many of these features will be discussed in subsequent chapters.
Choosing the Right Software
There are so many programs out there that carry out each of the functions discussed
above, with new ones appearing constantly. Because the state of technology changes all
the time, it is outside the scope of this chapter to detail specific options for software applications, though online resources can be helpful in this regard (see, e.g., University of Surrey
n.d.). But researchers still need to make decisions about which software to use. So, how do
researchers choose the right qualitative software application or applications for their projects? There are four primary sets of questions researchers should ask themselves to help
with this decision.
First, what functions does the research need and the project require? As discussed
above, programs have very different functions. In many cases, researchers may need to
combine multiple programs to get access to all the functions they need. In other cases,
researchers may need only a simple software application already available on their computers.
Second, researchers should consider how they use technology. There are a variety of
questions that are relevant here. For example, what kind of device will be used, a desktop
computer, laptop, tablet, or phone? What operating system, Windows, Mac/iOS, Chrome,
or Android? How much experience and skill do researchers have with computers—do they
need software applications that are very easy to use, or can they handle command-line
interfaces that require some programming skills? Do they prefer software that is installed
on their devices or a cloud-based approach? And will the researcher be working alone or
as part of a team where multiple people need to contribute and share access to the same
materials?
Third, what type of data will be used? Will it be textual, visual, audio, or video? Will data come
from multiple sources and styles or will it all be consistent? Is the data organized or freeform? What is the magnitude of the data that will be analyzed?
Finally, what resources does the researcher already have available? What software can
they access, whether already available on their personal computing devices or via licenses
provided by their employer or college/university? What degree of technical support can
they access, and are technical support personnel familiar with CAQDAS? And how much
money do they have available to pay for software on a one-time or ongoing basis? Note
that some software can be purchased, while other software is provided as a service with
a monthly subscription fee. And even when software is purchased, licenses may only provide access for a limited time period such as a year. Thus, both short-term and long-term
financial costs and resource availability should be assessed prior to committing to a software package.
Exercises
1. Transcribe about 10 minutes of an audio interview—one good source might be your local NPR station's website. Be sure that your transcription is an exact record of what was said, including any pauses, laughter, vulgarities, or other kinds of things you might not typically write in an academic context, and that you transcribe both questions and responses. What was it like to complete this exercise?
2. Use the course listings at your college or university as a set of data. Develop a typology of different types of courses—not based on the department or school offering them or the course number alone—and classify courses within this typology. What does this exercise tell you about the curriculum at your college or university?
3. Review the notes, documents, and other materials you have already collected from this course and develop a new system of file management for them, with digital or physical folders, subfolders, and labels or file names that make items easy to locate.
12. Qualitative Coding
MIKAILA MARIEL LEMONIK ARTHUR
Codes are words or phrases that capture a central or notable attribute of a particular segment of text or visual data (Saldaña 2016). Coding, then, is the process of applying codes to
texts or visuals. It is one of the most common strategies for data reduction and analysis of
qualitative data, though many qualitative projects do not require or use coding. This chapter will provide an overview of approaches based in coding, including how to develop codes
and how to go through the coding process.
In order to understand coding, it is essential to think about what it means for something
to be a code. To analogize to social media, codes might function a bit like tags or hashtags.
They are words or phrases that convey content, ideas, perspectives, or other key elements of
segments of text. Codes are not the same as themes. Themes are broader than codes—they
are concepts or topics around which a discussion, analysis, or text focuses. Themes are more
general and more explanatory—often, once we code, we find themes emerge as ideas to
explore in our further analysis (Saldaña 2016). Codes are also different from descriptors.
Descriptors are words or phrases that describe characteristics of the entire text and/or the
person who created it. For example, if we note the profession of an interview respondent,
whether an article is news or opinion, or the type of camera used to take a photograph,
those would be descriptors. Saldaña (2016) instead calls these attributes. The term attributes more typically refers to the possible answer choices or options for a variable, so it is
possible to think about descriptors as variables (or perhaps their attributes) as well.
Figure 1. Codes vs. Themes vs. Descriptors
Let’s consider an example. Imagine that you were conducting an interview-based study
looking at minor-league athletes’ workplace experiences and later-life career plans. In this
study, themes might be broad ideas like “aspirations” or “work experiences.” There would
be a vast array of codes, but they might include things like “short-term goals,” “educational
plans,” “pay,” “team bonding,” “travel,” “treatment by managers,” “family demands,” and
many more. Descriptors might include the athlete’s gender and what sport they play.
Developing a Coding System
While all approaches to coding have in common the idea that codes are applied to segments of text or visuals, there are many different ways to go about coding. These
approaches differ in terms of when they occur during the research process and how codes
are developed. First of all, there is a distinction between first- and second-cycle coding
approaches (Saldaña 2016). First-cycle coding happens early in the research process and
is really a bridge from data reduction to data analysis, while second-cycle coding occurs
later in the research process and is more analytical in nature. Another version of this distinction is the comparison between rough, analytic, and focused coding. Rough coding is
really part of the process of data reduction. It often involves little more than putting a few
words near each segment of text to make clear what is important in that segment, with the
approach being further refined as coding continues. In contrast, analytic coding involves
more detailed techniques designed to move towards the development of themes and findings. Finally, focused coding involves selecting ideas of interest and going back and re-coding your texts to orient your approach more specifically around these ideas (Bergin 2018).
A second set of distinctions concerns whether the data drives the development of codes
or whether codes are instead developed in advance. If codes are determined in advance,
or predetermined, researchers develop a set of codes based on their theory, hypothesis, or
research question. This sort of coding is typically called deductive coding or closed coding. In contrast, open coding or inductive coding refers to a process in which researchers
develop codes based on what they observe in their data, grounding their codes in the texts.
This second approach is more common, though by no means universal, in qualitative data
analysis. In both types of coding, however, researchers may rely upon ideas generated by
writing theoretical memos as they work through the connections between concepts, theory, and data (Saldaña 2016).
Finally, a third set of distinctions focuses on what is coded. Manifest coding refers to
the coding of surface-level and easily observable elements of texts (Berg 2009). In contrast,
latent coding is a more interpretive approach based on looking deeply into texts for the
meanings that are encoded within or symbolized by them (Berg 2009). For example, consider a research project focused on gender in car advertisements. A manifest approach
might count the number of men versus women who appear in the ads. A latent approach
would instead focus on the use of gendered language and the extent to which men and
women are depicted in gender-stereotyped ways.
Researchers need to answer two more questions as they develop their coding systems.
First, what to code, and second, how many codes. When thinking about what to code,
researchers can look at the level of individual words, characters or actors in the text, paragraphs, entire textual items (like complete books or articles), or really any unit of text (Berg
2009), but the most useful procedure is to look for chunks of words that together express a
thought or idea, here referred to as “segments of text” or “textual segments,” and then code
to represent the ideas, concepts, emotions, or other relevant thoughts expressed in those
chunks.
How many codes should a particular coding system have? There is no simple answer to
this question. Some researchers develop complex coding systems with many codes and
may have over a hundred different codes. Others may use no more than 25, perhaps fewer,
even for the same size project (Saldaña 2016). Some researchers nest codes into code trees,
with several related “child” codes (or subcodes) under a single “parent” code. For example, a
code “negative emotions” could be the parent code for a series of codes like “anger,” “frustration,” “sadness,” and “fear.” This approach enables researcher to use a smaller or larger
number of codes in their analysis as seems fit after coding is complete. While there is no
formula for determining the right number of codes for a particular project, researchers
should be attentive to overgrowth in the number of codes. Codes have limited analytical
value if they are used only once or twice—if a coding system includes many codes that are
applied only a small number of times, consider whether there are larger categories of codes
that might be more useful. Occasionally, there are codes worth keeping but applying rarely,
for example when there is a rare but important phenomenon that arises in the data. But for
the most part, codes should be used with some degree of frequency in order for them to
be useful for uncovering themes and patterns.
Types of Codes
A wide variety of different types of codes can be used in coding systems. The discussion
below, which draws heavily on the work of Saldaña (2016), details a variety of different
approaches to coding and code development. Researchers do not need to choose just one
of these approaches—most researchers combine multiple coding approaches to create an
overall system that is right for the texts they are coding and the project they are conducting. The approaches detailed here are presented roughly in order of the degree of complexity they represent.
At the most basic level is descriptive coding. Descriptive codes are nouns or phrases
describing the content covered in a segment of text or the topic the segment of text
focuses on. All studies can use descriptive coding, but it often is less productive of rich data
for analysis than other approaches might be. Descriptive coding is often used as part of
rough coding and data reduction to prepare for later iterations of coding that delve more
deeply into the texts. So, for instance, that study of sexism in advertisements might involve
some rough coding in which the researcher notes what type of product or service is being
advertised in each advertisement.
Structural coding, in contrast, attends more closely to the research question rather than
to the ideas in the text. In structural coding, codes indicate which specific research question, part of a research question, or hypothesis is being addressed by a particular segment
of text. This may be most useful as part of rough coding to help researchers ensure that
their data addresses the questions and foci central to their project.
In vivo coding captures short phrases derived from participants’ own language, typically
action-oriented. This is particularly important when researchers are studying subcultural
groups that use language in different ways than researchers are accustomed to and where
this language is important for subsequent analysis (Manning 2017). In this approach,
researchers choose actual portions of respondents’ words and use those as codes. In vivo
coding can be used as part of both rough and analytical coding processes.
A related approach is process coding, which involves “the use of gerunds to label actual
or conceptual actions relayed by participants” (Saldaña 2016:77). (Gerunds are verb forms
that end in -ing and can function grammatically as if they are nouns when used in sentences). Process coding draws researchers’ attention to actions, but in contrast to in vivo
coding it uses the researcher’s vocabulary to build the coding system. So, for instance, in
the study of minor league athletes discussed earlier in the chapter, process codes might
include “traveling,” “planning,” “exercising,” “competing,” and “socializing.”
Concept coding involves codes consisting of words or short phrases that represent
broader concepts or ideas rather than tangible objects or actions. Sticking with the minor
league athletes example, concept codes might include “for the love of the game,” “youth,”
and “exploitation.” A combination of concept, process, and descriptive coding may be useful if researchers want their coding system to result in an inventory of the ideas, objects, and
actions discussed in the texts.
Emotion codes are codes indicating the emotions participants discuss in or that are evoked
by a segment of text. A more contemporary version of emotion codes relies on “emoticodes” or
the emoji that express specific kinds of emotions, as shown in Figure 2.
Figure 2. A Selection of "Emoticodes"
Values coding involves the use of codes designed to represent the "perspectives or worldview" of a respondent by conveying participants' "values, attitudes, and beliefs" (Saldaña 2016:131). For example, a project on elementary school teachers' workplace satisfaction might include values codes like "equity," "learning," "commitment," and "the pursuit of excellence."
Do note that choices made in values coding are,
even more so than in other forms of coding, likely to reflect the values and worldviews of
the coder. Thus, it can be essential to use a team of multiple coders with different backgrounds and perspectives in order to ensure a values coding approach that reflects the
contents of the texts rather than the ideas of the coders.
Versus coding requires the construction of a series of binary oppositions and then the
application of one or the other of the items in the binary as a code to each relevant segment
of text. This may be a particularly useful approach for deductive coding, as the researcher
can set out a series of hypothesized binaries to use as the basis for coding. For example,
the project on elementary school teachers’ workplace satisfaction might use binaries like
feeling supported vs. feeling unsupported, energized vs. tired, unfulfilled needs vs. fulfilled
needs, kids ready to learn vs. kids needing services, academic vs non-academic concerns,
and so on.
Evaluation coding is used to signify what is and is not working in the policy, program, or
endeavor that respondents are discussing or that the research focuses on. This approach is
obviously especially useful in evaluation research designed to assess the merit or functioning of particular policies or programs. For example, if the project about elementary school
teachers was part of a mentoring program designed to keep new teachers in the education profession, codes might include “future orientation” to flag portions of the text in which
teachers discuss their longer-term plans and “mentor/mentee match” to flag portions in
which they explore how they feel about their mentors, both key elements of the program
and its goals.
There are a variety of other approaches more common outside of sociology, such as dramaturgical coding, which treats interview transcripts or fieldnotes as if they are scripts for a play, coding such things as actors, attitudes, conflicts, and
subtexts; coding approaches relying on terms and ideas from literary analysis; and those
drawn from communications studies, which focus on facets of verbal exchange. Finally,
some researchers have outlined very specific coding strategies and procedures such that
someone else could pick up their methods and apply them exactly. This sort of approach
is typically deductive, as it requires the advance specification of the decisions that will be
made about coding.
Some coding strategies incorporate measures of weight or intensity, and this can be
combined with many of the approaches detailed above. For example, consider a project
collecting narratives of people’s experiences with losing their jobs. Respondents might
include a variety of emotional content in their narratives, whether sadness, fear, stress, relief,
or something else. But the emotions they discuss will vary not only in type, they will also
vary in extent. A worker who is fired from a job they liked well enough but who knows they
will be able to find another job soon may express sadness while a worker whose company
closed after she worked there for 20 years and who has few other equivalent employment
opportunities in the region may express devastation. Code weights help account for these
differences.
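Where software supports weights, each application of a code can carry an intensity value alongside the code itself. The sketch below is a hypothetical illustration of how weighted applications of an emotion code might be stored and summarized; the segment labels, codes, and 1-to-5 scale are all invented for demonstration.

    from collections import defaultdict

    # Hypothetical weighted code applications: (segment, code, intensity 1-5)
    weighted_codes = [
        ("seg1", "sadness", 2),  # liked the job but expects to find another soon
        ("seg2", "sadness", 5),  # 20 years at the company, few equivalent options
        ("seg3", "relief", 3),
    ]

    # Mean intensity per code shows not just which emotions appear in the
    # narratives but how strongly they are expressed
    intensities = defaultdict(list)
    for _segment, code, weight in weighted_codes:
        intensities[code].append(weight)
    for code, weights in intensities.items():
        print(f"{code}: mean intensity {sum(weights) / len(weights):.1f}")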
A final question researchers must consider is whether they will apply only one code per
segment of text or will permit overlapping codes. Overlapping codes make data analysis
more complex but can facilitate the process of looking for relationships between different
concepts or ideas in the data.
Codebooks
As a coding system is developed and certainly upon its completion, researchers create documents known as codebooks. As is the case with survey research, codebooks lay out the
details of how the measurement instrument works to capture data and measure it. For surveys, a codebook tells researchers how to transform the multiple-choice and short-answer
responses to survey questions into the numerical data used for quantitative analysis. For
qualitative coding, codebooks instead explain when and how to use each of the codes
included in the project. Codebooks are an important part of the coding process because
they remind the researcher, and any other coders working on the project, what each code
means, what types of data it is meant to apply to, and when it should and should not be
used (Luker 2008). Even if a researcher is coding without others, it is easy to lose sight of
what you were thinking when you initially developed your coding system, and so the codebook serves as an important reminder.
For each code, the codebook should state the name of the code; include a few sentences describing the code and what it should be used for; note when the code should not be used; provide examples of both typical and atypical conditions under which the code would be used; and discuss the role the code plays in analysis (Saldaña 2016).
Codebooks thus serve as instruction manuals for when and how to apply codes. They can
also help researchers think about taxonomies of codes as they organize the code book, with
higher-level ideas serving as categories for groups of child, or more precise, codes.
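For researchers who keep their codebooks digitally, each entry can be stored as structured data with one field per element listed above. The Python sketch below is one possible representation, not a standard CAQDAS format; the class, field names, and example code are all this sketch's inventions.

    from dataclasses import dataclass, field

    @dataclass
    class CodebookEntry:
        # One code's codebook entry, mirroring the elements described above
        name: str
        description: str            # what the code captures and is used for
        do_not_use_when: str        # conditions under which the code should not apply
        examples: list = field(default_factory=list)  # typical and atypical uses
        role_in_analysis: str = ""  # how the code is expected to figure in analysis

    team_bonding = CodebookEntry(
        name="team bonding",
        description="Respondent describes building relationships with teammates.",
        do_not_use_when="Interactions with managers or family members.",
        examples=["'We're like brothers on the road' (typical)",
                  "Bonding described as an unwelcome obligation (atypical)"],
        role_in_analysis="Feeds the broader 'work experiences' theme.",
    )
    print(team_bonding.name, "-", team_bonding.description)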
The Process of Coding
So, what does the process of coding look like? While qualitative research can and does
involve deductive approaches, the process that will be detailed here is an inductive
approach, as this is more common in qualitative research. This discussion will lay out a
series of steps in the coding process as well as some additional questions researchers and
analysts must consider as they develop and carry out their coding.
The first step in inductive coding is to completely and thoroughly read through the data
several times while taking detailed notes. To Saldaña (2016), the most important question to
ask during this initial read is what is especially interesting or surprising or otherwise stands
out. In addition, researchers might contemplate the actions people take, how people go
about accomplishing things, how people use language or understand the world, and what
people seem to be thinking. The notes should include anything and everything—objects,
people, emotions, actions, theoretical ideas, questions—really anything, whether it comes
up again and again in the data or only once, though it is useful to flag or highlight those
concepts that seem to recur frequently in the data.
Next, researchers need to organize these notes into a coding system. This involves deciding which coding approach(es) to incorporate, whether or not to use parent and child
codes, and what sort of vocabulary to use for codes. Remember that readers will not see
the coding system except insofar as the researcher chooses to convey it, so vocabulary and
terms should be chosen based on the extent to which they make sense to the research
team. Once a coding system has been developed, the researcher must create a codebook. If
paper coding will be used, a paper codebook should be created. If researchers will be using
CAQDAS, or computer-aided qualitative data analysis software, to do their coding, it is often
the case that the codebook can be built into the software itself.
Next, the researcher or research team should rough code, applying codes to the text
while taking notes to reflect upon missing pieces in the coding system, ways to reorganize
the codes or combine them to enhance meaning, and relevant theoretical ideas and
insights. Upon completing the rough coding process, researchers should revise the coding
system and codebook to fully reflect the data and the project’s needs.
At this point, researchers are ready to engage in coding using the revised codebook.
They should always have someone else code a portion of the texts—usually a minimum of
10%—for interrater reliability checks, and if a larger research team is used, 10% of the texts
should be coded in common by all coders who are part of the research team. Even in cases
where researchers are working alone, it truly strengthens data analysis to be able to check
for interrater reliability, so most analysts suggest having a portion of the data coded by
another coder, using the codebook. If at all possible, additional coding staff should not be
told what the hypothesis or research question is, as one of the strengths of this approach
is that additional coding staff will be less likely to be influenced by preexisting ideas about
what the data should show (Luker 2008). There are various quantitative measures, such
as Chronbach’s alpha and Kappa, that researchers use to calculate interrater reliability, the
measure of how closely the ratings of multiple coders correspond. All coders should keep
detailed notes about their coding process and any obstacles or difficulties they encounter.
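To illustrate what such a statistic does, the sketch below computes Cohen's kappa by hand for two coders who each applied one code to the same ten segments; kappa compares observed agreement to the agreement expected by chance. The code names are hypothetical, and in practice researchers would rely on an established implementation (for example, sklearn.metrics.cohen_kappa_score) rather than this minimal version.

    from collections import Counter

    def cohens_kappa(coder_a: list, coder_b: list) -> float:
        # Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
        n = len(coder_a)
        observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        # Chance agreement: for each code, the product of the two coders'
        # marginal proportions, summed across codes
        freq_a, freq_b = Counter(coder_a), Counter(coder_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
        return (observed - expected) / (1 - expected)

    # Hypothetical codes applied by two coders to the same ten segments
    a = ["pay", "travel", "pay", "bonding", "travel",
         "pay", "bonding", "pay", "travel", "pay"]
    b = ["pay", "travel", "pay", "bonding", "pay",
         "pay", "bonding", "pay", "travel", "travel"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.8 raw agreement -> kappa of about 0.68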
How do researchers know they are done coding? Not just because they have gone
through each text once or twice! Researchers may need to continue repeating this process
of revision and re-coding until additional coding does not reveal anything more. This repetition is an essential part of coding, as coding always requires refinement and rethinking (Saldaña 2016). In Berg’s (2009:354-55) words, it is essential to “code minutely,” beginning with
a rough view of the entire text and then refining as you go until you are examining each
detail of a text. Then, researchers think about why and how they developed their codes and
what jumps out at them as important from the research as they delve into findings, making sure that nothing has been left out of the coding process before they move towards
data analysis.
One interesting question is whether the identities and standpoints (as discussed in
the chapter “The Qualitative Approach”) of coders matter to the coding process. Eduardo
Bonilla-Silva (Zuberi and Bonilla-Silva 2008:17) has described how, after a presentation discussing his research on racism, a colleague asked whether the coders were White or
Black—and he responded by asking the colleague “if he asked such questions across the
board or only to researchers saying race matters.” As Bonilla-Silva’s question suggests, race
(like other aspects of identity and experience, such as gender, immigration status, disability
status, age, and social class, just to name a few) very well might shape the way coders see
and understand data, functioning as part of a particular coding filter (Saldaña 2016). But
that shaping extends broadly across all issues, not just those we might assume are particularly salient in relationship to identities. Thus, it is best for research teams to be diverse so as
to ensure that a variety of perspectives are brought to bear on the data and that the findings reflect more than just a narrow set of ideas about how the world works.
Coding and What Comes After
If researchers will code by hand, they will need multiple copies of their data, one for reference and one for writing on (Luker 2008). On the copy that will be written on, researchers
use a note-taking system that makes sense to them—whether different-colored markers,
Roman numerals in the margins, a complex series of sticky notes, or whatever—to mark
the application of various codes to sections of the data. You can see an example of what
hand coding might look like in Figure 3 below, which is taken from a study of the comments faculty members make on student writing. Segments of text are highlighted in different colors, with codes noted in the margins next to the text. You can see how codes
are repeated but in different combinations. Once the initial coding process is complete,
researchers often cut apart the pieces of paper to make chunks of text with individual
codes and sort the pieces of paper by code (if multiple codes appear in individual chunks
of text, additional copies might be needed). Then, each pile is organized and used as the
basis for writing theoretical memos. Another option for coding by hand is to use an index
sheet (Berg 2009). This approach entails developing a set of codes and categories, arranging them on paper, and entering transcript, page, and paragraph information to identify
where relevant quotes can be found.
For more complex analytical processes, researchers will likely want to use software,
though there are limitations to software. Luker (2008), for instance, argues that when coding manually, she tends to start with big themes and only breaks them into their constituent parts later, while coding using software leads her to start with the smallest possible
codes. (One solution to this, offered by some software packages, is upcoding, where a
so-called “parent” code is simultaneously applied to all of the “child” codes under it. For
instance, you might have a parent code of “activism” and then child codes that you apply
to different kinds of activism, whether protest, legislative advocacy, community organizing,
or whatever.)
Figure 3. An Example of Hand Coding
Coding does not stand on its own, and thus simply completing the coding process does
not move a research project from data to analysis. While the analysis process will be discussed in more detail in a subsequent chapter, there are several steps researchers take
alongside coding or immediately after completing coding that facilitate analysis and are
thus useful to discuss in the context of coding. Many of these are best understood as part
of the process of data reduction. One of the most important of these is categorizing codes
into larger groupings, a step that helps to enable the development of themes. These larger
groupings, sometimes called “parent” codes, can collapse related but not identical ideas.
This is always useful, but it is especially useful in cases where researchers have used a large
number of codes and each one is applied only a few times. Once parent codes have been
created, researchers then go back and ensure that the appropriate parent code is assigned
to all segments of text that were initially coded with the relevant “child” codes (a step that
can be automated in CAQDAS). If appropriate, researchers may repeat this process to see if
parent codes can be further grouped. An alternative approach to this grouping process is to
wait until coding is complete, and then create more analytical categories that make sense
as thematic groupings for the codes that have been utilized in the project so far (Saldaña
2016).
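The assignment of parent codes to segments already carrying the relevant child codes is exactly the step that CAQDAS can automate. Here is a minimal Python sketch of that upcoding logic; the code tree and segments are hypothetical.

    # Hypothetical mapping from child codes to their parent codes
    parent_of = {
        "anger": "negative emotions",
        "frustration": "negative emotions",
        "sadness": "negative emotions",
        "protest": "activism",
        "legislative advocacy": "activism",
    }

    # Codes already applied to each segment of text
    segments = {
        "seg1": {"anger", "protest"},
        "seg2": {"sadness"},
        "seg3": {"legislative advocacy", "frustration"},
    }

    # Upcoding: wherever a child code appears, add its parent code too
    for codes in segments.values():
        codes |= {parent_of[c] for c in codes if c in parent_of}

    print(segments["seg1"])  # anger and protest plus both parent codes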
There are a variety of other approaches researchers may take as part of data reduction
or preliminary analysis after completing coding. They may outline the codes that have
occurred most frequently for specific participants or texts, or for the entire body of data,
or the codes that are most likely to co-occur in the same segment of text or in the same
document. They may print out or photocopy documents or segments of text and rearrange
them on a surface until the arrangement is analytically meaningful. They may develop diagrams or models of the relationships between codes. In doing this, it is especially helpful
to focus on the use of verbs or other action words to specify the nature of these relationships—not just stating that relationships exist, but exploring what the relationships do and
how they work.
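Both frequency counts and co-occurrence counts are mechanical once each segment's codes are stored together, which is part of why software handles them so readily. A minimal sketch, again with invented codes:

    from collections import Counter
    from itertools import combinations

    # Hypothetical coded segments: each set holds the codes applied to one segment
    segments = [
        {"pay", "travel"},
        {"pay", "team bonding", "travel"},
        {"team bonding", "family demands"},
        {"pay", "travel"},
    ]

    code_counts = Counter(code for codes in segments for code in codes)
    pair_counts = Counter(
        pair for codes in segments for pair in combinations(sorted(codes), 2)
    )

    print(code_counts.most_common(3))  # the most frequently applied codes
    print(pair_counts.most_common(3))  # the codes most often applied together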
In inductive coding especially, it is often useful to write theoretical and analytical memos
while coding occurs, and after coding is completed it is a good time to go back and review
and refine these memos. Here, researchers clearly articulate to themselves how the
coding process occurred and what methodological choices they made, as well as what preliminary ideas they have about analysis and potential findings. It can be very useful to summarize one's thinking and any patterns that might have been observed so far as a step in
moving towards analysis. However, it is extremely important to remember the data and not
just the codes. Qualitative researchers always go back to the actual text and not just the
summaries or categories. So a final step in the process of moving toward analysis might
be to flag quotes or data excerpts that seem particularly noteworthy, meaningful, or analytically useful, as researchers need these examples to make their data come alive during
analysis and when they ultimately present their results.
Becoming a Coder
This chapter has provided an overview of how to develop a coding system and apply that
system to the task of conducting qualitative coding as part of a research project. Many new
researchers find it easy—if sometimes time-consuming and not always fascinating—to get
engaged with the coding process. But what does it take to become an effective coder?
Saldaña (2016) emphasizes personality attributes and skills that can help. Some of these
are attributes and skills that are important for anyone who is involved in any aspect of
research and data analysis: organization, to keep track of data, ideas, and procedures; perseverance, to ensure that one keeps going even when the going is tough, as is often the
case in research; and ethics, to ensure proper treatment of research participants, appropriate data security behaviors, and integrity in the use of sources. In most aspects of data
analysis, creativity is also important, though there are some roles in quantitative data analysis that require more in the way of technical skills and ability to follow directions. In qualitative data analysis, creativity remains important because of the need to think deeply and
differently about the data as analysis continues. Flexibility and the ability to deal with ambiguity are much more important in qualitative research, as the data itself is more variable
and less concrete; quantitative research tends to place more emphasis on rules and procedures. A final strength that is particularly important for those working in qualitative coding
is having a strong vocabulary, as vocabulary both helps researchers understand the data
and enhances their ability to create effective and useful coding systems. The best way to
develop a stronger vocabulary is to read more, especially within your discipline or field but
broadly as well, so researchers should be sure to stay engaged with reading, learning, and
growing.
Reading, learning, and growing, along with a lot of practice, is of course how researchers
enhance their data collection, coding, and data analysis skills, so keep working at it. Qualitative research can indeed be easy to get started with, but it takes time to become an expert.
Put in the time, and you, too, can become a skilled qualitative data analyst.
Exercises
1. For each of the following words or phrases, consider whether it is most likely to represent a code, a theme, or a descriptor. Explain your response.
◦ Female respondent
◦ Energized
◦ The relationship between poverty and social control
◦ Creative
◦ A teacher
◦ The process of divorce
◦ Social hierarchies
◦ Grief
2. Pick a research topic you find interesting and determine which of the approaches to coding detailed in this chapter might be most appropriate for your topic, then write a paragraph about why this approach is the best.
3. Sticking with the same topic you used to respond to Exercise 2, brainstorm some codes that might be useful for coding texts related to this topic. Then, write appropriate text for a codebook for each of those codes.
4. Select a hashtag of interest on a particular social media site and randomly sample every other post using that hashtag until you have selected 15 posts. Then inductively code those posts and engage in summarization or classification to determine what the most important themes they express might be.
5. Create a codebook based on what you did in Exercise 4. Exchange codebooks and posts with a classmate and code each other's posts according to the instructions in the codebook. Compare your results—how often did your coding decisions agree and how often did they disagree? What does this tell you about interrater reliability, codebook construction, and coder training?
Media Attributions
• codes themes descriptors © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC (Attribution NonCommercial) license
• Emoticodes © AnnaliseArt is licensed under a CC BY (Attribution) license
• Hand Coding Example © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-ND (Attribution NonCommercial NoDerivatives) license
13. From Qualitative Data to
Findings
MIKAILA MARIEL LEMONIK ARTHUR
So far in this text, you have learned about various approaches to managing, preparing,
reducing, and otherwise interacting with qualitative data. Because of the iterative and cyclical nature of the qualitative research process, it is not accurate to say that these steps come
before analysis. Rather, they are an integral part of analysis. Yet there are procedures and
methods for moving from the data to findings that are essential to completing a qualitative data analysis project. This chapter will outline three basic strategies for analysis of qualitative data: theoretical memos, data displays, and narratives; discuss how to move towards
conclusions; and suggest approaches for testing these conclusions to ensure that they hold
up to scrutiny.
But before the final stages of analysis occur, researchers do need to take a step back and
ensure that data collection really is finished—or at least finished enough for the particular
phase of analysis and publication the researcher is working on, in the case of a very long-term project. How do researchers know their data collection has run its course? Well, in
some cases they know because they have exhausted their sample. If a project was designed
to include interviews of forty respondents or the collection of 500 social media posts, then it
is complete when those interviews have been conducted or those social media posts have
been saved. In other cases, researchers know that data collection is complete when they
reach saturation, or the point in the research process where continuing to engage in data
collection no longer yields any new insights. This way of concluding data collection is more
common in ethnographic work or work with archival documents.
In addition, since qualitative research often results in a truly enormous amount of data,
one of the key tasks of analysis is finding ways to select the most central or important ideas
for a given project. Keep in mind that doing so does not mean dismissing other ideas as
unimportant. Rather, these other ideas may become the basis for another analysis drawing
on the same data in the future. But one project, even an entire book, often cannot contend
with the full body of data that a researcher or research team has collected. That is why it is
important to engage in data reduction before or alongside the analysis process.
As researchers move from data towards findings, it is essential that they remember that, because most qualitative research, unlike much quantitative research, draws on small or otherwise unrepresentative samples, its findings cannot be generalized. Thus, while the findings of qualitative research may be suggestive of general patterns, they must be regarded
as just that: only suggestive.
Similarly, qualitative research cannot demonstrate causation. Demonstrating causation
requires three elements:
• Association, or the ability to show a clear relationship between the phenomena, concepts, or processes in question,
• Temporal order, or the ability to show that the supposed cause came earlier in time
than the supposed effect, and
• Elimination of alternatives, or the ability to show that there is no possible alternative
explanation that could account for the phenomena in question.
While qualitative research can demonstrate association and temporal order, it cannot
eliminate all alternative explanations—only well-designed and properly-controlled laboratory experiments can do so. Therefore, qualitative researchers (along with any quantitative
researchers who are not relying on data from controlled laboratory experiments) need to
take care to stay away from arguments suggesting that their analysis has proven anything
or shown a causal relationship. However, qualitative researchers can locate evidence that
supports the argument that a relationship is causal, leading to sentences like “This provides
evidence suggestive of a causal relationship between X and Y.”
Theoretical Memos
Memo-writing has been discussed in prior chapters as a strategy for data reduction, but
memos (or theoretical notes) can be a key element in the process of moving from data
to findings. Memos and theoretical notes are texts the researcher writes for themselves in
which they work through ideas from the data, connect examples and excerpts to themes
and theories, pose new questions, and illuminate potential findings. Memos can serve as a
way to move from the rich, contextual detail of qualitative data—detail that can sometimes
be overwhelming—towards the broader issues and questions that motivate a study. Initial
memos are often drafted while data collection is still going on. For instance, a researcher
might write reflective memos that integrate preliminary thoughts and ideas about data,
help clarify concepts central to the research project, or pull together disparate chunks of
data. But additional, and more specifically analytical, memos come later in the process.
These memos may focus on a variety of topics and ideas, including reflecting on the
researcher’s role and thought processes; contemplating the research question, potential
answers, and shifts in the research focus; noting choices about coding strategies, the coding process, and choices made during coding; clarifying ethical or methodological issues
that have arisen in the course of the research; considering what further research may need
to be done in the future; and trying out ideas that may become part of the final analysis
or write-up (Saldaña 2016). What is integral to the memo-writing approach is the recognition that writing is a key part of thinking (Roberts 1993). Thus, it is in drafting memos that
researchers can come to better understand the ideas shaping their study and what their
data is saying in response to their research question.
Saldaña (2016:274-76) suggests a variety of techniques for analytical processes that use
theoretical memos as thinking tools:
• Select the ten most interesting passages—a paragraph to half a page in length—from
your data. Arrange, rearrange, and comment on these passages.
• Choose the three “codes, categories, themes, and/or concepts” (Saldaña 2016:275) that
most stand out to you. Write about how they relate to one another.
• Write a narrative in as few sentences as possible that incorporates all of the most
important key words, codes, or themes from your study.
• Make sure your analysis is based in concepts instead of nouns. Saldaña advises the use
of what he calls the “touch test”—if you can touch it, it is a noun rather than a concept.
For instance, you can touch a formerly incarcerated person, but you cannot touch their
reentry strategy for life after prison. Thus, if the data still seems to be focused on or
organized around nouns, write about the concepts that these nouns might exemplify
or illuminate.
Alternatively, you can deconstruct your research question into its central parts and write
memos laying out what the data seems to be telling you about the answer(s) to each one
of these parts. Then, you can go back to the data to see if your impressions are supported or
if you are missing other key elements. But note that theoretical memos cannot be simply
lists of anecdotes or ways to confirm pre-existing ideas. They must be explicitly and thoroughly grounded in the data.
Another option, related to the crafting of theoretical memos but distinct in its approach,
is to create what Berg (2009) calls “short-answer sheets.” These function as a kind of
open-ended survey, except instead of administering the survey directly to a respondent,
researchers use an interview transcript or a text as a data source to complete the survey
themselves. The short-answer sheet might summarize respondents’ answers to key portions of the research question or detail how texts address particular issues of concern to
the project, for instance. Such an approach can help categorize different themes or types
of responses by making it easier to see commonalities and patterns.
Data Displays
Another tool for moving from data to findings is called a data display. Data displays are
diagrams, tables, and other items that enable researchers to visualize and organize data
so that it is possible to clearly see the patterns, comparisons, processes, and themes that
emerge. These patterns, comparisons, processes, and themes then enable the researcher to
articulate conclusions and findings. Data displays can be used as analytical tools, as part of
the presentation process (to be discussed elsewhere), or to serve both purposes simultaneously. While the discussion of data displays in this chapter will of necessity be introductory,
researchers interested in learning more about the development and use of data displays
in qualitative research can consult Miles and Huberman’s (1994) thorough and comprehensive sourcebook.
Researchers who use data displays need to remember that even as the displays enable
the drawing of conclusions through looking for patterns and themes, this is not sufficient
to support analysis and writeup. The display is simply a tool for analysis and understanding
and cannot fully encapsulate the findings or the richness of the data. Thus, the analytical
process always needs to return to the data itself, and whatever researchers write up or present needs to include stories, quotes, or excerpts of data to bring the concepts and ideas
that are part of the study alive. The display is just that, a display; it does not itself contain or encapsulate the conclusions or the analysis.
There are a variety of types of data displays, and this chapter will only present two common types: first, process and network diagrams, and second, matrixes, including tables,
chronologies or timelines, and related formal methods. This summary should provide
researchers with a sense of the possibilities. Of course, researchers can craft new
approaches that make sense to them and for their specific projects—they should not feel
bound to using only those types of data displays they have previously encountered. It is also
worth noting here that many CAQDAS software packages integrate their own proprietary approaches to data display, and these can be very helpful for researchers using such software.
approaches to data display, and these can be very helpful for researchers using such software.
Process or Network Diagrams
Drawing various kinds of diagrams can help make sense of the ideas and connections
in the data (Taylor, Bogdan, and DeVault 2016). Process and network diagrams are both
key ways of visualizing relationships, though the types of relationships they visualize differ.
Process diagrams visualize relationships between steps in a process or procedure, while
network diagrams visualize relationships between people or organizations. There is also
a specialized kind of network diagram called a cognitive map that shows how ideas are
related to one another (Miles and Huberman 1994). Process diagrams are particularly useful
for policy-related applied research, as they can help researchers understand where an intervention would be helpful or where a current policy or program is falling short of its goals, as
well as whether the number or complexity of steps in a process may be getting in the way
of optimal outcomes.
Figure 1. A Decision Tree or Process Diagram of Options for After High School
The category of visualizations that we call process diagrams also includes decision trees
and flow charts. Decision trees are diagrams that specifically lay out the steps that people
or organizations take to make decisions. For example, consider Figure 1, which is a hypothetical decision tree that could have emerged from a study of high school students considering their post-high school plans. Such a diagram can allow researchers to uncover
unexpected sticking points or organize an analytical narrative around stages in the deci-
sion-making process. For example, maybe students are not aware of the option to participate in paid service programs or consider a gap year, options depicted in Figure 1. And, as
the question mark suggests, those entering into criminal activity, such as drug sales, may
see what they do as a kind of under-the-table employment—or as a rejection of employment. Similarly, flow charts can be used to diagram stages in a process and can depict
complex, multidimensional relationships between these stages (flow charts can also be
used to diagram personal relationships—one common use of flow charts is to diagram
organizational relationships within corporations or other organizations, but more on this
later).
To develop any type of process diagram, researchers should review their interview transcripts, fieldnotes, documents, and other data for instances where decisions are made (or
avoided), events occurred that could have had a different outcome, or steps are taken or not
taken that lead or could have led in a particular direction. All of these occurrences should
then be listed and categorized, and researchers should work out which decisions or steps
seem to depend on earlier decisions or steps, to be associated with other co-occurring decisions or steps, or to shape later decisions or steps. It is essential to ensure that no decisions
or steps are missed when the diagram is created. For example, if one were to create a diagram of the process of making a peanut butter and jelly sandwich, it would not be sufficient
to say “put peanut butter and jelly on bread.” Rather, the diagram would have to account
for choosing and obtaining the peanut butter, jelly, and bread; the utensils needed to spread
the peanut butter and jelly; how to go about spreading it; and the fact that the sandwich
needs to be closed up after the spreading is complete.
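Researchers who prefer to sketch such diagrams programmatically before drawing them can represent the steps and branches with a simple nested data structure. The following is a minimal sketch in Python using only the standard library; the option labels are hypothetical and only loosely echo the choices depicted in Figure 1.

# Each key is a step or decision; its value maps out possible follow-ups.
# Leaves (final outcomes) are empty dicts. All labels are hypothetical.
plans = {
    "Finish high school": {
        "Enter the workforce": {
            "Full-time employment": {},
            "Under-the-table employment?": {},
        },
        "Continue education": {
            "Two-year college": {},
            "Four-year college": {},
        },
        "Take a gap year": {},
    }
}

def print_tree(node, depth=0):
    """Print the decision tree as an indented outline."""
    for step, branches in node.items():
        print("  " * depth + "- " + step)
        print_tree(branches, depth + 1)

print_tree(plans)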
Network diagrams, as noted above, visualize the connections between people, organizations, or ideas. There is an entire subfield of sociology concerned with network analysis, and it has developed a specialized language for talking about these diagrams. In this specialized language, the individual people, organizations, or ideas included in the diagram are called nodes (represented by the blue circles in Figure 2), while the lines connecting them are called edges (represented by the black lines in Figure 2). Analysts can choose to use double-headed arrows, which indicate reciprocal relationships, or single-headed arrows, which indicate one-way relationships—or lines without arrows, if the direction of relationships is unclear, as in Figure 2.

Figure 2. A Hypothetical Network Diagram

While the field of network analysis typically relies on quantitative and computational methods to draw conclusions from these diagrams, which can be large and complex enough that generating them requires specialized software, network diagrams can also be useful tools for smaller-scale qualitative research because of the way they enable visualization of complicated webs of relationships.
To develop a network diagram, researchers must begin by combing through their data
to find all of the individual nodes (people, organizations, ideas, or other entities) that will
be included in the diagram. Then, they must find the relationships between the nodes,
which can require a lot of time spent reviewing codes and categories produced as part of
data reduction as well as many memoing tasks designed to help clarify these relationships
and understand where each node fits in the broader scheme of things. Analysts can then
draw diagrams by hand, by using computer graphics software, or by using specialized network analysis software. Alternatively, researchers may wish to create a flow chart. As noted
above, flow charts can be extremely useful for diagramming relationships—the difference
is that in a flow chart the relationships tend to be more formalized, like supervisory relationships in a corporation or volunteer organization (this is often referred to as an organizational chart) or kinship ties in a family tree. Once researchers create an initial draft of their
network diagram, they need to look back at the evidence and data they have collected,
especially searching for disconfirming bits of data that might suggest alternative network
arrangements, in order to be sure they are on the right track.
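Because hand-drawing becomes impractical as networks grow, researchers sometimes generate these diagrams in software. Below is a minimal sketch of what that might look like in Python, assuming the widely used networkx and matplotlib libraries are installed; the nodes and ties are invented for illustration, not drawn from an actual study.

import networkx as nx
import matplotlib.pyplot as plt

# A directed graph distinguishes one-way from reciprocal ties.
G = nx.DiGraph()

# Hypothetical nodes and relationships, as might be compiled from coded fieldnotes.
ties = [
    ("Case worker", "Client A"),
    ("Client A", "Case worker"),  # reciprocal tie: arrows in both directions
    ("Client A", "Food pantry"),  # one-way tie
    ("Client B", "Food pantry"),
    ("Case worker", "Client B"),
]
G.add_edges_from(ties)

# Draw the diagram: circles are nodes, arrows are edges.
nx.draw_networkx(G, node_color="lightblue", font_size=8)
plt.axis("off")
plt.show()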
Once network diagrams are created, they can be used in a variety of ways to help
researchers build towards conclusions and findings. As noted above, they can be especially
useful for understanding processes or policies, as they enable analysts and viewers to see
whether something is working, where the social breakdowns might be, or where a procedure might benefit from changes. Of course, network diagrams also help make relationships and patterns of interaction clear, and by following the path laid out in the diagrams,
it is possible to uncover insight into what the extended network surrounding an individual
might look like—a network the individual themselves may not even be fully conscious of.
Examining the shape and structure of the network can also answer a variety of questions.
For example: are lots of entities clustered closely together in one area? Are there parts of the
diagram where it seems like there are very few connections, or where it takes a lot of steps
to get from one person to another? Are there cliques, or portions of the network where
everyone is close with everyone else in their subgroup, or are ties spread more evenly? Are
there ties that seem like they should exist but do not? And how does the diagram map on
to social processes in the real world, such as the spread of ideas, behaviors, or disease? The
possibilities go on. The point, though, is that these approaches allow researchers to see the
way relationships interact and intersect as part of building their analysis.
Matrices and Tables
Another type of data display arranges data in a grid. We call these displays matrices (the
plural of matrix) or tables. Such displays let researchers more clearly see patterns of similarities and differences and observe comparisons across cases or types of cases. In developing a matrix-based data display, researchers have a variety of choices to make. The most
important concern is that the matrix be usable for analysis and display purposes. Thus,
researchers should strive to create matrixes that they can see all at once—those fitting on
a single sheet of paper or a single computer screen. As a simple heuristic or rule designed
to help guide matrix development, try to ensure that a matrix has no more than 10-12 rows
or columns, and a number closer to six may be preferable. If a matrix cannot be adjusted
to usably fit into one view, researchers should consider whether it is possible to break the
matrix into component parts, each of which fits onto a screen. For example, a study of
racism in the residential real estate industry might, rather than including one overall matrix,
include separate matrixes detailing the experiences of real estate agents, home buyers, and
home sellers, or alternatively separate matrixes for rental markets and sales markets. While
many researchers draw matrixes by hand, word processing programs have excellent table
construction features that enable the development of clear, organized, and well-formatted
matrixes.
In order to populate the matrix, it is necessary for researchers to already have completed
some level of data reduction and preliminary analysis, such as coding and/or the development of typologies or categories. These steps will then form the basis of matrix development. However, there are many different ways to organize the data in a matrix, and thus
there are a variety of questions researchers must think through as they develop this type of
data display.
First, will the matrix be used to explore one case at a time, with the grid being used to illuminate elements of the case and with a new copy of the matrix being completed for each
case covered in the data? Or alternatively, will it be used to make cross-case comparisons
or look for patterns?
Second, will this matrix be descriptive or explanatory? A descriptive matrix is especially
helpful in circumstances where researchers are trying to categorize or classify data or
develop a typology with key characteristics for each type or category. In contrast, a matrix
designed to facilitate explanation needs to go well beyond the detailed characteristics of
each case to instead think about comparisons, processes, potentially-causal mechanisms,
and other dynamics, enabling researchers to see relationships among the data.
Third, which axes of variation—variables, if you will—will be incorporated into the table?
Remember, a typical matrix has two dimensions, the columns and the rows, though modern computer applications can create matrixes that are multidimensional and thus include
one or more additional axes. These axes might include individual respondents, types or
categories of people, settings or contexts, actions people take, events that have occurred,
distinct perspectives, organizations, processes and procedures, or portions of the research
question, just to name a few.
Fourth, does the order of the data in the table show anything important about the data?
In many cases it does not, and the order of table columns or rows is, for all intents and purposes, random. But this need not be the case—order can be a valuable part of both the
development of the data display and the analytical power it presents. Order can be used to
convey information about participants’ social roles, timing and process, or the magnitude
or strength of ideas and emotions, just to give a few examples.
Fifth, what information will be placed within each cell of the matrix or table? Will the
table cells include summaries of data, events, or stories? Vignettes developed from multiple
accounts? Cross-references to specific interview transcripts or documents? Key quotes?
Explanations of what the data shows about the intersection of the ideas noted in the
columns and rows? Images? Or perhaps there is another idea that is a better fit for a particular research project.
Once the design and structure of the matrix have been developed, researchers then
return to their data to complete the matrix, filling in headers for the rows and columns
and then entering information of the specified kind into each table cell. To do so properly,
researchers need to be sure they are returning to the raw data frequently to check that
details are correct and that nothing has been inappropriately excluded.
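Researchers comfortable with scripting can also assemble a draft matrix directly from their coded excerpts rather than building it by hand. Here is a minimal sketch in Python, assuming the pandas library is available; the respondents, themes, and quotes are invented placeholders rather than real data.

import pandas as pd

# Hypothetical coded excerpts: (respondent, theme, quote).
excerpts = [
    ("R01", "Affordability", "He just couldn't keep steady work."),
    ("R01", "Trust", "I handle my own money now."),
    ("R02", "Affordability", "Rent comes first, always."),
    ("R02", "Control", "Nobody decides for my kids but me."),
]
df = pd.DataFrame(excerpts, columns=["respondent", "theme", "quote"])

# Pivot into a matrix: one row per theme, one column per respondent,
# with each cell holding that respondent's quotes for that theme.
matrix = df.pivot_table(index="theme", columns="respondent",
                        values="quote", aggfunc=" / ".join)
print(matrix)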
After the matrix has been fully completed, it can be used to facilitate analysis. The design
of a matrix makes it particularly useful for observing patterns or making comparisons, but
matrixes can also be helpful in making themes or categories clear or looking for clusters
of related ideas or cases. For example, consider the matrix shown in Table 1. Note that this
matrix was constructed based on an already-published article (Edin 2000), but it reflects
the kind of approach a researcher might take as part of the creation of data displays
for analysis. The study on which Table 1 was based was an inductive in-depth interview
study of low-income single mothers who identified as Black or White in three United States
cities, looking at how low-income women think about marriage. While the author, Kathryn
Edin, does not detail her analysis strategy in the article, she likely used a coding-based
approach to uncover five key factors shaping women’s thoughts about marriage: affordability, respectability, control, trust, and domestic violence. By selecting a series of quotes
that illustrate each theme and categorizing them by the race of the respondent, it is possible to see whether there are important racial differences in any of the themes. In fact—as
may be clear from Table 1 itself—there are not major racial differences in the way Black and
White respondents discussed each theme, except that White respondents were more likely
to think “marrying up” was a realistic possibility; there were, however, differences in the frequency with which respondents discussed each theme, which Edin discusses in the text of
the article. As this might suggest, some researchers, including Edin, use counting as part
of the matrix process, whether to count rows and columns with specific characteristics or to
insert numbers into table cells to quantify the frequency of particular types of responses,
but care should be taken with numbers to avoid risks related to over-quantification (as discussed elsewhere in this text).
Table 1. Themes and Illustrative Quotes from Black and White Respondents (Edin 2000)

Affordability
• “Men simply don’t earn enough to support a family. This leads to couples breaking up.”
• “You can’t get married and go on living with your mother. That’s just like playing house.”
• “We were [thinking about marriage] for a while, but he was real irresponsible. I didn’t want to be mean or anything, [but when he didn’t work] I didn’t let him eat my food.”

Respectability
• “I am not going to get married and pay rent to someone else. When we save up enough money to [buy] an acre of land and [can finance] a trailer, then we’ll marry.”
• “I couldn’t get him to stay working….It’s hard to love somebody if you lose respect….”
• I plan to “marry out of poverty” and become a housewife.
• I want to marry “up or not at all.”

Control
• “[I won’t marry because] the men take over the money. I’m too afraid to lose control of my money again.”
• “If we were to marry, I don’t think it would be so ideal. [Husbands] want to be in charge, and I can’t deal with that.”
• “I’m the head of the household right now, and I make the [financial] decisions. I [don’t want to give that up].”
• “[Marriage isn’t an option] right now. I don’t want any man thinking that he has any claim on my kids or on how I raise them.”
• “One thing my mom did teach me is that you must work some and bring some money into the household so you can have a say in what happens. If you completely live off a man, you are helpless. That is why I don’t want to get married until I get my own [career] and get off of welfare.”

Trust
• “I don’t want to depend on nobody. It’s too scary.”
• “You know, I feel better [being alone] because I am the provider, I’m getting the things that I want and I’m getting them for myself, little by little.”
• “All those reliable guys, they are gone, they are gone. They’re either thinking about one of three things: another woman, another man, or dope….”
• “I was married for three years before I threw him out after discovering that he had another woman. I loved my husband, but I don’t [want another one]. This is a wicked world we are living in.”
• “I would like to find a nice man to marry, but I know that men cannot be trusted.”
• “I want to meet a man who will love me and my son and want us to grow together. I just don’t know if he exists.”
• “I’m frustrated with men, period. They bring drugs and guns into the house, you take care of their kids, feed them, and then they steal your rent money out of your purse.”
• “Love is blind. You fall in love with the wrong one sometimes. It’s easy to do.”

Domestic Violence
• “My daughter’s father, we used to fight. I got to where nobody be punching on me because love is not that serious. And I figure somebody is beating on you and the only thing they love is watching you go the emergency room. That’s what they love.”
• “…after being abused, physically abused, by him the whole time we were married, I was ready to [kill him]. He put me in the hospital three times. I was carrying our child four and a half months, he beat me and I miscarried.”
• “So [we got in the car] and we started arguing about why he had to hang around people like that [who do] drugs and all that sort of stuff. One thing led to another and he kind of tossed me right out of the car.”
• “I was terrified to leave because I knew it would mean going on welfare…. But that is okay. I can handle that. The thing I couldn’t deal with is being beat up.”
A specialized type of matrix that works somewhat differently is the timeline or chronology.
In this type of matrix, one axis is time (whether times of day, days of the week, dates, years,
ages, or some other time scale), and the researcher fills in details about events, interactions,
or other phenomena that occurred or that could be expected to occur at each given time.
Researchers can create separate timelines for individual respondents or organizations, or
can use common timelines to group and summarize data.
Matrixes also form the basis of a set of more complex analytical methods, including
truth tables and related formal methods, such as Charles Ragin’s Qualitative Comparative
Analysis (QCA). While a full exploration of such approaches is beyond the scope of this
text, new data analysts should be aware of the existence of these techniques. They are
designed to formalize qualitative analysis (Luker 2008) using a rigorous procedure focusing
on the existence of necessary and sufficient factors. Ragin (2000; 2008) uses Boolean algebra and the logic of sets to determine which factors or groups of factors are necessary
and which factors or groups of factors are sufficient for each important outcome. Boolean
algebra is the mathematical expression of logical relationships based in the principles of
syllogism—deductive reasoning based on a series of premises (Boole 1848). Sets simply
represent groups of similar objects that are classified together. Thus, in QCA and other
techniques based on this approach, factors and combinations of factors are evaluated to
see which are necessary—without them, the outcome cannot happen—and which are sufficient—with them, the outcome must always happen.
There are a variety of other types of formal methods as well. For example, in case study
research, researchers might use quasi-experimental approaches to compare cases across
key axes of variation (Gerring 2007). Another approach is called process tracing, in which
data, “with the aid of a visual diagram or formal model” is used to “verify each stage of this
model” (Gerring 2007:184), and there are many others not discussed here. Researchers who
seek formal approaches for developing within-case or cross-case comparisons can turn to
the robust literature on case study research as they enhance their methodological toolkits.
Narrative Approaches
In many cases, researchers rely on the crafting of narratives as a key part of the analytical
process. Such narratives may be used only for analysis or they may be integrated into
the ultimate write-up of the project. There are a variety of different narrative approaches
(Grbich 2007). One approach is the case study, in which the researcher tells the story of a
specific individual or organization as part of an extended narrative discussion or a short
summary. A given project may include multiple case studies. Case studies can be used
to holistically highlight the dynamics and contexts under exploration in a project, seeking
to describe and/or explain the dynamics of the case or cases under examination. Another
approach is to use vignettes, or “small illustrative stor[ies]” (Grbich 2007:214) created by
summarizing data or consolidating data from different sources. Vignettes can be used to
bring attention to particular themes, patterns, or types of experiences. Similarly, anecdotes
are short tales that are used to illustrate particularly central ideas within the discussion.
Narratives can be descriptive—simply using the data to tell a story—or theoretical and analytical, where data is used to illustrate concepts and ideas (Taylor, Bogdan, and DeVault
2016).
Narrative approaches are used most frequently for ethnographic and archival data. Here,
the common strategy is to craft what is called a thick description, a detailed account that
is rich with details about the observations, actors, and contexts, with analytical points highlighted in the course of the narrative (Berg 2009). Researchers using such a strategy often
also employ metaphor to help them make sense of the ideas that they are working with.
Berg (2009:236), for example, uses “revolving-door justice,” a phrase used to refer to a situation in which people rapidly move into and out of the criminal justice system, as an example
of a metaphor which is analytically useful.
Grounded theory is a particular approach to both data collection and analysis in which
researchers collect data, identify themes in the data, review and compare the data to see
if it fits the themes and theoretical concepts, collect additional data to continue to refine
the analytical ideas, and ultimately build theory that reflects what the data has to say about
the world (Taylor, Bogdan, and DeVault 2016). When utilizing a grounded theory approach,
researchers must take care to ensure that theories they build are in fact grounded in data,
rather than in prior knowledge or scholarship (Bergin 2018). Thus, grounded theory is a
highly inductive approach to research and data analysis. A grounded theory approach to
data analysis involves open coding, the writing of theoretical memos, further selective coding to validate categories and ideas, and integration of the empirical data with the theoretical memos (Grbich 2007).
Narrative approaches can also center the use of quotes from participants. Grbich (2007)
discusses several ways to do this:
• Layering, or the interweaving of different perspectives to show how views and experiences may diverge;
• Pastiche, in which many voices are brought together to create an overall picture of
experiences; and
• Juxtaposition, in which opposing views are contrasted to one another.
Most common is an approach in which a series of quotes exemplifying a particular theme
or idea are presented together. Each quote must be analytically explained in the body of
the text—quotes cannot simply stand on their own. For example, Figure 3 below presents
an excerpt from the section of Kathryn Edin’s paper on motherhood and marriage among
poor women that focuses on control. In it, you can see how a theme is discussed and a
series of quotes showcasing respondents’ words relating to this theme are presented, with
the meaning and relevance of each quote explained before or after the quote.
Whether to choose memo-writing, data display, or narrative approaches to analysis, or
some combination of two or more of these approaches, is determined both by researchers’
personal styles of research and analysis and by the nature, type, and complexity of the
data. For instance, a multifaceted case study of an organization with multiple departments
may have a story which is too complex for a narrative approach and thus requires the
researcher to find other ways of simplifying and organizing the data. An ethnography of
a girls’ soccer team over the course of a single season, though, might lend itself well to a
narrative approach utilizing thick description. And interviews with people recovering from
surgery about their experiences might best be captured through a narrative approach
focusing on quotes from participants.
Figure 3. An Example of the Use of Quotes in a Narrative Approach (Edin 2000:121)
Making Conclusions
Theoretical memos, data displays, and narratives are not themselves conclusions or findings. Rather, they are tools and strategies that help researchers move from data towards
findings. So how do researchers use these tools and strategies in service of their ultimate
goal of making conclusions? Most centrally, this occurs by looking for patterns, comparisons, associations, or categories. And patterns are probably the most common and useful
of these. There are a variety of types of patterns that researchers might encounter or look
for. These include finding patterns of similarities, patterns of predictable differences, patterns of sequence or order, patterns of relationship or association, and patterns that appear
to be causal (Saldaña 2016).
Making comparisons across cases is one way to look for patterns, and doing so also
enhances researchers’ ability to make claims about the representativeness of their data.
There are a variety of ways to make comparisons. Researchers can make predictions about
what would happen in different sets of circumstances represented in the data and then
examine cases to see whether they do or do not fit with these predictions. They can look
at individual variables or codes across cases to see how cases are similar or different. Or
they can focus on categorizing whole cases into typologies. Categorization is an especially important part of research that has used coding. Depending on the approach taken,
researchers tend to collapse groups of codes into categories either during the development
of the coding strategy, as part of the coding process, or once it concludes, and it is these
broader categories that then provide the core of the analysis (Saldaña 2016). To take some
examples of approaches to analysis that might involve comparison, a researcher conducting ethnographic research in two different prisoner-reentry programs with different success rates might compare a series of variables across the two programs to see where the
key differences arise. Or a project involving interviews of marketing professionals might categorize them according to their approach to the job to see which approaches have most
helped people move forward in their careers.
Another strategy for developing conclusions, one that must be more tightly integrated
into the data collection process, involves making predictions and then testing whether they
come true by checking back with the site or the participant and seeing what happens
later (Miles and Huberman 1994). For instance, consider an interview-based study looking
at adult women returning to college and asking how they are adjusting. The researcher
might learn about different strategies student respondents use and then might develop a
hypothesis or prediction about which of these strategies will be most likely to lead to students staying enrolled in college past their first year. Then, the researcher can follow up
with participants a year later to see if the strategies worked as predicted.
A final approach to developing conclusions involves the use of negative or deviant case
methodologies. Deviant case methodologies are usually positioned as a sampling strategy
in which researchers sample cases that seem especially likely to present problems for existing theory, often selecting on the dependent variable. However, deviant case methodologies can also be used long after sampling is completed. To do this, researchers sift through
their data, looking for cases that do not conform to their theory, that do not fit the broader
patterns they have observed from the body of their data, or that are as different as possible
from other cases, and then they seek to understand what has shaped these differences. For
instance, a project looking at why countries experience revolutions might collect data on
a variety of countries to see if common theories hold up, zeroing in on those cases that do not seem to fit the theories. Or a project involving life-history interviews with men in prison
for murder might devote special attention to those men whose stories seem least like the
others.
While qualitative research does not have access to heuristics as simple as the null hypothesis significance testing approach used by quantitative researchers, qualitative researchers
can still benefit from the use of an approach informed by the idea of the null hypothesis
(Berg 2009). In simple terms, the null hypothesis is the hypothesis that there is no relationship or association between the variables or concepts under study. Thus, qualitative
researchers can approach their data with the assumption of a null hypothesis in mind,
rather than starting from an assumption that whatever they are hoping to find will be displayed in their data. Such an approach reduces the likelihood that data becomes a self-fulfilling prophecy. Deviant case methodology—looking for examples within the data that do
not fit the explanations researchers have developed and considering how the analysis can
account for these aberrations (Warren and Karner 2015)—has a particular strength in this
regard.
Testing Findings
Settling in on a set of findings does not mean a research project has been completed.
Rather, researchers need to go through a process of testing, cross-checking, and verifying
their conclusions to be sure they stand up to scrutiny. Researchers use a variety of
approaches to accomplish this task, usually in combination.
One of the most important is consulting with others. Researchers discuss their findings
with other researchers and other professional colleagues as well as with participants or
other people similar to the participants (Warren and Karner 2015). These conversations give
researchers the opportunity to test their logic, learn about questions others may have in
regards to the research, refine their explanations, and be sure they have not missed obvious limitations or errors in their analysis. Presenting preliminary versions of the project to
classmates, at conferences or workshops, or to colleagues can be particularly helpful, as can
sharing preliminary drafts of the research write-up. Talking to participants or people like the
participants can be especially important. While it is always possible that a research project
will develop findings that are valid but that do not square with the lived experiences of participants, researchers should take care in such circumstances to responsibly address objections in ways that uphold the validity of both the research and the participants’ experiences
and to listen carefully to criticisms to be sure all potential errors or omissions in the analysis have been addressed. Feminist, critical social science, and action research perspectives
especially value participants’ expertise and encourage researchers to give participants final
control over how they are portrayed in publication and presentation. For instance, feminist
researchers often seek to ensure that relationships between researchers and participants
are non-exploitative and empowering for participants (Grbich 2007), which may require
that researchers do not position themselves as experts about participants’ lives but rather
provide participants with the opportunity to have their own lived experience shine through.
However, it is essential to be attentive to how participants’ feedback is used. Such feedback can be extremely important to a project, but it can also misdirect analysis and conclusions in problematic ways. Participants vary greatly in their degree of comprehension of
social scientific methods and language. This means researchers must strive to present their
results to participants in ways that make sense to them. Participants’ critiques of methods
and language—while potentially illuminating—also could, if incorporated into the project,
weaken its scientific strengths. In addition, there can be disagreements between different participants or groups of participants, as well as between participants and researchers,
about explanations for the phenomena under consideration in the study.
While many qualitative researchers emphasize the importance of participants’ own knowledge about their social worlds, sometimes individuals are not the best at analyzing and
understanding their own social circumstances. For example, consider someone you know
whose romantic relationship is ending and ask them what happened. There is a good
chance that the explanations offered by each partner are different, and maybe even that
neither explanation matches what you as an outside observer see. Similarly, participants’
narratives and explanations are a vital part of conclusion-building in qualitative research,
but they are not the only possible conclusions a project can draw. In addition, attention
to participants’ views can sometimes lead researchers to self-censor, especially when
researchers have ongoing relationships with participants or contexts and when participants’ priorities and understandings are opposed to researchers’. Similarly, participants
may use various discursive strategies—or ways of using language intended to have particular effects—that researchers may wish to critically interrogate. For example, what
researchers call “race talk,” or the discourse strategies people use to talk about and around
race (Twine and Warren 2000; Van Den Berg, Wetherell, and Houtkoop-Steenstra 2003), can
shed light on patterns of thought that participants may not be willing to openly admit.
Returning to participants to discuss findings and conclusions can also lead to new ethical, privacy, and even legal concerns as participants may be exposed to information gathered from others. While an interview-based study using appropriate care for confidentiality
and in which participants do not know one another is not likely to raise these concerns, as
long as data has been handled appropriately, in the case of ethnographic research or interviewing in which participants are acquainted, even careful attention to confidentiality may
still leave the possibility that participants recognize one another in the narrative and analysis. Thus, it may be necessary to share with each participant only those sections of the data and analysis that concern them.
But researchers need not rely only on others to help them assess their work. There are
a variety of steps researchers can take as part of the research process to test and evaluate
their findings. For instance, researchers can critically re-examine their data, being sure their
findings are firmly based on stronger data: data that was collected later in the research
process after early weaknesses in collection methods were corrected, first-hand observations rather than occurrences the researcher only heard about later, and data that was collected in conditions with higher trust. They should also attend to the issue of face validity,
the type of validity concerned with whether the measures used in a study are a good fit
for the concepts. Sometimes, in the course of analysis, face validity falls away as researchers
focus on exciting and new ideas, so returning to the core concepts of a study and ensuring
the ultimate conclusions are based on measures that fit those concepts can help ensure
solid conclusions.
Even if a study does not specifically use a deviant case methodology (as discussed above),
researchers can take care to look for evidence that is surprising or that does not fit the
model, theory, or predictions. If no such evidence appears—if it seems like all of the data
conforms to the same general perspective—remember that the absence of such evidence
does not mean the researcher’s assumptions are correct. For example, imagine a research
project designed to study factors that help students learn more math in introductory math
classes. The researcher might interview students about their classes and find that the students who report that there were more visual aids used in their class all say that they
learned a lot, while the students who report that their classes were conducted without
visual aids say they did not learn so much. In this analysis, there may not have been any
responses that did not fit this overall pattern. Clearly, then, something is going on within
this population of students. But it is not necessarily the case that the use of visual aids
impacts learning. Rather, the difference could be due to some other factor. Students might
have inaccurate perceptions of how much they have learned, and the use or lack of use of
visual aids could impact these perceptions. Or visual aids might have a spurious relationship with students’ perceptions of learning given some other variable, like the helpfulness
of the instructor, that correlates with both perception and use of visual aids. Remember
that a spurious relationship is a relationship in which two phenomena seem to vary in association with one another, but the observed association is not due to any causal connection
between the two phenomena. Rather, the association is due to some other factor that is
related to both phenomena but that has not been included in the analysis.
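Readers who find it helpful to see spuriousness in action can experiment with a brief simulation. The Python sketch below (standard library only) invents a scenario in which instructor helpfulness drives both visual-aid use and perceived learning; visual aids have no direct effect, yet the two groups of students still report different levels of learning. All numbers are fabricated for illustration.

import random

random.seed(1)
perceived_with, perceived_without = [], []

for _ in range(1000):
    helpfulness = random.random()  # the lurking third variable
    # Helpful instructors are more likely to use visual aids...
    uses_visual_aids = random.random() < helpfulness
    # ...and helpfulness (not visual aids) drives perceived learning.
    perceived_learning = helpfulness + random.gauss(0, 0.1)
    if uses_visual_aids:
        perceived_with.append(perceived_learning)
    else:
        perceived_without.append(perceived_learning)

print("Mean perceived learning, visual aids:   ",
      round(sum(perceived_with) / len(perceived_with), 2))
print("Mean perceived learning, no visual aids:",
      round(sum(perceived_without) / len(perceived_without), 2))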
Careful attention to logic can also help researchers avoid making conclusions that turn
out to be spurious. If an association is observed, then the researcher should consider
whether that association is plausible or whether there might be alternative explanations
that make more sense, turning back to the data as necessary. Indeed, researchers should
always consider the possibility of alternative explanations, and it can be very helpful to ask
others to suggest alternative explanations that have not yet been considered in the analysis. Not only does doing this increase the odds that a project’s conclusions will be reliable
and valid, it also staves off potential criticism from others who may otherwise remain convinced that their explanations are more correct.
Researchers should always make clear to others how they carried out their research. Providing sufficient detail about the research design and analytical strategy makes it possible
for other researchers to replicate the study, or carry out a repeat of the research designed
to be as similar as possible to the initial project. Complete, accurate replications are possible
for some qualitative projects, such as an analysis of historical newspaper articles or of children’s videos, and thus providing the level of detail and clarity necessary for replication is
a strength for such projects. It is far less possible for in-depth interviewing or ethnography
to be replicated given the importance of specific contextual factors as well as the impact
of interviewer effect. However, providing as much detail about methodological choices and
strategies as possible, along with why these choices and strategies were the right ones for a
given project, keeps the researcher and the project more honest and makes the approach
more clear, as the goals of good research should include transparency.
Additionally, research projects involving multiple coders should have already undergone
inter-rater reliability checks including at least 10% of the texts or visuals to be coded, and,
if possible, even projects with only one coder should have conducted some inter-rater reliability testing. A discussion of the results of inter-rater reliability testing should be included
in any publication or presentation drawing on the analysis, and if inter-rater reliability was
not conducted for some reason this should be explicitly discussed as a limitation of the project. There are other types of limitations researchers must also clearly acknowledge, such
as a lack of representativeness among respondents, small sample size, any issues that
might suggest stronger-than-usual interviewer or Hawthorne effects, and other issues
that might shape the reliability and validity of the findings.
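To make the inter-rater reliability check described above concrete, here is a minimal Python sketch (standard library only) that computes simple percent agreement and Cohen's kappa, a common chance-corrected agreement statistic, for two coders' decisions about the same excerpts. The codes and coding decisions are hypothetical.

from collections import Counter

# Hypothetical codes assigned by two coders to the same ten excerpts.
coder_a = ["trust", "control", "trust", "afford", "control",
           "trust", "afford", "trust", "control", "trust"]
coder_b = ["trust", "control", "afford", "afford", "control",
           "trust", "trust", "trust", "control", "trust"]
n = len(coder_a)

# Percent agreement: the share of excerpts coded identically.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Cohen's kappa corrects for agreement expected by chance,
# based on how often each coder used each code.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n)
               for c in set(freq_a) | set(freq_b))
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")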
There are a variety of other cautions and concerns that researchers should keep in mind
as they build and evaluate their conclusions and findings. The term anecdotalism refers
to the practice of treating anecdotes, or individual stories or events, as if they themselves
are sufficient data upon which to base conclusions. In other words, researchers who are
engaging in anecdotalism present snippets of data to illustrate or demonstrate a phenomenon without any evidence that these particular snippets are representative. While it is natural for researchers to include their favorite anecdotes in the presentation of their results,
this needs to be done with attention to whether the anecdote illustrates a broader theme
expressed throughout the data or whether it is an outlier. Without this attention, the use
of anecdotes can quickly mislead researchers to misplace their focus and build unsupported conclusions. One of the most problematic aspects of anecdotalism is that it can
enable researchers to focus on particular data because it supports the research hypothesis, is aligned with researchers’ political ideals, or is exotic and attention-getting, rather
than focusing on data that is representative of the results. The practice of anecdotalism is
at the foundation of some people’s perceptions that qualitative research is not rigorous or
methodologically sound. In reality, bad or sloppy qualitative research, including that suffering from anecdotalism, is not rigorous, just as bad quantitative research is also not rigorous.
Qualitative researchers must take care to present excerpts and examples from their
data. When researchers do not do this and instead focus their write-up on summaries (or
even numbers), readers are not able to draw their own conclusions about whether the
data supports the findings. Respondents’ actual words, words or images from documents,
and ethnographers’ first-hand observations are the true strength of qualitative research
and thus it is essential that these things come through in the final presentation. Plus, if
researchers focus on summaries or numbers, they may miss important nuances in their
data that could more accurately shape the findings. On the other hand, researchers also
must take care to avoid making overconclusions, or conclusions that go beyond what the
data can support. Researchers risk making overconclusions when they assume data are
representative of a broader or more diverse population than that which was included in the
study, when they assume a pattern or phenomenon they have observed occurs in other
types of contexts, and in similar circumstances when limited data cannot necessarily be
extended to apply to events or experiences beyond the parameters of the study.
Another risk in qualitative research is that researchers might underemphasize theory. The
role of theory marks one of the biggest differences between social science research and
journalism. By connecting data to theory, social scientists have the ability to make broader
arguments about social process, mechanisms, and structures, rather than to simply tell stories. Remember that one common goal in social science research is to focus on ordinary
and everyday life and people, showing how—for instance—social inequality and social organizations structure people’s lives, while journalism seeks stories that will draw attention.
Thinking Like a Researcher
This chapter has highlighted a variety of strategies for moving from data to conclusions.
In the quantitative research process, moving from data to conclusions really is the analysis
stage of research. But in qualitative research, especially inductive qualitative research, the
process is more iterative, and researchers move back and forth between data collection,
data management, data reduction, and analysis. It is also important to note that the strategies and tools outlined here are only a small sampling of the possible analytical techniques qualitative researchers use—but they provide a solid introduction to the qualitative
research process. As you practice qualitative research and develop your expertise, you will
continue to find new approaches that better fit your data and your research style.
So what ties all of these approaches to qualitative data analysis together? Among the
most important characteristics is that the data needs to speak for itself. Yes, qualitative
researchers may engage in data reduction due to the volume and complexity of the data
they have collected, but they need to stay close enough to the data that it continues to
shape the analysis and come alive in the write-up.
Another very important element of qualitative research is reflexivity. Reflexivity, in the
context of social science research, refers to the process of reflecting on one’s own perspective and positionality and how this perspective and positionality shape “research design,
data collection, analysis, and knowledge production” (Hsiung 2008:212). The practice of
reflexivity is one of the essential habits of mind for qualitative researchers. While
researchers should engage in reflexivity throughout the research process, it is important to
engage in a specifically reflexive thought process as the research moves towards conclusions. Here, researchers consider what they were thinking about their project, methodology, theoretical approach, topic, question, and participants when they began the research
process, how these thoughts and ideas have or have not shifted, and how these thoughts
and ideas—along with shifts in them—might have impacted the findings (Taylor, Bogdan,
and DeVault 2016). They do this by “turn[ing] the investigative lens away from others and
toward themselves” (Hsiung 2008:213), taking care to remember that the data they have
collected and the data reduction and analysis strategies they have pursued result in records
of interpretations, not clear, objective facts. Some feminist reflexive approaches involve talking through this set of concepts with participants; reflexive research may also involve having additional researchers serve as a kind of check on the research processes to ensure they
comport with researchers’ goals and ethical priorities (Grbich 2007).
Adjusting to the qualitative way of thinking can be challenging. New researchers who
are accustomed to being capable students are generally used to being good at what they
do—getting the right answers on tests, finding it easy to write papers that fulfill the professor’s requirements, and picking up material from lectures or reading without much difficulty. Thus, they may end up thinking that if something is hard, they are probably falling
short. And those who are accustomed to thinking of themselves as not such good students
are often used to finding many of these academic activities hard and assuming that the
fault lies within themselves. But one of the most important lessons we can learn from doing
social science research is that neither of these sets of assumptions is accurate. In fact, doing
research is hard by definition, and when it is done right, researchers are inevitably going
to hit many obstacles and will frequently feel like they do not know what they are doing.
This is not going to be because there is something wrong with the researcher! Rather, this
is because that is how research works, because research involves trying to answer a question no one has answered before by collecting and analyzing data in a way no one has tried
before. In Martin Schwartz’s (2008:1771) words, faculty like those of us writing this book and
teaching your class have been doing students a disservice by not making students “understand how hard it is to do research” and teaching them how we go about “confronting our
absolute stupidity,” the existential ignorance we all have when trying to understand the
unknown. Schwartz says this kind of ignorance, which we choose to engage in when we
pursue research, is highly productive, because it drives us to learn more. And that, after all,
is the real point of research.
Exercises
1. Ask three friends or acquaintances to tell you what steps they took to get their most recent job. Create a process diagram of the job-searching process.
2. Create a network diagram of your class, with nodes representing each student and edges reflecting whether students knew one another before the semester began (or perhaps whether they are taking multiple courses together this semester).
3. Using a textual or video interview with a celebrity as your data (be sure it is an interview and not an article summarizing an interview), write a narrative case study of that celebrity’s life, being sure to reference sociological concepts where appropriate.
4. Locate a work of long-form journalism about a topic of social science interest. Good publications to explore for this purpose include The Atlantic, The New Yorker, The New York Times Magazine, Vanity Fair, Slate, and longreads.com, among others. Summarize how the article might be different if it were an example of social science rather than journalism—what theory or theories might it draw on? What types of scholarly sources might it cite? How might its data collection have been different? How might data analysis have been conducted? What social science conclusions might it have reached?
5. Drawing on the “Conceptual Baggage Inventory Chart” (Hsiung 2008:219), identify your own research interests and goals; biographical characteristics; beliefs, values, and ideologies; and position in structures of stratification (including, but not limited to, race, gender, class, sexuality, age, and disability). Then consider how each of these might serve as potential advantages and as potential disadvantages in carrying out research design, data collection, and data analysis.
Media Attributions
• Process Diagram of Post-High-School Pathways © Mikaila Mariel Lemonik Arthur is
licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
• Social-network © By Wykis - Own work is licensed under a Public Domain license
• Edin 2000 Excerpt © Kathryn Edin is licensed under a All Rights Reserved license
14. Presenting the Results of
Qualitative Analysis
MIKAILA MARIEL LEMONIK ARTHUR
Qualitative research is not finished just because you have determined the main findings
or conclusions of your study. Indeed, disseminating the results is an essential part of the
research process. By sharing your results with others, whether in written form as a scholarly paper or an applied report or in some alternative format like an oral presentation, an infographic, or a video, you ensure that your findings become part of the ongoing conversation of scholarship in your field, forming part of the foundation for future researchers. This
chapter provides an introduction to writing about qualitative research findings. It will outline how writing continues to contribute to the analysis process, what concerns researchers
should keep in mind as they draft their presentations of findings, and how best to organize
qualitative research writing.
As you move through the research process, it is essential to keep yourself organized.
Organizing your data, memos, and notes aids both the analytical and the writing processes.
Whether you use electronic or physical, real-world filing and organizational systems, these
systems help make sense of the mountains of data you have and ensure you focus your
attention on the themes and ideas you have determined are important (Warren and Karner
2015). Be sure that you have kept detailed notes on all of the decisions you have made and
procedures you have followed in carrying out research design, data collection, and analysis,
as these will guide your ultimate write-up.
First and foremost, researchers should keep in mind that writing is in fact a form of thinking. Writing is an excellent way to discover ideas and arguments and to further develop an
analysis. As you write, more ideas will occur to you, things that were previously confusing
will start to make sense, and arguments will take a clear shape rather than being amorphous and poorly-organized. However, writing-as-thinking cannot be the final version that
you share with others. Good-quality writing does not display the workings of your thought
process. It is reorganized and revised (more on that later) to present the data and arguments important in a particular piece. And revision is totally normal! No one expects the
first draft of a piece of writing to be ready for prime time. So write rough drafts and memos
and notes to yourself and use them to think, and then revise them until the piece is the way
you want it to be for sharing.
Bergin (2018) lays out a set of key concerns for appropriate writing about research. First,
present your results accurately, without exaggerating or misrepresenting. It is very easy
to overstate your findings by accident if you are enthusiastic about what you have found,
so it is important to take care and use appropriate cautions about the limitations of the
research. You also need to work to ensure that you communicate your findings in a way
people can understand, using clear and appropriate language that is adjusted to the level
of those you are communicating with. And you must be clear and transparent about the
methodological strategies employed in the research. Remember, the goal is, as much as
possible, to describe your research in a way that would permit others to replicate the study.
There are a variety of other concerns and decision points that qualitative researchers must
keep in mind, including the extent to which to include quantification in their presentation
of results, ethics, considerations of audience and voice, and how to bring the richness of
qualitative data to life.
Quantification, as you have learned, refers to the process of turning data into numbers. It
can indeed be very useful to count and tabulate quantitative data drawn from qualitative
research. For instance, if you were doing a study of dual-earner households and wanted
to know how many had an equal division of household labor and how many did not, you
might want to count those numbers up and include them as part of the final write-up.
However, researchers need to take care when they are writing about quantified qualitative
data. Qualitative data is not as generalizable as quantitative data, so quantification can be
very misleading. Thus, qualitative researchers should strive to use raw numbers instead of
the percentages that are more appropriate for quantitative research. Writing, for instance,
“15 of the 20 people I interviewed prefer pancakes to waffles” is a simple description of the
data; writing “75% of people prefer pancakes” suggests a generalizable claim that is not
likely supported by the data. Note that mixing numbers with qualitative data is really a type
of mixed-methods approach. Mixed-methods approaches are good, but sometimes they
seduce researchers into focusing on the persuasive power of numbers and tables rather
than capitalizing on the inherent richness of their qualitative data.
A variety of issues of scholarly ethics and research integrity are raised by the writing
process. Some of these are unique to qualitative research, while others are more universal
concerns for all academic and professional writing. For example, it is essential to avoid plagiarism and misuse of sources. All quotations that appear in a text must be properly cited,
whether with in-text and bibliographic citations to the source or with an attribution to the
research participant (or the participant’s pseudonym or description in order to protect confidentiality) who said those words. Where writers will paraphrase a text or a participant’s
words, they need to make sure that the paraphrase they develop accurately reflects the
meaning of the original words. Thus, some scholars suggest that participants should have
the opportunity to read (or to have read to them, if they cannot read the text themselves) all
sections of the text in which they, their words, or their ideas are presented to ensure accuracy and enable participants to maintain control over their lives.
Audience and Voice
When writing, researchers must consider their audience(s) and the effects they want their
writing to have on these audiences. The designated audience will dictate the voice used in
the writing, or the individual style and personality of a piece of text. Keep in mind that the
potential audience for qualitative research is often much more diverse than that for quantitative research because of the accessibility of the data and the extent to which the writing
can be accessible and interesting. Yet individual pieces of writing are typically pitched to a
more specific subset of the audience.
Let us consider one potential research study, an ethnography involving participant-observation of the same children both when they are at a daycare facility and when they
are at home with their families to try to understand how daycare might impact behavior
and social development. The findings of this study might be of interest to a wide variety
of potential audiences: academic peers, whether at your own academic institution, in your
broader discipline, or multidisciplinary; people responsible for creating laws and policies;
practitioners who run or teach at day care centers; and the general public, including both
people who are interested in child development more generally and those who are themselves parents making decisions about child care for their own children. And the way you
write for each of these audiences will be somewhat different. Take a moment and think
through what some of these differences might look like.
If you are writing to academic audiences, using specialized academic language and
working within the typical constraints of scholarly genres, as will be discussed below, can
be an important part of convincing others that your work is legitimate and should be taken
seriously. Your writing will be formal. Even if you are writing for students and faculty you
already know—your classmates, for instance—you are often asked to imitate the style of
academic writing that is used in publications, as this is part of learning to become part of
the scholarly conversation. When speaking to academic audiences outside your discipline,
you may need to be more careful about jargon and specialized language, as disciplines do
not always share the same key terms. For instance, in sociology, scholars use the term diffusion to refer to the way new ideas or practices spread from organization to organization.
In the field of international relations, scholars often use the term cascade to refer to the
way ideas or practices spread from nation to nation. These terms are describing what is fundamentally the same concept, but they are different terms—and a scholar from one field
might have no idea what a scholar from a different field is talking about! Therefore, while
the formality and academic structure of the text would stay the same, a writer with a multidisciplinary audience might need to pay more attention to defining their terms in the body
of the text.
It is not only other academic scholars who expect to see formal writing. Policymakers
tend to expect formality when ideas are presented to them, as well. However, the content
and style of the writing will be different. Much less academic jargon should be used, and the
most important findings and policy implications should be emphasized right from the start
rather than initially focusing on prior literature and theoretical models as you might for an
academic audience. Long discussions of research methods should also be minimized. Similarly, when you write for practitioners, the findings and implications for practice should be
highlighted. The reading level of the text will vary depending on the typical background of
the practitioners to whom you are writing—you can make very different assumptions about
the general knowledge and reading abilities of a group of hospital medical directors with
MDs than you can about a group of case workers who have a post-high-school certificate.
Consider the primary language of your audience as well. The fact that someone can get
by in spoken English does not mean they have the vocabulary or English reading skills to
digest a complex report. But the fact that someone’s vocabulary is limited says little about
their intellectual abilities, so try your best to convey the important complexity of the ideas
and findings from your research without dumbing them down—even if you must limit your
vocabulary usage.
When writing for the general public, you will want to move even further towards emphasizing key findings and policy implications, but you also want to draw on the most interesting aspects of your data. General readers will read sociological texts that are rich with
ethnographic or other kinds of detail—it is almost like reality television on a page! And this
is a contrast to busy policymakers and practitioners, who probably want to learn the main
findings as quickly as possible so they can go about their busy lives. But also keep in mind
that there is a wide variation in reading levels. Journalists at publications pegged to the
general public are often advised to write at about a tenth-grade reading level, which would
leave most of the specialized terminology we develop in our research fields out of reach. If
you want to be accessible to even more people, your vocabulary must be even more limited. The excellent exercise of trying to write using the 1,000 most common English words,
available at the Up-Goer Five website (https://www.splasho.com/upgoer5/) does a good job
of illustrating this challenge (Sanderson n.d.).
Another element of voice is whether to write in the first person. While many students
are instructed to avoid the use of the first person in academic writing, this advice needs to
be taken with a grain of salt. There are indeed many contexts in which the first person is
best avoided, at least as long as writers can find ways to build strong, comprehensible sentences without its use, including most quantitative research writing. However, if the alternative to using the first person is crafting a sentence like “it is proposed that the researcher
will conduct interviews,” it is preferable to write “I propose to conduct interviews.” In qualitative research, in fact, the use of the first person is far more common. This is because
the researcher is central to the research project. Qualitative researchers can themselves be
understood as research instruments, and thus eliminating the use of the first person in
writing is in a sense eliminating information about the conduct of the researchers themselves.
But the question really extends beyond the issue of first-person or third-person. Qualitative researchers have choices about how and whether to foreground themselves in their
writing, not just in terms of using the first person, but also in terms of whether to emphasize their own subjectivity and reflexivity, their impressions and ideas, and their role in the
setting. In contrast, conventional quantitative research in the positivist tradition really tries
to eliminate the author from the study—which indeed is exactly why typical quantitative
research avoids the use of the first person. Keep in mind that emphasizing researchers’
roles and reflexivity and using the first person does not mean crafting articles that provide
overwhelming detail about the author’s thoughts and practices. Readers do not need to
hear, and should not be told, which database you used to search for journal articles, how
many hours you spent transcribing, or whether the research process was stressful—save
these things for the memos you write to yourself. Rather, readers need to hear how you
interacted with research participants, how your standpoint may have shaped the findings,
and what analytical procedures you carried out.
Making Data Come Alive
One of the most important parts of writing about qualitative research is presenting the
data in a way that makes its richness and value accessible to readers. As the discussion of
analysis in the prior chapter suggests, there are a variety of ways to do this. Researchers may
select key quotes or images to illustrate points, write up specific case studies that exemplify their argument, or develop vignettes (little stories) that illustrate ideas and themes, all
drawing directly on the research data. Researchers can also write more lengthy summaries,
narratives, and thick descriptions.
Nearly all qualitative work includes quotes from research participants or documents to
some extent, though ethnographic work may focus more on thick description than on
relaying participants’ own words. When quotes are presented, they must be explained and
interpreted—they cannot stand on their own. This is one of the ways in which qualitative
research can be distinguished from journalism. Journalism presents what happened, but
social science needs to present the “why,” and the why is best explained by the researcher.
So how do authors go about integrating quotes into their written work? Julie Posselt
(2017), a sociologist who studies graduate education, provides a set of instructions. First of
all, authors need to remain focused on the core questions of their research, and avoid getting distracted by quotes that are interesting or attention-grabbing but not so relevant to
the research question. Selecting the right quotes, those that illustrate the ideas and arguments of the paper, is an important part of the writing process. Second, not all quotes
should be the same length (just like not all sentences or paragraphs in a paper should be
the same length). Include some quotes that are just phrases, others that are a sentence
or so, and others that are longer. We call longer quotes, generally those more than about
three lines long, block quotes, and they are typically indented on both sides to set them off
from the surrounding text. For all quotes, be sure to summarize what the quote should be
telling or showing the reader, connect this quote to other quotes that are similar or different, and provide transitions in the discussion to move from quote to quote and from topic
to topic. Especially for longer quotes, it is helpful to do some of this writing before the quote
to preview what is coming and other writing after the quote to make clear what readers
should have come to understand. Remember, it is always the author’s job to interpret the
data. Presenting excerpts of the data, like quotes, in a form the reader can access does not
minimize the importance of this job. Be sure that you are explaining the meaning of the
data you present.
A few more notes about writing with quotes: avoid patchwriting, whether in your literature review or the section of your paper in which quotes from respondents are presented.
Patchwriting is a writing practice wherein the author lightly paraphrases original texts but
stays so close to those texts that there is little the author has added. Sometimes, this even
takes the form of presenting a series of quotes, properly documented, with nothing much
in the way of text generated by the author. A patchwriting approach does not move the scholarly conversation forward, as it does not represent any kind of new contribution on
the part of the author. It is of course fine to paraphrase quotes, as long as the meaning is
not changed. But if you use direct quotes, do not edit the text of the quotes unless your edits do not change the meaning and you have made clear through the use of ellipses (…) and brackets ([ ]) what kinds of edits have been made. For example, consider this
exchange from Matthew Desmond’s (2012:1317) research on evictions:
The thing was, I wasn’t never gonna let Crystal come and stay with me
from the get go. I just told her that to throw her off. And she wasn’t fittin’ to
come stay with me with no money…No. Nope. You might as well stay in that
shelter.
A paraphrase of this exchange might read “She said that she was not going to let Crystal stay with her if Crystal did not have any money.” Paraphrases like that are fine. What is not fine
is rewording the statement but treating it like a quote, for instance writing:
The thing was, I was not going to let Crystal come and stay with me from
beginning. I just told her that to throw her off. And it was not proper for her
to come stay with me without any money…No. Nope. You might as well stay
in that shelter.
But as you can see, the change in language and style removes some of the distinct meaning of the original quote. Instead, writers should leave as much of the original language as
possible. If some text in the middle of the quote needs to be removed, as in this example,
ellipses are used to show that this has occurred. And if a word needs to be added to clarify,
it is placed in square brackets to show that it was not part of the original quote.
Data can also be presented through the use of data displays like tables, charts, graphs,
diagrams, and infographics created for publication or presentation, as well as through the
use of visual material collected during the research process. Note that if visuals are used,
the author must have the legal right to use them. Photographs or diagrams created by the
author themselves—or by research participants who have signed consent forms for their
work to be used—are fine. But photographs, and sometimes even excerpts from archival
documents, may be owned by others from whom researchers must get permission in order
to use them.
A large percentage of qualitative research does not include any data displays or visualizations. Therefore, researchers should carefully consider whether the use of data displays will
help the reader understand the data. One of the most common types of data displays used
by qualitative researchers are simple tables. These might include tables summarizing key
data about cases included in the study; tables laying out the characteristics of different taxonomic elements or types developed as part of the analysis; tables counting the incidence
of various elements; and 2×2 tables (two columns and two rows) illuminating a theory. Basic
network or process diagrams are also commonly included. If data displays are used, it is
essential that researchers include context and analysis alongside data displays rather than
letting them stand by themselves, and it is preferable to continue to present excerpts and
examples from the data rather than just relying on summaries in the tables.
If you will be using graphs, infographics, or other data visualizations, it is important that
you attend to making them useful and accurate (Bergin 2018). Think about the viewer or
user as your audience and ensure the data visualizations will be comprehensible. You may
need to include more detail or labels than you might think. Ensure that data visualizations
are laid out and labeled clearly and that you make visual choices that enhance viewers’
ability to understand the points you intend to communicate using the visual in question.
Finally, given the ease with which it is possible to design visuals that are deceptive or misleading, it is essential to make ethical and responsible choices in the construction of visualizations so that viewers will interpret them in accurate ways.
The Genre of Research Writing
As discussed above, the style and format in which results are presented depends on the
audience they are intended for. These differences in styles and format are part of the genre
of writing. Genre is a term referring to the rules of a specific form of creative or productive
work. Thus, the academic journal article—and student papers based on this form—is one
genre. A report or policy paper is another. The discussion below will focus on the academic
journal article, but note that reports and policy papers follow somewhat different formats.
They might begin with an executive summary of one or a few pages, include minimal background, focus on key findings, and conclude with policy implications, shifting methods and
details about the data to an appendix. But both academic journal articles and policy papers
share some things in common, for instance the necessity for clear writing, a well-organized
structure, and the use of headings.
So what factors make up the genre of the academic journal article in sociology? While
there is some flexibility, particularly for ethnographic work, academic journal articles tend
to follow a fairly standard format. They begin with a “title page” that includes the article
title (often witty and involving scholarly inside jokes, but more importantly clearly describing the content of the article); the authors’ names and institutional affiliations; an abstract;
and sometimes keywords designed to help others find the article in databases. An abstract
is a short summary of the article that appears both at the very beginning of the article and
in search databases. Abstracts are designed to aid readers by giving them the opportunity
to learn enough about an article that they can determine whether it is worth their time
to read the complete text. They are written about the article, and thus not in the first person, and clearly summarize the research question, methodological approach, main findings, and often the implications of the research.
After the abstract comes an “introduction” of a page or two that details the research
question, why it matters, and what approach the paper will take. This is followed by a literature review of about a quarter to a third of the length of the entire paper. The literature
review is often divided, with headings, into topical subsections, and is designed to provide a
clear, thorough overview of the prior research literature on which a paper has built—including prior literature the new paper contradicts. At the end of the literature review it should
be made clear what researchers know about the research topic and question, what they do
not know, and what this new paper aims to do to address what is not known.
The next major section of the paper is the section that describes research design, data
collection, and data analysis, often referred to as “research methods” or “methodology.” This
section is an essential part of any written or oral presentation of your research. Here, you tell
your readers or listeners “how you collected and interpreted your data” (Taylor, Bogdan, and
DeVault 2016:215). Taylor, Bogdan, and DeVault suggest that the discussion of your research
methods include the following:
• The particular approach to data collection used in the study;
• Any theoretical perspective(s) that shaped your data collection and analytical
approach;
• When the study occurred, over how long, and where (concealing identifiable details as
needed);
• A description of the setting and participants, including sampling and selection criteria
(if an interview-based study, the number of participants should be clearly stated);
• The researcher’s perspective in carrying out the study, including relevant elements of
their identity and standpoint, as well as their role (if any) in research settings; and
• The approach to analyzing the data.
After the methods section comes a section, variously titled but often called “data,” that
takes readers through the analysis. This section is where the thick description narrative; the
quotes, broken up by theme or topic, with their interpretation; the discussions of case studies; most data displays (other than perhaps those outlining a theoretical model or summarizing descriptive data about cases); and other similar material appears. The idea of the data
section is to give readers the ability to see the data for themselves and to understand how
this data supports the ultimate conclusions. Note that all tables and figures included in formal publications should be titled and numbered.
At the end of the paper come one or two summary sections, often called “discussion”
and/or “conclusion.” If there is a separate discussion section, it will focus on exploring the
overall themes and findings of the paper. The conclusion clearly and succinctly summarizes
the findings and conclusions of the paper, the limitations of the research and analysis, any
suggestions for future research building on the paper or addressing these limitations, and
implications, be they for scholarship and theory or policy and practice.
After the end of the textual material in the paper comes the bibliography, typically called
“works cited” or “references.” The references should appear in a consistent citation style—in
sociology, we often use the American Sociological Association format (American Sociological Association 2019), but other formats may be used depending on where the piece will
eventually be published. Care should be taken to ensure that in-text citations also reflect
the chosen citation style. In some papers, there may be an appendix containing supplemental information such as a list of interview questions or an additional data visualization.
Note that when researchers give presentations to scholarly audiences, the presentations
typically follow a format similar to that of scholarly papers, though given time limitations
they are compressed. Abstracts and works cited are often not part of the presentation,
though in-text citations are still used. The literature review presented will be shortened to
only focus on the most important aspects of the prior literature, and only key examples
from the discussion of data will be included. For long or complex papers, sometimes only
one of several findings is the focus of the presentation. Of course, presentations for other
audiences may be constructed differently, with greater attention to interesting elements of
the data and findings as well as implications and less to the literature review and methods.
Concluding Your Work
After you have written a complete draft of the paper, be sure you take the time to revise and
edit your work. There are several important strategies for revision. First, put your work away
for a little while. Even waiting a day to revise is better than nothing, but it is best, if possible,
to take much more time away from the text. This helps you forget what your writing looks
like and makes it easier to find errors, mistakes, and omissions. Second, show your work to
others. Ask them to read your work and critique it, pointing out places where the argument
is weak, where you may have overlooked alternative explanations, where the writing could
be improved, and what else you need to work on. Finally, read your work out loud to yourself (or, if you really need an audience, try reading to some stuffed animals). Reading out
loud helps you catch wrong words, tricky sentences, and many other issues. But as important as revision is, try to avoid perfectionism in writing (Warren and Karner 2015). Writing
can always be improved, no matter how much time you spend on it. Those improvements,
however, have diminishing returns, and at some point the writing process needs to conclude so the writing can be shared with the world.
Of course, the main goal of writing up the results of a research project is to share them with others. Thus, researchers should consider how they intend to disseminate their results.
What conferences might be appropriate? Where can the paper be submitted? Note that if
you are an undergraduate student, there are a wide variety of journals that accept and publish research conducted by undergraduates. Some publish across disciplines, while others
are specific to disciplines. Other work, such as reports, may be best disseminated by publication online on relevant organizational websites.
After a project is completed, be sure to take some time to organize your research materials and archive them for longer-term storage. Some Institutional Review Board (IRB) protocols require that original data, such as interview recordings, transcripts, and field notes,
be preserved for a specific number of years in a protected (locked for paper or password-protected for digital) form and then destroyed, so be sure that your plans adhere to the
IRB requirements. Be sure you keep any materials that might be relevant for future related
research or for answering questions people may ask later about your project.
And then what? Well, then it is time to move on to your next research project. Research is
a long-term endeavor, not a one-time-only activity. We build our skills and our expertise as
we continue to pursue research. So keep at it.
Exercises
1. Find a short article that uses qualitative methods. The sociological magazine Contexts is a good place to find such pieces. Write an abstract of the article.
2. Choose a sociological journal article on a topic you are interested in that uses some form of qualitative methods and is at least 20 pages long. Rewrite the article as a five-page research summary accessible to non-scholarly audiences.
3. Choose a concept or idea you have learned in this course and write an explanation of it using the Up-Goer Five Text Editor (https://www.splasho.com/upgoer5/), a website that restricts your writing to the 1,000 most common English words. What was this experience like? What did it teach you about communicating with people who have a more limited English-language vocabulary—and what did it teach you about the utility of having access to complex academic language?
4. Select five or more sociological journal articles that all use the same basic type of qualitative methods (interviewing, ethnography, documents, or visual sociology). Using what you have learned about coding, code the methods sections of each article, and use your coding to figure out what is common in how such articles discuss their research design, data collection, and analysis methods.
5. Return to an exercise you completed earlier in this course and revise your work. What did you change? How did revising impact the final product?
6. Find a quote from the transcript of an interview, a social media post, or elsewhere that has not yet been interpreted or explained. Write a paragraph that includes the quote along with an explanation of its sociological meaning or significance.
SECTION IV
QUANTITATIVE DATA ANALYSIS
WITH SPSS
This portion of this text provides details on how to perform basic quantitative analysis with
SPSS, a statistical software package produced by IBM. Students and faculty can access discounted versions of the software via a variety of educational resellers, with lower-cost limited-term licenses for students that last six months to cover the time spent in a course (see
IBM’s list of resellers here). For most users, the GradPack Standard is the right option. Many
colleges and universities also make SPSS available in campus computer labs or via a virtual
lab environment, so check with your campus before assuming you need to pay for access.
SPSS for non-students can be very expensive, though a 30-day free trial is available from
IBM and should provide sufficient time to learn basic functions. Note that SPSS does offer
screenreader capabilities, but users may need to install an additional plugin and may wish
to seek technical support in advance for accomplishing this. Those looking for free, open-source statistical analysis software may want to consider R instead, though it does have
a steeper learning curve. Hopefully, R supplements to this book will be available at some
point in the future.
The examples and screenshots provided throughout this section of the book utilize data
from the 2021 General Social Survey. The standard 2021 GSS file has been imported into
SPSS and modified and simplified to produce an SPSS file that is available for download so
users of this book can follow along with the examples. The number of variables has been
reduced to 407, with most duplicated and survey-experiment variables removed as well as
those that are difficult to use or that were responded to by only a very small number of people. Variable information has been adjusted and variables have been reordered to further
1
simplify use. Finally, survey weights have been removed from this dataset, as the proper
use of survey weights is beyond the scope of this text. The dataset is thus designed only
for learning purposes. Researchers who want to conduct actual analyses will need to download the original 2021 GSS file, import it into SPSS, and apply the survey weights. To learn
more about survey weighting in the GSS, read this FAQ, and for instructions about applying
survey weights in SPSS, see this handy guide from Kent State.
1. Survey weights are adjustments made to survey data to correct for the fact that certain populations have been under- or oversampled. For instance, because in the GSS only one person per household is sampled, individuals living in larger households have a lower chance of being selected into the survey. Thus, the survey weights adjust for household size in the calculation of results.
A simplified codebook is also available as part of this book (see Modified GSS Codebook
for the Data Used in this Text). The codebook is an edited version of the 2021 GSS Codebook,
with some technical detail removed and the variable list edited and simplified to match the
dataset. Users of this book should take some time to familiarize themselves with the codebook before beginning to work with the data.
15. Quantitative Analysis with
SPSS: Getting Started
MIKAILA MARIEL LEMONIK ARTHUR
This chapter focuses on getting started with SPSS. Note that before you can start to work
with SPSS, you need to get your data into an appropriate format, as discussed in the chapter on Preparing Quantitative Data and Data Management. It is possible to enter data
directly into SPSS, but the interface is not conducive to data entry and so researchers are
better off entering their data using a spreadsheet program and then importing it.
Importing Data Into SPSS
In some cases, existing data will be able to be downloaded in SPSS format (*.sav is the file
extension for an SPSS datafile), in which case it can be opened in SPSS by going to File →
Open → Data and then navigating to the location of the file. However, in most cases, researchers
will need to import data stored in another file format into SPSS. To import data, go to the
file menu, then select import data. Next, choose the type of data you wish to import from
the menu that appears. In most cases, researchers will be importing Excel or CSV data
(when they have entered it themselves or are downloading it from a general-purpose site
like the Census Bureau) or SAS or Stata data (when they are downloading it from a site that
makes prepared statistical data files available).
Figure 1. The Import Data Menu in SPSS
Once you click on a data type, a window will pop up for you to select the file you wish to
import. Be sure it is of the file type you have chosen. If you import a file in a format that is
already designed to work with statistical software, such as Stata, the importation process
will be as seamless as opening a file. Researchers should be sure that immediately after
importing, they save their file (File → Save As) so that it is stored in SPSS format and can be
opened in SPSS, rather than imported, in the future. It is essential to remember that SPSS
is not cloud-resident software and does not have an autosave function, so any time a file is
changed, it must be manually saved.
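For readers curious about the syntax equivalents of these menu operations (syntax is introduced at the end of this chapter), opening and saving a data file each take a single command. This is a minimal sketch; the file path is a placeholder.

* Open an existing SPSS data file (the path is a placeholder).
GET FILE='C:\data\mystudy.sav'.
* Save the active dataset in SPSS (*.sav) format.
SAVE OUTFILE='C:\data\mystudy.sav'.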
If you import a file in Excel, CSV (comma-separated values), or text format, SPSS will
open an import wizard with a number of
steps. The steps vary slightly depending on
which file type you are importing. For
instance, to import an Excel file, as shown
in Figure 2, you first need to specify the
worksheet (if the file has multiple worksheets—SPSS can only import one worksheet at a time). You can choose to specify
a limited range of cells. Checking the
checkbox next to “Read variable names
from first row of data” will replace the V1,
V2, V3, and so on column headers with
whatever appears in the top row of data in
the Excel file. You can also choose to
change the percentage of values that are
used to determine data type, remove leading and trailing spaces from string values,
and—if your Excel file has hidden rows or columns—you can choose to ignore them.
Figure 2. The Import Data Window for an Excel File
Below the options, a preview of your Excel file will be shown; you can scroll through the
preview to see that data is being displayed correctly. Clicking OK will finalize the import.
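The same Excel import can also be performed via the GET DATA command in syntax. The sketch below is illustrative, assuming a hypothetical file named mydata.xlsx with variable names in the first row of a worksheet called Sheet1; adjust the path and sheet name to match your own file.

* Import one worksheet of an Excel file, reading variable names
* from the first row (file name and sheet name are placeholders).
GET DATA
  /TYPE=XLSX
  /FILE='C:\data\mydata.xlsx'
  /SHEET=NAME 'Sheet1'
  /CELLRANGE=FULL
  /READNAMES=ON.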
A different set of options appears when you import a CSV file, as shown in Figure 3.
Figure 3. Window for Importing CSV Files
The top of the popup window shows a preview of the data in CSV format. While toggles related to whether the first line
contains variable names, removing leading
and trailing spaces, and indicating the percentage of values that determine the data
type are the same as for importing data
from Excel, there are additional options
that are important for the proper importing
of CSV data. First of all, the user must specify whether values are delimited by a comma, a semicolon, or a tab. While commas are the most common delimiters in
CSV files, the other delimiters are possible, and looking at the preview should make clear
which of the delimiters is being used in a given file, as shown in the example below.
Comma-delimited:     1,2312,"Yes",984
Semicolon-delimited: 1;2312;"Yes";984
Tab-delimited:       1    2312    "Yes"    984
Second, the user must specify whether the period or the comma is the decimal symbol.
Data produced in the United States typically uses the period (as in 1238.67), as does data
produced in many other English-speaking countries, while most of Europe and Latin America use the comma. Third, the user must specify the text qualifier (single quotes, double
quotes, or none). This is the character used to note that the contents of a particular entry
in the CSV file are textual (string variables) in nature, not numerical. If your data includes
text, it should be clear from the preview which qualifier is being used. Users can also toggle
whether data is cached locally or not; caching locally speeds the importation process.
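In syntax, these same choices (the delimiter, the text qualifier, and where the data begin) appear as subcommands to GET DATA. The following is a sketch for a hypothetical comma-delimited file with double-quote qualifiers and variable names in its first line; the path and the variable list are placeholders that must match your actual file.

* Import a comma-delimited CSV file; data start on line 2 because
* line 1 holds the variable names (path and variables are placeholders).
GET DATA
  /TYPE=TXT
  /FILE='C:\data\survey.csv'
  /ARRANGEMENT=DELIMITED
  /DELIMITERS=","
  /QUALIFIER='"'
  /FIRSTCASE=2
  /VARIABLES=id F4.0 age F3.0 answer A10.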
Finally, there is a button for Advanced Options (Text Wizard). The text wizard offers the
same window and options that users see if they are importing a text file directly, and this
wizard offers more direct control over the importation process across a series of six steps. First,
users can specify a predefined format if they have a *.tpf file on their computers (this is rare)
and see a preview of what the data in the file looks like. In step two, they can indicate if the
file is delimited (as above) or fixed-width (where values are stored in columns of constant
size specified within the file); which—if any—row contains the variable names; and the decimal symbol. Note that some forms of fixed-width files may not be supported. Third, they
indicate which line of the file contains the first line of data, whether each line represents
a case or a specific given number of variables represents a case, and how many cases to
import. This last choice includes the option to import a random sample of cases. Fourth,
users specify the delimiter and the text qualifier and determine how to handle leading and
trailing spaces in string values. Fifth, users can double-check variable names and formats.
Finally, before clicking the “Finish” button, users can choose to save their selections as a *.tpf
file to be reused or to paste the syntax (to be discussed later in this chapter).
In all cases, once the importation options have been selected and OK or Finish has been
clicked, the data is imported. An output window (see Figure 4) may open with various warnings and details about the importation process, and the Data View window (see Figure 5)
will show the data, with variable names at the top of each column. At this point, be sure to
save the dataset in a location and with a name you will be able to locate later.
Figure 4. The SPSS Output Window
Figure 5. SPSS Data View
Before users are done setting up their dataset, they must be sure that appropriate variable
information is included. When datasets are imported from other statistical programs, they
will typically come with variable information. But when they are imported from Excel or
CSV files, the variable information must be manually entered, typically from a codebook
or related document. Variable information is entered using Variable View. Users can switch
between Data View and Variable View by clicking the tabs at the bottom of the screen or
using the Ctrl+T key combination. As you can see in Figure 6, a screenshot of a completed
dataset, Variable View shows each variable in a row, with a variety of information about
that variable. When a dataset is imported, each of these pieces of information needs to be
entered by hand for each variable. To move between columns by key commands, use the
tab key; to open variable information that requires a menu for entry, press the space bar
twice.
Figure 6. SPSS Variable View
• Name requires that each variable be given a short name, without any spaces. There
are additional rules about names, but in short, names should be primarily alphanumeric in nature and cannot be reserved words or use symbols that have meaning for the
underlying computer processing. Names can be entered directly.
• Type specifies the variable type. To open up the menu allowing the selection of variable types, click on the cell, then click on the three dots [...] that appear on the right
side of the cell. Users can then choose from among numeric, dollar, date, numeric
with leading zeros, string, and other variable types.
• Width specifies the number of characters of width for the variable itself in data storage, while decimals specifies how many decimal places the variable will have. These
can both be entered or edited directly or in the dialog box for Type.
• Label provides space for a longer variable name that spells out
more completely what the variable is
measuring. It can be entered directly.
• Values is where the attributes or value
labels for a variable are specified. Clicking the three dots [...]—remember, they
are not visible until you click in a
cell—opens a dialog box in which values
and their labels can be entered, as
shown in Figure 7. To enter a value and
its label, click on the green plus sign.
Then enter the numerical value under the “Value” column and the value label under the “Label” column, and continue doing this until all values are labeled. Labels can be long, but the beginning portions should be easily distinguishable so analysts can work with them even when the entire label is not displayed. There is a “Spelling…” button for spell-checking your work. Use the red X to delete a value and its label.
Figure 7. Value Labels Popup Window
• Missing provides for the indication that
particular values—like “refused to
answer”—should be treated by the SPSS
software as missing data rather than as
analytically useful categories. Clicking
the three dots [...] opens a dialog box for
specifying missing values. When there
are no missing values, “no missing values” should be selected. Otherwise, users
can select “discrete missing values” and
then enter three specific missing values—the numerical values, not the value
labels—or they can select “range plus one optional discrete missing value” to specify a range from low to high of missing values, optionally adding an additional single discrete value.
Figure 8. Missing Values Popup
• Columns specifies the width of the display column for the variable. It can be entered
directly.
• Align specifies whether the variable data will be aligned right, center, or left. Users can
click in the cell to make a menu appear or can press spacebar twice and then use
arrows to select the desired alignment.
• Measure permits the indication of level of measurement from among nominal, ordinal, and scale variables. Users can click in the cell to make a menu appear or can press
spacebar twice and then use arrows to select the desired level of measurement. Note
that measure is often wrong in datasets and analysts should not rely on it in determining the level of measurement for selection of statistical tests; SPSS does not use this
characteristic when running tests.
• Some datasets will have additional columns. For example, the dataset shown in Figure 6
has a column called origsort which displays the original sort order of the dataset, so
that if an analyst sorts the variables they can be returned to their original order.
When entering variable information, it is especially important to include Name, Label, and
Values and be sure Type is correct and any Missing values are specified. Other variable information is less crucial, though clearly it is better to fully specify all variable information. Once
all variable information is entered and double-checked and the dataset has been saved, it
is ready for use.
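All of the variable information just described can also be entered via syntax, which can be much faster when many variables need similar treatment. The sketch below is illustrative only, using a hypothetical variable named happy; the commands shown correspond to the Label, Values, Missing, and Measure columns of Variable View.

* Hypothetical variable happy: supply a label, value labels, missing values, and a level of measurement.
VARIABLE LABELS happy 'General happiness'.
VALUE LABELS happy
  1 'Very happy'
  2 'Pretty happy'
  3 'Not too happy'
  8 'Do not know'
  9 'No answer'.
MISSING VALUES happy (8, 9).
VARIABLE LEVEL happy (ORDINAL).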
Using SPSS
When a user first opens SPSS, they are greeted with the “Welcome Dialog” (see Figure 9).
This dialog provides tips, links to help resources, and options for creating a new file (by
selecting “new dataset”) or opening recently used files. There is a checkbox for turning off
the Welcome Dialog so that it will not be shown in the future.
Figure 9. SPSS Welcome Dialog
When the Welcome Dialog is turned off, SPSS opens with a blank file. Going to File → Open
→ Data (Alt+F, O, D) brings up the dialog for opening a data file; the Open menu also provides for opening other types of files, which will be discussed below. Earlier in this chapter, the differences between Data View and Variable View were discussed; when you open a
data file, be sure to observe which view you are using.
It can be useful to be able to search for a
variable or case in the datafile. There are
two main ways to do this, both under the Edit menu¹ (Alt+E). The Edit menu offers
Find and Go To. Find, which can also be
accessed by pressing Ctrl+F, allows users to
search for all or part of a variable name. Figure 10 displays the Search dialog, with
options shown after clicking on the “show
options” button. (Users can also use the
Replace function, but this carries the risk of
writing over data and so should be avoided
in almost all cases.) Be sure to select the
column you wish to search—the Find function can only examine one column in Variable View at a time. Most typically, users will want to search variable names or labels. The
checkbox for Match Case toggles whether or not case (in other words, capitalization) matters to the search. Expanding the options permits users to specify how much and which part of a cell must be matched as well as search order.
Figure 10. Find and Replace Dialog in SPSS
1. Note that “Search,” another option under the Edit menu, does not search variables or cases but instead launches a search of SPSS web resources and help files.
Users can also navigate to specific cases or variables by using Edit → Go to Case (to navigate
to a specific case—or row in data view) and Edit → Go to Variable (to navigate to a specific
variable—a row in variable view or a column in data view). Users can also access detailed
variable information via the tool Utilities → Variables.
Another useful feature is the ability to sort variables and cases. Both types of sorting can
be found in the data menu. Variables can be sorted by any of the characteristics in variable
view; when sorting, the original sort order can be saved as a new characteristic. Cases can
be sorted on any variable.
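Both kinds of sorting have simple syntax equivalents as well; the variable name in the sketch below is a placeholder.

* Sort cases by a variable, ascending (A) or descending (D).
SORT CASES BY age (D).
* Sort the rows of Variable View alphabetically by variable name.
SORT VARIABLES BY NAME (A).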
SPSS Options
The Options dialog can be reached by going to Edit → Options (or Alt+E, Alt+N). There are
a wide variety of options available to help users customize their SPSS experience, a few of
which are particularly important. First of all, using various dialogs and menus in the program is much easier if the options Variable List—Display Names (Alt+N) and Alphabetical (Alt+H) are selected under General. You can also change the display language for both
the user interface and for output under Language; change fonts and colors for output under Viewer; set number options under Data; change currency options under Currency;
set default output for graphs and charts under Charts; and set default file locations for saving files under File locations. While most of these options can be left on their default settings, it is really important for most users to set variables to display names and alphabetical
before use. Options will be preserved if you use the same computer and user account, but
if you are working on a public computer you should get in the habit of checking every time
you start the program.
Getting More Out of SPSS
So far, we have been working only with Data View and Variable View in the main dataset
window. But when researchers produce the results of an analysis, these results appear in
a new window called Output—IBM SPSS Statistics Viewer. New Output windows can be
opened from the File menu by going to Open → Output or from the Window menu by
selecting “Go to Designated Viewer Window” (the latter command also brings the output
window to the foreground if one is already open). Output will be discussed in more detail
when the results of different tests are discussed. For now, note that output can be saved in
*.spv format, but this format can only be viewed in SPSS. To save output in a format viewable in other applications, go to File → Export, where you can choose a file location and a file
format (like Word, PowerPoint, HTML, or PDF). Individual output items can also be copied
and pasted.
SPSS also offers a Syntax viewer and editor, which can also be accessed from both the File
and Window menus. While syntax is beyond the scope of this text, it provides the option for
writing code (kind of like a computer program) to control SPSS rather than using menus
and buttons in a graphical user interface. Experienced users, or those doing many similar
repetitive tasks, often find working via syntax to be faster and more efficient, but the learning curve is quite steep. If you are interested in learning more about how to write syntax
in SPSS, Help → Command Syntax Reference brings up a very long document detailing the
commands available.
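To give a flavor of what syntax looks like, the following one-command sketch (our illustration, not pasted from SPSS) produces a basic frequency table for the variable AGE, equivalent to using Analyze → Descriptive Statistics → Frequencies:

* Produce a basic frequency table for the variable AGE.
FREQUENCIES VARIABLES=AGE.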
Finally, the Help menu in SPSS offers a variety of options for getting help in using the program, including links to web resource guides, PDF documentation, and help forums. These
tools can also be reached directly via the SPSS website. In addition, many dialog boxes contain a “Help” button that takes users to webpages with more detail on the tool in question.
Exercises
Go to https://www.baseball-reference.com/ and select 10 baseball players of your choice. In an Excel or other
spreadsheet, enter the name, position, batting arm, throwing arm, weight in pounds, and height in inches, as
well as, from the Summary: Career section, HR (home runs) and WAR (wins above replacement). Each player
should get one row of the Excel spreadsheet. Once you have entered the data, import it into SPSS. Then use
Variable View to enter the relevant information about each variable—including value labels for position, batting arm, and throwing arm. Sort your cases by home runs. Finally, save your file.
Media Attributions
• import menu
• import excel © IBM SPSS is licensed under a All Rights Reserved license
• import csv © IBM SPSS is licensed under a All Rights Reserved license
• output window © IBM SPSS is licensed under a All Rights Reserved license
• spss data view © IBM SPSS is licensed under a All Rights Reserved license
• variable-view © IBM SPSS is licensed under a All Rights Reserved license
• value labels © IBM SPSS is licensed under a All Rights Reserved license
• missing values © IBM SPSS is licensed under a All Rights Reserved license
• welcome dialog © IBM SPSS is licensed under a All Rights Reserved license
• find and replace © IBM SPSS is licensed under a All Rights Reserved license
16. Quantitative Analysis with SPSS: Univariate Analysis
MIKAILA MARIEL LEMONIK ARTHUR
The first step in any quantitative analysis project is univariate analysis, also known as
descriptive statistics. Producing these measures is an important part of understanding the
data as well as important for preparing for subsequent bivariate and multivariate analysis. This chapter will detail how to produce frequency distributions (also called frequency
tables), measures of central tendency, measures of dispersion, and graphs in SPSS. The
chapter on Univariate Analysis provides details on understanding and interpreting these
measures. To select the correct measures for your variables, first determine the level of
measurement of each variable for which you want to produce appropriate descriptive statistics. The distinction between binary and other nominal variables is important here, so
you need to determine whether each variable is binary, nominal, ordinal, or continuous.
Then, use Table 1 to determine which descriptive statistics you should produce.
Table 1. Selecting the Right Univariate/Descriptive Statistics

• Binary. Measures of central tendency: Mean; Mode. Measures of dispersion: Frequency distribution. Graphs: Pie chart; Bar graph.
• Nominal. Measures of central tendency: Mode. Measures of dispersion: Frequency distribution. Graphs: Pie chart; Bar graph.
• Ordinal. Measures of central tendency: Median; Mode. Measures of dispersion: Range (min/max); Frequency distribution; occasionally Percentiles. Graphs: Bar graph.
• Continuous. Measures of central tendency: Mean; Median. Measures of dispersion: Standard deviation; Variance; Range (min/max); Skewness; Kurtosis; Percentiles. Graphs: Histogram.
Producing Descriptive Statistics
Other than graphs, all of the univariate analyses discussed in this chapter are produced by going to Analyze → Descriptive Statistics → Frequencies, as shown in Figure 1. Note that SPSS also offers a tool called Descriptives; avoid this unless you are specifically seeking to produce Z scores, a topic beyond the scope of this text, as the Descriptives tool provides far fewer options than the Frequencies tool.
Figure 1. Running Descriptive Statistics in SPSS
Selecting this tool brings up a window called “Frequencies” from which the various descriptive statistics can be selected, as shown in Figure 2. In this window, users select which variables to perform univariate analysis upon. Note that while univariate analyses can be performed upon multiple variables as a group, those variables need to all have the same level of measurement, as only one set of options can be selected at a time.
Figure 2. The Frequencies Window
To use the Frequencies tool, scroll through the list of variables on the left side of the screen, or click in the list and begin typing the variable name if you remember it and the list will jump to it. Use the blue arrow to move the variable into the Variables box, or grab and drag it over. If you are performing analysis on a binary, nominal, or ordinal variable, be sure the checkbox next to “Display frequency tables” is checked; if you are performing analysis on a continuous variable, leave that box unchecked. The checkbox for “Create APA style tables” slightly alters the format and display of tables; if you are working in the field of psychology specifically, you should select this checkbox, but otherwise it is not needed. The options under “Format” specify elements of the display of the tables; in most cases those should be left as the default. The options under “Style” and “Bootstrap” are beyond the scope of this text.
It is under “Statistics” that the specific descriptive statistics to be produced are selected, as shown in Figure 3. First, users can select several different options for producing percentiles, which are usually produced only for continuous variables but occasionally are used for ordinal variables. Quartiles produces the 25th, 50th (median), and 75th percentiles in the data. Cut points allows the user to select a specified number of equal groups and see at which values the groups break. Percentiles allows the user to specify particular percentiles to produce—for instance, a user might want to specify 33 and 66 to see where the upper, middle, and lower thirds of the data fall.
Figure 3. The Dialog Box for Selecting Descriptive Statistics
Second, users can select measures of central tendency, specifically the mean (used for
binary and continuous variables), the median (used for ordinal and continuous variables),
and the mode (used for binary, nominal, and ordinal variables). Sum adds up all the values
of the variable, and is not typically used. There is also an option to select if values are group
midpoints, which is beyond the scope of this text.
Next, users can select measures of dispersion and distribution, including the standard
deviation (abbreviated here Std. deviation, and used for continuous variables), the variance
(used for continuous variables), the range (used for ordinal and continuous variables), the
minimum value (used for ordinal and continuous variables), the maximum value (used for
ordinal and continuous variables), and the standard error of the mean (abbreviated here as
S.E. mean, this is a measure of sampling error and beyond the scope of this text), as well as
skewness and kurtosis (used for continuous variables).
Once all desired tests are selected, click “Continue” to go back to the main Frequencies dialog. There, you can also select the Charts button to produce graphs (as shown in Figure 4), though only one graph can be produced at a time (other options for producing graphs will be discussed later in this chapter). Bar charts are appropriate for binary, nominal, and ordinal variables. Pie charts are typically used only for binary variables and nominal variables with just a few categories, though they may at times make sense for ordinal variables with just a few categories. Histograms are used for continuous variables; there is an option to show the normal curve on the histogram, which can help users visualize the distribution more clearly. Users can also choose whether their graphs will be displayed in terms of frequencies (the raw count of values) or percentages.
Figure 4. Making Graphs from the Frequencies Dialog
Examples at Each Level of Measurement
Here, we will produce appropriate descriptive statistics for one variable from the 2021 GSS
file at each level of measurement, showing what it looks like to produce them, what the
resulting output looks like, and how to interpret that output.
A Binary Variable
To produce descriptive statistics for a binary variable, be sure to leave Display frequency
tables checked. Under statistics, select Mean and Mode and then click continue, and under
graphs select your choice of bar graph or pie chart and then click continue. Using the variable GUNLAW, then, the selected option would look as shown in Figure 5. Then click OK,
and the results will appear in the Output window.
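For readers curious about the syntax alternative mentioned in the Getting Started chapter, this dialog corresponds to a command roughly like the following sketch (our approximation rather than pasted output, so details may vary by version):

* Mean, mode, and a pie chart (in percentages) for the binary variable GUNLAW.
FREQUENCIES VARIABLES=GUNLAW
  /STATISTICS=MEAN MODE
  /PIECHART PERCENT.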
Figure 5. SPSS Dialogs Set Up for Descriptive Statistics for the Binary Variable GUNLAW
The output for GUNLAW will look approximately like what is shown in Figure 6. GUNLAW
is a variable measuring whether the respondent favors or opposes requiring individuals to
obtain police permits before buying a gun.
The output shows that 3,992 people gave a valid answer to this question, while responses for 40 people are missing. Of those who provided answers, the mode, or most frequent response, is 1. If we look at the value labels, we will find that 1 here means “favor;” in other words, the largest number of respondents favors requiring permits for gun owners. The mean is 1.33. In the case of a binary variable, the mean tells us the approximate proportion of people who have provided the higher-numbered value label—so in this case, about ⅓ of respondents said they are opposed to requiring permits.

The frequency table, then, shows the number and proportion of people who provided each answer. The most important column to pay attention to is Valid Percent. This column tells us what percentage of the people who answered the question gave each answer. So, in this case, we would say that 67.3% of respondents favor requiring permits for gun ownership, while 32.7% are opposed (the roughly 1% of respondents with missing answers are excluded from these percentages).
Figure 6. SPSS Output for Descriptive Statistics on GUNLAW
Finally, we have produced a pie chart, which provides the same information in a visual
format. Users who like playing with their graphs can double-click on the graph and then
right-click or cmd/ctrl click to change options such as displaying value labels or amounts or
changing the color of the graph.
A Nominal Variable
To produce descriptive statistics for a nominal variable, be sure to leave Display frequency
tables checked. Under statistics, select Mode and then click continue, and under graphs
select your choice of bar graph or pie chart (avoid pie chart if your variable has many categories) and then click continue. Using the variable MOBILE16, then, the selected option
would look as shown in Figure 7. Then click OK, and the results will appear in the Output
window.
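Sketched as syntax (again an approximation, not pasted output), the equivalent command would be roughly:

* Mode and a bar chart (in percentages) for the nominal variable MOBILE16.
FREQUENCIES VARIABLES=MOBILE16
  /STATISTICS=MODE
  /BARCHART PERCENT.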
Figure 7. SPSS Dialogs Set Up for Descriptive Statistics for the Nominal Variable MOBILE16
The output will then look approximately like the output shown in Figure 8. MOBILE16 is a
variable measuring respondents’ degree of geographical mobility since age 16, asking them
if they live in the same city they lived in at age 16; stayed in the same state they lived in at
age 16 but now live in a different city; or live in a different state than they lived in at age 16.
The output shows that 3,608 respondents answered this survey question, while 424 did not. The mode is 2; looking at the value labels, we conclude that 2 refers to “same state, different city,” or in other words that the largest group of respondents lives in the same state they lived in at age 16 but not in the same city they lived in at age 16. The frequency table shows us the percentage breakdown of respondents into the three categories. Valid Percent is most useful here, as it tells us the percentage of respondents in each category after those who have not responded to the question are removed. In this case, 35.9% of people live in the same state but a different city, the largest category of respondents. Thirty-four percent live in a different state, while 30.1% live in the same city in which they lived at age 16. Below the frequency table is a bar graph which provides a visual for the information in the frequency table. As noted above, users can change options such as displaying value labels or amounts or changing the color of the graph.
Figure 8. SPSS Output for Descriptive Statistics on MOBILE16
An Ordinal Variable
To produce descriptive statistics for an ordinal variable, be sure to leave Display frequency tables checked. Under statistics, select Median, Mode, Range, Minimum, and Maximum, and then click continue; under graphs, select bar chart and then click continue. Using the variable CARSGEN, then, the selected options would look as shown in Figure 9. Then click OK, and the results will appear in the Output window.
Figure 9. SPSS Dialogs Set Up for Descriptive Statistics for the Ordinal Variable CARSGEN
The output will then look approximately like the output shown in Figure 10. CARSGEN is an
ordinal variable measuring the degree to which respondents agree or disagree that car pollution is a danger to the environment.
First, we see that 1,778 respondents answered this question, while 2,254 did not (remember that the GSS has a lot of questions; some are asked of all respondents while others are only asked of a subset, so the fact that a lot of people did not answer may indicate that many were not asked rather than that there is a high degree of nonresponse). The median and mode are both 3. Looking at the value labels tells us that 3 represents “somewhat dangerous.” The range is 4, representing the maximum (5) minus the minimum (1)—in other words, there are five ordinal categories.

Looking at the valid percents, we can see that 13% of respondents consider car pollution extremely dangerous, 31.4% very dangerous, and 45.8%—the biggest category (and both the mode and median)—somewhat dangerous. In contrast, only 8.5% think car pollution is not very dangerous and 1.2% think it is not dangerous at all. Thus, it is reasonable to conclude that the vast majority—over 90%—of respondents think that car pollution presents at least some degree of danger. The bar graph at the bottom of the output represents this information visually.
Figure 10. SPSS Output for Descriptive Statistics on CARSGEN
A Continuous Variable
To produce descriptive statistics for a continuous variable, be sure to uncheck Display
frequency tables. Under statistics, go to percentile values and select Quartiles (or other
percentile options appropriate to your project). Then select Mean, Median, Std. deviation,
Variance, Range, Minimum, Maximum, Skewness, and Kurtosis and then click continue, and
under graphs select Histograms and turn on Show normal curve on histogram and then
click continue. Using the variable EATMEAT, then, the selected option would look as shown
in Figure 11. Then click OK, and the results will appear in the Output window.
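The syntax equivalent, again as a rough sketch (here the FORMAT subcommand suppresses the frequency table, mirroring the unchecked box):

* Quartiles, full continuous-variable statistics, and a histogram with normal curve for EATMEAT.
FREQUENCIES VARIABLES=EATMEAT
  /FORMAT=NOTABLE
  /PERCENTILES=25 50 75
  /STATISTICS=MEAN MEDIAN STDDEV VARIANCE RANGE MINIMUM MAXIMUM SKEWNESS KURTOSIS
  /HISTOGRAM NORMAL.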
Figure 11. SPSS Dialogs Set Up for Descriptive Statistics for the Continuous Variable EATMEAT
The output will then look approximately like the output shown in Figure 12. EATMEAT is
a continuous variable measuring the number of days per week that the respondent eats
beef, lamb, or products containing beef or lamb.
Because this variable is continuous, we have not produced frequency tables, and therefore we jump right into the statistics. 1,795 respondents answered this question. On average, they eat beef or lamb 2.77 days per week (that is what the mean tells us). The median respondent eats beef or lamb three days per week. The standard deviation of 1.959 tells us that about 68% of respondents will be found within ±1.959 days of the mean of 2.77, or between 0.811 days and 4.729 days. The skewness of 0.541 tells us that the data is mildly skewed to the right, with a longer tail at the higher end of the distribution. The kurtosis of -0.462 tells us that the data is mildly platykurtic, or has little data in the outlying tails. (Note that we have ignored several statistics in the table, which are used to compute or further interpret the figures we are discussing and which are otherwise beyond the scope of this text.) The range is 7, with a minimum of 0 and a maximum of 7—sensible, given that this variable is measuring the number of days of the week that something happens. The 25th percentile is at 1, the 50th at 3 (this is the same as the median), and the 75th at 4. This tells us that one quarter of respondents eat beef or lamb one day a week or fewer; a quarter eat it between one and three days a week; a quarter eat it between three and four days a week; and a quarter eat it more than four days per week. The histogram shows the shape of the distribution; note that while the distribution is otherwise fairly normally distributed, more respondents eat beef or lamb seven days a week than eat it six days a week.
Figure 12. SPSS Output for Descriptive Statistics on EATMEAT
Graphs
There are several other ways to produce graphs in SPSS. The simplest is to go to Graphs → Legacy Dialogs, where a variety of specific graph types can be selected and produced, including both univariate and bivariate charts. The Legacy Dialogs menu, as shown in Figure 13, permits users to choose bar graphs, 3-D bar graphs, line graphs, area charts, pie charts, high-low plots, boxplots, error bars, population pyramids, scatterplots/dot graphs, and histograms. Users are then presented with a series of options for what data to include in their chart and how to format the chart.
Figure 13. The Legacy Dialogs/Graphs Menu in SPSS
Here, we will review how to produce univariate bar graphs, pie charts, and histograms
using the legacy dialogs. Other graphs important to the topics discussed in this text will be
reviewed in other chapters.
Bar Graphs
To produce a bar graph, go to Graphs → Legacy Dialogs → Bar. For a univariate graph, select Simple, and click Define. Then, select the relevant binary, nominal, or ordinal variable and use the blue arrow (or drag and drop) to place it in the “Category Axis” box. You can change the options under “Bars represent” to be the number of cases, the percent of cases, or other statistics, if you choose. Once you have set up your graph, click OK, and the graph will appear in the Output Viewer window. Figure 14 shows the dialog boxes for creating a bar graph, with the appropriate options selected, as well as a graph of the variable NEWS, which measures how often the respondent reads a newspaper.
Figure 14. Bar Graph Dialog and Resulting Bar Graph for NEWS
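Sketched in syntax (an approximation of what the dialog would generate):

* A simple bar chart of counts for NEWS; replace COUNT with PCT for percentages.
GRAPH
  /BAR(SIMPLE)=COUNT BY NEWS.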
Pie Charts
To produce a pie chart, go to Graphs → Legacy Dialogs → Pie. In most cases, users will want to select the default option, “Summaries for groups of cases,” and click Define. Then, select the relevant binary, nominal, or ordinal variable (remember not to use pie charts for variables with too many categories) and use the blue arrow (or drag and drop) to place it in the “Define Slices By” box. You can change the options under “Slices represent” to be the number of cases or the percent of cases. Once you have set up your graph, click OK, and the graph will appear in the Output Viewer window. Figure 15 shows the dialog boxes for creating a pie chart, with the appropriate options selected, as well as a graph of the variable BORN, which measures whether or not the respondent was born in the United States.
Figure 15. Pie Chart Dialog and Resulting Pie Chart for BORN
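The pie chart’s rough syntax equivalent, as a sketch:

* A pie chart of counts for BORN.
GRAPH
  /PIE=COUNT BY BORN.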
Histograms
To produce a histogram, go to Graphs → Legacy Dialogs → Histogram. Then, select the relevant continuous variable and use the blue arrow (or drag and drop it) to place it in the “Variable” box. Most users will want to check the “Display normal curve” box. Once you have set
up your graph, click OK, and the graph will appear in the Output Viewer window. Figure 16
shows the dialog boxes for creating a histogram, with the appropriate options selected, as
well as a graph of the variable AGE, which measures the respondent’s age at the time of the
survey. Note that when histograms are produced, SPSS also provides the mean, standard
deviation, and total number of cases along with the graph.
Figure 16. Histogram Dialog and Resulting Histogram for AGE
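And the histogram’s rough syntax equivalent:

* A histogram of AGE with the normal curve overlaid.
GRAPH
  /HISTOGRAM(NORMAL)=AGE.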
Other Ways of Producing Graphs
Other options include the Chart Builder and the Graphboard Template Chooser. In the Graphboard Template Chooser, users select one or more variables and SPSS indicates a selection of graphs that may be suitable for that combination of variables (note that SPSS simply provides options; it cannot determine if those options would in fact be appropriate for the analysis in question, so analysts must take care to evaluate the options and choose which one(s) are actually useful for a given analysis). Then, users are able to select from among a set of detailed options and provide titles for their graph. In Chart Builder, users first select from among a multitude of univariate and bivariate graph formats and drag and drop variables into the graph, then set options and properties and change colors as desired. While both of these tools provide more flexibility than the graphs accessed via Legacy Dialogs, advanced users designing visuals often move outside of the SPSS ecosystem and create graphs in software more directly suited to this purpose, such as Excel or Tableau.
Exercises
To complete these exercises, load the 2021 GSS data prepared for this text into SPSS. For each of the
following variables, answer the questions below.
• ZODIAC
• COMPUSE
• SATJOB
• NUMROOMS
• Any other variable of your choice

1. What is the variable measuring? Use the GSS codebook to be sure you understand.
2. At what level of measurement is the variable?
3. What measures of central tendency, measures of dispersion, and graphs can you produce for this variable, given its level of measurement?
4. Produce each of the measures and graphs you have listed and copy and paste the output into a document.
5. Write a paragraph explaining the results of the descriptive statistics you’ve obtained. The goal is to put into words what you now know about the variable—interpreting what each statistic means, not just restating the statistic.
Media Attributions
• descriptives frequencies © IBM SPSS is licensed under a All Rights Reserved license
• frequencies window © IBM SPSS is licensed under a All Rights Reserved license
• frequencies-statistics © IBM SPSS is licensed under a All Rights Reserved license
• frequencies charts © IBM SPSS is licensed under a All Rights Reserved license
• binary descriptives © IBM SPSS is licensed under a All Rights Reserved license
• gunlaws output © IBM SPSS is licensed under a All Rights Reserved license
• nominal descriptives © IBM SPSS is licensed under a All Rights Reserved license
• mobile16 output © IBM SPSS is licensed under a All Rights Reserved license
• ordinal descriptives © IBM SPSS is licensed under a All Rights Reserved license
• carsgen output © IBM SPSS is licensed under a All Rights Reserved license
• continuous descriptives © IBM SPSS is licensed under a All Rights Reserved license
• eatmeat output © IBM SPSS is licensed under a All Rights Reserved license
• graphs legacy dialogs © IBM SPSS is licensed under a All Rights Reserved license
• bar graphs © IBM SPSS is licensed under a All Rights Reserved license
• pie charts © IBM SPSS is licensed under a All Rights Reserved license
• histogram © IBM SPSS is licensed under a All Rights Reserved license
17. Quantitative Analysis with SPSS: Data Management
MIKAILA MARIEL LEMONIK ARTHUR
This chapter is designed to introduce a variety of ways to work with datasets and variables
that facilitate analysis. None of the approaches in this chapter themselves produce results,
but rather are designed to enable analysis that might not be possible if datasets are used
in their default form. First, it will show how to perform analysis on more limited subsets of data. Then, it will show how to transform variables to change their level of measurement, reduce attributes, create index variables, and otherwise combine variables. One
quick note about a topic that is not covered in this text: the application of survey weights.
More advanced quantitative analysts will want to learn to properly weight their data before
performing analysis.
Working With Datasets
In some cases, analysts may wish to use a smaller subset of their dataset or to analyze different groups within the dataset separately. This section of the chapter will review select
cases and split file, approaches for doing just this.
Select Cases
The Select Cases tool permits analysts to choose a subset of cases upon which to perform analysis. It can be found at the bottom of the Data menu (Alt+D, Alt+S); the Select Cases dialog is shown in Figure 1. Select Cases offers the option of selecting cases based on satisfying a certain condition (e.g. cases with a specific value for a specific variable), selecting a random sample of a percentage or number of cases, selecting a specific range of cases, or using a filter variable to select only those cases with a value other than 0 or missing on that variable.
Figure 1. The Select Cases Dialog

When using the “If condition is satisfied” option, click the “If…” button and use variable names and logical or mathematical operators to write an expression (either by clicking or just typing the expression). For instance, one might select only those who have a bachelor’s degree or higher by writing an expression like DEGREE = 3 | DEGREE = 4 (the symbol | means “or”; in the 2021 GSS, codes 3 and 4 on DEGREE represent bachelor’s and graduate degrees), as shown in Figure 2.
Figure 2. The “Select Cases If” Dialog

Once an option has been selected, analysts then need to determine what should happen to the selected cases. They can choose to filter out the unselected cases or to copy the selected cases into a new file with a given filename. SPSS also permits the option of deleting unselected cases, but since this permanently alters the original dataset, it is not recommended. If “Filter Out Unselected Cases” is chosen, it is important to remember to return to the Select Cases dialog when the portion of the project relying on the selected subset of cases is completed. When returning to Select Cases, “All Cases” should be selected in order to revert to the original dataset with all cases available for analysis.
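For reference, filtering out unselected cases corresponds to syntax along these lines (a sketch using the same bachelor’s-or-higher condition; the filter_$ variable name follows the convention SPSS itself uses):

* Filter to respondents with a bachelor's degree or higher.
USE ALL.
COMPUTE filter_$=(DEGREE = 3 | DEGREE = 4).
FILTER BY filter_$.
EXECUTE.
* When finished with the subset, restore all cases.
FILTER OFF.
USE ALL.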
Split File
The Split File tool allows analysts to produce output that is separated according to the attributes of a variable. For instance, analysts could perform descriptive statistics or crosstabulations and generate separate output for different race, sex, educational, or other categories. Split File can be accessed via the Data menu (Alt+D, Alt+F); the dialog is shown in Figure 3. Analysts can choose to analyze all cases (not splitting the file) or to split the file and either compare groups or organize output by groups. Using each of these options, all analyses performed appear in the output in multiple copies, one for each of the attributes of the selected variable. So, for instance, if the file were split by SEX (in the 2021 GSS, SEX only has the attributes of male and female), separate descriptive statistics, crosstabs, graphs, or whatever other output is desired will be produced. The difference between “Compare groups” and “Organize output by groups” is that “Compare groups” produces a stack of output—say, the frequency tables for male and female—right on top of each other, while “Organize output by groups” produces all output requested in a single procedure separated for each attribute.
Figure 3. The Split File Dialog
Once the analyst has selected one of these options, they select the variable and use the blue arrow to put it in the Groups Based On box; in most cases, the option for Sort the file by grouping variables should be selected as well. Then click OK and proceed to perform the desired analysis. Once the analysis is completed, return to the Split File dialog, select “Analyze all cases, do not compare groups,” and click OK so that the split file is turned off.
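In syntax, splitting and then unsplitting the file looks roughly like this sketch (LAYERED corresponds to “Compare groups,” while SEPARATE corresponds to “Organize output by groups”):

* Split all subsequent output by SEX.
SORT CASES BY SEX.
SPLIT FILE LAYERED BY SEX.
* Run the desired analyses here, then turn the split off.
SPLIT FILE OFF.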
Working With Variables
Analysts may wish to use variables differently than the way they were originally collected.
This section of the chapter will address three approaches for transforming variables: first,
recoding variables, which permits transforming a continuous variable into an ordinal one
or reducing the number of attributes in a variable to fewer, larger categories; second, creating indexes by combining variables using the count function; and third, using compute to
manipulate or combine variables in other ways, such as creating averages.
Recoding
Recoding is a procedure that permits analysts to change the way in which the attributes of
a variable are set up. It can be used for a variety of purposes, among them:
• Converting a continuous variable into an ordinal one by grouping numerical values
into categories,
• Simplifying a nominal or ordinal variable with many categories by collapsing those
categories into a smaller number of categories,
• Changing the direction of a variable, for instance taking an ordinal variable with 5 categories ranging from 1: strongly disagree to 5: strongly agree and turning it into an
ordinal variable with 5 categories ranging from 1: strongly agree to 5: strongly disagree,
and
• Creating dummy variables, as will be discussed in the chapter on multivariate regression.
This section of the chapter will provide examples of how to conduct the first two types of
recoding. Note that before proceeding to recode any variable, it is essential to first produce
complete descriptive statistics for the variable in question and study them carefully. If you
are recoding a continuous variable, you may wish to use the “cut points” option under Analyze → Descriptive Statistics → Frequencies → Statistics, being sure to specify the number of
equal groups you are considering creating. If you are recoding a discrete variable it is also
essential to understand what the attributes (value labels) are for that variable and how they
are coded (values). Take good notes on both the descriptive statistics and the attributes so
that you have the information available to help you decide how to set up your recode.
The recode dialog is found under Transform (Alt+T). Note that there are several different
recoding options. You should never use Recode into Same Variables (Alt+S), as this writes
over your original data. The Automatic Recode (Alt+A) option is most useful when data is in
non-numeric form and needs to be converted to numeric form. Most frequently, as a quantitative analyst, you will use Recode into Different Variables (Alt+R).
Recoding a Continuous Variable into an Ordinal Variable
Let’s say we would like to recode AGE, changing it from a continuous variable to a discrete variable. We might use our own understanding of ages to come up with categories, like 18-25, 26-35, 36-45, 46-55, 56-65, 66-75, and 75 and older. But these categories, it turns out, might not be so useful, as not very many people in our dataset are at the youngest or oldest ends of the distribution—something we find out if we look at the descriptive statistics. In fact, if we produce descriptive statistics using cut points for five equal groups, as shown in Table 1, we would find out that we have to get all the way to age 35 to have 20% of the people in our sample fall into one age category. We might not want to just use the cut points our descriptive statistics found, though, as they do not necessarily make sense as theoretical groupings. Perhaps instead we would choose 18-35, 36-45, 46-59, 60-69, and 70 and older. These groupings would be approximately equal in size, but make more sense numerically. Once we determine our groups, we also need to decide which numerical value we will assign each group—perhaps 1: 18-35, 2: 36-45, 3: 46-59, 4: 60-69, 5: 70+. Now that we have decided how we will recode our variable, we are ready to proceed with actually recoding the variable.

Table 1. Descriptive Statistics for AGE, 2021 GSS
Age of respondent
N Valid                  3699
N Missing                333
Mean                     52.16
Median                   53.00
Std. Deviation           17.233
Variance                 296.988
Skewness                 .018
Std. Error of Skewness   .040
Kurtosis                 -1.018
Std. Error of Kurtosis   .080
Range                    71
Minimum                  18
Maximum                  89
Percentile 20            35.00
Percentile 40            46.00
Percentile 60            59.00
Percentile 80            69.00
To begin the process of recoding, go to Transform → Recode Into Different Variables. Select the variable you wish to recode and move it into the box using the blue arrow. Then, give the variable a new name and label. Many analysts use the convention of adding an R to the original variable name; thus, here we are giving our variable the new name RAGE and the label “Age Recoded Into Categories.” Click the Change button. There is an If… option for more complicated recoding procedures, but in most cases all that needs to be done now is clicking Old and New Values to put in the old and new values we already decided upon.
Figure 4. Recode Into Different Dialog Box Set Up to Recode Age
Figure 5. Old and New Values Dialog for Recoding Age

In the Old and New Values dialog, there are a variety of ways to indicate the original value (old) and the new value. We always begin by selecting old value: System- or user-missing and new value: System-missing, to ensure that missing values remain missing. We then put in the rest of our categories using the Range ____ through _____ option, except for the final category, where we use Range, value through highest to ensure we don’t accidentally leave out those 89-year-olds. Once an old value and its respective new value have been entered, we click the Add button so that they appear in the Old → New box. In other cases, analysts might change individual values, use Range, lowest through value, or combine all other values. If it is necessary to edit or delete something that has already been added, use the Change (to edit) or Remove (to delete) buttons. When all of the old and new values have been added, the Old and New Values dialog should look as it does in Figure 5, with the following text in the Old → New box:
MISSING-->SYSMIS
18 thru 35-->1
36 thru 45-->2
46 thru 59-->3
60 thru 69-->4
70 thru highest-->5
When everything is set up, click Continue and then OK. To see your new variable, scroll to
the bottom of the screen in Variable View.
There is one more step to recoding, and that is to add the value labels. To do this, go to
Variable View; you will most likely find your new variable at the very bottom of the list of
variables. If you click in the Values box for the row with your new variable in it, as shown in
Figure 6, you will see a box with … in it. Click the … and the value labels dialog will come up.
Figure 6. Preparing to Enter Value Labels
To enter the value labels, click on the green plus sign, and then enter a numerical value and
its associated value label. Click the plus sign again to enter the next value and value label,
and so on until all have been entered. If you need to remove one that has been entered
incorrectly, use the red X. There is also a spellchecker, should you find it useful. When you are done, the Value
Labels dialog should look as it does in Figure 7. Then click OK, and remember to save your
dataset so that you don’t lose your new variable.
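The entire recoding procedure, value labels included, corresponds to syntax roughly like the following sketch (our approximation, not pasted from SPSS):

* Recode AGE into grouped categories in the new variable RAGE.
RECODE AGE (MISSING=SYSMIS) (18 thru 35=1) (36 thru 45=2) (46 thru 59=3)
  (60 thru 69=4) (70 thru Highest=5) INTO RAGE.
VARIABLE LABELS RAGE 'Age Recoded Into Categories'.
VALUE LABELS RAGE 1 '18-35' 2 '36-45' 3 '46-59' 4 '60-69' 5 '70+'.
EXECUTE.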
Finally, it is time to check that the recoding proceeded correctly before using the new variable in any analysis. To check the variable, produce a frequency distribution. Assess the frequency distribution for evidence of any errors, such as:

• Old values that did not get caught in the recoding process,
• Categories that are missing from the new values,
• More missing values than would be expected, and
• Unexpected discrepancies between the descriptive statistics from the original variable and those produced now.

Figure 7. Value Labels for Recoded Age Variable
In the case of our RAGE variable, we can observe in Table 2 that we did a relatively good job of keeping the proportion of respondents in each of our categories pretty even. In the absence of theoretical or conceptual reasons for choosing a particular recoding strategy, making categories of relatively consistent size can be a good way to proceed.

Table 2. Frequency Distribution for Age Recoded Into Categories
                  Frequency   Percent   Valid Percent   Cumulative Percent
18-35             795         19.7      21.5            21.5
36-45             643         15.9      17.4            38.9
46-59             855         21.2      23.1            62.0
60-69             728         18.1      19.7            81.7
70+               678         16.8      18.3            100.0
Valid Total       3699        91.7      100.0
Missing (System)  333         8.3
Total             4032        100.0
Reducing the Attributes of a Discrete Variable
As noted above, recoding can also be used to condense categories in the case of an ordinal
or nominal variable with many categories. The example here uses the variable POLVIEWS,
an ordinal variable measuring respondents’ political views on a seven-point scale from
extremely liberal to extremely conservative. In recoding this variable, we might want to
reduce the seven points to three: liberal, moderate, and conservative. But which values
belong in which categories? We might say that extremely liberal and liberal make up the
liberal category; extremely conservative and conservative make up the conservative category; and slightly liberal, slightly conservative, and moderate/middle of the road make up
the moderate category. So we produce a frequency table to see what our data looks like.
This frequency table is shown in Table 3.
Table 3. Think of Self as Liberal or Conservative
                               Frequency   Percent   Valid Percent   Cumulative Percent
extremely liberal              207         5.1       5.2             5.2
liberal                        623         15.5      15.7            20.9
slightly liberal               490         12.2      12.4            33.3
moderate, middle of the road   1377        34.2      34.7            68.0
slightly conservative          476         11.8      12.0            80.0
conservative                   617         15.3      15.6            95.6
extremely conservative         174         4.3       4.4             100.0
Valid Total                    3964        98.3      100.0
Missing (System)               68          1.7
Total                          4032        100.0
Table 3 might cause us to think that our original idea about how to combine these categories is not the best, given how few people would end up in the liberal and conservative
categories and how many in the moderate category. Instead, we might conclude that it
would make more sense to group extremely liberal, liberal, and slightly liberal together;
keep moderate on its own; and then group extremely conservative, conservative, and
slightly conservative together. And we might decide that 1 will be liberal, 2 will be moderate,
and 3 will be conservative. We also need to write down the value labels from our existing
variable so that we can use them in the recoding.
Once we’ve made these decisions, it’s time to proceed with the recoding, which we do
much the same way as we did for the recoding of the continuous variable above, by going
to Transform → Recode Into Different Variables. Give the variable a new name, RPOLVIEWS, and a new
label, Is R Liberal, Moderate, or Conservative? Click Change, and then click Old and New Values. Note that if you are recoding right after recoding a prior variable, you will need to use
the Remove button to remove the old and new values from the prior recoding process. The
old and new values we will be entering are shown below; Figure 8 shows what the recode
dialogs should look like.
MISSING-->SYSMIS
1 thru 3-->1
4-->2
5 thru 7-->3
Once the old and new values are entered, click Continue and then OK. Then, scroll to the new variable in Variable View and add the value labels: 1 Liberal, 2 Moderate, 3 Conservative, as shown in Figure 8.
Figure 8. Recoding POLVIEWS
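As with the age example, the whole procedure can be sketched in syntax:

* Collapse the seven-point POLVIEWS scale into three categories.
RECODE POLVIEWS (MISSING=SYSMIS) (1 thru 3=1) (4=2) (5 thru 7=3) INTO RPOLVIEWS.
VARIABLE LABELS RPOLVIEWS 'Is R Liberal, Moderate, or Conservative?'.
VALUE LABELS RPOLVIEWS 1 'Liberal' 2 'Moderate' 3 'Conservative'.
EXECUTE.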
Finally, produce a frequency table to check for errors before using the new RPOLVIEWS
variable in an analysis. As Table 4 shows, the recoding strategy we chose happened to
distribute respondents quite evenly, though of course it is based on conceptual concerns
rather than simply the distribution of respondents.
Table 4. Is R Liberal, Moderate, or Conservative?
                  Frequency   Percent   Valid Percent   Cumulative Percent
Liberal           1320        32.7      33.3            33.3
Moderate          1377        34.2      34.7            68.0
Conservative      1267        31.4      32.0            100.0
Valid Total       3964        98.3      100.0
Missing (System)  68          1.7
Total             4032        100.0
Creating an Index
In the course of many research and data analysis projects, researchers may seek to create index variables by combining responses on multiple related variables. While it may seem that simply adding the values together would work, one of the main reasons it does not is that a simple sum cannot distinguish between respondents who did not answer one or more questions and respondents who gave answers with lower numerical values. Therefore, the process to be detailed here requires two steps: first, counting missing responses, and then creating an index while excluding those respondents who did not answer all questions included in the index.
The example index detailed here is an index of the seven variables in the 2021 GSS that ask
respondents their views on whether abortion should be legal in a variety of circumstances:
in the case of fetal defect (ABDEFECT), risks to health (ABHLTH), rape (ABRAPE), if the pregnant person is single (ABSINGLE), if the pregnant person is poor (ABPOOR), if the pregnant
person already has children and does not want any more (ABNOMORE), and for any reason
(ABANY). Note that in the 2021 GSS there are a separate set of variables asking about abortion that end in G. These variables reflect differences in wording as part of a survey experiment; one could use them instead of the non-G abortion opinion variables, but the two sets
should not be combined as they were asked of different people.
Our first task is to look at the value labels for our variables. Each of these abortion variables is coded with 1:Yes and 2:No; we need to determine whether we wish to make an
index of yeses or nos, and in this case we will use the Yeses. The second task is to collect
the missing responses so that we can exclude them. Both this part of the process and the
ultimate task of creating the index utilize Transform → Count Values Within Cases… (Alt+T,
Alt+O). Once the Count dialog is open, we need to give our new variable a name (in the
Target Variable box) and Label (in the Target Label box). We will call this variable ABMISSING with the label Count of Missing Values for Abortion Variables, given that’s what we
are counting at the moment. We then move all of the variables (seven, in this case) we are
including in our index into the Variables box using the blue arrow. Next, click on the Define
Values button. Select the radio button next to System- or user-missing and click Add, then
click Continue, then click OK. Figure 9 shows how this Count procedure should be set up.
Figure 9. Setting Up the Count for Missing Values
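A rough syntax sketch of this step:

* Count how many of the seven abortion items each respondent did not answer.
COUNT ABMISSING=ABDEFECT ABHLTH ABRAPE ABSINGLE ABPOOR ABNOMORE ABANY (MISSING).
VARIABLE LABELS ABMISSING 'Count of Missing Values for Abortion Variables'.
EXECUTE.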
It is a good practice to produce a frequency table at this stage to check for errors and ensure that a reasonable number of cases remain for use in producing the desired index variable. In this example, the frequency table for the missing values count should appear as in Table 5.

Table 5. Count of Missing Values for Abortion Variables
        Frequency   Percent   Valid Percent   Cumulative Percent
.00     1284        31.8      31.8            31.8
1.00    118         2.9       2.9             34.8
2.00    26          .6        .6              35.4
3.00    8           .2        .2              35.6
4.00    9           .2        .2              35.8
5.00    1           .0        .0              35.9
6.00    11          .3        .3              36.1
7.00    2575        63.9      63.9            100.0
Total   4032        100.0     100.0

This shows that 31.8%, or 1284 cases, answered all seven abortion questions and thus will be able to be included in our index variable. 63.9% did not answer any of
the seven questions (presumably because they were not asked them), while a far smaller
percent answered only some of the questions—and thus also will be excluded from our
index.
The next step is to create the index variable while excluding those who have missing values. To do this, we again go to Transform → Count Values Within Cases…. This time, we will call our new variable ABINDEX and give it the label Index Variable for Abortion Questions. Under Define Values, we remove MISSING and, in its place, add 1 to the Values to Count box, and then click Continue. Next, we click If… and select the radio button next to Include if case satisfies condition. In the box, we enter ABMISSING = 0 (so that cases in which any of the included variables have missing values are excluded) and click Continue. Figure 10 below shows what all of the dialogs should look like when everything is set up to produce the index. Once everything is ready, click OK.
Figure 10. Creating an Index
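And a rough sketch of this second step in syntax (here the If condition becomes a DO IF structure):

* Count yes (1) answers, but only for respondents who answered all seven items.
DO IF (ABMISSING = 0).
COUNT ABINDEX=ABDEFECT ABHLTH ABRAPE ABSINGLE ABPOOR ABNOMORE ABANY (1).
END IF.
VARIABLE LABELS ABINDEX 'Index Variable for Abortion Questions'.
EXECUTE.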
Finally, produce a frequency distribution for the new index variable. (Note: some analysts
treat this type of index variable as ordinal, while others argue that because it is a count of
numbers it might better be understood as continuous. Either approach is acceptable for
producing descriptive statistics.) Table 6 shows the results, which many people might find
surprising: 50.4% of respondents—just a tad more than half—agree that abortion should
be legal in all of the cases about which they were asked, while only 7.2% believe abortion
should be legal in none of them.
Table 6. Frequencies for Index Variable for Abortion Questions
                  Frequency   Percent   Valid Percent   Cumulative Percent
.00               93          2.3       7.2             7.2
1.00              82          2.0       6.4             13.6
2.00              104         2.6       8.1             21.7
3.00              172         4.3       13.4            35.1
4.00              69          1.7       5.4             40.5
5.00              43          1.1       3.3             43.8
6.00              74          1.8       5.8             49.6
7.00              647         16.0      50.4            100.0
Valid Total       1284        31.8      100.0
Missing (System)  2748        68.2
Total             4032        100.0
Computing Variables
Sometimes, analysts want to combine variables in ways other than by making an index (for instance, adding two continuous variables together or taking their average) or otherwise wish to perform mathematical functions on them. These types of operations can be conducted by going to Transform → Compute Variable (Alt+T, Alt+C). Here, we will try two examples: one in which we take the average of the two variables measuring parental occupational prestige (MAPRES10 and PAPRES10) to determine respondents’ parents’ average occupational prestige, and one in which we take a continuous variable collected on a weekly basis (WWWHR, which measures how many hours respondents spend on the Internet per week) and divide it by seven, the number of days in a week, to produce a variable on a daily basis (the average number of hours respondents spend on the Internet per day).
To create the computed variable for average parental occupational prestige, we go to Transform → Compute Variable. We then indicate the name for our new variable in the Target Variable box; we will call it PARPRES10 (for parental prestige). Once we enter this, we can click on the Type & Label box to provide our variable with a label and be sure it is classified as a numeric variable. Next, in the Numeric Expression box, we set up the formula that will produce the outcome we want. Here, we are averaging two variables, so we want to add them together (remember the parentheses, for order of operations) and then divide by two, like this: (MAPRES10 + PAPRES10) / 2. The If… (optional case selection) can be used to only include or to make sure to exclude certain kinds of cases; we do not need to use it to exclude missing values, as the Compute function already excludes them from computation.
Figure 11 shows what the Compute Variable dialog should look like to produce the
desired variable, the average of mother’s
and father’s occupational prestige. When it
is set up, click OK. The resulting variable is
continuous, so the last step is to produce
descriptive statistics for the continuous
variable. Here, we will produce descriptive
statistics for all three variables–the two
original occupational prestige scores and
our new average. Table 7 shows the results.
Figure 11. Computing an Average Variable
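In syntax, the computation is a single command; a sketch:

* Average the two parental occupational prestige scores.
COMPUTE PARPRES10=(MAPRES10 + PAPRES10) / 2.
VARIABLE LABELS PARPRES10 "Average of Mother's and Father's Occupational Prestige".
EXECUTE.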
Table 7. Descriptive Statistics on Occupational Prestige
Columns: (A) Average of Mother’s and Father’s Occupational Prestige; (B) Mother’s occupational prestige score (2010); (C) Father’s occupational prestige score (2010)

Statistic                (A)        (B)       (C)
N Valid                  2232       2767      3349
N Missing                1800       1265      683
Mean                     44.1633    42.66     45.16
Median                   42.5000    42.00     44.00
Std. Deviation           10.58686   13.168    13.148
Variance                 112.082    173.387   172.869
Skewness                 .493       .389      .622
Std. Error of Skewness   .052       .047      .042
Kurtosis                 -.317      -.672     -.305
Std. Error of Kurtosis   .104       .093      .085
Range                    60.00      64        64
Minimum                  20.00      16        16
Maximum                  80.00      80        80
25th Percentile          36.0000    32.00     35.00
50th Percentile          42.5000    42.00     44.00
75th Percentile          51.5000    50.00     52.00
Let’s take one last example: starting with a variable that measures the number of hours respondents spend on the Internet per week and adjusting it so it measures the number of hours they spend per day. We will call the new variable WWWHRDAY, and the expression to produce it is simply WWWHR / 7, as shown in Figure 13. Then click OK.
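Or, sketched as syntax:

* Convert weekly Internet hours into an average per day.
COMPUTE WWWHRDAY=WWWHR / 7.
VARIABLE LABELS WWWHRDAY 'Average Number of Hours Online Per Day'.
EXECUTE.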
Figure 13. Another Example of Computing a Variable

Again, the final step is to compute descriptive statistics for the new variable, as shown in Table 8. These descriptive statistics show that the average person spends nearly 15 hours a week online, or just over 2 hours per day—but that this is pretty skewed by folks who spend quite a bit of time online, as the median time online is just 9 hours a week, or somewhat over 1 hour per day. The range shows that there are people in the dataset who spend no time online at all, and others who claim to spend every hour of the day online (how is this possible? Don’t they sleep?). Looking at the percentiles, we can observe that a quarter of our respondents spend less than half an hour a day online, while another quarter claim to spend more than 2.8 hours per day online.
Table 8. Descriptive Statistics for Time Online
Columns: (A) Hours per week r spends on the Internet; (B) Average Number of Hours Online Per Day

Statistic                (A)       (B)
N Valid                  2466      2466
N Missing                1566      1566
Mean                     14.80     2.1146
Median                   9.00      1.2857
Std. Deviation           17.392    2.48459
Variance                 302.487   6.173
Skewness                 2.575     2.575
Std. Error of Skewness   .049      .049
Kurtosis                 10.257    10.257
Std. Error of Kurtosis   .099      .099
Range                    168       24.00
Minimum                  0         .00
Maximum                  168       24.00
25th Percentile          3.00      .4286
50th Percentile          9.00      1.2857
75th Percentile          20.00     2.8571
Exercises
1. Select cases to select only those who identify as poor (from the variable CLASS). Produce a histogram of working hours (from the variable HRS1). Then select all cases and produce the same histogram. Compare your results.
2. Split file by RACE or SEX. Choose any variable of interest and perform appropriate descriptive statistics. Write a paragraph explaining the differences you observe between the racial or sex categories in your analysis.
3. Recode HRS1, creating no more than 5 categories. Be sure you can explain why your categories are grouped the way they are, and look at the descriptive statistics as you determine them. Then produce descriptive statistics after recoding and summarize your results.
4. Recode ENPRBUS, creating no more than 4 categories. Be sure you can explain why your categories are grouped the way they are, and look at the descriptive statistics as you determine them. Then produce descriptive statistics after recoding and summarize your results.
5. Create an index of the six variables asking about whether people with various views should be allowed to speak in your community (SPKATH, SPKRAC, SPKCOM, SPKMIL, SPKHOMO, SPKMSLM), being sure to create the missing value index first. Produce appropriate descriptive statistics and summarize your results.
6. Create a new variable for the average number of hours per day the respondent spends in a car or other vehicle by using the Compute function to divide CARHR (the number of hours in a vehicle per week) by 7. Produce descriptive statistics for the original CARHR variable and your new computed variable.
Media Attributions
• select cases © IBM SPSS is licensed under a All Rights Reserved license
• select cases if © IBM SPSS is licensed under a All Rights Reserved license
• split file © IBM SPSS is licensed under a All Rights Reserved license
• recode age © IBM SPSS is licensed under a All Rights Reserved license
• recode age values © IBM SPSS is licensed under a All Rights Reserved license
• to enter value labels © IBM SPSS is licensed under a All Rights Reserved license
• age value labels © IBM SPSS is licensed under a All Rights Reserved license
• recoding polviews © IBM SPSS is licensed under a All Rights Reserved license
• count of missing values © IBM SPSS is licensed under a All Rights Reserved license
• creating an index © IBM SPSS is licensed under a All Rights Reserved license
• compute average © IBM SPSS is licensed under a All Rights Reserved license
• compute math © IBM SPSS is licensed under a All Rights Reserved license
18. Quantitative Analysis with SPSS: Bivariate Crosstabs
MIKAILA MARIEL LEMONIK ARTHUR
This chapter will focus on how to produce and interpret bivariate crosstabulations in SPSS.
To access the crosstabs dialog, go to Analyze → Descriptive Statistics → Crosstabs (Alt+A,
Alt+E, Alt+C). Once the Crosstabs dialog opens, the independent variable should be placed
in the Columns box and the dependent variable in the Rows box. There is a checkbox that
will make a clustered bar chart appear, as well as one that will suppress the tables so that
only the bar chart appears (typically one would not want to suppress the tables; whether or
not to produce the clustered bar chart is a matter of personal preference). In the analysis
shown in Figure 1, SEX is the independent variable (here permitting only male and female
as answers) and DISRSPCT is the dependent variable (how often the respondent feels they
are treated with “less courtesy or respect” than other people are).
Figure 1. Setting Up a Bivariate Crosstab
Next, click the Cells button (Alt+E) and select Column under Percentages. It is important to be sure to select column percentages when the independent variable is in the columns; this is necessary for proper interpretation of the table. Observed will already be checked under Counts; if it is not, you may want to select it as well. There are a variety of other options in this dialog, but they are beyond the scope of this chapter. To see what the Cell Display dialog should look like before proceeding, take a look at Figure 2. Once the appropriate options are selected from Cell Display, click Continue. If one is interested only in producing the crosstabulation table and/or clustered bar charts, OK can be pressed after returning to the main Crosstabs dialog.
Figure 2. Cell Display Dialog for Crosstabs
However, in most cases analysts will want to obtain statistical significance and association measures as well. These can be located by clicking the Statistics button. Chi-square
should be checked in order to produce statistical significance; analysts should then select
the appropriate measure of association for their analysis from among those displayed in
the box. In the absence of information suggesting a different measure of association, Phi
and Cramer’s V is a reasonable default, with Phi being used for 2×2 tables and Cramer’s V
for larger tables, though this may not be appropriate for Ordinal x Ordinal tables. For more
information on selecting an appropriate measure of association, see the chapter on measures of association. The default options are shown in Figure 3.
Some measures of association that SPSS can compute are not listed in the dialog but instead are produced by selecting a different option: Goodman and Kruskal tau can be found under Lambda, while both Pearson's r and Spearman Correlation are found under Correlations. Note that not all of the statistics SPSS can produce are frequently used by beginning social science data analysts, and thus some are not addressed in the chapter on measures of association. And remember to select only one or two appropriate options—it is never the right answer to produce all, or many, of the available statistics, especially not if the analyst is simply searching for the strongest possible association.
Figure 3. Statistics Dialog for Crosstabs
Once the appropriate statistics are selected, click Continue to go back to the main Crosstabs dialog, and then OK to proceed with producing the results (which will then appear in the output).
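Readers who prefer typed commands can click Paste instead of OK to turn the dialog choices into syntax. A minimal sketch of what that syntax might look like for this example (an approximation of pasted output, not a verbatim copy):

CROSSTABS
  /TABLES=DISRSPCT BY SEX
  /CELLS=COUNT COLUMN
  /STATISTICS=CHISQ PHI
  /BARCHART.

Note that the dependent variable (DISRSPCT) supplies the rows and the independent variable (SEX) the columns, matching the dialog setup above; the /BARCHART subcommand requests the optional clustered bar chart.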
The output for this analysis is shown in Figure 4. Below Figure 4, the text will review how
to interpret this output.
Figure 4. SPSS Output for a Crosstabulation of SEX and DISRSPCT
The first table shown is the Case Processing Summary, which simply shows the proportion
of valid cases included in the analysis versus those missing from the analysis (which are
those where there is no response to at least one of the variables).
The second table is the main crosstabulation table. To read this table, compare the percentages across the rows. So, for instance, we can see that very similar proportions of males
and females feel disrespected almost every day or a few times a month, though females
are somewhat more likely to feel disrespected at least once a week. Females are also more
likely to feel disrespected a few times a year, while males are more likely to feel disrespected
less than once a year or never. These conclusions are made simply by comparing percentages across the rows and noting which are bigger and which are smaller. Ignore the count
(the raw number) as this is heavily impacted by the total number of people in each category
of the independent variable and thus is not analytically useful. For example, in this analysis, there are 1434 women and 1154 men (see the total row in the crosstabulation table) and
thus there are more women than men in every category of the dependent variable—even
those where men are more likely to have selected that answer choice than women! Thus, it
is necessary to focus on the percentages, not the raw numbers.
The third table presents the results of the Chi-square significance test. A variety of figures
are provided in this table, including the value and degrees of freedom used to compute the
Chi square. However, in most cases you need only pay attention to the figure under Asymptotic Significance (2-Sided). You will note there are several rows in that column, all of which
provide the same figure. It will almost always be the case that the same figure appears in
each row under the significance column; if it does not, attend to the first significance figure. In this case, the significance figure presented is 0.006, well under both the p<0.05 and
p<0.01 confidence levels though above the p<0.001 level.
The fourth table presents the measures of association, in this case Phi and Cramer's V. Sometimes, as in this case, these figures are the same, while in other cases they are different. If they are different, be sure you know which one you should be looking at given the level of measurement of your variables. Here, they are 0.079, which the strength chart in the measures of association chapter would tell us means there is a weak association.
Finally, at the bottom, is a clustered bar chart. Bivariate bar graphs can also be produced
using the Graphs menu. Under Graphs → Legacy Dialogs → Bar Charts, both clustered and
stacked bar charts are available. Those interested in displaying their data in bivariate graphs
may wish to play around with the different options to see which presents the data in the
most useful form.
Exercises
Select two variables of interest. Answer the following questions:
• Which is the independent variable and which is the dependent variable?
• What is the research hypothesis for this analysis?
• What is the null hypothesis for this analysis?
• What confidence level (p value) have you chosen?
• Which measure of association is most appropriate for this relationship?
Next, use SPSS to produce a crosstabulation according to the instructions in this chapter. Interpret the crosstabulation, being sure to answer the following questions:
• Is the relationship statistically significant?
• Can the null hypothesis be rejected?
• How strong is the association between the two variables?
• Looking at the pattern of percentages across the rows, what can you determine about the nature of the relationship between the two variables?
• Is there support for your research hypothesis?
Repeat this exercise for two additional pairs of variables, choosing new variables each time.
Media Attributions
• crosstabs dialog bivariate © IBM SPSS is licensed under a All Rights Reserved license
• cell display crosstabs © IBM SPSS is licensed under a All Rights Reserved license
• statistics dialog crosstabs © IBM SPSS is licensed under a All Rights Reserved license
• crosstabs output © IBM SPSS is licensed under a All Rights Reserved license
19. Quantitative Analysis with
SPSS: Multivariate Crosstabs
MIKAILA MARIEL LEMONIK ARTHUR
Producing a multivariate crosstabulation is exactly the same as producing a bivariate
crosstabulation, except that an additional variable is added. Note that, due to the limitations of the crosstabulation approach, you are not actually looking at the relationships
between all three variables simultaneously (and this approach is limited to three variables).
Rather, you are looking at how controlling for a third variable—your “Layer” or control variable—changes the relationship between the independent and dependent variable in your
analysis. What SPSS produces, then, is basically a stack of crosstabulation tables with your
independent and dependent variables, one for each category of your control variable, along
with statistical significance and association values for each category of your control variable. This chapter will review how to produce and interpret a multivariate crosstabulation. It
uses variables with fairly few categories for ease of interpretation. Do note that when using
variables with many categories, results can become quite complex and lengthy, and due to
small numbers of cases left in each cell of the very lengthy tables, statistical significance is
likely to be reduced. Thus, analysts should take care to consider whether the relationship(s)
they are interested in are suitable for this type of analysis, and may want to consider recoding variables (see the chapter on data management) with many categories into somewhat
fewer categories to facilitate analysis.
To produce a multivariate crosstabulation, follow the same steps as you would follow to produce a bivariate crosstabulation—put the independent variable in the columns box, the dependent variable in the rows box, select column percentages under Cells, and select chi square and an appropriate measure of association under Statistics. Note that the measure of association you choose should be the same one that you would choose for a bivariate analysis with the same independent and dependent variables, as the third variable is a control variable and does not alter the criteria upon which the decision about measures of association is made. The one thing you need to add in order to produce a multivariate crosstabulation is your third variable, the control variable, which goes in the Layer box in the crosstabs dialog. Figure 1 shows what this would look like for a crosstabulation with the independent variable SEX, the dependent variable HAPMAR, and the control variable DIVORCE. In other words, this analysis is exploring whether being male or female influences respondents' feelings of happiness in their marriages, controlling for whether or not they have ever been divorced.
Figure 1. Crosstabs Dialog for an Analysis with SEX as Independent Variable, HAPMAR as Dependent Variable, and DIVORCE as Control Variable
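In syntax terms, the only change from the bivariate command is a second BY clause naming the control variable; a minimal sketch (an approximation, not pasted output):

CROSSTABS
  /TABLES=HAPMAR BY SEX BY DIVORCE
  /CELLS=COUNT COLUMN
  /STATISTICS=CHISQ PHI.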
Below are the tables SPSS produces for this analysis. After the tables, the text will continue,
with an explanation of how one would go about interpreting these results.
Happiness of R’s marriage * Respondent’s sex * Ever been divorced or separated Crosstabulation
Respondent’s sex
Ever been divorced or separated
female
136
133
269
57.1%
52.8%
54.9%
Count
96
107
203
% within
Respondent’s sex
40.3%
42.5%
41.4%
Count
6
12
18
% within
Respondent’s sex
2.5%
4.8%
3.7%
Count
238
252
490
% within
Respondent’s sex
100.0%
100.0%
100.0%
Count
467
439
906
65.6%
59.8%
62.7%
Count
224
260
484
% within
Respondent’s sex
31.5%
35.4%
33.5%
Count
21
35
56
% within
Respondent’s sex
2.9%
4.8%
3.9%
Count
712
734
1446
% within
Respondent’s sex
100.0%
100.0%
100.0%
Count
603
572
1175
63.5%
58.0%
60.7%
Count
320
367
687
% within
Respondent’s sex
33.7%
37.2%
35.5%
Count
27
47
74
% within
Respondent’s sex
2.8%
4.8%
3.8%
Count
950
986
1936
% within
Respondent’s sex
100.0%
100.0%
100.0%
Count
very happy % within
Respondent’s sex
Happiness of R’s
marriage
pretty
happy
yes
not too
happy
Total
very happy % within
Respondent’s sex
Happiness of R’s
marriage
pretty
happy
no
not too
happy
Total
very happy % within
Respondent’s sex
Happiness of R’s
marriage
pretty
happy
Total
not too
happy
Total
Total
male
Chi-Square Tests
Ever been divorced or separated          Value       df    Asymptotic Significance (2-sided)
yes    Pearson Chi-Square                2.231(b)    2     .328
       Likelihood Ratio                  2.269       2     .322
       Linear-by-Linear Association      1.649       1     .199
       N of Valid Cases                  490
no     Pearson Chi-Square                6.710(c)    2     .035
       Likelihood Ratio                  6.748       2     .034
       Linear-by-Linear Association      6.524       1     .011
       N of Valid Cases                  1446
Total  Pearson Chi-Square                8.772(a)    2     .012
       Likelihood Ratio                  8.840       2     .012
       Linear-by-Linear Association      8.200       1     .004
       N of Valid Cases                  1936
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 36.31.
b. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 8.74.
c. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 27.57.
Symmetric Measures
Ever been divorced or separated                       Value    Approximate Significance
yes    Nominal by Nominal    Phi                      .067     .328
                             Cramer's V               .067     .328
       N of Valid Cases                               490
no     Nominal by Nominal    Phi                      .068     .035
                             Cramer's V               .068     .035
       N of Valid Cases                               1446
Total  Nominal by Nominal    Phi                      .067     .012
                             Cramer's V               .067     .012
       N of Valid Cases                               1936
First, consider the crosstabulation table. As you can see, this table really consists of three tables stacked on top of each other. Each of these three tables considers the relationship between sex and the happiness of the respondent's marriage, but there is one table for those who have ever been divorced, one table for those who have never been divorced, and one table for everyone. Comparing the percentages across the rows, we can make the following observations:
• Among those who have ever been divorced, males are slightly more likely to be very
happy in their marriage, while females are somewhat more likely to be not too happy.
• Among those who have not ever been divorced, males are more likely to be very
happy in their marriage, while females are more likely to be pretty happy and are
somewhat more likely to be not too happy.
• Among the entire sample, males are more likely to be very happy in their marriages,
while females are more likely to be pretty happy and somewhat more likely to be not
too happy.
• Overall, then, the results suggest men are happier in their marriages than women.
Next, we turn to statistical significance. At the p<0.05 level, we can observe that this analysis
produces significant results for those who have never been divorced and for the entire
sample, but not for those who have been divorced. Turning to the association, we find a
weak association—the figures for those who have been divorced, those who have not been
divorced, and the entire population are quite similar.
Thus, we can conclude that women who have never been divorced are, on average, less
happy in their marriages than men who have never been divorced, but that among those
who have been divorced, the relationship between sex and marital happiness is not statistically significant.
Exercises
Select three variables of interest. Answer the following questions:
• Which is the independent variable, which is the dependent variable, and which is the control variable?
• What is the research hypothesis for this analysis? What do you predict will be the relationship between the independent variable and the dependent variable, and how will the control variable impact this relationship?
• What is the null hypothesis for this analysis?
• What confidence level (p value) have you chosen?
• Which measure of association is most appropriate for this relationship?
Next, use SPSS to produce a multivariate crosstabulation according to the instructions in this chapter. Interpret the crosstabulation. First, answer the following questions for each of the stacked crosstabulations of your independent and dependent variable (one for each category of the control variable, plus one for everyone):
• Is the relationship between the independent and dependent variables statistically significant?
• Can the null hypothesis be rejected?
• How strong is the association between the two variables?
• Looking at the pattern of percentages across the rows, what can you determine about the nature of the relationship between the two variables?
Then, compare your results across the different categories of the control variable.
• What does this tell you about how the control variable impacts the relationship between the independent and dependent variables?
• Is there support for your research hypothesis?
Media Attributions
• crosstabs dialog multivariate © IBM SPSS is licensed under a All Rights Reserved
license
20. Quantitative Analysis with
SPSS: Comparing Means
MIKAILA MARIEL LEMONIK ARTHUR
In prior chapters, we have discussed how to perform analysis using only discrete variables.
In this chapter, we will begin to explore techniques for analyzing relationships between
discrete independent variables and continuous dependent variables. In particular, these
techniques enable us to compare groups. For example, imagine a college course with 400
people enrolled in it. The professor gives the first exam, and wants to know if majors scored
better than non-majors, or if first-year students scored worse than sophomores. The techniques discussed in this chapter permit those sorts of comparisons to be made. First, the
chapter will detail descriptive approaches to these comparisons—approaches that let us
observe the differences between groups as they appear in our data without performing
statistical significance testing. Second, the chapter will explore statistical significance testing for these types of comparisons. Note here that the Split File technique, discussed in the
chapter on Data Management, also provides a way to compare groups.
Comparing Means
The most basic way to look at differences between groups is by using the Compare Means command, found by going to Analyze → Compare Means → Means (Alt+A, Alt+M, Alt+M). Put the independent (discrete) variable in the Layer 1 of 1 box and the dependent (continuous) variable in the Dependent List box. Note that while you can use as many independent and dependent variables as you would like in one Compare Means command, Compare Means does not permit multivariate analysis, so including more variables will just mean more paired analyses (one independent and one dependent variable at a time) will be produced. Under Options, you can select additional statistics to produce; the default is the mean, standard deviation, and number of cases, but other descriptive and explanatory statistics are also available. The options under Style and Bootstrap are beyond the scope of this text. Once the Compare Means analysis is set up, click OK.
Figure 1. The Compare Means Dialog in SPSS
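The equivalent syntax is short; a minimal sketch for the example that follows (an approximation of what Paste would generate):

MEANS TABLES=AGEKDBRN BY SEX
  /CELLS=MEAN COUNT STDDEV.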
The results that appear in the Output window are quite simple: just a table listing the
statistics of the dependent variable that were selected (or, if no changes were made, the
default statistics as discussed above) for each category or attribute of the independent variable. In this case, we looked at the independent variable SEX and the dependent variable
AGEKDBRN to see if there is a difference between the age at which men’s and women’s
first child was born.
Table 1: Comparing Mean Age At Birth of First Child By Sex

R's age when their 1st child was born
Respondent's sex    Mean     N      Std. Deviation
male                27.27    1146   6.108
female              24.24    1601   5.917
Total               25.51    2747   6.179

Table 1, the result of this analysis, shows that the average male respondent had his first child at age 27.27, while the average female respondent had her first child at age 24.24, a difference of about three years. The higher standard deviation for men tells us that there is more variation in when men have their first child than there is among women.
The Compare Means command can be used with discrete independent variables with
any number of categories, though keep in mind that if the number of respondents in each
group becomes too small, means may reflect random variation rather than real differences.
Boxplots
Boxplots provide a way to look at this same type of relationship visually. Like Compare Means, boxplots can be used with discrete independent variables with any number of categories, though the graphs will likely become illegible when the independent variable has more than 10 or so categories. To produce a boxplot, go to Graphs → Legacy Dialogs → Boxplot (Alt+G, Alt+L, Alt+X). Click Define. Then put the discrete independent variable in the Category Axis box and the continuous dependent variable in the Variable box. Under Options it is possible to include missing values to see if those respondents differ from those who did respond, but this option is not usually selected. Other options in the Boxplot dialog generally increase the complexity of the graph in ways that may make it harder to use, so just click OK once your variables are set up.
Figure 2. The Boxplot Dialog
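Behind the scenes, the legacy Boxplot dialog runs the EXAMINE procedure; a minimal sketch of equivalent syntax for this example (an approximation, not pasted output):

EXAMINE VARIABLES=AGEKDBRN BY SEX
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.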
Figure 3 displays the boxplot that is produced. It shows that the median (the thick black line), the 25th percentile (the bottom of the blue box), the 75th percentile (the top of the blue box), the low-end extreme (the ⊥ at the bottom of the distribution), and the high-end extreme before outliers (the T at the top of the distribution) are all higher for men than women, while the most extreme outlier (the *) is pretty similar for both. Outliers are labeled with their case numbers so they can be located within the dataset. As you can see, the boxplot provides a way to describe the differences between groups visually.
Figure 3. A Boxplot of Sex and Age at the Birth of First Child

T-Tests For Statistical Significance
But what if we want to know if these differences are statistically significant? That is where T-tests come in. Like the Chi square test, the T test is designed to determine statistical significance, but here, what the test is examining is whether there is a statistically significant difference between the means of two groups. It can only be used to compare two groups, not more than two. There are multiple types of T tests; we will begin here with the independent-samples T test, which is used to compare the means of two groups of different people.
The computation behind the T test involves the standard deviation for each category, the
number of observations (or respondents) in each category, and taking the mean value for
each category and computing the difference between the means (the mean difference).
Like in the case of the Chi square, this produces a calculated T value and degrees of freedom that are then compared to a table of critical values to produce a statistical significance
value. While SPSS will display many of the figures computed as part of this process, it produces the significance value itself so there is no need to do any part of the computation by
hand.
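For readers curious about the formula behind this, the unequal-variances (Welch) form of the statistic, one of the two versions SPSS reports, can be sketched in LaTeX as:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}

where \bar{x}_1 and \bar{x}_2 are the two group means, s_1 and s_2 the group standard deviations, and n_1 and n_2 the group sizes; the equal-variances version instead pools the two standard deviations.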
To produce an independent samples T test, go to Analyze → Compare Means → Independent-Samples T Test (Alt+A, Alt+M, Alt+T). Put the continuous dependent variable in the Test Variable(s) box. Note that you can use multiple continuous dependent variables at once, but you will only be looking at differences in each one, one at a time, not at the relationships between them. Then, put the discrete independent variable in the Grouping Variable box. Click the Define Groups button, and specify the numerical values of the two groups you wish to compare¹—keep in mind that any one T test can only compare two values, not more, so if you have a discrete variable with more than two categories, you will need to perform multiple T tests or choose another method of analysis.² In most cases, other options should be left as they are. For our analysis looking at differences in the age a respondent's first child was born in terms of whether the respondent is male or female, the Independent-Samples T Test dialogs would look as shown in Figure 4. AGEKDBRN is the test variable and SEX is the grouping variable, and under Define Groups, the values of 1 and 2 (the two values of the SEX variable) are entered. If we were using a variable with more than two groups, we would need to select the two groups we were interested in comparing and input the numerical values for just those two groups.
1. Yes, you will need to check variable view for this first, before proceeding to produce your T-test.
2. It is also possible to use a continuous variable for this type of analysis, with the "Cut Point" automatically dividing people into two categories.
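Again, the equivalent syntax is brief; a minimal sketch for this example (an approximation of what Paste would generate):

T-TEST GROUPS=SEX(1 2)
  /VARIABLES=AGEKDBRN
  /CRITERIA=CI(.95).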
After clicking OK to run the test, the results are produced in the output window. While multiple tables are produced, the ones most important to the analysis are called Group Statistics and Independent Samples Test, and for our analysis of sex and age at the birth of the first child, they are reproduced below as Table 2 and Table 3, respectively.
Figure 4. The Independent-Samples T Test Dialogs
Table 2. Group Statistics, Age at Birth of First Child Grouped by Sex

R's age when their 1st child was born
Respondent's sex    N      Mean     Std. Deviation    Std. Error Mean
male                1146   27.27    6.108             .180
female              1601   24.24    5.917             .148
Table 2 provides the number of respondents in each group (male and female), the mean
age at the birth of the first child, the standard deviation, and the standard error of the
mean, statistics much like those we produced in the Compare Means analysis above.
Table 3. Independent Samples Test Results, Sex by Age at Birth of First Child

R's age when their 1st child was born
Levene's Test for Equality of Variances: F = 1.409, Sig. = .235

t-test for Equality of Means            Equal variances assumed    Equal variances not assumed
t                                       13.026                     12.957
df                                      2745                       2418.993
One-Sided p                             <.001                      <.001
Two-Sided p                             <.001                      <.001
Mean Difference                         3.023                      3.023
Std. Error Difference                   .232                       .233
95% Confidence Interval of the
Difference (Lower)                      2.568                      2.565
Table 3 shows the results of the T test, including the T test result, degrees of freedom, and
confidence intervals. There are two rows, one for when equal variances are assumed and
one for when equal variances are not assumed. If the significance under “Sig.” is below 0.05,
that means we should assume the variances are not equal and proceed with our analysis
using the bottom row. If the significance under “Sig.” is 0.05 or above, we should treat the
variances as equal and proceed using the top row. Thus, looking further at the top row, we
can see the mean difference of 3.023 (which recalls the mean difference from our compare
means analysis above) and the significance. Separate significance values are produced for
one-sided and two-sided tests, though these are often similar. One-sided tests only look for
change in one direction (increase or decrease), while two-sided tests look for any change
or difference. Here, we can see both significance values are less than 0.001, so we can conclude that the observed mean difference of 3.023 does represent a statistically significant
difference in the age at which men and women have their first child.
There are a number of other types of T tests. For example, the Paired Samples T Test is
used when comparing two means from the same group—such as if we wanted to compare
the average score on test one to the average score on test two given to the same students.
There is also a One-Sample T Test, which permits analysts to compare observed data from
a sample to a hypothesized value. For instance, a researcher might record the speed of drivers on a given road and compare that speed to the posted speed limit to see if drivers are
going statistically significantly faster. To produce a Paired Samples T Test, go to Analyze → Compare Means → Paired-Samples T Test (Alt+A, Alt+M, Alt+P); the procedure and interpretation are beyond the scope of this text.
To produce a One-Sample T Test, go to Analyze → Compare Means → One-Sample T
Test (Alt+A, Alt+M, Alt+S). Put the continuous variable of interest in the Test Variables box
(Alt+T) and the comparison value in the Test Value box (Alt+V). In most cases, other options
should be left as is. The results will show the sample mean, mean difference (the difference
between the sample mean and the test value), and confidence intervals for the difference,
as well as statistical significance tests both one-sided and two-sided. If the results are significant, that tells us that there is high likelihood that our sample differs from our predicted
value; if the results are not significant, that tells us that any difference between our sample
and our predicted value is likely to have occurred by chance, usually because the difference
is quite small.
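A minimal syntax sketch for these two variants, using hypothetical variable names (EXAM1, EXAM2, and SPEED are illustrative, not GSS variables):

* Paired-samples test comparing two scores from the same respondents.
T-TEST PAIRS=EXAM1 WITH EXAM2 (PAIRED).

* One-sample test comparing observed speeds to a posted limit of 55.
T-TEST
  /TESTVAL=55
  /VARIABLES=SPEED.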
ANOVA
While a detailed discussion of ANOVA—Analysis of Variance—is beyond the scope of this
text, it is another type of test that examines relationships between discrete independent
variables and continuous dependent variables. Used more often in psychology than in
sociology, ANOVA relies on the statistical F test rather than the T test discussed above. It
enables analysts to use more than two categories of an independent variable—and to look
at multiple independent variables together (including by using interaction effects to look at
how two different independent variables together might impact the dependent variable).
It also, as its name implies, includes an analysis of differences in variance between groups
rather than only comparing means. To produce a simple ANOVA, with just one independent variable in SPSS, go to Analyze → Compare Means → One Way ANOVA (Alt+A, Alt+M,
Alt+O); the independent variable in ANOVA is called the “Factor.” You can also use the Compare Means dialog discussed above to produce ANOVA statistics and eta as a measure of
association by selecting the checkbox under Options. For more on ANOVA, consult a more
advanced methods text or one in the field of psychology or behavioral science.
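A minimal sketch of the syntax for a one-way ANOVA (DEGREE is used here only as an illustrative factor variable):

ONEWAY AGEKDBRN BY DEGREE
  /STATISTICS DESCRIPTIVES.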
Exercises
1. Select a discrete independent variable of interest and a continuous dependent variable of interest. Run appropriate descriptive statistics for them both and summarize what you have found.
2. Run Compare Means and a Boxplot for your pair of variables and summarize what you have found.
3. Select two categories of the independent variable that you wish to compare and determine what the numerical codes for those categories are.
4. Run an independent-samples T test comparing those two categories and summarize what you have found. Be sure to discuss both statistical significance and the mean difference.
Media Attributions
• compare means © IBM SPSS is licensed under a All Rights Reserved license
• boxplot dialog © IBM SPSS is licensed under a All Rights Reserved license
• boxplot © IBM SPSS is licensed under a All Rights Reserved license
• independent samples t test dialog © IBM SPSS is licensed under a All Rights Reserved
license
21. Quantitative Analysis with
SPSS: Correlation
MIKAILA MARIEL LEMONIK ARTHUR
So far in this text, we have only looked at relationships involving at least one discrete variable. But what if we want to explore relationships between two continuous variables? Correlation is a tool that lets us do just that.¹ The way correlation works is detailed in the chapter on Correlation and Regression; this chapter, then, will focus on how to produce scatterplots (the graphical representations of the data upon which correlation procedures are based); bivariate correlations and correlation matrices (which can look at many variables, but only two at a time); and partial correlations (which enable the analyst to examine a bivariate correlation while controlling for a third variable).
Scatterplots
To produce a scatterplot, go to Graphs → Legacy Dialogs → Scatter/Dot (Alt+G, Alt+L, Alt+S), as shown in Figure 13 in the chapter on Quantitative Analysis with SPSS: Univariate Analysis. Choose "Simple Scatter" for a scatterplot with two variables, as shown in Figure 1.
Figure 1. Scatter/Dot Graph Selection Dialog
1. Note that the bivariate correlation procedures discussed in this chapter can also be used with ordinal variables
when appropriate options are selected, as will be detailed below.
This brings up the dialog for creating a scatterplot, as shown in Figure 2. The independent variable is placed in the X Axis box, as it is a graphing convention to always put the independent variable on the X axis (you can remember this because X comes before Y, therefore X is the independent variable and Y is the dependent variable, and X goes on the X axis while Y goes on the Y axis). Then the dependent variable is placed in the Y Axis box.
There are a variety of other options in the simple scatter dialog, but most are rarely used. In a small dataset, Label Cases by allows you to specify a variable that will be used to label the dots in the scatterplot (for instance, in a database of states you could label the dots with the 2-letter state code).
Figure 2. Simple Scatter Dialog
Once the scatterplot is set up with the independent and dependent variables, click OK to continue. The scatterplot will then appear in the output. In this case, we have used the independent variable AGE and the dependent variable CARHR to look at whether there is a relationship between the respondent's age and how many hours they spend in a car per week. The resulting scatterplot is shown in Figure 3.
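A minimal sketch of the equivalent legacy graph syntax (an approximation, not pasted output):

GRAPH
  /SCATTERPLOT(BIVAR)=AGE WITH CARHR.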
In some scatterplots, it is easy to observe the relationship between the variables. In others, like the one in Figure 3, the pattern of dots is too complex to make it possible to really see the relationship. A tool to help analysts visualize the relationship is the line of best fit, as discussed in the chapter on Correlation and Regression. This line is the line mathematically calculated to be the closest possible to the greatest number of dots. To add the line of best fit, sometimes called the regression line or the fit line, to your scatterplot, go to the scatterplot in the output window and double-click on it. This will open up the Chart Editor window. Then go to Elements → Fit Line at Total, as shown in Figure 4. This will bring up the Properties window. Under the Fit Line tab, be sure the Linear button is selected; click Apply if needed and close out.
Figure 3. A Scatterplot of Age and Hours Spent in a Car Per Week
Doing so will add a line with an equation to the scatterplot, as shown in Figure 5.² From looking at the line, we can see that as age goes up, time spent in the car per week goes down, but only slightly. The equation confirms this. As shown in the graph, the equation for this line is ŷ = 9.04 - 0.05x. This equation tells us that the line crosses the y axis at 9.04 and that the line goes down 0.05 hours per week in the car for every one year that age goes up (that's about 3 minutes).
Figure 4. Adding a Fit Line to a Scatterplot
Figure 5. Scatterplot of Age and Hours Spent in the Car Per Week with Fit Line
2. It will also add the R²; see the chapter on Correlation and Regression for more on how to interpret this.
What if we are interested in a whole bunch of different variables? It would take a while to produce scatterplots for each pair of variables. But there is an option for producing them all at once, if smaller and a bit harder to read. This is a scatterplot matrix. To produce a scatterplot matrix, go to Graphs → Legacy Dialogs → Scatter/Dot (Alt+G, Alt+L, Alt+S), as in Figure 1. But this time, choose Matrix from the dialog that appears.
In the Scatterplot Matrix dialog, select all of the variables you are interested in and put them in the Matrix Variables box, and then click OK. The many other options here, as in the case of the simple scatterplot, are rarely used.
Figure 6. The Scatterplot Matrix Dialog
The scatterplot matrix will then be produced. As you can see in Figure 7, the scatterplot matrix involves a series of smaller scatterplots, one for each pair of variables specified. Here we specified CARHR and AGE, the two variables we were already using, and added REALINC, the respondent's family's income in real (inflation-adjusted) dollars. It is possible, using the same instructions detailed above, to add lines of best fit to the little scatterplots in the scatterplot matrix. Note that each little scatterplot appears twice, once with the variable on the x-axis and once with the variable on the y-axis. You only need to pay attention to one version of each pair of scatterplots.
Figure 7. A Scatterplot Matrix
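The matrix version of the legacy syntax is similar; a minimal sketch:

GRAPH
  /SCATTERPLOT(MATRIX)=CARHR AGE REALINC.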
Keep in mind that while you can include discrete variables in a scatterplot, the resulting scatterplot will be very hard to read as most of the dots will just be stacked on top of each other. See Figure 8 for an example of a scatterplot matrix that uses some binary and ordinal variables so you are aware of what to expect in such circumstances. Here, we are looking at the relationships between pairs of the three variables real family income, whether the respondent works for themselves or someone else, and how they would rate their family income from the time that they were 16 in comparison to that of others. As you can see, including discrete variables in a scatterplot produces a series of stripes which are not very useful for analytical purposes.
Figure 8. A Scatterplot Matrix Including an Ordinal and a Binary Variable
Correlation
Scatterplots can help us visualize the relationships between our variables. But they cannot tell us whether the patterns we observe are statistically significant—or how strong the relationships are. For this, we turn to correlation, as discussed in the chapter on Correlation and Regression. Correlations are bivariate in nature—in other words, each correlation looks at the relationship between two variables. However, like in the case of the scatterplot matrix discussed above, we can produce a correlation matrix with results for a series of pairs of variables all shown in one table.
To produce a correlation matrix, go to Analyze → Correlate → Bivariate (Alt+A, Alt+C, Alt+B). Put all of the variables of interest in the Variables box. Be sure Flag significant correlations is checked and select your correlation coefficient. Note that the dialog provides the option of three different correlation coefficients: Pearson, Kendall's tau-b, and Spearman. The first, Pearson, is used when looking at the relationship between two continuous variables; the other two are used when looking at the relationship between two ordinal variables.³ In most cases, you will want the two-tailed test of significance. Under Options, you can request that means and standard deviations are also produced. When your correlation is set up, as shown in Figure 9, click OK to produce it. The results will be as shown in Table 1 (the order of variables in the table is determined by the order in which they were entered into the bivariate correlation dialog).
Figure 9. Bivariate Correlation Dialog
3. A detailed explanation of each of these measures of association is found in the chapter An In-Depth Look At
Measures of Association.
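A minimal sketch of the equivalent syntax; the second command shows the Spearman alternative for ordinal variables (both are approximations, not pasted output):

CORRELATIONS
  /VARIABLES=REALINC AGE CARHR
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

NONPAR CORR
  /VARIABLES=REALINC AGE CARHR
  /PRINT=SPEARMAN TWOTAIL NOSIG.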
Table 1. Bivariate Correlation Matrix
(Column heads abbreviated: Income = R's family income in 1986 dollars; Age = Age of respondent; Car hours = How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit.)

                            Income     Age        Car hours
Income
  Pearson Correlation       1          .017       -.062*
  Sig. (2-tailed)                      .314       .013
  N                         3509       3336       1613
Age
  Pearson Correlation       .017       1          -.100**
  Sig. (2-tailed)           .314                  <.001
  N                         3336       3699       1710
Car hours
  Pearson Correlation       -.062*     -.100**    1
  Sig. (2-tailed)           .013       <.001
  N                         1613       1710       1800

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
As in the scatterplot matrix above, each correlation appears twice, so you only need to
look at half of the table—above or below the diagonal. Note that in the diagonal, you are
seeing the correlation of each variable with itself, so a perfect 1 for complete agreement
and the number of cases with valid responses on that variable. For each pair of variables,
the correlation matrix includes the N, or number of respondents included in the analysis;
the Sig. (2-tailed), or the p value of the correlation; and the Pearson Correlation, which is
the measure of association in this analysis. It is starred to further indicate the significance
level. The direction, indicated by a + or – sign, tells us whether the relationship is direct or
inverse. Therefore, for each pair of variables, you can determine the significance, strength,
and direction of the relationship. Taking the results in Table 1 one variable pair at a time, we
can thus conclude that:
• The relationship between age and family income is not significant. (We could say
there is a weak positive association, but since this association is not significant, we
often do not comment on it.)
• The relationship between time spent in a car per week and family income is significant at the p<0.05 level. It is a weak negative relationship—in other words, as family
income goes up, time spent in a car each week goes down, but only a little bit.
• The relationship between time spent in a car per week and age is significant at the p<0.001 level. It is a moderate negative relationship—in other words, as age goes up, time spent in a car each week goes down.
Partial Correlation
Partial correlation analysis is an analytical procedure designed to allow you to examine
the association between two continuous variables while controlling for a third variable.
Remember that when we control for a variable, what we are doing is holding that variable
constant so we can see what the relationship between our independent and dependent
variables would look like without the influence of the third variable on that relationship.
Once you’ve developed a hypothesis about the relationship between the independent,
dependent, and control or intervening variable and run appropriate descriptive statistics,
the first step in partial correlation analysis is to run a regular bivariate correlation with all of
your variables, as shown above, and interpret your results.
After running and interpreting the results of your bivariate correlation matrix, the next step is to produce the partial correlation by going to Analyze → Correlate → Partial (Alt+A, Alt+C, Alt+R). Place the independent and dependent variables in the Variables box, and the control variable in the Controlling for box, as shown in Figure 10. Note that the partial correlation assumes continuous variables and will only produce the Pearson correlation. The resulting partial correlation table will look much like the original bivariate correlation, but will show that the third variable has been controlled for, as shown in Table 2. To interpret the results of the partial correlation, begin by looking at the significance and association displayed and interpret them as usual.
Figure 10. The Partial Correlation Dialog
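A minimal sketch of the equivalent syntax (an approximation, not pasted output):

PARTIAL CORR
  /VARIABLES=AGE CARHR BY REALINC
  /SIGNIFICANCE=TWOTAIL
  /MISSING=LISTWISE.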
Table 2. Partial Correlation
Control Variable: R's family income in 1986 dollars
(Column heads abbreviated: Age = Age of respondent; Car hours = How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit.)

                              Age        Car hours
Age
  Correlation                 1.000      -.106
  Significance (2-tailed)     .          <.001
  df                          0          1547
Car hours
  Correlation                 -.106      1.000
  Significance (2-tailed)     <.001      .
  df                          1547       0
To interpret the results, we again look at significance, strength, and direction. Here, we find
that the relationship is significant at the p<0.001 level and it is a moderate negative relationship.
As age goes up, time spent in a car each week goes down.
After interpreting the results of the bivariate correlation, compare the value of the measure of association in the correlation to that in the partial correlation to see how they differ.
Keep in mind that we ignore the + or – sign when we do this, just considering the actual
number (the absolute value). In this case, then, we would be comparing 0.100 from the
bivariate correlation to 0.106 from the partial correlation. The number in the partial correlation is just a little bit higher. So what does this mean?
Interpreting Partial Correlation Coefficients
To determine how to interpret the results of your partial correlation, figure out which of the
following criteria applies:
• If the correlation between x and y is smaller in the bivariate correlation than in the partial correlation: the third variable is a suppressor variable. This means that when we
don’t control for the third variable, the relationship between x and y seems smaller
than it really is. So, for example, if I give you an exam with a very strict time limit to see
if how much time you spend in class predicts your exam score, the exam time limit
might suppress the relationship between class time and exam scores. In other words,
if we control for the time limit on the exam, your time in class might better predict
your exam score.
• If the correlation between x and y is bigger in the bivariate correlation than in the partial correlation, this means that the third variable is a mediating variable. This is
another way of saying that it is an intervening variable—in other words, the relationship between x and y seems larger than it really is because some other variable z intervenes in the relationship between x and y to change the nature of that relationship. So,
for example, if we are interested in the relationship between how tall you are and how
good you are at basketball, we might find a strong relationship. However, if we added
the additional variable of how many hours a week you practice shooting hoops, we
might find the relationship between height and basketball skill is much diminished.
• It is additionally possible for the direction of the relationship to change. So, for example, we might find that there is a direct relationship between miles run and marathon
performance, but if we add frequency of injuries, then running more miles might
reduce your marathon performance.
• If the value of Pearson’s r is the same or very similar in the bivariate and partial correlations, the third variable has little or no effect. In other words, the relationship
between x and y is basically the same regardless of whether we consider the influence
of the third variable, and thus we can conclude that the third variable does not really
matter much and the relationship of interest remains the one between our independent and dependent variables.
Finally, remember that significance still matters! If neither the bivariate correlation nor the partial correlation is significant, we cannot reject our null hypothesis and thus we cannot conclude that there is anything happening amongst our variables. If both the bivariate correlation and the partial correlation are significant, we can reject the null hypothesis and proceed according to the instructions for interpretation as discussed above. If the original bivariate correlation was not significant but the partial correlation was significant, we cannot reject the null hypothesis with regard to the relationship between our independent and dependent variables alone. However, we can reject the null hypothesis that there is no relationship between the variables as long as we are controlling for the third variable! If the original bivariate correlation was significant but the partial correlation was not significant, we can reject the null hypothesis with regard to the relationship between our independent and dependent variables, but we cannot reject the null hypothesis when considering the role of our third variable. While we can't be sure what is going on in such a circumstance, the analyst should conduct more analysis to try to see what the relationship between the control variable and the other variables of interest might be.
So, what about our example above? Well, the number in our partial correlation was
higher, even if just a little bit, than the number in our bivariate correlation. This means that
family income is a suppressor variable. In other words, when we do not control for family
income, the relationship between age and time spent in the car seems smaller than it really
is. But here is where we find the limits of what the computer can do to help us with our
analysis—the computer cannot explain why controlling for income makes the relationship
between age and time spent in the car larger. We have to figure that out ourselves. What
do you think is going on here?
Exercises
1. Choose two continuous variables of interest. Produce a scatterplot with regression line and describe what you see.
2. Choose three continuous variables of interest. Produce a scatterplot matrix for the three variables and describe what you see.
3. Using the same three continuous variables, produce a bivariate correlation matrix. Interpret your results, paying attention to statistical significance, direction, and strength.
4. Choose one of your three variables to use as a control variable. Write a hypothesis about how controlling for this variable will impact the relationship between the other two variables.
5. Produce a partial correlation. Interpret your results, paying attention to statistical significance, direction, and strength.
6. Compare the results of your partial correlation to the results from the correlation of those same two variables in Question 3 (when the other variable is not controlled for). How have the results changed? What does that tell you about the impact of the control variable?
Media Attributions
• scatter dot dialog © IBM SPSS is licensed under a All Rights Reserved license
• simple scatter dialog © IBM SPSS is licensed under a All Rights Reserved license
• scatter of carhrs and age © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-ND (Attribution NonCommercial NoDerivatives) license
• scatter fit line © IBM SPSS is licensed under a All Rights Reserved license
• scatter with line © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-ND
(Attribution NonCommercial NoDerivatives) license
• scatterplot matrix dialog © IBM SPSS is licensed under a All Rights Reserved license
• matrix scatter © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-ND (Attribution NonCommercial NoDerivatives) license
• scatter binary ordinal © Mikaila Mariel Lemonik Arthur is licensed under a All Rights
Reserved license
• bivariate correlation dialog © IBM SPSS is licensed under a All Rights Reserved license
• partial correlation dialog © IBM SPSS is licensed under a All Rights Reserved license
22. Quantitative Analysis with
SPSS: Bivariate Regression
MIKAILA MARIEL LEMONIK ARTHUR
This chapter will detail how to conduct basic bivariate linear regression analysis using one
continuous independent variable and one continuous dependent variable. The concepts
and mathematics underpinning regression are discussed more fully in the chapter on Correlation and Regression. Some more advanced regression techniques will be discussed in
the chapter on Multivariate Regression.
Before beginning a regression analysis, analysts should first run appropriate descriptive statistics. In addition, they should create a scatterplot with regression line, as described in the chapter on Quantitative Analysis with SPSS: Correlation. One important reason why is that linear regression has as a basic assumption the idea that data are arranged in a linear—or line-like—shape. When relationships are weak, it will not be possible to tell just by glancing at the scatterplot whether the relationship is linear, or whether there is any relationship at all.
However, there are cases where it is quite obvious that there *is* a relationship, but that this relationship is not line-like in shape. For example, if the scatterplot shows a clear curve, as in Figure 1, one that could not be approximated by a line, then the relationship is not sufficiently linear to be detected by a linear regression.¹ Thus, any results you obtain from linear regression analysis would considerably underestimate the strength of such a relationship and would not be able to discern its nature. Therefore, looking at the scatterplot before running a regression allows the analyst to determine if the particular relationship of interest can appropriately be tested with a linear regression.
Figure 1. A Curvilinear Scatterplot
1. There are other regression techniques that are appropriate for such relationships, but they are beyond the
scope of this text.
Assuming that the relationship of interest is appropriate for linear regression, the regression can be produced by going to Analyze → Regression → Linear (Alt+A, Alt+R, Alt+L).² The dependent variable is placed in the Dependent box; the independent in the "Block 1 of 1" box. Under Statistics, be sure both Estimates and Model fit are checked. Here, we are using the independent variable AGE and the dependent variable CARHR. Once the regression is set up, click OK to run it.
Figure 2. The Linear Regression Dialog
The results will appear in the output window. There will be four tables: Variables Entered/Removed; Model Summary; ANOVA; and Coefficients. The first of these simply documents the variables you have used.³ The other three contain important elements of the analysis. Results are shown in Tables 1, 2, and 3.
2. You will notice that there are many, many options and tools within the Linear Regression dialog; some of these will be discussed in the chapter on Multivariate Regression, while others are beyond the scope of this text.
3. The Variables Entered/Removed table is important to those running a series of multivariate models while adding or removing individual variables, but is not useful when only one model is run at a time.
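A minimal sketch of the equivalent syntax for this dialog setup (an approximation of what Paste would generate):

REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT CARHR
  /METHOD=ENTER AGE.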
Table 1. Model Summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .100(a)  .010        .009                 8.656
a. Predictors: (Constant), Age of respondent
Table 2. ANOVA(a)

Model 1       Sum of Squares    df      Mean Square    F         Sig.
Regression    1290.505          1       1290.505       17.225    <.001(b)
Residual      127966.152        1708    74.922
Total         129256.657        1709
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
b. Predictors: (Constant), Age of respondent
Table 3. Coefficients(a)

                     Unstandardized Coefficients    Standardized Coefficients
Model 1              B         Std. Error           Beta                         t         Sig.
(Constant)           9.037     .680                                              13.297    <.001
Age of respondent    -.051     .012                 -.100                        -4.150    <.001
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
When interpreting the results of a bivariate linear regression, we need to answer the following questions:
• What is the significance of the regression?
• What is the strength of the observed relationship?
• What is the direction of the observed relationship?
• What is the actual numerical relationship?
Each of these questions is answered by numbers found in different places within the set of
tables we have produced.
First, let’s look at the significance. The significance is found in two places among our
results, under “Sig.” in the ANOVA table (here, table 2) and under “Sig.” in the Coefficients
table (here, table 3). In the Coefficients table, look at the significance number in the row
with the independent variable; you will also see a significance number for the constant,
which will be discussed later. In a bivariate regression, these two significance numbers are
the same (this is not true for multivariate regressions). So, in these results, the significance
is p<0.001, which means we can conclude that the results are significant.
Next, we look at the strength. Again, we can look in two places for the strength, under R in the Model Summary table (here, Table 1) and under Beta in the Coefficients table. Beta refers to the Greek letter β, and Beta and β are used interchangeably when referring to the standardized coefficient. R, here, refers to Pearson's r, and both it and Beta are interpreted the same way as measures of association usually are. While the R and the Beta will be the same in a bivariate regression, the sign (whether the number is positive or negative) may not be; again, in multivariate regressions, the numbers will not be the same. This is because Beta is used to look at the strength of the relationship each individual independent variable has with the dependent variable. Here, the R/Beta is 0.100, so the relationship is moderate in strength.
The direction of the relationship is determined by whether the Beta is positive or negative. Here, it is negative, so that means it is an inverse relationship. In other words, as age goes up, time spent in cars each week goes down. And the B value, found in the Coefficients table, tells us by how much it goes down. Here we see that for every one year of additional age, time spent in cars goes down by about 0.051 hours (a little more than three minutes).
One final thing to look at is the R squared (R2) in the Model Summary table. The R2 tells
us how much of the variance in our dependent variable is explained by our independent
variable. Here, then, age explains 1% (0.010 converted to a percent by multiplying it by
100) of the variance in time spent in a car each week. That might not seem like very much,
and it is not very much. But considering all the things that matter to how much time you
spend in a car each week, it is clear that age is contributing somehow.
The numbers in the Coefficients table also allow us to construct the regression equation (the equation for the line of best fit). The number under B for the constant row is the y-intercept (in other words, if X were 0, what would Y be?), and the number under B for the variable is the slope of the line. We apply asterisks to indicate significance (here, *** for p<.001), giving us the following equation: ŷ = 9.037*** − .051***(age). Note that whether or not the constant/intercept is statistically significant just tells us whether the constant/intercept is statistically significantly different from zero, which is not actually very interesting, and thus most analysts do not pay much attention to the significance of the constant/intercept.
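To see the equation at work, consider a hypothetical respondent (not one drawn from the data) who is 40 years old: ŷ = 9.037 − .051(40) = 9.037 − 2.04 ≈ 7.0, so the regression would predict that such a respondent spends about 7 hours per week in a car.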
So, in summary, our results tell us that age has a significant, moderate, inverse relationship with time spent in a car each week; that age explains 1% of the variance in time spent in the car each week; and that for every one year of additional age, about 3 fewer minutes per week are spent in the car.
Exercises
1. Choose two continuous variables of interest. Write a hypothesis about the relationship between the variables.
2. Create a scatterplot for these two variables with a regression line (line of best fit). Explain what the scatterplot shows.
3. Run a bivariate regression for these two variables. Interpret the results, being sure to discuss significance, strength, direction, and the actual magnitude of the effect.
4. Create the regression equation for your regression results.
Media Attributions
• curvilinear © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
• linear regression dialog © IBM SPSS is licensed under a All Rights Reserved license
23. Quantitative Analysis with SPSS: Multivariate Regression
MIKAILA MARIEL LEMONIK ARTHUR
In the chapter on Bivariate Regression, we explored how to produce a regression with one
independent variable and one dependent variable, both of which are continuous. In this
chapter, we will expand our understanding of regression. The regressions we produce here
will still be linear regressions with one continuous dependent variable, but now we will
be able to include more than one independent variable. In addition, we will learn how to
include discrete independent variables in our analysis.
In fact, producing and interpreting multivariate linear regressions is not very different
from producing and interpreting bivariate linear regressions. The main differences are:
1. We add one or more additional variables to the Block 1 of 1 box (where the independent variables go) when setting up the regression analysis,
2. We check off one additional option under Statistics when setting up the regression
analysis, Collinearity diagnostics, which will be explained below,
3. We interpret the strength and significance of the entire regression and then look at
the strength, significance, and direction of each included independent variable one at
a time, so there are more things to interpret, and
4. We can add or remove variables and compare the R2 to see how those changes
impacted the overall predictive power of the regression.
Each of these differences between bivariate and multivariate regression will be discussed
below, beginning with the issue of collinearity and the tools used to diagnose it.
Collinearity
Collinearity refers to the situation in which two independent variables in a regression analysis are closely correlated with one another (when more than two independent variables are
closely correlated, we call it multicollinearity). This is a problem because when the correlation between independent variables is high, the impact of each individual variable on the
dependent variable can no longer be separately calculated. Collinearity can occur in a variety of circumstances: when two variables are measuring the same thing but using different scales; when they are measuring the same concept but doing so slightly differently; or
when one of the variables has a very strong effect on the other.
Let’s consider examples of each of these circumstances in turn. If a researcher included
both year of birth and age, or weight in pounds and weight in kilograms, both of the variables in each pair are measuring the exact same thing. Only the scales are different. If a
researcher included both hourly pay and weekly pay, or the length of commute in both
distance and time, the correlation would not be quite as close. A person might get paid
$10 an hour but work a hundred hours per week, or get paid $100 an hour but work ten
hours per week, and thus still have the same weekly pay. Someone might walk two miles to
work and spend the same time commuting as someone else driving 35 miles on the highway. But overall, the relationships between hourly pay and weekly pay and the relationship
between commute distance and commute time are likely to be quite strong. Finally, consider a researcher who includes variables measuring the grade students earned on Exam 1
and their total grade in a course with three exams, or one who includes variables measuring families’ spending on housing each month and their overall spending each month. In
these cases, the variables are not measuring the same underlying phenomena, but the first
variable likely has a strong effect on the second variable, resulting in a strong correlation.
In many cases, the potential for collinearity will be obvious when considering the variables included in the analysis, as in the examples above. But it is not always obvious.
Therefore, researchers need to test for collinearity when performing multivariate regressions. There are several ways to do this. First of all, before beginning to run a regression,
researchers can check for collinearity by running a correlation matrix and a scatterplot
matrix to look at the correlations between each pair of variables. The instructions for these
techniques can be found in the chapter on Quantitative Analysis with SPSS: Correlation. A
general rule of thumb is that if a Pearson correlation is above 0.8, this suggests a likely problem with collinearity, though some suggest scrutinizing those pairs of variables with a correlation above 0.7.
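For those who prefer syntax to the menus this book otherwise relies on, a minimal sketch of this pre-check might look like the following (the variable names are GSS variables used later in this chapter; substitute your own candidate independent variables):

    CORRELATIONS
      /VARIABLES=AGE REALINC EDUC PRESTG10
      /PRINT=TWOTAIL NOSIG.

This produces the same correlation matrix as the menu path does; scan it for pairs of independent variables correlated above the 0.7 to 0.8 range before settling on a final set of predictors.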
In addition, when running the regression, researchers can check off the option for Collinearity diagnostics (Alt+l) under the Statistics dialog (Alt+S), as shown in Figure 1. The resulting regression’s Coefficients table will include two additional pieces of information, the VIF and the Tolerance, as well as an additional table called Collinearity diagnostics. The VIF, or Variance Inflation Factor, calculates the degree of collinearity present. Values at or close to one suggest no collinearity; values around four or five suggest that a deeper look at the variables is needed; and values at ten or above definitely suggest collinearity great enough to be problematic for the regression analysis. The Tolerance measure calculates the extent to which other independent variables can predict the values of the variable under consideration; for Tolerance, the smaller the number, the more likely that collinearity is a problem. Typically, researchers performing relatively straightforward regressions such as those detailed in this chapter do not need to rely on the Collinearity diagnostics table, as they will be able to determine which variables may be correlated with one another by simply considering the variables and looking at the Tolerance and VIF statistics.
Figure 1. Using Collinearity Diagnostics in Regression
Producing & Interpreting Multivariate Linear Regressions
Producing multivariate linear regressions in SPSS works just the same as producing bivariate linear regressions, except that we add one or more additional variables to the Block 1 of 1 box and check off the Collinearity diagnostics, as shown in Figure 2. Let’s continue our analysis of the variable CARHR, adding the independent variable REALINC (inflation-adjusted family income) to the independent variable AGE. Figure 2 shows how the linear regression dialog would look when set up to run this regression, with CARHR in the Dependent box and AGE and REALINC in the Independent(s) box under Block 1 of 1. Be sure that Estimates, Model fit, and Collinearity diagnostics are checked off, as shown in Figure 1. Then click OK to run the regression.
Figure 2. The Linear Regression Dialog Set Up With CARHR as Dependent and AGE and REALINC as Independent
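If you prefer syntax to the dialogs, a sketch of the equivalent command (using the GSS variable names from this example) would be:

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
      /DEPENDENT CARHR
      /METHOD=ENTER AGE REALINC.

The COLLIN and TOL keywords on the /STATISTICS subcommand are what the Collinearity diagnostics checkbox adds; the remaining keywords correspond to the default Estimates and Model fit options.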
Tables 1, 2, and 3 below show the results (excluding those parts of the output unnecessary for interpretation).
Table 1. Model Summary
Model 1: R = .124a; R Square = .015; Adjusted R Square = .014; Std. Error of the Estimate = 8.619
a. Predictors: (Constant), R’s family income in 1986 dollars, Age of respondent
Table 2. ANOVAa
Model 1, Regression: Sum of Squares = 1798.581; df = 2; Mean Square = 899.290; F = 12.106; Sig. <.001b
Model 1, Residual: Sum of Squares = 114913.923; df = 1547; Mean Square = 74.282
Model 1, Total: Sum of Squares = 116712.504; df = 1549
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
b. Predictors: (Constant), R’s family income in 1986 dollars, Age of respondent
Table 3. Coefficientsa
Model 1, (Constant): B = 9.808; Std. Error = .752; t = 13.049; Sig. <.001
Model 1, Age of respondent: B = -.055; Std. Error = .013; Beta = -.106; t = -4.212; Sig. <.001; Tolerance = 1.000; VIF = 1.00
Model 1, R’s family income in 1986 dollars: B = -1.356E-5; Std. Error = .000; Beta = -.064; t = -2.538; Sig. = .011; Tolerance = 1.000; VIF = 1.00
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
So, how do we interpret the results of our multivariate linear regression? First, look at the
Collinearity Statistics in the Coefficients table (here, Table 3). As noted above, to know we
are not facing a situation involving collinearity, we are looking for a VIF that’s lower than
5 and a Tolerance that is close to 1. Both of these conditions are met here, so collinearity
is unlikely to be a problem. If it were, we would want to figure out which variables were
overly correlated and remove at least one of them. Next, we look at the overall significance of the regression in the ANOVA table (here, Table 2). The significance shown is <0.001, so the regression is significant. If it were not, we would stop there.
To find out the overall strength of the regression, we look at the R in the Model Summary
(here, Table 1). It is 0.124, which means it is moderately strong. The R2 is 0.015, which—converting the decimal into a percentage by multiplying it by 100—tells us that the two independent variables combined explain 1.5% of the variance in the dependent variable, how
much time the respondent spends in the car. And here’s something fancy you can do with
that R2: compare it to the R2 for our prior analysis in the chapter on Bivariate Regression,
which had just the one independent variable of AGE. That R2 was 0.010, so though our new
regression still explains very little of the variance in hours spent in the car, adding income
does enable us to explain a bit more of the variance. Note that you can only compare R2 values among a series of models with the same dependent variable. If you change dependent
variables, you can no longer make that comparison.
Now, let’s turn back to the Coefficients table. When we interpreted the bivariate regression results, we saw that the significance and Beta values in this table were the same as
the significance value in the ANOVA table and the R values, respectively. In the multivariate
regression, this is no longer true—because now we have multiple independent variables,
each with their own significance and Beta values. These results allow us to look at each
independent variable, while holding constant (controlling for) the effects of the other independent variable(s). Age, here, is significant at the p<0.001 level, and its Beta value is -0.106,
showing a moderate negative association. Family income is significant at the p<0.05 level,
and its Beta value is -0.064, showing a weak negative association. We can compare the Beta
values to determine that age has a larger effect (0.106 is a bigger number than 0.064; we
ignore sign when comparing strength) than does income.
Next, we look at the B values to see the actual numerical effect of each variable. For
every year of additional age, respondents spend on average 0.055 fewer hours in the car, or about 3.3 minutes less. And for every dollar of additional family income, respondents spend
-1.356E-5 fewer hours in the car. But wait, what does -1.356E-5 mean? It’s a way of writing
numbers that have a lot of decimal places so that they take up less space. Written the long
way, this number is -0.00001356—so what the E-5 is telling us is to move the decimal point
five spaces over. That’s a pretty tiny number, but that’s because an increase of $1 in your
annual family income really doesn’t have much impact on, well, really anything. If instead
we considered the impact of an increase of $10,000 in your annual family income, we would
multiply our B value by $10,000, getting -0.1356. In other words, an increase of $10,000 in
annual family income (in constant 1986 dollars) is associated with an average decrease of
0.1356 hours in the car, or a little more than 8 minutes.
Our final step is to create the regression equation. We do this the same way we did for bivariate regression, only this time, there is more than one x, so we have to indicate the coefficient and significance of each one separately. Some people do this by numbering their xs with subscript numerals (e.g. x1, x2, and so on), while others just use the short variable name. We will do the latter here. Taking the numbers from the B column (with *** indicating p<.001 and * indicating p<.05), our regression equation is ŷ = 9.808*** − .055***(AGE) − .00001356*(REALINC).
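Again taking a hypothetical respondent—say, one who is 40 years old with a family income of $50,000 in 1986 dollars—the equation predicts 9.808 − .055(40) − .00001356(50000) = 9.808 − 2.2 − .678 ≈ 6.9 hours per week in the car.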
Phew, that was a lot to go through! But it told us a lot about what is going on with our
dependent variable, CARHR. That’s the power of regression: it tells us not just about the
strength, significance, and direction of the relationship between a given pair of variables,
but also about the way adding or removing additional variables changes things as well as
about the actual impact each independent variable has on the dependent variable.
Dummy Variables
So far, we have reviewed a number of the advantages of regression analysis, including the
ability to look at the significance, strength, and direction of the relationships between a
series of independent variables and a dependent variable; examining the effect of each
independent variable while controlling for the others; and seeing the actual numerical
effect of each independent variable. Another advantage is that it is possible to include independent variables that are discrete in our analysis. However, they can only be included in a very specific way: if we transform them into a special kind of variable called a dummy variable in which a single value of interest is coded as 1 and all other values are coded as 0. It is even possible to create multiple dummy variables for different categories of the same discrete variable, so long as you have an excluded category or set of categories that are sizable. It is important to leave a sizeable group of respondents or data points in the excluded category because of collinearity.
Consider, for instance, the variable WRKSLF, which asks if respondents are self-employed
or work for someone else. This is a binary variable, with only two answer choices. We could
make a dummy variable for self-employment, with being self-employed coded as 1 and
everything else (which, here, is just working for someone else) as 0. Or we could make a
dummy variable for working for someone else, with working for someone else coded as 1
and everything else as 0. But we cannot include both variables in our analysis because they
are, fundamentally, measuring the same thing.
Figuring out how many dummy variables to make and which ones they should be can be
difficult. The first question is theoretical: what are you actually interested in? Only include
categories you think would be meaningfully related to the outcome (dependent variable)
you are considering. Second, look at the descriptive statistics for your variable to be sure
you have an excluded category or categories. If all of the categories of the variable are sufficiently large, it may be enough to exclude one category. However, if a category represents
very few data points—say, just 5 or 10 percent of respondents—it may not be big enough
to avoid collinearity. Therefore, some analysts suggest using one of the largest categories,
assuming this makes sense theoretically, as the excluded category.
Let’s consider a few examples:
Table 4. Examples of Dummy Variables

RACE (White: 78.2%; Black: 11.6%; Other: 10.2%)
Option 1. 2 variables: Black 1, all others 0 & Other 1, all others 0
Option 2. Nonwhite 1, all others 0
Option 3. White 1, all others 0

DEGREE (Less than high school: 6.1%; High school: 38.8%; Associate/junior college: 9.2%; Bachelor’s: 25.7%; Graduate: 18.8%)
Option 1. Bachelor’s or higher 1; all others 0
Option 2. High school or higher 1; all others 0
Option 3. 4 variables: Less than high school 1, all others 0; Associate/junior college 1, all others 0; Bachelor’s 1, all others 0; Graduate 1, all others 0
Option 4. Use EDUC instead, as it is continuous

CHILDS (0: 29.2%; 1: 16.2%; 2: 28.9%; 3: 14.5%; 4: 7%; 5: 2%; 6: 1.3%; 7: 0.4%; 8 or more: 0.5%)
Option 1. 0 children 1, all others 0
Option 2. 2 variables: 0 children 1, all others 0; 1 child 1, all others 0
Option 3. 3 variables: 0 children 1, all others 0; 1 child 1, all others 0; 2 children 1, all others 0
Option 4. Ignore the fact that this variable is not truly continuous and treat it as continuous anyway

CLASS (Lower class: 8.7%; Working class: 27.4%; Middle class: 49.8%; Upper class: 4.2%)
The best option is to create three variables: Lower class 1, all others 0; Working class 1, all others 0; Upper class 1, all others 0 (however, you could instead include Working class and have a variable for Middle class if that made more sense theoretically)

SEX (Male: 44.1%; Female: 55.9%)
Option 1. Male 1, all others 0
Option 2. Female 1, all others 0
So, how do we go about making our dummy variable or variables? We use the Recode technique, as illustrated in the chapter on Quantitative Analysis with SPSS: Data Management. Just remember to use Recode into Different Variables and to make as many dummy variables as needed: maybe one, maybe more. Here, we will make one for SEX. Because we are continuing our analysis of CARHR, let’s assume we hypothesize that, on average, women spend more time in the car than men because women are more likely to be responsible for driving children to school and activities. On the basis of this hypothesis, we would treat female as the included category (coded 1) and male as the excluded category (coded 0), since what we are interested in is the effect of being female.
As a reminder, to recode, we first make sure we know the value labels for our existing
original variable, which we can find out by checking Values in Variable View. Here, male is 1
and female is 2. Then we go to Transform → Recode into Different Variables (Alt+T, Alt+R). We
add the original variable to the box, and then give our new variable a name, generally something like the name of the category we are interested in (here, Female), and a descriptive label, and click Change. Next, we click “Old and New Values.” We set system or user missing as system missing, our category of interest as 1, and everything else as 0. We click Continue, then go to the bottom of Variable View and edit our value labels to reflect our new categories. Finally, we run a frequency table of our new variable to be sure everything worked right. Figure 3 shows all of the steps described here.
Figure 3. The Process of Recoding Sex to Create the Dummy Variable Female
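For reference, a syntax sketch of the same recode (assuming the GSS coding of 1 = male and 2 = female noted above) would be:

    RECODE SEX (2=1) (1=0) (ELSE=SYSMIS) INTO FEMALE.
    VARIABLE LABELS FEMALE 'Dummy Variable for Being Female'.
    VALUE LABELS FEMALE 0 'Male' 1 'Female'.
    EXECUTE.
    FREQUENCIES VARIABLES=FEMALE.

The closing FREQUENCIES command performs the same check described above, confirming that the new variable contains only 0s, 1s, and missing values.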
Dummy Variables in Regression Analysis
After creating the dummy variable, we are ready to include our dummy variable in a regression. We set up the regression just the same way as we did above, except that we add FEMALE to the independent variables REALINC and AGE (the dependent variable will stay CARHR). Be sure to check Collinearity diagnostics under Statistics. Figure 4 shows how the linear regression dialog should look with this regression set up. Once the regression is set up, click OK to run it.
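If you are working in syntax, the only change from the earlier regression is the added dummy variable on the /METHOD=ENTER line:

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
      /DEPENDENT CARHR
      /METHOD=ENTER AGE REALINC FEMALE.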
Now, let’s consider the output, again focusing only on those portions of the output necessary to our interpretation, as shown in Tables 5, 6, and 7.
Figure 4. The Multivariate Linear Regression Window with our Dummy Variable Added
Table 5. Model Summary
Model 1: R = .143a; R Square = .020; Adjusted R Square = .018; Std. Error of the Estimate = 8.602
a. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars
Table 6. ANOVAa
Model 1, Regression: Sum of Squares = 2372.292; df = 3; Mean Square = 790.764; F = 10.686; Sig. <.001b
Model 1, Residual: Sum of Squares = 114259.126; df = 1544; Mean Square = 74.002
Model 1, Total: Sum of Squares = 116631.418; df = 1547
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
b. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars
Table 7. Coefficientsa
Model 1, (Constant): B = 10.601; Std. Error = .801; t = 13.236; Sig. <.001
Model 1, Age of respondent: B = -.056; Std. Error = .013; Beta = -.108; t = -4.301; Sig. <.001; Tolerance = 1.000; VIF = 1.00
Model 1, R’s family income in 1986 dollars: B = -1.548E-5; Std. Error = .000; Beta = -.073; t = -2.882; Sig. = .004; Tolerance = .986; VIF = 1.014
Model 1, Dummy Variable for Being Female: B = -1.196; Std. Error = .442; Beta = -.069; t = -2.706; Sig. = .007; Tolerance = .986; VIF = 1.015
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
First, we look at our collinearity diagnostics in the Coefficients table (here, Table 7). We can
see that all three of our variables have both VIF and Tolerance close to 1 (see above for a
more detailed explanation of how to interpret these statistics), so it is unlikely that there is
a collinearity problem.
Second, we look at the significance for the overall regression in the ANOVA table (here,
Table 6). We find the significance is <0.001, so our regression is significant and we can continue our analysis.
Third, we look at the Model Summary table (here, Table 5). We see that the R is 0.143, so the
regression’s strength is moderate, and the R2 is 0.02, meaning that all of our variables
together explain 2% of the variance (0.02 * 100 converts the decimal to a percent) in our
dependent variable. We can compare this 2% R2 to the 1.5% R2 we obtained from the earlier regression without Female and determine that adding the dummy variable for being
female helped our regression explain a little bit more of the variance in time respondents
spend in the car.
Fourth, we look at the significance and Beta values in the Coefficients table. First, we find
that Age is significant at the p<0.001 level and that it has a moderate negative relationship with time spent in the car. Second, we find that income is significant at the p<0.01
level and has a weak negative relationship with time spent in the car. Finally, we find that
being female is significant at the p<0.01 level and has a weak negative relationship with
time spent in the car. But wait, what does this mean? Well, female here is coded as 1 and
male as 0. So what this means is that when you move from 0 to 1—in other words from
male to female—the time spent in the car goes down (but weakly). This is the opposite of
what we hypothesized! Of the three variables, age has the strongest effect (the largest Beta
value).
Next, we look at the B values to see what the actual numerical effect is. For every one
additional year of age, time spent in the car goes down by 0.056 hours (3.36 minutes)
a week. For every one additional dollar of income, time spent in the car goes down by 1.548E-5 hours per week; translated (as we did above), this means that for every additional $10,000 of income, time spent in the car goes down by about 0.15 hours (about 9 minutes) per week. And women, it seems, spend on average 1.196 hours (about one hour and
twelve minutes) fewer per week in the car than do men.
Finally, we produce our regression equation. Taking the numbers from the B column, our regression equation is ŷ = 10.601*** − .056***(AGE) − .00001548**(REALINC) − 1.196**(FEMALE), where ** indicates significance at the p<.01 level.
Regression Modeling
There is one more thing you should know about basic multivariate linear regression. Many
analysts who perform this type of technique systematically add or remove variables or
groups of variables in a series of regression models (SPSS calls them “Blocks”) to look at
how they influence the overall regression. This is basically the same as what we have done
above by adding a variable and comparing the R2 (the difference between the two R2 values is called the R2 change). However, SPSS provides a tool for running multiple blocks at
once and looking at the results. When looking at the Linear regression dialog, you may have
noticed that it says “Block 1 of 1” just above the box where the independent variables go.
Well, if you click “next” (Alt+N), you will be moved to a blank box called “Block 2 of 2”. You
can then add additional independent variables here as an additional block.
Just below the Block box is a tool called “Method” (Alt+M). While a description of the
options here is beyond the scope of this text, this tool provides different ways for variables in
each block to be entered or removed from the regression to develop the regression model
that is most optimal for predicting the dependent variable, retaining only those variables
that truly add to the predictive power of the ultimate regression equation. Here, we will
stick with the “Enter” Method, which does not draw on this type of modeling but instead
simply allows us to compare two (or more) regressions upon adding an additional block (or
blocks) of variables.
So, to illustrate this approach to regression analysis, we will retain the same set of variables for Block 1 that we used above: age, income, and the dummy variable for being female. And then we will add a Block 2 with EDUC (the highest year of schooling completed) and PRESTG10 (the respondent’s occupational prestige score).¹ Remember to be sure to check the collinearity diagnostics box under Statistics. Figure 5 shows how the regression dialog should be set up to run this analysis.
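In syntax, each block corresponds to its own /METHOD subcommand, and adding the CHANGE keyword to /STATISTICS produces the R2 change statistics discussed above. A sketch using this chapter’s variables:

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL CHANGE
      /DEPENDENT CARHR
      /METHOD=ENTER AGE REALINC FEMALE
      /METHOD=ENTER EDUC PRESTG10.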
The output for this type of analysis (relevant sections of the output appear as Tables 8, 9, and 10) does look more complex at first, as each table now has two tables stacked on top of one another. Note that the output will first, before the relevant tables, include a “Variables Entered/Removed” table that simply lists which variables are included in each block. This is more important for the more complex methods other than Enter, in which SPSS calculates the final model; here, we already know which variables we have included in each block.
Figure 5. Setting Up a Linear Regression With Blocks
Table 8. Model Summary
Model 1: R = .155a; R Square = .024; Adjusted R Square = .022; Std. Error of the Estimate = 8.201
Model 2: R = .200b; R Square = .040; Adjusted R Square = .037; Std. Error of the Estimate = 8.139
a. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars
b. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars, R’s occupational prestige score (2010), Highest year of school R completed
1. Occupational prestige is a score assigned to each occupation. The score has been determined by administering a prior survey in which respondents were asked to rank the prestige of various occupations; these rankings were consolidated into scores. Census occupational codes were used to assign scores of related occupations to those that had not been asked about in the original survey.
Table 9. ANOVAa
Model 1, Regression: Sum of Squares = 2486.883; df = 3; Mean Square = 828.961; F = 12.326; Sig. <.001b
Model 1, Residual: Sum of Squares = 101353.178; df = 1507; Mean Square = 67.255
Model 1, Total: Sum of Squares = 103840.061; df = 1510
Model 2, Regression: Sum of Squares = 4144.185; df = 5; Mean Square = 828.837; F = 12.512; Sig. <.001c
Model 2, Residual: Sum of Squares = 99695.876; df = 1505; Mean Square = 66.243
Model 2, Total: Sum of Squares = 103840.061; df = 1510
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
b. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars
c. Predictors: (Constant), Dummy Variable for Being Female, Age of respondent, R’s family income in 1986 dollars, R’s occupational prestige score (2010), Highest year of school R completed
Table 10. Coefficientsa
Model 1, (Constant): B = 10.642; Std. Error = .783; t = 13.586; Sig. <.001
Model 1, Age of respondent: B = -.055; Std. Error = .013; Beta = -.111; t = -4.344; Sig. <.001; Tolerance = 1.000; VIF = 1.00
Model 1, R’s family income in 1986 dollars: B = -1.559E-5; Std. Error = .000; Beta = -.077; t = -3.003; Sig. = .003; Tolerance = .986; VIF = 1.014
Model 1, Dummy Variable for Being Female: B = -1.473; Std. Error = .426; Beta = -.089; t = -3.456; Sig. <.001; Tolerance = .986; VIF = 1.015
Model 2, (Constant): B = 16.240; Std. Error = 1.411; t = 11.514; Sig. <.001
Model 2, Age of respondent: B = -.053; Std. Error = .013; Beta = -.107; t = -4.216; Sig. <.001; Tolerance = .995; VIF = 1.00
Model 2, R’s family income in 1986 dollars: B = -4.286E-6; Std. Error = .000; Beta = -.021; t = -.762; Sig. = .446; Tolerance = .827; VIF = 1.209
Model 2, Dummy Variable for Being Female: B = -1.576; Std. Error = .424; Beta = -.095; t = -3.722; Sig. <.001; Tolerance = .983; VIF = 1.017
Model 2, Highest year of school R completed: B = -.279; Std. Error = .092; Beta = -.092; t = -3.032; Sig. = .002; Tolerance = .691; VIF = 1.44
Model 2, R’s occupational prestige score (2010): B = -.040; Std. Error = .018; Beta = -.067; t = -2.225; Sig. = .026; Tolerance = .712; VIF = 1.40
a. Dependent Variable: How many hours in a typical week does r spend in a car or other motor vehicle, not counting public transit
You will notice, upon inspecting the results, that what appears under Model 1 (the rows with
the 1 at the left-hand side) is the same as what appeared in our earlier regression in this
chapter, the one where we added the dummy variable for being female. That is because, in
fact, Model 1 is the same regression as that prior regression. Therefore, here we only need
to interpret Model 2 and compare it to Model 1; if we had not previously run the regression
that is shown in Model 1, we would also need to interpret the regression in Model 1, not just
the regression in Model 2. But since we do not need to do that here, let’s jump right in to
interpreting Model 2.
We begin with collinearity diagnostics in the Coefficients table (here, Table 10). We can see that the Tolerance and VIF have moved further away from 1 than in our prior regressions.
However, the VIF is still well below 2 for all variables, while the Tolerance remains above
0.5. Inspecting the variables, we can assume the change in Tolerance and VIF may be due
to the fact that education and occupational prestige are strongly correlated. And in fact,
if we run a bivariate correlation of these two variables, we do find that the Pearson’s R is
0.504—indeed a strong correlation! But not quite so strong as to suggest that they are too
highly correlated for regression analysis.
Thus, we can move on to the ANOVA table (here, Table 9). The ANOVA table shows that the regression is significant at the p<0.001 level. So we can move on to the Model Summary table (here, Table 8). This table shows that the R is 0.200, still a moderate correlation, but a
stronger one than before. And indeed, the R2 is 0.040, telling us that all of our independent
variables together explain about 4% of the variance in hours spent in the car per week. If
we compare this R2 to the one for Model 1, we can see that, while the R2 remains relatively
small, the predictive power has definitely increased with the addition of educational attainment and occupational prestige to our analysis.
Next, we turn our attention back to the Coefficients table to determine the strength and
significance of each of our five variables. Income is no longer significant now that education and occupational prestige have been included in our analysis, suggesting that income
in the prior regressions was really acting as a kind of proxy for education and/or occupational prestige (it is correlated with both, though not as strongly as they are correlated with
one another). The other variables are all significant, age and being female at the p<0.001
level; education at the p<0.01 level; and occupational prestige is significant at the p<0.05
level. Age of respondent has a moderate negative (inverse) effect. Being female has a weak
negative association, as do education and occupational prestige. In this analysis, age has
the strongest effect, though the Betas for all the significant variables are pretty close in size
to one another.
The B column provides the actual numerical effect of each independent variable, as well as the numbers for our regression equation. For every one year of additional age, time spent in the car each week goes down by about 3.2 minutes. Since income is not significant, we might want to ignore it; in any case, the effect is quite tiny, with even a $10,000 increase in income being associated with only a decrease of about 2.6 minutes in time spent in the car. Being female is associated with a decrease of, on average, just over an hour and a half (94.56 minutes). A one year increase in educational attainment is associated with a decrease of just under 17 minutes a week in the car, while a one-point increase in occupational prestige score² is associated with a decline of 2.4 minutes spent in the car per week. Our regression equation is ŷ = 16.240*** − .053***(AGE) − .000004286(REALINC) − 1.576***(FEMALE) − .279**(EDUC) − .040*(PRESTG10), with income carrying no asterisk because it is not significant.
So, what have we learned from our regression analysis in this chapter? Adding more variables can result in a regression that better explains or predicts our dependent variable. And
controlling for an additional independent variable can sometimes make an independent
variable that looked like it had a relationship with our dependent variable become insignificant. Finally, remember that regression results are generalized average predictions, not
some kind of universal truth. Our results suggest that folks who want to spend less time in
the car might benefit from being older, being female, getting more education, and working in a high-prestige occupation. However, there are plenty of older females with graduate degrees working in high-prestige jobs who spend lots of time in the car–and there are
plenty of young men with little education who hold low-prestige jobs and spend no time in
the car at all.
Notes on Advanced Regression
Multivariate linear regression with dummy variables is the most advanced form of quantitative analysis covered in this text. However, there are a vast array of more advanced regression techniques for data analysts to use. All of these techniques are similar in some ways. All
involve an overall significance, an overall strength using Pearson’s r or a pseudo-R or R analog which is interpreted in somewhat similar ways, and a regression equation made up of
various coefficients (standardized and unstandardized) that can be interpreted as to their
significance, strength, and direction. However, they differ as to their details. While exploring
all of those details is beyond the scope of this book, a brief introduction to logistic regression will help illuminate some of these details in at least one type of more advanced regression.
2. In the 2021 General Social Survey dataset, occupational prestige score ranges from 16 to 80 with a median of
47.
Logistic regression is a technique used when dependent variables are binary. Instead of estimating a best-fit line, it estimates a best-fit logistic curve, an example of which is shown in Figure 6. This curve shows the odds that an outcome will be one versus the other of the two binary attributes of the variable in question. Thus, the coefficients that the regression analysis produces are themselves odds, which can be a bit trickier to interpret. Because of the different math for a logistic rather than a linear equation, logistic regression uses pseudo-R measures rather than Pearson’s r. But logistic regression can tell us, just like linear regression can, about the significance, strength, and direction of the relationships we are interested in. And it lets us do this for binary dependent variables.
Figure 6. A Plot of a Logistic Function
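Though interpreting logistic output is beyond the scope of this text, it may help to see how little the setup itself differs. A purely illustrative sketch—imagining that we had recoded the binary variable WRKSLF into a hypothetical dummy called SELFEMP (1 = self-employed, 0 = works for someone else), as in the dummy-variable discussion above, to use as a dependent variable—might be:

    LOGISTIC REGRESSION VARIABLES SELFEMP
      /METHOD=ENTER AGE EDUC.

SELFEMP here is hypothetical, not an analysis run in this chapter; the coefficients such a regression reports are in log-odds, with the Exp(B) column providing the more readily interpretable odds ratios.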
Besides using different regression models, more advanced regression can also include
interaction terms. Interaction terms are variables constructed by combining the effects of
two (or more) variables so as to make it possible to see the combined effect of these variables together rather than looking at their effects one by one. For example, imagine you
were doing an analysis of compensation paid to Hollywood stars and were interested in factors like age, gender, and number of prior star billings. Each of these variables undoubtedly
has an impact on compensation. But many media commentators suggest that the effect
of age is different for men than for women, with starring roles for women concentrated
among the younger set. Thus, an interaction term that combined the effects of gender and
age would make it more possible to uncover this type of situation.
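Constructing an interaction term is mechanically simple; interpreting it is the harder part. A sketch using this chapter’s variables (FEMAGE is a hypothetical name for a term we have not actually analyzed here):

    COMPUTE FEMAGE=FEMALE*AGE.
    EXECUTE.

The new FEMAGE variable would then be entered into the regression alongside FEMALE and AGE themselves, and its coefficient would indicate how the effect of age on the dependent variable differs for women as compared to men.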
There are many excellent texts, online resources, and courses on advanced regression. If
you are thinking about continuing your education as a data analyst or pursuing a career in
which data analysis skills are valuable, learning more about the various regression analysis
techniques out there is a good way to start. But even if you do not learn more, the skills
you have already developed will permit you to produce basic analyses–as well as to understand the more complex analyses presented in the research and professional literature in
your academic field and your profession. For even more complex regressions still rely on
the basic building blocks of significance, direction, and strength/effect size.
Exercises
1. Choose three continuous variables. Produce a scatterplot matrix and describe what you see. Are there any reasons to suspect that your variables might not be appropriate for linear regression analysis? Are any of them overly correlated with one another?
2. Produce a multivariate linear regression using two of your continuous variables as independent variables and one as a dependent variable. Be sure to produce collinearity diagnostics. Answer the following questions:
◦ Are there any collinearity problems with your regression? How do you know?
◦ What is the significance of the entire regression?
◦ What is the strength of the entire regression?
◦ How much of the variance in your dependent variable is explained by the two independent variables combined?
◦ For each independent variable:
▪ What is the significance of that variable’s relationship with the dependent variable?
▪ What is the strength of that variable’s relationship with the dependent variable?
▪ What is the direction of that variable’s relationship with the dependent variable?
▪ What is the actual numerical effect that an increase of one in that variable would have on the dependent variable?
◦ Which independent variable has the strongest relationship with the dependent variable?
3. Produce the regression equation for the regression you ran in response to Question 2.
4. Choose a discrete variable of interest that may be related to the same dependent variable you used for Question 2. Create one or more dummy variables from this variable (if it has only two categories, you can create only one dummy variable; if it has more than two categories, you may be able to create more than one dummy variable, but be sure you have left out at least one largeish category which will be the excluded category with no corresponding dummy variable). Using the Recode into Different function, create your dummy variable or variables. Run descriptive statistics on your new dummy variable or variables and explain what they show.
5. Run a regression with the two continuous variables from Question 2, the two dummy variables from Question 4, and one additional dummy or continuous variable as your independent variables and the same dependent variable as in Question 2.
6. Be sure to produce collinearity diagnostics. Answer the following questions:
◦ Are there any collinearity problems with your regression? How do you know?
◦ What is the significance of the entire regression?
◦ What is the strength of the entire regression?
◦ How much of the variance in your dependent variable is explained by the independent variables combined?
◦ For each independent variable:³
▪ What is the significance of that variable’s relationship with the dependent variable?
▪ What is the strength of that variable’s relationship with the dependent variable?
▪ What is the direction of that variable’s relationship with the dependent variable?
▪ What is the actual numerical effect that an increase of one in that variable would have on the dependent variable?
◦ Which independent variable has the strongest relationship with the dependent variable?
7. Produce the regression equation for the regression that you ran in response to Question 6.
8. Compare the R2 for the regression you ran in response to Question 2 and the regression you ran in response to Question 6. Which one explains more of the variance in your dependent variable? How much more? Is the difference large enough to conclude that adding more additional variables helped explain more?
3. Be sure to pay attention to the difference between dummy variables and continuous variables in interpreting your results.
Media Attributions
• collinearity diagnostics menu © IBM SPSS is licensed under a All Rights Reserved
license
• multivariate reg 1 © IBM SPSS is licensed under a All Rights Reserved license
• recode sex dummy © IBM SPSS is licensed under a All Rights Reserved license
• multivariate reg 2 © IBM SPSS is licensed under a All Rights Reserved license
• multivariate reg 3 © IBM SPSS is licensed under a All Rights Reserved license
• mplwp_logistic function © Geek3 is licensed under a CC BY (Attribution) license
SECTION V
QUALITATIVE AND MIXED METHODS DATA ANALYSIS WITH DEDOOSE
While researchers can and do analyze qualitative data by hand or with the use of basic
computer software like word processing programs and spreadsheets, most qualitative projects involving a moderate to large volume of data today do rely on qualitative data analysis software packages. There are a variety of such packages, all with different strengths
and limitations, and those who intend to perform qualitative data analysis regularly as
part of their research or professional responsibilities should explore the options to find
out which program is the best fit for their research style and priorities. This text features
Dedoose. To get started with Dedoose, visit https://dedoose.com/ — the website has helpful
guides, an explanation of pricing, and other resources. Users can sign up for an account
at https://dedoose.com/signup (there are instructions about a student discount there as
well) and can download the software at https://www.dedoose.com/resources/articledetail/dedoose-desktop-app for Windows, Mac, Chromebook, or Linux. Note there is both a regular installation and a “portable” option for Windows users who do not have administrative
privileges. Unfortunately, Dedoose is not screenreader compliant at the time of this writing
and has some limitations in terms of screen zoom applications.
The chapters here on how to use Dedoose include screenshots from a dataset created by
students in the Sociology 404 class in Spring 2020, just before the onslaught of COVID. Students were asked to write a paragraph about their first day on campus here at Rhode Island
College and answer a few questions about their graduation year, living arrangements, gender, major, and what they had been doing before coming to our campus. Students then
did the work, collectively, of developing a code tree and coding the excerpts. Care has been
taken to keep all of their participation, both their responses to the writing prompt and
their work on the project, confidential, but let me take this moment to express my deepest
appreciation for their excellent work and their contributions, which also helped inspire me
to write this book.
24. Qualitative Data Analysis with Dedoose: Data Management
MIKAILA MARIEL LEMONIK ARTHUR
While researchers generally refer to the software they use to facilitate qualitative research
as qualitative data analysis software, such software programs also play a very important role
in data management and data reduction. Indeed, without employing the data management capabilities of qualitative data analysis software, the software itself is unlikely to be
functional. Thus, before we start analyzing our data using tools like Dedoose, we need to
feed the data we have collected into the tool of our choice and take various steps to set it
up to be usable. This chapter will provide an overview of how to get started with a project in
Dedoose and add data to the project, as well as how to view and manipulate the data. This
chapter assumes you have already created a user account with Dedoose and downloaded
and installed the software, and that you can successfully log into your account in the program. Note that if you run into technical difficulties while using Dedoose, their support can
be reached at support@dedoose.com.
Getting Started With a New Project
The first step in getting started with Dedoose is to either create a new project or be added
to an existing one. This chapter will assume you are starting with a new project; if you are
working with an existing project, someone who has administrative privileges on the existing project will need to add you to it. “Projects,” in Dedoose’s terminology, are workspaces
that store complete collections of data from a particular research study.
To create a new project, first click on “projects” on the menu bar in the top right corner of
the screen, as shown in Figure 1.
This will bring up a window that lists all of the projects your account has access to. If you are a new Dedoose user, you will likely have far fewer projects in your account than the examples here will show; as you develop additional projects, they will be added to your list. Next, click on the “Create Project” button at the bottom of the screen.
Figure 1. Finding the Projects Button on the Dedoose Home Screen
Figure 2. Creating a New Project in Dedoose
Figure 3. Popup Window for Creating a New Project
The final step in creating a project is to set up that project, using the “create project”
popup window shown in Figure 3, by giving it a title (this should be short but descriptive,
so you can remember which project is which) and a brief description of the project. If you
wish, you can assign the project an encrypted password, but note that this makes it impossible for Dedoose to help you recover the project. Once you have set up the project, click
submit.
Once you have created your project, you may need to load it by first selecting it from the
list of projects in the project window, then clicking the “load” button next to the project
name, as shown in Figure 4. You will see that the information you provided in setting up
your project now appears in the list of projects. The project screen also provides buttons for
deleting, renaming, and copying projects.
If you would like to add another user to your project, like your professor, a classmate, or a research collaborator, click on “security” in the top menu. Of course, if you are planning on working alone, you do not need to add anyone. The security center, shown in Figure 6, is not always the most intuitive feature of Dedoose, but luckily you do not need to use it too often.
Figure 4. Loading a Project
Figure 5. Security Icon
The first step in the Security Center is to use the Add Group button to add a new group, or the Edit Group button to edit an existing group. A “group,” in this context, refers to a set of users who have similar security privileges—for instance, you might have a project director with full privileges to do anything with the project, two lead researchers who can use most tools in the project but cannot add other users, five research assistants who can code and use analysis tools but cannot delete anything, and an intern who can view the project but not edit it. Each of these types of people—project director, lead researcher, research assistant, and intern—would be a group, and the security center enables you to set specific security privileges for them.
Figure 6. Dedoose Security Center
When you click on Add Group or Edit Group, a popup window opens, as shown in Figure 7, with a list of the various types of security privileges that are available. These vary in terms of the extent to which users in those groups can use various features, ranging from “Full Access” for our project manager to “Guest Access” for our intern. It may take a little while to explore the options and select the right one for your project. When you have done so, select that option and click the green submit button.
Figure 7. Security Privileges
To add users to the user group you have now defined, use the “add user” button (next to the “add group” button highlighted in Figure 6). First, you will need to indicate which of the security groups you have defined this user should be added to. The system will then ask you for the user’s email address, and depending on whether the user is already an active Dedoose user or not will ask you whether you want to invite them to Dedoose, add them to your account, or just add them to the project.
Once you have finished working with the security center—or, if this is an individual project and thus you did not need to use the security center—your project should be ready for you to start working.
Figure 8. Adding a User
Working With Data
The first step in the qualitative data analysis project, as discussed in the chapter on Preparing & Managing Qualitative Data, is to prepare your data for analysis. In Dedoose, a key part of this process is adding your data to the application. To do this, click on the plus sign in a circle at the top of the “Media” box on the home screen, as shown in Figure 9.
This will open a pop-up window with a variety of options for importing media, as shown in Figure 10, depending on how your data is stored. On the most basic level, you can create a blank document and type or copy-paste content into Dedoose, but that becomes cumbersome with more than a small volume of data. If you have stored fieldnotes in document (*.doc, *.docx, *.rtf, *.txt, *.htm, or *.html) or in PDF format, you can choose the “Import Text” or “Import PDFs” options, respectively. These options will allow you to select one or more files and have them all imported at once. Using this option is especially handy when you have a separate, single file for each interview, respondent, or case, as Dedoose will then store each file as a separate instance of data in ways that facilitate analysis. It is also possible to import image files (*.jpg, *.png, *.bmp, or *.gif), which is a helpful option for visual sociology. Dedoose can handle audio and video files but at additional cost, and discussing these features is beyond the scope of this text. Finally, you can import a spreadsheet. However, this method of importation requires careful spreadsheet construction and formatting. The Dedoose User Guide (scroll down to “Survey Importer”) provides the necessary details about how to format your file.
Figure 9. Opening the Dialogue for Adding Media
Figure 10. Methods of Importing Data into Dedoose
Once your data is imported, you can view and manipulate it using the Media tools by clicking on the Media icon at the top of the Dedoose screen. The media icon takes users to a window that displays a list of all documents, graphics, or other texts that are part of a project, with their title, the user that imported them, the date they were imported, their length, and several other features that will be discussed later on in this text (including descriptors, memos, and excerpts), as shown in Figure 12.
Figure 11. Media Icon
Figure 12. Dedoose Media Tool
If you click on any media item, Dedoose opens that particular media item, as shown in Figure 13. By clicking on the gearshift highlighted in Figure 13, you can edit the title or description of a media item. And by clicking on the lock icon highlighted in Figure 13, you can edit the text in the media item itself. Other features of the media window will be discussed later, after those tools are introduced.
Figure 13. Media Editing Window
There are two additional important features that are part of setting up your data in Dedoose. The first is called “Descriptors.” Descriptors are category labels that apply to an entire text or piece of media or to its author or creator—they are not codes, which are applied to specific segments of text. For instance, if we were keeping track of the gender, age, or race of the author of a narrative or the participant in an interview, that would be a descriptor. Similarly, if our study involved collecting advertisements, the magazine, website, or television program on which the advertisement appeared might be a descriptor. Descriptors are created and edited in Dedoose by using the Descriptors tool in the toolbar at the top of the program.
Figure 14. Descriptors Icon
To create descriptors, load the descriptors tool using the icon shown in Figure 14, and then click on the plus sign in the circle highlighted in Figure 15 next to the number 1, in the Descriptor Sets section of the screen. Give your descriptor set a name (the name does not matter, it is just part of the data storage system) and click submit. Then, click on the plus sign in the circle highlighted in Figure 15 next to the number 2. This will bring up a window that permits you to develop a set of descriptors.
Figure 15. The Descriptors Window
For each descriptor you wish to add, as shown in Figure 16, you can provide a name (something like "gender" or "age" or "magazine appeared in") and a longer description explaining the descriptor so you can remember what you did. Then you can choose a field type from four options: an option list, which is basically the same as a multiple-choice question; a number, which permits the free entry of any numerical value; a date; or a text field, in which any short text can be entered. There is an option for dynamic fields, which are those where data might change over time, as in a longitudinal study, but we will leave those aside for the purposes of this discussion. If you select option list for field type, you can then use "Add Options" under Field Options to add each of your multiple-choice options to the list. When you are done, click submit. For example, Figure 16 shows an option list descriptor called "Housing," used in a study of student life, in which respondents indicated whether they lived with their family, lived in on-campus housing, or lived off-campus but not with their family. You can use the X icon to delete an option from the list, the icon that looks like a piece of paper to edit the option, and the ∧ and ∨ icons to reorder your options.
Figure 16. Adding or Editing Descriptor Fields
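The four field types correspond to familiar data types and validation rules. The sketch below illustrates the general idea with hypothetical field definitions; it is not Dedoose-specific:

```python
from datetime import date

# Hypothetical descriptor fields mirroring the four field types.
fields = {
    "Housing": {"type": "option",
                "options": ["with family", "on-campus", "off-campus"]},
    "Age": {"type": "number"},
    "InterviewDate": {"type": "date"},
    "Notes": {"type": "text"},
}

def validate(field_name, value):
    """Check a value against its field's type, as an option list,
    number, date, or free text field would."""
    spec = fields[field_name]
    if spec["type"] == "option":
        return value in spec["options"]        # must match a listed choice
    if spec["type"] == "number":
        return isinstance(value, (int, float)) # free numeric entry
    if spec["type"] == "date":
        return isinstance(value, date)
    return isinstance(value, str)              # short text

print(validate("Housing", "on-campus"))  # True
print(validate("Age", "forty-two"))      # False
```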
The next step in using descriptors is to link your descriptors to your media. There are several ways to do this. You can go to the Media window and click the blue box under the
Descriptors heading, or you can use the tiny (and hard to see) descriptor icon on an individual media item. Both options are shown in Figure 17.
Clicking on either option will bring up a
popup window called Descriptor Links.
Then click “Create and Link Descriptor,” as
shown in Figure 18.
Figure 17. Adding Descriptors to Media
Figure 18. Linking Descriptors to Media
This will bring up a pop-up window in which you can select from drop-down menus (for option lists) or enter values (for other types of descriptor fields), as shown in Figure 19, to apply descriptors to the selected media item. You may have far more descriptors than are shown here; however many there are, select the appropriate options for each one, and then click submit.
Figure 19. Editing Descriptors for a Particular Media Item
You will need to do this individually for each media item in your project. Once you have done this, if you return to the Media tool, you will be able to preview all of your media items and their linked descriptors. In addition, if you return to the Descriptors tool, you will see all of your descriptors, but without the associated media.
The final tool you should know about as you get started with Dedoose is the memo tool.
To create a memo, first open a specific media item, then click on the memo icon, as shown
in Figure 20.
Figure 20. Opening the Memo Tool
This will open a pop-up window, as shown in Figure 21, in which you can create a memo. You should enter a title, and then you can type or copy-paste your text where the screen says "What are you thinking?" Using the memo groups box at the top, you can also create a memo group or add a memo to an existing group, if you have many memos and want to classify or categorize them. Once you begin typing, a "Save" button will appear at the bottom of the memo screen—be sure to save when you are done.
Figure 21. Adding a Memo
Once you have created one or more memos, you can use the memos icon, as shown in Figure 22, to load a window that shows all of your memos and allows you to work with them.
Figure 22. Memo Icon
Backing Up Your Data & Managing Dedoose
As Dedoose uses cloud-based storage to keep your data, you do not need to save—all data
is automatically saved. However, you may wish to download your data, either to back it up,
to keep a local copy, or to import it into a different software package in the future. You can
use the Export button on the Home screen, as shown in Figure 23, to export all or a portion
of a project.
Figure 23. The Export Button
The export button brings up a pop-up window, as shown in Figure 24, with a variety of options. You can export just the codes, descriptors, media information with linked descriptors, or excerpts to a spreadsheet by selecting the option of your choice. Excerpts, as well as memos, can be exported to a document. Keep in mind that none of these options involves exporting the full volume of original media—Dedoose strongly encourages you to keep your original media, in the form it was in prior to uploading to Dedoose, intact and backed up outside of Dedoose. If you have the media plus these exports, you can load data into other programs or applications. There is also an option to export the entire project, but this type of file is hard to use outside of Dedoose itself, so it best serves as a backup of your work.
Figure 24. Export Options Window
If Dedoose gets a little slow or nonresponsive, you may wish to log out of the application, close it, and then reopen and log back in. There is also a refresh icon¹ at the top of the screen that can be helpful if you are working on a project with another researcher and want to be sure their changes have loaded into your view.
Exercises
1. Create a project in Dedoose. If you are doing this work as part of a class, give your instructor access to the project.
2. Download five oral histories from the COVID-19 Archive (you can do this at https://covid-19archive.org/s/oralhistory/item). Import them into your project.
3. Create a descriptor set with at least three descriptors you find relevant to the oral histories you selected. Link the descriptor set to your oral history media, being sure to correctly select any options.
4. Read one of the oral histories you downloaded. Write a memo of at least 250 words summarizing the most important insights in that oral history, and add your memo to that media item in Dedoose.
1. This icon might not display correctly in some versions of this text. Click here to see what it looks like.
Media Attributions
• Dedoose Home Screen with Projects Button © Dedoose is licensed under an All Rights Reserved license
• Creating a New Project © Dedoose is licensed under an All Rights Reserved license
• Creating a Project Popup Window © Dedoose is licensed under an All Rights Reserved license
• Loading a Project © Dedoose is licensed under an All Rights Reserved license
• The Security Icon © Dedoose is licensed under an All Rights Reserved license
• The Security Center © Dedoose is licensed under an All Rights Reserved license
• Security Privileges Popup © Dedoose is licensed under an All Rights Reserved license
• Adding User to Project © Dedoose is licensed under an All Rights Reserved license
• Adding Media © Dedoose is licensed under an All Rights Reserved license
• Importing Data Popup Window © Dedoose is licensed under an All Rights Reserved license
• The Media Icon © Dedoose is licensed under an All Rights Reserved license
• The Dedoose Media Tool © Dedoose is licensed under an All Rights Reserved license
• Editing Media © Dedoose is licensed under an All Rights Reserved license
• The Descriptors Icon © Dedoose is licensed under an All Rights Reserved license
• The Descriptors Window © Dedoose is licensed under an All Rights Reserved license
• Editing Descriptor Fields © Dedoose is licensed under an All Rights Reserved license
• Adding Descriptors to Media © Dedoose is licensed under an All Rights Reserved license
• Descriptor Links © Dedoose is licensed under an All Rights Reserved license
• Editing Descriptors © Dedoose
• Opening the Memo Tool © Dedoose is licensed under an All Rights Reserved license
• Adding a Memo © Dedoose is licensed under an All Rights Reserved license
• The Memo Icon © Dedoose is licensed under an All Rights Reserved license
• The Export Button © Dedoose is licensed under an All Rights Reserved license
• Export Options Window © Dedoose is licensed under an All Rights Reserved license
25. Qualitative Data Analysis with Dedoose: Coding
MIKAILA MARIEL LEMONIK ARTHUR
The process of coding in Dedoose begins with the creation of a code tree. Once the code
tree is created, then codes can be applied to segments of text or other media. This chapter
will step users through the process of creating a code tree and applying codes, as well as
other issues in working with codes and coding in Dedoose.
The Code Tree
In order to begin coding, analysts must first develop a code tree, as discussed in the chapter
on Qualitative Coding. Once the code tree has been developed, it can be added to Dedoose
so that it is ready to use. Codes can be added using the Codes section of the Dedoose home
page or by selecting “Codes” from the menu at the top of the screen, and then clicking the
⊕ (plus sign in a circle) symbol.
Figure 1. The Code Screen in Dedoose
Clicking the ⊕ icon brings up a pop-up window into which users can enter a code. The window asks for a title, which is the term used to identify that code in the list of codes. This title should be clear and brief. For example, rather than writing "An instance of the use of stereotyping by the respondent" you would write "stereotyping."
The description box provides space to enter a longer explanation of the code and the circumstances in which it should be used, which is especially useful when coding with multiple coders or over a longer period of time so everyone can keep track of the meanings behind codes. Users can also change the color of the code (note that using this feature also requires changing a setting, which will be discussed below). Finally, if code weighting is part of the project, users can enable it and specify minimum, maximum, and default weights.
Figure 2. New Code Window
Weighting is used when analysts wish to not only code the text, but also indicate the degree to which a code applies. For example, if a study involved codes representing a variety of emotions—happy, sad, angry, excited, satisfied, etc.—weighting could be used to distinguish pleased from ecstatic and gloomy from devastated by applying a 1 to a milder emotion, a 2 to a moderate emotion, and a 3 to a more intense emotion. While the rest of this text will proceed without involving code weighting, you may wish to explore its use in projects you are completing.
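To make the idea concrete, here is a minimal sketch of how weighted code applications might be recorded and summarized; the structure and values are invented for illustration and are not Dedoose's internal format:

```python
# Each application of the "sad" code carries a weight from 1 (mild) to 3 (intense).
weighted_applications = [
    {"excerpt": "I was a bit down that week", "code": "sad", "weight": 1},
    {"excerpt": "Losing the scholarship devastated me", "code": "sad", "weight": 3},
]

# Average intensity of the "sad" code across the project.
weights = [a["weight"] for a in weighted_applications]
print(sum(weights) / len(weights))  # 2.0
```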
In many cases, analysts develop code trees that have multiple levels of nested codes. For instance, in the example above of emotion codes, the code tree might look something like this:
Emotions
    Happy
        Pleased
        Joyous
        Ecstatic
    Sad
        Gloomy
        Dejected
        Devastated
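One way to picture such a tree is as nested parent/child relationships. The following is a small illustrative sketch in Python, not Dedoose's internal representation:

```python
# A code tree as a nested dictionary: each code maps to its child codes.
code_tree = {
    "Emotions": {
        "Happy": {"Pleased": {}, "Joyous": {}, "Ecstatic": {}},
        "Sad": {"Gloomy": {}, "Dejected": {}, "Devastated": {}},
    }
}

def print_tree(tree, depth=0):
    """Print each code indented according to its level in the tree."""
    for code, children in tree.items():
        print("    " * depth + code)
        print_tree(children, depth + 1)

print_tree(code_tree)
```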
To set the code tree up in this way, Dedoose uses the language of "parent codes" (those at the top or first level of the tree, like emotions in this example) and "child codes" (those at lower levels of the tree). First, analysts need to enter a parent code into the coding system, as shown above. Then, they can use the ⊕ symbol next to that code to add a child code, using just the same dialog box shown in Figure 2 above. Figure 3 also shows an icon of a gear shift, next to the ⊕ symbol—this gear shift can be used to re-open the dialog box for editing the code.
Figure 3. Adding a Child Code
Once codes are added, the code tree will look something like the one shown in Figure 4. The little triangles can be used to open and close parent codes, making child codes visible or hiding them. Note that there can be multiple levels of codes, so an analyst could add additional child codes under, say, anxiety or confusion.
Figure 4. A Complete Code Tree in Dedoose
The magnifying glass at the top of the codes window can be used to search all of the codes used in a project, which can be helpful when the code tree gets very lengthy. There are also two other icons at the top of the code tree, one that looks like little slider bars and one that looks like an exclamation point in a circle. The sliders allow the analyst to set options for how coding will proceed:
• Automatic upcoding: When automatic upcoding is turned on, any time that a child code is used while coding, the parent code will also be applied to the same segment of text (a sketch of this logic appears after the next list).
• Sort alphabetically: Just as it sounds, this option reorders codes in alphabetical order, which can make it easier to find them in a lengthy code tree.
• Code counts: The code counts option displays the number of times each code has been used in a project right in the code tree next to the code itself. There are two ways to implement code counts. In the first, "explicit code count," just instances in which the code itself has been used are counted, while in the second, "child sum code count," the sum of all uses of all child codes is displayed next to each parent code (see the sketch following this list).
• Color scheme: Changing the color scheme to "custom" allows the colors designated in the process of adding codes to be used in the display.
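To make the two code count modes concrete, here is a minimal sketch of the arithmetic; the counts and tree are invented for illustration:

```python
# Number of times each code was applied directly (explicit counts).
explicit = {"Emotions": 2, "Happy": 5, "Sad": 3, "Pleased": 4, "Gloomy": 6}

# Child lookup derived from the code tree.
children = {"Emotions": ["Happy", "Sad"], "Happy": ["Pleased"], "Sad": ["Gloomy"]}

def child_sum(code):
    """Explicit count for a code plus the counts of all its descendants."""
    total = explicit.get(code, 0)
    for child in children.get(code, []):
        total += child_sum(child)
    return total

print(explicit["Emotions"])   # 2  -- explicit code count
print(child_sum("Emotions"))  # 20 -- child sum code count (2+5+3+4+6)
```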
The exclamation point icon provides a number of useful tools:
• Collapse/Expand: This tool is the equivalent of going through and clicking all of the little black triangles one at a time—when clicked, it toggles the code tree between having all parent codes closed, such that child codes are hidden, and having all parent codes open, such that all child codes are visible.
• Retroactive upcode: This tool is used when, having not turned on "automatic upcoding" (as discussed above) at the beginning of a coding process, the analyst decides later that they would like the parent code applied to all instances where a child code is used.
• Reorder codes: This tool allows the analyst to reorganize the code tree into a different order.
• Import codes: This tool permits the importation of codes from a Microsoft Excel or comma-separated file.
• Export codes: This tool permits the analyst to export codes to Microsoft Excel or Microsoft Word.
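As a sketch of the upcoding logic referenced in both lists above, applying a child code also applies all of its ancestor codes. This is illustrative Python over invented structures, not Dedoose's actual implementation:

```python
# Parent lookup derived from the code tree: child -> parent.
parents = {
    "Happy": "Emotions", "Sad": "Emotions",
    "Pleased": "Happy", "Gloomy": "Sad",
}

def upcode(codes):
    """Return the given codes plus all of their ancestors."""
    result = set(codes)
    for code in codes:
        while code in parents:
            code = parents[code]
            result.add(code)
    return result

# Automatic upcoding at application time:
print(upcode({"Pleased"}))  # the set Pleased, Happy, Emotions

# Retroactive upcoding: apply the same expansion to already-coded excerpts.
excerpts = [{"span": (0, 40), "codes": {"Gloomy"}}]
for ex in excerpts:
    ex["codes"] = upcode(ex["codes"])
print(excerpts[0]["codes"])  # the set Gloomy, Sad, Emotions
```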
Once all settings and options have been set to the analysts’ preference and the code tree
has been added, it is time to start coding. Note that it is possible to change settings and add
codes during the coding process. However, it is very important that, if a new code is added
during the coding process, the analyst goes back to all texts that have already been coded
and re-codes them. Otherwise, that new code will be used for only part of the dataset,
which will introduce errors into the data analysis process.
Coding in Dedoose
In order to apply codes to texts, the first step is to create an excerpt. To create an excerpt, load a media item, highlight a segment of text to which one or more codes should be applied, and click the quotation mark in the corner of the document screen, as shown in Figure 5. If you have made a mistake in your selection, you can click the X next to the quotation mark in the "Selection Info" box to delete it.
Once you have created an excerpt, you can then apply codes by either dragging each individual code from the "codes" box to the "selection info" box or by double-clicking on the code in the "codes" box. If you want to remove a particular code you have added to an excerpt, just click the X next to that code in the "selection info" box. When you are done applying codes to a given excerpt, click the X next to "selection info" to exit the editing mode and move on to create your next excerpt. If you want to re-open a particular excerpt, you can click on the black bracket next to the excerpt, and this will permit you to add additional codes or delete the excerpt. When you are done with a given text, you can use the < and > icons at the bottom of the screen to move on to the next text.
Figure 5. Creating an Excerpt
Figure 6 provides an example of what it might look like after a complete (short) text is coded. You can see how each excerpt appears highlighted in color, with a black bracket in the margin. One excerpt is currently selected, and the "selection info" box shows the codes that the coder applied to that excerpt. Do note the typical length of the excerpts—when selecting an excerpt, analysts should strive to select a unit of text that represents a complete idea or utterance, whether that complete idea/utterance is just a few words or whether it is a paragraph or more in length.
Figure 6. A Coded Text
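Conceptually, an excerpt is a span of characters in a media item plus the set of codes applied to it. The sketch below mirrors the create, apply, and remove steps described above, using an invented structure rather than Dedoose's internal format:

```python
excerpts = []

def create_excerpt(media, start, end):
    """Model highlighting a segment and clicking the quotation mark."""
    excerpt = {"media": media, "span": (start, end), "codes": set()}
    excerpts.append(excerpt)
    return excerpt

ex = create_excerpt("interview_01.docx", 120, 245)
ex["codes"].add("anxiety")      # drag or double-click a code to apply it
ex["codes"].add("classes")
ex["codes"].discard("classes")  # the X next to a code removes it
print(ex)
```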
Working with Codes
While the process of coding primarily involves moving through each text, creating excerpts, and applying relevant codes (and sometimes repeating this process as the code tree evolves or additional texts are added to the dataset), Dedoose does offer some tools that can further enhance the coding process.
In the chapter on Data Management with Dedoose, you learned about linking memos to texts. But what if the memo you write is less connected to a given text and instead is generated by observations made while applying a particular code? In that case, you may wish to link a memo to a code. The icon that looks a bit like a scroll of paper next to the code, as shown in Figure 7—the one after the quotation mark icon—allows analysts to link memos to individual codes.
Figure 7. Code Tools
The quotation mark icon brings up a window, as shown in Figure 8, that includes all of the excerpts to which a given code has been applied. The "view text excerpts in full" button shows the complete text of each excerpt and all codes that have been applied to it. You can also export all of the excerpts. If you double-click on a specific excerpt, you can copy and paste the text of that excerpt, which is useful when you need to include a quote in the paper or talk you are preparing. After double-clicking, there is also a "view in context" button, which loads the text in question in the background such that after you close the various pop-up windows you will be able to view it.
Figure 8. Excerpt Viewer Tool
Similar information is available by clicking on the "Excerpts" tool at the top of the screen, as shown in Figure 9. The excerpts tool brings up a list of all the excerpts in a given project and shows, for each one, which text it is part of, when it was created, who created it, how long it is, how many codes were applied to it, which codes, and—if applicable—any memos or descriptors. The list of excerpts can be sorted or filtered by any of these columns, and double-clicking on any row will bring up the specific excerpt in a pop-up window.
Figure 9. The Excerpts Window in Dedoose
A final tool worth noting for those who are coding in teams is the Training Center. While it is beyond the scope of this text to detail the workings of the Training Center, it is designed to help coding teams enhance their interrater reliability. In short, team leads can select a variety of excerpts and codes as part of a coding test that all members of a coding team then take. After coders complete the test, they and their team leads can see an overall interrater reliability score comparing their coding to the codes applied by the initial coder to those excerpts selected for the test, and they can also delve deeper by looking at agreement and disagreement rates and Kappa scores for each individual code. Reports can be exported, and team members can also view specific code applications to see how their coding compares to the initial coding seeding the test.
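The Kappa scores reported by the Training Center are a chance-corrected measure of agreement (Cohen's kappa). For a single code treated as present or absent on each test excerpt, the statistic can be computed as follows; this is a generic sketch of the formula, not Dedoose's implementation:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' binary judgments (1 = code applied)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal rates.
    p_a, p_b = sum(rater_a) / n, sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Did each of 8 test excerpts receive the "anxiety" code from each coder?
lead_coder = [1, 1, 0, 0, 1, 0, 1, 0]
new_coder  = [1, 1, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(lead_coder, new_coder), 2))  # 0.5
```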
A final note: the analysis tools in Dedoose do rely on completed coding, so finish coding
all your texts before delving into analysis.
Exercises
1. Return to the five oral history transcripts you selected in the exercises for the chapter Qualitative Data Analysis with Dedoose: Data Management. Read through the transcripts and develop a code tree including at least five parent codes and additional child codes as relevant. Enter the code tree into Dedoose.
2. Choose one of your transcripts and code that transcript completely.
3. Write a memo focusing on what the transcript you coded tells you in relation to one or two of the codes you selected.
Media Attributions
• code screen © Dedoose is licensed under an All Rights Reserved license
• new code window © Dedoose is licensed under an All Rights Reserved license
• add child code © Dedoose is licensed under an All Rights Reserved license
• code tree © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• create excerpt © Mikaila Mariel Lemonik Arthur is licensed under an All Rights Reserved license
• example coding © Mikaila Mariel Lemonik Arthur is licensed under an All Rights Reserved license
• code tools © Dedoose is licensed under an All Rights Reserved license
• view excerpts © Dedoose is licensed under an All Rights Reserved license
• excerpts window © Dedoose is licensed under an All Rights Reserved license
26. Qualitative Data Analysis with Dedoose: Developing Findings
MIKAILA MARIEL LEMONIK ARTHUR
When working with very small datasets, analysts may find it sufficient to use Dedoose to code texts and then use the coding tools to just explore each code. But Dedoose offers a variety of tools for digging more deeply into the data, exploring relationships, and moving towards findings. In this chapter, we will review each of the analysis tools, showing what each tool can provide and how to work with the analysis tools in developing findings.
Using the Analysis Tools
To access the analysis tools in Dedoose, click on the “Analyze” button in the toolbar at the
top of the screen, as shown in Figure 1. This will bring up the Analyze window. In this window, the sidebar contains a list of all of the analysis tools that Dedoose provides, categorized
by type. The main window displays the results after a particular analysis tool is selected.
Figure 1. The Analyze Window in Dedoose
In the top corner of the screen, you will observe a variety of icons for interacting with the
results of the selected analysis tool. Note that not all of these icons are available for all
tools—for instance, the icons that let users switch between bar graph and pie chart views
are only available for tools where the results are displayed in bar graphs and pie charts. Next
to the pie chart you will see a tool with an up arrow; this tool is used to export results. Most
results are exported in Microsoft Excel format (which can be easily viewed in Google Sheets
and other spreadsheet programs and includes both data and visuals), though in one or two
cases a PDF file is produced. Sometimes there are options users can select when determining how to export their results. After that, the icon with four arrows is used to enter fullscreen view, and the icon with the question mark is used to access very brief information
about the currently-selected tool. Many tools will also provide specific options for formatting or data processing, and these will be explained along with the explanation of each tool.
Before getting into the specific tools, it is also important to note that tools provide direct
access to relevant excerpts. While tools are loaded, you can typically put your cursor over
any data element to bring up a popup with more detailed information. Then, if you click
on the data element, you will bring up a window that provides all excerpts that meet the
given criteria specified by the data element you have clicked on. For instance, if an analyst
clicked on one of the bars shown in the bar graph in Figure 1 (which represent texts), they
would be taken to a window showing all of the excerpts in the text they clicked on. Clicking
on an excerpt then brings up more detail about that specific excerpt in a new window, and
the text of that excerpt can then be selected, copied, and pasted into a working document
when quotes are desired.
The Dedoose Analysis Toolkit
Below, each of the analysis tools in Dedoose will be explored. There are a few more
advanced tools that will only be touched upon briefly. Note that many tools can be found
under multiple tool categories in the sidebar, but provide the same information regardless
of which category the tool has been selected under.
Exploring Media and Users
Dedoose offers a few tools that are rather limited in terms of analytical power but that do
offer some useful ways to explore the texts that are part of a project and to track the work of
multiple coders on the project. The first two tools discussed in this section are for exploring
texts, while the final three are for looking at the work of coders.
Excerpt Count x Media (found under Media Charts, Excerpt Charts, and Quantitative
Charts) provides an overview of how many excerpts have been created in each text. This is
the tool shown in Figure 1 above. Clicking on the bar representing any given text provides a
window with all of the excerpts from that text. The dropdown menu in the corner provides
the option of changing the sort order.
Code Count x Media (found under Media Charts, Code Charts, and Quantitative Charts)
shows how many codes were applied to each text, displaying a bar graph (which can be
changed to a pie chart) as shown in Figure 2. This bar graph is often quite similar to the one
found under Excerpt Count x Media, except the numbers are typically higher as more than
one code is typically applied per excerpt. Similarly, clicking on a bar brings up all relevant
excerpts, and the dropdown menu in the corner provides the option of changing the sort
order.
Figure 2. Code Count x Media Tool
User Excerpts (found under Excerpt Charts, User Charts, and Quantitative Charts) and User
Code Application (found under Code Charts, User Charts, and Quantitative Charts) provide bar graphs similar to those discussed above, except instead of displaying the number of excerpts and codes by text, they display the number of excerpts and codes by user.
These tools, then, can be a useful way of tracking engagement with a project when multiple coders are working together. User Media (found under Media Charts, User Charts, and
Quantitative Charts) provides a bar graph showing how many media were uploaded by
each user. Except in large projects with many researchers, this tool is less likely to be useful.
Code Tools
The tools that are most important for the standard forms of qualitative data analysis discussed in this text are among the code tools. These are a set of tools that allow analysts to
explore what they can learn from the way they have coded within their project. The most
basic of these is Code Presence (found under Media Charts, Code Charts, and Qualitative
Charts), which simply displays whether or not a particular code has been applied at least
once to a given text. As shown in Figure 3, this tool provides a grid or matrix in which the
entire code tree developed in the project is arrayed across the top, while the document
titles are listed down the side. When a code appears in a particular document or text, a
red square with the numeral 1 marks the intersection; when the code does not appear, the
intersecting cell is empty. If you click on the red square, a window will pop up with all of the
excerpts from that text to which that code was applied.
Figure 3. Code Presence Tool
Code Application (found under Media Charts, Code Charts, and Qualitative Charts) is a
somewhat more useful way to view the same data. In this tool, instead of just displaying the
presence or absence of each code in each text, the number of times each code was applied
to each text is displayed, as shown in Figure 4. Color coding helps users quickly spot the
most frequently used codes, which are displayed in orange and red (while rarely used codes
are displayed in blue) with the number of times they were applied indicated. As in the case
of other tools, clicking on a table cell brings up all applicable excerpts.
Figure 4. Code Application Tool
Code Co-Occurrence (found under Code Charts and Qualitative Charts) is arguably the most useful of the code tools. Rather than simply documenting the presence or extent of given codes in particular texts, the Code Co-Occurrence tool lets analysts explore the relationships between codes. It flags and tallies excerpts to which two given codes have both been applied. For example, in investigating the sample data displayed in Figure 5, we can observe that "emotions" tends to co-occur with relationships and classes, and anxiety is a particular emotion frequently occurring in relation to discussions of classes. The same color scheme as noted above in relation to the Code Application tool helps viewers see, at a glance, which codes co-occur most frequently, and clicking on table cells brings up relevant excerpts. The checkbox in the corner toggles whether overlapping excerpts—or multiple excerpts that have some sections of text in common—are included. As is the case with other tools, the resulting chart can be exported to a spreadsheet format, though excerpts are not included in the export.
Figure 5. The Code Co-Occurrence Tool
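The logic behind a co-occurrence table is straightforward: for every excerpt, each pair of codes applied to that excerpt increments one cell of a symmetric matrix. A minimal sketch with invented data:

```python
from collections import Counter
from itertools import combinations

excerpts = [
    {"codes": {"emotions", "anxiety", "classes"}},
    {"codes": {"emotions", "relationships"}},
    {"codes": {"anxiety", "classes"}},
]

cooccurrence = Counter()
for ex in excerpts:
    for pair in combinations(sorted(ex["codes"]), 2):
        cooccurrence[pair] += 1

for (a, b), n in cooccurrence.most_common():
    print(f"{a} x {b}: {n}")
# anxiety x classes co-occur twice; every other pair co-occurs once.
```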
Less useful, but perhaps more fun, are two tools that produce code clouds. Code clouds are a type of word cloud used specifically to display the frequency with which a given code has been applied within the body of text in a project. In code clouds, codes that have been applied more frequently are displayed in larger and bolder text than codes that are used less frequently. These tools can be used to get an at-a-glance sense of which codes are most predominant in the text, and can also be useful in creating visuals to use in presentations when discussing coding. The Packed Code Cloud (found under Code Charts and Qualitative Charts), as shown in Figure 6, provides a static visual which can be modified in a variety of ways. The "Sub-code Count" checkbox toggles whether or not child codes are included in the counts driving the sizes of the codes. Under the "Colors" drop-down menu, analysts can choose from the default color scheme shown in Figure 6 or color schemes that are more blue, red/yellow/orange, or pastel in color. The "Layout" drop-down menu offers a fast scheme or one that places each code into the visual one at a time. The "Direction" drop-down menu offers options for how horizontal or vertical the display is, as well as an option called "Wiggly" for a more dynamic diagonal display of terms. Finally, the "Redraw" button refreshes the display once options have been changed; even if options aren't changed, slightly different presentations will occur each time Redraw is hit, so that the analyst can choose the one that is most visually appealing. Hovering the mouse over a code provides the number of times that code was applied in the project, and clicking on it brings up a window with all instances of the code. The export tool creates a PDF of the code cloud, which is necessary if you need a print-quality version of the resulting image, though for including it in a presentation or other digital document many users might prefer to take a screenshot. Note that the Packed Code Cloud is most useful as a visual representation of coding data—to actually carry out data analysis, other tools will be more helpful.
Figure 6. Packed Code Cloud Tool
Similarly, the 3D Code Cloud (found under Code Charts and Qualitative Charts) presents word cloud data, but in a simulated three-dimensional format, as shown in the (silent) video clip below. A checkbox toggles the inclusion or exclusion of sub-codes, while sliders on the side of the window allow users to adjust the zoom and the minimum frequency of code applications for inclusion. Note that the 3D Code Cloud tool does not provide an export option, so users will need to have a screen recorder in order to use these visualizations elsewhere. Just as in the Packed Code Cloud tool, users can click on individual codes to bring up matching excerpts. However, as not all codes can be clearly seen at once, the 3D Code Cloud is even less useful as an analytical tool.¹
1. For screenreader users: there is a screencapture video of this tool below; it is silent. Tab to move through toggle buttons and options; the screenreader is unable to read the moving graphic shown.
One or more interactive elements have been excluded from this version of the text. You can view them online here: https://pressbooks.ric.edu/socialdataanalysis/?p=62#video-62-1
Tools for both Descriptors and Codes
An additional set of tools is designed to help analysts investigate the relationships between
codes and descriptors. These tools would permit, for example, an investigation of gender or
racial differences in how respondents discuss a particular topic, or an analysis of whether
different age groups use different emotion words when talking about their pets. All of the
tools provide similar basic information, but display it differently or permit deeper dives.
Codes x Descriptor (found under Descriptors Charts, Code Charts, and Mixed Methods
Charts, as well as under the Codes tab in Dedoose) creates bar graphs that show how often
a code is applied to texts with a given descriptor, as shown in Figure 7. All codes in the code
tree are shown in the visualization, and a box at the top of the window with two drop-down
menus lets the analyst select the descriptor they wish to investigate (and the descriptor set
in which that descriptor appears, if more than one descriptor set is in use in a given project).
Other options, shown in the grey bar at the top of the window, allow for configuration of
the results:
• Hit/Miss: when selected, the display will show the number of texts with a given
descriptor to which a particular code is applied. When unselected, the display will
show the total number of times a particular code is applied to texts with a given
descriptor. This option cannot be selected at the same time as the Normalize option.
• Sub-code Count: as in other tools, this toggles on or off the inclusion of child codes
when parent codes are presented.
• Normalize: this option applies a mathematical calculation to the figures presented in the tool to adjust them in light of the overall number of texts with a given descriptor. For instance, if a dataset had 23 nurses and 5 doctors, it might not be reasonable to just compare how many times nurses versus doctors discussed status at work—there are so many more nurses that their figures would seem inflated. Normalizing helps correct for this (see the sketch below).
• %: toggles between displaying data as counts (raw numbers) and percentages.
As in other tools, clicking a bar in one of the graphs brings up a window with relevant
excerpts.
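One common normalization, sketched below with the nurse/doctor example, converts raw counts into rates per text within each descriptor group; Dedoose's exact calculation may differ, so treat this as an illustration of the general idea:

```python
# Raw applications of a "status at work" code, and texts per group.
code_counts = {"nurse": 46, "doctor": 15}
text_counts = {"nurse": 23, "doctor": 5}

# Normalized: applications per text, so group size no longer dominates.
for group in code_counts:
    rate = code_counts[group] / text_counts[group]
    print(f"{group}: raw = {code_counts[group]}, per text = {rate:.1f}")
# Nurses look higher in raw counts (46 vs. 15) but lower per text (2.0 vs. 3.0).
```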
Figure 7. The Code x Descriptor Tool
Descriptor x Code (found under Descriptors Charts, Code Charts, and Mixed Methods
Charts) is, in a way, the inverse of Code x Descriptor. Here, all descriptors employed in the
project are displayed in the window, and a drop-down menu permits the analyst to select
which code they wish to investigate. The same options for adjusting the output are provided. In the example displayed in Figure 8, for instance, we can see that only those texts
produced by individuals who were recently in high school are coded as involving high
school classmates, though not too many texts discuss high school classmates at all.
Figure 8. The Descriptor x Code Tool
The next two tools provide an overview of code applications by descriptor. Both have the
same options and the same basic format, but they provide slightly different data. The
Descriptor x Code Case Count Table (found under Descriptors Charts, Code Charts, and
Mixed Methods Charts) shows the number of excerpts across all texts with a given descriptor to which a particular code has been applied. In contrast, the Descriptor x Code Count
Table (found under Descriptors Charts, Code Charts, and Mixed Methods Charts) shows the
number of texts with a given descriptor to which a particular code has been applied. Thus,
it is unsurprising that the numbers are generally higher in the former than in the latter.
Which is more appropriate to use depends on the research question and goals of a given
project. Figure 9 shows the Descriptor x Code Count Table; the Descriptor x Code Case
Count Table would look similar but with higher numbers in many cells.
Figure 9. The Descriptor x Code Count Table
Descriptor x Descriptor x Code (found under Descriptors Charts, Code Charts, and Mixed Methods Charts) provides a way to look at the relationship between two descriptors and a code. Analysts choose two descriptors, and the tool produces a set of bar graphs, one graph for each category of the first descriptor with one bar in each graph for each category of the second descriptor. Then analysts choose a code, and the applications of that code determine the length of each bar. For instance, in Figure 10, we can see the relationship between gender, living arrangements, and the application of the anxiety code. The results show that anxiety is discussed most frequently among male respondents who live on campus, but among female respondents who live off campus but not with family. Note that switching which descriptor is in Field 1 and which is in Field 2 does change the results, especially when the normalize option is switched on, as this makes the data quite susceptible to alteration based on small numbers of texts with a given descriptor. Thus, it is essential that analysts think carefully about their research question and how to set up any Descriptor x Descriptor x Code analysis to address that question. This tool can be used with code weights in projects that have applied them. Options for the inclusion of child codes, for normalizing figures, and for toggling between percentages and raw numbers are also available.
Figure 10. The Descriptor x Descriptor x Code Tool
Code Frequency Descriptor Bubble Plot (found under Descriptors Charts, Code Charts, and Mixed Methods Charts) creates a visual display incorporating data from three codes and one descriptor. The frequency of applications of one code is displayed on the X axis, of a second code on the Y axis, and of a third code in the size of the bubble. Note that rearranging which code is in the X axis, Y axis, or size drop-down box will alter the display, so analysts should think carefully about how they wish to set up their display. Then, the descriptor selected from the Field drop-down box creates different bubbles, with a color key, for each category of the selected descriptor. For example, in Figure 11, we can see a plot looking at anxiety, being lost, and talking about classes, with the descriptor of where students were prior to their first day on our campus: high school, the local community college, or another college. The results show that anxiety is most prevalent in the discussions of students starting directly from high school, who are also more likely to talk about classes. In contrast, students coming from the local community college wrote little about anxiety, but a lot about getting lost.
As in other tools, including sub-code counts and normalizing can be toggled on and off;
clicking a bubble will bring up a list of included excerpts, and results can be exported, in
this case as both spreadsheet and PDF files.
Figure 11. The Code Frequency Descriptor Bubble Plot
Descriptor Fields by Codes Grid Chart (found under Descriptors Charts, Code Charts, and
Mixed Methods Charts) is a particularly flexible tool that lets analysts look at various combinations of descriptors and codes. First, analysts select as many descriptor fields as they
wish to include from the Field drop-down menu, clicking “Add Field” after each one to add
it to the list of Descriptor Fields in the top corner. Then, they select the checkboxes next
to the codes they wish to include from the Codes list. Counts or weights can be displayed
in projects that use weights. These selections generate a grid chart that shows all possible
combinations of the selected descriptor categories and the number of code applications of
each selected code in texts with those combinations of descriptors. For instance, in Figure
12, we can see that female students coming directly from high school made more statements that were coded with Anxiety than did other groups.
Figure 12. Descriptor Fields x Codes Grid Chart
Descriptor Tools
The descriptor tools are designed to provide summary data, including descriptive statistics and some basic explanatory statistics, using descriptors as variables. Unlike other tools in Dedoose, the descriptor tools produce primarily quantitative data. As such, they are generally most useful either for presenting basic descriptive data about participants in a study or when the study is designed as a mixed-methods study. However, in most cases, if more than the most basic descriptive data is desired, it would make more sense to export the relevant descriptor data from Dedoose (which can be done under the Descriptors tab) and load it into appropriate statistical analysis software.
Descriptor Ratios Multi Chart (found under Descriptors Charts and Quantitative Charts) provides a choice of pie or bar graphs that display the number of texts associated with each category for each descriptor field. This tool, as shown in Figure 13, is a good way to quickly familiarize oneself with the distribution of descriptor data in a project.
Figure 13. Descriptor Ratios Multi Chart
Descriptor Ratios Grid Chart (found under Descriptors Charts and Quantitative Charts) provides a way to view crosstabulations of descriptor field categories. Analysts choose one descriptor from the Target Field drop-down menu, and that descriptor is used as the independent variable in a stack of crosstabulations, with the other descriptors as the dependent variables. A toggle is available to turn on or off the inclusion of descriptors that have not been linked to a text. For example, Figure 14 shows students' status prior to coming to our campus as the target field, with gender and how far away students lived from campus prior to becoming students as the other included variables. More information on how to interpret crosstabulations can be found in the chapter on Bivariate Analyses: Crosstabulation, but in short, researchers compare the percentages across the rows. Doing so for the data displayed in Figure 14 shows that there is no notable gender difference in students' educational status before coming to our campus, but that there is a difference in terms of how far away their homes are from campus—students who came to our campus from the local community college are more likely to live within 20 miles of campus.
Figure 14. Descriptor Ratios Grid Chart
Descriptor Field x Descriptor Field (found under Descriptors Charts and Quantitative Charts) provides a different way to examine crosstabulation data. To use this tool, researchers select two descriptors, one for the outer field and one for the inner field. A separate bar graph is produced for each category of the inner field, with bars for each category of the outer field. Options toggle whether the count is of descriptors or of excerpts, whether only linked descriptors are included, and whether categories with no data are included. At the bottom of the screen, the value of the Chi square statistic and the degrees of freedom (df) are presented; clicking on the question mark in a circle at the bottom of the screen brings up a webpage that provides the table of critical values of Chi square, as discussed in the chapter on Bivariate Analyses: Crosstabulation. For example, in Figure 15, an analysis of the relationship between housing one's first year on campus and how far away one lived prior to enrolling is presented. The data shows that the vast majority of students lived within 20 miles of our campus prior to enrolling, and that those students were more likely to continue to live with family, while students from further away were more likely to live in on-campus housing. However, the Chi square calculation is such that this observed relationship is not statistically significant. To determine this, note the Chi square and df values, click on the question mark in a circle at the bottom of the screen, and follow the directions to use the table that is presented.
Figure 15. The Descriptor Field x Descriptor Field Tool
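Rather than looking values up in a printed table, you can compute the critical value and p-value for the reported Chi square and df directly. Below is a sketch using the SciPy library; the statistic and df are invented placeholders to be replaced with the values Dedoose reports:

```python
from scipy.stats import chi2

chi_square, df = 3.2, 2          # hypothetical values read off the Dedoose screen
critical = chi2.ppf(0.95, df)    # critical value at the .05 significance level
p_value = chi2.sf(chi_square, df)

print(f"critical value = {critical:.2f}, p = {p_value:.3f}")
# The relationship is significant at .05 only if the Chi square exceeds the
# critical value (equivalently, if p < .05); here 3.2 < 5.99, so it is not.
```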
Because this text recommends that statistical analysis be performed using statistical analysis software, discussion of the remaining, more quantitative tools will be limited—especially as these tools rely heavily on the inclusion of continuous, numerical variables among the descriptors, which is generally less common in qualitative and perhaps even mixed-methods projects. The Descriptor Number Distribution Plot (found under Descriptors Charts and Quantitative Charts) provides a way to obtain basic descriptive data for numerical descriptors in larger datasets. However, its use is not possible with smaller datasets such as are commonly used for qualitative analysis. The Descriptor Field T-Test (found under Descriptors Charts, Code Charts, Quantitative Charts, and Mixed Methods Charts) produces an independent-samples T-test. To use this tool, you can select under the drop-down for "primary field" any descriptor with categories, and this will be the independent variable. The dependent variable must be continuous rather than discrete. The tool then produces a plot, mean and median differences, and the T-value and degrees of freedom to look up in the chart that loads when the "Critical Values Table" button is pressed. Similarly, the Descriptor ANOVA (found under Descriptors Charts, Code Charts, Quantitative Charts, and Mixed Methods Charts) produces an ANOVA output for a discrete independent variable and a continuous dependent variable. The Descriptor Field Correlation (found under Descriptors Charts and Quantitative Charts) produces a kind of scatterplot with descriptors that are continuous variables presented on both the X-axis and the Y-axis. As in the other tools, the correlation and degrees of freedom (here called DoF) are provided, along with a button to bring up the relevant critical value table.
Code Weight Tools
Code weighting is generally beyond the scope of this text, but it is worth briefly noting that
there are four analysis tools designed specifically for projects that use code weight. In addition, some of the tools discussed above have options that allow the incorporation of data
on code weights into the analysis process, as you have seen.
Code Weight Statistics (found under Code Charts and Qualitative Charts) displays the
minimum, maximum, mean, median, and count of code weights applied for each code,
while Code Weight Distribution Plot (found under Code Charts and Quantitative Charts)
provides more ways to examine descriptive statistics about code weights for a particular
code.
Code Weight Descriptor Bubble Plot (found under Descriptors Charts, Code Charts, and
Mixed Methods Charts) lets users create four-dimensional visualizations of the relationship between three different codes with code weights and one descriptor. Codes can be
assigned to the X axis, Y axis, and bubble size, while the categories of the descriptor become
different bubbles on the graph. Code Weight Frequency x Field (found under Code Charts
and Mixed Methods Charts) permits the analyst to see how the weight of codes varies
across different categories of a descriptor.
Conducting and Concluding Analysis
While Dedoose provides powerful analytical tools, it is important to remember that it remains up to the analyst to make the right choices and use the tools appropriately in service of the project. Analysts need to make sure to choose the tools that fit with the research question and approach, not just those that are appealing or easy to use. For instance, many students of Dedoose are drawn to the code cloud tools because they are attractive and simple—but they provide far less analytical power than does the code co-occurrence tool or even the code application tool.
In addition, in qualitative research, it is not sufficient to simply report what a tool tells you. And it is very unlikely that many of the graphs, tables, and other visuals Dedoose produces will find their way into publications or presentations. Instead, the tools should be used as a guide to determine what findings are worth investigating further or focusing on. Analysts will then need to return to the texts to choose appropriate quotes for illustrating these findings and making sure the findings are sensible in the context of the data. Dedoose provides tools to help researchers make sense of their data, but it does not itself provide answers.
Exercises
1. Return to the Dedoose project involving oral history transcripts. Click through all of the analysis tools. Select two tools you find useful (not just easy or pretty) and write a paragraph summarizing the findings from each tool.
2. Use the tools you have selected to locate and copy two quotes that illustrate each of the findings you have summarized.
Media Attributions
• analyze tools © Dedoose is licensed under an All Rights Reserved license
• codecountmedia © Dedoose is licensed under an All Rights Reserved license
• code presence © Dedoose is licensed under an All Rights Reserved license
• code application © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• code coocurrance © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• packed code cloud © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• code x descriptor © Dedoose is licensed under an All Rights Reserved license
• descriptor x code © Dedoose is licensed under an All Rights Reserved license
• descriptor x code count table © Dedoose is licensed under an All Rights Reserved license
• descriptor x descriptor x code © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• code descriptor bubble © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• descriptor codes grid © Dedoose, adapted by Mikaila Mariel Lemonik Arthur, is licensed under an All Rights Reserved license
• descriptor ratios multi © Dedoose is licensed under an All Rights Reserved license
• descriptor ratios grid © Dedoose is licensed under an All Rights Reserved license
• descriptor field x descriptor field © Dedoose
Glossary
abduction
An approach to research that combines both inductive and deductive elements.
abstract
A short summary of a text written from the perspective of a reader rather than from
the perspective of an author.
addition theorem
The theorem addressing the determination of the probability of a given outcome
occurring at least once across a series of trials; it is determined by adding the probability of each possible series of outcomes together.
analytic coding
Coding designed to move analysis towards the development of themes and findings.
anecdotalism
When researchers choose particular stories or incidents specifically to illustrate a point
rather than because they are representative of the data in general.
ANOVA
A statistical test designed to measure differences between groups.
antecedent variable
A variable that is hypothesized to affect both the independent variable and the dependent variable.
applied research
Research designed to address a specific problem.
archive
A repository of documents, especially those of historical interest.
association
The situation in which variables are able to be shown to be related to one another.
attributes
The possible levels or response choices of a given variable.
bar chart
Also called bar graphs, these graphs display data using bars of varying heights.
basic research
Research designed to increase knowledge, regardless of whether that knowledge may
have any practical application.
bell curve
A graph showing a normal distribution—one that is symmetrical with a rounded top
that then falls away towards the extremes in the shape of a bell.
beta
The standardized regression coefficient. In a bivariate regression, the same as Pearson's
r; in a multivariate regression, the correlation between the given independent variable
and the dependent variable when all other variables included in the regression are controlled for.
binary
Consisting of only two options. Also known as dichotomous.
bivariate analyses
Quantitative analyses that tell us about the relationship between two variables.
block quote
A quotation, usually one of some length, which is set off from the main text by being
indented on both sides rather than being placed in quotation marks.
CAQDAS
An acronym for "computer-aided qualitative data analysis software," or software that
helps to facilitate qualitative data analysis.
causation
A relationship between two phenomena where one phenomenon influences, produces, or alters another phenomenon.
central limit theorem
The theorem that states that if you take a series of sufficiently large random samples
from the population (replacing people back into the population so they can be reselected each time you draw a new sample), the distribution of the sample means will be
approximately normally distributed.
chronology
A list or diagram of events in order of their occurrence in time.
cliques
An exclusive circle of people or organizations in which all members of the circle have
connections to all other members of the circle.
closed coding
Coding in which the researcher developed a coding system in advance based on their
theory, hypothesis, or research question.
code tree
A hierarchically-organized coding system.
code weights
Elements of a coding strategy that help identify the intensity or degree of presence of
a code in a text.
codebooks
Documents that lay out the details of measurement. Codebooks may be used in surveys to indicate the way survey questions and responses are entered into data analysis
software. Codebooks may be used in coding to lay out details about how and when to
use each code that has been developed.
codes
Words or phrases that capture a central or notable attribute of a particular segment of
textual or visual data.
coding
The process of assigning observations to categories.
coding (in quantitative methods)
Assigning numerical values to replace the names of variable categories.
cognitive map
Visualizations of the relationships between ideas.
collinearity
The condition where two independent variables used in the same analysis are strongly
correlated with one another.
column marginal
The total number of cases in a given column of a table.
concept coding
Coding using words or phrases that represent concepts or ideas.
confidence interval
A range of estimates into which it is highly probable that an unknown population parameter falls.
confidence level
The probability that the sample statistic we observe holds true for the larger population.
continuous variable
A variable measured using numbers, not categories, including both interval and ratio
variables. Also called a scale variable.
control variable
A variable that is neither the independent variable nor the dependent variable in a relationship, but which may impact that relationship.
controlling a relationship
Examining a relationship between two variables while eliminating the effect of variation in an additional variable, the control variable.
crosstabulation
An analytical method in which a bivariate table is created using discrete variables to
show their relationship.
data cleaning
The process of examining data to find any errors, mistakes, duplications, corruptions,
omissions, or other issues, and then correcting or removing data as is appropriate.
data display
Tables, diagrams, figures, and related items that enable researchers to visualize and
organize data in ways that permit the perception of patterns, comparisons, processes,
or themes.
data management
The process of organizing, preserving, and storing data so that it can be used effectively.
data reduction
The process of reducing the volume of data to make it more usable while maintaining
the integrity of the data.
decision tree
A diagram that lays out the steps taken to reach decisions.
deductive
An approach to research in which researchers begin with a theory, then collect data
and use that data to test their theory.
deductive coding
Coding in which the researcher developed a coding system in advance based on their
theory, hypothesis, or research question.
degrees of freedom
The number of cells in a table that can vary if we know something about the row and
column totals of that table, calculated according to the formula (# of columns-1)*(# of
rows-1).
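For example, a crosstabulation with 3 columns and 2 rows has (3-1)*(2-1) = 2 degrees of freedom: once the row and column totals are known, only two cells are free to vary.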
denominator
The expression below the line in a fraction; the entity used to divide another entity in a
formula.
dependent variable
A variable that is affected or influenced by (or depends on) another variable; the effect
in a causal relationship.
descriptive coding
Coding that relies on nouns or phrases describing the content or topic of a segment of
text.
descriptive statistics
Statistics used to describe a sample.
descriptor
A category in an information storage system; more specifically in Dedoose, a characteristic of an author or entire text. Also, the word used to indicate that category or characteristic.
deviant case
A case that appears to be an exception to commonly-understood patterns or explanations.
dichotomous
Consisting of only two options. Also known as binary.
direction
How categories of an independent variable are related to categories of a dependent
variable.
discrete variable
A variable measured using categories rather than numbers, including binary/dichotomous, nominal, and ordinal variables.
dramaturgical coding
Coding that treats texts as if they are scripts for a play.
dummy variable
A two-category (binary/dichotomous) variable that can be used in regression or correlation, typically with the values 0 and 1.
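As an illustration, a dummy variable can be created by recoding a two-category variable to 0 and 1. A minimal Python sketch, using the GSS coding of sex (1 = male, 2 = female) shown in the codebook later in this text:

```python
# Recode sex (1 = male, 2 = female) into a 0/1 dummy for "female"
sexes = [1, 2, 2, 1, 2]
female = [1 if s == 2 else 0 for s in sexes]
print(female)  # [0, 1, 1, 0, 1]
```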
edge
The line connecting nodes in a network diagram; such lines represent real-world relationships or linkages.
elaboration
A term used to refer to the process of controlling for a variable.
elimination of alternatives
In relation to causation, the requirement that, for a causal relationship to be established, all possible explanations other than the hypothesized independent variable must be eliminated as causes of the dependent variable.
emotion codes
Codes indicating emotions discussed by or present in the text, sometimes indicated by
the use of emoji/emoticons.
empirical
That which could hypothetically be shown to be true or false; statements about reality
rather than opinion.
epistemology
The philosophical study of the nature of knowledge.
ethics (in research)
Standards for the appropriate conduct of research that seek to ensure researchers treat
human participants in research appropriately and do not harm them and that scientific
misconduct is avoided.
ethnography
A research method in which the researcher is a participant in a social setting while
simultaneously observing and collecting data on that setting and the people within it.
evaluation coding
A coding system used to indicate what is or is not working in a program or policy.
exhaustive
The property of a variable which has a category for everyone.
extraneous variable
A variable that impacts the dependent variable but is not related to the independent
variable.
face validity
The extent to which measures appear to measure that which they were intended to
measure.
feminism
A perspective rooted in the idea that explorations and understandings of gendered
power relations should be at the root of inquiry and action.
fieldnotes
Qualitative notes recorded by researchers in relation to their observation of and/or participation with participants, social circumstances, events, etc., in which they document occurrences, interactions, and other details observed during their observational or ethnographic research.
first-cycle coding
Coding that occurs early in the research process as part of a bridge from data reduction
to data analysis.
flow chart
A diagram of a sequence of operations or relationships.
focus group
A research method in which multiple participants interact with each other while being
interviewed.
focused coding
Selective coding designed to orient an analytical approach around certain ideas.
frequency distribution
An analysis that shows the number of cases that fall into each category of a variable.
gamma
A measure of the direction and strength of a crosstabulated relationship between two
ordinal-level variables.
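Gamma is computed from the concordant and discordant pairs of cases in a crosstab: gamma = (C − D) / (C + D). A Python sketch with a hypothetical 2×2 table (statistical packages such as SPSS report gamma directly):

```python
# Gamma = (C - D) / (C + D), where C counts concordant pairs of cases
# (ranked the same way on both ordinal variables) and D counts
# discordant pairs (ranked in opposite ways).
def gamma(table):
    c = d = 0
    for i in range(len(table)):
        for j in range(len(table[0])):
            for k in range(i + 1, len(table)):
                c += table[i][j] * sum(table[k][j + 1:])  # concordant: down-right
                d += table[i][j] * sum(table[k][:j])      # discordant: down-left
    return (c - d) / (c + d)

# Hypothetical table: rows = low/high education, columns = low/high income
print(gamma([[30, 10],
             [10, 30]]))  # 0.8, a strong positive association
```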
General Social Survey
A nationally-representative survey on social issues and opinions which has been carried
out roughly every other year since 1972. Also known as the GSS.
generalizability
The degree to which a finding based on data from a sample can be assumed to be true
for the larger population from which the sample was drawn.
genre
A classification of written or artistic work based on form, content, and style.
gerunds
Verb forms that end in -ing and function grammatically in sentences as if they are
nouns.
grounded theory
An inductive approach to data collection and data analysis in which researchers strive
to generate a conception of how participants understand their own lives and circumstances.
Hawthorne effect
When research participants modify their behavior, actions, or responses due to their
awareness that they are being observed.
histogram
A graph that looks like a bar chart but with no spaces between the bars; it is designed to display the distribution of continuous data by creating rectangles that represent equally-sized groups of values.
hypothesis
A statement of the expected or predicted relationship between two or more variables.
In vivo coding
Coding that relies on research participants' own language.
independent variable
A variable that may affect or influence another variable; the cause in a causal relationship.
index variable
A composite variable created by combining information from multiple variables.
inductive
A research approach in which researchers begin by collecting data and then use this
data to build theory.
inductive coding
Coding in which the researcher develops codes based on what they observe in the data
they have collected.
inferential statistics
Statistics that permit researchers to make inferences (or reasoned conclusions) about
the larger populations from which a sample has been drawn.
inter-rater reliability
The extent to which multiple raters or coders assign the same or a similar score, code,
or rating to a given text, item, or circumstance.
interaction term
A variable constructed by multiplying the values of other variables together so as to
make it possible to look at their combined impact.
interpretivism
A philosophy of research that assumes all knowledge is constructed and understood by
human beings through their own individual and cultural perspectives.
interval variable
A variable with adjacent, ordered categories that are a standard distance from one
another, typically as measured numerically.
intervening variable
A variable hypothesized to intervene in the relationship between an independent and a
dependent variable; in other words, a variable that is affected by the independent variable and in turn affects the dependent variable.
interview
A research method in which a researcher asks a participant open-ended questions.
iterative
A process in which steps are repeated.
Kappa
A measure of association especially likely to be used for testing interrater reliability.
kurtosis
How sharp the peak of a frequency distribution is. If the peak is too pointed to be a normal curve, it is said to have positive kurtosis (or “leptokurtosis”). If the peak of a distribution is too flat to be normally distributed, it is said to have negative kurtosis (or “platykurtosis”).
latent coding
Interpretive coding that focuses on meanings within texts.
leptokurtosis
The characteristic of a distribution that is too pointed to be a normal curve, indicated
by a positive kurtosis statistic.
levels of measurement
Classification of variables in terms of the precision or sensitivity in how they are
recorded.
line of best fit
The line that minimizes the overall distance between itself and all of the points in a scatterplot.
linear relationship
A relationship in which a scatterplot will produce a reasonable approximation of a
straight line (rather than something like a U or some other shape).
logistic regression
A type of regression analysis that uses the logistic function to predict the odds of a particular value of a binary dependent variable.
manifest coding
Coding of surface-level and/or easily observable elements of texts.
margin of error
A suggestion of how far away from the actual population parameter a sample statistic
is likely to be.
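For a sample proportion, a common approximation (a sketch using the standard 95 percent formula, which is not spelled out in this glossary) is 1.96 times the standard error of the proportion:

```python
import math

# Approximate 95% margin of error for a proportion p in a sample of size n
def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(0.5, 1000), 3))  # 0.031, i.e. about +/- 3 points
```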
matrices
Tables with rows and columns that are used to summarize and analyze or compare
data.
mean
The sum of all the values in a list divided by the number of such values.
measures of central tendency
A measure of the value most representative of an entire distribution of data.
measures of dispersion
Statistical tests that show the degree to which data is scattered or spread.
median
The middle value when all values in a list are arranged in order.
metadata
Data about other data.
mode
The category in a list that occurs most frequently.
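The mean, median, and mode defined above can each be computed with Python's standard library; a minimal sketch with made-up values:

```python
import statistics

values = [2, 3, 3, 5, 7]
print(statistics.mean(values))    # 4.0: the sum (20) divided by the count (5)
print(statistics.median(values))  # 3: the middle value when ordered
print(statistics.mode(values))    # 3: the most frequent value
```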
multiple regression
Regression analysis looking at the relationship between a dependent variable and
more than one independent variable.
multiplication theorem
The theorem in probability about the likelihood of a given outcome occurring repeatedly over multiple trials; this is determined by multiplying the probabilities together.
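For example, the probability of rolling a six on a fair die three times in a row is (1/6) × (1/6) × (1/6); a one-line sketch:

```python
# Multiplication theorem: multiply the probabilities of independent events
print((1 / 6) ** 3)  # about 0.0046, or 1 chance in 216
```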
multivariate analyses
Quantitative analyses that explore relationships involving more than two variables or examine the impact of other variables on a relationship between two variables.
multivariate regression
Regression analysis looking at the relationship between a dependent variable and
more than one independent variable.
mutually exclusive
The characteristic of a variable in which no one can fit into more than one category, such as age categories 5-10 and 11-15 (rather than 5-10 and 10-15, as this would mean ten-year-olds fit into two categories).
network diagram
A visualization of the relationships between people, organizations, or other entities.
NHST
Null hypothesis significance testing.
nodes
Points in a network diagram that represent an individual person, organization, idea, or other entity of the type the diagram is designed to show connections between.
nominal variable
A variable whose categories have names that do not imply any order.
normal distribution
A distribution of values that is symmetrical and bell-shaped.
null hypothesis
The hypothesis that there is no relationship between the variables in question.
null hypothesis significance testing
A method of testing for statistical significance in which an observed relationship, pattern, or figure is tested against a hypothesis that there is no relationship or pattern among the variables being tested.
objectivity
The ability to evaluate something without individual perspectives, values, or biases
impacting the evaluation.
observational research
A research method in which the researcher observes the actions, interactions, and
behaviors of people.
open coding
Coding in which the researcher develops codes based on what they observe in the data
they have collected.
ordinal variable
A variable with categories that can be ordered in a sensible way.
organizational chart
A diagram, usually a flow chart, that documents the hierarchy and reporting relationships within an organization.
original relationship
The relationship between an independent variable and a dependent variable before
controlling for an additional variable.
p value
The measure of statistical significance typically used in quantitative analysis. The lower
the p value, the more likely you are to reject the null hypothesis.
paradigm
A set of assumptions, values, and practices that shapes the way that people see, understand, and engage with the world.
partial
Shorter term for a partial relationship.
partial relationship
A relationship between an independent and a dependent variable for only the portion
of a sample that falls into a given category of a control variable.
participant-observation
A research method in which the researcher observes social interaction while themselves participating in the social setting.
participants
People who participate in a research project or from or about whom data is collected.
Pearson’s chi-square
A measure of statistical significance used in crosstabulation to determine the generalizability of results.
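The statistic itself sums (observed − expected)² / expected over every cell of the table, where expected counts come from the row and column marginals. A Python sketch with a hypothetical table (SPSS reports this statistic automatically):

```python
# Pearson's chi-square for a crosstab given as a list of rows
def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no relationship
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

print(chi_square([[30, 10],
                  [10, 30]]))  # 20.0 for this hypothetical table
```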
Pearson’s r
A measure of association that calculates the strength and direction of association
between two continuous (interval and/or ratio) level variables.
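Python's standard library (version 3.10 or later) computes Pearson's r directly; the values here are invented for illustration:

```python
import statistics

hours_studied = [1, 2, 3, 4, 5]
exam_scores = [55, 62, 70, 75, 88]

# Pearson's r: strength and direction of a linear association
r = statistics.correlation(hours_studied, exam_scores)
print(round(r, 3))  # 0.989: a strong positive linear association
```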
pie charts
Circular graphs that show the proportion of the total that is in each category in the
shape of a slice of pie.
platykurtosis
The characteristic of a distribution that is too flat to be a normal curve, indicated by a
negative kurtosis statistic.
population
A group of cases about which researchers want to learn something; generally, members of a population share common characteristics that are relevant to the research,
such as living in a certain area, sharing a certain demographic characteristic, or having
had a common experience.
population parameter
A quantitative measure of data from a population.
positionality
An individual's social, cultural, and political location in relation to the research they are
doing.
positivism
A view of the world in which knowledge can be obtained through logic and empirical
observation and the world can be subjected to prediction and control.
pragmatism
A philosophy that suggests that researchers can adapt elements of both objectivist and
interpretivist philosophies.
probability
How likely something is to happen; also, a branch of mathematics concerned with
investigating the likelihood of occurrences.
probability sample
A sample that has been drawn to give every member of the population a known (nonzero) chance of inclusion.
process coding
Coding in which gerunds are applied to actions that are described in segments of text.
process diagrams
Visualizations that display the relationships between steps in a process or procedure.
qualitative data analysis
Data analysis in which the data is not primarily numeric, for instance based on words or
images.
quantification
The transformation of non-numerical data into numerical data.
quantitative data analysis
Data analysis in which the data is numerical.
R squared
The square of the correlation coefficient, which tells analysts how much of the variation in the dependent variable has been explained by the independent variable(s) in the regression.
R2 change
The change in the percent of the variance of the dependent variable that is explained
by all of the independent variables together when comparing two different regression
models.
random sample
A sample in which all members of the population have an equal probability of being
selected.
range
The highest category in a list minus the lowest category.
ratio level variable
A numerical variable with an absolute zero which can also be multiplied and divided.
reflexivity
A continual reflection on the research process and the researcher's role within that
process designed to ensure that researchers are aware of any thought processes that
may impact their work.
regression
A statistical technique used to explore how one variable is affected by one or more
other variables.
regression line
The line that is the best fit for a series of data, typically as displayed in a scatterplot.
relationship (between variables)
When certain categories of one variable are associated, or go together, with certain categories of the other variable(s).
reliability
The extent to which multiple or repeated measurements of something produce the
same results.
repeatability
The extent to which a researcher can repeat a measurement and get the same result.
replicability
The extent to which a research study can be entirely redone and yet produce the same
overall findings.
replicate
To repeat a research study with different participants.
representativeness
The degree to which the characteristics of a sample resemble those of the larger population.
reproducibility
The extent to which a new study designed to test the same hypothesis or answer the
same question ends up with the same findings as the original study.
respondents
People who participate in a research project or from or about whom data is collected.
rough coding
Coding for data reduction or as part of an initial pass through the data.
row marginal
The total number of cases in a given row of a table.
sample
A subset of cases drawn or selected from a larger population.
sample statistics
Quantitative measures of data from a sample.
sampling error
Measurement error created due to the fact that even properly-constructed random samples do not have precisely the same characteristics as the larger population from which they were drawn.
saturation
The point in the research process where continuing to engage in data collection no
longer yields any new insights. Can also be used to refer to the same point in the literature review process.
scale variable
A variable measured using numbers, not categories, including both interval and ratio
variables. Also called a continuous variable.
scatterplot
A visual depiction of the relationship between two interval-level variables, represented as points on a graph with an x-axis and a y-axis.
second-cycle coding
Analytical coding that occurs later in the data analysis process.
significance (statistical)
A statistical measure that suggests that sample results can be generalized to the larger
population, based on a low probability of having made a Type 1 error.
simple linear regression
A regression analysis looking at a linear relationship between one independent and one
dependent variable.
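A sketch of a least-squares fit using Python's standard library (version 3.10 or later; the same invented values as the Pearson's r example above):

```python
import statistics

x = [1, 2, 3, 4, 5]       # independent variable
y = [55, 62, 70, 75, 88]  # dependent variable

# Least-squares line of best fit: predicted y = intercept + slope * x
slope, intercept = statistics.linear_regression(x, y)
print(round(slope, 1), round(intercept, 1))  # 7.9 and 46.3
```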
skewness
An asymmetry in a distribution in which a curve is distorted either to the left or the
right, with positive values indicating right skewness and negative values indicating left
skewness.
social data analysis
The analysis of empirical data in the social sciences.
social responsibility (in research)
The extent to which research is conducted with integrity, is trustworthy, is relevant, and
meets the needs of communities.
spurious
The term used to refer to a relationship in which variables seem to vary in relation to one another, but in which no causal relationship in fact exists.
standard deviation
A measure of variation that takes into account every value’s distance from the sample
mean.
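A quick sketch of the standard deviation (and the variance defined later in this glossary) with Python's standard library; note that statistics.stdev uses the sample formula, dividing by n − 1:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.variance(values))  # 4.571...: sample variance
print(statistics.stdev(values))     # 2.138...: the square root of the variance
```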
standard error
A measure of accuracy of sample statistics computed using the standard deviation of
the sampling distribution.
standpoint
The particular social position in which a person exists and in which their understandings of the world are rooted.
strength (of relationship)
A measure of how well we can predict the value or category of the dependent variable
for any given unit in our sample based on knowing the value or category of the independent variable(s).
string
A data type that represents non-numerical data; string values can include any
sequence of letters, numbers, and spaces.
structural coding
Coding that indicates which research question or hypothesis is being addressed by a
given segment of text.
subjects
People who participate in a research project or from or about whom data is collected.
summarization
The process of creating abridged or shortened versions of content or texts that still keep
intact the main points and ideas they contain.
table
A display that uses rows and columns to show information.
temporal order
The order of events in time; in relation to causation, the fact that independent variables
must occur prior to dependent variables.
the elaboration model
A typology developed by Paul Lazarsfeld for the possible analytical outcomes of controlling for a variable.
themes
Concepts, topics, or ideas around which a discussion, analysis, or text focuses.
thick description
A detailed narrative account of social action that incorporates rich details about context
and meaning such that readers are able to understand the analytical meaning of the
description.
timeline
A diagram that lays out events in order of when they occurred in time.
trace analysis
Research that uses the traces of life people have left behind as data, as in archeology.
triangulation
The use of multiple methods, sites, populations, or researchers in a project, especially to
validate findings.
type 1 error
The error made if one infers that a relationship exists in a larger population when it does
not really exist; in other words, a false positive error.
type 2 error
The error made if one does not infer that a relationship exists in the larger population when it actually does exist; in other words, a false negative conclusion.
typologies
Classification systems.
univariate
Using one variable.
univariate analyses
Quantitative analyses that tell us about one variable, like the mean, median, or mode.
validity
The degree to which research measurements accurately reflect the real phenomena
they are intended to measure.
values coding
Coding that relies on codes indicating the perspective, worldview, values, attitudes,
and/or beliefs of research participants.
variable
A characteristic that can vary from one subject or case to another or for one case over
time within a particular research study.
variance
A basic statistical measure of dispersion, the calculation of which is necessary for computing the standard deviation.
versus coding
Coding that relies on a series of binary oppositions, one of which must be applied to
each segment of text.
voice
The style or personality of a piece of writing, including such elements as tone, word
choice, syntax, and rhythm.
word cloud
Visual display of words in which the size and boldness of each word indicates the frequency with which it appears in a body of text.
Yule’s Q
A measure of the strength of association used with binary variables.
Z score
A way of standardizing data based on how many standard deviations away each value
is from the mean.
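A sketch of standardizing values as Z scores, using the same illustrative list as the standard deviation example above:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(values)  # 5.0
sd = statistics.stdev(values)   # about 2.14

# Z score: distance from the mean in standard deviation units
z = [(v - mean) / sd for v in values]
print(round(z[-1], 2))  # 1.87: the value 9 sits 1.87 standard deviations above the mean
```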
Modified GSS Codebook for the
Data Used in this Text
The General Social Survey
2021 GSS (Cross-section study) Documentation and Public
Use File Codebook (Release 2)
Edited and Modified for use with this text
Citation of This Document
In publications, please acknowledge the original source. The citation for this Public Use File
is:
Davern, Michael; Bautista, Rene; Freese, Jeremy; Morgan, Stephen L.; and Tom W. Smith.
General Social Survey 2021 Cross-section. [Machine-readable data file]. Principal Investigator, Michael Davern; Co-Principal Investigators, Rene Bautista, Jeremy Freese, Stephen L.
Morgan, and Tom W. Smith. NORC ed. Chicago, 2021. 1 datafile (68,846 cases) and 1 codebook (506 pages).
Copyright 2021-2022 NORC
Permission is hereby granted, free of charge, to any person obtaining a copy of this codebook, portions thereof, and the associated documentation (the “codebook”), to use the
codebook, including, without limitation, the rights to use, copy, modify, merge, publish, and
distribute copies of the codebook, and to permit persons to whom the codebook is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or
portions of the codebook. Any distribution of the codebook must be free of charge to the
recipient, except for charges to recover duplicating costs.
The codebook is provided “as is,” without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and
noninfringement. In no event shall the authors or copyright holders be liable for any claim,
damages or other liability, whether in an action of contract, tort or otherwise, arising from,
out of, or in connection with the codebook or the use or other dealings in the codebook.
Please contact the GSS team at gss@norc.org with any questions or requests.
Notes on the modified version:
A modified version of the 2021 GSS was created for use with the Open Educational
Resources text Social Data Analysis. This version is simplified for use by undergraduate
students learning SPSS software; as such, survey weights and certain variables have been
removed from the dataset and it has been converted to SPSS format. This codebook
includes only important information about the 2021 GSS and information about those variables included in the modified dataset. Other information has been removed to shorten and simplify the codebook.
–Mikaila Mariel Lemonik Arthur
2021 GENERAL SOCIAL SURVEY CROSS-SECTION CODEBOOK,
RELEASE 1
(Codebook for the Machine-Readable Data File 2021 General Social Survey Cross-section)
Principal Investigators
Michael Davern
Co-Principal Investigator René Bautista
Co-Principal Investigator Jeremy Freese
Co-Principal Investigator Stephen L. Morgan
Co-Principal Investigator Tom W. Smith
Senior Advisor
Colm O’Muircheartaigh
Research Associates
Jaesok Son
Benjamin Schapiro
Jodie Smylie
Beth Fisher
Katherine Burda
Ned English
Steven Pedlow
Amy Idhe
María Sánchez
Eyob Moges
Hans Erickson
Abigail Norling Ruggles
Rachel Sparkman
Produced by NORC at the University of Chicago
This project was supported by the National Science Foundation
INTRODUCTION
Introduction to the General Social Survey (GSS)
The General Social Survey (sometimes, General Social Surveys) is a series of nationally representative cross-sectional interviews in the United States that have occurred since 1972.
The GSS collects data on contemporary American society to monitor and explain trends in
opinions, attitudes, and behaviors. The GSS has adapted questions from earlier surveys,
thereby allowing researchers to conduct comparisons for up to 80 years. Originally proposed and developed by James A. Davis, the GSS has been administered by NORC at the
University of Chicago (NORC) and funded by the National Science Foundation (NSF) since
its inception. Currently, the GSS is designed by a set of Principal Investigators (PIs), with input from the GSS Board, which is composed of notable researchers within the scientific community. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and
violence, intergroup tolerance, morality, national spending priorities, psychological well-
being, social mobility, and stress and traumatic events. Altogether, the GSS is the single
best source for sociological and attitudinal trend data covering the United States. It allows
researchers to examine the structure and functioning of society in general, as well as the
role played by relevant subgroups and to compare the United States to other nations. The
GSS aims to make high-quality data easily accessible to scholars, students, policymakers,
and others, with minimal cost and waiting.
The GSS has been tracking trends in public opinion since 1972. Throughout, the GSS has
taken great care to keep the survey methodology as comparable over time as possible,
which includes everything from keeping the same sampling approach to not changing
question wording. This is done to minimize potential changes due to changes in methodology and support the study of trends in public opinion in the United States over time.
(See the Study Overview for an explanation of why this is the 2021 GSS, rather than the 2020 GSS.)
However, due to the global COVID-19 pandemic, the 2021 GSS Cross-section implemented significant methodological adaptations for the safety of respondents and interviewers. Since
its inception, the GSS has traditionally used in-person data collection as its primary mode
of data collection. However, the 2021 GSS Cross-section used an address-based sampling
with push to web and a web self-administered questionnaire. This new survey design and
methodology bring numerous changes, which are discussed in this codebook.
The GSS comprises a core set of items (the Replicating Core) that are repeated every
round, as well as topical modules, which may or may not be repeated. The GSS is currently
composed of three separate ballots (A, B, and C), as well as two separate forms (X and Y),
which allow for up to six different paths through the interview itself (in addition to paths
determined by respondent answers, such as questions about spouses or partners, or questions on employment). Not every question in the Replicating Core is asked of every respondent; most only appear on two of the three ballots. However, every item in the Replicating
Core overlaps on at least one ballot with every other item in the Replicating Core, ensuring
that researchers can estimate inter-item correlations. Forms are used for experiments
such as wording variations within questions, ensuring that half of the respondents on each
ballot see the experimental or control conditions of each relevant variable. Within the GSS,
these form experiments are usually assigned mnemonics that end in -Y.
Topical modules are typically assigned to either two full ballots (e.g., A and B) or one full ballot and one half-ballot (e.g., A and BX), covering two-thirds or half of sample respondents,
respectively. However, some topical modules are included on all ballots. Modules are usually
assigned to specific ballots based on one of two conditions: overlap with other key questions (either ensuring that respondents to specific items also receive specific modules or
that respondents to specific items do not receive specific modules), or time constraints. The
GSS tries to balance the length of all six paths to be approximately equal. Topical modules
may be administered via interviewer in any mode or completed by self-administered questionnaire, depending on the sensitivity of the items included.
Topical modules come from several different sources. While the GSS broadly is funded
by NSF, individual modules may be sponsored by other government agencies, universities,
research institutes, or individuals. The GSS typically includes modules every round that are
related to the International Social Survey Programme (ISSP), a consortium of national-level
studies like the GSS (for more information, see Introduction to the International Social Survey Programme, below). Finally, modules may be solicited by the GSS Scientific Advisory
Board or the Principal Investigators and can be included based on scientific merits and
available time in the interview. The number of GSS modules varies by year.
Additionally, the GSS has implemented experimental designs over time or through collaborations (for instance, supporting other studies such as the National Organization Studies, National Congregations Study, National Voluntary Associations Study, and the
2016-2020 General Social Survey-American National Election Studies (GSS-ANES) Panel),
which have led to several ancillary datasets.
Introduction to the International Social Survey Programme
(ISSP)
The ISSP is a consortium of nationally representative research studies, like the GSS, that
have all agreed to ask questions on the same topics on an annual basis. It emerged out
of bilateral collaboration between NORC and the German organization Zentrum für Umfragen,
Methoden, und Analysen (ZUMA; now part of GESIS-Leibniz Institute of the Social Sciences).
Starting in 1982, each organization devoted a small segment of their national surveys, ALLBUS and GSS, to a common set of questions. The ISSP was formally established in 1984 by
Australia, Germany, Great Britain, and the United States, and it now has 42 member countries across five continents and collects data in 70 countries. NORC represents the United
States as a country in the ISSP.
ISSP modules have several defining criteria:
• Modules are developed in English and translated to every administered language.
• Modules must contain approximately 60 questions and can be supplemented by
optional items, as well as around 40 items on background and demographic characteristics.
• Modules must contain the same items across all languages and studies, asked in the
same order, with minor variations to account for mode differences or regional differences (for example, asking about Congress or Parliament, or asking about the European Union or NAFTA).
• Each topical module must be administered within a relatively narrow time frame, usually about 18 months from the start of the relevant year.
ISSP Modules are currently replicated every 10 years, allowing for topics to be studied as
both a multinational snapshot at a single point in time as well as a slowly evolving repeated
cross-section of national opinions. While not every topic is repeated, the longest-running
module is up to its fifth replication. ISSP rules require that when a topic is repeated, at least
two-thirds of the questions must be repeated from a previous round, while up to one-third
can be new questions.
Study Overview
Due to the onset of the COVID-19 pandemic in the United States in the early months of
2020, GSS staff redesigned the GSS in several ways to ensure the safety and well-being of all
the people who participate in and administer the study.
In 2020, we conducted the GSS as two studies: 1) a panel reinterview of past respondents
from the 2016 and 2018 cross-sectional GSS studies (referred to as the 2016-2020 GSS
Panel), and 2) an independent fresh cross-sectional address-based sampling push-to-web
study (referred to in this document as the 2021 GSS Cross-section but also known as the
2020 cross-sectional survey in previous documents). This codebook provides details of the
second study—namely, the 2021 GSS Cross-section, where newly selected respondents
answered a GSS questionnaire from December 2020 to May 2021. We refer to the second
study as the 2021 GSS Cross-section because the majority of the data was collected in 2021.
Documentation for the first study (the 2016-2020 GSS Panel) is provided separately. During
the spring and summer of 2020, GSS staff redesigned both the panel and the cross-section
to be administered primarily via web self-administered questionnaire, instead of face-to-face interviews, with phone interviews as a secondary mode.
Each of these major changes had several ramifications for sampling, fielding, questionnaire design, data cleaning, response rates, and weights.
Cross-section Overview
This codebook focuses on the 2021 GSS Cross-section survey. While this iteration of the
Cross-section has meaningful changes from previous editions, there is much that remains
consistent. Just as was done since 2004, the GSS Cross-section survey administers a full-
probability sample approach with samples created from an adapted form of the United
States Postal Service (USPS) metropolitan statistical area (MSA)/county frame area. More on
the GSS conventional design can be found on pages 3177–3178 of the legacy cumulative
codebook, available at the GSS website.
The GSS has been conducted since 1972 and currently functions as a social indicators program, which highly values historical trends and continuity. To that end, the GSS’s replicating core contains items that have been asked since its inception. In some cases, these items
were asked on even older surveys, allowing for continuous measurement of concepts since
the 1940s.
Table 1. Key Aspects of the 2021 GSS Cross-section

Sample: Adults 18 or older in the United States who live in noninstitutional housing at the time of interviewing
Invitation: Mailing materials that show a web link to invite people to participate on the web (i.e., pushing respondents to a web survey first); a phone option was also provided
Survey mode: Web (supplemented by phone)
Incentive: Both non-contingent pre-paid incentive and contingent post-paid incentive
Final sample size: 4,032 completes from 27,591 lines of sample
Response rate (AAPOR RR3): 17.4%
Fielding period: December 1, 2020, to May 3, 2021
Administration: Mail push to web as primary mode, supplemented with phone
Respondent selection within household: Last birthday method
Language: English and Spanish
Paradata derived from instrument: Paradata were recorded but are not available in release 1
Occasionally, the cumulative datafile is updated with newly cleaned or derived variables,
and error corrections. Please see Release Notes for changes since the initial release of the
1972-2021 Cumulative file.
NOTE ON MEASUREMENT AND INTERPRETATION
The GSS has been tracking trends in public opinion since 1972. Over this time, the GSS has
taken great care to keep the survey methodology as comparable over time as possible,
which includes everything from keeping the same sampling approach to not changing
question wording. This is done to minimize potential changes due to methodology and
support the study of changes in public opinion in the United States. However, due to the
global COVID-19 pandemic, the 2021 GSS needed to implement significant methodological adaptations for the safety of respondents and interviewers. Since its inception, the
GSS has traditionally used in-person data collection as its primary mode of data collection. In response to the COVID-19 pandemic, the GSS altered its methodology for the first
time to be primarily an address-based sampling push-to-web methodology in 2021. As a
result, when examining changes in trends over time, we caution GSS data users to carefully examine how a change they are observing in a trend may have been impacted by the
methodological differences employed in 2021. The 2021 GSS Cross-section codebook provides documentation of the methodological changes and adaptations that were necessary
to transition the GSS survey from in-person to a web survey.
Total Survey Error Perspective for GSS Trend Estimates in 2021
The GSS was collected in 2021 to provide vital opinion data to the research community at
a critical time in U.S. history. While the data will contribute to our understanding of society,
any changes in public opinion seen in the 2021 GSS data could be due to either changes
in actual opinion and/or changes the GSS made in the methodology to adapt to COVID-19.
When evaluating the GSS for trend changes over time, we caution our users to carefully
consider changes in the GSS methodology from a total survey error perspective. Total survey error is a way of comprehending the impact on estimates due to measurement, nonresponse, coverage, and sampling error. Below, we provide a high-level summary of the
components of the total survey error in the 2021 GSS Cross-section, but we invite the user
to carefully review the details in the present document.
Measurement Error: Changes in how survey questions were administered can impact the
answers. The GSS has traditionally been conducted in person and administered by an inter-
viewer. Due to the COVID-19 pandemic, the GSS needed to administer survey questions primarily over the web without any interviewer assistance. Some cases were collected over the
phone as well. To adapt to the primary mode of administration (web), some changes were
needed in the measures. For example, “Don’t Know” response categories were included for factual questions but were not displayed for opinion questions. In the past, Don’t Knows could be recorded by interviewers regardless of whether
they were factual or opinion questions.
Non-Response Error: Historically, the GSS has achieved high response rates well above 50
percent, mostly because in-person surveys can attain higher response rates. The 2021 GSS
was conducted using a mail invite to push the respondent to the web. The 2021 GSS Cross-section response rate is 17 percent (which is still high for web surveys). Differential participation rates across selected GSS participants in 2020 relative to previous years could also
contribute to a change in estimates. To help control for this concern, the GSS implemented
an adjustment to known population totals for the 2021 Cross-section round based on a raking approach (i.e., post-stratification weighting) to ensure the weighted totals in the 2021
GSS Cross-section sample are as close as possible to the control totals from the U.S. Census
Bureau estimates by education, sex, age, region of the country, race, and ethnicity.
Coverage Error: In 2021, the GSS had to change how it rostered and ultimately estimated
adults residing in a household. Typically, household rosters have been used in the in-person
methodology to randomly select one adult in the household to complete the survey. That
is, in previous years the GSS has instructed interviewers to complete an initial household
enumeration form to collect some basic data on everyone residing in a household and then
randomly chose one adult with whom to complete the main interview. In the 2021 GSS
such a household enumeration was not possible up front, and the GSS instructed selected
households to identify the adult with the most recent birthday and to report the number of
adults living in the household. It is possible that respondent selection may have happened
at the household level or missed some household residents (for instance, people abroad,
adult children living at home, etc.).
Sampling Error: The 2021 GSS Cross-section relies on scientific sampling to allow for the
calculation of sampling error (i.e., margin of error). As in any scientific sample, any one trend
from year to year can be impacted by the fact we only observed a sample and not the entire
population. The dataset contains survey design variables (clusters, strata, and weights) to
account for the complex survey sample design. It is also possible that our trend estimates are off by sampling error (a “margin of error”) between any given set of years. It is essential, therefore, to control for sampling error by conducting tests of significance for trend difference estimates.
We recommend our users include one of the following statements when reporting
on the GSS 2021 Cross-section data:
Total Survey Error Summary Perspective for the 2021 GSS Cross-section: Changes in opin-
ions, attitudes, and behaviors observed in 2021 relative to historical trends may be due
to actual change in concept over time and/or may have resulted from methodological
changes made to the survey methodology during the COVID-19 global pandemic.
Suggested Statement to Include in Articles and Reports That Use GSS Data
To safeguard the health of staff and respondents during the COVID-19 pandemic, the
2021 GSS data collection used a mail-to-web methodology instead of its traditional in-person interviews. Research and interpretation done using the data should take extra care to
ensure the analysis reflects actual changes in public opinion and is not unduly influenced
by the change in data collection methods. For more information on the 2021 GSS methodology and its implications, please visit https://gss.norc.org/Get-The-Data
Screenshots of Changes
The 2021 GSS Cross-section consisted primarily of a self-administered web questionnaire. The following typical item from this questionnaire displays both the typical layout of a self-administered web questionnaire item and the change to Don’t Know response options discussed in Don’t Know and No Answer Responses.
Figure 1: Visual Display of a Survey Question in the 2021 GSS Cross-section
Appendix A: 2021 GSS Cross-section Outcomes
Table A1: 2021 GSS Cross-section Response Rate Calculation

Status                                               Frequency
Complete                                             4,032
Eligible cases (partial or NIR)                      5,589
Non-interviewed respondent – eligibility not known   19,670
Total                                                27,591
Completes were defined as those cases that completed the full interview or met the data
threshold predefined by the research team to be included with the Completes category.
Partial completes were cases that started the GSS interview but did not meet the threshold to be included with Completes (i.e., completing two-thirds of the core section). Non-interviewed respondent (NIR) cases were split into two categories: eligibility known and
eligibility not known. “Eligibility known” cases were those in which the sampled address
was confirmed as being an occupied residence. “Eligibility not known” cases were those in
which the address was not confirmed as being an occupied residence. Finally, out-of-scope
cases were those in which the sampled address was identified as vacant, a business, or not
an address. Out-of-scope cases were also those in which nobody in the sampled household
spoke English or Spanish, or the selected respondent was ill or incapacitated and could not
complete the interview. A total of 3,561 completes (88 percent) were completed by web self-administered questionnaire (SAQ), with the remainder (471, or 12 percent) completed by phone.
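As a sketch of how the figures in Table A1 produce the reported response rate: AAPOR RR3 divides completes by completes plus known-eligible cases plus an estimated eligible share of the unknown-eligibility cases. NORC's actual eligibility estimate is not reproduced in this excerpt; back-solving from the reported 17.4 percent implies a rate of roughly 0.69:

```python
# AAPOR RR3 = completes / (completes + known eligible + e * unknown eligibility)
completes = 4032
known_eligible = 5589   # partials and known-eligible non-respondents
unknown = 19670
e = 0.69                # back-solved for illustration; not NORC's published figure

rr3 = completes / (completes + known_eligible + e * unknown)
print(round(rr3, 3))    # 0.174, matching the reported 17.4%
```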
APPENDIX B: Other GSS Documentation
GSS Codebook
This is the legacy codebook that combines all of the notes, appendices, and frequency
tables of the GSS since its inception in 1972 up through 2018. This is useful to see how
trends have changed over the decades, find question-level wording, and find when questions were added or discontinued.
GSS Data Explorer
A web-based interactive tool to access and analyze GSS data, question wording, and ballot timing. The Data Explorer is helpful to visualize individual variables’ trend lines or extract
data into various software packages. The Data Explorer tool is going through renovations
and it is expected to be upgraded in December of 2021.
The Codebook Guide to GSS Variables
1) ID
Respondent's ID number
RANGE: 1 to 4471
N Mean Std. Deviation
Total 4032 2221.813 1289.407
2) WRKSTAT
Last week were you working full time, part time, going to school, keeping
house, or what?
RANGE: 1 to 8
N Mean Std. Deviation
Total 4024 3.111 2.254
1) Working full time 1776 44.1
2) Working part time 365 9.1
3) With a job, but not at work because of temporary illness, vacation,
strike 104 2.6
4) Unemployed, laid off, looking for work 265 6.6
5) Retired 993 24.7
6) In school 88 2.2
7) Keeping house 315 7.8
8) Other 118 2.9
Missing 8
3) HRS1
How many hours did you work last week, at all jobs?
Responses greater than 89 were coded 89.
RANGE: 0 to 89
N Mean Std. Deviation
Total 2116 39.98 13.199
Missing 1916
4) WRKSLF
(Are/were) you self-employed or (do/did) you work for someone else?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3925 1.89 0.313
1) Self-employed 432 11.0
2) Someone else 3493 89.0
Missing 107
5) PRESTG10
Prestige of respondent's occupation
RANGE: 16 to 80
N Mean Std. Deviation
Total 3873 46.544 13.811
6) MARITAL
Are you currently - married, widowed, divorced, separated, or have you
never been married?
RANGE: 1 to 5
N Mean Std. Deviation
Total 4023 2.438 1.655
1) Married 1999 49.7
2) Widowed 301 7.5
3) Divorced 655 16.3
4) Separated 96 2.4
5) Never married 972 24.2
Missing 9
7) DIVORCE
IF CURRENTLY MARRIED OR WIDOWED: Have you ever been divorced or legally
separated?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2300 1.737 0.44
1) Yes 605 26.3
2) No 1695 73.7
Missing 1732
8) WIDOWED
IF CURRENTLY MARRIED, SEPARATED, OR DIVORCED: Have you ever been widowed?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2752 1.983 0.131
1) Yes 48 1.7
2) No 2704 98.3
Missing 1280
9) PAWRKSLF
Was your [father/stepfather/male relative you were living with when you
were 16] an employee, self-employed without employees, or self-employed
with employees?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3349 1.758 0.428
1) Self-employed 810 24.2
2) Someone else 2539 75.8
Missing 683
10) PAPRES10
Prestige of respondent's father's occupation
RANGE: 16 to 80
N Mean Std. Deviation
Total 3349 45.157 13.148
11) MAWRKSLF
At this job, was [mother/stepmother/female relative you were living with
when you were 16] an employee, self-employed without employees, or self-employed with employees?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2797 1.897 0.304
1) Self-employed 287 10.3
2) Someone else 2510 89.7
Missing 1235
12) MAPRES10
Prestige of respondent's mother's occupation
RANGE: 16 to 80
N Mean Std. Deviation
Total 2767 42.664 13.168
13) SIBS
How many brothers and sisters did you have? Please count those born alive,
but no longer living, as well as those alive now. Also include stepbrothers and stepsisters, and children adopted by your parents.
RANGE: 0 to 35
N Mean Std. Deviation
Total 3968 3.13 2.646
Missing 64
14) CHILDS
How many children have you ever had? Please count all that were born alive
at any time (including any you had from a previous marriage).
RANGE: 0 to 8
N Mean Std. Deviation
Total 3983 1.7 1.526
0) None 1163 29.2
1) One 646 16.2
2) Two 1152 28.9
3) Three 578 14.5
4) Four 277 7.0
5) Five 79 2.0
6) Six 52 1.3
7) Seven 17 0.4
8) Eight or more 19 0.5
Missing 49
15) AGE
Respondent's age
RANGE: 18 to 89
N Mean Std. Deviation
Total 3699 52.165 17.233
Missing 333
16) AGEKDBRN
How old were you when your first child was born?
RANGE: 9 to 57
N Mean Std. Deviation
Total 2803 25.47 6.192
17) EDUC
Respondent's education
RANGE: 0 to 20
N Mean Std. Deviation
Total 3966 14.769 2.8
Missing 66
18) PAEDUC
What is the highest grade in elementary school or high school that your
father finished and got credit for?
RANGE: 0 to 20
N Mean Std. Deviation
Total 3090 12.546 3.809
Missing 942
19) MAEDUC
What is the highest grade in elementary school or high school that your
mother finished and got credit for?
RANGE: 0 to 20
N Mean Std. Deviation
Total 3613 12.504 3.294
Missing 419
20) DEGREE
Respondent's degree
RANGE: 0 to 4
N Mean Std. Deviation
Total 4009 2.116 1.283
0) Less than high school 246 6.1
1) High school 1597 39.8
2) Associate/junior degree 370 9.2
3) Bachelor's 1036 25.8
4) Graduate 760 19.0
Missing 23
21) SEX
Respondent's sex
RANGE: 1 to 2
N Mean Std. Deviation
Total 3940 1.559 0.497
1) Male 1736 44.1
2) Female 2204 55.9
Missing 92
22) RACE
What race do you consider yourself?
RANGE: 1 to 3
N Mean Std. Deviation
Total 3978 1.32 0.649
1) White 3110 78.2
2) Black 463 11.6
3) Other 405 10.2
Missing 54
23) RES16
Which of these categories comes closest to the type of place you were living in when you were 16 years old?
RANGE: 1 to 6
N Mean Std. Deviation
Total 4029 3.824 1.5
1) In open country but not on a farm 406 10.1
2) Farm 204 5.1
3) In a small city or town (under 50,000) 1208 30.0
4) In a medium-size city (50,000-250,000) 777 19.3
5) In a suburb near a large city 742 18.4
6) In a large city (over 250,000) 692 17.2
Missing 3
24) REG16
In what state or foreign country were you living when you were 16 years
old?
RANGE: 1 to 9
N Mean Std. Deviation
Total 4018 5.128 2.621
1) New England 176 4.4
2) Middle Atlantic 543 13.5
3) East North Central 812 20.2
4) West North Central 335 8.3
5) South Atlantic 536 13.3
6) East South Central 235 5.8
7) West South Central 356 8.9
8) Mountain 242 6.0
9) Pacific 783 19.5
Missing 14
25) MOBILE16
IF STATE NAMED IS SAME STATE R. LIVES IN NOW, ASK MOBILE16: When you were
16 years old, were you living in this same (city/town/county)?
RANGE: 1 to 3
N Mean Std. Deviation
Total 3608 2.039 0.8
1) Same state, same city 1087 30.1
2) Same state, different city 1294 35.9
3) Different state 1227 34.0
Missing 424
26) MAWRKGRW
Did your mother ever work for pay for as long as a year, while you were
growing up?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3770 1.242 0.429
1) Yes 2856 75.8
2) No 914 24.2
Missing 262
27) INCOM16
Thinking about the time when you were 16 years old, compared with American
families in general then, would you say your family income was: far below
average, below average, average, above average, or far above average?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3826 2.739 0.952
1) Far below average 421 11.0
2) Below average 1013 26.5
3) Average 1625 42.5
4) Above average 679 17.7
5) Far above average 88 2.3
Missing 206
28) BORN
Were you born in this country?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3960 1.112 0.316
1) Yes 3516 88.8
2) No 444 11.2
Missing 72
29) GRANBORN
(Were all of your four grandparents born in this country?...) IF NO: how
many were born outside the United States?
RANGE: 0 to 4
N Mean Std. Deviation
Total 3633 0.979 1.52
0) None 2379 65.5
1) One 220 6.1
2) Two 350 9.6
3) Three 99 2.7
4) Four 585 16.1
Missing 399
30) MABORN
Was (your mother/your stepmother/the female relative you were living with
at 16) born in this country?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3939 1.159 0.366
1) Yes 3312 84.1
2) No 627 15.9
Missing 93
31) PABORN
Was (your father/your stepfather/the male relative you were living with at
16) born in this country?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3918 1.166 0.372
1) Yes 3269 83.4
2) No 649 16.6
Missing 114
32) SEXBIRTH1
Was your sex recorded as male or female at birth?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3928 1.56 0.497
1) Male 1730 44.0
2) Female 2198 56.0
Missing 104
33) REGION
Region of interview
RANGE: 1 to 9
N Mean Std. Deviation
Total 4032 5.192 2.454
1) New England 203 5.0
2) Middle Atlantic 414 10.3
3) East North Central 676 16.8
4) West North Central 314 7.8
5) South Atlantic 800 19.8
6) East South Central 270 6.7
7) West South Central 426 10.6
8) Mountain 345 8.6
9) Pacific 584 14.5
34) PARTYID
Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?
RANGE: 0 to 7
N Mean Std. Deviation
Total 4000 2.776 2.135
0) Strong Democrat 822 20.6
1) Not very strong Democrat 541 13.5
2) Independent, close to Democrat 471 11.8
3) Independent (neither, no response) 817 20.4
4) Independent, close to Republican 327 8.2
5) Not very strong Republican 384 9.6
6) Strong Republican 524 13.1
7) Other party 114 2.9
Missing 32
35) VOTE16
In 2016, you remember that Hillary Clinton ran for president on the Democratic ticket against Donald Trump for the Republicans. Do you remember for
sure whether or not you voted in that election?
RANGE: 1 to 3
N Mean Std. Deviation
Total 3703 1.28 0.567
1) Voted 2886 77.9
2) Did not vote 596 16.1
3) Ineligible 221 6.0
Missing 329
36) PRES16
Did you vote for Hillary Clinton or Donald Trump?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2764 1.551 0.69
1) Clinton 1509 54.6
2) Trump 1037 37.5
3) Other candidate (please specify) 169 6.1
4) Didn't vote for President 49 1.8
Missing 1268
37) IF16WHO
Who would you have voted for, for president, if you had voted?
RANGE: 1 to 3
N Mean Std. Deviation
Total 761 1.859 0.809
1) Clinton 310 40.7
2) Trump 248 32.6
3) Other 203 26.7
Missing 3271
38) POLVIEWS
We hear a lot of talk these days about liberals and conservatives. I'm
going to show you a seven-point scale on which the political views that
people might hold are arranged from extremely liberal - point one - to
extremely conservative - point seven. Where would you place yourself on
this scale?
RANGE: 1 to 7
N Mean Std. Deviation
Total 3964 3.968 1.536
1) Extremely liberal 207 5.2
2) Liberal 623 15.7
3) Slightly liberal 490 12.4
4) Moderate, middle of the road 1377 34.7
5) Slightly conservative 476 12.0
6) Conservative 617 15.6
7) Extremely conservative 174 4.4
Missing 68
39) NATSPAC
(Let's begin with some things people think about today. We are faced with
many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd
like you to tell me whether you think we're spending too much money on it,
too little money, or about the right amount.) Space exploration program
RANGE: 1 to 3
N Mean Std. Deviation
Total 1969 1.975 0.685
1) Too little 488 24.8
2) About right 1043 53.0
3) Too much 438 22.2
Missing 2063
40) NATENVIR
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Improving and protecting the environment
RANGE: 1 to 3
N Mean Std. Deviation
Total 1979 1.401 0.658
1) Too little 1376 69.5
2) About right 413 20.9
3) Too much 190 9.6
Missing 2053
41) NATHEAL
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Improving and protecting the nation's health
RANGE: 1 to 3
N Mean Std. Deviation
Total 1966 1.418 0.644
1) Too little 1314 66.8
2) About right 483 24.6
3) Too much 169 8.6
Missing 2066
42) NATCITY
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Solving the problems of big cities
RANGE: 1 to 3
N Mean Std. Deviation
Total 1949 1.661 0.763
1) Too little 1009 51.8
2) About right 591 30.3
3) Too much 349 17.9
Missing 2083
43) NATCRIME
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Halting the rising crime rate
RANGE: 1 to 3
N Mean Std. Deviation
Total 1969 1.449 0.644
1) Too little 1249 63.4
2) About right 556 28.2
3) Too much 164 8.3
Missing 2063
44) NATDRUG
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Dealing with drug addiction
RANGE: 1 to 3
N Mean Std. Deviation
Total 1964 1.466 0.647
1) Too little 1216 61.9
2) About right 581 29.6
3) Too much 167 8.5
Missing 2068
45) NATEDUC
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Improving the nation's education system
RANGE: 1 to 3
N Mean Std. Deviation
Total 1976 1.349 0.618
1) Too little 1439 72.8
2) About right 384 19.4
3) Too much 153 7.7
Missing 2056
46) NATRACE
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Improving the condition of Blacks
RANGE: 1 to 3
N Mean Std. Deviation
Total 1959 1.654 0.759
1) Too little 1020 52.1
2) About right 596 30.4
3) Too much 343 17.5
Missing 2073
47) NATARMS
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) The
military, armaments, and defense
RANGE: 1 to 3
N Mean Std. Deviation
Total 1974 2.068 0.764
1) Too little 513 26.0
2) About right 814 41.2
3) Too much 647 32.8
Missing 2058
48) NATAID
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Foreign aid
RANGE: 1 to 3
N Mean Std. Deviation
Total 1969 2.472 0.655
1) Too little 177 9.0
2) About right 686 34.8
3) Too much 1106 56.2
Missing 2063
49) NATFARE
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Welfare
RANGE: 1 to 3
N Mean Std. Deviation
Total 1970 2.023 0.797
1) Too little 604 30.7
2) About right 717 36.4
3) Too much 649 32.9
Missing 2062
50) NATROAD
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Highways and bridges
RANGE: 1 to 3
N Mean Std. Deviation
Total 4004 1.485 0.599
1) Too little 2281 57.0
2) About right 1504 37.6
3) Too much 219 5.5
Missing 28
51) NATSOC
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Social
Security
RANGE: 1 to 3
N Mean Std. Deviation
Total 4002 1.478 0.59
1) Too little 2286 57.1
2) About right 1518 37.9
3) Too much 198 4.9
Missing 30
52) NATMASS
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Mass
transportation
RANGE: 1 to 3
N Mean Std. Deviation
Total 3996 1.675 0.62
1) Too little 1628 40.7
2) About right 2039 51.0
3) Too much 329 8.2
Missing 36
53) NATPARK
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Parks
and recreation
RANGE: 1 to 3
N Mean Std. Deviation
Total 4002 1.684 0.545
1) Too little 1428 35.7
2) About right 2412 60.3
3) Too much 162 4.0
Missing 30
54) NATCHLD
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Assistance for childcare
RANGE: 1 to 3
N Mean Std. Deviation
Total 3984 1.518 0.641
1) Too little 2244 56.3
2) About right 1418 35.6
3) Too much 322 8.1
Missing 48
55) NATSCI
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Supporting scientific research
RANGE: 1 to 3
N Mean Std. Deviation
Total 3996 1.62 0.622
1) Too little 1820 45.5
2) About right 1873 46.9
3) Too much 303 7.6
Missing 36
56) NATENRGY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Developing alternative energy sources
RANGE: 1 to 3
N Mean Std. Deviation
Total 4008 1.54 0.678
1) Too little 2265 56.5
2) About right 1321 33.0
3) Too much 422 10.5
Missing 24
57) NATSPACY
(Let's begin with some things people think about today. We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Space exploration
RANGE: 1 to 3
N Mean Std. Deviation
Total 2030 2.018 0.687
1) Too little 461 22.7
2) About right 1071 52.8
3) Too much 498 24.5
Missing 2002
58) NATENVIY
(We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name some of these problems, and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Improving and protecting the environment
RANGE: 1 to 3
N Mean Std. Deviation
Total 2040 1.399 0.648
1) Too little 1410 69.1
2) About right 447 21.9
3) Too much 183 9.0
Missing 1992
59) NATHEALY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Health
RANGE: 1 to 3
N Mean Std. Deviation
Total 2025 1.47 0.711
1) Too little 1333 65.8
2) About right 432 21.3
3) Too much 260 12.8
Missing 2007
60) NATCITYY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Assistance to big cities
RANGE: 1 to 3
N Mean Std. Deviation
Total 2014 2.058 0.76
1) Too little 526 26.1
2) About right 846 42.0
3) Too much 642 31.9
Missing 2018
61) NATCRIMY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Law
enforcement
RANGE: 1 to 3
N Mean Std. Deviation
Total 2037 1.806 0.778
1) Too little 853 41.9
2) About right 727 35.7
3) Too much 457 22.4
Missing 1995
62) NATDRUGY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Drug
rehabilitation
RANGE: 1 to 3
N Mean Std. Deviation
Total 2023 1.495 0.671
1) Too little 1225 60.6
2) About right 595 29.4
3) Too much 203 10.0
Missing 2009
63) NATEDUCY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.) Education
RANGE: 1 to 3
N Mean Std. Deviation
Total 2031 1.321 0.607
1) Too little 1532 75.4
2) About right 346 17.0
3) Too much 153 7.5
Missing 2001
64) NATRACEY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Assistance to Blacks
RANGE: 1 to 3
N Mean Std. Deviation
Total 2006 1.728 0.762
1) Too little 930 46.4
2) About right 692 34.5
3) Too much 384 19.1
Missing 2026
65) NATARMSY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.)
National defense
RANGE: 1 to 3
N Mean Std. Deviation
Total 2033 2.046 0.74
1) Too little 512 25.2
2) About right 915 45.0
3) Too much 606 29.8
Missing 1999
66) NATAIDY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Assistance to other countries
RANGE: 1 to 3
N Mean Std. Deviation
Total 2031 2.501 0.67
1) Too little 202 9.9
2) About right 609 30.0
3) Too much 1220 60.1
Missing 2001
67) NATFAREY
(We are faced with many problems in this country, none of which can be
solved easily or inexpensively. I'm going to name some of these problems,
and for each one I'd like you to tell me whether you think we're spending
too much money on it, too little money, or about the right amount.) Assistance to the poor
RANGE: 1 to 3
N Mean Std. Deviation
Total 2038 1.406 0.648
1) Too little 1392 68.3
2) About right 464 22.8
3) Too much 182 8.9
Missing 1994
68) EQWLTH
Some people think that the government in Washington ought to reduce the income differences between the rich and the poor, perhaps by raising the taxes of wealthy families or by giving income assistance to the poor. Others think that the government should not concern itself with reducing this income difference between the rich and the poor. Here is a card with a scale from one to seven. Think of a score of one as meaning that the government ought to reduce the income differences between rich and poor, and a score of seven meaning that the government should not concern itself with reducing income differences. What score between one and seven comes closest to the way you feel?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2661 3.385 2.2
1) The government should reduce income differences 882 33.1
2) 242 9.1
3) 325 12.2
4) 389 14.6
5) 249 9.4
6) 153 5.7
7) The government should not concern itself with reducing income differences 421 15.8
Missing 1371
69) TAX
Do you consider the amount of federal income tax which you have to pay as
too high, about right, or too low?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2634 1.467 0.552
1) Too high 1478 56.1
2) About right 1082 41.1
3) Too low 74 2.8
Missing 1398
70) SPKATH
There are always some people whose ideas are considered bad or dangerous
by other people. For instance, somebody who is against all churches and
religion... If such a person wanted to make a speech in your (city/town/
community) against churches and religion, should he be allowed to speak,
or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1312 1.186 0.389
1) Yes, allowed to speak 1068 81.4
2) Not allowed 244 18.6
Missing 2720
71) COLATH
(There are always some people whose ideas are considered bad or dangerous
by other people. For instance, somebody who is against all churches and
religion...) Should such a person be allowed to teach in a college or university, or not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 2647 4.304 0.46
4) Yes, allowed to teach 1842 69.6
5) Not allowed 805 30.4
Missing 1385
72) LIBATH
(There are always some people whose ideas are considered bad or dangerous
by other people. For instance, somebody who is against all churches and
religion...) If some people in your community suggested that a book he
wrote against churches and religion should be taken out of your public
library, would you favor removing this book, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1310 1.844 0.363
1) Remove 204 15.6
2) Not remove 1106 84.4
Missing 2722
73) SPKRAC
Or consider a person who believes that Blacks are genetically inferior. If
such a person wanted to make a speech in your community claiming that
Blacks are inferior, should he be allowed to speak, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1310 1.512 0.5
1) Yes, allowed to speak 639 48.8
2) Not allowed 671 51.2
Missing 2722
74) COLRAC
(Or consider a person who believes that Blacks are genetically inferior...) Should such a person be allowed to teach in a college or university, or not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 2633 4.672 0.47
4) Yes, allowed to teach 864 32.8
5) Not allowed 1769 67.2
Missing 1399
75) LIBRAC
(Or consider a person who believes that Blacks are genetically inferior...) If some people in your community suggested that a book he wrote
which said Blacks are inferior should be taken out of your public library,
would you favor removing this book, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1301 1.613 0.487
1) Remove 503 38.7
2) Not remove 798 61.3
Missing 2731
76) SPKCOM
Now, we would like to ask you some questions about a man who admits he is
a Communist. Suppose this admitted Communist wanted to make a speech in
your community. Should he be allowed to speak, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1305 1.266 0.442
1) Yes, allowed to speak 958 73.4
2) Not allowed 347 26.6
Missing 2727
77) COLCOM
(Now, we would like to ask you some questions about a man who admits he is
a Communist...) Suppose he is teaching in a college. Should he be fired,
or not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 1293 4.704 0.457
4) Yes, fired 383 29.6
5) Not fired 910 70.4
Missing 2739
78) LIBCOM
(Now, we would like to ask you some questions about a man who admits he is
a Communist...) Suppose he wrote a book which is in your public library.
Somebody in your community suggests that the book should be removed from
the library. Would you favor removing it, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1303 1.787 0.41
1) Remove 278 21.3
2) Not remove 1025 78.7
Missing 2729
79) SPKMIL
Consider a person who advocates doing away with elections and letting the
military run the country. If such a person wanted to make a speech in your
community, should he be allowed to speak, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1301 1.381 0.486
1) Yes, allowed to speak 805 61.9
2) Not allowed 496 38.1
Missing 2731
80) COLMIL
(Consider a person who advocates doing away with elections and letting the
military run the country...) Should such a person be allowed to teach in a
college or university, or not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 2637 4.529 0.499
4) Yes, allowed to teach 1243 47.1
5) Not allowed 1394 52.9
Missing 1395
81) LIBMIL
Suppose he wrote a book advocating doing away with elections and letting
the military run the country. Somebody in your community suggests that the
book be removed from the public library. Would you favor removing it, or
not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1302 1.699 0.459
1) Remove 392 30.1
2) Not remove 910 69.9
Missing 2730
82) SPKHOMO
And what about a man who admits that he is homosexual... Suppose this
admitted homosexual wanted to make a speech in your community. Should he
be allowed to speak, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1304 1.083 0.276
1) Yes, allowed to speak 1196 91.7
2) Not allowed 108 8.3
Missing 2728
83) COLHOMO
Should such a person be allowed to teach in a college or university, or
not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 2656 4.069 0.253
4) Yes, allowed to teach 2473 93.1
5) Not allowed 183 6.9
Missing 1376
84) LIBHOMO
(And what about a man who admits that he is homosexual...) If some people
in your community suggested that a book he wrote in favor of homosexuality
should be taken out of your public library, would you favor removing this
book, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1308 1.867 0.34
1) Remove 174 13.3
2) Not Remove 1134 86.7
Missing 2724
85) SPKMSLM
Now consider a Muslim clergyman who preaches hatred of the United States.
If such a person wanted to make a speech in your community preaching
hatred of the United States, should he be allowed to speak, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1307 1.519 0.5
1) Yes, allowed 629 48.1
2) Not allowed 678 51.9
Missing 2725
86) COLMSLM
Should such a person be allowed to teach in a college or university, or
not?
RANGE: 4 to 5
N Mean Std. Deviation
Total 2646 4.67 0.47
4) Yes, allowed to teach 874 33.0
5) Not allowed 1772 67.0
Missing 1386
87) LIBMSLM
(Now consider a Muslim clergyman who preaches hatred of the United
States...) If some people in your community suggested that a book he wrote
which preaches hatred of the United States should be taken out of your public library, would you favor removing this book, or not?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1304 1.564 0.496
1) Remove 569 43.6
2) Not remove 735 56.4
Missing 2728
88) CAPPUN
Do you favor or oppose the death penalty for persons convicted of murder?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3957 1.438 0.496
1) Favor 2222 56.2
2) Oppose 1735 43.8
Missing 75
89) GUNLAW
Would you favor or oppose a law which would require a person to obtain a
police permit before he or she could buy a gun?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3992 1.327 0.469
1) Favor 2686 67.3
2) Oppose 1306 32.7
Missing 40
90) RACLIVE
Are there any ('Whites' for Black respondents, 'Blacks' for non-Black
respondents) living in this neighborhood now?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3587 1.176 0.381
1) Yes 2955 82.4
2) No 632 17.6
Missing 445
91) RELIG
What is your religious preference? Is it Protestant, Catholic, Jewish,
some other religion, or no religion?
RANGE: 1 to 13
N Mean Std. Deviation
Total 3951 2.774 2.358
1) Protestant 1590 40.2
2) Catholic 824 20.9
3) Jewish 75 1.9
4) None 1121 28.4
5) Other 55 1.4
6) Buddhism 47 1.2
7) Hinduism 30 0.8
8) Other Eastern religions 2 0.1
9) Muslim/Islam 25 0.6
10) Orthodox Christian 37 0.9
11) Christian 124 3.1
12) Native American 3 0.1
13) Inter/nondenominational 18 0.5
Missing 81
92) FUND
Fundamentalism/liberalism of respondent's religion
RANGE: 1 to 3
N Mean Std. Deviation
Total 3742 2.255 0.719
1) Fundamentalist 612 16.4
2) Moderate 1565 41.8
3) Liberal 1565 41.8
Missing 290
93) ATTEND
How often do you attend religious services? (USE CATEGORIES AS PROBES, IF
NECESSARY.)
RANGE: 0 to 8
N Mean Std. Deviation
Total 3962 2.853 2.756
0) Never 1178 29.7
1) Less than once a year 565 14.3
2) About once or twice a year 453 11.4
3) Several times a year 403 10.2
4) About once a month 122 3.1
5) Two to three times a month 200 5.0
6) Nearly every week 331 8.4
7) Every week 532 13.4
8) Several times a week 178 4.5
Missing 70
94) PRAY
About how often do you pray? (USE CATEGORIES AS PROBES.)
RANGE: 1 to 6
N Mean Std. Deviation
Total 3955 3.255 1.968
1) Several times a day 1139 28.8
2) Once a day 656 16.6
3) Several times a week 542 13.7
4) Once a week 175 4.4
5) Less than once a week 560 14.2
6) Never 883 22.3
Missing 77
95) SPREL
What is your (SPOUSE'S) religious preference? Is it Protestant, Catholic,
Jewish, some other religion, or no religion?
RANGE: 1 to 7
N Mean Std. Deviation
Total 1639 2.422 1.451
1) Protestant 590 36.0
2) Catholic 465 28.4
3) Jewish 27 1.6
4) None 486 29.7
5) Other 26 1.6
6) Buddhism 20 1.2
7) Hinduism 25 1.5
Missing 2393
96) AFFRMACT
Some people say that because of past discrimination, Blacks should be
given preference in hiring and promotion. Others say that such preference
in hiring and promotion of Blacks is wrong because it discriminates
against Whites. What about your opinion? Are you for or against preferential hiring and promotion of Blacks? IF FAVORS: Do you favor preference in
hiring and promotion strongly or not strongly? IF OPPOSES: Do you oppose
preference in hiring and promotion strongly or not strongly?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2628 3.063 1.041
1) Strongly favors 349 13.3
2) Not strongly favors 299 11.4
3) Not strongly opposes 818 31.1
4) Strongly opposes 1162 44.2
Missing 1404
97) WRKWAYUP
Do you agree strongly, agree somewhat, neither agree nor disagree, disagree somewhat, or disagree strongly with the following statement (HAND CARD TO RESPONDENT) Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without special favors.
RANGE: 1 to 5
N Mean Std. Deviation
Total 2688 2.878 1.418
1) Agree strongly 609 22.7
2) Agree somewhat 536 19.9
3) Neither agree nor disagree 640 23.8
4) Disagree somewhat 380 14.1
5) Disagree strongly 523 19.5
Missing 1344
98) HAPPY
Taken all together, how would you say things are these days - would you
say that you are very happy, pretty happy, or not too happy?
RANGE: 1 to 3
N Mean Std. Deviation
Total 4014 2.035 0.651
1) Very happy 783 19.5
2) Pretty happy 2308 57.5
3) Not too happy 923 23.0
Missing 18
99) HAPMAR
(IF CURRENTLY MARRIED, ASK HAPMAR) Taking things all together, how would
you describe your marriage? Would you say that your marriage is very
happy, pretty happy, or not too happy?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1986 1.427 0.565
1) Very happy 1211 61.0
2) Pretty happy 701 35.3
3) Not too happy 74 3.7
Missing 2046
100) HEALTH
Would you say your own health, in general, is excellent, good, fair, or
poor?
RANGE: 1 to 4
N Mean Std. Deviation
Total 4023 2.06 0.739
1) Excellent 835 20.8
2) Good 2264 56.3
3) Fair 773 19.2
4) Poor 151 3.8
Missing 9
101) LIFE
In general, do you find life exciting, pretty routine, or dull?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2669 1.691 0.562
1) Exciting 962 36.0
2) Routine 1571 58.9
3) Dull 136 5.1
Missing 1363
102) CONFINAN
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Banks and financial institutions
RANGE: 1 to 3
N Mean Std. Deviation
Total 2660 2.039 0.633
1) A great deal 482 18.1
2) Only some 1592 59.8
3) Hardly any 586 22.0
Missing 1372
103) CONBUS
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Major companies
RANGE: 1 to 3
N Mean Std. Deviation
Total 2656 2.048 0.62
1) A great deal 450 16.9
2) Only some 1628 61.3
3) Hardly any 578 21.8
Missing 1376
104) CONCLERG
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Organized religion
RANGE: 1 to 3
N Mean Std. Deviation
Total 2650 2.183 0.669
1) A great deal 394 14.9
2) Only some 1377 52.0
3) Hardly any 879 33.2
Missing 1382
105) CONEDUC
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Education
RANGE: 1 to 3
N Mean Std. Deviation
Total 2658 2.052 0.619
1) A great deal 443 16.7
2) Only some 1633 61.4
3) Hardly any 582 21.9
Missing 1374
106) CONFED
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Executive branch of the federal government
RANGE: 1 to 3
N Mean Std. Deviation
Total 2658 2.318 0.686
1) A great deal 336 12.6
2) Only some 1140 42.9
3) Hardly any 1182 44.5
Missing 1374
107) CONLABOR
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Organized labor
RANGE: 1 to 3
N Mean Std. Deviation
Total 2648 2.151 0.594
1) A great deal 297 11.2
2) Only some 1655 62.5
3) Hardly any 696 26.3
Missing 1384
108) CONPRESS
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Press
RANGE: 1 to 3
N Mean Std. Deviation
Total 2654 2.358 0.679
1) A great deal 306 11.5
2) Only some 1093 41.2
3) Hardly any 1255 47.3
Missing 1378
109) CONMEDIC
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Medicine
RANGE: 1 to 3
N Mean Std. Deviation
Total 2662 1.69 0.631
1) A great deal 1070 40.2
2) Only some 1346 50.6
3) Hardly any 246 9.2
Missing 1370
110) CONTV
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) TV
RANGE: 1 to 3
N Mean Std. Deviation
Total 2660 2.341 0.617
1) A great deal 207 7.8
2) Only some 1340 50.4
3) Hardly any 1113 41.8
Missing 1372
111) CONJUDGE
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) U.S. Supreme Court
RANGE: 1 to 3
N Mean Std. Deviation
Total 2662 1.943 0.676
1) A great deal 689 25.9
2) Only some 1437 54.0
3) Hardly any 536 20.1
Missing 1370
112) CONSCI
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Scientific community
RANGE: 1 to 3
N Mean Std. Deviation
Total 2654 1.563 0.616
1) A great deal 1337 50.4
2) Only some 1141 43.0
3) Hardly any 176 6.6
Missing 1378
113) CONLEGIS
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Congress
RANGE: 1 to 3
N Mean Std. Deviation
Total 2661 2.485 0.597
1) A great deal 141 5.3
2) Only some 1089 40.9
3) Hardly any 1431 53.8
Missing 1371
114) CONARMY
(I am going to name some institutions in this country. As far as the people running this institution are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?) Military
RANGE: 1 to 3
N Mean Std. Deviation
Total 2656 1.624 0.653
1) A great deal 1254 47.2
2) Only some 1147 43.2
3) Hardly any 255 9.6
Missing 1376
115) OBEY
(If you had to choose, which thing on this list would you pick as the most
important for a child to learn to prepare him or her for life? Which comes
next in importance? Which comes third? Which comes fourth?) To obey
RANGE: 1 to 5
N Mean Std. Deviation
Total 2573 3.875 1.027
1) First 127 4.9
2) Second 160 6.2
3) Third 294 11.4
4) Fourth 1318 51.2
5) Fifth 674 26.2
Missing 1459
116) POPULAR
(If you had to choose, which thing on this list would you pick as the most
important for a child to learn to prepare him or her for life? Which comes
next in importance? Which comes third? Which comes fourth?) To be well-liked or popular
RANGE: 1 to 5
N Mean Std. Deviation
Total 2573 4.649 0.64
1) First 21 0.8
2) Second 20 0.8
3) Third 48 1.9
4) Fourth 663 25.8
5) Fifth 1821 70.8
Missing 1459
117) THNKSELF
(If you had to choose, which thing on this list would you pick as the most
important for a child to learn to prepare him or her for life? Which comes
next in importance? Which comes third? Which comes fourth?) To think for
himself or herself
RANGE: 1 to 5
N Mean Std. Deviation
Total 2573 1.93 1.111
1) First 1268 49.3
2) Second 581 22.6
3) Third 411 16.0
4) Fourth 262 10.2
5) Fifth 51 2.0
Missing 1459
118) WORKHARD
(If you had to choose, which thing on this list would you pick as the most
important for a child to learn to prepare him or her for life? Which comes
next in importance? Which comes third? Which comes fourth?) To work hard
RANGE: 1 to 5
N Mean Std. Deviation
Total 2573 2.201 0.865
1) First 624 24.3
2) Second 934 36.3
3) Third 894 34.7
4) Fourth 116 4.5
5) Fifth 5 0.2
Missing 1459
119) HELPOTH
(If you had to choose, which thing on this list would you pick as the most
important for a child to learn to prepare him or her for life? Which comes
next in importance? Which comes third? Which comes fourth?) To help others
when they need help
RANGE: 1 to 5
N Mean Std. Deviation
Total 2573 2.345 0.926
1) First 533 20.7
2) Second 878 34.1
3) Third 926 36.0
4) Fourth 214 8.3
5) Fifth 22 0.9
Missing 1459
120) SOCREL
(Would you use this card and tell me which answer comes closest to how
often you do the following things...) Spend a social evening with relatives?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2705 3.912 1.615
1) Almost daily 191 7.1
2) Once or twice a week 438 16.2
3) Several times a month 478 17.7
4) About once a month 481 17.8
5) Several times a year 694 25.7
6) About once a year 275 10.2
7) Never 148 5.5
Missing 1327
121) SOCOMMUN
(Would you use this card and tell me which answer comes closest to how
often you do the following things...) Spend a social evening with someone
who lives in your neighborhood?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2705 5.277 1.78
1) Almost daily 39 1.4
2) Once or twice a week 235 8.7
3) Several times a month 272 10.1
4) About once a month 322 11.9
5) Several times a year 438 16.2
6) About once a year 323 11.9
7) Never 1076 39.8
Missing 1327
122) SOCFREND
(Would you use this card and tell me which answer comes closest to how
often you do the following things...) Spend a social evening with friends
who live outside the neighborhood?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2701 4.473 1.508
1) Almost daily 34 1.3
2) Once or twice a week 253 9.4
3) Several times a month 467 17.3
4) About once a month 554 20.5
5) Several times a year 777 28.8
6) About once a year 273 10.1
7) Never 343 12.7
Missing 1331
123) SOCBAR
(Would you use this card and tell me which answer comes closest to how
often you do the following things...) Go to a bar or tavern?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2702 5.688 1.465
1) Almost daily 5 0.2
2) Once or twice a week 91 3.4
3) Several times a month 190 7.0
4) About once a month 271 10.0
5) Several times a year 513 19.0
6) About once a year 461 17.1
7) Never 1171 43.3
Missing 1330
124) JOBLOSE
Thinking about the next 12 months, how likely do you think it is that you
will lose your job or be laid off--very likely, fairly likely, not too
likely, or not at all likely?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1495 3.395 0.706
1) Very likely 34 2.3
2) Fairly likely 92 6.2
3) Not too likely 619 41.4
4) Not likely 750 50.2
Missing 2537
125) JOBFIND
About how easy would it be for you to find a job with another employer
with approximately the same income and fringe benefits you now have? Would
you say very easy, somewhat easy, or not easy at all?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1489 2.269 0.703
1) Very easy 221 14.8
2) Somewhat easy 647 43.5
3) Not easy 621 41.7
Missing 2543
126) SATJOB
On the whole, how satisfied are you with the work you do—would you say you
are very satisfied, moderately satisfied, a little dissatisfied, or very
dissatisfied?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2734 1.755 0.809
1) Very satisfied 1198 43.8
2) Moderately satisfied 1119 40.9
3) A little dissatisfied 305 11.2
4) Very dissatisfied 112 4.1
Missing 1298
127) RICHWORK
(IF CURRENTLY WORKING OR TEMPORARILY NOT AT WORK, ASK RICHWORK.) If you
were to get enough money to live as comfortably as you would like for the
rest of your life, would you continue to work or would you stop working?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1624 1.382 0.486
1) Continue to work 1004 61.8
2) Stop working 620 38.2
Missing 2408
128) CLASS
If you were asked to use one of four names for your social class, which
would you say you belong in: the lower class, the working class, the middle class, or the upper class?
RANGE: 1 to 4
N Mean Std. Deviation
Total 4018 2.495 0.713
1) Lower class 349 8.7
2) Working class 1501 37.4
3) Middle class 1999 49.8
4) Upper class 169 4.2
Missing 14
129) RANK
(In our society there are groups which tend to be toward the top and those
that are toward the bottom. Here we have a scale that runs from top to bottom...) Where would you put yourself on this scale?
RANGE: 1 to 10
N Mean Std. Deviation
Total 1966 4.58 1.718
1) Top 76 3.9
2) 89 4.5
3) 361 18.4
4) 365 18.6
5) 685 34.8
6) 142 7.2
7) 134 6.8
8) 64 3.3
9) 20 1.0
10) Bottom 30 1.5
Missing 2066
130) SATFIN
We are interested in how people are getting along financially these days.
So far as you and your family are concerned, would you say that you are
pretty well satisfied with your present financial situation, more or less
satisfied, or not satisfied at all?
RANGE: 1 to 3
N Mean Std. Deviation
Total 4016 1.927 0.739
1) Pretty well satisfied 1254 31.2
2) More or less satisfied 1800 44.8
3) Not satisfied at all 962 24.0
Missing 16
131) FINALTER
During the last few years, has your financial situation been getting better, worse, or has it stayed the same?
RANGE: 1 to 3
N Mean Std. Deviation
Total 4021 1.991 0.894
1) Getting better 1623 40.4
2) Getting worse 811 20.2
3) Stayed the same 1587 39.5
Missing 11
132) FINRELA
Compared with American families in general, would you say your family
income is far below average, below average, average, above average, or far
above average? (PROBE: Just your best guess.)
RANGE: 1 to 5
N Mean Std. Deviation
Total 4020 2.938 0.962
1) Far below average 287 7.1
2) Below average 978 24.3
3) Average 1603 39.9
4) Above average 1000 24.9
5) Far above average 152 3.8
Missing 12
133) WKSUB
We have some more questions about [your/your spouse's] job. At work, [do
you/does your spouse] have a supervisor to whom [you/he or she] are
directly responsible?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2500 1.169 0.375
1) Yes 2077 83.1
2) No 423 16.9
Missing 1532
134) WKSUBS
IF YES: Does that supervisor have a supervisor to whom he or she is
directly responsible?
RANGE: 3 to 4
N Mean Std. Deviation
Total 2049 3.113 0.317
3) Yes 1817 88.7
4) No 232 11.3
Missing 1983
135) WKSUP
At work, [do you/does your spouse] supervise anyone who is directly responsible to [you/your spouse]?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2548 1.669 0.471
1) Yes 844 33.1
2) No 1704 66.9
Missing 1484
136) UNION
Do you (or your spouse) belong to a labor union? (Who?)
RANGE: 1 to 4
N Mean Std. Deviation
Total 2652 3.679 0.874
1) Yes, respondent belongs 212 8.0
2) Yes, spouse belongs 88 3.3
3) Yes, both belong 39 1.5
4) No, neither belongs 2313 87.2
Missing 1380
137) UNION1
Do you (or your spouse or partner) belong to a labor union? (Who?)
RANGE: 1 to 4
N Mean Std. Deviation
Total 2652 3.678 0.867
1) Yes, respondent belongs 202 7.6
2) Yes, spouse or partner belongs 100 3.8
3) Yes, both belong 49 1.8
4) No, neither belongs 2301 86.8
Missing 1380
138) PARSOL
Compared to your parents when they were the age you are now, do you think
your own standard of living now is much better, somewhat better, about the
same, somewhat worse, or much worse than theirs was?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2653 2.404 1.114
1) Much better 656 24.7
2) Somewhat better 835 31.5
3) About the same 703 26.5
4) Somewhat worse 353 13.3
5) Much worse 106 4.0
Missing 1379
139) ABDEFECT
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If there is a strong chance of serious defect in the baby?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1435 1.215 0.411
1) Yes 1126 78.5
2) No 309 21.5
Missing 2597
140) ABNOMORE
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she is married and does not want any more children?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1424 1.409 0.492
1) Yes 841 59.1
2) No 583 40.9
Missing 2608
141) ABHLTH
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the woman's own health is seriously endangered by the pregnancy?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1431 1.099 0.298
1) Yes 1290 90.1
2) No 141 9.9
Missing 2601
142) ABPOOR
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the family has a very low income and cannot afford any more children?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1425 1.421 0.494
1) Yes 825 57.9
2) No 600 42.1
Missing 2607
143) ABRAPE
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she becomes pregnant as a result of rape?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1430 1.153 0.36
1) Yes 1211 84.7
2) No 219 15.3
Missing 2602
144) ABSINGLE
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she is not married and does not want to marry the man?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1425 1.429 0.495
1) Yes 813 57.1
2) No 612 42.9
Missing 2607
145) ABANY
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the woman wants it for any reason?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1328 1.436 0.496
1) Yes 749 56.4
2) No 579 43.6
Missing 2704
146) ABDEFECTG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If there is a strong chance of serious defect in the baby?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1176 1.207 0.406
1) Yes 932 79.3
2) No 244 20.7
Missing 2856
147) ABNOMOREG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she is married and does not want any more children?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1165 1.467 0.499
1) Yes 621 53.3
2) No 544 46.7
Missing 2867
148) ABHLTHG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the woman's own health is seriously endangered by the pregnancy?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1188 1.106 0.308
1) Yes 1062 89.4
2) No 126 10.6
Missing 2844
149) ABPOORG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the family has a very low income and cannot afford any more children?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1160 1.437 0.496
1) Yes 653 56.3
2) No 507 43.7
Missing 2872
150) ABRAPEG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she becomes pregnant as a result of rape?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1179 1.173 0.378
1) Yes 975 82.7
2) No 204 17.3
Missing 2853
151) ABSINGLEG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If she is not married and does not want to marry the man?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1158 1.461 0.499
1) Yes 624 53.9
2) No 534 46.1
Missing 2874
152) ABANYG
(Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if...) If the woman wants it for any reason?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1300 1.401 0.49
1) Yes 779 59.9
2) No 521 40.1
Missing 2732
153) PROCHOIC
(We hear a lot of talk these days about abortion. Please indicate to what extent you agree or disagree with each of the following statements.) I consider myself pro-choice.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3551 2.448 1.306
1) Strongly agree 1053 29.7
2) Agree 1004 28.3
3) Neither agree nor disagree 729 20.5
4) Disagree 379 10.7
5) Strongly disagree 386 10.9
Missing 481
154) PROLIFE
(We hear a lot of talk these days about abortion. Please indicate to what extent you agree or disagree with each of the following statements.) I consider myself pro-life.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3537 2.888 1.329
1) Strongly agree 660 18.7
2) Agree 791 22.4
3) Neither agree nor disagree 941 26.6
4) Disagree 574 16.2
5) Strongly disagree 571 16.1
Missing 495
155) CHLDIDEL
What do you think is the ideal number of children for a family to have?
RANGE: 0 to 8
N Mean Std. Deviation
Total 2693 4.13 2.674
0) 49 1.8
1) 49 1.8
2) 1076 40.0
3) 501 18.6
4) 160 5.9
5) 22 0.8
6) 9 0.3
7) Seven or more 3 0.1
8) As many as you want 824 30.6
Missing 1339
156) PILLOK
Do you strongly agree, agree, disagree, or strongly disagree that methods
of birth control should be available to teenagers between the ages of 14
and 16 if their parents do not approve?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2691 2.099 1.028
1) Strongly agree 947 35.2
2) Agree 885 32.9
3) Disagree 505 18.8
4) Strongly disagree 354 13.2
Missing 1341
157) SEXEDUC
Would you be for or against sex education in the public schools?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2688 1.089 0.284
1) Favor 2450 91.1
2) Oppose 238 8.9
Missing 1344
158) PREMARSX
There's been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and a woman have sexual relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2680 3.306 1.098
1) Always wrong 390 14.6
2) Almost always wrong 162 6.0
3) Wrong only sometimes 367 13.7
4) Not wrong at all 1761 65.7
Missing 1352
159) TEENSEX
What if they are in their early teens, say 14 to 16 years old? In that
case, do you think sex relations before marriage are always wrong, almost
always wrong, wrong only sometimes, or not wrong at all?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2679 2.048 1.12
1) Always wrong 1196 44.6
2) Almost always wrong 581 21.7
3) Wrong only sometimes 479 17.9
4) Not wrong at all 423 15.8
Missing 1353
160) XMARSEX
What is your opinion about a married person having sexual relations with
someone other than the marriage partner—is it always wrong, almost always
wrong, wrong only sometimes, or not wrong at all?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2650 1.553 0.831
1) Always wrong 1667 62.9
2) Almost always wrong 606 22.9
3) Wrong only sometimes 272 10.3
4) Not wrong at all 105 4.0
Missing 1382
161) HOMOSEX
What about sexual relations between two adults of the same sex—do you
think it is always wrong, almost always wrong, wrong only sometimes, or
not wrong at all?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2611 3.047 1.313
1) Always wrong 693 26.5
2) Almost always wrong 112 4.3
3) Wrong only sometimes 185 7.1
4) Not wrong at all 1621 62.1
Missing 1421
162) PORNLAW
Which of these statements comes closest to your feelings about pornography
laws?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2652 1.793 0.518
1) There should be laws against the distribution of pornography whatever the age 686 25.9
2) There should be laws against the distribution of pornography to persons under 18 1828 68.9
3) There should be no laws forbidding the distribution of pornography 138 5.2
Missing 1380
163) XMOVIE
Have you seen an X-rated movie in the last year?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2656 1.696 0.46
1) Yes 808 30.4
2) No 1848 69.6
Missing 1376
164) SPANKING
Do you strongly agree, agree, disagree, or strongly disagree that it is
sometimes necessary to discipline a child with a good, hard spanking?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2684 2.49 0.952
1) Strongly agree 424 15.8
2) Agree 977 36.4
3) Disagree 827 30.8
4) Strongly disagree 456 17.0
Missing 1348
165) LETDIE1
When a person has a disease that cannot be cured, do you think doctors
should be allowed by law to end the patient's life by some painless means
if the patient and his family request it?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1316 1.29 0.454
1) Yes 934 71.0
2) No 382 29.0
Missing 2716
166) SUICIDE1
(Do you think a person has the right to end his or her own life if this
person...) Has an incurable disease?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1212 1.268 0.443
1) Yes 887 73.2
2) No 325 26.8
Missing 2820
167) SUICIDE2
(Do you think a person has the right to end his or her own life if this
person...) Has gone bankrupt?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1298 1.881 0.324
1) Yes 155 11.9
2) No 1143 88.1
Missing 2734
168) SUICIDE3
(Do you think a person has the right to end his or her own life if this
person...) Has dishonored his or her own family?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1317 1.89 0.313
1) Yes 145 11.0
2) No 1172 89.0
Missing 2715
169) SUICIDE4
(Do you think a person has the right to end his or her own life if this
person...) Is tired of living and ready to die?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1231 1.795 0.404
1) Yes 252 20.5
2) No 979 79.5
Missing 2801
170) POLHITOK
Are there any situations you can imagine in which you would approve of a
policeman striking an adult male citizen?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1300 1.315 0.465
1) Yes 890 68.5
2) No 410 31.5
Missing 2732
171) POLABUSE
(Would you approve of a policeman striking a citizen who...) Had said vulgar and obscene things to the policeman?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1304 1.923 0.266
1) Yes 100 7.7
2) No 1204 92.3
Missing 2728
172) POLMURDR
(Would you approve of a policeman striking a citizen who...) Was being
questioned as a suspect in a murder case?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2634 1.787 0.409
1) Yes 560 21.3
2) No 2074 78.7
Missing 1398
173) POLESCAP
(Would you approve of a policeman striking a citizen who...) Was attempting to escape from custody?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2639 1.402 0.49
1) Yes 1579 59.8
2) No 1060 40.2
Missing 1393
174) POLATTAK
(Would you approve of a policeman striking a citizen who...) Was attacking
the policeman with his fists?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1303 1.214 0.41
1) Yes 1024 78.6
2) No 279 21.4
Missing 2729
175) FEAR
Is there any area right around here—that is, within a mile—where you would
be afraid to walk alone at night?
RANGE: 1 to 2
N Mean Std. Deviation
Total 4022 1.65 0.477
1) Yes 1409 35.0
2) No 2613 65.0
Missing 10
176) OWNGUN
Do you happen to have in your home (IF HOUSE: or garage) any guns or
revolvers?
RANGE: 1 to 3
N Mean Std. Deviation
Total 3922 1.65 0.482
1) Yes 1383 35.3
2) No 2529 64.5
3) Refused 10 0.3
Missing 110
177) HUNT1
Do you (or does your [husband/wife/partner]) go hunting?
RANGE: 1 to 4
N Mean Std. Deviation
Total 4025 3.658 0.887
1) Yes, respondent does 325 8.1
2) Yes, spouse or partner does 155 3.9
3) Yes, both do 93 2.3
4) No, neither respondent nor spouse or partner does 3452 85.8
Missing 7
178) NEWS
How often do you read the newspaper—every day, a few times a week, once a
week, less than once a week, or never?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2696 3.3 1.628
1) Every day 642 23.8
2) A few times a week 361 13.4
3) Once a week 240 8.9
4) Less than once a week 452 16.8
5) Never 1001 37.1
Missing 1336
179) TVHOURS
On the average day, about how many hours do you personally watch television?
RANGE: 0 to 24
N Mean Std. Deviation
Total 2683 3.458 3.109
Missing 1349
180) FECHLD
Please read the following statements and indicate whether you strongly agree, agree, disagree, or strongly disagree with each statement. For example, here is the statement: A working mother can establish just as warm and secure a relationship with her children as a mother who does not work.
RANGE: 1 to 4
N Mean Std. Deviation
Total 2714 1.817 0.793
1) Strongly agree 1062 39.1
2) Agree 1173 43.2
3) Disagree 394 14.5
4) Strongly disagree 85 3.1
Missing 1318
181) FEPRESCH
A preschool child is likely to suffer if his or her mother works.
RANGE: 1 to 4
N Mean Std. Deviation
Total 2707 2.97 0.766
1) Strongly agree 105 3.9
2) Agree 520 19.2
3) Disagree 1433 52.9
4) Strongly disagree 649 24.0
Missing 1325
182) FEFAM
It is much better for everyone involved if the man is the achiever outside
the home and the woman takes care of the home and family.
RANGE: 1 to 4
N Mean Std. Deviation
Total 2708 3.132 0.841
1) Strongly agree 131 4.8
2) Agree 410 15.1
3) Disagree 1137 42.0
4) Strongly disagree 1030 38.0
Missing 1324
183) RACDIF1
On the average (Negroes/Blacks/African-Americans) have worse jobs, income,
and housing than white people. Do you think these differences are...
Mainly due to discrimination?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2681 1.432 0.495
1) Yes 1524 56.8
2) No 1157 43.2
Missing 1351
184) RACDIF2
(On the average (Negroes/Blacks/African-Americans) have worse jobs,
income, and housing than white people. Do you think these differences
are...) Because most (Negroes/Blacks/African-Americans) have less in-born
ability to learn?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2689 1.948 0.223
1) Yes 141 5.2
2) No 2548 94.8
Missing 1343
185) RACDIF3
(On the average (Negroes/Blacks/African-Americans) have worse jobs,
income, and housing than white people. Do you think these differences
are...) Because most (Negroes/Blacks/African-Americans) don't have the
chance for education that it takes to rise out of poverty?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2673 1.435 0.496
1) Yes 1511 56.5
2) No 1162 43.5
Missing 1359
186) RACDIF4
(On the average (Negroes/Blacks/African-Americans) have worse jobs,
income, and housing than white people. Do you think these differences
are...) Because most (Negroes/Blacks/African-Americans) just don't have
the motivation or willpower to pull themselves up out of poverty?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2666 1.735 0.441
1) Yes 706 26.5
2) No 1960 73.5
Missing 1366
187) HELPPOOR
Next, here are issues that some people tell us are important. Some people
think that the government in Washington should do everything possible to
improve the standard of living of all poor Americans, they are at Point
One on the scale below. Other people think it is not the government's
responsibility, and that each person should take care of himself, they are
at Point Five. Where would you place yourself on this scale, or haven't
you made up your mind on this?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2633 2.682 1.254
1) Government should help 675 25.6
2) 335 12.7
3) Agree with both 1035 39.3
4) 327 12.4
5) People should help themselves 261 9.9
Missing 1399
188) HELPNOT
Some people think that the government in Washington is trying to do too
many things that should be left to individuals and private businesses. Others disagree and think that the government should do even more to solve
our country's problems. Still others have opinions somewhere in between.
Where would you place yourself on this scale, or haven't you made up your
mind on this?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2609 2.859 1.293
1) Government should do more 539 20.7
2) 386 14.8
3) Agree with both 977 37.4
4) 319 12.2
5) Government does too much 388 14.9
Missing 1423
189) HELPSICK
In general, some people think that it is the responsibility of the
government in Washington to see to it that people have help in paying for
doctors and hospital bills. Others think that these matters are not the
responsibility of the federal government and that people should take care
of these things themselves. Where would you place yourself on this scale,
or haven't you made up your mind on this?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2627 2.376 1.267
1) Government should help 917 34.9
2) 471 17.9
3) Agree with both 793 30.2
4) 227 8.6
5) People should care for themselves 219 8.3
Missing 1405
190) HELPBLK
Some people think that Blacks have been discriminated against for so long
that the government has a special obligation to help improve their living
standards. Others believe that the government should not be giving special
treatment to Blacks. Where would you place yourself on this scale, or
haven't you made up your mind on this?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2602 3.027 1.457
1) Government should help 560 21.5
2) 407 15.6
3) Agree with both 679 26.1
4) 315 12.1
5) No special treatment 641 24.6
Missing 1430
191) GOD
Please look at this card and tell me which statement comes closest to
expressing what you believe about God.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2615 4.657 1.682
1) Don't believe 173 6.6
2) Don't know, no way to find out 237 9.1
3) Higher power 341 13.0
4) Believe sometimes 124 4.7
5) Believe with doubts 427 16.3
6) No doubts 1313 50.2
Missing 1417
192) REBORN
Would you say you have been born again or have had a born again
experience—that is, a turning point in your life when you committed yourself to
Christ?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2600 1.661 0.474
1) Yes 882 33.9
2) No 1718 66.1
Missing 1432
193) SAVESOUL
Have you ever tried to encourage someone to believe in Jesus Christ or to
accept Jesus Christ as his or her savior?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3949 1.63 0.483
1) Yes 1461 37.0
2) No 2488 63.0
Missing 83
194) LIVEBLKS
Now, I'm going to ask you about different types of contact with various
groups of people. In each situation would you please tell me whether you
would be very much in favor of it happening, somewhat in favor, neither in
favor nor opposed to it happening, somewhat opposed or very much opposed
to it happening? Living in a neighborhood where half of your neighbors
were Black?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2695 2.722 0.943
1) Strongly favor 454 16.8
2) Favor 234 8.7
3) Neither favor nor oppose 1694 62.9
4) Oppose 233 8.6
5) Strongly oppose 80 3.0
Missing 1337
195) MARBLK
How about having a close relative marry a Black person? Would you be very
in favor of it happening, somewhat in favor, neither in favor nor opposed
to it happening, somewhat opposed, or very opposed to it happening?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2698 2.624 0.944
1) Strongly favor 546 20.2
2) Favor 196 7.3
3) Neither favor nor oppose 1747 64.8
4) Oppose 144 5.3
5) Strongly oppose 65 2.4
Missing 1334
196) MARASIAN
How about having a close relative marry an Asian American person? Would
you be very in favor of it happening, somewhat in favor, neither in favor
nor opposed to it happening, somewhat opposed, or very opposed to it happening?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2689 2.588 0.88
1) Strongly favor 513 19.1
2) Favor 242 9.0
3) Neither favor nor oppose 1816 67.5
4) Oppose 76 2.8
5) Strongly oppose 42 1.6
Missing 1343
197) MARHISP
How about having a close relative marry a Hispanic person? Would you be
very in favor of it happening, somewhat in favor, neither in favor nor
opposed to it happening, somewhat opposed, or very opposed to it happening?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2690 2.58 0.882
1) Strongly favor 517 19.2
2) Favor 257 9.6
3) Neither favor nor oppose 1794 66.7
4) Oppose 82 3.0
5) Strongly oppose 40 1.5
Missing 1342
198) MARWHT
What about having a close relative marry a White person? Would you be very
in favor of it happening, somewhat in favor, neither in favor nor opposed
to it happening, somewhat opposed, or very opposed to it happening?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2694 2.461 0.922
1) Strongly favor 666 24.7
2) Favor 235 8.7
3) Neither favor nor oppose 1709 63.4
4) Oppose 54 2.0
5) Strongly oppose 30 1.1
Missing 1338
199) RACWORK
IF EMPLOYED: Are the people who work where you work all White, mostly
White, about half and half, mostly Black, or all Black?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1371 2.253 0.722
1) All White 163 11.9
2) Mostly White 757 55.2
3) About half and half 397 29.0
4) Mostly Black 49 3.6
5) All Black 5 0.4
Missing 2661
200) DISCAFF
What do you think the chances are these days that a white person won't get
a job or promotion while an equally or less qualified Black person gets
one instead? Is this very likely, somewhat likely, or not very likely to
happen these days?
RANGE: 1 to 3
N Mean Std. Deviation
Total 3948 2.367 0.688
1) Very likely 476 12.1
2) Somewhat likely 1549 39.2
3) Not very likely 1923 48.7
Missing 84
201) FEJOBAFF
Some people say that because of past discrimination, women should be given
preference in hiring and promotion. Others say that such preference in
hiring and promotion of women is wrong because it discriminates against men.
What about your opinion—are you for or against preferential hiring and
promotion of women? IF FOR: Do you favor preference in hiring and promotion
strongly or not strongly? IF AGAINST: Do you oppose the preference in
hiring and promotion strongly or not strongly?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1323 2.796 1.13
1) Strongly favor 257 19.4
2) Not strongly favor 235 17.8
3) Not strongly oppose 352 26.6
4) Strongly oppose 479 36.2
Missing 2709
202) DISCAFFM
What do you think the chances are these days that a man won't get a job or
promotion while an equally or less qualified woman gets one instead? Is
this very likely, somewhat likely, somewhat unlikely, or very unlikely
these days?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1336 2.719 0.898
1) Very likely 119 8.9
2) Somewhat likely 422 31.6
3) Not very likely 511 38.2
4) Very unlikely 284 21.3
Missing 2696
203) FEHIRE
Now I'm going to read several statements. As I read each one, please tell
me whether you strongly agree, agree, neither agree nor disagree,
disagree, or strongly disagree. Because of past discrimination, employers
should make special efforts to hire and promote qualified women.
RANGE: 1 to 5
N Mean Std. Deviation
Total 2708 2.449 1.134
1) Strongly agree 622 23.0
2) Agree 898 33.2
3) Neither agree nor disagree 673 24.9
4) Disagree 381 14.1
5) Strongly disagree 134 4.9
Missing 1324
204) RELPERSN
To what extent do you consider yourself a religious person? Are you...
RANGE: 1 to 4
N Mean Std. Deviation
Total 3954 2.72 1.038
1) Very religious 518 13.1
2) Moderately religious 1284 32.5
3) Slightly religious 939 23.7
4) Not religious at all 1213 30.7
Missing 78
205) SPRTPRSN
To what extent do you consider yourself a spiritual person? Are you...
RANGE: 1 to 4
N Mean Std. Deviation
Total 3934 2.319 1.007
1) Very spiritual 961 24.4
2) Moderately spiritual 1362 34.6
3) Slightly spiritual 1006 25.6
4) Not spiritual at all 605 15.4
Missing 98
206) OTHLANG
Can you speak a language other than English [Spanish]?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3937 1.709 0.454
1) Yes 1144 29.1
2) No 2793 70.9
Missing 95
207) OTHLANG1
What other language(s) do you speak? First response. (2016)
RANGE: 1 to 176
N Mean Std. Deviation
Total 1064 11.837 22.84
Missing 2968
See Value Labels in SPSS for the 176 possible answer choices (a syntax
sketch for listing them follows this table). Most common answers:
2) Spanish 46.6
1) English 9.8
4) French 8.9
12) German 6.8
8) Chinese 3.0
10) Italian 2.3
32) Japanese 1.9
19) Korean 1.8
6) Russian 1.5
33) Portuguese 1.5
46) Sign language 1.3
22) Arabic 1.1
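A minimal SPSS syntax sketch for viewing the full set of value labels and
tabulating all 176 first responses (assuming the variable is named OTHLANG1,
as above):

* List the variable's dictionary information, including all value labels.
DISPLAY DICTIONARY /VARIABLES=OTHLANG1.
* Tabulate how often each coded language appears as a first response.
FREQUENCIES VARIABLES=OTHLANG1.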
208) SPKLANG
How well do you speak that language? [IF SPEAKS 2 OR MORE, ASK ONLY OF THE
MOST FLUENT LANGUAGE] READ CATEGORIES
RANGE: 1 to 4
N Mean Std. Deviation
Total 1066 1.957 0.931
1) Very well 441 41.4
2) Well 275 25.8
3) Not well 305 28.6
4) Poorly/hardly at all 45 4.2
Missing 2966
209) COMPUSE
Do you personally ever use a computer at home, at work, or at some other
location?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2651 1.092 0.29
1) Yes 2406 90.8
2) No 245 9.2
Missing 1381
210) EMAILHR
About how many minutes or hours per week do you spend sending and
answering electronic mail or email?
RANGE: 0 to 124
N Mean Std. Deviation
Total 2322 6.918 10.038
211) WWWHR
Not counting email, about how many minutes or hours per week do you use
the web? (Include time you spend visiting regular websites and time spent
using interactive internet services like social media, streaming services,
chat rooms, online conferencing services, discussion boards or forums, and
the like.)
RANGE: 0 to 168
N Mean Std. Deviation
Total 2466 14.803 17.392
212) DISRSPCT
(In your day-to-day life how often have any of the following things
happened to you?) You are treated with less courtesy or respect than other
people.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2601 4.178 1.409
1) Almost every day 136 5.2
2) At least once a week 231 8.9
3) A few times a month 327 12.6
4) A few times a year 801 30.8
5) Less than once a year 552 21.2
6) Never 554 21.3
Missing 1431
213) POORSERV
(In your day-to-day life how often have any of the following things
happened to you?) You receive poorer service than other people at restaurants
or stores.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2592 4.831 1.116
1) Almost every day 24 0.9
2) At least once a week 82 3.2
3) A few times a month 155 6.0
4) A few times a year 672 25.9
5) Less than once a year 774 29.9
6) Never 885 34.1
Missing 1440
214) NOTSMART
(In your day-to-day life how often have any of the following things
happened to you?) People act as if they think you are not smart.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2602 4.611 1.35
1) Almost every day 101 3.9
2) At least once a week 129 5.0
3) A few times a month 202 7.8
4) A few times a year 684 26.3
5) Less than once a year 619 23.8
6) Never 867 33.3
Missing 1430
215) AFRAIDOF
(In your day-to-day life how often have any of the following things
happened to you?) People act as if they are afraid of you.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2604 5.267 1.14
1) Almost every day 39 1.5
2) At least once a week 56 2.2
3) A few times a month 120 4.6
4) A few times a year 352 13.5
5) Less than once a year 426 16.4
6) Never 1611 61.9
Missing 1428
216) THREATEN
(In your day-to-day life how often have any of the following things
happened to you?) You are threatened or harassed.
RANGE: 1 to 6
N Mean Std. Deviation
Total 2604 5.267 0.997
1) Almost every day 20 0.8
2) At least once a week 44 1.7
3) A few times a month 84 3.2
4) A few times a year 322 12.4
5) Less than once a year 738 28.3
6) Never 1396 53.6
Missing 1428
217) QUALLIFE
In general, would you say your quality of life is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 3632 2.483 0.915
1) Excellent 477 13.1
2) Very good 1449 39.9
3) Good 1243 34.2
4) Fair 399 11.0
5) Poor 64 1.8
Missing 400
218) HLTHPHYS
In general, how would you rate your physical health?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3630 2.674 1.038
1) Excellent 526 14.5
2) Very good 1012 27.9
3) Good 1361 37.5
4) Fair 582 16.0
5) Poor 149 4.1
Missing 402
219) HLTHMNTL
In general, how would you rate your mental health, including your mood and
your ability to think?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3637 2.519 1.016
1) Excellent 579 15.9
2) Very good 1313 36.1
3) Good 1147 31.5
4) Fair 476 13.1
5) Poor 122 3.4
Missing 395
220) SATSOC
In general, how would you rate your satisfaction with your social
activities and relationships?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3622 2.782 1.034
1) Excellent 362 10.0
2) Very good 1125 31.1
3) Good 1283 35.4
4) Fair 645 17.8
5) Poor 207 5.7
Missing 410
221) ACTSSOC
In general, please rate how well you carry out your usual social
activities and roles. (This includes activities at home, at work and in your
community, and responsibilities as a parent, child, spouse, employee, friend,
etc.)
RANGE: 1 to 5
N Mean Std. Deviation
Total 3624 2.507 0.912
1) Excellent 431 11.9
2) Very good 1457 40.2
3) Good 1290 35.6
4) Fair 358 9.9
5) Poor 88 2.4
Missing 408
222) PHYSACTS
To what extent are you able to carry out your everyday physical activities
such as walking, climbing stairs, carrying groceries, or moving a chair?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3623 1.689 0.995
1) Completely 2190 60.4
2) Mostly 667 18.4
3) Moderately 507 14.0
4) A little 219 6.0
5) Not at all 40 1.1
Missing 409
223) EMOPROBS
In the past seven days, how often have you been bothered by emotional
problems such as feeling anxious, depressed or irritable?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3614 2.491 1.092
1) Never 765 21.2
2) Rarely 1092 30.2
3) Sometimes 1126 31.2
4) Often 478 13.2
5) Always 153 4.2
Missing 418
224) FATIGUE
In the past seven days, how would you rate your fatigue on average?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3615 2.314 0.916
1) None 684 18.9
2) Mild 1491 41.2
3) Moderate 1124 31.1
4) Severe 253 7.0
5) Very severe 63 1.7
Missing 417
225) WKRSLFFAM
(Do/did) you work in your own family business or farm?
RANGE: 1 to 2
N Mean Std. Deviation
Total 429 1.55 0.498
0) Inapplicable
1) Yes 193 45.0
2) No 236 55.0
Missing 3603
226) NEXTGEN
I'm going to read to you some statements like those you might find in a
newspaper or magazine article. For each statement, please tell me if you
strongly agree, agree, disagree or strongly disagree. Because of science
and technology, there will be more opportunities for the next generation.
RANGE: 1 to 4
N Mean Std. Deviation
Total 1859 1.712 0.668
1) Strongly agree 734 39.5
2) Agree 951 51.2
3) Disagree 149 8.0
4) Strongly disagree 25 1.3
Missing 2173
227) TOOFAST
I'm going to read to you some statements like those you might find in a
newspaper or magazine article. For each statement, please tell me if you
strongly agree, agree, disagree or strongly disagree. Science makes our
way of life change too fast.
RANGE: 1 to 4
N Mean Std. Deviation
Total 1862 2.605 0.798
1) Strongly agree 163 8.8
2) Agree 617 33.1
3) Disagree 875 47.0
4) Strongly disagree 207 11.1
Missing 2170
228) ADVFRONT
I'm going to read to you some statements like those you might find in a
newspaper or magazine article. For each statement, please tell me if you
strongly agree, agree, disagree or strongly disagree. Even if it brings no
immediate benefits, scientific research that advances the frontiers of
knowledge is necessary and should be supported by the federal government.
RANGE: 1 to 4
N Mean Std. Deviation
Total 1855 1.822 0.704
1) Strongly agree 615 33.2
2) Agree 994 53.6
3) Disagree 207 11.2
4) Strongly disagree 39 2.1
Missing 2177
229) SCIBNFTS
Now for another type of question. People have frequently noted that
scientific research has produced benefits and harmful results. Would you say
that, on balance, the benefits of scientific research have outweighed the
harmful results, or have the harmful results of scientific research been
greater than its benefits?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1838 1.447 0.561
1) Benefits greater 1078 58.7
2) About equal (if volunteered) 698 38.0
3) Harmful results greater 62 3.4
Missing 2194
230) VIRUSES
Now, I would like to ask you a few short questions like those you might
see on a television game show. For each statement that I read, please tell
me if it is true or false. If you don't know or aren't sure, just tell me
so, and we will skip to the next question. Remember true, false or don't
know. Antibiotics kill viruses as well as bacteria.
RANGE: 1 to 2
N Mean Std. Deviation
Total 1834 1.683 0.466
1) True 582 31.7
2) False 1252 68.3
Missing 2198
231) INTEDUC
Are you very interested, moderately interested or not at all interested in
local school issues?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1878 1.929 0.707
1) Very interested 541 28.8
2) Moderately interested 930 49.5
3) Not at all interested 407 21.7
Missing 2154
232) INTSCI
Are you very interested, moderately interested or not at all interested in
issues about new scientific discoveries?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1877 1.638 0.642
1) Very interested 849 45.2
2) Moderately interested 858 45.7
3) Not at all interested 170 9.1
Missing 2155
233) INTECON
Are you very interested, moderately interested or not at all interested in
economic issues and business conditions?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1867 1.653 0.634
1) Very interested 812 43.5
2) Moderately interested 891 47.7
3) Not at all interested 164 8.8
Missing 2165
234) INTTECH
Are you very interested, moderately interested or not at all interested in
issues about the use of new inventions and technologies?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1874 1.677 0.639
1) Very interested 782 41.7
2) Moderately interested 915 48.8
3) Not at all interested 177 9.4
Missing 2158
235) POSSLQ
Which of these statements applies to you?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1970 2.324 1.385
1) I am married and living in the same household as my husband or wife 945 48.0
2) I am living as married and my partner and I together live in the same household 169 8.6
3) I have a husband or wife or steady partner, but we don't live in the same household 129 6.5
4) I don't have a steady partner 727 36.9
Missing 2062
236) POSSLQY
Which of these statements applies to you?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2040 2.26 1.354
1) I am married and living in the same household as my husband or wife 990 48.5
2) I am living as married and my partner and I together live in the same household 213 10.4
3) I have a husband or wife or steady partner, but we don't live in the same household 153 7.5
4) I don't have a steady partner 684 33.5
Missing 1992
237) MARCOHAB
Marriage and cohabitation status
RANGE: 1 to 3
N Mean Std. Deviation
Total 4031 1.913 0.947
1) Married 1997 49.5
2) Not married, cohabitating partner 386 9.6
3) Not married, no cohabitating partner 1648 40.9
Missing 1
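MARCOHAB is a constructed variable summarizing the marriage and cohabitation
items rather than a question asked of respondents. One plausible construction,
sketched in SPSS syntax under the assumption that it simply collapses a
POSSLQ-style item (consult the GSS documentation for the exact rule; the name
MARCOHAB2 is illustrative):

* Illustrative only: collapse the four POSSLQ categories into three.
RECODE POSSLQ (1=1) (2=2) (3,4=3) INTO MARCOHAB2.
VALUE LABELS MARCOHAB2 1 'Married' 2 'Not married, cohabitating partner'
 3 'Not married, no cohabitating partner'.
EXECUTE.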
238) ENDSMEET
Thinking of your household's total income, including all the sources of
income of all the members who contribute to it, how difficult or easy is
it currently for your household to make ends meet?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1763 3.264 1.147
1) Very difficult 129 7.3
2) Fairly difficult 321 18.2
3) Neither easy nor difficult 548 31.1
4) Fairly easy 485 27.5
5) Very easy 280 15.9
Missing 2269
239) OPWLTH
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is coming from a wealthy family?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1815 3.149 1.091
1) Essential 118 6.5
2) Very important 407 22.4
3) Fairly important 580 32.0
4) Not very important 506 27.9
5) Not important at all 204 11.2
Missing 2217
240) OPPARED
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is having well-educated parents?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1845 2.685 0.937
1) Essential 159 8.6
2) Very important 642 34.8
3) Fairly important 738 40.0
4) Not very important 233 12.6
5) Not important at all 73 4.0
Missing 2187
241) OPEDUC
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is having a good education yourself?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1853 1.883 0.781
1) Essential 623 33.6
2) Very important 875 47.2
3) Fairly important 313 16.9
4) Not very important 32 1.7
5) Not important at all 10 0.5
Missing 2179
242) OPHRDWRK
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is hard work?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1857 1.666 0.718
1) Essential 856 46.1
2) Very important 797 42.9
3) Fairly important 176 9.5
4) Not very important 24 1.3
5) Not important at all 4 0.2
Missing 2175
243) OPKNOW
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is knowing the right people?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1845 2.543 0.913
1) Essential 250 13.6
2) Very important 603 32.7
3) Fairly important 763 41.4
4) Not very important 199 10.8
5) Not important at all 30 1.6
Missing 2187
244) OPCLOUT
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is having political connections?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1793 3.524 1.021
1) Essential 76 4.2
2) Very important 203 11.3
3) Fairly important 499 27.8
4) Not very important 735 41.0
5) Not important at all 280 15.6
Missing 2239
245) OPRACE
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is a person's race?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1800 3.908 1.128
1) Essential 44 2.4
2) Very important 184 10.2
3) Fairly important 420 23.3
4) Not very important 398 22.1
5) Not important at all 754 41.9
Missing 2232
246) OPRELIG
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is a person's religion?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1806 4.185 0.996
1) Essential 42 2.3
2) Very important 97 5.4
3) Fairly important 217 12.0
4) Not very important 579 32.1
5) Not important at all 871 48.2
Missing 2226
247) OPSEX
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is being born a man or woman?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1793 3.83 1.173
1) Essential 69 3.8
2) Very important 198 11.0
3) Fairly important 403 22.5
4) Not very important 422 23.5
5) Not important at all 701 39.1
Missing 2239
248) OPBRIBES
(For each of these, please tell me how important you think it is for
getting ahead in life...) How important is giving bribes?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1718 4.588 0.82
1) Essential 20 1.2
2) Very important 51 3.0
3) Fairly important 96 5.6
4) Not very important 282 16.4
5) Not important at all 1269 73.9
Missing 2314
249) GOODLIFE
The way things are in America, people like me and my family have a good
chance of improving our standard of living—do you agree or disagree?
RANGE: 1 to 5
N Mean Std. Deviation
Total 2664 2.741 1.073
1) Strongly agree 290 10.9
2) Agree 944 35.4
3) Neither agree nor disagree 750 28.2
4) Disagree 526 19.7
5) Strongly disagree 154 5.8
Missing 1368
250) PAYDOC
About how much do you think a doctor in general practice earns?
RANGE: 0 to 999996
N Mean Std. Deviation
Total 1828 228682.544 188107.26
251) PAYCLERK
How much do you think a salesclerk earns?
RANGE: 0 to 999996
N Mean Std. Deviation
Total 1823 38771.885 76630.982
252) PAYEXEC
How much do you think a chairman of a large national corporation earns?
RANGE: 0 to 999996
N Mean Std. Deviation
Total 1818 673620.721 371536.782
253) PAYUNSKL
How much do you think an unskilled worker in a factory earns?
RANGE: 0 to 999996
N Mean Std. Deviation
Total 1824 36354.363 66898.405
254) PAYCABNT
How much do you think a cabinet minister in the federal government earns?
RANGE: 0 to 999996
N Mean Std. Deviation
Total 1812 257485.826 232526.024
255) INCGAP
To what extent do you agree or disagree with the following statements?
Differences in income in America are too large.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1822 1.96 1.071
1) Strongly agree 793 43.5
2) Agree 545 29.9
3) Neither agree nor disagree 301 16.5
4) Disagree 130 7.1
5) Strongly disagree 53 2.9
Missing 2210
256) GOVEQINC
It is the responsibility of the government to reduce the differences in
income between people with high incomes and those with low incomes.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1806 3.122 1.257
1) Strongly agree 190 10.5
2) Agree 434 24.0
3) Neither agree nor disagree 471 26.1
4) Disagree 387 21.4
5) Strongly disagree 324 17.9
Missing 2226
257) GOVUNEMP
The government should provide a decent standard of living for the unemployed.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1784 2.786 1.213
1) Strongly agree 288 16.1
2) Agree 508 28.5
3) Neither agree nor disagree 455 25.5
4) Disagree 363 20.3
5) Strongly disagree 170 9.5
Missing 2248
258) TAXRICH
Generally, how would you describe taxes in America today for those with
high incomes? Taxes are...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1704 3.601 1.168
1) Much too high 98 5.8
2) Too high 240 14.1
3) About right 329 19.3
4) Too low 614 36.0
5) Much too low 423 24.8
Missing 2328
259) TAXSHARE
Do you think people with high incomes should pay a larger share of their
income in taxes than those with low incomes, the same share, or a smaller
share?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1748 2.011 0.822
1) Much larger share 512 29.3
2) Larger share 760 43.5
3) The same share 432 24.7
4) Smaller share 32 1.8
5) Much smaller share 12 0.7
Missing 2284
260) CONWLTH
(In all countries, there are differences or even conflicts between
different social groups. In your opinion, in America how much conflict is there
between...) Poor people and rich people?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1675 2.216 0.726
1) Very strong conflicts 255 15.2
2) Strong conflicts 849 50.7
3) Not very strong conflicts 526 31.4
4) There are no conflicts 45 2.7
Missing 2357
261) CONCLASS
(In all countries, there are differences or even conflicts between
different social groups. In your opinion, in America how much conflict is there
between...) The working class and the middle class?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1701 2.768 0.668
1) Very strong conflicts 69 4.1
2) Strong conflicts 416 24.5
3) Not very strong conflicts 1057 62.1
4) There are no conflicts 159 9.3
Missing 2331
262) CONUNION
(In all countries, there are differences or even conflicts between
different social groups. In your opinion, in America how much conflict is there
between...) Management and workers?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1687 2.434 0.683
1) Very strong conflicts 140 8.3
2) Strong conflicts 721 42.7
3) Not very strong conflicts 780 46.2
4) There are no conflicts 46 2.7
Missing 2345
263) CONAGE
(In all countries, there are differences or even conflicts between
different social groups. In your opinion, in America how much conflict is there
between...) Young people and older people?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1716 2.594 0.733
1) Very strong conflicts 138 8.0
2) Strong conflicts 537 31.3
3) Not very strong conflicts 925 53.9
4) There are no conflicts 116 6.8
Missing 2316
264) CONIMM
(In all countries, there are differences or even conflicts between
different social groups. In your opinion, in America how much conflict is there
between...) People born in America and people from other countries who
have come to live in America?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1700 2.236 0.721
1) Very strong conflicts 244 14.4
2) Strong conflicts 854 50.2
3) Not very strong conflicts 558 32.8
4) There are no conflicts 44 2.6
Missing 2332
265) LDCGAP
(Turning to international differences, to what extent do you agree or
disagree with the following statements?) Present economic differences between
rich and poor countries are too large.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1700 2.311 0.953
1) Strongly agree 365 21.5
2) Agree 619 36.4
3) Neither agree nor disagree 587 34.5
4) Disagree 81 4.8
5) Strongly disagree 48 2.8
Missing 2332
266) LDCTAX
(Turning to international differences, to what extent do you agree or
disagree with the following statements?) People in wealthy countries should
make an additional tax contribution to help people in poor countries.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1722 3.177 1.162
1) Strongly agree 150 8.7
2) Agree 335 19.5
3) Neither agree nor disagree 548 31.8
4) Disagree 438 25.4
5) Strongly disagree 251 14.6
Missing 2310
267) RICHHLTH
Is it just or unjust—right or wrong—that people with higher incomes can
buy better health care than people with lower incomes?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1771 3.655 1.231
1) Very just, definitely right 121 6.8
2) Somewhat just, right 191 10.8
3) Neither just nor unjust, mixed feelings 452 25.5
4) Somewhat unjust, wrong 421 23.8
5) Very unjust, definitely wrong 586 33.1
Missing 2261
268) RICHEDUC
Is it just or unjust—right or wrong—that people with higher incomes can
buy better education for their children than people with lower incomes?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1769 3.617 1.254
1) Very just, definitely right 119 6.7
2) Somewhat just, right 241 13.6
3) Neither just nor unjust, mixed feelings 422 23.9
4) Somewhat unjust, wrong 404 22.8
5) Very unjust, definitely wrong 583 33.0
Missing 2263
269) PAYRESP
(In deciding how much people ought to earn, how important should each of
these things be, in your opinion...) How much responsibility goes with the
job—how important do you think that ought to be in deciding pay?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1767 1.989 0.733
1) Essential 436 24.7
2) Very important 950 53.8
3) Fairly important 357 20.2
4) Not very important 12 0.7
5) Not important at all 12 0.7
Missing 2265
270) PAYEDTRN
(In deciding how much people ought to earn, how important should each of
these things be, in your opinion...) The number of years spent in
education and training?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1771 2.395 0.855
1) Essential 242 13.7
2) Very important 759 42.9
3) Fairly important 622 35.1
4) Not very important 125 7.1
5) Not important at all 23 1.3
Missing 2261
271) PAYCHILD
(In deciding how much people ought to earn, how important should each of
these things be, in your opinion...) Whether the person has children to
support?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1729 3.577 1.161
1) Essential 91 5.3
2) Very important 242 14.0
3) Fairly important 413 23.9
4) Not very important 544 31.5
5) Not important at all 439 25.4
Missing 2303
272) PAYDOWEL
(In deciding how much people ought to earn, how important should each of
these things be, in your opinion...) How well he or she does the job?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1760 1.701 0.726
1) Essential 763 43.4
2) Very important 796 45.2
3) Fairly important 171 9.7
4) Not very important 24 1.4
5) Not important at all 6 0.3
Missing 2272
273) MARHOMO
(Do you agree or disagree?) Homosexual couples should have the right to
marry one another.
RANGE: 1 to 5
N Mean Std. Deviation
Total 2658 2.149 1.371
1) Strongly agree 1265 47.6
2) Agree 500 18.8
3) Neither agree nor disagree 419 15.8
4) Disagree 179 6.7
5) Strongly disagree 295 11.1
Missing 1374
274) MEOVRWRK
Family life often suffers because men concentrate too much on their work.
RANGE: 1 to 5
N Mean Std. Deviation
Total 2714 2.806 0.985
1) Strongly agree 202 7.4
2) Agree 886 32.6
3) Neither agree nor disagree 1001 36.9
4) Disagree 487 17.9
5) Strongly disagree 138 5.1
Missing 1318
275) RELACTIV
How often do you take part in the activities and organizations of a church
or place of worship other than attending services?
RANGE: 1 to 10
N Mean Std. Deviation
Total 3949 2.865 2.278
1) Never 1650 41.8
2) Less than once a year 614 15.5
3) About once or twice a year 488 12.4
4) Several times a year 449 11.4
5) About once a month 150 3.8
6) Two to three times a month 159 4.0
7) Nearly every week 137 3.5
8) Every week 241 6.1
9) Several times a week 26 0.7
10) Once a day 35 0.9
Missing 83
276) CANTRUST
Generally speaking, would you say that people can be trusted or that you
can't be too careful in dealing with people?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1815 2.724 0.867
1) People can almost always be trusted 71 3.9
2) People can usually be trusted 714 39.3
3) You usually can't be too careful in dealing with people 750 41.3
4) You almost always can't be too careful in dealing with people 205 11.3
Missing 2286
277) RELIGINF
(Please indicate to what extent you agree or disagree with each of the
following statements.) The U.S. would be a better country if religion had
less influence.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3559 3.095 1.25
1) Strongly agree 514 14.4
2) Agree 565 15.9
3) Neither agree nor disagree 1069 30.0
4) Disagree 890 25.0
5) Strongly disagree 521 14.6
Missing 473
278) PRIVENT
(How much do you agree or disagree with each of these statements) Private
enterprise is the best way to solve America's economic problems.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1754 2.845 1.065
1) Strongly agree 225 12.8
2) Agree 366 20.9
3) Neither agree nor disagree 730 41.6
4) Disagree 322 18.4
5) Strongly disagree 111 6.3
Missing 2278
279) POSTMAT1
Looking at the list on the hand card, please tell me the one thing you
think should be America's highest priority, the most important thing it
should do.
RANGE: 1 to 4
N Mean Std. Deviation
Total 1610 2.196 1.103
1) Maintain order in the nation 509 31.6
2) Give people more say in government decisions 618 38.4
3) Fight rising prices 141 8.8
4) Protect freedom of speech 342 21.2
Missing 2422
280) SCIGRN
(How much do you agree or disagree with each of these statements)? Modern
science will solve our environmental problems with little change to our
way of life.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1772 3.53 0.978
1) Strongly agree 35 2.0
2) Agree 241 13.6
3) Neither agree nor disagree 525 29.6
4) Disagree 691 39.0
5) Strongly disagree 280 15.8
Missing 2260
281) GRNECON
(How much do you agree or disagree with each of these statements)? We
worry too much about the future of the environment and not enough about
prices and jobs today.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1795 3.403 1.162
1) Strongly agree 114 6.4
2) Agree 310 17.3
3) Neither agree nor disagree 453 25.2
4) Disagree 574 32.0
5) Strongly disagree 344 19.2
Missing 2237
282) HARMSGRN
(How much do you agree or disagree with each of these statements?) Almost
everything we do in modern life harms the environment.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1798 2.804 0.993
1) Strongly agree 135 7.5
2) Agree 639 35.5
3) Neither agree nor disagree 520 28.9
4) Disagree 451 25.1
5) Strongly disagree 53 2.9
Missing 2234
283) GRNPROG
(How much do you agree or disagree with each of these statements)? People
worry too much about human progress harming the environment.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1772 3.42 1.071
1) Strongly agree 72 4.1
2) Agree 312 17.6
3) Neither agree nor disagree 459 25.9
4) Disagree 658 37.1
5) Strongly disagree 271 15.3
Missing 2260
284) GRWTHELP
(And please tell me for each of these statements, how much you agree or
disagree with it.) Economic growth always harms the environment.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1754 2.712 0.955
1) Strongly agree 143 8.2
2) Agree 637 36.3
3) Neither agree nor disagree 617 35.2
4) Disagree 297 16.9
5) Strongly disagree 60 3.4
Missing 2278
285) GRWTHARM
(And please tell me for each of these statements, how much you agree or
disagree with it.) In order to protect the environment America needs
economic growth.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1771 3.501 0.835
1) Strongly agree 30 1.7
2) Agree 160 9.0
3) Neither agree nor disagree 619 35.0
4) Disagree 817 46.1
5) Strongly disagree 145 8.2
Missing 2261
286) GRNPRICE
How willing would you be to pay much higher prices in order to protect the
environment?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1778 2.918 1.142
1) Very willing 135 7.6
2) Fairly willing 625 35.2
3) Neither willing nor unwilling 476 26.8
4) Fairly unwilling 334 18.8
5) Very unwilling 208 11.7
Missing 2254
287) GRNTAXES
And how willing would you be to pay much higher taxes in order to protect
the environment?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1775 3.208 1.25
1) Very willing 129 7.3
2) Fairly willing 479 27.0
3) Neither willing nor unwilling 433 24.4
4) Fairly unwilling 361 20.3
5) Very unwilling 373 21.0
Missing 2257
288) GRNSOL
And how willing would you be to accept cuts in your standard of living in
order to protect the environment?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1778 3.15 1.196
1) Very willing 111 6.2
2) Fairly willing 505 28.4
3) Neither willing nor unwilling 487 27.4
4) Fairly unwilling 356 20.0
5) Very unwilling 319 17.9
Missing 2254
289) TOODIFME
How much do you agree or disagree with each of these statements? It is
just too difficult for someone like me to do much about the environment.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1783 3.392 0.999
1) Strongly agree 62 3.5
2) Agree 305 17.1
3) Neither agree nor disagree 474 26.6
4) Disagree 756 42.4
5) Strongly disagree 186 10.4
Missing 2249
290) IHLPGRN
(How much do you agree or disagree with each of these statements?) I do
what is right for the environment, even when it costs more money or takes
more time.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1781 2.495 0.813
1) Strongly agree 123 6.9
2) Agree 868 48.7
3) Neither agree nor disagree 602 33.8
4) Disagree 161 9.0
5) Strongly disagree 27 1.5
Missing 2251
291) CARSGEN
In general, do you think that air pollution caused by cars is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1778 2.534 0.869
1) Extremely dangerous for the environment 232 13.0
2) Very dangerous 559 31.4
3) Somewhat dangerous 814 45.8
4) Not very dangerous 151 8.5
5) Not dangerous at all for the environment 22 1.2
Missing 2254
292) RECYCLE
How often do you make a special effort to sort glass or cans or plastic or
newspapers and so on for recycling?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1700 1.769 1.065
1) Always 1006 59.2
2) Often 278 16.4
3) Sometimes 218 12.8
4) Never 198 11.6
Missing 2332
293) GRNGROUP
Are you a member of any group whose main aim is to preserve or protect the
environment?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1820 1.898 0.303
1) Yes 186 10.2
2) No 1634 89.8
Missing 2212
294) GRNSIGN
In the last five years, have you signed a petition about an environmental
issue?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1802 1.752 0.432
1) Yes, I have 447 24.8
2) No, I have not 1355 75.2
Missing 2230
295) GRNMONEY
In the last five years, have you given money to an environmental group?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1814 1.764 0.425
1) Yes, I have 428 23.6
2) No, I have not 1386 76.4
Missing 2218
296) GRNDEMO
In the last five years, have you taken part in a protest or demonstration
about an environmental issue?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1817 1.951 0.216
1) Yes, I have 89 4.9
2) No, I have not 1728 95.1
Missing 2215
297) IMPGRN
(How much do you agree or disagree with each of these statements?) There
are more important things to do in life than protect the environment.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1788 3.45 1.004
1) Strongly agree 54 3.0
2) Agree 262 14.7
3) Neither agree nor disagree 553 30.9
4) Disagree 664 37.1
5) Strongly disagree 255 14.3
Missing 2244
298) OTHSSAME
(How much do you agree or disagree with each of these statements?) There
is no point in doing what I can for the environment unless others do the
same.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1791 3.539 1.028
1) Strongly agree 67 3.7
2) Agree 257 14.3
3) Neither agree nor disagree 376 21.0
4) Disagree 826 46.1
5) Strongly disagree 265 14.8
Missing 2241
299) GRNEXAGG
(How much do you agree or disagree with each of these statements?) Many of
the claims about environmental threats are exaggerated.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1777 3.547 1.187
1) Strongly agree 106 6.0
2) Agree 282 15.9
3) Neither agree nor disagree 354 19.9
4) Disagree 604 34.0
5) Strongly disagree 431 24.3
Missing 2255
300) TOPPROB1
Which of these issues is the most important for America today?
RANGE: 1 to 9
N Mean Std. Deviation
Total 1690 3.723 2.546
1) Health care 551 32.6
2) Education 216 12.8
3) Crime 53 3.1
4) The environment 224 13.3
5) Immigration 68 4.0
6) The economy 363 21.5
7) Terrorism 50 3.0
8) Poverty 99 5.9
9) None of these 66 3.9
Missing 2342
301) GRNCON
Generally speaking, how concerned are you about environmental issues?
Please tell me what you think, where one means you are not at all
concerned and five means you are very concerned.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1823 3.889 1.156
1) Not at all concerned 81 4.4
2) 142 7.8
3) 419 23.0
4) 438 24.0
5) Very concerned 743 40.8
Missing 2209
302) ENPRBUS
Here is a list of some different environmental problems. Which problem, if
any, do you think is the most important for America as a whole?
RANGE: 1 to 10
N Mean Std. Deviation
Total 1642 5.988 2.612
1) Air pollution 141 8.6
2) Chemicals and pesticides 140 8.5
3) Water shortage 76 4.6
4) Water pollution 131 8.0
5) Nuclear waste 40 2.4
6) Domestic waste disposal 87 5.3
7) Climate change 642 39.1
8) Genetically modified foods 115 7.0
9) Using up our natural resources 177 10.8
10) None of these 93 5.7
Missing 2390
303) GRNEFFME
(How much do you agree or disagree with each of these statements?)
Environmental problems have a direct effect on my everyday life.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1768 2.819 0.997
1) Strongly agree 132 7.5
2) Agree 607 34.3
3) Neither agree nor disagree 542 30.7
4) Disagree 423 23.9
5) Strongly disagree 64 3.6
Missing 2264
304) TEMPGEN1
In general, do you think that a rise in the world's temperature caused by
climate change is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1734 2.251 1.072
1) Extremely dangerous for the environment 528 30.4
2) Very dangerous 501 28.9
3) Somewhat dangerous 493 28.4
4) Not very dangerous 166 9.6
5) Not dangerous at all for the environment 46 2.7
Missing 2298
305) BUSGRN
Which of these approaches do you think would be the best way of getting
business and industry in America to protect the environment?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1643 1.826 0.745
1) Heavy fines for businesses that damage the environment 623 37.9
2) Use the tax system to reward businesses that protect the environment 683 41.6
3) More information and education for businesses about the advantages of protecting the environment 337 20.5
Missing 2389
306) PEOPGRN
Which of these approaches do you think would be the best way of getting
people and their families in America to protect the environment?
RANGE: 1 to 3
N Mean Std. Deviation
Total 1689 2.3 0.693
1) Heavy fines for people that damage the environment 228 13.5
2) Use the tax system to reward people that protect the environment 727 43.0
3) More information and education for people about the advantages of protecting the environment 734 43.5
Missing 2343
307) NOBUYGRN
And how often do you avoid buying certain products for environmental reasons?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1821 2.761 0.873
1) Always 146 8.0
2) Often 525 28.8
3) Sometimes 769 42.2
4) Never 381 20.9
Missing 2211
308) IMPORTS
America should limit the import of foreign products in order to protect
its national economy.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1800 2.683 0.99
1) Strongly agree 195 10.8
2) Agree 617 34.3
3) Neither agree nor disagree 607 33.7
4) Disagree 326 18.1
5) Strongly disagree 55 3.1
Missing 2232
309) POWRORGS
International organizations are taking away too much power from the American government.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1746 2.99 1.082
1) Strongly agree 174 10.0
2) Agree 377 21.6
3) Neither agree nor disagree 618 35.4
4) Disagree 446 25.5
5) Strongly disagree 131 7.5
Missing 2286
310) LETIN1A
Do you think the number of immigrants to America nowadays should be
RANGE: 1 to 5
N Mean Std. Deviation
Total 2670 3.121 1.133
1) Increased a lot 227 8.5
2) Increased a little 476 17.8
3) Remain the same as it is 1128 42.2
4) Reduced a little 426 16.0
5) Reduced a lot 413 15.5
Missing 1362
311) PARTNERS
How many sex partners have you had in the last 12 months?
RANGE: 0 to 9
N Mean Std. Deviation
Total 2313 0.966 1.038
0) No partners 599 25.9
1) One partner 1504 65.0
2) Two partners 83 3.6
3) Three partners 44 1.9
4) Four partners 27 1.2
5) Five to 10 partners 40 1.7
6) 11-20 partners 4 0.2
7) 21-100 partners 2 0.1
8) More than 100 partners 4 0.2
9) One or more partners (unspecified) 6 0.3
Missing 1719
312) MATESEX
Was one of the partners your husband or wife or regular sexual partner?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1697 1.077 0.266
1) Yes 1567 92.3
2) No 130 7.7
Missing 2335
313) SEXSEX
Have your sex partners in the last 12 months been...
RANGE: 1 to 3
N Mean Std. Deviation
Total 1689 1.944 0.994
1) Exclusively male 884 52.3
2) Both male and female 16 0.9
3) Exclusively female 789 46.7
Missing 2343
314) SEXFREQ
About how often did you have sex during the last 12 months?
RANGE: 0 to 6
N Mean Std. Deviation
Total 2157 2.254 1.914
0) Not at all 633 29.3
1) Once or twice 251 11.6
2) Once a month 283 13.1
3) Two or three times a month 351 16.3
4) About once a week 287 13.3
5) Two or three times a week 268 12.4
6) More than three times a week 84 3.9
Missing 1875
315) NUMWOMEN
Now thinking about the time since your 18th birthday (including the past
12 months), how many female partners have you had sex with?
RANGE: 0 to 996
N Mean Std. Deviation
Total 2207 14.745 92.634
Missing 1825
316) NUMMEN
Now thinking about the time since your 18th birthday (including the past
12 months), how many male partners have you had sex with?
RANGE: 0 to 997
N Mean Std. Deviation
Total 2190 13.307 90.957
Missing 1842
317) PARTNRS5
Now thinking about the past five years - the time since February/March
2015, and including the past 12 months, how many sex partners have you had
in that five-year period?
RANGE: 0 to 9
N Mean Std. Deviation
Total 2314 1.723 2.025
0) No partners 336 14.5
1) One partner 1416 61.2
2) Two partners 151 6.5
3) Three partners 83 3.6
4) Four partners 64 2.8
5) Five to 10 partners 109 4.7
6) 11-20 partners 40 1.7
7) 21-100 partners 25 1.1
8) More than 100 partners 7 0.3
9) One or more partners (unspecified) 83 3.6
Missing 1718
318) SEXSEX5
(Now thinking about the past five years—the time since February/March
2015, and including the past 12 months), have your sex partners in the
last five years been...
RANGE: 1 to 3
N Mean Std. Deviation
Total 1925 1.925 0.983
1) Exclusively male 1007 52.3
2) Both male and female 55 2.9
3) Exclusively female 863 44.8
Missing 2107
319) EVPAIDSX
Thinking about the time since your 18th birthday, have you ever had sex
with a person you paid or who paid you for sex?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2275 1.937 0.243
1) Yes 143 6.3
2) No 2132 93.7
Missing 1757
320) EVSTRAY
Have you ever had sex with someone other than your husband or wife while
you were married?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2677 2.252 0.641
1) Yes 297 11.1
2) No 1408 52.6
3) Never married 972 36.3
Missing 1355
321) CONDOM
The last time you had sex, was a condom used? By 'sex' we mean vaginal,
oral, or anal sex.
RANGE: 1 to 2
N Mean Std. Deviation
Total 2157 1.835 0.371
1) Used last time 356 16.5
2) Not used 1801 83.5
Missing 1875
322) RELATSEX
The last time you had sex, was it with someone you were in an ongoing
relationship with, or was it with someone else? Remember that by 'sex' we mean
only vaginal, oral, or anal sex.
RANGE: 1 to 2
N Mean Std. Deviation
Total 2186 1.088 0.284
1) Yes, the last time I had sex, it was with someone I was in an ongoing relationship with 1993 91.2
2) No, the last time I had sex, it was not with someone I was in an ongoing relationship with 193 8.8
Missing 1846
323) EVIDU
Have you ever, even once, taken any drugs by injection with a needle (like
heroin, cocaine, amphetamines, or steroids)? DO NOT include anything you
took under a doctor's orders.
RANGE: 1 to 2
N Mean Std. Deviation
Total 2259 1.973 0.162
1) Yes 61 2.7
2) No 2198 97.3
Missing 1773
324) EVCRACK
Have you ever, even once, used 'crack' cocaine in chunk or rock form?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2277 1.952 0.214
1) Yes 110 4.8
2) No 2167 95.2
Missing 1755
325) HIVTEST
Have you ever been tested for HIV? Do not count tests you may have had as
part of a blood donation. Include oral test (where they take a swab from
your mouth).
RANGE: 1 to 2
N Mean Std. Deviation
Total 2216 1.667 0.471
1) Yes 737 33.3
2) No 1479 66.7
Missing 1816
326) SEXORNT
Which of the following best describes you?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2260 2.89 0.406
1) Gay, lesbian, or homosexual 76 3.4
2) Bisexual 96 4.2
3) Heterosexual or straight 2088 92.4
Missing 1772
327) REALINC
Family income in 1972-2006 surveys in constant dollars (base=1986)
RANGE: 218 to 144835.4286
N Mean Std. Deviation
Total 3509 40053.127 40147.485
Missing 523
328) CONINC
Inflation-adjusted family income
RANGE: 336 to 168736.29696
N Mean Std. Deviation
Total 3509 55955.939 47370.022
Missing 523
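Both REALINC and CONINC report the same underlying family income, adjusted
for inflation to different base years. The conversion is the standard
constant-dollar calculation: nominal income is scaled by the ratio of the
price index in the base year to the price index in the survey year,

    constant-dollar income = nominal income × (price index in base year / price index in survey year)

so, for example, income reported in a year when prices were double their
base-year level is divided by two.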
329) COHORT
Birth cohort of respondent
RANGE: 1932 to 9999
N Mean Std. Deviation
Total 4032 2632.041 2210.722
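Note that the mean and standard deviation above are inflated because the top
of the range, 9999, appears to be a no-answer code rather than a real birth
year. A minimal SPSS sketch, assuming 9999 is the only such code in this
file, that declares it user-missing before recomputing the statistics:

* Exclude the 9999 no-answer code from all statistics.
MISSING VALUES COHORT (9999).
* Recompute descriptives on valid birth years only.
DESCRIPTIVES VARIABLES=COHORT /STATISTICS=MEAN STDDEV MIN MAX.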
330) VETYEARS
Have you ever been on active duty for military training or service for two
consecutive months or more? IF YES: What was your total time on active
duty?
RANGE: 0 to 3
N Mean Std. Deviation
Total 3934 0.24 0.714
0) No active duty 3491 88.7
1) Yes, less than two years 87 2.2
2) Yes, two to four years 212 5.4
3) Yes, more than four years 144 3.7
Missing 98
331) DWELOWN16
When you were 16 years old, did your family own your own home, pay rent,
or something else?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2636 1.252 0.463
1) Owned or was buying 2007 76.1
2) Paid rent 595 22.6
3) Other 34 1.3
Missing 1396
332) DWELOWN
(Do you/Does your family) own your (home/apartment), pay rent, or what?
RANGE: 1 to 3
N Mean Std. Deviation
Total 2645 1.324 0.494
1) Own or is buying 1821 68.8
2) Pays rent 791 29.9
3) Other 33 1.2
Missing 1387
333) SEI10
Respondent's socioeconomic index (2010)
RANGE: 10.6 to 93.7
N Mean Std. Deviation
Total 3873 52.409 23.173
334) HISPANIC
Are you Spanish, Hispanic, or Latino/Latina? IF YES: Which group are you
from?
RANGE: 1 to 50
N Mean Std. Deviation
Total 3998 1.873 4.628
1) Not Hispanic 3544 88.6
2) Mexican, Mexican American, Chicano/a 245 6.1
3) Puerto Rican 51 1.3
4) Cuban 21 0.5
5) Salvadorian 9 0.2
6) Guatemalan 2 0.1
7) Panamanian 4 0.1
8) Nicaraguan 2 0.1
9) Costa Rican 4 0.1
10) Central American 5 0.1
11) Honduran 3 0.1
15) Dominican 17 0.4
20) Peruvian 4 0.1
21) Ecuadorian 8 0.2
22) Colombian 10 0.3
23) Venezuelan 5 0.1
24) Argentinian 1 0.0
25) Chilean 1 0.0
30) Spanish 46 1.2
35) Filipino/a 1 0.0
41) South American 3 0.1
46) Latino/a 2 0.1
47) Hispanic 7 0.2
50) Other, not specified 3 0.1
Missing 34
335) RACECEN1
(What is your race? Indicate one or more races that you consider yourself
to be.) First mention
RANGE: 1 to 16
N Mean Std. Deviation
Total 3978 3.524 3.816
(For each category below, the first percentage is of all cases and the second is of valid cases.)
1) White 3110 77.1% 78.2%
2) Black or African American 463 11.5% 11.6%
3) American Indian or Alaska Native 44 1.1% 1.1%
4) Asian Indian 45 1.1% 1.1%
5) Chinese 39 1.0% 1.0%
6) Filipino 15 0.4% 0.4%
7) Japanese 13 0.3% 0.3%
8) Korean 23 0.6% 0.6%
9) Vietnamese 5 0.1% 0.1%
10) Other Asian 20 0.5% 0.5%
11) Native Hawaiian 2 0.0% 0.1%
12) Guamanian or Chamorro 2 0.0% 0.1%
13) Samoan 2 0.0% 0.1%
14) Other Pacific Islander 3 0.1% 0.1%
15) Some other race 56 1.4% 1.4%
16) Hispanic 136 3.4% 3.4%
Missing 54
336) ZODIAC
Astrological sign of respondent
RANGE: 1 to 12
N Mean Std. Deviation
Total 3676 6.816 3.337
1) Aries 237 6.4
2) Taurus 281 7.6
3) Gemini 253 6.9
4) Cancer 270 7.3
5) Leo 305 8.3
6) Virgo 327 8.9
7) Libra 319 8.7
8) Scorpio 346 9.4
9) Sagittarius 339 9.2
10) Capricorn 390 10.6
11) Aquarius 330 9.0
12) Pisces 279 7.6
Missing 356
337) WRKGOVT1
(Are/Were) you employed by the government? (Please consider federal,
state, or local government.)
RANGE: 1 to 2
N Mean Std. Deviation
Total 3916 1.778 0.416
1) Yes 869 22.2
2) No 3047 77.8
Missing 116
338) WRKGOVT2
(Are/Were) you employed by a private employer (including non-profit organizations)?
RANGE: 1 to 2
N Mean Std. Deviation
Total 3919 1.408 0.491
1) Yes 2322 59.2
2) No 1597 40.8
Missing 113
339) IMMLIMIT
America should limit immigration in order to protect our national way of
life.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1813 3.069 1.2
1) Strongly agree 194 10.7
2) Agree 433 23.9
3) Neither agree nor disagree 471 26.0
4) Disagree 484 26.7
5) Strongly disagree 231 12.7
Missing 2219
340) TRRESRCH
(On a scale of 0 to 10, how much do you personally trust each of the
following institutions? 0 means you do not trust an institution at all, and
10 means you trust it completely.) University research centers
RANGE: 0 to 10
N Mean Std. Deviation
Total 1788 6.247 2.383
0) No trust 58 3.2
1) 22 1.2
2) 64 3.6
3) 83 4.6
4) 93 5.2
5) 389 21.8
6) 151 8.4
7) 280 15.7
8) 343 19.2
9) 207 11.6
10) Complete trust 98 5.5
Missing 2244
341) TRMEDIA
(On a scale of 0 to 10, how much do you personally trust each of the
following institutions? 0 means you do not trust an institution at all, and
10 means you trust it completely.) The news media
RANGE: 0 to 10
N Mean Std. Deviation
Total 1820 3.61 2.732
0) No trust 372 20.4
1) 149 8.2
2) 199 10.9
3) 179 9.8
4) 150 8.2
5) 291 16.0
6) 153 8.4
7) 172 9.5
8) 105 5.8
9) 31 1.7
10) Complete trust 19 1.0
Missing 2212
342) TRBUSIND
(On a scale of 0 to 10, how much do you personally trust each of the
following institutions? 0 means you do not trust an institution at all, and
10 means you trust it completely.) Business and industry
RANGE: 0 to 10
N Mean Std. Deviation
Total 1805 4.672 1.975
0) No trust 67 3.7
1) 45 2.5
2) 134 7.4
3) 218 12.1
4) 249 13.8
5) 571 31.6
6) 210 11.6
7) 193 10.7
8) 80 4.4
9) 16 0.9
10) Complete trust 22 1.2
Missing 2227
343) TRLEGIS
(On a scale of 0 to 10, how much do you personally trust each of the
following institutions? 0 means you do not trust an institution at all, and
10 means you trust it completely.) The U.S. Congress
RANGE: 0 to 10
N Mean Std. Deviation
Total 1818 3.236 2.418
0) No trust 339 18.6
1) 195 10.7
2) 197 10.8
3) 255 14.0
4) 222 12.2
5) 332 18.3
6) 108 5.9
7) 88 4.8
8) 46 2.5
9) 11 0.6
10) Complete trust 25 1.4
Missing 2214
344) CLMTCAUS
There has been a lot of discussion about the world's climate and the idea
it has been changing in recent decades. Which of the following statements
comes closest to your opinion?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1770 3.308 0.777
1) The world's climate has not been changing 35 2.0
2) The world's climate has been changing mostly due to natural processes 240 13.6
3) The world's climate has been changing about equally due to natural processes and human activity 639 36.1
4) The world's climate has been changing mostly due to human activity 856 48.4
Missing 2262
345) CLMTWRLD
On a scale from 0 to 10, how bad or good do you think the impacts of climate change will be for the world as a whole? 0 means extremely bad, 10 means extremely good.
RANGE: 0 to 10
N Mean Std. Deviation
Total 1737 2.876 2.359
0) Extremely bad 428 24.6
1) 137 7.9
2) 243 14.0
3) 239 13.8
4) 184 10.6
5) 346 19.9
6) 56 3.2
7) 44 2.5
8) 20 1.2
9) 15 0.9
10) Extremely good 25 1.4
Missing 2295
346) CLMTUSA
On a scale from 0 to 10, how bad or good do you think the impacts of climate change will be for America? 0 means extremely bad, 10 means extremely good.
RANGE: 0 to 10
N Mean Std. Deviation
Total 1722 3.029 2.327
0) Extremely bad 372 21.6
1) 124 7.2
2) 242 14.1
3) 263 15.3
4) 198 11.5
5) 349 20.3
6) 60 3.5
7) 52 3.0
8) 19 1.1
9) 24 1.4
10) Extremely good 19 1.1
Missing 2310
347) NATURDEV
How willing would you be to accept a reduction in the size of America's
protected nature areas, in order to open them up for economic development?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1771 4.036 1.051
1) Very willing 28 1.6
2) Fairly willing 159 9.0
3) Neither willing nor unwilling 303 17.1
4) Fairly unwilling 513 29.0
5) Very unwilling 768 43.4
Missing 2261
348) INDUSGEN1
In general, do you think that air pollution caused by industry is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1779 2.142 0.863
1) Extremely dangerous for the environment 461 25.9
2) Very dangerous 688 38.7
3) Somewhat dangerous 557 31.3
4) Not very dangerous 63 3.5
5) Not dangerous at all for the environment 10 0.6
Missing 2253
349) CHEMGEN1
And do you think that pesticides and chemicals used in farming are...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1773 2.27 0.906
1) Extremely dangerous for the environment 419 23.6
2) Very dangerous 578 32.6
3) Somewhat dangerous 667 37.6
4) Not very dangerous 97 5.5
5) Not dangerous at all for the environment 12 0.7
Missing 2259
350) WATERGEN1
And do you think that pollution of America's rivers, lakes and streams
is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1780 1.982 0.849
1) Extremely dangerous for the environment 591 33.2
2) Very dangerous 691 38.8
3) Somewhat dangerous 442 24.8
4) Not very dangerous 51 2.9
5) Not dangerous at all for the environment 5 0.3
Missing 2252
351) GENEGEN1
And do you think that modifying the genes of certain crops is...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1655 2.857 1.05
1) Extremely dangerous for the environment 198 12.0
2) Very dangerous 367 22.2
3) Somewhat dangerous 649 39.2
4) Not very dangerous 355 21.5
5) Not dangerous at all for the environment 86 5.2
Missing 2377
352) NUKEGEN1
And do you think that nuclear power stations are...
RANGE: 1 to 5
N Mean Std. Deviation
Total 1743 2.714 1.114
1) Extremely dangerous for the environment 309 17.7
2) Very dangerous 386 22.1
3) Somewhat dangerous 631 36.2
4) Not very dangerous 328 18.8
5) Not dangerous at all for the environment 89 5.1
Missing 2289
353) ENJOYNAT
How much, if at all, do you enjoy being outside in nature?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1792 4.07 0.876
1) Not at all 11 0.6
2) To a small extent 61 3.4
3) To some extent 380 21.2
4) To a great extent 679 37.9
5) To a very great extent 661 36.9
Missing 2240
354) ACTIVNAT
In the last 12 months how often, if at all, have you engaged in any leisure activities outside in nature, such as hiking, bird watching, swimming, skiing, other outdoor activities, or just relaxing?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1774 2.628 1.137
1) Daily 328 18.5
2) Several times a week 530 29.9
3) Several times a month 475 26.8
4) Several times a year 356 20.1
5) Never 85 4.8
Missing 2258
355) PLANETRP
In the last 12 months, how many trips did you make by plane? Count outward
and return journeys, including transfers, as one trip.
RANGE: 0 to 200
N Mean Std. Deviation
Total 1806 1.118 7.11
Missing 2226
356) CARHR
In a typical week, about how many hours do you spend in a car or another
motor vehicle, including motorcycles, trucks, and vans, but not counting
public transport? Do not include shared rides in buses, collective taxis,
or carpooling services.
RANGE: 0 to 90
N Mean Std. Deviation
Total 1800 6.313 8.605
Missing 2232
357) EATMEAT
In a typical week, on how many days do you eat beef, lamb, or products
that contain them?
RANGE: 0 to 7
N Mean Std. Deviation
Total 1795 2.77 1.959
Missing 2237
358) NUMROOMS
How many rooms are there in your home (apartment or house)? Do not count any separate kitchens, bathrooms, garages, balconies, hallways or cupboards.
RANGE: 0 to 20
N Mean Std. Deviation
Total 1812 5.343 2.795
Missing 2220
359) AIRPOLLU
(Thinking about your neighborhood, to what extent, if at all, was it
affected by the following things over the last 12 months?) Air pollution
RANGE: 1 to 5
N Mean Std. Deviation
Total 1717 2.015 1.034
1) Not at all 661 38.5
2) To a small extent 563 32.8
3) To some extent 348 20.3
4) To a great extent 96 5.6
5) To a very great extent 49 2.9
Missing 2315
360) WTRPOLLU
(Thinking about your neighborhood, to what extent, if at all, was it
affected by the following things over the last 12 months?) Water pollution
RANGE: 1 to 5
N Mean Std. Deviation
Total 1699 1.809 0.982
1) Not at all 854 50.3
2) To a small extent 449 26.4
3) To some extent 285 16.8
4) To a great extent 88 5.2
5) To a very great extent 23 1.4
Missing 2333
361) EXWEATHR
(Thinking about your neighborhood, to what extent, if at all, was it
affected by the following things over the last 12 months?) Extreme weather
events (such as severe storms, droughts, floods, heat waves, cold snaps,
etc.)
RANGE: 1 to 5
N Mean Std. Deviation
Total 1756 2.574 1.079
1) Not at all 301 17.1
2) To a small extent 554 31.5
3) To some extent 591 33.7
4) To a great extent 212 12.1
5) To a very great extent 98 5.6
Missing 2276
362) MKT1
It is the responsibility of private companies to reduce the differences in
pay between their employees with high pay and those with low pay.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1790 2.402 1.086
1) Strongly agree 384 21.5
2) Agree 680 38.0
3) Neither agree nor disagree 425 23.7
4) Disagree 224 12.5
5) Strongly disagree 77 4.3
Missing 2242
363) RESPINEQ
From the following list, who do you think should have the greatest responsibility for reducing differences in income between people with high incomes and people with low incomes?
RANGE: 1 to 6
N Mean Std. Deviation
Total 1535 2.542 1.701
1) Private companies 515 33.6
2) Government 539 35.1
3) Trade unions 87 5.7
4) High-income individuals themselves 97 6.3
5) Low-income individuals themselves 122 7.9
6) Income differences do not need to be reduced 175 11.4
Missing 2497
364) GOVINEQ1
To what extent do you agree or disagree with the following statement: Most
politicians in America do not care about reducing the differences in
income between people with high incomes and people with low incomes.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1749 2.011 0.942
1) Strongly agree 593 33.9
2) Agree 692 39.6
3) Neither agree nor disagree 338 19.3
4) Disagree 103 5.9
5) Strongly disagree 23 1.3
Missing 2283
365) INEQMAD
Some people feel angry about differences in wealth between the rich and the poor, while others do not. How do you feel when you think about differences in wealth between the rich and the poor in America? Please place yourself on a scale of 0 to 10, where 0 means not angry at all and 10 means extremely angry.
RANGE: 0 to 10
N Mean Std. Deviation
Total 1778 4.764 2.986
0) Not angry at all 266 15.0
1) 76 4.3
2) 109 6.1
3) 124 7.0
4) 125 7.0
5) 356 20.0
6) 168 9.4
7) 229 12.9
8) 143 8.0
9) 50 2.8
10) Extremely angry 132 7.4
Missing 2254
366) MIGRPOOR
(Turning to international differences, to what extent do you agree or disagree with the following statements?) People from poor countries should be allowed to work in wealthy countries.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1742 2.376 0.944
1) Strongly agree 280 16.1
2) Agree 770 44.2
3) Neither agree nor disagree 498 28.6
4) Disagree 145 8.3
5) Strongly disagree 49 2.8
Missing 2290
367) CONTPOOR
How often do you have any contact with people who are a lot poorer than
you when you are out and about? This might be in the street, on public
transport, in shops, in your neighborhood, or at your workplace.
RANGE: 1 to 7
N Mean Std. Deviation
Total 1704 4.623 1.87
1) Never 80 4.7
2) Less than once a month 240 14.1
3) Once a month 151 8.9
4) Several times a month 378 22.2
5) Once a week 128 7.5
6) Several times a week 377 22.1
7) Every day 350 20.5
Missing 2328
368) CONTRICH
How often do you have any contact with people who are a lot richer than
you when you are out and about? This might be in the street, on public
transport, in shops, in your neighborhood, or at your workplace.
RANGE: 1 to 7
N Mean Std. Deviation
Total 1706 3.978 1.901
1) Never 158 9.3
2) Less than once a month 349 20.5
3) Once a month 188 11.0
4) Several times a month 399 23.4
5) Once a week 110 6.4
6) Several times a week 294 17.2
7) Every day 208 12.2
Missing 2326
369) CLASS1
Most people see themselves as belonging to a particular class. Please tell
me which social class you would say you belong to?
RANGE: 1 to 6
N Mean Std. Deviation
Total 1806 3.295 1.165
1) Lower class 95 5.3
2) Working class 461 25.5
3) Lower middle class 336 18.6
4) Middle class 666 36.9
5) Upper middle class 226 12.5
6) Upper class 22 1.2
Missing 2226
370) RANK1
In our society there are groups which tend to be toward the top and groups
which tend to be toward the bottom. On the handcard is a scale that runs
from top to bottom. Where would you put yourself now on this scale?
RANGE: 1 to 10
N Mean Std. Deviation
Total 1812 5.047 1.684
1) Top 34 1.9
2) 44 2.4
3) 228 12.6
4) 267 14.7
5) 730 40.3
6) 197 10.9
7) 165 9.1
8) 78 4.3
9) 29 1.6
10) Bottom 40 2.2
Missing 2220
371) RANK16
(In our society there are groups which tend to be toward the top and
groups which tend to be toward the bottom. On the handcard is a scale that
runs from top to bottom.) If you think about the family that you grew up
in, where did they fit in then?
RANGE: 1 to 10
N Mean Std. Deviation
Total 1808 5.482 1.866
1) Top 34 1.9
2) 44 2.4
3) 160 8.8
4) 226 12.5
5) 599 33.1
6) 243 13.4
7) 237 13.1
8) 146 8.1
9) 64 3.5
10) Bottom 55 3.0
Missing 2224
372) RANK10FUT
(In our society there are groups which tend to be toward the top and
groups which tend to be toward the bottom. On the handcard is a scale that
runs from top to bottom.) And thinking ahead 10 years from now, where do
you think you will be on a scale of 1 to 10, where 1 is the top and 10 the
bottom?
RANGE: 1 to 10
N Mean Std. Deviation
Total 1793 4.732 1.917
1) Top 77 4.3
2) 106 5.9
3) 282 15.7
4) 317 17.7
5) 554 30.9
6) 170 9.5
7) 129 7.2
8) 82 4.6
9) 27 1.5
10) Bottom 49 2.7
Missing 2239
373) FAIRDIST
How fair or unfair do you think the income distribution is in America?
RANGE: 1 to 4
N Mean Std. Deviation
Total 1652 2.887 0.797
1) Very fair 74 4.5
2) Fair 406 24.6
3) Unfair 805 48.7
4) Very unfair 367 22.2
Missing 2380
374) ENDSME12
And during the next 12 months, how difficult or easy do you think it will
be for your household to make ends meet?
RANGE: 1 to 5
N Mean Std. Deviation
Total 1752 3.244 1.162
1) Very difficult 132 7.5
2) Fairly difficult 349 19.9
3) Neither easy nor difficult 507 28.9
4) Fairly easy 488 27.9
5) Very easy 276 15.8
Missing 2280
375) SKIPMEAL
How often do you or other members of your household skip a meal because
there is not enough money for food?
RANGE: 1 to 7
N Mean Std. Deviation
Total 1767 1.421 1.119
1) Never 1471 83.2
2) Less than once a month 107 6.1
3) Once a month 46 2.6
4) Several times a month 82 4.6
5) Once a week 18 1.0
6) Several times a week 31 1.8
7) Every day 12 0.7
Missing 2265
376) RATEPAIN1
On a scale from 0 to 10, with 0 meaning no pain and 10 being the worst
imaginable pain, how would you rate your pain on average?
RANGE: 0 to 10
N Mean Std. Deviation
Total 3613 2.7 2.481
0) No pain 828 22.9
1) 620 17.2
2) 600 16.6
3) 430 11.9
4) 258 7.1
5) 302 8.4
6) 211 5.8
7) 173 4.8
8) 114 3.2
9) 56 1.5
10) The worst imaginable pain 21 0.6
Missing 419
377) RELIGIMP
How important is religion in your life—very important, somewhat important,
not too important, or not at all important?
RANGE: 1 to 4
N Mean Std. Deviation
Total 3609 2.369 1.163
1) Very important 1124 31.1
2) Somewhat important 923 25.6
3) Not too important 668 18.5
4) Not at all important 894 24.8
Missing 423
378) RELIDIMP
When thinking about religion, how important is being (a Christian/a Catholic/a Jew/a Buddhist/a Hindu/a Muslim/an atheist/an agnostic/someone who does not identify with a religion/a member of your religion) to you?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3511 2.797 1.416
1) Extremely important 855 24.4
2) Very important 776 22.1
3) Somewhat important 711 20.3
4) Not too important 563 16.0
5) Not at all important 606 17.3
Missing 521
379) RELIDESC
How well does the term (Christian/Catholic/Jew/Buddhist/Hindu/Muslim/atheist/agnostic/non-religious/N/A) describe you?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3450 2.497 1.086
1) Extremely well 689 20.0
2) Very well 1085 31.4
3) Somewhat well 1130 32.8
4) Not very well 363 10.5
5) Not well at all 183 5.3
Missing 582
380) RELIDWE
When talking about (Christians/Catholics/Jews/Buddhists/Hindus/Muslims/atheists/agnostics/people who do not identify with a religion/your religion), how often do you say 'we' instead of 'they'?
RANGE: 1 to 5
N Mean Std. Deviation
Total 3432 2.859 1.352
1) Never 785 22.9
2) Rarely 586 17.1
3) Some of the time 857 25.0
4) Most of the time 735 21.4
5) All of the time 469 13.7
Missing 600
381) RELIDINS
If someone criticized (Christians/Catholics/Jews/Buddhists/Hindus/Muslims/atheists/agnostics/people who do not identify with a religion/your religion), to what extent would it feel like a personal insult?
RANGE: 1 to 4
N Mean Std. Deviation
Total 3496 2.682 1.024
1) A great deal 483 13.8
2) Somewhat 1115 31.9
3) Very little 927 26.5
4) Not at all 971 27.8
Missing 536
382) SPRTCONNCT
(Some people say they have experiences of being personally moved, touched,
or inspired, while others say they do not have these experiences at all.
How often, if at all, do you experience each of the following?) Felt particularly connected to the world around you.
RANGE: 1 to 7
N Mean Std. Deviation
Total 3523 3.811 1.859
1) At least once a day 324 9.2
2) Almost every day 828 23.5
3) Once or twice a week 514 14.6
4) Once or twice a month 503 14.3
5) A few times per year 654 18.6
6) Once a year or less 279 7.9
7) Never 421 12.0
Missing 509
383) SPRTLRGR
(Some people say they have experiences of being personally moved, touched,
or inspired, while others say they do not have these experiences at all.
How often, if at all, do you experience each of the following?) Felt like
you were part of something much larger than yourself.
RANGE: 1 to 7
N Mean Std. Deviation
Total 3544 4.095 1.951
1) At least once a day 334 9.4
2) Almost every day 693 19.6
3) Once or twice a week 442 12.5
4) Once or twice a month 421 11.9
5) A few times per year 693 19.6
6) Once a year or less 409 11.5
7) Never 552 15.6
Missing 488
384) SPRTPURP
(Some people say they have experiences of being personally moved, touched,
or inspired, while others say they do not have these experiences at all.
How often, if at all, do you experience each of the following?) Felt a
sense of a larger meaning or purpose in life.
RANGE: 1 to 7
N Mean Std. Deviation
Total 3537 3.996 1.949
1) At least once a day 371 10.5
2) Almost every day 708 20.0
3) Once or twice a week 460 13.0
4) Once or twice a month 428 12.1
5) A few times per year 672 19.0
6) Once a year or less 390 11.0
7) Never 508 14.4
Missing 495
385) MDITATE1
How often do you meditate?
RANGE: 1 to 7
N Mean Std. Deviation
Total 3574 4.803 2.185
1) At least once a day 336 9.4
2) Almost every day 402 11.2
3) Once or twice a week 461 12.9
4) Once or twice a month 343 9.6
5) A few times per year 374 10.5
6) Once a year or less 204 5.7
7) Never 1454 40.7
Missing 458
386) GRTWRKS
(Please indicate to what extent you agree or disagree with each of the following statements.) The great works of philosophy and science are the best source of truth, wisdom, and ethics.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3547 2.796 0.942
1) Strongly agree 297 8.4
2) Agree 959 27.0
3) Neither agree nor disagree 1624 45.8
4) Disagree 506 14.3
5) Strongly disagree 161 4.5
Missing 485
387) FREEMIND
(Please indicate to what extent you agree or disagree with each of the following statements.) To understand the world, we must free our minds from old traditions and beliefs.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3556 2.893 1.05
1) Strongly agree 345 9.7
2) Agree 902 25.4
3) Neither agree nor disagree 1334 37.5
4) Disagree 740 20.8
5) Strongly disagree 235 6.6
Missing 476
388) DECEVIDC
(Please indicate to what extent you agree or disagree with each of the following statements.) When I make important decisions in my life, I rely mostly on reason and evidence.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3564 2.078 0.847
1) Strongly agree 833 23.4
2) Agree 1888 53.0
3) Neither agree nor disagree 626 17.6
4) Disagree 167 4.7
5) Strongly disagree 50 1.4
Missing 468
389) ADVFMSCI
(Please indicate to what extent you agree or disagree with each of the following statements.) All of the greatest advances for humanity have come from science and technology.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3558 2.734 0.997
1) Strongly agree 366 10.3
2) Agree 1135 31.9
3) Neither agree nor disagree 1271 35.7
4) Disagree 652 18.3
5) Strongly disagree 134 3.8
Missing 474
390) GODUSA
(Please indicate to what extent you agree or disagree with each of the following statements.) The success of the United States is part of God's plan.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3553 3.285 1.264
1) Strongly agree 326 9.2
2) Agree 607 17.1
3) Neither agree nor disagree 1238 34.8
4) Disagree 493 13.9
5) Strongly disagree 889 25.0
Missing 479
391) GOVCHRST
(Please indicate to what extent you agree or disagree with each of the following statements.) The federal government should advocate Christian values.
RANGE: 1 to 5
N Mean Std. Deviation
Total 3559 3.367 1.246
1) Strongly agree 287 8.1
2) Agree 598 16.8
3) Neither agree nor disagree 1087 30.5
4) Disagree 696 19.6
5) Strongly disagree 891 25.0
Missing 473
392) TRDUNIO1
To what extent do you agree or disagree with the following statements?
Workers need strong trade unions to protect their interests.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1697 2.666 1.109
1) Strongly agree 246 14.5
2) Agree 555 32.7
3) Neither agree nor disagree 542 31.9
4) Disagree 227 13.4
5) Strongly disagree 127 7.5
Missing 2335
393) BOARDREP
(To what extent do you agree or disagree with the following statements?)
Workers should be represented on the board of directors at major companies.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1701 2.098 0.888
1) Strongly agree 427 25.1
2) Agree 820 48.2
3) Neither agree nor disagree 342 20.1
4) Disagree 84 4.9
5) Strongly disagree 28 1.6
Missing 2331
394) UPWAGES
(To what extent do you agree or disagree with the following statements?)
The government should ensure that the wages of low-paying jobs increase as
the economy grows.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1714 2.127 1.075
1) Strongly agree 537 31.3
2) Agree 708 41.3
3) Neither agree nor disagree 254 14.8
4) Disagree 144 8.4
5) Strongly disagree 71 4.1
Missing 2318
395) LIMITPAY
(To what extent do you agree or disagree with the following statements?)
The government should take steps to limit the pay of executives at major
companies.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1697 2.864 1.215
1) Strongly agree 259 15.3
2) Agree 432 25.5
3) Neither agree nor disagree 459 27.0
4) Disagree 374 22.0
5) Strongly disagree 173 10.2
Missing 2335
396) PRFTIMPV
(To what extent do you agree or disagree with the following statements?)
Allowing business to make good profits is the best way to improve everyone's standard of living.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1690 2.649 1.007
1) Strongly agree 184 10.9
2) Agree 627 37.1
3) Neither agree nor disagree 555 32.8
4) Disagree 246 14.6
5) Strongly disagree 78 4.6
Missing 2342
397) GOVFNANC
(To what extent do you agree or disagree with the following statements?)
The government should finance projects to create new jobs, even if it
might require a tax increase to pay for it.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1698 2.756 1.057
1) Strongly agree 180 10.6
2) Agree 566 33.3
3) Neither agree nor disagree 540 31.8
4) Disagree 312 18.4
5) Strongly disagree 100 5.9
Missing 2334
398) DCLINDUS
(To what extent do you agree or disagree with the following statements?)
The government should support declining industries to protect jobs, even
if it might require a tax increase to pay for it.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1677 3.291 0.991
1) Strongly agree 54 3.2
2) Agree 311 18.5
3) Neither agree nor disagree 582 34.7
4) Disagree 553 33.0
5) Strongly disagree 177 10.6
Missing 2355
399) GOVFNAID
(To what extent do you agree or disagree with the following statements?)
The government should give financial assistance to college students from
low-income families, even if it might require a tax increase to pay for it.
RANGE: 1 to 5
N Mean Std. Deviation
Total 1713 2.521 1.084
1) Strongly agree 281 16.4
2) Agree 681 39.8
3) Neither agree nor disagree 419 24.5
4) Disagree 242 14.1
5) Strongly disagree 90 5.3
Missing 2319
400) HIVAFRAID
Please indicate how much you agree or disagree with each of the following
comments. I would be afraid to be around a person with HIV because I would
be worried I could get infected.
RANGE: 1 to 4
N Mean Std. Deviation
Total 2174 1.685 0.933
1) Strongly disagree 1249 57.5
2) Disagree 508 23.4
3) Agree 270 12.4
4) Strongly agree 147 6.8
Missing 1858
401) HIVIMMRL
(Please indicate how much you agree or disagree with each of the following
comments.) People who have HIV have participated in immoral activities.
RANGE: 1 to 4
N Mean Std. Deviation
Total 2032 1.653 0.875
1) Strongly disagree 1163 57.2
2) Disagree 500 24.6
3) Agree 280 13.8
4) Strongly agree 89 4.4
Missing 2000
402) HIVDSCRM
(Please indicate how much you agree or disagree with each of the following
comments.) There is a lot of discrimination against people with HIV in
this country today.
RANGE: 1 to 4
N Mean Std. Deviation
Total 1976 2.781 0.848
1) Strongly disagree 161 8.1
2) Disagree 490 24.8
3) Agree 945 47.8
4) Strongly agree 380 19.2
Missing 2056
403) STRVBIAS
Which do you think should be the bigger priority for the U.S. criminal justice system today?
RANGE: 1 to 2
N Mean Std. Deviation
Total 1149 1.665 0.472
1) Strengthening law and order through more police and greater enforcement of the laws 385 33.5
2) Reducing bias against minorities in the criminal justice system by reforming court and police practices 764 66.5
Missing 2883
404) RACESURV17
How much, if at all, do you think the legacy of slavery affects the position of Black people in American society today?
RANGE: 1 to 4
N Mean Std. Deviation
Total 2340 2.168 1.064
1) A great deal 802 34.3
2) A fair amount 704 30.1
3) Not much 474 20.3
4) Not at all 360 15.4
Missing 1692
405) DEFUND
Do you favor or oppose reducing funding for police departments, and moving
those funds to mental health, housing, and other social services?
RANGE: 1 to 2
N Mean Std. Deviation
Total 2344 1.594 0.491
1) Favor 952 40.6
2) Oppose 1392 59.4
Missing 1688
406) POLTRTBLK
The following questions are about police and law enforcement. In general,
do the police (treat Whites better than Blacks, treat them both the same,
or treat Blacks better than Whites/treat Blacks better than Whites, treat
them both the same, or treat Whites better than Blacks)?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2328 2.382 1.344
1) Treat Whites much better than Blacks 914 39.3
2) Treat Whites moderately better than Blacks 452 19.4
3) Treat Whites a little better than Blacks 167 7.2
4) Treat both the same 765 32.9
5) Treat Blacks a little better than Whites 19 0.8
6) Treat Blacks moderately better than Whites 5 0.2
7) Treat Blacks much better than Whites 6 0.3
Missing 1704
407) POLTRTHSP
In general, do the police (treat Whites better than Latinos, treat them both the same, or treat Latinos better than Whites/treat Latinos better than Whites, treat them both the same, or treat Whites better than Latinos)?
RANGE: 1 to 7
N Mean Std. Deviation
Total 2315 2.513 1.318
1) Treat Whites much better than Latinos 761 32.9
2) Treat Whites moderately better than Latinos 509 22.0
3) Treat Whites a little better than Latinos 174 7.5
4) Treat both the same 856 37.0
5) Treat Latinos a little better than Whites 3 0.1
6) Treat Latinos moderately better than Whites 6 0.3
7) Treat Latinos much better than Whites 6 0.3
Missing 1717
Media Attributions
• gss screenshot from codebook © NORC is licensed under an All Rights Reserved license
Works Cited
Adler, Emily Stier, and Roger Clark. 2008. How It’s Done: An Invitation to Social Research,
3rd. Belmont, CA: Wadsworth.
American Sociological Association. 2019. American Sociological Association Style Guide, 6th
Edition. Washington, D.C.: American Sociological Association.
Bartsch, Robert A., Teresa Burnett, Tommye R. Diller, and Elizabeth Rankin-Williams. 2000.
“Gender Representation in Television Commercials: Updating an Update.” Sex Roles 43(9):
735-43.
Berg, Bruce L. 2009. Qualitative Research Methods for the Social Sciences, 7th. Boston:
Allyn and Bacon.
Bergin, Tiffany. 2018. An Introduction to Data Analysis: Quantitative, Qualitative, and Mixed
Methods. London: Sage Publications.
Boole, George. 1848. “The Calculus of Logic.” Cambridge and Dublin Mathematical Journal
III:183-98.
Brewster, Signe. 2020. “The Best Transcription Services.” Wirecutter. Website accessed 07/28/2020 (https://www.nytimes.com/wirecutter/reviews/best-transcription-services/).
Center for American Women and Politics. 2022. “New Records for Women in the U.S. Congress and House.” Rutgers Eagleton Institute of Politics. Website accessed 12/12/2022 (https://cawp.rutgers.edu/news-media/press-releases/new-records-women-us-congress-and-house).
Cite Black Women Collective. n.d. “Cite Black Women.” Website accessed 06/24/2020,
(https://www.citeblackwomencollective.org/).
Davern, Michael; Bautista, Rene; Freese, Jeremy; Morgan, Stephen L.; and Tom W. Smith.
General Social Survey 2021 Cross-section. [Machine-readable data file]. Principal Investigator, Michael Davern; Co-Principal Investigators, Rene Bautista, Jeremy Freese, Stephen
L. Morgan, and Tom W. Smith. NORC ed. Chicago, 2021. 1 datafile (68,846 cases) and 1
codebook (506 pages).
Desmond, Matthew. 2012. “Disposable Ties and the Urban Poor.” American Journal of Sociology 117(5):1295-1335.
Edin, Kathryn. 2000. “What Do Low-Income Single Mothers Say about Marriage?” Social
Problems 47(1):112-33.
Fernandez, Mary Ellen. 2019. “Trump and Clinton Rallies: Are Political Campaigns Quasi-Religious in Nature?” Sociology Between the Gaps: Forgotten and Neglected Topics 4(1):1-10.
Gauthier, Jessica, Madeline MacKay, Madison Mellor, and Roger Clark. 2020. “Ebbs and Flows in the Feminist Presentation of Female Characters Among Caldecott Award-Winning Picture Books for Children.” Sociology Between the Gaps: Forgotten and Neglected Topics 5(1): 1-20.
Gerring, John. 2007. Case Study Research: Principles and Practices. Cambridge, UK: Cambridge University Press.
Grbich, Carol. 2007. Qualitative Data Analysis: An Introduction. London: Sage.
Hsiung, Ping-Chun. 2008. “Teaching Reflexivity in Qualitative Interviewing.” Teaching Sociology 36(July):211-26.
Inter-Parliamentary Union. 2022. “Monthly Ranking of Women in National Parliaments.” IPU Parline. Website accessed 12/12/2022 (https://data.ipu.org/women-ranking).
Jones, Taylor, Jessica Rose Kalbfeld, Ryan Hancock, and Robin Clark. 2019. “Testifying While
Black: An Experimental Study of Court Reporter Accuracy in Transcription of African
American English.” Language 95(2):e216-52.
Kearney, Melissa S. and Phillip B. Levine. 2015. “Media Influences on Social Outcomes: The Impact of MTV’s 16 and Pregnant on Teen Childbearing.” American Economic Review 105(12):3597-3632.
Khamsi, Roxanne. 2019. “Say What? A Non-Scientific Comparison of Automated Transcription Services.” The Open Notebook. Website accessed 07/28/2020 (https://www.theopennotebook.com/2019/12/17/say-what-a-non-scientific-comparison-of-automated-transcription-services/).
Khan, Shamus. 2019. “The Subpoena of Ethnographic Data.” Sociological Forum
34(1):253-63.
Lovdal, Lynn T. 1989. “Sex Role Messages in Television Commercials: An Update.” Sex Roles
21(11):715-24.
Luker, Kristin. 2008. Salsa Dancing into the Social Sciences: Research in an Age of Info-glut.
Cambridge, MA: Harvard University Press.
Manning, Jimmie. 2017. “In Vivo Coding.” In The International Encyclopedia of Communication Research Methods, edited by Jörg Matthes, Christine S. Davis, and Robert F. Potter.
New York: Wiley-Blackwell.
Miles, Matthew B., and A. Michael Huberman. 1994. Qualitative Data Analysis: An Expanded
Sourcebook, 2nd Edition. Thousand Oaks, CA: Sage.
National Center for Health Statistics. 2022. “Firearm Mortality by State.” Centers for Disease
Control and Prevention. Website accessed 09/19/2022 (https://www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm).
O’Donnell, William J., and Karen J. O’Donnell. 1978. “Update: Sex‐role Messages in TV Commercials.” Journal of Communication 28(1): 156-58.
Pearce, Lisa D. 2012. “Mixed Methods Inquiry in Sociology.” American Behavioral Scientist
56(6):829-48.
Plesser, Hans E. 2017. “Reproducibility vs. Replicability: A Brief History of a Confused Terminology.” Frontiers in Neuroinformatics 11(76), accessed online (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5778115/).
Posselt, Julie. 2017. “4 tips for using quotes in #qualitative findings, after a month of much reviewing -journal articles, conference proposals, & student writing.” Twitter thread accessed 08/13/2020 (https://web.archive.org/web/20221213180729/https://twitter.com/JuliePosselt/status/883050014024450049).
Petit, Lindsay, Madison Mellor, and Roger Clark. “The Gender Gap in Political Affiliation:
Understanding Why It Emerged and Maintained Itself Over Time.” Sociology Between the
Gaps: Forgotten and Neglected Topics 5(1):1-13.
Ragin, Charles C. 2000. Fuzzy-Set Social Science. Chicago: University of Chicago Press.
Ragin, Charles C. 2008. Redesigning Social Inquiry: Fuzzy Sets and Beyond. Chicago: University of Chicago Press.
Roberts, Keith A. 1993. “Toward a Sociology of Writing.” Teaching Sociology 21(4):317-324.
Rosenberg, Morris. 1968. The Logic of Survey Analysis. New York: Basic Books.
Saldaña, Johnny. 2016. The Coding Manual for Qualitative Researchers. London: Sage.
Sanderson, Theo. n.d. “The Up-Goer Five Text Editor.” Website accessed 08/13/2020
(https://www.splasho.com/upgoer5/).
Schell, Terry L. et al. 2020. “State-Level Estimates of Household Firearm Ownership.” RAND Corporation report TL-354-LJAF. Website accessed 09/19/2022 (https://www.rand.org/pubs/tools/TL354.html).
Schwartz, Martin A. 2008. “The Importance of Stupidity in Scientific Research.” Journal of
Cell Science 121(11):1771.
Smith, Dorothy E. 1987. The Everyday World as Problematic: A Feminist Sociology. Boston,
MA: Northeastern University Press.
Taylor, Steven J., Robert Bogdan, and Marjorie L. DeVault. 2016. Introduction to Qualitative
Research Methods: A Guidebook and Resource, Fourth edition. Hoboken, NJ: John Wiley
& Sons.
Teczar, Rebecca, Katherine Rocha, Joseph Palazzo, and Roger Clark. 2018. “Cultural Attitudes towards Women in Politics and Women’s Political Representation in Legislatures
and Cabinet Ministries.” Sociology Between the Gaps: Forgotten and Neglected Topics
4(1):1-7.
Thomas, Charlie. 2020/2021. “SDA: Survey Documentation and Analysis Archive.” Social Data Archive. Website accessed 12/12/2022 (https://sda.berkeley.edu/archive.htm).
Twine, France Winddance, and Jonathan W. Warren. 2000. Racing Research, Researching
Race: Methodological Dilemmas in Critical Race Studies. New York: New York University
Press.
University of Surrey. n.d. “Computer Assisted Qualitative Data Analysis (CAQDAS) Networking Project.” Website accessed 07/30/2020 (https://www.surrey.ac.uk/computer-assisted-qualitative-data-analysis).
Van Den Berg, Harry, Margaret Wetherell, and Hanneke Houtkoop-Steenstra. 2003. Analyzing Race Talk: Multidisciplinary Approaches on the Research Interview. Cambridge, UK: Cambridge University Press.
Warren, Carol A. B., and Tracy Xavia Karner. 2015. Discovering Qualitative Methods: Ethnography, Interviews, Documents, and Images. Oxford: Oxford University Press.
Zuberi, Tukufu, and Eduardo Bonilla-Silva. 2008. White Logic, White Methods: Racism and
Methodology. Lanham, MD: Rowman & Littlefield.
About the Authors
Mikaila Mariel Lemonik Arthur is Professor of Sociology at Rhode Island College, where she
has taught a wide variety of courses including Social Research Methods, Social Data Analysis, Senior Seminar in Sociology, Professional Writing for Justice Services, Comparative Law
and Justice, Law and Society, Comparative Perspectives on Higher Education, and Race
and Justice. She has written a number of books and articles, including both those with a
pedagogical focus (including Law and Justice Around the World, published by the University of California Press) and those focusing on her scholarly expertise in higher education (including Student Activism and Curricular Change in Higher Education, published
by Routledge). She has expertise and experience in academic program review, translating
research findings for policymakers, and disability accessibility in higher education, and has
served as a department chair and as Vice President of the RIC/AFT, her faculty union. Outside of work, she enjoys reading speculative fiction, eating delicious vegan food, visiting the
ocean, and spending time with amazing humans.
Roger Clark is Professor Emeritus of Sociology at Rhode Island College, where he continues to teach courses in Social Research Methods and Social Data Analysis and to coauthor
empirical research articles with undergraduate students. He has coauthored two textbooks,
An Invitation to Social Research (with Emily Stier Adler) and Gender Inequality in Our
Changing World: A Comparative Approach (with Lori Kenschaft and Desirée Ciambrone).
He has been ranked by the USTA in its New England 60- and 65-and-older divisions, shot
four holes in one on genuine golf courses, and run multiple half and full marathons. Like the
Energizer Bunny, he keeps on going and going, but, given his age, leaves it to your imagination where.