DATA ANALYSIS GUIDE-SPSS
When conducting any statistical analysis, you need to become familiar with your data and
examine it carefully in order to reduce the odds of biased results that can make all of your
hard work essentially meaningless or substantially weakened.
Getting to Know SPSS
When you run a procedure in SPSS, such as frequencies, you need to select the variables
in the dialog box. On the left side of the dialog box, you see the list of variables in your
data file. You click a variable name to select it, and then click the right-arrow button to
move the variable into the Variable(s) list.
TIP 1
You can change the appearance of the variables so that they appear as variable names
rather than variable labels, which is the default option. You can also make the variables
appear in alphabetical order. I recommend switching to the variable names option and
having them listed alphabetically so that you can more easily find the variables of
interest to you. With this setting, you can type the first letter of the name of the variable
that you want in the variable list section of the dialog box and SPSS will jump to the
first variable that starts with that letter, and then to every subsequent variable that starts
with that letter as well.
 Directions: Pull down the Edit menu, select Options, select the General tab,
and under Variable Lists select Display names and Alphabetical.
TIP 2 [What is this variable?]
In case you forget the label that you gave to a variable, when you go to a dialog box,
such as the frequencies dialog box above, highlight the variable that you are interested in
and click the right mouse button. This provides a pop-up window that offers the
Variable Information section. In other words, if you were presented with the frequency
table above, you might highlight "size of the company [size]," click the right mouse
button, and select Variable Information. This action provides you with the name of the
variable, its label, measurement setting [e.g. ordinal], and value labels for the variable
[i.e. categories].
TIP 3 [What is this statistic?]
If you are unsure of what a particular statistic is used for, then highlight the particular
item and right-click on the selected statistic [e.g. mean]; you will receive a brief
description of what that statistic in the dialog box provides. If the statistic is one that
seems useful to you, then select it by placing a check in the box next to it. Then click
OK and you will return to the main frequencies box.
TIP 4 [What is in this output?]
To obtain help on the output screen [i.e. the SPSS Viewer], you need to double-click a pivot
table in order to activate it so that you can make modifications. When activated, it will
appear to have "railroad track" lines surrounding it. See Table a below.
Table a
Favor or Oppose Death Penalty for Murder

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor                1074       71.6         77.4              77.4
          Oppose                314       20.9         22.6             100.0
          Total                1388       92.5        100.0
Missing   DK                    106        7.1
          NA                      6         .4
          Total                 112        7.5
Total                          1500      100.0
Once you have activated the pivot table, right-click a row or column header, such as
the column labeled Valid Percent, for a pop-up menu. Choose What's This? This will
bring up a pop-up window explaining what the particular column or row is addressing.
If you forget to activate the pivot table and simply right-click on a column or row, you
will get the following message: [Displays output. Click once to select an object (for
example, so that you can copy it to the clipboard). Double-click to activate an object for
editing. If the object is a pivot table, you can obtain detailed help on items within the
table by right-clicking on row and column labels after the table is activated.]
If you want more than a pop-up delivers, choose Results Coach from the list instead
of What's This? Essentially, this will take you through a subsection of the SPSS
tutorial.
Dressing Up Your Output
Changing Text
Activate the pivot table in the SPSS Viewer [see Output screen] and then double-click
on the text that you wish to change. Enter the new text and then follow the same
procedure as needed. If you wish to get rid of the title, you can select the title and then
hit the Delete key on your keyboard to obtain a table without a title.
Changing the Number of Decimal Places
Select the cell entries with too many [or too few] decimal places. From the Format
menu, choose Cell Properties, select the number of decimal places that you want, and
click OK.
Showing/Hiding Cells
Activate the table and then select the row or column you wish by using Ctrl-Alt-click
in the column heading or row label. From the View menu, choose Hide. To resurrect
the hidden cells at a later time, activate the table and then, from the View menu,
choose Show All. If you're sure that you never want to resurrect the information,
then you can simply delete the cells and they will be permanently removed. See Table a
above, which shows a table with no hidden columns. See Table b for an example of a
table with the Valid Percent column hidden.
Table b
Favor or Oppose Death Penalty for Murder

                            Frequency   Percent   Cumulative Percent
Valid     Favor                1074       71.6           77.4
          Oppose                314       20.9          100.0
          Total                1388       92.5
Missing   DK                    106        7.1
          NA                      6         .4
          Total                 112        7.5
Total                          1500      100.0
Rearranging the rows, columns, and layers
Activate the pivot table, and from the Pivot menu, choose Pivoting Trays. A
schematic representation of a pivot table appears with 3 areas [trays] labeled layer, row,
and column. Colored icons in these trays represent the contents of the table, one for each
variable and one for the statistics. Place your mouse pointer over an icon to see what
it represents; if you wish to change the structure of the table, drag an icon to another
tray and the table will rearrange itself. See Table b above for a pre-modification
version of the table and Table c below for a post-modification version.
Table c = post-modification version of Table b
Favor or Oppose Death Penalty for Murder

Valid     Favor     Frequency               1074
                    Percent                 71.6
                    Cumulative Percent      77.4
          Oppose    Frequency                314
                    Percent                 20.9
                    Cumulative Percent     100.0
          Total     Frequency               1388
                    Percent                 92.5
Missing   DK        Frequency                106
                    Percent                  7.1
          NA        Frequency                  6
                    Percent                   .4
          Total     Frequency                112
                    Percent                  7.5
Total               Frequency               1500
                    Percent                100.0
Editing Your Charts
Double-click a chart in the Viewer to open it in a new Chart Editor window. Note: To
access some chart editing capabilities, such as identifying points on a scatterplot or
changing the width of bars in a histogram, you must click an element of the chart to select
it. For example, you must click any point in a scatterplot or any bar in a histogram
or bar chart. You can change labels [double-click any text and substitute your
own], create reference lines, and change colors, line types, and sizes. When you close
the Chart Editor window, the original chart in the Viewer updates to show any changes
that you made.
Using Syntax
You should ALWAYS use syntax when running statistical analyses. There are 2 ways
that you can do this. You can click the Paste button when you run a statistical analysis
using one of the dialog boxes, such as that for frequencies. However, when you use
the paste function, you have to remember to go to the newly created syntax window,
or one that you created in a previous session, highlight the commands, and run them
if you wish the analysis to actually execute.
The other method is to open a new syntax file so that you can type in any
commentary and syntax, or copy and paste from an already existing syntax file. The
syntax below is what you would receive if you clicked Paste in SPSS after using
a dialog box, such as that for frequencies; you would also obtain this command by
copying and pasting prior commands from an existing or newly created syntax file.
FREQUENCIES
VARIABLES=cappun
/PIECHART PERCENT
/ORDER= ANALYSIS .
Whichever method you use to create syntax, you MUST always type in commentary that
explains what the command does. This ensures that you have a way of checking back to
see the methodology that you used and the steps that were taken when you conducted
your analysis. This is useful in case something goes wrong and you need to make
corrections, and it provides you and others with a guide to how the analyses were done
in case replications need to be done. Commentary should be written in the following way
when dealing with commands:
*frequencies of attitudes toward capital punishment and gun laws.*
Notice that there are asterisks at either end and that a period (.) comes just before the
closing asterisk. This tells the computer that this is not command text, so that while the
computer may highlight it during a run of the analysis, it will not treat it as a command.
If you were going to combine the commentary and the command syntax in a syntax file,
it would appear as you see it below.
*frequencies of attitudes toward capital punishment and gun laws.*
FREQUENCIES
VARIABLES=cappun
/PIECHART PERCENT
/ORDER= ANALYSIS .
In addition, you MUST keep a log of the analyses that you run, which will appear in the
output [SPSS Viewer] file. To do this, go to Edit, then Options, and select the Viewer
tab. Under that tab, be sure that Initial Output State has "log" listed in the pull-down
menu and that Display Commands in the Log is checked. This ensures that the program
enters the text of any analysis that you do right before it displays the results of the
analysis, which is another way to let yourself and others know what type of analysis
you did and to evaluate whether it is the appropriate analysis and whether it has been
done properly. See the information just below this text for a sample.
FREQUENCIES
VARIABLES=cappun
/ORDER= ANALYSIS .
Frequencies
Statistics
Favor or Oppose Death Penalty for Murder
N     Valid        1388
      Missing       112

Favor or Oppose Death Penalty for Murder

                            Frequency   Percent   Cumulative Percent
Valid     Favor                1074       71.6           77.4
          Oppose                314       20.9          100.0
          Total                1388       92.5
Missing   DK                    106        7.1
          NA                      6         .4
          Total                 112        7.5
Total                          1500      100.0
Introducing Data
Typing your own data
If your data aren't already in a computer-readable SPSS format, you can enter the
information directly into the SPSS Data Editor. From the menus, choose File, then New,
then Data, which opens the Data Editor in Data View. If you type a number into the
first cell, SPSS will label that column with the variable name VAR00001. To create
your own variable names, click the Variable View tab.
Assigning Variable names and properties
In the Name column, enter a unique name for each variable in the order in which
you want to enter the variables. The name must start with a letter, but the remaining
characters can be letters or digits. A name can't end with a period, contain
blanks or special characters, or be longer than 64 characters.
Assigning Descriptive Labels
 Variable Labels: Assign descriptive text to a variable by clicking the cell and
then entering the label. For instance for the variable “cappun” the label says
“favor or oppose death penalty for murder.”
 Value Labels: To label individual values, click the button in the Value column.
This opens its dialog box. For cappun, the label is coded 1 = favor, 2 = oppose.
The sequence of operations is to: enter the value, enter its label, click add, and
repeat this process for each value.
o Note: Labels for individual values are useful only for variables with a
limited number of categories whose codes aren’t self-explanatory. You
don’t want to attach value labels to individual ages; however, you should
label the missing value codes for all variables if you use more than one
code.
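In syntax, the same labeling can be done in two short commands. A minimal sketch,
using cappun from the examples above:

*assign descriptive labels to cappun.
VARIABLE LABELS cappun 'Favor or Oppose Death Penalty for Murder'.
VALUE LABELS cappun 1 'Favor' 2 'Oppose'.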
Assigning Missing Values
To indicate which codes were used for each variable when information is not available,
click in the missing column, and assign missing values. Cases with these codes will be
treated differently during statistical analysis. If you don’t assign codes for missing
values, even nonsensical values are accepted. A value of -1 for age would be considered
a real age. The missing-value codes that you assign to a variable are called usermissing values. System-missing values are assigned by SPSS to any blank numeric cell
in the Data Editor or to any calculated value that is not defined. A system-missing value
is indicated with a period (.).
 Note: You can’t assign missing values to a string variable that is more than 8
characters in width. For string variables, uppercase and lowercase letters are
treated as distinct characters. This means that if you use the code NA (not
available) as a missing value code, entries coded as na will not be treated as
missing. Also, if a string variable is 3 characters wide and the missing value code
is only 2 characters wide, the placement of the two characters in the field of 3
affects what’s considered missing. Blanks at the end of the field (trailing blanks)
are ignored in missing-value specifications.
 Warning: DON'T use a blank space as a missing value. Use a specific number
or character to signify "I looked for this value and I don't know what it is."
DON'T use missing-value codes that fall between the smallest and largest valid
values, even if these particular codes don't occur in the data.
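In syntax, declaring user-missing values takes one command. A minimal sketch; the
codes 0, 8, and 9 are only an assumption here (e.g. NAP, DK, and NA):

*declare user-missing codes for cappun [assumed codes].
MISSING VALUES cappun (0, 8, 9).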
Assigning Levels of Measurement
Click in a cell in the Measure column to assign a level of measurement to each variable.
You have 3 choices: nominal, ordinal, and scale.
Warning 1: If you don't specify the level of measurement, SPSS attempts to divine it
based on characteristics of the data, but its judgment in this matter is fallible. For
example, string variables are always designated as nominal. In some procedures, SPSS
uses different icons for the 3 types of variables. The scale on which a variable is
measured doesn't necessarily dictate the appropriate statistical analysis for a
variable. For example, an ID number assigned to subjects in an experiment is
usually classified as a nominal variable. If the numbers are assigned sequentially,
however, they can be plotted on a scale to see if subject responses change with
time. Velleman and Wilkinson (1993) discuss the problems associated with
stereotyping variables.
Warning 2: Although SPSS assigns a level of measurement to each variable, this
information is seldom used to guide you. SPSS will let you calculate means for
nominal variables as long as they have numeric values. Certain statistical
procedures don’t allow string variables in particular fields in the dialog boxes.
For example, you can’t calculate the mean of a string variable.
Saving the Data File
You MUST save your data periodically so that you don't have to start from
scratch if anything goes wrong. You can also include text information in an SPSS data
file by choosing Utilities and then Data File Comments; the resulting command appears
in the syntax window. Anyone using the file can read the text associated with it. You can
also elect to have the comments displayed in the output. This is similar to including your
own comments describing what steps you are taking in your data analysis. I recommend
the latter approach because you will already be in the syntax window rather than having
to switch back and forth, but data file comments are a possible option.
* Data File Comments.
PRESERVE.
SET PRINT OFF.
ADD DOCUMENT
'test of data file comments'.
RESTORE.
Selecting Cases for Analyses
If you wish to perform analyses on a subset of your cases, this command is
invaluable. For instance, suppose you want to examine gender differences in
support for or opposition to capital punishment. Choose Select Cases from the Data
menu and all analyses will be restricted to the cases that meet the criteria you
specify. After choosing Select Cases, choose "If condition is satisfied" and click the
If button. This takes you to a dialog box that allows you to complete the command
syntax necessary to carry out the procedure. In the example I gave you, males are
coded 1 and females are coded 2, and I am interested in calculating results separately
for both groups. Therefore, I click on sex in the variable list and use the arrow to put
it in the box allocated for formulas. Once this variable has been transferred, I click
the = sign on the calculator provided and then a "1" to inform the computer that I am
only interested in selecting cases for males. Then I hit Continue and go back to the
original Select Cases dialog box, where I can choose "unselected cases are filtered" or
"unselected cases are deleted." If you wish to keep both males and females in the
dataset but want to conduct separate analyses for each group, choose "filtered." If
you wish to get rid of the cases that don't meet the criterion, i.e. you want to delete
the females from the data set permanently, choose "deleted." If you look at the Data
Editor when Select Cases is in effect, you'll see lines through the cases that did not
meet the selection criteria [only for filtering of cases]. They won't be included in any
statistical analyses or graphical procedures.
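For reference, pasting the filtered version of this Select Cases example produces syntax
along the following lines (a sketch; filter_$ is the name SPSS generates for the filter
variable):

*select males only, filtering rather than deleting unselected cases.
USE ALL.
COMPUTE filter_$=(sex = 1).
VARIABLE LABELS filter_$ 'sex = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.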
Repeating the Analysis for Different Groups of Cases
If you want to perform the same analysis for several groups of cases, choose Split File
from the Data menu. A separate analysis is done for each combination of values of the
variables specified in the Split File Dialog box.
SORT CASES BY sex .
SPLIT FILE
LAYERED BY sex .
Frequencies
Statistics
Favor or Oppose Death Penalty for Murder

Male      N     Valid        607
                Missing       34
Female    N     Valid        781
                Missing       78
Favor or Oppose Death Penalty for Murder

Respondent's Sex                 Frequency   Percent   Valid Percent   Cumulative Percent
Male     Valid     Favor             502       78.3         82.7              82.7
                   Oppose            105       16.4         17.3             100.0
                   Total             607       94.7        100.0
         Missing   DK                 34        5.3
         Total                       641      100.0
Female   Valid     Favor             572       66.6         73.2              73.2
                   Oppose            209       24.3         26.8             100.0
                   Total             781       90.9        100.0
         Missing   DK                 72        8.4
                   NA                  6         .7
                   Total              78        9.1
         Total                       859      100.0
You can also select how you want the output displayed—all output for each subgroup
together or the same output for each subgroup together.
SORT CASES BY sex .
SPLIT FILE
SEPARATE BY sex .
Frequencies
Respondent's Sex = Male

Statistics(a)
Favor or Oppose Death Penalty for Murder
N     Valid        607
      Missing       34
a. Respondent's Sex = Male

Favor or Oppose Death Penalty for Murder(a)

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor                 502       78.3         82.7              82.7
          Oppose                105       16.4         17.3             100.0
          Total                 607       94.7        100.0
Missing   DK                     34        5.3
Total                           641      100.0
a. Respondent's Sex = Male

Respondent's Sex = Female

Statistics(a)
Favor or Oppose Death Penalty for Murder
N     Valid        781
      Missing       78
a. Respondent's Sex = Female

Favor or Oppose Death Penalty for Murder(a)

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor                 572       66.6         73.2              73.2
          Oppose                209       24.3         26.8             100.0
          Total                 781       90.9        100.0
Missing   DK                     72        8.4
          NA                      6         .7
          Total                  78        9.1
Total                           859      100.0
a. Respondent's Sex = Female
Preparing Your Data
Checking Variable Definitions
Using the Utilities Menu
Choose utilities and then variables to get data-definition information for each variable
in your data file. Make sure that all of your missing-value codes are correctly identified.
TIP 5
If you click Go To, you will find yourself in the column of the Data Editor for the
selected variable, provided the Data Editor is in Data View. To edit the variable
information from the Data Editor in Data View, double-click the variable name at the
top of that column. This takes you to the Variable View for that variable.
TIP 6
To get a listing of the information for all of the variables without having to select the
variables individually, choose File, then Display Data File Information, then Working
File. This lists variable information for the whole data file. The disadvantage is that
you can't quickly go back to the Data Editor to fix mistakes. An advantage is that
codes that are defined as missing are identified, so it's easier to check the labels.
Checking Your Case Count
Eliminating Duplicate Cases
If you have entered your own data, it is possible that you entered the same case twice
or more. To oust any duplicates, choose Data, then Identify Duplicate Cases. If
you entered a supposedly unique ID variable for each case, move the name of that
ID variable into the Define Matching Cases By list. If it takes more than one
variable to guarantee uniqueness (for example, college and student ID), move all of
these variables into the list. When you click OK, SPSS checks the file for cases that
have duplicate values of the ID variables.
TIP 7
DON’T automatically discard cases with the same ID number unless all of the other
values also match. It’s possible that the problem is merely that a wrong ID number was
entered.
Adding Missing Cases
Run any procedure and look at the count of the total cases processed. That’s always the
first piece of output. Table d shows the summary from the Crosstabs procedure for sex
by cappun.
Table d
Crosstabs
Case Processing Summary

                                                      Cases
                                      Valid            Missing            Total
                                   N     Percent     N     Percent     N      Percent
Respondent's Sex * Favor or
Oppose Death Penalty for Murder   1388    92.5%     112     7.5%     1500    100.0%
You see that the data file has 1500 cases, but only 1388 have valid (nonmissing) values
for the sex and cappun variables. If the count isn’t what you think it should be and if you
assigned sequential numbers to cases, you can look for missing ID numbers.
Checking Your Data Values
Warning: Data checking is not an excuse to get rid of data values that you don’t like.
You are looking for values that are obviously in error and need to be corrected or
replaced with missing values. This is not the time to deal with unusual but correct data
points. You’ll deal with those during the actual data analysis phase.
Making Frequency Tables
Use the frequency procedure to count the number of times each value of a variable occurs
in your data. For example, how many people in the gss survey support capital
punishment? You can also graph this information using pie charts, bar charts, or
histograms.
TIP 8
You can obtain the descriptive statistics that you need [e.g. mean, minimum,
maximum] through the frequencies dialog box by clicking the Statistics button and
checking the statistics of interest to you.
To obtain your frequencies and descriptives, follow the instructions below.
 Go to Analyze, scroll down to Descriptive Statistics, and select Frequencies.
Click on the variables of interest and move them into the box for analysis
using the arrow shown. To obtain the descriptives, click the Statistics button
in the frequencies dialog box and select the mean, minimum, maximum, and
standard deviation boxes. Then click Charts and decide whether you wish
to run a pie chart, a bar chart, or a histogram with a normal curve. You
can only run one type of graph/chart at a time; the pasted syntax looks like
the sketch below.
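A minimal sketch of what such a pasted request looks like, assuming the hours-of-TV
variable is named tvhours:

*frequencies and descriptives for hours of TV watching, with a histogram.
FREQUENCIES
  VARIABLES=tvhours
  /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.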
When conducting frequency analyses, you want to consider the following questions when
reviewing the results presented in your output.
 Are the codes that you used for missing values labeled as missing values in
the frequency table? If the codes are not labeled, go back to the data editor and
specify them as missing values.
 Do the value labels correctly match the codes? For example, if you see that
50% of your customers are very dissatisfied with your product, make sure that
you haven’t made a mistake in assigning the labels.
 Are all of the values in the table possible? For example, if you asked the
number of times a person has been married and you see values of -2, you know
that’s an error. Go back to the source and see if you can figure out what the
correct values are. If you can’t, replace them with codes for missing values.
 Are there values that are possible, but highly unlikely? For example, if you see
a subject who claims to own 11 toasters, you want to check whether the value is
correct. If the value is incorrect, you’ll have to take that into account when
analyzing the data.
 Are there unexpectedly large or small counts for any of the values? If you’re
studying the relationship of highest educational degree to subscription to Web
services offered by your company and you see that no one in your sample has a
college degree, suspect problems.
TIP 9
To search for a particular data value for a variable, go to Data View, highlight the
column of the variable that you are interested in, choose Edit, then Find, and then
type in the value that you want to find.
Looking At the Distribution of Values
For a scale variable with too many values for a frequency table [e.g. income in
dollars], you need different tools for checking the data values, because counting
how often different values occur isn't useful anymore.
 Are the smallest and largest values sensible?
You don't want to look solely at the single largest and single smallest values;
instead, you want to look at a certain percentage or number of cases with the largest
and smallest values. There are several ways to do this. The simplest, but most
limited, way is to choose:
o Analyze, then Descriptive Statistics, then either Descriptives or
Explore. Click Statistics in the Explore dialog box and select Outliers in
the Explore Statistics dialog box. You will receive a list of cases with the
5 smallest and the 5 largest values for a particular variable. Values that
are defined as missing aren't included, so if you see missing values in the
list, there's something wrong. Check the other values if they appear to be
unusual. The pasted syntax is sketched below. [see Table e]
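The pasted Explore request looks roughly like the following (a sketch, again assuming
the variable is named tvhours):

*extreme values and plots for hours of TV watching.
EXAMINE
  VARIABLES=tvhours
  /PLOT BOXPLOT STEMLEAF
  /STATISTICS DESCRIPTIVES EXTREME
  /MISSING LISTWISE.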
Table e [using the Explore command]
Extreme Values

Hours Per Day Watching TV      Case Number   Value
    Highest    1                   1402        24
               2                    466        22
               3                    300        20
               4                   1360        20
               5                    115        16 (a)
    Lowest     1                   1500         0
               2                   1400         0
               3                   1373         0
               4                   1372         0
               5                   1356         0 (b)
a. Only a partial list of cases with the value 16 is shown in the table of upper extremes.
b. Only a partial list of cases with the value 0 is shown in the table of lower extremes.
You can see that one of the respondents claims to watch television 24 hours a day. You
know that’s not correct. It’s possible that he or she understood the question to mean how
many hours is the TV set on. When analyzing the TV variable, you’ll have to decide
what to do with people who have reported impossible values. In Table e, you see that
there are only 4 cases with values of 16 hours or greater and then there is a gap until 12
hours. You might want to set values greater than 12 hours to 12 hours when analyzing
the data. This is similar to what many people do when dealing with a variable for "age."
 Is there anything strange about the distribution of values?
The next task is to examine the distribution of the values using histograms or
stem-and-leaf plots. Make a stem-and-leaf plot [for small data sets] or a
histogram of the data using either Graphs/Histogram or the Explore Plots dialog
box. You want to look for unusual patterns in your data. For example, look at the
histogram of ages in Table f. Ask yourself: where have all of the 30-year-olds
gone? Why are there no people above the age of 90? Were there really no people
younger than 18 in the survey?
 Are there logical impossibilities?
For example, if you have a data file of hospital admissions, you can make a
frequency table to count the reason for admission and the number of male and
female admissions. Looking at these tables, you may not notice anything strange.
However, if you look at these 2 variables together in a Crosstabs table, you may
uncover unusual events. For instance, you may find males giving birth to babies,
and women undergoing prostate surgery.
 Sometimes, pairs of variables have values that must be ordered in a particular
way. For example, if you ask a woman her current age, her age at first marriage,
and the duration of her first marriage, you know that the current age must be
greater than or equal to the age at first marriage. You also know that the age at
first marriage plus the duration of first marriage cannot exceed the current age.
Start by looking at the simplest relationship: Is the age at first marriage less than
the current age? You can plot the two variables on a scatterplot and look for cases
that have unacceptable values. You know that all of the points must fall on or
above the identity line.
TIP 10
For large data files, the drawback to this approach is that it’s tedious and prone to error.
A better way is to create a new variable that is the difference between the current age and
the age at first marriage. Then use data, select cases to select cases with negative
values and analyze, then reports, then case summaries to list the pertinent
information. Once you’ve remedied the age problem, you can create a new variable
that is the sum of the age at first marriage and the duration of first marriage. You
can then find the difference between this sum and the current age. Reset the select
cases criteria and use case summaries to list cases with offending values.
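A sketch of this approach in syntax, with hypothetical variable names (age for current
age, agewed for age at first marriage, id for the case identifier):

*list cases where age at first marriage exceeds current age.
COMPUTE agediff = age - agewed.
EXECUTE.
TEMPORARY.
SELECT IF (agediff < 0).
SUMMARIZE
  /TABLES=id age agewed
  /FORMAT=VALIDLIST NOCASENUM
  /CELLS=COUNT.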


 Is there consistency?
For a survey, you often have questions that are conditional. For example, first
you ask Do you have a car? and then, if the answer is Yes, you ask insightful
questions about the car. You can make Crosstabs tables of the responses to the
main question with those to the subquestions. You have to decide how to deal
with these inconsistencies: do you impute answers to the main question, or do you
discard answers to subquestions? It's your call.
 Is there agreement?
This refers to whether you have pairs of variables that convey similar information
in different ways. For example, you may have recorded both years of education
and highest degree earned. Or, you may have created a new variable that groups
age into 3 categories, such as less than 25, 25 to 50, and older than 50. Compare
the values of the 2 variables using crosstabs. The table may be large, but it's easy
to check the correspondence between the 2 variables. You can also identify
problems by plotting the values of the 2 variables.
 Are there unusual combinations of values?
Identify any outliers so that you can make sure the values of these variables are
correct and make any necessary adjustments. What counts as an outlier depends
on the variables that are being considered together.
TIP 11
You can identify points in a scatterplot by specifying a variable in the Label Cases By
text box in the Scatterplot dialog box. Double-click the plot to activate it in the Chart
Editor. From the Elements menu, choose Data Label Mode or click the Data
Label Mode icon on the toolbar. This changes your cursor to a black box. Click the
cursor over the point that you want identified by the value of the labeling variable.
To go to that case in the Data Editor, right-click on the point and then left-click.
Make sure that the Data Editor is in Data View. To turn Data Label Mode off, click
the Data Label Mode icon on the toolbar.
Transforming Your Data
Before you transform your data, be sure that you know the values of the variables that
you are interested in so that you know how to code the information. See the earlier
instructions about how to use the Utilities menu to obtain information on the variables
either individually or for the entire working data file.
Computing a New Variable
If you want to perform the same calculation for all of the cases in your data file, the
transformation is called unconditional. If you want to perform different computations
based on the values of 1 or more variables, the transformation is conditional. For
example, if you compute an index differently for men and women, the transformation is
conditional. Both types of transformations can be performed in the Compute Variable
dialog box.
One Size Fits All: Unconditional Transformation
Choose compute from the transform menu to open the compute variable dialog box.
At the top left, assign a new name to the variable that you will be computing. To do
so, click in the target variable box and type in the desired name. You must follow the
same rules for assigning variable names as you did when naming variables in the Data
Editor. Also, don’t forget to enter the information in the type and label tab in the dialog
box.
Warning: You MUST use a new variable name rather than one already in use. If you
reuse the same name and make a mistake specifying the transformation, you’ll replace the
values of the original variable with values that you don’t want. If you don’t catch the
mistake right away, and you save the data file, the original values of the variable are lost.
SPSS will ask you for permission to proceed if you try to use an existing variable name.
To specify the formula for the calculations that you want to perform, either type directly
in the Numeric Expression text box or use the calculator pad. Each time you want to
refer to an existing variable, click it in the variable list and then click the arrow button.
The variable name will appear in the formula at the blinking insertion point. Once you
click ok, the variable is added to your data file as the last variable. However, remember
that you want to click the paste button and then run the syntax command from the syntax
window so that you know what commands you specified. You also want to remember to
use commentary information above the pasted syntax in order to tell yourself and the
reviewer, in this case me, what you did to conduct your analysis.
TIP 12
Right-click your mouse on any button (except the #s) on the calculator pad or any
function for an explanation of what it means.
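As an example, an unconditional transformation pasted to syntax might look like this
sketch (satjob and satfam are stand-ins for whatever items you combine):

*create a mean satisfaction index from two items.
COMPUTE satindex = (satjob + satfam) / 2.
VARIABLE LABELS satindex 'Mean of two satisfaction items'.
EXECUTE.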
Using a Built-in Function
The function groups are located in the dialog box and can be used to perform your
calculations, if necessary. There are 7 main groups of functions: arithmetic, statistical,
string, date and time, distribution, random-variable, and missing-values. If you wish to
use a function, click it when the blinking insertion point is placed where you want to
insert the function into your formula, and then click the up-arrow button. The function
will appear in your formula, but it will have question marks for the arguments. The
arguments of a function are the numbers or strings that it operates on. In the expression
SQRT(25), 25 is the sole argument of this function. Enter a value for the argument, or
double-click a variable to move it into the argument list. If there are more
question-mark arguments, select them in turn and enter a value, move a variable, or
otherwise supply whatever suits the needs of the function.
TIP 13
For detailed information about any function and its arguments, from the Help menu,
choose Topics, click the index tab, and type the word functions. You can then select the
type of function that you want.
If and Then: Conditional Transformation
If you want to use different formulas, depending on the values of one or more existing
variables, you have to enter the formula and then click the button labeled if at the bottom
of the compute variable dialog box. This will take you a secondary compute data dialog
box in which you choose, “include if cases satisfies condition.” To make your conditional
equation. For example, if you wish to compute a new variable, you would specify how
the new target variable is coded in reference to the “if, then expression.”
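A minimal sketch of a conditional transformation, assuming sex is coded 1 for males
and 2 for females and q1 to q3 are hypothetical items:

*compute the index differently for men and women.
IF (sex = 1) newindex = q1 + q2.
IF (sex = 2) newindex = q1 + q2 + q3.
EXECUTE.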
Changing the Coding Scheme
Recode into Same Variables
If you wish to change the coding of a variable but not create a totally different variable,
select Transform, Recode, Into Same Variables, click on the variable or variables of
interest, and move them into the variable box by clicking the arrow. Depending on how
you wish to recode the values within a variable, you could select Old and New Values;
on the left side of the dialog box, choose the numbers that you wish to change, on the
right side of the dialog box choose what you want them to become, and click Add.
When done, select Continue to go back to the previous dialog box, and paste the
command syntax so that you can run it. Again, don't forget to type in a commentary
of what the command is doing. In other cases, you might click the If button to specify
the conditions under which a recode will take place.
TIP 14
If you wish to recode a group of variables using the same coding scheme, such as
recoding a 2 into a 1 for a set of variables even if the numbers stand for different value
labels, you can enter several variables into the dialog box at once, as in the sketch below.
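A minimal sketch, with item1 through item3 as hypothetical variable names:

*recode 2 into 1 for several variables at once.
RECODE item1 item2 item3 (2=1).
EXECUTE.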
Recode into Different Variables
If you want to recode an existing variable into a new one, every original value has to be
transformed into a value of the new variable. Click Transform, Recode, Into Different
Variables and you will get a dialog box. In this dialog box, select the name of the
variable that will be recoded. Then, in the Output Variable Name text box, enter a
name for the new variable. Click the Change button and the new name appears after
the arrow in the central list. Once this is done, click Old and New Values and enter
the recode criteria that will comprise the command syntax. SPSS carries out the recode
specifications in the order they are listed in the Old to New list.
TIP 15
Always specify all of the values even if you're leaving them unchanged. Select All
Other Values and then Copy Old Value(s). Remember to click the Add button after
entering each specification to move it into the Old to New list; otherwise, it is ignored.
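A sketch of recoding into a new variable, with hypothetical cut points for age:

*collapse age into three groups.
RECODE age (LOWEST THRU 24=1) (25 THRU 50=2) (51 THRU HIGHEST=3) INTO agegrp.
VARIABLE LABELS agegrp 'Age in three groups'.
VALUE LABELS agegrp 1 'Under 25' 2 '25 to 50' 3 'Over 50'.
EXECUTE.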
Checking the Recode
The easiest method is to make a crosstabs table of the original variable with the new
variable containing recoded values.
Warning: After you’ve created a new variable with recode, go to the variable view in the
Data Editor and set the missing values for each newly created variable.
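Continuing the hypothetical age recode above, the check is a single crosstab:

*verify the recode against the original variable.
CROSSTABS
  /TABLES=agegrp BY age
  /CELLS=COUNT.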
Describing Your Data
Examining Tables and Chart Counts
Frequency Tables
Rap Music

                               Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Like Very Much           41         2.7          2.9              2.9
          Like It                 145         9.7         10.1             13.0
          Mixed Feelings          266        17.7         18.6             31.6
          Dislike It              401        26.7         28.0             59.6
          Dislike Very Much       578        38.5         40.4            100.0
          Total                  1431        95.4        100.0
Missing   DK Much About It         58         3.9
          NA                       11          .7
          Total                    69         4.6
Total                            1500       100.0
Imagine that you were interested in analyzing respondents' views regarding rap music.
You would run a frequency table like the one above to find a count of the level of like or
dislike of rap music reported by respondents. Each row of the table corresponds to one of
the recorded answers. Be sure to check that the counts presented appear to be correct,
including those for the missing data listing.
The 3rd-5th columns contain percentages. The 3rd column, labeled simply Percent, is the
% of all cases in the data file with that value: 9.7% of respondents reported that they like
rap music. However, the 4th column, labeled Valid Percent, indicates that 10.1% of
respondents like rap music. Why the difference? The 4th column bases the % only on the
people who actually responded to the question.
Warning: A large difference between the % and valid % columns can signal big
problems for your study. If the missing values result from people not being asked the
question because that’s the design of the study, you don’t have to worry. If people
weren’t asked because the interviewer decided not to ask them or if they refused to
answer, that’s a different matter.
The 5th column, labeled Cumulative Percent, is the sum of the valid % for that row and
all of the rows before it. It's useful only if the variable is measured on at least an ordinal
scale. For example, the cumulative % for "Like It" tells you that 13% of respondents
reported either that they like rap music or that they like it very much. The valid data
value that occurs most frequently is called the mode. For these data, "dislike very much"
is the modal category, since 578 of the respondents reported that they disliked rap music
very much. The mode is not a particularly good summary measure, and if you report it,
you should always indicate the percentage of cases with that value. For variables
measured on a nominal scale, the mode is the only summary statistic that makes sense,
but that isn't the case for this variable because there is a natural order to the responses
[i.e. it is an ordinal variable].
Frequency Tables as Charts
You can display the numbers in a frequency table in a pie chart or a bar chart, although
prominent statisticians advise that one should “never use a pie chart.”
[Pie chart of Rap Music: Like Very Much 2.87%, Like It 10.13%, Mixed Feelings
18.59%, Dislike It 28.02%, Dislike Very Much 40.39%]
Warning: If you create a pie chart by choosing Descriptive Statistics, then frequencies, a
slice for missing values is always included. Use graph, then select pie if you don’t want
to include a slice for missing values. This was the way that I obtained the pie chart
above.
[Bar chart of Rap Music: percent of respondents in each category, from Like Very Much
through Dislike Very Much]
Crosstabulations
Now you know how people as a group feel about rap music, but what about more
nuanced information about the kinds of people who hold these views? Are they male?
College educated? Racial and ethnic minorities? To find out, you need to look at
attitudes regarding rap music in conjunction with other variables. A crosstabulation is a
2-way table of counts, here for attitudes toward rap music and gender. Gender is the row
variable since it defines the rows of the table, and attitude toward rap music is the
column variable since it defines the columns. Each of the unique combinations of the
values of the 2 variables defines a cell of the table. The numbers in the total row and
column are called marginals because they are in the margins of the table. They are the
frequency tables for the individual variables.
TIP 16
DON’T be alarmed if the marginals in the crosstabulation aren’t identical to the
frequency tables for the individual variables. Only cases with valid values for both
variables are in the crosstabulation, so if you have cases with missing values for one
variable but not the other, they will be excluded from the crosstabulation. Respondents
who tell you their gender but not their attitudes about rap music are included in the
frequency table for gender but not in the crosstabulation of the 2 variables.
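A pasted crosstabs request with all three percentages looks roughly like this sketch (rap
is assumed here as the name of the rap music variable):

*crosstabulation of sex by attitudes toward rap music.
CROSSTABS
  /TABLES=sex BY rap
  /CELLS=COUNT ROW COLUMN TOTAL.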
The table below shows a crosstabulation that contains information solely on the number
of cases that meet both criteria, but not a % distribution.
Respondent's Sex * Rap Music Crosstabulation
Count

                          Like Very   Like It   Mixed      Dislike It   Dislike      Total
                          Much                  Feelings                Very Much
Respondent's   Male           17         62        97         181          258         615
Sex            Female         24         83       169         220          320         816
Total                         41        145       266         401          578        1431
Percentages
The counts in the cells are the basic elements of the table, but they are usually not the
best choice for reporting findings because they cannot be easily compared when there
are different totals in the rows and columns of the table. For example, if you know that
17 males and 24 females like rap music very much, you can conclude little about the
relationship between the 2 variables unless you also know the totals of men and women
in the sample.
For a crosstabulation, you can compute 3 different percentages:
 Row %: the cell count divided by the number of cases in the row times 100
 Column %: the cell count divided by the number of cases in the column times
100
 Total %: the cell count divided by the total number of cases in the table times
100
The 3 % convey different information, so be sure to choose the correct one for your
problem. If one of the 2 variables in your table can be considered an independent
variable and the other a dependent variable, make sure the % sum up to 100 for each
category of the independent variable.
Respondent's Sex * Rap Music Crosstabulation

                                                Like Very  Like It  Mixed     Dislike  Dislike    Total
                                                Much                Feelings  It       Very Much
Respondent's  Male    Count                         17        62       97       181      258        615
Sex                   % within Respondent's Sex    2.8%     10.1%    15.8%    29.4%    42.0%     100.0%
                      % within Rap Music          41.5%     42.8%    36.5%    45.1%    44.6%      43.0%
                      % of Total                   1.2%      4.3%     6.8%    12.6%    18.0%      43.0%
              Female  Count                         24        83      169       220      320        816
                      % within Respondent's Sex    2.9%     10.2%    20.7%    27.0%    39.2%     100.0%
                      % within Rap Music          58.5%     57.2%    63.5%    54.9%    55.4%      57.0%
                      % of Total                   1.7%      5.8%    11.8%    15.4%    22.4%      57.0%
Total                 Count                         41       145      266       401      578       1431
                      % within Respondent's Sex    2.9%     10.1%    18.6%    28.0%    40.4%     100.0%
                      % within Rap Music         100.0%    100.0%   100.0%   100.0%   100.0%    100.0%
                      % of Total                   2.9%     10.1%    18.6%    28.0%    40.4%     100.0%
Since gender would fall under the realm of an independent variable, you want to
calculate the row % because they will tell you what % of men and of women fall into
each of the attitudinal categories. These % aren't affected by unequal numbers of males
and females in your sample. From the row % displayed above, you find that 2.8% of
males like rap music very much, as do 2.9% of females. So with regard to strong positive
feelings about rap music, there are no visible differences. Note: No statistical differences
have been examined yet. From the column % displayed above, you find that among those
who like rap music very much, 41.5% are male and 58.5% are female. This does not tell
you that females are more likely than males to report liking rap music very much.
It tells you only that, of the people who like rap music very much, more are women than
men. Note: The column % depend on the number of men and women in the sample as
well as on how they feel about rap music. If men and women had identical attitudes but
there were twice as many men in the survey as women, the column % for men would be
twice as large as the column % for women. You can't draw any conclusions based only
on the column %.
TIP 17
If you use row %, compare the % within a column. If you use column %, compare the %
within a row.
Multiway Tables of Counts as Charts
You can plot the % in the table above by using a clustered bar chart like the one below.
For each attitudinal category regarding rap music, there are separate bars for men and
women since gender is the cluster variable. The values plotted are the % of all men and
the % of all women who gave each response. You can easily that females are equally
likely to like rap music very much as much as males. Although the same information is
in the crosstabulation, it is easier to see in the bar chart.
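In syntax, such a chart can be requested with the legacy GRAPH command (a sketch,
again assuming the variable names sex and rap):

*clustered bar chart of rap music attitudes by sex.
GRAPH
  /BAR(GROUPED)=PCT BY rap BY sex.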
[Clustered bar chart: % within Respondent's Sex for each Rap Music category.
Male: Like Very Much 2.76%, Like It 10.08%, Mixed Feelings 15.77%, Dislike It
29.43%, Dislike Very Much 41.95%.
Female: Like Very Much 2.94%, Like It 10.17%, Mixed Feelings 20.71%, Dislike It
26.96%, Dislike Very Much 39.22%.]
TIP 18
Always select % in the clustered bar chart dialog boxes; otherwise, you’ll have a difficult
time making comparisons within a cluster, since the height of the bars will depend on the
number of cases in each subgroup. For example, you won’t be able to tell if the bar for
men who always read newspapers is higher because men are more likely to read a
newspaper daily or because there are more men in the sample.
Control Variables
You can examine the relationship between gender and attitudes toward rap music
separately for each category of another variable, such as education [i.e., the control
variable]. The crosstabulation below shows how the information looks when the control
variable is entered as a layer in the crosstabulation dialog box.
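A sketch of the syntax, with degree assumed as the name of the highest-degree variable:

*sex by rap music attitudes, controlling for highest degree.
CROSSTABS
  /TABLES=sex BY rap BY degree
  /CELLS=COUNT ROW.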
Respondent's Sex * Rap Music * RS Highest Degree Crosstabulation

RS Highest Degree                                  Like Very  Like It  Mixed     Dislike  Dislike    Total
                                                   Much                Feelings  It       Very Much
Less than HS    Male    Count                          5         11       14        30       55        115
                        % within Respondent's Sex     4.3%       9.6%    12.2%     26.1%    47.8%     100.0%
                Female  Count                         10         18       19        35       59        141
                        % within Respondent's Sex     7.1%      12.8%    13.5%     24.8%    41.8%     100.0%
                Total   Count                         15         29       33        65      114        256
                        % within Respondent's Sex     5.9%      11.3%    12.9%     25.4%    44.5%     100.0%
High school     Male    Count                          9         36       50        87      110        292
                        % within Respondent's Sex     3.1%      12.3%    17.1%     29.8%    37.7%     100.0%
                Female  Count                         11         45       95       134      175        460
                        % within Respondent's Sex     2.4%       9.8%    20.7%     29.1%    38.0%     100.0%
                Total   Count                         20         81      145       221      285        752
                        % within Respondent's Sex     2.7%      10.8%    19.3%     29.4%    37.9%     100.0%
Junior college  Male    Count                          1          4        4        13       14         36
                        % within Respondent's Sex     2.8%      11.1%    11.1%     36.1%    38.9%     100.0%
                Female  Count                          1          3       13        15       18         50
                        % within Respondent's Sex     2.0%       6.0%    26.0%     30.0%    36.0%     100.0%
                Total   Count                          2          7       17        28       32         86
                        % within Respondent's Sex     2.3%       8.1%    19.8%     32.6%    37.2%     100.0%
Bachelor        Male    Count                          2          8       22        32       41        105
                        % within Respondent's Sex     1.9%       7.6%    21.0%     30.5%    39.0%     100.0%
                Female  Count                          2         11       30        27       52        122
                        % within Respondent's Sex     1.6%       9.0%    24.6%     22.1%    42.6%     100.0%
                Total   Count                          4         19       52        59       93        227
                        % within Respondent's Sex     1.8%       8.4%    22.9%     26.0%    41.0%     100.0%
Graduate        Male    Count                                     3        7        19       38         67
                        % within Respondent's Sex               4.5%     10.4%     28.4%    56.7%     100.0%
                Female  Count                                     5       12         9       16         42
                        % within Respondent's Sex              11.9%     28.6%     21.4%    38.1%     100.0%
                Total   Count                                     8       19        28       54        109
                        % within Respondent's Sex               7.3%     17.4%     25.7%    49.5%     100.0%
You see that the largest difference in strong dislike of rap music between men and
women occurs among those with a graduate degree: 56.7% of males strongly dislike rap
music, compared to 38.1% of females. The percentages are almost equal for those with a
high school education.
As the number of variables in a crosstabulation increases, it becomes unwieldy to plot all
of the categories of a variable. Instead, you can restrict your attention to a particular
response.
T-Tests
When using these statistical tests, you are testing the null hypothesis that 2 population
means are equal. The alternative hypothesis is that they are not equal. There are 3
different ways to go about this, depending on how the data were obtained.
Deciding Which T-test to Use
Neither the one-sample t test nor the paired samples t test requires any assumption about
the population variances, but the 2-sample t test does.
TIP 19
When reporting the results of a t test, make sure to include the actual means, differences,
and standard errors. Don’t give just a t value and the observed significance level.
One-sample T test
If you have a single sample of data and want to know whether it might be from a
population with a known mean, you have what’s termed a one-sample design, which can
be analyzed with a one-sample t test.
Examples
 You want to know whether CEOs have the same average score on a personality
inventory as the population on which it was normed. You administer the test to a
random sample of CEOs. The population value is assumed to be known in advance.
You don't estimate it from your data.
 You're suspicious of the claim that normal body temperature is 98.6 degrees. You
want to test the null hypothesis that the average body temperature for human adults is
the long-assumed value of 98.6, against the alternative hypothesis that it is not. The
value 98.6 isn't estimated from the data; it is a known constant. You take a single
random sample of 1,000 adult men and women and obtain their temperatures.
 You think that 40 hours no longer defines the traditional work week. You want to
test the null hypothesis that the average work week is 40 hours, against the alternative
that it isn't. You ask a random sample of 500 full-time employees how many hours
they worked last week.
 You want to know whether the average IQ score for children diagnosed with
schizophrenia differs from 100, the average for the population of all children. You
administer an IQ test to a random sample of 700 schizophrenic children. Your null
hypothesis is that the population value for the average IQ score for schizophrenic
children is 100, and the alternative hypothesis is that it isn't.
Data Arrangement
For the one-sample t test, you have one variable that contains the values for each
case. For example:
A manufacturer of high-performance automobiles produces disc brakes that must
measure 322 millimeters in diameter. Quality control randomly draws 16 discs
made by each of eight production machines and measures their diameters. This
example uses the file brakes.sav . Use One Sample T Test to determine whether or
not the mean diameters of the brakes in each sample significantly differ from 322
millimeters. A nominal variable, Machine Number, identifies the production
machine used to make the disc brake. Because the data from each machine must
be tested as a separate sample, the file must first be split into groups by Machine
Number.
In the Split File dialog box, select Compare Groups. Select Machine Number from the
variable list and move it into the box for "Groups Based On." Since the file isn't already
sorted, be sure that you have selected "Sort the file by grouping variables." Then run
the test:
 Select Analyze, then Compare Means, and then One-Sample T Test.
Select the test variable, i.e. Disc Brake Diameter (mm), type 322 as the test value, and
click Options. In the Options dialog box for the one-sample T test, type 90 in the
Confidence Interval % box, make sure missing values are set to "Exclude cases analysis
by analysis," click Continue, click Paste so that the syntax is entered in the syntax
window, and then select OK.
 Note: A 95% confidence interval is generally used, but the examples below
reflect a 90% confidence interval.
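The pasted command should look something like this sketch (assuming the diameter
variable in brakes.sav is named brake):

*one-sample t test of disc brake diameters against 322 mm.
T-TEST
  /TESTVAL=322
  /MISSING=ANALYSIS
  /VARIABLES=brake
  /CRITERIA=CI(.90).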
The Descriptives table displays the sample size, mean, standard deviation, and standard
error for each of the eight samples. The sample means disperse around the 322mm
standard by what appears to be a small amount of variation.
The test statistic table shows the results of the one-sample T test.
The t column displays the observed t statistic for each sample, calculated as the ratio of
the mean difference to the standard error of the sample mean.
The df column displays degrees of freedom. In this case, this equals the number of cases in
each group minus 1.
The column labeled Sig. (2-tailed) displays a probability from the t distribution with 15
degrees of freedom. The value listed is the probability of obtaining an absolute value
greater than or equal to the observed t statistic, if the difference between the sample mean
and the test value is purely random.
The Mean Difference is obtained by subtracting the test value (322 in this example) from
each sample mean.
The 90% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 90% of all possible random samples of 16
disc brakes produced by this machine.
Since their confidence intervals lie entirely above 0.0, you can safely say that machines 2,
5 and 7 are producing discs that are significantly wider than 322mm on the average.
Similarly, because its confidence interval lies entirely below 0.0, machine 4 is producing
discs that are not wide enough.
The one-sample t test can be used whenever sample means must be compared to a known
test value. As with all t tests, the one-sample t test assumes that the data be reasonably
normally distributed, especially with respect to skewness. Extreme or outlying values
should be carefully checked; boxplots are very handy for this.
Paired-Samples T test
You use a paired-samples (also known as the matched cases) T test if you want to test
whether 2 population means are equal, and you have 2 measurements from pairs of
people or objects that are similar in some important way. For example, you’ve observed
the same person before and after treatment or you have personally measures for each
CEO and their non-CEO sibling. Each “case” in this data file represents a pair of
observations.








Examples
You are interested in determining whether self-reported weights and actual
weights differ. You ask a random sample of 200 people how much they weigh
and then you weigh them on a scale. You want to compare the means of the 2
related sets of weights.
You want to test the null hypothesis that husbands and wives have the same
average years of education. You take a random sample of married couples and
compare their average years of education.
You want to compare 2 methods for teaching reading. You take a random sample
of 50 pairs of twins and assign each member of a pair to one of the 2 methods.
You compare average reading scores after completion of the program.
Data Arrangement
In a paired-samples design, both members of a pair must be on the same data
record. Different variable names are used to distinguish the 2 members of a pair.
For example:
A physician is evaluating a new diet for her patients with a family history of heart
disease. To test the effectiveness of this diet, 16 patients are placed on the diet for
6 months. Their weights and triglyceride levels are measured before and after the
study, and the physician wants to know if either set of measurements has changed.
This example uses the file dietstudy.sav . Use Paired-Samples T Test to determine
whether there is a statistically significant difference between the pre- and postdiet weights and triglyceride levels of these patients.
• Select Analyze, then Compare Means, then Paired-Samples T Test.
• Select Triglyceride and Final Triglyceride as the first set of paired variables.
• Select Weight and Final Weight as the second pair and click OK.
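The same analysis can be run from syntax. The sketch below assumes the measurements are
stored under names like triglyc, fintrig, weight, and finweight; check the actual names in
dietstudy.sav before running it.
* Paired-samples t tests for triglyceride and weight, before vs. after the diet.
* Variable names are assumptions; verify them against the data file.
T-TEST PAIRS = triglyc weight WITH fintrig finweight (PAIRED)
  /CRITERIA = CI(.95)
  /MISSING = ANALYSIS .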
The Descriptives table displays the mean, sample size, standard deviation, and standard
error for both groups. The output is organized by pair, with pair 1 listed first and pair 2
listed second in the table.
Across all 16 subjects, triglyceride levels dropped between 14 and 15 points on average
after 6 months of the new diet.
The subjects clearly lost weight over the course of the study; on average, about 8 pounds.
The standard deviations for pre- and post-diet measurements reveal that subjects were
more variable with respect to weight than to triglyceride levels.
At -0.286, the correlation between the baseline and six-month triglyceride levels is not
statistically significant. Levels were lower overall, but the change was inconsistent across
subjects. Several lowered their levels, but several others either did not change or increased
their levels.
On the other hand, the Pearson correlation between the baseline and six-month weight
measurements is 0.996, almost a perfect correlation. Unlike the triglyceride levels, all
subjects lost weight and did so quite consistently.
The Mean column in the paired-samples t test table displays the average difference
between triglyceride and weight measurements before the diet and six months into the diet.
The Std. Deviation column displays the standard deviation of the average difference score.
The Std. Error Mean column provides an index of the variability one can expect in
repeated random samples of 16 patients similar to the ones in this study.
The 95% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 95% of all possible random samples of 16
patients similar to the ones participating in this study.
The t statistic is obtained by dividing the mean difference by its standard error.
The Sig. (2-tailed) column displays the probability of obtaining a t statistic whose absolute
value is equal to or greater than the obtained t statistic.
Since the significance value for change in weight is less than 0.05, you can conclude that
the average loss of 8.06 pounds per patient is not due to chance variation, and can be
attributed to the diet.
However, the significance value greater than 0.10 for change in triglyceride level shows
the diet did not significantly reduce their triglyceride levels.
Warning: When you click the first variable of a pair, it doesn’t move to the list
box; instead, it moves to the lower left box labeled Current Selections. Only
when you click a second variable and move it into Current Selections can you
move the pair into the Paired Variable list.
Two-Independent-Samples T test
If you have 2 independent groups of subjects, such as CEOs and non-CEOs, men and
women, or people who received a treatment and people who didn’t, and you want to test
whether they come from populations with the same mean for the variable of interest, you
have a 2-independent samples design. In an independent-samples design, there is no
relationship between people or objects in the 2 groups. The T test you use is called an
independent-samples T test.
Examples
• You want to test the null hypothesis that, in the U.S. population, the average hours
spent watching TV per day is the same for males and females.
• You want to compare 2 teaching methods. One group of students is taught by one
method, while the other group is taught by the other method. At the end of the
course, you want to test the null hypothesis that the population values for the
average scores are equal.
• You want to test the null hypothesis that people who report their incomes in a
survey have the same average years of education as people who refuse.
Data Arrangement
If you have 2 independent groups of subjects, e.g., boys and girls, and want to
compare their scores, your data file must contain two variables for each child: one
that identifies whether a case is a boy or a girl, and one with the score. The same
variable name is used for the scores for all cases. To run the 2 independent
samples T test, you have to tell SPSS which variable defines the groups. That’s
the variable Gender, which is moved into the Grouping Variable box. Notice the
2 question marks after a variable name. They will disappear after you use the
Define Groups dialog box to tell SPSS which values of the variable should be
used to form the 2 groups.
TIP 18
Right-click the variable name in the Grouping Variable box and select variable
information from the pop-up menu. Now you can check the codes and value labels that
you’ve defined for that variable.
Warning: In the define groups dialog box, you must enter the actual values that you
entered into the data editor, not the value labels. If you used the codes of 1 for male and
2 for female and assigned them value labels of m and f, then you enter the values 1 and 2,
not the labels m and f, into the define groups dialog box.
An analyst at a department store wants to evaluate a recent credit card promotion. To this
end, 500 cardholders were randomly selected. Half received an ad promoting a reduced
interest rate on purchases made over the next three months, and half received a standard
seasonal ad.
• Select Analyze, then Compare Means, then Independent-Samples T Test.
• Select Money spent during the promotional period as the test variable. Select
Type of mail insert received as the grouping variable. Then click Define Groups.
• Type 0 as the Group 1 value and 1 as the Group 2 value; by default, Use
specified values should be selected. Then click Continue and OK.
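In syntax form, the test looks something like the sketch below. The names insert (0 =
standard ad, 1 = new promotion) and spent are assumptions standing in for the actual
variable names in the data file.
* Independent-samples t test comparing dollars spent across the 2 ad groups.
* Variable names and codes are hypothetical.
T-TEST GROUPS = insert(0 1)
  /VARIABLES = spent
  /CRITERIA = CI(.95)
  /MISSING = ANALYSIS .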
The Descriptives table displays the sample size, mean, standard deviation, and standard
error for both groups. On average, customers who received the interest-rate promotion
charged about $70 more than the comparison group, and they vary a little more around
their average.
The procedure produces two tests of the difference between the two groups. One test
assumes that the variances of the two groups are equal. The Levene statistic tests this
assumption.
In this example, the significance value of the statistic is 0.276. Because this value is
greater than 0.10, you can assume that the groups have equal variances and ignore the
second test. Using the pivoting trays, you can change the default layout of the table so
that only the "equal variances" test is displayed.
Activate the pivot table. Then under pivot, select pivoting trays.
Drag assumptions from the row to the layer and close the pivoting trays window.
With the test table pivoted so that assumptions are in the layer, the Equal variances
assumed panel is displayed.
The df column displays degrees of freedom. For the independent samples t test, this equals
the total number of cases in both samples minus 2.
The column labeled Sig. (2-tailed) displays a probability from the t distribution with 498
degrees of freedom. The value listed is the probability of obtaining an absolute value
greater than or equal to the observed t statistic, if the difference between the sample means
is purely random.
The Mean Difference is obtained by subtracting the sample mean for group 2 (the New
Promotion group) from the sample mean for group 1.
The 95% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 95% of all possible random samples of 500
cardholders.
Since the significance value of the test is less than 0.05, you can safely conclude that the
average of 71.11 dollars more spent by cardholders receiving the reduced interest rate is
not due to chance alone. The store will now consider extending the offer to all credit
customers.
Churn propensity scores are applied to the accounts of a cellular phone company. Scores
range from 0 to 100, and an account scoring 50 or above may be looking to change
providers. A manager with 50 customers above the threshold randomly samples 200
customers below it, wanting to compare the two groups on average minutes used per
month.
• Select Analyze, then Compare Means, then Independent-Samples T Test.
• Select Average monthly minutes as the test variable and Propensity to leave
as the grouping variable. Then select Define Groups.
• Select Cut point and type 50 as the cut point value. Then click Continue and
OK.
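A syntax sketch for the cut-point version follows; the names score and minutes are
assumptions. When a single value appears in parentheses after the grouping variable, SPSS
forms one group from cases below that value and the other from cases at or above it.
* Independent-samples t test using 50 as a cut point on the churn score.
* Variable names are hypothetical.
T-TEST GROUPS = score(50)
  /VARIABLES = minutes
  /CRITERIA = CI(.95) .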
The Descriptives table shows that customers with propensity scores of 50 or more are
using their cell phones about 78 minutes more per month on the average than customers
with scores below 50.
The significance value of the Levene statistic is greater than 0.10, so you can assume that
the groups have equal variances and ignore the second test. Using the pivoting trays,
change the default layout of the table so that only the "equal variances" test is displayed.
Play around with the pivot tray link if you wish.
The t statistic provides strong evidence of a difference in monthly minutes between
accounts more and less likely to change cellular providers.
Analyzing Truancy Data: The Example
To perform this analysis in order to test your skills using a T test, please see the SPSS file
on the course Blackboard page.
One-Sample T test
Consider whether the observed truancy rate before intervention [the % of school days
missed because of truancy] differs from an assumed nationwide truancy rate of 8%. You
have one sample of data [students enrolled in the TRP program-truancy reduction
program] and you want to compare the results to a fixed, specified in-advance population
value.
The null hypothesis is that the sample comes from a population with an average truancy
rate of 8%. [Another way of stating the null hypothesis is that the difference in the
population means between your population and the nation as a whole is 0.] The
alternative hypothesis is that your sample doesn’t come from a population with a truancy
rate of 8%.
To obtain the table below, you would do one of the following: Go to Analyze, choose
Descriptive Statistics, then Descriptives, select the variable to be examined, in this case
prepct, then go to Options in the Descriptives dialog box and select mean, minimum,
maximum, and standard deviation, then select Continue and OK. You can also
choose Frequencies under the Descriptive Statistics link, select the variable to be
examined, go to Statistics and pick the same statistics as above, select Continue, and
then OK.
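For reference, a minimal syntax sketch equivalent to the Descriptives route is shown here;
prepct is the variable name used in this data set.
* Descriptive statistics for the pre-intervention truancy rate.
DESCRIPTIVES VARIABLES = prepct
  /STATISTICS = MEAN STDDEV MIN MAX .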
Descriptive Statistics

                                      N    Minimum   Maximum      Mean    Std. Deviation
prepct Percent truant
days pre intervention                299      .00      72.08    14.2038      13.07160
Valid N (listwise)                   299
From the table above, you see that, for the 299 students in this sample, the average
truancy rate is 14.2%. You know that even if the sample is selected from a population in
which the true rate is 8%, you don’t expect your sample to have an observed rate of
exactly 8%. Samples from the population vary. What you want to determine is whether
it’s plausible for a sample of 299 students to have an observed truancy rate of 14.2% if
the population value is 8%.
TIP 19
Before you embark on actually computing a one-sample T test, make certain checks.
Look at the histogram of the truancy rates to make sure that all of the values make sense.
Are there percentages smaller than 0 or greater than 100? Are there values that are really
far from the rest? If so, make sure they’re not the result of errors. If you have a small
number of cases, outliers can have a large effect on the mean and the standard deviation.
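If you prefer syntax for these checks, the Examine procedure produces a histogram and a
boxplot for spotting outliers; this is a minimal sketch.
* Histogram and boxplot for checking the distribution of prepct.
EXAMINE VARIABLES = prepct
  /PLOT BOXPLOT HISTOGRAM
  /STATISTICS DESCRIPTIVES .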
Checking the Assumptions
To use the one-sample T test, you have to make certain assumptions about the data:
 The observations must be independent of each other. In this data file, students
came from 17 schools, so it’s possible that students in the same school may be
more similar than students in different schools. If that’s the case, the estimated
significance level may be smaller than it should be, since you don’t have as much
information as the sample size indicates. [If you have 10 students from 10
different schools, that’s more information than having 10 students from the same
school because it’s plausible that students in the same school are more similar
than students from different schools.] Independence is one of the most important
assumptions that you have to make when analyzing data.
 In the population, the distribution of the variable must be normal, or the sample
size must be large enough so that it doesn’t matter. The assumption of normally
distributed data is required for many statistical tests. The importance of the
assumption differs, depending on the statistical test. In the case of a one-sample T
test, the following guidelines are suggested: If the number of cases is < 15, the
data should be approximately normally distributed; if the number of cases is
between 15 and 40, the data should not have outliers or be very skewed; for
samples of 40 or more, even markedly skewed distributions are acceptable.
Because you have close to 300 observations, there’s little need to worry about the
assumption of normality.
TIP 20
If you have reason to believe that the assumptions required for the T test are violated in
an important way, you can analyze the data using a nonparametric test.
Testing the Hypothesis
Compute the difference between the observed sample mean and the hypothesized
population value. [14.2%-8% = 6.2%]
Compute the standard error of the difference. This is a measure of how much you expect
sample means, based on the same number of cases from the same population, to vary.
The hypothetical population value is a constant and doesn’t contribute to the variability
of the differences, so the standard error of the difference is just the standard error of the
mean. Based on the standard deviation in the table above, the standard error equals:
 SE = std. deviation / √(sample size) = 13.07 / √299 = 0.756. [Note:
You should be able to obtain this value using the Frequencies command and
selecting Std. Error Mean under Statistics. This is a way for you to double-check
if you are unsure of your calculations. See the table below.]
Statistics

prepct Percent truant days pre intervention
N               Valid        299
                Missing        0
Mean                     14.2038
Std. Error of Mean        .75595
Std. Deviation          13.07160
You can calculate the t statistic by hand if you divide the observed difference by the
standard error of the difference:
 t = (observed mean [prepct] − hypothesized mean) / std. error of the mean
   = (14.204 − 8) / 0.756 = 8.21
You can also conduct a one-sample T test using SPSS by going to Analyze, Compare
Means, One-Sample T Test, selecting the relevant variable [i.e. prepct], entering it
into the Test Variable box, entering the number 8 in the Test Value box at the
bottom of the dialog box, and running the analysis. You will get the output shown
below.
T-TEST
/TESTVAL = 8
/MISSING = ANALYSIS
/VARIABLES = prepct
/CRITERIA = CI(.95) .
T-Test

One-Sample Statistics

                                      N       Mean    Std. Deviation   Std. Error Mean
prepct Percent truant
days pre intervention                299    14.2038      13.07160           .75595

One-Sample Test

                                                  Test Value = 8
                                                                     95% Confidence Interval
                                             Sig.        Mean          of the Difference
                               t      df  (2-tailed)   Difference      Lower        Upper
prepct Percent truant
days pre intervention       8.207    298     .000        6.20378       4.7161       7.6915
Use the T distribution to determine if the observed t statistic is unlikely if the null
hypothesis is true. To calculate the observed significance level for a T statistic, you have
to take into account both how large the actual T value is and how many degrees of
freedom it has. For a one-sample T test, the degrees of freedom [df] is one fewer than
the number of cases. From the table above, you see that the observed significance level is
< .0001. Your observed results are very unlikely if the true rate is 8%, so you reject the
null hypothesis. Your sample probably comes from a population with a mean larger than
8%.
TIP 21
To obtain observed significance levels for an alternative hypothesis that specifies
direction, often known as a one-sided or one-tailed test, divide the observed two-tailed
significance level by two. Be very cautious about using one-sided tests.
Examining the Confidence Interval
If you look at the 95% Confidence Interval for the population difference, you see that it
ranges from 4.7% to 7.7%. You don’t know whether the true population difference is in
this particular interval, but you know that 95% of the time, 95% confidence intervals
include the true population values. Note that the value of 0 is not included in the
confidence interval. If your observed significance level had been larger than 0.05, 0
would have been included in the 95% confidence interval.
TIP 22
There is a close relationship between hypothesis testing and confidence intervals. You
can reject the null hypothesis that your sample comes from a population whose mean is
any value outside of the 95% confidence interval. For any such value, the observed
significance level for the hypothesis test will be less than 0.05.
Paired-Samples T test
You’ve seen that your students have a higher truancy rate than the country as a whole.
Now the question is whether there is a statistically significant difference in the truancy
rates before and after the truancy reduction programs. For each student, you have 2
values for unexcused absences. One is for the year before the student enrolled in the
program; the other is for the year in which the student was enrolled in the program. Since
there are two measurements for each subject, a before and an after, you want to use a
paired-samples T test to test the null hypothesis that the average before and after rates
are equal in the population.
TIP 23
The reason for doing a paired-samples design is to make the 2 groups as comparable as
possible on characteristics other than the one being studied. By studying the same
students before and after intervention, you control for differences in gender,
socioeconomic status, family supervision, and so on. Unless you have pairs of
observations that are quite similar to each other, pairing has little effect and may, in fact,
hurt your chances of rejecting the null hypothesis when it is false.
Before running the paired-samples T test procedure, look at the histogram of the
differences shown. You should see that the shape of the distribution is symmetrical [i.e.
not too far from normal]. Many of the cases cluster around 0, indicating that the
difference in the before and after scores is small for these students.
Checking the Assumptions
The same assumptions about the distributions of the data are required for this test as those
in the one-sample T test. The observations should be independent; if the sample size is
small, the distribution of differences should be approximately normal. Note that the
assumptions are about the differences, not the original observations. That’s because a
paired-samples T test is nothing more than a one-sample T test on the differences. If you
calculate the differences between the pre- and post-values and use the one-sample T test
with a population value of 0, you’ll get exactly the same statistic as using the
paired-samples T test.
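You can verify this equivalence yourself with a short syntax sketch: compute the difference
and test it against 0 with a one-sample T test. The name diff is a scratch variable
introduced here for illustration.
* A paired-samples t test is a one-sample t test on the differences.
* diff is a hypothetical scratch variable.
COMPUTE diff = postpct - prepct.
EXECUTE.
T-TEST
  /TESTVAL = 0
  /VARIABLES = diff
  /CRITERIA = CI(.95) .
The t statistic, degrees of freedom, and significance level from this test match the
paired-samples output shown later.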
Testing the Hypothesis
From the table below, you see that the average truancy rate before intervention is 14.2%
and the average truancy rate after intervention is 11.4%. That’s a difference of about 2.8%.
To get the table below, you should go to descriptives and select the prepct and postpct
variables and enter them into the variable list, be sure that the right statistics are
checked off [e.g. standard deviation], and then hit okay.
Paired Samples Statistics

                                            Mean       N    Std. Deviation   Std. Error Mean
Pair 1   postpct Percent truant
         days post intervention           11.4378     299      11.18297           .64673
         prepct Percent truant
         days pre intervention            14.2038     299      13.07160           .75595
To see how often you would expect to see a difference of at least 2.8% when the null
hypothesis of no difference is true, look at the paired-samples T test table below.
To obtain the table below, do the following: go to Analyze, then select Compare Means,
then select Paired-Samples T Test, choose the 2 variables of interest as the pair, i.e.,
prepct and postpct, and then select OK.
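The equivalent syntax is a short sketch:
* Paired-samples t test of post- vs. pre-intervention truancy rates.
T-TEST PAIRS = postpct WITH prepct (PAIRED)
  /CRITERIA = CI(.95)
  /MISSING = ANALYSIS .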
Paired Samples Test

                                              Paired Differences
                                                            95% Confidence Interval
                                     Std.     Std. Error      of the Difference
                          Mean     Deviation     Mean         Lower        Upper          t     df   Sig. (2-tailed)
Pair 1  postpct Percent
 truant days post
 intervention - prepct
 Percent truant days
 pre intervention       -2.76602   12.69355     .73409      -4.21067     -1.32137     -3.768   298        .000
The T statistic, 3.8, is computed by dividing the average difference [2.77%] by the
standard error of the mean difference [0.73]. The degrees of freedom is the number of
pairs minus one. The observed significance level is < .001, so you can reject the null
hypothesis that the pre-intervention and post-intervention truancy rates are equal in the
population. Intervention appears to have reduced the truancy rate.
Warning: The conclusions you can draw about the effectiveness of truancy reduction
programs from a study like this are limited. Even if you restrict your conclusions to the
schools from which these children are a sample, there are many problems. Since you are
looking at differences in truancy rates between adjacent years, you aren’t controlling for
possible increases or decreases in truancy that occur as children grow older. For
example, if truancy increases with age, the effect of the truancy reduction program may
be larger than it appears. There is also potential bias in the determination of what is
considered an “excused” absence.
The 95% confidence interval for the population change is from 1.3% to 4.2%. It appears
that if the program has an effect, it is not a very large one. On average, assuming a
180-day school year, students in the truancy reduction program attended school five more
days after the program than before. The 95% confidence interval for the number of days
“saved” is from 2.3 days to 7.6 days.
A paired-samples design is effective only if you have pairs of similar cases. If your
pairing does not result in a positive correlation coefficient between the 2 measurements
of close to 0.5, you may lose power [your computer stays on, but your ability to reject the
null hypothesis when it is false fizzles] by analyzing the data as a paired-samples design.
From the correlation table below, you see that the correlation coefficient between the
pre- and post-intervention rates is close to 0.5, so pairing was probably effective.
Paired Samples Correlations

                                                      N    Correlation    Sig.
Pair 1   postpct Percent truant days post
         intervention & prepct Percent truant
         days pre intervention                      299       .461        .000
Warning: Although well-intentioned, paired designs often run into trouble. If you give a
subject the same test before and after an intervention, the practice effect, instead of the
intervention, may be responsible for any observed change. You must also make sure that
there is no carryover effect; that is, the effect of one intervention must be completely
gone before you impose another.
Two-Independent Samples T test
You’ve seen that intervention seems to have had a small, although statistically significant,
effect. Several questions remain. Is the average truancy rate the same for boys and girls
prior to intervention? Is the average truancy rate the same for boys and girls after
intervention? Is the change in truancy rates before and after intervention the same for
boys and girls?
Group Statistics

                              gender Gender     N      Mean     Std. Deviation   Std. Error Mean
prepct Percent truant         f Female         152   13.0998       12.25336           .99388
days pre intervention         m Male           147   15.3453       13.81620          1.13954
postpct Percent truant        f Female         152   11.5130       11.43948           .92786
days post intervention        m Male           147   11.3599       10.94995           .90314
diffpct Pre - Post            f Female         152    1.5866       11.72183           .95077
                              m Male           147    3.9850       13.55834          1.11827
The table above shows summary statistics for the 2 groups for all 3 variables. Boys had
somewhat larger average truancy scores prior to intervention than did girls. The average
scores after intervention were similar for the 2 groups. The difference between the
average pre- and post-intervention is larger for boys. You must determine whether these
observed differences are large enough for you to conclude that, in the population, boys
and girls differ in average truancy rates. You can use the 2 independent-samples T test to
test all 3 hypotheses.
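All 3 tests can be requested at once in syntax. The sketch below assumes gender is a string
variable coded f and m, as the output suggests; if it is coded numerically, put the numeric
codes in the parentheses instead.
* Independent-samples t tests for the 3 truancy variables by gender.
T-TEST GROUPS = gender('f' 'm')
  /VARIABLES = prepct postpct diffpct
  /CRITERIA = CI(.95) .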
Checking the Assumptions
You must assume that all observations are independent. If the sample sizes in the groups
are small, the data must come from populations that have normal distributions. If the
sum of the sample sizes in the 2 groups is greater than 40, you don’t have to worry about
the assumption of normality. The 2-independent-samples T test also requires
assumptions about the variances in the 2 groups. If the 2 samples come from populations
with the same variance, you should use the “pooled” or equal-variance T test. If the
variances are markedly different, you should use the separate-variance T test. Both of
these are shown below.
Independent Samples Test

Levene's Test for Equality of Variances
                                                             F      Sig.
prepct Percent truant days     Equal variances assumed     5.248    .023
pre intervention
postpct Percent truant days    Equal variances assumed      .122    .727
post intervention
diffpct Pre - Post             Equal variances assumed     1.679    .196

t-test for Equality of Means
                                                                                  95% Confidence Interval
                                                  Sig.       Mean     Std. Error    of the Difference
                                   t       df  (2-tailed) Difference  Difference    Lower        Upper
prepct   Equal variances
         assumed                -1.488    297     .138    -2.24550     1.50904    -5.21527      .72426
         Equal variances
         not assumed            -1.485  290.226   .139    -2.24550     1.51207    -5.22151      .73051
postpct  Equal variances
         assumed                  .118    297     .906      .15309     1.29578    -2.39698     2.70317
         Equal variances
         not assumed              .118  296.969   .906      .15309     1.29483    -2.39511     2.70130
diffpct  Equal variances
         assumed                -1.638    297     .102    -2.39839     1.46426    -5.28003      .48326
         Equal variances
         not assumed            -1.634  287.906   .103    -2.39839     1.46782    -5.28740      .49063
You can test the null hypothesis that the population variances in the 2 groups are equal
using the Levene test, shown above. If the observed significance level is small [in the
column labeled sig. under Levene’s Test], you reject the null hypothesis that the
population variances are equal. For this example, you can reject the null hypothesis that
the per-intervention truancy variances are equal in the 2 groups. For the other 2
variables, you can’t reject the null hypothesis that the variances are equal.
Testing the Hypothesis
In the 2-independent-samples T test, the T statistic is computed the same as for the other
2 tests. It is the ratio of the difference between the 2 sample means divided by the
standard error of the difference. The standard error of the difference is computed
differently, depending on whether the 2 variances are assumed to be equal or not. That’s
why you see 2 sets of T values in the table above. In this example, the 2 T values and
confidence intervals based on them are very similar. That will always be the case when
the sample size in the 2 groups is almost the same.
The degrees of freedom for the t statistic also depends on whether you assume that the 2
variances are equal. If the variances are assumed to be equal, the degrees of freedom is 2
fewer than the sum of the number of cases in the 2 groups. If you don’t assume that the
variances are equal, the degrees of freedom is calculated from the actual variances and
the sample sizes in the groups. The result is usually not an integer.
From the column labeled Sig. [2-tailed], you can’t reject any of the 3 hypotheses of
interest. The observed results are not incompatible with the null hypotheses that boys and
girls are equally truant before and after the program and that the intervention affects
boys and girls equally.
Warning: When you compare 2 independent groups, one of which has a factor of interest
and the other that doesn’t, you must be very careful about drawing conclusions. For
example, if you compare people enrolled in a weight-loss program to people who aren’t,
you cannot attribute observed differences to the program unless the people have been
randomly assigned to the two groups.
Crosstabulations
You classify cases based on values for 2 or more categorical variables [e.g. type of health
insurance coverage and satisfaction with health care.] Each combination of values is
called a cell. To test whether the two variables that make up the rows and columns are
independent, you calculate how many cases you expect in each cell if the variables are
independent, and compare these expected values to those actually observed using the chi-square statistic. If your observed results are unlikely if the null hypothesis of
independence is true, you reject the null hypothesis. You can measure how strongly the
row and column variables are related by computing measures of association. There are
many different measures, and they define association in different ways. In selecting a
measure of association, you should consider the scale on which the variables are
measured, the type of association you want to detect, and the ease of interpretation of the
measure. You can study the relationship between a dichotomous [2-category] risk factor
and a dichotomous outcome [e.g. family history of a disease and development of the
disease], controlling for other variables [e.g. gender] by computing special measures
based on the odds.
Chi-Square Test: Are Two Variables Independent?
If you think that 2 variables are related, the null hypothesis that you want to test is that
they are not related. Another way of stating the null hypothesis is that the 2 variables are
independent. Independence has a very precise meaning in this situation. It means that
the probability that a case falls into a particular cell of a table is the product of the
probability that a case falls into that row and the probability that a case falls into that
column.
Warning: The word independent as used here has nothing to do with dependent and
independent variables. It refers to the absence of a relationship between 2 variables.
As an example of testing whether 2 variables are independent, look at the table below, a
crosstabulation of highest educational attainment [degree] and perception of life’s
excitement [life], based on the gssdata posted on Blackboard. From the row %, you see
that the % of people who find life exciting is not exactly the same in the 5 degree groups,
although it is fairly similar for the first 2 degree groups. Slightly less than half of those
with less than a high school education or with a high school education find life exciting.
However, there are substantial differences between those with some exposure
to college and those with a post-graduate degree. For those respondents, almost 2/3 find
that life is exciting.
degree Highest degree * life Is life exciting, routine or dull? Crosstabulation

                                             life Is life exciting, routine or dull?
degree Highest degree                        1 Exciting   2 Routine    3 Dull      Total
0 Lt high school   Count                          59           67         10         136
                   Expected Count               70.8         60.2        5.0       136.0
                   % within degree             43.4%        49.3%       7.4%      100.0%
1 High school      Count                         218          232         18         468
                   Expected Count              243.7        207.1       17.2       468.0
                   % within degree             46.6%        49.6%       3.8%      100.0%
2 Junior college   Count                          41           23          2          66
                   Expected Count               34.4         29.2        2.4        66.0
                   % within degree             62.1%        34.8%       3.0%      100.0%
3 Bachelor         Count                          94           46          3         143
                   Expected Count               74.4         63.3        5.3       143.0
                   % within degree             65.7%        32.2%       2.1%      100.0%
4 Graduate         Count                          55           29          0          84
                   Expected Count               43.7         37.2        3.1        84.0
                   % within degree             65.5%        34.5%        .0%      100.0%
Total              Count                         467          397         33         897
                   Expected Count              467.0        397.0       33.0       897.0
                   % within degree             52.1%        44.3%       3.7%      100.0%
Warning: The chi-square test requires that all observations be independent. This means
that each case can appear in only one cell of the table. For example, if you apply 2
different treatments to the same patients and classify them both times as improved or not
improved, you can’t analyze the data with the chi-square test of independence.
Computing Expected Values
You use the chi-square test to determine if your observed results are unlikely if the 2
variables are independent in the population. 2 variables are independent if knowing the
value of one variable tells you nothing about the value of the other variable. The level of
education one attains and one’s perception of life are independent if the probability of
any level of educational attainment/perception of life combination is the product of the
probability of that level of educational attainment times the probability of that perception
of life. For example, under the independence assumption, the probability of being a
college graduate and finding life exciting is:
P = Probability(bachelor degree) x Probability(life exciting)
P = 143/897 x 467/897 = .083
If the null hypothesis is true, you expect to find 897 x .083, or about 74, excited people
with bachelor’s degrees in your table. You see this expected value (74.4) in the row
labeled Expected Count in the table above.
The chi-square test is based on comparing these 2 counts: the observed number of cases
in a cell and the expected number of cases in a cell if the 2 variables are independent.
The Pearson chi-square statistic is:
 χ² = Σ (observed − expected)² / expected
TIP 24
By examining the differences between observed and expected values in the cells [the
residuals], you can see where the independence model fails. You can examine actual
residuals and residuals standardized by estimates of their variability to help you pinpoint
departures from independence by requesting them in the Cells dialog box of the
Analyze/Descriptive Statistics/Crosstabs procedure.
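In syntax, the residuals can be requested along with the chi-square test; RESID and SRESID
ask for raw and standardized residuals in each cell. A minimal sketch:
* Crosstab of degree by life with expected counts, row %, and residuals.
CROSSTABS
  /TABLES = degree BY life
  /CELLS = COUNT EXPECTED ROW RESID SRESID
  /STATISTICS = CHISQ .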
Determining the Observed Significance Level
From the calculated chi-square value, you can estimate how often in a sample you would
expect to see a chi-square value at least as large as the one you observed if the
independence hypothesis is true in the population. If the observed significance level is
small enough, you reject the null hypothesis that the 2 variables are independent. The
value of chi-square depends on the number of rows and columns in the table. The
degrees of freedom for the chi-square statistic is the product of one fewer than the
number of rows and one fewer than the number of columns. [The degrees of freedom is
the number of cells in a table that can be arbitrarily filled when the row and column
totals are fixed.] In this example, the degrees of freedom is (5 − 1) x (3 − 1) = 8.
From the table below, you see that the observed significance level for the Pearson
chi-square is 0.000, so you can reject the null hypothesis that level of educational
attainment and perception of life are independent.
Chi-Square Tests

                              Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square          34.750(a)     8           .000
Likelihood Ratio            37.030        8           .000
Linear-by-Linear
Association                 29.373        1           .000
N of Valid Cases               897

a. 2 cells (13.3%) have expected count less than 5. The minimum expected count is 2.43.
Warning: A conservative rule for use of the chi-square test requires that the expected
values in each cell be greater than 1 and that most cells have expected values greater than
5. After SPSS displays the pivot table with the statistics, it displays the number of cells
with expected values less than 5 and the minimum expected count. If more than 20% of
your cells have expected values less than 5, you should combine categories, if that makes
sense for your table, so that most expected values are greater than 5.
Examining Additional Statistics
SPSS displays several statistics in addition to the Pearson chi-square when you ask for a
chi-square test as shown above.
• The likelihood-ratio chi-square has a different mathematical basis than the
Pearson chi-square, but for large sample sizes, it is close in value to the Pearson
chi-square. It is seldom that these 2 statistics will lead you to different
conclusions.
• The linear-by-linear association statistic is also known as the Mantel-Haenszel
chi-square. It is based on the Pearson correlation coefficient. It tests whether
there is a linear association between the 2 variables. You SHOULD NOT use
this statistic for nominal variables. For ordinal variables, the test is more likely
to detect a linear association between the variables than is the Pearson chi-square
test; it is more powerful.
• A continuity-corrected chi-square [not shown here] is shown for tables with 2
rows and 2 columns. Some statisticians claim that this leads to a better estimate
of the observed significance level, but the claim is disputed.
• Fisher’s exact test [not shown here] is calculated if any expected value in a 2-by-2
table is < 5. You get exact probabilities of obtaining the observed table, or one
more extreme, if the 2 variables are independent and the marginals are fixed. That
is, the number of cases in the rows and columns of the table is determined in
advance by the researcher.
Warning: The Mantel-Haenszel test is calculated using the actual values of the row and
column variables, so if you coded 3 unevenly spaced dosages of a drug as 1, 2, and 3,
those values are used for the computations.
Are Proportions Equal?
A special case of the chi-square test for independence is the test that several proportions
are equal. For example, you want to test whether the % of people who report themselves
to be very happy has changed during the time that the GSS has been conducted. The
figure below is a crosstabulation of the % of people who said they were very happy in each of
the decades. This uses the aggregatedgss.sav file. Almost 35% of the people questioned
in the 1970s claimed that they were very happy, compared to 31% in this millennium.
happy GENERAL HAPPINESS * decade decade of survey Crosstabulation

                                              decade decade of survey
happy GENERAL HAPPINESS            1 1972-1979  2 1980-1989  3 1990-1999  4 2000-2002    Total
1 VERY HAPPY    Count                   3637         4475         4053         1296       13461
                Expected Count        3403.4       4516.7       4211.5       1329.4     13461.0
                % within decade        34.3%        31.8%        30.9%        31.3%       32.1%
2 PRETTY HAPPY  Count                   6977         9611         9081         2850       28519
                Expected Count        7210.6       9569.3       8922.5       2816.6     28519.0
                % within decade        65.7%        68.2%        69.1%        68.7%       67.9%
Total           Count                  10614        14086        13134         4146       41980
                Expected Count       10614.0      14086.0      13134.0       4146.0     41980.0
                % within decade       100.0%       100.0%       100.0%       100.0%      100.0%
Calculating the Chi-Square Statistic
If the null hypothesis is true, you expect 32.1% of people to be very happy in each
decade, the overall very happy rate. You calculate the expected number in each decade
by multiplying the total number of people questioned in each decade by 32.1%. The
expected number of pretty happy people is 67.9% multiplied by the number of people
in each decade. These values are shown in the table above. The chi-square statistic is
calculated in the usual fashion.
From the table below, you see that the observed significance level for the chi-square
statistic is < .001, leading you to reject the null hypothesis that in each decade people are
equally likely to describe themselves as very happy. Notice that the difference between
years isn’t very large; the largest % is 34.3% for the 1970s, while the smallest is 30.9%
for the 1990s. The sample sizes in each group are very large, so even small differences
are statistically significant, although they may have little practical implication.
Chi-Square Tests

                              Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square          34.180(a)     3           .000
Likelihood Ratio            33.974        3           .000
Linear-by-Linear
Association                 25.746        1           .000
N of Valid Cases            41980

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 1329.43.
Introducing a Control Variable
To see whether both men and women experienced changes in happiness during this time
period, you can compute the chi-square statistic separately for men and for women, as
shown below:
• Go to Analyze, then Descriptive Statistics, then Crosstabs. Put the variable
happy in the Row box and decade in the Column box, and move the variable
sex into the Layer 1 of 1 box. Under the Cells tab of the Crosstabs dialog
box, check the boxes marked Observed and Expected counts and Column %.
Then go to the Statistics box, check Chi-square, and select OK.
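In syntax, adding the control variable amounts to a second BY keyword; this is a minimal
sketch.
* Crosstab of happiness by decade, computed separately for each sex.
CROSSTABS
  /TABLES = happy BY decade BY sex
  /CELLS = COUNT EXPECTED COLUMN
  /STATISTICS = CHISQ .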
Chi-Square Tests

sex RESPONDENTS SEX                          Value      df    Asymp. Sig. (2-sided)
1 Male    Pearson Chi-Square                3.677(a)     3           .298
          Likelihood Ratio                  3.668        3           .300
          Linear-by-Linear Association       .901        1           .343
          N of Valid Cases                 18442
2 Female  Pearson Chi-Square               42.987(b)     3           .000
          Likelihood Ratio                 42.712        3           .000
          Linear-by-Linear Association     35.904        1           .000
          N of Valid Cases                 23538

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 586.01.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 742.96.
You see that for men, you can’t reject the null hypothesis that happiness has not changed
with time. You can reject the null hypothesis for women. From the line plot in the graph
below, you see that in the sample, happiness decreases with time for women, but not for
men. The graph, and the steps used to obtain it, are shown below.
• Go to the Graphs menu, choose Line, then select the Multiple icon and
Summaries for groups of cases, and then click Define. Next, move decade into
the Category Axis box and sex into the Define Lines By box in the dialog box
that appears. Select Other statistic, then move happy into the variable list,
and then click Change Statistic. In the Statistic subdialog box, select % inside
and type 1 into both the Low and High text boxes. Click Continue, and then
click OK.
[Line chart: %in(1,1) of GENERAL HAPPINESS (the % answering “very happy”) by decade of
survey, with separate lines for Male and Female respondents. The vertical axis runs from
30 to 36. Cases weighted by number of cases.]
Measuring Change: McNemar Test
The chi-square test can also be used to test hypotheses about change when the same
people or objects are observed at two different times. For example, the table below is a
crosstabulation of whether a person voted in 1996 and whether he or she voted in 2000.
[See gssdata.sav file]
vote00 DID R VOTE IN 2000 ELECTION * vote96 DID R VOTE IN 1996 ELECTION Crosstabulation

Count
                                          vote96 DID R VOTE IN 1996 ELECTION
                                             1 VOTED     2 DID NOT VOTE      Total
vote00 DID R VOTE     1 VOTED                  1539            151            1690
IN 2000 ELECTION      2 DID NOT VOTE            187            502             689
Total                                          1726            653            2379
An interesting question is whether people were more likely to vote in one of the years
than the other. The cases on the diagonal of the table don’t provide any information
because they behaved similarly in both elections. You have to look at the off-diagonal
cells, which correspond to people who voted in one election but not the other. If the null
hypothesis that the likelihood of voting did not change is true, a case should be equally likely
to fall into either of the 2 off-diagonal cells. The binomial distribution is used to calculate
the exact probability of observing a split between the 2 off-diagonal cells at least as
unequal as the one observed, if cases in the population are equally likely to fall into either
off-diagonal cell. This test is called the McNemar test.
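In syntax, the McNemar test can be requested from the Crosstabs procedure; a minimal
sketch:
* McNemar test for change in voting between the 1996 and 2000 elections.
CROSSTABS
  /TABLES = vote00 BY vote96
  /CELLS = COUNT
  /STATISTICS = MCNEMAR .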
Chi-Square Tests

                         Value    Exact Sig. (2-sided)
McNemar Test                            .057(a)
N of Valid Cases         2379

a. Binomial distribution used.
McNemar’s test can be calculated for a square table of any size to test whether the upper
half and the lower half of the table are symmetric. The test is labeled McNemar Test in the
table above. For tables with more than 2 rows and columns, it is labeled the McNemar-Bowker
test. From the table above, you see that you can’t reject the null hypothesis that people
who voted in only one of the 2 elections were equally likely to have voted in either one.
Warning: Since the same person is asked whether he or she voted in 1996 and whether
he or she voted in 2000, you can’t make a table in which the rows are years and the
columns are whether he or she voted. Each case would appear twice in such a table.
How Strongly are 2 Variables Related?
If you reject the null hypothesis that 2 variables are independent, you may want to
describe the nature and strength of the relationship between the 2 variables. There are
many statistical indexes that you can use to quantify the strength of the relationship
between 2 variables in a cross-classification. No single measure adequately summarizes
all possible types of association. Measures vary in the way they define perfect and
intermediate association and in the way they are interpreted. Some measures are used
only when the categories of the variables can be ordered from lowest to highest on some
scale.
Warning: Don’t compute a large number of measures and then report the most
impressive as if it were the only one examined.
You can test the null hypothesis that a particular measure of association is 0 based on an
approximate T statistic shown in the output. If the observed significance level is small
enough, you reject the null hypothesis that the measure is 0.
TIP 25
Measures of association should be calculated with data that are as detailed as possible. Don’t
combine categories with small numbers of cases, as was suggested above for the chi-square
test of independence.
FINAL NOTE: IF YOU WISH TO DO MEASURES OF ASSOCIATION TO
DETERMINE HOW STRONGLY 2 VARIABLES ARE RELATED, THEN PLEASE
SEE ME AND ASK FOR ASSISTANCE IF YOU HAVE ANY DIFFICULTIES ON
YOUR OWN.