Introduction to SPSS
Data types and SPSS
data entry and analysis
1
In this session








What does SPSS look like?
Types of data (revision)
Data Entry in SPSS
Simple charts in SPSS
Summary statistics
Contingency tables and crosstabulations
Scatterplots and correlations
Tests of differences of means
2
SPSS/PASW
3
Aspects of SPSS


Menus, especially Analyze and Graphs
Spreadsheet view of data



Rows are cases (people, respondents etc.)
Columns are Variables
Variable view of data

Shows detail of each variable type
4
Questionnaire Data Coding
5
In SPSS



We change ticks etc. on a questionnaire into
numbers
One number for each variable for each case
How we do this depends on the type of
variable/data
6
Types of data





Nominal
Ranked
Scales/measures
Mixed types
Text answers (open ended questions)
7
Nominal (categorical)



order is arbitrary
e.g. sex, country of birth, personality type, yes or no.
Use numeric in SPSS and give value labels.
(e.g. 1=Female, 2=Male, 99=Missing)
(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other,
99=Missing)
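As a minimal sketch of this coding scheme outside SPSS, a Python dictionary can play the role of value labels. The names `SEX_LABELS`, `MISSING_CODE` and `responses`, and the five sample answers, are made up for illustration. Only the codes 1=Female, 2=Male, 99=Missing come from the slide.

```python
# Value labels for a nominal variable, as on the slide (99 marks a missing answer).
SEX_LABELS = {1: "Female", 2: "Male", 99: "Missing"}
MISSING_CODE = 99

# Hypothetical coded answers, one number per case.
responses = [1, 2, 1, 99, 1]

# Translate the codes back into labels for display.
labelled = [SEX_LABELS[code] for code in responses]

# Leave out missing cases before counting or analysing.
valid = [code for code in responses if code != MISSING_CODE]

print(labelled)
print(len(valid))
```

The numbers carry no order or size here. They are only names for categories, which is why the labels matter.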
8
Ranks or Ordinal



in order, 1st, 2nd, 3rd etc.
e.g. status, social class
Use numeric in SPSS with value labels


E.g. 1=Working class, 2=Middle class, 3=Upper
class
E.g. Class of degree, 1=First, 2=Upper second,
3=Lower second, 4=Third, 5=Ordinary,
99=Missing
9
Measures, scales
1. Interval - equal units
e.g. IQ
2. Ratio - equal units, zero on scale
e.g. height, income, family size, age
Makes sense to say one value is twice another
Use numeric (or comma, dot or scientific) in
SPSS



E.g. family size, 1, 2, 3, 4 etc.
E.g. income per year, 25000, 14500, 18650 etc.
10
Mixed type


Categorised data
Actually ranked, but used to identify
categories or groups



e.g. age groups
= ratio data put into groups
Use numeric in SPSS and use value
labels.

E.g. Age group, 1=‘Under 18’, 2=‘18-24’, 3=‘25-34’, 4=‘35-44’, 5=‘45-54’, 6=‘55 or greater’
11
Text answers




E.g. answers to open-ended questions
Either enter text as given (Use String in SPSS)
Or
Code or classify answers into one of a small number
of types. (Use numeric/nominal in SPSS)
12
Data Entry in SPSS

Video by Andy Field
13
Frequency counts


Used with categorical and ranked variables
e.g. gender of students taking Health and
Illness option
Sex of student
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female       25        73.5        73.5              73.5
        Male          9        26.5        26.5             100.0
        Total        34       100.0       100.0
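The frequency, percent and cumulative percent columns can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not how SPSS computes its table. The list `sex` is rebuilt from the counts in the table (25 female, 9 male); the variable names are assumptions.

```python
from collections import Counter

# 25 female and 9 male students, the counts shown in the SPSS table.
sex = ["Female"] * 25 + ["Male"] * 9

counts = Counter(sex)
n = len(sex)

rows = []
cumulative = 0.0
for category in ["Female", "Male"]:
    percent = 100 * counts[category] / n   # share of all cases
    cumulative += percent                  # running total of the percentages
    rows.append((category, counts[category], round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```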
14
e.g. Number of GCSEs passed by students taking
Health and Illness option
Number of GCSEs
                Frequency   Percent   Valid Percent   Cumulative Percent
Valid    0          1          2.9         2.9               2.9
         1          1          2.9         2.9               5.9
         2          4         11.8        11.8              17.6
         3          6         17.6        17.6              35.3
         4          4         11.8        11.8              47.1
         5          2          5.9         5.9              52.9
         6          6         17.6        17.6              70.6
         7          3          8.8         8.8              79.4
         8          2          5.9         5.9              85.3
         9          3          8.8         8.8              94.1
        13          1          2.9         2.9              97.1
        14          1          2.9         2.9             100.0
        Total      34        100.0       100.0
15
Central Tendency

Mean
= average value
sum of all the values divided by the number of values
Mode
= the most frequent value in a distribution
(N.B. it is possible to have 2 or more modes, e.g. bimodal
distribution)
Median



= the half-way value, or the value that divides the ordered
distribution in the middle
The middle score when scores are ordered
N.B. need to put values into order first
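The three measures can be tried on the Number of GCSEs data. The sketch below rebuilds the 34 individual values from the frequency table earlier in these slides and uses Python's standard `statistics` module. The variable names are assumptions.

```python
import statistics

# Number of GCSEs -> number of students, copied from the frequency table.
gcse_counts = {0: 1, 1: 1, 2: 4, 3: 6, 4: 4, 5: 2, 6: 6,
               7: 3, 8: 2, 9: 3, 13: 1, 14: 1}

# Expand the table back into one value per student (34 values, already in order).
gcses = [value for value, freq in gcse_counts.items() for _ in range(freq)]

mean = statistics.mean(gcses)         # sum of values / number of values
median = statistics.median(gcses)     # middle of the ordered values
modes = statistics.multimode(gcses)   # every value tied for most frequent

print(round(mean, 2), median, modes)  # 5.29 5.0 [3, 6]
```

Two values (3 and 6) are tied for most frequent, so this distribution is an example of the bimodal case mentioned above.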
16
Dispersion and variability

Quartiles






The three values that split the sorted data into
four equal parts.
Second Quartile = median.
Lower quartile = median of lower half of the data
Upper quartile = median of upper half of the data
Need to order the individuals first
One quarter of the individuals fall in each of the four parts
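The "median of each half" rule can be sketched on the ages of the Health and Illness students, read back from the Age stem-and-leaf plot in these slides. The plot only reports the two extreme cases as ">= 36", so the value 36 below is a placeholder and not a real age. It does not affect the quartiles, because both cases sit at the top of the ordered list. Statistical packages use a variety of interpolation rules, so their quartiles can differ slightly from this hand method.

```python
import statistics

# Ages read from the stem-and-leaf plot. The two extreme cases are only
# reported as ">= 36", so 36 is a placeholder, not the real age.
ages = ([19] * 6 + [20] * 10 + [21] * 5 + [23, 24]
        + [25, 25, 26, 27, 28] + [31, 32, 33] + [35] + [36, 36])

ages.sort()               # quartiles need ordered data
half = len(ages) // 2     # 34 cases give two halves of 17

median = statistics.median(ages)
lower_quartile = statistics.median(ages[:half])    # median of the lower half
upper_quartile = statistics.median(ages[-half:])   # median of the upper half

print(lower_quartile, median, upper_quartile)      # 20 21.0 26
```

The median of 21 agrees with the value SPSS reports for these students.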
17
Used on Box Plot
Age of Health and Illness students
Statistics: Age
N Valid      34
N Missing     0
Mean      24.03
Median    21.00
[Box plot of age, with the upper quartile, median and lower quartile labelled]
18
Variance

Average of the squared deviations from the mean
Score   Mean   Deviation   Squared Deviation
1       2.6      -1.6           2.56
2       2.6      -0.6           0.36
3       2.6       0.4           0.16
3       2.6       0.4           0.16
4       2.6       1.4           1.96
Total                           5.20

5.20 is the Sum of Squares
This depends on number of individuals so we divide by n (5)

Gives 1.04 which is the variance

19
Standard Deviation

The variance has one problem: it is
measured in units squared.

This isn’t a very meaningful metric so we take
the square root value.

This is the Standard Deviation
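The worked example with the scores 1, 2, 3, 3, 4 can be checked in Python. The sketch follows the slide and divides the sum of squares by n. For comparison it also shows the sample versions, which divide by n - 1 and are what statistical packages such as SPSS normally report. The variable names are assumptions.

```python
import math
import statistics

scores = [1, 2, 3, 3, 4]
n = len(scores)
mean = sum(scores) / n                                  # 2.6

sum_of_squares = sum((x - mean) ** 2 for x in scores)   # 5.20
variance_n = sum_of_squares / n                         # 1.04, as on the slide
sd_n = math.sqrt(variance_n)                            # back in the original units

# Sample versions divide by n - 1 instead of n.
variance_sample = statistics.variance(scores)           # 5.20 / 4 = 1.30
sd_sample = statistics.stdev(scores)

print(round(variance_n, 2), round(sd_n, 2))             # 1.04 1.02
print(round(variance_sample, 2), round(sd_sample, 2))   # 1.3 1.14
```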
20
Using SPSS



‘Analyze > Descriptive Statistics > Explore’ menu.
Gives mean, median, SD, variance, min,
max, range, skew and kurtosis.
Can also produce stem and leaf, and
histogram.
21
Charts in SPSS





Use ‘Chart Builder’ from the ‘Graphs’ menu or the
‘Legacy Dialogs’ submenu
And/or double click chart to edit it.
E.g. double click to edit bars (e.g. to change
from colour to fill pattern).
Do this in SPSS first before cut and paste to
Word
Label the chart (in SPSS or in Word)
22
Stem and leaf plots


e.g. age of students taking Health and Illness
option
good at showing



distribution of data
outliers
range
23
Stem and leaf plots e.g.
Age Stem-and-Leaf Plot
Frequency    Stem &  Leaf
    6.00        1 .  999999
   17.00        2 .  00000000001111134
    5.00        2 .  55678
    3.00        3 .  123
    1.00        3 .  5
    2.00 Extremes    (>=36)
Stem width:  10
Each leaf:   1 case(s)
24
Box Plot
Statistics: Age
N Valid      34
N Missing     0
Mean      24.03
Median    21.00
25
Box Plot
Fill colour
changed.
N.B. numbers refer
to case numbers.
26
Histograms and bar charts

Length/height of bar indicates frequency
27
Histogram
Fill pattern suitable
for black and white
printing
28
Changing the bin size
Bin size made
smaller to show
more bars
29
Pie chart

angle of segment indicates proportion of the
whole
30
Pie Chart
Shadow and one
slice moved out for
emphasis
Analysing relationships

Contingency tables or crosstabulations

Compares nominal/categorical variables





But can include ordinal variables
N.B. table contains counts (= frequency data)
One variable on horizontal axis
One variable on vertical axis
Row and column total counts known as marginals
Example

In the Health and
Illness class, are
women more
likely to be under
21 than men?
Crosstabulations

e.g.

Use column and row percentages to look for
relationships
SPSS output
Chi-square
χ²
Cross tabulations and Chi-square are tests that
can be used to look for a relationship between
two variables:
When the variables are categorical so the
data are nominal (or frequency).
For example, if we wanted to look at the
relationship between gender and age group.
There are several different types of Chi-square
(χ²); we will be using the 2 x 2 Chi-square
2x2 Chi-square results in
SPSS
Another example

The Bank employees data
Bank Employees
Chi-Square tests
Chi-Square analysis on SPSS


http://www.youtube.com/watch?v=Ahs8jS5mJKk 4m15s
http://www.youtube.com/watch?v=IRCzOD27NQU
From 6m:30s to 9m:50s
http://www.youtube.com/watch?v=532QXt1P
MQ&feature=plcp&context=C3ba91a4UDOEgs
ToPDskJ-ABupdp-Yfvuf4j4fJGzV 12m30s
Low values in cells



Get SPSS to output expected values
Look where these are <5
Consider recoding to combine cols or rows
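The expected values and the chi-square statistic for a 2 x 2 table can be worked out by hand, as in the sketch below. The four cell counts are hypothetical. Only the marginal totals are taken from these slides (25 female and 9 male students, 16 under 21 and 18 aged 21 and over). The calculation is the plain Pearson chi-square with no continuity correction, and the names are assumptions.

```python
import math

# Hypothetical 2 x 2 counts (rows: Female, Male; columns: Under 21, 21 and over).
# Only the marginal totals (25, 9, 16, 18) come from the slides.
observed = [[12, 13],
            [4, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

# A 2 x 2 table has df = 1, where the p-value has this closed form.
p_value = math.erfc(math.sqrt(chi_square / 2))

# Count the cells whose expected value is below 5.
low_cells = sum(e < 5 for row in expected for e in row)

print(expected)
print(round(chi_square, 3), round(p_value, 3), low_cells)
```

With these marginals both cells in the Male row have expected values below 5, which is the situation where combining rows or columns is worth considering.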
Tabulating questionnaire
responses

Categorical survey data often “collapsed” for purposes of data
analysis
Original category   Frequency   Collapsed category   Frequency
White British          284      White                   304
White Irish              7
Other White             13
Indian                  40      South Asian             105
Pakistani               32
Bangladeshi             33
Chinese                 16      Chinese                  16
Black British           30      Black                    44
Afro-Caribbean          12
African                  2
An analysis on a sample of 2 (e.g. Black African) would not have been very meaningful!
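Collapsing categories is a recoding step: each original category is mapped to a broader one and the frequencies are added up. The sketch below does this for the table on this slide. The dictionary names `original`, `recode` and `collapsed` are assumptions; the categories and counts are from the slide.

```python
from collections import defaultdict

# Frequencies of the original categories, from the table on this slide.
original = {
    "White British": 284, "White Irish": 7, "Other White": 13,
    "Indian": 40, "Pakistani": 32, "Bangladeshi": 33,
    "Chinese": 16,
    "Black British": 30, "Afro-Caribbean": 12, "African": 2,
}

# Recoding rule: original category -> collapsed category.
recode = {
    "White British": "White", "White Irish": "White", "Other White": "White",
    "Indian": "South Asian", "Pakistani": "South Asian", "Bangladeshi": "South Asian",
    "Chinese": "Chinese",
    "Black British": "Black", "Afro-Caribbean": "Black", "African": "Black",
}

# Add each original frequency to its collapsed category.
collapsed = defaultdict(int)
for category, frequency in original.items():
    collapsed[recode[category]] += frequency

print(dict(collapsed))
# {'White': 304, 'South Asian': 105, 'Chinese': 16, 'Black': 44}
```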
Recoding variables

http://www.youtube.com/watch?v=uzQ_522F2SM&feature=related
Ignore t-test for now
6m11s
http://www.youtube.com/watch?v=FUoYZ_f6Lxc
Uses old version of SPSS, no submenu now. 6m
Scatterplots and correlations

Looks for association between variables, e.g.





Population size and GDP
crime and unemployment rates
height and weight
Both variables must be rank, interval or
ratio (scale or ordinal in SPSS).
Thus cannot use variables like gender,
ethnicity, town of birth, occupation.
44
Scatterplots

e.g. age (in years) versus Number of GCSEs
45
Interpretation



As Y increases, X increases
Called correlation
Regression line model in red
46
Correlation measures
association not causation



Height correlates with weight
But weight does not cause height
Height is one of the causes of weight (also body
shape, diet, fitness level etc.)
The older the child the better s/he is at reading
The less your income the greater the risk of
schizophrenia
Numbers of ice creams sold is correlated with
the rate of drowning


Ice creams do not cause drowning (nor vice versa)
Third variable involved – people swim more and buy
more ice creams when it’s warm
47
Scatterplot in SPSS




Use Graph menu
http://www.youtube.com/watch?v=74BjgPQvIEg 8m34s
http://www.youtube.com/watch?v=blfflA34pQ&feature=related 4m04s
http://www.youtube.com/watch?v=UVylQoG4hZM 1m50s, ignore polynomial regression
48
Modifying the Scatterplot


http://www.youtube.com/watch?v=803YCYA2AoQ&feature=related 4m04s
http://www.youtube.com/watch?v=vPzvuMuVXk8&feature=related 3m40s
49
If mixed data sets




Change point icon and/or colour to see
different subsets.
Overall data may have no relationship but
subsets might.
E.g. show male and female respondents.
Use Chart builder
50
Correlation


Correlation coefficient = measure of strength
of relationship, e.g. Pearson’s r
varies from -1 to +1: size from 0 to 1, with a plus or minus sign for direction
Correlations
                                        Number of GCSEs    Age
Number of GCSEs   Pearson Correlation         1           -.415*
                  Sig. (2-tailed)                          .015
                  N                          34             34
Age               Pearson Correlation      -.415*            1
                  Sig. (2-tailed)           .015
                  N                          34             34
*. Correlation is significant at the 0.05 level (2-tailed).
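Pearson's r can be computed by hand from the deviations of each variable from its mean. The sketch below uses made-up scores for five cases, not data from the slides. The function name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled to lie between -1 and +1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    sum_products = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return sum_products / math.sqrt(ss_x * ss_y)

# Made-up scores for five cases, not data from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 3))  # 0.775
```

Reversing the direction of one variable (for example replacing each y by -y) flips the sign of r but leaves its size unchanged.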
51
Positive correlation

as x increases, y increases
r = 0.7
52
Negative correlation

as x increases, y decreases
r = -0.7
53
Strong correlation (i.e. close to 1)
r = 0.9
54
Weak correlation (i.e. close to 0)
r = 0.2
55
Interpretation cont.
r² is a measure of degree of variation in
one variable accounted for by variation
in the other.
E.g. If r=0.7 then r²=0.49 i.e. just under half
the variation is accounted for (rest
accounted for by other factors).
If r=0.3 then r²=0.09 so 91% of the
variation is explained by other things.
56
Significance of r




SPSS reports if r is significant at α=0.05
N.B. this is dependent on sample size to a
large extent.
Other things being equal, larger samples
more likely to be significant.
Usually, size of r is more important than
its significance
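The dependence on sample size can be seen from the standard t statistic for a correlation, t = r * sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom. The sketch below applies it to the r and N from the Correlations table in these slides (r = -.415, N = 34) and then to the same r with two hypothetical sample sizes. The function name `t_for_r` is an assumption.

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether a Pearson r differs from zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# r and N from the Correlations table: Number of GCSEs vs Age.
t_slides = t_for_r(-0.415, 34)

# The same r with hypothetical smaller and larger samples.
t_small = t_for_r(-0.415, 10)
t_large = t_for_r(-0.415, 100)

print(round(t_slides, 2), round(t_small, 2), round(t_large, 2))
```

With df = 32 the two-tailed critical value at the 0.05 level is about 2.04, so the t of roughly -2.58 for the slide data is significant. The same r of -.415 gives a much smaller t with 10 cases and a much larger one with 100.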
57
Pearson’s r in SPSS

http://www.youtube.com/watch?v=loFLqZmvfzU 6m57s
58
Parametric and non-parametric




Some statistics rely on the variables being
investigated following a normal distribution.
Called Parametric statistics
Others can be used if variables are not
distributed normally. Called Non-parametric
statistics.
Pearson’s r is a parametric statistic
Kendall’s tau and Spearman’s rho (rank
correlation) are non-parametric.
59
Assessing normality

Produce histogram and normal plot
60
Use statistical test

SPSS provides two formal tests for normality:
Kolmogorov-Smirnov (K-S) and Shapiro-Wilk (S-W)




But, there is debate about KS
Extremely sensitive to departure from normality
May erroneously imply parametric test not
suitable – especially in small sample
So, always use a histogram as well.
61
Often can use parametric tests



Parametric tests (e.g. Pearson’s r) are robust
to departures from normality
Small, non-normal samples OK
But use non-parametric if


Data are skewed (questionnaire data often is)
Data are bimodal
62
Spearman’s rho


http://www.youtube.com/watch?v=r_WQe2cISU From 4.14 to 4.56
http://www.youtube.com/watch?v=POkFi5vKvI8&feature=fvwrel 6m16s
63
So far…

Looked at relationships between nominal
variables
Gender vs age group
Looked at relationships between scale
variables
Height vs. Weight
Now combine the two
Groups vs a scale variable
E.g. Gender vs income
64
Reminder – IV vs DV






IV = independent variable
What makes a difference, causes effects, is responsible
for differences.
DV = dependent variable
What is affected by things, what is changed by the IV.
Gender vs income. Gender = IV, income = DV
So we investigate the effect of gender on income
65
Example 1
Age group vs. no. of GCSEs


Using the Health and Illness class data
Age group defines 2 groups






Under 21
21 and over
Just two groups
Can use independent samples t-test
Independent because the two groups consist
of different people.
t-test compares the means of the 2 groups.
66
Difference of means

Do under 21s have more or fewer GCSEs
than 21 and overs?
Group Statistics
                  Age group      N    Mean   Std. Deviation   Std. Error Mean
Number of GCSEs   Under 21       16   6.44       3.140             .785
                  21 and over    18   4.28       2.906             .685
Means are different (6.44 & 4.28) but is that
significant?
67
Independent Samples Test: Number of GCSEs
Levene's Test for Equality of Variances: F = .164, Sig. = .689
No significant difference therefore
assume equal variances
t-test for Equality of Means
                               t       df      Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       2.082   32           .045              2.160               1.037                 .047           4.272
Equal variances not assumed   2.073   30.789       .047              2.160               1.042                 .034           4.285
(95% CI = 95% Confidence Interval of the Difference)
Means are statistically significantly different
68
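The t values in the SPSS output can be approximately reproduced from the Group Statistics alone (N, mean and standard deviation for each group). The sketch below works from the rounded means on the slide, so the results agree with SPSS only to about two decimal places. The variable names are assumptions.

```python
import math

# Summary statistics from the Group Statistics table (Number of GCSEs).
n1, mean1, sd1 = 16, 6.44, 3.140   # Under 21
n2, mean2, sd2 = 18, 4.28, 2.906   # 21 and over

mean_diff = mean1 - mean2

# Equal variances assumed: pool the two variances, df = n1 + n2 - 2.
df_pooled = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df_pooled
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = mean_diff / se_pooled

# Equal variances not assumed (Welch): separate variances, adjusted df.
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = mean_diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(round(t_pooled, 2), df_pooled, round(se_pooled, 3))          # 2.08 32 1.037
print(round(t_welch, 2), round(df_welch, 2), round(se_welch, 3))   # 2.07 30.79 1.042
```

The t statistic is the difference between the two means divided by its standard error. The two rows of the SPSS table differ only in how that standard error and the degrees of freedom are worked out.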
Parametric vs non-parametric




Just as in the case of correlations, there are
both kinds of tests.
Need to check if DV is normally distributed.
Do this visually
Also use statistical tests
69
Tests for normality






Kolmogorov-Smirnov and Shapiro-Wilk
If n>50 use KS
If n≤50 use SW
Null hypothesis is ‘data are normally distributed’.
So if p<0.05 then data are significantly different
from a normal distribution – use non-parametric tests
If p≥0.05 then no significant difference – use
parametric tests
70
Checking normality



Produce histogram of DV
Tick box to undertake statistical test
Interpret results.
71
t-test




Identify your two groups.
Determine what values in the data indicate
those two groups (e.g. 1=female, 2=male)
Select Analyze:Compare Means:Independent
samples t-test
http://www.youtube.com/watch?v=_KHI3ScO8sc 9m40s
72
Mann-Whitney U test


Use this when comparing two groups and the
DV is not normally distributed
http://www.youtube.com/watch?v=7iTvv3m9d_g 3m45s
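The idea behind the Mann-Whitney U test can be sketched by direct counting: compare every score in one group with every score in the other and count how often the first is larger. The scores below are made up and are not data from the slides. The function name `mann_whitney_u` is an assumption, and the sketch stops at the U statistic without working out a p-value.

```python
def mann_whitney_u(group_a, group_b):
    """U for group_a: count of (a, b) pairs where a > b, with ties counted as half."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1
            elif a == b:
                u_a += 0.5
    # The two U values always add up to the number of pairs.
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Made-up scores for two independent groups, not data from the slides.
under_21 = [7, 9, 6, 8]
over_21 = [3, 6, 4, 5]

u_a, u_b = mann_whitney_u(under_21, over_21)
u = min(u_a, u_b)   # the smaller U is the one commonly reported
print(u_a, u_b, u)  # 15.5 0.5 0.5
```

Because only the ordering of the scores is used, the result does not depend on the DV being normally distributed.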
73
Comparing 3 or more groups
ANOVA = Analysis of Variance
Analyze: Compare Means: One-way ANOVA
http://www.youtube.com/watch?v=wFq1b3QjI1U 4m04s
Useful to get table of means (descriptives) and
means plots from ANOVA options.

74
ANOVA Means and F value
75
ANOVA Means Plot
76