General Introduction to SPSS

Introduction to SPSS
for GHA Staff
Prof Gwilym Pryce: g@gpryce.com
Tutors: George Vlachos, Christian Holz
Lab notes based on material by John Malcolm
Plan:
• A. Data Types
– 1. Variables
– 2. Constants
• B. Introduction to SPSS
– 1. SPSS Menu Bar
– 2. File Types
• C. Tabulating Data
– 1. Categorical variables
– 2. Continuous variables
• D. Graphing Data
– 1. Categorical variables
– 2. Continuous variables
A. Data Types
• 1. Variables
• 2. Constants
1. What is a variable?
– A measurement or quantity that can take on
more than one value:
• E.g. size of planet: varies from planet to planet
• E.g. weight: varies from person to person
• E.g. gender: varies from person to person
• E.g. fear of crime: varies from person to person
• E.g. income: varies from household (HH) to household
– I.e. values vary across ‘individuals’ = the
objects described by our data
• Individuals = basic units of a data set whom
we observe or experiment on in a controlled
way
– not necessarily persons
• (could be schools, organisations, countries,
groups, policies, or objects such as cars or
safety pins)
• Variables = information that can vary across
the individuals we observe
– e.g. age, height, gender, income, exam scores,
whether signed Nuclear Test Ban Treaty
Variable Type, for Coding Purposes:
[Figure: Variable View of the Data]
• Available data types in SPSS are as
follows:
– Numeric (the default for new variables)
– Comma
– Dot
– Scientific Notation
– Date
– String
• Numeric
– A variable whose values are numbers. Values are
displayed in standard numeric format.
– The Data Editor accepts numeric values in
standard format or in scientific notation.
• Comma
– A numeric variable whose values are displayed
with commas delimiting every three places, and
with the period as a decimal delimiter.
– The Data Editor accepts numeric values for
comma variables with or without commas, or in
scientific notation.
• Values cannot contain commas to the right of the decimal
indicator.
• Dot
– A numeric variable whose values are displayed
with periods delimiting every three places and with
the comma as a decimal delimiter.
– The Data Editor accepts numeric values for dot
variables with or without periods, or in scientific
notation.
• Values cannot contain periods to the right of the decimal
indicator.
• Scientific notation
– A numeric variable whose values are displayed
with an embedded E and a signed power-of-ten
exponent.
– The Data Editor accepts numeric values for such
variables with or without an exponent.
• The exponent can be preceded either by E or D with an
optional sign, or by the sign alone, for example 123,
1.23E2, 1.23D2, 1.23E+2, and even 1.23+2.
• Date
– A numeric variable whose values are displayed in
one of several calendar-date or clock-time
formats. Select a format from the list. You can
enter dates with slashes, hyphens, periods,
commas, or blank spaces as delimiters. The
century range for two-digit year values is
determined by your Options settings (from the Edit
menu, choose Options and click the Data tab).
• Custom currency
– A numeric variable whose values are displayed in
one of the custom currency formats that you have
defined in the Currency tab of the Options dialog
box. Defined custom currency characters cannot
be used in data entry but are displayed in the Data
Editor.
• String
– Values of a string variable are not numeric
and therefore are not used in calculations.
– They can contain any characters up to the
defined length.
– Uppercase and lowercase letters are
considered distinct.
– Also known as an alphanumeric variable.
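The difference between the Comma and Dot display formats described above can be sketched in a few lines of Python. This is an illustration only, not SPSS code: the function names are hypothetical, and the parsing is deliberately simple (it does not check that delimiters sit in valid positions).

```python
def format_comma(value: float, decimals: int = 2) -> str:
    """Comma format: commas every three places, period as decimal delimiter."""
    return f"{value:,.{decimals}f}"


def format_dot(value: float, decimals: int = 2) -> str:
    """Dot format: periods every three places, comma as decimal delimiter."""
    swap = str.maketrans({",": ".", ".": ","})
    return format_comma(value, decimals).translate(swap)


def parse_comma(text: str) -> float:
    """Read a Comma-format entry, with or without the grouping commas."""
    return float(text.replace(",", ""))


def parse_dot(text: str) -> float:
    """Read a Dot-format entry, with or without the grouping periods."""
    return float(text.replace(".", "").replace(",", "."))


print(format_comma(1234567.891))   # same number, Comma display
print(format_dot(1234567.891))     # same number, Dot display
print(parse_comma("1,234,567.89"), parse_dot("1.234.567,89"))
print(float("1.23E2"))             # scientific notation with an E exponent
```

The stored number is the same in every case; only the way it is displayed and typed in changes.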
Conceptual Approach to
Variable Type:
• Numeric = values are numbers that can be
used in calculations.
• String = Values are not numeric, and hence
not used in calculations.
– But can often be coded: I.e. transformed into a
numerical variable:
• e.g.
IF (LA = 'Aberdeen') X = 1.
IF (LA = 'East Renfrewshire') X = 2.
etc.
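The same coding idea, sketched in Python rather than SPSS syntax. The dictionary holds only the two codes given above; the name `code_la` and the sample values are hypothetical. A name that is not in the scheme is left uncoded (`None`), and because uppercase and lowercase letters are distinct, 'aberdeen' does not match 'Aberdeen'.

```python
# Coding scheme from the example above: string value -> numeric code.
LA_CODES = {"Aberdeen": 1, "East Renfrewshire": 2}


def code_la(name):
    """Return the numeric code for a local authority, or None if uncoded."""
    return LA_CODES.get(name)


# Hypothetical string values for the variable LA.
la_values = ["Aberdeen", "East Renfrewshire", "Aberdeen", "aberdeen"]

# X is the new numeric variable.
x = [code_la(value) for value in la_values]
print(x)
```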
Continuous vs Categorical
• Continuous (or Scale or quantitative Variables) = data
values are numeric values on an interval or ratio
scale
– (e.g., age, income). Scale variables must be numeric.
– E.g. dimmer switch: brightness of light can be measured
along a continuum from dark to full brightness
• Categorical Variables = variables that have values
which fall into two or more discrete categories
– E.g. conventional light switch: either total darkness or full
brightness, on or off.
– Male or female, employment category, country of origin
Two types of Categorical
variables: Ordinal & Nominal
• Ordinal variables = Data values represent
categories with some intrinsic order
– (e.g., low, medium, high; strongly agree, agree,
disagree, strongly disagree).
– Ordinal variables can be either string
(alphanumeric) or numeric values that represent
distinct categories (e.g., 1=low, 2=medium,
3=high).
Ordinal variables:
• Values fall within discrete but ordered
categories
– I.e. the sequence of categories has meaning
• e.g. education categories:
– 1 = primary
– 2 = secondary
– 3 = college
– 4 = university undergraduate
– 5 = university postgraduate masters
– 6 = university postgraduate phd
• e.g. 1= Very poor, 2= poor, 3=good, 4=very good
Nominal variables
• Nominal Variables = Data values represent
categories with no intrinsic order
– the sequence of categories is arbitrary, so the ordering has no meaning in and of itself:
• e.g. country of origin: Wales, Scotland,
Germany…
• e.g. make of car: Ford, Vauxhall
• e.g. job category
• e.g. company division
– Nominal variables can be either string
(alphanumeric) or numeric values that represent
distinct categories (e.g., 1=Male, 2=Female).
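One practical consequence of the ordinal/nominal distinction, as a rough Python sketch with made-up data: because ordinal codes have a meaningful order, a "middle" category (median) can be reported, whereas for a nominal variable only the most common category (mode) makes sense. The education codes follow the 1 to 6 scheme above; the sample values themselves are hypothetical.

```python
from statistics import median_low, mode

# Hypothetical ordinal data, coded 1 = primary ... 6 = university postgraduate phd.
education = [1, 2, 2, 3, 4, 4, 4, 6]

# Hypothetical nominal data: the categories have no intrinsic order.
country = ["Wales", "Scotland", "Germany", "Scotland"]

# Order matters for ordinal codes, so a median category is meaningful.
# median_low always returns one of the observed codes.
typical_education = median_low(education)

# For nominal data we can only ask which category occurs most often.
most_common_country = mode(country)

print(typical_education, most_common_country)
```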
2. What is a constant?
– A measurement or quantity that has only one value
for all the objects described in our data
– Also called a ‘scalar’ or ‘intercept’ or ‘parameter’
• E.g. speed of light in a vacuum: constant for all light transmissions
• E.g. ratio of diameter to circumference: constant for all circles
• E.g. ave. increase in life expectancy: constant at 1 year pa since 1900
• E.g. price elasticity of housing supply: assumed constant for a particular market
• Often it is a constant that we want to estimate:
– we employ statistical techniques to estimate
‘parameters’ or ‘constants’ that summarise or
link variables.
• e.g. mean = ‘typical’ value of a variable = measure of
central tendency
• e.g. standard deviation = measure of the variability of
a variable = measure of spread
• e.g. correlation coefficient = measures the correlation
between two variables
• e.g. slope coefficients = how much y increases when
x increases
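The four summary constants listed above can be estimated by hand for a small data set. The Python sketch below uses two short made-up variables, x and y; the formulas are the usual sample versions (dividing by n - 1), which is an assumption about the definitions intended here.

```python
from statistics import mean, stdev

# Hypothetical data: five individuals measured on two variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Mean: the 'typical' value (central tendency).
mean_x, mean_y = mean(x), mean(y)

# Standard deviation: the spread of each variable.
sd_x, sd_y = stdev(x), stdev(y)

# Sample covariance, the building block for the next two constants.
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Correlation coefficient: strength of the linear link between x and y.
r = cov_xy / (sd_x * sd_y)

# Slope coefficient: how much y increases when x increases by one unit.
slope = cov_xy / sd_x ** 2

print(mean_x, mean_y, round(sd_x, 3), round(sd_y, 3), round(r, 3), round(slope, 3))
```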
B. Introduction to SPSS
1. SPSS Menu Bar
• When you first open SPSS, you will
usually be presented with a blank Data
View window
– The Data View lists variables as columns
and observations (also called “cases” or
“individuals”) as rows
• Data View without and with data looks like this…
[Figure: Data View of Home Sales data]
• Variable View looks like this…
[Figure: Variable View of Home Sales data]
[Figure: SPSS Menu Bar]
B.2. File Types & SPSS Structure
• If you try opening a new file (File,
New), you will see that you are
presented with five choices of file type.
• These choices reflect the basic
structure of SPSS:
– Data
– Syntax
• Steep learning curve, but essential for larger projects: backup, record/checking, re-use
– Output
• Graphs, tables, commands, error messages
SPSS Scripting Facility
• The scripting facility allows you to automate
tasks, including:
– Automatically customize output in the Viewer.
– Open and save data files.
– Display and manipulate dialog boxes.
– Run data transformations and statistical procedures using command syntax.
– Export charts as graphic files in a number of formats.
C. Tabulating Data
• 1. Categorical Data: Frequency Tables
– E.g. Neighbourhood type (House Sales data)
• Analyze, Descriptive Statistics, Frequencies
Neighborhood

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  A              42       1.7             1.7                  1.7
       B             319      13.1            13.1                 14.8
       C             258      10.6            10.6                 25.4
       D             467      19.1            19.1                 44.5
       E             500      20.5            20.5                 65.0
       F             372      15.2            15.2                 80.2
       G             482      19.8            19.8                100.0
       Total        2440     100.0           100.0
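How the Percent and Cumulative Percent columns follow from the Frequency column can be checked with a short Python sketch. The counts are those in the Neighborhood table; the code is an illustration, not what SPSS runs internally. There are no missing cases here, so Percent and Valid Percent coincide.

```python
# Frequencies from the Neighborhood table above.
counts = {"A": 42, "B": 319, "C": 258, "D": 467, "E": 500, "F": 372, "G": 482}
total = sum(counts.values())

rows = []
cumulative = 0
for category, freq in counts.items():
    cumulative += freq
    percent = round(100 * freq / total, 1)
    cumulative_percent = round(100 * cumulative / total, 1)
    rows.append((category, freq, percent, cumulative_percent))

for category, freq, percent, cumulative_percent in rows:
    print(f"{category:<6}{freq:>6}{percent:>8}{cumulative_percent:>8}")
print(f"{'Total':<6}{total:>6}")
```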
• Categorical Data: Crosstabs (2-Way Tables)
– E.g. Does Ethnic Minority Status affect job type?
(Employment data)
• Analyze, Descriptive Statistics, Crosstabs
Minority Classification * Employment Category Crosstabulation

                                                   Employment Category
Minority Classification                            Clerical   Custodial   Manager    Total
No      Count                                           276          14        80      370
        % within Employment Category                  76.0%       51.9%     95.2%    78.1%
Yes     Count                                            87          13         4      104
        % within Employment Category                  24.0%       48.1%      4.8%    21.9%
Total   Count                                           363          27        84      474
        % within Employment Category                 100.0%      100.0%    100.0%   100.0%
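The "% within Employment Category" figures are column percentages: each count divided by the total for its employment category. A minimal Python sketch using the counts from the crosstab above (the variable names are illustrative):

```python
# Counts from the crosstab above: rows = minority classification,
# columns = employment category.
counts = {
    "No": {"Clerical": 276, "Custodial": 14, "Manager": 80},
    "Yes": {"Clerical": 87, "Custodial": 13, "Manager": 4},
}
categories = ["Clerical", "Custodial", "Manager"]

# Column totals: number of employees in each employment category.
column_totals = {c: sum(row[c] for row in counts.values()) for c in categories}

# Column percentages: share of each employment category that is No / Yes.
column_pct = {
    status: {c: round(100 * row[c] / column_totals[c], 1) for c in categories}
    for status, row in counts.items()
}

for status in counts:
    print(status, column_pct[status])
```

Reading down a column answers the question posed: 48.1% of custodial staff but only 4.8% of managers are classified as minority.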
2. Scale Data
• Scale or quantitative data: usually a
measurement of size or quantity
– not meaningful to report % or count
• Not unless you break the variable into categories
(& then it becomes categorical data!)
• e.g. income bands = “grouped data”
• Tables of raw data not much use unless
only a few values...
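Breaking a scale variable into bands, as described above, can be sketched in Python. The ten incomes are non-missing Total Income values taken from the borrower listing in this section; the band width of 5,000 is an arbitrary assumption chosen for illustration.

```python
from collections import Counter

# Ten non-missing Total Income values from the borrower listing.
incomes = [18720, 16000, 16455, 7020, 4576, 10800, 19072, 11500, 2912, 11745]

BAND_WIDTH = 5000  # assumed band width, for illustration only


def income_band(income):
    """Return a label for the band that contains this income."""
    lower = (income // BAND_WIDTH) * BAND_WIDTH
    return f"{lower}-{lower + BAND_WIDTH - 1}"


# Once banded, the variable is categorical and can be tabulated by count.
band_counts = Counter(income_band(value) for value in incomes)

for band in sorted(band_counts, key=lambda label: int(label.split("-")[0])):
    print(band, band_counts[band])
```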
How do we tabulate 129,000
observations?
CM SML 1988: Total Income, by Borrower ("." = missing value)

Borrower  Total Income | Borrower  Total Income | Borrower  Total Income | Borrower  Total Income
       1             . |       21         10800 |       41             . |       61             .
       2             . |       22             . |       42          7216 |       62             .
       3             . |       23         19072 |       43             . |       63             .
       4             . |       24             . |       44         12000 |       64             .
       5             . |       25             . |       45          9758 |       65             .
       6             . |       26             . |       46          6084 |       66             .
       7             . |       27             . |       47             . |       67             .
       8             . |       28             . |       48             . |       68             .
       9             . |       29             . |       49             . |       69         18336
      10             . |       30             . |       50          9345 |       70         15096
      11             . |       31             . |       51          9810 |       71             .
      12             . |       32             . |       52         14406 |       72         12597
      13             . |       33             . |       53          9190 |       73          9700
      14             . |       34             . |       54             . |       74             .
      15         18720 |       35             . |       55             . |       75             .
      16         16000 |       36             . |       56             . |       76             .
      17         16455 |       37             . |       57             . |       77             .
      18             . |       38         11500 |       58             . |       78          5295
      19          7020 |       39          2912 |       59             . |       79          4539
      20          4576 |       40         11745 |       60             . |       80             .
Tables of Summary Statistics
for Continuous Data:
• Descriptives Function in SPSS:
– E.g. House Sales data
• On SPSS Menu Bar select:
– Analyze, Descriptive Statistics, Descriptives
Descriptive Statistics

                        N    Minimum     Maximum       Mean   Std. Deviation
Current Salary        474    $15,750    $135,000   $34419.6      $17,075.661
Valid N (listwise)    474
• Explore Function in SPSS:
– On SPSS Menu Bar select:
• Analyze, Descriptive Statistics, Explore
Descriptives: Current Salary

                                                  Statistic   Std. Error
Mean                                               $34419.6     $784.311
95% Confidence Interval for Mean   Lower Bound     $32878.4
                                   Upper Bound     $35960.7
5% Trimmed Mean                                    $32455.2
Median                                             $28875.0
Variance                                             3E+008
Std. Deviation                                     $17075.7
Minimum                                             $15,750
Maximum                                            $135,000
Range                                              $119,250
Interquartile Range                                 $13,163
Skewness                                              2.125         .112
Kurtosis                                              5.378         .224
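Several of the Explore figures for Current Salary are linked by simple formulas, which a short Python sketch can confirm: the standard error of the mean is the standard deviation divided by the square root of N, and the 95% confidence interval is the mean plus or minus a t-value times that standard error. The t-value of 1.965 is an assumption here (an approximate table value for 473 degrees of freedom), so the bounds agree with the output only to within rounding.

```python
import math

# Figures reported in the Descriptives and Explore output for Current Salary.
n = 474
mean_salary = 34419.6
std_dev = 17075.661

# Assumed: approximate two-sided 95% t-value for n - 1 = 473 degrees of freedom.
T_CRITICAL = 1.965

# Standard error of the mean.
std_error = std_dev / math.sqrt(n)

# 95% confidence interval for the mean.
margin = T_CRITICAL * std_error
lower_bound = mean_salary - margin
upper_bound = mean_salary + margin

# The range is simply maximum minus minimum.
salary_range = 135000 - 15750

print(round(std_error, 3), round(lower_bound, 1), round(upper_bound, 1), salary_range)
```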
D. Graphs of Variables:
1. Graphs of Categorical Data
• Pie Charts
– If all the categories sum to a meaningful
total, then you can use a pie chart
– Pie charts emphasise the differences in
proportions between categories
– OK for a single snapshot, but not very good
for showing trends
• would need to have a separate pie chart for
each year
• On SPSS Menu Bar select:
• Graphs, Pie, Summaries for Groups of Cases
• Bar Charts
– can show either % or count
– not very good for showing trends in more
than one category
[Figure: bar chart, "Income Support claimants with housing costs by statistical group in May 1999". Vertical axis: 000's, 0 to 100. Horizontal axis: Category of Claimant (Aged 60 or over, Lone Parents, Disabled, Other). Source: DSS Quarterly Statistical Enquiry.]
[Figure: chart, "Income Support claimants with housing costs by statistical group: May 1993 to May 1999". Series: Other 000s, Disabled 000s, Lone Parents 000s, Aged 60 or over 000s. Vertical axis: 000's, 0 to 140. Horizontal axis: Year, 1993 to 1999.]
Beware of scaling...
[Figure: chart, "Income Support claimants with housing costs by statistical group: May 1993 to May 1999", Lone Parents 000s only. Vertical axis: 000's, 60 to 120. Horizontal axis: Year, 1993 to 1999.]
[Figure: the same chart, Lone Parents 000s only, redrawn with the vertical axis running from 0 to 200.]
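Why scaling matters can be put in numbers. The Python sketch below takes a purely hypothetical series that rises from 90 to 110 (000's) and asks what share of the vertical axis that rise occupies under the two axis ranges used in the charts above (60 to 120, and 0 to 200). The start and end values are made up; only the axis ranges come from the charts.

```python
def share_of_axis(start, end, axis_min, axis_max):
    """Fraction of the vertical axis height covered by a change in value."""
    return abs(end - start) / (axis_max - axis_min)


# Hypothetical change in the series (000's); not taken from the charts.
start_value, end_value = 90, 110

narrow = share_of_axis(start_value, end_value, 60, 120)  # truncated axis
wide = share_of_axis(start_value, end_value, 0, 200)     # axis starting at 0

print(f"Axis 60-120: change fills {narrow:.0%} of the chart height")
print(f"Axis 0-200:  change fills {wide:.0%} of the chart height")
```

The same change looks more than three times as dramatic on the truncated axis, even though the data are identical.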
D. Graphs of Variables:
2. Graphs of Continuous Data
• What are we interested in when describing
data?
• E.g. income:
– Is income evenly spread?
– Or are most people rich?
– Or are most people poor?
– Or are most reasonably well off?
• These are all questions about the variable’s
distribution
– We can represent the whole data set with one
picture...
• On SPSS Menu Bar select:
• Graphs, Histogram, and select variable
[Figure: histogram of TOTAL INCOME OF BORROWER(S). Horizontal axis: income bins of width 1,500, from 0 to 58,500. Vertical axis: frequency, 0 to 12,000. Std. Dev = 12830.02, Mean = 17993.3, N = 125541.]
[Figure: histogram, "LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data)". Horizontal axis: LTV bins of width .50, from -.25 to 17.25. Vertical axis: frequency, 0 to 60,000. Std. Dev = .25, Mean = .80, N = 74736.]
[Figure: histogram, same title. Horizontal axis: LTV bins of width .05, from 0.00 to 1.50. Vertical axis: frequency, 0 to 30,000. Std. Dev = .22, Mean = .80, N = 74552.]
[Figure: histogram, same title. Horizontal axis: LTV bins of width .05, from 0.00 to 1.00. Vertical axis: frequency, 0 to 30,000. Std. Dev = .22, Mean = .78, N = 70545.]
[Figure: histogram, same title. Horizontal axis: three LTV bins of width .50 (0.00 - .50, .50 - 1.00, 1.00 - 1.50). Vertical axis: frequency, 0 to 70,000. Std. Dev = .22, Mean = .80, N = 74552.]
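The LTV histograms above differ only in the range shown and the bin width, yet they give quite different impressions. A rough Python sketch of the binning step shows why. The LTV values are hypothetical and are held as whole percentages (so 95 means an LTV of .95) to avoid floating-point surprises at bin edges; the two bin widths mirror the .05 and .50 bins in the charts.

```python
from collections import Counter

# Hypothetical loan-to-value ratios, in whole percent.
ltv_pct = [32, 48, 55, 71, 74, 78, 81, 83, 88, 92, 95, 95, 97, 102]


def bin_counts(values, width):
    """Count values per bin; a bin starting at b covers b up to b + width - 1."""
    counts = Counter((value // width) * width for value in values)
    return dict(sorted(counts.items()))


narrow_bins = bin_counts(ltv_pct, 5)   # like the .05-wide bins
wide_bins = bin_counts(ltv_pct, 50)    # like the .50-wide bins

print(narrow_bins)
print(wide_bins)
```

With wide bins almost everything falls into one bar; narrow bins reveal where within that bar the cases actually cluster.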
Summary
• A. Data Types
– 1. Variables
– 2. Constants
• B. Introduction to SPSS
– 1. SPSS Menu Bar
– 2. File Types
• C. Tabulating Data
– 1. Categorical variables
– 2. Continuous variables
• D. Graphing Data
– 1. Categorical variables
– 2. Continuous variables