Theory of Regression
1
The Course
• 16 (or so) lessons
– Some flexibility
• Depends how we feel
• What we get through
2
Part I: Theory of Regression
1. Models in statistics
2. Models with more than one parameter:
regression
3. Why regression?
4. Samples to populations
5. Introducing multiple regression
6. More on multiple regression
3
Part 2: Application of regression
7. Categorical predictor variables
8. Assumptions in regression analysis
9. Issues in regression analysis
10. Non-linear regression
11. Moderators (interactions) in regression
12. Mediation and path analysis
Part 3: Advanced Types of Regression
13. Logistic Regression
14. Poisson Regression
15. Introducing SEM
16. Introducing longitudinal multilevel models
4
House Rules
• Jeremy must remember
– Not to talk too fast
• If you don’t understand
– Ask
– Any time
• If you think I’m wrong
– Ask. (I’m not always right)
5
Learning New Techniques
• Best kind of data to learn a new technique
– Data that you know well, and understand
• Your own data
– In computer labs (esp later on)
– Use your own data if you like
• My data
– I’ll provide you with
– Simple examples, small sample sizes
• Conceptually simple (even silly)
6
Computer Programs
• SPSS
– Mostly
• Excel
– For calculations
• GPower
• Stata (if you like)
• R (because it's flexible and free)
• Mplus (SEM, ML?)
• AMOS (if you like)
7
Lesson 1: Models in statistics
Models, parsimony, error, mean,
OLS estimators
10
What is a Model?
11
What is a model?
• Representation
– Of reality
– Not reality
• Model aeroplane represents a real
aeroplane
– If model aeroplane = real aeroplane, it
isn’t a model
12
• Statistics is about modelling
– Representing and simplifying
• Sifting
– What is important from what is not
important
• Parsimony
– In statistical models we seek parsimony
– Parsimony ≈ simplicity
13
Parsimony in Science
• A model should be:
– 1: able to explain a lot
– 2: use as few concepts as possible
• More it explains
– The more you get
• Fewer concepts
– The lower the price
• Is it worth paying a higher price for a better
model?
14
A Simple Model
• Height of five individuals
– 1.40m
– 1.55m
– 1.80m
– 1.62m
– 1.63m
• These are our DATA
15
A Little Notation
Y    The (vector of) data that we are modelling
Yi   The ith observation in our data
Y = {4, 5, 6, 7, 8}
Y2 = 5
16
Greek letters represent the true value in the population.
β    (Beta) Parameters in our model (population value)
β0   The value of the first parameter of our model, in the population
βj   The value of the jth parameter of our model, in the population
ε    (Epsilon) The error in the population model
17
Normal letters represent the values in our sample. These are sample statistics, which are used to estimate population parameters.
b    A parameter in our model (sample statistic)
e    The error in our sample
Y    The data in our sample which we are trying to model
18
Symbols on top change the meaning.
Y    The data in our sample which we are trying to model (repeated)
Ŷi   The estimated value of Y, for the ith case
Ȳ    The mean of Y
19
So b1 = β̂1
I will use b1 (because it is easier to type)
20
• Not always that simple
– some texts and computer programs use
b = the parameter estimate (as we have used)
β (beta) = the standardised parameter estimate
SPSS does this.
21
A capital letter is the set (vector) of
parameters/statistics
B
Set of all parameters (b0, b1, b2, b3 … bp)
Rules are not used very consistently (even by
me).
Don’t assume you know what someone means,
without checking.
22
• We want a model
– To represent those data
• Model 1:
– 1.40m, 1.55m, 1.80m, 1.62m, 1.63m
– Not a model
• A copy
– VERY unparsimonious
• Data: 5 statistics
• Model: 5 statistics
– No improvement
23
• Model 2:
– The mean (arithmetic mean)
– A one parameter model
Ŷi = b0 = Ȳ = ( Σ(i = 1 to n) Yi ) / n
24
• Which, because we are lazy, can be
written as
Ȳ = ΣY / n
25
The Mean as a Model
26
The (Arithmetic) Mean
• We all know the mean
– The ‘average’
– Learned about it at school
– Forget (didn’t know) about how clever the mean is
• The mean is:
– An Ordinary Least Squares (OLS) estimator
– Best Linear Unbiased Estimator (BLUE)
27
Mean as OLS Estimator
• Going back a step or two
• MODEL was a representation of DATA
– We said we want a model that explains a lot
– How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA - MODEL
– We want a model with as little ERROR as possible
28
• What is error?
Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03
29
• How can we calculate the ‘amount’ of
error?
• Sum of errors
ERROR = Σ ei
      = Σ (Yi − Ŷ)
      = Σ (Yi − b0)
      = −0.20 − 0.05 + 0.20 + 0.02 + 0.03
      = 0
30
– 0 implies no ERROR
• Not the case
– Knowledge about ERROR is useful
• As we shall see later
31
• Sum of absolute errors
– Ignore signs
ERROR = Σ |ei|
      = Σ |Yi − Ŷ|
      = Σ |Yi − b0|
      = 0.20 + 0.05 + 0.20 + 0.02 + 0.03
      = 0.50
32
• Are small and large errors equivalent?
– One error of 4
– Four errors of 1
– The same?
– What happens with different data?
• Y = (2, 2, 5)
– b0 = 2
– Not very representative
• Y = (2, 2, 4, 4)
– b0 = any value from 2 - 4
– Indeterminate
• There are an infinite number of solutions which would satisfy
our criteria for minimum error
33
• Sum of squared errors (SSE)
ERROR = Σ ei²
      = Σ (Yi − Ŷ)²
      = Σ (Yi − b0)²
      = (−0.20)² + (−0.05)² + 0.20² + 0.02² + 0.03²
      = 0.08
34
• Determinate
– Always gives one answer
• If we minimise SSE
– Get the mean
• Shown in graph
– SSE plotted against b0
– Min value of SSE occurs when
– b0 = mean
35
[Figure: SSE plotted against candidate values of b0 (from 1 to 2); the minimum of the curve is at b0 = 1.60, the mean]
36
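A quick sketch of the same idea in R (one of the course tools); the object and function names here are just illustrative, not part of the course materials:

  # The five heights (in metres) from the example
  y <- c(1.40, 1.55, 1.80, 1.62, 1.63)
  # Sum of squared errors for a candidate one-parameter model b0
  sse <- function(b0) sum((y - b0)^2)
  sse(1.50)     # some other candidate value of b0: SSE is larger
  sse(mean(y))  # b0 = 1.60 (the mean): SSE = 0.0838, the minimum
  # Plotting SSE over a grid of candidate b0 values reproduces the curve above
  b0 <- seq(1.4, 1.8, by = 0.01)
  plot(b0, sapply(b0, sse), type = "l")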
The Mean as an OLS Estimate
37
Mean as OLS Estimate
• The mean is an Ordinary Least Squares
(OLS) estimate
– As are lots of other things
• This is exciting because
– OLS estimators are BLUE
– Best Linear Unbiased Estimators
– Proven with Gauss-Markov Theorem
• Which we won’t worry about
38
BLUE Estimators
• Best
– Minimum variance (of all possible unbiased estimators)
– Narrower distribution than other estimators
• e.g. median, mode
• Linear
– Linear predictions
– For the mean: Ŷ = Ȳ
– A linear (straight, flat) line
39
• Unbiased
– Centred around true (population) values
– Expected value = population value
– Minimum is biased.
• Minimum in samples > minimum in population
• Estimators
– Errrmm… they are estimators
• Also consistent
– Sample approaches infinity, get closer to
population values
– Variance shrinks
40
SSE and the Standard
Deviation
• Tying up a loose end
SSE = Σ (Yi − Ŷ)²

s = √[ Σ (Yi − Ŷ)² / n ]

σ̂ = √[ Σ (Yi − Ŷ)² / (n − 1) ]
41
• SSE closely related to SD
• Sample standard deviation – s
– Biased estimator of population SD
• Population standard deviation – σ
– Need to know the mean to calculate SD
• Reduces N by 1
• Hence divide by N-1, not N
– Like losing one df
42
Proof
• That the mean minimises SSE
– Not that difficult
– As statistical proofs go
• Available in
– Maxwell and Delaney – Designing
experiments and analysing data
– Judd and McClelland – Data Analysis (out
of print?)
43
What’s a df?
• The number of parameters free to vary
– When one is fixed
• Term comes from engineering
– Movement available to structures
44
0 df
No variation
available
1 df
Fix 1 corner, the
shape is fixed
45
Back to the Data
• Mean has 5 (N) df
– 1st moment
• σ has N – 1 df
– Mean has been fixed
– 2nd moment
– Can think of as amount cases vary away
from the mean
46
While we are at it …
• Skewness has N – 2 df
– 3rd moment
• Kurtosis has N – 3 df
– 4th moment
– Amount cases vary from σ
47
Parsimony and df
• Number of df remaining
– Measure of parsimony
• Model which contained all the data
– Has 0 df
– Not a parsimonious model
• Normal distribution
– Can be described in terms of mean and σ
• 2 parameters
– (z with 0 parameters)
48
Summary of Lesson 1
• Statistics is about modelling DATA
– Models have parameters
– Fewer parameters, more parsimony, better
• Models need to minimise ERROR
– Best model, least ERROR
– Depends on how we define ERROR
– If we define error as sum of squared deviations
from predicted value
– Mean is best MODEL
49
Lesson 2: Models with one
more parameter - regression
52
In Lesson 1 we said …
• Use a model to predict and describe
data
– Mean is a simple, one parameter model
53
More Models
Slopes and Intercepts
54
More Models
• The mean is OK
– As far as it goes
– It just doesn’t go very far
– Very simple prediction, uses very little
information
• We often have more information than
that
– We want to use more information than that
55
House Prices
• In the UK, two of the largest lenders
(Halifax and Nationwide) compile house
price indices
– Predict the price of a house
– Examine effect of different circumstances
• Look at change in prices
– Guides legislation
• E.g. interest rates, town planning
56
Predicting House Prices
Beds   £ (000s)
1      77
2      74
1      88
3      62
5      90
5      136
2      35
5      134
4      138
1      55
57
One Parameter Model
• The mean
Ȳ = 88.9
Ŷ = b0 = Ȳ
SSE = 11806.9
“How much is that house worth?”
“£88,900”
Use 1 df to say that
58
Adding More Parameters
• We have more information than this
– We might as well use it
– Add a linear function of number of
bedrooms (x1)
Ŷ = b0 + b1x1
59
Alternative Expression
• Estimate of Y (expected value of Y)
Ŷ = b0 + b1x1
• Value of Y
Yi = b0 + b1xi1 + ei
60
Estimating the Model
• We can estimate this model in four different,
equivalent ways
– Provides more than one way of thinking about it
1. Estimating the slope which minimises SSE
2. Examining the proportional reduction in SSE
3. Calculating the covariance
4. Looking at the efficiency of the predictions
61
Estimate the Slope to Minimise
SSE
62
Estimate the Slope
• Stage 1
– Draw a scatterplot
– x-axis at mean
• Not at zero
• Mark errors on it
– Called ‘residuals’
– Sum and square these to find SSE
63
[Scatterplot: price (£000s, 0–160) against number of bedrooms (1.5–5.5), with a horizontal line drawn at the mean price]
64
[The same scatterplot, with the residuals from the mean marked on it]
65
• Add another slope to the chart
– Redraw residuals
– Recalculate SSE
– Move the line around to find slope which
minimises SSE
• Find the slope
66
• First attempt:
67
• Any straight line can be defined with
two parameters
– The location (height) of the slope
• b0
– Sometimes called a
– The gradient of the slope
• b1
68
• Gradient
b1 units
1 unit
69
• Height
b0 units
70
• Height
• If we fix slope to zero
– Height becomes mean
– Hence mean is b0
• Height is defined as the point that the
slope hits the y-axis
– The constant
– The y-intercept
71
• Why the constant?
– b0x0
– Where x0 is 1.00 for
every case
• i.e. x0 is constant
• Implicit in SPSS
– Some packages force
you to make it
explicit
– (Later on we’ll need
to make it explicit)
beds (x1)   x0   £ (000s)
1           1    77
2           1    74
1           1    88
3           1    62
5           1    90
5           1    136
2           1    35
5           1    134
4           1    138
1           1    55
72
• Why the intercept?
– Where the regression line intercepts the y-axis
– Sometimes called y-intercept
73
Finding the Slope
• How do we find the values of b0 and b1?
– Start by 'jiggling' the values, to find the best
estimates which minimise SSE
– Iterative approach
• Computer intensive – used to matter, doesn’t
really any more
• (With fast computers and sensible search
algorithms – more on that later)
74
• Start with
– b0=88.9 (mean)
– b1=10 (nice round number)
• SSE = 14948 – worse than it was
– b0=86.9, b1=10, SSE=13828
– b0=66.9, b1=10, SSE=7029
– b0=56.9, b1=10, SSE=6628
– b0=46.9, b1=10, SSE=8228
– b0=51.9, b1=10, SSE=7178
– b0=51.9, b1=12, SSE=6179
– b0=46.9, b1=14, SSE=5957
– ……..
75
• Quite a long time later
– b0 = 46.000372
– b1 = 14.79182
– SSE = 5921
• Gives the position of the
– Regression line (or)
– Line of best fit
• Better than guessing
• Not necessarily the only method
– But it is OLS, so it is the best (it is BLUE)
76
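The same 'jiggling' can be handed to a general-purpose optimiser. A rough R sketch (the object names are mine, not from the slides), assuming the ten houses above:

  beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
  price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)
  # SSE for a candidate intercept b[1] and slope b[2]
  sse <- function(b) sum((price - (b[1] + b[2] * beds))^2)
  # Let optim() do the searching, starting from the mean and a guessed slope
  optim(c(88.9, 10), sse)$par   # converges near b0 = 46.0, b1 = 14.8
  # R's built-in OLS routine gives the same answer directly
  coef(lm(price ~ beds))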
[Scatterplot: actual price and predicted price (the regression line) against number of bedrooms]
77
• We now know
– A house with no bedrooms is worth ≈ £46,000 (??!)
– Adding a bedroom adds ≈ £15,000
• Told us two things
– Don’t extrapolate to meaningless values of
x-axis
– Constant is not necessarily useful
• It is necessary to estimate the equation
78
Standardised Regression Line
• One big but:
– Scale dependent
• Values change
– £ to €, inflation
• Scales change
– £, £000, £00?
• Need to deal with this
79
• Don’t express in ‘raw’ units
– Express in SD units
– σx1 = 1.72
– σy = 36.21
• b1 = 14.79
• We increase x1 by 1, and Ŷ increases by 14.79
14.79 = (14.79 / 36.21) SDs = 0.408 SDs
80
• Similarly, 1 unit of x1 = 1/1.72 SDs of x1
– So increase x1 by 1 SD of x1 (1.72 units)
– and Ŷ increases by 14.79 × 1.72 = 25.4 raw units, i.e. 25.4 / 36.21 = 0.70 SDs of y
• Put them both together
(b1 × σx1) / σy
81
14.79 1.72
 0.706
36.21
• The standardised regression line
– Change (in SDs) in Ŷ associated with a
change of 1 SD in x1
• A different route to the same answer
– Standardise both variables (divide by SD)
– Find line of best fit
82
• The standardised regression line has a
special name
The Correlation Coefficient
(r)
(r stands for ‘regression’, but more on that
later)
• Correlation coefficient is a standardised
regression slope
– Relative change, in terms of SDs
83
Proportional Reduction in
Error
84
Proportional Reduction in Error
• We might be interested in the level of
improvement of the model
– How much less error (as proportion) do we
have
– Proportional Reduction in Error (PRE)
• Mean only
– Error(model 0) = 11806
• Mean + slope
– Error(model 1) = 5921
85
PRE = ( ERROR(0) − ERROR(1) ) / ERROR(0)
PRE = 1 − ERROR(1) / ERROR(0)
PRE = 1 − 5921 / 11806
PRE = 0.4984
86
• But we squared all the errors in the first
place
– So we could take the square root
– (It’s a shoddy excuse, but it makes the
point)
√0.4984 = 0.706
• This is the correlation coefficient
• Correlation coefficient is the square root
of the proportion of variance explained
87
Standardised Covariance
88
Standardised Covariance
• We are still iterating
– Need a ‘closed-form’
– Equation to solve to get the parameter
estimates
• Answer is a standardised covariance
– A variable has variance
– Amount of ‘differentness’
• We have used SSE so far
89
• SSE varies with N
– Higher N, higher SSE
• Divide by N
– Gives SSE per person
– (Actually N – 1, we have lost a df to the
mean)
• The variance
• Same as SD2
– We thought of SSE as a scattergram
• Y plotted against X
– (repeated image follows)
90
91
• Or we could plot Y against Y
– Axes meet at the mean (88.9)
– Draw a square for each point
– Calculate an area for each square
– Sum the areas
• Sum of areas
– SSE
• Sum of areas divided by N
– Variance
92
Plot of Y against Y
[Scatterplot of Y against Y: both axes run from 0 to 180]
93
Draw Squares
[The same plot with a square drawn for each point, e.g.:
138 − 88.9 = 40.1 on each side, so Area = 40.1 × 40.1 = 1608.1
35 − 88.9 = −53.9 on each side, so Area = (−53.9) × (−53.9) = 2905.21]
94
• What if we do the same procedure
– Instead of Y against Y
– Y against X
•
•
•
•
Draw rectangles (not squares)
Sum the area
Divide by N - 1
This gives us the variance of x with y
– The Covariance
– Shortened to Cov(x, y)
95
96
[Plot of Y against X with a rectangle drawn for each point, e.g.:
55 − 88.9 = −33.9 and 1 − 3 = −2, so Area = (−33.9) × (−2) = 67.8
138 − 88.9 = 49.1 and 4 − 3 = 1, so Area = 49.1 × 1 = 49.1]
97
• More formally (and easily)
• We can state what we are doing as an
equation
– Where Cov(x, y) is the covariance
Cov(x, y) = Σ (x − x̄)(y − ȳ) / (N − 1)
• Cov(x,y)=44.2
• What do points in different sectors do
to the covariance?
98
• Problem with the covariance
– Tells us about two things
– The variance of X and Y
– The covariance
• Need to standardise it
– Like the slope
• Two ways to standardise the covariance
– Standardise the variables first
• Subtract from mean and divide by SD
– Standardise the covariance afterwards
99
• First approach
– Much more computationally expensive
• Too much like hard work to do by hand
– Need to standardise every value
• Second approach
– Much easier
– Standardise the final value only
• Need the combined variance
– Multiply two variances
– Find square root (were multiplied in first
place)
100
• Standardised covariance
Cov(x, y) / √( Var(x) × Var(y) ) = 44.2 / √( 2.9 × 1311 ) = 0.706
101
• The correlation coefficient
– A standardised covariance is a correlation
coefficient
r = Covariance / √( variance × variance )
102
• Expanded …
r = [ Σ(x − x̄)(y − ȳ) / (N − 1) ] / √[ ( Σ(x − x̄)² / (N − 1) ) × ( Σ(y − ȳ)² / (N − 1) ) ]
103
• This means …
– We now have a closed form equation to
calculate the correlation
– Which is the standardised slope
– Which we can use to calculate the
unstandardised slope
104
We know that:
r = (b1 × σx1) / σy
So, rearranging:
b1 = (r × σy) / σx1
105
b1 = (r × σy) / σx1
b1 = (0.706 × 36.21) / 1.72
b1 = 14.79
• So value of b1 is the same as the iterative
approach
106
• The intercept
– Just while we are at it
• The variables are centred at zero
– We subtracted the mean from both
variables
– Intercept is zero, because the axes cross at
the mean
107
• Add mean of y to the constant
– Adjusts for centring y
• Subtract mean of x
– But not the whole mean of x
– Need to correct it for the slope
c = ȳ − b1 x̄1
c = 88.9 − 14.8 × 3
c = 46.00
• Naturally, the same
108
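A small R check of this closed-form route (the object names are just illustrative), using the same house data:

  beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
  price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)
  r  <- cov(beds, price) / sqrt(var(beds) * var(price))  # 0.706, same as cor(beds, price)
  b1 <- r * sd(price) / sd(beds)                         # 14.79
  b0 <- mean(price) - b1 * mean(beds)                    # 46.0
  c(b0 = b0, b1 = b1, r = r)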
Accuracy of Prediction
109
One More (Last One)
• We have one more way to calculate the
correlation
– Looking at the accuracy of the prediction
• Use the parameters
– b0 and b1
– To calculate a predicted value for each case
110
Beds   Actual Price   Predicted Price
1      77             60.80
2      74             75.59
1      88             60.80
3      62             90.38
5      90             119.96
5      136            119.96
2      35             75.59
5      134            119.96
4      138            105.17
1      55             60.80
• Plot actual price
against
predicted price
– From the model
111
[Scatterplot: predicted value (20–140) against actual value (20–160)]
112
• r = 0.706
– The correlation
• Seems a futile thing to do
– And at this stage, it is
– But later on, we will see why
113
Some More Formulae
• For hand calculation
r = Σxy / √( Σx² Σy² )
• Point biserial
r = ( (My1 − My0) × √(PQ) ) / sdy
114
• Phi (φ)
– Used for 2 dichotomous variables

                Vote P   Vote Q
Homeowner       A: 19    B: 54
Not homeowner   C: 60    D: 53

r = (BC − AD) / √( (A + B)(C + D)(A + C)(B + D) )
115
• Problem with the phi correlation
– Unless Px= Py (or Px = 1 – Py)
• Maximum (absolute) value is < 1.00
• Tetrachoric can be used
• Rank (Spearman) correlation
– Used where data are ranked
r = 1 − ( 6Σd² ) / ( n(n² − 1) )
116
Summary
• Mean is an OLS estimate
– OLS estimates are BLUE
• Regression line
– Best prediction of DV from IV
– OLS estimate (like mean)
• Standardised regression line
– A correlation
117
• Four ways to think about a correlation
– 1. Standardised regression line
– 2. Proportional Reduction in Error (PRE)
– 3. Standardised covariance
– 4. Accuracy of prediction
118
Lesson 3: Why Regression?
A little aside, where we look at
why regression has such a curious
name.
121
Regression
The or an act of regression; reversion;
return towards the mean; return to an
earlier stage of development, as in an
adult’s or an adolescent’s behaving like
a child
(From Latin gradi, to go)
• So why give that name to a statistical
technique which is about prediction and
explanation?
122
• Francis Galton
– Charles Darwin’s cousin
– Studying heritability
• Tall fathers have shorter sons
• Short fathers have taller sons
– ‘Filial regression toward mediocrity’
– Regression to the mean
123
• Galton thought this was biological fact
– Evolutionary basis?
• Then did the analysis backward
– Tall sons have shorter fathers
– Short sons have taller fathers
• Regression to the mean
– Not biological fact, statistical artefact
124
Other Examples
• Secrist (1933): The Triumph of Mediocrity in
Business
• Second albums often tend not to be as good
as the first
• Sequel to a film is not as good as the first
one
• ‘Curse of Athletics Weekly’
• Parents think that punishing bad behaviour
works, but rewarding good behaviour doesn’t
125
Pair Link Diagram
• An alternative to a scatterplot
[two parallel axes, x and y; each case's pair of scores is joined by a line]
126
r=1.00
[pair link diagram: all the connecting lines between x and y are parallel]
127
r=0.00
[pair link diagram: the connecting lines cross at random]
128
From Regression to
Correlation
• Where do we predict an individual’s
score on y will be, based on their score
on x?
– Depends on the correlation
• r = 1.00 – we know exactly where they
will be
• r = 0.00 – we have no idea
• r = 0.50 – we have some idea
129
r=1.00
Starts here
Will end up
here
x
y
130
r=0.00
Starts here
Could end
anywhere here
x
y
131
r=0.50
Probably
end
somewhere
here
Starts here
x
y
132
Galton Squeeze Diagram
• Don’t show individuals
– Show groups of individuals, from the same
(or similar) starting point
– Shows regression to the mean
133
r=0.00
Ends here
Group starts
here
x
Group starts
here
y
134
r=0.50
x
y
135
r=1.00
x
y
136
1 unit
r units
x
y
• Correlation is amount of regression that
doesn’t occur
137
• No regression
• r=1.00
x
y
138
• Some
regression
• r=0.50
x
y
139
r=0.00
• Lots
(maximum)
regression
• r=0.00
x
y
140
Formula
ẑy = rxy × zx
141
Conclusion
• Regression towards mean is statistical necessity
regression = perfection – correlation
• Very non-intuitive
• Interest in regression and correlation
– From examining the extent of regression towards
mean
– By Pearson – worked with Galton
– Stuck with curious name
• See also Paper B3
142
Lesson 4: Samples to
Populations – Standard Errors
and Statistical Significance
145
The Problem
• In Social Sciences
– We investigate samples
• Theoretically
– Randomly taken from a specified
population
– Every member has an equal chance of
being sampled
– Sampling one member does not alter the
chances of sampling another
• Not the case in (say) physics, biology,
etc.
146
Population
• But it’s the population that we are
interested in
– Not the sample
– Population statistic represented with Greek
letter
– Hat means ‘estimate’
b = β̂
x̄ = μ̂x
147
• Sample statistics (e.g. mean) estimate
population parameters
• Want to know
– Likely size of the parameter
– If it is > 0
148
Sampling Distribution
• We need to know the sampling
distribution of a parameter estimate
– How much does it vary from sample to
sample
• If we make some assumptions
– We can know the sampling distribution of
many statistics
– Start with the mean
149
Sampling Distribution of the
Mean
• Given
– Normal distribution
– Random sample
– Continuous data
• Mean has a known sampling distribution
– Repeatedly sampling will give a known
distribution of means
– Centred around the true (population) mean (μ)
150
Analysis Example: Memory
• Difference in memory for different
words
– 10 participants given a list of 30 words to
learn, and then tested
– Two types of word
• Abstract: e.g. love, justice
• Concrete: e.g. carrot, table
151
Concrete   Abstract   Diff (x)
12         4          8
11         7          4
4          6          -2
9          12         -3
8          6          2
12         10         2
9          8          1
8          5          3
12         10         2
8          4          4

x̄ = 2.1    σx = 3.11    N = 10
152
Confidence Intervals
• This means
– If we know the mean in our sample
– We can estimate where the mean in the
population (μ) is likely to be
• Using
– The standard error (se) of the mean
– Represents the standard deviation of the
sampling distribution of the mean
153
1 SD contains
68%
Almost 2 SDs
contain 95%
154
• We know the sampling distribution of the mean
– t distributed
– Normal with large N (>30)
• We know the range within which means from other samples will fall
– Therefore the likely range of μ
se(x̄) = σx / √n
155
• Two implications of equation
– Increasing N decreases SE
• But only a bit
– Decreasing SD decreases SE
• Calculate Confidence Intervals
– From standard errors
• 95% is a standard level of CI
– In 95% of samples, the true mean will lie within
the 95% CIs
– In large samples: 95% CI = 1.96 × SE
– In smaller samples: depends on the t
distribution (df = N − 1 = 9)
156
x̄ = 2.1,  σx = 3.11,  N = 10

se(x̄) = σx / √n = 3.11 / √10 = 0.98
157
95% CI = 2.26 × 0.98 = 2.22
x̄ − CI ≤ μ ≤ x̄ + CI
−0.12 ≤ μ ≤ 4.32
158
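The same interval can be checked in a couple of lines of R (a sketch, using the ten difference scores above):

  diff <- c(8, 4, -2, -3, 2, 2, 1, 3, 2, 4)       # concrete minus abstract
  se <- sd(diff) / sqrt(length(diff))             # 0.98
  mean(diff) + c(-1, 1) * qt(0.975, df = 9) * se  # roughly -0.12 to 4.32
  t.test(diff)                                    # same CI, plus the t and p values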
What is a CI?
• (For 95% CI):
• 95% chance that the true (population)
value lies within the confidence
interval?
• 95% of samples, true mean will land
within the confidence interval?
159
Significance Test
• Probability that μ is a certain value
– Almost always 0
• Doesn't have to be though
• We want to test the hypothesis that the
difference is equal to 0
– i.e. find the probability of this difference
occurring in our sample IF μ = 0
– (Not the same as the probability that μ = 0)
160
• Calculate SE, and then t
– t has a known sampling distribution
– Can test probability that a certain value is
included
t = x̄ / se(x̄)
t = 2.1 / 0.98 = 2.14
p = 0.061
161
Other Parameter Estimates
• Same approach
– Prediction, slope, intercept, predicted
values
– At this point, prediction and slope are the
same
• Won’t be later on
• We will look at one predictor only
– More complicated with > 1
162
Testing the Degree of
Prediction
• Prediction is correlation of Y with Ŷ
– The correlation – when we have one IV
• Use F, rather than t
• Started with SSE for the mean only
– This is SStotal
– Divide this into SSresidual
– SSregression
• SStot = SSreg + SSres
163
F = ( SSreg / df1 ) / ( SSres / df2 )
df1 = k
df2 = N − k − 1
164
• Back to the house prices
– Original SSE (SStotal) = 11806
– SSresidual = 5921
• What is left after our model
– SSregression = 11806 – 5921 = 5885
• What our model explains
• Slope = 14.79
• Intercept = 46.0
• r = 0.706
165
F = ( SSreg / df1 ) / ( SSres / df2 )
F = ( 5885 / 1 ) / ( 5921 / (10 − 1 − 1) ) = 7.95
df1 = k = 1
df2 = N − k − 1 = 8
166
• F = 7.95, df = 1, 8, p = 0.02
– Can reject H0
• H0: Prediction is not better than chance
– A significant effect
167
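A minimal R sketch of this F test, built from the sums of squares given on the slides:

  ss_total <- 11806                 # SSE for the mean-only model
  ss_res   <- 5921                  # SSE once the slope is added
  ss_reg   <- ss_total - ss_res     # 5885
  k <- 1; n <- 10
  f <- (ss_reg / k) / (ss_res / (n - k - 1))            # about 7.95
  pf(f, df1 = k, df2 = n - k - 1, lower.tail = FALSE)   # about 0.02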
Statistical Significance:
What does a p-value (really)
mean?
168
A Quiz
• Six questions, each true or false
• Write down your answers (if you like)
• An experiment has been done. Carried out
perfectly. All assumptions perfectly satisfied.
Absolutely no problems.
• P = 0.01
– Which of the following can we say?
169
1. You have absolutely disproved the null
hypothesis (that is, there is no
difference between the population
means).
170
2. You have found the probability of the
null hypothesis being true.
171
3. You have absolutely proved your
experimental hypothesis (that there is
a difference between the population
means).
172
4. You can deduce the probability of the
experimental hypothesis being true.
173
5. You know, if you decide to reject the
null hypothesis, the probability that
you are making the wrong decision.
174
6. You have a reliable experimental
finding in the sense that if,
hypothetically, the experiment were
repeated a great number of times, you
would obtain a significant result on
99% of occasions.
175
OK, What is a p-value
• Cohen (1994)
“[a p-value] does not tell us what we
want to know, and we so much want to
know what we want to know that, out
of desperation, we nevertheless believe
it does” (p 997).
176
OK, What is a p-value
• Sorry, didn’t answer the question
• It's the probability of obtaining a result
as or more extreme than the result we
have in the study, given that the null
hypothesis is true
• Not probability the null hypothesis is
true
177
A Bit of Notation
• Not because we like notation
– But we have to say a lot less
• Probability – P
• Null hypothesis is true – H
• Result (data) – D
• Given – |
178
What’s a P Value
• P(D|H)
– Probability of the data occurring if the null
hypothesis is true
• Not
• P(H|D)
– Probability that the null hypothesis is true,
given that we have the data = p(H)
• P(H|D) ≠ P(D|H)
179
• What is probability you are prime minister
– Given that you are British
– P(M|B)
– Very low
• What is probability you are British
– Given you are prime minister
– P(B|M)
– Very high
• P(M|B) ≠ P(B|M)
180
• There’s been a murder
– Someone bumped off a statto for talking too
much
• The police have DNA
• The police have your DNA
– They match(!)
• DNA matches 1 in 1,000,000 people
• What’s the probability you didn’t do the
murder, given the DNA match (H|D)
181
• Police say:
– P(D|H) = 1/1,000,000
• Luckily, you have Jeremy on your defence
team
• We say:
– P(D|H) ≠ P(H|D)
• Probability that someone matches the
DNA, who didn’t do the murder
– Incredibly high
182
Back to the Questions
• Haller and Kraus (2002)
– Asked those questions of groups in
Germany
– Psychology Students
– Psychology lecturers and professors (who
didn’t teach stats)
– Psychology lecturers and professors (who
did teach stats)
183
1. You have absolutely disproved the null
hypothesis (that is, there is no difference
between the population means).
• True
– 34% of students
– 15% of professors/lecturers
– 10% of professors/lecturers teaching statistics
• False
– We have found evidence against the null
hypothesis
184
2. You have found the probability of the
null hypothesis being true.
– 32% of students
– 26% of professors/lecturers
– 17% of professors/lecturers teaching
statistics
• False
• We don't know
185
3. You have absolutely proved your
experimental hypothesis (that there is a
difference between the population means).
– 20% of students
– 13% of professors/lecturers
– 10% of professors/lecturers teaching statistics
• False
186
4. You can deduce the probability of the
experimental hypothesis being true.
– 59% of students
– 33% of professors/lecturers
– 33% of professors/lecturers teaching
statistics
• False
187
5. You know, if you decide to reject the null
hypothesis, the probability that you are
making the wrong decision.
– 68% of students
– 67% of professors/lecturers
– 73% of professors/lecturers teaching statistics
• False
• Can be worked out
– P(replication)
188
6. You have a reliable experimental finding
in the sense that if, hypothetically, the
experiment were repeated a great
number of times, you would obtain a
significant result on 99% of occasions.
– 41% of students
– 49% of professors/lecturers
– 37% of professors/lecturers teaching statistics
• False
• Another tricky one
– It can be worked out
189
One Last Quiz
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.05
• You replicate the study exactly
– What is probability you find p < 0.05?
190
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.01
• You replicate the study exactly
– What is probability you find p < 0.05?
191
• Significance testing creates boundaries
and gaps where none exist.
• Significance testing means that we find
it hard to build upon knowledge
– we don’t get an accumulation of
knowledge
192
• Yates (1951)
"the emphasis given to formal tests of significance
... has resulted in ... an undue concentration of
effort by mathematical statisticians on
investigations of tests of significance applicable
to problems which are of little or no practical
importance ... and ... it has caused scientific
research workers to pay undue attention to the
results of the tests of significance ... and too
little to the estimates of the magnitude of the
effects they are investigating
193
Testing the Slope
• Same idea as with the mean
– Estimate 95% CI of slope
– Estimate significance of difference from a
value (usually 0)
• Need to know the sd of the slope
– Similar to SD of the mean
194
sy.x = √( Σ(Y − Ŷ)² / (N − k − 1) )
sy.x = √( SSres / (N − k − 1) )
sy.x = √( 5921 / 8 ) = 27.2
195
• Similar to equation for SD of mean
• Then we need standard error
- Similar (ish)
• When we have standard error
– Can go on to 95% CI
– Significance of difference
196
se(by.x) = sy.x / √( Σ(x − x̄)² )
se(by.x) = 27.2 / √26.9 = 5.24
197
• Confidence Limits
• 95% CI
– t dist with N − k − 1 df is 2.31
– CI = 5.24 × 2.31 = 12.06
• 95% confidence limits
14.8 − 12.1 ≤ β ≤ 14.8 + 12.1
2.7 ≤ β ≤ 26.9
198
• Significance of difference from zero
– i.e. probability of getting result if β = 0
• Not probability that β = 0
t = b / se(b) = 14.7 / 5.2 = 2.81
df = N − k − 1 = 8
p = 0.02
• This probability is (of course) the same
as the value for the prediction
199
Testing the Standardised
Slope (Correlation)
• Correlation is bounded between –1 and +1
– Does not have symmetrical distribution, except
around 0
• Need to transform it
– Fisher z’ transformation – approximately
normal
z′ = 0.5 [ ln(1 + r) − ln(1 − r) ]
SEz = 1 / √(n − 3)
200
z′ = 0.5 [ ln(1 + 0.706) − ln(1 − 0.706) ]
z′ = 0.879
SEz = 1 / √(n − 3) = 1 / √(10 − 3) = 0.38
• 95% CIs
– 0.879 − 1.96 × 0.38 = 0.13
– 0.879 + 1.96 × 0.38 = 1.62
201
• Transform back to correlation
e 1
r  2y
e 1
2y
• 95% CIs = 0.13 to 0.92
• Very wide
– Small sample size
– Maybe that’s why CIs are not reported?
202
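The same interval can be computed in R (a sketch; atanh() and tanh() are the Fisher transformation and its inverse):

  r <- 0.706; n <- 10
  z  <- 0.5 * (log(1 + r) - log(1 - r))   # 0.879, the same as atanh(r)
  se <- 1 / sqrt(n - 3)                   # 0.38
  ci_z <- z + c(-1, 1) * 1.96 * se        # 0.14 to 1.62 on the z' scale
  tanh(ci_z)                              # back to the r scale: about 0.13 to 0.92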
Using Excel
• Functions in excel
– Fisher() – to carry out Fisher
transformation
– Fisherinv() – to transform back to
correlation
203
The Others
• Same ideas for calculation of CIs and
SEs for
– Predicted score
– Gives expected range of values given X
• Same for intercept
– But we have probably had enough
204
Lesson 5: Introducing Multiple
Regression
205
Residuals
• We said
Y = b0 + b1x1
• We could have said
Yi = b0 + b1xi1 + ei
• We ignored the i on the Y
• And we ignored the ei
– It’s called error, after all
• But it isn’t just error
– Trying to tell us something
206
What Error Tells Us
• Error tells us that a case has a different
score for Y than we predict
– There is something about that case
• Called the residual
– What is left over, after the model
• Contains information
– Something is making the residual ≠ 0
– But what?
207
[Scatterplot: actual price and predicted price (regression line) against number of bedrooms; one house well above the line is annotated 'swimming pool', one well below is annotated 'unpleasant neighbours']
208
• The residual (+ the mean) is the value
of Y
If all cases were equal on X
• It is the value of Y, controlling for X
• Other words:
– Holding constant
– Partialling
– Residualising
– Conditioned on
209
Beds   £ (000s)   Pred   Adj. Value   Res
1      77         61     105          -16
2      74         76     90           2
1      88         61     62           -27
3      62         90     117          28
5      90         120    119          30
5      136        120    73           -16
2      35         76     129          41
5      134        120    75           -14
4      138        105    56           -33
1      55         61     95           6
210
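A short R sketch of the adjustment idea, using the usual convention residual = actual − predicted (so signs may be displayed differently from the table above):

  beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
  price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)
  fit  <- lm(price ~ beds)
  pred <- fitted(fit)          # predicted price for each house
  res  <- resid(fit)           # residual: actual minus predicted
  adj  <- mean(price) + res    # price 'adjusted' to an average number of bedrooms
  round(cbind(beds, price, pred, res, adj), 1)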
• Sometimes adjustment is enough on its own
– Measure performance against criteria
• Teenage pregnancy rate
– Measure pregnancy and abortion rate in areas
– Control for socio-economic deprivation, and
anything else important
– See which areas have lower teenage pregnancy
and abortion rate, given same level of deprivation
• Value added education tables
– Measure school performance
– Control for initial intake
211
Control?
• In experimental research
– Use experimental control
– e.g. same conditions, materials, time of
day, accurate measures, random
assignment to conditions
• In non-experimental research
– Can’t use experimental control
– Use statistical control instead
212
Analysis of Residuals
• What predicts differences in crime rate
– After controlling for socio-economic
deprivation
– Number of police?
– Crime prevention schemes?
– Rural/Urban proportions?
– Something else
• This is what regression is about
213
• Exam performance
– Consider number of books a student read
(books)
– Number of lectures (max 20) a student
attended (attend)
• Books and attend as IV, grade as DV
214
Books   Attend   Grade
0       9        45
1       15       57
0       10       45
2       16       51
4       10       65
4       20       88
1       11       44
4       20       87
3       15       89
0       15       59

First 10 cases
215
• Use books as IV
– R=0.492, F=12.1, df=1, 28, p=0.001
– b0=52.1, b1=5.7
– (Intercept makes sense)
• Use attend as IV
– R=0.482, F=11.5, df=1, 38, p=0.002
– b0=37.0, b1=1.9
– (Intercept makes less sense)
216
[Scatterplot: grade (30–100) against books (0–4), with the regression line]
217
[Scatterplot: grade (30–100) against attend (5–21), with the regression line]
218
Problem
• Use R2 to give proportion of shared
variance
– Books = 24%
– Attend = 23%
• So we have explained 24% + 23% =
47% of the variance
– NO!!!!!
219
• Look at the correlation matrix
         BOOKS   ATTEND   GRADE
BOOKS    1
ATTEND   0.44    1
GRADE    0.49    0.48     1
• Correlation of books and attend is
(unsurprisingly) not zero
– Some of the variance that books shares
with grade, is also shared by attend
220
• I have access to 2 cars
• My wife has access to 2 cars
– We have access to four cars?
– No. We need to know how many of my 2
cars are the same cars as her 2 cars
• Similarly with regression
– But we can do this with the residuals
– Residuals are what is left after (say) books
– See if residual variance is explained by
attend
– Can use this new residual variance to
calculate SSres, SStotal and SSreg
221
• Well. Almost.
– This would give us correct values for SS
– Would not be correct for slopes, etc
• Assumes that the variables have a
causal priority
– Why should attend have to take what is
left from books?
– Why should books have to take what is left
by attend?
• Use OLS again
222
• Simultaneously estimate 2 parameters
– b1 and b2
– Y = b0 + b1x1 + b2x2
– x1 and x2 are IVs
• Not trying to fit a line any more
– Trying to fit a plane
• Can solve iteratively
– Closed form equations better
– But they are unwieldy
223
[3D scatterplot (2 points only): y plotted against x1 and x2]
224
[The fitted regression plane: height b0 at the origin, slope b1 along x1, slope b2 along x2]
225
(Really) Ridiculous Equations

b1 = [ Σ(y − ȳ)(x1 − x̄1) Σ(x2 − x̄2)² − Σ(y − ȳ)(x2 − x̄2) Σ(x1 − x̄1)(x2 − x̄2) ]
     / [ Σ(x1 − x̄1)² Σ(x2 − x̄2)² − ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b2 = [ Σ(y − ȳ)(x2 − x̄2) Σ(x1 − x̄1)² − Σ(y − ȳ)(x1 − x̄1) Σ(x1 − x̄1)(x2 − x̄2) ]
     / [ Σ(x1 − x̄1)² Σ(x2 − x̄2)² − ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b0 = ȳ − b1x̄1 − b2x̄2
226
• The good news
– There is an easier way
• The bad news
– It involves matrix algebra
• The good news
– We don’t really need to know how to do it
• The bad news
– We need to know it exists
227
A Quick Guide to Matrix
Algebra
(I will never make you do it again)
228
Very Quick Guide to Matrix
Algebra
• Why?
– Matrices make life much easier in
multivariate statistics
– Some things simply cannot be done
without them
– Some things are much easier with them
• If you can manipulate matrices
– you can specify calculations v. easily
– e.g. AA’ = sum of squares of a column
• Doesn’t matter how long the column
229
• A scalar is a number
A scalar: 4
• A vector is a row or column of numbers
A row vector:
[ 2 4 8 7 ]
A column vector:
[ 5 ]
[ 11 ]
230
• A vector is described as rows x columns
2
4 8 7
– Is a 1  4 vector
5
 
11 
– Is a 2  1 vector
– A number (scalar) is a 1  1 vector
231
• A matrix is a rectangle, described as
rows x columns
2 6 5 7 8


4 5 7 5 3
1 5 2 7 8


• Is a 3 x 5 matrix
• Matrices are referred to with bold capitals
- A is a matrix
232
• Correlation matrices and covariance
matrices are special
– They are square and symmetrical
– Correlation matrix of books, attend and
grade
[ 1.00 0.44 0.49 ]
[ 0.44 1.00 0.48 ]
[ 0.49 0.48 1.00 ]
233
• Another special matrix is the identity
matrix I
– A square matrix, with 1 in the diagonal and
0 in the off-diagonal
1

0
I
0

0
0 0 0

1 0 0
0 1 0

0 0 1
– Note that this is a correlation matrix, with
correlations all = 0
234
Matrix Operations
• Transposition
– A matrix is transposed by putting it on its
side
A = [ 7 5 6 ]
– Transpose of A is A'
A' = [ 7 ]
     [ 5 ]
     [ 6 ]
235
• Matrix multiplication
– A matrix can be multiplied by a scalar, a
vector or a matrix
– Not commutative
– AB  BA
– To multiply AB
• The number of columns in A must equal the
number of rows in B
236
• Matrix by vector

[ a d g ] [ j ]   [ aj + dk + gl ]
[ b e h ] [ k ] = [ bj + ek + hl ]
[ c f i ] [ l ]   [ cj + fk + il ]

[ 2  3  5  ] [ 2 ]   [ 4 + 9 + 20   ]   [ 33  ]
[ 7  11 13 ] [ 3 ] = [ 14 + 33 + 52 ] = [ 99  ]
[ 17 19 23 ] [ 4 ]   [ 34 + 57 + 92 ]   [ 183 ]
237
• Matrix by matrix

[ a b ] [ e f ]   [ ae + bg  af + bh ]
[ c d ] [ g h ] = [ ce + dg  cf + dh ]

[ 2 3 ] [ 2 3 ]   [ 4 + 12   6 + 15  ]   [ 16 21 ]
[ 5 7 ] [ 4 5 ] = [ 10 + 28  15 + 35 ] = [ 38 50 ]
238
• Multiplying by the identity matrix
– Has no effect
– Like multiplying by 1
AI = A

[ 2 3 ] [ 1 0 ]   [ 2 3 ]
[ 5 7 ] [ 0 1 ] = [ 5 7 ]
239
• The inverse of J is: 1/J
• J x 1/J = 1
• Same with matrices
– Matrices have an inverse
– Inverse of A is A-1
– AA-1=I
• Inverting matrices is dull
– We will do it once
– But first, we must calculate the
determinant
240
• The determinant of A is |A|
• Determinants are important in statistics
– (more so than the other matrix algebra)
• We will do a 2x2
– Much more difficult for larger matrices
241
A = [ a b ]
    [ c d ]
|A| = ad − cb

A = [ 1.0 0.3 ]
    [ 0.3 1.0 ]
|A| = 1 × 1 − 0.3 × 0.3
|A| = 0.91
242
• Determinants are important because
– Needs to be above zero for regression to
work
– Zero or negative determinant of a
correlation/covariance matrix means
something wrong with the data
• Linear redundancy
• Described as:
– Not positive definite
– Singular (if determinant is zero)
• In different error messages
243
• Next, the adjoint

A = [ a b ]
    [ c d ]

adj A = [ d  −b ]
        [ −c  a ]

• Now
A⁻¹ = (1 / |A|) × adj A
244
• Find A⁻¹

A = [ 1.0 0.3 ]
    [ 0.3 1.0 ]
|A| = 0.91

A⁻¹ = (1 / 0.91) × [ 1.0  −0.3 ]
                   [ −0.3  1.0 ]

A⁻¹ = [ 1.10  −0.33 ]
      [ −0.33  1.10 ]
245
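The same 2 × 2 example can be checked in R, where det() and solve() give the determinant and the inverse (a sketch):

  A <- matrix(c(1.0, 0.3,
                0.3, 1.0), nrow = 2, byrow = TRUE)
  det(A)                       # 0.91
  solve(A)                     # the inverse: 1.10 and -0.33 (to 2 dp)
  round(A %*% solve(A), 10)    # multiplying back gives the identity matrix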
Matrix Algebra with
Correlation Matrices
246
Determinants
• Determinant of a correlation matrix
– The volume of ‘space’ taken up by the
(hyper) sphere that contains all of the
points
A = [ 1.0 0.0 ]
    [ 0.0 1.0 ]
|A| = 1.0
247
[Scatterplot: uncorrelated points spread over the whole space]
A = [ 1.0 0.0 ]
    [ 0.0 1.0 ]
|A| = 1.0
248
[Scatterplot: perfectly correlated points all lying on one line]
A = [ 1.0 1.0 ]
    [ 1.0 1.0 ]
|A| = 0.0
249
Negative Determinant
• Points take up less than no space
– Correlation matrix cannot exist
– Non-positive definite matrix
250
Sometimes Obvious
1.0 1.2 

A  
1
.
2
1
.
0


A  0.44
251
Sometimes Obvious (If You
Think)
0.9 0.9 
 1


A   0.9
1
0.9 
 0.9 0.9

1 

A  2.88
252
Sometimes No Idea

A = [ 1.00   0.76  −0.40 ]
    [ 0.76   1      0.30 ]
    [ −0.40  0.30   1    ]
|A| = −0.01

A = [ 1.00   0.75  −0.40 ]
    [ 0.75   1      0.30 ]
    [ −0.40  0.30   1    ]
|A| = 0.0075
253
Multiple R for Each Variable
• Diagonal of inverse of correlation matrix
– Used to calculate multiple R
– Call elements aij
Ri.123…k = √( 1 − 1 / aii )
254
Regression Weights
• Where i is DV
• j is IV
bi.j = −aij / aii
Back to the Good News
• We can calculate the standardised
parameters as
B=Rxx-1 x Rxy
• Where
– B is the vector of regression weights
– Rxx-1 is the inverse of the correlation matrix
of the independent (x) variables
– Rxy is the vector of correlations of the
x variables with the y variable
– Now do exercise 3.2
256
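As a sketch in R, using the books/attend/grade correlations from earlier in this lesson (the object names are mine):

  Rxx <- matrix(c(1.00, 0.44,
                  0.44, 1.00), nrow = 2, byrow = TRUE)   # books and attend
  Rxy <- c(0.49, 0.48)                                   # each IV with grade
  B <- solve(Rxx) %*% Rxy
  B   # standardised weights: roughly 0.35 (books) and 0.33 (attend)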
One More Thing
• The whole regression equation can be
described with matrices
– very simply
Y = XB + E
257
• Where
– Y = vector of DV
– X = matrix of IVs
– B = vector of coefficients
• Go all the way back to our example
258
1
1

1

1
1

1
1

1
1

1
0
1
0
2
4
4
1
4
3
0
9
 e1   45 
 e   57 
5 
 2  
 e3   45 
10 
   

e4   51 
16 

 b0 
10     e5   65 
  b1       
20     e6   88 
b2 

 e7   44 

11
   

20 
 e8   87 
 e   89 
15 
 9   

15 
 e10   59 
259
1

1
1

1
1

1

1
1

1
1

0
1
0
2
4
4
1
4
3
0
The constant – literally a
constant. Could be any
 e1   45 
   but
number,
 it is most
 e2  to
 57make

convenient
it 1. Used
 e   45 
3
to ‘capture’
  the
 intercept.
9

5
10 

16 
 e4   51 
 b0     

e5
10  
65
 b1       
20    e6   88 
 b2   e   
11 
 7   44 
 e8   87 
20 
   

15 
 e9   89 
 e   59 
15 
 10   
260
1

1
1

1
1

1

1
1

1
1

0
1
0
2
4
4
1
4
3
0
9
 e1   45 
   

5
 e2   57 
 e   45 
10 
 3  

e4   51
16  The matrix
ofvalues for
 b0     

and attend)
e5
10  IVs (books
65
 b1       
20    e6   88 
 b2   e   
11 
 7   44 
 e8   87 
20 
   

15 
 e9   89 
 e   59 
15 
 10   
261
 e1   45 
1 0 9 
   


 e2   57 
1 1 5 
 e   45 
1 0 10 
 3  


 e4   51 
1 2 16 
1 4 10  b0   e   65 
The parameter

 b1    5    
4 20    e6   88 
1 are
estimates. We
b2     



trying to find the
1 1best
11 
 e7   44 
values of these.
 e8   87 
1 4 20 
   


 e9   89 
1 3 15 
1 0 15 
 e   59 


 10   
262
Error. We are trying to
0 9
1 this
minimise


1
1

1
1

1

1
1

1
1

1
0
2
4
4
1
4
3
0
 e1   45 
   
5
 e2   57 
 e   45 
10 
 3  

16 
 e4   51 
 b0     

e5
10  
65
 b1       
20    e6   88 
 b2   e   
11 
 7   44 
 e8   87 
20 
   

15 
 e9   89 
 e   59 
15 
 10   
263
1 0

1 1
1 0

1 2
1 4

1 4

1 1
1 4

1 3
1DV
The
 0
9
 e1   45 
   

5
 e2   57 
 e   45 
10 
 3  

16 
 e4   51 
 b0     

e5
10  
65
 b1       
20    e6   88 
 b2   e   
11 
 7   44 
 e8   87 
20 
   

15 
 e9   89 

 e   59 
- 15
grade

 10   
264
• Y=BX+E
• Simple way of representing as many IVs as
you like
Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e

For two cases and five IVs (plus the constant, x0):

[ x01 x11 x21 x31 x41 x51 ]   [ b0 ]   [ e1 ]
[ x02 x12 x22 x32 x42 x52 ] × [ b1 ] + [ e2 ]
                              [ b2 ]
                              [ b3 ]
                              [ b4 ]
                              [ b5 ]

Each row multiplies out to b0x0 + b1x1 + … + bkxk + e
266
Generalises to Multivariate
Case
• Y=BX+E
• Y, B and E
– Matrices, not vectors
• Goes beyond this course
– (Do Jacques Tacq’s course for more)
– (Or read his book)
267
Lesson 6: More on Multiple
Regression
271
Parameter Estimates
• Parameter estimates (b1, b2 … bk) were
standardised
– Because we analysed a correlation matrix
• Represent the correlation of each IV
with the DV
– When all other IVs are held constant
272
• Can also be unstandardised
• Unstandardised represent the unit
change in the DV associated with a 1
unit change in the IV
– When all the other variables are held
constant
• Parameters have standard errors
associated with them
– As with one IV
– Hence t-test, and associated probability
can be calculated
• Trickier than with one IV
273
Standard Error of Regression
Coefficient
• Standardised is easier
SEi = √[ (1 − R²Y) / (n − k − 1) ] × √[ 1 / (1 − R²i) ]
– R2i is the value of R2 when all other predictors are
used as predictors of that variable
• Note that if R2i = 0, the equation is the same as for
previous
274
Multiple R
• The degree of prediction
– R (or Multiple R)
– No longer equal to b
• R2 Might be equal to the sum of squares
of B
– Only if all x’s are uncorrelated
275
In Terms of Variance
• Can also think of this in terms of
variance explained.
– Each IV explains some variance in the DV
– The IVs share some of their variance
• Can’t share the same variance twice
276
[Venn diagram: the total variance of Y = 1; variance in Y accounted for by x1: rx1y² = 0.36; variance in Y accounted for by x2: rx2y² = 0.36; the two circles do not overlap]
277
• In this model
– R² = ryx1² + ryx2²
– R² = 0.36 + 0.36 = 0.72
– R = √0.72 = 0.85
• But
– If x1 and x2 are correlated
– No longer the case
278
[Venn diagram as before, but now the x1 and x2 circles overlap: some of the variance in Y is shared between x1 and x2 (the overlap is not equal to rx1x2)]
279
• So
– We can no longer sum the r2
– Need to sum them, and subtract the
shared variance – i.e. the correlation
• But
– It’s not the correlation between them
– It’s the correlation between them as a
proportion of the variance of Y
• Two different ways
280
• Based on estimates
R² = b1 ryx1 + b2 ryx2
• If rx1x2 = 0
– rxy = bx1
– Equivalent to ryx12 + ryx22
281
• Based on correlations
2
R 
2
yx1
r
2
yx2
r
 2ryx1 ryx2 rx1 x2
2
x1 x2
1r
• rx1x2 = 0
– Equivalent to ryx12 + ryx22
282
• Can also be calculated using methods
we have seen
– Based on PRE
– Based on correlation with prediction
• Same procedure with >2 IVs
283
Adjusted R2
• R2 is an overestimate of population
value of R2
– Any x will not correlate 0 with Y
– Any variation away from 0 increases R
– Variation from 0 more pronounced with
lower N
• Need to correct R2
– Adjusted R2
284
• Calculation of Adj. R²
Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)
• 1 − R²
– Proportion of unexplained variance
– We multiply this by an adjustment
• More variables – greater adjustment
• More people – less adjustment
285
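A tiny R sketch of the formula (the R² and sample sizes here are made-up numbers, just to show the behaviour):

  adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
  adj_r2(0.50, n = 100, k = 3)   # barely shrinks with many cases
  adj_r2(0.50, n = 10,  k = 3)   # shrinks a lot with few cases and several IVs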
Shrunken R2
• Some authors treat shrunken and
adjusted R2 as the same thing
– Others don’t
286
N 1
N  k 1
N  20, k  3
20  1
19

 1.1875
20  3  1 16
N  10, k  8
N  10, k  3
10  1
9
 9
10  8  1 1
10  1
9
  1.5
10  3  1 6
287
Extra Bits
• Some stranger things that can
happen
– Counter-intuitive
288
Suppressor variables
• Can be hard to understand
– Very counter-intuitive
• Definition
– An independent variable which increases
the size of the parameters associated with
other independent variables above the size
of their correlations
289
• An example (based on Horst, 1941)
– Success of trainee pilots
– Mechanical ability (x1), verbal ability (x2),
success (y)
• Correlation matrix
          Mech   Verb   Success
Mech      1      0.5    0.3
Verb      0.5    1      0
Success   0.3    0      1
290
– Mechanical ability correlates 0.3 with
success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
– (Don’t look ahead until you have had a
guess)
291
• Mechanical ability
– b = 0.4
– Larger than r!
• Verbal ability
– b = -0.2
– Smaller than r!!
• So what is happening?
– You need verbal ability to do the test
– Not related to mechanical ability
• Measure of mechanical ability is contaminated
by verbal ability
292
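The suppression result can be reproduced from the correlation matrix alone, using B = Rxx⁻¹Rxy from Lesson 5 (a sketch in R):

  Rxx <- matrix(c(1.0, 0.5,
                  0.5, 1.0), nrow = 2, byrow = TRUE)   # mech, verbal
  Rxy <- c(0.3, 0.0)                                   # correlations with success
  solve(Rxx) %*% Rxy   # 0.4 for mechanical ability, -0.2 for verbal ability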
• High mech, low verbal
– High mech
• This is positive
– Low verbal
• Negative, because we are talking about
standardised scores
• Your mech is really high – you did well on the
mechanical test, without being good at the
words
• High mech, high verbal
– Well, you had a head start on mech,
because of verbal, and need to be brought
down a bit
293
Another suppressor?
      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     0.2
y     0.3   0.2   1

b1 =
b2 =
294
Another suppressor?
      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     0.2
y     0.3   0.2   1

b1 = 0.26
b2 = 0.07
And another?
      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     -0.2
y     0.3   -0.2  1

b1 =
b2 =
296
And another?
      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     -0.2
y     0.3   -0.2  1

b1 = 0.53
b2 = -0.47
297
One more?
      x1    x2    y
x1    1     -0.5  0.3
x2    -0.5  1     0.2
y     0.3   0.2   1

b1 =
b2 =
298
One more?
      x1    x2    y
x1    1     -0.5  0.3
x2    -0.5  1     0.2
y     0.3   0.2   1

b1 = 0.53
b2 = 0.47
299
• Suppression happens when two opposing
forces are happening together
– And have opposite effects
• Don’t throw away your IVs,
– Just because they are uncorrelated with the DV
• Be careful in interpretation of regression
estimates
– Really need the correlations too, to interpret what
is going on
– Cannot compare between studies with different
IVs
300
Standardised Estimates > 1
• Correlations are bounded
-1.00 ≤ r ≤ +1.00
– We think of standardised regression
estimates as being similarly bounded
• But they are not
– Can go >1.00, <-1.00
– R cannot, because that is a proportion of
variance
301
• Three measures of ability
– Mechanical ability, verbal ability 1, verbal
ability 2
– Score on science exam
          Mech   Verbal1   Verbal2   Scores
Mech      1      0.1       0.1       0.6
Verbal1   0.1    1         0.9       0.6
Verbal2   0.1    0.9       1         0.3
Scores    0.6    0.6       0.3       1

– Before reading on, what are the parameter estimates?
302
Mech     Verbal1   Verbal2
0.56     1.71      -1.29
• Mechanical
– About where we expect
• Verbal 1
– Very high
• Verbal 2
– Very low
303
• What is going on
– It’s a suppressor again
– An independent variable which increases
the size of the parameters associated with
other independent variables above the size
of their correlations
• Verbal 1 and verbal 2 are correlated so
highly
– They need to cancel each other out
304
Variable Selection
• What are the appropriate independent
variables to use in a model?
– Depends what you are trying to do
• Multiple regression has two separate
uses
– Prediction
– Explanation
305
• Prediction
– What will happen in
the future?
– Emphasis on
practical application
– Variables selected
(more) empirically
– Value free
• Explanation
– Why did something
happen?
– Emphasis on
understanding
phenomena
– Variables selected
theoretically
– Not value free
306
• Visiting the doctor
– Precedes suicide attempts
– Predicts suicide
• Does not explain suicide
• More on causality later on …
• Which are appropriate variables
– To collect data on?
– To include in analysis?
– Decision needs to be based on theoretical knowledge
of the behaviour of those variables
– Statistical analysis of those variables (later)
• Unless you didn’t collect the data
– Common sense (not a useful thing to say)
307
Variable Entry Techniques
• Entry-wise
– All variables entered simultaneously
• Hierarchical
– Variables entered in a predetermined order
• Stepwise
– Variables entered according to change in
R2
– Actually a family of techniques
308
• Entrywise
– All variables entered simultaneously
– All treated equally
• Hierarchical
– Entered in a theoretically determined order
– Change in R2 is assessed, and tested for
significance
– e.g. sex and age
• Should not be treated equally with other
variables
• Sex and age MUST be first
– Confused with hierarchical linear modelling
309
• Stepwise
– Variables entered empirically
– Variable which increases R2 the most goes
first
• Then the next …
– Variables which have no effect can be
removed from the equation
• Example
– IVs: Sex, age, extroversion,
– DV: Car – how long someone spends
looking after their car
310
• Correlation Matrix
        SEX     AGE     EXTRO   CAR
SEX     1.00    -0.05   0.40    0.66
AGE     -0.05   1.00    0.40    0.23
EXTRO   0.40    0.40    1.00    0.67
CAR     0.66    0.23    0.67    1.00
311
• Entrywise analysis
– r2 = 0.64
         b      p
SEX      0.49   <0.01
AGE      0.08   0.46
EXTRO    0.44   <0.01
312
• Stepwise Analysis
– Data determines the order
– Model 1: Extroversion, R2 = 0.450
– Model 2: Extroversion + Sex, R2 = 0.633
         b      p
EXTRO    0.48   <0.01
SEX      0.47   <0.01
313
• Hierarchical analysis
– Theory determines the order
– Model 1: Sex + Age, R2 = 0.510
– Model 2: S, A + E, R2 = 0.638
– Change in R2 = 0.128, p = 0.001
         b      p
SEX      0.49   <0.01
AGE      0.08   0.46
EXTRO    0.44   <0.01
314
• Which is the best model?
– Entrywise – OK
– Stepwise – excluded age
• Did have a (small) effect
– Hierarchical
• The change in R2 gives the best estimate of the
importance of extroversion
• Other problems with stepwise
– F and df are wrong (cheats with df)
– Unstable results
• Small changes (sampling variance) – large
differences in models
315
– Uses a lot of paper
– Don’t use a stepwise procedure to pack
your suitcase
316
Is Stepwise Always Evil?
• Yes
• All right, no
• Research goal is predictive (technological)
– Not explanatory (scientific)
– What happens, not why
• N is large
– 40 people per predictor, Cohen, Cohen, Aiken,
West (2003)
• Cross validation takes place
317
A quick note on R2
R2 is sometimes regarded as the ‘fit’ of a
regression model
– Bad idea
• If good fit is required – maximise R2
– Leads to entering variables which do not
make theoretical sense
318
Critique of Multiple Regression
• Goertzel (2002)
– “Myths of murder and multiple regression”
– Skeptical Inquirer (Paper B1)
• Econometrics and regression are ‘junk
science’
– Multiple regression models (in US)
– Used to guide social policy
319
More Guns, Less Crime
– (controlling for other factors)
• Lott and Mustard: A 1% increase in gun
ownership
– 3.3% decrease in murder rates
• But:
– More guns in rural Southern US
– More crime in urban North (crack cocaine
epidemic at time of data)
320
Executions Cut Crime
• No difference between crimes in states
in US with or without death penalty
• Ehrlich (1975) controlled all variables
that affect crime rates
– Death penalty had effect in reducing crime
rate
• No statistical way to decide who’s right
321
Legalised Abortion
• Donohue and Levitt (1999)
– Legalised abortion in 1970’s cut crime in 1990’s
• Lott and Whitley (2001)
– “Legalising abortion decreased murder rates by …
0.5 to 7 per cent.”
• It’s impossible to model these data
– Controlling for other historical events
– Crack cocaine (again)
322
Another Critique
• Berk (2003)
– Regression analysis: a constructive critique (Sage)
• Three cheers for regression
– As a descriptive technique
• Two cheers for regression
– As an inferential technique
• One cheer for regression
– As a causal analysis
323
Is Regression Useless?
• Do regression carefully
– Don’t go beyond data which you have a
strong theoretical understanding of
• Validate models
– Where possible, validate predictive power
of models in other areas, times, groups
• Particularly important with stepwise
324
Lesson 7: Categorical
Independent Variables
325
Introduction
326
Introduction
• So far, just looked at continuous
independent variables
• Also possible to use categorical
(nominal, qualitative) independent
variables
– e.g. Sex; Job; Religion; Region; Type (of
anything)
• Usually analysed with t-test/ANOVA
327
Historical Note
• But these (t-test/ANOVA) are special
cases of regression analysis
– Aspects of General Linear Models (GLMs)
• So why treat them differently?
– Fisher’s fault
– Computers’ fault
• Regression, as we have seen, is
computationally difficult
– Matrix inversion and multiplication
– Unfeasible, without a computer
328
• In the special cases where:
• You have one categorical IV
• Your IVs are uncorrelated
– It is much easier to do it by partitioning of
sums of squares
• These cases
– Very rare in ‘applied’ research
– Very common in ‘experimental’ research
• Fisher worked at Rothamsted agricultural
research station
• Never have problems manipulating wheat, pigs,
cabbages, etc
329
• In psychology
– Led to a split between ‘experimental’
psychologists and ‘correlational’
psychologists
– Experimental psychologists (until recently)
would not think in terms of continuous
variables
• Still (too) common to dichotomise a
variable
– Too difficult to analyse it properly
– Equivalent to discarding 1/3 of your data
330
The Approach
331
The Approach
• Recode the nominal variable
– Into one, or more, variables to represent that
variable
• Names are slightly confusing
– Some texts talk of ‘dummy coding’ to refer to all of
these techniques
– Some (most) refer to ‘dummy coding’ to refer to
one of them
– Most have more than one name
332
• If a variable has g possible categories it
is represented by g-1 variables
• Simplest case:
– Smokes: Yes or No
– Variable 1 represents ‘Yes’
– Variable 2 is redundant
• If it isn’t yes, it’s no
333
The Techniques
334
• We will examine two coding schemes
– Dummy coding
• For two groups
• For >2 groups
– Effect coding
• For >2 groups
• Look at analysis of change
– Equivalent to ANCOVA
– Pretest-posttest designs
335
Dummy Coding – 2 Groups
• Also called simple coding by SPSS
• A categorical variable with two groups
• One group chosen as a reference group
– The other group is represented in a variable
• e.g. 2 groups: Experimental (Group 1) and
Control (Group 0)
– Control is the reference group
– Dummy variable represents experimental group
• Call this variable ‘group1’
336
• For variable ‘group1’
  – 1 = ‘Yes’, 0 = ‘No’

  Original Category    New Variable (group1)
  Exp                   1
  Con                   0
337
• Some data
• Group is x, score is y
                Control Group    Experimental Group
Experiment 1         10                 20
Experiment 2         10                 10
Experiment 3         10                 30
338
• Control Group = 0
– Intercept = Score on Y when x = 0
– Intercept = mean of control group
• Experimental Group = 1
– b = change in Y when x increases 1 unit
– b = difference between experimental
group and control group
339
(Chart: group means for Experiments 1–3, with Control Group and Experimental Group on
the x-axis. The gradient of the slope joining the two points represents the difference
between the means.)
340
Dummy Coding – 3+ Groups
• With three groups the approach is similar
• g = 3, therefore g-1 = 2 variables
needed
• 3 Groups
– Control
– Experimental Group 1
– Experimental Group 2
341
Original Category    Gp1    Gp2
Con                   0      0
Gp1                   1      0
Gp2                   0      1
• Recoded into two variables
– Note – do not need a 3rd variable
• If we are not in group 1 or group 2 MUST be in
control group
• 3rd variable would add no information
• (What would happen to determinant?)
342
• F and associated p
  – Tests H0 that the group means are equal:
    μ(g1) = μ(g2) = μ(g3)
• b1 and b2 and associated p-values
– Test difference between each experimental
group and the control group
• To test difference between experimental
groups
– Need to rerun analysis
343
• One more complication
– Have now run multiple comparisons
– Increases α – i.e. probability of type I error
• Need to correct for this
– Bonferroni correction
– Multiply given p-values by two/three
(depending how many comparisons were
made)
344
Effect Coding
• Usually used for 3+ groups
• Compares each group (except the reference
group) to the mean of all groups
– Dummy coding compares each group to the
reference group.
• Example with 5 groups
– 1 group selected as reference group
• Group 5
345
• Each group (except reference) has a
variable
– 1 if the individual is in that group
– 0 if not
– -1 if in reference group
group    group_1    group_2    group_3    group_4
  1         1          0          0          0
  2         0          1          0          0
  3         0          0          1          0
  4         0          0          0          1
  5        -1         -1         -1         -1
346
Examples
• Dummy coding and Effect Coding
• Group 1 chosen as reference group
each time
• Data
Group     Mean     SD
1         52.40    4.60
2         56.30    5.70
3         60.10    5.00
Total     56.27    5.88
347
• Dummy
  Group    dummy2    dummy3
    1         0         0
    2         1         0
    3         0         1

• Effect
  Group    effect2    effect3
    1        -1         -1
    2         1          0
    3         0          1
348
Dummy                                      Effect
R = 0.543, F = 5.7, df = 2, 27, p = 0.009  R = 0.543, F = 5.7, df = 2, 27, p = 0.009
b0 = 52.4                                  b0 = 56.27
b1 = 3.9, p = 0.100                        b1 = 0.03, p = 0.980
b2 = 7.7, p = 0.002                        b2 = 3.8, p = 0.007

b0 = mean of group 1                       b0 = grand mean (G)
b1 = mean(g2) − mean(g1)                   b1 = mean(g2) − mean(G)
b2 = mean(g3) − mean(g1)                   b2 = mean(g3) − mean(G)
349
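A minimal R sketch of the two coding schemes, using made-up data with roughly the group
means above (the names group and y, and the simulated values, are assumptions, not the
course data); R's contr.treatment gives dummy coding and contr.sum gives effect coding:

  # Hypothetical data: 3 groups of 10, with means near 52.4, 56.3 and 60.1
  set.seed(1)
  group <- factor(rep(1:3, each = 10))
  y     <- rnorm(30, mean = c(52.4, 56.3, 60.1)[group], sd = 5)

  # Dummy (treatment) coding: intercept = mean of the reference group (group 1)
  m_dummy  <- lm(y ~ group)                       # contr.treatment is R's default

  # Effect coding: intercept = unweighted grand mean
  # (contr.sum codes the *last* level as -1, so reorder the levels if you want
  #  group 1 to be the reference, as in the example above)
  m_effect <- lm(y ~ group, contrasts = list(group = contr.sum))

  summary(m_dummy)$coefficients
  summary(m_effect)$coefficients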
In SPSS
• SPSS provides two equivalent procedures for
regression
– Regression (which we have been using)
– GLM (which we haven’t)
• GLM will:
– Automatically code categorical variables
– Automatically calculate interaction terms
• GLM won’t:
– Give standardised effects
– Give hierarchical R2 p-values
– Allow you to not understand
350
ANCOVA and Regression
351
• Test
– (Which is a trick; but it’s designed to make
you think about it)
• Use employee data.sav
– Compare the pay rise (difference between
salbegin and salary)
– For ethnic minority and non-minority staff
• What do you find?
352
ANCOVA and Regression
• Dummy coding approach has one special use
– In ANCOVA, for the analysis of change
• Pre-test post-test experimental design
– Control group and (one or more) experimental
groups
– Tempting to use difference score + t-test / mixed
design ANOVA
– Inappropriate
353
• Salivary cortisol levels
– Used as a measure of stress
– Not absolute level, but change in level over
day may be interesting
• Test at: 9.00am, 9.00pm
• Two groups
– High stress group (cancer biopsy)
• Group 1
– Low stress group (no biopsy)
• Group 0
354
        High Stress    Low Stress
AM         20.1           22.3
PM          6.8           11.8
Diff       13.3           10.5
• Correlation of AM and PM = 0.493
(p=0.008)
• Has there been a significant difference
in the rate of change of salivary
cortisol?
– 3 different approaches
355
• Approach 1 – find the differences, do a
t-test
– t = 1.31, df=26, p=0.203
• Approach 2 – mixed ANOVA, look for
interaction effect
– F = 1.71, df = 1, 26, p = 0.203
– F = t2
• Approach 3 – regression (ANCOVA)
based approach
356
– IVs: AM and group
– DV: PM
– b1 (group) = 3.59, standardised b1=0.432,
p = 0.01
• Why is the regression approach better?
– The other two approaches took the
difference
– Assumes that r = 1.00
– Any difference from r = 1.00 and you add
error variance
• Subtracting error is the same as adding error
357
• Using regression
– Ensures that all the variance that is
subtracted is true
– Reduces the error variance
• Two effects
– Adjusts the means
• Compensates for differences between groups
– Removes error variance
358
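A minimal sketch of the regression (ANCOVA) approach to the cortisol example in R; the
data frame and variable names (cortisol, am, pm, group) are hypothetical:

  # group: 0 = low stress, 1 = high stress; am, pm = cortisol at 9.00am and 9.00pm
  m <- lm(pm ~ am + group, data = cortisol)
  summary(m)    # the group coefficient is the adjusted difference in change

  # For comparison, the (inappropriate) difference-score approach
  cortisol$diff <- cortisol$am - cortisol$pm
  t.test(diff ~ group, data = cortisol, var.equal = TRUE)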
In SPSS
• SPSS automates all of this
– But you have to understand it, to know
what it is doing
• Use Analyse, GLM, Univariate ANOVA
359
(Screenshot: GLM Univariate dialog – the outcome goes in the Dependent Variable box,
categorical predictors in Fixed Factors, continuous predictors in Covariates; then
click Options.)
360
(Screenshot: Options dialog – select Parameter estimates.)
361
More on Change
• If difference score is correlated with
either pre-test or post-test
– Subtraction fails to remove the difference
between the scores
– If two scores are uncorrelated
• Difference will be correlated with both
• Failure to control
– Equal SDs, r = 0
• Correlation of change and pre-score =0.707
362
Even More on Change
• A topic of surprising complexity
– What I said about difference scores isn’t
always true
• Lord’s paradox – it depends on the precise
question you want to answer
– Collins and Horn (1993). Best methods for
the analysis of change
– Collins and Sayer (2001). New methods for
the analysis of change.
363
Lesson 8: Assumptions in
Regression Analysis
364
The Assumptions
1. The distribution of residuals is normal (at
each value of the dependent variable).
2. The variance of the residuals for every set
of values for the independent variable is
equal.
• violation is called heteroscedasticity.
3. The error term is additive
   • no interactions.
4. At every value of the dependent variable
   the expected (mean) value of the residuals is zero
   • no non-linear relationships
365
5. The expected correlation between residuals,
   for any two cases, is 0.
   • The independence assumption (lack of autocorrelation)
6. All independent variables are uncorrelated
with the error term.
7. No independent variables are a perfect
linear function of other independent
variables (no perfect multicollinearity)
8. The mean of the error term is zero.
366
What are we going to do …
• Deal with some of these assumptions in
some detail
• Deal with others in passing only
– look at them again later on
367
Assumption 1: The
Distribution of Residuals is
Normal at Every Value of the
Dependent Variable
368
Look at Normal Distributions
• A normal distribution
– symmetrical, bell-shaped (so they say)
369
What can go wrong?
• Skew
– non-symmetricality
– one tail longer than the other
• Kurtosis
– too flat or too peaked
– kurtosed
• Outliers
– Individual cases which are far from the distribution
370
Effects on the Mean
• Skew
– biases the mean, in direction of skew
• Kurtosis
– mean not biased
– standard deviation is
– and hence standard errors, and
significance tests
371
Examining Univariate
Distributions
• Histograms
• Boxplots
• P-P plots
• Calculation based methods
372
Histograms
• A and B
  (Histograms of distributions A and B)
373
• C and D
  (Histograms of distributions C and D)
374
•E&F
20
10
0
375
Histograms can be tricky ….
(Several example histograms of the same data, drawn in different ways)
376
Boxplots
377
P-P Plots
• A & B
  (P-P plots for distributions A and B)
378
• C & D
  (P-P plots for distributions C and D)
379
• E & F
  (P-P plots for distributions E and F)
380
Calculation Based
• Skew and Kurtosis statistics
• Outlier detection statistics
381
Skew and Kurtosis Statistics
• Normal distribution
– skew = 0
– kurtosis = 0
• Two methods for calculation
– Fisher’s and Pearson’s
– Very similar answers
• Associated standard error
– can be used for significance of departure from
normality
– not actually very useful
• Never normal above N = 400
382
     Skewness    SE Skew    Kurtosis    SE Kurt
A     -0.12       0.172      -0.084      0.342
B      0.271      0.172       0.265      0.342
C      0.454      0.172       1.885      0.342
D      0.117      0.172      -1.081      0.342
E      2.106      0.172       5.75       0.342
F      0.171      0.172      -0.21       0.342
383
Outlier Detection
• Calculate distance from mean
– z-score (number of standard deviations)
– deleted z-score
• that case biased the mean, so remove it
– Look up expected distance from mean
• 1% 3+ SDs
• Calculate influence
– how much effect did that case have on the mean?
384
Non-Normality in Regression
385
Effects on OLS Estimates
• The mean is an OLS estimate
• The regression line is an OLS estimate
• Lack of normality
– biases the position of the regression slope
– makes the standard errors wrong
• probability values attached to statistical
significance wrong
386
Checks on Normality
• Check residuals are normally distributed
– SPSS will draw histogram and p-p plot of
residuals
• Use regression diagnostics
– Lots of them
– Most aren’t very interesting
387
Regression Diagnostics
• Residuals
– standardised, unstandardised, studentised,
deleted, studentised-deleted
– look for cases > |3| (?)
• Influence statistics
– Look for the effect a case has
– If we remove that case, do we get a different
answer?
– DFBeta, Standardised DFBeta
• changes in b
388
– DfFit, Standardised DfFit
• change in predicted value
– Covariance ratio
• Ratio of the determinants of the covariance
matrices, with and without the case
• Distances
– measures of ‘distance’ from the centroid
– some include IV, some don’t
389
More on Residuals
• Residuals are trickier than you might
have imagined
• Raw residuals
– OK
• Standardised residuals
– Residuals divided by SD
s_e = sqrt( Σe² / (n − k − 1) )
390
Leverage
• But
– That SD is wrong
– Variance of the residuals is not equal
• Those further from the centroid on the
predictors have higher variance
• Need a measure of this
• Distance from the centroid is leverage,
or h (or sometimes hii)
• One predictor
– Easy
391

h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²
• Minimum hi is 1/n, the maximum is 1
• Except
– SPSS uses standardised leverage - h*
• It doesn’t tell you this, it just uses it
392
h*_i = h_i − 1/n = (x_i − x̄)² / Σ(x_j − x̄)²

• Minimum 0, maximum (N − 1)/N
393
• Multiple predictors
– Calculate the hat matrix (H)
– Leverage values are the diagonals of this
matrix
1
H  X(X' X) X'
– Where X is the augmented matrix of
predictors (i.e. matrix that includes the
constant)
– Hence leverage hii – element ii of H
394
• Example of calculation of the hat matrix
  – One predictor, with values 15, 20, …, 65; the augmented matrix X has rows
    (1, 15), (1, 20), …, (1, 65)
  – H = X(X′X)⁻¹X′
  – The top-left corner of H is 0.318, 0.273 / 0.273, 0.236, …; the leverages are the
    diagonal elements h_ii (here 0.318, 0.236, …, 0.318)
395
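A sketch in R of the same quantities (the model m and the data frame dat are
hypothetical); it builds the hat matrix by hand and then uses R's built-in functions:

  m <- lm(y ~ x, data = dat)                     # hypothetical model and data

  X <- model.matrix(m)                           # augmented matrix (includes the constant)
  H <- X %*% solve(t(X) %*% X) %*% t(X)          # H = X(X'X)^-1 X'
  all.equal(unname(diag(H)), unname(hatvalues(m)))   # leverages = diagonal of H

  rstandard(m)   # internally studentised residuals (SPSS's 'studentised')
  rstudent(m)    # externally studentised / deleted studentised residuals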
Standardised / Studentised
• Now we can calculate the standardised
residuals
– SPSS calls them studentised residuals
– Also called internally studentised residuals
e′_i = e_i / ( s_e · sqrt(1 − h_i) )
396
Deleted Studentised Residuals
• Studentised residuals do not have a
known distribution
– Cannot use them for inference
• Deleted studentised residuals
– Externally studentised residuals
– Jackknifed residuals
• Distributed as t
• With df = N – k – 1
397
Testing Significance
• We can calculate the probability of a
residual
– Is it sampled from the same population
• BUT
– Massive type I error rate
– Bonferroni correct it
• Multiply p value by N
398
Bivariate Normality
• We didn’t just say “residuals normally
distributed”
• We said “at every value of the
dependent variables”
• Two variables can be normally
distributed – univariate,
– but not bivariate
399
• Couple’s IQs
– male and female
(Histograms of FEMALE and MALE IQ scores, 60–140)
– Seem reasonably normal
400
• But wait!!
  (Scatterplot of MALE against FEMALE IQ – one couple sits well away from the rest)
401
• When we look at bivariate normality
– not normal – there is an outlier
• So plot X against Y
• OK for bivariate
– but – may be a multivariate outlier
– Need to draw graph in 3+ dimensions
– can’t draw a graph in 3 dimensions
• But we can look at the residuals instead
…
402
• IQ histogram of residuals
  (Histogram of the residuals from the regression)
403
Multivariate Outliers …
• Will be explored later in the exercises
• So we move on …
404
What to do about Non-Normality
• Skew and Kurtosis
– Skew – much easier to deal with
– Kurtosis – less serious anyway
• Transform data
– removes skew
– positive skew – log transform
– negative skew - square
405
Transformation
• May need to transform IV and/or DV
– More often DV
• time, income, symptoms (e.g. depression) all positively
skewed
– can cause non-linear effects (more later) if only
one is transformed
– alters interpretation of unstandardised parameter
– May alter meaning of variable
– May add / remove non-linear and moderator
effects
406
• Change measures
– increase sensitivity at ranges
• avoiding floor and ceiling effects
• Outliers
– Can be tricky
– Why did the outlier occur?
• Error? Delete them.
• Weird person? Probably delete them
• Normal person? Tricky.
407
– You are trying to model a process
• is the data point ‘outside’ the process
• e.g. lottery winners, when looking at salary
• yawn, when looking at reaction time
– Which is better?
• A good model, which explains 99% of your
data?
• A poor model, which explains all of it
• Pedhazur and Schmelkin (1991)
– analyse the data twice
408
• We will spend much less time on the
other 6 assumptions
• Can do exercise 8.1.
409
Assumption 2: The variance of
the residuals for every set of
values for the independent
variable is equal.
410
Heteroscedasticity
• This assumption is about
heteroscedasticity of the residuals
– Hetero=different
– Scedastic = scattered
• We don’t want heteroscedasticity
– we want our data to be homoscedastic
• Draw a scatterplot to investigate
411
(Scatterplot of MALE against FEMALE IQ – used to look at the spread of the residuals)
412
• Only works with one IV
  – need every combination of IVs
• Easy to get – use predicted values
  – use residuals there
• Plot predicted values against residuals
  – or standardised residuals
  – or deleted residuals
  – or standardised deleted residuals
  – or studentised residuals
• A bit like turning the scatterplot on its side
413
Good – no heteroscedasticity
Predicted Value
414
Bad – heteroscedasticity
Predicted Value
415
Testing Heteroscedasticity
• White’s test
  – Not automatic in SPSS (it is in SAS)
  – Luckily, not hard to do (see the R sketch below)
  1. Do the regression, save the residuals
  2. Square the residuals
  3. Square the IVs
  4. Calculate the interactions of the IVs
     – e.g. x1·x2, x1·x3, x2·x3
416
  5. Run a regression using
     – the squared residuals as the DV
     – the IVs, squared IVs, and interactions as IVs
  6. Test statistic = N × R²
     – Distributed as χ²
     – df = k (for the second regression)
• Use education and salbegin to predict salary (employee data.sav)
  – R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001
417
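A sketch of White's test in R, assuming a data frame called bank with variables salary,
educ and salbegin (these names are assumptions based on the employee data example):

  m <- lm(salary ~ educ + salbegin, data = bank)

  bank$res2 <- resid(m)^2          # squared residuals (assumes no missing data)
  aux <- lm(res2 ~ educ + salbegin + I(educ^2) + I(salbegin^2) + educ:salbegin,
            data = bank)           # the auxiliary regression

  stat <- nobs(aux) * summary(aux)$r.squared   # test statistic = N x R^2
  df   <- length(coef(aux)) - 1                # k for the auxiliary regression
  pchisq(stat, df, lower.tail = FALSE)         # p-value from the chi-squared distribution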
Plot of Pred and Res
(Scatterplot of the residuals against the regression standardised predicted values)
418
Magnitude of
Heteroscedasticity
• Chop data into “slices”
– 5 slices, based on X (or predicted score)
• Done in SPSS
– Calculate variance of each slice
– Check ratio of smallest to largest
– Less than 10:1
• OK
419
The Visual Bander
• New in SPSS 12
420
• Variances of the 5 groups
  Group 1: 0.219
  Group 2: 0.336
  Group 3: 0.757
  Group 4: 0.751
  Group 5: 3.119
• We have a problem
  – 3 / 0.2 ≈ 15
421
Dealing with
Heteroscedasticity
• Use Huber-White estimates
  – Very easy in Stata
  – Fiddly in SPSS – bit of a hack
• Use Complex Samples
  1. Create a new variable where all cases are equal to 1, call it const
  2. Use Complex Samples, Prepare for Analysis
  3. Create a plan file
422
  4. Sample weight is const
  5. Finish
  6. Use Complex Samples, GLM
  7. Use the plan file created, and set up the model as in GLM
(More on complex samples later)
In Stata, do regression as normal, and
click “robust”.
423
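In R, Huber-White (robust) standard errors can be obtained with the sandwich and lmtest
packages (assuming they are installed); the variable names reuse the hypothetical bank
example from above:

  library(sandwich)
  library(lmtest)

  m <- lm(salary ~ educ + salbegin, data = bank)       # hypothetical variable names
  coeftest(m, vcov. = vcovHC(m, type = "HC3"))         # same estimates, robust SEs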
Heteroscedasticity –
Implications and Meanings
Implications
• What happens as a result of
heteroscedasticity?
– Parameter estimates are correct
• not biased
– Standard errors (hence p-values) are
incorrect
424
However …
• If there is no skew in predicted scores
– P-values a tiny bit wrong
• If skewed,
– P-values very wrong
• Can do exercise
425
Meaning
• What is heteroscedasticity trying to tell
us?
– Our model is wrong – it is misspecified
– Something important is happening that we
have not accounted for
• e.g. amount of money given to charity
(given)
– depends on:
• earnings
• degree of importance person assigns to the
charity (import)
426
• Do the regression analysis
– R2 = 0.60, F=31.4, df=2, 37, p < 0.001
• seems quite good
– b0 = 0.24, p=0.97
– b1 = 0.71, p < 0.001
– b2 = 0.23, p = 0.031
• White’s test
  – χ² = 18.6, df = 5, p = 0.002
• The plot of predicted values against
residuals …
427
• Plot shows heteroscedastic relationship
428
• Which means …
– the effects of the variables are not additive
– If you think that what a charity does is
important
• you might give more money
• how much more depends on how much money
you have
429
(Scatterplot of GIVEN against IMPORT, with separate markers for High and Low Earnings –
the relationship between importance and giving is much steeper for high earners)
430
• One more thing about
heteroscedasticity
– it is the equivalent of homogeneity of
variance in ANOVA/t-tests
431
Assumption 3: The Error Term
is Additive
432
Additivity
• What heteroscedasticity shows you
– effects of variables need to be additive
• Heteroscedasticity doesn’t always show it to
you
– can test for it, but hard work
– (same as homogeneity of covariance assumption
in ANCOVA)
• Have to know it from your theory
• A specification error
433
Additivity and Theory
• Two IVs
– Alcohol has sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– Some painkillers have sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– A bit of alcohol and a bit of painkiller
doesn’t make you very tired
– Effects multiply together, don’t add
together
434
• If you don’t test for it
– It’s very hard to know that it will happen
• So many possible non-additive effects
– Cannot test for all of them
– Can test for obvious
• In medicine
– Choose to test for salient non-additive
effects
– e.g. sex, race
435
Assumption 4: At every value of
the dependent variable the
expected (mean) value of the
residuals is zero
436
Linearity
• Relationships between variables should be
linear
– best represented by a straight line
• Not a very common problem in social
sciences
– except economics
– measures are not sufficiently accurate to make a
difference
• R2 too low
• unlike, say, physics
437
• Relationship between speed of travel and fuel used
  (Curved plot of Fuel against Speed)
438
• R2 = 0.938
– looks pretty good
– know speed, make a good prediction of
fuel
• BUT
– look at the chart
– if we know speed we can make a perfect
prediction of fuel used
– R2 should be 1.00
439
Detecting Non-Linearity
• Residual plot
– just like heteroscedasticity
• Using this example
– very, very obvious
– usually pretty obvious
440
Residual plot
441
Linearity: A Case of Additivity
• Linearity = additivity along the range of the
IV
• Jeremy rides his bicycle harder
– Increase in speed depends on current speed
– Not additive, multiplicative
– MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in
multiple regression. Psychological Bulletin.
442
Assumption 5: The expected
correlation between residuals, for
any two cases, is 0.
The independence assumption (lack of
autocorrelation)
443
Independence Assumption
• Also: lack of autocorrelation
• Tricky one
– often ignored
– exists for almost all tests
• All cases should be independent of one
another
– knowing the value of one case should not tell you
anything about the value of other cases
444
How is it Detected?
• Can be difficult
– need some clever statistics (multilevel
models)
• Better off avoiding situations where it
arises
• Residual Plots
• Durbin-Watson Test
445
Residual Plots
• Were data collected in time order?
– If so plot ID number against the residuals
– Look for any pattern
• Test for linear relationship
• Non-linear relationship
• Heteroscedasticity
446
(Plot of the residuals against Participant Number)
447
How does it arise?
Two main ways
• time-series analyses
– When cases are time periods
• weather on Tuesday and weather on Wednesday
correlated
• inflation 1972, inflation 1973 are correlated
• clusters of cases
– patients treated by three doctors
– children from different classes
– people assessed in groups
448
Why does it matter?
• Standard errors can be wrong
– therefore significance tests can be wrong
• Parameter estimates can be wrong
– really, really wrong
– from positive to negative
• An example
– students do an exam (on statistics)
– choose one of three questions
• IV: time
• DV: grade
449
• Result, with line of best fit
  (Scatterplot of Grade against Time – the line of best fit slopes upwards)
450
• Result shows that
– people who spent longer in the exam,
achieve better grades
• BUT …
– we haven’t considered which question
people answered
– we might have violated the independence
assumption
• DV will be autocorrelated
• Look again
– with questions marked
451
• Now somewhat different
  (The same scatterplot of Grade against Time, with the points labelled by Question 1,
  2 or 3 – within each question the relationship runs the other way)
452
• Now, people that spent longer got lower
grades
– questions differed in difficulty
– do a hard one, get better grade
– if you can do it, you can do it quickly
• Very difficult to analyse well
– need multilevel models
453
Durbin Watson Test
• Not well implemented in SPSS
• Depends on the order of the data
– Reorder the data, get a different result
• Doesn’t give statistical significance of
the test
454
Assumption 6: All independent
variables are uncorrelated
with the error term.
455
Uncorrelated with the Error
Term
• A curious assumption
– by definition, the residuals are uncorrelated
with the independent variables (try it and
see, if you like)
• It is about the DV
– must have no effect (when the IVs have
been removed)
– on the DV
456
• Problem in economics
– Demand increases supply
– Supply increases wages
– Higher wages increase demand
• OLS estimates will be (badly) biased in
this case
– need a different estimation procedure
– two-stage least squares
• simultaneous equation modelling
457
Assumption 7: No independent
variables are a perfect linear
function of other independent
variables
no perfect multicollinearity
458
No Perfect Multicollinearity
• IVs must not be linear functions of one
another
– matrix of correlations of IVs is not positive definite
– cannot be inverted
– analysis cannot proceed
• Have seen this with
– age, age start, time working
– also occurs with subscale and total
459
• Large amounts of collinearity
– a problem (as we shall see) sometimes
– not an assumption
460
Assumption 8: The mean of the
error term is zero.
You will like this one.
461
Mean of the Error Term = 0
• Mean of the residuals = 0
• That is what the constant is for
– if the mean of the error term deviates from
zero, the constant soaks it up
Y = β0 + β1·x1 + ε
Y = (β0 + 3) + β1·x1 + (ε − 3)
- note, Greek letters because we are
talking about population values
462
• Can do regression without the constant
– Usually a bad idea
– E.g R2 = 0.995, p < 0.001
• Looks good
463
(Scatterplot of y against x1 with the no-constant regression line – the line misses the
data badly, despite the high R²)
464
465
Lesson 9: Issues in
Regression Analysis
Things that alter the
interpretation of the regression
equation
466
The Four Issues
• Causality
• Sample sizes
• Collinearity
• Measurement error
467
Causality
468
What is a Cause?
• Debate about definition of cause
– some statistics (and philosophy) books try
to avoid it completely
– We are not going into depth
• just going to show why it is hard
• Two dimensions of cause
– Ultimate versus proximal cause
– Determinate versus probabilistic
469
Proximal versus Ultimate
• Why am I here?
– I walked here because
– This is the location of the class because
– Eric Tanenbaum asked me because
– (I don’t know)
– because I was in my office when he rang
because
– I am a lecturer at York because
– I saw an advert in the paper because
470
– I exist because
– My parents met because
– My father had a job …
• Proximal cause
– the direct and immediate cause of
something
• Ultimate cause
– the thing that started the process off
– I fell off my bicycle because of the bump
– I fell off because I was going too fast
471
Determinate versus Probabilistic
Cause
• Why did I fall off my bicycle?
– I was going too fast
– But every time I ride too fast, I don’t fall
off
– Probabilistic cause
• Why did my tyre go flat?
– A nail was stuck in my tyre
– Every time a nail sticks in my tyre, the tyre
goes flat
– Deterministic cause
472
• Can get into trouble by mixing them
together
– Eating deep fried Mars Bars and doing no
exercise are causes of heart disease
– “My Grandad ate three deep fried Mars
Bars every day, and the most exercise he
ever got was when he walked to the shop
next door to buy one”
– (Deliberately?) confusing deterministic and
probabilistic causes
473
Criteria for Causation
• Association
• Direction of Influence
• Isolation
474
Association
• Correlation does not mean causation
– we all know
• But
– Causation does mean correlation
• Need to show that two things are related
– may be correlation
– my be regression when controlling for third (or
more) factor
475
• Relationship between price and sales
– suppliers may be cunning
– when people want it more
• stick the price up
           Price    Demand    Sales
Price        1       0.6        0
Demand      0.6       1        0.6
Sales        0       0.6        1
– So – no relationship between price
and sales
476
– Until (or course) we control for demand
– b1 (Price) = -0.56
– b2 (Demand) = 0.94
• But which variables do we enter?
477
Direction of Influence
• Relationship between A and B
– three possible processes
  A → B         A causes B
  A ← B         B causes A
  A ← C → B     C causes A & B
478
• How do we establish the direction of
influence?
– Longitudinally?
  Barometer drops → Storm
– Now if we could just get that barometer
needle to stay where it is …
• Where the role of theory comes in
(more on this later)
479
Isolation
• Isolate the dependent variable from all
other influences
– as experimenters try to do
• Cannot do this
– can statistically isolate the effect
– using multiple regression
480
Role of Theory
• Strong theory is crucial to making
causal statements
• Fisher said: to make causal statements
“make your theories elaborate.”
– don’t rely purely on statistical analysis
• Need strong theory to guide analyses
– what critics of non-experimental research
don’t understand
481
• S.J. Gould – a critic
– says correlate price of petrol and his age,
for the last 10 years
– find a correlation
– Ha! (He says) that doesn’t mean there is a
causal link
– Of course not! (We say).
• No social scientist would do that analysis
without first thinking (very hard) about the
possible causal relations between the variables
of interest
• Would control for time, prices, etc …
482
• Atkinson, et al. (1996)
– relationship between college grades and
number of hours worked
– negative correlation
– Need to control for other variables – ability,
intelligence
• Gould says “Most correlations are non-causal” (1982, p. 243)
– Of course!!!!
483
(Diagram: “I drink a lot of beer” causes 16 things – laugh, toilet, jokes (about
statistics), vomit, karaoke, curtains closed, sleeping, headache, equations (beermat),
thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys – giving 16
causal relations but 120 non-causal correlations among them)
484
• Abelson (1995) elaborates on this
– ‘method of signatures’
• A collection of correlations relating to
the process
– the ‘signature’ of the process
• e.g. tobacco smoking and lung cancer
– can we account for all of these findings
with any other theory?
485
1. The longer a person has smoked cigarettes, the greater the risk of cancer.
2. The more cigarettes a person smokes over a given time period, the greater the risk
   of cancer.
3. People who stop smoking have lower cancer rates than do those who keep smoking.
4. Smokers’ cancers tend to occur in the lungs, and be of a particular type.
5. Smokers have elevated rates of other diseases.
6. People who smoke cigars or pipes, and do not usually inhale, have abnormally high
   rates of lip cancer.
7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette
   smokers.
8. Non-smokers who live with smokers have elevated cancer rates.
(Abelson, 1995: 183–184)
486
– In addition, should be no anomalous
correlations
• If smokers had more fallen arches than non-smokers, that would not be consistent with the theory
• Failure to use theory to select
appropriate variables
– specification error
– e.g. in previous example
– Predict wealth from price and sales
• increase price, price increases
• Increase sales, price increases
487
• Sometimes these are indicators of the
process
– e.g. barometer – stopping the needle won’t
help
– e.g. inflation? Indicator or cause?
488
No Causation without
Experimentation
• Blatantly untrue
– I don’t doubt that the sun shining makes
us warm
• Why the aversion?
– Pearl (2000) says problem is no
mathematical operator
– No one realised that you needed one
– Until you build a robot
489
AI and Causality
• A robot needs to make judgements
about causality
• Needs to have a mathematical
representation of causality
– Suddenly, a problem!
– Doesn’t exist
• Most operators are non-directional
• Causality is directional
490
Sample Sizes
“How many subjects does it take
to run a regression analysis?”
491
Introduction
• Social scientists don’t worry enough about
the sample size required
– “Why didn’t you get a significant result?”
– “I didn’t have a large enough sample”
• Not a common answer
• More recently awareness of sample size is
increasing
– use too few – no point doing the research
– use too many – waste their time
492
• Research funding bodies
• Ethical review panels
– both become more interested in sample
size calculations
• We will look at two approaches
– Rules of thumb (quite quickly)
– Power Analysis (more slowly)
493
Rules of Thumb
• Lots of simple rules of thumb exist
– 10 cases per IV
– >100 cases
– Green (1991) more sophisticated
• To test significance of R2 – N = 50 + 8k
• To test sig of slopes, N = 104 + k
• Rules of thumb don’t take into account
all the information that we have
– Power analysis does
494
Power Analysis
Introducing Power Analysis
• Hypothesis test
– tells us the probability of a result of that
magnitude occurring, if the null hypothesis
is correct (i.e. there is no effect in the
population)
• Doesn’t tell us
– the probability of that result, if the null
hypothesis is false
495
• According to Cohen (1982) all null
hypotheses are false
– everything that might have an effect, does
have an effect
• it is just that the effect is often very tiny
496
Type I Errors
• Type I error is false rejection of H0
• Probability of making a type I error
– a – the significance value cut-off
• usually 0.05 (by convention)
• Always this value
• Not affected by
– sample size
– type of test
497
Type II errors
• Type II error is false acceptance of the
null hypothesis
– Much, much trickier
• We think we have some idea
– we almost certainly don’t
• Example
– I do an experiment (random sampling, all
assumptions perfectly satisfied)
– I find p = 0.05
498
– You repeat the experiment exactly
• different random sample from same population
– What is probability you will find p < 0.05?
– ………………
– Another experiment, I find p = 0.01
– Probability you find p < 0.05?
– ………………
• Very hard to work out
– not intuitive
– need to understand non-central sampling
distributions (more in a minute)
499
• Probability of type II error = beta (β)
– same as population regression parameter
(to be confusing)
• Power = 1 – Beta
– Probability of getting a significant result
500
                                     State of the World
Research Findings                H0 true                   H0 false
                                 (no effect to be found)   (effect to be found)
We find no effect (p > 0.05)     correct                   Type II error, p = β
We find an effect (p < 0.05)     Type I error, p = α       correct, power = 1 − β
501
• Four parameters in power analysis
– α – prob. of Type I error
– β – prob. of Type II error (power = 1 − β)
– Effect size – size of effect in population
–N
• Know any three, can calculate the
fourth
– Look at them one at a time
502
• α – Probability of Type I error
– Usually set to 0.05
– Somewhat arbitrary
• sometimes adjusted because of circumstances
– rarely because of power analysis
– May want to adjust it, based on power
analysis
503
• β – Probability of type II error
  – Power (probability of finding a result) = 1 − β
– Standard is 80%
• Some argue for 90%
– Implication that Type I error is 4 times
more serious than type II error
• adjust ratio with compromise power analysis
504
•
Effect size in the population
– Most problematic to determine
– Three ways
1. What effect size would be useful to find?
•
R2 = 0.01 - no use (probably)
2. Base it on previous research
– what have other people found?
3. Use Cohen’s conventions
– small R2 = 0.02
– medium R2 = 0.13
– large R2 = 0.26
505
– Effect size usually measured as f²
– For R²:
  f² = R² / (1 − R²)
506
– For (standardised) slopes:
  f² = sr_i² / (1 − R²)
– Where sr² is the contribution to the variance accounted for by the variable of
  interest
– i.e. sr² = R² (with the variable) − R² (without)
  • the change in R² in hierarchical regression
507
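A sketch of a sample-size calculation in R using the pwr package (assumed to be
installed); the numbers are just Cohen's ‘medium’ convention with three predictors:

  library(pwr)

  # alpha = .05, power = .80, medium effect (f2 = 0.13), k = 3 predictors
  res <- pwr.f2.test(u = 3, f2 = 0.13, sig.level = 0.05, power = 0.80)
  res
  ceiling(res$v + 3 + 1)    # required N = v (denominator df) + k + 1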
• N – the sample size
– usually use other three parameters to
determine this
– sometimes adjust other parameters (α)
based on this
– e.g. You can have 50 participants. No
more.
508
Doing power analysis
• With power analysis program
– SamplePower, GPower, Nquery
• With SPSS MANOVA
– using non-central distribution functions
– Uses MANOVA syntax
• Relies on the fact you can do anything with
MANOVA
• Paper B4
509
Underpowered Studies
• Research in the social sciences is often
underpowered
– Why?
– See Paper B11 – “the persistence of
underpowered studies”
510
Extra Reading
• Power traditionally focuses on p values
– What about CIs?
– Paper B8 – “Obtaining regression
coefficients that are accurate, not simply
significant”
511
Collinearity
512
Collinearity as Issue and
Assumption
• Collinearity (multicollinearity)
– the extent to which the independent
variables are (multiply) correlated
• If R2 for any IV, using other IVs = 1.00
– perfect collinearity
– variable is linear sum of other variables
– regression will not proceed
– (SPSS will arbitrarily throw out a variable)
513
• R2 < 1.00, but high
– other problems may arise
• Four things to look at in collinearity
– meaning
– implications
– detection
– actions
514
Meaning of Collinearity
• Literally ‘co-linearity’
– lying along the same line
• Perfect collinearity
– when some IVs predict another
– Total = S1 + S2 + S3 + S4
– S1 = Total – (S2 + S3 + S4)
– rare
515
• Less than perfect
– when some IVs are close to predicting
– correlations between IVs are high (usually,
but not always)
516
Implications
• Effects the stability of the parameter
estimates
– and so the standard errors of the
parameter estimates
– and so the significance
• Because
– shared variance, which the regression
procedure doesn’t know where to put
517
• Red cars have more accidents than
other coloured cars
– because of the effect of being in a red car?
– because of the kind of person that drives a
red car?
• we don’t know
– No way to distinguish between these three:
Accidents = 1 x colour + 0 x person
Accidents = 0 x colour + 1 x person
Accidents = 0.5 x colour + 0.5 x person
518
• Sex differences
– due to genetics?
– due to upbringing?
– (almost) perfect collinearity
• statistically impossible to tell
519
• When collinearity is less than perfect
– increases variability of estimates between
samples
– estimates are unstable
– reflected in the variances, and hence
standard errors
520
Detecting Collinearity
• Look at the parameter estimates
– large standardised parameter estimates
(>0.3?), which are not significant
• be suspicious
• Run a series of regressions
– each IV as DV
– all other IVs as IVs
• for each IV
521
• Sounds like hard work?
– SPSS does it for us!
• Ask for collinearity diagnostics
  – Tolerance – calculated for every IV
    Tolerance = 1 − R²
  – Variance Inflation Factor
    • its square root is the amount by which the s.e. has been increased
    VIF = 1 / Tolerance
522
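Tolerance and VIF can also be computed straight from the definition; a sketch in R with
hypothetical variables y and x1–x3 in a data frame dat:

  m <- lm(y ~ x1 + x2 + x3, data = dat)

  # Tolerance for x1: 1 - R^2 from regressing x1 on the other IVs
  tol_x1 <- 1 - summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
  vif_x1 <- 1 / tol_x1

  # If the car package is available, car::vif(m) gives the VIF for every IV at once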
Actions
What you can do about collinearity
“no quick fix” (Fox, 1991)
1. Get new data
   • avoids the problem
   • address the question in a different way
   • e.g. find people who have been raised as the ‘wrong’ gender
     – exist, but rare
     – Not a very useful suggestion
523
2. Collect more data
   • not different data, more data
   • collinearity increases the standard error (se)
   • se decreases as N increases
     – get a bigger N
3. Remove / Combine variables
   • If an IV correlates highly with other IVs
   • Not telling us much new
   • If you have two (or more) IVs which are very similar
     – e.g. 2 measures of depression, socio-economic status, achievement, etc
524
   • sum them, average them, remove one
   • Many measures
     – use principal components analysis to reduce them
4. Use stepwise regression (or some flavour of)
   • See previous comments
   • Can be useful in a theoretical vacuum
5. Ridge regression
   • not very useful
   • behaves weirdly
525
Measurement Error
526
What is Measurement Error
• In social science, it is unlikely that we
measure any variable perfectly
– measurement error represents this
imperfection
• We assume that we have a true score
– T
• A measure of that score
–x
527
x T e
• just like a regression equation
– standardise the parameters
– T is the reliability
• the amount of variance in x which comes from T
• but, like a regression equation
– assume that e is random and has mean of zero
– more on that later
528
Simple Effects of
Measurement Error
• Lowers the measured correlation
– between two variables
• Real correlation
– true scores (x* and y*)
• Measured correlation
– measured scores (x and y)
529
(Path diagram: the true scores x* and y* are connected by the true correlation r_x*y*;
x* → x with reliability r_xx and y* → y with reliability r_yy, each measure also
picking up error e; the measured correlation of x and y is r_xy)
530
• Attenuation of the correlation
  r_xy = r_x*y* × sqrt(r_xx × r_yy)
• Attenuation-corrected correlation
  r_x*y* = r_xy / sqrt(r_xx × r_yy)
531
• Example
  r_xx = 0.7, r_yy = 0.8, r_xy = 0.3
  r_x*y* = r_xy / sqrt(r_xx × r_yy) = 0.3 / sqrt(0.7 × 0.8) = 0.40
532
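The attenuation correction is a one-liner; a small R sketch reproducing the example
above (the function name is just illustrative):

  disattenuate <- function(rxy, rxx, ryy) rxy / sqrt(rxx * ryy)
  disattenuate(rxy = 0.3, rxx = 0.7, ryy = 0.8)   # about 0.40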
Complex Effects of
Measurement Error
• Really horribly complex
• Measurement error reduces correlations
– reduces estimate of 
– reducing one estimate
• increases others
– because of effects of control
– combined with effects of suppressor
variables
– exercise to examine this
533
Dealing with Measurement
Error
• Attenuation correction
– very dangerous
– not recommended
• Avoid in the first place
– use reliable measures
– don’t discard information
• don’t categorise
• Age: 10-20, 21-30, 31-40 …
534
Complications
• Assume measurement error is
– additive
– linear
• Additive
– e.g. weight – people may under-report / over-report at the extremes
• Linear
– particularly the case when using proxy variables
535
• e.g. proxy measures
– Want to know effort on childcare, count
number of children
• 1st child is more effort than last
– Want to know financial status, count
income
• 1st £10 much greater effect on financial status
than the 1000th.
536
Lesson 10: Non-Linear
Analysis in Regression
537
Introduction
• Non-linear effect occurs
– when the effect of one independent
variable
– is not consistent across the range of the IV
• Assumption is violated
– expected value of residuals = 0
– no longer the case
538
Some Examples
539
A Learning Curve
(Curve of Skill against Experience)
540
Yerkes-Dodson Law of Arousal
(Inverted-U curve of Performance against Arousal)
541
Enthusiasm Levels over a Lesson on Regression
(Curve of enthusiasm, from ‘Suicidal’ up to ‘Enthusiastic’, against Time from 0 to 3.5)
542
• Learning
– line changed direction once
• Yerkes-Dodson
– line changed direction once
• Enthusiasm
– line changed direction twice
543
Everything is Non-Linear
• Every relationship we look at is nonlinear, for two reasons
– Exam results cannot keep increasing with
reading more books
• Linear in the range we examine
– For small departures from linearity
• Cannot detect the difference
• Non-parsimonious solution
544
Non-Linear Transformations
545
Bending the Line
• Non-linear regression is hard
– We cheat, and linearise the data
• Do linear regression
Transformations
• We need to transform the data
– rather than estimating a curved line
• which would be very difficult
• may not work with OLS
– we can take a straight line, and bend it
– or take a curved line, and straighten it
• back to linear (OLS) regression
546
• We still do linear regression
– Linear in the parameters
– Y = b1x + b2x2 + …
• Can do non-linear regression
– Non-linear in the parameters
– Y = b1x + b2x2 + …
• Much trickier
– Statistical theory either breaks down OR
becomes harder
547
• Linear transformations
– multiply by a constant
– add a constant
– change the slope and the intercept
548
(Plot of y against x showing the lines y = x, y = x + 3 and y = 2x)
549
• Linear transformations are no use
– alter the slope and intercept
– don’t alter the standardised parameter
estimate
• Non-linear transformation
– will bend the slope
– quadratic transformation
y = x²
– one change of direction
550
– Cubic transformation
y = x² + x³
– two changes of direction
551
Quadratic Transformation
y = 0 + 0.1x + 1x²
552
Square Root Transformation
y = 20 + −3x + 5√x
553
Cubic Transformation
y = 3 − 4x + 2x² − 0.2x³
(Plot of the cubic curve, showing its two changes of direction)
554
Logarithmic Transformation
y = 1 + 0.1x + 10log(x)
555
Inverse Transformation
y = 20 -10x + 8(1/x)
556
• To estimate a non-linear regression
– we don’t actually estimate anything nonlinear
– we transform the x-variable to a non-linear
version
– can estimate that straight line
– represents the curve
– we don’t bend the line, we stretch the
space around the line, and make it flat
557
Detecting Non-linearity
558
Draw a Scatterplot
• Draw a scatterplot of y plotted against x
– see if it looks a bit non-linear
– e.g. Anscombe’s data
– e.g. Education and beginning salary
• from bank data
• drawn in SPSS
• with line of best fit
559
• Anscombe (1973)
– constructed a set of datasets
– show the importance of graphs in
regression/correlation
• For each dataset
  N                              11
  Mean of x                      9
  Mean of y                      7.5
  Equation of regression line    y = 3 + 0.5x
  Sum of squares (X − mean)      110
  Correlation coefficient        0.82
  R²                             0.67
560
(Slides 561–564: scatterplots of the four Anscombe datasets)
A Real Example
• Starting salary and years of education
– From employee data.sav
565
(Scatterplot of beginning salary against Educational Level (years) with the linear
regression line – in parts of the range the expected value of the error (residual)
is > 0, in others it is < 0)
566
Use Residual Plot
• Scatterplot is only good for one variable
– use the residual plot (that we used for
heteroscedasticity)
• Good for many variables
567
• We want
– points to lie in a nice straight sausage
568
• We don’t want
– a nasty bent sausage
569
• Educational level and starting salary
  (Residual plot: residuals against standardised predicted values – a clearly curved
  pattern)
570
Carrying Out Non-Linear
Regression
571
Linear Transformation
• Linear transformation doesn’t change
– interpretation of slope
– standardised slope
– se, t, or p of slope
– R2
• Can change
– effect of a transformation
572
• Actually more complex
– with some transformations can add a
constant with no effect (e.g. quadratic)
• With others does have an effect
– inverse, log
• Sometimes it is necessary to add a
constant
– negative numbers have no square root
– 0 has no log
573
Education and Salary
Linear Regression
• Saw previously that the assumption of
expected errors = 0 was violated
• Anyway …
– R2 = 0.401, F=315, df = 1, 472, p < 0.001
– salbegin = −6290 + 1727 × educ
– Standardised
• b1 (educ) = 0.633
– Both parameters make sense
574
Non-linear Effect
• Compute a new variable
  – quadratic
  – educ2 = educ²
• Add this variable to the equation
  – R² = 0.585, p < 0.001
  – salbegin = 46263 + −6542 × educ + 310 × educ²
  • slightly curious
  – Standardised
    • b1 (educ) = −2.4
    • b2 (educ²) = 3.1
– What is going on?
575
• Collinearity
– is what is going on
– Correlation of educ and educ2
• r = 0.990
– Regression equation becomes difficult
(impossible?) to interpret
• Need hierarchical regression
– what is the change in R2
– is that change significant?
– R2 (change) = 0.184, p < 0.001
576
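A sketch of the hierarchical test of the quadratic term in R, again assuming the
hypothetical bank data frame with salbegin and educ; poly() is an alternative that
sidesteps the collinearity between educ and educ²:

  m1 <- lm(salbegin ~ educ, data = bank)
  m2 <- lm(salbegin ~ educ + I(educ^2), data = bank)

  summary(m2)$r.squared - summary(m1)$r.squared   # change in R^2
  anova(m1, m2)                                   # F test of that change

  # Orthogonal polynomials: the same fit, but the linear and quadratic terms
  # are uncorrelated, so the individual coefficients are interpretable
  m2o <- lm(salbegin ~ poly(educ, 2), data = bank)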
Cubic Effect
• While we are at it, let’s look at the cubic
effect
  – R² (change) = 0.004, p = 0.045
  – salbegin = 19138 + 103 × e + −206 × e² + 12 × e³
  – Standardised:
    b1(e) = 0.04
    b2(e²) = −2.04
    b3(e³) = 2.71
577
Fourth Power
• Keep going while we are ahead
– won’t run
• ???
• Collinearity is the culprit
– Tolerance (educ4) = 0.000005
– VIF = 215555
• Matrix of correlations of IVs is not
positive definite
– cannot be inverted
578
Interpretation
• Tricky, given that parameter estimates
are a bit nonsensical
• Two methods
• 1: Use R2 change
– Save predicted values
• or calculate predicted values to plot line of best
fit
– Save them from equation
– Plot against IV
579
(Plot of predicted beginning salary against Education (Years), 8 to 22, for the linear,
quadratic and cubic models)
580
• Differentiate with respect to e
• We said:
  s = 19138 + 103 × e + −206 × e² + 12 × e³
  – but first we will simplify it to the quadratic
  s = 46263 + −6542 × e + 310 × e²
• dy/dx = −6542 + 310 × 2 × e
581
Education    Slope
     9        -962
    10        -342
    11         278
    12         898
    13        1518
    14        2138
    15        2758
    16        3378
    17        3998
    18        4618
    19        5238
    20        5858

1 year of education at the higher end of the scale is worth more than 1 year at the
lower end of the scale. MBA versus GCSE.
582
• Differentiate Cubic
  s = 19138 + 103 × e + −206 × e² + 12 × e³
  dy/dx = 103 − 206 × 2 × e + 12 × 3 × e²
• Can calculate slopes for quadratic and
cubic at different values
583
Education    Slope (Quad)    Slope (Cub)
     9           -962           -689
    10           -342           -417
    11            278            -73
    12            898            343
    13           1518            831
    14           2138           1391
    15           2758           2023
    16           3378           2727
    17           3998           3503
    18           4618           4351
    19           5238           5271
    20           5858           6263
584
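The slope columns above can be reproduced by putting the two derivatives into small
functions; a sketch in R:

  quad_slope  <- function(e) -6542 + 2 * 310 * e               # d/de of the quadratic
  cubic_slope <- function(e) 103 - 2 * 206 * e + 3 * 12 * e^2  # d/de of the cubic

  e <- 9:20
  cbind(Education = e, Quad = quad_slope(e), Cubic = cubic_slope(e))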
A Quick Note on
Differentiation
• For y = x^p
  – dy/dx = p·x^(p−1)
• For equations such as
  y = b1·x + b2·x^p
  dy/dx = b1 + b2·p·x^(p−1)
• y = 3x + 4x²
  – dy/dx = 3 + 4 · 2x
585
• y = b1·x + b2·x² + b3·x³
  – dy/dx = b1 + b2 · 2x + b3 · 3 · x²
• y = 4x + 5x² + 6x³
  – dy/dx = 4 + 5 · 2 · x + 6 · 3 · x²
• Many functions are simple to
differentiate
– Not all though
586
Automatic Differentiation
• If you
– Don’t know how to differentiate
– Can’t be bothered to look up the function
• Can use automatic differentiation
software
– e.g. GRAD (freeware)
587
588
Lesson 11: Logistic Regression
Dichotomous/Nominal Dependent
Variables
589
Introduction
• Often in social sciences, we have a
dichotomous/nominal DV
– we will look at dichotomous first, then a quick look
at multinomial
• Dichotomous DV
• e.g.
  – guilty/not guilty
  – pass/fail
  – won/lost
  – alive/dead (used in medicine)
590
Why Won’t OLS Do?
591
Example: Passing a Test
• Test for bus drivers
– pass/fail
– we might be interested in degrees of pass fail
• a company which trains them will not
• fail means ‘pay for them to take it again’
• Develop a selection procedure
– Two predictor variables
– Score – Score on an aptitude test
– Exp – Relevant prior experience (months)
592
• 1st ten cases

Score    Exp    Pass
  5       6      0
  1      15      0
  1      12      0
  4       6      0
  1      15      1
  1       6      0
  4      16      1
  1      10      1
  3      12      0
  4      26      1
593
• DV
– pass (1 = Yes, 0 = No)
• Just consider score first
– Carry out regression
– Score as IV, Pass as DV
– R2 = 0.097, F = 4.1, df = 1, 48, p = 0.028.
– b0 = 0.190
– b1 = 0.110, p=0.028
• Seems OK
594
• Or does it? …
• 1st Problem – P-P plot of the residuals
  (P-P plot: the observed cumulative probabilities deviate markedly from the diagonal)
595
• 2nd problem - residual plot
596
• Problems 1 and 2
– strange distributions of residuals
– parameter estimates may be wrong
– standard errors will certainly be wrong
597
• 3rd problem – interpretation
– I score 2 on aptitude.
– Pass = 0.190 + 0.110 × 2 = 0.41
– I score 8 on the test
– Pass = 0.190 + 0.110 × 8 = 1.07
• Seems OK, but
– What does it mean?
– Cannot score 0.41 or 1.07
• can only score 0 or 1
• Cannot be interpreted
– need a different approach
598
A Different Approach
Logistic Regression
599
Logit Transformation
• In lesson 10, transformed IVs
– now transform the DV
• Need a transformation which gives us
– graduated scores (between 0 and 1)
– No upper limit
• we can’t predict someone will pass twice
– No lower limit
• you can’t do worse than fail
600
Step 1: Convert to Probability
• First, stop talking about values
– talk about probability
– for each value of score, calculate
probability of pass
• Solves the problem of graduated scales
601
         Score:     1      2      3      4      5
Fail     N          7      5      6      4      2
         P         0.7    0.5    0.6    0.4    0.2
Pass     N          3      5      4      6      8
         P         0.3    0.5    0.4    0.6    0.8

(The probability of failure given a score of 1 is 0.7; the probability of passing given
a score of 5 is 0.8.)
602
This is better
• Now a score of 0.41 has a meaning
– a 0.41 probability of pass
• But a score of 1.07 has no meaning
– cannot have a probability > 1 (or < 0)
– Need another transformation
603
Step 2: Convert to Odds-Ratio
Need to remove upper limit
• Convert to odds
• Odds, as used by betting shops
– 5:1, 1:2
• Slightly different from odds in speech
– a 1 in 2 chance
– odds are 1:1 (evens)
– 50%
604
• Odds ratio = (number of times it
happened) / (number of times it didn’t
happen)
  odds = p(event) / p(not event) = p(event) / (1 − p(event))
605
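The whole chain of transformations (probability → odds → logit, and back again) is easy
to write down; a sketch in R:

  p_to_odds  <- function(p) p / (1 - p)
  p_to_logit <- function(p) log(p / (1 - p))       # natural log of the odds
  logit_to_p <- function(l) 1 / (1 + exp(-l))      # back to a probability

  p_to_odds(c(0.2, 0.8))          # 0.25 and 4
  p_to_logit(c(0.3, 0.5, 0.7))    # -0.85, 0.00, 0.85
  logit_to_p(0.847)               # about 0.70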
• 0.8 = 0.8/0.2 = 4
– equivalent to 4:1 (odds on)
– 4 times out of five
• 0.2 = 0.2/0.8 = 0.25
– equivalent to 1:4 (4:1 against)
– 1 time out of five
606
• Now we have solved the upper bound
problem
– we can interpret 1.07, 2.07, 1000000.07
• But we still have the zero problem
– we cannot interpret predicted scores less
than zero
607
Step 3: The Log
• Log10 of a number (x):
  10^(log10(x)) = x
• log(10) = 1
• log(100) = 2
• log(1000) = 3
608
• log(1) = 0
• log(0.1) = -1
• log(0.00001) = -5
609
Natural Logs and e
• Don’t use log10
– Use loge
• Natural log, ln
• Has some desirable properties, that log10
doesn’t
  – For us
  – If y = ln(x) + c
  – dy/dx = 1/x
  – Not true for any other logarithm
610
• Be careful – calculators and stats
packages are not consistent when they
use log
– Sometimes log10, sometimes loge
– Can prove embarrassing (a friend told me)
611
Take the natural log of the odds ratio
• Goes from −∞ to +∞
– can interpret any predicted value
612
Putting them all together
• Logit transformation
– log-odds ratio
– not bounded at zero or one
613
                Score:     1       2       3       4       5
Fail     N                 7       5       6       4       2
         P                0.7     0.5     0.6     0.4     0.2
Pass     N                 3       5       4       6       8
         P                0.3     0.5     0.4     0.6     0.8
Odds (Fail)              2.33    1.00    1.50    0.67    0.25
log(odds) fail           0.85    0.00    0.41   -0.41   -1.39
614
(Plot of probability against logit: an S-shaped curve. Probability gets closer to zero,
but never reaches it, as the logit goes down; and closer to one, but never reaches it,
as the logit goes up.)
615
• Hooray! Problem solved, lesson over
– errrmmm… almost
• Because we are now using log-odds
ratio, we can’t use OLS
– we need a new technique, called Maximum
Likelihood (ML) to estimate the parameters
616
Parameter Estimation using
ML
ML tries to find estimates of model
parameters that are most likely to give
rise to the pattern of observations in
the sample data
• All gets a bit complicated
– OLS is a special case of ML
– the mean is an ML estimator
617
• Don’t have closed form equations
– must be solved iteratively
– estimates parameters that are most likely
to give rise to the patterns observed in the
data
– by maximising the likelihood function (LF)
• We aren’t going to worry about this
– except to note that sometimes, the
estimates do not converge
• ML cannot find a solution
618
Interpreting Output
Using SPSS
• Overall fit for:
– step (only used for stepwise)
– block (for hierarchical)
– model (always)
– in our model, all are the same
  – χ² = 4.9, df = 1, p = 0.025
  • like the F test in OLS regression
619
Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step        4.990       1    .025
         Block       4.990       1    .025
         Model       4.990       1    .025
620
• Model summary
– -2LL (= χ²/N)
– Cox & Snell R2
– Nagelkerke R2
– Different versions of R2
• No real R2 in logistic regression
• should be considered ‘pseudo R2’
621
Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
  1          64.245                 .095                    .127
622
• Classification Table
– predictions of model
– based on cut-off of 0.5 (by default)
– predicted values x actual values
623
Classification Table (a)
                               Predicted PASS        Percentage
Observed                       0          1           Correct
Step 1   PASS       0         18          8             69.2
                    1         12         12             50.0
         Overall Percentage                             60.0
a. The cut value is .500
624
Model parameters
•B
– Change in the logged odds associated with
a change of 1 unit in IV
– just like OLS regression
– difficult to interpret
• SE (B)
– Standard error
– Multiply by 1.96 to get 95% CIs
625
Variables in the Equation
                       B       S.E.      Wald
Step 1    SCORE      -.467     .219      4.566
          Constant   1.314     .714      3.390
a. Variable(s) entered on step 1: SCORE.

Variables in the Equation
                      Sig.    Exp(B)    95.0% C.I. for EXP(B)
                                         Lower      Upper
Step 1    score       .386    1.263       .744      2.143
          Constant    .199     .323
a. Variable(s) entered on step 1: score.
626
• Constant
– i.e. score = 0
– B = 1.314
– Exp(B) = e^B = e^1.314 = 3.720
– OR = 3.720, p = 1 – (1 / (OR + 1))
= 1 – (1 / (3.720 + 1))
– p = 0.788
627
• Score 1
– Constant b = 1.314
– Score B = -0.467
– Exp(1.314 – 0.467) = Exp(0.847)
= 2.332
– OR = 2.332
– p = 1 – (1 / (2.332 + 1))
= 0.699
628
Standard Errors and CIs
• SPSS gives
– B, SE B, exp(B) by default
– Can work out 95% CI from standard error
– B ± 1.96 x SE(B)
– Or ask for it in options
• Symmetrical in B
– Non-symmetrical (sometimes very) in
exp(B)
629
Variables in the Equation
                       B       S.E.    Exp(B)    95.0% C.I. for EXP(B)
                                                  Lower      Upper
          SCORE      -.467     .219     .627       .408       .962
          Constant   1.314     .714    3.720
a. Variable(s) entered on step 1: SCORE.
630
• The odds of passing the test are multiplied by 0.63 (95% CI = 0.408, 0.962;
  p = 0.033) for every additional point on the aptitude test.
631
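A sketch of the same analysis in R (the data frame drivers, with variables pass and
score, is hypothetical):

  m <- glm(pass ~ score, data = drivers, family = binomial)

  summary(m)                 # B, Wald SEs and p-values
  exp(coef(m))               # Exp(B): the odds ratios
  exp(confint(m))            # CIs for the odds ratios (profile likelihood, not Wald)
  anova(m, test = "Chisq")   # likelihood-ratio test, like SPSS's omnibus chi-square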
More on Standard Errors
• In OLS regression
– If a variable is added in a hierarchical fashion
– The p-value associated with the change in R2 is
the same as the p-value of the variable
– Not the case in logistic regression
• In our data 0.025 and 0.033
• Wald standard errors
– mean the p-value in the estimates table is wrong – too high
– (CIs still correct)
632
• Two estimates use slightly different
information
– P-value says “what if no effect”
– CI says “what if this effect”
• Variance depends on the hypothesised ratio of the
number of people in the two groups
• Can calculate likelihood-ratio based p-values
– If you can be bothered
– Some packages provide them automatically
633
Probit Regression
• Very similar to logistic
– much more complex initial transformation
(to normal distribution)
– Very similar results to logistic (multiplied by
1.7)
• In SPSS:
– A bit weird
• Probit regression available through menus
634
– But requires data structured differently
• However
– Ordinal logistic regression is equivalent to
binary logistic
• If outcome is binary
– SPSS gives option of probit
635
Results
                               Estimate     SE        P
Logistic (binary)     Score     0.288     0.301     0.339
                      Exp       0.147     0.073     0.043
Logistic (ordinal)    Score     0.288     0.301     0.339
                      Exp       0.147     0.073     0.043
Logistic (probit)     Score     0.191     0.178     0.282
                      Exp       0.090     0.042     0.033
636
Differentiating Between Probit
and Logistic
• Depends on shape of the error term
– Normal or logistic
– Graphs are very similar to each other
• Could distinguish quality of fit
– Given enormous sample size
• Logistic = probit x 1.7
– Actually 1.6998
• Probit advantage
– Understand the distribution
• Logistic advantage
– Much simpler to get back to the probability
637
(Plot comparing the Normal (Probit) and Logistic curves over the range −3 to +3 – the
two are almost indistinguishable)
638
Infinite Parameters
• Non-convergence can happen because
of infinite parameters
– Insoluble model
• Three kinds:
• Complete separation
– The groups are completely distinct
• Pass group all score more than 10
• Fail group all score less than 10
639
• Quasi-complete separation
– Separation with some overlap
• Pass group all score 10 or more
• Fail group all score 10 or less
• Both cases:
– No convergence
• Close to this
– Curious estimates
– Curious standard errors
640
• Categorical Predictors
– Can cause separation
– Esp. if correlated
• Need people in every cell, i.e. every combination of:
– Male / Female
– White / Non-White
– Below / Above poverty line
641
Logistic Regression and
Diagnosis
• Logistic regression can be used for diagnostic
tests
– For every score
• Calculate probability that result is positive
• Calculate proportion of people with that score (or lower)
who have a positive result
• Calculate c statistic
– Measure of discriminative power
– %age of all possible pairs (one positive case, one negative case) where the model gives the higher probability to the positive case
642
– Perfect c-statistic = 1.0
– Random c-statistic = 0.5
• SPSS doesn’t do it automatically
– But easy to do
• Save probabilities
– Use Graphs, ROC Curve
– Test variable: predicted probability
– State variable: outcome
643
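The c-statistic can also be computed directly from the saved probabilities; a minimal base-R sketch (the phat and outcome vectors here are simulated purely for illustration):

# c-statistic (area under the ROC curve) from predicted probabilities
c_statistic <- function(phat, outcome) {
  pos <- phat[outcome == 1]
  neg <- phat[outcome == 0]
  # proportion of (positive, negative) pairs where the positive case gets
  # the higher predicted probability; ties count as half
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
outcome <- rbinom(100, 1, 0.5)
phat    <- plogis(outcome + rnorm(100))   # fake predicted probabilities
c_statistic(phat, outcome)                # 0.5 = chance, 1.0 = perfect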
Sensitivity and Specificity
• Sensitivity:
– Probability of saying someone has a positive result
• If they do: p(test positive | truly positive)
• Specificity:
– Probability of saying someone has a negative result
• If they do: p(test negative | truly negative)
644
Calculating Sens and Spec
• For each value (cutoff)
– Calculate
• proportion of minority earning less – p(m)
• proportion of non-minority earning less – p(w)
– Sensitivity (value) = p(m)
– 1 − Specificity (value) = p(w)
645
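A small R sketch of this calculation (the salary and minority vectors here are simulated stand-ins for the bank variables):

# Sensitivity and 1 - specificity at a series of salary cutoffs
set.seed(2)
minority <- rbinom(200, 1, 0.3)
salary   <- 30 + 15 * rnorm(200) - 8 * minority   # salary in 000s

cutoffs  <- seq(10, 90, by = 10)
sens     <- sapply(cutoffs, function(k) mean(salary[minority == 1] < k))  # p(m)
one_spec <- sapply(cutoffs, function(k) mean(salary[minority == 0] < k))  # p(w)
cbind(cutoff = cutoffs, sensitivity = sens, one_minus_specificity = one_spec)
# plotting sens against one_spec traces the ROC curve by hand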
Salary (000s)   P(minority)
10              .39
20              .31
30              .23
40              .17
50              .12
60              .09
70              .06
80              .04
90              .03
646
Using Bank Data
• Predict minority group, using salary
(000s)
– Logit(minority) = -0.044 + salary x –0.039
• Find actual proportions
647
ROC Curve
[Figure: ROC curve – sensitivity plotted against 1 − specificity. The area under the curve is the c-statistic; diagonal segments are produced by ties.]
648
More Advanced Techniques
• Multinomial Logistic Regression
– more than two categories in DV
– same procedure
– one category chosen as reference group
• odds of being in category other than reference
• Polytomous Logit Universal Models
(PLUM)
– Ordinal multinomial logistic regression
– For ordinal outcome variables
649
Final Thoughts
• Logistic Regression can be extended
– dummy variables
– non-linear effects
– interactions (even though we don’t cover
them until the next lesson)
• Same issues as OLS
– collinearity
– outliers
650
651
652
Lesson 12: Mediation and Path
Analysis
653
Introduction
• Moderator
– Level of one variable influences effect of another
variable
• Mediator
– One variable influences another via a third variable
• All relationships are really mediated
– are we interested in the mediators?
– can we make the process more explicit?
654
• In examples with the bank data:
– education → beginning salary
• Why?
– What is the process?
– Are we making assumptions about the
process?
– Should we test those assumptions?
655
[Diagram: education → job skills, expectations, negotiating skills, kudos for the bank → beginning salary.]
656
Direct and Indirect Influences
X may affect Y in two ways
• Directly – X has a direct (causal)
influence on Y
– (or maybe mediated by other variables)
• Indirectly – X affects Y via a mediating
variable - M
657
• e.g. how does going to the pub affect comprehension on a Summer school course
– on, say, regression
[Diagram: Having fun in pub in evening → not reading books on regression → less knowledge. Is there anything in the direct path from the pub to less knowledge?]
658
[Diagram: Having fun in pub in evening → not reading books on regression, and → fatigue, both leading to less knowledge. Is the direct path still needed?]
659
• Mediators needed
– to cope with more sophisticated theory in
social sciences
– make explicit assumptions made about
processes
– examine direct and indirect influences
660
Detecting Mediation
661
4 Steps
From Baron and Kenny (1986)
• To establish that the effect of X on Y is
mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If effect of X controlling for M is zero, M
is complete mediator of the relationship
•
(3 and 4 in same analysis)
662
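The four steps amount to three regressions; a sketch in R, assuming hypothetical variables enjoy (X), buy (M) and read (Y) in a data frame called books:

# Baron & Kenny's steps as regressions
step1  <- lm(read ~ enjoy, data = books)        # 1: X predicts Y
step2  <- lm(buy  ~ enjoy, data = books)        # 2: X predicts M
step34 <- lm(read ~ enjoy + buy, data = books)  # 3: M predicts Y controlling for X
                                                # 4: if enjoy's estimate here is ~0,
                                                #    buy is a complete mediator
lapply(list(step1, step2, step34), summary)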
Example: Book habits
Enjoy books → Buy books → Read books
663
Three Variables
• Enjoy
– How much an individual enjoys books
• Buy
– How many books an individual buys (in a
year)
• Read
– How many books an individual reads (in a
year)
664
         ENJOY   BUY    READ
ENJOY     1.00    0.64   0.73
BUY       0.64    1.00   0.75
READ      0.73    0.75   1.00
665
• The theory:
enjoy → buy → read
666
• Step 1
1. Show that X (enjoy) predicts Y (read)
– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK
667
2. Show that X (enjoy) predicts M (buy)
– b1 = 0.974, p < 0.001
– standardised b1 = 0.643
– OK
668
3. Show that M (buy) predicts Y (read),
controlling for X (enjoy)
– b1 = 0.469, p < 0.001
– standardised b1 = 0.206
– OK
669
4. If effect of X controlling for M is zero,
M is complete mediator of the
relationship
– (Same as analysis for step 3.)
– b2 = 0.287, p = 0.001
– standardised b2 = 0.431
– Hmmmm…
•
Significant, therefore not a complete mediator
670
[Diagram: enjoy → read = 0.287 (step 4); enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3).]
671
The Mediation Coefficient
• Amount of mediation =
Step 1 − Step 4 = 0.487 − 0.287 = 0.200
• OR
Step 2 × Step 3 = 0.974 × 0.206 = 0.200
672
SE of Mediator
[Diagram: enjoy → buy = a (from step 2); buy → read = b (from step 3).]
• sa = se(a)
• sb = se(b)
673
• Sobel test
– Standard error of mediation coefficient can
be calculated
se(ab) = √(b²·sa² + a²·sb² − sa²·sb²)

a = 0.974, sa = 0.189
b = 0.206, sb = 0.054
674
• Indirect effect = 0.200
– se = 0.056
– t = 3.52, p = 0.001
• Online Sobel test:
http://www.unc.edu/~preacher/sobel/sobel.htm
– (Won’t be there for long; probably will be
somewhere else)
675
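The calculation is easy to do by hand in R; a sketch using the a, b, sa and sb values from the slides above (the result can differ somewhat from the slide's se = 0.056 depending on whether standardised or unstandardised estimates are used, and on which variant of the formula is applied):

# Sobel-type standard error for the indirect effect a*b
a <- 0.974; sa <- 0.189
b <- 0.206; sb <- 0.054

indirect <- a * b                                        # 0.200
se_ab    <- sqrt(b^2 * sa^2 + a^2 * sb^2 - sa^2 * sb^2)  # formula from the slide
z        <- indirect / se_ab
p        <- 2 * pnorm(-abs(z))
c(indirect = indirect, se = se_ab, z = z, p = p)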
A Note on Power
• Recently
– Move in methodological literature away from this
conventional approach
– Problems of power:
– Several tests, all of which must be significant
• Type I error rate = 0.05 * 0.05 = 0.0025
• Must affect power
– Bootstrapping suggested as alternative
• See Paper B7, A4, B9
• B21 for SPSS syntax
676
677
678
Lesson 13: Moderators in
Regression
“different slopes for different
folks”
679
Introduction
• Moderator relationships have many
different names
– interactions (from ANOVA)
– multiplicative
– non-linear (just confusing)
– non-additive
• All talking about the same thing
680
A moderated relationship occurs
• when the effect of one variable depends
upon the level of another variable
681
• Hang on …
– That seems very like a nonlinear relationship
– Moderator
• Effect of one variable depends on level of another
– Non-linear
• Effect of one variable depends on level of itself
• Where there is collinearity
– Can be hard to distinguish between them
– Paper in handbook (B5)
– Should (usually) compare effect sizes
682
• e.g. How much it hurts when I drop a
computer on my foot depends on
– x1: how much alcohol I have drunk
– x2: how high the computer was dropped
from
– but if x1 is high enough
– x2 will have no effect
683
• e.g. Likelihood of injury in a car
accident
– depends on
– x1: speed of car
– x2: if I was wearing a seatbelt
– but if x1 is low enough
– x2 will have no effect
684
[Figure: injury severity plotted against speed (mph), with separate lines for seatbelt and no seatbelt.]
685
• e.g. number of words (from a list) I can
remember
– depends on
– x1: type of words (abstract, e.g. ‘justice’, or
concrete, e.g. ‘carrot’)
– x2: Method of testing (recognition – i.e.
multiple choice, or free recall)
– but if using recognition
– x1: will not make a difference
686
• We looked at three kinds of moderator
• alcohol x height = pain
– continuous x continuous
• speed x seatbelt = injury
– continuous x categorical
• word type x test type
– categorical x categorical
• We will look at them in reverse order
687
How do we know to look for
moderators?
Theoretical rationale
• Often the most powerful
• Many theories predict additive/linear
effects
– Fewer predict moderator effects
Presence of heteroscedasticity
• Clue that there may be a moderated relationship missing
688
Two Categorical Predictors
689
Data
• 2 IVs
– word type (concrete [1], abstract [2])
– test method (recog [1], recall [2])
• 20 participants in one of four groups
– 1, 1
– 1, 2
– 2, 1
– 2, 2
• 5 per group
• lesson12.1.sav
690
                Concrete   Abstract   Total
Recog   Mean      15.40      15.20     15.30
        SD         2.19       2.59      2.26
Recall  Mean      15.60       6.60     11.10
        SD         1.67       7.44      6.95
Total   Mean      15.50      10.90     13.20
        SD         1.84       6.94      5.47
691
• Graph of means
[Figure: mean scores plotted by TEST type (1 = recog, 2 = recall), with separate lines for the two WORDS conditions.]
692
ANOVA Results
• Standard way to analyse these data
would be to use ANOVA
– Words: F=6.1, df=1, 16, p=0.025
– Test: F=5.1, df=1, 16, p=0.039
– Words x Test: F=5.6, df=1, 16, p=0.031
693
Procedure for Testing
1: Convert to effect coding
• could use dummy coding, but effect coding makes collinearity less of an issue
• doesn’t make any difference to
substantive interpretation
2: Calculate interaction term
• In ANOVA interaction is automatic
• In regression we create an interaction
variable
694
• Interaction term (wxt)
– multiply effect coded variables together

word   test   wxt
 -1     -1     1
  1     -1    -1
 -1      1    -1
  1      1     1
695
3: Carry out regression
• Hierarchical
– linear effects first
– interaction effect in next block
696
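A sketch of the whole procedure in R, assuming a data frame d with variables words and test coded 1/2 and an outcome called score:

# Effect coding, interaction term, and a hierarchical test
d$w   <- ifelse(d$words == 1, -1, 1)   # concrete = -1, abstract = 1
d$t   <- ifelse(d$test  == 1, -1, 1)   # recog = -1, recall = 1
d$wxt <- d$w * d$t                     # interaction term

block1 <- lm(score ~ w + t,       data = d)   # linear effects only
block2 <- lm(score ~ w + t + wxt, data = d)   # add the interaction
anova(block1, block2)   # change in R2 / F test for the interaction
summary(block2)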
• b0 = 13.2
• b1 (words) = −2.3, p = 0.025
• b2 (test) = −2.1, p = 0.039
• b3 (words x test) = −2.2, p = 0.031
• Might need to use change in R² to test the significance of the interaction, because of collinearity
What do these mean?
• b0 (intercept) = predicted value of Y (score) when all X = 0
– i.e. the central point
697
• b0 = 13.2
– grand mean
• b1 = −2.3
– distance from the grand mean to the means for the two word types
– 13.2 − (−2.3) = 15.5
– 13.2 + (−2.3) = 10.9

          Concrete   Abstract   Total
Recog       15.40     15.20     15.30
Recall      15.60      6.60     11.10
Total       15.50     10.90     13.20
698
• b2 = -2.1
– distance from grand mean to recog and
recall means
• b3 = -2.2
– to understand b3 we need to look at
predictions from the equation without this
term
Score = 13.2 + (−2.3) × w + (−2.1) × t
699
Score = 13.2 + (−2.3) × w + (−2.1) × t
• So for each group we can calculate an
expected value
700
b1 = −2.3, b2 = −2.1

Word   Test   W    T    Expected Value
C      Cog   -1   -1    13.2 + (−2.3) × (−1) + (−2.1) × (−1)
C      Call  -1    1    13.2 + (−2.3) × (−1) + (−2.1) × 1
A      Cog    1   -1    13.2 + (−2.3) × 1 + (−2.1) × (−1)
A      Call   1    1    13.2 + (−2.3) × 1 + (−2.1) × 1
701
Word   Test   W    T    Expected   Actual
C      Cog   -1   -1      17.6       15.4
C      Call  -1    1      13.4       15.6
A      Cog    1   -1      13.0       15.2
A      Call   1    1       8.8        6.6

• The exciting part comes when we look at the differences between the actual value and the value in the 2 IV model
702
• Each difference = 2.2 (or –2.2)
• The value of b3 was –2.2
– the interaction term is the correction
required to the slope when the second IV
is included
703
• Examine the slope across test type (averaged over word types)
[Figure: mean score plotted against test type, Recog (−1) to Recall (1).
Gradient = (11.1 − 15.3) / 2 = −2.1]
704
• Add the slopes for the two word groups
[Figure: mean score against test type, Recog (−1) to Recall (1).
Both word groups combined: slope = −2.1
Concrete: (15.6 − 15.4) / 2 = 0.1
Abstract: (6.6 − 15.2) / 2 = −4.3]
705
b associated with interaction
• the change in slope, away from the
average, associated with a 1 unit
change in the moderating variable
OR
• Half the difference in the slopes
706
• Another way to look at it
Y = 13.2 + (−2.3)w + (−2.1)t + (−2.2)wt
• Examine concrete words group (w = −1)
– substitute values into the equation
Y(concrete) = 13.2 + (−2.3)(−1) + (−2.1)t + (−2.2)(−1)t
Y(concrete) = 13.2 + 2.3 + (−2.1)t + 2.2t
Y(concrete) = 15.5 + 0.1t
• The effect of changing test type for concrete words (the slope, which is half the actual difference)
707
Why go to all that effort? Why not do ANOVA in the first place?
1. That is what ANOVA actually does
• if it can handle an unbalanced design (i.e. different numbers of people in each group)
• Helps to understand what can be done with ANOVA
• SPSS uses regression to do ANOVA
2. Helps to clarify more complex cases
• as we shall see
708
Categorical x Continuous
709
Note on Dichotomisation
• Very common to see people dichotomise
a variable
– Makes the analysis easier
– Very bad idea
• Paper B6
710
Data
A chain of 60 supermarkets
• examining the relationship between
profitability, shop size, and local
competition
• 2 IVs
– shop size
– comp (local competition, 0=no, 1=yes)
• DV
– profit
711
• Data, 'lesson 12.2.sav'

Shopsize   Comp   Profit
4          1      23
10         1      25
7          0      19
10         0      9
10         1      18
29         1      33
12         0      17
6          1      20
14         0      21
62         0      8
712
1st Analysis
Two IVs
• R2=0.367, df=2, 57, p < 0.001
• Unstandardised estimates
– b1 (shopsize) = 0.083 (p=0.001)
– b2 (comp) = 5.883 (p<0.001)
• Standardised estimates
– b1 (shopsize) = 0.356
– b2 (comp) = 0.448
713
• Suspicions
– Presence of competition is likely to have an
effect
– Residual plot shows a little heteroscedasticity
[Figure: standardised residuals plotted against standardised predicted values.]
714
Procedure for Testing
• Very similar to last time
– convert ‘comp’ to effect coding
– -1 = No competition
– 1 = competition
– Compute interaction term
• comp (effect coded) x size
– Hierarchical regression
715
Result
• Unstandardised estimates
– b1 (shopsize) = 0.071 (p=0.006)
– b2 (comp) = -1.67 (p = 0.506)
– b3 (sxc) = -0.050 (p=0.050)
• Standardised estimates
– b1 (shopsize) = 0.306
– b2 (comp) = -0.127
– b3 (sxc) = -0.389
716
• comp now non-significant
– shows the importance of the hierarchical approach
– it obviously is important
717
Interpretation
• Draw graph with lines of best fit
– drawn automatically by SPSS
• Interpret equation by substitution of
values
– evaluate effects of
• size
• competition
718
[Figure: profit plotted against shopsize, with separate lines of best fit for shops with competition, without competition, and all shops.]
719
• Effects of size
– in presence and absence of competition
– (can ignore the constant)
Y = x1 × 0.071 + x2 × (−1.67) + x1 × x2 × (−0.050)
– Competition present (x2 = 1)
Y = x1 × 0.071 + 1 × (−1.67) + x1 × 1 × (−0.050)
Y = x1 × 0.071 − 1.67 + x1 × (−0.050)
Y = x1 × 0.021 + (−1.67)
720
Y=x10.071 + x2(-1.67) + x1x2 (-0.050)
– Competition absent (x2 = -1)
Y=x10.071 + -1(-1.67) + x1-1 (-0.050)
Y=x1 0.071 + x1-1 (-0.050) + -1(-1.67)
Y= x1 0.121 (+ 1.67)
721
Two Continuous Variables
722
Data
• Bank Employees
– only using clerical staff
– 363 cases
– predicting starting salary
– previous experience
– age
– age x experience
723
• Correlation matrix
– only one correlation significant

            LOGSB   AGESTART   PREVEXP
LOGSB        1.00     -0.09      0.08
AGESTART    -0.09      1.00      0.77
PREVEXP      0.08      0.77      1.00
724
Initial Estimates (no moderator)
• (standardised)
– R2 = 0.061, p<0.001
– Age at start = -0.37, p<0.001
– Previous experience = 0.36, p<0.001
• Suppressing each other
– Age and experience compensate for one
another
– Older, with no experience, bad
– Younger, with experience, good
725
The Procedure
• Very similar to previous
– create multiplicative interaction term
– BUT
• Need to eliminate effects of means
– cause massive collinearity
• and SDs
– cause one variable to dominate the
interaction term
• By standardising
726
• To standardise x,
– subtract mean, and divide by SD
– re-expresses x in terms of distance from
the mean, in SDs
– ie z-scores
• Hint: automatic in SPSS in Descriptives
• Create interaction term of age and exp
– axe = z(age) × z(exp)
727
• Hierarchical regression
– two linear effects first
– moderator effect in second
– hint: it is often easier to interpret if
standardised versions of all variables are
used
728
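A sketch of the same steps in R, assuming a data frame bank with variables logsb, agestart and prevexp:

# Standardise, create the product term, fit hierarchically
bank$z_age <- scale(bank$agestart)[, 1]   # z-scores: mean 0, SD 1
bank$z_exp <- scale(bank$prevexp)[, 1]
bank$axe   <- bank$z_age * bank$z_exp     # interaction term

block1 <- lm(logsb ~ z_age + z_exp,       data = bank)
block2 <- lm(logsb ~ z_age + z_exp + axe, data = bank)
anova(block1, block2)   # tests the change in R2 for the moderator
summary(block2)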
• Change in R2
– 0.085, p<0.001
• Estimates (standardised)
– b1 (exp) = 0.104
– b2 (agestart) = -0.54
– b3 (age x exp) = -0.54
729
Interpretation 1: Pick-a-Point
• Graph is tricky
– can’t have two continuous variables
– Choose specific points (pick-a-point)
• Graph the line of best fit for one variable at chosen values of the other
– Two ways to pick a point
• 1: Choose high (z = +1), medium (z = 0) and low (z = −1)
• 2: Choose 'sensible' values – age 20, 50, 80?
730
• We know:
– Y = e × 0.10 + a × (−0.54) + a × e × (−0.54)
– Where a = agestart, and e = experience
• We can rewrite this as:
– Y = (e × 0.10) + (a × (−0.54)) + (a × e × (−0.54))
– Take a out of the brackets
– Y = (e × 0.10) + (−0.54 + e × (−0.54)) × a
• Bracketed terms are the simple intercept and the simple slope
– simple intercept = (e × 0.10)
– simple slope = (−0.54 + e × (−0.54))
– Y = simple intercept + simple slope × a
731
• Pick any value of e, and we know the slope for a
– Standardised, so it's easy
• e = −1
– intercept = (−1 × 0.10) = −0.10
– slope = (−0.54 + (−1) × (−0.54)) = 0.00, i.e. 0.00a
• e = 0
– intercept = (0 × 0.10) = 0
– slope = (−0.54 + 0 × (−0.54)) = −0.54, i.e. −0.54a
• e = 1
– intercept = (1 × 0.10) = 0.10
– slope = (−0.54 + 1 × (−0.54)) = −1.08, i.e. −1.08a
732
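A tiny R sketch that reproduces these pick-a-point values from the standardised estimates on the previous slides:

# Simple intercepts and slopes for age at low, medium and high experience
b_exp <- 0.10; b_age <- -0.54; b_axe <- -0.54

simple <- function(e) c(intercept = e * b_exp,
                        slope     = b_age + e * b_axe)
sapply(c(low = -1, medium = 0, high = 1), simple)
# columns: experience 1 SD below the mean, at the mean, and 1 SD above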
Graph the Three Lines
[Figure: predicted log(salary) plotted against standardised age, with separate lines for e = −1, e = 0 and e = 1; the slope becomes more steeply negative as experience increases.]
733
Interpretation 2: P-Values and CIs
• Second way
– Newer, rarely done
• Calculate CIs of the slope
– At any point
• Calculate p-value
– At any point
• Give ranges of significance
734
What do you need?
• The variance and covariance of the
estimates
– SPSS doesn't provide the covariances involving the intercept
– Need to do it manually
• In options, exclude the intercept
– Create your own intercept variable – c = 1
– Use it in the regression
735
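With the variances and covariances in hand, the standard error of the simple slope at any value of the moderator follows directly; a sketch in R, reusing the block2 object fitted in the earlier continuous-by-continuous sketch:

# SE and CI for the simple slope of age at a chosen value of experience
e <- 1                       # experience 1 SD above the mean
b <- coef(block2)
V <- vcov(block2)            # variances/covariances of the estimates

slope    <- b["z_age"] + e * b["axe"]
se_slope <- sqrt(V["z_age", "z_age"] + e^2 * V["axe", "axe"] +
                 2 * e * V["z_age", "axe"])
slope + c(lower = -1.96, estimate = 0, upper = 1.96) * se_slope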
• Enter information into web page:
– www.unc.edu/~preacher/interact/acov.htm
– (Again, may not be around for long)
• Get results
• Calculations in Bauer and Curran (in
press: Multivariate Behavioral Research)
– Paper B13
736
[Figure: MLR 2-Way Interaction Plot produced by the web tool – Y plotted against X, with separate lines for three conditional values of the moderator, CVz1(1), CVz1(2) and CVz1(3).]
737
Areas of Significance
[Figure: the simple slope plotted against experience, with confidence bands; the regions where the bands exclude zero are the areas of significance.]
738
• 2 complications
– 1: Constant differed
– 2: DV was logged, hence non-linear
• effect of 1 unit depends on where the unit is
– Can use SPSS to do graphs showing lines
of best fit for different groups
– See paper A2
739
Finally …
740
Unlimited Moderators
• Moderator effects are not limited to
– 2 variables
– linear effects
741
Three Interacting Variables
• Age, Sex, Exp
• Block 1
– Age, Sex, Exp
• Block 2
– Age x Sex, Age x Exp, Sex x Exp
• Block 3
– Age x Sex x Exp
742
• Results
– All two way interactions significant
– Three way not significant
– Effect of Age depends on sex
– Effect of experience depends on sex
– Size of the age x experience interaction
does not depend on sex (phew!)
743
Moderated Non-Linear
Relationships
• Enter non-linear effect
• Enter non-linear effect x moderator
– if significant, this indicates that the degree of non-linearity differs across levels of the moderator
744
745
Modelling Counts: Poisson
Regression
Lesson 14
746
Counts and the Poisson Distribution
• Von Bortkiewicz (1898)
– Numbers of Prussian soldiers kicked to death by horses

Number of deaths:   0     1    2   3   4   5
Frequency:         109   65   22   3   1   0

[Figure: bar chart of these frequencies.]
747
• The data fitted a Poisson probability distribution
– When counts of events occur, poisson distribution is
common
– E.g. papers published by researchers, police arrests,
number of murders, ship accidents
• Common approach
– Log transform and treat as normal
• Problems
– Censored at 0
– Integers only allowed
– Heteroscedasticity
748
The Poisson Distribution
[Figure: Poisson probability distributions with means of 0.5, 1, 4 and 8, plotted over counts 0 to 17.]
749
p(y | x) = exp(−μ) μ^y / y!
750
p(y | x) = exp(−μ) μ^y / y!
• Where:
– y is the count
– μ is the mean of the Poisson distribution
• In a Poisson distribution
– The mean = the variance (hence the heteroscedasticity issue)
– μ = σ²
751
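A quick check of this formula in R against the built-in dpois() function, using one of the means from the figure above:

# Poisson probabilities by hand versus dpois()
mu <- 4
y  <- 0:6
p_hand  <- exp(-mu) * mu^y / factorial(y)
p_dpois <- dpois(y, lambda = mu)
rbind(p_hand, p_dpois)   # the two rows are identical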
Poisson Regression in SPSS
• Not directly available
– SPSS can be tweaked to do it in three ways:
– General loglinear model (genlog)
– Non-linear regression (CNLR)
• Bootstrapped p-values only
– Both are quite tricky
• SPSS 15,
752
Example Using Genlog
• Number of shark bites on different colour surfboards
– 100 surfboards, 50 red, 50 blue
• Weight cases by bites
• Analyse, Loglinear, General
– Colour is factor
[Figure: bar chart of the frequency of 0 to 4 bites, shown separately for blue and red surfboards.]
753
Results

Correspondence Between Parameters and Terms of the Design
Parameter   Aliased   Term
1                     Constant
2                     [COLOUR = 1]
3           x         [COLOUR = 2]
Note: 'x' indicates an aliased (or a redundant) parameter. These parameters are set to zero.
754
                                        Asymptotic 95% CI
Param   Est.      SE       Z-value      Lower    Upper
1        4.1190   .1275    32.30         3.87     4.37
2        -.5495   .2108    -2.61         -.96     -.14
3        .0000    .        .             .        .

• Note: Intercept (param 1) is curious
• Param 2 is the difference in the means
755
SPSS: Continuous Predictors
• Bleedin’ nightmare
• http://www.spss.com/tech/answer/details.cfm?tech_tan_id=100006204
756
Poisson Regression in Stata
• SPSS will save a Stata file
• Open it in Stata
• Statistics, Count outcomes, Poisson
regression
757
Poisson Regression in R
• R is a freeware program
– Similar to SPlus
– www.r-project.org
• Steep learning curve to start with
• Much nicer to do Poisson (and other) regression
analysis
http://www.stat.lsa.umich.edu/~faraway/book/
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/
758
• Commands in R
• Stage 1: enter data
– colour <- c(1, 0, 1, 0, 1, 0 … 1)
– bites <- c(3, 1, 0, 0, … )
• Run analysis
– p1 <- glm(bites ~ colour, family
= poisson)
• Get results
– summary.glm(p1)
759
R Results

Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.3567      0.1686   -2.115   0.03441 *
colour        0.5555      0.2116    2.625   0.00866 **

• Results for colour
– Same as SPSS
– For intercept different (weird SPSS)
760
Predicted Values
• Need to get exponential of parameter
estimates
– Like logistic regression
• Exp(0.555) = 1.74
– You are likely to be bitten by a shark 1.74
times more often with a red surfboard
761
Checking Assumptions
• Was it really Poisson distributed?
– For Poisson, μ = σ²
• As mean increases, variance should also increase
– Residuals should be random
• Overdispersion is a common problem
• Too many zeroes
• For blue: μ = σ² = exp(-0.3567) = 1.42
• For red: μ = σ² = exp(-0.3567 + 0.555) = 2.48
762
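To compare the predicted and actual distributions (as on the next slide), the expected counts can be computed with dpois(); a sketch, using the blue-surfboard mean reported above and made-up observed counts purely for illustration:

# Observed versus Poisson-expected counts for 50 blue surfboards
obs_blue      <- c(13, 20, 10, 5, 2)               # hypothetical counts of 0-4 bites
expected_blue <- 50 * dpois(0:4, lambda = 1.42)    # expected counts under Poisson
round(rbind(observed = obs_blue, expected = expected_blue), 1)
# a clear excess of zeroes over the expected row would suggest overdispersion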
p(y | x) = exp(−μ) μ^y / y!
• Strictly:
p(yᵢ | xᵢ) = exp(−μ̂ᵢ) μ̂ᵢ^yᵢ / yᵢ!
763
Compare Predicted with Actual Distributions
[Figure: for red and for blue surfboards separately, the expected Poisson probabilities and the actual proportions of boards with 0 to 4 bites.]
764
Overdispersion
• Problem in Poisson regression
– Too many zeroes
• Causes
– χ² inflation
– Standard error deflation
• Hence p-values too low
– Higher type I error rate
• Solution
– Negative binomial regression
765
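Negative binomial regression is not in base R's glm(), but MASS provides it; a sketch reusing the bites and colour vectors from the earlier R example:

# Negative binomial regression as a check on overdispersion
library(MASS)                  # for glm.nb()

nb1 <- glm.nb(bites ~ colour)
summary(nb1)
# if the dispersion parameter matters (and the AIC improves over the Poisson
# model), the Poisson standard errors were probably too small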
Using R
• R can read an SPSS file
– But you have to ask it nicely
• Click Packages menu, Load package,
choose “Foreign”
• Click File, Change Dir
– Change to the folder that contains your
data
766
More on R
• R uses objects
– To place something into an object, use <-
– X <- Y
• Puts Y into X
• Function is read.spss()
– Mydata <- read.spss(“spssfilename.sav”)
• Variables are then referred to as
Mydata$VAR1
– Note 1: R is case sensitive
– Note 2: SPSS variable name in capitals
767
GLM in R
• Command
– glm(outcome ~ pred1 + pred2 + … +
predk [,family = familyname])
– If no familyname, default is OLS
• Use binomial for logistic, poisson for poisson
• Output is a GLM object
– You need to give this a name
– my1stglm <- glm(outcome ~ pred1 +
pred2 + … + predk [,family =
familyname])
768
• Then need to explore the result
– summary(my1stglm)
• To explore what it means
– Need to plot regressions
• Easiest is to use Excel
769
770
Introducing Structural
Equation Modelling
Lesson 15
771
Introduction
• Related to regression analysis
– All (OLS) regression can be considered as a
special case of SEM
• Power comes from adding restrictions to
the model
• SEM is a system of equations
– Estimate those equations
772
Regression as SEM
• Grades example
– Grade = constant + books + attend +
error
• Looks like a regression equation
– Also
– Books correlated with attend
– Explicit modelling of error
773
Path Diagram
• Systems of equations are usefully represented in a path diagram
[Diagram key: x in a rectangle = measured variable; e in an ellipse = unmeasured variable; single-headed arrow = regression; double-headed arrow = correlation.]
774
Path Diagram for Regression
[Diagram: Books and Attend (their correlation must be explicitly modelled) each predict Grade; Grade has an error term, which must usually be explicitly modelled.]
775
Results
• Unstandardised
[Diagram: BOOKS → GRADE = 4.04; ATTEND → GRADE = 1.28; error → GRADE = 13.52 (error loading fixed at 1.00); covariance of BOOKS and ATTEND = 2.65; variances of BOOKS and ATTEND = 2.00 and 17.84.]
776
Standardised
[Diagram: BOOKS → GRADE = .35; ATTEND → GRADE = .33; error → GRADE = .82; correlation of BOOKS and ATTEND = .44.]
777
Table

                    Estimate   S.E.   C.R.   P      St. Est.
GRADE <-- BOOKS       4.04     1.71   2.36   0.02     0.35
GRADE <-- ATTEND      1.28     0.57   2.25   0.03     0.33
GRADE <-- e          13.52     1.53   8.83   0.00     0.82
GRADE (intercept)    37.38     7.54   4.96   0.00

Coefficients
             Unstandardized Coefficients   Standardized Coefficients
Model 1      B         Std. Error          Beta          Sig.
(Constant)   37.38     7.74                              .00
BOOKS         4.04     1.75                .35           .03
ATTEND        1.28      .59                .33           .04
a. Dependent Variable: GRADE
778
So What Was the Point?
• Regression is a special case
• Lots of other cases
• Power of SEM
– Power to add restrictions to the model
• Restrict parameters
– To zero
– To the value of other parameters
– To 1
779
Restrictions
• Questions
– Is a parameter really necessary?
– Are a set of parameters necessary?
– Are parameters equal
• Each restriction adds 1 df
– Test of model with χ²
780
The χ² Test
• Can the model proposed have
generated the data?
– Test of significance of difference of model
and data
– Statistically significant result
• Bad
– Theoretically driven
• Start with model
• Don’t start with data
781
Regression Again
[Diagram: the same path model, with the BOOKS → GRADE and ATTEND → GRADE paths restricted to zero; GRADE keeps its error term.]
• Both estimates restricted to zero
782
• Two restrictions
– 2 df for χ² test
– χ² = 15.9, p = 0.0003
• This test is (asymptotically) equivalent
to the F test in regression
– We still haven’t got any further
783
Multivariate Regression
[Diagram: x1 and x2 (correlated) each predict y1, y2 and y3.]
784
Test of all x’s on all y’s
(6 restrictions = 6 df)
[Diagram: all six x → y paths restricted to zero.]
785
Test of all x1 on all y’s
(3 restrictions)
[Diagram: the three paths from x1 to y1, y2 and y3 restricted to zero.]
786
Test of all x1 on all y1
(3 restrictions)
[Diagram: the three tested paths restricted to zero.]
787
Test of all 3 partial correlations between
y’s, controlling for x’s
(3 restrictions)
[Diagram: the three correlations among the y residuals, controlling for the x's, restricted to zero.]
788
Path Analysis and SEM
• More complex models – can add more restrictions
– E.g. mediator model
[Diagram: ENJOY → BUY → READ, with error terms e_buy and e_read (loadings fixed at 1).]
• 1 restriction
– No path from enjoy -> read
789
Result
• χ² = 10.9, 1 df, p = 0.001
• Not a complete mediator
– Additional path is required
790
Multiple Groups
• Same model
– Different people
• Equality constraints between groups
– Means, correlations, variances, regression
estimates
– E.g. males and females
791
Multiple Groups Example
• Age
• Severity of psoriasis
– SEVE – in emotional areas
• Hands, face, forearm
– SEVNONE – in non-emotional areas
• Anxiety
• Depression
792
Correlations (SEX = f, n = 110)

            AGE     SEVE    SEVNONE   GHQ_A   GHQ_D
AGE         1       -.270    -.248     .017    .035
SEVE        -.270    1        .665     .045    .075
SEVNONE     -.248    .665    1         .109    .096
GHQ_A        .017    .045     .109    1        .782
GHQ_D        .035    .075     .096     .782   1

(Two-tailed p-values: AGE–SEVE .004, AGE–SEVNONE .009, SEVE–SEVNONE .000, GHQ_A–GHQ_D .000; the remaining correlations are non-significant.)
793
Correlations (SEX = m, n = 79)

            AGE     SEVE    SEVNONE   GHQ_A   GHQ_D
AGE         1       -.243    -.116    -.195   -.190
SEVE        -.243    1        .671     .456    .453
SEVNONE     -.116    .671    1         .210    .232
GHQ_A       -.195    .456     .210    1        .800
GHQ_D       -.190    .453     .232     .800   1

(Two-tailed p-values: AGE–SEVE .031, SEVE–SEVNONE .000, SEVE–GHQ_A .000, SEVE–GHQ_D .000, SEVNONE–GHQ_D .040, GHQ_A–GHQ_D .000; the remaining correlations are non-significant.)
794
Model
[Diagram: AGE predicts SEVE and SEVNONE (error terms e_s and e_sn, loadings fixed at 1); SEVE and SEVNONE predict Anx and Dep (error terms e_a and E_d, loadings fixed at 1).]
795
Females
[Diagram: standardised estimates for females. AGE → SEVE = −.27, AGE → SEVNONE = −.25; the SEVE and SEVNONE residuals correlate .64; the paths from the severity variables to Anx and Dep are all small (between −.04 and .15); the Anx and Dep residuals correlate .78; error loadings range from .96 to .99.]
796
Males
[Diagram: standardised estimates for males. AGE → SEVE = −.24, AGE → SEVNONE = −.12; the SEVE and SEVNONE residuals correlate .67; the paths from SEVE to Anx and Dep are substantial (.52 and .55) while the SEVNONE paths are small and negative; the Anx and Dep residuals correlate .74; error loadings range from .88 to .99.]
797
Constraint
• sevnone -> dep
– Constrained to be equal for males and
females
• 1 restriction, 1 df
– χ² = 1.3 – not significant
• 4 restrictions
– 2 severity -> anx & dep
798
• 4 restrictions, 4 df
– χ² = 1.3, p = 0.014
• Parameters are not equal
799
Missing Data: The big advantage
• SEM programs tend to deal with missing
data
– Multiple imputation
– Full Information (Direct) Maximum
Likelihood
• Asymptotically equivalent
• Data can be MAR, not just MCAR
800
Power: A Smaller Advantage
• Power for regression gets tricky with
large models
• With SEM power is (relatively) easy
– It’s all based on chi-square
– Paper B14
801
Lesson 16: Dealing with clustered
data & longitudinal models
802
The Independence
Assumption
• In Lesson 8 we talked about independence
– The residual of any one case should not tell you
about the residual of any other case
• Particularly problematic when:
– Data are clustered on the predictor variable
• E.g. predictor is household size, cases are members of
family
• E.g. Predictor is doctor training, outcome is patients of
doctor
– Data are longitudinal
• Have people measured over time
– It’s the same person!
803
Clusters of Cases
• Problem with cluster (group)
randomised studies
– Or group effects
• Use Huber-White sandwich estimator
– Tell it about the groups
– Correction is made
– Use complex samples in SPSS
804
Complex Samples
• As with Huber-White for heteroscedasticity
– Add a variable that tells it about the clusters
– Put it into clusters
• Run GLM
– As before
• Warning:
– Need about 20 clusters for solutions to be stable
805
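Outside SPSS, a similar cluster-robust (sandwich) correction can be obtained in R; a sketch, assuming hypothetical variables cost, triage and week in a data frame d, and using the sandwich and lmtest packages (neither is covered in these slides):

# Huber-White cluster-robust standard errors
library(sandwich)   # vcovCL()
library(lmtest)     # coeftest()

fit <- lm(cost ~ triage, data = d)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ week))   # corrected SEs and p-values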
Example
• People randomised by week to one of two
forms of triage
– Compare the total cost of treating each
• Ignore clustering
– Difference is £2.40 per person, with 95% confidence interval £0.58 to £4.22, p = 0.010
• Include clustering
– Difference is still £2.40, with 95% CI −£0.85 to £5.65, and p = 0.141.
• Ignoring clustering led to type I error
806
Longitudinal Research
• For comparing repeated measures
– Clusters are people
– Can model the repeated measures over time

ID   V1   V2   V3   V4
1    2    3    4    7
2    3    6    8    4
3    2    5    7    5

• Data are usually short and fat
807
Converting Data
• Change data to tall and thin
• Use Data, Restructure in SPSS
• Clusters are ID

ID   V   X
1    1   2
1    2   3
1    3   4
1    4   7
2    1   3
2    2   6
2    3   8
2    4   4
3    1   2
3    2   5
3    3   7
3    4   5
808
(Simple) Example
• Use employee data.sav
– Compare beginning salary and salary
– Would normally use paired samples t-test
• Difference = $17,403, 95% CIs
$16,427.407, $18,379.555
809
Restructure the Data
• Do it again
– With data tall and thin
• Complex GLM with Time as factor
– ID as cluster
• Difference = $17,430, 95% CIs = 16,427.407, 18,739.555

ID   Time   Cash
1    1      $18,750
1    2      $21,450
2    1      $12,000
2    2      $21,900
3    1      $13,200
3    2      $45,000
810
Interesting …
• That wasn’t very interesting
– What is more interesting is when we have
multiple measurements of the same people
• Can plot and assess trajectories over
time
811
Single Person Trajectory
[Figure: one person's repeated measurements plotted against time.]
812
Multiple Trajectories: What's the Mean and SD?
[Figure: several people's trajectories plotted against time.]
813
Complex Trajectories
• An event occurs
– Can have two effects:
– A jump in the value
– A change in the slope
• Event doesn’t have to happen at the
same time for each person
– Doesn’t have to happen at all
814
[Figure: a trajectory with slope 1 before the event occurs, a jump in the value when it occurs, and slope 2 afterwards.]
815
Parameterising

Time   Event   Time2   Outcome
1      0       0       12
2      0       0       13
3      0       0       14
4      0       0       15
5      0       0       16
6      1       0       10
7      1       1       9
8      1       2       8
9      1       3       7
816
Draw the Line
What are the parameter estimates?
817
Main Effects and Interactions
• Main effects
– Intercept differences
• Moderator effects
– Slope differences
818
Multilevel Models
• Fixed versus random effects
– Fixed effects are fixed across individuals
(or clusters)
– Random effects have variance
• Levels
– Level 1 – individual measurement
occasions
– Level 2 – higher order clusters
819
More on Levels
• NHS direct study
– Level 1 units: …………….
– Level 2 units: ……………
• Widowhood food study
– Level 1 units ……………
– Level 2 units ……………
820
More Flexibility
• Three levels:
– Level 1: measurements
– Level 2: people
– Level 3: schools
821
More Effects
• Variances and covariances of effects
• Level 1 and level 2 residuals
– Makes R2 difficult to talk about
• Outcome variable
– Yij
• The score of the ith person in the jth group
822
Y     i   j
2.3   1   1
3.2   2   1
4.5   3   1
4.8   1   2
7.2   2   2
3.1   3   2
1.6   4   2
823
Notation
• Notation gets a bit horrid
– Varies a lot between books and programs
• We used to have b0 and b1
– If fixed, that’s fine
– If random, each person has their own
intercept and slope
824
Standard Errors
• Intercept has standard errors
• Slopes have standard errors
• Random effects have variances
– Those variances have standard errors
• Is there statistically significant variation
between higher level units (people)?
• OR
• Is everyone the same?
825
Programs
• Since version 12
– Can do this in SPSS
– Can’t do anything really clever
• Menus
– Completely unusable
– Have to use syntax
826
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.
827
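For comparison, roughly the same random-intercept, random-slope model can be fitted in R with the lme4 package (not covered in these slides); a sketch assuming a long-format data frame dat with columns relfd, time and id:

# Random intercept and random slope for time, clustered by id
library(lme4)

m1 <- lmer(relfd ~ time + (time | id), data = dat)
summary(m1)   # fixed effects, plus the variances and covariance of the
              # random intercept and slope (unstructured, like covtype(un))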
SPSS Syntax
• MIXED
• relfd with time
– relfd is the outcome
– time is a continuous predictor
828
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
Must specify effect as
fixed first
829
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
– Specify random effects
– Intercept and time are random
– SPSS assumes that your level 2 units are subjects, and needs to know the id variable
830
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
– Covariance matrix of random effects is unstructured. (Alternatives are id – identity, or vc – variance components.)
831
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.
– Print the answer
832
The Output
• Information criteria
– We'll come back

Information Criteria
-2 Restricted Log Likelihood              64899.758
Akaike's Information Criterion (AIC)      64907.758
Hurvich and Tsai's Criterion (AICC)       64907.763
Bozdogan's Criterion (CAIC)               64940.134
Schwarz's Bayesian Criterion (BIC)        64936.134
The information criteria are displayed in smaller-is-better form.
a. Dependent Variable: relfd.
833
Fixed Effects
• Not useful here, useful for interactions
Type III Tests of Fixed Effects
Source      Numerator df   Denominator df   F          Sig.
Intercept   1              741              3251.877   .000
time        1              741.000          2.550      .111
a. Dependent Variable: relfd.
834
Estimates of Fixed Effects
• Interpreted as regression equation

Estimates of Fixed Effects
                                               95% Confidence Interval
Parameter   Estimate   Std. Error   t         Sig.    Lower Bound   Upper Bound
Intercept    21.90       .38         57.025    .000    21.15         22.66
time         -.06        .04         -1.597    .111    -.14          .01
a. Dependent Variable: relfd.
835
Covariance Parameters

Estimates of Covariance Parameters
Parameter                              Estimate    Std. Error
Residual                               64.11577    1.0526353
Intercept + time [subject = id]
  UN (1,1)                             85.16791    5.7003732
  UN (2,1)                             -4.53179     .5067146
  UN (2,2)                              .7678319    .0636116
a. Dependent Variable: relfd.
836
Change Covtype to VC
• We know that this is wrong
– The covariance of the effects was statistically
significant
– Can also see if it was wrong by comparing
information criteria
• We have removed a parameter from the
model
– Model is worse
– Model is more parsimonious
• Is it much worse, given the increase in parsimony?
837
                                        UN Model     VC Model
-2 Restricted Log Likelihood            64899.758    65041.891
Akaike's Information Criterion (AIC)    64907.758    65047.891
Hurvich and Tsai's Criterion (AICC)     64907.763    65047.894
Bozdogan's Criterion (CAIC)             64940.134    65072.173
Schwarz's Bayesian Criterion (BIC)      64936.134    65069.173
The information criteria are displayed in smaller-is-better form.
a. Dependent Variable: relfd.
Lower is better.
838
Adding Bits
• So far, all a bit dull
• We want some more predictors, to make it
more exciting
– E.g. female
– Add:
Relfd with time female
/fixed = time sex time * sex
• What does the interaction term represent?
839
Extending Models
• Models can be extended
– Any kind of regression can be used
• Logistic, multinomial, Poisson, etc
– More levels
• Children within classes within schools
• Measures within people within classes within prisons
– Multiple membership / cross classified models
• Children within households and classes, but households
not nested within class
• Need a different program
– E.g. MlwiN
840
MlwiN Example (very quickly)
841
Books
Singer, JD and Willett, JB (2003). Applied
Longitudinal Data Analysis: Modeling Change
and Event Occurrence. Oxford, Oxford
University Press.
Examples at:
http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm
842
The End
843