Theory of Regression - Jeremy Miles's Page

Applying Regression
1
The Course
• 14 (or so) lessons
– Some flexibility
• Depends how we feel
• What we get through
2
Part I: Theory of Regression
1. Models in statistics
2. Models with more than one parameter:
regression
3. Samples to populations
4. Introducing multiple regression
5. More on multiple regression
3
Part 2: Application of regression
6. Categorical predictor variables
7. Assumptions in regression analysis
8. Issues in regression analysis
9. Non-linear regression
10. Categorical and count variables
11. Moderators (interactions) in regression
12. Mediation and path analysis
Part 3: Taking Regression Further
(Kind of brief)
13. Introducing longitudinal multilevel models
4
Bonuses
Bonus lesson 1: Why is it called
regression?
Bonus lesson 2: Other types of
regression.
5
House Rules
• Jeremy must remember
– Not to talk too fast
• If you don’t understand
– Ask
– Any time
• If you think I’m wrong
– Ask. (I’m not always right)
6
The Assistants
• Carla Xena - cxenag@essex.ac.uk
• Eugenia Suarez Moran - esuare@essex.ac.uk
• Arian Daneshmand - adanes@essex.ac.uk
Learning New Techniques
• Best kind of data to learn a new technique
– Data that you know well, and understand
• Your own data
– In computer labs (esp later on)
– Use your own data if you like
• My data
– I’ll provide you with
– Simple examples, small sample sizes
• Conceptually simple (even silly)
8
Computer Programs
• Stata
– Mostly
• I’ll explain SPSS options
• You’ll like Stata more
• Excel
– For calculations
– Semi-optional
• GPower
9
Lesson 1: Models in statistics
Models, parsimony, error, mean,
OLS estimators
10
What is a Model?
11
What is a model?
• Representation
– Of reality
– Not reality
• Model aeroplane represents a real
aeroplane
– If model aeroplane = real aeroplane, it
isn’t a model
12
• Statistics is about modelling
– Representing and simplifying
• Sifting
– What is important from what is not
important
• Parsimony
– In statistical models we seek parsimony
– Parsimony = simplicity
13
Parsimony in Science
• A model should be:
– 1: able to explain a lot
– 2: use as few concepts as possible
• More it explains
– The more you get
• Fewer concepts
– The lower the price
• Is it worth paying a higher price for a better
model?
14
The Mean as a Model
15
The (Arithmetic) Mean
• We all know the mean
– The ‘average’
– Learned about it at school
– Forget (didn’t know) about how clever the mean is
• The mean is:
– An Ordinary Least Squares (OLS) estimator
– Best Linear Unbiased Estimator (BLUE)
16
Mean as OLS Estimator
• Going back a step or two
• MODEL was a representation of DATA
– We said we want a model that explains a lot
– How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA - MODEL
– We want a model with as little ERROR as possible
17
• What is error?

  Data (Y)   Model (b0 = mean)   Error (e)
  1.40       1.60                -0.20
  1.55       1.60                -0.05
  1.80       1.60                 0.20
  1.62       1.60                 0.02
  1.63       1.60                 0.03
18
• How can we calculate the ‘amount’ of
error?
• Sum of errors?
• Sum of absolute errors?
  ERROR = Σei = Σ(Yi − Ŷ) = Σ(Yi − b0)
        = (−0.20) + (−0.05) + 0.20 + 0.02 + 0.03
        = 0
19
• Are small and large errors equivalent?
– One error of 4
– Four errors of 1
– The same?
– What happens with different data?
• Y = (2, 2, 5)
– b0 = 2
– Not very representative
• Y = (2, 2, 4, 4)
– b0 = any value from 2 - 4
– Indeterminate
• There are an infinite number of solutions which would satisfy
our criteria for minimum error
20
• Sum of squared errors (SSE)

  ERROR = Σei² = Σ(Yi − Ŷ)² = Σ(Yi − b0)²
        = (−0.20)² + (−0.05)² + 0.20² + 0.02² + 0.03²
        = 0.08
21
• Determinate
– Always gives one answer
• If we minimise SSE
– Get the mean
• Shown in graph
– SSE plotted against b0
– Min value of SSE occurs when
– b0 = mean
22
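A minimal Stata sketch of this idea (not in the original slides), using the five example values above: the SSE around the mean is just the sample variance times (N − 1).

  * enter the five example values and compute the SSE around the mean
  clear
  input y
  1.40
  1.55
  1.80
  1.62
  1.63
  end
  quietly summarize y
  display "mean = " r(mean)
  display "SSE at the mean = " r(Var)*(r(N)-1)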
[Figure: SSE plotted against b0 (from 1 to 2); the minimum value of SSE occurs where b0 = mean]
23
The Mean as an OLS Estimate
24
Mean as OLS Estimate
• The mean is an Ordinary Least Squares
(OLS) estimate
– As are lots of other things
• This is exciting because
– OLS estimators are BLUE
– Best Linear Unbiased Estimators
– Proven with Gauss-Markov Theorem
• Which we won’t worry about
25
BLUE Estimators
• Best
– Minimum variance (of all possible unbiased
estimators)
– Narrower distribution than other estimators
• e.g. median, mode
Y Y
26
SSE and the Standard
Deviation
• Tying up a loose end
  SSE = Σ(Yi − Ŷ)²

  s = √[ Σ(Yi − Ŷ)² / n ]  →  √[ Σ(Yi − Ŷ)² / (n − 1) ]
27
• SSE closely related to SD
• Sample standard deviation – s
– Biased estimator of population SD
• Population standard deviation - σ
– Need to know the mean to calculate SD
• Reduces N by 1
• Hence divide by N-1, not N
– Like losing one df
28
Proof
• That the mean minimises SSE
– Not that difficult
– As statistical proofs go
• Available in
– Maxwell and Delaney – Designing
experiments and analysing data
– Judd and McClelland – Data Analysis: a
model comparison approach
• (out of print?)
29
What’s a df?
• The number of parameters free to vary
– When one is fixed
• Term comes from engineering
– Movement available to structures
30
Back to the Data
• Mean has 5 (N) df
– 1st moment
• σ has N – 1 df
– Mean has been fixed
– 2nd moment
– Can be thought of as the amount cases vary
away from the mean
31
While we are at it …
• Skewness has N – 2 df
– 3rd moment
• Kurtosis has N – 3 df
– 4th moment
– Amount cases vary from σ
32
Parsimony and df
• Number of df remaining
– Measure of parsimony
• Model which contained all the data
– Has 0 df
– Not a parsimonious model
• Normal distribution
– Can be described in terms of mean and σ
• 2 parameters
– (z with 0 parameters)
33
Summary of Lesson 1
• Statistics is about modelling DATA
– Models have parameters
– Fewer parameters, more parsimony, better
• Models need to minimise ERROR
– Best model, least ERROR
– Depends on how we define ERROR
– If we define error as sum of squared deviations
from predicted value
– Mean is best MODEL
34
Lesson 1a
• A really brief introduction to Stata
35
Command
review
Commands
Output
Variable list
Commands
36
Stata Commands
• Can use menus
– But commands are easy
• All have similar format:
• command variables , options
• Stata is case sensitive
– BEDS, beds, Beds
• Stata lets you shorten
– summarize sqft
– su sq
37
More Stata Commands
• Open exercise 1.4.dta
– Run
• summarize sqm
• table beds
• mean price
• histogram price
– Or
• su be
• tab be
• mean pr
• hist pr
38
Lesson 2: Models with one
more parameter - regression
39
In Lesson 1 we said …
• Use a model to predict and describe
data
– Mean is a simple, one parameter model
40
More Models
Slopes and Intercepts
41
More Models
• The mean is OK
– As far as it goes
– It just doesn’t go very far
– Very simple prediction, uses very little
information
• We often have more information than
that
– We want to use more information than that
42
House Prices
• Look at house prices in one area of Los
Angeles
• Predictors of house prices
• Using:
– Sale price, size, number of bedrooms, size
of lot, year built …
43
House Prices
address
listprice
beds
baths
sqft
3628 OLYMPIAD Dr
649500
4
3
2575
3673 OLYMPIAD Dr
450000
2
3
1910
3838 CHANSON Dr
489900
3
2
2856
3838 West 58TH Pl
330000
4
2
1651
3919 West 58TH Pl
349000
3
2
1466
3954 FAIRWAY Blvd
514900
3
2.25
2018
4044 OLYMPIAD Dr
649000
4
2.5
3019
4336 DON LUIS Dr
474000
2
2.5
2188
4421 West 59TH St
460000
3
2
1519
4518 WHELAN Pl
388000
2
1.5
1403
4670 West 63RD St
259500
3
2
1491
5000 ANGELES
VISTA Blvd
678800
5
4
3808
46
One Parameter Model
• The mean
Y  415.69
Yˆ  b0  Y
SSE  341683
“How much is that house worth?”
$415,689
Use 1 df to say that
47
Adding More Parameters
• We have more information than this
– We might as well use it
– Add a linear function of size
(square feet) (x1)
  Ŷ = b0 + b1x1
48
Alternative Expression
• Estimate of Y (expected value of Y)
  Ŷ = b0 + b1x1
• Value of Y
  Yi = b0 + b1xi1 + ei
49
Estimating the Model
• We can estimate this model in four different,
equivalent ways
– Provides more than one way of thinking about it
1. Estimating the slope which minimises SSE
2. Examining the proportional reduction in SSE
3. Calculating the covariance
4. Looking at the efficiency of the predictions
50
Estimate the Slope to Minimise
SSE
51
Estimate the Slope
• Stage 1
– Draw a scatterplot
– x-axis at mean
• Not at zero
• Mark errors on it
– Called ‘residuals’
– Sum and square these to find SSE
52
[Figure: scatterplot of LAST SALE PRICE against SQFT, with the fitted values (mean line) marked]
53
[Figure: the scatterplot again, with the residuals from the mean line marked]
• Add another slope to the chart
– Redraw residuals
– Recalculate SSE
– Move the line around to find slope which
minimises SSE
• Find the slope
55
• First attempt:
56
• Any straight line can be defined with
two parameters
– The location (height) of the slope
• b0
– Sometimes called a
– The gradient of the slope
• b1
57
• Gradient
b1 units
1 unit
58
• Height
b0 units
59
• Height
• If we fix slope to zero
– Height becomes mean
– Hence mean is b0
• Height is defined as the point that the
slope hits the y-axis
– The constant
– The y-intercept
60
• Why the constant?
  – b0x0
  – Where x0 is 1.00 for every case
    • i.e. x0 is constant
  – Implicit in Stata
    • (And SPSS, SAS, R)
  – Some packages force you to make it explicit
  – (Later on we'll need to make it explicit)

  beds (x1)   x0   £ (000s)
  1           1    77
  2           1    74
  1           1    88
  3           1    62
  5           1    90
  5           1    136
  2           1    35
  5           1    134
  4           1    138
  1           1    55
61
• Why the intercept?
– Where the regression line intercepts the yaxis
– Sometimes called y-intercept
62
Finding the Slope
• How do we find the values of b0 and b1?
– Start with we jiggle the values, to find the
best estimates which minimise SSE
– Iterative approach
• Computer intensive – used to matter, doesn’t
really any more
• (With fast computers and sensible search
algorithms – more on that later)
63
• Start with
– b0=416 (mean)
– b1=0.5 (nice round number)
• SSE = 365,774
– b0=300, b1=0.5, SSE=341,683
– b0=300, b1=0.6, SSE=310,240
– b0=300, b1=0.8, SSE=264,573
– b0=300, b1=1, SSE=301, 797
– b0=250, b1=1, SSE=255,366
– …..
64
• Quite a long time later
– b0 = 216.357
– b1 = 1.084
– SSE = 145,636.78
• Gives the position of the
– Regression line (or)
– Line of best fit
• Better than guessing
• Not necessarily the only method
– But it is OLS, so it is the best (it is BLUE)
65
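A hedged Stata sketch of what the iteration is doing (not in the original slides): for any candidate (b0, b1) pair you can compute the SSE yourself; the variable names price and sqft are assumptions.

  * SSE for one candidate (b0, b1) pair - price and sqft are assumed variable names
  scalar b0 = 216.357
  scalar b1 = 1.084
  capture drop e2
  generate double e2 = (price - (b0 + b1*sqft))^2
  quietly summarize e2
  display "SSE = " r(sum)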
[Figure: actual price and predicted price plotted against SQFT / number of bedrooms, with the fitted regression line]
66
• We now know
– A zero square metre house is worth about
$216,000
– Adding a square meter adds $1,080
• Told us two things
– Don’t extrapolate to meaningless values of
x-axis
– Constant is not necessarily useful
• It is necessary to estimate the equation
67
Exercise 2a, 2b
68
Standardised Regression Line
• One big but:
– Scale dependent
• Values change
– £ to €, inflation
• Scales change
– £, £000, £00?
• Need to deal with this
69
• Don’t express in ‘raw’ units
– Express in SD units
– x1=183.82
– y=114.637
• b1 = 1.103
• We increase x1 by 1, and Ŷ increases by
1.084
  1.084 = (1.084 / 114.637) SDs = 0.00945 SDs
• So we increase x1 by 1 and Ŷ increases
by 0.0094 SDs
70
• Similarly, 1 unit of x1 = 1/69.017 SDs
– Increase x1 by 1 SD
– Ŷ increases by 1.103  (69.017/1) =
76.126
• Put them both together
  standardised b1 = (b1 × σx1) / σy
71
  = (1.080 × 69.071) / 114.637
  = 0.653
• The standardised regression line
– Change (in SDs) in Ŷ associated with a
change of 1 SD in x1
• A different route to the same answer
– Standardise both variables (divide by SD)
– Find line of best fit
72
• The standardised regression line has a
special name
The Correlation Coefficient
(r)
(r stands for ‘regression’, but more on that
later)
• Correlation coefficient is a standardised
regression slope
– Relative change, in terms of SDs
73
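A hedged Stata sketch of this equivalence (not in the slides; price and sqft are assumed variable names): rescale the slope by the two SDs and compare it with the correlation.

  * standardised slope versus correlation
  quietly regress price sqft
  quietly summarize sqft
  scalar sdx = r(sd)
  quietly summarize price
  scalar sdy = r(sd)
  display "standardised slope = " _b[sqft]*sdx/sdy
  quietly correlate price sqft
  display "correlation        = " r(rho)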
Exercise 2c
74
Proportional Reduction in
Error
75
Proportional Reduction in Error
• We might be interested in the level of
improvement of the model
– How much less error (as proportion) do we
have
– Proportional Reduction in Error (PRE)
• Mean only
– Error(model 0) = 341,683
• Mean + slope
– Error(model 1) = 196,046
76
  PRE = (ERROR(0) − ERROR(1)) / ERROR(0)

  PRE = 1 − ERROR(1) / ERROR(0)

  PRE = 1 − 196046 / 341683

  PRE = 0.426
77
• But we squared all the errors in the first
place
– So we could take the square root
  √0.426 = 0.653
• This is the correlation coefficient
• Correlation coefficient is the square root
of the proportion of variance explained
78
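A quick sketch of PRE using the sums of squares Stata stores after regress (again assuming variables called price and sqft):

  * PRE and its square root from the stored sums of squares
  quietly regress price sqft
  display "PRE       = " 1 - e(rss)/(e(rss) + e(mss))
  display "sqrt(PRE) = " sqrt(1 - e(rss)/(e(rss) + e(mss)))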
Standardised Covariance
79
Standardised Covariance
• We are still iterating
– Need a ‘closed-form’ equation
– Equation to solve to get the parameter
estimates
• Answer is a standardised covariance
– A variable has variance
– Amount of ‘differentness’
• We have used SSE so far
80
• SSE varies with N
– Higher N, higher SSE
• Divide by N
– Gives SSE per person (or house)
– (Actually N – 1, we have lost a df to the
mean)
• Gives us the variance
• Same as SD2
– We thought of SSE as a scattergram
• Y plotted against X
– (repeated image follows)
81
[Figure: the scattergram of Y against X, repeated]
82
• Or we could plot Y against Y
– Axes meet at the mean (415)
– Draw a square for each point
– Calculate an area for each square
– Sum the areas
• Sum of areas
– SSE
• Sum of areas divided by N
– Variance
83
Plot of Y against Y
[Figure: Y plotted against Y, with the axes crossing at the mean]
84
Draw Squares
[Figure: Y plotted against Y with a square drawn for each point. For example, 138 − 88.9 = 40.1 on each side gives area = 40.1 × 40.1 = 1608.1; 35 − 88.9 = −53.9 on each side gives area = (−53.9) × (−53.9) = 2905.21]
85
• What if we do the same procedure
– Instead of Y against Y
– Y against X
•
•
•
•
Draw rectangles (not squares)
Sum the area
Divide by N - 1
This gives us the variance of x with y
– The Covariance
– Shortened to Cov(x, y)
86
87
[Figure: Y plotted against X with a rectangle drawn for each point. For example, 55 − 88.9 = −33.9 and 1 − 3 = −2 give area = (−33.9) × (−2) = 67.8; 138 − 88.9 = 49.1 and 4 − 3 = 1 give area = 49.1 × 1 = 49.1]
88
• More formally (and easily)
• We can state what we are doing as an
equation
– Where Cov(x, y) is the covariance
  Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)
• Cov(x,y)=5165
• What do points in different sectors do
to the covariance?
89
• Problem with the covariance
– Tells us about two things
– The variance of X and Y
– The covariance
• Need to standardise it
– Like the slope
• Two ways to standardise the covariance
– Standardise the variables first
• Subtract from mean and divide by SD
– Standardise the covariance afterwards
90
• First approach
– Much more computationally expensive
• Too much like hard work to do by hand
– Need to standardise every value
• Second approach
– Much easier
– Standardise the final value only
• Need the combined variance
– Multiply two variances
– Find square root (were multiplied in first
place)
91
• Standardised covariance

  = Cov(x, y) / √[ Var(x) × Var(y) ]
  = 5165 / (69.02 × 114.64)
  = 0.653
92
• The correlation coefficient
– A standardised covariance is a correlation
coefficient
  r = Covariance / √(variance × variance)
93
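The same standardisation can be checked in Stata (a sketch, not from the slides; price and sqft are assumed variable names):

  * standardise the covariance by hand and compare with the correlation
  quietly correlate price sqft, covariance
  scalar cxy = r(cov_12)
  quietly summarize sqft
  scalar vx = r(Var)
  quietly summarize price
  scalar vy = r(Var)
  display "standardised covariance = " cxy/sqrt(vx*vy)
  quietly correlate price sqft
  display "correlation             = " r(rho)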
• Expanded …
  r = [ Σ(x − x̄)(y − ȳ) / (N − 1) ] / √[ (Σ(x − x̄)² / (N − 1)) × (Σ(y − ȳ)² / (N − 1)) ]
94
• This means …
– We now have a closed form equation to
calculate the correlation
– Which is the standardised slope
– Which we can use to calculate the
unstandardised slope
95
We know that:
  r = (b1 × σx1) / σy

We know that:

  b1 = (r × σy) / σx1
96
  b1 = (r × σy) / σx1
  b1 = (0.659 × 114.64) / 69.017
  b1 = 1.080
• So value of b1 is the same as the iterative
approach
97
• The intercept
– Just while we are at it
• The variables are centred at zero
– We subtracted the mean from both
variables
– Intercept is zero, because the axes cross at
the mean
98
• Add mean of y to the constant
– Adjusts for centring y
• Subtract mean of x
– But not the whole mean of x
– Need to correct it for the slope
  c = ȳ − b1 × x̄1
  c = 415.7 − 1.08 × 183.81
  c = 216.35
99
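A sketch of the closed-form slope and intercept in Stata (price and sqft are assumed variable names):

  * slope and intercept from r, the SDs and the means
  quietly correlate price sqft
  scalar rho = r(rho)
  quietly summarize sqft
  scalar mx = r(mean)
  scalar sx = r(sd)
  quietly summarize price
  scalar my = r(mean)
  scalar sy = r(sd)
  display "b1 = " rho*sy/sx
  display "b0 = " my - rho*sy/sx*mx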
Accuracy of Prediction
100
One More (Last One)
• We have one more way to calculate the
correlation
– Looking at the accuracy of the prediction
• Use the parameters
– b0 and b1
– To calculate a predicted value for each
case
101
  Sqm     Actual Price   Predicted Price
  239.2   605.0          475.8
  177.4   400.0          408.8
  265.3   529.5          504.1
  153.4   315.0          382.7
  136.2   341.0          364.0
  187.5   525.0          419.7
  280.5   585.0          520.5
  203.3   430.0          436.8
  141.1   436.0          369.4
  130.3   390.0          357.7
• Plot actual price
against
predicted price
– From the model
102
[Figure: Predicted Value plotted against Actual Value (LAST SALE PRICE)]
103
• r = 0.653
• The correlation between actual and predicted
value
• Seems a futile thing to do
– And at this stage, it is
– But later on, we will see why
104
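A sketch of the "accuracy of prediction" route in Stata (price and sqft are assumed variable names); it uses predict, which is introduced properly later:

  * correlate the outcome with its predicted values
  quietly regress price sqft
  capture drop yhat
  predict double yhat, xb
  correlate price yhat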
Some More Formulae
• For hand calculation
  r = Σxy / √(Σx² × Σy²)

• Point biserial

  r = [ (My1 − My0) × √(PQ) ] / sdy
105
• Phi (φ)
– Used for 2 dichotomous variables

                   Vote P    Vote Q
  Homeowner        A: 19     B: 54
  Not homeowner    C: 60     D: 53

  r = (BC − AD) / √[ (A + B)(C + D)(A + C)(B + D) ]
106
• Problem with the phi correlation
– Unless Px= Py (or Px = 1 – Py)
• Maximum (absolute) value is < 1.00
• Tetrachoric correlation can be used to correct
this
• Rank (Spearman) correlation
– Used where data are ranked

  r = 1 − 6Σd² / [ n(n² − 1) ]
107
Summary
• Mean is an OLS estimate
– OLS estimates are BLUE
• Regression line
– Best prediction of outcome from predictor
– OLS estimate (like mean)
• Standardised regression line
– A correlation
108
• Four ways to think about a correlation
– 1. Standardised regression line
– 2. Proportional Reduction in Error (PRE)
– 3. Standardised covariance
– 4. Accuracy of prediction
109
Regression and Correlation in
Stata
• Correlation:
• correlate x y
• correlate x y , cov
• regress y x
• Or
• regress price sqm
110
Post-Estimation
• Stata commands ‘leave behind’
something
• You can run post-estimation commands
– They mean ‘from the last regression’
• Get predicted values:
– predict my_preds
• Get residuals:
– predict my_res, residuals
  (residuals comes after the comma, so it's an option)
111
Graphs
• Scatterplot
• scatter price beds
• Regression line
– lfit price beds
• Both graphs
– twoway (scatter price beds)
(lfit price beds)
112
• What happens if you run reg without a
predictor?
– regress price
113
Exercises
114
Lesson 3: Samples to
Populations – Standard Errors
and Statistical Significance
115
The Problem
• In Social Sciences
– We investigate samples
• Theoretically
– Randomly taken from a specified
population
– Every member has an equal chance of
being sampled
– Sampling one member does not alter the
chances of sampling another
• Not the case in (say) physics, biology,
etc.
116
Population
• But it’s the population that we are
interested in
– Not the sample
– Population statistic represented with Greek
letter
– Hat means ‘estimate’
  b estimates β (β̂)
  x̄ estimates μx (μ̂x)
117
• Sample statistics (e.g. mean) estimate
population parameters
• Want to know
– Likely size of the parameter
– If it is > 0
118
Sampling Distribution
• We need to know the sampling
distribution of a parameter estimate
– How much does it vary from sample to
sample
• If we make some assumptions
– We can know the sampling distribution of
many statistics
– Start with the mean
119
Sampling Distribution of the
Mean
• Given
– Normal distribution
– Random sample
– Continuous data
• Mean has a known sampling distribution
– Repeatedly sampling will give a known
distribution of means
– Centred around the true (population) mean
()
120
Analysis Example: Memory
• Difference in memory for different
words
– 10 participants given a list of 30 words to
learn, and then tested
– Two types of word
• Abstract: e.g. love, justice
• Concrete: e.g. carrot, table
121
  Concrete   Abstract   Diff (x)
  12         4          8
  11         7          4
  4          6          -2
  9          12         -3
  8          6          2
  12         10         2
  9          8          1
  8          5          3
  12         10         2
  8          4          4

  x̄ = 2.1,  σx = 3.11,  N = 10
122
Confidence Intervals
• This means
– If we know the mean in our sample
– We can estimate where the mean in the
population () is likely to be
• Using
– The standard error (se) of the mean
– Represents the standard deviation of the
sampling distribution of the mean
123
[Figure: the sampling distribution of the mean; 1 SD contains 68%, almost 2 SDs contain 95%]
124
• We know the sampling distribution of
the mean
– t distributed if N < 30
– Normal with large N (>30)
• Asymptotically normal
• Know the range within means from
other samples will fall
– Therefore the likely range of μ

  se(x̄) = σx / √n
125
• Two implications of equation
– Increasing N decreases SE
• But only a bit (SE only halves if N is 4 times bigger)
– Decreasing SD decreases SE
• Calculate Confidence Intervals
– From standard errors
• 95% is a standard level of CI
– In 95% of samples, the true mean will lie within
the 95% CIs
– In large samples: 95% CI = 1.96  SE
– In smaller samples: depends on t
distribution (df=N-1=9)
126
  x̄ = 2.1,  σx = 3.11,  N = 10

  se(x̄) = σx / √n = 3.11 / √10 = 0.98
127
  95% CI = 2.26 × 0.98 = 2.22

  x̄ − CI ≤ μ ≤ x̄ + CI
  −0.12 ≤ μ ≤ 4.32
128
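A Stata sketch of the same SE and CI calculation (not in the slides), assuming a variable called diff holding the ten difference scores:

  * standard error and 95% CI for the mean of diff
  quietly summarize diff
  scalar se = r(sd)/sqrt(r(N))
  scalar tcrit = invttail(r(N)-1, 0.025)
  display "se = " se
  display "95% CI: " r(mean) - tcrit*se " to " r(mean) + tcrit*se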
What is a CI?
• (For 95% CI):
• 95% chance that the true (population)
value lies within the confidence
interval? No;
• 95% of samples, true mean will land
within the confidence interval?
129
Significance Test
• Probability that μ is a certain value
– Almost always 0
• Doesn't have to be though
• We want to test the hypothesis that the
difference is equal to 0
– i.e. find the probability of this difference
occurring in our sample IF μ = 0
– (Not the same as the probability that μ = 0)
130
• Calculate SE, and then t
– t has a known sampling distribution
– Can test probability that a certain value is
included
  t = x̄ / se(x̄)

  t = 2.1 / 0.98 = 2.14

  p = 0.061
131
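The same t and two-tailed p by hand in Stata (a sketch; numbers from the slide, df = 9):

  * t and p for the mean difference
  display "t = " 2.1/0.98
  display "p = " 2*ttail(9, 2.1/0.98)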
Other Parameter Estimates
• Same approach
– Prediction, slope, intercept, predicted
values
– At this point, prediction and slope are the
same
• Won’t be later on
• One predictor only
– More complicated with > 1
132
Testing the Degree of
Prediction
• Prediction is correlation of Y with Ŷ
– The correlation – when we have one IV
• Use F, rather than t
• Started with SSE for the mean only
– This is SStotal
– Divide this into SSresidual
– SSregression
• SStot = SSreg + SSres
133
  F = (SSreg / df1) / (SSres / df2)

  df1 = k
  df2 = N − k − 1
134
• Back to the house prices
– Original SSE (SStotal) = 341683
– SSresidual = 196046
• What is left after our model
– SSregression = 341683– 196046= 145636
• What our model explains
135
  F = (SSreg / df1) / (SSres / df2)

  F = (145636 / 1) / (196046 / 25) = 18.57

  df1 = k = 1
  df2 = N − k − 1 = 25
136
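A sketch that pulls the same pieces out of Stata's stored results after regress (price and sqft are assumed variable names), avoiding hand-copied sums of squares:

  * model, residual and total SS, F and p from stored results
  quietly regress price sqft
  display "SS total    = " e(mss) + e(rss)
  display "SS model    = " e(mss)
  display "SS residual = " e(rss)
  scalar Fstat = (e(mss)/e(df_m))/(e(rss)/e(df_r))
  display "F = " Fstat ", p = " Ftail(e(df_m), e(df_r), Fstat)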
• F = 18.6, df = 1, 25, p = 0.0002
– Can reject H0
• H0: Prediction is not better than chance
– A significant effect
137
Statistical Significance:
What does a p-value (really)
mean?
138
A Quiz
• Six questions, each true or false
• Write down your answers (if you like)
• An experiment has been done. Carried out
perfectly. All assumptions perfectly satisfied.
Absolutely no problems.
• P = 0.01
– Which of the following can we say?
139
1. You have absolutely disproved the null
hypothesis (that is, there is no
difference between the population
means).
140
2. You have found the probability of the
null hypothesis being true.
141
3. You have absolutely proved your
experimental hypothesis (that there is
a difference between the population
means).
142
4. You can deduce the probability of the
experimental hypothesis being true.
143
5. You know, if you decide to reject the
null hypothesis, the probability that
you are making the wrong decision.
144
6. You have a reliable experimental
finding in the sense that if,
hypothetically, the experiment were
repeated a great number of times, you
would obtain a significant result on
99% of occasions.
145
OK, What is a p-value
• Cohen (1994)
“[a p-value] does not tell us what we
want to know, and we so much want to
know what we want to know that, out
of desperation, we nevertheless believe
it does” (p 997).
146
OK, What is a p-value
• Sorry, didn’t answer the question
• It’s “The probability of obtaining a
result as or more extreme than the
result we have in the study, given that
the null hypothesis is true”
• Not probability the null hypothesis is
true
147
A Bit of Notation
• Not because we like notation
– But we have to say a lot less
• Probability – P
• Null hypothesis is true – H
• Result (data) – D
• Given – |
148
What’s a P Value
• P(D|H)
– Probability of the data occurring if the null
hypothesis is true
• Not
• P(H|D) (what we want to know)
– Probability that the null hypothesis is true,
given that we have the data = p(H)
• P(H|D) ≠ P(D|H)
149
• What is probability you are prime minister
– Given that you are British
– P(M|B)
– Very low
• What is probability you are British
– Given you are prime minister
– P(B|M)
– Very high
• P(M|B) ≠ P(B|M)
150
• There’s been a murder
– Someone murdered an instructor (perhaps
they talked too much)
• The police have DNA
• The police have your DNA
– They match(!)
• DNA matches 1 in 1,000,000 people
• What’s the probability you didn’t do the
murder, given the DNA match (H|D)
151
• Police say:
– P(D|H) = 1/1,000,000
• Luckily, you have Jeremy on your defence
team
• We say:
– P(D|H) ≠ P(H|D)
• Probability that someone matches the
DNA, who didn’t do the murder
– Incredibly high
152
Back to the Questions
• Haller and Kraus (2002)
– Asked those questions of groups in
Germany
– Psychology Students
– Psychology lecturers and professors (who
didn’t teach stats)
– Psychology lecturers and professors (who
did teach stats)
153
1. You have absolutely disproved the null
hypothesis (that is, there is no difference
between the population means).
• True
  – 34% of students
  – 15% of professors/lecturers
  – 10% of professors/lecturers teaching statistics
• False
  – We have found evidence against the null
    hypothesis
154
2. You have found the probability of the
null hypothesis being true.
– 32% of students
– 26% of professors/lecturers
– 17% of professors/lecturers teaching
statistics
• False
• We don't know
155
3. You have absolutely proved your
experimental hypothesis (that there is a
difference between the population means).
– 20% of students
– 13% of professors/lecturers
– 10% of professors/lecturers teaching statistics
• False
156
4. You can deduce the probability of the
experimental hypothesis being true.
– 59% of students
– 33% of professors/lecturers
– 33% of professors/lecturers teaching
statistics
• False
157
5. You know, if you decide to reject the null
hypothesis, the probability that you are
making the wrong decision.
• 68% of students
• 67% of professors/lecturers
• 73% of professors/lecturers teaching statistics
• False
• Can be worked out
  – P(replication)
158
6. You have a reliable experimental finding
in the sense that if, hypothetically, the
experiment were repeated a great
number of times, you would obtain a
significant result on 99% of occasions.
– 41% of students
– 49% of professors/lecturers
– 37% of professors/lecturers teaching statistics
• False
• Another tricky one
  – It can be worked out
159
One Last Quiz
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.05
• You replicate the study exactly
– What is probability you find p < 0.05?
160
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.01
• You replicate the study exactly
– What is probability you find p < 0.05?
161
• Significance testing creates boundaries
and gaps where none exist.
• Significance testing means that we find
it hard to build upon knowledge
– we don’t get an accumulation of
knowledge
162
• Yates (1951)
"the emphasis given to formal tests of significance
... has resulted in ... an undue concentration of
effort by mathematical statisticians on
investigations of tests of significance applicable
to problems which are of little or no practical
importance ... and ... it has caused scientific
research workers to pay undue attention to the
results of the tests of significance ... and too
little to the estimates of the magnitude of the
effects they are investigating."
163
Testing the Slope
• Same idea as with the mean
– Estimate 95% CI of slope
– Estimate significance of difference from a
value (usually 0)
• Need to know the SD of the slope
– Similar to SD of the mean
164
  sy.x = √[ Σ(Y − Ŷ)² / (N − k − 1) ]

  sy.x = √[ SSres / (N − k − 1) ]

  sy.x = √(5921 / 8) = 27.2
165
• Similar to equation for SD of mean
• Then we need standard error
- Similar (ish)
• When we have standard error
– Can go on to 95% CI
– Significance of difference
166
  se(by.x) = sy.x / √[ Σ(x − x̄)² ]

  se(by.x) = 27.2 / √26.9 = 5.24
167
• Confidence Limits
• 95% CI
– t dist with N - k - 1 df is 2.31
– CI = 5.24 × 2.31 = 12.06
• 95% confidence limits

  14.8 − 12.1 ≤ β ≤ 14.8 + 12.1
  2.7 ≤ β ≤ 26.9
168
• Significance of difference from zero
– i.e. probability of getting result if β = 0
• Not probability that β = 0

  t = b / se(b) = 14.7 / 5.2 = 2.81
  df = N − k − 1 = 8
  p = 0.02
• This probability is (of course) the same
as the value for the prediction
169
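The slope test is already in the regress output, but here is a sketch from the stored results (price and sqft are assumed variable names):

  * t and p for the slope from the stored coefficient and SE
  quietly regress price sqft
  scalar tstat = _b[sqft]/_se[sqft]
  display "t = " tstat
  display "p = " 2*ttail(e(df_r), abs(tstat))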
Testing the Standardised
Slope (Correlation)
• Correlation is bounded between –1 and +1
– Does not have symmetrical distribution, except
around 0
• Need to transform it
– Fisher z’ transformation – approximately
normal
  z′ = 0.5 × [ ln(1 + r) − ln(1 − r) ]

  SEz = 1 / √(n − 3)
170
  z′ = 0.5 × [ ln(1 + 0.706) − ln(1 − 0.706) ]
  z′ = 0.879

  SEz = 1 / √(n − 3) = 1 / √(10 − 3) = 0.38

• 95% CIs
– 0.879 – 1.96 × 0.38 = 0.13
– 0.879 + 1.96 × 0.38 = 1.62
171
• Transform back to correlation
e 1
r  2y
e 1
2y
• 95% CIs = 0.13 to 0.92
• Very wide
– Because of small sample size
– Maybe that’s why CIs are not reported?
172
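A sketch of the whole Fisher z′ calculation in Stata for r = 0.706, n = 10 (not in the slides; tanh() is used as the back-transformation):

  * Fisher z CI for r, transformed back to the r scale
  scalar z = 0.5*(ln(1 + 0.706) - ln(1 - 0.706))
  scalar se = 1/sqrt(10 - 3)
  display "z = " z ", se = " se
  display "95% CI for r: " tanh(z - 1.96*se) " to " tanh(z + 1.96*se)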
Using Excel
• Functions in excel
– Fisher() – to carry out Fisher
transformation
– Fisherinv() – to transform back to
correlation
173
The Others
• Same ideas for calculation of CIs and
SEs for
– Predicted score
– Gives expected range of values given X
• Same for intercept
– But we have probably had enough
174
One more tricky thing
• (Don’t worry if you don’t understand)
• For means, regression estimates, etc
– Estimate
• 1.0000
– 95% confidence intervals
• 0.0000, 2.0000
– P = 0.05000
• They match
175
• For correlations, odds ratios, etc
– No longer match
• 95% CIs
– 0.0000, 0.50000
• P-value
– 0.052000
• Because of the sampling distribution of the
mean
– Does not depend on the value
• The sampling distribution of a proportion
– Does depend on the value
– More certainty around 0.9 than around 0.00.
176
Lesson 4: Introducing Multiple
Regression
177
Residuals
• We said
Y = b0 + b1x1
• We could have said
Yi = b0 + b1xi1 + ei
• We ignored the i on the Y
• And we ignored the ei
– It’s called error, after all
• But it isn’t just error
– Trying to tell us something
178
What Error Tells Us
• Error tells us that a case has a different
score for Y than we predict
– There is something about that case
• Called the residual
– What is left over, after the model
• Contains information
– Something is making the residual  0
– But what?
179
[Figure: actual price and predicted price plotted against SQFT / number of bedrooms, with the fitted regression line (repeated)]
181
• The residual (+ the mean) is the
expected value of Y
– If all cases were equal on X
• It is the value of Y, controlling for X
• Other words:
– Holding constant
– Partialling
– Residualising (residualised scores)
– Conditioned on
182
• Sometimes adjustment is enough on its own
– Measure performance against criteria
• Teenage pregnancy rate
– Measure pregnancy and abortion rate in areas
– Control for socio-economic deprivation, religion,
rural/urban and anything else important
– See which areas have lower teenage pregnancy
and abortion rate, given same level of deprivation
• Value added education tables
– Measure school performance
– Control for initial intake
183
  Sqm     Price   Predicted   Residual   Adj Value (mean + resid)
  239.2   605.0   475.77      129.23     544.8
  177.4   400.0   408.78      -8.78      406.8
  265.3   529.5   504.08      25.42      441.0
  153.4   315.0   382.69      -67.69     347.9
  136.2   341.0   364.05      -23.05     392.6
  187.5   525.0   419.66      105.34     520.9
  280.5   585.0   520.51      64.49      480.1
  203.3   430.0   436.79      -6.79      408.8
  141.1   436.0   369.39      66.61      482.2
  130.3   390.0   357.70      32.30      447.9
184
Control?
• In experimental research
– Use experimental control
– e.g. same conditions, materials, time of
day, accurate measures, random
assignment to conditions
• In non-experimental research
– Can’t use experimental control
– Use statistical control instead
185
Analysis of Residuals
• What predicts differences in crime rate
– After controlling for socio-economic
deprivation
– Number of police?
– Crime prevention schemes?
– Rural/Urban proportions?
– Something else
• This is (mostly) what multiple
regression is about
186
• Exam performance
– Consider number of books a student read
(books)
– Number of lectures (max 20) a student
attended (attend)
• Books and attend as IV, grade as
outcome
187
  Books   Attend   Grade
  0       9        45
  1       15       57
  0       10       45
  2       16       51
  4       10       65
  4       20       88
  1       11       44
  4       20       87
  3       15       89
  0       15       59

  First 10 cases
188
• Use books as IV
– R=0.492, F=12.1, df=1, 28, p=0.001
– b0=52.1, b1=5.7
– (Intercept makes sense)
• Use attend as IV
– R=0.482, F=11.5, df=1, 38, p=0.002
– b0=37.0, b1=1.9
– (Intercept makes less sense)
189
[Figure: scatterplot of Grade (out of 100) against Books]
190
[Figure: scatterplot of Grade against Attend]
191
Problem
• Use R2 to give proportion of shared
variance
– Books = 24%
– Attend = 23%
• So we have explained 24% + 23% =
47% of the variance
– NO!!!!!
192
• Look at the correlation matrix

            BOOKS   ATTEND   GRADE
  BOOKS     1
  ATTEND    0.44    1
  GRADE     0.49    0.48     1
• Correlation of books and attend is
(unsurprisingly) not zero
– Some of the variance that books shares
with grade, is also shared by attend
193
• I have access to 2 cars
• My wife has access to 2 cars
– We have access to four cars?
– No. We need to know how many of my 2
cars are shared
• Similarly with regression
– But we can do this with the residuals
– Residuals are what is left after (say) books
– See if residual variance is explained by
attend
– Can use this new residual variance to
calculate SSres, SStotal and SSreg
194
• Well. Almost.
– This would give us correct values for SS
– Would not be correct for slopes, etc
• Because assumes that the variables
have a causal priority
– Why should attend have to take what is
left from books?
– Why should books have to take what is left
by attend?
• Use OLS again; take variance they
share
195
• Simultaneously estimate 2 parameters
– b1 and b2
– Y = b0 + b1x1 + b2x2
– x1 and x2 are IVs
• Shared variance
• Not trying to fit a line any more
– Trying to fit a plane
• Can solve iteratively
– Closed form equations better
– But they are unwieldy
196
[Figure: 3D scatterplot of y against x1 and x2 (2 points only)]
197
[Figure: the regression plane for y on x1 and x2, showing b0, b1 and b2]
198
199
Increasing Power
• What if the predictors don’t correlate?
• Regression is still good
– It increases the power to detect effects
– (More on power later)
• Less variance left over
• When do we know the two predictors
don’t correlate?
200
(Really) Ridiculous Equations

  b1 = [ Σ(y − ȳ)(x1 − x̄1) × Σ(x2 − x̄2)² − Σ(y − ȳ)(x2 − x̄2) × Σ(x1 − x̄1)(x2 − x̄2) ]
       / [ Σ(x1 − x̄1)² × Σ(x2 − x̄2)² − (Σ(x1 − x̄1)(x2 − x̄2))² ]

  b2 = [ Σ(y − ȳ)(x2 − x̄2) × Σ(x1 − x̄1)² − Σ(y − ȳ)(x1 − x̄1) × Σ(x1 − x̄1)(x2 − x̄2) ]
       / [ Σ(x2 − x̄2)² × Σ(x1 − x̄1)² − (Σ(x1 − x̄1)(x2 − x̄2))² ]

  b0 = ȳ − b1x̄1 − b2x̄2
201
• The good news
– There is an easier way
• The bad news
– It involves matrix algebra
• The good news
– We don’t really need to know how to do it
202
• We’re not programming computers
– So we usually don’t care
• Very, very occasionally it helps to know
what the computer is doing
203
Back to the Good News
• We can calculate the standardised
parameters as
B=Rxx-1 x Rxy
• Where
– B is the vector of regression weights
– Rxx-1 is the inverse of the correlation matrix
of the independent (x) variables
– Rxy is the vector of correlations of the x
variables with y
204
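A sketch in Stata using the books/attend/grade correlations given earlier (0.44, 0.49, 0.48); not part of the original slides:

  * standardised estimates from the correlation matrices
  matrix Rxx = (1, 0.44 \ 0.44, 1)
  matrix Rxy = (0.49 \ 0.48)
  matrix B = invsym(Rxx) * Rxy
  matrix list B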
Exercise 4.2
205
Exercises
• Exercise 4.1
– Grades data in Excel
• Exercise 4.2
– Repeat in Stata
• Exercise 4.3
– Zero correlation
• Exercise 4.4
– Repeat therapy data
• Exercise 4.5
– PTSD in families.
206
Lesson 5: More on Multiple
Regression
207
Contents
• More on parameter estimates
– Standard errors of coefficients
• R, R2, adjusted R2
• Extra bits
– Suppressors
– Decisions about control variables
– Standardised estimates > 1
– Variable entry techniques
208
More on Parameter Estimates
209
Parameter Estimates
• Parameter estimates (b1, b2 … bk) were
standardised
– Because we analysed a correlation matrix
• Represent the correlation of each IV
with the outcome
– When all other IVs are held constant
210
• Can also be unstandardised
• Unstandardised represent the unit (rather
than SD’s) change in the outcome
associated with a 1 unit change in the
IV
– When all the other variables are held
constant
• Parameters have standard errors
associated with them
– As with one IV
– Hence t-test, and associated probability
can be calculated
• Trickier than with one IV
211
Standard Error of Regression
Coefficient
• Standardised is easier
  SEi = √[ (1 − R²Y) / (n − k − 1) × 1 / (1 − R²i) ]
– R2i is the value of R2 when all other predictors are
used as predictors of that variable
• Note that if R2i = 0, the equation is the same as for
previous
212
Multiple R
213
Multiple R
• The degree of prediction
– R (or Multiple R)
– No longer equal to b
• R2 Might be equal to the sum of squares
of B
– Only if all x’s are uncorrelated
214
In Terms of Variance
• Can also think of R2 in terms of
variance explained.
– Each IV explains some variance in the
outcome
– The IVs share some of their variance
• Can’t share the same variance twice
215
[Figure: Venn diagram. The total variance of Y = 1. Variance in Y accounted for by x1: r²x1y = 0.36. Variance in Y accounted for by x2: r²x2y = 0.36]
216
• In this model
– R² = r²yx1 + r²yx2
– R² = 0.36 + 0.36 = 0.72
– R = √0.72 = 0.85
• But
– If x1 and x2 are correlated
– No longer the case
217
[Figure: Venn diagram. The total variance of Y = 1. Variance in Y accounted for by x1: r²x1y = 0.36. Variance in Y accounted for by x2: r²x2y = 0.36. Variance shared between x1 and x2 (not equal to rx1x2)]
218
• So
– We can no longer sum the r2
– Need to sum them, and subtract the
shared variance – i.e. the correlation
• But
– It’s not the correlation between them
– It’s the correlation between them as a
proportion of the variance of Y
• Two different ways
219
• Based on estimates

  R² = b1ryx1 + b2ryx2

• If rx1x2 = 0
– rxy = bx1
– Equivalent to r²yx1 + r²yx2
220
• Based on correlations

  R² = ( r²yx1 + r²yx2 − 2 × ryx1 × ryx2 × rx1x2 ) / ( 1 − r²x1x2 )

• If rx1x2 = 0
– Equivalent to r²yx1 + r²yx2
221
• Can also be calculated using methods
we have seen
– Based on PRE (predicted value)
– Based on correlation with prediction
• Same procedure with >2 IVs
222
Adjusted R2
• R2 is on average an overestimate of
population value of R2
– Any x will not correlate 0 with Y
– Any variation away from 0 increases R
– Variation from 0 more pronounced with
lower N
• Need to correct R2
– Adjusted R2
223
• Calculation of Adj. R²

  Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)
• 1 – R2
– Proportion of unexplained variance
– We multiple this by an adjustment
• More variables – greater adjustment
• More people – less adjustment
224
N 1
N  k 1
N  20, k  3
20  1
19

 1.1875
20  3  1 16
N  10, k  8
N  10, k  3
10  1
9
 9
10  8  1 1
10  1
9
  1.5
10  3  1 6
225
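A one-line check of the adjustment in Stata (a sketch; the R² of 0.5 is a hypothetical value):

  * adjustment factor and adjusted R-squared for R2 = 0.5, N = 20, k = 3
  display "(N-1)/(N-k-1) = " (20-1)/(20-3-1)
  display "adjusted R2   = " 1 - (1-0.5)*(20-1)/(20-3-1)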
Extra Bits
• Some stranger things that can
happen
– Counter-intuitive
226
Suppressor variables
• Can be hard to understand
– Very counter-intuitive
• Definition
– A predictor which increases the size of the
parameters associated with other
predictors above the size of their
correlations
227
• An example (based on Horst, 1941)
– Success of trainee pilots
– Mechanical ability (x1), verbal ability (x2),
success (y)
• Correlation matrix

             Mech   Verb   Success
  Mech       1      0.5    0.3
  Verb       0.5    1      0
  Success    0.3    0      1
228
– Mechanical ability correlates 0.3 with
success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
– (Don’t look ahead until you have had a
guess)
229
• Mechanical ability
– b = 0.4
– Larger than r!
• Verbal ability
– b = -0.2
– Smaller than r!!
• So what is happening?
– You need verbal ability to do the mechanical
ability test
– Not actually related to mechanical ability
• Measure of mechanical ability is contaminated by verbal
ability
230
• High mech, low verbal
– High mech
• This is positive (.4)
– Low verbal
• Negative, because we are talking about
standardised scores (−(−0.2) = +0.2)
• Your mech is really high – you did well on the
mechanical test, without being good at the
words
• High mech, high verbal
– Well, you had a head start on mech,
because of verbal, and need to be brought
down a bit
231
Another suppressor?

        x1     x2     y
  x1    1      0.5    0.3
  x2    0.5    1      0.2
  y     0.3    0.2    1

  b1 =
  b2 =
232
Another suppressor?

        x1     x2     y
  x1    1      0.5    0.3
  x2    0.5    1      0.2
  y     0.3    0.2    1

  b1 = 0.26
  b2 = -0.06
233
And another?

        x1     x2     y
  x1    1      0.5    0.3
  x2    0.5    1      -0.2
  y     0.3    -0.2   1

  b1 =
  b2 =
234
And another?

        x1     x2     y
  x1    1      0.5    0.3
  x2    0.5    1      -0.2
  y     0.3    -0.2   1

  b1 = 0.53
  b2 = -0.47
235
One more?

        x1     x2     y
  x1    1      -0.5   0.3
  x2    -0.5   1      0.2
  y     0.3    0.2    1

  b1 =
  b2 =
236
One more?

        x1     x2     y
  x1    1      -0.5   0.3
  x2    -0.5   1      0.2
  y     0.3    0.2    1

  b1 = 0.53
  b2 = 0.47
237
• Suppression happens when two opposing
forces are happening together
– And have opposite effects
• Don’t throw away your IVs,
– Just because they are uncorrelated with the
outcome
• Be careful in interpretation of regression
estimates
– Really need the correlations too, to interpret what
is going on
– Cannot compare between studies with different
predictors
– Think about what you want to know
• Before throwing variables into the analysis
238
What to Control For?
• What is the added value of a ‘better’
college
– In terms of salary
– More academic people go to ‘better’
colleges
– Control for:
• Ability? Social class? Mother’s education?
Parent’s income? Course? Ethnic group? …
239
• Decisions about control variables
– Guided from theory
• Effect of gender
– Controlling for hair length and skirt
wearing?
240
241
• Do dogs make kids healthier?
– What to control for? Parent’s weight?
• Yes: Obese parents are more likely to have
obese kids, so kids should be judged as thin
or heavy relative to their parents
• No: Dog might make parent thinner. By
controlling for parental weight, you’re
controlling for the effect of dog
242
[Diagrams: Dog → Kid's health. Bad control vars: Parent Weight, Child Asthma. Good control vars: Rural/Urban?, House/apartment?, Income]
Standardised Estimates > 1
• Correlations are bounded
-1.00 ≤ r ≤ +1.00
– We think of standardised regression
estimates as being similarly bounded
• But they are not
– Can go >1.00, <-1.00
– R cannot, because that is a proportion of
variance
246
• Three measures of ability
– Mechanical ability, verbal ability 1, verbal
ability 2
– Score on science exam
             Mech   Verbal1   Verbal2   Scores
  Mech       1      0.1       0.1       0.6
  Verbal1    0.1    1         0.9       0.6
  Verbal2    0.1    0.9       1         0.3
  Scores     0.6    0.6       0.3       1
–Before reading on, what are the parameter
estimates?
247
  Mech    Verbal1   Verbal2
  0.56    1.71      -1.29
• Mechanical
– About where we expect
• Verbal 1
– Very high
• Verbal 2
– Very low
248
• What is going on
– It’s a suppressor again
– a predictor which increases the size of the
parameters associated with other
predictors above the size of their
correlations
• Verbal 1 and verbal 2 are correlated so
highly
– They need to cancel each other out
249
Variable Selection
• What are the appropriate predictors to
use in a model?
– Depends what you are trying to do
• Multiple regression has two separate
uses
– Prediction
– Explanation
250
• Prediction
– What will happen in
the future?
– Emphasis on
practical application
– Variables selected
(more) empirically
– Value free
• Explanation
– Why did something
happen?
– Emphasis on
understanding
phenomena
– Variables selected
theoretically
– Not value free
251
• Visiting the doctor
– Precedes suicide attempts
– Predicts suicide
• Does not explain suicide
• More on causality later on …
• Which are appropriate variables
– To collect data on?
– To include in analysis?
– Decision needs to be based on theoretical knowledge
of the behaviour of those variables
– Statistical analysis of those variables (later)
• Unless you didn’t collect the data
– Common sense (not a useful thing to say)
252
Variable Entry Techniques
• Entry-wise
– All variables entered simultaneously
• Hierarchical
– Variables entered in a predetermined order
• Stepwise
– Variables entered according to change in
R2
– Actually a family of techniques
253
• Entrywise regression
– All variables entered simultaneously
– All treated equally
• Hierarchical regression
– Entered in a theoretically determined order
– Change in R2 is assessed, and tested for
significance
– e.g. sex and age
• Should not be treated equally with other variables
• Sex and age MUST be first (unchangeable)
– Confused with hierarchical linear modelling (MLM)
254
R-Squared Change

  F = [ (SSE0 − SSE1) / (df0 − df1) ] / (SSE1 / df1)
• SSE0, df0
• SSE and df for first (smaller) model
• SSE1, df1
• SSE and df for second (larger) model
255
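A sketch of the change test done by running both models yourself (variable names price, sqm and lotsize are assumptions):

  * R-squared change F test from two nested regressions
  quietly regress price sqm
  scalar rss0 = e(rss)
  scalar df0 = e(df_r)
  quietly regress price sqm lotsize
  scalar rss1 = e(rss)
  scalar df1 = e(df_r)
  scalar Fchange = ((rss0 - rss1)/(df0 - df1))/(rss1/df1)
  display "F = " Fchange ", p = " Ftail(df0 - df1, df1, Fchange)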
• Stepwise
– Variables entered empirically
– Variable which increases R2 the most goes
first
• Then the next …
– Variables which have no effect can be
removed from the equation
• Example
– House prices – what’s important?
– Size, lot size, list price,
256
• Stepwise Analysis
– Data determines the order
– Model 1: listing price, R2 = 0.87
– Model 2: listing price + lot size, R2 = 0.89
             b       p
  List       0.81    <0.001
  Lot size   0.02    0.02
257
• Hierarchical analysis
– Theory determines the order
– Model 1: Lot size+ House size, R2 = 0.499
– Model 2: + List price, R2 = 0.905
– Change in R2 = 0.41, p < 0.001
               b       p
  House size   0.18    0.20
  Lot size     0.15    0.03
  List price   0.75    <0.001
258
• Which is the best model?
– Entrywise – OK
– Stepwise – excluded age
• Excluded size
– MOST IMPORTANT PREDICTOR
– Hierarchical
• Listing price accounted for additional variance
– Whoever decides the price has information that we
don’t
• Other problems with stepwise
– F and df are wrong (cheats with df)
– Unstable results
• Small changes (sampling variance) – large
differences in models
259
– Uses a lot of paper
– Don’t use a stepwise procedure to pack
your suitcase
260
Is Stepwise Always Evil?
• Yes
• All right, no
• Research goal is entirely predictive
(technological)
– Not explanatory (scientific)
– What happens, not why
• N is large
– 40 people per predictor, Cohen, Cohen, Aiken, West
(2003)
• Cross validation takes place
261
• Alternatives to stepwise regression
– More recently developed
– Used for genetic studies
• 1000s of predictors, one outcome, small
samples
– Least Angle Regression
• LARS (least angle regression)
• Lasso (Least absolute shrinkage and selection
operator)
262
Entry Methods in Stata
• Entrywise
– What regress does
• Hierarchical
– Two ways
– Use hireg
– Add on module
• net search hireg
• Then install
263
Hierarchical Regression
• Use (on one line)
– hireg outcome (block1var1
block1var2) (block2var1
block2var2)
• Hireg reports
– Parameter estimates for the two
regressions
– R2 for each model, change in R2
264
  Model   R2      p       F(df)
  1:      0.022   0.136   2.256(1,98)
  2:      0.513   0.000   50.987(2,97)

  R2 change   F(df) change    p
  0.490       97.497(1,97)    0.000

  (The first p-values are for each model's R2; the final p is for the change in R2)
265
Hierarchical Regression (Cont…)
• I don’t like hireg, for two reasons
– It’s different to regression
– It only works for OLS regression, not
logistic, multinomial, Poisson, etc
• Alternative 2:
– Use test
– The p-value associated with the change in
R2 for a variable
• Equal to the p-value for that variable.
266
Hierarchical Regression (Cont…)
• Example (using cars)
– Parameters from final model:
– hireg price () (extro)

  ---------------------------------------------------------------------------
          car |      Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
  ------------+--------------------------------------------------------------
        extro |       .463       .1296    3.57    0.001     .2004     .72626
  ---------------------------------------------------------------------------

– R2 change statistics

  R2 change   F(df) change   p
  0.128       12.773(1,36)   0.001
– (What is relationship between t and F?)
• We know the p-value of the R2 change
– When there is one predictor in the block
– What about when there’s more than one?
267
Hierarchical Regression (Cont)
• test isn’t exactly what we want
– But it is the same as what we want
• Advantage of test
– You can always use it
• (I can always remember how it works)
268
(For SPSS)
• SPSS calls them ‘blocks’
• Enter some variables, click ‘next block’
– Enter more variables
• Click on ‘Statistics’
– Click on R-squared change
269
Stepwise Regression
• Add stepwise: prefix
• With
– Pr() – probability value to be removed from
equation
– Pe() – probability value to be entered into
equation
• stepwise, pe(0.05) pr(0.2):
reg price sqm lotsize
originallis
270
A quick note on R2
R2 is sometimes regarded as the ‘fit’ of a
regression model
– Bad idea
• If good fit is required – maximise R2
– Leads to entering variables which do not
make theoretical sense
271
Propensity Scores
• Another method of controlling for
variables
• Ensure that predictors are uncorrelated
with one predictor
– Don’t need to control for them
272
x’s Uncorrelated?
• Two cases when x’s are uncorrelated
• Experimental design
– Predictors are uncorrelated
– We randomly assigned people to conditions
to ensure that was the case
• Sample weights
– We can deliberately sample
• Ensure that they are uncorrelated
273
• 20 women with college degree
• 20 women without college degree
• 20 men with college degree
• 20 men without college degree
– Or use post hoc sample weights
• Propensity weighting
– Weight to ensure that variables are uncorrelated
– Usually done to avoid having to control
– E.g. ethnic differences in PTSD symptoms
– Can incorporate many more control
variables
• 100+
274
Propensity Scores
• Race profiling of police stops
– Same time, place, area, etc
– www.youtube.com/watch?v=Oot0BOaQTZI
275
Critique of Multiple Regression
• Goertzel (2002)
– “Myths of murder and multiple regression”
– Skeptical Inquirer (Paper B1)
• Econometrics and regression are ‘junk
science’
– Multiple regression models (in US)
– Used to guide social policy
276
More Guns, Less Crime
– (controlling for other factors)
• Lott and Mustard: A 1% increase in gun
ownership
– 3.3% decrease in murder rates
• But:
– More guns in rural Southern US
– More crime in urban North (crack cocaine
epidemic at time of data)
277
Executions Cut Crime
• No difference between crimes in states
in US with or without death penalty
• Ehrlich (1975) controlled all variables
that affect crime rates
– Death penalty had effect in reducing crime
rate
• No statistical way to decide who’s right
278
Legalised Abortion
• Donohue and Levitt (1999)
– Legalised abortion in 1970’s cut crime in 1990’s
• Lott and Whitley (2001)
– “Legalising abortion decreased murder rates by …
0.5 to 7 per cent.”
• It’s impossible to model these data
– Controlling for other historical events
– Crack cocaine (again)
279
• Crime is still dropping in the US
– Despite the recession
• Levitt says it’s mysterious, because the
abortion effect should be over
• Some suggest Xboxes, Playstations, etc
• Netflix, DVRs
– (Violent movies reduce crime).
280
Another Critique
• Berk (2003)
– Regression analysis: a constructive critique (Sage)
• Three cheers for regression
– As a descriptive technique
• Two cheers for regression
– As an inferential technique
• One cheer for regression
– As a causal analysis
281
Is Regression Useless?
• Do regression carefully
– Don’t go beyond data which you have a
strong theoretical understanding of
• Validate models
– Where possible, validate predictive power
of models in other areas, times, groups
• Particularly important with stepwise
282
Lesson 6: Categorical
Predictors
283
Introduction
284
Introduction
• So far, just looked at continuous
predictors
• Also possible to use categorical
(nominal, qualitative) predictors
– e.g. Sex; Job; Religion; Region; Type (of
anything)
• Usually analysed with t-test/ANOVA
285
Historical Note
• But these (t-test/ANOVA) are special
cases of regression analysis
– Aspects of General Linear Models (GLMs)
• So why treat them differently?
– Fisher’s fault
– Computers’ fault
• Regression, as we have seen, is
computationally difficult
– Matrix inversion and multiplication
– Can’t do it, without a computer
286
• In the special cases where:
• You have one categorical predictor
• Your IVs are uncorrelated
– It is much easier to do it by partitioning of
sums of squares
• These cases
– Very rare in ‘applied’ research
– Very common in ‘experimental’ research
• Fisher worked at Rothamsted agricultural
research station
• Never have problems manipulating wheat, pigs,
cabbages, etc
287
• In psychology
– Led to a split between ‘experimental’
psychologists and ‘correlational’
psychologists
– Experimental psychologists (until recently)
would not think in terms of continuous
variables
• Still (too) common to dichotomise a
variable
– Too difficult to analyse it properly
– Equivalent to discarding 1/3 of your data
288
The Approach
289
The Approach
• Recode the nominal variable
– Into one, or more, variables to represent that
variable
• Names are slightly confusing
– Some texts talk of ‘dummy coding’ to refer to all
of these techniques
– Some (most) refer to ‘dummy coding’ to refer to
one of them
– Most have more than one name
290
• If a variable has g possible categories it
is represented by g-1 variables
• Simplest case:
– Smokes: Yes or No
– Variable 1 represents ‘Yes’
– Variable 2 is redundant
• If it isn’t yes, it’s no
291
The Techniques
292
• We will examine two coding schemes
– Dummy coding
• For two groups
• For >2 groups
– Effect coding
• For >2 groups
• Look at analysis of change
– Equivalent to ANCOVA
– Pretest-posttest designs
293
Dummy Coding – 2 Groups
• Sometimes called ‘simple coding’
• A categorical variable with two groups
• One group chosen as a reference group
– The other group is represented in a variable
• e.g. 2 groups: Experimental (Group 1) and
Control (Group 0)
– Control is the reference group
– Dummy variable represents experimental group
• Call this variable ‘group1’
294
• For variable ‘group1’
– 1 = ‘Yes’, 0=‘No’
  Original Category   New Variable
  Exp                 1
  Con                 0
295
• Some data
• Group is x, score is y
                  Control Group   Experimental Group
  Experiment 1    10              10
  Experiment 2    10              20
  Experiment 3    10              30
296
• Control Group = 0
– Intercept = Score on Y when x = 0
– Intercept = mean of control group
• Experimental Group = 1
– b = change in Y when x increases 1 unit
– b = difference between experimental
group and control group
297
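A hedged Stata sketch (not in the slides), assuming a 0/1 dummy called group1 and an outcome called score; the regression and the t-test give the same answer:

  * dummy-coded two-group comparison
  regress score group1
  ttest score, by(group1)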
[Figure: means for the control and experimental groups in Experiments 1-3; the gradient of the slope represents the difference between the means]
298
Dummy Coding – 3+ Groups
• With three groups the approach is the
similar
• g = 3, therefore g-1 = 2 variables
needed
• 3 Groups
– Control
– Experimental Group 1
– Experimental Group 2
299
  Original Category   Gp1   Gp2
  Con                 0     0
  Gp1                 1     0
  Gp2                 0     1
• Recoded into two variables
– Note – do not need a 3rd variable
• If we are not in group 1 or group 2 MUST be in
control group
• 3rd variable would add no information
• (What would happen to determinant?)
300
• F and associated p
– Tests H0 that μg1 = μg2 = μg3
• b1 and b2 and associated p-values
– Test difference between each experimental group
and the reference group
• To test difference between experimental
groups
– Need to rerun analysis (or just do ANOVA with
post-hoc tests)
301
• One more complication
– Have now run multiple comparisons
– Increases α – i.e. the probability of a Type I error
• Need to correct for this
– Bonferroni correction
– Multiply given p-values by two/three
(depending how many comparisons were
made)
302
Effect Coding
• Usually used for 3+ groups
• Compares each group (except the reference
group) to the mean of all groups
– Dummy coding compares each group to the
reference group.
• Example with 5 groups
– 1 group selected as reference group
• Group 5
303
• Each group (except reference) has a
variable
– 1 if the individual is in that group
– 0 if not
– -1 if in reference group
  group   group_1   group_2   group_3   group_4
  1       1         0         0         0
  2       0         1         0         0
  3       0         0         1         0
  4       0         0         0         1
  5       -1        -1        -1        -1
304
Examples
• Dummy coding and Effect Coding
• Group 1 chosen as reference group
each time
• Data
  Group    Mean     SD
  1        52.40    4.60
  2        56.30    5.70
  3        60.10    5.00
  Total    56.27    5.88
305
• Dummy

  Group   dummy2   dummy3
  1       0        0
  2       1        0
  3       0        1

• Effect

  Group   effect2   effect3
  1       -1        -1
  2       1         0
  3       0         1
306
  Dummy                          Effect
  R=0.543, F=5.7,                R=0.543, F=5.7,
  df=2, 27, p=0.009              df=2, 27, p=0.009
  b0 = 52.4                      b0 = 56.27
  b1 = 3.9, p=0.100              b1 = 0.03, p=0.980
  b2 = 7.7, p=0.002              b2 = 3.8, p=0.007

  b0 = x̄g1                       b0 = x̄G (grand mean)
  b1 = x̄g2 − x̄g1                 b1 = x̄g2 − x̄G
  b2 = x̄g3 − x̄g1                 b2 = x̄g3 − x̄G
307
In Stata
• Use xi: prefix for dummy coding
• Use xi3: module for more codings
• But
– I don’t like it, I do it by hand
– I don’t understand what it’s doing
– It makes very long variables
• And then I can’t use test
– BUT: If doing stepwise, you need to keep the variables
together
• Example:
xi: reg outcome contpred i.catpred

(This has changed in Stata 11: xi: is no longer needed.
Put i. in front of categorical predictors.)
308
xi: reg salary i.job_description
  ------------------------------------------------------
        salary |      Coef.   Std. Err.      t    P>|t|
  -------------+----------------------------------------
  _Ijob_desc~2 |    3100.34     2023.76    1.53   0.126
  _Ijob_desc~3 |    36139.2    1228.352   29.42   0.000
         _cons |    27838.5    532.4865   52.28   0.000
  ------------------------------------------------------
309
Exercise 6.1
• 5 golf balls
– Which is best?
310
In SPSS
• SPSS provides two equivalent procedures for
regression
– Regression
– GLM
– GLM will:
– Automatically code categorical variables
– Automatically calculate interaction terms
– Allow you to not understand
• GLM won’t:
– Give standardised effects
– Give hierarchical R2 p-values
311
ANCOVA and Regression
312
• Test
– (Which is a trick; but it’s designed to make
you think about it)
• Use bank data (Ex 5.3)
– Compare the pay rise (difference between
salbegin and salary)
– For ethnic minority and non-minority staff
• What do you find?
313
ANCOVA and Regression
• Dummy coding approach has one special use
– In ANCOVA, for the analysis of change
• Pre-test post-test experimental design
– Control group and (one or more) experimental
groups
– Tempting to use difference score + t-test / mixed
design ANOVA
– Inappropriate
314
• Salivary cortisol levels
– Used as a measure of stress
– Not absolute level, but change in level over
day may be interesting
• Test at: 9.00am, 9.00pm
• Two groups
– High stress group (cancer biopsy)
• Group 1
– Low stress group (no biopsy)
• Group 0
315
        High Stress    Low Stress
AM      20.1           22.3
PM      6.8            11.8
Diff    13.3           10.5
• Correlation of AM and PM = 0.493
(p=0.008)
• Has there been a significant difference
in the rate of change of salivary
cortisol?
– 3 different approaches
316
• Approach 1 – find the differences, do a
t-test
– t = 1.31, df=26, p=0.203
• Approach 2 – mixed ANOVA, look for
interaction effect
– F = 1.71, df = 1, 26, p = 0.203
– F = t2
• Approach 3 – regression (ANCOVA)
based approach
317
– IVs: AM and group
– outcome: PM
– b1 (group) = 3.59, standardised b1=0.432,
p = 0.01
• Why is the regression approach better?
– The other two approaches took the
difference
– Assumes that r = 1.00
– Any difference from r = 1.00 and you add
error variance
• Subtracting error is the same as adding error
318
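A minimal Stata sketch of the regression (ANCOVA) approach, assuming variables called am, pm and group (0 = low stress, 1 = high stress) – the names are made up for illustration:

reg pm am group
* the coefficient on group is the adjusted group difference in PM cortisol,
* controlling for the AM (pre-test) level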
• Using regression
– Ensures that all the variance that is
subtracted is true
– Reduces the error variance
• Two effects
– Adjusts the means
• Compensates for differences between groups
– Removes error variance
• Data is am-pm cortisol
319
More on Change
• If difference score is correlated with
either pre-test or post-test
– Subtraction fails to remove the difference
between the scores
– If two scores are uncorrelated
• Difference will be correlated with both
• Failure to control
– Equal SDs, r = 0
• Correlation of change and pre-score =0.707
320
Even More on Change
• A topic of surprising complexity
– What I said about difference scores isn’t
always true
• Lord’s paradox – it depends on the precise
question you want to answer
– Collins and Horn (1993). Best methods for
the analysis of change
– Collins and Sayer (2001). New methods for
the analysis of change
– More later
321
Lesson 7: Assumptions in
Regression Analysis
322
The Assumptions
1. The distribution of residuals is normal (at
each value of the outcome).
2. The variance of the residuals for every set
of values for the predictor is equal.
• violation is called heteroscedasticity.
3. The error term is additive
•
no interactions.
4. At every value of the outcome the expected
(mean) value of the residuals is zero
•
No non-linear relationships
323
5. The expected correlation between residuals,
for any two cases, is 0.
•
The independence assumption (lack of
autocorrelation)
6. All predictors are uncorrelated with the
error term.
7. No predictors are a perfect linear function
of other predictors (no perfect
multicollinearity)
8. The mean of the error term is zero.
324
What are we going to do …
• Deal with some of these assumptions in
some detail
• Deal with others in passing only
– look at them again later on
325
Assumption 1: The
Distribution of Residuals is
Normal at Every Value of the
outcome
326
Look at Normal Distributions
• A normal distribution
– symmetrical, bell-shaped (so they say)
327
What can go wrong?
• Skew
– asymmetry
– one tail longer than the other
• Kurtosis
– too flat or too peaked
– kurtosed
• Outliers
– Individual cases which are far from the
distribution
328
Effects on the Mean
• Skew
– biases the mean, in direction of skew
• Kurtosis
– mean not biased
– standard deviation is
– and hence standard errors, and
significance tests
329
Examining Univariate
Distributions
• Graphs
– Histograms
– Boxplots
– P-P plots
• Calculation based methods
330
Histograms
• A and B
[Histograms of distributions A and B]
331
• C and D
[Histograms of distributions C and D]
332
• E and F
[Histograms of distributions E and F]
333
Histograms can be tricky ….
[Example histograms]
334
Boxplots
335
P-P Plots
• A and B
[P–P plots for distributions A and B]
336
• C and D
[P–P plots for distributions C and D]
337
• E and F
[P–P plots for distributions E and F]
338
Calculation Based
• Skew and Kurtosis statistics
• Outlier detection statistics
339
Skew and Kurtosis Statistics
• Normal distribution
– skew = 0
– kurtosis = 0
• Two methods for calculation
– Fisher’s and Pearson’s
– Very similar answers
• Associated standard error
– can be used for significance (t-test) of departure
from normality
– not actually very useful
• above N ≈ 400, the test almost always rejects normality
340
     Skewness    Kurtosis
A    -0.12       -0.084
B     0.271       0.265
C     0.454       1.885
D     0.117      -1.081
E     2.106       5.75
F     0.171      -0.21
341
Outlier Detection
• Calculate distance from mean
– z-score (number of standard deviations)
– deleted z-score
• that case biased the mean, so remove it
– Look up expected distance from mean
• well under 1% of cases expected 3+ SDs from the mean
342
Non-Normality in Regression
343
Effects on OLS Estimates
• The mean is an OLS estimate
• The regression line is an OLS estimate
• Lack of normality
– biases the position of the regression slope
– makes the standard errors wrong
• probability values attached to statistical
significance wrong
344
Checks on Normality
• Check residuals are normally distributed
– Draw a histogram of the residuals
• Use regression diagnostics
– Lots of them
– Most aren’t very interesting
345
Regression Diagnostics
• Residuals
– Standardised, studentised-deleted
– look for cases > |3| (?)
• Influence statistics
– Look for the effect a case has
– If we remove that case, do we get a different
answer?
– DFBeta, Standardised DFBeta
• changes in b
346
– DfFit, Standardised DfFit
• change in predicted value
• Distances
– measures of ‘distance’ from the centroid
– some include IV, some don’t
347
More on Residuals
• Residuals are trickier than you might
have imagined
• Raw residuals
– OK
• Standardised residuals
– Residuals divided by SD
se = √( Σe² / (n − k − 1) )
348
Standardised / Studentised
• Now we can calculate the standardised
residuals
– SPSS calls them studentised residuals
– Also called internally studentised residuals
ei′ = ei / ( se √(1 − hi) )
349
Deleted Studentised Residuals
• Studentised residuals do not have a
known distribution
– Cannot use them for inference
• Deleted studentised residuals
– Externally studentised residuals
– Studentized (jackknifed) residuals
• Distributed as t
• With df = N – k – 1
350
Testing Significance
• We can calculate the probability of a
residual
– Is it sampled from the same population
• BUT
– Massive type I error rate
– Bonferroni correct it
• Multiply p value by N
351
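A minimal Stata sketch of this (run after any regress; the variable names y, x1, x2 are made up, and df = N − k − 1 as stated above):

reg y x1 x2
predict rstu, rstudent                     // deleted (jackknifed) studentised residuals
gen p_raw = 2*ttail(e(df_r), abs(rstu))    // two-tailed p from the t distribution
gen p_bonf = min(1, p_raw*_N)              // Bonferroni: multiply by N, cap at 1
list if p_bonf < 0.05                      // cases that remain 'significant' outliers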
Bivariate Normality
• We didn’t just say “residuals normally
distributed”
• We said “at every value of the
outcomes”
• Two variables can be normally
distributed – univariate,
– but not bivariate
352
• Couple’s IQs
– male and female
[Histograms of FEMALE and MALE IQ scores (60–140)]
– Seem reasonably normal
353
• But wait!!
[Scatterplot of MALE IQ against FEMALE IQ]
354
• When we look at bivariate normality
– not normal – there is an outlier
• So plot X against Y
• OK for bivariate
– but – may be a multivariate outlier
– would need to draw the graph in 3+ dimensions
– can’t draw a graph in more than 3 dimensions
• But we can look at the residuals instead
…
355
• IQ histogram of residuals
[Histogram of residuals]
356
Multivariate Outliers …
• Will be explored later in the exercises
• So we move on …
357
What to do about Non-Normality
• Skew and Kurtosis
– Skew – much easier to deal with
– Kurtosis – less serious anyway
• Transform data
– removes skew
– positive skew – log transform
– negative skew - square
358
Transformation
• May need to transform IV and/or outcome
– More often outcome
• time, income, symptoms (e.g. depression) all positively
skewed
– can cause non-linear effects (more later) if only
one is transformed
– alters interpretation of unstandardised parameter
– May alter meaning of variable
• Some people say this is such a big problem
– that you should never transform
– May add / remove non-linear and moderator
effects
359
• Change measures
– increase sensitivity at the extremes of the range
• avoiding floor and ceiling effects
• Outliers
– Can be tricky
– Why did the outlier occur?
• Error? Delete them.
• Weird person? Probably delete them
• Normal person? Tricky.
360
– You are trying to model a process
• is the data point ‘outside’ the process
• e.g. lottery winners, when looking at salary
• yawn, when looking at reaction time
– Which is better?
• A good model, which explains 99% of your
data? (because we threw outliers out)
• A poor model, which explains all of it (because
we keep outliers in)
• I prefer a good model
361
More on House Prices
• Zillow.com tracks and predicts house
prices
– In the USA
• Sometimes detects outliers
– We don’t trust this selling price
– We haven’t used it
362
Example in Stata
• reg salary educ
• predict res, res
• hist res
• gen logsalary= log(salary)
• reg logsalary educ
• predict logres, res
• hist logres
363
[Histogram of residuals from reg salary educ]
[Histogram of residuals from reg logsalary educ]
But …
• Parameter estimates change
• Interpretation of parameter estimate is
different
• Exercise 7.0, 7.1
366
Bootstrapping
• Bootstrapping is very, very cool
• And very, very clever
• But very, very simple
367
Bootstrapping
• When we estimate a test statistic (F or r
or t or c2)
• We rely on knowing the sampling
distribution
• Which we know
– If the distributional assumptions are
satisfied
368
Estimate the Distribution
• Bootstrapping lets you:
– Skip the bit about distribution
– Estimate the sampling distribution from the
data
• This shouldn’t be allowed
– Hence bootstrapping
– But it is
369
How to Bootstrap
• We resample, with replacement
• Take our sample
– Sample 1 individual
• Put that individual back, so that they can be
sampled again
– Sample another individual
• Keep going until we’ve sampled as many
people as were in the sample
• Analyze the data
• Repeat the process B times
– Where B is a big number
370
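A minimal sketch of a single bootstrap replicate in Stata (bsample resamples the data in memory with replacement; the regression shown is just an example):

preserve
bsample               // draw N cases, with replacement, from the data in memory
reg salary educ       // analyse the resampled data
restore               // return to the original sample; repeat B times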
Example
Original    B1    B2    B3
1           1     1     2
2           1     2     2
3           3     3     3
4           3     4     2
5           3     4     4
6           3     4     4
7           7     8     6
8           7     8     7
9           9     9     9
10          9     10    9
371
• Analyze each dataset
– Sampling distribution of statistic
• Gives sampling distribution
• 2 approaches to CI or P
• Semi-parametric
– Calculate standard error of statistic
– Call that the standard deviation
– Does not make assumption about
distribution of data
• Makes assumption about sampling distribution
372
• Non-parametric
– Stata calls this percentile
• Count.
– If you have 1000 samples
– 25th is lower CI
– 975th is upper CI
– P-value is proportion that cross zero
• Non-parametric needs more samples
373
Bootstrapping in Stata
• Very easy:
– Use bootstrap: (or bs: or bstrap: ) prefix or
– (Better) use vce(bootstrap) option
• By default does 50 samples
– Not enough
– Use reps()
– At least 1000
374
Example
reg salary salbegin educ, vce(bootstrap, reps(50))

             |  Observed    Bootstrap
             |     Coef.    Std. Err.        z
  -----------+------------------------------------
    salbegin |  1.672631     .0863302    19.37

• Again
    salbegin |  1.672631     .0737315    22.69
375
More Reps
• 1,000 reps
– Z = 17.31
• Again
– Z = 17.59
• 10,000 reps
– 17.23
– 17.02
376
• Exercise 7.2, 7.3
377
Assumption 2: The variance of
the residuals for every set of
values for the predictor is
equal.
378
Heteroscedasticity
• This assumption is about
heteroscedasticity of the residuals
– Hetero=different
– Scedastic = scattered
• We don’t want heteroscedasticity
– we want our data to be homoscedastic
• Draw a scatterplot to investigate
379
[Scatterplot of MALE IQ against FEMALE IQ]
380
• Only works with one IV
– need every combination of IVs
• Easy to get – use predicted values
– use residuals there
• Plot predicted values against residuals
• A bit like turning the scatterplot to
make the line of best fit flat
381
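A minimal Stata sketch of the residuals-against-predicted-values plot (after any regress; rvfplot does the same thing in one step):

reg salary educ salbegin
predict yhat, xb            // predicted (fitted) values
predict res, residuals      // residuals
scatter res yhat            // plot residuals against predicted values
* or, in one step:
rvfplot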
Good – no heteroscedasticity
[Residuals plotted against predicted values]
382
Bad – heteroscedasticity
[Residuals plotted against predicted values]
383
Testing Heteroscedasticity
• White’s test
1. Do regression, save residuals
2. Square the residuals
3. Square the IVs
4. Calculate interactions of the IVs
– e.g. x1·x2, x1·x3, x2·x3
384
5. Run regression using
– squared residuals as outcome
– IVs, squared IVs, and interactions as IVs
6. Test statistic = N × R²
– Distributed as χ²
– df = k (for the second regression)
• Use education and salbegin to predict salary (employee data.sav)
– R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001
• Automatic in Stata
– estat imtest, white
385
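A hand-rolled sketch of the same test for this two-predictor case (estat imtest, white gives it automatically; the new variable names are made up):

reg salary educ salbegin
predict r, residuals
gen r2 = r^2
gen educsq = educ^2
gen salbeginsq = salbegin^2
gen educXsal = educ*salbegin
reg r2 educ salbegin educsq salbeginsq educXsal
display "White chi2 = " e(N)*e(r2) "   p = " chi2tail(5, e(N)*e(r2))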
Plot of Predicted and Residual
[Residuals plotted against the linear prediction]
386
White’s Test as Test of
Interest
• Possible to have a theory that predicts
heteroscedasticity
• Lupien, et al, 2006
– Heteroscedasticity in relationship of
hippocampal volume and age
387
Magnitude of
Heteroscedasticity
• Chop data into 5 “slices”
– Calculate variance of each slice
– Check ratio of smallest to largest
– Less than 5
• OK
388
gen slice = 1
replace slice = 2 if pred > 30000
replace slice = 3 if pred > 60000
replace slice = 4 if pred > 90000
replace slice = 5 if pred > 120000
bysort slice: su pred

slice 1: 3954
slice 5: 17116
(Doesn’t look too bad, thanks to skew in predictors)
389
Dealing with Heteroscedasticity
• Use Huber-White (robust) estimates
– Also called sandwich estimates
– Also called empirical estimates
• Use survey techniques
– Relatively straightforward in SAS and Stata, fiddly in SPSS
– Google: SPSS Huber-White
390
Why’s it a Sandwich?
• SE can be calculated with:
( (1/n) X′X )⁻¹
• Sandwich estimator:
( (1/n) X′X )⁻¹ ( (1/n) X′ΩX ) ( (1/n) X′X )⁻¹
– the middle term (the filling) is estimated from the squared residuals
391
Example
• reg salary educ
– Standard errors: 204, 2821
• reg salary educ, robust
– Standard errors: 267, 3347
• SEs usually go up, can go down
392
Heteroscedasticity –
Implications and Meanings
Implications
• What happens as a result of
heteroscedasticity?
– Parameter estimates are correct
• not biased
– Standard errors (hence p-values) are
incorrect
393
However …
• If there is no skew in predicted scores
– P-values a tiny bit wrong
• If skewed,
– P-values can be very wrong
• Exercise 7.4
394
Robust SE Haiku
T-stat looks too good.
Use robust standard errors
significance gone
395
Meaning
• What is heteroscedasticity trying to tell
us?
– Our model is wrong – it is misspecified
– Something important is happening that we
have not accounted for
• e.g. amount of money given to charity
(given)
– depends on:
• earnings
• degree of importance person assigns to the
charity (import)
396
• Do the regression analysis
– R² = 0.60, p < 0.001
• seems quite good
– b0 = 0.24, p=0.97
– b1 = 0.71, p < 0.001
– b2 = 0.23, p = 0.031
• White’s test
– c2 = 18.6, df=5, p=0.002
• The plot of predicted values against
residuals …
397
[Residuals plotted against the linear prediction]
• Plot shows heteroscedastic relationship
398
• Which means …
– the effects of the variables are not additive
– If you think that what a charity does is
important
• you might give more money
• how much more depends on how much money
you have
399
[given plotted against import, with fitted values]
400
• One more thing about
heteroscedasticity
– it is the equivalent of homogeneity of
variance in ANOVA/t-tests
401
• Exercise 7.4, 7.5, 7.6
402
Assumption 3: The Error Term
is Additive
403
Additivity
• What heteroscedasticity shows you
– effects of variables need to be additive (assume
no interaction between the variables)
• Heteroscedasticity doesn’t always show it to
you
– can test for it, but hard work
– (same as homogeneity of covariance assumption
in ANCOVA)
• Have to know it from your theory
• A specification error
404
Additivity and Theory
• Two IVs
– Alcohol has sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– Some painkillers have sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– A bit of alcohol and a bit of painkiller
doesn’t make you very tired
– Effects multiply together, don’t add
together
405
• If you don’t test for it
– It’s very hard to know that it will happen
• So many possible non-additive effects
– Cannot test for all of them
– Can test for obvious
• In medicine
– Choose to test for salient non-additive
effects
– e.g. sex, race
• More on this, when we look at
moderators
406
• Exercise 7.6
• Exercise 7.7
407
Assumption 4: At every value of
the outcome the expected
(mean) value of the residuals
is zero
408
Linearity
• Relationships between variables should be
linear
– best represented by a straight line
• Not a very common problem in social
sciences
– measures are not sufficiently accurate (much measurement
error) to make a difference
• R2 too low
• unlike, say, physics
409
• Relationship between speed of travel and fuel used
[Plot of Fuel against Speed]
410
• R2 = 0.938
– looks pretty good
– know speed, make a good prediction of
fuel
• BUT
– look at the chart
– if we know speed we can make a perfect
prediction of fuel used
– R2 should be 1.00
411
Detecting Non-Linearity
• Residual plot
– just like heteroscedasticity
• Using this example
– very, very obvious
– usually pretty obvious
412
Residual plot
413
Linearity: A Case of Additivity
• Linearity = additivity along the range of the
IV
• Jeremy rides his bicycle harder
– Increase in speed depends on current speed
– Not additive, multiplicative
– MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in
multiple regression. Psychological Bulletin.
414
Assumption 5: The expected
correlation between residuals, for
any two cases, is 0.
The independence assumption (lack of
autocorrelation)
415
Independence Assumption
• Also: lack of autocorrelation
• Tricky one
– often ignored
– exists for almost all tests
• All cases should be independent of one
another
– knowing the value of one case should not tell you
anything about the value of other cases
416
How is it Detected?
• Can be difficult
– need some clever statistics (multilevel
models)
• Better off avoiding situations where it
arises
– Or handling it when it does arise
• Residual Plots
417
Residual Plots
• Were data collected in time order?
– If so plot ID number against the residuals
– Look for any pattern
• Test for linear relationship
• Non-linear relationship
• Heteroscedasticity
418
[Residuals plotted against Participant Number]
419
How does it arise?
Two main ways
• time-series analyses
– When cases are time periods
• weather on Tuesday and weather on Wednesday
correlated
• inflation 1972, inflation 1973 are correlated
• clusters of cases
– patients treated by three doctors
– children from different classes
– people assessed in groups
420
Why does it matter?
• Standard errors can be wrong
– therefore significance tests can be wrong
• Parameter estimates can be wrong
– really, really wrong
– from positive to negative
• An example
– students do an exam (on statistics)
– choose one of three questions
• IV: time
• outcome: grade
421
• Result, with line of best fit
[Scatterplot of Grade against Time, with line of best fit]
422
• Result shows that
– people who spent longer in the exam,
achieve better grades
• BUT …
– we haven’t considered which question
people answered
– we might have violated the independence
assumption
• outcome will be autocorrelated
• Look again
– with questions marked
423
• Now somewhat different
[Scatterplot of Grade against Time, with points marked by Question (1, 2, 3)]
424
• Now, people that spent longer got
lower grades
– questions differed in difficulty
– do a hard one, get better grade
– if you can do it, you can do it quickly
425
Dealing with Non-Independence
• For time series data
– Time series analysis (another course)
– Multilevel models (hard, some another
course)
• For clustered data
– Robust standard errors
– Generalized estimating equations
– Multilevel models
426
Cluster Robust Standard
Errors
• Predictor: School size
• Outcome: Grades
• Sample:
– 20 schools
– 20 children per school
• What is the N?
427
Robust Standard Errors
• Sample is:
– 400 children – is it 400?
– Not really
• Each child adds information
• First child in a school adds lots of information
about that school
– 100th child in a school adds less information
– How much less depends on how similar the children
in the school are
– 20 schools
• It’s more than 20
428
Robust SE in Stata
• Very easy
• reg outcome predictor,
robust cluster(clusterid)
• BUT
– Only to be used where clustering is a
nuisance only
• Only adjusts standard errors, not parameter
estimates
• Only to be used where parameter estimates
shouldn’t be affected by clustering
429
Example of Robust SE
• Effects of incentives for attendance at
adult literacy class
– Some students rewarded for attendance
– Others not rewarded
• 152 classes randomly assigned to each
condition
– Scores measured at mid term and final
430
Example of Robust SE
• Naïve
– reg postscore tx midscore
– Est: -.6798066 SE: .7218797
• Clustered
– reg postscore tx midscore,
robust cluster(classid)
– Est: -.6798066 SE .9329929
431
Problem with Robust
Estimates
• Only corrects standard error
– Does not correct estimate
• Other predictors must be uncorrelated
with predictors of group membership
– Or estimates wrong
• Two alternatives:
– Generalized estimating equations (gee)
– Multilevel models
432
Independence +
Heteroscedasticity
• Assumption is that residuals are:
– Independently and identically distributed
• i.i.d.
• Same procedure used for both problems
– Really, same problem
433
• Exercise 7.9, exercise 7.10
434
Assumption 6: All predictor
variables are uncorrelated
with the error term.
435
Uncorrelated with the Error
Term
• A curious assumption
– by definition, the residuals are uncorrelated
with the predictors (try it and see, if you
like)
• There are no other predictors that are
important
– That correlate with the error
– i.e. Have an effect
436
• Problem in economics
– Demand increases supply
– Supply increases wages
– Higher wages increase demand
• OLS estimates will be (badly) biased in
this case
– need a different estimation procedure
– two-stage least squares
• simultaneous equation modelling
– Instrumental variables
437
Another Haiku
Supply and demand:
without a good instrument,
not identified.
438
Assumption 7: No predictors are
a perfect linear function of
other predictors
no perfect multicollinearity
439
No Perfect Multicollinearity
• IVs must not be linear functions of one
another
– matrix of correlations of IVs is not positive definite
– cannot be inverted
– analysis cannot proceed
• Have seen this with
– age, age start, time working (can’t have all three
in the model)
– also occurs with subscale and total in model at the
same time
440
• Large amounts of collinearity
– a problem (as we shall see) sometimes
– not an assumption
• Exercise 7.11
441
Assumption 8: The mean of the
error term is zero.
You will like this one.
442
Mean of the Error Term = 0
• Mean of the residuals = 0
• That is what the constant is for
– if the mean of the error term deviates from
zero, the constant soaks it up
Y = β0 + β1x1 + ε
Y = (β0 + 3) + β1x1 + (ε − 3)
- note, Greek letters because we are
talking about population values
443
• Can do regression without the constant
– Usually a bad idea
– E.g R2 = 0.995, p < 0.001
• Looks good
444
[Scatterplot of y against x1]
445
Lesson 8: Issues in
Regression Analysis
Things that alter the
interpretation of the regression
equation
446
The Four Issues
• Causality
• Sample sizes
• Collinearity
• Measurement error
447
Causality
448
What is a Cause?
• Debate about definition of cause
– some statistics (and philosophy) books try
to avoid it completely
– We are not going into depth
• just going to show why it is hard
• Two dimensions of cause
– Ultimate versus proximal cause
– Determinate versus probabilistic
449
Proximal versus Ultimate
• Why am I here?
– I walked here because
– This is the location of the class because
– Eric Tanenbaum asked me because
– (I don’t know)
– because I was in my office when he rang
because
– I was a lecturer at Derby University
because
– I saw an advert in the paper because
450
– I exist because
– My parents met because
– My father had a job …
• Proximal cause
– the direct and immediate cause of
something
• Ultimate cause
– the thing that started the process off
– I fell off my bicycle because of the bump
– I fell off because I was going too fast
451
Determinate versus Probabilistic
Cause
• Why did I fall off my bicycle?
– I was going too fast
– But every time I ride too fast, I don’t fall
off
– Probabilistic cause
• Why did my tyre go flat?
– A nail was stuck in my tyre
– Every time a nail sticks in my tyre, the tyre
goes flat
– Deterministic cause
452
• Can get into trouble by mixing them
together
– Eating deep fried Mars Bars and doing no
exercise are causes of heart disease
– “My Grandad ate three deep fried Mars
Bars every day, and the most exercise he
ever got was when he walked to the shop
next door to buy one”
– (Deliberately?) confusing deterministic and
probabilistic causes
453
Criteria for Causation
• Association (correlation)
• Direction of Influence (a  b)
• Isolation (not c  a and c  b)
454
Association
• Correlation does not mean causation
– we all know
• But
– Causation does mean correlation
• Need to show that two things are related
– may be correlation
– may be regression when controlling for third (or
more) factor
455
• Relationship between price and sales
– suppliers may be cunning
– when people want it more
• stick the price up
          Price    Demand    Sales
Price     1        0.6       0
Demand    0.6      1         0.6
Sales     0        0.6       1
– So – no relationship between price
and sales
456
– Until (or course) we control for demand
– b1 (Price) = -0.56
– b2 (Demand) = 0.94
• But which variables do we enter?
457
Direction of Influence
• Relationship between A and B
– three possible processes
A
B
A
B
B causes A
A
B
C causes A & B
C
A causes B
458
• How do we establish the direction of
influence?
– Longitudinally?
Barometer
Drops
Storm
– Now if we could just get that barometer
needle to stay where it is …
• Where the role of theory comes in
(more on this later)
459
Isolation
• Isolate the outcome from all other
influences
– as experimenters try to do
• Cannot do this
– can statistically isolate the effect
– using multiple regression
460
Role of Theory
• Strong theory is crucial to making
causal statements
• Fisher said: to make causal statements
“make your theories elaborate.”
– don’t rely purely on statistical analysis
• Need strong theory to guide analyses
– what critics of non-experimental research
don’t understand
461
• S.J. Gould – a critic
– says correlate price of petrol and his age,
for the last 10 years
– find a correlation
– Ha! (He says) that doesn’t mean there is a
causal link
– Of course not! (We say).
• No social scientist would do that analysis
without first thinking (very hard) about the
possible causal relations between the variables
of interest
• Would control for time, prices, etc …
462
• Atkinson, et al. (1996)
– relationship between college grades and
number of hours worked
– negative correlation
– Need to control for other variables –
ability, intelligence
• Gould says “Most correlations are non-causal” (1982, p. 243)
– Of course!!!!
463
I drink a lot of beer
• 16 causal relations: laugh, bathroom, jokes (about statistics), children wake early, karaoke, curtains closed, sleeping, headache, equations (beermat), thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys
• 120 non-causal correlations among those consequences
464
• Abelson (1995) elaborates on this
– ‘method of signatures’
• A collection of correlations relating to
the process
– the ‘signature’ of the process
• e.g. tobacco smoking and lung cancer
– can we account for all of these findings
with any other theory?
465
1. The longer a person has smoked cigarettes, the greater the risk of cancer.
2. The more cigarettes a person smokes over a given time period, the greater the risk of cancer.
3. People who stop smoking have lower cancer rates than do those who keep smoking.
4. Smokers’ cancers tend to occur in the lungs, and be of a particular type.
5. Smokers have elevated rates of other diseases.
6. People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer.
7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.
8. Non-smokers who live with smokers have elevated cancer rates.
(Abelson, 1995: 183-184)
466
– In addition, there should be no anomalous correlations
• If smokers had more fallen arches than non-smokers, that would not be consistent with the theory
• Failure to use theory to select appropriate variables
– specification error
– e.g. in the previous example
– Predict wealth from price and sales
• increase price, wealth increases
• increase sales, wealth increases
467
• Sometimes these are indicators of the
process, not the process itself
– e.g. barometer – stopping the needle won’t
help
– e.g. inflation? Indicator or cause of
economic health?
468
No Causation without
Experimentation
• Blatantly untrue
– I don’t doubt that the sun shining makes
us warm
• Why the aversion?
– Pearl (2000) says problem is that there is
no mathematical operator (e.g. “=“)
– No one realised that you needed one
– Until you build a robot
469
AI and Causality
• A robot needs to make judgements
about causality
• Needs to have a mathematical
representation of causality
– Suddenly, a problem!
– Doesn’t exist
• Most operators are non-directional
• Causality is directional
470
Sample Sizes
“How many subjects does it take
to run a regression analysis?”
471
Introduction
• Social scientists don’t worry enough about the
sample size required
– “Why didn’t you get a significant result?”
– “I didn’t have a large enough sample”
• Not a common answer, but very common reason
• More recently awareness of sample size is
increasing
– use too few – no point doing the research
– use too many – waste their time
472
• Research funding bodies
• Ethical review panels
– both become more interested in sample
size calculations
• We will look at two approaches
– Rules of thumb (quite quickly)
– Power Analysis (more slowly)
473
Rules of Thumb
• Lots of simple rules of thumb exist
– 10 cases per IV
– and at least 100 cases
– Green (1991) more sophisticated
• To test significance of R2 – N = 50 + 8k
• To test significance of slopes, N = 104 + k
• Rules of thumb don’t take into account
all the information that we have
– Power analysis does
474
Power Analysis
Introducing Power Analysis
• Hypothesis test
– tells us the probability of a result of that
magnitude occurring, if the null hypothesis is
correct (i.e. there is no effect in the population)
• Doesn’t tell us
– the probability of that result, if the null hypothesis
is false (i.e., there actually is an effect in the
population)
475
• According to Cohen (1982) all null
hypotheses are false
– everything that might have an effect, does
have an effect
• it is just that the effect is often very tiny
476
Type I Errors
• Type I error is false rejection of H0
• Probability of making a type I error
– α – the significance value cut-off
• usually 0.05 (by convention)
• Always this value
• Not affected by
– sample size
– type of test
477
Type II errors
• Type II error is false acceptance of the
null hypothesis
– Much, much trickier
• We think we have some idea
– we almost certainly don’t
• Example
– I do an experiment (random sampling, all
assumptions perfectly satisfied)
– I find p = 0.05
478
– You repeat the experiment exactly
• different random sample from same population
– What is probability you will find p < 0.05?
– Answer: 0.5
– Another experiment, I find p = 0.01
– Probability you find p < 0.05?
– Answer: 0.79
• Very hard to work out
– not intuitive
– need to understand non-central sampling
distributions (more in a minute)
479
• Probability of type II error = beta (β)
– same as population regression parameter
(to be confusing)
• Power = 1 – β
– Probability of getting a significant result
(given that there is a significant result to
be found)
480
State of the World

Research Findings                  H0 true (no effect to be found)    H0 false (effect to be found)
We find no effect (p > 0.05)       ✓                                  Type II error, p = β
We find an effect (p < 0.05)       Type I error, p = α                ✓  power = 1 − β
481
• Four parameters in power analysis
– α – prob. of Type I error
– β – prob. of Type II error (power = 1 – β)
– Effect size – size of effect in population
– N
• Know any three, can calculate the
fourth
– Look at them one at a time
482
• α – Probability of Type I error
– Usually set to 0.05
– Somewhat arbitrary
• sometimes adjusted because of circumstances
– rarely because of power analysis
– May want to adjust it, based on power
analysis
483
• β – Probability of type II error
– Power (probability of finding a result) = 1 – β
– Standard is 80%
• Some argue for 90%
– Implication that Type I error is 4 times
more serious than type II error
• adjust ratio with compromise power analysis
484
•
Effect size in the population
– Most problematic to determine
– Three ways
1. What effect size would be useful to find?
•
R2 = 0.01 - no use (probably)
2. Base it on previous research
– what have other people found?
3. Use Cohen’s conventions
– small R2 = 0.02
– medium R2 = 0.13
– large R2 = 0.26
485
– Effect size usually measured as f²
– For R²:
f² = R² / (1 − R²)
486
– For (standardised) slopes:
f² = sri² / (1 − R²)
– Where sr² is the contribution to the variance accounted for by the variable of interest
– i.e. sr² = R² (with variable) – R² (without)
• change in R² in hierarchical regression
487
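A minimal sketch of the calculation in Stata, with made-up R² values (0.30 with the predictor in the model, 0.25 without):

display "sr2 = " 0.30 - 0.25
display "f2  = " (0.30 - 0.25) / (1 - 0.30)    // ≈ 0.07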
• N – the sample size
– usually use other three parameters to
determine this
– sometimes adjust other parameters (a)
based on this
– e.g. You can have 50 participants. No
more.
488
Doing power analysis
• With power analysis program
– SamplePower, Gpower (free), Nquery
– With Stata command sampsi
• Which I find very confusing
• But we’ll use it anyway
489
sampsi
• Limited in usefulness
– A categorical, two group predictor
• sampsi 0 0.5, pre(1) r01(0.5)
n1(50) sd(1)
– Find power for detecting an effect of 0.5
• When there’s one other variable at baseline
• Which correlates 0.5
• 50 people in each group
• When sd is 1.0
490
sampsi …
Method: ANCOVA
relative efficiency = 1.143
adjustment to sd    = 0.935
adjusted sd1        = 0.935
Estimated power:
power               = 0.762
491
GPower
• Better for regression designs
492
Underpowered Studies
• Research in the social sciences is often
underpowered
– Why?
– See Paper B11 – “the persistence of
underpowered studies”
495
Extra Reading
• Power traditionally focuses on p values
– What about CIs?
– Paper B8 – “Obtaining regression
coefficients that are accurate, not simply
significant”
496
• Exercise 8.1
497
Collinearity
498
Collinearity as Issue and
Assumption
• Collinearity (multicollinearity)
– the extent to which the predictors are
(multiply) correlated
• If R2 for any IV, using other IVs = 1.00
– perfect collinearity
– variable is linear sum of other variables
– regression will not proceed
– (SPSS will arbitrarily throw out a variable)
499
• R2 < 1.00, but high
– other problems may arise
• Four things to look at in collinearity
– meaning
– implications
– detection
– actions
500
Meaning of Collinearity
• Literally ‘co-linearity’
– lying along the same line
• Perfect collinearity
– when some IVs predict another
– Total = S1 + S2 + S3 + S4
– S1 = Total – (S2 + S3 + S4)
– rare
501
• Less than perfect
– when some IVs are close to predicting
other IVs
– correlations between IVs are high (usually,
but not always) → high multiple
correlations
502
Implications
• Effects the stability of the parameter
estimates
– and so the standard errors of the
parameter estimates
– and so the significance and CIs
• Because
– shared variance, which the regression
procedure doesn’t know where to put
503
• Sex differences
– due to genetics?
– due to upbringing?
– (almost) perfect collinearity
• statistically impossible to tell
504
• When collinearity is less than perfect
– increases variability of estimates between
samples
– estimates are unstable
– reflected in the variances, and hence
standard errors
505
Detecting Collinearity
• Look at the parameter estimates
– large standardised parameter estimates
(>0.3?), which are not significant
• be suspicious
• Run a series of regressions
– each IV as outcome
– all other IVs as IVs
• for each IV
506
• Sounds like hard work?
– SPSS does it for us!
• Ask for collinearity diagnostics
– Tolerance – calculated for every IV
Tolerance = 1 − R²
– Variance Inflation Factor
VIF = 1 / Tolerance
• its square root is the amount by which the s.e. has been inflated
507
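In Stata the same diagnostics are available after regress; a minimal sketch (predictor names are illustrative):

reg salary educ salbegin
estat vif          // reports VIF and 1/VIF (tolerance) for each predictor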
Actions
What you can do about collinearity
“no quick fix” (Fox, 1991)
1. Get new data
• avoids the problem
• address the question in a different way
• e.g. find people who have been raised as the ‘wrong’ gender
– they exist, but are rare
– Not a very useful suggestion
508
2. Collect more data
• not different data, more data
• collinearity increases the standard error (se)
• se decreases as N increases
– so get a bigger N
3. Remove / combine variables
• If an IV correlates highly with the other IVs, it is not telling us much new
• If you have two (or more) IVs which are very similar
– e.g. 2 measures of depression, socio-economic status, achievement, etc
509
• sum them, average them, or remove one
• Many measures
– use principal components analysis to reduce them
4. Use stepwise regression (or some flavour of it)
• See previous comments
• Can be useful in a theoretical vacuum
5. Ridge regression
• not very useful
• behaves weirdly
510
• Exercise 8.2, 8.3, 8.4
511
Measurement Error
512
What is Measurement Error
• In social science, it is unlikely that we
measure any variable perfectly
– measurement error represents this
imperfection
• We assume that we have a true score
– T
• A measure of that score
–x
513
xT e
• just like a regression equation
– standardise the parameters
– T is the reliability
• the amount of variance in x which comes from T
• but, like a regression equation
– assume that e is random and has mean of zero
– more on that later
514
Simple Effects of
Measurement Error
• Lowers the measured correlation
– between two variables
• Real correlation
– true scores (x* and y*)
• Measured correlation
– measured scores (x and y)
515
[Path diagram: true scores x* and y* correlate rx*y* (the true correlation of x and y); x* → x and y* → y, with reliabilities rxx and ryy and error terms e; the measured correlation of x and y is rxy]
516
• Attenuation of correlation
rxy = rx*y* × √(rxx × ryy)
• Attenuation-corrected correlation
rx*y* = rxy / √(rxx × ryy)
517
• Example
rxx = 0.7, ryy = 0.8, rxy = 0.3
rx*y* = rxy / √(rxx × ryy)
      = 0.3 / √(0.7 × 0.8) = 0.40
518
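The same correction as a one-line Stata calculation (numbers from the example above):

display 0.3 / sqrt(0.7*0.8)    // attenuation-corrected correlation ≈ 0.40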
Complex Effects of
Measurement Error
• Really horribly complex
• Measurement error reduces correlations
– reduces the estimate of β
– reducing one estimate
• increases others
– because of effects of control
– combined with effects of suppressor
variables
– exercise to examine this
519
Dealing with Measurement
Error
• Attenuation correction
– very dangerous
– not recommended
• Avoid in the first place
– use reliable measures
– don’t discard information
• don’t categorise
• Age: 10-20, 21-30, 31-40 …
520
Complications
• Assume measurement error is
– additive
– linear
• Additive
– e.g. weight – people may under-report / over-report at the extremes
• Linear
– particularly the case when using proxy variables
521
• e.g. proxy measures
– Want to know effort on childcare, count
number of children
• 1st child is more effort than 19th child
– Want to know financial status, count
income
• 1st £1 much greater effect on financial status
than the 1,000,000th.
522
• Exercise 8.5
523
Lesson 9: Non-Linear Analysis
in Regression
524
Introduction
• Non-linear effect occurs
– when the effect of one predictor
– is not consistent across the range of the IV
• Assumption is violated
– expected value of residuals = 0
– no longer the case
525
Some Examples
526
A Learning Curve
[Skill plotted against Experience]
527
Yerkes-Dodson Law of Arousal
[Performance plotted against Arousal]
528
Enthusiasm Levels over a Lesson on Regression
[Enthusiasm (from suicidal to enthusiastic) plotted against Time (0 to 3.5)]
529
• Learning
– line changed direction once
• Yerkes-Dodson
– line changed direction once
• Enthusiasm
– line changed direction twice
530
Everything is Non-Linear
• Every relationship we look at is non-linear, for two reasons
– Exam results cannot keep increasing with
reading more books
• Linear in the range we examine
– For small departures from linearity
• Cannot detect the difference
• Non-parsimonious solution
531
Non-Linear Transformations
532
Bending the Line
• Non-linear regression is hard
– We cheat, and linearise the data
• Do linear regression
Transformations
• We need to transform the data
– rather than estimating a curved line
• which would be very difficult
• may not work with OLS
– we can take a straight line, and bend it
– or take a curved line, and straighten it
• back to linear (OLS) regression
533
• We still do linear regression
– Linear in the parameters
– e.g. Y = b1x + b2x² + …
• Can do non-linear regression
– Non-linear in the parameters
– e.g. Y = b1·x^(b2)
• Much trickier
– Statistical theory either breaks down OR
becomes harder
534
• Linear transformations
– multiply by a constant
– add a constant
– change the slope and the intercept
535
[Plot of y = x, y = x + 3, and y = 2x against x]
536
• Linear transformations are no use
– alter the slope and intercept
– don’t alter the standardised parameter
estimate
• Non-linear transformation
– will bend the slope
– quadratic transformation
y = x²
– one change of direction
537
– Cubic transformation
y = x² + x³
– two changes of direction
538
• To estimate a non-linear regression
– we don’t actually estimate anything non-linear
– we transform the x-variable to a non-linear
version
– can estimate that straight line
– represents the curve
– we don’t bend the line, we stretch the
space around the line, and make it flat
539
Detecting Non-linearity
540
Draw a Scatterplot
• Draw a scatterplot of y plotted against x
– see if it looks a bit non-linear
– e.g. Education and beginning salary
• from bank data
• with line of best fit
541
A Real Example
• Starting salary and years of education
– From employee data.sav
542
[Scatterplot of beginning salary against educational level (years), with fitted line: the expected value of the error (residual) is > 0 in some ranges of education and < 0 in others]
543
Use Residual Plot
• Scatterplot is only good for one variable
– use the residual plot (that we used for
heteroscedasticity)
• Good for many variables
544
• We want
– points to lie in a nice straight sausage
545
• We don’t want
– a nasty bent sausage
546
• Educational level and starting salary
[Residuals plotted against the linear prediction]
547
Carrying Out Non-Linear
Regression
548
Linear Transformation
• Linear transformation doesn’t change
– interpretation of slope
– standardised slope
– se, t, or p of slope
– R2
• Can change
– effect of a transformation
549
• Actually more complex
– with some transformations can add a
constant with no effect (e.g. quadratic)
• With others does have an effect
– inverse, log
• Sometimes it is necessary to add a
constant
– negative numbers have no square root
– 0 has no log
550
Education and Salary
Linear Regression
• Saw previously that the assumption of
expected errors = 0 was violated
• Anyway …
– R2 = 0.401, p < 0.001
– salbegin = −6290 + 1727 × educ
– Standardised
• b1 (educ) = 0.633
– Both parameters make sense
551
Non-linear Effect
• Compute new variable
– quadratic
– educ2 = educ²
• Add this variable to the equation
– R2 = 0.585, p < 0.001
– salbegin = 46263 − 6542 × educ + 310 × educ²
• slightly curious
– Standardised
• b1 (educ) = -2.4
• b2 (educ2) = 3.1
– What is going on?
552
• Collinearity
– is what is going on
– Correlation of educ and educ2
• r = 0.990
– Regression equation becomes difficult
(impossible?) to interpret
• Need hierarchical regression
– what is the change in R2
– is that change significant?
– R2 (change) = 0.184, p < 0.001
553
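A minimal Stata sketch of the hierarchical step (employee data, variables as above):

gen educ2 = educ^2
reg salbegin educ             // model 1: linear only
reg salbegin educ educ2       // model 2: add the quadratic term
test educ2                    // is the change (the quadratic term) significant?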
Cubic Effect
• While we are at it, let’s look at the cubic
effect
– R2 (change) = 0.004, p = 0.045
– salbegin = 19138 + 103 × e − 206 × e² + 12 × e³
– Standardised:
b1(e) = 0.04
b2(e²) = −2.04
b3(e³) = 2.71
554
Fourth Power
• Keep going while we are ahead?
– When do we stop?
555
Interpretation
• Tricky, given that parameter estimates
are a bit nonsensical
• Two methods
• 1: Use R2 change
– Save predicted values
• or calculate predicted values to plot line of best
fit
– Save them from equation
– Plot against IV
556
[Beginning salary plotted against educational level (years), with the fitted predictions]
557
• Differentiate with respect to e
• We said:
s = 19138 + 103 × e − 206 × e² + 12 × e³
– but first we will simplify it to the quadratic
s = 46263 − 6542 × e + 310 × e²
• dy/dx = −6542 + 310 × 2 × e
558
Education    Slope
9            −962
10           −342
11            278
12            898
13           1518
14           2138
15           2758
16           3378
17           3998
18           4618
19           5238
20           5858

1 year of education at the higher end of the scale is worth more than 1 year at the lower end of the scale. MBA versus GCSE.
559
• Differentiate the cubic
s = 19138 + 103 × e − 206 × e² + 12 × e³
dy/dx = 103 − 206 × 2 × e + 12 × 3 × e²
• Can calculate slopes for the quadratic and cubic at different values
560
Education    Slope (Quad)    Slope (Cub)
9            −962            −689
10           −342            −417
11            278             −73
12            898             343
13           1518             831
14           2138            1391
15           2758            2023
16           3378            2727
17           3998            3503
18           4618            4351
19           5238            5271
20           5858            6263
561
A Quick Note on
Differentiation
• For y = x^p
– dy/dx = p·x^(p−1)
• For equations such as
y = b1·x + b2·x^p
dy/dx = b1 + b2·p·x^(p−1)
• y = 3x + 4x²
– dy/dx = 3 + 4 · 2 · x
562
• y = b1·x + b2·x² + b3·x³
– dy/dx = b1 + b2 · 2 · x + b3 · 3 · x²
• y = 4x + 5x² + 6x³
– dy/dx = 4 + 5 · 2 · x + 6 · 3 · x²
• Many functions are simple to differentiate
– Not all though
563
Splines and Knots
• Estimate a different slope following an
event
– Lines are splines
– Events are knots
• Event might be known
– Marriage
• Might be unknown
– How many years after brain injury does
recovery start
564
Lesson 10: Regression for
Counts and Categories
Dichotomous/Nominal outcomes
565
Contents
• General and Generalized Linear Models
• Dichotomous – logistic / probit
• Counts – Poisson and negative binomial
566
GLMs and GLMs
• General linear models
– Ordinary least squares regression based
models
– Identity link function
– Regression, ANOVA, correlation, etc
• Generalized linear models
– More links
– More error structures
– General linear models are a subset of
generalized linear models
567
Dichotomous
• Often in social sciences, we have a
dichotomous/nominal outcome
– we will look at dichotomous first, then a quick look
at multinomial
• Dichotomous outcome
• e.g.
– guilty/not guilty
– pass/fail
– won/lost
– alive/dead (used in medicine)
568
Why Won’t OLS Do?
569
Example: PTSD in Veterans
• How does length of deployment affect
probability of PTSD?
– Have PTSD, or don’t.
– We might be interested in severity
• Army are not
• If you have PTSD, you need help
– Not going back
• Develop a selection procedure
– Two predictor variables
– Rank – 1 = Staff Sgt, 5 = Private
– Deployment length (months)
570
• 1st ten cases
Rank    Months    PTSD
5       6         0
1       15        0
1       12        0
4       6         0
1       15        1
1       6         0
4       16        1
1       10        1
3       12        0
4       26        1
571
• outcome
– PTSD (1 = Yes, 0 = No)
• Just consider rank first
– Carry out regression
– Rank as predictor, PTSD as outcome
– R2 = 0.097, F = 4.1, df = 1, 48, p = 0.028.
– b0 = 0.190
– b1 = 0.110, p=0.028
• Seems OK
572
• Residual plot
573
• Problems 1 and 2
– strange distributions of residuals
– parameter estimates may be wrong
– standard errors will certainly be wrong
574
• 2nd problem – interpretation
– I have rank 2
– Predicted PTSD = 0.190 + 0.110 × 2 = 0.41
– I have rank 8
– Predicted PTSD = 0.190 + 0.110 × 8 = 1.07
• Seems OK, but
– What does it mean?
– Cannot score 0.41 or 1.07
• can only score 0 or 1
• Cannot be interpreted
– need a different approach
575
A Different Approach
Logistic Regression
576
Logit Transformation
• In lesson 9, transformed IVs
– now transform the outcome
• Need a transformation which gives us
– graduated scores (between 0 and 1)
– No upper limit
• we can’t predict someone will pass twice
– No lower limit
• you can’t do worse than fail
577
Step 1: Convert to Probability
• First, stop talking about values
– talk about probability
– for each value of score, calculate
probability of pass
• Solves the problem of graduated scales
578
Score (rank)      1      2      3      4      5
No PTSD     N     7      5      6      4      2
            P     0.7    0.5    0.6    0.4    0.2
PTSD        N     3      5      4      6      8
            P     0.3    0.5    0.4    0.6    0.8

e.g. the probability of no PTSD given a rank of 1 is 0.7; given a rank of 5 it is 0.2
579
This is better
• Now a score of 0.41 has a meaning
– a 0.41 probability of PTSD
• But a score of 1.07 has no meaning
– cannot have a probability > 1 (or < 0)
– Need another transformation
580
Step 2: Convert to Odds-Ratio
Need to remove upper limit
• Convert to odds
• Odds, as used by betting shops
– 5:1, 1:2
• Slightly different from odds in speech
– a 1 in 2 chance
– odds are 1:1 (evens)
– 50%
581
• Odds ratio = (number of times it happened) / (number of times it didn’t happen)

odds = p(event) / p(not event) = p(event) / (1 − p(event))
582
• p = 0.8: odds = 0.8/0.2 = 4
– equivalent to 4:1 (odds on)
– 4 times out of five
• p = 0.2: odds = 0.2/0.8 = 0.25
– equivalent to 1:4 (4:1 against)
– 1 time out of five
583
• Now we have solved the upper bound
problem
– we can interpret 1.07, 2.07, 1000000.07
• But we still have the zero problem
– we cannot interpret predicted scores less
than zero
584
Step 3: The Log
• Log10 of a number (x) is the power to which 10 must be raised to give x
10^(log10(x)) = x
• log(10) = 1
• log(100) = 2
• log(1000) = 3
585
• log(1) = 0
• log(0.1) = -1
• log(0.00001) = -5
586
Natural Logs and e
• Don’t use log10
– Use loge
• Natural log, ln
• Has some desirable properties, that log10
doesn’t
– For us:
– If y = ln(x) + c
– dy/dx = 1/x
– Not true for any other logarithm
587
• Be careful – calculators and stats
packages are not consistent when they
use log
– Sometimes log10, sometimes loge
588
Take the natural log of the odds ratio
• Goes from −∞ to +∞
– can interpret any predicted value
589
Putting them all together
• Logit transformation
– log-odds ratio
– not bounded at zero or one
590
Score (rank)           1       2       3       4       5
No PTSD        N       7       5       6       4       2
               P       0.7     0.5     0.6     0.4     0.2
PTSD           N       3       5       4       6       8
               P       0.3     0.5     0.4     0.6     0.8
Odds (No PTSD)         2.33    1.00    1.50    0.67    0.25
log(odds) No PTSD      0.85    0.00    0.41    −0.41   −1.39
591
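The last two rows as a quick Stata calculation (using the rank-1 column as an example):

display "odds      = " 0.7/0.3
display "log(odds) = " ln(0.7/0.3)     // ≈ 0.85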
[Plot of probability against the logit: probability gets closer to zero, but never reaches it, as the logit goes down]
592
• Hooray! Problem solved, lesson over
– errrmmm… almost
• Because we are now using log-odds
ratio, we can’t use OLS
– we need a new technique, called Maximum
Likelihood (ML) to estimate the parameters
593
Parameter Estimation using
ML
ML tries to find estimates of model
parameters that are most likely to give
rise to the pattern of observations in
the sample data
• All gets a bit complicated
– OLS is a special case of ML
– the mean is an ML estimator
594
• Don’t have closed form equations
– must be solved iteratively
– estimates parameters that are most likely
to give rise to the patterns observed in the
data
– by maximising the likelihood function (LF)
• We aren’t going to worry about this
– except to note that sometimes, the
estimates do not converge
• ML cannot find a solution
595
R2 in Logistic Regression
• A dichotomous variable doesn’t have
variance
– If you know the mean (proportion) you
know the variance
– You can’t have R2.
• There are several pseudo-R2
• None are perfect
– There’s something better
596
Logistic Regression in Stata
• Exercise 10.1
• Two (almost) equivalent commands
– logistic ptsd rank deployment
– logit ptsd rank deployment
597
Logit
• Gives output in log-odds
• logit ptsd rank deployment

------------------------------------------------------------------------------
        pass | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  deployment |   1.158213    .0841987    2.02   0.043     1.004404    1.335575
        rank |   1.333192    .4011279    0.96   0.339     .7392395    2.404365
------------------------------------------------------------------------------
598
Logistic
• Gives output in odds ratios
– No intercept
• logistic ptsd rank deployment

------------------------------------------------------------------------------
        pass | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  deployment |   1.158213    .0841987    2.02   0.043     1.004404    1.335575
        rank |   1.333192    .4011279    0.96   0.339     .7392395    2.404365
------------------------------------------------------------------------------
599
• SPSS produces a classification table
– And Stata produces it if you ask
– predictions of model
– based on cut-off of 0.5 (by default)
– predicted values x actual values
• DO NOT USE IT!
• Will this person go to prison?
– No.
– You will be right 99.9% of the time
– Doesn’t mean you have a good model
– (Gottman and Murray – Blink)
600
Classification Table (cut value = .500)

                        Predicted PASS
Observed PASS       0        1        Percentage Correct
0                   18       8        69.2
1                   12       12       50.0
Overall Percentage                    60.0
601
Model parameters
•B
– Change in the logged odds associated with
a change of 1 unit in IV
– just like OLS regression
– difficult to interpret
• SE (B)
– Standard error
– Multiply by 1.96 to get 95% CIs
602
• Constant
– i.e. score = 0
– B = 1.314
– Exp(B) = e^B = e^1.314 = 3.720
– OR = 3.720
– p = 1 − (1 / (OR + 1)) = 1 − (1 / (3.720 + 1))
– p = 0.788
603
• Score 1
– Constant B = 1.314
– Score B = −0.467
– Exp(1.314 − 0.467) = Exp(0.847) = 2.332
– OR = 2.332
– p = 1 − (1 / (2.332 + 1)) = 0.699
604
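The same back-transformation in Stata (invlogit() converts a log-odds value to a probability):

display invlogit(1.314)            // ≈ 0.788, probability at score = 0
display invlogit(1.314 - 0.467)    // ≈ 0.700, probability at score = 1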
Standard Errors and CIs
• Symmetrical in B
– Non-symmetrical (sometimes very) in
exp(B)
605
• The odds of failing the test are
multiplied by 0.63 (CIs = 0.408, 0.962
p = 0.033), for every additional point
on the aptitude test.
606
Hierarchical Logistic
Regression
• In OLS regression
– Use R2 change
• In logistic regression
– Use chi-square change
• Difference in chi-square = chi-square
• Difference in df = df
607
Hierarchical Logistic
Regression
• Model 1: Experience
• Model 2: Experience + Score
• Model 1:
– Chi-square =4.83, df = 1
• Model 2:
– Chi-square =5.77, df = 2
608
• Difference:
– Chi-square = 5.77-4.83= 1.94,
– Df = 2 – 1 = 1
• gen p = 1 - chi2(1, 1.94)
• tab p
• p = 0.332
• P-value from SE = 0.339
• Why?
609
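A sketch of the same comparison using a likelihood-ratio test directly (variable names follow the PTSD example above; see the next slide for why this differs from the Wald p-value):

logit ptsd deployment
estimates store m1
logit ptsd deployment rank
estimates store m2
lrtest m1 m2            // LR chi2(1) test of adding rank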
More on Standard Errors
• Because of Wald standard errors
– Wald SEs are overestimated
– Makes the p-value in the estimates table wrong – too high
– (CIs still correct)
610
• Two estimates use slightly different
information
– P-value says “what if no effect”
– CI says “what if there is this effect”
• Variance depends on the hypothesised ratio of the
number of people in the two groups
• Can calculate likelihood-ratio based p-values
– If you can be bothered
– Some packages provide them automatically
611
Probit Regression
• Very similar to logistic
– much more complex initial transformation
(to normal distribution)
– Very similar results to logistic (multiplied by
1.7)
• Swap logistic for probit in Stata
command
– Harder to interpret
• Parameter doesn’t mean something – like log
odds
612
Differentiating Between Probit
and Logistic
• Depends on shape of the error term
– Normal or logistic
– Graphs are very similar to each other
• Could distinguish quality of fit
– Given enormous sample size
• Logistic = probit x 1.7
– Actually 1.6998
• Probit advantage
– Understand the distribution
• Logistic advantage
– Much simpler to get back to the probability
613
[Cumulative normal (probit) and logistic curves plotted together: nearly indistinguishable]
614
Infinite Parameters
• Non-convergence can happen because
of infinite parameters
– Insoluble model
• Three kinds:
• Complete separation
– The groups are completely distinct
• Pass group all score more than 10
• Fail group all score less than 10
615
• Quasi-complete separation
– Separation with some overlap
• Pass group all score 10 or more
• Fail group all score 10 or less
• Both cases:
– No convergence
• Close to this
– Curious estimates
– Curious standard errors
616
• Categorical Predictors
– Can cause separation
– Especially if correlated
• Need people in every cell
[Table of cells: Male/Female × White/Non-White × Below/Above Poverty Line]
617
Logistic Regression and
Diagnosis
• Logistic regression can be used for diagnostic
tests
– For every score
• Calculate probability that result is positive
• Calculate proportion of people with that score (or lower)
who have a positive result
• Calculate c statistic
– Measure of discriminative power
– % of all possible cases, where the model gives a
higher probability to a correct case than to an
incorrect case
618
– Perfect c-statistic = 1.0
– Random c-statistic = 0.5
619
Sensitivity and Specificity
• Sensitivity:
– Probability of saying someone has a positive result
• If they do: p(pos | pos)
• Specificity:
– Probability of saying someone has a negative result
• If they do: p(neg | neg)
620
C-Statistic, Sensitivity and
Specificity
• After logistic
– lroc
• Gives c-statistic
– Better than R-squared
621
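A minimal Stata sketch (the outcome pass and predictor score are assumed names):

logit pass score
lroc                  // plots the ROC curve and reports the c-statistic
estat classification  // sensitivity and specificity at the 0.5 cut-off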
[Figure: ROC curve – sensitivity plotted against 1 - specificity; area under ROC curve = 0.7469]
More Advanced Techniques
• Multinomial logistic regression – more than two categories in the outcome
– same procedure
– one category chosen as reference group
• odds of being in category other than reference
• Ordinal multinomial logistic regression
– For ordinal outcome variables
623
More on Odds Ratios
• Odds ratios are horrid
• We use them because they have nice
distributional properties
• Example:
– 40% in group 1 get PTSD
– 60% in group 2 get PTSD
– What’s the odds ratio?
– How is this confusing?
624
Alternatives to Odds Ratios
• Risk difference
– 20 percentage points higher
• Relative risk
– Probability is 1.5 times higher
– This is what you would think an odds ratio
meant
• Can we use these in regression?
– RD – maybe. Sometimes.
– RR – yes. But we need to do something
else first
625
Final Thoughts
• Logistic Regression can be extended
– dummy variables
– non-linear effects
– interactions
• Same issues as OLS
– collinearity
– outliers
626
• Same additional options as regress
– xi:
– cluster
– robust
627
Poisson Regression
628
Counts and the Poisson
Distribution
• Von Bortkiewicz
(1898)
– Numbers of Prussian
soldiers kicked to
death by horses
Deaths per corps-year:      0     1     2     3     4     5
Number of corps-years:    109    65    22     3     1     0

[Figure: bar chart of these frequencies]
629
• The data fitted a Poisson probability distribution
– When counts of events occur, poisson distribution is
common
– E.g. papers published by researchers, police arrests,
number of murders, ship accidents
• Common approach
– Log transform and treat as normal
• Problems
– Censored at 0
– Integers only allowed
– Heteroscedasticity
630
The Poisson Distribution
[Figure: Poisson probability distributions with means 0.5, 1, 4 and 8, plotted over counts 0–17]
631
exp( )
p ( y | x) 
y!
y
632
exp( )
p ( y | x) 
y!
y
Excel has a Poisson
function you can
use.
• Where:
– y is the count
–  is the mean of the Poisson distribution
• In a Poisson distribution
– The mean = the variance (hence
heteroscedasticity issue))
–   2
633
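These probabilities can be checked directly in Stata (a small sketch; for example, P(Y = 3) when the mean is 2):

display poissonp(2, 3)
display exp(-2) * 2^3 / exp(lnfactorial(3))

Both give 0.18, matching the probability table that follows.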
Poisson Probabilities
           Score:   0     1     2     3     4     5     6     7     8     9    10
Mean  1          0.37  0.37  0.18  0.06  0.02  0.00  0.00  0.00  0.00  0.00  0.00
Mean  2          0.14  0.27  0.27  0.18  0.09  0.04  0.01  0.00  0.00  0.00  0.00
Mean  3          0.05  0.15  0.22  0.22  0.17  0.10  0.05  0.02  0.01  0.00  0.00
Mean 10          0.00  0.00  0.00  0.01  0.02  0.04  0.06  0.09  0.11  0.13  0.13
634
Issues with Estimation
• Just as with logistic
– We can’t predict a mean below zero
• Don’t predict the mean
– Predict the log of the mean
635
Poisson Regression in Stata
• Adult literacy study
– Number of sessions attended
– Count variable
• Poisson regression
636
poisson sessions tx

------------------------------------------------------------------------------
    sessions |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          tx |  -.2359546     .06668    -3.54   0.000     -.366645   -.1052642
       _cons |   1.899973    .046225    41.10   0.000     1.809374    1.990572
------------------------------------------------------------------------------

poisson sessions tx, irr

------------------------------------------------------------------------------
    sessions |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          tx |   .7898166    .052665    -3.54   0.000     .6930557    .9000867
------------------------------------------------------------------------------
637
But was it Poisson?
• Look at predicted probabilities
– Compare with actual probabilities
• Predicted means
– Control: exp(1.899) = 6.68
– Intervention: exp(1.899-0.236) = 5.28
• Get means and SDs
638
bysort tx: sum sessions

-> tx = 0
    Variable |    Obs        Mean    Std. Dev.
-------------+--------------------------------
    sessions |     70    6.685714    3.495516

-> tx = 1
    Variable |    Obs        Mean    Std. Dev.
-------------+--------------------------------
    sessions |     82    5.280488    2.709263
639
• Do OK on the means
– Don’t do OK on the variances
– Variances are too high
• Compare predicted probabilities with
actual probabilities
• tab sessions tx, col nofreq
• Draw graphs
– Not horrible
– Except the zeroes
640
[Figure: predicted and actual probabilities for the control and intervention groups, plotted over counts 1–16]
Test for Goodness of Fit to
Poisson Distribution
• After running Poisson
– estat gof
Goodness-of-fit chi2  =  314.139
Prob > chi2(150)      =   0.0000
• Highly significant
– Poisson distribution doesn’t fit
642
Overdispersion
• Problem in Poisson regression
– Too many zeroes
• Causes
– Chi-square inflation
– Standard error deflation
• Hence p-values too low
– Higher type I error rate
• Two solutions
– Negative binomial regression
– Robust standard errors
643
Robust Standard Errors
poisson sessions tx, robust

---------------------------------------------------------
             |             Robust
    sessions |      Coef.   Std. Err.      z    P>|z|
-------------+-------------------------------------------
          tx |  -.2359546   .0840648    -2.81   0.005
       _cons |   1.899973   .0622477    30.52   0.000
---------------------------------------------------------
• Robust SEs are larger
644
Negative Binomial Regression
• Adds an extra parameter to account for the overdispersion (the excess zeroes)
– Called alpha
• nbreg sessions tx
• OR
• nbreg sessions tx, robust
645
Back to Categorical Outcomes
• We said:
– Odds ratios are not good
• We like relative risk instead
– What is the ratio of the risks?
• What analysis technique do we know that gives ratios of means?
646
• Poisson regression!
• Wait. It won’t work. The distribution is
wrong.
• Robust estimates!
647
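A minimal sketch of that idea in Stata (assuming a binary 0/1 outcome pass and a predictor group): fit a Poisson model to the binary outcome with robust standard errors, and read the exponentiated coefficient as a relative risk rather than an odds ratio.

poisson pass group, irr vce(robust)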
Poisson Regression in SPSS
• SPSS 15 (and above), has added it
• Under generalized linear models
648
Lesson 11: Mediation and Path
Analysis
649
Introduction
• Moderator
– Level of one variable influences effect of another
variable
• Mediator
– One variable influences another via a third
variable
• All relationships are really mediated
– are we interested in the mediators?
– can we make the process more explicit
650
• In the examples with the bank:
[Path diagram: education → beginning salary]
• Why?
– What is the process?
– Are we making assumptions about the
process?
– Should we test those assumptions?
651
[Path diagram: education → job skills, expectations, negotiating skills, kudos for the bank → beginning salary]
652
Direct and Indirect Influences
X may affect Y in two ways
• Directly – X has a direct (causal)
influence on Y
– (or maybe mediated by other variables)
• Indirectly – X affects Y via a mediating
variable - M
653
• e.g. how does going to the pub affect comprehension on a summer school course
– on, say, regression
[Path diagram: having fun in the pub in the evening → not reading books on regression → less knowledge; anything here (a direct path)?]
654
[Path diagram: the same model with fatigue added as a second mediator – is the direct path still needed?]
655
• Mediators needed
– to cope with more sophisticated theory in
social sciences
– make explicit assumptions made about
processes
– examine direct and indirect influences
656
Detecting Mediation
657
“Classic Approach” 4 Steps
From Baron and Kenny (1986)
• To establish that the effect of X on Y is
mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If effect of X controlling for M is zero, M
is complete mediator of the relationship
•
(3 and 4 in same analysis)
658
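A minimal Stata sketch of these steps, using the variable names from the book-habits example that follows (enjoy, buy, read):

regress read enjoy        // step 1: X predicts Y
regress buy enjoy         // step 2: X predicts M
regress read enjoy buy    // steps 3 and 4: M predicts Y, controlling for X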
Example: Book habits
Enjoy books → Buy books → Read books
659
Three Variables
• Enjoy
– How much an individual enjoys books
• Buy
– How many books an individual buys (in a
year)
• Read
– How many books an individual reads (in a
year)
660
           ENJOY    BUY   READ
ENJOY       1.00   0.64   0.73
BUY         0.64   1.00   0.75
READ        0.73   0.75   1.00
661
• The theory:
[Path diagram: enjoy → buy → read]
662
• Step 1
1. Show that X (enjoy) predicts Y (read)
– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK
663
2. Show that X (enjoy) predicts M (buy)
– b1 = 0.974, p < 0.001
– standardised b1 = 0.643
– OK
664
3. Show that M (buy) predicts Y (read),
controlling for X (enjoy)
– b1 = 0.469, p < 0.001
– standardised b1 = 0.206
– OK
665
4. If effect of X controlling for M is zero,
M is complete mediator of the
relationship
– (Same as analysis for step 3.)
– b2 = 0.287, p = 0.001
– standardised b2 = 0.431
– Hmmmm…
•
Significant, therefore not a complete mediator
666
[Path diagram: enjoy → read = 0.287 (step 4); enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3)]
667
The Mediation Coefficient
• Amount of mediation =
Step 1 – Step 4
=0.487 – 0.287
= 0.200
• OR
Step 2 x Step 3
=0.974 x 0.206
= 0.200
668
SE of Mediator
[Path diagram: enjoy → buy, path a (from step 2); buy → read, path b (from step 3)]
• s_a = se(a)
• s_b = se(b)
669
• Sobel test
– The standard error of the mediation coefficient can be calculated:

se_{ab} = \sqrt{b^{2}s_{a}^{2} + a^{2}s_{b}^{2} - s_{a}^{2}s_{b}^{2}}

a = 0.974, s_a = 0.189
b = 0.206, s_b = 0.054
670
• Indirect effect = 0.200
– se = 0.056
– t =3.52, p = 0.001
• Online Sobel test:
http://quantpsy.org
671
Problems with the Sobel test
• Recently
– Move in methodological literature away from this
conventional approach
• Problems of power:
– Several tests, all of which must be significant
• Type I error rate = 0.05 * 0.05 = 0.0025
• Must affect power
672
• Distributional Assumption
– We assume that the sampling distribution
of the coefficient is normally distributed
• Standard error is standard deviation
• If:
– a (x  m) is normal and not zero
– b (m  y) is normal and not zero
• Then:
– a×b
– Is not normally distributed
• Assumption is violated
– Test is incorrect
673
• Solution:
– Bootstrap
• Computer intensive semi-parametric procedure
• Removes distributional assumption
– Bootstrapping suggested as alternative
• For Stata:
– www.ats.ucla.edu/stat/stata/faq/mediation_cativ.htm
• For SAS, SPSS:
– www.quantpsy.org
674
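A minimal sketch of a bootstrapped indirect effect in Stata, using the enjoy/buy/read variable names from the earlier example (the program name indirect is made up for illustration):

program define indirect, rclass
    regress buy enjoy
    local a = _b[enjoy]
    regress read enjoy buy
    local b = _b[buy]
    return scalar ab = `a' * `b'
end

bootstrap r(ab), reps(1000): indirect
estat bootstrap, percentile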
Cross Sectional Bias
• If everything is measured at one time
– Likely to be bias
• Ideally:
– Three variables, measured on three
occasions
675
[Path diagram: x, m and y each measured on three occasions, with paths running from x to m to y across the waves]
676
• Kind of hard work
– Collecting data on three occasions
• BUT: Stationarity assumption can save
us
[Path diagram: the same cross-lagged model, with the repeated paths constrained to be equal (stationarity)]
677
• We assume that the effect from M to Y
is stable over time
– Only need two time points
• Cole and Maxwell (2003)
678
Power in Mediation
• Really hard to work out
• Need to run simulations
• Power depends on
– Size of a
– Size of b
• Fritz and Mackinnon (2007)
– Table of power for different effects
679
More Information on
Mediation
• Mackinnon, Fritz and Fairchild
– Annual Review of Psychology
• Mackinnon
– Introduction to statistical mediation
• Iacobucci
– Mediation analysis (little green book)
• Mackinnon’s website (Google: mackinnon mediation)
• Facebook group
– (No, really)
680
Lesson 12: Moderators in
Regression
“different slopes for different
folks”
681
Introduction
• Moderator relationships have many
different names
– interactions (from ANOVA)
– multiplicative
– non-linear (just confusing)
– non-additive
• All talking about the same thing
682
A moderated relationship occurs
• when the effect of one variable
depends upon the level of another
variable
683
• Hang on …
– That seems very like a nonlinear relationship
– Moderator
• Effect of one variable depends on level of another
– Non-linear
• Effect of one variable depends on level of itself
• Where there is collinearity
– Can be hard to distinguish between them
– Paper B5
– Should (usually) compare effect sizes
684
• e.g. How much it hurts when I drop a
computer on my foot depends on
– x1: how much alcohol I have drunk
– x2: how high the computer was dropped
from
– but if x1 is high enough
– x2 will have no effect
685
• e.g. Likelihood of injury in a car
accident
– depends on
– x1: speed of car
– x2: if I was wearing a seatbelt
– but if x1 is low enough
– x2 will have no effect
686
[Figure: injury severity plotted against speed (5–45 mph), with separate lines for seatbelt and no seatbelt]
687
• e.g. number of words (from a list) I can
remember
– depends on
– x1: type of words (abstract, e.g. ‘justice’, or
concrete, e.g. ‘carrot’)
– x2: Method of testing (recognition – i.e.
multiple choice, or free recall)
– but if using recognition
– x1: will not make a difference
688
• We looked at three kinds of moderator
• alcohol x height = pain
– continuous x continuous
• speed x seatbelt = injury
– continuous x categorical
• word type x test type
– categorical x categorical
• We will look at them in reverse order
689
How do we know to look for
moderators?
Theoretical rationale
• Often the most powerful
• Many theories predict additive/linear
effects
– Fewer predict moderator effects
Presence of heteroscedasticity
• Clue there may be a moderated
relationship missing
690
Two Categorical Predictors
691
Data
• 2 IVs
– word type (concrete [e.g. carrot, table], abstract [e.g. love, justice])
– test method (multiple choice, recall)
• 20 participants, in one of four groups
– Concrete, MC
– Concrete, recall
– Abstract, MC
– Abstract, recall
• 5 per group
• lesson12.1-words.dta
692
             MC             Recall         Total
             Mean    SD     Mean    SD     Mean    SD
Concrete     15.4    2.5    15.2    3.2    15.3    2.7
Abstract     15.6    1.5     7.0    1.6    11.3    4.8
Total        15.5    2.0    11.1    4.9    13.3    4.3
693
• Graph of means
[Figure: mean words recalled for concrete and abstract words under MC and recall testing]
694
Procedure for Testing
1: Convert to dummy coding
– Already done
2: Calculate interaction term
– Multiply dummy codes together
– (Can also use xi: for this)
– Call interaction mxc
695
• Interaction term (mxc)
– multiply the dummy coded variables together

Concrete    MC    mxc
    0        0     0
    0        1     0
    1        0     0
    1        1     1
696
3: Carry out regression
– Hierarchical
– linear effects first
– interaction effect in next block
697
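A minimal Stata sketch of this procedure, using lesson12.1-words.dta; the variable names (words for the outcome, mc and concrete for the dummies) are assumptions:

gen mxc = mc * concrete
regress words mc concrete        // block 1: the two linear effects
regress words mc concrete mxc    // block 2: add the interaction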
• b0 (intercept) = 7.0
– Mean score when MC = 0 and concrete = 0
• b1 (mc) = 8.6
– When concrete is zero, effect of MC
• b2 (concrete) = 8.2
– When mc is zero, effect of concrete
            Concrete   Abstract   Total
MC            15.4       15.6     15.5
Recall        15.2        7.0     11.1
Total         15.3       11.3     13.3
698
• b3 (mc x con) = -8.4
– grand mean
• Given other estimates, what’s the
predicted mean of concrete, MC?
– 7.0 + 8.6 + 8.2 = 23.8
• What is it?
– 15.4
            Concrete   Abstract   Total
Recog         15.4       15.6     15.5
Recall        15.2        7.0     11.1
Total         15.3       11.3     13.3
699
• Have: 15.4
• Expect: 23.8
• Difference: -8.4
700
Back to the Graph
[Figure: interaction plot of the cell means]
• Slope for concrete words: 15.2 - 15.4 = -0.2
• Slope for abstract words: 7.0 - 15.6 = -8.6
• Difference in slopes: -8.6 - (-0.2) = -8.4
701
b associated with interaction
• The difference in the slopes
OR
• The change in slope, away from the
average, associated with a 1 unit
change in the moderating variable
702
• Another way to look at it:
Y = 7 + 8.6·m + 8.2·c + (-8.4)·m·c
• Examine the concrete words group (c = 1)
– substitute values into the equation
Y(conc) = 7 + 8.6·m + 8.2·1 + (-8.4)·m·1
Y(conc) = 7 + 8.6·m + 8.2 - 8.4·m
Y(conc) = 7 + 8.2 + 8.6·m - 8.4·m
Y(conc) = 15.2 + 0.2·m
703
Categorical x Continuous
704
Note on Dichotomisation
• Very common to see people dichotomise
a variable
– Makes the analysis easier
– Very bad idea
• Paper B6
705
Data
A chain of 60 supermarkets
• examining the relationship between
profitability, shop size, and local
competition
• 2 IVs
– shop size
– comp (local competition, 0=no, 1=yes)
• outcome
– profit
706
• Data, ‘lesson 12.2.dta’

Shopsize   Comp   Profit
     4       1      23
    10       1      25
     7       0      19
    10       0       9
    10       1      18
    29       1      33
    12       0      17
     6       1      20
    14       0      21
    62       0       8
707
1st Analysis
Two IVs
• R2 = 0.367, df = 2, 57, p < 0.001
• Unstandardised estimates
– b1 (shopsize) = 0.083 (p = 0.001)
– b2 (comp) = -5.883 (p < 0.001)
• Standardised estimates
– b1 (shopsize) = 0.356
– b2 (comp) = -0.448
708
• Suspicions
– Presence of competition is likely to have an
effect
– Residual plot shows a little
heteroscedasticity
709
[Figure: residuals plotted against the linear prediction, showing a little heteroscedasticity]
Procedure for Testing
• Very similar to last time
– convert ‘comp’ to dummy coding
• (if it’s not already)
– Compute interaction term
• comp (dummy coded) × size
– Hierarchical regression
711
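A minimal Stata sketch of this, using the variable names from lesson 12.2.dta (shopsize, comp, profit):

gen sxc = shopsize * comp
regress profit shopsize comp        // block 1: linear effects
regress profit shopsize comp sxc    // block 2: add the interaction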
Result
• Estimates
– b1 (shopsize) = 0.12, SE = 0.03
– b2 (comp) = -1.67, SE 2.50
– b3 (sxc) = -0.10, SE 0.05
712
• comp is now non-significant
– shows the importance of the hierarchical approach
– competition obviously is important
713
Interpretation
• Draw graph with lines of best fit
– graph twoway (scatter profit shopsize if comp==1) (lfit profit shopsize if comp==1) (scatter profit shopsize if comp==0) (lfit profit shopsize if comp==0), legend(off)
714
[Figure: profit plotted against shopsize (0–100), with scatterpoints and fitted lines for shops with and without local competition]
715
• Substitute into equation
• Effects of size
– (can ignore the constant)
Y = size·0.12 + comp·(-1.67) + size·comp·(-0.09)
– Competition present (comp = 1)
Y = size·0.12 + 1·(-1.67) + size·1·(-0.09)
Y = size·0.12 + size·(-0.09)
Y = size·0.03
716
Y = size·0.12 + comp·(-1.67) + size·comp·(-0.09)
– Competition absent (comp = 0)
Y = size·0.12 + 0·(-1.67) + size·0·(-0.09)
Y = size·0.12
717
Two Continuous Variables
718
Data
• Bank Employees
– only using clerical staff
– 363 cases
– predicting starting salary
– previous experience
– age
– age x experience
– (exercise 6.3)
719
• Correlation matrix
– only one correlation significant

             LOGSB   AGESTART   PREVEXP
LOGSB         1.00      -0.09      0.08
AGESTART     -0.09       1.00      0.77
PREVEXP       0.08       0.77      1.00
720
Initial Estimates (no moderator)
• (standardised)
– R2 = 0.063, p<0.001
– Age at start = -0.37, p<0.001
– Previous experience = 0.36, p<0.001
• Suppressing each other
– Age and experience compensate for one
another
– Older, with no experience, bad
– Younger, with experience, good
721
The Procedure
• Very similar to previous
– create multiplicative interaction term
– BUT
• Center variables (subtract mean)
– Not always necessary
– Can make life easier
722
• Hierarchical regression
– two linear effects first
– moderator effect in second
723
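A minimal Stata sketch of the centering and the two blocks, using the variable names from the bank example (logsb, agestart, prevexp):

summarize agestart, meanonly
gen age_c = agestart - r(mean)
summarize prevexp, meanonly
gen exp_c = prevexp - r(mean)
gen axe = age_c * exp_c
regress logsb age_c exp_c        // block 1: linear effects
regress logsb age_c exp_c axe    // block 2: add the moderator term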
• Change in R2
– 0.085, p<0.001
• Estimates (standardised)
– b1 (agestart) = -0.52
– b2 (prevexp) = 0.93
– b3 (age x exp) = -0.56
724
Interpretation 1: Pick-a-Point
• Graph is tricky
– can’t have two continuous variables
– Choose specific points (pick-a-point)
• Graph the line of best fit of one variable at chosen values of the other
– Two ways to pick a point
• 1: Choose high (z = +1), medium (z = 0) and low (z = -1)
• 2: Choose ‘sensible’ values – age 20, 50, 80?
725
• We know:
– Y = e·0.94 + a·(-0.53) + a·e·(-0.58)
– Where a = agestart and e = experience
• We can rewrite this as:
– Y = (e·0.94) + (-0.53 + e·(-0.58))·a
– Take a outside the brackets
• The bracketed terms are the simple intercept and simple slope – the intercept and slope for agestart
– β0 = e·0.94
– β1 = -0.53 + e·(-0.58)
– Y = β0 + β1·a
726
• Pick any value of e, and we know the intercept and slope for a
– Standardised, so it’s easy
• e = -1
– β0 = (-1 × 0.94) = -0.94
– β1 = (-0.53 + (-1) × (-0.58))·a = 0.05·a
• e = 0
– β0 = (0 × 0.94) = 0
– β1 = (-0.53 + 0 × (-0.58))·a = -0.53·a
• e = 1
– β0 = (1 × 0.94) = 0.94
– β1 = (-0.53 + 1 × (-0.58))·a = -1.11·a
727
Graph the Three Lines
[Figure: log(salary) plotted against standardised age (-1 to 1), with separate lines for e = -1, e = 0 and e = 1]
728
Do This in Stata
• The easy way
• Create some pseudo cases
– Some fake people
– With sensible scores for the variables
– Regression equation ‘stays behind’
• Calculate predicted scores with
predict
– (Can be in a new dataset)
729
• Then draw the graph
• drop if _n < 364
• graph twoway (lfit pred agestart if prevexp==0, lcolor(red)) (lfit pred agestart if prevexp==85, lcolor(black)) (lfit pred agestart if prevexp==170, lcolor(green)), legend(off)
731
[Figure: predicted log(salary) plotted against agestart (20–50) at prevexp = 0, 85 and 170]
• (Also works in SPSS; in SAS, use Proc Score)
733
Interpretation 2: P-Values and CIs
• Second way
– Newer, rarely done
• Calculate CIs of the slope
– At any point
• Calculate p-value
– At any point
• Give ranges of significance
734
What do you need?
• The variance and covariance of the
estimates
– SPSS doesn’t provide estimates for
intercept
– Need to do it manually
• In options, exclude intercept
– Create intercept – c = 1
– Use it in the regression
735
• Enter information into web page:
• www.people.ku.edu/~preacher/interact
/mlr2.htm
• Get results
• Calculations in Bauer and Curran (in
press: Multivariate Behavioral Research)
– Paper B13
736
[Figure: MLR 2-way interaction plot produced by the web page – Y plotted against X at three conditional values of Z1]
737
Areas of Significance
[Figure: the simple slope with confidence bands, plotted against experience (-4 to 4), showing where the slope is significantly different from zero]
738
• 2 complications
– 1: Constant differed
– 2: outcome was logged, hence non-linear
• effect of 1 unit depends on where the unit is
– See paper A2
739
Finally …
740
Unlimited Moderators
• Moderator effects are not limited to
– 2 variables
– linear effects
741
Three Interacting Variables
• Age, Sex, Exp
• Block 1
– Age, Sex, Exp
• Block 2
– Age x Sex, Age x Exp, Sex x Exp
• Block 3
– Age x Sex x Exp
742
• Results
– All two way interactions significant
– Three way not significant
– Effect of Age depends on sex
– Effect of experience depends on sex
– Size of the age x experience interaction
does not depend on sex (phew!)
743
Moderated Non-Linear
Relationships
• Enter non-linear effect
• Enter non-linear effect x moderator
– if significant, the degree of non-linearity differs by moderator
744
745
Lesson 13: Longitudinal Models
746
Advantages of Longitudinal
Data
• You get more data from the same
number of people
• You can test causal relationships
– Although you can’t rule them out
• You can examine change
• You can control for individual
differences
747
Disadvantage of Longitudinal
Data
• It’s much harder to analyze
748
Longitudinal Research
• For comparing
repeated measures
– Clusters are people
• Data are usually
short and fat
ID   V1   V2   V3   V4
 1    2    3    4    7
 2    3    6    8    4
 3    2    5    7    5
749
Converting Data
• Change data to tall
and thin
• Use reshape in
stata
• Use Data,
Restructure in
SPSS
• Clusters are ID
ID   V   X
 1   1   2
 1   2   3
 1   3   4
 1   4   7
 2   1   3
 2   2   6
 2   3   8
 2   4   4
 3   1   2
 3   2   5
 3   3   7
 3   4   5
750
Predict Salary Change
• Use exercise5.3-bank salary.dta
– Compare beginning salary and salary
– Would normally use paired samples t-test
• Difference = $17,403, 95% CIs $16,427.407 to $18,379.555
751
Predict Salary Change
• Don’t take the difference in salary and
salbegin
– Why not?
• reg salary agestart salbegin
• Est: -207.8, 95% CIs -267.4, -148.2
752
Restructure the Data
• gen id = _n
• rename salbegin sal1
• rename salary sal2
• reshape long sal, i(id) j(t)
• replace t = t-1
753
Restructure the Data
• Do it again
– With data tall and thin
• Do a regression
– What do we find?
ID   Time   Cash
 1    0     $18,750
 1    1     $21,450
 2    0     $12,000
 2    1     $21,900
 3    0     $13,200
 3    1     $45,000
754
Results
• We have violated the independence
assumption
• We have the wrong answer
• Simplest way to solve it:
• regress sal t, cluster(id)
• Assumes that ID is just an irritant
– Rather inflexible
755
However …
• That has one advantage
– Missing data doesn’t mean that we exclude
the case
– If data are missing at random (or missing
completely at random) estimates will be
unbiased
756
• If everyone has
– Score at time 1
– Score at time 2
• Analysis is easy
• If half the people have
– Score at time 1
– Score at time 2
• Analysis is easy
• But what if some have
– Time 1
– Time 2
– Time 1 and 2
757
• If data are
– Missing at random (MAR)
– Missing completely at random (MCAR)
– (Crappy names)
• No problem
[Table: illustrative T1 and T2 scores, with some people measured at only one time point]
Interesting …
• That wasn’t very interesting
– What is more interesting is when:
• We have missing data
– Which we won’t talk about more (much)
• We have multiple measurements of the same
people
– Which we will talk about
759
Modelling Change
• Can plot and assess trajectories over
time
• How do people change?
• What predicts the rate of change?
760
Plotting Individuals
[Figure: one person’s salary at T1 and T2, joined by a line]
761
Plotting Individuals
[Figure: salary trajectories from T1 to T2 for persons 1, 2 and 3]
762
[Figure: salary plotted against t (0 to 1) separately for each of 36 individuals (graphs by id), and the individual trajectories overlaid on a single plot]
Estimation
• Each individual has an intercept
– Sampled from the population of intercepts
• Each individual has a slope
– Sampled from the population of slopes
• Can we estimate the average of each
– And a measure of their variance?
• Yes! With multilevel models
765
Multilevel Models
• Can do all kinds of clever things
– We won’t worry about most of them
• Used when
– Level 1 units (measures)
– Are nested within
– Level 2 units (people)
• Same person measured twice
– Violates independence
766
Levels
• In regression
– Everything is at one level
• In multilevel models
– We have multiple levels
• Hierarchical levels (hence hierarchical linear
models)
• Random effects (random effects models)
• Mixed effects (mixed models)
767
Levels
• Level 1 units
– First level of measurement
– Are clustered within
• Level 2 units
– Second level of measurement
768
Some Equations
• (This is very hard. It’s not important).
• In regression
y_i = b_0 + b_1 x_{i1} + e_i
• If x is time
– And we have one person
– Index time with i
– And call it T
769
• Single person equation
y_i = b_0 + b_1 T_i + e_i
• But what if we have lots of people?
– We’ve used i for time
– We’ll use j for people
y_{ij} = b_0 + b_1 T_{ij} + e_{ij}
• But everything is fixed
– We want to have some random effects
770
• Let’s make intercepts random
– Everyone has their own intercept
y_{ij} = b_{0j} + b_1 T_{ij} + e_{ij}
• Look! We added a little j
– Now it’s a multilevel model
• And we need an equation for the intercept
771
• Equation for each person’s intercept
b_{0j} = γ_{00} + μ_{0j}
• Your intercept (b_{0j}) is equal to:
– The mean intercept
• γ_{00} (gamma)
– Plus a residual (for that individual)
• μ_{0j} (mu)
– This is the level 2 model
• Level 2 residuals
• i.i.d., etc.
772
• So now we have
y_{ij} = b_{0j} + b_1 T_{ij} + e_{ij}
b_{0j} = γ_{00} + μ_{0j}
• Or
y_{ij} = (γ_{00} + μ_{0j}) + b_1 T_{ij} + e_{ij}
773
Make Time Random
• The value of the time parameter can vary
– Amongst people
– Everyone can have a different effect of time
y_{ij} = b_{0j} + b_{1j} T_{ij} + e_{ij}
774
• Time is random
– Everyone has a slope parameter
b_{1j} = γ_{10} + μ_{1j}
• So:
y_{ij} = b_{0j} + b_{1j} T_{ij} + e_{ij}
b_{0j} = γ_{00} + μ_{0j}
b_{1j} = γ_{10} + μ_{1j}
775
• Or
y_{ij} = b_{0j} + b_{1j} T_{ij} + e_{ij}
b_{0j} = γ_{00} + μ_{0j}
b_{1j} = γ_{10} + μ_{1j}
y_{ij} = (γ_{00} + μ_{0j}) + (γ_{10} + μ_{1j}) T_{ij} + e_{ij}
776
Time Invariant Covariates
• Can be added at level 2
– We can predict person’s intercept
• Starting point
– And rate of change
777
Employee Data
• Level 1:
– Pay measures
• Two of them
– Clustered within
• Level 2:
– People
– Level 2 measures: age, sex, job, etc
778
Regression with Time
• Do a regression analysis
– On one person
• Time is the predictor
• Get a regression line for that person
779
Fixed vs Random Effects
• Fixed effects
– Effect is the same across all clusters
(people)
– Variation is only measurement error
• Random effects
– Effect varies across people
– Additional parameter in the model
• Less parsimonious
780
Fixed vs Random Effects
• If an effect has variance
– It might have covariance
– With any other effects which also have
variance
• More parameters
– Less parsimony
781
Covariates
• Two kinds of covariates
– Time invariant
• Fixed for a person
– Age when study started
– Sex
– Time variant
• Can change over time
– Time
– Marital Status
782
Time Invariant
• Look at effect of age
– Add age to the fixed effects
– Is that significant?
– Are random effects (still?) significant
783
Multilevel Models in Stata
• Use xtmixed
– Stata 10 added xtmelogit, xtmepoisson
• Continuous outcomes are hard enough
• In SPSS, continuous outcomes only
784
Multilevel Models in Stata
• xtmixed sal t
• Does regression
• Need to tell it about the people:
• xtmixed sal t ||id:
785
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7321    35.04   0.000      16429.9   18377.06
       _cons |   17016.09   610.6691    27.86   0.000      15819.2   18212.98
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                   sd(_cons) |   10875.87    449.592      10029.44   11793.73
-----------------------------+------------------------------------------------
                sd(Residual) |   7647.093   248.6285      7174.992   8150.258
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 280.88   Prob >= chibar2 = 0.0000
786
The same output, annotated:
• The coefficient on t (17403.48) is the average slope
• _cons (17016.09) is the average intercept
• sd(_cons) is the SD of the individual intercepts
• sd(Residual) is the within-person residual SD
787
• Gives random intercepts only
– Let’s look at them
• predict rand_int
• xtline rand_int, overlay t(t) i(id) legend(off)
788
[Figure: the predicted values plotted against t (0 to 1) for each individual, overlaid]
Random Slopes
• Everyone has the same slope
– Maybe that’s not true
– Make slopes random
• xtmixed sal t ||id: t
790
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7273    35.04   0.000     16429.91   18377.05
       _cons |   17016.09   361.5138    47.07   0.000     16307.53   17724.64
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
                       sd(t) |   10814.52   351.6071      10146.88   11526.09
                   sd(_cons) |   7870.712   255.9014      7384.801   8388.595
-----------------------------+------------------------------------------------
                sd(Residual) |    .709687   .3436148      .2747476   1.833158
------------------------------------------------------------------------------
• predict rand_int_slope, fitted
• xtline rand_int_slope, overlay t(t) i(id) legend(off)
792
[Figure: fitted salary trajectories with random intercepts and random slopes, plotted against t (0 to 1), overlaid]
Structure of the Covariances
• We have been forcing slope and
intercepts to be uncorrelated
• Let’s correlate them
• xtmixed sal t ||id: t, cov(un)
794
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7316    35.04   0.000     16429.91   18377.06
       _cons |   17016.09   361.5085    47.07   0.000     16307.54   17724.63
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Unstructured             |
                       sd(t) |   10103.18   11670.32      1050.081   97205.97
                   sd(_cons) |   7382.783   7985.867      886.1058   61511.25
               corr(t,_cons) |   .8550486   3.491476            -1          1
-----------------------------+------------------------------------------------
                sd(Residual) |   2727.788   21601.34      .0004956   1.50e+10
------------------------------------------------------------------------------
Predicting Change
• Does another variable moderate the effect of time?
– This means that the effect of time varies
– As a function of that variable
• xi: xtmixed sal i.t*agestart ||id: t
796
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _It_1 |   25483.16   1640.544    15.53   0.000     22267.75   28698.57
    agestart |  -5.157022   30.85168    -0.17   0.867     -65.6252   55.31115
 _ItXagest_1 |  -212.5138   41.25203    -5.15   0.000    -293.3663  -131.6613
       _cons |   17205.18   1226.935    14.02   0.000     14800.43   19609.93
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Unstructured             |
                       sd(t) |   9871.434   9785.792      1414.372   68896.47
                   sd(_cons) |   7437.632   6495.359      1342.987   41190.55
               corr(t,_cons) |   .8622535   2.921366            -1          1
-----------------------------+------------------------------------------------
                sd(Residual) |   2619.971    18422.9      .0027095   2.53e+09
------------------------------------------------------------------------------
Exercises
• 13.1, 13.2
798
Fixed Effects Models
• A second way of looking at longitudinal
data
• Multilevel (mixed) models
– Assume that intercepts are random
• Fixed effects models
– Assume they are fixed
– If they are fixed they can correlate
• With all other predictors
799
Fixed Effects Models
• Allowing intercepts to correlate
– Has the effect of controlling for ALL time
invariant predictors
– Even those you didn’t measure
– Each person is their own control
800
Fixed Effects Models
• Regression asks:
– Are people who are higher on x also higher
on y?
• Fixed effects asks:
– When a person is higher on x are they also
higher on y
– Effects are within people, not between
people
801
Fixed Effects in Stata
• Make data long, then
– xtreg sal t agestart, i(id)
802
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7319    35.04   0.000     16427.41   18379.56
       _cons |   17016.09   351.2425    48.45   0.000      16325.9   17706.28
-------------+----------------------------------------------------------------
     sigma_u |  12145.928
     sigma_e |  7647.0911
         rho |  .71612838   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:  F(473, 473) = 5.05              Prob > F = 0.0000
Fixed Effects Regression
• Can we look at the effect of time
invariant predictors?
• xtreg sal t agestart, i(id) fe
• Why not?
804
Interactions
• But we can look at interactions of time
invariant predictors
• xi: xtreg sal i.t*agestart, i(id) fe
805
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _It_1 |   25483.16   1640.538    15.53   0.000     22259.48   28706.84
    agestart |  (dropped)
 _ItXagest_1 |  -212.5138   41.25188    -5.15   0.000    -293.5743  -131.4533
       _cons |   17009.25   342.8103    49.62   0.000     16335.62   17682.88
-------------+----------------------------------------------------------------
     sigma_u |   12087.77
     sigma_e |  7455.6318
         rho |  .72441115   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Exercises
• 13.3, 13.4
807
Bonus Lesson 1: Why
Regression?
A little aside, where we look at
why regression has such a curious
name.
808
Regression
The or an act of regression; reversion;
return towards the mean; return to an
earlier stage of development, as in an
adult’s or an adolescent’s behaving like
a child
(From Latin gradi, to go)
• So why is a statistical technique which is about prediction and explanation named after it?
809
• Francis Galton
– Charles Darwin’s cousin
– Studying heritability
• Tall fathers have shorter sons
• Short fathers have taller sons
– ‘Filial regression toward mediocrity’
– Regression to the mean
810
• Galton thought this was biological fact
– Evolutionary basis?
• Then did the analysis backward
– Tall sons have shorter fathers
– Short sons have taller fathers
• Regression to the mean
– Not biological fact, statistical artefact
811
Other Examples
• Secrist (1933): The Triumph of Mediocrity in
Business
• Second albums often tend to not be as good
as first
• Sequel to a film is not as good as the first
one
• Sports Illustrated Cover Jinx
• Parents think that punishing bad behaviour
works, but rewarding good behaviour doesn’t
812
• Accident reduction schemes
– Always reduce accidents
• Poor radiologists improve after training
• Any treatment for a cold will work
– Or for most illnesses
• Deaths due to methadone in Utah
– High last year
– Must take action!
813
Pair Link Diagram
• An alternative to a scatterplot
[Pair link diagram: each case’s x score joined by a line to its y score]
814
r = 1.00
[Pair link diagram for r = 1.00]
815
r = 0.00
[Pair link diagram for r = 0.00]
816
From Regression to
Correlation
• Where do we predict an individual’s
score on y will be, based on their score
on x?
– Depends on the correlation
• r = 1.00 – we know exactly where they
will be
• r = 0.00 – we have no idea
• r = 0.50 – we have some idea
817
r = 1.00
[Pair link diagram: a case that starts here on x will end up at the corresponding point on y]
818
r = 0.00
[Pair link diagram: a case that starts here on x could end up anywhere on y]
819
r = 0.50
[Pair link diagram: a case that starts here on x will probably end up somewhere around here on y]
820
Galton Squeeze Diagram
• Don’t show individuals
– Show groups of individuals, from the same
(or similar) starting point
– Shows regression to the mean
821
r = 0.00
[Galton squeeze diagram: groups starting at different points on x all end at the mean of y]
822
r = 0.50
[Galton squeeze diagram]
823
r = 1.00
[Galton squeeze diagram]
824
[Diagram: a group 1 unit from the mean on x ends up r units from the mean on y]
• Correlation is the amount of regression that doesn’t occur
825
• No regression
• r = 1.00
[Galton squeeze diagram]
826
• Some regression
• r = 0.50
[Galton squeeze diagram]
827
• Lots (maximum) regression
• r = 0.00
[Galton squeeze diagram]
828
Formula
\hat{z}_{y} = r_{xy} z_{x}
829
Conclusion
• Regression towards mean is statistical necessity
regression = perfection – correlation
• Very non-intuitive
• Interest in regression and correlation
– From examining the extent of regression towards
mean
– By Pearson – worked with Galton
– Stuck with curious name
• See also Paper B3
830
• Correcting for regression to the mean
– Possible
– Makes lots of tricky assumptions
• To appear to do well in your job / life
– Do something after someone has failed
– You probably can’t do worse
– Hospital / school / department / class /
study / experiment
• If it fails, volunteer to do it
831
Bonus Lesson 2: Other Kinds
of Regression
832
Introduction
• We’ve covered a few kinds of
regression
– There are many more, for specific types of
outcomes
833
Beta Regression
• Used when the outcome variable is beta
distributed
• Rates and proportions
– Bounded by zero and 1
– Uniform, or strange shaped distributions
834
Cox Proportional Hazards
Regression
• Type of survival model
• Used for time to an event
– When the event might not occur
• Developed for medical research
835
Cox Proportional Hazards
Regression
• E.g. How long does it take for a car to
break down
– The car crashes, and is scrapped
– We’ll never know – but we want to know
the information
– Discarding the data point would lead to
bias
836
Competing Risks Survival
• Time to multiple events
– Several things are trying to kill you
– Which one succeeds
837
Data Mining Techniques
• Avoid problems with stepwise
regression
– Used as alternatives to logistic
• Boosted regression (Stata command:
boost)
– Semi-parametric alternative to logistic
regression
• Least Angle Regression (LARS)
838
• Classification trees
839
Seemingly Unrelated
Regression
• Used for multiple outcomes
– With correlated error terms
• Some say it should be seemingly related
• Stata command: sureg
840
Instrumental Variables
Regression
• Used for mediator models
– Although economists don’t call them that.
• Use ivregress in Stata
841
Quantile Regression
• Why do we always try to predict the
mean?
• What about predicting the median? The
25th percentile?
• That’s quantile regression
842
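A minimal Stata sketch, with made-up variable names y and x:

qreg y x                    // median (50th percentile) regression
qreg y x, quantile(0.25)    // regression on the 25th percentile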
Non-Parametric Regression
• (Might also include LARS/Lasso/Boost)
• Don’t force any functional form on the
relationship
• LO(W)ESS – locally weighted scatterplot
smoothing
– Will find any relationship
843
Robust Regression
• (Careful, not sandwich estimators)
• Trimmed of outliers, estimated with
bootstrap
• Lots of publications by Wilcox
844
Censored Regression
• Tobit regression
– When a measure is censored
– E.g. unemployed people work 0 hours.
845