Advanced Methods and Models in Behavioral

Advanced Models and Methods
in Behavioral Research
• Chris Snijders
• c.c.p.snijders@gmail.com
ToDo
(if not done yet):
Enroll in 0a611
• 3 ects
• http://www.chrissnijders.com/ammbr
(=studyguide)
• literature: Field book + separate course material
• laptop exam (+ assignments)
Advanced Methods and Models in Behavioral Research –
The methods package
• MMBR (6 ects)
– Blumberg: questions, reliability, validity, research design
– Field: SPSS: factor analysis, multiple regression, ANCOVA,
sample size etc
• AMMBR (3 ects)
– Field (1 chapter): logistic regression
– literature through website: conjoint analysis, multi-level regression
Models and methods: topics
• t-test, Cronbach's alpha, etc
• multiple regression, analysis of (co)variance and
factor analysis
• logistic regression
• conjoint analysis / repeated measures
– Stata next to SPSS
– “Finding new questions”
– Some data collection
In the background:
“now you should be able to deal with data on your own”
Methods in brief (1)
• Logistic regression: target Y, predictors Xi.
Y is a binary variable (0/1).
- Why not just multiple regression?
- Interpretation is more difficult
- Goodness of fit is non-standard
- ...
(and it is a chapter in Field)
Methods in brief (2)
• Conjoint analysis
Example offer:
- 10 Euro p/m
- 2 years fixed
- free phone
- ...
How attractive is this offer to you?
Underlying assumption: for each user, the "utility" of an offer can be written as
U(x1, x2, ..., xn) = c0 + c1 x1 + ... + cn xn
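A minimal sketch of how such a linear utility could be evaluated for one offer. The attribute coding (price per month, contract years, free phone yes/no) and all coefficient values are made up for illustration; they are assumptions, not estimates from the course material.

```python
def utility(levels, coefficients, intercept=0.0):
    """Linear utility U(x1..xn) = c0 + c1*x1 + ... + cn*xn."""
    if len(levels) != len(coefficients):
        raise ValueError("need exactly one coefficient per attribute")
    return intercept + sum(c * x for c, x in zip(coefficients, levels))

# Hypothetical coding of the example offer:
# x1 = price in euro per month, x2 = contract length in years, x3 = free phone (0/1)
offer = [10, 2, 1]

# Hypothetical coefficients c1..c3 and intercept c0 (assumed values)
coefs = [-0.2, -0.5, 1.5]
u = utility(offer, coefs, intercept=3.0)
print(u)
```

In a real conjoint study the coefficients would be estimated from respondents' ratings of many such offers.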
Conjoint analysis as an “in between method”
Between
Which phone do you like and why?
What would your favorite phone be?
And:
Let’s keep track of what people buy.
We have:
Local Master Thesis example:
Fiber to the home
Speed: really fast
Price: sort of high
Installation: free!
Your neighbors: are in!
How attractive is this to you?
(Roel Schuring)
Coming up with new ideas (3)
“More research is necessary”
But on what?
YOU: come up with sensible new
ideas, given previous research
Stata next to SPSS
• It’s just better (faster, better written, more possibilities, better programmable …)
• Multi-level regression is much easier than in SPSS
• It’s good to be exposed to more than just a single statistics package (your knowledge should not be based on “where to click” arguments)
• More stable
• BTW Supports OSX as well… (anybody?)
Every advantage has a disadvantage
• Output less “polished”
• It takes some extra work
to get you started
• The Logistic Regression
chapter in the Field book
uses SPSS (but still readable
for the larger part)
• (and it’s not campus
software, but subfaculty
software)
• Installation …
If on Windows, try downloading
• www.chrissnijders.com/ammbr/TUeStata12-zip.exe
Logistic Regression Analysis
That is: your Y variable is 0/1:
Now what?
The main points
1. Why do we have to know and sometimes use logistic regression?
2. What is the underlying model? What is maximum likelihood estimation?
3. Logistics of logistic regression analysis
   1. Estimate coefficients
   2. Assess model fit
   3. Interpret coefficients
   4. Check residuals
4. An SPSS example
Suppose we have 100 observations with information
about an individual's age and whether or not this individual
had some kind of heart disease (CHD)
ID    age   CHD
1     20    0
2     23    0
3     24    0
4     25    1
…     …     …
98    64    0
99    65    1
100   69    1
A graphic representation of the data (figure: CHD plotted against Age)
Let’s just try regression analysis
pr(CHD|age) = -.54 +.022*Age
... linear regression is not a suitable model for probabilities
pr(CHD|age) = -.54 +.0218107*Age
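To see the problem concretely, one can plug a few ages into the fitted line. This is only a sketch using the coefficients printed above; age 80 lies outside the observed age range (20 to 69) and is included as an assumed extrapolation.

```python
def linear_probability(age, intercept=-0.54, slope=0.0218107):
    """Predicted 'probability' of CHD from the linear regression line."""
    return intercept + slope * age

for age in (20, 40, 60, 80):
    p = linear_probability(age)
    # Flag predictions that cannot be probabilities
    flag = "" if 0.0 <= p <= 1.0 else "  <- not a valid probability"
    print(f"age {age}: {p:.3f}{flag}")
```

The line produces a negative value for a 20-year-old and a value above 1 for an 80-year-old, which is why a model that is bounded between 0 and 1 is needed.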
In this graph for 8 age groups, I plotted the probability of
having a heart disease (proportion)
A nonlinear model is probably better here
Something like this
This is the logistic regression model
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_1)}}$$
Predicted probabilities are always between 0 and 1
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_1)}}$$
The part in the exponent is similar to classic regression analysis
Side note: this is similar to MMBR …
Suppose Y is a percentage (so between 0 and 1).
Then consider
…which will ensure that the estimated Y will vary between 0 and 1
and after some rearranging this is the same as
… (continued)
And one “solution” might be:
- Change all Y values that are 0 to 0.001
- Change all Y values that are 1 to 0.999
Now run regression on log(Y/(1-Y)) …
… but that really is sort of higgledy-piggledy …
Logistics of logistic regression
1. How do we estimate the coefficients?
2. How do we assess model fit?
3. How do we interpret coefficients?
4. How do we check regression assumptions?
Kinds of estimation in regression
• Ordinary Least Squares (we fit a line through a cloud
of dots)
• Maximum likelihood (we find the parameters that are
the most likely, given our data)
We never bothered to consider maximum likelihood in standard
multiple regression, because you can show that OLS and maximum
likelihood lead to exactly the same estimator there (in MR, that is;
in general they differ).
Actually, maximum likelihood has superior statistical
properties (efficiency, consistency, invariance, …)
Maximum likelihood estimation
• Method of maximum likelihood yields values
for the unknown parameters that maximize
the probability of obtaining the observed set
of data
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \varepsilon_1)}}$$
Unknown parameters: b0 and b1
Maximum likelihood estimation
• First we have to construct the “likelihood
function” (probability of obtaining the
observed set of data).
Likelihood = pr(obs1)*pr(obs2)*pr(obs3)…*pr(obsn)
Assuming that observations are independent
Log-likelihood
• For technical reasons the likelihood is
transformed in the log-likelihood (then you
just maximize the sum of the logged
probabilities)
LL= ln[pr(obs1)]+ln[pr(obs2)]+ln[pr(obs3)]…+ln[pr(obsn)]
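A small sketch of the log-likelihood for the logistic model with one predictor. It uses only the seven rows that are visible in the data table earlier in these slides (not the full 100 cases), so the numbers are purely illustrative; the function names are my own.

```python
import math

# The seven visible rows of the CHD data (an illustrative subset)
ages = [20, 23, 24, 25, 64, 65, 69]
chd = [0, 0, 0, 1, 0, 1, 1]

def predicted_probability(age, b0, b1):
    """Logistic model: Pr(CHD = 1 | age)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))

def log_likelihood(b0, b1, xs, ys):
    """Sum of ln[pr(obs_i)], assuming independent observations."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(x, b0, b1)
        # pr(obs) is p when CHD was observed, and 1 - p when it was not
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

# With b0 = b1 = 0 every case gets probability .5
ll_null = log_likelihood(0.0, 0.0, ages, chd)
print(ll_null, math.exp(ll_null))
```

Exponentiating the log-likelihood gives back the likelihood itself, here the product of seven probabilities of .5.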
Some subtleties
• In OLS, we did not need stochastic assumptions to
be able to calculate a best-fitting line (only for the
estimates of the confidence intervals we need that).
With maximum likelihood estimation we need this
from the start
(and let us not be bothered at this point by how
the confidence intervals are calculated in
maximum likelihood)
Note: optimizing log-likelihoods is difficult
• It’s iterative (“searching the landscape”)
→ it might not converge
→ it might converge to the wrong answer
Nasty implication:
extreme cases should be left out
(some handwaving here)
SPSS output
Estimation of coefficients: SPSS Results
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.11 X_1)}}$$
Variables in the Equation
                     B        S.E.    Wald     df   Sig.   Exp(B)
Step 1a  age         ,111     ,024    21,254   1    ,000   1,117
         Constant    -5,309   1,134   21,935   1    ,000   ,005
a. Variable(s) entered on step 1: age.
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.11 X_1)}}$$
This function fits best: other values of b0 and b1 give worse results
(that is, other values have a smaller likelihood value)
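The claim can be checked numerically by comparing log-likelihoods for different slopes while keeping b0 = -5.3 fixed. The sketch below uses the slope .11 and the two alternatives .05 and .40; as an assumption it is evaluated on the seven visible rows of the data table only, so it shows the ranking, not the actual SPSS likelihood values.

```python
import math

# The seven visible rows of the CHD data (an illustrative subset, not all 100 cases)
ages = [20, 23, 24, 25, 64, 65, 69]
chd = [0, 0, 0, 1, 0, 1, 1]

def log_likelihood(b0, b1):
    """Log-likelihood of the logistic model on the subset above."""
    total = 0.0
    for age, y in zip(ages, chd):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# Same intercept, three candidate slopes
results = {b1: log_likelihood(-5.3, b1) for b1 in (0.05, 0.11, 0.40)}
for b1, ll in results.items():
    print(f"b1 = {b1:.2f}: LL = {ll:.2f}")

best = max(results, key=results.get)
print("highest log-likelihood at b1 =", best)
```

A log-likelihood closer to zero means a higher likelihood, so the slope with the largest (least negative) value fits these rows best.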
Illustration 1: suppose we chose .05X instead of .11X
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.05 X_1)}}$$
Illustration 2: suppose we chose .40X instead of .11X
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.40 X_1)}}$$
Logistics of logistic regression
• Estimate the coefficients (and their conf.int.)
• Assess model fit
– Between model comparisons
– Pseudo R2 (similar to multiple regression)
– Predictive accuracy
• Interpret coefficients
• Check regression assumptions
Model fit:
comparisons between models
The log-likelihood ratio test statistic can
be used to test the fit of a model:
$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$
where New is the full model and baseline is the reduced model.
The test statistic has a chi-square distribution.
NOTE This is sort of similar to the variance decomposition
tables you see in MR!
Between model comparisons:
the likelihood ratio test
$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$
full model:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
reduced model:
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$
The model including only an intercept
is often called the empty model. SPSS uses this
model as the default baseline.
Between model comparison: SPSS output
$$\chi^2 = 2LL(\text{New}) - 2LL(\text{baseline})$$

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    29,310       1    ,000
         Block   29,310       1    ,000
         Model   29,310       1    ,000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      107,353a            ,254                   ,341
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than ,001.

The chi-square in the Omnibus table is the test statistic,
and Sig. is its associated
significance
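As a rough check on this output, the sketch below recovers the -2LL of the baseline model from the two tables and computes the p-value of the test statistic. The helper `chi2_sf_df1` is my own name; it uses the fact that a chi-square variable with 1 df is a squared standard normal, so it only applies when df = 1.

```python
import math

neg2ll_model = 107.353   # "-2 Log likelihood" in the Model Summary
chi_square = 29.310      # "Model" row in the Omnibus Tests table
df = 1

# chi-square = (-2LL baseline) - (-2LL new model), so:
neg2ll_baseline = neg2ll_model + chi_square

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_df1(chi_square)
print(f"-2LL baseline = {neg2ll_baseline:.3f}, p = {p_value:.2e}")
```

The p-value is far below .001, which SPSS prints as ,000.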
Overall model fit
pseudo R2
$$R^2_{LOGIT} = 1 - \frac{-2LL(\text{Model})}{-2LL(\text{Empty})}$$
where -2LL(Model) is based on the log-likelihood of the model that you want to test,
and -2LL(Empty) on the log-likelihood of the model before any predictors were entered.
Just like in multiple regression, pseudo R2 ranges 0.0 to 1.0
– Cox and Snell
• cannot theoretically reach 1
– Nagelkerke
• adjusted so that it can reach 1
NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression
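The Cox and Snell and Nagelkerke values in the SPSS Model Summary can be reproduced from the printed -2LL and chi-square. The formulas below are the standard textbook definitions, which the slides do not spell out, so treat this as a sketch under that assumption; n = 100 is the sample size of the CHD example.

```python
import math

n = 100                  # observations in the CHD example
neg2ll_model = 107.353   # "-2 Log likelihood" in the Model Summary
chi_square = 29.310      # model chi-square in the Omnibus Tests table
neg2ll_empty = neg2ll_model + chi_square

# Cox & Snell: 1 - (L_empty / L_model)^(2/n), written in terms of chi-square
cox_snell = 1.0 - math.exp(-chi_square / n)

# Largest value Cox & Snell can take for these data
max_cox_snell = 1.0 - math.exp(-neg2ll_empty / n)

# Nagelkerke rescales Cox & Snell so that the maximum becomes 1
nagelkerke = cox_snell / max_cox_snell

print(round(cox_snell, 3), round(nagelkerke, 3))
```

Because the maximum of Cox and Snell is below 1, dividing by it always makes Nagelkerke the larger of the two.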
Overall model fit: Classification table
Classification Table(a)
                          Predicted chd
Observed                  0      1      Percentage Correct
Step 1   chd   0          45     12     78,9
               1          14     29     67,4
         Overall Percentage             74,0
a. The cut value is ,500
We predict 74% correctly
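The percentages in the table follow directly from the four cell counts. A minimal sketch (the function name is my own):

```python
# Rows: observed chd (0, 1); columns: predicted chd (0, 1)
table = [[45, 12],
         [14, 29]]

def percent_correct(table):
    """Per-row and overall percentage of correctly classified cases."""
    per_row = [100.0 * row[i] / sum(row) for i, row in enumerate(table)]
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return per_row, 100.0 * correct / total

per_row, overall = percent_correct(table)
print([round(p, 1) for p in per_row], round(overall, 1))
```

The diagonal cells (45 and 29) are the correct predictions; the off-diagonal cells (12 and 14) are the misclassified cases.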
14 cases had a CHD while, according to our model,
this shouldn't have happened (observed 1, predicted 0)
12 cases didn’t have a CHD while, according to our model,
this should have happened (observed 0, predicted 1)
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
– Direction
– Significance
– Magnitude
• Check regression assumptions
The Odds Ratio
We had:
$$p(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \dots + b_n X_n)}} = \frac{e^{b_0 + b_1 X_1 + \dots + b_n X_n}}{1 + e^{b_0 + b_1 X_1 + \dots + b_n X_n}}$$
And after some rearranging we can get
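The rearranging is presumably the standard step from probabilities to odds. A sketch, writing z as shorthand for b_0 + b_1 X_1 + ... + b_n X_n (the symbol z is my own, not from the slides):

```latex
\begin{aligned}
p(Y) &= \frac{e^{z}}{1 + e^{z}}, \qquad 1 - p(Y) = \frac{1}{1 + e^{z}} \\
\frac{p(Y)}{1 - p(Y)} &= e^{z} = e^{b_0 + b_1 X_1 + \dots + b_n X_n} \\
\ln \frac{p(Y)}{1 - p(Y)} &= b_0 + b_1 X_1 + \dots + b_n X_n
\end{aligned}
```

So the odds are an exponential function of the predictors, and the log of the odds is linear in them.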
Magnitude of association: Percentage change in odds
$$\text{Odds}_i = \frac{\text{prob event}}{1 - \text{prob event}}$$
Probability   Odds
25%           0.33
50%           1
75%           3
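The small table can be reproduced with a one-line conversion from probability to odds. A minimal sketch:

```python
def odds(probability):
    """Odds of an event: prob / (1 - prob)."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)

for p in (0.25, 0.50, 0.75):
    print(f"{p:.0%} -> odds {odds(p):.2f}")
```

Note that odds are not symmetric around 50%: going from 50% to 75% triples the odds, while going from 50% to 25% divides them by three.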
Interpreting coefficients: direction
• original b reflects changes in logit: b>0 implies positive relationship
$$\text{logit} = \ln \frac{p(y)}{1 - p(y)} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
• exponentiated b reflects the “changes in odds”: exp(b) > 1 implies a
positive relationship
3. Interpreting coefficients: magnitude
• The slope coefficient (b) is interpreted as the rate of change in
the "log odds" as X changes … not very useful.
$$\text{logit} = \ln \frac{p(y)}{1 - p(y)} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
• exp(b) is the effect of the independent variable on the odds,
more useful for calculating the size of an effect
$$\text{Odds} = \frac{p(y)}{1 - p(y)} = e^{b_0} \cdot e^{b_1 x_1} \cdot e^{b_2 x_2} \cdot \ldots \cdot e^{b_n x_n}$$
Magnitude of association
Ref=0
Ref=1
Variables in the Equation
                     B        S.E.    Wald     df   Sig.   Exp(B)
Step 1a  age         ,111     ,024    21,254   1    ,000   1,117
         Constant    -5,309   1,134   21,935   1    ,000   ,005
a. Variable(s) entered on step 1: age.
• For the age variable:
– Percentage change in odds = (exponentiated coefficient – 1) * 100 = 12%, or “the
odds times 1,117”
– A one-unit increase in age results in a 12% increase in the odds that the person will
have a CHD
– So if a soccer player is one year older, the odds that (s)he will have CHD are 12%
higher
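A short sketch of this calculation for the age coefficient. The ten-year comparison is my own addition to show that the effect on the odds is multiplicative rather than additive.

```python
import math

b_age = 0.111                      # B for age in the SPSS table

odds_ratio = math.exp(b_age)       # Exp(B)
percent_change = (odds_ratio - 1.0) * 100.0

def odds_multiplier(b, delta):
    """Factor by which the odds change when X increases by delta units."""
    return math.exp(b * delta)

ten_year = odds_multiplier(b_age, 10)
print(round(odds_ratio, 3), round(percent_change, 1), round(ten_year, 2))
```

Ten extra years multiply the odds by Exp(B) ten times over, which is considerably more than ten times 12%.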
Another way to get an idea of the size of effects:
Calculating predicted probabilities
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.11 X_1)}}$$
For somebody of 20 years old, the predicted probability is .04
For somebody of 70 years old, the predicted probability is .92
But this gets more complicated
when you have more than a single X-variable
$$\Pr(Y \mid X) = \frac{1}{1 + e^{-(-5.3 + 0.11 X_1 + 1 \cdot X_2)}}$$
(see blackboard)
Conclusion: if you consider the effect of a variable on
the predicted probability, the size of the effect of X1
depends on the value of X2! (yuck!)
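A sketch that makes this dependence visible, using the two-variable model above. Treating X2 as a 0/1 variable and looking at X1 going from 20 to 30 are assumptions chosen for illustration.

```python
import math

def predicted_probability(x1, x2):
    """Pr(Y | X) for the model -5.3 + .11*X1 + 1*X2."""
    return 1.0 / (1.0 + math.exp(-(-5.3 + 0.11 * x1 + 1.0 * x2)))

def effect_of_x1(x1_from, x1_to, x2):
    """Change in predicted probability when X1 moves, holding X2 fixed."""
    return predicted_probability(x1_to, x2) - predicted_probability(x1_from, x2)

for x2 in (0, 1):
    change = effect_of_x1(20, 30, x2)
    print(f"X2 = {x2}: probability changes by {change:.3f}")
```

The same ten-unit change in X1 shifts the probability by a different amount for X2 = 0 and X2 = 1, although the effect on the odds is identical in both cases.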
Testing significance of coefficients
• In linear regression analysis this statistic is used to test significance:
$$\frac{b}{SE_b}$$
• In logistic regression something similar exists:
$$\text{Wald} = \frac{\text{estimate}}{\text{standard error of estimate}}$$ (t-distribution)
• however, when b is large, standard error tends to become inflated, hence underestimation (Type II errors are more likely)
Note: This is not the Wald Statistic SPSS presents!!!
Interpreting coefficients: significance
• SPSS presents
$$\text{Wald} = \frac{b^2}{SE_b^2}$$
• While Andy Field thinks SPSS presents this (at least in the 2nd
version of the book):
$$\text{Wald} = \frac{b}{SE_b}$$
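Both versions can be computed from the age row of the SPSS table. Since B and S.E. are printed with only three decimals, the result is close to, but not exactly, the 21,254 that SPSS reports; the p-value helper assumes a chi-square distribution with 1 df.

```python
import math

b, se = 0.111, 0.024          # age row, as printed (rounded) by SPSS

z = b / se                    # the unsquared version: b / SE_b
wald_chi_square = z ** 2      # the version SPSS prints: b^2 / SE_b^2

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_df1(wald_chi_square)
print(round(z, 3), round(wald_chi_square, 2), f"{p_value:.1e}")
```

The squared statistic and the unsquared one lead to the same two-sided p-value; they are just two ways of writing the same test.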
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
• Check regression assumptions
Checking assumptions
• Influential data points & Residuals
– Follow Samantha's tips
• Hosmer & Lemeshow
– Divides sample in subgroups
– Checks whether there are differences between observed and
predicted between subgroups
– Test should not be significant, if so: indication of lack of fit
Hosmer & Lemeshow
Test divides sample in subgroups, checks whether
difference between observed and predicted is about
equal in these groups
Test should not be significant (indicating no difference)
Examining residuals in logistic regression
1. Isolate points for which the model fits poorly
2. Isolate influential data points
Residual statistics: Field’s rules of thumb
Logistic regression
• Y = 0/1
• Multiple regression (or ANCOVA) is not right
• You consider either the odds or the log(odds)
• It is estimated through “maximum likelihood”
• Interpretation is a bit more complicated than normal
• Assumption testing is a bit more concrete than in
multiple regression
Advanced Methods and Models
in Behavioral Research
Make sure to
• enroll in studyweb (0a611)
• Read the Field chapter on logistic
regression
• Go through the slides as well
• Bring your laptop next time: we’ll go
through a logistic regression in Stata
Advanced Methods and Models in Behavioral Research – 2008/2009
Illustration with SPSS
(without the outlier part)
• Penalty kicks data, variables:
– Scored: outcome variable,
• 0 = penalty missed, and 1 = penalty scored
– Pswq: degree to which a player worries
– Previous: percentage of penalties scored by a particular
player in their career
SPSS OUTPUT Logistic Regression
Case Processing Summary
Unweighted Cases(a)                        N     Percent
Selected Cases     Included in Analysis    75    100,0
                   Missing Cases           0     ,0
                   Total                   75    100,0
Unselected Cases                           0     ,0
Total                                      75    100,0
a. If weight is in effect, see classification table for the total
number of cases.

Dependent Variable Encoding
Original Value     Internal Value
Missed Penalty     0
Scored Penalty     1

Tells you something about the number of observations and missings
Block 0: Beginning Block
This table is based on the empty model, i.e. only the constant in the model:
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table(a,b)
                                                  Predicted: Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 0   Result of Penalty Kick   Missed Penalty  0                35               ,0
                                  Scored Penalty  0                40               100,0
         Overall Percentage                                                         53,3
a. Constant is included in the model.
b. The cut value is ,500

Variables in the Equation
                    B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   ,134   ,231   ,333   1    ,564   1,143

Variables not in the Equation
                                  Score    df   Sig.
Step 0   Variables   previous     34,109   1    ,000
                     pswq         34,193   1    ,000
         Overall Statistics       41,558   2    ,000
These variables will be entered in the model
later on
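The Block 0 numbers can be reproduced by hand: in a model with only a constant, b0 is the log of the observed odds of scoring. A small sketch using the counts from the classification table (40 scored, 35 missed):

```python
import math

scored, missed = 40, 35

p_scored = scored / (scored + missed)        # observed proportion of scored penalties
b0 = math.log(p_scored / (1.0 - p_scored))   # constant of the empty model
exp_b0 = math.exp(b0)                        # Exp(B): the observed odds of scoring

print(round(100 * p_scored, 1), round(b0, 3), round(exp_b0, 3))
```

Since the predicted probability is above the cut value of ,500 for everybody, the empty model classifies all 75 kicks as scored, which is correct for 53,3% of them.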
Block is useful to check significance of
individual coefficients, see Field
Block 1: Method = Enter
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    54,977       2    ,000
         Block   54,977       2    ,000
         Model   54,977       2    ,000
This is the test statistic:
$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48,662a             ,520                   ,694
a. Estimation terminated at iteration number 6 because
parameter estimates changed by less than ,001.
New model: the -2 Log likelihood gives LL(New) after dividing by -2
Note: Nagelkerke is larger than Cox
Block 1: Method = Enter (Continued)
Classification Table(a)
                                                  Predicted: Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Result of Penalty Kick   Missed Penalty  30               5                85,7
                                  Scored Penalty  7                33               82,5
         Overall Percentage                                                         84,0
a. The cut value is ,500
Predictive accuracy has improved (was 53%)

Variables in the Equation
                     B       S.E.    Wald    df   Sig.   Exp(B)
Step 1a  previous    ,065    ,022    8,609   1    ,003   1,067
         pswq        -,230   ,080    8,309   1    ,004   ,794
         Constant    1,280   1,670   ,588    1    ,443   3,598
a. Variable(s) entered on step 1: previous, pswq.
B: estimates; S.E.: standard error of the estimates; Sig.: significance based on Wald statistic; Exp(B): change in odds
How is the classification table constructed?
The off-diagonal cells of the Step 1 classification table (5 and 7) are the # cases not predicted correctly.
The predicted probabilities come from the estimates in "Variables in the Equation":
$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$
How is the classification table constructed?
$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$
pswq   previous   scored   Predict. prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
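The four rows above can be recomputed from the estimated equation and then classified with the cut value of ,500. This sketch uses the coefficients as rounded in the equation, so the probabilities may differ from the table in the last digit.

```python
import math

B0, B_PREVIOUS, B_PSWQ = 1.28, 0.065, -0.230   # rounded estimates from the output
CUT = 0.5                                      # "The cut value is ,500"

# (pswq, previous, scored) for the four example cases
cases = [(18, 56, 1), (17, 35, 1), (20, 45, 0), (10, 42, 0)]

def predicted_probability(pswq, previous):
    z = B0 + B_PREVIOUS * previous + B_PSWQ * pswq
    return 1.0 / (1.0 + math.exp(-z))

rows = []
for pswq, previous, scored in cases:
    p = predicted_probability(pswq, previous)
    predicted = int(p >= CUT)      # classified as "scored" when p reaches the cut value
    rows.append((pswq, previous, scored, p, predicted))
    print(pswq, previous, scored, f"{p:.2f}", predicted)

n_correct = sum(1 for r in rows if r[2] == r[4])
```

Each case lands in one cell of the classification table: the row is given by the observed outcome and the column by the predicted one. Of these four cases, two are classified correctly.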
How is the classification table constructed?
pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1

Classification Table(a)
                                                  Predicted: Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Result of Penalty Kick   Missed Penalty  30               5                85,7
                                  Scored Penalty  7                33               82,5
         Overall Percentage                                                         84,0
a. The cut value is ,500