Introduction to Data Analysis
Lecture 10: Introduction to Logistic Regression
This week’s lecture

Categorical dependent variables in more complicated models: logistic regression (for binary categorical dependent variables).

Why can’t we just use OLS?
How does logistic regression work?
How do we compare logistic models?

Reading: A & F chapter 15.
But, first an experiment

I’m going to show you a short video of some students playing basketball.
There are 6 people: 3 dressed in black shirts and 3 in white shirts.

I’d like you to count the number of times that the white-shirted students pass the ball to each other, in two different ways:
An ‘aerial’ pass (without touching the ground on the way).
A ‘bounce’ pass (touching the ground on the way).

Thus after the video has ended you should have two totals, one for aerial passes by white shirts and one for bounce passes by white shirts.
“Gorillas in our midst” (1)
http://viscog.beckman.uiuc.edu/grafs/demos/15.html
“Gorillas in our midst” (2)

This is a real piece of psychology research by Simons and Chabris (1999) at Harvard.
They find that the harder the task, the more likely it is that people don’t spot the gorilla.
Only 50% of their subjects spotted the gorilla…

How is this relevant to us?
Imagine we wanted to predict whether someone saw the gorilla or not; this is a binary dependent variable.
We might have independent variables like concentration span, difficulty of the task, time of day and so on.
Predicting gorilla sightings (1)

Our dependent variable is just like the variables we were using earlier.
e.g. vote choice in 1950s Britain was Labour or Conservative.

But let’s say with this example we want to predict whether the gorilla will be spotted by a person with a particular set of characteristics.
In this case, let’s say with a particular concentration span (measured on a 1–100 scale).

Since our independent variable is interval-level data, we can’t use cross-tabs.
Predicting gorilla sightings (2)

So, what we want to know is the probability that any person will be a gorilla spotter or not for any value of concentration span.
Remember, if we know this, we will know the proportion of people that will spot the gorilla at each level of concentration span, on average.

We could use simple linear regression (SLR) here, with the dependent variable coded as 0 (no gorilla spotted) or 1 (gorilla spotted).
Well, why can’t we…?
What’s wrong with SLR?
We want to predict a probability; this can only vary between zero and 1.
But our SLR may predict values that are below zero or above 1…
Let’s quickly fit an SLR to our example.

Our sample here is the 108 subjects that Simons and Chabris used. I’ve added some extra data on their concentration spans.
A scatter-plot isn’t all that much use here.
Scatter-plot (1)
[Scatter-plot: gorilla spotter (0/1) against concentration span (0–100). More low-concentration people spot the gorilla; more high-concentration people DON’T spot the gorilla.]
Scatter-plot (2)
[The same scatter-plot with a linear regression line added. People with CS below 21 have > 1 predicted probability of being a spotter; people with CS above 92 have < 0 predicted probability of being a spotter.]
Other problems

If you think about it, that’s just one problem.
For linear regression we assumed that the population distribution was normally distributed around the mean for each value of the X variable.
That’s not going to be the case if we’ve got a binary response. The distribution around the mean is going to be quite different.
Looking at our data, when CS=50 we’ll have about 60% of cases scoring 1 (being spotters) and 40% of cases scoring 0 (not being spotters). That doesn’t sound much like a normal distribution…
What to do? (1)
Instead of linear OLS regression we use something called logistic regression.
This is a very widely used method, and it’s important to understand how it works.
It is probably more widely used (especially if we include variants) than linear OLS, as interesting dependent variables are often categorical.
A randomly selected academic (by the name of Tilley) has used logistic regression in 55.5% of all his sociology and politics articles.
What to do? (2)
Somehow we need to dump the linear OLS bit of our model for this binary categorical variable.
So what we want to do is assume a different kind of relationship between the probability of seeing gorillas (or whatever) and concentration span.
Maybe something like this…
What to do? (3)
Here’s a more realistic representation of the relationship between the probability of gorilla spotting and CS.
[Plot: an S-shaped curve of the probability of being a gorilla spotter against concentration span, staying between 0 and 1 across the whole 0–100 range.]
The logistic transformation (1)

This type of relationship is described by a special formula.
Remember, if the relationship was linear then the equation would just be:

$\pi = \alpha + \beta X$

But the relationship on the graph is actually described by:

$\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta X$
The logistic transformation (2)

$\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta X$

$\pi/(1-\pi)$ is just the odds. As the probability increases (from zero to 1), the odds increase from 0 to infinity.
The log of the odds then increases from –infinity to +infinity.
So if β is ‘large’ then as X increases the log of the odds will increase steeply. The steepness of the curve will therefore increase as β gets bigger.
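A quick way to get a feel for the transformation is to compute odds and log-odds for a few probabilities. Here is a minimal Python sketch (the probabilities are illustrative, not from the gorilla data):

```python
import numpy as np

# Probability -> odds -> log-odds, illustrating the logit transformation.
for p in [0.1, 0.5, 0.9]:
    odds = p / (1 - p)         # odds run from 0 to infinity
    log_odds = np.log(odds)    # log-odds run from -infinity to +infinity
    print(f"p={p:.1f}  odds={odds:.2f}  log(odds)={log_odds:.2f}")
```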
Fitting this model (1)

So that’s what we want to do, but how do we do it?
With SLR we tried to minimize the squares of the residuals to get the best-fitting line.
That doesn’t really make sense here (remember, the errors won’t be normally distributed as there are only two values).

We use something called maximum likelihood to estimate what the β and α are.
Fitting this model (2)

Maximum likelihood is an iterative process that estimates the best-fitting equation.
The iterative bit just means that we try lots of models until we get to a situation where tweaking the equation any further doesn’t improve the fit.
The maximum likelihood bit is kind of complicated, although the underlying assumptions are simple to understand, and very intuitive. The basic idea is that we find the coefficient values that make the observed data most likely.
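To make "most likely" concrete: for binary data the likelihood is $L(\alpha,\beta) = \prod_i \hat{\pi}_i^{y_i}(1-\hat{\pi}_i)^{1-y_i}$, and the fitting routine climbs this function iteratively. Below is a minimal Python sketch of that iteration using Newton's method, which is roughly what statistical packages do under the hood. The data are simulated stand-ins: the "true" values 3.7 and -0.07 are invented for illustration.

```python
import numpy as np

# Maximum likelihood for logistic regression via Newton's method.
# Simulated data: the "true" values 3.7 and -0.07 are made up for this sketch.
rng = np.random.default_rng(0)
cs = rng.uniform(0, 100, 200)                              # concentration span
y = rng.binomial(1, 1 / (1 + np.exp(-(3.7 - 0.07 * cs))))  # 1 = spotted

X = np.column_stack([np.ones_like(cs), cs])                # intercept + CS
beta = np.zeros(2)
for _ in range(25):                       # iterate until the fit stops improving
    p = 1 / (1 + np.exp(-X @ beta))       # current predicted probabilities
    grad = X.T @ (y - p)                  # gradient of the log-likelihood
    hess = X.T @ (X * (p * (1 - p))[:, None])   # negative Hessian
    beta += np.linalg.solve(hess, grad)   # Newton step
print("alpha, beta estimates:", beta)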
Back to the gorillas
So pressing the appropriate buttons in STATA or SPSS allows us to fit a logistic regression to our gorilla-spotting data.
The numbers that we get out are not immediately interpretable, however.
Remember, for OLS linear regression a change of one unit on the X variable meant that the Y variable would increase by the coefficient for X.
That’s not what the coefficient associated with X in our logistic regression means.
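For readers working in Python rather than STATA or SPSS, here is a hedged sketch of the same fit using statsmodels (on simulated stand-in data, with the "true" coefficients set to the slide's 3.69 and -0.07, since the original sample is not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

# Fit the gorilla logit in Python; `cs` and `spotted` are simulated stand-ins.
rng = np.random.default_rng(1)
cs = rng.uniform(0, 100, 108)                    # concentration span
spotted = rng.binomial(1, 1 / (1 + np.exp(-(3.69 - 0.07 * cs))))

X = sm.add_constant(cs)                          # intercept + CS
result = sm.Logit(spotted, X).fit()              # maximum likelihood fit
print(result.summary())                          # coefficients, SEs, p-values
```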
Gorilla results
Variable         Coefficient value   Standard error   p-value
Concentration    -0.07               0.01             0.00
Intercept         3.69               0.72             0.00

This is how logistic regression results are often reported in articles.
It’s clear that concentration span has a negative (and statistically significant) effect on gorilla sightings.
But what does the -0.07 actually mean?
Interpreting the coefficients (1)



What we need to do is think about the equation again, and what an increase in X means.

$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = a + bX$

$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 3.69 - 0.07X$

Remember, the ‘hat’ sign means the predicted value.
So an increase in X of 1 unit will decrease our log(odds) by 0.07.
If we antilog both sides then we can see how the odds change…
Interpreting the coefficients (2)



Antilog both sides and we get the odds on the left-hand side:

$\frac{\hat{\pi}}{1-\hat{\pi}} = e^{a+bX}$

$\frac{\hat{\pi}}{1-\hat{\pi}} = e^{3.69-0.07X}$

If we enter a value of X we can work out what the predicted odds will be:

$\frac{\hat{\pi}}{1-\hat{\pi}} = e^{3.69-0.07\times 30} = 4.90$

Thus the odds of spotting the gorilla (as opposed to not spotting the gorilla) for a person with CS=30 are nearly 5. For every 5 spotters there should be one non-spotter.
Interpreting the coefficients (3)



We can also think about what happens to the odds when we increase X by a certain amount.
Another way of writing $e^{a+bX}$ is $e^a(e^b)^X$. That means that a one-unit increase in X multiplies the odds by $e^b$ (as it’s to the power of 1).
In our case, therefore, a one-unit increase in X multiplies the odds by $e^{-0.07}$, or 0.93.
When X increases from 30 to 31, the odds are 4.90 × 0.93, or 4.56.
When X increases from 30 to 40, the odds are 4.90 × (0.93)^10, or 2.37.
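The same arithmetic, checked in a few lines of Python (this uses the unrounded $e^b$, so the final digits differ slightly from the slide's rounded 0.93):

```python
import numpy as np

# Worked check of the odds interpretation, with the slide's coefficients.
a, b = 3.69, -0.07
odds_30 = np.exp(a + b * 30)       # predicted odds at CS = 30, ~4.90
print(odds_30)
print(odds_30 * np.exp(b))         # CS 30 -> 31: odds multiplied by e^b once
print(odds_30 * np.exp(b) ** 10)   # CS 30 -> 40: multiplied by (e^b)^10
```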
Yet more coefficient interpretation (1)



The other way of thinking about things is in terms of probabilities.
If we rearrange the ‘antilogged’ equation then we can work out what the probability (for a particular value of X) would be:

$\frac{\hat{\pi}}{1-\hat{\pi}} = e^{3.69-0.07X}$

$\hat{\pi} = \frac{e^{3.69-0.07X}}{1+e^{3.69-0.07X}}$

$\hat{\pi} = \frac{e^{3.69-0.07\times 30}}{1+e^{3.69-0.07\times 30}} = 0.83$

The probability of a person with CS=30 spotting the gorilla is thus 83%.
Yet more coefficient interpretation (2)
Perhaps the most useful thing to do is to plot the predicted probabilities (it is easiest to do this in STATA).
[Plot: predicted % chance of being a gorilla spotter against concentration span, falling smoothly as CS rises; when CS=30, the probability of spotting the gorilla is 83%.]
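A sketch of that predicted-probability plot in Python (rather than STATA), using the fitted coefficients from the slides:

```python
import numpy as np
import matplotlib.pyplot as plt

# Predicted probability of spotting the gorilla across the CS range.
cs = np.linspace(0, 100, 200)
p_hat = np.exp(3.69 - 0.07 * cs) / (1 + np.exp(3.69 - 0.07 * cs))

plt.plot(cs, 100 * p_hat)
plt.xlabel("Concentration span")
plt.ylabel("Gorilla spotter (% chance)")
plt.show()
```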
Adding extra variables (1)

Including other interval-level independent variables and categorical independent variables is as easy as in multiple linear regression.
The logic is the same as before: we are examining the effects of one independent variable when the others are held constant.
The important bit is to understand what the coefficients for the extra independent variables actually mean.
Since this is less clear-cut than in multiple linear regression, we need to be careful in interpretation.
Adding extra variables (2)

Let’s say we think that people that own monkeys are more adept at spotting the gorilla.
We could include a dummy variable for monkey owner (1 if you are a monkey owner, and 0 if not).

Variable        Coefficient value   Standard error   p-value
Concentration   -0.09               0.02             0.000
Monkey owner     3.15               0.96             0.001
Intercept        4.01               0.83             0.000
Interpreting extra variables (1)

So owning a monkey (holding concentration span constant) multiplies the odds by $e^{3.15}$, or 23.3.
The odds of monkey owners spotting the gorilla are 23 times the odds of non-monkey owners spotting the gorilla.
The probability of a person with a CS of 50 who owns a monkey being a gorilla spotter is 93%, while the probability of a person with a CS of 50 who does not own a monkey being a gorilla spotter is only 40%.

With such a simple model we can still display it graphically.
A linear model would have two parallel lines, one for each type of person (monkey or none), by CS. Our lines are NOT parallel.
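A quick check of those two probabilities, plugging the slide's coefficients into the inverse-logit formula:

```python
import numpy as np

# Predicted probability of spotting the gorilla, dummy-variable model.
def p_hat(cs, monkey):
    log_odds = 4.01 - 0.09 * cs + 3.15 * monkey   # linear predictor
    return np.exp(log_odds) / (1 + np.exp(log_odds))

print(p_hat(50, monkey=1))   # ~0.93 for monkey owners
print(p_hat(50, monkey=0))   # ~0.38, which the slide rounds to 40%
```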
Monkeys and no monkeys
[Plot: predicted % chance of being a gorilla spotter against concentration span, with one curve for monkey owners (higher) and one for non-monkey owners (lower); the curves are not parallel.]
Interpreting extra variables (2)

Generally, we want to present information from a logistic regression in the form of probabilities, as these are the easiest to understand.
If we have lots of variables, then we normally set them to particular values and then examine how the predicted probability of the dependent outcome varies.
e.g. if I had more independent variables (age, sex, eyesight), I would produce the first graph from before for men of average age with average eyesight not owning a monkey. Then I could see how concentration alone affected the predicted probability of a gorilla sighting.
Interactive monkeys (1)

We can also include interaction effects. Again, though, we need to be careful interpreting these.

Variable                Coefficient value   Standard error   p-value
Concentration           -0.12               0.02             0.00
Monkey owner            -1.92               2.00             0.34
Monkey*concentration     0.08               0.04             0.02
Intercept                5.07               1.14             0.00
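With an interaction, the monkey-owner effect is no longer a single odds multiplier: the shift in log-odds for owning a monkey is -1.92 + 0.08 × CS, so it depends on concentration span. A small sketch of what that implies:

```python
import numpy as np

# Under the interaction model, the odds multiplier for monkey ownership
# varies with concentration span: exp(-1.92 + 0.08 * CS).
for cs in (10, 30, 50):
    shift = -1.92 + 0.08 * cs
    print(cs, np.exp(shift))   # multiplier applied to a non-owner's odds
```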
Interactive monkeys (2)
[Plot: predicted % chance of being a gorilla spotter against concentration span under the interaction model, again with separate, non-parallel curves for monkey owners and non-monkey owners.]
Comparing models (1)

One of the most important differences between logistic regression and linear regression is in how we compare models.
Remember, for linear regression we looked at how the adjusted R² changed. If there was a significant increase when we added another variable (or interaction) then we thought the model had improved.
For logistic regression there are a variety of ways of looking at model improvement.
Comparing models (2)

The best way of comparing models is to use something called the likelihood-ratio test.
When we were using OLS regression we were trying to minimize the sum of squares; for logistic regression we are trying to maximize something called the likelihood function (normally called L).
To see whether our model has improved by adding a variable (or interaction, or squared term), we can compare the maximum of the likelihood function for each model (just like we compared the R² before for OLS regressions).
Comparing models (3)

In fact, just to complicate matters, we actually compare the maximized values of -2 log L:

$LR = (-2\log L_0) - (-2\log L_1)$

where $-2\log L_0$ is the first model’s maximized value and $-2\log L_1$ is the second model’s.

By logging the Ls and multiplying them by -2, this statistic conveniently ends up with a chi-square distribution. This means we test whether there is a statistically significant improvement with reference to the χ² distribution.
Comparing models (4)

Each addition to the model here improves the model fit. We can test each improvement with a χ² test using the appropriate DF. Each of these is statistically significant at the 0.05 level.

Model                                      -2 log L
Concentration                              85.5
Concentration + monkey                     65.2
Concentration + monkey + concent*monkey    60.0

Each model uses an extra degree of freedom, as we’re adding an extra parameter. From earlier weeks, we know that a χ² value of 20 with 1 df is highly statistically significant, so the model clearly fits better with the extra terms.
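The two likelihood-ratio tests, computed explicitly with scipy from the -2 log L values in the table above:

```python
from scipy.stats import chi2

# Likelihood-ratio tests for the nested models.
lr_monkey = 85.5 - 65.2      # adding the monkey dummy: LR = 20.3, 1 df
lr_inter = 65.2 - 60.0       # adding the interaction:  LR = 5.2,  1 df
print(chi2.sf(lr_monkey, df=1))   # ~0.000007, highly significant
print(chi2.sf(lr_inter, df=1))    # ~0.023, significant at the 0.05 level
```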
Non-binary variables?

A lot of categorical variables are not binary though; what can we do with these?
Often we can recode them to a binary response. You often see vote choice in Britain coded as Conservative or not (with the ‘not’ category including Labour, the Liberal Democrats and everyone else).
We could use something called multinomial logistic regression. This allows the dependent categorical variable to have more than two categories. More on this next in POLS 7050.
Some warnings

This course is only an introduction (and a very brief one at that) to statistical methods.
Hopefully you can now pick up a journal and understand the results of a linear regression or logistic regression.
Hopefully you can run models yourself and interpret the results.
But be careful on both counts.
I really haven’t covered very much of the underlying math behind the concepts I’ve talked about.
Plus, there are specific things that are worth looking out for in your own and other people’s analysis. The three problems I’ve picked out are all things that will crop up next term in Intermediate Stats.
Warning (1)

Are you planning on using time as an independent variable with aggregate data?
e.g. predicting presidential approval for every month between 1970 and 2000 in the US (dependent variable), using economic growth (independent variable).

STOP. You need to use time-series analysis.
When we measure things over time, we need to take into account autocorrelation.
e.g. the errors we make in predicting presidential approval ratings in May are going to be highly correlated with the errors we make in predicting approval ratings in June.
Time-series analysis

Time-series models for this kind of data normally have a lagged dependent variable as an independent predictor.
We include $Y_{t-1}$ as a predictor of $Y_t$:

$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_0 X_t + \varepsilon_t$

The assumption is normally that the effect of $X_t$ persists over time; the coefficient $\beta_0$ is just the immediate impact. Broadly speaking, the $\alpha_1$ coefficient tells us how long the effect persists for.
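A minimal sketch of fitting such a lagged-dependent-variable model in Python, on simulated monthly data (the 0.8 and 0.5 are made-up "true" values, not estimates from any real approval series):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a series where Y depends on its own lag and a covariate X.
rng = np.random.default_rng(2)
x = rng.normal(size=120)                    # stand-in for economic growth
y = np.zeros(120)
for t in range(1, 120):
    y[t] = 0.8 * y[t - 1] + 0.5 * x[t] + rng.normal()

# Regress Y_t on Y_{t-1} and X_t.
X = sm.add_constant(np.column_stack([y[:-1], x[1:]]))
print(sm.OLS(y[1:], X).fit().params)        # roughly [0, 0.8, 0.5]
```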
Warning (2)

Are you planning on using items that are highly correlated, or not well measured, in a regression?
e.g. predicting whether women work full-time or part-time (dependent variable) using ten different attitudes to feminism (independent variables).

STOP. You need to use factor analysis and create a scale, or use structural equation modelling.
Factor analysis tells you how a collection of characteristics are linked together, and whether one can create a scale from what appear to be similar items.
SEM is similar to factor analysis and allows one to create latent variables.
Factor analysis


This is really useful for attitudinal variables (though it can be used in numerous other contexts as well).
Imagine that I have questions about what people think is important about being British:
e.g. speaking English, having British ancestry, having British citizenship, feeling British, being a Christian, being born in Britain, living one’s whole life in Britain.

We can use factor analysis to tease out whether there are groups of questions that ‘fit together well’.
In this case we might expect to find two factors, one representing ‘civic’ items and the other representing ‘ethnic’ items. People who answer positively to one ‘civic’ question also answer positively to other ‘civic’ questions.
Warning (3)

Do you have count data (with ‘smallish’ counts)?
The dependent variable is then measured as the number of occurrences of a certain event in a given period of time.
e.g. number of presidential vetoes in any year, number of strikes in any month, etc.

STOP. You need to use a different kind of model, one that uses the Poisson distribution.
There is no upper limit to this kind of data (or a very high limit), and we can’t measure it as a proportion.
There is a special distribution (and therefore a special kind of model) that describes this kind of data.
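A hedged sketch of a Poisson regression in Python, again on simulated stand-in counts (the 0.5 and 0.3 are invented coefficients for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Simulate count data whose mean is log-linear in x, then fit a Poisson model.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
counts = rng.poisson(np.exp(0.5 + 0.3 * x))   # e.g. strikes per month

X = sm.add_constant(x)
print(sm.Poisson(counts, X).fit().params)     # roughly [0.5, 0.3]
```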
Some parting words of “wisdom”
Quantitative data is easily found and can add a lot to your thesis.
You don’t have to use fancy statistical methods to find interesting things.
Take as many quantitative classes as you can!
Quant work is certainly not the only way to analyze data, but a strong background makes you more marketable and well-rounded.