please click, pptx - Department of Statistics | Rajshahi University

Confounders and Interactions:
An Introduction
1
An Example
• Data were collected from students in one department of a
university on the following variables:
– No. of times visited theatre per month (z)
– Scores in the final examination (y)
• The simple correlation coefficient between y and z ($r_{yz}$) was
calculated to be 0.20, which was statistically significant because the
sample size was moderately large.
• The same exercise was repeated for other departments in
the university. Each time the correlation was positive and significant.
• Interpretation: As you visit theatre more and more, your
result will improve. An interpretation which was hard to
believe.
2
An Example
(Continued)
• Statisticians were puzzled. After a long investigation
they found that the students who visited the theatre more often
were the more intelligent ones. They need less time to study and
thus spend more time on other things.
• For the same set of students in the department,
tests were carried out to measure the IQ of the
students (x). The results of the computation were as
follows:
$r_{xy} = 0.8$, $r_{yz} = 0.2$ and $r_{xz} = 0.6$.
• Still the paradox was not solved.
3
Solution - 1
• One statistician suggested the following:
– Let us fix IQ and take the correlation coefficient
between y and z at each IQ level.
• This was not practicable as such: the sample size was
too small for such an exercise.
• The sample size was increased and the correlation
coefficient between y and z was found at each
IQ level.
• Each time the value was negative, but the values differed.
4
Solution - 2
• The effect of x was removed from both y and z,
and the correlation coefficient between y and z
was then found. It was negative.
• How do we eliminate the effect of x?
– We assume that linear relations exist between these
variables, i.e., y = a + b x and z = c + d x (apart from
the errors in the equations). The two regressions are fitted,
the residuals of y and z are computed, and the
correlation between the two sets of residuals is found. This is
the correlation coefficient between y and z after
eliminating the effect of x, and it was negative.
– This is known as the partial correlation coefficient.
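The residual procedure described above can be sketched in a few lines of Python. The data below are simulated and purely hypothetical (the coefficients 0.6, 1.2 and -0.5 and the helper names `pearson` and `residuals` are assumptions made for illustration, not values from the study): IQ raises both the score and the number of theatre visits, while theatre visits themselves lower the score.

```python
import random
from statistics import mean

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def residuals(target, regressor):
    """Residuals from the least-squares line target = a + b * regressor."""
    mt, mr = mean(target), mean(regressor)
    b = (sum((r - mr) * (t - mt) for r, t in zip(regressor, target))
         / sum((r - mr) ** 2 for r in regressor))
    a = mt - b * mr
    return [t - (a + b * r) for t, r in zip(target, regressor)]

# Hypothetical simulated data (not the students' data from the slides).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]                       # IQ (standardized)
z = [0.6 * xi + rng.gauss(0, 0.8) for xi in x]                   # theatre visits
y = [1.2 * xi - 0.5 * zi + rng.gauss(0, 0.5) for xi, zi in zip(x, z)]  # score

r_yz = pearson(y, z)                                   # simple correlation
r_yz_x = pearson(residuals(y, x), residuals(z, x))     # partial correlation
print(f"r_yz = {r_yz:.2f}, r_yz.x = {r_yz_x:.2f}")
```

In this simulation the simple correlation is positive while the correlation between the residuals is negative, which is the same sign reversal as in the example.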
5
Discussions
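• Fortunately, it is not necessary to do all these steps to find
the partial correlation coefficient. We can use the
following formula:
$$ r_{yz\cdot x} = \frac{r_{yz} - r_{xy}\, r_{xz}}{\sqrt{(1 - r_{xy}^{2})(1 - r_{xz}^{2})}} $$
• With $r_{xy} = 0.8$, $r_{yz} = 0.2$ and $r_{xz} = 0.6$ the result is $r_{yz\cdot x} = (0.2 - 0.48)/0.48 \approx -0.58$. It is clearly a negative value.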
• Solution 1 gives a different estimate of the
correlation coefficient for each stratum.
• If we assume that the correlation coefficient is the same in
each stratum (i.e., at each fixed value of x), then the estimates will
be close to one another and close to −0.58 in this example.
• If (x, y, z) follows a trivariate normal distribution then,
theoretically, the correlation coefficient between y and z is the
same at every value of x.
• Thus Solution 1 needs no distributional
assumptions but gives multiple answers, whereas Solution 2
gives a unique answer that is valid only under restrictive assumptions.
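As a quick check of the quoted value, the standard first-order partial correlation formula, $r_{yz\cdot x} = (r_{yz} - r_{xy} r_{xz}) / \sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}$, can be evaluated directly with the three correlations given earlier. The function name `partial_corr` is chosen here for illustration.

```python
from math import sqrt

def partial_corr(r_yz, r_xy, r_xz):
    """First-order partial correlation of y and z, controlling for x."""
    return (r_yz - r_xy * r_xz) / sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))

r = partial_corr(r_yz=0.2, r_xy=0.8, r_xz=0.6)
print(round(r, 2))  # -0.58
```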
6
Partial Correlation to Regression
• Correlations and regression coefficients are related. In the equation
y = a + b x, b is positive if and only if $r_{xy}$ is positive. Testing the
significance of b is the same as testing the significance of $r_{xy}$.
• In the equation y = a + b x + c z, c is positive if and only if $r_{yz\cdot x}$ is
positive. Testing the significance of c is the same as testing the
significance of $r_{yz\cdot x}$.
• If we want to find the relation between y and z, and the variable x
affects both, then we should include both x and z as
regressors and proceed.
• This is why the regression coefficients in a multiple linear
regression are known as partial regression coefficients.
• x is called the confounding variable. Not all such variables are
confounding variables. The confounding variable should be the true
cause of variation of the explained variable.
7
Another Illustration of Confounding
• Diabetes is associated with hypertension.
• Does diabetes cause hypertension?
• Does hypertension cause diabetes?
• Another way in which diabetes and hypertension may be related is when
both variables are caused by FACTOR X. For hypertension and diabetes,
Factor X might be obesity.
• We should not conclude that diabetes causes hypertension. In fact, they
may have no direct causal relationship. We should rather say that:
• The relationship between hypertension and diabetes is confounded by
obesity. Obesity would be termed as a confounding variable in this
relationship.
8
Confounders are true causes of disease.
9
Definition of Confounding
• A confounder:
– 1) Is associated with exposure
– 2) Is associated with disease
– 3) Is NOT a consequence of exposure (i.e. not
occurring between exposure and disease)
10
MEDIATING VARIABLE
(SYNONYM: INTERVENING VARIABLE)
EXPOSURE → MEDIATOR → DISEASE
AN EXPOSURE THAT PRECEDES A MEDIATOR IN
A CAUSAL CHAIN IS CALLED AN ANTECEDENT
VARIABLE.
11
Mediation
• A mediation effect occurs when the third variable (mediator,
M) carries the influence of a given independent variable (X)
to a given dependent variable (Y).
• Mediation models explain how an effect occurred by
hypothesizing a causal sequence.
12
Confounding Vs. Mediation
• In mediation, the exposure occurs first, then the mediator and
the outcome; conceptually this follows an
experimental design.
• Confounders are often demographic variables
that typically cannot be changed in an
experimental design. Mediators are by
definition capable of being changed and are
often selected based on flexibility.
13
Another Example: No Confounding
14
A Different Example
• A group of scientists wanted to find the effect of IQ and the time
spent on studying for examination on the result of examination. The
linear model taken by them was
$y_t = \alpha + \beta x_t + \gamma z_t + e_t$.
• They fitted the model and the fit was good. However, one of the
scientists noticed that the residuals did not show a random pattern
when the data were arranged in increasing order of IQ.
They then started investigating the behaviour of the data more
closely. They could do so because the sample size was large.
• They fixed the value of IQ at different points and plotted the scatter
diagram of result against study hours. Every time the scatter
diagram showed a linear relation, but the slope changed every time
the value of IQ was changed. Surprisingly, the slope increased
systematically as the value of IQ increased.
15
The Revised Model
• Now look at the model again:
$y_t = \alpha + \beta x_t + \gamma z_t + e_t$.
• We interpret $\beta$ as the average change in y when
x is increased by one unit with z held fixed. But why should the
value of $\beta$ change when z is moved to some other fixed value?
Ideally the intercept should absorb $\gamma z_t$, so that the intercept
term changes and not the slope parameter.
• It means that the model was misspecified. If $\beta$ changes (increases) as z
increases, then $\beta$ is not a constant. We may replace $\beta$ by $(\beta + \delta z_t)$ to get
$y_t = \alpha + (\beta + \delta z_t) x_t + \gamma z_t + e_t$,
that is,
$y_t = \alpha + \beta x_t + \gamma z_t + \delta x_t z_t + e_t$.
• This phenomenon is known as the interaction effect between x and z. It is
symmetric: one arrives at the same model by letting the coefficient of $z_t$
vary with $x_t$.
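The diagnostic described above (fix one variable, regress the response on the other, and watch the slope) can be illustrated with simulated data. This is only a sketch: the coefficient labels `alpha`, `beta`, `gamma`, `delta` and their numerical values are hypothetical, with `delta` standing for the coefficient of the product term.

```python
import random
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical parameter values for y = alpha + beta*x + gamma*z + delta*x*z + e
alpha, beta, gamma, delta = 1.0, 2.0, 0.5, 0.3

rng = random.Random(1)
slopes = {}
for z in (0, 1, 2, 3):                      # hold z fixed at each level
    xs = [rng.uniform(0, 10) for _ in range(500)]
    ys = [alpha + beta * x + gamma * z + delta * x * z + rng.gauss(0, 1)
          for x in xs]
    slopes[z] = slope(xs, ys)               # estimates beta + delta * z

for z, s in slopes.items():
    print(f"z = {z}: fitted slope = {s:.2f} (model value {beta + delta * z:.2f})")
```

With no interaction (`delta = 0`) the fitted slopes would agree up to sampling noise; a systematic drift across levels of z is what signals the product term.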
16
No interaction Vs. Interaction
• No Interaction: Disease increases with age and this
association is the same for both, male and female.
• Interaction: gender interacts with age if the effect of age
on disease is not the same in each gender.
17
Examples
• Aspirin protects against heart attacks, but only in men
and not in women. We say then that gender moderates
the relationship between aspirin and heart attacks,
because the effect is different in the different sexes. We
can also say that there is an interaction between sex
and aspirin in the effect of aspirin on heart disease.
• In individuals with high cholesterol levels, smoking
produces a higher relative risk of heart disease than it
does in individuals with low cholesterol levels.
Smoking interacts with cholesterol in its effects on
heart disease.
18
The Implications
• The implication is that, when x or z is increased, there is an
additional change in the expected value of y apart from the linear
effect.
• If x is increased by one unit for fixed z then the change in y is $\beta + \delta z_t$
instead of $\beta$ only, and if z is increased by one unit for fixed x then
the change in y is $\gamma + \delta x_t$. If both x and z are increased by one unit
then the change in y is $\beta + \gamma + \delta x_t + \delta z_t + \delta$.
• For binary variables taking only the values 0 and 1, the corresponding
changes in y are $\beta$, $\gamma$ and $\beta + \gamma + \delta$ respectively, assuming that x and z
were both at 0. This is clear from the following table:
Expected values of y at different values of x and z
          z = 0               z = 1
x = 0     $\alpha$            $\alpha + \gamma$
x = 1     $\alpha + \beta$    $\alpha + \beta + \gamma + \delta$
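The table of expected values for binary x and z can be generated mechanically from the interaction model. In this sketch the parameter names and the integer values 1, 2, 3, 4 are hypothetical and chosen only so the arithmetic is easy to follow; `delta` denotes the coefficient of the product term.

```python
def expected_y(x, z, alpha, beta, gamma, delta):
    """Expected y under y = alpha + beta*x + gamma*z + delta*x*z."""
    return alpha + beta * x + gamma * z + delta * x * z

params = dict(alpha=1, beta=2, gamma=3, delta=4)   # hypothetical values
table = {(x, z): expected_y(x, z, **params) for x in (0, 1) for z in (0, 1)}

base = table[(0, 0)]
changes = {
    "x only": table[(1, 0)] - base,   # beta
    "z only": table[(0, 1)] - base,   # gamma
    "both":   table[(1, 1)] - base,   # beta + gamma + delta
}
print(table)
print(changes)
```

The change when both variables move exceeds the sum of the two separate changes by exactly `delta`, which is the interaction.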
19
The Implications
• Since y measures the effect i.e., disease, say, of exposures x and/or
z, the number of cases of y in each stage will reflect the same. The
odds ratios will be different.
• Interaction between two variables (with respect to a response
variable) is said to exist when the association between one of these
variables (may be called the exposure variable) and the response
variable (generally measured by the odds ratio or relative risk) is
different at different levels of the other exposure variable.
• For example, the odds ratio that measures the association between
cigarette smoking and lung cancer may be smaller among
individuals who consume large quantities of beta carotene in their
food when compared to the analogous odds ratio among persons
who consume little or no beta carotene in their food.
20
THE INTERACTING OR EFFECT-MODIFYING
VARIABLE IS ALSO KNOWN AS A
MODERATOR VARIABLE
[Diagram: the MODERATOR acts on the relationship between EXPOSURE and DISEASE]
A moderator variable is one that moderates or modifies
the way in which the exposure and the disease are
related. When an exposure has different effects on
disease at different values of a variable, that variable is
called a modifier.
21
Methods to reduce confounding
– during study design:
• Randomization
• Restriction
• Matching
– during study analysis:
• Stratified analysis
• Mathematical regression
22
Randomized controlled trial
• Randomized controlled trial: A method where the study population is
divided randomly in order to mitigate the chances of self-selection by
participants or bias by the study designers. Before the experiment begins,
the testers will assign the members of the participant pool to their groups,
using a randomization process such as the use of a random number
generator.
• For example, in a study on the effects of exercise, the conclusions would be
less valid if participants were given a choice if they wanted to belong to the
control group which would not exercise or the intervention group which
would be willing to take part in an exercise program. The study would then
capture other variables besides exercise, such as pre-experiment health
levels and motivation to adopt healthy activities. From the observer’s side,
the experimenter may choose candidates who are more likely to show the
results the study wants to see or may interpret subjective results (more
energetic, positive attitude) in a way favorable to their desires.
23
Case-Control Studies
• In a case-control study the researcher retrospectively determines
which individuals were exposed to the agent or treatment or the
prevalence of a variable in each of the study groups. The researcher
assigns confounders to both groups, cases and controls, equally. For
example, if somebody wants to study the causes of myocardial
infarct and thinks that age is a probable confounding variable,
each 67-year-old infarct patient will be matched with a healthy
67-year-old "control" person. In case-control studies, the matched
variables are most often age and sex.
• Drawback: Case-control studies are feasible only when it is easy to
find controls, i.e., persons whose status with respect to all known potential
confounding factors is the same as that of the case. Suppose
a case-control study attempts to find the cause of a given disease in a
person who is 1) 45 years old, 2) African-American, 3) from Alaska,
4) an avid football player, 5) vegetarian, and 6) working in
education. A theoretically perfect control would be a person who, in
addition to not having the disease being investigated, matches all
these characteristics and has no diseases that the patient does not
also have; finding such a control would be an enormous task.
24
A Hypothetical Example
25
Cohort studies
• Cohort studies: A group of people is chosen who do not have the outcome of
interest (for example, myocardial infarction). The investigator then measures a
variety of variables that might be relevant to the development of the condition.
Over a period of time the people in the sample are observed to see whether they
develop the outcome of interest (that is, myocardial infarction).
– Internal Controls: In single cohort studies those people who do not develop the
outcome of interest are used as internal controls.
– External Controls: Where two cohorts are used, one group has been exposed to or
treated with the agent of interest and the other has not, thereby acting as an external
control.
• A degree of matching is also possible in cohort studies, by creating cohorts of people
who share similar characteristics, so that all cohorts are comparable with regard to
the possible confounding variable. For example, if age and sex are thought to be
confounders, only males aged 40 to 50 would be included in a cohort study
that assesses the myocardial infarct risk in cohorts that are either physically
active or inactive.
• Drawback: In cohort studies, the over-exclusion of input data may lead researchers
to define too narrowly the set of similarly situated persons for whom they claim
the study to be useful. Similarly, "over-stratification" of input data within a study
may reduce the sample size in a given stratum to the point where conclusions drawn
from that stratum alone are no longer statistically reliable.
26
Double blinding
• Double blinding conceals the group
membership of the participants from both the trial
population and the observers. By preventing the
participants from knowing if they are receiving
treatment or not, the placebo effect should be the
same for the control and treatment groups. By
preventing the observers from knowing of their
membership, there should be no bias from
researchers treating the groups differently or from
interpreting the outcomes differently.
27
Stratification
• Stratification: As in the example above, physical
activity is thought to be a behaviour that protects
against myocardial infarct, and age is assumed to be a
possible confounder. The sampled data are then
stratified by age group; that is, the association
between activity and infarct is analyzed within
each age group. If the age-specific risk ratios differ
markedly from the crude (unstratified) risk ratio, age must be
viewed as a confounding variable. There exist
statistical tools, among them Mantel–Haenszel
methods, that account for stratification of data sets.
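As an illustration of a stratified analysis, the sketch below computes stratum-specific odds ratios and the Mantel–Haenszel pooled odds ratio, $\sum_i (a_i d_i / n_i) / \sum_i (b_i c_i / n_i)$. The two age strata and all cell counts are hypothetical; each stratum is written as (a, b, c, d) = (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).

```python
# Hypothetical 2x2 tables, one per age stratum: (a, b, c, d)
strata = {
    "age < 50":  (10, 90, 5, 95),
    "age >= 50": (40, 60, 25, 75),
}

def odds_ratio(a, b, c, d):
    """Odds ratio of a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

for name, cells in strata.items():
    print(f"{name}: OR = {odds_ratio(*cells):.2f}")

or_mh = mantel_haenszel_or(strata.values())
print(f"Mantel-Haenszel pooled OR = {or_mh:.2f}")
```

The pooled estimate is a weighted average of the stratum-specific odds ratios, so it is only a sensible summary when those ratios are reasonably similar (that is, when there is no strong interaction with the stratifying variable).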
28
Stratification of Confounding Variable
• While ascertaining association between 2 factors, we have Exposure
and disease
– Both Discrete: 2 levels of exposure/disease: 2x2 table
– Both Discrete: More levels of exposure/disease: r x c
– Level of disease continuous and exposure discrete or continuous: Usual
regression
– Level of disease discrete and exposure discrete or continuous:
Regression, but needs special attention
• A 3rd variable is considered: May be considered as an additional
regressor variable or one may use stratification
– Repeat analysis within every level of that variable
– E.g. gender, age, breed, farm etc.
• Stratification solves the problem of confounding as well as
interaction
29
The Problem with Stratification as a
Solution to Confounding
• Stratification sometimes may cause bias. Consider the situation of a
pair of dice, die A and die B. Of course, you know that they must be
independent. In other words, if you roll one, it tells you nothing
about the roll of the other. What if we stratify upon the sum of the
dice?
• What happens if we stratify? Let’s look in the stratum where the
sum is, for example, 7. In this stratum, if we know A (say, 1) then
we know B. If A is 3, B must be 4.
• Earlier, we said that A and B were independent. Now, however,
once we stratify upon the sum, if we know A, we know B. We have
induced a relationship between A and B that otherwise did not exist.
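The dice argument can be verified by enumerating all 36 equally likely outcomes. The sketch below computes the correlation between A and B over all outcomes and then within the stratum where the sum is 7; the helper name `corr` is chosen here for illustration.

```python
from itertools import product

def corr(pairs):
    """Pearson correlation of the two coordinates of a list of (a, b) pairs."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    sab = sum((a - ma) * (b - mb) for a, b in pairs)
    saa = sum((a - ma) ** 2 for a, _ in pairs)
    sbb = sum((b - mb) ** 2 for _, b in pairs)
    return sab / (saa * sbb) ** 0.5

rolls = list(product(range(1, 7), repeat=2))          # all (A, B) outcomes
stratum_7 = [(a, b) for a, b in rolls if a + b == 7]  # stratify on the sum

overall = corr(rolls)        # independence: correlation is zero
within = corr(stratum_7)     # inside the stratum B = 7 - A exactly
print(overall, within)
```

Within the stratum the two dice are perfectly negatively correlated even though they are independent overall, which is the induced relationship described above.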
30
Holding the Extraneous Variable Constant
• For example, if you want to control for gender using
this strategy, you would only include females in your
research study (or you would only include males in
your study). If there is still a relationship between the
variables say motivation and test grades, you will be
able to tell that the relationship is not due to gender
because you have made it a constant (by only
including one gender in your study).
31
Statistical Control
• Statistical Control: It's based on the following logic:
examine the relationship between the variables at each level
of the control/extraneous variable; actually, the computer
will do it for you, but that’s what it does.
• One type of statistical control is called partial correlation.
This technique shows the correlation between two
quantitative variables after statistically controlling for one
or more quantitative control/extraneous variables.
• A second type of statistical control is called ANCOVA (or
analysis of covariance). This technique shows the
relationship between the variables after statistically
controlling for one or more quantitative control/extraneous
variables.
32
LOGISTIC REGRESSION
A Note Compiled
by
MANORANJAN PAL
ECONOMIC RESEARCH UNIT
INDIAN STATISTICAL INSTITUTE
203 BARRACKPUR TRUNK ROAD
KOLKATA – 700 108
33
Characteristics
– Qualitative (Attribute)
  • Dichotomous: binary variable (0 or 1)
  • Polychotomous: set of binary variables (dummy variables)
– Quantitative (Variable)
  • Discrete
  • Continuous
34
Binary Dependent Variable
• In this case the dependent variable takes only one of two values for
each unit/individual.
• Often an individual economic agent must choose one out of two
alternatives, as follows:
– A household must decide whether to buy or rent a suitable dwelling;
– A consumer must choose which of two types of shopping areas to visit;
– A person must choose one of two modes of transportation available;
– A person must decide whether or not to attend college.
35
The Linear Probability Model (LPM)
$y_i = 1$ if an event A occurs
$y_i = 0$ if the event does not occur
Suppose the probability that it occurs is $P_i$. Then
$E(y_i) = 1 \times P_i + 0 \times (1 - P_i) = P_i$.    …(01)
We assume that $P_i$ depends on the explanatory variables $x_i$, which form a vector. Thus
$y_i = P_i + e_i = x_i'\beta + e_i$,   i = 1, 2, …, T,    …(02)
where T is the size of the sample. For a given $x_i$, we now have
---------------------------------------
$y_i$     $e_i$               $\Pr(e_i)$
---------------------------------------
1         $1 - x_i'\beta$     $x_i'\beta$
0         $-x_i'\beta$        $1 - x_i'\beta$
---------------------------------------
36
Problems with LPMs
• $E(y_i) = P_i = x_i'\beta$
may not be within the unit interval
• $\mathrm{Var}(e_i) = (-x_i'\beta)^2 (1 - x_i'\beta) + (1 - x_i'\beta)^2 (x_i'\beta)$
$= (x_i'\beta)(1 - x_i'\beta)$
$= E(y_i)\,[1 - E(y_i)]$
⇒ Introduces heteroscedasticity
• $e_i$ takes only two values, $-x_i'\beta$ and $1 - x_i'\beta$
⇒ Normality assumption is violated
However,
• $E(e_i) = (1 - x_i'\beta)(x_i'\beta) + (-x_i'\beta)(1 - x_i'\beta) = 0$
⇒ The only solace
37
GLS Estimation of LPM
Thus all T observations are written as
$y = X\beta + e$.
It follows that the covariance matrix of e is
$\mathrm{Cov}(e) = E(ee') = \Phi$,
where $\Phi$ is a diagonal matrix with ith diagonal element $E(y_i)[1 - E(y_i)]$.
Suppose the number of choice outcomes $y_i$ observed for each $x_i$, say $n_i$, is just one,
that is, $n_i = 1$. In that case, feasible GLS can be carried out by first estimating $\beta$ by
OLS, which, though inefficient, is consistent, and then constructing $\hat\Phi$ to be a
diagonal matrix with ith element $x_i'b\,(1 - x_i'b)$, where $b$ is the OLS estimate.
Since $\hat\Phi$ is diagonal, feasible GLS is easily applied using WLS (Weighted
Least Squares). That is, multiplying each observation on the dependent
and independent variables by the square root of the reciprocal of the variance
of the error yields a transformed model, OLS estimation of which produces
feasible GLS estimates.
Caution: The transformed (weighted) regression does not have an intercept
term, because the constant column is also multiplied by the weights.
38
The Problem with GLS Estimation of LPM
While this estimation procedure is consistent, an obvious difficulty
exists. If $x_i'b$ falls outside the (0,1) interval, the matrix $\hat\Phi$ has negative
or undefined elements on its diagonal. If this occurs one must
modify $\hat\Phi$, either by deleting the observations for which the problem
occurs or by setting the value of $x_i'b$ to 0.01 or 0.99, say, and
proceeding accordingly. While this does not affect the asymptotic
properties of the feasible GLS procedure, it is clearly an awkward
position to be in, especially since predictions based on the feasible
GLS estimates, $\hat y_i = x_i'\hat\beta$,
may also fall outside the (0,1) interval.
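The two-step feasible GLS procedure (OLS, then WLS with estimated variances) can be sketched for a single regressor plus a constant. Everything specific here is an assumption for illustration: the data are simulated with true probabilities 0.2 + 0.6x, the helper `weighted_line` is hypothetical, and fitted values are clipped to [0.01, 0.99] as one of the two remedies mentioned in the text.

```python
import random

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = b1 + b2 * x with weights w.

    Equivalent to OLS after multiplying y, x and the constant column
    by sqrt(w), so the transformed regression has no intercept.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b2 = sxy / sxx
    return ybar - b2 * xbar, b2

# Hypothetical binary data with P(y = 1 | x) = 0.2 + 0.6 * x
rng = random.Random(42)
x = [rng.uniform(0, 1) for _ in range(2000)]
y = [1 if rng.random() < 0.2 + 0.6 * xi else 0 for xi in x]

# Step 1: OLS (equal weights) gives consistent but inefficient estimates.
b1_ols, b2_ols = weighted_line(x, y, [1.0] * len(x))

# Step 2: estimated error variances p(1 - p), with fitted values clipped.
p_hat = [min(max(b1_ols + b2_ols * xi, 0.01), 0.99) for xi in x]
w = [1.0 / (p * (1.0 - p)) for p in p_hat]

# Step 3: WLS with weights 1 / variance gives the feasible GLS estimates.
b1_fgls, b2_fgls = weighted_line(x, y, w)
print(f"OLS:  {b1_ols:.3f} + {b2_ols:.3f} x")
print(f"FGLS: {b1_fgls:.3f} + {b2_fgls:.3f} x")
```

Clipping keeps every weight finite and positive, but it does nothing to keep the final predictions inside the unit interval, which is the weakness the text goes on to discuss.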
39
The Case of Repeated Observations
Let $n_i > 1$. The sample proportion of occurrences of
the event is $p_i = y_i/n_i$, where $y_i$ is the number of successes out of $n_i$.
Since $E(p_i) = P_i = x_i'\beta$, the model can be rewritten as
$p_i = P_i + e_i = x_i'\beta + e_i$, i = 1, 2, …, T,
where $e_i$ is now the difference between $p_i$ and its expectation $P_i$. The
full set of T observations is then written as
$p = X\beta + e$.
Since the sample proportions $p_i$ are related to the true proportions $P_i$
by
$p_i = P_i + e_i$,
i = 1, 2, …, T,
the error term $e_i$ has zero mean and variance $P_i(1 - P_i)/n_i$, the same as
the sample proportion based on $n_i$ Bernoulli trials.
40
Estimation Under the Case of Repeated
Observations
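The covariance matrix of e is
$\Phi = \mathrm{diag}\big(P_i(1 - P_i)/n_i\big)$,
and the appropriate estimator for $\beta$ is
the GLS estimator
$\hat\beta = (X'\Phi^{-1}X)^{-1} X'\Phi^{-1} p$.
If the true proportions $P_i$ are not
known, then a feasible GLS estimator is
$\tilde\beta = (X'\hat\Phi^{-1}X)^{-1} X'\hat\Phi^{-1} p$,
where $\hat\Phi$ replaces each $P_i$ by a consistent estimate such as the sample proportion $p_i$.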
41
Some Alternative Estimations
42
Questionable Value of R2 as a Measure of
Goodness of Fit
• The conventionally computed R2 is of limited value in the
dichotomous response models. To see why, consider the following
figure. Corresponding to a given X, Y is either 0 or 1. Therefore, all
the Y values will either lie along the X axis or along the line
corresponding to 1. Therefore, generally no LPM is expected to fit
such a scatter well. As a result, the conventionally computed R2 is
likely to be much lower than 1 for such models. In most practical
applications the R2 ranges between 0.2 and 0.6. R2 in such models will
be high, say, in excess of 0.8, only when the actual scatter is very
closely clustered around points A and B (say), for in that case it is
easy to fix the straight line by joining the two points A and B. In this
case the predicted yi will be very close to either 0 or 1.
• Thus, use of the coefficient of determination as a summary statistic
should be avoided in models with qualitative dependent variable.
43
LPM: The case of High R2
44
The difficulty with the linear probability model
Unfortunately, the predictor obtained from feasible GLS estimation can fall
outside the zero-one interval.
To ensure that the predicted proportion of successes falls within the unit
interval, at least over a range of $x_i$ of interest, one may impose inequality
restrictions of the form $0 \le x_i'\beta \le 1$; alternatively, the number of repetitions $n_i$ must be large
enough that the sample proportion $p_i$ is a reliable estimate of the probability $P_i$.
The situation is illustrated in the following figure for the case $x_i'\beta = \beta_1 + \beta_2 x_{i2}$.
Figure 1: Linear and non-linear probability models.
45
The difficulty with the linear probability model
• As we have seen, the LPM is plagued by several problems, such as
(1) nonnormality of ui, (2) heteroscedasticity of ui, (3) possibility of
values lying outside the 0–1 range, and (4) the generally lower R2
values. Some of these problems are surmountable. For example, we
can use WLS to resolve the heteroscedasticity problem or increase
the sample size to minimize the non-normality problem. By
resorting to restricted least-squares or mathematical programming
techniques we can even make the estimated probabilities lie in the
0–1 interval.
• But even then the fundamental problem with the LPM is that it is
not logically a very attractive model because it assumes that Pi =
P(y=1|x) increases linearly with x, that is, the marginal or
incremental effect of x remains constant throughout. This seems
patently unrealistic. In reality one would expect that Pi is nonlinearly
related to xi.
46
Alternatives to LPM
As an alternative to the linear probability model, the
probabilities $P_i$ can be specified as a nonlinear function of
the explanatory variables.
In the next sections two particular nonlinear
probability models are discussed, based on the cumulative
distribution functions of normal and logistic random
variables.
Two kinds of estimation procedures are applied:
feasible GLS when repeated observations are
available, and ML when $n_i = 1$ or is small.
47
Probit and Logit Models
Two choices of the nonlinear function $P_i = g(x_i)$ are the cumulative distribution functions of
normal and logistic random variables. The
former gives rise to the probit model and the
latter to the logit model.
The logit model is based on the logistic
cumulative distribution function (CDF).
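For concreteness, the two link functions can be written out using only the standard library: the logistic CDF is $1/(1 + e^{-t})$ and the standard normal CDF can be expressed through the error function. The linear index `0.5 + 0.25 * t` used for comparison is a hypothetical LPM chosen only to show where a straight line leaves the unit interval.

```python
from math import erf, exp, sqrt

def logistic_cdf(t):
    """Logistic CDF, the link behind the logit model."""
    return 1.0 / (1.0 + exp(-t))

def normal_cdf(t):
    """Standard normal CDF, the link behind the probit model."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

for t in (-4, -2, 0, 2, 4):
    linear = 0.5 + 0.25 * t          # hypothetical linear probability index
    print(f"t = {t:+d}: linear = {linear:+.2f}, "
          f"logit = {logistic_cdf(t):.3f}, probit = {normal_cdf(t):.3f}")
```

Both CDFs stay strictly between 0 and 1 for every value of the index, whereas the linear index gives -0.50 at t = -4 and 1.50 at t = 4.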
48
The Logit Model
49
The Logit Model
50
The Logit Model
51
An Interpretive Note
Finally, we note the interpretation of the estimated coefficients in the logit
model. Estimated coefficients do not indicate the increase in the probability
of the event occurring given a one-unit increase in the corresponding
independent variable. Rather, the coefficients reflect the effect of a change
in an independent variable upon $\ln(p_i/(1 - p_i))$ for the logit model. The
amount of the increase in the probability depends upon the original
probability and thus upon the initial values of all the independent variables
and their coefficients. This is true since $p_i = F(x_i'\beta)$ and $\partial p_i/\partial x_{ij} = f(x_i'\beta)\,\beta_j$,
where f(.) is the pdf associated with F(.).
For the logit model $f(t) = F(t)[1 - F(t)]$, so that $\partial p_i/\partial x_{ij} = p_i(1 - p_i)\,\beta_j$.
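The dependence of the marginal effect on the starting probability can be seen numerically. The sketch below uses the logistic density identity f(t) = F(t)[1 - F(t)], so the effect of the regressor on the probability is p(1 - p) times its coefficient; the intercept -1.0 and slope 0.8 are hypothetical values.

```python
from math import exp

B0, B1 = -1.0, 0.8          # hypothetical intercept and slope

def prob(x):
    """Logit probability at regressor value x."""
    return 1.0 / (1.0 + exp(-(B0 + B1 * x)))

def marginal_effect(x):
    """Analytic derivative of prob(x) with respect to x."""
    p = prob(x)
    return p * (1.0 - p) * B1

def numeric_effect(x, h=1e-5):
    """Central finite-difference check of the derivative."""
    return (prob(x + h) - prob(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.25, 4.0):
    print(f"x = {x:5.2f}: p = {prob(x):.3f}, dp/dx = {marginal_effect(x):.4f}")
```

The same coefficient produces its largest effect where p = 0.5 (here at x = 1.25) and progressively smaller effects as p approaches 0 or 1.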
52
ML Estimator of Logit Model
When the number of repeated observations on the
choice experiment $n_i$ is small and $p_i$ cannot be
reliably estimated using the sample proportion, then
ML estimation of the logit model can be carried out.
If $p_i$ is the probability that the event A occurs on the
ith trial of the experiment then the random variable $y_i$,
which is one if the event occurs but zero otherwise,
has the probability function
$f(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$, $y_i = 0, 1$.
Consequently, if T observations are available then the
likelihood function is
$L = \prod_{i=1}^{T} p_i^{y_i}(1 - p_i)^{1 - y_i}$.
53
ML Estimator of Logit Models
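The logit model arises when $p_i$ is specified to be given by the logistic
CDF evaluated at $x_i'\beta$. If $F(x_i'\beta)$ denotes the CDF evaluated at $x_i'\beta$,
then the likelihood function (L) for the model is
$L = \prod_{i=1}^{T} F(x_i'\beta)^{y_i}\,[1 - F(x_i'\beta)]^{1 - y_i}$
and the log L is
$\ln L = \sum_{i=1}^{T} \{ y_i \ln F(x_i'\beta) + (1 - y_i) \ln[1 - F(x_i'\beta)] \}$.
The first-order conditions for the maximum are non-linear, so ML
estimates must be obtained numerically.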
54
ML Estimator of Logit Model
55
ML Estimator of Logit Model
56
ML Estimator of Logit Model
57
ML Estimator of Logit Models
Using these derivatives and the recursive relation, ML
estimates can be obtained given some initial estimates.
The choice of the initial estimates does not matter, since it can
be shown (Dhrymes, P.J. (1978), Introductory Econometrics,
NY, Springer-Verlag, pp. 344-347) that the matrix of second
partials $\partial^2 \ln L/\partial\beta\,\partial\beta'$ is negative definite for all values of $\beta$.
Consequently, the Newton-Raphson (NR) procedure will converge, ultimately,
to the unique ML estimates regardless of the initial
estimates. Computationally, of course, the choice does
matter, since the better the initial estimates the fewer
iterations are required to attain the maximum of the LF. While
several alternatives for initial estimates exist, one can
simply use the OLS estimates of $\beta$ obtained by regressing $y_i$
on the explanatory variables.
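A minimal Newton-Raphson sketch for a logit model with a constant and one regressor is given below. It uses the standard logit score, $\sum (y_i - p_i) x_i$, and Hessian, $-\sum p_i(1 - p_i) x_i x_i'$; the ten data points are hypothetical, and for simplicity the iteration starts from zero rather than from OLS estimates.

```python
from math import exp

def fit_logit(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson ML estimates (b0, b1) for P(y=1) = 1/(1+exp(-(b0+b1*x)))."""
    b0, b1 = 0.0, 0.0                     # starting values (assumption: zeros)
    for _ in range(max_iter):
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (first derivatives of log L)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Negative Hessian: sum of p(1-p) * [1, x][1, x]'
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: (negative Hessian)^-1 times the score
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: the outcome becomes more likely as x grows, with overlap.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logit(x, y)
p_hat = [1.0 / (1.0 + exp(-(b0_hat + b1_hat * xi))) for xi in x]
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
```

The data deliberately overlap (some zeros occur at larger x than some ones); with perfectly separated data the ML estimates do not exist and the iteration would not settle.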
58
Tests of Hypothesis
Usual tests about individual coefficients and confidence
intervals can be constructed from the estimate of the
asymptotic covariance matrix, the negative of the inverse of
the matrix of second partials evaluated at the ML estimates,
relying on the asymptotic normality of the ML
estimator.
A test of the hypothesis
$H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$
can be easily carried out using the likelihood ratio (LR)
procedure, since the value of the log LF under the hypothesis
is easily obtained analytically. If n is the number of
successes ($y_i = 1$) observed in the T observations, then the
maximum value of the log LF under the null hypothesis $H_0$
is
$\ln L(\hat\omega) = n \ln\left(\frac{n}{T}\right) + (T - n) \ln\left(\frac{T - n}{T}\right)$.
Tests of Hypothesis
If the hypothesis is true, then asymptotically
$-2 \ln l = -2[\ln L(\hat\omega) - \ln L(\hat\Omega)]$
has a $\chi^2_{k-1}$ distribution, where $\ln L(\hat\Omega)$ is the value of the log LF evaluated at the ML estimate $\tilde\beta$.
Acceptance of this hypothesis would, of course, imply that the explanatory
variables have no effect on the probability of A occurring. In this case the
probability that $y_i = 1$ is estimated by
$\hat P = n/T$,
which is simply the sample proportion.
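The likelihood ratio statistic is simple to compute once the maximized log-likelihood of the full model is available. In the sketch below the restricted log-likelihood is n ln(n/T) + (T - n) ln((T - n)/T); the full-model value of -5.0, together with T = 10 and n = 5, is a hypothetical input, not a fitted result.

```python
from math import log

def null_loglik(n, T):
    """Maximized log-likelihood when all slope coefficients are zero."""
    if not 0 < n < T:
        raise ValueError("need 0 < n < T for a finite log-likelihood")
    return n * log(n / T) + (T - n) * log((T - n) / T)

def lr_statistic(loglik_full, n, T):
    """-2 [ln L(restricted) - ln L(unrestricted)]."""
    return -2.0 * (null_loglik(n, T) - loglik_full)

# Hypothetical example: T = 10 observations, n = 5 successes,
# and a full model with one slope whose maximized log L is -5.0.
lr = lr_statistic(loglik_full=-5.0, n=5, T=10)
CHI2_1_5PCT = 3.841          # 5% critical value of chi-square with 1 d.f.
print(f"LR = {lr:.3f}, reject H0 at 5%: {lr > CHI2_1_5PCT}")
```

The degrees of freedom equal the number of slope coefficients set to zero, so with more regressors the critical value must be taken from the corresponding chi-square distribution.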
60
Measuring Goodness of Fit
There is a problem with the use of conventional R2-type measures when
the explained variable y takes only two values. The predicted values are
probabilities and the actual values y are either 0 or 1. For the linear
probability model and the logit model we have $\sum y_i = \sum \hat y_i$, as with the linear
regression model, if a constant term is also estimated. For the probit model
there is no such exact relationship.
61
Measuring Goodness of Fit
62
SUMMARY AND CONCLUSIONS
The purpose of this presentation is to show how qualitative, or dummy,
variables using values of 1 and 0 can be introduced into regression
models alongside quantitative variables. The dummy variables are
essentially a data classifying device in that they divide a sample into
various subgroups based on qualities, or attributes (sex, marital status,
race, religion, etc.).
We have considered a model for situations in which the outcome of an
experiment, the dependent variable, takes only two values.
For the binary choice model the appropriate estimation technique
depends upon the nature of the sample data that are available. If
repeated observations exist on individual decision makers, a feasible
GLS estimation procedure can be used. If only one or a few
observations exist for each decision maker, ML estimation is possible
for models that relate the choice probabilities to the unknown parameters in a
nonlinear way.
63
Logistic and Poisson Regression Models
Manoranjan Pal
Indian Statistical Institute
64
An Example
                Exposed     Non-exposed
Deaths          18,000      9,500
Person-years    900,000     950,000
The incidence rates are:
I1 = 18,000/900,000 = 0.02 deaths per person-year.
I0 = 9,500/950,000 = 0.01 deaths per person-year.
RR = I1/I0 = 2.00.
The incidence rate in the exposed group is double
that in the non-exposed group.
65
The Regression Model
• We can achieve the same result by using a regression model. We
define a dichotomous exposure variable (X1) as:
X1 = 0 if non-exposed
X1 = 1 if exposed
Exposure (X1)    Rate (I)
0                0.01
1                0.02
We want to model the rate (I) as a function of exposure (X1).
One possibility is:
I = b0 + b1X1 (+ e),
but this is less convenient statistically because, among other things, the predicted value
of I may fall outside the range [0,1].
66
An Alternative Regression Model
It is more convenient to fit the model:
ln(I) = b0 +b1X1 (+ e).
We could fit the model using simple linear
regression (least squares).
However, the least-squares approach does not
handle Poisson or dichotomous outcome variables
well, as they are not normally distributed. Instead,
the model parameters are estimated by the method
of maximum likelihood.
67
Estimation of RR from the Model
The Equation: ln(I) = b0 + b1X1 (+ e).
Exposed: E(ln(I) | X1 = 1) = ln(I1) = b0 + b1.
Non-exposed: E(ln(I) | X1 = 0) = ln(I0) = b0.
ln(I1) – ln(I0) = ln(I1/I0) = (b0 + b1) – (b0) = b1.
RR = I1/I0 = $e^{b_1}$.
b1 = ln(RR): the regression coefficient gives the log of
the RR value.
68
Estimation of Confidence Interval
The 95% CI for ln(RR) is:
ln(RR) ± 1.96 SE(ln(RR)) = b1 ± 1.96 SE(b1).
If b1 = 0.693 and SE(b1) = 0.124 then
RR = $e^{0.693}$ = 2.00,
95% lower confidence limit = $e^{0.693 - 1.96 \times 0.124}$ = 1.57
and
95% upper confidence limit = $e^{0.693 + 1.96 \times 0.124}$ = 2.55.
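The rate ratio and its confidence limits can be reproduced step by step. The sketch below takes the death and person-year counts from the earlier example and then evaluates exp(b1 ± 1.96 SE(b1)) with the stated b1 = 0.693 and SE(b1) = 0.124; the variable names are chosen here for illustration.

```python
from math import exp, log

# Counts from the example: deaths and person-years by exposure group
deaths_exposed, py_exposed = 18_000, 900_000
deaths_unexposed, py_unexposed = 9_500, 950_000

rate_exposed = deaths_exposed / py_exposed          # I1
rate_unexposed = deaths_unexposed / py_unexposed    # I0
rr = rate_exposed / rate_unexposed                  # rate ratio
print(f"I1 = {rate_exposed:.2f}, I0 = {rate_unexposed:.2f}, RR = {rr:.2f}")
print(f"ln(RR) = {log(rr):.3f}")

# Confidence limits on the log scale, then exponentiated
b1, se_b1 = 0.693, 0.124
lower = exp(b1 - 1.96 * se_b1)
upper = exp(b1 + 1.96 * se_b1)
print(f"95% CI for RR: ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the point estimate of 2.00.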
69
Discussions
• This general approach can be used in a variety
of situations.
• For cohort studies, we fit the Poisson model
ln(I) = b0 +b1X.
This is Poisson data, and we use Poisson
regression to estimate the rate ratio.
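• For case-control studies we fit the logistic model ln(P/(1 − P)) = b0 + b1X, where P is the probability of being a case, and use logistic regression to estimate the odds ratio.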
70
Confounding
• We can use the same approach to control for potential
confounding variables:
ln(I) = b0 + b1X1 + b2X2.
where,
X1 = 0 if non-exposed
= 1 if exposed
and
X2 = 0 if Age < 50
= 1 if Age ≥ 50.
71
Confounding
• Then in the exposed group
E[ln(I) | X1 = 1] = ln(I1) = b0 + b1 + b2X2,
• and in the non-exposed group
E[ln(I) | X1 = 0] = ln(I0) = b0 + b2X2.
• Thus, ln(I1/I0) = (b0 + b1 + b2X2) – (b0 + b2X2) = b1.
RR = I1/I0 = exp(b1),
• and we proceed as before.
72
Multiple Levels
• We can also represent multiple categories of
exposure (or a confounder): Suppose we have
four levels of exposure: none, low, medium and
high.
• We need three variables to represent four levels of
exposure:
ln(I) = b0 + b1X1 + b2X2 + b3X3.
where,
X1 = 1 if low exposure,
= 0 otherwise;
X2 = 1 if medium exposure,
= 0 otherwise;
X3 = 1 if high exposure,
= 0 otherwise.
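Coding a four-level exposure with three indicator variables can be sketched in Python as follows. The sketch assumes "none" is the reference level, for which all three indicators are 0; the names `LEVELS` and `dummy_code` are illustrative.

```python
LEVELS = ["none", "low", "medium", "high"]  # "none" is the reference level

def dummy_code(level):
    """Return the indicators (X1, X2, X3) for a four-level exposure."""
    if level not in LEVELS:
        raise ValueError(f"unknown exposure level: {level!r}")
    # One indicator per non-reference level: low, medium, high
    return tuple(int(level == name) for name in LEVELS[1:])

for level in LEVELS:
    print(level, dummy_code(level))
```

With this coding, exp(b1), exp(b2) and exp(b3) are the rate ratios of the low, medium and high exposure groups relative to the unexposed group.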
73
Interaction (Joint Effects)
• Suppose that we wish to derive the effect of smoking
and exposure to asbestos on the incidence of cancer.
• The usual model (without an interaction term) is:
ln(I) = b0 + b1X1 + b2X2
where X1 and X2 stand for asbestos and smoking
respectively. However, to get the above table, we need
to fit the following model:
ln(I) = b0 + b1X1 + b2X2 + b3X1X2.
74
The Joint Effect
• This can be used to derive the following:
Group            X1   X2   Model            RR
Asbestos only    1    0    b0+b1            exp(b1)
Smoking only     0    1    b0+b2            exp(b2)
Both             1    1    b0+b1+b2+b3      exp(b1+b2+b3)
• Thus, the joint effect is obtained by RR = exp(b1+b2+b3).
75
Testing the Joint Effect
• Note that if b3 = 0 then the joint effect is just
exp(b1+b2). Thus, b3 provides a test for interaction.
However, it is important to emphasize that
b3 only provides a test for a departure from the
multiplicative assumptions of the model. It
does not test for a departure from additivity.
• The confidence interval for the joint effect can be
calculated using the following:
exp[(b1+b2+b3) ± 1.96 SE(b1+b2+b3)],
where Var(b1+b2+b3) is the sum of the variances of b1, b2
and b3 plus twice each of their pairwise covariances.
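A confidence interval for the joint effect exp(b1+b2+b3) needs the variance of the sum b1+b2+b3, which is the sum of all entries of the variance-covariance matrix of the three coefficients. The Python sketch below illustrates the calculation; the coefficient values and the covariance matrix are hypothetical numbers chosen only for illustration, and the function name `joint_effect_ci` is not from the slides.

```python
import math

def joint_effect_ci(b, cov, z=1.96):
    """RR and confidence limits for the joint effect exp(b1 + b2 + b3).

    b   : (b1, b2, b3) estimated coefficients
    cov : 3x3 variance-covariance matrix of (b1, b2, b3)
    """
    log_rr = sum(b)
    # Var(b1 + b2 + b3) = sum of every entry of the covariance matrix
    var = sum(cov[i][j] for i in range(3) for j in range(3))
    se = math.sqrt(var)
    return math.exp(log_rr), math.exp(log_rr - z * se), math.exp(log_rr + z * se)

# Hypothetical estimates, for illustration only
b = (0.9, 1.4, 0.3)
cov = [[0.040, 0.010, -0.030],
       [0.010, 0.030, -0.025],
       [-0.030, -0.025, 0.060]]

rr, lower, upper = joint_effect_ci(b, cov)
print(f"Joint RR = {rr:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Ignoring the covariances, that is, summing only the three variances, would generally give the wrong interval width, because the interaction coefficient is usually correlated with the main-effect coefficients.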
76
An Alternative Model
• There is a much easier way to get the same results. Just define three
new variables as follows:
X1 = 1 if asbestos but not smoking
= 0 otherwise
X2 = 1 if smoking but not asbestos
= 0 otherwise
X3 = 1 if both asbestos and smoking
= 0 otherwise
• Then fit
ln(I) = b0 + b1X1 + b2X2 + b3X3.
• This will give us the separate and joint effects directly without any
need to consider Variance covariance matrix.
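The equivalence of the two codings can be checked on a small example. In the Python sketch below the four incidence rates are hypothetical, and because each model has one parameter per group, the coefficients follow directly from the log rates.

```python
import math

# Hypothetical incidence rates per person-year, for illustration only
rates = {"neither": 0.001, "asbestos": 0.005, "smoking": 0.010, "both": 0.080}
ref = math.log(rates["neither"])          # b0 in both codings

# Interaction coding: ln(I) = b0 + b1*X1 + b2*X2 + b3*X1*X2
b1 = math.log(rates["asbestos"]) - ref
b2 = math.log(rates["smoking"]) - ref
b3 = math.log(rates["both"]) - ref - b1 - b2
joint_from_interaction = math.exp(b1 + b2 + b3)

# Alternative coding: one indicator per exposed group, so the
# coefficient of the "both" indicator is the joint log rate ratio
c3 = math.log(rates["both"]) - ref
joint_from_indicator = math.exp(c3)

print(f"{joint_from_interaction:.1f} {joint_from_indicator:.1f}")
```

Both codings give the same joint rate ratio, but in the alternative coding it is a single coefficient, so its standard error is reported directly by the fitting software.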
77
Cohort Study Vs. Case Control Study
                   Cohort Study                    Case Control Study
Numerator          Cases                           Cases
Denominator        Person-Years                    Controls
Effect Estimate    Rate Ratio                      Odds Ratio
Modeling           Poisson Regression              Logistic Regression
Model              ln(I) = b0 + b1X1 + b2X2 + …    ln(Odds) = b0 + b1X1 + b2X2 + …
78
POISSON REGRESSION WITH MULTIPLE
EXPLANATORY VARIABLES
Manoranjan Pal
Indian Statistical Institute
79
Poisson Regression Model
• The Poisson regression model is a technique used to
describe count data as a function of a set of
predictor variables. In the last two decades it has
been extensively used both in human and in
veterinary Epidemiology to investigate the
incidence and mortality of chronic diseases. Among
its numerous applications, Poisson regression has
been mainly applied to compare exposed and
unexposed cohorts and to evaluate the clinical
course of ill subjects.
80
Introduction
• Poisson regression analysis is a technique that
allows one to model dependent variables that describe
count data. It is often applied to study the
occurrence of small numbers of counts or events as
a function of a set of predictor variables, in
experimental and observational studies in many
disciplines, including Economics, Demography,
Psychology, Biology and Medicine.
81
Applications
• The Poisson regression model may be used as an alternative
to the Cox model for survival analysis, when hazard rates
are approximately constant during the observation period
and the risk of the event under study is small (e.g.,
incidence of rare diseases). For example, in ecological
investigations, where data are available only in an
aggregated form (typically as a count), the Poisson regression
model usually replaces the Cox model, which cannot be easily
applied to aggregated data.
• Finally, some variants of the Poisson regression model have
been proposed to take into account the extra-variability
(overdispersion) observed in actual data, mainly due to the
presence of spatial clusters or other sources of
autocorrelation.
82
Measures of Occurrence in Cohort
Studies: Risk and Rate
• The definition of rate may be derived from the general relationship linking
the risk to the follow up time:
$R(t) = 1 - \exp\left(-\int_0^t \lambda(u)\,du\right)$ … (1).
• Variable λ represents the rate of onset of the outcomes in the cohort and it
may be considered as a measure of the “speed” of their occurrence. In
many instances, especially for rare diseases in observational cohorts, λ may
be considered approximately as a constant. Moreover, when the rate is
small, the following useful approximation may be applied:
$R(t) \approx \lambda t$.
83
Risk and Rate
• It may be noted that for low values of λt, λ represents a mean rate, while λ(t)
represents an instant rate, often called hazard rate.
• λ may be estimated by the ratio between the observed events O and the
corresponding sum of follow up times m, named “person-time at risk”:
$\hat\lambda = O / m$
• An RR estimate may be obtained by the corresponding rate ratio as follows:
$\widehat{RR} = \hat\lambda_1 / \hat\lambda_2$
where λ1 and λ2 represent the rates estimated in the exposed and unexposed sub-cohorts, respectively.
84
Poisson Distribution
• The variability of a rate estimate and the comparison between rates need
some assumptions about the probability distribution that is assumed to
generate the observed rates. When rare events are considered, a Poisson
distribution may be assumed:
$P(O = k) = \dfrac{e^{-\mu} \mu^{k}}{k!}$
• where μ is an unknown parameter that may be estimated by the observed
events O. In the Poisson distribution function, parameter μ represents both
the expected number of events and the variance of their estimate.
Accordingly, the variance of an estimate of a rate may be obtained as
follows:
$\widehat{Var}(\hat\lambda) = O / m^{2}$.
85
Variance of Rate Ratio
• Under the null hypothesis of no association between the outcome
(events) and the factor under study (exposure, medications, etc.), an
RR estimate may be assumed to follow approximately a log-normal
distribution with expected value of 1. Accordingly, statistical
inference about a rate ratio may be based on an estimate of the
variance of its logarithm, which needs separate estimates of the
variances of the two rates:
$Var[\ln(\widehat{RR})] = Var[\ln(\hat\lambda_1)] + Var[\ln(\hat\lambda_2)]$
• Applying the Delta method, such an estimate may be obtained by the
following equation:
$\widehat{Var}[\ln(\widehat{RR})] = 1/O_1 + 1/O_2$.
86
Confidence Interval
Confidence intervals of an RR estimate, obtained via a rate ratio, may be
obtained by the following equation:
$\widehat{RR} \times \exp\left(\pm Z_{\alpha/2} \sqrt{1/O_1 + 1/O_2}\right)$
where O1 and O2 are the observed events in the two sub-cohorts and Zα/2 = 1.96
for α=0.05 (useful to obtain 95% confidence intervals).
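The rate ratio and its confidence interval can be computed with a short function. The Python sketch below assumes the usual Delta-method variance Var[ln(RR)] = 1/O1 + 1/O2, and applies it to the counts of the hypothetical cohort in Table 1 of this presentation; the function name `rate_ratio_ci` is illustrative.

```python
import math

def rate_ratio_ci(o1, m1, o2, m2, z=1.96):
    """Rate ratio of sub-cohort 1 vs 2 with a log-normal confidence interval.

    o1, o2 : observed events; m1, m2 : person-time at risk.
    """
    rr = (o1 / m1) / (o2 / m2)
    se_log_rr = math.sqrt(1 / o1 + 1 / o2)   # Delta-method standard error
    factor = math.exp(z * se_log_rr)
    return rr, rr / factor, rr * factor

# Table 1: exposed 108 cases in 44870 person-years,
#          unexposed 51 cases in 21063 person-years
rr, lower, upper = rate_ratio_ci(108, 44870, 51, 21063)
print(f"RR = {rr:.2f} ({lower:.2f}; {upper:.2f})")
```

Note that the person-time enters only through the point estimate; the width of the interval depends on the numbers of events alone.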
87
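Table 1. Results of a Hypothetical
Observational Cohort Study
Exposure     Number of Cases   Person-years
Exposed      108               44870
Unexposed    51                21063
• In the exposed sub-cohort the estimated rate is:
$\hat\lambda_1 = 108 / 44870 = 240.7 \times 10^{-5}$ per person-year,
• while the corresponding estimate for the unexposed is:
$\hat\lambda_2 = 51 / 21063 = 242.1 \times 10^{-5}$ per person-year.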
88
Results of a Hypothetical Observational
Cohort Study
Finally, the estimate of RR is:
$\widehat{RR} = (108 / 44870) / (51 / 21063) = 0.99$
The 95% confidence interval of the estimated RR will be:
$0.99 \times \exp\left(\pm 1.96 \sqrt{1/108 + 1/51}\right) = (0.71;\ 1.4)$
The confidence interval includes the expected value under the null hypothesis
of no association (i.e., RR=1); hence, in the cohort under study, no
evidence emerges of an association between the exposure and the risk of
disease onset (p > 0.05). A similar result may be obtained by the Poisson
regression model.
89
Rate Ratio Estimate via Poisson Regression
Model
• As briefly illustrated above, the numerator of a rate for a rare
disease may be considered as a realization of a Poisson
variable with an unknown parameter μ. As a consequence, the
relation between the rate and the variable under study (e.g.,
exposures or treatments) may be investigated by a Poisson
model, which is a regression model belonging to the GLM
class (Generalized Linear Models):
$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
where:
$g(\mu) = \ln(\mu)$
and g is called “the link function”.
90
Table 2: An Example of Confounding in an
Observational Cohort Study
A simple example of confounding by a dichotomous variable (gender) is
illustrated in Table 2, using the same data reported in aggregated form in Table 1.
Stratum                           Exposure    No. of cases   Person-years   Estimated RR (95% CI)
All individuals (pooled cohort)   Exposed     108            44870          RRT = 0.99 (0.71; 1.4)
                                  Unexposed   51             21063
Stratum 1 - Males                 Exposed     30             3218           RR1 = 2.5 (1.6; 3.9)
                                  Unexposed   44             11699
Stratum 2 - Females               Exposed     78             41652          RR2 = 2.5 (1.2; 5.4)
                                  Unexposed   7              9364
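Confounding by gender can be controlled by fitting a Poisson model with both exposure and gender as predictors, ln(rate) = b0 + b1·Exposed + b2·Male, to the four stratum-specific cells of Table 2. The sketch below is a small pure-Python Newton-Raphson fit written for illustration; the helper names `solve` and `fit_poisson` are hypothetical, and a statistical package would normally be used instead.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_poisson(x, events, person_years, n_iter=30):
    """Newton-Raphson fit of ln(rate) = x . beta for grouped Poisson counts."""
    p = len(x[0])
    beta = [0.0] * p
    # Assumes column 0 of x is the intercept; start at the pooled log-rate
    beta[0] = math.log(sum(events) / sum(person_years))
    for _ in range(n_iter):
        mu = [t * math.exp(sum(xi * bi for xi, bi in zip(row, beta)))
              for row, t in zip(x, person_years)]
        score = [sum((o - m) * row[j] for o, m, row in zip(events, mu, x))
                 for j in range(p)]
        info = [[sum(m * row[j] * row[k] for m, row in zip(mu, x))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [bi + si for bi, si in zip(beta, step)]
    return beta

# Table 2 cells; columns: intercept, exposed (1 = yes), male (1 = yes)
x = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [1, 0, 0]]
events = [30, 44, 78, 7]
person_years = [3218, 11699, 41652, 9364]

beta = fit_poisson(x, events, person_years)
adjusted_rr = math.exp(beta[1])
print(f"Gender-adjusted RR = {adjusted_rr:.2f}")
```

The adjusted estimate exp(b1) can then be compared with the pooled RR of 0.99 and with the two stratum-specific estimates.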
91
Table 3: Example of Effect Modifying or
Interaction in an Observational Cohort Study
A simple example of interaction between a variable of exposure and an effect
modifier, both expressed on a dichotomous scale, is provided in Table 3.
Stratum                           Exposure    No. of cases   Person-years   Estimated RR (95% CI)
All individuals (pooled cohort)   Exposed     391            769309         RRT = 1.5 (1.2; 1.9)
                                  Unexposed   119            358341
Stratum 1 - Males                 Exposed     189            478383         RR1 = 1.2 (0.94; 1.6)
                                  Unexposed   78             242043
Stratum 2 - Females               Exposed     202            290926         RR2 = 2.0 (1.4; 2.8)
                                  Unexposed   41             116298
92
Discussions
• In the pooled cohort (Table 3), an association between
the exposure and the risk of the disease onset seems to
emerge, the corresponding RR being statistically
significantly higher than 1, as it is evident from the
corresponding 95% confidence interval which does not
include such a value. However, after stratifying by
gender, different RR emerge comparing males and
females (RR=1.2 and RR=2.0, respectively). In
conclusion, data in Table 3 suggest an interaction
between sex and exposure, indicating that females are
probably more susceptible than males to the exposure
effect.
93
Interaction in Poisson Regression Model
• In the presence of interaction, separate estimates of RR for
each group (stratum) of the effect modifier should be
produced. However, different RR may be observed,
especially in small cohorts, simply due to sampling
variability. To check for the presence of interaction, some
formal statistical tests have been developed, including the
use of Poisson regression models with (at least) one
interaction variable among the predictors:
$\ln(\lambda) = \beta_0 + \beta_1 E + \beta_2 M + \beta_3 (E \times M)$
• where M is the effect modifier and E is the exposure, both
considered as binary variables for didactic purposes.
94
Estimation of RR in Poisson Model with
Interaction
The two RR estimates in each M stratum may be obtained from the above
equation (with β1 the coefficient of the exposure E and β3 that of the
interaction term). In fact, when M=0:
$RR_{M=0} = \exp(\beta_1)$
and when M=1:
$RR_{M=1} = \exp(\beta_1 + \beta_3)$
It may be noted that when β3 equals 0, the two RR estimates by M stratum
are equal, so M cannot be considered an effect modifier. As a
consequence, interaction may be checked by testing the statistical significance
of the β3 coefficient with one of the tests commonly employed in GLM (Likelihood
ratio, Wald or Score test).
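A Wald test of the interaction coefficient can be sketched for the data of Table 3. The Python sketch below assumes the model ln(λ) = β0 + β1E + β2M + β3(E×M) with M = 0 for males and M = 1 for females. The female person-years (290926 and 116298) are reassembled from digits that are split in the extracted table, an assumption that agrees with the pooled totals. With four cells and four parameters the model is saturated, so the coefficients follow directly from the cell rates.

```python
import math

# Table 3 cells: (cases, person-years)
cells = {
    ("exposed", "male"): (189, 478383),
    ("unexposed", "male"): (78, 242043),
    ("exposed", "female"): (202, 290926),
    ("unexposed", "female"): (41, 116298),
}

def log_rate(key):
    """Natural log of the observed rate in one cell."""
    cases, person_years = cells[key]
    return math.log(cases / person_years)

# ln(RR) in each stratum of the effect modifier
beta1 = log_rate(("exposed", "male")) - log_rate(("unexposed", "male"))
beta1_plus_beta3 = (log_rate(("exposed", "female"))
                    - log_rate(("unexposed", "female")))
beta3 = beta1_plus_beta3 - beta1

# Wald test: the variance of beta3 is the sum of 1/cases over the four cells
se_beta3 = math.sqrt(sum(1 / cases for cases, _ in cells.values()))
z = beta3 / se_beta3
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

print(f"RR males = {math.exp(beta1):.2f}, "
      f"RR females = {math.exp(beta1_plus_beta3):.2f}")
print(f"beta3 = {beta3:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```

A likelihood ratio or score test would be an equally valid choice; the Wald test is used here only because it needs nothing beyond the estimate and its standard error.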
95
Negative Binomial
Manoranjan Pal
Indian Statistical Institute
96
• This part of the presentation has been taken from:
“Poisson-Based Regression Analysis of Aggregate
Crime Rates”, by D. Wayne Osgood, Journal of
Quantitative Criminology, Vol. 16, No. 1, pp. 21 – 43,
2000.
97
Poisson
• The Poisson distribution characterizes the probability of
observing any discrete number of events (i.e., 0, 1, 2, . . .),
given an underlying mean count or rate of events,
assuming that the timing of the events is random and
independent. For instance, the Poisson distribution for a
mean count of 4.5 robberies would describe the
proportion of times that we should expect to observe any
specific count of robberies (0, 1, 2, . . .) in a
neighbourhood, if the ‘‘true’’ (and unchanging) annual
rate for the neighbourhood were 4.5, if the occurrence of one
robbery had no impact on the likelihood of the next, and
if we had an unlimited number of years to observe.
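The probabilities described here can be tabulated directly from the Poisson formula P(k) = exp(−mean) · mean^k / k!. A minimal Python sketch for the mean count of 4.5 robberies follows; the function name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the mean count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean_robberies = 4.5
for k in range(10):
    print(k, round(poisson_pmf(k, mean_robberies), 3))
```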
98
Limiting Cases of Poisson Distribution
• When the mean arrest count is low, as is likely for a small
population, the Poisson distribution is skewed, with only
a small range of counts having a meaningful probability
of occurrence.
• As the mean count grows, the Poisson distribution
increasingly approximates the normal. The Poisson
distribution has a variance equal to the mean count.
Therefore, as the mean count increases, the probability of
observing any specific number of events declines and a
broader range of values have meaningful probabilities of
being observed.
99
An Example
• If our interest is in per capita crime rates, say, rather than in
counts of offenses, then we have to translate the Poisson
distribution of crime counts into distributions of crime rates.
Given a constant underlying mean rate of 500 crimes per
100,000 population, population sizes of 200, 600, 2000, and
10,000 would produce the mean crime counts of 1, 3, 10,
and 50. For the population of 200, only a very limited
number of crime rates are probable (i.e., increments of 500
per 100,000), but those probable rates comprise an
enormous range. As the population base increases, the range
of likely crime rates decreases, even though the range of
likely crime counts increases. The standard deviation around
the mean rate shrinks as the population size increases.
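The figures in this example can be checked with a few lines of Python. The sketch assumes crime counts are Poisson, so the standard deviation of a count is the square root of its mean; the function name `rate_summary` is illustrative.

```python
import math

RATE_PER_100K = 500          # underlying mean crime rate from the example
populations = [200, 600, 2000, 10_000]

def rate_summary(population, rate_per_100k=RATE_PER_100K):
    """Mean count and SD of the observed rate (per 100,000) under a Poisson model."""
    mean_count = rate_per_100k * population / 100_000
    sd_count = math.sqrt(mean_count)              # Poisson: variance = mean
    sd_rate = sd_count / population * 100_000     # convert count SD to rate SD
    return mean_count, sd_rate

for n in populations:
    mean_count, sd_rate = rate_summary(n)
    print(f"population {n:>6}: mean count {mean_count:g}, "
          f"SD of rate {sd_rate:.1f} per 100,000")
```

The standard deviation of the count grows with the population, while the standard deviation of the rate falls in proportion to the inverse square root of the population size.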
100
The Basic Poisson Regression Model
• The basic Poisson regression model is:
$\ln(\lambda_i) = \sum_{k=0}^{K} \beta_k x_{ik}$ … (1)
$P(Y_i = y_i) = \dfrac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}$ … (2)
• Equation (1) is a regression equation relating the natural logarithm
of the mean or expected number of events for case i, ln(λi), to the linear
function of explanatory variables. Equation (2) indicates that the
probability of the observed outcome for this case, yi, follows the
Poisson distribution (the right-hand side of the equation) for the
mean count from Eq. (1). Thus, the expected distribution of crime
counts, and corresponding distribution of regression residuals,
depends on the fitted mean count. The role of the natural logarithm
in Eq. (1) is comparable to the logarithmic transformation of the
dependent variable that is common in analysis of aggregate crime
rates. In both cases, the regression coefficients reflect proportional
differences in rates.
101
Altering the Basic Poisson Regression Model
• Next we must alter the basic Poisson regression model so that it provides
an analysis of per capita crime rates rather than counts of crimes. If λi is the
expected number of crimes in a given aggregate unit, then λi/ni would be
the corresponding per capita crime rate, where ni is the population size for
that unit. With a bit of algebra, we can derive a variation of Eq. (1) that is a
model of per capita crime rates:
$\ln(\lambda_i / n_i) = \sum_{k=0}^{K} \beta_k x_{ik}$, i.e. $\ln(\lambda_i) = \ln(n_i) + \sum_{k=0}^{K} \beta_k x_{ik}$ … (3)
• Thus, by adding the natural logarithm of the size of the population at risk to
the regression model of Eq. (1), and by giving that variable a fixed
coefficient of one, Poisson regression becomes an analysis of rates of
events per capita, rather than an analysis of counts of events.
102
Poisson Regression Vs. OLS Regression
• In the expected distribution of observed crime
rates around the fitted mean crime rates
produced by Eq. (3), the standard deviation is
inversely proportional to the square root of the
population size. Thus, Poisson regression
analysis explicitly addresses the heterogeneous
residual variance that presented a problem for
OLS regression analysis of crime rates.
103
Overdispersion and Variations on the Basic
Poisson Regression Model
• Reason 1: The basic Poisson regression model is appropriate only if
the probability model of Eq. (2) matches the data. Equation (2)
requires that the residual variance be equal to the fitted values, λi,
which is plausible only if the assumptions underlying the Poisson
distribution are fully met by the data. One assumption is that λi is the
true rate for each case, which implies that the explanatory
variables account for all of the meaningful variation among the
aggregate units. If not, the differences between the fitted and true
rates will inflate the variance of the residuals. It is very unlikely that
this assumption will be valid, for there is no more reason to expect
that a Poisson regression will explain all of the variation in the true
crime rates than to expect that an OLS regression would explain all
variance other than error of measurement.
104
Overdispersion and Variations on the
Basic Poisson Regression Model
• Reason 2: Residual variance will also be greater than λi if the
assumption of independence among individual crime events is
inaccurate. Dependence will arise if the occurrence of one
offense generates a short-term increase in the probability of
another occurring. For aggregate crime data, there are many
potential sources of dependence, such as an individual
offending at a high rate over a brief period until being
incarcerated, multiple offenders being arrested for the same
incident, and offenders being influenced by one another’s
behavior. These types of dependence would increase the
year-to-year variability in crime rates for a community beyond λi,
even if the underlying crime rate were constant.
105
A Way Out
• For these two reasons, ‘‘overdispersion’’ in which residual variance
exceeds λi is ubiquitous in analyses of crime data. Applying the
basic Poisson regression model to such data can produce a
substantial underestimation of standard errors of the b’s, which in
turn leads to highly misleading significance tests.
• We use the negative binomial regression model, which is the best
known and most widely available Poisson-based regression model
that allows for overdispersion. Negative binomial regression
combines the Poisson distribution of event counts with a gamma
distribution of the unexplained variation in the underlying or true
mean event counts, λi. This combination produces the negative
binomial distribution, which replaces the Poisson distribution of Eq.
(2).
106
The Negative Binomial Distribution
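• The formula for the negative binomial is
$P(Y_i = y_i) = \dfrac{\Gamma(y_i + \phi)}{y_i!\,\Gamma(\phi)} \cdot \dfrac{\phi^{\phi} \lambda_i^{y_i}}{(\phi + \lambda_i)^{\phi + y_i}}$
• where Γ is the gamma function (a continuous version of the
factorial function), and φ is the reciprocal of α, the residual
variance of the underlying mean counts.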
• With α equal to zero, we have the original Poisson
distribution. As α increases, the distribution becomes more
decidedly skewed as well as more broadly dispersed. Even
for a moderate α of 0.75, the change from the Poisson is
dramatic: From 5.0% of cases having zero crimes and 1.2%
having eight or more crimes when α = 0, it would increase
to 20.8% and 8.8% of cases respectively when α = 0.75.
107
Poisson Vs. Negative Binomial Regression
• In negative binomial regression (as in almost
all Poisson-based regression models), the
substantive portion of the regression model
remains Eq. (1) for crime counts or Eq. (3) for
per capita crime rates. Thus, though the
response probabilities associated with the
fitted values differ from the basic Poisson
regression model, the interpretation of the
regression coefficients does not.
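How the response probabilities differ between the two models can be seen numerically. The Python sketch below evaluates the negative binomial with φ = 1/α and compares it with the Poisson. The mean count of 3 is an assumption made here, since the slides do not state it; it is chosen because it reproduces the percentages quoted earlier for α = 0 and α = 0.75. The function names are illustrative.

```python
import math

def neg_binomial_pmf(y, lam, alpha):
    """Negative binomial probability of count y, mean lam, dispersion alpha > 0."""
    phi = 1.0 / alpha
    # Work on the log scale; lgamma is the log of the gamma function
    log_p = (math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
             + phi * math.log(phi / (phi + lam))
             + y * math.log(lam / (phi + lam)))
    return math.exp(log_p)

def poisson_pmf(y, lam):
    """Poisson probability of count y with mean lam (the alpha = 0 case)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lam = 3.0   # assumed mean count; not stated on the slides

p0_poisson = poisson_pmf(0, lam)
p8_poisson = 1 - sum(poisson_pmf(y, lam) for y in range(8))
p0_nb = neg_binomial_pmf(0, lam, 0.75)
p8_nb = 1 - sum(neg_binomial_pmf(y, lam, 0.75) for y in range(8))

print(f"Poisson:           P(0) = {p0_poisson:.1%}, P(8 or more) = {p8_poisson:.1%}")
print(f"Negative binomial: P(0) = {p0_nb:.1%}, P(8 or more) = {p8_nb:.1%}")
```

Both distributions have the same mean here; only the spread around it changes, which is why the regression coefficients keep the same interpretation.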
108
An Example
• Table I presents descriptive statistics for all measures. During this 5-year
period, there were 1212 arrests of juveniles for robbery in this sample of
counties. The distribution of arrest rates is highly skewed, with zero
robbery arrests of juveniles recorded in 52% of the counties, while the
highest annual arrest rates were slightly less than 400 per 100,000.
109
Example (Contd.)
• Poisson-based models do not assume homogeneity of
variance. Instead, residual variance is expected to be a
function of the predicted number of offenses, which is
in turn a function of population size. Furthermore, even
though a logarithmic transformation is inherent in
Poisson-based regression, observed crime rates of zero
present no problem. Unlike the preceding OLS analyses
of log crime rates, Poisson-based regression analyses
do not require taking the logarithm of the dependent
variable. Instead, estimation for these models involves
computing the probability of the observed count of
offenses, based on the fitted value for the mean count.
110
Conclusions
• Using Poisson-based regression models of offense counts to analyze
per capita offense rates is an important advance for research on
aggregate crime data. Standard analytical approaches require that
data be highly aggregated across either offense types or population
units. Otherwise offense counts are too small to generate per capita
rates that have appropriate distributions and sufficient accuracy to
justify least-squares analysis.
• Poisson-based regression models give researchers an appropriate
means for more fine-grained analysis. Poisson-based models are built
on the assumption that the underlying data take the form of
nonnegative integer counts of events. This is the case for crime
rates, which are computed as offense counts divided by population
size. In our example analysis of juvenile arrest rates for robbery, the
Poisson-based negative binomial model provides a very good fit to
the data, while OLS analyses produce outliers and require arbitrary
choices that have a striking impact on results.
111
Conclusions
• Poisson-based regression models enable
researchers to investigate a much broader
range of aggregate data.
• The reason they are appropriate is that they
recognize the limited amount of information in
small offense counts. The price one must pay
in this trade off is that the smaller the offense
counts, the larger the sample of aggregate units
needed to achieve adequate statistical power.
112
Thank You
113