regression? - The Economics Network

OLS & Logistic Regression
Analysis – A Recap
Cristina Penaloza & Eoin Maloney
Health Economics Unit
1
Outline
• What is regression analysis?
• Relevance of regression analysis
• Regression modelling process
– OLS regression
– Logistic regression
• Exercise
2
What is Regression Analysis?
“Regression analysis is concerned with the study
of the dependence of one variable, the dependent
variable, on one or more other variables, the
explanatory variables, …
with a view to estimating and/or predicting the
(population) mean or average value of the
dependent variable in terms of known or fixed (in
repeated sampling) values of the explanatory
variables.”
Gujarati (1995: 16)
3
Terminology
Dependent variable, explained variable, outcome
variable, outcome, response variable, regressand,
output variable, predicted value, predictand,
endogenous
Explanatory variable, Independent variable,
predictor variable, predictor, regressor,
stimulus/control variable, exogenous
Disturbance (random error) term, residual,
residual error
4
Causation / correlation
• Regression vs causation
– “A statistical relationship, however strong and however
suggestive, can never establish causal connection: our
ideas of causation must come from outside statistics”
Gujarati (1995: 20)
• Regression vs correlation
– Correlation analysis: seeks to measure the strength of
linear association between two variables
– Regression analysis: seeks to estimate or predict the
average value of one variable on the basis of fixed
values of other variables
5
Why study regression?
• Adjusting for baseline characteristics in Economic Evaluation
(Nathwani et al. 2004; Manca et al. 2005; Hoch et al. 2002)
• Predicting/mapping utility-based outcome measures for use in
Economic Evaluation (Gray et al. 2006; Kaambwa et al. 2011;
Sengupta et al. 2004)
• Predicting costs for use in Economic Evaluation (Smith et al. 2007;
Bonizzato et al. 2000; Baumeister et al. 2009)
• Constructing cost-effectiveness acceptability curves (CEACs) (Hoch et al. 2006)
• Regression imputation for missing data (Billingham et al. 2002;
Engels & Diehr 2003; Blazer et al. 1995)
• Explaining factors which cause variation in outcome and cost data
(Barber & Thompson 2004; Kaambwa et al. 2008; Raine et al. 2010)
6
The regression modelling process
1. Statement of hypothesis (theory)
2. Specification of the model
3. Obtaining the data
4. Estimation of the regression model
5. Diagnostic analysis
6. Hypothesis testing
7. Prediction/forecasting
7
1. Statement of hypothesis
Example: High Blood Pressure and older
people
“Amongst those over the age of 65, the incidence
of high diastolic blood pressure (DIBP) increases
with age. Therefore, DIBP is, in part, explained by
age.”
8
2. Specification of the model
In functional form:
Mean diastolic blood pressure, DIBP, is
some function of age, A:
$$\text{DIBP} = f(A) \qquad (1)$$
9
2. Specification of the model (cntd)
In mathematical (linear) form:
$$Y = \beta_1 + \beta_2 X \qquad (2)$$
where
Y = mean DIBP and X = age
$\beta_1$ and $\beta_2$ = parameters
10
Linear relationship
[Figure: E(Y|X) plotted against X, with observations scattered at x1, x3, x4 and x6. Axes: E(Y|X) (vertical), X (horizontal).]
11
2. Specification of the model (cntd)
Econometric (regression) model:
$$Y = \beta_1 + \beta_2 X + u \qquad (3)$$
where
Y = mean DIBP, the dependent variable
X = age, the explanatory variable
u = disturbance (random error) term
$\beta_1$ and $\beta_2$ = parameters
12
The error term (u)
• Omitted explanatory variables
• Measurement error
• Wrong functional form
• Unavailability of data
• Inherent randomness
etc….
13
3 & 4. Data / estimation of parameters
• Obtaining the data
– observed values of Y and X
• Estimation of the parameters
– Y and X are the variables (“known”)
– $\beta_1$, $\beta_2$ and u are the unknowns: $\beta_1$ and $\beta_2$ are the parameters to be estimated, and u is the unobserved disturbance
14
5. Diagnostic analysis
• Is the model correctly specified?
• Have all assumptions been met?
• Are there any unusual observations or
outliers that may unduly influence results?
More on this later this morning…
15
6. Hypothesis Testing
• Is the estimate statistically close to a postulated
value? Or are the estimates in accord with
expectations from theory?
• Only after the model has been shown to be
adequate
16
7. Forecasting or Prediction
• If the hypothesis or theory being tested is
confirmed, then future values of the
dependent variable can be predicted or
forecast
• Policy recommendations
17
The practice of regression modelling
Hypothesis / theory
→ Model specification
→ Data
→ Estimation
→ Specification testing and diagnostic testing
→ Is the model adequate?
   No: return to the earlier steps and revise the model
   Yes: → Hypothesis testing
→ Policy: prediction and forecasting
18
Sample regression
• In practice we will never observe the population
regression line.
• Instead we take a random sample of observations
in order to estimate the $\beta$s.
• We distinguish the sample regression from the
population regression as follows:
19
Sample regression
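Mathematical model:
$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$$
Econometric model:
$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$$
where
$\hat{Y}_i$ = estimator of $E(Y|X_i)$
$\hat{\beta}_1$ = estimator of $\beta_1$
$\hat{\beta}_2$ = estimator of $\beta_2$
$\hat{u}_i$ = estimate of $u_i$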
20
Population regression
Mathematical model:
$$Y_i = \beta_1 + \beta_2 X_i$$
Econometric model:
$$Y_i = \beta_1 + \beta_2 X_i + u_i$$
where
$Y_i$ (in the mathematical model) = $E(Y|X_i)$
$\beta_1$ = constant / Y intercept
$\beta_2$ = coefficient for $X_i$
$u_i$ = error term
21
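[Figure: observed values $Y_1$ to $Y_4$ and fitted values $\hat{Y}_1$ to $\hat{Y}_4$ plotted against $X_1$ to $X_4$, with the sample regression line $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$. Axes: Y (vertical), X (horizontal).]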
22
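[Figure: observed values $Y_1$ to $Y_4$ and fitted values $\hat{Y}_1$ to $\hat{Y}_4$ plotted against $X_1$ to $X_4$, with the residuals $\hat{u}_1$ to $\hat{u}_4$ marked as the vertical distances between each observed value and the sample regression line, so that $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$. Axes: Y (vertical), X (horizontal).]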
23
The Ordinary Least
Squares (OLS) Model
The dependent variable is modelled as a linear function of
predictor (independent) variables. The dependent variable
is continuous, e.g. blood pressure, cholesterol level or
weight.
24
OLS
• What factors cause variation in an individual’s
diastolic blood pressure?
• What variables explain movement in men’s
cholesterol level?
• What variables are predictive of high birth weight in a
population of mothers from Birmingham?
The dependent variable can take on any numerical value
within the limits of the range of that variable.
25
OLS
The OLS method seeks to minimise the residual
sum of squares:
$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$$
$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
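For the two-parameter model, minimising the residual sum of squares has a standard closed-form solution: the slope estimate is $\sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and the intercept estimate is $\bar{Y} - \hat{\beta}_2 \bar{X}$. The sketch below applies these formulas in plain Python. The age and blood pressure figures are made up for illustration, and the function names `ols_fit` and `rss` are not from the slides.

```python
def ols_fit(x, y):
    """Return (intercept, slope) that minimise the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(x, y, b1, b2):
    """Residual sum of squares for candidate intercept b1 and slope b2."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data: age (years) and diastolic blood pressure (mmHg).
ages = [66, 70, 74, 78, 82, 86]
dibp = [78, 80, 85, 84, 90, 93]

b1_hat, b2_hat = ols_fit(ages, dibp)
best = rss(ages, dibp, b1_hat, b2_hat)

print(f"intercept = {b1_hat:.3f}, slope = {b2_hat:.3f}, RSS = {best:.3f}")
# Nudging either estimate away from the OLS values increases the RSS.
print(rss(ages, dibp, b1_hat + 0.5, b2_hat) > best)
print(rss(ages, dibp, b1_hat, b2_hat - 0.01) > best)
```

Any other pair of values for the intercept and slope gives a larger residual sum of squares on the same data, which is what "least squares" means.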
26
Minimising the residual…
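[Figure: scatter of Y against X at $X_1$ to $X_4$ with the fitted line; the residuals $\hat{u}_1$ to $\hat{u}_4$ are shown as the vertical distances between the points and the line. Axes: Y (vertical), X (horizontal).]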
27
Describing the overall fit of the
estimated model
The coefficient of determination, or $R^2$, is a
measure of the ‘goodness of fit’ of a regression,
i.e. the proportion of the variation
in $Y_i$ which is explained by the regression:
$$0 \le R^2 \le 1$$
But focusing solely on maximising $R^2$ is not a good
idea! (Other measures will be considered this afternoon…)
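As a worked illustration, $R^2$ for a regression with an intercept can be computed as one minus the ratio of the residual sum of squares to the total sum of squares of Y around its mean. The sketch below does this in plain Python. The data are made up, and the helper name `r_squared` is an assumption for illustration rather than anything from the slides.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


# Made-up illustrative data: age (years) and diastolic blood pressure (mmHg).
ages = [66, 70, 74, 78, 82, 86]
dibp = [78, 80, 85, 84, 90, 93]

# Closed-form simple OLS estimates of the slope and intercept.
n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(dibp) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, dibp))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * x for x in ages]
r2 = r_squared(dibp, fitted)
print(f"R^2 = {r2:.4f}")
```

A value near 1 says only that the fitted line tracks these particular observations closely; it does not show that the model is correctly specified.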
28
Models for Categorical
Dependent Variables
For use on dependent variables that are either
dichotomous (individual has CVD or not), or
polytomous (Low, Medium or High cholesterol level)
which are quite common in Health-related datasets
29
Models for Categorical Dependent
Variables
Focus
Binary response variable – independent variables are
used to predict whether or not some event will occur:
Based on certain described characteristics:
Will an individual get cancer or not?
Will a patient survive or die?
Will an individual develop CVD or not?
30
Coding of outcomes:
Usually coded 1 if the attribute of interest is
present and 0 otherwise.
Approach to be used:
Logistic regression - best for dichotomous
dependent variable, and continuous and
categorical independent variables.
Other commonly used approaches:
Probit & Nested Logit
31
Major difference from Ordinary Linear
Regression
• Uses a link function for the relationship between the dependent
and independent variables
• Substitutes maximum likelihood estimation
(MLE) of a link function of the dependent
variable for OLS regression's least squares
estimation of the dependent variable itself.
MLE: a method of estimating unknown parameters in such
a way that the probability of observing the given
values of the dependent variable is as high as
possible (i.e. maximised)
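To make the idea of maximum likelihood concrete, the sketch below takes the simplest possible case: a binary outcome with a single unknown event probability p and no explanatory variables. It evaluates the Bernoulli log-likelihood over a grid of candidate values and picks the one that makes the observed data most probable. The outcomes are made up, and the grid search is an illustrative stand-in for the iterative optimisers that statistical packages use.

```python
import math


def log_likelihood(p, outcomes):
    """Bernoulli log-likelihood of 0/1 outcomes for event probability p."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)


# Made-up outcomes: 1 = has CVD, 0 = does not (6 events out of 10).
outcomes = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Candidate probabilities 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
p_mle = max(grid, key=lambda p: log_likelihood(p, outcomes))

print(f"maximum likelihood estimate of p = {p_mle:.2f}")
print(f"sample proportion               = {sum(outcomes) / len(outcomes):.2f}")
```

In this intercept-only case the maximum likelihood estimate coincides with the sample proportion. Logistic regression applies the same principle, but lets the probability vary with the explanatory variables through the link function.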
32
Issues to consider…
• Why are OLS models not suitable for dichotomous
data?
• Logit transformation – Link Function
• Marginal & Conditional Odds and Probability
33
Suppose we want to model $Y_i = \beta_0 + \beta_1 X_1 + \varepsilon$, but
$$Y_i = \begin{cases} 1 & \text{if the i-th individual has the attribute of interest, e.g. CVD} \\ 0 & \text{otherwise} \end{cases}$$
and
• $\beta_0$ is the coefficient on the constant term,
• $\beta_1$ is the coefficient on the independent variable,
• $X_1$ is the independent variable, e.g. Age, and
• $\varepsilon$ is the error term.
34
Let $Y_i = 1$ if the i-th individual has CVD, and 0
otherwise.
Let $Y_i$ also take the values 1 and 0 with probabilities
$p_i$ and $1 - p_i$, respectively,
i.e.
$P(Y_i = 1) = P(\text{CVD} = 1) = p_i$
$P(Y_i = 0) = P(\text{CVD} = 0) = 1 - p_i$
35
Why not just use Simple Linear (OLS) regression?
Consider a simple OLS regression model
$\text{CVD} = \beta_0 + \beta_1 \text{Age} + \varepsilon$,
Assumptions
a) $\varepsilon \sim N(0, \sigma^2)$
b) $\text{var}(\varepsilon)$ is constant, i.e. homoscedasticity
Binary outcome variables violate these assumptions…
36
Why not just use Simple Linear (OLS) regression?
• CVD is binary: it takes on only two values (0 and 1).
Consequently, for any given value of Age, $\varepsilon$ can also take
only two values, and therefore the ‘normality of residuals’
assumption is violated.
• The error terms are heteroscedastic: their variance,
$p_i(1 - p_i)$, changes with the probability of the event, so the
regression assumption that the variance of the error term is
constant is violated.
• The predicted probabilities can be greater than 1 or
less than 0, which can be a problem if the predicted
values are used in a subsequent analysis!
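The out-of-range problem is easy to reproduce. The sketch below fits a straight line by OLS to a made-up binary CVD indicator and evaluates the fitted "probabilities" at the youngest and oldest ages in the sample. The data and the helper name `predict` are assumptions for illustration only.

```python
# Made-up illustrative data: age (years) and CVD status (1 = yes, 0 = no).
ages = [66, 68, 70, 72, 74, 76, 78, 80, 82, 84]
cvd = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Closed-form simple OLS estimates (a linear probability model).
n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(cvd) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cvd))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar


def predict(age):
    """Fitted value from the linear model, read as a 'probability' of CVD."""
    return intercept + slope * age


for age in (66, 75, 84):
    print(f"age {age}: predicted P(CVD) = {predict(age):.3f}")
```

With these numbers the fitted value is below 0 at age 66 and above 1 at age 84, even though both ages are inside the sample, so the fitted values cannot be read as probabilities.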
37
Logit transformation
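1. Move from probabilities to odds
$$\text{Odds} = \frac{P_i}{1 - P_i} = \frac{P(\text{CVD exists})}{P(\text{CVD doesn't exist})}$$
2. Take logs of both sides, to get the log-odds or logit
$$\log(\text{odds}) = \text{logit}(P_i) = \log\left(\frac{P_i}{1 - P_i}\right) = \beta_i \text{Age}_i$$
or equivalently,
$$P_i(\text{CVD exists}) = \frac{\exp(\beta_i \text{Age}_i)}{1 + \exp(\beta_i \text{Age}_i)}$$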
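The two steps of the transformation, and the inverse that maps log-odds back to a probability, can be sketched in a few lines of Python. The function names `odds`, `logit` and `inv_logit` are illustrative choices, and the inverse is written in the simple exp(z) / (1 + exp(z)) form, which is fine for moderate values of z but would overflow for very large ones.

```python
import math


def odds(p):
    """Odds corresponding to probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def logit(p):
    """Log-odds of probability p (requires 0 < p < 1)."""
    return math.log(odds(p))


def inv_logit(z):
    """Probability corresponding to log-odds z: exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1 + math.exp(z))


for p in (0.05, 0.40, 0.50, 0.75, 0.95):
    z = logit(p)
    print(f"p = {p:.2f}  odds = {odds(p):7.3f}  "
          f"logit = {z:7.3f}  back to p = {inv_logit(z):.2f}")
```

Probabilities are confined to the interval from 0 to 1, odds run from 0 upwards, and log-odds can be any real number. Whatever value the linear predictor takes, `inv_logit` returns a probability strictly between 0 and 1.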
38
The logit transformation removes the floor restriction: odds range from 0 to infinity, whereas log-odds can take any value on the real line
39
Logistic Regression Output
Part of this output is in the form of odds, odds ratios and
probabilities.
An understanding of these concepts (both marginal
and conditional) is therefore central to interpreting
logistic regression output
Key Question to be explored:
What factors determine the probability that an
individual will or will not develop CVD?
40
Marginal & Conditional odds

               CVD   No CVD   Row total
Smokers         75       25         100
Non-smokers     40       60         100
Column total   115       85         200

• The odds of having CVD are 115/85 = 1.353. These are the
marginal or unconditional odds of having CVD.
• The conditional odds of having CVD, given the category
“Smokers”, are 75:25, or 3. A smoker is 3.0 times as likely
to have CVD as not to have it
• The conditional odds of having CVD, given the
category “Non-smokers”, are 40:60, or 0.67. A non-smoker
is 0.67 times as likely to have CVD as not to have it
41
Probability
The probability of having CVD is 115/200 = 0.575
The probability of having CVD given that one is a
smoker is 75/100 = 0.75
The probability of having CVD given that one is a
non-smoker is 40/100 = 0.40
42
Odds Ratio
• The odds ratio of smokers (numerator) to non-smokers
(denominator) for CVD is (75/25)/(40/60) = 4.5. (Dividing the
rounded odds, 3/0.67, gives approximately 4.478.)
(This means that the odds of smokers having CVD are 4.5
times as high as those of non-smokers having CVD)
• The odds ratio is the cross-product ratio, i.e. (60 × 75)/(40 × 25) = 4.5
• When one moves from being a non-smoker to a smoker, the
odds of having CVD increase by 350% (i.e. from 0.67
for non-smokers to 3 for smokers)
43
Alternative interpretation of Odds Ratio
• Smokers have 4.5 times the odds of having CVD that non-smokers have
• Loosely, this is often read as smokers being “4.5 times more likely”
to have CVD, but the odds ratio is not a ratio of risks: the ratio of
the probabilities here is 0.75/0.40 = 1.875
• The odds of CVD for smokers are 350% higher than the
odds of CVD for non-smokers ((4.5 − 1.00) × 100)
• The predicted odds for smokers are 4.5 times the odds
for non-smokers.
• A one unit change in the independent variable Smokers
(non-smoker to smoker) increases the odds of having
CVD by a factor of 4.5.
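The odds, probabilities and odds ratio can be recomputed directly from the smoking/CVD table. The sketch below does so in plain Python using the counts from the slides. The dictionary layout and helper names are illustrative assumptions. Computed from the unrounded counts, the odds ratio is exactly 4.5; dividing the rounded odds (3/0.67) gives about 4.478.

```python
import math

# Counts from the smoking/CVD table.
counts = {
    "smokers": {"cvd": 75, "no_cvd": 25},
    "non_smokers": {"cvd": 40, "no_cvd": 60},
}


def odds(group):
    """Conditional odds of CVD within one group."""
    return group["cvd"] / group["no_cvd"]


def probability(group):
    """Conditional probability of CVD within one group."""
    return group["cvd"] / (group["cvd"] + group["no_cvd"])


odds_smokers = odds(counts["smokers"])
odds_non_smokers = odds(counts["non_smokers"])

# Marginal odds ignore smoking status altogether.
total_cvd = sum(g["cvd"] for g in counts.values())
total_no_cvd = sum(g["no_cvd"] for g in counts.values())
marginal_odds = total_cvd / total_no_cvd

odds_ratio = odds_smokers / odds_non_smokers
cross_product = (75 * 60) / (25 * 40)
percent_increase = (odds_ratio - 1) * 100
risk_ratio = probability(counts["smokers"]) / probability(counts["non_smokers"])
log_odds_ratio = math.log(odds_ratio)

print(f"marginal odds        = {marginal_odds:.3f}")
print(f"odds (smokers)       = {odds_smokers:.3f}")
print(f"odds (non-smokers)   = {odds_non_smokers:.3f}")
print(f"odds ratio           = {odds_ratio:.3f}")
print(f"cross-product ratio  = {cross_product:.3f}")
print(f"% increase in odds   = {percent_increase:.1f}")
print(f"ratio of probabilities = {risk_ratio:.3f}")
print(f"log odds ratio       = {log_odds_ratio:.3f}")
```

In a logistic regression of CVD on a single smoker indicator, the slope coefficient equals the log of this odds ratio, so exponentiating the coefficient recovers the odds ratio.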
44
References
• Altman D.G. 1991. Practical Statistics for Medical Research
(London: Chapman & Hall/CRC)
• Gujarati D.N. 1995. Basic Econometrics (New York: McGraw-Hill, Inc)
• Johnston J. and J. DiNardo. 1997. Econometric Methods
(London: The McGraw-Hill Companies, Inc)
• Long J.S. 1997. Regression Models for Categorical and
Limited Dependent Variables. A Volume in the Sage Series for
Advanced Quantitative Techniques (Thousand Oaks, CA: Sage
Publications)
• Wang, MinQi, James M. Eddy, Eugene C. Fitzhugh. 1995.
"Application of Odds Ratio and Logistic Models in Epidemiology
and Health Research," Health Values 19: 59-62.
45
References
• Nathwani et al. 2004. “An economic evaluation of a European cohort
from a multinational trial of linezolid versus teicoplanin in serious
Gram-positive bacterial infections: the importance of treatment setting
in evaluating treatment effects” International Journal of
Antimicrobial Agents 23: 315–324
• Manca A, Hawkins N, Sculpher M. 2005. “Estimating mean QALYs in
trial-based cost-effectiveness analysis: the importance of controlling
for baseline utility” Health Economics 14: 487–496
• Hoch et al. 2002. “Something old, something new, something blue: a
framework for the marriage of health econometrics and
cost-effectiveness analysis” Health Econ 11: 415–430
• Gray et al. 2006. “Estimating the association between SF-12 responses
and EQ-5D utility values by response mapping” Med Decis Making
26, 1: 18–29
46
References
• Kaambwa et al. 2011. “Mapping utility scores from the Barthel
index” Eur. Journal of Health Economics, DOI: 10.1007/s10198-011-0364-5
• Sengupta et al. 2004. “Mapping the SF-12 to the HUI3 and VAS
in a managed care population” Med Care 42, 9: 927–937
• Smith et al. 2007. “Predicting Costs Of Care In Chronic Kidney
Disease: The Role Of Comorbid Conditions” The Internet
Journal of Nephrology 4, 1
• Bonizzato et al. 2000. “Community-based mental health care: to
what extent are service costs associated with clinical, social and
service history variables?” Psychological Medicine 30: 1205–1215
• Baumeister et al. 2009. “Predictive modeling of health care costs:
do cardiovascular risk markers improve prediction?” European
Journal of Cardiovascular Prevention & Rehabilitation
47
References
• Hoch et al. 2006. “Using the net benefit regression framework to
construct cost-effectiveness acceptability curves: an example
using data from a trial of external loop recorders versus Holter
monitoring for ambulatory monitoring of "community acquired"
syncope” BMC Health Services Research 6: 68
• Billingham LJ et al. 2002. “Patterns, costs and cost-effectiveness
of care in a trial of chemotherapy for advanced non-small cell
lung cancer: evidence from a randomised trial” Lung Cancer
37: 219–225
• Engels, J.M. & Diehr, P. 2003. “Imputation of missing
longitudinal data: a comparison of methods” Journal of Clinical
Epidemiology 56: 968–976
• Blazer et al. 1995. “Health Services Access and Use among Older
Adults in North Carolina: Urban vs Rural Residents” American
Journal of Public Health 85, 10: 1384–1390
48
References
• Barber, J. & Thompson, S. 2004. “Multiple regression of cost
data: use of generalised linear models” J Health Serv Res
Policy 9: 197–204
• Kaambwa, B., Bryan, S., Barton, P., Parker, H., Martin, G.,
Hewitt, G., Parker, S., & Wilson, A. 2008. “Costs and health
outcomes of intermediate care: results from five UK case study
sites” Health Soc. Care Community 16: 573–581
• Raine et al. 2010. “Social variations in access to hospital care for
patients with colorectal, breast, and lung cancer between 1999
and 2006: retrospective analysis of hospital episode statistics”
BMJ 340: b5479
49
Exercises
• OLS regression
• Logistic Regression
50