Chapter 12 Slides (PPT)

Chapter 12
Multiple Regression
Multiple Regression
• Numeric Response variable (y)
• k Numeric predictor variables (k < n)
• Model:
Y = b0 + b1x1 + … + bkxk + e
• Partial Regression Coefficients: bi ≡ effect (on the mean
response) of increasing the ith predictor variable by 1
unit, holding all other predictors constant
• Model Assumptions (Involving Error terms e )
– Normally distributed with mean 0
– Constant Variance s2
– Independent (Problematic when data are series in time/space)
Example - Effect of Birth weight on
Body Size in Early Adolescence
• Response: Height at Early adolescence (n =250 cases)
• Predictors (k=6 explanatory variables)
• Adolescent Age (x1, in years -- 11-14)
• Tanner stage (x2, units not given)
• Gender (x3=1 if male, 0 if female)
• Gestational age (x4, in weeks at birth)
• Birth length (x5, units not given)
• Birthweight Group (x6=1,...,6: <1500g (1), 1500-1999g (2), 2000-2499g (3), 2500-2999g (4), 3000-3499g (5), >3500g (6))
Source: Falkner, et al (2004)
Least Squares Estimation
• Population Model for mean response:
E(Y) = b0 + b1x1 + … + bkxk
• Least Squares Fitted (predicted) equation, minimizing SSE:
Ŷ = b̂0 + b̂1x1 + … + b̂kxk        SSE = Σ(Y − Ŷ)²
• All statistical software packages/spreadsheets can
compute least squares estimates and their standard errors
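The fitted equation above can be computed directly; a minimal sketch with numpy's least-squares solver, on made-up data (the sample size and true coefficients here are illustrative, not from the chapter):

```python
# Sketch: least squares fit of Y = b0 + b1*x1 + b2*x2 + e with numpy.
# Data and true coefficients are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])         # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes SSE

y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)                    # SSE = sum((Y - Yhat)^2)
print(beta_hat, sse)
```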
Analysis of Variance
• Direct extension to ANOVA based on simple
linear regression
• Only adjustments are to degrees of freedom:
– DFR = k    DFE = n-(k+1)

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square         | F
Model               | SSR            | k                  | MSR = SSR/k         | F = MSR/MSE
Error               | SSE            | n-(k+1)            | MSE = SSE/(n-(k+1)) |
Total               | TSS            | n-1                |                     |

R² = SSR/TSS = (TSS − SSE)/TSS
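As a sketch of how the quantities in the ANOVA table relate, the sums of squares, mean squares, F, and R² can be computed by hand from a least-squares fit (the data here are simulated for illustration):

```python
# Sketch: the ANOVA decomposition TSS = SSR + SSE and the table's quantities,
# for an OLS fit on simulated data (n = 40 cases, k = 2 predictors).
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = np.sum((y - y.mean()) ** 2)   # Total,  df = n-1
sse = np.sum((y - y_hat) ** 2)      # Error,  df = n-(k+1)
ssr = tss - sse                     # Model,  df = k

msr = ssr / k
mse = sse / (n - (k + 1))
f_stat = msr / mse                  # F = MSR/MSE
r2 = ssr / tss                      # R^2 = SSR/TSS
print(f_stat, r2)
```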
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are
associated with the response
• H0: b1==bk=0 (None of the xs associated with y)
• HA: Not all bi = 0
T.S.: F_obs = MSR/MSE = (R²/k) / [(1−R²)/(n−(k+1))]
R.R.: F_obs ≥ F_α, k, n−(k+1)
P-val: P(F ≥ F_obs)
Example - Effect of Birth weight on
Body Size in Early Adolescence
• Authors did not print ANOVA, but did provide following:
• n = 250   k = 6   R² = 0.26
• H0: b1==b6=0
HA: Not all bi = 0
MSR
R2 / k
T .S . : Fobs 


2
MSE
(1  R ) /( n  ( k  1))
0.26 / 6
.0433


 14.2
(1  0.26) /( 250  (6  1)) .0030
R.R. : Fobs  F , 6 , 243  2.13
P  val : P ( F  14.2)
Testing Individual Partial Coefficients - t-tests
• Wish to determine whether the response is
associated with a single explanatory variable, after
controlling for the others
• H0: bi = 0
HA: bi  0 (2-sided alternative)
T.S.: t_obs = b̂i / s_b̂i
R.R.: |t_obs| ≥ t_α/2, n−(k+1)
P-val: 2P(t ≥ |t_obs|)
Example - Effect of Birth weight on
Body Size in Early Adolescence
Variable        | b     | SEb  | t=b/SEb | P-val (z)
Adolescent Age  | 2.86  | 0.99 | 2.89    | .0038
Tanner Stage    | 3.41  | 0.89 | 3.83    | <.001
Male            | 0.08  | 1.26 | 0.06    | .9522
Gestational Age | -0.11 | 0.21 | -0.52   | .6030
Birth Length    | 0.44  | 0.19 | 2.32    | .0204
Birth Wt Grp    | -0.78 | 0.64 | -1.22   | .2224
Controlling for all other predictors, adolescent age, Tanner stage, and
Birth length are associated with adolescent height measurement
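The t statistics and p-values in the table can be reproduced from the published b and SE(b) values; a sketch with scipy's t distribution (the slide reports z-based p-values, so the df = 243 t-based p-values differ slightly):

```python
# Recomputing t = b/SE(b) and two-sided p-values for the table above
# (b and SE values from Falkner et al (2004); df = 250 - (6+1) = 243).
from scipy import stats

df = 250 - (6 + 1)
coefs = {
    "Adolescent Age":  (2.86, 0.99),
    "Tanner Stage":    (3.41, 0.89),
    "Male":            (0.08, 1.26),
    "Gestational Age": (-0.11, 0.21),
    "Birth Length":    (0.44, 0.19),
    "Birth Wt Grp":    (-0.78, 0.64),
}
for name, (b, se) in coefs.items():
    t_obs = b / se
    p = 2 * stats.t.sf(abs(t_obs), df)   # P-val = 2*P(t >= |t_obs|)
    print(f"{name}: t = {t_obs:.2f}, p = {p:.4f}")
```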
Comparing Regression Models
• Conflicting Goals: Explaining variation in Y while
keeping model as simple as possible (parsimony)
• We can test whether a subset of k-g predictors (including
possibly cross-product terms) can be dropped from a
model that contains the remaining g predictors. H0:
bg+1=…=bk =0
– Complete Model: Contains all k predictors
– Reduced Model: Eliminates the predictors from H0
– Fit both models, obtaining sums of squares for each
(or R2 from each):
• Complete: SSRc , SSEc (Rc2)
• Reduced: SSRr , SSEr (Rr2)
Comparing Regression Models
• H0: bg+1=…=bk = 0 (After removing the
effects of X1,…,Xg, none of other predictors
are associated with Y)
• Ha: H0 is false
T.S.: F_obs = [(SSR_c − SSR_r)/(k−g)] / {SSE_c/[n−(k+1)]} = [(R²c − R²r)/(k−g)] / {(1−R²c)/[n−(k+1)]}
R.R.: F_obs ≥ F_α, k−g, n−(k+1)
P = P(F ≥ F_obs)
P-value based on F-distribution with k-g and n-(k+1) d.f.
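A sketch of the complete-vs-reduced F test using the R² form of the statistic; all numbers here (n, k, g, and both R² values) are hypothetical:

```python
# Sketch: complete-vs-reduced F test from R^2 values. All numbers are
# hypothetical: n = 100 cases, complete model with k = 5 predictors
# (Rc^2 = 0.40), reduced model keeping g = 3 of them (Rr^2 = 0.35).
from scipy import stats

n, k, g = 100, 5, 3
r2_c, r2_r = 0.40, 0.35
f_obs = ((r2_c - r2_r) / (k - g)) / ((1 - r2_c) / (n - (k + 1)))
p_val = stats.f.sf(f_obs, k - g, n - (k + 1))    # df: k-g and n-(k+1)
print(round(f_obs, 2), round(p_val, 4))
```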
Models with Dummy Variables
• Some models have both numeric and categorical
explanatory variables (Recall gender in example)
• If a categorical variable has m levels, need to create m-1
dummy variables that take on the values 1 if the level of
interest is present, 0 otherwise.
• The baseline level of the categorical variable is the one
for which all m-1 dummy variables are set to 0
• The regression coefficient corresponding to a dummy
variable is the difference between the mean for that
level and the mean for baseline group, controlling for
all numeric predictors
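A minimal sketch of the dummy-variable coding described above, in plain Python (the 4-level predictor and its observations are hypothetical; pandas.get_dummies with drop_first=True produces the same coding):

```python
# Sketch: coding an m-level categorical predictor as m-1 dummy variables in
# plain Python (the 4-level variable and observations are hypothetical).
levels = ["None", "Mild", "Moderate", "Severe"]   # m = 4; "None" is baseline
data = ["Mild", "None", "Severe", "Moderate", "None"]

non_baseline = levels[1:]                         # one dummy per non-baseline level
dummies = [[1 if obs == lev else 0 for lev in non_baseline] for obs in data]
print(dummies)   # baseline observations ("None") get all zeros
```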
Example - Deep Cervical Infections
• Subjects - Patients with deep neck infections
• Response (Y) - Length of Stay in hospital
• Predictors: (One numeric, 11 Dichotomous)
– Age (x1)
– Gender (x2=1 if female, 0 if male)
– Fever (x3=1 if Body Temp > 38C, 0 if not)
– Neck swelling (x4=1 if Present, 0 if absent)
– Neck Pain (x5=1 if Present, 0 if absent)
– Trismus (x6=1 if Present, 0 if absent)
– Underlying Disease (x7=1 if Present, 0 if absent)
– Respiration Difficulty (x8=1 if Present, 0 if absent)
– Complication (x9=1 if Present, 0 if absent)
– WBC > 15000/mm3 (x10=1 if Present, 0 if absent)
– CRP > 100mg/ml (x11=1 if Present, 0 if absent)
Source: Wang, et al (2003)
Example - Weather and Spinal Patients
• Subjects - Visitors to National Spinal Network in 23 cities
Completing SF-36 Form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
– Patient's age (x1)
– Gender (x2=1 if female, 0 if male)
– High temperature on day of visit (x3)
– Low temperature on day of visit (x4)
– Dew point (x5)
– Wet bulb (x6)
– Total precipitation (x7)
– Barometric Pressure (x8)
– Length of sunlight (x9)
– Moon Phase (new, wax crescent, 1st Qtr, wax gibbous, full moon,
wan gibbous, last Qtr, wan crescent; presumably had 8-1=7
dummy variables)
Source: Glaser, et al (2004)
Modeling Interactions
• Statistical Interaction: When the effect of one
predictor (on the response) depends on the level
of other predictors.
• Can be modeled (and thus tested) with cross-product terms (case of 2 predictors):
– E(Y) = b0 + b1X1 + b2X2 + b3X1X2
– X2 = 0 ⇒ E(Y) = b0 + b1X1
– X2 = 10 ⇒ E(Y) = b0 + b1X1 + 10b2 + 10b3X1 = (b0 + 10b2) + (b1 + 10b3)X1
• The effect of increasing X1 by 1 on E(Y) depends
on level of X2, unless b3=0 (t-test)
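The interaction algebra above can be verified numerically; a sketch with made-up coefficient values showing that the per-unit effect of X1 changes with X2 whenever b3 ≠ 0:

```python
# Numeric check of the interaction algebra, with made-up coefficients:
# the effect of a 1-unit increase in X1 depends on X2 when b3 != 0.
def mean_response(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=0.3):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

slope_at_x2_0 = mean_response(1, 0) - mean_response(0, 0)     # b1
slope_at_x2_10 = mean_response(1, 10) - mean_response(0, 10)  # b1 + 10*b3
print(slope_at_x2_0, slope_at_x2_10)
```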
Logistic Regression
• Logistic Regression - Binary Response variable
and numeric and/or categorical explanatory
variable(s)
– Goal: Model the probability of a particular
outcome as a function of the predictor
variable(s)
– Problem: Probabilities are bounded between 0
and 1
Logistic Regression with 1 Predictor
• Response - Presence/Absence of characteristic
• Predictor - Numeric variable observed for each case
• Model: p(x) ≡ Probability of presence at predictor level x
p(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x))
• b1 = 0 ⇒ P(Presence) is the same at each level of x
• b1 > 0 ⇒ P(Presence) increases as x increases
• b1 < 0 ⇒ P(Presence) decreases as x increases
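A sketch of the logistic curve above with made-up coefficients, showing that p(x) stays strictly between 0 and 1 and increases in x when b1 > 0:

```python
# Sketch of the logistic model p(x) = e^(b0+b1x) / (1 + e^(b0+b1x)),
# with made-up coefficients b0 = -2, b1 = 0.5 (so P(presence) rises in x).
import math

def p(x, b0=-2.0, b1=0.5):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

probs = [round(p(x), 3) for x in (0, 2, 4, 8)]
print(probs)   # probabilities stay strictly between 0 and 1
```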
Logistic Regression with 1 Predictor
• b0, b1 are unknown parameters and must be estimated
using statistical software such as SPSS, SAS, or STATA
• Primary interest in estimating and testing hypotheses
regarding b1
• Large-Sample test (Wald Test): (some software runs a z-test)
• H0: b1 = 0   HA: b1 ≠ 0

T.S.: X²_obs = (b̂1 / s_b̂1)²
R.R.: X²_obs ≥ χ²_α,1
P-value: P(χ² ≥ X²_obs)
Example - Rizatriptan for Migraine
• Response - Complete Pain Relief at 2 hours (Yes/No)
• Predictor - Dose (mg): Placebo (0),2.5,5,10
Dose | # Patients | # Relieved | % Relieved
0    | 67         | 2          | 3.0
2.5  | 75         | 7          | 9.3
5    | 130        | 29         | 22.3
10   | 145        | 40         | 27.6

Source: Gijsmant, et al (1997)
Example - Rizatriptan for Migraine (SPSS)
SPSS coefficient output ("Variables in the Equation"): Constant B = −2.490; Dose B = 0.165, S.E. = 0.037

p̂(x) = e^(−2.490 + 0.165x) / (1 + e^(−2.490 + 0.165x))

H0: b1 = 0   HA: b1 ≠ 0
T.S.: X²_obs = (0.165/0.037)² = 19.819
R.R.: X²_obs ≥ χ²_.05,1 = 3.84
P-val: .000
Odds Ratio
• Interpretation of Regression Coefficient (b1):
– In linear regression, the slope coefficient is the change
in the mean response as x increases by 1 unit
– In logistic regression, we can show that:
odds ( x  1)
 e b1
odds ( x)

p ( x) 
 odds ( x) 

1  p ( x) 

• Thus e^b1 represents the (multiplicative) change in the odds of the outcome when x increases by 1 unit
• If b1 = 0, the odds (and probability) are equal at all x levels (e^b1 = 1)
• If b1 > 0, the odds (and probability) increase as x increases (e^b1 > 1)
• If b1 < 0, the odds (and probability) decrease as x increases (e^b1 < 1)
95% Confidence Interval for Odds Ratio
• Step 1: Construct a 95% CI for b1:
b̂1 ± 1.96 s_b̂1  ⇒  (b̂1 − 1.96 s_b̂1 , b̂1 + 1.96 s_b̂1)
• Step 2: Raise e = 2.718 to the lower and upper bounds of the CI:
(e^(b̂1 − 1.96 s_b̂1) , e^(b̂1 + 1.96 s_b̂1))
• If entire interval is above 1, conclude positive association
• If entire interval is below 1, conclude negative association
• If interval contains 1, cannot conclude there is an association
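The two steps can be sketched in a few lines, using the estimates from the Rizatriptan example (b̂1 = 0.165, s_b̂1 = 0.037):

```python
# The two CI steps with the Rizatriptan estimates (b1_hat = 0.165,
# SE = 0.037): CI for b1, then exponentiate the endpoints for the OR.
import math

b1_hat, se = 0.165, 0.037
lo, hi = b1_hat - 1.96 * se, b1_hat + 1.96 * se      # Step 1: CI for b1
or_lo, or_hi = math.exp(lo), math.exp(hi)            # Step 2: CI for e^b1
print(round(lo, 4), round(hi, 4), round(or_lo, 2), round(or_hi, 2))
```

Since the whole odds-ratio interval sits above 1, this reproduces the positive-association conclusion.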
Example - Rizatriptan for Migraine
• 95% CI for b1:
b̂1 = 0.165   s_b̂1 = 0.037
95% CI: 0.165 ± 1.96(0.037) = (0.0925 , 0.2375)
• 95% CI for population odds ratio:
(e^0.0925 , e^0.2375) = (1.10 , 1.27)
• Conclude positive association between dose and
probability of complete relief
Multiple Logistic Regression
• Extension to more than one predictor variable (either
numeric or dummy variables).
• With k predictors, the model is written:
p = e^(b0 + b1x1 + … + bkxk) / (1 + e^(b0 + b1x1 + … + bkxk))
• Adjusted Odds Ratio for raising xi by 1 unit, holding
all other predictors constant:
ORi = e^bi
• Inferences on bi and ORi are conducted as was described
above for the case with a single predictor
Example - ED in Older Dutch Men
• Response: Presence/Absence of ED (n=1688)
• Predictors: (k=12)
– Age stratum (50-54*, 55-59, 60-64, 65-69, 70-78)
– Smoking status (Nonsmoker*, Smoker)
– BMI stratum (<25*, 25-30, >30)
– Lower urinary tract symptoms (None*, Mild, Moderate, Severe)
– Under treatment for cardiac symptoms (No*, Yes)
– Under treatment for COPD (No*, Yes)
* Baseline group for dummy variables
Source: Blanker, et al (2001)
Example - ED in Older Dutch Men
Predictor                    | b    | SEb  | Adjusted OR (95% CI)
Age 55-59 (vs 50-54)         | 0.83 | 0.42 | 2.3 (1.0 – 5.2)
Age 60-64 (vs 50-54)         | 1.53 | 0.40 | 4.6 (2.1 – 10.1)
Age 65-69 (vs 50-54)         | 2.19 | 0.40 | 8.9 (4.1 – 19.5)
Age 70-78 (vs 50-54)         | 2.66 | 0.41 | 14.3 (6.4 – 32.1)
Smoker (vs nonsmoker)        | 0.47 | 0.19 | 1.6 (1.1 – 2.3)
BMI 25-30 (vs <25)           | 0.41 | 0.21 | 1.5 (1.0 – 2.3)
BMI >30 (vs <25)             | 1.10 | 0.29 | 3.0 (1.7 – 5.4)
LUTS Mild (vs None)          | 0.59 | 0.41 | 1.8 (0.8 – 4.3)
LUTS Moderate (vs None)      | 1.22 | 0.45 | 3.4 (1.4 – 8.4)
LUTS Severe (vs None)        | 2.01 | 0.56 | 7.5 (2.5 – 22.5)
Cardiac symptoms (Yes vs No) | 0.92 | 0.26 | 2.5 (1.5 – 4.3)
COPD (Yes vs No)             | 0.64 | 0.28 | 1.9 (1.1 – 3.6)
• Interpretations: Risk of ED appears to be:
– Increasing with age, BMI, and LUTS strata
– Higher among smokers
– Higher among men being treated for cardiac symptoms or COPD
Testing Regression Coefficients
• Testing the overall model:
H0: b1 = … = bk = 0
HA: Not all bi = 0

T.S.: X²_obs = (−2 log(L0)) − (−2 log(L1))
R.R.: X²_obs ≥ χ²_α,k
P = P(χ² ≥ X²_obs)
• L0, L1 are values of the maximized likelihood function, computed by
statistical software packages. This logic can also be used to compare
full and reduced models based on subsets of predictors. Testing for
individual terms is done as in model with a single predictor.
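A sketch of the likelihood-ratio computation above: fit the null (intercept-only) and full logistic models by maximizing the Bernoulli log-likelihood with scipy, then compare −2 log L (data and coefficients are simulated for illustration):

```python
# Sketch: likelihood-ratio test for a logistic model. Null (intercept-only)
# and full models are fit by maximizing the Bernoulli log-likelihood with
# scipy; data are simulated for illustration (true slope = 1.2).
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)

def neg_loglik(beta, X):
    eta = X @ beta
    # -log L = -sum[y*eta - log(1 + e^eta)]
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

X_full = np.column_stack([np.ones(n), x])
X_null = np.ones((n, 1))
fit_full = optimize.minimize(neg_loglik, np.zeros(2), args=(X_full,))
fit_null = optimize.minimize(neg_loglik, np.zeros(1), args=(X_null,))

x2_obs = 2 * (fit_null.fun - fit_full.fun)   # (-2 log L0) - (-2 log L1)
p_val = stats.chi2.sf(x2_obs, df=1)          # k = 1 predictor dropped
print(round(x2_obs, 2), p_val)
```

The same comparison, with appropriate degrees of freedom, tests a reduced model against a full model.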