Overview of our study of the
multiple linear regression model
Regression models with
more than one slope parameter
Example 1
Are brain and body size
predictive of intelligence?
• Sample of n = 38 college students
• Response (y): intelligence based on PIQ
(performance) scores from the (revised)
Wechsler Adult Intelligence Scale.
• Potential predictor (x1): Brain size based on
MRI scans (given as count/10,000).
• Potential predictor (x2): Height in inches.
• Potential predictor (x3): Weight in pounds.
Example 1
Scatter matrix plot
[Figure: scatter matrix of PIQ, Brain, Height, and Weight]
Scatter matrix plot
• Illustrates the marginal relationships
between each pair of variables without
regard to the other variables.
• The challenge is how the response y relates
to all three predictors simultaneously.
Example 1
A multiple linear regression model
with three quantitative predictors
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$
where …
• $y_i$ is intelligence (PIQ) of student i
• $x_{i1}$ is brain size (MRI) of student i
• $x_{i2}$ is height (Height) of student i
• $x_{i3}$ is weight (Weight) of student i
and … the independent error terms $\epsilon_i$ follow a normal
distribution with mean 0 and equal variance $\sigma^2$.
Example 1
Some research questions
• Which predictors – brain size, height, or
weight – explain some variation in PIQ?
• What is the effect of brain size on PIQ, after
taking into account height and weight?
• What is the PIQ of an individual with a
given brain size, height, and weight?
Example 1
The regression equation is
PIQ = 111 + 2.06 Brain - 2.73 Height + 0.001 Weight

Predictor      Coef   SE Coef       T       P
Constant     111.35     62.97    1.77   0.086
Brain        2.0604    0.5634    3.66   0.001
Height       -2.732     1.229   -2.22   0.033
Weight       0.0006    0.1971    0.00   0.998

S = 19.79   R-Sq = 29.5%   R-Sq(adj) = 23.3%

Analysis of Variance
Source            DF        SS        MS      F       P
Regression         3    5572.7    1857.6   4.74   0.007
Residual Error    34   13321.8     391.8
Total             37   18894.6

Source    DF   Seq SS
Brain      1   2697.1
Height     1   2875.6
Weight     1      0.0
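The summary statistics in this output can be recomputed from the analysis of variance entries. The following is a minimal Python sketch that uses only the sums of squares and degrees of freedom printed for Example 1; the variable names are illustrative and not part of the original output.

```python
import math

# Values copied from the Example 1 analysis of variance output.
ss_regression = 5572.7
ss_error = 13321.8
ss_total = 18894.6
df_regression = 3
df_error = 34
df_total = 37

ms_regression = ss_regression / df_regression  # mean square for regression
ms_error = ss_error / df_error                 # mean square error (MSE)

s = math.sqrt(ms_error)                        # residual standard error S
r_sq = ss_regression / ss_total                # R-Sq
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)  # R-Sq(adj)
f_stat = ms_regression / ms_error              # overall F statistic

print(f"S = {s:.2f}")
print(f"R-Sq = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F = {f_stat:.2f}")
```

The four printed values agree with S, R-Sq, R-Sq(adj), and F as reported in the output.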
Example 2
Baby bird breathing habits
in burrows?
• Experiment with n = 120 nestling bank swallows
• Response (y): % increase in “minute ventilation”,
Vent, i.e., total volume of air breathed per minute
• Potential predictor (x1): percentage of oxygen, O2,
in the air the baby birds breathe
• Potential predictor (x2): percentage of carbon
dioxide, CO2, in the air the baby birds breathe
Example 2
Scatter matrix plot
[Figure: scatter matrix of Vent, O2, and CO2]
Example 2
Three-dimensional scatter plot
[Figure: three-dimensional scatter plot of Vent against O2 and CO2]
Example 2
A first order model
with two quantitative predictors
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
where …
• $y_i$ is percentage increase in minute ventilation
• $x_{i1}$ is percentage of oxygen
• $x_{i2}$ is percentage of carbon dioxide
and … the independent error terms $\epsilon_i$ follow a normal
distribution with mean 0 and equal variance $\sigma^2$.
Example 2
Some research questions
• Is oxygen related to minute ventilation, after
taking into account carbon dioxide?
• Is carbon dioxide related to minute ventilation,
after taking into account oxygen?
• What is the mean minute ventilation of all nestling
bank swallows whose breathing air is comprised
of 15% oxygen and 5% carbon dioxide?
Example 2
The regression equation is
Vent = 86 - 5.33 O2 + 31.1 CO2

Predictor     Coef   SE Coef       T       P
Constant      85.9     106.0    0.81   0.419
O2          -5.330     6.425   -0.83   0.408
CO2         31.103     4.789    6.50   0.000

S = 157.4   R-Sq = 26.8%   R-Sq(adj) = 25.6%

Analysis of Variance
Source             DF        SS       MS       F       P
Regression          2   1061819   530909   21.44   0.000
Residual Error    117   2897566    24766
Total             119   3959385

Source   DF    Seq SS
O2        1     17045
CO2       1   1044773
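The third research question for this example asks for the mean minute ventilation at 15% oxygen and 5% carbon dioxide. A point estimate comes from plugging those values into the fitted equation. The sketch below does this in Python with the coefficients from the Coef column; the function name `estimated_mean_vent` is a hypothetical helper, not something from the original analysis.

```python
# Coefficients copied from the Example 2 output (Coef column).
B0, B_O2, B_CO2 = 85.9, -5.330, 31.103

def estimated_mean_vent(o2_percent, co2_percent):
    """Point estimate of the mean % increase in minute ventilation."""
    return B0 + B_O2 * o2_percent + B_CO2 * co2_percent

# 15% oxygen and 5% carbon dioxide, as in the research question.
estimate = estimated_mean_vent(15, 5)
print(f"{estimate:.1f}")
```

This gives only the point estimate, roughly a 161.5% increase. A confidence interval for the mean response would also need the standard error of the fit, which this output does not report.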
Example 3
Is baby’s birth weight related to
smoking during pregnancy?
• Sample of n = 32 births
• Response (y): birth weight in grams of baby
• Potential predictor (x1): length of gestation
in weeks
• Potential predictor (x2): smoking status of
mother (yes or no)
Example 3
Scatter matrix plot
[Figure: scatter matrix of Weight, Gest, and Smoking]
Example 3
A first order model
with one binary predictor
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
where …
• $y_i$ is birth weight of baby i
• $x_{i1}$ is length of gestation of baby i
• $x_{i2} = 1$, if mother smokes and $x_{i2} = 0$, if not
and … the independent error terms $\epsilon_i$ follow a normal
distribution with mean 0 and equal variance $\sigma^2$.
Example 3
Estimated first order model
with one binary predictor
The regression equation is
Weight = - 2390 + 143 Gest - 245 Smoking
[Figure: Weight (grams) against Gestation (weeks), by smoking status (0, 1)]
Example 3
Some research questions
• Is baby’s birth weight related to smoking
during pregnancy?
• How is birth weight related to gestation,
after taking into account smoking status?
Example 3
The regression equation is
Weight = - 2390 + 143 Gest - 245 Smoking

Predictor      Coef   SE Coef       T       P
Constant    -2389.6     349.2   -6.84   0.000
Gest        143.100     9.128   15.68   0.000
Smoking     -244.54     41.98   -5.83   0.000

S = 115.5   R-Sq = 89.6%   R-Sq(adj) = 88.9%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         2   3348720   1674360   125.45   0.000
Residual Error    29    387070     13347
Total             31   3735789

Source    DF    Seq SS
Gest       1   2895838
Smoking    1    452881
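With a 0/1 predictor and no interaction term, the fitted model describes two parallel lines, one for each smoking status. The Python sketch below makes that concrete using the coefficients from the Coef column. The function name `fitted_weight` and the gestation value of 38 weeks are chosen only for illustration.

```python
# Coefficients copied from the Example 3 output (Coef column).
B0, B_GEST, B_SMOKING = -2389.6, 143.100, -244.54

def fitted_weight(gest_weeks, smoker):
    """Fitted birth weight in grams; smoker is 1 for yes, 0 for no."""
    return B0 + B_GEST * gest_weeks + B_SMOKING * smoker

# Intercepts of the two parallel lines (both have slope B_GEST).
intercept_nonsmoker = B0
intercept_smoker = B0 + B_SMOKING

# At any fixed gestation, the fitted difference equals the Smoking coefficient.
difference_at_38 = fitted_weight(38, 1) - fitted_weight(38, 0)
print(round(intercept_smoker, 2), round(difference_at_38, 2))
```

The Smoking coefficient is therefore the estimated vertical gap between the two lines: about 245 grams lower for babies of smoking mothers at the same length of gestation.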
Example 4
Compare three treatments (A, B, C)
for severe depression
• Random sample of n = 36 severely
depressed individuals.
• y = measure of treatment effectiveness
• x1 = age (in years)
• x2 = 1 if patient received A and 0, if not
• x3 = 1 if patient received B and 0, if not
Example 4
Compare three treatments (A, B, C)
for severe depression
[Figure: scatter plot of y against age, by treatment (A, B, C)]
Example 4
A second order model with one
quantitative predictor, a three-group
qualitative variable, and interactions
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_{12} x_{i1} x_{i2} + \beta_{13} x_{i1} x_{i3} + \epsilon_i$
where …
• $y_i$ is treatment effectiveness for patient i
• $x_{i1}$ is age of patient i
• $x_{i2} = 1$, if treatment A and $x_{i2} = 0$, if not
• $x_{i3} = 1$, if treatment B and $x_{i3} = 0$, if not
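The coding above turns a three-level treatment variable into two indicator variables, with treatment C as the reference level (both indicators 0). As a minimal Python sketch, the predictor values for one patient could be built as follows. The function name `design_row` and the age of 40 are hypothetical choices for illustration.

```python
def design_row(age, treatment):
    """Return [1, x1, x2, x3, x1*x2, x1*x3] for one patient.

    Treatment C is the reference level: both indicators are 0.
    """
    if treatment not in ("A", "B", "C"):
        raise ValueError("treatment must be 'A', 'B', or 'C'")
    x2 = 1 if treatment == "A" else 0
    x3 = 1 if treatment == "B" else 0
    return [1, age, x2, x3, age * x2, age * x3]

for trt in ("A", "B", "C"):
    print(trt, design_row(40, trt))
```

The leading 1 corresponds to the intercept, and the last two entries are the interaction terms that let the age slope differ by treatment.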
Example 4
The estimated regression function
Regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3
- 0.703 agex2 - 0.510 agex3
[Figure: fitted lines of y against age (x) for each treatment]
A: y = 47.5 + 0.33x
B: y = 28.9 + 0.52x
C: y = 6.21 + 1.03x
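The three fitted lines follow from the single estimated regression equation by setting the indicator variables for each treatment. The Python sketch below works through that arithmetic with the rounded coefficients from the equation; the variable names are illustrative.

```python
# Coefficients as printed in the estimated regression equation.
b0, b_age, b_x2, b_x3, b_agex2, b_agex3 = 6.21, 1.03, 41.3, 22.7, -0.703, -0.510

# (x2, x3) indicator values for each treatment; C is the reference level.
indicators = {"A": (1, 0), "B": (0, 1), "C": (0, 0)}

lines = {}
for trt, (x2, x3) in indicators.items():
    intercept = b0 + b_x2 * x2 + b_x3 * x3          # main effects shift the intercept
    slope = b_age + b_agex2 * x2 + b_agex3 * x3     # interactions shift the age slope
    lines[trt] = (intercept, slope)
    print(f"{trt}: y = {intercept:.2f} + {slope:.2f} age")
```

Up to rounding, the results reproduce the three lines shown on the plot: the x2 and x3 coefficients change the intercept, and the agex2 and agex3 coefficients change the slope.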
Example 4
Potential research questions
• Does the effectiveness of the treatment
depend on age?
• Is one treatment superior to the other
treatments for all ages?
• What is the effect of age on the effectiveness
of the treatment?
Example 4
Regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3 - 0.703 agex2 - 0.510 agex3

Predictor      Coef   SE Coef       T       P
Constant      6.211     3.350    1.85   0.074
age         1.03339   0.07233   14.29   0.000
x2           41.304     5.085    8.12   0.000
x3           22.707     5.091    4.46   0.000
agex2       -0.7029    0.1090   -6.45   0.000
agex3       -0.5097    0.1104   -4.62   0.000

S = 3.925   R-Sq = 91.4%   R-Sq(adj) = 90.0%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         5   4932.85   986.57   64.04   0.000
Residual Error    30    462.15    15.40
Total             35   5395.00

Source   DF    Seq SS
age       1   3424.43
x2        1    803.80
x3        1      1.19
agex2     1    375.00
agex3     1    328.42
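One possible use of the sequential sums of squares: agex2 and agex3 enter the model last, so their Seq SS add up to the extra regression sum of squares gained by adding both interaction terms to a model that already contains age, x2, and x3. Dividing that by its 2 degrees of freedom and by the MSE gives an F statistic for testing whether both interaction coefficients are zero. The Python sketch below computes only the statistic and assumes the usual partial F-test setup. A p-value would need the F distribution with 2 and 30 degrees of freedom, which is not computed here.

```python
# Sequential sums of squares and error terms copied from the Example 4 output.
seq_ss_agex2 = 375.00
seq_ss_agex3 = 328.42
ss_error = 462.15
df_error = 30

# Extra sum of squares for adding both interactions after age, x2, x3.
extra_ss = seq_ss_agex2 + seq_ss_agex3
extra_df = 2

mse = ss_error / df_error
f_partial = (extra_ss / extra_df) / mse
print(f"F = {f_partial:.2f} on {extra_df} and {df_error} df")
```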
Example 5
How is the length of a bluegill fish
related to its age?
• In 1981, n = 78 bluegills randomly sampled
from Lake Mary in Minnesota.
• y = length (in mm)
• x1 = age (in years)
Example 5
Scatter plot
[Figure: scatter plot of length against age]
Example 5
A second order polynomial model
with one quantitative predictor


$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \epsilon_i$
where …
• $y_i$ is length of bluegill (fish) i (in mm)
• $x_i$ is age of bluegill (fish) i (in years)
and … the independent error terms $\epsilon_i$ follow a normal
distribution with mean 0 and equal variance $\sigma^2$.
Example 5
Estimated regression function
Regression Plot
length = 13.6224 + 54.0493 age - 4.71866 age**2
S = 10.9061
R-Sq = 80.1 %
R-Sq(adj) = 79.6 %
[Figure: fitted quadratic curve of length against age]
Example 5
Potential research questions
• How is the length of a bluegill fish related to
its age?
• What is the length of a randomly selected
five-year-old bluegill fish?
Example 5
The regression equation is
length = 148 + 19.8 c_age - 4.72 c_agesq

Predictor      Coef   SE Coef        T       P
Constant    147.604     1.472   100.26   0.000
c_age        19.811     1.431    13.85   0.000
c_agesq     -4.7187    0.9440    -5.00   0.000

S = 10.91   R-Sq = 80.1%   R-Sq(adj) = 79.6%

Analysis of Variance
Source            DF      SS      MS        F       P
Regression         2   35938   17969   151.07   0.000
Residual Error    75    8921     119
Total             77   44859
...
Predicted Values for New Observations
New      Fit   SE Fit           95.0% CI           95.0% PI
1     165.90     2.77   (160.39, 171.42)   (143.49, 188.32)

Values of Predictors for New Observations
New   c_age   c_agesq
1      1.37      1.88
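The fitted value for the new observation can be checked against both forms of the estimated model. The Python sketch below evaluates the uncentered equation from the regression plot at age 5, the age named in the research question, and the centered equation at the printed predictor values. The function names are hypothetical. That the new observation is a five-year-old fish is inferred from the matching fit, not stated in the output.

```python
# Uncentered fit, copied from the regression plot slide.
def fitted_length(age):
    return 13.6224 + 54.0493 * age - 4.71866 * age**2

# Centered fit, copied from the Coef column; the inputs are the printed
# predictor values for the new observation.
def fitted_length_centered(c_age, c_agesq):
    return 147.604 + 19.811 * c_age - 4.7187 * c_agesq

fit_uncentered = fitted_length(5)
fit_centered = fitted_length_centered(1.37, 1.88)
print(f"{fit_uncentered:.2f} {fit_centered:.2f}")
```

Both values land close to the reported Fit of 165.90 mm. The small gap in the centered version comes from the predictor values being rounded to two decimals in the output.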
The good news!
• Everything you learned about the simple
linear regression model extends, with at
most minor modification, to the multiple
linear regression model:
– same assumptions, same model checking
– (adjusted) $R^2$
– t-tests and t-intervals for one slope
– prediction (confidence) intervals for (mean)
response
New things we need to learn!
• The above research scenarios (models) and
a few more
• The “general linear test” which helps to
answer many research questions
• F-tests for more than one slope
• Interactions between two or more predictor
variables
• Identifying influential data points
New things we need to learn!
• Detection of correlated predictors
(“multicollinearity”) using “variance inflation
factors”, and the limitations they cause
• Selection of variables from a large set of
variables for inclusion in a model (“stepwise
regression” and “best subsets regression”)