This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this
material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of these materials
permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or
warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently
review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for
obtaining permissions for use from third parties as needed.
Section C
Variability in MLR: Assessing Uncertainty and Goodness of Fit
MLR
- The algorithm to estimate the equation of the MLR "line" is called "least squares" estimation
- The idea is to find the line (actually a multi-dimensional object, like a plane or beyond) that gets "closest" to all of the points in the sample
- How to define closeness to multiple points?
- In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of $\hat{y}$ for that point's set of x values: in other words, the squared distance between an observed y-value and the estimated mean y-value for all points with the same values of each x
MLR
- Each distance is computed for each data point in the sample
- The algorithm to estimate the equation of the line is called "least squares" estimation
- The values chosen for $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ are the values that minimize the cumulative squared distances: i.e., they minimize $\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$
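A minimal sketch of this minimization in Python, on hypothetical simulated data (not from the lecture); numpy's lstsq finds the coefficients that minimize the cumulative squared distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
X = rng.normal(size=(n, 2))   # two hypothetical predictors x1, x2
y = 10 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(size=n)

# Design matrix with an intercept column; lstsq returns the beta-hats
# that minimize sum_i (y_i - yhat_i)^2
D = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)

residuals = y - D @ beta_hat
print("estimated coefficients:", beta_hat)
print("minimized sum of squared distances:", np.sum(residuals**2))
```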
Example: Arm Circumference and Height
- The values chosen for $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ are just estimates based on a single sample
- If you were to have a different random sample, the resulting estimates would likely be different: i.e., the values that minimized the cumulative squared distance from this second sample of points would likely be different
- As such, all regression coefficients have an associated standard error that can be used to make statements about the true relationship between the mean of y and $x_1, x_2, \dots, x_p$ (the true slopes) based on a single sample
Example: Arm Circumference and Height
- Random sampling behavior of estimated regression coefficients is normal for large samples (n > 60), and centered at the true values
- As such, we can use the same ideas to create 95% CIs and get p-values
Arm Circumference MLR
- How were the 95% CIs for the slopes of height and weight estimated?
Arm Circumference MLR
- Notice each slope has an estimated standard error
Arm Circumference MLR
- Just like in SLR, the sampling distributions of estimated MLR slopes are normal, and centered at the true population values (when n is large, i.e., n > 60)
- So the approach to constructing a 95% CI for $\beta_j$ is given by the same old approach: $\hat{\beta}_j \pm 2 \cdot \widehat{SE}(\hat{\beta}_j)$
- So, for example, the slope of height from the previous regression results:
  - 95% CI for $\beta_{height}$: $\hat{\beta}_{height} \pm 2 \cdot \widehat{SE}(\hat{\beta}_{height}) \approx (-0.21, -0.11)$
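A quick sketch of this recipe in Python; since the regression output itself is not reproduced here, the point estimate and standard error below are hypothetical stand-ins, back-solved from the reported interval for illustration only:

```python
# Hypothetical height slope and SE (not the actual output values),
# chosen so that estimate +/- 2*SE reproduces the reported interval
beta_height = -0.16
se_height = 0.025

ci = (beta_height - 2 * se_height, beta_height + 2 * se_height)
print(ci)   # approximately (-0.21, -0.11)
```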
Arm Circumference MLR
- How to get a p-value?
  - Ho: $\beta_p = 0$ (no relationship between y and $x_p$ after accounting for the other xs)
  - Ha: $\beta_p \neq 0$ ($x_p$ is associated with y after accounting for the other xs)
- Same "recipe" as before
Arm Circumference MLR
- How to get a p-value for the slope of height in MLR?
  - Ho: $\beta_{height} = 0$ (no relationship between arm circumference and height after accounting for weight)
  - Ha: $\beta_{height} \neq 0$
- Same "recipe" as before
  - Assume Ho is true
  - Compute the "distance" of the sample result from 0 in units of standard error
  - Compare this distance to the sampling distribution to get a p-value
Arm Circumference MLR
- To get the p-value:
  - We have a result that is 6.6 standard errors below 0; the sampling distribution is normal and centered at the assumed null truth of 0
  - The resulting probability of getting a sample estimate 6.6 or more standard errors away from 0 if Ho is true is really small: p < .0001
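A short Python sketch of this computation, using the 6.6 standard errors from the slide and the standard normal as the null sampling distribution:

```python
from scipy.stats import norm

z = -6.6                          # (estimate - 0) / SE, from the slide
p_value = 2 * norm.cdf(-abs(z))   # two-sided tail probability under Ho
print(p_value)                    # on the order of 1e-11, i.e., p < .0001
```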
Hemoglobin MLR
- How were the 95% CIs for the slopes of PCV and age estimated?
Hemoglobin MLR
- When n is small (i.e., n < 60), just like in SLR, the sampling distributions of MLR slopes are t-distributions, but with n - (1 + number of xs) degrees of freedom
- So the approach to constructing a 95% CI for $\beta_j$ is given by the same old approach: $\hat{\beta}_j \pm t \cdot \widehat{SE}(\hat{\beta}_j)$, where t is the appropriate cutoff from that t-distribution
- So, for example, the slope of PCV from the previous regression results:
  - 95% CI for $\beta_{PCV}$: $\hat{\beta}_{PCV} \pm t \cdot \widehat{SE}(\hat{\beta}_{PCV}) \approx (0.037, 0.163)$
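A Python sketch of this recipe; the 18 degrees of freedom come from the slide that follows (e.g., n = 21 subjects and 2 xs gives 21 - 3 = 18), while the PCV estimate and SE are hypothetical stand-ins chosen to reproduce the reported interval:

```python
from scipy.stats import t

df = 18                         # n - (1 + number of xs)
t_cut = t.ppf(0.975, df)        # about 2.10, vs. 1.96 for the normal
beta_pcv, se_pcv = 0.10, 0.03   # hypothetical values for the PCV slope
print((beta_pcv - t_cut * se_pcv, beta_pcv + t_cut * se_pcv))
# approximately (0.037, 0.163)
```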
Hemoglobin MLR
- How to get a p-value?
  - Ho: $\beta_p = 0$ (no relationship between y and $x_p$ after accounting for the other xs)
  - Ha: $\beta_p \neq 0$ ($x_p$ is associated with y after accounting for the other xs)
- Same "recipe" as before
Hemoglobin MLR
- How to get a p-value for the slope of PCV in the MLR of hemoglobin on PCV and age?
  - Ho: $\beta_{PCV} = 0$ (no relationship between hemoglobin and PCV after accounting for age)
  - Ha: $\beta_{PCV} \neq 0$
- Same "recipe" as before
  - Assume Ho is true
  - Compute the "distance" of the sample result from 0 in units of standard error
  - Compare this distance to the sampling distribution to get a p-value
Hemoglobin MLR
- To get the p-value:
  - We have a result that is 3.3 standard errors below 0; the sampling distribution is a t-distribution with 18 degrees of freedom, centered at the assumed null truth of 0
  - The resulting probability of getting a sample estimate 3.3 or more standard errors away from 0 if Ho is true is the p-value: p = 0.004
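A short Python sketch of this computation, using the 3.3 standard errors and 18 degrees of freedom from the slide:

```python
from scipy.stats import t

t_stat = -3.3    # (estimate - 0) / SE, from the slide
df = 18
p_value = 2 * t.cdf(-abs(t_stat), df)   # two-sided tail probability
print(round(p_value, 3))                # 0.004, matching the slide
```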
The Overall F-Test
- In both small and large samples, the p-value for each slope in an MLR is based on testing for a relationship between y and a specific x, in a model that includes multiple xs
- In some sense, it may be nice to know whether any of the xs are associated with y before assessing which ones are by looking at the inferences (CIs, p-values) on the individual slopes
- The overall F-test provides an answer to this question
The Overall F-Test
- Generic formulation: null and alternative
  - Ho: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
  - Ha: at least one slope ($\beta_j$) not equal to 0
- The test gives only a p-value (no 95% CI) for choosing between the null and alternative hypotheses
  - If the null is rejected, the individual CIs/p-values for each $\beta_j$ can be used to find out which slopes are statistically significant
The Overall F-Test
- Null and alternative
  - Ho: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
  - Ha: at least one slope ($\beta_j$) not equal to 0
- The regression output reports the p-value for this test
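One common way the overall F statistic is computed, sketched in Python under the assumption that the model and error sums of squares are available from the fit; the inputs below are hypothetical:

```python
from scipy.stats import f

def overall_f_pvalue(ss_model, ss_error, n, p):
    """Overall F-test for an MLR with p xs and n observations.

    Under Ho (all slopes 0), MSR/MSE follows an F-distribution
    with p and n - (p + 1) degrees of freedom.
    """
    msr = ss_model / p            # mean square for the model
    mse = ss_error / (n - p - 1)  # mean square error
    f_stat = msr / mse
    return f_stat, f.sf(f_stat, p, n - p - 1)  # upper-tail probability

# Hypothetical sums of squares, purely for illustration
print(overall_f_pvalue(ss_model=80.0, ss_error=20.0, n=100, p=3))
```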
Measuring Variability Explained by MLR
- (SR1 flashback) The sample standard deviation of the y-values, ignoring the corresponding potential information in the xs, is
  - $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$
  - This measures how far, on average, each of the sample y-values falls from the overall mean of all y-values
  - In the arm circumference example, s = 1.48 cm
Measuring Variability Explained by MLR
- "Visualization" on the scatterplot
Measuring Variability Explained by MLR
- The standard deviation of the regression, referred to as the root mean squared error ($s_{y|x}$), is the "average" distance of the points from the line: how far, on average, each y falls from its mean as predicted by its corresponding x-values
Measuring Variability Explained by MLR
- The regress command in Stata gives $s_{y|x}$ (named "Root MSE" on the output)
Measuring Variability Explained by MLR
- If $s = s_{y|x}$, then knowing x does not yield a better guess for the mean of y than using the overall mean $\bar{y}$ (flat regression line)
- The smaller $s_{y|x}$ is relative to s, the closer the points are to the regression line
- $R^2$ functionally measures how much smaller $s_{y|x}$ is than s: as such, it is an estimate of the amount of variability in y explained by taking all the xs into account
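A Python sketch contrasting the three quantities on hypothetical simulated data: s ignores the xs, $s_{y|x}$ uses them, and $R^2$ compares the variability around the line to the variability around the overall mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 5 + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Fit the MLR by least squares, as in the earlier sketch
D = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
resid = y - D @ beta_hat

s = np.std(y, ddof=1)                           # SD of y, ignoring the xs
s_yx = np.sqrt(np.sum(resid**2) / (n - p - 1))  # root MSE: "average" distance from the line
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(s, s_yx, r2)
```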
Measuring Variability Explained by MLR
- The regress command in Stata gives $R^2$: child height, sex, and weight together explain (an estimated) 78% of the variation in arm circumferences
Example: Arm Circumference and Height
- One mathematical quirk about $R^2$ in MLR is that adding more xs will always increase $R^2$, even if an x is not informative about y
- There is a quantity called "adjusted $R^2$" that penalizes the original $R^2$ for this property
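The slide does not show the adjustment itself; the standard formula is $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, sketched here in Python with hypothetical inputs:

```python
def adjusted_r2(r2, n, p):
    """Penalize r2 from a fit with n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With the same R^2, more xs means a bigger penalty:
print(adjusted_r2(0.78, n=150, p=3))    # modest penalty
print(adjusted_r2(0.78, n=150, p=30))   # larger penalty
```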