Conditioning on Continuous Variables

STAT 360: Regression Analysis
Handout #3: Conditioning on a Continuous Variable
Example 3.1: Consider the Impact Crater dataset on our course website.
In the previous handout, we considered the following conditional distributions.
• Diameter | SandType
• Diameter | ProjectileType
In this handout, we will consider the conditional distribution of Diameter | Height. Recall, the height is
the distance from which the projectile was dropped. It is expected that as the height of the fall
increases, the diameter increases. A better understanding of this relationship is the goal of the
following investigation.
If the conditioning variable is numerical, the conditional distribution of the response given this variable
is best displayed via a scatterplot.
Questions
1. What is the relationship between the height from which the projectile fell and the diameter of the
crater it created? Discuss.
2. Does the mean diameter appear to change as a function of height? Explain.
3. Does the variance in the diameter appear to change as a function of height? Discuss.
Descriptive summaries for each height can easily be obtained in JMP by selecting Tables > Summary.
The specification of which descriptive summaries to report is done in the Statistics box. The
conditioning variable(s) are specified in the Group box.
The summaries computed by JMP are provided here.
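For readers working outside JMP, the same kind of grouped summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `observations` list is hypothetical data, not the Impact Crater dataset, and the names `groups` and `summary` are chosen here for convenience.

```python
from collections import defaultdict
from statistics import mean, stdev

# HYPOTHETICAL (height, diameter) pairs; NOT the course's Impact Crater data.
observations = [
    (25, 10.1), (25, 11.3), (25, 9.8),
    (50, 13.0), (50, 14.2), (50, 12.5),
    (75, 15.9), (75, 14.4), (75, 17.1),
]

# Group the response values by the conditioning variable (the "Group" box).
groups = defaultdict(list)
for height, diameter in observations:
    groups[height].append(diameter)

# One row of summaries per height (the "Statistics" box): n, mean, std. dev.
summary = {
    h: {"n": len(d), "mean": mean(d), "sd": stdev(d)}
    for h, d in sorted(groups.items())
}

for h, row in summary.items():
    print(f"Height={h:>3}  n={row['n']}  mean={row['mean']:.2f}  sd={row['sd']:.2f}")
```

Each row describes one conditional distribution of Diameter | Height, so reading down the mean column shows how the center changes with height, and reading down the sd column shows how the spread changes.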
Questions
4. Using the numerical summaries for the mean, discuss the effect of height on the diameter of the crater.
5. Does the effect of height appear to be constant? Is the amount by which the mean increases the
same for all heights? Discuss.
6. Using the numerical summaries for the mean, discuss the effect of height on the diameter of the crater.
The following scatterplot displays the mean for each height.
Conditioning should always be done in light of what makes sense for the data; however, there are no
rules or guidelines to follow. For example, in the following graphic, conditioning was done by dividing
the heights into two subgroups, heights <=100 and heights > 100.
Estimating the Mean Function
Next, we will consider how best to obtain an estimate of the mean function for the conditional
distribution of Diameter | Height. As we have already seen, one very simple approach is to obtain the
mean for the various values of height.
The approach of computing the mean for each height value is simple, but it has some shortcomings. For
example, when replicates do not exist for some or all of the heights, the estimation of the mean function is
confounded with the variance function. That is, the mean computed for each height in the following
plot is distorted by the natural variation in Diameter.
A second concern with oversimplifying the estimate of the mean function for the conditional distribution
is related to prediction. This particular experiment chose to use Height values of 25, 50, 75, 100,
125, 150, 175, and 200. Certainly, we cannot expect an experiment to include all the possible height
values of interest. For example, in many situations, using the information from Height = 50 and Height =
75 is sufficient for estimating a quantity such as $\hat{E}(\text{Diameter} \mid \text{Height} = 60)$.
Using the conditional distribution of Diameter | Height = 50 and Diameter | Height = 75 to understand
the conditional distribution of Diameter | Height = 60.
When several replicates exist for each height, it may be practical to connect the means for each height to
obtain an estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.
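Connecting the means amounts to linear interpolation between neighboring heights. The sketch below shows one way to do this for a height such as 60 that was not part of the experiment. The function name `interpolate_mean` and the values in `means` are hypothetical and are not taken from the course data.

```python
def interpolate_mean(height, means):
    """Piecewise-linear estimate of the mean at `height` from per-height means."""
    xs = sorted(means)
    if not xs[0] <= height <= xs[-1]:
        raise ValueError("height lies outside the range of observed heights")
    for lo, hi in zip(xs, xs[1:]):
        if lo <= height <= hi:
            # Fraction of the way from the lower height to the upper height.
            frac = (height - lo) / (hi - lo)
            return means[lo] + frac * (means[hi] - means[lo])
    return means[xs[0]]  # only reached when a single height was supplied


# HYPOTHETICAL per-height mean diameters; not the course data.
means = {25: 10.0, 50: 13.0, 75: 15.5, 100: 17.0}
estimate_at_60 = interpolate_mean(60, means)
print(estimate_at_60)
```

Only the two neighboring means (here Height = 50 and Height = 75) contribute to the estimate at Height = 60, and the sketch refuses to extrapolate beyond the observed range of heights.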
Question
7. If the mean function is obtained by simply connecting the means, is the result a continuous
function?
As is typical with many things, the "keep-it-simple" approach may not be the best approach. Locally
weighted scatterplot smoothing (LOESS or LOWESS) is an approach that sits on the other end of the
"keep-it-simple" spectrum.
Wiki Entry for Lowess Smoothing
http://en.wikipedia.org/wiki/Local_regression
Step #1: The first step in doing locally weighted scatterplot smoothing (i.e. LOESS or LOWESS) is to
obtain a subset of the data. Typically, the neighborhood of points is used in obtaining this subset.
Step #2: The second step in doing LOWESS is to obtain a mean function using only the data within the
window which includes the subset of the data obtained from Step #1. There are a wide variety of
different mean functions that can be used. As Wiki states, a low order polynomial is often used to allow
for some flexibility in fitting the data.
A linear mean function
A quadratic mean function
Step #3: When fitting the mean function on the data within this window, you may want to allow some
neighbors to have more weight than other neighbors. A data point with more weight will have more
impact on the estimation of the mean function. As with kernel density estimation, a normal weighting
scheme is a common choice.
Using normal weights
Using uniform weights
Step #4: Repeat this process for all height values. In other words, slide your window across all values of
x. This is shown visually here.
---> Sliding the window across the values of height ----->
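Steps #1 through #4 can be sketched in plain Python as follows. This is a simplified illustration, not the exact algorithm used by JMP or by any particular package: it fits a local line with normal weights, and the choices to set the window by a fraction `frac` of the data and to place the window edge two standard deviations from the center are assumptions made here. The data are hypothetical.

```python
import math


def local_linear_fit(x0, xs, ys, frac=0.5):
    """Estimate the mean at x0 from a weighted line fit to the nearest points."""
    n = len(xs)
    # Step 1: the subset is the k points closest to x0.
    k = min(n, max(2, math.ceil(frac * n)))
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    half_width = max(abs(xs[i] - x0) for i in nearest) or 1.0
    sd = half_width / 2.0  # ASSUMPTION: window edge sits 2 sd from x0

    # Step 3: normal weights, so closer neighbors count more.
    w = [math.exp(-0.5 * ((xs[i] - x0) / sd) ** 2) for i in nearest]
    px = [xs[i] for i in nearest]
    py = [ys[i] for i in nearest]

    # Step 2: weighted least squares for a linear mean function.
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, px)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, py)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, px))
    if sxx == 0:  # every neighbor shares one x value, so use the weighted mean
        return ybar
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, px, py))
    slope = sxy / sxx
    return ybar + slope * (x0 - xbar)


def lowess(xs, ys, grid, frac=0.5):
    """Step 4: slide the window across every value in `grid`."""
    return [local_linear_fit(x0, xs, ys, frac) for x0 in grid]


# HYPOTHETICAL data; not the Impact Crater measurements.
heights = [25, 50, 75, 100, 125, 150, 175, 200]
diameters = [10.2, 13.1, 15.4, 16.8, 18.3, 19.1, 20.4, 20.9]
smooth = lowess(heights, diameters, grid=heights, frac=0.5)
print([round(v, 2) for v in smooth])
```

Raising `frac` widens the window, so each local fit uses more of the data. Replacing the normal weights with a constant gives the uniform-weight version of Step #3.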
The following is a LOWESS estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.
A LOWESS estimate using a wider window (i.e. a larger subset of the data) will result in more smoothing.
Akin to kernel density estimation, it is necessary to find the appropriate balance between too much
smoothing and too little smoothing.
---> Sliding the window across the values of height ----->
The following graph compares LOWESS smoothers obtained from two different window sizes. As we
can see, increasing the window size has little effect on the estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.
Comment: A traditional modeling approach assumes the "subset" window spans all values of x and
that all points are equally weighted.
Traditional Approach to Modeling
"Subset" is all the data
All points are equally weighted
The following graph displays the best estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$ using a traditional approach
with a linear and a quadratic form for the mean function.
Questions:
8. As mentioned above in the estimation of the mean function using the LOWESS method, a subset
that includes fewer data points results in a LOWESS smoother that does less smoothing. Using
non-technical language, explain why a smaller subset of data would result in an estimated mean
function that has more bumps.
9. Consider the graph above in which a traditional linear and quadratic mean function was used to
estimate $\hat{E}(\text{Diameter} \mid \text{Height})$. The correct answer to Question 3 on page 5 supports the
notion that the quadratic mean function is more appropriate than the linear mean function.
Explain why this is the case.
10. Consider the following snippet from Excel's Help menu for adding linear trendlines to a
scatterplot.
a. As stated above, fitting a linear mean function implies a constant rate of change for all x
values. Does Excel communicate this notion to the reader? Explain.
b. Does Excel communicate that all points are equally weighted in the fitting of this mean
function?
c. We have not yet discussed the formulas for obtaining the trendline, but consider the
fact that the y-intercept of the trendline can be computed as follows.
$$\text{y-intercept} = \text{Mean of } Y - \text{Slope} \times \text{Mean of } X$$
Next, consider the fact that Excel is using the date as the x-variable in the above
graphic. How do you think Excel deals with a formula such as "Mean of X" when X is a
date variable? [By the way, I don't know the answer to this question.]
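The y-intercept formula in part (c) can be checked numerically with a small sketch of an equal-weight least-squares line. The slope formula used here is the standard least-squares one, which the handout has not yet introduced. The function name `fit_line` and the numeric x and y values are hypothetical, and the sketch says nothing about how Excel handles dates.

```python
from statistics import mean


def fit_line(xs, ys):
    """Equal-weight least-squares line; returns (slope, intercept)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # y-intercept = Mean of Y - Slope * Mean of X
    intercept = ybar - slope * xbar
    return slope, intercept


# HYPOTHETICAL numeric data; not the Excel example and not the course data.
xs = [25, 50, 75, 100]
ys = [10.0, 13.0, 15.5, 17.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because every point enters the sums with the same weight, this corresponds to the traditional approach in which the "subset" is all the data and all points are equally weighted.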
Understanding the Unexplained Variation
Consider once again the formula for the total unexplained variation in the marginal distribution of
Diameter. In words, the total unexplained variation is the sum of all the squared distances from each
data point to the mean.
$$\text{Total Unexplained Variation} = \sum \text{Residuals}^2 = \sum (\text{Data Point} - \text{Mean})^2$$
A visual depiction of the total unexplained variation in the marginal distribution is shown here.
For simplicity, consider the total unexplained variation when the estimated mean functions are
computed using simple averages for each height.
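The comparison between the marginal and conditional versions of the total unexplained variation can be sketched as follows. The data are hypothetical, chosen so the arithmetic is easy to follow by hand, and the helper name `sum_sq_about_mean` is introduced here only for illustration.

```python
from collections import defaultdict
from statistics import mean


def sum_sq_about_mean(values):
    """Sum of squared distances from each data point to the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# HYPOTHETICAL (height, diameter) pairs; NOT the Impact Crater data.
observations = [
    (25, 10.0), (25, 12.0),
    (50, 13.0), (50, 15.0),
    (75, 15.0), (75, 19.0),
]

# Marginal distribution: one overall mean for every diameter.
marginal = sum_sq_about_mean([d for _, d in observations])

# Conditional distributions: a separate mean for each height.
groups = defaultdict(list)
for h, d in observations:
    groups[h].append(d)
per_height = {h: sum_sq_about_mean(d) for h, d in sorted(groups.items())}
conditional = sum(per_height.values())

print("marginal:", marginal)
print("per height:", per_height)
print("conditional total:", conditional)
```

The conditional total can never exceed the marginal total, because each group's own mean minimizes that group's sum of squares. The size of the gap between the two totals shows how much of the variation the conditioning variable accounts for.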
The total unexplained variation for each height is given in the following table.
Distribution                          Total Unexplained Variation
Marginal (Diameter)                             425.35
Diameter | Height = 25                           22.46
Diameter | Height = 50                           24.58
Diameter | Height = 75                           28.72
Diameter | Height = 100                          41.44
Diameter | Height = 125                          40.05
Diameter | Height = 150                          45.84
Diameter | Height = 175                          48.52
Diameter | Height = 200                          53.47
Total (conditional distributions)               305.08
Questions
11. Did the conditional distribution take away a sufficient amount of the unexplained variation in
Diameter? Discuss.
12. What is the R^2 value for this analysis?
13. A small R^2 value supports the notion that not much of the total unexplained variation is
accounted for in the conditional distribution of Diameter | Height. Give a rough sketch of the
pattern that must be present in the conditional distribution for each of the following situations.
Situation #1: Small R^2
Situation #2: Very large R^2
Estimating the Variance Function
The estimation of the variance function is often considered to be of less importance than the estimation
of the mean function; however, this is not the case for all analyses. Consider the following example in
which an understanding of the variance function has implications about the possible diversity of bird
species in northern Montana.
Example 3.2: Consider an article by Hutto et al. (1995) titled "A Comparison of Bird Detection Rates
Derived from On-Road versus Off-Road Counts in Northern Montana." This article contained data on the
off-road and on-road counts for a variety of bird species. In this work, the authors were interested in
comparing the bird detection rates (i.e. counts) for on-road and off-road habitats. Bird detection rates
were obtained for 0-50 meters and for 50-100 meters from the observer.
Source: United States Department of Agriculture, General Technical Report PSW-GTR-149, 1995.
On-Road Bird Counting
Off-Road Bird Counting
A plot comparing the Off-Road bird detection rate (i.e. count) to the On-Road counts for 0-50 meters.
Answer the following questions using the plot below.
Questions
14. What is the purpose of adding the y=x line to this plot? What would it mean if the data points
fall close to the y=x line?
15. Why would we expect the variance function to be larger for higher counts? Discuss.
16. How might an investigation of the variance function be used to understand the diversity level of
these bird species between the on-road and off-road habitats? Explain.
Let's return to the Impact Crater example in which we were investigating the relationship between
the diameter of the crater and the height from which the projectile fell.
Conditional Distribution of
Diameter | Height
Plot of Residuals for each Height
Consider the following table which contains the total amount of unexplained variation for each height
value.
Comparing the amount of Total Unexplained Variation across the heights
Height    Total Unexplained Variation
  25              22.46
  50              24.58
  75              28.72
 100              41.44
 125              40.05
 150              45.84
 175              48.52
 200              53.47
Question
17. Does the variation in diameter change as a function of height? Explain.
A scatterplot of the residuals vs. height will allow us to investigate possible trends in the variance
function.
As we have discussed previously, the average residual is zero whenever the commonly used L2 loss
function is used in fitting the mean function, because the negative residuals cancel out the
positive residuals. There are two options to overcome this problem: square the residuals or use the
absolute residuals. In the following graph, the squared residuals, i.e. Residual^2, are displayed as
a function of height.
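A short sketch can illustrate why the raw residuals are not useful for studying spread while the squared or absolute residuals are. The diameters below are hypothetical replicates at a single height, and the mean function is taken to be the simple average at that height.

```python
from statistics import mean

# HYPOTHETICAL replicate diameters at one height; not the course data.
diameters = [14.0, 15.5, 13.2, 16.1, 15.2]

center = mean(diameters)                     # fitted mean at this height
residuals = [d - center for d in diameters]

avg_residual = mean(residuals)                   # negatives cancel positives
avg_squared = mean(r ** 2 for r in residuals)    # option 1: square the residuals
avg_absolute = mean(abs(r) for r in residuals)   # option 2: absolute residuals

print(avg_residual, avg_squared, avg_absolute)
```

The average residual is zero up to floating-point rounding regardless of how spread out the data are, whereas the other two summaries grow as the spread grows.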
Once again, a very simple approach to understanding the variance function is to obtain an average
Residual^2 value for each height. These values are shown in the table below and are also shown here
graphically.
Comparing Residual^2 across the heights
Height    Total Residual^2    Average Residual^2    SQRT(Average Residual^2)
  25           22.46                1.32                     1.15
  50           24.58                1.45                     1.20
  75           28.72                1.69                     1.30
 100           41.44                2.44                     1.56
 125           40.05                2.36                     1.53
 150           45.84                2.70                     1.64
 175           48.52                2.85                     1.69
 200           53.47                3.15                     1.77
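The columns of the table above are linked by simple arithmetic, which the following sketch reproduces. The totals are copied from the table. The divisor of 17 observations per height is an assumption inferred from the table (each total divided by 17 matches the reported average), since the handout does not state the number of replicates.

```python
import math

heights = [25, 50, 75, 100, 125, 150, 175, 200]
total_sq = [22.46, 24.58, 28.72, 41.44, 40.05, 45.84, 48.52, 53.47]

# ASSUMPTION: 17 observations per height, inferred from the table's averages;
# the handout itself does not state the number of replicates.
n_per_height = 17

avg_sq = [t / n_per_height for t in total_sq]     # Average Residual^2
root_avg_sq = [math.sqrt(a) for a in avg_sq]      # SQRT(Average Residual^2)

for h, t, a, r in zip(heights, total_sq, avg_sq, root_avg_sq):
    print(f"{h:>6}  {t:>6.2f}  {a:>5.2f}  {r:>5.2f}")
```

The last column is on the original measurement scale, which makes it easier to compare against the diameters themselves.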
Questions
18. Is there any pattern in the average Residual^2 values as height increases?
19. Is the increase in the average Residual^2 fairly consistent? That is, would a linear function
appear to model the variance adequately for this data? Discuss.
Similar to our approach to modeling the mean function, a LOWESS smoother may be a better way to estimate the
variance function. This will provide a smooth estimate of the variance function, which will allow us to
understand how |Residual| changes as a function of height.
The same plot as above with the data points being plotted.
Putting the pieces together -- Mean & Variance Functions
Some of the more sophisticated modeling software packages (e.g. Arc or R) allow you to display the
mean and variance functions on the same scatterplot. In the plot below, we can see the curvature in
the mean function and the slight increase in the variance as height increases.
LOWESS Estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$
LOWESS Estimate of $\widehat{\text{Var}}(\text{Diameter} \mid \text{Height})$
The mean and SQRT(variance) functions on a single plot. The square root of the variance function is
necessary to match the scale, i.e. mm, not mm^2, when making this plot.