STAT 360: Regression Analysis
Handout #3: Conditioning on a Continuous Variable

Example 3.1: Consider the Impact Crater dataset on our course website. In the previous handout, we considered the following conditional distributions.

• Diameter | SandType
• Diameter | ProjectileType

In this handout, we will consider the conditional distribution of Diameter | Height. Recall that the height is the distance from which the projectile was dropped. It is expected that as the height of the fall increases, the diameter increases. A better understanding of this relationship is the goal of the following investigation.

If the conditioning variable is numerical, the conditional distribution of the response given this variable is best displayed via a scatterplot.

Questions

1. What is the relationship between the height from which the projectile fell and the diameter of the crater it created? Discuss.
2. Does the mean diameter appear to change as a function of height? Explain.
3. Does the variance in the diameter appear to change as a function of height? Discuss.

Descriptive summaries for each height can easily be obtained in JMP by selecting Tables > Summary. The descriptive summaries to report are specified in the Statistics box. The conditioning variable(s) are specified in the Group box. The summaries computed by JMP are provided here.

Questions

4. Using the numerical summaries for the mean, discuss the effect of height on diameter of crater.
5. Does the effect of height appear to be constant? Is the amount by which the mean increases the same for all heights? Discuss.
6. Using the numerical summaries for the mean, discuss the effect of height on diameter of crater.

The following scatterplot displays the mean for each height.

Conditioning should always be done in light of what makes sense for the data; however, there are no rules or guidelines to follow.
For example, in the following graphic, conditioning was done by dividing the heights into two subgroups, heights <= 100 and heights > 100.

Estimating the Mean Function

Next, we will consider how best to obtain an estimate of the mean function for the conditional distribution of Diameter | Height. As we have already seen, one very simple approach is to obtain the mean for the various values of height.

The approach of computing the mean for each height value is simple, but it has some shortcomings. For example, when replicates do not exist for some or all of the heights, the estimation of the mean function is confounded with the variance function. That is, computing the mean for each height in the following plot is distorted by the natural variation in Diameter.

A second concern in oversimplifying the estimate of the mean function for the conditional distribution is related to prediction. This particular experiment chose to use Height values of 25, 50, 75, 100, 125, 150, 175, and 200. Certainly, we cannot expect an experiment to include all the possible height values of interest. For example, in many situations, using the information from Height = 50 and Height = 75 is sufficient for estimating a quantity such as $\hat{E}(\text{Diameter} \mid \text{Height} = 60)$. That is, we use the conditional distributions of Diameter | Height = 50 and Diameter | Height = 75 to understand the conditional distribution of Diameter | Height = 60.

When several replicates exist for each height, it may be practical to connect the means for each height to obtain an estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.

Question

7. If the mean function is obtained by simply connecting the means, is the result a continuous function?

As is typical with many things, the “keep-it-simple” approach may not be the best approach. Locally weighted scatterplot smoothing (LOESS or LOWESS) is an approach that sits on the other end of the “keep-it-simple” spectrum.
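The connect-the-means idea described above, including the estimate at Height = 60, can be sketched in a few lines. This is only an illustration: the mean diameters below are hypothetical placeholder values, not the summaries computed from the Impact Crater data, and the function name connect_the_means is made up for this sketch.

```python
import numpy as np

# Height values used in the experiment (from the handout).
heights = np.array([25, 50, 75, 100, 125, 150, 175, 200], dtype=float)

# HYPOTHETICAL mean diameter at each height. Placeholder values only;
# replace them with the means from the JMP summary table.
mean_diameter = np.array([10.0, 13.0, 15.0, 16.5, 17.5, 18.3, 19.0, 19.5])


def connect_the_means(h):
    """Estimate E(Diameter | Height = h) by joining adjacent means with straight lines."""
    return np.interp(h, heights, mean_diameter)


# Height = 60 was not part of the experiment, so the estimate borrows
# information from the means at Height = 50 and Height = 75.
estimate_60 = connect_the_means(60.0)
print(f"Estimated mean diameter at Height = 60: {estimate_60:.2f}")
```

Note that np.interp holds the end values constant outside the observed range of heights, so this sketch should not be used to extrapolate below 25 or above 200.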
Wiki Entry for Lowess Smoothing
http://en.wikipedia.org/wiki/Local_regression

Step #1: The first step in doing locally weighted scatterplot smoothing (i.e. LOESS or LOWESS) is to obtain a subset of the data. Typically, this subset consists of the points in a neighborhood of the x value of interest.

Step #2: The second step in doing LOWESS is to obtain a mean function using only the data within the window, i.e. the subset of the data obtained in Step #1. There is a wide variety of mean functions that can be used. As Wiki states, a low-order polynomial is often used to allow for some flexibility in fitting the data.

[Figure panels: A linear mean function; A quadratic mean function]

Step #3: When fitting the mean function to the data within this window, you may want to allow some neighbors to have more weight than other neighbors. A data point with more weight will have more impact on the estimation of the mean function. As with kernel density estimation, a normal weighting scheme is most common.

[Figure panels: Using normal weights; Using uniform weights]

Step #4: Repeat this process for all height values. In other words, slide your window across all values of x. This is shown visually here.

---> Sliding the window across the values of height --->

The following is a LOWESS estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.

A LOWESS estimate using a wider window (i.e. a larger subset of the data) will result in more smoothing. Akin to kernel density estimation, it is necessary to find the appropriate balance between too much smoothing and too little smoothing.

---> Sliding the window across the values of height --->

The following graph compares LOWESS smoothers obtained from two different window sizes. As we can see, increasing the window size has little effect on the estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$.

Comment: A traditional modeling approach assumes the “subset” window spans all values of x and that all points are equally weighted.
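Steps #1 through #4 can be sketched directly in code. The following is a minimal illustration under stated assumptions, not the algorithm used by JMP or any other package: the function name lowess_fit, the frac argument (fraction of the data in each window), and the choice of setting the normal weight's standard deviation to half the window radius are all made up for this sketch, and the data are simulated rather than taken from the Impact Crater dataset.

```python
import numpy as np


def lowess_fit(x, y, frac=0.5, degree=1, weighting="normal"):
    """Return a locally weighted polynomial estimate of the mean of y at each x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Window size: number of neighbors used in each local fit.
    k = min(n, max(degree + 1, int(np.ceil(frac * n))))
    fitted = np.empty(n)
    for i, x0 in enumerate(x):  # Step #4: slide the window across all x values
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]  # Step #1: subset = k nearest neighbors of x0
        xs, ys, ds = x[idx], y[idx], dist[idx]
        if weighting == "normal":  # Step #3: closer neighbors get more weight
            # ASSUMPTION: standard deviation of the normal weights is half the window radius.
            scale = ds.max() / 2.0 if ds.max() > 0 else 1.0
            w = np.exp(-0.5 * (ds / scale) ** 2)
        else:  # uniform weights: every neighbor counts equally
            w = np.ones(k)
        # Step #2: weighted least-squares polynomial, centered at x0 so that
        # the intercept is the fitted mean at x0.
        design = np.vander(xs - x0, degree + 1, increasing=True)
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], ys * sw, rcond=None)
        fitted[i] = coef[0]
    return fitted


# SIMULATED data for illustration only (not the Impact Crater data).
rng = np.random.default_rng(0)
height = np.repeat(np.arange(25.0, 201.0, 25.0), 5)
diameter = 5.0 + 1.2 * np.sqrt(height) + rng.normal(scale=1.0, size=height.size)

narrow = lowess_fit(height, diameter, frac=0.3)  # smaller window, less smoothing
wide = lowess_fit(height, diameter, frac=0.8)    # larger window, more smoothing
# Window spans all the data with equal weights: the traditional linear fit.
traditional = lowess_fit(height, diameter, frac=1.0, weighting="uniform")
```

Setting frac=1.0 with uniform weights reproduces the ordinary least-squares line, which is the sense in which the traditional approach is a special case of this procedure.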
Traditional Approach to Modeling
• “Subset” is all the data
• All points are equally weighted

The following graph displays the best estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$ using a traditional approach with a linear and a quadratic form for the mean function.

Questions:

8. As mentioned above in the estimation of the mean function using the LOWESS method, a subset that includes fewer data points results in a LOWESS smoother that does less smoothing. Using non-technical language, explain why a smaller subset of data would result in an estimated mean function that has more bumps.

9. Consider the graph above in which a traditional linear and quadratic mean function was used to estimate $\hat{E}(\text{Diameter} \mid \text{Height})$. The correct answer to Question 3 on page 5 supports the notion that the quadratic mean function is more appropriate than the linear mean function. Explain why this is the case.

10. Consider the following snippet from Excel’s Help menu for adding linear trendlines to a scatterplot.

a. As stated above, fitting a linear mean function implies a constant rate of change for all x values. Does Excel communicate this notion to the reader? Explain.

b. Does Excel communicate that all points are equally weighted in the fitting of this mean function?

c. We have not yet discussed the formulas for obtaining the trendline, but consider the fact that the y-intercept of the trendline can be computed as follows.

$$\text{y-intercept} = \text{Mean of } Y - \text{Slope} \times \text{Mean of } X$$

Next, consider the fact that Excel is using the date as the x-variable on the above graphic. How do you think Excel deals with a formula such as “Mean of X” when X is a date variable? [By the way, I don’t know the answer to this question.]

Understanding the Unexplained Variation

Consider once again the formula for the total unexplained variation in the marginal distribution of Diameter. In words, the total unexplained variation is the sum of all the squared distances from each data point to the mean.
$$\text{Total Unexplained Variation} = \sum \text{Residual}^2 = \sum (\text{Data Point} - \text{Mean})^2$$

A visual depiction of the total unexplained variation in the marginal distribution is shown here.

For simplicity, consider the total unexplained variation when the estimated mean functions are computed using simple averages for each height.

The total unexplained variation for each height is given in the following table.

Distribution                        Total Unexplained Variation
Marginal Distribution               425.35
Conditional Distributions of Diameter | Height
  Height = 25                        22.46
  Height = 50                        24.58
  Height = 75                        28.72
  Height = 100                       41.44
  Height = 125                       40.05
  Height = 150                       45.84
  Height = 175                       48.52
  Height = 200                       53.47
  Total                             305.08

Questions

11. Did the conditional distribution take away a sufficient amount of the unexplained variation in Diameter? Discuss.

12. What is the R^2 value for this analysis?

13. A small R^2 value supports the notion that not much of the total unexplained variation is accounted for in the conditional distribution of Diameter | Height. Give a rough sketch of the pattern that must be present in the conditional distribution for each of the following situations.

Situation #1: Small R^2
Situation #2: Very large R^2

Estimating the Variance Function

The estimation of the variance function is often considered to be of less importance than the estimation of the mean function; however, this is not the case for all analyses. Consider the following example in which an understanding of the variance function has implications about the possible diversity of bird species in northern Montana.

Example 3.2: Consider an article by Hutto, et al. (1995) titled “A Comparison of Bird Detection Rates Derived from On-Road versus Off-Road Counts in Northern Montana.” This article contained data on the off-road and on-road counts for a variety of bird species. In this work, the authors were interested in comparing the bird detection rates (i.e. counts) for on-road and off-road habitats.
Bird detection rates were obtained for 0-50 meters and for 50-100 meters from the observer.

Source: United States Department of Agriculture, General Technical Report PSW-GTR-149, 1995.

[Figure panels: On-Road Bird Counting; Off-Road Bird Counting]

A plot comparing the Off-Road bird detection rate (i.e. count) to the On-Road counts for 0-50 meters.

Answer the following questions using the plot below.

Questions

14. What is the purpose of adding the y = x line to this plot? What would it mean if the data points fall close to the y = x line?

15. Why would we expect the variance function to be larger for higher counts? Discuss.

16. How might an investigation of the variance function be used to understand the diversity level of these bird species between the on-road and off-road habitats? Explain.

Let’s return to the Impact Crater example in which we were investigating the relationship between the diameter of the crater and the height from which the projectile fell.

[Figure panels: Conditional Distribution of Diameter | Height; Plot of Residuals for each Height]

Consider the following table, which contains the total amount of unexplained variation for each height value.

Comparing the amount of Total Unexplained Variation across the heights

Height    Total Unexplained Variation
25        22.46
50        24.58
75        28.72
100       41.44
125       40.05
150       45.84
175       48.52
200       53.47

Question

17. Does the variation in diameter change as a function of height? Explain.

A scatterplot of the residuals vs. height will allow us to investigate possible trends in the variance function. As we have discussed previously, the average residual is zero whenever the commonly used L2 loss function is used in the fitting of the mean function, because the negative residuals simply cancel out the positive residuals. There are two options to overcome this problem: square the residuals or use the absolute residuals. In the following graph, the squared residuals, i.e. the Residual^2 values, are displayed as a function of height.
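The cancellation described above is easy to check numerically. The sketch below uses simulated diameters (not the Impact Crater data) and takes the mean function to be the simple average at each height, as in the earlier discussion; all variable names are made up for this illustration.

```python
import numpy as np

# SIMULATED data for illustration only: 5 replicates at each height,
# with more spread in diameter at larger heights.
rng = np.random.default_rng(360)
heights = np.repeat([25, 50, 75, 100, 125, 150, 175, 200], 5).astype(float)
diameter = 10 + 0.05 * heights + rng.normal(scale=1 + 0.005 * heights)

# Mean function estimated by the simple average at each height.
group_means = {h: diameter[heights == h].mean() for h in np.unique(heights)}
fitted = np.array([group_means[h] for h in heights])
residual = diameter - fitted

# Negative and positive residuals cancel, so the plain average is
# (numerically) zero and says nothing about the spread.
print("Average residual:  ", residual.mean())

# Squaring or taking absolute values removes the cancellation.
print("Average Residual^2:", (residual ** 2).mean())
print("Average |Residual|:", np.abs(residual).mean())
```

Either the squared or the absolute residuals can then be plotted against height to look for a trend in the variance function.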
Once again, a very simple approach to understanding the variance function is to obtain an average Residual^2 value for each height. These values are shown in the table below and are shown here graphically as well.

Comparing Residual^2 across the heights

Height    Total Residual^2    Average Residual^2    SQRT(Average Residual^2)
25        22.46               1.32                  1.15
50        24.58               1.45                  1.20
75        28.72               1.69                  1.30
100       41.44               2.44                  1.56
125       40.05               2.36                  1.53
150       45.84               2.70                  1.64
175       48.52               2.85                  1.69
200       53.47               3.15                  1.77

Questions

18. Is there any pattern in the average Residual^2 values as height increases?

19. Is the increase in the average Residual^2 fairly consistent? That is, would a linear function appear to model the variance adequately for this data? Discuss.

Similar to our approach to modeling the mean function, a LOWESS may be a better way to estimate the variance function. This will provide a smooth estimate of the variance function, which will allow us to understand how |Residual| changes as a function of height.

The same plot as above with the data points being plotted.

Putting the pieces together -- Mean & Variance Functions

Some of the more sophisticated modeling software packages (e.g. Arc or R) allow you to display the mean and variance functions on the same scatterplot. In the plot below, we can see the curvature in the mean function and the slight increase in the variance as height increases.

[Figure panels: LOWESS Estimate of $\hat{E}(\text{Diameter} \mid \text{Height})$; LOWESS Estimate of $\widehat{\text{Var}}(\text{Diameter} \mid \text{Height})$]

The mean and SQRT(variance) functions on a single plot. The square root of the variance function is necessary to match the scale, i.e. mm, not mm^2, when making this plot.
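The Average Residual^2 and SQRT(Average Residual^2) columns of the table above can be reproduced from the Total Residual^2 column, which also shows why the square root is needed to get back to the original units. One assumption is required: the handout does not state the number of crater measurements at each height, and the value of 17 used below is only inferred from the ratio of the Total column to the Average column.

```python
import numpy as np

# Values copied from the handout's table.
heights = np.array([25, 50, 75, 100, 125, 150, 175, 200])
total_sq = np.array([22.46, 24.58, 28.72, 41.44, 40.05, 45.84, 48.52, 53.47])

# ASSUMPTION: 17 observations per height (inferred from Total / Average
# in the table; not stated in the handout).
n_per_height = 17

avg_sq = total_sq / n_per_height   # simple variance-function estimate, in mm^2
sd_scale = np.sqrt(avg_sq)         # back on the scale of the data, in mm

for h, v, s in zip(heights, avg_sq, sd_scale):
    print(f"Height {h:3d}: Average Residual^2 = {v:.2f}, SQRT = {s:.2f}")
```

Small differences in the last printed digit relative to the table are possible, since the table's square roots were presumably computed from unrounded averages.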