STAT 360: Regression Analysis
Handout #2: Reducing the Unexplained Variation through Conditioning

Example 2.1: Consider the following data, collected from my STAT 110 students over the past semesters.

Response Variable: Hair Length (mm)

Variables under investigation (i.e. independent variables):
Gender
Height (inches)

A snippet of the data is provided here.

A univariate analysis of the response ignores, or marginalizes over, the effect of all other variables that may be under study. In this case, it is said that the marginal distribution of the response is being summarized. A summary of the marginal distribution ignores all other variables.

Marginal Distribution of Hair Length (mm)

Historically, the mean and variance have been considered sufficient for describing a distribution. [In fact, the mean and variance are said to be sufficient statistics for a wide class of distributions.] Certainly, information is lost when all data values are reduced to a single quantity describing a typical value (i.e. mean or average) and a single quantity quantifying the variation (i.e. standard deviation or variance).

In this class, we will use the E() or expectation notation to denote the mean and Var() to denote the variance. The following notation will be used to describe the mean and variance of the response Hair Length.

Mean = E(Hair Length)
Variance = Var(Hair Length)

When these quantities are estimated from the data, the commonly used “hat” notation will be used.

Estimate of mean from data: $\widehat{E}(\text{Hair Length}) = 237.3 \text{ mm}$

Estimate of variance from data: $\widehat{Var}(\text{Hair Length}) = (\text{Std Dev})^2 = (171.8 \text{ mm})^2 \approx 29{,}519 \text{ mm}^2$

Explained vs. Unexplained Variation

Consider the following Wiki entry for “Unexplained” Variation.
http://en.wikipedia.org/wiki/Explained_variation

Consider next these concepts in the context of our previous example. There are several reasons why people have different hair lengths (e.g.
specific hair styles, days since last haircut, gender, etc.). If such variables are known to have an effect on hair length, then some of the inherent variation in hair length can be explained. However, if the aforementioned variables have no effect on hair length, or if they are ignored altogether, then all the variation in hair length is said to be unexplained.

Explained Variation: Variation in a response that can be attributed to one or more other variables

Unexplained Variation: Variation in a response that remains after considering other variables

Comment: When the marginal distribution of a response is being considered, all the variation in the response is said to be unexplained.

Example 2.2: Consider the following data regarding the top baseball players over a particular season. This data has been sorted by Batting Average, which is one possible measure of a baseball player’s performance or value to a team.

Explained Variation in Context

In baseball, the batting average is computed as follows.

$\text{Batting Average} = \frac{\text{Hits}}{\text{At Bats}}$

For example, for the first observation the batting average is computed as 159/459 = 0.346, and for Mauer the batting average is computed as 155/486 = 0.319. The differences in the batting average values can be explained (completely explained) by players having different numbers of hits and at bats.

Unexplained Variation in Context

Next, consider the variation in Salary. There appear to be factors other than those given here that have an impact on salary. For example, Mauer has the top salary of the players listed here, but his performance, as measured by batting average, is not the highest. The differences in salary cannot be completely explained by At Bats, Hits, and Batting Average. Thus, much of the variation in salary remains unexplained.

Measuring the Unexplained Variation

Consider again the Hair Length data presented in Example 2.1.
The unexplained variation is the inherent variation that exists in hair lengths. Up to this point, we have only considered the marginal distribution of hair length (i.e. we have ignored other variables such as gender, hair style, etc.), thus all the variation in hair length is said to be unexplained.

Figure: Dotplot of Hair Length
Figure: Dotplot of Hair Length with the average added to the display (the spread of the points around the average is the unexplained variation)

The variance of the response is used to quantify the unexplained variation.

$\text{Variance} = \frac{\sum (\text{Residual})^2}{n - 1}$

where

$\text{Residual} = \text{Data Point} - \text{Mean}$

Questions:

1. What does it mean if the residual value is positive? Negative?
2. What does it mean if the residual is small (i.e. close to zero)?
3. Suppose the residual value for a particular data point is positive. Is the mean an overestimate or an underestimate for this particular data value? Explain.

Figure: The residual value for the 1st observation in the Hair Length dataset.

On the graphic below, identify the residuals for the first three observations.

All residual values are shown on the graphic below. Excel can be used to easily obtain the residual value for all observations in your data.

Figure: Residuals shown graphically
Figure: Obtaining residual values in a simple spreadsheet

Notice that when all the residual values are added up, the total is 0. This happens whenever the average is used to estimate the mean function (i.e. the E(Hair Length) quantity). Visually, we can see this happens because the positive residuals exactly cancel the negative residuals.

Task: Prove $\sum \text{Residual} = 0$ when the sample mean is used as an estimate of the mean function, i.e.

$\text{mean} = \frac{\sum \text{Data Points}}{n}$

There are two mathematical options to avoid the cancelling-out effect when summing the residuals.

$\text{Total Variation} = \sum \text{Residual}^2$

or

$\text{Total Variation} = \sum |\text{Residual}|$

The first method, squaring the residuals, is more generally known as squared error loss or L2 loss. The second method, in which an absolute value is used to get rid of the negatives, is known as absolute loss or L1 loss.
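The residual calculations above can be checked numerically. The following is a minimal sketch in Python rather than Excel; the five hair lengths are made-up values chosen for easy arithmetic (they are not the STAT 110 data), and the variable names are illustrative only.

```python
# Hypothetical hair lengths in mm (made-up values, NOT the STAT 110 data).
hair_length = [50.0, 120.0, 300.0, 410.0, 20.0]

n = len(hair_length)
mean = sum(hair_length) / n                      # sample mean, 180.0 here

# Residual = Data Point - Mean
residuals = [x - mean for x in hair_length]

# The residuals sum to zero when the sample mean is the estimate.
sum_residuals = sum(residuals)

# Two ways to stop positives and negatives from cancelling:
total_l2 = sum(r ** 2 for r in residuals)        # squared error (L2) loss
total_l1 = sum(abs(r) for r in residuals)        # absolute (L1) loss

# The variance is the L2 total divided by n - 1.
variance = total_l2 / (n - 1)

print("residuals:", residuals)
print("sum of residuals:", sum_residuals)
print("L2 total:", total_l2, " L1 total:", total_l1)
print("variance:", variance)
```

With real data the sum of residuals may differ from zero by a tiny floating-point rounding error, so it is safer to compare it against a small tolerance than to test for exact equality.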
Wiki entry for Loss Function
http://en.wikipedia.org/wiki/Loss_function

Comments: Whenever an L2 loss function is used, the mean is the quantity that minimizes the total loss (the sum of squared residuals) across all observations. If an L1 loss function is used, then the median minimizes the total loss (the sum of absolute residuals) across all observations.

In the following, Excel was used to obtain the total variation using the L1 and L2 loss functions.

$\text{Total Variation} = \sum |\text{Residual}| = 20106$

$\text{Total Variation} = \sum \text{Residual}^2 = 3836975$

Questions:

1. Explain how the total variation for squared error loss can be computed from the variance. Note that JMP can return the numerator of the variance, which it labels Corrected SS.
2. Verify that when the median is used instead of the mean in the absolute loss, the total variation is reduced.

Concept of Conditioning

As stated previously, there are several variables that may influence a person’s hair length. For example, gender is likely to help explain some of the variation in hair lengths, i.e. women tend to have longer hair than men. If you take gender into consideration when analyzing hair length, then it is said that the conditional distribution of hair length given gender is being considered.

Conditional Distribution: the distribution of the response variable conditioning on (i.e. taking into consideration) one or more other variables

Visually, the conditional distribution simply means that the distribution of the response will be considered for each gender separately.

Figure: Marginal Distribution
Figure: Conditional Distributions: dividing the response into subgroups

The following graph communicates the difference between the marginal distribution and the two conditional distributions.

Akin to our investigation of the marginal distribution, the mean and variance are sufficient quantities for the conditional distributions. Identify each of these quantities for the marginal and conditional distributions in the table below.
Identify the conditional means and variances in the following table.

Distribution of Hair Length

Mean
  Hair Length:           $\widehat{E}(\text{Hair Length}) =$ ________
  Hair Length | Female:  $\widehat{E}(\text{Hair Length} \mid \text{Female}) =$ ________
  Hair Length | Male:    $\widehat{E}(\text{Hair Length} \mid \text{Male}) =$ ________

Variance
  Hair Length:           $\widehat{Var}(\text{Hair Length}) =$ ________
  Hair Length | Female:  $\widehat{Var}(\text{Hair Length} \mid \text{Female}) =$ ________
  Hair Length | Male:    $\widehat{Var}(\text{Hair Length} \mid \text{Male}) =$ ________

n
  Hair Length:           n = ________
  Hair Length | Female:  n = ________
  Hair Length | Male:    n = ________

The normal kernel density estimates for the marginal and conditional distributions are shown here.

From a modeling perspective, a significant advantage of considering the conditional distributions is a possible reduction in the unexplained variation. Consider the substantial reduction in these quantities for the hair length data.

Questions:

1. Given that a squared error loss function was used, give the formula for how the total unexplained variation is computed for Females. How is it computed for Males?
2. What is the difference between the total unexplained variation in the marginal distribution and the total unexplained variation in the two conditional distributions?
3. Explain, in context, what it would mean if this difference were zero.
4. If the difference between the unexplained variation in the marginal distribution and the conditional distributions is close to zero, then which of the following is true? Explain.
     The variable being conditioned on (i.e. Gender) is useful in understanding the response variable.
     The variable being conditioned on (i.e. Gender) is *not* useful in understanding the response variable.
5. On the graph below, sketch a possible scenario in which the difference between the unexplained variation in the marginal and conditional distributions is small.
6. Next, sketch a scenario in which the difference between the unexplained variation in the marginal and conditional distributions is very large (i.e. the total variation in the conditional distributions is very small, say 0).
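The comparison between the marginal and conditional unexplained variation can also be sketched in code. The example below is a minimal illustration in Python under squared error loss; the six observations are made-up values (not the STAT 110 data), and the helper name `sum_sq_residuals` is hypothetical.

```python
from collections import defaultdict

def sum_sq_residuals(values):
    """Total variation under squared error loss: sum of (value - mean)^2."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical (gender, hair length in mm) pairs; made-up values for illustration.
data = [
    ("Female", 300.0), ("Female", 400.0), ("Female", 350.0),
    ("Male", 50.0), ("Male", 30.0), ("Male", 70.0),
]

# Marginal distribution: ignore gender, use one overall mean.
ss_marginal = sum_sq_residuals([length for _, length in data])

# Conditional distributions: split by gender, use each group's own mean.
groups = defaultdict(list)
for gender, length in data:
    groups[gender].append(length)

ss_by_group = {g: sum_sq_residuals(vals) for g, vals in groups.items()}
ss_conditional = sum(ss_by_group.values())

# Amount of unexplained variation removed by conditioning on gender.
reduction = ss_marginal - ss_conditional

print("marginal:", ss_marginal)
print("by group:", ss_by_group)
print("conditional total:", ss_conditional)
print("reduction:", reduction)
```

In this made-up data the two group means are far apart, so most of the marginal variation disappears after conditioning. If the group means were nearly equal, the reduction would be close to zero.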
Proportion of Variation Being Explained

The amount of unexplained variation that can potentially be explained by conditioning depends on the total amount in the marginal distribution. For example, in the hair length example the reduction was 3836975 - 1209170 = 2627805, which is a substantial amount considering that the total unexplained variation in the marginal distribution was 3836975. As a result, the proportion of unexplained variation removed by considering the conditional distributions is typically used as a measure of the overall usefulness of the conditioning variable(s). This proportion is commonly referred to as the coefficient of determination or R2.

$R^2 = \frac{\text{Total Unexplained Variation in Marginal} - \text{Total Unexplained Variation in Conditional}}{\text{Total Unexplained Variation in Marginal}}$

Comment: R2 is often misinterpreted in practice. The correct interpretation, albeit not necessarily eloquent, is the proportion of the unexplained variation in the marginal distribution of the response that is explained by the independent variables considered in the model.

Notation:

Sum of Squares Total (i.e. SSTotal) is commonly used to identify the total unexplained variation in the marginal distribution of the response.

Sum of Squares Error (i.e. SSError) is commonly used to identify the total amount of unexplained variation in the conditional distributions.

The coefficient of determination using this common notation is expressed as

$R^2 = \frac{\text{SSTotal} - \text{SSError}}{\text{SSTotal}}$

or more simply

$R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}$

The coefficient of determination or R2 value for our situation would then be computed as

$R^2 = \frac{\text{SSTotal} - \text{SSError}}{\text{SSTotal}} = \frac{3836975 - 1209170}{3836975} = \frac{2627805}{3836975} = 0.6849 = 68.5\%$

Question: What is the correct interpretation of this quantity in context?

Wiki entry for Coefficient of Determination
http://en.wikipedia.org/wiki/Coefficient_of_determination

Example 2.3: Consider once again the Impact Crater dataset.
Goal: Determine which of the following conditional distributions is more advantageous to consider if the goal is to reduce the total unexplained variation in Diameter, the response variable of interest here.

Diameter | SandType
Diameter | ProjectileType

Compute the following quantities for the various distributions under investigation here.

Marginal Distribution of Diameter

Mean:                        $\widehat{E}(\text{Diameter}) =$ ________
Variance:                    $\widehat{Var}(\text{Diameter}) =$ ________
n:                           n = ________
Total Unexplained Variation: ________

Conditional Distribution of Diameter | SandType

Mean
  Diameter | Coarse:  $\widehat{E}(\text{Diameter} \mid \text{Coarse}) =$ ________
  Diameter | Fine:    $\widehat{E}(\text{Diameter} \mid \text{Fine}) =$ ________

Variance
  Diameter | Coarse:  $\widehat{Var}(\text{Diameter} \mid \text{Coarse}) =$ ________
  Diameter | Fine:    $\widehat{Var}(\text{Diameter} \mid \text{Fine}) =$ ________

n
  Diameter | Coarse:  n = ________
  Diameter | Fine:    n = ________

Unexplained Variation
  Diameter | Coarse:  ________
  Diameter | Fine:    ________

Total Unexplained Variation = _____________

Conditional Distribution of Diameter | ProjectileType

Mean
  Diameter | Glass:  $\widehat{E}(\text{Diameter} \mid \text{Glass}) =$ ________
  Diameter | Wood:   $\widehat{E}(\text{Diameter} \mid \text{Wood}) =$ ________
  Diameter | Steel:  $\widehat{E}(\text{Diameter} \mid \text{Steel}) =$ ________

Variance
  Diameter | Glass:  $\widehat{Var}(\text{Diameter} \mid \text{Glass}) =$ ________
  Diameter | Wood:   $\widehat{Var}(\text{Diameter} \mid \text{Wood}) =$ ________
  Diameter | Steel:  $\widehat{Var}(\text{Diameter} \mid \text{Steel}) =$ ________

n
  Diameter | Glass:  n = ________
  Diameter | Wood:   n = ________
  Diameter | Steel:  n = ________

Unexplained Variation
  Diameter | Glass:  ________
  Diameter | Wood:   ________
  Diameter | Steel:  ________

Total Unexplained Variation = _____________

Questions:

1. What proportion of the total unexplained variation is accounted for by considering the Sand type?
2. What proportion of the total unexplained variation is accounted for by considering the Projectile type?
3. Which conditional distribution accounts for more of the unexplained variation in Diameter? Discuss.
4. Consider the following visual depictions of the conditional distributions. From these visual displays, explain why we’d expect the conditional distribution of Diameter | ProjectileType to account for more of the unexplained variation in Diameter.

Figure: Conditional Distribution of Diameter | SandType
Figure: Conditional Distribution of Diameter | ProjectileType
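The comparison asked for in Example 2.3 can be sketched as follows. This is a minimal illustration in Python, not the handout's Excel or JMP workflow. The diameters and group labels are made-up values (the actual Impact Crater data is not reproduced here), and the function names `sum_sq_residuals` and `r_squared` are hypothetical.

```python
def sum_sq_residuals(values):
    """Sum of squared residuals about the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def r_squared(response, labels):
    """R2 = 1 - SSError / SSTotal when conditioning on one grouping variable."""
    ss_total = sum_sq_residuals(response)            # marginal distribution
    groups = {}
    for label, y in zip(labels, response):
        groups.setdefault(label, []).append(y)
    # SSError: squared residuals about each group's own mean, added up.
    ss_error = sum(sum_sq_residuals(g) for g in groups.values())
    return 1 - ss_error / ss_total

# Hypothetical crater diameters and conditions (made-up for illustration).
diameter = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
sand_type = ["Coarse", "Fine", "Coarse", "Fine", "Coarse", "Fine"]
projectile_type = ["Glass", "Glass", "Wood", "Wood", "Steel", "Steel"]

r2_sand = r_squared(diameter, sand_type)
r2_projectile = r_squared(diameter, projectile_type)

print("R2 for Diameter | SandType:      ", round(r2_sand, 4))
print("R2 for Diameter | ProjectileType:", round(r2_projectile, 4))
```

The made-up values were chosen so that the projectile groups barely overlap while the sand groups overlap heavily, which is why the second R2 comes out larger. The real data may behave differently, so the quantities in the tables above still need to be computed from the actual dataset.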