Understanding Variation

STAT 360: Regression Analysis
Handout #2: Reducing the Unexplained Variation through Conditioning
Example 2.1: Consider the following data that has been collected from my STAT 110 students over the
past semesters.
Response Variable: Hair Length (mm)
Variables under investigation (i.e. independent variables):
• Gender
• Height (inches)
A snippet of the data is provided here.
A univariate analysis of the response ignores, or marginalizes over, the effect of all other variables
that may be under study. In this case, it is said that the marginal distribution of the response is being
summarized. A summary of the marginal distribution ignores all other variables.
Marginal Distribution of Hair Length (mm)
Historically, the mean and variance have been considered sufficient for describing a distribution.
[In fact, for some families of distributions, most notably the normal, the sample mean and variance are
sufficient statistics.]
Certainly, information is lost when all data values are reduced to a single quantity that is supposedly
describing a typical value (i.e. mean or average) and a single quantity being used to quantify the
variation (i.e. standard deviation or variance).
In this class, we will use the E() or expectation notation to denote the mean and Var() to denote the
variance. The following notation will be used to describe the mean and variance of the response Hair
Length.

Mean = E(Hair Length)

Variance = Var(Hair Length)
When these quantities are estimated from the data, the commonly used “hat” notation will be used.

Estimate of mean from data
$$\hat{E}(\text{Hair Length}) = 237.3 \text{ mm}$$

Estimate of variance from data
$$\widehat{Var}(\text{Hair Length}) = (\text{Std Dev})^2 = (171.8 \text{ mm})^2 \approx 29{,}519 \text{ mm}^2$$
Explained vs. Unexplained Variation
Consider the following Wiki entry for “Unexplained” Variation.
http://en.wikipedia.org/wiki/Explained_variation
Consider next these concepts in the context of our previous example. There are several reasons why
people have different hair lengths (e.g. specific hair styles, days since last haircut, gender, etc.). If such
variables are known to have an effect on hair length, then some of the inherent variation in hair length
can be explained. However, if these variables have no effect on hair length, or if they are ignored
altogether, then all the variation in hair length is said to be unexplained.

Explained Variation: Variation in a response that can be attributed to one or more other
variables

Unexplained Variation: Variation in a response that remains after considering other variables
Comment: When the marginal distribution of a response is being considered, then all the variation in
the response is said to be unexplained.
Example 2.2 Consider the following data regarding the top baseball players over a particular season.
This data has been sorted by Batting Average, which is one possible measure of a baseball player’s
performance or value to a team.

Explained Variation in Context
In baseball, the batting average is computed as follows:
$$\text{Batting Average} = \frac{\text{Hits}}{\text{At Bats}}$$
For example, for the first observation the batting average is computed as 159/459 = 0.346, and for
Mauer the batting average is computed as 155/486 = 0.319. The differences in the batting
average values can be completely explained by players having different numbers of hits and
at bats.
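The two batting averages quoted above can be reproduced with a few lines of Python. This is only a sketch; the function name `batting_average` is a label chosen here for illustration and does not come from the handout.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Return hits divided by at bats."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats


# Hits and at bats quoted in the text for the first observation and for Mauer.
first_player = batting_average(159, 459)
mauer = batting_average(155, 486)

print(f"First observation: {first_player:.3f}")
print(f"Mauer: {mauer:.3f}")
```

Because the batting average is a deterministic function of hits and at bats, knowing those two variables leaves no variation in batting average unexplained.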

Unexplained Variation in Context
Next, consider the variation in Salary. There appear to be factors other than those given here
that have an impact on salary. For example, Mauer has the top salary of the players listed here,
but his performance, as measured by batting average, is not the highest. The differences in
salary cannot be completely explained by At Bats, Hits, and Batting Average. Thus, much of the
variation in salary remains unexplained.
Measuring the Unexplained Variation
Consider again the Hair Length data presented in Example 2.1. The unexplained variation is the inherent
variation that exists in hair lengths. Up to this point, we have only considered the marginal distribution
of hair length (i.e. we have ignored other variables such as gender, hair style, etc.), thus all the variation
in hair length is said to be unexplained.
[Figure: dotplot of Hair Length, with the average added to the display and the unexplained variation marked]
The variance of the response is used to quantify the unexplained variation.
$$\text{Variance} = \frac{\sum (\text{Residual})^2}{n - 1}$$
where
$$\text{Residual} = \text{Data Point} - \text{Mean}$$
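As a rough sketch of the variance formula above, the Python snippet below computes the residuals and then the variance by hand, and compares the result with the standard library's `statistics.variance`. The values in `hair_length_mm` are made up for illustration and are not the STAT 110 data.

```python
import statistics

# Hypothetical hair lengths in mm (illustrative only, not the class data).
hair_length_mm = [40, 55, 120, 300, 410, 25, 380, 150]

n = len(hair_length_mm)
mean = sum(hair_length_mm) / n

# Residual = Data Point - Mean, one residual per observation.
residuals = [y - mean for y in hair_length_mm]

# Variance = sum of squared residuals divided by n - 1.
sum_sq_residuals = sum(r ** 2 for r in residuals)
variance = sum_sq_residuals / (n - 1)

print(f"mean = {mean:.1f} mm")
print(f"sum of squared residuals = {sum_sq_residuals:.1f}")
print(f"variance = {variance:.1f} mm^2")
print(f"library variance = {statistics.variance(hair_length_mm):.1f} mm^2")
```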
Questions:
1. What does it mean if the residual value is positive? Negative?
2. What does it mean if the residual is small (i.e. close to zero)?
3. Suppose the residual value for a particular data point is positive. Is the mean an overestimate or an underestimate of this particular data value? Explain.
The residual value for the 1st observation in the Hair Length dataset.
On the graphic below, identify the residuals for the first three observations.
All residual values are shown on the graphic below. Excel can be used to easily obtain the residual value
for all observations in your data.
[Figures: residuals shown graphically; obtaining residual values in a simple spreadsheet]
Notice that when all the residual values are added up, the total is 0. This happens whenever the average
is used to estimate the mean function (i.e. the E(Hair Length) quantity). Visually, we can see that this
happens because the positive residuals exactly cancel out the negative residuals.
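A numeric check is not a proof, but it can make the cancellation concrete. The sketch below sums the residuals around the sample mean and, for contrast, around the sample median. The data and the helper `residual_sum` are hypothetical and used only for illustration.

```python
import math
import statistics

# Hypothetical hair lengths in mm (illustrative only, not the class data).
hair_length_mm = [40, 55, 120, 300, 410, 25, 380, 150]


def residual_sum(data, center):
    """Sum of (data point - center) over all observations."""
    return sum(y - center for y in data)


mean = statistics.mean(hair_length_mm)
median = statistics.median(hair_length_mm)

sum_about_mean = residual_sum(hair_length_mm, mean)
sum_about_median = residual_sum(hair_length_mm, median)

print("Sum of residuals about the mean:", sum_about_mean)
print("Sum of residuals about the median:", sum_about_median)
```

With these made-up values the residuals about the mean cancel, while the residuals about the median do not, because the median differs from the mean here.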
Task: Prove
$$\sum \text{Residual} = 0$$
when the sample mean is used as an estimate of the mean function, i.e.
$$\text{mean} = \frac{\sum \text{Data Points}}{n}$$
There are two mathematical options to avoid the cancelling out effect when summing the residuals.
$$\text{Total Variation} = \sum \text{Residual}^2$$
or
$$\text{Total Variation} = \sum |\text{Residual}|$$
The first method, squaring the residuals, is more generally known as squared error loss or L2 loss. The
second method, taking the absolute value of the residuals to get rid of the negatives, is known as
absolute error loss or L1 loss.
Wiki entry for Loss Function
http://en.wikipedia.org/wiki/Loss_function
Comments:

Whenever an L2 loss function is used, the mean is the quantity that minimizes the total variation
(the sum of squared residuals) across all observations.

If an L1 loss function is used, the median is the quantity that minimizes the total variation
(the sum of absolute residuals) across all observations.
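The two comments above can be checked numerically. The sketch below searches a grid of whole-number candidate "centers" and reports which candidate gives the smallest total variation under each loss. The data are hypothetical, and a grid search is used only because it is easy to follow; it is not how statistical software finds these minimizers.

```python
import statistics

# Hypothetical hair lengths in mm; an odd count keeps the median unique.
hair_length_mm = [40, 55, 120, 300, 410, 25, 380, 150, 90]


def l2_loss(data, center):
    """Total variation under squared error (L2) loss."""
    return sum((y - center) ** 2 for y in data)


def l1_loss(data, center):
    """Total variation under absolute error (L1) loss."""
    return sum(abs(y - center) for y in data)


# Candidate centers: every whole number of mm from 0 to 450.
candidates = range(0, 451)
best_l2 = min(candidates, key=lambda c: l2_loss(hair_length_mm, c))
best_l1 = min(candidates, key=lambda c: l1_loss(hair_length_mm, c))

print("L2 minimizer on the grid:", best_l2)
print("Sample mean:", round(statistics.mean(hair_length_mm), 2))
print("L1 minimizer on the grid:", best_l1)
print("Sample median:", statistics.median(hair_length_mm))
```

The L2 minimizer lands on the grid point closest to the sample mean, and the L1 minimizer lands on the sample median.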
In the following, Excel was used to obtain the total variation using the L1 and L2 loss functions.
$$\text{Total Variation} = \sum |\text{Residual}| = 20106$$
$$\text{Total Variation} = \sum \text{Residual}^2 = 3836975$$
Questions:
1. Explain how the total variation for squared error loss can be computed from the variance.
Note that JMP can return the numerator of the variance directly; it is identified by the
Corrected SS value.
2. Verify that when the median is used instead of the mean in the absolute loss, the total variation
is reduced.
Concept of Conditioning
As stated previously, there are several variables that may influence a person’s hair length. For example,
gender is likely to help explain some of the variation in hair lengths, i.e. women tend to have longer
hair than men. If you take gender into consideration when analyzing hair length, then it is said that
the conditional distribution of hair length given gender is being considered.
Conditional Distribution – the distribution of the response variable conditioning on (i.e. taking
into consideration) one or more other variables
Visually, the conditional distribution simply means that the distribution of the response will be
considered for each gender, separately.
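In code, conditioning on gender amounts to splitting the response into subgroups before summarizing. The sketch below does this in plain Python; the `records` list is hypothetical and does not reproduce the STAT 110 data.

```python
import statistics
from collections import defaultdict

# Hypothetical (gender, hair length in mm) records, for illustration only.
records = [
    ("Female", 300), ("Female", 410), ("Female", 380), ("Female", 150),
    ("Male", 40), ("Male", 55), ("Male", 120), ("Male", 25),
]

# Marginal distribution: ignore gender and summarize every hair length.
all_lengths = [length for _, length in records]
print("Marginal:", len(all_lengths),
      statistics.mean(all_lengths), round(statistics.variance(all_lengths), 1))

# Conditional distributions: summarize hair length separately for each gender.
groups = defaultdict(list)
for gender, length in records:
    groups[gender].append(length)

for gender, lengths in groups.items():
    print(f"Hair Length | {gender}:", len(lengths),
          statistics.mean(lengths), round(statistics.variance(lengths), 1))
```

Each printed line gives n, the estimated mean, and the estimated variance for one distribution.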
[Figure: the marginal distribution of Hair Length, and the conditional distributions obtained by dividing the response into subgroups]
The following graph communicates the difference between the marginal distribution and the two
conditional distributions.
Akin to our investigation of the marginal distribution, the mean and variance are used to summarize
each of the conditional distributions. Identify each of these quantities for the marginal and conditional
distributions in the table below.
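Distribution of Hair Length

Quantity   Hair Length                             Hair Length | Female                                       Hair Length | Male
Mean       $\hat{E}(\text{Hair Length}) =$         $\hat{E}(\text{Hair Length} \mid \text{Female}) =$         $\hat{E}(\text{Hair Length} \mid \text{Male}) =$
Variance   $\widehat{Var}(\text{Hair Length}) =$   $\widehat{Var}(\text{Hair Length} \mid \text{Female}) =$   $\widehat{Var}(\text{Hair Length} \mid \text{Male}) =$
n          n =                                     n =                                                        n =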
The normal kernel density estimates for the marginal and conditional distributions are shown here.
From a modeling perspective, a significant advantage of considering the conditional distributions is the
possible reduction in the unexplained variation. Consider the substantial reduction in these quantities
for the hair length data.
Questions:
1. Given that a squared error loss function was used, give the formula for how the total
unexplained variation is computed for Females. How is it computed for Males?
2. What is the difference between the total unexplained variation in the marginal distribution and
the total unexplained variation in the two conditional distributions?
3. Explain in context, what would it mean if this difference were zero?
4. If the difference between the unexplained variation in the marginal distribution and conditional
distributions is close to zero, then which of the following is true?


• The variable being conditioned on (i.e. Gender) is useful in understanding the response
variable.
• The variable being conditioned on (i.e. Gender) is *not* useful in understanding the
response variable.
Explain.
5. On the graph below, sketch a possible scenario in which the difference between the
unexplained variation in the marginal and conditional distributions is small.
6. Next, sketch a scenario in which the difference between the unexplained variation in the
marginal and conditional distributions is very large (i.e. the total variation in the conditional
distributions is very small, say 0).
Proportion of Variation Being Explained
The amount of unexplained variation that can potentially be explained by conditioning certainly
depends on the total amount in the marginal distribution. For example, in the hair length example
above, the total unexplained variation fell from 3836975 in the marginal distribution to 1209170
in the two conditional distributions, a reduction of
3836975 – 1209170 = 2627805
which is a substantial amount considering that the total unexplained variation in the marginal
distribution was 3836975. As a result, the proportion of unexplained variation taken away by
considering the conditional distributions is typically used as a measure of the overall usefulness of
the conditioning variable(s). This proportion is commonly referred to as the coefficient of
determination or $R^2$.
$$R^2 = \frac{\text{Total Unexplained Variation in Marginal} - \text{Total Unexplained Variation in Conditional}}{\text{Total Unexplained Variation in Marginal}}$$
Comment: $R^2$ is often misinterpreted in practice. The correct interpretation, albeit not
necessarily eloquent, is the proportion of the unexplained variation in the marginal distribution of
the response that is explained by the independent variables being considered by the model.
Notation:

Sum of Squares Total (i.e. SSTotal) is commonly used to identify the total unexplained
variation in the marginal distribution of the response

Sum of Squares Error (i.e. SSError) is commonly used to identify the total amount of
unexplained variation in the conditional distributions
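Under squared error loss, these two quantities could be written out as follows. The symbols are assumptions introduced here for illustration and are not the handout's own notation: $y_i$ is the $i$-th response, $\bar{y}$ is the overall sample mean, $g$ indexes the subgroups formed by the conditioning variable, $n_g$ is the size of subgroup $g$, and $\bar{y}_g$ is the sample mean within subgroup $g$.

```latex
\text{SSTotal} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\qquad\qquad
\text{SSError} = \sum_{g} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2
```

In words, SSTotal measures residuals from the single marginal mean, while SSError measures residuals from each subgroup's own mean and then adds the subgroup totals together.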
The coefficient of determination using this common notation is expressed as
$$R^2 = \frac{\text{SSTotal} - \text{SSError}}{\text{SSTotal}}$$
or more simply
$$R^2 = 1 - \frac{\text{SSError}}{\text{SSTotal}}$$
The coefficient of determination or $R^2$ value for our situation would then be computed as
$$R^2 = \frac{\text{SSTotal} - \text{SSError}}{\text{SSTotal}} = \frac{3836975 - 1209170}{3836975} = \frac{2627805}{3836975} = 0.6849 = 68.5\%$$
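The arithmetic above can be double-checked with a short Python sketch. The function name `r_squared` is a label chosen here; the two sums of squares are the values reported in this handout for the hair length data.

```python
def r_squared(ss_total: float, ss_error: float) -> float:
    """Coefficient of determination: 1 - SSError / SSTotal."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return 1 - ss_error / ss_total


ss_total = 3836975   # total unexplained variation, marginal distribution
ss_error = 1209170   # total unexplained variation, conditional distributions

r2 = r_squared(ss_total, ss_error)
print(f"Reduction in unexplained variation: {ss_total - ss_error}")
print(f"R^2 = {r2:.4f} = {r2:.1%}")
```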
Question: What is the correct interpretation of this quantity in context?
Wiki entry for Coefficient of Determination
http://en.wikipedia.org/wiki/Coefficient_of_determination
Example 2.3 Consider once again the Impact Crater dataset.
Goal: Determine which of the following conditional distributions is more advantageous to consider if the
goal is to reduce the total unexplained variation in Diameter, the response variable of interest here.

Diameter | SandType

Diameter | ProjectileType
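One way to compare the two candidate conditioning variables is to compute the proportion of variation explained by each. The sketch below does so in plain Python under squared error loss. The crater measurements, the group labels, and the helper names `sum_sq` and `r_squared_by_group` are all hypothetical; the values were invented so that the two variables behave differently, and they say nothing about the real Impact Crater data.

```python
from collections import defaultdict


def sum_sq(values):
    """Sum of squared residuals about the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def r_squared_by_group(response, labels):
    """Proportion of variation in `response` explained by grouping on `labels`."""
    groups = defaultdict(list)
    for y, label in zip(response, labels):
        groups[label].append(y)
    ss_total = sum_sq(response)                          # marginal distribution
    ss_error = sum(sum_sq(g) for g in groups.values())   # conditional distributions
    return 1 - ss_error / ss_total


# Hypothetical crater records (made-up diameters and labels).
diameter = [50, 52, 60, 62, 70, 72, 51, 53, 61, 63, 71, 73]
sand_type = ["Coarse"] * 6 + ["Fine"] * 6
projectile_type = ["Glass", "Glass", "Steel", "Steel", "Wood", "Wood"] * 2

r2_sand = r_squared_by_group(diameter, sand_type)
r2_projectile = r_squared_by_group(diameter, projectile_type)

print(f"Diameter | SandType:       R^2 = {r2_sand:.3f}")
print(f"Diameter | ProjectileType: R^2 = {r2_projectile:.3f}")
```

The conditioning variable with the larger proportion leaves less of the variation in the response unexplained.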
Compute the following quantities for the various distributions under investigation here.

Marginal Distribution of Diameter

Quantity                      Diameter
Mean                          $\hat{E}(\text{Diameter}) =$
Variance                      $\widehat{Var}(\text{Diameter}) =$
n                             n =
Total Unexplained Variation

Conditional Distribution of Diameter | SandType

Quantity                Diameter | Coarse                                       Diameter | Fine
Mean                    $\hat{E}(\text{Diameter} \mid \text{Coarse}) =$         $\hat{E}(\text{Diameter} \mid \text{Fine}) =$
Variance                $\widehat{Var}(\text{Diameter} \mid \text{Coarse}) =$   $\widehat{Var}(\text{Diameter} \mid \text{Fine}) =$
n                       n =                                                     n =
Unexplained Variation

Total Unexplained Variation = _____________

Conditional Distribution of Diameter | ProjectileType

Quantity                Diameter | Glass                                       Diameter | Steel                                       Diameter | Wood
Mean                    $\hat{E}(\text{Diameter} \mid \text{Glass}) =$         $\hat{E}(\text{Diameter} \mid \text{Steel}) =$         $\hat{E}(\text{Diameter} \mid \text{Wood}) =$
Variance                $\widehat{Var}(\text{Diameter} \mid \text{Glass}) =$   $\widehat{Var}(\text{Diameter} \mid \text{Steel}) =$   $\widehat{Var}(\text{Diameter} \mid \text{Wood}) =$
n                       n =                                                    n =                                                    n =
Unexplained Variation

Total Unexplained Variation = _____________
Questions:
1. What proportion of the total unexplained variation is accounted for by considering the Sand
type?
2. What proportion of the total unexplained variation is accounted for by considering the Projectile
type?
3. Which conditional distribution accounts for more of the unexplained variation in Diameter?
Discuss.
4. Consider the following visual depictions of the conditional distributions. From these visual
displays, explain why we’d expect the conditional distribution of Diameter | ProjectileType to
account for more of the unexplained variation in Diameter.
[Figures: conditional distribution of Diameter | SandType; conditional distribution of Diameter | ProjectileType]