Handout #9: Understanding the Influence of Individual Observations
Example 9.1: In this example, we will again consider a subset of a more substantial dataset so that we can more easily understand the ideas presented. For this example, we will investigate the Price of a used Chevrolet from our CarPrices dataset.
Linear Regression Setup
• Model to be fit using used Chevrolet vehicles only, i.e., Make = Chevrolet, New = No.
• Response Variable: Price
• Predictor Variable: Miles
• Assume the following structure for the mean and variance functions:
  o $E(Price \mid Miles, Make = Chevrolet, New = No) = \beta_0 + \beta_1 \cdot Miles$
  o $Var(Price \mid Miles, Make = Chevrolet, New = No) = \sigma^2$
An initial plot was made using the Graph Builder functionality in JMP. As expected, Price decreases as Miles increases, and Price increases as Year increases.
Regression Output for $E(Price \mid Miles) = \beta_0 + \beta_1 \cdot Miles$
Scatterplot with simple linear regression line.
Standard Regression Output
Next, consider a plot of the residuals.
Discussion: The functional form of the mean function does not appear to be correct. Discuss.
Something Extra - A Significance Test for Curvature
Cook and Weisberg (1999) discuss a simple statistical test to determine whether statistically significant curvature exists in a residual plot. The following outlines this procedure.
• Step #1: Save the predicted values and residuals into the dataset.
• Step #2: Plot the residuals against the predicted values.
• Step #3: Fit a quadratic mean function to investigate possible curvature.
Note: Even with the extreme observation on the left removed, possible issues with curvature may still exist. This is shown in the following plot.
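For readers who want to try this test in R rather than JMP, here is a minimal sketch of the three steps above. The data frame name CarPrices and the column names Price and Miles are assumptions based on this example; a statistically significant quadratic term in the final fit signals curvature.

> fit = lm(Price ~ Miles, data = CarPrices)    # simple linear regression fit
> CarPrices$pred = fitted(fit)                 # Step 1: save predicted values
> CarPrices$res = resid(fit)                   # Step 1: save residuals
> plot(res ~ pred, data = CarPrices)           # Step 2: residual plot
> quad.fit = lm(res ~ pred + I(pred^2), data = CarPrices)   # Step 3: quadratic fit
> summary(quad.fit)                            # test the quadratic coefficient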
Concepts of Leverage, Outliers, and Influence
In this section, we will consider the potential for an observation to impact our estimated mean function.
Continuing with our investigation above, we will consider the effect of the observation at the far right
of the graph presented above. This observation had a high number of miles relative to other used cars
in our dataset.
A very simple (and maybe somewhat naïve) approach would be to simply compare the estimated mean
function when this observation is included in the analysis to the estimated mean function when this
observation is not included in the analysis. This can easily be done in JMP.
To exclude an observation in JMP: from the drop-down menu, select Script > Redo Analysis.
Including Observation #14
Not including Observation #14
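The same with/without comparison can be sketched in R. The data frame CarPrices (restricted to used Chevrolets) and the assumption that Observation #14 sits in row 14 are illustrative naming choices, not part of the original analysis.

> fit.all = lm(Price ~ Miles, data = CarPrices)           # all observations
> fit.wo14 = lm(Price ~ Miles, data = CarPrices[-14, ])   # observation #14 removed
> coef(fit.all); coef(fit.wo14)       # compare y-intercepts and slopes
> sigma(fit.all); sigma(fit.wo14)     # compare estimated standard deviations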
Questions
1. What was the impact of removing Observation #14 on the estimated slope of the regression
line? What about the impact on the y-intercept?
2. What was the impact of removing Observation #14 on the estimated variance or standard
deviation?
3. Do you think removing this observation has significantly impacted our model?
Concept of Leverage
Observation #14 is an outlier because of its miles, not because of its price. This notion is captured by a concept called leverage. The current Wikipedia entry for leverage is linked here.
Wiki entry for Leverage
http://en.wikipedia.org/wiki/Leverage_(statistics)
I would guess that the term leverage as commonly used in modeling was borrowed from the concept of
a lever in physics.
A visual depiction of leverage within the context of regression is shown next.
Concept of leverage with one predictor
Concept of leverage with 2 predictor variables
The formula for leverage when a single predictor is involved can be written using summation notation. Matrix representation is typically used when more than one predictor is involved in the mean function.
Formula for leverage for a single predictor:

$$h_i = \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)$$

Formula for leverage in matrix notation:

$$\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'$$

Comments:
• The matrix $\boldsymbol{H}$ is commonly referred to as the hat matrix.
• The $h_i$ values presented above are the diagonal elements of $\boldsymbol{H}$.
The leverage values, i.e., the diagonal elements of $\boldsymbol{H}$, can be obtained in JMP by selecting Hats from the Save Columns menu.
The leverage values for all observations are added as a column in your dataset.
Plotting the diagonal elements of the hat matrix against Miles clearly shows which observations have
high leverage and which do not.
In a model with a single predictor, leverage increases as the distance from the center increases.
In a model with multiple predictors, leverage increases as the distance from the centroid increases.
Getting the leverage values in R can easily be done using matrix notation.

> x = cbind(rep(1, 38), Miles)             # design matrix: column of 1's and Miles
> xprime = t(x)                            # X'
> xprimex = xprime %*% x                   # X'X
> xprimex.inv = solve(xprimex, diag(2))    # (X'X)^(-1)
> hat.matrix = x %*% xprimex.inv %*% xprime    # H = X(X'X)^(-1)X'
> diag(hat.matrix)                         # the leverage values h_i
Belsley (1980) suggests that observations with

$$h_i > 2 \cdot \frac{\#\ \text{of parameters in model}}{n}$$

be considered high leverage observations. Such observations may have an adverse effect on your estimated mean function, and one should proceed cautiously in this case.
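R's built-in hatvalues() function returns the leverage values directly from a fitted model, so Belsley's rule can be sketched as follows (the CarPrices data frame is again an assumption):

> fit = lm(Price ~ Miles, data = CarPrices)
> h = hatvalues(fit)            # diagonal elements of the hat matrix
> p = length(coef(fit))         # number of parameters in the model
> n = nobs(fit)                 # number of observations
> which(h > 2 * p / n)          # observations flagged as high leverage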
Concept of Outlier
We have already considered a crude approach to identifying outliers in a regression situation.
Crude Outlier Rule
Graphically, more than 2 standard deviations away from 0:

$$Residual > 2\hat{\sigma} \quad \text{or} \quad Residual < -2\hat{\sigma}$$
A more thorough consideration of what constitutes an outlier is presented here. First, consider the following facts about the estimated residual vector.
𝐸(πœΊΜ‚) = 𝟎
and
π‘‰π‘Žπ‘Ÿ(πœΊΜ‚) =
=
=
=
=
Μ‚)
π‘‰π‘Žπ‘Ÿ(𝒀 − π‘Ώπœ·
π‘‰π‘Žπ‘Ÿ(𝒀 − 𝑿(𝑿′ 𝑿)−𝟏 𝑿′ 𝒀)
π‘‰π‘Žπ‘Ÿ( (𝑰 − 𝑯) 𝒀 )
(𝑰 − 𝑯) ∗ 𝑽𝒂𝒓(𝒀) ∗ (𝑰 − 𝑯)′
(𝑰 − 𝑯) ∗ 𝜎 2
Comments:
• The above expectation holds assuming the model includes an intercept term.
• The last equality is true because $(\boldsymbol{I} - \boldsymbol{H})$ is a symmetric idempotent matrix. Idempotent implies that when the matrix is multiplied by itself, you retain the same matrix.
• It makes sense that the variability in the estimated residuals should be a function of leverage: as the distance in the x-direction(s) increases, there is more variability in the estimated mean function, and this is reflected in the variability of the estimated residuals.
Task: Verify that $(\boldsymbol{I} - \boldsymbol{H})$ is indeed a symmetric and idempotent matrix for the model we have been working with.
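A sketch of this verification in R, reusing the hat.matrix object computed in the matrix code above; both checks should return TRUE up to rounding error.

> IminusH = diag(nrow(hat.matrix)) - hat.matrix   # I - H
> all.equal(IminusH, t(IminusH))                  # symmetric?
> all.equal(IminusH %*% IminusH, IminusH)         # idempotent?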
Studentized Residuals
Certainly, the determination of whether or not an observation would be considered an outlier depends on the scale of the response, i.e., it depends on $\hat{\sigma}$. The crude approach above simply multiplied this quantity by 2 to identify an outlier. A more traditional approach to the identification of outliers is to standardize the residuals.
Concept of Standardizing a Measurement
To standardize a measurement implies the following transformation:

$$\frac{Measurement - Mean\ of\ Measurement}{\sqrt{Variance\ of\ Measurement}}$$
Standardized measurements have the following properties:
• Expectation = 0
• Variance = 1
• Outlier Rules
  o If normal theory holds, flag $|Standardized\ Value| > 2$, or
  o use the more conservative $|Standardized\ Value| > 3$ for nonnormal situations (via Chebyshev's Inequality).
A studentized residual is computed as follows.
$$Studentized\ Residual_i = student\ r_i = \frac{\hat{e}_i - 0}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}}$$
Comments:
• Any observation with an absolute studentized residual larger than 2 would be considered an outlier.
• This type of residual is sometimes referred to as an internally studentized residual. This is in contrast to an externally studentized residual (or deleted studentized residual), which is discussed below.
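In R, the internally studentized residuals are returned by the built-in rstandard() function; a brief sketch (CarPrices assumed as before):

> fit = lm(Price ~ Miles, data = CarPrices)
> student.r = rstandard(fit)        # internally studentized residuals
> which(abs(student.r) > 2)         # observations flagged as outliers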
Getting Studentized residuals in JMP
The output from JMP is placed into a new column.
Questions:
1. Show the calculations for at least one of the studentized residuals in the above dataset. Would this observation be considered an outlier?
2. Which observations would be considered statistical outliers?
3. Do the outliers identified here agree with the outliers identified by the crude approach of $2\hat{\sigma}$?
Deleted Studentized Residuals
Deleted studentized residuals give a more holistic perspective of error. In particular, the deleted studentized residual for a particular observation is computed from a model that is estimated with this observation deleted from the dataset. If an observation is withheld from the fitting process, then its residual may reflect a purer notion of the error in prediction.
A deleted studentized residual is computed as follows.
$$Deleted\ studentized\ residual_{(-i)} = deleted\ student\ r_{(-i)} = \frac{\hat{e}_i - 0}{\sqrt{\hat{\sigma}^2_{(-i)} (1 - h_{ii})}}$$
Comments:
• Any observation with an absolute deleted studentized residual larger than 2 would be considered an outlier.
• The deleted studentized residual is sometimes called the externally studentized residual, as the estimate of error is computed externally, i.e., without the observation being investigated.
• The deleted studentized residuals cannot be easily obtained in JMP; however, Beckman and Trussell (1974) provide a way to get from internally to externally studentized residuals via the following relationship. This relationship suggests that if n is substantially large compared to the number of predictors, then the externally and internally studentized residuals are similar. In the following, n = number of observations and k = number of predictors in the model.
$$deleted\ student\ r_{(-i)} = student\ r_i \cdot \sqrt{\frac{(n-1)-(k+1)}{n-(k+1)-(student\ r_i)^2}}$$
Cook's Distance
Cook (1977) developed a separate measure of the effect of an individual observation by combining the magnitude of the internally studentized residual with the magnitude of the leverage for this observation. This statistic is simply referred to as Cook's Distance or Cook's D.
πΆπ‘œπ‘œπ‘˜ ′ 𝑠 𝐷𝑖 =
(𝑠𝑑𝑒𝑑𝑒𝑛𝑑 π‘Ÿπ‘– )2
β„Žπ‘–
∗
(1 − β„Žπ‘– )
⏟ (π‘˜ + 1)
⏟
π‘Ÿπ‘’π‘ π‘–π‘‘π‘’π‘Žπ‘™
π‘™π‘’π‘£π‘’π‘Ÿπ‘Žπ‘”π‘’
where
ο‚·
ο‚·
ο‚·
π‘Ÿπ‘– = π‘–π‘›π‘‘π‘’π‘Ÿπ‘›π‘Žπ‘™π‘™π‘¦ 𝑠𝑑𝑒𝑑𝑒𝑛𝑑𝑖𝑧𝑒𝑑 π‘Ÿπ‘’π‘ π‘–π‘‘π‘’π‘Žπ‘™
β„Žπ‘– = π‘™π‘’π‘£π‘’π‘Ÿπ‘Žπ‘”π‘’
π‘˜ = π‘›π‘’π‘šπ‘π‘’π‘Ÿ π‘œπ‘“ π‘π‘Ÿπ‘’π‘‘π‘–π‘π‘‘π‘œπ‘Ÿπ‘  𝑖𝑛 π‘šπ‘œπ‘‘π‘’π‘™
Suggested Rules for Cook's Distance
• An observation whose Cook's Distance is substantially larger than the others should be investigated.
• Cook suggests it is always important to investigate observations whose Cook's D > 1.
• Others have suggested observations with Cook's D > 4/n should be investigated further.
To obtain Cook's Distance values in JMP, select Save Columns > Cook's D Influence from the red triangle drop-down menu in the Fit Model output window.
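In R, Cook's Distance values come from the built-in cooks.distance() function; a minimal sketch using the fit from earlier:

> d = cooks.distance(fit)        # Cook's D for each observation
> which(d > 4 / nobs(fit))       # rule of thumb: investigate D > 4/n
> plot(d, type = "h")            # index plot of the Cook's D values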