Centering and Standardizing Data and the Change of Coordinates

advertisement
Centering, Scaling, and Standardizing Data
and the Change of Coordinates for Model
Parameters
Centering and standardizing are common data transformations. Changes to the coordinate system of
the data space induce a change in the coordinate system of the parameter space that is fairly simple
where the parameter space is a tensor product of the components of the data space. New parameters
can be calculated as a linear transformation of the original parameters.
Recentering a variable
When we recenter data, we move the zero coordinate
π‘₯𝑐 = π‘₯ − π‘Ž
Where 𝑐 is a superscript indicating π‘₯ has been recentered, and π‘Ž is some arbitrary constant. Very often
we will simply center π‘₯ at its mean, πœ‡π‘₯
Δπ‘₯ = π‘₯ − πœ‡π‘₯
Rescaling a variable
Similarly, when we rescale data, we change the quantity considered a unit of measure
π‘₯ 𝑠 = π‘₯/π‘Ž
Where again π‘Ž is an arbitrary constant. Very often we rescale in units of standard deviation, 𝜎π‘₯ .
Standardizing a variable
Because the concept of standard deviation is tied to the mean, we will very often combine recentering
and rescaling as standardizing a variable
π‘₯𝑧 =
Δπ‘₯ π‘₯ − πœ‡π‘₯
=
𝜎π‘₯
𝜎π‘₯
Centering and Standardizing Additive Models
Suppose we are modeling a regression outcome, 𝑦, as a function of a variable π‘₯1 , so
𝑦 = 𝛽0 + 𝛽1 π‘₯1 + 𝑒
If we center the data in π‘₯1 , so that Δπ‘₯1 = π‘₯1 − πœ‡1 then we can model
𝑦 = 𝛽0𝑐 + 𝛽1𝑐 Δπ‘₯1 + 𝑒 𝑐
= (𝛽0𝑐 − 𝛽1𝑐 πœ‡1 ) + 𝛽1𝑐 π‘₯1 + 𝑒 𝑐
With the second model expressed in terms of π‘₯1 , we see that
𝑒 = 𝑒𝑐
𝛽1 = 𝛽1𝑐
𝛽0 + 𝛽1 πœ‡1 = 𝛽0𝑐
The change in parameters from the first model to the second is a linear transformation, and can be
expressed in matrix form as
[
𝛽0𝑐
1
]=[
𝛽1𝑐
0
πœ‡1 𝛽0
][ ]
1 𝛽1
1 πœ‡1
]. It is worth noting that we could “center” our data at
0 1
any arbitrary point on the π‘₯1 scale by substituting any constant for πœ‡1 .
And let’s call the transition matrix π‘ͺπ’™πŸ = [
We can easily extend this to models with more independent variables.
𝑦 = 𝛽0 + 𝛽1 π‘₯1 + 𝛽2 π‘₯2 + 𝑒
Can be centered as
𝑦 = (𝛽0𝑐 − 𝛽1𝑐 πœ‡1 − 𝛽2𝑐 πœ‡2 ) + 𝛽1𝑐 π‘₯1 + 𝛽2𝑐 π‘₯2 + 𝑒 𝑐
with
𝛽0𝑐
1
[𝛽1𝑐 ] = [0
0
𝛽2𝑐
πœ‡1
1
0
πœ‡2 𝛽0
0 ] [𝛽1 ]
1 𝛽2
This form for the centering transition matrix is characteristic for additive models: the intercept is
shifted, and all other coefficients remain unchanged.
Returning to our simple model with one independent variable, we can also standardize π‘₯1 , so that π‘₯ 𝑧 =
Δπ‘₯1 /𝜎π‘₯1
𝑦 = 𝛽0𝑧 + 𝛽1𝑧 π‘₯1𝑧 + 𝑒 𝑧
= 𝛽0𝑧 +
𝛽1𝑧
Δπ‘₯ + 𝑒 𝑧
𝜎1 1
Which eventually gives us
𝛽𝑧
1 0 𝛽0𝑐
[ 0𝑧 ] = [
][ ]
0 𝜎1 𝛽1𝑐
𝛽1
Calling this new transition matrix π‘Ίπ’™πŸ we see that
𝑩𝒛 = π‘Ίπ’™πŸ π‘ͺπ’™πŸ 𝑩
Centering and Standardizing Models with Interactions
Now let’s consider a slightly more complicated model where we add a second variable π‘₯2 and consider
also the interaction effect of π‘₯1 and π‘₯2 , formed in the usual way as π‘₯1 π‘₯2, that is
𝑦 = 𝛽0 + 𝛽1 π‘₯1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯1 π‘₯2 + 𝑒
Here, the parameter space for our expanded model is formed as a tensor product of the parameter
spaces for π‘₯1 and π‘₯2 . If we wish to center both π‘₯1 and π‘₯2 we then have two transition matrices, π‘ͺπ’™πŸ and
π‘ͺπ’™πŸ , and the transition matrix for the parameter space is now the direct product of the two [citation]
π‘ͺπ’™πŸ ⊗ π‘ͺπ’™πŸ
1
=[
0
πœ‡2
1
]⊗[
1
0
1 πœ‡1
πœ‡1
0 1
]=[
1
0 0
0 0
πœ‡2
0
1
0
πœ‡1 πœ‡2
πœ‡2
]
πœ‡1
1
In other words (and expressed in matrix form)
𝑩𝒄 = (π‘ͺπ’™πŸ ⊗ π‘ͺπ’™πŸ )𝑩
And this may be extended to include more variables in the natural way.
Standardizing variables works in the same way. We start with π‘₯1𝑧 =
(π‘₯1 −πœ‡1 )
𝜎1
π‘₯𝑐
= 𝜎1 If we take the
1
transition matrix
𝑆π‘₯1 = [
1 0
]
0 𝜎1
(again, note that we could scale our parameter by any arbitrary constant), then
𝑩𝒛 = π‘Ίπ’™πŸ π‘ͺπ’™πŸ 𝑩
Or more fully
[
𝛽0𝑧
1
𝑧 ] = [0
𝛽1
0 1 πœ‡1 𝛽0
][
][ ]
𝜎1 0 1 𝛽1
For the two variable model we can form the full transition matrix as either of two forms
(π‘Ίπ’™πŸ ⊗ π‘Ίπ’™πŸ )(π‘ͺπ’™πŸ ⊗ π‘ͺπ’™πŸ ) = (π‘Ίπ’™πŸ π‘ͺπ’™πŸ ) ⊗ (π‘Ίπ’™πŸ π‘ͺπ’™πŸ )
Where
1
(π‘Ίπ’™πŸ ⊗ π‘Ίπ’™πŸ ) = [
0
1 0
0 𝜎1
0
1 0
]⊗[
]=[
𝜎2
0 𝜎1
0 0
0 0
Again, this generalizes to more variables.
Variable-standardization or term-standardization
Standardizing each variable:
0
0
𝜎2
0
0
0
]
0
𝜎1 𝜎2
𝑦 = 𝛽0𝑧 + 𝛽1𝑧 π‘₯1𝑧 + 𝛽2𝑧 π‘₯2𝑧 + 𝛽3𝑧 π‘₯1𝑧 π‘₯2𝑧 + 𝑒 𝑧
Or standardizing each term
𝑦 = 𝛽0𝑧 + 𝛽1𝑧 π‘₯1𝑧 + 𝛽2𝑧 π‘₯2𝑧 + 𝛽3𝑧 (π‘₯1 π‘₯2 ) 𝑧 + 𝑒 𝑧
In the first case we have
π‘₯1𝑧 π‘₯2𝑧 =
(π‘₯1 − πœ‡π‘₯1 )(π‘₯2 − πœ‡π‘₯2 )
𝜎π‘₯1 𝜎π‘₯2
In the latter case we have
(π‘₯1 π‘₯2 ) 𝑧 =
π‘₯1 π‘₯2 − πœ‡π‘₯1π‘₯2
𝜎π‘₯1π‘₯2
Consider five models, which differ in how the independent variables are centered and scaled.
(1) Original scales
𝑦 = 𝛽0 + 𝛽1 π‘₯1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯1 π‘₯2 + 𝑒
(2) Centered, variable-wise
𝑦 = 𝛽0𝑐𝑣 + 𝛽1𝑐𝑣 π‘₯1𝑐 + 𝛽2𝑐𝑣 π‘₯2𝑐 + 𝛽3𝑐𝑣 π‘₯1𝑐 π‘₯2𝑐 + 𝑒 𝑐
(3) Centered, term-wise
𝑦 = 𝛽0𝑐𝑑 + 𝛽1𝑐𝑑 π‘₯1𝑐 + 𝛽2𝑐𝑑 π‘₯2𝑐 + 𝛽3𝑐𝑑 (π‘₯1 π‘₯2 )𝑐 + 𝑒 𝑐
(4) Standardized, variable-wise
𝑦 = 𝛽0𝑧𝑣 + 𝛽1𝑧𝑣 π‘₯1𝑧 + 𝛽2𝑧𝑣 π‘₯2𝑧 + 𝛽3𝑧𝑣 π‘₯1𝑧 π‘₯2𝑧 + 𝑒 𝑧
(5) Standardized, term-wise
𝑦 = 𝛽0𝑧𝑑 + 𝛽1𝑧𝑑 π‘₯1𝑧 + 𝛽2𝑧𝑑 π‘₯2𝑧 + 𝛽3𝑧𝑑 (π‘₯1 π‘₯2 )𝑧 + 𝑒 𝑧
As it turns out, these models are all equivalent. Although they have different values of 𝛽, they produce
the same predicted values for 𝑦, and the residuals are the same, that is 𝑒 = 𝑒 𝑐 = 𝑒 𝑧 . This means that
the overall fit, 𝑅 2, of the models is the same. Additionally, all of the t-statistics and p-values for 𝛽s
representing the same effect are equal for the highest order effects, i.e. centering and scaling do not
change the statistical significance of highest order effects (but will in general change t-values and pvalues for lower order, included effects, and intercepts). Given the correct centering and scaling
constants, each set of 𝛽s can be calculated as a linear transformation of any of the other sets (without
any need to transform the data).
The real differences, then, are in ease of calculation, use, and interpretation.
For ease of calculation it is hard to beat term-wise centering. Except for the intercept, each 𝛽𝑖𝑐𝑑 = 𝛽𝑖 ,
and the intercept is 𝛽0𝑐𝑑 = πœ‡π‘¦ . More explicitly
𝛽0𝑐𝑑
1 πœ‡π‘₯1
𝛽1𝑐𝑑
0 1
𝑐𝑑 = [0
0
𝛽2
𝑐𝑑
0
0
[𝛽3 ]
πœ‡π‘₯2
0
1
0
πœ‡π‘₯1π‘₯2 𝛽0
𝛽
0
] [ 1]
𝛽
0
2
𝛽3
1
This is transformation is like the transformation in an additive model.
The one wrinkle here is to note that πœ‡π‘₯1π‘₯2 = πœ‡π‘₯1 πœ‡π‘₯2 + π‘π‘œπ‘£(π‘₯1 , π‘₯2 ). Because of this, unless
π‘π‘œπ‘£(π‘₯1 , π‘₯2 ) = 0 there is a constant effect hidden in the interaction term, and there is no simple
interpretation of the first-order effects. Another way to say this is when π‘π‘œπ‘£(π‘₯1 , π‘₯2 ) ≠ 0 there is no
easily interpreted point in the data space where both 𝛽2𝑐𝑑 and 𝛽3𝑐𝑑 cancel out leaving only the effect of
𝛽1𝑐𝑑 .
Suppose we seek the point at which both 𝛽2𝑐𝑑 and 𝛽3𝑐𝑑 cancel out. For 𝛽2𝑐𝑑 to cancel this implies π‘₯2 =
πœ‡π‘₯2 . Then the product term is
(π‘₯1 πœ‡π‘₯2 )𝑐 = π‘₯1 πœ‡π‘₯2 − (πœ‡π‘₯1 πœ‡π‘₯2 + π‘π‘œπ‘£(π‘₯1 , π‘₯2 ))
And for this to be zero
π‘₯1 =
πœ‡π‘₯1 πœ‡π‘₯2 + π‘π‘œπ‘£(π‘₯1 , π‘₯2 )
πœ‡π‘₯2
πœ‡π‘₯1 πœ‡π‘₯2 +π‘π‘œπ‘£(π‘₯1 ,π‘₯2 )
, πœ‡π‘₯2 ).
πœ‡π‘₯2
π‘π‘œπ‘£(π‘₯1 ,π‘₯2 )
In other words, the parameter 𝛽1𝑐𝑑 is the effect of π‘₯1𝑐 at the point (π‘₯1 , π‘₯2 ) = (
Rewritten in terms of our centered variables, this is the point (π‘₯1𝑐 , π‘₯2𝑐 ) = (
πœ‡π‘₯2
, 0). Because this is a
specific point, we can further calculate the predicted value of 𝑦
𝑦 = πœ‡π‘¦ + 𝛽1𝑐𝑑 (
π‘π‘œπ‘£(π‘₯1 , π‘₯2 )
)
πœ‡π‘₯2
A parallel derivation isolates the parameter 𝛽2𝑐𝑑 .
Consider the predicted value of 𝑦 where π‘₯1𝑐 = π‘₯2𝑐 = 0, i.e. at π‘₯1 = πœ‡π‘₯1 and π‘₯2 = πœ‡π‘₯2 .
𝑦 = 𝛽0𝑐𝑑 + 𝛽1𝑐𝑑 βˆ™ 0 + 𝛽2𝑐𝑑 βˆ™ 0 + 𝛽3𝑐𝑑 (π‘₯1 π‘₯2 )𝑐
= πœ‡π‘¦ + 𝛽3𝑐𝑑 (π‘₯1 π‘₯2 )𝑐
But
(π‘₯1 π‘₯2 )𝑐 = (πœ‡π‘₯1 πœ‡π‘₯2 )𝑐
= πœ‡π‘₯1 πœ‡π‘₯2 − πœ‡π‘₯1π‘₯2
= πœ‡π‘₯1 πœ‡π‘₯2 − (πœ‡π‘₯1 πœ‡π‘₯2 + π‘π‘œπ‘£(π‘₯1 , π‘₯2 ))
= −π‘π‘œπ‘£(π‘₯1 , π‘₯2 )
Returning to the predicted value of 𝑦, we have
𝑦 = πœ‡π‘¦ − 𝛽3𝑐𝑑 π‘π‘œπ‘£(π‘₯1 , π‘₯2 )
And our hidden constant makes its appearance.
Interestingly, we see that all three parameters take on their individual meaning at specific points, and
not as general “effects” across sets of points. We also note that 𝛽0𝑐𝑑 never appears in isolation if
π‘π‘œπ‘£(π‘₯1 , π‘₯2 ) ≠ 0.
But if these are not effects, what would be the effect of π‘₯1𝑐 or π‘₯2𝑐 ? Because there is an interaction term,
both effects are conditional. Suppose we let π‘₯2𝑐 = π‘Ž and pursue the effect of π‘₯1𝑐 .
𝑐
𝑦 = πœ‡π‘¦ + 𝛽1 π‘₯1𝑐 + 𝛽2 π‘Ž + 𝛽3 ((π‘₯1𝑐 + πœ‡π‘₯1 )(π‘Ž + πœ‡π‘₯2 ))
= πœ‡π‘¦ + 𝛽1 π‘₯1𝑐 + 𝛽2 π‘Ž + 𝛽3 (π‘₯1𝑐 (π‘Ž + πœ‡π‘₯2 ) + πœ‡π‘₯1 (π‘Ž + πœ‡π‘₯2 ) − πœ‡π‘₯1π‘₯2 )
= 𝛽0 + 𝛽1 πœ‡π‘₯1 + 𝛽2 (π‘Ž + πœ‡2 ) + 𝛽3 (πœ‡π‘₯1 (π‘Ž + πœ‡2 )) + (𝛽1 + 𝛽3 (π‘Ž + πœ‡2 ))π‘₯1𝑐
Then the effect of π‘₯1𝑐 at π‘₯2𝑐 = π‘Ž will be 𝛽1 + 𝛽3 (π‘Ž + πœ‡π‘₯2 )
Download