Centering, Scaling, and Standardizing Data and the Change of Coordinates for Model Parameters

Centering and standardizing are common data transformations. Changes to the coordinate system of the data space induce a change in the coordinate system of the parameter space that is fairly simple where the parameter space is a tensor product of the components of the data space. New parameters can be calculated as a linear transformation of the original parameters.

Recentering a variable

When we recenter data, we move the zero coordinate:

$$x^c = x - c$$

where the superscript $c$ indicates that $x$ has been recentered, and $c$ is some arbitrary constant. Very often we will simply center $x$ at its mean, $\mu_x$:

$$\Delta x = x - \mu_x$$

Rescaling a variable

Similarly, when we rescale data, we change the quantity considered a unit of measure:

$$x^s = x / s$$

where again $s$ is an arbitrary constant. Very often we rescale in units of standard deviation, $\sigma_x$.

Standardizing a variable

Because the concept of standard deviation is tied to the mean, we will very often combine recentering and rescaling as standardizing a variable:

$$x^z = \frac{\Delta x}{\sigma_x} = \frac{x - \mu_x}{\sigma_x}$$

Centering and Standardizing Additive Models

Suppose we are modeling a regression outcome, $y$, as a function of a variable $x_1$, so

$$y = \beta_0 + \beta_1 x_1 + \epsilon$$

If we center the data in $x_1$, so that $\Delta x_1 = x_1 - c_1$, then we can model

$$y = \beta_0^c + \beta_1^c \Delta x_1 + \epsilon^c = (\beta_0^c - \beta_1^c c_1) + \beta_1^c x_1 + \epsilon^c$$

With the second model expressed in terms of $x_1$, we see that

$$\epsilon = \epsilon^c, \qquad \beta_1 = \beta_1^c, \qquad \beta_0 + \beta_1 c_1 = \beta_0^c$$

The change in parameters from the first model to the second is a linear transformation, and can be expressed in matrix form as

$$\begin{bmatrix} \beta_0^c \\ \beta_1^c \end{bmatrix} = \begin{bmatrix} 1 & c_1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

Let's call the transition matrix $C_{x_1} = \begin{bmatrix} 1 & c_1 \\ 0 & 1 \end{bmatrix}$. It is worth noting that we could "center" our data at any arbitrary point on the $x_1$ scale by substituting any constant for $c_1$.

We can easily extend this to models with more independent variables. The model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

can be centered as

$$y = (\beta_0^c - \beta_1^c c_1 - \beta_2^c c_2) + \beta_1^c x_1 + \beta_2^c x_2 + \epsilon^c$$

with

$$\begin{bmatrix} \beta_0^c \\ \beta_1^c \\ \beta_2^c \end{bmatrix} = \begin{bmatrix} 1 & c_1 & c_2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$

This form for the centering transition matrix is characteristic of additive models: the intercept is shifted, and all other coefficients remain unchanged.

Returning to our simple model with one independent variable, we can also standardize $x_1$, so that $x_1^z = \Delta x_1 / s_1$:

$$y = \beta_0^z + \beta_1^z x_1^z + \epsilon^z = \beta_0^z + \beta_1^z \frac{\Delta x_1}{s_1} + \epsilon^z$$

which eventually gives us

$$\begin{bmatrix} \beta_0^z \\ \beta_1^z \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & s_1 \end{bmatrix} \begin{bmatrix} \beta_0^c \\ \beta_1^c \end{bmatrix}$$

Calling this new transition matrix $S_{x_1}$, we see that

$$\boldsymbol{\beta}^z = S_{x_1} C_{x_1} \boldsymbol{\beta}$$

Centering and Standardizing Models with Interactions

Now let's consider a slightly more complicated model where we add a second variable $x_2$ and consider also the interaction effect of $x_1$ and $x_2$, formed in the usual way as $x_1 x_2$, that is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$$

Here, the parameter space for our expanded model is formed as a tensor product of the parameter spaces for $x_1$ and $x_2$. If we wish to center both $x_1$ and $x_2$ we then have two transition matrices, $C_{x_1}$ and $C_{x_2}$, and the transition matrix for the parameter space is now the direct product of the two [citation]:

$$C_{x_2} \otimes C_{x_1} = \begin{bmatrix} 1 & c_2 \\ 0 & 1 \end{bmatrix} \otimes \begin{bmatrix} 1 & c_1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & c_1 & c_2 & c_1 c_2 \\ 0 & 1 & 0 & c_2 \\ 0 & 0 & 1 & c_1 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

In other words (and expressed in matrix form),

$$\boldsymbol{\beta}^c = (C_{x_2} \otimes C_{x_1}) \boldsymbol{\beta}$$

And this may be extended to include more variables in the natural way.
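To make the Kronecker-product relationship concrete, here is a minimal numerical sketch in Python. The simulated data, the centering constants, and the helper `fit` are illustrative assumptions, and NumPy's `lstsq` simply stands in for whatever regression routine one prefers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(10, 2, n)
x2 = rng.normal(-3, 5, n)
y = 1.0 + 0.5 * x1 - 2.0 * x2 + 0.25 * x1 * x2 + rng.normal(0, 1, n)

def fit(*cols):
    """OLS with an intercept; returns the coefficient vector (b0, b1, ...)."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Fit the interaction model on the original scales: beta = (b0, b1, b2, b3)
beta = fit(x1, x2, x1 * x2)

# Center each variable at an arbitrary constant and refit
c1, c2 = 10.0, -3.0
beta_c = fit(x1 - c1, x2 - c2, (x1 - c1) * (x2 - c2))

# Per-variable centering transition matrices
C_x1 = np.array([[1.0, c1], [0.0, 1.0]])
C_x2 = np.array([[1.0, c2], [0.0, 1.0]])

# The parameter-space transition matrix is the Kronecker (direct) product
C = np.kron(C_x2, C_x1)

print(np.allclose(beta_c, C @ beta))  # True: beta^c = (C_x2 kron C_x1) beta
```

The same kind of check works for the purely additive model, where the transition matrix reduces to the intercept-shifting form shown above.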
Standardizing variables works in the same way. We start with

$$x_1^z = \frac{x_1 - c_1}{s_1} = \frac{x_1^c}{s_1}$$

If we take the transition matrix

$$S_{x_1} = \begin{bmatrix} 1 & 0 \\ 0 & s_1 \end{bmatrix}$$

(again, note that we could scale our parameter by any arbitrary constant), then

$$\boldsymbol{\beta}^z = S_{x_1} C_{x_1} \boldsymbol{\beta}$$

or, more fully,

$$\begin{bmatrix} \beta_0^z \\ \beta_1^z \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & s_1 \end{bmatrix} \begin{bmatrix} 1 & c_1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

For the two-variable model we can form the full transition matrix in either of two equivalent ways,

$$(S_{x_2} \otimes S_{x_1})(C_{x_2} \otimes C_{x_1}) = (S_{x_2} C_{x_2}) \otimes (S_{x_1} C_{x_1})$$

where

$$S_{x_2} \otimes S_{x_1} = \begin{bmatrix} 1 & 0 \\ 0 & s_2 \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 \\ 0 & s_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & s_1 & 0 & 0 \\ 0 & 0 & s_2 & 0 \\ 0 & 0 & 0 & s_1 s_2 \end{bmatrix}$$

Again, this generalizes to more variables.

Variable-standardization or term-standardization

We may standardize each variable:

$$y = \beta_0^z + \beta_1^z x_1^z + \beta_2^z x_2^z + \beta_3^z x_1^z x_2^z + \epsilon^z$$

or standardize each term:

$$y = \beta_0^z + \beta_1^z x_1^z + \beta_2^z x_2^z + \beta_3^z (x_1 x_2)^z + \epsilon^z$$

In the first case we have

$$x_1^z x_2^z = \frac{(x_1 - \mu_{x_1})(x_2 - \mu_{x_2})}{\sigma_{x_1} \sigma_{x_2}}$$

In the latter case we have

$$(x_1 x_2)^z = \frac{x_1 x_2 - \mu_{x_1 x_2}}{\sigma_{x_1 x_2}}$$

Consider five models, which differ in how the independent variables are centered and scaled.

(1) Original scales: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$

(2) Centered, variable-wise: $y = \beta_0^{cv} + \beta_1^{cv} x_1^c + \beta_2^{cv} x_2^c + \beta_3^{cv} x_1^c x_2^c + \epsilon^c$

(3) Centered, term-wise: $y = \beta_0^{ct} + \beta_1^{ct} x_1^c + \beta_2^{ct} x_2^c + \beta_3^{ct} (x_1 x_2)^c + \epsilon^c$

(4) Standardized, variable-wise: $y = \beta_0^{zv} + \beta_1^{zv} x_1^z + \beta_2^{zv} x_2^z + \beta_3^{zv} x_1^z x_2^z + \epsilon^z$

(5) Standardized, term-wise: $y = \beta_0^{zt} + \beta_1^{zt} x_1^z + \beta_2^{zt} x_2^z + \beta_3^{zt} (x_1 x_2)^z + \epsilon^z$

As it turns out, these models are all equivalent. Although they have different values of $\beta$, they produce the same predicted values for $y$, and the residuals are the same, that is $\epsilon = \epsilon^c = \epsilon^z$. This means that the overall fit, $R^2$, of the models is the same. Additionally, all of the t-statistics and p-values for $\beta$s representing the same effect are equal for the highest-order effects; i.e., centering and scaling do not change the statistical significance of the highest-order effects (but will in general change t-values and p-values for lower-order, included effects, and intercepts). Given the correct centering and scaling constants, each set of $\beta$s can be calculated as a linear transformation of any of the other sets, without any need to transform the data. The real differences, then, are in ease of calculation, use, and interpretation (these claims are checked numerically in the sketch below).

For ease of calculation it is hard to beat term-wise centering. Except for the intercept, each $\beta_i^{ct} = \beta_i$, and the intercept is $\beta_0^{ct} = \mu_y$. More explicitly,

$$\begin{bmatrix} \beta_0^{ct} \\ \beta_1^{ct} \\ \beta_2^{ct} \\ \beta_3^{ct} \end{bmatrix} = \begin{bmatrix} 1 & \mu_{x_1} & \mu_{x_2} & \mu_{x_1 x_2} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}$$

This transformation is like the one for an additive model. The one wrinkle here is to note that $\mu_{x_1 x_2} = \mu_{x_1} \mu_{x_2} + \mathrm{cov}(x_1, x_2)$. Because of this, unless $\mathrm{cov}(x_1, x_2) = 0$ there is a constant effect hidden in the interaction term, and there is no simple interpretation of the first-order effects. Another way to say this is that when $\mathrm{cov}(x_1, x_2) \neq 0$ there is no easily interpreted point in the data space where both the $\beta_2^{ct}$ and $\beta_3^{ct}$ terms cancel out, leaving only the effect of $\beta_1^{ct}$.
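Before chasing down where those terms cancel, the equivalence of the five parameterizations, and the convenience of term-wise centering, can be checked numerically. Here is a minimal sketch with simulated, deliberately correlated data; the helper `fit_and_predict`, the constants, and the seed are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(4, 1.5, n)
x2 = 0.6 * x1 + rng.normal(0, 2, n)      # correlated on purpose, so cov(x1, x2) != 0
y = 2.0 - 1.0 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, n)

def fit_and_predict(*cols):
    """OLS with an intercept; returns (coefficients, fitted values)."""
    X = np.column_stack([np.ones(n), *cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, X @ b

x12 = x1 * x2
z1 = (x1 - x1.mean()) / x1.std()
z2 = (x2 - x2.mean()) / x2.std()
z12 = (x12 - x12.mean()) / x12.std()     # term-wise standardized interaction

b_orig, yhat1 = fit_and_predict(x1, x2, x12)        # (1) original scales
b_zv, yhat4 = fit_and_predict(z1, z2, z1 * z2)      # (4) standardized, variable-wise
b_zt, yhat5 = fit_and_predict(z1, z2, z12)          # (5) standardized, term-wise

# Different betas, identical predictions, hence identical residuals and R^2
print(np.allclose(yhat1, yhat4), np.allclose(yhat1, yhat5))  # True True

# Term-wise centering: intercept becomes mean(y); other coefficients unchanged
b_ct, _ = fit_and_predict(x1 - x1.mean(), x2 - x2.mean(), x12 - x12.mean())
print(np.isclose(b_ct[0], y.mean()), np.allclose(b_ct[1:], b_orig[1:]))  # True True
```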
Suppose we seek the point at which both the $\beta_2^{ct}$ and $\beta_3^{ct}$ terms cancel out. For the $\beta_2^{ct}$ term to vanish, we need $x_2 = \mu_{x_2}$. Then the product term is

$$(x_1 \mu_{x_2})^c = x_1 \mu_{x_2} - (\mu_{x_1} \mu_{x_2} + \mathrm{cov}(x_1, x_2))$$

and for this to be zero,

$$x_1 = \frac{\mu_{x_1} \mu_{x_2} + \mathrm{cov}(x_1, x_2)}{\mu_{x_2}}$$

In other words, the parameter $\beta_1^{ct}$ is the effect of $x_1^c$ at the point

$$(x_1, x_2) = \left( \frac{\mu_{x_1} \mu_{x_2} + \mathrm{cov}(x_1, x_2)}{\mu_{x_2}},\ \mu_{x_2} \right)$$

Rewritten in terms of our centered variables, this is the point

$$(x_1^c, x_2^c) = \left( \frac{\mathrm{cov}(x_1, x_2)}{\mu_{x_2}},\ 0 \right)$$

Because this is a specific point, we can further calculate the predicted value of $y$:

$$\hat{y} = \mu_y + \beta_1^{ct} \left( \frac{\mathrm{cov}(x_1, x_2)}{\mu_{x_2}} \right)$$

A parallel derivation, exchanging the roles of $x_1$ and $x_2$, isolates the parameter $\beta_2^{ct}$.

Now consider the predicted value of $y$ where $x_1^c = x_2^c = 0$, i.e. at $x_1 = \mu_{x_1}$ and $x_2 = \mu_{x_2}$:

$$\hat{y} = \beta_0^{ct} + \beta_1^{ct} \cdot 0 + \beta_2^{ct} \cdot 0 + \beta_3^{ct} (x_1 x_2)^c = \mu_y + \beta_3^{ct} (x_1 x_2)^c$$

But

$$(x_1 x_2)^c = (\mu_{x_1} \mu_{x_2})^c = \mu_{x_1} \mu_{x_2} - \mu_{x_1 x_2} = \mu_{x_1} \mu_{x_2} - (\mu_{x_1} \mu_{x_2} + \mathrm{cov}(x_1, x_2)) = -\mathrm{cov}(x_1, x_2)$$

Returning to the predicted value of $y$, we have

$$\hat{y} = \mu_y - \beta_3^{ct} \, \mathrm{cov}(x_1, x_2)$$

and our hidden constant makes its appearance.

Interestingly, we see that all three parameters take on their individual meaning at specific points, and not as general "effects" across sets of points. We also note that $\beta_0^{ct}$ never appears in isolation if $\mathrm{cov}(x_1, x_2) \neq 0$.

But if these are not effects, what would be the effect of $x_1^c$ or $x_2^c$? Because there is an interaction term, both effects are conditional. Suppose we let $x_2^c = a$ and pursue the effect of $x_1^c$ (recall that $\beta_i^{ct} = \beta_i$ for the non-intercept terms):

$$\hat{y} = \mu_y + \beta_1 x_1^c + \beta_2 a + \beta_3 \big( (x_1^c + \mu_{x_1})(a + \mu_{x_2}) \big)^c$$

$$= \mu_y + \beta_1 x_1^c + \beta_2 a + \beta_3 \big( x_1^c (a + \mu_{x_2}) + \mu_{x_1} (a + \mu_{x_2}) - \mu_{x_1 x_2} \big)$$

$$= \beta_0 + \beta_1 \mu_{x_1} + \beta_2 (a + \mu_{x_2}) + \beta_3 \big( \mu_{x_1} (a + \mu_{x_2}) \big) + \big( \beta_1 + \beta_3 (a + \mu_{x_2}) \big) x_1^c$$

where the last step substitutes $\mu_y = \beta_0 + \beta_1 \mu_{x_1} + \beta_2 \mu_{x_2} + \beta_3 \mu_{x_1 x_2}$. Then the effect of $x_1^c$ at $x_2^c = a$ will be $\beta_1 + \beta_3 (a + \mu_{x_2})$.
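Both observations, the hidden constant at the means and the conditional effect $\beta_1 + \beta_3(a + \mu_{x_2})$, can be verified numerically. Here is a minimal sketch with simulated correlated data; the helper `predict`, the constants, and the seed are illustrative assumptions, and the covariance is computed in its population form so that $\mu_{x_1 x_2} = \mu_{x_1}\mu_{x_2} + \mathrm{cov}(x_1, x_2)$ holds exactly in the sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(5, 2, n)
x2 = 0.8 * x1 + rng.normal(0, 3, n)      # cov(x1, x2) != 0 on purpose
y = 1.0 + 0.4 * x1 - 0.7 * x2 + 0.2 * x1 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

mu1, mu2 = x1.mean(), x2.mean()
cov12 = np.mean(x1 * x2) - mu1 * mu2     # population covariance: mu_{x1 x2} = mu1*mu2 + cov12

def predict(a1, a2):
    """Fitted value at the point (x1, x2) = (a1, a2)."""
    return b0 + b1 * a1 + b2 * a2 + b3 * a1 * a2

# Hidden constant: the prediction at the means is mu_y - b3 * cov(x1, x2)
print(np.isclose(predict(mu1, mu2), y.mean() - b3 * cov12))  # True

# Conditional effect of x1 at x2^c = a, i.e. at x2 = a + mu2
a = 1.5
slope = predict(mu1 + 1.0, a + mu2) - predict(mu1, a + mu2)  # one-unit change in x1
print(np.isclose(slope, b1 + b3 * (a + mu2)))                # True
```

Because the non-intercept coefficients are identical under term-wise centering, the check uses the original-scale fit directly.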