TESTS FOR TRANSFORMATIONS AND ROBUST REGRESSION

Anthony Atkinson, 25th March 2014
Joint work with Marco Riani, Parma

Department of Statistics, London School of Economics, London WC2A 2AE, UK
a.c.atkinson@lse.ac.uk

Transformation of Data

- We may need to transform y. For regression we hope for a simpler model, homoscedasticity and approximate normality. Examples: survival time, volume, viscosity
- Power transformations are often used: √y, y^{1/3}, 1/y
- Box and Cox (1964) analyse the normalized power transformation

      z(λ) = (y^λ − 1)/(λ ẏ^{λ−1})   λ ≠ 0
      z(λ) = ẏ log y                  λ = 0,

  where the geometric mean of the observations is written as ẏ = exp(Σ log y_i / n)
- The family includes the log transformation (λ = 0); λ = 1 is untransformed data and λ = −1 the reciprocal
- Box and Cox use a likelihood ratio test for λ = λ0

Tests for Data Transformation

- The residual sum of squares of the z(λ) is

      R(λ) = z(λ)^T (I − H) z(λ) = z(λ)^T A z(λ),

  where H = X(X^T X)^{−1} X^T is the 'hat' matrix
- The Box-Cox LR test for λ = λ0 is

      T_LR = n log{R(λ0)/R(λ̂)},

  requiring λ̂, which has to be found numerically
- The value of λ̂ is usually only an indication of the transformation; y^{0.1379...} is unlikely to be used
- Several score tests have been suggested, requiring only quantities calculated at λ0

Score Tests for Data Transformation

- A computationally simple alternative to the LR test is an approximate score statistic derived from a Taylor series
- Expansion about λ0 yields

      z(λ) ≈ z(λ0) + (λ − λ0) w(λ0),   where   w(λ0) = ∂z(λ)/∂λ evaluated at λ = λ0

- Combined with the regression model z(λ) = x^T β + ε this gives

      z(λ0) = x^T β − (λ − λ0) w(λ0) + ε = x^T β + γ w(λ0) + ε

- This is another regression model with an extra variable w(λ0), the constructed variable for the transformation
- Testing γ = 0 tests λ = λ0 (a short computational sketch is given after these slides)

Score Tests for Data Transformation 2

- The numerator of the approximate score statistic Tp is z(λ0)^T A w(λ0); the variance comes from the regression on X and w(λ0)
- Lawrance (1987) uses the same numerator but standardises the statistic with an approximation to the information, which requires second derivatives of z(λ). This variance can be seen as an improvement on that in Tp
- Atkinson and Lawrance (1989) compare six statistics, additionally including the signed square root of the LR test and two Wald tests
- The Wald tests have distributions far from N(0, 1); otherwise the behaviour depends on the example
- The tests have very similar power when adjusted for size (so the choice between them is perhaps of little practical importance)
- However, simulations in Lawrance (1987) in the absence of regression show that Tp does not have an asymptotic N(0, 1) distribution. (Why?)

Aggregate Statistics and Departures from Assumptions

- These are all aggregate statistics
- The effect of a single outlier can be found using deletion statistics (computationally cheap)
- But there may be many aberrant observations: dispersed outliers, a second population, systematic departures from distributional assumptions, ...
- For a general test of departures, divide the data into two groups, the m observations in one of them believed to be uncontaminated
- Estimate the parameters from these m observations
- Calculate test statistics using these estimates: residuals, tests of transformation, ...
- How to divide the data into these two groups? We'll use the Forward Search
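To make the constructed-variable test concrete, here is a minimal Python sketch. It is not the implementation behind the talk: the function names z_bc and score_test_tp are mine, and w(λ0) is obtained by a central difference rather than the analytic derivative of z(λ).

```python
import numpy as np

def z_bc(y, lam):
    """Normalized Box-Cox transformation z(lambda) of a positive response y."""
    gm = np.exp(np.mean(np.log(y)))                 # geometric mean, y-dot
    if abs(lam) < 1e-12:
        return gm * np.log(y)                       # limit as lambda -> 0
    return (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def score_test_tp(y, X, lam0, h=1e-5):
    """Approximate score statistic Tp for the hypothesis lambda = lam0.

    w(lam0) = dz/dlambda at lam0 is approximated by a central difference;
    Tp is the t-statistic for the constructed variable in the regression
    of z(lam0) on the augmented matrix [X, w(lam0)].
    """
    z = z_bc(y, lam0)
    w = (z_bc(y, lam0 + h) - z_bc(y, lam0 - h)) / (2.0 * h)
    Xw = np.column_stack([X, w])
    beta, *_ = np.linalg.lstsq(Xw, z, rcond=None)
    resid = z - Xw @ beta
    n, k = Xw.shape
    s2 = resid @ resid / (n - k)                    # residual mean square
    cov = s2 * np.linalg.inv(Xw.T @ Xw)             # covariance of beta-hat
    return beta[-1] / np.sqrt(cov[-1, -1])          # t-statistic for gamma
```

With X containing a column of ones, score_test_tp(y, X, -1.0) would, for example, test the reciprocal transformation; a value well outside the N(0, 1) range is evidence against λ0.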
The Forward Search

- The Forward Search (FS) fits the model to increasing numbers of observations. Starting from an outlier-free subset, outliers enter at the end of the search
- For regression, the subset of observations S∗(m) of size m yields least squares estimates β̂(m) and s²(m), giving n least squares residuals e_i(m)
- The search moves forward: S∗(m+1) contains the observations with the m + 1 smallest absolute values of e_i(m)
- The search starts from a subset of p observations chosen by least median of squares (LMS)
- We may need simulation to interpret the results

Poison Data

- Example: Box and Cox poison data - 48 observations with response survival time
- Non-negative y values with range 0.18 to 1.24
- Normal errors of constant variance are unlikely to hold
- Box and Cox suggest the reciprocal transformation, λ = −1

Poison Data 2

[Figure: forward plot of the score test statistic Tp(−1) against subset size m, with 90%, 95% and 99% simulation envelopes using parameter estimates β̂(n)]

Distribution of Statistic

- What is the null distribution of Tp(λ0)?
- The ordering is of residual values of y without regression on w(λ0)
- If w were a new regressor, instead of w(λ0), the distribution would be t
- The normal approximation is given on the plot
- But w(λ0) is a function of y

Poison Data 3

[Figure: forward plot of Tp(−1) against subset size m, with simulation envelopes]

- The inverse transformation is supported by all the data; the death rate has a simple structure
- The normal approximation is mostly excellent
- There is slight asymmetry and a "trumpet" effect in the envelopes as m → n

Multiply Modified Poison Data

- The preceding just considers the null distribution of the statistic
- To demonstrate the use of the FS, four observations were changed. As a result of these fabricated outliers, a value of λ = 1/3 was indicated as the transformation for the data
- For the reciprocal transformation, Tp(−1) = 22.08
- The effect can be seen by looking at forward plots of Tp(λ0) for a set of values of λ0, called a "fan plot" (one trajectory is sketched in code below)

Multiply Modified Poison Data 2

[Figure: fan plot for λ0 = −1, −0.5, 0, 0.5 and 1. The effect of the four outliers on the otherwise correct transformation, λ = −1, is clearly revealed]

Null Distribution

- The simulation results of Lawrance (1987) for a simple sample, that is without regression, showed that Tp does not have an asymptotic N(0, 1) distribution
- In Atkinson and Lawrance (1989) the numbers of observations were around 40
- Forward plots of more examples suggested that the higher the value of R², the smaller the departure from normality
- Look at simulation envelopes for small effects
- Recall that, for outlier detection, the distribution of the residuals does not depend on β, just on H
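The following sketch strings these pieces together into one fan-plot trajectory: a minimal forward search on z(λ0), recording Tp(λ0) at each subset size. The name fan_trajectory is mine; it reuses z_bc and score_test_tp from the earlier sketch, the LMS start is approximated crudely by the best of many random p-subsets, and the residual ordering uses regression on X only, as described above. No simulation envelopes are computed.

```python
import numpy as np

def fan_trajectory(y, X, lam0, n_lms_trials=1000, seed=0):
    """One fan-plot trajectory: forward search on z(lam0), with Tp(lam0)
    recorded at each subset size m (reuses z_bc and score_test_tp)."""
    rng = np.random.default_rng(seed)
    z = z_bc(y, lam0)                  # search runs on the transformed response
    n, p = X.shape
    best, S = np.inf, None
    for _ in range(n_lms_trials):      # crude LMS start: best random p-subset
        idx = rng.choice(n, size=p, replace=False)
        b, *_ = np.linalg.lstsq(X[idx], z[idx], rcond=None)
        med = np.median((z - X @ b) ** 2)
        if med < best:
            best, S = med, idx
    tp = {}
    for m in range(p, n + 1):
        b, *_ = np.linalg.lstsq(X[S], z[S], rcond=None)
        if m > p + 1:                  # Tp needs residual degrees of freedom
            tp[m] = score_test_tp(y[S], X[S], lam0)
        if m < n:                      # S*(m+1): m + 1 smallest |e_i(m)|
            S = np.argsort(np.abs(z - X @ b))[: m + 1]
    return tp

# Plotting tp[m] against m for lam0 in (-1, -0.5, 0, 0.5, 1) gives a fan plot.
```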
Null Distribution 2

[Figure: two forward plots of the score test statistic against subset size m, with simulation envelopes]

- Poison data. Left-hand panel: envelopes from β̂; R² = 0.85
- Right-hand panel: envelopes from β̂/10; average R² from the simulations 0.15
- There is a marked decrease in normality towards the end of the search for weak regression

Fitting a Constant to the Poison Data

[Figure: three panels - forward plot of the score test statistic, and added variable plots (residual response against residual constructed variable) at m = 39 and m = 48]

- Poison data with λ = −1 when only a constant is fitted
- (a) forward plot of Tp(−1)
- (b) added variable plot for m = n − 9
- (c) added variable plot for all the data: the filled symbols are the last observations to enter (some symbols overlap)

Simulated data with no structure

[Figure: three panels - forward plot of the score test statistic, and added variable plots at m = 47 and m = 50]

- Simple random sample, fitted model with three random regressors
- (a) forward plot of Tp(1), increasing at the end - but the data do not need transformation
- (b) symmetrical added variable plot for m = n − 3 showing a jittered elliptical structure
- (c) added variable plot for all the data. The filled symbols are the last observations to enter

Uses of the Forward Search

- The purpose of the FS is to provide robust methods of data analysis, both in the presence of outliers and with data coming from more than one model
- Imports into the EU of a wide variety of goods are liable to tax fraud and money laundering through misreporting of transaction values and quantities
- The EU Joint Research Centre has a programme monitoring such data
- There are huge quantities of data, even if each commodity should have a simple structure
- There may also be non-fraudulent misrecording or incorrect classifications

International trade data example

- n = 677 monthly aggregates of EU import flows of a fishery product (y = value, X = quantity)
- Fan plot; dynamic visualization; dynamic link from the fan plot to the yXplot

Clustering with the Forward Search

- The main tool in clustering multivariate data is the detection of outliers
- For a single population, start from a robustly chosen subset of m0 observations. The subset is increased from size m to m + 1 by forming the new subset from the observations with the m + 1 smallest squared Mahalanobis distances
- For each m, test for the presence of outliers
- With data coming from two or more populations, a search starting from a subset of observations in one of the clusters identifies some observations from other clusters as outliers
- Now there is no simple robust path to an initial subset. Instead use randomly selected initial subsets. The resulting searches indicate the number and membership of the clusters
- Then refine cluster membership

Outlier Detection

- The subset of m observations yields estimates μ̂(m) and Σ̂(m)
- Hence n squared Mahalanobis distances

      d_i²(m) = {y_i − μ̂(m)}^T Σ̂^{−1}(m) {y_i − μ̂(m)},   i = 1, ..., n

- Outliers are detected by the minimum Mahalanobis distance amongst observations not in the subset:

      d_min(m) = min d_i(m) over i ∉ S∗(m)

- We need a reference distribution for d_i²(m) and hence for d_min(m)
- If Σ were estimated from all n observations, the statistic would have an F distribution. The FS selects the central m out of the n observations for Σ̂(m); a consistency factor makes the estimate approximately unbiased
- An order-statistic argument gives the distribution of d_min(m), so simulation is not needed for the reference distribution (one forward step is sketched in code below)
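A minimal sketch of one step of this multivariate search, under my own naming (fs_multivariate_step) and with the consistency factor for Σ̂(m) omitted for brevity:

```python
import numpy as np

def fs_multivariate_step(Y, S):
    """One forward-search step for multivariate data Y (n x v).

    Estimates mu-hat(m) and Sigma-hat(m) from the current subset S,
    computes all n squared Mahalanobis distances d_i^2(m), and returns
    d_min(m) together with the next subset of size m + 1.
    """
    n, m = len(Y), len(S)
    mu = Y[S].mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(Y[S], rowvar=False))  # needs m > v
    diff = Y - mu
    d2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)   # d_i^2(m), all i
    outside = np.setdiff1d(np.arange(n), S)
    d_min = np.sqrt(d2[outside].min())      # min distance outside the subset
    S_next = np.argsort(d2)[: m + 1]        # m + 1 smallest squared distances
    return d_min, S_next
```

Because S_next is chosen afresh from all n distances, a unit can leave the subset as well as join it, which is what lets the random-start searches below be attracted to cluster centres.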
Outlier Detection 2

- As the search progresses, we perform a series of outlier tests, one for each m ≥ m0
- To allow for multiple testing, we use an outlier detection rule depending on the sample size and on the calculated envelopes for the distribution of the test statistic

Random Start Forward Search

- We run many forward searches from randomly selected starting points, monitoring the evolution of the values of d_min(m) for each search (sketched in code at the end of these slides)
- Because the search can drop units from the subset as well as adding them, some searches are attracted to cluster centres
- The random start trajectories converge, with subsets containing the same units. Once trajectories have converged, they cannot diverge again
- The searches are rapidly reduced to only a few trajectories, which provide information on the number and membership of the clusters

An Example of García-Escudero et al.

[Figure: scatterplot of the simulated data]

- A simulated data set of 1,800 bivariate normal observations plus 200 outliers
- The plot appears to show one clear tight cluster, one moderately clearly defined cluster, a third more dispersed cluster and a background scatter
- Unlike some traditional methods, we do not have to cluster all the observations

Preliminary Cluster Identification

[Figure: four panels - trajectories of d_min(m) against subset size m, and three scatterplots with the preliminary groups highlighted]

- The trajectories of minimum Mahalanobis distances d_min(m) from 500 random start forward searches
- Top right-hand panel: scatterplot with preliminary Group 1 highlighted; the first 420 units from the sharpest peak
- Bottom left-hand panel: preliminary Group 2; the first 490 units from the lowest trajectory
- Bottom right-hand panel: preliminary Group 3; the first 780 units from the central peak

Cluster Confirmation

- Start within each tentative cluster j. Run the FS on all the data, monitoring the bounds for all n observations until we obtain a "signal" indicating that observation m†_j, and therefore succeeding observations, may be outliers
- Cluster j contains an unknown number n_j of observations
- To judge the values of the statistics against envelopes from n_j, we superimpose envelopes for values of n from m†_j − 1 onwards, until the first outlier is found, so establishing the cluster size n_j

Confirming Group 1

[Figure: four forward plots - d_min(m), then d_min(m, 243), d_min(m, 390) and d_min(m, 397), the last showing d_min(375, 397) > 99.9% envelope]

- Top left-hand panel: forward plot of minimum Mahalanobis distances d_min(m) starting with units believed to be in Group 1; signal at m† = 244
- Succeeding panels: distances for n = 243, 390 and 397. 396 units are assigned to the group

Confirming Group 2

[Figure: two forward plots - d_min(m) and d_min(m, 661), the latter showing d_min(597, 661) > 99.9% envelope]

- Because this group is relatively dispersed compared with Group 1, a FS starting from Group 2 will absorb many units from the compact group
- Analyse the 836 unassigned units
- There is a signal at m = 543
- Superimpose envelopes from n = 542. At n = 661 an outlier is indicated at m = 597
- Units (from a different group?) make the last part of the trace flat
- Removal of these units leaves 656 units in the group
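A brute-force sketch of the random-start machinery, reusing fs_multivariate_step from the earlier sketch. For 2,000 observations and 500 starts this naive version would be slow; it is only meant to show the logic of collecting the d_min(m) trajectories:

```python
import numpy as np

def random_start_trajectories(Y, m0=10, n_starts=500, seed=0):
    """d_min(m) trajectories from many random-start forward searches.

    Each step reselects the m + 1 closest units, so a search can drop
    units as well as add them; searches attracted to the same cluster
    converge to identical subsets and hence identical trajectories.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    trajs = np.empty((n_starts, n - m0))
    for s in range(n_starts):
        S = rng.choice(n, size=m0, replace=False)   # random initial subset
        for j, m in enumerate(range(m0, n)):
            d_min, S = fs_multivariate_step(Y, S)   # d_min at subset size m
            trajs[s, j] = d_min
    return trajs   # plot each row against m = m0, ..., n - 1
```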
Final FS Clustering

[Figure: left-hand panel, scatterplot of Y2 against Y1 for the three groups; right-hand panel, histogram of the classification of Y2]

- Left-hand panel: scatterplot of the three groups and the background contamination
- Right-hand panel: histogram of the classification of y2 values; the contamination is shown in blue
- For central values of y2, virtually all these observations have been included in one of the groups

References

Atkinson, A. C. and Lawrance, A. J. (1989). A comparison of asymptotically equivalent tests of regression transformation. Biometrika, 76, 223–229.

Lawrance, A. J. (1987). The score statistic for regression transformation. Biometrika, 74, 275–289.