TESTS FOR TRANSFORMATIONS
AND
ROBUST REGRESSION
Anthony Atkinson, 25th March 2014
Joint work with Marco Riani, Parma
Department of Statistics
London School of Economics
London WC2A 2AE, UK
a.c.atkinson@lse.ac.uk
Transformation of Data

- May need to transform y. For regression hope for simpler model, homoscedasticity and approximate normality
- Examples: survival time, volume, viscosity
- Power transformations are often used: √y, y^{1/3}, 1/y
- Box and Cox (1964) analyse the normalized power transformation (a code sketch follows this slide)

  z(λ) = (y^λ − 1)/(λ ẏ^{λ−1})   λ ≠ 0
  z(λ) = ẏ log y                 λ = 0,

  where the geometric mean of the observations is written as ẏ = exp(Σ log y_i / n)
- Includes the log transformation. Untransformed data: λ = 1; reciprocal: λ = −1
- They use a likelihood ratio test for λ = λ0
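A minimal numpy sketch of z(λ), to fix ideas; the function name boxcox_z is ours, not from the paper, and λ = 0 must be passed exactly to get the log case:

```python
import numpy as np

def boxcox_z(y, lam):
    """Normalized Box-Cox transformation z(lambda).

    Dividing by lam * ydot**(lam - 1), with ydot the geometric mean,
    makes residual sums of squares comparable across values of lambda.
    """
    y = np.asarray(y, dtype=float)
    ydot = np.exp(np.mean(np.log(y)))        # geometric mean of the y_i
    if lam == 0:
        return ydot * np.log(y)              # limiting case lambda = 0
    return (y**lam - 1.0) / (lam * ydot**(lam - 1.0))
```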
Tests for Data Transformation

- The residual sum of squares of the z(λ) is

  R(λ) = z(λ)ᵀ(I − H)z(λ) = z(λ)ᵀA z(λ),

  where H = X(XᵀX)⁻¹Xᵀ is the ‘hat’ matrix
- The Box-Cox LR test for λ = λ0 is

  T_LR = n log{R(λ0)/R(λ̂)},

  requiring λ̂, which has to be found numerically (sketched in code below)
- The value of λ̂ is usually only an indication of the transformation; y^{0.1379...} is unlikely
- Several score tests have been suggested, requiring quantities calculated only at λ0
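A sketch of R(λ) and the LR test, reusing boxcox_z from above; scipy is assumed available, and the bounded search interval for λ̂ is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rss(y, X, lam):
    """R(lambda): residual sum of squares of z(lambda) regressed on X."""
    z = boxcox_z(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    e = z - X @ beta
    return e @ e

def boxcox_lr_test(y, X, lam0, bounds=(-2.0, 2.0)):
    """T_LR = n log{R(lam0)/R(lamhat)}, with lamhat found numerically.

    Under H0: lambda = lam0 the statistic is asymptotically chi-squared
    on 1 degree of freedom. The interval `bounds` is our assumption.
    """
    opt = minimize_scalar(lambda l: rss(y, X, l), bounds=bounds,
                          method="bounded")
    t_lr = len(y) * np.log(rss(y, X, lam0) / opt.fun)
    return t_lr, opt.x
```

For the poison data, calling boxcox_lr_test(y, X, lam0=-1) would give a test of the reciprocal transformation.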
Score Tests for Data Transformation

- A computationally simple alternative to the LR test is an approximate score statistic derived from a Taylor series
- Expansion yields

  z(λ) ≈ z(λ0) + (λ − λ0) w(λ0),

  where

  w(λ0) = ∂z(λ)/∂λ evaluated at λ = λ0
- Together with the regression model z(λ) = xᵀβ + ε this gives

  z(λ0) = xᵀβ − (λ − λ0) w(λ0) + ε = xᵀβ + γ w(λ0) + ε
- Another regression model, with an extra variable w(λ0): the constructed variable for the transformation
- Testing γ = 0 tests λ = λ0 (a code sketch follows)
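A sketch of the constructed variable and the resulting added-variable t test, again reusing boxcox_z; the derivative is taken numerically rather than from the analytic expression, and the sign convention may differ from that of Tp in the papers:

```python
import numpy as np

def constructed_variable(y, lam0, eps=1e-5):
    """w(lam0) = dz(lambda)/dlambda at lam0, by central finite difference."""
    return (boxcox_z(y, lam0 + eps) - boxcox_z(y, lam0 - eps)) / (2.0 * eps)

def constructed_variable_t_test(y, X, lam0):
    """t statistic for gamma in z(lam0) = X beta + gamma w(lam0) + error.

    A significant gamma is evidence against lambda = lam0.
    """
    z = boxcox_z(y, lam0)
    Xw = np.column_stack([X, constructed_variable(y, lam0)])
    n, k = Xw.shape
    beta, *_ = np.linalg.lstsq(Xw, z, rcond=None)
    e = z - Xw @ beta
    s2 = (e @ e) / (n - k)                    # residual mean square
    cov = s2 * np.linalg.inv(Xw.T @ Xw)       # covariance of beta-hat
    return beta[-1] / np.sqrt(cov[-1, -1])    # t statistic for gamma
```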
Score Tests for Data Transformation 2
I The numerator of the approximate score statistic Tp is z(λ0 )T Aw(λ0 ).
The variance comes from regression on X and w(λ0 ).
I Lawrance (1987) uses the same numerator but with an approximation to
the information to standardise the statistic - this requires second
derivatives of z(λ). The variance can be seen as an improvement of
that in Tp .
I Atkinson and Lawrance (1989) compare six statistics, additionally
including the signed square root of the LR test and two Wald tests
I The Wald tests have distributions far from N (0, 1). Otherwise the
behaviour depends on the example.
I Tests have very similar power when adjusted for size (perhaps not much
practical use)
I However, simulations in Lawrance (1987) in the absence of regression
show that Tp does not have an asymptotic N (0, 1) distribution. (why?)
Aggregate Statistics and Departures from Assumptions

- These are all aggregate statistics
- The effect of a single outlier can be found using deletion statistics (computationally cheap)
- But there may be many aberrant observations: dispersed outliers, a second population, systematic departures from distributional assumptions, ...
- For a general test of departures, divide the data into two groups, m observations believed to be uncontaminated
- Estimate the parameters from these m observations
- Calculate test statistics using these estimates: residuals, tests of transformation, ...
- How to divide the data into these two groups?
- We’ll use the Forward Search
The Forward Search

- The Forward Search (FS) fits the model to increasing numbers of observations. Starting from an outlier-free subset, outliers enter at the end of the search
- For regression, the subset of observations S∗(m) of size m yields least squares estimates β̂(m) and s²(m), giving n least squares residuals e_i(m)
- The search moves forward with S∗(m+1) containing the observations with the m + 1 smallest absolute values of e_i(m) (a code sketch follows)
- The search starts from a subset of p observations chosen by LMS
- We may need simulation to interpret the results
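A minimal sketch of one such search for regression; the initial subset is passed in rather than computed by LMS, and the model is refitted from scratch at each step:

```python
import numpy as np

def forward_search_regression(y, X, initial_idx):
    """One regression forward search, refitting naively at each step.

    Returns, for each subset size m, the subset used and all n residuals
    e_i(m). Note that the argsort step can drop units as well as add them.
    """
    n = len(y)
    subset = np.asarray(initial_idx)
    steps = []
    while True:
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        e = y - X @ beta                          # residuals for all n units
        steps.append((len(subset), subset.copy(), e))
        if len(subset) == n:
            return steps
        # S*(m+1): the m + 1 observations with smallest |e_i(m)|
        subset = np.argsort(np.abs(e))[:len(subset) + 1]
```

Forward plots of statistics such as Tp(λ0) can then be read off the recorded steps.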
Poison Data

- Example: Box and Cox poison data - 48 observations with response survival time
- Non-negative y values with range 0.18 to 1.24
- Normal errors of constant variance are unlikely to hold
- B & C suggest the reciprocal transformation, λ = −1
Poison Data 2

[Figure: forward plot of the score test statistic Tp(−1) against subset size m (10–40), with 90%, 95% and 99% simulation envelopes using parameter estimates β̂(n)]
Distribution of Statistic

- What is the null distribution of Tp(λ0)?
- The ordering is of residual values of y without regression on w(λ0)
- If w were a new regressor, instead of w(λ0), the distribution would be t
- Normal approximation given on the plot
- But w(λ0) is a function of y
Poison Data 3

[Figure: forward plot of the score test statistic Tp(−1) against subset size m (10–40) with simulation envelopes]

- Inverse transformation supported by all the data; death rate has a simple structure
- Normal approximation mostly excellent
- Slight asymmetry and “trumpet” effect in envelopes as m → n
Multiply Modified Poison Data

- The preceding just considers the null distribution of the statistic
- To demonstrate the use of the FS, four observations were changed. As a result of these fabricated outliers, a value of λ = 1/3 was indicated as the transformation for the data
- For the reciprocal transformation, Tp(−1) = 22.08
- The effect can be seen by looking at forward plots of Tp(λ0) for a set of values of λ0, called a “fan plot”
Multiply Modified Poison Data 2

[Figure: fan plot of the score test statistic against subset size m (10–50) for λ0 = −1, −0.5, 0, 0.5 and 1. The effect of the four outliers on the otherwise correct transformation, λ = −1, is clearly revealed]
Null Distribution

- The simulation results of Lawrance (1987) for a simple sample, that is without regression, showed that Tp does not have an asymptotic N(0, 1) distribution
- In Atkinson and Lawrance (1989) the numbers of observations were around 40
- Forward plots of more examples suggested that the higher the value of R², the smaller the departure from normality
- Look at simulation envelopes for small effects
- Recall that, for outlier detection, the distribution of residuals does not depend on β, just on H
Null Distribution 2

[Figure: two forward plots of the score test statistic against subset size m (10–40) with simulation envelopes]

- Poison data. Left-hand panel: envelopes from β̂; R² = 0.85
- Right-hand panel: envelopes from β̂/10. Average R² from simulations 0.15
- There is a marked decrease in normality towards the end of the search for weak regression
Fitting a Constant to the Poison Data

[Figure: three panels - a forward plot of the score test statistic against subset size m, and two added variable plots of residual response against residual constructed variable, for m = 39 and m = 48]

- Poison data with λ = −1 when only a constant is fitted
- (a) forward plot of Tp(−1)
- (b) added variable plot for m = n − 9
- (c) added variable plot for all the data: the filled symbols are the last observations to enter (some symbols overlap)
Simulated data with no structure

[Figure: three panels - a forward plot of the score test statistic against subset size m (20–50), and two added variable plots of residual response against residual constructed variable, for m = 47 and m = 50]

- Simple random sample, fitted model with three random regressors
- (a) forward plot of Tp(1) increasing at the end - but the data do not need transformation
- (b) symmetrical added variable plot for m = n − 3 showing jittered elliptical structure
- (c) added variable plot for all the data. The filled symbols are the last observations to enter
Uses of the Forward Search

- The purpose of the FS is to provide robust methods of data analysis, both in the presence of outliers and with data coming from more than one model
- Imports into the EU of a wide variety of goods are liable to tax fraud and money laundering due to misreporting of transaction values and quantities
- The EU Joint Research Centre has a programme monitoring such data
- There are huge quantities of data, even if each commodity should have a simple structure
- There may also be non-fraudulent misrecording or incorrect classifications
International trade data example

- n = 677 monthly aggregates of EU import flows of a fishery product (y = value, X = quantity)
- Fan plot
- Dynamic visualization: dynamic link from the fan plot to the yXplot
Clustering with the Forward Search

- The main tool in clustering multivariate data is the detection of outliers
- For a single population, start from a robustly chosen subset of m0 observations. The subset is increased from size m to m + 1 by forming the new subset from the observations with the m + 1 smallest squared Mahalanobis distances
- For each m, test for the presence of outliers
- With data coming from two or more populations, starting from a subset of observations in one of the clusters, some observations from other clusters are identified as outliers
- Now there is no simple robust path to an initial subset. Instead use randomly selected initial subsets. The resulting searches indicate the number and membership of clusters
- Then refine cluster membership
Outlier Detection

- The subset of m observations yields estimates μ̂(m) and Σ̂(m)
- Hence n squared Mahalanobis distances

  d_i²(m) = {y_i − μ̂(m)}ᵀ Σ̂⁻¹(m) {y_i − μ̂(m)},  i = 1, ..., n
- Outliers are detected by the minimum Mahalanobis distance amongst observations not in the subset:

  d_min(m) = min d_i(m), i ∉ S∗(m)
- Need a reference distribution for d_i²(m) and hence for d_min(m)
- If Σ were estimated from all n observations, the statistic would have an F distribution. The FS selects the central m out of n observations for Σ̂(m); use a consistency factor to make the estimate approximately unbiased
- Use an order-statistic argument for the distribution of d_min(m), so simulation is not needed for the reference distribution (one step of the search is sketched in code below)
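A sketch of a single step of this multivariate search; the consistency factor and the order-statistic reference distribution from the actual methodology are omitted:

```python
import numpy as np

def dmin_step(Y, subset):
    """One step of the multivariate forward search.

    Returns d_min(m), the minimum Mahalanobis distance among units outside
    the subset, and the next subset of size m + 1 (smallest distances).
    """
    mu = Y[subset].mean(axis=0)
    Sigma = np.cov(Y[subset], rowvar=False)       # consistency factor omitted
    diff = Y - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    outside = np.setdiff1d(np.arange(len(Y)), subset)
    d_min = np.sqrt(d2[outside].min())
    return d_min, np.argsort(d2)[:len(subset) + 1]
```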
Outlier Detection 2

- As the search progresses, we perform a series of outlier tests, one for each m ≥ m0
- To allow for multiple testing, we use an outlier detection rule depending on the sample size and on the calculated envelopes for the distribution of the test statistic
Random Start Forward Search

- We run many forward searches from randomly selected starting points, monitoring the evolution of the values of d_min(m) for each search (a sketch follows)
- Because the search can drop units from the subset as well as adding them, some searches are attracted to cluster centres
- The random start trajectories converge, with subsets containing the same units. Once trajectories have converged, they cannot diverge again
- The search is rapidly reduced to only a few trajectories, which provide information on the number and membership of the clusters
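A brute-force sketch of the random-start strategy, reusing dmin_step from the previous slide; the default of 500 starts matches the example below, though this naive version would be slow on 2,000 observations:

```python
import numpy as np

def random_start_trajectories(Y, n_starts=500, m0=None, seed=None):
    """Collect d_min(m) trajectories from many randomly started searches.

    Converged searches share the same subsets, hence identical tails,
    so only a few distinct trajectories remain late in the search.
    """
    rng = np.random.default_rng(seed)
    n, v = Y.shape
    if m0 is None:
        m0 = v + 1                                # smallest usable subset
    trajectories = []
    for _ in range(n_starts):
        subset = rng.choice(n, size=m0, replace=False)
        traj = []
        while len(subset) < n:
            d_min, subset = dmin_step(Y, subset)  # step from previous slide
            traj.append(d_min)
        trajectories.append(traj)
    return trajectories
```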
An Example of García-Escudero et al.

[Figure: scatterplot of the simulated bivariate data]

- A simulated data set of 1,800 bivariate normal observations plus 200 outliers
- The plot appears to show one clear tight cluster, one moderately clearly defined cluster, a third more dispersed cluster and a background scatter
- Unlike some traditional methods, we do not have to cluster all observations
Preliminary Cluster Identification

[Figure: four panels - forward plot of d_min(m) trajectories against subset size m (0–2000), and three scatterplots with the preliminary groups highlighted]

- The trajectories of minimum Mahalanobis distances d_min(m) from 500 random start forward searches
- Top right-hand panel, scatterplot, with preliminary Group 1 highlighted; the first 420 units from the sharpest peak
- Bottom left-hand panel, preliminary Group 2; the first 490 units from the lowest trajectory
- Bottom right-hand panel, preliminary Group 3, the first 780 units from the central peak
Cluster Confirmation

- Start within each tentative cluster j. Run the FS on all the data, monitoring the bounds for all n observations until we obtain a “signal” indicating that observation m†_j, and therefore succeeding observations, may be outliers
- Cluster j contains an unknown number n_j of observations
- To judge the values of the statistics against envelopes from n_j we superimpose envelopes for values of n from m†_j − 1 onwards, until the first outlier is found, so establishing the cluster size n_j
Confirming Group 1

[Figure: four panels of forward plots of minimum Mahalanobis distances against subset size m, with superimposed envelopes d_min(m, 243), d_min(m, 390) and d_min(m, 397); d_min(375, 397) > 99.9% envelope]

- Top left-hand panel, forward plot of minimum Mahalanobis distances d_min(m) starting with units believed to be in Group 1; signal at m† = 244
- Succeeding panels, distances with envelopes for n = 243, 390 and 397. 396 units are assigned to the group
Confirming Group 2

[Figure: two panels of forward plots of minimum Mahalanobis distances against subset size m, including the envelopes d_min(m, 661); d_min(597, 661) > 99.9% envelope]

- Because this group is relatively dispersed compared to Group 1, a FS starting from Group 2 will absorb many units from the compact group
- Analyse the 836 unassigned units
- There is a signal at m = 543
- Superimpose envelopes from n = 542. At n = 661 an outlier is indicated at m = 597
- Units (from a different group?) make the last part of the data trace flat
- Removal of these units leads to 656 units in the group
Final FS Clustering

[Figure: left-hand panel, scatterplot of y2 against y1; right-hand panel, histogram of y2 values by classification]

- Left-hand panel, scatterplot of the three groups and background contamination
- Right-hand panel, histogram of classification of y2 values: the contamination is shown in blue
- For central values of y2, virtually all these observations have been included in one of the groups
References
Atkinson, A. C. and Lawrance, A. J. (1989). A comparison of
asymptotically equivalent tests of regression transformation.
Biometrika, 76, 223–229.
Lawrance, A. J. (1987). The score statistic for regression
transformation. Biometrika, 74, 275–289.