Pipeline Givens sequences for computing the QR decomposition on a EREW PRAM

Parallel Computing 32 (2006) 222–230
www.elsevier.com/locate/parco

Marc Hofmann a,*, Erricos John Kontoghiorghes b,c

a Institut d'informatique, Université de Neuchâtel, Emile-Argand 11, Case Postale 2, CH-2007 Neuchâtel, Switzerland
b Department of Public and Business Administration, University of Cyprus, Cyprus
c School of Computer Science and Information Systems, Birkbeck College, University of London, UK

Received 4 March 2005; received in revised form 25 September 2005; accepted 19 November 2005
Available online 18 January 2006
Abstract
Parallel Givens sequences for computing the QR decomposition of an m × n (m > n) matrix are considered. The Givens rotations operate on adjacent planes. A pipeline strategy for updating the pair of elements in the affected rows of the matrix is employed. This allows a Givens rotation to use rows that have been partially updated by previous rotations. Two new Givens schemes, based on this pipeline approach, and requiring respectively n²/2 and n processors, are developed. Within this context a performance analysis on an exclusive-read, exclusive-write (EREW) parallel random access machine (PRAM) computational model establishes that the proposed schemes are twice as efficient as existing Givens sequences.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Givens rotation; QR decomposition; Parallel algorithms; PRAM
1. Introduction
Consider the QR decomposition of the full column rank matrix A ∈ R^{m×n}:
\[
Q^{T} A = \begin{pmatrix} R \\ 0 \end{pmatrix}
\begin{matrix} n \\ m-n \end{matrix}, \qquad (1)
\]
where Q ∈ R^{m×m} and R is upper triangular of order n. The triangular matrix R in (1) is derived iteratively from Q_i^T Ã_i = Ã_{i+1}, where Ã_0 = A and Q_i is orthogonal; the triangularization process terminates when Ã_v (v > 0) is upper triangular. Q = Q_0 Q_1 ⋯ Q_v is not computed explicitly. R can be derived by employing a sequence of Givens rotations. The Givens rotation (GR) that annihilates the element A_{i,j} when applied from the left of A has the form
* This work is in part supported by the Swiss National Foundation Grants 101412-105978.
Corresponding author. Fax: +41 32 718 27 01.
E-mail addresses: marc.hofmann@unine.ch (M. Hofmann), erricos@ucy.ac.cy (E.J. Kontoghiorghes).
doi:10.1016/j.parco.2005.11.001
\[
G_{i,j} = \mathrm{diag}\bigl( I_{i-2},\ \widetilde{G}_{i,j},\ I_{m-i} \bigr)
\quad \text{with} \quad
\widetilde{G}_{i,j} = \begin{pmatrix} c & s \\ -s & c \end{pmatrix},
\]
where c = A_{i−1,j}/t, s = A_{i,j}/t and t² = A²_{i−1,j} + A²_{i,j} (≠ 0).
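As a minimal illustration (ours, not part of the paper), the following C++ function computes c and s from the pair (A_{i−1,j}, A_{i,j}), sets that pair to (t, 0) and applies the rotation to the remaining pairs of elements of the two rows, i.e. the elementary rotations g^{(k)}_{i,j} used by the pipeline schemes of Section 2. Zero-based indices and row-major storage are assumptions of this sketch.

```cpp
#include <cmath>
#include <vector>

// Sketch: annihilate A(i,j) with a Givens rotation between rows i-1 and i.
// The conventions (zero-based indices, row-major std::vector storage) are
// ours; the paper defines the rotation algebraically only.
void givens_annihilate(std::vector<std::vector<double>>& A, int i, int j) {
    double t = std::hypot(A[i - 1][j], A[i][j]);   // t^2 = A(i-1,j)^2 + A(i,j)^2, assumed non-zero
    double c = A[i - 1][j] / t;
    double s = A[i][j] / t;
    A[i - 1][j] = t;                               // first pair becomes (t, 0)
    A[i][j] = 0.0;
    for (std::size_t k = j + 1; k < A[i].size(); ++k) {   // elementary rotations g^(2), g^(3), ...
        double top = A[i - 1][k], bot = A[i][k];
        A[i - 1][k] =  c * top + s * bot;
        A[i][k]     = -s * top + c * bot;
    }
}
```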
Fig. 1. The SK (a) and modified SK (b) Givens sequences to compute the QR decomposition, where m = 16 and n = 6.
Fig. 2. Partial annihilation of a matrix by the PipSK scheme when n = 6.
A GR affects only the ith and (i − 1)th rows of A. Thus, ⌊m/2⌋ rotations can be applied simultaneously. A compound disjoint Givens rotation (CDGR) comprises rotations that can be applied simultaneously. Parallel algorithms for computing the QR decomposition based on CDGRs have been developed [2,3,8–10,12–14]. The Greedy sequence in [4,11] was found to be optimal; that is, it requires fewer CDGRs than any other Givens strategy. However, the computation of the indices of the rotations which do not involve adjacent planes is non-trivial. The employment of rotations between adjacent planes facilitates the development of efficient factorization strategies for structured matrices [6].
An EREW (exclusive-read, exclusive-write) PRAM (parallel random access machine) computational model is considered [6]. It is assumed that there are p processors which can simultaneously perform p GRs. A single time unit is defined to be the time required to apply a Givens rotation to two one-element vectors. Thus, the elapsed time necessary to perform a rotation depends on the length of the vectors involved. Computing c and s requires six flops. Rotating a pair of elements requires another six flops. Hence, annihilating an element and performing the necessary updating of an m × n matrix requires 6n flops. Notice that the GR is not applied explicitly to the first pair of elements, the components of which are set to t and zero, respectively. To simplify the complexity analysis of the proposed algorithms it is assumed that m and n are even. Complexities are given in time units.
Fig. 3. Parallel Givens sequences for computing the QR decomposition, where m = 16 and n = 6: (a) PGS-1, (b) PGS-2, (c) PGS-2 cycles.
New parallel Givens sequences to compute the QR decomposition are proposed. Throughout the execution
of a Givens sequence the annihilated elements are preserved. In the next section a pipelined version of the parallel Sameh and Kuck scheme is presented. Two new pipeline parallel strategies are discussed in Section 3. A
theoretical analysis of complexities of the various schemes is presented. Section 4 summarizes what has been
achieved.
2. Pipeline parallel SK sequence
The parallel Sameh and Kuck (SK) scheme in [14] computes (1) by applying up to n GRs simultaneously. Each GR is performed by one processor. Specifically, the elements of the ith column of A begin to be annihilated by the (2i − 1)th CDGR, as illustrated in Fig. 1(a). The numeral i and the symbol • denote an element annihilated by the ith CDGR and a non-zero element, respectively. The number of CDGRs and time units required by the SK scheme are given, respectively, by
\[
C_{\mathrm{SK}}(m, n) = m + n - 2
\]
and
\[
T_{\mathrm{SK}}(m, n) = (m - 1) n + \sum_{j=2}^{n} (n - j + 1) = n (2m + n - 3)/2 .
\]
Here it is assumed that p = n.

Fig. 4. Triangularization of a 16 × 6 matrix by the PipPGS-1 algorithm.
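The closed form for T_SK(m, n) can be checked directly against the defining sum; the small self-check below is ours and not part of the paper.

```cpp
#include <cassert>

// Verify n(2m + n - 3)/2 = (m - 1)n + sum_{j=2}^{n} (n - j + 1) for even m, n.
int t_sk_sum(int m, int n) {
    int t = (m - 1) * n;
    for (int j = 2; j <= n; ++j) t += n - j + 1;
    return t;
}

int main() {
    for (int n = 2; n <= 20; n += 2)
        for (int m = n; m <= 64; m += 2)
            assert(t_sk_sum(m, n) == n * (2 * m + n - 3) / 2);
    return 0;
}
```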
Let g^{(k)}_{i,j} (k = 1, . . . , n − j + 1) denote the application of G_{i,j} to the kth pair of elements in positions (i, j + k − 1) and (i − 1, j + k − 1). That is, the rotation G_{i,j} operating on the (n − j + 1)-element subvectors of rows i and i − 1 is now expressed as the application of a sequence of elementary rotations g^{(1)}_{i,j}, . . . , g^{(n−j+1)}_{i,j} to the pairs of elements {(i, j), (i − 1, j)}, . . . , {(i, n), (i − 1, n)}, respectively. The rotation G_{i+1,j+1} can begin before the application of G_{i,j} has been completed. Specifically, g^{(1)}_{i+1,j+1} can be applied once g^{(2)}_{i,j} has been executed. Thus, several GRs can operate concurrently on different elements of a row. The Pipelined SK (PipSK) scheme employs this strategy. The first steps of the PipSK scheme are illustrated in Fig. 2. As in the case of the SK scheme, PipSK initiates a CDGR every second time unit and its overall execution time is given by
\[
T_{\mathrm{PipSK}}(m, n) = 2\, C_{\mathrm{SK}}(m, n) - 1 = 2m + 2n - 5 .
\]
The number of CDGRs being performed in parallel is n/2, requiring 2i processors each (i = 1, . . . , n/2). The PipSK scheme thus requires p = \sum_{i=1}^{n/2} 2i ≈ n²/4 processors. Its annihilation pattern, a modified SK scheme, is shown in Fig. 1(b).
Fig. 5. Annihilation cycle of the PipPGS-1 algorithm when n = 6.
Fig. 6. Annihilation cycle of the PGS-2 scheme when n = 6.
Fig. 7. Annihilation cycle of the PipPGS-2 algorithm when n = 6.
3. New pipeline parallel Givens sequences
Alternative parallel Givens (PG) sequences to compute the QR decomposition, which are more suitable for pipelining, are shown in Fig. 3. The number of CDGRs applied by the first PG sequence (PGS-1 [1]) is given by
\[
C_{\mathrm{PGS\text{-}1}}(m, n) = m + 2n - 3 .
\]
The Pipelined PGS-1 (PipPGS-1) is illustrated in Fig. 4. It requires p ≈ n²/2 processors. It initiates a CDGR every time unit and its time complexity is given by the total number of CDGRs applied. That is,
\[
T_{\mathrm{PipPGS\text{-}1}}(m, n) = C_{\mathrm{PGS\text{-}1}}(m, n) = m + 2n - 3 .
\]
The PipPGS-1 operates in cycles of n + 1 time units. This is shown in Fig. 5. In every time unit up to n processors initiate a GR, while the other processors are updating previously initiated rotations. Each processor executes two GRs in one cycle, with complexities T1 and T2 respectively, such that T1 + T2 = n + 1 time units. The PipPGS-1 performs better than the SK scheme while utilizing approximately n/2 times the number of processors. That is, T_SK(m, n)/T_PipPGS-1(m, n) ≈ n. Hence, the efficiency of the PipPGS-1 is twice that of the SK scheme.
The PGS-2 illustrated in Fig. 3(b) employs p = n processors. A cycle involves one CDGR and n consecutive GRs and annihilates up to 2n elements in n + 1 steps. This is illustrated in Fig. 6. The PGS-2 scheme requires more steps than the SK scheme. When m and n are even, the sequence consists of (m + n − 2)/2 cycles which can be partitioned into three sets comprising cycles {1, . . . , n − 1}, {n, . . . , (m − 2)/2} and {m/2, . . . , (m + n − 2)/2}. The sets, cycles and constituent steps are detailed in Fig. 3. The ith cycle in the first, second and third set applies i + 1, n + 1 and 2(n/2 − i + 1) steps, respectively. The total number of steps applied by the PGS-2 is given by
\[
C_{\mathrm{PGS\text{-}2}}(m, n) = \sum_{i=1}^{n-1} (i + 1) + (m - 2n)(n + 1)/2 + \sum_{i=1}^{n/2} 2i \approx (2mn + 2m - n^2)/4 .
\]

Fig. 8. Initial annihilation steps of a 16 × 6 matrix by the PipPGS-2 algorithm.
Fig. 9. Final annihilation steps of a 16 × 6 matrix by the PipPGS-2 algorithm.
The Pipelined PGS-2 (PipPGS-2) applies 2n GRs in a pipeline manner. Fig. 7 illustrates the annihilation cycle of the PipPGS-2, which annihilates 2n elements in n + 1 time units. This is equivalent to two CDGRs of the SK scheme, for which 2n time units are required. Figs. 8 and 9 illustrate, respectively, the initial and the final phase of the annihilation process of a 16 × 6 matrix by the PipPGS-2. An asterisk denotes the start of a new cycle.
In every time unit a processor initiates a GR. When m and n are even, the first (m − 2)/2 cycles are executed in n + 1 time units each. The ith cycle in the third set requires 2(n/2 − i + 1) + 1 time units (i = 1, . . . , n/2). The overall execution time of the PipPGS-2 algorithm is given by
\[
T_{\mathrm{PipPGS\text{-}2}}(m, n) = (m - 2)(n + 1)/2 + \sum_{i=1}^{n/2} (2i + 1) \approx (2mn + 2m + n^2)/4 .
\]
Furthermore, if m ≫ n and for the same number of processors,
\[
\frac{T_{\mathrm{SK}}(m, n)}{T_{\mathrm{PipPGS\text{-}2}}(m, n)} \approx 2 .
\]
That is, the proposed scheme is twice as fast as the SK scheme.
Table 1
Summary of the complexities of the SK and pipelined schemes

Scheme      Processors   Complexity
SK          n            n(2m + n − 3)/2
PipSK       n²/4         2m + 2n − 5
PipPGS-1    n²/2         m + 2n − 3
PipPGS-2    n            (2mn + 2m + n²)/4
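The following small program (ours) tabulates the Table 1 complexities for a few problem sizes. For m ≫ n the ratio T_SK/T_PipPGS-2 equals approximately 2n/(n + 1), which approaches the factor of two claimed above for the same number of processors (p = n).

```cpp
#include <cstdio>

// Tabulate the complexities of Table 1 (in time units) and the SK/PipPGS-2 ratio.
int main() {
    std::printf("%7s %5s %12s %10s %10s %12s %7s\n",
                "m", "n", "SK", "PipSK", "PipPGS-1", "PipPGS-2", "ratio");
    for (int n = 8; n <= 128; n *= 4) {
        int m = 100 * n;                                   // m >> n
        double t_sk = n * (2.0 * m + n - 3) / 2.0;
        double t_ps = 2.0 * m + 2.0 * n - 5.0;
        double t_p1 = m + 2.0 * n - 3.0;
        double t_p2 = (2.0 * m * n + 2.0 * m + 1.0 * n * n) / 4.0;
        std::printf("%7d %5d %12.0f %10.0f %10.0f %12.0f %7.2f\n",
                    m, n, t_sk, t_ps, t_p1, t_p2, t_sk / t_p2);
    }
    return 0;
}
```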
4. Conclusions
A new pipeline parallel strategy has been proposed for computing the QR decomposition. Its computational complexity is compared to the SK scheme of Sameh and Kuck in [14]. The complexity analysis is
not based on the unrealistic assumption that all CDGRs, or cycles, have the same execution time. Instead,
the number of operations performed by a single Givens rotation is given by the size of the pair of vectors used
in the rotation. It was found that for an equal number of processors the pipelined scheme solves the m × n QR decomposition twice as fast as the SK scheme when m ≫ n. The complexities are summarized in Table 1.
Block versions of the SK scheme have previously been designed to compute the orthogonal factorizations of
structured matrices which arise in econometric estimation problems [5,15]. Within this context the Givens rotations are replaced by orthogonal factorizations which employ Householder reflections. Thus, it might be fruitful to investigate the effectiveness of incorporating the pipeline strategy in the design of block algorithms [7,16].
Acknowledgements
The authors are grateful to Maurice Clint, Ahmed Sameh and the anonymous referee for their valuable
comments and suggestions.
References
[1] A. Bojanczyk, R. Brent, H. Kung, Numerically stable solution of dense systems of linear equations using mesh-connected processors,
SIAM J. Sci. Statist. Comput. 5 (1984) 95–104.
[2] M. Cosnard, M. Daoudi, Optimal algorithms for parallel Givens factorization on a coarse-grained PRAM, J. ACM 41 (2) (1994) 399–
421.
[3] M. Cosnard, J.-M. Muller, Y. Robert, Parallel QR decomposition of a rectangular matrix, Numerische Mathematik 48 (1986) 239–
249.
[4] M. Cosnard, Y. Robert, Complexité de la factorisation QR en parallèle, C. R. Acad. Sci. Paris, ser. I 297 (1983) 137–139.
[5] E.J. Kontoghiorghes, Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems, Advances in
Computational Economics, volume 15, Kluwer Academic Publishers, Boston, MA, 2000.
[6] E.J. Kontoghiorghes, Parallel Givens sequences for solving the general linear model on a EREW PRAM, Parallel Algorithms Appl.
15 (1–2) (2000) 57–75.
[7] E.J. Kontoghiorghes, Parallel strategies for rank-k updating of the QR decomposition, SIAM J. Matrix Anal. Appl. 22 (3) (2000)
714–725.
[8] E.J. Kontoghiorghes, Greedy Givens algorithms for computing the rank-k updating of the QR decomposition, Parallel Comput. 28
(9) (2002) 1257–1273.
[9] F.T. Luk, A rotation method for computing the QR decomposition, SIAM J. Sci. Statist. Comput. 7 (2) (1986) 452–459.
[10] J.J. Modi, Parallel Algorithms and Matrix Computation, Oxford Applied Mathematics and Computing Science series, Oxford
University Press, 1988.
[11] J.J. Modi, M.R.B. Clarke, An alternative Givens ordering, Numerische Mathematik 43 (1984) 83–90.
[12] A.H. Sameh, Solving the linear least squares problem on a linear array of processors, in: Algorithmically Specialized Parallel
Computers, Academic Press, Inc., 1985, pp. 191–200.
[13] A.H. Sameh, R.P. Brent, On Jacobi and Jacobi like algorithms for a parallel computer, Math. Comput. 25 (115) (1971) 579–590.
[14] A.H. Sameh, D.J. Kuck, On stable parallel linear system solvers, J. ACM 25 (1) (1978) 81–91.
[15] P. Yanev, P. Foschi, E.J. Kontoghiorghes, Algorithms for computing the QR decomposition of a set of matrices with common
columns, Algorithmica 39 (2004) 83–93.
[16] P. Yanev, E.J. Kontoghiorghes, Efficient algorithms for block downdating of least squares solutions, Appl. Numer. Math. 49 (2004)
3–15.
Computational Statistics & Data Analysis 52 (2007) 16–29
www.elsevier.com/locate/csda

Efficient algorithms for computing the best subset regression models for large-scale problems

Marc Hofmann a,*, Cristian Gatu a,d, Erricos John Kontoghiorghes b,c

a Institut d'Informatique, Université de Neuchâtel, Switzerland
b Department of Public and Business Administration, University of Cyprus, Cyprus
c School of Computer Science and Information Systems, Birkbeck College, University of London, UK
d Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi, Romania

Available online 24 March 2007
Abstract
Several strategies for computing the best subset regression models are proposed. Some of the algorithms are modified versions of
existing regression-tree methods, while others are new. The first algorithm selects the best subset models within a given size range.
It uses a reduced search space and is found to outperform computationally the existing branch-and-bound algorithm. The properties
and computational aspects of the proposed algorithm are discussed in detail. The second new algorithm preorders the variables
inside the regression tree. A radius is defined in order to measure the distance of a node from the root of the tree. The algorithm
applies the preordering to all nodes which have a smaller distance than a certain radius that is given a priori. An efficient method
of preordering the variables is employed. The experimental results indicate that the algorithm performs best when preordering
is employed on a radius of between one quarter and one third of the number of variables. The algorithm has been applied with
such a radius to tackle large-scale subset-selection problems that are considered to be computationally infeasible by conventional
exhaustive-selection methods. A class of new heuristic strategies is also proposed. The most important of these is one that assigns a different tolerance value to each subset model size. With suitable choices of the tolerances, this strategy is equivalent to all the exhaustive and heuristic subset-selection strategies. In addition, the strategy can be used to investigate submodels having noncontiguous size ranges. Its implementation provides a flexible tool for tackling large-scale models.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Best-subset regression; Regression tree; Branch-and-bound algorithm
1. Introduction
The problem of computing the best-subset regression models arises in statistical model selection. Most of the criteria
used to evaluate the subset models rely upon the residual sum of squares (RSS) (Searle, 1971; Sen and Srivastava,
1990). Consider the standard regression model
\[
y = A\beta + \varepsilon, \qquad \varepsilon \sim (0,\ \sigma^2 I_m), \qquad (1)
\]

The R routines can be found at URL: http://iiun.unine.ch/matrix/software.
* Corresponding author. Tel.: +41 32 7182708; fax: +41 32 7182701.
E-mail addresses: marc.hofmann@unine.ch (M. Hofmann), cristian.gatu@unine.ch (C. Gatu), erricos@ucy.ac.cy (E.J. Kontoghiorghes).
doi:10.1016/j.csda.2007.03.017
Table 1
Leaps and BBA: execution times in seconds for data sets of different sizes, without and with variable preordering

# Variables   36   37   38   39   40   41   42   43   44   45   46    47    48
Leaps          8   29   44   30  203   57  108  319  135  316  685  2697  6023
BBA            2    5   12    8   35   14    9   55   27   37   97   380  1722
Leaps-1        3   16   28    9   82   33   22  203   79   86  306  1326  1910
BBA-1          1    4   13    2   20   11    4   47   18   15   51   216   529
where y ∈ R^m, A ∈ R^{m×n} is the exogenous data matrix of full column rank, β ∈ R^n is the coefficient vector and ε ∈ R^m is the noise vector. The columns of A correspond to the exogenous variables V = [v_1, . . . , v_n]. A submodel S of (1) comprises some of the variables in V. There are 2^n − 1 possible subset models, and their computation is only feasible
for small values of n. The dropping column algorithm (DCA) derives all submodels by generating a regression tree
(Clarke, 1981; Gatu and Kontoghiorghes, 2003; Smith and Bremner, 1989). The parallelization of the DCA moderately
improves its practical value (Gatu and Kontoghiorghes, 2003). Various procedures such as the forward, backward and
stepwise selection try to identify a subset by inspecting very few combinations of variables. However, these methods
rarely succeed in finding the best submodel (Hocking, 1976; Seber, 1977). Other approaches for subset-selection include
ridge regression, the nonnegative garrote and the lasso (Breiman, 1995; Fan and Li, 2001; Tibshirani, 1996). Sequential
replacement algorithms are fairly fast and can be used to give some indication of the maximum size of the subsets that
are likely to be of interest (Hastie et al., 2001). The branch-and-bound algorithms for choosing a subset of k features
from a given larger set of size n have also been investigated within the context of feature selection problems (Narendra
and Fukunaga, 1997; Roberts, 1984; Somol et al., 2004). These strategies are used when the size k of the subset to be
selected is known. Thus, they search over n!/(k!(n − k)!) subsets.
A computationally efficient branch-and-bound algorithm (BBA) has been devised (Gatu and Kontoghiorghes, 2006;
Gatu et al., 2007). The BBA avoids the computation of the whole regression tree and it derives the best subset model
for each number of variables. That is, it computes
\[
\underset{S}{\operatorname{argmin}}\ \mathrm{RSS}(S) \quad \text{subject to } |S| = k \ \text{ for } k = 1, \ldots, n. \qquad (2)
\]
The BBA was built around the fundamental property
\[
\mathrm{RSS}(S_1) \ge \mathrm{RSS}(S_2) \quad \text{if } S_1 \subseteq S_2, \qquad (3)
\]
where S1 and S2 are two variable subsets of V (Gatu and Kontoghiorghes, 2006). The BBA-1, which is an extension of
the BBA, preorders the n variables according to their strength in the root node. The variables i and j are arranged such that
RSS(V_{−i}) ≥ RSS(V_{−j}) for each i ≤ j, where V_{−i} is the set V from which the ith variable has been deleted. The BBA-1
has been shown to outperform the previously introduced leaps-and-bounds algorithm (Furnival and Wilson, 1974).
Table 1 shows the execution times of the BBA and leaps-and-bounds algorithm for data sets with 36–48 variables.
Note that the BBA outperforms the leaps-and-bounds with preordering in the root node (Leaps-1). A heuristic version
of the BBA (HBBA) that uses a tolerance parameter to relax the BBA pruning test has been discussed. The HBBA
might not provide the optimal solution, but the relative residual error (RRE) of the computed solution is smaller than
the tolerance employed.
Often models within a given size range must be investigated. These models, hereafter called subrange subset models,
do not require the generation of the whole tree. Thus, the adaptation of the BBA for deriving the subrange subset models
is expected to have a lower computational cost and thus make it feasible to tackle larger-scale models. The structural properties of a regression tree strategy which generates the subrange subset models are investigated and its theoretical complexity is derived. A new nontrivial preordering strategy that outperforms the BBA-1 is designed and analyzed. The new strategy, which is found to be significantly faster than existing ones, can derive the best subset models from
a larger pool of variables. In addition, some new heuristic strategies based on the HBBA are developed. The tolerance
parameter is either a function of the level in the regression tree, or of the size of the subset model. The novel strategies
decrease execution time while selecting models of similar, or of even better, quality.
The proposed strategies, which outperform the existing subset selection BBA-1 and its heuristic version, are aimed
at tackling large-scale models. The next section briefly discusses the DCA, and it introduces the all-subset-models
regression tree. It generalizes the DCA so as to select only the submodels within a given size range. Section 3 discusses
a novel strategy that preorders the variables of the nodes in various levels of the tree. The significant improvement in
the computational efficiency when compared to the BBA-1 is illustrated. Section 4 presents and compares various new
heuristic strategies. Theoretical and experimental results are presented. Conclusions and proposals for future work are
discussed in Section 5.
The algorithms were implemented in C++ and are available in a package for the R statistical software environment
(R Development Core Team, 2005). The GNU compiler collection was used to generate the shared libraries. The tests
were run on a Pentium-class machine with 512 Mb of RAM in a Linux environment. Real and artificial data have been
used in the experiments. A set of artificial variables has been randomly generated. The response variable of the true
model is based on a linear combination of a subset of these artificial variables with the addition of some noise. An
intercept term is included in the true model.
2. Subrange model selection
The DCA employs a straightforward approach to solve the best-subset problem (2). It enumerates and evaluates all possible 2^n − 1 subsets of V. It generates a regression tree consisting of 2^{n−1} nodes (Gatu and Kontoghiorghes, 2003; Smith and Bremner, 1989). Each node in the tree corresponds to a subset S = [s_1, . . . , s_{n_s}] of n_s variables and to an index k (k = 0, . . . , n_s − 1). The n_s − k subleading models [s_1, . . . , s_{k+1}], . . . , [s_1, . . . , s_{n_s}] are evaluated. A new node
is generated by deleting a variable. The descending nodes are given by
(drop(S, k + 1), k), (drop(S, k + 2), k + 1), . . . , (drop(S, ns − 1), ns − 2).
Here, the operation drop(S, i) computes a new subset which corresponds to the subset S from which the ith variable
has been deleted. This is equivalent to downdating the QR decomposition after the ith column has been deleted (Golub
and Van Loan, 1996; Kontoghiorghes, 2000; Smith and Bremner, 1989). The DCA employs Givens rotations to move
efficiently from one node to another.
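A sketch (ours) of the drop operation on a square upper-triangular factor: the column is deleted and the triangular form is restored by Givens rotations between adjacent rows, which is the QR downdating step the DCA performs when moving from a node to one of its children. The function name, zero-based indexing and dense row-major storage are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;   // dense, row-major

// drop(S, idx) on the factor level: remove column idx of the square upper
// triangular R and retriangularize with adjacent-row Givens rotations.
// The last row becomes zero and is discarded.
Matrix drop_column(Matrix R, std::size_t idx) {
    for (auto& row : R) row.erase(row.begin() + idx);
    std::size_t rows = R.size(), cols = R[0].size();
    for (std::size_t j = idx; j < cols && j + 1 < rows; ++j) {
        double x = R[j][j], y = R[j + 1][j];
        double t = std::hypot(x, y);
        if (t == 0.0) continue;                      // nothing to annihilate
        double c = x / t, s = y / t;
        for (std::size_t k = j; k < cols; ++k) {     // rotate rows j and j+1
            double top = R[j][k], bot = R[j + 1][k];
            R[j][k]     =  c * top + s * bot;
            R[j + 1][k] = -s * top + c * bot;
        }
    }
    R.pop_back();
    return R;
}
```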
The search space of all possible variable subset models can be reduced by imposing bounds on the size of the subset
models. The subrange model selection problem is to derive:
\[
S_j^{*} = \underset{S}{\operatorname{argmin}}\ \mathrm{RSS}(S) \quad \text{subject to } |S| = j \ \text{ for } j = n_a, \ldots, n_b, \qquad (4)
\]
where n_a and n_b are the subrange bounds (1 ≤ n_a ≤ n_b ≤ n). The DCA and the Subrange DCA (RangeDCA) are equivalent when n_a = 1 and n_b = n. The RangeDCA generates a subtree of the original regression tree. The nodes (S, k) are not computed when n_s < n_a or k ≥ n_b. This is illustrated in Fig. 1. The DCA regression tree with n = 5 variables is
shown. The blank nodes represent the RangeDCA subtree for na = nb = 3. Portions of the tree that are not computed
by the RangeDCA are shaded. The nodes in the last two levels of the tree evaluate subset models of sizes 1 and 2, i.e. the subsets [4], [5], [4, 5], [3, 5], [2, 5] and [1, 5]. These nodes are discarded by the RangeDCA (case n_s < n_a). The leftmost node in the tree evaluates the subset model [1, 2, 3, 5] of size 4. The RangeDCA discards this node (case k ≥ n_b). The Appendix provides a detailed and formal analysis of the RangeDCA.

Fig. 1. The RangeDCA subtree, where n = 5 and n_a = n_b = 3.
The branch-and-bound strategy can be applied to the subtree generated by the RangeDCA. This strategy is called
RangeBBA and is summarized in Algorithm 1. The RangeBBA stores the generated nodes of the regression subtree in
a list. The list is managed according to a last-in, first-out (LIFO) policy. The RSSs of the best subset models are recorded
in a table r. The entry ri holds the RSS of the best current submodel of size i. The initial residuals table may be given a
priori based on some earlier results, otherwise the initial residuals are set to positive infinity. The entries are sorted in
decreasing order. Each iteration removes a node (S, k) from the list. The subleading model [s1 , . . . , si ] is evaluated and
compared to the entry ri in the residuals table for i = k + 1, . . . , ns . The entry ri is updated when RSS([s1 , . . . , si ]) < ri .
If n_s ≤ n_a, then no child nodes are computed and the iteration terminates; otherwise, the cutting test b_S > r_i is computed
for i = k + 1, . . . , min(nb , ns − 1) and bS = RSS(S). If the test fails the child node (drop(S, i), i − 1) is generated and
inserted into the node list. Note that, if i < na , then the value rna is used in the cutting test. This is illustrated on Line 11
of the algorithm. The modified cutting test is more efficient than that of the BBA, since r_{n_a} ≤ r_i for i = 1, . . . , n_a − 1.
The algorithm terminates when the node list is empty. Notice that the RangeBBA with preordering (RangeBBA-1) is
obtained by sorting the variables in the initial set V. The RangeBBA outperforms the standard BBA, since it uses a
reduced search space and a more efficient cutting test.
Algorithm 1. The subrange BBA (RangeBBA)
1   procedure RangeBBA(V, n_a, n_b, r)
2     insert (V, 0) into the node list
3     while the node list is not empty do
4       Extract (S, k) from the node list
5       n_s ← |S|
6       Update the residuals r_{k+1}, . . . , r_{n_s}
7       if n_s > n_a then
8         b_S ← RSS(S)
9         for i = k + 1, . . . , min(n_b, n_s − 1) do
10          j ← max(i, n_a)
11          if b_S > r_j go to line 3
12          S′ ← drop(S, i)
13          Insert (S′, i − 1) into the node list
14        end for
15      end if
16    end while
17  end procedure
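The following C++ sketch mirrors Algorithm 1. To keep it self-contained, the RSS of a subset is obtained from a caller-supplied function instead of the Givens-based QR updates used in the paper, and the residuals table is held as a vector indexed by model size; all type and function names are ours.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

using Subset = std::vector<int>;                       // variable indices s_1,...,s_ns
using RssFn  = std::function<double(const Subset&)>;   // caller-supplied RSS(S)

// RangeBBA sketch: returns r, where r[i-1] is the best RSS found for size i.
std::vector<double> range_bba(const Subset& V, int na, int nb, RssFn rss) {
    const int n = static_cast<int>(V.size());
    std::vector<double> r(n, std::numeric_limits<double>::infinity());

    struct Node { Subset S; int k; };
    std::vector<Node> list{ {V, 0} };                  // LIFO node list

    while (!list.empty()) {
        Node node = list.back(); list.pop_back();
        const Subset& S = node.S;
        const int k = node.k, ns = static_cast<int>(S.size());

        for (int i = k + 1; i <= ns; ++i) {            // update residuals r_{k+1},...,r_{ns}
            Subset lead(S.begin(), S.begin() + i);     // subleading model [s_1,...,s_i]
            r[i - 1] = std::min(r[i - 1], rss(lead));
        }
        if (ns <= na) continue;                        // no child nodes

        const double bS = rss(S);
        for (int i = k + 1; i <= std::min(nb, ns - 1); ++i) {
            const int j = std::max(i, na);
            if (bS > r[j - 1]) break;                  // cutting test (line 11): abandon node
            Subset Sp = S;
            Sp.erase(Sp.begin() + (i - 1));            // drop(S, i)
            list.push_back({Sp, i - 1});
        }
    }
    return r;
}
```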
The effects of the subrange bounds na and nb on the computational performance of the RangeDCA and RangeBBA-1
have been investigated. Fig. 2(a) and (b) show the execution times of the RangeDCA for n = 20 variables and the RangeBBA-1 for 36 variables, respectively. It can be observed that the RangeDCA is computationally effective in two cases: for narrow size ranges (i.e. n_b − n_a < 2) or extreme ranges (i.e. n_a = 1 and n_b < n/4, or n_a > 3n/4 and n_b = n). The RangeBBA-1, on the other hand, is effective for all ranges such that n_a > n/2. This is further confirmed by the results in Table 2. The number of nodes generated by the RangeBBA-1 for the 15-variable POLLUTE data set (Miller, 2002) is shown. All possible subranges 1 ≤ n_a ≤ n_b ≤ 15 are considered. For the case n_a = 1
and nb = 15 the RangeBBA-1 generates 381 nodes and is equivalent to the BBA-1 (Gatu and Kontoghiorghes,
2006).
3. Radius preordering
The BBA with an initial preordering of the variables in the root node (BBA-1) significantly increases the computational speed. The cost of preordering the variables once is negligible. The aim is to consider a strategy that applies
the preordering of variable subsets inside the regression tree and that yields a better computational performance than
Fig. 2. Subrange model selection: execution times in seconds for varying n_a and n_b: (a) RangeDCA (n = 20); (b) RangeBBA-1 (n = 36).
Table 2
Number of nodes generated by the RangeBBA-1 to compute the best subset models of the POLLUTE data set for different size ranges

        nb=1  nb=2  nb=3  nb=4  nb=5  nb=6  nb=7  nb=8  nb=9 nb=10 nb=11 nb=12 nb=13 nb=14 nb=15
na= 1     12    37    96   178   276   332   356   373   375   376   377   378   380   381   381
na= 2           31    90   172   270   326   350   367   369   370   371   372   374   375   375
na= 3                 75   157   255   311   335   352   354   355   356   357   359   360   360
na= 4                      123   221   277   301   318   320   321   322   323   325   326   326
na= 5                            173   229   253   270   272   273   274   275   277   278   278
na= 6                                  103   127   144   146   147   148   149   151   152   152
na= 7                                         52    69    71    72    73    74    76    77    77
na= 8                                               38    40    41    42    43    45    46    46
na= 9                                                     11    12    13    14    16    17    17
na=10                                                           11    12    13    15    16    16
na=11                                                                 12    13    15    16    16
na=12                                                                       13    15    16    16
na=13                                                                             15    16    16
na=14                                                                                   15    15
na=15                                                                                          1
the BBA-1. The new strategy is hereafter called radius preordering BBA (RadiusBBA). The RadiusBBA sorts the
variables according to their strength. The strength of the ith variable is given by its bound RSS(S−i ) = RSS(drop(S, i)).
The main tool for deriving the bound is the downdating of the QR decomposition after the corresponding column of the data matrix has been deleted. This has a cost, and therefore care must be taken to apply a preordering to the nodes whenever the expected gain outweighs the total cost inherent to the preordering
process.
The RadiusBBA preorders the variables at the roots of large subtrees. The size of the subtree with root (S, k) is given by 2^{d−1}, where n_s is the number of variables in S and d = n_s − k is the number of active variables. The RadiusBBA defines the node radius ρ = n − d, where n is the initial number of variables in the root node (V, 0). The radius of a node is a measure of its distance from the root node. Notice that the root nodes of larger subtrees have a smaller radius, while the roots of subtrees with the same number of nodes have the same radius. Given a radius P,
the preordering of variables is applied only to nodes of radius ρ < P, where 0 ≤ P ≤ n. If P = 0 and P = 1, then the RadiusBBA is equivalent to the BBA and BBA-1, respectively. If P = n, then the active variables are preordered in all nodes. Fig. 3 illustrates the radius of every node in a regression tree for n = 5 variables. Shaded nodes are preordered by the RadiusBBA, where P = 3.

Fig. 3. Nodes that apply preordering for radius P = 3 (each node of the n = 5 regression tree is labelled with its radius ρ).
The RadiusBBA is illustrated in Algorithm 2. A node is extracted from the node list at each iteration. If the node radius
is less than the given preordering radius P, then the active variables are preordered before updating the residuals table.
The nodes which cannot improve the current best solutions are not generated. The cutting test (see line 11) compares
the bound of the current node to the corresponding entry in the residuals table in order to decide whether or not to
generate a child node.
Algorithm 2. The radius branch-and-bound algorithm (RadiusBBA)
1   procedure RadiusBBA(V, P, r)
2     n ← |V|
3     Insert (V, 0) into node list
4     while node list is not empty do
5       Extract (S, k) from node list
6       n_s ← |S|; ρ ← n − (n_s − k)
7       if ρ < P then preorder [s_{k+1}, . . . , s_{n_s}]
8       Update residuals r_{k+1}, . . . , r_{n_s}
9       b_S ← RSS(S)
10      for i = k + 1, . . . , n_s − 1 do
11        if b_S > r_i goto line 4
12        S′ ← drop(S, i)
13        Insert (S′, i − 1) into node list
14      end for
15    end while
16  end procedure
The preordering process sorts the variables in order of decreasing bounds. Given a node (S, k), the bound of the
ith active variable (i = 1, . . . , d) is RSS(drop(S, k + i)), i.e. the RSS of the model from which the (k + i)th variable
has been removed. The d active variables of the node (S, k) are represented by the leading d × d submatrix of the
upper triangular R ∈ R(d+1)×(d+1) factor of the QR decomposition. The last column of R corresponds to the response
variable y in (1). Let R̃ denote R without its ith column. The drop operation applies d − i + 1 biplanar Givens rotations to retriangularize R̃. That is, it computes
\[
G_d \cdots G_i \tilde{R} = \begin{pmatrix} \hat{R} \\ 0 \end{pmatrix}
\begin{matrix} d \\ 1 \end{matrix}, \qquad (5)
\]
where R̂ ∈ R^{d×d} is upper triangular. The bound of the ith active variable, that is, the RSS of the model after deleting the ith variable, is given by R̂²_{d,d}, the square of the diagonal element of R̂ in position (d, d).

Fig. 4. Exploiting the QR decomposition to compute the bound of the ith active variable, when i = 2.
Note that the rotation G_j R̃ (j = i, . . . , d) affects only the last d − j + 1 elements of rows j and j + 1 of R̃, and reduces R̃_{j+1,j} to zero. The application of a rotation to two biplanar elements x and y can be written as
\[
\begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix}
=
\begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
\]
If c and s are chosen such that c = x/t and s = y/t, then x̃ = t and ỹ = 0, where t² = x² + y² and t ≠ 0.
The number of nodes in which the variables are preordered increases exponentially with the preordering radius
P. This computational overhead will have a significant impact on the overall performance of the RadiusBBA.
Fig. 4(a) shows the retriangularization of a 6 × 6 triangular matrix after deleting the second column using Givens
rotations.
The complete and explicit triangularization of R̃ should be avoided and only the bound of the deleted ith variable, i.e. R̂²_{d,d}, should be computed. This can be achieved by computing only the elements of R̃ which are needed in deriving R̂_{d,d}. Thus, the Givens rotation G_j R̃ (j = i, . . . , d) explicitly updates only the last d − j + 1 elements of the (j + 1)th row of R̃ which are required by the subsequent rotation. The jth row of R̃ is not updated, and neither is the subdiagonal element R̃_{j+1,j} annihilated. This strategy has approximately half the cost for deriving the bound of the
ith variable. In order to optimize the implementation of this procedure, the original triangular R is not modified and
the bounds are computed without copying the matrix to temporary store. Fig. 4(b) illustrates the steps of this strategy,
while Algorithm 3 shows the exact computational steps.
Fig. 5. Computational cost of RadiusBBA for the POLLUTE data set and artificial data with 52 variables: (a) POLLUTE data set (number of generated nodes vs. preordering radius); (b) 13-variable true model (execution time in seconds vs. preordering radius); (c) 26-variable true model (execution time); (d) 39-variable true model (execution time).
Algorithm 3. Computing the bound of the ith active variable
1   procedure bound(R, i, b)
2     x_j ← R_{i,j}, for j = i + 1, . . . , d + 1
3     for j = i + 1, . . . , d + 1 do
4       y_k ← R_{j,k}, k = j, . . . , d + 1
5       t ← (x_j² + y_j²)^{1/2}; c ← x_j / t; s ← y_j / t
6       x_k ← −s · x_k + c · y_k, k = j + 1, . . . , d + 1
7     end for
8     b ← t²
9   end procedure
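A C++ rendering (ours) of Algorithm 3 with zero-based indices: R is the (d + 1) × (d + 1) upper-triangular factor stored row-major, column i is the deleted variable, and only the carried row x is updated while R itself is left untouched. As in the text, the rotation parameters assume t ≠ 0.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;   // (d+1) x (d+1), upper triangular

// Bound of the variable in column i: the RSS of the model after deleting
// that column, obtained without explicitly retriangularizing R.
double bound(const Matrix& R, std::size_t i) {
    const std::size_t d1 = R.size();               // d + 1 (last column is the response)
    std::vector<double> x(R[i]);                   // carried row, initialized with row i
    double t = 0.0;
    for (std::size_t j = i + 1; j < d1; ++j) {     // rotate against rows i+1, ..., d
        const std::vector<double>& y = R[j];
        t = std::hypot(x[j], y[j]);                // pivot column j; t != 0 assumed
        const double c = x[j] / t, s = y[j] / t;
        for (std::size_t k = j + 1; k < d1; ++k)
            x[k] = -s * x[k] + c * y[k];           // only the carried row is updated
    }
    return t * t;                                  // squared last diagonal of the retriangularized factor
}
```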
Fig. 5 illustrates the effect of the preordering radius P on the RadiusBBA. Fig. 5(a) illustrates the number of nodes
generated by the RadiusBBA on the POLLUTE data set (15 variables) for every preordering radius (1 ≤ P ≤ 15). The
BBA-1 generates 318 nodes. The number of nodes generated by the RadiusBBA decreases steadily up to P = 8 where a
minimum of 146 nodes is reached. The other three figures illustrate the execution times of the RadiusBBA on artificial
data sets with 52 variables. The true model comprises 13, 26 and 39 variables, respectively. In all three cases the
RadiusBBA represents a significant improvement over the BBA-1. In the case of the small true model (13 variables),
Table 3
Execution times in seconds of the subrange RadiusBBA with radius 21 and 26 on data sets comprising 64 and 80 variables, respectively

         n_true    n_a    n_b      Time
n = 64       16      1     64       119
             16      8     24        50
             32      1     64      3415
             32     24     40         4
             48      1     64      3531
             48     36     60         1
n = 80       20      1     80      4205 (70 min)
             20     10     30      2309 (38 min)
             40      1     80    177383 (2 days)
             40     20     60     25732 (8 h)
             60      1     80   1293648 (15 days)
             60     40     80       178 (3 min)

The true model comprises n_true variables.
the BBA-1 and the RadiusBBA with P = 9 require 30 and 1 s, respectively. In the case of a medium size true model
(26 variables), the time is reduced from 112 to 6 s, i.e. the RadiusBBA with radius 13 is almost 20 times faster. In
the third case (big true model with 39 variables) the RadiusBBA with radius 18 is over 2 times faster than the BBA-1
which requires 500 s. These tests show empirically that values of P lying between n/4 and n/3 are a good choice for
the RadiusBBA.
Table 3 shows the execution times of the RadiusBBA on two data sets with n = 64 and 80 variables, respectively.
The preordering radius used is P = n/3. The number of variables in the true model is given by ntrue . Different ranges
na and nb have been used. For the full range, i.e. na = 1 and nb = n, the RadiusBBA computes the best subset models
for model sizes which are computationally infeasible for the BBA-1. It can be observed that the use of smaller ranges
significantly reduces the time required by the RadiusBBA for deriving the best subset models.
4. Heuristic strategies
The Heuristic BBA (HBBA) relaxes the objective of finding an optimal solution in order to gain in computational efficiency. That is, the HBBA is able to tackle large-scale models when the exhaustive BBA is found to be computationally
infeasible. The heuristic algorithm ensures that
\[
\mathrm{RRE}(\tilde{S}_i) < \tau \quad \text{for } i = 1, \ldots, n, \qquad (6)
\]
where S̃_i is the (heuristic) solution subset model of size i and τ is a tolerance parameter (τ > 0). Generally, the RRE of a subset S_i is given by
\[
\mathrm{RRE}(S_i) = \frac{|\mathrm{RSS}(S_i) - \mathrm{RSS}(S_i^{*})|}{\mathrm{RSS}(S_i^{*})},
\]
where S_i^{*} is the optimal subset of size i reported by the BBA. The space of all possible submodels is not searched exhaustively. The HBBA aims to find an acceptable compromise between the brevity of the search (τ → ∞) and the quality of the solutions computed (τ → 0). The modified cutting test in Gatu and Kontoghiorghes (2006) is given by
\[
(1 + \tau) \cdot \mathrm{RSS}(S) > r_{j+1}. \qquad (7)
\]
Note that the HBBA is equivalent to the BBA for τ = 0. Furthermore, the HBBA reduces to the DCA if τ = −1: for τ = −1 the left-hand side of (7) is zero, which implies that the cutting test never holds and all the nodes of the tree are generated.
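As a one-line illustration (ours) of the relaxed cutting test (7): with τ = 0 it reduces to the exhaustive BBA test, while with τ = −1 the left-hand side is zero, the test never fires and the whole DCA tree is generated.

```cpp
// Hypothetical helper, not the authors' code: decide whether the subtree for
// model size j+1 is cut, given the bound RSS(S) and the best residual r_{j+1}.
inline bool hbba_cut(double tau, double rss_S, double r_next) {
    return (1.0 + tau) * rss_S > r_next;           // cutting test (7)
}
```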
Fig. 6. The λ(i) level tolerance function: λ decreases linearly from 2τ at level i = 0 to 0 at level i = n − 1.
Table 4
Mean number of nodes and RREs generated by the HBBA and LevelHBBA

              n_true = 9           n_true = 18          n_true = 27
Algorithm     Nodes     RRE        Nodes     RRE        Nodes     RRE
HBBA          14 278    6e−4       47 688    3e−4       35 062    9e−4
LevelHBBA     13 129    8e−4       34 427    5e−4       21 455    3e−3
In order to increase the capability of the heuristic strategy to tackle larger subset-selection problems, a new heuristic
algorithm is proposed. The Level HBBA (LevelHBBA) employs different values of the tolerance parameter in different
levels of the regression tree. It uses higher values in the levels close to the root node to encourage the cutting of large
subtrees. Lower tolerance values are employed in lower levels of the tree in order to select good quality subset models.
The indices of the tree levels are shown in Fig. 1. The tolerance function employed by the LevelHBBA is defined formally as
\[
\lambda(i) = 2\tau (n - i - 1)/(n - 1) \quad \text{for } i = 0, \ldots, n - 1,
\]
where i denotes the level of the regression tree and τ the average tolerance. The graph of the function λ(i) is shown in Fig. 6.
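A small check (ours) that the level tolerance function above decreases linearly from 2τ at the root level to 0 at the deepest level and averages to τ over the n levels:

```cpp
#include <cstdio>

int main() {
    const int n = 36;                               // number of variables / tree levels
    const double tau = 0.2;                         // mean tolerance
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += 2.0 * tau * (n - i - 1) / (n - 1);   // lambda(i)
    std::printf("lambda(0) = %.3f, lambda(n-1) = %.3f, mean = %.6f\n",
                2.0 * tau, 0.0, sum / n);
    return 0;
}
```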
The HBBA and LevelHBBA were executed on data sets with 36 variables. Three types of data sets were employed,
with a small, a medium and a big true model comprising 9, 18 and 27 variables, respectively. The tolerance parameter has been set to 0.2. The results are summarized in Table 4. The table shows the number of nodes and the mean RRE. Each
experiment has been repeated 32 times. The values shown in the table are overall means. The LevelHBBA generates
slightly fewer nodes, but it produces results that are of lesser quality than those computed by the HBBA. Notice that
the average RRE is significantly lower than the tolerance employed.
The Size HBBA (SizeHBBA) assigns a different tolerance value to each subset model size. It can be seen as a
generalization of the HBBA and the RangeBBA. The degree of importance of each subset size can be expressed.
Lower tolerance values are attributed to subset sizes of greater importance. Less relevant subset sizes are given higher
tolerance values. Subset model sizes can be effectively excluded from the search by setting a very high tolerance
value. Thus, unlike the RangeBBA, the SizeHBBA can be employed to investigate non-contiguous size ranges. The
SizeHBBA satisfies
\[
\mathrm{RRE}(\tilde{S}_i) \le \tau_i \quad \text{for } i = 1, \ldots, n,
\]
where i denotes the size of the subset model and τ_i the corresponding tolerance value. Given a node (S, k), the child node (drop(S, j), j − 1) is cut if
\[
(1 + \tau_i) \cdot \mathrm{RSS}(S) > r_i \quad \text{for } i = j, \ldots, n_s - 1.
\]
Table 5
Mean number of nodes and RREs generated by the HBBA and SizeHBBA

              n_true = 9           n_true = 18          n_true = 27
Algorithm     Nodes     RRE        Nodes     RRE        Nodes     RRE
HBBA          12 781    8e−4       38 716    4e−4       39 907    1e−3
SizeHBBA      15 079    2e−4       39 457    2e−4       40 250    3e−4
The SizeHBBA generalizes the previous algorithms, i.e.,
\[
\text{SizeHBBA} \equiv
\begin{cases}
\text{DCA} & \text{if } \tau_i = -1, \\
\text{RangeDCA} & \text{if } \tau_i = -1 \text{ for } n_a \le i \le n_b \text{ and } \tau_i \gg 0 \text{ otherwise}, \\
\text{BBA} & \text{if } \tau_i = 0, \\
\text{RangeBBA} & \text{if } \tau_i = 0 \text{ for } n_a \le i \le n_b \text{ and } \tau_i \gg 0 \text{ otherwise}, \\
\text{HBBA} & \text{if } \tau_i = \tau.
\end{cases}
\]
The SizeHBBA is equivalent to all previously proposed algorithms with the exception of the LevelHBBA. Thus, it can
be seen as more than a mere heuristic algorithm and allows a very flexible investigation of all subset models.
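The special cases above can be expressed as tolerance vectors. The helper below (ours; "huge" stands for the arbitrarily large tolerance that effectively excludes a size from the search) returns τ_i = τ inside the range [n_a, n_b] and a huge value outside it, so that τ = 0 gives the RangeBBA, τ = −1 the RangeDCA, and the full range n_a = 1, n_b = n with τ > 0, τ = 0 or τ = −1 gives the HBBA, BBA or DCA, respectively.

```cpp
#include <limits>
#include <vector>

// Build the SizeHBBA tolerance vector (tau_1, ..., tau_n) for a size range.
std::vector<double> size_tolerances(int n, int na, int nb, double tau) {
    const double huge = std::numeric_limits<double>::max();
    std::vector<double> t(n, tau);
    for (int i = 1; i <= n; ++i)
        if (i < na || i > nb) t[i - 1] = huge;     // exclude sizes outside [na, nb]
    return t;
}
```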
It has been observed experimentally that the SizeHBBA is efficient compared to the HBBA when a tolerance value
is used for the first half of model sizes and zero-tolerance for the remaining sizes. That is, when the optimal solution
is guaranteed to be found for submodel sizes between n/2 and n. Table 5 shows the computational performance of the
HBBA and SizeHBBA on data sets with 36 variables. The HBBA is executed with τ = 0.2, while the SizeHBBA is executed with τ_i = τ for i ≤ 18 and τ_i = 0 otherwise. The results show that, without a significant increase in computational
cost, there is a gain in solution quality, i.e. the optimal subset models with 18 or more variables are found. Furthermore,
the results are consistent with the observed behavior of the RangeBBA. For bigger models, larger subranges can be
chosen at a reasonable computational cost. In case of the SizeHBBA, constraints on larger submodels can be stricter
(i.e. lower tolerance) without additional computational cost. This may be due to the asymmetric structure of the tree,
i.e. subtrees are smaller on the right hand side.
5. Conclusions
Various algorithms for computing the best subset regression models have been developed. They improve and extend
previously introduced exhaustive and heuristic strategies which were aimed at solving large-scale model-selection
problems. The proposed algorithms are based on a dropping column algorithm (DCA) which derives all possible subset
models by generating a regression tree (Gatu and Kontoghiorghes, 2003, 2006; Smith and Bremner, 1989).
An algorithm (RangeDCA) that computes the all-subsets models within a given range of model sizes has been
proposed. The RangeDCA is a generalization of the DCA and it generates only a subtree of the all-subsets tree derived
by the DCA. Theoretical measures of complexity of the RangeDCA have been derived and analyzed (see Appendix).
The theoretical complexities have been confirmed through experiments. The branch-and-bound strategy in Gatu and
Kontoghiorghes (2006) has been applied in the tree that is generated by the RangeDCA.
The preordering of the initial variable set (BBA-1) significantly improves the computational performance of the
BBA. However, the BBA-1 might fail to detect significant combinations of variables in the root node. Hence, a more
robust preordering strategy is designed. Subsets of variables are sorted inside the regression tree after some variables
have been deleted. Thus, important combinations of variables are more likely to be identified and exploited by the
algorithm.
A preordering BBA (RadiusBBA) which generalizes the BBA-1 has been designed. The RadiusBBA applies variable preordering to nodes of arbitrary radius in the regression tree rather than to the root only. The radius provides a
measure of the distance between a node and the root. Experiments have shown that the number of nodes computed
by the RadiusBBA decreases as the preordering radius increases. However, the preordering requires the retriangularization of an upper triangular matrix after deleting a column and it incurs a considerable computational overhead. A
computationally efficient strategy has been designed, which avoids the explicit retriangularization used to compute the
strength of a variable. This reduces the total overhead of the RadiusBBA. In various experiments, it has been observed
that the best performance is achieved when preordering is employed with a radius of between one quarter and one third
of the number of variables. The RadiusBBA significantly reduces the computational time required to derive the best
submodels when compared to the existing BBA-1. This allows the RadiusBBA to tackle subset-selection problems that
have previously been considered as computationally infeasible.
A second class of algorithms has been designed, which improve the heuristic version of the BBA (HBBA) (Gatu
and Kontoghiorghes, 2006). The Level HBBA (LevelHBBA) applies different tolerances on different levels of the
regression tree. The LevelHBBA generates fewer nodes than the HBBA when both algorithms are applied with the
same mean tolerance. Although the subset models computed by the LevelHBBA are of lesser quality than those
computed by the HBBA, the relative residual errors remain far below the mean tolerance. The size-heuristic BBA
(SizeHBBA) assigns different tolerance values to subset models of different sizes. The subset models computed by the
SizeHBBA improve the quality of the models derived by the HBBA. Thus, for approximately the same computational
effort, the SizeHBBA produces submodels closer to the optimal ones than does the HBBA. With suitably chosen tolerances the SizeHBBA is equivalent to the DCA, RangeDCA, BBA, RangeBBA and HBBA. This makes the SizeHBBA a
powerful and flexible tool for computing subset models. Within this context, it extends the RangeBBA by allowing the
investigation of submodels of non-contiguous size ranges.
The employment by the RadiusBBA of computationally less expensive criteria in preordering the variables should be
investigated. This should include the use of parallel strategies to compute the bound of the model after deleting a variable
(Hofmann and Kontoghiorghes, 2006). It might be fruitful to explore the possibility of designing a dynamic heuristic
BBA which automatically determines the tolerance value in a given node based on a learning strategy. A parallelization
of the BBA, employing a task-farming strategy on heterogeneous parallel systems, could be considered. The adaptation
of the strategies to the vector autoregressive model is currently under investigation (Gatu and Kontoghiorghes, 2005,
2006).
Acknowledgments
The authors are grateful to the guest-editor Manfred Gilli and the two anonymous referees for their valuable comments
and suggestions. This work is in part supported by the Swiss National Science Foundation Grants 101412-105978,
200020-100116/1, PIOI1-110144 and PIOI1-115431/1, and the Cyprus Research Promotion Foundation Grant KYIT/0906/09.
Appendix A. Subrange model selection: complexity analysis
Let the pair (S, k) denote a node of the regression tree, where S is a set of n variables and k the number of passive variables (0 ≤ k < n). A formal representation of the DCA regression tree is given by
\[
\Psi(S, k) =
\begin{cases}
(S, k) & \text{if } k = n - 1, \\
\bigl( (S, k),\ \Psi(\mathrm{drop}(S, k+1), k),\ \ldots,\ \Psi(\mathrm{drop}(S, n-1), n-2) \bigr) & \text{if } k < n - 1.
\end{cases}
\]
The operation drop(S, i) deletes the ith variable in S = [s_1, . . . , s_n]. The QR decomposition is downdated after the corresponding column of the data matrix has been deleted. Orthogonal Givens rotations are employed in reconstructing the upper-triangular factor. An elementary operation is defined as the rotation of two vector elements. The cost of one elementary operation is approximately six flops. The number of elementary operations required by the drop operation is
\[
T_{\mathrm{drop}}(S, i) = (n - i + 1)(n - i + 2)/2 .
\]
The passive variables s_1, . . . , s_k are not dropped, i.e. they are inherited by all child nodes. All active variables s_{k+1}, . . . , s_n, except the last one, are dropped in turn to generate new nodes. The structure of the tree can be expressed in terms of the number of active variables d = n − k. This simplified representation Θ(d) of the regression tree Ψ(S, k) is given by
\[
\Theta(d) =
\begin{cases}
\theta(d) & \text{if } d = 1, \\
\bigl( \theta(d),\ \Theta(d-1),\ \ldots,\ \Theta(1) \bigr) & \text{if } d > 1,
\end{cases}
\]
where θ(d) is a node with d active variables. The number of nodes and elementary operations are calculated, respectively, by
\[
N(d) = 1 + \sum_{i=1}^{d-1} N(d - i) = 2^{d-1}
\]
and
\[
T(d) = \sum_{i=1}^{d-1} \bigl( T_{\mathrm{drop}}(d, i) + T(d - i) \bigr) = 7 \cdot 2^{d-1} - (d^2 + 5d + 8)/2 .
\]
Here, T_drop(d, i) is the complexity of dropping the ith of d active variables (i = 1, . . . , d).
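The recursions and closed forms above can be checked numerically; the short program below (ours) does so for d up to 15, taking T_drop(d, i) = (d − i + 1)(d − i + 2)/2 as in the definition of T_drop(S, i) with n = d.

```cpp
#include <cassert>

// Recursive node and operation counts of the simplified DCA tree.
long long n_rec(int d) {
    long long s = 1;
    for (int i = 1; i <= d - 1; ++i) s += n_rec(d - i);
    return s;
}
long long t_rec(int d) {
    long long s = 0;
    for (int i = 1; i <= d - 1; ++i)
        s += static_cast<long long>(d - i + 1) * (d - i + 2) / 2 + t_rec(d - i);
    return s;
}

int main() {
    for (int d = 1; d <= 15; ++d) {
        assert(n_rec(d) == (1LL << (d - 1)));                              // N(d) = 2^{d-1}
        assert(t_rec(d) == 7 * (1LL << (d - 1)) - (d * d + 5 * d + 8) / 2);
    }
    return 0;
}
```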
Let n_a designate a model size (1 ≤ n_a ≤ n). Then, Ψ_{n_a}(S, k) denotes the subtree of Ψ(S, k) which consists of all nodes which evaluate exactly one model of size n_a (0 ≤ k < n_a). It is equivalent to
\[
\Theta_a(d) =
\begin{cases}
\theta(d) & \text{if } d = a, \\
\bigl( \theta(d),\ \Theta_a(d-1),\ \ldots,\ \Theta_1(d-a) \bigr) & \text{if } d > a,
\end{cases}
\]
where a = n_a − k. The number of nodes is calculated by
\[
N_a(d) =
\begin{cases}
1 & \text{if } d = a, \\
1 + \sum_{i=1}^{a} N_{a-i+1}(d - i) & \text{if } d > a
\end{cases}
= C_a^d = \frac{d!}{a!\,(d - a)!} .
\]
Similarly, the number of elementary operations required to construct Θ_a(d) is calculated by
\[
T_a(d) =
\begin{cases}
0 & \text{if } d = a, \\
\sum_{i=1}^{a} \bigl( T_{\mathrm{drop}}(d, i) + T_{a-i+1}(d - i) \bigr) & \text{if } d > a
\end{cases}
= T_{\mathrm{drop}}(d, 1) + T_{a-1}(d - 1) + T_a(d - 1) .
\]
The closed form
\[
T_a(d) = \sum_{i=0}^{a-1} \sum_{j=i}^{d-a+i-1} C_i^{\,j}\, T_{\mathrm{drop}}(d - j, 1)
\]
is obtained through the generating function
\[
G(x, y) = (1 - y(1 + x))^{-1} \sum_{0 < i < j} T_{\mathrm{drop}}(j, i)\, x^i y^j
\]
of T_a(d), where k = 0, a = n_a and d = n. That is, this is the number of elementary operations necessary to compute all subset models comprising n_a out of n variables.
Now, let Ψ_{n_a,n_b}(S, k) denote the tree which evaluates all subset models with no fewer than n_a and no more than n_b variables (1 ≤ n_a ≤ n_b ≤ n and 0 ≤ k < n_a). It is equivalent to
\[
\Theta_{a,b}(d) =
\begin{cases}
\theta(d) & \text{if } d = a, \\
\Theta_{a,b-1}(d) & \text{if } d = b, \\
\bigl( \theta(d),\ \Theta_{a,b}(d-1),\ \ldots,\ \Theta_{1,b-a+1}(d-a),\ \ldots,\ \Theta_{1,1}(d-b) \bigr) & \text{if } d > b,
\end{cases}
\]
where a = n_a − k and b = n_b − k. This tree can be seen as the union of all trees Θ_c(d), for c = a, . . . , b. Hence, the number of nodes and operations can be calculated, respectively, by
\[
N_{a,b}(d) = \sum_{c=a}^{b} N_c(d) - \sum_{c=a}^{b-1} \overline{N}_c(d)
\]
and
\[
T_{a,b}(d) = \sum_{c=a}^{b} T_c(d) - \sum_{c=a}^{b-1} \overline{T}_c(d).
\]
Now,
\[
\overline{N}_c(d) = C_c^{\,d-1}
\]
and
\[
\overline{T}_c(d) = \sum_{i=0}^{c-1} \sum_{j=i}^{d-c+i-2} C_i^{\,j}\, T_{\mathrm{drop}}(d - j, 1)
\]
are the nodes and operations which have been counted twice. Specifically, these are given by the subtrees \overline{\Psi}_{n_c}(S, k) which represent the intersection of the two trees Ψ_{n_c}(S, k) and Ψ_{n_c+1}(S, k) for 1 ≤ n_c < n. Their structure is given by
\[
\overline{\Theta}_c(d) =
\begin{cases}
\theta(d) & \text{if } d = c + 1, \\
\bigl( \theta(d),\ \overline{\Theta}_c(d-1),\ \ldots,\ \overline{\Theta}_1(d-c) \bigr) & \text{if } d > c + 1,
\end{cases}
\]
where c = n_c − k.
References
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37, 373–384.
Clarke, M.R.B., 1981. Statistical algorithms: algorithm AS 163: a Givens algorithm for moving from one linear model to another without going back
to the data. J. Roy. Statist. Soc. Ser. C Appl. Statist. 30, 198–203.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Furnival, G., Wilson, R., 1974. Regression by leaps and bounds. Technometrics 16, 499–511.
Gatu, C., Kontoghiorghes, E.J., 2003. Parallel algorithms for computing all possible subset regression models using the QR decomposition. Parallel
Comput. 29, 505–521.
Gatu, C., Kontoghiorghes, E.J., 2005. Efficient strategies for deriving the subset VAR models. Comput. Manage. Sci. 2, 253–278.
Gatu, C., Kontoghiorghes, E.J., 2006. Branch-and-bound algorithms for computing the best subset regression models. J. Comput. Graph. Statist. 15,
139–156.
Gatu, C., Yanev, P., Kontoghiorghes, E.J., 2007. A graph approach to generate all possible regression submodels. Comput. Statist. Data Anal. in
press, doi: 10.1016/j.csda.2007.02.018.
Golub, G.H., Van Loan, C.F., 1996. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. third ed. Johns Hopkins University
Press, Baltimore, MA.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York.
Hocking, R.R., 1976. The analysis and selection of variables in linear regression. Biometrics 32, 1–49.
Hofmann, M., Kontoghiorghes, E.J., 2006. Pipeline givens sequences for computing the QR decomposition on a EREW PRAM. Parallel
Comput. 32, 222–230.
Kontoghiorghes, E.J., 2000. Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems, Advances in Computational
Economics, vol. 15. Kluwer Academic Publishers, Boston.
Miller, A.J., 2002. Subset Selection in Regression Monographs on Statistics and Applied Probability, vol. 95, second ed. Chapman & Hall, London
(Related software can be found at URL: http://users.bigpond.net.au/amiller/).
Narendra, P.M., Fukunaga, K., 1997. A branch and bound algorithm for feature subset selection. IEEE Trans. Comput. 26, 917–922.
R Development Core Team, 2005. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna,
Austria.
Roberts, S.J., 1984. Statistical algorithms: algorithm AS 199: a branch and bound algorithm for determining the optimal feature subset of given size.
Appl. Statist. 33, 236–241.
Searle, S.R., 1971. Linear Models. Wiley, New York.
Seber, G.A.F., 1977. Linear Regression Analysis. Wiley, New York.
Sen, A., Srivastava, M., 1990. Regression Analysis. Theory, Methods and Applications. Springer, Berlin.
Smith, D.M., Bremner, J.M., 1989. All possible subset regressions using the QR decomposition. Comput. Statist. Data Anal. 7, 217–235.
Somol, P., Pudil, P., Kittler, J., 2004. Fast branch & bound algorithms for optimal feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26,
900–912.
Tibshirani, R.J., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B (Statist. Methodol.) 58, 267–288.
An exact least-trimmed-squares algorithm for a range of coverage values

Marc Hofmann*, Cristian Gatu†, Erricos John Kontoghiorghes‡

* Institut d'informatique, Université de Neuchâtel, Switzerland. E-mail: marc.hofmann@unine.ch
† VTT Technical Research Centre of Finland, Espoo, Finland; and Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi, Romania. E-mail: cristian.gatu@unine.ch
‡ Department of Public and Business Administration, University of Cyprus, Cyprus; and School of Computer Science and Information Systems, Birkbeck College, University of London, UK. E-mail: erricos@ucy.ac.cy
Abstract
A new adding row algorithm (ARA) extending existing methods to compute exact least trimmed squares
(LTS) regression is presented. The ARA employs a tree-based strategy to compute LTS regressors for a range
of coverage values. Thus, a priori knowledge of the optimal coverage parameter is not required. New nodes
in the regression tree are generated by updating the QR decomposition after adding one observation to the
regression model. The ARA is enhanced employing a branch-and-bound strategy. The branch-and-bound
algorithm is an exhaustive algorithm that uses a cutting test to prune non-optimal subtrees. It significantly
improves over the ARA in computational performance. Observation preordering throughout the traversal
of the regression tree is investigated. A computationally efficient and numerically stable calculation of the
bounds using Givens rotations is designed around the QR decomposition, avoiding the need to explicitly
update the triangular factor when an observation is added. This reduces the overall computational load
of the preordering device by approximately half. A solution is proposed to allow preordering during the
execution of the algorithm when the model is underdetermined. It employs pseudo-orthogonal rotations to
downdate the QR decomposition. The strategies are illustrated by example. Experimental results confirm
the computational efficiency of the proposed algorithms.
Keywords: Least trimmed squares, outliers, regression tree algorithms, QR factorization.
1 Introduction
Least-squares regression is sensitive to outliers. This has prompted the search for regression estimators which
are resistant to data points that deviate from the usual assumptions. The goal of positive-breakdown estimators is to be robust against the possibility of unannounced outliers (Rousseeuw 1997). The breakdown point
provides a crude quantification of the robustness properties of an estimator. Briefly, the breakdown point is
the smallest amount of contamination that may cause an estimator to take on arbitrarily large aberrant values
(Donoho & Huber 1983).
Consider the standard regression model
y = Xβ + ε,
(1)
where y ∈ Rn is the dependent-variable vector, X ∈ Rn×p is the exogenous data-matrix of full column-rank,
β ∈ Rp is the coefficient vector and ε ∈ Rn is the noise vector. It is usually assumed that ε is normally
distributed with zero mean and variance-covariance matrix σ 2 In . Least squares (LS) regression consists of
minimizing the residual sum of squares (RSS). One outlier may be sufficient to compromise the LS estimator.
In other words, the finite-sample breakdown point of the LS estimator is 1/n and therefore tends to 0 when n is
large (Rousseeuw 1997). Several positive-breakdown methods for robust regression have been proposed, such
as the least median of squares (LMS) (Rousseeuw 1984). The LMS is defined by minimizing medi ε̂2i . The
LMS attains the highest-possible breakdown-value, namely (⌊(n − p)/2⌋ + 1)/n. This means that the LMS
fit stays in a bounded region whenever ⌊(n − p)/2⌋, or fewer, observations are replaced by arbitrary points
(Rousseeuw & Van Driessen 2006).
The least trimmed squares (LTS) estimator possesses better theoretical properties than the LMS (Rousseeuw
& Van Driessen 2006, Hössjer 1994). The objective of the LTS estimator is to minimize
\[
\sum_{i=1}^{h} \langle \hat{\varepsilon}^2 \rangle_i ,
\]
where h = 1, . . . , n and $\langle \hat{\varepsilon}^2 \rangle$ denotes the vector of squared residuals sorted in increasing order. This is equivalent
to finding the h-subset of observations with the smallest LS objective function. The LTS regression estimate
is then the LS fit to these h points. The breakdown value of LTS with h = ⌊(n + p + 1)/2⌋ is equivalent to
that of the LMS. In spite of its advantages over the LMS estimator, the LTS estimator has been applied less
often because it is computationally demanding. For the multiple linear regression model, the trivial algorithm
that explicitly enumerates and computes the RSS for all h-subsets works if the number of observations is
relatively small, i.e. less than 30. Otherwise, the computational load is prohibitive. To overcome this drawback,
several approximate algorithms have been proposed. These include the PROGRESS algorithm (Rousseeuw &
Leroy 1987), the feasible-solution-algorithm (FSA) (Hawkins 1994, Hawkins & Olive 1999) and the FAST-LTS algorithm (Rousseeuw & Van Driessen 2006). However, these algorithms do not inspect all combinations
of observations and are not guaranteed to find the optimal solution. An exact algorithm to calculate the LTS
estimator has been proposed (Agulló 2001). It is based on a branch-and-bound procedure that does not require
the explicit enumeration of all h-subsets. It therefore reduces the computational load of the trivial algorithm
significantly. A tree-based algorithm to enumerate subsets of observations in the context of outlier detection
has been suggested (Belsley, Kuh & Welsch 1980).
High-breakdown estimators may pick up local linear trends with different slopes than the global linear trend.
Thus, high-breakdown estimators may have arbitrarily low efficiency (Stefanski 1991, Morgenthaler 1991).
The higher the breakdown point, the greater the probability that this happens. Thus, for small n/p it
is preferable to use a method with lower breakdown value such as the LTS with larger h (Rousseeuw 1997).
Here, an adding row algorithm (ARA) is proposed which computes the exact LTS estimates for a range hmin :
hmax = {h ∈ N|h ≥ hmin , h ≤ hmax }. It renders possible the efficient computation and investigation of a set of
exact LTS estimators for distinct breakdown values h = hmin , . . . , hmax . A branch-and-bound algorithm (BBA)
improves upon the ARA. Its computational efficiency is further improved by preordering the observations.
The ARA and its application to LTS regression is discussed in Section 2. The BBA is introduced in Section 3, observation preordering is investigated and numerical results are presented. Conclusions and notions of
future work are discussed in Section 4.
2 Adding row algorithm
As noted by Belsley et al. (1980), there is a strong correspondence between row-selection techniques and
procedures for computing the all-possible-column-subsets regression. Within the context of variable-subset
selection, a Dropping column algorithm (DCA) has been discussed (Gatu & Kontoghiorghes 2003, Gatu &
Kontoghiorghes 2006, Smith & Bremner 1989). The new Adding row algorithm (ARA) computes the all-observation-subsets regression. The organization of the algorithm is similar to that of the DCA and is determined by the all-subsets tree illustrated in Figure 1 (Furnival & Wilson 1974, Gatu & Kontoghiorghes 2003,
Smith & Bremner 1989), where the number of observations in the model is n = 4.
[Figure 1: The ARA regression tree where I = [1234], with nodes (S, A) arranged by level from the root ([], [1234]) down to the leaves.]
The observations or points in model (1) are designated by their indices I = [1, . . . , n]. A node (S, A) in the
regression tree carries an observation-subset model of nS selected observations S ⊆ I. The set A represents
the nA observations which are available for selection in nodes that are children of node (S, A). The RSS of the
LS estimator that corresponds to the subset model S is computed in each node. The subset model is denoted by
(XS yS ) and is assumed to be of full rank. The regression model is represented by means of the numerically
stable QR decomposition (QRD)
\[
Q_S^T \begin{pmatrix} X_S & y_S \end{pmatrix}
= \begin{pmatrix} R_S & z_S \\ 0 & w \end{pmatrix}, \tag{2}
\]
where the two block rows have $p$ and $n_S - p$ rows, respectively, $Q_S$ is orthogonal and $R_S$ is square upper-triangular and non-singular. The RSS of the model is given by $\rho_S = w^T w$. Note that for underdetermined models ($n_S < p$), the QR factorization is not computed and the RSS is 0. The orthogonal factor $Q_S^T$ is typically a product of Givens rotations, or Householder reflectors (Golub & Van Loan 1996).
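For illustration only, the quantities $R_S$, $z_S$ and $\rho_S$ in (2) can be obtained with a standard QR routine; the following is a minimal numpy sketch (not part of the original algorithm), assuming $X_S$ has full column rank.

```python
import numpy as np

def subset_rss(X_S, y_S):
    """Return R_S, z_S and rho_S from the QRD (2) of the subset model (X_S y_S)."""
    n_S, p = X_S.shape
    if n_S < p:                        # underdetermined model: the RSS is defined as 0
        return None, None, 0.0
    # QR decomposition of the augmented matrix (X_S y_S)
    Q, R_aug = np.linalg.qr(np.column_stack([X_S, y_S]), mode="reduced")
    R_S = R_aug[:p, :p]                # p x p upper triangular factor
    z_S = R_aug[:p, p]                 # transformed right-hand side
    w = y_S - X_S @ np.linalg.solve(R_S, z_S)     # residual vector of the LS fit
    return R_S, z_S, float(w @ w)      # rho_S = w^T w

# Example with random illustrative data
rng = np.random.default_rng(0)
X_S, y_S = rng.normal(size=(10, 3)), rng.normal(size=10)
print(subset_rss(X_S, y_S)[2])
```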
Given any node (S, A = [a1 , . . . , anA ]), its nA child nodes are given by
$\bigl[(\mathrm{add}(S, a_1), A_{2:}),\; (\mathrm{add}(S, a_2), A_{3:}),\; \ldots,\; (\mathrm{add}(S, a_{n_A}), \emptyset)\bigr],$
where $A_{i:}$ denotes the subset of A containing all but the first i − 1 observations. The operation add(S, ai ),
i = 1, . . . , nA , constructs the new linear model S ∪ {ai } by updating the linear model S with observation
ai . Effective algorithms to update the quantities RS , zS and ρS after adding an observation exist (Gill, Golub,
Murray & Saunders 1974). A Cholesky updating routine based on orthogonal Givens rotations can be found
in the LINPACK numerical library (Dongarra, Bunch, Moler & Stewart 1979). It requires approximately 3p2
flops (Björck, Park & Eldén 1994).
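The row-updating step can be sketched as follows. This is a minimal numpy implementation of a Givens-based update in the spirit of the routines cited above (it is not the LINPACK routine itself); `R`, `z`, `rho` describe the parent model and `(x, y)` is the added observation.

```python
import numpy as np

def add_observation(R, z, rho, x, y):
    """Update R (p x p upper triangular), z (p,) and rho after adding the row (x^T, y)."""
    R, z, x, y = R.copy(), z.copy(), x.astype(float).copy(), float(y)
    p = R.shape[0]
    for i in range(p):
        # Givens rotation annihilating x[i] against the diagonal element R[i, i]
        t = np.hypot(R[i, i], x[i])
        c, s = R[i, i] / t, x[i] / t
        R_row, x_row = R[i, i:].copy(), x[i:].copy()
        R[i, i:] = c * R_row + s * x_row
        x[i:] = -s * R_row + c * x_row
        z_i, y_i = z[i], y
        z[i] = c * z_i + s * y_i
        y = -s * z_i + c * y_i
    return R, z, rho + y * y            # the rotated y accumulates into the RSS

# Consistency check against a full QR of the augmented data (illustrative data)
rng = np.random.default_rng(1)
X, yv = rng.normal(size=(8, 3)), rng.normal(size=8)
Ra = np.linalg.qr(np.column_stack([X[:-1], yv[:-1]]))[1]
R0, z0 = Ra[:3, :3], Ra[:3, 3]
rho0 = float(np.sum((yv[:-1] - X[:-1] @ np.linalg.solve(R0, z0)) ** 2))
R1, z1, rho1 = add_observation(R0, z0, rho0, X[-1], yv[-1])
print(np.isclose(rho1, np.sum((yv - X @ np.linalg.solve(R1, z1)) ** 2)))
```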
A straightforward “brute-force” (BF) method to compute the exact LTS estimator β̂h for a given coverage h (h = 1, . . . , n) consists in enumerating all possible h-subsets and solving the LS problem for each subset. This implies computing $\binom{n}{h} = n!/(h!(n-h)!)$ QRDs. Thus, the computational cost amounts to approximately $T_{BF} = \binom{n}{h} \cdot 3hp^2$ flops, where $3hp^2$ is the approximate number of flops necessary to compute the QRD. On the
other hand, the specialized algorithm ARAh to compute the LTS regressor β̂h for a given coverage h lists the
observation subsets in an order predetermined by the all-subsets tree. Starting at the root node, it traverses all
nodes that lead to a node on level h. The nodes on level h are included in the traversal. Although it enumerates
more subsets than the BF algorithm, the computational load is less. By exploiting the information gathered in
intermediate nodes, the QRDs are obtained cheaply. That is, the QRD in a node on level ℓ is partially available
as the QRD in the parent node (level ℓ − 1). The new QRD is not derived from scratch. Instead, it is obtained
by adding an observation to the selected set S. Numerically, this implies updating the parent QRD by one row.
Let ∆(S, A) denote the all-observation-subsets tree with root node (S, A) and let ∆h (S, A) denote the tree
generated by the ARAh . The tree ∆h (S, A) is the subtree of ∆(S, A) that contains the nodes that lie on a path
between the root node and a node on level h. It is equivalent to the tree employed in feature subset regression
by Narendra & Fukunaga (1977) and by Agulló’s (2001) exact LTS algorithm (hereafter denoted by AGLA).
Formally,
\[
\Delta_h(S, A) =
\begin{cases}
\emptyset & \text{if } n_S + n_A < h \text{ or } n_S > h, \\[2pt]
\bigl[(S, A),\; \Delta_h(\mathrm{add}(S, a_1), A_{2:}),\; \ldots,\; \Delta_h(\mathrm{add}(S, a_{n_A}), \emptyset)\bigr] & \text{otherwise},
\end{cases}
\]
where $n_S = |S|$ and $n_A = |A|$. The tree $\Delta_h(\emptyset, I)$ consists of $\binom{n+1}{h}$ nodes, where $n = |I|$. Thus, the computational cost of the ARA$_h$ is approximately $T_{ARA_h} = \binom{n+1}{h} \cdot 3p^2$ flops. In other words,
\[
T_{BF}/T_{ARA_h} \approx \alpha(1-\alpha)n + \alpha = O(n),
\]
where $n \gg p$ and $h = \alpha n$ ($\alpha < 1$). The ARA$_h$ is $O(n)$ times faster than the brute-force method.
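For concreteness, the two flop counts can be compared numerically; the sketch below uses `math.comb` for the binomial coefficients, with illustrative values of n, p and h.

```python
import math

def t_bf(n, h, p):       # brute force: C(n, h) QRDs of roughly 3*h*p^2 flops each
    return math.comb(n, h) * 3 * h * p**2

def t_ara(n, h, p):      # ARA_h: C(n+1, h) nodes, one ~3*p^2 flop update per node
    return math.comb(n + 1, h) * 3 * p**2

n, p = 40, 4
for h in (20, 30):
    print(h, t_bf(n, h, p) / t_ara(n, h, p))    # ratio grows like O(n)
```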
The optimal value of h will both resist outliers in the data and give the highest efficiency, i.e. will accurately reveal the global linear trend. However, in practice, this value is never known before using LTS regression
(Atkinson & Cheng 1999). The ARAh visits all the nodes in ∆h to determine β̂h , the LTS estimator with coverage h. Several passes are required to find all the LTS estimators β̂h , h = hmin , . . . , hmax . Each pass executes
the ARA$_h$ for one h. This implies computing $\sum_{h=h_{\min}}^{h_{\max}} \binom{n+1}{h}$ nodes. Notice that nodes will be computed several times over. It is inefficient to analyse the same data several times for different values of h.
Now, let ∆hmin :hmax (S, A) denote the subtree of ∆(S, A) that contains all nodes that lie on a path between
the root node and a node on level $\ell$ ($\ell = h_{\min}, \ldots, h_{\max}$). Formally,
\[
\Delta_{h_{\min}:h_{\max}}(S, A) =
\begin{cases}
\emptyset & \text{if } n_S + n_A < h_{\min} \text{ or } n_S > h_{\max}, \\[2pt]
\bigl[(S, A),\; \Delta_{h_{\min}:h_{\max}}(\mathrm{add}(S, a_1), A_{2:}),\; \ldots,\; \Delta_{h_{\min}:h_{\max}}(\mathrm{add}(S, a_{n_A}), \emptyset)\bigr] & \text{otherwise}.
\end{cases}
\]
The ARAhmin :hmax generates the tree ∆hmin :hmax (∅, I) and returns the set of LTS estimators {β̂h |h = hmin , . . . , hmax }.
It is optimal in the sense that it does not generate any unnecessary nodes and that it does not generate any node
multiple times. The number of nodes that it computes is given by
\[
N_{ARA} = \sum_{h=h_{\min}}^{h_{\max}} \binom{n+1}{h} \;-\; \sum_{h=h_{\min}}^{h_{\max}-1} \binom{n}{h},
\]
thus improving over the previous approach. The complete procedure for generating ∆hmin :hmax (∅, I) is given by
Algorithm 1. Nodes which await processing are held in a node list. The list is managed according to a LIFO
(“last in, first out”) strategy (Burks, Warren & Wright 1954, Newell & Shaw 1957). The output of the algorithm
is ρ̂h and Ŝh , respectively the RSS of the LTS estimate β̂h and the corresponding h-subset (h = hmin , . . . , hmax ).
Algorithm 1: Adding row algorithm (ARA)
 1  procedure ARA(I, hmin, hmax, ρ̂, Ŝ)
 2    n ← |I|
 3    ρ̂h ← +∞, where h = hmin, . . . , hmax
 4    Insert node (∅, I) into node list
 5    while list contains nodes do
 6      Remove node (S, A) from node list
 7      nS ← |S|; nA ← |A|; ρS ← RSS(S)
 8      if nS ≥ hmin and nS ≤ hmax then
 9        if ρS < ρ̂nS then ρ̂nS ← ρS, ŜnS ← S
10      end if
11      if nS + nA ≥ hmin and nS < hmax then
12        for i = 1, . . . , nA do
13          Compute (S′, A′) = (add(S, ai), Ai+1:)
14          Insert (S′, A′) into node list
15        end for
16      end if
17    end while
18  end procedure ARA
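A compact executable sketch of the ARA is given below. For readability it recomputes the RSS of every node with an ordinary least-squares solve instead of the QR-updating scheme described above, so it reproduces the traversal and the output of Algorithm 1 but not its flop count; the dataset and helper names are illustrative.

```python
import numpy as np

def ara(X, y, h_min, h_max):
    """Adding row algorithm: best RSS and h-subset for every h in h_min..h_max."""
    n = len(y)
    best_rss = {h: np.inf for h in range(h_min, h_max + 1)}
    best_set = {h: None for h in range(h_min, h_max + 1)}
    stack = [((), tuple(range(n)))]            # LIFO node list, root (S, A) = (empty, I)
    while stack:
        S, A = stack.pop()
        n_S, n_A = len(S), len(A)
        if n_S >= X.shape[1]:                  # RSS is 0 for underdetermined models
            b, res, *_ = np.linalg.lstsq(X[list(S)], y[list(S)], rcond=None)
            rho = float(res[0]) if res.size else float(
                np.sum((y[list(S)] - X[list(S)] @ b) ** 2))
        else:
            rho = 0.0
        if h_min <= n_S <= h_max and rho < best_rss[n_S]:
            best_rss[n_S], best_set[n_S] = rho, S
        if n_S + n_A >= h_min and n_S < h_max:
            for i, a in enumerate(A):          # children (add(S, a_i), A_{i+1:})
                stack.append((S + (a,), A[i + 1:]))
    return best_rss, best_set

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=12)
y[:3] += 50                                    # three gross outliers
print(ara(X, y, h_min=6, h_max=12)[0])
```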
Hofmann, Gatu & Kontoghiorghes (2007) give a detailed analysis of the tree structures and the associated
complexities in the context of variable-subset selection.
The advantage of the ARA over the previously introduced AGLA is that the ARA simultaneously computes the LTS estimators for a range of coverage values 1 ≤ hmin ≤ hmax ≤ n. It traverses the tree ∆hmin :hmax
only once and avoids redundant computations. In contrast, the AGLA can determine only one LTS regressor at
a time. Thus, in order to obtain the same list of LTS estimators, the AGLA is executed hmax − hmin + 1 times.
This implies several independent traversals of the subsets-tree and the computation of redundant nodes.
2.1 Experimental results
To see the effect of the coverage parameter h on the LTS estimator, the ratio suggested by Atkinson & Cheng
(1999) is considered:
\[
R_h = \frac{\hat{\sigma}_h^2}{\hat{\sigma}_{LS}^2},
\]
where $\hat{\sigma}_h^2 = \hat{\rho}_h/(h-p)$ and $\hat{\rho}_h = \mathrm{RSS}(\hat{\beta}_h)$. The variance estimate associated with the LS estimator is given by $\hat{\sigma}_{LS}^2 = \hat{\sigma}_n^2$. The ratio $R_h$ is examined as different numbers of data are fitted in the model. Two data models
are used in the experiment. Atkinson & Cheng (1999) used the model
\[
y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{i,j} + \varepsilon_i, \qquad i = 1, \ldots, n,
\]
where εi ∼ N (0, 1) for good data, whereas bad data are simulated from N (12, 1). Rousseeuw & Van Driessen
(2006) used the model
\[
y_i = x_{i,1} + x_{i,2} + \cdots + x_{i,p-1} + 1 + \varepsilon_i, \qquad i = 1, \ldots, n,
\]
where εi ∼ N (0, 1) and xi,j ∼ N (0, 100). Outliers are introduced by replacing some of the xi,1 by values that
are normally distributed with mean 100 and variance 100.
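The second data model can be simulated in a few lines of numpy; the sizes below mirror the experiment (n = 32, p = 5, q = 8 outliers) and the function name is illustrative.

```python
import numpy as np

def simulate_rvd(n=32, p=5, q=8, seed=0):
    """Contaminated data in the spirit of Rousseeuw & Van Driessen (2006)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 10.0, size=(n, p - 1))        # x_{i,j} ~ N(0, 100)
    y = X.sum(axis=1) + 1.0 + rng.normal(size=n)      # y_i = sum_j x_{i,j} + 1 + eps_i
    X[:q, 0] = rng.normal(100.0, 10.0, size=q)        # replace some x_{i,1} by bad leverage points
    # an intercept column is added so that the design has p columns (illustrative choice)
    return np.column_stack([np.ones(n), X]), y

X, y = simulate_rvd()
print(X.shape, y.shape)
```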
Two classes of datasets are generated, one for each data model. Each class contains 100 datasets with
n = 32 and p = 5. The contaminated datasets are obtained from a clean dataset by injecting q = 8 outliers.
The ARA is then executed on all 100 datasets with hmin = 16 and hmax = 32, and the results reported for
both classes of data. For the first class of data (Atkinson & Cheng 1999), the ARA correctly discriminated the
contaminated observations in 92 of 100 cases. This means that for 92 out of 100 contaminated datasets, the
estimator β̂n−q=24 did not include any of the contaminated data points. For the second class of data (Rousseeuw
& Van Driessen 2006), the ARA correctly discriminated the outliers in all 100 cases. Figure 2 illustrates the
ratio Rh for the two classes of data. Each plot shows the ratio Rh for one case, i.e. a dataset with and without
contamination.

[Figure 2: The ratio Rh for two types of data. (a) Data: Atkinson & Cheng (1999). (b) Data: Rousseeuw & Van Driessen (2006). Each panel plots R(h) against the coverage h for a clean and a contaminated dataset.]
3 Branch and bound algorithm
The ARA is computationally prohibitive for even a moderate number of observations to investigate. The ARA
can be optimized to avoid the explicit enumeration of all observation-subsets. Given two sets of observations
$S_1$ and $S_2$, if $S_1 \subset S_2$ then $\rho_{S_1} \le \rho_{S_2}$,
where ρS denotes the RSS of the LS estimator of the model consisting of the observations in S. That is, adding
observations cannot cause the RSS of the model to decrease. This property can be used to restrict the number
of evaluated subsets while searching for the best observation-subset models.
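This monotonicity is easy to verify numerically; the sketch below compares the RSS of a nested pair of observation sets on random data (illustrative sizes only).

```python
import numpy as np

def rss(X, y, rows):
    """RSS of the LS fit restricted to the given observation indices."""
    b, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
    r = y[rows] - X[rows] @ b
    return float(r @ r)

rng = np.random.default_rng(3)
X, y = rng.normal(size=(20, 4)), rng.normal(size=20)
S1 = list(range(8))          # S1 is a subset of S2
S2 = list(range(12))
print(rss(X, y, S1) <= rss(X, y, S2))   # True: adding observations cannot decrease the RSS
```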
Let $\hat{\rho}_j$ denote the minimal RSS for models selecting $j$ observations. Furthermore, let $\hat{\rho}_j^{(g)}$ denote its value after $g$ nodes of the regression tree have been generated ($g = 0, \ldots, 2^n$). For any $g$, the RSS satisfy the following relationship:
\[
\hat{\rho}_1^{(g)} \le \hat{\rho}_2^{(g)} \le \cdots \le \hat{\rho}_n^{(g)}.
\]
See Lemma 1 in Appendix A for a formal proof. After the whole regression tree ∆(∅, I) has been generated, the minimal RSS corresponding to the best regression models for each number of observations are given
by ρ̂1 , . . . , ρ̂n (g = 2n ). Consider the gth node (S, A) and let its bound be ρS . A cutting test is devised.
Specifically,
\[
\text{if} \quad \rho_S \ge \hat{\rho}^{(g-1)}_{n_S+n_A-i+1}, \quad \text{then} \quad \rho_W \ge \hat{\rho}^{(g)}_{|W|}, \tag{3}
\]
where W is any model obtained from ∆(S, Ai: ), i = 1, . . . , nA (see Lemma 2 for a formal proof). This implies
that the subtrees ∆(add(S, ai ), Ai+1: ) cannot improve ρ̂. A procedure to compute the regression tree follows.
The child nodes (add(S, ai ), Ai+1: ) of node (S, A) are computed from left to right, i.e. for increasing i. If the
bound ρS is greater than ρ̂nS +nA −i+1 , then the corresponding child and its younger siblings (i.e. child nodes to
the right) are discarded. Otherwise, the child node is computed and this procedure repeated for the next child.
This is illustrated in Algorithm 2. The break statement on line 11 terminates and exits the inner-most loop of
the algorithm. Note that the cutting test is not effective in the first p levels of the tree, where the RSS of the
submodels is 0.
The computational efficiency of the BBA improves when more nodes are cut. That is, if bigger subtrees are
bounded with bigger values. This can be achieved by preordering the observations in each node (S, A). The
BBA with preordering (PBBA) constructs nodes using “stronger” observations first. The observations in A are
sorted according to their strength (Agulló 2001). The exact bound RSS(S+ai ) can be computed to determine
the strength of observation ai ∈ A. Here, S+ai denotes the set S to which the observation ai has been added.
This approach involves nA rank-1 Cholesky updates (Gill et al. 1974, Golub & Van Loan 1996). Note that the
observations in A are sorted in decreasing order of the RSS. Let $\tilde{R}$ denote the right-hand side of (2) and $\tilde{x}^T$ the row vector corresponding to the added observation. The updating process is illustrated in Figure 3(a). It
Algorithm 2: Branch-and-bound algorithm (BBA)
 1  procedure BBA(I, ρ̂, Ŝ)
 2    n ← |I|
 3    ρ̂i ← +∞, where i = 1, . . . , n
 4    Insert node (∅, I) into node list
 5    while list contains nodes do
 6      Remove node (S, A) from node list
 7      nS ← |S|; nA ← |A|; ρS ← RSS(S)
 8      if ρS < ρ̂nS then ρ̂nS ← ρS, ŜnS ← S
 9      end if
10      for i = 1, . . . , nA do
11        if ρS ≥ ρ̂nS+nA−i+1 then break
12        Compute (S′, A′) = (add(S, ai), Ai+1:)
13        Insert (S′, A′) into node list
14      end for
15    end while
16  end procedure BBA
involves the computation and the application of p + 1 Givens rotations, where p is the number of independent
variables. The $i$th Givens rotation $G_i$ can be written as
\[
G_i = \begin{pmatrix} c_i & s_i \\ -s_i & c_i \end{pmatrix},
\]
where $c_i = \tilde{R}_{i,i}/t$, $s_i = \tilde{x}_i/t$ and $t^2 = \tilde{R}_{i,i}^2 + \tilde{x}_i^2$. The application of $G_i$ is given by
\[
G_i \begin{pmatrix} \tilde{R}^{(i-1)}_{i,i:p+1} \\ \tilde{x}^{(i-1),T}_{i:p+1} \end{pmatrix}
= \begin{pmatrix} \tilde{R}^{(i)}_{i,i:p+1} \\ \tilde{x}^{(i),T}_{i:p+1} \end{pmatrix},
\]
where $\tilde{R}^{(i)}$ and $\tilde{x}^{(i),T}$ respectively denote $\tilde{R}$ and $\tilde{x}^T$ modified by the first $i$ Givens rotations, $\tilde{R}^{(0)} \equiv \tilde{R}$, $\tilde{x}^{(0)} \equiv \tilde{x}$ and $\tilde{x}^{(p+1)} \equiv 0$. The standard colon notation is used in order to denote submatrices and subvectors (Golub & Van Loan 1996). The bound, i.e. the RSS, is given by the square of the element $(p+1, p+1)$ of the matrix $\tilde{R}^{(p+1)} = G_{p+1} \cdots G_1 \tilde{R}$. The procedure is computationally expensive. Note that the construction of a Givens rotation does not involve previously modified elements of the matrix $\tilde{R}$. Thus, the application of the sequence of Givens rotations can be carried out without explicitly modifying the upper triangular factor as illustrated in Figure 3(b). The bound is given by $\bigl(\tilde{R}^{(p)}_{p+1,p+1}\bigr)^2 + \bigl(\tilde{x}^{(p)}_{p+1}\bigr)^2$. Thus, the computational complexity is roughly divided by two. The new procedure to derive the bound is illustrated by Algorithm 3.
[Figure 3: Exploiting the QR decomposition to compute the bound of an observation. (a) Updating the QR decomposition after adding a row. (b) Efficiently computing the bound of the observation.]
Algorithm 3: Computing the exact bound of an observation x
 1  procedure bound(R, x, b)
 2    for i = 1, . . . , p + 1 do
 3      t ← sqrt(R(i,i)² + x(i)²); c ← R(i,i)/t; s ← x(i)/t
 4      x(j) ← −s · x(j) + c · R(i,j), where j = i + 1, . . . , p + 1
 5    end for
 6    b ← t²
 7  end procedure
The absolute residuals (Resid) represent an alternative criterion that can be used in preordering the observations. In a node (S, A), the residual vector ε̂S is computed with respect to β̂S , the LS estimator of the
subset S of observations. This merely involves solving an upper-triangular system and is computationally inexpensive.
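A sketch of the residual-based criterion: given $R_S$ and $z_S$ of the current node, $\hat{\beta}_S$ follows from one triangular solve and the available observations can then be ranked by their absolute residuals (the names and the ascending ordering below are illustrative; the direction used in each stage is described in the text).

```python
import numpy as np
from scipy.linalg import solve_triangular

def resid_order(R_S, z_S, X_A, y_A):
    """Indices of the available observations sorted by |residual| (ascending)."""
    beta_S = solve_triangular(R_S, z_S)          # one back-substitution
    resid = np.abs(y_A - X_A @ beta_S)
    return np.argsort(resid)

rng = np.random.default_rng(5)
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)
Raug = np.linalg.qr(np.column_stack([X[:6], y[:6]]), mode="r")
print(resid_order(Raug[:3, :3], Raug[:3, 3], X[6:], y[6:]))
```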
The traversal by the PBBA of the subsets-tree is divided into two stages. In stage I, the algorithm visits
the nodes that are on the first p levels (i.e. levels 0 to p − 1) of the regression tree. The remaining nodes
are visited in stage II. In stage I, the subset-models S are underdetermined, that is nS < p. The estimator β̂S
is not available, and neither the RSSs nor the residual vector can be computed. To circumvent this problem,
one can turn to another estimator, namely β̂S∪A . The initial linear model I is solved in the root node. Then,
in stage I of the algorithm, each new child node can be obtained by downdating the estimator of the parent
node by one observation. The downdating operation can be implemented, for example, by means of pseudo-orthogonal rotations (Golub & Van Loan 1996, Alexander, Pan & Plemmons 1988). The residual vector ε̂S∪A
is then available and can be used to sort the available observations. Another indicator of the strength of an
observation is RSS(S ∪ A−ai ), the RSS of the estimator β̂S∪A downdated by the observation ai . This quantity
can be computed for every observation ai in A. The same scheme illustrated in Figure 3(b) can be employed,
substituting hyperbolic plane rotations for Givens rotations. The observations are then sorted in increasing
order of the criterion. To distinguish the two cases, RSS+ and Resid+ will denote the criteria used to preorder
the observations obtained by updating the estimator β̂S (stage II); RSS− and Resid− will denote the criteria
obtained by downdating β̂S∪A (stage I).
Preordering is an expensive procedure. The algorithm hence preorders the available observations in nodes
which are the root of big subtrees, i.e. where a potentially large number of nodes will be cut. A preordering
radius can be used to restrict the use of preordering (Hofmann et al. 2007). Specifically, observations are
preordered in nodes whose distance from the root node is smaller than the preordering radius. In the present
context this is equivalent to preordering the observations in all nodes such that nA > n − π, where π denotes
the preordering radius. Notice that if π = 0, then no preordering occurs. If π = 1, then the observations are
preordered in the root node; if π = n, then the observations are preordered in all nodes of the tree.
Figures 4 and 5 illustrate the computational cost of the PBBA for various preordering radii. The employed
dataset contains n = 40 observations (8 of which are outliers), p = 4 independent variables and is generated
according to Rousseeuw & Van Driessen (2006). The LTS estimates for coverage values hmin = 20 to hmax = 40
are computed. Four different preordering strategies are illustrated, depending on which criterion (RSS or Resid)
is used in each of the two stages (I/II). It can be seen from Figure 4 that all four preordering strategies generate
about the same number of nodes, and that the number of nodes does not continue to decrease for preordering
radii π > n/2. Thus, Resid should be the preordering criterion of choice as it is cheaper to compute than RSS,
and the employed preordering radius should be approximately n/2. These findings are confirmed by Figure 5
which shows that a preordering radius beyond n/2 does not significantly improve the execution time of the
algorithm. Moreover, employing the Resid criterion in stage II of the algorithm leads to execution times that
are up to 3 times lower for π ≥ n/2.
The PBBA is equivalent to Agulló’s (2001) algorithm (AGLA) under the conditions that follow. The AGLA
computes the LTS estimate for one coverage value only, i.e. h = hmin = hmax . Furthermore, preordering
of observations takes place in the root node with respect to the absolute residuals; in the second stage the
observations are sorted according to their RSS. In brief, the AGLA is equivalent to the PBBA with Resid−
(π1 = 1) / RSS+ (π2 = n), where π1 and π2 designate the preordering radii in stages I and II, respectively. For
a given range hmin : hmax the PBBA performs twice as fast as the AGLA if it employs the same preordering
strategy. Contrary to the AGLA, the PBBA does not need to step through the regression tree more than once
and hence does not generate any redundant nodes.
Table 1 illustrates the results of an experimental comparison of the PBBA with the AGLA. The algorithms
are executed on datasets with n = 32, 36, 40, 44, 48 observations. The number of outliers is n/4. The PBBA
is executed with a preordering radius π = n/2. The LTS estimates are computed for three different ranges
hmin : hmax , namely n/2 : n, n/2 : 3n/4 and 3n/4 : n. The execution times and number of nodes indicated are
mean values taken over 100 different datasets generated for every data size. The experiment reveals that the
PBBA is 6 to 10 times faster than the AGLA. In certain cases, the AGLA generates fewer nodes than the PBBA.
It employs the RSS as a preordering criterion and preorders the observations in all nodes of stage II (i.e. down
to the leaf nodes). However, the RSS criterion is expensive to compute compared to Resid and a preordering
radius which is too large adds to the computational load. For these reasons the PBBA outperforms the AGLA.

[Figure 4: Number of nodes visited by the PBBA with different preordering criteria in stages I/II of the algorithm, as a function of the preordering radius (data: n = 40, p = 4). (a) RSS−/Resid+ and RSS−/RSS+. (b) Resid−/Resid+ and Resid−/RSS+.]

[Figure 5: Execution time (in seconds) of the PBBA with different preordering criteria in stages I/II of the algorithm, as a function of the preordering radius (data: n = 40, p = 4). (a) RSS−/Resid+ and RSS−/RSS+. (b) Resid−/Resid+ and Resid−/RSS+.]
Table 1: Time (in seconds) needed by the PBBA and the AGLA to compute a range of LTS regressors, for three different ranges hmin : hmax. The number of nodes visited by each algorithm is also reported (lower block of each subtable).

(a) hmin = n/2, hmax = n

Time in seconds
 n   p  q    AGLA mean  AGLA min  AGLA max   PBBA mean  PBBA min  PBBA max  AGLA/PBBA
 32  3  8         1.16      0.60      2.71        0.18     0.092      0.40       6.44
 36  3  9         3.51      1.30     11.08        0.54      0.16      1.52       6.50
 40  4  10       17.51      8.17     52.44        2.34      0.91      6.74       7.48
 44  4  11       49.68     19.92    123.88        6.75      2.56     16.36       7.36
 48  5  12      270.73    132.55  1'164.31       31.58     12.16    122.47       8.57

Number of nodes
 n   p  q    AGLA mean   AGLA min   AGLA max    PBBA mean   PBBA min   PBBA max    AGLA/PBBA
 32  3  8        99'434     50'322     238'444      97'138     44'742     215'763       1.02
 36  3  9       268'115     91'892     881'282     275'415     72'665     821'995       0.97
 40  4  10      974'200    462'648   2'964'662   1'061'924    424'824   3'168'084       0.92
 44  4  11    2'506'074    987'432   6'292'358   2'966'393  1'021'971   7'517'632       0.84
 48  5  12   10'477'284  5'146'928  44'844'933  12'276'869  4'724'249  46'391'496       0.85

(b) hmin = n/2, hmax = 3n/4

Time in seconds
 n   p  q    AGLA mean  AGLA min  AGLA max   PBBA mean  PBBA min  PBBA max  AGLA/PBBA
 32  3  8         0.87      0.47      1.88        0.15     0.068      0.30       5.80
 36  3  9         2.47      0.92      6.99        0.42      0.12      1.12       5.88
 40  4  10       13.43      7.52     37.05        1.95      0.84      5.42       6.88
 44  4  11       36.03     17.16     87.42        5.47      2.08     13.85       6.59
 48  5  12      205.12    114.84    752.72       26.28     10.59     80.96       7.81

Number of nodes
 n   p  q    AGLA mean   AGLA min   AGLA max    PBBA mean   PBBA min   PBBA max    AGLA/PBBA
 32  3  8        77'835     40'851     169'718      84'428     37'008     180'373       0.92
 36  3  9       198'201     72'552     579'627     233'379     63'821     641'923       0.85
 40  4  10      776'800    411'723   2'222'448     946'947    393'425   2'670'110       0.82
 44  4  11    1'906'482    878'052   4'907'138   2'613'071    826'507   7'087'033       0.73
 48  5  12    8'231'868  4'612'913  30'672'512  10'971'795  3'868'554  32'010'797       0.75

(c) hmin = 3n/4, hmax = n

Time in seconds
 n   p  q    AGLA mean  AGLA min  AGLA max   PBBA mean  PBBA min  PBBA max  AGLA/PBBA
 32  3  8         0.40      0.10      1.44       0.061     0.012      0.22       6.67
 36  3  9         1.42      0.16      5.83        0.21      0.02      0.76       6.76
 40  4  10        5.51      1.00     25.71        0.67      0.11      2.99       8.22
 44  4  11       18.18      2.19     60.85        2.13      0.24      6.88       8.54
 48  5  12       87.88     11.99    753.21        8.79      1.18     64.96      10.00

Number of nodes
 n   p  q    AGLA mean   AGLA min   AGLA max    PBBA mean  PBBA min   PBBA max    AGLA/PBBA
 32  3  8        29'901      6'560     114'792      21'804     3'964     100'873       1.37
 36  3  9        96'467      9'662     432'003      72'002     5'609     336'567       1.34
 40  4  10      271'261     47'830   1'353'596     194'495    28'420   1'093'940       1.40
 44  4  11      811'691     88'112   2'857'928     580'998    53'587   2'248'152       1.40
 48  5  12    3'044'579    362'517  27'933'149   2'139'972   209'355  20'633'518       1.42
4 Conclusions
Various strategies to compute the exact least-trimmed-squares (LTS) regression are proposed. The adding
row algorithm (ARA) is based on a regression tree. The computational tool employed is a rank-1 Cholesky
updating procedure. The exploitation of the tree’s structure makes it possible to determine LTS estimates for a range
of coverages. Thus, the coverage parameter h does not need to be known in advance and the algorithm can
be used to examine the degree of contamination of the data. The branch-and-bound algorithm (BBA) avoids
the explicit enumeration of all observation-subset models. It considerably reduces the execution time. The
BBA with observation preordering (PBBA) sorts the observations in each node to increase the computational
performance of the BBA. Experiments show a significant improvement of the PBBA over the BBA. In the ARA
context a fast computation of the bounds has been devised.
A heuristic strategy which provides solutions reasonably close to the best solution can lead to smaller execution times. The heuristic BBA (HBBA) uses a tolerance parameter τ to cut subtrees (Gatu & Kontoghiorghes
2006). Although very efficient for variable-subset regression, this approach does not show interesting results
when combined with the PBBA. The increase in computational efficiency with respect to the loss in quality of
the solution computed by the HBBA, is not significant. The PBBA traverses the regression tree starting from
the empty model, updating submodels as it proceeds downward. Thus, τ has little effect on the execution time because the residual sums of squares (RSSs) are small in early stages of the algorithm.
The PBBA is compared to the exact algorithm (AGLA) presented by Agulló (2001). The AGLA computes
the LTS regressor for a given coverage parameter h, usually n/2. Thus, it will discard relevant data as the
degree of contamination of the data is typically less than 50% and might fail to reveal the global linear trend
of the data. The PBBA can be seen as a generalization of the AGLA. It uses a more general notion of the
subsets-tree to compute a set of LTS regressors for a range of coverage values hmin : hmax simultaneously.
Furthermore, the PBBA employs a more efficient preordering strategy resulting in a smaller computational
load. Experiments show that the PBBA is 6 to 10 times faster than the AGLA to compute a range of LTS
regressors. It is an efficient tool to examine the degree of contamination of the data, hence revealing the exact
LTS estimator which is both robust and accurate.
Acknowledgements
The authors are grateful to the two anonymous referees for their valuable comments and suggestions.
A Formal proofs
The proofs of the Lemmas 1–3 are given. These follow closely the proofs given in (Gatu & Kontoghiorghes
2006).
Lemma 1 $r^{(g)}_j \le r^{(g)}_{j+1}$ ($j = 1, \ldots, n-1$).

Proof 1. The proof is by induction on the number of generated nodes. Initially, $r^{(0)}_j = +\infty$ and the proposition holds. By inductive hypothesis, the proposition holds if $g$ nodes have been generated. It must be shown that the proposition holds after the $(g+1)$th node has been computed. Consider the $(g+1)$th node $(S, A)$, with $n_S = j$. It selects the observations in $S$ and affects $r_j$, which when modified becomes $r^{(g+1)}_j = \mathrm{RSS}(S) \le r^{(g)}_j$. Thus, by inductive hypothesis, $r^{(g+1)}_j \le r^{(g)}_{j+1} = r^{(g+1)}_{j+1}$. The model $S$ was derived from its parent model $S_{\mathrm{par}}$ by adding an observation. Hence, $\mathrm{RSS}(S) \ge \mathrm{RSS}(S_{\mathrm{par}})$ and $r^{(g+1)}_{j-1} \le \mathrm{RSS}(S_{\mathrm{par}}) \le r^{(g+1)}_j$. This completes the proof.
Lemma 2 Given the node $(S, A)$ and a constant $\alpha > 0$, if $r^{(g)}_{n_S+n_A} \le \alpha \cdot \mathrm{RSS}(S)$ then $r^{(g)}_{|W|} \le \alpha \cdot \mathrm{RSS}(W)$, where $W$ is any observation-subset obtained from $\Delta(S, A)$.

Proof 2. Any observation-subset $W$ of size $j$ ($j = n_S + 1, \ldots, n_S + n_A$) was obtained by adding one or more observations to $S$. Thus, $\mathrm{RSS}(W) \ge \mathrm{RSS}(S)$. From Lemma 1 it follows that $r^{(g)}_j \le r^{(g)}_{n_S+n_A}$. Hence, if $r^{(g)}_{n_S+n_A} \le \alpha \cdot \mathrm{RSS}(S)$ then $r^{(g)}_j \le \alpha \cdot \mathrm{RSS}(W)$. This completes the proof.
Lemma 3 Let $\tilde{S}$ be an observation-subset model selected by the HBBA$_\tau$. Then, $\mathrm{RRE}(\tilde{S}) \le \tau$.

Proof 3. Two possibilities arise. Either the HBBA$_\tau$ found the optimal model, i.e. $\tilde{S} = S^*$, and $\mathrm{RRE}(\tilde{S}) = 0 \le \tau$. Else, the model $S^*$ was contained in a subtree that was cut. Note that $|\tilde{S}| = |S^*| = n_S$. According to Lemma 2 with $\alpha = 1 + \tau$, $r^{(g)}_{n_S} \le (1+\tau) \cdot \mathrm{RSS}(S^*)$. Further, $\mathrm{RSS}(\tilde{S}) \le r^{(g)}_{n_S}$. Thus, $\mathrm{RSS}(\tilde{S}) \le (1+\tau) \cdot \mathrm{RSS}(S^*)$, which completes the proof.
References
Agulló, J. (2001), ‘New algorithms for computing the least trimmed squares regression estimator’, Computational Statistics and Data Analysis 36, 425–439.
Alexander, S. T., Pan, C.-T. & Plemmons, R. J. (1988), ‘Analysis of a recursive least squares hyperbolic rotation
algorithm for signal processing’, Linear Algebra and its Applications 98, 3–40.
Atkinson, A. C. & Cheng, T.-C. (1999), ‘Computing least trimmed squares regression with the forward search’,
Statistics and Computing 9, 251–263.
Belsley, D. A., Kuh, A. E. & Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data and
Sources of Collinearity, John Wiley and Sons, New York.
Björck, Å., Park, H. & Eldén, L. (1994), ‘Accurate downdating of least squares solutions’, SIAM Journal on
Matrix Analysis and Applications 15(2), 549–568.
Burks, A. W., Warren, D. W. & Wright, J. B. (1954), ‘An analysis of a logical machine using parenthesis-free
notation’, Mathematical Tables and Other Aids to Computation 8(46), 53–57.
Dongarra, J. J., Bunch, J. R., Moler, C. B. & Stewart, G. W. (1979), LINPACK Users’ Guide, SIAM, Philadelphia.
Donoho, D. L. & Huber, P. J. (1983), The notion of breakdown point, in P. J. Bickel, K. A. Doksum, E. L.
Lehman & J. L. Hodges, eds, ‘A festschrift for Erich L. Lehmann in honor of his sixty-fifth birthday’,
CRC Press.
Furnival, G. & Wilson, R. (1974), ‘Regression by leaps and bounds’, Technometrics 16, 499–511.
Gatu, C. & Kontoghiorghes, E. J. (2003), ‘Parallel algorithms for computing all possible subset regression
models using the QR decomposition’, Parallel Computing 29(4), 505–521.
Gatu, C. & Kontoghiorghes, E. J. (2006), ‘Branch-and-bound algorithms for computing the best subset regression models’, Journal of Computational and Graphical Statistics 15, 139–156.
Gill, P. E., Golub, G. H., Murray, W. & Saunders, M. A. (1974), ‘Methods for modifying matrix factorizations’,
Mathematics of Computations 28, 505–535.
Golub, G. H. & Van Loan, C. F. (1996), Matrix computations, Johns Hopkins Studies in the Mathematical
Sciences, 3rd edn, Johns Hopkins University Press, Baltimore, Maryland.
Hawkins, D. M. (1994), ‘The feasible solution algorithm for least trimmed squares regression’, Computational
Statistics and Data Analysis 17, 185–196.
Hawkins, D. M. & Olive, D. J. (1999), ‘Improved feasible solution algorithms for high breakdown estimation’,
Computational Statistics and Data Analysis 30, 1–11.
Hofmann, M., Gatu, C. & Kontoghiorghes, E. J. (2007), ‘Efficient algorithms for computing the best-subset
regression models for large-scale problems’, Computational Statistics and Data Analysis 52, 16–29.
Hössjer, O. (1994), ‘Rank-based estimates in the linear model with high breakdown point’, Journal of the
American Statistical Association 89, 149–158.
Morgenthaler, S. (1991), ‘A note on efficient regression estimators with positive breakdown point’, Statistics
and Probability Letters 11, 469–472.
Narendra, P. M. & Fukunaga, K. (1977), ‘A branch and bound algorithm for feature subset selection’, IEEE
Transactions on Computers 26(9), 917–922.
Newell, A. & Shaw, J. C. (1957), Programming the logic theory machine, in ‘Proceedings of the Western Joint
Computer Conference’, Institute of Radio Engineers, pp. 230–240.
Rousseeuw, P. J. (1984), ‘Least median of squares regression’, Journal of the American Statistical Association
79, 871–880.
Rousseeuw, P. J. (1997), Introduction to positive-breakdown methods, in G. S. Maddala & C. R. Rao, eds,
‘Handbook of statistics’, Vol. 15: Robust inference, Elsevier, pp. 101–121.
Rousseeuw, P. J. & Leroy, A. M. (1987), Robust regression and outlier detection, John Wiley & Sons.
Rousseeuw, P. J. & Van Driessen, K. (2006), ‘Computing LTS regression for large data sets’, Data Mining and
Knowledge Discovery 12, 29–45.
Smith, D. M. & Bremner, J. M. (1989), ‘All possible subset regressions using the QR decomposition’, Computational Statistics and Data Analysis 7(3), 217–235.
Stefanski, L. A. (1991), ‘A note on high-breakdown estimators’, Statistics and Probability Letters 11, 353–358.
Matrix strategies for computing the least trimmed squares
estimation of the general linear and SUR models
Marc Hofmann (Institut d’informatique, Université de Neuchâtel, Switzerland; marc.hofmann@unine.ch),
Erricos John Kontoghiorghes (Department of Public and Business Administration, University of Cyprus, Cyprus; and School of Computer Science and Information Systems, Birkbeck College, University of London, UK; erricos@ucy.ac.cy)
Abstract
An algorithm has been recently proposed for computing the exact least trimmed squares (LTS)
estimator of the standard regression model. This combinatorial algorithm is adapted to the case
of the general linear and seemingly unrelated regression models with possible singular dispersion
matrices. It searches through a regression tree to find optimal estimates by employing efficient
matrix techniques to update the generalized linear least squares problem in each tree node. The
new formulation of the problem allows one to update the residual sum of squares of a subset model
efficiently. Specifically, the new algorithms utilize previous computations in order to update the
generalized QR decomposition by a single observation. The sparse structures of the models are
exploited. Theoretical measures of computational complexity are provided. Experimental results
confirm the ability of the algorithms to identify the outlying observations, and at the same time,
they illustrate the computational intensity of deriving the LTS estimators.
Keywords: Least trimmed squares, general linear model, seemingly unrelated regressions, generalized
linear least squares.
1 Introduction
Algorithms for least trimmed squares (LTS) regression of the ordinary linear model have been proposed
(Agulló 2001, Rousseeuw & Leroy 1987, Rousseeuw & Van Driessen 2006). Robust multivariate
methods, and multivariate LTS in particular, have also been investigated (Agulló, Croux & Aelst 2008,
Hubert, Rousseeuw & Aelst 2008, Rousseeuw, Aelst, Driessen & Agulló 2004). Here, new numerical
strategies to solve the LTS regression of the general linear model (GLM) and of the seemingly unrelated
regressions (SUR) model are designed. These strategies exploit the matrix properties of the linear
models. Recently, a fast branch and bound strategy for computing the LTS estimator has been designed
(Hofmann, Gatu & Kontoghiorghes 2008). This algorithm is extended to compute the LTS estimator of
the GLM and SUR model (Srivastava & Giles 1987, Srivastava & Dwivedi 1979, Zellner 1962). The
GLM is given by:
\[
y = X\beta + \varepsilon, \tag{1}
\]
where $y \in \mathbb{R}^m$, $X \in \mathbb{R}^{m\times n}$, $\beta \in \mathbb{R}^n$ and $\varepsilon \sim (0, \sigma^2\Omega)$, $\Omega \in \mathbb{R}^{m\times m}$ ($m \ge n$). The objective of the LTS estimator is to minimize $\sum_{i=1}^{h} e^2_{(i)}$, where $e^2_{(1)}, \ldots, e^2_{(m)}$ are the squared residuals sorted in increasing order and $h$ is the coverage parameter ($h \ge \lfloor (m+n+1)/2 \rfloor$).
with the smallest least-squares (LS) objective function.
The SUR model is a special case of the GLM and is written as:
\[
y^{(i)} = X^{(i)}\beta^{(i)} + \varepsilon^{(i)}, \qquad i = 1, \ldots, G,
\]
where $y^{(i)} \in \mathbb{R}^m$, $X^{(i)} \in \mathbb{R}^{m\times n_i}$ ($m \ge n_i$), $\beta^{(i)} \in \mathbb{R}^{n_i}$ and $\varepsilon^{(i)} \in \mathbb{R}^m$. Furthermore, $E(\varepsilon^{(i)}_k) = 0$, and contemporaneous disturbances are correlated, i.e. $\mathrm{Var}(\varepsilon^{(i)}_t, \varepsilon^{(j)}_t) = \sigma_{ij}$ and $\mathrm{Var}(\varepsilon^{(i)}_s, \varepsilon^{(j)}_t) = 0$ if $s \ne t$.
In compact form, the model may be written:
\[
\mathrm{vec}(Y) = \bigoplus_{i=1}^{G} X^{(i)} \cdot \mathrm{vec}(\{\beta^{(i)}\}_G) + \mathrm{vec}(E), \tag{2}
\]
where $Y = \bigl(y^{(1)} \cdots y^{(G)}\bigr) \in \mathbb{R}^{m\times G}$, $E = \bigl(\varepsilon^{(1)} \cdots \varepsilon^{(G)}\bigr) \in \mathbb{R}^{m\times G}$, $\mathrm{vec}(E) \sim (0, \Sigma \otimes I_m)$ and $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{G\times G}$. The set of vectors $\beta^{(1)}, \ldots, \beta^{(G)}$ is denoted by $\{\beta^{(i)}\}_G$, $\bigoplus_{i=1}^{G} X^{(i)} = \mathrm{diag}(X^{(1)}, \ldots, X^{(G)})$ denotes the direct sum of the matrices $X^{(1)}, \ldots, X^{(G)}$ and $\mathrm{vec}(\bullet)$ is the vector operator which stacks a set of column vectors.
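The compact form (2) can be assembled directly with numpy: the direct sum is a block-diagonal matrix and the covariance of vec(E) is the Kronecker product Σ ⊗ I_m. A small sketch with illustrative dimensions follows.

```python
import numpy as np
from scipy.linalg import block_diag

m, G = 6, 3
n_i = [2, 3, 4]
rng = np.random.default_rng(6)
X_blocks = [rng.normal(size=(m, n)) for n in n_i]     # X^(1), ..., X^(G)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])

X_sur = block_diag(*X_blocks)                         # direct sum, (G*m) x sum(n_i)
Omega = np.kron(Sigma, np.eye(m))                     # covariance of vec(E)
print(X_sur.shape, Omega.shape)                       # (18, 9) (18, 18)
```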
Here the GLM and SUR model are reformulated as a generalized linear least squares problem
(GLLSP) (Foschi & Kontoghiorghes 2002, Kontoghiorghes & Clarke 1995, Paige 1978, Paige 1979b,
Paige 1979a). The solution of the GLLSP using orthogonal factorization methods is numerically stable.
Furthermore, the GLLSP can be updated efficiently after one observation has been added to the regression model (Kontoghiorghes 2004, Yanev & Kontoghiorghes 2007). The best linear unbiased estimator
(BLUE) of β in (1) is the solution of the GLLSP
\[
\hat{\beta} = \operatorname*{argmin}_{\beta} \|u\|^2
\quad \text{subject to} \quad
y = X\beta + Bu, \tag{3}
\]
where $B \in \mathbb{R}^{m\times p}$ such that $\Omega = BB^T$, $u \sim (0, \sigma^2 I_m)$ and $p \ge m - n$. The residual sum of squares of $\hat{\beta}$ is $\mathrm{RSS}(\hat{\beta}) = \|u\|^2$. Notice that B is not necessarily of full rank.
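When Ω is nonsingular, the BLUE defined by the GLLSP (3) coincides with the familiar generalized least squares estimator; the sketch below uses that closed form purely as a numerical cross-check. It is not the GQRD-based method developed in this paper, which also covers singular B, and all names and data are illustrative.

```python
import numpy as np

def gls_blue(X, y, Omega):
    """BLUE of beta and RSS(beta) = u^T u for the GLM with nonsingular Omega = B B^T."""
    W = np.linalg.inv(Omega)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    B = np.linalg.cholesky(Omega)                 # one valid choice of B
    u = np.linalg.solve(B, y - X @ beta)          # y = X beta + B u
    return beta, float(u @ u)

rng = np.random.default_rng(7)
m, n = 12, 3
X = rng.normal(size=(m, n))
Omega = np.diag(rng.uniform(0.5, 2.0, size=m))    # simple nonsingular dispersion
y = X @ np.ones(n) + np.linalg.cholesky(Omega) @ rng.normal(size=m)
print(gls_blue(X, y, Omega))
```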
Similarly, the BLUE of β in the SUR model (2) is obtained by solving the GLLSP (hereafter SUR-GLLSP):
\[
\{\hat{\beta}^{(i)}\}_G = \operatorname*{argmin}_{\{\beta^{(i)}\}_G} \|U\|_F
\quad \text{subject to} \quad
\mathrm{vec}(Y) = \bigoplus_{i=1}^{G} X^{(i)} \cdot \mathrm{vec}(\{\beta^{(i)}\}_G) + K \cdot \mathrm{vec}(U), \tag{4}
\]
where $\|\bullet\|_F$ denotes the Frobenius norm, $E = UC^T$, $C \in \mathbb{R}^{G\times G}$ is upper triangular such that $CC^T = \Sigma$, $K = C \otimes I_m$, $\mathrm{vec}(U) \sim (0, \sigma^2 I_M)$ and $M = Gm$. The residual sum of squares is given by $\mathrm{RSS}(\{\hat{\beta}^{(i)}\}_G) = \|U\|_F$.
Section 2 provides a brief description of the row-adding algorithm which computes the LTS estimators for a range of coverage values. Sections 3 and 4 present the adaptation of the LTS algorithm
for the GLM and SUR model, respectively. Within this context, emphasis is given to the development
of efficient strategies for updating the linear models. The sparse matrix structures are exploited for
minimizing the cost of expensive matrix operations. Experimental results are also presented. Section 5
concludes.
2 The row-adding algorithm
The row-adding algorithm (ARA) generates all possible observation-subset models to compute the
LTS estimates for a range hmin , . . . , hmax of coverage values (Hofmann et al. 2008). The observation
subsets are organized in a regression tree illustrated in Figure 1 (Belsley, Kuh & Welsch 1980, Gatu
& Kontoghiorghes 2003). A node (S,A) corresponds to a set of selected observations S and a set of
available observations A. The observations in A are used to compute new nodes. For example, in
the node ([12],[34]) on level 2, the observations 1 and 2 are selected and new nodes can be computed
by adding observations 3 or 4. If observation 4 is selected, then 3 is skipped so as to avoid generating
duplicate subsets. Thus, it is ensured that all submodels are generated and that no submodel is computed
more than once. A branch and bound strategy is employed to reduce computational time (Furnival &
Wilson 1974, Gatu & Kontoghiorghes 2006, Hofmann et al. 2008). The RSS is computed in each node
by updating the numerical quantities computed in the parent node. Effectively, this corresponds to
updating a generalized QR decomposition after a row has been added. The best RSS for every subset
size is stored in a list, which is updated each time a better solution for one of the subset sizes is found.
This is illustrated in Algorithm 1. The input argument I is the list of observation indices contained in
the data. The output arguments are the RSS and the indices of the selected observations (SEL) of the
computed LTS estimators.
[Figure 1: The ARA regression tree for m = 4 observations.]
Algorithm 1: Adding row algorithm (ARA)
 1  procedure ARA(I, OPT, SEL)
 2    m ← |I|
 3    OPT(i) ← +∞, where i = 1, . . . , m
 4    Insert node (∅, I) into node list
 5    while list contains nodes do
 6      Remove node (S, A) from node list
 7      if RSS(S) < OPT(|S|) then OPT(|S|) ← RSS(S), SEL(|S|) ← S
 8      for i = 1, . . . , |A| do
 9        if RSS(S) ≥ OPT(|S| + |A| − i + 1) then break
10        Compute (S′, A′) = add(S, A, i)
11        Insert (S′, A′) into node list
12      end for
13    end while
14  end procedure ARA
3 The LTS estimation of the GLM
For the solution of the GLLSP (3), the generalized QR decomposition (GQRD) of X and B is required:
\[
Q^T X = \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}
\quad \text{and} \quad
Q^T B P^T = \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix}. \tag{5}
\]
Here Q ∈ Rm×m and P ∈ Rp×p are orthogonal, T11 ∈ Rn×(p−m+n) is upper trapezoidal, T12 ∈
Rn×(m−n) and T22 ∈ R(m−n)×(m−n) is upper triangular. The constraints y = Xβ +Bu of the GLLSP (3)
are equivalent to QT y = QT Xβ + QT BP T P u. Thus, the transformed GLLSP can be written as:
\[
\hat{\beta} = \operatorname*{argmin}_{\beta} \|u_1\|^2 + \|u_2\|^2
\quad \text{subject to} \quad
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}\beta
+ \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \tag{6}
\]
where $Q^T y = \bigl(\begin{smallmatrix} y_1 \\ y_2 \end{smallmatrix}\bigr)$ and $P u = \bigl(\begin{smallmatrix} u_1 \\ u_2 \end{smallmatrix}\bigr)$. Thus, $y_2 = T_{22} u_2$ and the solution $z = R_{11}\hat{\beta}$ follows from $u_1 = 0$, where $z = y_1 - T_{12} u_2$. The RSS is $\|u_2\|^2$. Notice that (6) is equivalent to
\[
\hat{\beta} = \operatorname*{argmin}_{\beta} \|u_1\|^2
\quad \text{subject to} \quad z = R_{11}\beta + T_{11}u_1. \tag{7}
\]
This is the reduced-size GLLSP of (1) and has the same solution $\hat{\beta}$.
It is not necessary to compute the new solution from scratch after an observation has been added to
the GLM. Let the updated GLM (1) be given by:
\[
\tilde{y} = \tilde{X}\beta + \tilde{\varepsilon},
\quad \text{where} \quad
\tilde{y} = \begin{pmatrix} y \\ \upsilon \end{pmatrix},\;
\tilde{X} = \begin{pmatrix} X \\ x^T \end{pmatrix}
\quad \text{and} \quad
\tilde{\varepsilon} \sim (0, \sigma^2 \tilde{B}\tilde{B}^T),
\;\text{with}\;
\tilde{B} = \begin{pmatrix} B \\ b^T \end{pmatrix}. \tag{8}
\]
The GLLSP corresponding to (8) is:
\[
\tilde{\beta} = \operatorname*{argmin}_{\beta} \|u_1\|^2
\quad \text{subject to} \quad
\begin{pmatrix} z \\ \zeta \end{pmatrix}
= \begin{pmatrix} R_{11} \\ x^T \end{pmatrix}\beta
+ \begin{pmatrix} T_{11} \\ b_1^T \end{pmatrix} u_1, \tag{9}
\]
where $b^T P^T = \bigl(b_1^T \;\; b_2^T\bigr)$, $\zeta = \upsilon - b_2^T u_2$, $u_2$ is given by (6) and $z$, $R_{11}$, $T_{11}$ by (7). Now consider the updated GQRD
\[
\tilde{Q}^T \begin{pmatrix} R_{11} \\ x^T \end{pmatrix}
= \begin{pmatrix} \tilde{R}_{11} \\ 0 \end{pmatrix}
\quad \text{and} \quad
\tilde{Q}^T \begin{pmatrix} T_{11} \\ b_1^T \end{pmatrix} \tilde{P}^T
= \begin{pmatrix} \tilde{T}_{11} & t \\ 0 & \tau \end{pmatrix}. \tag{10}
\]
After applying the orthogonal transformations to (9), the GLLSP becomes:
\[
\tilde{\beta} = \operatorname*{argmin}_{\beta} \|\tilde{u}_1\|^2 + \eta^2
\quad \text{subject to} \quad
\begin{pmatrix} \tilde{y}_1 \\ \xi \end{pmatrix}
= \begin{pmatrix} \tilde{R}_{11} \\ 0 \end{pmatrix}\beta
+ \begin{pmatrix} \tilde{T}_{11} & t \\ 0 & \tau \end{pmatrix}
\begin{pmatrix} \tilde{u}_1 \\ \eta \end{pmatrix}, \tag{11}
\]
where
\[
\tilde{Q}^T \begin{pmatrix} z \\ \zeta \end{pmatrix} = \begin{pmatrix} \tilde{y}_1 \\ \xi \end{pmatrix}
\quad \text{and} \quad
\tilde{P} u_1 = \begin{pmatrix} \tilde{u}_1 \\ \eta \end{pmatrix}.
\]
Hence, $\eta = \xi/\tau$ and setting $\tilde{u}_1 = 0$ leads to $\tilde{z} = \tilde{R}_{11}\tilde{\beta}$, where $\tilde{z} = \tilde{y}_1 - t\eta$. The new RSS is $\mathrm{RSS}(\tilde{\beta}) = \mathrm{RSS}(\hat{\beta}) + \eta^2$. As expected, the RSS is non-decreasing when an observation is added to the GLM. The reduced GLLSP of (8) is given by:
\[
\tilde{\beta} = \operatorname*{argmin}_{\beta} \|\tilde{u}_1\|^2
\quad \text{subject to} \quad \tilde{z} = \tilde{R}_{11}\beta + \tilde{T}_{11}\tilde{u}_1. \tag{12}
\]
It follows that if the row $(\upsilon \;\; x^T \;\; b^T)$ has been properly transformed, or reduced, to $(\zeta \;\; x^T \;\; b_1^T)$
in (9), then the solution of the updated GLM (8) can be obtained from the reduced GLLSP (7). This
is exploited by the ARA to solve all subset models. The ARA solves a new subset model S ′ in each
node (S ′ , A′ ) by updating the model S of the parent node (S, A) with one of the observations in A.
Specifically, given a GLLSP (7) it computes the updated GQRD (10) in order to obtain the reduced
GLLSP (12) of the new subset model. All observations in A′ are reduced so that they can be used later
in updating the new, reduced GLLSP.
The GLM is underdetermined on the first n levels of the tree (i.e. m < n). On level n, the ARA
computes the GQRD (5) and it transforms the GLLSP according to (6) with m = n. The matrix
structure of the GLLSP is illustrated in Figure 2(a). Both sets of observations are shown: the set S of n
selected points (y, X, B) and the set A of k available points (yA , XA , BA ). As illustrated in Figure 2(b),
the orthogonal transformation P T must be applied to BA .
Following the order imposed by the all subsets tree, each of the k observations in A are selected in
turn to compute a new subset model. As illustrated in Figure 3(a), the ith available observation is added
to the GLLSP, while the first i − 1 are discarded. In Figure 3(b), the GLLSP is transformed according
to (11). Finally, the reduced GLLSP (12) is formed in Figure 3(c).
The GLLSP (11) is computed by updating the GQRD using Givens rotations (Golub & Van Loan
1996, Paige 1978) as illustrated in Figure 4. In Figure 4(a), a row is added to (7) to form (9). In
Figure 4(b), (p − n − 1) Givens rotations are applied to the right of T11 and BA . Next, four pairs of
Givens rotations are applied in Figure 4(c)-4(f). A rotation $\tilde{Q}_i$ from the left annihilates one subdiagonal element in $R_{11}$, causing the fill-up of a subdiagonal element in $T_{11}$. The fill-up is annihilated by a Givens rotation $\tilde{P}_j$ applied from the right. Finally, the GLLSP is reduced in Figure 4(g).
[Figure 2: Computing the GQRD and transforming the GLLSP (case p > n). (a) Structure of the GLLSP. (b) The transformed GLLSP.]
[Figure 3: Updating the GLLSP with the i-th observation (case p > n). (a) Add row. (b) Update GQRD. (c) Reduce GQRD.]
Consider the GLLSP (11) with dimensions n and p. Let T (n, k, p) denote the cost of computing all
possible subset models, where k is the number of available observations (n ≥ 1, p ≥ k ≥ 0). It can be
expressed as:
\[
T(n, k, p) = \sum_{i=1}^{k} \bigl[ T_u(n, k-i, p) + T(n, k-i, p-1) \bigr]
= T_u(n, k, p) + T(n, k-1, p-1) + T(n, k-1, p),
\]
where $T_u$ is the cost of the updating operation. The closed form for $T$ is given by:
\[
T(n, k, p) = \sum_{i=0}^{k} \sum_{j=i}^{p} \binom{k-i}{p-j}\, T_u(n, i, j).
\]
The complexity $T_u(n, k, p)$ expressed in terms of elementary rotations is:
\[
T_u(n, k, p) =
\begin{cases}
3n^2/2 + (k+1)p + 4n - (k+1)/2 & \text{if } p > n, \\[2pt]
n^2/2 + 2np - p^2 + n + (k+4)p - (k+5)/2 & \text{if } p \le n.
\end{cases}
\]
[Figure 4: Givens sequence for updating the GLM (case p > n). (a) Add row. (b) Rotations P̃1^T, P̃2^T, P̃3^T applied from the right. (c) Q̃1^T, P̃4^T. (d) Q̃2^T, P̃5^T. (e) Q̃3^T, P̃6^T. (f) Q̃4^T, P̃7^T. (g) Reduce GQRD.]
In computing the complexity Tu , it is assumed that the cost of adding a multiple of a vector to another
vector of the same length (Figure 3(c)) is equal to half the cost of rotating the same vectors. Furthermore, assuming that the cost of solving the GQRD (Figure 2) is equal to the cost of n update operations,
it follows that the overall cost of the ARA is $T(n, m, m) \in O\bigl(2^m (m^2 + n^2)\bigr)$ when $p = m$.
3.1 Experimental results
The GLM is simulated with m observations and n variables. The chosen variance structure is given by:
\[
\Omega_{i,j} = \frac{\sigma^2}{1 - \varphi^2}\, \varphi^{|i-j|}. \tag{13}
\]
The values of σ = 0.25 and ϕ = 0.9 were chosen for the simulation. Specifically,
y = Xβ + Bu,
where β, u ∼ (0, 1), X ∼ (0, 10) and B is the upper triangular Cholesky factor of the variance-covariance matrix Ω. Outliers are introduced by replacing some values of y by values that are normally
distributed with mean 100 and standard deviation 10. Five instances are simulated for each of the
different dimensions of the problem. The mean execution times are shown in Table 1. The experiments show that the computational cost remains high and the algorithm cannot be used for large-scale
problems. The algorithm correctly identified the outlying observations in all simulations. Figure 5
illustrates the RSSs computed by the algorithm for the simulated GLM with m = 24 observations. The
steep increase in RSS for subset sizes greater than 18 indicates that the computed LTS estimators for a
coverage h > 18 are contaminated by outliers.
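A sketch of the simulation setup, assuming the AR(1)-type structure (13) with σ = 0.25 and φ = 0.9; the outlier mechanism follows the description above, and the names, scales and factorization choice (a lower-triangular Cholesky factor for simplicity) are illustrative.

```python
import numpy as np

def simulate_glm(m=24, n=4, n_out=6, sigma=0.25, phi=0.9, seed=8):
    rng = np.random.default_rng(seed)
    idx = np.arange(m)
    Omega = sigma**2 / (1 - phi**2) * phi ** np.abs(idx[:, None] - idx[None, :])  # eq. (13)
    B = np.linalg.cholesky(Omega)            # any B with B @ B.T == Omega works for simulation
    X = rng.normal(0.0, np.sqrt(10.0), size=(m, n))    # "X ~ (0, 10)"; scale is an assumption
    beta, u = rng.normal(size=n), rng.normal(size=m)   # beta, u ~ (0, 1)
    y = X @ beta + B @ u
    out = rng.choice(m, size=n_out, replace=False)
    y[out] = rng.normal(100.0, 10.0, size=n_out)       # inject outliers into y
    return X, y, Omega

X, y, Omega = simulate_glm()
print(X.shape, y.shape, np.allclose(Omega, Omega.T))
```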
Table 1: Mean execution times in seconds of the LTS algorithm for simulated GLM models.

 m    n   outliers   time (s)
 16   4   4          4
 20   4   5          14
 24   4   6          62
 28   8   7          4'404
 32   8   8          18'038
 36   8   9          62'551
[Figure 5: The LTS-RSSs computed for the simulated GLM with m = 24 observations. Axes: subset size vs RSS.]
4 The LTS estimation of the SUR model
For the solution of the SUR-GLLSP (4), which provides the estimator of the SUR model (2), consider
the GQRD of ⊕ X (i) and K = C ⊗ Im :
\[
Q^T \cdot \bigoplus X^{(i)} = \begin{pmatrix} \oplus R^{(i)} \\ 0 \end{pmatrix}
\quad \text{and} \quad
Q^T K P^T = \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix},
\]
where $R^{(i)} \in \mathbb{R}^{n_i\times n_i}$, $T_{11} \in \mathbb{R}^{N\times N}$, $T_{22} \in \mathbb{R}^{(M-N)\times(M-N)}$ are upper-triangular and $Q \in \mathbb{R}^{M\times M}$ is orthogonal. Here, $Q = \bigl(\oplus Q_1^{(i)} \;\; \oplus Q_2^{(i)}\bigr)$, where the QRD of $X^{(i)}$ is given by
\[
Q^{(i)\,T} X^{(i)} = \begin{pmatrix} R^{(i)} \\ 0 \end{pmatrix},
\quad \text{with} \quad
Q^{(i)} = \bigl( Q_1^{(i)} \;\; Q_2^{(i)} \bigr)
\quad \text{such that} \quad
Q_1^{(i)\,T} X^{(i)} = R^{(i)} \ \text{and}\ Q_2^{(i)\,T} X^{(i)} = 0.
\]
The SUR-GLLSP becomes
\[
\{\hat{\beta}^{(i)}\} = \operatorname*{argmin}_{\{\beta^{(i)}\}} \bigl(\|u_1\|^2 + \|u_2\|^2\bigr)
\quad \text{subject to} \quad
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= \begin{pmatrix} \oplus R^{(i)} \\ 0 \end{pmatrix} \cdot \mathrm{vec}(\{\beta^{(i)}\})
+ \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix},
\]
where
\[
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= \begin{pmatrix} \oplus Q_1^{(i)\,T} \\ \oplus Q_2^{(i)\,T} \end{pmatrix} \cdot \mathrm{vec}(Y)
\quad \text{and} \quad
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = P \cdot \mathrm{vec}(U).
\]
Thus, $u_1 = 0$ and $y_2 = T_{22} \cdot u_2$. The estimator of the SUR model is given by:
\[
\oplus R^{(i)} \cdot \mathrm{vec}(\{\hat{\beta}^{(i)}\}) = \mathrm{vec}(\{z^{(i)}\}), \tag{14}
\]
where vec({z (i) }) = y1 − T12 · u2 , z (i) ∈ Rni and RSS({βb(i) }) = ku2 k2 . This is equivalent to
solving the triangular systems R(i) βb(i) = z (i) (i = 1, . . . , G). The procedure is illustrated in Figure 6.
An orthogonal transformation from the left is applied to triangularize the X (i) s. Next, the rows are
permuted such as to obtain ⊕ R(i) in the upper part of the data matrix. The same permutation is applied
to the rows and columns of the covariance structure K. An orthogonal transformation is applied to the
right of K to form a block upper-triangular structure. Finally, the GLLSP is reduced by a block column
to:
\[
\{\hat{\beta}^{(i)}\} = \operatorname*{argmin}_{\{\beta^{(i)}\}} \|u_1\|^2
\quad \text{subject to} \quad
\mathrm{vec}(\{z^{(i)}\}) = \oplus R^{(i)} \cdot \mathrm{vec}(\{\beta^{(i)}\}) + T_{11} \cdot u_1.
\]
An observation is added to each of the G equations of the SUR model (2). The reduced SUR-GLLSP is now written as:
\[
\{\hat{\beta}^{(i)}\} = \operatorname*{argmin}_{\{\beta^{(i)}\}} \bigl(\|u_1\|^2 + \|u_2\|^2\bigr)
\quad \text{subject to} \quad
\begin{pmatrix} \mathrm{vec}(\{z^{(i)}\}) \\ \mathrm{vec}(\{\zeta^{(i)}\}) \end{pmatrix}
= \begin{pmatrix} \oplus R^{(i)} \\ \oplus a^{(i)\,T} \end{pmatrix} \cdot \mathrm{vec}(\{\beta^{(i)}\})
+ \begin{pmatrix} T_{11} & 0 \\ 0 & C \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.
\]
The GQRD of $\oplus R^{(i)}$ and $T_{11}$ is updated by a sequence of Givens rotations to give:
\[
\tilde{Q}^T \begin{pmatrix} \oplus R^{(i)} \\ \oplus a^{(i)\,T} \end{pmatrix}
= \begin{pmatrix} \oplus \tilde{R}^{(i)} \\ 0 \end{pmatrix}
\quad \text{and} \quad
\tilde{Q}^T \begin{pmatrix} T_{11} & 0 \\ 0 & C \end{pmatrix} \tilde{P}^T
= \begin{pmatrix} \tilde{T}_{11} & \tilde{T}_{12} \\ 0 & \tilde{T}_{22} \end{pmatrix}.
\]
Thus, the SUR-GLLSP is reduced to:
\[
\{\tilde{\beta}^{(i)}\} = \operatorname*{argmin}_{\{\beta^{(i)}\}} \bigl(\|v_1\|^2 + \|v_2\|^2\bigr)
\quad \text{subject to} \quad
\begin{pmatrix} \tilde{y}_1 \\ \mathrm{vec}(\{\xi^{(i)}\}) \end{pmatrix}
= \begin{pmatrix} \oplus \tilde{R}^{(i)} \\ 0 \end{pmatrix} \cdot \mathrm{vec}(\{\beta^{(i)}\})
+ \begin{pmatrix} \tilde{T}_{11} & \tilde{T}_{12} \\ 0 & \tilde{T}_{22} \end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \end{pmatrix},
\]
where
\[
\begin{pmatrix} \tilde{y}_1 \\ \mathrm{vec}(\{\xi^{(i)}\}) \end{pmatrix}
= \tilde{Q}^T \begin{pmatrix} \mathrm{vec}(\{z^{(i)}\}) \\ \mathrm{vec}(\{\zeta^{(i)}\}) \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \tilde{P} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.
\]
The solution is given by $\mathrm{vec}(\{\tilde{z}^{(i)}\}) = \oplus \tilde{R}^{(i)} \cdot \mathrm{vec}(\{\tilde{\beta}^{(i)}\})$, where $\mathrm{vec}(\{\tilde{z}^{(i)}\}) = \tilde{y}_1 - \tilde{T}_{12} \cdot v_2$. The RSS of the new model is $\mathrm{RSS}(\{\tilde{\beta}^{(i)}\}) = \mathrm{RSS}(\{\hat{\beta}^{(i)}\}) + \|v_2\|^2$. The various steps of the procedure are
the block C to the covariance structure T11 . The G equations are updated by applying the orthogonal
eT from the left. Next, the block upper-triangular structure of T11 is restored by an ortransformation Q
thogonal transformation from the right. Finally, the model is reduced. Givens sequences are employed
to update the matrices (Foschi, Belsley & Kontoghiorghes 2003, Kontoghiorghes 2004).
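The core primitive of this update is the annihilation of an appended observation row against an existing upper-triangular factor by a sequence of Givens rotations, each rotation being carried through the corresponding responses. A minimal NumPy sketch of this primitive for a single equation is given below; the names are illustrative, and the accompanying rotations applied to the covariance factor from the right, described above, are omitted.

```python
import numpy as np

def givens(a, b):
    """Rotation (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def add_row(R, z, a, zeta):
    """Annihilate a new observation row (a, zeta) against (R, z).

    R is an n x n upper-triangular factor and z the transformed
    response; a is the appended regressor row and zeta its response.
    Illustrative sketch of the update primitive for one equation.
    """
    R, z = R.copy(), z.copy()
    a = np.asarray(a, dtype=float).copy()
    n = R.shape[0]
    for j in range(n):
        c, s = givens(R[j, j], a[j])
        # rotate row j of R against the appended row, zeroing a[j]
        R_row, a_row = R[j, j:].copy(), a[j:].copy()
        R[j, j:] = c * R_row + s * a_row
        a[j:] = -s * R_row + c * a_row
        # carry the responses through the same rotation
        zj = z[j]
        z[j] = c * zj + s * zeta
        zeta = -s * zj + c * zeta
    return R, z, zeta

# example: update the QRD of X with one extra row x_new
rng = np.random.default_rng(1)
X, y = rng.normal(size=(10, 4)), rng.normal(size=10)
x_new, y_new = rng.normal(size=4), rng.normal()
Q, R0 = np.linalg.qr(X)                      # reduced QRD
R1, z1, leftover = add_row(R0, Q.T @ y, x_new, y_new)
```

In this simplified (uncorrelated) setting the leftover response component plays the role that ‖v_2‖² plays in the SUR-GLLSP update.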
Figure 6: Solving the SUR-GLLSP (G = 3): (a) the SUR-GLLSP; (b) apply Q^T from the left; (c) permute rows and columns; (d) apply P^T from the right; (e) reduce the SUR-GLLSP.

The ARA can be modified to compute the all-observation-subsets regression for the SUR model. A notable difference from the GLM is that the observations available for selection are not affected by the orthogonal transformation P^T. In constructing the model, the G regressions are sorted in order of increasing number of variables, so that n_G ≥ n_i, i = 1, …, G. A subset model is solved in each node of level n_G, as illustrated in Figure 6. The algorithm proceeds by updating the GQRD with one observation at a time (Figure 7).
Figure 7: Updating the SUR-GLLSP (G = 3): (a) add rows; (b) apply Q^T from the left; (c) apply P^T from the right; (d) reduce the SUR-GQRD.
The cost of computing the subtree of a node residing in a level ℓ > n_G is:
$$
T(G, \{n_i\}, k) = \sum_{i=1}^{k} \left[ T_u(G, \{n_i\}) + T(G, \{n_i\}, k - i) \right]
= \left( 2^{k+1} - 1 \right) T_u(G, \{n_i\}),
$$
where T_u is the cost of updating the SUR-GQRD, T_u(G, {n_i}) ∈ O(GN²) and N = Σ_i n_i. Assuming that solving the GQRD in level n_G has the same cost as performing n_G updates, the complexity of the algorithm is T(G, {n_i}, m) ∈ O(2^m GN²).
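The recurrence can be evaluated directly to see the exponential growth that underlies the O(2^m GN²) bound. The sketch below measures cost in units of T_u and assumes the base case T(G, {n_i}, 0) = 0, which the text does not state; any constant base case leaves the exponential order unchanged.

```python
from functools import lru_cache

def subtree_cost(k, t_update=1.0):
    """Evaluate the recurrence T(G, {n_i}, k) in units of T_u.

    The base case T(G, {n_i}, 0) = 0 is an assumption; it only shifts
    the constant in front of the exponential growth.
    """
    @lru_cache(maxsize=None)
    def T(j):
        if j == 0:
            return 0.0
        return sum(t_update + T(j - i) for i in range(1, j + 1))
    return T(k)

# the cost roughly doubles with every extra selectable observation
print([subtree_cost(k) for k in range(1, 8)])   # 1, 3, 7, 15, 31, 63, 127
```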
4.1 Experimental results
The SUR model is simulated as follows:
$$
\mathrm{vec}(\{y^{(i)}\}) = \oplus X^{(i)} \cdot \beta + (C \otimes I_m) \cdot u,
$$
where X^(i) ∼ (0, 10), β, u ∼ (0, 1), and C is the upper-triangular Cholesky factor of Σ, which is
generated according to the same scheme as Ω in (13). Outlying observations are simulated by replacing
some values at identical indices in each of the vectors y (i) by values that are normally distributed with
mean 100 and standard deviation 10. Five instances of the problem are simulated for each of the various problem dimensions. The mean execution times are reported in Table 2. The algorithm is computationally infeasible for large-scale SUR models. As was the case for the GLM, Figure 8 reveals the degree of contamination of the simulated data.
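A sketch of this data-generating process, showing in particular that the Kronecker-structured disturbance (C ⊗ I_m)·u can be applied as U C^T without forming the Kronecker product, is given below. The uniform draws and the stand-in Σ are assumptions: the text specifies only the intervals and generates Σ by the scheme used for Ω in (13); all names are illustrative.

```python
import numpy as np

def simulate_sur(m, n_list, n_outliers, seed=0):
    """Simulate vec({y^(i)}) = oplus X^(i) beta + (C kron I_m) u."""
    rng = np.random.default_rng(seed)
    G = len(n_list)
    X_list = [rng.uniform(0.0, 10.0, size=(m, n)) for n in n_list]
    beta = [rng.uniform(0.0, 1.0, size=n) for n in n_list]
    S = rng.uniform(0.0, 1.0, size=(G, G))
    Sigma = S @ S.T + G * np.eye(G)          # stand-in for the paper's Sigma
    C = np.linalg.cholesky(Sigma).T          # upper-triangular factor
    U = rng.uniform(0.0, 1.0, size=(m, G))
    # (C kron I_m) vec(U) = vec(U @ C.T), so no Kronecker product is needed
    Y = np.column_stack([X @ b for X, b in zip(X_list, beta)]) + U @ C.T
    # contaminate the same row indices in every equation
    idx = rng.choice(m, size=n_outliers, replace=False)
    Y[idx, :] = rng.normal(100.0, 10.0, size=(n_outliers, G))
    return Y, X_list, C, idx

Y, X_list, C, outliers = simulate_sur(m=24, n_list=[4, 6, 8], n_outliers=6)
```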
Table 2: Mean execution times in seconds for the simulated SUR model.
m     n_i (G = 3)    outliers    time (s)
16    4, 6, 8        4               12
20    4, 6, 8        5               96
24    4, 6, 8        6              537
28    6, 8, 10       7            10561
32    6, 8, 10       8            46120
Figure 8: The LTS-RSSs (RSS against subset size) computed for the simulated SUR model with m = 24 observations.
5 Conclusions
New algorithms have been designed to compute the LTS estimators of the GLM and SUR model. These
algorithms are based on LTS strategies for the standard regression model. The singularity problem of
the dispersion matrix is avoided by reformulating the estimation problem as a GLLSP. The main computational tool of the algorithm is the GQRD. Efficient strategies to update the GQRD are employed.
The sparse matrix structure of the linear models is exploited to apply the orthogonal transformations efficiently. Thus, solutions to the GLLSP are obtained without having to solve the GLLSP from scratch.
Although the algorithm is infeasible for larger problem sizes, the simulations show that it succeeds in detecting the outlying observations. Future work can address the LTS problem of the SUR model where the regressions are modified independently, that is, where observations are deleted from one regression at a time. This will yield an unbalanced SUR model in each node of the ARA tree (Foschi & Kontoghiorghes 2002). The feasibility of adapting the Fast LTS algorithm (Rousseeuw & Van Driessen 2006) to the GLM, and the use of parallel computing to solve larger-scale problems, merit investigation.
References
Agulló, J. (2001), ‘New algorithms for computing the least trimmed squares regression estimator’,
Computational Statistics and Data Analysis 36, 425–439.
Agulló, J., Croux, C. & Aelst, S. V. (2008), ‘The multivariate least-trimmed squares estimator’, Journal
of Multivariate Analysis 99(3), 311–338.
Belsley, D. A., Kuh, A. E. & Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity, John Wiley and Sons, New York.
Foschi, P., Belsley, D. A. & Kontoghiorghes, E. J. (2003), ‘A comparative study of algorithms for
solving seemingly unrelated regressions models’, Computational Statistics and Data Analysis
44(1-2), 3–35.
Foschi, P. & Kontoghiorghes, E. J. (2002), ‘Seemingly unrelated regression model with unequal size
observations: computational aspects’, Computational Statistics and Data Analysis 41(1), 211–
229.
Furnival, G. & Wilson, R. (1974), ‘Regression by leaps and bounds’, Technometrics 16, 499–511.
Gatu, C. & Kontoghiorghes, E. J. (2003), ‘Parallel algorithms for computing all possible subset regression models using the QR decomposition’, Parallel Computing 29(4), 505–521.
Gatu, C. & Kontoghiorghes, E. J. (2006), ‘Branch-and-bound algorithms for computing the best subset
regression models’, Journal of Computational and Graphical Statistics 15, 139–156.
Golub, G. H. & Van Loan, C. F. (1996), Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, 3rd edn, Johns Hopkins University Press, Baltimore, Maryland.
Hofmann, M., Gatu, C. & Kontoghiorghes, E. J. (2008), ‘An exact least-trimmed-squares algorithm for
a range of coverage values’, Journal of Computational and Graphical Statistics. Submitted.
Hubert, M., Rousseeuw, P. J. & Aelst, S. V. (2008), ‘High-breakdown robust multivariate methods’,
Statistical Science 23(1), 92–119.
Kontoghiorghes, E. J. (2004), ‘Computational methods for modifying seemingly unrelated regressions
models’, Journal of Computational and Applied Mathematics 162(1), 247–261.
Kontoghiorghes, E. J. & Clarke, M. R. B. (1995), ‘An alternative approach for the numerical solution
of seemingly unrelated regression equations models’, Computational Statistics & Data Analysis
19(4), 369–377.
Paige, C. C. (1978), ‘Numerically stable computations for general univariate linear models’, Communications in Statistics Part B — Simulation and Computation 7(5), 437–453.
Paige, C. C. (1979a), ‘Computer solution and perturbation analysis of generalized linear least squares
problems’, Mathematics of Computation 33(145), 171–183.
Paige, C. C. (1979b), ‘Fast numerically stable computations for generalized linear least squares problems’, SIAM Journal on Numerical Analysis 16(1), 165–171.
Rousseeuw, P. J., Aelst, S. V., Driessen, K. V. & Agulló, J. (2004), ‘Robust multivariate regression’,
Technometrics 46, 293–305.
Rousseeuw, P. J. & Leroy, A. M. (1987), Robust regression and outlier detection, John Wiley & Sons.
Rousseeuw, P. J. & Van Driessen, K. (2006), ‘Computing LTS regression for large data sets’, Data
Mining and Knowledge Discovery 12, 29–45.
Srivastava, V. K. & Dwivedi, T. D. (1979), ‘Estimation of seemingly unrelated regression equations
models: a brief survey’, Journal of Econometrics 10, 15–32.
Srivastava, V. K. & Giles, D. E. A. (1987), Seemingly unrelated regression equations models: estimation and inference, Vol. 80 of Statistics: textbooks and monographs, Marcel Dekker, Inc.
Yanev, P. & Kontoghiorghes, E. J. (2007), ‘Computationally efficient methods for estimating the
updated-observations SUR models’, Applied Numerical Mathematics 57(11-12), 1245–1258.
Zellner, A. (1962), ‘An efficient method of estimating seemingly unrelated regression equations and
tests for aggregation bias’, Journal of the American Statistical Association 57, 348–368.