Local Convergence of Tri-Level Alternating Optimization

Richard J. Hathaway1, Yingkang Hu1, and James C. Bezdek2
1Mathematics and Computer Science Department, Georgia Southern University, Statesboro, GA 30460
2Computer Science Department, University of West Florida, Pensacola, FL 32514
Abstract
Tri-level alternating optimization (TLAO) of a real-valued function f(w) consists of
partitioning the vector variable w into three parts, say w = (x,y,z), and alternating
optimizations over each of the three parts while holding the other two at their newest
values. Alternating optimization is not usually the best approach to optimizing a
function. However, when f(x,y,z) has special structure that allows each of the partial
optimizations to be performed very easily, the method can be simple to implement
and computationally competitive with other popular approaches such as quasi-Newton or
conjugate gradient methods (Hu and Hathaway, 1999). A local convergence analysis of
TLAO is given which shows that the method is locally q-linearly convergent to
minimizers for which the second derivative of the objective function is positive definite.
A useful recent application of TLAO in the area of pattern recognition is described.
Keywords – alternating optimization, local convergence, pattern recognition
1. INTRODUCTION
In this paper we consider the convergence analysis of a technique for computing local
solutions to the problem:
\[
\min_{w \in R^s} f(w), \qquad (1)
\]
where f: R^s → R is twice differentiable. The technique is called tri-level alternating
optimization (TLAO). Application of this technique requires a partitioning of the
variable w ∈ R^s as w = (x,y,z), with x ∈ R^p, y ∈ R^q, and z ∈ R^r. TLAO attempts to
minimize f using an iteration that sequentially minimizes f over each of the x, y, and z
variables. The TLAO procedure is stated next; the notation "arg min" is used to denote
the argument that minimizes, i.e., the minimizer.
Tri-Level Alternating Optimization (TLAO)
TLAO-1  Partition w ∈ R^s as w = (x,y,z), with x ∈ R^p, y ∈ R^q, and z ∈ R^r.
        Pick the initial iterate w(0) = (x(0), y(0), z(0)) and a stopping criterion.
        Set k = 0.
TLAO-2  Compute
\[
x^{(k+1)} = \arg\min_{x \in R^p} f(x, y^{(k)}, z^{(k)}) \qquad (2)
\]
TLAO-3  Compute
\[
y^{(k+1)} = \arg\min_{y \in R^q} f(x^{(k+1)}, y, z^{(k)}) \qquad (3)
\]
TLAO-4  Compute
\[
z^{(k+1)} = \arg\min_{z \in R^r} f(x^{(k+1)}, y^{(k+1)}, z) \qquad (4)
\]
TLAO-5  If w(k+1) = (x(k+1), y(k+1), z(k+1)) and w(k) = (x(k), y(k), z(k)) satisfy the
        stopping criterion, then quit; otherwise, set k = k+1 and go to TLAO-2.
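As an illustration, a minimal sketch of the TLAO loop for a generic smooth objective is given below. It simply applies a general-purpose minimizer (here SciPy's minimize, an arbitrary choice made only for illustration) to each block in turn, whereas in practice TLAO is most attractive when the partial minimizations in (2)-(4) have cheap closed forms.

```python
import numpy as np
from scipy.optimize import minimize

def tlao(f, x0, y0, z0, max_iter=100, tol=1e-8):
    """Tri-level alternating optimization of f(x, y, z).

    Each inner step minimizes f over one block of variables while the
    other two blocks are held at their newest values (TLAO-2 to TLAO-4).
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x0, y0, z0))
    for k in range(max_iter):
        w_old = np.concatenate([x, y, z])
        x = minimize(lambda u: f(u, y, z), x).x      # TLAO-2
        y = minimize(lambda u: f(x, u, z), y).x      # TLAO-3
        z = minimize(lambda u: f(x, y, u), z).x      # TLAO-4
        # TLAO-5: stop when w = (x, y, z) has essentially stopped changing
        if np.linalg.norm(np.concatenate([x, y, z]) - w_old) < tol:
            break
    return x, y, z

# Small test problem: a coupled quadratic whose unique minimizer is the origin.
def f(x, y, z):
    return x[0]**2 + 2.0*y[0]**2 + 3.0*z[0]**2 + 0.5*x[0]*y[0] + 0.2*y[0]*z[0]

print(tlao(f, [1.0], [1.0], [1.0]))
```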
A bi-level version of this approach is analyzed in Bezdek et al. (1987). The bi-level
version has been widely used to optimize numerous fuzzy clustering criteria. Our interest
in the tri-level version is motivated in part by the need to validate the optimization
procedure employed in the recently devised pattern recognition tool in Hathaway and
Bezdek (1999) that is briefly described in Section 3. This technique is a modification of
the popular fuzzy c-means algorithm (Bezdek, 1981) and is capable of effectively
clustering incomplete data. Incomplete data vectors are data vectors missing values for
some (but not all) components. Other statistical and fuzzy methods for pattern
recognition also alternate optimizations over three sets of variables, and this note helps
supply the underlying local convergence theory for those cases as well.
We mention that the global convergence properties of TLAO follow easily from the
general convergence theory of Zangwill (1969), and are based on the monotonic decrease
in the objective function values as the iteration proceeds. In short, the theory can be
used to show that under mild assumptions any limit point of a TLAO sequence is a point
(x*,y*,z*) satisfying (2-4) with x(k+1) = x*, y(k) = y(k+1) = y*, and z(k) = z(k+1) = z*.
This type of point could either be a minimizer or a saddle point, but in practice, computed
(x*,y*,z*) values are almost never saddle points.
The next section gives the local analysis of TLAO. Section 3 briefly describes the
new clustering algorithm that uses TLAO to optimize a particular clustering criterion.
The final section contains concluding remarks and some ideas regarding worthwhile
future work.
2. LOCAL CONVERGENCE ANALYSIS OF TLAO
Let f: R^p × R^q × R^r → R and partition w = (x,y,z) ∈ R^p × R^q × R^r. We show in this
section that TLAO is locally q-linearly convergent to any local minimizer of f for which
the Hessian of f is positive definite. Corresponding to (2)-(4) we define X: R^q × R^r → R^p,
Y: R^p × R^r → R^q, and Z: R^p × R^q → R^r, as:
\[
X(y,z) = \arg\min_{x \in R^p} f(x, y, z) \qquad (5)
\]
\[
Y(x,z) = \arg\min_{y \in R^q} f(x, y, z) \qquad (6)
\]
\[
Z(x,y) = \arg\min_{z \in R^r} f(x, y, z) \qquad (7)
\]
The arguments used in this section are translation invariant, and we therefore simplify
notation by assuming the local minimizer of interest is (0,0,0) ∈ R^p × R^q × R^r. We first
show that under reasonable assumptions, X, Y, and Z are continuously differentiable at
(0,0) ∈ R^q × R^r, R^p × R^r, and R^p × R^q, respectively. (In what follows, we will
sometimes leave it to the reader to infer the applicable dimensions of points such as (0,0)
rather than mention them explicitly.)
Lemma 2.1
Let f: R^p × R^q × R^r → R satisfy the conditions:
(i) f is C^2 in a neighborhood of (0,0,0);
(ii) f''(0,0,0) is positive definite; and
(iii) (0,0,0) is a local minimizer of f.
Then in some neighborhood of (0,0) ∈ R^q × R^r, the function X(y,z) in (5) is continuously
differentiable. Similar results hold for Y(x,z) and Z(x,y).
Proof. Partition f''(x,y,z) as
\[
f''(x,y,z) =
\begin{pmatrix}
f_{xx}(x,y,z) & f_{xy}(x,y,z) & f_{xz}(x,y,z) \\
f_{yx}(x,y,z) & f_{yy}(x,y,z) & f_{yz}(x,y,z) \\
f_{zx}(x,y,z) & f_{zy}(x,y,z) & f_{zz}(x,y,z)
\end{pmatrix}.
\]
By (i) and (ii), f_xx(x,y,z) is positive definite (and hence nonsingular) in a neighborhood
of (0,0,0). The implicit function theorem guarantees a continuously differentiable function
X: R^q × R^r → R^p, defined in a neighborhood of (0,0) ∈ R^q × R^r, satisfying
f_x(X(y,z), y, z) = 0. This implies x = X(y,z) is a critical point of f(·, y, z), and this
together with (iii) gives us that X(0,0) = 0. Since (X(y,z), y, z) is near (0,0,0) for (y,z)
near (0,0), it follows using (i) and (ii) that f_xx(X(y,z), y, z) is positive definite for (y,z)
near (0,0), and this implies that X(y,z) is a minimizer of f(·, y, z). Similar arguments give
the results for Y(x,z) and Z(x,y).
For notational convenience in the following, we define A = f''(0,0,0) and
\[
\begin{pmatrix}
A_{xx} & A_{xy} & A_{xz} \\
A_{yx} & A_{yy} & A_{yz} \\
A_{zx} & A_{zy} & A_{zz}
\end{pmatrix}
=
\begin{pmatrix}
f_{xx}(0,0,0) & f_{xy}(0,0,0) & f_{xz}(0,0,0) \\
f_{yx}(0,0,0) & f_{yy}(0,0,0) & f_{yz}(0,0,0) \\
f_{zx}(0,0,0) & f_{zy}(0,0,0) & f_{zz}(0,0,0)
\end{pmatrix}. \qquad (8)
\]
Define the mapping S: R^p × R^q × R^r → R^p × R^q × R^r corresponding to one iteration
through steps TLAO-2, 3, and 4 as:
\[
S(x,y,z) = (S_1(x,y,z),\ S_2(x,y,z),\ S_3(x,y,z)) \qquad (9a)
\]
\[
= (X(y,z),\ Y(X(y,z), z),\ Z(X(y,z), Y(X(y,z), z))). \qquad (9b)
\]
The results of Lemma 2.1 imply that S is continuously differentiable in a neighborhood
of (0,0,0) with S(0,0,0) = (0,0,0). As will be seen later in the proof of Theorem 2.1, the
fundamental property needed to establish convergence of a TLAO sequence is
ρ(S'(0,0,0)) < 1, where ρ(·) denotes the spectral radius; this is proved in the following lemma.
Lemma 2.2
Let f: R^p × R^q × R^r → R satisfy the conditions of Lemma 2.1 and let S:
R^p × R^q × R^r → R^p × R^q × R^r be defined by (9). Then ρ(S'(0,0,0)) < 1.
Proof. Partition S'(0,0,0) as
\[
S'(0,0,0) =
\begin{pmatrix}
S_{1x}(0,0,0) & S_{1y}(0,0,0) & S_{1z}(0,0,0) \\
S_{2x}(0,0,0) & S_{2y}(0,0,0) & S_{2z}(0,0,0) \\
S_{3x}(0,0,0) & S_{3y}(0,0,0) & S_{3z}(0,0,0)
\end{pmatrix}.
\]
In calculating S'(0,0,0), we will need the various partials X_y(0,0), X_z(0,0), Y_x(0,0),
Y_z(0,0), Z_x(0,0), and Z_y(0,0), which are obtained first. We will suppress the argument
(0,0) in the following. To obtain X_y, differentiate f_x(X(y,z), y, z) = 0 with respect to y
and evaluate at (0,0) to get
\[
f_{xx}(0,0,0)\, X_y + f_{xy}(0,0,0) = 0,
\]
which yields
\[
X_y = -A_{xx}^{-1} A_{xy}. \qquad (10a)
\]
Differentiating f_x(X(y,z), y, z) = 0 with respect to z and evaluating at (0,0) gives
\[
X_z = -A_{xx}^{-1} A_{xz}. \qquad (10b)
\]
The remaining partials are calculated by differentiating f_y(x, Y(x,z), z) = 0 with respect to
x and z, and f_z(x, y, Z(x,y)) = 0 with respect to x and y. They are:
\[
Y_x = -A_{yy}^{-1} A_{yx}, \qquad (10c)
\]
\[
Y_z = -A_{yy}^{-1} A_{yz}, \qquad (10d)
\]
\[
Z_x = -A_{zz}^{-1} A_{zx}, \qquad (10e)
\]
\[
Z_y = -A_{zz}^{-1} A_{zy}. \qquad (10f)
\]
Now the parts of S'(0,0,0) are calculated using (10) as:
\[
S_{1x}(0,0,0) = 0 \in R^{p \times p}; \qquad (11a)
\]
\[
S_{2x}(0,0,0) = 0 \in R^{q \times p}; \qquad (11b)
\]
\[
S_{3x}(0,0,0) = 0 \in R^{r \times p}; \qquad (11c)
\]
\[
S_{1y}(0,0,0) = X_y = -A_{xx}^{-1} A_{xy}; \qquad (11d)
\]
\[
S_{2y}(0,0,0) = Y_x X_y = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11e)
\]
\[
S_{3y}(0,0,0) = Z_x X_y + Z_y Y_x X_y = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xy} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11f)
\]
\[
S_{1z}(0,0,0) = X_z = -A_{xx}^{-1} A_{xz}; \qquad (11g)
\]
\[
S_{2z}(0,0,0) = Y_x X_z + Y_z = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} - A_{yy}^{-1} A_{yz}; \qquad (11h)
\]
\[
S_{3z}(0,0,0) = Z_x X_z + Z_y Y_x X_z + Z_y Y_z = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xz} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} + A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yz}. \qquad (11i)
\]
We can now establish that ρ(S'(0,0,0)) < 1 by recognizing an important relationship
between S'(0,0,0) and A. Define the matrices B, C, and D as
\[
B = \begin{pmatrix} A_{xx} & 0 & 0 \\ A_{yx} & A_{yy} & 0 \\ A_{zx} & A_{zy} & A_{zz} \end{pmatrix}, \quad
C = \begin{pmatrix} 0 & -A_{xy} & -A_{xz} \\ 0 & 0 & -A_{yz} \\ 0 & 0 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} A_{xx} & 0 & 0 \\ 0 & A_{yy} & 0 \\ 0 & 0 & A_{zz} \end{pmatrix}. \qquad (12)
\]
Note that A = B − C. Since A is positive definite, it follows that D is positive definite
and B is nonsingular. A straightforward but tedious calculation using (11) shows that
\[
S'(0,0,0) = B^{-1} C. \qquad (13)
\]
By Theorem 7.1.9 in Ortega (1972) and the assumption that A is symmetric and positive
definite, we have that ρ(S'(0,0,0)) = ρ(B^{-1}C) = ρ(I − B^{-1}A) < 1 if A = B − C is a
P-regular splitting. By definition, B − C is a P-regular splitting if B is nonsingular and
B + C is positive definite. By earlier comments, it only remains to show that B + C is
positive definite.
The symmetric part of B + C is
\[
\tfrac{1}{2}\left[(B + C) + (B^T + C^T)\right]
= \tfrac{1}{2}\left[(B + C^T) + (B^T + C)\right]
= \tfrac{1}{2}\left(D + D^T\right) = D, \qquad (14)
\]
which is positive definite.
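As a concrete numerical check of Lemma 2.2, the following minimal sketch draws a random symmetric positive definite A, forms B and C as in (12), and verifies that ρ(B⁻¹C) < 1; by (13) this is exactly the spectral radius of S'(0,0,0). The block sizes and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 2, 3, 2                    # block sizes for x, y, z
s = p + q + r

M = rng.standard_normal((s, s))
A = M @ M.T + s * np.eye(s)          # a symmetric positive definite "Hessian"

# Split A = B - C as in (12): B keeps the diagonal and lower blocks of A,
# C holds the negated upper blocks.
edges = [0, p, p + q, s]
B = np.zeros_like(A)
C = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        block = A[edges[i]:edges[i + 1], edges[j]:edges[j + 1]]
        if j <= i:
            B[edges[i]:edges[i + 1], edges[j]:edges[j + 1]] = block
        else:
            C[edges[i]:edges[i + 1], edges[j]:edges[j + 1]] = -block

assert np.allclose(A, B - C)
rho = max(abs(np.linalg.eigvals(np.linalg.solve(B, C))))
print("rho(B^{-1} C) =", rho)        # strictly less than 1, as Lemma 2.2 predicts
```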
We now give the main result for local convergence, which is essentially an adaptation of
Ostrowski's theorem (Theorem 8.1.7 in Ortega, 1972).
Theorem 2.1
Let w* be a local minimizer of f: R^s → R for which f''(w*) is positive definite, and let f
be C^2 on a neighborhood of w*. Then there is a neighborhood U of w* such that for any
w(0) ∈ U, the corresponding TLAO iteration sequence {w(k)} defined using S in (9) as
w(k+1) = S(w(k)) converges q-linearly to w*.
Proof. As discussed earlier, we can assume that w* = 0. It is necessary to show that
\[
\lim_{k \to \infty} w^{(k+1)} = \lim_{k \to \infty} S^{k+1}(w^{(0)}) = 0 \qquad (15)
\]
for all choices of w(0) close enough to w*. Apply Lemma 2.2 to obtain
\[
\sigma = \rho(S'(0,0,0)) < 1. \qquad (16)
\]
Pick ε > 0 such that σ + 2ε < 1. By Theorem 3.8 in Stewart (1973), there exists a norm
‖·‖ on R^s such that for all w ∈ R^s,
\[
\| S'(0,0,0)\, w \| \le (\sigma + \varepsilon) \| w \|. \qquad (17)
\]
Since S' is continuous near w* = (x*, y*, z*) = (0,0,0), there is a number r > 0 such that
\[
\| S'(w_1)\, w_2 \| \le (\sigma + 2\varepsilon) \| w_2 \| \qquad (18)
\]
for all w_1 and w_2 ∈ B_r = {w ∈ R^s : ‖w‖ ≤ r}. From (18) and the fact that S(0) = 0, we
have, for all w ∈ B_r,
\[
\| S(w) \| = \left\| \int_0^1 S'(tw)\, w \, dt \right\|
\le \int_0^1 \| S'(tw)\, w \| \, dt
\le (\sigma + 2\varepsilon) \| w \|. \qquad (19)
\]
The result in (19) establishes that for initializations of TLAO near w*, the error is reduced
by the factor (σ + 2ε) < 1 at each iteration, which gives the local q-linear convergence of
{w(k)} to w*.
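The q-linear rate can also be observed directly on a quadratic f(w) = ½ wᵀAw, for which each block minimization is an exact linear solve. In the minimal sketch below (an illustration with arbitrarily chosen block sizes), the per-iteration error ratio settles at ρ(B⁻¹C), the rate identified in Lemma 2.2 and Theorem 2.1.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 2, 2, 2
s = p + q + r
M = rng.standard_normal((s, s))
A = M @ M.T + s * np.eye(s)                 # positive definite Hessian; minimizer w* = 0
blocks = [slice(0, p), slice(p, p + q), slice(p + q, s)]

w = rng.standard_normal(s)                  # initial iterate
prev_err = np.linalg.norm(w)
for k in range(20):
    for b in blocks:
        # Exact minimization of 0.5 * w^T A w over block b with the rest fixed:
        # solve A_bb w_b = -(coupling terms from the other blocks).
        rhs = -(A[b, :] @ w - A[b, b] @ w[b])
        w[b] = np.linalg.solve(A[b, b], rhs)
    err = np.linalg.norm(w)
    print(k, err, err / prev_err)           # the ratio approaches rho(B^{-1} C)
    prev_err = err
```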
3. AN EXAMPLE OF TLAO
An important problem from the area of pattern recognition is that of partitioning a set of
data X = {x1,…,xn} into natural data groupings (or clusters). A popular and effective
method for partitioning data into c fuzzy clusters {C1,…,Cc} is based on minimization of
the fuzzy c-means functional (Bezdek, 1981):
\[
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} U_{ik}^m \, \| x_k - V_i \|^2, \qquad (20)
\]
where:
m > 1 is the fuzzification parameter;
U = [Uik], where Uik = degree to which xk ∈ Ci;
V = [V1,…,Vc], where Vi is the center of Ci; and
‖·‖ is an inner product norm on R^t.
The optimization of (20) is attempted over all V ∈ R^{t×c} and U ∈ M_fcn, where
\[
M_{fcn} = \left\{ U \in R^{c \times n} \;\middle|\; U_{ik} \in [0,1], \ \sum_{i=1}^{c} U_{ik} = 1, \ \sum_{k=1}^{n} U_{ik} \ge 1; \ \forall\, i, k \right\}. \qquad (21)
\]
The most popular method for optimizing (20) is a bi-level alternating optimization over
the U and V variables known as the fuzzy c-means algorithm (Bezdek, 1981). It
calculates a new V from the most recent U via
\[
V_{ji} = \left( \sum_{k=1}^{n} U_{ik}^m \, x_{jk} \right) \Big/ \left( \sum_{k=1}^{n} U_{ik}^m \right), \quad \forall\, j, i, \qquad (22)
\]
and a new U from the most recent V by
\[
U_{ik} = \| x_k - V_i \|^{-2/(m-1)} \Big/ \left( \sum_{j=1}^{c} \| x_k - V_j \|^{-2/(m-1)} \right), \quad \forall\, i, k, \qquad (23)
\]
where Vji and xjk are the jth components of Vi and xk, respectively.
Real-world data sets sometimes contain partially missing data (Jain and Dubes,
1988; Dixon, 1979), which can arise from human or sensor error, subsequent data
corruption, etc. For example, datum component x3k may be missing so that xk has the
form (x1k,x2k,?,x4k,…,xtk)T. Unfortunately, the iteration described by (22) and (23)
requires access to complete data. If the number of data with missing components is
small, then one option is to delete that portion of the data set and base the clustering
entirely on those data containing no missing components. This approach is problematic if
the proportion of data with missing components is significant and in all cases fails to
provide a classification of those data deleted from the cluster analysis. Recently a
missing data version of the fuzzy c-means algorithm (Hathaway and Bezdek, 1999) has
been devised for the incomplete data case and we give the algorithm here as a useful
example of TLAO.
In the following we will use x̂ to represent missing data components; e.g., xk =
(x1k, x2k, x̂3k, x4k,…,xtk)T. The collection of all missing data components will be denoted
by X̂ ⊂ X. We adapt fuzzy c-means to missing data using a principle of optimality
similar to that used in maximum likelihood methods from statistics. In this case, we
assume that the missing data are highly consistent with the data set having strong cluster
structure. We implement this optimistic completion principle by dynamically estimating
the missing data components to be those numbers that maximize the cluster structure, as
measured by minimum values of the criterion in (20).
The incomplete data version of fuzzy c-means from Hathaway and Bezdek (1999)
minimizes (20) over U, V, and X̂. The minimization is done using TLAO based on the
following updating. The current values of V and U are used to estimate the missing data
components X̂ by:
\[
\hat{x}_{jk} = \left( \sum_{i=1}^{c} U_{ik}^m \, V_{ji} \right) \Big/ \left( \sum_{i=1}^{c} U_{ik}^m \right), \quad \forall\, \hat{x}_{jk} \in \hat{X}. \qquad (24)
\]
The current missing data values X̂ are then used to complete X so that (22) can be used to
calculate the new V. The third inner step of one complete TLAO iteration is then done
using the completed X and the new V in (23) to calculate the new U. Preliminary testing
of this missing data approach for clustering has demonstrated it to be highly effective.
The convergence theory of the last section guarantees the procedure to be locally
convergent, at a q-linear rate.
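A minimal sketch of one complete TLAO pass of this incomplete-data scheme follows, under the same layout assumptions as the previous sketch and with missing components marked by NaN; it applies (24), then (22), then (23), in the order just described.

```python
import numpy as np

def incomplete_fcm_iteration(X, U, V, m=2.0):
    """One TLAO pass of incomplete-data FCM: (24), then (22), then (23).

    X : (t, n) data with missing components stored as np.nan.
    U : (c, n) memberships;  V : (t, c) cluster centers.
    """
    Um = U ** m
    missing = np.isnan(X)
    X_hat = (V @ Um) / Um.sum(axis=0)                 # (24): optimistic completion estimates
    X_full = np.where(missing, X_hat, X)              # fill in only the missing entries
    V_new = (X_full @ Um.T) / Um.sum(axis=1)          # (22): new centers from completed data
    d2 = ((X_full[:, None, :] - V_new[:, :, None]) ** 2).sum(axis=0)
    W = d2 ** (-1.0 / (m - 1.0))                      # assumes no datum sits exactly on a center
    U_new = W / W.sum(axis=0)                         # (23): new memberships
    return X_full, V_new, U_new
```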
4. DISCUSSION
Tri-level alternating optimization attempts to optimize a function f(x,y,z) using an
iteration that alternates optimizations over each of the three (vector) variables while
holding the other two fixed. The method is locally q-linearly convergent to any
minimizer at which the second derivative is positive definite. The global analysis of this
method fits into the general convergence theory of Zangwill (1969), which guarantees
that any limit point of an iteration sequence must satisfy the first-order necessary
conditions for minimizing f. The authors plan to unify the convergence theory and extend
it to the case of m-level alternating optimization, and to survey many important instances
of alternating-optimization-type schemes in pattern recognition and statistics.
A recent example of TLAO for an important problem in pattern recognition was
given. The TLAO scheme applied to the fuzzy c-means function allows clustering of
data sets where some of the data vectors have missing components. Preliminary
numerical tests of the approach have shown it to produce good results even when there is
a substantial proportion of incomplete data.
Alternating optimization is often not the best way to optimize a function, but
it is certainly worth consideration if there is a natural partitioning of the variables so that
each of the partial minimizations is simple to do. While the convergence rate of the
approach is only q-linear, if the partial minimizations are simple, the method can still be
competitive with, or superior to, optimization approaches with faster (e.g., q-superlinear)
rates of convergence (Hu and Hathaway, 1999).
One of the most interesting mathematical questions concerning this approach is how
to systematically and efficiently determine the “best” partitioning of the variables so that
the value of σ in (16) is as small as possible. We expect the value of σ to be small when
we have partitioned into variables that are largely independent. For example,
minimization of f(x,y,z) = x^2 + y^2 + z^2 can be done in one iteration (σ = 0; see the
sketch at the end of this section) because there
is complete independence among the three variables. Other computationally oriented
questions concern how to best formulate a relaxation scheme and how an alternating
optimization approach can best be hybridized with a q-superlinearly (or faster)
convergent local method.
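The separable example is easy to check numerically; the minimal sketch below (an illustration using the same generic minimizer as in the earlier TLAO sketch) performs one TLAO pass on f(x,y,z) = x^2 + y^2 + z^2 and lands at the minimizer immediately.

```python
from scipy.optimize import minimize

# Fully separable objective: the three block problems do not interact,
# so one TLAO pass reaches the minimizer (0, 0, 0) and sigma = 0 in (16).
f = lambda x, y, z: x[0]**2 + y[0]**2 + z[0]**2

x = minimize(lambda u: f(u, [1.0], [1.0]), [1.0]).x   # TLAO-2
y = minimize(lambda u: f(x, u, [1.0]), [1.0]).x       # TLAO-3
z = minimize(lambda u: f(x, y, u), [1.0]).x           # TLAO-4
print(x, y, z)                                        # all (numerically) zero after one pass
```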
REFERENCES
Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Functions. New
York: Plenum Press.
Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., & Windham, M.P.
(1987). Local convergence analysis of a grouped variable version of coordinate
descent, Journal of Optimization Theory and Applications, v. 54, 471-477.
Dixon, J.K. (1979). Pattern recognition with partly missing data, IEEE Transactions
on Systems, Man and Cybernetics, v. 9, 617-621.
Hu, Y., & Hathaway, R.J. (1999). On efficiency of optimization in fuzzy c-means,
preprint.
Hathaway, R.J., & Bezdek, J.C. (1999). Pattern recognition with incomplete data
using the optimistic completion principal, preprint.
Jain, A.K., & Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood
Cliffs, NJ: Prentice-Hall.
Ortega, J.M. (1972). Numerical Analysis: A Second Course. New York: Academic
Press.
Stewart, G.W. (1973). Introduction to Matrix Computations. New York: Academic
Press.
Zangwill, W. (1969). Nonlinear Programming: A Unified Approach. Englewood
Cliffs, NJ: Prentice-Hall.