Local Convergence of Tri-Level Alternating Optimization

Richard J. Hathaway¹, Yingkang Hu¹, and James C. Bezdek²

¹Mathematics and Computer Science Department, Georgia Southern University, Statesboro, GA 30460
²Computer Science Department, University of West Florida, Pensacola, FL 32514

Abstract

Tri-level alternating optimization (TLAO) of a real-valued function $f(w)$ consists of partitioning the vector variable $w$ into three parts, say $w = (x,y,z)$, and alternating optimizations over each of the three parts while holding the other two at their newest values. Alternating optimization is not usually the best approach to optimizing a function. However, when $f(x,y,z)$ has special structure so that each of the partial optimizations can be performed very easily, the method can be simple to implement and computationally competitive with other popular approaches such as quasi-Newton or conjugate gradient methods (Hu and Hathaway, 1999). A local convergence analysis of TLAO is given which shows that the method is locally, q-linearly convergent to minimizers at which the second derivative of the objective function is positive definite. A useful recent application of TLAO in the area of pattern recognition is described.

Keywords: alternating optimization, local convergence, pattern recognition

1. INTRODUCTION

In this paper we consider the convergence analysis of a technique for computing local solutions to the problem

$$\min_{w \in \mathbb{R}^s} f(w), \qquad (1)$$

where $f: \mathbb{R}^s \to \mathbb{R}$ is twice differentiable. The technique is called tri-level alternating optimization (TLAO). Application of this technique requires a partitioning of the variable $w \in \mathbb{R}^s$ as $w = (x,y,z)$, with $x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$, and $z \in \mathbb{R}^r$. TLAO attempts to minimize $f$ using an iteration that sequentially minimizes $f$ over each of the $x$, $y$, and $z$ variables. The TLAO procedure is stated next; the notation "arg min" denotes the argument that minimizes, i.e., the minimizer.

Tri-Level Alternating Optimization (TLAO)

TLAO-1  Partition $w \in \mathbb{R}^s$ as $w = (x,y,z)$, with $x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$, and $z \in \mathbb{R}^r$. Pick the initial iterate $w^{(0)} = (x^{(0)}, y^{(0)}, z^{(0)})$ and a stopping criterion. Set $k = 0$.

TLAO-2  Compute $x^{(k+1)} = \arg\min_{x \in \mathbb{R}^p} f(x, y^{(k)}, z^{(k)})$. \qquad (2)

TLAO-3  Compute $y^{(k+1)} = \arg\min_{y \in \mathbb{R}^q} f(x^{(k+1)}, y, z^{(k)})$. \qquad (3)

TLAO-4  Compute $z^{(k+1)} = \arg\min_{z \in \mathbb{R}^r} f(x^{(k+1)}, y^{(k+1)}, z)$. \qquad (4)

TLAO-5  If $w^{(k+1)} = (x^{(k+1)}, y^{(k+1)}, z^{(k+1)})$ and $w^{(k)} = (x^{(k)}, y^{(k)}, z^{(k)})$ satisfy the stopping criterion, then quit; otherwise, set $k = k+1$ and go to TLAO-2. (A code sketch of this generic iteration is given below.)

A bi-level version of this approach is analyzed in Bezdek et al. (1987), and it has been widely used to optimize numerous fuzzy clustering criteria. Our interest in the tri-level version is motivated in part by the need to validate the optimization procedure employed in the recently devised pattern recognition tool of Hathaway and Bezdek (1999), which is briefly described in Section 3. That technique is a modification of the popular fuzzy c-means algorithm (Bezdek, 1981) and is capable of effectively clustering incomplete data. Incomplete data vectors are data vectors missing values for some (but not all) components. There are other statistical and fuzzy methods for pattern recognition that alternate optimizations over three sets of variables, and this note helps supply the underlying local convergence theory in all those cases.
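To make the generic iteration concrete, the following is a minimal Python sketch of steps TLAO-2 through TLAO-5. It is our illustration rather than anything from the original sources: the three partial minimizers are assumed to be supplied by the caller as functions, and the stopping rule (a tolerance on the change in the full iterate) is one hypothetical choice among many.

```python
import numpy as np

def tlao(x, y, z, argmin_x, argmin_y, argmin_z, tol=1e-10, max_iter=1000):
    """Generic TLAO iteration (steps TLAO-2 through TLAO-5).

    The caller supplies the three partial minimizers:
    argmin_x(y, z) returns arg min over x of f(x, y, z), and
    argmin_y(x, z) and argmin_z(x, y) act analogously.  The stopping
    rule used here is a tolerance on the change in the iterate w.
    """
    for _ in range(max_iter):
        x_new = argmin_x(y, z)                  # TLAO-2, Eq. (2)
        y_new = argmin_y(x_new, z)              # TLAO-3, Eq. (3)
        z_new = argmin_z(x_new, y_new)          # TLAO-4, Eq. (4)
        step = max(np.linalg.norm(x_new - x),   # TLAO-5: compare w(k+1), w(k)
                   np.linalg.norm(y_new - y),
                   np.linalg.norm(z_new - z))
        x, y, z = x_new, y_new, z_new
        if step < tol:
            break
    return x, y, z
```

For the quadratic $f(w) = \frac{1}{2} w^T A w$ studied in Section 2, each supplied minimizer is a single linear solve, e.g., `argmin_x = lambda y, z: np.linalg.solve(A_xx, -(A_xy @ y + A_xz @ z))`.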
We mention that the global convergence properties of TLAO follow easily from the general convergence theory of Zangwill (1969), and are based on the monotonic decrease of the objective function values as the iteration proceeds. In short, the theory can be used to show that under mild assumptions any limit point of a TLAO sequence is a point $(x^*, y^*, z^*)$ satisfying (2)-(4) with $x^{(k+1)} = x^*$, $y^{(k)} = y^{(k+1)} = y^*$, and $z^{(k)} = z^{(k+1)} = z^*$. Such a point can be either a minimizer or a saddle point, but in practice computed $(x^*, y^*, z^*)$ values are almost never saddle points. The next section gives the local analysis of TLAO. Section 3 briefly describes the new clustering algorithm that uses TLAO to optimize a particular clustering criterion. The final section contains concluding remarks and some ideas regarding worthwhile future work.

2. LOCAL CONVERGENCE ANALYSIS OF TLAO

Let $f: \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}$ and partition $w = (x,y,z) \in \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r$. We show in this section that TLAO is locally, q-linearly convergent to any local minimizer of $f$ at which the Hessian of $f$ is positive definite. Corresponding to (2)-(4) we define $X: \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}^p$, $Y: \mathbb{R}^p \times \mathbb{R}^r \to \mathbb{R}^q$, and $Z: \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}^r$ as

$$X(y,z) = \arg\min_{x \in \mathbb{R}^p} f(x,y,z), \qquad (5)$$
$$Y(x,z) = \arg\min_{y \in \mathbb{R}^q} f(x,y,z), \qquad (6)$$
$$Z(x,y) = \arg\min_{z \in \mathbb{R}^r} f(x,y,z). \qquad (7)$$

The arguments used in this section are translation invariant, and we therefore simplify notation by assuming the local minimizer of interest is $(0,0,0) \in \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r$. We first show that under reasonable assumptions, $X$, $Y$, and $Z$ are continuously differentiable at $(0,0) \in \mathbb{R}^q \times \mathbb{R}^r$, $\mathbb{R}^p \times \mathbb{R}^r$, and $\mathbb{R}^p \times \mathbb{R}^q$, respectively. (In what follows, we sometimes leave it to the reader to infer the applicable dimensions of points such as $(0,0)$ rather than mention them explicitly.)

Lemma 2.1  Let $f: \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}$ satisfy the conditions:
(i) $f$ is $C^2$ in a neighborhood of $(0,0,0)$;
(ii) $\nabla^2 f(0,0,0)$ is positive definite; and
(iii) $(0,0,0)$ is a local minimizer of $f$.
Then in some neighborhood of $(0,0) \in \mathbb{R}^q \times \mathbb{R}^r$, the function $X(y,z)$ in (5) is continuously differentiable. Similar results hold for $Y(x,z)$ and $Z(x,y)$.

Proof. Partition $\nabla^2 f(x,y,z)$ as

$$\nabla^2 f(x,y,z) = \begin{bmatrix} f_{xx}(x,y,z) & f_{xy}(x,y,z) & f_{xz}(x,y,z) \\ f_{yx}(x,y,z) & f_{yy}(x,y,z) & f_{yz}(x,y,z) \\ f_{zx}(x,y,z) & f_{zy}(x,y,z) & f_{zz}(x,y,z) \end{bmatrix}.$$

By (i) and (ii), $f_{xx}(x,y,z)$ is positive definite (hence nonsingular) in a neighborhood of $(0,0,0)$. The implicit function theorem then guarantees a continuously differentiable function $X: \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}^p$, defined in a neighborhood of $(0,0) \in \mathbb{R}^q \times \mathbb{R}^r$, satisfying $f_x(X(y,z), y, z) = 0$. This says that $x = X(y,z)$ is a critical point of $f(\cdot, y, z)$, which together with (iii) gives $X(0,0) = 0$. Since $(X(y,z), y, z)$ is near $(0,0,0)$ for $(y,z)$ near $(0,0)$, it follows from (i) and (ii) that $f_{xx}(X(y,z), y, z)$ is positive definite for $(y,z)$ near $(0,0)$, and this implies that $X(y,z)$ is a minimizer of $f(\cdot, y, z)$. Similar arguments give the results for $Y(x,z)$ and $Z(x,y)$. ∎

For notational convenience in the following, we define $A = \nabla^2 f(0,0,0)$ and

$$\begin{bmatrix} A_{xx} & A_{xy} & A_{xz} \\ A_{yx} & A_{yy} & A_{yz} \\ A_{zx} & A_{zy} & A_{zz} \end{bmatrix} = \begin{bmatrix} f_{xx}(0,0,0) & f_{xy}(0,0,0) & f_{xz}(0,0,0) \\ f_{yx}(0,0,0) & f_{yy}(0,0,0) & f_{yz}(0,0,0) \\ f_{zx}(0,0,0) & f_{zy}(0,0,0) & f_{zz}(0,0,0) \end{bmatrix}. \qquad (8)$$

Define the mapping $S: \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r$ corresponding to one iteration through steps TLAO-2, 3, and 4 as

$$S(x,y,z) = (\,S_1(x,y,z),\ S_2(x,y,z),\ S_3(x,y,z)\,) \qquad (9a)$$
$$= (\,X(y,z),\ Y(X(y,z), z),\ Z(X(y,z), Y(X(y,z), z))\,). \qquad (9b)$$

The results of Lemma 2.1 imply that $S$ is continuously differentiable in a neighborhood of $(0,0,0)$, with $S(0,0,0) = (0,0,0)$.
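As a concrete illustration (our addition, not part of the original argument), suppose $f$ is the quadratic $f(x,y,z) = \frac{1}{2} w^T A w$ with $w = (x,y,z)$ and $A$ as in (8) positive definite. Then the maps (5)-(7) are available in closed form and no implicit function theorem is needed:

```latex
% Quadratic illustration of Lemma 2.1: the stationarity condition
% f_x(X(y,z),y,z) = 0 reads  A_{xx} X(y,z) + A_{xy} y + A_{xz} z = 0,
% so the partial minimizers (5)-(7) are globally defined and linear:
X(y,z) = -A_{xx}^{-1}\left(A_{xy}\,y + A_{xz}\,z\right), \qquad
Y(x,z) = -A_{yy}^{-1}\left(A_{yx}\,x + A_{yz}\,z\right), \qquad
Z(x,y) = -A_{zz}^{-1}\left(A_{zx}\,x + A_{zy}\,y\right).
```

Differentiating these linear maps previews the partial derivative formulas (10a)-(10f) obtained in the proof of Lemma 2.2 below; in the general smooth case the implicit function theorem yields the same conclusions locally.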
As will be seen later in the proof of Theorem 2.1, the fundamental property needed to establish convergence of a TLAO sequence is $\rho(S'(0,0,0)) < 1$, where $\rho(\cdot)$ denotes the spectral radius; this is proved in the following lemma.

Lemma 2.2  Let $f: \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}$ satisfy the conditions of Lemma 2.1 and let $S: \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r \to \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^r$ be defined by (9). Then $\rho(S'(0,0,0)) < 1$.

Proof. Partition $S'(0,0,0)$ as

$$S'(0,0,0) = \begin{bmatrix} S_{1x}(0,0,0) & S_{1y}(0,0,0) & S_{1z}(0,0,0) \\ S_{2x}(0,0,0) & S_{2y}(0,0,0) & S_{2z}(0,0,0) \\ S_{3x}(0,0,0) & S_{3y}(0,0,0) & S_{3z}(0,0,0) \end{bmatrix}.$$

In calculating $S'(0,0,0)$ we will need the partials $X_y(0,0)$, $X_z(0,0)$, $Y_x(0,0)$, $Y_z(0,0)$, $Z_x(0,0)$, and $Z_y(0,0)$, which are obtained first; the argument $(0,0)$ is suppressed in the following. To obtain $X_y$, differentiate $f_x(X(y,z),y,z) = 0$ with respect to $y$ and evaluate at $(0,0)$ to get $f_{xx}(0,0,0)\, X_y + f_{xy}(0,0,0) = 0$, which yields

$$X_y = -A_{xx}^{-1} A_{xy}. \qquad (10a)$$

Differentiating $f_x(X(y,z),y,z) = 0$ with respect to $z$ and evaluating at $(0,0)$ gives

$$X_z = -A_{xx}^{-1} A_{xz}. \qquad (10b)$$

The remaining partials are calculated by differentiating $f_y(x, Y(x,z), z) = 0$ with respect to $x$ and $z$, and $f_z(x, y, Z(x,y)) = 0$ with respect to $x$ and $y$. They are:

$$Y_x = -A_{yy}^{-1} A_{yx}, \qquad (10c)$$
$$Y_z = -A_{yy}^{-1} A_{yz}, \qquad (10d)$$
$$Z_x = -A_{zz}^{-1} A_{zx}, \qquad (10e)$$
$$Z_y = -A_{zz}^{-1} A_{zy}. \qquad (10f)$$

Now the blocks of $S'(0,0,0)$ are calculated from (9b) and (10) by the chain rule:

$$S_{1x}(0,0,0) = 0 \in \mathbb{R}^{p \times p}; \qquad (11a)$$
$$S_{2x}(0,0,0) = 0 \in \mathbb{R}^{q \times p}; \qquad (11b)$$
$$S_{3x}(0,0,0) = 0 \in \mathbb{R}^{r \times p}; \qquad (11c)$$
$$S_{1y}(0,0,0) = X_y = -A_{xx}^{-1} A_{xy}; \qquad (11d)$$
$$S_{2y}(0,0,0) = Y_x X_y = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11e)$$
$$S_{3y}(0,0,0) = Z_x X_y + Z_y Y_x X_y = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xy} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xy}; \qquad (11f)$$
$$S_{1z}(0,0,0) = X_z = -A_{xx}^{-1} A_{xz}; \qquad (11g)$$
$$S_{2z}(0,0,0) = Y_x X_z + Y_z = A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} - A_{yy}^{-1} A_{yz}; \qquad (11h)$$
$$S_{3z}(0,0,0) = Z_x X_z + Z_y Y_x X_z + Z_y Y_z = A_{zz}^{-1} A_{zx} A_{xx}^{-1} A_{xz} - A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yx} A_{xx}^{-1} A_{xz} + A_{zz}^{-1} A_{zy} A_{yy}^{-1} A_{yz}. \qquad (11i)$$

We can now establish $\rho(S'(0,0,0)) < 1$ by recognizing an important relationship between $S'(0,0,0)$ and $A$. Define the matrices $B$, $C$, and $D$ as

$$B = \begin{bmatrix} A_{xx} & 0 & 0 \\ A_{yx} & A_{yy} & 0 \\ A_{zx} & A_{zy} & A_{zz} \end{bmatrix}, \quad C = \begin{bmatrix} 0 & -A_{xy} & -A_{xz} \\ 0 & 0 & -A_{yz} \\ 0 & 0 & 0 \end{bmatrix}, \quad D = \begin{bmatrix} A_{xx} & 0 & 0 \\ 0 & A_{yy} & 0 \\ 0 & 0 & A_{zz} \end{bmatrix}. \qquad (12)$$

Note that $A = B - C$. Since $A$ is positive definite, it follows that $D$ is positive definite and $B$ is nonsingular. A straightforward but tedious calculation using (11) shows that

$$S'(0,0,0) = B^{-1} C. \qquad (13)$$

By Theorem 7.1.9 in Ortega (1972) and the assumption that $A$ is symmetric and positive definite, we have $\rho(B^{-1}C) < 1$ if $A = B - C$ is a P-regular splitting. By definition, $B - C$ is a P-regular splitting if $B$ is nonsingular and $B + C$ is positive definite. By the earlier comments, it only remains to show that $B + C$ is positive definite. Using the symmetry of $A$, $B + B^T = A + D$ and $C + C^T = D - A$, so the symmetric part of $B + C$ is

$$\tfrac{1}{2}\left[(B + C) + (B + C)^T\right] = \tfrac{1}{2}\left[(B + B^T) + (C + C^T)\right] = \tfrac{1}{2}\left[(A + D) + (D - A)\right] = D, \qquad (14)$$

which is positive definite. ∎
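Lemma 2.2 is easy to check numerically. The sketch below is our illustration (assuming NumPy; the block sizes and the random positive definite $A$ are ours): it forms the splitting (12), confirms that the spectral radius of $B^{-1}C$ is below 1, and verifies that one TLAO sweep on the quadratic $\frac{1}{2} w^T A w$ acts as the linear map $w \mapsto B^{-1}C\,w$, in agreement with (13).

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 2, 3, 2                      # illustrative block sizes (our choice)
s = p + q + r
blk = [slice(0, p), slice(p, p + q), slice(p + q, s)]

G = rng.standard_normal((s, s))
A = G @ G.T + s * np.eye(s)            # random symmetric positive definite "Hessian", Eq. (8)

# Splitting A = B - C of Eq. (12): B keeps the block lower triangle of A.
B = np.zeros_like(A)
for i in range(3):
    for j in range(i + 1):
        B[blk[i], blk[j]] = A[blk[i], blk[j]]
C = B - A                              # so that A = B - C

M = np.linalg.solve(B, C)              # S'(0,0,0) = B^{-1} C, Eq. (13)
print(np.abs(np.linalg.eigvals(M)).max())   # spectral radius: < 1, as Lemma 2.2 asserts

# One TLAO sweep on f(w) = 0.5 w^T A w coincides with w -> M w.
w = rng.standard_normal(s)
w_new = w.copy()
for i in range(3):                     # TLAO-2,3,4: block solves at newest values
    rhs = -sum(A[blk[i], blk[j]] @ w_new[blk[j]] for j in range(3) if j != i)
    w_new[blk[i]] = np.linalg.solve(A[blk[i], blk[i]], rhs)
print(np.allclose(w_new, M @ w))       # True
```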
We now give the main result for local convergence, which is essentially an adaptation of Ostrowski's theorem (Theorem 8.1.7 in Ortega, 1972).

Theorem 2.1  Let $w^*$ be a local minimizer of $f: \mathbb{R}^s \to \mathbb{R}$ for which $\nabla^2 f(w^*)$ is positive definite, and let $f$ be $C^2$ on a neighborhood of $w^*$. Then there is a neighborhood $U$ of $w^*$ such that for any $w^{(0)} \in U$, the corresponding TLAO iteration sequence $\{w^{(k)}\}$, defined using $S$ in (9) as $w^{(k+1)} = S(w^{(k)})$, converges q-linearly to $w^*$.

Proof. As discussed earlier, we can assume that $w^* = 0$. It is necessary to show that

$$\lim_{k \to \infty} w^{(k+1)} = \lim_{k \to \infty} S^{(k+1)}(w^{(0)}) = 0 \qquad (15)$$

for all choices of $w^{(0)}$ sufficiently close to $w^*$. Apply Lemma 2.2 to obtain

$$\sigma = \rho(S'(0,0,0)) < 1. \qquad (16)$$

Pick $\varepsilon > 0$ such that $\sigma + 2\varepsilon < 1$. By Theorem 3.8 in Stewart (1973), there exists a norm $\|\cdot\|$ on $\mathbb{R}^s$ such that for all $w \in \mathbb{R}^s$,

$$\|S'(0,0,0)\, w\| \le (\sigma + \varepsilon)\, \|w\|. \qquad (17)$$

Since $S'$ is continuous near $w^* = (x^*,y^*,z^*) = (0,0,0)$, there is a number $r > 0$ such that

$$\|S'(w_1)\, w_2\| \le (\sigma + 2\varepsilon)\, \|w_2\| \qquad (18)$$

for all $w_1, w_2 \in B_r = \{w \in \mathbb{R}^s \mid \|w\| \le r\}$. From (18) and the fact that $S(0) = 0$, we have for $w \in B_r$:

$$\|S(w)\| = \left\| \int_0^1 S'(tw)\, w \, dt \right\| \le \int_0^1 \|S'(tw)\, w\| \, dt \le (\sigma + 2\varepsilon)\, \|w\|. \qquad (19)$$

The result (19) establishes that for initializations of TLAO near $w^*$, the error is reduced by the factor $(\sigma + 2\varepsilon) < 1$ at each iteration, which gives the local q-linear convergence of $\{w^{(k)}\}$ to $w^*$. ∎

3. AN EXAMPLE OF TLAO

An important problem from the area of pattern recognition is that of partitioning a set of data $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^t$ into natural data groupings (or clusters). A popular and effective method for partitioning data into $c$ fuzzy clusters $\{C_1, \ldots, C_c\}$ is based on minimization of the fuzzy c-means functional (Bezdek, 1981):

$$J_m(U,V) = \sum_{i=1}^{c} \sum_{k=1}^{n} U_{ik}^m \, \|x_k - V_i\|^2, \qquad (20)$$

where $m > 1$ is the fuzzification parameter; $U = [U_{ik}]$, where $U_{ik}$ is the degree to which $x_k$ belongs to $C_i$; $V = [V_1, \ldots, V_c]$, where $V_i$ is the center of $C_i$; and $\|\cdot\|$ is an inner product norm on $\mathbb{R}^t$. The optimization of (20) is attempted over all $V \in \mathbb{R}^{t \times c}$ and $U \in M_{fcn}$, where

$$M_{fcn} = \left\{ U \in \mathbb{R}^{c \times n} \,\middle|\, U_{ik} \in [0,1];\ \sum_{i=1}^{c} U_{ik} = 1 \ \forall k;\ \sum_{k=1}^{n} U_{ik} > 0 \ \forall i \right\}. \qquad (21)$$

The most popular method for optimizing (20) is a bi-level alternating optimization over the $U$ and $V$ variables known as the fuzzy c-means algorithm (Bezdek, 1981). It calculates a new $V$ from the most recent $U$ via

$$V_{ji} = \frac{\sum_{k=1}^{n} U_{ik}^m \, x_{jk}}{\sum_{k=1}^{n} U_{ik}^m}, \quad \forall\, j, i, \qquad (22)$$

and a new $U$ from the most recent $V$ by

$$U_{ik} = \frac{\|x_k - V_i\|^{-2/(m-1)}}{\sum_{j=1}^{c} \|x_k - V_j\|^{-2/(m-1)}}, \quad \forall\, i, k, \qquad (23)$$

where $V_{ji}$ and $x_{jk}$ are the $j$th components of $V_i$ and $x_k$, respectively.
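For concreteness, here is a minimal NumPy sketch of one pass of updates (22) and (23), assuming the Euclidean norm; the function name, array layout, and the small guard against zero distances are our choices, not part of the published algorithm.

```python
import numpy as np

def fcm_iteration(X, U, m=2.0):
    """One bi-level alternating pass of fuzzy c-means: Eq. (22), then Eq. (23).

    X : (n, t) data matrix (rows are the data x_k)
    U : (c, n) fuzzy membership matrix
    Returns the updated centers V (c, t) and memberships U.
    """
    W = U ** m                                     # U_ik^m, shape (c, n)
    V = (W @ X) / W.sum(axis=1, keepdims=True)     # Eq. (22): weighted means
    # Squared Euclidean distances ||x_k - V_i||^2, shape (c, n).
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                     # guard: x_k may coincide with V_i
    inv = d2 ** (-1.0 / (m - 1.0))                 # ||x_k - V_i||^{-2/(m-1)}
    U_new = inv / inv.sum(axis=0, keepdims=True)   # Eq. (23)
    return V, U_new
```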
Real-world data sets sometimes contain partially missing data (Jain and Dubes, 1988; Dixon, 1979), which can arise from human or sensor error, subsequent data corruption, and so on. For example, the component $x_{3k}$ may be missing, so that $x_k$ has the form $(x_{1k}, x_{2k}, ?, x_{4k}, \ldots, x_{tk})^T$. Unfortunately, the iteration described by (22) and (23) requires access to complete data. If the number of data with missing components is small, then one option is to delete that portion of the data set and base the clustering entirely on those data containing no missing components. This approach is problematic if the proportion of data with missing components is significant, and in all cases it fails to provide a classification of the data deleted from the cluster analysis.

Recently a missing-data version of the fuzzy c-means algorithm (Hathaway and Bezdek, 1999) has been devised for the incomplete-data case, and we give the algorithm here as a useful example of TLAO. In the following we use $\hat{x}$ to represent missing data components; e.g., $x_k = (x_{1k}, x_{2k}, \hat{x}_{3k}, x_{4k}, \ldots, x_{tk})^T$. The collection of all missing data components will be denoted by $\hat{X}$. We adapt fuzzy c-means to missing data using a principle of optimality similar to that used in maximum likelihood methods from statistics. In this case, we assume that the missing data are highly consistent with the data set having strong cluster structure. We implement this optimistic completion principle by dynamically estimating the missing data components to be those numbers that maximize the cluster structure, as measured by minimal values of the criterion in (20).

The incomplete-data version of fuzzy c-means from Hathaway and Bezdek (1999) minimizes (20) over $U$, $V$, and $\hat{X}$. The minimization is done using TLAO based on the following updating. The current values of $V$ and $U$ are used to estimate the missing data components $\hat{X}$ by

$$\hat{x}_{jk} = \frac{\sum_{i=1}^{c} U_{ik}^m \, V_{ji}}{\sum_{i=1}^{c} U_{ik}^m}, \quad \forall\, \hat{x}_{jk} \in \hat{X}. \qquad (24)$$

The current missing-value estimates $\hat{X}$ are then used to complete $X$ so that (22) can be used to calculate the new $V$. The third inner step of one complete TLAO iteration is then done using the completed $X$ and the new $V$ in (23) to calculate the new $U$. Preliminary testing of this missing-data approach for clustering has demonstrated it to be highly effective, and the convergence theory of the last section guarantees that the procedure is locally convergent at a q-linear rate.
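The three inner steps translate directly into code. The sketch below is our NumPy illustration of one complete TLAO pass, again assuming the Euclidean norm; the function name, the boolean mask convention for marking missing entries, and the guard against zero distances are ours, not from Hathaway and Bezdek (1999).

```python
import numpy as np

def tlao_fcm_step(X, mask, U, V, m=2.0):
    """One TLAO pass of incomplete-data fuzzy c-means: Eq. (24), then (22), then (23).

    X    : (n, t) data matrix; missing entries may hold any placeholder value
    mask : (n, t) boolean array, True where the entry of X is missing
    U    : (c, n) current memberships;  V : (c, t) current centers (as rows)
    """
    W = U ** m                                     # U_ik^m, shape (c, n)
    # Eq. (24): optimistic completion of missing components from current U and V.
    X_hat = (W.T @ V) / W.sum(axis=0)[:, None]     # (n, t) candidate completions
    X = np.where(mask, X_hat, X)                   # overwrite only the missing entries
    # Eq. (22): new centers from the completed data.
    V = (W @ X) / W.sum(axis=1, keepdims=True)
    # Eq. (23): new memberships from the completed data and the new centers.
    d2 = np.maximum(((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2), 1e-12)
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)
    return X, U, V
```

Iterating this step until $U$, $V$, and the completed entries of $X$ stop changing is exactly the TLAO scheme of Section 1 applied to (20), with the roles of $x$, $y$, and $z$ played by $\hat{X}$, $V$, and $U$.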
4. DISCUSSION

Tri-level alternating optimization attempts to optimize a function $f(x,y,z)$ using an iteration that alternates optimizations over each of the three (vector) variables while holding the other two fixed. The method is locally, q-linearly convergent to any minimizer at which the second derivative is positive definite. The global analysis of this method fits into the general convergence theory of Zangwill (1969), which guarantees that any limit point of an iteration sequence must satisfy the first-order necessary conditions for minimizing $f$. The authors plan to unify the convergence theory, extend it to the case of m-level alternating optimization, and survey many important instances of alternating-optimization schemes in pattern recognition and statistics.

A recent example of TLAO for an important problem in pattern recognition was given. The TLAO scheme applied to the fuzzy c-means functional allows clustering of data sets in which some of the data vectors have missing components. Preliminary numerical tests of the approach have shown it to produce good results even when there is a substantial proportion of incomplete data.

Alternating optimization is often not the best approach to optimization, but it is certainly worth consideration when there is a natural partitioning of the variables such that each of the partial minimizations is simple to do. While the convergence rate of the approach is only q-linear, if the partial minimizations are simple then the method can still be competitive with, or superior to, optimization approaches with faster (e.g., q-superlinear) rates of convergence (Hu and Hathaway, 1999). One of the most interesting mathematical questions concerning this approach is how to systematically and efficiently determine the "best" partitioning of the variables so that the value of $\sigma$ in (16) is as small as possible. We expect the value of $\sigma$ to be small when we have partitioned into variables that are largely independent. For example, minimization of $f(x,y,z) = x^2 + y^2 + z^2$ can be done in one iteration ($\sigma = 0$) because there is complete independence among the three variables. Other computationally oriented questions concern how best to formulate a relaxation scheme, and how an alternating optimization approach can best be hybridized with a q-superlinearly (or faster) convergent local method.

REFERENCES

Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press.

Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., & Windham, M.P. (1987). Local convergence analysis of a grouped variable version of coordinate descent. Journal of Optimization Theory and Applications, 54, 471-477.

Dixon, J.K. (1979). Pattern recognition with partly missing data. IEEE Transactions on Systems, Man, and Cybernetics, 9, 617-621.

Hathaway, R.J., & Bezdek, J.C. (1999). Pattern recognition with incomplete data using the optimistic completion principle, preprint.

Hu, Y., & Hathaway, R.J. (1999). On efficiency of optimization in fuzzy c-means, preprint.

Jain, A.K., & Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall.

Ortega, J.M. (1972). Numerical Analysis: A Second Course. New York: Academic Press.

Stewart, G.W. (1973). Introduction to Matrix Computations. New York: Academic Press.

Zangwill, W. (1969). Nonlinear Programming: A Unified Approach. Englewood Cliffs, NJ: Prentice-Hall.