Duality for Entropy Optimization and Its Applications
Xingsi Li
Shaohua Pan
Department of Engineering Mechanics, Dalian University of Technology
Dalian 116024, P.R.China
Abstract
In this paper we present the dual formulations of two entropy optimization principles, Jaynes' maximum entropy principle and Kullback-Leibler's minimum cross-entropy principle, together with some applications in developing efficient algorithms for various optimization problems, including minimax, complementarity and nonlinear programming. Our presentation consists of three parts: dual formulations of entropy optimization, a smoothing technique for the min-max problem with applications to optimization problems, and Lagrangian perturbations.
This work is supported by the Special Fund for Basic Research (G1999032805).
1. Dual Formulations of Entropy Optimization
Entropy optimization principles are developed to establish some inference criteria for
predicting probabilities based on incomplete information. The maximum entropy principle
claims: "in making inference on the basis of partial information we must use that probability
distribution which has maximum entropy subject to whatever is known. This is the only
unbiased assignment we can make." Mathematically, it is stated as the following optimization
problem (E1):
\max_{p} \; S(p) := -\sum_{i=1}^{n} p_i \ln p_i
\text{s.t.} \quad \sum_{i=1}^{n} p_i f_{ji} = E[f_j], \quad j = 1, 2, \ldots, m,
\qquad \sum_{i=1}^{n} p_i = 1,
\qquad p_i \ge 0, \quad i = 1, 2, \ldots, n                                   (1)
where the vector p stands for the probability distribution to be assigned, E[f_j] denotes the j-th moment known from some probabilistic experiments, and S(p) is the Shannon entropy measure. It can be easily verified that problem (E1) is a convex program and has an unconstrained dual program in the form (DE1):

\min_{\lambda} \; D(\lambda) := \ln\left[\sum_{i=1}^{n} \exp\left(-\sum_{j=1}^{m} \lambda_j f_{ji}\right)\right] + \sum_{j=1}^{m} \lambda_j E[f_j]                                   (2)

where λ is the vector of Lagrange multipliers.
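As an illustration of how the unconstrained dual (2) can be exploited numerically, the following Python sketch (ours, not part of the original presentation) solves a small instance of (E1) through (DE1) with scipy.optimize.minimize; the data F and Ef are invented for the example, and the maximum-entropy distribution p is recovered from the dual solution λ.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical data: n = 5 states, m = 2 moment constraints.
    F = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],      # f_{1i}
                  [1.0, 4.0, 9.0, 16.0, 25.0]])   # f_{2i}
    Ef = np.array([3.2, 12.0])                     # prescribed moments E[f_j]

    def dual(lam):
        # D(lambda) = ln sum_i exp(-sum_j lambda_j f_{ji}) + sum_j lambda_j E[f_j]
        a = -F.T @ lam                             # one exponent per state i
        amax = a.max()                             # shift for numerical stability
        return amax + np.log(np.exp(a - amax).sum()) + lam @ Ef

    res = minimize(dual, x0=np.zeros(2), method="BFGS")
    lam = res.x

    # Recover the maximum-entropy distribution from the dual solution.
    a = -F.T @ lam
    p = np.exp(a - a.max())
    p /= p.sum()
    print("lambda =", lam)
    print("p =", p, " moments =", F @ p)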
If one has a prior probability q = (q_1, ..., q_n), in addition to the moment constraints in (E1), the probability p should be assigned based on the minimum cross-entropy principle. Mathematically, it leads to the following entropy optimization problem (E2):
\min_{p} \; D(p, q) := \sum_{i=1}^{n} p_i \ln(p_i / q_i)
\text{s.t.} \quad \sum_{i=1}^{n} p_i f_{ji} = E[f_j], \quad j = 1, 2, \ldots, m,
\qquad \sum_{i=1}^{n} p_i = 1,
\qquad p_i \ge 0, \quad i = 1, 2, \ldots, n                                   (3)
where D(p, q) stands for the Kullback-Leibler cross-entropy, or relative entropy. Problem (E2) is also convex in p and has an unconstrained dual program (DE2):
\max_{\lambda} \; D_q(\lambda) := -\ln\left[\sum_{i=1}^{n} q_i \exp\left(-\sum_{j=1}^{m} \lambda_j f_{ji}\right)\right] - \sum_{j=1}^{m} \lambda_j E[f_j]                                   (4)
where the prior probability q is treated as a parameter vector only.
Suppose that there is no information at all (no moment constraints); then problem (E1) produces p_i = 1/n and (E2) gives p = q. This means that the maximum entropy principle chooses the probability p as close as possible to the uniform distribution, while the minimum cross-entropy principle chooses the probability p as close as possible to the prior probability q, subject to the given information.
The unconstrained nature of the dual programs not only makes it possible to solve the entropy optimization problems by unconstrained optimization algorithms, but also lends itself to various applications. In developing our optimization algorithms, we utilize this feature and artificially construct some entropy optimization problems.
2. Smoothing Technique for Min-Max Problem
The finite min-max problem is usually expressed as (MMP):
\min_{x} \; \phi(x) := \max\{g_1(x), g_2(x), \ldots, g_m(x)\}                                   (5)
This is a typical non-smooth optimization problem due to the non-differentiability of the objective (max) function φ(x). Many algorithms have been devised to solve this problem because of the special role it plays in various numerical analysis and optimization problems. They either transform the original problem (MMP) into an equivalent nonlinear program or seek a smooth approximation to the non-differentiable φ(x). Our methodology belongs to the latter, and the smooth functions are derived based on a continuous estimation of the Lagrange multipliers. For problem (MMP), the Lagrangian function has the following form:
L(x, \lambda) := \sum_{i=1}^{m} \lambda_i g_i(x)                                   (6)

where \lambda \in \Lambda := \{ \lambda \in R^m \mid \sum_{i=1}^{m} \lambda_i = 1, \ \lambda_i \ge 0, \ i = 1, 2, \ldots, m \}. Based on our interpretation that each Lagrange multiplier represents the probability of the corresponding component function attaining the maximum φ(x), we introduce Shannon's entropy and the Kullback-Leibler cross-entropy into the Lagrangian L(x, λ), respectively, and construct the following entropy optimization problems (PE1) and (PE2):
\max_{\lambda \in \Lambda} \; L_p(x, \lambda) := \sum_{i=1}^{m} \lambda_i g_i(x) - p^{-1} \sum_{i=1}^{m} \lambda_i \ln \lambda_i                                   (7)

and

\max_{\lambda \in \Lambda} \; L_p(x, \lambda, \mu) := \sum_{i=1}^{m} \lambda_i g_i(x) - p^{-1} \sum_{i=1}^{m} \lambda_i \ln(\lambda_i / \mu_i)                                   (8)
where μ ∈ int Λ denotes the Lagrange multiplier vector obtained from the last iteration. It is easily shown that the above entropy optimization problems can be solved analytically, and the original problem (MMP) is thereby transformed into the following smooth unconstrained optimization problems:
\min_{x} \; \phi_p(x) := p^{-1} \ln \sum_{i=1}^{m} \exp\left(p\, g_i(x)\right)                                   (9)

\min_{x} \; \phi_p(x, \mu) := p^{-1} \ln \sum_{i=1}^{m} \mu_i \exp\left(p\, g_i(x)\right)                                   (10)
It can be proven that φ_p(x) and φ_p(x, μ) uniformly approximate the maximum function φ(x) from above and below, respectively; that is, φ_p(x, μ) ≤ φ(x) ≤ φ_p(x). Furthermore, for the smooth function φ_p(x) there is an error bound: 0 ≤ φ_p(x) − φ(x) ≤ ln(m)/p.
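To make the construction concrete, the following small Python sketch (ours, with arbitrarily chosen component functions g_i) evaluates the aggregate function φ_p(x) of (9) and checks the error bound ln(m)/p; the shift by max_i g_i(x) inside the routine is only for numerical stability and does not change the value.

    import numpy as np

    def phi_p(g, p):
        """Smooth approximation of max(g) from above: p^{-1} * ln(sum_i exp(p*g_i)).

        Computed stably by factoring out max(g) before exponentiating."""
        g = np.asarray(g, dtype=float)
        gmax = g.max()
        return gmax + np.log(np.exp(p * (g - gmax)).sum()) / p

    # Toy component functions g_i(x) for the min-max problem (5); purely illustrative.
    def g(x):
        return np.array([x**2 - 1.0, np.sin(x), 0.5 * x + 0.2])

    x, p = 0.7, 50.0
    vals = g(x)
    print("max g_i(x)          =", vals.max())
    print("phi_p(x)            =", phi_p(vals, p))
    print("error bound ln(m)/p =", np.log(len(vals)) / p)   # 0 <= phi_p - max <= ln(m)/p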
2-1. Nonlinear Programming (NLP):
\min \; f(x)
\text{s.t.} \quad g_i(x) \le 0, \quad i = 1, 2, \ldots, m                                   (11)
The inequality constraints present the main difficulty in the solution of (NLP). However, the original problem is equivalent to the following singly-constrained one:

\min \; f(x)
\text{s.t.} \quad \phi(x) := \max_{1 \le i \le m} g_i(x) \le 0                                   (12)
The non-smooth constraint can be replaced by the smooth function φ_p(x), and an ε-optimal solution of the (NLP) problem can be found by solving the following problem:

\min \; f(x)
\text{s.t.} \quad \phi_p(x) \le 0                                   (13)
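A minimal sketch of how (13) might be solved in practice, assuming a hypothetical two-variable test problem and scipy's SLSQP solver; since φ_p(x) ≥ φ(x), any point feasible for the smoothed constraint is also feasible for (NLP), so the computed point is feasible and near-optimal with accuracy controlled by p.

    import numpy as np
    from scipy.optimize import minimize

    p = 100.0   # smoothing parameter; larger p gives a tighter approximation

    def f(x):                          # hypothetical objective
        return (x[0] - 2.0)**2 + (x[1] - 1.0)**2

    def g(x):                          # hypothetical constraints g_i(x) <= 0
        return np.array([x[0]**2 + x[1]**2 - 2.0,     # stay inside a disk
                         -x[0] + x[1] - 1.0])         # a half-space

    def phi_p(x):                      # smoothed max constraint, stable log-sum-exp
        v = g(x)
        vmax = v.max()
        return vmax + np.log(np.exp(p * (v - vmax)).sum()) / p

    # A single smooth constraint phi_p(x) <= 0 replaces the m original constraints.
    cons = {"type": "ineq", "fun": lambda x: -phi_p(x)}
    res = minimize(f, x0=np.array([0.0, 0.0]), method="SLSQP", constraints=cons)
    print(res.x, f(res.x), g(res.x))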
Similarly, the smooth function φ_p(x) can be applied to the non-smooth L_1 and L_∞ exact penalty functions

\theta_1(x) = f(x) + \sigma \sum_{i=1}^{m} \max\{0, g_i(x)\}
\theta_\infty(x) = f(x) + \sigma \max\{0, g_1(x), \ldots, g_m(x)\}

to smooth the max-type functions.
2-2. Complementarity Problem:
Consider the following vertical complementarity problem (VNCP):
x \ge 0, \quad F_1(x) \ge 0, \ \ldots, \ F_m(x) \ge 0, \quad x_i \prod_{j=1}^{m} F_{ji}(x) = 0, \quad i = 1, \ldots, n                                   (14)
where F_j(x): R^n \to R^n, 1 \le j \le m, are vector-valued functions and F_{ji}(x) denotes the i-th component of F_j(x). The problem (VNCP) is equivalent to the following non-smooth equations:
\min\{x_i, F_{1i}(x), \ldots, F_{mi}(x)\} = -\max\{-x_i, -F_{1i}(x), \ldots, -F_{mi}(x)\} = 0, \quad i = 1, \ldots, n                                   (15)
Still, one can replace the above maximum operation by the smoothing approximation φ_p. In the special case m = 1, the problem (VNCP) reduces to the nonlinear complementarity problem (NCP)

x \ge 0, \quad F_1(x) \ge 0, \quad x^{T} F_1(x) = 0,

and Eq. (15) then reduces to

\min\{x_i, F_{1i}(x)\} = -\max\{-x_i, -F_{1i}(x)\} = 0, \quad i = 1, \ldots, n.
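As an illustration (ours, with a made-up linear F), the sketch below replaces min{x_i, F_{1i}(x)} = −max{−x_i, −F_{1i}(x)} by its log-sum-exp smoothing and hands the resulting smooth system to scipy.optimize.root.

    import numpy as np
    from scipy.optimize import root

    p = 100.0

    # Hypothetical NCP data: F(x) = M x + q with a positive definite M.
    M = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    q = np.array([-2.0, 1.0])

    def F(x):
        return M @ x + q

    def smooth_min(a, b):
        # min{a, b} = -max{-a, -b} ~ -p^{-1} * ln(exp(-p a) + exp(-p b)),
        # evaluated stably by factoring out the smaller argument.
        lo = np.minimum(a, b)
        return lo - np.log(np.exp(-p * (a - lo)) + np.exp(-p * (b - lo))) / p

    def H(x):
        # Smoothed residual of min{x_i, F_i(x)} = 0, i = 1,...,n.
        return smooth_min(x, F(x))

    sol = root(H, x0=np.ones(2))
    x = sol.x
    print("x =", x, " F(x) =", F(x), " x*F(x) =", x * F(x))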
2-3. Box Constrained Variational Inequality Problem (BVIP):
This problem is to find an x \in [l, u] such that

(y - x)^{T} F(x) \ge 0, \quad \forall\, y \in [l, u]                                   (16)
where [l, u] is a box constraint in R^n with l < u. It is easy to see that the problem (BVIP) is equivalent to the system of equations

x - \operatorname{mid}\{l, u, x - F(x)\} = \operatorname{mid}\{x - l, x - u, F(x)\} = 0                                   (17)

where the mid operator \operatorname{mid}\{a, b, c\} can be represented by

\operatorname{mid}\{a, b, c\} = a + b + c - \min\{a, b, c\} - \max\{a, b, c\}                                   (18)
Once again, the max and min operators can be replaced by the smoothing approximation φ_p in the proper forms.
2-4. Global Optimization
The smooth approximation φ_p can be generalized to the infinite (continuous) case, i.e.,

\sup_{x \in X} f(x) \approx p^{-1} \ln \int_{X} \exp[p f(x)]\, dx                                   (19)
which provides a framework for devising global optimization algorithms. In particular, we could apply (19) to the above variational inequality problem (BVIP) and obtain a regularized gap function as follows. Auslender defined a gap function for (BVIP) as

g(x) := \sup_{y \in X} F^{T}(x)(x - y)                                   (20)
Due to the non-smoothness of g(x), Fukushima defined a regularized gap function of the form

g_{\alpha}(x) := \sup_{y \in X} \left\{ F^{T}(x)(x - y) - \frac{1}{2\alpha} \|x - y\|^2 \right\}                                   (21)
By applying Eq. (19) directly to (20), we obtain a new regularized gap function:

g_p(x) := p^{-1} \ln \int_{y \in X} \exp\left[ p\, F^{T}(x)(x - y) \right] dy                                   (22)

For y \in [l, u], the above integral can be easily calculated.
3. Lagrangian Perturbations
The Lagrangian function has played an important role in both the theoretical and the algorithmic development of optimization. For the NLP problem (11), the Lagrangian function takes the form

L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)                                   (23)
The weak duality theorem can be stated as

\min_{x} \max_{\lambda \ge 0} L(x, \lambda) \ \ge\ \max_{\lambda \ge 0} \min_{x} L(x, \lambda)                                   (24)
which gives two possibilities for solving the original problem (11). Usually, one starts from the right-hand side of the above inequality; that is, the minimization of L(x, λ) in the x-space is performed for a given λ and repeated with an updated λ until convergence. This kind of so-called dual algorithm is effective only for some structured problems. We make our contribution from the left-hand side of (24); that is,

\min_{x} \max_{\lambda \ge 0} \; L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)                                   (25)
It is well known that the maximization of L(x, λ) in the λ-space for a given x is difficult due to the linearity of L(x, λ) in λ. The Lagrangian perturbation is a special regularization technique through which the Lagrange multipliers can be estimated in terms of the primal variables. In this paper we employ Shannon's entropy and the Kullback-Leibler cross-entropy as our perturbing functions, respectively; that is, we solve
\max_{\lambda \ge 0} \; L_p(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) - p^{-1} \sum_{i=1}^{m} \lambda_i \ln \lambda_i                                   (26)

and

\max_{\lambda \ge 0} \; L_p(x, \lambda, \mu) := f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) - p^{-1} \sum_{i=1}^{m} \lambda_i \ln(\lambda_i / \mu_i)                                   (27)
where p > 0 is a controlling parameter and μ > 0 denotes the last estimate of λ. The entropy functions are chosen as perturbations because they are convex and bounded below for λ ≥ 0, and at the same time the regularized maximization problems (26) and (27) can be solved analytically. On substituting the maximizers of the two problems to eliminate λ from the perturbed Lagrangians, we obtain
L_p(x, \lambda(x)) = f(x) + p^{-1} \sum_{i=1}^{m} \exp[p\, g_i(x) - 1]                                   (28)

L_p(x, \lambda(x), \mu) = f(x) + p^{-1} \sum_{i=1}^{m} \mu_i \exp[p\, g_i(x) - 1]                                   (29)
It should be recognized that (28) and (29) are exponential penalty functions without and with Lagrange multipliers, respectively. By using entropy perturbations, we thus reveal a link between traditional optimization methods and entropy regularization techniques.
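To show how the perturbed Lagrangian (29) might be used algorithmically, here is a minimal Python sketch (our own construction, not taken from the paper) of a multiplier-type iteration: for the current estimate μ the function (29) is minimized in x, and μ is then updated with the analytic maximizer μ_i exp[p g_i(x) − 1] of (27). The test problem and parameter values are invented, and for a fixed p the iteration only yields an approximate solution whose accuracy improves as p grows.

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                                  # hypothetical objective
        return x[0]**2 + x[1]**2

    def g(x):                                  # hypothetical constraints g_i(x) <= 0
        return np.array([1.0 - x[0] - x[1],    # enforces x0 + x1 >= 1
                         x[0] - 2.0])          # enforces x0 <= 2

    p = 10.0                                   # controlling parameter
    mu = np.ones(2)                            # initial multiplier estimates, mu > 0
    x = np.zeros(2)

    for k in range(10):
        # Minimize the perturbed Lagrangian (29) in x for the current mu.
        Lp = lambda z: f(z) + np.sum(mu * np.exp(p * g(z) - 1.0)) / p
        x = minimize(Lp, x, method="BFGS").x
        # Analytic maximizer of (27) supplies the next multiplier estimate.
        mu = mu * np.exp(p * g(x) - 1.0)

    print("x =", x, " mu =", mu, " g(x) =", g(x))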
As a matter of fact, we can replace the entropy functions with a general convex function ϕ(λ) or ϕ(λ, μ) as the perturbing function to derive other penalty functions. From the above derivations, one should note that the estimation of the Lagrange multipliers has been embedded into the derived penalty functions.
All of these discussions reflect the important role of the duality of entropy optimization in the field of mathematical programming. Of course, since entropy optimization problems themselves originate in many different fields, the potential of this duality should not be limited to the applications presented here.
References
1. E. T. Jaynes (1957): "Information Theory and Statistical Mechanics", Physical Review, 106, 620-630.
2. S. Kullback and R. A. Leibler (1951): "On Information and Sufficiency", Annals of Mathematical Statistics, 22, 79-86.
3. A. B. Templeman and Li Xingsi (1985): "Entropy Duals", J. Engineering Optimization, 9, 107-119.
4. Li Xingsi (1991): "An Aggregate Function Method for Non-linear Programming", Science in China (Series A), 34, 1467-1473.
5. Li Xingsi (1992): "An Entropy-based Aggregate Method for Minimax Optimization", J. Engineering Optimization, 18, 277-285.
6. Li Xingsi (1994): "An Efficient Approach to a Class of Non-smooth Optimization Problems", Science in China (Series A), 37, 323-330.
7. Li Xingsi and Fang Shu-Cherng (1997): "On the Entropic Regularization Method for Solving Min-Max Problems with Applications", Mathematical Methods of Operations Research, 46, 119-130.