Automation and Remote Control, Vol. 63, No. 2, 2002, pp. 209–219. Translated from Avtomatika i Telemekhanika, No. 2, 2002, pp. 44–55. Original Russian Text Copyright © 2002 by Granichin.

STOCHASTIC SYSTEMS

Randomized Algorithms for Stochastic Approximation under Arbitrary Disturbances

O. N. Granichin

St. Petersburg State University, St. Petersburg, Russia

Received March 20, 2001

Abstract—New algorithms for stochastic approximation under input disturbance are designed. For the multidimensional case, they are simple in form, generate consistent estimates of the unknown parameters under "almost arbitrary" disturbances, and are easily "incorporated" into the design of quantum devices for estimating the gradient vector of a function of several variables.

1. INTRODUCTION

In the design of optimization and estimation algorithms, the noises and errors in measurements and in the properties of models are usually assumed to possess useful statistical characteristics, which are then exploited in proving the validity of the algorithm. For example, the noise is often assumed to be centered (zero-mean). Algorithms based on the ordinary least-squares method are used in engineering practice for simple averaging of observation data. If the noise is assumed to be centered without any valid justification, then such algorithms are unsatisfactory in practice and may even be harmful. Such is the state of affairs under an "opponent's" counteraction. In particular, if the noise is an unknown deterministic function (the opponent suppresses the signals) or the measurement noises form a dependent sequence, then averaging the observations yields no useful result. Analysis of typical results in such situations shows that the estimates generated by the ordinary algorithms are unsatisfactory; the observation sequences are regarded as degenerate, and such problems are usually set aside.

Another, closely related difficulty in applying many recurrent estimation algorithms is that the observation sequence may lack adequate variability. For example, the main aim in designing adaptive controls is to minimize the deviation of the state vector of a system from a given trajectory, and this often results in a degenerate observation sequence. Identification is thereby complicated, since successful identification requires "diverse" observations.

Our new approach to estimation and optimization under poor conditions (for example, degenerate observations) is based on the use of test disturbances. In many problems, the information in the observation channel can be "enriched" by introducing into the input channels or into the algorithm a test disturbance with known statistical properties. Sometimes a measured random process can itself serve as the test disturbance. In control systems, test actions can be introduced via the control channel; in other cases, a randomized observation plan (design of experiment) can play the part of the test action. By studying the renewed system with the test disturbance, which is sometimes simply the old system in a different form, results on the convergence and the domains of applicability of the new algorithms can be obtained by traditional estimation methods. A remarkable property of such algorithms is their convergence under "almost arbitrary" disturbances. An important constraint on their applicability is that the test disturbance and the inherent noise must be independent. In many problems this constraint is satisfied.
Such is the case if the noise is an unknown bounded deterministic function, or some external random perturbation generated by an opponent counteracting our investigation, to whom the statistical properties of the test disturbance are unknown.

Surprisingly, researchers long failed to notice that, in the case of noisy observations, search algorithms with sequential (n = 1, 2, ...) changes of the estimate θ̂_{n−1} along some random centered vector Δ_n,

θ̂_n = θ̂_{n−1} − Δ_n y_n,

might converge to the true vector of controlled parameters θ* under not only "good" but also "almost arbitrary" disturbances. This happens if the observations y_n are taken at a point defined by the previous estimate θ̂_{n−1} and the vector Δ_n, called the simultaneous test disturbance. Such an algorithm is called a randomized estimation algorithm, because its convergence under "almost arbitrary" noise is demonstrated through the stochastic (probabilistic) properties of the test disturbance.

In the near future, experimenters will radically change their present cautious attitude toward stochastic algorithms and their results. Modern computing devices will be supplanted by quantum computers, which operate as stochastic systems by virtue of the Heisenberg uncertainty principle. Owing to the possibility of quantum parallelism, randomized estimation algorithms will, most probably, form the underlying principle of future quantum computing devices.

A main feature of the stochastic approximation algorithms designed in this paper is that the unknown optimized function is measured at each step not at the point defined by the previous estimate, but at a slightly perturbed position. Precisely this "perturbation" plays the part of the simultaneous test disturbance, "enriching" the information arriving through the observation channel.

The convergence of randomized stochastic approximation algorithms under "almost arbitrary" disturbances was first investigated in [1–4]. In [5], such algorithms are studied under "good" disturbances in observations, and the asymptotic optimality of the recurrent algorithms in the minimax sense is demonstrated. A similar algorithm is studied in [6, 7], where it is shown that the number of iterations required to attain a given estimation accuracy does not grow, while the number of measurements per step in the multidimensional case is considerably smaller than in the classical Kiefer–Wolfowitz procedure. In the foreign literature, the new algorithm is called simultaneous perturbation stochastic approximation. Similar algorithms are investigated in the design of adjustment algorithms for neural networks [8, 9]. The consistency of recurrent algorithms under "almost arbitrary" perturbations is also investigated in [10, 11].

In this paper, we study a more general formulation of the problem of minimizing a mean-risk functional from measurements of the cost function corrupted by "almost arbitrary" perturbations. We shall demonstrate that the sequences of estimates generated by several algorithms converge to the true value of the unknown parameters both with probability 1 and in the mean-square sense.

2. FORMULATION OF THE PROBLEM AND MAIN ASSUMPTIONS

Let F(w, θ): R^p × R^r → R^1 be a function differentiable with respect to θ, and let x_1, x_2, ... be a sequence of measurement points chosen by the experimenter (the observation plan), at which the value

y_n = F(w_n, x_n) + v_n

of the function F(w_n, ·) is accessible to observation at every instant n = 1, 2, ...,
with additive disturbances v_n, where {w_n} is an uncontrollable sequence of random variables (w_n ∈ R^p) having, in general, an identical unknown distribution P_w(·) with finite carrier.

Formulation of the problem. Using the observations y_1, y_2, ..., construct a sequence of estimates {θ̂_n} of the unknown vector θ* minimizing the mean-risk functional

f(θ) = ∫_{R^p} F(w, θ) P_w(dw).

Usually, minimization of the function f(·) is studied for the simpler observation model y_n = f(x_n) + v_n, which fits easily into the general scheme. The more general formulation is motivated by the need to account for multiplicative perturbations in the observations,

y_n = w_n f(x_n) + v_n,

which corresponds to the general scheme with the function F(w, x) = w f(x), and by the need to cover the observation model of [12], where

F(w, x) = (1/2)(x − θ* − w)^T (x − θ* − w).

Let us formulate the main assumptions, denoting the Euclidean norm and the scalar product in R^r by ‖·‖ and ⟨·, ·⟩, respectively.

(SA.1) The function f(·) is strongly convex, i.e., it has a unique minimum point θ* = θ*(f(·)) in R^r and

⟨x − θ*, ∇f(x)⟩ ≥ µ‖x − θ*‖²  ∀x ∈ R^r

with some constant µ > 0.

(SA.2) The gradient of the function f(·) satisfies the Lipschitz condition

‖∇f(x) − ∇f(θ)‖ ≤ A‖x − θ‖  ∀x, θ ∈ R^r

with some constant A > µ.

3. TEST PERTURBATION AND MAIN ALGORITHMS

Let Δ_n, n = 1, 2, ..., be an observable sequence of independent random vectors in R^r, called the simultaneous test perturbation, with distribution functions P_n(·). Take a fixed initial vector θ̂_0 ∈ R^r and choose two sequences {α_n} and {β_n} of positive numbers tending to zero. We design three algorithms for constructing the sequences of measurement points {x_n} and estimates {θ̂_n}. The first algorithm uses one observation at every step (iteration):

x_n = θ̂_{n−1} + β_n Δ_n,  y_n = F(w_n, x_n) + v_n,
θ̂_n = θ̂_{n−1} − (α_n/β_n) K_n(Δ_n) y_n,  (1)

whereas the second and third algorithms, which are randomized variants of the Kiefer–Wolfowitz procedure, use two observations:

x_{2n} = θ̂_{n−1} + β_n Δ_n,  x_{2n−1} = θ̂_{n−1} − β_n Δ_n,
θ̂_n = θ̂_{n−1} − (α_n/(2β_n)) K_n(Δ_n)(y_{2n} − y_{2n−1});  (2)

x_{2n} = θ̂_{n−1} + β_n Δ_n,  x_{2n−1} = θ̂_{n−1},
θ̂_n = θ̂_{n−1} − (α_n/β_n) K_n(Δ_n)(y_{2n} − y_{2n−1}).  (3)

All three algorithms use vector functions (kernels) K_n(·): R^r → R^r, n = 1, 2, ..., with compact carrier, which, together with the distribution functions P_n(·) of the test perturbation, satisfy the conditions

∫ K_n(x) P_n(dx) = 0,  ∫ K_n(x) x^T P_n(dx) = I,  sup_n ∫ ‖K_n(x)‖² P_n(dx) < ∞,  n = 1, 2, ...,  (4)

where I is the r-dimensional identity matrix.
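To make the schemes concrete, here is a minimal numerical sketch (not from the original paper) of one step of algorithms (1)–(3). It assumes a symmetric Bernoulli test perturbation, whose components independently take the values ±1 with equal probability, and the simple kernel K_n(x) = x; this pair satisfies conditions (4). The names sp_step and observe are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sp_step(theta, n, observe, alpha, beta, variant=1):
    """One step of the randomized SA algorithms (1)-(3).

    observe(x) returns a noisy measurement y = F(w, x) + v;
    alpha(n), beta(n) are the step-size sequences.
    A Bernoulli test perturbation and the kernel K_n(x) = x are
    assumed, so K_n(Delta_n) is simply the vector delta below.
    """
    a, b = alpha(n), beta(n)
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # simultaneous test perturbation
    if variant == 1:
        # algorithm (1): a single observation at the perturbed point
        y = observe(theta + b * delta)
        return theta - (a / b) * delta * y
    if variant == 2:
        # algorithm (2): two observations at symmetrically perturbed points
        y_minus = observe(theta - b * delta)
        y_plus = observe(theta + b * delta)
        return theta - (a / (2.0 * b)) * delta * (y_plus - y_minus)
    # algorithm (3): first an unperturbed, then a perturbed observation
    y_first = observe(theta)
    y_second = observe(theta + b * delta)
    return theta - (a / b) * delta * (y_second - y_first)
```

For the Bernoulli distribution, E{Δ_n} = 0, E{Δ_n Δ_n^T} = I, and ‖Δ_n‖² = r, so conditions (4) hold trivially for K_n(x) = x. A call such as theta = sp_step(theta, n, observe, lambda n: 1/n, lambda n: n**-0.25, variant=3) performs one step of algorithm (3).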
Algorithm (2) was first developed by Kushner and Clark [13] for a uniform distribution and the kernel function K_n(Δ_n) = Δ_n. Algorithm (1) was first designed in [1] for constructing a sequence of consistent estimates under "almost arbitrary" perturbations in observations, using the same simple kernel function but a more general class of test perturbations. Polyak and Tsybakov [5] investigated algorithms (1) and (2) with a vector function K_n(·) of fairly general type for a uniformly distributed test perturbation under the assumption that the observation perturbations are independent and centered. Spall [6, 7] studied algorithm (2) for test perturbation distributions with finite inverse moments and the vector function K_n(·) defined by the rule

K_n((x^{(1)}, x^{(2)}, ..., x^{(r)})^T) = (1/x^{(1)}, 1/x^{(2)}, ..., 1/x^{(r)})^T.

Using this vector function K_n(·) and constraints on the distribution of the simultaneous test perturbation, Chen et al. [10] studied algorithm (3).

4. CONVERGENCE WITH PROBABILITY 1 AND IN THE MEAN-SQUARE SENSE

Instead of algorithm (1), let us study a close algorithm with projection:

x_n = θ̂_{n−1} + β_n Δ_n,  y_n = F(w_n, x_n) + v_n,
θ̂_n = P_{Θ_n}(θ̂_{n−1} − (α_n/β_n) K_n(Δ_n) y_n).  (5)

For this algorithm, it is more convenient to prove the consistency of the estimate sequence. Here P_{Θ_n}, n = 1, 2, ..., are operators of projection onto convex closed bounded subsets Θ_n ⊂ R^r that contain the point θ*, beginning from some n ≥ 1. If a set Θ containing the point θ* is known in advance, then we can take Θ_n = Θ; otherwise, the sets {Θ_n} may expand to infinity.

Let W = supp(P_w(·)) ⊂ R^p be the carrier of the distribution P_w(·), and let F_{n−1} be the σ-algebra of probabilistic events generated by the random variables θ̂_0, θ̂_1, ..., θ̂_{n−1} formed by algorithm (5) (or (2), or (3)). In applying algorithm (2) or (3), we set

v̄_n = v_{2n} − v_{2n−1},  w̄_n = (w_{2n}, w_{2n−1})^T,  d_n = 1,

and in constructing estimates by algorithm (5),

v̄_n = v_n,  w̄_n = w_n,  d_n = diam(Θ_n),

where diam(·) is the Euclidean diameter of the set.

Theorem 1. Let the following conditions hold: condition (SA.1) for the function f(θ) = E{F(w, θ)}; condition (SA.2) for the functions F(w, ·) ∀w ∈ W; conditions (4) for the functions K_n(·) and P_n(·), n = 1, 2, ...; ∀θ ∈ R^r the functions F(·, θ) and ∇_θ F(·, θ) are uniformly bounded on W; ∀n ≥ 1 the random variables v̄_1, ..., v̄_n and the vectors w̄_1, ..., w̄_{n−1} do not depend on w̄_n and Δ_n, the random vector w̄_n does not depend on Δ_n, and E{v̄_n²} ≤ σ_n², n = 1, 2, ....

If Σ_n α_n = ∞, α_n → 0, β_n → 0, and α_n² β_n^{−2}(1 + d_n² + σ_n²) → 0 as n → ∞, then the sequence of estimates {θ̂_n} generated by algorithm (5) (or (2), or (3)) converges to the point θ* in the mean-square sense: E{‖θ̂_n − θ*‖²} → 0 as n → ∞.

Moreover, if Σ_n α_n β_n² < ∞ and

Σ_n α_n² β_n^{−2} (1 + E{v̄_n² | F_{n−1}}) < ∞

with probability 1, then θ̂_n → θ* as n → ∞ with probability 1.

The proof of Theorem 1 is given in the Appendix.

The conditions of Theorem 1 hold for the function F(w, x) = w f(x) if the function f(x) satisfies conditions (SA.1) and (SA.2).

The observation perturbations v_n in Theorem 1 can be called "almost arbitrary," since they may be nonrandom but bounded, or a realization of some stochastic process with arbitrary internal dependencies. In particular, to prove the assertions of Theorem 1, no assumption is needed about the dependence between v̄_n and F_{n−1}.

The assertions of Theorem 1 may fail to hold if the perturbations depend on the observation point, v̄_n = v̄_n(x_n), in some specific manner. If such a dependence arises, for example, through rounding errors, then in solving the problem the observation perturbation must be split into two parts, v̄_n(x_n) = ṽ_n + ξ(x_n). The first part may satisfy the conditions of Theorem 1; the second part, if it results from rounding errors, as a rule has good statistical properties and does not prevent the convergence of the algorithms.

The condition that the observation perturbation be independent of the test perturbation can be relaxed: it suffices to assume that the conditional mutual correlation between v̄_n and K_n(Δ_n) tends to zero with probability 1 at a rate not slower than α_n β_n^{−1} as n → ∞.
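The following toy experiment (an illustration under assumptions of our choosing, not part of the paper) checks the statement of Theorem 1 numerically: algorithm (3) is run on the quadratic f(θ) = ‖θ − θ*‖²/2 with the bounded deterministic disturbance v_n = sin n and the step sizes α_n = 1/(n + 20), β_n = n^{−1/4}, which satisfy Σα_n = ∞, Σα_nβ_n² < ∞, and α_n²β_n^{−2}(1 + d_n² + σ_n²) → 0 (the shift in α_n merely damps the initial transient).

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
r = 20
theta_star = np.ones(r)  # unknown minimum point theta*
f = lambda th: 0.5 * float((th - theta_star) @ (th - theta_star))  # satisfies (SA.1), (SA.2)

counter = itertools.count(1)  # global observation index n
def observe(x):
    """Noisy measurement y = f(x) + v_n with a deterministic bounded disturbance."""
    return f(x) + np.sin(next(counter))

theta = np.zeros(r)
for n in range(1, 20001):
    alpha, beta = 1.0 / (n + 20.0), n ** -0.25    # step sizes satisfying Theorem 1
    delta = rng.choice([-1.0, 1.0], size=r)       # Bernoulli test perturbation, K_n(x) = x
    y_first = observe(theta)                      # algorithm (3): unperturbed observation
    y_second = observe(theta + beta * delta)      # observation with the perturbation in the system
    theta -= (alpha / beta) * delta * (y_second - y_first)

print(np.linalg.norm(theta - theta_star))         # small residual, in line with Theorem 1
```

The disturbance sin n is deterministic and hence trivially independent of Δ_n, which is exactly the "almost arbitrary" regime in which simple averaging of observations would fail.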
Though algorithms (2) and (3) may look alike, algorithm (3) is better suited for use in real-time systems when the observations contain arbitrary perturbations. For algorithm (2), the condition that the observation perturbation v_{2n} and the test perturbation Δ_n be independent is rather restrictive, because the vector Δ_n has already entered the system at the previous instant 2n − 1. In the operation of algorithm (3), the perturbation v_{2n} and the test perturbation vector Δ_n appear in the system simultaneously; hence they can reasonably be regarded as independent.

For ℓ-times continuously differentiable functions F(w, ·), the vector function K_n(·) is constructed in [5] with the help of orthogonal Legendre polynomials p_m(·), m = 0, 1, ..., ℓ, using the uniform distribution on the r-dimensional cube [−1/2, 1/2]^r as the distribution of the test perturbation. Moreover, it is shown there that the mean-square convergence rate asymptotically behaves as O(n^{−ℓ/(ℓ+1)}). A similar result can also be obtained for the general test perturbation distributions studied in this paper.

5. QUANTUM COMPUTER AND GRADIENT VECTOR ESTIMATION

Owing to the simplicity of their representation, randomized stochastic approximation algorithms can be used not only in programmable computing devices; they also permit the incorporation of the simultaneous perturbation principle into the design of electronic devices of the classical type. A fundamental point in demonstrating the convergence of any of these methods in a practical application is that the observation perturbation must be independent of the generated test perturbation, and the components of the test perturbation vector must be mutually independent.

When the algorithms are realized on a conventional computer executing elementary operations in succession, they are not as effective as theory predicts. The very name "simultaneous perturbation" implies an imperative requirement for practical realization, viz., the capability of being implemented in parallel. Nevertheless, if the vector argument of a function has large dimension (hundreds or thousands), it is not a simple matter to organize parallel computations on conventional computers.

We consider a model of a "hypothetical" quantum computer, which naturally generates not only a simultaneous test perturbation, but also realizes a fundamentally nontrivial concept, namely, measurement of the result of a quantum computation, since this concept itself underlies the design of the randomized algorithm in the form of the product of the computed value of a function and the test perturbation. In other words, we give an example of the design of a quantum device that estimates the gradient vector of an unknown function of several variables in one cycle. The epithet "hypothetical" is purposely put within quotes, because the quantum computer has been generally regarded as a machine of the near future ever since Shor delivered his report at the Berlin International Mathematical Congress in 1998 [14].

As in [14], let us describe the mathematical model of a quantum computer that computes by a specific circuit. Whenever possible, we use generalizations of the concepts describing the model of a conventional computer. A conventional computer handles bits, which take values in the set {0, 1}; it is based on a finite set of basic circuits that can be applied to sets of bits.
A quantum computer handles q-bits (quantum bits), i.e., quantum systems with two states (microscopic systems describing, for example, an excited ion, a polarized photon, or the spin of an atomic nucleus). The behavioral features of such a quantum system, such as interference, superposition, and stochasticity, can be described exactly only by the rules of quantum mechanics [15]. Mathematically, a q-bit takes its values in the complex projective (Hilbert) space C²; quantum states are invariant to multiplication by a scalar. Let |0⟩, |1⟩ be a basis of this space.

We assume that a quantum computer, like a conventional computer, is equipped with a discrete set of basic components, called quantum circuits. Every quantum circuit is essentially a unitary transformation, which acts on a fixed number of q-bits. One of the fundamental principles of quantum mechanics asserts that the joint quantum state space of a system consisting of ℓ two-state systems is the tensor product of the component state spaces. Thus, the quantum state space of a system of ℓ q-bits is the projective Hilbert space C^{2^ℓ}. The set of basis vectors of this state space can be parametrized by bit strings of length ℓ: |b₁b₂...b_ℓ⟩.

Assume that classical data, a string i of length k, k ≤ ℓ, is fed to the input of the quantum computer. In quantum computation, the ℓ q-bits initially exist in the state |i00...0⟩. An executable circuit is constructed from a finite number of quantum circuits acting on these q-bits. At the end of the computation, the quantum computer passes to some state, a unit vector in the space C^{2^ℓ}. This state can be represented as

W = Σ_s ψ(s)|s⟩,

where the summation with respect to s is taken over all binary strings of length ℓ, ψ(s) ∈ C, and Σ_s |ψ(s)|² = 1. Here ψ(s) is called the probability amplitude and W the superposition of the basis vectors |s⟩.

The Heisenberg uncertainty principle asserts that the state of a quantum system cannot be determined exactly. Nevertheless, there are several possibilities of measuring all the q-bits (or a subset of them). The state space of our quantum system is Hilbertian, and the concept of state measurement corresponds to the scalar product in this Hilbert space with some given vector V: ⟨V, W⟩. The projection of each q-bit onto the basis {|0⟩, |1⟩} is usually used in measurement; the result of this measurement is the computation result.

Let us examine the segment of the randomized stochastic approximation algorithm (3) (or (1), or (2)) that computes the approximate value ĝ(x) of the gradient vector of a function f(·): R^r → R at some point x ∈ R^r. Let the numbers in our quantum computer be represented by p binary digits, and let ℓ = p × r (in modern computers, p = 16, 32, 64). In the algorithm, let K_n(x) = x and β = 2^{−j}, 0 ≤ j < p, and let the simultaneous test perturbation fed to the input of the algorithm be defined by a Bernoulli distribution. For any number u ∈ R, let s_u denote its binary representation in the form of a bit array s_u = b_u^{(1)}...b_u^{(p)}. Assume that there exists a quantum circuit that computes the value of the function f(x); more exactly, there exists a unitary transformation U_f: C^{2^ℓ} → C^{2^ℓ} that, for any x = (x^{(1)}, ..., x^{(r)})^T ∈ R^r, maps the basis element

|s_{x^{(1)}}...s_{x^{(r)}}⟩ = |b_{x^{(1)}}^{(1)}...b_{x^{(1)}}^{(p)}...b_{x^{(r)}}^{(1)}...b_{x^{(r)}}^{(p)}⟩

to the basis element

|s_{f(x)}00...0⟩ = |b_{f(x)}^{(1)}...b_{f(x)}^{(p)}00...0⟩ = U_f |s_{x^{(1)}}...s_{x^{(r)}}⟩.
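The following toy classical simulation (our illustration, not part of the paper) renders these notions concrete: an ℓ-q-bit register is represented by a vector of 2^ℓ complex amplitudes, and measurement against a vector V is modeled by the scalar product ⟨V, W⟩, whose squared magnitude gives the probability of the corresponding outcome.

```python
import numpy as np

l = 3           # number of q-bits
dim = 2 ** l    # dimension of the state space C^(2^l)

# Superposition W = sum_s psi(s)|s>, amplitudes indexed by bit strings of length l.
psi = np.zeros(dim, dtype=complex)
psi[0b000] = 1.0 / np.sqrt(2)   # amplitude of |000>
psi[0b101] = 1.0j / np.sqrt(2)  # amplitude of |101>
assert np.isclose(np.sum(np.abs(psi) ** 2), 1.0)  # sum_s |psi(s)|^2 = 1

def measure_against(v, w):
    """Measurement modeled as the scalar product <V, W> in the Hilbert space."""
    return np.vdot(v, w)

# Probability of observing the basis state |101> is |<101|W>|^2 = 1/2.
basis_101 = np.zeros(dim, dtype=complex)
basis_101[0b101] = 1.0
print(abs(measure_against(basis_101, psi)) ** 2)  # 0.5
```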
Let the hypothetical quantum computer contain at least four ℓ-q-bit registers: an input register I for transforming conventional data into quantum data, two working registers W1 and W2 for manipulating quantum data, and an output register Δ for storing the quantum value whose projection yields the computation result in conventional form (measurement). The design of a quantum circuit for estimating the gradient vector of the function f(·) requires a few standard quantum (unitary) transformations applicable to register data. We assume that the transformation result is stored in the same register to which the transformation is applied.

Addition/Subtraction U_{±X} adds to (subtracts from) a register the vector stored in the register X.

Rotation UR_{1, p+1, ..., (r−1)p+1} converts the state of each of the q-bits 1, p + 1, ..., (r − 1)p + 1 into (1/√2)|0⟩ + (1/√2)|1⟩.

Shift US_j displaces the state vector by j q-bits, adding new bits |0⟩.

The gradient vector of the function f(·) at a point x can be estimated by the following algorithm.
1. Zero the states of all registers I, W1, W2, Δ.
2. Feed the bit array s_x to the input register I and transform the register Δ: Δ := UR_{1, p+1, ..., (r−1)p+1} Δ.
3. Compute the values of the functions f(x) and f(x + 2^{−j}Δ):
W1 := U_f U_{+I} W1,  W2 := U_f U_{+I} US_j U_{+Δ} W2.
4. Compute the difference: W2 := U_{−W1} W2.
5. Sequentially (i = 1, 2, ..., r) measure the components of the result:
ĝ^{(i)}(x) = ⟨Δ, W2⟩,  W2 := US_p W2.

The computation result is determined by the formula for estimating the gradient vector of f(·):

∇f(x) ≈ Δ (f(x + βΔ) − f(x))/β.
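For comparison, here is the classical counterpart of this routine (a sketch under our assumptions, with illustrative names): the simultaneous-perturbation estimate of the gradient from two function values, with a Bernoulli perturbation and β = 2^{−j}, exactly as in the formula above.

```python
import numpy as np

rng = np.random.default_rng(2)

def sp_gradient(f, x, j=4):
    """Simultaneous-perturbation gradient estimate
    grad f(x) ~ Delta * (f(x + beta*Delta) - f(x)) / beta, beta = 2**-j,
    with a Bernoulli test perturbation (components +/-1)."""
    beta = 2.0 ** -j
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return delta * (f(x + beta * delta) - f(x)) / beta

# Example: for the quadratic f(x) = <x, x> a single estimate is already unbiased,
# so averaging a few thousand estimates approaches the true gradient 2x.
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
est = np.mean([sp_gradient(f, x) for _ in range(5000)], axis=0)
print(est, 2 * x)
```

One call to sp_gradient uses only two evaluations of f regardless of the dimension r, which is the economy relative to the classical Kiefer–Wolfowitz procedure noted in Section 1.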
6. CONCLUSIONS

Unlike deterministic methods, stochastic optimization considerably widens the range of practical problems for which an optimal solution can be found. Stochastic optimization algorithms are effective in information network analysis, optimization via simulation, image processing, pattern recognition, neural network learning, and adaptive control. The role of stochastic optimization is expected to grow with the increasing complexity of modern systems, in the same way as population growth and the depletion of natural resources stimulate the use of more intensive technologies in areas where they were not required earlier. The trend of modern development of computing techniques suggests that traditional deterministic algorithms will be supplanted by stochastic ones, because pilot quantum computers based on stochastic principles already exist.

APPENDIX

Proof of Theorem 1. Let us consider algorithm (5); for algorithms (2) and (3), Theorem 1 is proved along similar lines. For sufficiently large n, for which θ* = θ*(f(·)) ∈ Θ_n, the projection properties readily yield

‖θ̂_n − θ*‖² ≤ ‖θ̂_{n−1} − θ* − (α_n/β_n) K_n(Δ_n) y_n‖².

Taking the conditional expectation with respect to the σ-algebra F_{n−1}, we obtain

E{‖θ̂_n − θ*‖² | F_{n−1}} ≤ ‖θ̂_{n−1} − θ*‖² − 2(α_n/β_n)⟨θ̂_{n−1} − θ*, E{K_n(Δ_n) y_n | F_{n−1}}⟩ + (α_n²/β_n²) E{y_n² ‖K_n(Δ_n)‖² | F_{n−1}}.  (6)

By the mean value theorem, condition (SA.2) for the function F(·, ·) yields

|F(w, x) − F(w, θ*)| ≤ (1/2)‖∇_θ F(w, θ*)‖² + (A + 1/2)‖x − θ*‖²,  x ∈ R^r.

Since the function F(·, θ*) is uniformly bounded, using the notation

ν₁ = sup_{w∈W} (|F(w, θ*)| + (1/2)‖∇_θ F(w, θ*)‖²),

we find that

F(w, θ̂_{n−1} + β_n x)² ≤ (ν₁ + (2A + 1)(‖θ̂_{n−1} − θ*‖² + ‖β_n x‖²))²

uniformly for w ∈ W. By condition (4) and the independence of v̄_n and Δ_n, we have

E{v̄_n² ‖K_n(Δ_n)‖² | F_{n−1}} ≤ sup_x ‖K_n(x)‖² ξ_n²,

where ξ_n² = E{v̄_n² | F_{n−1}}. Since the vector functions K_n(·) are bounded and their carriers are compact, the last two inequalities yield, for the last term on the right side of (6),

E{y_n² ‖K_n(Δ_n)‖² | F_{n−1}} ≤ 2E{v̄_n² ‖K_n(Δ_n)‖² | F_{n−1}} + 2∫∫ F(w, θ̂_{n−1} + β_n x)² ‖K_n(x)‖² P_n(dx) P_w(dw) ≤ C₁ + C₂(d_n² + 1)(‖θ̂_{n−1} − θ*‖² + β_n²) + C₃ξ_n².

Here and in what follows, C_i, i = 1, 2, ..., are positive constants. Now let us consider

β_n^{−1} E{y_n K_n(Δ_n) | F_{n−1}} = β_n^{−1} ∫∫ F(w, θ̂_{n−1} + β_n x) K_n(x) P_n(dx) P_w(dw) + β_n^{−1} E{v̄_n K_n(Δ_n) | F_{n−1}}.  (7)

For the second term in the last relation, since v̄_n and Δ_n are independent, from (4) we obtain

E{v̄_n K_n(Δ_n) | F_{n−1}} = E{v̄_n | F_{n−1}} ∫ K_n(x) P_n(dx) = 0.

Since the function ∇_θ F(·, θ) is uniformly bounded, we have

∫_{R^p} ∇_θ F(w, x) P_w(dw) = ∇f(x).

Using the last relation and condition (4), let us express the first term in (7) as

β_n^{−1} ∫∫ F(w, θ̂_{n−1} + β_n x) K_n(x) P_n(dx) P_w(dw) = ∇f(θ̂_{n−1}) + ∫∫ K_n(x) x^T ∫₀¹ (∇_θ F(w, θ̂_{n−1} + tβ_n x) − ∇_θ F(w, θ̂_{n−1})) dt P_n(dx) P_w(dw).

Since condition (SA.2) holds for every function F(w, ·) and condition (4) holds for K_n(·), the norm of the second term in the last equality admits the bound

‖∫∫ K_n(x) x^T ∫₀¹ (∇_θ F(w, θ̂_{n−1} + tβ_n x) − ∇_θ F(w, θ̂_{n−1})) dt P_n(dx) P_w(dw)‖ ≤ ∫∫ ‖K_n(x)‖ ‖x‖ A ‖β_n x‖ P_n(dx) P_w(dw) ≤ C₄β_n.

Consequently, for the second term on the right side of inequality (6) we obtain

−2(α_n/β_n)⟨θ̂_{n−1} − θ*, E{K_n(Δ_n) y_n | F_{n−1}}⟩ ≤ −2α_n⟨θ̂_{n−1} − θ*, ∇f(θ̂_{n−1})⟩ + 2C₄α_nβ_n‖θ̂_{n−1} − θ*‖.

Substituting the expressions for the second and third terms on the right side into (6), we find that

E{‖θ̂_n − θ*‖² | F_{n−1}} ≤ ‖θ̂_{n−1} − θ*‖² − 2α_n⟨θ̂_{n−1} − θ*, ∇f(θ̂_{n−1})⟩ + 2C₄α_nβ_n‖θ̂_{n−1} − θ*‖ + (α_n²/β_n²)(C₁ + C₂(d_n² + 1)(‖θ̂_{n−1} − θ*‖² + β_n²) + C₃ξ_n²).

Since condition (SA.1) holds for the function f(·) and the inequality

‖θ̂_{n−1} − θ*‖ ≤ (1/2)(ε^{−1}β_n + εβ_n^{−1}‖θ̂_{n−1} − θ*‖²)

holds for any ε > 0, we obtain

E{‖θ̂_n − θ*‖² | F_{n−1}} ≤ ‖θ̂_{n−1} − θ*‖²(1 − (2µ − εC₄)α_n + C₂α_n²β_n^{−2}(d_n² + 1)) + ε^{−1}C₄α_nβ_n² + (α_n²/β_n²)(C₁ + C₂β_n² + C₃ξ_n²).

Choosing ε so small that εC₄ < µ, taking n sufficiently large, and applying the conditions of Theorem 1 on the number sequences, we transform the last inequality to the form

E{‖θ̂_n − θ*‖² | F_{n−1}} ≤ ‖θ̂_{n−1} − θ*‖²(1 − C₅α_n) + C₆(α_nβ_n² + α_n²β_n^{−2}(1 + ξ_n²)).

Hence, by the conditions of Theorem 1, all conditions of the Robbins–Siegmund lemma [16] required for the convergence θ̂_n → θ* as n → ∞ with probability 1 are satisfied. To prove the mean-square convergence asserted in Theorem 1, let us take the unconditional mathematical expectation of both sides of the last inequality:

E{‖θ̂_n − θ*‖²} ≤ E{‖θ̂_{n−1} − θ*‖²}(1 − C₅α_n) + C₆(α_nβ_n² + α_n²β_n^{−2}(1 + σ_n²)).

The mean-square convergence of the sequence {θ̂_n} to the point θ* is now implied by Lemma 5 of [17]. This completes the proof of Theorem 1.

REFERENCES

1. Granichin, O.N., Stochastic Approximation with Input Perturbation for Dependent Observation Disturbances, Vestn. Leningrad. Gos. Univ., 1989, Ser. 1, no. 4, pp. 27–31.
2. Granichin, O.N., A Procedure of Stochastic Approximation with Input Perturbation, Avtom. Telemekh., 1992, no. 2, pp. 97–104.
3. Granichin, O.N., Estimation of the Minimum Point for an Unknown Function with Dependent Background Noise, Probl. Peredachi Inf., 1992, no. 2, pp. 16–20.
4. Polyak, B.T. and Tsybakov, A.B., On Stochastic Approximation with Arbitrary Noise (the KW Case), in Topics in Nonparametric Estimation, Adv. in Soviet Mathematics, Khasminskii, R.Z., Ed., Am. Math. Soc., 1992, no. 12, pp. 107–113.
5. Polyak, B.T. and Tsybakov, A.B., Optimal Accuracy Orders of Stochastic Approximation Search Algorithms, Probl. Peredachi Inf., 1990, no. 2, pp. 45–53.
6. Spall, J.C., A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates, Proc. Am. Control Conf., 1987, pp. 1161–1167.
7. Spall, J.C., Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation, IEEE Trans. Autom. Control, 1992, vol. 37, pp. 332–341.
8. Alspector, J., Meir, R., Jayakumar, A., et al., A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks, in Advances in Neural Information Processing Systems, Hanson, S.J., Ed., San Mateo: Morgan Kaufmann, 1993, pp. 834–844.
9. Maeda, Y. and Kanata, Y., Learning Rules for Recurrent Neural Networks Using Perturbation and Their Application to Neuro-control, Trans. IEE Japan, 1993, vol. 113-C, pp. 402–408.
10. Chen, H.F., Duncan, T.E., and Pasik-Duncan, B., A Kiefer–Wolfowitz Algorithm with Randomized Differences, IEEE Trans. Autom. Control, 1999, vol. 44, no. 3, pp. 442–453.
11. Ljung, L. and Guo, L., The Role of Model Validation for Assessing the Size of the Unmodeled Dynamics, IEEE Trans. Autom. Control, 1997, vol. 42, no. 9, pp. 1230–1239.
12. Granichin, O.N., Estimation of Linear Regression Parameters under Arbitrary Disturbances, Avtom. Telemekh., 2002, no. 1, pp. 30–41.
13. Kushner, H.J. and Clark, D.S., Stochastic Approximation Methods for Constrained and Unconstrained Systems, Berlin: Springer-Verlag, 1978.
14. Shor, P.W., Quantum Computing, Int. Math. Congress, Berlin, 1998.
15. Faddeev, L.D. and Yakubovskii, O.A., Lektsii po kvantovoi mekhanike dlya studentov-matematikov (Lectures on Quantum Mechanics for Students of Mathematics), St. Petersburg: RKhD, 2001.
16. Robbins, H. and Siegmund, D., A Convergence Theorem for Nonnegative Almost Supermartingales and Some Applications, in Optimizing Methods in Statistics, Rustagi, J.S., Ed., New York: Academic, 1971, pp. 233–257.
17. Polyak, B.T., Vvedenie v optimizatsiyu (Introduction to Optimization), Moscow: Nauka, 1983.

This paper was recommended for publication by B.T. Polyak, a member of the Editorial Board.