Proceedings of the 7th Annual ISC Graduate Research Symposium ISC-GRS 2013
April 24, 2013, Rolla, Missouri

Qiming Zhao
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO 65409

OPTIMAL ADAPTIVE CONTROLLER DESIGN FOR UNKNOWN LINEAR SYSTEMS

ABSTRACT
In this work, an optimal adaptive control design with finite horizon is presented for discrete-time linear systems with unknown system dynamics. A Q-learning scheme is utilized, and an adaptive estimator is proposed to learn the Q-function so that the system dynamics are not needed. The time-varying nature of the solution to the Bellman equation is handled by utilizing a time-dependent basis function, and the terminal constraint is incorporated in a novel update law for solving the optimal feedback control. The proposed optimal regulation scheme for the uncertain linear system yields a forward-in-time and online solution without using policy and/or value iterations. For time-invariant linear discrete-time systems, the closed-loop dynamics of the finite-horizon regulation problem become essentially non-autonomous and involved; stability is nevertheless verified by using standard Lyapunov theory. Simulation results are shown to verify the effectiveness of the proposed method.

1. INTRODUCTION
Optimal regulation of linear systems with a quadratic performance index (PI), i.e., the LQR problem, has been one of the key focuses in control theory for several decades. For a linear system in the infinite-horizon case, the algebraic Riccati equation (ARE) is considered, and its solution converges and becomes time-invariant. However, in the finite-horizon scenario, the solution of the Riccati equation (RE), which is essentially time-varying [9], can only be obtained by solving the RE in a backward-in-time manner when the system matrices are known.
More recently, the authors in [2] proposed a fixed-final-time optimal control design using a neural network (NN) to solve the time-varying Hamilton-Jacobi-Bellman (HJB) equation for general affine nonlinear continuous-time systems. Time-varying NN weights and a state-dependent activation function are used to obtain the optimal control in a backward-in-time manner. In [4], the input-constrained finite-horizon optimal control problem is considered by using an offline NN training scheme. The time-varying nature of the finite horizon is handled by utilizing constant NN weights and time-varying activation functions. The efforts in [2] and [4] provided some insights into solving the finite-horizon problem, but the solutions are obtained either backward-in-time or offline. To relax the requirement of system dynamics and achieve optimality, adaptive dynamic programming (ADP) techniques [3] are normally used to solve the optimal control problem in a forward-in-time fashion by using value and/or policy iterations. However, iteration-based schemes require a significantly large number of iterations within each time step to guarantee the stability of the system, and are thus not suitable for practical situations.
Motivated by the aforementioned deficiencies, in this work the ADP technique via reinforcement learning (RL) is utilized to solve the finite-horizon optimal regulation problem of a linear discrete-time system with unknown dynamics in an online and forward-in-time manner. Policy and/or value iterations are not used. The Bellman equation is utilized with an estimated Q-function so that the system dynamics are not needed. To properly satisfy the terminal constraint, an additional error term corresponding to the terminal constraint is defined and minimized at each time step so that the optimal control problem is solved within a finite time period. In addition, the controller functions in a forward-in-time fashion with no offline training phase.
Due to the time-varying nature of the finite horizon, the closed-loop system becomes essentially non-autonomous, and Lyapunov stability theory is utilized to show the stability of our proposed design scheme.

2. PROBLEM FORMULATION
Consider the time-invariant linear discrete-time system described as
$x_{k+1} = A x_k + B u_k$   (1)
where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ are the system state vector and control input vector, respectively. The system matrices $A$ and $B$ are assumed to be unknown and of appropriate dimensions. In this paper, it is also assumed that the system states are available for measurement. The objective of the control design is to determine a state feedback control policy that minimizes the cost function
$J_0 = x_N^T S_N x_N + \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right)$   (2)
where $Q$ and $R$ are weighting matrices for the system states and control inputs, assumed to be symmetric positive semi-definite and symmetric positive definite, respectively, and $S_N$ is a symmetric positive semi-definite matrix that penalizes the system states at the terminal stage $N$.
It is well known from conventional optimal control theory [9] that the finite-horizon optimal regulation problem can be addressed by solving the Riccati equation
$S_k = A^T \left[ S_{k+1} - S_{k+1} B \left( B^T S_{k+1} B + R \right)^{-1} B^T S_{k+1} \right] A + Q$   (3)
in a backward-in-time manner, while the time-varying Kalman gain is given as
$K_k = \left( B^T S_{k+1} B + R \right)^{-1} B^T S_{k+1} A$   (4)
However, it can be seen clearly from (3) and (4) that the traditional design of the optimal controller is essentially an offline scheme. Due to the backward-in-time feature, such a design is not suitable for real-time implementation. Moreover, when the system dynamics are not known a priori, the backward-in-time solution is not even possible. It will be shown in the next section that the finite-horizon optimal regulation problem for an uncertain linear discrete-time system can be tackled in an online and forward-in-time manner.
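As a point of reference, the conventional backward-in-time solution of (3) and (4) can be sketched as follows, assuming the matrices $A$ and $B$ are known, which is precisely the assumption the proposed scheme removes. The function name and the toy matrices are illustrative only.

```python
import numpy as np

def backward_riccati(A, B, Q, R, S_N, N):
    """Solve the finite-horizon RE (3) backward from S_N and return
    the time-varying Kalman gains K_0, ..., K_{N-1} of (4)."""
    S = S_N
    gains = [None] * N
    for k in range(N - 1, -1, -1):           # backward in time
        M = B.T @ S @ B + R
        K = np.linalg.solve(M, B.T @ S @ A)  # Kalman gain (4)
        S = A.T @ (S - S @ B @ np.linalg.solve(M, B.T @ S)) @ A + Q  # RE (3)
        gains[k] = K
    return gains

# Toy example: second-order system with scalar input (illustrative values)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.eye(1); S_N = np.eye(2)
K = backward_riccati(A, B, Q, R, S_N, N=20)
u0 = -K[0] @ np.array([1.0, -1.0])  # apply u_k = -K_k x_k at k = 0
```

Note that the entire gain sequence must be computed and stored before the first control action is applied, which is the offline, backward-in-time limitation discussed above.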
In addition, policy and/or value iterations are not needed, and the requirement of the system dynamics is relaxed for the optimal controller design, since a Q-learning scheme is adopted as defined in the next section.

3. FINITE-HORIZON OPTIMAL CONTROL DESIGN UNDER Q-LEARNING SCHEME
In this section, the finite-horizon optimal controller design for linear systems with uncertain system dynamics is addressed. A Q-function [3] is first defined and adaptively estimated by using reinforcement learning, which is in turn utilized to design the controller, so that the system dynamics are not required. Next, an additional error term corresponding to the terminal constraint is defined and minimized at each time step so that the terminal constraint can be satisfied properly. Finally, the stability of the closed-loop system is analyzed under the non-autonomous scheme and verified based on standard Lyapunov stability and geometric control theory.

3.1 Q-function Setup
Before proceeding, it is important to note that in the finite-horizon case the value function becomes time-dependent [9] and is denoted as $V(x_k, N-k)$, a function of both the system states and the time-to-go. For the linear system (1), by optimal control theory [9], the value function $V(x_k, N-k)$ can be expressed in the quadratic form
$V(x_k, N-k) = x_k^T S_k x_k$   (5)
where $S_k$ is the solution sequence to the time-varying Riccati equation. According to [9], the optimal control input is obtained as
$u_k = -K_k x_k = -\left( B^T S_{k+1} B + R \right)^{-1} B^T S_{k+1} A x_k$   (6)
Remark 1: Equation (6) clearly shows that the conventional optimal control approach requires both the system matrices $A$ and $B$, and that the solution of the RE is obtained backward-in-time from the terminal value $S_N$. Instead, with the ADP technique, the value function can be estimated and in turn used to derive the optimal control policy by using policy and/or value iterations, without the system dynamics, in a forward-in-time fashion. However, the available ADP iteration-based schemes are difficult for practical implementation, since an insufficient number of iterations can cause instability of the system [7].

Next, by reinforcement learning, we will show that the system dynamics are not needed when the time-dependent value function $V(x_k, N-k)$ is estimated. Define the time-varying Q-function $Q(x_k, u_k, N-k)$ as
$Q(x_k, u_k, N-k) = r(x_k, u_k) + J_{k+1} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T G_k \begin{bmatrix} x_k \\ u_k \end{bmatrix}$   (7)
where $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$ is the utility. The Bellman equation can be written as
$Q(x_k, u_k, N-k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + (A x_k + B u_k)^T S_{k+1} (A x_k + B u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T S_{k+1} A & A^T S_{k+1} B \\ B^T S_{k+1} A & R + B^T S_{k+1} B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$   (8)
Therefore, define the new time-varying matrix $G_k$ as
$G_k = \begin{bmatrix} G_k^{xx} & G_k^{xu} \\ G_k^{ux} & G_k^{uu} \end{bmatrix} = \begin{bmatrix} Q + A^T S_{k+1} A & A^T S_{k+1} B \\ B^T S_{k+1} A & R + B^T S_{k+1} B \end{bmatrix}$   (9)
Comparing with (4), the Kalman gain can be represented in terms of $G_k$ as
$K_k = \left( G_k^{uu} \right)^{-1} G_k^{ux}$   (10)
Therefore, by using adaptive control schemes, the time-varying Q-function $Q(x_k, u_k, N-k)$, which includes the information of $G_k$, can be solved in an online manner. Subsequently, the control input can be obtained by using (10).

3.2 Model-free Estimator
Online Tuning with the Q-function
In this subsection, to overcome the drawback of the iteration-based schemes mentioned before, the finite-horizon optimal regulation scheme is proposed by incorporating the history information of both the system states and the utilities. To properly satisfy the terminal constraint, an error term for the terminal constraint is defined and minimized along the system evolution. Before proceeding, the following assumption is introduced.
Assumption 1 (Linear in the unknown parameters): The Q-function $Q(x_k, u_k, N-k)$ can be expressed as linear in the unknown parameters (LIP).
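The block structure of (9)-(10) is the key to the model-free design: once $G_k$ is available, the gain follows without $A$ or $B$. The sketch below illustrates this relationship; it uses the known-model expression for $G_k$ only to verify that (10) reproduces (4), and the matrix values are illustrative, not from the paper's simulation.

```python
import numpy as np

def gain_from_G(G, n):
    """Extract K_k = (G^uu)^{-1} G^ux from the (n+m)x(n+m) matrix G_k of (9)."""
    G_uu = G[n:, n:]                      # block R + B^T S_{k+1} B
    G_ux = G[n:, :n]                      # block B^T S_{k+1} A
    return np.linalg.solve(G_uu, G_ux)    # gain of (10)

# Verify that (10) agrees with (4) for illustrative matrices
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.eye(1); S_next = np.eye(2)
G = np.block([[Q + A.T @ S_next @ A, A.T @ S_next @ B],
              [B.T @ S_next @ A, R + B.T @ S_next @ B]])
K_from_G = gain_from_G(G, n=2)
K_direct = np.linalg.solve(B.T @ S_next @ B + R, B.T @ S_next @ A)  # (4)
# The two gains coincide, so estimating G_k suffices for control
```

In the proposed scheme, of course, $G_k$ itself is never formed from $A$, $B$, and $S_{k+1}$; it is recovered from the estimated Q-function parameters, as developed next.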
By adaptive control theory and the Q-function definition, $Q(x_k, u_k, N-k)$ can be written in the vector form
$Q(x_k, u_k, N-k) = z_k^T G_k z_k = g_k^T \bar{z}_k$   (11)
where $z_k = [x_k^T \ u_k^T]^T \in \mathbb{R}^{n+m}$, $l = n+m$, $\bar{z}_k = \left( z_{k1}^2, \ldots, z_{k1} z_{kl}, z_{k2}^2, \ldots, z_{k(l-1)} z_{kl}, z_{kl}^2 \right)^T$ is the Kronecker product quadratic polynomial basis vector, and $g_k = vec(G_k)$, where $vec(\cdot)$ is a vector function that acts on an $l \times l$ matrix and returns an $l(l+1)/2$-dimensional column vector. The output of $vec(\cdot)$ is constructed by stacking the columns of the square matrix into a one-column vector, with the off-diagonal elements summed as $G_{mn} + G_{nm}$.
Based on Assumption 1, define $g_k$ as
$g_k = \theta^T \sigma(N-k)$   (12)
where $\theta$ is the target parameter vector of the time-invariant part of $g_k$ and $\sigma(N-k)$ is the time-varying basis function matrix. By [9], the standard Bellman equation can be represented in terms of the Q-function as
$Q(x_{k+1}, u_{k+1}, N-(k+1)) - Q(x_k, u_k, N-k) + r(x_k, u_k) = 0$   (13)
However, (13) no longer holds when the estimated value $\hat{g}_k$ is used. To estimate the time-varying matrix $G_k$, define
$\hat{g}_k = \hat{\theta}_k^T \sigma(N-k)$   (14)
where $\hat{\theta}_k$ is the estimated value of $\theta$. Therefore, the Q-function estimate can be written as
$\hat{Q}(x_k, u_k, N-k) = \hat{g}_k^T \bar{z}_k = \hat{\theta}_k^T \sigma(N-k) \bar{z}_k = \hat{\theta}_k^T X_k$   (15)
where $X_k = \sigma(N-k) \bar{z}_k$ is a time-dependent regression function incorporating the terminal time $N$ and satisfying $X_k = 0$ when $z_k = 0$. Note that since the time of interest is finite, the time-dependent basis function is bounded as $\sigma_{\min} \le \|\sigma(N-k)\| \le \sigma_{\max}$ for $k = 0, 1, \ldots, N$, where $\sigma_{\min}$ and $\sigma_{\max}$ are positive constants. Moreover, the Q-function estimate is also bounded as $\hat{Q}_{\min}(x_k, u_k) \le \hat{Q}(x_k, u_k, N-k) \le \hat{Q}_{\max}(x_k, u_k)$, where $\hat{Q}_{\min}(x_k, u_k) = \hat{\theta}_k^T \sigma_{\min} \bar{z}_k$ and $\hat{Q}_{\max}(x_k, u_k) = \hat{\theta}_k^T \sigma_{\max} \bar{z}_k$. It should be noted that $\hat{Q}_{\min}(x_k, u_k)$ and $\hat{Q}_{\max}(x_k, u_k)$ are time-independent functions and are used for the non-autonomous analysis.
Remark 2: In the infinite-horizon case, the desired value of $g$ becomes time-invariant [5], and hence the time-varying term $\sigma(N-k)$ vanishes in (14). By contrast, in the finite-horizon case, the desired value of $g_k$ becomes time-varying. Therefore, the basis function is taken as the product of the quadratic polynomial of the states and a time-dependent basis function.
With the estimated value of the Q-function, the Bellman equation can be expressed as
$\hat{Q}(x_{k+1}, u_{k+1}, N-(k+1)) - \hat{Q}(x_k, u_k, N-k) + r(x_k, u_k) = e_{k+1}$   (16)
where $e_{k+1}$ is the Bellman estimation error along the system trajectory. Using one-time-step-delayed values for convenience, the Bellman estimation error can be written as
$e_k = r(x_{k-1}, u_{k-1}) + \hat{\theta}_k^T X_k - \hat{\theta}_k^T X_{k-1} = r(x_{k-1}, u_{k-1}) + \hat{\theta}_k^T \Delta X_{k-1}$   (17)
where $\Delta X_{k-1} = X_k - X_{k-1} = \sigma(N-k) \bar{z}_k - \sigma(N-k+1) \bar{z}_{k-1}$ is bounded as $\Delta X_{\min} \le \|\Delta X_{k-1}\| \le \Delta X_{\max}$ for positive constants determined by the bounds of $\sigma(\cdot)$ and $\bar{z}$. The dynamics of the Bellman estimation error can thus be rewritten as
$e_{k+1} = r(x_k, u_k) + \hat{\theta}_{k+1}^T \Delta X_k$   (18)
Next, introduce an auxiliary error vector which incorporates the history of past utilities as
$\Xi_k = \Gamma_{k-1} + \hat{\theta}_k^T \Omega_{k-1}$   (19)
where $\Gamma_{k-1} = [r(x_{k-1}, u_{k-1}), r(x_{k-2}, u_{k-2}), \ldots, r(x_{k-1-j}, u_{k-1-j})]$ and $\Omega_{k-1} = [\Delta X_{k-1}, \Delta X_{k-2}, \ldots, \Delta X_{k-1-j}]$ for $0 < j < k-1$. Since $\Delta X_k$ is bounded, $\Omega_k$ is bounded as $\Omega_{\min}(\bar{z}) \le \|\Omega_k\| \le \Omega_{\max}(\bar{z})$. It is clear that (19) includes the previous $j+1$ Bellman estimation errors, which are recalculated by using the most recent $\hat{\theta}_k$.
Similar to (19), the dynamics of the auxiliary error vector are generated as
$\Xi_{k+1} = \Gamma_k + \hat{\theta}_{k+1}^T \Omega_k$   (20)
In the finite-horizon optimal regulation problem, the terminal constraint on the value function should also be considered. The estimated value function at the terminal stage is defined as
$\hat{Q}(x_N) = \hat{\theta}_k^T \sigma(0) \bar{z}_N$   (21)
In (21), it should be noted that the time-varying basis function $\sigma(N-k)$ at the terminal stage is $\sigma(0)$, since the time index, by definition of $\sigma(N-k)$, is taken as the time-to-go and hence runs in reverse order.
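The quadratic basis $\bar{z}_k$ and the $vec(\cdot)$ operation with summed off-diagonal terms, as used in (11), can be sketched as follows. The construction below is an illustrative implementation of that identity only; the helper names are not from the paper.

```python
import numpy as np

def quad_basis(z):
    """Kronecker-product quadratic basis of (11): products z_i * z_j
    over the upper triangle (i <= j) of the outer product z z^T."""
    l = len(z)
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def vec_sym(G):
    """vec(.) of (11): stack the upper triangle of the square matrix G,
    with off-diagonal entries summed as G_mn + G_nm."""
    l = G.shape[0]
    return np.array([G[i, j] if i == j else G[i, j] + G[j, i]
                     for i in range(l) for j in range(i, l)])

# Check the identity g_k^T zbar_k = z_k^T G_k z_k for a random symmetric G
rng = np.random.default_rng(0)
z = rng.standard_normal(3)                 # l = n + m = 3 here
G = rng.standard_normal((3, 3)); G = (G + G.T) / 2
lhs = vec_sym(G) @ quad_basis(z)
rhs = z @ G @ z
```

With $l = 3$ the basis has $l(l+1)/2 = 6$ entries, matching the dimension counted below (11); for the simulation in Section 4, $l = 4$ gives the 10-dimensional $\bar{z}_k$ reported there.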
Next, the terminal constraint error vector is defined as
$\Xi_{k,N} = \hat{g}_{k,N} - g_N = \hat{\theta}_k^T \sigma(0) - g_N$   (22)
where $g_N$ is bounded as $\|g_N\| \le g_M$.
Remark 3: In the finite-horizon case, the terminal error $\Xi_{k,N}$, which is the difference between the estimated and true values of the terminal constraint (in our case, $g_N$), is critical for the optimal control design. By minimizing $\Xi_{k,N}$ along the system trajectory, the terminal constraint can be satisfied. The Bellman estimation error term $\Xi_k$ will always exist, for both the finite-horizon and the infinite-horizon problem, as long as an estimated value function is used; see [5] for the infinite-horizon case.
Now, the total error vector is defined as
$\Xi_{k,total} = \Xi_k + \Xi_{k,N}$   (23)
Next, to incorporate the terminal constraint effect, define
$\Pi_k = \Omega_k + \sigma(0)$   (24)
It should be noted that $\Pi_k$ is bounded as $\Pi_{\min}(x) \le \|\Pi_k\| \le \Pi_{\max}(x)$, where $\Pi_{\min}(x) = \Omega_{\min}(x) + \sigma(0)$ and $\Pi_{\max}(x) = \Omega_{\max}(x) + \sigma(0)$. The update law for tuning $\hat{\theta}_k$ is defined as
$\hat{\theta}_{k+1} = \Pi_k \left( \Pi_k^T \Pi_k \right)^{-1} \left( \alpha \Xi_{k,total}^T - \Gamma_k^T \right)$   (25)
where $0 < \alpha < 1$ is a design parameter. Note that the update law (25) is essentially in the least-squares sense. Expanding (25) by using (23), we have
$\hat{\theta}_{k+1} = \Pi_k \left( \Pi_k^T \Pi_k \right)^{-1} \left( \alpha \Xi_k^T + \alpha \Xi_{k,N}^T - \Gamma_k^T \right) = \Pi_k \left( \Pi_k^T \Pi_k \right)^{-1} \left( \alpha \Xi_k^T - \Gamma_k^T \right) + \alpha \Pi_k \left( \Pi_k^T \Pi_k \right)^{-1} \Xi_{k,N}^T$   (26)
Note that from (24) we have $\Omega_k = \Pi_k - \sigma(0)$. Then (20) becomes
$\Xi_{k+1} = \Gamma_k + \hat{\theta}_{k+1}^T \Omega_k = \Gamma_k + \hat{\theta}_{k+1}^T \left( \Pi_k - \sigma(0) \right) = \Gamma_k + \hat{\theta}_{k+1}^T \Pi_k - \hat{g}_{k+1,N}$   (27)
To find the error dynamics for $\hat{\theta}_k$, substituting (26) into (27) renders
$\Xi_{k+1} = \Gamma_k + \hat{\theta}_{k+1}^T \Pi_k - \hat{\theta}_{k+1}^T \sigma(0) = \alpha \Xi_k + \alpha \Xi_{k,N} - \hat{\theta}_{k+1}^T \sigma(0)$   (28)
Equation (28) clearly shows that the Bellman estimation error is coupled with the terminal constraint estimation error. Therefore, the dynamics of the total error $\Xi_{k,total}$ are given by
$\Xi_{k+1,total} = \Xi_{k+1} + \Xi_{k+1,N} = \alpha \Xi_k + \alpha \Xi_{k,N} - \hat{\theta}_{k+1}^T \sigma(0) + \Xi_{k+1,N} = \alpha \left( \Xi_k + \Xi_{k,N} \right) - g_N$   (29)
Define the parameter estimation error for $\hat{\theta}_k$ as
$\tilde{\theta}_k = \theta - \hat{\theta}_k$   (30)
Recall from (19) that the utility vector can be written as $\Gamma_k = -\theta^T \Omega_k$. Then we have
$\Xi_{k+1} = \Gamma_k + \hat{\theta}_{k+1}^T \Omega_k = -\theta^T \Omega_k + \hat{\theta}_{k+1}^T \Omega_k = -\tilde{\theta}_{k+1}^T \Omega_k$   (31)
From (23) and (29), we further have
$-\tilde{\theta}_{k+1}^T \Omega_k = \alpha \Xi_k + \alpha \Xi_{k,N} - g_N - \Xi_{k+1,N}$   (32)
Note that $\Xi_{k,N} = \hat{g}_{k,N} - g_N = \hat{\theta}_k^T \sigma(0) - \theta^T \sigma(0) = -\tilde{\theta}_k^T \sigma(0)$ and, similarly, $\Xi_{k+1,N} = -\tilde{\theta}_{k+1}^T \sigma(0)$. Then, with $\Xi_k = -\tilde{\theta}_k^T \Omega_{k-1}$, (32) becomes
$-\tilde{\theta}_{k+1}^T \Omega_k = -\alpha \tilde{\theta}_k^T \Omega_{k-1} - \alpha \tilde{\theta}_k^T \sigma(0) - g_N + \tilde{\theta}_{k+1}^T \sigma(0)$   (33)

3.3 Estimation of the Optimal Control and Algorithm
By [9], the optimal control can be obtained by minimizing the value function. Recalling (10), the estimated control input is given by
$\hat{u}_k = -\left( R + B^T \hat{S}_{k+1} B \right)^{-1} B^T \hat{S}_{k+1} A x_k = -\left( \hat{G}_k^{uu} \right)^{-1} \hat{G}_k^{ux} x_k$   (36)
From (36), the Kalman gain can be computed from the matrix $\hat{G}_k$, which is obtained from the estimated Q-function. This avoids the need for the system matrices $A$ and $B$, while the update law (25) obviates policy and/or value iterations. Note that the Q-function (11) and the control input (36) are updated at each time step. The flowchart of the proposed scheme is shown in Fig. 1; the algorithm proceeds as follows:
1. Initialization: $\hat{V}_0(x) = 0$, $u = u_0$.
2. Update the finite-horizon Bellman equation and terminal constraint errors: $\Xi_k = \Gamma_{k-1} + \hat{\theta}_k^T \Omega_{k-1}$ and $\Xi_{k,N} = \hat{\theta}_k^T \sigma(0) - g_N$.
3. Update the adaptive estimator parameters with the auxiliary error vectors: $\hat{\theta}_{k+1} = \Pi_k \left( \Pi_k^T \Pi_k \right)^{-1} \left( \alpha \Xi_{k,total}^T - \Gamma_k^T \right)$, with $\Pi_k = \Omega_k + \sigma(0)$.
4. Update the finite-horizon control policy: $\hat{G}_k = vec^{-1}(\hat{g}_k) = vec^{-1}\left( \hat{\theta}_k^T \sigma(N-k) \right)$ and $\hat{u}_k = -\left( \hat{G}_k^{uu} \right)^{-1} \hat{G}_k^{ux} x_k$.
5. Update the time interval: $k = k+1$, $k = 1, 2, \ldots, N-1$; stop when $k = N$.
Fig. 1. Finite-horizon optimal design
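One step of the least-squares update law (25) can be sketched numerically as follows. The data here are random placeholders standing in for $\Pi_k$, $\Gamma_k$, and $\Xi_{k,total}$ (the paper's simulation is not reproduced); the check confirms the projection property $\hat{\theta}_{k+1}^T \Pi_k = \alpha \Xi_{k,total} - \Gamma_k$ that drives the error dynamics (28).

```python
import numpy as np

rng = np.random.default_rng(1)
p, L = 8, 5                            # p parameters, history-stack width L (p >= L)
Pi = rng.standard_normal((p, L))       # placeholder for Pi_k = Omega_k + sigma(0)
Gamma = rng.standard_normal(L)         # placeholder for past utilities Gamma_k
Xi_total = rng.standard_normal(L)      # placeholder for Xi_k + Xi_{k,N}
alpha = 0.5                            # design parameter, 0 < alpha < 1

# Update law (25): theta_{k+1} = Pi (Pi^T Pi)^{-1} (alpha*Xi_total - Gamma)
theta_next = Pi @ np.linalg.solve(Pi.T @ Pi, alpha * Xi_total - Gamma)

# Property used in deriving (28): theta_{k+1}^T Pi_k = alpha*Xi_total - Gamma
residual = theta_next @ Pi - (alpha * Xi_total - Gamma)
```

A full-rank $\Pi_k$ (here guaranteed almost surely by the Gaussian placeholder with $p \ge L$) is needed for the inverse in (25), which is the algebraic face of the persistency-of-excitation requirement discussed in Remark 4.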
Therefore, we have
$-\tilde{\theta}_{k+1}^T \left( \Omega_k + \sigma(0) \right) = -\alpha \tilde{\theta}_k^T \left( \Omega_{k-1} + \sigma(0) \right) - g_N$   (34)
From (24), (34) can finally be written as
$\tilde{\theta}_{k+1}^T \Pi_k = \alpha \tilde{\theta}_k^T \Pi_{k-1} + g_N$   (35)
Remark 4: It can be seen from the definition of the Q-function (11) that the Q-function estimation will no longer update once the system states converge to zero. This can be viewed as a persistency of excitation (PE) requirement: the system states must be persistently exciting long enough for the estimator to properly learn the Q-function. The PE condition is a standard assumption in adaptive control and can be satisfied by adding exploration noise; in this work, the same approach is taken to satisfy the PE condition.

3.4 Stability Analysis
In this subsection, both the estimation error $\tilde{\theta}_k$ and the closed-loop system are shown to be uniformly ultimately bounded (UUB). The closed-loop system becomes essentially non-autonomous, in contrast with [4], because of the time-dependent nature of the finite horizon. Due to the page limit, only the mathematical claims are presented; the proofs are omitted.
Theorem 1: Let the initial value $\hat{g}_0$ be bounded, let $u_0(k)$ be an initial admissible control for the linear system (1), and let the update law for $\hat{\theta}_k$ be given by (25). Then there exists a positive constant $\alpha$ satisfying $0 < \alpha < 1$ such that the parameter estimation error $\tilde{\theta}_k$ is UUB.
Proof: Omitted due to the page limit.
Before showing the closed-loop stability, the following lemma is also needed.
Lemma 1 (Bounds on the closed-loop dynamics with admissible control): Consider the linear discrete-time system (1). Then there exists an admissible control $u_k$ such that the closed-loop dynamics can be bounded as
$\|A x_k + B u_k\|^2 \le \rho^2 \|x_k\|^2$   (37)
where $0 < \rho < 1/2$ is a constant.
Theorem 2 (Boundedness of the closed-loop system): Let $u_0(k)$ be an initial admissible control policy, and let the parameter vector of the Q-function estimator and the estimated control policy be provided by (25) and (36), respectively. Then there exist positive constants $\alpha$ and $\rho$ satisfying $0 < \alpha < 1$ and $0 < \rho < 1/2$ such that the closed-loop system is UUB. Furthermore, the bounds for both the states and the estimation error decrease as $N$ increases.
Proof: Omitted due to the page limit.

4. SIMULATION RESULTS
In this section, a practical example is used to evaluate the feasibility of the proposed finite-horizon optimal regulation design. Consider the continuous-time F-16 aircraft model given as
$\dot{x} = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u$   (38)
where $x = [\alpha \ q \ \delta_e]^T$, with $\alpha$ the angle of attack, $q$ the pitch rate, and $\delta_e$ the elevator deflection angle. The control input is taken as the elevator actuator voltage. Discretizing the system with a sampling time $T_s = 0.1$ sec, we obtain the discrete-time version of the system as
$x_{k+1} = \begin{bmatrix} 0.9065 & 0.0816 & -0.0009 \\ 0.0741 & 0.9012 & -0.0159 \\ 0 & 0 & 0.9048 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ -0.0008 \\ 0.0952 \end{bmatrix} u_k$   (39)
The cost function is taken as
$J_0 = x_N^T S_N x_N + \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right)$   (40)
The weighting matrices $Q$ and $R$ and the terminal penalty matrix $S_N$ are selected to be identity matrices of appropriate dimensions. The initial system state and initial admissible control gain are selected as $x_0 = [1, 1, 0.5]^T$ and $K_0 = [0.3, 0.3, 1.2]$, respectively. The design parameter is selected as $\alpha = 0.001$. The time-dependent basis function $\sigma(N-k)$ is chosen as a polynomial of the time-to-go with saturation; the saturation ensures that the magnitude of $\sigma(N-k)$ stays within a reasonable range so that $\hat{\theta}_k$ remains computable. The initial values for $\hat{\theta}_k$ are set to zero. The simulation results are given below.
First, the response of the system with the proposed control design is examined. The augmented state is $z_k = [x_k^T \ u_k^T]^T \in \mathbb{R}^4$, and hence $\bar{z}_k \in \mathbb{R}^{10}$. From Figs. 2 and 3, it can be seen that both the system states and the control input finally converge close to zero, which first verifies the stability of the proposed design scheme.
Fig. 2. System response
Fig. 3. Control input
Fig. 4.
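The zero-order-hold discretization step from (38) to (39) can be reproduced with the matrix exponential of an augmented matrix, using $A_d = e^{A T_s}$ and $B_d = \int_0^{T_s} e^{A\tau} B \, d\tau$. The series-based exponential below is a self-contained sketch; in practice a library routine would be used.

```python
import numpy as np

def expm_series(M, terms=30):
    """Matrix exponential via truncated Taylor series
    (adequate here since ||M * Ts|| is small)."""
    E = np.eye(M.shape[0])
    T = np.eye(M.shape[0])
    for i in range(1, terms):
        T = T @ M / i          # accumulate M^i / i!
        E = E + T
    return E

# Continuous-time F-16 short-period model of (38)
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B = np.array([[0.0], [0.0], [1.0]])
Ts = 0.1

# ZOH trick: expm([[A, B], [0, 0]] * Ts) = [[Ad, Bd], [0, I]]
n, m = A.shape[0], B.shape[1]
M = np.zeros((n + m, n + m))
M[:n, :n] = A
M[:n, n:] = B
Md = expm_series(M * Ts)
Ad, Bd = Md[:n, :n], Md[:n, n:]
# e.g. Ad[2, 2] = exp(-0.1) ~ 0.9048, Bd[2, 0] = 1 - exp(-0.1) ~ 0.0952
```

The computed $A_d$ and $B_d$ match the rounded entries of (39) to about three decimal places, which also serves as a consistency check on the reconstructed signs of the discrete model.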
Convergence of error terms
Next, to verify optimality and the satisfaction of the terminal constraint, the error histories are plotted in Fig. 4. The figure clearly shows that the Bellman error $e$ eventually converges close to zero, which ensures the optimality of the system. More importantly, Fig. 4 also shows that the terminal constraint error $e_N$ converges close to zero, which illustrates that the terminal constraint is satisfied by the proposed controller design. Finally, for comparison purposes, the error between the cost of the traditional backward RE-based method and that of the proposed algorithm is shown in Fig. 5. It can be seen that the difference between the two costs converges close to zero more quickly than the system response, which illustrates that the proposed algorithm indeed yields a (near-)optimal control policy.

5. CONCLUSIONS
In this work, the finite-horizon optimal regulation problem for linear discrete-time systems with unknown system dynamics is addressed by using the ADP technique. An error term corresponding to the terminal constraint is defined together with the approximation error and minimized along the system trajectory. A time-dependent Q-function $Q(x_k, u_k, N-k)$ is estimated so that the system dynamics are not needed; the Q-function and the matrix $\hat{G}_k$ are learned by the Q-function estimator, and the optimal regulation problem is subsequently solved by using the estimated Q-function $\hat{Q}(x_k, u_k, N-k)$. Under the non-autonomous scheme, the boundedness of the closed-loop system is demonstrated by using Lyapunov stability theory. Policy and/or value iterations are not needed. Therefore, the proposed algorithm yields an online and forward-in-time control design scheme, which offers many practical benefits.

6. ACKNOWLEDGMENTS
The authors acknowledge the financial support for this research by the Intelligent Systems Center and ECCS-1128281.
This paper is based on [8], which has recently been accepted for publication.

Fig. 5. Cost function error between the traditional and proposed methods

7. REFERENCES
[1] Beard, R., 1995, "Improving the closed-loop performance of nonlinear systems," Ph.D. dissertation, Rensselaer Polytechnic Institute, USA.
[2] Cheng, T., Lewis, F.L., and Abu-Khalaf, M., 2007, "A neural network solution for fixed-final-time optimal control of nonlinear systems," Automatica, vol. 43, pp. 482-490.
[3] Watkins, C., 1989, "Learning from delayed rewards," Ph.D. dissertation, Cambridge University, England.
[4] Heydari, A., and Balakrishnan, S.N., 2011, "Finite-horizon input-constrained nonlinear optimal control using single network adaptive critics," in Proc. American Control Conf., San Francisco, USA, pp. 3047-3052.
[5] Xu, H., Jagannathan, S., and Lewis, F.L., 2012, "Stochastic optimal control of unknown networked control systems in the presence of random delays and packet losses," Automatica, vol. 48, pp. 1017-1030.
[6] Dierks, T., and Jagannathan, S., 2009, "Optimal control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proc. Conf. on Decision and Control, Shanghai, pp. 6750-6755.
[7] Dierks, T., and Jagannathan, S., 2012, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Networks and Learning Systems, vol. 23, pp. 1118-1129.
[8] Zhao, Q., Xu, H., and Jagannathan, S., 2013, "Finite-horizon optimal control design for uncertain linear discrete-time systems," to appear in Proc. of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Singapore.
[9] Lewis, F.L., and Syrmos, V.L., 1995, Optimal Control, 2nd edition, New York: Wiley.