Proceedings of the 7th Annual ISC Graduate Research Symposium
ISC-GRS 2013
April 24, 2013, Rolla, Missouri
Qiming Zhao
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO 65409
OPTIMAL ADAPTIVE CONTROLLER DESIGN FOR UNKNOWN LINEAR SYSTEMS
ABSTRACT
In this work, a finite-horizon optimal adaptive control design is presented for discrete-time linear systems with unknown system dynamics. A Q-learning scheme is utilized, and an adaptive estimator is proposed to learn the Q-function so that the system dynamics are not needed. The time-varying nature of the solution to the Bellman equation is handled by utilizing a time-dependent basis function, while the terminal constraint is incorporated in a novel update law for solving the optimal feedback control. The proposed optimal regulation scheme for the uncertain linear system yields a forward-in-time and online solution without using policy and/or value iterations. For time-invariant linear discrete-time systems, the closed-loop dynamics of the finite-horizon regulation problem become essentially non-autonomous and involved; stability is nevertheless verified by using standard Lyapunov theory. Simulation results are provided to verify the effectiveness of the proposed method.
1. INTRODUCTION
Optimal regulation of linear systems with a quadratic performance index (PI), i.e., the linear quadratic regulator (LQR) problem, has been one of the key focuses in control theory for several decades. For a linear system in the infinite-horizon case, the algebraic Riccati equation (ARE) is considered, and its solution converges and becomes time-invariant. However, in the finite-horizon scenario, the solution of the Riccati equation (RE), which is essentially time-varying [9], can only be obtained by solving the RE in a backward-in-time manner when the system matrices are known.
More recently, the authors in [2] proposed a fixed-final-time optimal control design using a neural network (NN) to solve the time-varying Hamilton-Jacobi-Bellman (HJB) equation for general affine nonlinear continuous-time systems. The time-varying NN weights and a state-dependent activation function are used to obtain the optimal control in a backward-in-time manner. In [4], the input-constrained finite-horizon optimal control problem is considered by using an offline NN training scheme. The time-varying nature of the finite horizon is handled by utilizing constant NN weights and time-varying activation functions. The efforts in [2] and [4] provided some insights into solving the finite-horizon problem, but the solutions are obtained either backward-in-time or offline.
To relax the requirement of system dynamics and achieve optimality, adaptive dynamic programming (ADP) techniques [3] are normally used to solve the optimal control problem in a forward-in-time fashion by using value and/or policy iterations. However, iteration-based schemes require a significantly large number of iterations within each time step to guarantee the stability of the system, and thus are not suitable for practical situations.
Motivated by the aforementioned deficiencies, in this work the ADP technique via reinforcement learning (RL) is utilized to solve the finite-horizon optimal regulation problem of a linear discrete-time system with unknown dynamics in an online and forward-in-time manner. Policy and/or value iterations are not used. The Bellman equation is utilized with an estimated Q-function so that the system dynamics are not needed. To properly satisfy the terminal constraint, an additional error term corresponding to the terminal constraint is defined and minimized at each time step in order to solve the optimal control problem within a finite time period. In addition, the controller functions in a forward-in-time fashion with no offline training phase. Due to the time-varying nature of the finite horizon, the closed-loop system becomes essentially non-autonomous, and Lyapunov stability theory is utilized to show the stability of the proposed design scheme.
2. PROBLEM FORMULATION
Consider the time-invariant linear discrete-time system described by
$x_{k+1} = A x_k + B u_k$   (1)
where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ are the system state vector and the control input vector, respectively. The system matrices $A$ and $B$ are assumed to be unknown and of appropriate dimensions. In this paper, it is also assumed that the system states are available for measurement.
The objective of the control design is to determine a state feedback control policy which minimizes the cost function
$J_0 = x_N^T S_N x_N + \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right)$   (2)
where $Q$ and $R$ are the weighting matrices for the system states and control inputs, assumed to be symmetric positive semi-definite and symmetric positive definite, respectively, and $S_N$ is a symmetric positive semi-definite matrix that penalizes the system states at the terminal stage $N$.
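For concreteness, the cost (2) can be evaluated along a recorded trajectory as in the short sketch below (illustrative Python; the function and variable names are not from the paper):

```python
import numpy as np

def finite_horizon_cost(xs, us, Q, R, S_N):
    """Evaluate J_0 of (2): terminal penalty x_N^T S_N x_N plus the
    running utilities x_k^T Q x_k + u_k^T R u_k for k = 0, ..., N-1."""
    J = xs[-1] @ S_N @ xs[-1]          # terminal-stage penalty
    for x, u in zip(xs[:-1], us):      # xs holds x_0..x_N, us holds u_0..u_{N-1}
        J += x @ Q @ x + u @ R @ u
    return J
```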
It is well known from conventional optimal control theory [9] that the finite-horizon optimal regulation problem can be addressed by solving the Riccati equation
$S_k = A^T [S_{k+1} - S_{k+1} B (B^T S_{k+1} B + R)^{-1} B^T S_{k+1}] A + Q$   (3)
in a backward-in-time manner, while the time-varying Kalman gain is given by
$K_k = (B^T S_{k+1} B + R)^{-1} B^T S_{k+1} A$   (4)
However, it can be seen clearly from (3) and (4) that the traditional design of the optimal controller is essentially an offline scheme. Due to the backward-in-time feature, such a design is not suitable for real-time implementation. Moreover, when the system dynamics are not known a priori, the backward-in-time solution is not even possible.
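For reference, a minimal sketch of this conventional backward-in-time recursion (3)-(4) is given below, assuming the matrices A, B, Q, R, and S_N are known; it is shown only to contrast with the online design developed in the next section:

```python
import numpy as np

def backward_riccati(A, B, Q, R, S_N, N):
    """Solve the RE (3) backward from the terminal value S_N and return
    the time-varying Kalman gains K_0, ..., K_{N-1} of (4)."""
    S = S_N
    gains = [None] * N
    for k in range(N - 1, -1, -1):
        # K_k = (B^T S_{k+1} B + R)^{-1} B^T S_{k+1} A
        K = np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)
        # S_k = A^T S_{k+1} A - A^T S_{k+1} B K_k + Q, an equivalent form of (3)
        S = A.T @ S @ A - A.T @ S @ B @ K + Q
        gains[k] = K
    return gains   # the optimal control is u_k = -K_k x_k
```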
It will be shown in the next section that the finite-horizon optimal regulation problem for an uncertain linear discrete-time system can be tackled in an online and forward-in-time manner. In addition, policy and/or value iterations are not needed, and the requirement of the system dynamics is relaxed for the optimal controller design, since a Q-learning scheme is undertaken as defined next.
3. FINITE-HORIZON OPTIMAL CONTROL DESIGN UNDER Q-LEARNING SCHEME
In this section, the finite-horizon optimal controller design for linear systems with uncertain system dynamics is addressed. A Q-function [3] is first defined and adaptively estimated by using reinforcement learning, which is in turn utilized to design the controller such that the system dynamics are not required. Next, an additional error term corresponding to the terminal constraint is defined and minimized at each time step so that the terminal constraint can be satisfied properly. Finally, the stability of the closed-loop system is analyzed under the non-autonomous scheme and verified based on standard Lyapunov stability and geometric control theory.
3.1 Q-function Setup
Before proceeding, it is important to note that in the case of finite horizon, the value function becomes time-dependent [9] and is denoted as $V(x_k, N-k)$, a function of both the system state and the time-to-go. For the linear system (1), by optimal control theory [9], the value function $V(x_k, N-k)$ can be expressed in the quadratic form
$V(x_k, N-k) = x_k^T S_k x_k$   (5)
where $S_k$ is the solution sequence of the time-varying Riccati equation.
According to [9], the optimal control input is obtained as
$u_k = -K_k x_k = -(B^T S_{k+1} B + R)^{-1} B^T S_{k+1} A x_k$   (6)
Remark 1: Equation (6) clearly shows that the conventional optimal control approach requires both system matrices $A$ and $B$, and that the solution of the RE is obtained backward-in-time from the terminal value $S_N$. Instead, with the ADP technique, the value function can be estimated and in turn used to derive the optimal control policy by using policy and/or value iterations, without the system dynamics and in a forward-in-time fashion. However, the available ADP iteration-based schemes are difficult to implement in practice, since an insufficient number of iterations within a time step can cause instability of the system [7].
Next, by reinforcement learning, we will show that the requirement of the system dynamics can be removed by estimating the time-dependent value function $V(x_k, N-k)$. Define the time-varying Q-function $Q(x_k, u_k, N-k)$ as
$Q(x_k, u_k, N-k) = r(x_k, u_k) + J_{k+1} = [x_k^T \; u_k^T] \, G_k \, [x_k^T \; u_k^T]^T$   (7)
where $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$ is the utility.
The Bellman equation can then be written as
$[x_k^T \; u_k^T] \, G_k \, [x_k^T \; u_k^T]^T = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T S_{k+1} (A x_k + B u_k)$   (8)
Therefore, collecting the quadratic terms on the right-hand side of (8), define the new time-varying matrix $G_k$ as
$G_k = \begin{bmatrix} Q + A^T S_{k+1} A & A^T S_{k+1} B \\ B^T S_{k+1} A & R + B^T S_{k+1} B \end{bmatrix} = \begin{bmatrix} G_k^{xx} & G_k^{xu} \\ G_k^{ux} & G_k^{uu} \end{bmatrix}$   (9)
Comparing with (4), the Kalman gain can be represented in terms of $G_k$ as
$K_k = (G_k^{uu})^{-1} G_k^{ux}$   (10)
Therefore, by using adaptive control schemes, the time-varying Q-function $Q(x_k, u_k, N-k)$, which includes the information of $G_k$, can be solved in an online manner. Subsequently, the control input can be obtained by using (10).
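As an illustration of (9) and (10), the sketch below builds G_k from known matrices purely for verification and extracts the gain from its blocks; in the proposed scheme G_k itself is estimated, so A and B appear here only to check the block structure:

```python
import numpy as np

def build_G(A, B, Q, R, S_next):
    """Assemble G_k of (9) from known matrices (for verification only)."""
    return np.block([[Q + A.T @ S_next @ A, A.T @ S_next @ B],
                     [B.T @ S_next @ A,     R + B.T @ S_next @ B]])

def gain_from_G(G, n, m):
    """Extract K_k = (G_k^uu)^{-1} G_k^ux of (10) from the blocks of G_k."""
    G_uu = G[n:, n:]   # m x m lower-right block, R + B^T S_{k+1} B
    G_ux = G[n:, :n]   # m x n lower-left block,  B^T S_{k+1} A
    return np.linalg.solve(G_uu, G_ux)
```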
3.2 Model-free Online Tuning with Q-function Estimator
In this subsection, to overcome the drawback of the iteration-based schemes mentioned before, the finite-horizon optimal regulation scheme is proposed by incorporating the history information of both the system states and the utilities. To properly satisfy the terminal constraint, an error term for the terminal constraint is defined and minimized along the system evolution. Before proceeding, the following assumption is introduced.
Assumption 1 (Linear in the unknown parameters): The Q-function $Q(x_k, u_k, N-k)$ can be expressed as linear in the unknown parameters (LIP).
By adaptive control theory and the Q-function definition, $Q(x_k, u_k, N-k)$ can be written in the vector form
$Q(x_k, u_k, N-k) = z_k^T G_k z_k = g_k^T \bar{z}_k$   (11)
where $z_k = [x_k^T \; u_k^T]^T \in \mathbb{R}^l$ with $l = n+m$, $\bar{z}_k = (z_{k1}^2, \ldots, z_{k1} z_{kl}, z_{k2}^2, \ldots, z_{k,l-1} z_{kl}, z_{kl}^2)^T$ is the Kronecker product quadratic polynomial basis vector, and $g_k = vec(G_k)$, where $vec(\cdot)$ is a vector function that acts on an $l \times l$ matrix and returns an $l(l+1)/2$-dimensional column vector. The output of $vec(\cdot)$ is constructed by stacking the columns of the square matrix into a one-column vector, with the off-diagonal elements summed as $G_{mn} + G_{nm}$.
Based on Assumption 1, define $g_k$ as
$g_k = \theta^T \sigma(N-k)$   (12)
where $\theta$ is the target parameter vector of the time-invariant part of $g_k$ and $\sigma(N-k)$ is the time-varying basis function matrix.
By [9], the standard Bellman equation can be represented in terms of the Q-function as
$Q(x_{k+1}, u_{k+1}, N-(k+1)) - Q(x_k, u_k, N-k) + r(x_k, u_k) = 0$   (13)
However, (13) no longer holds when the estimated value $\hat{g}_k$ is used.
To estimate the time-varying matrix $G_k$, define
$\hat{g}_k = \hat{\theta}_k^T \sigma(N-k)$   (14)
where $\hat{\theta}_k$ is the estimated value of $\theta$.
Therefore, the Q-function estimate can be written as
$\hat{Q}(x_k, u_k, N-k) = \hat{g}_k^T \bar{z}_k = \hat{\theta}_k^T \sigma(N-k) \bar{z}_k = \hat{\theta}_k^T X_k$   (15)
where $X_k = \sigma(N-k) \bar{z}_k$ is a time-dependent regression function incorporating the terminal time $N$ and satisfying $X_k \to 0$ when $\bar{z}_k \to 0$. Note that since the time of interest is finite, the time-dependent function $\sigma(N-k)$ is bounded as $\sigma_{\min} \le \|\sigma(N-k)\| \le \sigma_{\max}$, $k = 0, 1, \ldots, N$, where $\sigma_{\min}$ and $\sigma_{\max}$ are positive constants. Moreover, the Q-function is also bounded by $\hat{Q}_{\min}(x_k, u_k) \le \hat{Q}(x_k, u_k, N-k) \le \hat{Q}_{\max}(x_k, u_k)$, where $\hat{Q}_{\min}(x_k, u_k) = \hat{\theta}_k^T \sigma_{\min} \bar{z}_k$ and $\hat{Q}_{\max}(x_k, u_k) = \hat{\theta}_k^T \sigma_{\max} \bar{z}_k$. It should be noted that $\hat{Q}_{\min}(x_k, u_k)$ and $\hat{Q}_{\max}(x_k, u_k)$ are time-independent functions and are used for the non-autonomous analysis.
Remark 2: In the case of infinite horizon, the desired value of $g$ becomes time-invariant [5], and hence the time-varying term $\sigma(N-k)$ vanishes in (14). By contrast, in the case of finite horizon, the desired value of $g_k$ becomes time-varying. Therefore, the basis function is taken as the product of the system states and a time-dependent basis function.
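The quadratic basis z̄_k and the vec(·) convention of (11) can be sketched as follows (illustrative helper names; the identity z_k^T G_k z_k = vec(G_k)^T z̄_k is preserved by construction):

```python
import numpy as np

def quad_basis(z):
    """z_bar of (11): all monomials z_i z_j with i <= j, length l(l+1)/2."""
    l = len(z)
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def vec_sym(G):
    """vec(G) of (11): diagonal entries kept, off-diagonal pairs summed,
    so that z @ G @ z == vec_sym(G) @ quad_basis(z)."""
    l = G.shape[0]
    return np.array([G[i, i] if i == j else G[i, j] + G[j, i]
                     for i in range(l) for j in range(i, l)])
```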
where Χk 1  Χk  Χk 1   ( N  k )z k   ( N  k  1)z k 1 and
is bounded by X min  X k 1  X max , with X min  min z k 
maxz k 1 and X max  maxz k  min z k .
The dynamics of the Bellman estimation error can be thus
rewritten as
ek 1  r (xk , uk )  ˆkT1Χk
Next, introduce an auxiliary error
incorporates the history of past utilities as
Ξ  Γ  ˆT Ω
vector
(18)
which
(19)
Γ k 1  [r (x k 1 , u k 1 ), r (x k  2 , u k  2 ),  , r (x k 1 j , u k 1 j )] and
k 1
k
with
k 1
k
for 0  j  k 1 . Again,
Ω k 1  [Χ k 1 , Χ k 2 ,, Χ k 1 j ]
X k
since
is
Ωk
bounded,
is
bounded
as
Ωmin ( z)  Ωk  Ωmax ( z) .
It is clear that (19) includes previous j  1 Bellman
estimation errors which are recalculated by using the most
recent ˆ .
k
(15)
where Χ k   (N  k ) zk is a time-dependent regression function
incorporating the terminal time N and satisfying
ek  r (xk 1 , uk 1 )  ˆkT Χk  ˆkT Χk 1  r (xk 1 , uk 1 )  ˆkT Χk 1 (17)
Χk  0
when zk  0 . Note that since the time of interest is considered
to be finite, the time dependent function  (k ) is bounded as
min   ( N  k )  max , k  0,1,...,N , where  min and  max
are positive constants. Moreover, Q-function is also bounded
Qˆ min (x k , u k )  Qˆ (x k , u k , N  k )  Qˆ max (x k , u k ) , where
Qˆ min (x k , u k )  ˆkT min z k and Qˆ max (x k , u k )  ˆkT  max z k . It
should be noted that Qmin (x k , u k ) and Qmax (x k , u k ) are time
independent functions and are used for non-autonomous
analysis.
Remark 2: In the case of infinite-horizon, since the desired
value of g becomes time-invariant [5], and hence the timevarying term  (N  k ) vanishes in (14). By contrast, in the case
of finite-horizon, the desired value of g k becomes timevarying. Therefore, the basis function can be taken as the
product of system states and time-dependent basis function.
by
With the estimated value of the Q-function, the Bellman
equation can be expressed as
Qˆ (xk 1 , uk 1 , N  (k  1))  Qˆ (xk , uk , N  k )  r (xk , uk )  ek 1 (16)
where ek 1 is the Bellman estimation error along the system
trajectory.
Using one-time step delayed value for convenience, the
Bellman estimation error can be written as
Similar to (19), the dynamics of the auxiliary error vector
are generated as
Ξk 1  Γk  ˆkT1Ωk
(20)
In the finite-horizon optimal regulation problem, the terminal constraint of the value function should also be considered. The estimated value function at the terminal stage is defined as
$\hat{Q}(x_N) = \hat{\theta}_k^T \sigma(0) \bar{z}_N$   (21)
In (21), it should be noted that the time-varying basis function $\sigma(N-k)$ at the terminal stage is written as $\sigma(0)$, since the time index, by definition of $\sigma(N-k)$, is taken as the time-to-go and is hence in reverse order.
Next, the terminal constraint error vector is defined as
$\Xi_{k,N} = \hat{g}_{k,N} - g_N = \hat{\theta}_k^T \sigma(0) - g_N$   (22)
where $g_N$ is bounded by $\|g_N\| \le g_M$.
Remark 3: In the case of finite horizon, the terminal error $\Xi_{k,N}$, which is the difference between the estimated and true values of the terminal constraint (in our case, $g_N$), is critical for the optimal control design. By minimizing $\Xi_{k,N}$ along the system trajectory, the terminal constraint can be satisfied. The Bellman estimation error term $\Xi_k$ will always exist for both the finite-horizon and infinite-horizon problems as long as an estimated value function is used. See [5] for the infinite-horizon case.
Now, the total error vector is defined as
$\Xi_{k,total} = \Xi_k + \Xi_{k,N}$   (23)
Next, to account for the terminal constraint effect, define
$\Pi_k = \Omega_k - \sigma(0)$   (24)
It should be noted that $\Pi_k$ is bounded by $\Pi_{\min}(x) \le \|\Pi_k\| \le \Pi_{\max}(x)$, where $\Pi_{\min}(x) = \Omega_{\min}(x) - \sigma(0)$ and $\Pi_{\max}(x) = \Omega_{\max}(x) - \sigma(0)$.
The update law for tuning $\hat{\theta}_k$ is defined as
$\hat{\theta}_{k+1} = \Pi_k (\Pi_k^T \Pi_k)^{-1} (\Gamma_k^T - \alpha \Xi_{k,total}^T)$   (25)
where $0 < \alpha < 1$ is a design parameter. Also note that the update law defined in (25) is essentially a least-squares solution.
Expanding (25) by using (23), we have
$\hat{\theta}_{k+1} = \Pi_k (\Pi_k^T \Pi_k)^{-1} (\Gamma_k^T - \alpha \Xi_k^T - \alpha \Xi_{k,N}^T) = \Pi_k (\Pi_k^T \Pi_k)^{-1} (\Gamma_k^T - \alpha \Xi_k^T) - \alpha \Pi_k (\Pi_k^T \Pi_k)^{-1} \Xi_{k,N}^T$   (26)
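A minimal sketch of one step of the update law (25) follows; the small ridge term eps is an implementation detail added here only for numerical safety and is not part of (25):

```python
import numpy as np

def update_theta(Pi, Gamma, Xi_total, alpha, eps=1e-8):
    """One step of (25): theta_{k+1} = Pi (Pi^T Pi)^{-1} (Gamma - alpha*Xi_total)^T.
    Pi is p x (j+1); Gamma and Xi_total are length-(j+1) history rows."""
    M = Pi.T @ Pi + eps * np.eye(Pi.shape[1])   # ridge term: numerical safety only
    return Pi @ np.linalg.solve(M, Gamma - alpha * Xi_total)
```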
Note that from (24), we have $\Omega_k = \Pi_k + \sigma(0)$. Then (20) becomes
$\Xi_{k+1} = \Gamma_k - \hat{\theta}_{k+1}^T \Omega_k = \Gamma_k - \hat{\theta}_{k+1}^T (\Pi_k + \sigma(0)) = \Gamma_k - \hat{\theta}_{k+1}^T \Pi_k - \hat{g}_{k+1,N}$   (27)
To find the error dynamics for $\hat{\theta}_k$, substituting (26) into (27) renders
$\Xi_{k+1} = \Gamma_k - \hat{\theta}_{k+1}^T \Pi_k - \hat{\theta}_{k+1}^T \sigma(0) = \alpha \Xi_k + \alpha \Xi_{k,N} - \hat{\theta}_{k+1}^T \sigma(0)$   (28)
(28) clearly shows that the Bellman estimation error is
coupled with the terminal constraint estimation error.
Therefore, the dynamics of the total error $\Xi_{k,total}$ are given by
$\Xi_{k+1,total} = \Xi_{k+1} + \Xi_{k+1,N} = \alpha \Xi_k + \alpha \Xi_{k,N} - \hat{\theta}_{k+1}^T \sigma(0) + \Xi_{k+1,N} = \alpha \Xi_k + \alpha \Xi_{k,N} - g_N$   (29)
Define the parameter estimation error for $\hat{\theta}_k$ as
$\tilde{\theta}_k = \theta - \hat{\theta}_k$   (30)
Recall from (19) that the utility vector can be written as $\Gamma_k = \theta^T \Omega_k$. Then we have
$\Xi_{k+1} = \Gamma_k - \hat{\theta}_{k+1}^T \Omega_k = \theta^T \Omega_k - \hat{\theta}_{k+1}^T \Omega_k = \tilde{\theta}_{k+1}^T \Omega_k$   (31)
From (23), we further have
$\tilde{\theta}_{k+1}^T \Omega_k = \alpha \Xi_k + \alpha \Xi_{k,N} - g_N - \Xi_{k+1,N}$   (32)
Note that $\Xi_{k,N} = \hat{g}_{k,N} - g_N = \hat{\theta}_k^T \sigma(0) - \theta^T \sigma(0) = -\tilde{\theta}_k^T \sigma(0)$ and, similarly, $\Xi_{k+1,N} = -\tilde{\theta}_{k+1}^T \sigma(0)$. Then (32) becomes
$\tilde{\theta}_{k+1}^T \Omega_k = \alpha \tilde{\theta}_k^T \Omega_{k-1} - \alpha \tilde{\theta}_k^T \sigma(0) - g_N + \tilde{\theta}_{k+1}^T \sigma(0)$   (33)
Therefore, we have
$\tilde{\theta}_{k+1}^T (\Omega_k - \sigma(0)) = \alpha \tilde{\theta}_k^T (\Omega_{k-1} - \sigma(0)) - g_N$   (34)
From (24), (34) can finally be written as
$\tilde{\theta}_{k+1}^T \Pi_k = \alpha \tilde{\theta}_k^T \Pi_{k-1} - g_N$   (35)
Remark 4: It can be seen from the definition of the Q-function (11) that the Q-function estimation will no longer update once the system states converge to zero. This can be viewed as a persistency of excitation (PE) requirement: the system states must be persistently exciting long enough for the estimator to properly learn the Q-function. The PE condition is a standard assumption in adaptive control and can be satisfied by adding exploration noise. In this work, the same approach is taken to satisfy the PE condition.

3.3 Estimation of the Optimal Control and Algorithm
By [9], the optimal control can be obtained by minimizing the value function. Recalling (10), the estimated control input is given by
$\hat{u}_k = -(R + B^T \hat{S}_{k+1} B)^{-1} B^T \hat{S}_{k+1} A x_k = -(\hat{G}_k^{uu})^{-1} \hat{G}_k^{ux} x_k$   (36)
From (36), the Kalman gain can be computed based on the $\hat{G}_k$ matrix, which is obtained from the estimated Q-function. This avoids the need for the system matrices $A$ and $B$, while the update law (25) relaxes the policy and/or value iterations. Note that the Q-function (11) and the control input (36) are updated at each time step.
The flowchart of the proposed scheme is shown in Fig. 1: after initialization ($\hat{V}_0(x) = 0$, $u = u_0$), at each step the finite-horizon Bellman equation and terminal constraint errors $\Xi_k = \Gamma_{k-1} - \hat{\theta}_k^T \Omega_{k-1}$ and $\Xi_{k,N} = \hat{\theta}_k^T \sigma(0) - g_N$ are updated; the adaptive estimator parameters are tuned by $\hat{\theta}_{k+1} = \Pi_k (\Pi_k^T \Pi_k)^{-1} (\Gamma_k^T - \alpha \Xi_{k,total}^T)$ with $\Pi_k = \Omega_k - \sigma(0)$; the finite-horizon control policy is updated via $\hat{G}_k = vec^{-1}(\hat{g}_k) = vec^{-1}(\hat{\theta}_k^T \sigma(N-k))$ and $\hat{u}_k = -(\hat{G}_k^{uu})^{-1} \hat{G}_k^{ux} x_k$; and the time index advances ($k = k+1$, $k = 1, 2, \ldots, N-1$) until $k = N$.
Fig. 1. Finite-horizon optimal design
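Putting the pieces together, the policy-update box of Fig. 1 can be sketched as below, with hypothetical helper names and a small exploration-noise term added to maintain the PE condition of Remark 4:

```python
import numpy as np

def mat_from_vec(g, l):
    """Inverse of the vec(.) convention of (11): rebuild a symmetric l x l
    matrix, halving the summed off-diagonal entries."""
    G, idx = np.zeros((l, l)), 0
    for i in range(l):
        for j in range(i, l):
            G[i, j] = g[idx] if i == j else g[idx] / 2.0
            G[j, i] = G[i, j]
            idx += 1
    return G

def policy_update(theta_hat, sigma_k, x, n, m, noise=1e-3):
    """Policy-update step of Fig. 1: G_hat = vec^{-1}(theta_hat^T sigma(N-k)),
    then u_hat = -(G_hat^uu)^{-1} G_hat^ux x as in (36), plus exploration noise."""
    g_hat = theta_hat @ sigma_k            # estimated g_k, as in (14)
    G_hat = mat_from_vec(g_hat, n + m)
    K_hat = np.linalg.solve(G_hat[n:, n:], G_hat[n:, :n])
    return -K_hat @ x + noise * np.random.randn(m)
```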
3.4 Stability Analysis
In this subsection, both the parameter estimation error $\tilde{\theta}_k$ and the closed-loop system are shown to be uniformly ultimately bounded (UUB). The closed-loop system becomes essentially non-autonomous, in contrast with [4], because of the time-dependent nature of the finite horizon. Due to the page limit, only the mathematical claims are presented while the proofs are omitted.
Theorem 1: Let the initial value $\hat{g}_0$ be bounded, let $u_0(k)$ be an initial admissible control for the linear system (1), and let the update law for $\hat{\theta}_k$ be given by (25). Then, there exists a positive constant $\alpha$ satisfying $0 < \alpha < 1/2$ such that the parameter estimation error $\tilde{\theta}_k$ is UUB.
Proof: Omitted due to the page limit.
Before showing the closed-loop stability, the following lemma is also needed.
Lemma 1 (Bounds on the closed-loop dynamics with admissible control): Consider the linear discrete-time system (1). Then there exists an admissible control $u_k$ such that the closed-loop dynamics satisfy
$\|x_{k+1}\|^2 = \|A x_k + B u_k\|^2 \le \kappa \|x_k\|^2$   (37)
where $0 < \kappa < 1/2$ is a constant.
Theorem 2 (Boundedness of the Closed-loop System): Let $u_0(k)$ be an initial admissible control policy, and let the parameter vector of the Q-function estimator be tuned by (25) and the estimated control policy be provided by (36). Then, there exists a positive constant $\kappa$ satisfying $0 < \kappa < 1/2$ such that the closed-loop system is UUB. Furthermore, the bounds for both the states and the estimation error decrease as $N$ increases.
Proof: Omitted due to the page limit.

4. SIMULATION RESULTS
In this section, a practical example is used to evaluate the feasibility of the proposed finite-horizon optimal regulation design. Consider the continuous-time F-16 aircraft model given as
$\dot{x} = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u$   (38)
where $x = [\alpha \;\; q \;\; \delta_e]^T$, with $\alpha$ the angle of attack, $q$ the pitch rate, and $\delta_e$ the elevator deflection angle. The control input is taken as the elevator actuator voltage.
Discretizing the system with a sampling time $T_s = 0.1$ sec, we have the discrete-time version of the system as
$x_{k+1} = \begin{bmatrix} 0.9065 & 0.0816 & -0.0009 \\ 0.0741 & 0.9012 & -0.0159 \\ 0 & 0 & 0.9048 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ -0.0008 \\ 0.0952 \end{bmatrix} u_k$   (39)
The cost function is taken as
$J_0 = x_N^T S_N x_N + \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right)$   (40)
The weighting matrices $Q$, $R$ and the terminal penalty matrix $S_N$ are selected as identity matrices of appropriate dimensions. The initial system state and the initial admissible control gain are selected as $x_0 = [1, -1, 0.5]^T$ and $K_0 = [0.3, 0.3, 1.2]$, respectively. The design parameter is selected as $\alpha = 0.001$. The time-dependent basis function $\sigma(N-k)$ is chosen as a polynomial of the time-to-go with saturation; the saturation ensures that the magnitude of $\sigma(N-k)$ remains within a reasonable range so that $\hat{\theta}_k$ stays computable. The initial values of $\hat{\theta}_k$ are set to zero. The simulation results are given below.
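As a cross-check, the zero-order-hold discretization of (38) can be reproduced numerically; a sketch assuming SciPy is available (the entries should match (39) to roughly the displayed precision):

```python
import numpy as np
from scipy.signal import cont2discrete

# Continuous-time F-16 model of (38)
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B = np.array([[0.0], [0.0], [1.0]])
C, D = np.eye(3), np.zeros((3, 1))

# Zero-order-hold discretization with Ts = 0.1 s, cf. (39)
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, D), 0.1, method='zoh')
print(np.round(Ad, 4))
print(np.round(Bd, 4))
```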
Fig. 2. System response (system states vs. time index k; plot omitted)
Fig. 3. Control input (control input vs. time index k; plot omitted)
First, the response of the system with the proposed control design is examined. The augmented state is generated as $z_k = [x_k^T \; u_k^T]^T \in \mathbb{R}^4$, and hence $\bar{z}_k \in \mathbb{R}^{10}$. From Figs. 2 and 3, it can be seen that both the system states and the control input eventually converge close to zero, which first verifies the stability of the proposed design scheme.
Fig. 4. Convergence of error terms (Bellman error e and terminal constraint error e_N vs. time index k; plot omitted)
Next, to verify optimality and the satisfaction of the terminal constraint, the error histories are plotted in Fig. 4. The figure clearly shows that the Bellman error eventually converges close to zero, which ensures the optimality of the system. More importantly, Fig. 4 shows that the history of the terminal constraint error $e_N$ also converges close to zero, which illustrates that the terminal constraint is satisfied by the proposed controller design.
Finally, for comparison purposes, the error between the cost function of the traditional backward-in-time RE-based method and that of the proposed algorithm is shown in Fig. 5. It can be seen from Fig. 5 that the difference between the two costs converges close to zero more quickly than the system response, which illustrates that the proposed algorithm indeed yields a (near) optimal control policy.
Fig. 5. Cost function error between the traditional and the proposed method (cost difference vs. time index k; plot omitted)
5. CONCLUSIONS
In this work, the finite-horizon optimal regulation problem for linear discrete-time systems with unknown system dynamics is addressed by using the ADP technique. An error term corresponding to the terminal constraint, together with the approximation error, is defined and minimized along the system trajectory. A time-dependent Q-function $Q(x_k, u_k, N-k)$ is estimated so that the system dynamics are not needed. The Q-function $Q(x_k, u_k, N-k)$ and the matrix $\hat{G}_k$ are learned by the Q-function estimator. The optimal regulation problem is subsequently solved by using the information of the estimated Q-function $\hat{Q}(x_k, u_k, N-k)$.
Under the non-autonomous scheme, the boundedness of the closed-loop system is demonstrated by using Lyapunov stability theory. Policy and/or value iterations are not needed. Therefore, the proposed algorithm yields an online and forward-in-time control design scheme which offers many practical benefits.
6. ACKNOWLEDGMENTS
The authors acknowledge the financial support of this research by the Intelligent Systems Center and ECCS-1128281. This paper is derived from [8], which has recently been accepted for publication.
7. REFERENCES
[1] Beard, R., 1995, "Improving the closed-loop performance of nonlinear systems," Ph.D. dissertation, Rensselaer Polytechnic Institute, USA.
[2] Cheng, T., Lewis, F.L., and Abu-Khalaf, M., 2007, "A neural network solution for fixed-final-time optimal control of nonlinear systems," Automatica, vol. 43, pp. 482–490.
[3] Watkins, C., 1989, "Learning from delayed rewards," Ph.D. dissertation, Cambridge University, England.
[4] Heydari, A., and Balakrishnan, S.N., 2011, "Finite-horizon input-constrained nonlinear optimal control using single network adaptive critics," in Proc. American Control Conf., San Francisco, USA, pp. 3047–3052.
[5] Xu, H., Jagannathan, S., and Lewis, F.L., 2012, "Stochastic optimal control of unknown networked control systems in the presence of random delays and packet losses," Automatica, vol. 48, pp. 1017–1030.
[6] Dierks, T., and Jagannathan, S., 2009, "Optimal control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proc. Conf. on Decision and Control, Shanghai, pp. 6750–6755.
[7] Dierks, T., and Jagannathan, S., 2012, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Networks and Learning Systems, vol. 23, pp. 1118–1129.
[8] Zhao, Q., Xu, H., and Jagannathan, S., 2013, "Finite-horizon optimal control design for uncertain linear discrete-time systems," to appear in Proc. of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Singapore.
[9] Lewis, F.L., and Syrmos, V.L., 1995, Optimal Control, 2nd edition, New York: Wiley.