Neural dynamic programming based online controller with a novel trim approach

S. Chakraborty and M.G. Simoes
Engineering Division, Colorado School of Mines, Golden, Colorado 80401-1887, USA
Abstract: Neural dynamic programming (NDP) is a generic online learning control system based
on the principle of reinforcement learning. Such a controller can self-tune over a wide range of
operating conditions and parametric variations. Implementation details of a self-tuning NDP-based
speed controller for a permanent-magnet DC machine, along with the online training algorithm, are given.
A simple solution is developed for finding the trim control position for the NDP
controller that can be extended to other problems. The DC machine is chosen for the
implementation because it can be easily operated in a variety of operating conditions, including
parametric variations, to prove the robustness of the controller and its multiobjective capabilities.
The simulation results of the NDP controller are compared with the results of a conventional PI
controller to assess the overall performance.
1 Introduction
Adaptive critic design (ACD) neural network structures
have evolved from a combination of reinforcement learning,
dynamic programming and back-propagation with a long
and solid history of work [1 –11]. Dynamic programming is
a very useful tool in solving nonlinear MIMO control cases,
most of which can be formulated as a cost minimisation or
maximisation problem. Unfortunately the backward
numerical process required for running dynamic programming makes computation and storage very problematic,
especially for high-order nonlinear systems, i.e. the
commonly known ‘curse of dimensionality’ problem
[12-14]. Over the years, progress has been made to
overcome this by building a system called a critic to
approximate the cost function of dynamic programming
[6, 9], in the form of ACDs. The basic structures of ACD
proposed in the literature are heuristic dynamic programming (HDP), dual heuristic programming (DHP) and
globalised dual heuristic programming (GDHP), and their
action-dependent (AD) versions, which give action-dependent heuristic dynamic programming (ADHDP), action-dependent dual heuristic programming (ADDHP) and
action-dependent globalised dual heuristic programming
(ADGDHP) [4, 6, 9]. A typical ACD consists of three neural
network modules called action (decision making module),
critic (evaluation module) and model (prediction module).
In the action-dependent versions the action network is directly connected to the critic without using a model.
The basic idea in ACD is to adapt the weights of the
critic network to approximate the future reward-to-go
function J(t) such that it satisfies the modified Bellman
equation used in dynamic programming. Instead of finding
the exact minimum, a neural network is used to get an
approximate solution to the following dynamic programming equation:

J(X(t)) = \min_{u(t)} { J(X(t+1)) + g(X(t), X(t+1)) - U_0 }    (1)

where X(t) are the states of the system, g(X(t), X(t+1)) is
the immediate cost incurred by u(t), the control action at
time t, and U_0 is a heuristic term used to balance [15].
After solving (1), the optimised control signal u(t) is
utilised to train the action neural network. The weights of
the action module are adapted so that it produces the desired
control signal using the system states as its input. The
simple block diagram representing (1) is given in Fig. 1 [9].
The training procedure of both the critic and action network
is described in detail in the following sections of the paper.
Now to adapt J(X(t)) in the critic network, the target on the
right-hand side of (1) must be known a priori. It is required
to wait for a time-step until the next input becomes available
in the case of action-dependent ACD. Consequently
J(X(t+1)) can be calculated by using the critic network
at time (t+1). A simple block diagram of action-dependent
ACD [16] is shown in Fig. 2a.
When the problem is of a temporal nature, i.e. one cannot
wait for subsequent time-steps to infer incremental
costs, a pre-trained model network has to be used to
calculate X(t+1). The block diagram of this ACD [17] is
given in Fig. 2b. The problem with such an approach is
training the model network, which becomes complex for
nonlinear MIMO systems.
The basis of the proposed online training algorithm lies in
the combination of ACD and temporal difference (TD)
reinforcement learning [18, 19]. It has been shown in the
literature that TD-based reinforcement learning can be
combined with ACD to get a new structure called neural
dynamic programming [20].
In the proposed approach the system model network (the
one that can predict the future system states and consequently the cost-to-go for the next time-step) is excluded.
Instead, the previous J values are stored and, together with the current
J value, the temporal difference is obtained, along which a
reinforcement signal r(t) is used to train the critic network.
The signal r(t) is a measure of the instantaneous primary reward
of the current action, which means that it tries to adapt the
critic in such a way that the actual system states follow the
desired system states for optimised operation.

Fig. 1 Simple block diagram for solving modified Bellman equation by ACD
Fig. 2 Basic adaptive critic design block diagrams: a Action-dependent ACD; b Simple ACD
The adaptation of the action network is conducted by
indirectly back-propagating the error between the critic
network output J(t) and a control signal called the ultimate
control objective U_c(t). The ultimate control objective
function U_c(t) is a weighted cost function which can be
defined according to the main control objective for the
controller, giving more flexibility in designing the
controller.
For a closed-loop NDP controller the nominal control
position (sometimes referred to as trim control position
[21, 22]) has to be scheduled as a function of system states
and environmental parameters [21]; a novel way to find the
trim control position is proposed as follows. It is actually an
open-loop solution to the control problem, giving good
insight into forming a closed-loop solution. It also
helps to facilitate realistic simulations where one often
wants to start with arbitrary initial conditions. There are
existing methods for implementing the trim network, some
of which require knowledge of the governing equations of
the system, which is not usually a feasible option for complex
systems where the detailed system equations are hard to
obtain. Another previously used technique is the use of the
existing NDP neural network framework to design the trim
[21]. This method is robust but adds extra complexity to
the algorithm along with more convergence time. The trim
controller implemented here is called the correction
module, as it corrects the output of the controller. The
correction module actually consists of two simple neural
networks easily trained offline.
The proposed configuration is simpler than the ACD and
yet very useful for complex MIMO systems because the
controller can handle continuous states and a large number of
discrete states owing to the use of gradient information instead
of a search algorithm for finding the actions. Also, the
design is robust in the sense that it is insensitive to system
parameter variations, initial weights of action and critic
networks, and the control objective.
2 General structure of NDP controller
Figure 3 shows the block diagram of the NDP controller in
the closed loop. A brief description of each block follows.
2.1 Reinforcement signal
The reinforcement signal r(t) is used to train the critic
network. It is a measure of instantaneous primary reward of
the current action; it tries to adapt the critic in such a way that
the actual system states follow the ideal system states for
optimised operation. In systems where explicit feedback is
available, the following quadratic reinforcement signal can
be used:
r(t) = -\sum_{i=1}^{n} [ (\hat{X}_i - X_i) / (X_i)_{Max} ]^2    (2)

where X_i is the ith state of the state vector X (here the desired
reference state), \hat{X}_i is the actual state and (X_i)_{Max} is
the nominal maximum state value.
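As a worked illustration, a minimal Python (NumPy) sketch of (2); the function name, array names and numerical values are chosen here only for illustration:

import numpy as np

def reinforcement_signal(x_actual, x_ref, x_max):
    """Quadratic reinforcement signal r(t) of (2): non-positive, and zero
    only when every actual state matches its reference."""
    err = (np.asarray(x_actual) - np.asarray(x_ref)) / np.asarray(x_max)
    return -float(np.sum(err ** 2))

# example with three states (speed, current, rate of change of current)
r_t = reinforcement_signal([95.0, 4.1, 0.2], [100.0, 4.0, 0.0], [180.0, 20.0, 5.0])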
2.2 Generation of U_c(t)
The adaptation of the action network is done by indirectly
back-propagating the error between the critic network output
J(t) and a control signal called the ultimate control
objective U_c(t). The ultimate control objective function
U_c(t) is a signal which can be defined according to the main
control objective for the controller.

Fig. 3 Block diagram of proposed NDP structure
Suppose the system has n state variables denoted by X_1,
X_2, X_3, ..., X_n, and the multiobjective controller is to be
applied so that the weighted quadratic error between
m of the actual states and the corresponding desired
reference states reaches a minimum. In such a case U_c(t)
can be defined in the following way:
U_c(t) = -\sum_{i=1}^{m} (1/K_i) ((X_i)_{error})^2    (3)

where (X_i)_{error} is the error between the actual ith state X_i and
its desired value, and 1/K_i is the weight for the ith state
error. This ultimate control objective function is flexible
because it can be defined depending on the control actions
required for the system. The weighting variable K_i helps to
give different levels of importance to the different control
actions in the case of a multiobjective controller.

Fig. 4 Main components of NDP structure: a Critic; b Action
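A corresponding Python sketch of (3); the weights 1/K_i are passed as an array, and the example values mirror the weights later used for the DC-motor objective (32):

import numpy as np

def ultimate_control_objective(x_error, k):
    """Weighted quadratic ultimate control objective U_c(t) of (3);
    x_error[i] is the ith controlled-state error, 1/k[i] its weight."""
    x_error, k = np.asarray(x_error, dtype=float), np.asarray(k, dtype=float)
    return -float(np.sum((x_error ** 2) / k))

# with k = [2, 8, 16] this reduces to the DC-motor objective of (32) in Section 4
uc_t = ultimate_control_objective([5.0, 0.1, 0.02], [2.0, 8.0, 16.0])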
2.3 Internal model
For both r(t) and U_c(t) the desired reference states X_i are
needed. These desired reference states can be either given as
the external references to the controller (for explicit
multiobjective controller) or can be derived internally and
used for enhancement of the control actions. In the second
case, the internal model works like an observer generating
desired pseudostates from the controller output.
One important characteristic is that when the differential
equations of the system are explicitly known, they can be
used in the internal model. But for a system where the
system equations are unknown or complex, a simple offline
trained neural network model is preferable.
2.4 Critic network

The output of the critic network J(t) approximates the
discounted total reward-to-go R(t) given by

R(t) = r(t+1) + \alpha r(t+2) + \alpha^2 r(t+3) + ...    (4)

where R(t) is the future accumulative reward-to-go value at
time t, \alpha is a discount factor for the infinite-horizon problem
(0 < \alpha < 1), and r(t+1) is the reinforcement signal value at
time t+1.

The critic network can be implemented with a simple
multilayer feedforward neural network. The critic network
has the n measured states along with the action network output
u(t) as inputs, and the output of the critic network is J(t),
which is an approximation of the cost function R(t). The neural
network can have either linear or nonlinear sigmoid
functions at the nodes. The three-layer feedforward neural
network with one hidden layer shown in Fig. 4a is
commonly used as the critic. The equations for the feedforward
critic network are

J(t) = \sum_{i=1}^{Nh} W^{(2)}_{ci}(t) p_i(t)    (5)

p_i(t) = (1 - e^{-q_i(t)}) / (1 + e^{-q_i(t)}),   i = 1, 2, 3, ..., Nh(critic)    (6)

q_i(t) = \sum_{j=1}^{n+1} W^{(1)}_{cij}(t) x_j(t),   i = 1, 2, 3, ..., Nh(critic)    (7)

where Nh is the number of hidden nodes in the critic, q_i(t) the ith
hidden node input of the critic network, p_i(t) the ith hidden
node output, n+1 the total number of inputs into the critic
network including the action network output u(t) as one of
them, W^{(2)}_{ci}(t) the weight of the connection between the ith
hidden node and the output node of the critic network,
W^{(1)}_{cij}(t) the weight of the connection between the jth input node
and the ith hidden node, and x_j(t) the jth input to the critic
network.

The temporal difference method is utilised in the critic
training. The previous J values are stored and, together with
the current J value, the temporal difference is obtained, along
which a reinforcement signal r(t) is used to train the critic
network. The prediction error for the critic network can be
defined as

e_c(t) = \alpha J(t) - [J(t-1) - r(t)]    (8)

and the objective function that the critic network minimises is

E_c(t) = (1/2) e_c^2(t)    (9)

where J(t) is the output of the critic network, the J function,
as an approximation of R(t), \alpha the discount factor with range
(0, 1), and r(t) the reinforcement signal at time t.

The weight update rules for the critic network are based
on the gradient descent algorithm as follows:

W_c(t+1) = W_c(t) + \Delta W_c(t)    (10)

\Delta W_c(t) = l_c(t) [ -\partial E_c(t) / \partial W_c(t) ]    (11)

\partial E_c(t) / \partial W_c(t) = [\partial E_c(t) / \partial J(t)] [\partial J(t) / \partial W_c(t)]    (12)

where l_c(t) is the positive learning rate of the critic
network at time t, and W_c is the weight vector in the critic
network. For a three-layer feedforward neural network as
shown in Fig. 4a, by applying the chain rule to (11) and
(12) the adaptation of the critic can be summarised as
follows.

2.4.1 Weights of hidden to output layer:

\Delta W^{(2)}_{ci}(t) = l_c(t) [ -\partial E_c(t) / \partial W^{(2)}_{ci}(t) ]    (13)

\partial E_c(t) / \partial W^{(2)}_{ci}(t) = \alpha e_c(t) p_i(t)    (14)

2.4.2 Weights of input to hidden layer:

\Delta W^{(1)}_{cij}(t) = l_c(t) [ -\partial E_c(t) / \partial W^{(1)}_{cij}(t) ]    (15)

\partial E_c(t) / \partial W^{(1)}_{cij}(t) = \alpha e_c(t) W^{(2)}_{ci}(t) [ (1/2)(1 - p_i^2(t)) ] x_j(t)    (16)

where l_c(t) is the positive learning rate for the critic network.
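To make the critic adaptation concrete, the following Python/NumPy sketch implements the forward pass (5)-(7), the temporal-difference error (8) and the gradient steps (13)-(16) for a single hidden layer; the class structure, initial weight range and random seed are choices made for this sketch, not values specified by the paper:

import numpy as np

def bipolar(z):
    # bipolar sigmoid used at the nodes, as in (6)
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

class Critic:
    """Three-layer feedforward critic of Fig. 4a, equations (5)-(16)."""
    def __init__(self, n_inputs, n_hidden=10, lc=0.4, alpha=0.95):
        rng = np.random.default_rng(0)
        self.w1 = rng.uniform(-0.5, 0.5, (n_hidden, n_inputs))  # input -> hidden, W_c^(1)
        self.w2 = rng.uniform(-0.5, 0.5, n_hidden)               # hidden -> output, W_c^(2)
        self.lc, self.alpha = lc, alpha

    def forward(self, x):
        # x = [scaled states..., u(t)]; returns J(t) per (5)-(7)
        self.x = np.asarray(x, dtype=float)
        self.p = bipolar(self.w1 @ self.x)
        return float(self.w2 @ self.p)

    def dJ_du(self):
        # dJ(t)/du(t), assuming u(t) is the last critic input; this is the
        # bracketed sum that appears in (27) and (29)
        return float(np.sum(self.w2 * 0.5 * (1.0 - self.p ** 2) * self.w1[:, -1]))

    def train_step(self, J_now, J_prev, r_now):
        # TD prediction error (8) and one gradient-descent step per (13)-(16)
        e_c = self.alpha * J_now - (J_prev - r_now)
        grad_w2 = self.alpha * e_c * self.p
        grad_w1 = np.outer(self.alpha * e_c * self.w2 * 0.5 * (1.0 - self.p ** 2), self.x)
        self.w2 -= self.lc * grad_w2
        self.w1 -= self.lc * grad_w1
        return 0.5 * e_c ** 2  # E_c(t) of (9)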
2.5 Action network

The action network generates the desired plant control
based on the measurements of the plant states and
operates as the actual controller for the system. In a
similar way to the critic network, the action network can
be implemented with a standard multilayer linear or
nonlinear feedforward neural network. The inputs to the
action network are the n measured system states and the output is
the control action u(t). For a multiobjective controller the
control-space dimension defines the number of action
network outputs. The three-layer feedforward neural
network with one hidden layer as shown in Fig. 4b is
commonly used as the action network. The equations of the
feedforward action network follow:

u(t) = (1 - e^{-v(t)}) / (1 + e^{-v(t)})    (17)

v(t) = \sum_{i=1}^{Nh} W^{(2)}_{ai}(t) g_i(t)    (18)

g_i(t) = (1 - e^{-h_i(t)}) / (1 + e^{-h_i(t)}),   i = 1, 2, 3, ..., Nh(action)    (19)

h_i(t) = \sum_{j=1}^{n} W^{(1)}_{aij}(t) x_j(t),   i = 1, 2, 3, ..., Nh(action)    (20)

where Nh is the number of hidden nodes in the action network, v(t)
the input to the action output node, u(t) the output from the action
network, h_i(t) the ith hidden node input for the action network,
g_i(t) the output of the ith hidden node, n the number of
inputs for the action network, i.e. the number of states for
the system, W^{(2)}_{ai}(t) the weight of the connection between
the ith hidden node and the output node of the action
network, W^{(1)}_{aij}(t) the weight of the connection between the
jth input node and the ith hidden node, and x_j(t) the jth
input to the action network.

The adaptation of the action network is to back-propagate the
error between the desired ultimate objective U_c(t) and the
cost function R(t). If the explicit cost function is available,
the actual cost function R(t) is used. When the critic network is used
to approximate R(t), the critic output J(t) is used instead
of R(t). In the second case, back-propagation is done
through the critic network. The weights of the action network
are updated to minimise the following performance error:

E_ac(t) = (1/2) e_ac^2(t)    (21)

where

e_ac(t) = J(t) - U_c(t)    (22)

and e_ac(t) is the prediction error for the action network,
and E_ac(t) the objective function that the action network tries to
minimise. The weight update rules for the action network
are also based on the gradient descent algorithm:

W_a(t+1) = W_a(t) + \Delta W_a(t)    (23)

\Delta W_a(t) = l_a(t) [ -\partial E_ac(t) / \partial W_a(t) ]    (24)

\partial E_ac(t) / \partial W_a(t) = [\partial E_ac(t) / \partial J(t)] [\partial J(t) / \partial u(t)] [\partial u(t) / \partial W_a(t)]    (25)

where l_a(t) is the positive learning rate of the action
network at time t, and W_a is the weight vector in the
action network.

For a three-layer feedforward action network as shown in
Fig. 4b the chain rule is applied to (24) and (25) to get
the following equations.

2.5.1 Weights of hidden to output layer:

\Delta W^{(2)}_{ai}(t) = l_a(t) [ -\partial E_ac(t) / \partial W^{(2)}_{ai}(t) ]    (26)

\partial E_ac(t) / \partial W^{(2)}_{ai}(t) = e_ac(t) [ (1/2)(1 - u^2(t)) ] g_i(t) \sum_{i=1}^{Nh} [ W^{(2)}_{ci}(t) (1/2)(1 - p_i^2(t)) W^{(1)}_{ci,n+1}(t) ]    (27)

2.5.2 Weights of input to hidden layer:

\Delta W^{(1)}_{aij}(t) = l_a(t) [ -\partial E_ac(t) / \partial W^{(1)}_{aij}(t) ]    (28)

\partial E_ac(t) / \partial W^{(1)}_{aij}(t) = e_ac(t) [ (1/2)(1 - u^2(t)) ] W^{(2)}_{ai}(t) [ (1/2)(1 - g_i^2(t)) ] x_j(t) \sum_{i=1}^{Nh} [ W^{(2)}_{ci}(t) (1/2)(1 - p_i^2(t)) W^{(1)}_{ci,n+1}(t) ]    (29)

where p_i(t) is the ith hidden node output from the critic
network, and l_a(t) the positive learning rate for the action
network. Normalisation is performed in both the action and
critic networks to confine the values of the weights to an
appropriate range by

W_c(t+1) = [W_c(t) + \Delta W_c(t)] / ||W_c(t) + \Delta W_c(t)||_1    (30)

W_a(t+1) = [W_a(t) + \Delta W_a(t)] / ||W_a(t) + \Delta W_a(t)||_1    (31)
Following this description of all the component blocks for
the NDP controller it is now possible to understand the
interaction between these components in the controller.
In the proposed online control design the controller is
'naive' at the beginning, as both the action network and the critic
network have randomly initialised weights. To
speed up convergence the action network can
also be pretrained with static data from the system
before starting the controller. Once the system with the
controller is running, the system states are acquired and
the control signals r(t) and U_c(t) are calculated. The
NDP controller then starts training the critic network while
the action network's weights are kept constant. After training
is complete the critic network weights are kept constant at
their final values. Then the controller starts training the
action network. When action training is complete the action
network weights are frozen and the new control signal is
generated by a feedforward pass through the action network with those
final action network weights. With this new control signal
the critic network is again trained, followed by action
network training. This critic-action training process continues a few times to ensure proper adaptation of both the
action and critic network, after which the system is run with
the final control signal from the action network and new
states are observed and the whole process is repeated for
those new state values.
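The interaction just described can be summarised in a short sketch that alternates critic and action training a few times per sampling interval, reusing the Critic and Action classes sketched earlier; the fixed number of cycles and the helper signature are simplifications of the flowcharts in Figs. 5-7:

import numpy as np

def control_interval(critic, action, x_scaled, J_prev, r_t, Uc_t, n_cycles=3):
    """One sampling interval of the online procedure (a sketch).
    critic/action are the Critic and Action instances from the earlier
    listings; x_scaled are the scaled measured states, r_t and Uc_t the
    already-computed reinforcement and ultimate-objective signals."""
    u = action.forward(x_scaled)
    for _ in range(n_cycles):
        # train the critic with the action weights held constant
        J = critic.forward(np.append(x_scaled, u))
        critic.train_step(J, J_prev, r_t)
        # freeze the critic and train the action network
        J = critic.forward(np.append(x_scaled, u))
        action.train_step(J, Uc_t, critic.dJ_du())
        u = action.forward(x_scaled)          # regenerate the control signal
    # the (denormalised) u is then applied to the plant and new states observed
    return u, critic.forward(np.append(x_scaled, u))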
The detailed operation of the controller is given in the
flowcharts of Section 4, where an application of the controller
to command the speed of a permanent-magnet DC motor is
presented.
3 System description: Permanent-magnet DC
motor
The NDP controller is designed to drive the speed of a
permanent-magnet DC motor. A simple model for the
DC motor is developed. The input to the system is the
voltage V_a(t) and the outputs are the actual speed \hat{\omega}, the actual
current \hat{I}_a, and the signal \hat{L}_a(dI_a/dt), which is also made
available. The DC machine parameters used in simulations are given in Table 1. The DC motor is the chosen
plant because it is a simple nonlinear system (considering
inductance saturation and parameter variation) that will
enable our claims about the nonlinear control capability
of the proposed controller to be verified. Although the
controller is designed for speed control, a simple
modification in the ultimate control objective U_c(t) can
be utilised to get better transient response in armature
current and bounded back-EMF. With this system, it can
be easily verified that the NDP controller works with
various operating conditions, parametric variations and
load disturbances.
4 Implementation of NDP controller
For this particular system the ultimate control objective
U_c(t) is defined as

U_c(t) = -(1/2)((\omega)_{er})^2 - (1/8)((I_a)_{er})^2 - (1/16)((L_a dI_a/dt)_{er})^2    (32)

where (\omega)_{er} is the difference between the actual speed and the
reference speed, (I_a)_{er} the difference between the actual and
reference armature current, and (L_a dI_a/dt)_{er} the difference
between the actual and reference L_a dI_a/dt.
The main objective is to control the speed of the DC motor
such that the error in speed is minimised. The minimisation of the
error in armature current and of the rate of change of armature
current are also implemented, as shown in (32).

Table 1: Parameters of the permanent-magnet DC motor

Parameter   Value      Parameter   Value
Ra          1.78 Ω     Kt          0.935
La          0.03 H     Kv          0.935

Kt is the torque constant; Kv is the voltage constant; φ is the machine internal flux in
weber; Ra is the motor armature resistance; La is the motor armature inductance
Inputs to the critic network are the state variables
\hat{\omega}, \hat{I}_a, \hat{L}_a dI_a/dt and the output of the action network V_a(t).
The output of the critic is the cost function J(t). Inputs to the
action network are the actual state variables. Output of this
network is the control signal V_a(t), i.e. the voltage for the
PWM converter of the DC motor or directly the terminal
voltage of the DC motor.
The internal model has two inputs, V_a(t) and \hat{\omega}. The
outputs are the ideal current I_a and L_a(dI_a/dt). The governing
equation of the system assuming constant field excitation
(as a permanent-magnet DC motor) is as follows:
L_a J \ddot{\omega} + L_a(B + K)\dot{\omega} + (K_t\phi)(K_v\phi)\omega = K_t\phi(V_a - I_a R_a)    (33)
where J is the moment of inertia of the motor, B the damping
constant for the system, K the load torque constant,
i.e. T_L = K\omega, R_a and L_a the machine armature resistance and
inductance, V_a the machine terminal voltage (here equal
to the controller output), I_a the armature current, and \omega
the rotor speed in rad/s.
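For reference, a minimal open-loop simulation of (33), written in the equivalent coupled first-order electrical/mechanical form and integrated with forward Euler; the zero damping constant B and the step size are assumptions of this sketch:

import numpy as np

# parameters from Table 1 and Section 5; B (viscous damping) is assumed zero here
Ra, La, Kt_phi, Kv_phi = 1.78, 0.03, 0.935, 0.935
J_m, K_load, B = 0.0012, 0.2, 0.0

def motor_step(omega, ia, va, dt=1e-4):
    """One forward-Euler step of the permanent-magnet DC motor of (33),
    with the load torque TL = K*omega."""
    dia = (va - Ra * ia - Kv_phi * omega) / La            # armature circuit
    domega = (Kt_phi * ia - (B + K_load) * omega) / J_m   # mechanical equation
    return omega + dt * domega, ia + dt * dia

# open-loop check: constant terminal voltage, speed settles to its steady state
omega, ia = 0.0, 0.0
for _ in range(20000):                                    # 2 s of simulated time
    omega, ia = motor_step(omega, ia, va=100.0)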
The DC motor model and all other models used are
implemented in Simulink and the programs are written in
Matlab version 6.1. Initially the self-tuning controller has no
prior knowledge about the plant but only the online
measurements. The controller is actually the feedforward
action network where the weights are updated on the fly.
The variables speed, armature current and rate of change of
armature current are scaled to between -0.9 and 0.9 because
the neural network layer transfer functions are bipolar. The
signals r(t) and U_c(t) are scaled between -0.9 and 0 because, in
accordance with the control formulation presented here, they
are always negative. The action network output is to be
scaled back to the actual range. The flowcharts for the
complete procedure and the individual component training
are given in Figs. 5– 7.
The important training parameters are given in Table 2.
The learning rate of both the action and critic network is
increased by a factor of two if the corresponding training error
decreases, and is decreased by a factor of 0.8 if the training error
increases. A variable-step ODE solver with maximum step size limited to 10 ms is used.
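The learning-rate heuristic just described can be written compactly as follows; the upper bound lr_max is an assumed safety cap, not a value given in Table 2:

def adapt_learning_rate(lr, err, prev_err, lr_min=0.01, lr_max=1.0):
    """Learning-rate schedule described above for both critic and action
    training: double while the training error decreases, shrink by 0.8 when
    it increases; lr_min follows Table 2, lr_max is an assumed safety cap."""
    lr = lr * 2.0 if err < prev_err else lr * 0.8
    return min(max(lr, lr_min), lr_max)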
4.1 Design of correction module
For a closed-loop NDP controller the nominal control position (sometimes referred to as the trim control position
[21, 22]) has to be scheduled as a function of system states
and environmental parameters [21]. Previous works on NDP
were successful because they were tested on systems (like
the inverted pendulum) having zero trim requirements [20].
Finding a trim control position is actually an open-loop
solution to the control problem, giving good insight into forming a closed-loop solution. It also helps to
facilitate realistic simulations where one often wants to start
with an arbitrary initial condition rather than controlling the
DC machine while it is running at a particular state [21].
A simple but effective method for determining the trim
control position for the DC machine speed control is
implemented here and is called the correction module as it
corrects the output of the controller. The correction module
consists of two simple offline trained neural networks. The
first neural network takes the scaled reference speed and
scaled armature current Ia as inputs and generates the
terminal voltage as output, whereas the second one takes the
scaled actual speed and scaled armature current Ia as inputs
and generates the terminal voltage as output. The outputs of
these networks do not need to be optimised, so the target for
training can be easily obtained, i.e. the system voltage from
the machine running with a suboptimal PI speed controller.
The training is conducted offline using the 'nntool' GUI
available in Matlab 6.1.

Fig. 5 Flowchart for complete process
Fig. 6 Flowchart for critic training program
Fig. 7 Flowchart for action training program

Table 2: Summary of important training parameters

Parameter                                   Value
Critic: initial learning rate               0.4
Critic: minimum learning rate               0.01
Critic: discount factor                     0.95
Critic: number of nodes in hidden layer     10
Critic: maximum training cycles             no. of patterns
Critic: threshold error for training        0.01
Action: initial learning rate               0.3
Action: minimum learning rate               0.01
Action: number of nodes in hidden layer     10
Action: maximum training cycles             no. of patterns
Action: threshold error for training        0.01
The reason for using two neural networks is that the first
one alone works well in most operating
conditions, except when there is a load torque disturbance. With a load torque increase the armature current
increases, as there is no explicit armature current controller
(although the NDP system can optimise the armature current at
a particular trim position), so with the same reference speed
the correction module generates higher trim values than
expected. The second neural network takes care of
this variation because, with load changes, when the armature
current changes the speed also tries to change. In addition,
the second neural network captures the dynamics needed to
generate the proper trim control position. If only the second
NN were used as the correction module, the speed would settle
below the reference speed, because without a
reference speed input the system would not generate the proper
trim position corresponding to that reference speed. To
decide which neural network is to be used, some simple
conditional statements are used. The correction module is
portrayed in the block diagram of Fig. 8. The scaling gain
blocks are used to denormalise the outputs of the neural
networks back to their actual range. Then they are low-pass
filtered. The transfer function of this filter is based on the
mechanical pole of the DC machine as
G(s) = (K/J) / (s + K/J)    (34)

where K is defined as the load torque constant (T_L = K\omega) and J is the moment of inertia of the motor.
The decision block compares the actual speed with the
reference speed to decide which neural network will work.
Figure 9 shows the reference speed profile used from the
training data set.
It is very simple to generate the correction module
without changing the training algorithm of the main ACD
network. Before designing the controller for a specific
application, the correction module has to be trained offline
with the static data from that particular system. Once trained
offline, there is no need to change its weights during the
online training phase.
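A sketch of the correction-module logic of Fig. 8: the switching rule shown (use the actual-speed network when the speed is close to its reference, otherwise the reference-speed network) and the 5% threshold are one plausible reading of the decision block, not the authors' exact conditions; nn_ref and nn_act stand for the two offline-trained networks and v_scale for the scaling gain:

J_m, K_load = 0.0012, 0.2             # motor inertia and load torque constant
a = K_load / J_m                       # mechanical pole used in (34): G(s) = a/(s + a)

def correction_module(omega_ref, omega_act, ia, nn_ref, nn_act, v_filt,
                      dt=1e-3, v_scale=1.0):
    """Trim (correction) voltage: select one of the two offline-trained
    networks and low-pass filter its denormalised output with (34).
    nn_ref maps (scaled reference speed, scaled Ia) -> scaled voltage,
    nn_act maps (scaled actual speed, scaled Ia) -> scaled voltage."""
    if abs(omega_act - omega_ref) < 0.05 * abs(omega_ref) + 1e-6:
        v_trim = nn_act(omega_act, ia)     # near the reference: track load changes
    else:
        v_trim = nn_ref(omega_ref, ia)     # far from the reference: use reference speed
    v_trim *= v_scale                      # scaling gain: denormalise to volts
    v_filt += dt * a * (v_trim - v_filt)   # discrete first-order filter for (34)
    return v_filt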
Fig. 8 Block diagram of correction module showing all components involved in the design

5 Simulation study with NDP controller
The NDP controller was tested at different operating
conditions, under different parameter variations and subjected
to load torque variations. Some of these results are given
along with those of the PI-controlled DC machine to allow a good
comparison.
The permanent-magnet DC motor transfer function is
obtained with the parameters of the motor given in Table 1.
Some additional parameters used are J (moment of inertia of
the motor) = 0.0012 kg·m², load torque proportional to \omega,
i.e. T_L = K\omega, and K (load torque constant) = 0.2 N·m·s. With
these parameters the transfer function for the DC motor is

G_{\omega,V} = 0.935 / (3.54 \times 10^{-5} s^2 + 0.0081 s + 1.23)    (35)

Fig. 9 Reference speed profile from data set used for offline training of correction module neural networks
It is evident from Fig. 10b that without any controller the
system has an almost 55% steady-state error, as expected. The
overshoot is also quite high, of the order of 20%.
The steady-state error as well as the overshoot can be
eliminated by implementing a simple PI controller. By using
‘sisotool’ GUI in MATLAB 6.1 an optimal PI controller is
designed to obtain a settling time less than 0.1 s in addition
to zero steady-state error and no overshoot. The response of
the PI-controlled system is given in Fig. 11. The PI
controller parameters are K_P = 0.01 and K_I = 50. This PI
controller is optimally tuned for the machine parameters
given in Table 1. The additional parameters used are
J = 0.0012 kg·m² and K = 0.2 N·m·s. But when the operating
conditions change this simple PI controller is no longer
the optimal one. This is clear from Fig. 12, where the load
torque constant K is changed to 0.02 N·m·s.
All the following simulation studies were done at
operating conditions that differ from the condition for
which the PI controller was optimised, i.e. K = 0.02 N·m·s is
used.

Fig. 10 DC motor without controller: a Root locus; b Step response
Fig. 11 PI-controlled DC motor: a Closed-loop poles in root locus; b Step response
Fig. 12 Step response of PI-controlled DC motor at a different operating point (K = 0.02 N·m·s)
5.1 Speed and load torque change
In this simulation study the speed step-changes from 0 to
100 rad/s and then to 150 rad/s at time 0.5 s, along with an
additional step load torque change from 0 to 1 Nm at 1.35 s.
Fig. 13 Speed profile of PI and NDP controller with step speed and load torque changes
Fig. 14 Armature current profile of PI and NDP controller with step speed change and step load torque change

To test robustness the simulation studies were conducted at
operating conditions different from the condition for which
the PI controller was optimised. From Fig. 13 it can be seen
that the speed response of the NDP controller is much better
than the PI controller. There is no overshoot in speed for a
step change in speed with the NDP controller. Again the
oscillation in speed is less for the NDP controller when there
is a step change in torque.
Figure 14 shows that the armature current ripple at standstill
and at the step change in speed is much smaller with the
NDP controller than with the PI controller. With the torque
change, though, the current ripple with the NDP controller is
slightly larger than that with the PI controller; the oscillation
dies out at a much faster rate for the NDP controller.
5.2 Parameter variation analysis
In this case study a step change in speed was imposed from
0 to 100 rad/s and then to 150 rad/s at time 0.5 s. The
additional load torque step changed from 0 to 1 Nm at 1.5 s
and then came back to 0 at 2.19 s. Also considered here are a 70%
increase of the armature resistance from its nominal
value due to heating and a change in armature inductance with
current due to saturation.
This case was also done at operating conditions different
from the condition for which the PI controller is optimised.
The parametric variations were incorporated to prove the
robustness of the self-tuning NDP controller. In Fig. 15 the
parameter variations of the DC machine are shown.
Figure 16 shows the load torque disturbance. From
Figs. 17-20 it can be seen that the speed and current
responses with the NDP controller are much better than with the
PI controller, both in the transient and at steady state.
Fig. 15 Simulated armature resistance change to 70% of nominal value with time, and armature inductance change with armature current
Fig. 16 Load torque changes from 0 to 1 Nm at 1.5 s and then returns to 0 at 2.19 s
Fig. 17 Speed profile of PI and NDP controller with step speed change, periodic load torque change, and armature resistance and inductance variation
Fig. 18 Magnified speed profile of PI and NDP from Fig. 17 to show comparison between controllers when load torque changes

In all the simulations a simple PI controller was used to
compare with the NDP-based adaptive controller. As the DC
motor model is well-known it is easy to implement an
adaptive PI controller to get better results than a simple PI
controller. But in general most of these adaptive controller
design techniques, such as gain scheduling, require some
form of system identification based on the parameterised
model with a set of design equations relating controller
parameters to plant parameters [23]. With such techniques
there is a necessity to impose a model structure on the
system, which introduces approximation even when best-fit
parameters for such models are available. Furthermore, in
some cases the complexity of the system may make the
modelling infeasible, a typical problem in all model-based
reference adaptive control approaches. On the other hand,
some PI tuning techniques, such as unfalsified control
theory based tuning, can work without plant models.
But they are complex to implement and have the limitation
that the set of unfalsified controllers may shrink to a null set
if there are no PID controllers capable of meeting the
performance specification [24].
In contrast, the NDP controller is robust with parameter
variations and load disturbances. It is simpler to implement
than most of the optimal controllers. It does not need any
plant models and can be designed to have either explicit or
implicit multiobjective control actions which in turn makes
the controller very flexible. The NDP controller can be very
useful for complex MIMO systems because the controller
can handle continuous states and a large number of discrete states
owing to the use of gradient information instead of a search
algorithm for finding the actions.
Fig. 19 Armature current profile of PI and NDP controller with step speed change, periodic load torque change, armature resistance and inductance variation
Fig. 20 Starting current profile of PI and NDP controller with step speed change, periodic load torque change, armature resistance and inductance variation

The closed-loop stability analysis of the NDP controller is an ongoing topic of active research and is
outside the scope of this paper. With detailed analysis it
can be shown that the back-propagation-based weight
adaptation of the critic and action networks ensures that
weight estimation errors are uniformly ultimately
bounded [7]. The normalisation of the neural network
weights (30) and (31), the scaling of the input variables to
the controller and the use of bipolar-layer transfer
functions (6), (17) and (19) that saturate at ±1 help to
keep the controller output within the acceptable range.
Furthermore, the correction module acts as a feedforward
controller to enhance the stability of the closed-loop
system.
6 Conclusions
A step-by-step procedure for designing a NDP controller
has been fully described, analysed and implemented. A DC
machine was used as an example plant, but the approach
can be extended to other systems. The results from the NDP-controlled system were compared with those of a PI-based system,
corroborating the validity of the algorithms used and the
superiority of the controller. This work included a novel
technique for finding the trim control position by a correction
module. The results verified that the NDP controller was
able to successfully control a nonlinear system at various
operating conditions, parametric variations and load disturbances. A noticeable feature is the multiobjective
capability of controlling machine speed, with better
response for armature current and bounded back-EMF. It
appears that NDP is a good candidate for controlling
nonlinear MIMO systems. Future work will concentrate
on memory buffer size, network size and detailed
stability analysis.
7 Acknowledgments
The authors are grateful to the National Science Foundation
(NSF) for their continuing support in this work.
8 References
1 Werbos, P.J.: ‘Advanced forecasting methods for global crisis warning
and models of intelligence’, in ‘General Systems Yearbook’, 1977,
Vol. 22, pp. 25–38
2 Bertsekas, D.P., and Tsitsiklis, J.N.: ‘Neuro-dynamic programming’
(Athena Scientific, Belmont, MA, 1996)
3 Lendaris, G.G., and Paintz, C.: ‘Training strategies for critic and action
neural networks in dual heuristic programming method’. Proc.
1997 IEEE Int. Conf. on Neural Networks, Houston, TX, June 1997,
pp. 712–717
4 Prokhorov, D.V.: ‘Adaptive critic designs and their applications’.
PhD dissertation, Texas Tech University, Lubbock, TX, 1997
5 Prokhorov, D.V., Santiago, R.A., and Wunsch, D.C.: ‘Adaptive critic
designs: A case study for neurocontrol’, Neural Netw., 1995, 8,
pp. 1367–1372
6 Prokhorov, D.V., and Wunsch, D.C.: ‘Adaptive critic designs’, IEEE
Trans. Neural Netw., Sept. 1997, 8, pp. 997–1007
7 Lewis, F.W., Campos, J., and Selmic, R.: ‘On adaptive critic
architectures in feedback control’. Presented at IEEE Conf. on Decision
and Control, Phoenix, AZ, Dec. 1999, pp. 5–10
8 Werbos, P.J.: ‘A menu of designs for reinforcement learning over time’,
in Miller, W.T., Sutton, R.S., and Werbos, P.J. (Eds.): ‘Neural networks
for control ’ (MIT Press, Cambridge, MA, 1990), Chapter 3
9 Werbos, P.J.: ‘Approximate dynamic programming for real-time control
and neural modeling’, in White, D.A., and Sofge, D.A. (Eds.):
‘Handbook of intelligent control: neural, fuzzy, and adaptive
approaches’ (Van Nostrand Reinhold, New York, NY, 1992), Chapter 13
10 Ferrari, S., and Stengel, R.F: ‘An adaptive critic global controller’.
Presented at the American Control Conf., May 2002
11 Shannon, T.T., and Lendaris, G.G.: ‘Adaptive critic based design of a
fuzzy motor speed controller’. Proc. IEEE Int. Symp. on Intelligent
Control, Mexico City, Mexico, Sept. 2001, pp. 359–363
12 Bryson, A.E.: ‘Dynamic optimization’ (Addison-Wesley-Longman,
Menlo Park, CA, 1999)
13 Bellman, R.E.: ‘Dynamic programming’ (Princeton Univ. Press,
Princeton, NJ, 1957)
14 Dreyfus, S.E., and Law, A.M.: ‘The art and theory of dynamic
programming’ (Academic Press, New York, NY, 1977)
15 Werbos, P.J.: ‘Neurocontrol and supervised learning: An overview and
valuation’, in White, D.A., and Sofge, D.A. (Eds.): ‘Handbook of
intelligent control’ (Van Nostrand Reinhold, New York, NY, 1992)
16 Liu, D.: ‘Adaptive critic designs for problems with known analytical
form of cost function’. Proc. INNS-IEEE Int. Joint Conf. on Neural
Networks, Honolulu, HI, May 2002, pp. 1808–1813
17 Liu, D., Xiong, X., and Zhang, Y.: ‘Action dependent adaptive critic
designs’. Proc. INNS-IEEE Int. Joint Conf. on Neural Networks,
Washington, DC, July 2001, pp. 990 –995
18 Barto, A.G., Sutton, R.S., and Anderson, C.W.: ‘Neuron like adaptive
elements that can solve difficult learning control problems’, IEEE
Trans. Syst., Man Cybern., 1983, 13, pp. 834 –847
19 Sutton, R.S.: ‘Learning to predict by the methods of temporal
difference’, Mach. Learn., 1988, 3, pp. 9– 44
20 Si, J., and Wang, Y.-T.: ‘Online learning control by association and
reinforcement’, IEEE Trans. Neural Netw., 2001, 12, pp. 264–276
21 Enns, R., and Si, J.: ‘Helicopter flight control design using a learning
control approach’. Proc. 39th IEEE Conf. Decision and Control,
Sydney, Australia, Dec. 2000, pp. 1754–1759
22 Enns, R., and Si, J.: ‘Helicopter tracking control using direct neural
dynamic programming’. Proc. Int. Joint Conf. on Neural Networks,
2001, Vol. 2, pp. 1019–1024
23 Ringwood, J.V., and O’Dwyer, A.: ‘A frequency-domain based selftuning PID controller’. Proc. Asian Control Conf., Tokyo, Japan, July
1994, pp. 331–334
24 Jun, M., and Safonov, M.G.: ‘Automatic PID tuning: an application
of unfalsified control’. Proc. IEEE Symp. on CACSD, Hawaii,
August 1999