ADP for Feedback Control
F.L. Lewis, Moncrief-O'Donnell Endowed Chair
Head, Controls & Sensors Group, Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
Supported by: NSF (Paul Werbos) and ARO (Jim Overholt)
Talk available online at http://ARRI.uta.edu/acs

2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning
David Fogel, General Chair; Derong Liu, Program Chair; Remi Munos, Jennie Si, and Donald C. Wunsch, Program Co-Chairs

Relevance: Machine Feedback Control
High-speed precision motion control with unmodeled dynamics, vibration suppression, disturbance rejection, friction compensation, and deadzone/backlash control.
- Industrial machines and aerospace systems.
- Military land systems: gun turret with backlash and compliant drive train; barrel flexible modes; moving tank platform with terrain and vehicle vibration disturbances.
- Vehicle suspension: single-wheel/terrain system with nonlinearities (vehicle mass, series and parallel dampers, active damping if used, surface roughness).
(Figures: barrel tip position and turret geometry; single-wheel suspension model with sprung mass, damper constants, and road disturbance.)

Intelligent Control Tools
Fuzzy Associative Memory (FAM) and Neural Network (NN), which includes adaptive control.
- FAM: fuzzy logic rule base with input and output membership functions.
- NN: layered network mapping input x to output u.
Both FAM and NN define a function u = f(x) from inputs to outputs, and both can be used for (1) classification and decision-making and (2) control. The NN family includes adaptive control: adaptive control is a one-layer NN.

Neural Network Properties
Learning, recall, function approximation, generalization, classification, association, pattern recognition, clustering, robustness to single-node failure, repair and reconfiguration.
(Image: nervous system cell, http://www.sirinet.net/~jgjohnso/index.html)

Two-Layer Feedforward Static Neural Network
Inputs $x_1,\dots,x_n$, a hidden layer of $L$ sigmoid units $\sigma(\cdot)$, outputs $y_1,\dots,y_m$, first-layer weights $V$ and second-layer weights $W$.
Summation equations:
$$ y_i = \sum_{k=1}^{L} w_{ik}\,\sigma\Big(\sum_{j=1}^{n} v_{kj} x_j + v_{k0}\Big) + w_{i0} $$
Matrix form: $y = W^T \sigma(V^T x)$.
Two-layer networks have the universal approximation property and overcome Barron's fundamental accuracy limitation of one-layer NNs. A minimal sketch of this forward pass is given below.
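The following is a minimal Python/NumPy sketch of the two-layer forward pass $y = W^T\sigma(V^T x)$ described above, with thresholds folded in as bias weights. The dimensions and random weights are illustrative placeholders, not values from the talk.

```python
import numpy as np

# Minimal sketch of the two-layer feedforward network y = W^T sigma(V^T x).
# Dimensions and weights are illustrative only.
n_in, n_hid, n_out = 3, 8, 2
rng = np.random.default_rng(0)
V = rng.normal(size=(n_in + 1, n_hid))    # first-layer weights incl. thresholds v_k0
W = rng.normal(size=(n_hid + 1, n_out))   # second-layer weights incl. thresholds w_i0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def nn(x):
    xa = np.concatenate(([1.0], x))       # augment input with bias term
    h = sigmoid(V.T @ xa)                 # hidden-layer activations sigma(V^T x)
    ha = np.concatenate(([1.0], h))
    return W.T @ ha                       # outputs y_i

print(nn(np.array([0.1, -0.4, 0.7])))
```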
Neural Network Robot Controller
Uses the universal approximation property and feedback linearization: a nonlinear inner loop with NN estimate $\hat f(x)$, a feedforward loop driven by the desired trajectory $q_d$, a PD tracking loop with gain $K_v$ acting on the filtered tracking error $r$, and a robust control term $v(t)$.
Problem: the controller is nonlinear in the NN weights, so standard adaptive-control proof techniques do not work.
- Easy to implement with a few more lines of code.
- The learning feature allows on-line updates to the NN memory as the dynamics change.
- Handles unmodeled dynamics, disturbances, and actuator problems such as friction.
- The NN universal basis property means no regression matrix is needed.
- The nonlinear controller allows faster and more precise motion.
This extends adaptive control to nonlinear-in-the-parameters systems, with no regression matrix needed.

Theorem 1 (NN Weight Tuning for Stability)
Let the desired trajectory $q_d(t)$ and its derivatives be bounded, let the initial tracking error lie within a certain allowable set $U$, and let $Z_M$ be a known upper bound on the Frobenius norm of the unknown ideal weights $Z$. Take the control input as
$$ \tau = \hat W^T \sigma(\hat V^T x) + K_v r - v, \qquad v(t) = -K_Z\big(\|\hat Z\|_F + Z_M\big) r, $$
and let the weight tuning be provided by
$$ \dot{\hat W} = F\hat\sigma r^T - F\hat\sigma'\hat V^T x\, r^T - \kappa F\|r\|\hat W, \qquad
   \dot{\hat V} = G x\big(\hat\sigma'^T \hat W r\big)^T - \kappa G\|r\|\hat V, $$
with any constant matrices $F = F^T > 0$, $G = G^T > 0$ and scalar tuning parameter $\kappa > 0$. Initialize the weight estimates as $\hat W = 0$, $\hat V$ random. Then the filtered tracking error $r(t)$ and the NN weight estimates $\hat W, \hat V$ are uniformly ultimately bounded; moreover, arbitrarily small tracking error may be achieved by selecting large control gains $K_v$. One can also use simplified Hebbian tuning, but the tracking error is larger without the forward-propagation term.
The tuning laws contain backpropagation terms (Werbos) plus extra robustifying terms: Narendra's e-modification extended to nonlinear-in-the-parameters systems. (A discrete-time simulation sketch of these update laws appears at the end of this subsection.)

More Complex Systems?
Force control, flexible pointing systems, vehicle active suspension. SBIR contracts won, 1996 SBA Tibbetts Award, 4 US patents, NSF technology transfer to industry.

Flexible and Vibratory Systems
Add an extra feedback loop; two NNs are needed; use passivity to show stability; backstepping design.
(Figure: neural-network backstepping controller for a flexible-joint robot arm, with NN#1 in the nonlinear feedback-linearization loop, NN#2 in the backstepping loop, a robust control term, and the usual tracking loop.)
Advantage over traditional backstepping: no regression functions are needed.

Actuator Nonlinearities: Deadzone, Saturation, Backlash
NN in the feedforward loop for deadzone compensation: a NN deadzone precompensator, together with a small "critic" network, is placed between the controller output and the actuator nonlinearity $D(u)$ driving the mechanical system. The actor and critic NNs are tuned with modified backpropagation-style laws; the precompensator acts like a two-layer NN with enhanced backprop tuning.

NN Observers: Output Feedback
Needed when not all states are measured. A recurrent NN observer estimates the unmeasured states; the observer NN weights are tuned from the state estimation error $\tilde x_1$, e.g.
$$ \dot{\hat W}_o = F_o \sigma_o(\hat x)\tilde x_1^T - \kappa_o F_o \|\tilde x_1\| \hat W_o, $$
and the action NN is tuned from the estimated filtered error $\hat r$, e.g.
$$ \dot{\hat W}_c = F_c \sigma_c(\hat x_1, \hat x_2)\hat r^T - \kappa_c F_c \|\hat r\| \hat W_c. $$
(Figure: NN controller and NN observer wrapped around the robot, with gains $k_p$, $K_v$ and observer gain $k_D$.)

Other Approximators
One can also use CMAC NNs and fuzzy logic systems; a fuzzy logic system is a NN with VECTOR thresholds. Separable Gaussian activation functions give RBF NNs; separable triangular activation functions give CMAC NNs. Tuning the first-layer weights (e.g. centroids and spreads) makes the activation functions move around: dynamic focusing of awareness.

Elastic Fuzzy Logic (c.f. P. Werbos)
Membership functions are raised to a power, $\varphi(z,a,b,c) = [B(z,a,b)]^{c^2}$, where $B(z,a,b)$ is a standard membership function with center $b$ and spread $a$, so the elasticity $c$ weights the importance of each factor in the rules.
(Figures: effect of changing the membership-function spread $a$ and the elasticity $c$.)

Elastic Fuzzy Logic Control
Control $u(t) = K_v r + \hat g(x, x_d)$, where $\hat g(x, x_d)$ is the output of the fuzzy rule base. Tune the membership-function spreads $\hat a$, centers $\hat b$, elasticities $\hat c$, and the control representative values $\hat W$ with update laws of the form
$$ \dot{\hat a} = K_a A^T\hat W r - k_a K_a \|r\|\hat a, \qquad \dot{\hat b} = K_b B^T\hat W r - k_b K_b \|r\|\hat b, \qquad \dot{\hat c} = K_c C^T\hat W r - k_c K_c \|r\|\hat c, $$
$$ \dot{\hat W} = K_W\big(\hat\varphi + A\hat a + B\hat b + C\hat c\big) r^T - k_W K_W \|r\|\hat W . $$
Better performance: start with a 5x5 uniform grid of membership functions; after tuning, the system builds its own basis set, again a dynamic focusing of awareness.
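As promised above, here is a minimal Python/NumPy sketch of one Euler step of the Theorem 1 weight-tuning laws. The dimensions, gains, and the input/error signals are illustrative placeholders; in practice x and r would come from the robot tracking loop.

```python
import numpy as np

# One Euler integration step of the Theorem 1 tuning laws:
#   W_dot = F sig r' - F sig' V' x r' - kappa F ||r|| W
#   V_dot = G x (sig'^T W r)' - kappa G ||r|| V
n_in, n_hid, n_out = 4, 10, 2
rng = np.random.default_rng(0)

W = np.zeros((n_hid, n_out))          # initialize W_hat = 0 (as in Theorem 1)
V = rng.normal(size=(n_in, n_hid))    # initialize V_hat randomly
F = 10.0 * np.eye(n_hid)
G = 10.0 * np.eye(n_in)
kappa, dt = 0.01, 0.001               # illustrative tuning parameter and step

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def tune_step(x, r, W, V):
    s = sigmoid(V.T @ x)              # hidden-layer activations sigma_hat
    sp = np.diag(s * (1.0 - s))       # sigma_hat' (diagonal Jacobian)
    rn = np.linalg.norm(r)
    dW = F @ np.outer(s, r) - F @ sp @ V.T @ np.outer(x, r) - kappa * F @ W * rn
    dV = G @ np.outer(x, sp.T @ W @ r) - kappa * G @ V * rn
    return W + dt * dW, V + dt * dV

x = rng.normal(size=n_in)             # placeholder NN input (tracking signals)
r = rng.normal(size=n_out)            # placeholder filtered tracking error
W, V = tune_step(x, r, W, V)
print(W.shape, V.shape)
```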
Optimality in Biological Systems: Cell Homeostasis
The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so. Permeability control of the cell membrane; cellular metabolism. (http://www.accessexcellence.org/RC/VL/GG/index.html)

Optimality in Control Systems Design: Rocket Orbit Injection (R. Kalman, 1960)
Dynamics in polar coordinates, with radius $r$, radial velocity $w$, tangential velocity $v$, thrust $F$, thrust angle $\phi$, and mass $m$, of the form
$$ \dot r = w, \qquad \dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi, \qquad \dot v = -\frac{wv}{r} + \frac{F}{m}\cos\phi . $$
Objectives: get to orbit in minimum time; use minimum fuel. (http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf)

Research Thread
1. Neural networks for feedback control: based on the feedback control approach; unknown system dynamics; extended adaptive control to nonlinear-in-the-parameters systems; on-line tuning; no regression matrix.
2. Neural network solution of optimal design equations: nearly optimal control based on Hamilton-Jacobi optimal design equations; known system dynamics; preliminary off-line tuning.

H-Infinity Control Using Neural Networks (Murad Abu Khalaf)
System with performance output $z$, measured output $y$, disturbance $d$, and control $u = l(y)$:
$$ \dot x = f(x) + g(x)u + k(x)d, \qquad y = x, \qquad \|z\|^2 = h^T h + \|u\|^2 . $$
L2 gain problem: find the control $u(t)$ so that
$$ \frac{\int_0^\infty \|z(t)\|^2\,dt}{\int_0^\infty \|d(t)\|^2\,dt}
 = \frac{\int_0^\infty \big(h^T h + \|u\|^2\big)dt}{\int_0^\infty \|d(t)\|^2\,dt} \le \gamma^2 $$
for all L2 disturbances and a prescribed gain $\gamma^2$. This is a zero-sum differential (Nash) game, and the associated Hamilton-Jacobi-Isaacs equation cannot be solved directly.

Successive Solution, Algorithm 1 (Murad Abu Khalaf)
Let $\gamma$ be prescribed and fixed, and let $u^0$ be a stabilizing control with region of asymptotic stability $\Omega_0$; take the initial disturbance $d^0 = 0$.
1. Outer loop: update the control.
2. Inner loop: update the disturbance. Solve the value equation (the consistency equation for the value function)
$$ 0 = \Big(\frac{\partial V^i_j}{\partial x}\Big)^T\big(f + gu^j + kd^i\big) + h^T h + \|u^j\|^2 - \gamma^2\|d^i\|^2, $$
then update the disturbance
$$ d^{i+1} = \frac{1}{2\gamma^2}\,k^T(x)\,\frac{\partial V^i_j}{\partial x}, $$
and iterate on $i$ until convergence to $d^\infty, V_j$ with region of asymptotic stability $\Omega_j$. Then the outer-loop control update is
$$ u^{j+1} = -\tfrac12\, g^T(x)\,\frac{\partial V_j}{\partial x}, $$
and one iterates on $j$ until convergence to $u^*, V^*$ with region of asymptotic stability $\Omega^*$.

CT Policy Iteration for H-Infinity Control (Murad Abu Khalaf)
Problem: the value equation itself cannot be solved in closed form. Use a neural network approximation for the computation:
$$ V_L^{(i)}(x) = \sum_{j=1}^{L} w_j^{(i)}\sigma_j(x) = W_L^{(i)T}\sigma_L(x) $$
(one can use a 2-layer NN). The value-gradient approximation is
$$ \frac{\partial V_L^{(i)}}{\partial x} = \nabla\sigma_L^T(x)\, W_L^{(i)} . $$
Substituting into the value equation gives
$$ 0 = W_L^{(i)T}\nabla\sigma_L(x)\, f(x, u^j, d^i) + h^T h + \|u^j\|^2 - \gamma^2\|d^i\|^2, $$
so one may solve for the NN weights at each iteration $(i,j)$: value function approximation converts the partial differential equation into an algebraic equation in the NN weights.

Neural Network Optimal Feedback Controller (Murad Abu Khalaf)
The converged solution gives
$$ d = \frac{1}{2\gamma^2}k^T(x)\nabla\sigma_L^T W_L, \qquad u = -\tfrac12 g^T(x)\nabla\sigma_L^T W_L, $$
a NN feedback controller with nearly optimal weights.

Finite-Horizon Control (Cheng Tao)
Fixed-final-time HJB optimal control. The optimal cost is $V^*(x,t)$ and the optimal control is
$$ u^*(x,t) = -\tfrac12 R^{-1}g^T(x)\frac{\partial V^*}{\partial x}, $$
which yields the time-varying Hamilton-Jacobi-Bellman (HJB) equation
$$ \frac{\partial V^*}{\partial t} + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x) + Q(x)
   - \tfrac14\Big(\frac{\partial V^*}{\partial x}\Big)^T g(x)R^{-1}g^T(x)\frac{\partial V^*}{\partial x} = 0 . $$

HJB Solution by NN Value Function Approximation (Cheng Tao)
Use time-varying weights (Irwin Sandberg):
$$ V_L(x,t) = \sum_{j=1}^{L} w_j(t)\sigma_j(x) = w_L^T(t)\sigma_L(x), \qquad
   \frac{\partial V_L}{\partial x} = \nabla\sigma_L^T(x)\, w_L(t), \qquad
   \frac{\partial V_L}{\partial t} = \dot w_L^T(t)\,\sigma_L(x), $$
where $\nabla\sigma_L(x)$ is the Jacobian $\partial\sigma_L/\partial x$. Policy iteration is not needed: approximating $V(x,t)$ in the HJB equation gives an ODE in the NN weights,
$$ \dot w_L^T(t)\sigma_L(x) + w_L^T(t)\nabla\sigma_L(x) f(x)
   - \tfrac14 w_L^T(t)\nabla\sigma_L(x) g(x)R^{-1}g^T(x)\nabla\sigma_L^T(x) w_L(t) + Q(x) = e_L(x), $$
which is solved by least squares: simply integrate backwards from the final time to find the NN weights. The control is
$$ u^*(x) = -\tfrac12 R^{-1}g^T(x)\nabla\sigma_L^T(x)\, w_L(t) . $$
A minimal numerical sketch of this backward integration follows.
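The sketch below, in Python/NumPy, integrates the NN weight ODE backward in time for a hypothetical scalar system, solving for the weight derivative at each step by least squares over a set of collocation points. The dynamics, basis functions, horizon, and step size are illustrative assumptions, not the example used in the talk.

```python
import numpy as np

# Fixed-final-time HJB via NN value function approximation: integrate the
# weight ODE backward from t = T (where V = 0) using least squares in space.
f = lambda x: -0.5 * x                 # assumed internal dynamics (illustrative)
g = lambda x: 1.0
Q = lambda x: x**2
Rinv = 1.0

sigma  = lambda x: np.array([x**2, x**4])      # basis functions sigma_L(x)
dsigma = lambda x: np.array([2*x, 4*x**3])     # their gradients

xs = np.linspace(-1.0, 1.0, 21)                # collocation points
Phi = np.array([sigma(x) for x in xs])         # N x L regression matrix

T, dt = 5.0, 0.01
w = np.zeros(2)                                # terminal weights: V(x, T) = 0
for _ in range(int(T / dt)):                   # march backward in time
    # HJB residual gives d/dt (w' sigma(x)) at each collocation point
    rhs = np.array([
        -(w @ dsigma(x)) * f(x) - Q(x)
        + 0.25 * (w @ dsigma(x)) * g(x) * Rinv * g(x) * (dsigma(x) @ w)
        for x in xs
    ])
    wdot = np.linalg.lstsq(Phi, rhs, rcond=None)[0]
    w = w - dt * wdot                          # step from t to t - dt

u = lambda x: -0.5 * Rinv * g(x) * (dsigma(x) @ w)   # nearly optimal control at t = 0
print("weights at t = 0:", w)
```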
ARRI Research Roadmap in Neural Networks
1. Neural networks for feedback control (1995-2002): based on the feedback control approach; unknown system dynamics; extended adaptive control to nonlinear-in-the-parameters systems; on-line tuning; no regression matrix. NN applied to feedback linearization, singular perturbations, backstepping, force control, dynamic inversion, etc.
2. Neural network solution of optimal design equations (2002-2006): nearly optimal control based on Hamilton-Jacobi optimal design equations; known system dynamics; preliminary off-line tuning; nearly optimal solution of the control design equations; no canonical form needed.
3. Approximate dynamic programming (2006-): nearly optimal control based on a recursive equation for the optimal value; system dynamics usually known (except for Q learning); on-line tuning; no canonical form needed. The goal: extend adaptive control so that unknown dynamics yield OPTIMAL controllers, i.e. optimal adaptive control.

Four ADP Methods Proposed by Werbos
The critic NN approximates:
- Heuristic dynamic programming (HDP): the value $V(x_k)$.
- Dual heuristic programming (DHP): the gradient $\partial V/\partial x$.
- Action-dependent HDP (ADHDP; Watkins' Q learning): the Q function $Q(x_k, u_k)$.
- Action-dependent DHP (ADDHP): the gradients $\partial Q/\partial x$, $\partial Q/\partial u$.
An action NN approximates the control. See also Bertsekas (neurodynamic programming) and Barto & Bradtke (Q-learning proof, which imposed a settling time).

Discrete-Time Optimal Control
Cost under the prescribed control policy $u_k = h(x_k)$: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)$.
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$.
Hamiltonian: $H(x_k, V, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$.
Optimal cost: $V^*(x_k) = \min_h\big(r(x_k, h(x_k)) + V_h(x_{k+1})\big)$.
Bellman's principle: $V^*(x_k) = \min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$.
Optimal control: $h^*(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V^*(x_{k+1})\big)$.
The system dynamics does not appear explicitly in the Bellman equation; the solutions developed by the computational intelligence community, however, use the system dynamics.

System $x_{k+1} = f(x_k) + g(x_k)u_k$ with cost $V(x_0) = \sum_{k=0}^\infty\big(x_k^T Q x_k + u_k^T R u_k\big)$.
DT HJB equation:
$$ V^*(x_k) = \min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1})\big)
           = \min_{u_k}\big(x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k)u_k)\big), $$
$$ u^*(x_k) = -\tfrac12 R^{-1} g^T(x_k)\,\frac{dV^*(x_{k+1})}{dx_{k+1}} . $$
This is difficult to solve; there are few practical solutions from the control systems community.

ADP Method 1: Heuristic Dynamic Programming (HDP), Paul Werbos
Policy iteration:
$$ V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1}), \qquad
   h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big). $$
For the LQR this is the Lyapunov equation
$$ (A - BL_j)^T P_{j+1}(A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0, \qquad
   L_j = (I + B^T P_j B)^{-1} B^T P_j A, $$
the underlying Riccati-equation iteration of Hewer (1971). An initial stabilizing control is needed.
ADP greedy cost update (value iteration):
$$ V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1}), \qquad
   h_{j+1}(x_k) = \arg\min_{u_k}\big(r(x_k, u_k) + V_{j+1}(x_{k+1})\big), $$
a simple recursion; for the LQR,
$$ P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j, \qquad
   L_j = (I + B^T P_j B)^{-1} B^T P_j A . $$
Lancaster & Rodman proved convergence, and an initial stabilizing control is NOT needed. (A minimal LQR sketch of this greedy recursion appears below.)

DT HDP vs. Receding Horizon Optimal Control
Forward-in-time HDP:
$$ P_{i+1} = A^T P_i A + Q - A^T P_i B\big(I + B^T P_i B\big)^{-1} B^T P_i A, \qquad P_0 = 0 . $$
Backward-in-time optimization (receding horizon control):
$$ P_k = A^T P_{k+1} A + Q - A^T P_{k+1} B\big(I + B^T P_{k+1} B\big)^{-1} B^T P_{k+1} A, \qquad
   P_N = \text{a control Lyapunov function}. $$
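Here is a minimal Python/NumPy sketch of the greedy HDP (value-iteration) recursion for the DT LQR described above, started from P = 0 and checked against the discrete ARE residual. The plant matrices are illustrative assumptions only.

```python
import numpy as np

# Greedy HDP for DT LQR:  P_{j+1} = (A - B L_j)' P_j (A - B L_j) + Q + L_j' R L_j,
#                          L_j = (R + B' P_j B)^{-1} B' P_j A
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

P = np.zeros((2, 2))                                   # no stabilizing initial policy needed
for j in range(1000):
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # policy from current value
    Acl = A - B @ L
    P_next = Acl.T @ P @ Acl + Q + L.T @ R @ L         # greedy value update
    if np.max(np.abs(P_next - P)) < 1e-10:
        P = P_next
        break
    P = P_next

print("HDP value-iteration P:\n", P)
# Check the DARE residual: P = A'PA + Q - A'PB (R + B'PB)^{-1} B'PA
res = A.T @ P @ A + Q - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A) - P
print("DARE residual norm:", np.linalg.norm(res))
```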
Q Learning: Action-Dependent ADP
Define the Q function
$$ Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}), $$
where $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time $k$, so that $Q_h(x_k, h(x_k)) = V_h(x_k)$.
Recursion for Q:
$$ Q_h(x_k, u_k) = r(x_k, u_k) + Q_h\big(x_{k+1}, h(x_{k+1})\big) . $$
Simple expression of Bellman's principle:
$$ V^*(x_k) = \min_{u_k} Q^*(x_k, u_k), \qquad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k) . $$
Q learning does not need to know $f(x_k)$ or $g(x_k)$.

For the LQR, $V(x) = x^T P x$ is quadratic in $x$, and
$$ Q_h(x_k,u_k) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)
 = \begin{bmatrix} x_k \\ u_k\end{bmatrix}^T
   \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B\end{bmatrix}
   \begin{bmatrix} x_k \\ u_k\end{bmatrix}
 \equiv \begin{bmatrix} x_k \\ u_k\end{bmatrix}^T
   \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu}\end{bmatrix}
   \begin{bmatrix} x_k \\ u_k\end{bmatrix}, $$
so Q is quadratic in $x$ and $u$. The control update is found from $\partial Q/\partial u_k = 0$:
$$ 0 = 2\big[B^T P A\, x_k + (R + B^T P B)\,u_k\big] = 2\big[H_{ux}x_k + H_{uu}u_k\big], $$
so
$$ u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k = -L_{j+1} x_k . $$
The control is found only from the Q function; A and B are not needed.

Model-Free Policy Iteration: Q Policy Iteration (Bradtke, Ydstie, Barto)
$$ Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}\big(x_{k+1}, -L_j x_{k+1}\big), $$
or in parametric form
$$ W_{j+1}^T\big[\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})\big] = r(x_k, -L_j x_k) . $$
Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$, i.e. $u_k = -H_{uu}^{-1}H_{ux}x_k = -L_{j+1}x_k$. A stable initial control is needed.

ADP Method 3: Q Learning, Action-Dependent Heuristic Dynamic Programming (ADHDP), Paul Werbos
Greedy Q update (model-free ADP):
$$ Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j\big(x_{k+1}, h_j(x_{k+1})\big), $$
$$ W_{j+1}^T\phi(x_k, u_k) = \underbrace{r(x_k, -L_j x_k) + W_j^T\phi(x_{k+1}, -L_j x_{k+1})}_{\text{target}} . $$
Update the weights by RLS or backpropagation. Q learning actually solves the Riccati equation WITHOUT knowing the plant dynamics: model-free ADP, i.e. direct OPTIMAL ADAPTIVE CONTROL, and it extends to nonlinear systems. Open questions: proofs, robustness, comparison with adaptive control methods. (A minimal LQR sketch of this model-free iteration is given below.)
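The sketch below, in Python/NumPy, runs the greedy ADHDP Q update just described for a DT LQR problem: the kernel matrix H is fitted by batch least squares on a quadratic basis of (x, u) data with probing noise, and the gain is then extracted from H alone. The plant matrices are illustrative assumptions and are used only to generate data, never inside the learner.

```python
import numpy as np

# Model-free ADHDP (greedy Q update) for DT LQR.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1
rng = np.random.default_rng(0)

def basis(z):                            # quadratic Kronecker basis of z = [x; u]
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

L = np.zeros((m, n))                     # initial policy u = -L x
W = np.zeros(basis(np.zeros(n + m)).size)
for j in range(30):
    Phi, Y = [], []
    for _ in range(200):                 # data with probing noise (persistence of excitation)
        x = rng.normal(size=n)
        u = -L @ x + 0.1 * rng.normal(size=m)
        x1 = A @ x + B @ u
        u1 = -L @ x1                     # current policy applied after time k
        r = x @ Q @ x + u @ R @ u
        Phi.append(basis(np.concatenate([x, u])))
        Y.append(r + W @ basis(np.concatenate([x1, u1])))    # greedy target
    W = np.linalg.lstsq(np.array(Phi), np.array(Y), rcond=None)[0]
    # rebuild the symmetric kernel H from the weights, then extract H_ux, H_uu
    H = np.zeros((n + m, n + m))
    idx = 0
    for a in range(n + m):
        for b in range(a, n + m):
            H[a, b] = W[idx] if a == b else W[idx] / 2
            H[b, a] = H[a, b]
            idx += 1
    L = np.linalg.solve(H[n:, n:], H[n:, :n])   # u = -H_uu^{-1} H_ux x

print("learned gain L:", L)
```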
ADP for Discrete-Time H-Infinity Control (Asma Al-Tamimi)
Finding the Nash game equilibrium using HDP, DHP, AD HDP (Q learning), and AD DHP.

ADP for DT H-Infinity Optimal Control
$$ x_{k+1} = Ax_k + Bu_k + Ew_k, \qquad y_k = x_k, \qquad z_k^T z_k = x_k^T Q x_k + u_k^T u_k, $$
with control $u_k = Lx_k$, disturbance $w_k$, and disturbance-penalty output $z_k$. Find the control $u_k$ so that
$$ \frac{\sum_{k=0}^\infty\big(x_k^T Q x_k + u_k^T u_k\big)}{\sum_{k=0}^\infty w_k^T w_k} \le \gamma^2 $$
for all L2 disturbances and a prescribed gain $\gamma^2$, when the system is at rest, $x_0 = 0$.

Two Known Iterative Solutions for DT H-Infinity Control
Policy iteration for the game solution (requires a stable initial policy): iterate a game Lyapunov equation of the form
$$ P_{i+1} = \bar A^T P_{i+1}\bar A + Q + L_i^T L_i - \gamma^2 K_j^T K_j, \qquad \bar A = A + BL_i + EK_j, $$
with player updates of the form $L_i = -(I + B^T P_i B)^{-1}B^T P_i (A + EK_j)$ and $K_j = -(E^T P_i E - \gamma^2 I)^{-1}E^T P_i (A + BL_i)$.
ADP greedy iteration (does not require a stable initial policy):
$$ P_{i+1} = A^T P_i A + Q - \begin{bmatrix} A^T P_i B & A^T P_i E\end{bmatrix}
   \begin{bmatrix} I + B^T P_i B & B^T P_i E \\ E^T P_i B & E^T P_i E - \gamma^2 I\end{bmatrix}^{-1}
   \begin{bmatrix} B^T P_i A \\ E^T P_i A\end{bmatrix}. $$
Both forms require full knowledge of the system dynamics.

DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation (Asma Al-Tamimi)
An ADP scheme with the incremental optimization
$$ V_{i+1}(x_k) = \min_{u_k}\max_{w_k}\big(x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1})\big), $$
which is equivalently written as
$$ V_{i+1}(x_k) = x_k^T Q x_k + u_i^T(x_k)u_i(x_k) - \gamma^2 w_i^T(x_k)w_i(x_k) + V_i(x_{k+1}) . $$

HDP: Linear System Case (Asma Al-Tamimi)
$\hat V(x, p_i) = p_i^T\bar x$, where $\bar x = (x_1^2, \dots, x_1x_n, x_2^2, x_2x_3, \dots, x_{n-1}x_n, x_n^2)$ is the quadratic (Kronecker-product) basis.
Value function update, solved by batch least squares or RLS:
$$ p_{i+1}^T\bar x_k = x_k^T Q x_k + (L_i x_k)^T(L_i x_k) - \gamma^2 (K_i x_k)^T(K_i x_k) + p_i^T\bar x_{k+1} . $$
Control and disturbance updates $\hat u(x) = L_i x$, $\hat w(x) = K_i x$, with gains of the form
$$ L_i = -\big(I + B^T P_i B - B^T P_i E(E^T P_i E - \gamma^2 I)^{-1}E^T P_i B\big)^{-1}
        \big(B^T P_i A - B^T P_i E(E^T P_i E - \gamma^2 I)^{-1}E^T P_i A\big), $$
$$ K_i = -\big(E^T P_i E - \gamma^2 I - E^T P_i B(I + B^T P_i B)^{-1}B^T P_i E\big)^{-1}
        \big(E^T P_i A - E^T P_i B(I + B^T P_i B)^{-1}B^T P_i A\big), $$
so A, B, and E are needed for the control-gain and disturbance-gain updates. It was shown that this scheme is equivalent to iterating the underlying game Riccati equation
$$ P_{i+1} = A^T P_i A + Q - \begin{bmatrix} A^T P_i B & A^T P_i E\end{bmatrix}
   \begin{bmatrix} I + B^T P_i B & B^T P_i E \\ E^T P_i B & E^T P_i E - \gamma^2 I\end{bmatrix}^{-1}
   \begin{bmatrix} B^T P_i A \\ E^T P_i A\end{bmatrix}, $$
which is known to converge (Stoorvogel, Basar).

Q-Learning for DT H-Infinity Control: Action-Dependent Heuristic Dynamic Programming (Asma Al-Tamimi)
Dynamic programming (backward in time):
$$ Q(x_k,u_k,w_k) = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V(x_{k+1}), \qquad
   (u_k, w_k) = \arg\min_{u}\max_{w} Q(x_k,u,w) . $$
Adaptive dynamic programming (forward in time):
$$ Q_{i+1}(x_k,u_k,w_k) = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k
   + \min_{u_{k+1}}\max_{w_{k+1}} Q_i(x_{k+1},u_{k+1},w_{k+1})
   = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(Ax_k + Bu_k + Ew_k), $$
with $u_i(x_k) = L_i x_k$ and $w_i(x_k) = K_i x_k$. In the linear quadratic case both V and Q are quadratic: $V(x_k) = x_k^T P x_k$ and
$$ Q(x_k,u_k,w_k) = r(x_k,u_k,w_k) + V(x_{k+1}) = [x_k^T\; u_k^T\; w_k^T]\, H\, [x_k^T\; u_k^T\; w_k^T]^T . $$
Q function update:
$$ [x_k^T\; u_k^T\; w_k^T]\, H_{i+1}\, [x_k^T\; u_k^T\; w_k^T]^T
   = x_k^T R x_k + u_k^T u_k - \gamma^2 w_k^T w_k
   + [x_{k+1}^T\; u_{k+1}^T\; w_{k+1}^T]\, H_i\, [x_{k+1}^T\; u_{k+1}^T\; w_{k+1}^T]^T . $$
Control action and disturbance updates, from the partition of H into blocks $H_{xx}, H_{xu}, H_{xw}, H_{ux}, H_{uu}, H_{uw}, H_{wx}, H_{wu}, H_{ww}$:
$$ L_i = \big(H_{uu} - H_{uw}H_{ww}^{-1}H_{wu}\big)^{-1}\big(H_{uw}H_{ww}^{-1}H_{wx} - H_{ux}\big), \qquad
   K_i = \big(H_{ww} - H_{wu}H_{uu}^{-1}H_{uw}\big)^{-1}\big(H_{wu}H_{uu}^{-1}H_{ux} - H_{wx}\big) . $$
A, B, and E are NOT needed.

A quadratic basis set is used to allow on-line solution:
$$ \hat Q(z, h_i) = z^T H_i z = h_i^T\bar z, \qquad z = [x^T\; u^T\; w^T]^T, \qquad
   \bar z = (z_1^2, \dots, z_1 z_q, z_2^2, z_2 z_3, \dots, z_{q-1}z_q, z_q^2) $$
(the quadratic Kronecker basis). The Q function update then solves for the "NN weights", i.e. the elements of the kernel matrix H,
$$ h_{i+1}^T\bar z(x_k) = x_k^T R x_k + \hat u_i^T(x_k)\hat u_i(x_k) - \gamma^2\hat w_i^T(x_k)\hat w_i(x_k) + h_i^T\bar z(x_{k+1}), $$
using batch least squares or on-line RLS, followed by the control and disturbance updates $\hat u_i(x) = L_i x$, $\hat w_i(x) = K_i x$. Probing noise is injected to obtain persistence of excitation,
$$ \hat u_i^e(x_k) = L_i x_k + n_{1k}, \qquad \hat w_i^e(x_k) = K_i x_k + n_{2k}, $$
and the proof shows that the iteration still converges to the exact result.

H-Infinity Q-Learning Convergence Proofs (Asma Al-Tamimi)
Convergence: H-infinity Q learning is equivalent to iterating the game Riccati recursion above, rewritten in terms of the closed-loop signals $(x_k, L_i x_k, K_i x_k)$, but carried out using measured data only, without knowing the system matrices. The result is a model-free direct adaptive controller that converges to an H-infinity optimal controller, with no requirement whatsoever on knowledge of the plant model matrices: direct H-infinity adaptive control.
Compare with the Q function for the H2 optimal control case,
$$ Q_h(x_k,u_k) = \begin{bmatrix} x_k \\ u_k\end{bmatrix}^T
   \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B\end{bmatrix}
   \begin{bmatrix} x_k \\ u_k\end{bmatrix}; $$
the H-infinity game Q function simply adds the disturbance row and column. (A minimal sketch of the model-based game recursion appears below.)
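For reference, here is a minimal Python/NumPy sketch of the model-based DT game (H-infinity) greedy HDP recursion given above; the Q-learning version replaces the A, B, E terms with the measured kernel matrix H. The plant matrices and the attenuation level gamma are illustrative assumptions.

```python
import numpy as np

# Greedy HDP iteration on the DT game Riccati equation.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
E = np.array([[1.0], [0.0]])
Q = np.eye(2)
gamma = 5.0

P = np.zeros((2, 2))
for i in range(500):
    G = np.block([[np.eye(1) + B.T @ P @ B,              B.T @ P @ E],
                  [E.T @ P @ B,  E.T @ P @ E - gamma**2 * np.eye(1)]])
    S = np.vstack([B.T @ P @ A, E.T @ P @ A])
    P_next = A.T @ P @ A + Q - np.hstack([A.T @ P @ B, A.T @ P @ E]) @ np.linalg.solve(G, S)
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next

print("game-ARE solution P:\n", P)
```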
ADP for Nonlinear Systems: Convergence Proof for HDP (Asma Al-Tamimi)
Discrete-time nonlinear adaptive dynamic programming. System dynamics $x_{k+1} = f(x_k) + g(x_k)u(x_k)$, cost
$$ V(x_k) = \sum_{i=k}^{\infty}\big(x_i^T Q x_i + u_i^T R u_i\big), $$
and value function recursion
$$ V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1}) . $$
HDP iteration:
$$ u_i(x_k) = \arg\min_u\big(x_k^T Q x_k + u^T R u + V_i(x_{k+1})\big), $$
$$ V_{i+1}(x_k) = \min_u\big(x_k^T Q x_k + u^T R u + V_i(x_{k+1})\big)
              = x_k^T Q x_k + u_i^T(x_k)Ru_i(x_k) + V_i\big(f(x_k) + g(x_k)u_i(x_k)\big). $$
The proof of convergence of DT nonlinear HDP has the flavor of the arguments above.

Standard Neural Network VFA for On-Line Implementation
Critic NN for the value: $\hat V_i(x_k, W_{Vi}) = W_{Vi}^T\phi(x_k)$; action NN for the control: $\hat u_i(x_k, W_{ui}) = W_{ui}^T\sigma(x_k)$ (2-layer NNs can also be used).
Define the target cost
$$ d\big(\phi(x_k), W_{Vi}\big) = x_k^T Q x_k + \hat u_i^T(x_k)R\hat u_i(x_k) + \hat V_i(x_{k+1})
   = x_k^T Q x_k + \hat u_i^T(x_k)R\hat u_i(x_k) + W_{Vi}^T\phi(x_{k+1}) . $$
The cost equation is explicit, so use least squares for the critic NN update:
$$ W_{V,i+1} = \arg\min_W \int \big|W^T\phi(x_k) - d(\phi(x_k), W_{Vi})\big|^2\,dx_k
   = \Big(\int\phi(x_k)\phi^T(x_k)\,dx_k\Big)^{-1}\int\phi(x_k)\,d\big(\phi(x_k), W_{Vi}, W_{ui}\big)\,dx_k . $$
The control equation is implicit, so use gradient descent for the action update:
$$ W_{ui} = \arg\min_W\Big(x_k^T Q x_k + \hat u^T(x_k,W)R\hat u(x_k,W)
   + \hat V_i\big(f(x_k) + g(x_k)\hat u(x_k,W)\big)\Big), $$
$$ W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\,\frac{\partial}{\partial W_{ui}^{(j)}}
   \Big[x_k^T Q x_k + \hat u_i^{(j)T}R\hat u_i^{(j)} + \hat V_i(x_{k+1})\Big]
   = W_{ui}^{(j)} - \alpha\,\sigma(x_k)\Big(2R\hat u_i^{(j)} + g(x_k)^T\frac{\partial\phi(x_{k+1})}{\partial x_{k+1}}W_{Vi}\Big)^T . $$
This is backpropagation (P. Werbos).

Issues with Nonlinear ADP
The least-squares critic update involves an integral over a region of state space, which raises the question of how to select the NN training set: approximate the integral using a set of points over a region (batch least squares), or take sample points along a single trajectory (recursive least squares). For linear systems these are the same. Conjecture: for nonlinear systems they are the same under a persistence of excitation condition, i.e. with exploration.

Interesting Fact for HDP for Nonlinear Systems
In the linear case the policy update $h_j(x_k) = -L_j x_k = -(I + B^T P_j B)^{-1}B^T P_j A\, x_k$ requires knowledge of the system A and B matrices. In the nonlinear case, with the action NN $\hat u_i(x_k, W_{ui}) = W_{ui}^T\sigma(x_k)$ tuned by gradient descent, the internal dynamics $f(x_k)$ is NOT needed, since (1) a NN approximation is used for the action and (2) $x_{k+1}$ is measured. A minimal training sketch follows.
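The following Python/NumPy sketch runs a few HDP iterations for a hypothetical scalar nonlinear system, with a batch least-squares critic update and a gradient-descent actor update in the form described above. The dynamics, bases, learning rates, and training set are illustrative assumptions.

```python
import numpy as np

# Nonlinear DT HDP with NN value function approximation:
# batch LS for the critic, gradient descent for the action network.
f = lambda x: 0.8 * np.sin(x)            # assumed internal dynamics (illustrative)
g = lambda x: 1.0
Qc, R = 1.0, 1.0

phi  = lambda x: np.array([x**2, x**4])  # critic basis
dphi = lambda x: np.array([2*x, 4*x**3])
sig  = lambda x: np.array([x, x**3])     # actor basis

Wv = np.zeros(2)                         # critic weights
Wu = np.zeros(2)                         # actor weights
xs = np.linspace(-1, 1, 41)              # training set over a region of state space

for i in range(50):                      # HDP iterations
    # actor update: gradient descent on  x'Qx + u'Ru + V_i(x_{k+1})
    for _ in range(100):
        grad = np.zeros(2)
        for x in xs:
            u = Wu @ sig(x)
            x_next = f(x) + g(x) * u
            grad += sig(x) * (2 * R * u + g(x) * (dphi(x_next) @ Wv))
        Wu -= 0.01 * grad / len(xs)
    # critic update: fit W_{V,i+1}' phi(x) to the target d(x) by batch LS
    Phi, d = [], []
    for x in xs:
        u = Wu @ sig(x)
        x_next = f(x) + g(x) * u
        Phi.append(phi(x))
        d.append(Qc * x**2 + R * u**2 + Wv @ phi(x_next))
    Wv = np.linalg.lstsq(np.array(Phi), np.array(d), rcond=None)[0]

print("critic weights:", Wv, " actor weights:", Wu)
```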
ADP for Continuous-Time Systems (Draguna Vrabie): Policy Iteration and HDP
Continuous-time optimal control: system $\dot x = f(x,u)$, cost
$$ V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty\big(Q(x) + u^T R u\big)d\tau $$
(c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$, in which $f(\cdot)$ and $g(\cdot)$ do not appear).
Hamiltonian and cost equation for a given policy:
$$ 0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) . $$
Bellman's optimality condition:
$$ 0 = \min_{u(t)}\Big[r(x,u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x,u)\Big], $$
with optimal control
$$ h^*(x(t)) = -\tfrac12 R^{-1}g^T(x)\frac{\partial V^*}{\partial x} $$
and HJB equation
$$ 0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac14\Big(\frac{dV^*}{dx}\Big)^T gR^{-1}g^T\frac{dV^*}{dx}, \qquad V^*(0) = 0 . $$

Linear System, Quadratic Cost
System $\dot x = Ax + Bu$, utility $r(x,u) = x^T Q x + u^T R u$, $R > 0$, $Q \ge 0$. The cost is quadratic,
$$ V(x(t)) = \int_t^\infty r(x,u)\,d\tau = x^T(t)Px(t), $$
the optimal state-feedback control is $u(t) = -R^{-1}B^T P x(t) = -Lx(t)$, and the HJB equation is the algebraic Riccati equation (ARE)
$$ 0 = PA + A^T P + Q - PBR^{-1}B^T P . $$

CT Policy Iteration
Utility $r(x,u) = Q(x) + u^T R u$. For any given $u(t)$ the cost satisfies the Lyapunov-like equation $0 = (\partial V/\partial x)^T f(x,u) + r(x,u)$. Iterative solution: pick a stabilizing initial control, then
- find the cost: $0 = \big(\partial V_j/\partial x\big)^T f(x, h_j(x)) + r(x, h_j(x))$, with $V_j(0) = 0$;
- update the control: $h_{j+1}(x) = -\tfrac12 R^{-1}g^T(x)\,\partial V_j/\partial x$.
Convergence was proved by Saridis (1979) when the Lyapunov equation is solved exactly; Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation; Abu Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence. The full system dynamics must be known.

LQR Policy Iteration = Kleinman Algorithm
1. For a given control policy $u = -L_k x$, solve for the cost: $0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k$ (Lyapunov equation), with $A_k = A - BL_k$.
2. Improve the policy: $L_{k+1} = R^{-1}B^T P_k$.
If started with a stabilizing control policy $L_0$, the matrices $P_k$ converge monotonically to the unique positive definite solution of the Riccati equation, and every iteration step returns a stabilizing controller. The system has to be known (Kleinman 1968).

Policy Iteration Is a Newton Method
The policy iteration
$$ (A - BB^T P_i)^T P_{i+1} + P_{i+1}(A - BB^T P_i) + P_i BB^T P_i + Q = 0 $$
is in fact Newton's method: with $\mathrm{Ric}(P) = A^T P + PA + Q - PBB^T P$ and Frechet derivative $\mathrm{Ric}'_{P_i}(P) = (A - BB^T P_i)^T P + P(A - BB^T P_i)$, policy iteration reads
$$ P_{i+1} = P_i - \big(\mathrm{Ric}'_{P_i}\big)^{-1}\mathrm{Ric}(P_i), \qquad i = 0, 1, \dots $$

Synopsis on Policy Iteration and ADP
Discrete time. Policy iteration:
$$ V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1}) = r(x_k, h_j(x_k)) + V_{j+1}\big[f(x_k) + g(x_k)h_j(x_k)\big], \qquad
   h_j(x_k) = -(I + B^T P_j B)^{-1}B^T P_j A\, x_k . $$
If $x_{k+1}$ is measured, no knowledge of $f(x)$ or $g(x)$ is needed for the value update, but $f(x_k)$ AND $g(x_k)$ are needed for the control update. ADP greedy cost update: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$.
Continuous time. Policy iteration:
$$ 0 = \Big(\frac{\partial V_j}{\partial x}\Big)^T\big[f(x) + g(x)h_j(x)\big] + r(x, h_j(x)), \qquad
   h_{j+1}(x) = -\tfrac12 R^{-1}g^T(x)\frac{\partial V_j}{\partial x} . $$
Here one must either measure $dx/dt$ or know $f(x)$ and $g(x)$ for the value equation, while ONLY $g(x)$ is needed for the control update. What, then, is greedy ADP for CT systems?

Policy Iterations without Lyapunov Equations (Draguna Vrabie)
An alternative to policy iterations with Lyapunov equations is the following form of policy iterations:
$$ V_j(x_0) = \int_0^\infty\big[Q(x) + W(u_j)\big]dt \quad\text{(measure the cost)}, \qquad
   u_{j+1}(x) = -\tfrac12 R^{-1}g^T\frac{dV_j}{dx} . $$
Note that in this case, to solve for the Lyapunov function, you do not need to know $f(x)$ (Murray, Saeks, and Lendaris).

Methods to Obtain the Solution
Dynamic programming built on Bellman's optimality principle, in an alternative form for CT systems [Lewis & Syrmos 1995]:
$$ V^*(x(t)) = \min_{u(\tau),\, t\le\tau\le t+\Delta t}\Big[\int_t^{t+\Delta t} r\big(x(\tau), u(\tau)\big)\,d\tau + V^*(x(t+\Delta t))\Big], \qquad
   r(x,u) = x^T Q x + u^T R u . $$

Solving for the Cost: Our Approach (Draguna Vrabie)
For a given control $u = -Lx$ the cost satisfies
$$ V(x(t)) = \int_t^{t+T}\big(x^T Q x + u^T R u\big)d\tau + V(x(t+T)) $$
(c.f. the DT case $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$). In the LQR case,
$$ x^T(t)Px(t) = \int_t^{t+T}\big(x^T Q x + u^T R u\big)d\tau + x^T(t+T)Px(t+T), $$
so $f(x)$ and $g(x)$ do not appear; the optimal gain is $L = R^{-1}B^T P$.

Policy Evaluation: Critic Update
Let K be any state-feedback gain for the system. One can measure the associated cost over the infinite horizon as
$$ V(t, x(t)) = \int_t^{t+T} x^T(\tau)\big(Q + K^T R K\big)x(\tau)\,d\tau + W(t+T, x(t+T)), $$
where $W(t+T, x(t+T))$ is an initial infinite-horizon cost-to-go. (What to do about the tail is the classical issue in receding horizon control.)

CT ADP Greedy Iteration
Now greedy ADP can be defined for CT systems. Control policy $u_k(t) = -L_k x(t)$; cost update
$$ V_{k+1}(x(t_0)) = \int_{t_0}^{t_0+T}\big(x^T Q x + u_k^T R u_k\big)dt + V_k(x(t_0+T)), $$
i.e. for the LQR
$$ x_0^T P_{k+1}x_0 = \int_{t_0}^{t_0+T}\big(x^T Q x + u_k^T R u_k\big)dt + x_1^T P_k x_1, $$
where $x_1 = x(t_0+T)$. A and B do not appear in the cost update; the control gain update $L_{k+1} = R^{-1}B^T P_{k+1}$ needs only B. Implement using a quadratic basis set,
$$ p_{i+1}^T\bar x(t) = \int_t^{t+T} x^T(\tau)\big(Q + P_i BR^{-1}B^T P_i\big)x(\tau)\,d\tau + p_i^T\bar x(t+T) . $$
Expressing $u(t+T)$ in terms of $x(t+T)$ is fine, and no initial stabilizing control is needed: direct optimal adaptive control for partially unknown CT systems. (A minimal LQR sketch of this iteration is given below.)
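Here is a minimal Python sketch (NumPy/SciPy) of the CT greedy ADP iteration for the LQR case just described: cost increments are "measured" by simulating the closed-loop plant over intervals of length T, the quadratic-basis weights are found by least squares, and the gain is updated with B only. The plant matrices and sample trajectories are illustrative assumptions; A is used solely to generate the data.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

# CT greedy ADP (HDP) for LQR:
#   x0' P_{k+1} x0 = int_{t0}^{t0+T} (x'Qx + u_k'Ru_k) dt + x1' P_k x1,  L_{k+1} = R^{-1} B' P_{k+1}
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
T = 0.5

def vbar(x):                                  # quadratic (Kronecker) basis
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

P = np.zeros((2, 2))                          # no initial stabilizing gain needed
for k in range(60):
    L = np.linalg.solve(R, B.T @ P)           # L_k = R^{-1} B' P_k
    X, Y = [], []
    for x0 in [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.]),
               np.array([1., -1.]), np.array([0.5, 2.])]:
        def ode(t, z):                        # augment the state with the running cost
            x = z[:2]
            u = -L @ x
            return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])
        sol = solve_ivp(ode, [0, T], np.concatenate([x0, [0.0]]), rtol=1e-8)
        x1, cost = sol.y[:2, -1], sol.y[2, -1]
        X.append(vbar(x0))
        Y.append(cost + x1 @ P @ x1)          # measured cost increment + old cost-to-go
    p = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)[0]
    P = np.array([[p[0], p[1]/2], [p[1]/2, p[2]]])

print("ADP P:\n", P)
print("ARE P:\n", solve_continuous_are(A, B, Q, R))
```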
Algorithm Implementation
Measure the cost increment by adding V as a state, so that $\dot V = x^T Q x + u_k^T R u_k$. The critic update
$$ x^T(t)P_{i+1}x(t) = \int_t^{t+T} x^T(\tau)\big(Q + K_i^T R K_i\big)x(\tau)\,d\tau + x^T(t+T)P_i x(t+T) $$
can be set up with a quadratic basis set $\bar x$ as
$$ p_{i+1}^T\bar x(t) = \int_t^{t+T} x^T(\tau)\big(Q + K_i^T R K_i\big)x(\tau)\,d\tau + p_i^T\bar x(t+T) \equiv d\big(\bar x(t), K_i\big) . $$
Evaluating $d(\bar x(t), K_i)$ at $n(n+1)/2$ or more trajectory points, one can set up a least-squares problem
$$ p_{i+1} = (XX^T)^{-1}XY, \qquad X = \big[\bar x^1(t)\;\; \bar x^2(t)\;\cdots\;\bar x^N(t)\big], \qquad
   Y = \big[d(\bar x^1, K_i)\;\cdots\; d(\bar x^N, K_i)\big]^T, $$
or use recursive least squares along the trajectory.

Analysis of the Algorithm (Draguna Vrabie)
For a given control policy $u_k = -L_k x$ with $L_k = R^{-1}B^T P_k$, the closed-loop system is $\dot x = A_k x$, $A_k = A - BR^{-1}B^T P_k$, so $x(t) = e^{A_k t}x(0)$. The greedy update
$$ V_{k+1}(x(t)) = \int_t^{t+T} x^T\big(Q + L_k^T R L_k\big)x\,d\tau + V_k(x(t+T)), \qquad V_0 = 0, $$
is equivalent to
$$ P_{k+1} = \int_0^{T} e^{A_k^T t}\big(Q + L_k^T R L_k\big)e^{A_k t}\,dt + e^{A_k^T T}P_k\, e^{A_k T}, $$
a strange pseudo-discretized Riccati recursion (c.f. the DT Riccati recursion $P_{k+1} = A^T P_k A + Q - A^T P_k B(R + B^T P_k B)^{-1}B^T P_k A$).
Lemma 2: CT HDP is equivalent to
$$ P_{k+1} = P_k + \int_0^{T} e^{A_k^T t}\big(A_k^T P_k + P_k A_k + Q + L_k^T R L_k\big)e^{A_k t}\,dt, \qquad
   A_k = A - BR^{-1}B^T P_k . $$
This extra term means the initial control action need not be stabilizing. When ADP converges, the resulting P satisfies the continuous-time ARE: ADP solves the CT ARE without knowledge of the internal dynamics (the A matrix, or $f(x)$). In other words, the Riccati equation is solved WITHOUT knowing the full plant dynamics: model-free ADP, direct OPTIMAL ADAPTIVE CONTROL, and the approach extends to nonlinear systems. Open questions: proofs, robustness, comparison with adaptive control methods.
The gain (policy) $L_k$ is updated at discrete instants $k = 0, 1, 2, \dots$ while the control $u_k(t) = -L_k x(t)$ runs in continuous time, and the sample periods need not be the same: continuous-time control with discrete gain updates. (A small numerical check of Lemma 2 is sketched below.)
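The following Python sketch (NumPy/SciPy) numerically checks the equivalence stated in Lemma 2 above: the data-based greedy CT HDP update equals P_k plus the matrix-integral correction term. The matrices and the test P_k are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

# Check: int_0^T e^{Ak't}(Q + Lk'RLk)e^{Ak t}dt + e^{Ak'T} Pk e^{Ak T}
#      = Pk + int_0^T e^{Ak't}(Ak'Pk + Pk Ak + Q + Lk'RLk)e^{Ak t}dt
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
T = 0.5
Pk = np.diag([1.0, 2.0])                       # an arbitrary symmetric P_k

Lk = np.linalg.solve(R, B.T @ Pk)
Ak = A - B @ Lk
M = Q + Lk.T @ R @ Lk

def integral(F, n=4000):                       # midpoint-rule quadrature
    dt = T / n
    ts = (np.arange(n) + 0.5) * dt
    return sum(expm(Ak.T * t) @ F @ expm(Ak * t) for t in ts) * dt

greedy = integral(M) + expm(Ak.T * T) @ Pk @ expm(Ak * T)
lemma2 = Pk + integral(Ak.T @ Pk + Pk @ Ak + M)
print("difference:", np.linalg.norm(greedy - lemma2))   # ~ 0 up to quadrature error
```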
Neurobiology: Higher Central Control of Afferent Input
Descending tracts from the brain influence not only motor neurons but also the gamma neurons which regulate the sensitivity of the muscle spindle. Central control of end-organ sensitivity has been demonstrated, and many brain structures exert control of the first synapse in ascending systems. What is the role of the cerebello-rubrospinal tract and the Purkinje cells? (T.C. Ruch and H.D. Patton, Physiology and Biophysics, pp. 213, 497, Saunders, London, 1966.)

Small Time-Step Approximate Tuning for Continuous-Time Adaptive Critics
For a small time step $\Delta t$ the Hamiltonian is approximated by the temporal difference $\big[V(x_{t+1}) - V(x_t)\big]/\Delta t + r(x_t, u_t)$, which leads to Baird's advantage function
$$ A(x_t, u_t) \approx \frac{r(x_t, u_t) + V^*(x_{t+1}) - V^*(x_t)}{\Delta t}; $$
advantage learning is a sort of first-order approximation to our method.

Results Comparing the Performance of DT-ADHDP and CT-HDP (Asma Al-Tamimi and Draguna Vrabie)
Submitted to the IJCNN'07 conference.
System, cost function, optimal solution: the plant is a power-plant model (Wang, Y., R. Zhou, C. Wen, 1993), $\dot x = Ax + Bu$ with $x\in\mathbb{R}^4$, $u\in\mathbb{R}$, $B^T = [0\;\; 0\;\; 13.736\;\; 0]$ (the cited A matrix contains the entries -0.665, 3.663, 6.86, 13.736, 6, and 8), and cost
$$ V^*(x_0) = \min_{u(t)}\int_0^\infty\big(x^T Q x + u^T R u\big)d\tau, \qquad Q = I_n, \; R = I_m . $$
The CARE $A^T P + PA - PBR^{-1}B^T P + Q = 0$ has solution
$$ P_{CARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751 \\ 0.4766 & 0.7831 & 0.1237 & 0.3829 \\ 0.0601 & 0.1237 & 0.0513 & 0.0298 \\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}. $$

CT HDP results: state measurements were taken every 0.1 s and a cost-function update was performed every 1.5 s; over the 60 s simulation, 40 iterations (control policy updates) were performed.
(Figure: convergence of the P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) for CT HDP versus time.)
$$ P_{CT\text{-}HDP} = \begin{bmatrix} 0.4753 & 0.4771 & 0.0602 & 0.4770 \\ 0.4771 & 0.7838 & 0.1238 & 0.3852 \\ 0.0602 & 0.1238 & 0.0513 & 0.0302 \\ 0.4770 & 0.3852 & 0.0302 & 2.3462 \end{bmatrix}. $$

DT ADHDP results: the discrete version was obtained by discretizing the continuous-time model using the zero-order-hold method with sample time T = 0.01 s, giving
$$ V^*(x_k) = \min_{u}\sum_{t=k}^{\infty}\big(x_t^T(Q_{CT}T)x_t + u_t^T(R_{CT}T)u_t\big). $$
State measurements were taken every 0.01 s and a cost-function update was performed every 0.15 s; over the 60 s simulation, 400 iterations (control policy updates) were performed.
(Figure: convergence of the P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) for DT ADHDP versus time.)
$$ P_{DT\text{-}ADHDP} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754 \\ 0.4768 & 0.7887 & 0.1239 & 0.3834 \\ 0.0603 & 0.1239 & 0.0567 & 0.0300 \\ 0.4754 & 0.3834 & 0.0300 & 2.3433 \end{bmatrix}. $$
Continuous time used only 40 iterations!

Comparison of CT and DT ADP
- CT HDP is partially model-free: the system A matrix is not required to be known.
- DT ADHDP (Q learning) is completely model-free.
The DT ADHDP algorithm is computationally more intensive than CT HDP since it uses a smaller sampling period.

4 US patents. Sponsored by Paul Werbos, NSF.

Call for Papers
IEEE Transactions on Systems, Man, & Cybernetics, Part B: Special Issue on Adaptive Dynamic Programming and Reinforcement Learning in Feedback Control. Guest editors: George Lendaris, Derong Liu, F.L. Lewis. Papers due 1 August 2007.