JMLR: Workshop and Conference Proceedings 19 (2011) 1-26    24th Annual Conference on Learning Theory

Regret Bounds for the Adaptive Control of Linear Quadratic Systems

Yasin Abbasi-Yadkori (abbasiya@cs.ualberta.ca)
Csaba Szepesvári (szepesva@cs.ualberta.ca)
Department of Computing Science, University of Alberta

Editors: Sham Kakade, Ulrike von Luxburg

Abstract

We study the average cost Linear Quadratic (LQ) control problem with unknown model parameters, also known as the adaptive control problem in the control community. We design an algorithm and prove that, apart from logarithmic factors, its regret up to time $T$ is $O(\sqrt{T})$. Unlike previous approaches that use a forced-exploration scheme, we construct a high-probability confidence set around the model parameters and design an algorithm that plays optimistically with respect to this confidence set. The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm. To the best of our knowledge, this is the first time that a regret bound is derived for the LQ control problem.

1. Introduction

We study the average cost LQ control problem with unknown model parameters, also known as the adaptive control problem in the control community. The problem is to minimize the average cost of a controller that operates in an environment whose dynamics is linear, while the cost is a quadratic function of the state and the control. The optimal solution is a linear feedback controller, which can be computed in closed form from the matrices underlying the dynamics and the cost. In the learning problem, the topic of this paper, the dynamics of the environment is unknown. This problem is challenging since the control actions influence both the cost and the rate at which the dynamics is learned, a core difficulty of adaptive control. The objective in this case is to minimize the regret of the controller, i.e., the difference between the average cost incurred by the learning controller and that of the optimal controller. In this paper, for the first time, we show an adaptive controller and prove that, under some assumptions, its expected regret is bounded by $\tilde{O}(\sqrt{T})$. We build on recent works in online linear estimation and adaptive control design, the latter of which we survey next.

When the model parameters are known and the state is fully observed, one can use the principles of dynamic programming to obtain the optimal controller. The version of the problem that deals with unknown model parameters is called the adaptive control problem. Early attempts to solve this problem relied on the certainty equivalence principle (Simon, 1956). The idea was to estimate the unknown parameters from observations and then use the estimated parameters as if they were the true parameters to design a controller. It was soon realized that the certainty equivalence principle does not necessarily provide enough information to reliably estimate the parameters, and the estimated parameters can converge to incorrect values with positive probability (Becker et al., 1985). This in turn might lead to suboptimal performance. To avoid this non-identification problem, methods that actively explore the environment to gather information were developed (Lai and Wei, 1982a, 1987; Chen and Guo, 1987; Chen and Zhang, 1990; Fiechter, 1997; Lai and Ying, 2006; Campi and Kumar, 1998; Bittanti and Campi, 2006).
However, only asymptotic results are proven for these methods. One exception is the work of Fiechter (1997), who proposes an algorithm for the "discounted" LQ problem and analyzes its performance in a PAC framework.

Most of the aforementioned methods use forced-exploration schemes to provide the sufficient exploratory information. The idea is to take exploratory actions according to a fixed and appropriately designed schedule. However, forced-exploration schemes lack strong worst-case regret bounds, even in the simplest problems (see, e.g., Dani and Hayes (2006), Section 6). Unlike the preceding methods, Campi and Kumar (1998) propose an algorithm that uses the Optimism in the Face of Uncertainty (OFU) principle, which goes back to the work of Lai and Robbins (1985), to deal with the exploration/exploitation dilemma. They call this the Bet On the Best (BOB) principle. The idea is to construct high-probability confidence sets around the model parameters, find the optimal controller for each member of the confidence set, and finally choose the controller whose associated average cost is the smallest. However, Campi and Kumar (1998) only show asymptotic optimality, i.e., the average cost of their algorithm converges to that of the optimal policy in the limit. In this paper, we modify the algorithm and the proof technique of Campi and Kumar (1998) and extend their work to derive a finite-time regret bound. Our work also builds upon the works of Lai and Wei (1982b); Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) in analyzing linear estimation with dependent covariates, although we use a more recent, improved confidence bound (see Theorem 1).

Note that the OFU principle has been applied very successfully to a number of challenging learning and control situations. Lai and Robbins (1985), who invented the principle, used it to address learning in bandit problems (i.e., when there is no state), and later this work was picked up and modified by Auer et al. (2002) to make it work in nonparametric bandits. The OFU principle has also been applied to learning in finite Markov Decision Processes, both in a regret minimization setting (e.g., Bartlett and Tewari 2009; Auer et al. 2010) and in a PAC-learning setting (e.g., Kearns and Singh 1998; Brafman and Tennenholtz 2002; Kakade 2003; Strehl et al. 2006; Szita and Szepesvári 2010). In the PAC-MDP framework there has been some work to extend the OFU principle to infinite Markov Decision Problems under various assumptions. For example, Lipschitz assumptions have been used by Kakade et al. (2003), while Strehl and Littman (2008) explored linear models. However, none of these works consider both continuous state and action spaces. Continuous action spaces in the context of bandits have been explored in a number of works, such as the works of Kleinberg (2005); Auer et al. (2007); Kleinberg et al. (2008), and in a linear setting by Auer (2003); Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010).

2. Notation and conventions

We use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the 2-norm and the Frobenius norm, respectively. For a positive semidefinite matrix $A \in \mathbb{R}^{d \times d}$, the weighted 2-norm $\|\cdot\|_A$ is defined by $\|x\|_A^2 = x^\top A x$, where $x \in \mathbb{R}^d$. The inner product is denoted by $\langle \cdot, \cdot \rangle$. We use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the minimum and maximum eigenvalues of the positive semidefinite matrix $A$, respectively. We use $A \succ 0$ to denote that $A$ is positive definite, while $A \succeq 0$ denotes that it is positive semidefinite.
We use $\mathbb{I}\{A\}$ to denote the indicator function of event $A$.

3. The Linear Quadratic Problem

We consider the discrete-time infinite-horizon linear quadratic (LQ) control problem:
$$x_{t+1} = A_* x_t + B_* u_t + w_{t+1}, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$
where $t = 0, 1, \ldots$; $u_t \in \mathbb{R}^d$ is the control at time $t$, $x_t \in \mathbb{R}^n$ is the state at time $t$, $c_t \in \mathbb{R}$ is the cost at time $t$, $w_{t+1}$ is the "noise", $A_* \in \mathbb{R}^{n \times n}$ and $B_* \in \mathbb{R}^{n \times d}$ are unknown matrices, while $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ are known (positive definite) matrices. At time zero, for simplicity, $x_0 = 0$. The problem is to design a controller based on past observations that minimizes the average expected cost
$$J(u_0, u_1, \ldots) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \mathbb{E}[c_t]. \tag{1}$$
Let $J_*$ be the optimal (lowest) average cost. The regret up to time $T$ of a controller which incurs a cost of $c_t$ at time $t$ is defined by
$$R(T) = \sum_{t=0}^{T} (c_t - J_*),$$
which is the difference between the performance of the controller and the performance of the optimal controller that has full information about the system dynamics. Thus the regret can be interpreted as a measure of the cost of not knowing the system dynamics.

3.1. Assumptions

In this section, we state our assumptions on the noise and the system dynamics. In particular, we assume that the noise is sub-Gaussian and the system is controllable and observable. Define
$$z_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}, \qquad \Theta_*^\top = (A_*,\ B_*),$$
so that the state transition can be written as $x_{t+1} = \Theta_*^\top z_t + w_{t+1}$.

Assumption A1 There exists a filtration $(\mathcal{F}_t)$ such that for the random variables $(z_0, x_1), \ldots, (z_t, x_{t+1})$ the following hold: (i) $z_t$ and $x_{t+1}$ are $\mathcal{F}_{t+1}$-measurable; (ii) for any $t \ge 0$, $\mathbb{E}[x_{t+1} \mid \mathcal{F}_t] = \Theta_*^\top z_t$, i.e., $w_{t+1} = x_{t+1} - \Theta_*^\top z_t$ is a martingale difference sequence ($\mathbb{E}[w_{t+1} \mid \mathcal{F}_t] = 0$, $t = 0, 1, \ldots$); (iii) $\mathbb{E}[w_{t+1} w_{t+1}^\top \mid \mathcal{F}_t] = I_n$; (iv) the random variables $w_t$ are component-wise sub-Gaussian in the sense that there exists $L > 0$ such that for any $\gamma \in \mathbb{R}$ and any index $j$, $\mathbb{E}[\exp(\gamma w_{t+1,j}) \mid \mathcal{F}_t] \le \exp(\gamma^2 L^2/2)$.

The assumption $\mathbb{E}[w_{t+1} w_{t+1}^\top \mid \mathcal{F}_t] = I_n$ makes the analysis more readable. However, we shall show in Section 4 that it is in fact not necessary.

Our next assumption on the system uncertainty states that the unknown parameter is a member of a bounded set and is such that the system is controllable and observable. This assumption will let us derive a closed-form expression for the optimal control law.

Assumption A2 The unknown parameter $\Theta_*$ is a member of a set $S$ such that
$$S \subseteq S_0 \cap \{\Theta \in \mathbb{R}^{(n+d) \times n} : \operatorname{trace}(\Theta^\top \Theta) \le S^2\},$$
where
$$S_0 = \{\Theta \in \mathbb{R}^{(n+d) \times n} : \Theta^\top = (A, B),\ (A,B) \text{ is controllable},\ (A, M) \text{ is observable, where } Q = M^\top M\}.$$
(Controllability and observability are defined in Appendix B.) In what follows, we shall always assume that the above assumptions are valid.

3.2. Parameter estimation

In order to implement the OFU principle, we need high-probability confidence sets for the unknown parameter matrix. The derivation of the confidence set is based on results from Abbasi-Yadkori et al. (2011) that use techniques from self-normalized processes to bound the least-squares estimation error. Define
$$e(\Theta) = \lambda \operatorname{trace}(\Theta^\top \Theta) + \sum_{s=0}^{t-1} \operatorname{trace}\big((x_{s+1} - \Theta^\top z_s)(x_{s+1} - \Theta^\top z_s)^\top\big).$$
Let
$$\hat{\Theta}_t = \operatorname*{argmin}_{\Theta} e(\Theta) = (Z^\top Z + \lambda I)^{-1} Z^\top X \tag{2}$$
be the $\ell^2$-regularized least-squares estimate of $\Theta_*$ with regularization parameter $\lambda > 0$, where $Z$ and $X$ are the matrices whose rows are $z_0^\top, \ldots, z_{t-1}^\top$ and $x_1^\top, \ldots, x_t^\top$, respectively.
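To make the estimator (2) concrete, here is a minimal numerical sketch in Python. Everything specific in it (the dimensions, the synthetic system matrices $A_*$ and $B_*$, the exploratory inputs, and the noise scale) is an illustrative assumption, not something prescribed by the paper.

```python
import numpy as np

# Minimal sketch of the regularized least-squares estimate (2):
# Theta_hat = (Z^T Z + lambda*I)^{-1} Z^T X, with rows z_s^T and x_{s+1}^T.
rng = np.random.default_rng(0)
n, d, T, lam = 2, 1, 500, 1.0                       # illustrative values
A_star = np.array([[0.9, 0.1], [0.0, 0.8]])
B_star = np.array([[0.0], [1.0]])
Theta_star = np.vstack([A_star.T, B_star.T])        # Theta_* in R^{(n+d) x n}

x = np.zeros(n)
Z_rows, X_rows = [], []
for t in range(T):
    u = rng.normal(size=d)                          # i.i.d. inputs, for illustration only
    z = np.concatenate([x, u])                      # z_t = (x_t, u_t)
    x = A_star @ x + B_star @ u + rng.normal(size=n)  # unit-covariance noise, as in A1 (iii)
    Z_rows.append(z)
    X_rows.append(x)

Z, X = np.array(Z_rows), np.array(X_rows)
V = lam * np.eye(n + d) + Z.T @ Z                   # regularized design matrix V_t
Theta_hat = np.linalg.solve(V, Z.T @ X)             # equation (2)
print(np.linalg.norm(Theta_hat - Theta_star))       # estimation error shrinks with T
```

On this synthetic system the estimation error $\|\hat\Theta_t - \Theta_*\|_F$ shrinks as $T$ grows. In the adaptive control setting the inputs are of course produced by the controller itself rather than drawn i.i.d., which is exactly why the self-normalized confidence sets of Theorem 1 below are needed.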
Theorem 1 Let $(z_0, x_1), \ldots, (z_t, x_{t+1})$, $z_i \in \mathbb{R}^{n+d}$, $x_i \in \mathbb{R}^n$, satisfy the linear model Assumption A1 with some $L > 0$, $\Theta_* \in \mathbb{R}^{(n+d) \times n}$, $\operatorname{trace}(\Theta_*^\top \Theta_*) \le S^2$, and let $F = (\mathcal{F}_t)$ be the associated filtration. Consider the $\ell^2$-regularized least-squares parameter estimate $\hat\Theta_t$ with regularization coefficient $\lambda > 0$ (cf. (2)). Let
$$V_t = \lambda I + \sum_{i=0}^{t-1} z_i z_i^\top$$
be the regularized design matrix underlying the covariates. Define
$$\beta_t(\delta) = \left( nL \sqrt{2 \log\frac{\det(V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}} + \lambda^{1/2} S \right)^2. \tag{3}$$
Then, for any $0 < \delta < 1$, with probability at least $1-\delta$,
$$\operatorname{trace}\big((\hat\Theta_t - \Theta_*)^\top V_t (\hat\Theta_t - \Theta_*)\big) \le \beta_t(\delta), \qquad t = 1, 2, \ldots$$
In particular, $P(\Theta_* \in C_t(\delta),\ t = 1, 2, \ldots) \ge 1-\delta$, where
$$C_t(\delta) = \Big\{ \Theta \in \mathbb{R}^{(n+d) \times n} : \operatorname{trace}\big((\Theta - \hat\Theta_t)^\top V_t (\Theta - \hat\Theta_t)\big) \le \beta_t(\delta) \Big\}.$$

3.3. The design of the controller

Let $\Theta^\top = (A, B)$ with $\Theta \in S_0$, where $S_0$ is defined in Assumption A2. Then there is a unique solution $P(\Theta)$, in the class of positive semidefinite symmetric matrices, of the Riccati equation
$$P(\Theta) = Q + A^\top P(\Theta) A - A^\top P(\Theta) B \big(B^\top P(\Theta) B + R\big)^{-1} B^\top P(\Theta) A.$$
Under the same assumptions, the matrix $A + BK(\Theta)$ is stable (its spectral radius is less than one), where
$$K(\Theta) = -\big(B^\top P(\Theta) B + R\big)^{-1} B^\top P(\Theta) A$$
is the gain matrix (Bertsekas, 2001). Further, by the boundedness of $S$, we also obtain the boundedness of $P(\Theta)$ (Anderson and Moore, 1971). The corresponding constant will be denoted by $D$:
$$D = \sup\{\|P(\Theta)\| : \Theta \in S\}. \tag{4}$$
The optimal control law for an LQ system with parameters $\Theta$ is
$$u_t = K(\Theta) x_t, \tag{5}$$
i.e., this controller achieves the optimal average cost, which satisfies $J(\Theta) = \operatorname{trace}(P(\Theta))$ (Bertsekas, 2001). In particular, the average cost of control law (5) with $\Theta = \Theta_*$ is the optimal average cost $J_* = \operatorname{trace}(P(\Theta_*))$.

We assume that the bound on the norm of the unknown parameter, $S$, and the sub-Gaussianity constant, $L$, are known:

Assumption A3 The constants $L$ and $S$ in Assumptions A1 and A2 are known.

The algorithm that we propose implements the OFU principle as follows. At time $t$, the algorithm chooses a parameter $\tilde\Theta_t \in C_t(\delta) \cap S$ such that
$$J(\tilde\Theta_t) \le \inf_{\Theta \in C_t(\delta) \cap S} J(\Theta) + \frac{1}{\sqrt{t}},$$
and then uses the optimal feedback controller (5) underlying the chosen parameter. In order to prevent too frequent changes of the controller (which might harm performance), the algorithm changes controllers only after the current parameter estimates are significantly refined. The details are given in Algorithm 1.

Algorithm 1:
    Inputs: $T$, $S > 0$, $\delta > 0$, $Q$, $L$, $\lambda > 0$.
    Set $V_0 = \lambda I$ and $\hat\Theta_0 = 0$. Let $(\tilde A_0, \tilde B_0) = \tilde\Theta_0^\top$, where $\tilde\Theta_0 = \operatorname{argmin}_{\Theta \in C_0(\delta) \cap S} J(\Theta)$.
    for $t := 0, 1, 2, \ldots$ do
        if $\det(V_t) > 2 \det(V_0)$ then
            Calculate $\hat\Theta_t$ by (2).
            Find $\tilde\Theta_t$ such that $J(\tilde\Theta_t) \le \inf_{\Theta \in C_t(\delta) \cap S} J(\Theta) + \frac{1}{\sqrt{t}}$.
            Let $V_0 := V_t$.
        else
            $\tilde\Theta_t := \tilde\Theta_{t-1}$.
        end if
        Calculate $u_t = K(\tilde\Theta_t) x_t$ based on the current parameters.
        Execute the control, observe the new state $x_{t+1}$.
        Save $(z_t, x_{t+1})$ into the dataset, where $z_t^\top = (x_t^\top, u_t^\top)$.
        $V_{t+1} := V_t + z_t z_t^\top$.
    end for

Table 1: The proposed adaptive algorithm for the LQ problem.
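The controller computations of Section 3.3 are straightforward to realize numerically. The sketch below, which assumes SciPy's discrete-time algebraic Riccati solver, computes $P(\Theta)$, the gain $K(\Theta)$, and the average cost $J(\Theta) = \operatorname{trace}(P(\Theta))$ for a candidate parameter, and separately illustrates the $\det(V_t) > 2\det(V_0)$ test that triggers a policy change in Algorithm 1. The optimistic choice of $\tilde\Theta_t$ over $C_t(\delta) \cap S$ is a nontrivial optimization problem and is deliberately left abstract; all concrete numbers here are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def gain_and_cost(A, B, Q, R):
    """Solve the Riccati equation for P(Theta); return K(Theta) and J(Theta)."""
    P = solve_discrete_are(A, B, Q, R)                  # P = Q + A'PA - A'PB(B'PB+R)^{-1}B'PA
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # K(Theta) = -(B'PB+R)^{-1} B'P A
    return K, np.trace(P)                               # J(Theta) = trace(P(Theta))

A = np.array([[0.9, 0.1], [0.0, 0.8]])                  # illustrative candidate (A, B)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, J = gain_and_cost(A, B, Q, R)
print("J(Theta) =", J)

# Lazy policy updates: recompute the controller only when det(V_t) has doubled.
n, d, lam = 2, 1, 1.0
V = lam * np.eye(n + d)
V_last = V.copy()                       # design matrix at the last policy change
z = np.ones(n + d)                      # stand-in covariate; the algorithm uses z_t = (x_t, u_t)
for t in range(10):
    if np.linalg.det(V) > 2 * np.linalg.det(V_last):
        V_last = V.copy()               # here Algorithm 1 would re-estimate and re-plan
        print(f"policy change at t={t}")
    V = V + np.outer(z, z)
```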
4. Analysis

In this section we give our main result together with its proof. Before stating the main theorem, we make one more assumption in addition to the assumptions made before.

Assumption A4 The set $S$ is such that $\Lambda := \sup_{(A,B) \in S} \|A + BK(A,B)\| < 1$. Further, there exists a positive number $C$ such that $C := \sup_{\Theta \in S} \|K(\Theta)\| < \infty$.

Our main result is the following theorem:

Theorem 2 For any $0 < \delta < 1$ and any time $T$, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded as follows:
$$R(T) = \tilde{O}\big(\sqrt{T \log(1/\delta)}\big),$$
where the hidden constant is problem dependent and $\tilde{O}(\cdot)$ hides logarithmic factors.

Remark 3 The assumption $\mathbb{E}[w_{t+1} w_{t+1}^\top \mid \mathcal{F}_t] = I_n$ makes the analysis more readable. Alternatively, we could assume that $\mathbb{E}[w_{t+1} w_{t+1}^\top \mid \mathcal{F}_t] = G_*$ for an unknown covariance matrix $G_*$ belonging to a known bounded set. Then the optimal average cost becomes $J(\Theta, G) = \operatorname{trace}(P(\Theta) G)$, and the only change is in the computation of $\tilde\Theta_t$, which takes the form
$$J(\tilde\Theta_t, \tilde G_t) \le \inf_{(\Theta, G) \in \mathcal{C}_t} J(\Theta, G) + \frac{1}{\sqrt{t}},$$
where $\mathcal{C}_t$ is now a confidence set over $\Theta_*$ and $G_*$. The rest of the analysis remains identical, provided that an appropriate confidence set is constructed.

The least-squares estimation error of Theorem 1 scales with the size of the state and action vectors. Thus, in order to prove Theorem 2, we first prove a high-probability bound on the norm of the state vector. Given the boundedness of the state, we then decompose the regret and analyze each term using appropriate concentration inequalities.

4.1. Bounding $\|x_t\|$

We choose an error probability $\delta > 0$. Given this, we define two "good events" in the probability space $\Omega$: the event that the confidence sets hold for $s = 0, \ldots, t$,
$$E_t = \{\omega \in \Omega : \forall s \le t,\ \Theta_* \in C_s(\delta/4)\},$$
and the event that the state vector stays "small",
$$F_t = \{\omega \in \Omega : \forall s \le t,\ \|x_s\| \le \alpha_t\},$$
where
$$\alpha_t = \frac{1}{1-\Lambda}\left[ G\, Z_t^{\frac{n+d}{n+d+1}}\, \beta_t(\delta/4)^{\frac{1}{2(n+d+1)}} + 2L\sqrt{n \log\frac{4n t(t+1)}{\delta}} \right],$$
$Z_t = \max_{0 \le s \le t} \|z_s\|$, and
$$G = 2\left( \frac{4S (n+d)^{n+d-1/2} H^{1/2}}{U_0^{1/2}} \right)^{\frac{1}{n+d+1}}, \qquad U_0 = \frac{1}{16^{n+d+2}\,\big(1 \vee S^{2(n+d+2)}\big)},$$
with $H$ any number satisfying $H > 16 \vee 4S^2 M^2$, where
$$M = \sup_{Y \ge 0}\ \frac{nL\sqrt{(n+d)\log\frac{1+TY/\lambda}{\delta}} + \lambda^{1/2} S}{1 \vee Y}.$$
(We use $\wedge$ and $\vee$ to denote the minimum and the maximum, respectively.) In what follows, we let $E = E_T$ and $F = F_T$. First, we show that $E \cap F$ holds with high probability and that, on $E \cap F$, the state vector does not explode.

Lemma 4 $P(E \cap F) \ge 1 - \delta/2$.

The proof is in Appendix D. It first shows that $\|(\Theta_* - \tilde\Theta_t)^\top z_t\|$ is well controlled except for a small number of occasions. Given this and a proper decomposition of the state update equation, we can prove that the state vector $x_t$ stays smaller than $\alpha_t$. Notice that $\alpha_t$ itself depends on $Z_t$, which in turn depends on the states $x_s$. Thus, we need one more step to obtain an explicit bound on $\|x_t\|$.

Lemma 5 For appropriate problem-dependent constants $C_1, C_2 > 0$ (which are independent of $t$, $\delta$ and $T$), for any $t \ge 0$ it holds that $\mathbb{I}\{F_t\}\, X_t \le Y_t$, where $X_t := \max_{0 \le s \le t} \|x_s\|$ and
$$Y_t := \Big( e \vee 4\big(C_1 \log(1/\delta) + C_2 \log(t/\delta)\big) \log^2\!\big(4(C_1 \log(1/\delta) + C_2 \log(t/\delta))\big) \Big)^{n+d+1}.$$

Proof Fix $t$. On $F_t$, $X_t \le \alpha_t$. Since $Z_t \le \sqrt{1+C^2}\, X_t$, this gives an inequality of the form
$$X_t \le D_1\, \beta_t(\delta/4)^{\frac{1}{2(n+d+1)}}\, X_t^{\frac{n+d}{n+d+1}} + D_2 \sqrt{\log(t(t+1)/\delta)} \tag{6}$$
with problem-dependent constants $D_1, D_2 > 0$. Let $\hat X_t$ be the largest value of $x \ge 0$ that satisfies (6) with $X_t$ replaced by $x$; clearly, $X_t \le \hat X_t$. Because $\beta_t(\delta/4)$ is a function of $\log\det(V_t)$, and on $F_t$ we have $\log\det(V_t) \le (n+d)\log\big(\lambda + t(1+C^2) X_t^2\big)$ by Lemma 10, inequality (6) yields
$$\hat X_t \le D_1' \big(\log(t/\delta) + \log \hat X_t\big)^{1/2}\, \hat X_t^{\frac{n+d}{n+d+1}} + D_2 \sqrt{\log(t(t+1)/\delta)}, \tag{7}$$
which has the form
$$X \le f(\log X). \tag{8}$$
Let $a = \hat X_t^{1/(n+d+1)}$, so that (8) is equivalent to $a^{n+d+1} \le f\big((n+d+1) \log a\big)$, and let $c = 1 \vee a$. Assume $t \ge n+d$. By tedious but elementary calculations it can then be shown that
$$c \le A \log^2(c) + B_t, \tag{9}$$
where $A = G_1 \log(1/\delta)$ and $B_t = G_2 \log(t/\delta)$ for suitable problem-dependent constants $G_1, G_2 > 0$. From this, further elementary calculations show that the maximum value that $c$ can take subject to the constraint (9) is bounded from above by $Y_t^{1/(n+d+1)}$, which finishes the proof.
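The confidence radius $\beta_t(\delta)$ of Theorem 1, which the events $E_t$ above instantiate at level $\delta/4$, is explicit and cheap to evaluate, as is membership in $C_t(\delta)$. Below is a minimal sketch; the matrices and constants in it are made-up stand-ins, not values from the paper.

```python
import numpy as np

def beta(V, lam, S, L, n, delta):
    """Confidence radius beta_t(delta) from Theorem 1, equation (3)."""
    _, logdet_V = np.linalg.slogdet(V)
    logdet_lamI = V.shape[0] * np.log(lam)           # log det(lambda * I_{n+d})
    log_term = 0.5 * (logdet_V - logdet_lamI) - np.log(delta)
    return (n * L * np.sqrt(2.0 * log_term) + np.sqrt(lam) * S) ** 2

def in_confidence_set(Theta, Theta_hat, V, b):
    """Test trace((Theta - Theta_hat)^T V (Theta - Theta_hat)) <= beta_t(delta)."""
    D = Theta - Theta_hat
    return np.trace(D.T @ V @ D) <= b

# Tiny illustrative check:
n_plus_d, n = 3, 2
V = 5.0 * np.eye(n_plus_d)                           # stand-in design matrix
Theta_hat = np.zeros((n_plus_d, n))
Theta = 0.1 * np.ones((n_plus_d, n))
b = beta(V, lam=1.0, S=5.0, L=1.0, n=n, delta=0.25)  # delta/4, as in the event E_t
print(in_confidence_set(Theta, Theta_hat, V, b))
```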
4.2. Regret Decomposition

From the Bellman optimality equations for the average-cost LQ problem (Bertsekas, 1987, Vol. 2, pp. 228-229), writing $\tilde\Theta_t^\top = (\tilde A_t, \tilde B_t)$ for the optimistic parameter used at time $t$,
$$J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t) x_t = \min_u \Big\{ x_t^\top Q x_t + u^\top R u + \mathbb{E}\big[ (\tilde x^u_{t+1})^\top P(\tilde\Theta_t)\, \tilde x^u_{t+1} \mid \mathcal{F}_t \big] \Big\},$$
where $\tilde x^u_{t+1} = \tilde A_t x_t + \tilde B_t u + w_{t+1}$ is the next state under the dynamics $\tilde\Theta_t$. Since $u_t = K(\tilde\Theta_t) x_t$ attains this minimum,
$$J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t) x_t = x_t^\top Q x_t + u_t^\top R u_t + \mathbb{E}\big[ (\tilde A_t x_t + \tilde B_t u_t + w_{t+1})^\top P(\tilde\Theta_t) (\tilde A_t x_t + \tilde B_t u_t + w_{t+1}) \mid \mathcal{F}_t \big].$$
Expanding both quadratic forms, using $x_{t+1} = A_* x_t + B_* u_t + w_{t+1}$ and the martingale property of the noise (the conditional expectations of the cross terms vanish, and the terms $\mathbb{E}[w_{t+1}^\top P(\tilde\Theta_t) w_{t+1} \mid \mathcal{F}_t]$ cancel), we obtain
$$x_t^\top Q x_t + u_t^\top R u_t = J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t) x_t - \mathbb{E}\big[ x_{t+1}^\top P(\tilde\Theta_t) x_{t+1} \mid \mathcal{F}_t \big] + (A_* x_t + B_* u_t)^\top P(\tilde\Theta_t)(A_* x_t + B_* u_t) - (\tilde A_t x_t + \tilde B_t u_t)^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t).$$
Hence, summing over $t$,
$$\sum_{t=0}^{T} \big( x_t^\top Q x_t + u_t^\top R u_t \big) = \sum_{t=0}^{T} J(\tilde\Theta_t) + R_1 - R_2 - R_3,$$
where
$$R_1 = \sum_{t=0}^{T} \Big\{ x_t^\top P(\tilde\Theta_t) x_t - \mathbb{E}\big[ x_{t+1}^\top P(\tilde\Theta_{t+1}) x_{t+1} \mid \mathcal{F}_t \big] \Big\}, \tag{10}$$
$$R_2 = \sum_{t=0}^{T} \mathbb{E}\Big[ x_{t+1}^\top \big( P(\tilde\Theta_t) - P(\tilde\Theta_{t+1}) \big) x_{t+1} \mid \mathcal{F}_t \Big], \tag{11}$$
$$R_3 = \sum_{t=0}^{T} \Big\{ (\tilde A_t x_t + \tilde B_t u_t)^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t) - (A_* x_t + B_* u_t)^\top P(\tilde\Theta_t)(A_* x_t + B_* u_t) \Big\}. \tag{12}$$
On $E$, $\Theta_* \in C_t(\delta/4)$, so by the choice of $\tilde\Theta_t$ we have $J(\tilde\Theta_t) \le \inf_{\Theta \in C_t(\delta/4) \cap S} J(\Theta) + 1/\sqrt{t} \le J_* + 1/\sqrt{t}$, and thus $\sum_{t=0}^{T} J(\tilde\Theta_t) \le T J_* + 2\sqrt{T}$. Therefore, on $E \cap F$,
$$R(T) \le R_1 - R_2 - R_3 + 2\sqrt{T}. \tag{13}$$
In the following subsections we bound $R_1$, $R_2$ and $R_3$.

4.3. Bounding $\mathbb{I}\{E \cap F\} R_1$

We start by showing that with high probability all noise terms are small.

Lemma 6 With probability at least $1 - \delta/8$, for all $k \le T$, $\|w_k\| \le L\sqrt{2n \log(8nT/\delta)}$.

Proof From the sub-Gaussianity Assumption A1 we have that, for any index $1 \le i \le n$ and any time $k \le T$, with probability at least $1 - \delta/(8nT)$, $|w_{k,i}| \le L\sqrt{2 \log(8nT/\delta)}$. A union bound over $i$ and $k$ gives the claim.

Lemma 7 Let $R_1$ be as defined by (10) and let $W = L\sqrt{2n \log(8nT/\delta)}$. With probability at least $1 - \delta/2$,
$$\mathbb{I}\{E \cap F\}\, R_1 \le 2DW^2 \sqrt{2T \log(8/\delta)} + 2n\sqrt{B_0},$$
where
$$B_0 = 2L^2\big( v + T D^2 S^2 X_T^2 (1+C^2) \big) \log\!\left( \frac{4n}{\delta}\, \frac{\big(v + T D^2 S^2 X_T^2 (1+C^2)\big)^{1/2}}{v^{1/2}} \right).$$

Proof Let $f_{t-1} = A_* x_{t-1} + B_* u_{t-1}$ and $P_t = P(\tilde\Theta_t)$. Write
$$R_1 = x_0^\top P_0 x_0 - x_{T+1}^\top P(\tilde\Theta_{T+1}) x_{T+1} + \sum_{t=1}^{T} \Big\{ x_t^\top P_t x_t - \mathbb{E}\big[ x_t^\top P_t x_t \mid \mathcal{F}_{t-1} \big] \Big\}.$$
Because $P(\tilde\Theta_{T+1})$ is positive semidefinite and $x_0 = 0$, the first two terms are bounded by zero. Since $x_t = f_{t-1} + w_t$, the remaining sum decomposes as
$$\sum_{t=1}^{T} \Big\{ x_t^\top P_t x_t - \mathbb{E}\big[ x_t^\top P_t x_t \mid \mathcal{F}_{t-1} \big] \Big\} = 2\sum_{t=1}^{T} f_{t-1}^\top P_t w_t + \sum_{t=1}^{T} \Big\{ w_t^\top P_t w_t - \mathbb{E}\big[ w_t^\top P_t w_t \mid \mathcal{F}_{t-1} \big] \Big\}.$$
We bound each term separately. For the first term, let $v_t = \mathbb{I}\{E \cap F\} P_t f_{t-1}$ and note that on $E \cap F$, $\|v_t\| \le D S X_T \sqrt{1+C^2}$, since $\|f_{t-1}\| = \|\Theta_*^\top z_{t-1}\| \le S\|z_{t-1}\|$. Writing $\mathbb{I}\{E \cap F\} \sum_t f_{t-1}^\top P_t w_t = \sum_{i=1}^{n} M_{T,i}$ with $M_{T,i} = \sum_{t=1}^{T} v_{t,i} w_{t,i}$, Theorem 16 (applied coordinate-wise, with sub-Gaussianity constant $L$ and a scalar regularizer $v > 0$) yields that on an event $G_{\delta,i}$ of probability at least $1 - \delta/(4n)$, for any $T \ge 0$,
$$M_{T,i}^2 \le 2L^2 \Big( v + \sum_{t=1}^{T} v_{t,i}^2 \Big) \log\!\left( \frac{4n}{\delta}\, \frac{\big(v + \sum_t v_{t,i}^2\big)^{1/2}}{v^{1/2}} \right).$$
On $E \cap F$, $\sum_t v_{t,i}^2 \le T D^2 S^2 X_T^2 (1+C^2)$, so on $\cap_i G_{\delta,i}$, which holds with probability at least $1 - \delta/4$, $\sum_i |M_{T,i}| \le n\sqrt{B_0}$.

For the second term, define $\xi_t = \mathbb{I}\{E \cap F\}\big( w_t^\top P_t w_t - \mathbb{E}[w_t^\top P_t w_t \mid \mathcal{F}_{t-1}] \big)$ and its truncated version $\tilde\xi_t = \xi_t\, \mathbb{I}\{|\xi_t| \le 2DW^2\}$. By Lemma 14,
$$P\Big( \sum_t \xi_t > \Gamma \Big) \le P\Big( \max_{1 \le t \le T} |\xi_t| > 2DW^2 \Big) + P\Big( \sum_t \tilde\xi_t > \Gamma \Big).$$
By Lemma 6 and Azuma's inequality (Theorem 15) with $\Gamma = 2DW^2 \sqrt{2T \log(8/\delta)}$, each term on the right-hand side is bounded by $\delta/8$. Thus, with probability at least $1 - \delta/4$, $\sum_t \xi_t \le 2DW^2\sqrt{2T \log(8/\delta)}$.

Summing up the two bounds gives the result, which holds with probability at least $1 - \delta/2$.

4.4. Bounding $\mathbb{I}\{E \cap F\} |R_2|$

We bound $\mathbb{I}\{E \cap F\} |R_2|$ by showing that Algorithm 1 rarely changes the policy, so that most terms in (11) are zero.

Lemma 8 On $E \cap F$, the number $K$ of policy changes up to time $T$ satisfies
$$K \le (n+d) \log_2\!\big( 1 + T X_T^2 (1+C^2)/\lambda \big).$$

Proof If the policy has been changed $K$ times up to time $T$, then $\det(V_T) \ge \lambda^{n+d}\, 2^K$, since each change requires the determinant to double and $\det(V_0) = \lambda^{n+d}$. On the other hand,
$$\lambda_{\max}(V_T) \le \lambda + \sum_{t=0}^{T-1} \|z_t\|^2 \le \lambda + T X_T^2 (1+C^2),$$
where $C$ is the bound on the norm of $K(\cdot)$ from Assumption A4. Thus
$$\lambda^{n+d}\, 2^K \le \det(V_T) \le \big( \lambda + T X_T^2 (1+C^2) \big)^{n+d}.$$
Solving for $K$ gives the claimed bound.
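Lemma 8's logarithmic bound on the number of policy switches is easy to observe numerically: $\det(V_t)$ can double only about $(n+d)\log_2 T$ times. A small sketch with an i.i.d. covariate stream standing in for the algorithm's actual $z_t$ (an illustrative simplification, not the closed-loop process):

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam, T = 3, 1.0, 20000              # m plays the role of n + d
V = lam * np.eye(m)
det_last, switches = np.linalg.det(V), 0
for t in range(T):
    z = rng.normal(size=m)             # illustrative i.i.d. covariates
    V += np.outer(z, z)
    if np.linalg.det(V) > 2 * det_last:     # the doubling test of Algorithm 1
        det_last = np.linalg.det(V)
        switches += 1
# Compare against a Lemma 8-flavoured bound, with the worst-case ||z_t||^2
# replaced by its mean (= m) purely for illustration:
print(switches, m * np.log2(1 + T * m / lam))
```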
Lemma 9 Let $R_2$ be as defined by (11). Then
$$\mathbb{I}\{E \cap F\}\, |R_2| \le 2D X_T^2 (n+d) \log_2\!\big( 1 + T X_T^2 (1+C^2)/\lambda \big).$$

Proof By Lemma 8, on the event $E \cap F$ there are at most $K = (n+d)\log_2(1 + T X_T^2(1+C^2)/\lambda)$ policy changes up to time $T$, so at most $K$ terms of the summation (11) are non-zero. Each non-zero term is bounded by $2D X_T^2$, which gives the claim.

4.5. Bounding $\mathbb{I}\{E \cap F\} |R_3|$

The summation $\sum_{t=0}^{T} \|(\Theta_* - \tilde\Theta_t)^\top z_t\|$ controls $|R_3|$. So we first bound this summation, whose analysis requires the following two results.

Lemma 10 The following holds for any $t \ge 1$:
$$\sum_{k=0}^{t-1} \Big( \|z_k\|_{V_k^{-1}}^2 \wedge 1 \Big) \le 2 \log\frac{\det(V_t)}{\det(\lambda I)}.$$
Further, when the covariates satisfy $\|z_t\| \le c_m$, $t \ge 0$, w.p. 1 for some $c_m > 0$, then
$$\log\frac{\det(V_t)}{\det(\lambda I)} \le (n+d) \log\frac{\lambda(n+d) + t\, c_m^2}{\lambda(n+d)}.$$
The proof of the lemma can be found in Abbasi-Yadkori et al. (2011).

Lemma 11 Let $A, B \in \mathbb{R}^{m \times m}$ be positive semidefinite matrices such that $A \succeq B \succ 0$. Then we have
$$\sup_{X \ne 0} \frac{\|X^\top A X\|}{\|X^\top B X\|} \le \frac{\det(A)}{\det(B)}.$$
The proof of this lemma is in Appendix C.

Lemma 12 On $E \cap F$, it holds that
$$\sum_{t=0}^{T} \big\| (\Theta_* - \tilde\Theta_t)^\top z_t \big\|^2 \le 16 (1+C^2) X_T^2\, \beta_T(\delta/4) \log\frac{\det(V_T)}{\det(\lambda I)}.$$

Proof Consider time step $t$, let $s_t = (\Theta_* - \tilde\Theta_t)^\top z_t$, and let $\tau \le t$ be the last time step at which the policy was changed, so that $\tilde\Theta_t = \tilde\Theta_\tau$. For all $\Theta \in C_t(\delta/4)$,
$$\big\| (\Theta - \hat\Theta_t)^\top z_t \big\| \le \Big[ \operatorname{trace}\big( (\Theta - \hat\Theta_t)^\top V_t (\Theta - \hat\Theta_t) \big) \Big]^{1/2} \|z_t\|_{V_t^{-1}} \le \beta_t(\delta/4)^{1/2}\, \|z_t\|_{V_t^{-1}}, \tag{14}$$
where we used the Cauchy-Schwarz inequality and the fact that $\lambda_{\max}(M) \le \operatorname{trace}(M)$ for $M \succeq 0$. Hence, on $E$ (so that both $\Theta_*$ and $\tilde\Theta_\tau$ lie in $C_\tau(\delta/4)$), the triangle inequality gives
$$\|s_t\| \le \big\| (\Theta_* - \hat\Theta_\tau)^\top z_t \big\| + \big\| (\tilde\Theta_\tau - \hat\Theta_\tau)^\top z_t \big\| \le 2\beta_\tau(\delta/4)^{1/2}\, \|z_t\|_{V_\tau^{-1}}.$$
By the update rule of Algorithm 1, $\det(V_t) \le 2\det(V_\tau)$, so by Lemma 11,
$$\|z_t\|_{V_\tau^{-1}}^2 \le \frac{\det(V_t)}{\det(V_\tau)}\, \|z_t\|_{V_t^{-1}}^2 \le 2 \|z_t\|_{V_t^{-1}}^2,$$
and therefore $\|s_t\|^2 \le 8\beta_T(\delta/4)\, \|z_t\|_{V_t^{-1}}^2$. Moreover, by Assumption A4 and the fact that $\Theta_*, \tilde\Theta_t \in S$, we also have $\|s_t\| \le 2S\|z_t\| \le 2S\sqrt{1+C^2}\, X_T$. Combining the two bounds,
$$\|s_t\|^2 \le 8\beta_T(\delta/4)(1+C^2) X_T^2 \big( \|z_t\|_{V_t^{-1}}^2 \wedge 1 \big).$$
Summing over $t$ and applying Lemma 10 finishes the proof.

Lemma 13 Let $R_3$ be as defined by (12). Then
$$\mathbb{I}\{E \cap F\}\, |R_3| \le 8\sqrt{2}\, S D (1+C^2) X_T^2 \sqrt{ T\, \beta_T(\delta/4) \log\frac{\det(V_T)}{\det(\lambda I)} }.$$

Proof We have
$$\mathbb{I}\{E \cap F\}\, |R_3| \le \sum_{t=0}^{T} \Big| \big\| P(\tilde\Theta_t)^{1/2} \tilde\Theta_t^\top z_t \big\|^2 - \big\| P(\tilde\Theta_t)^{1/2} \Theta_*^\top z_t \big\|^2 \Big| \le \sum_{t=0}^{T} \Big( \big\| P(\tilde\Theta_t)^{1/2} \tilde\Theta_t^\top z_t \big\| + \big\| P(\tilde\Theta_t)^{1/2} \Theta_*^\top z_t \big\| \Big) \big\| P(\tilde\Theta_t)^{1/2} s_t \big\|$$
(triangle inequality), and further, using (4), $\|\Theta\| \le S$ for $\Theta \in S$, and the Cauchy-Schwarz inequality,
$$\mathbb{I}\{E \cap F\}\, |R_3| \le 2\sqrt{D}\, S \sqrt{1+C^2}\, X_T \cdot \sqrt{D} \cdot \sqrt{T} \Big( \sum_{t=0}^{T} \|s_t\|^2 \Big)^{1/2},$$
which, by Lemma 12, gives the claimed bound.

4.6. Putting Everything Together

Proof [Proof of Theorem 2] By (13) and Lemmas 7, 9 and 13, with probability at least $1 - \delta/2$,
$$R(T) \le 2DW^2\sqrt{2T\log(8/\delta)} + 2n\sqrt{B_0} + 2DX_T^2(n+d)\log_2\!\big(1 + T X_T^2(1+C^2)/\lambda\big) + 8\sqrt{2}\, SD(1+C^2)X_T^2 \sqrt{T\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)}} + 2\sqrt{T}.$$
Further, on $E \cap F$, by Lemmas 5 and 10, $X_T \le Y_T$, i.e., $X_T$ is bounded by a polylogarithmic function of $T$ and $1/\delta$, and
$$\log\frac{\det(V_T)}{\det(\lambda I)} \le (n+d)\log\frac{\lambda(n+d) + T(1+C^2)X_T^2}{\lambda(n+d)}.$$
Plugging these in gives the final bound $R(T) = \tilde{O}(\sqrt{T\log(1/\delta)})$, which, by Lemma 4 and a union bound, holds with probability at least $1-\delta$.

References

Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. arXiv preprint arXiv:1102.2670, 2011.
B. D. O. Anderson and J. B. Moore. Linear Optimal Control. Prentice-Hall, 1971.

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2003.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-07), pages 454-468, 2007.

P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.

A. Becker, P. R. Kumar, and C. Z. Wei. Adaptive control with the stochastic approximation algorithm: Geometry and convergence. IEEE Transactions on Automatic Control, 30(4):330-338, 1985.

D. Bertsekas. Dynamic Programming. Prentice-Hall, 1987.

D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2001.

S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the "bet on the best" principle. Communications in Information and Systems, 6(4):299-320, 2006.

R. I. Brafman and M. Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

M. C. Campi and P. R. Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890-1907, 1998.

H. Chen and L. Guo. Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization, 25(4):845-867, 1987.

H. Chen and J. Zhang. Identification and adaptive control for systems with unknown orders, delay, and coefficients. IEEE Transactions on Automatic Control, 35(8):866-877, August 1990.

V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 937-943, 2006.

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT-2008, pages 355-366, 2008.

C. Fiechter. PAC adaptive control of linear systems. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 72-80. ACM Press, 1997.

S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

S. M. Kakade, M. J. Kearns, and J. Langford. Exploration in metric state spaces. In T. Fawcett and N. Mishra, editors, ICML 2003, pages 306-312. AAAI Press, 2003.

M. Kearns and S. P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In J. W. Shavlik, editor, ICML 1998, pages 260-268. Morgan Kaufmann, 1998.

R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697-704, 2005.
R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681-690, 2008.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154-166, 1982a.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154-166, 1982b.

T. L. Lai and C. Z. Wei. Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization, 25:466-481, March 1987.

T. L. Lai and Z. Ying. Efficient recursive estimation and adaptive control in stochastic regression and ARMAX models. Statistica Sinica, 16:741-772, 2006.

P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.

H. A. Simon. Dynamic programming under uncertainty with a quadratic criterion function. Econometrica, 24(1):74-81, 1956.

A. L. Strehl and M. L. Littman. Online linear regression and its application to model-based reinforcement learning. In NIPS, pages 1417-1424, 2008.

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In ICML, pages 881-888, 2006.

I. Szita and Cs. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML 2010, pages 1031-1038, 2010.

Appendix A. Tools from Probability

Lemma 14 Let $X_1, \ldots, X_t$ be random variables, $S_t = \sum_{s=1}^{t} X_s$, and $\tilde S_t = \sum_{s=1}^{t} X_s\, \mathbb{I}\{X_s \le a\}$. Then it holds that
$$P(S_t > x) \le P\Big( \max_{1 \le s \le t} X_s > a \Big) + P\big( \tilde S_t > x \big).$$

Proof
$$P(S_t > x) = P\Big( S_t > x,\ \max_{1 \le s \le t} X_s \le a \Big) + P\Big( S_t > x,\ \max_{1 \le s \le t} X_s > a \Big) \le P\big( \tilde S_t > x \big) + P\Big( \max_{1 \le s \le t} X_s > a \Big),$$
since on the event $\{\max_{1 \le s \le t} X_s \le a\}$ we have $S_t = \tilde S_t$.

Theorem 15 (Azuma's inequality) Assume that $(X_s;\ s \ge 0)$ is a supermartingale and $|X_s - X_{s-1}| \le c_s$ almost surely. Then for all $t > 0$ and all $\epsilon > 0$,
$$P\big( |X_t - X_0| \ge \epsilon \big) \le 2 \exp\!\left( \frac{-\epsilon^2}{2\sum_{s=1}^{t} c_s^2} \right).$$

Theorem 16 (Self-normalized bound for vector-valued martingales) Let $(\mathcal{F}_k;\ k \ge 0)$ be a filtration, $(m_k;\ k \ge 0)$ be an $\mathbb{R}^d$-valued stochastic process such that $m_k$ is $\mathcal{F}_k$-measurable, and $(\eta_k;\ k \ge 1)$ be a real-valued martingale difference process such that $\eta_k$ is $\mathcal{F}_k$-measurable and conditionally sub-Gaussian with constant $R$, i.e., $\mathbb{E}[\exp(\gamma \eta_k) \mid \mathcal{F}_{k-1}] \le \exp(\gamma^2 R^2/2)$ for all $\gamma \in \mathbb{R}$. Consider the martingale and the matrix-valued processes
$$S_t = \sum_{k=1}^{t} \eta_k m_{k-1}, \qquad V_t = \sum_{k=1}^{t} m_{k-1} m_{k-1}^\top, \qquad \bar V_t = V + V_t, \quad t \ge 0,$$
where $V \succ 0$ is deterministic. Then for any $0 < \delta < 1$, with probability at least $1-\delta$,
$$\forall t \ge 0: \qquad \|S_t\|_{\bar V_t^{-1}}^2 \le 2R^2 \log\!\left( \frac{\det(\bar V_t)^{1/2} \det(V)^{-1/2}}{\delta} \right).$$

Appendix B. Controllability and Observability

Definition 1 (Bertsekas (2001)) A pair $(A, B)$, where $A$ is an $n \times n$ matrix and $B$ is an $n \times d$ matrix, is said to be controllable if the $n \times nd$ matrix
$$[\,B\ \ AB\ \ \cdots\ \ A^{n-1}B\,]$$
has full rank. A pair $(A, C)$, where $A$ is an $n \times n$ matrix and $C$ is a $d \times n$ matrix, is said to be observable if the pair $(A^\top, C^\top)$ is controllable.
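Definition 1 translates directly into a rank computation. A minimal sketch (the example matrices are illustrative, not from the paper):

```python
import numpy as np

def is_controllable(A, B):
    """Rank test on [B, AB, ..., A^{n-1}B] from Definition 1."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

def is_observable(A, C):
    """(A, C) is observable iff (A^T, C^T) is controllable (Definition 1)."""
    return is_controllable(A.T, C.T)

A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
print(is_controllable(A, B))   # True: the input reaches both states through the chain
```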
Appendix C. Proof of Lemma 11

Proof [Proof of Lemma 11] We consider first a simple case. Let $A = B + mm^\top$ with $B$ positive definite, and let $X \ne 0$ be an arbitrary matrix. Using the Cauchy-Schwarz inequality and the fact that for any matrix $M$, $\|M^\top M\| = \|M\|^2$, we get
$$\|m^\top X\|^2 = \|m^\top B^{-1/2} B^{1/2} X\|^2 \le \|B^{-1/2} m\|^2\, \|B^{1/2} X\|^2 = (m^\top B^{-1} m)\, \|X^\top B X\|,$$
and so
$$\|X^\top A X\| = \|X^\top (B + mm^\top) X\| \le \|X^\top B X\| + \|X^\top m m^\top X\| \le \big(1 + \|m\|_{B^{-1}}^2\big) \|X^\top B X\|.$$
We also have that
$$\det(A) = \det(B + mm^\top) = \det(B)\det\!\big(I + B^{-1/2} m (B^{-1/2} m)^\top\big) = \det(B)\big(1 + \|m\|_{B^{-1}}^2\big),$$
and thus
$$\frac{\|X^\top A X\|}{\|X^\top B X\|} \le 1 + \|m\|_{B^{-1}}^2 = \frac{\det(A)}{\det(B)},$$
finishing the proof of this case. More generally, if $A = B + m_1 m_1^\top + \cdots + m_{t-1} m_{t-1}^\top$, then define $V_1 = B$ and $V_{s+1} = V_s + m_s m_s^\top$, $s = 1, \ldots, t-1$, and use
$$\frac{\|X^\top A X\|}{\|X^\top B X\|} = \frac{\|X^\top V_t X\|}{\|X^\top V_{t-1} X\|} \cdot \frac{\|X^\top V_{t-1} X\|}{\|X^\top V_{t-2} X\|} \cdots \frac{\|X^\top V_2 X\|}{\|X^\top V_1 X\|}.$$
By the above argument, since all the factors are positive, we get
$$\frac{\|X^\top A X\|}{\|X^\top B X\|} \le \frac{\det(V_t)}{\det(V_{t-1})} \cdot \frac{\det(V_{t-1})}{\det(V_{t-2})} \cdots \frac{\det(V_2)}{\det(V_1)} = \frac{\det(A)}{\det(B)},$$
the desired inequality. Finally, by the SVD, if $C \succeq 0$, then $C$ can be written as the sum of at most $m$ rank-one matrices, finishing the proof for the general case $A = B + C$, $C \succeq 0$.

Appendix D. Bounding $\|x_t\|$

We show that $E \cap F$ holds with high probability.

Proof [Proof of Lemma 4] Let $M_t = \Theta_* - \tilde\Theta_t$. On event $E$, for any $t \le T$, both $\Theta_*$ and $\tilde\Theta_t$ lie in $C_t(\delta/4)$, so by the triangle inequality in the $V_t$-weighted norm,
$$\operatorname{trace}\Big( M_t^\top \Big( \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top \Big) M_t \Big) \le 4\beta_t(\delta/4).$$
Since $\lambda > 0$, we get that
$$\sum_{s=0}^{t-1} \operatorname{trace}\big( M_t^\top z_s z_s^\top M_t \big) \le 4\beta_t(\delta/4),$$
and since $\lambda_{\max}(M) \le \operatorname{trace}(M)$ for $M \succeq 0$, for all $0 \le s \le t-1$, $\|M_t^\top z_s\|^2 \le 4\beta_t(\delta/4)$. Thus, for all $t \ge 1$,
$$\max_{0 \le s \le t-1} \|M_t^\top z_s\| \le 2\beta_t(\delta/4)^{1/2}. \tag{15}$$
Choose $H > 16 \vee 4S^2 M^2$ and recall from Section 4.1 the constants
$$U_0 = \frac{1}{16^{n+d+2}\big(1 \vee S^{2(n+d+2)}\big)}, \qquad U = U_0/H, \qquad M = \sup_{Y \ge 0}\ \frac{nL\sqrt{(n+d)\log\frac{1+TY/\lambda}{\delta}} + \lambda^{1/2} S}{1 \vee Y}. \tag{16}$$
Fix a real number $0 < \eta \le 1$ and consider the time horizon $T$. For a vector $v$ and a matrix $N$, let $(v, B)$ and $(N, B)$ denote the projections of $v$ and $N$ onto a subspace $B \subseteq \mathbb{R}^{n+d}$, where the projection of a matrix is done column-wise. Let $B \oplus \{v\}$ be the span of $B$ and $v$, and let $B^{\perp}$ be the orthogonal complement of $B$, so that $\mathbb{R}^{n+d} = B \oplus B^{\perp}$. Define a sequence of subspaces $(B_t)$ as follows. Set $B_{T+1} = \emptyset$. For $t = T, \ldots, 1$, initialize $B_t = B_{t+1}$; then, while $\|(M_t, B_t^{\perp})\|_F > (n+d)\eta$, choose a column $v$ of $M_t$ such that $\|(v, B_t^{\perp})\| > \eta$ and update $B_t := B_t \oplus \{v\}$. After finishing with time step $t$, we thus have $\|(M_t, B_t^{\perp})\|_F \le (n+d)\eta$. Let $\mathcal{T}_T$ be the set of time steps at which the subspace expands, and let $\mathcal{T}_t = \mathcal{T}_T \cap \{1, \ldots, t\}$. The cardinality $m$ of $\mathcal{T}_T$ is at most $n+d$; denote its elements by $t_1 > t_2 > \cdots > t_m$, and let $i(t) = \max\{1 \le i \le m : t_i \ge t\}$.

Lemma 17 For any vector $x \in \mathbb{R}^{n+d}$,
$$\|(x, B_t)\|^2 \le \frac{1}{U \eta^{2(n+d)}} \sum_{i=1}^{i(t)} \|M_{t_i}^\top x\|^2, \qquad \text{where } U = U_0/H.$$

Proof Let $N = \{v_1, \ldots, v_m\}$ be the set of vectors that are added to the subspaces during the expansion time steps. By construction, $N$ is a subset of the set of all columns of $M_{t_1}, M_{t_2}, \ldots, M_{t_{i(t)}}$, and $B_t = \operatorname{span}(v_1, \ldots, v_{i(t)})$. Thus,
$$\sum_{i=1}^{i(t)} \|M_{t_i}^\top x\|^2 \ge x^\top \big( v_1 v_1^\top + \cdots + v_{i(t)} v_{i(t)}^\top \big) x,$$
so, in order to prove the statement of the lemma, it is enough to show that
$$\forall x,\ \forall j \in \{1, \ldots, m\}: \qquad \sum_{k=1}^{j} \langle v_k, x \rangle^2 \ge \frac{\eta^{4j}}{H\, 16^{j}\, 2^{2(j-1)} \big(1 \vee S^{2(j-1)}\big)}\, \|(x, B_j)\|^2, \tag{17}$$
where $B_j = \operatorname{span}(v_1, \ldots, v_j)$; indeed, by the choice of $U_0$ and since $j \le m \le n+d$ and $0 < \eta \le 1$, the coefficient on the right-hand side of (17) is at least $U \eta^{2(n+d)}$. We can write $v_k = w_k + u_k$, where $w_k \in B_{k-1}$, $u_k \perp B_{k-1}$, $\|u_k\| \ge \eta$, and $\|v_k\| \le 2S$.

The inequality (17) is proven by induction on $j$. First, we prove the induction base $j = 1$. Without loss of generality, assume that $x = C v_1$, so that $\|(x, B_1)\|^2 = \|x\|^2 = C^2 \|v_1\|^2$. Since $\|v_1\| \ge \|u_1\| \ge \eta$,
$$\langle v_1, x \rangle^2 = C^2 \|v_1\|^4 \ge C^2 \|v_1\|^2 \eta^2 = \eta^2 \|(x, B_1)\|^2 \ge \frac{\eta^4}{16 H}\, \|(x, B_1)\|^2,$$
where the last step uses $\eta \le 1$ and $H \ge 16$. This establishes the base of the induction. Next, we prove that if the inequality (17) holds for $j = l$, then it also holds for $j = l+1$.
Figure 1: The geometry used in the inductive step.

Figure 1 contains the relevant quantities used in the following argument. Assume that (17) holds for $j = l$. Without loss of generality, assume that $x \in B_{l+1}$, so that $\|x\| = \|(x, B_{l+1})\|$. Let $P$ be the 2-dimensional subspace that passes through $x$ and $v_{l+1}$. Because $P$ is not a subset of $B_l$, the intersection of $P$ and $B_l$ is a line in $B_{l+1}$; call this line $L$. The line $L$ creates two half-planes on $P$; without loss of generality, assume that $x$ and $v_{l+1}$ are on the same half-plane (notice that we can always replace $x$ by $-x$ in (17)). Let $0 \le \varphi \le \pi/2$ be the angle between $v_{l+1}$ and $L$, and let $0 \le \psi \le \pi/2$ be the orthogonal angle between $v_{l+1}$ and $B_l$. We know that $\varphi \ge \psi$, that $u_{l+1}$ is orthogonal to $B_l$, and that $\|u_{l+1}\| \ge \eta$; thus $\psi \ge \arcsin(\eta/\|v_{l+1}\|)$. Let $0 \le \theta \le \pi$ be the angle between $x$ and $L$ ($\theta \le \pi$ because $x$ and $v_{l+1}$ are on the same half-plane), with the direction of $\theta$ chosen consistently with the direction of $\varphi$. Finally, let $0 \le \zeta \le \pi/2$ be the orthogonal angle between $x$ and $B_l$.

By the induction assumption,
$$\sum_{k=1}^{l+1} \langle v_k, x \rangle^2 = \langle v_{l+1}, x \rangle^2 + \sum_{k=1}^{l} \langle v_k, x \rangle^2 \ge \langle v_{l+1}, x \rangle^2 + \frac{\eta^{4l}}{H\, 16^{l}\, 2^{2(l-1)}(1 \vee S^{2(l-1)})}\, \|(x, B_l)\|^2.$$
If $\theta < \pi/2 + \varphi/2$ or $\theta > \pi/2 + 3\varphi/2$, then the angle between $v_{l+1}$ and $x$ is bounded away from $\pi/2$ by at least $\varphi/2$, so, using $\sin(\varphi/2) \ge \sin(\psi)/2 \ge \eta/(2\|v_{l+1}\|)$ and $\|v_{l+1}\| \le 2S$,
$$\langle v_{l+1}, x \rangle^2 \ge \|v_{l+1}\|^2 \|x\|^2 \sin^2(\varphi/2) \ge \frac{\eta^2 \|x\|^2}{16} \ge \frac{\eta^{4(l+1)}}{H\, 16^{l+1}\, 2^{2l}(1 \vee S^{2l})}\, \|(x, B_{l+1})\|^2,$$
which proves the claim for $j = l+1$ in this case. If instead $\pi/2 + \varphi/2 \le \theta \le \pi/2 + 3\varphi/2$, then $\zeta \le \pi/2 - \varphi/2$, and thus
$$\|(x, B_l)\| = \|x\|\, |\cos(\zeta)| \ge \|x\| \sin(\varphi/2) \ge \frac{\eta}{4S}\, \|x\|,$$
so that, by the induction assumption,
$$\sum_{k=1}^{l} \langle v_k, x \rangle^2 \ge \frac{\eta^{4l}}{H\, 16^{l}\, 2^{2(l-1)}(1 \vee S^{2(l-1)})} \cdot \frac{\eta^2}{16 S^2}\, \|x\|^2 \ge \frac{\eta^{4(l+1)}}{H\, 16^{l+1}\, 2^{2l}(1 \vee S^{2l})}\, \|(x, B_{l+1})\|^2,$$
where we used $\eta \le 1$ and $x \in B_{l+1}$ in the last inequality. This finishes the proof.

Next we show that $\|M_t^\top z_t\|$ is well controlled except when $t \in \mathcal{T}_T$.

Lemma 18 We have that for any $0 \le t \le T$,
$$\max_{s \le t,\ s \notin \mathcal{T}_t} \|M_s^\top z_s\| \le G\, Z_t^{\frac{n+d}{n+d+1}}\, \beta_t(\delta/4)^{\frac{1}{2(n+d+1)}},$$
where $Z_t = \max_{s \le t} \|z_s\|$ and $G = 2\big( 4S(n+d)^{n+d-1/2} H^{1/2} U_0^{-1/2} \big)^{1/(n+d+1)}$.

Proof From Lemma 17 we get that, for any $s$,
$$\|(z_s, B_s)\| \le \frac{\sqrt{i(s)}}{\sqrt{U}\, \eta^{n+d}} \max_{1 \le i \le i(s)} \|M_{t_i}^\top z_s\| \le \sqrt{\frac{n+d}{U}}\, \eta^{-(n+d)} \max_{1 \le i \le i(s)} \|M_{t_i}^\top z_s\|. \tag{18}$$
Now decompose
$$M_s^\top z_s = (M_s, B_s)^\top (z_s, B_s) + (M_s, B_s^{\perp})^\top (z_s, B_s^{\perp}),$$
where the cross terms vanish by orthogonality. If $s \notin \mathcal{T}_t$, then $\|(M_s, B_s^{\perp})\|_F \le (n+d)\eta$, so, using $\|M_s\| \le 2S$ and (18),
$$\|M_s^\top z_s\| \le 2S \sqrt{\frac{n+d}{U}}\, \eta^{-(n+d)} \max_{1 \le i \le i(s)} \|M_{t_i}^\top z_s\| + (n+d)\,\eta\, Z_t. \tag{19}$$
From $s \notin \mathcal{T}_t$ and $1 \le i \le i(s)$ we have $t_i > s$, hence (15) implies $\|M_{t_i}^\top z_s\| \le 2\beta_{t_i}(\delta/4)^{1/2} \le 2\beta_t(\delta/4)^{1/2}$. Thus,
$$\max_{s \le t,\ s \notin \mathcal{T}_t} \|M_s^\top z_s\| \le (n+d)\,\eta\, Z_t + 4S \sqrt{\frac{n+d}{U}}\, \eta^{-(n+d)}\, \beta_t(\delta/4)^{1/2}.$$
Now choose
$$\eta = \left( \frac{4S\, \beta_t(\delta/4)^{1/2}}{\sqrt{U(n+d)}\; Z_t} \right)^{\frac{1}{n+d+1}},$$
which balances the two terms and gives
$$\max_{s \le t,\ s \notin \mathcal{T}_t} \|M_s^\top z_s\| \le 2(n+d)\,\eta\, Z_t = G\, Z_t^{\frac{n+d}{n+d+1}}\, \beta_t(\delta/4)^{\frac{1}{2(n+d+1)}},$$
with $G$ as above (recall that $U = U_0/H$). Finally, elementary calculations based on the choice of $H$ (which satisfies $H > 16 \vee 4S^2 M^2$) and the definition of $M$ show that this choice of $\eta$ satisfies $0 < \eta \le 1$.
We can write the state update as $x_{t+1} = \Gamma_t x_t + r_{t+1}$, where
$$\Gamma_t = \begin{cases} \tilde A_t + \tilde B_t K(\tilde\Theta_t), & t \notin \mathcal{T}_T, \\ A_* + B_* K(\tilde\Theta_t), & t \in \mathcal{T}_T, \end{cases} \qquad r_{t+1} = \begin{cases} M_t^\top z_t + w_{t+1}, & t \notin \mathcal{T}_T, \\ w_{t+1}, & t \in \mathcal{T}_T. \end{cases}$$
Hence, unrolling the recursion and using $x_0 = 0$,
$$x_t = \Gamma_{t-1} x_{t-1} + r_t = \Gamma_{t-1}\big(\Gamma_{t-2} x_{t-2} + r_{t-1}\big) + r_t = \cdots = \sum_{k=1}^{t} \Big( \prod_{s=k}^{t-1} \Gamma_s \Big) r_k.$$
From Section 4 we have that
$$\max_{t \le T}\ \big\| \tilde A_t + \tilde B_t K(\tilde\Theta_t) \big\| \vee \big\| A_* + B_* K(\tilde\Theta_t) \big\| \le \Lambda,$$
so that $\big\| \prod_{s=k}^{t-1} \Gamma_s \big\| \le \Lambda^{t-k}$. Hence, we have that
$$\|x_t\| \le \sum_{k=1}^{t} \Lambda^{t-k}\, \|r_k\| \le \frac{1}{1-\Lambda} \max_{0 \le k \le t-1} \|r_{k+1}\|.$$
Now, $\|r_{k+1}\| \le \|M_k^\top z_k\| + \|w_{k+1}\|$ when $k \notin \mathcal{T}_T$, and $\|r_{k+1}\| = \|w_{k+1}\|$ otherwise. Hence,
$$\max_{0 \le k \le t-1} \|r_{k+1}\| \le \max_{k < t,\ k \notin \mathcal{T}_T} \|M_k^\top z_k\| + \max_{k < t} \|w_{k+1}\|.$$
The first term can be bounded by Lemma 18. The second term can be bounded as follows: from the sub-Gaussianity Assumption A1 we have that, for any index $1 \le i \le n$ and any time $k \le t$, with probability at least $1 - \delta/(4n t(t+1))$,
$$|w_{k,i}| \le L\sqrt{2 \log\frac{4n t(t+1)}{\delta}}.$$
As a result, by a union bound argument, on some event $H$ with $P(H) \ge 1 - \delta/4$, for all $k \le t \le T$, $\|w_k\| \le 2L\sqrt{n \log\frac{4nt(t+1)}{\delta}}$. Thus, on $H \cap E$, for all $t \le T$,
$$\|x_t\| \le \frac{1}{1-\Lambda}\left[ G\, Z_t^{\frac{n+d}{n+d+1}}\, \beta_t(\delta/4)^{\frac{1}{2(n+d+1)}} + 2L\sqrt{n \log\frac{4n t(t+1)}{\delta}} \right] = \alpha_t.$$
By the definition of $F$, $H \cap E \subseteq F \cap E$. Since, by the union bound, $P(H \cap E) \ge 1 - \delta/2$, $P(E \cap F) \ge 1 - \delta/2$ also holds, finishing the proof.