Regret Bounds for the Adaptive Control of Linear Quadratic Systems

JMLR: Workshop and Conference Proceedings 19 (2011) 1–26, 24th Annual Conference on Learning Theory

Yasin Abbasi-Yadkori abbasiya@cs.ualberta.ca
Csaba Szepesvári szepesva@cs.ualberta.ca
Department of Computing Science, University of Alberta

Editors: Sham Kakade, Ulrike von Luxburg
Abstract

We study the average cost Linear Quadratic (LQ) control problem with unknown model parameters, also known as the adaptive control problem in the control community. We design an algorithm and prove that, apart from logarithmic factors, its regret up to time $T$ is $O(\sqrt{T})$. Unlike previous approaches that use a forced-exploration scheme, we construct a high-probability confidence set around the model parameters and design an algorithm that plays optimistically with respect to this confidence set. The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm. To the best of our knowledge, this is the first time that a regret bound has been derived for the LQ control problem.
1. Introduction
We study the average cost LQ control problem with unknown model parameters, also known as the adaptive control problem in the control community. The problem is to minimize the average cost of a controller that operates in an environment whose dynamics are linear, while the cost is a quadratic function of the state and the control. The optimal solution is a linear feedback controller, which can be computed in closed form from the matrices underlying the dynamics and the cost. In the learning problem, the topic of this paper, the dynamics of the environment are unknown. This problem is challenging since the control actions influence both the cost and the rate at which the dynamics are learned, a topic of adaptive control. The objective in this case is to minimize the regret of the controller, i.e. to minimize the difference between the average cost incurred by the learning controller and that of the optimal controller. In this paper, for the first time, we show an adaptive controller and prove that, under some assumptions, its expected regret is bounded by $\tilde O(\sqrt{T})$. We build on recent works in online linear estimation and adaptive control design, the latter of which we survey next.
© 2011 Y. Abbasi-Yadkori & C. Szepesvári.
When the model parameters are known and the state is fully observed, one can use the principles of dynamic programming to obtain the optimal controller. The version of the problem that deals with unknown model parameters is called the adaptive control problem. Early attempts to solve this problem relied on the certainty equivalence principle (Simon, 1956). The idea was to estimate the unknown parameters from observations and then use the estimated parameters as if they were the true parameters to design a controller. It was soon realized that the certainty equivalence principle does not necessarily provide enough information to reliably estimate the parameters, and the estimated parameters can converge to incorrect values with positive probability (Becker et al., 1985). This in turn might lead to suboptimal performance.

To avoid this non-identification problem, methods that actively explore the environment to gather information have been developed (Lai and Wei, 1982a, 1987; Chen and Guo, 1987; Chen and Zhang, 1990; Fiechter, 1997; Lai and Ying, 2006; Campi and Kumar, 1998; Bittanti and Campi, 2006). However, only asymptotic results have been proven for these methods. One exception is the work of Fiechter (1997), who proposes an algorithm for the "discounted" LQ problem and analyzes its performance in a PAC framework.
Most of the aforementioned methods use forced-exploration schemes to provide the sufficient exploratory information. The idea is to take exploratory actions according to a fixed and appropriately designed schedule. However, forced-exploration schemes lack strong worst-case regret bounds, even in the simplest problems (see, e.g., Dani and Hayes (2006), Section 6). Unlike the preceding methods, Campi and Kumar (1998) propose an algorithm that uses the Optimism in the Face of Uncertainty (OFU) principle, which goes back to the work of Lai and Robbins (1985), to deal with the exploration/exploitation dilemma. They call this the Bet On the Best (BOB) principle. The idea is to construct high-probability confidence sets around the model parameters, find the optimal controller for each member of the confidence set, and finally choose the controller whose associated average cost is the smallest. However, Campi and Kumar (1998) only show asymptotic optimality, i.e. the average cost of their algorithm converges to that of the optimal policy in the limit. In this paper, we modify the algorithm and the proof technique of Campi and Kumar (1998) and extend their work to derive a finite-time regret bound. Our work also builds upon the works of Lai and Wei (1982b); Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) in analyzing linear estimation with dependent covariates, although we use a more recent, improved confidence bound (see Theorem 1).
Note that the OFU principle has been applied very successfully to a number of challenging learning and control situations. Lai and Robbins (1985), who invented the principle, used it to address learning in bandit problems (i.e., when there is no state), and later this work was picked up and modified by Auer et al. (2002) to make it work in nonparametric bandits. The OFU principle has also been applied to learning in finite Markov Decision Processes, both in a regret minimization setting (e.g., Bartlett and Tewari 2009; Auer et al. 2010) and in a PAC-learning setting (e.g., Kearns and Singh 1998; Brafman and Tennenholtz 2002; Kakade 2003; Strehl et al. 2006; Szita and Szepesvári 2010). In the PAC-MDP framework there has been some work to extend the OFU principle to infinite Markov Decision Problems under various assumptions. For example, Lipschitz assumptions have been used by Kakade et al. (2003), while Strehl and Littman (2008) explored linear models. However, none of these works consider both continuous state and action spaces. Continuous action spaces in the context of bandits have been explored in a number of works, such as the works of Kleinberg (2005); Auer et al. (2007); Kleinberg et al. (2008), and in a linear setting by Auer (2003); Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010).
2. Notation and conventions

We use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the 2-norm and the Frobenius norm, respectively. For a positive semidefinite matrix $A \in \mathbb{R}^{d\times d}$, the weighted 2-norm $\|\cdot\|_A$ is defined by $\|x\|_A^2 = x^\top A x$, where $x \in \mathbb{R}^d$. The inner product is denoted by $\langle\cdot,\cdot\rangle$. We use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the minimum and maximum eigenvalues of the positive semidefinite matrix $A$, respectively. We use $A \succ 0$ to denote that $A$ is positive definite, while $A \succeq 0$ denotes that it is positive semidefinite. We use $\mathbb{I}\{A\}$ to denote the indicator function of an event $A$.
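As a concrete illustration of the weighted norm (a small sketch we add for clarity; the snippet and its names are ours, not part of the original text):

```python
import numpy as np

def weighted_norm(x, A):
    """Weighted 2-norm ||x||_A = sqrt(x^T A x), defined for positive semidefinite A."""
    return float(np.sqrt(x @ A @ x))

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])      # positive definite, hence positive semidefinite
x = np.array([1.0, -1.0])
print(weighted_norm(x, A))     # sqrt(x^T A x) = sqrt(2)
```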
3. The Linear Quadratic Problem

We consider the discrete-time infinite-horizon linear quadratic (LQ) control problem:

$$x_{t+1} = A_* x_t + B_* u_t + w_{t+1}, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$

where $t = 0, 1, \dots$, $u_t \in \mathbb{R}^d$ is the control at time $t$, $x_t \in \mathbb{R}^n$ is the state at time $t$, $c_t \in \mathbb{R}$ is the cost at time $t$, $w_{t+1}$ is the "noise", $A_* \in \mathbb{R}^{n\times n}$ and $B_* \in \mathbb{R}^{n\times d}$ are unknown matrices, while $Q \in \mathbb{R}^{n\times n}$ and $R \in \mathbb{R}^{d\times d}$ are known (positive definite) matrices. At time zero, for simplicity, $x_0 = 0$. The problem is to design a controller based on past observations to minimize the average expected cost

$$J(u_0, u_1, \dots) = \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T}\mathbb{E}[c_t]. \qquad (1)$$

Let $J_*$ be the optimal (lowest) average cost. The regret up to time $T$ of a controller which incurs a cost of $c_t$ at time $t$ is defined by

$$R(T) = \sum_{t=0}^{T}(c_t - J_*),$$

which is the difference between the performance of the controller and the performance of the optimal controller that has full information about the system dynamics. Thus the regret can be interpreted as a measure of the cost of not knowing the system dynamics.
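To make the setting concrete, the following Python sketch simulates the dynamics and the cost above for an illustrative two-dimensional system; the specific matrices and the placeholder policy are our own assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2, 1                        # illustrative state/control dimensions
A_star = np.array([[0.9, 0.1],     # unknown to the learner
                   [0.0, 0.8]])
B_star = np.array([[0.0], [0.1]])
Q, R = np.eye(n), np.eye(d)        # known positive definite cost matrices

def step(x, u):
    """One step of x_{t+1} = A* x + B* u + w, with standard normal noise."""
    w = rng.standard_normal(n)     # E[w w^T] = I_n, sub-Gaussian with L = 1
    return A_star @ x + B_star @ u + w

def cost(x, u):
    """Quadratic stage cost c_t = x^T Q x + u^T R u."""
    return float(x @ Q @ x + u @ R @ u)

x = np.zeros(n)                    # x_0 = 0 as in the text
total = 0.0
for t in range(1000):
    u = np.zeros(d)                # placeholder policy; Section 3.3 derives u_t = K(Theta) x_t
    total += cost(x, u)
    x = step(x, u)
print("average cost over 1000 steps:", total / 1000)
```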
3.1. Assumptions

In this section, we state our assumptions on the noise and the system dynamics. In particular, we assume that the noise is sub-Gaussian and the system is controllable and observable.¹ Define

$$\Theta_*^\top = (A_*,\ B_*) \qquad\text{and}\qquad z_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}.$$

Thus, the state transition can be written as

$$x_{t+1} = \Theta_*^\top z_t + w_{t+1}.$$

Assumption A1 There exists a filtration $(\mathcal{F}_t)$ such that for the random variables $(z_0, x_1), \dots, (z_t, x_{t+1})$, the following hold: (i) $z_t$, $x_{t+1}$ are $\mathcal{F}_{t+1}$-measurable; (ii) for any $t \ge 0$, $\mathbb{E}[x_{t+1} \mid \mathcal{F}_t] = \Theta_*^\top z_t$; i.e., $w_{t+1} = x_{t+1} - \Theta_*^\top z_t$ is a martingale difference sequence ($\mathbb{E}[w_{t+1} \mid \mathcal{F}_t] = 0$, $t = 0, 1, \dots$); (iii) $\mathbb{E}[w_{t+1}w_{t+1}^\top \mid \mathcal{F}_t] = I_n$; (iv) the random variables $w_t$ are component-wise sub-Gaussian in the sense that there exists $L > 0$ such that for any $\gamma \in \mathbb{R}$ and index $j$,

$$\mathbb{E}[\exp(\gamma w_{t+1,j}) \mid \mathcal{F}_t] \le \exp(\gamma^2 L^2/2).$$

The assumption $\mathbb{E}[w_{t+1}w_{t+1}^\top \mid \mathcal{F}_t] = I_n$ makes the analysis more readable. However, we shall show in Section 4 that it is in fact not necessary. Our next assumption on the system uncertainty states that the unknown parameter is a member of a bounded set and is such that the system is controllable and observable. This assumption will let us derive a closed-form expression for the optimal control law.

Assumption A2 The unknown parameter $\Theta_*$ is a member of a set $S$ such that

$$S \subseteq S_0 \cap \big\{\Theta \in \mathbb{R}^{(n+d)\times n} \mid \mathrm{trace}(\Theta^\top\Theta) \le S^2\big\},$$

where

$$S_0 = \big\{\Theta \in \mathbb{R}^{(n+d)\times n} \mid \Theta^\top = (A, B),\ (A,B)\text{ is controllable},\ (A,M)\text{ is observable, where } Q = M^\top M\big\}.$$

In what follows, we shall always assume that the above assumptions are valid.

1. Controllability and observability are defined in Appendix B.
3.2. Parameter estimation

In order to implement the OFU principle, we need high-probability confidence sets for the unknown parameter matrix. The derivation of the confidence set is based on results from Abbasi-Yadkori et al. (2011), which use techniques from self-normalized processes to estimate the least-squares estimation error. Define

$$e(\Theta) = \lambda\,\mathrm{trace}(\Theta^\top\Theta) + \sum_{s=0}^{t-1}\mathrm{trace}\big((x_{s+1} - \Theta^\top z_s)(x_{s+1} - \Theta^\top z_s)^\top\big).$$

Let $\hat\Theta_t$ be the $\ell^2$-regularized least-squares estimate of $\Theta_*$ with regularization parameter $\lambda > 0$:

$$\hat\Theta_t = \operatorname*{argmin}_{\Theta}\, e(\Theta) = (Z^\top Z + \lambda I)^{-1} Z^\top X, \qquad (2)$$

where $Z$ and $X$ are the matrices whose rows are $z_0^\top, \dots, z_{t-1}^\top$ and $x_1^\top, \dots, x_t^\top$, respectively.

Theorem 1 Let $(z_0, x_1), \dots, (z_t, x_{t+1})$, $z_i \in \mathbb{R}^{n+d}$, $x_i \in \mathbb{R}^n$, satisfy the linear model Assumption A1 with some $L > 0$, $\Theta_* \in \mathbb{R}^{(n+d)\times n}$, $\mathrm{trace}(\Theta_*^\top\Theta_*) \le S^2$, and let $F = (\mathcal{F}_t)$ be the associated filtration. Consider the $\ell^2$-regularized least-squares parameter estimate $\hat\Theta_t$ with regularization coefficient $\lambda > 0$ (cf. (2)). Let

$$V_t = \lambda I + \sum_{i=0}^{t-1} z_i z_i^\top$$

be the regularized design matrix underlying the covariates. Define

$$\beta_t(\delta) = \left( nL\sqrt{2\log\frac{\det(V_t)^{1/2}\det(\lambda I)^{-1/2}}{\delta}} + \lambda^{1/2} S \right)^2. \qquad (3)$$

Then, for any $0 < \delta < 1$, with probability at least $1-\delta$,

$$\mathrm{trace}\big((\hat\Theta_t - \Theta_*)^\top V_t (\hat\Theta_t - \Theta_*)\big) \le \beta_t(\delta).$$

In particular, $\mathbb{P}(\Theta_* \in C_t(\delta),\ t = 1, 2, \dots) \ge 1-\delta$, where

$$C_t(\delta) = \Big\{\Theta \in \mathbb{R}^{(n+d)\times n} : \mathrm{trace}\big((\Theta - \hat\Theta_t)^\top V_t (\Theta - \hat\Theta_t)\big) \le \beta_t(\delta)\Big\}.$$
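The estimate (2) and the radius (3) are straightforward to compute; below is a minimal Python sketch (our own rendering, with hypothetical helper names), where Z has rows $z_0^\top,\dots,z_{t-1}^\top$ and X has rows $x_1^\top,\dots,x_t^\top$.

```python
import numpy as np

def ls_estimate(Z, X, lam):
    """l2-regularized least squares (2): Theta_hat = (Z^T Z + lam I)^{-1} Z^T X."""
    p = Z.shape[1]                  # p = n + d
    V = lam * np.eye(p) + Z.T @ Z   # regularized design matrix V_t
    return np.linalg.solve(V, Z.T @ X), V

def beta(V, lam, delta, n, L, S):
    """Confidence radius beta_t(delta) from (3)."""
    p = V.shape[0]
    logdet_ratio = np.linalg.slogdet(V)[1] - p * np.log(lam)  # log det(V)/det(lam I)
    return (n * L * np.sqrt(2.0 * (0.5 * logdet_ratio + np.log(1.0 / delta)))
            + np.sqrt(lam) * S) ** 2
```

A matrix $\Theta$ then belongs to $C_t(\delta)$ exactly when $\mathrm{trace}((\Theta - \hat\Theta_t)^\top V_t(\Theta - \hat\Theta_t))$ is at most this radius.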
3.3. The design of the controller

Let $\Theta^\top = (A, B) \in S_0$, where $S_0$ is defined in Assumption A2. Then there is a unique solution $P(\Theta)$ in the class of positive semidefinite symmetric matrices to the Riccati equation

$$P(\Theta) = Q + A^\top P(\Theta) A - A^\top P(\Theta) B\big(B^\top P(\Theta) B + R\big)^{-1} B^\top P(\Theta) A.$$

Under the same assumptions, the matrix $A + BK(\Theta)$ is stable, i.e. its 2-norm is less than one, where

$$K(\Theta) = -\big(B^\top P(\Theta) B + R\big)^{-1} B^\top P(\Theta) A$$

is the gain matrix (Bertsekas, 2001). Further, by the boundedness of $S$, we also obtain the boundedness of $P(\Theta)$ (Anderson and Moore, 1971). The corresponding constant will be denoted by $D$:

$$D = \sup\{\|P(\Theta)\| \mid \Theta \in S\}. \qquad (4)$$

The optimal control law for an LQ system with parameters $\Theta$ is

$$u_t = K(\Theta)x_t, \qquad (5)$$

i.e., this controller achieves the optimal average cost, which satisfies $J(\Theta) = \mathrm{trace}(P(\Theta))$ (Bertsekas, 2001). In particular, the average cost of control law (5) with $\Theta = \Theta_*$ is the optimal average cost $J_* = J(\Theta_*) = \mathrm{trace}(P(\Theta_*))$. We assume that the bound on the norm of the unknown parameter, $S$, and the sub-Gaussianity constant, $L$, are known:

Assumption A3 The constants $L$ and $S$ in Assumptions A1 and A2 are known.

The algorithm that we propose implements the OFU principle as follows: at time $t$, the algorithm chooses a parameter $\tilde\Theta_t$ from $C_t(\delta) \cap S$ such that

$$J(\tilde\Theta_t) \le \inf_{\Theta \in C_t(\delta)\cap S} J(\Theta) + \frac{1}{\sqrt{t}},$$

and then uses the optimal feedback controller (5) underlying the chosen parameter. In order to prevent too frequent changes to the controller (which might harm performance), the algorithm changes controllers only after the current parameter estimate is significantly refined. The details of the algorithm are given in Algorithm 1.
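Assuming SciPy is available, $P(\Theta)$, $K(\Theta)$, and $J(\Theta) = \mathrm{trace}(P(\Theta))$ can be computed with a standard discrete algebraic Riccati solver; the sketch below is ours, on an illustrative system, and mirrors the equations above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def riccati_quantities(A, B, Q, R):
    """Return P, the gain K, and J = trace(P) for the Riccati equation of Section 3.3.

    scipy's solve_discrete_are returns the stabilizing solution P of
    P = Q + A^T P A - A^T P B (B^T P B + R)^{-1} B^T P A.
    """
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # gain: u = K x
    J = float(np.trace(P))                               # optimal average cost
    return P, K, J

# illustrative system (not from the paper)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
P, K, J = riccati_quantities(A, B, np.eye(2), np.eye(1))
print("spectral radius of A + BK:", max(abs(np.linalg.eigvals(A + B @ K))))  # < 1
```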
4. Analysis

In this section we give our main result together with its proof. Before stating the main theorem, we make one more assumption in addition to the assumptions made before.

Assumption A4 The set $S$ is such that $\rho := \sup_{(A,B)\in S}\|A + BK(A,B)\| < 1$. Further, there exists a positive number $C$ such that $C := \sup_{\Theta\in S}\|K(\Theta)\| < \infty$.

Our main result is the following theorem:
  Inputs: $T$, $S > 0$, $\delta > 0$, $Q$, $L$, $\lambda > 0$.
  Set $V_0 = \lambda I$ and $\hat\Theta_0 = 0$. $(\tilde A_0, \tilde B_0) = \tilde\Theta_0 = \operatorname{argmin}_{\Theta \in C_0(\delta)\cap S} J(\Theta)$.
  for $t := 0, 1, 2, \dots$ do
    if $\det(V_t) > 2\det(V_0)$ then
      Calculate $\hat\Theta_t$ by (2).
      Find $\tilde\Theta_t$ such that $J(\tilde\Theta_t) \le \inf_{\Theta\in C_t(\delta)\cap S} J(\Theta) + \frac{1}{\sqrt{t}}$.
      Let $V_0 := V_t$.
    else
      $\tilde\Theta_t = \tilde\Theta_{t-1}$.
    end if
    Calculate $u_t$ based on the current parameters, $u_t = K(\tilde\Theta_t)x_t$.
    Execute control, observe new state $x_{t+1}$.
    Save $(z_t, x_{t+1})$ into the dataset, where $z_t^\top = (x_t^\top, u_t^\top)$.
    $V_{t+1} := V_t + z_t z_t^\top$.
  end for

Table 1: The proposed adaptive algorithm for the LQ problem

Theorem 2 For any $0 < \delta < 1$, for any time $T$, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded as follows:

$$R(T) = \tilde O\big(\sqrt{T\log(1/\delta)}\big),$$

where the hidden constant is a problem-dependent constant.²

Remark 3 The assumption $\mathbb{E}[w_{t+1}w_{t+1}^\top \mid \mathcal{F}_t] = I_n$ makes the analysis more readable. Alternatively, we could assume that $\mathbb{E}[w_{t+1}w_{t+1}^\top \mid \mathcal{F}_t] = G_*$. Then the optimal average cost becomes $J(\Theta_*) = \mathrm{trace}(P(\Theta_*)G_*)$. The only change in the algorithm will be in the computation of $\tilde\Theta_t$, which will have the following form:

$$J(\tilde\Theta_t, \tilde G_t) \le \inf_{(\Theta, G)\in C_t} J(\Theta, G) + \frac{1}{\sqrt{t}},$$

where $C_t$ is now a confidence set over $\Theta_*$ and $G_*$. The rest of the analysis remains identical, provided that an appropriate confidence set is constructed.

2. Here, $\tilde O$ hides logarithmic factors.
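For concreteness, here is a schematic Python sketch of the optimistic choice in Algorithm 1. The inner minimization over $C_t(\delta)\cap S$ is the computationally hard step; we approximate it here by sampling the confidence ellipsoid, which is our own simplification rather than the paper's prescription. Algorithm 1 would recompute this choice only when $\det(V_t)$ has doubled.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def J_of(theta, Q, R, n):
    """Average cost J(Theta) = trace(P(Theta)), where Theta^T = (A, B)."""
    A, B = theta[:n].T, theta[n:].T
    return float(np.trace(solve_discrete_are(A, B, Q, R)))

def optimistic_choice(theta_hat, V, radius, Q, R, n, n_samples=200, seed=0):
    """Crude stand-in for the argmin over C_t: sample Theta with
    trace((Theta - theta_hat)^T V (Theta - theta_hat)) <= radius and keep the
    candidate with the smallest average cost J(Theta)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.linalg.inv(V))      # L L^T = V^{-1}, so L^T V L = I
    try:
        best, best_J = theta_hat, J_of(theta_hat, Q, R, n)
    except (np.linalg.LinAlgError, ValueError):   # center may not be stabilizable
        best, best_J = theta_hat, np.inf
    for _ in range(n_samples):
        G = rng.standard_normal(theta_hat.shape)
        G /= max(1.0, np.linalg.norm(G))          # ||G||_F <= 1 keeps us in the ellipsoid
        cand = theta_hat + np.sqrt(radius) * (L @ G)
        try:
            cand_J = J_of(cand, Q, R, n)
        except (np.linalg.LinAlgError, ValueError):  # Riccati solver may fail off S
            continue
        if cand_J < best_J:
            best, best_J = cand, cand_J
    return best
```

Sampling the ellipsoid trades the paper's exact optimistic optimization for a cheap heuristic; a practical implementation would also project candidates onto $S$.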
The least-squares estimation error from Theorem 1 scales with the size of the state and action vectors. Thus, in order to prove Theorem 2, we first prove a high-probability bound on the norm of the state vector. Given the boundedness of the state, we decompose the regret and analyze each term using appropriate concentration inequalities.
4.1. Bounding $\|x_t\|$

We choose an error probability $\delta > 0$. Given this, we define two "good events" in the probability space $\Omega$. In particular, we define the event that the confidence sets hold for $s = 0, \dots, t$,

$$E_t = \{\omega\in\Omega : \forall s\le t,\ \Theta_*\in C_s(\delta/4)\},$$

and the event that the state vector stays "small":

$$F_t = \{\omega\in\Omega : \forall s\le t,\ \|x_s\| \le \upsilon_t\},$$

where

$$\upsilon_t = \frac{\Xi^{n+d}}{1-\rho}\left(G\,Z_T^{\frac{n+d}{n+d+1}}\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}} + 2L\sqrt{n\log\frac{4nt(t+1)}{\delta}}\right), \qquad \Xi = 1\vee\sup_{\Theta\in S}\|A_* + B_*K(\Theta)\|, \qquad Z_T = \max_{0\le t\le T}\|z_t\|,$$

$$G = 2\left(\frac{2S(n+d)^{n+d+1/2}}{U^{1/2}}\right)^{\frac{1}{n+d+1}}, \qquad U = \frac{U_0}{H}, \qquad U_0 = \frac{1}{16^{n+d-2}\,(1\vee S^{2(n+d-2)})},$$

and $H$ is any number satisfying

$$H > 16 \vee 4S^2M_0^2, \qquad\text{where}\qquad M_0 = \sup_{Y\ge 1}\frac{nL\sqrt{(n+d)\log\frac{1+TY/\lambda}{\delta}} + \lambda^{1/2}S}{Y}.$$

In what follows, we let $E = E_T$ and $F = F_T$. First, we show that $E\cap F$ holds with high probability and that, on $E\cap F$, the state vector does not explode.

Lemma 4 $\mathbb{P}(E\cap F) \ge 1-\delta/2$.

The proof is in Appendix D. It first shows that $\|(\Theta_* - \tilde\Theta_t)^\top z_t\|$ is well controlled except for a small number of occasions. Given this and a proper decomposition of the state update equation, we can prove that the state vector $x_t$ stays smaller than $\upsilon_t$, which itself depends on $Z_T$, which in turn depends on $x_t$. Thus, we need one more step to obtain a bound on $\|x_t\|$.³

3. We use $\wedge$ and $\vee$ to denote the minimum and the maximum, respectively.
Lemma 5 For appropriate problem-dependent constants $C_1 > 0$, $C_2 > 0$ (which are independent of $t$, $\delta$, $T$), for any $t\ge 0$ it holds that $\hat X_t \le Y_t^{n+d+1}$, where

$$\hat X_t := \max_{1\le s\le t}\mathbb{I}\{F_{s-1}\}\,\|x_s\|$$

and

$$Y_t \stackrel{\mathrm{def}}{=} e \vee (n+d)(e-1) \vee 4\big(C_1\log(1/\delta) + C_2\log(t/\delta)\big)\log^2\Big(4\big(C_1\log(1/\delta) + C_2\log(t/\delta)\big)\Big).$$

Proof Fix $t$. On $F_{s-1}$, $\|x_s\| \le \upsilon_{s-1} \le \upsilon_t$, so, with appropriate problem-dependent constants $D_1, D_2 > 0$,

$$\hat X_t \le D_1\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}}\,\hat X_t^{\frac{n+d}{n+d+1}} + D_2\sqrt{\log(t/\delta)}\,. \qquad (6)$$

Let $\bar X_t$ be the largest value of $x \ge 0$ that satisfies

$$x \le D_1\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}}\,x^{\frac{n+d}{n+d+1}} + D_2\sqrt{\log(t/\delta)}\,. \qquad (7)$$

Clearly, $\hat X_t \le \bar X_t$. Because $\beta_t(\delta)$ is a function of $\log\det(V_t)$, and on $F_{t-1}$ the covariates satisfy $\|z_s\| \le \sqrt{1+C^2}\,\hat X_t$, the inequality (7) has the form

$$X \le f\big(\log(X)\big)\,X^{\frac{n+d}{n+d+1}} + g, \qquad (8)$$

where $f$ grows at most logarithmically. Let $a = X^{1/(n+d+1)}$. Then (8) is equivalent to $a^{n+d+1} \le f\big((n+d+1)\log a\big)\,a^{n+d} + g$. Let $c = \max(1, a)$ and assume that $t \ge n+d$. By Lemma 10 and tedious, but elementary, calculations it can then be shown that

$$c \le A\log^2(c) + B_t, \qquad (9)$$

where $A = G_1\log(1/\delta)$ and $B_t = G_2\log(t/\delta)$ for appropriate problem-dependent constants $G_1, G_2$. From this, further elementary calculations show that the maximum value that $c$ can take subject to the constraint (9) is bounded from above by $Y_t$; hence $\hat X_t \le \bar X_t \le Y_t^{n+d+1}$.
4.2. Regret Decomposition

From the Bellman optimality equations for the LQ problem, we get that (Bertsekas, 1987, Vol. 2, pp. 228–229)

$$J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t)x_t = \min_u\Big\{ x_t^\top Q x_t + u^\top R u + \mathbb{E}\big[\tilde x_{t+1}(u)^\top P(\tilde\Theta_t)\,\tilde x_{t+1}(u) \mid \mathcal{F}_t\big]\Big\},$$

where $\tilde x_{t+1}(u) = \tilde A_t x_t + \tilde B_t u + w_{t+1}$ and $(\tilde A_t, \tilde B_t) = \tilde\Theta_t^\top$. Hence, with $u_t = K(\tilde\Theta_t)x_t$,

$$\begin{aligned}
J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t)x_t
&= x_t^\top Q x_t + u_t^\top R u_t + \mathbb{E}\big[(\tilde A_t x_t + \tilde B_t u_t + w_{t+1})^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t + w_{t+1}) \mid \mathcal{F}_t\big] \\
&= x_t^\top Q x_t + u_t^\top R u_t + \mathbb{E}\big[w_{t+1}^\top P(\tilde\Theta_t)w_{t+1} \mid \mathcal{F}_t\big] + (\tilde A_t x_t + \tilde B_t u_t)^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t) \\
&= x_t^\top Q x_t + u_t^\top R u_t + \mathbb{E}\big[x_{t+1}^\top P(\tilde\Theta_t)x_{t+1} \mid \mathcal{F}_t\big] \\
&\qquad - (A_* x_t + B_* u_t)^\top P(\tilde\Theta_t)(A_* x_t + B_* u_t) + (\tilde A_t x_t + \tilde B_t u_t)^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t),
\end{aligned}$$

where in the one-before-last equality we have used $x_{t+1} = A_* x_t + B_* u_t + w_{t+1}$ and the martingale property of the noise. Hence,

$$\sum_{t=0}^{T}\big(x_t^\top Q x_t + u_t^\top R u_t\big) = \sum_{t=0}^{T} J(\tilde\Theta_t) + R_1 + R_2 + R_3 \le T\,J(\Theta_*) + R_1 + R_2 + R_3 + 2\sqrt{T},$$

where the last inequality follows from the choice of $\tilde\Theta_t$ and the fact that on $E$, $\Theta_* \in C_t(\delta/4)\cap S$ (so that $J(\tilde\Theta_t) \le J(\Theta_*) + 1/\sqrt{t}$ and $\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$), and

$$R_1 = \sum_{t=0}^{T}\Big\{ x_t^\top P(\tilde\Theta_t)x_t - \mathbb{E}\big[x_{t+1}^\top P(\tilde\Theta_{t+1})x_{t+1} \mid \mathcal{F}_t\big]\Big\}, \qquad (10)$$

$$R_2 = \sum_{t=0}^{T}\mathbb{E}\Big[x_{t+1}^\top\big(P(\tilde\Theta_{t+1}) - P(\tilde\Theta_t)\big)x_{t+1} \mid \mathcal{F}_t\Big], \qquad (11)$$

$$R_3 = \sum_{t=0}^{T}\Big\{(A_* x_t + B_* u_t)^\top P(\tilde\Theta_t)(A_* x_t + B_* u_t) - (\tilde A_t x_t + \tilde B_t u_t)^\top P(\tilde\Theta_t)(\tilde A_t x_t + \tilde B_t u_t)\Big\}. \qquad (12)$$

Thus, on $E\cap F$,

$$R(T) \le R_1 + R_2 + R_3 + 2\sqrt{T}. \qquad (13)$$

In the following subsections, we bound $R_1$, $R_2$, and $R_3$.
4.3. Bounding $\mathbb{I}\{E\cap F\}R_1$

We start by showing that with high probability all noise terms are small.

Lemma 6 With probability $1-\delta/8$, for any $k \le T$, $\|w_k\| \le L\sqrt{2n\log(8nT/\delta)}$.

Proof From the sub-Gaussianity Assumption A1, we have that for any index $1\le i\le n$ and any time $k$, with probability $1-\delta/(8nT)$, $|w_{k,i}| \le L\sqrt{2\log(8nT/\delta)}$. Thus, by a union bound, with probability $1-\delta/8$, for any $k\le T$, $\|w_k\| \le L\sqrt{2n\log(8nT/\delta)}$.

Lemma 7 Let $R_1$ be as defined by (10). With probability at least $1-\delta/2$,

$$\mathbb{I}\{E\cap F\}\,R_1 \le 2DW^2\sqrt{2T\log\frac{8}{\delta}} + n\sqrt{B_0},$$

where $W = L\sqrt{2n\log(8nT/\delta)}$, $v > 0$ is an arbitrary constant, and

$$B_0 = 2L^2\big(v + 4TD^2S^2X_T^2(1+C^2)\big)\log\left(\frac{4n}{\delta}\,\frac{\big(v + 4TD^2S^2X_T^2(1+C^2)\big)^{1/2}}{v^{1/2}}\right).$$

Proof Let $f_{t-1} = A_* x_{t-1} + B_* u_{t-1}$ and $P_t = P(\tilde\Theta_t)$. Write

$$R_1 = x_0^\top P(\tilde\Theta_0)x_0 - \mathbb{E}\big[x_{T+1}^\top P(\tilde\Theta_{T+1})x_{T+1} \mid \mathcal{F}_T\big] + \sum_{t=1}^{T}\Big\{ x_t^\top P(\tilde\Theta_t)x_t - \mathbb{E}\big[x_t^\top P(\tilde\Theta_t)x_t \mid \mathcal{F}_{t-1}\big]\Big\}.$$

Because $P$ is positive semi-definite and $x_0 = 0$, the sum of the first two terms is bounded by zero. Using $x_t = f_{t-1} + w_t$, the remaining sum can be decomposed as

$$\sum_{t=1}^{T}\Big\{x_t^\top P_t x_t - \mathbb{E}\big[x_t^\top P_t x_t\mid \mathcal{F}_{t-1}\big]\Big\} = \sum_{t=1}^{T} 2 f_{t-1}^\top P_t w_t + \sum_{t=1}^{T}\Big\{ w_t^\top P_t w_t - \mathbb{E}\big[w_t^\top P_t w_t \mid \mathcal{F}_{t-1}\big]\Big\}.$$

We bound each term separately. Let $v_t^\top = 2f_{t-1}^\top P_t$ and

$$G_1 = \mathbb{I}\{E\cap F\}\sum_{t=1}^{T} v_t^\top w_t = \sum_{i=1}^{n}\mathbb{I}\{E\cap F\}\sum_{t=1}^{T} v_{t,i}w_{t,i}.$$

Let $M_{T,i} = \sum_{t=1}^{T}v_{t,i}w_{t,i}$. By Theorem 16, on some event $G_{\delta,i}$ that holds with probability at least $1-\delta/(4n)$, for any $T\ge 0$,

$$M_{T,i}^2 \le 2L^2\left(v + \sum_{t=1}^{T}v_{t,i}^2\right)\log\left(\frac{4n}{\delta}\,\frac{\big(v + \sum_{t=1}^{T}v_{t,i}^2\big)^{1/2}}{v^{1/2}}\right) =: B_{\delta,i}.$$

On $E\cap F$, $\|v_t\| \le 2DS X_T\sqrt{1+C^2}$, so that $B_{\delta,i} \le B_0$. Thus, $G_1 \le n\sqrt{B_0}$ on $\cap_{i=1}^{n}G_{\delta,i}$, which holds w.p. $1-\delta/4$.

Define $X_t = w_t^\top P_t w_t - \mathbb{E}[w_t^\top P_t w_t \mid \mathcal{F}_{t-1}]$ and $G_2 = \mathbb{I}\{E\cap F\}\sum_{t=1}^{T}X_t$. By Lemma 14, applied to the truncated version $\tilde X_t = X_t\,\mathbb{I}\{X_t \le 2DW^2\}$ and $\tilde S_T = \sum_{t=1}^{T}\tilde X_t$,

$$\mathbb{P}\left(G_2 > 2DW^2\sqrt{2T\log\frac{8}{\delta}}\right) \le \mathbb{P}\left(\max_{1\le t\le T}X_t > 2DW^2\right) + \mathbb{P}\left(\tilde S_T > 2DW^2\sqrt{2T\log\frac{8}{\delta}}\right).$$

By Lemma 6 and Azuma's inequality, each term on the right-hand side is bounded by $\delta/8$. Thus, w.p. $1-\delta/4$,

$$G_2 \le 2DW^2\sqrt{2T\log\frac{8}{\delta}}.$$

Summing up the bounds on $G_1$ and $G_2$ gives the result, which holds w.p. at least $1-\delta/2$:

$$\mathbb{I}\{E\cap F\}\,R_1 \le 2DW^2\sqrt{2T\log\frac{8}{\delta}} + n\sqrt{B_0}.$$
4.4. Bounding $\mathbb{I}\{E\cap F\}|R_2|$

We can bound $\mathbb{I}\{E\cap F\}|R_2|$ by simply showing that Algorithm 1 rarely changes the policy, and hence most terms in (11) are zero.

Lemma 8 On event $E\cap F$, the number of policy changes up to time $T$ is at most

$$K = (n+d)\log_2\left(1 + \frac{TX_T^2(1+C^2)}{\lambda}\right).$$

Proof If we have changed the policy $K$ times up to time $T$, then, since each change at least doubles the determinant of the design matrix, $\det(V_T) \ge \lambda^{n+d}2^K$. On the other hand, we have

$$\det(V_T) \le \lambda_{\max}(V_T)^{n+d} \le \left(\lambda + \sum_{t=0}^{T-1}\|z_t\|^2\right)^{n+d} \le \big(\lambda + TX_T^2(1+C^2)\big)^{n+d},$$

where $C$ is the bound on the norm of $K(\cdot)$ as defined in Assumption A4. Thus, it holds that

$$\lambda^{n+d}2^K \le \big(\lambda + TX_T^2(1+C^2)\big)^{n+d}.$$

Solving for $K$, we get $K \le (n+d)\log_2\big(1 + TX_T^2(1+C^2)/\lambda\big)$.

Lemma 9 Let $R_2$ be as defined by Equation (11). Then we have

$$\mathbb{I}\{E\cap F\}|R_2| \le 2DX_T^2(n+d)\log_2\big(1 + TX_T^2(1+C^2)/\lambda\big).$$

Proof On event $E\cap F$, we have at most $K = (n+d)\log_2\big(1 + TX_T^2(1+C^2)/\lambda\big)$ policy changes up to time $T$. So at most $K$ terms in the summation (11) are non-zero. Each term in the summation is bounded by $2DX_T^2$. Thus,

$$\mathbb{I}\{E\cap F\}|R_2| \le 2DX_T^2(n+d)\log_2\big(1 + TX_T^2(1+C^2)/\lambda\big).$$
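The determinant-doubling argument is easy to check empirically; in the sketch below (our own illustration, with bounded random covariates), the number of doublings never exceeds the bound of Lemma 8.

```python
import numpy as np

rng = np.random.default_rng(2)
p, T, lam = 3, 10_000, 1.0           # p plays the role of n + d
V = lam * np.eye(p)
V_ref = V.copy()                     # determinant reference, as in Algorithm 1
changes, z_max = 0, 0.0
for t in range(T):
    z = rng.uniform(-1.0, 1.0, size=p)   # bounded covariates
    z_max = max(z_max, np.linalg.norm(z))
    V += np.outer(z, z)
    if np.linalg.det(V) > 2.0 * np.linalg.det(V_ref):
        changes += 1
        V_ref = V.copy()
# Lemma 8-style bound: det(V_T) >= lam^p 2^K and det(V_T) <= (lam + T z_max^2)^p
bound = p * np.log2(1.0 + T * z_max**2 / lam)
print(changes, "<=", bound)
```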
4.5. Bounding $\mathbb{I}\{E\cap F\}|R_3|$

The summation $\sum_{t=0}^{T}\|(\Theta_* - \tilde\Theta_t)^\top z_t\|^2$ controls $|R_3|$. So we first bound this summation, whose analysis requires the following two results.

Lemma 10 The following holds for any $t \ge 1$:

$$\sum_{k=0}^{t-1}\Big(\|z_k\|_{V_k^{-1}}^2 \wedge 1\Big) \le 2\log\frac{\det(V_t)}{\det(\lambda I)}.$$

Further, when the covariates satisfy $\|z_t\| \le c_m$, $t \ge 0$, with some $c_m > 0$ w.p. 1, then

$$\log\frac{\det(V_t)}{\det(\lambda I)} \le (n+d)\log\frac{\lambda(n+d) + t\,c_m^2}{\lambda(n+d)}.$$

The proof of the lemma can be found in Abbasi-Yadkori et al. (2011).

Lemma 11 Let $A, B \in \mathbb{R}^{m\times m}$ be positive semi-definite matrices such that $A \succeq B \succ 0$. Then, we have

$$\sup_{X\ne 0}\frac{\|X^\top A X\|}{\|X^\top B X\|} \le \frac{\det(A)}{\det(B)}.$$
The proof of this lemma is in Appendix C.
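Lemma 10 (the "elliptical potential" bound) is also easy to check numerically; the following sketch (ours) draws random covariates and compares the truncated sum against $2\log\det(V_t)/\det(\lambda I)$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, T, lam = 3, 5000, 1.0
V = lam * np.eye(p)
potential = 0.0
for t in range(T):
    z = rng.standard_normal(p)
    potential += min(z @ np.linalg.solve(V, z), 1.0)   # ||z_t||^2_{V_t^{-1}} ∧ 1
    V += np.outer(z, z)
logdet_ratio = np.linalg.slogdet(V)[1] - p * np.log(lam)
print(potential, "<=", 2.0 * logdet_ratio)             # Lemma 10
```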
Lemma 12 On $E\cap F$, it holds that

$$\sum_{t=0}^{T}\big\|(\Theta_* - \tilde\Theta_t)^\top z_t\big\|^2 \le 16(1+C^2)X_T^2\,\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)}.$$

Proof Consider timestep $t$. Let $s_t = (\Theta_* - \tilde\Theta_t)^\top z_t$, and let $\tau \le t$ be the last timestep at which the policy was changed, so that $\tilde\Theta_t = \tilde\Theta_\tau$. For all $\Theta \in C_\tau$,

$$\big\|(\hat\Theta_\tau - \Theta)^\top z_t\big\| = \big\|(\hat\Theta_\tau - \Theta)^\top V_\tau^{1/2}V_\tau^{-1/2}z_t\big\| \le \big\|V_\tau^{1/2}(\hat\Theta_\tau - \Theta)\big\|\,\|z_t\|_{V_\tau^{-1}} \le \sqrt{\beta_\tau(\delta/4)}\,\|z_t\|_{V_\tau^{-1}},$$

where we used the Cauchy-Schwarz inequality and the fact that $\lambda_{\max}(M) \le \mathrm{trace}(M)$ for $M\succeq 0$. Applying this with $\Theta = \Theta_*$ and $\Theta = \tilde\Theta_\tau$ (both lie in $C_\tau$ on $E$, the latter by the choice of $\tilde\Theta_\tau$), the triangle inequality gives

$$\|s_t\| \le 2\sqrt{\beta_\tau(\delta/4)}\,\|z_t\|_{V_\tau^{-1}}. \qquad (14)$$

Because the policy has not changed between $\tau$ and $t$, $\det(V_t) \le 2\det(V_\tau)$, so by Lemma 11,

$$\|z_t\|_{V_\tau^{-1}}^2 \le \frac{\det(V_t)}{\det(V_\tau)}\,\|z_t\|_{V_t^{-1}}^2 \le 2\,\|z_t\|_{V_t^{-1}}^2,$$

which, together with (14), gives $\|s_t\|^2 \le 8\beta_T(\delta/4)\,\|z_t\|_{V_t^{-1}}^2$. Now, by Assumption A4 and the fact that $\Theta_*, \tilde\Theta_t \in S$, we also have $\|s_t\|^2 \le 4S^2\|z_t\|^2 \le 4S^2(1+C^2)X_T^2$, hence

$$\|s_t\|^2 \le 8(1+C^2)X_T^2\,\beta_T(\delta/4)\big(\|z_t\|_{V_t^{-1}}^2 \wedge 1\big).$$

It follows then that

$$\sum_{t=0}^{T}\|s_t\|^2 \le 8(1+C^2)X_T^2\,\beta_T(\delta/4)\sum_{t=0}^{T}\big(\|z_t\|_{V_t^{-1}}^2 \wedge 1\big) \le 16(1+C^2)X_T^2\,\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)},$$

where the last step is Lemma 10.

Now we are ready to bound $R_3$.

Lemma 13 Let $R_3$ be as defined by Equation (12). Then we have

$$\mathbb{I}\{E\cap F\}|R_3| \le 8\sqrt{2}\,(1+C^2)X_T^2\,SD\,\sqrt{T\,\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)}}.$$

Proof We have that

$$\begin{aligned}
\mathbb{I}\{E\cap F\}|R_3| &\le \mathbb{I}\{E\cap F\}\sum_{t=0}^{T}\Big|\big\|P(\tilde\Theta_t)^{1/2}\Theta_*^\top z_t\big\|^2 - \big\|P(\tilde\Theta_t)^{1/2}\tilde\Theta_t^\top z_t\big\|^2\Big| \\
&\le \mathbb{I}\{E\cap F\}\sum_{t=0}^{T}\big\|P(\tilde\Theta_t)^{1/2}(\Theta_* - \tilde\Theta_t)^\top z_t\big\|\,\Big(\big\|P(\tilde\Theta_t)^{1/2}\Theta_*^\top z_t\big\| + \big\|P(\tilde\Theta_t)^{1/2}\tilde\Theta_t^\top z_t\big\|\Big) \quad\text{(triangle ineq.)}\\
&\le 2SD\sqrt{1+C^2}\,X_T\;\mathbb{I}\{E\cap F\}\sum_{t=0}^{T}\big\|(\Theta_* - \tilde\Theta_t)^\top z_t\big\| \quad\text{(by (4))}\\
&\le 2SD\sqrt{1+C^2}\,X_T\,\sqrt{T+1}\left(\mathbb{I}\{E\cap F\}\sum_{t=0}^{T}\big\|(\Theta_* - \tilde\Theta_t)^\top z_t\big\|^2\right)^{1/2} \quad\text{(Cauchy-Schwarz ineq.)}\\
&\le 8\sqrt{2}\,(1+C^2)X_T^2\,SD\,\sqrt{T\,\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)}}, \quad\text{(Lemma 12 and } T+1\le 2T\text{)}
\end{aligned}$$

which finishes the proof.

4.6. Putting Everything Together

Now we are ready to prove Theorem 2.

Proof [Proof of Theorem 2] By (13) and Lemmas 7, 9 and 13, we have that, on $E\cap F$, with probability at least $1-\delta/2$,

$$\begin{aligned}
R(T) \le\; & 2DW^2\sqrt{2T\log\frac{8}{\delta}} + n\sqrt{B_0} + 2DX_T^2(n+d)\log_2\left(1 + \frac{TX_T^2(1+C^2)}{\lambda}\right) \\
& + 8\sqrt{2}\,(1+C^2)X_T^2\,SD\,\sqrt{T\,\beta_T(\delta/4)\log\frac{\det(V_T)}{\det(\lambda I)}} + 2\sqrt{T}.
\end{aligned}$$

Further, on $E\cap F$, by Lemmas 5 and 10, $X_T \le Y_T^{n+d+1}$ and

$$\log\frac{\det(V_T)}{\det(\lambda I)} \le (n+d)\log\left(\frac{\lambda(n+d) + T(1+C^2)X_T^2}{\lambda(n+d)}\right).$$

Plugging these in gives the final bound, which, by Lemma 4, holds with probability $1-\delta$.
References
Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. arXiv preprint, http://arxiv.org/abs/1102.2670, 2011.

B. D. O. Anderson and J. B. Moore. Linear Optimal Control. Prentice-Hall, 1971.

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2003.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-07), pages 454–468, 2007.

P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.

A. Becker, P. R. Kumar, and C. Z. Wei. Adaptive control with the stochastic approximation algorithm: Geometry and convergence. IEEE Transactions on Automatic Control, 30(4):330–338, 1985.

D. Bertsekas. Dynamic Programming. Prentice-Hall, 1987.

D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2001.

S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the "bet on the best" principle. Communications in Information and Systems, 6(4):299–320, 2006.

R. I. Brafman and M. Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

M. C. Campi and P. R. Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890–1907, 1998.

H. Chen and L. Guo. Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization, 25(4):845–867, 1987.

H. Chen and J. Zhang. Identification and adaptive control for systems with unknown orders, delay, and coefficients. IEEE Transactions on Automatic Control, 35(8):866–877, August 1990.

V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 937–943, 2006.

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT-2008, pages 355–366, 2008.

C. Fiechter. PAC adaptive control of linear systems. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 72–80. ACM Press, 1997.

S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

S. M. Kakade, M. J. Kearns, and J. Langford. Exploration in metric state spaces. In T. Fawcett and N. Mishra, editors, ICML 2003, pages 306–312. AAAI Press, 2003.

M. Kearns and S. P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In J. W. Shavlik, editor, ICML 1998, pages 260–268. Morgan Kaufmann, 1998.

R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.

R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690, 2008.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982a.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982b.

T. L. Lai and C. Z. Wei. Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization, 25:466–481, March 1987.

T. L. Lai and Z. Ying. Efficient recursive estimation and adaptive control in stochastic regression and ARMAX models. Statistica Sinica, 16:741–772, 2006.

P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

H. A. Simon. Dynamic programming under uncertainty with a quadratic criterion function. Econometrica, 24(1):74–81, 1956.

A. L. Strehl and M. L. Littman. Online linear regression and its application to model-based reinforcement learning. In NIPS, pages 1417–1424, 2008.

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In ICML, pages 881–888, 2006.

I. Szita and Cs. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML 2010, pages 1031–1038, 2010.
Appendix A. Tools from Probability Theory

Lemma 14 Let $X_1, \dots, X_t$ be random variables, $S_t = \sum_{s=1}^{t}X_s$ and $\tilde S_t = \sum_{s=1}^{t}X_s\,\mathbb{I}\{X_s \le a\}$. Then it holds that

$$\mathbb{P}(S_t > x) \le \mathbb{P}\Big(\max_{1\le s\le t}X_s > a\Big) + \mathbb{P}(\tilde S_t > x).$$

Proof

$$\mathbb{P}(S_t > x) = \mathbb{P}\Big(S_t > x,\ \max_{1\le s\le t}X_s > a\Big) + \mathbb{P}\Big(S_t > x,\ \max_{1\le s\le t}X_s \le a\Big) \le \mathbb{P}\Big(\max_{1\le s\le t}X_s > a\Big) + \mathbb{P}(\tilde S_t > x).$$

Theorem 15 (Azuma's inequality) Assume that $(X_s;\ s\ge 0)$ is a supermartingale and $|X_s - X_{s-1}| \le c_s$ almost surely. Then for all $t > 0$ and all $\epsilon > 0$,

$$\mathbb{P}\big(|X_t - X_0| \ge \epsilon\big) \le 2\exp\left(\frac{-\epsilon^2}{2\sum_{s=1}^{t}c_s^2}\right).$$

Theorem 16 (Self-normalized bound) Let $(\mathcal{F}_k)$ be a filtration, $(m_k;\ k\ge 0)$ be an $\mathbb{R}^d$-valued stochastic process such that $m_k$ is $\mathcal{F}_k$-measurable, and $(\eta_k;\ k\ge 1)$ be a real-valued martingale difference sequence such that $\eta_k$ is $\mathcal{F}_k$-measurable and conditionally sub-Gaussian with constant $R$:

$$\mathbb{E}[\exp(\gamma\eta_k)\mid \mathcal{F}_{k-1}] \le \exp(\gamma^2R^2/2), \qquad \gamma\in\mathbb{R}.$$

Consider the martingale and the matrix-valued processes

$$S_t = \sum_{k=1}^{t}\eta_k m_{k-1}, \qquad V_t = \sum_{k=1}^{t}m_{k-1}m_{k-1}^\top, \qquad \bar V_t = V + V_t,\quad t\ge 0,$$

where $V \succ 0$ is deterministic. Then for any $0 < \delta < 1$, with probability $1-\delta$,

$$\forall t\ge 0, \qquad \|S_t\|_{\bar V_t^{-1}}^2 \le 2R^2\log\left(\frac{\det(\bar V_t)^{1/2}\det(V)^{-1/2}}{\delta}\right).$$
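A one-dimensional Monte Carlo illustration of Theorem 16 (our own sketch, with iid Gaussian $m_k$ and $\eta_k$, which satisfy the measurability and sub-Gaussianity requirements): the uniform-in-time bound should fail on at most a $\delta$ fraction of runs.

```python
import numpy as np

rng = np.random.default_rng(4)
R_sub, v, delta, T = 1.0, 1.0, 0.05, 2000
failures = 0
for trial in range(500):
    m = rng.standard_normal(T)               # predictable covariates m_{k-1}
    eta = rng.standard_normal(T) * R_sub     # sub-Gaussian martingale differences
    S = np.cumsum(m * eta)                   # S_t
    Vbar = v + np.cumsum(m * m)              # Vbar_t = v + sum m_{k-1}^2
    lhs = S**2 / Vbar                        # ||S_t||^2_{Vbar_t^{-1}} in one dimension
    rhs = 2 * R_sub**2 * np.log(np.sqrt(Vbar / v) / delta)
    if np.any(lhs > rhs):                    # the bound should fail w.p. <= delta
        failures += 1
print("empirical failure rate:", failures / 500, "<= delta =", delta)
```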
Appendix B. Controllability and Observability
Definition 1 (Bertsekas (2001)) A pair $(A, B)$, where $A$ is an $n\times n$ matrix and $B$ is an $n\times d$ matrix, is said to be controllable if the $n\times nd$ matrix

$$[B\ \ AB\ \ \cdots\ \ A^{n-1}B]$$

has full rank. A pair $(A, C)$, where $A$ is an $n\times n$ matrix and $C$ is a $d\times n$ matrix, is said to be observable if the pair $(A^\top, C^\top)$ is controllable.
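The rank conditions of Definition 1 translate directly into code; a small sketch (ours):

```python
import numpy as np

def controllable(A, B):
    """Rank test of Definition 1: [B, AB, ..., A^{n-1}B] must have full rank n."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

def observable(A, C):
    """(A, C) is observable iff (A^T, C^T) is controllable."""
    return controllable(A.T, C.T)

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
print(controllable(A, B), observable(A, np.eye(2)))
```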
Appendix C. Proof of Lemma 11

Proof [Proof of Lemma 11] We consider first a simple case. Let $A = B + mm^\top$, with $B$ positive definite. Let $X \ne 0$ be an arbitrary matrix. Using the Cauchy-Schwarz inequality and the fact that for any matrix $M$, $\|M^\top M\| = \|M\|^2$, we get

$$\|X^\top mm^\top X\| = \|m^\top X\|^2 = \|m^\top B^{-1/2}B^{1/2}X\|^2 \le \|m^\top B^{-1/2}\|^2\,\|B^{1/2}X\|^2 = \|m\|_{B^{-1}}^2\,\|X^\top BX\|,$$

and so

$$\|X^\top AX\| = \|X^\top(B + mm^\top)X\| \le \|X^\top BX\| + \|X^\top mm^\top X\| \le \big(1 + \|m\|_{B^{-1}}^2\big)\|X^\top BX\|.$$

We also have that

$$\det(A) = \det(B + mm^\top) = \det(B)\det\big(I + B^{-1/2}m(B^{-1/2}m)^\top\big) = \det(B)\big(1 + \|m\|_{B^{-1}}^2\big),$$

thus finishing the proof of this case. More generally, if $A = B + m_1m_1^\top + \cdots + m_{t-1}m_{t-1}^\top$, then define $V_s = B + m_1m_1^\top + \cdots + m_{s-1}m_{s-1}^\top$ and use

$$\frac{\|X^\top AX\|}{\|X^\top BX\|} = \frac{\|X^\top V_t X\|}{\|X^\top V_{t-1}X\|}\cdot\frac{\|X^\top V_{t-1}X\|}{\|X^\top V_{t-2}X\|}\cdots\frac{\|X^\top V_2X\|}{\|X^\top BX\|}.$$

By the above argument, since all the terms are positive, we get

$$\frac{\|X^\top AX\|}{\|X^\top BX\|} \le \frac{\det(V_t)}{\det(V_{t-1})}\cdot\frac{\det(V_{t-1})}{\det(V_{t-2})}\cdots = \frac{\det(A)}{\det(B)},$$

the desired inequality. Finally, by the SVD, if $C \succeq 0$, then $C$ can be written as the sum of at most $m$ rank-one matrices, finishing the proof for the general case.
Appendix D. Bounding $\|x_t\|$

We show that $E\cap F$ holds with high probability.

Proof [Proof of Lemma 4] Let $M_t = \Theta_* - \tilde\Theta_t$. On event $E$, for any $t\le T$ we have that

$$\mathrm{trace}\left(M_t^\top\Big(\lambda I + \sum_{s=0}^{t-1}z_sz_s^\top\Big)M_t\right) \le \beta_t(\delta/4).$$

Since $\lambda > 0$ we get that

$$\mathrm{trace}\left(\sum_{s=0}^{t-1}M_t^\top z_sz_s^\top M_t\right) \le \beta_t(\delta/4).$$

Since $\lambda_{\max}(M) \le \mathrm{trace}(M)$ for $M\succeq 0$, we get that

$$\sum_{s=0}^{t-1}\big\|M_t^\top z_s\big\|^2 \le \beta_t(\delta/4).$$

Thus, for all $t\ge 1$,

$$\max_{0\le s\le t-1}\big\|M_t^\top z_s\big\| \le \beta_t(\delta/4)^{1/2}. \qquad (15)$$

Choose $H$ as in Section 4.1:

$$H > 16 \vee 4S^2M_0^2, \qquad M_0 = \sup_{Y\ge 1}\frac{nL\sqrt{(n+d)\log\frac{1+TY/\lambda}{\delta}} + \lambda^{1/2}S}{Y}, \qquad (16)$$

and recall that $U_0 = \big(16^{n+d-2}(1\vee S^{2(n+d-2)})\big)^{-1}$ and $U = U_0/H$.

Fix a real number $\epsilon$ with $0 < \epsilon \le 1$, and consider the time horizon $T$. Let $(v, B)$ and $(M, B)$ denote the projections of a vector $v$ and a matrix $M$ onto a subspace $B\subseteq\mathbb{R}^{n+d}$, where the projection of a matrix is done column-wise. Let $B\oplus\{v\}$ be the span of $B$ and $v$, and let $B^\perp$ be the subspace orthogonal to $B$, so that $\mathbb{R}^{n+d} = B\oplus B^\perp$. Define a sequence of subspaces $B_t$ as follows. Set $B_{T+1} = \emptyset$. For $t = T, \dots, 1$, initialize $B_t = B_{t+1}$; then, while $\|(M_t, B_t^\perp)\|_F > (n+d)\epsilon$, choose a column $v$ of $M_t$ such that $\|(v, B_t^\perp)\| > \epsilon$ and update $B_t = B_t\oplus\{v\}$. After finishing with timestep $t$, we thus have $\|(M_t, B_t^\perp)\|_F \le (n+d)\epsilon$.

Let $\mathcal{T}_T$ be the set of timesteps at which the subspace $B_t$ expands. The cardinality of this set, $m$, is at most $n+d$. Denote these timesteps by $t_1 > t_2 > \cdots > t_m$, and let $i(t) = \max\{1\le i\le m : t_i \ge t\}$.

Lemma 17 For any vector $x\in\mathbb{R}^{n+d}$,

$$\|(x, B_t)\|^2 \le \frac{1}{U\,\epsilon^{2(n+d)}}\sum_{i=1}^{i(t)}\big\|M_{t_i}^\top x\big\|^2, \qquad\text{where } U = U_0/H.$$

Proof Let $N = \{v_1, \dots, v_m\}$ be the set of vectors that are added to $B_t$ during the expansion timesteps. By construction, $N$ is a subset of the set of all columns of $M_{t_1}, M_{t_2}, \dots, M_{t_{i(t)}}$. Thus, we have that

$$\sum_{i=1}^{i(t)}\big\|M_{t_i}^\top x\big\|^2 \ge x^\top\big(v_1v_1^\top + \cdots + v_mv_m^\top\big)x = \sum_{k=1}^{m}\langle v_k, x\rangle^2.$$

Thus, in order to prove the statement of the lemma, it is enough to show that

$$\forall x,\ \forall j\in\{1,\dots,m\}: \qquad \sum_{k=1}^{j}\langle v_k, x\rangle^2 \ge \frac{\epsilon^{2j}}{H\,16^{j-2}\,(1\vee S^{2(j-2)})}\,\|(x, B_j)\|^2, \qquad (17)$$

where $B_j = \mathrm{span}(v_1, \dots, v_j)$ for any $1\le j\le m$. We can write $v_k = w_k + u_k$, where $w_k\in B_{k-1}$, $u_k\perp B_{k-1}$, $\|u_k\| > \epsilon$, and $\|v_k\| \le 2S$.

The inequality (17) is proven by induction. First, we prove the induction base for $j = 1$. Without loss of generality, assume that $x = Cv_1$. Then $\langle v_1, x\rangle^2 = C^2\|v_1\|^4$ and $\|(x, B_1)\|^2 = C^2\|v_1\|^2$, so that, using $\|v_1\| > \epsilon$ and the condition $H > 16$,

$$\langle v_1, x\rangle^2 \ge \epsilon^2\,\|(x, B_1)\|^2 \ge \frac{\epsilon^2}{H\,16^{-1}\,(1\vee S^{-2})}\,\|(x, B_1)\|^2,$$

which establishes the base of the induction.

Figure 1: The geometry used in the inductive step. $v = v_{l+1}$ and $B = B_l$.

Next, we prove that if the inequality (17) holds for $j = l$, then it also holds for $j = l+1$. Figure 1 contains all relevant quantities used in the following argument. Assume that (17) holds for $j = l$. Without loss of generality, assume that $x$ is in $B_{l+1}$, and thus $\|x\| = \|(x, B_{l+1})\|$. Let $P$ be the 2-dimensional subspace that passes through $x$ and $v_{l+1}$. The 2-dimensional subspace $P$ and the $l$-dimensional subspace $B_l$ can, respectively, be identified by $l-1$ equations and one equation in $B_{l+1}$. Because $P$ is not a subset of $B_l$, the intersection of $P$ and $B_l$ is a line in $B_{l+1}$; let us call this line $L$. The line $L$ creates two half-planes on $P$. Without loss of generality, assume that $x$ and $v_{l+1}$ are on the same half-plane (notice that we can always replace $x$ by $-x$ in (17)). Let $0\le\eta\le\pi/2$ be the angle between $v_{l+1}$ and $L$, and let $0\le\psi<\pi/2$ be the orthogonal angle between $v_{l+1}$ and $B_l$. We know that $\eta\ge\psi$, that $u_{l+1}$ is orthogonal to $B_l$, and that $\|u_{l+1}\| > \epsilon$; thus $\psi \ge \arcsin(\epsilon/\|v_{l+1}\|)$. Let $\varphi\ge 0$ be the angle between $x$ and $L$ ($\varphi < \pi$, because $x$ and $v_{l+1}$ are on the same half-plane), with the direction of $\varphi$ chosen consistently with the direction of $\eta$. Finally, let $0\le\theta\le\pi/2$ be the orthogonal angle between $x$ and $B_l$.

By the induction assumption,

$$\sum_{k=1}^{l+1}\langle v_k, x\rangle^2 = \langle v_{l+1}, x\rangle^2 + \sum_{k=1}^{l}\langle v_k, x\rangle^2 \ge \langle v_{l+1}, x\rangle^2 + \frac{\epsilon^{2l}}{H\,16^{l-2}\,(1\vee S^{2(l-2)})}\,\|(x, B_l)\|^2.$$

If $\varphi < \eta/2 + \pi/2$ or $\varphi > \eta/2 + 3\pi/2$, then

$$|\langle v_{l+1}, x\rangle| = \big|\,\|v_{l+1}\|\,\|x\|\cos\angle(v_{l+1}, x)\,\big| \ge \|v_{l+1}\|\,\|x\|\sin\frac{\psi}{2} \ge \frac{\epsilon}{2}\,\|x\|,$$

using $\sin(\psi/2) \ge \sin(\psi)/2 \ge \epsilon/(2\|v_{l+1}\|)$. Thus,

$$\langle v_{l+1}, x\rangle^2 \ge \frac{\epsilon^2}{4}\,\|x\|^2 \ge \frac{\epsilon^{2(l+1)}}{H\,16^{l-1}\,(1\vee S^{2(l-1)})}\,\|(x, B_{l+1})\|^2,$$

where we use $0 < \epsilon \le 1$, $H > 16$, and $x\in B_{l+1}$ in the last inequality.

If $\eta/2 + \pi/2 \le \varphi \le \eta/2 + 3\pi/2$, then $\theta \le \pi/2 - \psi/2$, and thus

$$\|(x, B_l)\| = \|x\|\,|\cos\theta| \ge \|x\|\sin\frac{\psi}{2} \ge \frac{\epsilon}{4S}\,\|x\|,$$

using $\|v_{l+1}\| \le 2S$. Combining this with the induction assumption,

$$\sum_{k=1}^{l+1}\langle v_k, x\rangle^2 \ge \frac{\epsilon^{2l}}{H\,16^{l-2}\,(1\vee S^{2(l-2)})}\cdot\frac{\epsilon^2}{16S^2}\,\|x\|^2 \ge \frac{\epsilon^{2(l+1)}}{H\,16^{l-1}\,(1\vee S^{2(l-1)})}\,\|(x, B_{l+1})\|^2,$$

which finishes the proof.

Next we show that $\|M_t^\top z_t\|$ is well controlled except when $t\in\mathcal{T}_T$.

Lemma 18 We have that, for any $0\le t\le T$,

$$\max_{s\le t,\ s\notin\mathcal{T}_t}\big\|M_s^\top z_s\big\| \le G\,Z_t^{\frac{n+d}{n+d+1}}\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}},$$

where

$$G = 2\left(\frac{2S(n+d)^{n+d+1/2}}{U^{1/2}}\right)^{\frac{1}{n+d+1}} \qquad\text{and}\qquad Z_t = \max_{s\le t}\|z_s\|.$$

Proof From Lemma 17 we get that

$$\|(z_s, B_s)\| \le \sqrt{\frac{i(s)}{U\,\epsilon^{2(n+d)}}}\,\max_{1\le i\le i(s)}\big\|M_{t_i}^\top z_s\big\| \le \sqrt{\frac{n+d}{U}}\,\frac{1}{\epsilon^{n+d}}\,\max_{1\le i\le i(s)}\big\|M_{t_i}^\top z_s\big\|. \qquad (18)$$

Now we can write

$$M_s^\top z_s = M_s^\top\big((z_s, B_s) + (z_s, B_s^\perp)\big) = M_s^\top(z_s, B_s) + \big((M_s, B_s) + (M_s, B_s^\perp)\big)^\top(z_s, B_s^\perp) = M_s^\top(z_s, B_s) + (M_s, B_s^\perp)^\top(z_s, B_s^\perp),$$

because $(M_s, B_s)^\top(z_s, B_s^\perp) = 0$. Thus, using $\|M_s\| \le 2S$, $\|(M_s, B_s^\perp)\|_F \le (n+d)\epsilon$ (by the construction of $B_s$) and (18),

$$\big\|M_s^\top z_s\big\| \le (n+d)\,\epsilon\,Z_t + 2S\sqrt{\frac{n+d}{U}}\,\frac{1}{\epsilon^{n+d}}\,\max_{1\le i\le i(s)}\big\|M_{t_i}^\top z_s\big\|.$$

For $s\notin\mathcal{T}_t$ and $1\le i\le i(s)$, we have $s < t_i$, so $z_s$ is one of the covariates underlying $V_{t_i}$, and (15) implies that $\|M_{t_i}^\top z_s\| \le \beta_{t_i}(\delta/4)^{1/2} \le \beta_t(\delta/4)^{1/2}$. Thus,

$$\max_{s\le t,\ s\notin\mathcal{T}_t}\big\|M_s^\top z_s\big\| \le (n+d)\,\epsilon\,Z_t + 2S\sqrt{\frac{n+d}{U}}\,\frac{1}{\epsilon^{n+d}}\,\beta_t(\delta/4)^{1/2}.$$

Now, if we choose

$$\epsilon = \left(\frac{2S\,\beta_t(\delta/4)^{1/2}}{Z_t\,(n+d)^{1/2}\,U^{1/2}}\right)^{\frac{1}{n+d+1}},$$

the two terms on the right-hand side are equal and we get

$$\max_{s\le t,\ s\notin\mathcal{T}_t}\big\|M_s^\top z_s\big\| \le 2(n+d)\,Z_t\,\epsilon = G\,Z_t^{\frac{n+d}{n+d+1}}\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}}.$$

Finally, from the choice of $H$ in (16), one can verify that this choice of $\epsilon$ satisfies $\epsilon \le 1$.

We can write the state update as

$$x_{t+1} = \Gamma_t x_t + r_{t+1}, \qquad \Gamma_t = \begin{cases}\tilde A_t + \tilde B_t K(\tilde\Theta_t), & t\notin\mathcal{T}_T,\\ A_* + B_*K(\tilde\Theta_t), & t\in\mathcal{T}_T,\end{cases} \qquad r_{t+1} = \begin{cases} M_t^\top z_t + w_{t+1}, & t\notin\mathcal{T}_T,\\ w_{t+1}, & t\in\mathcal{T}_T.\end{cases}$$

Hence, using $x_0 = 0$, we can write

$$x_t = \Gamma_{t-1}x_{t-1} + r_t = \Gamma_{t-1}\Gamma_{t-2}x_{t-2} + r_t + \Gamma_{t-1}r_{t-1} = \cdots = \sum_{k=1}^{t}\left(\prod_{s=k}^{t-1}\Gamma_s\right)r_k.$$

From Section 4.1 we have that $\|\tilde A_t + \tilde B_t K(\tilde\Theta_t)\| \le \rho < 1$ for $t\notin\mathcal{T}_T$ (Assumption A4) and $\|A_* + B_*K(\tilde\Theta_t)\| \le \Xi$ for $t\in\mathcal{T}_T$. Since $|\mathcal{T}_T| \le n+d$, at most $n+d$ of the factors in each product are bounded by $\Xi$ rather than $\rho$, so that

$$\left\|\prod_{s=k}^{t-1}\Gamma_s\right\| \le \Xi^{n+d}\,\rho^{(t-k)-(n+d)},$$

and hence, summing the geometric series,

$$\|x_t\| \le \frac{\Xi^{n+d}}{1-\rho}\,\max_{0\le k\le t-1}\|r_{k+1}\|.$$

Now, $\|r_{k+1}\| \le \|M_k^\top z_k\| + \|w_{k+1}\|$ when $k\notin\mathcal{T}_T$, and $\|r_{k+1}\| = \|w_{k+1}\|$ otherwise. Hence,

$$\|x_t\| \le \frac{\Xi^{n+d}}{1-\rho}\left(\max_{k<t,\ k\notin\mathcal{T}_t}\big\|M_k^\top z_k\big\| + \max_{k<t}\|w_{k+1}\|\right).$$

The first term can be bounded by Lemma 18. The second term can be bounded as follows: from the sub-Gaussianity Assumption A1, for any index $1\le i\le n$ and any time $k\le t$, with probability $1-\delta/(4nt(t+1))$,

$$|w_{k,i}| \le L\sqrt{2\log\frac{4nt(t+1)}{\delta}}.$$

As a result, with a union bound argument, on some event $H'$ with $\mathbb{P}(H') \ge 1-\delta/4$, we have $\|w_k\| \le 2L\sqrt{n\log\frac{4nt(t+1)}{\delta}}$ for all $k\le t$. Thus, on $H'\cap E$,

$$\|x_t\| \le \frac{\Xi^{n+d}}{1-\rho}\left(G\,Z_T^{\frac{n+d}{n+d+1}}\,\beta_t(\delta/4)^{\frac{1}{2(n+d+1)}} + 2L\sqrt{n\log\frac{4nt(t+1)}{\delta}}\right) = \upsilon_t.$$

By the definition of $F$, $H'\cap E \subseteq F\cap E$. Since, by the union bound, $\mathbb{P}(H'\cap E) \ge 1-\delta/2$, we also have $\mathbb{P}(E\cap F) \ge 1-\delta/2$, finishing the proof.