# solution1

CS 234 Spring 2017
Assignment 1 Solutions

## 1 Bellman Operator Properties
(a) For all $s$:

$$
\begin{aligned}
(T_\pi V)(s) &= \mathbb{E}_\pi \Big[ R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V(s') \Big] \\
&\le \mathbb{E}_\pi \Big[ R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V'(s') \Big] \\
&= (T_\pi V')(s).
\end{aligned}
$$

So,

$$
T_\pi V \le T_\pi V'. \tag{1}
$$

Let $\pi_1$ and $\pi_2$ be the greedy policies with respect to $V$ and $V'$ respectively, so that

$$
T_{\pi_1} V = T V, \qquad T_{\pi_2} V' = T V'.
$$

As a result:

$$
T V = T_{\pi_1} V \le T_{\pi_1} V',
$$

where the inequality follows from (1). Further, since $\pi_2$ is greedy with respect to $V'$,

$$
T_{\pi_1} V' \le T_{\pi_2} V' = T V'.
$$

Hence,

$$
T V \le T V'. \tag{2}
$$
(b) For the $T_\pi$ operator,

$$
\begin{aligned}
\big(T_\pi (V + c\vec{1})\big)(s) &= \mathbb{E}_\pi \Big[ R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \big( V(s') + c \big) \Big] \\
&= \mathbb{E}_\pi \Big[ R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V(s') + \gamma c \sum_{s' \in S} P(s' \mid s, \pi(s)) \Big] \\
&= \mathbb{E}_\pi \Big[ R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V(s') + \gamma c \Big] \\
&= (T_\pi V)(s) + \gamma c.
\end{aligned}
$$

For the $T$ operator, the argument is the same, since the constant $c$ does not affect the max operation.
(c) Let $\pi_1$ be the greedy policy with respect to $\alpha V + \beta V'$. Hence, for all $s$:

$$
\begin{aligned}
T(\alpha V + \beta V')(s) &= T_{\pi_1}(\alpha V + \beta V')(s) \\
&= \mathbb{E}_{\pi_1} \Big[ R(s, \pi_1(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_1(s)) \big( \alpha V(s') + \beta V'(s') \big) \Big] \\
&= \mathbb{E}_{\pi_1} \Big[ \alpha R(s, \pi_1(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_1(s)) \, \alpha V(s') \Big]
+ \mathbb{E}_{\pi_1} \Big[ \beta R(s, \pi_1(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_1(s)) \, \beta V'(s') \Big] \\
&= (\alpha T_{\pi_1} V)(s) + (\beta T_{\pi_1} V')(s) \\
&\le (\alpha T V)(s) + (\beta T V')(s).
\end{aligned}
$$

Note that splitting $R$ into $\alpha R + \beta R$ uses $\alpha + \beta = 1$, and the final inequality uses $\alpha, \beta \ge 0$.
(d) The statement is false. Consider the counterexample below: let $V = \vec{0}$ and $R(s, a) > 0$ for all $s, a$. This results in $T V > V$.
## 2 Value Iteration
(a) Yes, it is possible that the policy changes again with further iterations of value iteration. Consider the following example. There are three states $\{0, 1, 2\}$. In each state, the available actions are indicated by outgoing edges, and each action is denoted by the next state it leads to. The corresponding rewards are indicated on the edges. The state transitions are deterministic, that is, taking an action always leads to the corresponding state in the next time step. Assume $\gamma = 1$, and note that $\epsilon \in (0, 1)$ is some small constant.

Over the iterations of value iteration, the value function can be computed as follows:

| Iteration | $V(0)$ | $V(1)$ | $V(2)$ |
|---|---|---|---|
| 0 | $0$ | $0$ | $0$ |
| 1 | $1+\epsilon$ | $1$ | $\infty$ |
| 2 | $2(1+\epsilon)$ | $\infty$ | $\infty$ |
| 3 | $\infty$ | $\infty$ | $\infty$ |

Over the iterations of value iteration, the current best policy can be computed as follows:

| Iteration | $\pi(0)$ | $\pi(1)$ | $\pi(2)$ |
|---|---|---|---|
| 0 | Arb. | Arb. | Arb. |
| 1 | 0 | 2 | 2 |
| 2 | 0 | 2 | 2 |
| 3 | 1 | 2 | 2 |

In particular, $\pi(0)$ is unchanged for iterations 1 and 2 and then changes again at iteration 3.
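Since the figure defining the example MDP is not reproduced above, the sketch below uses a different, entirely hypothetical chain MDP (my own construction, with $\gamma = 0.9$ rather than $1$) that exhibits the same phenomenon: the greedy policy at state 0 is stable for two sweeps of value iteration and then changes on the third, once the value of the long path has propagated back.

```python
# Value iteration on a small hypothetical chain MDP (not the one in the
# original figure): the greedy action at state 0 stays the same for two
# sweeps and then changes on the third.
GAMMA = 0.9

# transitions[state][action] = (next_state, reward); 'T' is an absorbing terminal.
transitions = {
    0: {'exit': ('T', 1.0), 'go': (1, 0.0)},  # small reward now vs. a longer path
    1: {'go': (2, 0.0)},
    2: {'go': (3, 0.0)},
    3: {'go': ('T', 10.0)},                   # big reward at the end of the chain
}

V = {s: 0.0 for s in transitions}
V['T'] = 0.0

def greedy_action(state, V):
    # argmax_a [ R(s, a) + GAMMA * V(next state) ]
    return max(transitions[state],
               key=lambda a: transitions[state][a][1] + GAMMA * V[transitions[state][a][0]])

policies = []
for _ in range(3):
    # One synchronous sweep of value iteration.
    newV = {s: max(r + GAMMA * V[ns] for ns, r in transitions[s].values())
            for s in transitions}
    newV['T'] = 0.0
    V = newV
    policies.append(greedy_action(0, V))

print(policies)  # → ['exit', 'exit', 'go']: stable for two sweeps, then changes
```

After one sweep only the immediate rewards are visible, so `exit` looks best from state 0; the reward at the end of the chain needs three sweeps to propagate back, at which point the greedy policy changes again.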
(b) Note the optimal value function $V^*$ is such that

$$
V^*(s) = \max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \Big\}, \quad \forall s \in S.
$$

The greedy policy for $\tilde V$ is deterministic and is defined by

$$
\pi_{\tilde V}(s) = \arg\max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \tilde V(s') \Big\}, \quad \forall s \in S,
$$

and the corresponding value function is such that

$$
V_{\pi_{\tilde V}}(s) = R(s, \pi_{\tilde V}(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) V_{\pi_{\tilde V}}(s'), \quad \forall s \in S.
$$

We first find relations on $\{L_{\tilde V}(s)\}_{s \in S}$, where $L_{\tilde V}(s) = V^*(s) - V_{\pi_{\tilde V}}(s)$ is the loss of the greedy policy. For any $s \in S$,

$$
\begin{aligned}
V_{\pi_{\tilde V}}(s) &= R(s, \pi_{\tilde V}(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) V_{\pi_{\tilde V}}(s') \\
&= \max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \tilde V(s') \Big\}
+ \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) \big( V_{\pi_{\tilde V}}(s') - \tilde V(s') \big),
&& \text{by definition of } \pi_{\tilde V} \\
&\ge \max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \big( V^*(s') - \epsilon \big) \Big\}
+ \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) \big( V_{\pi_{\tilde V}}(s') - (V^*(s') + \epsilon) \big),
&& \text{since } |V^*(s) - \tilde V(s)| \le \epsilon, \ \forall s \\
&= \max_a \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \Big\}
+ \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) \big( V_{\pi_{\tilde V}}(s') - V^*(s') \big) - 2\gamma\epsilon,
&& \text{since } \textstyle\sum_{s' \in S} P(s' \mid s, a) = 1, \ \forall s, a \\
&= V^*(s) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) \big( V_{\pi_{\tilde V}}(s') - V^*(s') \big) - 2\gamma\epsilon,
&& \text{by definition of } V^*.
\end{aligned}
$$

Then, we can rewrite

$$
\gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) \big( V^*(s') - V_{\pi_{\tilde V}}(s') \big) + 2\gamma\epsilon \ge V^*(s) - V_{\pi_{\tilde V}}(s).
$$

Equivalently, for any $s \in S$,

$$
\gamma \sum_{s' \in S} P(s' \mid s, \pi_{\tilde V}(s)) L_{\tilde V}(s') + 2\gamma\epsilon \ge L_{\tilde V}(s). \tag{3}
$$

We now show $L_{\tilde V}(s) \le \frac{2\gamma\epsilon}{1-\gamma}$ for all $s \in S$. Let $M = \max_s L_{\tilde V}(s)$ and let the max be achieved at state $s^*$, i.e., $L_{\tilde V}(s^*) = M$. From (3),

$$
\begin{aligned}
L_{\tilde V}(s^*) &\le \gamma \sum_{s' \in S} P(s' \mid s^*, \pi_{\tilde V}(s^*)) L_{\tilde V}(s') + 2\gamma\epsilon \\
&\le \gamma \sum_{s' \in S} P(s' \mid s^*, \pi_{\tilde V}(s^*)) M + 2\gamma\epsilon \\
&= \gamma M + 2\gamma\epsilon.
\end{aligned}
$$

Then $M \le \gamma M + 2\gamma\epsilon$, or $M \le \frac{2\gamma\epsilon}{1-\gamma}$. It follows that $L_{\tilde V}(s) \le \frac{2\gamma\epsilon}{1-\gamma}$ for all $s \in S$.
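The $\frac{2\gamma\epsilon}{1-\gamma}$ bound can be checked empirically. The sketch below (array layout and names are my own, not from the assignment) computes $V^*$ on a random MDP, perturbs it by noise bounded by $\epsilon$, evaluates the greedy policy for the perturbed values exactly via a linear solve, and verifies the loss stays within the bound.

```python
# Empirical check of L(s) <= 2*gamma*eps/(1-gamma) on a random MDP (sketch).
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, eps = 6, 3, 0.9, 0.05

# P[a, s, s'] transition probabilities, R[s, a] rewards.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def bellman_opt(V):
    # (TV)(s) = max_a R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    return (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)

# Solve for V* by value iteration to high precision.
V_star = np.zeros(nS)
for _ in range(3000):
    V_star = bellman_opt(V_star)

V_tilde = V_star + rng.uniform(-eps, eps, size=nS)   # |V*(s) - V~(s)| <= eps

# Greedy policy w.r.t. V~, then exact policy evaluation:
# V_pi = (I - gamma * P_pi)^{-1} R_pi.
Q = R + gamma * np.einsum('ast,t->sa', P, V_tilde)
pi = Q.argmax(axis=1)
P_pi = P[pi, np.arange(nS), :]
R_pi = R[np.arange(nS), pi]
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

loss = np.max(V_star - V_pi)
assert loss <= 2 * gamma * eps / (1 - gamma) + 1e-8
```

The theoretical bound here is $2 \cdot 0.9 \cdot 0.05 / 0.1 = 0.9$; in practice the observed loss is usually far smaller, since the bound is worst-case.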
## 3 Grid Policies
(a) Let all rewards be −1.
(b)

```
−4  −3  −2  −1   0
−5   4   3   2   1
 4   5   4   3   2
−3  −2  −1   0   1
−4  −3  −2  −1   0
```
(c) No. Changing $\gamma$ changes the value function, but not the relative ordering of actions at each state, so the optimal policy stays the same.
(d) The value function changes (the value of every state increases by 15), but the policy does not.
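The observation in (d) is closely related to a general fact for discounted MDPs with $\gamma < 1$: adding a constant $c$ to every reward adds $c/(1-\gamma)$ to every state's optimal value and leaves the greedy policy unchanged. A quick check on a random MDP (illustrative setup, not the grid problem's own numbers):

```python
# Shifting every reward by a constant shifts all optimal values by c/(1-gamma)
# and leaves the optimal policy unchanged (random-MDP sketch).
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, c = 5, 3, 0.8, 15.0

P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def solve(R):
    # Value iteration to (numerical) convergence, then the greedy policy.
    V = np.zeros(nS)
    for _ in range(2000):
        V = (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return V, Q.argmax(axis=1)

V1, pi1 = solve(R)
V2, pi2 = solve(R + c)       # every reward shifted by the same constant

assert np.allclose(V2, V1 + c / (1 - gamma), atol=1e-6)
assert np.all(pi1 == pi2)    # greedy policy is invariant to the shift
```

This is the same mechanism as in part 1(b): the shift adds the same constant to every Q-value, so the argmax is untouched.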
## 4 Frozen Lake MDP
(c) Stochasticity generally increases the number of iterations required to converge. In the stochastic frozen lake environment, value iteration takes more iterations to converge. For policy iteration, depending on the implementation, the number of iterations may remain unchanged, or policy iteration might not converge at all. Stochasticity also changes the optimal policy: in this environment, the optimal policy for the stochastic frozen lake differs from that of the deterministic frozen lake.
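The effect of stochasticity on convergence speed can be illustrated on a toy "slippery corridor" (a stand-in for FrozenLake, not the actual environment): with deterministic moves, value iteration converges exactly after a handful of sweeps, while with slipping the values only converge geometrically.

```python
# Count value-iteration sweeps until the value function stops changing,
# for a 5-state corridor with and without slipping (hypothetical setup).
import numpy as np

nS, gamma, tol = 5, 0.9, 1e-8
GOAL = nS - 1                     # absorbing goal; stepping into it pays reward 1

def sweeps_to_converge(slip):
    P = np.zeros((2, nS, nS))     # P[a, s, s'], actions: 0 = left, 1 = right
    for s in range(nS):
        if s == GOAL:
            P[:, s, s] = 1.0      # absorbing, no further reward
            continue
        for a, d in ((0, -1), (1, +1)):
            intended = min(max(s + d, 0), nS - 1)
            slipped = min(max(s - d, 0), nS - 1)
            P[a, s, intended] += 1.0 - slip
            P[a, s, slipped] += slip   # with probability `slip`, move the other way
    R = P[:, :, GOAL].T.copy()    # expected reward = prob. of stepping into the goal
    R[GOAL, :] = 0.0
    V = np.zeros(nS)
    for k in range(1, 10000):
        newV = (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)
        if np.max(np.abs(newV - V)) < tol:
            return k
        V = newV
    return None

det = sweeps_to_converge(0.0)
stoch = sweeps_to_converge(1.0 / 3.0)
assert stoch > det                # slipping slows value iteration's convergence
```

In the deterministic case the values stop changing once returns have propagated across the corridor, whereas in the slippery case the per-sweep error only shrinks by roughly a factor of $\gamma$, so many more sweeps are needed.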
## 5 Frozen Lake Reinforcement Learning
(d) Q-learning is less computationally intensive, since it only needs to update a single Q-value per update. Model-based learning, on the other hand, must keep track of all the counts and estimate the model parameters. However, model-based learning uses the data more efficiently, so it may require less experience to learn a good policy (provided the model space is not prohibitively large).
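The per-step cost comparison can be made concrete with a tabular sketch (hypothetical names, not the assignment's starter code): a Q-learning step touches one table entry, while a model-based learner maintains transition counts and reward sums from which the model is re-estimated before planning.

```python
# Per-transition bookkeeping: Q-learning vs. model-based learning (sketch).
from collections import defaultdict

alpha, gamma = 0.1, 0.99

# --- Q-learning: one entry touched per transition ---
Q = defaultdict(float)

def q_learning_step(s, a, r, s_next, actions):
    # O(|A|) work: a max over next-state actions, then one table entry updated.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# --- Model-based learning: counts and reward sums for every (s, a, s') ---
n_sa = defaultdict(int)
n_sas = defaultdict(int)
r_sum = defaultdict(float)

def model_step(s, a, r, s_next):
    n_sa[(s, a)] += 1
    n_sas[(s, a, s_next)] += 1
    r_sum[(s, a)] += r

def estimate(s, a, s_next):
    # Maximum-likelihood estimates P_hat(s'|s,a) and R_hat(s,a) from the counts;
    # a planner (e.g. value iteration) would then be run on this estimated model.
    return n_sas[(s, a, s_next)] / n_sa[(s, a)], r_sum[(s, a)] / n_sa[(s, a)]

# Feed the same transition to both learners:
q_learning_step(0, 'right', 1.0, 1, actions=['left', 'right'])
model_step(0, 'right', 1.0, 1)
print(Q[(0, 'right')], estimate(0, 'right', 1))  # → 0.1 (1.0, 1.0)
```

The Q-learning step is cheap but discards the transition after one update; the model-based learner pays for the bookkeeping (and a planning pass) but can reuse every observed transition through its estimated model.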