Lecture 1 – Lecture 6
Lecture 1: MDPs & Bellman Equations
Dynamics of the Environment (Transition Probability)
Figure 1: Agent-interaction interface
Conditional probability
\[
p(s', r \mid s, a) \doteq \Pr[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a]
\]
State-Transition Probabilities
\[
p(s' \mid s, a) \doteq \Pr[S_{t+1} = s' \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)
\]
Expected Rewards for State-Action Pairs
\[
r(s, a) \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
\]
Expected Rewards for State-Action-NextState Triplets
\[
r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
\]
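The three derived quantities above all come from marginalizing the four-argument dynamics p(s′, r | s, a). A minimal Python sketch follows, assuming the dynamics for one fixed state-action pair are stored as a dictionary; the state names, rewards, probabilities, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical dynamics for one fixed pair (s, a):
# keys are (next_state, reward), values are p(s', r | s, a).
dynamics = {
    ("s0", 0.0): 0.2,
    ("s1", 1.0): 0.5,
    ("s1", 5.0): 0.3,
}

def state_transition_probs(dyn):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    probs = defaultdict(float)
    for (s_next, _reward), p in dyn.items():
        probs[s_next] += p
    return dict(probs)

def expected_reward(dyn):
    """r(s, a) = sum over r and s' of r * p(s', r | s, a)."""
    return sum(reward * p for (_s_next, reward), p in dyn.items())

def expected_reward_given_next(dyn, s_next):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    p_next = state_transition_probs(dyn)[s_next]
    numerator = sum(reward * p for (s2, reward), p in dyn.items() if s2 == s_next)
    return numerator / p_next

p_next = state_transition_probs(dynamics)
r_sa = expected_reward(dynamics)
r_sas = expected_reward_given_next(dynamics, "s1")
```

Storing only p(s′, r | s, a) and deriving the rest keeps the three quantities consistent with each other by construction.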
Episodic task
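\[
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots = \sum_{k=0}^{\infty} R_{t+k+1}
\]
Rewards after the terminal time step T are taken to be zero, so only finitely many terms of the sum are nonzero.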
Discounted Return
\[
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
\]
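As a small illustration of the return definitions, the sketch below computes G_t for a finite list of rewards. It assumes the episode has ended after the last listed reward, and the function name and sample numbers are hypothetical. Working backwards through the list avoids computing powers of γ explicitly.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma**k * R_{t+k+1} for rewards = [R_{t+1}, R_{t+2}, ...].

    Assumes the reward list is finite (all later rewards are zero).
    """
    g = 0.0
    # Accumulate from the last reward to the first: g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 2.0, 3.0]
g_discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5*2 + 0.25*3
g_undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards
```

Setting gamma to 1 recovers the undiscounted return of the episodic case.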
Policy Definition
\[
\pi(a \mid s) \doteq \Pr[A_t = a \mid S_t = s]
\]
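State-Value Function (vπ)
\[
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \forall s \in \mathcal{S}
\]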
Action-Value Function (qπ)
\[
q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a], \quad \forall s \in \mathcal{S}, \ \forall a \in \mathcal{A}(s)
\]
Bellman Equation for State-Value Function (Derivation)
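\[
\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]
\end{aligned}
\]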
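The last line of the derivation can be used as an update rule: applying its right-hand side repeatedly to an arbitrary initial guess converges to vπ when γ < 1 (this is commonly called iterative policy evaluation, which the notes above do not name). The sketch below assumes a hypothetical two-state, two-action MDP with a uniform random policy; all numbers and names are illustrative.

```python
# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
states, actions = [0, 1], [0, 1]
policy = {s: {a: 0.5 for a in actions} for s in states}  # pi(a | s), uniform
gamma = 0.9

def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        policy[s][a]
        * sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
        for a in actions
    )

def evaluate_policy(tol=1e-10):
    """Apply the Bellman backup to every state until values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        new_v = {s: bellman_backup(v, s) for s in states}
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v

v_pi = evaluate_policy()
```

For this toy MDP the fixed point can also be found by hand from the two linear equations, giving vπ(0) = 7.25 and vπ(1) = 7.75.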
Bellman Equation for Action-Value Function (Derivation)
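\[
\begin{aligned}
q_\pi(s, a) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s', A_{t+1} = a'] \Big] \\
&= \sum_{s', r} p(s', r \mid s, a) \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \big[ r + \gamma q_\pi(s', a') \big]
\end{aligned}
\]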
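The same fixed-point idea applies to qπ: the final line of the derivation can be iterated over all state-action pairs. The sketch below assumes a hypothetical two-state, two-action MDP with a uniform random policy (all values are illustrative), and then checks the relation vπ(s) = Σ_a π(a|s) qπ(s, a) that links the two value functions.

```python
# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
states, actions = [0, 1], [0, 1]
policy = {s: {a: 0.5 for a in actions} for s in states}  # pi(a | s), uniform
gamma = 0.9

def q_backup(q, s, a):
    """Right-hand side of the Bellman equation for q_pi at (s, a)."""
    return sum(
        p * sum(
            policy[s_next][a_next] * (r + gamma * q[(s_next, a_next)])
            for a_next in actions
        )
        for p, s_next, r in dynamics[(s, a)]
    )

def evaluate_q(tol=1e-10):
    """Iterate the q_pi backup over all pairs until values stop changing."""
    q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        new_q = {(s, a): q_backup(q, s, a) for s in states for a in actions}
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q

q_pi = evaluate_q()
# Averaging q_pi over the policy recovers the state values.
v_from_q = {s: sum(policy[s][a] * q_pi[(s, a)] for a in actions) for s in states}
```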
Policy Comparison (Condition for a "Better" Policy)
\[
\pi \ge \tilde{\pi} \iff v_\pi(s) \ge v_{\tilde{\pi}}(s), \quad \forall s \in \mathcal{S}
\]
Optimal State-Value Function (v∗)
\[
v_*(s) \doteq \max_{\pi} v_\pi(s), \quad \forall s \in \mathcal{S}
\]
Optimal Action-Value Function (q∗)
\[
q_*(s, a) \doteq \max_{\pi} q_\pi(s, a), \quad \forall s \in \mathcal{S}, \ \forall a \in \mathcal{A}(s)
\]
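Bellman Optimality Equation for State-Value Function (v∗) (Derivation)
\[
\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_*(s, a) \\
&= \max_{a} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
\end{aligned}
\]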
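Because the Bellman optimality equation contains a max, it is not linear in v∗, but its right-hand side can still be applied repeatedly as an update (commonly called value iteration, a technique the notes above do not name). The sketch below assumes a hypothetical two-state, two-action MDP; all numbers and names are illustrative.

```python
# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
states, actions = [0, 1], [0, 1]
gamma = 0.9

def optimality_backup(v, s):
    """max over a of sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return max(
        sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
        for a in actions
    )

def value_iteration(tol=1e-10):
    """Iterate the optimality backup until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        new_v = {s: optimality_backup(v, s) for s in states}
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v

v_star = value_iteration()
```

In this toy MDP the best behaviour is to move to state 1 and stay there collecting reward 2, so v∗(1) = 2 / (1 − 0.9) = 20 and v∗(0) = 1 + 0.9 · 20 = 19.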
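Bellman Optimality Equation for Action-Value Function (q∗) (Derivation)
\[
\begin{aligned}
q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t = s, A_t = a\big] \\
&= \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]
\end{aligned}
\]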
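Once q∗ is available, an optimal action in each state can be read off directly as the maximizing action, with no further lookahead through the dynamics. The sketch below iterates the q∗ equation as an update and then extracts a greedy policy; it assumes a hypothetical two-state, two-action MDP, and all numbers and names are illustrative.

```python
# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
states, actions = [0, 1], [0, 1]
gamma = 0.9

def q_optimality_backup(q, s, a):
    """sum_{s', r} p(s', r | s, a) * (r + gamma * max_a' q(s', a'))."""
    return sum(
        p * (r + gamma * max(q[(s_next, a_next)] for a_next in actions))
        for p, s_next, r in dynamics[(s, a)]
    )

def q_value_iteration(tol=1e-10):
    """Iterate the q* backup over all pairs until values stop changing."""
    q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        new_q = {(s, a): q_optimality_backup(q, s, a) for s in states for a in actions}
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q

q_star = q_value_iteration()
# Greedy policy: in each state pick the action with the largest q* value.
greedy_policy = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
```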