UNIT – 4: SEMI-SUPERVISED LEARNING, REINFORCEMENT LEARNING
MARKOV DECISION PROCESS
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the
environment is completely observable, then its dynamics can be modeled as a Markov Process. In an
MDP, the agent constantly interacts with the environment and performs actions; after each action, the
environment responds and generates a new state.
An MDP is used to describe the environment for RL, and almost all RL problems can be
formalized using an MDP. An MDP is a tuple of four elements (S, A, Pa, Ra), sketched in code after this list:
➢ A finite set of states S
➢ A finite set of actions A
➢ A reward Ra(s, s') received after transitioning from state s to state s' due to action a
➢ A transition probability Pa(s, s'), the probability that action a taken in state s leads to state s'
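As a concrete illustration, the following Python sketch stores this four-element tuple as plain dictionaries; the three states, two actions, probabilities, and reward values here are invented placeholders, not part of the notes.

    # A minimal sketch of a finite MDP (S, A, Pa, Ra); all values are illustrative.
    states = ["s1", "s2", "s3"]
    actions = ["left", "right"]

    # P[(s, a)] maps each possible next state s' to its probability Pa(s, s').
    P = {
        ("s1", "right"): {"s2": 1.0},
        ("s2", "right"): {"s3": 0.8, "s1": 0.2},
        ("s2", "left"):  {"s1": 1.0},
        ("s3", "left"):  {"s2": 1.0},
    }

    # R[(s, a, s')] is the reward Ra(s, s'); transitions not listed give reward 0.
    R = {("s2", "right", "s3"): 1.0}

    def reward(s, a, s_next):
        return R.get((s, a, s_next), 0.0)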
Markov Property: It states that "if the agent is present in the current state s1, performs an action a1,
and moves to the state s2, then the state transition from s1 to s2 depends only on the current state;
future actions and states do not depend on past actions, rewards, or states."
In other words, as per the Markov Property, the current state transition does not depend on any past
action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a Chess
game, the players only focus on the current board position and do not need to remember past actions or states.
Finite MDP: A finite MDP is one in which the sets of states, rewards, and actions are all finite. In RL,
we consider only finite MDPs.
Markov Process: A Markov Process is a memoryless process with a sequence of random states S1, S2,
..., St that satisfies the Markov Property. A Markov process is also known as a Markov chain, which is a
tuple (S, P) of a state set S and a transition function P. These two components (S and P) define the
dynamics of the system.
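The memoryless dynamics are easy to see in code. The sketch below, a made-up two-state weather chain used purely for illustration, samples a trajectory in which each next state is drawn using only the current state:

    import random

    # A toy Markov chain (S, P); states and probabilities are invented.
    P_chain = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def step(state):
        # The next state depends only on the current state (Markov Property).
        next_states = list(P_chain[state])
        weights = list(P_chain[state].values())
        return random.choices(next_states, weights=weights)[0]

    state = "sunny"
    trajectory = [state]
    for _ in range(10):
        state = step(state)
        trajectory.append(state)
    print(trajectory)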
BELLMAN EQUATION
The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953,
and hence it is called the Bellman equation. It is associated with dynamic programming and is
used to calculate the value of a decision problem at a certain point by including the value of the
state that follows it.
It is a way of calculating the value functions in dynamic programming, and it leads to
modern reinforcement learning. The key elements used in the Bellman equation are:
➢ The action performed by the agent, referred to as "a"
➢ The state occupied by the agent, referred to as "s"
➢ The reward/feedback obtained for each good or bad action, referred to as "R"
➢ The discount factor, Gamma "γ"
The Bellman equation can be written as:
V(s) = max_a [R(s,a) + γV(s')]
Where,
V(s) = the value calculated for state s
R(s,a) = the reward obtained in state s by performing action a
γ = the discount factor, with 0 ≤ γ ≤ 1
V(s') = the value of the next state s'
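Applied repeatedly, this equation gives the standard value-iteration procedure. The sketch below reuses the dict-based MDP placeholders from the earlier section (states, actions, P, reward); the sweep count and the averaging over stochastic next states are ordinary value-iteration conventions rather than anything specific to these notes.

    def bellman_backup(V, states, actions, P, reward, gamma=0.9):
        # One sweep of V(s) = max_a [R(s,a) + γV(s')], averaging over
        # possible next states s' when a transition is stochastic.
        new_V = {}
        for s in states:
            candidates = []
            for a in actions:
                if (s, a) not in P:
                    continue  # action a is not available in state s
                candidates.append(sum(p * (reward(s, a, s2) + gamma * V[s2])
                                      for s2, p in P[(s, a)].items()))
            new_V[s] = max(candidates) if candidates else 0.0
        return new_V

    # Start from all-zero values and repeat the backup until the values settle.
    V = {s: 0.0 for s in states}
    for _ in range(100):
        V = bellman_backup(V, states, actions, P, reward)
    print(V)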
The equation can be illustrated with a worked example on a grid of blocks: the agent moves block by
block toward a goal block, reaching the goal yields a reward of 1, every other move yields a reward
of 0, and the values are filled in backwards starting from the block adjacent to the goal.
For 1st block:
V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For 2nd block:
V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward
at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward
at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For 4th block:
V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no
reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For 5th block:
V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no
reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
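Since every block except the goal has reward 0, the backward pass above is just repeated multiplication by γ. The few lines below reproduce the arithmetic; the block labels s3, s2, s1, s5, s9 follow the example, while the grid layout itself is assumed.

    gamma = 0.9
    V = {}
    V["s3"] = 1.0              # goal block: reward 1, no further state
    V["s2"] = gamma * V["s3"]  # 0.9
    V["s1"] = gamma * V["s2"]  # 0.81
    V["s5"] = gamma * V["s1"]  # 0.729, quoted as 0.73
    V["s9"] = gamma * V["s5"]  # 0.6561, quoted as 0.66
    print(V)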