CS 7643 – Quiz 6 Review | Questions and Answers – Spring 2026 | 100% Correct – GT
🔵 SECTION 1: Markov Decision Processes (MDPs)
Q1. Define a Markov Decision Process (MDP).
Answer:
An MDP is a tuple:
\[ (S, A, T, R, \gamma) \]
Where:
S = set of states
A = set of actions
T(s'|s,a) = transition probability
R(s,a) = reward function
γ ∈ [0,1] = discount factor
Markov property:
\[ P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_t, \ldots, S_0) \]
Only the present state matters.
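To make the tuple concrete, here is a minimal sketch of an MDP stored as plain Python data. The two-state example, the state and action names, and the helper `is_valid_mdp` are all hypothetical and chosen only for illustration.

```python
# Hypothetical two-state MDP, used only to illustrate the tuple (S, A, T, R, gamma).
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

# T[(s, a)] maps each next state s' to the probability T(s'|s,a).
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}

# R[(s, a)] is the reward R(s,a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "go"): 0.0,
}


def is_valid_mdp(S, A, T, R, gamma):
    """Check that every (s, a) has a reward and a proper distribution over S."""
    if not 0.0 <= gamma <= 1.0:
        return False
    for s in S:
        for a in A:
            dist = T.get((s, a))
            if dist is None or (s, a) not in R:
                return False
            # Probabilities must be non-negative and land on known states.
            if any(s2 not in S or p < 0 for s2, p in dist.items()):
                return False
            # Each T(.|s,a) must sum to 1.
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True
```

Because T depends only on the current state and action, sampling the next state from `T[(s, a)]` never needs the earlier history, which is the Markov property in practice.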
Q2. Define Value Function and Q-Function.
Answer:
Value function:
\[ V^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s \right] \]
Q-function:
\[ Q^\pi(s,a) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a \right] \]
Optimal value:
\[ V^*(s) = \max_a Q^*(s,a) \]
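The quantity inside both expectations is the discounted return of one trajectory. The sketch below computes it for a finite reward sequence (a truncation of the infinite sum) and shows the max over actions that turns Q-values into a state value. The function names and the sample numbers are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def state_value_from_q(q_row):
    """Given {action: Q(s, action)} for one state, return max_a Q(s, a)."""
    return max(q_row.values())


# Hypothetical rewards 1, 2, 3 with gamma = 0.5: 1 + 0.5*2 + 0.25*3
g = discounted_return([1.0, 2.0, 3.0], 0.5)

# Hypothetical optimal Q-values for one state; its value is the larger entry.
v = state_value_from_q({"left": 1.0, "right": 2.5})
```

Averaging `discounted_return` over many trajectories sampled under π from the same start state would give a Monte Carlo estimate of V^π(s).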
Q3. Write the Bellman Optimality Equation.
\[ V^*(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right] \]
Breaks value into:
Immediate reward
Discounted future value
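The same decomposition can be written for the optimal Q-function. As a short worked step, using the relation V*(s) = max_a Q*(s,a) from Q2 and the r, P notation of this question:

```latex
\begin{align}
Q^*(s,a) &= r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \\
         &= r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^*(s',a')
\end{align}
```

Taking the max over a on both sides of the first line recovers the Bellman optimality equation for V*.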
🔵 SECTION 2: Dynamic Programming
Q4. Explain Value Iteration.
Answer:
Initialize \(V(s)\) arbitrarily
Iteratively update:
\[ V(s) \leftarrow \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right] \]
Repeat until convergence.
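The loop above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the two-state MDP, the tolerance `tol`, and the iteration cap `max_iters` are assumptions, and values are updated in place within each sweep.

```python
def value_iteration(states, actions, P, r, gamma, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until the largest change is below tol."""
    V = {s: 0.0 for s in states}  # arbitrary initialization (all zeros here)
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            # max over actions of r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
            best = max(
                r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V


# Hypothetical deterministic MDP: states 0 and 1, actions "stay" and "move".
STATES = [0, 1]
ACTIONS = ["stay", "move"]
P = {
    (0, "stay"): {0: 1.0},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 1.0},
    (1, "move"): {0: 1.0},
}
r = {(0, "stay"): 0.0, (0, "move"): 1.0, (1, "stay"): 2.0, (1, "move"): 0.0}

V = value_iteration(STATES, ACTIONS, P, r, gamma=0.5)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 2 / (1 - 0.5) = 4, and moving from state 0 is worth 1 + 0.5 * 4 = 3, so V should approach {0: 3, 1: 4}.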
Q5. Explain Policy Iteration.
Answer:
1. Initialize random policy π
2. Policy Evaluation:
\(V^\pi(s)\)
3. Policy Improvement:
\(\pi(s) = \arg\max_a Q^\pi(s,a)\)
4. Repeat until stable
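The four steps can be sketched as follows. This is a minimal sketch under stated assumptions: the two-state MDP is hypothetical, policy evaluation is done iteratively to a tolerance rather than by solving the linear system, and the initial policy is fixed (first action everywhere) instead of random so the run is reproducible.

```python
def q_value(s, a, V, P, r, gamma):
    """One-step lookahead: r(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


def policy_evaluation(pi, states, P, r, gamma, tol=1e-10):
    """Iteratively compute V^pi for a fixed deterministic policy pi (needs gamma < 1)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = q_value(s, pi[s], V, P, r, gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_iteration(states, actions, P, r, gamma):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    pi = {s: actions[0] for s in states}  # step 1 (fixed here, not random)
    while True:
        V = policy_evaluation(pi, states, P, r, gamma)  # step 2
        new_pi = {  # step 3: act greedily with respect to Q^pi
            s: max(actions, key=lambda a: q_value(s, a, V, P, r, gamma))
            for s in states
        }
        if new_pi == pi:  # step 4: stable policy
            return pi, V
        pi = new_pi


# Hypothetical deterministic MDP: states 0 and 1, actions "stay" and "move".
STATES = [0, 1]
ACTIONS = ["stay", "move"]
P = {
    (0, "stay"): {0: 1.0},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 1.0},
    (1, "move"): {0: 1.0},
}
r = {(0, "stay"): 0.0, (0, "move"): 1.0, (1, "stay"): 2.0, (1, "move"): 0.0}

pi, V = policy_iteration(STATES, ACTIONS, P, r, gamma=0.5)
```

On this example the starting policy stays everywhere, one improvement step switches state 0 to "move", and the next round leaves the policy unchanged, so the loop stops.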
🔵 SECTION 3: Q-Learning & Deep Q-Learning
Q6. Write Q-learning update rule.
\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \]
Off-policy TD learning.
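A minimal sketch of one tabular update, assuming Q is stored as a dictionary keyed by (state, action) with unseen entries defaulting to 0. The function name, the state and action labels, and the sample transition are hypothetical, and terminal states are not handled.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    # The target bootstraps from the best next action, whatever action the
    # behavior policy actually takes next. That is what makes it off-policy.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical transition: from "s0", action "right", reward 1, landing in "s1".
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
err = q_learning_update(
    Q, "s0", "right", r=1.0, s_next="s1",
    actions=["left", "right"], alpha=0.5, gamma=0.5,
)
```

Here the target is 1 + 0.5 * 2 = 2, the TD error is 2 - 0 = 2, and the entry for ("s0", "right") moves halfway to the target.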
Q7. Compute Q-update (GT-style numeric problem).
Given:
γ = 0.8
State B, action Up
Enter state C
Reward = 3
max Q(C,a) = 5
Q(B,Up) = 8
Update (the arithmetic applies the full TD error, i.e. α = 1):
\[ Q(B,\text{Up}) = 8 + (3 + 0.8 \cdot 5 - 8) = 8 + (3 + 4 - 8) = 7 \]
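The arithmetic can be checked with a few lines of Python. The problem statement gives no learning rate; the worked answer adds the whole TD error, so α = 1 is assumed here. The α = 0.5 line is a hypothetical variation, included only to show how a smaller step changes the result.

```python
def q_update(q_old, reward, max_q_next, gamma, alpha):
    """Q-learning update for a single entry, given max_a' Q(s', a')."""
    td_target = reward + gamma * max_q_next
    return q_old + alpha * (td_target - q_old)


# Values from the problem; alpha = 1.0 is an assumption implied by the answer.
q_new = q_update(q_old=8.0, reward=3.0, max_q_next=5.0, gamma=0.8, alpha=1.0)

# Hypothetical variation with a smaller learning rate.
q_new_half = q_update(q_old=8.0, reward=3.0, max_q_next=5.0, gamma=0.8, alpha=0.5)
```

With α = 1 the old estimate is replaced by the TD target 3 + 0.8 * 5 = 7. With α = 0.5 the estimate would move only halfway, from 8 to 7.5.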
MSE Loss: