CS 7643 Quiz 5 Review – Questions and Answers | 2026 Update | 100% Correct – GT.
🔵 SECTION 1: Deep Reinforcement Learning
Q1. Define Reinforcement Learning and explain how Deep Reinforcement
Learning extends it.
Answer:
Reinforcement Learning (RL) is a framework where an agent interacts with an environment to
maximize cumulative reward.
At each timestep:
1. Observe state $s_t$
2. Take action $a_t$
3. Receive reward $r_t$
4. Transition to next state $s_{t+1}$
Deep Reinforcement Learning (DRL) extends RL by using deep neural networks to
approximate:
Value functions $V(s)$
Q-functions $Q(s,a)$
Policies $\pi(a \mid s)$
This makes it possible to handle problems with high-dimensional inputs, such as images, and problems with continuous action spaces, such as continuous control.
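The four-step interaction loop above can be sketched in a few lines of Python. This is only an illustrative sketch: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names, and the discount factor `gamma` is an assumption (the answer above only says "cumulative reward").

```python
class CorridorEnv:
    """Hypothetical toy environment: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left (clipped at the walls)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Run one episode and return the discounted cumulative reward."""
    state = env.reset()                          # 1. observe s_t
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                   # 2. take a_t
        state, reward, done = env.step(action)   # 3. receive r_t, 4. move to s_{t+1}
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret


def always_right(state):
    """Trivial hand-written policy; a DRL agent would use a neural network here."""
    return 1


episode_return = run_episode(CorridorEnv(), always_right)
```

In a deep RL setting, the hand-written `always_right` policy would be replaced by a network that maps the observed state to an action or to action values.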
Q2. What is the difference between Q-learning and SARSA?
Answer:
Both are temporal-difference (TD) methods.
Q-Learning (Off-Policy)
$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]$$
Uses maximum next Q-value
Learns the value of the optimal (greedy) policy, regardless of the policy used to collect experience
Off-policy
SARSA (On-Policy)
$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[r + \gamma Q(s',a') - Q(s,a)\big]$$
Uses the next action $a'$ actually chosen by the current policy
Learns the value of the behavior policy it follows (e.g., ε-greedy)
On-policy
Key difference:
Q-learning uses greedy max; SARSA uses actual action taken.
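To make the contrast concrete, here is a minimal tabular sketch of both update rules. The function names, the list-of-lists Q-table, and the default values of `alpha` and `gamma` are assumptions chosen for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best next action, whatever is actually taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the next action the agent actually chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


# Same transition, same starting table: only the bootstrap term differs.
Q_ql = [[0.0, 0.0], [1.0, 5.0]]
Q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
q_learning_update(Q_ql, s=0, a=0, r=0.0, s_next=1)           # uses max(1.0, 5.0)
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0)   # uses Q[1][0] = 1.0
```

When the agent explores and picks the worse next action (`a_next=0`), SARSA's target reflects that choice while Q-learning's target does not, which is the on-policy versus off-policy distinction.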
Q3. Explain step-by-step how Deep Q-Learning (DQN) works.
Answer:
1. Replace Q-table with neural network:
$Q(s,a;\theta)$
2. Use ε-greedy action selection.
3. Store experience tuple in the replay buffer:
$(s,a,r,s')$
4. Sample minibatch from replay buffer.
5. Compute target:
$y = r + \gamma \max_{a'} Q(s',a';\theta^-)$
6. Minimize loss:
$L = (y - Q(s,a;\theta))^2$
7. Periodically update the target network by copying the online weights: $\theta^- \leftarrow \theta$.
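Stabilization methods:
Experience Replay (breaks the correlation between consecutive training samples)
Target Network (keeps the bootstrap target fixed between updates)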
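The seven steps above can be sketched with the standard library alone. This is a rough sketch under stated assumptions, not a full DQN: a linear function (`LinearQ`, a hypothetical class) stands in for the neural network, the stored tuple carries an extra `done` flag that the answer above does not mention, and all hyperparameters are illustrative.

```python
import copy
import random
from collections import deque


class LinearQ:
    """Stand-in for a Q-network (assumption): Q(s, a; theta) = theta[a] . s"""

    def __init__(self, n_features, n_actions):
        self.theta = [[0.0] * n_features for _ in range(n_actions)]

    def q_values(self, s):
        return [sum(w * x for w, x in zip(row, s)) for row in self.theta]


def epsilon_greedy(qnet, s, epsilon, rng):
    """Step 2: random action with probability epsilon, otherwise the greedy one."""
    q = qnet.q_values(s)
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)


def td_target(target_net, r, s_next, done, gamma):
    """Step 5: y = r + gamma * max_a' Q(s', a'; theta^-), no bootstrap if terminal."""
    if done:
        return r
    return r + gamma * max(target_net.q_values(s_next))


def train_step(qnet, target_net, batch, gamma=0.99, lr=0.1):
    """Step 6: per-sample gradient steps on (y - Q)^2; returns the mean squared
    TD error, each error measured just before its own update."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        err = td_target(target_net, r, s_next, done, gamma) - qnet.q_values(s)[a]
        total += err ** 2
        for i, x in enumerate(s):
            # gradient of 0.5 * err^2 w.r.t. theta[a][i] is -err * x
            qnet.theta[a][i] += lr * err * x
    return total / len(batch)


rng = random.Random(0)
buffer = deque(maxlen=1000)                    # replay buffer
qnet = LinearQ(n_features=2, n_actions=2)      # step 1: online parameters theta
target_net = copy.deepcopy(qnet)               # frozen copy, theta^-

buffer.append(([1.0, 0.0], 1, 1.0, [0.0, 1.0], True))   # step 3: store transition
batch = rng.sample(list(buffer), 1)                      # step 4: sample minibatch
loss = train_step(qnet, target_net, batch)
target_net = copy.deepcopy(qnet)               # step 7: theta^- <- theta
```

Keeping `target_net` as a separate deep copy is what makes the target in step 5 stay fixed while `qnet` changes, and sampling from `buffer` rather than training on the latest transition is the experience replay part.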