CS7643 LAST QUIZ QUESTIONS WITH DETAILED VERIFIED
ANSWERS (100% CORRECT ANSWERS) /ALREADY
GRADED A+
Reinforcement learning - ANSWER-Sequential decision making in an environment with evaluative
feedback
Environment: may be unknown, non-linear, stochastic and complex
Agent: learns a policy to map states of the environment to actions
- seeks to maximize long-term reward
RL: Evaluative Feedback - ANSWER-- Pick an action, receive a reward
- No supervision for what the correct action is or would have been (unlike supervised learning)
RL: Sequential Decisions - ANSWER-- Plan and execute actions over a sequence of states
- Reward may be delayed, requiring optimization of future rewards (long-term planning)
Signature Challenges in RL - ANSWER-Evaluative Feedback: Need trial and error to find the right action
Delayed Feedback: Actions may not lead to immediate reward
Non-stationarity: Data distribution of visited states changes when the policy changes
Fleeting Nature of Online Data: May only see each data point once
MDP - ANSWER-Framework underlying RL
S: Set of states
A: Set of actions
R: Distribution of rewards
T: Transition probability
y: Discount factor
Markov Property: Current state completely characterizes state of the environment
RL: Equations relating optimal quantities - ANSWER-1. V*(s) = max_a Q*(s, a)
2. PI*(s) = argmax_a Q*(s, a)
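As a small illustration of the two relations above, the sketch below reads V* and PI* off a Q* table. The table values are made up for the example (they are not from the course material), and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical Q* table: 3 states (rows) x 2 actions (columns).
# The numbers are illustrative only.
q_star = np.array([
    [1.0, 2.5],
    [0.3, 0.1],
    [4.0, 4.2],
])

# V*(s) = max_a Q*(s, a): best achievable value in each state.
v_star = q_star.max(axis=1)

# PI*(s) = argmax_a Q*(s, a): the action that attains that value.
pi_star = q_star.argmax(axis=1)

print(v_star)   # [2.5 0.3 4.2]
print(pi_star)  # [1 0 1]
```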
V*(s) - ANSWER-max_a ( sum_(s') { p(s'|s, a) [r(s, a) + y*V*(s')] } )
Q*(s,a) - ANSWER-sum_(s') { p(s'|s, a) [r(s, a) + y*max_(a') Q*(s', a')] }
Value Iteration - ANSWER-V_(i+1)(s) = max_a ( sum_(s') { p(s'|s, a) [r(s, a) + y*V_(i)(s')] } )
- Repeat until convergence
- Time complexity per iteration: O(|S|^2 |A|)
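The update above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the array layout (`P[s, a, s']`, `R[s, a]`), the stopping tolerance, and the two-state toy MDP are all assumptions made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    P[s, a, s2] = p(s2 | s, a), R[s, a] = r(s, a) (assumed layout).
    Returns the converged values and the greedy policy.
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # sum_s' p(s'|s,a) [r(s,a) + gamma V(s')] = R + gamma * (P @ V),
        # because p(.|s,a) sums to 1. The product costs O(|S|^2 |A|).
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    greedy_policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, greedy_policy

# Hypothetical toy MDP with 2 states and 2 actions (deterministic moves).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to 1
P[1, 0, 1] = 1.0  # state 1, action 0: stay in 1
P[1, 1, 0] = 1.0  # state 1, action 1: move to 0
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

V, policy = value_iteration(P, R, gamma=0.5)
print(np.round(V, 4), policy)  # approximately [3. 4.] and [1 0]
```

In this toy problem, staying in state 1 earns 2 per step, so V(1) = 2 / (1 - 0.5) = 4, and moving there from state 0 gives V(0) = 1 + 0.5 * 4 = 3.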
Policy Iteration - ANSWER-Policy Evaluation: Compute V_PI for the current policy PI
Policy Refinement: Greedily change the action in each state according to V_PI at the next states
Why do Policy Iteration: the policy PI_i often converges to PI* sooner than V_PI_i converges to V_PI*
- thus requires fewer iterations
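The evaluation and refinement steps above can be sketched as follows. The card only says "compute V_PI", so solving the linear system exactly is one possible choice for evaluation and is an assumption here, as are the array layout (`P[s, a, s']`, `R[s, a]`) and the toy MDP.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation.

    P[s, a, s2] = p(s2 | s, a), R[s, a] = r(s, a) (assumed layout).
    Returns (V, policy, number of refinement rounds).
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # arbitrary starting policy
    rounds = 0
    while True:
        rounds += 1
        # Policy evaluation: V = R_pi + gamma * P_pi V, solved as a linear system.
        P_pi = P[idx, policy]  # shape (S, S)
        R_pi = R[idx, policy]  # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy refinement: act greedily with respect to V at the next states.
        new_policy = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy, rounds
        policy = new_policy

# Hypothetical toy MDP with 2 states and 2 actions (deterministic moves).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to 1
P[1, 0, 1] = 1.0  # state 1, action 0: stay in 1
P[1, 1, 0] = 1.0  # state 1, action 1: move to 0
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

V, policy, rounds = policy_iteration(P, R, gamma=0.5)
print(np.round(V, 4), policy, rounds)  # approximately [3. 4.], [1 0], 2
```

On this toy problem the policy stops changing after two rounds, which illustrates the point above: the policy can settle well before an iteratively computed value function would have converged.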
Deep Q-Learning - ANSWER-- Q(s, a; w, b) = w_a^T * s + b_a
- MSE Loss := (Q_new(s, a) - (r + y*max_(a') Q_old(s', a')))^2
- Using a single Q function for both the prediction and the target makes the loss unstable
--> use two Q-networks (NNs)
- Freeze Q_old and update Q_new
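A minimal sketch of the frozen-target idea above, using the linear Q function from this card instead of a deep network so it stays NumPy-only. The class name `LinearQ`, the helper `td_update`, the learning rate, and the single hand-made transition are all hypothetical choices for illustration.

```python
import numpy as np

class LinearQ:
    """Q(s, a; w, b) = w_a^T s + b_a, one weight vector and bias per action."""

    def __init__(self, state_dim, n_actions, rng):
        self.w = rng.normal(scale=0.1, size=(n_actions, state_dim))
        self.b = np.zeros(n_actions)

    def __call__(self, s):
        return self.w @ s + self.b  # Q(s, a) for every action a

    def copy_from(self, other):
        self.w = other.w.copy()
        self.b = other.b.copy()

def td_update(q_new, q_old, s, a, r, s_next, gamma=0.99, lr=0.1):
    """One gradient step on (Q_new(s, a) - (r + gamma * max_a' Q_old(s', a')))^2."""
    target = r + gamma * np.max(q_old(s_next))  # computed from the frozen Q_old
    error = q_new(s)[a] - target
    # Gradient of error^2 w.r.t. w_a is 2 * error * s, w.r.t. b_a is 2 * error.
    q_new.w[a] -= lr * 2.0 * error * s
    q_new.b[a] -= lr * 2.0 * error
    return error ** 2  # loss measured before the step

rng = np.random.default_rng(0)
q_new = LinearQ(state_dim=4, n_actions=2, rng=rng)
q_old = LinearQ(state_dim=4, n_actions=2, rng=rng)
q_old.copy_from(q_new)  # start from identical parameters, then freeze Q_old
w_old_snapshot = q_old.w.copy()

# One hypothetical transition (s, a, r, s'), replayed several times.
s = np.array([1.0, 0.0, 0.5, -0.5])
s_next = np.array([0.0, 1.0, 0.5, 0.5])
losses = [td_update(q_new, q_old, s, a=0, r=1.0, s_next=s_next) for _ in range(5)]
print(losses)  # shrinks every step because the target does not move
```

Because Q_old is never touched inside `td_update`, the regression target stays fixed while Q_new is trained. In practice Q_old is typically refreshed from Q_new every so often (here that would be another `copy_from` call), which this sketch leaves out.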