CS7643 QUIZ 4 QUESTIONS WITH DETAILED
VERIFIED ANSWERS (100% CORRECT
ANSWERS) /ALREADY GRADED
Reinforcement learning
Sequential decision making in an environment with evaluative feedback
Environment: may be unknown, non-linear, stochastic and complex
Agent: learns a policy to map states of the environment to actions
- seeks to maximize long-term reward
RL: Evaluative Feedback
- Pick an action, receive a reward
- No supervision for what the correct action is or would have been (unlike supervised learning)
RL: Sequential Decisions
- Plan and execute actions over a sequence of states
- Reward may be delayed, requiring optimization of future rewards (long-term planning)
Signature Challenges in RL
Evaluative Feedback: Need trial and error to find the right action
Delayed Feedback: Actions may not lead to immediate reward
Non-stationarity: Data distribution of visited states changes when the policy changes
Fleeting Nature: online data may only be seen once
MDP
Framework underlying RL
S: Set of states
A: Set of actions
R: Distribution of Rewards
T: Transition probability
γ: Discount factor
Markov Property: Current state completely characterizes state of the environment
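The five MDP components above can be written down directly as plain data. A minimal sketch in Python, using a made-up two-state MDP purely for illustration (the states, actions, and numbers are not from the quiz):

```python
# Minimal sketch of an MDP (S, A, R, T, gamma) as plain Python data.
# The two-state MDP below is an invented example for illustration.

S = ["s0", "s1"]          # S: set of states
A = ["stay", "move"]      # A: set of actions
gamma = 0.9               # gamma: discount factor

# T[(s, a)] -> {s': p(s' | s, a)}; each distribution sums to 1.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}

# R[(s, a)] -> expected immediate reward r(s, a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 0.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): 0.0,
}

# Markov property: T[(s, a)] depends only on the current state s and
# action a, never on any earlier history.
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in T.values())
```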
RL: Equations relating optimal quantities
1. V*(s) = max_a Q*(s, a)
2. π*(s) = argmax_a Q*(s, a)
V*(s)
max_a sum_{s'} p(s'|s, a) [r(s, a) + γ V*(s')]
Q*(s, a)
sum_{s'} p(s'|s, a) [r(s, a) + γ max_{a'} Q*(s', a')]
Value Iteration
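The Bellman optimality equation gives the value-iteration update directly: initialize V(s) = 0 and repeatedly apply V(s) ← max_a sum_{s'} p(s'|s, a) [r(s, a) + γ V(s')] until the values stop changing; the greedy policy π*(s) = argmax_a Q*(s, a) is then read off the converged values. A minimal sketch on a made-up two-state MDP (the states, rewards, and transitions are invented for illustration):

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} p(s'|s,a) [ r(s,a) + gamma * V(s') ]
# on an invented two-state MDP (illustrative, not from the quiz).

S = ["s0", "s1"]
A = ["stay", "move"]
gamma = 0.9
T = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

def q_value(s, a, V):
    """Q(s, a) = sum_{s'} p(s'|s, a) [ r(s, a) + gamma * V(s') ]."""
    return sum(p * (R[(s, a)] + gamma * V[sp]) for sp, p in T[(s, a)].items())

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        # Bellman optimality backup: V(s) <- max_a Q(s, a)
        V_new = {s: max(q_value(s, a, V) for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

V_star = value_iteration()
# Greedy policy: pi*(s) = argmax_a Q*(s, a)
pi_star = {s: max(A, key=lambda a: q_value(s, a, V_star)) for s in S}
```

On this example the fixed point is V*(s1) = 1/(1 − γ) = 10 (stay and collect reward 1 forever) and V*(s0) = γ · V*(s1) = 9 (move to s1 first), so π* moves from s0 and stays at s1.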