1. Reinforcement learning
Answer Sequential decision making in an environment with evaluative feedback
Environment: may be unknown, non-linear, stochastic and complex
Agent: learns a policy to map states of the environment to actions
- seeks to maximize long-term reward
2. RL: Evaluative Feedback
Answer Evaluative Feedback:
- Pick an action, receive a reward
- No supervision for what the correct action is or would have been (unlike supervised learning)
3. RL: Sequential Decisions
Answer - Plan and execute actions over a sequence of states
- Reward may be delayed, requiring optimization of future rewards (long-term planning)
4. Signature Challenges in RL
Answer Evaluative Feedback: Need trial and error to find the right action
Delayed Feedback: Actions may not lead to immediate reward
Non-stationarity: Data distribution of visited states changes when the policy changes
Fleeting Nature of online data: may only see each sample once
CS7643 EXAM STUDY GUIDE
5. MDP (Markov Decision Process)
Answer Framework underlying RL
S: Set of states
A: Set of actions
R: Reward distribution R(s, a)
T: Transition probability p(s' | s, a)
γ: Discount factor
Markov Property: Current state completely characterizes the state of the environment
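The (S, A, R, T, γ) tuple above can be written down concretely. A minimal Python sketch with a hypothetical 2-state, 2-action MDP (the state/action names and all numbers are made up for illustration):

```python
# Hypothetical MDP: 2 states, 2 actions, made-up rewards and transitions.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
GAMMA = 0.9  # discount factor (gamma)

# R[s][a]: expected immediate reward r(s, a)
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}

# T[s][a][s']: transition probability p(s' | s, a); each row sums to 1
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.7, "s1": 0.3}},
}

# Markov property: T[s][a] depends only on the current state and action,
# not on the history of earlier states.
assert all(
    abs(sum(T[s][a].values()) - 1.0) < 1e-9 for s in STATES for a in ACTIONS
)
```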
6. RL: Optimal Quantities
Answer Equations relating optimal quantities:
1. V*(s) = max_a Q*(s, a)
2. π*(s) = argmax_a Q*(s, a)
7. V*(s)
Answer V*(s) = max_a sum_{s'} p(s'|s, a) [r(s, a) + γ V*(s')]
8. Q*(s, a)
Answer Q*(s, a) = sum_{s'} p(s'|s, a) [r(s, a) + γ max_{a'} Q*(s', a')]
9. Value Iteration
Answer V_{i+1}(s) = max_a sum_{s'} p(s'|s, a) [r(s, a) + γ V_i(s')]
- Repeat until convergence
- Time complexity per iteration: O(|S|^2 |A|)
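The value iteration card above can be sketched end to end: repeat the Bellman backup until the largest per-state change falls below a tolerance, then read off the greedy policy π*(s) = argmax_a Q*(s, a). The 2-state MDP below is hypothetical (all numbers made up):

```python
# Value iteration sketch: V_{i+1}(s) = max_a sum_{s'} p(s'|s,a)[r(s,a) + gamma V_i(s')]
GAMMA, TOL = 0.9, 1e-8

R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.7, "s1": 0.3}},
}

def value_iteration(R, T, gamma=GAMMA, tol=TOL):
    V = {s: 0.0 for s in R}
    while True:
        # One backup per state: each inner sum is O(|S|) and we take a max
        # over |A| actions, giving O(|S|^2 |A|) per sweep.
        V_new = {
            s: max(
                sum(p * (R[s][a] + gamma * V[sp]) for sp, p in T[s][a].items())
                for a in R[s]
            )
            for s in R
        }
        if max(abs(V_new[s] - V[s]) for s in R) < tol:  # converged
            return V_new
        V = V_new

V_star = value_iteration(R, T)

# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a)
pi_star = {
    s: max(
        R[s],
        key=lambda a: sum(
            p * (R[s][a] + GAMMA * V_star[sp]) for sp, p in T[s][a].items()
        ),
    )
    for s in R
}
# Staying in s1 earns reward 2 forever, so V*(s1) = 2 / (1 - 0.9) = 20.
print(V_star, pi_star)
```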