Lecture 1
RL = what to do to maximize a numerical reward
1 action for each situation -> highest reward
Problem of RL:
• Sense of the state of the environment
• Take actions -> affect the state
• Goal relating to the state
Learning from interactions, directly from its environment
Exploration-Exploitation dilemma:
Exploitation: profit from your experience
Exploration: look for better options in the future
Elements of RL:
1. Policy
Behavior of learning agent in time
Action to be taken
Policies may be stochastic
2. Reward Rt
Goal of the problem
Reward-signal
3. Value function V (s)
Total amount of reward expected to accumulate over the future, starting from state s
Long-run desirability of the state, considering future states & rewards
Actions based on value judgements
Value estimation
4. Model environment
Model-free: trial & error learner
Model-based given state & action, model predicts next state & R
→ What action to take as a function of the state signal
→ Learn while interacting
Multi-armed bandits
k-armed bandit → k options & 1 situation: non-associative feedback problem
k: number of actions
t: time step
At: action at t
q* (a): true value of action a (expected reward) = E[Rt | At = a]
q* (a) unknown → We use Qt (a) as an estimation at time t
𝞹
, Action-value methods
Types of actions: Greedy approach: exploiting, so choose a with highest Qt(a)
Exploring actions
= number of times a has been selected until time t
If Nt (a) = 0 , then Qt (a) = c, some default time
If Nt (a) → ♾ , then Qt (a) → q* (a)
Selection:
1. Random selection: P[ At = a ] = 1/k
2. Greedy action selection method At = arg maxa {Qt (a)}
3. -greedy action selection: with prob. select randomly from all actions with equal probability,
otherwise greedy
Offline computing: all data already available : computationally inefficient
Qn = (R1 + … + Rn-1) / (n - 1)
Qn+1 = (R1 + … + Rn-1) / n
Online computing:
new estimate = old estimate + stepwise ( target - old estimate )
Non-stationary: rewards probabilities change over time
give more weight to recent reward than to long-past reward
Varying step-size n (a): convergence for n (a) = 1/2
& no convergence for n (a) = and varying Qn+1
Sample average: bias disappears when actions are selected at least once
Optimistic initial value for Q1(a) → forces to select all options at least once
→ Qt (a) to proper level
𝜺 𝜶 𝜺 𝜶 𝜶 𝜶
, Lecture 2
Observations multi-armed bandits
Optimistic initial value Q1(a):
• Qt+1 = Qn(a) + [Rn(a) - Qn(a)], Q1(a) = c, influence high for small
• Qn+1 = Qn(a) + 1/n [Rn(a) - Qn(a)], whatever Q1(a) = c → Q2(a) = R1(a)
Greedy (with average reward value):
• t > t0 → At = a0, action a0 as long as Qt(a0) > 0 and if q*(a0) > 0
• t→♾ → Qt(a0) → q*(a0), for the ‘absorbing’ action
-Greedy (with average reward value):
• t→♾ → Qt(a) → q*(a), for all actions
• t→♾ → P[At = a*] = ( 1 - ) + / k, with a* = arg maxa {q* (a)}: optimal
Greedy selection (optimal initial value, exp. regency weighted average):
• Higher average reward (high Rt: slow decrease Qt), ‘absorbing’ action often optimal action
Do better by: giving exploiting priority and when we explore:
• Avoid low reward actions (no random selections)
• Good choices for pi: select greedy with high prob. for p1
• Low reward actions will still be selected
Upper-confidence-bound action selection
Explore non-greedy actions with high potential & keep exploring, also in the long-run
Behaviour:
• Sqrt-term measure of uncertainty, like confidence interval
• Value of actions increases in time, even when not selected
• When selected the uncertainty term decreases
• Subtle favoring of less-frequent selected actions
Multi-bandit problem non-associative task: find or track the best action for a single situation (or
state), either stationary or non-stationary
Contextual Bandits
Associative task: find best action for multiple situations (or states), i.e. learn a policy
Associative search: trial-and-error learning and association of actions to situations
Full RL associative task: Actions affect next situation (or state)
𝜺
𝜶 𝜺 𝜺 𝜶