13. Reinforcement Learning
[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]
Control learning
Control policies that choose optimal actions
Q learning
Convergence
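As a preview of the Q-learning material covered later in the chapter, the deterministic update Q(s,a) <- r + gamma * max_a' Q(s',a') can be sketched on a toy problem. The corridor environment, goal reward of 100, and hyperparameters below are illustrative choices, not taken from the slides:

```python
import random

# Toy deterministic corridor: states 0..4, goal at state 4.
# Environment and constants here are illustrative assumptions.
GOAL, N_STATES, GAMMA = 4, 5, 0.9
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Deterministic transition; reward 100 on reaching the goal, else 0."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (100 if s2 == GOAL else 0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(200):                      # training episodes
    s = random.randrange(N_STATES)
    while s != GOAL:
        a = random.choice(ACTIONS)        # random exploration
        s2, r = step(s, a)
        # Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2

# Greedy policy read off the learned Q values: move right toward the goal.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
```

After training, the Q values decay geometrically with distance from the goal (100, 90, 81, ...), which is the gamma-discounting behavior the convergence discussion relies on.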
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Control Learning
Consider learning to choose actions, e.g.,
Robot learning to dock on battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon
Note several problem characteristics:
Delayed reward
Opportunity for active exploration
Possibility that state only partially observable
Possible need to learn multiple tasks with same sensors/effectors
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon
Immediate reward
+100 if win
-100 if lose
0 for all other states
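The sparse reward scheme above can be written directly as a reward function. This is a minimal sketch; the string outcome labels are a stand-in for TD-Gammon's actual board-state encoding, which is not described on this slide:

```python
def reward(outcome):
    """Sparse reward used in the TD-Gammon example:
    +100 on a win, -100 on a loss, 0 everywhere else.
    `outcome` labels are hypothetical; the real system inspects the board state."""
    if outcome == "win":
        return 100
    if outcome == "lose":
        return -100
    return 0
```

Because the reward is zero for all intermediate states, credit for a win must propagate backward through many moves, which is the delayed-reward characteristic noted on the previous slide.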
Trained by playing 1.5 million games against itself
Now approximately equal to best human player