CS7641 FINAL - REINFORCEMENT LEARNING
QUESTIONS WITH VERIFIED ACCURATE ANSWERS
What are the characteristics of a Markov decision process? - Answers - - The
environment can be described by a particular state
- The agent can take actions in the world based on its state
- The world is described by a transition function that describes how the environment
responds to the action
- Actions are rewarded or punished based on their outcome and the resulting state
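The components above (states, actions, transitions, rewards) can be sketched as a tiny toy MDP in code; the state names, actions, and numbers here are hypothetical, not from the course:

```python
# A minimal two-state toy MDP (hypothetical example, not from the lectures).
# States: "cool", "hot"; actions: "work", "rest".
# T[state][action] is a list of (probability, next_state) pairs.
T = {
    "cool": {"work": [(0.7, "cool"), (0.3, "hot")],
             "rest": [(1.0, "cool")]},
    "hot":  {"work": [(1.0, "hot")],
             "rest": [(0.6, "cool"), (0.4, "hot")]},
}
# R[state][action]: reward for taking that action in that state.
R = {
    "cool": {"work": 2.0, "rest": 1.0},
    "hot":  {"work": -1.0, "rest": 0.0},
}
# Sanity check: each action's transition probabilities sum to 1.
for s in T:
    for a in T[s]:
        assert abs(sum(p for p, _ in T[s][a]) - 1.0) < 1e-9
print("valid MDP")
```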
What is the Markovian property of an MDP? - Answers - - Only the present matters --
the next state depends only on the current state and action, not on the full history, and
the transition rules themselves don't change over time
What is a policy? - Answers - - A set of actions to take at each possible environment
state
What is an optimal policy? - Answers - - A policy that maximizes the expected long-term
(discounted) reward
What is the effect of giving every state in the environment a small negative reward? -
Answers - - The agent is encouraged to end the game quickly
What are stationary preferences? - Answers - - The assumption that if we prefer one
sequence of rewards starting now, we'd still prefer that same sequence if it started later
in time
What is the purpose of discounted rewards in reinforcement learning? - Answers - -
Geometrically shrink the impact of future rewards so that the total reward converges to a
finite value even after infinite steps
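Why the sum stays finite: with a constant reward r each step and a discount factor gamma in [0, 1), the discounted total is a geometric series converging to r / (1 - gamma). A quick check (numbers are an arbitrary example):

```python
# With reward r = 1 every step and discount gamma = 0.9, the infinite
# discounted sum 1 + 0.9 + 0.81 + ... converges to r / (1 - gamma) = 10.
gamma = 0.9
r = 1.0
total = sum(r * gamma**t for t in range(1000))  # truncated approximation
closed_form = r / (1 - gamma)                   # geometric-series limit
print(total, closed_form)                       # both approximately 10
```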
What is the Bellman equation? - Answers - - A recursive equation for the true utility of a
state -- its immediate reward plus the discounted expected utility of the best next state
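Written out, with R(s) for the immediate reward, T(s, a, s') for the transition probability, and discount factor gamma, the utility form of the Bellman equation is:

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U(s')
```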
What is value iteration? - Answers - - Start by giving each state an arbitrary utility, then
update them based on their neighbors until they converge
- The real transitions and rewards will gradually overwhelm our initial random guesses
- At the end, we can extract the optimal policy by picking, in each state, the action with
the highest expected utility
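The steps above can be sketched on a tiny hypothetical chain MDP (this example and its numbers are illustrative assumptions, not from the course):

```python
# Value iteration on a hypothetical 3-state chain: states 0, 1, 2.
# Reaching state 2 ends the episode with reward +1; other moves cost -0.04.
gamma = 0.9
states = [0, 1, 2]
actions = ["left", "right"]

def step(s, a):
    """Deterministic transition model: returns (next_state, reward)."""
    if s == 2:                       # terminal state: nothing more happens
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else s + 1
    return s2, (1.0 if s2 == 2 else -0.04)

U = {s: 0.0 for s in states}         # arbitrary starting utilities
for _ in range(100):                 # repeated Bellman updates until converged
    U = {s: max(step(s, a)[1] + gamma * U[step(s, a)[0]] for a in actions)
         for s in states}

# Extract the greedy policy from the converged utilities.
policy = {s: max(actions, key=lambda a: step(s, a)[1] + gamma * U[step(s, a)[0]])
          for s in states if s != 2}
print(U, policy)   # both non-terminal states move right, toward the goal
```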
What is policy iteration? - Answers - - Start with a random policy, which will result in a
particular utility
- Improve that policy by picking, in each state, the action that maximizes expected utility
under the current policy's utilities
- Repeat until the policy can no longer be improved
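The same loop can be sketched in code on a small hypothetical chain MDP (an illustrative assumption, not a course example): evaluate the current policy, act greedily against those utilities, and stop when the policy no longer changes.

```python
# Policy iteration on a hypothetical 3-state chain: states 0, 1, 2.
# Reaching state 2 ends the episode with reward +1; other moves cost -0.04.
gamma = 0.9
states = [0, 1, 2]
actions = ["left", "right"]

def step(s, a):
    """Deterministic transition model: returns (next_state, reward)."""
    if s == 2:
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else s + 1
    return s2, (1.0 if s2 == 2 else -0.04)

policy = {s: "left" for s in states}     # start with an arbitrary policy
while True:
    # Policy evaluation: iterate the Bellman equation for the fixed policy.
    U = {s: 0.0 for s in states}
    for _ in range(100):
        U = {s: step(s, policy[s])[1] + gamma * U[step(s, policy[s])[0]]
             for s in states}
    # Policy improvement: act greedily with respect to the utilities.
    new_policy = {s: max(actions, key=lambda a: step(s, a)[1]
                         + gamma * U[step(s, a)[0]])
                  for s in states}
    if new_policy == policy:             # no improvement possible -> done
        break
    policy = new_policy
print(policy)   # non-terminal states learn to move right, toward the goal
```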
What does it mean for an RL agent to be model-free? - Answers - - It doesn't need to
know the environment's transition (or reward) function in advance -- it learns directly
from experience