CS 234 Winter 2021
Assignment 1
Due: January 22 at 6:00 pm (PST)



For submission instructions, please refer to the website. For all problems, if you use an existing result
from either the literature or a textbook to solve the exercise, you need to cite the source.


1 Flappy Karel MDP [25 pts]
There is a hot new mobile game on the market called Flappy Karel, where Karel the robot must
dodge the red pillars of doom and flap its way to the green pasture. Consider the following two grid
environments (Flappy World 1 and Flappy World 2). Starting from any unshaded square, Karel can
either move right & up, or right & down (e.g. from state 4 you can move to state 10 or 12; think
checkers). Actions are deterministic and always succeed unless they would cause Karel to run into a
wall. The thicker edges indicate walls, and attempting to move in the direction of a wall results in
falling down one square (e.g. going in any direction from state 30 leads to falling into state 31). A
successful run by Karel in Flappy World 1 is shown in Figure 1b. Taking any action from the green
target square (no. 32) earns a reward of rg and ends the episode. Taking any action from the red
squares of doom (no. 1, 7, 8, 12, 13, ...) earns a reward of rr and ends the episode. Otherwise, from
every other square, taking any action is associated with a reward rs. Assume the discount factor
γ = 0.9, rg = +5, and rr = −5 unless otherwise specified. Notice that the horizon is technically infinite
in both worlds.
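The dynamics above can be sketched in code. This is a minimal model of Flappy World 1 under one assumption not stated verbatim in the text: that states are numbered column-wise in a 7-row by 5-column grid, so state s sits at row (s − 1) mod 7, column (s − 1) div 7 (this matches the examples 4 → 10/12 and 30 → 31).

```python
# A minimal sketch of the Flappy World 1 dynamics, assuming a 7x5 grid
# numbered column-wise (state s at row (s-1) % 7, column (s-1) // 7).

N_ROWS, N_COLS = 7, 5

def to_rc(s):
    """State number (1-35) -> (row, col)."""
    return (s - 1) % N_ROWS, (s - 1) // N_ROWS

def to_state(r, c):
    """(row, col) -> state number (1-35)."""
    return c * N_ROWS + r + 1

def step(s, action):
    """action: -1 for right & up, +1 for right & down.
    Moving into a wall makes Karel fall down one square instead."""
    r, c = to_rc(s)
    nr, nc = r + action, c + 1
    if nc >= N_COLS or nr < 0 or nr >= N_ROWS:
        # Blocked by a wall: fall down one square (clamped at the bottom row).
        return to_state(min(r + 1, N_ROWS - 1), c)
    return to_state(nr, nc)
```

With this model, `step(4, -1)` gives 10, `step(4, 1)` gives 12, and both actions from state 30 give 31, matching the examples in the problem statement.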

[Figure 1: (a) Flappy World 1; (b) a successful run by Karel in Flappy World 1. Both panels show the same 7×5 grid, with squares numbered column-wise:]

 1  8 15 22 29
 2  9 16 23 30
 3 10 17 24 31
 4 11 18 25 32
 5 12 19 26 33
 6 13 20 27 34
 7 14 21 28 35

(a) Let rs ∈ {−4, −1, 0, 1}. Starting in square 2, for each of the possible values of rs briefly
explain what the optimal policy would be in Flappy World 1. In each case, is the optimal policy
unique, and does the optimal policy depend on the value of the discount factor γ? Explain
your answer. [5 pts]
(b) What value of rs would cause the optimal policy to return the shortest path to the green
target square? Using this value of rs, find the optimal value function for each square in Flappy
World 1. What is the optimal action from square 27? [5 pts]
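The optimal value function asked for in (b) can be computed numerically with value iteration, V(s) ← max_a [r(s, a) + γ·V(s′)] for deterministic transitions. Below is a generic sketch (not the official solution); the tiny three-state chain it runs on is an arbitrary illustration, not Flappy World itself.

```python
# Generic value iteration for a deterministic MDP.
# `transitions` maps (state, action) -> (next_state, reward, done).

def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                r + (0.0 if done else gamma * V[s2])
                for a in actions
                for (s2, r, done) in [transitions[(s, a)]]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy chain: move "right" toward a goal worth +5, or "stay" for 0 reward.
states = [0, 1, 2]
actions = ["right", "stay"]
transitions = {
    (0, "right"): (1, 0, False), (0, "stay"): (0, 0, False),
    (1, "right"): (2, 0, False), (1, "stay"): (1, 0, False),
    (2, "right"): (2, 5, True),  (2, "stay"): (2, 5, True),
}
V = value_iteration(states, actions, transitions)
# V[2] = 5, V[1] = 0.9 * 5 = 4.5, V[0] = 0.9 * 4.5 = 4.05
```

Plugging in the Flappy World transition model and the terminal green/red squares in place of the toy chain would yield the table of optimal values the question asks for.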

Now consider Flappy World 2. It is the same as Flappy World 1, except there are no walls on the
right and left sides. Going past the right end of Flappy World 2 simply loops you to the left-hand side.
Take a look at Figure 2b for a successful run by Karel in Flappy World 2.


[Figure 2: (a) Flappy World 2; (b) a successful run by Karel in Flappy World 2. The grid and numbering are identical to Figure 1, but without walls on the left and right sides.]


(c) Let rs ∈ {−4, −1, 0, 1}. Starting in square 3, for each of the possible values of rs briefly
explain what the optimal policy would be in Flappy World 2. Using the value of rs that
would cause the optimal policy to return the shortest path to the green target square, find the
optimal value function for each square in Flappy World 2. What is the optimal action from
square 27? [5 pts]
(d) Consider a general MDP with rewards and transitions. Consider a discount factor of γ. For
this case assume that the horizon is infinite (so there is no termination). A policy π in this
MDP induces a value function V^π (let us refer to this as V^π_old). Now suppose we have the same
MDP where all rewards have a constant c added to them and then have been scaled by a
constant a (i.e. r_new = a(c + r_old)). Can you come up with an expression for the new value
function V^π_new induced by π in this second MDP in terms of V^π_old, c, a, and γ? [5 pts]
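By linearity of the discounted sum, r_new = a(c + r_old) gives V^π_new(s) = a·V^π_old(s) + a·c/(1 − γ), since the added constant contributes a·c·(1 + γ + γ² + ...) = a·c/(1 − γ). A quick numeric sanity check on a small made-up two-state chain (the chain itself is an arbitrary example, not from the assignment):

```python
# Check V_new(s) = a * V_old(s) + a * c / (1 - gamma) for r_new = a*(c + r_old).

def evaluate(rewards, next_state, gamma, iters=10_000):
    """Policy evaluation V(s) = r(s) + gamma * V(next_state[s]) by iteration,
    for a deterministic chain with per-state rewards."""
    V = [0.0] * len(rewards)
    for _ in range(iters):
        V = [rewards[s] + gamma * V[next_state[s]] for s in range(len(rewards))]
    return V

gamma, a, c = 0.9, 2.0, 3.0
r_old = [1.0, -1.0]
nxt = [1, 0]                       # state 0 -> 1 -> 0 -> ...

V_old = evaluate(r_old, nxt, gamma)
V_new = evaluate([a * (c + r) for r in r_old], nxt, gamma)

shift = a * c / (1 - gamma)        # the constant offset contributed by c
# V_new[s] == a * V_old[s] + shift for every state s
```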


(e) Can scaling all the rewards by a fixed amount change the optimal policy of an MDP? If so,
describe how different ranges of the constant a (where r_new = a · r_old) would change the
optimal policy of the MDP from part (c). [5 pts]
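For intuition on (e): scaling every reward by a > 0 scales every return by a and cannot change which policy is optimal; a = 0 makes every policy optimal; a < 0 reverses the ordering of returns, so the optimal policy can change. A small illustration (an assumption-laden sketch with made-up reward sequences, not the graded answer):

```python
# Scaling rewards by a > 0 preserves the ordering of discounted returns;
# a < 0 reverses it, so the best behaviour can flip.

gamma = 0.9

def discounted_return(rewards):
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Two candidate behaviours in some made-up episodic MDP:
good = [0, 0, 5]      # longer route ending in a big reward
bad = [-5]            # immediate penalty, then termination

g_pos = discounted_return([2.0 * r for r in good])
b_pos = discounted_return([2.0 * r for r in bad])
g_neg = discounted_return([-1.0 * r for r in good])
b_neg = discounted_return([-1.0 * r for r in bad])
# g_pos > b_pos (ordering preserved), but g_neg < b_neg (ordering flipped)
```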


(a) rs = 1: take the longest possible path (2, 10, 18, 24, 30, 31, 32) to the target square.
    rs = 0: take the shortest possible path to the target square.
    rs = −1: take the shortest possible path to the target square.
