CSCI 3202

Assignment 5

Assigned: Thu Oct 7, 2004
Due: Tue Oct 19, 2004

In this assignment, you must program a simple simulated environment and construct a reinforcement learning agent that discovers the optimal (shortest) path through the environment to reach a goal. The agent's environment will look like:

Each cell in this grid is a state in the environment. The cell labeled "I" is the initial state of the agent. The cell labeled "G" is the goal state of the agent. The black cells are barriers--states that are inaccessible to the agent. At each time step, the agent may move one cell to the left, right, up, or down. The environment does not wrap around. Thus, if the agent is in the lower left corner and tries to move down, it will remain in the same cell. Likewise, if the agent is in the initial state and tries to move up (into a barrier), it will remain in the same cell.

You should implement a Q learning algorithm that selects moves for the agent. The algorithm should perform exploration by choosing the action with the maximum Q value 95% of the time, and choosing one of the remaining three actions at random the remaining 5% of the time.

The simulation runs until the agent reaches the goal state. The reward at the goal is 0, but at every other state is -1 (because it takes energy for the agent to move). Because this simulation has a finite running time, you do not need to use a discounting factor, i.e., you can set the discounting parameter γ (gamma) to one.  Because the environment is deterministic, you can set the averaging constant--denoted by the greek letter α (alpha) in the text and ξ (xi) in my notes--to one as well (i.e., the new Q estimate replaces the old Q estimate).

Write up

Your write up should include the following:

Advice

Here is a suggestion about how to modularize your code:

Bells & Whistles

If you get your basic program running, you can easily explore the consequences of nondeterministic actions.  Suppose that choosing some action has probabiliy .7 of causing that action, and probability .1 of causing each of the other 3 actions.  How does that influence the resulting policy and Q function?  (In order to study this question, you will have to use an α (alpha) value << 1; you might try α = .05.

You may also want to explore other maze environments.   One interesting question to ask is how the algorithm scales.  For example if you have an NxN grid world with a fraction K of cells containing obstacles, how many trials does it take before the agent reliably gets to the goal?  If could be that the number of learning trials grows
linearly with N, or with N^2, or even exponentially with N.  The latter would indicate that the future of Q learning is not great.