Human Sequential Decision Making and Reinforcement Learning

Theories of learning by reinforcement have been used to interpret the asymptotic behavior of individuals performing one- or two-step choice tasks (e.g., the Iowa gambling task), and data from mice performing temporally extended behaviors, but not data from individuals in the initial stages of learning sequential decision tasks. We studied participants in a continuous task that involves exploring an unfamiliar environment. Participants viewed a graphical depiction of the room they were currently in and were given the choice of two doors leading to other rooms. Rewards were associated with state-action pairs (i.e., the choice of a given door in a given room). Participants made a sequence of 300 choices, attempting to maximize their total reward. We addressed whether temporal-difference (TD) learning is an appropriate framework for characterizing the temporal dynamics of human choice. A very general form of TD learning, Q-learning with a dozen free parameters, was fit to individual subject data to identify the parameters that best predicted action sequences. We discovered that, even with 300-step action sequences, many different parameter settings yielded equivalent fits, making it difficult to unambiguously recover model parameters from behavior. Previous research fitting human and mouse behavior with reinforcement learning models has used models simple enough that this issue did not arise; that research was also simpler in that it addressed asymptotic performance following learning, not the initial stages of learning. Nonetheless, the model fits predicted individual differences in performance, both within a task and across tasks.
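The task and model described above can be sketched as tabular Q-learning in a small rooms-and-doors environment. This is a minimal illustrative sketch only: the room layout, reward values, three free parameters (the fitted model had a dozen), and all names below are assumptions for illustration, not the actual task or model used in the study.

```python
import random

N_ROOMS = 4
ACTIONS = (0, 1)                 # two doors per room
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative parameter values

# Hypothetical fixed rewards and transitions for each (room, door) pair;
# in the study, rewards were likewise tied to state-action pairs.
REWARD = {(r, a): (r + a) % 3 for r in range(N_ROOMS) for a in ACTIONS}
NEXT_ROOM = {(r, a): (r + a + 1) % N_ROOMS
             for r in range(N_ROOMS) for a in ACTIONS}

def run(n_steps=300, seed=0):
    """Simulate one 300-choice session of epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    Q = {(r, a): 0.0 for r in range(N_ROOMS) for a in ACTIONS}
    room, total = 0, 0.0
    for _ in range(n_steps):
        # Epsilon-greedy choice between the two doors.
        if rng.random() < EPSILON:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(room, x)])
        r, nxt = REWARD[(room, a)], NEXT_ROOM[(room, a)]
        # Temporal-difference (Q-learning) update toward the bootstrapped target.
        td_target = r + GAMMA * max(Q[(nxt, x)] for x in ACTIONS)
        Q[(room, a)] += ALPHA * (td_target - Q[(room, a)])
        room, total = nxt, total + r
    return Q, total
```

Fitting such a model to a participant would mean searching the parameter space (here ALPHA, GAMMA, EPSILON) for values that best predict the observed 300-choice sequence; the abstract's point is that many such settings can fit equally well.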
Hal Pashler (Psychology, UCSD)