Assignment 6
Probabilistic Models of
Human and Machine Learning

CSCI 7222
Fall 2012

Assigned 10/30/2012
Due  11/8/2012

Goal

The goal of this assignment is get some first-hand appreciation of the chinese restaurant process, in particular, the variant that draws samples from a Pitman-Yor Process.

Code

I've included the pseudocode in my lecture notes (pp. 13-15), though please make sure you understand the algorithm because there's always the possibility that I screwed up some notation.

Task 1

Regenerate  the first two graphs in Teh's Figure 1 (reproduced in my lecture notes on p. 16).

Plot the number of words types generated as a function of the number of word tokens drawn, using a log-log plot for
(a)  d=.5, alpha = 1, 10, and 100; and for (b) alpha = 10 and d = 0, .5, and .9.

Assume that G0 is a discrete, uniform distribution over 10^5 word types.  To be clear about the meaning of 'word type' and 'word token', in the sentence "The big boy picked up the big ball", there are 8 word tokens but only 6 word types, because the words the and big are repeated.

Task 2

Generate a plot like the Zipf's law plot on p. 17 of my lecture notes.  Rank the word types by number of tokens drawn, and plot the number of tokens as a function of rank on a log-log scale.  If Zipf's law holds for the Pitman Yor process, you should see a straightish line.  Try different parameters, but my guess is that parameters the yield a fairly straight line in Task 1 are most likely to produce Zipf's law.  For example, you might try alpha = 1 and d = .5

Task 3

Using the simpler Dirichlet Process (d = 0), pick a value of alpha and generate, say, 1000 samples from a DP each with, say, 100 customers.  For each sample, determine the total number of tables that are occupied (= number of components in the DP mixture model).  Make a histogram showing number of tables on the x-axis (which will vary from 1 to 100) and relative frequency on the y-axis.  Repeat this process for a different value of alpha, such that the two histograms look dissimilar.