Your assignment is to implement a Naive Bayes classifier to determine the polarity of a movie review. By Naive Bayes, I have in mind the method described in class for WSD and as Multinomial NB in the Manning et al. chapter. Note that even within this approach there are a lot of issues/parameters at play. You'll have to explore them to get reasonable performance on this data.
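To make the starting point concrete, here is a minimal sketch of multinomial Naive Bayes with add-alpha smoothing, working in log space. The function names (train_nb, classify_nb), the alpha parameter, and the choice to ignore tokens unseen in training are illustrative assumptions, not a required design. These are exactly the kinds of choices you are expected to explore.

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. alpha is the smoothing constant (assumed)."""
    label_counts = Counter()
    token_counts = {}
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    total_docs = sum(label_counts.values())
    # Class priors: fraction of training documents carrying each label.
    log_prior = {c: math.log(n / total_docs) for c, n in label_counts.items()}
    # Smoothed per-class token likelihoods over the shared vocabulary.
    log_likelihood = {}
    for c, counts in token_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_likelihood, vocab

def classify_nb(tokens, model):
    """Return the label with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are skipped (one possible policy).
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)
```

Summing log probabilities rather than multiplying raw probabilities avoids floating-point underflow on the longer reviews.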
I am providing as training data a set of 1000 (500 positive/500 negative) movie reviews. The data is divided into 2 sub-directories (pos and neg). Each directory contains 500 plain text files each containing a single movie review. These reviews range from short paragraphs to multiple pages in length. The text has been downcased (unfortunately) and tokenized by spaces (all tokens including punctuation are space separated).
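Since the text is already tokenized by spaces, reading the data can be as simple as the sketch below. The function name load_reviews, the "+"/"-" labels attached to each directory, and the UTF-8 decoding with replacement of bad bytes are assumptions on my part; check them against the actual files.

```python
from pathlib import Path

def load_reviews(root):
    """Read root/pos and root/neg into a list of (tokens, label) pairs."""
    docs = []
    for subdir, label in (("pos", "+"), ("neg", "-")):
        for path in sorted(Path(root, subdir).iterdir()):
            if path.is_file():
                # Encoding is an assumption; undecodable bytes are replaced.
                text = path.read_text(encoding="utf-8", errors="replace")
                # Tokens are space separated, so a whitespace split suffices.
                docs.append((text.split(), label))
    return docs
```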
Your system will be evaluated on a withheld test set that I will supply (which will also have a 50/50 split). Given a directory of files containing reviews, your system should output (one per line) the filename of the file being processed followed by a <tab> and a + or - indicating a positive or negative review.
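The output format can be produced along these lines. This is only a sketch: label_directory is a hypothetical helper, and classify stands for whatever function your system uses to map a token list to "+" or "-".

```python
import sys
from pathlib import Path

def label_directory(test_dir, classify, out=sys.stdout):
    """Write one 'filename<TAB>label' line per review file in test_dir."""
    for path in sorted(Path(test_dir).iterdir()):
        if path.is_file():
            # Encoding is an assumption; undecodable bytes are replaced.
            tokens = path.read_text(encoding="utf-8", errors="replace").split()
            # classify is a caller-supplied function returning "+" or "-".
            out.write(f"{path.name}\t{classify(tokens)}\n")
```

Passing an open file as out writes the plain text output file directly instead of printing to the terminal.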
What to Hand In
You should mail me a gzipped tar file consisting of your code for the assignment, a short text writeup of what you did, and a plain text output file for the test instances that I will provide. Include in your writeup a description of your development and training process and your predicted performance.
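One reasonable way to arrive at a predicted performance figure is k-fold cross-validation over the training data; other held-out schemes are fine too. In the sketch below, cross_validate is a hypothetical helper, train and classify are whatever functions your system provides (assumed here to have the signatures train(docs) and classify(tokens, model)), and the data is assumed to contain at least k documents. The folds are shuffled but not stratified by class.

```python
import random

def cross_validate(docs, train, classify, k=10, seed=0):
    """Return mean accuracy over k folds of (tokens, label) pairs."""
    docs = list(docs)
    # Fixed seed so the reported estimate is reproducible.
    random.Random(seed).shuffle(docs)
    folds = [docs[i::k] for i in range(k)]
    accuracies = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train(training)
        correct = sum(classify(tokens, model) == label for tokens, label in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```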
This is due by 5pm on December 21, 2012.