Assignments will be posted here as we go along. Programming assignments should be completed using the Python programming language.
Assignments
Extra-Credit Movie Assignment
As an extra-credit assignment, you can try to do something smarter or better than our NB approach. There are a lot of things you could do here. You might, for example, experiment with a better classifier (maxent, SVM, random forest, etc.). Or you might do some feature selection to get a better set of features for NB. The only stipulation is that the task stays the same.
Note that since the NB performance is quite high, you will have difficulty getting a big improvement in accuracy. Other improvements might involve creating smaller models that do just as well, or using vastly smaller amounts of training data to get the same effect.
I'll provide an additional test set for this part of the assignment.
Assignment 3: Sentiment via Naive Bayes
Your assignment is to implement a Naive Bayes classifier to determine the polarity of a movie review. By Naive Bayes, I have in mind the method described as Multinomial NB in the Manning et al. chapter. Note that even within this approach there are a lot of issues/parameters at play. You'll have to explore them to get reasonable performance on this data.
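To make the starting point concrete, here is a minimal sketch of multinomial NB with add-alpha (Laplace) smoothing. The function names train_nb and classify_nb, the "+"/"-" label strings, and the choice to skip words unseen in training are assumptions made for illustration, not requirements of the assignment. Treat it as a baseline to explore from, not a reference solution.

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """Train multinomial NB. docs is a list of (tokens, label) pairs.

    Returns (log_prior, log_likelihood, vocab). alpha is the add-alpha
    smoothing constant (alpha=1.0 is add-one smoothing); it is one of the
    parameters worth tuning.
    """
    class_docs = Counter()   # number of documents per class
    token_counts = {}        # per-class token frequency counters
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)

    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}

    log_likelihood = {}
    for c, counts in token_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_likelihood, vocab

def classify_nb(tokens, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior score.

    Assumption: tokens never seen in training are ignored.
    """
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)
```

Working in log space avoids floating-point underflow on multi-page reviews, where the product of thousands of small probabilities would otherwise round to zero.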
I am providing as training data a set of 1000 (500 positive/500 negative) movie reviews. The data is divided into 2 sub-directories (pos and neg). Each directory contains 500 plain text files each containing a single movie review. These reviews range from short paragraphs to multiple pages in length. The text has been downcased (unfortunately) and tokenized by spaces (all tokens including punctuation are space separated).
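As a rough sketch of reading this layout, the helper below walks the pos and neg sub-directories and splits each file on whitespace, which matches the space-separated tokenization described above. The function name load_reviews, the "+"/"-" label strings, and the UTF-8 decoding with replacement of bad bytes are assumptions, not part of the assignment.

```python
import os

def load_reviews(root):
    """Load (tokens, label) pairs from root/pos and root/neg.

    Assumption: every entry in the two sub-directories is a review file.
    """
    examples = []
    for subdir, label in (("pos", "+"), ("neg", "-")):
        folder = os.path.join(root, subdir)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                # Tokens are already space separated, so split() is enough.
                examples.append((f.read().split(), label))
    return examples
```

Since no separate development set is provided, one option is to hold out a slice of these examples, or cross-validate, when estimating performance.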
Your system will be evaluated on a withheld test set that I will supply (which will also have a 50/50 split). Given a directory of files containing reviews, your system should output, one per line, the filename of the file being processed followed by a <tab> and a + or - indicating a positive or negative review.
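The output format can be sketched as follows. The name label_directory and its classify argument (any function mapping a token list to "+" or "-") are hypothetical. The sketch also assumes the test directory is flat and that the bare filename, not the full path, is what should be printed.

```python
import os

def label_directory(test_dir, classify):
    """Return one 'filename<TAB>label' line per review file in test_dir.

    classify is any callable that maps a list of tokens to '+' or '-'.
    """
    lines = []
    for name in sorted(os.listdir(test_dir)):
        path = os.path.join(test_dir, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, encoding="utf-8", errors="replace") as f:
            tokens = f.read().split()
        lines.append(f"{name}\t{classify(tokens)}")
    return lines
```

Printing "\n".join(label_directory(test_dir, classify)) and redirecting it to a file would produce the plain text output file.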
What to Hand In
You should mail a gzipped tar file to the csci5832@gmail.com address. It should contain your code for the assignment, a short text writeup of what you did, and a plain text output file for the test instances that I will provide. Include in your writeup a description of your development/training process and your predicted performance.
This is due by 5pm on Monday May 2.
Assignment 2: Probabilistic Hashtag Segmentation
Your task in this assignment is to improve the performance of the deterministic methods you used in Assignment 1 through the use of a probabilistic language model (the details of what you try are up to you).
Your primary resource in tackling this problem is a large list of bigrams, with frequency counts, derived from the Google N-gram corpus. The most obvious approach is to use these counts to derive a bigram language model and then use that model to find the most probable segmentation given some hashtag.
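One way this obvious approach might look is sketched below. It assumes the bigram list has already been loaded into a dict mapping (word1, word2) to a count; the file format is not shown here. The fallback is a swapped-in heuristic rather than part of the assignment: when a bigram is unseen, the score backs off to a discounted unigram estimate, and unknown strings are penalized by length. The result is a score rather than a true probability. The names segment, backoff, and max_word_len are hypothetical.

```python
import math
from collections import Counter
from functools import lru_cache

def segment(hashtag, bigram_counts, max_word_len=20, backoff=0.4):
    """Return the highest-scoring split of a hashtag into words."""
    text = hashtag.lstrip("#").lower()

    # Derive unigram-style counts from the bigram table (an assumption:
    # no separate unigram list is available).
    first, second = Counter(), Counter()
    for (w1, w2), c in bigram_counts.items():
        first[w1] += c
        second[w2] += c
    total = sum(bigram_counts.values())

    def log_score(word, prev):
        pair = bigram_counts.get((prev, word), 0)
        if pair > 0:
            return math.log(pair / first[prev])       # relative frequency
        count = max(first[word], second[word])
        if count > 0:
            return math.log(backoff * count / total)  # discounted unigram
        # Unknown string: penalize heavily, more so for longer strings.
        return math.log(backoff / (total * 10 ** len(word)))

    @lru_cache(maxsize=None)
    def best(start, prev):
        """Best (score, words) for text[start:] given the previous word."""
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_word_len) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end, word)
            candidates.append((log_score(word, prev) + rest_score,
                               (word,) + rest_words))
        return max(candidates)

    return list(best(0, "<s>")[1])
```

Memoizing on (start, prev) keeps the search polynomial in the hashtag length. The backoff weight and the unknown-word penalty are exactly the kind of details left up to you.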
Improving performance in this context means reducing the error rate from the previous HW, where error rate is WER (length-normalized minimum edit distance).
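For reference, a minimal sketch of WER over word sequences follows. It assumes unit costs for insertion, deletion, and substitution, and normalization by the length of the reference segmentation. The costs used in Assignment 1 may differ, so check them against your earlier numbers.

```python
def edit_distance(ref, hyp):
    """Minimum edit distance between two word lists (unit costs assumed)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # delete r
                           cur[j - 1] + 1,            # insert h
                           prev[j - 1] + (r != h)))   # match or substitute
        prev = cur
    return prev[-1]

def wer(ref_words, hyp_words):
    """Edit distance normalized by reference length (assumed non-empty)."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

For example, scoring the hypothesis "new yorkcity" against the reference "new york city" gives one substitution and one deletion over three reference words, a WER of 2/3.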
How much credit you get for this will be based in large part on how much your system improves things (reduces WER) and how complete/clever you were in your approach. Note that I am specifically interested in approaches that employ a probabilistic language model.
This assignment is due on Thursday, February 24.