Assignments will be posted here as we go along. Programming assignments should be completed using the Python programming language.
Your assignment is to implement a Naive Bayes classifier to determine the polarity of a movie review. By Naive Bayes, I have in mind the method described in class for WSD and as Multinomial NB in the Manning et al. chapter. Note that even within this approach there are a lot of issues/parameters at play. You'll have to explore them to get reasonable performance on this data.
I am providing as training data a set of 1000 (500 positive/500 negative) movie reviews. The data is divided into 2 sub-directories (pos and neg), each containing 500 plain text files, each file holding a single movie review. These reviews range from short paragraphs to multiple pages in length. The text has been downcased (unfortunately) and tokenized by spaces (all tokens, including punctuation, are space-separated).
Your system will be evaluated on a withheld test set that I will supply (which will also have a 50/50 split). Given a directory of files containing reviews, your system should output (one per line) the filename of the file being processed followed by a <tab> and a + or - indicating a positive or negative review.
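To make the expected shape of a solution concrete, here is a minimal Python sketch of the pipeline; the function names are mine, and add-one smoothing with 0.5/0.5 priors are illustrative starting points, not requirements:

    import math
    import os
    from collections import Counter

    def train(pos_dir, neg_dir):
        """Collect per-class token counts from the training reviews."""
        counts = {"+": Counter(), "-": Counter()}
        for label, path in (("+", pos_dir), ("-", neg_dir)):
            for name in os.listdir(path):
                with open(os.path.join(path, name)) as f:
                    counts[label].update(f.read().split())  # text is already space-tokenized
        return counts

    def classify(tokens, counts, priors={"+": 0.5, "-": 0.5}):
        """Multinomial NB in log space with add-one smoothing."""
        vocab = set(counts["+"]) | set(counts["-"])
        scores = {}
        for label in ("+", "-"):
            total = sum(counts[label].values())
            scores[label] = math.log(priors[label]) + sum(
                math.log((counts[label][t] + 1) / (total + len(vocab))) for t in tokens)
        return max(scores, key=scores.get)

    def label_directory(test_dir, counts):
        """Emit the required "<filename><tab><+/->" line for each test file."""
        for name in sorted(os.listdir(test_dir)):
            with open(os.path.join(test_dir, name)) as f:
                print(f"{name}\t{classify(f.read().split(), counts)}")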
Your task is to implement the transition-based dependency parsing algorithm discussed in class. In the transition-based paradigm, a parser searches through a state space seeking a final state that satisfies some criteria. The search is implemented via a set of operators that transform states into new states.
In the scheme discussed in class, states consist of a stack, a word list and a set of relations that represent the parse. In the initial state, the stack consists of just a dummy 'root' node, the word list contains the sentence to be parsed, and the set of relations is empty; in a final state, the stack contains only the 'root' node, the word list is empty, and the set of relations represents the parse.
So for an example like "We want a shrubbery" the initial state would look like this (assuming some Python-like data structures):
[root], [we, want, a, shrubbery], []
and a reasonable final state looks like this:
[root], [], [(want, we), (want, shrubbery), (shrubbery, a)]
To transition through this space from the start state to some final state, the parser makes use of three operators defined as follows:
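In the standard arc-standard scheme (assumed here; the definitions given in class govern if they differ):

SHIFT: remove the first word from the word list and push it onto the stack.
LEFTARC: assert a (head, dependent) relation between the word at the top of the stack and the word beneath it, and remove that lower word from the stack.
RIGHTARC: assert a (head, dependent) relation between the second word on the stack and the word at the top, and remove the top word from the stack.

A minimal Python sketch of these operators (the in-place list/set representation is an illustrative choice):

    def shift(stack, words, rels):
        """SHIFT: move the first word of the word list onto the stack."""
        stack.append(words.pop(0))

    def leftarc(stack, words, rels):
        """LEFTARC: the top of the stack heads the word beneath it."""
        dep = stack.pop(-2)            # remove the second item from the top
        rels.add((stack[-1], dep))     # relations are (head, dependent) pairs

    def rightarc(stack, words, rels):
        """RIGHTARC: the second word on the stack heads the top word."""
        dep = stack.pop()
        rels.add((stack[-1], dep))

    # One operator sequence that parses the example above:
    stack, words, rels = ["root"], ["we", "want", "a", "shrubbery"], set()
    for op in (shift, shift, leftarc, shift, shift, leftarc, rightarc, rightarc):
        op(stack, words, rels)
    # stack == ["root"], words == [], and rels holds the three relations shown
    # above plus ("root", "want") from the final RIGHTARC that attaches the root.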
Your task in this assignment is to improve the performance of the deterministic methods you used in Assignment 1 through the use of a probabilistic language model (the details of what you try are up to you).
Your primary resource in tackling this problem is a large list of bigrams, with frequency counts, derived from the Google N-gram corpus. The most obvious approach is to use these counts to derive a bigram language model and then use that model to find the most probable segmentation given some hashtag.
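For instance, a smoothed bigram scorer might look like the following sketch; add-one smoothing and the <s> start symbol are illustrative assumptions, and bigrams/unigrams are plain count dictionaries built from the provided list:

    import math

    def bigram_logprob(words, bigrams, unigrams, vocab_size):
        """Log probability of a word sequence under an add-one-smoothed bigram model."""
        score = 0.0
        for w1, w2 in zip(["<s>"] + words, words):
            score += math.log((bigrams.get((w1, w2), 0) + 1) /
                              (unigrams.get(w1, 0) + vocab_size))
        return score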
The main missing element from most of your solutions to the previous assignment is the ability to find more than one segmentation. If we're going to use a language model to rank multiple segmentations, then we'll need multiple segmentations to compare.
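One way to get them, sketched under the same assumptions as above (recursive enumeration over a lexicon, ranked with bigram_logprob from the previous sketch; for long tags a Viterbi-style dynamic program over split points is the more efficient alternative):

    def all_segmentations(tag, lexicon, max_word_len=20):
        """Enumerate every way to split tag into lexicon words."""
        if not tag:
            return [[]]
        segs = []
        for i in range(1, min(len(tag), max_word_len) + 1):
            if tag[:i] in lexicon:
                for rest in all_segmentations(tag[i:], lexicon, max_word_len):
                    segs.append([tag[:i]] + rest)
        return segs

    def best_segmentation(tag, lexicon, bigrams, unigrams, vocab_size):
        """Rank the candidate segmentations by bigram log probability."""
        candidates = all_segmentations(tag, lexicon)
        if not candidates:
            return [tag]               # fall back to the unsegmented hashtag
        return max(candidates,
                   key=lambda s: bigram_logprob(s, bigrams, unigrams, vocab_size))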
Improving performance in this context means reducing the error rate from the previous HW, where error rate is WER (length-normalized minimum edit distance).
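Concretely, with equal-cost insertions, deletions, and substitutions, a WER scorer is only a few lines (a sketch; hyp and ref are token lists, and ref is assumed non-empty):

    def wer(hyp, ref):
        """Word error rate: edit distance between token lists / reference length."""
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                d[i][j] = min(d[i - 1][j] + 1,                                # deletion
                              d[i][j - 1] + 1,                                # insertion
                              d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))   # substitution
        return d[-1][-1] / len(ref)

    # e.g. wer("now thats a hashtag".split(), "now that's a hashtag".split()) == 0.25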
How much credit you get for this will be based in large part on how much your system improves things (reduces WER) and on how complete/clever your approach is. Note that I am specifically interested in approaches that employ a probabilistic language model.
Your task in this assignment is to implement and explore the behavior of a deterministic segmenter for Twitter hashtags like the following:
The goal is to use a lexicon (a word list) to drive the segmentation of such tags into corresponding correct sequences of English words.
Part 1: 40 points
In this part of the assignment you should implement and explore the behavior of the MaxMatch algorithm as described in Chapter 3. Your first sub-task is to acquire an appropriate list of words. As a starting point, use the 75,000 most frequent words from a Google-derived word list. Note that the full list has 333,333 entries; use only the top 75,000 for this part of the assignment.
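A minimal MaxMatch sketch, assuming the lexicon is already loaded as a set of those 75,000 entries (falling back to a single character when no lexicon word matches is one common convention; other choices are defensible):

    def maxmatch(tag, lexicon, max_word_len=20):
        """Greedy longest-match-first segmentation (MaxMatch)."""
        words = []
        while tag:
            for i in range(min(len(tag), max_word_len), 0, -1):
                if tag[:i] in lexicon or i == 1:   # no match: emit one character
                    words.append(tag[:i])
                    tag = tag[i:]
                    break
        return words

The greedy longest match is exactly where the interesting failures come from: an early overlong match can derail the rest of the tag, as in the textbook's "thetabledownthere" example, which MaxMatch renders as "theta bled own there".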
Once you've implemented your system, you should analyze its behavior. When it fails, what is the source of the failure? How many different kinds of frequent failures are there? You should develop your system on this somewhat representative corpus of hashtags. I'll leave it to you to figure out a set of reasonable reference answers.