Assignments

Assignments will be posted here as we go along. Programming assignments should be completed using the Python programming language.

Extra-Credit Movie Assignment

As an extra-credit assignment you can try to do something smarter/better with than our NB approach.  There are a lot of things you could do here.  You might for example experiment with a better classifier (maxent, svm, random forest, etc.)  Or you might do some feature selection to get a better set of features for NB.  The only stipulation is that the task stays the same.

Note that since the NB performance is quite high you will have difficulty getting a big improvement here on accuracy.  Other improvements might involve creating smaller models that do just as well. Or using vastly smaller amounts of training data to get the same effect. 

I'll provide an additional test set for this part of the assignment. 

Assignment 3: Sentiment via Naive Bayes

Your assignment is to implement a Naive Bayes classifier to determine the polarity behind a movie review. By Naive Bayes, I have in mind the the method described as Multinomial NB in the Manning et al. chapter.  Note that even within this approach there are a lot of issues/parameters at play. You'll have to explore them to get reasonable performance on this data. 

I am providing as training data a set of 1000 (500 positive/500 negative) movie reviews.  The data is divided into 2 sub-directories (pos and neg). Each directory contains 500 plain text files each containing a single movie review. These reviews range from short paragraphs to multiple pages in length.  The text has been downcased (unfortunately) and tokenized by spaces (all tokens including punctuation are space separated).  

Your system will be evaluated on a withheld test set that I will supply (which will also have a 50/50 split). Given a directory of files containing reviews your system should output (one per line) the filename of the file being processed followed by a <tab> and a + or - indicating  a positive or negative review.

What to Hand In

You should mail to the csci5832@gmail.com address, a gzipped tar file consisting of your code for the assignment, a short text writeup of what you did, and a plain text output file for the test instances that I will provide.  Include in your writeup a description of your development training process and you're predicted performance.

This is due by 5pm on Monday May 2.

Assignment 2: Probabilistic Hashtag Segmentation

Your task in this assignment is to improve the performance of the deterministic methods you used in Assignment 1  through the use of a probabilistic language model (the details of what you try are up to you). 

Your primary resource in tackling this problem is a large list of bigrams, with frequency counts, derived from the Google N-gram corpus. The most obvious approach is to use these counts to derive a bigram language model and then use that model to find the most probable segmentation given some hashtag.

Improving performance in this context, means reducing the error rate from the previous HW, where error rate is WER (length normalized minimum edit distance). 

How much credit you get for this will be based in large part on how much your system improves things (reduces WER) and how complete/clever you were in your approach. Note that I am specifically interested in approaches that employ a probabilistic language model.

This assignment is due on Thursday, February 24. 

Whattohandin

 As with the first assignment, I will post a new test-set with answers. You should run your system (as is at that time) on that test-set and send the output to a plain text file with one segmentation per line.  Don't include any extraneous output or decoration in this file. Then send me an email with the following attachments:
1. Your code as <yourlastname>-assgn1-extra.py
2.Output as <yourlastname>-out-extra.txt
3 A short description of what you did, including the WER of the new system.




 

Assignment 1: Deterministic English Segmentation

Your task in this assignment is to implement and explore the behavior of a deterministic segmenter for Twitter hashtags like the following:

#30secondstomars
#americanidol
#hurricaneike
#celebraterandommilestones
#entrepreneurship
#firstdayofkindergarden
#30secondstomars

The goal is to use a lexicon (a word list) to drive the segmentation of such tags into corresponding correct sequences of English words.

Part 1: 40 points
In this part of the assignment you should implement and explore the behavior of the MaxMatch algorithm as described in Chapter 3.  Your first sub-task here will be to acquire an appropriate list of words.  As a starting point, you should use the  first 75,000 most frequent words from a Google-derived list of words. Note this list has 333,333 entries; just use the top 75,000 for this part of the assignment.

Once you've implemented your system you should analyze its behavior.  When it fails what is the source of the failure?  How many different kinds of frequent failures are there? You should develop your system on this somewhat representative corpus of hashtags. I'll leave it to you to figure out a set of reasonable reference answers.

Note both the Google list and the hashtag corpus are raw and unfiltered. There's likely to be stuff in there to offend all manner of sensibilities.

Part 2: 20 points
Before you can attempt to improve your segmenter, we'll need a more rigorous figure of merit to tell us how well we're doing. There are a lot of ways to approach the problem of evaluating taggers; we'll adopt one from the automatic speech recognition community called Word Error Rate (WER).  The first step in computing WER is to compute the minimum number of edits (deletions, insertions and substitutions) to get from the system's hypothesis to the actual correct reference answer.  WER is then just the length normalized minimum edit distance (i.e, minimum edit distance divided by the length of the reference). 

Your assignment is to take a nearly complete version of minimum edit distance and create a working WER that takes two files as input (hypothesized answers and gold answers) and returns the average WER across that test set.

Part 3: 40 points
In this part of the assignment, you should attempt to improve things by tweaking the original algorithm or altering  the lexicon or some combination of both.  You should base this improvement on the failure modes you've identified in Part 1.  Evaluate your changes using your WER code. Don't consider the use of probabilities in any form at this point; that includes using the position of a word in the word list. Limit yourself to considerations such as (1) changing the MaxMatch strategy, (2) changing the greedy nature of its heuristic, or (3) manipulations to your lexicon.

Miscellaneous and Whattohandin

Don't include the hashtag in your output.  If the input is 
#thebetrayed
the output should be
the betrayed

Ignore case. If the input has upper case just treat it like its all lowercase. So
#AmericanIdol
is the same as 
#americanidol

I will post a test set shortly before the due date.  You should run your system (as is at that time) and send the output to a plain text file with one segmentation per line.  Don't include any extraneous output or decoration in this file. Then send me an email with the following attachments:

1. Your code as <yourlastname>-assgn1.py
2.Output as <yourlastname>-out-assgn1.txt
3 A short description of what you tried to do in Part 3, including your measured WER on the test set.

This is due on February 1 by the start of class.