Assignment 1: Deterministic English Segmentation

 Your task in this assignment is to implement and explore the behavior of a deterministic segmenter for Twitter hashtags like the following:


#30secondstomars

#americanidol

#hurricaneike

#celebraterandommilestones

#entrepreneurship

#firstdayofkindergarden


The goal is to use a lexicon (a word list) to drive the segmentation of such tags into corresponding correct sequences of English words.

Part 1: 40 points

In this part of the assignment you should implement and explore the behavior of the MaxMatch algorithm as described in Chapter 3.  Your first sub-task here will be to acquire an appropriate list of words.  As a starting point, you should use the  first 75,000 most frequent words from a Google-derived list of words. Note this list has 333,333 entries; just use the top 75,000 for this part of the assignment.

Once you've implemented your system you should analyze its behavior.  When it fails what is the source of the failure?  How many different kinds of frequent failures are there? You should develop your system on this somewhat representative corpus of hashtags. I'll leave it to you to figure out a set of reasonable reference answers.

Note both the Google list and the hashtag corpus are raw and unfiltered. There's likely to be stuff in there to offend all manner of sensibilities.

Part 2: 20 points

Before you can attempt to improve your segmenter, we'll need a more rigorous figure of merit to tell us how well we're doing. There are a lot of ways to approach the problem of evaluating taggers; we'll adopt one from the automatic speech recognition community called Word Error Rate (WER).  The first step in computing WER is to compute the minimum number of edits (deletions, insertions and substitutions) to get from the system's hypothesis to the actual correct reference answer.  WER is then just the length normalized minimum edit distance (i.e, minimum edit distance divided by the length of the reference in terms of words). 

Your assignment is to take a nearly complete version of minimum edit distance and create a working WER that takes two files as input (hypothesized answers and gold answers) and returns the average WER across that test set.

Part 3: 40 points

In this part of the assignment, you should attempt to improve things by tweaking the original algorithm or altering  the lexicon or some combination of both.  You should base this improvement on the failure modes you've identified in Part 1.  Evaluate your changes using your WER code. Don't consider the use of probabilities in any form at this point; that includes using the position of a word in the word list. Limit yourself to considerations such as (1) changing the MaxMatch strategy, (2) changing the greedy nature of its heuristic, or (3) manipulations to your lexicon.

Miscellaneous and Whattohandin

Don't include the hashtag in your output.  If the input is 

#thebetrayed

the output should be

the betrayed

Ignore case. If the input has upper case just treat it like its all lowercase. So

#AmericanIdol

is the same as 

#americanidol

I will post a test set shortly before the due date.  You should run your system (as is at that time) and send the output to a plain text file with one segmentation per line.  Don't include any extraneous output or decoration in this file. Then send me an email with the following attachments:

1. Your code as <yourlastname>-assgn1.py

2.Output as <yourlastname>-out-assgn1.txt

3 A short description of what you tried to do in Part 3, including your measured WER on the test set.


This is due on September 20 by 5PM.