Assignment 3: 100 Points

Due 12/5/2006

In this assignment, you will build a system that takes short (paragraph-length) documents as input and outputs, for each document, a label naming its category. Labels are drawn from the small set of categories represented in the training data.

You may solve this problem using any machine learning or probabilistic language modeling approach that you feel is appropriate.
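As an illustration only, here is a minimal sketch of one approach that would qualify: a unigram naive Bayes classifier with add-one smoothing. It is not a required or recommended method. The names tokenize, NaiveBayes, train and classify are hypothetical, as is the choice of lowercased word tokens as features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercased alphabetic words are the only features.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Unigram naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training paragraphs
        self.vocab = set()           # every word seen during training

    def train(self, labeled_docs):
        # labeled_docs is an iterable of (label, text) pairs.
        for label, text in labeled_docs:
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                # Unseen words get a count of 0, so smoothing keeps the
                # probability above zero.
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the label set is taken from the training pairs rather than written into the code, the same sketch can be retrained on a different pair of authors without modification.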

Details

I am making two collections of text available for training: the Holmes corpus from Assignment 4, and a similar corpus of Tarzan text.

Your submitted system must take a file containing paragraphs drawn from either source (separated by blank lines) and output Holmes or Tarzan as the chosen label for each paragraph: one label per line, in the same order as the paragraphs.
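The input and output format described above can be sketched as follows. This is a hedged example, not a required interface: split_paragraphs and label_paragraphs are hypothetical names, and classify stands for whatever function your system uses to map one paragraph to a label.

```python
import re


def split_paragraphs(text):
    # Paragraphs are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def label_paragraphs(text, classify):
    # One label per line, in the same order as the input paragraphs.
    return "\n".join(classify(p) for p in split_paragraphs(text))
```

For example, calling label_paragraphs on the contents of the input file and printing the result would produce output in the required shape.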

Bring a hard copy of your code, your output, and a brief description of your approach to class. Also email me your system's results on the new test data that I will provide.

You can use whatever programming language you like to solve this problem.

For the final evaluation I will make available a new set of training materials (from two new, different authors) and a set of test documents. You should train a new system on these materials and run it on the test data. You shouldn't do any further development on the system at that point; retraining it is the only change allowed.
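Since the final evaluation uses different authors, it helps if nothing in the code names Holmes or Tarzan directly. The sketch below shows one way to do that, assuming (purely for illustration) that each author's training text sits in its own .txt file and that the file name is the label. That directory layout and the function names load_corpora and labeled_paragraphs are hypothetical, not part of the assignment.

```python
from pathlib import Path


def load_corpora(directory):
    """Return {label: text}, taking each label from a .txt file's name."""
    corpora = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # Assumption: "Holmes.txt" holds the text for the label "Holmes".
        corpora[path.stem] = path.read_text(encoding="utf-8")
    return corpora


def labeled_paragraphs(corpora):
    """Turn {label: text} into a list of (label, paragraph) training pairs."""
    pairs = []
    for label, text in corpora.items():
        for block in text.split("\n\n"):
            paragraph = " ".join(block.split())
            if paragraph:  # skip empty blocks left by extra blank lines
                pairs.append((label, paragraph))
    return pairs
```

With this arrangement, retraining for the final evaluation amounts to pointing the system at a directory containing the new training files.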