Homework 3:  k-NN Classification  Due 10/25/11

In this assignment, you are to create a document classification system that roughly approximates a k-NN algorithm using Lucene. More specifically, you are to adapt your Lucene index of our medical collection so that it supports assigning MeSH terms (the .M field) to new documents.

MeSH Terms

MeSH terms are topics drawn from a fixed vocabulary created by the National Library of Medicine. The topics assigned to our documents appear as the .M field in the corpus. To make things a bit easier, I'm providing a qrels-like file that maps the MeSH identifiers to the documents in the collection to which they are relevant (i.e., you don't actually need to parse or deal with the .M fields). There are 4904 MeSH topics represented in this collection. Here's an example from the MeSH qrels file:

MSH15 87061364
MSH15 87111396
MSH15 87154830
MSH15 87154836
MSH15 87259306
MSH15 87303180
MSH15 87325781

These lines indicate that this MeSH term (whatever it is) has been assigned to these 7 documents (only these 7).
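As a rough illustration of how the qrels-like file might be loaded, here is a minimal Python sketch. It assumes each line holds exactly two whitespace-separated fields (MeSH identifier, then document id), as in the sample above; the function name load_mesh_qrels and the two dictionaries it returns are hypothetical choices, not part of the assignment.

```python
from collections import defaultdict
import io

def load_mesh_qrels(lines):
    """Build term -> docs and doc -> terms maps from 'MESH_ID DOC_ID' lines.

    Assumes two whitespace-separated columns per line, as in the sample.
    """
    term_to_docs = defaultdict(set)
    doc_to_terms = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        mesh_id, doc_id = parts
        term_to_docs[mesh_id].add(doc_id)
        doc_to_terms[doc_id].add(mesh_id)
    return term_to_docs, doc_to_terms

# The first three lines of the sample, read from an in-memory buffer.
sample = """MSH15 87061364
MSH15 87111396
MSH15 87154830
"""
term_to_docs, doc_to_terms = load_mesh_qrels(io.StringIO(sample))
```

With a real file you would pass an open file handle instead of the in-memory sample. The doc_to_terms map is the one a k-NN approach is likely to need, since it gives the known labels of each neighboring document.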

The Task

The functionality you are to create involves assigning MeSH terms to new documents. I'll provide a fresh set of documents from our collection to use as a test set. A k-NN approach to this task classifies new documents by using the MeSH terms assigned to similar documents. How exactly you do this is up to you.
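One possible way to turn neighbors into labels is a score-weighted vote, sketched below in Python. This is only one option among many, not a prescribed method. It assumes you have already retrieved a ranked list of (doc-id, similarity) pairs for the new document, for instance by using its text as a query against your index; the function name assign_terms and the parameters k and threshold are illustrative assumptions, and the document ids in the example are made up.

```python
from collections import defaultdict

def assign_terms(neighbors, doc_to_terms, k=10, threshold=0.3):
    """Pick MeSH terms by a score-weighted vote over the top-k neighbors.

    neighbors: list of (doc_id, similarity) pairs, best match first.
    doc_to_terms: dict mapping doc_id -> set of MeSH ids.
    A term is kept if its share of the total neighbor score >= threshold.
    """
    top = neighbors[:k]
    total = sum(score for _, score in top)
    if total <= 0:
        return []
    votes = defaultdict(float)
    for doc_id, score in top:
        for term in doc_to_terms.get(doc_id, ()):
            votes[term] += score
    # Highest vote first; ties broken by term id for a stable order.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, vote in ranked if vote / total >= threshold]

# Toy example with made-up ids and scores.
toy_terms = {"d1": {"MSH15", "MSH2140"}, "d2": {"MSH15"}, "d3": {"MSH954"}}
toy_neighbors = [("d1", 2.0), ("d2", 1.0), ("d3", 1.0)]
chosen = assign_terms(toy_neighbors, toy_terms, k=3, threshold=0.5)
# chosen == ["MSH15", "MSH2140"]
```

Raising the threshold or lowering k makes the assignment more conservative; lowering the threshold assigns more terms per document.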

Evaluation

Since I know the correct answers for all the documents in the collection, the evaluation will consist of precision and recall metrics with respect to these answers.

If you adopt a conservative approach that assigns only a few high-confidence terms, your system should do well on precision; assign boatloads of terms to all documents and you'll do well on recall. I'll combine recall and precision with an F1 measure.
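To check your own system on held-out documents, the three measures can be computed as in the Python sketch below. It assumes scoring is done over the pooled set of (doc-id, mesh-id) pairs, which may differ from how the official evaluation averages; the function name precision_recall_f1 and the toy pairs are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over sets of (doc_id, mesh_id) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 3 predicted pairs, 4 gold pairs, 2 in common.
gold_pairs = {("d1", "MSH1"), ("d1", "MSH2"), ("d1", "MSH3"), ("d2", "MSH4")}
pred_pairs = {("d1", "MSH1"), ("d2", "MSH4"), ("d2", "MSH5")}
p, r, f1 = precision_recall_f1(pred_pairs, gold_pairs)
# p = 2/3, r = 2/4, f1 = 4/7
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so neither the very conservative nor the boatloads strategy scores well on its own.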

What to Hand In

Your output should consist of doc-id mesh-id pairs, one pair per line, with one line for each MeSH topic that you assign. For example:

87169270 MSH2140
87269956 MSH2233
87269956 MSH2730
87269956 MSH954

...

There will be 100 documents in the test set.
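The output format described above can be produced with a few lines of Python. This is a minimal sketch; the function name write_assignments and the shape of the assignments dictionary (doc-id mapped to a list of MeSH ids) are assumptions made for illustration.

```python
import io

def write_assignments(assignments, out):
    """Write one 'doc-id mesh-id' pair per line to the file-like object out."""
    for doc_id, terms in assignments.items():
        for mesh_id in terms:
            out.write(f"{doc_id} {mesh_id}\n")

# Reproduces the four example lines shown above.
example = {
    "87169270": ["MSH2140"],
    "87269956": ["MSH2233", "MSH2730", "MSH954"],
}
buffer = io.StringIO()
write_assignments(example, buffer)
text = buffer.getvalue()
```

To write the actual submission, pass a file opened for writing in place of the in-memory buffer.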

As with the previous assignments, email me your output, your code, and a brief description of what you did to get your results, all as attachments. Your output file should be named

<last-name>-assgn4-out.txt

as in 

hawkins-assgn4-out.txt

Due Date:  October 25, 2011


© James H. Martin, 2011