In this assignment, you are to create a document classification system that roughly approximates a k-NN algorithm using Lucene. More specifically, you are to adapt your Lucene index of our medical collection so that it can assign MeSH terms (the .M field) to new documents.
MeSH terms are topics drawn from a fixed vocabulary created by the National Library of Medicine. The topics assigned to our documents appear as the .M field in the corpus. To make things a bit easier, I'm providing a qrels-like file that maps each MeSH identifier to the documents in the collection to which it is relevant, so you don't actually need to parse or deal with the .M fields. There are 4904 MeSH topics represented in this collection. Here's an example from the MeSH qrels file.
These lines indicate that this MeSH term (whatever it is) has been assigned to these 7 documents (only these 7).
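Since classification needs the opposite lookup (document to MeSH terms), one way to load the file is sketched below. This is only a sketch under an assumption: it guesses that each line follows the usual TREC qrels layout of `mesh-id iteration doc-id relevance`, so adjust the column positions if the provided file differs. The class name `MeshQrels` and method `termsByDoc` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MeshQrels {
    // ASSUMPTION: TREC-style columns "mesh-id iteration doc-id relevance".
    // Change the indices below if the provided file uses another layout.
    static Map<String, Set<String>> termsByDoc(List<String> lines) {
        Map<String, Set<String>> byDoc = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 4 || cols[3].equals("0")) {
                continue; // skip blank, malformed, or non-relevant lines
            }
            // Invert the mapping: collect every MeSH id under its document id.
            byDoc.computeIfAbsent(cols[2], k -> new TreeSet<>()).add(cols[0]);
        }
        return byDoc;
    }
}
```

The lines themselves could be read with `java.nio.file.Files.readAllLines` before being passed in.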
The functionality you are to create involves assigning MeSH terms to new documents. I'll provide a fresh set of documents from our collection to use as a test set. A k-NN approach to this task classifies a new document by using the MeSH terms assigned to similar documents. How exactly you do this is up to you.
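To make the k-NN idea concrete, here is one possible voting scheme, offered as a sketch and not as the required method. It assumes you have already run the new document against your index and hold a ranked list of neighbours with similarity scores. Each of the top k neighbours votes for its own MeSH terms with a weight equal to its score, and a term is assigned when its share of the total vote reaches a threshold. The names `KnnMeshVoter`, `Neighbor`, `assign`, `k`, and `minShare` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KnnMeshVoter {
    /** One retrieved neighbour: its document id and similarity score. */
    static final class Neighbor {
        final String docId;
        final double score;
        Neighbor(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // ASSUMPTION: 'ranked' is sorted best-first and scores are non-negative.
    static List<String> assign(List<Neighbor> ranked,
                               Map<String, Set<String>> termsByDoc,
                               int k, double minShare) {
        Map<String, Double> votes = new HashMap<>();
        double total = 0.0;
        int used = Math.min(k, ranked.size());
        for (Neighbor n : ranked.subList(0, used)) {
            total += n.score;
            // Each neighbour votes for all of its terms, weighted by score.
            for (String term : termsByDoc.getOrDefault(n.docId, Set.of())) {
                votes.merge(term, n.score, Double::sum);
            }
        }
        List<String> assigned = new ArrayList<>();
        if (total <= 0.0) {
            return assigned; // no usable neighbours, so assign nothing
        }
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() / total >= minShare) {
                assigned.add(e.getKey());
            }
        }
        Collections.sort(assigned); // deterministic output order
        return assigned;
    }
}
```

Raising `minShare` yields fewer, higher-confidence terms; lowering it (or raising `k`) yields more terms per document.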
Since I know the correct answers for all the documents in the collection, the evaluation will consist of precision and recall metrics with respect to these answers.
If you adopt a conservative approach that assigns only a few high-confidence terms, your system should do well on precision; assign boatloads of terms to every document and you'll do well on recall. I'll combine recall and precision with an F1 measure.
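If you want to score your own runs against the qrels data, the sketch below computes precision, recall, and F1 = 2PR / (P + R) for a single document. It is an assumption that scoring is done per document like this; how the scores are averaged across documents in the actual evaluation is not specified here. The class name `F1Score` is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class F1Score {
    /** Returns {precision, recall, f1} for one document's assigned terms. */
    static double[] score(Set<String> assigned, Set<String> correct) {
        // Terms that were both assigned and correct.
        Set<String> hits = new HashSet<>(assigned);
        hits.retainAll(correct);
        // Define each ratio as 0 when its denominator would be 0.
        double p = assigned.isEmpty() ? 0.0 : (double) hits.size() / assigned.size();
        double r = correct.isEmpty() ? 0.0 : (double) hits.size() / correct.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }
}
```

For instance, assigning four terms of which two are the only correct ones gives precision 0.5 and recall 1.0, which illustrates the trade-off described above.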
What to Hand In
Your output should consist of doc-id mesh-id pairs, one pair per line, for as many MeSH topics as you assign. As in
There will be 100 documents in the test set.
As with the previous assignments, attach to an email your output, your code, and a brief description of what you did to get your results. Your output file should be named
Due Date: October 25, 2011