Homework 2 - Part 1: Due 9/20/2011

For this assignment you are to use Lucene to create a basic index of a collection of medical abstracts and evaluate its performance on a representative set of queries and reference relevance judgments.


Go to lucene.apache.org to get the latest Lucene release (or go to the site for one of the Lucene ports to another language). You can find the text files that you need to get started in the following directory:


There you'll find three files: medical.txt.gz, queries.txt, and qrels.txt. The first is your corpus of documents (54,710 records; about 20 MB compressed). The second is a set of 63 queries that you'll use to test your basic Lucene index. The qrels file contains the query/document relevance judgments. Each line consists of a query ID followed by the number of a document that is relevant to that query.
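As a starting point for working with the qrels file, here is a minimal Python sketch that collects the relevant document numbers for each query. It assumes the two values on each line are separated by whitespace, which the description above does not state explicitly. The function name parse_qrels and the sample lines are made up for illustration; they are not taken from the real file.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query ID to the set of document IDs judged relevant.

    Assumes each line is '<query-id> <document-id>' separated by whitespace.
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        query_id, doc_id = parts[0], parts[1]
        relevant[query_id].add(doc_id)
    return dict(relevant)


# Made-up sample lines, for illustration only.
sample = ["OHSU1 87073567", "OHSU1 87210457", "OHSU2 87316317"]
qrels = parse_qrels(sample)

# With the real file you would pass an open file handle instead:
#   with open("qrels.txt") as f:
#       qrels = parse_qrels(f)
```

Keeping the relevant documents in a set per query makes the later relevance lookups cheap.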


Once you've installed Lucene, create an index of the collection. There are several things to keep track of here:

  • Be sure to keep track of the actual document ID (the .U field); you will need it for the evaluation.
  • You must index the contents of the document abstracts (the .W field); you may also index the other fields (titles, MeSH terms, and so on) if you like.
  • Tokenization, stemming, stop-lists, and synonymy do have an effect on performance. Explore the options but don't go crazy.
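Before anything can be indexed, the records have to be split into fields. The Python sketch below shows one way to do that, under an assumption you should check against the actual file: each record starts with a line beginning ".I", and each other field marker (such as .U or .W) sits on its own line with the field's text on the following line or lines. The function name parse_records and the sample record contents are invented for illustration.

```python
def parse_records(lines):
    """Yield one dict per record, keyed by field marker such as '.U' or '.W'.

    Assumed layout (verify against the real file): a line starting with '.I'
    opens a record; a two-character line like '.U' opens a field; the lines
    that follow hold that field's text.
    """
    record, field = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(".I"):
            if record is not None:
                yield record
            record, field = {".I": line[2:].strip()}, None
        elif (record is not None and len(line) == 2
              and line[0] == "." and line[1].isupper()):
            field = line
            record[field] = ""
        elif record is not None and field is not None:
            # Join continuation lines of the current field with a space.
            record[field] = (record[field] + " " + line).strip()
    if record is not None:
        yield record


# Invented sample records, for illustration only.
sample = [
    ".I 1", ".U", "87073567", ".T", "A made-up title",
    ".W", "A made-up abstract", "spanning two lines.",
    ".I 2", ".U", "87210457", ".W", "Another made-up abstract.",
]
docs = [(r[".U"], r.get(".W", "")) for r in parse_records(sample)]

# With the real corpus you could stream the compressed file directly:
#   import gzip
#   with gzip.open("medical.txt.gz", "rt") as f:
#       for rec in parse_records(f): ...
```

Each (document ID, abstract) pair is then what you hand to your indexer, storing the ID so it can be read back from the hits.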

Query Processing

Keep in mind that your choices in indexing must match your choices in query processing. That is, apply the same tokenization and stemming to the queries that you apply to the documents; otherwise you cannot expect to get any matches.
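The point is easiest to see with a toy example. The Python sketch below routes both document text and query text through one shared analysis function, so the two sides always produce comparable terms. The stop list and the plural-stripping rule are deliberately crude stand-ins chosen for illustration; they are not Lucene's analyzers. In Lucene the same idea usually amounts to giving the index writer and the query parser the same analyzer.

```python
import re

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "with"}


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, strip a final 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if len(tok) > 3 and tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]  # crude stand-in for a real stemmer
        terms.append(tok)
    return terms


# The same function is applied on the indexing side and the query side.
doc_terms = analyze("Infections in patients with diabetes")
query_terms = analyze("diabetes patient infection")
```

Note that the crude rule turns "diabetes" into "diabete" on both sides. The stem is not a real word, but the query still matches because the document went through the identical transformation.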


Run the queries against your index and keep track of up to the top 50 hits for each query. Implement the R-Precision metric from the text and use it to evaluate your system (by consulting the qrels file). Finally, include an attachment with the top 50 hits for each query, one hit per line, giving the query ID and document ID in the hit order provided by Lucene, as in the following:

OHSU1 87073567 
OHSU1 87210457 
OHSU1 87316317 
OHSU1 87105372 
OHSU1 87157537

Do not include any extraneous blank lines, salutations, or other adornments in this file. Please name the file <last-name>-assgn3-out.txt.
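To make the evaluation and output steps concrete, here is a minimal Python sketch. It uses the common definition of R-Precision: if a query has R relevant documents, the score is the fraction of the top R hits that are relevant. Check that against the definition in the text before relying on it. Because only the top 50 hits are kept, a query with more than 50 relevant documents is scored as if every rank past 50 were non-relevant. The function names and the relevance judgments below are invented for illustration.

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Fraction of the top R hits that are relevant, where R = number of relevant docs."""
    r = len(relevant_doc_ids)
    if r == 0:
        return 0.0  # no judgments for this query; treated as zero here
    top_r = ranked_doc_ids[:r]
    hits = sum(1 for doc_id in top_r if doc_id in relevant_doc_ids)
    return hits / r


def format_hits(query_id, ranked_doc_ids, limit=50):
    """One 'query-id document-id' line per hit, in rank order, at most `limit` lines."""
    return [f"{query_id} {doc_id}" for doc_id in ranked_doc_ids[:limit]]


# Ranked hits in the order of the example listing; the relevance set is made up.
ranked = ["87073567", "87210457", "87316317", "87105372", "87157537"]
relevant = {"87073567", "87316317", "99999999"}

score = r_precision(ranked, relevant)   # R = 3; two of the top three hits are relevant
lines = format_hits("OHSU1", ranked)

# Writing the attachment without blank lines or other adornments:
#   with open("<last-name>-assgn3-out.txt", "w") as out:
#       out.write("\n".join(lines) + "\n")
```

To report a single figure for the whole system, one option is to average the per-query scores over all 63 queries.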

© James H. Martin, 2011