Assignment 1, Part 1: Due on Monday, September 17, 2007

Part 1

Download and install Lucene on the machine of your choice. You can use the standard Java-based Apache version, or any of the other language ports: Python, Ruby, or Perl.

Part 2

Download the Cystic Fibrosis test collection from the link given in my earlier email. Your task is to index this collection using Lucene. For the purposes of this assignment, you should include three separate fields for each document: the document ID (the RN #), the title, and the abstract or extract.
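
As a starting point for pulling out those three fields before handing them to Lucene, here is a minimal sketch in Python. It assumes, and this is an assumption about the collection files rather than something stated in this handout, that each field starts on a line with a two-letter tag (RN for the record number, TI for the title, AB or EX for the abstract or extract) and that continuation lines are indented. The function name parse_records and the dictionary keys are made up for illustration; check the layout of the files you actually download before relying on it.

```python
import re

# Assumed layout: a two-letter uppercase tag, whitespace, then the value.
# Continuation lines are assumed to start with whitespace.
TAG_LINE = re.compile(r"^([A-Z]{2})\s+(.*)$")

# Tags we keep, mapped to the (hypothetical) field names used for indexing.
WANTED = {"RN": "id", "TI": "title", "AB": "abstract", "EX": "abstract"}


def parse_records(text):
    """Return a list of dicts with the keys 'id', 'title' and 'abstract'."""
    records, current, field = [], {}, None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            # A second RN means the previous record is complete.
            if tag == "RN" and "id" in current:
                records.append(current)
                current = {}
            field = WANTED.get(tag)  # None for tags we do not index
            if field:
                current[field] = value.strip()
        elif field and line.strip():
            # Indented continuation of the field we are currently reading.
            current[field] += " " + line.strip()
    if "id" in current:
        records.append(current)
    return records


if __name__ == "__main__":
    sample = (
        "RN 00001\n"
        "TI A made-up title that wraps\n"
        "   onto a second line.\n"
        "AB A made-up abstract.\n"
    )
    for doc in parse_records(sample):
        print(doc["id"], "|", doc["title"], "|", doc["abstract"])
```

Each dictionary returned this way corresponds to one Lucene document with three fields; how you add those fields to the index depends on the Lucene port and version you installed.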

To get started on this, you might want to look at the sample demo code, which is described more fully in the Lucene in Action book.

What to Turn In

For Part 1 of this assignment, I will send out a set of 25 queries to be run against your index. You should email me a single plain text attachment with the results of each query formatted as

query-number, doc-id

pairs, one per line, for all the query results. Limit your results to the top 30 hits for each query. Given this, your output file should have no more than 750 lines (25 queries * 30 hits per query).
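
The output format above can be produced with a few lines of code. The following is a minimal sketch in Python, assuming you have already collected, for each query number, the ranked list of document IDs returned by your index. The function name format_results and the shape of the results dictionary are made up for illustration and are not part of Lucene.

```python
MAX_HITS = 30  # top hits to keep per query, as required above


def format_results(results, max_hits=MAX_HITS):
    """Format ranked hits as 'query-number, doc-id' lines.

    results maps each query number to a list of document IDs,
    best hit first (a hypothetical structure for this sketch).
    """
    lines = []
    for query_number in sorted(results):
        # Keep only the first max_hits IDs of the ranked list.
        for doc_id in results[query_number][:max_hits]:
            lines.append(f"{query_number}, {doc_id}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up hits for two queries, only to show the call.
    example = {1: ["00533", "00139"], 2: ["00169"]}
    print(format_results(example), end="")
```

With 25 queries and the 30-hit limit, the text returned by this function can never exceed 750 lines; writing it unchanged to lastname-assgn1-1.txt gives the plain-text file to attach.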

Don't tar, zip, gzip, compress, uuencode, stuffit or otherwise encode it. Just a plain-text file named lastname-assgn1-1.txt. Don't place it in-line. Make sure it is an attachment.