Assignment 1, Part 1: Due on Monday, September 17, 2007
Part 1
Download and install Lucene on the machine of your choice. You can use
the standard Java-based Apache version, or you can use any of the
other language ports: Python, Ruby, or Perl.
Part 2
Download the Cystic Fibrosis test collection from the link given in my
earlier email. Your task is to index this collection using
Lucene. For the purposes of this assignment you should include 3
separate fields for each doc: the document ID (the RN #), the title,
and the abstract or extract.
To get started on this you might want to look at the description of
the sample demo code, which is covered more fully in the
Lucene in Action book.
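Before anything can be indexed, the three fields have to be pulled out of the collection files. The following is only a rough sketch of that step in Python, not part of the Lucene demo code. It assumes each record is a run of lines that start with a two-letter tag (RN for the record number, TI for the title, AB or EX for the abstract or extract) and that indented lines continue the previous field. Those tag names and the layout are assumptions, so check them against the files you actually downloaded. The function name parse_records and the dictionary keys are made up for illustration.

```python
import re

# ASSUMPTION: a field starts with a two-letter uppercase tag and a space.
TAG_LINE = re.compile(r"^([A-Z]{2}) (.*)$")

# ASSUMPTION: these tags carry the three fields the assignment asks for.
WANTED = {"RN": "id", "TI": "title", "AB": "abstract", "EX": "abstract"}


def parse_records(text):
    """Return a list of dicts with 'id', 'title', and 'abstract' keys."""
    records, current, field = [], {}, None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            # A second RN means the previous record is complete.
            if tag == "RN" and "id" in current:
                records.append(current)
                current = {}
            field = WANTED.get(tag)  # None for tags we do not index
            if field:
                current[field] = value.strip()
        elif field and line.strip():
            # Continuation line: append it to the field in progress.
            current[field] += " " + line.strip()
    if "id" in current:
        records.append(current)
    # Collapse runs of whitespace left over from the line wrapping.
    return [{k: " ".join(v.split()) for k, v in r.items()} for r in records]
```

Each dictionary it returns could then be turned into one Lucene document with three fields, using whichever port you installed.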
What to Turn In
For Part 1 of this assignment, I will send out a set of 25 queries to be
run against your index. You should email me a single plain text
attachment with the results of each query formatted as
query-number, doc-id
pairs, one per line, for all the query results. Limit your results
to the top 30 hits for each query. Given this, your output file should
have no more than 750 lines (25 queries * 30 hits per query).
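As a rough illustration of the format above, here is a small Python sketch that turns ranked results into the query-number, doc-id lines and cuts each query off at 30 hits. It assumes you have already collected your search results into a dictionary mapping each query number to a list of doc IDs in rank order. That dictionary, and the names format_results and write_results, are hypothetical and not part of Lucene.

```python
MAX_HITS = 30  # per-query cutoff stated in the assignment


def format_results(ranked_hits, max_hits=MAX_HITS):
    """Return 'query-number, doc-id' lines, best hits first for each query."""
    lines = []
    for query_number in sorted(ranked_hits):
        # Slicing keeps at most max_hits doc IDs for this query.
        for doc_id in ranked_hits[query_number][:max_hits]:
            lines.append(f"{query_number}, {doc_id}")
    return lines


def write_results(path, ranked_hits):
    """Write the formatted lines to a plain-text file, one pair per line."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for line in format_results(ranked_hits):
            handle.write(line + "\n")
```

With 25 queries this produces at most 750 lines, and a call such as write_results("lastname-assgn1-1.txt", ranked_hits) would create the file to attach.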
Don't tar, zip, gzip, compress, uuencode, stuffit or otherwise
encode it. Just a plain-text file named lastname-assgn1-1.txt. Don't
place it in-line. Make sure it is an attachment.