For this assignment you are to create a Lucene search index for a collection of movie plots.
Movie Plots
For each entry in the movie database, there are one or more plot descriptions by various contributors. The following is a typical example of an entry:
-------------------------------------------------------------------------------
MV: Sherlock Holmes and the Secret Weapon (1943)
PL: Starting in Switzerland, Sherlock Holmes rescues the inventor of a
PL: bomb-sight which the allies want to keep from the Nazis. Back in London it
PL: sems that the inventor is not all that he seemed.
BY: Michael Crew <m.crew@bbcnc.org.uk>
PL: Working for the British government, Sherlock Holmes manages to spirit Dr.
PL: Franz Tobel out of Switzerland and into England before the GESTAPO are able
PL: to get to him. Tobel has devised an immensely accurate bomb site and while
PL: he is willing to make it available to the Allies, he insists on
PL: manufacturing it himself. Soon however, he vanishes and it is left to
PL: Homes, assisted bu the bumbling Dr. Watson, to decipher a coded message he
PL: left behind. Holmes soon realizes that he is up against his old nemesis,
PL: Professor Moriarty.
BY: garykmcd
Note that we're interested in indexing movies by their plots, so an entry with multiple plots poses a problem. How you handle it is up to you: you might simply take the first plot, or merge the plots. In any case, the unit you return is the movie, not the individual plot.
By my count there are 167242 entries in the list. Assume for evaluation/indexing purposes that the movies are numbered starting at 1 in the order they appear in the file. Note that there are a fair number of TV shows and TV movies in this list. None are relevant to any of the queries you provided, so you might want to deal with those in a special way. But if you exclude them from the index, make sure the numbering stays intact.
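The numbering and multiple-plot handling described above can be sketched as follows. This is a minimal sketch, not a required approach. It assumes the MV:/PL:/BY: line layout shown in the example entry, and that each BY: line closes one plot. The names Movie, parse_plots, and is_tv are hypothetical, as is the quoted "Example Series" title in the sample. The TV test (a quoted title or a "(TV)" tag) is an assumption about how the list marks TV entries and should be checked against the actual file.

```python
from dataclasses import dataclass, field


@dataclass
class Movie:
    number: int                      # 1-based position in the file
    title: str
    plots: list = field(default_factory=list)

    @property
    def is_tv(self):
        # Assumption: TV series titles are quoted, TV movies carry "(TV)".
        return self.title.startswith('"') or "(TV)" in self.title

    def text(self, merge=True):
        # Either merge every plot into one string or keep only the first.
        if not self.plots:
            return ""
        return " ".join(self.plots) if merge else self.plots[0]


def parse_plots(lines):
    movies = []
    pending = []  # PL: lines of the plot currently being read

    def flush():
        # Close the plot in progress and attach it to the latest movie.
        if movies and pending:
            movies[-1].plots.append(" ".join(pending))
        pending.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("MV:"):
            flush()
            movies.append(Movie(number=len(movies) + 1, title=line[3:].strip()))
        elif line.startswith("PL:") and movies:
            pending.append(line[3:].strip())
        elif line.startswith("BY:"):
            flush()
    flush()
    return movies


SAMPLE = [
    "-----",
    "MV: Sherlock Holmes and the Secret Weapon (1943)",
    "PL: Starting in Switzerland, Sherlock Holmes rescues the inventor of a",
    "PL: bomb-sight which the allies want to keep from the Nazis.",
    "BY: Michael Crew",
    "PL: Working for the British government, Sherlock Holmes manages to spirit Dr.",
    "PL: Franz Tobel out of Switzerland.",
    "BY: garykmcd",
    "-----",
    'MV: "Example Series" (2001)',   # hypothetical TV entry
    "PL: A placeholder plot.",
    "BY: someone",
]

movies = parse_plots(SAMPLE)
# Filter after numbering, so skipped TV entries leave gaps rather than renumbering.
indexable = [m for m in movies if not m.is_tv]
```

Because each number is assigned while reading the file and the TV filter is applied afterwards, the second entry keeps number 2 even though it is not indexed. The number can then be stored as the document identifier in the index.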
Queries
I will provide a set of 50 queries, drawn from your email responses earlier in the semester, to use as a development set.
Evaluation
Since we don't have any relevance judgments for this collection, we'll use pooled evaluation, a method used by NIST. Run the development queries through your system and, by hand, make a relevance judgment for each of the top 25 results your system returns for each query. Record both the relevant and the non-relevant query/movie pairs as part of this process and report them to me. Use these judgments to create a qrels file suitable for use with trec_eval; then run trec_eval on your results.
I will pool (take the union of) these pairings, and I will adjudicate in cases where your judgments differ. These pooled results will form the basis for the final evaluation.
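As a rough illustration of the two files trec_eval reads, the sketch below formats judgments as qrels lines (query id, an iteration field that is conventionally 0, document id, relevance) and ranked results as run lines (query id, the literal Q0, document id, rank, score, run tag). The function names, the "myrun" tag, and the sample numbers are hypothetical.

```python
def qrels_lines(judgments):
    """judgments maps (query_id, movie_number) -> True/False."""
    return [
        f"{query} 0 {movie} {int(relevant)}"
        for (query, movie), relevant in sorted(judgments.items())
    ]


def run_lines(results, tag="myrun", depth=25):
    """results maps query_id -> list of (movie_number, score), best first."""
    lines = []
    for query in sorted(results):
        # Keep only the top `depth` results, numbering ranks from 1.
        for rank, (movie, score) in enumerate(results[query][:depth], start=1):
            lines.append(f"{query} Q0 {movie} {rank} {score:.4f} {tag}")
    return lines


# Hypothetical judgments and scores for query 1.
judgments = {(1, 42): True, (1, 7): False}
results = {1: [(42, 3.2), (7, 1.5)]}

qrels = qrels_lines(judgments)
run = run_lines(results)
```

Writing each list to a file, one line per entry, gives a qrels file and a results file. trec_eval takes the qrels file as its first argument and the results file as its second.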
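The pooling step can be pictured with a small sketch. It assumes each submission is a mapping from (query, movie) pairs to a relevant/non-relevant flag. The function name pool and the sample data are hypothetical, and this is only an illustration of taking a union and setting aside disagreements, not the procedure that will actually be used.

```python
def pool(judgment_sets):
    """Union several judgment mappings; return (agreed, conflicting_pairs)."""
    pooled = {}
    conflicts = set()
    for judgments in judgment_sets:
        for pair, relevant in judgments.items():
            if pair in pooled and pooled[pair] != relevant:
                conflicts.add(pair)          # two submissions disagree
            pooled.setdefault(pair, relevant)
    # Disagreements are removed from the pool and left for adjudication.
    for pair in conflicts:
        del pooled[pair]
    return pooled, sorted(conflicts)


# Two hypothetical submissions that disagree on query 1, movie 7.
first = {(1, 42): True, (1, 7): False}
second = {(1, 42): True, (1, 7): True, (2, 3): False}

agreed, to_adjudicate = pool([first, second])
```

Pairs judged by only one submission enter the pool as they are, which is why recording non-relevant pairs matters as much as recording relevant ones.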
What to hand in
Provide a short description of how you did the indexing, the results of running trec_eval, and the qrels file you generated.
Due Date: No later than December 16, 2009