For this assignment you should improve the performance of the system you created for Part 1. You can try pretty much anything for this part of the assignment, but you should base your attempts on a rational analysis of the behavior of your original system. That is, do some sort of error analysis (for example, look at which queries do badly and why); don't just try random variations.
trec_eval
The standard software used by NIST to evaluate ad hoc retrieval systems is called trec_eval and is freely available (get the release marked "latest"). Please download, install, and use this system to evaluate your results. Installation is trivial; just run make. Once it's running, run trec_eval -h to get help on using the system. Pay particular attention to the results file format and the qrels format. They're a little different from what we've been assuming. One change that you'll have to make to your systems is to output the Lucene similarity score for a scoreDoc in addition to the queryid and docid; trec_eval sorts the results file based on this field rather than using the order in which the results are returned.
The results format and qrels format also assume some dummy columns that are not used. You can fill them with whatever you want.
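As a rough illustration of the results format, here is a minimal Python sketch that turns (queryid, docid, score) triples into results-file lines. It assumes the usual six-column trec_eval layout (queryid, a dummy "Q0" column, docid, rank, score, run tag); check trec_eval -h for the version you installed. The function name format_run, the run tag "myrun", and the input shape are made up for this example, and your own system will most likely write these lines directly from its Lucene code.

```python
from collections import defaultdict


def format_run(hits, run_tag="myrun", max_hits=250):
    """Build results-file lines from (queryid, docid, score) triples.

    Assumed layout per line: queryid Q0 docid rank score run_tag.
    "Q0" and the rank are the dummy columns; trec_eval orders by score.
    """
    by_query = defaultdict(list)
    for qid, docid, score in hits:
        by_query[qid].append((docid, float(score)))

    lines = []
    for qid in sorted(by_query):
        # Highest score first, keeping at most max_hits documents per query.
        ranked = sorted(by_query[qid], key=lambda h: h[1], reverse=True)[:max_hits]
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.6f} {run_tag}")
    return lines


if __name__ == "__main__":
    demo = [("1", "doc7", 0.42), ("1", "doc3", 1.10), ("2", "doc9", 0.05)]
    print("\n".join(format_run(demo)))
```

The qrels file has a similar shape, assuming the usual four columns: queryid, a dummy column, docid, and the relevance judgment.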
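When you prepare your results, please use the -q, -c, and -M250 options: -q reports results for each query as well as the summary, -c averages over all the queries in the qrels file rather than only those that appear in your results, and -M250 limits the evaluation to the top 250 documents per query.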
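The per-query figures that -q produces are a natural starting point for error analysis. The sketch below is one possible way to pull them out, assuming the output has three whitespace-separated columns per line (measure name, queryid or "all", value) and that mean average precision is reported under the name "map"; verify both against your own trec_eval output. The function names parse_per_query and weakest_queries are invented for this example.

```python
def parse_per_query(output_text, measure="map"):
    """Return {queryid: value} for one measure from trec_eval -q output.

    Assumes lines of the form: measure <whitespace> queryid <whitespace> value.
    The summary rows (queryid "all") are skipped.
    """
    scores = {}
    for line in output_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, qid, value = parts
        if name != measure or qid == "all":
            continue
        try:
            scores[qid] = float(value)
        except ValueError:
            continue  # non-numeric rows such as the run name
    return scores


def weakest_queries(scores, n=10):
    """Return the n (queryid, value) pairs with the lowest values."""
    return sorted(scores.items(), key=lambda kv: kv[1])[:n]


if __name__ == "__main__":
    sample = "map\t1\t0.1000\nmap\t2\t0.5000\nmap\tall\t0.3000\n"
    print(weakest_queries(parse_per_query(sample), n=1))
```

To use it, save the output of a run such as trec_eval -q -c -M250 followed by your qrels file and then your results file, read the saved text, and pass it to parse_per_query. The queries at the bottom of the list are the ones worth inspecting by hand.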
What to Hand In
Send me the top 250 hits for each of the queries in the collection (in the required trec_eval format). In addition, include a short report on the techniques you used and the trec_eval output from your final run. Feel free to report on avenues of investigation that did not work out.
Due on Monday September 18, 2009 before class starts