For this assignment, you will create a system that can generate an index for a small test collection of medical abstracts.
Go to www.cs.colorado.edu/~martin/csci5417/assgn1/ and retrieve the file med.all. This is a plain text file consisting of 1033 short medical abstracts. The following is an example of one abstract. The first line (the .I line) provides the document number, and the .W line precedes the body of the abstract. Each abstract ends at the .I line for the following abstract (or at EOF).
.I 16
.W
treatment of active chronic hepatitis and lupoid hepatitis with
6-mercaptopurine and azothioprine .
6-mercaptopurine or azothioprine ('imuran') was used successfully in 3
patients with active chronic hepatitis and 2 with lupoid hepatitis, for
periods up to 1 year . these drugs allowed modification and even
abolition of discomforting corticosteroid regimes . their action in
chronic hepatitis may be analogous to their anti-immune action in
suppressing homograft rejection .
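The record layout described above can be read with a small line-based parser. This is only a sketch: the function name parse_abstracts is hypothetical, the document number is assumed to follow .I on the same line, and the second record in the sample text is made up for illustration.

```python
import re

# ASSUMPTION: a header line looks like ".I <number>" and the body marker is ".W" alone.
HEADER = re.compile(r"^\.I\s+(\d+)\s*$")

def parse_abstracts(text):
    """Yield (doc_number, body) pairs from text in the .I / .W layout."""
    doc_id, body, in_body = None, [], False
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new .I line closes the previous abstract, if there was one.
            if doc_id is not None:
                yield doc_id, "\n".join(body)
            doc_id, body, in_body = int(header.group(1)), [], False
        elif line.strip() == ".W":
            in_body = True
        elif in_body:
            body.append(line)
    # End of input closes the last abstract.
    if doc_id is not None:
        yield doc_id, "\n".join(body)

# Illustrative input only; document 17 here is invented.
sample = """.I 16
.W
treatment of active chronic hepatitis and lupoid hepatitis with
6-mercaptopurine and azothioprine .
.I 17
.W
a second, made-up abstract .
"""
docs = list(parse_abstracts(sample))
```

Reading the real file would amount to passing the contents of med.all to the same function.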
Your task in this assignment is to build an index for this collection. More specifically, your system should take this collection and create a file containing a list of document numbers for each term in the collection: one term/posting list per line, sorted by term, with the document numbers sorted as well and separated by commas. As in...
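One possible way to build such a file is sketched below. Several details are assumptions rather than part of the assignment: the names build_index and format_index are hypothetical, the term and its posting list are separated by a single space (the exact line layout is not specified here), terms are matched with the alphanumerics-and-hyphens definition used in this assignment, and terms are sorted in Python's default string order. The toy documents are invented.

```python
import re
from collections import defaultdict

# ASSUMPTION: terms are runs of ASCII letters, digits, and hyphens.
TERM = re.compile(r"[A-Za-z0-9-]+")

def build_index(docs):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, body in docs:
        for term in TERM.findall(body):
            index[term].add(doc_id)
    return index

def format_index(index):
    """Return one 'term postings' line per term, sorted by term."""
    lines = []
    for term in sorted(index):
        # Sort document numbers as integers so that 10 comes after 2.
        postings = ",".join(str(d) for d in sorted(index[term]))
        # ASSUMPTION: a single space separates the term from its posting list.
        lines.append(f"{term} {postings}")
    return lines

# Invented toy collection of (doc_number, text) pairs.
docs = [(2, "chronic hepatitis ."),
        (1, "hepatitis, anti-immune action"),
        (10, "hepatitis")]
lines = format_index(build_index(docs))
print("\n".join(lines))
```

Using a set per term means a document is listed once no matter how often the term occurs in it.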
For the purposes of this assignment we'll define terms as maximal sequences of alphanumerics and dashes (hyphens). So for document 16 given above, the following are legitimate terms:
"treatment", "imuran", "1", "6-mercaptopurine", "anti-immune"
but the following are not terms:
".", "'imuran'", "('imuran')", "hepatitis,".
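The term definition above maps naturally onto a regular expression. The sketch below assumes that "alphanumerics" means ASCII letters and digits and that no case folding is applied; the function name extract_terms is hypothetical. Note that under a literal reading a standalone "-" would also count as a term, and this pattern follows that reading.

```python
import re

# Maximal runs of letters, digits, and hyphens; findall returns the longest
# match at each position, so punctuation and whitespace act as separators.
TERM = re.compile(r"[A-Za-z0-9-]+")

def extract_terms(text):
    """Return the terms of text in order of appearance (duplicates kept)."""
    return TERM.findall(text)

terms = extract_terms("treatment ('imuran') hepatitis, 1 year . anti-immune")
print(terms)
```

Quotes, parentheses, commas, and the isolated period are dropped, while the hyphenated term stays in one piece.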
What to hand in
Email me the code for your system and a short description of your approach including the size of the file, the number of terms and the number of postings. Do not send me the index.
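The term and posting counts can be read directly off an in-memory index. The sketch below assumes the index is a dictionary from term to a collection of document numbers, and that a posting means one (term, document) pair; the name index_stats and the toy index are hypothetical.

```python
def index_stats(index):
    """Return (number of terms, number of postings) for a term -> doc-numbers mapping."""
    num_terms = len(index)
    # ASSUMPTION: one posting per (term, document) pair.
    num_postings = sum(len(doc_ids) for doc_ids in index.values())
    return num_terms, num_postings

# Invented toy index for illustration.
toy_index = {"hepatitis": [1, 2, 10], "chronic": [2], "anti-immune": [1]}
num_terms, num_postings = index_stats(toy_index)
print(num_terms, num_postings)
```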
Early in the day on 9/1 I will post a file with a small set of terms. You should send me the corresponding entries from your index for those terms. Remember that the document numbers in the index should be in a numerically sorted (ascending), comma-separated list with no extraneous whitespace. Send me these as a single plain-text email attachment. The attachment should be named <your last name>-assgn1-out.txt
This will be graded solely by a simple diff against the correct answers, but you should still give some thought to how your approach would scale beyond this tiny collection.