Homework 1: Due 9/1/2011

For this assignment, you will create a system that can generate an index for a small test collection of medical abstracts.

Resources

Go to www.cs.colorado.edu/~martin/csci5417/assgn1/ and retrieve the file med.all. This is a plain text file consisting of 1033 short medical abstracts. The following is an example of one abstract. The first line (the .I line) provides the document number, and the .W line precedes the body of the abstract. Each abstract ends at the .I line of the following abstract (or at EOF).


.I 16
.W
treatment of active chronic hepatitis and lupoid hepatitis with
6-mercaptopurine and azothioprine .
  6-mercaptopurine or azothioprine ('imuran') was used successfully in 3
patients with active chronic hepatitis and 2 with lupoid hepatitis, for
periods up to 1 year . these drugs allowed modification and even
abolition of discomforting corticosteroid regimes . their action in
chronic hepatitis may be analogous to their anti-immune action in
suppressing homograft rejection .
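
To make the record layout concrete, here is a minimal sketch of one way to split the file into abstracts. It is not a required approach. The function name parse_collection is hypothetical, and the sketch assumes that every .I and .W marker sits on a line of its own and that no body line looks like a marker.

```python
import re

def parse_collection(text):
    """Split med.all-style text into {document number: body text}."""
    docs = {}
    doc_id = None
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        marker = re.fullmatch(r"\.I\s+(\d+)", stripped)
        if marker:
            # A new .I line closes the previous abstract and opens the next.
            doc_id = int(marker.group(1))
            docs[doc_id] = []
            in_body = False
        elif stripped == ".W":
            # Everything after .W belongs to the body of the current abstract.
            in_body = True
        elif in_body and doc_id is not None:
            docs[doc_id].append(line)
    return {d: "\n".join(lines) for d, lines in docs.items()}
```

Calling parse_collection on the contents of med.all should give one entry per abstract, keyed by the integer document number.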


Task

Your task in this assignment is to build a system that can index this collection. More specifically, your system should take this collection and create a file that lists the document numbers for each term in the collection. Put one term and its posting list on each line, sort the lines by term, sort the document numbers within each line, and separate the values with commas. For example:


hepatitis,1,2,330,500,1001
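
The output format above can be produced along the following lines. This is only a sketch under stated assumptions: build_index and format_index are hypothetical names, the input is assumed to be a dictionary mapping each integer document number to the terms found in that document, and terms are sorted with Python's default string ordering, which the assignment does not specify.

```python
from collections import defaultdict

def build_index(doc_terms):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            # A set records each document once, however often the term occurs.
            postings[term].add(doc_id)
    # Document numbers are integers, so sorted() orders them numerically.
    return {term: sorted(ids) for term, ids in postings.items()}

def format_index(index):
    """Render one 'term,doc,doc,...' line per term, sorted by term."""
    lines = []
    for term in sorted(index):
        fields = [term] + [str(doc_id) for doc_id in index[term]]
        lines.append(",".join(fields) + "\n")
    return "".join(lines)
```

Keeping the document numbers as integers until the final formatting step matters: sorting them as strings would place 1001 before 2.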


For the purposes of this assignment we'll define terms as maximal sequences of alphanumerics and dashes (hyphens). So for document 16 given above, the following are legitimate terms:


"treatment", "imuran", "1", "6-mercaptopurine", "anti-immune"


but the following are not terms:


".", "'imuran'", "('imuran')", "hepatitis,"
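
One way to read the term definition is as a regular expression. The sketch below is an illustration rather than the required method. The name tokenize is hypothetical, "alphanumerics" is assumed to mean ASCII letters and digits, and case is left unchanged.

```python
import re

# One or more ASCII letters, digits, or hyphens. re.findall returns
# non-overlapping matches, and the greedy + makes each match maximal.
TERM_PATTERN = re.compile(r"[A-Za-z0-9-]+")

def tokenize(text):
    """Return the terms in text, in order of appearance."""
    return TERM_PATTERN.findall(text)
```

For instance, tokenize("azothioprine ('imuran') , anti-immune") yields "azothioprine", "imuran", and "anti-immune", with the quotes, parentheses, and comma dropped. Note that under this reading a hyphen standing alone would also count as a term.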


What to hand in


Email me the code for your system and a short description of your approach, including the size of the index file, the number of terms, and the number of postings. Do not send me the index.
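
The three numbers asked for here can be computed roughly as follows. This sketch assumes an in-memory index that maps each term to its list of document numbers, counts one posting per (term, document) pair, and measures the file size as the number of bytes in the formatted index text. The name index_stats is hypothetical.

```python
def index_stats(index, index_text):
    """Summarize an index given as {term: [doc numbers]} plus its file text."""
    return {
        # Number of distinct terms, i.e. lines in the index file.
        "terms": len(index),
        # One posting per (term, document) pair.
        "postings": sum(len(doc_ids) for doc_ids in index.values()),
        # Size in bytes of the index file contents.
        "bytes": len(index_text.encode("utf-8")),
    }
```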


Early in the day on 9/1 I will post a file with a small set of terms. You should send me the corresponding entries from your index for those terms. Remember that the document numbers in the index should be in a numerically sorted (ascending), comma-separated list with no extraneous whitespace. Send me these as a single plain-text email attachment. The attachment should be named <your last name>-assgn1-out.txt
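
Pulling the requested entries out of a finished index might look like the sketch below. The name select_entries is hypothetical, and two things here are assumptions rather than requirements: entries are returned in the order the terms are requested, and terms that do not appear in the index are silently skipped.

```python
def select_entries(index_text, query_terms):
    """Return the index lines whose term is one of query_terms."""
    by_term = {}
    for line in index_text.splitlines():
        if line:
            # The term is everything before the first comma.
            term, _, _ = line.partition(",")
            by_term[term] = line
    return [by_term[term] for term in query_terms if term in by_term]
```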


This will be graded solely as a simple diff against the correct answers.  But you should at least give some thought to how your approach would scale beyond this tiny collection.

Due on Thursday, September 1, 2011, before class starts.

© James H. Martin, 2011