Assignment 3:  Information Extraction

In this assignment you are to implement an HMM-based approach to named entity recognition.  In this approach, we can cast the problem of finding named entities as a tagging task using IOB tags. The framework for the HMM-based solution is identical to the one used for the POS tagging assignment.  The particular NER task we’ll be tackling is to find all the references to genes in a set of biomedical journal article abstracts. 

Sample GENE tags

Structure	O
,	O
promoter	O
analysis	O
and	O
chromosomal	O
assignment	O
of	O
the	O
human	B
APEX	I
gene	I
.	O

The training material consists of around 13,000 sentences with gene references tagged with IOB tags.  Since we’re only dealing with one kind of entity here, there are only three tag types in the data: B (beginning of a gene mention), I (inside one), and O (outside any mention). The format of the data is identical to the POS tagging HW: one token per line, followed by its tag (tab separated).  An example is shown above under “Sample GENE tags”. In this example there is one gene mention, “human APEX gene”, with the corresponding tag sequence B I I.
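A minimal sketch of reading this format, assuming tokens never contain a tab and that a blank line separates sentences (as in the output format described under "What to hand in"). The function name read_tagged_sentences and the inline sample are hypothetical, not part of the provided materials.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # [(token, tag), ...]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token/tag lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


# Hypothetical in-memory sample; with a real file you would pass an open file object.
sample = ["the\tO\n", "human\tB\n", "APEX\tI\n", "gene\tI\n", ".\tO\n", "\n", "Structure\tO\n"]
sentences = read_tagged_sentences(sample)
```

With a file on disk, the same function can be called on the handle returned by open(), since a file object iterates over its lines.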

Although the structure of this problem is the same as POS tagging, the characteristics of the problem are quite different.  In particular, there are far fewer parameters to learn for transition probabilities since there are only three tags. However, the vocabulary is much larger than that of the BERP domain, and unknown words will be far more prevalent.  Both of these considerations may lead you to different strategies from those you used in Assignment 2.
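One possible way to cope with unknown words, sketched here purely as an illustration and not as the required approach, is to replace rare training words with coarse word-shape classes so that the emission model has counts to fall back on at test time. The class names, the min_count threshold, and the function names below are all assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import Iterable, Set


def word_class(token: str) -> str:
    """Map a token to a coarse pseudo-word based on its surface shape."""
    if re.search(r"\d", token):
        return "<UNK-DIGIT>"      # contains a digit, e.g. "p53"
    if token.isupper():
        return "<UNK-ALLCAPS>"    # e.g. "APEX"
    if "-" in token:
        return "<UNK-HYPHEN>"
    if token[:1].isupper():
        return "<UNK-INITCAP>"    # e.g. "Structure"
    return "<UNK>"


def build_vocab(tokens: Iterable[str], min_count: int = 2) -> Set[str]:
    """Keep only words seen at least min_count times in training."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def normalize(token: str, vocab: Set[str]) -> str:
    """Return the token itself if known, otherwise its shape class."""
    return token if token in vocab else word_class(token)


vocab = build_vocab(["the", "the", "gene", "gene", "APEX"])
```

The same normalize step would be applied to both training and test tokens, so that the classes seen in training match the ones produced for unseen words.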

Evaluation

As noted in the book, evaluation of these kinds of systems is not based on per-tag accuracy (you can do pretty well on that basis by just saying O all the time). What we really want to optimize is precision, recall, and F-measure at the gene level.  Remember that precision is the ratio of genes correctly identified by your system to the total number of genes your system found, and recall is the ratio of correctly identified genes to the total that you should have found.  The F1 measure given in the book is just the harmonic mean of these two.
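A rough sketch of gene-level scoring follows. It assumes a gene counts as correct only when its predicted span matches the gold span exactly, and that a stray I with no preceding B starts a new span; the scoring actually used for grading may differ in these details. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each gene span in one sentence."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.add((start, i))   # a new B closes the previous span
            start = i
        elif tag == "I":
            if start is None:
                start = i               # assumption: a stray I opens a span
        else:                           # "O"
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def precision_recall_f1(gold_sents: List[List[str]], pred_sents: List[List[str]]):
    """Gene-level precision, recall, and F1 over parallel lists of tag sequences."""
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two gold genes, one of them found: precision 1.0, recall 0.5.
p, r, f1 = precision_recall_f1([["B", "I", "O", "B"]], [["B", "I", "O", "O"]])
```

Note that tagging everything O finds no genes at all, so recall and F1 are both zero even though per-tag accuracy would look high.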

I’ll use F1 to evaluate your systems on the withheld test data.  You should create a training set and a dev set from the data I’m providing for use in developing your system.
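One simple way to carve out a dev set is a seeded random split at the sentence level. The 10% dev fraction, the seed, and the function name below are arbitrary choices for this sketch, not requirements.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(sentences: Sequence[T], dev_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle whole sentences with a fixed seed and hold out a fraction as dev."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Integers stand in for sentences here, just to show the sizes.
train, dev = split_train_dev(list(range(100)))
```

Splitting by sentence rather than by token keeps each gene mention intact in either the training set or the dev set.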

What to hand in

As with the previous HW, send me a report of what you did and your Python code.  I will post a test set before the due date.  Run the test data through your system and send me the output.  The output of your system should be in the same format as the input training data: one token per line, with a tab separating the token from its tag, and a blank line separating sentences.
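A small sketch of producing that output format, assuming your system yields each sentence as a list of (token, tag) pairs. The function name is hypothetical, and writing a blank line after the final sentence as well is an assumption of this sketch.

```python
from typing import List, Tuple


def format_tagged_sentences(sentences: List[List[Tuple[str, str]]]) -> str:
    """Render sentences as token<TAB>tag lines with a blank line after each sentence."""
    chunks = []
    for sentence in sentences:
        chunks.extend(f"{token}\t{tag}\n" for token, tag in sentence)
        chunks.append("\n")   # blank line marks the sentence boundary
    return "".join(chunks)


output = format_tagged_sentences([[("human", "B"), ("APEX", "I")], [(".", "O")]])
```

The resulting string can be written to a file in one call, for example with a file handle's write method.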