News

News concerning the class will be posted here.

HW 3 Test set

The HW 3 test set is now available.  Please run your systems on this test and send the results (along with your code and writeups) to csci5832@gmail.com.

Don't retune or otherwise do further development on your systems once you've downloaded the tests.

IR Text

The page numbers given for the IR reading (naive Bayes text classification) are for the on-line PDF that's linked from the lectures and readings page.

The page numbers are different from the various editions of the physical textbook that some of you may have.

Last Year's Final

I posted last year's final (with answers) in the directory of old exams.  This includes the worked-out answer to the final EM question.

April Grades

Here's a new combined grade sheet for the CS, Ling, and CAETE sections.  If you can't uniquely figure out where you are, send me mail.  The subtotal column weights the two HWs and the two exams equally.

Recall that the quizzes, HWs and final are equally weighted at 30% each.

If you're at or above the 85% mark and don't plan to blow the final then you're in good shape for an A. And you can probably skip the extra-credit HW.


HW3 

The details of HW 3 (Movie reviews) have been posted on the assignments page.

Readings for Final Exam

The final is cumulative. That means everything is fair game.  But the focus will be on the newer material. Use the course discussions and questions from the quizzes as a guide to reviewing the earlier material (and, of course, the specific readings for the earlier quizzes).

Here are the readings from the last chunk of the class.  This is just the material not covered on our previous quizzes.

Chapter 19: Skip 19.5

Chapter 20: Read pages 637-645, and Section 20.9

IR (Manning et al.): Read pages 253-262 and Section 13.6

Chapter 25: Skip 25.5.2, 25.10, 25.11, 25.12

NLP-based sports reporter

Another reason not to be a journalism major... NLP system writes a better sports story.

Job ad

For those of you finishing up this semester...  Job in SF.

http://www.lemnoslabs.com/search-engineer.html


Midterm 2 Readings

Here's the roadmap for the readings for the next midterm...  (this assumes you've retained what you need from the first part of the class).

Chapter 12: Skip 12.6, 12.7.2, 12.8, and 12.9

Chapter 13: Skip 13.4.2, 13.4.3

Chapter 14: Skip 14.6.2, 14.8, 14.9, 14.10

Chapter 17: Skip 17.5

Chapter 18: Skip 18.3.2, 18.4, 18.5, and 18.6

Chapter 22: Skip 22.3


Maxmatch P1 Output

Here are the "correct" maxmatch answers if you want to debug your WER computation (assuming you didn't get the 20 points on the HW). The answer should be .528.

Grades

I've posted the latest grades (HW1 and the midterm).  The grades are sorted by a straight sum of the two.  You should be able to find yourself by your exam grade or your WER.  If you can't find yourself, contact me and I'll tell you where you are.

Remember for HW1: if you have a 0 for part 1 it means your MaxMatch output was wrong; if you have a 0 for part 2 it means your WER was wrong.  Part 3 is based on (.195/yourWER)*40.
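If you want to check your part 3 number, here's a minimal sketch of that arithmetic. The function and parameter names are made up for illustration, and it applies the formula exactly as written above; whether the result is capped or rounded on the grade sheet isn't stated here, so the sketch does neither.

```python
def part3_score(your_wer, reference_wer=0.195, scale=40):
    """Apply (.195 / yourWER) * 40 exactly as stated in the announcement.

    reference_wer and scale are hypothetical parameter names for the
    constants .195 and 40; no capping or rounding is applied.
    """
    if your_wer <= 0:
        raise ValueError("WER must be positive to apply this formula")
    return (reference_wer / your_wer) * scale
```

So a WER of .39 (twice .195) comes out to half of the 40.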

This is currently just for the CSCI section. If you're in the Ling or CAETE section just email me and I'll send you your grades.

Assignment 2 Test Set

The test set for today's assignment has been posted. The submission instructions are slightly different this time.

Please send your submission to csci5832@gmail.com (not to me).

Send a single gzipped tar file as an attachment.  The contents of that should be three files:

<yourlastname>-assgn2.py

<yourlastname>-assgn2-out.txt

<yourlastname>-assgn2-description.*

If you don't know what tar (or gzip) is just send the mail as normal attachments and then see me after class.
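For what it's worth, here's one way to build the archive from Python using the standard tarfile module. This is only a sketch: the helper name make_submission and the archive name <yourlastname>-assgn2.tar.gz are my assumptions (no archive name is specified above), and it assumes the three files sit in one directory.

```python
import tarfile
from pathlib import Path


def make_submission(lastname, directory="."):
    """Bundle the three assignment 2 files into one gzipped tar file.

    The archive name below is an assumption, not a course requirement.
    """
    directory = Path(directory)
    files = [
        directory / f"{lastname}-assgn2.py",
        directory / f"{lastname}-assgn2-out.txt",
    ]
    # The description can have any extension, so match it with a glob.
    files += sorted(directory.glob(f"{lastname}-assgn2-description.*"))

    archive = directory / f"{lastname}-assgn2.tar.gz"
    # Mode "w:gz" writes a gzip-compressed tar file.
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            # arcname keeps directory prefixes out of the archive.
            tar.add(path, arcname=path.name)
    return archive
```

From a shell, tar -czf followed by the archive name and the three file names does the same thing.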


Office Hours Monday

I'm going to be out of town Monday.  So no office hours. I'll add extra office hours Tuesday afternoon.  3-4:30.

Jeopardy!

Don't forget to watch Jeopardy tonight.   And PBS's Nova had a recent episode covering Watson and various developments related to NLP and speech.

GT example

Here's the promised exciting GT example.  Let's say our text is 

one fish two fish red fish blue fish fish fish two fish fish

That yields the following bigram tokens

one fish
fish two
two fish
fish red
red fish
fish blue
blue fish
fish fish
fish fish
fish two
two fish
fish fish

Sorting and counting those gets us these counts.
fish fish 3
fish two 2
two fish 2
blue fish 1
fish blue 1
fish red 1
one fish 1
red fish 1

So we have a vocabulary of size 5, giving us 25 possible bigrams, of which 8 show up in our set of 12 bigram instances.  That leaves us with no data on 17 of our possible bigrams, and with the following buckets for the rest.

N_3 = 1
N_2 = 2
N_1 = 5

So if we use the ones to give us the probability to assign to (all) the zeroes we get  N_1/N or 5/12.   That's the probability for the sum of all unseens.  If we want a probability for any given one we can distribute that evenly over the unseens to give us 5/12 * 1/17 or .0245.

Now let's move up the food chain.  Remember the basic equation for the discounted counts is:

c* = (c+1) * N_{c+1} / N_c

so

c = 1  -->  c* = (1+1) * N_2/N_1 = 2 * 2/5 = .8
c = 2  -->  c* = (2+1) * N_3/N_2 = 3 * 1/2 = 1.5

We'd better leave the count of 3 alone, since N_4 is 0 and we're out of data.

Now we see that the counts for the things that occurred once have gone from 1 to .8, and the 2s have gone to 1.5.
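If you'd like to check these numbers yourself, here's a small sketch that reproduces the example. The variable and function names are mine, and leaving a count alone when the next bucket is empty is just the shortcut used in this example, not a general recipe for Good-Turing.

```python
from collections import Counter
from fractions import Fraction

text = "one fish two fish red fish blue fish fish fish two fish fish"
tokens = text.split()

# Bigram tokens are the adjacent word pairs in the text.
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)

vocab = set(tokens)
total = len(bigrams)                     # N: number of bigram instances
unseen = len(vocab) ** 2 - len(counts)   # possible bigrams never observed

# N_c: how many distinct bigrams were seen exactly c times.
buckets = Counter(counts.values())

# Probability mass for all unseen bigrams (N_1 / N), and an even
# share of that mass for any single unseen bigram.
p_unseen_total = Fraction(buckets[1], total)
p_unseen_each = p_unseen_total / unseen


def discounted(c):
    """Return c* = (c+1) * N_{c+1} / N_c, or c unchanged if N_{c+1} is 0."""
    if buckets[c + 1] == 0:
        return Fraction(c)
    return Fraction((c + 1) * buckets[c + 1], buckets[c])


for c in sorted(buckets):
    print(c, "-->", float(discounted(c)))
```

Fractions keep the arithmetic exact, so 5/12 stays 5/12 until you ask for a float.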







Old Exams

I've posted various old quizzes and exams from past semesters. Note the distribution and coverage of materials changes from year to year. So there may be stuff on these old quizzes that we haven't covered.

Readings for Midterm 1

Chapters 1 to 6
   Chapter 2
   Chapter 3
      Skip 3.4.1, 3.10, 3.12
   Chapter 4
      Skip 4.7, 4.8-4.11
   Chapter 5
      Skip 5.5.4, 5.6, 5.8-5.10
   Chapter 6
     Skip 6.6-6.9

HW 1 Test and Submission

The test set and corresponding answers are now available. Don't do any further development on your system (or dictionary) after you retrieve them.

Please email 4 files as attachments to me:

<yourlastname>-assgn1.py

<yourlastname>-assgn1-part1-out.txt

<yourlastname>-assgn1-part3-out.txt

<yourlastname>-assgn1-description.*  

Use whatever formatter you want for the last one. In the description, please include your WER for maxmatch and for your "improved" system.  Also include a description of what you attempted for part 3.

HW 1 Clarification

The word error rate computation for the first HW is in terms of words, not characters.  That is, the substitutions, insertions and deletions are happening to words, not characters.  In the following examples, the cost of a substitution is 1 (not 2 as in the Levenshtein distance).

Fortunately (?), the edit distance code I gave you works both ways.  It doesn't care about what's in the sequences you pass it.  Specifically, if you call it with strings as in

>>> minEditDist("intention", "execution")

5

you get a character level distance because that's what the target and source variable indexes are giving you.  If you call with lists of words as in 

>>> hyp = "theta bled own there".split()

>>> hyp

['theta', 'bled', 'own', 'there']

>>> ref = "the table down there".split()

>>> ref

['the', 'table', 'down', 'there']

>>> minEditDist(hyp, ref)

3

then you get a word level edit distance. 

To get the WER from that you need to divide by the length of the gold standard answer.  

>>> float(minEditDist(hyp, ref)) / len(ref)

0.75

You have to convert the distance to a float so you don't get an integer divide (in Python 2, / on two ints truncates).
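In case it helps to see why the same code works on both strings and lists, here's a minimal stand-in for an edit distance routine. It is not the code handed out for the HW: min_edit_dist, wer, and the sub_cost parameter are names made up for this sketch, which is written for Python 3 (where / is already a true divide).

```python
def min_edit_dist(target, source, sub_cost=1):
    """Edit distance between two sequences (strings or lists of words).

    Insertions and deletions cost 1; a substitution costs sub_cost.
    Only len(), indexing and == are used, so the items can be
    characters or whole words.
    """
    n, m = len(target), len(source)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if target[i - 1] == source[j - 1] else sub_cost
            dist[i][j] = min(dist[i - 1][j] + 1,        # cost-1 step from the row above
                             dist[i][j - 1] + 1,        # cost-1 step from the column to the left
                             dist[i - 1][j - 1] + sub)  # match or substitution
    return dist[n][m]


def wer(hyp, ref):
    """Word error rate: word-level edit distance over reference length."""
    return min_edit_dist(hyp, ref) / len(ref)


hyp = "theta bled own there".split()
ref = "the table down there".split()
print(min_edit_dist("intention", "execution"))  # character level
print(min_edit_dist(hyp, ref))                  # word level
print(wer(hyp, ref))
```

Passing sub_cost=2 gives the other convention mentioned above, where a substitution counts as a deletion plus an insertion.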


Google Tech Talk

Google is recruiting on campus tomorrow (Wed 1/26).  They're giving a tech talk as part of that.