CSCI 1300
Project: Tagging English Words

This is one of several possible projects for CSCI 1300. The following link tells how the projects are used:


What the Program Should Do

Start by creating a working directory for this project and downloading this 11MB file to that directory:


www.cs.colorado.edu/~main/penn.txt
This file was created by a linguistics undergraduate student from the Penn Treebank corpus of English sentences, taken from the Wall Street Journal. Each line of the file has the form:
xxx/YYY
where the xxx is an English word or punctuation mark and the YYY is a syntactic category for the word. For example:
chairman/NN
means that the word chairman appears in the corpus and it has been "tagged" as being in the syntactic class NN (which is the category for singular nouns).
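
As a minimal sketch of how one such line might be taken apart, the hypothetical helper below (split_tagged_line is an illustrative name, not part of the assignment) splits at the last slash. Splitting at the last slash rather than the first is a defensive assumption in case some word in penn.txt itself contains a slash; the handout does not say whether that happens.

```python
def split_tagged_line(line):
    """Split one 'word/TAG' line into (word, tag).

    Returns None for a blank or malformed line.
    Assumption: the tag is whatever follows the LAST slash on the line.
    """
    text = line.strip()
    word, slash, tag = text.rpartition("/")
    if not slash or not word or not tag:
        return None
    return word, tag
```

For example, split_tagged_line("chairman/NN") gives the pair ("chairman", "NN").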

Your job: Write a program that repeatedly asks the user to type an English word. The program reads the word and then looks through the penn.txt file to see which syntactic categories that word appears in. The output of the program should be a table with three columns:

  1. The name of the syntactic category (such as NN).
  2. The number of times that the word appears in the category.
  3. The percentage of the word's occurrences that fall in this category (e.g. 42.0%).
The table should list only those categories for which the word actually has one or more appearances. There will be no more than 100 categories.
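
The three columns described above could be produced along these lines. This is only a sketch: the function name format_table, the column widths, and the alphabetical ordering of categories are all assumptions, since the handout does not fix a layout. It expects a dictionary mapping each category name to the number of times the user's word was seen with that tag.

```python
def format_table(tag_counts):
    """Return one formatted row per category that has at least one occurrence.

    tag_counts maps a category name (e.g. "NN") to a count for one word.
    Column widths and sort order are illustrative choices only.
    """
    total = sum(tag_counts.values())
    rows = []
    for tag, count in sorted(tag_counts.items()):
        if count > 0:  # list only categories where the word actually appears
            percent = 100.0 * count / total
            rows.append(f"{tag:<8}{count:>8}{percent:>9.1f}%")
    return rows
```

A word seen 21 times as NN and 29 times as VB would get two rows, with 42.0% in the NN row and 58.0% in the VB row.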

Estimated Difficulty Level for First Semester Students:


On a scale of 50 (easy) to 500 (hard): 330 points


Bonus of 120 points if your program does not re-read the 11MB penn.txt file for each word that the user inputs.
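
One possible way to earn the bonus, sketched here under assumptions, is to read penn.txt a single time into a nested dictionary (word, then category, then count) and answer every query from memory. The names load_counts and run are hypothetical, matching is assumed to be case-sensitive, and stopping on an empty input line is an assumption; the handout specifies none of these.

```python
from collections import defaultdict


def load_counts(lines):
    """Build {word: {tag: count}} from an iterable of 'word/TAG' lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        # Assumption: the tag follows the last slash on the line.
        word, slash, tag = line.strip().rpartition("/")
        if slash and word and tag:
            counts[word][tag] += 1
    return counts


def run(path="penn.txt"):
    """Read the file once, then answer queries until an empty line is typed."""
    with open(path, encoding="utf-8", errors="replace") as f:
        counts = load_counts(f)
    while True:
        word = input("Word: ").strip()
        if not word:
            break
        tags = counts.get(word, {})  # .get avoids adding unknown words
        total = sum(tags.values())
        for tag, n in sorted(tags.items()):
            print(f"{tag:<8}{n:>8}{100.0 * n / total:>9.1f}%")
```

Calling run() starts the question-and-answer loop. The trade-off is memory for speed: the whole word table stays in memory, but each lookup no longer scans the 11MB file.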