CSCI 1300
Project: Tagging English Words

This is one of several possible projects for CSCI 1300. The following links explain how the projects are used:

CSCI 1300 - Class Syllabus
CSCI 1300 - Class Grading
CSCI 1300 - Project List
Please note that each project specifies precisely what your program should accomplish, but not how the program should work. Designing the techniques that the program uses is part of your assignment.

What the Program Should Do

Start by creating a working directory for this project and downloading this 11MB file to that directory:


www.cs.colorado.edu/~main/penn.txt

This file was created by a linguistics undergraduate student from the Penn Treebank corpus of English sentences, taken from the Wall Street Journal. Each line of the file has the form:

xxx/YYY

where xxx is an English word or punctuation mark and YYY is the syntactic category assigned to that word. For example:

chairman/NN

means that the word chairman appears in the corpus and it has been "tagged" as being in the syntactic class NN (which is the category for singular nouns).
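
As an illustration of the line format, here is a minimal Python sketch of splitting one line into its word and its category. The function name split_tagged is hypothetical, and the sketch assumes the category always follows the last slash on the line, so that a word which itself contains a slash is kept whole. Whether penn.txt contains such words is not stated here, so treat that as a precaution rather than a fact about the file.

```python
def split_tagged(line):
    """Split a 'word/TAG' line into (word, tag).

    Returns None for lines that do not match the expected form.
    Assumption: the tag is whatever follows the LAST slash.
    """
    word, sep, tag = line.strip().rpartition("/")
    if not sep or not word or not tag:
        return None
    return word, tag


print(split_tagged("chairman/NN\n"))
```

Splitting at the last slash rather than the first is a design choice: it costs nothing for ordinary lines and avoids mangling any word that happens to contain a slash.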

Your job: Write a program that repeatedly asks the user to type an English word. The program reads the word and then looks through the penn.txt file to see which syntactic categories that word appears in. The output of the program should be a table with three columns:

The name of the syntactic category (such as NN).
The number of times that the word appears in the category.
The percentage of the word's occurrences that are in this category (e.g., 42.0%).

The table should list only those categories for which the word actually has one or more appearances. There will be no more than 100 categories.
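
One possible way to produce the three-column table, once the per-category counts for a word are known, is sketched below in Python. The function name format_table, the column widths, the sort order, and the sample counts are all assumptions made for illustration; the assignment does not prescribe any of them.

```python
from collections import Counter


def format_table(tag_counts):
    """Return one formatted row per category with a nonzero count.

    Each row holds the category name, the count, and the percentage
    of the word's occurrences that fall in that category.
    """
    total = sum(tag_counts.values())
    rows = []
    # Most frequent category first; ties broken alphabetically (a choice, not a requirement).
    for tag, count in sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count > 0:
            share = 100.0 * count / total
            rows.append(f"{tag:<8}{count:>8}{share:>9.1f}%")
    return rows


# Made-up counts, used only to show the layout.
sample = Counter({"NN": 21, "VB": 29})
for row in format_table(sample):
    print(row)
```

Categories with a zero count are skipped, which matches the requirement that the table list only categories in which the word actually appears.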

Estimated Difficulty Level for First Semester Students:

On a scale of 50 (easy) to 500 (hard): 330 points

Bonus of 120 points if your program does not re-read the 11MB penn.txt file for each word that the user inputs.
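
One way to avoid re-reading the file for every word is to read it once and keep the counts in memory. The Python sketch below is an illustration under stated assumptions, not a required design: the names build_index and lookup are hypothetical, words are matched exactly as typed (so "The" and "the" are counted separately), and the category is assumed to follow the last slash on each line.

```python
from collections import Counter, defaultdict


def build_index(lines):
    """Read 'word/TAG' lines once and return a word -> Counter of tags mapping."""
    index = defaultdict(Counter)
    for line in lines:
        word, sep, tag = line.strip().rpartition("/")
        if sep and word and tag:  # skip blank or malformed lines
            index[word][tag] += 1
    return index


def lookup(index, word):
    """Return the tag counts for word, or an empty Counter if it never appears."""
    # Use .get so that looking up an unknown word does not add it to the index.
    return index.get(word, Counter())


# Tiny in-memory stand-in for penn.txt, used only to demonstrate the idea.
demo = build_index(["chairman/NN\n", "chairman/NN\n", "run/VB\n", "run/NN\n"])
print(dict(lookup(demo, "run")))
```

With the real data, the index would be built a single time, for example with `with open("penn.txt") as f: index = build_index(f)`, and every word the user types afterwards would be answered from memory.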
