Project Work 3: Due Monday February 16, 2004

 

Each group needs to email me code (and a binary that runs on XP, if applicable) that takes the three data sets that Platt used and generates the training and test sets.

 

First, you will need to download the datasets:


1. The Adult training and testing data are available at:

 

Training adult data

testing set for the adult data

 

2. The Web training and testing data are available at:

 

training web data

testing set for the web data set

 

Check the readme files to determine the format of the data.

 

3. The Reuters data can be downloaded from:

 

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

 

You will need to do some serious parsing to get at this data. Pointers to public parsers are available at:

 

http://xml.coverpages.org/publicSW.html#parsers
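If you would rather not use a full SGML parser, a few regular expressions may be enough to pull out the fields you need. The sketch below is only an illustration. It assumes each article is wrapped in a <REUTERS ...> ... </REUTERS> element that carries a LEWISSPLIT attribute, with topics listed as <D> items inside <TOPICS> and the article text inside <BODY>. Check these assumptions against the readme that ships with the collection. The function name parse_reuters and the dictionary keys are made up for this example.

```python
import re

# Assumed record layout (verify against the collection's readme):
#   <REUTERS ... LEWISSPLIT="TRAIN" ...>
#     <TOPICS><D>acq</D></TOPICS>
#     <TEXT> ... <BODY>article text</BODY></TEXT>
#   </REUTERS>
RECORD_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
SPLIT_RE = re.compile(r'LEWISSPLIT="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_ITEM_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters(sgml_text):
    """Return one dict per article with its split, topics, and body text."""
    records = []
    for attrs, content in RECORD_RE.findall(sgml_text):
        split = SPLIT_RE.search(attrs)
        topics = TOPICS_RE.search(content)
        body = BODY_RE.search(content)
        records.append({
            # None when the attribute is missing from the opening tag.
            "split": split.group(1) if split else None,
            # Empty list when the article has no topics.
            "topics": TOPIC_ITEM_RE.findall(topics.group(1)) if topics else [],
            # Empty string when the article has no body.
            "body": body.group(1).strip() if body else "",
        })
    return records
```

Turning each body into word-stem features and choosing the class label from the topics is still up to you; this sketch only separates the raw fields.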

 

If you can’t parse this data, then extract the Reuters data from:

 

http://download.joachims.org/svm_light/examples/example1.tar.gz

 

The description can be found at:  http://svmlight.joachims.org/

“Documents are represented as feature vectors. Each feature corresponds to a word stem (9947 features). The task is to learn which Reuters articles are about "corporate acquisitions". There are 1000 positive and 1000 negative examples in the file train.dat. The file test.dat contains 600 test examples. The feature numbers correspond to the line numbers in the file words.”
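The files train.dat and test.dat store each example sparsely in the SVM-light input format, roughly "label index:value index:value ...", with feature indices starting at 1 and an optional trailing "#" comment. The sketch below shows one way to expand such a line into a dense feature list. The function name parse_svmlight_line is made up for this example, and the sketch assumes plain classification files (no qid fields).

```python
def parse_svmlight_line(line, num_features):
    """Expand one sparse 'label idx:val ...' line into (label, dense list).

    Returns None for blank lines and lines that hold only a comment.
    Assumes feature indices are 1-based and at most num_features.
    """
    # Drop anything after '#', which marks a comment.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    tokens = line.split()
    # Labels such as '+1', '-1', or '1' all parse this way.
    label = int(float(tokens[0]))
    dense = [0.0] * num_features
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        dense[int(index) - 1] = float(value)
    return label, dense
```

For the Reuters example described above, num_features would be 9947, so each dense row has one class value followed by 9947 feature values.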

 

 

The code you write will generate training and test TEXT files in the following format. Each row contains one example, as follows:

 

C  x1 x2 x3 … xd

 

Where C is the class, and x1, …, xd are the d features (or inputs).
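As a rough illustration of this output format, the sketch below writes one example per line, with the class first and the d feature values after it. The function names format_row and write_examples are made up for this example, and the single-space separator is an assumption; use whatever whitespace your own reader expects.

```python
def format_row(label, features):
    """Return 'C x1 x2 ... xd' for one example, separated by single spaces."""
    return " ".join([str(label)] + [str(x) for x in features])


def write_examples(path, examples):
    """Write (label, features) pairs to a text file, one example per row."""
    with open(path, "w") as out:
        for label, features in examples:
            out.write(format_row(label, features) + "\n")
```

For instance, write_examples("train.txt", [(1, [0.0, 0.5]), (-1, [1.0, 0.0])]) produces a two-line file whose first line is "1 0.0 0.5".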