Project Work 3: Due
Each group needs to email me code (and, if applicable, a binary that runs on XP) that takes the three data sets that Platt used and generates the learning and test sets.
First, you will need to download the datasets:
1. The Adult training and testing data are available at:
Training adult data
testing set for the adult data
2. The Web data are available at:
training web data
testing set for the web data set
Check out the readme files to determine the format of the data.
3. The Reuters data can be downloaded from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
You will need to do some serious parsing to get at this data. Pointers to public parsers are available at:
http://xml.coverpages.org/publicSW.html#parsers
If you can’t parse this data, then extract the Reuters data from:
http://download.joachims.org/svm_light/examples/example1.tar.gz
The description can be found at: http://svmlight.joachims.org/
“Documents are represented as feature vectors. Each feature corresponds to a word stem (9947 features). The task is to learn which Reuters articles are about "corporate acquisitions". There are 1000 positive and 1000 negative examples in the file train.dat. The file test.dat contains 600 test examples. The feature numbers correspond to the line numbers in the file words.”
The code you write will generate training and test TEXT files in the following format. Each row contains one example, as follows:
C x1 x2 x3 … xd
Where C is the class, and x1, …, xd are the d features (or inputs).
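As a rough illustration of the row format, here is a small Python sketch that writes examples one per line, class first and then the d feature values, separated by single spaces. The names format_row and write_examples are hypothetical, and the sketch assumes each example is already a (class, list of d numbers) pair. How you print the numbers (for example, integers versus floats) is up to you.

```python
def format_row(label, features):
    """Return one output row: the class followed by the d feature values."""
    fields = [str(label)] + [repr(float(x)) for x in features]
    return " ".join(fields)


def write_examples(path, examples):
    """Write (label, features) pairs to a text file, one example per row."""
    with open(path, "w") as out:
        for label, features in examples:
            out.write(format_row(label, features) + "\n")
```

For instance, format_row(1, [0.5, 0.0, 2.0]) produces the row "1 0.5 0.0 2.0", that is, C = 1 with d = 3 features.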