You are all expected to work
individually on this project. You may discuss this project with your colleges,
(or anyone else) but the java code you implement must be your own.
The goal of the project is to
implement a novel learning algorithm (Minimax
Probability Machine Regression (MPMR)) within the WEKA environment. The
learning algorithm is described in MPM_Regression.pdf. This algorithm is motivated by a
new theoretical framework (called Minimax Probability
Machines – which we will cover in class – see minimax-nips01.ps
and derivation.pdf) and has not been extensively tested on
real world data. Your task is to implement MPMR and test it on a variety or
real world data sets.
To implement the algorithm
you will need to solve a linear system of equations. Java code for doing this
can be obtained at http://jmat.sourceforge.net/
. In particular, you will need to use the Singular Value Decomposition (SVD)
for inverting a matrix (which we will also cover in class). You can also
convert the following C code for singular value decomposition into java: svd.c. An example driver routine for svd.c is in test_svd_sigma.c. A complete description of how to calculate
the covariance matrix (i.e. the elements in equation (11) of MPM_Regression.pdf) is also contained in test_svd_sigma.c.
To debug your algorithm you
can use the data file debug_data.txt.
This data file contains 5 instances (i.e. training examples) in 5 rows. Each
instance has 13 inputs (the first 13 columns) and 1 output (the last column).
An MPMR model was built using the rbf kernel with
. The coefficients for the MPMR model are given in model_coeff.txt.
The cutoff for setting weights to zero in the SVD algorithm was set to
.
You will evaluate the MPMR
algorithm as follows:
1.
Compare the
performance of MPMR with 3 other suitable algorithms in the Weka
environment on the following data set project_data_1.txt
(you should use your judgment to decide which algorithms are suitable). This
data contains 506 instances (i.e. training examples) in 506 rows. Each instance
has 13 inputs (the first 13 columns) and 1 output (the last column).
2.
Pick 4 other
datasets from the The
Machine Learning Database Repository at
UC Irvine and compare MPMR performance to the best published results on these
datasets. At least 2 of the datasets should be classification data.
Implementing and testing the
MPMR algorithm is worth 50% of your final mark. The breakdown of marks is as
follows:
·
35% for
implementing MPMR in Java (Please see me if you would like to use a language
other than Java) and evaluating it as defined above.
·
10% for
experimentally determining if equations (9), (12) and (13) in MPM_Regression.pdf are valid. You will need to present
the experimental evidence clearly by using either an independent test set, or
using N (10?) fold cross validation. You should verify this on the project_data_1.txt
data, as well as one additional data set (this can be one of the 4 datasets you
will use above).
·
5% for
integrating MPMR into the WEKA environment.
In order to encourage
creativity, you can also receive the following bonus marks;
·
5% Bonus for
finding an algorithm that automatically chooses optimal
for the Polynomial
Kernel and
for the Gaussian
kernel, using
and
in section 2 of the algorithm
description. In particular, formulate an algorithm that does the following:
o
Find
if
is a Polynomial Kernel
is, and
if
is a Gaussian kernel
such that
,
is maximize,
is minimized, and
is maximized.
·
5% Bonus for
modifying the MPMR algorithm to give better results on regression and/or classification
data.