Machine Learning (CSCI 4830)

 

Project: Due Monday December 16, 2002

 

You are all expected to work individually on this project. You may discuss this project with your colleges, (or anyone else) but the java code you implement must be your own.

 

The goal of the project is to implement a novel learning algorithm (Minimax Probability Machine Regression (MPMR)) within the WEKA environment. The learning algorithm is described in MPM_Regression.pdf. This algorithm is motivated by a new theoretical framework (called Minimax Probability Machines – which we will cover in class – see minimax-nips01.ps and derivation.pdf) and has not been extensively tested on real world data. Your task is to implement MPMR and test it on a variety or real world data sets.

 

To implement the algorithm you will need to solve a linear system of equations. Java code for doing this can be obtained at http://jmat.sourceforge.net/ . In particular, you will need to use the Singular Value Decomposition (SVD) for inverting a matrix (which we will also cover in class). You can also convert the following C code for singular value decomposition into java: svd.c. An example driver routine for svd.c is in test_svd_sigma.c. A complete description of how to calculate the covariance matrix (i.e. the elements in equation (11) of MPM_Regression.pdf) is also contained in test_svd_sigma.c.

 

To debug your algorithm you can use the data file debug_data.txt. This data file contains 5 instances (i.e. training examples) in 5 rows. Each instance has 13 inputs (the first 13 columns) and 1 output (the last column). An MPMR model was built using the rbf kernel with . The coefficients for the MPMR model are given in model_coeff.txt. The cutoff for setting weights to zero in the SVD algorithm was set to.

 

You will evaluate the MPMR algorithm as follows:

 

1.     Compare the performance of MPMR with 3 other suitable algorithms in the Weka environment on the following data set project_data_1.txt (you should use your judgment to decide which algorithms are suitable). This data contains 506 instances (i.e. training examples) in 506 rows. Each instance has 13 inputs (the first 13 columns) and 1 output (the last column).

2.     Pick 4 other datasets from the The Machine Learning Database Repository at UC Irvine and compare MPMR performance to the best published results on these datasets. At least 2 of the datasets should be classification data.

 

Implementing and testing the MPMR algorithm is worth 50% of your final mark. The breakdown of marks is as follows:

·        35% for implementing MPMR in Java (Please see me if you would like to use a language other than Java) and evaluating it as defined above.

·        10% for experimentally determining if equations (9), (12) and (13) in MPM_Regression.pdf are valid. You will need to present the experimental evidence clearly by using either an independent test set, or using N (10?) fold cross validation. You should verify this on the project_data_1.txt data, as well as one additional data set (this can be one of the 4 datasets you will use above).

·        5% for integrating MPMR into the WEKA environment.

 

In order to encourage creativity, you can also receive the following bonus marks;

 

·        5% Bonus for finding an algorithm that automatically chooses optimal  for the Polynomial Kernel and  for the Gaussian kernel, using  and  in section 2 of the algorithm description. In particular, formulate an algorithm that does the following:

 

o       Find  if  is a Polynomial Kernel is, and  if  is a Gaussian kernel such that ,  is maximize,  is minimized, and  is maximized.

 

·        5% Bonus for modifying the MPMR algorithm to give better results on regression and/or classification data.