CSCI 5622 Project: (worth 25% of your mark)

 

11/6/2001

 

Instructor: Professor Grudic

 

Due date: December 18, 2002

 

You have two choices for the project:

 

  1. Evaluate learning algorithms on data. Either use existing algorithms on a novel set of data, or design your own algorithm and compare its performance to that of at least one existing algorithm. Your mark will be based on the quality of your analysis, not on the quality of the model you generate.

 

OR

 

 

  2. Write a review of two papers (if possible, no longer than 15 pages total please). You will need to validate the claims made, point out any technical holes you have found, decide if the papers are interesting, and indicate potential future work. You may choose from the following set of papers (I suggest you choose two related papers), or you may choose two papers yourself that you are interested in. Some papers come with algorithms which you will be expected to run; other papers might require you to implement the algorithms yourself. Your mark will be based on the extensiveness and quality of your review.
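For the first option (evaluating learning algorithms), the comparison against an existing algorithm could be organized roughly as below. This is only a minimal sketch under stated assumptions: the synthetic two-class data, the choice of a majority-class baseline and a 1-nearest-neighbor learner, and all function names are illustrative and are not prescribed by the project. The point it shows is that both learners are scored on the same cross-validation folds, so their per-fold accuracies can be compared pairwise.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_baseline(train_X, train_y):
    """Baseline learner: always predict the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def one_nearest_neighbor(train_X, train_y):
    """1-NN learner: predict the label of the closest training point."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return predict

def cross_validate(learner, X, y, folds):
    """Return the per-fold test accuracy of a learner."""
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test_set]
        model = learner([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# Synthetic two-class data (an assumption, purely for illustration).
rng = random.Random(1)
X = [(rng.gauss(c * 3, 1), rng.gauss(c * 3, 1)) for c in (0, 1) for _ in range(50)]
y = [c for c in (0, 1) for _ in range(50)]

# Both learners are evaluated on the same folds so the scores are paired.
folds = k_fold_indices(len(X), k=5)
base = cross_validate(majority_baseline, X, y, folds)
knn = cross_validate(one_nearest_neighbor, X, y, folds)
diffs = [a - b for a, b in zip(knn, base)]
print("baseline mean accuracy:", statistics.mean(base))
print("1-NN mean accuracy:", statistics.mean(knn))
print("mean paired difference:", statistics.mean(diffs), "+/-", statistics.stdev(diffs))
```

Reporting the spread of the paired differences, rather than two mean accuracies alone, is one simple way to make the analysis say something about how reliable the observed gap is.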

 

 

Bayes Nets:

 

Probabilistic Clustering in Relational Data, B. Taskar, E. Segal, and D. Koller. Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, August 2001, pages 870--876.

 

Active Learning for Structure in Bayesian Networks, S. Tong and D. Koller. Seventeenth International Joint Conference on Artificial

 

Exact Inference in Networks with Discrete Children of Continuous Parents, U. Lerner, E. Segal, and D. Koller. Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, Washington, August 2001, pages 319--328.

 

 

Ensemble Algorithms:

 

Friedman, J. H. "Tutorial: Getting Started with MART in Splus." (Sept. 1999) (software)

Friedman, J. H. "Stochastic Gradient Boosting." (March 1999b) (software)

Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (Feb. 1999a) (software)

Friedman, J. H., Hastie, T. and Tibshirani, R. "Additive Logistic Regression: a Statistical View of Boosting." (Aug. 1998)

Random Forests (Leo Breiman)

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.
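The definition above can be illustrated with a toy sketch. It is not Breiman's implementation: as a simplifying assumption, each "tree" here is a one-split stump, and the random vector for each tree is a bootstrap sample of the data plus one randomly chosen feature. The function names and the synthetic data are hypothetical. What the sketch does preserve is the structure of the definition: every tree receives a random vector drawn independently and from the same distribution, and the forest combines the trees by majority vote.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Fit a one-split tree on one feature, choosing the threshold with fewest errors."""
    best = None
    for threshold in sorted({x[feature] for x in X}):
        left = [label for x, label in zip(X, y) if x[feature] <= threshold]
        right = [label for x, label in zip(X, y) if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    """Give each tree its own random vector: a bootstrap sample and a feature index."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Drawn independently and from the same distribution for every tree.
        sample = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))
        trees.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return trees

def predict_forest(trees, x):
    """Combine the trees by majority vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Synthetic two-class data (an assumption, purely for illustration).
rng = random.Random(2)
X = [(rng.gauss(c * 4, 1), rng.gauss(c * 4, 1)) for c in (0, 1) for _ in range(40)]
y = [c for c in (0, 1) for _ in range(40)]

forest = fit_forest(X, y)
accuracy = sum(predict_forest(forest, x) == label for x, label in zip(X, y)) / len(X)
print("training accuracy:", accuracy)
```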

A technical report is available explaining the theory and implementation of random forests.

Software for Random Forests

For standalone use, the following are available:

source code

program documentation

a sample data set

sample output

 

Support Vector Machines:

 

Support Vector Machine Active Learning with Applications to Text Classification, S. Tong and D. Koller. Machine Learning Journal, 2002, to appear.

 

Feature Selection for SVMs -- Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik