Machine Learning CSCI 5622

Instructor: Greg Grudic

 

Spring 2008.

 

Location: Wednesdays 3:00pm-5:30pm, ECCR 131

Instructor: Professor Greg Grudic

Office: ECOT 525

Office Hours: Tuesday and Thursday 2:00 to 3:00, and by appointment

Phone: 303-492-4419

Email: grudic@cs.colorado.edu

 

 

Term Project: (Due May 7, 11:55PM. Please email the project directly to me.) The project write-up should include concise descriptions of 1) what you did, 2) why you did it, 3) your experimental or theoretical results, 4) a conclusion, and 5) future work (if applicable). All software written for the project should be submitted, along with all detailed experimental results if appropriate. The default project should use either WEKA (http://www.cs.waikato.ac.nz/ml/weka/ ) or other publicly available software to analyze the following datasets: Data_Default_Project.zip, which contains five data sets (four classification and one regression). You should use at least 3 different algorithms to estimate future error rates on each data set, for each algorithm type.

 

 

Quizzes:

1.   Quiz 1: CSCI5622_quiz_1.pdf

2.   Quiz 2: CSCI5622_quiz_2.pdf

3.   Quiz 3: CSCI5622_quiz_3.pdf

4.   Quiz 4: CSCI5622_quiz_4.pdf

 

 

Homework:

1.   Homework 1: See http://www.colorado.edu/physics/pion/csci5622-spring08

 

2.   Homework 2: See http://www.colorado.edu/physics/pion/csci5622-spring08

 

 

3.   Homework 3: Map the auto data in Homework 1 into Gaussian Kernel Space. Use the algorithms developed in Problem 4 of HW2 (set the number of cross validation folds to 5) to choose the optimal kernel parameter (sigma) and Nearest Neighbor parameter k in kernel space. Compare this to the k you get in the original input space of this dataset. Report the cross validation error rate you get by doing k Nearest Neighbor in input space, as well as in kernel space. Which space do you think is better for this problem domain (assuming the k Nearest Neighbor algorithm is used)? Email (to the marker Avleen.Bijral@colorado.edu) all MATLAB code used (zipped) and answers to the above question by 11:55PM on Wednesday, March 6. Make sure to normalize your data!

 

4.   Homework 4: Modify the Perceptron cost/loss function to mimic the Support Vector Machine Classification cost/loss function. You are free to use any code that I have posted on the web. You are also free to use numerical differentiation or analytic differentiation to implement your modified version of the Perceptron gradient descent algorithm. Test your algorithm on synthetic data that you design and generate to verify the algorithm. Verify the algorithm first in linear space, on linearly separable data and on data that is not linearly separable (where a single point makes the data not linearly separable). Then verify the algorithm on nonlinear data (that you design for the test) using the Gaussian Kernel. The assignment is due by 11:55PM on Wednesday, April 16. It includes a description and justification of the cost function (i.e., why it is SVM-like), a description of your gradient descent algorithm, the MATLAB code used to generate verification data, and the MATLAB code used to implement the algorithm and the model. I will post data to which you will apply your algorithm by April 2. You are free to work in groups to try to understand this assignment, but the work you hand in must be your own! You are also free to look on the web for descriptions of such algorithms (but this is not needed). If you do this, you must let me and the marker know which paper you are using and why. You must also implement all algorithms in the paper on your own.

 

5.   Homework 5: Due April 30, 2008, 11:55PM. Estimate the future error of a model constructed on the data HW5_Data.zip using a support vector machine with a radial basis function kernel. Use the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/ ) with the Matlab Interface (http://www.csie.ntu.edu.tw/~cjlin/cgi-bin/matlab.cgi?+http://www.csie.ntu.edu.tw/~cjlin/libsvm/matlab+zip ). To estimate your error, do 50 random experiments, each time splitting the data randomly into 10% for testing and 90% for training. The training data should be used in a 5-fold cross validation experiment to pick the "best" radial basis function kernel parameter gamma and the "best" value for C (use the C-SVM implementation; i.e. set "-s 0"). Make sure to normalize your data and to find an appropriate search range for C and gamma. Once you find your optimal C and gamma, use the entire 90% of the data to build a single SVM model with these parameters, and use it to obtain an error rate on the 10% test data. Report your final estimated error on future data by averaging the error rates over the 50 experiments. Repeat this experimental scenario by mapping the data into radial basis function kernel space (i.e. the same kernel as for the SVM experiments), and using the SPARSE FISHER LDA code presented in class to construct a model in this space. Once more, use the 90% training set to perform 5-fold cross validation to pick the "best" radial basis function kernel parameter gamma and the number of basis functions (terms) in the SPARSE FISHER LDA model given this gamma (to estimate the number of terms in the model, average the number of terms used in each of the 5 folds for a specific gamma, and round to the nearest integer). Once these learning parameters are known, use the entire 90% training set to build a single model for testing on the 10% test set.
Report the average error on future data of this SPARSE FISHER LDA algorithm by averaging all your errors over the 50 random experiments. In a single zip file, email the TA (and me) 1) all code (which must be in MATLAB) used to run these experiments, 2) instructions on how to run the code, and 3) the final error rates for the SVM algorithm and the SPARSE FISHER LDA algorithm.
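As a rough illustration of the kernel-space mapping described in Homework 3, the sketch below normalizes the inputs, represents each point by its Gaussian kernel values against the training points, and runs k Nearest Neighbor on that representation. It is written in Python only for illustration (the homework itself requires MATLAB). The function names, the choice of this particular mapping, and the assumption that the targets are class labels are all hypothetical and not taken from the course code.

```python
import math

def zscore_fit(rows):
    """Column means and standard deviations, computed from the given rows only."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant columns
    return means, stds

def zscore_apply(rows, means, stds):
    """Normalize rows with previously fitted means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_map(rows, centers, sigma):
    """Assumed mapping: each row becomes its kernel values against every center."""
    return [[gaussian_kernel(r, c, sigma) for c in centers] for r in rows]

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = {}
    for i in order[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)
```

Calling knn_predict on the normalized inputs gives the input-space result, and calling it on the output of kernel_map gives the kernel-space result. Choosing sigma and k would wrap both calls in the 5-fold cross validation loop from HW2.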

 
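For Homework 4, one common way to make the Perceptron loss SVM-like is to replace max(0, -y f(x)) with the hinge loss max(0, 1 - y f(x)) and add a penalty on the weight norm. The Python sketch below (illustration only; the homework requires MATLAB) applies batch subgradient descent to that objective in linear space. The specific objective, the step size, the regularization weight lam, and all function names are assumptions made for this example, and labels are assumed to be +1 or -1.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron_loss(margin):
    """Perceptron criterion: zero as soon as the point is on the correct side."""
    return max(0.0, -margin)

def hinge_loss(margin):
    """SVM-style hinge: also penalizes correct points whose margin is below 1."""
    return max(0.0, 1.0 - margin)

def objective(w, b, xs, ys, lam):
    """lam/2 * ||w||^2 plus the mean hinge loss over the data."""
    reg = 0.5 * lam * dot(w, w)
    mean_hinge = sum(hinge_loss(y * (dot(w, x) + b)) for x, y in zip(xs, ys)) / len(xs)
    return reg + mean_hinge

def train_hinge(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the objective above; labels must be +1 or -1."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1.0:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

The difference from the Perceptron shows up at a correctly classified point with margin 0.5: the Perceptron loss is already zero there, while the hinge loss is still 0.5 and keeps pushing the boundary away. A kernel version would apply the same update to inputs mapped through the Gaussian kernel.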

 
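The experimental protocol in Homework 5 (repeated random 90%/10% splits, with 5-fold cross validation on the training portion to select parameters) can be sketched as follows. This Python version is for illustration only, since the homework requires MATLAB and LIBSVM. The learner here is a k Nearest Neighbor stand-in, not an SVM or SPARSE FISHER LDA, so the parameter grid holds values of k rather than (C, gamma). All function names are hypothetical, and normalization, which would be fitted on each training portion, is omitted for brevity.

```python
import math
import random

def knn_train(xs, ys, k):
    """Stand-in learner (NOT an SVM): returns a k Nearest Neighbor predict function."""
    def predict(q):
        order = sorted(range(len(xs)), key=lambda i: math.dist(xs[i], q))
        votes = {}
        for i in order[:k]:
            votes[ys[i]] = votes.get(ys[i], 0) + 1
        return max(votes, key=votes.get)
    return predict

def error_rate(predict, xs, ys, idx):
    """Fraction of the points indexed by idx that predict gets wrong."""
    return sum(predict(xs[i]) != ys[i] for i in idx) / len(idx)

def cv_error(train_fn, xs, ys, train_idx, folds, params):
    """Average held-out error of one parameter setting over the given folds."""
    total = 0.0
    for held in folds:
        held_set = set(held)
        fit_idx = [i for i in train_idx if i not in held_set]
        predict = train_fn([xs[i] for i in fit_idx], [ys[i] for i in fit_idx], params)
        total += error_rate(predict, xs, ys, held)
    return total / len(folds)

def estimate_future_error(train_fn, xs, ys, grid, n_runs=50, test_frac=0.1,
                          n_folds=5, seed=0):
    """Mean test error over n_runs random splits, selecting parameters by inner CV."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, round(test_frac * n))
    run_errors = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        # Folds are built once per run so every parameter setting sees the same splits.
        folds = [train_idx[f::n_folds] for f in range(n_folds)]
        best = min(grid, key=lambda p: cv_error(train_fn, xs, ys, train_idx, folds, p))
        # Refit on the whole training portion with the selected parameters.
        predict = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx], best)
        run_errors.append(error_rate(predict, xs, ys, test_idx))
    return sum(run_errors) / n_runs
```

The test points never take part in parameter selection, which is what makes the averaged test error an estimate of future error. Replacing knn_train with a function that trains an SVM for a given (C, gamma) pair would leave the rest of the loop unchanged.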

 

Weekly Class Schedule:

 

1.   January 16, 2008: Introduction. K Nearest Neighbor Algorithm.

 

2.   January 23, 2008: Cross Validation, Model Selection, and Accuracy Estimation. Please read and be ready to discuss accEst.pdf. (Guest Lecturer: Sam Reid). See http://www.colorado.edu/physics/pion/csci5622-spring08

 

3.   January 30, 2008: Cross Validation, Model Selection, and Accuracy Estimation. Please read and be ready to discuss accEst.pdf. (Guest Lecturer: Sam Reid). See http://www.colorado.edu/physics/pion/csci5622-spring08

 

4.   February 6, 2008: Reading material for next week: please read and be prepared to discuss these two documents: Classification_1.pdf and Regression_1.pdf.

 

5.   February 13, 2008: Introduction to regression and classification. Mercer_Kernels.pdf

 

6.   February 20, 2008: Intro to classification Classification_1.pdf. Perceptron Algorithm (Perceptron.pdf).

 

7.   February 27, 2008: Perceptron Algorithm (Perceptron.pdf) (Perceptron_Demo.zip). Kernel Demo (Demo_Kernels.zip). Regression Demo (Reg_Demos.zip). Fast Gaussian Kernel Calculations (Fast_Gaussian_Kernels.zip). Linear Discriminant Analysis (LDA), Fisher's Linear Discriminant, Quadratic Discriminant Analysis (QDA).

 

8.   March 5, 2008: Support Vector Classification (SMV_classification.pdf).

 

9.   March 12, 2008: Support Vector Regression (SMV_regression.pdf). Decision Trees (Trees.pdf).

 

10.  March 19, 2008: Linear Discriminant Analysis (LDA), Fisher's Linear Discriminant, Quadratic Discriminant Analysis (QDA), Mixture of Gaussians (LDA_QDA_FISHER.pdf). Neural Networks (NeuralNetwork.pdf).

 

11.  April 2, 2008: Sparse Linear Systems. (Notes). (Code).

 

12.  April 9, 2008: Reinforcement Learning.

 

13.  April 16, 2008: Neural Networks (NeuralNetwork.pdf). Ensemble Learning (Approaches_To_Supervised_Learning.pdf). Predicting Model Error (Model_Selection_1.pdf).

 

14.  April 23, 2008: Ensemble Learning (Approaches_To_Supervised_Learning.pdf). Bayesian Learning. K-Means Clustering (Bayesian_1.pdf).

 

15.  April 30, 2008: Dimensionality Reduction. Semi-Supervised Learning. Spectral Clustering.

 

If you qualify for accommodations because of a disability, please submit to me a letter from Disability Services in a timely manner so that your needs may be addressed.  Disability Services determines accommodations based on documented disabilities.  Contact: 303-492-8671, Willard 322, and http://www.Colorado.EDU/disabilityservices.

 

Campus policy regarding religious observances requires that faculty make every effort to reasonably and fairly deal with all students who, because of religious obligations, have conflicts with scheduled exams, assignments or required attendance. See full details at http://www.colorado.edu/policies/fac_relig.html.

 

Students and faculty each have responsibility for maintaining an appropriate learning environment. Those who fail to adhere to such behavioral standards may be subject to discipline. Professional courtesy and sensitivity are especially important with respect to individuals and topics dealing with differences of race, culture, religion, politics, sexual orientation, gender, gender variance, and nationalities. Class rosters are provided to the instructor with the student's legal name. I will gladly honor your request to address you by an alternate name or gender pronoun. Please advise me of this preference early in the semester so that I may make appropriate changes to my records. See policies at http://www.colorado.edu/policies/classbehavior.html and at http://www.colorado.edu/studentaffairs/judicialaffairs/code.html#student_code

 

The University of Colorado at Boulder policy on Discrimination and Harassment, the University of Colorado policy on Sexual Harassment and the University of Colorado policy on Amorous Relationships apply to all students, staff and faculty.  Any student, staff or faculty member who believes s/he has been the subject of discrimination or harassment based upon race, color, national origin, sex, age, disability, religion, sexual orientation, or veteran status should contact the Office of Discrimination and Harassment (ODH) at 303-492-2127 or the Office of Judicial Affairs at 303-492-5550.  Information about the ODH, the above referenced policies and the campus resources available to assist individuals regarding discrimination or harassment can be obtained at http://www.colorado.edu/odh

 

All students of the University of Colorado at Boulder are responsible for knowing and adhering to the academic integrity policy of this institution. Violations of this policy may include: cheating, plagiarism, aid of academic dishonesty, fabrication, lying, bribery, and threatening behavior.  All incidents of academic misconduct shall be reported to the Honor Code Council (honor@colorado.edu; 303-725-2273). Students who are found to be in violation of the academic integrity policy will be subject to both academic sanctions from the faculty member and non-academic sanctions (including but not limited to university probation, suspension, or expulsion). Other information on the Honor Code can be found at http://www.colorado.edu/policies/honor.html  and at http://www.colorado.edu/academics/honorcode/