Learning Algorithms covered so far:

 

  1. K Nearest Neighbor
    1. Hypothesis Space

                                                               i.      Variable size (nonparametric)

                                                             ii.      Deterministic

                                                            iii.      Continuous parameters (although can use discrete as well given a distance metric)

    1. Learning algorithms:

                                                               i.      Direct computation

1.      Take all data and directly compute a model

                                                             ii.      Lazy

1.      Looking for generalizations

                                                            iii.      Batch and online versions.

  1. Naïve Bayes
    1. Hypothesis Space

                                                               i.      Fixed size (parametric)

                                                             ii.      Stochastic

                                                            iii.      Continuous parameters

    1. Learning algorithms:

                                                               i.      Direct computation

                                                             ii.      Eager

1.      Looking for generalizations

                                                            iii.      Batch.

  1. Decision Trees
    1. Hypothesis Space

                                                               i.      Variable size (nonparametric)

                                                             ii.      Deterministic

                                                            iii.      Discrete and continuous parameters

    1. Learning algorithms:

                                                               i.      Constructive search

1.      Learning algorithms built by adding nodes

                                                             ii.      Eager

1.      Looking for generalizations

                                                            iii.      Batch (however, some online versions exist).

 

 

 

Statistical and Computational Learning Theory (Part1)

 

 

Why theory? Answer questions such as:

 

  1. What is the rate of convergence of a learning algorithm?
    1. As a function of the size of the hypothesis space?
    2. As a function of the number of training examples?
  2. What is the computational complexity of a learning algorithm?
    1. As a function of the size of the hypothesis space?
    2. As a function of the number of training examples?

 

How do we begin to answer these questions?

 

Given Assumptions such as:

 

The specific Question we address here is: What are the theoretical bounds on the error rate of  on new data points?

 

General Assumptions (Noise-Free Case)

 

  1. Assumptions on data : Examples are iid (independently identically distributed) generated according to a probability distribution  and labeled according to an unknown function
  2. Assumptions on learning algorithm: The learning algorithm is given  examples in  and outputs a hypothesis  that is consistent with  (i.e. correctly classifies them all).
  3. Goal Assumption:  should fit new examples well (low  error rate) that are draw according to the same distribution

 

 

                                            

 

 

PAC (Probably – Approximately Correct) Learning

 

We allow algorithms to fail with probability

 

Procedure:

1.      Draw  random samples ->

2.      Run a learning algorithm and generate

 

Since  is iid we cannot guarantee that the data will be representative

-Want to ensure that  of the time, the hypothesis error is less than

                                                    

 

Ex: want to obtain a 90% () correct hypothesis 95% () of the time.

 

 

Case 1: Finite Hypothesis Spaces

 

Assume *is finite

 

Consider  such that  - (-bad).

 

Given one training example , the probability that  classifies it correctly is

                                                    

Given  training examples ,…,, the probability that  classifies it correctly is:

                                  

Now, assume we have a second hypothesis  (also -bad). What is the probability that either  or  will be correct?

Therefore, for k -bad hypothesis: the probability that any one of them are correct is:

                                                              

Since

                                                            

Inequality:

                                                      

 

Lemma: For a finite hypothesis space , given a set of  training examples drawn independently according to , the probability that there exists an hypothesis  with true error greater than  consistent with the training examples is less than or equal to .

 

Therefore, for probability less than

                                                             

This is true whenever

                                                       

(Blumer bound – Blumer, Ehrenfeucht, Haussler, and Warmuth 1987).

 

Therefore, if  is consistent and all  samples are independently drawn according to , the error rate  on new data points is bounded by

                                                       

 

Example applications:

·        Boolean Conjunctions over  features

o       Three possibilities  or not present. Therefore for  features

o      

 

Finite Hypothesis Spaces: Inconsistent Hypothesis

 

If  does not perfectly fit the data, but has error rate of

                                                    

Therefore larger than the error rate on

 

 

Infinite Hypothesis Spaces

 

Even if ,  has limited expressive power, therefore we should still be able to obtain bounds.

 

Definition: Let  be a set of  examples. A hypothesis space  can trivially fit , if for every possible labeling of the examples in , there exists an  that gives this labeling. If so, than  is said to shatter .

 

Definition: The Vapnik-Chervonenkis dimension (VC-dimension) of a hypothesis space  is the size of the largest set of examples that can be trivially fit (shattered) by .

 

Note: if  is finite, .

 

VC-Dimension Example

 

Let  be the set of all intervals on the real line.

If  than  is in the interval.

If  than  is NOT in the interval.

 

 can trivially fit (shatter) any two points.

 

 

 

However,  cannot trivially fit (shatter) three points.

 

 

Therefore the VC-dimension is 2.

 

 

Error Bound for Infinite Hypothesis Spaces

 

Theorem: Suppose that . Assume that there are  training examples in , and that a learning algorithm finds an  with error rate  on . Then, with probability , the error rate  on new data is:

                                             

 

Called the Empirical Risk Minimization Principle (Vapnik).

 

 

However, this does not work well for fixed hypothesis spaces because you learning algorithm will minimize :

 

·        Underfitting: Every hypothesis  has high error . Want to consider  that has larger space.

·        Overfitting: Every hypothesis  has high error . Want to consider  smaller hypothesis spaces that have lower .

 

Suppose we have a nested series of hypothesis spaces:

                                                      

with corresponding VC dimensions and errors

                                                      

 

Then, you should use the Structural Risk Minimization Principle (Vapnik).

 

Choose the hypothesis space  that minimizes the combined error bounds: