Learning
Algorithms covered so far:
i. Variable size (nonparametric)
ii. Deterministic
iii. Continuous parameters (although can use discrete as well given a distance metric)
i. Direct computation
1. Take all data and directly compute a model
ii. Lazy
1. Looking for generalizations
iii. Batch and online versions.
i. Fixed size (parametric)
ii. Stochastic
iii. Continuous parameters
i. Direct computation
ii. Eager
1. Looking for generalizations
iii. Batch.
i. Variable size (nonparametric)
ii. Deterministic
iii. Discrete and continuous parameters
i. Constructive search
1. Learning algorithms built by adding nodes
ii. Eager
1. Looking for generalizations
iii. Batch (however, some online versions exist).
Statistical and
Computational Learning Theory (Part1)
Why theory? Answer questions such as:
How do we begin to answer these questions?
Given Assumptions such as:
The specific Question we address here is: What are the
theoretical bounds on the error rate of
on new data points?
General Assumptions (Noise-Free Case)

![]()
PAC
(Probably – Approximately Correct) Learning
We allow algorithms to fail
with probability ![]()
Procedure:
1. Draw
random samples -> ![]()
2. Run
a learning algorithm and generate ![]()
Since
is iid we cannot
guarantee that the data will be representative
-Want to
ensure that
of the time, the
hypothesis error is less than ![]()
![]()
Ex: want to obtain a 90% (
) correct hypothesis 95% (
) of the time.
Case 1: Finite Hypothesis Spaces
Assume
is finite
Consider
such that
- (
-bad).
Given
one training example
, the probability that
classifies it
correctly is
![]()
Given
training examples
,…,
, the probability that
classifies it
correctly is:
![]()
Now, assume we have a second hypothesis
(also
-bad). What is the probability that either
or
will be correct?
Therefore, for k
-bad hypothesis: the probability that any one of them are correct
is:
![]()
Since ![]()
![]()
Inequality: ![]()
![]()
Lemma: For a finite hypothesis space
, given a set of
training examples
drawn independently according to
, the probability that there exists an hypothesis
with true error
greater than
consistent with the
training examples is less than or equal to
.
Therefore, for probability less than ![]()
![]()
This is true whenever
![]()
(Blumer bound – Blumer, Ehrenfeucht, Haussler, and Warmuth 1987).
Therefore, if
is consistent and
all
samples are
independently drawn according to
, the error rate
on new data points
is bounded by
![]()
Example applications:
·
Boolean Conjunctions over
features
o Three
possibilities
or not present.
Therefore for
features ![]()
o ![]()
Finite Hypothesis Spaces: Inconsistent Hypothesis
If
does not perfectly
fit the data, but has error rate of ![]()

Therefore larger than the error rate on ![]()
Infinite Hypothesis Spaces
Even if
,
has limited
expressive power, therefore we should still be able to obtain bounds.
Definition: Let
be a set of
examples. A
hypothesis space
can trivially
fit
, if for every possible labeling of the examples in
, there exists an
that gives this
labeling. If so, than
is said to shatter
.
Definition: The Vapnik-Chervonenkis dimension
(VC-dimension) of a hypothesis space
is the size of the
largest set of examples that can be trivially fit (shattered) by
.
Note: if
is finite,
.
VC-Dimension Example
Let
be the set of all
intervals on the real line.
If
than
is in the interval.
If
than
is NOT in the
interval.
can trivially fit
(shatter) any two points.

However,
cannot trivially fit
(shatter) three points.

Therefore the VC-dimension is 2.
Error Bound for Infinite Hypothesis Spaces
Theorem: Suppose that
. Assume that there are
training examples in
, and that a learning algorithm finds an
with error rate
on
. Then, with probability
, the error rate
on new data is:
![]()
Called the Empirical Risk Minimization Principle (Vapnik).
However, this does not work well for fixed
hypothesis spaces because you learning algorithm will minimize
:
·
Underfitting: Every hypothesis
has high error
. Want to consider
that has larger
space.
·
Overfitting: Every hypothesis
has high error
. Want to consider
smaller hypothesis
spaces that have lower
.
Suppose we have a nested series of hypothesis spaces:
![]()
with corresponding VC dimensions and errors
![]()
Then, you should use the Structural Risk Minimization Principle (Vapnik).
Choose the hypothesis space
that minimizes the
combined error bounds:
