Prediction and classification: Pitfalls for the
Michael C. Mozer, Ph.D.
Professor, Department of Computer Science, University of Colorado
Chief Scientist, Athene Software
Robert Dodier, Ph.D.
Research Engineer, Athene Software
Cesar Guerra, Ph.D.
Research Engineer, Athene Software
Richard Wolniewicz, Ph.D.
VP of Engineering, Athene Software
Lian Yan, Ph.D.
Research Engineer, Athene Software
Member of Technical Staff, Athene Software
Member of Technical Staff, Athene Software
In many facets of science, academia, and business, predicting the future
is critical: What will the weather be like tomorrow? Is an applicant to
law school likely to be successful in the field? What will the Deutsche
Mark / US dollar exchange rate be next year? Is a customer happy with their
Internet Service Provider (ISP)?
Such predictions can be made using models, mathematical or formal
expressions of the relationships among observed quantities of the world.
To give an example, the equation
I = 1000 A + 3000 E - 10000
is a simple-minded model that predicts expected income (I) in dollars
from an individual's age in years (A) and years of college education
(E). A, E, and I are known as variables,
placeholders for observed quantities. In predicting customer satisfaction
with an ISP, a model might use variables such as: the customer's age, whether
the account is for personal or business use, the length of time the customer
has been with the ISP, the type of computer the customer owns, and the
customer's usage patterns. Variables that are fed into the model are called
input variables. For problems in which the model's task is to classify
the input in some manner, such as deciding whether a customer is likely
to churn, the model's prediction is specified in terms of output classes
(e.g., "churn" or "nonchurn").
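The income equation above can be written directly as a function. The Python sketch below does only that; the name `predict_income` is chosen here for illustration, and the coefficients are the simple-minded ones from the equation, not a fitted model.

```python
def predict_income(age_years, college_years):
    """Expected income in dollars: I = 1000 A + 3000 E - 10000."""
    return 1000 * age_years + 3000 * college_years - 10000


# A 30-year-old with 4 years of college education.
print(predict_income(30, 4))  # 32000
```

Here `age_years` and `college_years` play the role of the input variables A and E, and the returned value is the prediction I.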
Building a predictive model is the work of specialists in a field called
machine learning and consists of the following basic steps:
1. Selecting the input variables. The choice of input variables is
determined by what data is available and what factors are deemed potentially
relevant to the prediction task.
2. Defining the output classes. The task of the model must be specified
by giving a precise definition of each output class.
3. Preparing data. Data preparation involves transforming the input
variables into a representation that is appropriate for the prediction
task. A variable such as average connect time could be fed into the model
as the number of minutes, but often, transformations of the variable will
yield a more accurate model.
4. Building models. Given a set of training examples for which
the output class is known, a variety of alternative models is constructed
that can classify the set of examples.
5. Performing model selection. From the set of alternatives, a single
model must be selected based on its expected accuracy on novel cases.
6. Evaluating the selected model on novel cases. One can always build
a model that performs well on the training examples, but the true value
of a model is how well it performs on novel cases. Evaluation is necessary
to estimate the model's expected accuracy.
Each step in this process requires the combined skills of statistical machine
learning specialists and domain experts. Scientists in the field of statistical
machine learning optimize the performance of a model and assess its quality
and effectiveness. The insights and understanding of domain experts allow
fine-tuning of a model to achieve the very best performance.
A great challenge of modeling is that, for any reasonably complex domain,
one will not know in advance how accurate a model can be built. Should
a stock trader be satisfied with a model that can predict whether a stock
will increase or decrease in value the next trading day with an accuracy
of 50.5%? Perhaps the accuracy is high enough to make money for the trader,
but they would certainly prefer a model that could predict the change in
price with an accuracy of 52%.
An outsider to the field can find it difficult to evaluate the quality
of modeling efforts, and can easily be tricked by misleading or elevated
claims concerning a model's performance. The goal of this white paper is
to point out some of the most common pitfalls of modeling to which the
unwary may be susceptible, and to provide guidelines for assessing the
quality of a model and its predictions.
Selecting input variables
In selecting input variables for a model, one must be careful not to include
false predictors. A false predictor is a variable that is strongly correlated
with the output class, but that is not available in a realistic prediction
scenario. For example, suppose the task were to predict churn in January,
and one of the input variables was the connect time in January. Clearly,
the connect time in January will be a strong indicator of churn in January,
but this variable would not be available in a realistic prediction scenario,
because the predictions for a given month must be based on past
data. False predictors can easily sneak into a model, because the process
of extracting time-tagged information from a data base is arduous and errors
easily occur. Nonetheless, automatic techniques exist to identify likely
false predictors, and an expert in a modeling domain can spot when a false
predictor is buried among the input variables, because the model will be
performing better than could be expected given the uncertain nature of
the prediction task.
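One simple automatic screen, offered here only as a sketch and not as the specific technique used at Athene, is to flag any input variable whose correlation with the output class is implausibly strong. The variable names, the toy data, and the 0.9 threshold below are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / sqrt(var_x * var_y)


def flag_suspicious(columns, labels, threshold=0.9):
    """Names of input variables whose |correlation| with the label exceeds
    `threshold` (a hypothetical cutoff chosen for illustration)."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) > threshold]


# Toy data: 1 = churned in January, 0 = stayed.
churned = [1, 1, 0, 0, 0, 0]
columns = {
    # Measured in the same month as the label: a false predictor.
    "connect_minutes_january": [0, 5, 300, 280, 310, 295],
    # Measured in the past: legitimately available at prediction time.
    "connect_minutes_december": [200, 310, 290, 150, 320, 240],
}
print(flag_suspicious(columns, churned))  # ['connect_minutes_january']
```

A flagged variable is not necessarily a false predictor; the flag is a prompt to check when the variable was measured relative to the event being predicted.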
Defining output classes
If the model's task is to predict churn, it seems natural for the output
classes to be "churn" and "nonchurn". However, characterizing the exact
conditions under which churn occurs is not straightforward. At Athene,
we have run across problems with many wireless carriers' definitions of
churn. For example, one carrier included among the churners customers who
were disconnected because they had not paid their bills. Although predicting
churn for these customers is easy given bill payment information, these
are not the customers one wants to save! Another carrier defined churn
to have occurred at the point when the number of services to which a customer
was subscribed dropped to zero. However, because the cancellation of all
services could take a month or more, a customer who asked to terminate
service would often appear to be dropping services gradually over a several
month period. Consequently, predicting the churn of these customers was
easy, but the customers had in actuality already churned.
If a modeler claims to be able to predict churn with a certain accuracy,
they should be questioned about the definition of churn used, and whether
the churners identified are the ones an ISP or wireless carrier actually
wants to save.
Preparing data
A critical but often unappreciated aspect of modeling is the choice of
the data representation. To explain, consider the question of how
to represent or encode an input variable in a model, say, the number of
calls per day a customer makes to the ISP. The obvious representation is
just a scalar (single number); if the customer made 5 calls, the representation
would be the number 5. The number of calls could also be represented by
dividing the number line into bins, e.g., 0-1, 2-5, 6-12, 12+, and indicating
which bin a customer was in using a binary vector; for example, if the
customer made 5 calls, the representation would be the vector [0 1 0 0],
if the customer made 20 calls, the representation would be [0 0 0 1]. The
spectrum of possible representations is vast. The choice of representation
fundamentally influences the accuracy of a model. For example, if customers
making 2-5 calls all behave similarly, and/or if the behavior of customers
is nonmonotonic with the number of calls (i.e., customers making 2-5 calls
behave similarly to customers making 12+ calls, but differently than customers
making 6-12 calls), then the vector representation described here makes
sense, because it makes the critical information useful for prediction
explicit.
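The two representations described above can be sketched in a few lines of Python. The function name and the use of `bisect` are illustrative choices. Because the bins as written overlap at 12, the sketch assumes that 12 calls belong to the 6-12 bin and that the last bin starts at 13.

```python
import bisect

# Upper edges of the first three bins (0-1, 2-5, 6-12); anything larger
# falls in the last bin. Starting the last bin at 13 is an assumption.
UPPER_EDGES = [1, 5, 12]


def encode_binned(calls):
    """Binary vector with a single 1 marking the bin that `calls` falls in."""
    index = bisect.bisect_left(UPPER_EDGES, calls)
    vector = [0] * (len(UPPER_EDGES) + 1)
    vector[index] = 1
    return vector


print(encode_binned(5))   # [0, 1, 0, 0]
print(encode_binned(20))  # [0, 0, 0, 1]
```

The scalar representation would pass the value 5 or 20 to the model unchanged; the binned representation lets a model treat each range of call counts independently of the others.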
Data preparation involves transforming raw numbers and labels in a data
base into a representation suitable for input to a model. Data preparation
is the primary point where domain expertise can come into play. A modeler
who does not focus on data preparation and representation is likely to
develop inferior models. We have illustrated the importance of representations
based on domain intuition in a recent paper (Mozer, Wolniewicz, Grimes,
Johnson, & Kaushansky, 2000).
No free lunch
A wide variety of techniques exist for modeling, from traditional statistical
approaches such as generalized linear models and linear discriminant analysis,
to modern machine learning techniques such as neural networks, evolutionary
(or genetic) algorithms, support-vector machines, Bayes (or belief) networks,
decision (or classification) trees, and Gaussian processes. An important
theorem in statistical machine learning essentially states that no one
technique will outperform all other techniques on all problems (Wolpert
& MacReady, 1997). This theorem is sometimes referred to as no free
lunch. Consequently, any modeling effort should consider a range of
techniques.
Often, a modeling group will specialize in one particular technique,
and will tout that technique as being intrinsically superior to others.
Such a claim should be regarded with extreme suspicion. Certain techniques
may be better suited to certain classes of problems (e.g., noisy domains,
small data sets), but a modeling group should be versed in and explore
a range of techniques. Furthermore, the field of statistical machine learning
is evolving rapidly, and new algorithms are developed at a regular pace.
A modeling group should be aware of and should be contributing to these
developments.
Model selection
Any modeling technique can be used to construct a continuum of models,
from simple to complex. Simple models capture the primary trends in a data
set, but their simplicity may prevent them from capturing subtle patterns.
Complex models have the ability to capture subtle patterns, but they have
a tendency to memorize quirks of the training examples instead of capturing
patterns that will be useful for prediction. One of the key issues in modeling
is model selection (step 5 above), which involves picking the appropriate
level of complexity for a model given a data set. Many different methods
exist for model selection (e.g., cross validation, Bayesian averaging,
ensembles, regularization, minimum description length). Without a rigorous
model selection process, the resulting models will be far less accurate
than they could otherwise be.
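As one concrete instance of model selection, the sketch below uses k-fold cross validation to choose the complexity of a nearest-neighbor classifier. The classifier, the synthetic data, and the candidate values of k are assumptions made for illustration; with a small k the model can memorize quirks of the training examples, and with a very large k it captures only the coarsest trend.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(data, k, folds=5):
    """Accuracy of k-NN when every example is predicted from the other folds."""
    correct = 0
    for f in range(folds):
        held_out = data[f::folds]  # examples whose index i has i % folds == f
        train = [pair for i, pair in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)


def select_k(data, candidates, folds=5):
    """Pick the candidate k with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, folds))


# Synthetic data: the class depends on x, with 20% of the labels flipped.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

for k in (1, 5, 25, 101):
    print(k, round(cv_accuracy(data, k), 3))
print("selected k:", select_k(data, [1, 5, 25, 101]))
```

The accuracy used for selection is itself an estimate obtained from the selection data, so the chosen model still needs to be evaluated on examples that played no part in this process.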
Although model selection methods can be automated to some degree, model
selection cannot be avoided. If a modeling group claims otherwise, or does
not emphasize their expertise in model selection, one should be suspicious
of their abilities.
Segmentation
Often, a data set can be broken into several smaller, more homogeneous data
sets, which is referred to as segmentation. For example, a customer
data base might be split into business and residential customers. A modeler
must decide whether to build a single model for the entire data set, or
one model for each segment of the data set. If the segments are quite distinct
from one another in terms of their behavior, then segmentation is sensible.
However, if one segment behaves quite similarly to another segment, a single
model is preferable to specialized models because the single model benefits
from having far more data for training. Many statistical machine learning
techniques (e.g., decision trees, mixture of experts) can automatically
segment the data as warranted. Some machine learning techniques can perform
a soft segmentation, in which one model specializes in a particular
segment of the data, but is weakly influenced by the other segments.
Although domain experts can readily propose segmentations, enforcing
a segmentation suggested by domain experts is generally not the most prudent
approach to modeling, because the data itself provides clues to how the
segmentation should be performed. Nonetheless, intuitions of domain experts
can be used to guide the modeling process. Consequently, one should be
concerned if a modeler claims to utilize a priori segmentation, or if they
claim that segmentation is not necessary given their approach. The value
of segmentation must be decided on a case-by-case basis from the data set.
Evaluating models
Once a model has been built, the natural question to ask is how accurate
it is. Unfortunately, this is a tricky question to answer, and the modeler
can readily cheat--either deliberately or accidentally--to inflate estimates
of a model's accuracy. We describe common sorts of deception that can occur
in assessing and evaluating a model.
Failing to use an independent test set. A set of training examples
is used to construct the model. It is meaningless to assess the accuracy
of the model using the training set, because one could always build a model
that has extremely high accuracy on the training set. To obtain a fair
estimate of performance, the model must be evaluated on examples that were
not contained in the training set. (Wouldn't you have liked it in school
if your teacher gave you a practice test one day, went over the answers,
and then asked exactly the same questions on the real exam the next day?)
To obtain independent training and test sets, the available data must be
split into two nonoverlapping subsets, with the test set reserved only
for evaluation.
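A minimal sketch of such a split follows. The function name, the 25% test fraction, and the fixed seed are illustrative assumptions; the integers stand in for customer records.

```python
import random


def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of `examples` and cut it into two nonoverlapping lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


customers = list(range(1000))  # stand-ins for customer records
train, test = split_train_test(customers)
print(len(train), len(test))   # 750 250
print(set(train) & set(test))  # set()
```

The empty intersection confirms that no example appears in both subsets.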
Assuming stationarity of the test environment. For many difficult
problems, a model built based on historical data will become a poorer and
poorer predictor as time goes on, because the environment is nonstationary--the
rules and behaviors of individuals change over time. For example, the factors
influencing churn two years ago are likely to be very different than the
factors influencing churn today, because the competitive environment and
user needs have changed significantly. Consequently, the best measure of
a model's true performance will be obtained if it is tested on data from
a future point in time relative to the training data. For churn prediction,
we typically find a small but reliable drop in prediction accuracy when
the test data comes from a shifted time window. Any report of a
model's accuracy should be clear on whether the test data comes from the
same time window or a shifted time window relative to the training data.
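A shifted time window can be obtained by splitting on a date rather than at random. In the sketch below, the record layout (a `month` field holding a date) and the cutoff are hypothetical.

```python
from datetime import date


def split_by_time(records, cutoff):
    """Train on records dated before `cutoff`; test on records from `cutoff` on."""
    train = [r for r in records if r["month"] < cutoff]
    test = [r for r in records if r["month"] >= cutoff]
    return train, test


# Toy records: one per customer per month, January through June.
records = [
    {"customer": c, "month": date(2000, m, 1)}
    for m in range(1, 7)
    for c in range(3)
]
train, test = split_by_time(records, cutoff=date(2000, 5, 1))
print(len(train), len(test))  # 12 6
```

Every test record here postdates every training record, so the measured accuracy reflects how the model fares once the environment has had time to change.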
Incomplete reports of results. An accurate model will correctly
discriminate examples of one output class from examples of another output
class. Discrimination performance is best reported with an ROC curve, a
lift curve, or a precision-recall curve (see Mozer et al., 2000). Any report
of accuracy using only a single number is suspect. For example, should
you be impressed if you are told a model achieves 90% accuracy? In the
case of churn, you should not be, because if 5% of the customers are churning in
a given time window, then a model can be 95% accurate simply by classifying
each example as nonchurn! A more meaningful assessment of performance will
report two numbers: accuracy of classifying a churner as a churner, and
accuracy of classifying a nonchurner as a nonchurner.
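The arithmetic behind this example is easy to check. The sketch below scores a do-nothing model that labels every customer as nonchurn on a population with a 5% churn rate; the function name and the class labels are illustrative.

```python
def class_accuracies(actual, predicted, positive="churn"):
    """Return (overall accuracy, accuracy on churners, accuracy on nonchurners)."""
    pairs = list(zip(actual, predicted))
    on_churners = [p for a, p in pairs if a == positive]
    on_nonchurners = [p for a, p in pairs if a != positive]
    overall = sum(a == p for a, p in pairs) / len(pairs)
    churner_acc = sum(p == positive for p in on_churners) / len(on_churners)
    nonchurner_acc = sum(p != positive for p in on_nonchurners) / len(on_nonchurners)
    return overall, churner_acc, nonchurner_acc


actual = ["churn"] * 5 + ["nonchurn"] * 95
predicted = ["nonchurn"] * 100  # classify every example as nonchurn
print(class_accuracies(actual, predicted))  # (0.95, 0.0, 1.0)
```

The single number 0.95 hides the fact that not one churner was identified, which the pair of per-class numbers makes plain.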
Filtering data to bias results. In a large data set, one segment
of the population may be easier to predict than another. For example, customers
with low incomes are likely to be more cost sensitive, and hence, might
reliably churn when their bills rise above a certain amount. If a model
is trained and tested just on this segment of the population, it will be
more accurate than a model that must handle the entire population. We have
run across evaluations of a model in which such selective filtering has
turned a hard problem into an easier problem. One such case focused on
customers in the first three months of service, and assumed that performance
on the larger customer base would be comparable.
Selective sampling of test cases. A fair evaluation of a model
will utilize a test set that is drawn from the same population as the model
will eventually encounter in actual usage. We have run across many instances
where the modelers appear to have selectively sampled test cases to achieve
higher accuracy. For example, correct identification of churn is likely
to be higher for the 10% of the test set deemed most likely to churn than
for the test set as a whole. Some reports are outright puzzling, such as
one modeling group's report in which the test set contained thirty thousand
customers, but performance graphs and charts represented a small fraction--less
than 1%--of the test set.
Failing to assess statistical reliability. When comparing the
accuracy of two models, it is not sufficient to report that one model performed
better than the other, because the difference might not be statistically
reliable. "Statistical reliability" means, among other things, that if
the comparison were repeated using a different sample of the population,
the same result would be achieved. Thus, when comparing two models--or
even when evaluating whether a model is performing better than chance--assessing
the statistical reliability of results is mandatory.
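One standard way to assess the reliability of a difference between two classifiers evaluated on the same test set is an exact McNemar (sign) test on the examples where the two models disagree. The sketch below is one such test among many, not a method prescribed by this paper; the counts are invented for illustration.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example outcomes.

    correct_a[i] and correct_b[i] say whether model A and model B classified
    test example i correctly. Only examples where the models disagree matter.
    """
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree
    # Probability of a split at least this lopsided if each disagreement
    # were equally likely to favor either model.
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 100 test examples: both right on 84, only A right on 6,
# only B right on 2, both wrong on 8 (A: 90% accurate, B: 86%).
correct_a = [True] * 84 + [True] * 6 + [False] * 2 + [False] * 8
correct_b = [True] * 84 + [False] * 6 + [True] * 2 + [False] * 8
print(round(mcnemar_exact(correct_a, correct_b), 3))  # 0.289
```

In this invented case a four-point difference in accuracy yields a p-value near 0.29, so the apparent advantage of model A could easily be an accident of the particular test sample.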
In this brief paper, we have tried to provide information that will allow
the reader to better appreciate the modeling process, and some of the key
competencies that are required to build state-of-the-art predictive models.
References
Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., & Kaushansky,
H. (2000). Predicting subscriber dissatisfaction and improving retention
in the wireless telecommunications industry. IEEE Transactions on Neural
Networks, 11, 690-696.
Wolpert, D., & MacReady, W. G. (1997). No free lunch theorems for
optimization. IEEE Transactions on Evolutionary Computation, 1,