Prediction and classification: Pitfalls for the unwary

Michael C. Mozer, Ph.D.
Professor, Department of Computer Science, University of Colorado
Chief Scientist, Athene Software

Robert Dodier, Ph.D.
Research Engineer, Athene Software

Cesar Guerra, Ph.D.
Research Engineer, Athene Software

Richard Wolniewicz, Ph.D.
VP of Engineering, Athene Software

Lian Yan, Ph.D.
Research Engineer, Athene Software

Michael Colagrosso
Member of Technical Staff, Athene Software

David Grimes
Member of Technical Staff, Athene Software

In many facets of science, academia, and business, predicting the future is critical: What will the weather be like tomorrow? Is an applicant to law school likely to be successful in the field? What will the Deutsche Mark / US dollar exchange rate be next year? Is a customer happy with their Internet Service Provider (ISP)?

Such predictions can be made using models, mathematical or formal expressions of the relationships among observed quantities of the world. To give an example, the equation

I = 1000 A + 3000 E - 10000
is a simple-minded model that predicts expected income (I) in dollars from an individual's age in years (A) and years of college education (E). A, E, and I are known as variables, placeholders for observed quantities. In predicting customer satisfaction with an ISP, a model might use variables such as: the customer's age, whether the account is for personal or business use, the length of time the customer has been with the ISP, the type of computer the customer owns, and the customer's usage patterns. Variables that are fed into the model are called input variables. For problems in which the model's task is to classify the input in some manner, such as deciding whether a customer is likely to churn, the model's prediction is specified in terms of output classes (e.g., "churn" or "nonchurn").
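
Expressed as code (Python here), the same toy model is just a function from the input variables to a prediction:

    # Toy income model from the equation above: I = 1000*A + 3000*E - 10000.
    def predicted_income(age_years, education_years):
        """Predict expected income (I) in dollars from age (A) and years of college (E)."""
        return 1000 * age_years + 3000 * education_years - 10000

    # Example: a 40-year-old with 4 years of college education.
    print(predicted_income(40, 4))  # 42000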

Building a predictive model is the work of specialists in a field called statistical machine learning and consists of the following basic steps:

  1. Selecting the input variables. The choice of input variables is determined by what data is available and what factors are deemed potentially relevant to the prediction task.
  2. Defining the output classes. The task of the model must be specified by giving a precise definition of each output class.
  3. Preparing data. Data preparation involves transforming the input variables into a representation that is appropriate for the prediction task. A variable such as average connect time could be fed into the model as the number of minutes, but often, transformations of the variable will yield a more accurate model.
  4. Building models. Given a set of training examples for which the output class is known, a variety of alternative models is constructed that can classify the set of examples.
  5. Performing model selection. From the set of alternatives, a single model must be selected based on its expected accuracy on novel cases.
  6. Evaluating the selected model on novel cases. One can always build a model that performs well on the training examples, but the true value of a model is how well it performs on novel cases. Evaluation is necessary to estimate the model's expected accuracy.

Each step in this process requires the combined skills of statistical machine learning specialists and domain experts. Scientists in the field of statistical machine learning optimize the performance of a model and assess its quality and effectiveness. The insights and understanding of domain experts allow fine-tuning of a model to achieve the best possible performance.
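
As a rough sketch of how these steps translate into code (using the scikit-learn library; the data and model choices below are synthetic placeholders, not a recommended recipe):

    # Skeleton of the six steps; the data and model choices are placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # Steps 1-2: input variables X and output class labels y (synthetic stand-ins here).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

    # Step 3: data preparation (here, simply standardizing the inputs).
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 6 requires novel cases, so a test set is held out before any modeling.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Steps 4-5: build candidate models and select among them by estimated
    # accuracy on novel cases (cross-validation within the training set only).
    model = LogisticRegression()
    print("cross-validated accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

    # Step 6: final evaluation on the untouched test set.
    model.fit(X_train, y_train)
    print("test-set accuracy:", model.score(X_test, y_test))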

A great challenge of modeling is that, for any reasonably complex domain, one cannot know in advance how accurate a model can be made. Should a stock trader be satisfied with a model that can predict whether a stock will increase or decrease in value the next trading day with an accuracy of 50.5%? Perhaps that accuracy is high enough to make money for the trader, but they would certainly prefer a model that could predict the change in price with an accuracy of 52%.

An outsider to the field can find it difficult to evaluate the quality of modeling efforts, and can easily be tricked by misleading or inflated claims concerning a model's performance. The goal of this white paper is to point out some of the most common pitfalls of modeling to which the unwary may be susceptible, and to provide guidelines for assessing the quality of a model and its predictions.

Selecting input variables

In selecting input variables for a model, one must be careful not to include false predictors. A false predictor is a variable that is strongly correlated with the output class, but that is not available in a realistic prediction scenario. For example, suppose the task were to predict churn in January, and one of the input variables was the connect time in January. Clearly, the connect time in January will be a strong indicator of churn in January, but this variable would not be available in a realistic prediction scenario, because the predictions for a given month must be based on past data. False predictors can easily sneak into a model, because the process of extracting time-tagged information from a database is arduous and errors easily occur. Nonetheless, automatic techniques exist to identify likely false predictors, and an expert in a modeling domain can spot when a false predictor is buried among the input variables, because the model will be performing better than could be expected given the uncertain nature of the task.
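
One simple screening heuristic (a sketch only; the AUC threshold and variable names are illustrative) is to check whether any single input variable discriminates the output class suspiciously well on its own:

    # Screen for likely false predictors: any single variable that by itself
    # discriminates the output class almost perfectly deserves scrutiny.
    # The AUC threshold of 0.95 is illustrative, not a standard value.
    from sklearn.metrics import roc_auc_score

    def flag_suspicious_variables(X, y, names, threshold=0.95):
        suspects = []
        for j, name in enumerate(names):
            auc = roc_auc_score(y, X[:, j])
            auc = max(auc, 1.0 - auc)   # direction of the correlation is irrelevant
            if auc > threshold:
                suspects.append((name, auc))
        return suspects

Any variable flagged this way should be checked against the prediction timeline: would its value actually be known at the time the prediction must be made?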

Defining output classes

If the model's task is to predict churn, it seems natural for the output classes to be "churn" and "nonchurn". However, characterizing the exact conditions under which churn occurs is not straightforward. At Athene, we have run across problems with many wireless carriers' definitions of churn. For example, one carrier included among the churners customers who were disconnected because they had not paid their bills. Although predicting churn for these customers is easy given bill payment information, these are not the customers one wants to save! Another carrier defined churn to have occurred at the point when the number of services to which a customer was subscribed dropped to zero. However, because the cancellation of all services could take a month or more, a customer who asked to terminate service would often appear to be dropping services gradually over a period of several months. Consequently, predicting the churn of these customers was easy, but the customers had in actuality already churned.
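
A sketch of how such a labeling rule might be made explicit (the column names and the rule itself are hypothetical; the appropriate definition depends on the carrier's billing system):

    # Hypothetical labeling rule: count as churners only customers who asked to
    # cancel, excluding nonpayment disconnects, and date the churn from the
    # cancellation request rather than from the last service being dropped.
    import pandas as pd

    def label_churn(accounts: pd.DataFrame) -> pd.Series:
        voluntary = accounts["disconnect_reason"] != "nonpayment"
        requested = accounts["cancellation_request_date"].notna()
        return (voluntary & requested).astype(int)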

If a modeler claims to be able to predict churn with a certain accuracy, they should be questioned about the definition of churn used, and whether the churners identified are the ones an ISP or wireless carrier actually cares about.

Data preparation

A critical but often unappreciated aspect of modeling is the choice of the data representation. To explain, consider the question of how to represent or encode an input variable in a model, say, the number of calls per day a customer makes to the ISP. The obvious representation is just a scalar (single number); if the customer made 5 calls, the representation would be the number 5. The number of calls could also be represented by dividing the number line into bins, e.g., 0-1, 2-5, 6-12, 12+, and indicating which bin a customer was in using a binary vector; for example, if the customer made 5 calls, the representation would be the vector [0 1 0 0]; if the customer made 20 calls, the representation would be [0 0 0 1]. The spectrum of possible representations is vast. The choice of representation fundamentally influences the accuracy of a model. For example, if customers making 2-5 calls all behave similarly, and/or if the behavior of customers is nonmonotonic with the number of calls (i.e., customers making 2-5 calls behave similarly to customers making 12+ calls, but differently than customers making 6-12 calls), then the vector representation described here makes sense, because it makes the critical information useful for prediction more explicit.
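
The binned representation described above is mechanical to construct; a minimal sketch, using the bin boundaries from the example (and reading "12+" as more than 12 calls so that the bins are disjoint):

    # Encode number of calls as a binary vector over the bins 0-1, 2-5, 6-12, 12+
    # (the last bin is read here as "more than 12" so that the bins are disjoint).
    def encode_calls(num_calls):
        bins = [(0, 1), (2, 5), (6, 12)]
        vector = [1 if lo <= num_calls <= hi else 0 for lo, hi in bins]
        vector.append(1 if num_calls > 12 else 0)
        return vector

    print(encode_calls(5))   # [0, 1, 0, 0]
    print(encode_calls(20))  # [0, 0, 0, 1]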

Data preparation involves transforming raw numbers and labels in a database into a representation suitable for input to a model. Data preparation is the primary point where domain expertise can come into play. A modeler who does not focus on data preparation and representation is likely to develop inferior models. We have illustrated the importance of representations based on domain intuition in a recent paper (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000).

No free lunch

A wide variety of techniques exist for modeling, from traditional statistical approaches such as generalized linear models and linear discriminant analysis, to modern machine learning techniques such as neural networks, evolutionary (or genetic) algorithms, support-vector machines, Bayes (or belief) networks, decision (or classification) trees, and Gaussian processes. An important theorem in statistical machine learning essentially states that no one technique will outperform all other techniques on all problems (Wolpert & Macready, 1997). This theorem is sometimes referred to as no free lunch. Consequently, any modeling effort should consider a range of techniques.

Often, a modeling group will specialize in one particular technique, and will tout that technique as being intrinsically superior to others. Such a claim should be regarded with extreme suspicion. Certain techniques may be better suited to certain classes of problems (e.g., noisy domains, small data sets), but a modeling group should be versed in and explore a range of techniques. Furthermore, the field of statistical machine learning is evolving rapidly, and new algorithms are developed at a regular pace. A modeling group should be aware of and should be contributing to these developments.
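
In practice, considering a range of techniques can be as simple as comparing several model families under a common cross-validation protocol; a sketch with synthetic data:

    # Compare several model families under one cross-validation protocol.
    # Which family wins depends on the problem (no free lunch); data is synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 5))
    y = (X[:, 0] + rng.normal(size=800) > 0).astype(int)

    families = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(max_depth=5),
        "k-nearest neighbors": KNeighborsClassifier(),
        "naive Bayes": GaussianNB(),
    }
    for name, model in families.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")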

Model selection

Any modeling technique can be used to construct a continuum of models, from simple to complex. Simple models capture the primary trends in a data set, but their simplicity may prevent them from capturing subtle patterns. Complex models have the ability to capture subtle patterns, but they have a tendency to memorize quirks of the training examples instead of capturing patterns that will be useful for prediction. One of the key issues in modeling is model selection (step 5 above), which involves picking the appropriate level of complexity for a model given a data set. Many different methods exist for model selection (e.g., cross validation, Bayesian averaging, ensembles, regularization, minimum description length). Without a rigorous model selection process, the resulting models will be far less accurate than they could otherwise be.
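
As one concrete, hedged example of such a process, cross-validation can be used to choose a complexity parameter, here the maximum depth of a decision tree on synthetic data:

    # Model selection by cross-validation: choose the tree depth (complexity)
    # that yields the best estimated accuracy on novel cases. Data is synthetic.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a pattern a too-simple model will miss

    depths = [1, 2, 3, 5, 8, 12, None]        # None lets the tree grow fully
    scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                 X, y, cv=5).mean()
              for d in depths}
    best_depth = max(scores, key=scores.get)
    print("selected max_depth:", best_depth, "with accuracy", round(scores[best_depth], 3))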

Although model selection methods can be automated to some degree, model selection cannot be avoided. If a modeling group claims otherwise, or does not emphasize their expertise in model selection, one should be suspicious of their abilities.

Segmentation

Often, a data set can be broken into several smaller, more homogeneous data sets, a process referred to as segmentation. For example, a customer database might be split into business and residential customers. A modeler must decide whether to build a single model for the entire data set, or one model for each segment of the data set. If the segments are quite distinct from one another in terms of their behavior, then segmentation is sensible. However, if one segment behaves quite similarly to another segment, a single model is preferable to specialized models because the single model benefits from having far more data for training. Many statistical machine learning techniques (e.g., decision trees, mixture of experts) can automatically segment the data as warranted. Some machine learning techniques can perform a soft segmentation, in which one model specializes in a particular segment of the data, but is weakly influenced by the other segments.

Although domain experts can readily propose segmentations, enforcing a segmentation suggested by domain experts is generally not the most prudent approach to modeling, because the data itself provides clues to how the segmentation should be performed. Nonetheless, intuitions of domain experts can be used to guide the modeling process. Consequently, one should be concerned if a modeler claims to utilize a priori segmentation, or if they claim that segmentation is not necessary given their approach. The value of segmentation must be decided on a case-by-case basis from the data set.
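
A sketch of how the data itself can settle the question (segment labels and data are synthetic): compare a single pooled model against per-segment models under the same validation protocol.

    # Let the data decide: compare one pooled model against per-segment models
    # under the same cross-validation protocol. Segments and data are synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 4))
    segment = rng.integers(0, 2, size=2000)   # e.g., 0 = residential, 1 = business
    y = (X[:, 0] + 0.5 * segment * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

    pooled = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    per_segment = np.mean([
        cross_val_score(LogisticRegression(), X[segment == s], y[segment == s], cv=5).mean()
        for s in (0, 1)   # segments are roughly equal in size, so a plain mean suffices
    ])
    print("pooled model accuracy:     ", round(pooled, 3))
    print("per-segment model accuracy:", round(per_segment, 3))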

Model evaluation

Once a model has been built, the natural question to ask is how accurate it is. Unfortunately, this is a tricky question to answer, and the modeler can readily cheat--either deliberately or accidentally--to inflate estimates of a model's accuracy. We describe common sorts of deception that can occur in assessing and evaluating a model.

Failing to use an independent test set. A set of training examples is used to construct the model. It is meaningless to assess the accuracy of the model using the training set, because one could always build a model that has extremely high accuracy on the training set. To obtain a fair estimate of performance, the model must be evaluated on examples that were not contained in the training set. (Wouldn't you have liked it in school if your teacher gave you a practice test one day, went over the answers, and then asked exactly the same questions on the real exam the next day?) To obtain independent training and test sets, the available data must be split into two nonoverlapping subsets, with the test set reserved only for evaluation.

Assuming stationarity of the test environment. For many difficult problems, a model built based on historical data will become a poorer and poorer predictor as time goes on, because the environment is nonstationary--the rules and behaviors of individuals change over time. For example, the factors influencing churn two years ago are likely to be very different than the factors influencing churn today, because the competitive environment and user needs have changed significantly. Consequently, the best measure of a model's true performance will be obtained if it is tested on data from a future point in time relative to the training data. For churn prediction, we typically find a small but reliable drop in prediction accuracy when the test data comes from a shifted time window. Any report of a model's accuracy should be clear on whether the test data comes from the same time window or a shifted time window relative to the training data.
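
A sketch of a temporally honest evaluation (the column names are hypothetical, and the date column is assumed to hold timestamps): train on an earlier window and test on a strictly later one.

    # Train on an earlier time window and test on a strictly later one, rather
    # than on a random split of the same window. Column names are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def time_shifted_accuracy(df: pd.DataFrame, feature_cols, label_col="churned",
                              date_col="snapshot_date", cutoff="2000-06-30"):
        train = df[df[date_col] <= cutoff]
        test = df[df[date_col] > cutoff]      # novel cases from a later window
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train[label_col])
        return model.score(test[feature_cols], test[label_col])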

Incomplete reports of results. An accurate model will correctly discriminate examples of one output class from examples of another output class. Discrimination performance is best reported with an ROC curve, a lift curve, or a precision-recall curve (see Mozer et al., 2000). Any report of accuracy using only a single number is suspect. For example, should you be impressed if you are told a model achieves 90% accuracy? In the case of churn, you should not be, because if 5% of the customers are churning in a given time window, then a model can be 95% accurate simply by classifying each example as nonchurn! A more meaningful assessment of performance will report two numbers: the accuracy of classifying a churner as a churner, and the accuracy of classifying a nonchurner as a nonchurner.
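
A sketch of a more meaningful report (the labels and scores below are stand-ins): per-class accuracy alongside a threshold-free measure such as ROC AUC.

    # Report per-class accuracy and a threshold-free measure (ROC AUC) rather
    # than a single overall accuracy figure. Labels and scores are stand-ins.
    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])              # 1 = churn
    y_score = np.array([.1, .2, .1, .3, .4, .2, .6, .1, .7, .4])   # model's churn scores
    y_pred = (y_score >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("churners classified as churners:      ", tp / (tp + fn))
    print("nonchurners classified as nonchurners:", tn / (tn + fp))
    print("ROC AUC:", roc_auc_score(y_true, y_score))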

Filtering data to bias results. In a large data set, one segment of the population may be easier to predict than another. For example, customers with low incomes are likely to be more cost sensitive, and hence, might reliably churn when their bills rise above a certain amount. If a model is trained and tested just on this segment of the population, it will be more accurate than a model that must handle the entire population. We have run across evaluations of a model in which such selective filtering has turned a hard problem into an easier problem. One such case focused on customers in the first three months of service, and assumed that performance on the larger customer base would be comparable.

Selective sampling of test cases. A fair evaluation of a model will utilize a test set that is drawn from the same population as the model will eventually encounter in actual usage. We have run across many instances where the modelers appear to have selectively sampled test cases to achieve higher accuracy. For example, correct identification of churn is likely to be higher for the 10% of the test set deemed most likely to churn than for the test set as a whole. Some reports are outright puzzling, such as one modeling group's report in which the test set contained thirty thousand customers, but performance graphs and charts represented a small fraction--less than 1%--of the test set.

Failing to assess statistical reliability. When comparing the accuracy of two models, it is not sufficient to report that one model performed better than the other, because the difference might not be statistically reliable. "Statistical reliability" means, among other things, that if the comparison were repeated using a different sample of the population, the same result would be achieved. Thus, when comparing two models--or even when evaluating whether a model is performing better than chance--assessing the statistical reliability of the results is mandatory.
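
One hedged example of such an assessment is McNemar's test, applied to the two models' predictions on the same test set (the disagreement counts below are illustrative):

    # McNemar's test: do two models differ in a statistically reliable way on
    # the same test set? b = cases model A got right and model B got wrong;
    # c = the reverse. The counts below are illustrative.
    from scipy.stats import binomtest

    b, c = 62, 41
    # Under the null hypothesis of no real difference, disagreements split 50/50.
    result = binomtest(b, n=b + c, p=0.5)
    print("p-value:", result.pvalue)   # a small p-value -> the difference is reliable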

Conclusions

In this brief paper, we have tried to provide information that will allow the reader to better appreciate the modeling process, and some of the key competencies that are required to build state-of-the-art predictive models.

References

Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., & Kaushansky, H. (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks, 11, 690-696.

Wolpert, D., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67-82.