
LETTER Communicated by Leo Breiman

Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms

Thomas G. Dietterich

Department of Computer Science, Oregon State University, Corvallis, OR 97331, U.S.A.

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 × 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 × 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.

1 Introduction

In the research, development, and application of machine learning algorithms for classification tasks, many questions arise for which statistical methods are needed. The purpose of this article is to investigate one of these questions, demonstrate that existing statistical methods are inadequate for this question, and propose a new statistical test that shows acceptable performance in initial experiments.

To understand the question raised in this article, it is helpful to consider a taxonomy of the different kinds of statistical questions that arise in machine learning. Figure 1 gives a taxonomy of nine statistical questions.

© 1998 Massachusetts Institute of Technology. Neural Computation 10, 1895–1923 (1998)

Let us begin at the root of the tree. The first issue to consider is whether we are studying only a single application domain or multiple domains. In most applied research, there is a single domain of interest, and the goal is to find the best classifier or the best learning algorithm to apply in that domain. However, a fundamental goal of research in machine learning is to find learning algorithms that work well over a wide range of application domains. We will return to this issue; for the moment, let us consider the single-domain case.

Within a single domain, there are two different sets of questions, depending on whether we are analyzing classiﬁers or algorithms. A classiﬁer is a function that, given an input example, assigns that example to one of K classes. A learning algorithm is a function that, given a set of examples and their classes, constructs a classiﬁer. In a particular application setting, our primary goal is usually to ﬁnd the best classiﬁer and estimate its accuracy with future examples. Suppose we are working for a medical instrumentation company and wish to manufacture and sell an instrument for classifying blood cells. At the time we are designing the instrument, we could gather a large collection of blood cells and have a human expert classify each cell.

We could then apply a learning algorithm to produce a classiﬁer from this set of classiﬁed cells. The classiﬁer would be implemented in the instrument and sold. We want our instrument to contain the most accurate classiﬁer we can ﬁnd.

There are some applications, however, where we must select the best learning algorithm rather than ﬁnd the best classiﬁer. For example, suppose we want to sell an e-mail system that learns to recognize and ﬁlter junk mail. Whenever the user receives an e-mail message that he considers junk mail, he will ﬂag that message. Periodically, a learning algorithm included in the program will analyze the accumulated examples of junk and nonjunk e-mail and update its ﬁltering rules. Our job is to determine which learning algorithm to include in the program.

The next level of the taxonomy distinguishes between two fundamental tasks: estimating accuracy and choosing between classiﬁers (or algorithms).

When we market our blood cell diagnosis system, we would like to make a claim about its accuracy. How can we measure this accuracy? And, of course, when we design the system, we want to choose the best classiﬁer from some set of available classiﬁers.

The lowest level of the taxonomy concerns the amount of data available.

If we have a large amount of data, then we can set some of them aside to serve as a test set for evaluating classiﬁers. Much simpler statistical methods can be applied in this case. However, in most situations, the amount of data is limited, and we need to use all we have as input to our learning algorithms.

This means that we must use some form of resampling (i.e., cross-validation or the bootstrap) to perform the statistical analysis.

Now that we have reviewed the general structure of the taxonomy, let's consider the nine statistical questions. We assume that all data points (examples) are drawn independently from a fixed probability distribution defined by the particular application problem.

Figure 1: A taxonomy of statistical questions in machine learning. The boxed node (Question 8) is the subject of this article.

Question 1: Suppose we are given a large sample of data and a classifier C. The classifier C may have been constructed using part of the data, but there are enough data remaining for a separate test set. Hence, we can measure the accuracy of C on the test set and construct a binomial confidence interval (Snedecor & Cochran, 1989; Efron & Tibshirani, 1993; Kohavi, 1995). Note that in Question 1, the classifier could have been produced by any method (e.g., interviewing an expert); it need not have been produced by a learning algorithm.
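Such a binomial confidence interval can be sketched in a few lines of Python. This is an illustration, not code from the article; the function name and the use of the Wald normal approximation with z = 1.96 (an approximate 95% interval, appropriate only for large test sets) are our choices:

```python
import math

def binomial_confidence_interval(correct, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for the true
    accuracy of a classifier that labeled `correct` of `n` independent
    test examples correctly.  z = 1.96 gives roughly a 95% interval."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```

For example, a classifier that labels 870 of 1,000 test examples correctly receives the interval (0.849, 0.891), approximately.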

Question 2: Given a small data set, S, suppose we apply learning algorithm A to S to construct classiﬁer CA. How accurately will CA classify new examples? Because we have no separate test set, there is no direct way to answer this question. A frequently applied strategy is to convert this question into Question 6: Can we predict the accuracy of algorithm A when it is trained on randomly selected data sets of (approximately) the same size as S? If so, then we can predict the accuracy of CA, which was obtained from training on S.

Question 3: Given two classifiers CA and CB and enough data for a separate test set, determine which classifier will be more accurate on new test examples. This question can be answered by measuring the accuracy of each classifier on the separate test set and applying McNemar's test, which will be described below.
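Although McNemar's test is described in detail later in the article, its shape can be sketched here: it compares the counts of test examples misclassified by only one of the two classifiers. The Python illustration below is ours (including the standard continuity correction), not the article's code:

```python
import math

def mcnemar_test(n01, n10):
    """McNemar's test with continuity correction.

    n01: number of test examples misclassified by classifier A but not B.
    n10: number of test examples misclassified by classifier B but not A.
    Returns the chi-square statistic (1 degree of freedom) and its
    p-value; requires n01 + n10 > 0."""
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A large imbalance between the two disagreement counts yields a small p-value, i.e., evidence that the classifiers differ in accuracy.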

Question 4: Given two classiﬁers, CA and CB, produced by feeding a small data set S to two learning algorithms, A and B, which classiﬁer will be more accurate in classifying new examples? Again, because we have no separate set of test data, we cannot answer this question directly. Some researchers have taken the approach of converting this problem into a question about learning algorithms (Question 8). If we can determine which algorithm usually produces more accurate classiﬁers (when trained on data sets of approximately the same size), then we can select the classiﬁer (CA or CB ) created by that algorithm.

Question 5: Given a learning algorithm A and a large data set S, what is the accuracy of the classiﬁers produced by A when trained on new training sets of a speciﬁed size m? This question has not received much attention in the literature. One approach, advocated by the DELVE project (Hinton, Neal, Tibshirani, & DELVE team members, 1995; Rasmussen, 1996), is to subdivide S into a test set and several disjoint training sets of size m. Then A is trained on each of the training sets, and the resulting classiﬁers are tested on the test set. The average performance on the test set estimates the accuracy of new runs.
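The DELVE-style protocol just described can be sketched as follows. This is an illustrative outline of ours, not DELVE code; `train_and_test` is a hypothetical callback that trains algorithm A on one training set and returns the resulting classifier's accuracy on the shared test set:

```python
def disjoint_train_sets_estimate(data, m, n_sets, train_and_test):
    """Carve several disjoint training sets of size m out of a large
    data set, keep the remainder as a single shared test set, and
    average the test-set accuracies over the training sets."""
    train_sets = [data[i * m:(i + 1) * m] for i in range(n_sets)]
    test_set = data[n_sets * m:]
    accuracies = [train_and_test(ts, test_set) for ts in train_sets]
    return sum(accuracies) / n_sets
```

The average over training sets estimates the expected accuracy of a classifier produced by A on a fresh training set of size m.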

Question 6: Given a learning algorithm A and a small data set S, what is the accuracy of the classifiers produced by A when A is trained on new training sets of the same size as S? Kohavi (1995) shows that stratified 10-fold cross-validation produces fairly good estimates in this case. Note that in any resampling approach, we cannot train A on training sets of exactly the same size as S. Instead, we train on data sets that have slightly fewer examples (e.g., 90% of the size of S in 10-fold cross-validation) and rely on the assumption that the performance of learning algorithms changes smoothly with changes in the size of the training data. This assumption can be checked experimentally (by performing additional cross-validation studies) with even smaller training sets, but it cannot be checked directly for training sets of the size of S. Results on the shape of learning curves show that in some cases, this smoothness assumption will be violated (Haussler, Kearns, Seung, & Tishby, 1994). Nonetheless, it is observed to hold experimentally in most applications.
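Stratified 10-fold cross-validation can be sketched in Python. This is an illustrative implementation of ours, not code from Kohavi (1995); `train_and_test` is again a hypothetical callback that trains on one index set and reports accuracy on the other:

```python
import random
from collections import defaultdict

def stratified_k_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def cross_validated_accuracy(train_and_test, labels, k=10):
    """Mean held-out accuracy over k stratified folds; each fold serves
    once as the test set while the remaining k - 1 folds form the
    training set (about 90% of the data when k = 10)."""
    folds = stratified_k_folds(labels, k)
    accuracies = []
    for fold in folds:
        train = [i for g in folds if g is not fold for i in g]
        accuracies.append(train_and_test(train, fold))
    return sum(accuracies) / k
```

Note that each training set has roughly (k - 1)/k of the examples, which is exactly the slight size mismatch discussed above.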

Question 7: Given two learning algorithms A and B and a large data set S, which algorithm will produce more accurate classiﬁers when trained on data sets of a speciﬁed size m? This question has not received much attention, although the DELVE team has studied this question for regression problems.

They divide S into several disjoint training sets and a single test set. Each algorithm is trained on each training set, and all resulting classifiers are tested on the test set. An analysis of variance can then be performed that includes terms for the choice of learning algorithm, the choice of the training set, and each individual test example. The Quasi-F test (Lindman, 1992) is applied to determine whether the effect due to the choice of learning algorithms is significantly nonzero.

Question 8: Given two learning algorithms A and B and a small data set S, which algorithm will produce more accurate classiﬁers when trained on data sets of the same size as S? The purpose of this article is to describe and compare several statistical tests for answering this question. Because S is small, it will be necessary to use holdout and resampling methods. As mentioned regarding Question 6, this means that we cannot answer this question exactly without making the assumption that the performance of the two learning algorithms changes smoothly with changes in the size of the training set. Speciﬁcally, we will need to assume that the relative difference in performance of the two algorithms changes slowly with changes in the size of the training set.
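For orientation, the 5 × 2 cv statistic developed later in the article can be sketched now: each of five replications of twofold cross-validation yields two observed accuracy differences between the algorithms, and the difference from the first fold of the first replication is divided by a variance estimate pooled over the replications; the result is compared with a t distribution with 5 degrees of freedom. The Python sketch below is our illustration of that computation:

```python
import math

def five_by_two_cv_t(diffs):
    """t statistic for the 5 x 2 cv paired t test.

    diffs: five pairs (p1, p2) of observed accuracy differences
    (algorithm A minus algorithm B), one pair per replication of
    twofold cross-validation.  Reject the null hypothesis of equal
    accuracy at the 95% level when |t| exceeds t_{5, 0.975} ~= 2.571."""
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / 5)
```

Because the denominator pools variance estimates from five independent train-test splits, the statistic directly reflects variation due to the choice of training set.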

Question 9: Given two learning algorithms A and B and data sets from several domains, which algorithm will produce more accurate classiﬁers when trained on examples from new domains? This is perhaps the most fundamental and difﬁcult question in machine learning. Some researchers have applied a simple sign test (or the Wilcoxon signed-ranks test) to try to answer this question, based on single runs or cross-validation-based estimates, but these tests do not take into account the uncertainty of the individual comparisons. Effectively, we want to combine the results from several answers to Question 8, where each answer has an associated uncertainty.
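The simple sign test mentioned here can be sketched as follows. This is our illustration; as the text notes, such a test treats each domain as a single binary outcome and ignores the uncertainty of each per-domain comparison:

```python
import math

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: the probability, under the null
    hypothesis that each algorithm wins a domain with probability 1/2,
    of a win-loss split at least this extreme.  Domains where the two
    algorithms tie should be dropped before calling."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, winning 9 of 10 domains is significant at the 5% level, while winning 6 of 10 is not.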

This is an important question for future research.

Questions 7, 8, and 9 are the most important for experimental research on learning algorithms. When someone develops a new learning algorithm (or a modiﬁcation to an existing algorithm), answers to these questions can determine whether the new algorithm is better than existing algorithms.

Unfortunately, many data sets used in experimental research are too small to allow posing Question 7. Hence, this article focuses on developing good statistical tests for Question 8. We define and compare five statistical tests for this question.

Before proceeding with the derivation of these statistical tests, it is worth noting that each of the questions posed can be extended beyond classiﬁcation algorithms and misclassiﬁcation rates. For example, in many decisionmaking settings, it is important to estimate the conditional probability that a new example belongs to each of the K classes. One measure of the accuracy of probability estimates is the log loss; Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected log loss of a classiﬁer or an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which classiﬁer or algorithm has the smaller log loss. We are unaware of any statistical research speciﬁcally addressing these questions in the case of log loss, however.
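The log loss itself is straightforward to compute. The following Python sketch (our illustration) evaluates the mean negative log-probability that a classifier assigns to the true classes:

```python
import math

def log_loss(probabilities, true_classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    probabilities: per-example lists of estimated class probabilities.
    true_classes: per-example true class indices.
    eps clips zero probabilities to keep the logarithm finite."""
    total = 0.0
    for probs, y in zip(probabilities, true_classes):
        total += -math.log(max(probs[y], eps))
    return total / len(true_classes)
```

Lower values are better; a perfectly confident, correct prediction contributes zero loss.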

In many neural network applications, the task is to predict a continuous response variable. In these problems, the squared error is usually the natural loss function, and Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected mean squared error of a predictor or of an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which predictor or algorithm has the smaller mean squared error. Question 1 can be addressed by constructing a confidence interval based on the normal or t distribution (depending on the size of the test set).

Question 3 can be addressed by constructing a conﬁdence interval for the expected difference. The DELVE project has developed analysis-of-variance techniques for Questions 5 and 7. Appropriate statistical tests for the smallsample questions (2, 4, 6, and 8) are still not well established.
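For the large-test-set regression questions (1 and 3), the confidence interval mentioned above can be sketched in Python. This is our illustration, using the normal approximation appropriate for large test sets; pass per-example squared errors for Question 1, or per-example differences of squared errors between the two predictors for Question 3:

```python
import math

def mean_ci(values, z=1.96):
    """Normal-approximation confidence interval for the expected value
    of a per-example loss, given the observed losses on a large test
    set.  z = 1.96 gives roughly a 95% interval; requires len >= 2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half
```

For Question 3, an interval that excludes zero indicates a significant difference in expected squared error between the two predictors.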

The statistical tests for regression methods may suggest ways of designing statistical tests for the log loss case, an important area for future research.