
LETTER Communicated by Leo Breiman

Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms

Thomas G. Dietterich
Department of Computer Science, Oregon State University, Corvallis, OR 97331, U.S.A.

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 × 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 × 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.

1 Introduction

In the research, development, and application of machine learning algorithms for classification tasks, many questions arise for which statistical methods are needed. The purpose of this article is to investigate one of these questions, demonstrate that existing statistical methods are inadequate for this question, and propose a new statistical test that shows acceptable performance in initial experiments.

To understand the question raised in this article, it is helpful to consider a taxonomy of the different kinds of statistical questions that arise in machine learning. Figure 1 gives a taxonomy of nine statistical questions.

© 1998 Massachusetts Institute of Technology. Neural Computation 10, 1895–1923 (1998).

Let us begin at the root of the tree. The first issue to consider is whether we are studying only a single application domain or multiple domains. In most applied research, there is a single domain of interest, and the goal is to find the best classifier or the best learning algorithm to apply in that domain. However, a fundamental goal of research in machine learning is to find learning algorithms that work well over a wide range of application domains. We will return to this issue; for the moment, let us consider the single-domain case.

Within a single domain, there are two different sets of questions, depending on whether we are analyzing classifiers or algorithms. A classifier is a function that, given an input example, assigns that example to one of K classes. A learning algorithm is a function that, given a set of examples and their classes, constructs a classifier. In a particular application setting, our primary goal is usually to find the best classifier and estimate its accuracy with future examples. Suppose we are working for a medical instrumentation company and wish to manufacture and sell an instrument for classifying blood cells. At the time we are designing the instrument, we could gather a large collection of blood cells and have a human expert classify each cell.

We could then apply a learning algorithm to produce a classifier from this set of classified cells. The classifier would be implemented in the instrument and sold. We want our instrument to contain the most accurate classifier we can find.

There are some applications, however, where we must select the best learning algorithm rather than find the best classifier. For example, suppose we want to sell an e-mail system that learns to recognize and filter junk mail. Whenever the user receives an e-mail message that he considers junk mail, he will flag that message. Periodically, a learning algorithm included in the program will analyze the accumulated examples of junk and nonjunk e-mail and update its filtering rules. Our job is to determine which learning algorithm to include in the program.

The next level of the taxonomy distinguishes between two fundamental tasks: estimating accuracy and choosing between classifiers (or algorithms).

When we market our blood cell diagnosis system, we would like to make a claim about its accuracy. How can we measure this accuracy? And, of course, when we design the system, we want to choose the best classifier from some set of available classifiers.

The lowest level of the taxonomy concerns the amount of data available.

If we have a large amount of data, then we can set some of them aside to serve as a test set for evaluating classifiers. Much simpler statistical methods can be applied in this case. However, in most situations, the amount of data is limited, and we need to use all we have as input to our learning algorithms.





This means that we must use some form of resampling (i.e., cross-validation or the bootstrap) to perform the statistical analysis.

Now that we have reviewed the general structure of the taxonomy, let's consider the nine statistical questions. We assume that all data points (examples) are drawn independently from a fixed probability distribution defined by the particular application problem.

[Figure 1: A taxonomy of statistical questions in machine learning. The boxed node (Question 8) is the subject of this article.]

Question 1: Suppose we are given a large sample of data and a classifier C. The classifier C may have been constructed using part of the data, but there are enough data remaining for a separate test set. Hence, we can measure the accuracy of C on the test set and construct a binomial confidence interval (Snedecor & Cochran, 1989; Efron & Tibshirani, 1993; Kohavi, 1995). Note that in Question 1, the classifier could have been produced by any method (e.g., interviewing an expert); it need not have been produced by a learning algorithm.
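For concreteness, a minimal Python sketch of such a binomial confidence interval, using the normal approximation to the binomial (the function name and the 1.96 critical value for an approximate 95% interval are illustrative choices, not from the article):

```python
import math

def binomial_confidence_interval(errors, n, z=1.96):
    """Normal-approximation confidence interval for the true error rate
    of a classifier that makes `errors` mistakes on n independent test
    examples (z = 1.96 gives an approximate 95% interval)."""
    p = errors / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Example: 30 errors on a test set of 1000 examples.
lo, hi = binomial_confidence_interval(errors=30, n=1000)
```

For small test sets or error rates near 0 or 1, an exact binomial interval would be preferable to this normal approximation.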

Question 2: Given a small data set, S, suppose we apply learning algorithm A to S to construct classifier CA. How accurately will CA classify new examples? Because we have no separate test set, there is no direct way to answer this question. A frequently applied strategy is to convert this question into Question 6: Can we predict the accuracy of algorithm A when it is trained on randomly selected data sets of (approximately) the same size as S? If so, then we can predict the accuracy of CA, which was obtained from training on S.

Question 3: Given two classifiers CA and CB and enough data for a separate test set, determine which classifier will be more accurate on new test examples. This question can be answered by measuring the accuracy of each classifier on the separate test set and applying McNemar's test, which will be described below.
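A sketch of the continuity-corrected form of McNemar's statistic commonly used for this comparison (the article's own presentation of the test appears later; the function name and example counts here are illustrative):

```python
def mcnemar_statistic(n01, n10):
    """McNemar's test statistic with continuity correction.
    n01: test examples misclassified by A but not by B;
    n10: test examples misclassified by B but not by A.
    Under the null hypothesis that the two classifiers have the same
    error rate, the statistic is approximately chi-square with 1 degree
    of freedom; values above 3.841 reject at the 5% level."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Example: A alone errs on 25 examples, B alone on 10.
stat = mcnemar_statistic(n01=25, n10=10)  # (|25 - 10| - 1)^2 / 35 = 5.6
```

Note that only the discordant counts enter the statistic; examples that both classifiers get right (or both get wrong) carry no information about which is more accurate.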

Question 4: Given two classifiers, CA and CB, produced by feeding a small data set S to two learning algorithms, A and B, which classifier will be more accurate in classifying new examples? Again, because we have no separate set of test data, we cannot answer this question directly. Some researchers have taken the approach of converting this problem into a question about learning algorithms (Question 8). If we can determine which algorithm usually produces more accurate classifiers (when trained on data sets of approximately the same size), then we can select the classifier (CA or CB ) created by that algorithm.

Question 5: Given a learning algorithm A and a large data set S, what is the accuracy of the classifiers produced by A when trained on new training sets of a specified size m? This question has not received much attention in the literature. One approach, advocated by the DELVE project (Hinton, Neal, Tibshirani, & DELVE team members, 1995; Rasmussen, 1996), is to subdivide S into a test set and several disjoint training sets of size m. Then A is trained on each of the training sets, and the resulting classifiers are tested on the test set. The average performance on the test set estimates the accuracy of new runs.
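The DELVE-style protocol described above can be sketched as follows; `train_fn` and `test_fn` are hypothetical placeholders for the learning algorithm and the accuracy measurement, and the shuffling and set counts are illustrative details:

```python
import random

def disjoint_split_estimate(data, m, train_fn, test_fn,
                            n_train_sets=3, seed=0):
    """Sketch of the DELVE-style protocol: carve several disjoint
    training sets of size m out of `data`, reserve the remainder as a
    shared test set, train on each training set, and average the
    resulting test-set accuracies. `train_fn` maps a training set to a
    classifier; `test_fn` maps (classifier, test_set) to an accuracy."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    train_sets = [shuffled[i * m:(i + 1) * m] for i in range(n_train_sets)]
    test_set = shuffled[n_train_sets * m:]
    accuracies = [test_fn(train_fn(ts), test_set) for ts in train_sets]
    return sum(accuracies) / len(accuracies)
```

The average over the disjoint training sets estimates the accuracy of classifiers produced by new runs of the algorithm on fresh training sets of size m.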

Question 6: Given a learning algorithm A and a small data set S, what is the accuracy of the classifiers produced by A when A is trained on new training sets of the same size as S? Kohavi (1995) shows that stratified 10-fold cross-validation produces fairly good estimates in this case. Note that in any resampling approach, we cannot train A on training sets of exactly the same size as S. Instead, we train on data sets that have slightly fewer examples (e.g., 90% of the size of S in 10-fold cross-validation) and rely on the assumption that the performance of learning algorithms changes smoothly with changes in the size of the training data. This assumption can be checked experimentally (by performing additional cross-validation studies) with even smaller training sets, but it cannot be checked directly for training sets of the size of S. Results on the shape of learning curves show that in some cases, this smoothness assumption will be violated (Haussler, Kearns, Seung, & Tishby, 1994). Nonetheless, it is observed to hold experimentally in most applications.
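A minimal sketch of the stratification step in stratified k-fold cross-validation, assuming a simple round-robin assignment of each class's examples to folds (one common implementation choice, not prescribed by the article):

```python
from collections import defaultdict

def stratified_folds(examples, labels, k=10):
    """Sketch of stratified k-fold cross-validation: deal the examples
    of each class round-robin into k folds, so every fold preserves the
    overall class proportions as closely as possible. Each fold serves
    once as a test set while the remaining k-1 folds form the training
    set."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append((x, y))
    i = 0
    for cls_examples in by_class.values():
        for pair in cls_examples:
            folds[i % k].append(pair)
            i += 1
    return folds
```

In practice one would shuffle within each class before dealing; the round-robin step alone shows why each fold's class proportions track those of the full data set.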

Question 7: Given two learning algorithms A and B and a large data set S, which algorithm will produce more accurate classifiers when trained on data sets of a specified size m? This question has not received much attention, although the DELVE team has studied this question for regression problems.

They divide S into several disjoint training sets and a single test set. Each algorithm is trained on each training set, and all resulting classifiers are tested on the test set. An analysis of variance can then be performed that includes terms for the choice of learning algorithm, the choice of the training set, and each individual test example. The Quasi-F test (Lindman, 1992) is applied to determine whether the effect due to the choice of learning algorithms is significantly nonzero.

Question 8: Given two learning algorithms A and B and a small data set S, which algorithm will produce more accurate classifiers when trained on data sets of the same size as S? The purpose of this article is to describe and compare several statistical tests for answering this question. Because S is small, it will be necessary to use holdout and resampling methods. As mentioned regarding Question 6, this means that we cannot answer this question exactly without making the assumption that the performance of the two learning algorithms changes smoothly with changes in the size of the training set. Specifically, we will need to assume that the relative difference in performance of the two algorithms changes slowly with changes in the size of the training set.
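The statistical tests for this question are developed in the remainder of the article. As a preview of the kind of computation involved, the following sketch assumes a pooled-variance t-style statistic over five replications of twofold cross-validation (the exact form the article proposes is derived in later sections; the inputs here are the ten observed error-rate differences):

```python
import math

def five_by_two_cv_t(diffs):
    """Sketch of a 5 x 2 cv t-style statistic. `diffs` holds five pairs
    (p_i1, p_i2): the observed difference in error rates between the two
    algorithms on the two folds of replication i. The numerator is the
    first observed difference; the denominator pools the per-replication
    variance estimates. Under the null hypothesis the statistic is
    approximately t-distributed with 5 degrees of freedom."""
    variances = []
    for p1, p2 in diffs:
        p_bar = (p1 + p2) / 2.0
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / 5.0)

# Example: the same pair of fold differences in all five replications.
t = five_by_two_cv_t([(0.02, 0.04)] * 5)
```

Each replication uses a fresh random split of S into two halves, so the pooled variance reflects variation due to the choice of training set.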

Question 9: Given two learning algorithms A and B and data sets from several domains, which algorithm will produce more accurate classifiers when trained on examples from new domains? This is perhaps the most fundamental and difficult question in machine learning. Some researchers have applied a simple sign test (or the Wilcoxon signed-ranks test) to try to answer this question, based on single runs or cross-validation-based estimates, but these tests do not take into account the uncertainty of the individual comparisons. Effectively, we want to combine the results from several answers to Question 8, where each answer has an associated uncertainty.

This is an important question for future research.
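The simple sign test mentioned above can be sketched as an exact two-sided binomial test over per-domain wins, with ties dropped; as the text notes, this treats each domain's outcome as certain and ignores the uncertainty of the individual comparisons:

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test over domains on which algorithm A or B won
    (domains where they tie are dropped). Computes the exact binomial
    probability of a split at least this lopsided under the null
    hypothesis that each algorithm wins with probability 1/2."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Example: A wins on 8 of 10 domains, B on 2.
p = sign_test_p_value(wins_a=8, wins_b=2)
```

Even an 8-to-2 split is not significant at the 5% level under this test, which illustrates how little power such domain-counting procedures have with the handful of data sets typical in experimental papers.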

Questions 7, 8, and 9 are the most important for experimental research on learning algorithms. When someone develops a new learning algorithm (or a modification to an existing algorithm), answers to these questions can determine whether the new algorithm is better than existing algorithms.

Unfortunately, many data sets used in experimental research are too small to allow posing Question 7. Hence, this article focuses on developing good statistical tests for Question 8. We define and compare five statistical tests for this question.

Before proceeding with the derivation of these statistical tests, it is worth noting that each of the questions posed can be extended beyond classification algorithms and misclassification rates. For example, in many decision-making settings, it is important to estimate the conditional probability that a new example belongs to each of the K classes. One measure of the accuracy of probability estimates is the log loss; Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected log loss of a classifier or an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which classifier or algorithm has the smaller log loss. We are unaware of any statistical research specifically addressing these questions in the case of log loss, however.
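A minimal sketch of the log loss for class-probability estimates; the small floor `eps` that keeps the logarithm finite when a model assigns probability zero to the true class is an implementation detail, not from the article:

```python
import math

def log_loss(probabilities, true_classes, eps=1e-15):
    """Mean negative log-likelihood of the true class under the model's
    predicted class probabilities. `probabilities[i]` is a sequence
    giving the predicted probability of each of the K classes for
    example i; `true_classes[i]` is the index of the true class."""
    total = 0.0
    for probs, y in zip(probabilities, true_classes):
        total -= math.log(max(probs[y], eps))
    return total / len(true_classes)

# Example: a single two-class example predicted with probability 0.5.
loss = log_loss([[0.5, 0.5]], [0])  # equals ln 2
```

Unlike the misclassification rate, the log loss rewards well-calibrated probabilities, not just correct top-ranked classes.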

In many neural network applications, the task is to predict a continuous response variable. In these problems, the squared error is usually the natural loss function, and Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected mean squared error of a predictor or of an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which predictor or algorithm has the smaller mean squared error. Question 1 can be addressed by constructing a confidence interval based on the normal or t distribution (depending on the size of the test set).

Question 3 can be addressed by constructing a confidence interval for the expected difference. The DELVE project has developed analysis-of-variance techniques for Questions 5 and 7. Appropriate statistical tests for the small-sample questions (2, 4, 6, and 8) are still not well established.
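A sketch of such a confidence interval for the expected difference in squared error, computed from paired per-example errors on a shared test set; the caller supplies the t critical value for the appropriate degrees of freedom (the function name and argument layout are illustrative):

```python
import math
import statistics

def paired_difference_interval(errors_a, errors_b, t_crit):
    """Confidence interval for the expected difference in squared error
    between two predictors, from their per-example squared errors on a
    shared test set. Pairing on the same test examples removes the
    example-to-example variation shared by both predictors. `t_crit` is
    the critical value of the t distribution with len(errors_a) - 1
    degrees of freedom."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - t_crit * sem, mean + t_crit * sem
```

An interval that excludes zero indicates a significant difference in expected squared error at the chosen confidence level.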

The statistical tests for regression methods may suggest ways of designing statistical tests for the log loss case, an important area for future research.


