ABSTRACT:

Supervised classification is a major area of machine learning that has
attracted growing interest over the years. The literature contains many
proposals for classification paradigms and learning algorithms that
can be applied to specific classification tasks. Honest classifier
evaluation and fair comparison among classification models are
therefore essential, both for drawing the right conclusions from the
results obtained and for choosing the best model or paradigm. However,
many researchers focus on proposing new classification algorithms and
neglect the fair evaluation of their results.

This tutorial presents an overview of performance evaluation
methodologies for classifiers. It is organized in five parts. In the
first part, we introduce the classification problem and motivate the
importance of honest validation and comparison of classification
models. The second part is devoted to the scores that can be used
to measure the goodness of a classifier. The classification error is
the most studied and also the most commonly used score. However, there
are other scores that may be of interest in certain application
domains. The third part of the tutorial covers estimation
methods. We present and motivate the problem of estimating the value
of a score for a classifier given a (finite) data set, and we
elaborate on different estimation methods as well as on their
properties and application domains. The fourth part of the tutorial is
dedicated to classifier comparison. In this part, we introduce 
statistical hypothesis testing and different types of statistical tests 
that can be used to compare two or more classification models using one 
or more data sets. Finally, the last part of the tutorial presents
recommendations for fair classifier evaluation, tailored to specific
characteristics of the problem or the data set, together with general
best practices for drawing sound conclusions from the results.