2.3 MACHINE LEARNING CLASSIFIERS
Machine learning is usually divided into three main areas, namely supervised, unsupervised, and semi-supervised learning. In supervised learning, the goal is to learn a mapping from inputs to outputs, given a labeled set of input-output pairs. In unsupervised learning, we are only given inputs, and the goal is to automatically find interesting patterns in the data. This problem is less well defined, because we are not told what kind of patterns to look for. Moreover, unlike supervised learning, where we can compare the label predicted for a given sample with the observed value, there is no obvious error metric to use. The third type, semi-supervised learning, typically combines a small amount of labeled data with a large amount of unlabeled data to learn an appropriate function or classifier. Labeling a large corpus of data is often prohibitively expensive, whereas acquiring unlabeled data is relatively cheap; in such cases, the semi-supervised approach can be of great practical value.
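To make the first two settings concrete, the following minimal sketch fits a supervised classifier and an unsupervised clustering model on the same data. The scikit-learn dataset and model choices here are illustrative assumptions, not prescribed by this chapter.

```python
# Contrasting supervised and unsupervised learning on the same data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from inputs X to the known labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy on training data:", clf.score(X, y))

# Unsupervised: only X is given; look for structure (here, 3 clusters).
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("Cluster assignments of first 5 samples:", km.labels_[:5])
```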
Another important class of machine learning algorithms is "reinforcement learning," in which an agent learns the ideal behavior from observations of the world. Every action the agent takes has some impact on the environment, and the environment provides reward feedback to guide the learning algorithm. However, in this book our focus is mainly on supervised learning, which is the most widely used machine learning approach in practice.
A wide range of supervised classification techniques has been proposed in the literature. These methods can be divided into three different categories, namely linear (e.g., SVM [Cortes, 1995]; logistic regression; Linear Discriminant Analysis (LDA) [Fisher, 1936]), nonlinear (e.g., Multi-Layer Perceptron (MLP), kernel SVM), and ensemble-based (e.g., Random Decision Forest (RDF) [Breiman, 2001, Quinlan, 1986]; AdaBoost [Freund and Schapire, 1997]) classifiers. The goal of ensemble methods is to combine the predictions of several base classifiers to improve generalization over a single classifier. Ensemble methods can be divided into two categories, namely averaging (e.g., bagging methods; Random Decision Forests [Breiman, 2001, Quinlan, 1986]) and boosting (e.g., AdaBoost [Freund and Schapire, 1997]; Gradient Tree Boosting [Friedman, 2000]). In the averaging methods, the aim is to build several classifiers independently and then to average their predictions. In the boosting methods, base "weak" classifiers are built sequentially, each one trying to reduce the bias of the combined overall classifier. The motivation is to combine several weak models to produce a powerful ensemble.
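The following brief sketch trains one classifier from each ensemble family on the same data; the synthetic dataset and hyperparameters are illustrative assumptions.

```python
# Averaging (Random Decision Forest) vs. boosting (AdaBoost) with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Averaging: many trees built independently; their predictions are averaged.
rdf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: weak learners built sequentially, each correcting its predecessors.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("RDF test accuracy:     ", rdf.score(X_te, y_te))
print("AdaBoost test accuracy:", ada.score(X_te, y_te))
```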
Our definition of a machine learning classifier that is capable of improving computer vision tasks through experience is somewhat abstract. To make it concrete, in the following we describe three widely used classifiers in some detail: a linear one (SVM), a nonlinear one (kernel SVM), and an ensemble one (RDF).
2.3.1 SUPPORT VECTOR MACHINE (SVM)
SVM [Cortes, 1995] is a supervised machine learning algorithm used for classification or regression problems. SVM works by finding a linear hyperplane which separates the training dataset into two classes. As there are many such hyperplanes, the SVM algorithm tries to find the optimal separating hyperplane (as shown in Fig. 2.7), which is intuitively achieved when the distance (also known as the margin) to the nearest training samples is as large as possible. This is because, in general, the larger the margin, the lower the generalization error of the model.
Mathematically, SVM is a maximum-margin linear model. Given a training dataset of $n$ samples of the form $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i$ is an $m$-dimensional feature vector and $y_i \in \{1, -1\}$ is the class to which the sample $\mathbf{x}_i$ belongs, the goal of SVM is to find the maximum-margin hyperplane which divides the group of samples for which $y_i = 1$ from the group of samples for which $y_i = -1$. As shown in Fig. 2.7b (the bold blue line), this hyperplane can be written as the set of sample points satisfying the following equation:

$$\mathbf{w}^T \mathbf{x} + b = 0,$$
where $\mathbf{w}$ is the normal vector to the hyperplane and $b$ is a bias term. More precisely, any sample above the hyperplane, i.e., $\mathbf{x}_i$ such that $\mathbf{w}^T \mathbf{x}_i + b > 0$, should have label $y_i = 1$. Similarly, any sample below the hyperplane, i.e., $\mathbf{x}_i$ such that $\mathbf{w}^T \mathbf{x}_i + b < 0$, should have label $y_i = -1$.
Notice that there is some space between the hyperplane (or decision boundary, the bold blue line in Fig. 2.7b) and the nearest data samples of either class. The data are therefore rescaled such that anything on or above the hyperplane $\mathbf{w}^T \mathbf{x} + b = 1$ is of one class with label 1, and anything on or below the hyperplane $\mathbf{w}^T \mathbf{x} + b = -1$ is of the other class with label $-1$. Since these two new hyperplanes are parallel, the distance between them is

$$\frac{2}{\|\mathbf{w}\|}.$$
Recall that SVM tries to maximize the distance between these two new hyperplanes demarcating the two classes, which is equivalent to minimizing

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$

subject to:

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad \forall i \in \{1, \ldots, n\}.$$
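Under these definitions, the following minimal sketch fits a linear SVM on a small, linearly separable toy dataset (an illustrative assumption) with scikit-learn, and reads off the learned $\mathbf{w}$, $b$, and the margin $2/\|\mathbf{w}\|$. A very large penalty parameter approximates the hard-margin problem above.

```python
# Maximum-margin linear SVM: recover w, b, and the margin 2/||w||.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (class labels +1 and -1).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem (almost no slack).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]  # hyperplane: w^T x + b = 0
print("w =", w, " b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```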
Soft-margin Extension
In the case where the training samples are not perfectly linearly separable, SVM can allow some samples of one class to appear on the other side of the hyperplane (boundary) by introducing a slack variable $\xi_i$ for each sample $\mathbf{x}_i$, as follows:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to:

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i \in \{1, \ldots, n\},$$

where $C$ is a penalty parameter that trades off the width of the margin against the total amount of slack.
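The small sketch below illustrates this trade-off: the penalty $C$ weights the slack terms, so a small $C$ tolerates margin violations (a wider margin, more support vectors) while a large $C$ penalizes them heavily. The toy data, with one deliberately mislabeled point, are an illustrative assumption.

```python
# Effect of the soft-margin penalty C on a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(0.0, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
y[0] = -1  # inject one label-noise sample inside the +1 cluster

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>5}: support vectors = {clf.n_support_.sum()}, "
          f"train accuracy = {clf.score(X, y):.2f}")
```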
Unlike deep neural networks, which will be introduced in Chapter 3, linear SVMs can only produce linear decision boundaries, i.e., they can only solve problems in which the samples belonging to class 1 can be separated from the samples belonging to class 2 by a hyperplane, as shown in Fig. 2.7. In many cases, however, the data samples are not linearly separable.
Nonlinear Decision Boundary
SVM can be extended to nonlinear classification by projecting the original input space into a higher-dimensional feature space, where the classes become linearly separable; this projection is typically performed implicitly through a kernel function (the so-called kernel trick).
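The following minimal sketch contrasts a linear SVM with an RBF-kernel SVM on data that no linear hyperplane can separate (concentric circles); the dataset and kernel choice are illustrative assumptions.

```python
# Nonlinear decision boundary via the kernel trick (RBF kernel).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("Linear SVM accuracy:", linear.score(X, y))  # near chance level
print("RBF-kernel accuracy:", rbf.score(X, y))     # near perfect
```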