A Guide to Convolutional Neural Networks for Computer Vision. Salman Khan
methods. The idea behind the HOG descriptors is that the object appearance and the shape within an image can be described by the histogram of edge directions. The implementation of these descriptors consists of the following four steps.
Gradient Computation
The first step is the computation of the gradient values. A 1D centered point discrete derivative mask is applied on an image in both the horizontal and vertical directions. Specifically, this method requires the filtering of the gray-scale image with the following filter kernels:
Thus, given an image I, the following convolution operations (denoted by *) result in the derivatives of the image I in the x and y directions:
Thus, the orientation θ and the magnitude |g| of the gradient are calculated as follows:
As you will see in Chapter 4, just like the HOG descriptor, CNNs also use convolution operations in their layers. However, the main difference is that instead of using hand-engineered filters, e.g., fx, fy in Eq. (2.1), CNNs use trainable filters which make them highly adaptive. That is why they can achieve high accuracy levels in most applications such as image recognition.
Cell Orientation Histogram
The second step is the calculation of the cell histograms. First, the image is divided into small (usually 8 × 8 pixels) cells. Each cell has a fixed number of gradient orientation bins, which are evenly spread over 0–180° or 0–360°, depending on whether the gradient is unsigned or signed. Each pixel within the cell casts a weighted vote for a gradient orientation bin based on the gradient magnitude at that pixel. For the vote weight, the pixel contribution can be the gradient magnitude, or the square root of the gradient magnitude or the square of the gradient magnitude.
Block Descriptor
To deal with changes in illumination and contrast, the gradient strengths are locally normalized by grouping the cells together into larger, spatially connected blocks. The HOG descriptor is then the vector of the components of the normalized cell histograms from all of the block regions.
Block Normalization
The final step is the normalization of the block descriptors. Let v be the non-normalized vector containing all histograms in a given block, ||v||k be its k-norm for k = 1, 2, and ϵ be a small constant. Then the normalization factor can be one of the following:
or
or
There is another normalization factor, L2-Hys, which is obtained by clipping the L2-norm of v (i.e., limiting the maximum values of v to 0.2) and then re-normalizing.
The final image/RoI descriptor is formed by concatenating all normalized block descriptors. The experimental results in Triggs and Dalal [2005] show that all four block normalization methods achieve a very significant improvement over the non-normalized one. Moreover, the L2-norm, L2-Hys, and L1-sqrt normalization approaches provide a similar performance, while the L1-norm provides a slightly less reliable performance.
Figure 2.3: HOG descriptor. Note that for better visualization, we only show the cell orientation histogram for four cells and a block descriptor corresponding to those four cells.
2.2.2 SCALE-INVARIANT FEATURE TRANSFORM (SIFT)
SIFT [Lowe, 2004] provides a set of features of an object that are are robust against object scaling and rotations. The SIFT algorithm consists of four main steps, which are discussed in the following subsections.
Scale-space Extrema Detection
The first step aims to identify potential keypoints that are invariant to scale and orientation. While several techniques can be used to detect keypoint locations in the scale-space, SIFT uses the Difference of Gaussians (DoG), which is obtained as the difference of Gaussian blurring of an image with two different scales, σ, one with scale k times the scale of the other, i.e., k × σ. This process is performed for different octaves of the image in the Gaussian Pyramid, as shown in Fig. 2.4a.
Then, the DoG images are searched for local extrema over all scales and image locations. For instance, a pixel in an image is compared with its eight neighbors in the current image as well as nine neighbors in the scale above and below, as shown in Fig. 2.4b. If it is the minimum or maximum of all these neighbors, then it is a potential keypoint. It means that a keypoint is best represented in that scale.
Figure 2.4: Scale-space feature detection using a sub-octave DoG pyramid. (a) Adjacent levels of a sub-octave Gaussian pyramid are subtracted to produce the DoG; and (b) extrema in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors. (Figure from Lowe [2004], used with permission.)
Accurate Keypoint Localization
This step removes unstable points from the list of potential keypoints by finding those that have low contrast or are poorly localized on an edge. In order to reject low contrast keypoints, a Taylor series expansion of the scale space is computed to get more accurate locations of extrema, and if the intensity at each extrema is less than a threshold value, the keypoint is rejected.
Moreover, the DoG function has a strong response along the edges, which results in a large principal curvature across the edge but a small curvature in the perpendicular direction in the DoG function. In order to remove the keypoints located on an edge, the principal curvature at the keypoint is computed from a 2 × 2 Hessian matrix at the location and scale of the keypoint. If the ratio between the first and the second eigenvalues is greater than a threshold, the keypoint is rejected.
Remark: In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function. Specifically, suppose f(x1, x2, …, xn) is a function outputting a scalar, i.e.,
Orientation Assignment
In order to achieve invariance to image rotation, a consistent orientation is assigned to each keypoint based on its local image properties. The keypoint descriptor can then be represented relative to this orientation. The algorithm used to find an orientation consists of the following steps.
1. The scale of the keypoint is used to select the