Deep learning makes use of activation functions that are more resistant to neuron saturation than conventional activation functions. One of the classic characteristics of traditional neural networks was the infamous use of sigmoidal transformations in hidden units. Sigmoidal transformations are problematic for gradient-based learning because the sigmoid has two asymptotic regions that can saturate (that is, the gradient of the output is near zero). The red or deeper shaded outer areas represent areas of saturation. See Figure 1.2.
Figure 1.2: Hyperbolic Tangent Function
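To make the saturation concrete, consider the derivative of the hyperbolic tangent, 1 − tanh(x)², which approaches zero in both asymptotic regions. The following NumPy sketch (an illustration, not code from the book) evaluates the gradient at a few points:

import numpy as np

# The derivative of tanh(x) is 1 - tanh(x)^2.
x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grad = 1.0 - np.tanh(x) ** 2
print(grad)  # roughly [8e-09, 0.07, 1.0, 0.07, 8e-09]

At x = ±10, the gradient is effectively zero, so any weight update that depends on it stalls.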
On the other hand, a linear transformation such as the identity poses little issue for gradient-based learning because the gradient is a constant. However, the use of linear transformations negates the benefit provided by nonlinear transformations (that is, the ability to approximate nonlinear relationships).
The rectified linear transformation (or ReLU) is a piecewise linear transformation; when many such units are combined, they can approximate nonlinear functions. (See Figure 1.3.)
Figure 1.3: Rectified Linear Function
In the case of ReLU, the derivative of the transformation is 1 in the active region and 0 in the inactive region. The inactive region can be viewed as a weakness of the transformation because it prevents the unit from contributing to gradient-based learning.
The saturation of ReLU can be somewhat mitigated by cleverly initializing the weights to avoid negative output values. For example, consider a business scenario of modeling image data. Each unstandardized input pixel value ranges between 0 and 255. In this case, the weights could be initialized and constrained to be strictly positive to avoid negative output values, steering clear of the inactive region of the ReLU.
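As a minimal illustration (a NumPy sketch, not SAS code from the book), ReLU and its derivative can be written as follows. Note the zero gradient everywhere in the inactive region:

import numpy as np

def relu(x):
    # max(0, x): identity in the active region, zero in the inactive region
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 where x > 0 (active), 0 where x <= 0 (inactive)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]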
Other variants of the rectified linear transformation exist that permit learning to continue when the combination function resolves to a negative value. Most notable of these is the exponential linear activation transformation (ELU) as shown in Figure 1.4.
Figure 1.4: Exponential Linear Function
SAS researchers have observed better performance when ELU is used instead of ReLU in convolutional neural networks in some cases. SAS includes other popular activation functions that are not shown here, such as softplus and leaky ReLU. Additionally, you can create your own activation functions in SAS using the SAS Function Compiler (or FCMP).
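For reference, the ELU outputs x for positive inputs and α(exp(x) − 1) otherwise, so the gradient remains nonzero for negative inputs. A minimal NumPy sketch (illustrative; the choice α = 1 is an assumption, not a SAS default):

import numpy as np

def elu(x, alpha=1.0):
    # Identity in the positive region; a smooth exponential approach to
    # -alpha in the negative region, so learning can continue there
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(elu(x))  # [-0.95 -0.39  0.    0.5   3.  ]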
Note: Convolutional neural networks (CNNs) are a class of artificial neural networks. CNNs are widely used in image recognition and classification. Like regular neural networks, a CNN consists of multiple layers and a number of neurons. CNNs are well suited for image data, but they can also be used for other problems such as natural language processing. CNNs are detailed in Chapter 2.
The error function defines a surface in the parameter space. For a linear model fit by least squares, the error surface is convex and has a unique minimum. However, in a nonlinear model, this error surface is often a complex landscape consisting of numerous deep valleys, steep cliffs, and long-reaching plateaus.
To efficiently search this landscape for an error minimum, optimization must be used. The optimization methods use local features of the error surface to guide their descent. Specifically,
the parameters associated with a given error minimum are located using the following procedure:
1. Initialize the weight vector to small random values, w(0).
2. Use an optimization method to determine the update vector, δ(t).
3. Add the update vector to the weight values from the previous iteration to generate new estimates: w(t+1) = w(t) + δ(t).
4. If none of the specified convergence criteria have been achieved, then go back to step 2.
Here are the three conditions under which convergence is declared:
1. when the specified error function stops improving
2. when the gradient is effectively zero (implying that a minimum has been reached)
3. when the magnitude of the parameters stops changing substantially
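Putting the procedure and the convergence criteria together, here is a minimal skeleton in NumPy that uses plain gradient descent as the optimization method (an illustrative sketch, not SAS code; the error function, gradient function, and tolerance are hypothetical stand-ins):

import numpy as np

def train(err_fn, grad_fn, n_weights, lr=0.1, max_iter=1000, tol=1e-6):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=n_weights)       # step 1: small random w(0)
    prev_err = err_fn(w)
    for _ in range(max_iter):
        g = grad_fn(w)
        delta = -lr * g                              # step 2: update vector delta(t)
        w = w + delta                                # step 3: w(t+1) = w(t) + delta(t)
        err = err_fn(w)
        if (abs(prev_err - err) < tol                # criterion 1: error stops improving
                or np.linalg.norm(g) < tol           # criterion 2: gradient near zero
                or np.linalg.norm(delta) < tol):     # criterion 3: weights stop changing
            break
        prev_err = err                               # step 4: otherwise, repeat step 2
    return w

# Example: minimize E(w) = sum(w^2), whose gradient is 2w
w_min = train(lambda w: np.sum(w ** 2), lambda w: 2.0 * w, n_weights=3)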
Batch Gradient Descent
Re-invented several times, the back propagation (backprop) algorithm initially just used gradient descent to determine an appropriate set of weights. The gradient, ∇E(w(t)), is the vector of partial derivatives of the error function with respect to the weights, evaluated at the current estimates w(t); it points in the direction that is locally steepest uphill. (See Figure 1.5.)
Figure 1.5: Batch Gradient Descent
By negating the gradient and scaling it by the step size (that is, learning rate) parameter, η, a step is made in the direction that is locally steepest downhill: δ(t) = −η ∇E(w(t)).
Unfortunately, as gradient descent approaches the desired weights, it exhibits numerous back-and-forth movements known as hemstitching. To control the training iterations wasted in this hemstitching, later versions of back propagation included a momentum term, yielding the modern update rule: δ(t) = −η ∇E(w(t)) + α δ(t−1).
The momentum term retains the last update vector, δ(t−1), using this information to “dampen” potentially oscillating search paths. The cost is an extra tuning parameter, the momentum coefficient α (0 ≤ α ≤ 1), that must be set. This update rule uses all of the training observations at each iteration t to calculate the exact gradient on each descent step. This results in a smooth progression to the error minimum.
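A minimal NumPy sketch of the batch update with momentum (illustrative; grad_fn computes the exact gradient over the full training set, and the values of η and α are hypothetical settings):

import numpy as np

def batch_gd_momentum(grad_fn, w0, lr=0.1, alpha=0.9, n_iter=100):
    w = w0.copy()
    delta = np.zeros_like(w)             # no previous update on the first step
    for _ in range(n_iter):
        g = grad_fn(w)                   # exact gradient over all training data
        delta = -lr * g + alpha * delta  # delta(t) = -eta*grad + alpha*delta(t-1)
        w = w + delta                    # w(t+1) = w(t) + delta(t)
    return w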
Stochastic Gradient Descent
In the batch variant of the gradient descent algorithm, the weight update vector is determined by using all of the examples in the training set. That is, the exact gradient is calculated, ensuring a relatively smooth progression to the error minimum.
However, when the training data set is large, computing the exact gradient is computationally expensive. The entire training data set must be assessed on each step down the gradient. Moreover, if the data are redundant, the error gradient on the second half of the data will be almost identical to the gradient on the first half. In this event, it would be a waste of time to compute the gradient on the whole data set. You would be better off computing the gradient on a subset of the observations, updating the weights, and then repeating on a new subset. In this case, each weight update is based on an approximation to the true gradient. But as long as it points in approximately the same direction as the exact gradient, the approximate gradient is a useful alternative to computing the exact gradient (Hinton 2007).
Taken to extremes, calculation of the approximate gradient can be based on a single training case. The weights are then updated, and the gradient is calculated on the next case. This is known as stochastic gradient descent (also known as online learning). (See Figure 1.6.)
Figure 1.6: Stochastic Gradient Descent
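A matching NumPy sketch of the stochastic variant (illustrative; grad_fn_one computes the approximate gradient from a single training case, and the learning rate is a hypothetical setting):

import numpy as np

def sgd(grad_fn_one, w0, data, lr=0.01, n_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(data)):  # visit training cases in random order
            g = grad_fn_one(w, data[i])       # approximate gradient from one case
            w = w - lr * g                    # update immediately, then move on
    return w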