Efficient Processing of Deep Neural Networks. Vivienne Sze

Efficient Processing of Deep Neural Networks

href="#fb3_img_img_79314ff0-5949-5bee-b0cd-d36f8a9908c1.jpg" alt="image"/>, where Wij, xi, and yj are the weights, input activations, and output activations, respectively, and f(.) is a nonlinear function described in Section 2.3.3. The bias term bj is omitted from Figure 1.3b for simplicity. In this book, we will use the color green to denote weights, blue to denote activations, and red to denote weighted sums (or partial sums, which are further accumulated to become the final weighted sums).

Within the domain of neural networks, there is an area called deep learning, in which the neural networks have more than three layers, i.e., more than one hidden layer. Today, the typical numbers of network layers used in deep learning range from 5 to more than a 1,000. In this book, we will generally use the terminology deep neural networks (DNNs) to refer to the neural networks used in deep learning.

DNNs are capable of learning high-level features with more complexity and abstraction than shallower neural networks. An example that demonstrates this point is using DNNs to process visual data, as shown in Figure 1.4. In these applications, pixels of an image are fed into the first layer of a DNN, and the outputs of that layer can be interpreted as representing the presence of different low-level features in the image, such as lines and edges. In subsequent layers, these features are then combined into a measure of the likely presence of higher-level features, e.g., lines are combined into shapes, which are further combined into sets of shapes. Finally, given all this information, the network provides a probability that these high-level features comprise a particular object or scene. This deep feature hierarchy enables DNNs to achieve superior performance in many tasks.

Figure 1.5: Example of an image classification task. The machine learning platform takes in an image and outputs the class probabilities for a predefined set of classes.

1.2 TRAINING VERSUS INFERENCE

Since DNNs are an instance of machine learning algorithms, the basic program does not change as it learns to perform its given tasks. In the specific case of DNNs, this learning involves determining the value of the weights (and biases) in the network, and is referred to as training the network. Once trained, the program can perform its task by computing the output of the network using the weights determined during the training process. Running the program with these weights is referred to as inference.

In this section, we will use image classification, as shown in Figure 1.5, as a driving example for training and using a DNN. When we perform inference using a DNN, the input is image and the output is a vector of values representing the class probabilities. There is one value for each object class, and the class with the highest value indicates the most likely (predicted) class of object in the image. The overarching goal for training a DNN is to determine the weights that maximize the probability of the correct class and minimize the probabilities of the incorrect classes. The correct class is generally known, as it is often defined in the training set. The gap between the ideal correct probabilities and the probabilities computed by the DNN based on its current weights is referred to as the loss (L). Thus, the goal of training DNNs is to find a set of weights to minimize the average loss over a large training set.

When training a network, the weights (wij) are usually updated using a hill-climbing (hill-descending) optimization process called gradient descent. In gradient descent, a weight is updated by a scaled version of the partial derivative of the loss with respect to the weight (i.e., updated to , where α is called the learning rate⁴). Note that this gradient indicates how the weights should change in order to reduce the loss. The process is repeated iteratively to reduce the overall loss.

Sidebar: Key steps in training

Here, we will provide a very brief summary of the key steps of training and deploying a model. For more details, we recommend the reader refer to more comprehensive references such as [2]. First, we collect a labeled dataset and divide the data into subsets for training and testing. Second, we use the training set to train a model so that it can learn the weights for a given task. After achieving adequate accuracy on the training set, the ultimate quality of the model is determined by how accurately it performs on unseen data. Therefore, in the third step, we test the trained model by asking it to predict the labels for a test set that it has never seen before and compare the prediction to the ground truth labels. Generalization refers to how well the model maintains the accuracy between training and unseen data. If the model does not generalize well, it is often referred to as overfitting; this implies that the model is fitting to the noise rather than the underlying data structure that we would like it to learn. One way to combat overfitting is to have a large, diverse dataset; it has been shown that accuracy increases logarithmically as a function of the number of training examples [16]. Section 2.6.3 will discuss various popular datasets used for training. There are also other mechanisms that help with generalization including Regularization. It adds constraints to the model during training such as smoothness, number of parameters, size of the parameters, prior distribution or structure, or randomness in the training using dropout [17]. Further partitioning the training set into training and validation sets is another useful tool. Designing a DNN requires determining (tuning) a large number of hyperparameters such as the size and shape of a layer or the number of layers. Tuning the hyperparameters based on the test set may cause overfitting to the test set, which results in a misleading evaluation of the true performance on unseen data. In this circumstance, the validation set can be used instead of the test set to mitigate this problem. Finally, if the model performs sufficiently well on the test set, it can be deployed on unlabeled images.

An efficient way to compute the partial derivatives of the gradient is through a process called backpropagation. Backpropagation, which is a computation derived from the chain rule of calculus, operates by passing values backward through the network to compute how the loss is affected by each weight.

Figure 1.6: An example of back propagation through a neural network.

This backpropagation computation is, in fact, very similar in form to the computation used for inference, as shown in Figure 1.6 [19].⁵ Thus, techniques for efficiently performing inference can sometimes be useful for performing training. There are, however, some important additional considerations to note. First, backpropagation requires intermediate outputs of the network to be preserved for the backward computation, thus training has increased storage requirements. Second, due to the gradients use for hill-climbing (hill-descending), the precision requirement for training is generally higher than inference. Thus, many of the reduced precision techniques discussed in Chapter 7 are limited to inference only.

A variety of techniques are used to improve the efficiency and robustness of training. For example, often, the loss from multiple inputs is computed before a single pass of weight updates is performed. This is called batching, which helps to speed up and stabilize the process.⁶

There are multiple ways to train the weights. The most common approach, as described above, is called supervised learning, where all the training samples are labeled (e.g., with the correct class). Unsupervised learning is another approach, where no

Скачать книгу