Efficient Processing of Deep Neural Networks. Vivienne Sze

be adaptive and scalable in order to handle the new and varied forms of DNNs that these applications may employ.

1.5 EMBEDDED VERSUS CLOUD

The various applications and aspects of DNN processing (i.e., training versus inference) have different computational needs. Specifically, training often requires a large dataset⁹ and significant computational resources for multiple weight-update iterations. In many cases, training a DNN model still takes several hours to multiple days (or weeks or months!) and thus is typically performed in the cloud.

Inference, on the other hand, can happen either in the cloud or at the edge (e.g., Internet of Things (IoT) or mobile). In many applications, it is desirable to have the DNN inference processing at the edge near the sensor. For instance, in computer vision applications, such as measuring wait times in stores or predicting traffic patterns, it would be desirable to extract meaningful information from the video right at the image sensor rather than in the cloud, to reduce the communication cost. For other applications, such as autonomous vehicles, drone navigation, and robotics, local processing is desired since the latency and security risks of relying on the cloud are too high. However, video involves a large amount of data, which is computationally complex to process; thus, low-cost hardware to analyze video is challenging, yet critical, to enabling these applications.¹⁰ Speech recognition allows us to seamlessly interact with electronic devices, such as smartphones. While currently most of the processing for applications such as Apple Siri and Amazon Alexa voice services is in the cloud, it is still desirable to perform the recognition on the device itself to reduce latency. Some works have even considered partitioning the processing between the cloud and the edge on a per-layer basis in order to improve performance [49]. However, considerations related to dependency on connectivity, privacy, and security augur for keeping computation at the edge. Many of the embedded platforms that perform DNN inference have stringent limits on energy consumption, compute, and memory cost; efficient processing of DNNs has thus become of prime importance under these constraints.

¹ Image recognition is also commonly referred to as image classification.

² Note: Recent work using TrueNorth in a stylized fashion allows it to be used to compute reduced precision neural networks [14]. These types of neural networks are discussed in Chapter 7.

³ Without a nonlinear function, multiple layers could be collapsed into one.

⁴ A large learning rate increases the step size applied at each iteration, which can help speed up the training, but may also result in overshooting the minimum or cause the optimization to not converge. A small learning rate decreases the step size applied at each iteration, which slows down the training but increases the likelihood of convergence. There are various methods to set the learning rate, such as ADAM [18]. Finding the best learning rate is one of the key challenges in training DNNs.

⁵ To backpropagate through each layer: (1) compute the gradient of the loss relative to the weights, ∂L/∂Wij, from the layer inputs (i.e., the forward activations, Xi) and the gradients of the loss relative to the layer outputs, ∂L/∂Yj; and (2) compute the gradient of the loss relative to the layer inputs, ∂L/∂Xi, from the layer weights, Wij, and the gradients of the loss relative to the layer outputs, ∂L/∂Yj.

⁶ There are various forms of gradient descent, which differ in terms of how frequently the weights are updated. Batch Gradient Descent updates the weights after computing the loss on the entire training set, which is computationally expensive and requires significant storage. Stochastic Gradient Descent updates the weights after computing the loss on a single training example, and the examples are shuffled after going through the entire training set. While it is fast, looking at a single example can be noisy and cause the weights to go in the wrong direction. Finally, Mini-batch Gradient Descent divides the training set into smaller sets called mini-batches (commonly referred to simply as “batches”) and updates the weights based on the loss of each mini-batch; this approach is most commonly used. In general, each pass through the entire training set is referred to as an epoch.

⁷ In the early 1960s, single neuron systems built out of analog logic were used for adaptive filtering [21, 22].

⁸ The Top-5 error rate is measured based on whether the correct answer appears in one of the top five categories selected by the algorithm.

⁹ One of the major drawbacks of DNNs is their need for large datasets to prevent overfitting during training.

¹⁰ As a reference, running a DNN on an embedded device is estimated to consume several orders of magnitude higher energy per pixel than video compression, which is a common form of processing near the image sensor [48].
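
As an illustration of footnotes 4 to 6, the following is a minimal sketch in plain Python of backpropagation through a single linear layer combined with mini-batch gradient descent. It is not taken from the text: the function names (forward, backward, train), the squared-error loss, the zero weight initialization, and the default learning rate and batch size are all assumptions chosen to keep the example short.

```python
import random


def forward(W, x):
    """Linear layer (assumed form): y[j] = sum_i W[i][j] * x[i]."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


def backward(W, x, dL_dy):
    """Backpropagate through one linear layer, as outlined in footnote 5.

    (1) dL/dW[i][j] = x[i] * dL/dy[j]           (layer inputs and output gradients)
    (2) dL/dx[i]    = sum_j W[i][j] * dL/dy[j]  (layer weights and output gradients)
    """
    n_in, n_out = len(x), len(dL_dy)
    dL_dW = [[x[i] * dL_dy[j] for j in range(n_out)] for i in range(n_in)]
    dL_dx = [sum(W[i][j] * dL_dy[j] for j in range(n_out)) for i in range(n_in)]
    return dL_dW, dL_dx


def train(data, n_in, n_out, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Mini-batch gradient descent on an assumed loss L = 0.5 * sum_j (y[j] - t[j])^2.

    batch_size=1 corresponds to stochastic gradient descent;
    batch_size=len(data) corresponds to batch gradient descent.
    """
    rng = random.Random(seed)
    W = [[0.0] * n_out for _ in range(n_in)]  # assumed initialization
    data = list(data)
    for _ in range(epochs):  # one epoch = one pass through the training set
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [[0.0] * n_out for _ in range(n_in)]
            for x, t in batch:
                y = forward(W, x)
                dL_dy = [y[j] - t[j] for j in range(n_out)]
                dL_dW, _ = backward(W, x, dL_dy)  # dL/dx is unused with one layer
                for i in range(n_in):
                    for j in range(n_out):
                        grad[i][j] += dL_dW[i][j] / len(batch)
            # Weight update: the learning rate lr scales the step size.
            for i in range(n_in):
                for j in range(n_out):
                    W[i][j] -= lr * grad[i][j]
    return W


if __name__ == "__main__":
    # Noise-free toy data (hypothetical): target = 2*a - b.
    points = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
    data = [([a, b], [2.0 * a - b]) for a, b in points]
    print(train(data, n_in=2, n_out=1))  # should approach [[2.0], [-1.0]]
```

In a multi-layer network, the second gradient returned by backward (with respect to the layer inputs) would be passed to the preceding layer as its output gradient; the single-layer sketch discards it. Raising lr far above the default here makes the updates overshoot, while lowering it slows convergence, which is the trade-off described in footnote 4.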

CHAPTER 2

Overview of Deep Neural Networks

Deep Neural Networks (DNNs) come in a wide variety of shapes and sizes depending on the application.¹ The popular shapes and sizes are also evolving rapidly to improve accuracy and efficiency. In all cases, the input to a DNN is a set of values representing the information to be analyzed by the network. For instance, these values can be pixels of an image, sampled amplitudes of an audio wave, or the numerical representation of the state of some system or game.

In this chapter, we will describe the key building blocks for DNNs. As there are many different types of DNNs [50], we will focus our attention on those that are most widely used. We will begin by describing the salient characteristics of commonly used DNN layers in Sections 2.1 and 2.2. We will then describe popular DNN layers and how these layers can be combined to form various types of DNNs in Section 2.3. Section 2.4 will provide a detailed discussion on convolutional neural networks (CNNs), since they are widely used and tend to provide many opportunities for efficient DNN processing. It will also highlight various popular CNN models that are often used as workloads for evaluating DNN hardware accelerators. Next, in Section 2.5, we will briefly discuss other types of DNNs and describe how they are similar to and differ from CNNs from a workload processing perspective (e.g., data dependencies, types of compute operations, etc.). Finally, in Section 2.6, we will discuss the various DNN development resources (e.g., frameworks and datasets),
