partial sums after they have gone through a nonlinear function (i.e., the output activations).
5 In some literature, K is used rather than M to denote the number of 3-D filters (also referred to as kernels), which determines the number of output feature map channels. We opted not to use K to avoid confusion with yet other communities that use it to refer to the number of dimensions. We have also adopted the convention of using P and Q as the dimensions of the output to align with other publications and because our prior use of E and F caused an alias with the use of “F” to represent filter weights. Note that some literature also uses X and Y to denote the spatial dimensions of the input rather than W and H.
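To make this naming convention concrete, the following sketch (not from the text; the function and variable names are illustrative) shows a naive loop-nest implementation of a CONV layer in Python/NumPy using the adopted dimension names: N input feature maps, C input channels, H and W input spatial dimensions, M filters of size R by S, stride U, and P and Q output spatial dimensions.

```python
import numpy as np

def conv_layer(I, F, B, U=1):
    """Naive CONV layer using the book's dimension naming convention.

    I: input fmaps,  shape (N, C, H, W)
    F: filters,      shape (M, C, R, S)
    B: biases,       shape (M,)
    U: stride
    Returns output fmaps O of shape (N, M, P, Q).
    """
    N, C, H, W = I.shape
    M, _, R, S = F.shape
    P = (H - R) // U + 1                # output height
    Q = (W - S) // U + 1                # output width
    O = np.zeros((N, M, P, Q), dtype=I.dtype)
    for n in range(N):                  # input fmaps (batch)
        for m in range(M):              # 3-D filters / output channels
            for p in range(P):          # output rows
                for q in range(Q):      # output columns
                    acc = B[m]
                    for c in range(C):          # input channels
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter columns
                                acc += I[n, c, U*p + r, U*q + s] * F[m, c, r, s]
                    O[n, m, p, q] = acc
    return O
```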
6 Note that many of the values in the CONV layer tensors are zero, making the tensors sparse. The origins of this sparsity, and approaches for performing the resulting sparse tensor algebra, are presented in Chapter 8.
7 Note that Albert Einstein popularized a similar notation for tensor algebra which omits any explicit specification of the summation variable.
8 In addition to being simple to implement, ReLU also increases the sparsity of the output activations, which can be exploited by a DNN accelerator to increase throughput, reduce energy consumption, and reduce storage cost, as described in Section 8.1.1.
9 In the literature, this is often referred to as dense prediction.
10 There are two versions of unpooling: (1) zero insertion is applied in a regular pattern, as shown in Figure 2.6a [60]—this is most commonly used; and (2) unpooling is paired with a max pooling layer, where the location of the max value during pooling is stored, and during unpooling the location of the non-zero value is placed in the location of the max value before pooling [61].
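A minimal sketch of the two unpooling variants described above is given below (the function names are illustrative, and single-channel 2-D feature maps with a pooling window equal to the stride are assumed): version (1) inserts zeros in a regular pattern, while version (2) records the location of each max value during pooling and restores the pooled value to that location during unpooling.

```python
import numpy as np

def unpool_zero_insert(x, factor=2):
    """Version (1): zero insertion in a regular pattern (upsample by 'factor')."""
    H, W = x.shape
    out = np.zeros((H * factor, W * factor), dtype=x.dtype)
    out[::factor, ::factor] = x          # each input value lands in the top-left
    return out                           # of its factor-by-factor block

def max_pool_with_indices(x, size=2):
    """Max pooling that also records the location of each max value."""
    H, W = x.shape                       # assumes H and W are multiples of 'size'
    pooled = np.zeros((H // size, W // size), dtype=x.dtype)
    idx = np.zeros((H // size, W // size, 2), dtype=int)
    for i in range(0, H, size):
        for j in range(0, W, size):
            block = x[i:i+size, j:j+size]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i // size, j // size] = block[r, c]
            idx[i // size, j // size] = (i + r, j + c)
    return pooled, idx

def unpool_with_indices(pooled, idx, out_shape):
    """Version (2): place each value back at its recorded max location."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = idx[i, j]
            out[r, c] = pooled[i, j]
    return out
```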
11 It has been recently reported that the reason batch normalization enables faster and more stable training is due to the fact that it makes the optimization landscape smoother, resulting in more predictive and stable behavior of the gradient [67]; this is in contrast to the popular belief that batch normalization stabilizes the distribution of the input across layers. Nonetheless, batch normalization continues to be widely used for training and thus needs to be supported during inference.
12 During training, parameters σ and μ are computed per batch, and γ and β are updated per batch based on the gradient; therefore, training for different batch sizes will result in different σ and μ parameters, which can impact accuracy. Note that each channel has its own set of σ, μ, γ, and β parameters. During inference, all parameters are fixed, where σ and μ are computed from the entire training set. To avoid performing an extra pass over the entire training set to compute σ and μ, σ and μ are usually implemented as the running average of the per-batch σ and μ computed during training.
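Because all four parameters are fixed at inference time, batch normalization reduces to a per-channel affine transform, as the following sketch illustrates (the function name is illustrative; activations in (N, C, H, W) layout are assumed):

```python
import numpy as np

def batch_norm_inference(x, mu, sigma2, gamma, beta, eps=1e-5):
    """Inference-time batch normalization with fixed per-channel parameters.

    x:      activations, shape (N, C, H, W)
    mu:     per-channel mean, shape (C,)      (running average from training)
    sigma2: per-channel variance, shape (C,)  (running average from training)
    gamma:  per-channel scale, shape (C,)     (learned)
    beta:   per-channel shift, shape (C,)     (learned)
    """
    # With all parameters fixed, the normalization collapses into a
    # per-channel scale and bias that can be precomputed (or even folded
    # into the weights of the preceding CONV or FC layer).
    scale = gamma / np.sqrt(sigma2 + eps)    # shape (C,)
    bias = beta - mu * scale                 # shape (C,)
    return x * scale[None, :, None, None] + bias[None, :, None, None]
```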
13 Note that variants of the up CONV layer with different types of upsampling include the deconvolution layer, sub-pixel or fractional convolutional layer, transposed convolutional layer, and backward convolution layer [69].
14 This grouped convolution approach is applied more aggressively when performing co-design of algorithms and hardware to reduce complexity, which will be discussed in Chapter 9.
15 v2 is very similar to v3.
16 Note that in some parts of the book we use Top-1 and Top-5 error. The error can be computed as 100% minus accuracy.
17 This was demonstrated on Google’s internal JFT-300M dataset with 300M images and 18,291 classes, which is two orders of magnitude larger than ImageNet. However, performing four iterations across the entire training set using 50 K-80 GPUs required two months of training, which further emphasizes that compute is one of the main bottlenecks in the advancement of DNN research.
PART II
Design of Hardware for Processing DNNs
CHAPTER 3
Key Metrics and Design Objectives
Over the past few years, there has been a significant amount of research on efficient processing of DNNs. Accordingly, it is important to discuss the key metrics that one should consider when comparing and evaluating the strengths and weaknesses of different designs and proposed techniques and that should be incorporated into design considerations. While efficiency is often only associated with the number of operations per second per Watt (e.g., floating-point operations per second per Watt as FLOPS/W or tera-operations per second per Watt as TOPS/W), it is actually composed of many more metrics including accuracy, throughput, latency, energy consumption, power consumption, cost, flexibility, and scalability. Reporting a comprehensive set of these metrics is important in order to provide a complete picture of the trade-offs made by a proposed design or technique.
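As a small illustration of the operations-per-second-per-Watt metric mentioned above, the sketch below (illustrative function and parameter names; it assumes the common convention of counting one MAC as two operations) computes TOPS/W from a measured throughput and power:

```python
def tops_per_watt(macs_per_inference, inferences_per_second, power_watts):
    """Operations per second per Watt, reported in TOPS/W.

    macs_per_inference:    number of MACs in the DNN model
    inferences_per_second: measured throughput on the hardware
    power_watts:           measured power while running the workload
    """
    ops_per_second = 2 * macs_per_inference * inferences_per_second  # 1 MAC = 2 ops
    return ops_per_second / power_watts / 1e12                       # tera-ops/s per Watt
```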
In this chapter, we will
• discuss the importance of each of these metrics;
• break down the factors that affect each metric and, when feasible, present equations that describe the relationship between the factors and the metrics;
• describe how these metrics can be incorporated into design considerations for both the DNN hardware and the DNN model (i.e., workload); and
• specify what should be reported for a given metric to enable proper evaluation.
Finally, we will provide a case study on how one might bring all these metrics together for a holistic evaluation of a given approach. But first, we will discuss each of the metrics.
3.1 ACCURACY
Accuracy is used to indicate the quality of the result for a given task. The fact that DNNs can achieve state-of-the-art accuracy on a wide range of tasks is one of the key reasons driving the popularity and wide use of DNNs today. The units used to measure accuracy depend on the task. For instance, for image classification, accuracy is reported as the percentage of correctly classified images, while for object detection, accuracy is reported as the mean average precision (mAP), which is related to the trade-off between the true positive rate and false positive rate.
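For the image classification case, a minimal sketch of how Top-k accuracy (and hence Top-k error, i.e., 100% minus accuracy) could be computed is shown below; the function name and arrays are illustrative:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: shape (num_samples, num_classes), classifier outputs
    labels: shape (num_samples,), ground-truth class indices
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]          # indices of the k largest scores
    correct = np.any(top_k == labels[:, None], axis=1)  # true label among them?
    return 100.0 * correct.mean()

# Example usage:
#   top1_error = 100.0 - top_k_accuracy(scores, labels, k=1)
#   top5_error = 100.0 - top_k_accuracy(scores, labels, k=5)
```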
Factors that affect accuracy include the difficulty of the task and dataset.1 For instance, classification on ImageNet is much more difficult than on MNIST, and object detection or semantic segmentation is more difficult than classification. As a result, a DNN model that performs well on MNIST may not necessarily perform well on ImageNet.
Achieving high accuracy on difficult tasks or datasets typically requires more complex DNN models (e.g., a larger number