Figure 3.1: The roofline model. The peak operations per second is indicated by the bold line; when the operational intensity, which is the amount of compute per byte of data, is low, the operations per second is limited by the rate of data delivery. The design goal is to operate as close as possible to the peak operations per second for the operational intensity of a given workload.

      There is also an interplay between the number of PEs and the utilization of PEs. For instance, one way to reduce the likelihood that a PE needs to wait for data is to store some data locally near or within the PE. However, this requires increasing the chip area allocated to on-chip storage, which, given a fixed chip area, would reduce the number of PEs. Therefore, a key design consideration is how much area to allocate to compute (which increases the number of PEs) versus on-chip storage (which increases the utilization of PEs).
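      To make this trade-off concrete, the following toy model sweeps how much of a fixed chip area is given to on-chip storage versus PEs. All of the constants (area per PE, area per KB of storage, and the utilization curve) are hypothetical placeholders rather than numbers from any real accelerator.

```python
# Toy model of the compute-versus-storage area trade-off.
# All constants (areas, utilization curve) are hypothetical illustrations,
# not numbers from any real accelerator.

TOTAL_AREA = 100.0   # arbitrary area units
PE_AREA = 1.0        # area per PE
BUFFER_AREA = 0.5    # area per KB of on-chip storage

def throughput(storage_kb: float) -> float:
    """Relative ops/cycle: number of PEs times their utilization.

    Utilization is modeled as rising with local storage (PEs are starved
    for data less often) and saturating at 1.0 -- a stand-in for a real
    utilization model.
    """
    num_pes = max((TOTAL_AREA - BUFFER_AREA * storage_kb) / PE_AREA, 0)
    utilization = min(1.0, 0.2 + storage_kb / 100.0)
    return num_pes * utilization

best = max(range(0, 201), key=throughput)
print(f"best storage allocation: {best} KB -> relative throughput {throughput(best):.1f}")
```

      In this toy model the optimum is interior: spending some area on storage raises utilization enough to pay for the PEs it displaces, but past that point adding storage only removes compute.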

      The impact of these factors can be captured using Eyexam, which is a systematic way of understanding the performance limits for DNN processors as a function of specific characteristics of the DNN model and accelerator design. Eyexam includes and extends the well-known roofline model [119]. The roofline model, as illustrated in Figure 3.1, relates average bandwidth demand and peak computational ability to performance. Eyexam is described in Chapter 6.
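      As a minimal sketch of the roofline model itself, attainable performance is simply the minimum of two bounds: the peak compute rate and the bandwidth-limited rate (bandwidth times operational intensity). The peak compute and bandwidth values below are hypothetical.

```python
# Minimal sketch of the roofline model: attainable performance is the
# minimum of the compute roof and the bandwidth-limited slope.
# peak_ops_per_sec and peak_bytes_per_sec are hypothetical values.

def roofline(intensity_ops_per_byte: float,
             peak_ops_per_sec: float = 2e12,
             peak_bytes_per_sec: float = 100e9) -> float:
    """Attainable ops/second at a given operational intensity (ops/byte)."""
    return min(peak_ops_per_sec, peak_bytes_per_sec * intensity_ops_per_byte)

for intensity in [1, 10, 20, 100]:
    print(f"{intensity:>4} ops/byte -> {roofline(intensity):.2e} ops/sec")
```

      With these numbers the ridge point sits at 20 ops/byte: workloads below it are bandwidth bound, workloads above it are compute bound.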

      While the number of operations per inference in Equation (3.1) depends on the DNN model, the operations per second depends on both the DNN model and the hardware. For example, designing DNN models with efficient layer shapes (also referred to as efficient network architectures), as described in Chapter 9, can reduce the number of MAC operations in the DNN model and consequently the number of operations per inference. However, such DNN models can result in a wide range of layer shapes, some of which may have poor utilization of PEs and therefore reduce the overall operations per second, as shown in Equation (3.2).
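      For instance, consider a hypothetical accelerator that spreads a layer's output channels across a fixed array of 64 PEs (the array size and the mapping are illustrative assumptions): a layer whose channel count is not a multiple of the array size leaves PEs idle on the last pass.

```python
import math

# Sketch: spatial utilization of a PE array for different layer shapes,
# assuming a simple (hypothetical) mapping that spreads output channels
# across a fixed array of 64 PEs.

NUM_PES = 64

def utilization(output_channels: int) -> float:
    """Fraction of PEs doing useful work, averaged over the passes
    needed to cover all output channels."""
    passes = math.ceil(output_channels / NUM_PES)
    return output_channels / (passes * NUM_PES)

for m in [64, 128, 96, 65]:
    print(f"{m:>3} output channels -> {utilization(m):.1%} utilization")
```

      A layer with 65 output channels needs two passes but keeps barely half the array busy, even though it has only one more channel than a perfectly mapped 64-channel layer.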

      A deeper consideration of the operations per second is that all operations are not created equal, and therefore cycles per operation may not be a constant. For example, if we consider the fact that anything multiplied by zero is zero, some MAC operations are ineffectual (i.e., they do not change the accumulated value). The number of ineffectual operations is a function of both the DNN model and the input data. These ineffectual MAC operations can require fewer cycles or no cycles at all. Conversely, we only need to process effectual (or non-zero) MAC operations, where both inputs are non-zero; this is referred to as exploiting sparsity, which is discussed in Chapter 8.
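      The following sketch simply counts which MACs in a toy dot product are effectual under this definition; the weight and activation vectors are made-up examples.

```python
# Sketch: counting effectual MACs (both operands non-zero) in a toy dot
# product; a sparsity-exploiting PE could skip the ineffectual ones.

weights     = [0.5, 0.0, -1.2, 0.0, 0.3]
activations = [0.0, 2.0,  1.0, 0.0, 4.0]

total = len(weights)
effectual = sum(1 for w, a in zip(weights, activations) if w != 0 and a != 0)
print(f"{effectual}/{total} MACs are effectual; {total - effectual} could be skipped")
```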

      Equation (3.4) shows how operations per cycle can be decomposed into:

      1. the number of effectual operations plus unexploited ineffectual operations per cycle, which remains somewhat constant for a given hardware accelerator design;

      2. the ratio of effectual operations over effectual operations plus unexploited ineffectual operations, which refers to the ability of the hardware to exploit ineffectual operations (ideally unexploited ineffectual operations should be zero, and this ratio should be one); and

      3. the number of effectual operations out of (total) operations, which is related to the amount of sparsity and depends on the DNN model.

      As the amount of sparsity increases (i.e., the number of effectual operations out of (total) operations decreases), the operations per cycle increases, which subsequently increases operations per second, as shown in Equation (3.2):

$$
\frac{\text{operations}}{\text{cycle}} =
\frac{\text{effectual operations} + \text{unexploited ineffectual operations}}{\text{cycle}}
\times \frac{\text{effectual operations}}{\text{effectual operations} + \text{unexploited ineffectual operations}}
\times \frac{1}{\left(\dfrac{\text{effectual operations}}{\text{operations}}\right)}
\tag{3.4}
$$
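      A quick numeric check of this decomposition, using made-up counts, confirms that the three factors multiply out to total operations per cycle:

```python
# Numeric check of the Equation (3.4) decomposition, using made-up counts.

effectual = 800      # effectual operations
unexploited = 100    # ineffectual operations the hardware failed to skip
total_ops = 2000     # total operations (effectual + all ineffectual)
cycles = 500

ops_per_cycle = ((effectual + unexploited) / cycles) \
    * (effectual / (effectual + unexploited)) \
    * (1 / (effectual / total_ops))

# The three factors multiply out to total operations per cycle.
assert abs(ops_per_cycle - total_ops / cycles) < 1e-9
print(f"operations per cycle = {ops_per_cycle}")  # 4.0
```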

      However, exploiting sparsity requires additional hardware to identify when inputs are zero in order to avoid performing unnecessary MAC operations. The additional hardware can increase the critical path, which decreases cycles per second, and can also reduce the area density of the PEs, which reduces the number of PEs for a given area. Both of these factors can reduce the operations per second, as shown in Equation (3.2). Therefore, the complexity of the additional hardware can result in a trade-off between reducing the number of unexploited ineffectual operations and increasing the critical path or reducing the number of PEs.
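      Plugging illustrative numbers into the factors of Equation (3.2) shows how the net effect can go either way; every value below is hypothetical.

```python
# Sketch of the Equation (3.2) trade-off: sparsity support raises
# operations/cycle but may lower clock frequency and PE count.
# Every number here is hypothetical.

def ops_per_sec(ops_per_cycle_per_pe: float,
                cycles_per_sec: float,
                num_pes: int) -> float:
    return ops_per_cycle_per_pe * cycles_per_sec * num_pes

baseline = ops_per_sec(1.0, 1.0e9, 100)   # no sparsity support
with_sparsity = ops_per_sec(1.6,          # skips enough zeros for 1.6x ops/cycle
                            0.9e9,        # longer critical path: slower clock
                            90)           # zero-detection logic shrinks the array

print(f"baseline:      {baseline:.2e} ops/sec")
print(f"with sparsity: {with_sparsity:.2e} ops/sec ({with_sparsity / baseline:.2f}x)")
```

      With these particular numbers the gain in operations per cycle outweighs the slower clock and smaller array, but more costly zero-skipping logic could tip the balance the other way.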

      Finally, designing hardware and DNN models that support reduced precision (i.e., fewer bits per operand and per operation), which is discussed in Chapter 7, can also increase the number of operations per second. Fewer bits per operand means that the memory bandwidth required to support a given operation is reduced, which can increase the utilization of PEs since they are less likely to be starved for data. In addition, the area of each PE can be reduced, which can increase the number of PEs for a given area. Both of these factors can increase the operations per second, as shown in Equation (3.2). Note, however, that if multiple levels of precision need to be supported, additional hardware is required, which can, once again, increase the critical path and also reduce the area density of the PEs, both of which can reduce the operations per second, as shown in Equation (3.2).
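      The bandwidth side of this argument is simple arithmetic: for a fixed (hypothetical) memory bandwidth, halving the operand width doubles the number of operands delivered per second, which helps keep PEs fed.

```python
# Sketch: operands delivered per second at a fixed (hypothetical) memory
# bandwidth, as a function of operand bit width.

BANDWIDTH_BYTES_PER_SEC = 100e9

for bits in [32, 16, 8]:
    operands_per_sec = BANDWIDTH_BYTES_PER_SEC / (bits / 8)
    print(f"{bits:>2}-bit operands -> {operands_per_sec:.1e} operands/sec delivered")
```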

      In this section, we discussed multiple factors that affect the number of inferences per second. Table 3.1 classifies whether each factor is dictated by the hardware, by the DNN model, or by both.

      In summary, the number of MAC operations in the DNN model alone is not sufficient for evaluating the throughput and latency. While the DNN model can affect the number of MAC operations per inference based on the network architecture (i.e., layer shapes) and the sparsity of the weights and activations, the overall impact that the DNN model has on throughput and latency depends on the ability of the hardware to detect and exploit these properties without significantly reducing the utilization of PEs, the number of PEs, or cycles per second. This is why the number of MAC operations is not necessarily a good proxy for throughput and latency (e.g., Figure 3.2), and it is often more effective to design efficient DNN models with hardware in the loop. Techniques for designing DNN models with hardware in the loop are discussed in Chapter 9.


      Figure 3.2: The number of MAC operations in various DNN models versus latency measured on a Pixel phone. Clearly, the number of MAC operations is not a good predictor of latency.

