which researchers and practitioners have made available to help enable the rapid progress in DNN model and hardware research and development.
2.1 ATTRIBUTES OF CONNECTIONS WITHIN A LAYER
As discussed in Chapter 1, DNNs are composed of several processing layers, where in most layers the main computation is a weighted sum. There are several different types of layers, which primarily differ in terms of how the inputs and outputs are connected within the layers. There are two main attributes of the connections within a layer:
1. The connection pattern between the input and output activations, as shown in Figure 2.1a: if a layer has the attribute that every input activation is connected to every output activation, then we call that layer fully connected. On the other hand, if only a subset of the input activations is connected to the output activations, then we call that layer sparsely connected. Note that the weights associated with these connections can be zero or non-zero; if a weight happens to be zero (e.g., as a result of training), it does not mean there is no connection (i.e., the connection still exists).
Figure 2.1: Properties of connections in DNNs (Figure adapted from [4]).
For sparsely connected layers, a sub-attribute is related to the structure of the connections. Input activations may connect to any output activation (i.e., global), or they may only connect to output activations in their neighborhood (i.e., local). The consequence of such local connections is that each output activation is a function of a restricted window of input activations, which is referred to as the receptive field.
2. The value of the weight associated with each connection: the most general case is that the weight can take on any value (e.g., each weight can have a unique value). A more restricted case is that the same value is shared by multiple weights, which is referred to as weight sharing.
Combinations of these attributes result in many of the common layer types. Any layer with the fully connected attribute is called a fully connected layer (FC layer). To distinguish the layer type from the attribute, in this chapter we will use the term FC layer for the former and fully connected for the latter; however, in subsequent chapters we will follow the common practice of using the terms interchangeably. Another widely used layer type is the convolutional (CONV) layer, which is sparsely and locally connected with weight sharing. The computation in both FC and CONV layers is a weighted sum. However, other computations can also be performed, and these result in other types of layers. We will discuss FC, CONV, and these other layers in more detail in Section 2.3.
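As a concrete illustration of this weighted sum, the following minimal NumPy sketch (the function and variable names are ours, chosen for illustration) computes an FC layer output as a matrix-vector product in which every input activation contributes to every output activation:

```python
import numpy as np

def fc_layer(x, W, b):
    """Fully connected layer: every input activation contributes,
    via its own weight, to every output activation."""
    # x: input activations, shape (C_in,)
    # W: weights, shape (C_out, C_in); b: biases, shape (C_out,)
    return W @ x + b

# Example: 4 input activations, 3 output activations
x = np.random.randn(4)
W = np.random.randn(3, 4)
b = np.zeros(3)
y = fc_layer(x, W, b)  # shape (3,)
```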
2.2 ATTRIBUTES OF CONNECTIONS BETWEEN LAYERS
Another attribute is how the output of one layer is connected to the input of another layer, as shown in Figure 2.1b. The output can be connected to the input of the next layer, in which case the connection is referred to as feed forward. With feed-forward connections, all of the computation is performed as a sequence of operations on the outputs of a previous layer. The network has no memory, and the output for a given input is always the same irrespective of the sequence of inputs previously given to the network. DNNs that contain only feed-forward connections are referred to as feed-forward networks. Examples of such networks include multi-layer perceptrons (MLPs), which are DNNs composed entirely of feed-forward FC layers, and convolutional neural networks (CNNs), which are DNNs that contain both FC and CONV layers. CNNs, which are commonly used for image processing and computer vision, will be discussed in more detail in Section 2.4.
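A feed-forward network can thus be sketched as a chain of layers in which each layer consumes only the output of the previous one; the ReLU nonlinearity below is just one illustrative choice, and all names are ours:

```python
import numpy as np

def mlp_forward(x, layers):
    """Feed-forward pass of an MLP: each layer consumes only the output
    of the previous layer, so the network keeps no state between inputs."""
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)  # weighted sum followed by a ReLU nonlinearity
    return x

# Two FC layers: 4 -> 8 -> 3 activations
layers = [(np.random.randn(8, 4), np.zeros(8)),
          (np.random.randn(3, 8), np.zeros(3))]
y = mlp_forward(np.random.randn(4), layers)  # the same input always yields the same output
```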
Alternatively, the output can be fed back to the input of its own layer, in which case the connection is often referred to as recurrent. With recurrent connections, the output of a layer is a function of both the current and prior input(s) to the layer. This creates a form of memory in the DNN, which allows long-term dependencies to affect the output. DNNs that contain these connections are referred to as recurrent neural networks (RNNs), which are commonly used to process sequential data (e.g., speech, text), and will be discussed in more detail in Section 2.5.
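In contrast to the feed-forward sketch above, a recurrent connection can be sketched as a step function that takes both the current input and the previous output (the hidden state); the specific weight matrices and tanh nonlinearity below are illustrative assumptions rather than a particular RNN from the text:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: the output depends on the current input x_t
    AND the previous state h_prev, giving the layer a form of memory."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sequence; the hidden state carries information across time steps
W_x, W_h, b = np.random.randn(8, 4), np.random.randn(8, 8), np.zeros(8)
h = np.zeros(8)
for x_t in np.random.randn(5, 4):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)
```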
2.3 POPULAR TYPES OF LAYERS IN DNNs
In this section, we will discuss the various popular layers used to form DNNs. We will begin by describing the CONV and FC layers, whose main computation is a weighted sum, since that tends to dominate the computation cost in terms of both energy consumption and throughput. We will then discuss various layers that can optionally be included in a DNN and that do not use weighted sums, such as nonlinearity, pooling, and normalization layers.
These layers can be viewed as primitive layers, which can be combined to form compound layers. Compound layers are often given names as a convenience when the same combination of primitive layers is frequently used together. In practice, people often refer to either primitive or compound layers as just layers.
2.3.1 CONV LAYER (CONVOLUTIONAL)
CONV layers are primarily composed of high-dimensional convolutions, as shown in Figure 2.2. In this computation, the input activations of a layer are structured as a 3-D input feature map (ifmap), where the dimensions are the height (H), width (W), and number of input channels (C). The weights of a layer are structured as a 3-D filter, where the dimensions are the height (R), width (S), and number of input channels (C). Notice that the number of channels for the input feature map and the filter are the same. For each input channel, the input feature map undergoes a 2-D convolution (see Figure 2.2a) with the corresponding channel in the filter. The results of the convolution at each point are summed across all the input channels to generate the output partial sums. In addition, a 1-D (scalar) bias can be added to the filtering results, but some recent networks [24] omit it in parts of their layers. The results of this computation are the output partial sums that comprise one channel of the output feature map (ofmap). Additional 3-D filters can be used on the same input feature map to create additional output channels (i.e., applying M filters to the input feature map generates M output channels in the output feature map). Finally, multiple input feature maps (N) may be processed together as a batch to potentially improve reuse of the filter weights.
Figure 2.2: Dimensionality of convolutions. (a) Shows the traditional 2-D convolution used in image processing. (b) Shows the high dimensional convolution used in CNNs, which applies a 2-D convolution on each channel.
Table 2.1: Shape parameters of a CONV/FC layer
Shape Parameter | Description |
N | Batch size of 3-D fmaps |
M | Number of 3-D filters / number of channels of ofmap (output channels) |
C | Number of channels of filter / ifmap (input channels) |
H/W | Spatial height/width of ifmap |
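Putting these shape parameters together, the following naive loop-nest sketch computes the CONV layer computation described above. It assumes unit stride and no padding, so the ofmap spatial dimensions (called P and Q here for illustration) are H-R+1 and W-S+1; all names are ours:

```python
import numpy as np

def conv_layer(ifmaps, filters, biases):
    """Naive CONV layer loop nest (unit stride, no padding assumed).
    ifmaps:  (N, C, H, W)   -- batch of N 3-D input feature maps
    filters: (M, C, R, S)   -- M 3-D filters with the same channel count C as the ifmap
    biases:  (M,)           -- one bias per output channel
    returns: (N, M, H-R+1, W-S+1) output feature maps
    """
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    P, Q = H - R + 1, W - S + 1          # ofmap height/width
    ofmaps = np.zeros((N, M, P, Q))
    for n in range(N):                   # batch
        for m in range(M):               # output channels (one per filter)
            for p in range(P):           # ofmap rows
                for q in range(Q):       # ofmap columns
                    # weighted sum over all C input channels and the RxS window
                    window = ifmaps[n, :, p:p+R, q:q+S]
                    ofmaps[n, m, p, q] = np.sum(window * filters[m]) + biases[m]
    return ofmaps

# Example: N=2, C=3, H=W=8 ifmaps, M=4 filters of size R=S=3
y = conv_layer(np.random.randn(2, 3, 8, 8), np.random.randn(4, 3, 3, 3), np.zeros(4))
print(y.shape)  # (2, 4, 6, 6)
```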