Efficient Processing of Deep Neural Networks. Vivienne Sze

Efficient Processing of Deep Neural Networks

that even for the same DNN (e.g., AlexNet) the accuracy of these models can vary by around 1 to 2% depending on how the model was trained and tested, and thus the results do not always exactly match the original publication.

These pre-trained models often are tied to a given framework. In order to facilitate easier exchange between different networks, Open Neural Network Exchange (ONNX) has been established as an open ecosystem for interchangeable DNN models [102]; the current participants include Amazon, Facebook, and Microsoft.

2.6.3 POPULAR DATASETS FOR CLASSIFICATION

It is important to factor in the difficulty of the task when comparing different DNN models. For instance, the task of classifying handwritten digits from the MNIST dataset [103] is much simpler than classifying an object into one of 1000 classes as is required for the ImageNet dataset [23] (Figure 2.15). It is expected that the size of the DNNs (i.e., number of weights) and the number of MACs will be larger for the more difficult task than the simpler task and thus require more energy and have lower throughput. For instance, LeNet-5[71] is designed for digit classification, while AlexNet[7], VGG-16[73], GoogLeNet[74], and ResNet[24] are designed for the 1000-class image classification.

There are many AI tasks that come with publicly available datasets in order to evaluate the accuracy of a given DNN. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task in computer vision is image classification, which involves being given an entire image, and selecting 1 of N classes that the image most likely belongs to. There is no localization or detection.

MNIST is a widely used dataset for digit classification that was introduced in 1998 [103]. It consists of 28×28 pixel grayscale images of handwritten digits. There are 10 classes (for 10 digits) and 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an accuracy of 99.05% when MNIST was first introduced. Since then the accuracy has increased to 99.79% using regularization of neural networks with dropconnect [104]. Thus, MNIST is now considered a fairly easy dataset.

CIFAR is a dataset that consists of 32×32 pixel colored images of various objects, which was released in 2009 [105]. CIFAR is a subset of the 80 million Tiny Image dataset [106]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [107]. Since then the accuracy has increased to 96.53% using fractional max pooling [108].

ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [23]. It contains images of 256×256 pixel in color with 1000 classes. The classes are defined using the WordNet as a backbone to handle ambiguous word meanings and to combine together synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing images (100 per class) and 50,000 validation images (50 per class).

The accuracy for the image classification task in the ImageNet Challenge are reported using two metrics: Top-5 and Top-1 accuracy.¹⁶ Top-5 accuracy means that if any of the top five scoring categories are the correct category, it is counted as a correct classification. Top-1 accuracy requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge (AlexNet) was able to achieve an accuracy of 83.6% for the Top-5 (which is substantially better than the 73.8% which was second place that year that did not use DNNs); it achieved 61.9% on the Top-1 of the validation set. In 2019, the state-of-the-art DNNs achieve accuracy above 97% for the Top-5 and above 84% for the Top-1 [87].

In summary of the various image classification datasets, it is clear that MNIST is a fairly easy dataset, while ImageNet is a more challenging one with a wider coverage of classes. Thus, in terms of evaluating the accuracy of a given DNN, it is important to consider that dataset upon which the accuracy is measured.

2.6.4 DATASETS FOR OTHER TASKS

Since the accuracy of the state-of-the-art DNNs are performing better than human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding box for all objects in these categories must be labeled. Objects that are not labeled are penalized as well as duplicated detections.

Beyond ImageNet, there are also other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset that contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [109]. For object detection, segmentation, and recognition in context, there is the M.S. COCO dataset with 2.5M labeled instances in 328k images (91 object categories) [110]; compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image to potentially help with contextual information.

Most recently, even larger scale datasets have been made available. For instance, Google has an Open Images dataset with over 9M images [111], spanning 6000 categories. There is also a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [112]. Google also released an audio dataset comprised of 632 audio event classes and a collection of 2M human-labeled 10-second sound clips [113]. These large datasets will be evermore important as DNNs become deeper with more weights to train. In addition, it has been shown that accuracy increases logarithmically based on the amount of training data [16].¹⁷

Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.

2.6.5 SUMMARY

The development resources presented in this section enable us to evaluate hardware using the appropriate DNN model and dataset. In particular, it’s important to realize that difficult tasks typically require larger models; for instance, LeNet would not apply to the ImageNet Challenge. In addition, different datasets are required for different tasks; for instance, self-driving cars require high-definition video, and thus a network trained on the low resolution ImageNet dataset may not be sufficient. To address these requirements, the number of datasets continues to grow at a rapid pace.

¹ The DNN research community often refers to the shape and size of a DNN as its “network architecture.” However, to avoid confusion with the use of the word “architecture” by the hardware community, we will talk about “DNN models” and their shape and size in this book.

² CONV layers use a specific type of weight sharing, which will be described in Section 2.4.

³ Connections can come from the immediately preceding layer or an earlier layer. Furthermore, connections from a layer can go to multiple later layers.

⁴ For simplicity, in this chapter, we will refer to an array of partial sums as an output feature map. However, technically, the output feature map

Скачать книгу