Artificial Intelligence Hardware Design. Albert Chun-Chen Liu
Chapter 6
Table 6.1 Neurocube performance comparison.
Chapter 8
Table 8.1 SeerNet system performance comparison.
List of Illustrations
1 Chapter 1
Figure 1.1 High-tech revolution.
Figure 1.2 Neural network development timeline.
Figure 1.3 ImageNet challenge.
Figure 1.4 Neural network model.
Figure 1.5 Regression.
Figure 1.6 Clustering.
Figure 1.7 Neural network top 1 accuracy vs. computational complexity.
Figure 1.8 Neural network top 1 accuracy density vs. model efficiency [14].
Figure 1.9 Neural network memory utilization and computational complexity [1...
2 Chapter 2
Figure 2.1 Deep neural network AlexNet architecture [1].
Figure 2.2 Deep neural network AlexNet model parameters.
Figure 2.3 Deep neural network AlexNet feature map evolution [3].
Figure 2.4 Convolution function.
Figure 2.5 Nonlinear activation functions.
Figure 2.6 Pooling functions.
Figure 2.7 Dropout layer.
Figure 2.8 Deep learning hardware issues [1].
3 Chapter 3
Figure 3.1 Intel Xeon processor E5 2600 family Grantley platform ring architecture.
Figure 3.2 Intel Xeon processor scalable family Purley platform mesh architecture.
Figure 3.3 Two-socket configuration.
Figure 3.4 Four-socket ring configuration.
Figure 3.5 Four-socket crossbar configuration.
Figure 3.6 Eight-socket configuration.
Figure 3.7 Sub-NUMA cluster domains [3].
Figure 3.8 Cache hierarchy comparison.
Figure 3.9 Intel multiple sockets parallel processing.
Figure 3.10 Intel multiple socket training performance comparison [4].
Figure 3.11 Intel AVX-512 16 bits FMA operations (VPMADDWD + VPADDD).
Figure 3.12 Intel AVX-512 with VNNI 16 bits FMA operation (VPDPWSSD).
Figure 3.13 Intel low-precision convolution.
Figure 3.14 Intel Xeon processor training throughput comparison [2].
Figure 3.15 Intel Xeon processor inference throughput comparison [2].
Figure 3.16 NVIDIA Turing GPU architecture.
Figure 3.17 NVIDIA GPU shared memory.
Figure 3.18 Tensor core 4 × 4 × 4 matrix operation [9].
Figure 3.19 Turing tensor core performance [7].
Figure 3.20 Matrix D thread group indices.
Figure 3.21 Matrix D 4 × 8 elements computation.
Figure 3.22 Different size matrix multiplication.
Figure 3.23 Simultaneous multithreading (SMT).
Figure 3.24 Multithreading schedule.
Figure 3.25 GPU with HBM2 architecture.
Figure 3.26 Eight GPUs NVLink2 configuration.
Figure 3.27 Four GPUs NVLink2 configuration.
Figure 3.28 Two GPUs NVLink2 configuration.
Figure 3.29 Single GPU NVLink2 configuration.
Figure 3.30 NVDLA core architecture.
Figure 3.31 NVDLA small system model.
Figure 3.32 NVDLA large system model.
Figure 3.33 NVDLA software dataflow.
Figure 3.34 Tensor processing unit architecture.
Figure 3.35 Tensor processing unit floorplan.
Figure 3.36 Multiply-Accumulate (MAC) systolic array.
Figure 3.37 Systolic array matrix multiplication.
Figure 3.38 Cost of different numerical format operation.
Figure 3.39 TPU brain floating-point format.
Figure 3.40 CPU, GPU, and TPU performance comparison [15].
Figure 3.41 Tensor Processing Unit (TPU) v1.
Figure 3.42 Tensor Processing Unit (TPU) v2.
Figure 3.43 Tensor Processing Unit (TPU) v3.
Figure 3.44 Google TensorFlow subgraph optimization.
Figure 3.45 Microsoft Brainwave configurable cloud architecture.
Figure 3.46 Torus network topology.
Figure 3.47 Microsoft Brainwave design flow.
Figure 3.48 The Catapult fabric shell architecture.
Figure 3.49 The Catapult fabric microarchitecture.
Figure 3.50 Microsoft low-precision quantization [27].
Figure 3.51 Matrix-vector multiplier overview.
Figure 3.52 Tile engine architecture.
Figure 3.53 Hierarchical decode and dispatch scheme.
Figure 3.54 Sparse matrix-vector multiplier architecture.
Figure 3.55 (a) Sparse Matrix; (b) CSR Format; and (c) CISR Format.
4 Chapter 4
Figure 4.1 Data streaming TCS model.
Figure 4.2 Blaize depth-first scheduling approach.
Figure 4.3 Blaize graph streaming processor architecture.
Figure 4.4 Blaize GSP thread scheduling.
Figure 4.5 Blaize GSP instruction scheduling.
Figure 4.6 Streaming vs. sequential processing comparison.
Figure 4.7 Blaize GSP convolution operation.
Figure 4.8 Intelligence processing unit architecture [8].
Figure 4.9 Intelligence processing unit mixed-precision multiplication.
Figure 4.10 Intelligence processing unit single-precision multiplication.
Figure 4.11 Intelligence processing unit interconnect architecture [9].
Figure 4.12 Intelligence processing unit bulk synchronous parallel model.
Figure 4.13 Intelligence processing unit bulk synchronous parallel execution...
Figure 4.14 Intelligence processing unit bulk synchronous parallel inter-chip...
5 Chapter 5
Figure 5.1 Deep convolutional neural network hardware architecture.
Figure 5.2 Convolution computation.
Figure 5.3 Filter decomposition with zero padding.
Figure 5.4 Filter decomposition approach.
Figure 5.5 Data streaming architecture with the data flow.
Figure 5.6 DCNN accelerator COL buffer architecture.
Figure 5.7 Data streaming architecture with 1×1 convolution mode.
Figure 5.8 Max pooling architecture.
Figure 5.9 Convolution engine architecture.
Figure 5.10 Accumulation (ACCU) buffer architecture.
Figure 5.11 Neural network model compression.
Figure 5.12 Eyeriss system architecture.
Figure 5.13 2D convolution to 1D multiplication mapping.
Figure 5.14 2D convolution to 1D multiplication – step #1.
Figure 5.15 2D convolution to 1D multiplication – step #2.
Figure 5.16 2D convolution to 1D multiplication – step #3.
Figure 5.17 2D convolution to 1D multiplication – step #4.
Figure 5.18 Output stationary.
Figure 5.19 Output stationary index looping.
Figure 5.20 Weight stationary.
Figure 5.21 Weight stationary index looping.
Figure 5.22 Input stationary.
Figure 5.23 Input stationary index looping.
Figure 5.24 Eyeriss Row Stationary (RS) dataflow.
Figure 5.25 Filter reuse.
Figure 5.26 Feature map reuse.
Figure 5.27 Partial sum reuse.
Figure 5.28 Eyeriss run-length compression.
Figure 5.29 Eyeriss processing element architecture.
Figure 5.30 Eyeriss global input network.
Figure 5.31 Eyeriss processing element mapping (AlexNet CONV1).
Figure 5.32 Eyeriss processing element mapping (AlexNet CONV2).
Figure 5.33 Eyeriss processing element mapping (AlexNet CONV3).
Figure 5.34 Eyeriss processing element mapping (AlexNet CONV4/CONV5).
Figure 5.35 Eyeriss processing element operation (AlexNet CONV1).
Figure 5.36 Eyeriss processing element operation (AlexNet CONV2).
Figure 5.37 Eyeriss processing element operation (AlexNet CONV3).
Figure 5.38 Eyeriss processing element operation (AlexNet CONV4/CONV5).
Figure 5.39 Eyeriss architecture comparison.
Figure 5.40 Eyeriss v2 system architecture.
Figure 5.41 Network-on-Chip configurations.
Figure 5.42 Mesh network configuration.
Figure 5.43 Eyeriss v2 hierarchical mesh network examples.
Figure 5.44 Eyeriss v2 input activation hierarchical mesh network.
Figure 5.45 Weights hierarchical mesh network.
Figure 5.46 Eyeriss v2 partial sum hierarchical mesh network.
Figure 5.47 Eyeriss v1 neural network model performance [6].
Figure 5.48 Eyeriss v2 neural network model performance [6].
Figure 5.49 Compressed sparse column format.
Figure 5.50 Eyeriss v2 PE architecture.
Figure 5.51 Eyeriss v2 row stationary plus dataflow.
Figure 5.52 Eyeriss architecture AlexNet throughput speedup [6].
Figure 5.53 Eyeriss architecture AlexNet energy efficiency [6].
Figure 5.54 Eyeriss architecture MobileNet throughput speedup [6].
Figure 5.55 Eyeriss architecture MobileNet energy efficiency [6].
6 Chapter 6
Figure 6.1 Neurocube architecture.
Figure 6.2 Neurocube organization.
Figure 6.3 Neurocube 2D mesh network.
Figure 6.4 Memory-centric neural computing flow.
Figure 6.5 Programmable neurosequence generator architecture.
Figure 6.6 Neurocube programmable neurosequence generator.
Figure 6.7 Tetris system architecture.
Figure 6.8 Tetris neural network engine.
Figure 6.9 In-memory accumulation.
Figure 6.10 Global buffer bypass.
Figure 6.11 NN partitioning scheme comparison.
Figure 6.12 Tetris performance and power comparison [7].
Figure 6.13 NeuroStream and NeuroCluster architecture.
Figure 6.14 NeuroStream coprocessor architecture.
Figure 6.15 NeuroStream 4D tiling.
Figure 6.16 NeuroStream roofline plot [8].
7 Chapter 7
Figure 7.1 DaDianNao system architecture.
Figure 7.2 DaDianNao neural functional unit architecture.
Figure 7.3 DaDianNao pipeline configuration.
Figure 7.4 DaDianNao multi-node mapping.
Figure 7.5 DaDianNao timing performance (Training) [1].
Figure 7.6 DaDianNao timing performance (Inference) [1].
Figure 7.7 DaDianNao power reduction (Training) [1].
Figure 7.8 DaDianNao power reduction (Inference) [1].
Figure 7.9 DaDianNao basic operation.
Figure 7.10 Cnvlutin basic operation.
Figure 7.11 DaDianNao architecture.
Figure 7.12 Cnvlutin architecture.
Figure 7.13 DaDianNao processing order.
Figure 7.14 Cnvlutin processing order.
Figure 7.15 Cnvlutin zero free neuron array format.
Figure 7.16 Cnvlutin dispatch.
Figure 7.17 Cnvlutin timing comparison [4].
Figure 7.18 Cnvlutin power comparison [4].
Figure 7.19 Cnvlutin2 ineffectual activation skipping.
Figure 7.20 Cnvlutin2 ineffectual weight skipping.
8 Chapter 8
Figure 8.1 EIE leading nonzero detection network.
Figure 8.2 EIE processing element architecture.
Figure 8.3 Deep compression weight sharing and quantization.
Figure 8.4 Matrix W, vector a and b are interleaved over four processing elements.
Figure 8.5 Matrix W layout in compressed sparse column format.
Figure 8.6 EIE timing performance comparison [1].
Figure 8.7 EIE energy efficiency comparison.