2
A Qualitative Approach to Many-core Architecture
Benoît DUPONT DE DINECHIN
Kalray S.A., Grenoble, France
We present the design of the Kalray third-generation MPPA many-core processor, whose objectives are to combine the performance scalability of GPGPUs, the energy efficiency of DSP cores and the I/O capabilities of FPGA devices. These objectives are motivated by the consolidation of high-performance and high-integrity functions on a single computing platform for autonomous vehicles. High-performance computing functions, represented by deep learning inference and computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust and implements a security architecture, in order to support trusted execution environments.
The MPPA software development tools and run-time environments conform to CPU standards, particularly the availability of C/C++/OpenMP programming environments, supported by POSIX operating systems and RTOSes. As these standards target multi-core shared-memory architectures, there is an opportunity for higher-level application code generators to automate code and data distribution across the multiple compute units and local memories of a many-core processor. For the MPPA3 processor, this opportunity is realized for deep learning model inference, as well as for model-based software development using synchronous-reactive languages.
2.1. Introduction
Cyber-physical systems (CPSs) are characterized by software that interacts with the physical world, often with time-sensitive safety-critical physical sensing and actuation (Lee et al. 2017). Applications such as aircraft pilot support or automated driving systems require more than what classical CPSs provide. More specifically, application functionality increasingly relies on machine learning techniques, while cyber-security requirements have become significantly more stringent. We refer to CPSs enhanced with high-performance machine learning capabilities and strong cyber-security support as “intelligent systems”. Given the state of CMOS computing technology (Kanduri et al. 2017), providing the processing performance required by intelligent systems while meeting the size, weight and power (SWaP) constraints of embedded systems can only be achieved by parallel computing and the specialization of processing elements. For example, automated driving systems targeting L3/L4 SAE J3016 levels of automation are estimated to require over 150 TOPS of deep learning inference for vehicle perception functions, while motion planning functions would require more than 50 FP32 TFLOPS (Figure 2.5).
In order to address the challenges of high-performance embedded computing with time predictability, Kalray has been refining a many-core architecture called the MPPA (Massively Parallel Processor Array) across three generations. The first-generation MPPA processor primarily targeted accelerated computing (Dupont de Dinechin et al. 2013), but implemented the first key architectural features for time-critical computing (Dupont de Dinechin et al. 2014). Kalray further improved the second-generation MPPA processor for time predictability (Saidi et al. 2015), providing an excellent target for model-based code generation (Perret et al. 2016; Graillat et al. 2018, 2019). Accurate analysis of network-on-chip (NoC) service guarantees was achieved through a new deterministic network calculus formulation (Dupont de Dinechin and Graillat 2017). Unlike the first-generation MPPA processor, which relied on cyclostatic dataflow programming (Bodin et al. 2013, 2016), the second-generation MPPA programming environment was able to support OpenCL and OpenVX applications (Hascoët et al. 2018).
In this chapter, we present the third-generation MPPA processor, manufactured in 16FFC CMOS technology, whose many-core architecture has significantly improved upon the previous ones in the areas of performance, programmability, functional safety and cyber-security. These features are motivated by application cases in defense, avionics and automotive where the high-performance, high-integrity and cyber-security functions can be consolidated onto a single or dual processor configuration. In section 2.2, we discuss many-core architectures and their limitations with regard to intelligent system requirements. In section 2.3, we present the main features of the third-generation MPPA architecture and processor. In section 2.4, we introduce the MPPA3 application software environments.
2.2. Motivations and context
2.2.1. Many-core processors
A multi-core processor refers to a computing device that contains multiple software-programmable processing units (cores with caches). Multi-core processors deployed in desktop computers or data centers have homogeneous cores and a memory hierarchy composed of coherent caches (Figure 2.1). Conversely, a many-core processor can be characterized by the architecturally visible grouping of cores inside compute units: cache coherence may not extend beyond the compute unit, or the compute unit may provide scratch-pad memory and data movement engines. A multi-core processor scales by replicating its cores, while a many-core processor scales by replicating its compute units. A many-core architecture may thus be scaled to hundreds, if not thousands, of cores.
Figure 2.1. Homogeneous multi-core processor (Firesmith 2017)
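These compute-unit traits can be made concrete with a short CUDA sketch, added here as an illustration under the GPGPU programming model discussed next (it uses no MPPA-specific API): each thread block stages its tile of data into on-chip shared memory, the scratch-pad of its compute unit, synchronizes with a block-local barrier, and computes on the staged copy, with no reliance on cache coherence between compute units.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Illustrative kernel: each thread block copies its tile of the input into
 * the SM's shared memory (the scratch-pad local to a compute unit),
 * synchronizes with a block-local barrier, then computes on the staged
 * copy. Data movement between global and local memory is explicit. */
__global__ void scale_tile(const float *in, float *out, int n, float k)
{
    __shared__ float tile[256];                  /* compute-unit local memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  /* stage in */
    __syncthreads();                             /* local barrier only */
    if (i < n)
        out[i] = k * tile[threadIdx.x];          /* compute, write back */
}

int main(void)
{
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof *in);      /* unified memory for brevity */
    cudaMallocManaged(&out, n * sizeof *out);
    for (int i = 0; i < n; i++)
        in[i] = (float)i;

    scale_tile<<<n / 256, 256>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[42] = %.1f\n", out[42]);         /* expect 84.0 */
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Scaling such a program means adding thread blocks, that is, more compute units' worth of work, rather than growing a single coherent cache domain, which is the many-core scaling argument made above.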
The GPGPU architecture introduced by the NVIDIA Fermi (Figure 2.2) is a mainstream many-core architecture, whose compute units are called streaming multiprocessors (SMs). Each SM comprises 32 streaming cores (SCs) that share a local memory, caches and a global memory system. Threads are scheduled and executed atomically by “warps”, which are sets of 32 threads dispatched to SCs that execute the same instruction at any given time. Hardware multi-threading enables warp execution switching on each cycle, helping to cover the memory access latencies.
Figure 2.2. NVIDIA Fermi GPGPU architecture (Huang et al. 2013)
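The performance consequences of warp execution can be seen with a second illustrative CUDA sketch (again an addition to this discussion, not material from the Fermi documentation). When the 32 threads of a warp disagree on a branch, the warp serializes both paths with lanes masked in turn; when the branch condition is uniform within each warp, a single path is executed per warp. This behavior underlies the control-flow sensitivity noted in the list below.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Divergent kernel: even and odd lanes of the same warp take different
 * branches, so every warp issues the instructions of both arms under
 * SIMT execution, with inactive lanes masked (branch divergence). */
__global__ void divergent(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = i * 2.0f;   /* even lanes */
    else
        out[i] = i + 0.5f;   /* odd lanes of the same warp */
}

/* Uniform variant: all 32 lanes of a warp share the same value of i / 32,
 * so the branch condition is warp-uniform and each warp executes only
 * one of the two arms. */
__global__ void uniform(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        out[i] = i * 2.0f;
    else
        out[i] = i + 0.5f;
}

int main(void)
{
    const int n = 1024;
    float *out;
    cudaMallocManaged(&out, n * sizeof *out);

    divergent<<<n / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    printf("divergent: out[3] = %.1f\n", out[3]);  /* 3.5 */

    uniform<<<n / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    printf("uniform:   out[3] = %.1f\n", out[3]);  /* 6.0 */

    cudaFree(out);
    return 0;
}
```

Both kernels compute valid results; the difference is that every warp of divergent issues the instructions of both branch arms, roughly halving its arithmetic throughput for this branch pattern.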
Although embedded GPGPU processors provide adequate performance and energy efficiency for accelerated computing, their architecture carries inherent limitations that hinder their use in intelligent systems:
– the kernel programming environment lacks standard features of C/C++, such as recursion, standard multi-threading or access to a (virtual) file system;
– the performance of kernels is highly sensitive to run-time control flow (because of branch divergence, illustrated in the sketch above) and to data access patterns (because of memory coalescing);
– thread blocks are dynamically allocated to SMs, while warps are dynamically scheduled for execution inside an SM;
– coupling between