Multi-Processor System-on-Chip 1. Liliana Andrade


time per-frame: rbcc mode and mp mode
Figure 5.8. Breakdown of execution time for different clips with increasing back...
Figure 5.9. Normalized benchmark execution time for different coherency region s...
Figure 5.10. Taxonomy of in-/near-memory computing (colored elements are covered...
Figure 5.11. Architecture of the remote near-memory synchronization accelerator
Figure 5.12. Queues are widely used as message passing buffers
Figure 5.13. Mechanism for a remote dequeue operation (right) for queues in dist...
Figure 5.14. NAS benchmark 4 × 4 results (left) and IS scalability (right) for di...
Figure 5.15. Far-from memory (left) versus near-memory (right) graph copy
Figure 5.16. IMSuite benchmark results on a 4 × 4 tile design with 1 memory tile
Figure 5.17. IMSuite benchmark results for inter-memory graph copy
Figure 5.18. Effect of NCA
Figure 5.19. Interplay of RBCC and NMA for shared and distributed memory program...

Chapter 6
Figure 6.1. The MAGIC NOR gate. (a) MAGIC NOR gate schematic; (b) MAGIC NOR gate...
Figure 6.2. Evaluation tool for MAGIC within crossbar arrays. The initial voltag...
Figure 6.3. The SIMPLE and SIMPLER flows. In both flows, the logic is synthesize...
Figure 6.4. A 1-bit full adder implementation using SIMPLER. (a) A 1-bit full ad...
Figure 6.5. High-level description of the mMPU architecture. A program is execut...
Figure 6.6. The internal structure of the mMPU controller. First, an instruction...

Chapter 7
Figure 7.1. Address translation for the ARMv7 architecture
Figure 7.2. Host view of the architected state of the guest
Figure 7.3. Pseudo-code of the helper for the ldr instruction
Figure 7.4. QEMU-generated code to perform a load instruction
Figure 7.5. Embedding guest address space
Figure 7.6. Overview of the implementation
Figure 7.7. Contrasting memory access binary translations
Figure 7.8. Kernel module page fault handler
Figure 7.9. Percentage of memory accesses (with Linux)
Figure 7.10. Time spent in the Soft MMU
Figure 7.11. Benchmark speed-ups: our solution versus vanilla QEMU
Figure 7.12. Plain/hybrid speed-ups versus vanilla QEMU
Figure 7.13. Number of calls to slow path during program execution (i386 and ARM...
Figure 7.14. Page fault optimization speed-ups (ARM guest)
Figure 7.15. Page fault handling – internal versus percolated (note the logarith...

Chapter 8
Figure 8.1. Kalray MPPA overall architecture
Figure 8.2. Memory banks and local interconnect in a Kalray cluster
Figure 8.3. Example of collisions with four accesses to the same memory bank
Figure 8.4. Description of memory address bits and their use for a 32-bit memory...
Figure 8.5. Example of interleaving with four accesses to different memory banks
Figure 8.6. Description of memory address bits and their use for a 32-bit memory...
Figure 8.7. A four-bank architecture, 1-byte words, with a 16-byte stride access...
Figure 8.8. A four-bank architecture, 1-byte words, with a 17-byte stride
Figure 8.9. Left: the distribution of addresses within a 5-bank memory system. R...
Figure 8.10. Distribution of addresses across five memory banks with Index = Add...
Figure 8.11. Using a hash function for memory bank selection. N is the address s...
Figure 8.12. Left: an example of an H matrix of size 2 × 3. Right: the same H m...
Figure 8.13. Example of PRIM allocation in a four-bank architecture, and four me...
Figure 8.14. H matrix for the PRIM solution
Figure 8.15. Complex Addressing circuit overview
Figure 8.16. Intel Complex Addressing stage 1 and PRIM 67
Figure 8.17. Overview of the Kalray MPPA simplified local memory architecture
Figure 8.18. Kalray MPPA simplified crossbar internal architecture
Figure 8.19. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.20. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.21. Comparison between MOD 16 and MOD 17 for the same executable code a...
Figure 8.22. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.23. Hotmap of memory access efficiency according to the number of banks...

Chapter 9
Figure 9.1. A traditional synchronous bus, in this case the implementation of an...
Figure 9.2. Arteris switch fabric network (Arteris IP 2020)
Figure 9.3. NoC layer mapping summary (Arteris IP 2020)
Figure 9.4. The NoC on the left has a floorplan-unfriendly topology, whereas the...
Figure 9.5. Pipeline stages are required in a path to span a particular distance...
Figure 9.6. Single-event effect (SEE) error hierarchy diagram (ISO26262-11 2018b...
Figure 9.7. Failure mode effects and diagnostic analysis (FMEDA) includes analys...
Figure 9.8. A cache coherent NoC interconnect allows the integration of IP using...
Figure 9.9. NoC interconnects enable easier creation of hard macro tiles that ca...
Figure 9.10. Hierarchical coherency macros enable massive scalability of cache c...
Figure 9.11. An AUTOSAR MCAL showing how NoC configuration information will be u...

Chapter 10
Figure 10.1. Energy efficiency improvement by near-threshold computing
Figure 10.2. Block diagram of the SCM with an R × C-bit capacity
Figure 10.3. An example of SCM structures (R = 4, C = 4)
Figure 10.4. Energy measurement results for two memories with a 256 × 32 capacit...
Figure 10.5. Minimum energy point curves
Figure 10.6. Energy and delay contours
Figure 10.7. Minimum energy point in near- or sub-threshold region
Figure 10.8. Minimum energy point in super-threshold region
Figure 10.9. The concept of minimum energy point tracking algorithm
Figure 10.10. An OS-based algorithm for MEPT

Chapter 11
Figure 11.1. Tasks in a heterogeneous computing architecture communicate with ea...
Figure 11.2. Hardware context switch on a task with FIFO-based communication channe...
Figure 11.3. Modified communication channel to support hardware context switchin...
Figure 11.4. The proposed communication protocol in hardware context switching
Figure 11.5. FIFOs in the communication channel
Figure 11.6. Reconfigurable architecture with the proposed communication structu...
Figure 11.7. Hardware context switch scenario in the experiments
Figure 11.8. Comparison of hardware context switch (preemption) time between CS ...
Figure 11.9. Hardware task migration between heterogeneous reconfigurable SoCs
Figure 11.10. Task migration timeline