Figure 1.3 shows the top four supercomputers. Rmax is the maximal achieved performance, while Rpeak is the theoretical peak (assuming zero-cost communication, etc.). The holy grail of high-performance computing is to have an exascale machine by the year 2021. That deadline has been a moving target: from 2015 to 2018 and now 2021. What is hard about that? We can build an exascale machine, that is, one on the order of 10^18 FLOPS, by connecting, say, a thousand petascale machines with a high-speed interconnect, right? Wrong! A machine built that way would require about 50% of the power generated by the Hoover Dam! It is the problem of power again. The goal set by the US Department of Energy (2013) for an exascale machine is one exaFLOPS within a power budget of 20–30 MW. This makes the problem very challenging.
Figure 1.3 Part of the TOP500 list of fastest supercomputers (as of November 2018). (Top 500 List, November 2018; courtesy of Jack Dongarra; retrieved November 2018 from https://www.top500.org/lists/2018/11/)
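For scale, a quick back-of-the-envelope calculation (ours, for illustration): delivering 10^18 FLOPS within a 20 MW budget means 10^18 / (2 × 10^7 W) = 5 × 10^10 FLOPS per watt, that is, 50 GFLOPS/W, several times better than the most power-efficient supercomputers of that era.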
Heterogeneity is one step toward the solution. Some GPUs may dissipate more power than multicore processors. But if a program is written in a GPU-friendly way and optimized for the GPU at hand, you get orders-of-magnitude speedup over a multicore, which makes the GPU better than a multicore in a performance-per-watt measure. If we assume the power budget is fixed at, say, 30 MW, then using the right chips for the application at hand gets you much higher performance. Of course, heterogeneity alone will not solve the exascale challenge, but it is a necessary step.
2 Different Players: Heterogeneity in Computing
In this chapter we take a closer look at the different computing nodes that can exist in a heterogeneous system. Computing nodes are the parts that do the computations, and computation is the main task of any program. Computing nodes are like programming languages: each one can do any computation, but some are far more efficient at certain types of computations than others, as we will see.
In 1966, Michael Flynn classified computations into four categories based on how instructions interact with data. The traditional sequential central processing unit (CPU) executes an instruction with its data, then another instruction with its data, and so on. In Flynn’s classification, this computation is called single instruction–single data (SISD). Alternatively, you can execute the same instruction on different data; think of multiplying each element of a matrix by a factor, for example. This is called single instruction–multiple data (SIMD). The other way around, where the same data go through different instructions, is called multiple instruction–single data (MISD). There are not many examples of MISD around. With some stretch, we can call pipelining a special case of MISD. Redundant execution of instructions, for reliability reasons, can also be considered MISD. Finally, the most generic category is multiple instruction–multiple data (MIMD). There are also some generalizations. For instance, if we execute the same set of instructions (rather than a single instruction) on different data, we can generalize SIMD to single program (or single thread)–multiple data (SPMD). One advantage of such classifications is that we can build hardware suited to each category, or to the categories used most often, as we will see in this chapter.
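As a concrete illustration of the SISD versus SIMD distinction, below is a minimal C sketch, assuming an x86 processor with AVX support (compile with -mavx); the function names are ours, and the vector loop assumes n is a multiple of 8 for brevity.

    #include <immintrin.h> /* AVX intrinsics; x86-specific */

    /* SISD flavor: one scalar multiply per instruction. */
    void scale_scalar(float *a, int n, float k) {
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }

    /* SIMD flavor: a single AVX instruction multiplies 8 floats at once. */
    void scale_simd(float *a, int n, float k) {
        __m256 factor = _mm256_set1_ps(k); /* broadcast k to all 8 lanes */
        for (int i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(&a[i]);
            _mm256_storeu_ps(&a[i], _mm256_mul_ps(v, factor));
        }
    }

Modern compilers will often auto-vectorize the scalar loop into exactly this kind of SIMD code when optimization is enabled.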
2.1 Multicore
The first player on a heterogeneity team is the multicore processor itself. Figure 2.1 shows a generic multicore processor. The de facto definition of a core now is a CPU and its two level-1 caches (one for instructions and the other for data). Below the L1 caches, designs differ. One design has shared L2 and L3 caches, where L3 is usually the last-level cache (LLC) before going off-chip. Such a shared L2 cache is physically distributed but logically shared, to increase scalability with the number of cores. This makes the shared L2 cache a nonuniform cache access (NUCA) design, as we saw in the previous chapter; the LLC is NUCA as well. The LLC is built in either SRAM or embedded DRAM (eDRAM); POWER processors from IBM use eDRAM for the LLC. Another design has a private L2 cache per core followed by a shared LLC. A few recent processors also have a shared L4 cache; for example, Intel’s Broadwell i7 processor has a 128 MB L4 cache implemented in eDRAM technology. After the cache hierarchy, we go off-chip to access the system memory. Currently, the vast majority of system memory is DRAM, but as we saw earlier, nonvolatile memory technologies (such as PCM, STT-RAM, MRAM, ReRAM, etc.) will soon appear and will be used alongside or in place of DRAM, and also in some cache levels.
Figure 2.1 Generic multicore processor
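To get a feel for this hierarchy on a real machine, here is a minimal sketch, assuming Linux with glibc; these _SC_* names are glibc extensions, and other systems may report 0 or -1.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* All sizes are reported in bytes. */
        printf("L1 data cache:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1 instr cache: %ld\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
        printf("L2 cache:       %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 cache:       %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        printf("Cache line:     %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        return 0;
    }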
When we consider programming a multicore processor, we need to take into account several factors. The first is the process technology used for fabrication; it determines the cost, the power density, and the speed of the transistors. The second factor is the number of cores and whether they support simultaneous multithreading (SMT) [Tullsen et al. 1995], called hyperthreading technology in Intel lingo (AMD uses the generic term SMT). This is where a single core can serve more than one thread at the same time, sharing resources. So if the processor has four cores and each one has two-way SMT capability, the OS will see your processor as one with eight logical cores. The number of cores (physical and logical) determines the amount of parallelism that you can get and hence the potential performance gain. The third factor is the architecture of the core itself, as it affects the performance of a single thread. The fourth factor is the cache hierarchy: the number of cache levels, the specifics of each cache, the coherence protocol, the consistency model, etc. This factor is of crucial importance because going off-chip to access memory is a very expensive operation; the cache hierarchy helps reduce those expensive trips, with help from the programmer, the compiler, and the OS. Finally, the last factor is scaling out: How efficient is a multisocket design? Can we scale even further, to thousands of processors?
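To see why the cache hierarchy deserves this attention, consider a minimal C sketch (N is an assumed size; exact timings vary by machine). C stores 2-D arrays in row-major order, so the first loop uses every element of each cache line it fetches, while the second touches one element per line and generates far more of those expensive off-chip trips.

    #define N 4096
    static float a[N][N];

    /* Cache-friendly: walks memory sequentially. */
    float sum_rows(void) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: strides N floats between consecutive accesses. */
    float sum_cols(void) {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

On a typical machine the second version can be several times slower, even though both perform exactly the same arithmetic.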
Figure 2.2 IBM POWER9 processor. (Courtesy of International Business Machines Corporation, © International Business Machines Corporation)
Let’s see an example of a multicore. Figure 2.2 shows the POWER9 processor from IBM [Sadasivam et al. 2017]. The POWER9 is fabricated in 14 nm FinFET technology, with about eight billion transistors, a fairly advanced process as of 2018; smaller nodes (e.g., 10 nm) exist but are still very expensive and not yet in mass production. The figure shows 24 CPU cores. Each core can support up to four hardware threads (SMT), which means up to 96 threads can execute in parallel. There is another variant of the POWER9 (not shown in the figure) that has 12 cores, each supporting up to 8 hardware threads, bringing the total again to 96 threads. The variant in the figure has more physical cores, so it offers better potential performance, depending on the application at hand, of course. Before we proceed, let’s think from a programmer’s perspective. Suppose you are writing a parallel program for this processor, and the language you are using gives you the ability to assign threads (or processes) to cores. How will you decide which thread goes to which core? Obviously, the first rule of thumb is to assign different threads to different physical cores. But there is a good chance you will have more threads than physical cores. In that case, try to assign threads of different personalities to the same physical core: a memory-bound thread with a compute-bound thread, a thread dominated by floating-point operations with one dominated by integer operations, and so on. Of course, there is no magic recipe; these are rules of thumb, and your assignment may be overridden by the language runtime, the OS, or the hardware.
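Many environments expose this kind of placement. As one illustration, here is a minimal Linux sketch using the pthreads affinity extension; the helper name pin_to_core is ours, and the mapping of logical CPU numbers to physical cores and SMT contexts is machine-specific (on Linux, see /proc/cpuinfo or a tool such as hwloc).

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin a thread to one logical CPU. Which logical CPUs are SMT
       siblings on the same physical core is machine-specific. */
    int pin_to_core(pthread_t thread, int cpu_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_id, &set);
        /* Returns 0 on success, an error number on failure. */
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }

With such a helper, you could place a memory-bound thread and a compute-bound thread on the two SMT contexts of one physical core, following the rule of thumb above. Now back to the POWER9.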
Each core includes its own L1 caches (instructions and data). The processor has a three-level cache hierarchy. L2 is a private 512 KB 8-way set-associative cache. Depending