Customizable Computing. Yu-Ting Chen
Acknowledgments
This research is supported by the NSF Expeditions in Computing Award CCF-0926127, by C-FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA), and by the NSF Graduate Research Fellowship Grant #DGE-0707424.
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
June 2015
CHAPTER 1
Introduction
Since the introduction of the microprocessor in 1971, the improvement of processor performance in its first thirty years was largely driven by the Dennard scaling of transistors [45]. This scaling calls for a reduction of transistor dimensions by 30% every generation (roughly every two years) while keeping electric fields constant everywhere in the transistor to maintain reliability (which implies that the supply voltage must also be reduced by 30% in each generation). Such scaling doubles the transistor density each generation, reduces the transistor delay by 30%, and at the same time reduces power by 50% and energy by 65% [7]. The increased transistor count also enables more architectural innovations, such as better memory hierarchy designs and more sophisticated instruction scheduling and pipelining support. These factors combined led to a performance improvement of over 1,000 times for Intel processors in 20 years (from the 1.5 μm generation down to the 65 nm generation), as shown in [7].
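The per-generation figures quoted above follow from simple first-order arithmetic, which the following Python sketch reproduces. It assumes the textbook relations for constant-field scaling (capacitance and delay track the linear dimension, switching energy goes as C·V², power as C·V²·f); the function name `scaling_ratios` and its parameters are hypothetical and chosen only for illustration. It is a back-of-the-envelope check, not a device model.

```python
# First-order arithmetic for one technology generation (illustrative only).
# All names here are hypothetical; this is not a device-level model.

def scaling_ratios(dimension_scale=0.7, voltage_scale=0.7):
    """Return new/old ratios for one generation of transistor scaling."""
    area = dimension_scale ** 2                  # both planar dimensions shrink
    density = 1.0 / area                         # transistors per unit area
    delay = dimension_scale                      # gate delay tracks dimension
    frequency = 1.0 / delay                      # faster gates allow a faster clock
    capacitance = dimension_scale                # capacitance tracks dimension
    energy = capacitance * voltage_scale ** 2    # switching energy ~ C * V^2
    power = energy * frequency                   # per-transistor power ~ C * V^2 * f
    return {
        "density": density,
        "delay": delay,
        "energy": energy,
        "power": power,
        "power_density": power * density,        # power per unit chip area
    }

ideal = scaling_ratios()
for name, ratio in ideal.items():
    print(f"{name:>14}: {ratio:.2f}x")
```

With the default 0.7 factors, the sketch gives a density ratio of about 2.04, a delay ratio of 0.7, an energy ratio of about 0.34 (a 65% reduction), and a per-transistor power ratio of about 0.49 (roughly a 50% reduction), so power per unit area stays approximately constant. Calling `scaling_ratios(voltage_scale=1.0)` holds the supply voltage fixed, and in this simple model the power density ratio then rises to about 2.04.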
Unfortunately, Dennard scaling came to an end in the early 2000s. Although the reduction of transistor dimensions by 30% per generation continues to follow Moore’s law, supply voltage scaling has almost come to a halt due to the rapid increase of leakage power. In this case, transistor density continues to increase, but so does the power density. As a result, in order to keep meeting ever-increasing computing needs while maintaining a constant power budget, in the past ten years the computing industry stopped simple processor frequency scaling and entered the era of parallelization, with tens to hundreds of computing cores integrated in a single processor, and hundreds to thousands of computing servers connected in a warehouse-scale data center. However, such highly parallel, general-purpose computing systems now face serious challenges in terms of performance, power, heat dissipation, space, and cost, as pointed out by a number of researchers. The term “utilization wall” was introduced in [128], which shows that if a chip is filled with 64-bit adders (with input and output registers) designed in a 45 nm TSMC process technology and running at the maximum operating frequency (5.2 GHz in their experiment), only 6.5% of 300 mm² of silicon can be active at the same time. This utilization ratio drops further to less than 3.5% in the 32 nm fabrication technology, that is, roughly by a factor of two in each technology generation following their leakage-limited scaling model [128].
A similar but more detailed and realistic study on dark silicon projection was carried out in [51]. It uses a set of 20 representative Intel and AMD cores to build empirical models that capture the relationship between area and performance and the relationship between power and performance. These models, together with device-scaling models, are used to project core area, performance, and power in various technology generations. The study also considers real parallel application workloads, as represented by the PARSEC benchmark suite [9]. It further considers different multicore models, including symmetric multicores, asymmetric multicores (consisting of both large and small cores), dynamic multicores (either large or small cores, depending on whether the power or area constraint is imposed), and composable multicores (where small cores can be fused into large cores). The study concludes that at 22 nm, 21% of a fixed-size chip must be powered off, and at 8 nm, this dark silicon ratio grows to more than 50% [51]. This study also points to the end of simple core scaling.
Given the limitation of core scaling, the computing industry and research community are actively looking for new disruptive solutions beyond parallelization that can bring further significant energy efficiency improvement. Recent studies suggest that the next opportunity for significant power-performance efficiency improvement comes from customized computing, where one may adapt the processor architecture to optimize for intended applications or application domains [7, 38].
The performance gap between a fully customized solution using an application-specific integrated circuit (ASIC) and a general-purpose processor can be very large, as documented in several studies. An early case study of the 128-bit key AES encryption algorithm was presented in [116]. An ASIC implementation of this algorithm in a 0.18 μm CMOS technology achieves a 3.86 Gbits/second processing rate at 350 mW power consumption, while the same algorithm coded in assembly language yields a 31 Mbits/second processing rate at 240 mW power running on a StrongARM processor, and a 648 Mbits/second processing rate at 41.4 W power running on a Pentium III processor. This results in a performance/energy efficiency (measured in Gbits/second/W) gap of a factor of 85X and 800X, respectively, when compared with the ASIC implementation. In an extreme case, when the same algorithm is coded in the Java language and executed on an embedded SPARC processor, it yields 450 bits/second at 120 mW power, resulting in a performance/energy efficiency gap as large as a factor of 3 million (!) when compared to the ASIC solution.
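The efficiency gaps above can be checked directly from the quoted throughput and power figures. The sketch below computes Gbits/second/W for each implementation and the ratio to the ASIC; the dictionary keys and function names are hypothetical labels introduced here for illustration. Because the inputs are the rounded figures quoted in the text, the computed ratios are approximate: the StrongARM and SPARC gaps come out close to 85X and 3 million, while the Pentium III gap comes out near 700X, the same order of magnitude as the cited 800X. The sketch makes no attempt to reconcile that difference.

```python
# Throughput (bits/second) and power (watts), as quoted in the text for [116].
# The labels are hypothetical names used only in this sketch.
implementations = {
    "ASIC, 0.18 um CMOS": (3.86e9, 0.350),
    "StrongARM, assembly": (31e6, 0.240),
    "Pentium III, assembly": (648e6, 41.4),
    "Embedded SPARC, Java": (450.0, 0.120),
}

def gbits_per_second_per_watt(bits_per_second, watts):
    """Performance/energy efficiency in Gbits/second/W."""
    return bits_per_second / 1e9 / watts

def efficiency_gaps(table, baseline="ASIC, 0.18 um CMOS"):
    """Ratio of the baseline's efficiency to every other entry's efficiency."""
    base = gbits_per_second_per_watt(*table[baseline])
    return {
        name: base / gbits_per_second_per_watt(*figures)
        for name, figures in table.items()
        if name != baseline
    }

gaps = efficiency_gaps(implementations)
for name, gap in gaps.items():
    print(f"{name:>22}: {gap:,.0f}X less efficient than the ASIC")
```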
More recent work studied a much more complex application for such gap analysis [67]. It uses a 720p high-definition H.264 encoder as the application driver, and a four-core CMP system using the Tensilica extensible RISC cores [119] as the baseline processor configuration. Compared to an optimized ASIC implementation, the baseline CMP is 250X slower and consumes 500X more energy. Adding 16-wide SIMD execution units to the baseline cores improves performance by 10X and energy efficiency by 7X. The addition of custom-fused instructions is also considered, and it improves performance and energy efficiency by an additional 1.4X. Despite these enhancements, the resulting enhanced CMP is still 50X less energy efficient than the ASIC.
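The 50X figure is consistent with composing the quoted improvement factors, as the short sketch below shows. It assumes the SIMD and custom-instruction gains multiply (which is how "an additional 1.4X" is read here); the helper `remaining_gap` and the variable names are hypothetical. The remaining performance gap it prints is derived under the same assumption and is not a number reported in the text.

```python
# Gap factors relative to the ASIC, as quoted in the text for [67].
BASELINE_SLOWDOWN = 250.0      # baseline CMP is 250X slower
BASELINE_ENERGY = 500.0        # baseline CMP uses 500X more energy

SIMD_SPEEDUP = 10.0            # 16-wide SIMD: 10X performance
SIMD_ENERGY_GAIN = 7.0         # 16-wide SIMD: 7X energy efficiency
FUSED_GAIN = 1.4               # custom-fused instructions: additional 1.4X on both

def remaining_gap(baseline_gap, *gains):
    """Divide the baseline gap by each gain, assuming the gains multiply."""
    gap = baseline_gap
    for gain in gains:
        gap /= gain
    return gap

energy_gap = remaining_gap(BASELINE_ENERGY, SIMD_ENERGY_GAIN, FUSED_GAIN)
speed_gap = remaining_gap(BASELINE_SLOWDOWN, SIMD_SPEEDUP, FUSED_GAIN)

print(f"remaining energy gap:      {energy_gap:.1f}X")
print(f"remaining performance gap: {speed_gap:.1f}X (derived, not from the text)")
```

Under the multiplicative assumption, the remaining energy gap is about 51X, which matches the roughly 50X stated above.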
The large energy efficiency gap between ASICs and general-purpose processors is the main motivation for architecture customization, which is the focus of this lecture. In particular, one way to significantly improve energy efficiency is to introduce many special-purpose on-chip accelerators implemented in ASIC and share them among multiple processor cores, so that as much computation as possible is carried out on accelerators instead of on general-purpose cores. This leads to accelerator-rich architectures, which have received growing interest in recent years [26, 28, 89]. Such architectures will be discussed in detail in Chapter 4.
There are two major concerns about using accelerators. One relates to their low utilization and the other relates to their narrow workload coverage. However, given the utilization wall [128] and the dark silicon problem [51] discussed earlier, low accelerator utilization is no longer a serious problem, as only a fraction of the on-chip computing resources can be activated at one time in future technology generations, given the tight power and thermal budgets. It is therefore perfectly fine to populate the chip with many accelerators, knowing that many of them will be inactive at any given time. Once an accelerator is used, however, it can deliver one to two orders of magnitude improvement in energy efficiency over general-purpose cores.
The problem of narrow workload coverage can be addressed by introducing reconfigurability and using composable accelerators. Examples include the use of fine-grain field-programmable gate arrays (FPGAs), coarse-grain reconfigurable arrays [61, 62, 91, 94, 118], or dynamically composable accelerator building blocks [26, 27]. These approaches will be discussed in more detail in Section 4.4.
Given the significant energy efficiency advantage of accelerators and the promising progress in widening accelerator workload coverage, we increasingly believe that the future of processor architecture should be rich in accelerators, as opposed to having many general-purpose cores. To some extent, such accelerator-rich architectures are more like a human brain, which has many specialized