Customizable Computing. Yu-Ting Chen

Figure 3.2: Energy cost of subcomponents in a conventional compute core as a proportion of the total energy consumed by the core. This shows the energy savings attainable if computation is performed in an energy-optimal ASIC. Results are for a Nehalem-era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].

      • Core Fusion: Architectures that enable one “big” core to act as if it were really many “small” cores, and vice versa, to dynamically adapt to different amounts of thread-level or instruction-level parallelism.

      • Customized Instruction Set Extensions: Augmenting processor cores with new workload-specific instructions.

      When a general-purpose processor is designed, it is designed with a wide range of potential workloads in mind. For any particular workload, many resources may not be fully utilized. As a result, these resources continue to consume power but do not contribute meaningfully to program performance. To improve energy efficiency, architectural features can be added that allow these components to be selectively turned off. While this obviously does not allow the chip area spent on deactivated components to be repurposed, it does allow a meaningful improvement in energy efficiency.

      Manufacturers of modern CPUs already support this type of selective defeaturing, though typically not for this purpose. It is implemented through machine-specific registers that control whether particular components are active. The purpose of these registers, from a manufacturer’s perspective, is to improve processor yield by allowing faulty components in an otherwise stable processor to be disabled. For this reason, the machine-specific registers governing device activation are rarely documented.
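      On x86 Linux machines, registers of this kind are exposed through the msr driver as per-CPU device files (/dev/cpu/N/msr) that are read and written at an offset equal to the register number. The sketch below shows only this access mechanism; the register address and bit position are hypothetical placeholders, since, as noted above, the actual defeaturing registers are rarely documented.

```c
/* Minimal sketch of toggling a defeaturing bit through Linux's msr driver
 * (requires the msr kernel module and root privileges).  The register
 * address and bit position are hypothetical placeholders: real defeaturing
 * registers are vendor-specific and rarely documented. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define HYPOTHETICAL_DEFEATURE_MSR 0x1FF          /* placeholder register number */
#define HYPOTHETICAL_DISABLE_BIT   (1ULL << 9)    /* placeholder bit */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);      /* registers of CPU 0 */
    if (fd < 0) { perror("open"); return 1; }

    uint64_t val;
    /* The msr device is addressed by seeking to the register number. */
    if (pread(fd, &val, sizeof(val), HYPOTHETICAL_DEFEATURE_MSR) != sizeof(val)) {
        perror("pread"); return 1;
    }

    val |= HYPOTHETICAL_DISABLE_BIT;              /* request the component be turned off */

    if (pwrite(fd, &val, sizeof(val), HYPOTHETICAL_DEFEATURE_MSR) != sizeof(val)) {
        perror("pwrite"); return 1;
    }
    close(fd);
    return 0;
}
```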

      There has been extensive academic work on using defeaturing to create dynamically heterogeneous systems. These works center on identifying when a program is entering a code region that systematically underutilizes some set of features present in a conventional core. For example, if it is possible to statically discover that a code region contains long chains of dependent instructions, then it is clear that a processor with wide issue and fetch widths will not find enough independent instructions to make effective use of those wide resources [4, 8, 19, 125]. In that case, powering off the components that enable wide fetch and issue, along with the architectural support for large instruction windows, can save energy without impacting performance. This work is contingent upon being able to discern the run-time behavior of code, either through run-time monitoring [4, 8] or static analysis [125].
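      As a rough illustration of the analysis these works depend on, the sketch below walks a straight-line region and computes its critical dependency-chain length; a chain as long as the region itself means there is essentially no instruction-level parallelism for wide fetch and issue to exploit. The toy instruction encoding and region size are assumptions chosen for illustration, not the representation used in any of the cited works.

```c
/* Estimate the critical dependency-chain length of a straight-line region.
 * Each toy "instruction" lists the indices of the earlier instructions that
 * produce its inputs (or NONE). */
#include <stdio.h>

#define NONE        -1
#define MAX_REGION  64          /* toy limit on region size */

struct insn {
    int src1, src2;             /* indices of producer instructions, or NONE */
};

/* Length of the longest chain of dependent instructions in the region. */
static int critical_path(const struct insn *region, int n)
{
    int depth[MAX_REGION];
    int longest = 0;

    for (int i = 0; i < n; i++) {
        int d1 = region[i].src1 == NONE ? 0 : depth[region[i].src1];
        int d2 = region[i].src2 == NONE ? 0 : depth[region[i].src2];
        depth[i] = 1 + (d1 > d2 ? d1 : d2);
        if (depth[i] > longest)
            longest = depth[i];
    }
    return longest;
}

int main(void)
{
    /* Four instructions forming one serial chain: no two can issue in the
     * same cycle, so a wide issue width is wasted on this region. */
    struct insn chain[] = { { NONE, NONE }, { 0, NONE }, { 1, NONE }, { 2, NONE } };
    int n = 4;
    int cp = critical_path(chain, n);

    printf("instructions=%d  critical path=%d  ILP estimate=%.2f\n",
           n, cp, (double)n / cp);
    return 0;
}
```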

      An example of dynamic resource scaling from academia is CoolFetch [125]. CoolFetch relies on compiler support to statically estimate the execution rate of a code region, and then uses this information to dynamically grow and shrink structures within the processor’s fetch and issue units. By constraining these structures in code regions with few opportunities for instruction-level parallelism or out-of-order scheduling, CoolFetch also observed a carry-over effect: it reduced the power consumption of other processor structures that normally operate in parallel, and reduced the energy spent on squashed instructions by shrinking the number of instructions stalled at retirement. In all, CoolFetch reported an average energy savings of 8%, with a relatively trivial architectural modification and a negligible performance penalty.
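      The sizing decision itself can be pictured with a small heuristic like the one below. This is not CoolFetch's actual algorithm; it only sketches, under assumed window sizes and an assumed average completion latency, how an estimate of a region's sustainable execution rate might be translated into a smaller active window whose unused entries can be power-gated.

```c
/* Illustrative heuristic (not CoolFetch's actual algorithm): size the active
 * instruction window to what the estimated execution rate of the upcoming
 * region can use, so the remaining entries can be power-gated. */
#include <stdio.h>

#define MAX_WINDOW         128   /* assumed physical window size */
#define MIN_WINDOW          16   /* assumed smallest power-gateable configuration */
#define AVG_LATENCY_CYCLES  16   /* assumed average completion latency */

static int window_size_for(double estimated_ipc)
{
    /* Little's-law style sizing: instructions in flight ~ throughput (IPC)
     * times average completion latency.  All constants are assumptions. */
    int needed = (int)(estimated_ipc * AVG_LATENCY_CYCLES);
    if (needed < MIN_WINDOW) needed = MIN_WINDOW;
    if (needed > MAX_WINDOW) needed = MAX_WINDOW;
    return needed;
}

int main(void)
{
    printf("serial region (IPC ~1):   active window = %d\n", window_size_for(1.0));
    printf("parallel region (IPC ~6): active window = %d\n", window_size_for(6.0));
    return 0;
}
```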

      The efficiency of a processing core, both in energy consumption and in compute per area, tends to fall as the potential performance of the core increases. The primary cause is a shift in where resources are invested: small cores devote them to compute engines, while big cores devote them to aggressive scheduling mechanisms. In modern out-of-order processors, this scheduling machinery accounts for the overwhelming majority of both core area and energy consumption.

      An obvious conclusion, then, is that large sophisticated cores are not worth including in a system, since a sea of weak cores provides greater potential for system-wide throughput than a small group of powerful cores. The problem with this conclusion is that parallelizing software is difficult: parallel code is prone to errors such as race conditions, and many algorithms are limited by sequential components that are difficult to parallelize. Some code cannot reasonably be parallelized at all; indeed, the majority of software is not parallelized and thus cannot make use of a large number of cores. In these situations a single powerful core is preferable, since it offers high single-thread throughput even though it restricts the ability to exploit thread-level parallelism. It follows that the best design depends on the number of threads exposed by the software: large numbers of threads can run on a large number of cores, enabling higher system-wide throughput, while a few threads are better served by a few powerful cores, since additional cores would simply sit idle.
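      Amdahl's law makes this trade-off concrete. The toy calculation below shows a mostly sequential workload favoring one powerful core and a mostly parallel workload favoring many weak ones; the parallel fractions, core counts, and the assumption of a big core twice as fast per thread are illustrative numbers, not figures from the text.

```c
/* Amdahl's-law back-of-the-envelope comparison of one fast core against
 * many slow ones.  All numbers are illustrative assumptions. */
#include <stdio.h>

/* Speedup of a workload whose parallelizable fraction is p, run on n cores
 * that are each `perf` times the speed of the baseline core. */
static double speedup(double p, int n, double perf)
{
    return 1.0 / ((1.0 - p) / perf + p / (n * perf));
}

int main(void)
{
    double mostly_serial   = 0.30;   /* only 30% of the work parallelizes */
    double mostly_parallel = 0.95;   /* 95% of the work parallelizes */

    /* 16 weak baseline cores versus 1 big core assumed 2x faster per thread. */
    printf("mostly serial:   16 small cores = %.2fx, 1 big core = %.2fx\n",
           speedup(mostly_serial, 16, 1.0), speedup(mostly_serial, 1, 2.0));
    printf("mostly parallel: 16 small cores = %.2fx, 1 big core = %.2fx\n",
           speedup(mostly_parallel, 16, 1.0), speedup(mostly_parallel, 1, 2.0));
    return 0;
}
```

      Under these assumptions the single big core wins on the mostly sequential workload (about 2.0x versus 1.4x), while the sixteen small cores win on the mostly parallel one (about 9.1x versus 2.0x).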

      This observation gave rise to a number of academic works exploring heterogeneous systems that place a small group of very powerful cores on the same die as a large group of very efficient cores [64, 71, 84]. Industry has also begun to adopt this trend, as in ARM’s big.LITTLE architecture [64]. While these designs are interesting, they still allocate compute resources statically, and thus cannot react to variation in the degree of parallelism present in software. To address this rigidity, core fusion [74] and other related work [31, 108, 115, 123] propose mechanisms for implementing powerful cores out of collaborative collections of weak cores. This allows a system to grow and shrink so that there are as many “cores” as there are threads, with each core scaled to maximize performance for the current amount of parallelism in the system.

      Core fusion [74] accomplishes this scaling by splitting a core into two halves: a narrow-issue conventional core with the fetch engine stripped off, and an additional component that acts as a modular fetch/decode/commit engine. This added component either performs individual fetches for each core from its own instruction stream, or a single wide fetch that feeds all of the cores. Similar to how a line buffer reads multiple instructions in a single effort, the wide fetch engine reads an entire block of instructions and issues them across the different cores. Decode and register renaming are also performed collectively, with registers physically resident in the individual cores; a crossbar is added to move register values from one core to another when necessary. At the end of the pipeline, a reordering step is introduced to guarantee correct commit and exception handling. A diagram of this architecture is shown in Figure 3.3. Two instructions are added to the architecture that allow the operating system to merge and split core collections, thus adjusting the number of virtual cores available for scheduling.
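      A plausible policy built on those two instructions might look like the sketch below. The fuse_cores() and split_cores() wrappers stand in for the added instructions; the core count, group size, and thread-count heuristic are assumptions made for illustration rather than details taken from core fusion itself.

```c
/* Sketch of an OS-level policy layered on top of fuse/split instructions:
 * match the number of virtual cores to the number of runnable threads.
 * fuse_cores()/split_cores() are placeholder wrappers, and the sizes below
 * are illustrative assumptions. */
#include <stdio.h>

#define PHYSICAL_CORES 8
#define FUSION_GROUP   4    /* cores that can merge into one wide virtual core */

static void fuse_cores(int group)  { printf("fuse group %d\n", group);  }
static void split_cores(int group) { printf("split group %d\n", group); }

static void reconfigure(int runnable_threads)
{
    int groups = PHYSICAL_CORES / FUSION_GROUP;
    for (int g = 0; g < groups; g++) {
        if (runnable_threads <= groups)
            fuse_cores(g);    /* few threads: spend all cores on single-thread speed */
        else
            split_cores(g);   /* many threads: expose more cores for parallelism */
    }
}

int main(void)
{
    reconfigure(1);    /* sequential phase: fuse into wide virtual cores */
    reconfigure(12);   /* parallel phase: split back into narrow cores */
    return 0;
}
```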

      As shown in Figure 3.4, fused Core Fusion cores perform only slightly worse than a monolithic processor of the same effective issue width, coming within roughly 20% of its performance. The main reason for this gap is that the infrastructure enabling fusion comes with a performance cost. The strength of the system, however, is its adaptability, not its performance relative to a processor designed for one particular software configuration. Furthermore, the structures required for wide out-of-order scheduling need not be powered when cores are not fused. In effect, core fusion surrenders a portion of the area used to implement the out-of-order scheduler, and about 20% of performance when fused to emulate a larger core, in exchange for run-time customization of core width and core count that is binary compatible and therefore completely transparent to software. For systems that have no a priori knowledge of the workloads they will run, or that expect software to transition between sequential and parallel phases, the ability to adjust to varying workloads is a great benefit.

      Figure 3.3: A 4-core core fusion processor bundle with components added to support merging of cores. Adapted from [74].