1.5.1. Implementation considerations
For the kernel implementation, we need to factor in the following shortened list of considerations (Damjancevic et al. 2019):
1) Vectorization – In the ideal case, a single-instruction, multiple-data (SIMD) processor of vector length veclen, i.e. with veclen SIMD lanes, reduces the number of operations required by a factor of veclen; in practice, this factor is smaller, due to Amdahl's law, boundary conditions and other potential bottlenecks, such as load/store bandwidth. Similarly, the number of memory accesses (loads/stores) is, in the ideal case, reduced veclen times.
2) Boundary conditions – Depending on the algorithm implemented, a mismatch or misalignment between the data layout in memory and the data accessed by the vector load/store unit (Figure 1.13) requires extra processing cycles (see the sketch following this list).
3) Bandwidth bottleneck – When the functional units and register file blocks of Figure 1.13 consume or produce more data than the load/store block can handle, the core runs into a bandwidth bottleneck, lowering core utilization. Multiple load/store units and/or an increased load/store vector length can mitigate this problem, although such architectures come with higher energy consumption per load/store and a larger chip area for the load/store unit.
4) Memory bandwidth – Amount of data transferred per unit of time.
5) Trade-off between cycle count and memory bandwidth – When implementing an algorithm, minimizing the cycle count is usually paid for with additional memory accesses; vice versa, optimizing for a low load/store count increases the overall cycle count. In GFDM, we encounter such a trade-off. A low cycle count optimization reduces execution time, whereas a low memory bandwidth optimization saves energy (and area, if an architecture with a smaller load/store unit is used).
6) Loop order and vectorization – Some algorithms, such as GFDM, can be described with mutually independent nested loops. We identified three such loops in GFDM (see "for" loops in Figure 1.9). Loop order and choosing which loop to vectorize can significantly alter the boundary conditions, bandwidth, memory and register space used and even type (instruction set architecture (ISA)) and count of operations needed.
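As an illustration of considerations 1) and 2), the following plain-C sketch processes a buffer of K samples veclen elements at a time, with a scalar remainder loop for the boundary case; the function, the array names and the scaling operation are purely illustrative and do not correspond to actual vDSP intrinsics.

#define VECLEN 16   /* assumed SIMD vector length (veclen) */

static void scale_buffer(float *x, const float *d, int K, float c)
{
    int i = 0;

    /* Main part: each outer iteration models one SIMD operation over
       VECLEN lanes (one vector load, one vector MPY, one vector store). */
    for (; i + VECLEN <= K; i += VECLEN)
        for (int lane = 0; lane < VECLEN; lane++)
            x[i + lane] = c * d[i + lane];

    /* Boundary condition: the last K % VECLEN samples do not fill a full
       vector and cost extra (scalar or masked) cycles, lowering efficiency. */
    for (; i < K; i++)
        x[i] = c * d[i];
}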
1.5.2. Design space exploration
Let us refer to Figure 1.8 again. Note that the matrix elements are color-coded to match outputs with inputs.
Three loops can be identified: loop i along K, loop m along M of the dot product and loop l along M of the cyclic shift. We will use this notation for the rest of this chapter. The matrix D depends on m and i; the matrix g depends on l, m and i; and the output x depends on i and l. There are 3! = 6 loop order combinations to arrange data reading, computation and data storing in this algorithm. In addition, we can vectorize loop i or l at the output and any of the loops at the input. In our notation, a vectorized input loop is set in bold and a vectorized output loop is underlined. A scalar reference of the loop nest is sketched below.
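The following plain-C sketch spells out the three loops and the dependencies stated above (real-valued arithmetic for brevity); the flattened index expressions, in particular the indexing of g for the cyclic shift, are schematic stand-ins for the exact indexing of Figure 1.8.

/* Scalar reference sketch: D depends on (m, i), g on (l, m, i), x on (i, l). */
static void gfdm_kernel_ref(float *x, const float *D, const float *g,
                            int K, int M)
{
    for (int l = 0; l < M; l++)           /* loop l along M: cyclic shift */
        for (int i = 0; i < K; i++) {     /* loop i along K               */
            float acc = 0.0f;
            for (int m = 0; m < M; m++)   /* loop m along M: dot product  */
                acc += D[m * K + i] * g[(l * M + m) * K + i];
            x[l * K + i] = acc;
        }
}

Since the three loops are mutually independent, any of the six orders computes the same result; they differ only in memory access pattern, register usage and vectorization opportunities.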
Within this design space, we can search for optimizations with regard to the considerations of section 1.5.1. We decided to search for two optimizations: a) ideal vectorization and b) low memory bandwidth. For a), we analyze two variants (black and red) and for b) two other variants (blue and green) of GFDM. Since some features overlap between optimization variants, in the following we first present the features and then assign them to the black, red, blue and green variants. A summary is provided in Table 1.3 (Damjancevic et al. 2019). The features are:
1) Modifying the very long instruction word (VLIW) configuration – The vDSP has the capability to perform either one wide 2×veclen vector load or one veclen vector load plus one veclen vector store. This removes the bandwidth bottleneck for operations that require two operands, which, in turn, removes stalls when executing consecutive vector MAC operations. The basic configuration supports a veclen vector load plus a veclen vector store.
2) Vectorizing loop i maximizes efficiency (no boundary effects) – The IDFT output is of size K = 2ⁿ, n ∈ ℕ, due to the assumed implementation of the IDFT as a 2ⁿ-point iFFT (K ∈ [128, 4096], i.e. [low end, high end]), and veclen = 16. Moreover, this makes the code scale with any M and K value.
3) Vectorizing loop m allows the use of a) a shuffle operation on buffered coefficients g, reducing the required number of g loads from M²K to MK; and b) a combination of element-wise vector multiplication (MPY) and reduction add (rADD) that avoids MACs; if M ≤ veclen, the m loop is consumed altogether. This greatly reduces the number of memory accesses (see the sketch following this list). Conversely, when M is not a multiple of veclen, the boundary conditions deteriorate and the efficiency drops. We could handcraft the code for every M to optimize boundary effects, but this limits code scalability. In Table 1.3, we present the upper bound of efficiency for arbitrary M.
4) Vectorizing loop l at the output reduces the operation count, but is subject to boundary effects that reduce core efficiency.
5) m nested in i minimizes the need for accumulators (ACCs) in the register file (see Figure 1.13). Variants with this loop order use a single veclen-wide ACC by computing the partial sums of the dot product first.

Table 1.3. Selected GFDM implementation variants
6) i nested in l produces the output sequentially on a symbol-by-symbol basis of size K.
7) i nested in m allows D to be read sequentially, symbol by symbol, preventing stalls when the IFFT result arrives sequentially (in blocks of size K, in contrast to the otherwise required parallel M × K blocks).
8) l nested in i removes the need to either reload or buffer D in the register file.
9) m and l nested in i allow the use of the same memory locations for both D and x. An index location of D becomes obsolete when x of the same index location is ready; for example, when x₀,₀ is ready, D₀,₀ is no longer used. The relative order of m and l is irrelevant. The same is true for i nested in m and l.
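As a rough illustration of feature 3), the sketch below maps the m loop onto the vector lanes, assuming M ≤ veclen: one element-wise vector multiplication followed by one reduction add replaces the MAC chain. Plain C models the vector operations; the function name, g_buf (standing for the buffered, shuffled g coefficients of the current output position) and the indexing are illustrative.

#define VECLEN 16   /* assumed vDSP vector length */

/* One output sample x[l][i], with the m loop mapped onto the vector lanes. */
static void dot_m_vectorized(float *x, const float *D, const float *g_buf,
                             int K, int M, int l, int i)
{
    float prod[VECLEN];                  /* one product per lane; M <= VECLEN */

    for (int m = 0; m < M; m++)          /* element-wise vector MPY           */
        prod[m] = D[m * K + i] * g_buf[m];

    float acc = 0.0f;
    for (int m = 0; m < M; m++)          /* reduction add (rADD) across lanes */
        acc += prod[m];

    x[l * K + i] = acc;
}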
Black uses l{i{m{…}}}, hence features 1), 2), 5) and 6) apply.
Red uses m{i{l{…}}}, hence features 1), 2), 7) and 8) apply.
Blue uses i{l{m{…}}}, hence features 3), 5), 8) and 9) apply.
Green uses l{i{m{…}}}, hence features 3), 5) and 6) apply. In green, we have to either reload or buffer D; we chose to buffer D in the register file, reducing the memory bandwidth.
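The loop orders of the black and blue variants can be sketched as plain-C loop nests as follows; the loop that would map onto the vector lanes is marked in a comment, x is assumed zero-initialized, and the index expressions are the same schematic ones as in the reference sketch above.

/* black: l { i { m { … } } }, with loop i vectorized */
static void gfdm_black(float *x, const float *D, const float *g, int K, int M)
{
    for (int l = 0; l < M; l++)
        for (int i = 0; i < K; i++)       /* vectorized: K is a power of two, no boundary effects */
            for (int m = 0; m < M; m++)   /* partial sums accumulated in one veclen-wide ACC      */
                x[l * K + i] += D[m * K + i] * g[(l * M + m) * K + i];
}

/* blue: i { l { m { … } } }, with loop m vectorized */
static void gfdm_blue(float *x, const float *D, const float *g, int K, int M)
{
    for (int i = 0; i < K; i++)
        for (int l = 0; l < M; l++)
            for (int m = 0; m < M; m++)   /* vectorized: MPY + rADD, consumed if M <= veclen */
                x[l * K + i] += D[m * K + i] * g[(l * M + m) * K + i];
}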
Now that we have identified GFDM variants that minimize the cycle count (for maximum throughput) or the required memory bandwidth (for low energy consumption), we can investigate how the variable throughput requirement from the standard analysis in section 1.1 maps to the GFDM kernels, in particular to black, the high-throughput variant, and blue, the low-energy-consumption variant. We drop red and green from further consideration since they do not achieve the lowest cycle counts and have high memory and register file usage. However, we should keep in mind the lesson learned: that our code could create a kernel that is in a very unfavorable spot on the