These architectures suffer from limited flexibility in terms of block sizes and varying numbers of iterations, as well as from a large area and high power consumption. Hence, further research is mandatory.
Figure 2.2. Left: 306 Gbit/s Turbo decoder. Middle: 288 Gbit/s LDPC decoder. Right: 764 Gbit/s Polar decoder
2.3.2. LDPC decoder
LDPC decoding is based on an iterative message exchange between variable and check nodes on the Tanner graph that is represented by the parity check matrix H. Unlike Turbo decoding, this belief propagation (BP) has some inherent parallelism, since all check nodes (and, likewise, all variable nodes) can be processed independently of each other. The decoder throughput is mainly limited by the iterative data exchange between the check and variable nodes: the result of each check node or variable node calculation has to be distributed via the edges of the Tanner graph to all connected nodes. To provide good communications performance, the Tanner graph must have very limited locality, which challenges an efficient implementation of the data transfers and largely impacts ω. Hence, in contrast to Turbo decoding, the BP algorithm is dominated by data transfers rather than by computation. Let us consider an LDPC block code with a parity check matrix H that has #edges(H) 1-entries (the number of 1's in H equals the number of edges in the Tanner graph). Furthermore, let #proc_edges(A) denote the number of edges that can be processed in one clock cycle by a decoder architecture A. The corresponding parallelism is

P = #proc_edges(A) / #edges(H),

so that P = 1 corresponds to a fully parallel architecture that processes all edges of the Tanner graph in a single clock cycle.
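To make the message-passing structure and the parallelism measure concrete, the following Python sketch runs one BP iteration (in its common min-sum variant) on a toy parity check matrix and evaluates P. The matrix H, the channel LLRs and the assumed #proc_edges(A) = 4 are illustrative values, not taken from the decoder discussed above.

```python
import numpy as np

# Toy parity check matrix H (rows = check nodes, columns = variable nodes).
# Real LDPC codes use much larger, structured (e.g. quasi-cyclic) matrices.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1]])

num_edges = int(H.sum())  # #edges(H): number of 1-entries = Tanner-graph edges

def min_sum_iteration(llr, v2c):
    """One BP iteration (min-sum variant): check node update, then
    variable node update. v2c holds variable-to-check messages and is
    indexed like H; only positions with H == 1 are meaningful."""
    c2v = np.zeros_like(v2c)
    # Check node update: each outgoing message combines the signs and the
    # minimum magnitude of all *other* incoming messages at that check node.
    for c in range(H.shape[0]):
        vs = np.flatnonzero(H[c])
        for v in vs:
            others = [u for u in vs if u != v]
            sign = np.prod(np.sign(v2c[c, others]))
            c2v[c, v] = sign * np.min(np.abs(v2c[c, others]))
    # Variable node update: channel LLR plus all *other* check messages.
    for v in range(H.shape[1]):
        cs = np.flatnonzero(H[:, v])
        for c in cs:
            others = [d for d in cs if d != c]
            v2c[c, v] = llr[v] + np.sum(c2v[others, v])
    return v2c, c2v

llr = np.array([-1.2, 0.8, -0.3, 2.1, 0.9, -1.7])  # channel LLRs (illustrative)
v2c = H * llr                                      # initial messages per edge
v2c, c2v = min_sum_iteration(llr, v2c)

proc_edges = 4                 # #proc_edges(A), assumed for this example
P = proc_edges / num_edges     # parallelism; P = 1 means fully parallel
```

In a hardware decoder, the two nested loops correspond to check node and variable node processing units, and P is fixed by how many edges those units serve per clock cycle.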
2.3.3. Polar decoder
As discussed earlier, Polar codes have recently attracted much attention in the context of 5G. Successive Cancellation (SC), Successive Cancellation List (SCL) and BP are the most prominent decoding algorithms for Polar codes. Decoding corresponds to a breadth-first (BP) or depth-first (SC, SCL) traversal of the Polar Factor Tree (PFT) (Alamdar-Yazdi and Kschischang 2011), in which the received log-likelihood ratios from the channel are processed by the tree nodes. BP needs a large number of iterations to achieve the error correction performance of SC and is thus not well suited for very high throughput and low latency (Abbas et al. 2017). SC and SCL, on the other hand, exhibit sequential behavior due to the mandatory depth-first tree traversal.

To achieve a very high throughput, the tree traversal can be unrolled and pipelined (Giard et al. 2015), very similar to the iteration unrolling in Turbo and LDPC decoding described above. Whenever a node is visited during the tree traversal, a corresponding pipeline stage can be instantiated. In this way, for a block length of N (= number of leaves in the PFT), the maximum number of pipeline stages is 2 ∗ (2N − 2) + 1, and all N ∗ log N operations are executed in parallel, i.e. P = 1.

Obviously, the complexity of the decoding architecture is directly proportional to the size of the PFT. For a given code, however, the tree can be reduced by various transformations. For example, if a subtree represents a repetition code or a parity check code, it can be replaced by a single node. Similarly, rate-0 and rate-1 nodes can be merged into their parent nodes (Sarkis et al. 2014), or majority logic decoding can be used in subtrees; a small sketch of this subtree classification is given at the end of this section. The achievable tree reduction strongly depends on the underlying Polar code.

SC is the best candidate to achieve throughput towards 1 Tbit/s. Here, we consider a (1,024, 512) Polar code that is decoded with the SC algorithm, using the same technology and PVT setup as for the Turbo and LDPC decoders. The unoptimized PFT has 2,047 nodes, corresponding to 4,093 pipeline stages. The aforementioned tree optimizations reduce this to 385 stages. Implementing this tree in a fully pipelined architecture would require about 310 KB of pipeline memory. This huge memory requirement illustrates a further challenge for high-throughput decoder architectures.

Although pipelining enables the highest throughput, such architectures suffer from large latency and, more importantly, large power consumption. A huge amount of data has to be stored inside the pipelined architecture. These data are stored in registers that have to be driven by a clock tree; if a design contains many registers, the clock tree can become the dominant power consumer. Hence, minimizing the register load is a main optimization goal. On the code level, this can be done by the aforementioned tree optimizations. On the architectural level, register balancing (retiming) is a very efficient technique to further reduce the number of pipeline stages. Advanced retiming with a frequency constraint of 700 MHz results in a partially pipelined architecture with only 105 pipeline stages (85 KB of memory). The power consumption of this design is 5.7 W, of which more than 70% is consumed in the registers and the clock tree. In a further step, some of the registers can be replaced by latches, which have a much lower load.
A latch-based design with some further optimizations reduces the area from 3.14 mm2 to 2.79 mm2 and the power consumption from 5.7 W to 2.7 W. The final (coded) throughput is 664 Gbit/s, and registers/latches and the clock tree now account for only 32% of the overall power. Figure 2.2 (right) shows the layout of this Polar decoder: each of the 105 pipeline stages is shown in a different color, and memory is shown in black. The decoder was fully automatically optimized and generated with the framework presented in Kestel et al. (2018b) and Lehnigk-Emden et al. (2019).
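As an illustration of the tree reductions mentioned above, the following Python sketch classifies PFT subtrees by their frozen-bit pattern (rate-0, rate-1, repetition and single parity-check nodes, in the spirit of Alamdar-Yazdi and Kschischang 2011 and Sarkis et al. 2014) and counts the nodes that survive pruning. The helper that produces the frozen set is hypothetical, and the exact set of transformations used for the 385-stage decoder above may differ.

```python
def classify(frozen):
    """Classify a PFT subtree by its frozen-bit pattern (True = frozen bit).
    Any non-'mixed' subtree can be replaced by a single specialized node."""
    if all(frozen):
        return "rate-0"                 # all bits frozen
    if not any(frozen):
        return "rate-1"                 # no bits frozen
    if all(frozen[:-1]):
        return "repetition"             # only the last bit carries information
    if frozen[0] and not any(frozen[1:]):
        return "single-parity-check"    # only the first bit is frozen
    return "mixed"

def count_nodes(frozen):
    """Number of nodes in the pruned PFT: a specialized subtree collapses
    to one node; a 'mixed' node is split and both halves are visited."""
    if len(frozen) == 1 or classify(frozen) != "mixed":
        return 1
    half = len(frozen) // 2
    return 1 + count_nodes(frozen[:half]) + count_nodes(frozen[half:])

# Unpruned tree for block length N: 2N - 1 nodes and, when fully unrolled,
# 2 * (2N - 2) + 1 pipeline stages, as stated above.
N = 1024
assert 2 * N - 1 == 2047
assert 2 * (2 * N - 2) + 1 == 4093

# For a concrete (1,024, 512) code the result depends on the frozen set:
# frozen = reliability_based_frozen_mask(N, K=512)   # hypothetical helper
# print(count_nodes(frozen))
```

The numbers reported above follow this pattern: tree pruning brings the 4,093 potential stages down to 385, and retiming then reduces them further to 105.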
2.4. Conclusion
We have shown that throughputs towards 1 Tbit/s are feasible for all three code classes by appropriate “unrolling”, using heavy pipelining and spatial parallelism. However, this architectural approach is limited to small block sizes and small numbers of iterations (Turbo and LDPC codes), which negatively impacts the communications performance. Moreover, although pipelining largely increases throughput and locality, it also increases latency. All architectures suffer from limited flexibility in terms of block sizes (all three codes), varying numbers of iterations (Turbo and LDPC codes) and code rates (LDPC and Polar codes). In summary, the biggest challenge for very high-throughput decoder architectures lies in improving the communications performance under the aforementioned implementation constraints while providing block size, code rate and algorithmic flexibility. As discussed in the introduction, microelectronic progress will largely contribute to an improved area efficiency, but not as much to an increased performance or a reduced power density. Thus, further research is mandatory to keep pace with the increasing requirements on communication systems in terms of throughput, latency, power/energy efficiency, flexibility, cost and communications performance.
2.5. Acknowledgments
We gratefully acknowledge financial support by the EU (project-ID: 760150-EPIC).
2.6. References
Abbas, S.M., Fan, Y., Chen, J., and Tsui, C.Y. (2017). High-throughput and energy-efficient belief propagation polar code decoder. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(3), 1098–1111.