processing part of wireless communication. The main tasks of baseband processing are signal detection, parameter estimation, demodulation and channel decoding. Figure 2.1 shows a state-of-the-art commercial advanced baseband System-on-Chip (SoC) from the company Octasic (http://www.octasic.com). This SoC supports a wide range of communication standards from 2G to 5G. Supporting different standards demands flexibility; on the other hand, power and energy efficiency require dedicated, optimized accelerators. This calls for a careful trade-off between flexibility and implementation efficiency. Specialized hardware is 2 to 3 orders of magnitude more efficient than processor-based solutions, which offer the largest flexibility (Horowitz 2014). Hence, the SoC features a heterogeneous architecture composed of standard CPUs such as ARM cores, highly optimized low-power Digital Signal Processor (DSP) cores and dedicated accelerators.
The most computationally intensive part of baseband processing is channel decoding. Channel coding is an essential part of any communication system to guarantee reliable transmission, as it enables the correction of errors induced during data transmission over noisy channels. It has a long history, going back to Shannon's famous channel coding theorem in 1948 (Shannon 1948), and is based on the solid mathematical framework of information theory. Coding evolved into a very active research field and achieved a major breakthrough in 1993, when Claude Berrou wondered whether two small convolutional codes, arranged in an original way and iteratively decoded one after another, could do better than convolutional codes with a very large number of states. This new coding scheme was named Turbo codes (Berrou et al. 1993). The era of iterative coding schemes was born, which resulted in a re-discovery of LDPC codes by MacKay and Neal (1997), which had originally been published by Gallager in 1962 (Gallager 1962). Due to their excellent communications performance, consumer-driven cellular and broadcast communication standards such as UMTS, LTE, DVB and WiMAX, to name a few, adopted these new codes. A further breakthrough happened in 2009, when Erdal Arikan published Polar codes and proved that these codes achieve channel capacity for a Binary Symmetric Memoryless Channel (BSMC) (Arikan 2009). Due to their excellent communications performance, Polar codes have attracted significant attention and became, for the first time, part of the 5G standard.
For high throughput and low latency, however, these advanced coding schemes imply large challenges for efficient hardware implementation in consumer devices, where silicon area and power are the most critical cost factors. The computational workload of channel decoding for these advanced coding schemes at throughputs in the Gbit/s range exceeds 1,000 GOP/s (equivalent DSP operations). The task is even more challenging with regard to the available power budget, which is limited to a few Watts for baseband processing due to thermal power density constraints. The high design cost of a baseband chip in an advanced semiconductor technology demands a large volume. This means that the corresponding SoC has to be used in various applications, and the Forward Error Correction (FEC) Intellectual Property (IP) must be flexible enough to support various coding schemes, code rates, data rates, latencies and BER/FER requirements. As mentioned earlier, adding flexibility to any architecture has a negative impact on throughput, energy efficiency and area. Researchers are continuously looking for techniques to reduce the hardware implementation complexity and optimize the energy efficiency of these advanced coding schemes while minimizing the degradation in communications performance (Kienle et al. 2011). For these reasons, state-of-the-art channel decoders are typically implemented as dedicated accelerator cores. This can be seen in the wireless accelerator group depicted in Figure 2.1, which contains different types of channel en-/decoding accelerators to support various standards: an Application-Specific Instruction Set Processor (ASIP) that offers large flexibility for 2G/3G convolutional and Turbo code support (Vogt and Wehn 2008), and dedicated, highly optimized LTE Turbo and DVB-S2 LDPC en-/decoders (Muller et al. 2009; Weithoffer et al. 2018).
Figure 2.1. State-of-the-art commercial system-on-chip baseband architecture
2.2. Role of microelectronics
The tremendous improvement in mobile communication has to be considered alongside the progress in the microelectronics industry, which started with the invention of the transistor in the late 1940s (Shockley 1949), coincidentally around the same time that Shannon published his famous article (Shannon 1948). In the following decades, the semiconductor industry achieved an exponential increase in the number of transistors on a single chip, known as Moore's law (Moore 1965), which is a further key driver of our information society. In today's semiconductor technologies, tens of millions of transistors can be integrated on 1 mm2 of silicon. For many decades, improvements in silicon process technology provided better performance, lower cost per gate, higher integration density and lower power consumption. However, we have reached a point where Moore's law is slowing down. The reasons for this slowdown are, in particular, the immense cost of new technologies and of designs in these technologies, diminishing performance gains, increasing interconnect delay and power/power density challenges, to name just a few.
The question is what contribution microelectronics has made in the past to improving throughput and implementation efficiency in channel decoding. As a case study, we consider two Turbo code decoders. Both decoders were designed with the same design methodology and have very similar state-of-the-art architectures that exploit spatial parallelism and process several sub-blocks on corresponding Maximum a Posteriori (MAP) decoders in parallel:
– the first decoder is a fully UMTS-compliant Turbo decoder implemented in a 180 nm technology. Under worst-case Process, Voltage and Temperature (PVT) conditions, a maximum frequency of 166 MHz is achieved, which results in a throughput of 71 Mbit/s at 6 decoding iterations. The total area is 30 mm2 (Thul et al. 2005);
– the second decoder is a fully LTE-compliant Turbo decoder implemented in a 65 nm technology, achieving a maximum frequency of 450 MHz under worst-case PVT conditions. It yields a throughput of 2.15 Gbit/s at 6 decoding iterations and occupies an area of 7.7 mm2 (Ilnseher et al. 2012).
Three semiconductor technology nodes lie between the 180 nm and 65 nm technologies. We observe a throughput increase of 30×, although the improvement in frequency, which is limited by the critical path inside the MAP decoder, is only 3×. The improvement in area efficiency (throughput/area) is 118×. Hence, progress in microelectronics contributed a huge improvement in area efficiency, but much less to a frequency increase and, thus, to a throughput increase. The throughput increase mainly originates from code design, i.e. conflict-free Turbo code interleavers that enable efficient implementation with a high degree of parallelism, and from advanced algorithmic and architectural features such as next-iteration initialization, an optimized radix-4 kernel, re-computation and advanced normalization to reduce internal bit widths. We see that microelectronics could not keep up with the increased requirements coming from communication systems. Thus, the design of communication systems is no longer just a matter of spectral efficiency or BER/FER. When it comes to implementation, channel coding requires a cross-layer approach covering information theory, algorithms, parallel hardware architectures and semiconductor technology to achieve excellent communications performance, high throughput, low latency, low power consumption and high energy and area efficiency (Scholl et al. 2016; Kestel et al. 2018a).
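These factors can be verified directly from the figures quoted above; the symbols $T$, $f$ and $A$ for throughput, clock frequency and area are introduced here only for this back-of-the-envelope check:
\[
\frac{T_{65\,\mathrm{nm}}}{T_{180\,\mathrm{nm}}} = \frac{2150\ \text{Mbit/s}}{71\ \text{Mbit/s}} \approx 30, \qquad
\frac{f_{65\,\mathrm{nm}}}{f_{180\,\mathrm{nm}}} = \frac{450\ \text{MHz}}{166\ \text{MHz}} \approx 2.7,
\]
\[
\frac{T_{65\,\mathrm{nm}}/A_{65\,\mathrm{nm}}}{T_{180\,\mathrm{nm}}/A_{180\,\mathrm{nm}}}
= \frac{2150\ \text{Mbit/s}\,/\,7.7\ \text{mm}^2}{71\ \text{Mbit/s}\,/\,30\ \text{mm}^2}
\approx \frac{279}{2.4} \approx 118.
\]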
2.3. Towards 1 Tbit/s throughput decoders
A high degree of parallelism is a must for high-throughput decoders towards 1 Tbit/s. The achievable parallelism strongly depends on the properties of the decoding algorithm. For example, sub-functions of a decoding algorithm that have no mutual data dependencies can easily be parallelized by spatial parallelism. This is the case for the Belief Propagation (BP) algorithm used to decode LDPC codes: all check nodes can be processed independently of each other, and the same applies to the variable nodes. The situation is different for the MAP algorithm used in Turbo code decoding, where the calculation of a specific trellis step depends on the result of the previous trellis step. This results in a sequential behavior, and the different trellis steps