Multi-Processor System-on-Chip 2. Liliana Andrade

Multi-Processor System-on-Chip 2

in parallel. Such data dependencies exist also in iterative decoding algorithms between the various iterations. Hence, the bottleneck for high throughput is the part of an algorithm that cannot be parallelized. This part strongly limits the overall throughput that is also known as Amdahl’s law (Amdahl 2013). Other important features for efficient high-throughput implementations are locality and regularity to minimize interconnect. Interconnect can largely contribute to area, delay and energy consumption. Let us assume an FEC IP block of size 10 mm² is a feasible size. A signal has to travel at least 7 mm if it has to be transmitted from one corner to the other in this IP block. It will take three clock cycles in a 14 nm technology, assuming a frequency of 1 GHz. This interconnect delay largely decreases the throughput and increases power. Thus, data transfers can be as important as calculations. Although channel decoding algorithms for advanced codes such as Turbo- and LDPC codes are mainly data-flow dominated, they imply irregularity and restricted locality, since efficient channel coding for these coding schemes is grounded on some randomness: that is the interleaver for Turbo codes and the Tanner graph for LDPC codes, respectively. Any regularity and locality in these structures improves the implementation efficiency but has negative impact on the communications performance. Hence, there is a fundamental discrepancy between information theory and efficient implementations for high throughput that demands regularity and locality. In Table 2.1, important implementation properties for the different code classes are summarized.

Table 2.1. Implementation properties of various coding schemes

Code	Dec. Algorithms	Parallel vs. Serial	Locality	Compute Kernels	Transfers vs. Compute
Turbo	MAP	Serial/iterative	Low (interleaver)	Add-Compare-Select	Compute dominated
LDPC	Belief propagation	Parallel/iterative	Low (Tanner graph)	Min-Sum/add	Transfer dominated
Polar	Successive cancelation/list	Serial	High	Min-Sum/add/sorting	Balanced

Functional parallelism/pipelining is an efficient technique to speed up algorithms with data dependencies. We can “unroll” the iterations and insert buffers between the different pipeline stages in iterative decoding algorithms. In this way, the various iterations can be calculated in parallel, but on different data sets. Spatial and functional parallelization are implementation techniques only, i.e. they are not changing the algorithmic behavior. We can also modify the decoding algorithm itself to enable parallelism. Let us again consider the MAP algorithm. The data dependencies in the forward/backward recursion can be broken up by exploiting the fact that the trellis has a finite memory due to the underlying constraint length of the code. This property enables the splitting of the trellis into sub-trellises that can be processed independently of each other. However, some acquisition steps are mandatory at the border of the sub-trellises to get the correct probabilities. The length of the sub-trellises and the corresponding acquisition length impact the communications performance. In this case, an increased parallelism can have a negative impact on the communications performance.

The throughput of channel decoders depends on various communication and implementation parameters. Let N be the block size and R the rate of a channel code. Let I denote the number of iterations that a corresponding (iterative) decoder requires to decode a code block (I = 1 in the case of a non-iterative decoding algorithm). Let P denote the degree of achievable parallelism. P is defined as the ratio between the operations that are performed in one clock cycle, and the total number of operations necessary to perform one decoding iteration for a complete code block. Note that operation represents the “critical operation” in an algorithm: an operation can be a computation as well as a data transfer. Let f be the clock frequency. Using these parameters, the throughput (in information bits per second) of a FEC architecture can be roughly estimated by [2.1]

[2-1]

where ω is a normalized value between 0 and 1, that represents the timing overhead due to, for example, data distribution, interconnect, memory access conflicts, etc. The maximum clock frequency f is determined by the critical path in the compute kernels of the corresponding decoding algorithms and/or delay due to interconnect. Moreover, f is typically upper limited to 1 GHz due to power and design methodology constraints. The overhead ω increases with increasing N and P and is larger for decoding algorithms that have limited locality and are data-transfer dominated. The impact of ω on the throughput can be considered as an effective reduction of the clock frequency f or decrease in P, if additional clock cycles are mandatory, for example, due to memory conflicts, that cannot be hidden. If we are targeting 1 Tbit/s throughput with a frequency limit of 1 GHz, 1,000 bits have to be decoded in one single clock cycle that requires extreme parallelism. To achieve the highest throughput, P has to be maximized, and ω minimized. In the following sections, we will discuss the throughput maximization for the different coding schemes.

2.3.1. Turbo decoder

Among the three code classes, Turbo codes are most challenging with respect to high-throughput decoding. The decoding structure consists of two component decoders connected through an interleaver/deinterleaver. These component decoders work cooperatively by exchanging extrinsic information in an iterative loop. The respective bit-wise log-likelihood ratios are computed in a forward and a backward recursion on the trellis in the MAP-based component decoder. Thus, decoding is inherently serial on component decoder (MAP recursions) and on Turbo decoder level (iterative loop). The kernel operation in the MAP is the Add-Compare-Select (ACS) operation. Depending on the encoder’s memory depth, several ACS operations have to be executed to process a single trellis step in the forward and backward recursion, respectively. In total, 4 · K trellis computations and two interleavings have to be performed for one single Turbo decoding iteration. Hence, P is determined by the number of parallel trellis step computations performed by a decoding architecture. The maximum value of P for one decoding iteration, i.e. P = 1, can be achieved by an architecture if the forward/backward recursions of the MAP algorithm are unrolled and pipelined (functional parallelism), yielding the XMAP architecture. P = I if, in addition, iterations are unrolled and pipelined. For a detailed overview and discussion of such highly parallel decoders, we refer to (Weithoffer et al. 2018). Figure 2.2 (left) shows the layout of a Turbo decoder exploiting all levels of parallelism (P = I) and an optimized interleaver such that ω is very small (Weithoffer et al. 2018). The decoder achieves 102 Gbit/s information bit throughput for an information block length of 128 bit, on a low Vt 28 nm Fully Depleted Silicon on Insulator (FD-SOI) technology under worst-case PVT conditions, running with 800 MHz and performing four decoding iterations. Taking into account the code rate of R = 1/3, the corresponding coded throughput is 306 Gbit/s. The total area is 23.61 mm². Different colors represent the eight different MAP decoders originating from the four unrolled iterations (each iteration requires two MAP decoders). Although this architecture achieves an information bit throughput of more than 100 Gbit/s for small block sizes,

Скачать книгу