© Springer Nature Singapore Pte Ltd. and Science Press, Beijing, China  2019
Leibo Liu, Guiqiang Peng and Shaojun Wei, Massive MIMO Detection Algorithm and VLSI Architecture, https://doi.org/10.1007/978-981-13-6362-7_5

5. Architecture for Nonlinear Massive MIMO Detection

Leibo Liu1  , Guiqiang Peng2   and Shaojun Wei3  
(1)
Institute of Microelectronics, Tsinghua University, Beijing, China
(2)
Institute of Microelectronics, Tsinghua University, Beijing, China
(3)
Institute of Microelectronics, Tsinghua University, Beijing, China
 
 
Leibo Liu (Corresponding author)
 
Guiqiang Peng
 
Shaojun Wei

When an algorithm is mapped onto a hardware architecture, the performance of that architecture must be evaluated in terms of data throughput, area, power consumption, and delay, and issues such as resource reuse, sub-module design, and pipelining of the whole module must be studied in order to obtain designs of practical value. So far, only suboptimal linear data detection algorithms have been implemented on FPGA [1] or ASIC [2, 3]. Because of the inherent limitations of linear algorithms, their hardware results are not ideal, so it is necessary to design hardware architectures for nonlinear algorithms.

This chapter first introduces a VLSI architecture that we designed based on the CHOSLAR algorithm of Sect. 4.2. It serves as the preprocessor of a K-best detector for 64 QAM modulation in a 16 × 16 MIMO system [4]. To achieve an optimal trade-off among throughput, area, and power consumption, three types of systolic arrays with diagonal-priority scheduling are presented to perform the initial matrix computation, LR, and matrix inversion. An antenna-level pipelined scheme is also proposed to realize high data throughput and low latency in the sorted QRD and post-vector computations. Experimental results show that this architecture has clear advantages over existing designs in data throughput, latency, energy efficiency (throughput/power), and area efficiency (throughput/gate count).

Then, the corresponding systolic array designed for the TASER algorithm is presented. This systolic array achieves high-throughput data detection with a low silicon area [5]. The VLSI designs are implemented on a Xilinx Virtex-7 FPGA and in 40 nm CMOS technology, and their performance and computational complexity are compared in detail with those of other data detectors recently proposed for massive MU-MIMO wireless systems [1, 6–9].

5.1 CHOSLAR Hardware Architecture

5.1.1 VLSI Architecture

This section describes the VLSI implementation based on the CHOSLAR algorithm [4] presented in Sect. 4.2. The architecture is designed for 64 QAM 16 × 16 MIMO systems; the same circuit design method applies to larger scale MIMO systems. The BER simulation results in Sect. 4.2.5 show that three iterations are enough to achieve near-optimal detection accuracy with low resource consumption, so the number of LR iterations is set to 3.

Figure 5.1 shows the top-level block diagram of the CHOSLAR architecture. It is composed of five parts: the initialization unit, the sorted QRD unit, the PILR unit, the inversion unit, and the post-vector unit. These units are fully pipelined to achieve high data throughput. First, the initialization results (Gram matrix A and vector $$ \dot{\varvec{y}} $$) in Line 4 of Algorithm 4.2.2 are computed in the initialization unit. The Gram matrix A output by the initialization unit is used to perform the sorted QRD of Lines 5–18 of Algorithm 4.2.2, while the channel matrix H undergoes the column-exchange operation of Line 8 to obtain $$ \dot{\varvec{H}} $$. Next, matrix R is processed by PILR to obtain matrices $$ \widehat{\varvec{R}} $$ and $$ \widehat{\varvec{H}} $$, as in Lines 20–40 of Algorithm 4.2.2. Matrix $$ \widehat{\varvec{R}} $$ is one of the outputs of the CHOSLAR algorithm, and matrix $$ \widehat{\varvec{H}} $$ is transferred to the post-vector unit. Then, matrix $$ \widehat{\varvec{R}} $$ is inverted in the inversion unit, i.e., Lines 41–47 of Algorithm 4.2.2. Finally, in the post-vector unit, the final output $$ \hat{\varvec{y}} $$ of the CHOSLAR algorithm is obtained by matrix–vector multiplication using the outputs of the previous steps (matrices $$ \widehat{\varvec{H}} $$ and $$ \varvec{R}^{\text{inv}} $$ and vector $$ \dot{\varvec{y}} $$).
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig1_HTML.png
Fig. 5.1

Top-level modules of the CHOSLAR algorithm. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]
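As a functional point of reference for the data flow in Fig. 5.1, the following Python sketch computes, with ordinary matrix operations, what the five units jointly produce. The sorting and lattice-reduction steps are omitted (they permute columns and reshape R and H but do not change the shape of the data flow), the offset vector v and the shifter stage of Line 4 are treated abstractly, and all names are illustrative rather than taken from Algorithm 4.2.2.

```python
import numpy as np

def choslar_reference(H, y, v=None):
    """Functional reference of the top-level data flow in Fig. 5.1, with the
    sorting and lattice-reduction steps omitted.  Illustrative only; v is an
    assumed offset vector and the shifter of Line 4 is not modelled."""
    M, N = H.shape
    v = np.zeros(N, dtype=complex) if v is None else v
    A = H.conj().T @ H                      # initialization unit: Gram matrix
    y_dot = y - H @ v                       # initialization unit: Line 4 (shift omitted)
    R = np.linalg.cholesky(A).conj().T      # sorted QRD unit: A = R^H R, R upper triangular
    R_inv = np.linalg.inv(R)                # inversion unit (exact inverse of R)
    y_hat = R_inv.conj().T @ (H.conj().T @ y_dot)   # post-vector unit
    return R, R_inv, y_hat
```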

5.1.1.1 The Initialization Unit

The purpose of the initialization unit is to compute the Gram matrix A and the vector $$ \dot{\varvec{y}} $$ that are used by the subsequent units. To achieve high data throughput, this unit is built as a systolic array containing two types of PEs: N PE-As and $$ 1/2N^{2} - 1/2N $$ PE-Bs (i.e., 16 PE-As and 120 PE-Bs in a 16 × 16 MIMO system). The next unit (the sorted QRD unit) needs to compare the diagonal elements of matrix A, so the PE-As that compute these diagonal elements form the first block of each row, as shown in Fig. 5.2. In addition, N − 1 registers (REG) are used to store the elements of the channel matrix and to output the conjugate of each element. These REGs balance the timing of the pipeline, i.e., the different latencies of the PE-B computations. The first type of processing element, PE-A, computes the vector $$ \dot{\varvec{y}} $$ and the diagonal elements of matrix A. Every PE-A contains two types of ALU (two ALU-As and one ALU-B), three accumulators, two subtractors, and two shifters. One ALU-A performs the complex number multiplication (CM) of $$ H_{i,j}^{*} $$ and $$ H_{i,j} $$, and its results are accumulated over the cycles. The other ALU-A, combined with the ALU-B, performs the CM of matrix H and vector $$ \varvec{v} $$, and the ALU-A/ALU-B results are accumulated in the same way as the elements of matrix A. To execute the computation of Line 4 of Algorithm 4.2.2, the accumulated results are subtracted from the elements of y, and the shifters process the real and imaginary parts of the vector $$ \dot{\varvec{y}} $$. The second type of processing element, PE-B, computes the off-diagonal elements of matrix A. Each PE-B contains a CM unit consisting of an ALU-A and an ALU-B. To ensure that each processing element receives the correct operands, the values of Column i of $$ \varvec{H}^{\text{H}} $$ are delayed by i − 1 clock cycles. Each value of $$ \varvec{H}^{\text{H}} $$ is first transferred from a PE-A to the subsequent PE-Bs and then to a REG (operating by row), and the corresponding conjugate value is then transferred from the REG to the PE-Bs (operating by column).
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig2_HTML.png
Fig. 5.2

Architecture of the initialization unit. a architecture of the initialization unit, b internal architecture of PE-A, c internal architecture of PE-B. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]
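The arithmetic performed by the PE-A/PE-B array can be summarized in the following minimal NumPy sketch, in which the diagonal entries of A are computed first, as in the array. The offset vector v and the final shifter stage of Line 4 of Algorithm 4.2.2 are treated abstractly (they are assumptions made for illustration), and the names are not those of the algorithm.

```python
import numpy as np

def initialization_unit(H, y, v):
    """Arithmetic view of the initialization systolic array.  PE-A cells
    produce the diagonal of A = H^H H (first, so the sorted-QRD unit can
    start comparing diagonal entries early) together with y_dot; PE-B cells
    produce the off-diagonal entries of A."""
    N = H.shape[1]
    A = np.zeros((N, N), dtype=complex)
    # PE-A: diagonal entries h_i^H h_i, one per antenna
    for i in range(N):
        A[i, i] = np.vdot(H[:, i], H[:, i])
    # PE-B: off-diagonal entries h_i^H h_j (Hermitian symmetry reused)
    for i in range(N):
        for j in range(i + 1, N):
            A[i, j] = np.vdot(H[:, i], H[:, j])
            A[j, i] = np.conj(A[i, j])
    # PE-A also subtracts H v from y; the shifter that scales the real and
    # imaginary parts of y_dot is omitted here
    y_dot = y - H @ v
    return A, y_dot
```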

Similar systolic arrays are used to compute the Gram matrix in the linear detection architectures of Refs. [1, 3]. Unlike the systolic array designed in this section, however, the computation in those architectures does not begin with the PEs for the diagonal elements of matrix A [1, 3], so the diagonal elements of the Gram matrix are obtained late. The subsequent sorted QRD unit then has to wait longer for its input data, which lowers the data throughput of the overall architecture and increases the latency. Moreover, the PE-As in Refs. [1, 3] use bilateral inputs, whereas the architecture in this section adopts unilateral inputs, so the systolic arrays in Refs. [1, 3] require twice as many ports at the input end.

5.1.1.2 Sorted QRD Unit

After initialization, the output matrix A is transferred to the next unit, which performs the sorted QRD based on the Cholesky decomposition to obtain matrix R, as shown in Fig. 5.3. The channel matrix H is also updated in this unit. To achieve higher parallelism, the unit adopts a deep pipeline consisting of N similar processing elements, PE-Cs (for example, 16 PE-Cs in a 16 × 16 MIMO system). All PE-Cs are similar, but the number of ALU-Cs per PE-C decreases successively from the first PE-C to the last. Take the kth PE-C as an example. First, a comparator compares all diagonal elements of matrix A to find the smallest $$ A_{i,i} $$ and its location (LO) in A. Then, $$ A_{i,i} $$ is used to compute $$ R_{1,1} $$ in the square root (SQRT) element. According to the location of $$ A_{i,i} $$, matrix $$ \dot{\varvec{H}} $$ is obtained by exchanging Column i and Column k of matrix H, as in Line 8 of Algorithm 4.2.2. Next, the reciprocal (REC) unit computes the REC of $$ R_{1,1} $$, and the result is transferred to the first ALU-C and later used to obtain the kth column of R, as in Lines 9–12 of Algorithm 4.2.2. Finally, the elements of A are updated from matrices R and A and transferred to the (k + 1)th PE-C, as in Line 15 of Algorithm 4.2.2. The multiplier and subtractor in each ALU-C perform the updates of the matrix A elements in Lines 13–17 of Algorithm 4.2.2. Note that the diagonal elements of matrix A are computed first in this architecture and can be used directly by the next PE-C, which reduces the latency of the sorted QRD unit.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig3_HTML.png
Fig. 5.3

Sorted QRD architecture. a sorted QRD architecture, b internal architecture of PE-C. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]
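A minimal NumPy sketch of the sorted QRD based on the Cholesky decomposition is given below, assuming the standard pivoted-Cholesky formulation implied above. Each loop pass mirrors what one PE-C does (comparison, column exchange, SQRT/REC, and the rank-1 update of A); indices and names are illustrative, not taken from Algorithm 4.2.2.

```python
import numpy as np

def sorted_qrd_cholesky(A, H):
    """Sorted QRD via Cholesky factorization of the Gram matrix A = H^H H.
    Returns an upper-triangular R with A_perm = R^H R and the column-permuted
    channel matrix."""
    A = A.copy().astype(complex)
    H = H.copy().astype(complex)
    N = A.shape[0]
    R = np.zeros((N, N), dtype=complex)
    for k in range(N):
        # comparator: smallest remaining diagonal element of A and its location
        i = k + int(np.argmin(A.diagonal()[k:].real))
        # column exchange of H, symmetric exchange of A, and exchange of R
        H[:, [k, i]] = H[:, [i, k]]
        A[:, [k, i]] = A[:, [i, k]]
        A[[k, i], :] = A[[i, k], :]
        R[:, [k, i]] = R[:, [i, k]]
        # SQRT and reciprocal units: one row of R
        R[k, k] = np.sqrt(A[k, k].real)
        R[k, k + 1:] = A[k, k + 1:] / R[k, k]
        # ALU-C: rank-1 update of the trailing submatrix of A
        A[k + 1:, k + 1:] -= np.outer(np.conj(R[k, k + 1:]), R[k, k + 1:])
    return R, H
```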

VLSI architectures based on the GR algorithm and sorting have been proposed in Refs. [10, 11]. A flexible architecture for 64 QAM 1 × 1–4 × 4 MIMO systems is proposed in Ref. [10], and an 8 × 8 K-best signal detector with fewer sorting operations, combining LR and QRD, is proposed in Ref. [11]. The sorted QRD units in these architectures are built as long chains of paired CORDIC arrays, which results in excessive latency; this becomes even more problematic as the size of the MIMO system increases. The increased time for computing the sorted QRD, in turn, limits the overall detector's data throughput. The architecture proposed here does not compute the QRD directly, but realizes it through a series of deeply pipelined multiplications, to meet the high data throughput requirements of future wireless communication systems.

5.1.1.3 PILR Unit

As shown in Figs. 5.4 and 5.5, the PILR unit has two main functions: one is to update matrix R based on the Siegel condition, and the other is to perform the full size reduction of matrix R; matrices H and R are updated at the same time.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig4_HTML.png
Fig. 5.4

a Architecture for updating matrix R based on the Siegel condition, b internal architecture of PE-D. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]

../images/478198_1_En_5_Chapter/478198_1_En_5_Fig5_HTML.png
Fig. 5.5

a Architecture of full size reduction for matrix R, b internal architecture of PE-D, c internal architecture of PE-E. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]

In Fig. 5.4, the first part of the unit consists of 3N PE-Ds (for example, 48 PE-Ds in a 16 × 16 MIMO system). All PE-Ds are similar and operate in parallel. Take the kth PE-D as an example: its inputs are the kth and (k − 1)th rows of matrix R, which it updates. These two rows are then used as the input of the next PE-D, and REGs ensure the timing of the pipeline. The structure of the PE-D array is shown in Fig. 5.4. Each PE-D consists of three parts: the column updates of matrices $$ \dot{\varvec{H}} $$ and R, the column exchanges of matrices $$ \dot{\varvec{H}} $$ and R, and the update of matrix R. Before these operations, a dedicated unit processes the comparison of Line 23 of Algorithm 4.2.2, and the comparison results are used as the enable signals of the subsequent units. First, the parameter $$ \mu $$ is computed, with the division realized by a LUT. Then, the CM unit performs the multiplication of Line 25 of Algorithm 4.2.2, and the columns of the resulting matrices R and H are exchanged to obtain matrices $$ \widehat{\varvec{R}} $$ and $$ \widehat{\varvec{H}} $$. Next, the parameters a and b of matrix $$ \varvec{\theta} $$ are computed to update $$ \widehat{\varvec{R}} $$. During this process, real multiplications, real additions, and a LUT are used to realize the multiplication, SQRT, and REC operations on the elements of matrix $$ \widehat{\varvec{R}} $$. The multiplier, conjugate, and negation units are used to obtain $$ \varvec{\theta} $$, and the kth and (k − 1)th rows of matrix $$ \widehat{\varvec{R}} $$ are then updated.

In the second part of the PILR unit, there are $$ 1/2N^{2} - 1/2N $$ identical processing elements, PE-Es (e.g., 120 PE-Es in a 16 × 16 MIMO system). Figure 5.5 shows the structure of the size reduction for matrices $$ \widehat{\varvec{R}} $$ and $$ \widehat{\varvec{H}} $$. Take a single PE-E as an example. The PE-E first computes the parameter $$ \mu $$, then multiplies each element of the (N − 2)th column of matrices $$ \widehat{\varvec{R}} $$ and $$ \widehat{\varvec{H}} $$ by $$ \mu $$, and subtracts the result from the corresponding element of the (N − 1)th column. In the size reduction structure, the purpose of the first stage is to update the elements of the Nth column of matrices $$ \widehat{\varvec{R}} $$ and $$ \widehat{\varvec{H}} $$ (except the element in Row N − 1, Column N). The second stage uses the results of the first stage to update the elements of the Nth column (except the elements in Row N − 1, Column N and Row N − 2, Column N). In addition, the elements of the (N − 1)th and (N − 2)th columns are input to PE-Es to update the (N − 1)th column. The size reduction structure includes N − 1 stages, and the computation proceeds in the same way in all subsequent stages.
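The per-pair operation of a PE-D can be summarized as below. This is a sketch assuming a standard LLL-style formulation: a Siegel-type test, the size reduction of Line 25, the column exchange, and the 2 × 2 rotation θ that restores the triangular form. The parameter delta and the exact form of the test are common choices from the lattice-reduction literature, not necessarily those of Algorithm 4.2.2 line by line; the PE-E array then applies the same kind of size reduction across the remaining column pairs.

```python
import numpy as np

def pe_d_pass(R, H, k, delta=0.75):
    """One PE-D operation on columns (k-1, k) of the upper-triangular R and
    of the (column-exchanged) channel matrix H.  Sketch only."""
    R = R.copy(); H = H.copy()
    # Siegel-type test (assumed form): swap when the k-th diagonal is too small
    if delta * abs(R[k - 1, k - 1]) ** 2 > abs(R[k, k]) ** 2:
        # size reduction with mu = complex rounding of R[k-1,k] / R[k-1,k-1]
        z = R[k - 1, k] / R[k - 1, k - 1]
        mu = np.round(z.real) + 1j * np.round(z.imag)
        R[:, k] -= mu * R[:, k - 1]
        H[:, k] -= mu * H[:, k - 1]
        # column exchange of R and H
        R[:, [k - 1, k]] = R[:, [k, k - 1]]
        H[:, [k - 1, k]] = H[:, [k, k - 1]]
        # parameters a, b of theta: a 2x2 rotation that re-triangularizes R
        norm = np.sqrt(abs(R[k - 1, k - 1]) ** 2 + abs(R[k, k - 1]) ** 2)
        a = np.conj(R[k - 1, k - 1]) / norm
        b = np.conj(R[k, k - 1]) / norm
        theta = np.array([[a, b], [-np.conj(b), np.conj(a)]])
        R[[k - 1, k], :] = theta @ R[[k - 1, k], :]   # rows k-1 and k updated
    return R, H
```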

VLSI architectures containing similar LR procedures have been proposed before. In Ref. [11], LR is implemented on three pairs of CORDIC processors using an odd–even algorithm, and the detector achieves near-ML accuracy for a 64 QAM 8 × 8 MIMO system; several CORDIC pairs are also designed for QRD and LR. According to the timing diagram, this part accounts for the majority of the latency, so the data throughput of the overall detector decreases. The PILR unit proposed in this section, in contrast, implements LR with two structures, one for the Siegel condition and the other for the size reduction, both designed as systolic arrays. All intermediate data are consumed directly by the next PE-D and all computations are deeply pipelined, so the hardware utilization and data throughput are higher than those of the CORDIC-based architecture in Ref. [11].

5.1.1.4 Inversion Unit

A systolic array is designed for the inversion of matrix $$ \widehat{\varvec{R}} $$ in the CHOSLAR architecture, as shown in Fig. 5.6. The systolic array has two types of PEs: N PE-Fs and $$ 1/2N^{2} - 1/2N $$ PE-Gs (for example, 16 PE-Fs and 120 PE-Gs in a 16 × 16 MIMO system). The PE-Fs and PE-Gs compute the diagonal and off-diagonal elements of matrix $$ \varvec{R}^{\text{inv}} $$, respectively. A PE-F forms the first PE of each row because computing the off-diagonal elements of matrix $$ \varvec{R}^{\text{inv}} $$ requires the diagonal elements. To ensure that each processing element receives the correct operands, the values of Column i of $$ \widehat{\varvec{R}} $$ are delayed by i − 2 clock cycles, and each value of matrix $$ \widehat{\varvec{R}} $$ is passed from the PE-F to the subsequent PE-Gs (row by row). Take the third PE-F and PE-G as an illustration. The PE-F consists of a REC unit, a REG, and a CM unit, and the PE-G consists of two CM units, a REG, and an adder. A diagonal element is computed by the REC unit and then stored in the REG, where it is reused many times by the PE-Gs. In the next cycle, the off-diagonal elements of matrix $$ \widehat{\varvec{R}} $$ are transferred to the PE-F and the diagonal elements of $$ \widehat{\varvec{R}} $$ are transferred to the PE-Gs in the same row; the PE-F multiplies the diagonal element of $$ \widehat{\varvec{R}} $$ by the off-diagonal element of $$ \varvec{R}^{\text{inv}} $$ and passes the result to the PE-G at its lower right. The PE-G computes the off-diagonal elements of $$ \varvec{R}^{\text{inv}} $$ using the result from the upper-left PE, the diagonal elements of $$ \widehat{\varvec{R}} $$ from the left PE, and the diagonal elements of $$ \varvec{R}^{\text{inv}} $$, as in Line 45 of Algorithm 4.2.2.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig6_HTML.png
Fig. 5.6

a Architecture of the inversion unit, b internal architecture of PE-F, c internal architecture of PE-G. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]
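Functionally, the PE-F/PE-G array performs ordinary back substitution on an upper-triangular matrix, with the diagonal reciprocals computed first so that they can be reused. A short sketch (illustrative names, no fixed-point effects):

```python
import numpy as np

def invert_upper_triangular(R):
    """Exact inversion of an upper-triangular matrix, organized the way the
    PE-F / PE-G array works: diagonal reciprocals first (PE-F), then the
    off-diagonal entries of R_inv column by column (PE-G)."""
    N = R.shape[0]
    R_inv = np.zeros_like(R, dtype=complex)
    # PE-F: reciprocal of each diagonal element, computed first
    for i in range(N):
        R_inv[i, i] = 1.0 / R[i, i]
    # PE-G: off-diagonal entries, reusing the diagonal results
    for j in range(1, N):
        for i in range(j - 1, -1, -1):
            s = R[i, i + 1:j + 1] @ R_inv[i + 1:j + 1, j]
            R_inv[i, j] = -R_inv[i, i] * s
    return R_inv
```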

Some previously proposed VLSI architectures for linear detectors, such as those in Refs. [1, 3], also contain inversion units. Those inversion units are implemented approximately, based on the NSA, although they resemble the inversion unit described in this section in being based on systolic arrays. The proposed architecture, however, inverts matrix R exactly, whereas the architectures in Refs. [1, 3] introduce an approximation error. In addition, the systolic array of this design first inverts the diagonal elements of R in the PE-Fs, which form the first column of the unit, so that the results can be used immediately by the PE-Gs; in Refs. [1, 3] the diagonal elements are available only after a long delay because the computation does not start with the PEs for these elements. Moreover, the architectures in Refs. [1, 3] require more ports than the architecture in this section.

5.1.1.5 Post-Vector Unit

The post-vector unit performs the multiplications of matrix $$ (\varvec{R}^{\text{inv}} )^{\text{H}} $$, matrix $$ \widehat{\varvec{H}}^{\text{H}} $$, and vector $$ \dot{\varvec{y}} $$ in Line 49 of Algorithm 4.2.2. Matrix $$ \widehat{\varvec{H}}^{\text{H}} $$ is the output of the PILR unit, and matrix $$ (\varvec{R}^{\text{inv}} )^{\text{H}} $$ is obtained from the inversion unit. First, matrix $$ \widehat{\varvec{H}}^{\text{H}} $$ is multiplied by vector $$ \dot{\varvec{y}} $$, and the result is then multiplied by matrix $$ (\varvec{R}^{\text{inv}} )^{\text{H}} $$ to obtain the final output $$ \hat{\varvec{y}} $$. In addition, since matrices $$ (\varvec{R}^{\text{inv}} )^{\text{H}} $$ and $$ \widehat{\varvec{H}}^{\text{H}} $$ become available one after the other, the two matrix–vector multiplications can be computed in turn, so the resources for matrix–vector multiplication can be reused. The structure of the post-vector unit is shown in Fig. 5.7. The unit contains N PE-Hs, each of which computes one element of the result vector and contains a CM unit for complex multiplication and an ACC unit for accumulation.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig7_HTML.png
Fig. 5.7

a Architecture of the post-vector unit, b internal architecture of PE-H. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]
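Because the two matrix–vector products are computed one after the other, the same PE-H resources serve both passes. Functionally, the unit computes the following (a sketch with illustrative names):

```python
import numpy as np

def post_vector(H_hat, R_inv, y_dot):
    """Post-vector unit: two matrix-vector products computed in turn, so the
    same multiply-accumulate array (the N PE-Hs) can be reused for both."""
    t = H_hat.conj().T @ y_dot        # first pass through the PE-H array
    y_hat = R_inv.conj().T @ t        # second pass, reusing the same PEs
    return y_hat
```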

5.1.2 Implementation Results and Comparison

The VLSI architecture is laid out in TSMC 65 nm 1P8M technology. This section presents the ASIC implementation results and compares them with those of other existing nonlinear detectors. Figure 5.8 shows the ASIC layout. Table 5.1 gives a detailed comparison of the hardware features of the CHOSLAR-based architecture, using post-layout simulation results, with those in Refs. [5, 10–15], which are existing ASIC architectures for nonlinear detection preprocessing in small-scale or high-order MIMO systems.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig8_HTML.png
Fig. 5.8

ASIC layout of CHOSLAR-based architecture. © [2018] IEEE.

Reprinted, with permission, from Ref. [4]

Table 5.1

Comparison of ASIC implementation results with other MIMO detectors

| Parameter | Ref. [11] | Ref. [10] | Ref. [15] | Ref. [12] | Ref. [14] | Ref. [5] (4 × 4) | Ref. [5] (8 × 8) | Ref. [5] (16 × 16) | This section (without LR) | This section (with LR) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Antenna size | Complex 8 × 8 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 8 × 8 | Complex 16 × 16 | Complex 16 × 16 | Complex 16 × 16 |
| Modulation mode | 64 QAM | 64 QAM | 64 QAM | 64 QAM | 256 QAM | QPSK | QPSK | QPSK | 64 QAM | 64 QAM |
| Algorithm | GR + LR | Sorted GR | GS | Sorted QR | WLD | TASER | TASER | TASER | Sorted QR | Sorted QR + LR |
| SNR loss/dB | 1.2 | 3.37 | 3.88 | 3.37 | 4.96 | # | # | # | 3.37 | 1.44 |
| Process/nm | 90 | 65 | 90 | 65 | 90 | 40 | 40 | 40 | 65 | 65 |
| Voltage/V | 1.1 | 1.2 | 1 | 1.2 | # | 1.1 | 1.1 | 1.1 | 1.2 | 1.2 |
| Frequency/MHz | 65 | 550 | 114 | 500 | 275 | 598 | 560 | 454 | 588 | 588 |
| Throughput/(Mbit/s) | 585 | 2640 | 684 | 367.88 | 733 | 298 | 374 | 363 | 3528 | 3528 |
| Delay/µs | 2.1 | 1.26 | 0.28 | # | # | # | # | # | 0.7 | 1.2 |
| Number of gates/kG | 612 | 943 | 505 | 1055 | 1580 | 148 | 471 | 1428 | 3720 | 5681 |
| Power/mW | 37.1 | 184 | 56.8 | 315.36 | 320.56 | 41 | 87 | 216 | 1831 | 2513 |
| Energy efficiency/[Gbit/(s W)] | 15.77 | 14.34 | 12.04 | 1.17 | 2.29 | 7.27 | 4.30 | 1.68 | 1.93 | 1.40 |
| Area efficiency/[Mbit/(s kG)] | 0.96 | 2.80 | 0.73 | 0.348 | 0.46 | 2.01 | 0.79 | 0.25 | 0.950 | 0.62 |
| Normalized energy efficiency/[Gbit/(s W)] | 6.35③④ | 0.89 | 1.00③④ | 0.07 | 0.27③④ | 0.14③④ | 0.34③④ | 0.54 | 1.93 | 1.40 |
| Normalized area efficiency/[Mbit/(s kG)] | 0.33③④ | 0.18 | 0.12③④ | 0.02 | 0.04③④ | 0.08③④ | 0.12③④ | 0.16 | 0.95 | 0.62 |

① The SNR loss (at a target BER of $$ 10^{-5} $$) compared with ML detection in a 64 QAM 16 × 16 MIMO system

② Energy efficiency is defined as throughput/power, and area efficiency is defined as throughput/gate number

③ The process is normalized to 65 nm CMOS technology, following $$ f \sim s,  P_{\text{dyn}} \sim \left( {1/s} \right)\left( {V_{\text{dd}} /V_{\text{dd}}^{'} } \right)^{2} $$

④ Scaled to the 16 × 16 MIMO configuration: energy efficiency × $$ \left( {N \times N} \right)/\left( {16 \times 16} \right) $$, area efficiency × $$ \left( {N \times N} \right)/\left( {16 \times 16} \right) $$

The architecture of this design achieves a data throughput of 3.528 Gbit/s, which is 1.34 times, 9.59 times, 4.81 times, and 5.16 times that of the small-scale MIMO designs in Refs. [10, 12, 14, 15], respectively. Higher throughput is one of the requirements of future wireless communication systems; however, high data throughput generally comes at the cost of large hardware resources and power consumption, so the area and power consumption are also compared with those of recent designs. It should be noted that the CHOSLAR algorithm is designed for high-order MIMO systems, while the architectures in Refs. [10–15] incur higher resource and power consumption in high-order systems. In addition, these architectures use different technologies and MIMO configurations. To ensure a fair comparison, the energy efficiency and area efficiency are normalized to 65 nm technology and a 16 × 16 MIMO configuration, as shown in Table 5.1; this normalization method is widely used when comparing hardware implementations across technologies and MIMO configurations, for example in Refs. [5, 11, 12, 15–17]. The CHOSLAR architecture achieves an energy efficiency of 1.40 Gbit/(s W), which is 1.57 times, 20.00 times, 5.19 times, and 1.40 times that in Refs. [10, 12, 14, 15], respectively. Its area efficiency is 0.62 Mbit/(s kG), which is 3.44 times, 31.00 times, 15.5 times, and 5.17 times that in Refs. [10, 12, 14, 15], respectively. In addition, the architectures in Refs. [10, 12, 14, 15] do not perform LR, whereas this design, even without LR, achieves an energy efficiency of 1.93 Gbit/(s W) and an area efficiency of 0.95 Mbit/(s kG). The proposed architecture therefore has clear advantages in both energy efficiency and area efficiency. In terms of latency, CHOSLAR achieves a latency of 0.7 µs without LR, which is 55.56% of the latency in Ref. [10]. The latency in Ref. [15] is slightly lower than that of CHOSLAR, but CHOSLAR has significant advantages in energy efficiency and area efficiency. The architecture in Ref. [11] has higher energy efficiency than CHOSLAR but lower area efficiency, and its data throughput is computed under the assumption that the channel conditions remain constant, so that only one MIMO detection preprocessing pass is performed; for a fair comparison, its data throughput and energy efficiency should be reduced accordingly. Furthermore, compared with Ref. [11], the architecture of this design achieves 6.03 times the data throughput with only 57.14% of the latency. Meanwhile, CHOSLAR is designed for a 16 × 16 MIMO configuration, whereas Refs. [10, 11, 14, 15] target 4 × 4 or 8 × 8 configurations; for higher order MIMO configurations, the latency in Refs. [10, 11, 14, 15] would increase significantly.

The architecture in Ref. [5] is suitable for nonlinear detection in high-order MIMO systems. Compared with the three MIMO configurations reported in Ref. [5], the data throughput of CHOSLAR is 11.84 times, 9.43 times, and 9.72 times higher, respectively, and the low data throughput in Ref. [5] may not meet the data rate requirements of future wireless communication systems. The normalized energy efficiency of CHOSLAR is 10.00 times, 4.12 times, and 2.59 times that of the corresponding configurations in Ref. [5], and its area efficiency is 7.75 times, 5.17 times, and 3.88 times higher. Moreover, the architectures in Ref. [5] support only BPSK and QPSK, not higher order modulation, which is another disadvantage for future wireless systems. The architectures in Refs. [1–3] are designed for linear detection algorithms and can achieve near-MMSE performance; such linear detectors suffer a nonnegligible loss in detection accuracy, especially when the number of user antennas is comparable to the number of base station antennas, which is why they are not included in Table 5.1.

5.2 TASER-Based Hardware Architecture

5.2.1 Architecture Overview

This section presents a systolic VLSI architecture with low hardware complexity and high data throughput for the TASER algorithm [5]. Figure 5.9 shows the triangular systolic array, composed of N(N + 1)/2 PEs that mainly perform MAC operations. The PE at position (i, j) stores $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$ and $$ V_{i,j}^{\left( t \right)} $$. All PEs receive data from the column-broadcast units (CBUs) and the row-broadcast units (RBUs).
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig9_HTML.png
Fig. 5.9

Top-level modules of the TASER algorithm. © [2018] IEEE.

Reprinted, with permission, from Ref. [5]

In the kth cycle of the tth TASER iteration, the ith RBU sends $$ \tilde{L}_{i,k}^{{\left( {t - 1} \right)}} $$ to all PEs in the ith row, while the jth CBU sends $$ \widehat{T}_{k,j} $$ to all PEs in the jth column. The matrix $$ \widehat{\varvec{T}} = 2\tau \tilde{\varvec{T}} $$ is assumed to have been computed in the preprocessing stage and stored in memory. The value $$ \tilde{L}_{i,k}^{{\left( {t - 1} \right)}} $$ is taken from the PE in Row i, Column k, i.e., PE(i,k), and sent to the other PEs in the same row. After receiving data from the CBUs and RBUs, each PE performs MAC operations until the product $$ \tilde{\varvec{L}}^{{\left( {t - 1} \right)}} \widehat{\varvec{T}} $$ of Line 4 of Algorithm 4.3.1 has been computed. To include the subtraction of Line 4, the operation $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} - \tilde{L}_{i,1}^{{\left( {t - 1} \right)}} \widehat{T}_{1,j} $$ is performed in the first cycle of each TASER iteration and the result is stored in the accumulator; in subsequent cycles, $$ \tilde{L}_{i,k}^{{\left( {t - 1} \right)}} \widehat{T}_{k,j} \left( {2 \le k \le N} \right) $$ is subtracted from the accumulator in turn. Because matrix $$ \tilde{\varvec{L}} $$ is lower triangular, $$ \tilde{L}_{{i,k^{{\prime }} }} = 0 $$ for $$ i < k^{{\prime }} $$, so the subtraction of $$ \tilde{L}_{{i,k^{\prime}}}^{{\left( {t - 1} \right)}} \widehat{T}_{{k^{\prime},j}} $$ can be skipped. The value $$ V_{i,j}^{\left( t \right)} $$ in a PE of the ith row of the systolic array is obtained after i cycles, so the matrix $$ \varvec{V}^{\left( t \right)} $$ of Line 4 of Algorithm 4.3.1 is obtained after N cycles.

Figure 5.10 shows an example of the TASER array for N = 3. In the first cycle of the tth iteration, PE(1,1) takes $$ \tilde{L}_{1,1}^{{\left( {t - 1} \right)}} $$ and $$ \widehat{T}_{1,1} $$ as inputs and computes $$ V_{1,1}^{\left( t \right)} = \tilde{L}_{1,1}^{{\left( {t - 1} \right)}} - \tilde{L}_{1,1}^{{\left( {t - 1} \right)}} \widehat{T}_{1,1} $$. At the same time, the PEs in the second row perform their first MAC operation and store the values $$ \tilde{L}_{2,j}^{{\left( {t - 1} \right)}} - \tilde{L}_{2,1}^{{\left( {t - 1} \right)}} \widehat{T}_{1,j} $$ in their accumulators. In the second cycle, the PEs of the second row receive $$ \tilde{L}_{2,2}^{{\left( {t - 1} \right)}} $$ through the RBU and $$ \widehat{T}_{2,j} $$ through the CBU, completing the computation of $$ V_{2,j}^{\left( t \right)} = \tilde{L}_{2,j}^{{\left( {t - 1} \right)}} - \tilde{L}_{2,1}^{{\left( {t - 1} \right)}} \widehat{T}_{1,j} - \tilde{L}_{2,2}^{{\left( {t - 1} \right)}} \widehat{T}_{2,j} $$. Meanwhile, PE(1,1) uses its MAC unit to compute the square of $$ V_{1,1}^{\left( t \right)} $$ and transfers the result to the next PE in the same column in the next cycle. In the third cycle, PE(2,1) can use $$ V_{1,1}^{{\left( t \right)^{2} }} $$ from PE(1,1) and the internally stored $$ V_{2,1}^{\left( t \right)} $$: it uses its MAC unit to compute the square of $$ V_{2,1}^{\left( t \right)} $$ and add it to $$ V_{1,1}^{{\left( t \right)^{2} }} $$ (Fig. 5.10c). The result is the sum of the squares of the first two elements of the first column of $$ \varvec{V}^{\left( t \right)} $$. In the next cycle, this result is sent to the next PE in the same column [in this case, PE(3,1)], and the same steps are repeated. The procedure continues in all columns until all PEs have completed the corresponding computation. The squared two-norm of each column of $$ \varvec{V}^{\left( t \right)} $$ is therefore obtained after N + 1 clock cycles, i.e., only one cycle after $$ \varvec{V}^{\left( t \right)} $$ itself is completed. In the (N + 2)th cycle, the squared two-norm of the jth column is transferred to the scale unit, where the inverse SQRT is computed and the result is multiplied by $$ D_{j,j} $$; this operation takes two cycles, so the result is available in the (N + 4)th cycle. In the (N + 4)th cycle, the scaling factor $$ D_{j,j} /\left\| {\varvec{v}_{j} } \right\|_{2} $$ (where $$ \varvec{v}_{j} $$ is the jth column of $$ \varvec{V}^{\left( t \right)} $$) is transferred through the CBU to all PEs in the same column, as shown in Fig. 5.10d. Then, in the (N + 5)th and last cycle of the iteration, all PEs multiply the received scaling factor by $$ V_{i,j}^{\left( t \right)} $$ to obtain the $$ \tilde{L}_{i,j}^{\left( t \right)} $$ for the next iteration, completing the proximal operator of Line 5 of Algorithm 4.3.1. The operation of Line 2 of Algorithm 4.3.1 must be performed before the next symbol is decoded; this is realized through the CBUs, which transfer $$ D_{j,j} $$ to the diagonal PEs while the off-diagonal PEs clear $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$ from their internal registers.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig10_HTML.png
Fig. 5.10

Different cycles of the tth TASER iteration for an array with N = 3. a first cycle, b second cycle, c third cycle, d seventh cycle. © [2018] IEEE.

Reprinted, with permission, from Ref. [5]
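One complete TASER iteration, as scheduled on the triangular array, can be modelled functionally as below. Only the lower-triangular entries are tracked, as in the array, and $$ \widehat{\varvec{T}} = 2\tau \tilde{\varvec{T}} $$ is assumed to be precomputed; the hardware spreads the same arithmetic over N + 5 cycles. This is a behavioural sketch, not the cycle-accurate schedule.

```python
import numpy as np

def taser_iteration(L_prev, T_hat, D):
    """One TASER iteration (Lines 4-5 of Algorithm 4.3.1) on the lower
    triangle: matrix multiply-subtract followed by column-wise rescaling
    (the proximal operator)."""
    N = L_prev.shape[0]
    # Line 4: V = L_prev - L_prev @ T_hat on the lower triangle; products with
    # the zero upper part of L_prev are skipped by the PEs
    V = np.tril(L_prev - L_prev @ T_hat)
    # Line 5: scale column j to length D[j, j]; the PE chain accumulates the
    # squared column norm and the scale unit supplies D[j, j] / ||v_j||
    L_next = np.zeros_like(V)
    for j in range(N):
        col = V[j:, j]
        L_next[j:, j] = (D[j, j] / np.linalg.norm(col)) * col
    return L_next
```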

5.2.2 PE

The systolic array uses two types of PEs, off-diagonal PEs and diagonal PEs (as shown in Fig. 5.11), both supporting the four operating modes listed below; a behavioural sketch of these modes follows the list.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig11_HTML.png
Fig. 5.11

Architecture details of the TASER algorithm. a the j(th) scale unit, b the j(th) CBU, c non diagonal PE, d diagonal PE. © [2018] IEEE.

Reprinted, with permission, from Ref. [5]

  1. 1.

    Initialization of $$ \tilde{\varvec{L}} $$: This operation mode is used for computation of the second line of Algorithm 4.3.1. All off-diagonal PEs need to be initialized to $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} = 0 $$, and diagonal PEs need to be initialized to $$ D_{j,j} $$ received from CBU.

     
  2. 2.

    Matrix multiplication: This operating mode is used for the computation of Line 4 of Algorithm 4.3.1. The multiplier operates on the inputs from the broadcast signals; in the first cycle of the matrix–matrix multiplication the multiplier output is subtracted from $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$, and in the remaining cycles it is subtracted from the accumulator. Each PE stores its own $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$ value. In the kth cycle, all PEs in Column k use the internally stored $$ \tilde{L}_{i,k}^{{\left( {t - 1} \right)}} $$ as the multiplier input rather than the signal from the RBU.

     
  3. 3.

    Computation of the squared two-norm: This operating mode is used for the computation of Line 5 of Algorithm 4.3.1. Both multiplier inputs are $$ V_{i,j}^{\left( t \right)} $$. For a diagonal PE, the result is transferred to the next PE in the same column. For an off-diagonal PE, the multiplier output is added to $$ \sum\nolimits_{n = j}^{i - 1} {\left( {V_{n,j}^{\left( t \right)} } \right)^{2} } $$ received from the previous PE in the same column, and the result $$ \sum\nolimits_{n = j}^{i} {\left( {V_{n,j}^{\left( t \right)} } \right)^{2} } $$ is transferred to the next PE; if the PE is in the last row, the result is sent to the scale unit.

     
  4. 4.

    Scaling: This operating mode is used for Line 5 of Algorithm 4.3.1. The multiplier inputs are $$ V_{i,j}^{\left( t \right)} $$ and the scaling factor $$ D_{j,j} /\left\| {\varvec{v}_{j} } \right\|_{2} $$ computed by the scale unit and received from the CBU. The result $$ \tilde{L}_{i,j}^{\left( t \right)} $$ is stored in each PE as the $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$ of the next iteration.
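The four modes can be summarized by the following behavioural sketch of an off-diagonal PE; the diagonal PE differs only in that mode 1 loads $$ D_{j,j} $$ and mode 3 starts the squared-norm chain instead of extending it. Fixed-point effects are ignored and all names are illustrative.

```python
class OffDiagonalPE:
    """Behavioural sketch of an off-diagonal PE at position (i, j), i > j."""

    def __init__(self):
        self.L = 0.0      # stored L~_{i,j}^{(t-1)}
        self.acc = 0.0    # accumulator holding partial sums / V_{i,j}^{(t)}

    def init_L(self):                               # mode 1: Line 2 of Algorithm 4.3.1
        self.L = 0.0

    def mac(self, row_in, col_in, first_cycle):     # mode 2: Line 4, one cycle
        if first_cycle:
            self.acc = self.L - row_in * col_in
        else:
            self.acc -= row_in * col_in

    def norm_step(self, partial_from_above):        # mode 3: squared-norm chain
        return partial_from_above + self.acc ** 2

    def scale(self, factor_from_cbu):               # mode 4: proximal scaling
        self.L = factor_from_cbu * self.acc         # becomes L~_{i,j}^{(t)}
        return self.L
```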

     

5.2.3 Implementation Details

To demonstrate the effectiveness of the TASER algorithm and the proposed systolic array, FPGA and ASIC designs with different array sizes N are implemented in this section. All designs are described and optimized in Verilog at the register-transfer level. The implementation details are as follows.
  1. 1.

    Fixed-point design parameters: To minimize hardware complexity while maintaining near-optimal BER performance, a 14-bit fixed-point design is adopted. All PEs except those in the last row of the triangular array use 8 fraction bits to represent $$ \tilde{L}_{i,j}^{{\left( {t - 1} \right)}} $$ and $$ V_{i,j}^{\left( t \right)} $$, and the PEs in the last row use 7 fraction bits. $$ \tilde{L}_{N,N} $$ is stored in a register with 5 fraction bits.

     
  2. 2.

    Computation of the inverse SQRT: The inverse SQRT in the scale unit is realized by a LUT synthesized as random logic. Each LUT contains $$ 2^{11} $$ entries, and each entry is a 14-bit word, of which 13 bits are fraction bits (see the sketch after this list).

     
  3. 3.

    $$ \widehat{\varvec{T}} $$ matrix memory: For the FPGA designs, the $$ \widehat{T}_{k,j} $$ storage and the LUTs are implemented with distributed random access memory (RAM), i.e., without using block RAM. For the ASIC designs, latch arrays built from standard cells are used to reduce the circuit area [18].

     
  4. 4.

    RBU and CBU design: The RBUs are implemented differently for the FPGA and ASIC designs. For the FPGA, the RBU of the ith row is an i-input multiplexer that receives data from all PEs in its row and sends the appropriate $$ \tilde{L}_{i,k}^{{\left( {t - 1} \right)}} $$ back to them. For the ASIC, the RBU consists of a bidirectional bus: each PE in the row drives the bus in turn through a tri-state buffer, and all PEs in the same row read data from it. The CBUs are designed in a similar way, with multiplexers for the FPGA design and buses for the ASIC design. In all target architectures, the output of the ith RBU is connected to the PEs of the ith row, so for larger values of i the path has a larger fan-out and eventually becomes the critical path of a large systolic array; the same holds for the CBUs. To shorten the critical path, pipeline registers are placed at the input and output of each broadcast unit. Although this adds two extra cycles to each TASER iteration, the overall data throughput increases because of the significantly higher clock frequency.
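The inverse-SQRT table of item 2 can be generated as follows; the mapping of the squared column norm onto the 11-bit address is an assumption made for illustration, since the text does not specify the input quantization.

```python
import math

def build_inv_sqrt_lut(addr_bits=11, word_bits=14, frac_bits=13):
    """Inverse-square-root look-up table like the one in the scale unit:
    2**11 entries, each a 14-bit word with 13 fraction bits."""
    n_entries = 1 << addr_bits
    max_word = (1 << word_bits) - 1
    lut = [0] * n_entries                     # entry 0 left at 0 (input 0 is undefined)
    for addr in range(1, n_entries):
        x = addr / n_entries                  # assumed input range (0, 1)
        q = round((1.0 / math.sqrt(x)) * (1 << frac_bits))
        lut[addr] = min(q, max_word)          # saturate to the 14-bit word
    return lut

# usage: approximate 1/sqrt(0.64)
lut = build_inv_sqrt_lut()
print(lut[int(0.64 * len(lut))] / (1 << 13))  # ~1.25
```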

     

5.2.4 FPGA Implementation Result

Several FPGA designs with different systolic array sizes N were implemented on a Xilinx Virtex-7 XC7VX690T FPGA, where N = 9, 17, 33, 65. The implementation results are shown in Table 5.2. As expected, the resource utilization tends to grow with the square of the array size N. For the N = 9 and N = 17 arrays, the critical path is located in the MAC unit of the PEs; for the N = 33 and N = 65 arrays, it is located in the row-broadcast multiplexers, which limits the data throughput at N = 65.
Table 5.2

FPGA implementation results of TASER with different array sizes

| Matrix size | N = 9 | N = 17 | N = 33 | N = 65 |
| --- | --- | --- | --- | --- |
| Number of BPSK users/time slot | 8 | 16 | 32 | 64 |
| Number of QPSK users/time slot | 4 | 8 | 16 | 32 |
| Resource quantity | 1467 | 4350 | 13,787 | 60,737 |
| LUT resource | 4790 | 13,779 | 43,331 | 149,942 |
| FF resource | 2108 | 6857 | 24,429 | 91,829 |
| DSP48 | 52 | 168 | 592 | 2208 |
| Maximum clock frequency/MHz | 232 | 225 | 208 | 111 |
| Minimum delay/clock cycles | 16 | 24 | 40 | 72 |
| Maximum throughput/(Mbit/s) | 116 | 150 | 166 | 98 |
| Power estimation/W | 0.6 | 1.3 | 3.6 | 7.3 |

① Power estimation at the maximum clock frequency and a supply voltage of 1.0 V
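The throughput figures in Table 5.2 are consistent with a simple relation: the number of bits detected per pass (one per BPSK user, two per QPSK user) times the clock frequency, divided by the minimum latency in clock cycles. A quick check against the N = 17 column (the helper and its argument names are illustrative):

```python
def taser_throughput_mbps(clock_mhz, bpsk_users, latency_cycles):
    """Throughput = bits per detection pass * clock / latency in cycles.
    For BPSK, every user contributes one bit per pass."""
    return clock_mhz * bpsk_users / latency_cycles

print(taser_throughput_mbps(225, 16, 24))   # ~150 Mbit/s, matching the N = 17 column
```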

Table 5.3 compares the TASER design with several existing massive MIMO data detectors, namely the CGLS detector [6], the NSA detector [1], the OCD detector [7], and the GAS detector [8]. All of these designs target a 128 × 8 massive MIMO system and are implemented on the same FPGA. TASER achieves a data throughput comparable to that of CGLS and GAS and a significantly lower latency than NSA and OCD. In terms of hardware efficiency (measured as data throughput per FPGA LUT), TASER is similar to CGLS, NSA, and GAS, but lower than OCD. For the 128 × 8 massive MIMO system, all detectors achieve near-ML performance; however, for a 32 × 32 massive MIMO system (Fig. 4.14a, b), TASER achieves better BER performance than all the reference algorithms. On the other hand, the CGLS, NSA, OCD, and GAS detectors support 64 QAM modulation, whereas TASER supports only BPSK and QPSK, and the data throughput is linearly proportional to the number of bits per symbol; in this respect, TASER has no advantage in data throughput and hardware efficiency.
Table 5.3

Comparison of implementation results of different detectors for 128 × 8 massive MIMO systems

| Detection algorithm | TASER | TASER | CGLS [6] | NSA [1] | OCD [7] | GAS [8] |
| --- | --- | --- | --- | --- | --- | --- |
| BER | Near-ML | Near-ML | Approximate MMSE | Approximate MMSE | Approximate MMSE | Approximate MMSE |
| Modulation mode | BPSK | QPSK | 64 QAM | 64 QAM | 64 QAM | 64 QAM |
| Preprocessing | Not included | Not included | Included | Included | Included | Included |
| Maximum number of iterations $$ t_{\text{max}} $$ | 3 | 3 | 3 | 3 | 3 | 1 |
| Resource quantity | 1467 (1.35%) | 4350 (4.02%) | 1094 (1%) | 48,244 (44.6%) | 13,447 (12.4%) | N.a. |
| LUT resource | 4790 (1.11%) | 13,779 (3.18%) | 3324 (0.76%) | 148,797 (34.3%) | 23,955 (5.53%) | 18,976 (4.3%) |
| FF resource | 2108 (0.24%) | 6857 (0.79%) | 3878 (0.44%) | 161,934 (18.7%) | 61,335 (7.08%) | 15,864 (1.8%) |
| DSP48 | 52 (1.44%) | 168 (4.67%) | 33 (0.9%) | 1016 (28.3%) | 771 (21.5%) | 232 (6.3%) |
| BRAM18 | 0 (0%) | 0 (0%) | 1 (0.03%) | 32 (1.08%) | 1 (0.03%) | 12① (0.41%) |
| Clock frequency/MHz | 232 | 225 | 412 | 317 | 263 | 309 |
| Delay/clock cycles | 48 | 72 | 951 | 196 | 795 | N.a. |
| Throughput/(Mbit/s) | 38 | 50 | 20 | 621 | 379 | 48 |
| Throughput/LUT | 7933 | 3629 | 6017 | 4173 | 15,821 | 2530 |

① The BRAM36 used in these designs is equivalent to 2 BRAM18s

Figure 5.12 shows the trade-off between the data throughput of the TASER-based FPGA designs and the minimum SNR required to achieve 1% VER for coherent data detection in massive MIMO systems. The SIMO lower bound and linear MMSE detection are used as references, where MMSE detection serves as the fundamental performance limit of the CGLS [6], NSA [1], OCD [7], and GAS [8] detectors. The trade-off between performance and complexity of the TASER algorithm is controlled by the maximum number of iterations $$ t_{\text{max}} $$, and simulations show that only a few iterations are needed to exceed the performance of linear detection. As can be seen from Fig. 5.12, the TASER algorithm achieves near-ML performance, and its FPGA designs achieve data throughputs from 10 Mbit/s to 80 Mbit/s.
../images/478198_1_En_5_Chapter/478198_1_En_5_Fig12_HTML.png
Fig. 5.12

Trade-offs between throughput and performance in FPGA design. a BPSK, b QPSK © [2018] IEEE.

Reprinted, with permission, from Ref. [5]

5.2.5 ASIC Implementation Results

Here, ASICs with systolic array sizes N = 9, N = 17, and N = 33 are implemented in TSMC 40 nm CMOS; the implementation results are shown in Table 5.4. The silicon area of the ASIC designs grows with the square of the array size N, which can also be seen from Fig. 5.13 and Table 5.5: the unit area of each PE and of the scale unit remains essentially constant, while the total PE area increases with N². The unit area of the $$ \widehat{T}_{k,j} $$ storage increases with N, since each storage holds one column of an N × N matrix. Different array sizes have different critical paths: for N = 9 the critical path is located in the MAC unit of the PEs, for N = 17 in the inverse-SQRT LUT, and for N = 33 in the broadcast bus.
Table 5.4

ASIC implementation results of TASER with different array sizes

| Matrix size | N = 9 | N = 17 | N = 33 |
| --- | --- | --- | --- |
| Number of BPSK users/time slot | 8 | 16 | 32 |
| Number of QPSK users/time slot | 4 | 8 | 16 |
| Kernel area/µm² | 149,738 | 482,677 | 1,382,318 |
| Kernel density/% | 69.86 | 68.89 | 72.89 |
| Unit area/GE | 148,264 | 471,238 | 1,427,962 |
| Maximum clock frequency/MHz | 598 | 560 | 454 |
| Minimum delay/clock cycles | 16 | 24 | 40 |
| Maximum throughput/(Mbit/s) | 298 | 374 | 363 |
| Power estimation/mW | 41 | 87 | 216 |

① One gate equivalent (GE) refers to the area of a unit-sized NAND2 gate

② Power estimated after placement and routing, at the maximum clock frequency and a supply voltage of 1.1 V

../images/478198_1_En_5_Chapter/478198_1_En_5_Fig13_HTML.png
Fig. 5.13

ASIC implementation layout for TASER algorithm. a N = 9, b N = 17, c N = 33. © [2018] IEEE.

Reprinted, with permission, from Ref. [5]

Table 5.5

Area decomposition of the TASER algorithm with different ASIC array sizes

| Area | N = 9, unit area | N = 9, total area | N = 17, unit area | N = 17, total area | N = 33, unit area | N = 33, total area |
| --- | --- | --- | --- | --- | --- | --- |
| PE | 2391 (1.6%) | 105,198 (70.9%) | 2404 (0.5%) | 365,352 (77.5%) | 2084 (0.1%) | 1,168,254 (81.8%) |
| Scale unit | 6485 (4.4%) | 25,941 (17.5%) | 6315 (1.3%) | 50,521 (10.7%) | 5945 (0.4%) | 95,125 (6.6%) |
| $$ \widehat{T}_{k,j} $$ storage | 734 (0.5%) | 5873 (4.0%) | 1451 (0.3%) | 23,220 (4.9%) | 2888 (0.2%) | 92,426 (6.5%) |
| Control unit | 459 (0.3%) | 459 (0.3%) | 728 (0.2%) | 728 (0.2%) | 1259 (0.1%) | 1259 (0.1%) |
| Other | # | 10,793 (7.3%) | # | 31,417 (6.7%) | # | 70,898 (5.0%) |

Table 5.6 compares the ASIC implementation of TASER with the NSA detector in Ref. [9], which is the only other known ASIC design for massive MU-MIMO data detection. Although the data throughput of the TASER ASIC is significantly lower than that of the NSA detector, TASER achieves better hardware efficiency (measured as throughput per unit area) and better energy efficiency (measured as energy per bit) because of its lower area and power consumption. Moreover, TASER still achieves near-ML performance when the number of transmit antennas equals the number of receive antennas in a massive MU-MIMO system (Fig. 4.14a, b). The comparison in Table 5.6 is, in fact, not completely fair: the TASER design does not include the preprocessing circuit, whereas the NSA design [9] includes it and is optimized for broadband systems using single-carrier frequency-division multiple access (SC-FDMA).
Table 5.6

Comparison of ASIC implementation results of different algorithms

| Detection algorithm | TASER | TASER | NSA [9] |
| --- | --- | --- | --- |
| BER | Near-ML | Near-ML | Approximate MMSE |
| Modulation mode | BPSK | QPSK | 64 QAM |
| Preprocessing | Not included | Not included | Not included |
| Number of iterations | 3 | 3 | 3 |
| CMOS process/nm | 40 | 40 | 45 |
| Voltage/V | 1.1 | 1.1 | 0.81 |
| Clock frequency/MHz | 598 | 560 | 1000 (1125) |
| Throughput/(Mbit/s) | 99 | 125 | 1800 (2025) |
| Kernel area/mm² | 0.150 | 0.483 | 11.1 (8.77) |
| Kernel density/% | 69.86 | 68.89 | 73.00 |
| Unit area/kGE | 142.4 | 448.0 | 12,600 |
| Power/mW | 41.25 | 87.10 | 8000 (13,114) |
| Throughput per unit area/[bit/(s GE)] | 695 | 279 | 161 |
| Energy per bit/(pJ/b) | 417 | 697 | 6476 |

① Assuming $$ A \sim 1/\ell^{2} $$, $$ t_{\text{pd}} \sim 1/\ell $$, $$ P_{\text{dyn}} \sim 1/\left( {V_{\text{dd}}^{2}\,\ell } \right) $$, when scaling the process to 40 nm and 1.1 V

② Gate count, excluding storage

③ At the maximum clock frequency and given supply voltage

A large number of ASIC data detectors have been designed for traditional small-scale MIMO systems (see Refs. [11, 19]); most of them achieve near-ML performance and can even provide data throughput at the Gbit/s level for small-scale MIMO systems. Their efficiency for massive MIMO, however, has not been verified, and the corresponding algorithm- and hardware-level comparison is left as future work.