When an algorithm is mapped onto a hardware architecture, the performance of that architecture must be evaluated in terms of data throughput, area, power consumption, and delay, and the resource reuse, sub-module design, and overall pipelining of the architecture must be studied, so as to obtain innovative designs of practical value. Up to now, only suboptimal linear data detection algorithms have been implemented on FPGA [1] or ASIC [2, 3]. Owing to the characteristics of the linear algorithms themselves, their hardware implementations are not ideal, so it is necessary to design hardware architectures for nonlinear algorithms.
This chapter first introduces, in Sect. 5.1, a VLSI architecture we designed based on the CHOSLAR algorithm of Sect. 4.2. It is implemented as a preprocessor for K-best detection in 64 QAM 16 × 16 MIMO systems [4]. To achieve an optimal trade-off among throughput, area, and power consumption, we present three types of systolic arrays with diagonal priority to perform the initial matrix computation, LR, and matrix inversion. An antenna-level pipelined structure is also proposed to realize high data throughput and low latency in the sorted QRD and post-vector computations. Experimental results show that this architecture has clear advantages over existing designs in data throughput, latency, energy efficiency (throughput/power), and area efficiency (throughput/gate count).
Then, a corresponding systolic array is designed for the TASER algorithm, which realizes high-throughput data detection with low silicon area [5]. The design is implemented on a Xilinx Virtex-7 FPGA and in 40 nm CMOS technology, and its performance and computational complexity are compared in detail with those of other data detectors recently proposed for massive MU-MIMO wireless systems [1, 6–9].
5.1 CHOSLAR Hardware Architecture
5.1.1 VLSI Architecture
This section describes the VLSI architecture based on the CHOSLAR algorithm [4] of Sect. 4.2. The architecture is designed for 64 QAM 16 × 16 MIMO systems; the circuit design method extends in the same way to larger-scale MIMO systems. The BER simulation results in Sect. 4.2.5 show that three iterations are sufficient to achieve near-optimal detection accuracy with low resource consumption, so the number of LR iterations is set to 3.
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig1_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig1_HTML.png)
Top-level modules of the CHOSLAR algorithm. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
5.1.1.1 The Initialization Unit
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig2_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig2_HTML.png)
Architecture of the initialization unit. a architecture of the initialization unit, b internal architecture of PE-A, c internal architecture of PE-B. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
Similar systolic arrays are used to compute the Gram matrix in the linear detection architectures of Refs. [1, 3]. However, unlike the systolic arrays designed in this section, the computations in those architectures do not begin with the PEs of the diagonal elements of matrix A, so the computation of the diagonal elements of the Gram matrix G is postponed. The subsequent sorted QRD stage then has to wait longer for its input data, which decreases the data throughput of the overall architecture and increases the latency. In addition, PE-A in Refs. [1, 3] uses bilateral inputs, whereas the architecture in this section adopts unilateral inputs, so the systolic arrays in Refs. [1, 3] require twice as many ports at the input end.
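To make the scheduling idea concrete, the following Python sketch computes the Gram matrix and emits its diagonal entries before the off-diagonal ones, which is what allows the sorted-QRD stage to start earlier; the function name and the toy 16 × 16 test are illustrative and not part of the original design.

```python
import numpy as np

def gram_diagonal_first(H):
    """Compute G = H^H H, yielding the diagonal entries before the off-diagonals.

    This mimics, at the algorithmic level only, the diagonal-priority schedule of
    the PE-A/PE-B systolic array: the sorted-QRD stage can begin as soon as the
    diagonal of G is available.
    """
    N = H.shape[1]
    G = np.zeros((N, N), dtype=complex)
    for i in range(N):                       # diagonal entries first: G[i, i] = ||h_i||^2
        G[i, i] = np.vdot(H[:, i], H[:, i])
        yield ('diag', i, i, G[i, i])
    for i in range(N):                       # off-diagonals afterwards (G is Hermitian)
        for j in range(i + 1, N):
            G[i, j] = np.vdot(H[:, i], H[:, j])
            G[j, i] = np.conj(G[i, j])
            yield ('offdiag', i, j, G[i, j])

H = (np.random.randn(16, 16) + 1j * np.random.randn(16, 16)) / np.sqrt(2)
order = [tag for tag, *_ in gram_diagonal_first(H)]
assert order[:16] == ['diag'] * 16           # the 16 diagonal entries come out first
```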
5.1.1.2 Sorted QRD Unit
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig3_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig3_HTML.png)
Sorted QRD architecture. a sorted QRD architecture, b internal architecture of PE-C. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
VLSI architectures based on the GR algorithm and sorting are proposed in Refs. [10, 11]: a flexible architecture for 64 QAM 1 × 1 to 4 × 4 MIMO systems in Ref. [10], and an 8 × 8 K-best signal detector with fewer sorting operations that combines LR and QRD in Ref. [11]. The sorted QRD units in these architectures are built as a long chain of paired CORDIC arrays, resulting in excessive latency, which becomes even more problematic as the size of the MIMO system increases. The increased time for computing the sorted QRD in turn lowers the overall detector's data throughput. The architecture proposed here does not need to compute the QRD directly; instead, it is implemented with a series of deeply pipelined multiplications to meet the high-throughput requirements of future wireless communication systems.
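As an illustration of how a sorted QRD can be obtained from multiplications alone rather than CORDIC rotations, the sketch below sorts the columns by norm and derives R from a Cholesky factorization of the permuted Gram matrix; the sorting criterion and the helper name are assumptions made for this example, not a transcription of the CHOSLAR preprocessing.

```python
import numpy as np

def sorted_qrd_via_cholesky(H):
    """Sorted QRD built from multiply-accumulate operations (no Givens rotations).

    Columns are permuted by ascending norm (an assumed sorting criterion), R is
    taken as the Cholesky factor of the permuted Gram matrix, and Q is recovered
    through a triangular inverse, so the datapath reduces to multiplications.
    """
    order = np.argsort(np.linalg.norm(H, axis=0))   # sort columns by norm
    Hp = H[:, order]
    G = Hp.conj().T @ Hp                            # Gram matrix of the permuted channel
    R = np.linalg.cholesky(G).conj().T              # upper-triangular factor, G = R^H R
    Q = Hp @ np.linalg.inv(R)                       # Q = Hp R^{-1}
    return Q, R, order

H = (np.random.randn(16, 16) + 1j * np.random.randn(16, 16)) / np.sqrt(2)
Q, R, order = sorted_qrd_via_cholesky(H)
assert np.allclose(Q @ R, H[:, order])
assert np.allclose(Q.conj().T @ Q, np.eye(16), atol=1e-8)
```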
5.1.1.3 PILR Unit
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig4_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig4_HTML.png)
a Architecture for updating matrix R based on the Siegel condition, b internal architecture of PE-D. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig5_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig5_HTML.png)
a Architecture of full size reduction for matrix R, b internal architecture of PE-D, c internal architecture of PE-E. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
In Fig. 5.4, the first part of the unit consists of 3N PE-Ds (for example, 48 PE-Ds in a 16 × 16 MIMO system). All PE-Ds are identical and operate in parallel. Take the k(th) PE-D as an example: its inputs are the k(th) and (k − 1)(th) rows of matrix R, which it updates; the two updated rows then serve as the input of the next PE-D, and registers (REG) preserve the timing of the pipeline. The framework of the PE-D array is shown in Fig. 5.4. Each PE-D consists of three parts: the column updates for matrices H and R, the column exchanges for matrices H and R, and the updates for matrix R. Before these, a dedicated unit processes the comparison of Line 23 of Algorithm 4.2.2, and the comparison result serves as the enable signal of the subsequent units. First, the required parameter is computed, with the division realized by an LUT. Then the CM unit performs the multiplication of Line 25 of Algorithm 4.2.2, and the columns of the resulting matrices R and H are exchanged to obtain the updated matrices. Next, the parameters a and b are computed from the updated R to restore its upper-triangular form. During this process, real multiplications, real additions, and an LUT realize the multiplication, SQRT, and REC operations on the elements of matrix R. Multiplier, conjugate, and negation units then form the rotation coefficients, and the k(th) and (k − 1)(th) rows of matrix R are updated.
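The following sketch illustrates, at the algorithmic level, a Siegel-type check-swap-retriangularize step of the kind the PE-D pipeline performs; the threshold value and the exact update formulas are illustrative assumptions rather than a literal copy of Algorithm 4.2.2.

```python
import numpy as np

def pe_d_style_step(R, H, k, delta=0.75):
    """One PE-D-style step: Siegel-type check, CM multiplication, column swap,
    and a 2 x 2 update of rows k-1 and k of R (delta and the exact formulas
    are illustrative assumptions)."""
    # Comparison acting as the enable signal for the rest of the unit
    if abs(R[k, k]) ** 2 < delta * abs(R[k - 1, k - 1]) ** 2:
        mu = np.round(R[k - 1, k] / R[k - 1, k - 1])   # parameter (division via LUT in hardware)
        if mu != 0:                                    # CM multiply-and-subtract
            R[:, k] -= mu * R[:, k - 1]
            H[:, k] -= mu * H[:, k - 1]
        R[:, [k - 1, k]] = R[:, [k, k - 1]]            # exchange columns of R and H
        H[:, [k - 1, k]] = H[:, [k, k - 1]]
        # Parameters a and b of the rotation that restores the upper-triangular form
        d = np.sqrt(abs(R[k - 1, k - 1]) ** 2 + abs(R[k, k - 1]) ** 2)  # SQRT via LUT
        a = R[k - 1, k - 1].conj() / d                                   # REC via LUT
        b = R[k, k - 1].conj() / d
        Theta = np.array([[a, b], [-b.conj(), a.conj()]])
        R[[k - 1, k], k - 1:] = Theta @ R[[k - 1, k], k - 1:]            # update rows k-1 and k
    return R, H
```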
The second part of the PILR unit consists of N(N − 1)/2 identical processing elements PE-E (e.g., 120 PE-Es in a 16 × 16 MIMO system). Figure 5.5 shows the framework of the size reduction for matrices R and H. Take a single PE-E as an example: it first computes the rounding parameter, then multiplies each element of the (N − 2)(th) column of matrices R and H by this parameter, and subtracts the result from the corresponding element of the (N − 1)(th) column. In the size-reduction framework, the first stage updates the elements of the N(th) column of matrices R and H (except the element in Row N − 1, Column N). The second stage uses the results of the first stage to update the elements of the N(th) column of matrices R and H (except the elements in Row N − 1, Column N and Row N − 2, Column N). In addition, the elements of the (N − 1)(th) and (N − 2)(th) columns are input to PE-E to update the (N − 1)(th) column. The size-reduction framework comprises N − 1 stages, and the computation proceeds in the same way in all subsequent stages.
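A minimal sketch of the size-reduction step that the PE-E array pipelines is given below; the rounding rule and the loop order are generic size-reduction choices assumed for illustration, not a transcription of Algorithm 4.2.2.

```python
import numpy as np

def size_reduce_column(R, H, col):
    """Size-reduce column `col` of the upper-triangular R and apply the same
    multiply-and-subtract updates to H, mirroring the PE-E stage structure."""
    for row in range(col - 1, -1, -1):
        mu = np.round(R[row, col] / R[row, row])    # rounding parameter of this stage
        if mu != 0:
            R[:, col] -= mu * R[:, row]             # multiply each element and subtract
            H[:, col] -= mu * H[:, row]
    return R, H
```

Running `size_reduce_column` for `col` from N − 1 down to 1 then corresponds to the N − 1 stages described above.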
VLSI architectures containing similar LR procedures have been proposed before. In Ref. [11], LR is implemented on three pairs of CORDIC processors using an odd–even algorithm, and the detector achieves near-ML accuracy for 64 QAM 8 × 8 MIMO systems; several CORDIC pairs for QRD and LR are also designed. According to the timing diagram, this part accounts for the majority of the latency, so the data throughput of the overall detector decreases. In contrast, the PILR unit proposed in this section implements LR with two frameworks, one for the Siegel condition and the other for size reduction, both designed as systolic arrays. All intermediate data are computed in the next PE-D and all computations are deeply pipelined, so the hardware utilization and data throughput are higher than those of the CORDIC-based architecture in Ref. [11].
5.1.1.4 Inversion Unit
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig6_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig6_HTML.png)
a Architecture of the inversion unit, b internal architecture of PE-F, c internal architecture of PE-G. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
Some previously proposed VLSI architectures for linear detectors, such as those in Refs. [1, 3], also contain inversion units. Those inversion units are based on the NSA and are therefore approximate, although, like the inversion unit described in this section, they are built on systolic arrays. The proposed architecture, in contrast, inverts the matrix R exactly, whereas the architectures in Refs. [1, 3] introduce an approximation error. In addition, the systolic array in this design first inverts the diagonal elements of R in the PE-F elements of the first column of the unit, so that the results can be used immediately by the PE-G elements, whereas in Refs. [1, 3] the diagonal elements are only computed after a long delay because the computation does not start with the PEs of these elements. Moreover, the architectures in Refs. [1, 3] require more ports than the architecture in this section.
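The sketch below shows an exact back-substitution inverse of an upper-triangular matrix in which the diagonal reciprocals are produced first and then reused for the off-diagonal entries, mirroring the PE-F/PE-G scheduling described above; variable names and the test matrix are illustrative.

```python
import numpy as np

def invert_upper_triangular(R):
    """Exact inverse of an upper-triangular matrix, diagonal reciprocals first.

    The reciprocals (PE-F-style results) are available immediately, with no
    NSA-type approximation, and the off-diagonal entries (PE-G-style results)
    reuse them column by column through back substitution.
    """
    N = R.shape[0]
    Rinv = np.zeros_like(R)
    for i in range(N):
        Rinv[i, i] = 1.0 / R[i, i]                  # diagonal first
    for j in range(N):
        for i in range(j - 1, -1, -1):              # then fill column j upwards
            Rinv[i, j] = -Rinv[i, i] * np.dot(R[i, i + 1:j + 1], Rinv[i + 1:j + 1, j])
    return Rinv

R = np.triu(np.random.randn(8, 8) + 1j * np.random.randn(8, 8)) + 4 * np.eye(8)
assert np.allclose(R @ invert_upper_triangular(R), np.eye(8))
```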
5.1.1.5 Post-Vector Unit
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig7_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig7_HTML.png)
a Architecture of the post-vector unit, b internal architecture of PE-H. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
5.1.2 Implementation Results and Comparison
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig8_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig8_HTML.png)
ASIC layout of CHOSLAR-based architecture. © [2018] IEEE.
Reprinted, with permission, from Ref. [4]
Table 5.1 Comparison of ASIC implementation results with other MIMO detectors
Parameter | Reference [11] | Reference [10] | Reference [15] | Reference [12] | Reference [14] | Reference [5] | Reference [5] | Reference [5] | This section (without LR) | This section (with LR)
---|---|---|---|---|---|---|---|---|---|---
Antenna size | Complex 8 × 8 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 4 × 4 | Complex 8 × 8 | Complex 16 × 16 | Complex 16 × 16 | Complex 16 × 16
Modulation mode | 64 QAM | 64 QAM | 64 QAM | 64 QAM | 256 QAM | QPSK | QPSK | QPSK | 64 QAM | 64 QAM
Algorithm | GR + LR | Sorted GR | GS | Sorted QR | WLD | TASER | TASER | TASER | Sorted QR | Sorted QR + LR
SNR loss①/dB | 1.2 | 3.37 | 3.88 | 3.37 | 4.96 | # | # | # | 3.37 | 1.44
Process/nm | 90 | 65 | 90 | 65 | 90 | 40 | 40 | 40 | 65 | 65
Voltage/V | 1.1 | 1.2 | 1 | 1.2 | # | 1.1 | 1.1 | 1.1 | 1.2 | 1.2
Frequency/MHz | 65 | 550 | 114 | 500 | 275 | 598 | 560 | 454 | 588 | 588
Throughput/(Mbit/s) | 585 | 2640 | 684 | 367.88 | 733 | 298 | 374 | 363 | 3528 | 3528
Delay/µs | 2.1 | 1.26 | 0.28 | # | # | # | # | # | 0.7 | 1.2
Number of gates/kG | 612 | 943 | 505 | 1055 | 1580 | 148 | 471 | 1428 | 3720 | 5681
Power/mW | 37.1 | 184 | 56.8 | 315.36 | 320.56 | 41 | 87 | 216 | 1831 | 2513
Energy efficiency②/[Gbit/(s·W)] | 15.77 | 14.34 | 12.04 | 1.17 | 2.29 | 7.27 | 4.30 | 1.68 | 1.93 | 1.40
Area efficiency②/[Mbit/(s·kG)] | 0.96 | 2.80 | 0.73 | 0.348 | 0.46 | 2.01 | 0.79 | 0.25 | 0.950 | 0.62
Normalized energy efficiency/[Gbit/(s·W)] | 6.35③④ | 0.89④ | 1.00③④ | 0.07④ | 0.27③④ | 0.14③④ | 0.34③④ | 0.54③ | 1.93 | 1.40
Normalized area efficiency/[Mbit/(s·kG)] | 0.33③④ | 0.18④ | 0.12③④ | 0.02④ | 0.04③④ | 0.08③④ | 0.12③④ | 0.16③ | 0.95 | 0.62
The architecture designed here achieves a data throughput of 3.528 Gbit/s, which is 1.34 times, 9.59 times, 4.81 times, and 5.16 times that of the small-scale MIMO designs in Refs. [10, 12, 14, 15], respectively. Higher throughput is one of the requirements of future wireless communication systems, but high data throughput generally comes at the cost of a large amount of hardware resources and power, so area and power consumption are also compared with those of recent designs. It should be noted that the CHOSLAR algorithm is designed for high-order MIMO systems, whereas the architectures in Refs. [10–15] would incur higher resource and power consumption in high-order systems. Besides this, different technologies and MIMO configurations are used in these architectures. To ensure a fair comparison, the energy efficiency and area efficiency are normalized to 65 nm technology and a 16 × 16 MIMO configuration, as shown in Table 5.1. This normalization method is widely used when comparing hardware implementations across technologies and MIMO configurations, as in Refs. [5, 11, 12, 15–17]. The CHOSLAR architecture achieves an energy efficiency of 1.40 Gbit/(s·W), which is 1.57 times, 20.00 times, 5.19 times, and 1.40 times that of Refs. [10, 12, 14, 15], respectively. Its area efficiency is 0.62 Mbit/(s·kG), which is 3.44 times, 31.00 times, 15.5 times, and 5.17 times that of Refs. [10, 12, 14, 15], respectively. Moreover, the architectures in Refs. [10, 12, 14, 15] do not perform LR; excluding LR, the proposed architecture achieves 1.93 Gbit/(s·W) energy efficiency and 0.95 Mbit/(s·kG) area efficiency. The proposed architecture therefore has clear advantages in both energy efficiency and area efficiency. In terms of latency, CHOSLAR achieves 0.7 µs without LR, which is 55.56% of the latency in Ref. [10]. The latency in Ref. [15] is slightly lower than that of CHOSLAR, but CHOSLAR has significant advantages in energy efficiency and area efficiency. The architecture in Ref. [11] has higher energy efficiency than CHOSLAR but lower area efficiency, and its data throughput is computed under the assumption that the channel conditions remain constant, so that only one MIMO detection preprocessing operation is performed; for a fair comparison, its data throughput and energy efficiency should be reduced accordingly. Furthermore, compared with Ref. [11], the architecture in this design achieves 6.03 times the data throughput with only 57.14% of the latency. CHOSLAR is designed for the 16 × 16 MIMO configuration, whereas the designs in Refs. [10, 11, 14, 15] target 4 × 4 or 8 × 8 configurations; for higher-order MIMO configurations, the latency in Refs. [10, 11, 14, 15] would increase significantly.
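The ratios quoted above follow directly from the Table 5.1 entries; the short script below reproduces them (all numbers are copied from the table).

```python
# Reproduce the comparison ratios quoted above from the entries of Table 5.1.
throughput_this = 3528                     # Mbit/s (this design)
ee_with_lr, ae_with_lr = 1.40, 0.62        # Gbit/(s*W) and Mbit/(s*kG), with LR

table = {                                  # Ref.: (throughput Mbit/s, norm. EE, norm. AE)
    '[10]': (2640.0, 0.89, 0.18),
    '[12]': (367.88, 0.07, 0.02),
    '[14]': (733.0, 0.27, 0.04),
    '[15]': (684.0, 1.00, 0.12),
}
for ref, (tp, ee, ae) in table.items():
    print(f"vs {ref}: throughput x{throughput_this / tp:.2f}, "
          f"energy efficiency x{ee_with_lr / ee:.2f}, "
          f"area efficiency x{ae_with_lr / ae:.2f}")
```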
The architecture in Ref. [5] is suitable for nonlinear detection in high-order MIMO systems. Compared with the different MIMO configurations in Ref. [5], the data throughput of CHOSLAR is 11.84 times, 9.43 times, and 9.72 times higher, respectively, and the low data throughput of Ref. [5] may not meet the data-rate requirements of future wireless communication systems. The normalized energy efficiency of CHOSLAR is 10.00 times, 4.12 times, and 2.59 times that of the different MIMO configurations in Ref. [5], and its normalized area efficiency is 7.75 times, 5.17 times, and 3.88 times that of Ref. [5]. At the same time, the architectures in Ref. [5] only support BPSK and QPSK and do not support higher-order modulation, which is another disadvantage for future wireless systems. The architectures in Refs. [1–3] are designed for linear detection algorithms and can achieve near-MMSE performance, but such linear detectors suffer a non-negligible loss in detection accuracy, especially when the number of user antennas in the MIMO system is comparable to the number of base-station antennas; this is why these linear detectors are not included in Table 5.1.
5.2 TASER-Based Hardware Architecture
5.2.1 Architecture Overview
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig9_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig9_HTML.png)
Top-level modules of the TASER algorithm. © [2018] IEEE.
Reprinted, with permission, from Ref. [5]
In the k(th) cycle of the t(th) iteration of TASER, the i(th) RBU takes $\tilde{L}_{i,k}^{(t-1)}$ from the PE in Row i, Column k (i.e., PE (i, k)) and sends it to all PEs in the i(th) row, while the j(th) CBU sends $\widehat{T}_{k,j}$ to all PEs in the j(th) column. The matrix $\widehat{T}$ is assumed to have been computed in the preprocessing stage and stored in the matrix memory. After receiving the data from the CBU and RBU, each PE performs MAC operations until $V_{i,j}^{(t)}$ in Line 4 of Algorithm 4.3.1 has been computed. To include the subtraction of Line 4, the operation $\tilde{L}_{i,j}^{(t-1)} - \tilde{L}_{i,1}^{(t-1)}\widehat{T}_{1,j}$ is performed in the first cycle of each TASER iteration and the result is stored in the accumulator; in the subsequent cycles, the products $\tilde{L}_{i,k}^{(t-1)}\widehat{T}_{k,j}$ are subtracted from the accumulator in turn. Since $\tilde{L}^{(t-1)}$ is a lower-triangular matrix, $\tilde{L}_{i,k}^{(t-1)} = 0$ for $k > i$, so the corresponding subtractions can be avoided. The value $V_{i,j}^{(t)}$ of a PE in the i(th) row of the systolic array is therefore obtained after i cycles, and the matrix $V^{(t)}$ in Line 4 of Algorithm 4.3.1 is obtained after N cycles.
For example, in the first cycle the diagonal PE computes $V_{1,1}^{(t)} = \tilde{L}_{1,1}^{(t-1)} - \tilde{L}_{1,1}^{(t-1)}\widehat{T}_{1,1}$, and the PEs in the second row subsequently accumulate $V_{2,j}^{(t)} = \tilde{L}_{2,j}^{(t-1)} - \tilde{L}_{2,1}^{(t-1)}\widehat{T}_{1,j} - \tilde{L}_{2,2}^{(t-1)}\widehat{T}_{2,j}$. The squared values $V_{i,j}^{(t)2}$ are then accumulated down each column, each column $v_{j}$ of $V^{(t)}$ is scaled by $D_{j,j}/\left\| v_{j} \right\|_{2}$, and after $N + 5$ cycles the scaled result $\tilde{L}_{i,j}^{(t)}$ is available in each PE as the $\tilde{L}_{i,j}^{(t-1)}$ of the next iteration.
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig10_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig10_HTML.png)
Different cycles of the i(th) iteration of TASER array when N = 3. a first round, b second round, c third round, d seventh round. © [2018] IEEE.
Reprinted, with permission, from Ref. [5]
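To make the cycle-level scheduling of Fig. 5.10 concrete, the following behavioural sketch accumulates Line 4 of Algorithm 4.3.1 with one broadcast per cycle; it models only the arithmetic, not the fixed-point hardware, and the variable names are illustrative.

```python
import numpy as np

def matrix_multiply_by_cycles(L_prev, T_hat):
    """Cycle-by-cycle model of Line 4 of Algorithm 4.3.1 on the systolic array.

    In cycle k the k(th) column of L_prev is broadcast by the RBUs and the k(th)
    row of T_hat by the CBUs, and every PE subtracts one product per cycle.
    Because L_prev is lower triangular, PE (i, j) is finished after i cycles.
    """
    N = L_prev.shape[0]
    V = L_prev.copy()                        # the first-cycle subtraction starts from L_prev
    for k in range(N):                       # one broadcast (one MAC per PE) per cycle
        V -= np.outer(L_prev[:, k], T_hat[k, :])
    return V                                 # equals L_prev - L_prev @ T_hat

N = 9
L_prev = np.tril(np.random.randn(N, N))
T_hat = np.random.randn(N, N)
assert np.allclose(matrix_multiply_by_cycles(L_prev, T_hat), L_prev - L_prev @ T_hat)
```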
5.2.2 PE
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig11_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig11_HTML.png)
Architecture details of the TASER algorithm. a the j(th) scale unit, b the j(th) CBU, c off-diagonal PE, d diagonal PE. © [2018] IEEE.
Reprinted, with permission, from Ref. [5]
- 1. Initialization of $\tilde{L}^{(0)}$: This operation mode is used for the computation of Line 2 of Algorithm 4.3.1. All off-diagonal PEs are initialized to zero, and each diagonal PE is initialized with the value received from the CBU.
- 2. Matrix multiplication: This operation mode is used for the computation of Line 4 of Algorithm 4.3.1. The multiplier takes both of its inputs from the broadcast signals; its output is subtracted from $\tilde{L}_{i,j}^{(t-1)}$ in the first cycle of the matrix–matrix multiplication and from the accumulator in the remaining cycles. Each PE stores its own $\tilde{L}_{i,j}^{(t-1)}$ value, and in the k(th) cycle all PEs in Column k use the internally stored value as the multiplier input rather than the signal from the RBU.
- 3. Computation of the squared two-norm: This operation mode is used for the computation of Line 5 of Algorithm 4.3.1. Both inputs of every multiplier are $V_{i,j}^{(t)}$, so each PE produces $V_{i,j}^{(t)2}$. A diagonal PE passes its result to the next PE in the same column; an off-diagonal PE adds the partial sum received from the previous PE in the same column to its multiplier output and passes the result on to the next PE. A PE in the last row sends its result to the scale unit.
- 4. Scaling: This operation mode is used for Line 5 of Algorithm 4.3.1. The inputs of the multiplier are the scaling factor $D_{j,j}/\left\| v_{j} \right\|_{2}$ computed by the scale unit and $V_{i,j}^{(t)}$ received from the CBU. The result $\tilde{L}_{i,j}^{(t)}$ is stored in each PE as the $\tilde{L}_{i,j}^{(t-1)}$ of the next iteration. A behavioural sketch combining these operation modes is given after this list.
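The sketch below combines operation modes 2–4 into one TASER iteration; it assumes that only the lower-triangular entries of $V^{(t)}$ are formed (since only those PEs exist in the triangular array), and the variable names are illustrative.

```python
import numpy as np

def taser_iteration(L_prev, T_hat, D):
    """Behavioural model of one TASER iteration as seen by the PE array.

    Mode 2: matrix multiplication V = L_prev - L_prev @ T_hat (Line 4); only the
            lower-triangular entries are kept, an assumption of this sketch.
    Mode 3: squared column norms accumulated down each column (Line 5).
    Mode 4: column j scaled by D[j, j] / ||v_j||_2 to give the new L (Line 5).
    Mode 1, the initialization, runs only before the first iteration.
    """
    V = np.tril(L_prev - L_prev @ T_hat)       # mode 2
    col_norms = np.linalg.norm(V, axis=0)      # mode 3
    return V * (np.diag(D) / col_norms)        # mode 4, broadcast over the columns

N = 9
L = np.tril(np.random.randn(N, N))
T_hat = np.random.randn(N, N)
D = np.diag(np.abs(np.random.randn(N)) + 1.0)
L_next = taser_iteration(L, T_hat, D)
assert np.allclose(np.linalg.norm(L_next, axis=0), np.diag(D))   # column norms equal D[j, j]
```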
5.2.3 Implementation Details
- 1. Fixed-point design parameters: To minimize hardware complexity while ensuring near-optimal BER performance, a 14-bit fixed-point design is adopted. All PEs except those in the last row of the triangular array use 8 fraction bits to represent their stored operands, the PEs in the last row use 7 fraction bits, and 5 fraction bits are used for the values held in registers.
- 2. Computation of the inverse SQRT: The inverse SQRT in the scale unit is realized with an LUT synthesized as random logic. Each LUT contains 2^11 entries, and each 14-bit entry has 13 fraction bits. A quantization sketch of such a table is given after this list.
- 3. $\widehat{T}$ matrix memory: For the FPGA, the $\widehat{T}$ storage and the LUT are implemented with distributed random-access memory (RAM), that is, without block RAM. For the ASIC, latch arrays built from standard cells are used to reduce the circuit area [18].
- 4. RBU and CBU design: The RBU is implemented differently in the FPGA and ASIC designs. For the FPGA, the RBU in the i(th) row is an i-input multiplexer that receives data from all PEs in its row and sends the appropriate value back to these PEs. For the ASIC, the RBU consists of a bidirectional bus: each PE in the row drives the bus in turn through a tri-state buffer, and all PEs in the same row read data from it. The CBU is designed in a similar way, with multiplexers for the FPGA design and buses for the ASIC design. In all target architectures, the output of the i(th) RBU is connected to i PEs, so for larger values of i the path has a larger fan-out and eventually becomes the critical path of a large systolic array; the same holds for the CBU. To shorten the critical path, pipeline registers are placed at the input and output of each broadcast unit. Although this adds two extra cycles to each TASER iteration, the total data throughput increases because of the significantly higher clock frequency.
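To make the LUT parameters of item 2 concrete, the sketch below builds a 2^11-entry inverse-square-root table with 14-bit entries (13 fraction bits); the input range and the address mapping are illustrative assumptions.

```python
import numpy as np

ADDR_BITS = 11          # 2^11 table entries, as in the scale unit
FRAC_BITS = 13          # 14-bit entries with 13 fraction bits

def build_inv_sqrt_lut(x_min=0.5, x_max=2.0):
    """Quantized 1/sqrt(x) table; the input range [0.5, 2.0) is an assumption."""
    x = np.linspace(x_min, x_max, 2 ** ADDR_BITS, endpoint=False)
    return np.round((1.0 / np.sqrt(x)) * 2 ** FRAC_BITS).astype(np.int64)

def inv_sqrt(x, lut, x_min=0.5, x_max=2.0):
    """Look up 1/sqrt(x) and convert the fixed-point entry back to a float."""
    idx = int((x - x_min) / (x_max - x_min) * 2 ** ADDR_BITS)
    idx = min(max(idx, 0), 2 ** ADDR_BITS - 1)
    return lut[idx] / 2 ** FRAC_BITS

lut = build_inv_sqrt_lut()
worst = max(abs(inv_sqrt(v, lut) - 1.0 / np.sqrt(v))
            for v in np.random.uniform(0.5, 2.0, 1000))
assert worst < 1e-2     # coarse check that the table resolution is sufficient
```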
5.2.4 FPGA Implementation Result
FPGA implementation results of TASER with different array sizes
Matrix size | N = 9 | N = 17 | N = 33 | N = 65 |
---|---|---|---|---|
Number of BPSK Users/Time slot | 8 | 16 | 32 | 64 |
Number of QPSK Users/Time slot | 4 | 8 | 16 | 32 |
Resource quantity | 1467 | 4350 | 13,787 | 60,737 |
LUT resource | 4790 | 13,779 | 43,331 | 149,942 |
FF resource | 2108 | 6857 | 24,429 | 91,829 |
DSP48 | 52 | 168 | 592 | 2208 |
Maximum clock frequency/MHz | 232 | 225 | 208 | 111 |
Minimum delay/Clock cycle | 16 | 24 | 40 | 72 |
Maximum throughput/(Mbit/s) | 116 | 150 | 166 | 98 |
Power estimation①/W | 0.6 | 1.3 | 3.6 | 7.3 |
Comparison of implementation results of different detectors for 128 × 8 massive MIMO systems
Detection Algorithm | TASER | TASER | CGLS [6] | NSA [1] | OCD [7] | GAS [8] |
---|---|---|---|---|---|---|
BER | Near-ML | Near-ML | Approximate MMSE | Approximate MMSE | Approximate MMSE | Approximate MMSE |
Modulation mode | BPSK | QPSK | 64 QAM | 64 QAM | 64 QAM | 64 QAM |
Preprocessing | Not include | Not include | Include | Include | Include | Include |
Maximum number of iterations | 3 | 3 | 3 | 3 | 3 | 1 |
Resource quantity | 1467(1.35%) | 4350(4.02%) | 1094(1%) | 48,244(44.6%) | 13,447(12.4%) | N.a. |
LUT resource | 4790(1.11%) | 13,779(3.18%) | 3324(0.76%) | 148,797(34.3%) | 23,955(5.53%) | 18,976(4.3%) |
FF resource | 2108(0.24%) | 6857(0.79%) | 3878(0.44%) | 161,934(18.7%) | 61,335(7.08%) | 15,864(1.8%) |
DSP48 | 52(1.44%) | 168(4.67%) | 33(0.9%) | 1016(28.3%) | 771(21.5%) | 232(6.3%) |
BRAM18 | 0(0%) | 0(0%) | 1(0.03%) | 32①(1.08%) | 1(0.03%) | 12①(0.41%)
Clock frequency/MHz | 232 | 225 | 412 | 317 | 263 | 309 |
Delay/Clock cycle | 48 | 72 | 951 | 196 | 795 | N.a. |
Throughput/(Mbit/s) | 38 | 50 | 20 | 621 | 379 | 48 |
Throughput per LUT/[bit/(s·LUT)] | 7933 | 3629 | 6017 | 4173 | 15,821 | 2530
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig12_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig12_HTML.png)
Trade-offs between throughput and performance in FPGA design. a BPSK, b QPSK © [2018] IEEE.
Reprinted, with permission, from Ref. [5]
5.2.5 ASIC Implementation Results
ASIC implementation results of TASER with different array sizes
Matrix size | N = 9 | N = 17 | N = 33 |
---|---|---|---|
Number of BPSK users/Time slot | 8 | 16 | 32 |
Number of QPSK users/Time slot | 4 | 8 | 16 |
Kernel area/µm² | 149,738 | 482,677 | 1,382,318 |
Kernel density/% | 69.86 | 68.89 | 72.89 |
Unit area/GE① | 148,264 | 471,238 | 1,427,962 |
Maximum clock frequency/MHz | 598 | 560 | 454 |
Minimum delay/Clock cycle | 16 | 24 | 40 |
Maximum throughput/(Mbit/s) | 298 | 374 | 363 |
Power estimation②/mW | 41 | 87 | 216 |
![../images/478198_1_En_5_Chapter/478198_1_En_5_Fig13_HTML.png](../images/478198_1_En_5_Chapter/478198_1_En_5_Fig13_HTML.png)
ASIC implementation layout for TASER algorithm. a N = 9, b N = 17, c N = 33. © [2018] IEEE.
Reprinted, with permission, from Ref. [5]
Area decomposition of the TASER algorithm with different ASIC array sizes
Matrix size | N = 9 | N = 17 | N = 33 | |||
---|---|---|---|---|---|---|
Area | Unit area | Total area | Unit area | Total area | Unit area | Total area |
PE | 2391(1.6%) | 105,198(70.9%) | 2404(0.5%) | 365,352(77.5%) | 2084(0.1%) | 1,168,254(81.8%) |
Scale unit | 6485(4.4%) | 25,941(17.5%) | 6315(1.3%) | 50,521(10.7%) | 5945(0.4%) | 95,125(6.6%) |
$\widehat{T}$ memory | 734(0.5%) | 5873(4.0%) | 1451(0.3%) | 23,220(4.9%) | 2888(0.2%) | 92,426(6.5%)
Control unit | 459(0.3%) | 459(0.3%) | 728(0.2%) | 728(0.2%) | 1259(0.1%) | 1259(0.1%) |
Other | # | 10,793(7.3%) | # | 31,417(6.7%) | # | 70,898(5.0%) |
Comparison of ASIC implementation results of different algorithms
Detection algorithm | TASER | TASER | NSA [9] |
---|---|---|---|
BER | Near-ML | Near-ML | Approximate MMSE |
Modulation mode | BPSK | QPSK | 64 QAM |
Pre-process | Not include | Not include | Not include |
Iterations number | 3 | 3 | 3 |
CMOS process/nm | 40 | 40 | 45 |
Voltage/V | 1.1 | 1.1 | 0.81 |
Clock frequency/MHz | 598 | 560 | 1000(1125①) |
Throughput/(Mbit/s) | 99 | 125 | 1800(2025①) |
Kernel area/mm2 | 0.150 | 0.483 | 11.1(8.77①) |
Kernel density/% | 69.86 | 68.89 | 73.00 |
Unit area②/kGE | 142.4 | 448.0 | 12,600 |
Power③/mW | 41.25 | 87.10 | 8000(13,114①) |
Throughput/Unit area①/[bit/(s·GE)] | 695 | 279 | 161 |
Energy/bit①/(pJ/b) | 417 | 697 | 6476 |
A large number of ASIC data detector designs have been proposed for traditional small-scale MIMO systems (see Refs. [11, 19]); most of them achieve near-ML performance and can even provide data throughput at the Gbit/s level for small-scale MIMO systems. However, the efficiency of these data detectors for massive MIMO has not been verified, and the corresponding algorithm- and hardware-level comparison is a direction for future work.