© Springer Nature Singapore Pte Ltd. and Science Press, Beijing, China  2019
Leibo Liu, Guiqiang Peng and Shaojun Wei, Massive MIMO Detection Algorithm and VLSI Architecture, https://doi.org/10.1007/978-981-13-6362-7_4

4. Nonlinear Massive MIMO Signal Detection Algorithm

Leibo Liu (Corresponding author), Guiqiang Peng and Shaojun Wei
Institute of Microelectronics, Tsinghua University, Beijing, China

Currently, there are two kinds of signal detectors for MIMO systems: linear signal detectors and nonlinear signal detectors [1]. Linear signal detection algorithms include the conventional ZF and MMSE algorithms [2], as well as some recently proposed linear signal detection algorithms [3–5]. Although these linear algorithms have the advantage of low complexity, their deficiency in detection accuracy cannot be ignored, especially when the number of user antennas is close or equal to the number of base station antennas [3]. The optimal signal detector is the nonlinear ML signal detector, but its complexity increases exponentially with the number of transmitting antennas, so it cannot be implemented for massive MIMO systems [6]. The SD detector [7] and the K-Best detector [8] are two variations of the ML detector; they balance computational complexity against performance by controlling the number of nodes in each search stage. However, the QR decomposition in these nonlinear signal detectors leads to high computational complexity and low parallelism because it involves matrix operations such as element elimination. A detector with low complexity, high precision and high processing parallelism is therefore urgently needed.

This chapter first introduces several conventional nonlinear MIMO signal detection algorithms in Sect. 4.1. Section 4.2 presents a K-best signal detection and preprocessing algorithm for high-order MIMO systems that combines Cholesky sorted QR decomposition and partial iterative lattice reduction (CHOSLAR) [9]. Section 4.3 presents another signal detection algorithm, the TASER algorithm [10].

4.1 Conventional Nonlinear MIMO Signal Detection Algorithm

4.1.1 ML Signal Detection Algorithm

The ML signal detection algorithm achieves the optimal estimation of the transmitted signal. The detection process finds, among all constellation points, the one nearest to the received signal as the estimate of the transmitted signal. The detailed analysis is as follows. Consider a MIMO system with $$ N_{\text{t}} $$ transmitting antennas and $$ N_{\text{r}} $$ receiving antennas, and let the symbols received by all receiving antennas be represented by the vector $$ \varvec{y} \in C^{{N_{\text{r}} }} $$; then
$$ \varvec{y} = \varvec{Hs} + \varvec{n,} $$
(4.1)
where $$ \varvec{s} \in \varOmega^{{N_{\text{t}} }} $$ is the transmitted signal vector containing all user data symbols ($$ \varOmega $$ denotes the set of constellation points); $$ \varvec{H} \in C^{{N_{\text{r}} \times N_{\text{t}} }} $$ is the Rayleigh flat-fading channel matrix, whose element $$ h_{j,i} $$ is the channel gain from transmitting antenna $$ i\left( {i = 1,2, \ldots ,N_{\text{t}} } \right) $$ to receiving antenna $$ j\left( {j = 1,2, \ldots ,N_{\text{r}} } \right) $$; and $$ \varvec{n} \in C^{{N_{\text{r }} }} $$ is the additive white Gaussian noise vector with independent components, each obeying the $$ N\left( {0,\sigma^{2} } \right) $$ distribution.
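To make the model concrete, the following minimal NumPy sketch generates one realization of Eq. (4.1). The antenna counts, noise level and QPSK constellation are illustrative assumptions standing in for $$ \varOmega $$, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

Nt, Nr = 4, 8                      # transmit / receive antennas (hypothetical sizes)
sigma = 0.1                        # noise standard deviation (hypothetical)

# 4-QAM (QPSK) constellation as a stand-in for the set Omega
Omega = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

# Rayleigh flat-fading channel: i.i.d. complex Gaussian entries, unit variance
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)

s = rng.choice(Omega, size=Nt)     # transmitted symbol vector from Omega^Nt
n = sigma * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)) / np.sqrt(2)

y = H @ s + n                      # received vector, Eq. (4.1)
```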
The conditional probability density of the receiving signal can be expressed as
$$ P\left( {\left. \varvec{y} \right|\varvec{H},\varvec{s}} \right) = \frac{1}{{\left( {\uppi\sigma^{2} } \right)^{{N_{\text{r}} }} }}\exp \left( { - \frac{1}{{\sigma^{2} }}\left\| {\varvec{y} - \varvec{Hs}} \right\|^{2} } \right) $$
(4.2)
As the optimal signal detection algorithm, the ML signal detection algorithm solves for $$ \varvec{s} $$ through a finite-set constrained least-squares optimization, as shown in Eq. (4.3).
$$ \tilde{\varvec{s}} = \mathop {\arg \hbox{max} }\limits_{s \in \varOmega } P\left( {\left. \varvec{y} \right|\varvec{H},\varvec{s}} \right) = \mathop {\arg \hbox{min} }\limits_{s \in \varOmega } \left\| {\varvec{y} - \varvec{Hs}} \right\|^{2} $$
(4.3)
Performing QR decomposition on the channel matrix H, we obtain
$$ \begin{aligned} \left\| {\varvec{y} - \varvec{Hs}} \right\|^{2} & { = }\left\| {\varvec{QQ}^{\text{H}} \left( {\varvec{y} - \varvec{Hs}} \right) + \left( {\varvec{I}_{{N_{\text{r}} }} - \varvec{QQ}^{\text{H}} } \right)\left( {\varvec{y} - \varvec{Hs}} \right)} \right\|^{2} \\ & = \left\| {\varvec{QQ}^{\text{H}} \left( {\varvec{y} - \varvec{Hs}} \right)} \right\|^{2} + \left\| {\left( {\varvec{I}_{{N_{\text{r}} }} - \varvec{QQ}^{\text{H}} } \right)\left( {\varvec{y} - \varvec{Hs}} \right)} \right\|^{2} \\ & = \left\| {\varvec{Q}^{\text{H}} \varvec{y} - \varvec{Rs}} \right\|^{2} + \left\| {\left( {\varvec{I}_{{N_{\text{r}} }} - \varvec{QQ}^{\text{H}} } \right)y} \right\|^{2} \\ & = \left\| {\varvec{Q}^{\text{H}} \varvec{y} - \varvec{Rs}} \right\|^{2} \\ & = \left\| {\varvec{y^{\prime}} - \varvec{Rs}} \right\|^{2} , \\ \end{aligned} $$
(4.4)
where $$ \varvec{y^{\prime}} = \varvec{Q}^{\text{H}} \varvec{y} $$. According to the upper triangular property of matrix $$ \varvec{R} $$,
$$ \begin{aligned} \tilde{\varvec{s}} & = \mathop {\arg \hbox{min} }\limits_{s \in \varOmega } \left\| {\varvec{y^{\prime}} - \varvec{Rs}} \right\|^{2} \\ & = \mathop {\arg \hbox{min} }\limits_{s \in \varOmega } \left( {\sum\limits_{i = 1}^{{N_{\text{t}} }} {\left| {y^{\prime}_{i} - \sum\limits_{j = i}^{{N_{\text{t}} }} {R_{i,j} s_{j} } } \right|^{2} + \sum\limits_{{i = N_{\text{t}} + 1}}^{{N_{\text{r}} }} {\left| {y^{\prime}_{i} } \right|^{2} } } } \right) \\ & = \mathop {\arg \hbox{min} }\limits_{s \in \varOmega } \left( {\sum\limits_{i = 1}^{{N_{\text{t}} }} {\left| {y^{\prime}_{i} - \sum\limits_{j = i}^{{N_{\text{t}} }} {R_{i,j} s_{j} } } \right|^{2} } } \right) \\ & = \mathop {\arg \hbox{min} }\limits_{s \in \varOmega } \left[ {f_{{N_{\text{t}} }} \left( {s_{{N_{\text{t}} }} } \right) + f_{{N_{\text{t}} - 1}} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} } \right) + \cdots + f_{1} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} , \ldots ,s_{1} } \right)} \right], \\ \end{aligned} $$
(4.5)
where the function $$ f_{k} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} , \ldots ,s_{k} } \right) $$ can be expressed as
$$ f_{k} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} , \ldots ,s_{k} } \right){ = }\left| {y^{\prime}_{k} - \sum\limits_{j = k}^{{N_{\text{t}} }} {R_{k,j} s_{j} } } \right|^{2} $$
(4.6)
Here we construct a search tree to seek the optimal solution over the set of all constellation points, as shown in Fig. 4.1.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig1_HTML.png
Fig. 4.1

Search tree in the ML signal detection algorithm

The root node expands into S nodes in the first stage (S is the number of possible values of each symbol under the given modulation). Each node has a value of $$ f_{{N_{\text{t}} }} \left( {s_{{N_{\text{t}} }} } \right) $$$$ \left( {s_{{N_{\text{t}} }} \in \,\varvec{\varOmega}} \right) $$. Each node in the first stage then expands into S child nodes to form the second stage, with $$ S^{2} $$ nodes in total, each with value $$ f_{{N_{\text{t}} }} \left( {s_{{N_{\text{t}} }} } \right){ + }f_{{N_{\text{t}} - 1}} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} } \right) $$, and so on, until the final stage $$ N_{\text{t}} $$, which has $$ S^{{N_{\text{t}} }} $$ nodes, each with value $$ f_{{N_{\text{t}} }} \left( {s_{{N_{\text{t}} }} } \right){ + }f_{{N_{\text{t}} - 1}} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} } \right) + \cdots + f_{1} \left( {s_{{N_{\text{t}} }} ,s_{{N_{\text{t}} - 1}} , \cdots ,s_{1} } \right) $$. The optimal solution is found by examining all the nodes.

The ML signal detection algorithm searches all nodes to find the optimal one and then estimates the transmitted signal, so it is the optimal estimation algorithm. Although it achieves the best performance, its computational complexity increases exponentially with the number of transmitting antennas, the number of bits per modulated symbol and the length of the processed data. The computational complexity is $$ O\left( {M^{{N_{\text{t}} }} } \right) $$ (M is the number of constellation points and $$ N_{\text{t}} $$ is the number of transmitting antennas). For example, in a $$ 4 \times 4 $$ MIMO system with 16-QAM modulation, the search amount for each symbol block is as high as $$ 16^{4} $$ = 65,536. Therefore, the ML signal detection algorithm is difficult to apply in practical communication systems with high-order modulation (large $$ M $$) and a large number of transmitting antennas (large $$ N_{\text{t}} $$). To reduce the computational complexity, approximate detection algorithms are needed [11].
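The exhaustive search described above can be sketched directly. The following Python function is an illustrative brute-force implementation (not an efficient detector): it enumerates all $$ M^{N_{\text{t}}} $$ candidates of Eq. (4.3) and keeps the one with the smallest Euclidean metric.

```python
import itertools
import numpy as np

def ml_detect(y, H, Omega):
    """Exhaustive ML detection: argmin over Omega^Nt of ||y - H s||^2."""
    Nt = H.shape[1]
    best, best_metric = None, np.inf
    for cand in itertools.product(Omega, repeat=Nt):   # M^Nt candidates
        s = np.array(cand)
        metric = np.linalg.norm(y - H @ s) ** 2        # Eq. (4.3) metric
        if metric < best_metric:
            best, best_metric = s, metric
    return best
```

Even for this toy function the loop runs $$ M^{N_{\text{t}}} $$ times, which is exactly why the tree-search approximations of the next subsection are needed.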

4.1.2 SD Signal Detection Algorithm and K-Best Signal Detection Algorithm

To achieve near-optimal performance while reducing computational complexity, several nonlinear signal detectors have been proposed. Typical among them are tree-based search algorithms [12, 13]. Many near-ML signal detection algorithms have been presented so far, including low-complexity tree-based search algorithms. The K-best signal detection algorithm [14] searches K nodes in each stage to find the possible transmitted signals, while the SD signal detection algorithm [15] searches the hypersphere around the received signal vector. However, apart from the ML signal detection algorithm, no other signal detection algorithm achieves full diversity gain [16]. The fixed-complexity sphere decoding (FSD) signal detection algorithm [17] exploits the underlying lattice structure of the received signal and is considered the most promising approach to attaining near-ML detection performance at reduced computational complexity. In conventional small-scale MIMO systems the algorithm performs well, but its computational complexity is still unbearable when the antenna array grows or the modulation order increases (for example, 128 transmitting antennas with 64-QAM modulation) [6].

The above-mentioned algorithms are described in detail below.

4.1.2.1 K-Best Signal Detection Algorithm

The K-best signal detection algorithm [14] is a breadth-first search algorithm that performs only forward search. It keeps only the $$ K $$ paths with the best metrics in each stage. Figure 4.2 shows the search tree of the K-best signal detection algorithm when $$ N_{\text{t}} = 2 $$ [11]. The algorithm expands all possible candidate nodes from the root node, sorts them by metric, selects the $$ K $$ paths with the smallest metrics as the next stage, and so on, until the leaf nodes are reached. The specific implementation of the K-best signal detection algorithm is shown in Algorithm 4.1.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig2_HTML.png
Fig. 4.2

Search tree in the K-best signal detection algorithm.

© [2018] IEEE. Reprinted, with permission, from Ref. [11]

Algorithm 4.1

K-best signal detection algorithm

../images/478198_1_En_4_Chapter/478198_1_En_4_Figa_HTML.png

The K-best signal detection algorithm achieves near-optimal performance with fixed complexity and moderate parallelism. The fixed complexity depends on the number of retained candidate nodes $$ K $$, the modulation order and the number of transmitting antennas. In terms of search count, the total number of nodes searched is $$ 2^{Q} + \left( {N_{\text{t}} - 1} \right)K2^{Q} $$. Despite these advantages, the algorithm takes no account of the noise variance or channel conditions. In addition, it expands each of the K retained paths in every stage into $$ 2^{Q} $$ possible child nodes, so enumerating these child nodes is very costly, especially for high-order modulation or a large number of survivor paths. The algorithm must also compute and sort the $$ 2^{Q} K $$ paths of each stage, of which $$ K\left( {2^{Q} - 1} \right) $$ are pruned from the tree; this sorting is likewise time-consuming. Moreover, the algorithm suffers a high bit error rate at low $$ K $$ values.
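The stage-by-stage expand, sort and prune procedure can be sketched as follows. This is a simplified breadth-first illustration operating on the upper-triangular system $$ \varvec{y^{\prime}} = \varvec{Rs} $$; the function and variable names are hypothetical, and the real-valued decomposition and $$ 2^{Q} $$-enumeration details of Algorithm 4.1 are omitted.

```python
import numpy as np

def k_best_detect(y_prime, R, Omega, K):
    """Breadth-first K-best search on the upper-triangular system y' ~ R s.

    Detection proceeds from antenna Nt down to 1; at each stage every
    surviving path is expanded by all |Omega| symbols and only the K
    paths with the smallest partial Euclidean distance (PED) are kept.
    """
    Nt = R.shape[0]
    paths = [((), 0.0)]                      # (partial symbol tuple, PED)
    for i in range(Nt - 1, -1, -1):          # stage Nt ... 1 (0-based row i)
        expanded = []
        for sym_tail, ped in paths:
            for s_i in Omega:
                s_part = (s_i,) + sym_tail   # symbols s_i .. s_Nt
                interf = sum(R[i, j] * s_part[j - i] for j in range(i, Nt))
                e = y_prime[i] - interf
                expanded.append((s_part, ped + abs(e) ** 2))
        expanded.sort(key=lambda p: p[1])    # sort candidates by metric
        paths = expanded[:K]                 # keep the K best survivors
    return np.array(paths[0][0])
```

Setting K equal to the full constellation-power makes the search exhaustive (ML); small K trades accuracy for the fixed complexity discussed above.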

4.1.2.2 SD Signal Detection Algorithm

The SD signal detection algorithm is applicable over a wide SNR range and can achieve near-optimal performance while maintaining polynomial average computational complexity [7]. It was originally used to compute the shortest lattice vector, was further developed to find short lattice vectors, and was finally applied to ML estimation [15]. The basic principle of the SD signal detection algorithm is to limit the search space of the optimal ML solution to a hypersphere of radius $$ r_{\text{s}} $$ around the received vector, as shown in Fig. 4.3 [11] and formulated in Eq. (4.7). The computational complexity is thus reduced by verifying only the lattice points within the hypersphere rather than all possible transmitted signal points.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig3_HTML.png
Fig. 4.3

Principle of SD signal detection algorithm.

© [2018] IEEE. Reprinted, with permission, from Ref. [11]

$$ \hat{\varvec{s}}_{{\text{SD}}} = \mathop {\arg \hbox{min} }\limits_{{\varvec{s} \in 2^{{QN_{\text{t}} }} }} \left\{ {\left\| {\varvec{y} - \varvec{Hs}} \right\|^{2} \le r_{\text{s}}^{2} } \right\} $$
(4.7)
The channel matrix $$ \varvec{H} $$ can be decomposed into matrices $$ \varvec{Q} $$ and $$ \varvec{R} $$, i.e., $$ \varvec{H} = \varvec{QR} $$; then Eq. (4.7) is equivalent to
$$ \hat{\varvec{s}}_{{\text{SD}}} = \mathop {\arg \hbox{min} }\limits_{{\varvec{s} \in 2^{{QN_{\text{t}} }} }} \left\{ {\left\| {\tilde{\varvec{y}} - \varvec{Rs}} \right\|^{2} \le r_{\text{s}}^{2} } \right\}, $$
(4.8)
where $$ \varvec{R} $$ is the upper triangular matrix. The Euclidean distance is defined as $$ d_{1} = \left\| {\tilde{\varvec{y}} - \varvec{Rs}} \right\|^{2} $$, and the partial Euclidean distance (PED) is given in Eq. (4.9)
$$ d_{i} = d_{i + 1} + \left| {\tilde{y}_{i} - \sum\limits_{j = i}^{{N_{t} }} {R_{i,j} s_{j} } } \right|^{2} = d_{i + 1} + \left| {e_{i} } \right|^{2} $$
(4.9)
The search process over the tree with $$ N_{\text{t}} + 1 $$ stages can be expressed by Fig. 4.4, where Stage $$ i $$ denotes the $$ i\left( {\text{th}} \right) $$ transmitting antenna.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig4_HTML.png
Fig. 4.4

Search tree in the SD signal detection algorithm.

© [2018] IEEE. Reprinted, with permission, from Ref. [11]

The search starts from the root node with its first child node in Stage $$ N_{\text{t}} $$, where Stage $$ N_{\text{t}} $$ represents the $$ N_{\text{t}} $$(th) antenna symbol at the transmitter, and the PED is computed. If the PED (i.e., $$ d_{{N_{\text{t}} }} $$) is less than the sphere radius $$ r_{\text{s}} $$, the search continues down to Stage $$ N_{\text{t}} - 1 $$. Otherwise, if $$ d_{{N_{\text{t}} }} $$ is greater than the sphere radius $$ r_{\text{s}} $$, the search has left the hypersphere; all child nodes below this node are skipped, and the search continues along another path. The search thus proceeds stage by stage until a valid leaf node in the first stage is reached.

The choice of the initial radius $$ r_{\text{s}} $$ in the SD signal detection algorithm influences the search process. If $$ r_{\text{s}} $$ is too small, the values of all first-stage nodes, or of all nodes at some intermediate stage, may exceed it, so the optimal solution cannot be found. The SD-pruning signal detection algorithm effectively solves this problem: it first sets the radius to infinity, updates it when the last stage is reached, and keeps updating it whenever a smaller value is found, so that the performance of the SD-pruning signal detection algorithm reaches that of the ML signal detection algorithm.
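The depth-first search with radius shrinking can be sketched as follows. This is an illustrative recursive implementation of the SD-pruning idea (function names are hypothetical); the radius starts at infinity and shrinks to the metric of each leaf reached, so the result matches the ML solution.

```python
import numpy as np

def sphere_detect(y_tilde, R, Omega):
    """Depth-first sphere decoding with radius pruning (SD-pruning)."""
    Nt = R.shape[0]
    best = {"radius2": np.inf, "s": None}    # squared radius, best leaf

    def search(i, tail, ped):
        if ped >= best["radius2"]:           # outside current sphere: prune
            return
        if i < 0:                            # valid leaf: shrink the radius
            best["radius2"], best["s"] = ped, np.array(tail)
            return
        for s_i in Omega:
            s_part = (s_i,) + tail           # symbols s_i .. s_Nt
            interf = sum(R[i, j] * s_part[j - i] for j in range(i, Nt))
            d = ped + abs(y_tilde[i] - interf) ** 2   # PED update, Eq. (4.9)
            search(i - 1, s_part, d)

    search(Nt - 1, (), 0.0)
    return best["s"]
```

Because the radius only ever shrinks, every pruned subtree provably contains no candidate better than the current best, which is why SD-pruning retains ML optimality.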

4.1.2.3 FSD Signal Detection Algorithm

The FSD signal detection algorithm is another suboptimal MIMO signal detection algorithm, which can further reduce the computational complexity of the K-best detection algorithm [13]. Its detection process is based on a two-stage tree search, as shown in Fig. 4.5 [11]. The algorithm implementation is shown in Algorithm 4.2.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig5_HTML.png
Fig. 4.5

Search tree in the FSD signal detection algorithm.

© [2018] IEEE. Reprinted, with permission, from Ref. [11]

Algorithm 4.2

FSD signal detection algorithm

../images/478198_1_En_4_Chapter/478198_1_En_4_Figb_HTML.png

The conventional FSD signal detection algorithm has fixed complexity but takes no account of noise and channel conditions. A simplified version of the FSD signal detection algorithm is described in Ref. [18], which introduces path selection into the remaining stages so that the FSD signal detection algorithm can be highly parallel and fully pipelined [19].

4.2 CHOSLAR Algorithm

4.2.1 System Model

The CHOSLAR algorithm considers a complex-valued system with the same number of transmitting and receiving antennas, i.e., $$ N_{\text{r}} = N_{\text{t}} = N $$, and assumes that the channel matrix has been estimated [8, 10, 20–22]. The channel model is shown in Eq. (4.1) and the ML signal detection in Eq. (4.3). After QR decomposition of the channel matrix $$ \varvec{H} $$, Eq. (4.3) can be written as
$$ \hat{\varvec{s}}_{{\text{ML}}} = \mathop {\arg \hbox{min} }\limits_{{\varvec{s} \in \varOmega }} \left\| {\varvec{Q}^{\text{H}} \varvec{y} - \varvec{Rs}} \right\|^{2} = \mathop {\arg \hbox{min} }\limits_{{\varvec{s} \in \varOmega }} \left\| {\hat{\varvec{y}} - \varvec{Rs}} \right\|^{2} , $$
(4.10)
where $$ \varvec{H} = \varvec{QR} $$, $$ \varvec{Q} $$ is a unitary matrix, and $$ \varvec{R} $$ is an upper triangular matrix. The ML detector generates soft outputs through the log-likelihood ratio (LLR), i.e., by seeking the other symbol vectors closest to $$ \hat{\varvec{y}} $$. According to Ref. [23], the LLR of the $$ b $$(th) bit of the $$ i $$(th) symbol can be computed by Eq. (4.11).
$$ \text{LLR}\left( {s_{i,b} } \right) = \frac{1}{{\sigma^{2} }}\left( {\mathop {\hbox{min} }\limits_{{\varvec{s} \in s_{i,b}^{0} }} \left\| {\hat{\varvec{y}} - \varvec{Rs}} \right\|^{2} - \mathop {\hbox{min} }\limits_{{\varvec{s} \in s_{i,b}^{1} }} \left\| {\hat{\varvec{y}} - \varvec{Rs}} \right\|^{2} } \right), $$
(4.11)
where $$ s_{i,b}^{0} $$ and $$ s_{i,b}^{1} $$ represent the sets of symbols from the modulation constellation $$ \varOmega $$ for which $$ s_{i,b} $$ equals 0 and 1, respectively.
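A brute-force illustration of the max-log LLR in Eq. (4.11) follows. Here `bits_of` is a hypothetical helper mapping each constellation point to its bit labels, and the exhaustive enumeration is a simplification: a practical detector would reuse the candidate list produced by the tree search instead.

```python
import itertools
import numpy as np

def max_log_llr(y_hat, R, Omega, bits_of, sigma2, i, b):
    """Max-log LLR of bit b of symbol i, per Eq. (4.11), by full enumeration.

    bits_of maps each constellation point to its bit tuple; d0 and d1 are
    the minimum metrics over the bit-0 and bit-1 candidate sets.
    """
    Nt = R.shape[0]
    d0 = d1 = np.inf
    for cand in itertools.product(Omega, repeat=Nt):
        s = np.array(cand)
        m = np.linalg.norm(y_hat - R @ s) ** 2
        if bits_of[cand[i]][b] == 0:
            d0 = min(d0, m)
        else:
            d1 = min(d1, m)
    return (d0 - d1) / sigma2
```

Under this sign convention the LLR is positive when the bit-1 hypothesis fits better, matching the ordering of the two minima in Eq. (4.11).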

Matrix $$ \varvec{R} $$ is an upper triangular matrix, so the tree search starts with the last element of $$ \varvec{s} $$. The scale of the channel matrix $$ \varvec{H} $$ increases as the scale of the MIMO system increases (e.g., from 2 × 2 to 16 × 16). Due to the high complexity and data dependency, QR decomposition is difficult to implement on hardware, especially for high-order MIMO systems. Therefore, the computational complexity and parallelism of QR decomposition are the main factors limiting the throughput of high-order MIMO systems at present.

4.2.2 QR Decomposition

QR decomposition is the process of decomposing the estimated channel matrix $$ \varvec{H} $$ into a unitary matrix $$ \varvec{Q} $$ and an upper triangular matrix $$ \varvec{R} $$ [8]. Gram–Schmidt (GS) [21, 24, 25], Householder transformation (HT) [26] and Givens rotation (GR) [22, 27] are three widely used QR decomposition algorithms. A simplified GS algorithm achieving stable permutation QR decomposition is proposed in Ref. [28]. In the GR algorithm, the computation of matrix $$ \varvec{Q} $$ can be replaced by the same rotation operations used for the upper triangular matrix $$ \varvec{R} $$. However, in high-order MIMO systems, as the scale of matrix H increases, eliminating only one element at a time becomes inefficient. Another disadvantage of the GR algorithm is that a rotation cannot be executed until the two elements to its left have been eliminated. The parallelism of the GR algorithm is therefore constrained, resulting in lower data throughput in high-order MIMO systems. To reduce the computational complexity while maintaining the diversity gain, a lattice reduction (LR) algorithm was proposed in Ref. [20] as a preprocessing method that adjusts the estimated channel matrix in polynomial time according to the Lenstra–Lenstra–Lovász (LLL) algorithm [20]. In addition, the LR-based MMSE signal detection algorithm proposed in Ref. [21] and the LR-based zero-forcing decision feedback (ZF-DF) algorithm proposed in Ref. [29] outperform conventional linear detection algorithms. However, LR requires a huge number of conditional checks and column exchanges, resulting in uncertain data throughput, low parallelism and long latency in hardware implementation, especially as the size of the MIMO system increases. The LR-based ZF-DF algorithm also requires QR decomposition of the channel matrix, and QR decomposition is very sensitive to the order of the search stages; its scale grows with the system dimensions.
Therefore, as the number of users and antennas increases, the preprocessing part of the K-best signal detection algorithm (including QR decomposition and LR) has high complexity and low parallelism [20], which becomes one of the main factors limiting the performance of the whole detector, especially when the channel matrix is not slowly varying. After preprocessing, the detection procedure itself is also a rather complicated part of the whole process. The number of child nodes in each stage determines the complexity of the K-best signal detection algorithm, and seeking the optimal solution in a smaller vector space can greatly reduce the number of child nodes. Therefore, when the preprocessed channel matrix has better properties (i.e., is closer to orthogonal), the complexity of the detection procedure itself can be greatly reduced.

Ref. [8] gives a detailed comparison of the computational complexity of the GS, HT and GR algorithms in terms of real-valued operations. For a complex-valued system, the GS and GR algorithms require approximately $$ O\left( {4N^{3} } \right) $$ real-domain multiplications, while the HT algorithm needs more computation because it must compute the unitary matrix $$ \varvec{Q} $$ [26]. Therefore, although the HT algorithm can eliminate elements column by column, its hardware consumption is prohibitive. An improved GS algorithm for $$ 4 \times 4 $$ QR decomposition is proposed in Ref. [25], in which the matrices $$ \varvec{Q} $$ and $$ \varvec{R} $$ as well as the intermediate computation values are expanded for parallel design. However, compared with the conventional GS algorithm, the number of multiplications is not reduced. In Ref. [23], a simplified GS algorithm for matrix decomposition is proposed, which achieves low complexity, especially for stable permutation QR decomposition of slowly varying channels. However, when the channel is not slowly varying, this algorithm suffers a nonnegligible loss in detection accuracy. A time-dependent channel matrix is considered in Ref. [30], in which only some columns of the matrix are updated over time; an algorithm combining an approximate $$ \varvec{Q} $$ matrix retention scheme and a precise QR decomposition update scheme is proposed there to reduce the computational complexity. In Ref. [8], a GR algorithm using coordinate rotation digital computer (CORDIC) units simplifies the complex arithmetic units in QR decomposition through iterative shift and addition operations. A new decomposition scheme based on a real-valued system is also proposed in Ref. [8] to further reduce the number of arithmetic operations. However, this simplified QR decomposition combined with K-best signal detection suffers a serious loss of precision compared with the ML signal detection algorithm.
Other nonlinear signal detection algorithms, such as the TASER algorithm [10] and the probabilistic data association (PDA) algorithm [31], are also unsuitable for high-order modulation systems: they suffer serious precision loss, and their hardware implementation is very difficult.

The sorted QR decomposition can narrow the search range and further improve the accuracy of the K-best signal detection algorithm while keeping complexity low. During the execution of the K-best signal detection algorithm, each element of $$ \hat{\varvec{s}}^{\text{ML}} $$ is estimated as shown in Eq. (4.12).
$$ \hat{s}_{i} = {{\left( {\hat{y}_{i} - \sum\limits_{{j = i + 1}}^{N} {R_{{i,j}} s_{j} } } \right)} \mathord{\left/ {\vphantom {{\left( {\hat{y}_{i} - \sum\limits_{{j = i + 1}}^{N} {R_{{i,j}} s_{j} } } \right)} {R_{{i,i}} }}} \right. \kern-\nulldelimiterspace} {R_{{i,i}} }} $$
(4.12)
Although $$ \hat{y}_{i} $$ contains noise, a large principal diagonal element $$ R_{i,i} $$ can mitigate the influence of noise and signal interference. QR decomposition is performed column by column, which means that the matrix $$ \varvec{R} $$ is generated row by row. Because the magnitude of the determinant of the matrix $$ \varvec{H} $$ is fixed, the product of all diagonal elements of $$ \varvec{R} $$ is also a constant. When the $$ i $$(th) column is decomposed, there is
$$ R_{i,i} = \text{norm}\left( {\varvec{H}\left( {i:N,i} \right)} \right) = \sqrt {\sum\limits_{j = i}^{N} {\left| {H_{j,i} } \right|^{2} } } , $$
(4.13)
where $$ R_{i,i} $$ decreases as $$ i $$ increases. Sorting selects the remaining column with the smallest norm as the next column to be decomposed, ensuring that the product of the remaining diagonal elements stays as large as possible.
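The norm-based sorting rule can be illustrated with a modified Gram–Schmidt QR with column pivoting. This is a sketch of the sorting idea only, not the Cholesky-based procedure of Sect. 4.2.4: at each step the remaining column with the smallest residual norm (Eq. 4.13) is chosen first, so that later diagonal elements of $$ \varvec{R} $$ remain large.

```python
import numpy as np

def sorted_qr(H):
    """Sorted QR decomposition via modified Gram-Schmidt with pivoting.

    Returns Q, R and the column permutation such that Q @ R == H[:, perm].
    """
    H = H.astype(complex).copy()
    N = H.shape[1]
    perm = np.arange(N)
    Q = H.copy()
    R = np.zeros((N, N), dtype=complex)
    for i in range(N):
        # pick the not-yet-processed column with the smallest residual norm
        norms = np.linalg.norm(Q[:, i:], axis=0)
        k = i + int(np.argmin(norms))
        Q[:, [i, k]] = Q[:, [k, i]]          # swap columns of Q ...
        R[:, [i, k]] = R[:, [k, i]]          # ... and of the rows of R built so far
        perm[[i, k]] = perm[[k, i]]
        R[i, i] = np.linalg.norm(Q[:, i])    # Eq. (4.13): residual column norm
        Q[:, i] /= R[i, i]
        for j in range(i + 1, N):            # orthogonalize remaining columns
            R[i, j] = Q[:, i].conj() @ Q[:, j]
            Q[:, j] -= R[i, j] * Q[:, i]
    return Q, R, perm
```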

4.2.3 Lattice Reduction

The LR algorithm can further improve detection performance. For example, the LR algorithm is combined with the MMSE signal detection algorithm in Ref. [30] and with the ZF-DF algorithm in Ref. [29]. In the LR-based K-best and ZF-DF signal detection algorithms, the preprocessing part (QR decomposition and LR) plays an important role in reducing the computational complexity, especially as the size of the MIMO system increases. Applying LR to the channel matrix $$ \varvec{H} $$ yields an approximately orthogonal channel matrix, that is
$$ \bar{\varvec{H}} = \varvec{HT}, $$
(4.14)
where $$ \varvec{T} $$ is a unimodular matrix. $$ \bar{\varvec{H}} $$ is much better conditioned than $$ \varvec{H} $$ and thus amplifies the noise less. The LLL algorithm is an LR algorithm known for its polynomial-time computational complexity [20]. It checks and corrects the matrix $$ \varvec{R} $$ so that it satisfies two conditions. In this section, a variant of the LLL algorithm is adopted, which must satisfy the Siegel condition [20] expressed in Eq. (4.15) and the size reduction condition expressed in Eq. (4.16). The parameter $$ \delta $$ satisfies $$ 0.25 < \delta < 1 $$. The Siegel condition keeps the ratio between adjacent diagonal elements within a small range to prevent excessively small diagonal elements. The size reduction condition ensures that the diagonal elements are slightly dominant so as to achieve approximate orthogonality. The adjusted channel matrix $$ \varvec{H} $$ or upper triangular matrix $$ \varvec{R} $$ can suppress interference between different antennas.
$$ \delta \left| {R_{k - 1,k - 1} } \right| > \left| {R_{k,k} } \right|, \, k = 2,3, \ldots ,N $$
(4.15)
$$ \frac{1}{2}\left| {R_{k - 1,k - 1} } \right| > \left| {R_{k - 1,r} } \right|,\,2 \le k \le r \le N $$
(4.16)

However, the LR algorithm requires multiple conditional checks and column exchanges, resulting in uncertain data throughput, low parallelism and long latency. The sorted QR decomposition is essentially similar to the LR algorithm: both achieve better properties by adjusting the matrix $$ \varvec{R} $$. The K-best signal detection algorithm can therefore achieve near-ML accuracy with a small number of branch extensions. However, as the scale of the MIMO system increases, the computational complexity of preprocessing becomes uncontrollable. Sorted QR decomposition and LR both add considerable computational complexity, especially LR, which requires an uncertain number of poorly parallelizable operations.
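The two LLL-style conditions can be sketched as follows: `siegel_ok` checks Eq. (4.15) and `size_reduce` applies the column operations enforcing the size reduction of Eq. (4.16) while accumulating the unimodular matrix $$ \varvec{T} $$. The function names are hypothetical, and the column-swap branch of a full LLL/Siegel reduction is deliberately omitted.

```python
import numpy as np

def siegel_ok(R, delta=0.75):
    """Check the Siegel condition (Eq. 4.15) on an upper-triangular R."""
    d = np.abs(np.diag(R))
    return all(delta * d[k - 1] > d[k] for k in range(1, len(d)))

def size_reduce(R, T):
    """Size reduction (Eq. 4.16): shrink off-diagonal entries of R relative
    to the diagonal via integer column operations, mirrored onto T."""
    N = R.shape[0]
    for k in range(1, N):
        for i in range(k - 1, -1, -1):       # reduce column k from row k-1 upward
            q = R[i, k] / R[i, i]
            mu = np.round(q.real) + 1j * np.round(q.imag)  # Gaussian-integer rounding
            if mu != 0:
                R[:, k] -= mu * R[:, i]      # column operation on R
                T[:, k] -= mu * T[:, i]      # same operation keeps T unimodular
    return R, T
```

Because every operation is a Gaussian-integer column combination, $$ \left| {\det \varvec{T}} \right| $$ stays equal to 1, so the reduced basis spans the same lattice.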

4.2.4 Cholesky Preprocessing

4.2.4.1 Cholesky-Sorted QR Decomposition

In the K-best signal detector, matrix preprocessing must deliver both high detection accuracy and computational efficiency. The first step of K-best preprocessing is QR decomposition of the channel matrix $$ \varvec{H} $$. In the decomposition process, the computation of the unitary matrix $$ \varvec{Q} $$ leads to high computational complexity. The QR decomposition algorithm in this section uses the properties of the matrices $$ \varvec{Q} $$ and $$ \varvec{R} $$ to avoid computing $$ \varvec{Q} $$. With the QR decomposition $$ \varvec{H} = \varvec{QR} $$, there is
$$ \varvec{H}^{\text{H}} \varvec{H} = \left( {\varvec{QR}} \right)^{\text{H}} \varvec{QR} = \varvec{R}^{\text{H}} \left( {\varvec{Q}^{\text{H}} \varvec{Q}} \right)\varvec{R} = \varvec{R}^{\text{H}} \varvec{R} $$
(4.17)
The elements of the channel matrix $$ \varvec{H} $$ are independent and identically distributed complex Gaussian variables with mean 0 and variance 1, so matrix $$ \varvec{H} $$ is nonsingular; that is, $$ \varvec{A} = \varvec{H}^{\text{H}} \varvec{H} = \varvec{R}^{\text{H}} \varvec{R} $$ is positive definite. Therefore, $$ \varvec{A} $$ is a Hermitian positive definite matrix. Cholesky decomposition factors a Hermitian positive definite matrix into the product of a lower triangular matrix and its conjugate transpose, namely $$ \varvec{A} = \varvec{LL}^{\text{H}} $$. Under Cholesky decomposition, the matrix $$ \varvec{R} $$ is exactly the upper triangular matrix $$ \varvec{L}^{\text{H}} $$, and the calculation of the unitary matrix $$ \varvec{Q} $$ is avoided. Although matrix $$ \varvec{R} $$ is obtained by this shortcut, Eq. (4.10) shows that matrix $$ \varvec{Q} $$ still seems to be needed to compute $$ \varvec{Q}^{\text{H}} \varvec{y} $$. Since $$ \varvec{H} = \varvec{QR} $$, $$ \varvec{Q} $$ can be expressed as $$ \varvec{Q} = \varvec{HR}^{ - 1} $$. Therefore, the calculation of $$ \varvec{Q}^{\text{H}} \varvec{y} $$ is converted to
$$ \varvec{Q}^{\text{H}} \varvec{y} = \left( {\varvec{HR}^{ - 1} } \right)^{\text{H}} \varvec{y} = \left( {\varvec{R}^{ - 1} } \right)^{\text{H}} \varvec{H}^{\text{H}} \varvec{y} $$
(4.18)

In Eq. (4.18), the direct computation of $$ \varvec{Q}^{\text{H}} \varvec{y} $$ is replaced by the inversion of the upper triangular matrix $$ \varvec{R} $$ and two matrix–vector multiplications. The computational complexity of inverting the upper triangular matrix is $$ O\left( {N^{3} } \right) $$, which is significantly lower than that of directly computing matrix $$ \varvec{Q} $$. In the GR algorithm, when eliminating the elements of matrix $$ \varvec{H} $$, the same transformation is applied to the vector $$ \varvec{y} $$ to obtain $$ \varvec{Q}^{\text{H}} \varvec{y} $$ without directly computing matrix $$ \varvec{Q} $$; however, computing the matrix $$ \varvec{R} $$ there requires multiplications of $$ O\left( {4N^{3} } \right) $$ complexity. The computational complexity of these algorithms is compared in the subsequent chapters.
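The Q-free computation of $$ \varvec{Q}^{\text{H}} \varvec{y} $$ in Eqs. (4.17)–(4.18) can be checked numerically. The sketch below assumes a square nonsingular $$ \varvec{H} $$ (with hypothetical dimensions), obtains $$ \varvec{R} $$ as the conjugate transpose of the Cholesky factor of the Gram matrix, and forms $$ \varvec{y^{\prime}} = \left( {\varvec{R}^{ - 1} } \right)^{\text{H}} \varvec{H}^{\text{H}} \varvec{y} $$ by a triangular solve.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
H = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Gram matrix and Cholesky factor: A = H^H H = R^H R, so R = L^H (Eq. 4.17)
A = H.conj().T @ H
L = np.linalg.cholesky(A)        # lower triangular, A = L L^H
R = L.conj().T                   # upper triangular with positive real diagonal

# Q^H y via Eq. (4.18), without ever forming Q explicitly:
# y' = (R^-1)^H H^H y, computed as a triangular solve with R^H (= L)
y_prime = np.linalg.solve(R.conj().T, H.conj().T @ y)

# For square nonsingular H, Q = H R^{-1} is unitary, hence for any s:
# ||y - H s||^2 == ||y' - R s||^2  (checked in the test below)
```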

The next question is how to combine the sorting operation with the QR decomposition. In the Cholesky-based algorithm, the QR decomposition is realized through the Gram matrix $$ \varvec{A} = \varvec{H}^{\text{H}} \varvec{H} $$. Based on the Cholesky decomposition, the matrix $$ \varvec{R} $$ is generated row by row; the pseudocode of the Cholesky decomposition is shown in Algorithm 4.3.

Algorithm 4.3

Cholesky decomposition algorithm

../images/478198_1_En_4_Chapter/478198_1_En_4_Figc_HTML.png
Here, a 4 × 4 matrix is used as an example. When $$ i = 1 $$ (the first round of sorting), $$ \varvec{A} = \varvec{H}^{\text{H}} \varvec{H} $$, and the values to be compared in the sorting are shown in Eq. (4.19).
$$ V_{k} = \varvec{h}_{k}^{\text{H}} \varvec{h}_{k} = \text{norm}^{2} \left( {\varvec{h}_{k} } \right) = A_{k,k} , \, k = 1,2,3,4 $$
(4.19)
That is, each diagonal element $$ A_{k,k} $$ is the squared norm of the corresponding column of the matrix $$ \varvec{H} $$, so the row and column of $$ \varvec{A} $$ corresponding to the smallest diagonal element are exchanged with the first row and first column, after which the first round of decomposition can begin. At $$ i = 2 $$ (the second round of sorting), the value of $$ V_{k} $$ satisfies
$$ \begin{aligned} V_{k} & = \text{norm}^{2} \left( {\varvec{h}_{k} } \right) - R_{i - 1,k}^{\text{H}} R_{i - 1,k} \\ & = A_{k,k} - R_{i - 1,k}^{\text{H}} R_{i - 1,k} , \, k = 2,3,4 \\ \end{aligned} $$
(4.20)
According to Eq. (4.20) and the eighth row of Algorithm 4.3, $$ V_{k} $$ is the diagonal element of the matrix $$ \varvec{A} $$ updated in the previous round of decomposition, so when $$ i = 2 $$, the diagonal elements of the updated matrix $$ \varvec{A} $$ can be reused as the sorting criterion. When $$ i = 3, 4 $$, the analysis is similar. Therefore, the diagonal elements of the Gram matrix $$ \varvec{A} $$ can always serve as the squared column norms required by the sorting operation. In the conventional GR algorithm, when $$ i = 2 $$ (after the first round of decomposition), the matrix $$ \varvec{H} $$ can be written as
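The pivot-on-the-diagonal idea can be sketched as follows (a hypothetical NumPy helper `sorted_cholesky`; the real Algorithm 4.3 operates in place with a different loop structure, but the pivoting rule is the one described above):

```python
import numpy as np

def sorted_cholesky(H):
    """Sorted QR via Cholesky: the updated diagonal A[k, k] is the squared
    norm of the k-th remaining column, so each round pivots the smallest
    diagonal element to the front before the elimination step."""
    N = H.shape[1]
    A = (H.conj().T @ H).astype(complex)     # Gram matrix
    p = np.arange(N)                         # column permutation
    R = np.zeros((N, N), dtype=complex)
    for i in range(N):
        k = i + int(np.argmin(np.real(np.diagonal(A))[i:]))
        A[[i, k], :] = A[[k, i], :]          # symmetric row/column exchange
        A[:, [i, k]] = A[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]          # already-computed rows follow suit
        p[[i, k]] = p[[k, i]]
        R[i, i] = np.sqrt(A[i, i].real)      # Cholesky elimination of row i
        R[i, i + 1:] = A[i, i + 1:] / R[i, i]
        # rank-1 update of the trailing Gram matrix (Schur complement)
        A[i + 1:, i + 1:] -= np.outer(R[i, i + 1:].conj(), R[i, i + 1:])
    return p, R

rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
p, R = sorted_cholesky(H)
Hp = H[:, p]
assert np.allclose(R.conj().T @ R, Hp.conj().T @ Hp, atol=1e-10)
assert np.allclose(np.tril(R, -1), 0)        # R is upper triangular
```

Note that the sorting costs only a comparison over the current diagonal; no column norms are recomputed.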
$$ \left[ {\begin{array}{*{20}c} {H_{1,1} ,} & {H_{1,2} ,} & {H_{1,3} ,} & {H_{1,4} } \\ {H_{2,1} ,} & {H_{2,2} ,} & {H_{2,3} ,} & {H_{2,4} } \\ {H_{3,1} ,} & {H_{3,2} ,} & {H_{3,3} ,} & {H_{3,4} } \\ {H_{4,1} ,} & {H_{4,2} ,} & {H_{4,3} ,} & {H_{4,4} } \\ \end{array} } \right] \to \left[ {\begin{array}{*{20}c} {R_{1,1} ,} & {R_{1,2} ,} & {R_{1,3} ,} & {R_{1,4} } \\ {0,} & {H^{\prime}_{2,2} ,} & {H^{\prime}_{2,3} ,} & {H^{\prime}_{2,4} } \\ {0,} & {H^{\prime}_{3,2} ,} & {H^{\prime}_{3,3} ,} & {H^{\prime}_{3,4} } \\ {0,} & {H^{\prime}_{4,2} ,} & {H^{\prime}_{4,3} ,} & {H^{\prime}_{4,4} } \\ \end{array} } \right], $$
(4.21)
where $$ H^{\prime}_{i,j} $$ is the updated element in row i, column j of the matrix $$ \varvec{H} $$. Although the GR algorithm preserves the norm of every full column, the sorting at step $$ i $$ must use the norm over the rows that have not yet been eliminated, so when the decomposition of column 2 is performed, the value to be compared for sorting column $$ k $$ is
$$ V_{k} = \sum\limits_{j = i}^{4} {\left( {H^{\prime}_{j,k} } \right)^{\text{H}} H^{\prime}_{j,k} } ,\,k = 2,3,4 $$
(4.22)
To obtain the correct ordering, the value to be compared in each round of sorting must be computed from the updated matrix $$ \varvec{H} $$. This computation takes $$ \frac{2}{3}N^{3} $$ real multiplications. Figure 4.6 illustrates the difference between the Cholesky sorted QR decomposition algorithm of this section and the conventional algorithm, assuming that the fourth column has the smallest norm in the second round of decomposition. In the proposed algorithm, sorting is achieved by exchanging rows and columns after matrix updates that touch only the diagonal elements. In the conventional algorithm, the squared norms of all three remaining column vectors must be computed before sorting, and only then can the columns be exchanged.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig6_HTML.png
Fig. 4.6

Difference between Cholesky sorted QR decomposition algorithm and a conventional algorithm. a Use this algorithm to sort QRD, b use a conventional algorithm to sort QRD.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

Compared with other Cholesky decomposition algorithms for MIMO detection [29, 32, 33], the algorithm proposed in this section differs slightly in purpose and implementation details. The ZF-DF algorithm and the successive interference cancelation (SIC) detector [29, 33] require only the diagonal elements of $$ \varvec{R} $$ from the QR decomposition. There, the Gram matrix is factored by the LDL algorithm into a lower triangular matrix $$ \varvec{L} $$ with unit diagonal and a diagonal matrix $$ \varvec{D} $$ with real entries, and the diagonal elements of $$ \varvec{R} $$ are then computed from $$ \varvec{D} $$. In the K-best signal detection algorithm, however, the whole upper triangular matrix $$ \varvec{R} $$ is needed, so $$ \varvec{R} $$ cannot be obtained from the LDL algorithm alone; other algorithms would be required, delaying the subsequent K-best search. In the CHOSLAR algorithm of this section, the matrices $$ \varvec{R} $$ and $$ \varvec{Q} $$ are obtained directly from the matrix decomposition, so the subsequent K-best search can start after a very short time. In Ref. [32], the Cholesky algorithm is used to decompose and invert the Gram matrix within a linear MMSE signal detector. Compared with these algorithms, the CHOSLAR algorithm performs the Cholesky-based sorted QR decomposition first and then executes the LR and K-best algorithms, so it is suitable for high-performance nonlinear detection. In short, the algorithm in this section is more favorable to the subsequent K-best search than the other algorithms. It also embeds the sorting operation in the decomposition process, which greatly improves the detection accuracy compared with the Cholesky decomposition algorithms in Refs. [29, 32, 33].
The resulting matrix has flat diagonal elements, which benefits the LR and K-best computations and increases the detection accuracy. In conventional QR decomposition, a sorting operation is used to optimize the matrix $$ \varvec{R} $$ [21, 24, 25, 27]. In the sorted QR decomposition algorithm of this section, the squared norm of each column of the matrix $$ \varvec{H} $$ is already available when the matrix $$ \varvec{A} $$ is adjusted. As a result, the algorithm needs no additional operations for sorting, because the adjustment of $$ \varvec{A} $$ is already part of the Cholesky decomposition process. Conventional QR decomposition, whether by the GR [27] or the GS algorithm [21, 24, 25], operates directly on the matrix $$ \varvec{H} $$. The decomposition proceeds column by column, and the sorting operation requires an extra step to compute the norms of all remaining columns that have not yet been decomposed.

4.2.4.2 PILR

In this section, a PILR algorithm is described to reduce the number of column exchanges while maintaining a constant throughput.

The algorithm runs $$ T $$ iterations, working upward from the last row to row $$ N/2 + 1 $$. In step $$ k $$ of each iteration, the algorithm first checks the Siegel condition of Eq. (4.15). If the condition is not met, the algorithm calculates the parameter $$ \mu $$ according to Eq. (4.23), where round() is the rounding function.
$$ \mu = \text{round}\left( {R_{k - 1,k} /R_{k - 1,k - 1} } \right) $$
(4.23)
Then, column $$ k - 1 $$ of $$ \varvec{R} $$ is multiplied by $$ \mu $$, and the result is subtracted from column $$ k $$, as shown in Eq. (4.24):
$$ R_{1:k,k} \leftarrow R_{1:k,k} - \mu R_{1:k,k - 1} $$
(4.24)
The channel matrix $$ \varvec{H} $$ undergoes the corresponding column operation shown in Eq. (4.25).
$$ H_{:,k} \leftarrow H_{:,k} - \mu H_{:,k - 1} $$
(4.25)
The operation described in Eq. (4.25) is a single size reduction, which ensures that the Siegel condition can subsequently be satisfied. After the single size reduction, columns $$ k $$ and $$ k - 1 $$ of $$ \varvec{R} $$ and $$ \varvec{H} $$ are exchanged to obtain $$ \hat{\varvec{R}} $$ and $$ \hat{\varvec{H}} $$. This exchange destroys the triangular form of the matrix $$ \varvec{R} $$. Therefore, a $$ 2 \times 2 $$ GR matrix $$ \varvec{\theta} $$ is needed, that is
$$ \varvec{\theta}= \left[ {\begin{array}{*{20}c} {a^{\text{H}} ,} & {b^{\text{H}} } \\ { - b,} & a \\ \end{array} } \right], $$
(4.26)
where the value of $$ a $$ and $$ b $$ can be expressed as
$$ a = \frac{{\hat{R}_{k - 1,k - 1} }}{{\sqrt {\hat{R}_{k - 1,k - 1}^{2} { + }\hat{R}_{k,k - 1}^{2} } }},\,b = \frac{{\hat{R}_{k,k - 1} }}{{\sqrt {\hat{R}_{k - 1,k - 1}^{2} { + }\hat{R}_{k,k - 1}^{2} } }} $$
(4.27)
Finally, rows $$ k $$ and $$ k - 1 $$ are updated by left-multiplying by the matrix $$ \varvec{\theta} $$ to restore the triangular form of the matrix $$ \varvec{R} $$, i.e.,
$$ \left[ {\begin{array}{*{20}l} {\hat{R}_{k - 1,k - 1} ,} \hfill & {\hat{R}_{k - 1,k} ,} \hfill & \ldots \hfill \\ {0,} \hfill & {\hat{R}_{k,k} ,} \hfill & \ldots \hfill \\ \end{array} } \right] \leftarrow \theta \times \left[ {\begin{array}{*{20}l} {\hat{R}_{k - 1,k} ,} \hfill & {\hat{R}_{k - 1,k - 1} ,} \hfill & \ldots \hfill \\ {\hat{R}_{k,k} ,} \hfill & {0,} \hfill & \ldots \hfill \\ \end{array} } \right] $$
(4.28)
The full size reduction process is similar to the single size reduction step but is performed on the entire matrix $$ \hat{\varvec{R}} $$: the algorithm iterates over the elements from $$ \hat{R}_{k,k} $$ to $$ \hat{R}_{1,1} $$ one by one. As an example, Fig. 4.7 shows the element values of the sorted and unsorted matrix $$ \varvec{R} $$ for a $$ 16 \times 16 $$ MIMO system. Because the diagonal elements after sorted QR decomposition are flat, the LR algorithm can be applied to only the $$ 8 \times 8 $$ submatrix in the lower right corner of $$ \varvec{R} $$, iterating from the sixteenth column to the ninth column. The complete preprocessing algorithm, comprising the Cholesky sorted QR decomposition and the PILR, is shown in Algorithm 4.4.
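The single size reduction and triangle restoration of Eqs. (4.23)-(4.28) can be sketched as below. This is a sketch only: $$ \varvec{R} $$ and $$ \varvec{H} $$ are taken real-valued for readability, and since Eq. (4.15) lies outside this excerpt the Siegel test is written in the common form $$ \delta R_{k - 1,k - 1}^{2} \le R_{k,k}^{2} $$ with an assumed parameter $$ \delta = 0.75 $$:

```python
import numpy as np

def pilr_step(R, H, k, delta=0.75):
    """One column-pair step of the PILR at column k (0-indexed, k >= 1).
    Returns True if a size reduction + swap + rotation was performed."""
    if delta * R[k - 1, k - 1] ** 2 <= R[k, k] ** 2:
        return False                               # Siegel condition already met
    mu = np.round(R[k - 1, k] / R[k - 1, k - 1])   # Eq. (4.23)
    R[:k + 1, k] -= mu * R[:k + 1, k - 1]          # Eq. (4.24)
    H[:, k] -= mu * H[:, k - 1]                    # Eq. (4.25)
    R[:, [k - 1, k]] = R[:, [k, k - 1]]            # exchange the two columns
    H[:, [k - 1, k]] = H[:, [k, k - 1]]
    r = np.hypot(R[k - 1, k - 1], R[k, k - 1])
    a, b = R[k - 1, k - 1] / r, R[k, k - 1] / r    # Eq. (4.27)
    theta = np.array([[a, b], [-b, a]])            # Eq. (4.26), real case
    R[k - 1:k + 1, :] = theta @ R[k - 1:k + 1, :]  # Eq. (4.28)
    return True

rng = np.random.default_rng(2)
N = 4
R = np.triu(rng.standard_normal((N, N)))
R[np.arange(N), np.arange(N)] = [3.0, 2.5, 2.0, 0.1]   # weak last diagonal
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
H = Q @ R                                      # so that H^T H = R^T R
swapped = pilr_step(R, H, k=3)
assert swapped
assert np.allclose(np.tril(R, -1), 0, atol=1e-12)      # triangular form restored
assert np.allclose(R.T @ R, H.T @ H, atol=1e-10)       # same lattice Gram matrix
```

The final assertion captures why the exchange is legitimate: the column operations are unimodular and the rotation is orthogonal, so the Gram matrix of the lattice basis is unchanged.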
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig7_HTML.png
Fig. 4.7

a Element values of unsorted matrix R, b sorted matrix R.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

Algorithm 4.4

CHOSLAR algorithm

../images/478198_1_En_4_Chapter/478198_1_En_4_Figd_HTML.png
../images/478198_1_En_4_Chapter/478198_1_En_4_Fige_HTML.png

In Ref. [34], conventional LR and partial LR operations are used to optimize the matrix $$ \varvec{R} $$. A conventional algorithm first uses the identity matrix $$ \varvec{T} $$ to track all column adjustments, so that the channel matrix $$ \varvec{H} $$ can be used directly. Then, when the full size reduction is executed, the original LR [29] and partial LR [34] algorithms iterate over the elements from $$ \hat{R}_{1,1} $$ to $$ \hat{R}_{k,k} $$ one by one, whereas the algorithm proposed in this section operates in the opposite direction, which makes row-wise operation possible and thus improves parallelism. As shown in Chap. 5, the algorithm supports a high-throughput ASIC hardware architecture, which confirms its high parallelism. Finally, the conventional LR algorithm must adjust the entire matrix $$ \hat{\varvec{R}} $$, whereas the algorithm in this section combines LR with the sorted QR decomposition: both need to adjust the matrix $$ \hat{\varvec{R}} $$, and combining them yields more advantageous features. The PILR algorithm runs $$ T $$ iterations from the last row up to row $$ N/2 + 1 $$; the algorithm exploits this characteristic and the combination in order to reduce the total number of column exchanges.

4.2.5 Improved K-Best Detector and Its Performance Simulation

Based on the proposed preprocessing algorithm, an improved K-best detector with K = 10 is adopted in this section. The detector is divided into N stages to solve for the output signal $$ \hat{\varvec{s}} $$. Taking the $$ 16 \times 16 $$ MIMO system as an example, the search tree is shown in Fig. 4.8. An approximate solution $$ \hat{s}_{N} $$ is computed in Stage 1, as shown in Eq. (4.29).
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig8_HTML.png
Fig. 4.8

Tree extension of K-best detector.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

$$ \hat{s}_{N} = \hat{y}_{N} /\hat{R}_{N,N} $$
(4.29)
Then, the four Gaussian integers closest to $$ \hat{s}_{N} $$ in the 64QAM constellation are obtained in a certain order, as shown in Fig. 4.9. These four nodes, in ascending order of their partial Euclidean distances (PEDs), serve as the parent nodes of Stage 2. In Stage 2, the four parent nodes perform interference cancelation in turn. Then, the parent node with the minimum PED extends four child nodes in the same way as in Stage 1, and each of the other three parent nodes extends two child nodes in a certain order. The 10 child nodes, in ascending order of their PEDs, serve as the parent nodes of Stage 3. Stages 3–15 are structurally similar: first, the 10 parent nodes perform interference cancelation successively; then the parent node with the minimum PED expands into four child nodes and each of the other parent nodes expands into two child nodes; finally, the 10 child nodes with the smallest PEDs are selected as the parent nodes of the next stage. In the last stage, after interference cancelation, each parent node expands only one child node. Finally, the child node with the minimum PED is selected, and the path corresponding to this child node is the output signal $$ \hat{\varvec{s}} $$.
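A simplified breadth-first K-best search conveys the structure of the stages just described. This is a sketch only: every surviving path expands to all constellation points, whereas the detector above limits expansion to four or two children per parent via sorted enumeration; the helper name `kbest_detect` and the tiny BPSK test case are illustrative.

```python
import numpy as np
from itertools import product

def kbest_detect(R, yhat, constellation, K):
    """K-best tree search over an upper triangular system yhat ~ R s."""
    N = R.shape[0]
    paths = [((), 0.0)]                       # (decided symbols s_N..s_{i+1}, PED)
    for i in range(N - 1, -1, -1):            # Stage 1 decides s_N, cf. Eq. (4.29)
        cand = []
        for tail, ped in paths:
            # interference cancelation with the already-decided symbols
            interf = sum(R[i, j] * s for j, s in zip(range(N - 1, i, -1), tail))
            for s in constellation:
                e = yhat[i] - interf - R[i, i] * s
                cand.append((tail + (s,), ped + abs(e) ** 2))
        paths = sorted(cand, key=lambda c: c[1])[:K]   # keep the K smallest PEDs
    best_tail, best_ped = paths[0]
    return np.array(best_tail[::-1]), best_ped         # reorder to s_1..s_N

# with K = |constellation|^N the search is exhaustive, hence exactly ML
rng = np.random.default_rng(3)
N = 3
R = np.triu(rng.standard_normal((N, N)))
R[np.arange(N), np.arange(N)] = 1.0 + np.abs(R[np.arange(N), np.arange(N)])
yhat = rng.standard_normal(N)
s_hat, _ = kbest_detect(R, yhat, (-1.0, 1.0), K=8)
s_ml = min(product((-1.0, 1.0), repeat=N),
           key=lambda s: float(np.sum((yhat - R @ np.array(s)) ** 2)))
assert np.allclose(s_hat, np.array(s_ml))
```

With a smaller K the search becomes approximate, which is exactly the complexity/accuracy trade-off the detector exploits.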
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig9_HTML.png
Fig. 4.9

Enumeration of constellation points.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

The BER is simulated below to estimate the performance of the algorithm. All algorithms consider 64QAM-modulated high-order MIMO systems and employ a rate-1/2 randomly interleaved convolutional code [1, 5]. The number of encoded symbols is 120 and the number of frames is 100,000. The channel is an independent and identically distributed Rayleigh fading channel, and the SNR is defined at the receiver. Figure 4.10a compares the BER performance of the full-matrix LR and PILR algorithms in $$ 16 \times 16 $$ MIMO systems with different numbers of iterations. It can be seen that the K-best signal detection algorithm with lattice reduction at $$ K = 10 $$ achieves near-optimal performance when the number of iterations is 3. Therefore, when the lower-right-corner $$ 8 \times 8 $$ PILR algorithm described in this section is customized for the $$ 16 \times 16 $$ MIMO system, the algorithm iterates three times from the last row to the ninth row. Figure 4.10b compares the BER performance of the K-best signal detection algorithm with different $$ K $$ values $$ (K = 8,10,12) $$ using the CHOSLAR preprocessing. Compared with the ML and CHOSLAR (K = 12) algorithms, the performance loss of the CHOSLAR (K = 10) algorithm is 1.44 dB and 0.53 dB respectively (at a BER of $$ 10^{ - 5} $$), which is within the acceptable range. However, when $$ K = 8 $$, there is a performance loss of 3.46 dB compared to the ML algorithm, so $$ K = 10 $$ is used in this section. Figure 4.10c compares the BER performance of different detection algorithms $$ (K = 10) $$ for $$ 16 \times 16 $$ MIMO systems with 64QAM modulation. Compared with the ML algorithm, the performance loss of the K-best algorithm with the CHOSLAR preprocessing of this section is 1.44 dB at a BER of $$ 10^{ - 5} $$, which is very close to that of the K-best algorithm with sorted QR decomposition and LR (0.89 dB) and the K-best algorithm with QR decomposition and LR (1.2 dB) [20].
Furthermore, the algorithm in this section compares favorably with the K-best algorithm with sorted QR decomposition in Refs. [22, 24] (3.37 dB), the K-best algorithm with QR decomposition in Ref. [25] (3.88 dB), the K-best algorithm with simplified QR decomposition in Ref. [35] (4.96 dB), and the MMSE-LR and MMSE algorithms in Ref. [21] (both more than 8 dB). At the same time, Fig. 4.10c shows that the performance of the algorithm combining sorted QR decomposition and LR is significantly better than that of the algorithm using sorted QR decomposition alone. It should be noted that the simulation results are based on 64QAM modulation, which the TASER algorithm does not support [10].
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig10_HTML.png
Fig. 4.10

Comparison of BER performance, a PILR and full-matrix LR, b different K values, c different algorithms.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

The above simulation results are all based on the $$ 16 \times 16 $$ MIMO system, where the CHOSLAR algorithm of this section shows clear advantages in BER performance. To demonstrate its advantages in higher order MIMO systems and with different modulation types, Fig. 4.11 shows the BER performance of the algorithm under different simulation settings. Figure 4.11a, b compare the BER performance under 64QAM modulation of the $$ 64 \times 64 $$ MIMO system and the $$ 128 \times 128 $$ MIMO system respectively. Because the complexity of the ML signal detection algorithm is prohibitive in $$ 64 \times 64 $$ and $$ 128 \times 128 $$ MIMO systems, its simulation results are not shown in the figure. As can be seen from Fig. 4.11a, the proposed algorithm has a performance loss of 0.77 dB compared with the K-best signal detection algorithm using sorted QR decomposition and LR in the $$ 64 \times 64 $$ MIMO system, while Fig. 4.11b shows a performance loss of 1.41 dB in the $$ 128 \times 128 $$ MIMO system. Moreover, the BER performance of the proposed algorithm is better than that of the algorithms in Refs. [21, 22, 24, 35]. Therefore, the CHOSLAR algorithm maintains its advantages in higher order MIMO systems. Figure 4.11c shows the BER performance with higher order modulation (256QAM), where the K-best signal detection algorithm uses $$ K = 14 $$. According to Fig. 4.11c, the CHOSLAR algorithm retains its advantages, with a performance loss of only 1.01 dB.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig11_HTML.png
Fig. 4.11

BER performance comparison with different configurations and different modulation. a 64 × 64 MIMO 64QAM, b 128 × 128 MIMO 64QAM, c 16 × 16 MIMO 256QAM.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

Although only symmetric MIMO systems, in which the numbers of receiving and transmitting antennas are equal, have been discussed so far, the algorithm is also applicable to asymmetric MIMO systems with more receiving antennas than transmitting antennas. Throughout the algorithm, the QR decomposition is converted into the Cholesky decomposition, and the number of receiving antennas only affects the initialization stage (row 4 of Algorithm 4.4) and the updating and column exchange of the matrix $$ \varvec{H} $$ (rows 8, 25, 26, and 38 of Algorithm 4.4). These parts of the process are affected when the number of receiving antennas exceeds the number of transmitting antennas. Initialization still reduces to simple matrix and vector multiplications, while the update and column exchange of $$ \varvec{H} $$ operate on single columns of $$ \varvec{H} $$ only. Therefore, the CHOSLAR algorithm is also applicable to the asymmetric MIMO system. Figure 4.12a illustrates the BER performance of an asymmetric $$ (16 \times 32) $$ MIMO system. The results show that the CHOSLAR algorithm is also applicable to the asymmetric MIMO system and still maintains its advantages.
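The claim carries over directly to the Gram-matrix formulation: only the $$ \varvec{H}^{\text{H}} \varvec{H} $$ initialization and the per-column updates see the number of receiving antennas. A minimal check with a tall $$ 32 \times 16 $$ channel (NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
Nr, Nt = 32, 16                     # asymmetric: more receive antennas
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)

A = H.conj().T @ H                  # Gram matrix is still only Nt x Nt
R = np.linalg.cholesky(A).conj().T  # upper triangular Nt x Nt factor
Q = H @ np.linalg.inv(R)            # thin Nr x Nt factor with orthonormal columns

assert np.allclose(Q.conj().T @ Q, np.eye(Nt), atol=1e-8)
assert np.allclose(Q @ R, H, atol=1e-8)
```

The decomposition, LR, and K-best stages all act on the $$ N_{\text{t}} \times N_{\text{t}} $$ factor, so their cost is unchanged by the extra receiving antennas.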
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig12_HTML.png
Fig. 4.12

Comparison of BER performance. a Asymmetric (16 × 32) MIMO system, b Kronecker channel (16 × 16 MMIMO, 64QAM).

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

To better reflect how different channel characteristics affect the algorithm and the simulation results, the Kronecker channel model is used to estimate the performance [5]. The elements of the channel matrix of this model follow a distribution of the form $$ N\left( {0,d\left( \varvec{z} \right)\varvec{I}_{B} } \right) $$, where $$ d\left( \varvec{z} \right) $$ denotes the channel attenuation (such as path attenuation and shadowing). The classic path attenuation model is adopted, with the channel attenuation variable $$ d\left( \varvec{z} \right) = C/\left\| {\varvec{z} - \varvec{b}} \right\|^{\kappa } $$, where $$ \varvec{b} \in \varvec{R}^{2} $$, $$ \kappa $$, and $$ \parallel \cdot\parallel $$ denote the location of the base station, the path attenuation exponent, and the Euclidean norm respectively. The independent shadowing attenuation, represented by $$ C $$, satisfies $$ 10{ \lg }C \sim N\left( {0,\sigma_{\text{sf}}^{2} } \right) $$. Another distinctive feature of the Kronecker channel model is that it accounts for channel correlation: $$ \varvec{R}_{\text{r}} $$ and $$ \varvec{R}_{\text{t}} $$ denote the channel correlation matrices of the receiving and transmitting antennas respectively. The exponential correlation model [5] is adopted, with $$ \xi $$ standing for the correlation factor. Therefore, the channel matrix $$ \varvec{H} $$ can be expressed as
$$ \varvec{H} = \varvec{R}_{\text{r}}^{1/2} \varvec{H}_{{\text{i}\text{.i}\text{.d}\text{.}}} \sqrt {d\left( \varvec{z} \right)} \varvec{R}_{\text{t}}^{1/2} , $$
(4.30)
where $$ \varvec{H}_{{{\text{i}}.{\text{i}}.{\text{d}}.}} $$ is a random matrix whose elements are independent and identically distributed complex Gaussian variables with a mean of 0 and a variance of 1. In the simulation, the radius of each hexagonal cell is $$ r = 500{\text{m}} $$, and the user location $$ \varvec{z} \in \varvec{R}^{2} $$ is independent and random. The simulation also adopts the following assumptions: $$ \kappa = 3.7 $$, $$ \sigma_{\text{sf}}^{2} = 5 $$, and transmitting antenna power $$ \rho = r^{\kappa } /2 $$. Figure 4.12b compares the BER performance of the algorithm using the Kronecker channel model with three different correlation factors ($$ \xi = 0.2 $$, 0.5, 0.7). According to Fig. 4.12b, the CHOSLAR algorithm maintains its advantages in this realistic model.
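Drawing one channel realization from Eq. (4.30) can be sketched as follows (hypothetical helpers `exp_corr` and `sqrtm_psd`; for simplicity the attenuation $$ d\left( \varvec{z} \right) $$ is passed in directly rather than derived from the cell geometry and shadowing):

```python
import numpy as np

def exp_corr(n, xi):
    """Exponential correlation model: R[i, j] = xi^|i - j|."""
    idx = np.arange(n)
    return xi ** np.abs(idx[:, None] - idx[None, :])

def sqrtm_psd(M):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def kronecker_channel(Nr, Nt, xi, d, rng):
    """One draw of H = Rr^{1/2} H_iid sqrt(d) Rt^{1/2}, Eq. (4.30)."""
    H_iid = (rng.standard_normal((Nr, Nt))
             + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
    return np.sqrt(d) * (sqrtm_psd(exp_corr(Nr, xi)) @ H_iid
                         @ sqrtm_psd(exp_corr(Nt, xi)))

rng = np.random.default_rng(5)
H = kronecker_channel(16, 16, xi=0.5, d=1.0, rng=rng)
assert H.shape == (16, 16)
Rr = exp_corr(16, 0.5)
assert np.allclose(sqrtm_psd(Rr) @ sqrtm_psd(Rr), Rr, atol=1e-10)
```

Setting $$ \xi = 0 $$ makes both correlation matrices the identity and recovers the i.i.d. Rayleigh channel used in the earlier simulations.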

4.2.6 Summary and Analysis

This section compares the computational complexity and parallelism of the CHOSLAR algorithm with those of other algorithms (GS, GR, and HT); a specific summary is given in Ref. [8]. The analysis shows that most of the computational complexity is attributable to the QR decomposition and the LR process. A complex-valued system is considered here, and the complexity of the QR decomposition algorithms is expressed in terms of the required real-valued operations. For the QR decomposition, the real-valued multiplications (RMUL) and real-valued additions (RADD) dominate; as in Refs. [23, 35], a real-valued division or square root is counted as one RMUL. A complex multiplication requires 4 RMUL and 2 RADD, and a complex addition requires 2 RADD. Table 4.1 lists the number of real operations required by the GS, GR, HT, and Cholesky sorted QR decomposition algorithms. The Cholesky sorted QR decomposition consists of two parts: the matrix multiplication $$ \varvec{H}^{\text{H}} \varvec{H} $$ and the decomposition of the Gram matrix $$ \varvec{A} $$. The matrix multiplication $$ \varvec{H}^{\text{H}} \varvec{H} $$ requires $$ 2N^{3} + 2N^{2} $$ RMUL and $$ N^{3} - N $$ RADD, because it is a conjugate-symmetric matrix multiplication. The Cholesky decomposition of the matrix $$ \varvec{A} $$ requires $$ N^{3} - 2N^{2} + 5N $$ RMUL and $$ N^{3} - 3N^{2} + 3N $$ RADD. Because the matrix $$ \varvec{Q} $$ does not need to be computed directly in the GR algorithm, its computational complexity is omitted in Table 4.1. To perform the sorting operation, each algorithm requires an additional $$ \frac{2}{3}N^{3} $$ RMUL [27]. As can be seen from Table 4.1, the computational complexity of the Cholesky sorted QR decomposition algorithm is lower than that of the other algorithms.
For example, when $$ N = 16 $$, compared with that of GS, GR, and HT algorithms, the number of RMULs required by the algorithm is decreased by 25.1, 44.6, and 93.2%, respectively, and the number of RADDs required by the algorithm is decreased by 55.1, 58.9, and 95.2%, respectively.
Table 4.1
Computational complexity of different detection algorithms in a higher order MIMO system

| Algorithm | Real addition | Real multiplication |
| --- | --- | --- |
| GS | $$ 4N^{3} + N^{2} - 2N $$ | $$ \frac{14}{3}N^{3} + 4N^{2} + N $$ |
| GR | $$ 4N^{3} + \frac{15}{2}N^{2} - \frac{23}{2}N $$ | $$ \frac{18}{3}N^{3} + \frac{23}{2}N^{2} - \frac{107}{6}N $$ |
| HT | $$ 2N^{4} + 5N^{3} + \frac{21}{2}N^{2} $$ | $$ \frac{8}{3}N^{4} + \frac{22}{3}N^{3} + 14N^{2} $$ |
| Cholesky sorted QR decomposition | $$ 2N^{3} - 3N^{2} + 2N $$ | $$ \frac{11}{3}N^{3} + 5N $$ |
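Evaluating the formulas of Table 4.1 at $$ N = 16 $$ reproduces the quoted reductions (a quick arithmetic sketch; since all figures are rounded, small discrepancies against the rounded percentages in the text may remain):

```python
# operation counts from Table 4.1 evaluated at N = 16
N = 16
radd = {"GS":   4 * N**3 + N**2 - 2 * N,
        "GR":   4 * N**3 + 7.5 * N**2 - 11.5 * N,
        "HT":   2 * N**4 + 5 * N**3 + 10.5 * N**2,
        "Chol": 2 * N**3 - 3 * N**2 + 2 * N}
rmul = {"GS":   14 / 3 * N**3 + 4 * N**2 + N,
        "GR":   18 / 3 * N**3 + 11.5 * N**2 - 107 / 6 * N,
        "HT":   8 / 3 * N**4 + 22 / 3 * N**3 + 14 * N**2,
        "Chol": 11 / 3 * N**3 + 5 * N}

def reduction(table, alg):
    """Percentage saving of the Cholesky sorted QRD relative to `alg`."""
    return 100 * (1 - table["Chol"] / table[alg])

assert round(reduction(rmul, "GS"), 1) == 25.1 and round(reduction(radd, "GS"), 1) == 55.1
assert round(reduction(rmul, "GR"), 1) == 44.6 and round(reduction(radd, "GR"), 1) == 58.9
assert round(reduction(radd, "HT"), 1) == 95.2 and reduction(rmul, "HT") > 90
```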

Figure 4.13 shows the simulated average and maximum numbers of column exchanges for the LR algorithm with non-sorted QR decomposition and for the LR algorithm with sorted QR decomposition in a $$ 16 \times 16 $$ MIMO system. The results show that sorted QR decomposition reduces the number of column exchanges. The PILR algorithm with constant throughput needs three iterations, each requiring eight column exchanges, for a total of 24 exchanges, which is fewer than the 44 exchanges required by full-matrix LR (a 45.5% reduction). The number of multiplications required for row updates also falls, both because the number of column exchanges is reduced and because the PILR is executed only on the lower right corner of the triangular matrix. In the $$ 16 \times 16 $$ MIMO system, the average numbers of multiplications required by the LR and PILR algorithms are 3960 and 1296 respectively; that is, the PILR algorithm reduces the number of multiplications by 67.3%. In the K-best computation, to achieve the same detection accuracy, the K-best signal detection algorithm adopted here needs only 33 comparators per stage, while the unsorted algorithm needs 216 comparators per stage, a reduction of 84.7%.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig13_HTML.png
Fig. 4.13

Column exchange comparison between LR for non-sorted QR decomposition and LR for sorted QR decomposition. a LR for non-sorted QRD, b LR for sorted QRD.

© [2018] IEEE. Reprinted, with permission, from Ref. [9]

Considering the parallelism of a hardware implementation, the Cholesky sorted QR decomposition algorithm can eliminate an entire column of the matrix $$ \varvec{A} $$ at a time, while the conventional GR algorithm based on paired CORDICs can only eliminate one element of the matrix $$ \varvec{H} $$ at a time. Moreover, the elements in the columns to the left must already have been eliminated before a new column of elements can be removed. The dependency between these eliminations limits parallelism, especially as the numbers of users and antennas grow in higher order MIMO systems. For example, for a $$ 16 \times 16 $$ matrix, the GR algorithm based on paired CORDICs takes four rounds to eliminate all the elements in the first column. Therefore, compared with the conventional QR decomposition, the Cholesky sorted QR decomposition algorithm achieves higher parallelism and lower preprocessing delay. In addition, the parallelism of the size reduction process in the LR algorithm is also improved, because the column updates of the matrices $$ \varvec{H} $$ and $$ \varvec{R} $$ are implemented row by row rather than element by element.

4.3 TASER Algorithm

4.3.1 System Model

TASER algorithm is applicable to two scenarios: the coherent data detection in MU-MIMO wireless systems and the JED in large-scale single-input multiple-output (SIMO) wireless systems.

First, consider the coherent data detection in the MU-MIMO wireless system. The channel model is given in Eq. (4.1) and the ML detection in Eq. (4.3); the channel matrix $$ \varvec{H} $$ is assumed to have been estimated.

For conventional small-scale MIMO systems, a series of SD algorithms is presented in Ref. [15]. However, the computational complexity of these algorithms increases exponentially with the number of transmitting antennas $$ N_{\text{t}} $$, so they are not suitable for massive MIMO systems. When the ratio of the number of receiving antennas to the number of transmitting antennas exceeds 2 in a massive MU-MIMO system, some newly proposed linear algorithms achieve near-ML performance [3]. However, as the number of transmitting antennas increases and this ratio approaches 1, the BER performance of these linear algorithms becomes unacceptably poor [3].

To ensure low complexity and near-optimal performance in this application scenario, the ML problem of Eq. (4.3) is relaxed into a semidefinite program (SDP) [36], and the ML detection problem must be reformulated in the relaxation process. Assume that the modulation is a constant-modulus constellation such as BPSK or QPSK. First, the system model is transformed by the real-valued decomposition, as shown in Eq. (4.31).
$$ \bar{\varvec{y}} = \varvec{\bar{H}\bar{s}} + \varvec{\bar{n}}, $$
(4.31)
where the elements are defined as follows:
$$ \begin{aligned} & \bar{\varvec{y}} = \left[ {\begin{array}{*{20}l} {\text{Re}\left\{ \varvec{y} \right\}} \hfill \\ {\text{Im} \left\{ \varvec{y} \right\}} \hfill \\ \end{array} } \right],\,\bar{\varvec{H}} = \left[ {\begin{array}{*{20}c} {\text{Re}\left\{ \varvec{H} \right\}} & { - \text{Im} \left\{ \varvec{H} \right\}} \\ {\text{Im} \left\{ \varvec{H} \right\}} & {\text{Re}\left\{ \varvec{H} \right\}} \\ \end{array} } \right] \\ & \bar{\varvec{s}} = \left[ {\begin{array}{*{20}l} {\text{Re}\left\{ \varvec{s} \right\}} \hfill \\ {\text{Im} \left\{ \varvec{s} \right\}} \hfill \\ \end{array} } \right], \, \bar{\varvec{n}} = \left[ {\begin{array}{*{20}l} {\text{Re}\left\{ \varvec{n} \right\}} \hfill \\ {\text{Im} \left\{ \varvec{n} \right\}} \hfill \\ \end{array} } \right] \\ \end{aligned} $$
(4.32)
The decomposition of Eq. (4.31) causes the ML problem to be formulated as
$$ \bar{\varvec{s}}^{{\text{ML}}} = \mathop {\arg \hbox{min} }\limits_{{\tilde{\varvec{s}} \in {\mathcal{X}}^{N} }} \text{Tr}\left( {\tilde{\varvec{s}}^{\text{H}} \varvec{T\tilde{s}}} \right) $$
(4.33)

For QPSK, $$ \varvec{T} = \left[ {\bar{\varvec{H}}^{\text{H}} \bar{\varvec{H}}, - \bar{\varvec{H}}^{\text{H}} \bar{\varvec{y}}; - \bar{\varvec{y}}^{\text{H}} \bar{\varvec{H}},\bar{\varvec{y}}^{\text{H}} \bar{\varvec{y}}} \right] $$ is an $$ N \times N $$ matrix $$ (N = 2N_{\text{t}} + 1) $$ and $$ \tilde{\varvec{s}} = \left[ {\text{Re} \left\{ \varvec{s} \right\};\text{Im} \left\{ \varvec{s} \right\};1} \right] $$, whose elements take values in $$ {\mathcal{X}} = \left\{ { - 1, + 1} \right\} $$. The solution $$ \bar{\varvec{s}}^{\text{ML}} $$ can then be converted back to the complex-domain solution $$ \left[ {\hat{\varvec{s}}^{{\text{ML}}} } \right]_{i} = \left[ {\bar{\varvec{s}}^{{\text{ML}}} } \right]_{i} + j\left[ {\bar{\varvec{s}}^{{\text{ML}}} } \right]_{i + U} ,i = 1, \ldots ,U $$. For BPSK, $$ \varvec{T} = \left[ {\underline{\varvec{H}}^{\text{H}} \underline{\varvec{H}} , - \underline{\varvec{H}}^{\text{H}} \bar{\varvec{y}}; - \bar{\varvec{y}}^{\text{H}} \underline{\varvec{H}} ,\bar{\varvec{y}}^{\text{H}} \bar{\varvec{y}}} \right] $$ is an $$ N \times N $$ matrix $$ (N = N_{\text{t}} + 1) $$ and $$ \tilde{\varvec{s}} = \left[ {\text{Re} \left\{ \varvec{s} \right\};1} \right] $$, where the $$ 2N_{\text{r}} \times N_{\text{t}} $$ matrix $$ \underline{\varvec{H}} = \left[ {\text{Re} \left\{ \varvec{H} \right\};\text{Im} \left\{ \varvec{H} \right\}} \right] $$ is defined. In this case $$ \text{Im} \left\{ \varvec{s} \right\} = 0 $$, so $$ \left[ {\hat{\varvec{s}}^{{\text{ML}}} } \right]_{i} = \left[ {\bar{\varvec{s}}^{{\text{ML}}} } \right]_{i} ,i = 1, \ldots ,U $$.
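The real-valued decomposition and the matrix $$ \varvec{T} $$ can be checked numerically: for $$ \tilde{\varvec{s}} = \left[ {\bar{\varvec{s}};1} \right] $$, the objective $$ \text{Tr}\left( {\tilde{\varvec{s}}^{\text{H}} \varvec{T\tilde{s}}} \right) $$ equals the squared residual $$ \left\| {\bar{\varvec{y}} - \varvec{\bar{H}\bar{s}}} \right\|^{2} $$, confirming that minimizing it over $$ {\mathcal{X}}^{N} $$ is ML detection (NumPy sketch for the QPSK case, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(6)
Nr, Nt = 8, 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), Nt)   # QPSK symbols
y = H @ s + 0.1 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))

# real-valued decomposition, Eq. (4.32)
yb = np.concatenate([y.real, y.imag])
Hb = np.block([[H.real, -H.imag], [H.imag, H.real]])
sb = np.concatenate([s.real, s.imag])
assert np.allclose(Hb @ sb, np.concatenate([(H @ s).real, (H @ s).imag]))

# T for QPSK and the quadratic form of Eq. (4.33)
T = np.block([[Hb.T @ Hb, -(Hb.T @ yb)[:, None]],
              [-(Hb.T @ yb)[None, :], np.array([[yb @ yb]])]])
st = np.concatenate([sb, [1.0]])
assert np.isclose(st @ T @ st, np.sum((yb - Hb @ sb) ** 2))
```

Expanding the quadratic form gives $$ \bar{\varvec{s}}^{\text{T}} \bar{\varvec{H}}^{\text{T}} \varvec{\bar{H}\bar{s}} - 2\bar{\varvec{y}}^{\text{T}} \varvec{\bar{H}\bar{s}} + \bar{\varvec{y}}^{\text{T}} \bar{\varvec{y}} $$, which is exactly the squared residual; this is the identity the SDP relaxation starts from.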

The second application scenario is JED in a massive SIMO wireless system. Suppose that a single user transmits over $$ K + 1 $$ time slots, the number of receiving antennas is $$ N_{\text{r}} $$, and the system model of the SIMO wireless channel with narrowband flat fading is [37]
$$ \varvec{Y} = \varvec{hs}^{\text{H}} + \varvec{N} , $$
(4.34)
where $$ \varvec{Y} \in \varvec{C}^{{N_{\text{r}} \times \left( {K + 1} \right)}} $$ is the matrix of received vectors collected over the $$ K + 1 $$ time slots. $$ \varvec{h} \in \varvec{C}^{{N_{\text{r}} }} $$ is the unknown SIMO channel vector and is assumed to remain constant over the $$ K + 1 $$ time slots. $$ \varvec{s}^{\text{H}} \in \varvec{O}^{{1 \times \left( {K + 1} \right)}} $$ is the transmit symbol vector containing all the data symbols of the $$ K + 1 $$ time slots, and $$ \varvec{N} \in \varvec{C}^{{N_{\text{r}} \times \left( {K + 1} \right)}} $$ is i.i.d. circularly symmetric Gaussian noise with variance $$ N_{0} $$. The ML JED problem can be expressed as [37]
$$ \left\{ {\hat{\varvec{s}}^{{\text{JED}}} ,\hat{\varvec{h}}} \right\} = \mathop {\arg \hbox{min} }\limits_{{\varvec{s} \in {\mathbf{O}}^{K + 1} ,\varvec{h} \in {\mathbf{C}}^{{N_{\text{r}} }} }} \left\| {\varvec{Y} - \varvec{hs}^{\text{H}} } \right\|_{\text{F}} $$
(4.35)

Note that both outputs of JED suffer from phase ambiguity; that is, for any phase $$ \phi $$, if $$ \hat{\varvec{s}}^{\text{JED}} {\text{e}}^{j\phi } \in {\mathcal{O}}^{K + 1} $$, then the pair with $$ \hat{\varvec{h}}{\text{e}}^{j\phi } $$ is also a solution. To avoid this ambiguity, assume that the first transmitted symbol is known to the receiver.

Since $$ \varvec{s} $$ is assumed to be modulated with a constant-modulus constellation (such as BPSK or QPSK), the ML JED estimate of the transmit symbol vector can be expressed as [37]
$$ \hat{\varvec{s}}^{{\text{JED}}} = \mathop {\arg \hbox{max} }\limits_{{\varvec{s} \in {\mathbf{O}}^{K + 1} }} \left\| {\varvec{Ys}} \right\|_{2} $$
(4.36)

$$ \hat{\varvec{h}} = \varvec{Y\hat{s}}^{\text{JED}} $$ is the estimate of the channel vector. When the number of time slots $$ K + 1 $$ is small, Eq. (4.36) can be solved exactly [37] with a low-complexity SD algorithm; however, the complexity of the SD algorithm becomes very high when the number of time slots is large. In contrast to the coherent ML detection problem of Eq. (4.33), it is not advisable to approximate the solution of Eq. (4.36) by a linear algorithm, because once the constraint $$ \varvec{s} \in \varvec{O}^{K + 1} $$ is relaxed to $$ \varvec{s} \in \varvec{C}^{K + 1} $$, the objective $$ \left\| {\varvec{Ys}} \right\|_{2} $$ can be made arbitrarily large.

The ML JED problem of Eq. (4.36) can be relaxed to an SDR with the same structure as that of the coherent ML problem of Eq. (4.33). Since the first transmitted symbol $$ s_{0} $$ is assumed known, $$ \parallel \varvec{Ys}\parallel_{2} = \parallel y_{0} s_{0} + \varvec{Y}_{\text{r}} \varvec{s}_{\text{r}} \parallel_{2} $$, where $$ \varvec{Y}_{\text{r}} = \left[ {y_{1} , \ldots ,y_{K} } \right] $$, $$ \varvec{s}_{\text{r}} = \left[ {s_{1} , \ldots ,s_{K} } \right]^{\text{H}} $$. Similar to the coherent ML problem, we transform it into a real-valued form. First, we define
$$ \bar{\varvec{y}} = \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {y_{0} s_{0} } \right\}} \\ {\text{Im} \left\{ {y_{0} s_{0} } \right\}} \\ \end{array} } \right],\,\bar{\varvec{H}} = \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{Y}_{\text{r}} } \right\}} & { - \text{Im} \left\{ {\varvec{Y}_{\text{r}} } \right\}} \\ {\text{Im} \left\{ {\varvec{Y}_{\text{r}} } \right\}} & {\text{Re} \left\{ {\varvec{Y}_{\text{r}} } \right\}} \\ \end{array} } \right],\,\bar{\varvec{s}} = \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{s}_{\text{r}} } \right\}} \\ {{\text{Im}}\left\{ {\varvec{s}_{\text{r}} } \right\}} \\ \end{array} } \right] $$
(4.37)
$$ \parallel y_{0} s_{0} + \varvec{Y}_{\text{r}} \varvec{s}_{\text{r}} \parallel_{2} = \parallel \bar{\varvec{y}} + \varvec{\bar{H}\bar{s}}\parallel_{2} $$ can be obtained from Eq. (4.37). Equation (4.36) can be written in the same form as Eq. (4.33), i.e.
$$ \bar{\varvec{s}}^{{\text{JED}}} = \mathop {\arg \hbox{min} }\limits_{{\tilde{\varvec{s}} \in {\mathcal{X}}^{N} }} \text{Tr}\left( {\tilde{\varvec{s}}^{\text{T}} \varvec{T\tilde{s}}} \right) $$
(4.38)

For QPSK, $$ \varvec{T} = - \left[ {\bar{\varvec{H}}^{\text{H}} \bar{\varvec{H}},\bar{\varvec{H}}^{\text{H}} \bar{\varvec{y}};\bar{\varvec{y}}^{\text{H}} \bar{\varvec{H}},\bar{\varvec{y}}^{\text{H}} \bar{\varvec{y}}} \right] $$ is an $$ N \times N $$ matrix with $$ N = 2K + 1 $$, and $$ \tilde{\varvec{s}} = \left[ {\text{Re} \left\{ {\varvec{s}_{\text{r}} } \right\};{\text{Im}}\left\{ {\varvec{s}_{\text{r}} } \right\};1} \right] $$, whose elements take values in $$ {\mathcal{X}} = \left\{ { - 1, + 1} \right\} $$. For BPSK, $$ \varvec{T} = - \left[ {\underline{\varvec{H}}^{\text{H}} \underline{\varvec{H}} ,\underline{\varvec{H}}^{\text{H}} \bar{\varvec{y}};\bar{\varvec{y}}^{\text{H}} \underline{\varvec{H}} ,\bar{\varvec{y}}^{\text{H}} \bar{\varvec{y}}} \right] $$ is an $$ N \times N $$ matrix with $$ N = K + 1 $$ and $$ \tilde{\varvec{s}} = \left[ {\text{Re} \left\{ {\varvec{s}_{\text{r}} } \right\};1} \right] $$, where the $$ 2N_{\text{r}} \times K $$ matrix $$ \underline{\varvec{H}} = \left[ {\text{Re} \left\{ {\varvec{Y}_{\text{r}} } \right\};{\text{Im}}\left\{ {\varvec{Y}_{\text{r}} } \right\}} \right] $$ is defined. Similar to the coherent ML problem, the solution $$ \bar{\varvec{s}}^{\text{JED}} $$ can be converted back to the complex domain.
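The JED construction can be checked numerically in the same way as the coherent case. The sketch below (illustrative sizes and variable names of our choosing) builds the quantities of Eq. (4.37) for QPSK and verifies both the real-valued mapping $$ \parallel y_{0} s_{0} + \varvec{Y}_{\text{r}} \varvec{s}_{\text{r}} \parallel_{2} = \parallel \bar{\varvec{y}} + \varvec{\bar{H}\bar{s}}\parallel_{2} $$ and the sign convention of Eq. (4.38): minimizing $$ \tilde{\varvec{s}}^{\text{T}} \varvec{T}\tilde{\varvec{s}} $$ with the negated $$ \varvec{T} $$ maximizes the norm.

```python
import numpy as np

rng = np.random.default_rng(3)
Nr, K = 16, 7

# SIMO model Y = h s^H + N over K+1 slots with unit-modulus QPSK symbols
h = (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)) / np.sqrt(2)
s = (rng.choice([-1.0, 1.0], K + 1) + 1j * rng.choice([-1.0, 1.0], K + 1)) / np.sqrt(2)
Y = np.outer(h, s.conj()) + 0.1 * (rng.standard_normal((Nr, K + 1))
                                   + 1j * rng.standard_normal((Nr, K + 1)))

y0s0 = Y[:, 0] * s[0]          # contribution of the known first symbol
Yr = Y[:, 1:]
sr = s[1:].conj()              # s_r = [s_1, ..., s_K]^H

# Real-valued quantities of Eq. (4.37)
y_bar = np.concatenate([y0s0.real, y0s0.imag])
H_bar = np.block([[Yr.real, -Yr.imag], [Yr.imag, Yr.real]])
s_bar = np.concatenate([sr.real, sr.imag])

# Negated T matrix for QPSK JED, Eq. (4.38): N = 2K + 1
T = -np.block([[H_bar.T @ H_bar, (H_bar.T @ y_bar)[:, None]],
               [(y_bar @ H_bar)[None, :], np.array([[y_bar @ y_bar]])]])
s_tilde = np.concatenate([s_bar, [1.0]])

cost = s_tilde @ T @ s_tilde   # equals -||y_bar + H_bar s_bar||^2
```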

Next, we will introduce how to find approximate solutions of Eqs. (4.33) and (4.38) with the same SDR-based algorithm.

4.3.2 Semi-definite Relaxation

SDR is well known for its use in solving the coherent ML problem [36], where it significantly reduces the computational complexity of BPSK- and QPSK-modulated systems. SDR can not only provide near-ML performance, but also achieve the same diversity order as the ML detector [38]. By contrast, SDR has rarely been used to solve the ML JED problem.

The data detection based on SDR first reformulates Eqs. (4.33) and (4.38) into the form of Eq. (4.39) [36], i.e.,
$$ \hat{\varvec{S}} = \mathop {\arg \hbox{min} }\limits_{{\varvec{S} \in {\mathbf{R}}^{N \times N} }} \text{Tr}\left( {\varvec{TS}} \right),\,\text{s}\text{.t}\text{.}\,\text{diag}\left( \varvec{S} \right) = 1,\,\text{rank}\left( \varvec{S} \right) = 1, $$
(4.39)
where $$ {\text{Tr}}\left( {\varvec{s}^{\text{H}} \varvec{Ts}} \right) = {\text{Tr}}\left( {\varvec{Tss}^{\text{H}} } \right) = {\text{Tr}}\left( {\varvec{TS}} \right) $$ and $$ \varvec{S} = \varvec{ss}^{\text{H}} $$ is a rank-1 matrix of dimension $$ N $$ with $$ \varvec{s} \in {\mathcal{X}}^{N} $$. However, the rank-1 constraint makes Eq. (4.39) as hard to solve as Eqs. (4.33) and (4.38). The key idea of SDR is therefore to relax the rank constraint so that the resulting SDP can be solved in polynomial time. Applying SDR to Eq. (4.39) yields the well-known optimization problem shown in Eq. (4.40) [36].
$$ \hat{\varvec{S}} = \mathop {\arg \hbox{min} }\limits_{{\varvec{S} \in {\mathbf{R}}^{N \times N} }} \text{Tr}\left( {\varvec{TS}} \right) ,\,\text{s}\text{.t}\text{.}\,\text{diag}\left( \varvec{S} \right) = 1,\,\varvec{S}\underline{ \succ } 0 $$
(4.40)

The constraint $$ \varvec{S}\underline{ \succ } 0 $$ makes $$ \varvec{S} $$ a positive semi-definite (PSD) matrix. If the solution of Eq. (4.40) has rank 1, then $$ \hat{\varvec{s}} $$ in $$ \hat{\varvec{S}} = \varvec{\hat{s}\hat{s}}^{\text{H}} $$ is an exact solution of Eqs. (4.33) and (4.38); that is, SDR solves the original problem optimally. If the rank of $$ \hat{\varvec{S}} $$ is greater than 1, the ML solution can be estimated by taking the signs of the leading eigenvector of $$ \hat{\varvec{S}} $$ or by adopting a randomization scheme [36].
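The rank-1 extraction step can be illustrated with a minimal numpy sketch (hypothetical names, and the ideal rank-1 case for clarity): the leading eigenvector of $$ \hat{\varvec{S}} = \hat{\varvec{s}}\hat{\varvec{s}}^{\text{H}} $$ is $$ \pm \hat{\varvec{s}}/\left\| {\hat{\varvec{s}}} \right\|_{2} $$, and the global sign is resolved by the convention that the last entry of $$ \tilde{\varvec{s}} $$ equals 1.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 9
s_true = rng.choice([-1.0, 1.0], N)
s_true[-1] = 1.0                     # last entry fixed, matching s~ = [...; 1]

# Rank-1 case: S_hat = s s^T, so SDR would be tight
S_hat = np.outer(s_true, s_true)

w, V = np.linalg.eigh(S_hat)         # eigh returns ascending eigenvalues
lead = V[:, np.argmax(w)]            # leading eigenvector, = +/- s/sqrt(N)
s_est = np.sign(lead * np.sign(lead[-1]))  # fix global sign, then take signs
```

When $$ \hat{\varvec{S}} $$ has rank greater than 1, the same extraction is applied to the leading eigenvector and yields an approximate (near-ML) estimate rather than an exact one.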

4.3.3 Algorithm Analysis

In this section, we introduce a new algorithm, TASER, which approximately solves the SDP problem of Eq. (4.40).

The TASER algorithm is based on the fact that a real-valued PSD matrix $$ \varvec{S}\underline{ \succ } 0 $$ admits a Cholesky factorization $$ \varvec{S} = \varvec{L}^{\text{H}} \varvec{L} $$, where $$ \varvec{L} $$ is an $$ N \times N $$ lower triangular matrix with nonnegative entries on its main diagonal. Therefore, the SDP of Eq. (4.40) can be transformed into
$$ \hat{\varvec{L}} = \mathop {\arg \hbox{min} }\limits_{\varvec{L}} \text{Tr}\left( {\varvec{LTL}^{\text{H}} } \right) ,\,\text{s}\text{.t}\text{. }\left\| {l_{k} } \right\|_{2} = 1,\quad \forall k $$
(4.41)

Equation (4.41) replaces the constraint $$ {\text{diag}}\left( {\varvec{L}^{\text{H}} \varvec{L}} \right) = 1 $$ of Eq. (4.40) with a two-norm constraint, where $$ l_{k} = \left[ \varvec{L} \right]_{k} $$. The sign bits of the last row of the solution $$ \hat{\varvec{L}} $$ of Eq. (4.41) are taken as the solution to the problems in Eqs. (4.33) and (4.38), because if the rank of $$ \hat{\varvec{S}} = \hat{\varvec{L}}^{\text{H}} \hat{\varvec{L}} $$ is 1, the last row of $$ \hat{\varvec{L}} $$ must contain the relevant eigenvector. If the rank of $$ \hat{\varvec{S}} = \hat{\varvec{L}}^{\text{H}} \hat{\varvec{L}} $$ is greater than 1, a near-ML solution must be extracted. Ref. [39] proposes using the last row of the Cholesky factor as a rank-1 approximation of the PSD matrix. In Chap. 5, the simulation results show that the performance of this approximation is close to that of the exact SDR detector based on eigenvalue decomposition. This approach avoids the eigenvalue decomposition and randomization of conventional schemes, thus reducing complexity.

An effective algorithm is proposed to solve the triangular SDP of Eq. (4.41) directly. However, the constraint set of Eq. (4.41) is non-convex, so finding the optimal solution is difficult. The TASER algorithm applies FBS, a method designed for convex optimization problems, to the non-convex problem of Eq. (4.41) [40]. This method cannot guarantee convergence to the optimal solution of Eq. (4.41). Therefore, in Chap. 5, the TASER algorithm will be proved to converge to a critical point, and simulation results also show that it achieves BER performance similar to that of the ML algorithm.

FBS is an effective iterative method for solving convex optimization problems of the form $$ \hat{\varvec{x}} = \mathop {\arg \hbox{min} }\limits_{\varvec{x}} f\left( \varvec{x} \right) + g\left( \varvec{x} \right) $$, where $$ f $$ is a smooth convex function and $$ g $$ is a convex function that is not necessarily smooth or bounded. The FBS iteration is [40, 41]
$$ \varvec{x}^{\left( t \right)} = \text{prox}_{g} \left( {\varvec{x}^{{\left( {t - 1} \right)}} - \tau^{{\left( {t - 1} \right)}} {\nabla }f\left( {\varvec{x}^{{\left( {t - 1} \right)}} } \right);\tau^{{\left( {t - 1} \right)}} } \right),\quad t = 1,2, \ldots $$
(4.42)
Equation (4.42) is iterated until convergence or until the maximum number of iterations $$ t_{ \hbox{max} } $$ is reached. $$ \left\{ {\tau^{\left( t \right)} } \right\} $$ is a sequence of step-size parameters and $$ {\nabla }f\left( \varvec{x} \right) $$ is the gradient of $$ f $$. The proximal operator of the function $$ g $$ is defined as [40, 41]
$$ \text{prox}_{g} \left( {\varvec{z};\tau } \right) = \mathop {\arg \hbox{min} }\limits_{\varvec{x}} \left\{ {\tau g\left( \varvec{x} \right) + \frac{1}{2}\left\| {\varvec{x} - \varvec{z}} \right\|_{2}^{2} } \right\} $$
(4.43)
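To make Eqs. (4.42) and (4.43) concrete, the following self-contained Python sketch (a toy instance of our own choosing, not from the text) runs FBS on a simple convex problem: $$ f\left( \varvec{x} \right) = \tfrac{1}{2}\left\| {\varvec{x} - \varvec{b}} \right\|_{2}^{2} $$ with $$ g $$ the indicator of the box $$ \left[ { - 1,1} \right]^{n} $$, whose proximal operator is element-wise clipping, independent of $$ \tau $$.

```python
import numpy as np

def fbs(grad_f, prox_g, x0, tau, t_max=200):
    """Forward-backward splitting, Eq. (4.42): x <- prox_g(x - tau*grad_f(x); tau)."""
    x = x0
    for _ in range(t_max):
        x = prox_g(x - tau * grad_f(x), tau)
    return x

# Toy instance: f(x) = 0.5*||x - b||^2, g = indicator of the box [-1, 1]^n
b = np.array([2.0, -0.3, 0.7, -5.0])
grad_f = lambda x: x - b                       # gradient with Lipschitz constant 1
prox_g = lambda z, tau: np.clip(z, -1.0, 1.0)  # projection onto the box

x_hat = fbs(grad_f, prox_g, np.zeros_like(b), tau=0.9)
```

Here the iteration converges to the element-wise clipped vector, the exact minimizer of this toy problem; TASER uses the same template with a non-convex $$ g $$.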

In order to solve Eq. (4.41) approximately with FBS, we define $$ f\left( \varvec{L} \right) = \text{Tr}\left( {\varvec{LTL}^{\text{H}} } \right) $$ and $$ g\left( \varvec{L} \right) = \chi \left( {\left\| {l_{k} } \right\|_{2} = 1,\forall k} \right) $$, where $$ \chi $$ is the indicator function (its value is 0 when the constraint is satisfied and infinite otherwise). The gradient is $$ {\nabla }f\left( \varvec{L} \right) = \text{tril}\left( {2\varvec{LT}} \right) $$, where $$ {\text{tril}}\left( \cdot \right) $$ extracts the lower triangular part. Although the function $$ g $$ is non-convex, the proximal operator still has a closed-form solution $$ \text{prox}_{g} \left( {l_{k} ;\tau } \right) = l_{k} /\left\| {l_{k} } \right\|_{2} ,\forall k $$.

To make the algorithm hardware-friendly, the adaptive step-size rule proposed in Ref. [40] is not used here; instead, a fixed step size is used [41]. The fixed step size is proportional to the reciprocal of the Lipschitz constant of the gradient $$ {\nabla }f\left( \varvec{L} \right) $$, i.e., $$ \tau = \alpha /\parallel \varvec{T}\parallel_{2} $$, where $$ \parallel \varvec{T}\parallel_{2} $$ is the spectral norm of the matrix $$ \varvec{T} $$ and $$ 0 < \alpha < 1 $$ is a system-dependent tuning parameter used to improve the convergence rate of the TASER algorithm [41]. To further improve the convergence rate of FBS, the problem in Eq. (4.41) is preprocessed. First, the diagonal scaling matrix $$ \varvec{D} = \text{diag}\left( {\sqrt {T_{1,1} } , \ldots ,\sqrt {T_{N,N} } } \right) $$ is computed and used to scale the matrix $$ \varvec{T} $$ into $$ \tilde{\varvec{T}} = \varvec{D}^{ - 1} \varvec{TD}^{ - 1} $$, a matrix whose main diagonal entries are all 1. This operation, called Jacobi preprocessing, improves the condition number of the original PSD matrix $$ \varvec{T} $$ [42]. FBS is then run to obtain the scaled lower triangular matrix $$ \tilde{\varvec{L}} = \varvec{LD} $$. The proximal operator must be corrected accordingly: $$ \text{prox}_{{\tilde{g}}} \left( {\tilde{l}_{k} } \right) = D_{k,k} \tilde{l}_{k} /\left\| {\tilde{l}_{k} } \right\|_{2} $$, where $$ \tilde{l}_{k} $$ is the $$ k $$th column of $$ \tilde{\varvec{L}} $$. Because only the sign bits of the last row of $$ \hat{\varvec{L}} $$ are needed to estimate the ML solution, the signs of the scaled triangular matrix $$ \tilde{\varvec{L}} $$ can be used directly.
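The Jacobi scaling step can be sketched in a few lines of numpy (the matrix below is a hypothetical SPD stand-in for $$ \varvec{T} $$):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
A = rng.standard_normal((N, N))
T = A.T @ A + np.eye(N)          # an SPD stand-in for the PSD matrix T

# Jacobi scaling: D = diag(sqrt(T_11), ..., sqrt(T_NN)), T~ = D^-1 T D^-1
d = np.sqrt(np.diag(T))
T_tilde = T / np.outer(d, d)      # unit main diagonal by construction

# Fixed FBS step size from the spectral norm of the scaled matrix
alpha = 0.99
tau = alpha / np.linalg.norm(T_tilde, 2)
```

Dividing by the outer product of the diagonal roots is equivalent to the two-sided scaling $$ \varvec{D}^{ - 1} \varvec{TD}^{ - 1} $$ and avoids forming $$ \varvec{D}^{ - 1} $$ explicitly.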

The pseudocode of the TASER algorithm is shown in Algorithm 4.5. The inputs are the preprocessed matrix $$ \tilde{\varvec{T}} $$, the scaling matrix $$ \varvec{D} $$ and the step size $$ \tau $$; $$ \tilde{\varvec{L}}^{\left( 0 \right)} = \varvec{D} $$ is used for initialization. The main loop of the TASER algorithm applies the gradient step and the proximal operator until the maximum number of iterations $$ t_{ \hbox{max} } $$ is reached. In most cases, only a few iterations are required to achieve near-ML BER performance.

Algorithm 4.5

TASER algorithm

../images/478198_1_En_4_Chapter/478198_1_En_4_Figf_HTML.png
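Since Algorithm 4.5 appears only as an image above, the following Python sketch reconstructs it from the description in the text; it is an illustrative reimplementation under our own naming, not the authors' reference code. It initializes $$ \tilde{\varvec{L}}^{\left( 0 \right)} = \varvec{D} $$, iterates the gradient step $$ {\nabla }f\left( {\tilde{\varvec{L}}} \right) = \text{tril}\left( {2\tilde{\varvec{L}}\tilde{\varvec{T}}} \right) $$ followed by the scaled proximal operator, and reads off the signs of the last row.

```python
import numpy as np

def taser(T, t_max=20, alpha=0.99):
    """Sketch of TASER (Algorithm 4.5) via forward-backward splitting.

    T is the N x N real PSD matrix of Eq. (4.33)/(4.38); returns the sign
    bits of the last row of the triangular iterate as the symbol estimate.
    """
    d = np.sqrt(np.diag(T))
    T_tilde = T / np.outer(d, d)              # Jacobi preprocessing
    tau = alpha / np.linalg.norm(T_tilde, 2)  # fixed, Lipschitz-based step
    L = np.diag(d)                            # initialization L~(0) = D
    for _ in range(t_max):
        V = L - tau * np.tril(2.0 * (L @ T_tilde))   # gradient step
        norms = np.linalg.norm(V, axis=0)            # column two-norms
        L = V * (d / np.maximum(norms, 1e-12))       # prox: ||l~_k||_2 = D_kk
    return np.sign(L[-1, :])
```

Convergence to the global optimum is not guaranteed for this non-convex problem (see Proposition 4.3.1), so the sketch is best treated as a structural illustration of the iteration rather than a validated detector.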

Using FBS to solve a non-convex problem raises two questions: first, whether the algorithm converges to a minimum; second, whether a local minimum of the non-convex problem corresponds to the minimum of the convex SDP. We address these two questions next.

Regarding the first question: although using FBS to find the minimum of a semi-definite program is still novel, there has been considerable research on the convergence of FBS for non-convex problems. Ref. [43] gives conditions under which FBS converges for non-convex problems; in particular, the problem to be solved must be semi-algebraic. Equation (4.41) meets this condition exactly. A rigorous proof is given below.

Proposition 4.3.1

If FBS (Algorithm 4.5) is used to solve Eq. (4.41) with step size $$ \tau = \alpha /\left\| \varvec{T} \right\|_{2} \left( {0 < \alpha < 1} \right) $$, then the iterative sequence $$ \left\{ {\varvec{L}^{\left( t \right)} } \right\} $$ converges to a critical point.

Proof

The function $$ \parallel l_{k} \parallel_{2}^{2} $$ is a polynomial, and the constraint set of Eq. (4.41) is the solution set of the polynomial system $$ \left\| {l_{k} } \right\|_{2}^{2} - 1 = 0\,\left( {\forall k} \right) $$, so it is semi-algebraic. Theorem 5.3 in Ref. [43] shows that if the step size is bounded above by the inverse of the Lipschitz constant of the gradient of the objective function, then the iterative sequence $$ \left\{ {\varvec{L}^{\left( t \right)} } \right\} $$ converges. Here, the Lipschitz constant is the spectral norm (two-norm) of the matrix $$ \varvec{T} $$.□

The Jacobi preprocessor leads to a problem of the same form as Eq. (4.41), except that the constraint becomes $$ \left\| {\tilde{l}_{k} } \right\|_{2}^{2} = D_{k,k}^{2} $$ and the step size becomes $$ \tau = \alpha /\left\| {\tilde{\varvec{T}}} \right\|_{2} $$, so Proposition 4.3.1 applies as well. This proposition does not guarantee that a minimum is found, only a stationary point (in practice, a minimum is often found). Nevertheless, this guarantee is much stronger than those of other known low-complexity SDP algorithms; for example, the non-convex augmented Lagrangian scheme adopted in Ref. [44] cannot guarantee convergence.

The second question is whether a local minimum of Eq. (4.41) corresponds to the minimum of the convex SDP of Eq. (4.40). Ref. [44] shows that a local minimum of Eq. (4.41) is the minimum of the SDP of Eq. (4.40) when the factors $$ \varvec{L} $$ and $$ \varvec{L}^{\text{H}} $$ are not constrained to triangular form. Here, however, $$ \varvec{L} $$ and $$ \varvec{L}^{\text{H}} $$ are constrained to triangular form, as this simplifies the hardware architecture designed in Chap. 5.

4.3.4 Performance Analysis

Figure 4.14a, b show the simulated vector error rate (VER) of the TASER algorithm with BPSK and QPSK modulation, respectively, for massive MU-MIMO systems of size $$ 128 \times 8\,(N_{\text{r}} \times N_{\text{t}} ) $$, $$ 64 \times 16 $$ and $$ 32 \times 32 $$, using coherent data detection over a flat Rayleigh fading channel. For comparison, the performance of ML detection (using the SD algorithm of Ref. [15]), exact SDR detection of Eq. (4.39), linear MMSE detection, and K-best detection $$ (K = 5) $$ of Ref. [45] is also given, with SIMO performance as the lower reference bound.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig14_HTML.png
Fig. 4.14

VER performance comparison of MIMOs with different configurations a BPSK, b QPSK.

© [2018] IEEE. Reprinted, with permission, from Ref. [10]

For the $$ 128 \times 8 $$ massive MIMO system, the performance of all detectors is close to optimal, including the SIMO lower bound; this result is expected [46]. For the $$ 64 \times 16 $$ massive MIMO system, only linear detection suffers a small performance loss, and the other detectors perform better. For the $$ 32 \times 32 $$ massive MIMO system, the TASER algorithm still achieves near-ML performance, clearly superior to the MMSE and K-best algorithms (note that even with the SD algorithm, the complexity of ML detection remains very high). Figure 4.14a, b also show the fixed-point performance of the TASER algorithm, which exhibits only a small performance loss (less than 0.2 dB of SNR at the 1% VER point).

Figure 4.15a, b show the simulated BER of the TASER algorithm with BPSK and QPSK modulation in a SIMO system, respectively, where the number of receiving antennas is $$ N_{\text{r}} = 16 $$, the number of transmit time slots is $$ K + 1 = 16 $$, the maximum number of iterations is $$ t_{ \hbox{max} } = 20 $$, and an i.i.d. flat Rayleigh block-fading channel model is adopted. The simulation includes SIMO detection with perfect receiver channel state information (CSIR), SIMO detection with channel estimation (CHEST) [44], exact SDR detection, and ML JED detection. The TASER algorithm achieves near-optimal performance close to that with perfect CSIR, outperforms SIMO detection with CHEST, and matches the ML JED and exact SDR algorithms at manageable complexity.
../images/478198_1_En_4_Chapter/478198_1_En_4_Fig15_HTML.png
Fig. 4.15

BER performance comparison of SIMO system. a BPSK, b QPSK.

© [2018] IEEE. Reprinted, with permission, from Ref. [10]

4.3.5 Computational Complexity

Next, we compare the computational complexity of the TASER algorithm with that of other massive MIMO data detection algorithms: the CGLS algorithm [47], the NSA algorithm [48], the OCD algorithm [49] and the GAS algorithm [50]. Table 4.2 lists the number of real-valued multiplications required by each algorithm for $$ t_{ \hbox{max} } $$ iterations. The complexity of the TASER algorithm (BPSK and QPSK) and of NSA scales as $$ t_{ \hbox{max} } N_{\text{t}}^{3} $$, with TASER slightly higher; the complexity of CGLS and GAS scales as $$ t_{ \hbox{max} } N_{\text{t}}^{2} $$, with GAS slightly higher; and the complexity of OCD scales as $$ t_{ \hbox{max} } N_{\text{r}} N_{\text{t}} $$. In summary, the TASER algorithm achieves near-ML performance at the highest computational complexity, while CGLS, OCD, and GAS have lower complexity but perform poorly in the $$ 32 \times 32 $$ system. Moreover, only the TASER algorithm can be used for JED; the linear algorithms cannot be applied in this scenario.
Table 4.2

Computational complexity of different data detection algorithms in the massive MIMO system

Algorithm: Computational complexitya

BPSK TASER: $$ t_{ \hbox{max} } \left( {\frac{1}{3}N_{\text{t}}^{3} + \frac{5}{2}N_{\text{t}}^{2} + \frac{37}{6}N_{\text{t}} + 4} \right) $$

QPSK TASER: $$ t_{ \hbox{max} } \left( {\frac{8}{3}N_{\text{t}}^{3} + 10N_{\text{t}}^{2} + \frac{37}{3}N_{\text{t}} + 4} \right) $$

CGLS [47]: $$ \left( {t_{ \hbox{max} } + 1} \right)\left( {4N_{\text{t}}^{2} + 20N_{\text{t}} } \right) $$

NSA [48]: $$ \left( {t_{ \hbox{max} } - 1} \right)2N_{\text{t}}^{3} + 2N_{\text{t}}^{2} - 2N_{\text{t}} $$

OCD [49]: $$ t_{ \hbox{max} } \left( {8N_{\text{r}} N_{\text{t}} + 4N_{\text{t}} } \right) $$

GAS [50]: $$ t_{ \hbox{max} } 6N_{\text{t}}^{2} $$

aComplexity is expressed as the number of real-valued multiplications (RMULs) for $$ t_{ \hbox{max} } $$ iterations; one complex-valued multiplication requires four RMULs. All results ignore the preprocessing complexity
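To give a feel for these expressions, the following Python sketch evaluates the Table 4.2 formulas for one representative configuration ($$ N_{\text{t}} = 32 $$, $$ N_{\text{r}} = 128 $$, $$ t_{ \hbox{max} } = 20 $$; the configuration is our choice for illustration, not a figure from the text):

```python
# Evaluate the RMUL counts of Table 4.2 for Nt = 32, Nr = 128, t_max = 20
Nt, Nr, t_max = 32, 128, 20

rmuls = {
    "BPSK TASER": t_max * (Nt**3 / 3 + 5 * Nt**2 / 2 + 37 * Nt / 6 + 4),
    "QPSK TASER": t_max * (8 * Nt**3 / 3 + 10 * Nt**2 + 37 * Nt / 3 + 4),
    "CGLS":       (t_max + 1) * (4 * Nt**2 + 20 * Nt),
    "NSA":        (t_max - 1) * 2 * Nt**3 + 2 * Nt**2 - 2 * Nt,
    "OCD":        t_max * (8 * Nr * Nt + 4 * Nt),
    "GAS":        t_max * 6 * Nt**2,
}
```

For this configuration, the cubic-complexity algorithms (QPSK TASER and NSA) dominate the quadratic ones (CGLS, GAS), consistent with the scaling discussion above.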