© Springer Nature Singapore Pte Ltd. and Science Press, Beijing, China  2019
Leibo Liu, Guiqiang Peng and Shaojun Wei, Massive MIMO Detection Algorithm and VLSI Architecture
https://doi.org/10.1007/978-981-13-6362-7_2

2. Linear Massive MIMO Detection Algorithm

Leibo Liu (Corresponding author), Guiqiang Peng and Shaojun Wei
Institute of Microelectronics, Tsinghua University, Beijing, China

Massive MIMO signal detection is a key technology of next-generation wireless communication (such as 5G) [1], and detecting the transmitted signal of a massive MIMO system efficiently and accurately is of vital importance. Many algorithms exist to implement massive MIMO signal detection. Generally, these algorithms can be divided into linear detection algorithms and nonlinear detection algorithms according to their calculation methods [2]. Although the linear detection algorithm is less accurate than the nonlinear detection algorithm, it is still a practical signal detection method for massive MIMO systems in some cases due to its low complexity. In the linear detection algorithm, the main difficulty is the inversion of a large-scale matrix: when the scale of a massive MIMO system is very large, the algorithm complexity is very high and the corresponding hardware is difficult to implement. Therefore, this chapter introduces several typical linear iterative algorithms for massive MIMO signal detection. With these algorithms, iterations over vectors or matrices can be used to avoid direct inversion of large-scale matrices and to reduce the complexity of the linear detection algorithm. In the following sections, we introduce the Neumann series approximation (NSA) algorithm, the Chebyshev iteration algorithm, the Jacobi iteration algorithm and the conjugate gradient (CG) algorithm respectively. Optimization methods for the Chebyshev iteration algorithm, the Jacobi iteration algorithm and the CG algorithm are also introduced to obtain better linear detection algorithms. In addition, this chapter compares the complexity and accuracy of these algorithms with those of other massive MIMO signal detection algorithms.

2.1 Analysis of Linear Detection Algorithm

Massive MIMO signal detection algorithms are usually divided into linear detection algorithms and nonlinear detection algorithms. The nonlinear detection algorithm, as the name implies, recovers the transmitted signal s from the received signal y through nonlinear operations. Such algorithms usually have high accuracy, but their computational complexity is also high. For example, the maximum likelihood (ML) detection is a typical nonlinear detection algorithm [3]. In theory, ML detection is ideal for massive MIMO signal detection because of its high accuracy. However, in ML detection, the number of candidate vectors that must be examined depends on the modulation order q and the number of the user antennas (Nt); the total number is $$ q^{{N_{\text{t}} }} $$. This result is disastrous because the total number still increases enormously even if the modulation order or the number of user antennas increases only slightly. Therefore, although ML detection is an ideal detection method in theory, it is not applicable in practice, especially in massive MIMO signal detection. In general, the nonlinear detection algorithm is more accurate than the linear detection algorithm in massive MIMO signal detection, but has higher complexity. We will introduce nonlinear detection algorithms in detail in Chap. 4.
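
To make the $$ q^{{N_{\text{t}} }} $$ growth concrete, the short sketch below (an illustrative example, not part of the original text; function and variable names are ours) enumerates every candidate vector of a toy constellation and keeps the one with the smallest Euclidean distance to y.

```python
import itertools
import numpy as np

def ml_detect(y, H, constellation):
    """Brute-force ML detection: test all q**Nt candidate vectors of y = Hs + n."""
    Nt = H.shape[1]
    best_s, best_metric = None, np.inf
    for cand in itertools.product(constellation, repeat=Nt):   # q**Nt candidates
        s = np.array(cand)
        metric = np.linalg.norm(y - H @ s) ** 2                # Euclidean distance
        if metric < best_metric:
            best_s, best_metric = s, metric
    return best_s

# QPSK (q = 4) with Nt = 4 already means 4**4 = 256 candidates;
# 64-QAM with Nt = 16 would need 64**16 (about 8e28) candidates.
```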

Corresponding to the nonlinear detection algorithm, the linear detection algorithm usually estimates the signal s through matrix operations. Common linear detection algorithms include the zero-forcing (ZF) detection algorithm and the minimum mean square error (MMSE) detection algorithm [4], both of which transform massive MIMO signal detection into the solution of the linear matrix equation $$ {\varvec{Hs}} = {\varvec{y}} $$. According to the massive MIMO channel model in Sect. 1.2.1, the received signal and the transmitted signal are related by Eq. (2.1)
$$ {\varvec{y}} = {\varvec{Hs}} + {\varvec{n}} $$
(2.1)
where y is the received signal, H is the channel matrix, s is the transmitted signal, and n is the additive noise. The linear detection algorithm for massive MIMO left-multiplies both sides of Eq. (2.1) by the conjugate transpose $$ {\varvec{H}}^{\text{H}} $$ of the channel matrix and neglects the additive noise n, so as to obtain Eq. (2.2)
$$ {\varvec{H}}^{\text{H}} {\varvec{y}} = {\varvec{H}}^{\text{H}} {\varvec{Hs}} $$
(2.2)
With the matched-filter output $$ {\varvec{y}}^{\text{MF}} $$ defined as in Eq. (2.3), we obtain Eq. (2.4).
$$ {\varvec{y}}^{\text{MF}} = {\varvec{H}}^{\text{H}} {\varvec{y}} $$
(2.3)
$$ {\varvec{s}} = ({\varvec{H}}^{\text{H}} {\varvec{H}})^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{y}} = ({\varvec{H}}^{\text{H}} {\varvec{H}})^{ - 1} {\varvec{y}}^{\text{MF}} $$
(2.4)
This implements detection of the transmitted signal s. However, Eq. (2.4) contains an error because the additive noise n has been neglected. Based on this idea, the transmitted signal s can be estimated through a matrix W that makes Eq. (2.5) hold
$$ {\hat{\varvec{s}}} = {\varvec{Wy}} $$
(2.5)
where the $$ {\hat{\varvec{s}}} $$ denotes the estimated transmitted signal. In this way, the massive MIMO linear detection is transformed into the estimation of matrix W.
The ZF is a common linear detection algorithm. Its main idea is to neglect the additive noise n in the massive MIMO channel model, which makes the detection algorithm much simpler and easier to implement. However, since the noise is usually not negligible in actual situations, the result obtained with this algorithm may not be the optimal solution. For a massive MIMO system with a scale of $$ N_{\text{r}} \times N_{\text{t}} $$, the signal received at the base station can be expressed as
$$ {\varvec{y}} = \sum\limits_{i = 1}^{{N_{\text{t}} }} {{\varvec{h}}_{i} {\varvec{s}}_{i} + {\varvec{n}},} \,i = 1,2, \ldots ,N_{\text{t}} $$
(2.6)
where $$ {\varvec{H}} = \left( {{\varvec{h}}_{ 1} ,{\varvec{h}}_{ 2} , \ldots ,{\varvec{h}}_{{N_{\text{t}} }} } \right) $$ and $$ s_{i} $$ is the ith element of the transmitted signal s, $$ i = 1, 2, \ldots ,N_{\text{t}} $$. Now we define row vectors $$ {\varvec{w}}_{i} $$ of dimension $$ 1 \times N_{\text{r}} $$ that satisfy Eq. (2.7)
$$ {\varvec{w}}_{i} {\varvec{h}}_{j} = \left\{ {\begin{array}{*{20}c} {1,} & {i = j} \\ {0,} & {i \ne j} \\ \end{array} } \right. $$
(2.7)
where $$ i = 1, 2, \ldots ,N_{\text{t}} $$. Taking the vectors $$ {\varvec{w}}_{i} $$ as rows forms a matrix $$ {\varvec{W}}_{{N_{\text{t}} \times N_{\text{r}} }} $$, and Eq. (2.7) obviously gives $$ {\varvec{WH}} = {\varvec{I}} $$. Combining this with Eq. (2.4), we have
$$ {\varvec{W}} = ({\varvec{H}}^{\text{H}} {\varvec{H}})^{ - 1} {\varvec{H}}^{\text{H}} $$
(2.8)
In this way, the transmitted signal s can be estimated as
$$ {\hat{\varvec{s}}} = {\varvec{W}}({\varvec{Hs}} + {\varvec{n}}) = {\varvec{s}} + {\varvec{Wn}} $$
(2.9)

Obviously, when the additive noise is $$ {\varvec{n}} = 0 $$, $$ {\hat{\varvec{s}}} = {\varvec{s}} $$ is strictly satisfied. Because the ZF detection algorithm meets the conditions in Eq. (2.7), it can eliminate the interference between the data sent by different transmitting antennas, and can get relatively accurate detection results when the signal-to-noise ratio is relatively high.

Although the ZF detection algorithm has some deficiencies in accuracy, its derivation raises the question of whether the influence of the noise n can be incorporated into the matrix W so that the transmitted signal s can still be solved with the same linear-matrix method. On this basis, the MMSE detection algorithm is proposed.

The MMSE detection algorithm is another typical linear detection algorithm. Its basic idea is to make the estimated signal $$ {\hat{\varvec{s}}} = {\varvec{Wy}} $$ as close to the real value as possible. In the MMSE detection algorithm, the adopted objective function is
$$ {\varvec{W}}_{\text{MMSE}} = \arg \mathop {\hbox{min} }\limits_{{\varvec{W}}} E\left\| {{\varvec{s}} - {\varvec{Wy}}} \right\|^{2} $$
(2.10)
By solving the matrix W based on Eq. (2.10), the following equation can be obtained:
$$ \begin{aligned} {\varvec{W}}_{\text{MMSE}} & = \arg \mathop {\hbox{min} }\limits_{{\varvec{W}}} E\left\| {{\varvec{s}} - {\varvec{Wy}}} \right\|^{2} \\ & = \arg \mathop {\hbox{min} }\limits_{{\varvec{W}}} E\left\{ {({\varvec{s}} - {\varvec{Wy}})^{\text{H}} ({\varvec{s}} - {\varvec{Wy}})} \right\} \\ & = \arg \mathop {\hbox{min} }\limits_{{\varvec{W}}} E\left\{ {{\text{tr}}\left[ {({\varvec{s}} - {\varvec{Wy}})({\varvec{s}} - {\varvec{Wy}})^{\text{H}} } \right]} \right\} \\ & = \arg \mathop {\hbox{min} }\limits_{{\varvec{W}}} E\left\{ {{\text{tr}}\left[ {{\varvec{ss}}^{\text{H}} - {\varvec{sy}}^{\text{H}} {\varvec{W}}^{\text{H}} - {\varvec{Wys}}^{\text{H}} + {\varvec{Wyy}}^{\text{H}} {\varvec{W}}^{\text{H}} } \right]} \right\} \\ \end{aligned} $$
(2.11)
Taking the partial derivative of Eq. (2.11) with respect to W and setting it equal to zero, we get
$$ \frac{{\partial {\text{tr}}\left[ {{\varvec{ss}}^{\text{H}} - {\varvec{sy}}^{\text{H}} {\varvec{W}}^{\text{H}} - {\varvec{Wys}}^{\text{H}} + {\varvec{Wyy}}^{\text{H}} {\varvec{W}}^{\text{H}} } \right]}}{{\partial {\varvec{W}}}} = 0 $$
(2.12)
By solving Eq. (2.12), we can get
$$ {\varvec{W}} = \left( {{\varvec{H}}^{\text{H}} {\varvec{H}} + \frac{{N_{0} }}{{E_{\text{s}} }}{\varvec{I}}_{{N_{\text{t}} }} } \right)^{ - 1} {\varvec{H}}^{\text{H}} $$
(2.13)
where N0 is the noise power spectral density, Es is the signal power spectral density, and the Gram matrix is defined as $$ {\varvec{G}} = {\varvec{H}}^{\text{H}} {\varvec{H}} $$. The MMSE detection algorithm is thus derived similarly to the ZF detection algorithm: it also computes the estimated signal as $$ {\hat{\varvec{s}}} = {\varvec{Wy}} $$, and the difference lies in the matrix W, with $$ {\varvec{W}}_{\text{ZF}} = {\varvec{G}}^{ - 1} {\varvec{H}}^{\text{H}} $$ in the ZF detection algorithm and $$ {\varvec{W}}_{\text{MMSE}} = \left( {{\varvec{G}} + \frac{{N_{0} }}{{E_{\text{s}} }}{\varvec{I}}_{{N_{\text{t}} }} } \right)^{ - 1} {\varvec{H}}^{\text{H}} $$ in the MMSE detection algorithm. Both algorithms can estimate the transmitted signal s in massive MIMO detection.
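
As a concrete reference point, the sketch below forms both equalization matrices directly from Eqs. (2.8) and (2.13) with numpy (the variable names N0 and Es follow the text; the explicit solves shown here are exactly what the iterative methods of this chapter try to avoid).

```python
import numpy as np

def zf_detect(y, H):
    """ZF detection, Eqs. (2.4)/(2.8): s_hat = (H^H H)^{-1} H^H y."""
    G = H.conj().T @ H                                   # Gram matrix G = H^H H
    return np.linalg.solve(G, H.conj().T @ y)

def mmse_detect(y, H, N0, Es):
    """MMSE detection, Eq. (2.13): s_hat = (H^H H + N0/Es * I)^{-1} H^H y."""
    Nt = H.shape[1]
    A = H.conj().T @ H + (N0 / Es) * np.eye(Nt)          # regularized Gram matrix
    return np.linalg.solve(A, H.conj().T @ y)

# Example channel: i.i.d. complex Gaussian entries, Nr = 128, Nt = 8
# H = (np.random.randn(128, 8) + 1j * np.random.randn(128, 8)) / np.sqrt(2)
```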

Matrix inversion is involved in both the ZF detection algorithm and the MMSE detection algorithm. The scale of the channel matrix H is usually very large in a massive MIMO system, so the matrix inversion is very complex and usually difficult to implement in an actual signal detection circuit. In order to avoid the huge complexity of matrix inversion, many algorithms have been put forward to reduce the algorithm complexity and thus the burden caused by large-scale matrix inversion. Among these algorithms, the typical ones are the NSA algorithm, the Chebyshev iteration algorithm, the Jacobi iteration algorithm, the CG algorithm and so on. These four common algorithms are introduced in detail below.

2.2 Neumann Series Approximation Algorithm

In the massive MIMO system, the number of receiving antennas is usually much greater than that of the user antennas [5], that is, Nr is much larger than Nt. Since the elements in channel matrix H are independent identically distributed, and the real and imaginary parts are subject to the Gaussian distribution with parameter N(0, 1), both Gram matrix G and MMSE matrix $$ {\varvec{A}} = {\varvec{G}} + N_{0} /E_{\text{s}} {\varvec{I}}_{{N_{\text{t}} }} $$ are diagonally dominant matrices. Gram matrix G tends to be a scalar matrix when Nr tends to infinity, i.e., $$ {\varvec{G}} \to N_{\text{r}} {\varvec{I}}_{{N_{\text{t}} }} $$ [5]. Using this property, we can simplify the inverse process of matrix in the linear massive MIMO detection algorithm.

Because of the diagonal dominance of the MMSE matrix A, if the diagonal elements of matrix A are taken out and denoted as matrix D, the inverse of matrix D is very easy to compute. As mentioned above, the condition under which A tends to D is that the number of receiving antennas Nr tends to infinity. That is not actually possible, and practical systems are often far from that condition. Thus, simply using $$ {\varvec{A}} \approx {\varvec{D}} $$ and approximating the inverse of matrix A by the inverse of D leads to relatively large errors.

2.2.1 Algorithm Design

The exact matrix inverse can be obtained with a Neumann series [6]. The Neumann series expands the inverse matrix A−1 of the MMSE matrix A into
$$ {\varvec{A}}^{ - 1} = \sum\limits_{n = 0}^{\infty } {({\varvec{X}}^{ - 1} ({\varvec{X}} - {\varvec{A}}))^{n} } {\varvec{X}}^{ - 1} $$
(2.14)
In Eq. (2.14) the matrix X needs to satisfy
$$ \mathop {\lim }\limits_{n \to \infty } ({\varvec{I}} - {\varvec{X}}^{ - 1} {\varvec{A}})^{n} = {\mathbf{0}}_{{N_{{\mathbf{t}}} \times N_{{\mathbf{t}}} }} $$
(2.15)
The matrix A is decomposed into the sum of its diagonal part and its off-diagonal part, $$ {\varvec{A}} = {\varvec{D}} + {\varvec{E}} $$. Substituting this decomposition with $$ {\varvec{X}} = {\varvec{D}} $$ into Eq. (2.14), we get
$$ {\varvec{A}}^{ - 1} = \sum\limits_{n = 0}^{\infty } {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{n} } {\varvec{D}}^{ - 1} $$
(2.16)

According to Eq. (2.15), if the condition $$ \mathop {\lim }\limits_{n \to \infty } \left( { - {\varvec{D}}^{ - 1} {\varvec{E}}} \right)^{n} = {\varvec{0}}_{{N_{\text{t}} \times N_{\text{t}} }} $$ is satisfied, the expansion of Eq. (2.16) will definitely converge.

Expanding the matrix A−1 into a sum of infinitely many terms is not practical. In order to apply the Neumann series to the linear massive MIMO detection algorithm, it is necessary to approximate Eq. (2.16) with the NSA [7]. The main idea of solving the inverse matrix with the NSA is to keep only the first K terms of the Neumann series in Eq. (2.16); the expression for the first K terms of the Neumann series is
$$ {\tilde{\varvec{A}}}_{K}^{ - 1} = \sum\limits_{n = 0}^{K - 1} {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{n} } {\varvec{D}}^{ - 1} $$
(2.17)
By using Eq. (2.17), a Neumann series with a finite number of terms replaces the infinite series, which reduces the computational complexity of estimating the inverse of the MMSE matrix A. From the approximation of A−1, an approximate MMSE equalization matrix $$ {\tilde{\varvec{W}}}_{K} = {\tilde{\varvec{A}}}_{K}^{ - 1} {\varvec{H}}^{\text{H}} $$ can be obtained. The approximation of A−1 takes different forms according to the number of selected terms K. When K = 1, $$ {\tilde{\varvec{A}}}_{1}^{ - 1} = {\varvec{D}}^{ - 1} $$ and $$ {\tilde{\varvec{W}}}_{1} = {\varvec{D}}^{ - 1} {\varvec{H}}^{\text{H}} $$. When K = 2, $$ {\tilde{\varvec{A}}}_{2}^{ - 1} = {\varvec{D}}^{ - 1} - {\varvec{D}}^{ - 1} {\varvec{ED}}^{ - 1} $$, and the operation complexity is $$ O(N_{\text{t}}^{2} ) $$. When K = 3, $$ {\tilde{\varvec{A}}}_{3}^{ - 1} $$ is
$$ {\tilde{\varvec{A}}}_{3}^{ - 1} = {\varvec{D}}^{ - 1} - {\varvec{D}}^{ - 1} {\varvec{ED}}^{ - 1} + {\varvec{D}}^{ - 1} {\varvec{ED}}^{ - 1} {\varvec{ED}}^{ - 1} $$
(2.18)

The computational complexity of Eq. (2.18) is $$ O(N_{\text{t}}^{3} ) $$, which is comparable to the computational complexity of exact matrix inversion, although the approximation of Eq. (2.18) still requires fewer operations. When K > 4, exact matrix inversion may have lower complexity than the approximate algorithm.
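
A minimal sketch of the K-term truncation in Eq. (2.17), assuming the MMSE matrix A has already been formed; the alternating sign comes from the $$ - {\varvec{D}}^{ - 1} {\varvec{E}} $$ factor, and the helper name is ours.

```python
import numpy as np

def neumann_inverse(A, K):
    """K-term Neumann series approximation of A^{-1}, Eq. (2.17)."""
    d = np.diag(A)                          # diagonal entries of A
    D_inv = np.diag(1.0 / d)                # inverting D is trivial
    E = A - np.diag(d)                      # off-diagonal part of A
    M = -D_inv @ E                          # iteration matrix -D^{-1} E
    approx, term = D_inv.copy(), D_inv.copy()
    for _ in range(1, K):                   # accumulate (-D^{-1}E)^n D^{-1}, n = 1..K-1
        term = M @ term
        approx = approx + term
    return approx

# Approximate MMSE equalization matrix: W_K = neumann_inverse(A, 3) @ H.conj().T
```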

2.2.2 Error Analysis

Obviously, truncating the series to its first K terms in Eq. (2.17) introduces an error, which can be expressed as
$$ \begin{aligned} {\varvec{\varDelta}}_{K} & = {\varvec{A}}^{ - 1} - {\tilde{\varvec{A}}}_{K}^{ - 1} \\ & = \sum\limits_{n = K}^{\infty } {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{n} } {\varvec{D}}^{ - 1} \\ & = ( - {\varvec{D}}^{ - 1} {\varvec{E}})^{K} \sum\limits_{n = 0}^{\infty } {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{n} } {\varvec{D}}^{ - 1} \\ & = ( - {\varvec{D}}^{ - 1} {\varvec{E}})^{K} {\varvec{A}}^{ - 1} \\ \end{aligned} $$
(2.19)
Using Eq. (2.17) to estimate the transmitted signal and substituting Eq. (2.19), the expression in Eq. (2.20) for the error of the estimated transmitted signal is obtained
$$ {\hat{\varvec{s}}}_{K} = {\tilde{\varvec{A}}}_{K}^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{y}} = {\varvec{A}}^{ - 1} {\varvec{y}}^{\text{MF}} -\varvec{\varDelta}_{K} {\varvec{y}}^{\text{MF}} $$
(2.20)
Taking the 2-norm of the second term of Eq. (2.20), we have
$$ \begin{aligned} \left\| {{\varvec{\varDelta}}_{K} {\varvec{y}}^{\text{MF}} } \right\|_{2} & = \left\| {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{K} {\varvec{A}}^{ - 1} {\varvec{y}}^{\text{MF}} } \right\|_{2} \\ & \quad \le \left\| {( - {\varvec{D}}^{ - 1} {\varvec{E}})^{K} } \right\|_{F} \left\| {{\varvec{A}}^{ - 1} {\varvec{y}}^{\text{MF}} } \right\|_{2} \\ & \quad \le \left\| { - {\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{K} \left\| {{\varvec{A}}^{ - 1} {\varvec{y}}^{\text{MF}} } \right\|_{2} \\ \end{aligned} $$
(2.21)
We can see from Eq. (2.21) that if inequality (2.22) is satisfied, the approximation error decays exponentially to zero as the number of terms K increases; it can also be proved that inequality (2.22) is a sufficient condition for the convergence of Eq. (2.16).
$$ \left\| { - {\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F} < 1 $$
(2.22)

Now it is necessary to prove that Eq. (2.16) converges when the scale of the massive MIMO system satisfies the condition that Nr is much greater than Nt, the elements in the channel matrix H are independent and identically distributed, and all of them obey the complex Gaussian distribution with parameter N(0, 1). More specifically, it is necessary to prove that the condition for convergence of the Neumann series and the condition for a small error in Eq. (2.21) are related only to Nt and Nr. The corresponding theorem and its proof are given below.

Theorem 2.2.1

When $$ N_{\text{r}} > 4 $$ and the elements in the channel matrix H are independent of each other and follow the complex Gaussian distribution with unit variance, the following expression holds
$$ P\left\{ {\left\| { - {\mathbf{D}}^{ - 1} {\mathbf{E}}} \right\|_{F}^{K} < \alpha } \right\} \ge 1 - \frac{{(N_{\text{t}}^{2} - N_{\text{t}} )}}{{\alpha^{{\frac{2}{K}}} }}\sqrt {\frac{{2N_{\text{r}} (N_{\text{r}} + 1)}}{{(N_{\text{r}} - 1)(N_{\text{r}} - 2)(N_{\text{r}} - 3)(N_{\text{r}} - 4)}}} $$
(2.23)

Proof

To prove Theorem 2.2.1, we first give three lemmas and their proofs.

Lemma 2.2.1

Let $$ x^{(k)} $$, $$ y^{(k)} \left( {k = 1,2, \ldots ,N_{\text{r}} } \right) $$ be independent and identically distributed and follow the complex Gaussian distribution with unit variance. Then
$$ E\left[ {\left| {\sum\limits_{k = 1}^{{N_{\text{r}} }} {x^{(k)} y^{(k)} } } \right|^{4} } \right] = 2N_{\text{r}} \left( {N_{\text{r}} + 1} \right) $$
(2.24)

Proof

First we can get
$$ \begin{aligned} E\left[ {\left| {\sum\limits_{k = 1}^{{N_{\text{r}} }} {x^{(k)} y^{(k)} } } \right|^{4} } \right] & = E\left[ {\left( {\sum\limits_{k = 1}^{{N_{\text{r}} }} {x^{(k)} y^{(k)} } \sum\limits_{k = 1}^{{N_{\text{r}} }} {\left( {x^{(k)} y^{(k)} } \right)}^{*} } \right)^{2} } \right] \\ & = \left( {_{2}^{{N_{\text{r}} }} } \right)E\left[ {\left| {x^{(k)} } \right|^{2} \left| {y^{(k)} } \right|^{2} } \right] + N_{\text{r}} E\left[ {\left| {x^{(k)} } \right|^{4} \left| {y^{(k)} } \right|^{4} } \right] \\ & = 2N_{\text{r}} \left( {N_{\text{r}} - 1} \right) + 4N_{\text{r}} \\ & = 2N_{\text{r}}^{2} + 2N_{\text{r}} \\ \end{aligned} $$
(2.25)

The computation in Eq. (2.25) can be described as follows. After the quadratic term is expanded, the terms with nonzero expectation are of the form $$ \left| {x^{(k)} } \right|^{4} \left| {y^{(k)} } \right|^{4} $$ and $$ \left| {x^{(k)} } \right|^{2} \left| {y^{(k)} } \right|^{2} $$, of which there are $$ N_{\text{r}} $$ terms of the former and $$ \left( {_{2}^{{N_{\text{r}} }} } \right) $$ of the latter. Using $$ {\text{E}}\left[ {\left| {x^{(k)} } \right|^{4} } \right] = {\text{E}}\left[ {\left| {y^{(k)} } \right|^{4} } \right] = 2 $$ and $$ {\text{E}}\left[ {\left| {x^{(k)} } \right|^{2} } \right] = {\text{E}}\left[ {\left| {y^{(k)} } \right|^{2} } \right] = 1 $$, we obtain the conclusion of Lemma 2.2.1.

Lemma 2.2.2

Let $$ N_{\text{r}} > 4 $$, let $$ x^{(k)} \left( {k = 1,2, \ldots ,N_{\text{r}} } \right) $$ be independent and identically distributed and follow the complex Gaussian distribution with unit variance, and let $$ g = \sum\limits_{k = 1}^{{N_{\text{r}} }} {\left| {x^{(k)} } \right|^{2} } $$. Then
$$ E\left[ {\left| {g^{ - 1} } \right|^{4} } \right] = \left( {\left( {N_{\text{r}} - 1} \right)\left( {N_{\text{r}} - 2} \right)\left( {N_{\text{r}} - 3} \right)\left( {N_{\text{r}} - 4} \right)} \right)^{ - 1} $$
(2.26)

Proof

First, g is rewritten as
$$ g = \frac{1}{2}\sum\limits_{k = 1}^{{2N_{\text{r}} }} {\left| {s^{(k)} } \right|^{2} } $$
(2.27)
Here $$ s^{(k)} $$ are independent and identically distributed and follow the real Gaussian distribution with zero mean and unit variance. Therefore, $$ 2g $$ obeys the $$ \chi^{2} $$ distribution with $$ 2N_{\text{r}} $$ degrees of freedom, and its reciprocal obeys the corresponding inverse $$ \chi^{2} $$ distribution. Using the moments of the inverse $$ \chi^{2} $$ distribution, Eq. (2.28) is obtained:
$$ E\left( {\left| {2g^{ - 1} } \right|^{4} } \right) = \frac{16}{{\left( {N_{\text{r}} - 1} \right)\left( {N_{\text{r}} - 2} \right)\left( {N_{\text{r}} - 3} \right)\left( {N_{\text{r}} - 4} \right)}} $$
(2.28)

Dividing both sides of Eq. (2.28) by 16 yields the conclusion in Eq. (2.26).

Lemma 2.2.3

Let $$ N_{\text{r}} > 4 $$, and let the elements in the channel matrix H be independent and identically distributed complex Gaussian variables with zero mean and unit variance. Then
$$ E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] \le \left( {N_{\text{t}}^{2} - N_{\text{t}} } \right)\sqrt {\frac{{2N_{\text{r}} \left( {N_{\text{r}} + 1} \right)}}{{\left( {N_{\text{r}} - 1} \right)\left( {N_{\text{r}} - 2} \right)\left( {N_{\text{r}} - 3} \right)\left( {N_{\text{r}} - 4} \right)}}} $$
(2.29)

Proof

The Gram matrix G and matrix A are related by $$ {\varvec{A}} = {\varvec{D}} + {\varvec{E}} = {\varvec{G}} + \frac{{N_{0} }}{{E_{\text{s}} }}{\varvec{I}}_{{N_{\text{t}} }} $$. Therefore, the element in the ith row and jth column of matrix A can be expressed as
$$ a^{(i,j)} = \left\{ {\begin{array}{*{20}l} {g^{(i,j)} = \sum\limits_{k = 1}^{{N_{\text{r}} }} {(h^{(k,i)} )^{*} h^{(k,j)} ,} } \hfill & {i \ne j} \hfill \\ {g^{(i,i)} + \frac{{N_{0} }}{{E_{\text{s}} }} = \sum\limits_{k = 1}^{{N_{\text{r}} }} {\left| {h^{(k,i)} } \right|^{2} } + \frac{{N_{0} }}{{E_{\text{s}} }},} \hfill & {i = j} \hfill \\ \end{array} } \right. $$
(2.30)
where $$ g^{(i,j)} $$ is the element of ith row and jth column of the matrix G. Therefore, we can find the inequality in Eq. (2.31):
$$ E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] = E\left[ {\sum\limits_{i = 1}^{{i = N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{j = N_{\text{t}} }} {\left| {\frac{{g^{(i,j)} }}{{a^{(i,i)} }}} \right|^{2} } } } \right] \le \sum\limits_{i = 1}^{{i = N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{j = N_{\text{t}} }} {E\left| {\frac{{g^{(i,j)} }}{{g^{(i,i)} }}} \right|^{2} } } $$
(2.31)
Then applying the Cauchy–Schwarz inequality to Eq. (2.31) gives
$$ E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] \le \sum\limits_{i = 1}^{{i = N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{j = N_{\text{t}} }} {\sqrt {E\left[ {\left| {g^{(i,j)} } \right|^{4} } \right]E\left[ {\left| {\left( {g^{(i,i)} } \right)^{ - 1} } \right|^{4} } \right]} } } $$
(2.32)

Applying Lemma 2.2.1 to the first expectation and Lemma 2.2.2 to the second, we get the following expression:

$$ \begin{aligned} E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] & \le \sum\limits_{i = 1}^{{i = N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{j = N_{\text{t}} }} {\sqrt {\frac{{2N_{\text{r}} \left( {N_{\text{r}} + 1} \right)}}{{\left( {N_{\text{r}} - 1} \right)\left( {N_{\text{r}} - 2} \right)\left( {N_{\text{r}} - 3} \right)\left( {N_{\text{r}} - 4} \right)}}} } } \\ & \quad = \left( {N_{\text{t}}^{2} - N_{\text{t}} } \right)\sqrt {\frac{{2N_{\text{r}} \left( {N_{\text{r}} + 1} \right)}}{{\left( {N_{\text{r}} - 1} \right)\left( {N_{\text{r}} - 2} \right)\left( {N_{\text{r}} - 3} \right)\left( {N_{\text{r}} - 4} \right)}}} \\ \end{aligned} $$
(2.33)
Now let us prove Theorem 2.2.1. Using Markov's inequality, we obtain
$$ P\left\{ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{K} \ge \alpha } \right\} = P\left\{ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} \ge \alpha^{{\frac{2}{K}}} } \right\} \le \alpha^{{ - \frac{2}{K}}} E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] $$
(2.34)

Combining $$ P\left\{ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{K} < \alpha } \right\} = 1 - P\left\{ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{K} \ge \alpha } \right\} $$ with the bound on $$ E\left[ {\left\| {{\varvec{D}}^{ - 1} {\varvec{E}}} \right\|_{F}^{2} } \right] $$ in Lemma 2.2.3, we obtain the conclusion of Theorem 2.2.1.

As we can see from Eq. (2.23), when $$ N_{\text{r}} \gg N_{\text{t}} $$, the probability that condition (2.22) is satisfied is larger. This theorem shows that the Neumann series converges with a certain probability: the greater the ratio $$ N_{\text{r}} /N_{\text{t}} $$, the greater the probability of convergence. In addition, the theorem also provides a condition on α for keeping the residual estimation error small: when α is less than 1, the greater the number of terms K selected, the greater the probability of convergence.
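
The following Monte Carlo sketch (illustrative only; the trial count and the N0/Es value are arbitrary choices of ours) estimates how often the sufficient condition (2.22) holds for a given antenna configuration, which is the probability that Theorem 2.2.1 lower-bounds.

```python
import numpy as np

def prob_condition_2_22(Nr, Nt, N0_over_Es=0.1, trials=500):
    """Empirical P{ ||D^{-1}E||_F < 1 } for i.i.d. complex Gaussian channels."""
    hits = 0
    for _ in range(trials):
        H = (np.random.randn(Nr, Nt) + 1j * np.random.randn(Nr, Nt)) / np.sqrt(2)
        A = H.conj().T @ H + N0_over_Es * np.eye(Nt)      # MMSE matrix A = G + N0/Es I
        d = np.real(np.diag(A))                           # diagonal of A (real, positive)
        D_inv_E = (A - np.diag(np.diag(A))) / d[:, None]  # D^{-1} E, row-wise scaling
        hits += np.linalg.norm(D_inv_E, 'fro') < 1
    return hits / trials

# A larger Nr/Nt ratio should push this probability toward 1, as Theorem 2.2.1 predicts:
# print(prob_condition_2_22(128, 8), prob_condition_2_22(32, 8))
```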

2.2.3 Complexity and Block Error Rate

The NSA reduces the computational complexity of matrix inversion. Now we discuss the advantages and limitations of the NSA from the two aspects of computational complexity and block error rate. Here, we only consider the cases where $$ N_{\text{r}} $$ is 64, 128, and 256.

Among the exact algorithms for solving the inverse matrix, the Cholesky decomposition (CHD) algorithm [8] has lower complexity than other exact solutions, such as direct matrix inversion, QR decomposition, LU decomposition [3, 9], etc. Therefore, the CHD algorithm is selected as the reference to be compared with the NSA. The complexity of the CHD algorithm for matrix inversion is in the $$ O(N_{\text{t}}^{3} ) $$ range, while the complexity of the NSA is in the $$ O(N_{\text{t}} ) $$ range when K = 1 and in the $$ O(N_{\text{t}}^{2} ) $$ range when K = 2. When $$ K \ge 3 $$, the complexity of the NSA mainly comes from multiplications between large matrices, and the complexity increases linearly with K. For example, when K = 3 there is one multiplication between large matrices in the algorithm, and when K = 4 there are two; that is, when $$ K \ge 3 $$, K − 2 multiplications between large matrices must be computed in the NSA. So when $$ K \ge 3 $$, the complexity of the NSA algorithm is $$ O\left( {(K - 2)N_{\text{t}}^{3} } \right) $$. It can be seen that when $$ K \ge 3 $$, the NSA has no advantage in complexity compared with the CHD algorithm.

The complexity of an algorithm mainly depends on the number of real-number multiplications in the algorithm. Figure 2.1 depicts the number of real-number multiplications versus the number of the user antennas $$ N_{\text{t}} $$ for the CHD algorithm and for the NSA algorithm with different K values. Figure 2.1 shows that the complexity of the NSA algorithm is lower than that of the CHD algorithm for $$ K \le 3 $$, while the complexity of the NSA algorithm is higher when K > 3.
Fig. 2.1 The relationship between the number of the user antennas $$ N_{\text{t}} $$ and the number of real-number multiplications in the algorithm. © [2018] IEEE. Reprinted, with permission, from Ref. [7]

The NSA approximates the matrix inversion by taking the first K terms of the Neumann series. Obviously, the more terms are taken, the closer the result is to the exact one, but the cost is an increase in complexity; thus precision and complexity form a pair of contradictions. In order to compare the block error rates of the NSA and the CHD algorithm, the uplink of a massive MIMO system is considered here. At the base station, MMSE detection using the above NSA and CHD algorithms is adopted, and $$ {\text{SNR}} = N_{\text{r}} \frac{{E_{\text{s}} }}{{N_{0} }} $$ is defined. Figure 2.2 shows the block error rate of the NSA and the CHD algorithm at different $$ N_{\text{r}} $$ when $$ N_{\text{t}} $$ is equal to 4, 8, and 12 respectively.
Fig. 2.2 Block error rate curves at a $$ N_{\text{t}} = 4 $$, b $$ N_{\text{t}} = 8 $$, c $$ N_{\text{t}} = 12 $$. © [2018] IEEE. Reprinted, with permission, from Ref. [7]

As we can see from Fig. 2.2, when K = 1 or K = 2, the NSA algorithm has a large block error rate; a large number of antennas at the base station can make up for part of this loss. Considering the requirement of 10% block error rate [10] in LTE, the NSA is not suitable for practical applications when K = 1 or K = 2 with a modulation order of 64-QAM. The simulation results show that the block error rate is less than 10−2 when K = 1, $$ N_{\text{r}} = 512 $$ and $$ N_{\text{t}} = 4 $$, and that fewer terms are required by the NSA algorithm when the modulation order is 16-QAM. When the modulation order is 64-QAM and K = 3, the result of the NSA algorithm is close to that of the CHD algorithm. For example, when K = 3 with $$ N_{\text{t}} = 4 $$ or $$ N_{\text{t}} = 8 $$ and $$ N_{\text{r}} = 256 $$, the SNR loss of the NSA algorithm at a block error rate of 10−2 is less than 0.25 dB. Therefore, when the $$ N_{\text{r}} /N_{\text{t}} $$ ratio of the massive MIMO system is large, the NSA algorithm with K = 3 can be used to compute the inverse of the matrix, so that a low block error rate can be achieved at low computational complexity.

In summary, in the massive MIMO system, when the $$ N_{\text{r}} /N_{\text{t}} $$ ratio is small, the CHD algorithm or another exact algorithm is required to compute the inverse of the matrix. When the $$ N_{\text{r}} /N_{\text{t}} $$ ratio is large, the NSA algorithm can be used to approximate the inversion. By using the NSA algorithm, we can find relatively accurate inverse matrix results with low computational complexity. This makes the NSA an efficient and accurate method for massive MIMO detection in some specific cases.

2.3 Chebyshev Iteration Algorithm

2.3.1 Algorithm Design

The Chebyshev iteration algorithm [11] solves the matrix equation $$ {\varvec{Ax}} = {\varvec{b}} $$ iteratively and thus avoids the inversion of a large matrix. Its basic iteration form is
$$ {\varvec{x}}^{(K)} = {\varvec{x}}^{(K - 1)} + {\varvec{\sigma}}^{(K)} $$
(2.35)
where σ is the correction vector and K is the number of iterations. The σ can be expressed as
$$ {\varvec{\sigma}}^{(0)} = \frac{1}{\beta }{\varvec{r}}^{(0)} $$
(2.36)
$$ {\varvec{\sigma}}^{(K)} = \rho^{(K)} {\varvec{r}}^{(K)} + \varphi^{(K)} {\varvec{\sigma}}^{(K - 1)} $$
(2.37)
where $$ {\varvec{r}}^{(K)} = {\varvec{b}} - {\varvec{Ax}}^{(K)} $$ denotes the residual vector, ρ(K) and φ(K) are the Chebyshev polynomial parameters of two successive iterations, and β is an iterative parameter related to the eigenvalues of matrix A. Therefore, the Chebyshev iteration can be used to solve the linear equation in the MMSE detection of massive MIMO, so as to avoid the operation complexity caused by the inversion of a large matrix. In this section, for convenience, set $$ {\varvec{A}} = {\varvec{H}}^{H} {\varvec{H}} + \frac{{N_{0} }}{{E_{\text{s}} }}{\varvec{I}}_{{N_{\text{t}} }} $$, so that $$ {\varvec{{A}\hat{s}}} = {\varvec{y}}^{\text{MF}} $$.

Although the Chebyshev iteration can be used in the MMSE detection of massive MIMO, it still faces some challenges. First, since the parameters β, ρ(K) and φ(K) are related to the eigenvalues of matrix A, it is difficult to calculate these parameters. Second, at the beginning of the iteration it is necessary to compute matrix A, which involves a large-scale matrix multiplication and consumes many hardware resources. Third, different initial values of the iteration affect the convergence rate of the algorithm, and determining a good initial value is also a challenge. To solve the above problems, the Chebyshev iteration is optimized in this section to make it more suitable for massive MIMO signal detection.

According to Eq. (2.35), the iteration form in MMSE can be written as
$$ {\hat{\varvec{s}}}^{(K)} = {\hat{\varvec{s}}}^{(K - 1)} + {\varvec{\sigma}}^{(K)} $$
(2.38)
where $$ {\hat{\varvec{s}}}^{(0)} $$ is the initial value of iteration, and the parameter $$ {\varvec{\sigma}}^{(K)} $$ satisfies:
$$ {\varvec{\sigma}}^{(K)} = \frac{1}{\beta }\left( {{\varvec{y}}^{\text{MF}} - {\varvec{{A}\hat{s}}}^{(K)} } \right) $$
(2.39)
In order to reduce the computational complexity, the product of matrix A and $$ {\hat{\varvec{s}}}^{(K)} $$ in Eq. (2.39) can be split as
$$ {\varvec{{A}\hat{s}}}^{(K)} = \frac{{N_{0} }}{{E_{\text{s}} }}{\hat{\varvec{s}}}^{(K)} + {\varvec{H}}^{\text{H}} \left( {{\varvec{H\hat{s}}}^{(K)} } \right) $$
(2.40)
In Eqs. (2.36) and (2.37), the parameters ρ(K) and φ(K) can be expressed as
$$ \rho^{(K)} = \frac{2\alpha }{\beta }\frac{{T_{K} (\alpha )}}{{T_{K + 1} (\alpha )}} $$
(2.41)
$$ \varphi^{(K)} = \frac{{T_{K - 1} (\alpha )}}{{T_{K + 1} (\alpha )}} $$
(2.42)
where $$ T_{K} $$ denotes the Kth-order Chebyshev polynomial, and α is a parameter related to the eigenvalues of matrix A. According to the Chebyshev polynomial recursion [11], the expression of T is
$$ T_{0} (\alpha ) = 1 $$
(2.43)
$$ T_{ 1} (\alpha ) = \alpha $$
(2.44)
$$ T_{K} (\alpha ) = 2\alpha {\text{T}}_{K - 1} (\alpha ) - T_{K - 2} (\alpha ),\quad K \ge 2 $$
(2.45)
Combining Eqs. (2.41)–(2.45), we can get Eqs. (2.46)–(2.49):
$$ \rho^{(1)} = \frac{{2\alpha^{2} }}{{\left( {2\alpha^{2} - 1} \right)\beta }} $$
(2.46)
$$ \rho^{(K)} = \frac{{4\alpha^{2} }}{{4\alpha^{2} \beta - \beta^{2} \rho^{(K - 1)} }} $$
(2.47)
$$ \varphi^{(1)} = \frac{1}{{2\alpha^{2} - 1}} $$
(2.48)
$$ \varphi^{(K)} = \frac{{\beta^{2} \rho^{(K - 1)} }}{{4\alpha^{2} \beta - \beta^{2} \rho^{(K - 1)} }} $$
(2.49)
where $$ \alpha $$ and $$ \beta $$ satisfy:
$$ \alpha = \frac{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } }}{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }} $$
(2.50)
$$ \beta = \frac{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } }}{2} $$
(2.51)
where λmax and λmin are the maximum and minimum eigenvalues of matrix A respectively. Since computing the eigenvalues of matrix A is complicated, an approximation is adopted here. As Nr and Nt increase, λmax and λmin can be approximately expressed as
$$ \lambda_{\hbox{max} } \approx N_{\text{r}} \left( {1 + \sqrt {\frac{{N_{\text{t}} }}{{N_{\text{r}} }}} } \right)^{2} $$
(2.52)
$$ \lambda_{\hbox{min} } \approx N_{\text{r}} \left( {1 - \sqrt {\frac{{N_{\text{t}} }}{{N_{\text{r}} }}} } \right)^{2} $$
(2.53)

So far, all the parameters used in the Chebyshev iteration can be expressed in terms of the scale parameters of the channel matrix H in the massive MIMO system. According to Eq. (2.38), an initial value is still needed if we want to use the Chebyshev iteration algorithm to estimate the signal s. Theoretically, any initial value leads to the final estimate, but the convergence rates corresponding to different initial values are not the same. A good initial value makes the algorithm converge faster toward the desired result.

As described in Sect. 2.2, in the massive MIMO system the number of receiving antennas is often much larger than the number of the user antennas, i.e., $$ N_{\text{r}} \gg N_{\text{t}} $$, and the elements in the channel matrix H are independent and identically distributed Gaussian variables with parameters N(0, 1), so that matrix A is a diagonally dominant matrix and satisfies
$$ A_{i,j} = \left\{ {\begin{array}{*{20}l} {\frac{{\lambda_{\hbox{max} }\,+\,\lambda_{\hbox{min} } }}{2} = \beta ,} \hfill & {i = j} \hfill \\ {0,} \hfill & {i \ne j} \hfill \\ \end{array} } \right. $$
(2.54)
So the initial value $$ {\hat{\varvec{s}}}^{(0)} $$ can be approximated as
$$ {\hat{\varvec{s}}}^{(0)} \approx \frac{1}{\beta }{\varvec{H}}^{\text{H}} {\varvec{y}} = \frac{2}{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } }}{\varvec{H}}^{\text{H}} {\varvec{y}} $$
(2.55)

This initial value enables the Chebyshev iteration to achieve a faster convergence rate, and the computational complexity of the initial value is very low. In addition, the computation of the initial value can be executed in parallel.
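
A small sketch of the closed-form parameter choices in Eqs. (2.50)–(2.55): every quantity depends only on Nr, Nt and the matched-filter output, so no eigenvalue decomposition of A is required (the function names are ours).

```python
import numpy as np

def chebyshev_parameters(Nr, Nt):
    """alpha and beta from Eqs. (2.50)-(2.53), using the eigenvalue approximations."""
    lam_max = Nr * (1 + np.sqrt(Nt / Nr)) ** 2
    lam_min = Nr * (1 - np.sqrt(Nt / Nr)) ** 2
    alpha = (lam_max + lam_min) / (lam_max - lam_min)
    beta = (lam_max + lam_min) / 2
    return alpha, beta

def initial_estimate(H, y, beta):
    """Initial value of Eq. (2.55): s^(0) = (1/beta) * H^H y."""
    return (H.conj().T @ y) / beta
```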

In massive MIMO signal detection, it is often necessary to output the log-likelihood ratio for the use of the next stage circuit, so it is necessary to discuss how to use Chebyshev iteration to find the approximate log-likelihood ratio. The estimated transmitted signal can be expressed as
$$ {\hat{\varvec{s}}} = {\varvec{A}}^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{y}} = {\varvec{A}}^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{Hs}} + {\varvec{A}}^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{n}} $$
(2.56)
Set $$ {\varvec{X}} = {\varvec{A}}^{ - 1} {\varvec{H}}^{\text{H}} {\varvec{H}} $$ and $$ {\varvec{Y}} = {\varvec{X{A}}}^{ - 1} $$; they can be used to compute the equivalent channel gain and the noise-plus-interference (NPI) respectively. In combination with Eqs. (2.37) and (2.38), the estimate $$ {\hat{\varvec{s}}} $$ of the transmitted signal can be expressed as
$$ \begin{aligned} {\hat{\varvec{s}}} \approx {\hat{\varvec{s}}}^{(K)} & = {\hat{\varvec{s}}}^{{\left( {K - 1} \right)}} + \rho^{{\left( {K - 1} \right)}} {\varvec{r}}^{{\left( {K - 1} \right)}} + \varphi^{{\left( {K - 1} \right)}} {\varvec{\sigma}}^{{\left( {K - 2} \right)}} \\ & = \left[ {\left( {1 + \varphi^{{\left( {K - 1} \right)}} } \right){\varvec{I}}_{{N_{\text{t}} }} - \rho^{{\left( {K - 1} \right)}} {\varvec{A}}} \right]{\hat{\varvec{s}}}^{{\left( {K - 1} \right)}} - \varphi^{{\left( {K - 1} \right)}} {\hat{\varvec{s}}}^{{\left( {K - 2} \right)}} + \rho^{{\left( {K - 1} \right)}} {\varvec{y}}^{\text{MF}} \\ \end{aligned} $$
(2.57)
Setting $$ {\varvec{y}}^{\text{MF}} = {\varvec{e}}_{{(N_{\text{t}} , 1)}} $$, i.e., replacing it with unit vectors, the iteration in Eq. (2.57) can also be used to approximate the inverse of matrix A as well as X and Y. For example
$$ \begin{aligned} {\tilde{\varvec{A}}}^{ - 1} & \approx \left( {{\tilde{\varvec{A}}}^{ - 1} } \right)^{\left( K \right)} \\ & = \left[ {\left( {1 + \varphi^{{\left( {K - 1} \right)}} } \right){\varvec{I}}_{{N_{t} }} - \rho^{{\left( {K - 1} \right)}} {\varvec{A}}} \right]\left( {{\tilde{\varvec{A}}}^{ - 1} } \right)^{{\left( {K - 1} \right)}} - \varphi^{{\left( {K - 1} \right)}} \left( {{\tilde{\varvec{A}}}^{ - 1} } \right)^{{\left( {K - 2} \right)}} + \rho^{{\left( {K - 1} \right)}} {\varvec{I}} \\ \end{aligned} $$
(2.58)
where $$ ({\tilde{\varvec{A}}}^{ - 1} )^{(0)} = \frac{1}{\beta }{\varvec{I}}_{{N_{\text{t}} }} $$, and all $$ ({\tilde{\varvec{A}}}^{ - 1} )^{(K)} $$ are diagonal matrices. Similarly, matrices X and Y are shown as
$$ \begin{aligned} {\hat{\varvec{X}}} & \approx {\hat{\varvec{X}}}^{(K)} \\ & = \left[ {\left( {1 + \varphi^{{\left( {K - 1} \right)}} } \right){\varvec{I}}_{{N_{\text{t}} }} - \rho^{{\left( {K - 1} \right)}} {\varvec{A}}} \right]{\hat{\varvec{X}}}^{{\left( {K - 1} \right)}} - \varphi^{{\left( {K - 1} \right)}} {\hat{\varvec{X}}}^{{\left( {K - 2} \right)}} + \rho^{{\left( {K - 1} \right)}} {\varvec{H}}^{\text{H}} {\varvec{H}} \\ \end{aligned} $$
(2.59)
$$ {\hat{\varvec{Y}}} \approx {\hat{\varvec{Y}}}^{(K)} = {\hat{\varvec{X}}}({\tilde{\varvec{A}}}^{ - 1} )^{(K)} $$
(2.60)
The equivalent channel gain μi and NPI can be approximately expressed as
$$ \mu_{i} = \hat{X}_{i,i} \approx \hat{X}_{i,i}^{(K)} $$
(2.61)
$$ \nu_{i}^{2} = \sum\limits_{j \ne i}^{{N_{\text{t}} }} {\left| {X_{i,j} } \right|^{2} E_{\text{s}} } + Y_{i,i} N_{0} \approx N_{0} \hat{X}_{i,i}^{(K)} \tilde{A}_{i,i}^{(K)} $$
(2.62)
The signal-to-interference-plus-noise ratio (SINR) can be calculated by combining Eqs. (2.58)–(2.62). However, this computation has high complexity. An algorithm based on the initial eigenvalue solution is presented here to compute the log-likelihood ratio (LLR), which reduces the computational complexity [12]. The LLR is shown below.
$$ L_{i,b} (\hat{s}_{i} ) = \gamma_{i} \left( {\mathop {\hbox{min} }\limits_{{s \in S_{b}^{0} }} \left| {\frac{{\hat{s}_{i} }}{{\mu_{i} }} - s} \right|^{2} - \mathop {\hbox{min} }\limits_{{s \in S_{b}^{1} }} \left| {\frac{{\hat{s}_{i} }}{{\mu_{i} }} - s} \right|^{2} } \right) $$
(2.63)
where the parameter $$ \gamma_{i} $$ meets
$$ \gamma_{i} = \frac{{\mu_{i}^{2} }}{{\nu_{i}^{2} }} = \frac{{(\hat{X}_{i,i}^{(K)} )^{2} }}{{N_{0} \hat{X}_{i,i}^{(K)} \tilde{A}_{i,i}^{(K)} }} = \frac{{\hat{X}_{i,i}^{(K)} }}{{N_{0} \tilde{A}_{i,i}^{(K)} }} \approx \frac{1}{\beta }\frac{1}{{N_{0} }} $$
(2.64)
where $$ \gamma_{i} $$ indicates the SINR of the ith user, and $$ S_{b}^{0} $$ and $$ S_{b}^{1} $$ are the subsets of the modulation constellation $$ Q\left( {\left| Q \right| = 2^{\vartheta } } \right) $$ whose bth bit is 0 and 1 respectively. For convenience, $$ L_{i,b} $$ is written in the linear form $$ L_{i,b} (\hat{s}_{i} ) = \gamma_{i} \xi_{b} (\hat{s}_{i} ) $$. Here, the approximate SINR no longer depends on the results of $$ \hat{X}_{i,i}^{(K)} $$ and $$ \tilde{A}_{i,i}^{(K)} $$.
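
The max-log LLR of Eq. (2.63) can be computed in a few lines once the equalized symbol and the SINR are available; in the sketch below, points holds the constellation Q and labels is an integer array of its bit labels (both are hypothetical inputs supplied by the caller, and the function name is ours).

```python
import numpy as np

def max_log_llr(s_hat_i, mu_i, gamma_i, points, labels, num_bits):
    """Per-bit max-log LLR of Eq. (2.63) for one user symbol estimate."""
    z = s_hat_i / mu_i                       # equalized symbol s_hat_i / mu_i
    d2 = np.abs(z - points) ** 2             # squared distances to all constellation points
    llr = np.empty(num_bits)
    for b in range(num_bits):
        bit_b = (labels >> b) & 1            # b-th bit of every constellation label
        llr[b] = gamma_i * (d2[bit_b == 0].min() - d2[bit_b == 1].min())
    return llr
```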

Based on the above analysis, we obtain an optimized Chebyshev iterative algorithm that approximates an MMSE detector in a massive MIMO system and name it the parallelizable Chebyshev iteration (PCI). The algorithm [13] is shown in Algorithm 2.1.

Algorithm 2.1

The parallelizable Chebyshev iteration (PCI) algorithm for soft-output MMSE detection
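
The numpy sketch below restates the main loop of the PCI algorithm from Eqs. (2.36)–(2.55), using the standard Chebyshev semi-iteration ordering and the matrix-free split of Eq. (2.40). It is only a readable reference model under those assumptions, not the hardware-oriented formulation of Ref. [13], and the function name is ours.

```python
import numpy as np

def pci_detect(H, y, N0, Es, K):
    """Parallelizable Chebyshev iteration (PCI) estimate of s, Eqs. (2.36)-(2.55)."""
    Nr, Nt = H.shape
    # Closed-form parameters, Eqs. (2.50)-(2.53)
    lam_max = Nr * (1 + np.sqrt(Nt / Nr)) ** 2
    lam_min = Nr * (1 - np.sqrt(Nt / Nr)) ** 2
    alpha = (lam_max + lam_min) / (lam_max - lam_min)
    beta = (lam_max + lam_min) / 2

    def A_times(v):                              # A v without forming A, Eq. (2.40)
        return (N0 / Es) * v + H.conj().T @ (H @ v)

    y_mf = H.conj().T @ y                        # matched filter, Eq. (2.3)
    s = y_mf / beta                              # initial value, Eq. (2.55)
    sigma = (y_mf - A_times(s)) / beta           # first correction, Eq. (2.36)
    s = s + sigma

    rho_prev = None
    for k in range(1, K):
        if k == 1:
            rho = 2 * alpha**2 / ((2 * alpha**2 - 1) * beta)      # Eq. (2.46)
            phi = 1 / (2 * alpha**2 - 1)                          # Eq. (2.48)
        else:
            denom = 4 * alpha**2 * beta - beta**2 * rho_prev
            rho = 4 * alpha**2 / denom                            # Eq. (2.47)
            phi = beta**2 * rho_prev / denom                      # Eq. (2.49)
        r = y_mf - A_times(s)                    # residual vector
        sigma = rho * r + phi * sigma            # correction, Eq. (2.37)
        s = s + sigma                            # update, Eq. (2.38)
        rho_prev = rho
    return s
```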

2.3.2 Convergence

Now let us discuss the problem of the convergence rate of the Chebyshev iteration. After K steps, the approximation error can be expressed as [14]
$$ {\varvec{e}}^{(K)} = {\varvec{s}} - {\hat{\varvec{s}}}^{(K)} = P^{(K)} ({\varvec{A}}){\varvec{e}}^{(0)} $$
(2.65)
In the above equation, $$ P^{(K)} $$ satisfies
$$ P^{(K)} (\lambda_{i} ) = \frac{{T^{(K)} \left( {\alpha - \frac{\alpha }{\beta }\lambda_{i} } \right)}}{{T^{(K)} (\alpha )}} $$
(2.66)
So the error can be expressed as
$$ \left\| {{\varvec{e}}^{(K)} } \right\| \le \left\| {P^{\left( K \right)} ({\varvec{A}})} \right\|\left\| {{\varvec{e}}^{( 0)} } \right\| $$
(2.67)
From the previous description, we know that matrix A is a diagonally dominant matrix, so there is
$$ \begin{aligned} P^{(K)} ({\varvec{A}}) & = P^{(K)} ({\varvec{SJS}}^{{ - \text{1}}} ) = {\varvec{S}}P^{(K)} ({\varvec{J}}){\varvec{S}}^{ - 1} \\ & = {\varvec{S}}\left[ {\begin{array}{*{20}c} {P^{(K)} (\lambda_{1} )} & {} & {} \\ {} & \ddots & {} \\ {} & {} & {P^{(K)} (\lambda_{{N_{\text{t}} }} )} \\ \end{array} } \right]{\varvec{S}}^{ - 1} \\ \end{aligned} $$
(2.68)
where S is an $$ N_{\text{t}} \times N_{\text{t}} $$ complex matrix satisfying $$ {\varvec{S}}^{ - 1} = {\varvec{S}}^{\text{H}} $$, and J is an upper triangular matrix. Two lemmas are now proposed and proved, followed by the corresponding conclusions.

Lemma 2.3.1

In the massive MIMO system, there is
$$ \left| {P^{(K)} (\lambda_{i} )} \right| \approx \left| {\frac{{N_{\text{r}} + N_{\text{t}} - \lambda_{i} + \sqrt {\left( {N_{\text{r}} + N_{\text{t}} - \lambda_{i} } \right)^{ 2} - 1} }}{{ 2N_{\text{r}} }}} \right|^{K} $$
(2.69)
where $$ P^{(K)} (\lambda_{i} ) $$ is the Kth normalized Chebyshev polynomial.

Proof

The Chebyshev polynomials in Eq. (2.45) can be rewritten as [15]
$$ T_{K} (\alpha ) = { \cosh }\left( {K{\text{arcosh}}(\alpha )} \right) = \frac{{{\text{e}}^{{K{\text{arcosh}}(\alpha )}} + {\text{e}}^{{ - K{\text{arcosh}}(\alpha )}} }}{2} $$
(2.70)
Combining Eqs. (2.66) and (2.70), the Chebyshev polynomial is converted to
$$ \begin{aligned} P^{(K)} (\lambda_{i} ) & = \frac{{{\text{e}}^{{K{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} + {\text{e}}^{{ - K{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} }}{{{\text{e}}^{{K{\text{arcosh}}\left( \alpha \right)}} + {\text{e}}^{{ - K{\text{arcosh}}\left( \alpha \right)}} }} \\ & = \left( {\frac{{{\text{e}}^{{{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} }}{{{\text{e}}^{{{\text{arcosh}}\left( \alpha \right)}} }}} \right)^{K} \cdot \left( {\frac{{1 + {\text{e}}^{{ - 2K{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} }}{{1 + {\text{e}}^{{ - 2K{\text{arcosh}}\left( \alpha \right)}} }}} \right) \\ \end{aligned} $$
(2.71)
On this basis, the identity $$ {\text{arcosh}}\left( x \right) = \ln \left( {x + \sqrt {x^{2} - 1} } \right) $$ is used to transform Eq. (2.71) into
$$ P^{(K)} (\lambda_{i} ) = \left( {V(\lambda_{i} )} \right)^{k} \cdot Q^{(K)} (\lambda_{i} ) $$
(2.72)
where $$ V(\lambda_{i} ) $$ and $$ Q^{(K)} (\lambda_{i} ) $$ meet:
$$ V(\lambda_{i} ) = \frac{{{\text{e}}^{{{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} }}{{{\text{e}}^{{{\text{arcosh}}(\alpha )}} }} = \frac{{\alpha - \frac{\alpha }{\beta }\lambda_{i} + \sqrt {\left( {\alpha - \frac{\alpha }{\beta }\lambda_{i} } \right)^{2} - 1} }}{{\alpha + \sqrt {\alpha^{2} - 1} }} $$
(2.73)
$$ Q^{(K)} (\lambda_{i} ) = \frac{{1 + {\text{e}}^{{ - 2K{\text{arcosh}}\left( {\alpha - \left( {\frac{\alpha }{\beta }} \right) \cdot \lambda_{i} } \right)}} }}{{1 + {\text{e}}^{{ - 2K{\text{arcosh}}(\alpha )}} }} $$
(2.74)
From Eqs. (2.50)–(2.53), the parameters satisfy $$ \alpha \notin \left( { - \infty ,\left. 1 \right]} \right. $$ and $$ \left( {\alpha - \frac{\alpha }{\beta }\lambda_{i} } \right) \notin [ - 1,1] $$. As a result, $$ Q^{(K)} (\lambda_{i} ) $$ satisfies
$$ 0 \le \left| {Q^{(K)} (\lambda_{i} )} \right| \le \frac{2}{{1 - \tau^{K} }} $$
(2.75)
As the iteration number K increases, the value $$ \left| {Q^{(K)} (\lambda_{i} )} \right| $$ remains bounded, so there is
$$ P^{(K)} (\lambda_{i} ) \approx (V(\lambda_{i} ))^{(K)} $$
(2.76)
Considering that $$ V(\lambda_{i} ) $$ is an operator similar to that in Eq. (2.72), set $$ V = V({\varvec{A}}) $$. When K is very large, we can get
$$ P^{(K)} ({\varvec{A}}) \approx V^{(K)} $$
(2.77)
If the eigenvalues of A satisfy $$ \lambda_{\hbox{min} } < \lambda_{i} < \lambda_{\hbox{max} } $$, then $$ \left| {V(\lambda_{i} )} \right| $$ takes the same constant value, so the eigenvalues $$ \psi_{i} $$ of V satisfy
$$ \left| {\psi_{i} } \right| = \left| {\frac{{\alpha - \frac{\alpha }{\beta }\lambda_{i} + \sqrt {\left( {\alpha - \frac{\alpha }{\beta }\lambda_{i} } \right)^{2} - 1} }}{{\alpha + \sqrt {\alpha^{2} - 1} }}} \right| $$
(2.78)
Combining Eqs. (2.50)–(2.53) with Eqs. (2.76)–(2.78), the following equation is obtained:
$$ \left| {P^{(K)} (\lambda_{i} )} \right| \approx \left| {\frac{{\frac{{N_{\text{r}} + N_{\text{t}} - \lambda_{i} }}{{2\sqrt {N_{\text{r}} N_{\text{t}} } }} + \sqrt {\left( {\frac{{N_{\text{r}} + N_{\text{t}} - \lambda_{i} }}{{2\sqrt {N_{\text{r}} N_{\text{t}} } }}} \right)^{2} - 1} }}{{\frac{{N_{\text{r}} + N_{\text{t}} }}{{2\sqrt {N_{\text{r}} N_{\text{t}} } }} + \sqrt {\left( {\frac{{N_{\text{r}} + N_{\text{t}} }}{{2\sqrt {N_{\text{r}} N_{\text{t}} } }}} \right)^{2} - 1} }}} \right|^{K} $$
(2.79)

Therefore, the conclusion of Lemma 2.3.1 is obtained.

According to Eqs. (2.52) and (2.53), when $$ N_{\text{t}} $$ remains constant but $$ N_{\text{r}} $$ increases, both the maximum eigenvalue and the minimum eigenvalue of matrix A approach $$ N_{\text{r}} $$. Lemma 2.3.1 proves that the value $$ \left| {P^{(K)} (\lambda_{i} )} \right| $$ is less than 1 and decreases as the ratio of $$ N_{\text{r}} $$ to $$ N_{\text{t}} $$ increases. According to Eqs. (2.67) and (2.68), the estimation error is therefore very small even when the number of iterations K is finite. And from Lemma 2.3.1, we know that the estimation error decreases as the number of iterations K increases. Therefore, we can get
$$ \mathop {\lim }\limits_{K \to \infty } \left| {P^{(K)} (\lambda_{i} )} \right| = 0 $$
(2.80)
$$ \mathop {\lim }\limits_{K \to \infty } \left| {P^{(K)} ({\varvec{A}})} \right| = 0 $$
(2.81)

In other words, in massive MIMO signal detection, using Chebyshev iterative algorithm to estimate the transmitted signal s, the calculation error is very small, even close to zero.

Lemma 2.3.2

In the massive MIMO system, $$ \left| {V_{\text{ch}} (\lambda_{i} )} \right| \le \left| {V_{\text{cg}} (\lambda_{i} )} \right| $$ and $$ \left| {V_{\text{ch}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| \le \left| {V_{\text{ne}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| $$ are satisfied, where $$ V_{\text{ch}} $$, $$ V_{\text{cg}} $$ and $$ V_{\text{ne}} $$ are respectively the normalized Chebyshev polynomials corresponding to the Chebyshev iteration algorithm, the CG algorithm and the NSA algorithm.

Proof

The convergence rate of the Chebyshev iteration algorithm is
$$ R({\varvec{A}}) = - \log_{2} \left( {\mathop {\lim }\limits_{K \to \infty } \left( {\left\| {P^{(K)} ({\varvec{A}})} \right\|^{{\frac{1}{K}}} } \right)} \right) $$
(2.82)
In order to make the convergence faster, $$ \mathop {\lim }\limits_{K \to \infty } \left( {\left\| {P^{(K)} ({\varvec{A}})} \right\|^{{\frac{1}{K}}} } \right) $$ in Eq. (2.82) should be as small as possible. By Eqs. (2.76) and (2.78), the problem is transformed into minimizing the maximum value of $$ \left| {V(\lambda_{i} )} \right| $$, i.e.,
$$ \hbox{min} \hbox{max} \left| {V(\lambda_{i} )} \right| = \hbox{min} \hbox{max} \left| {\frac{{\alpha - \frac{\alpha }{\beta }\lambda_{i} + \sqrt {\left( {\alpha - \frac{\alpha }{\beta }\lambda_{i} } \right)^{2} - 1} }}{{\alpha + \sqrt {\alpha^{2} - 1} }}} \right| $$
(2.83)
Set $$ \lambda_{i} = \beta $$, and combining with Eq. (2.51), we can find the minimization of the maximum value $$ \left| {V(\lambda_{i} )} \right| $$. Thus $$ \left| {V_{\text{ch}} (\lambda_{i} )} \right| $$ can be summarized as
$$ \begin{aligned} \left| {V_{\text{ch}} (\lambda_{i} )} \right| & = \left| {\frac{1}{{\alpha + \left( {\alpha^{2} - 1} \right)^{{\frac{1}{2}}} }}} \right| = \left| {\frac{1}{{\frac{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } }}{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }} + \sqrt {\left( {\frac{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } }}{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }}} \right)^{2} - 1} }}} \right| \\ & = \left| {\frac{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }}{{\lambda_{\hbox{max} } + \lambda_{\hbox{min} } + 2\sqrt {\lambda_{\hbox{max} } \cdot \lambda_{\hbox{min} } } }}} \right| \\ \end{aligned} $$
(2.84)
Let $$ \theta = \frac{{\lambda_{\hbox{min} } }}{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }} $$; then the $$ \left| {V_{\text{cg}} (\lambda_{i} )} \right| $$ of the CG algorithm can be bounded as
$$  \begin{aligned}  \left| {V_{{{\text{cg}}}} (\lambda _{i} )} \right| = & \left| {\left[ {\frac{2}{{\left( {1 + 2\theta + \sqrt {\left( {1 + 2\theta } \right)^{2} - 1} } \right)^{K} + \left( {1 + 2\theta + \sqrt {\left( {1 + 2\theta } \right)^{2} - 1} } \right)^{{ - K}} }}} \right]^{{\frac{1}{K}}} } \right| \\  \quad \ge & \left| {\frac{1}{{1 + 2\theta + \sqrt {\left( {1 + 2\theta } \right)^{2} - 1} }}} \right| = \left| {\frac{{\lambda _{{{\text{max}}}} - \lambda _{{{\text{min}}}} }}{{\lambda _{{{\text{max}}}} + \lambda _{{{\text{min}}}} + 2\sqrt {\lambda _{{{\text{max}}}} \cdot \lambda _{{{\text{min}}}} } }}} \right| \\  \quad = & \left| {V_{{{\text{ch}}}} (\lambda _{i} )} \right| \\ \end{aligned} $$
(2.85)

Equation (2.85) shows that $$ \left| {V_{\text{ch}} (\lambda_{i} )} \right| \le \left| {V_{\text{cg}} (\lambda_{i} )} \right| $$. This inequality indicates that, comparing the CG algorithm with the Chebyshev iteration algorithm, $$ \left| {V(\lambda_{i} )} \right| $$ has a smaller maximum value for the Chebyshev iteration algorithm.

In combination with Eqs. (2.50)–(2.53) and (2.73), the $$ \left| {V_{\text{ch}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| $$ of Chebyshev iteration algorithm can be approximated to
$$ \left| {V_{\text{ch}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| = \left| {\frac{{2\sqrt {N_{\text{r}} N_{\text{t}} } }}{{N_{\text{r}} + N_{\text{t}} + \sqrt {\left( {N_{\text{r}} + N_{\text{t}} } \right)^{2} - \left( {2\sqrt {N_{\text{r}} N_{\text{t}} } } \right)^{2} } }}} \right| = \sqrt {\frac{{N_{\text{t}} }}{{N_{\text{r}} }}} $$
(2.86)
According to literature [7], the $$ \left| {V_{\text{ne}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| $$ of the NSA algorithm is
$$ \begin{aligned} \left| {V_{\text{ne}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| & = \left\| {{\varvec{D}}^{ - 1} \left( {{\varvec{L + L}}^{\text{H}} } \right)} \right\|_{F} \ge \frac{1}{{N_{\text{r}} \sqrt {N_{\text{t}} } }}\left\| {{\varvec{L + L}}^{\text{H}} } \right\|_{F} = \frac{1}{{N_{\text{r}} \sqrt {N_{\text{t}} } }}\sqrt {N_{\text{r}} N_{\text{t}} \left( {N_{\text{t}} - 1} \right)} \\ & \approx \sqrt {\frac{{N_{\text{t}} }}{{N_{\text{r}} }}} = \left| {V_{\text{ch}} \left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| \\ \end{aligned} $$
(2.87)

Equation (2.87) indicates that, comparing the NSA algorithm with the Chebyshev iteration algorithm, $$ \left| {V\left( {N_{\text{r}} ,N_{\text{t}} } \right)} \right| $$ has a smaller maximum value for the Chebyshev iteration algorithm.

Combining Eqs. (2.68), (2.82) and (2.83), a smaller maximum value of $$ \left| {V(\lambda_{i} )} \right| $$ results in a faster convergence rate. Therefore, Lemma 2.3.2 shows that the convergence rate of the Chebyshev iteration algorithm is faster than those of the CG algorithm and the NSA algorithm.
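
To see the ordering in Lemma 2.3.2 numerically, the sketch below evaluates the asymptotic contraction factor of the Chebyshev iteration from Eqs. (2.84) and (2.86) alongside the $$ \sqrt {N_{\text{t}} /N_{\text{r}} } $$ value used for the NSA in Eq. (2.87); these are analytical factors under the eigenvalue approximations, not measured error curves.

```python
import numpy as np

def chebyshev_contraction(Nr, Nt):
    """|V_ch| from Eq. (2.84), using the eigenvalue approximations of Eqs. (2.52)-(2.53)."""
    lam_max = Nr * (1 + np.sqrt(Nt / Nr)) ** 2
    lam_min = Nr * (1 - np.sqrt(Nt / Nr)) ** 2
    return (lam_max - lam_min) / (lam_max + lam_min + 2 * np.sqrt(lam_max * lam_min))

# Eq. (2.86) predicts sqrt(Nt/Nr); e.g. Nr = 128, Nt = 8 gives 0.25 for both values,
# while Eq. (2.85) lower-bounds the CG factor by the Chebyshev one:
# print(chebyshev_contraction(128, 8), np.sqrt(8 / 128))
```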

2.3.3 Complexity and Parallelism

After discussing the convergence and convergence rate of the Chebyshev iteration algorithm, we analyze its computational complexity. As computational complexity mainly refers to the number of multiplications in the algorithm, we count the number of multiplications required by the algorithm to evaluate its computational complexity. In the Chebyshev iteration algorithm, the first part of the calculation is the parameter calculation based on Eqs. (2.46)–(2.53). Because the scale of the massive MIMO system is fixed, these parameters only need to be computed once and stored in memory as constants when processing multiple sets of data, so the corresponding number of multiplications is negligible. The second part is to compute the matched-filter vector $$ {\varvec{y}}^{\text{MF}} $$ and the approximate initial solution based on the eigenvalues. The $$ N_{\text{t}} \times N_{\text{r}} $$ matrix $$ {\varvec{H}}^{\text{H}} $$ is multiplied by the $$ N_{\text{r}} \times 1 $$ vector y to get the $$ N_{\text{t}} \times 1 $$ vector $$ {\varvec{y}}^{\text{MF}} $$, which is then scaled by $$ \frac{1}{\beta } $$. This process requires $$ 4N_{\text{r}} N_{\text{t}} $$ real multiplications. The third part of the computation consists of three steps: first, multiply the $$ N_{\text{r}} \times N_{\text{t}} $$ matrix H by the $$ N_{\text{t}} \times 1 $$ estimated vector $$ {\hat{\varvec{s}}}^{(K)} $$; then compute the residual vector $$ {\varvec{r}}^{(K)} $$; and finally compute the correction vector $$ \varvec{\sigma}^{(K)} $$. These three steps require $$ 4KN_{\text{r}} N_{\text{t}} $$, $$ 4KN_{\text{r}} N_{\text{t}} + 2KN_{\text{t}} $$ and $$ 4KN_{\text{t}} $$ real multiplications respectively. The last part of the computation is the calculation of the approximate log-likelihood ratio, which requires $$ N_{\text{t}} + 1 $$ multiplications when the modulation order is 64-QAM. Therefore, the total number of multiplications required by the Chebyshev iteration algorithm is $$ \left( {8K + 4} \right)N_{\text{r}} N_{\text{t}} + \left( {6K + 1} \right)N_{\text{t}} $$.

As mentioned before, directly computing the inverse matrix increases the computational complexity. Such computation is time-consuming and makes hardware implementation difficult; for example, a large-scale matrix multiplication takes 36 clock cycles even when many hardware units operate in parallel. These defects affect the energy and area efficiency of the detector hardware. In addition, the PCI includes an initial-value computation that can be regarded as one iteration; therefore, for a fair comparison, the iteration number K of the PCI comprises the initial-value calculation plus K-1 iterations. Table 2.1 compares the computational complexity of different methods. The computational complexity of accurate MMSE detection methods such as the CHD algorithm is $$ O(N_{\text{t}}^{3} ) $$. The computational complexity of the NSA grows with the number of iterations K: when K < 3 the complexity is $$ O(N_{\text{t}}^{2} ) $$, when K = 3 it increases to $$ O(N_{\text{t}}^{3} ) $$, and when K > 3 it exceeds that of accurate MMSE detection. Usually, to ensure detection accuracy, the number of NSA iterations K should be greater than 3. In addition, the NSA in literature [7] adopts the explicit large-scale matrix multiplication of the $$ N_{\text{t}} \times N_{\text{r}} $$ matrix $$ {\varvec{H}}^{\text{H}} $$ by the $$ N_{\text{r}} \times N_{\text{t}} $$ matrix H, which consumes considerable computing resources, i.e., $$ O\left( {N_{\text{r}} N_{\text{t}}^{2} } \right) $$. Since the result of direct inversion can be reused in the downlink, using the parallelizable Chebyshev algorithm for this calculation doubles the complexity, but it remains lower than the results in literature [7]. Some methods compute matrix A directly but compute A−1 only implicitly, including the implicit version of the Neumann series (INS) approximation and the implicit version of CG [16]. The proposed PCI can also be modified to compute the explicit version of matrix A. Table 2.1 lists the corresponding computational complexity of the PCI, which is lower than or equal to that of literatures [12], [16] and [17]. Under the same conditions as the PCI, the conjugate gradient least squares (CGLS) algorithm [18] and the optimized coordinate descent (OCD) algorithm [19] implement large-scale matrix multiplication and inversion in a completely implicit manner. Compared with the CGLS and OCD algorithms, the PCI still has advantages in computational complexity.
Table 2.1

Computational complexity analysis

| Algorithm | K = 2 | K = 3 | K = 4 | K = 5 |
| --- | --- | --- | --- | --- |
| NSA [7] | 2NrNt² + 6NtNr + 4Nt² + 2Nt | 2NrNt² + 6NtNr + 2Nt³ + 4Nt² | 2NrNt² + 6NtNr + 6Nt³ | 2NrNt² + 6NtNr + 10Nt³ - 6Nt² |
| INS [17] | 2NrNt² + 6NtNr + 10Nt² + 2Nt | 2NrNt² + 6NtNr + 14Nt² + 2Nt | 2NrNt² + 6NtNr + 18Nt² + 2Nt | 2NrNt² + 6NtNr + 22Nt² + 2Nt |
| GAS [12] | 2NrNt² + 6NtNr + 10Nt² - 2Nt | 2NrNt² + 6NtNr + 14Nt² - 6Nt | 2NrNt² + 6NtNr + 18Nt² - 10Nt | 2NrNt² + 6NtNr + 22Nt² - 14Nt |
| CG [16] | 2NrNt² + 6NtNr + 8Nt² + 33Nt | 2NrNt² + 6NtNr + 12Nt² + 49Nt | 2NrNt² + 6NtNr + 16Nt² + 65Nt | 2NrNt² + 6NtNr + 20Nt² + 81Nt |
| CGLS [18] | 16NtNr + 20Nt² + 32Nt | 20NtNr + 28Nt² + 48Nt | 24NtNr + 36Nt² + 64Nt | 28NtNr + 44Nt² + 80Nt |
| OCD [19] | 16NtNr + 4Nt | 20NtNr + 6Nt | 32NtNr + 8Nt | 40NtNr + 10Nt |
| PCI (explicit) | 2NrNt² + 6NtNr + 8Nt² + 8Nt | 2NrNt² + 6NtNr + 12Nt² + 12Nt | 2NrNt² + 6NtNr + 16Nt² + 16Nt | 2NrNt² + 6NtNr + 20Nt² + 20Nt |
| PCI (implicit) | 12NtNr + 7Nt | 20NtNr + 13Nt | 28NtNr + 19Nt | 36NtNr + 25Nt |

Now consider the parallelism of the PCI. In the Gauss–Seidel (GAS) algorithm [12], there is a strong dependence between elements, which limits parallelism. When calculating $$ s_{i}^{(K)} $$ in the Kth GAS iteration, the elements $$ s_{j}^{(K)} $$, $$ j = 1,2, \cdots ,i - 1 $$, of the current iteration and $$ s_{j}^{(K - 1)} $$, $$ j = i,i + 1, \cdots ,N_{\text{t}} $$, of the previous iteration are required. Moreover, in the CHD algorithm as well as the NSA, GAS, and CG methods, there is a strong coupling between the large-scale matrix inversion and the matrix multiplication, because the large-scale matrix multiplication must be completed first; this reduces their parallelism. Parallelism is an important consideration in the algorithm design and hardware implementation of the PCI. In the PCI, when computing the estimate $$ {\varvec{s}}^{(K)} $$, the correction vector $$ {\varvec{\sigma}}^{(K)} $$ and the residual vector $$ {\varvec{r}}^{(K)} $$, all elements of each vector can be computed in parallel. In addition, the implicit method weakens the dependence between the large-scale matrix multiplication and the inverse calculation, which further improves the parallelism of the algorithm.

2.3.4 Bit Error Rate

To evaluate the performance of the PCI, its BER simulation results are compared below with those of the NSA, CG, GAS, OCD and Richardson (RI) algorithms. Furthermore, the BER performance comparison also includes the MMSE algorithm based on the CHD. The SNR is defined at the receiving antenna [7]. As before, the number of iterations K of the PCI comprises the initial-value calculation plus K-1 iterations.

Figure 2.3 shows the BER performance comparisons between the PCI and other methods when $$ N_{\text{t}} = 16 $$, with the SNR set to 14 dB. As the figure shows, the PCI needs only a small number of iterations as $$ N_{\text{r}} $$ increases and achieves near-optimal performance (compared with accurate MMSE). This result shows that the reduction in computational complexity is well justified. Figure 2.3 also exhibits the performance of the NSA: when the number of iterations is relatively small, the detection accuracy loss cannot be ignored, whereas a large number of iterations consumes a lot of hardware resources. Hence, this figure validates that the PCI is superior to the NSA in the massive MIMO system.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig3_HTML.png
Fig. 2.3

BER simulation results of various algorithms at Nt = 16, SNR = 14 dB.

© [2018] IEEE. Reprinted, with permission, from Ref. [13]

Figure 2.4 shows the BER comparisons between the PCI using the proposed initial value for the iteration and the PCI using the traditional zero-vector initial solution. In this simulation, the numbers of transmitting and receiving antennas are 16 and 128 respectively. The initial solution based on the eigenvalue approximation achieves a low detection accuracy loss with the same number of iterations. The simulation result of the PCI with K = 3 is very close to the BER with K = 4, which means that the proposed initial value reduces the amount of computation while maintaining similar detection accuracy.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig4_HTML.png
Fig. 2.4

BER performance comparisons between PCI after updating initial value and PCI for conventional zero vector value.

© [2018] IEEE. Reprinted, with permission, from Ref. [13]

Figure 2.5 exhibits the performance of the PCI, CG, OCD, GAS, RI, and NSA. It also provides the MMSE based on the CHD algorithm for reference. The comparison of the simulation results shows that the PCI achieves near-optimal performance under different antenna configurations. To achieve the same BER, the SNR required by the PCI is almost identical to that of the GAS and OCD methods, and smaller than that of the RI and NSA. When the ratio of $$ N_{\text{r}} $$ to $$ N_{\text{t}} $$ is small, the CG is slightly better than the PCI; when the ratio is larger, the PCI has better detection performance. For example, in Fig. 2.5c, when K = 3, the SNR required by the PCI to achieve a BER of 10−6 is 17.25 dB, which is close to the precise MMSE (16.93 dB), GAS (17.57 dB), OCD (17.67 dB), and CG (17.71 dB). By comparison, the SNRs required by the RI and NSA are 18.45 and 19.26 dB, respectively.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig5a_HTML.png
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig5b_HTML.png
Fig. 2.5

BER comparisons between PCI and other algorithms. a Nr = 64, Nt = 16, b Nr = 90, Nt = 16, c Nr = 128, Nt = 16, d Nr = 162, Nt = 16

© [2018] IEEE. Reprinted, with permission, from Ref. [13]

2.3.5 Analysis on Channel Model Impact

The channel model of the massive MIMO system also affects the algorithm. The Kronecker channel model [20] is often used to evaluate performance because it is more practical than the Rayleigh fading channel. In the Kronecker channel model, the elements of the channel matrix satisfy $$ N\left( {0,d({\varvec{z}}){\varvec{I}}_{{N_{\text{r}} }} } \right) $$, where d(z) indicates the channel fading (such as path fading and shadow fading). Another significant feature of the Kronecker channel model is that it takes the channel correlation into account. Specifically, Rr and Rt denote the channel correlation of the receiving antennas and the transmitting antennas respectively. This part is also based on the Kronecker channel model, and the channel H can be expressed as
$$ {\varvec{H}} = {\varvec{R}}_{\text{r}}^{{^{{\frac{1}{2}}} }} {\varvec{H}}_{{{\text{i}} . {\text{i}} . {\text{d}} .}} \sqrt {d({\varvec{z}})} {\varvec{R}}_{\text{t}}^{{^{{\frac{ 1}{ 2}}} }} $$
(2.88)
where Hi.i.d. is a random matrix, the elements of which are independent and identically distributed and are subject to the complex Gaussian distribution with a zero mean and a unit variance.
In the simulation, the radius of each hexagonal region is r = 500 m, and the users’ locations are independent and random. The independent shadow fading C satisfies
$$ 10\lg C\sim\,N\left( {0,\sigma_{\text{sf}}^{2} } \right) $$
(2.89)

Consider the channel fading variance $$ d({\varvec{z}}) = \frac{C}{{\left\| {{\varvec{z}} - {\varvec{b}}} \right\|^{\kappa } }} $$, where $$ \varvec{b} \in \varvec{R}^{2} $$, κ and ||·|| are the base station location, the path loss exponent and the Euclidean norm respectively, and z denotes the user's location. The simulation adopts the following assumptions: κ = 3.7, $$ \sigma_{\text{sf}}^{ 2} = 5 $$ and the transmitted power $$ \rho = \frac{{\gamma^{\kappa } }}{2} $$.

Now we will discuss the influence of the Kronecker channel model on the eigenvalue approximation. When the channel is independent and identically distributed Rayleigh fading, the diagonal elements of matrix A are close to the approximation value β; ξ denotes the channel correlation coefficient. When the correlation coefficient increases (the Kronecker channel model), the approximation error also increases slightly. Therefore, the eigenvalue approximation is still applicable to more practical channel models, such as the Kronecker channel model. The influence of different channel models on computational complexity is also considered. In order to keep the approximation error low, the number of iterations of the proposed PCI should be slightly increased when the channel correlation, and hence the eigenvalue approximation error, increases. Therefore, there is a slight increase in the computational complexity of the PCI, which is a limitation of this method. As discussed earlier, there are three ways to compute the large-scale matrix multiplication and inverse. When the channel is frequency-flat and changes slowly [21], channel hardening becomes obvious and the matrix multiplication and inversion results of the explicit and partially implicit methods, such as the NSA, INS, GAS, and CG, can be partially reused. However, these methods also have limitations. For example, the high computational complexity at the beginning of the detection affects the subsequent operations (such as the matrix decomposition and the reciprocals of the diagonal elements), which can start only after the matrix multiplication is completed. In addition, the hardware utilization of the large-scale multiplication is not high, which reduces the energy and area efficiency of the detector. The fully implicit methods (i.e., CGLS, OCD, and PCI) solve the large-scale matrix multiplication and inversion implicitly; their computation is much lower and their parallelism higher than those of the explicit methods. However, their results cannot be reused in the case of low frequency selectivity, where small-scale fading [22] averages out and channel hardening occurs, which is another limitation of the PCI. Finally, Fig. 2.6 shows the impact of large-scale fading and spatial correlation on the MIMO channel (the Kronecker channel model), which is an important problem in actual MIMO systems. Simulation results show that the accuracy loss of the PCI relative to the MMSE is small. As the channel correlation increases (the channel correlation coefficient increases), the number of iterations of all methods must increase to keep the approximation error low. With the same number of iterations, the PCI achieves a lower error than the NSA and RI methods. Although the PCI has detection performance similar to the CG, OCD, and GAS, its main advantages are higher parallelism and lower computational complexity than these three methods. In a word, the PCI is superior to the other methods under more practical channel model conditions.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig6_HTML.png
Fig. 2.6

BER comparisons of various algorithms under the Kronecker channel model.

© [2018] IEEE. Reprinted, with permission, from Ref. [13]

2.4 Jacobi Iteration Algorithm

2.4.1 Weighted Jacobi Iteration and Convergence

This section introduces an optimized Jacobi iteration algorithm named the weighted Jacobi iteration (WeJi) algorithm. As discussed earlier, matrix G and matrix W are diagonally dominant matrices. Here, we can decompose matrix W into $$ {\varvec{W}} = {\varvec{P}} + {\varvec{Q}} $$, where P is a diagonal matrix, and Q is a matrix with the diagonal elements as 0. Using the WeJi to solve the linear equation, the transmitted signal can be estimated as
$$ {\hat{\varvec{s}}}^{(K)} = {\varvec{B\hat{s}}}^{{\left( {K - 1} \right)}} + {\varvec{F}}{ = }\left( {\left( {1 - \omega } \right){\varvec{I}} - \omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right){\hat{\varvec{s}}}^{{\left( {K - 1} \right)}} + \omega {\varvec{P}}^{ - 1} {\varvec{y}}^{\text{MF}} $$
(2.90)
where $$ {\varvec{B}} = \left( {\left( {1 - \omega } \right){\varvec{I}} - \omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right) $$ is the iteration matrix, $$ {\varvec{F}} = \omega {\varvec{P}}^{ - 1} {\varvec{y}}^{\text{MF}} $$, K is the number of iterations, and $$ {\hat{\varvec{s}}}^{(0)} $$ is the initial solution. In addition, the weighting parameter satisfies $$ 0 < \omega < 1 $$ and plays a crucial role in the convergence and convergence rate of the WeJi. In the WeJi, the range of the parameter ω is set as $$ 0 < \omega < \frac{2}{{\rho \left( {{\varvec{P}}^{ - 1} {\varvec{W}}} \right)}} $$ [9]. Because P is a diagonal matrix, its inverse is trivial to compute, so the computational complexity of the WeJi is greatly reduced.
The initial value of the iteration will affect the convergence and convergence rate of the iteration, so we need to find a good initial value when we use the WeJi to solve the massive MIMO signal detection problem. Here, the initial value with low computational complexity can be obtained by using NSA. Therefore, we can set the initial value of iteration as Eq. (2.91), which can make WeJi converge at a faster rate.
$$ {\hat{\varvec{s}}}^{(0)} = \left( {{\varvec{I}} - {\varvec{P}}^{ - 1} {\varvec{Q}}} \right){\varvec{P}}^{ - 1} {\varvec{y}}^{\text{MF}} = \left( {{\varvec{I}} - {\varvec{R}}} \right){\varvec{T}} $$
(2.91)

In addition, by increasing the convergence rate of the algorithm, this initial value also reduces the hardware resource consumption and increases the data throughput.
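As a concrete illustration of Eqs. (2.90) and (2.91), the following Python sketch assembles the WeJi iteration with numpy. It assumes that W is the regularized Gram matrix of the MMSE filter, $$ {\varvec{W}} = {\varvec{H}}^{\text{H}} {\varvec{H}} + \frac{{N_{0} }}{{E_{\text{s}} }}{\varvec{I}} $$, and the choices ω = 0.9 and K = 3 are illustrative rather than values prescribed by the text.

```python
import numpy as np

def weji_detect(H, y, N0, Es, omega=0.9, K=3):
    """Weighted Jacobi (WeJi) detection sketch following Eqs. (2.90)-(2.91).

    Assumes W = H^H H + (N0/Es) I (MMSE filter matrix); omega and K are
    illustrative values."""
    Nt = H.shape[1]
    y_mf = H.conj().T @ y                        # matched-filter output
    W = H.conj().T @ H + (N0 / Es) * np.eye(Nt)
    P_diag = np.diag(W)                          # P: diagonal part of W
    Q = W - np.diag(P_diag)                      # Q: off-diagonal part of W
    R = Q / P_diag[:, None]                      # P^{-1} Q (P is diagonal, so inversion is trivial)
    T = y_mf / P_diag                            # P^{-1} y^MF
    s = T - R @ T                                # initial solution, Eq. (2.91)
    B = (1.0 - omega) * np.eye(Nt) - omega * R   # iteration matrix B
    F = omega * T                                # vector F
    for _ in range(K):
        s = B @ s + F                            # weighted Jacobi update, Eq. (2.90)
    return s
```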

According to Eq. (2.90), when the iteration number K tends to be infinite, the error of the transmitted signal estimated by using the WeJi is [7]
$$ \Delta = {\varvec{s}} - {\hat{\varvec{s}}}^{(K)} \approx {\hat{\varvec{s}}}^{(\infty )} - {\hat{\varvec{s}}}^{(K)} = {\varvec{B}}^{K} \left( {{\varvec{s}} - {\hat{\varvec{s}}}^{(0)} } \right) $$
(2.92)
Here, obviously $$ {\varvec{s}} = {\hat{\varvec{s}}}^{(\infty )} $$. So the convergence rate of WeJi is
$$ R({\varvec{B}}) = - \ln \left( {\mathop {\lim }\limits_{K \to \infty } \left\| {{\varvec{B}}^{K} } \right\|^{{\frac{1}{K}}} } \right) = - \ln \left( {\rho ({\varvec{B}})} \right) $$
(2.93)
where $$ \rho ({\varvec{B}}) $$ is the spectral radius of the matrix B. We can see that the smaller $$ \rho ({\varvec{B}}) $$ is, the faster the algorithm converges. As for the convergence rate of the WeJi, two lemmas and their corresponding proofs are given here.

Lemma 2.4.1

In massive MIMO systems, $$ \rho \left( {{\varvec{B}}_{\text{W}} } \right) \le \omega \rho \left( {{\varvec{B}}_{\text{N}} } \right) $$, where $$ {\varvec{B}}_{\text{W}} $$ and $$ {\varvec{B}}_{\text{N}} $$ are the iteration matrices of the WeJi and the NSA respectively.

Proof

The spectral radius of $$ {\varvec{B}}_{\text{W}} $$ is defined as
$$ \rho \left( {{\varvec{B}}_{\text{W}} } \right) = \rho \left( {\left( {1 - \omega } \right){\varvec{I}} - \omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right) $$
(2.94)
In the WeJi, $$ 0 < \omega < 1 $$ and ω is close to 1. The condition $$ 0 < \omega < 1 $$ implies $$ 0 < 1 - \omega < 1 $$, so Eq. (2.94) can also be written as
$$ \rho \left( {{\varvec{B}}_{\text{W}} } \right) = \rho \left( {\omega {\varvec{P}}^{ - 1} {\varvec{Q}} - \left( {1 - \omega } \right){\varvec{I}}} \right) \le \omega \rho \left( {{\varvec{P}}^{ - 1} {\varvec{Q}}} \right) $$
(2.95)
In the Neumann series approximation, there is
$$ \rho \left( {{\varvec{B}}_{\text{N}} } \right) = \rho \left( {{\varvec{P}}^{ - 1} {\varvec{Q}}} \right) $$
(2.96)
Combined with Eq. (2.95), we can get
$$ \rho \left( {{\varvec{B}}_{\text{W}} } \right) \le \omega \rho \left( {{\varvec{B}}_{\text{N}} } \right) $$
(2.97)
Lemma 2.4.1 shows that the WeJi converges faster than the Neumann series approximation. Without loss of generality, the l2 norm is used to estimate the iteration error, for example
$$ \left\| \Delta \right\|_{ 2} \le \left\| {{\varvec{B}}_{\text{W}}^{K} } \right\|_{\text{F}} \left\| {{\varvec{s}} - {\hat{\varvec{s}}}^{(0)} } \right\|_{ 2} \le \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}}^{K} \left\| {{\varvec{s}} - {\hat{\varvec{s}}}^{(0)} } \right\|_{ 2} $$
(2.98)

According to the above expression (2.98), if $$ \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1 $$ is satisfied, the approximation error of the WeJi exponentially approaches 0 as the number of iterations K increases.

Lemma 2.4.2

In the massive MIMO system, the probability of $$ \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1 $$ satisfies:
$$ {\text{P}}\left\{ {\left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1} \right\} \ge 1 - \omega \sqrt[4]{{\frac{{(N_{\text{r}} + 17)(N_{\text{t}} - 1)N_{\text{r}}^{2} }}{{2N_{\text{r}}^{3} }}}} $$
(2.99)

Proof

We can get the following formula according to the Markov inequality:
$$ P\left\{ {\left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1} \right\} \ge 1 - P\left\{ {\left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} \ge 1} \right\} \ge 1 - E\left( {\left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} } \right) $$
(2.100)
Note that in the WeJi the parameter ω approaches 1, so $$ 1 - \omega $$ is close to 0 and satisfies $$ 0 < 1 - \omega < 1 $$; thus the effect of $$ \left( {1 - \omega } \right){\varvec{I}} $$ can be ignored. The probability in (2.99) then satisfies
$$ P\left\{ {\left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1} \right\} \ge 1 - E\left( {\left\| {\omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} } \right) $$
(2.101)
So the probability of $$ \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} < 1 $$ is related to $$ E\left( {\left\| {\omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} } \right) $$. To evaluate $$ \left\| {{\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} $$, note that the element in the ith row and jth column of the matrix A satisfies
$$ a_{ij} = \left\{ {\begin{array}{*{20}l} {\sum\limits_{t = 1}^{{N_{\text{r}} }} {h_{ti}^{*} h_{tj} } ,} & {i \ne j} \\ {\sum\limits_{m = 1}^{{N_{\text{r}} }} {\left| {h_{mi} } \right|^{2} } + \frac{{N_{0} }}{{E_{\text{s}} }},} & {i = j} \\ \end{array} } \right. $$
(2.102)
As a result, $$ E\left( {\left\| {\omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} } \right) $$ can be expressed as
$$ E\left( {\left\| {\omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} } \right) = E\left( {\sqrt {\sum\limits_{i = 1}^{{N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{N_{\text{t}} }} {\left| {\omega \frac{{a_{ij} }}{{a_{ii} }}} \right|^{2} } } } } \right) = E\left( {\sqrt[4]{{\left( {\sum\limits_{i = 1}^{{N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{N_{\text{t}} }} {\left( {\omega^{2} \frac{{\left| {a_{ij} } \right|^{2} }}{{\left| {a_{ii} } \right|^{2} }}} \right)} } } \right)^{2} }}} \right) $$
(2.103)
Then, applying the Cauchy–Schwarz inequality to Eq. (2.103), we have
$$ E\left( {\left\| {\omega {\varvec{P}}^{ - 1} {\varvec{Q}}} \right\|_{\text{F}} } \right) \le \omega \sqrt[4]{{\sum\limits_{i = 1}^{{N_{\text{t}} }} {\sum\limits_{j = 1,i \ne j}^{{N_{\text{t}} }} {E\left( {\left| {a_{ij} } \right|^{4} } \right) \cdot } } \sum\limits_{i = 1}^{{N_{\text{t}} }} {E\left( {\frac{1}{{\left| {a_{ii} } \right|^{4} }}} \right)} }} $$
(2.104)
Obviously, there are two key terms, $$ E\left( {\left| {a_{ij} } \right|^{4} } \right) $$ and $$ \sum\limits_{i = 1}^{{N_{\text{t}} }} {E\left( {\frac{1}{{\left| {a_{ii} } \right|^{4} }}} \right)} $$. In the massive MIMO system, the diagonal elements of the matrix A are close to $$ N_{\text{r}} $$, so the $$ E\left( {\frac{1}{{\left| {a_{ii} } \right|^{4} }}} \right) $$ term can be approximated as
$$ E\left( {\frac{1}{{\left| {a_{ii} } \right|^{4} }}} \right) = \frac{1}{{N_{\text{r}}^{4} }} $$
(2.105)
Now consider $$ E\left( {\left| {a_{ij} } \right|^{4} } \right) $$. We can express it according to formula (2.102) as follows:
$$ \begin{aligned} E\left( {\left| {a_{ij} } \right|^{4} } \right) & = E\left( {\left| {\sum\limits_{m = 1}^{{N_{\text{r}} }} {h_{mi}^{*} h_{mj} } } \right|^{4} } \right) \\ & = \sum\limits_{q_{1} + q_{2} + \cdots + q_{{N_{\text{r}} }} = N_{\text{r}} } {\left( {\begin{array}{*{20}c} {N_{\text{r}} } \\ {q_{1} ,q_{2} , \cdots ,q_{{N_{\text{r}} }} } \\ \end{array} } \right)E\left( {\prod\limits_{1 \le m \le N_{\text{r}} } {\left( {h_{mi}^{*} h_{mj} } \right)^{{q_{m} }} } } \right)} \\ \end{aligned} $$
(2.106)
Let X and μ satisfy Eqs. (2.107) and (2.108), where $$ \mu_{m} $$ is the average of $$ h_{mi}^{ * } h_{mj} $$.
$$ {\varvec{X}} = \left[ {h_{1i}^{ * } h_{1j} ,h_{2i}^{ * } h_{2j} , \cdots ,h_{{N_{\text{r}} i}}^{ * } h_{{N_{\text{r}} j}} } \right]^{\text{T}} $$
(2.107)
$$ \varvec{\mu}= \left[ {\mu_{1} ,\mu_{2} , \cdots ,\mu_{{N_{\text{r}} }} } \right]^{\text{T}} $$
(2.108)
Therefore, there is
$$ \varphi \left( {h_{1i}^{*} h_{1j} ,h_{2i}^{*} h_{2j} , \cdots ,h_{{N_{\text{r}} i}}^{*} h_{{N_{\text{r}} j}} } \right) = \frac{{{\text{e}}^{{ - \frac{{\left( {{\varvec{X}} - \mu } \right)^{\text{T}} {\varvec{C}}^{ - 1} \left( {{\varvec{X}} - \mu } \right)}}{2}}} }}{{\left( {2\,\uppi} \right)^{{\frac{{N_{\text{r}} }}{2}}} \sqrt {\det {\varvec{C}}} }} $$
(2.109)
The matrix C is the covariance matrix. Note that the elements in the channel matrix H are independent and identically distributed and follow $$ N(0, 1 ) $$, so when $$ m \ne p $$, $$ h_{mi}^{ * } h_{mj} $$ and $$ h_{pi}^{ * } h_{pj} $$ are also independent and identically distributed and obey $$ N(0, 1 ) $$. Therefore, expression (2.109) can be written as follows:
$$ \varphi \left( {h_{1i}^{*} h_{1j} ,h_{2i}^{*} h_{2j} , \cdots ,h_{{N_{\text{r}} i}}^{*} h_{{N_{\text{r}} j}} } \right) = \frac{1}{{\left( {2\,\uppi} \right)^{{\frac{{N_{\text{r}} }}{2}}} }}{\text{e}}^{{ - \frac{{{\varvec{X}}^{\text{T}} {\varvec{X}}}}{2}}} $$
(2.110)
The expression above shows that when $$ m \ne p $$, $$ h_{mi}^{ * } h_{mj} h_{pi}^{ * } h_{pj} $$ obeys $$ N(0, 1 ) $$ and $$ \left( {h_{mi} } \right)^{2} $$ obeys $$ \chi^{2} (1) $$. Therefore, the probability density function of the random variable $$ \left( {h_{mi} } \right)^{2} $$ is
$$ f\left( {h_{mi} ;1} \right) = \left\{ {\begin{array}{*{20}c} {\frac{1}{{2\Gamma \left( {\frac{1}{2}} \right)}}\sqrt {\frac{2}{{h_{mi} }}} {\text{e}}^{{ - \frac{{h_{mi} }}{2}}} } & {h_{mi} > 0} \\ 0 & {h_{mi} < 0} \\ \end{array} } \right. $$
(2.111)
where Γ is the gamma function [23]. We can get $$ E\left( {\left| {h_{mi} } \right|^{2} } \right) = 1 $$ and $$ D\left( {\left| {h_{mi} } \right|^{2} } \right) = 2 $$. When $$ i \ne j $$, since $$ h_{mi}^{ * } $$ and $$ h_{mj} $$ are independent, $$ E\left( {\left| {h_{mi} } \right|^{4} } \right) = D\left( {\left| {h_{mi} } \right|^{2} } \right) + \left[ {E\left( {\left| {h_{mi} } \right|^{2} } \right)} \right]^{2} = 3 $$, $$ E\left( {\left| {h_{mi}^{*} h_{mj} } \right|^{2} } \right) = E\left( {\left| {h_{mi}^{*} } \right|^{2} } \right)E\left( {\left| {h_{mj} } \right|^{2} } \right) = 1 $$, and $$ E\left( {\left| {h_{mi}^{*} h_{mj} } \right|^{4} } \right) = E\left( {\left| {h_{mi}^{*} } \right|^{4} } \right)E\left( {\left| {h_{mj} } \right|^{4} } \right) = 9 $$. After neglecting the zero terms, $$ E\left( {\left| {a_{ij} } \right|^{4} } \right) $$ can be expressed as
$$ E\left( {\left| {a_{ij} } \right|^{4} } \right) = N_{\text{r}} E\left( {\left( {h_{mi}^{*} h_{mj} } \right)^{4} } \right) + \left( {\begin{array}{*{20}c} {N_{\text{r}} } \\ 2 \\ \end{array} } \right)\left( {E\left( {h_{mi}^{*} h_{mj} } \right)^{2} } \right)^{2} = \frac{1}{2}\left( {N_{\text{r}}^{2} + 17N_{\text{r}} } \right) $$
(2.112)

By substituting Eqs. (2.105) and (2.112) into Eqs. (2.101) and (2.104), we can obtain the conclusion in Lemma 2.4.2.

Lemma 2.4.2 demonstrates that when $$ N_{\text{t}} $$ is fixed, the probability of $$ \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} &lt; 1 $$ increases with the increase of $$ N_{\text{r}} $$. Because of $$ N_{\text{r}} \gg N_{\text{t}} $$ in the massive MIMO system, the probability of $$ \left\| {{\varvec{B}}_{\text{W}} } \right\|_{\text{F}} &lt; 1 $$ is close to 1.

2.4.2 Complexity and Frame Error Rate

Some analysis needs to be done for the WeJi. The WeJi first computes the matrices $$ {\varvec{R}}\left( {{\varvec{R = P}}^{ - 1} {\varvec{Q}}} \right) $$ and $$ {\varvec{T}}\left( {{\varvec{T = P}}^{ - 1} {\varvec{y}}^{\text{MF}} } \right) $$. To facilitate the calculation of the WeJi, the initial solution should be obtained as soon as possible, so the vector T and the matrix R should be ready within the allocated time. The initial value $$ {\hat{\varvec{s}}}^{(0)} $$ of the WeJi is then calculated based on Eq. (2.91). It is worth noting that the architecture of the initial solution can be reused in the subsequent iteration block when considering the hardware design. Finally, the iteration toward the final value $$ {\hat{\varvec{s}}}^{(K)} $$ is executed according to Eq. (2.90). In the iteration part, the matrix multiplication is implemented by vector multiplications, and all elements of the vector can be computed in parallel. Because only the diagonal matrix P needs to be inverted, the computational complexity of the complex matrix inversion is greatly reduced. In addition, the weighting parameter reduces the number of iterations while achieving similar performance, which further reduces the computational complexity of the detector. Now we compare the WeJi with recently developed algorithms in terms of computational complexity, parallelism, and hardware realizability. Because both the MMSE and the WeJi need to compute the matrix G and the vector $$ {\varvec{y}}^{\text{MF}} $$, the comparison mainly focuses on the computational complexity of the matrix inversion and the LLR calculation [7, 12, 24, 25]. The computational complexity is estimated by the number of real multiplications required, where each complex multiplication requires four real multiplications. The computational complexity of the first part comes from the products of the diagonal $$ N_{\text{t}} \times N_{\text{t}} $$ matrix P−1 with the $$ N_{\text{t}} \times N_{\text{t}} $$ matrix Q and with the $$ N_{\text{t}} \times 1 $$ vector $$ {\varvec{y}}^{\text{MF}} $$, which require $$ 2N_{\text{t}} \left( {N_{\text{t}} - 1} \right) $$ and $$ 2N_{\text{t}} $$ real multiplications respectively. The computational complexity of the second part comes from the iteration with the matrices B and F, which involves $$ 4N_{\text{t}} $$ real multiplications. The computational complexity of the third part comes from the calculation of the initial solution, and that of the last part comes from the computation of the channel gain, the NPI variance and the LLR. Therefore, the total number of multiplications required by the WeJi is $$ \left( {4K + 4} \right)2N_{\text{t}}^{2} - \left( {4K - 4} \right)2N_{\text{t}} $$. Figure 2.7 compares the numbers of real multiplications of the WeJi and other methods. The WeJi has lower computational complexity than the GAS, SOR, and SSOR methods. When K = 2, the computational complexity of the NSA is relatively low; however, K should generally not be less than 3 in the NSA to ensure detection accuracy, and when K = 3 the computation of the NSA is higher. As a result, the computational advantage of the NSA is negligible.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig7_HTML.png
Fig. 2.7

Comparison of the numbers of the actual multiplications between the WeJi and other algorithms.

© [2018] IEEE. Reprinted, with permission, from Ref. [29]

On the other hand, we need to consider the hardware implementation of the WeJi so that it can be executed as parallel as possible. The solution of the WeJi in Eq. (2.90) can be written as
$$ \hat{s}_{i}^{(K)} = \frac{\omega }{{A_{i,i} }}y_{i}^{\text{MF}} - \frac{\omega }{{A_{i,i} }}\sum\limits_{j \ne i} {A_{i,j} \hat{s}_{j}^{{\left( {K - 1} \right)}} } + \left( {1 - \omega } \right)\hat{s}_{i}^{{\left( {K - 1} \right)}} $$
(2.113)

The computation of $$ \hat{s}_{i}^{\left( K \right)} $$ requires only elements from the previous iteration, so all the elements of $$ {\hat{\varvec{s}}}^{(K)} $$ can be computed in parallel. However, in the GAS, the successive over-relaxation (SOR) method [26] and the symmetric successive over-relaxation (SSOR) method [27], the estimates of the transmitted signals are strongly coupled across the iterative steps. When calculating $$ \hat{s}_{i}^{(K)} $$, we need $$ \hat{s}_{j}^{(K)} \left( {j = 1,2, \cdots ,i - 1} \right) $$ of the Kth iteration and $$ \hat{s}_{j}^{{\left( {K - 1} \right)}} \left( {j = i,i + 1, \cdots ,N_{\text{t}} } \right) $$ of the (K-1)th iteration. This means that the elements cannot be computed in parallel. Therefore, neither the GAS [25] nor the SOR [26] can achieve a high data throughput, and their throughput rates are far lower than that of the WeJi.
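The difference in parallelism can be illustrated with a minimal numpy sketch of one sweep of each method; the function names are illustrative, A and y^MF are assumed to be the MMSE system matrix and the matched-filter vector, the WeJi line follows the elementwise form of Eq. (2.113), and the GAS sweep is the plain textbook Gauss–Seidel update.

```python
import numpy as np

def weji_sweep(A, y_mf, s_prev, omega=0.9):
    """One WeJi sweep per Eq. (2.113): every element depends only on the
    previous iterate, so the whole vector can be updated in parallel."""
    d = np.diag(A)
    off = A @ s_prev - d * s_prev              # sum over j != i of A_ij * s_j^(K-1)
    return (1.0 - omega) * s_prev + omega * (y_mf - off) / d

def gas_sweep(A, y_mf, s_prev):
    """One Gauss-Seidel sweep: element i needs the already-updated elements
    1..i-1 of the current sweep, which forces sequential execution."""
    s = s_prev.copy()
    for i in range(A.shape[0]):
        off = A[i, :] @ s - A[i, i] * s[i]     # mixes new (j < i) and old (j > i) values
        s[i] = (y_mf[i] - off) / A[i, i]
    return s
```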

It should be noted that a detection method based on the Jacobi iteration is also proposed in literature [28]. Compared with this method, the WeJi described in this section achieves better performance in the following three aspects. First, the WeJi is designed with the hardware architecture in mind, that is, the hardware implementation is fully considered in the process of algorithm optimization and improvement. In the algorithm design of the WeJi, the detection accuracy, computational complexity, parallelism, and hardware reusability are all considered. By contrast, literature [28] does not consider hardware implementation issues such as parallelism and hardware reuse. Second, the initial iterative solution of the WeJi in this section is different from that in literature [28]. The initial value in literature [28] is fixed; by contrast, the method described in this section takes the characteristics of the massive MIMO system into account and includes a computational method for the initial solution. According to Eq. (2.91), the initial iterative solution is close to the final result, so the number of iterations can be reduced and the hardware consumption can be minimized. Furthermore, the computation of the initial solution of the WeJi is similar to that of the later iterative steps, so the hardware resources can be reused. Because the computation of the Gram matrix G occupies a large number of clock cycles before the iteration, this reuse of hardware resources does not affect the system throughput. Third, compared with literature [28], the WeJi introduces a weighting factor, as shown in Eq. (2.90), which improves the accuracy of the solution and consequently reduces hardware resource consumption. In addition, the same unit can be reused during the pre-iteration and the iteration to increase unit utilization, and this reuse does not affect the data throughput of the hardware.

Next, we discuss the frame error rate (FER) of the WeJi and other recent signal detection algorithms. The FER performance of the exact matrix inversion (CHD) algorithm is also used for comparison. In the comparison, we consider the 64-QAM modulation scheme, and the channel is assumed to be an independent and identically distributed Rayleigh fading matrix. The LLR output of the detector is used as the soft input of the Viterbi decoding at the receiver. As for 4G and 5G, we have discussed a kind of parallel concatenated convolutional code, the Turbo code [1]. The Turbo scheme currently used in 4G is also an important encoding scheme for 5G and is widely used. Furthermore, these simulation settings are commonly used for many massive MIMO detection algorithms and architectures in 5G communications.

Figure 2.8 shows the FER performance curves of the WeJi [29], NSA [7], RI [14], intra-iterative interference cancelation (IIC) [30], CG [18], GAS [12, 25], OCD [31] and MMSE [8, 24]. In Fig. 2.8a, the algorithms are simulated in a 64 × 8 massive MIMO system with a 1/2 code rate. Figure 2.8b shows the FER performance in a 128 × 8 massive MIMO system with a 1/2 code rate. To demonstrate the advantages of the proposed method at higher code rates, Fig. 2.8c shows the performance of a 128 × 8 massive MIMO system with a 3/4 code rate. These simulation results show that the WeJi achieves near-optimal performance at different MIMO scales and code rates. To achieve the same FER, the SNR required by the WeJi is almost the same as that required by the MMSE, and lower than that required by the OCD, CG, GAS, IIC, RI, and NSA. As shown in Fig. 2.8, the proposed WeJi achieves better FER performance than the existing methods at different MIMO scales.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig8a_HTML.png
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig8b_HTML.png
Fig. 2.8

Performance diagram of various algorithms. a Nr = 64, Nt = 8, 1/2 code rate, b Nr = 128, Nt = 8, 1/2 code rate, c Nr = 128, Nt = 8, 3/4 code rate.

© [2018] IEEE. Reprinted, with permission, from Ref. [29]

2.4.3 Analyses on Channel Model Effects

The previous simulation results are based on the Rayleigh fading channel model. To demonstrate the superiority of the proposed algorithm under a more realistic channel model, Fig. 2.9 shows the effects of large-scale fading and spatial correlation of MIMO channels on the FER performance of different algorithms. The Kronecker channel model [20] is used to evaluate the FER performance because it is more practical than the independent and identically distributed Rayleigh fading channel model. The Kronecker channel model assumes that transmission and reception are separable, and measurements show that the Kronecker model is a good approximation of non-line-of-sight scenarios, so this model is widely used in the literature. In this channel model, the elements of the channel matrix satisfy $$ N\left( {0,d({\varvec{z}}){\varvec{I}}_{B} } \right) $$, where $$ d({\varvec{z}}) $$ is an arbitrary function that captures the channel attenuation, such as shadowing and path loss. Consider the channel attenuation variance $$ d({\varvec{z}}) = \frac{C}{{\left\| {{\varvec{z}} - {\varvec{b}}} \right\|^{\kappa } }} $$, where $$ {\varvec{z}} \in \varvec{R}^{2} $$, $$ {\varvec{b}} \in \varvec{R}^{2} $$, κ and $$ \left\| \cdot \right\| $$ denote the user's location, the base station's location, the path loss exponent and the Euclidean norm respectively. The independent shadow fading C satisfies $$ 10\lg C\sim\,N\left( {0,\sigma_{\text{sf}}^{2} } \right) $$. Combined with the correlation matrices, the Kronecker channel matrix H can be written as
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig9_HTML.png
Fig. 2.9

FER performance of Kronecker channel model.

© [2018] IEEE. Reprinted, with permission, from Ref. [29]

$$ {\varvec{H}} = {\varvec{R}}_{\text{r}}^{{^{{\frac{1}{2}}} }} {\varvec{H}}_{{{\text{i}} . {\text{i}} . {\text{d}} .}} \sqrt {d\left( {\varvec{z}} \right)} {\varvec{R}}_{\text{t}}^{{^{{\frac{1}{2}}} }} $$
(2.114)
$$ {\varvec{H}}_{{{\text{i}} . {\text{i}} . {\text{d}} .}} $$ is a random matrix whose elements are independent and identically distributed and follow the complex Gaussian distribution with zero mean and unit variance. The exponential correlation model is used to generate the correlation matrices. The elements of the correlation matrix Rr can be written as
$$ r_{ij} = \left\{ {\begin{array}{*{20}c} {\xi^{j - i} ,} & {i \le j} \\ {\left( {\xi^{i - j} } \right)^{*} ,} & {i > j} \\ \end{array} } \right. $$
(2.115)
where ξ is the correlation factor between adjacent branches. The users in the same cell are evenly distributed in a hexagon with radius r = 500 m. We assume κ = 3.7, $$ \sigma_{\text{sf}}^{2} = 5 $$ and the transmitting power $$ \rho = \frac{{\gamma^{\kappa } }}{2} $$ in the simulation. The correlation factors are 0.2, 0.5, and 0.7 respectively. Figure 2.9 shows that, to achieve the same FER, the SNR required by the proposed algorithm is smaller than that required by the CG, RI and NSA, which proves that the algorithm maintains its advantages under a realistic channel model.
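For reference, a minimal sketch of how such a Kronecker channel realization can be generated is given below; it follows Eqs. (2.114) and (2.115), uses the Cholesky factor as a convenient matrix square root, and the example values (correlation factor 0.5, a user at 200 m from the base station, unit shadow fading) are assumptions for illustration only.

```python
import numpy as np

def exp_corr(n, xi):
    """Exponential correlation matrix with entries xi^|i-j| (Eq. (2.115), real xi)."""
    idx = np.arange(n)
    return xi ** np.abs(idx[:, None] - idx[None, :])

def kronecker_channel(Nr, Nt, xi_r, xi_t, d_z, rng=None):
    """Kronecker channel sketch per Eq. (2.114): H = Rr^(1/2) H_iid sqrt(d(z)) Rt^(1/2).

    d_z is the large-scale fading gain d(z) = C / ||z - b||^kappa; the Cholesky
    factor serves as the matrix square root. Argument names are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    H_iid = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2.0)
    Rr_half = np.linalg.cholesky(exp_corr(Nr, xi_r))
    Rt_half = np.linalg.cholesky(exp_corr(Nt, xi_t))
    return np.sqrt(d_z) * (Rr_half @ H_iid @ Rt_half)

# Example: 128 x 8 system, correlation 0.5, user 200 m from the base station, kappa = 3.7
H = kronecker_channel(Nr=128, Nt=8, xi_r=0.5, xi_t=0.5, d_z=1.0 / 200.0 ** 3.7)
```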

2.5 Conjugate Gradient Algorithm

2.5.1 Algorithm Design

CG is an iterative algorithm for solving linear matrix equations [9]. Its approximate solution is presented as follows:
$$ {\varvec{x}}_{K + 1} = {\varvec{x}}_{K} + \alpha_{K} {\varvec{p}}_{K} $$
(2.116)
where p is an auxiliary vector, K is the number of iterations, and $$ \alpha_{K} $$ can be calculated in Eq. (2.117)
$$ \alpha_{K} = \frac{{\left( {{\varvec{r}}_{K} ,{\varvec{r}}_{K} } \right)}}{{\left( {{\varvec{Ap}}_{K} ,{\varvec{r}}_{K} } \right)}} $$
(2.117)
Here r is the residual vector, which is expressed by Eq. (2.118)
$$ {\varvec{r}}_{K + 1} = {\varvec{r}}_{K} - \alpha_{K} {\varvec{Ap}}_{K} $$
(2.118)

By using the CG algorithm, we can solve the linear matrix equation, so this method can be applied to massive MIMO signal detection. However, the conventional CG algorithm still has some deficiencies. First, although each row of the matrix can be multiplied and accumulated with the vector in parallel in the matrix-vector multiplication, there is a strong data dependence between the steps of the original algorithm, so the computation must be carried out step by step in the order of the algorithm and the degree of parallelism is not high. Second, the conventional CG algorithm does not specify the initial value of the iteration, and the zero vector is usually used. Obviously, the zero vector is feasible but not optimal. Therefore, finding a better initial iterative value can make the algorithm converge faster and reduce the number of iterations, thus indirectly reducing the computational complexity of the algorithm. The CG algorithm is improved with respect to these two points to obtain better parallelism and convergence; the improved algorithm is named the three-term-recursion conjugate gradient (TCG) algorithm [32].

The conventional CG algorithm is equivalent to the Lanczos orthogonalization algorithm [9], so the residual recursion can be presented as
$$ {\varvec{r}}_{K + 1} = \rho_{K} \left( {{\varvec{r}}_{K} - \gamma_{K} {\varvec{Ar}}_{K} } \right) + \mu_{K} {\varvec{r}}_{K - 1} $$
(2.119)
Here, a polynomial $$ q_{j} $$ can be used such that $$ {\varvec{r}}_{j} = q_{j} \left( {\varvec{A}} \right){\varvec{r}}_{0} $$. When $$ {\varvec{A}} = {\varvec{0}} $$, $$ {\varvec{r}}_{j} = {\varvec{b}} - {\varvec{Ax}}_{j} \equiv {\varvec{b}} $$ and $$ {\varvec{b}} = \rho_{K} {\varvec{b}} + \mu_{K} {\varvec{b}} $$, so we can get $$ \rho_{K} + \mu_{K} = 1 $$ and Eq. (2.120)
$$ {\varvec{r}}_{K + 1} = \rho_{K} \left( {{\varvec{r}}_{K} - \gamma_{K} {\varvec{Ar}}_{K} } \right) + \left( {1 - \rho_{K} } \right){\varvec{r}}_{K - 1} $$
(2.120)
Since $$ {\varvec{r}}_{K - 1} $$, $$ {\varvec{r}}_{K} $$ and $$ {\varvec{r}}_{K + 1} $$ are orthogonal, i.e., $$ \left( {{\varvec{r}}_{K + 1} ,{\varvec{r}}_{K} } \right) = \left( {{\varvec{r}}_{K - 1} ,{\varvec{r}}_{K} } \right) = \left( {{\varvec{r}}_{K - 1} ,{\varvec{r}}_{K + 1} } \right) = {\varvec{0}} $$, We can derive from Eq. (2.120)
$$ \gamma_{K} = \frac{{\left( {{\varvec{r}}_{K} ,{\varvec{r}}_{K} } \right)}}{{\left( {{\varvec{Ar}}_{K} ,{\varvec{r}}_{K} } \right)}} $$
(2.121)
$$ \rho_{K} = \frac{{\left( {{\varvec{r}}_{K - 1} ,{\varvec{r}}_{K - 1} } \right)}}{{\left( {{\varvec{r}}_{K - 1} ,{\varvec{r}}_{K - 1} } \right) + \gamma_{K} \left( {{\varvec{Ar}}_{K} ,{\varvec{r}}_{K - 1} } \right)}} $$
(2.122)
And since $$ \left( {{\varvec{Ar}}_{K} ,{\varvec{r}}_{K - 1} } \right) = \left( {{\varvec{r}}_{K} ,{\varvec{Ar}}_{K - 1} } \right) $$, and $$ {\varvec{Ar}}_{K - 1} = - \frac{{{\varvec{r}}_{K} }}{{\rho_{K - 1}^{{}} \gamma_{K - 1}^{{}} }} + \frac{{{\varvec{r}}_{K - 1} }}{{\gamma_{K - 1}^{{}} }} + \frac{{\left( {1 - \rho_{K - 1} } \right){\varvec{r}}_{K - 1} }}{{\rho_{K - 1}^{{}} \gamma_{K - 1}^{{}} }} $$, we can get
$$ \rho_{K} = \frac{1}{{1 - \frac{{\gamma_{K} }}{{\gamma_{K - 1} }}\frac{{\left( {{\varvec{r}}_{K} ,{\varvec{r}}_{K} } \right)}}{{\left( {{\varvec{r}}_{K - 1} ,{\varvec{r}}_{K - 1} } \right)}}\frac{1}{{\rho_{K - 1} }}}} $$
(2.123)
We can also derive from Eq. (2.120)
$$ {\varvec{x}}_{K + 1} = \rho_{K} \left( {{\varvec{x}}_{K} + \gamma_{K} {\varvec{r}}_{K} } \right) + \left( {1 - \rho_{K} } \right){\varvec{x}}_{K - 1} $$
(2.124)

By applying the above derivation to the massive MIMO system, we obtain the TCG algorithm used for massive MIMO signal detection.

Algorithm 2.2

The three-term-recursion conjugate gradient (TCG) algorithm used for the MMSE detection in the massive MIMO system
../images/478198_1_En_2_Chapter/478198_1_En_2_Figb_HTML.png

We can see from Algorithm 2.2 that no data dependence exists between $$ \left( {\varvec{\eta}_{K} ,{\varvec{z}}_{K} } \right) $$ and $$ \left( {{\varvec{z}}_{K} ,{\varvec{z}}_{K} } \right) $$, or between $$ {\tilde{\varvec{s}}}_{K + 1} = \rho_{K} \left( {{\tilde{\varvec{s}}}_{K} + \gamma_{K} {\varvec{z}}_{K} } \right) + \left( {1 - \rho_{K} } \right){\tilde{\varvec{s}}}_{K - 1} $$ and $$ {\varvec{z}}_{K + 1} = \rho_{K} \left( {{\varvec{z}}_{K} - \gamma_{K} {\varvec{\eta}}_{K} } \right) + \left( {1 - \rho_{K} } \right){\varvec{z}}_{K - 1} $$. These computations can therefore be performed simultaneously, which increases the parallelism of the algorithm. In addition, since the algorithm also contains matrix-vector multiplications, each row of the matrix can still be multiplied and accumulated with the vector in parallel in the hardware design.
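Since Algorithm 2.2 itself is reproduced only as a figure, the following Python sketch reconstructs the TCG recursion directly from Eqs. (2.120)-(2.124), under the assumption that the system matrix is the MMSE matrix W = H^H H + (N0/Es)I and y^MF = H^H y; the variable names (z for the residual, eta = Wz) and the default K = 3 are illustrative.

```python
import numpy as np

def tcg_detect(H, y, N0, Es, K=3, s0=None):
    """Three-term-recursion CG (TCG) sketch based on Eqs. (2.120)-(2.124).

    s0 is an optional initial value, e.g. from the quadrant-based method of
    Sect. 2.5.3; otherwise the zero vector is used."""
    Nt = H.shape[1]
    y_mf = H.conj().T @ y
    W = H.conj().T @ H + (N0 / Es) * np.eye(Nt)
    s = np.zeros(Nt, dtype=complex) if s0 is None else np.asarray(s0, dtype=complex)
    s_prev = np.zeros_like(s)
    z = y_mf - W @ s                   # initial residual
    z_prev = np.zeros_like(z)
    gamma_prev = rho_prev = zz_prev = 1.0
    for k in range(K):
        eta = W @ z                    # the two inner products below have no mutual
        zz = np.vdot(z, z).real        # data dependence and can be computed in parallel
        ez = np.vdot(eta, z).real
        gamma = zz / ez                                                          # Eq. (2.121)
        rho = 1.0 if k == 0 else 1.0 / (1.0 - (gamma / gamma_prev) * (zz / zz_prev) / rho_prev)  # Eq. (2.123)
        s, s_prev = rho * (s + gamma * z) + (1.0 - rho) * s_prev, s              # Eq. (2.124)
        z, z_prev = rho * (z - gamma * eta) + (1.0 - rho) * z_prev, z            # Eq. (2.120)
        gamma_prev, rho_prev, zz_prev = gamma, rho, zz
    return s
```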

2.5.2 Convergence

After discussing the CG and its optimization, it is necessary to study the convergence of the CG. First, the polynomial is defined as shown in (2.125):
$$ C_{K} (t) = { \cos }\left[ {K\arccos \left( t \right)} \right], - 1 \le t \le 1 $$
(2.125)
$$ C_{K + 1} (t) = 2tC_{K} (t) - C_{K - 1} (t),C_{0} (t) = 1,C_{1} (t) = t $$
(2.126)
When the above constraint conditions are extended to the case of $$ \left| t \right| > 1 $$, the polynomial (2.125) becomes
$$ C_{K} (t) = { \cosh }\left[ {K{\text{arcosh}}\left( t \right)} \right], \, \left| t \right| \ge 1 $$
(2.127)
The polynomial can also be expressed as
$$ C_{K} (t) = \frac{1}{2}\left[ {\left( {t + \sqrt {t^{2} - 1} } \right)^{K} + \left( {t + \sqrt {t^{2} - 1} } \right)^{ - K} } \right] \ge \frac{1}{2}\left( {t + \sqrt {t^{2} - 1} } \right)^{K} $$
(2.128)
If $$ \eta = \frac{{\lambda_{\text{min} } }}{{\lambda_{\text{max} } - \lambda_{\text{min} } }} $$ is defined, it can be derived from Eq. (2.127)
$$ \begin{aligned} C_{K} \left( {1 + 2\eta } \right) & \ge \frac{1}{2}\left[ {1 + 2\eta + \sqrt {\left( {1 + 2\eta } \right)^{2} - 1} } \right]^{K} \\ & = \frac{1}{2}\left[ {1 + 2\eta + 2\sqrt {\eta \left( {\eta + 1} \right)} } \right]^{K} \\ \end{aligned} $$
(2.129)
From the above expression (2.129), we can find that $$ 1 + 2\eta + 2\sqrt {\eta \left( {\eta + 1} \right)} $$ satisfies
$$ \begin{aligned} 1 + 2\eta + 2\sqrt {\eta \left( {\eta + 1} \right)} & = \left( {\sqrt \eta + \sqrt {\eta + 1} } \right)^{2} = \frac{{\left( {\sqrt {\lambda_{\hbox{min} } } + \sqrt {\lambda_{\hbox{max} } } } \right)^{2} }}{{\lambda_{\hbox{max} } - \lambda_{\hbox{min} } }} \\ & = \frac{{\sqrt {\lambda_{\hbox{max} } } + \sqrt {\lambda_{\hbox{min} } } }}{{\sqrt {\lambda_{\hbox{max} } } - \sqrt {\lambda_{\hbox{min} } } }} = \frac{\sqrt \kappa + 1}{\sqrt \kappa - 1} \\ \end{aligned} $$
(2.130)
where $$ \kappa = \frac{{\lambda_{\text{max} } }}{{\lambda_{\text{min} } }} $$. If $$ {\varvec{x}}_{*} $$ denotes the exact solution vector, the convergence bound of the CG is obtained as (2.131):
$$ \left\| {{\varvec{x}}_{*} - {\varvec{x}}_{K} } \right\|_{A} \le 2\left( {\frac{\sqrt \kappa - 1}{\sqrt \kappa + 1}} \right)^{K} \left\| {{\varvec{x}}_{*} - {\varvec{x}}_{0} } \right\|_{A} $$
(2.131)

As (2.131) shows, $$ {\varvec{x}}_{*} $$ denotes the exact solution vector and is therefore fixed, so once the initial value of the iteration is determined, $$ 2\left\| {{\varvec{x}}_{*} - {\varvec{x}}_{0} } \right\|_{A} $$ on the right side of (2.131) can be regarded as a constant. As the number of iterations K increases, $$ \left\| {{\varvec{x}}_{*} - {\varvec{x}}_{K} } \right\|_{A} $$ exponentially approaches 0. This indicates that the error of the CG algorithm decreases exponentially with the number of iterations K.
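As a quick numerical reading of the bound (2.131), the sketch below evaluates the factor 2((√κ - 1)/(√κ + 1))^K for an illustrative condition number κ = 4; this value of κ is an assumption for demonstration, not one from the text.

```python
import numpy as np

def cg_error_bound(kappa, K):
    """Right-hand factor of Eq. (2.131): 2 * ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^K."""
    q = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    return 2.0 * q ** K

# For kappa = 4 the bound shrinks by a factor of 3 per iteration, 2 * (1/3)^K:
print([round(cg_error_bound(4.0, K), 4) for K in range(1, 5)])   # [0.6667, 0.2222, 0.0741, 0.0247]
```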

2.5.3 Initial Iteration Value and Search

Iteration requires a corresponding initial value vector in the TCG. As mentioned earlier, finding a good initial value vector is crucial to the convergence rate of the algorithm. When introducing the PCI in Sect. 2.3, we mentioned an eigenvalue-based initial value algorithm, which can also be used in the TCG to make the algorithm converge faster. Besides this, a quadrant-based initial value algorithm is discussed in this section.

As discussed above, if the influence of noise n is ignored in the massive MIMO system, the real part and the imaginary part of the received signal y, the transmitted signal s and the channel matrix H are separated and written in the real number form, and the relationship between them can be expressed as
$$ \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{y}} \right\}} \\ {\text{Im} \left\{ {\varvec{y}} \right\}} \\ \end{array} } \right]_{{2N_{\text{r}} \times 1}} = \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{H}} \right\}} &amp; { - \text{Im} \left\{ {\varvec{H}} \right\}} \\ {\text{Im} \left\{ {\varvec{H}} \right\}} &amp; {\text{Re} \left\{ {\varvec{H}} \right\}} \\ \end{array} } \right]_{{2N_{\text{r}} \times 2N_{\text{t}} }} \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{s}} \right\}} \\ {\text{Im} \left\{ {\varvec{s}} \right\}} \\ \end{array} } \right]_{{2N_{\text{t}} \times 1}} + \left[ {\begin{array}{*{20}c} {\text{Re} \left\{ {\varvec{n}} \right\}} \\ {\text{Im} \left\{ {\varvec{n}} \right\}} \\ \end{array} } \right]_{{2N_{\text{r}} \times 1}} $$
(2.132)
$$ {\varvec{y}}_{\text{R}} = {\varvec{H}}_{\text{R}} {\varvec{s}}_{\text{R}} $$
(2.133)
where the subscript R indicates the expression in the form of real numbers. By using real number form in ZF, the transmitted signal s can be estimated as
$$ {\hat{\varvec{s}}}_{\text{R}} = \left( {{\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{H}}_{\text{R}} } \right)^{ - 1} {\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{y}}_{\text{R}} = \left( {{\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{H}}_{\text{R}} } \right)^{ - 1} {\varvec{y}}_{\text{R}}^{\text{MF}} $$
(2.134)
In the real-valued channel matrix HR, each element is independent and identically distributed and obeys the standard Gaussian distribution, so the matrix $$ {\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{H}}_{\text{R}} $$ can be approximated as a diagonal matrix, and its diagonal elements are all nonnegative since each of them is a sum of squares. If $$ {\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{H}}_{\text{R}} $$ is approximated as a diagonal matrix, then $$ \left( {{\varvec{H}}_{\text{R}}^{\text{H}} {\varvec{H}}_{\text{R}} } \right)^{ - 1} $$ is also a diagonal matrix with nonnegative diagonal elements. Now consider the relationship between the ith element $$ \hat{s}_{{{\text{R}},i}} $$ of the vector $$ {\hat{\varvec{s}}}_{\text{R}} $$ and the ith element $$ y_{R,i}^{\text{MF}} $$ of the vector $$ {\varvec{y}}_{\text{R}}^{\text{MF}} $$. According to Eq. (2.134), we can get
$$ \hat{s}_{R,i} \approx \frac{{y_{R,i}^{\text{MF}} }}{{\sum\limits_{j = 1}^{{N_{\text{r}} }} {\left( {H_{j,i} } \right)^{2} } }} $$
(2.135)
where $$ \frac{1}{{\sum\limits_{j = 1}^{{N_{\text{r}} }} {\left( {H_{j,i} } \right)^{2} } }} $$ is nonnegative, so $$ \hat{s}_{{{\text{R}},i}} $$ and $$ y_{R,i}^{\text{MF}} $$ always have the same sign. By transforming the solution from the real-valued form back to the complex form, we can deduce that $$ \hat{s}_{i} $$ and $$ y_{i}^{MF} $$ lie in the same quadrant. Based on this conclusion, a new initial value for the iteration can be proposed.
Now consider the ith element of the transmitted vector $$ {\hat{\varvec{s}}} $$ in a massive MIMO system with 64-QAM modulation. Since $$ \hat{s}_{i} $$ and $$ y_{i}^{\text{MF}} $$ are in the same quadrant, assuming that $$ y_{i}^{\text{MF}} $$ is in the first quadrant, we set $$ \tilde{s}_{i}^{(0)} = 4 + 4{\text{i}} $$; the coordinate of the hollow circle in Fig. 2.10 is (4, 4). Since $$ \hat{s}_{i} $$ will eventually be located in the first quadrant, and the average distance between the point (4, 4) and all constellation points in the first quadrant is smaller than the average distance between the point (0, 0) and those constellation points, the point (4, 4) is closer to the final solution, so the iteration can converge as quickly as possible. The quadrant-based initial value algorithm (Algorithm 2.3) is obtained based on this principle.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig10_HTML.png
Fig. 2.10

Schematic diagram of the quadrant-based initial value algorithm

Algorithm 2.3

Quadrant-based initial value algorithm
../images/478198_1_En_2_Chapter/478198_1_En_2_Figc_HTML.png
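Since Algorithm 2.3 is shown only as a figure, a minimal sketch of the quadrant-based initial value it describes is given below; the level value 4 follows the 64-QAM example point (4, 4) used in the text, and the function name is illustrative.

```python
import numpy as np

def quadrant_initial_value(y_mf, level=4.0):
    """Quadrant-based initial value sketch (cf. Algorithm 2.3): each entry of
    s^(0) is placed at (+/-level, +/-level) in the same quadrant as the
    corresponding matched-filter output; level = 4 matches the 64-QAM example."""
    y_mf = np.asarray(y_mf)
    return level * (np.sign(y_mf.real) + 1j * np.sign(y_mf.imag))
```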

Taking the output of the quadrant-based initial value algorithm as the initial value of the TCG, the TCG can converge faster. For a given error requirement of massive MIMO signal detection, the number of iterations can then be reduced, which lowers the computational complexity of the algorithm from another angle and makes the TCG even better.

In massive MIMO signal detection, whether the algorithm is linear or nonlinear, it eventually obtains the estimated vector $$ {\tilde{\varvec{s}}} $$ through its computation and then searches, for each element of $$ {\tilde{\varvec{s}}} $$, for the constellation point with the smallest Euclidean distance to that element; these constellation points serve as the final estimate of the transmitted vector. It is worth noting that all modulation modes currently used in MIMO systems are two-dimensional, that is, the constellation points are located in a plane rectangular coordinate system. Therefore, finding the minimum-Euclidean-distance constellation point for each element of $$ {\tilde{\varvec{s}}} $$ amounts to finding the constellation point closest to that element in the plane. Based on this, we can simplify the traditional method, in which the Euclidean distance between the element and every constellation point in the constellation diagram is first computed and then compared. As shown in Fig. 2.11, assume that an element is located at the hollow circle in the diagram. By dividing the constellation diagram into several regions with the dotted lines, the constellation point nearest to the hollow circle can be determined from the region in which the hollow circle lies, that is, the constellation point of that region is exactly the one with the smallest distance to the element. With this analysis, the operation of finding the minimum-Euclidean-distance constellation points for the elements of the estimated vector after the iteration can be simplified accordingly, and a point seeking algorithm based on rounding is obtained. Using Algorithm 2.4 to locate the constellation points, we can quickly find the final result at very low computational complexity.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig11_HTML.png
Fig. 2.11

Schematic diagram of rounding off-based point seeking method

Algorithm 2.4

Rounding off-based point seeking method
../images/478198_1_En_2_Chapter/478198_1_En_2_Figd_HTML.png
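Similarly, a minimal sketch of the rounding-based point seeking idea behind Algorithm 2.4 is given below; it assumes the standard unnormalized 64-QAM grid {±1, ±3, ±5, ±7} per dimension, which is an assumption for illustration rather than a detail stated in the text.

```python
import numpy as np

def round_to_64qam(s_tilde):
    """Rounding-based point seeking sketch (cf. Algorithm 2.4): each coordinate
    is rounded to the nearest odd integer and clipped to the 64-QAM grid, so no
    explicit Euclidean-distance search over the 64 constellation points is needed."""
    def slice_axis(x):
        return np.clip(2.0 * np.round((x - 1.0) / 2.0) + 1.0, -7.0, 7.0)
    s_tilde = np.asarray(s_tilde)
    return slice_axis(s_tilde.real) + 1j * slice_axis(s_tilde.imag)
```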

2.5.4 Complexity and Parallelism

In the TCG algorithm, the first part is a series of initial value calculations, including $$ {\varvec{y}}^{\text{MF}} $$, W and z0. These calculations involve the $$ N_{\text{t}} \times N_{\text{r}} $$ matrix $$ {\varvec{H}}^{\text{H}} $$ multiplied by the $$ N_{\text{r}} \times 1 $$ vector y, the $$ N_{\text{t}} \times N_{\text{r}} $$ matrix $$ {\varvec{H}}^{\text{H}} $$ multiplied by the $$ N_{\text{r}} \times N_{\text{t}} $$ matrix H to get the $$ N_{\text{t}} \times N_{\text{t}} $$ matrix W, and the $$ N_{\text{t}} \times N_{\text{t}} $$ matrix W multiplied by the $$ N_{\text{t}} \times 1 $$ vector $$ {\tilde{\varvec{s}}}_{0} $$. Their computational complexities are $$ 4N_{\text{t}} N_{\text{r}} $$, $$ 2N_{\text{r}} N_{\text{t}}^{ 2} $$ and $$ 4N_{\text{t}}^{ 2} $$ respectively; the complexity of computing the matrix W is only half (that is, $$ 2N_{\text{r}} N_{\text{t}}^{ 2} $$) because W is a symmetric matrix. Since these parameters are computed only once before the iteration, their complexity is counted only once. The second part is the matrix-vector multiplication in the iteration, that is, the $$ N_{\text{t}} \times N_{\text{t}} $$ matrix W multiplied by the $$ N_{\text{t}} \times 1 $$ vector z, whose computational complexity is $$ 4KN_{\text{t}}^{ 2} $$. The third part is the calculation of the two inner products $$ \left( {{\varvec{\eta}}_{K} ,{\varvec{z}}_{K} } \right) $$ and $$ \left( {{\varvec{z}}_{K} ,{\varvec{z}}_{K} } \right) $$, whose complexities are $$ 4KN_{t} $$ and $$ 4KN_{t} $$ respectively. The last part is the update of $$ {\tilde{\varvec{s}}} $$ and z, whose complexities are $$ 12KN_{\text{t}} $$ and $$ 12KN_{\text{t}} $$ respectively. Note that when the number of iterations K is 1, the value of ρ is 1, so Step 13 and Step 14 of the algorithm do not need to be calculated. Therefore, the total complexity of the algorithm is $$ 2N_{\text{r}} N_{\text{t}}^{2} + 4N_{\text{t}} N_{\text{r}} + 4N_{\text{t}}^{ 2} + K\left( {4N_{\text{t}}^{ 2} + 32N_{\text{t}} } \right) - 16N_{\text{t}} $$. Table 2.2 lists the complexity of various algorithms at different numbers of iterations. We can see from the table that the complexity of the TCG algorithm is low. For example, when $$ N_{\text{r}} = 128 $$, $$ N_{\text{t}} = 8 $$ and K = 2, the OCD algorithm has the lowest complexity of 16,416 real multiplications, and the TCG algorithm requires 21,632 real multiplications, which is higher than that of the OCD algorithm, while the complexity of the other algorithms is higher than that of the TCG algorithm.
Table 2.2

Complexity comparison

| Algorithm | K = 2 | K = 3 | K = 4 | K = 5 |
|---|---|---|---|---|
| NSA [7] | 2NrNt² + 6NtNr + 4Nt² + 2Nt | 2NrNt² + 6NtNr + 2Nt³ + 4Nt² | 2NrNt² + 6NtNr + 6Nt³ | 2NrNt² + 6NtNr + 10Nt³ − 6Nt² |
| INS [17] | 2NrNt² + 6NtNr + 10Nt² + 2Nt | 2NrNt² + 6NtNr + 14Nt² + 2Nt | 2NrNt² + 6NtNr + 18Nt² + 2Nt | 2NrNt² + 6NtNr + 22Nt² + 2Nt |
| GAS [12] | 2NrNt² + 6NtNr + 10Nt² − 2Nt | 2NrNt² + 6NtNr + 14Nt² − 6Nt | 2NrNt² + 6NtNr + 18Nt² − 10Nt | 2NrNt² + 6NtNr + 22Nt² − 14Nt |
| CG [16] | 2NrNt² + 6NtNr + 8Nt² + 33Nt | 2NrNt² + 6NtNr + 12Nt² + 49Nt | 2NrNt² + 6NtNr + 16Nt² + 65Nt | 2NrNt² + 6NtNr + 20Nt² + 81Nt |
| CGLS [18] | 24NtNr + 20Nr² + 8Nr + 44Nt | 32NtNr + 28Nr² + 12Nr + 66Nt | 40NtNr + 36Nr² + 16Nr + 88Nt | 48NtNr + 44Nr² + 20Nr + 110Nt |
| OCD [19] | 16NtNr + 4Nt | 24NtNr + 6Nt | 32NtNr + 8Nt | 40NtNr + 10Nt |
| TCG | 2NrNt² + 4NtNr + 12Nt² + 48Nt | 2NrNt² + 4NtNr + 16Nt² + 80Nt | 2NrNt² + 4NtNr + 20Nt² + 112Nt | 2NrNt² + 4NtNr + 24Nt² + 144Nt |
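As a quick sanity check on the closed-form expression derived above and on the K = 2 entries of Table 2.2, the following sketch (a minimal illustration, not part of the original text; the function names are ours) evaluates the TCG and OCD real-multiplication counts for $$ N_{\text{r}} = 128 $$, $$ N_{\text{t}} = 8 $$ and K = 2, reproducing the 21,632 and 16,416 figures quoted in the text.

```python
def tcg_real_mults(Nr, Nt, K):
    # 2NrNt^2 + 4NtNr + 4Nt^2 + K(4Nt^2 + 32Nt) - 16Nt real multiplications
    return 2 * Nr * Nt**2 + 4 * Nt * Nr + 4 * Nt**2 + K * (4 * Nt**2 + 32 * Nt) - 16 * Nt

def ocd_real_mults(Nr, Nt, K):
    # Pattern of the OCD entries in Table 2.2: 8K*NtNr + 2K*Nt
    return 8 * K * Nt * Nr + 2 * K * Nt

print(tcg_real_mults(128, 8, 2))  # 21632
print(ocd_real_mults(128, 8, 2))  # 16416
```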

The parallelism of the TCG algorithm is also very important. According to Steps 2, 3, 6 and 7 of the algorithm, when the matrix is multiplied by the vector, each row of the matrix is multiplied by the vector independently, so all of these multiply-accumulate operations can be carried out simultaneously. In addition to this parallelism within the matrix-vector multiplications, there is no data dependence between Steps 8 and 9 of the algorithm, or between Steps 14 and 15, so these pairs of steps can also be executed simultaneously. The algorithm therefore exhibits parallelism at both levels, and this parallelism between steps gives it a great advantage over other algorithms.
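The row-level parallelism described above can be sketched as follows: every element of the product Wz is an independent dot product between one row of W and z, so the Nt multiply-accumulate chains have no mutual data dependence. The thread pool and the matrix sizes below are purely illustrative assumptions; in a VLSI implementation the rows would instead be mapped onto parallel MAC units.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def row_parallel_matvec(W, z):
    """Compute W @ z row by row. Each output element depends only on one
    row of W and on z, so all rows can be evaluated concurrently."""
    rows = [W[i, :] for i in range(W.shape[0])]
    with ThreadPoolExecutor() as pool:
        out = list(pool.map(lambda r: np.dot(r, z), rows))
    return np.array(out)

# Illustrative sizes only: Nt = 8 user antennas.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
z = rng.standard_normal(8) + 1j * rng.standard_normal(8)
assert np.allclose(row_parallel_matvec(W, z), W @ z)
```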

As analyzed above, the TCG algorithm has significant advantages over other algorithms in complexity and parallelism. In addition, the TCG algorithm optimizes the initial value of the iteration and the final point-finding method, so that the algorithm brings the maximum benefit. In general, complexity and accuracy are contradictory: lower complexity often means lower precision. Therefore, although the above results show the excellent performance of the TCG algorithm in complexity and parallelism, they do not by themselves show that it is a well-rounded algorithm; its accuracy must also be considered.

2.5.5 Symbol Error Rate

Figure 2.12 shows the effect of the initial value on the performance of the CG algorithm in a 128 × 8 massive MIMO system, where the initial value is calculated by the quadrant-division initial value algorithm. Obviously, at the same SNR and the same number of iterations, the CG algorithm with this initial value achieves a lower symbol error rate (SER), and the impact of the initial value algorithm on performance is greatest when the number of iterations is K = 2.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig12_HTML.png
Fig. 2.12

Effect of the initial value algorithm on the SER of the CG algorithm

Figure 2.13 shows the SNR-SER curves of various algorithms for different numbers of iterations in a 128 × 8 massive MIMO system with 64-QAM modulation. It can be clearly seen from the figure that when the number of iterations is K = 2 and the SER is $$ 10^{ - 2} $$, the SNR required by the TCG algorithm is 10.17 dB, while the SNR required by MMSE [8, 24], CG [12] and OCD [31] is around 9.66, 10.10 and 10.11 dB respectively. However, the SNR required by WeJi [29] is about 10.42 dB, and the SNR required by NSA [7] is larger than 20 dB. Figure 2.14 shows the SER of various algorithms in MIMO systems of different sizes. We can see that the SER of the TCG algorithm is obviously lower than that of NSA and WeJi [29] for MIMO systems of different sizes.
../images/478198_1_En_2_Chapter/478198_1_En_2_Fig13_HTML.png
Fig. 2.13

SER curves of various algorithms for different number of iterations in massive MIMO system with 128 * 8

../images/478198_1_En_2_Chapter/478198_1_En_2_Fig14_HTML.png
Fig. 2.14

Influence of different algorithms on SER in MIMO systems with different scales

Therefore, the TCG algorithm has obvious advantages over the NSA, CG and CGLS algorithms: it is better than these three methods in complexity and parallelism, while its SER performance under the same SNR is not inferior to theirs. Compared with the PCI algorithm, the TCG algorithm has lower complexity and better parallelism: besides the parallelism of the matrix-vector multiplications that the PCI also possesses, the TCG algorithm additionally has parallelism between steps. Compared with the OCD algorithm, the TCG algorithm has higher complexity, and the OCD algorithm also has better SER performance. However, the OCD algorithm has an inherent shortcoming: its data dependence is so strong that every intermediate result must be stored in a register and read back for the next calculation, so its parallelism is poor and the operations can only be executed sequentially, step by step.

Compared with the NSA, CG, CGLS, PCI, OCD and other algorithms, the TCG algorithm is not the best in every performance metric, but it achieves a better overall compromise. In practical massive MIMO signal detection, the appropriate algorithm can be selected according to the actual application requirements.