As mentioned in the introduction of Chapter 10, the dependence among high-dimensional measurements is often driven by a few common factors, on which different variables load differently. This leads to the factor model (10.4):
X_j = a_j + b_j^T f + u_j        (11.1)
for each component j =1, … ,p. These factors f can be consistently estimated by using the principal component analysis as shown in the previous chapter.
This chapter applies principal component analysis and factor models to solve several statistics and machine learning problems. These include factor adjustments in high-dimensional model selection and multiple testing, as well as high-dimensional regression using augmented factor models. Principal component analysis is a form of spectral learning. We also discuss other applications of spectral learning, such as matrix completion, community detection and item ranking, and its use for constructing initial values for non-convex optimization problems such as mixture models. Detailed treatments of these topics would require a whole book. Rather, we focus only on some of the methodological power of spectral learning to give the reader an idea of its importance to statistical machine learning. Indeed, the role of PCA in statistical machine learning is very similar to that of regression analysis in statistics.
11.1 Factor-adjusted Regularized Model Selection
Model selection is critical in high-dimensional regression analysis. Numerous methods for model selection have been proposed in the past two decades, including the Lasso, SCAD, the elastic net, and the Dantzig selector, among others. However, these regularization methods work only when the covariates satisfy certain regularity conditions such as the irrepresentable condition (3.38). When covariates admit a factor structure (11.1), such a condition breaks down and model selection consistency cannot be guaranteed.
11.1.1 Importance of factor adjustments
As a simple illustrative example, we simulate 100 random samples of size n = 100 from the p= 250 dimensional sparse regression model
Y = X^T β + ε,        (11.2)
where X ~ N(0, Σ) and Σ is a compound symmetry correlation matrix with equal correlation ρ. We apply Lasso to each sample and record the average model size and the model selection consistency rate. The results are depicted in Figure 11.1. The model selection consistency rate is low for Lasso when ρ ≥ 0.2, due to overestimation of the model size.

Figure 11.1: Comparison of model selection consistency rates by Lasso with (FarmSelect) and without (Lasso) factor adjustments.
The failure of model selection consistency is due to high correlation among the original variables X in p dimensions. If we use the factor model (11.1), the regression model (11.2) can now be written as
Y = α + u^T β + f^T γ + ε,        (11.3)
where α = a^T β and γ = B^T β. If u and f were observable, we would obtain a (p + 1 + K)-dimensional regression problem when γ and α are regarded as free parameters. In this lifted space, the covariates f and u are now weakly correlated (indeed, they are independent under the specific model (11.2)) and we can apply a regularized method to fit model (11.3). Note that the high-dimensional coefficient vector β in (11.3) is the same as that in (11.2) and is hence sparse.
In practice, u and f are not observable. However, they can be consistently estimated by PCA as shown in Section 10.5.4. Regarding the estimated latent factors {f̂_i}_{i=1}^n and the idiosyncratic components {û_i}_{i=1}^n as the covariates, we can now apply a regularized model selection method to the regression problem (11.3). The resulting method is called the Factor-Adjusted Regularized Model selector, or FarmSelect for short. It was introduced by Kneip and Sarda (2011) and expanded and developed by Fan, Ke and Wang (2020). Figure 11.1 shows the effectiveness of FarmSelect: the model selection consistency rate is nearly 100%, regardless of ρ.
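To make the two-step idea concrete, here is a minimal R sketch that mimics the factor adjustment using PCA and glmnet; the one-factor design, the choice K = 1, the sparse coefficient values, and the unpenalized treatment of γ are illustrative assumptions, not the FarmSelect package implementation.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 250; rho <- 0.5; K <- 1
Sigma <- matrix(rho, p, p); diag(Sigma) <- 1        # equi-correlation design
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
beta <- c(rep(2, 5), rep(0, p - 5))                 # sparse truth (illustrative)
y <- X %*% beta + rnorm(n)

# Factor adjustment: PCA on X gives estimated factors and idiosyncratic parts
pc   <- prcomp(X, center = TRUE, scale. = FALSE)
Fhat <- pc$x[, 1:K, drop = FALSE]                   # estimated factor scores (n x K)
Bhat <- pc$rotation[, 1:K, drop = FALSE]            # estimated loadings (p x K)
Uhat <- scale(X, center = TRUE, scale = FALSE) - Fhat %*% t(Bhat)

# Lasso on the raw covariates vs. on the augmented, weakly dependent design
fit_raw  <- cv.glmnet(X, y)
fit_farm <- cv.glmnet(cbind(Uhat, Fhat), y,
                      penalty.factor = c(rep(1, p), rep(0, K)))  # gamma left unpenalized

sel_raw  <- which(coef(fit_raw,  s = "lambda.min")[-1] != 0)
sel_farm <- which(coef(fit_farm, s = "lambda.min")[2:(p + 1)] != 0)  # beta part only
```

Comparing `sel_raw` and `sel_farm` over repeated simulations reproduces the qualitative message of Figure 11.1: the raw Lasso over-selects when ρ is moderate, while the factor-adjusted fit recovers the support much more reliably.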
Finally, we would like to remark that in representation (11.3), the parameter γ = B^T β is not free (B is given by the factor model). How can we ensure that the least-squares problem (11.3) with a free γ yields the solution γ = B^T β? This requires the condition
E(ε f) = 0   and   E(ε u) = 0.        (11.4)
See Exercise 11.1. This is somewhat stronger than the original condition that EXε = 0 and holds easily as X is exogenous to ε.
11.1.2 FarmSelect
The lifting via factor adjustments is used to weaken the dependence among covariates. It is insensitive to overestimation of the number of factors, as the expanded factor model includes the original factor model as a special case and hence the dependence among the new covariates is still weakened. The idea of factor adjustments is more generally applicable to a family of sparse models
(11.5)
whose covariates follow a factor model, where g(·,·) is a known function. This model encompasses the generalized linear models in Chapter 5 and the proportional hazards model in Chapter 6.
Suppose that we fit model (11.5) by using a loss function L(Y_i, X_i^T β), which can be the negative log-likelihood or, more generally, a loss for M-estimation. Instead of regularizing directly via
min_β  n^{-1} Σ_{i=1}^n L(Y_i, X_i^T β) + Σ_{j=1}^p p_λ(|β_j|),        (11.6)
where p_λ(·) is a folded concave penalty such as the Lasso, SCAD or MCP penalty with a regularization parameter λ, the Factor-Adjusted Regularized (or Robust, when so implemented) Model selection (FarmSelect) consists of two steps:
• Factor analysis. Obtain the estimates f̂_i and û_i of the latent factors and idiosyncratic components in model (11.5) via PCA.
• Augmented regularization. Find α, β and γ to minimize
n^{-1} Σ_{i=1}^n L(Y_i, α + û_i^T β + f̂_i^T γ) + Σ_{j=1}^p p_λ(|β_j|).        (11.7)
The resulting estimator is denoted as β̂. The procedure is implemented in the R package FarmSelect. Note that the β in the augmented model (11.7) is the same as that in model (11.5).
As an illustration, we consider the sparse linear model (11.2) again with p = 500 and β = (β_1, …, β_10, 0_{p−10})^T, where {β_j}_{j=1}^{10} are drawn uniformly at random from [2, 5]. Instead of taking the equi-correlation covariance matrix, the correlation structure is now calibrated to that of the excess monthly returns of the constituents of the S&P 500 Index between 1980 and 2012. Figure 11.2 reports the model selection consistency rates for n ranging from 50 to 150. The effectiveness of factor adjustments is evident.

Figure 11.2: FarmSelect is compared with Lasso, SCAD, Elastic Net in terms of selection consistency rates. Taken from Fan, Ke and Wang (2020).
As pointed out in Fan, Ke and Wang (2020), factor adjustments can also be applied to variable screening. This can be done via an application of an independence screening rule to the variables u, or by the conditional sure independence screening rule, conditioned on the variables f. See Chapter 8 for details.
11.1.3 Application to forecasting bond risk premia
As an empirical application, we take the example in Fan, Ke and Wang (2019+) that predicts the U.S. government bond risk premia by a large panel of macroeconomic variables mentioned in Section 1.1.4. The response variable is the monthly U.S. bond risk premium with maturity of 2 to 5 years between January 1980 and December 2015, consisting of 432 data points. The bond risk premium is calculated as the one-year return of a bond with n years to maturity in excess of the one-year bond yield, which serves as the risk-free rate. The covariates are 131 monthly U.S. disaggregated macroeconomic variables in the FRED-MD database1. These macroeconomic variables are driven by economic fundamentals and are highly correlated. To wit, we apply principal component analysis to these macroeconomic variables and obtain the scree plot of the top 20 principal components in Figure 11.3. It shows that the first principal component alone explains more than 60% of the total variance. In addition, the first 5 principal components together explain more than 90% of the total variance.

Figure 11.3: Eigenvalues (dotted line) and proportions of variance (bar) explained by the top 20 principal components for the macroeconomic data. Taken from Fan, Ke and Wang (2020).
We forecast one-month-ahead bond risk premia using a rolling window of size 120 months. Within each window, the risk premia are predicted by a sparse linear regression of dimensionality 131. The performance is evaluated by the out-of-sample R², defined as

R²_oos = 1 − Σ_t (Y_t − Ŷ_t)² / Σ_t (Y_t − Ȳ_t)²,

where Y_t is the response variable at time t, Ŷ_t is the predicted Y_t using the previous 120 months of data, and Ȳ_t is the sample mean of the previous 120 months of responses (Y_{t−120}, …, Y_{t−1}), representing a naive baseline predictor.
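For reference, the out-of-sample R² above is just a ratio of sums of squares; a small R helper, with `y`, `yhat` and `ybar` assumed to hold the realized responses, the rolling-window forecasts and the rolling-window historical means, is

```r
# Out-of-sample R^2: 1 - SSE(model forecast) / SSE(historical-mean benchmark)
oos_r2 <- function(y, yhat, ybar) {
  1 - sum((y - yhat)^2) / sum((y - ybar)^2)
}
```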
We compare FarmSelect with Lasso in terms of prediction and model size. We include also the principal component regression (PCR), which uses f as predictors, in the prediction competition. The FarmSelect is implemented by the R package FarmSelect with default settings. To be specific, the loss function is L1, the number of factors is estimated by the eigen-ratio method and the regularized parameter is selected by multi-fold cross validation. The Lasso method is implemented by the glmnet R package. The PCR method is implemented by the pls package in R. The number of principal components is chosen as 8 as suggested in Ludvigson and Ng (2009).
_______________
1 The FRED-MD is a monthly economic database updated by the Federal Reserve Bank of St. Louis which is publicly available at http://research.stlouisfed.org/econ/mccracken/sel/
Table 11.1: Out of sample R2 and average selected model size by FarmSelect, Lasso, and PCR.

Table 11.1, taken from Fan, Ke and Wang (2019+), reports the results. It is clear that FarmSelect outperforms Lasso and PCR in terms of out-of-sample R², even though Lasso uses many more variables. Note that PCR always uses 8 variables, which is about the same number as FarmSelect selects.
11.1.4 Application to neuroblastoma data
Neuroblastoma is a malignant tumor of neural crest origin that is the most frequent solid extracranial neoplasia in children. It is responsible for about 15% of all pediatric cancer deaths. Oberthuer et al. (2006) analyzed the German Neuroblastoma Trials NB90-NB2004 (diagnosed between 1989 and 2004) and developed a classifier based on gene expressions. For 246 neuroblastoma patients, aged from 0 to 296 months (median 15 months), gene expressions over 10,707 probe sites were measured. We took 3-year event-free survival information of the patients (56 positive and 190 negative) as the binary response and modeled the response as a high dimensional sparse logistic regression model of gene expressions.
First of all, due to co-expressions of genes, the covariates are correlated. This is evidenced by the scree plot in Figure 11.4. The figure and the numerical results here are taken from an earlier version of Fan, Ke and Wang (2020). The scree plot shows the top ten principal components together can explain more than 50% of the total variance. The eigen ratio method suggests a K̂=4 factor model. Therefore, FarmSelect should be advantageous in selecting important genes.
FarmSelect selects a model with 17 genes. In contrast, Lasso, SCAD and Elastic Net (with λ_1 = λ_2 ≡ λ) select much larger models, including 40, 34 and 86 genes, respectively. These methods overfit the model as they ignore the dependence among covariates. The overfitted models make it harder to interpret the molecular mechanisms and biological processes.
To assess the performance of each model selection method, we apply a bootstrap-based out-of-sample prediction as follows. For each replication, we randomly draw 200 observations as the training set and leave the remaining 46 data points as the testing set. We use the training set to select and fit a model. Then, for each observation in the testing set, we use the fitted model to calculate the conditional probabilities and to classify outcomes. We record the selected model size and the correct prediction rate (number of correct predictions/46). Repeating this procedure over 2,000 replications, we report the average model sizes and average correct prediction rates in Table 11.2. It shows that FarmSelect has the smallest average model size and the highest correct prediction rate. Lasso, SCAD and Elastic Net tend to select much larger models and result in lower prediction rates. In Table 11.2 we also report the out-of-sample correct prediction rate using the first 17 variables that enter the solution path of each model selection method. As a result, the correct prediction rates of Lasso, SCAD and Elastic Net decrease further. This indicates that Lasso, SCAD and Elastic Net are likely to select overfitted models.

Figure 11.4: Eigenvalues (dotted line) and proportions of variance (bars) explained by the top 20 principal components for the neuroblastoma data.
Table 11.2: Bootstrapping model selection performance of neuroblastoma data
Bootstrap sample average | Model selection methods
 | FarmSelect | Lasso | SCAD | Elastic Net
Model size | 17.6 | 46.2 | 34.0 | 90.0
Correct prediction rate | 0.813 | 0.807 | 0.809 | 0.790
Prediction performance with first 17 variables entering the solution path
 | FarmSelect | Lasso | SCAD | Elastic Net
Correct prediction rate | 0.813 | 0.733 | 0.764 | 0.705
11.1.5 Asymptotic theory for FarmSelect
To derive the asymptotic theory for FarmSelect, we need to solve the following three problems:
1. Like problem (11.3), the unconstrained minimization of E L(Y, α + uTβ + fTγ) with respect to α, β and γ is obtained at
(11.8)
under model (11.1).
2. Let θ = (α, β^T, γ^T)^T. The error bounds ‖θ̂ − θ*‖_ℓ for ℓ = 1, 2, ∞ and the sign consistency sgn(β̂) = sgn(β*) should be available for the M-estimator (11.7), when {(u_i^T, f_i^T)^T}_{i=1}^n were observable.
3. The estimation errors in {(û_i^T, f̂_i^T)^T}_{i=1}^n are negligible for the penalized M-estimator (11.7).
Similar to condition (11.4), Problem 1 can be resolved by imposing a condition
(11.9)
where ∇L(s, t) = ∂L(s, t)/∂t. See Exercise 11.2. Problem 2 is solved in Fan, Ke and Wang (2020) for strongly convex loss functions under the L1-penalty. Their derivations are related to the results on the error bounds and sign consistency for the L1-penalized M-estimator (Lee, Sun and Taylor, 2015). Problem 3 is handled by restricting the likelihood function to the generalized linear model, along with the consistency of the estimated latent factors and idiosyncratic components in Sections 10.5.3 and 10.5.4. In particular, the irrepresentable conditions are now imposed on u and f, which are much easier to satisfy. We refer to Fan, Ke and Wang (2019+) for more careful statements and technical proofs.
Finally, we would also like to note that in the regression (11.7), only the linear space spanned by {f_i}_{i=1}^n matters. Therefore, we only require consistent estimation of the eigen-space spanned by {f_i}_{i=1}^n, rather than consistency of each individual factor.
11.2 Factor-adjusted Robust Multiple Testing
Sparse linear regression attempts to find a set of variables that jointly explain the response variable. In many high-dimensional problems, we also ask about individual effects, such as which genes are differentially expressed between cell types (e.g. tumor vs. normal control), or which mutual fund managers have positive α-skills, namely, risk-adjusted excess returns. These questions result in high-dimensional multiple testing problems that have been extensively studied in statistics. For an overview, see the monograph by Efron (2012) and the references therein.
Due to co-expression of genes or herding effect in fund management as well as other latent factors, the test statistics are often dependent. This results in a correlated test whose false discovery rate is hard to control (Efron, 2007, 2010; Fan, Han, Gu, 2012). Factor adjustments are a powerful tool for this purpose.
11.2.1 False discovery rate control
Suppose that we have n measurements from the model
X_i = μ + ε_i,   i = 1, …, n.        (11.10)
Here X_i can be, in one case, the p-dimensional vector of logarithms of gene expression ratios between the tumor and normal cells for individual i; in another case, it can be the vector of risk-adjusted mutual fund returns over the ith period. Our interest is to test simultaneously the hypotheses
H_{0j}: μ_j = 0   versus   H_{1j}: μ_j ≠ 0,   for j = 1, …, p.        (11.11)
Let Tj be a generic test statistic for H0j with rejection region |Tj| ≥ z for a given critical value z > 0. Here, we implicitly assumed that the test statistic Tj has the same null distribution; otherwise the critical value z should be zj. The numbers of total discoveries R(z) and false discoveries V(z) are the number of rejections and the number of false rejections, defined as
R(z) = Σ_{j=1}^p I(|T_j| ≥ z)   and   V(z) = Σ_{j∈S_0} I(|T_j| ≥ z).        (11.12)
Note that R(z) is observable, being simply the total number of rejected hypotheses, while V(z) needs to be estimated, as it depends on the unknown set

S_0 = {j : μ_j = 0},

which is also called the set of true nulls. The false discovery proportion and false discovery rate are defined respectively as
FDP(z) = V(z)/R(z)   and   FDR(z) = E[FDP(z)],        (11.13)
with the convention 0/0 = 0. Our goal is to control the false discovery proportion or the false discovery rate. The FDP is more relevant to the data at hand, but the difference between the two is small when the test statistics are independent and p is large. Indeed, by the law of large numbers, under mild conditions,

FDP(z) − FDR(z) → 0.

However, the difference can be substantial when the observed data are dependent, as can be seen in Example 11.1.
Let P_j be the P-value for the jth hypothesis and P_(1) ≤ … ≤ P_(p) be the ordered P-values. The hypotheses (or the corresponding variables) are ranked by their P-values: the smaller the P-value, the more significant the hypothesis. To control the false discovery rate at a prescribed level α, Benjamini and Hochberg (1995) propose to choose
k̂ = max{ j : P_(j) ≤ jα/p }        (11.14)
and reject the null hypotheses H_{0,(j)} for j = 1, …, k̂. They prove that such a procedure controls the FDR at level α under the assumption that the P-values are independent. The procedure can also be implemented by defining the Benjamini-Hochberg adjusted P-values:
P_(j)^{adj} = min_{k ≥ j} min{ p P_(k)/k, 1 },        (11.15)
where the minimum over k ≥ j is taken to maintain monotonicity, and rejecting the null hypotheses with adjusted P-values less than α. The function p.adjust in R implements the procedure.
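For instance, given a vector of P-values, the Benjamini-Hochberg procedure at level α = 0.05 can be carried out with p.adjust; the simulated P-values below are purely illustrative.

```r
set.seed(1)
pvals <- c(runif(900), rbeta(100, 0.2, 5))   # 900 true nulls, 100 small P-values
p_bh  <- p.adjust(pvals, method = "BH")      # Benjamini-Hochberg adjusted P-values (11.15)
rejected <- which(p_bh <= 0.05)              # discoveries at FDR level 0.05
length(rejected)
```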
To better understand the above procedure, let us express the testing procedure in terms of P-values: reject H0j when Pj ≤ t for a given significant level t. Then, the total number of rejections and total number of false discoveries are respectively
r(t) = Σ_{j=1}^p I(P_j ≤ t)   and   v(t) = Σ_{j∈S_0} I(P_j ≤ t),        (11.16)
and the false discovery proportion of the test is FDP(t) = v(t)/r(t). Note that under the null hypothesis, P_j is uniformly distributed. If the test statistics {T_j} are independent, then {I(P_j ≤ t)}_{j∈S_0} are i.i.d. Bernoulli random variables with probability of success t. Thus, v(t) ≈ p_0 t, where p_0 = |S_0| is the number of true nulls. This is unknown, but can be upper bounded by p. Using this bound, we have the estimate
FDP̂(t) = pt/r(t).        (11.17)
The Benjamini-Hochberg method takes t = P_(k̂) with k̂ given by (11.14). Consequently,

FDP̂(P_(k̂)) = p P_(k̂)/r(P_(k̂)) = p P_(k̂)/k̂ ≤ α

by (11.14). This explains why the Benjamini-Hochberg approach controls the FDR at level α.
As mentioned above, estimating π_0 = p_0/p by its upper bound 1 controls the FDR correctly but can be too crude. Let us now briefly describe the method of Storey (2002) for estimating π_0. The histogram of the P-values for the p hypotheses is available to us (e.g. Figure 11.5); it is a mixture of P-values from the true nulls S_0 and from the false nulls S_0^c. Assume the P-values for the false nulls (significant hypotheses) are all small, so that P-values exceeding a given threshold η ∈ (0, 1) are all from the true nulls (e.g. η = 0.5). Under this assumption, π_0(1 − η) should be approximately the proportion of P-values that exceed η. This leads to the estimator
π̂_0 = #{ j : P_j > η } / ((1 − η) p)        (11.18)
for a given η, say η = 0.5. Figure 11.5, taken from Fan, Wang, Zhong and Zhu (2019+), illustrates the idea.

Figure 11.5: Estimation of the proportion of true nulls. The observed P-values (right panel) consist of those from significant variables, which are usually small, and those from insignificant variables, which are uniformly distributed. Assuming the P-values for significant variables are mostly less than η (taken to be 0.5 in this illustration, left panel), the contributions of observed P-values > η are mostly from true nulls, and this yields the natural estimator (11.18), which is the average height of the histogram over P-values > η (red line). Note that the histogram above the red line estimates the distribution of P-values from the significant variables in the left panel.
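A one-line sketch of Storey's estimator (11.18) with η = 0.5, applicable to the same kind of P-value vector simulated above:

```r
# Storey's estimator: fraction of P-values above eta, rescaled by 1/(1 - eta)
estimate_pi0 <- function(pvals, eta = 0.5) {
  mean(pvals > eta) / (1 - eta)
}
# e.g. estimate_pi0(pvals) should be close to the true null proportion 0.9 above
```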
11.2.2 Multiple testing under dependent measurements
Due to co-expressions of genes and latent factors, the gene expressions in the vector Xi are dependent. Similarly, due to the “herding effect” and unadjusted latent factors, the fund returns within the ith period are correlated. This calls for further modeling the dependence in εi in (11.10).
A common technique for modeling dependence is the factor model. Modeling εi by a factor model, by (11.10), we have
X_i = μ + B f_i + u_i        (11.19)
for all i= 1, …, n. In the presence of strong dependence, the FDR and FDP can now be very different, making the interpretation harder. The following example is adapted from Fan and Han (2017).
Example 11.1 To gain insight into how the dependence impacts the number of false discoveries, consider model (11.19) with only one factor f_i ~ N(0, 1), B = ρ1 and u_i ~ N(0, (1 − ρ²)I_p), namely,

X_ij = μ_j + ρ f_i + u_ij,   j = 1, …, p.

Assume in addition that f_i and u_i are independent. The sample-average test statistics for problem (11.11) admit

T_j = √n X̄_j = √n μ_j + ρ W + √(1 − ρ²) U_j,

where W = √n f̄ ~ N(0, 1) is independent of U = √(n/(1 − ρ²)) ū ~ N(0, I_p), so that T_j ~ N(0, 1) under the null hypothesis. Let p_0 = |S_0| be the size of the true nulls. Then, the number of false discoveries is

V(z) = Σ_{j∈S_0} I(|T_j| ≥ z) = Σ_{j∈S_0} I(|U_j + aρW| ≥ az),

where a = (1 − ρ²)^{−1/2}. By using the law of large numbers, conditioning on W, we have as p_0 → ∞ that

p_0^{−1} V(z) → Φ(a(ρW − z)) + Φ(−a(ρW + z)).
Therefore, the number of false discoveries depends on the realization of W. When ρ = 0, p_0^{−1} V(z) ≈ 2Φ(−z). In this case, the FDP and the FDR are approximately the same. To quantify the dependence on the realization of W, let p_0 = 1000, z = 2.236 and ρ = 0.8, so that

p_0^{−1} V(z) ≈ Φ(a(0.8W − 2.236)) + Φ(−a(0.8W + 2.236)),   a = (1 − 0.8²)^{−1/2}.

When W = −3, −2, −1, 0, the values of p_0^{−1} V(z) are approximately 0.608, 0.145, 0.008 and 0, respectively, which depend heavily on the realization of W. This is in contrast with the independence case, in which p_0^{−1} V(z) is always approximately 2Φ(−z) ≈ 0.025.
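The calculation in Example 11.1 can be checked by simulation. The sketch below conditions on a few values of W and uses the reconstructed setup (B = ρ1, u_i ~ N(0, (1 − ρ²)I_p)), which should be read as an assumption about the omitted displays.

```r
set.seed(1)
p0 <- 1000; rho <- 0.8; z <- 2.236
a  <- 1 / sqrt(1 - rho^2)

# Conditional on W, a null statistic is T_j = rho * W + sqrt(1 - rho^2) * U_j, U_j ~ N(0, 1)
for (W in c(-3, -2, -1, 0)) {
  U   <- rnorm(p0)
  Tj  <- rho * W + sqrt(1 - rho^2) * U
  emp <- mean(abs(Tj) >= z)                                    # simulated p0^{-1} V(z)
  thy <- pnorm(a * (rho * W - z)) + pnorm(-a * (rho * W + z))  # limit derived in the example
  cat(sprintf("W = %g: simulated %.3f, theoretical %.3f\n", W, emp, thy))
}
```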
11.2.3 Power of factor adjustments
Assume for a moment that B and fi are known in (11.19). It is then very natural to base the test on the adjusted data
Y_i = X_i − B f_i = μ + u_i.        (11.20)
This has two advantages. First, the factor-adjusted data have weaker dependence, since the idiosyncratic components u_i are assumed to be weakly dependent. This makes the FDP and FDR approximately the same; in other words, it is easier to control type I errors. More importantly, Y_i has a smaller variance than X_i. In other words, the factor adjustments reduce the noise without suppressing the signals. Therefore, the tests based on {Y_i} also have increased power.
To see the above intuition, let us consider the following simulated example.

Figure 11.6: Histograms of the sample means from a synthetic three-factor model without (left panel) and with (right panel) factor adjustments. Factor adjustments clearly decrease the noise in the test statistics.
Example 11.2 We simulate a random sample of size n = 100 from a three-factor model (11.19), in which
with p = 100. We take μ_j = 0.6 for j ≤ p/4 and 0 otherwise. Based on the data, we compute the sample average X̄. This results in a vector of p test statistics, whose histogram is presented in the left panel of Figure 11.6. The bimodality of the distribution is blurred by the large stochastic errors in the estimates. On the other hand, the histogram (right panel of Figure 11.6) of the sample averages based on the factor-adjusted data {X_i − B f_i}_{i=1}^n reveals the bimodality clearly, thanks to the decreased noise in the data. Hence the power of the testing is enhanced. At the same time, the factor-adjusted data are now uncorrelated due to the removal of the common factors. If the errors u were generated from N(0, 3I_p), then the factor-adjusted data would be independent and the FDR and FDP would be approximately the same.
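A sketch of the comparison in Example 11.2 is given below; the loading matrix and error distribution are illustrative assumptions, since the exact specification of the simulated three-factor model appears in the omitted display.

```r
set.seed(1)
n <- 100; p <- 100; K <- 3
mu <- c(rep(0.6, p / 4), rep(0, 3 * p / 4))
B    <- matrix(runif(p * K, -1, 1), p, K)       # illustrative loadings
Fmat <- matrix(rnorm(n * K), n, K)              # latent factors
U    <- matrix(rnorm(n * p), n, p)              # idiosyncratic errors (illustrative)
X    <- sweep(Fmat %*% t(B) + U, 2, mu, "+")    # X_i = mu + B f_i + u_i

raw_means      <- colMeans(X)                   # test statistics without adjustment
adjusted_means <- colMeans(X - Fmat %*% t(B))   # oracle factor-adjusted means

op <- par(mfrow = c(1, 2))
hist(raw_means,      breaks = 20, main = "No factor adjustment")
hist(adjusted_means, breaks = 20, main = "Factor adjusted")
par(op)
```

The right histogram separates the null and non-null means much more clearly, mirroring Figure 11.6.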
We also generated data from t_3(0, I_p) to demonstrate further the power of robustification. This distribution has finite moments of order 3 − δ for any δ > 0, but is not sub-Gaussian. Therefore, we employ a robust estimation of the means via the adaptive Huber estimator in Theorem 3.2. The resulting histograms of the estimated means are depicted in Figure 11.7. They demonstrate that the robust procedure helps further reduce the noise in the estimation: the bimodality is revealed even more clearly.

Figure 11.7: Histograms of the robust estimates of the means from a synthetic three-factor model without (left panel) and with (right panel) factor adjustments. Factor adjustments clearly decrease the noise, and robustification enhances the bimodality further.
11.2.4 FarmTest
The factor-adjusted test essentially applies a multiple testing procedure to the factor-adjusted data Y_i in (11.20). This requires first learning Bf_i via PCA, then applying t-tests to Y_i, and finally controlling the FDR by the Benjamini-Hochberg procedure. In other words, the existing software is fed with the factor-adjusted data {Y_i}_{i=1}^n rather than the original data {X_i}_{i=1}^n.
The Factor-Adjusted Robust Multiple test (FarmTest), introduced by Fan, Ke, Sun and Zhou (2019+), is a robust implementation of the above basic idea. Instead of using the sample mean, we use the adaptive Huber estimator to estimate μ_j for j = 1, …, p. With a robustification parameter τ_j > 0, we consider the following M-estimator of μ_j:

μ̂_j = argmin_θ Σ_{i=1}^n ρ_{τ_j}(X_{ij} − θ),
where ρτ(u) is the Huber loss given in Theorem 3.2. It was then shown in Fan, Ke, Sun, Zhou (2018) that
where bjT is the jth row of B and f̄ =n−1 Σi=1n fi, and
σ_{u,jj} = var(X_{ij}) − ‖b_j‖² = E(X_{ij}²) − μ_j² − ‖b_j‖².        (11.21)
By (11.19), unknown f can be estimated robustly from the regression problem
(11.22)
if {b_j} are given, by regarding the sparse {μ_j}_{j=1}^p as outliers. Thus, given a robust estimate B̂, we obtain a robust estimate f̂ of f by using (11.22) and use the factor-adjusted statistic

T_j = √n (μ̂_j − b̂_j^T f̂) / σ̂_{u,j},

where σ̂_{u,j} is a robust estimator of σ_{u,j}. This test statistic is simply a robust version of the z-test based on the factor-adjusted data {Y_i}.
We expect the factor-adjusted test statistic T_j to be approximately normally distributed. Hence, by the law of large numbers, the number of false discoveries is

V(z) = Σ_{j∈S_0} I(|T_j| ≥ z) ≈ 2 p_0 Φ(−z).
This is due to the fact that the correlation matrix of the idiosyncratic component vector u_i is assumed to be sparse, so that {T_j} are nearly independent. Therefore, by definition (11.13), we have FDP(z) ≈ FDP_A(z), where
FDP_A(z) = 2π_0 p Φ(−z)/R(z),        (11.23)
where π0 = p0/p. Note that FDPA(z) is nearly known, since only π0 is unknown. In many applications, true discoveries are sparse so that π0≈1. It can also be consistently estimated by (11.18).
We now summarize the FarmTest of Fan, Ke, Sun and Zhou (2019+). The inputs include the data {X_i}_{i=1}^n, a generic robust covariance matrix estimator Σ̂ ∈ ℝ^{p×p} computed from the data, a pre-specified level α ∈ (0, 1) for FDP control, the number of factors K, and the robustification parameters γ and {τ_j}_{j=1}^p. Many of these parameters can be simplified in the implementation. For example, K can be estimated by the methods in Section 10.2.3, and overestimating K has little impact on the final outputs. Similarly, the robustification parameters {τ_j} can be selected via five-fold cross-validation with their theoretically optimal orders taken into account.
FarmTest consists of the following three steps.
a) Compute the eigen-decomposition of Σ̂, set {λ̂_j}_{j=1}^K to be its top K eigenvalues in descending order, and {v̂_j}_{j=1}^K to be the corresponding eigenvectors. Let B̂ = (λ̃_1^{1/2} v̂_1, …, λ̃_K^{1/2} v̂_K) ∈ ℝ^{p×K}, where λ̃_j = max{λ̂_j, 0}, and denote its rows by {b̂_j^T}_{j=1}^p.
b) Let x̄_j = n^{-1} Σ_{i=1}^n X_{ij} for j = 1, …, p and f̂ = argmin_{f∈ℝ^K} Σ_{j=1}^p ℓ_γ(x̄_j − b̂_j^T f), which is an implementation of (11.22) regarding the sparse {μ_j} as “outliers”. Construct the factor-adjusted test statistics
T_j = √(n/σ̂_{u,jj}) (μ̂_j − b̂_j^T f̂),   j = 1, …, p,        (11.24)
where σ̂_{u,jj} = θ̂_j − μ̂_j² − ‖b̂_j‖² and θ̂_j = argmin_θ Σ_{i=1}^n ℓ_{τ_j}(X_{ij}² − θ) is a robust estimate of E X_{ij}². This estimate of the variance is based on (11.21).
c) Calculate the critical value z_α = inf{z ≥ 0 : FDP_A(z) ≤ α}, where FDP_A(z) = 2π̂_0 p Φ(−z)/R(z) (see (11.23)), and reject H_{0j} whenever |T_j| > z_α.
Step c) of FarmTest is similar to controlling the FDR by the Benjamini-Hochberg procedure, as noted in Section 11.2.1. Even though many heuristics are used in the introduction of the FarmTest procedure, its validity has been rigorously proved by Fan, Ke, Sun and Zhou (2019+). The proofs are sophisticated and we refer readers to the paper for details. The R package FarmTest implements the above procedure.
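A simplified, non-robust sketch of steps a)–c) is given below, replacing the Huber estimators with sample means and least squares and using Storey's π̂_0; it illustrates the logic only and is not the FarmTest package implementation.

```r
farm_test_sketch <- function(X, K, alpha = 0.05) {
  n <- nrow(X); p <- ncol(X)
  S   <- cov(X)                                 # sample covariance (robust versions in FarmTest)
  eig <- eigen(S, symmetric = TRUE)
  lam <- pmax(eig$values[1:K], 0)
  B   <- eig$vectors[, 1:K, drop = FALSE] %*% diag(sqrt(lam), K)   # loadings, step a)

  xbar   <- colMeans(X)
  fbar   <- qr.solve(B, xbar)                   # least-squares fit of xbar on rows of B, step b)
  mu_hat <- xbar                                # non-robust mean estimates
  su2    <- pmax(apply(X, 2, var) - rowSums(B^2), 1e-8)   # sigma_{u,jj} = var - ||b_j||^2
  Tj     <- sqrt(n / su2) * (mu_hat - as.vector(B %*% fbar))       # adjusted statistics

  pvals <- 2 * pnorm(-abs(Tj))
  pi0   <- min(1, mean(pvals > 0.5) / 0.5)      # Storey's estimate of pi_0
  fdp_a <- function(z) 2 * pi0 * p * pnorm(-z) / max(sum(abs(Tj) >= z), 1)
  zgrid <- seq(0, 5, by = 0.01)
  z_alpha <- zgrid[which(sapply(zgrid, fdp_a) <= alpha)[1]]        # critical value, step c)
  list(statistics = Tj, z_alpha = z_alpha, rejected = which(abs(Tj) >= z_alpha))
}
```

Replacing `colMeans`, `cov` and the least-squares step with their adaptive Huber counterparts gives the robust procedure described above.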
In the implementation, Fan, Ke, Sun and Zhou (2019+) propose to use the robust U-covariance (9.28) or the elementwise robust covariance (9.25) as the robust covariance matrix estimator Σ̂. They provide extensive simulations to demonstrate the advantages of FarmTest. They show that FarmTest controls the false discovery rate accurately and has higher power, with lower missed discovery rates, in comparison with the methods without factor adjustments. They also show convincingly that when the error distribution of u_i has light tails, FarmTest performs about the same as its non-robust implementation, whereas when u_i has heavy tails, FarmTest outperforms it.
11.2.5 Application to neuroblastoma data
We now revisit the neuroblastoma data used in Section 11.1.4. The figures and the results are taken from Fan, Ke, Sun and Zhou (2019+). Instead of examining the joint effect via the sparse logistic regression, we now investigate the marginal effect: which genes expressed statistically differently in the neuroblastoma from the normal controls. The dependence of gene expression has been demonstrated in Figure 11.4. Indeed, the top 4 principal components (PCs) explain 42.6% and 33.3% of the total variance for the positive and negative groups, respectively.
Another stylized feature of high-dimensional data is that many variables have heavy tails. To wit, Figure 11.8 depicts the histograms of the excess kurtosis of the gene expressions for both the positive (event-free survival) and negative groups. For the positive group, the left panel of Figure 11.8 shows that 6518 gene expressions have positive excess kurtosis and 420 of them have excess kurtosis greater than 6. In other words, about 4% of the gene expressions are severely heavy tailed, as their tails are fatter than those of the t-distribution with 5 degrees of freedom. Similarly, in the negative group, 9341 gene expressions exhibit positive excess kurtosis, with 671 of them having excess kurtosis greater than 6. Such a heavy-tailed feature indicates the necessity of using robust methods to estimate the mean and covariance of the data.

Figure 11.8: Histograms of the excess kurtoses of the gene expressions in the positive and negative groups of the neuroblastoma data. The vertical dashed line shows the excess kurtosis (equal to 6) of the t_5-distribution.

Figure 11.9: Correlations among the first 100 genes before and after factor adjustments. The blue and red pixels represent the pairs of gene expressions whose correlations are greater than 1/3 or less than −1/3, respectively.
The effectiveness of the factor adjustments is evidenced in Figure 11.9. There, for each group, we plot the correlation matrices of the first 100 gene expressions before and after adjusting for the top 4 PCs. It shows convincingly that the correlations are significantly weakened after the factor adjustments. More specifically, the number of pairs of genes with absolute correlation bigger than 1/3 drops from 1452 to 666 for the positive group and from 848 to 414 for the negative group.
The two-sample version of FarmTest uses the following two-sample t-type statistics:
T_j = (μ̂_{1j} − b̂_{1j}^T f̂_1 − μ̂_{2j} + b̂_{2j}^T f̂_2) / √(σ̂_{1u,jj}/n_1 + σ̂_{2u,jj}/n_2),        (11.25)
where the subscripts 1 and 2 correspond to the positive and negative groups, respectively. Specifically, μ̂1j and μ̂2j are the robust mean estimators, and b̂1j b̂2j, f̂1 and f̂2 are robust estimators of the factors and loadings based on a robust covariance estimator. In addition, σ̂ 1u,jj and σ̂ 2u,jj are the robust variance estimators defined in (11.24).
We apply four tests to the neuroblastoma data set: FarmTest with covariance inputs (9.25) (denoted by FARM-H) and (9.28) (denoted by FARM-U), the non-robust factor-adjusted version using the sample mean and sample variance (denoted by FAM), and the naive method without factor adjustments or robustification. At level α = 0.01, the FARM-H and FARM-U methods identify, respectively, 3912 and 3855 probes with differential gene expressions, among which 3762 probes are identical. This shows an approximately 97% overlap between the two methods. In contrast, FAM and the naive method discover 3509 and 3236 probes, respectively. Clearly, the more powerful methods lead to more discoveries.
11.3 Factor Augmented Regression Methods
Principal component regression (PCR) is very useful for dimensionality reduction in high-dimensional regression problems. Factor models provide a new perspective for understanding why such a method is useful for statistical modeling. Such factor models are usually augmented by other covariates, leading to augmented factor models. This section introduces the Factor Augmented Regression Method for Prediction (FarmPredict).
11.3.1 Principal component regression
Principal component analysis is very frequently used in compressing high-dimensional variables into a couple of principal components. These principal components can also be used for regression analysis, achieving the dimensionality reduction.
Why are the principal components, rather than other components, used as regressors? Factor models offer a new modeling perspective. The basic assumption is that there are latent factors that drive the independent variables and the response variables simultaneously, as schematically demonstrated in Figure 11.10. Therefore, we extract the latent factors from the independent variables and use the estimated latent factors as predictors for the response variables. The latent factors are frequently estimated by principal components. Since principal components are used as the predictors in the regression for Y, the method is called principal component regression (PCR).
To understand the last paragraph better, suppose that we have data {(X_i, Y_i)}_{i=1}^n and that X_i and Y_i are both driven by the latent factors via
X_i = a + B f_i + u_i   and   Y_i = g(f_i) + ε_i,        (11.26)

Figure 11.10: Illustration of principal component regression. Latent factors f_1, …, f_K drive both the input variables x_1, …, x_N and the output variables y_1, …, y_q at time or subject t. The analysis is to extract these latent factors via PCA and use the estimated factors f̂_1, …, f̂_K as predictors, leading to the principal component regression. An example of PCR is to construct multiple indices {φ_j^T f_t} of factors to predict Y; see (11.34).
where g(·) is a regression function. The model is also applicable in time series prediction, in which i indexes time t. Under model (11.26), we use data {Xi}i=1n to estimate {fi}i=1n via PCA, resulting in {f̂i}i=1n. We then fit the regression
Y_i = g(f̂_i) + ε_i.        (11.27)
PCR corresponds to multiple regression in model (11.27):
Y_i = α + γ^T f̂_i + ε_i.        (11.28)
Due to the identifiability constraints of latent factors, we often normalize {fi}i=1n to have sample mean zero and sample variance identity. Under such a condition, the least-squares estimate admits the sample formula:
α̂ = Ȳ   and   γ̂_j = n^{-1} Σ_{i=1}^n f̂_{ij} Y_i,   j = 1, …, K,        (11.29)
where f̂ij is the jth component of fi, and K is the number of factors. Paul, Bair, Hastie and Tibshirani (2008) employed “preconditioning” (correlation screening) to reduce the dimensionality in learning latent factors.
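A compact R sketch of principal component regression with prcomp and lm is given below; the number of factors K is assumed known, and the helper names are illustrative.

```r
# Principal component regression: extract K PCs from X, then regress y on them
pcr_fit <- function(X, y, K) {
  pc <- prcomp(X, center = TRUE, scale. = FALSE)
  scores <- pc$x[, 1:K, drop = FALSE]                  # estimated factors
  fit <- lm(y ~ scores)                                # multiple regression as in (11.28)
  list(pca = pc, fit = fit, K = K)
}

pcr_predict <- function(obj, Xnew) {
  scores_new <- predict(obj$pca, newdata = Xnew)[, 1:obj$K, drop = FALSE]
  cbind(1, scores_new) %*% coef(obj$fit)               # intercept plus PC coefficients
}
```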
PCR is traditionally motivated differently from the above. For simplicity, we assume that both the covariates and the response have mean zero so that we do not need to deal with the intercept. Write the linear model in matrix form as

Y = Xβ + ε.
Let X = UΔV^T, where Δ = diag(δ_1, …, δ_p) is the diagonal matrix consisting of the singular values, and U = (u_1, …, u_p) and V = (v_1, …, v_p) are the n × p and p × p orthonormal matrices consisting of the left and right singular vectors of X. Then, the principal components based on the sample covariance matrix n^{-1}X^T X are simply

Xv_j = δ_j u_j,   j = 1, …, p,

where v_j is the jth principal direction. Then, PCR can be regarded as a constrained least-squares problem:
min_β ‖Y − Xβ‖²   subject to   v_j^T β = 0 for j = K + 1, …, p.        (11.30)
The constraint can be regarded as regularization to reduce the complexity of β.
PCR also admits the optimality in the following sense. Let L be the p × (p − K) full rank matrix. Consider the generalization of the constrained least-squares problem (11.30)
min_β ‖Y − Xβ‖²   subject to   L^T β = 0.        (11.31)
Let β̂_L be the solution and MSE(L) be the mean squared error for estimating the true parameter vector β*. The choice of L that minimizes the MSE is the one given by (11.30), namely PCR (Park, 1981).
11.3.2 Augmented principal component regression
In many applications, we have more than one type of data. For example, for the macroeconomic data given in Section 11.1.3, in addition to the 131 disaggregated macroeconomic variables, we have 8 aggregated macroeconomic variables; see Section 11.3.3 for additional details. Similarly, for the neuroblastoma data in Section 11.1.4, in addition to the microarray gene expression data, we have demographic and clinical information such as age, weight, and socioeconomic status, among others. Clearly these two types of variables cannot be aggregated via PCA.
Assume that we have additional covariates W_i so that the observed data are now {(W_i, X_i, Y_i)}_{i=1}^n. We assume that W_i is low-dimensional; otherwise, we would use the estimated common factors from W_i. The augmented factor model extracts the latent factors {f_i} from {X_i} and runs the regression using both the estimated latent factors and the augmented variables:
Y_i = g(f̂_i, W_i) + ε_i.        (11.32)
We will refer to this class of prediction methods as the Factor-Augmented Regression Model for Prediction, or FarmPredict for short.
The regression problem in (11.32) can be a multiple regression model:
Y_i = α + γ^T f̂_i + δ^T W_i + ε_i.        (11.33)
This results in the augmented principal component regression. The regression function in (11.32) can also be estimated by using nonparametric function estimation techniques such as the kernel and spline methods in Chapter 2 or the deep neural network models in Chapter 14. It can also be estimated by using a multi-index model, as schematically shown in Figure 11.10. There, we attempt to extract several linear combinations of the estimated factors, also called diffusion indices by Stock and Watson (2002b), to predict the response Y. See Fan, Xue and Yao (2017) for a study of this approach, which imposes the model
Y_i = g(φ_1^T X*_i, …, φ_L^T X*_i) + ε_i,        (11.34)
where X*i = (fTi, WTi)T is now the expanded predictors. In particular, when there is no Wi component, the model reduces to the case depicted in Figure 11.10.
Since the study of the multi-index model by Li (1991, 1992), there are a lot of developments in this area on the estimation of the diffusion indices. See Cook (2007, 2009) and the references therein.
11.3.3 Application to forecast bond risk premia
We now revisit the macroeconomic data presented in Section 11.1.3. In addition to the 131 disaggregated macroeconomic variables, we have 8 aggregated macroeconomic variables:
W1 = Linear combination of five forward rates;
W2 = GDP;
W3= Real category development index (CDI);
W4 = Non-agriculture employment;
W5 = Real industrial production;
W6 = Real manufacturing and trade sales;
W7= Real personal income less transfer;
W8 = Consumer price index (CPI)
These variables {Wt} can be combined with the 5 principal components {ft} extracted from the 131 macroeconomic variables to predict the bond risk premia. This yields the PCR based on covariates ft and the augmented PCR based on (WtT, ftT). The out-of-sample R2 for these two models is presented in the last block of Table 11.3 under the heading of PCA. The augmented PCR improves somewhat the out-of-sample R2, but not by a lot.
Another use of the augmented variables W_t is to blend them in when extracting the latent factors via the projected PCA introduced in Section 10.4, which can be implemented robustly (labeled PPCA). After extracting the latent factors, we apply the multi-index model (11.34) to predict the bond risk premia. The diffusion index coefficients {φ_j}_{j=1}^L are estimated via sliced inverse regression (Li, 1991), with the number of diffusion indices L chosen by the eigen-ratio method. The choice is usually L = 2 or 3. The nonparametric function g in (11.34) is fitted by using the additive model g(x_1, …, x_L) = g_1(x_1) + … + g_L(x_L) via the local linear fit. The prediction results are shown in the first block of Table 11.3. They show that extracting latent factors via projected PCA is far more effective than the augmented PCR. We also implemented a non-robust version of PPCA and obtained similar but somewhat weaker results than those of its robust implementation.
Table 11.3: Out-of-sample R2 (%) for linear prediction of bond risk premia. Adapted from Fan, Ke, and Liao (2018).

Table 11.4: Out-of-sample R2 (%) by the multi-index model for the bond risk premia. Adapted from Fan, Ke, and Liao (2018).

Fan, Ke, and Liao (2018) also applied the multi-index model to the PCA extracted factors along with the augmented variables. The out-of-sample R2 is presented in the second block of Table 11.4. It is substantially better than the linear predictor but is not as effective as the factors extracted by PPCA.
11.4 Applications to Statistical Machine Learning
Principal component analysis has been widely used in statistical machine learning and applied mathematics. The applications range from dimensionality reduction, such as factor analysis, manifold learning, and multidimensional scaling, to clustering analysis, such as image segmentation and community detection. Other applications include item ranking, matrix completion, Z_2-synchronization, blind deconvolution, and initialization for nonconvex optimization problems such as mixture models and phase retrieval. The role of PCA in statistical machine learning is very similar to that of regression analysis in statistics. It would require a whole book to give a comprehensive account of this subject. Rather, we only highlight some applications to give readers an idea of how PCA can be used to solve seemingly complicated problems in statistical machine learning and applied mathematics.
In the applications so far, PCA has mainly been applied to the sample covariance matrix or robust estimates of the covariance matrix to extract latent factors. In the applications below, PCA is applied to a class of Wigner matrices, which are random symmetric (or Hermitian) matrices with independent elements in the upper triangle. It is often assumed that the random elements of Wigner matrices have mean zero and finite second moments. In contrast, the elements of the sample covariance matrix are dependent. Factor modeling and principal components are closely related, and in this sense PCA is a form of spectral learning.
11.4.1 Community detection
Community detection is basically a clustering problem based on network data. It has diverse applications in social networks, citation networks, genomic networks, and economic and financial networks, among others. The observed network data are represented by a graph in which nodes represent members of the communities and edges indicate the links between two members. The data can be summarized by an adjacency matrix A = (a_ij), where the element a_ij = 1 or 0 according to whether or not there is a link between the ith and jth nodes.
The simplest probabilistic model is probably the stochastic block model, introduced in Holland and Leinhardt (1981), Holland, Laskey, and Leinhardt (1983) and Wang and Wong (1987). For an overview of recent developments, see Abbe (2018).
Definition 11.1 Suppose that we have n vertices, labeled {1, …, n}, that can be partitioned into K disjoint communities C1, …, CK. For two vertices k ∈ Ci and l ∈ Cj, a stochastic block model assumes that they are connected with probability pij, independent of other connections. In other words, the adjacency matrix A (Akl) is regarded as a realization of a sequence of independent Bernoulli random variables with
P(A_kl = 1) = p_ij   for k ∈ C_i and l ∈ C_j.        (11.35)
The K × K symmetric matrix P = (pi,j) is referred to as the edge probability matrix.
In the above stochastic block model, the edge probability pij describes the connectivity between the two communities Ci and Cj. Within a community Ci, the connection probability is pii. The simplest model is the one in which pij = p for all i and j, and is referred to as the Erdös-Rényi graph or model (Erdös and Rényi, 1960). In other words, the partition is irrelevant, corresponding to a degenerate case, resulting in a random graph. Denote such a model as G(n, p).

Figure 11.11: Simulated network data from (a) the Erdös-Rényi graph and (b) the stochastic block model with n = 100, p = 5 log(n)/n, and q = p/4. For the SBM, there are two communities, with the first n/2 members from group 1 and the second n/2 members from group 2; they are hard to differentiate visually because p and q are quite close.
A planted partition model corresponds to the case where p_ii = p for all i and p_ij = q for all i ≠ j, with p ≠ q. In other words, the probability of a within-community connection is p and the probability of a between-community connection is q. When K = 2, this corresponds to the two-community case. Figure 11.11 gives a realization from the Erdös-Rényi model and one from the stochastic block model. The simulations and plots are generated from the functions sample_gnp and sample_sbm in the R package igraph [install.packages("igraph"); library(igraph)].
Given an observed adjacency matrix A = (a_ij), the statistical tasks include detecting whether the graph has a latent structure and recovering the latent partition of the communities. We naturally appeal to the maximum likelihood method. Let C(i) be the unknown community to which node i belongs. Then the likelihood of observing {a_ij}_{i>j} is simply the product of the Bernoulli likelihoods:

∏_{i>j} p_{C(i)C(j)}^{a_ij} (1 − p_{C(i)C(j)})^{1−a_ij},

where the unknown parameters are {C(i)}_{i=1}^n and P = (p_ij) ∈ ℝ^{K×K}. Maximizing it is an NP-hard computational problem. Therefore, various relaxed algorithms, including semidefinite programs and spectral methods, have been proposed; see Abbe (2018).
The spectral method is based on the method of moments. Let Γ be the n × K membership matrix whose ith row indicates the membership of node i: it has the element 1 at location C(i) and zeros elsewhere. The matrix Γ thus classifies each node into a community. Then, it is easy to see that E A_ij = p_{C(i)C(j)} for i ≠ j and
E A = Γ P Γ^T        (11.36)
except the diagonal elements, which are zero on the left (assuming no self-loop).2 It follows immediately that the membership matrix Γ falls in the eigen-space spanned by the first K eigenvectors of E A. Indeed, the column space of Γ is the same as the eigen-space spanned by the top K eigenvalues of E A, if P is not degenerate. See Exercise 11.9.
The above discussion reveals that the column space of Γ can be estimated via its empirical counterpart, namely the eigen-space spanned by the top K eigenvalues (in absolute value) of A. To find a consistent estimate of Γ, an unknown rotation is needed. This can be challenging to find. Instead of rotating, the second step is typically replaced by a clustering algorithm such as the K-means algorithm (see Section 13.1.1). This is due to the fact that if two members are in the same community, their corresponding rows of the eigen-vectors associated with the top K eigenvalues are the same, by using simple permutation arguments. See Example 11.3. Therefore, there are K distinct rows, which can be found by using the K-means algorithm. Spectral clustering is also discussed in Chapter 13. We summarize the algorithm for spectral clustering as in Algorithm 11.1.
Example 11.3 As a specific example, let us consider the planted partition model with two communities. For simplicity, we assume that we have two groups of equal size n/2, the first group with index set J and the second with index set J^c. Then, the mean adjacency matrix (ignoring the diagonal)

E A = [ p 1_{n/2} 1_{n/2}^T   q 1_{n/2} 1_{n/2}^T ;  q 1_{n/2} 1_{n/2}^T   p 1_{n/2} 1_{n/2}^T ]
_______________
2 More rigorously, E A = ΓPΓT − diag(ΓPΓT). Since || diag(ΓPΓT) || ≤ maxk∈[K] pkk, this diagonal matrix is negligible in the analysis of the top K eigenvalues of E A under some mild conditions. Putting heuristically, EA ≈ ΓPΓT. We will ignore this issue throughout this section.
Algorithm 11.1 Spectral clustering for community detection
1. From the adjacency matrix A, get the n × K matrix Γ̂, consisting of the eigenvectors corresponding to the top K eigenvalues in absolute value;
2. Treat each of the n rows of Γ̂ as a data point and run a clustering algorithm such as K-means to group the n points into K clusters. This assigns the n members to K communities. (Normalizing each row to the unit norm is recommended and is needed in more general situations.)
has two non-vanishing eigenvalues. They are λ_1 = n(p + q)/2 and λ_2 = n(p − q)/2,
with corresponding eigenvectors

u_1 = 1_n/√n   and   u_2 = (1_J − 1_{J^c})/√n.
In this case, the sign of the second eigenvector identifies memberships. Our spectral method is particularly simple. Get the second eigenvector u2 from the observed adjacency matrix A and use sgn(u2) to classify each membership. Abbe, Fan, Wang and Zhong (2019+) show such a simple method is optimal in exact and nearly exact recovery cases.
We now apply the spectral method in Example 11.3 to the simulated data given in Figure 11.11. The results are depicted in Figure 11.12. From the screeplot in the left panel of Figure 11.12, we can see that the data from the Erdös-Rényi model has only one distinguished eigenvalue, whereas the data from the SBM model has two distinguished eigenvalues. This corresponds to the fact that E A is rank one for the Erdös-Rényi model and rank two for the SBM model. The right panel of Figure 11.12 demonstrates the effectiveness of the spectral method. Using the signs of the empirical second eigenvector as a classifier, we recover correctly the communities for almost all members based on the data generated from SBM. In contrast, as it should be, we only identify 50% correctly for the data generated from Erdös-Rényi model.
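The simulation and the two-community spectral classifier of Example 11.3 can be reproduced along the following lines with igraph; the parameter values mirror the caption of Figure 11.11.

```r
library(igraph)

set.seed(1)
n <- 100; p <- 5 * log(n) / n; q <- p / 4
P <- matrix(c(p, q, q, p), 2, 2)                       # edge probability matrix
g_sbm <- sample_sbm(n, pref.matrix = P, block.sizes = c(n / 2, n / 2))
g_er  <- sample_gnp(n, p)                              # Erdos-Renyi benchmark

A   <- as.matrix(as_adjacency_matrix(g_sbm))
eig <- eigen(A, symmetric = TRUE)
ord <- order(abs(eig$values), decreasing = TRUE)       # rank eigenvalues by absolute value
u2  <- eig$vectors[, ord[2]]                           # second eigenvector
labels_hat <- ifelse(sign(u2) >= 0, 1, 2)              # classify by the sign of u2

truth <- rep(1:2, each = n / 2)
accuracy <- max(mean(labels_hat == truth), mean(labels_hat != truth))  # up to label swap
accuracy
```

Running the same spectral step on `g_er` yields accuracy close to 50%, matching the discussion of Figure 11.12.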
There are a number of variations on the applications of the spectral method. For each node i, let d_i = Σ_{j=1}^n a_{ij} be the degree of node i, which is the number of edges that node i has, and let D = diag(d_1, …, d_n) be the degree matrix. The Laplacian matrix, the symmetric normalized Laplacian and the random walk normalized Laplacian are defined as
L = D − A,   L_sym = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2},   L_rw = D^{−1} L = I − D^{−1} A.        (11.37)
In step 1 of Algorithm 11.1, the adjacency matrix A can be replaced by one of the Laplacian matrices in (11.37). In step 2, the rows of Γ̂ should be normalized to the unit norm before running the K-means algorithm. See Qin and Rohe (2013) and the arguments at the end of this subsection.
There are various extensions of the stochastic block models to make it more realistic. Mixed membership models have been introduced by Airoldi, Blei, Fienberg and Xing (2008) to accommodate members having multiple communities. Karrer and Newman (2011) introduced a degree corrected stochastic block model to account for varying degrees of connectivity among members in each community. We describe these models herewith to give readers some insights. We continue to use the notations introduced in Definition 11.1.

Figure 11.12: Spectral analysis of the simulated network data given in Figure 11.11. The top panels are for the data generated from the Erdös-Rényi model and the bottom panels for the data generated from the stochastic block model. The left panels are the scree plots based on {|λ̂_i|} and the right panels display √n û_2.
Definition 11.2 For each node i = 1, …, n, let πi be a K-dimensional probability vector that describes the mixed membership, with πi(k) = P(i ∈ Ck) for k ∈ [K] and θi > 0 representing the degree of heterogeneity in affinity of node i. Under the degree corrected mixed membership model, we assume that
P(A_kl = 1 | k ∈ C_i, l ∈ C_j) = θ_k θ_l p_ij.        (11.38)
Here we implicitly assume that 0 ≤ θ_k θ_l p_ij ≤ 1 for all i, j, k and l. Consequently, the probability that there is an edge between nodes k and l is
P(A_kl = 1) = θ_k θ_l π_k^T P π_l.        (11.39)
Note that when θi = 1 and πi is an indicator vector for all i = 1, …, n, model (11.39) reduces to the stochastic block model (11.35). If {θi} varies but πi is an indicator vector for all i = 1, …, n, we have the degree-corrected stochastic block model:
P(A_kl = 1) = θ_k θ_l p_{C(k)C(l)}.        (11.40)
Return to the general model (11.39). Let Θ = diag(θ_1, …, θ_n). Then, up to diagonal elements,
E A = Θ Π P Π^T Θ,        (11.41)
where Π is the n × K matrix with π_i^T as its ith row. Thus, the columns of ΘΠ fall in the space spanned by the eigenvectors of the top K eigenvalues of E A, which can be estimated by their empirical counterparts. A simple way to eliminate the degree heterogeneity Θ is to take the ratios of eigenvectors

Y_{ik} = γ̂_{ik}/γ̂_{i1},   k = 2, …, K,
where γ̂ik is the (i, k) element of Γ̂, the matrix of top K eigenvectors given in Algorithm 11.1. The community can now be detected by running the K-means algorithm on the rows {(Yi2, …,YiK)}ni=1. This is SCORE by Jin (2015). Another way to eliminate the degree heterogeneity is through normalization of rows of Γ̂ to the unit norm before running the K-mean algorithm.
To see why the degree heterogeneity can be removed by either of the above two methods, let us consider the population version: E A or a symmetric normalized Laplacian version (E D)^{−1/2} E A (E D)^{−1/2} (note that L = D^{1/2} L_sym D^{1/2} also falls in this category). Whichever version is used, the population matrix is of the form Θ̃ Π P Π^T Θ̃, where Θ̃ is a diagonal matrix. Let Γ be the matrix comprising the top K eigenvectors of Θ̃ Π P Π^T Θ̃. Then, the column space spanned by Γ is the same as that spanned by Θ̃Π, assuming P is not degenerate. Therefore, there exists a nondegenerate K × K matrix V = (v_1, …, v_K) such that

Γ = (θ̃ ∘ (Πv_1), …, θ̃ ∘ (Πv_K)),

where θ̃ = (θ̃_1, …, θ̃_n)^T is the vector of diagonal elements of Θ̃ and ∘ is the Hadamard (componentwise) product. Let γ_j be the jth column of Γ. Then, the ratios of eigenvectors {γ_j/γ_1}_{j=2}^K (componentwise division) eliminate the parameter vector θ̃, resulting in {Πv_j/Πv_1}_{j=2}^K. In addition, if Π has at most m distinct rows, so do the ratios of eigenvectors.
Similarly, letting {uiT}ni=1 be the rows of ΠV. Then, uiT = πTiV and the ith row of Γ is θ̃iuTi. If {πi}ni=1 have at most m distinct vectors, so do {ui}ni=1. Within a cluster, there might be different θ̃i but rows of Γ point the same direction ui = VTπi. Therefore, the normalization to the unit norm puts these rows at the same point so that the K-means algorithm can be applied.
Membership estimation by vertex hunting can be found in the paper by Jin, Ke and Luo (2017). Statistical inference on membership profiles πi is given in Fan, Fan, Han and Lv (2019).
11.4.2 Topic model
A topic model is used to classify a collection of documents into “topics” (see Figure 1.3). Suppose that we are given a collection of n documents and a bag of p words. Note that one of these words can be “none of the above words”. Then, we can count the frequencies of occurrence of the p words in each document and obtain a p × n observation matrix D = (d_1, …, d_n), called the text corpus matrix, where d_j contains the observed frequencies of the bag of words in document j. Suppose that there are K topics and that the bag of words appears in topic k with probability vector p_k, k ∈ [K], with p_ik indicating the probability of word i in topic k. As with the mixed membership model in the previous section, we assume that document j is a mixture of the K topics with probability vector π_j = (π_j(1), …, π_j(K))^T, i.e. document j puts weight π_j(k) on topic k. Therefore, the probability of observing the ith word in document j is

d*_ij = Σ_{k=1}^K p_ik π_j(k).
Let n_j be the length of the jth document. We assume that

n_j d_j ~ Multinomial(n_j, d*_j),

where d*_j = (d*_1j, …, d*_pj)^T. This is the probabilistic latent semantic indexing model introduced by Hofmann (1999).
Let Π be the n × K matrix with π_j^T as its jth row, P = (p_1, …, p_K) and D* = (d*_1, …, d*_n). Then, the observed data matrix D is related to the unknown parameters through

E D = D* = P Π^T,

which is of rank at most K. Let

D* = L* A* R*^T

be the singular value decomposition (SVD) of D*, where L* and R* consist respectively of the left and right singular vectors of D* and A* is the K × K diagonal matrix consisting of the K non-vanishing singular values. Then, the column space spanned by P is the same as that spanned by L*:

P = L* U   for some K × K matrix U,

i.e., P is identifiable from L* up to a right K × K matrix U. Similarly, Π is identifiable from R* up to a right K × K matrix.
The unknown matrices P and Π can be estimated by a spectral method. Let D = LAR^T be the SVD of the text corpus. Then L is an estimate of P up to a right K × K matrix U. This can be further identified by using anchor words, a concept that is similar to the pure nodes in the mixed membership model. As explained at the end of Section 11.4.1, we can normalize the matrix D first, followed by a post-SVD normalization. The K-means algorithm can then be applied to the rows of L for grouping words or to the rows of R for clustering documents. We refer to Ke and Wang (2019) for details.
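A bare-bones illustration of the spectral step: take the SVD of a (simulated) text corpus matrix and cluster documents by the rows of the right singular vectors. The simulated corpus and the absence of the normalizations used by Ke and Wang (2019) are simplifying assumptions.

```r
set.seed(1)
p <- 50; n <- 200; K <- 3
Pmat <- matrix(rgamma(p * K, 1), p, K);  Pmat <- sweep(Pmat, 2, colSums(Pmat), "/")  # topic-word probabilities
Pi   <- matrix(rgamma(n * K, 0.3), n, K); Pi  <- sweep(Pi, 1, rowSums(Pi), "/")      # document-topic weights
Dstar <- Pmat %*% t(Pi)                                       # expected word frequencies, rank K
D <- apply(Dstar, 2, function(prob) rmultinom(1, size = 300, prob = prob) / 300)     # observed corpus

sv <- svd(D, nu = K, nv = K)
doc_clusters <- kmeans(sv$v, centers = K, nstart = 20)$cluster   # cluster documents via rows of R
```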
A popular Bayesian approach to topic modeling is the Latent Dirichlet Allocation, introduced by Blei, Ng and Jordan (2003), in which a Dirichlet prior on {πi} is imposed. The parameters are then estimated by a variational EM algorithm (Hoffman, Blei, Wang, and Paisley, 2013).
11.4.3 Matrix completion
A motivating example is the Netflix problem in movie-rating: Customer i rates movie j if she has watched the movie; otherwise the (i, j)-entry of the ratings matrix is missing. Since Netflix has millions of customers and tens of thousands of movies, most entries of the ratings matrix are missing. The task is to predict the remaining entries in order to make good recommendations to customers on what to watch next, namely to complete the matrix. Similar problems appear in ratings and recommendations for books, music and CDs, which are the basis for collaborative filtering in the recommender systems. Another example of matrix completion is the word-document matrix: The frequencies of words used in a collection of documents can be represented as a matrix and the task is to classify the documents.
Let Θ be the n1 × n2 true preference matrix that we would like to recover, whose (i, j) entry is the preference score of the ith customer on the jth item. Instead, we only observe X on a small subset Ω of {1, …, n1} × {1, …, n2}, possibly further corrupted with noise (or uncertainties in ratings), so that the observed data are
(11.42)
The problem as stated is under-determined even when there is no noise. What makes the solution feasible is the low-rank assumption on Θ. This assumption leads to a factor model interpretation. Take the movie ratings as an example: each movie has a number of features (factors), and different users have different loadings on these features, plus some idiosyncratic components of personal taste. This leads to the factor model (10.6) and the low-rank matrix Θ in (11.42). The main difference here is that we observe only a small fraction of the data in the factor model.
A common assumption on the sampling is that Ω is randomly chosen in some sense. It is often assumed that the (i, j)-entry is observed with probability p, independently of the other entries; in other words, the data are missing at random. Let {Iij} be i.i.d. Bernoulli random variables with probability of success p. We model the observed entries as
(11.43)
A natural solution to problem (11.42) is the penalized least-squares: Find Θ̂ to minimize
(11.44)
where the nuclear norm penalty is a convex relaxation of the low-rank constraint. Candès and Recht (2009) show that in the noiseless situation, solving the relaxed problem yields, with high probability, the same solution as the nonconvex rank-constrained optimization problem and is optimal. Chen, Chen, Fan, Ma and Yan (2019) showed further that such optimality continues to hold in the noisy setting, and Chen, Fan, Ma and Yan (2019) refined the optimality result further and derived a method for constructing confidence intervals.
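Although specialized solvers exist, a simple way to approximate the solution of a nuclear-norm penalized problem of the form (11.44) is an iterative soft-thresholded SVD, in the spirit of Soft-Impute (Mazumder, Hastie and Tibshirani, 2010). The sketch below assumes X is the observed matrix with NA in the unobserved positions and lambda is the penalty level; the function name and the fixed number of iterations are ours.

# Sketch of an iterative soft-thresholded SVD (Soft-Impute style) for
# nuclear-norm regularized matrix completion.
# X: matrix with NA for unobserved entries; lambda: penalty level.
soft_impute <- function(X, lambda, n_iter = 100) {
  obs   <- !is.na(X)
  Theta <- matrix(0, nrow(X), ncol(X))       # current estimate
  for (iter in seq_len(n_iter)) {
    Z      <- Theta
    Z[obs] <- X[obs]                         # keep observed entries, impute the rest
    s      <- svd(Z)
    d      <- pmax(s$d - lambda, 0)          # soft-threshold the singular values
    Theta  <- s$u %*% (d * t(s$v))           # low-rank reconstruction
  }
  Theta
}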
We now offer a quick solution from the spectral point of view. First of all, let us create a complete data matrix via the inverse probability weighting:
(11.45)
where p̂ is the proportion of the observed entries. When there are sufficiently many entries, p̂ is an accurate estimate of p. Ignoring the error in estimating p, we then have
a low-rank matrix. This suggests a spectral method as follows. Let us assume that the rank of Θ is known to be K. As in (9.2), let
be its singular value decomposition. Then, a spectral estimator is
(11.46)
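A minimal implementation of this spectral estimator, assuming the observed matrix X carries NA in the unobserved positions and the rank K is known (the function name is ours):

# Spectral matrix completion via inverse probability weighting.
# X: matrix with NA for unobserved entries; K: assumed rank of Theta.
spectral_complete <- function(X, K) {
  obs     <- !is.na(X)
  p_hat   <- mean(obs)                       # proportion of observed entries
  Y       <- X
  Y[!obs] <- 0                               # fill unobserved entries with zero
  Y       <- Y / p_hat                       # inverse probability weighting, as in (11.45)
  s       <- svd(Y, nu = K, nv = K)
  s$u %*% (s$d[1:K] * t(s$v))                # keep the top-K part of the SVD, as in (11.46)
}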
The maximum entrywise estimation error of this procedure was established by Abbe, Fan, Wang and Zhong (2019+), who show that the rate is optimal in the sense that it recovers the Frobenius-norm bound of Keshavan et al. (2010) up to a logarithmic factor.
11.4.4 Item ranking
Item ranking is a classical subject, dating back at least to the 17th century. Humans are better at expressing preferences between two items than among many items. Thus, one important task is to rank the top K items based on many pairwise comparisons. Such a ranking problem finds applications in web search such as PageRank (Dwork, 2001), recommender systems (Baltrunas, Makcinskas and Ricci, 2010), and sports competitions (Masse, 1997), among others.

Figure 11.13: One way to rank the top K items is to first estimate the latent scores {w_i^*}_{i=1}^n and then rank them (left panel). This yields the actual ranks of the items (right panel).
To accomplish the above task, we need a statistical model. We assume that there is a positive latent score w_i^* associated with item i such that
(11.47)
for any pair (i, j). Such a model is referred to as the Bradley-Terry-Luce model (Bradley and Terry, 1952; Luce, 1959). Thus, our task is to estimate the latent score vector w* = (w_1^*, …, w_n^*)^T so that we can rank the items according to these preference scores. The idea is depicted in Figure 11.13. Note that w* can only be identified up to a constant multiple, so we will assume that it is a probability vector. Hunter (2004) discussed an MM algorithm for fitting the Bradley-Terry-Luce model.
We do not assume that all pairwise comparisons are available; in fact, only a small fraction of the pairs have ever been compared. The pairwise comparisons form a graph G, in which items are vertices and edges indicate whether two items have been compared; see Figure 11.14. Let ε be the edge set of the graph G. For each pair (i, j) ∈ ε, we have obtained L_ij independent comparisons:
(11.48)
Let p̂_ij be the proportion of these comparisons in which the jth item beats the ith one. For theoretical studies, we may assume that G is a realization of an Erdös-Rényi graph.

Figure 11.14: Comparison graph for top-K ranking. Each pair (i, j) in the graph has been compared L_ij times, and p̂_ij is the proportion of comparisons in which the jth item beats the ith item.
A simple and naive ranking is to compute the overall proportion p̂_i of the comparisons that item i has won, and then rank the items according to {p̂_i}_{i=1}^n. This method ignores the ranks of the competitors that item i was compared with. With a probabilistic model, a natural method is the maximum likelihood method. Conditioning on G, the likelihood of the observed data is a product of Bernoulli likelihoods:
The log-likelihood is given by
(11.49)
Parametrizing θ_i = log w_i yields a logistic type of regression, and the log-likelihood function is convex. To stabilize the high-dimensional likelihood, one can also regularize the likelihood function by adding a penalty; the resulting regularized criterion is given by
(11.50)
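For illustration, a direct way to compute a regularized MLE is to parametrize θ_i = log w_i and minimize the negative of the log-likelihood (11.49) plus a ridge penalty on θ (one common choice of regularization; the exact penalty used in (11.50) may differ). The sketch below assumes the comparisons are stored in a data frame edges with columns i, j (item indices), L (number of comparisons) and p_hat (fraction of the L_ij comparisons in which item j beats item i); the function name and input format are ours.

# Sketch of a ridge-regularized MLE for the Bradley-Terry-Luce model.
# edges: data frame with columns i, j, L, p_hat; n: number of items.
btl_regularized_mle <- function(edges, n, lambda = 0.1) {
  neg_pen_loglik <- function(theta) {
    ti  <- theta[edges$i]
    tj  <- theta[edges$j]
    lse <- pmax(ti, tj) + log1p(exp(-abs(ti - tj)))   # log(w_i + w_j), computed stably
    ll  <- sum(edges$L * (edges$p_hat * (tj - lse) +
                          (1 - edges$p_hat) * (ti - lse)))
    -ll + 0.5 * lambda * sum(theta^2)                 # negative log-likelihood + ridge penalty
  }
  fit   <- optim(rep(0, n), neg_pen_loglik, method = "BFGS")
  w_hat <- exp(fit$par)
  w_hat / sum(w_hat)                                  # normalize to a probability vector
}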
Now let us consider the spectral method, which is also referred to as Rank Centrality. Without loss of generality, we assume that w* has been normalized to a probability vector. The idea is to create a Markov chain on the graph G such that w* is its invariant distribution, also called the stationary distribution. Let us define the transition matrix P* as
(11.51)
where ε is the edge set of the graph G and d is a normalization parameter. When d is sufficiently large (e.g., the maximum of the degrees) so that the diagonal entries are non-negative, the matrix P* is a transition probability matrix: each row sums to 1. This Markov chain prefers to move to states with higher scores. It is easy to verify that P* is a transition matrix satisfying the detailed balance condition
Therefore,
or, put in matrix notation,
(11.52)
In other words, w* is the leading left eigenvector of P* with eigenvalue 1. It is also the stationary distribution of the Markov chain with transition matrix P*.
The spectral method replaces the unknown transition matrix by its empirical counterpart P̂, whose (i, j) element is given by
(11.53)
If d > dmax, the maximum degree of the graph, then P̂ is a transition matrix (Exercise 11.13). In practice, we take d = 2dmax.
Negahban, Oh and Shah (2017) establish the L2-rate of convergence for the spectral estimator ŵ. Chen, Fan, Ma and Wang (2019) improve their results and further establish the maximum entrywise estimation error. As a result, they obtain the selection consistency for both the spectral method and the regularized MLE, and show that the sample complexity achieves the information-theoretic lower bound given by Chen and Suh (2015). Additional references can be found in the papers by Negahban, Oh and Shah (2017) and Chen, Fan, Ma and Wang (2019); in particular, the latter gives a comprehensive review of recent developments on this topic (see Section 2.7 there).
Algorithm 11.2 Spectral method for top K-ranking (Rank Centrality)
1. Input the comparison graph G, sufficient statistics {p̂ij, (i, j) ∈ ε}, and the normalization factor d.
2. Define the probability transition matrix P̂ = [P̂_{i,j}]_{1≤i,j≤n} as in (11.53).
3. Compute the leading left eigenvector ŵ of P̂.
4. Output the K items that correspond to the K largest entries of ŵ.
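A compact implementation of Algorithm 11.2, assuming the sufficient statistics are stored in an n × n matrix p_hat with p_hat[i, j] the proportion of comparisons in which item j beats item i and NA for pairs that were never compared; the input format and function name are ours.

# Sketch of Algorithm 11.2 (Rank Centrality).
# p_hat: n x n matrix of pairwise winning proportions (NA if never compared);
# K: number of top items to report.
rank_centrality <- function(p_hat, K) {
  n         <- nrow(p_hat)
  adj       <- !is.na(p_hat)
  diag(adj) <- FALSE
  d         <- 2 * max(rowSums(adj))        # normalization factor d = 2 * d_max
  P         <- matrix(0, n, n)
  P[adj]    <- p_hat[adj] / d               # off-diagonal entries, as in (11.53)
  diag(P)   <- 1 - rowSums(P)               # make each row sum to one
  e         <- eigen(t(P))                  # left eigenvectors of P
  w         <- Re(e$vectors[, which.max(Re(e$values))])
  w         <- abs(w) / sum(abs(w))         # estimated score vector (stationary distribution)
  order(w, decreasing = TRUE)[1:K]          # indices of the top-K items
}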
11.4.5 Gaussian mixture models
As an illustration of the applications of spectral methods to nonconvex optimization, let us consider the Gaussian mixture model:
(11.54)
with unknown parameters {w_k, µ_k, σ_k^2}_{k=1}^K. The likelihood function based on a random sample {X_i}_{i=1}^n from model (11.54) can easily be written down (Exercise 11.14), but it is non-convex, which makes computing the MLE challenging.
PCA allows us to find a good initial value for this nonconvex optimization problem. Indeed, the classical result of Bickel (1975) shows that for finite-dimensional parametric problems, if we start from a root-n consistent estimator, the one-step Newton-Raphson estimator will be statistically efficient. In other words, the optimization error (the distance to a global optimum) is of smaller order than the statistical error (the standard deviation of the estimator). Further iterations only improve the optimization error, as is stated more precisely in Robinson (1988).
The method of moments provides a very simple way of finding root-n consistent estimators. For the mixture model (11.54), the first two moments (Exercise 11.15) are given by
(11.55)
where ⊗ denotes the Kronecker product and σ_w^2 = Σ_{k=1}^K w_k σ_k^2 is the weighted average of the variances. Therefore, the column space spanned by {µ_k}_{k=1}^K is the same as the eigen-space spanned by the top K principal components, and σ_w^2 is the minimum eigenvalue of E[X ⊗ X].
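A sketch of this first moment-based step, assuming the data are stored in an n × p matrix X with p > K so that the smallest eigenvalue of the second-moment matrix equals σ_w^2 (the function name is ours): the smallest eigenvalue estimates σ_w^2, and the top K eigenvectors estimate the span of the component means.

# Moment/PCA step for the Gaussian mixture model.
# X: n x p data matrix; K: number of mixture components (assumed known).
gmm_moment_step <- function(X, K) {
  M   <- crossprod(X) / nrow(X)             # empirical second-moment matrix of X
  eig <- eigen(M, symmetric = TRUE)
  list(sigma2_w   = min(eig$values),                   # estimate of sigma_w^2
       mean_space = eig$vectors[, 1:K, drop = FALSE],  # estimated basis of span{mu_1, ..., mu_K}
       v          = eig$vectors[, ncol(X)])            # eigenvector for the smallest eigenvalue
}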
To further determine the rotation, Hsu and Kakade (2013) introduce a third-order tensor method. Suppose that {µ_k}_{k=1}^K are linearly independent and w_k > 0 for all k ∈ [K], so that the problem does not degenerate. Then σ_w^2 is the minimum eigenvalue of Σ = cov(X); this will be shown at the end of this section. Let v be any eigenvector of Σ associated with the eigenvalue σ_w^2. Define the following quantities:
where we use the notation
and e_j is the unit vector whose jth entry is 1. Then Hsu and Kakade (2013) show that
(11.56)
With the {M_i} replaced by their empirical versions, the remaining task is to solve (11.56) for all the parameters of interest. Hsu and Kakade (2013) and Anandkumar, Ge, Hsu, Kakade and Telgarsky (2014) propose a fast method, called the robust tensor power method, to compute the estimators. The idea is to orthogonalize {µ_k} in M3 by using the matrix M2, so that the power method can be used to compute the tensor decomposition.
Let M2 = UDU^T be the spectral decomposition of M2, where D is the K × K diagonal matrix of the non-vanishing eigenvalues of M2. Set
(11.57)
Note that W is a generalized inverse of M2^{1/2}. Then W^T M2 W = I_K, which implies by (11.56) that
Thus, {µ̃k}Kk=1 are orthonormal and
(11.58)
Note that the quadratic form WTM2W can be denoted as M2(W, W), which is defined as
where a^{⊗2} = a ⊗ a. Similarly, we can define
(11.59)
where a^{⊗3} = a ⊗ a ⊗ a. Therefore, M̃3 admits an orthogonal tensor decomposition, analogous to the spectral decomposition of a symmetric matrix, and this decomposition can be computed rapidly by the power method. It can then be verified that
(11.60)
which is merely a rotated version of (11.56).
The estimation idea is now clear. First, using the method of moments, we obtain an estimator M̂3 by (11.60). Using the tensor power method (Anandkumar et al., 2014), we find estimators of the orthogonal tensor components {µ̃_k}_{k=1}^K and their associated eigenvalues {1/√w_k}_{k=1}^K in (11.59). The good news is that we operate in the K-dimensional space rather than the original p-dimensional one. We omit the details. Using (11.57) and (11.58), we can then obtain estimators of µ_k. We summarize the procedure in Algorithm 11.3.
Note that in Algorithm 11.3, the sample covariance matrix can be replaced by the empirical second-moment matrix n^{-1} Σ_{i=1}^n X_i X_i^T. To see this, note that by (11.55) and (11.56),
Thus, the top K eigenvalues are {d_1 + σ_w^2, …, d_K + σ_w^2} and the remaining (p − K) eigenvalues are all σ_w^2. The top K eigenvectors lie in the linear span of {µ_k}_{k=1}^K. Let v be an eigenvector corresponding to the minimum eigenvalue of E[X ⊗ X]. Then µ_k^T v = 0 for all k ∈ [K]. Using the population covariance
we have
Algorithm 11.3 The tensor power method for estimating parameters in the normal mixtures
1. Calculate the sample covariance matrix, its minimum eigenvalue σ̂2w and its associated eigenvector v̂.
2. Derive the estimators M̂1 and M̂2 by using the empirical moments of X, v̂ and σ̂2w.
3. Calculate the spectral decomposition M̂2 = ÛD̂ÛT. Let Ŵ = ÛD̂−1/2. Construct an estimator of M̃3, denoted by M̂3, based on (11.60) by substituting the empirical moments of ŴTX, Ŵ and M̂1. Apply the robust tensor power method in Anandkumar et al. (2014) to M̂3 and obtain {µ̂̃k}Kk=1 and {ŵk}Kk=1.
4. Compute the estimates µ̂_k from {µ̂̃_k}, Ŵ and {ŵ_k} via (11.57) and (11.58), and solve the linear equations M̂1 = Σ_{k=1}^K ŵ_k σ̂_k^2 µ̂_k for the component variances {σ̂_k^2}_{k=1}^K.
Therefore, σ_w^2 is an eigenvalue of Σ and v is an associated eigenvector. We can show further that σ_w^2 is indeed the minimum eigenvalue.
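For completeness, here is a short verification of the eigenvector claim under model (11.54). Writing µ̄ = Σ_{k=1}^K w_k µ_k for the overall mean, the population covariance matrix is Σ = Σ_{k=1}^K w_k µ_k µ_k^T − µ̄ µ̄^T + σ_w^2 I_p. Since µ_k^T v = 0 for every k ∈ [K], we also have µ̄^T v = 0, and hence

Σ v = (Σ_{k=1}^K w_k µ_k µ_k^T − µ̄ µ̄^T + σ_w^2 I_p) v = σ_w^2 v.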
The above gives an idea of how spectral methods can be used to obtain an initial estimator. Anandkumar et al. (2014) used tensor methods to solve a number of statistical machine learning problems; in addition to the aforementioned mixtures of spherical Gaussians with heterogeneous variances, these include hidden Markov models and latent Dirichlet allocation. Sedghi, Janzamin and Anandkumar (2016) applied tensor methods to learning mixtures of generalized linear models.
11.5 Bibliographical Notes
There is a huge literature on controlling false discoveries for independent test statistics. After the seminal work of Benjamini and Hochberg (1995) on controlling false discovery rates (FDR), the field of large-scale multiple testing received a lot of attention. Important work in this area includes Storey (2002), Genovese and Wasserman (2004), Lehmann and Romano (2005) and Lehmann, Romano and Shaffer (2005), among others. The estimation of the proportion of nulls has been studied by, for example, Storey (2002), Langaas and Lindqvist (2005), Meinshausen and Rice (2006), Jin and Cai (2007) and Jin (2008), among others.
Although Benjamini and Yekutieli (2001), Sarkar (2002), Storey, Taylor and Siegmund (2004), Clarke and Hall (2009), Blanchard and Roquain (2009) and Liu and Shao (2014) showed that the Benjamini-Hochberg procedure and its relatives continue to provide valid control of the FDR under positive regression dependence on subsets or weak dependence, these procedures still suffer from efficiency loss when the actual dependence information is ignored. Efron (2007, 2010) pioneered work in this field and noted that correlation must be accounted for in deciding which null hypotheses are significant, because the accuracy of FDR techniques is compromised in high-correlation situations. Leek and Storey (2008), Friguet, Kloareg and Causeur (2009), Fan, Han and Gu (2012), Desai and Storey (2012) and Fan and Han (2017) proposed different methods to control the FDR or FDP under factor dependence models. Other related studies on controlling the FDR under dependence include Owen (2005), Sun and Cai (2009), Schwartzman and Lin (2011) and Wang, Zhao, Hastie and Owen (2017), among others.
There is a surge of interest in community detection in the statistical machine learning literature; we only briefly mention some of it here. Since the introduction of stochastic block models by Holland, Laskey and Leinhardt (1983), there have been various extensions of this model; see Abbe (2018) for an overview. In addition to the references in Section 11.4.1, Bickel and Chen (2009) gave a nonparametric view of network models. Other related work on nonparametric estimation of graphons includes Wolfe and Olhede (2013), Olhede and Wolfe (2014) and Gao, Lu and Zhou (2015). Bickel, Chen and Levina (2011) employed the method of moments. Amini et al. (2013) studied pseudo-likelihood methods for network models. Rohe, Chatterjee and Yu (2011) and Lei and Rinaldo (2015) investigated the properties of spectral clustering for the stochastic block model. Zhao, Levina and Zhu (2012) established consistency of community detection under degree-corrected stochastic block models. Jin (2015) proposed fast community detection by SCORE, and Jin, Ke and Luo (2017) proposed a simplex hunting approach. Abbe, Bandeira and Hall (2016) investigated the exact recovery problem. Inference on network models has been studied by Bickel and Sarkar (2016), Lei (2016) and Fan, Fan, Han and Lv (2019a,b).
There has been a surge of interest in the matrix completion problem over the last decade. Candès and Recht (2009) and Recht (2011) studied exact recovery via convex optimization. Candès and Plan (2010) investigated matrix completion with noisy data. Candès and Tao (2010) studied the optimality of matrix completion. Cai, Candès and Shen (2010) proposed a singular value thresholding algorithm for matrix completion. Keshavan, Montanari and Oh (2010) investigated the sampling complexity of matrix completion. Eriksson, Balzano and Nowak (2011) studied high-rank matrix completion. Negahban and Wainwright (2011, 2012) derived the statistical error of the penalized M-estimator and showed that it is optimal up to logarithmic factors. Fan, Wang and Zhu (2016) proposed a robust procedure for low-rank matrix recovery, including matrix completion as a special case. Cai, Cai and Zhang (2016) and Fan and Kim (2019) applied structured matrix completion to genomic data integration and volatility matrix estimation for high-frequency financial data.
11.6 Exercises
11.1 Suppose that we have the linear model Y = X^Tβ* + ε with X following model (11.1). Here, we use β* to denote the true parameter of the model. Consider the least-squares solution to problem (11.3) without constraints:
Let α0, β0, and γ0 be the solution. Show that under condition (11.4), the solution to the above least-squares problem is obtained at
11.2 Suppose that the population parameter β* is the unique solution to E ∇L(Y, X^Tβ)X = 0, where ∇L denotes the derivative of L with respect to its second argument. Let W = (1, u^T, f^T)^T and assume that E ∇L(Y, W^Tθ)W = 0 has a unique solution. Show that, under condition (11.9), the unconstrained minimization of the population objective E L(Y, α + u^Tβ + f^Tγ) has the unique solution (11.8).
11.3 Let us consider the macroeconomic time series in the Federal Reserve Bank of St. Louis
http://research.stlouisfed.org/econ/mccracken/sel/
from January 1963 to December of the last year. Let us take the unemployment rate as the Y variable and the remaining series as the X variables. Use FarmSelect with rolling windows of the past 120, 180, and 240 months to predict the monthly change of the unemployment rate. Compute the out-of-sample R2 and produce a table similar to Table 11.1. Also report the top 10 variables most frequently selected by FarmSelect over the rolling windows. Note that several variables, such as “Industrial Production Index”, “Real Personal Consumption”, “Real M2 Money Stock”, “Consumer Price Index”, “S&P 500 index”, etc., are non-stationary; they are better replaced by their log-returns, namely the differences of the logarithms of the time series.
11.4 Let us dichotomize the monthly changes of the unemployment rate as 0 or 1, according to whether the changes are non-positive or positive. Run an analysis similar to that of the previous exercise, using FarmSelect for logistic regression, and report the percentage of correct predictions using rolling windows of the past 120, 180, and 240 months. Report also the top 10 variables most frequently selected by FarmSelect over the rolling windows.
11.5 As a generalization of Example 11.1, we consider the one-factor model
Consider again the sample average test statistics Zj = √nX̄.j for the testing problem (11.11). Show that under some mild conditions,
11.6 Analyze the neuroblastoma data again and reproduce the results in Section 11.2.5 using FarmTest and other functions in the R software package. Produce in particular
(a) the scree plots and the number of factors selected in both positive and negative groups;
(b) the correlation coefficient matrix (similar plot to Figure 11.9) for the last 100 genes before and after factor adjustments;
(c) the top 30 significant genes with and without factor adjustments.
11.7 Prove (11.29) and (11.30).
11.8 Use the data in Exercise 11.3 and run the principal component regression and the augmented principal component regression to forecast the changes of the unemployment rate using rolling windows of 120, 180, and 240 months, and report the out-of-sample R2 results, similar to the PCA block in Table 11.3.
11.9 Let us verify (11.36) using a specific example with three communities.
(a) Assume that communities 1, 2 and 3 consist of nodes {1, …, m}, {m + 1, …, 2m}, and {2m + 1, …, 3m}. Verify (11.36).
(b) Show that the column space of Γ is the eigen-space spanned by the top 3 eigen-vectors of the matrix E A, if P is not degenerate.
(c) Verify the eigenvalues and eigenvectors given in Example 11.3.
11.10 For the Erdös-Rényi graph G(n, p),
(a) What is the probability that a node is isolated?
(b) What is the limit of this probability as n → ∞ if p = c/n?
(c) Show that the graph has no isolated node with probability tending to one if p = c(log n)/n for c > 1.
11.11 Simulate a data set from the model given in Exercise 11.9 with n = 150 (m = 50), diagonal elements of the P matrix p = 5(log n)/n, and off-diagonal elements q = p/10. Give the resulting graph, the scree plot of the absolute eigenvalues of the adjacency matrix, and the results of the spectral analysis using the R function kmeans.
11.12 Following the notation in Section 11.4.3, let Zij = XijI(i,j)∈Ω/p. Show that E Zij = E Xij = θij under the Bernoulli sampling scheme.
11.13 Show that the matrix (11.53) is a transition matrix if d > dmax.
11.14 Suppose that we have a random sample {Xi}ni=1 from the Gaussian mixture model (11.54).
(a) Write down the likelihood function.
(b) Simulate a random sample of size n = 1000 from the model (11.54) with p = 1, K = 3, σk = 1 and wk = 1/3 for all k, and µ1 = −1, µ2 = 1 and µ3 = 10. Plot the resulting likelihood function as a function of (μ1, μ2) and a function of (μ2, μ3). Are they convex?
11.15 Prove (11.55).