1 Introduction
The principle of parsimony, known as Occam’s razor, is arguably one of the most fundamental ideas pervading science, and sparsity has become a popular paradigm in statistical analysis as a particular manifestation of the parsimony principle in modern statistics. Much of this popularity has been driven by the success of frequentist (and Bayesian) methods exploiting the underlying sparsity structure of the unknown parameter of interest.
In general, a sparse signal is a high-dimensional parameter that admits a parsimonious representation. In signal processing, this is typically expressed by assuming that the parameter contains only a small number of non-zero elements compared to its dimension. The value zero of the sparsity cluster has the interpretation of being “insignificant” for the corresponding coordinates. Any other value of the sparsity cluster can be handled as well (indeed, one can always reduce to the zero case by subtracting that value), as long as this value is known a priori to the observer.
In the signal+noise setting, the best-studied problem is that of signal estimation, and a variety of estimation methods and results are available in the literature: [1, 3–8]. Thresholding strategies are particularly appealing, mainly because thresholding automatically generates sparsity. In addition, the corresponding procedures generally exhibit fast convergence properties. Moreover, thresholding processes the signal in a coordinate-wise fashion, resulting in low complexity algorithms (typically of order n), which are easy to implement in practice.
Many methods have Bayesian connections. For example, even some seemingly non-Bayesian estimators can be obtained as certain quantities (like the posterior mode, in the case of penalized estimators) of the posterior distributions resulting from imposing specific priors on the parameter; cf. [1, 2, 4, 6, 8, 11]. A common Bayesian way to model the sparsity structure is via two-group priors. Such a prior puts positive mass on vectors with some exactly zero coordinates (the zero group), while the remaining coordinates (the signal group) are drawn from a chosen distribution. As pointed out by [6] (and also by [8]), the prior distribution of the non-zero coordinates should not have too light tails; otherwise, one obtains sub-optimal convergence rates (or even inconsistency). The important Gaussian case is, for example, excluded, and therefore [6, 8] use heavy-tailed priors. On the other hand, in [4] it was shown that normal priors are still usable and lead to strong local results (even for non-iid, non-normal models) when combined with the empirical Bayes approach.
However, all the above-mentioned approaches rely on the essential assumption that the sparsity cluster value is known to the observer (it is set to zero by default). In this note, we relax this modeling assumption by allowing the sparsity cluster value to be an unknown constant, which gives a robust formulation of the estimation problem. This situation occurs when, for example, there is a systematic error in the observations and the sparsity coordinates get shifted by an unknown value (the bias of the systematic error), leading to what we call a sparsity cluster with an unknown cluster value. It is clear that thresholding procedures are not applicable in this situation, so we need to deal with methodological and computational issues. We address the first aspect by applying an empirical Bayes approach, which delivers two robust procedures: the empirical Bayes model averaging (EBMA) and empirical Bayes model selection (EBMS) procedures. As to the computational issue, the former procedure is realized by an MCMC algorithm based on the partial (mixed) normal–normal conjugacy in the model, and the latter is based on a new optimization algorithm of $O(n^2)$-complexity (cf. $O(n)$-complexity for typical thresholding procedures). We perform simulations to demonstrate how the proposed procedures work and accommodate a possible systematic error in the sparsity cluster value.
2 Setting and Notation

Suppose we observe $X=(X_1,\ldots,X_n)$ from the signal+noise model
$$\begin{aligned} X_i=\theta _i+ \sigma \xi _i, \quad i\in [n]=\{1,\ldots ,n\}, \end{aligned} \tag{1}$$
where the $\xi_i$'s are independent standard normal random variables, $\sigma>0$ is the noise level, and $\theta=(\theta_1,\ldots,\theta_n)\in\mathbb{R}^n$ is the unknown high-dimensional parameter of interest.
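For concreteness, here is a minimal sketch (in Python) of how data from model (1) with a sparsity cluster can be generated; the values of n, p, σ, the signal strengths and the cluster value below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only.
n, p, sigma = 500, 25, 1.0            # dimension, number of "significant" coordinates, noise level
cluster_value = 0.0                   # 0 in the known case; an unknown constant otherwise

theta = np.full(n, cluster_value)     # the sparsity cluster
theta[:p] = rng.uniform(2.0, 4.0, p)  # "significant" coordinates, e.g. A_i ~ U[2, 4]

X = theta + sigma * rng.standard_normal(n)   # observations from model (1)
```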




Denote the probability measure of X from the model (1) by $\mathrm{P}_\theta$, and by $\mathrm{E}_\theta$ the corresponding expectation. For notational simplicity, we often skip the dependence on $\theta$ and n. Denote by $\mathbbm{1}\{E\}$ the indicator function of the event E, and by |S| the cardinality of the set S. Let $[n]_0=\{0\}\cup[n]$ and, for a set $S\subseteq[n]$, let $S^c=[n]\setminus S$ and $\bar{X}_S=|S|^{-1}\sum_{i\in S}X_i$. Let $\mathcal{I}$ be the family of all subsets of [n], including the empty set. Throughout, we use the usual conventions for degenerate cases; in particular, a sum over the empty index set is zero and $0\log(en/0)=0$. Let $X_{[1]}\ge X_{[2]}\ge \ldots \ge X_{[n]}$ be the ordered values of $X_1,\ldots,X_n$. Throughout, $\phi(x;\mu,\sigma^2)$ will be the density of the normal distribution $\mathrm{N}(\mu,\sigma^2)$ at the point x, and $\delta_x$ denotes the Dirac measure at the point x. Finally, $\|x\|=\big(\sum_{i\in[n]}x_i^2\big)^{1/2}$ denotes the usual norm of $x\in\mathbb{R}^n$.
3 Empirical Bayes Approach
First, we introduce a family of normal priors (similar to the priors from [4]). Next, by applying the empirical Bayes approach to the normal likelihood, we derive an empirical Bayes posterior for the case of an unknown sparsity cluster value, and use this posterior for further inference on $\theta$.
3.1 Multivariate Normal Prior

A logical choice for the prior on the set I of “significant” coordinates seems to be the uniform prior on $\mathcal{I}$. However, this prior is not monotone with respect to the cardinality |I|, whereas we would like to penalize large cardinalities. We therefore use a prior that is monotone (in |I|) and serves as a proxy for the uniform prior, with an extra parameter $\kappa$ controlling the amount of penalization.
3.2 Empirical Bayes Posterior

Combining the normal prior with the normal likelihood yields, for each $I\in\mathcal{I}$, an explicit conditional posterior $\tilde{\pi}_I(\vartheta|X)$ for $\vartheta$. In particular, its mean is
$$\tilde{\mathrm {E}}_I(\vartheta |X)=\hat{\mu }(I) =\big(X_i \mathbbm {1}\{i \in I\}+\bar{X}_{I^c}\mathbbm {1}\{i \in I^c\},\ i\in [n]\big),$$
that is, $\hat{\mu }_i(I)=X_i \mathbbm {1}\{i \in I\}+\bar{X}_{I^c}\mathbbm {1}\{i \in I^c\}$: the coordinates in I are estimated by the corresponding observations, and the coordinates in the sparsity cluster $I^c$ by their common average $\bar{X}_{I^c}$. Denoting by $\hat{I}$ the set selected by the empirical Bayes criterion (its explicit form for the case of an unknown sparsity cluster value is given in Sect. 5), the empirical Bayes posterior and the corresponding empirical Bayes estimator are defined as
$$\begin{aligned} \hat{\pi } (\vartheta |X)= \tilde{\pi }_{\hat{I}} (\vartheta |X), \quad \hat{\theta }^{EB}=\hat{\mathrm {E}}(\vartheta |X)= \hat{\mu }(\hat{I}) =\big(\hat{\mu }_i(\hat{I}),\ i\in [n]\big). \end{aligned} \tag{7}$$
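The posterior mean $\hat{\mu}(I)$ is straightforward to compute; a minimal sketch in Python (representing I by a boolean mask is our implementation choice, not notation from the text):

```python
import numpy as np

def mu_hat(X: np.ndarray, I: np.ndarray) -> np.ndarray:
    """Posterior mean given the set I: keep X_i for i in I, use the average
    of X over the complement I^c for the sparsity-cluster coordinates.

    X : observations, shape (n,)
    I : boolean mask of length n, True exactly for the indices in I
    """
    out = X.copy()
    if (~I).any():                     # if the cluster I^c is non-empty
        out[~I] = X[~I].mean()         # \bar{X}_{I^c} on the complement
    return out
```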


4 Known Sparsity Cluster Value: Thresholding Procedures
In the traditional sparsity setting, the sparsity cluster value is assumed to be known and is set to zero without loss of generality. This case is well studied, and various estimators have been proposed and analyzed in the literature; see [1, 4–8] and further references therein. Many estimation procedures originate as penalized estimators minimizing a criterion of the form $\|X-\theta\|^2+\mathrm{pen}(\theta)$ for some appropriately chosen penalty $\mathrm{pen}(\cdot)$, or as (empirical) Bayes estimators based on appropriately chosen priors. An extensive discussion of this can be found in [1].
Notice that whenever the penalty is of $\ell_0$-type, i.e., $\mathrm{pen}(\theta)=p(\|\theta\|_0)$ for some function p and $\|\theta\|_0=|\{i\in[n]:\theta_i\ne 0\}|$, the resulting penalized estimator is a thresholding estimator $\check{\theta}=(X_i\mathbbm{1}\{|X_i|\ge \check{t}\},\ i\in[n])$, where $\check{t}=|X_{[\check{k}]}|$ and $\check{k}$ is the minimizer of $\sum_{i=k+1}^n X^2_{[i]}+p(k)$ over $k\in[n]_0$ (with the $X_{[i]}$'s here ordered by decreasing absolute value). Thresholding strategies are particularly appealing because thresholding automatically generates sparsity. Besides, thresholding procedures generally exhibit fast convergence properties and process the signal in a coordinate-wise way, which results in low complexity algorithms. There is a vast literature on this topic; see, e.g., [10] and further references therein.
The Bayesian approach can be connected to penalized estimation by relating the penalties to the corresponding priors on $\theta$. Penalties of $\ell_0$-type can be linked to Bayesian procedures involving priors on the number of non-zero entries of $\theta$; see [1]. In particular, for a penalty p(k) involving a constant $K>0$ (for instance, $p(k)=K\sigma^2 k\log(en/k)$, the known-cluster-value analogue of the penalty appearing in Sect. 5), one obtains the selected set
$$\begin{aligned} \check{I}=\big \{i\in [n]: |X_i| \ge \check{t}=|X_{[\check{k}]}|\big \}, \quad \check{k}=\arg \!\min _{k\in [n]_0} \Big \{\sum _{i=k+1}^n X^2_{[i]} +p(k)\Big \}, \end{aligned} \tag{8}$$
and the hard-thresholding (HT) estimator $\check{\theta }^{HT}=\check{\theta }^{HT}(K,X)=(\check{\theta }_i^{HT},\ i\in [n])$ with
$$\begin{aligned} \check{\theta }_i^{HT} = X_i\mathbbm {1}\{i \in \check{I}\}= X_i\mathbbm {1}\{|X_i|\ge |X_{[\check{k}]}|\}, \quad i \in [n]. \end{aligned} \tag{9}$$
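A minimal Python sketch of (8)–(9), using the example penalty $p(k)=K\sigma^2 k\log(en/k)$ from above:

```python
import numpy as np

def ht_estimator(X: np.ndarray, sigma: float, K: float = 2.0) -> np.ndarray:
    """Hard-thresholding estimator (8)-(9), with example penalty p(k) = K*sigma^2*k*log(en/k)."""
    n = X.shape[0]
    x2 = np.sort(X**2)[::-1]                          # squared observations, decreasing
    tail = np.concatenate(([x2.sum()], x2.sum() - np.cumsum(x2)))   # tail[k] = sum_{i>k} X^2_{[i]}
    k = np.arange(n + 1)
    pen = K * sigma**2 * k * np.log(np.e * n / np.maximum(k, 1))    # convention 0*log(en/0) = 0
    k_hat = int(np.argmin(tail + pen))
    if k_hat == 0:
        return np.zeros_like(X)                        # empty selected set
    t = np.sort(np.abs(X))[::-1][k_hat - 1]            # threshold |X_{[k_hat]}|
    return X * (np.abs(X) >= t)                        # keep the k_hat largest |X_i|
```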





5 EBMA and EBMS Procedures for the Case of Unknown Sparsity Cluster Value
Clearly, the thresholding approach relies heavily on the fact that the sparsity cluster value is known, zero by default. Assume now that there is a large (sparsity) cluster of (almost) equal coordinates of $\theta$, but its value is not known. Alternatively, one can think of so-called robust inference, in the sense that there may be a systematic error in the “known” sparsity cluster value zero, so that the true sparsity cluster value actually deviates from zero. Using a thresholding procedure in such a situation would lead to a large cumulative error, because the sparsity cluster contains most of the coordinates of the high-dimensional signal $\theta$.
In the EBMS approach, the empirical Bayes criterion leads to the selected set
$$ \hat{I}=\arg \!\min _{I\in \mathcal {I}}\Big \{\sum _{i \in I^c} (X_i-\bar{X}_{I^c})^2 + K \sigma ^2 |I| \log \big (\tfrac{en}{|I|}\big )\Big \}=\big \{i\in [n]: X_i\ne X_{[\hat{j}+t]},\, t\in [\hat{k}]\big \}, $$
where $X_{[1]}\ge X_{[2]} \ge \ldots \ge X_{[n]}$ are the ordered values of the observations and
$$ (\hat{k},\hat{j})=\mathop {\arg \!\min }_{k,j\in [n]_0,\; k+j\le n}\Big \{\sum _{i=j+1}^{j+k} (X_{[i]}-\bar{X}_{jk})^2+K \sigma ^2 (n-k) \log \big (\tfrac{en}{n-k}\big )\Big \}, \quad \bar{X}_{jk}=\frac{1}{k} \sum _{i=1+j}^{k+j}X_{[i]}. $$
The resulting EBMS estimator $\hat{\theta }^{EB}=\hat{\theta }^{EB}(K,X) =(\hat{\theta }_i^{EB},\ i\in [n])$ is given by
$$\begin{aligned} \hat{\theta }_i^{EB}= X_i\mathbbm {1}\{X_i\ne X_{[t+\hat{j}]},\, t\in [\hat{k}]\}+\bar{X}_{\hat{j}\hat{k}} \mathbbm {1}\{X_i= X_{[t+\hat{j}]},\, t\in [\hat{k}]\}, \quad i \in [n]. \end{aligned} \tag{10}$$
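A minimal Python sketch of the EBMS search over $(k,j)$ and the estimator (10); with prefix sums the double minimization costs $O(n^2)$ operations after sorting. Variable names are illustrative.

```python
import numpy as np

def ebms_estimator(X: np.ndarray, sigma: float, K: float = 2.0) -> np.ndarray:
    """EBMS estimator (10): the block X_{[j+1]},...,X_{[j+k]} of the decreasingly
    sorted observations plays the role of the sparsity cluster I^c."""
    n = X.shape[0]
    order = np.argsort(-X)                               # X[order] gives X_{[1]} >= ... >= X_{[n]}
    Xs = X[order]
    S1 = np.concatenate(([0.0], np.cumsum(Xs)))          # prefix sums of X_{[i]}
    S2 = np.concatenate(([0.0], np.cumsum(Xs**2)))       # prefix sums of X_{[i]}^2

    # k = 0 (no cluster) gives criterion 0 + K*sigma^2*n*log(en/n) = K*sigma^2*n
    best_crit, k_hat, j_hat = K * sigma**2 * n, 0, 0
    for k in range(1, n + 1):                            # cluster size k = |I^c|
        pen = K * sigma**2 * (n - k) * np.log(np.e * n / max(n - k, 1))  # 0*log(en/0) := 0
        for j in range(0, n - k + 1):                    # cluster block starts at position j+1
            s1, s2 = S1[j + k] - S1[j], S2[j + k] - S2[j]
            crit = s2 - s1**2 / k + pen                  # within-block sum of squares + penalty
            if crit < best_crit:
                best_crit, k_hat, j_hat = crit, k, j

    theta_hat = X.copy()
    if k_hat > 0:
        cluster_mean = (S1[j_hat + k_hat] - S1[j_hat]) / k_hat
        theta_hat[order[j_hat:j_hat + k_hat]] = cluster_mean   # cluster coordinates -> block mean
    return theta_hat
```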

An alternative is to use the EBMA method. All posterior quantities involved in the construction of the EBMA estimator given by (5) are explicit; the major issue is that the number of terms in (5) is exponential in the dimension, so that direct computation is not practically feasible in high dimensions. Therefore, in this case, we have to resort to an MCMC procedure. The sampler runs on binary vectors $s=(s_1,\ldots,s_n)\in\{0,1\}^n$ representing the sets $I=\{i\in[n]: s_i=1\}$: given the current state s, a candidate state $s'$ is proposed and accepted with probability
$$\alpha =\min \big \{1, \; \tilde{\pi }\big (\{i\in [n]: s_i'=1\}|X\big )/\tilde{\pi }\big (\{i\in [n]: s_i=1\}|X\big )\big \}.$$
The collected states are then used to approximate the EBMA estimator, cf. (11).
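A minimal sketch of such a sampler, assuming a symmetric single-coordinate-flip proposal (so the acceptance probability reduces to the plain posterior ratio above) and taking the posterior $\tilde{\pi}(\cdot|X)$ as a user-supplied log-density `log_post`; the default burn-in, chain length and thinning match the values used in Sect. 6. The EBMA estimate is approximated by averaging $\hat{\mu}(I)$ (the `mu_hat` sketch from Sect. 3.2) over the retained states.

```python
import numpy as np

def ebma_mcmc(X, log_post, n_burn=10_000, n_iter=25_000, thin=25, rng=None):
    """Metropolis sampler over inclusion vectors s in {0,1}^n.

    log_post(I): log of pi-tilde(I | X), up to an additive constant, for a boolean mask I
    (its exact form comes from the empirical Bayes posterior and is not reproduced here).
    """
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    s = rng.integers(0, 2, size=n).astype(bool)          # initial inclusion vector
    lp = log_post(s)
    draws = []
    for it in range(n_burn + n_iter):
        s_prop = s.copy()
        i = rng.integers(n)
        s_prop[i] = ~s_prop[i]                            # flip one randomly chosen coordinate
        lp_prop = log_post(s_prop)
        if np.log(rng.uniform()) < lp_prop - lp:          # accept with prob min{1, ratio}
            s, lp = s_prop, lp_prop
        if it >= n_burn and (it - n_burn) % thin == 0:
            draws.append(mu_hat(X, s))                    # posterior mean for the current set
    return np.mean(draws, axis=0)                         # approximate EBMA estimate
```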






6 Comparative Simulation Study
In this section, we present a comparative simulation study for the cases of known and unknown (or shifted) sparsity cluster value.
We generate observations according to the model (1), where we use signals $\theta$ of the form $\theta=(A_1,\ldots,A_p,c,\ldots,c)$: the first p coordinates of $\theta$ are “significant,” and the remaining $n-p$ entries, all equal to the cluster value c, form the sparsity cluster. We consider different “sparsity levels” $p\in\{25,50,100\}$ and “signal strengths”: $A_i\sim \mathrm{U}[0,2]$ (signal is undetectable, i.e., comparable to the noise); $A_i\sim \mathrm{U}[2,4]$ (signal is barely distinct from the noise); $A_i\sim \mathrm{U}[4,6]$ (signal is well distinct from the noise). Next, we consider two situations: (a) known sparsity cluster value $c=0$; (b) unknown sparsity cluster value, which is set to a non-zero constant in the simulations.
The following estimators are considered: the projection oracle (PO) estimator, which keeps $X_i$ for the coordinates in the true active set and puts the true sparsity cluster value in the remaining coordinates; the empirical Bayes mean (EBMean) considered in [8], with a standard Laplace prior, as implemented in the R-package EbayesThresh (see [9]); the classical universal hard-thresholding (UHT) estimator (see [7]) $\hat{\theta}^{UHT}_i=X_i\mathbbm{1}\{|X_i|\ge \sigma\sqrt{2\log n}\}$; the HT estimator $\check{\theta}^{HT}$ defined by (9); the EBMA estimator given by (5); and, finally, the EBMS estimator $\hat{\theta}^{EB}$ defined by (10). The PO procedure is not really an estimator, as it uses oracle knowledge of the active set and of the sparsity cluster value; it cannot be outperformed by any practical estimation procedure, and its performance is provided only for reference, as a benchmark for the ideal situation.
The estimators EBMean, UHT, and HT are all geared towards the known (zero) sparsity cluster value, whereas our EBMA and EBMS estimators can also accommodate an unknown sparsity cluster value. To create more competition for our procedures in the case of an unknown sparsity cluster value, we also provide adjusted versions aEBMean, aUHT, and aHT of these estimators, constructed as follows: $\tilde{\theta}^{a}(X)=\tilde{\theta}(X-\bar{X}\mathbf{1}_n)+\bar{X}\mathbf{1}_n$, where $\mathbf{1}_n$ is an n-dimensional vector of ones, $\bar{X}$ is the empirical mean of the sample X, and $\tilde{\theta}$ is the corresponding estimator (respectively, EBMean, UHT, and HT). In the case of an unknown non-zero sparsity cluster value, the adjusted versions are clearly biased and only competitive for relatively small p and $A_i$'s; they are expected to perform worse (and they do, as Table 2 shows) for larger values of p and $A_i$'s.
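The adjustment is a one-liner; a sketch (the wrapper name `adjust` is illustrative):

```python
import numpy as np

def adjust(estimator, X):
    """Adjusted version of a zero-cluster estimator:
    theta_a(X) = estimator(X - Xbar*1_n) + Xbar*1_n."""
    Xbar = X.mean()
    return estimator(X - Xbar) + Xbar

# e.g. aHT = adjust(lambda Z: ht_estimator(Z, sigma=1.0, K=2.0), X)
```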
Each of our estimators $\check{\theta}^{HT}$, $\hat{\theta}^{EB}$ and the EBMA estimator depends on one tuning parameter (K for the first two, $\kappa$ for the EBMA procedure). It is possible to choose K and $\kappa$ from the data via a cross-validation procedure, but this significantly increases the running time, especially for the EBMA procedure. However, in the simulation results for the various cases considered, the optimal K did not vary much and mostly lay in the range [1.8, 3.2]. Moreover, the results were good for many choices of K; the performance deteriorates significantly only when K gets close to 1 or becomes too large. This agrees with the conclusions (about the penalty constants) from [5]. In the simulations for the EBMS-type estimators $\check{\theta}^{HT}$ and $\hat{\theta}^{EB}$, one fixed choice of K appeared to be fairly good in all considered cases. When computing the EBMA estimator, we took a value of $\kappa$ that is a natural choice in the light of Remark 1. The EBMA procedure seemed to be even less sensitive to the choice of its parameter: many choices are possible as long as $\kappa$ is not too small (it should be larger than 0.7) and not too large. We let the chain burn in for 10000 iterations and then collected 25000 states from the posterior. The final sample of states used to approximate the EBMA estimator was obtained by keeping every 25-th state, resulting in 1000 retained states in (11). This thinning was done to reduce the correlation between the samples from the MCMC procedure.
Tables 1 and 2 contain estimates of the mean square errors (MSE) of the above-mentioned estimators for the different choices of the signal $\theta$. Tables 1 and 2 concern the cases of the known and unknown sparsity cluster value, respectively. The MSE of an estimator $\tilde{\theta}$ is evaluated by the average squared error $l^{-1}\sum_{j=1}^{l}\|\tilde{\theta}^{(j)}-\theta\|^2$ of l estimates $\tilde{\theta}^{(1)},\ldots,\tilde{\theta}^{(l)}$ computed from l data vectors simulated independently from the model (1).
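A sketch of this Monte Carlo evaluation (the number of replications l = 100 and the estimator collection are illustrative):

```python
import numpy as np

def estimate_mse(theta, sigma, estimators, l=100, rng=None):
    """Average squared error over l independent data vectors from model (1).

    estimators: dict mapping a name to a function X -> theta_hat.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = theta.shape[0]
    errs = {name: 0.0 for name in estimators}
    for _ in range(l):
        X = theta + sigma * rng.standard_normal(n)
        for name, est in estimators.items():
            errs[name] += np.sum((est(X) - theta) ** 2)
    return {name: err / l for name, err in errs.items()}
```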

Table 1 Estimated MSEs for the case (a) of the known sparsity cluster value
Estimator | p=25, U[0,2] | p=25, U[2,4] | p=25, U[4,6] | p=50, U[0,2] | p=50, U[2,4] | p=50, U[4,6] | p=100, U[0,2] | p=100, U[2,4] | p=100, U[4,6] |
---|---|---|---|---|---|---|---|---|---|
PO | 25 | 25 | 25 | 50 | 50 | 50 | 99 | 99 | 99 |
EBMean | 34 | 96 | 91 | 64 | 164 | 172 | 111 | 273 | 319 |
UHT | 39 | 157 | 68 | 75 | 316 | 137 | 141 | 626 | 270 |
HT | 37 | 127 | 62 | 72 | 194 | 123 | 138 | 300 | 233 |
EBMA | 36 | 103 | 79 | 66 | 178 | 162 | 114 | 291 | 254 |
EBMS | 42 | 132 | 64 | 73 | 204 | 124 | 121 | 313 | 233 |
Table 2 Estimated MSEs for the case (b) of an unknown sparsity cluster value
Estimator | p=25, U[0,2] | p=25, U[2,4] | p=25, U[4,6] | p=50, U[0,2] | p=50, U[2,4] | p=50, U[4,6] | p=100, U[0,2] | p=100, U[2,4] | p=100, U[4,6] |
---|---|---|---|---|---|---|---|---|---|
PO | 25 | 25 | 25 | 50 | 50 | 50 | 99 | 99 | 99 |
EBMean | 136 | 178 | 176 | 154 | 229 | 240 | 182 | 312 | 348 |
aEBMean | 57 | 108 | 118 | 100 | 201 | 254 | 162 | 332 | 441 |
UHT | 162 | 280 | 191 | 191 | 432 | 254 | 245 | 730 | 374 |
aUHT | 69 | 174 | 96 | 129 | 380 | 285 | 224 | 811 | 916 |
HT | 157 | 256 | 206 | 186 | 327 | 268 | 240 | 414 | 360 |
aHT | 68 | 128 | 104 | 127 | 253 | 300 | 222 | 516 | 577 |
EBMA | 66 | 97 | 79 | 122 | 176 | 161 | 201 | 281 | 251 |
EBMS | 69 | 107 | 59 | 128 | 170 | 120 | 224 | 275 | 231 |