© Springer Nature Switzerland AG 2020
M. La Rocca et al. (eds.), Nonparametric Statistics, Springer Proceedings in Mathematics & Statistics 339, https://doi.org/10.1007/978-3-030-57306-5_8

Robust Estimation of Sparse Signal with Unknown Sparsity Cluster Value

Eduard Belitser (1), Nurzhan Nurushev (2) and Paulo Serra (3)
(1)
Department of Mathematics, VU Amsterdam, Amsterdam, The Netherlands
(2)
Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Amsterdam, The Netherlands. Research funded by the Netherlands Organisation for Scientific Research (NWO)
(3)
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
 
 
Corresponding author: Eduard Belitser

Abstract

In the signal+noise model, we assume that the signal has a more general sparsity structure in the sense that the majority of signal coordinates are equal to some value which is assumed to be unknown, contrary to the classical sparsity context where one knows the sparsity cluster value (typically, zero by default). We apply an empirical Bayes approach (linked to the penalization method) for inference on the signal, possibly sparse in this more general sense. The resulting method is robust in that we do not need to know the sparsity cluster value; in fact, the method extracts as much generalized sparsity as there is in the underlying signal. However, as compared to the case of a known sparsity cluster value, the proposed robust method can no longer be reduced to a thresholding procedure. We propose two new procedures: the empirical Bayes model averaging (EBMA) and the empirical Bayes model selection (EBMS) procedures. The former is realized by an MCMC algorithm based on the partial (mixed) normal–normal conjugacy built into our modeling stage, and the latter is based on a new optimization algorithm of $$O(n^2)$$-complexity. We perform simulations to demonstrate how the proposed procedures work and accommodate possible systematic error in the sparsity cluster value.

Keywords
Empirical Bayes · Sparse signal · Unknown sparsity cluster value

1 Introduction

The principle of parsimony, known as Occam’s razor, is arguably one of the most fundamental ideas that pervade science, and sparsity has become a popular paradigm in statistical analyses, as a particular manifestation of the parsimony principle in the context of modern statistics. Much of this popularity has been driven by the success of frequentist (and Bayesian) methods utilizing the underlying sparsity structure of the unknown parameter of interest.

In general, a sparse signal is a high-dimensional parameter which allows a parsimonious representation. In signal processing, this is typically expressed by assuming that it contains only a small number of non-zero elements compared to its dimension. The sparsity cluster value zero has the interpretation that the corresponding coordinates are “insignificant.” Any other sparsity cluster value can be handled as well in the analysis (in fact, one can always reduce to the zero case by subtracting that value), as long as this value is known a priori to the observer.

In the signal+noise setting, the best-studied problem is that of signal estimation, and a variety of estimation methods and results are available in the literature; see [1, 3–8]. Thresholding strategies are particularly appealing, mainly because thresholding automatically generates sparsity. In addition, the corresponding procedures generally exhibit fast convergence properties. Moreover, thresholding processes the signal in a coordinate-wise fashion, resulting in low-complexity algorithms (typically of order n), which are easy to implement in practice.

Many methods have Bayesian connections. For example, even some seemingly non-Bayesian estimators can be obtained as certain quantities (like the posterior mode for penalized estimators) of the posterior distributions resulting from imposing some specific priors on the parameter; cf. [1, 2, 4, 6, 8, 11]. A common Bayesian way to model the sparsity structure is via two-group priors. Such a prior puts positive mass on vectors $$\theta $$ with some exactly zero coordinates (zero group), while the remaining coordinates (signal group) are drawn from a chosen distribution. As pointed out by [6] (and also by [8]), the prior distributions of the non-zero coordinates should not have too light tails; otherwise, one gets sub-optimal convergence rates (or even inconsistency). The important Gaussian case is, for example, excluded; [6, 8] therefore use heavy-tailed priors. On the other hand, in [4], it was shown that normal priors are still usable and lead to strong local results (even for non-iid, non-normal models) if combined with the empirical Bayes approach.

However, all the above-mentioned approaches rely on the essential assumption that the sparsity cluster value is known to the observer (it is set to zero by default). In this note, we relax these modeling assumptions by allowing the sparsity cluster value to be an unknown constant, obtaining a robust formulation of the estimation problem. This situation can occur when, for example, there is a systematic error in the observations and the sparsity coordinates get shifted by an unknown value (the bias of the systematic error), leading to what we call a sparsity cluster with unknown cluster value. It is clear that thresholding procedures are not applicable in this situation, so we need to deal with methodological and computational issues. We address the first aspect by applying an empirical Bayes approach, which delivers two robust procedures: the empirical Bayes model averaging (EBMA) and empirical Bayes model selection (EBMS) procedures. As to the computational issue, the former procedure is realized by an MCMC algorithm based on the partial (mixed) normal–normal conjugacy in the model, and the latter is based on a new optimization algorithm of $$O(n^2)$$-complexity (cf. O(n)-complexity for typical thresholding procedures). We perform simulations to demonstrate how the proposed procedures work and accommodate possible systematic error in the sparsity cluster value.

2 Setting and Notation

Suppose we observe $$X=X^{(\sigma ,n)}=(X_1,\ldots ,X_n)$$:
$$\begin{aligned} X_i=\theta _i+ \sigma \xi _i, \quad i\in [n]=\{1,\ldots ,n\}, \end{aligned}$$
(1)
where $$\theta =(\theta _1,\ldots , \theta _n) \in \mathbb {R}^n$$ is an unknown high-dimensional parameter of interest, $$\xi _i\overset{\mathrm{ind}}{\sim }N(0,1)$$, and $$\sigma >0$$ is the known noise intensity. The goal is to make inference on the parameter $$\theta $$ based on the data X. We exploit the empirical Bayes approach and make a connection with the penalization method.

Denote the probability measure of X from the model (1) by $$\mathrm {P}_\theta =\mathrm {P}^{(\sigma ,n)}_\theta $$, and by $$\mathrm {E}_\theta $$ the corresponding expectation. For notational simplicity, we often skip the dependence on $$\sigma $$ and n. Denote by $$\mathbbm {1}_E=\mathbbm {1}\{E\}$$ the indicator function of the event E, and by |S| the cardinality of the set S. Let $$[n]=\{1,\ldots ,n\}$$ and $$[n]_0=\{0\} \cup [n]$$ for $$n\in \mathbb {N}=\{1,2,\ldots \}$$. For $$I\subseteq [n]$$, define $$I^c=\{i\in [n]: i \not \in I\}$$. Let $$\mathcal {I}=\mathcal {I}_n=\{I: \, I \subseteq [n]\}$$ be the family of all subsets of [n], including the empty set. Throughout, we adopt the conventions $$\bar{a}_I=\frac{1}{|I|} \sum _{i\in I} a_i$$, $$\sum _{i\in \varnothing }a_i=0$$, $$\prod _{i\in \varnothing }a_i=1$$, $$\sum _a^b a_i=\sum _{a\le i \le b} a_i$$, $$\sum _i a_i =\sum _{i\in [n]} a_i$$, $$\sum _I a_I =\sum _{I\in \mathcal {I}} a_I$$ for any $$a_i, a_I, a,b\in \mathbb {R}$$, and $$0\log (c/0)=0$$ (hence $$(c/0)^0=1$$) for any $$c>0$$. Let $$X_{[1]}^2\ge X_{[2]}^2\ge \cdots \ge X_{[n]}^2$$ be the ordered values of $$X_1^2,\ldots ,X_n^2$$, and introduce also $$X_{[0]}^2=\infty $$.

Throughout, $$\varphi (x,\mu ,\sigma ^2)$$ will be the density of $$\mu +\sigma Z\sim N(\mu ,\sigma ^2)$$ at point x, where $$Z\sim N(0,1)$$. By convention, $$N(\mu ,0)=\delta _\mu $$ denotes a Dirac measure at point $$\mu $$. Finally, let $$\Vert x\Vert $$ denote the usual norm of $$x\in \mathbb {R}^n$$.

3 Empirical Bayes Approach

First, we introduce a family of normal priors (similar to priors from [4]). Next, by applying the empirical Bayes approach to the normal likelihood, we derive an empirical Bayes posterior for the case of unknown sparsity cluster value, and use this posterior in further inference on $$\theta $$.

3.1 Multivariate Normal Prior

To model a possible sparsity cluster with unknown cluster value in the parameter $$\theta $$, the coordinates of $$\theta $$ can be split into two distinct groups: for some $$I \in \mathcal {I}$$, $$\theta _I = (\theta _i, i \in I)$$ and $$\theta _{I^c}=(\theta _i, i \in I^c)$$, so that $$\theta =(\theta _I, \theta _{I^c})$$. The group $$\theta _{I^c}=(\theta _i, i \not \in I)$$ consists of coordinates that are all assumed to be (almost) equal to some cluster value $$\mu _c$$, and $$\theta _{I}=(\theta _i, i \in I)$$ is the group of coordinates significantly different from $$\mu _c$$. To model sparsity with an unknown sparsity cluster value, we propose a prior on $$\theta $$ given I as follows:
$$\begin{aligned} \pi _I&=\bigotimes \nolimits _i N\big (\mu _i(I), \tau _i^2(I)\big ), \end{aligned}$$
where $$\mu _i(I)=\mu _i \mathbbm {1}\{i\in I\}+\mu _c\mathbbm {1}\{i\not \in I\}$$, $$\tau ^2_i(I)=\sigma ^2K_n(I)\mathbbm {1}\{i \in I\}$$, $$K_n(I)=(\tfrac{en}{|I|}-1)\mathbbm {1} \{I\not = \varnothing \}$$. The indicators in the above prior ensure the sparsity of the group $$I^c$$. The rather specific choice of $$K_n(I)$$ is made for the sake of concise expressions in later calculations; many other choices are possible. By using the normal likelihood $$\ell (\theta , X) =(2\pi \sigma ^2)^{-n/2}\exp \{-\Vert X-\theta \Vert ^2/2\sigma ^2\}$$, the corresponding posterior distribution for $$\theta $$ is readily obtained:
$$\begin{aligned} \pi _I(\vartheta |X) = \bigotimes \nolimits _i N\Big (\frac{\tau _i^2(I) X_i+\sigma ^2\mu _i(I)}{\tau _i^2(I)+\sigma ^2}, \frac{\tau _i^2(I)\sigma ^2}{\tau _i^2(I)+\sigma ^2}\Big ). \end{aligned}$$
(2)
Next, introduce the prior $$\lambda $$ on $$\mathcal {I}$$. For $$\varkappa >0$$, draw a random set from $$\mathcal {I}$$ with probabilities
$$\begin{aligned} \lambda _I=c_{\varkappa ,n} \exp \big \{-\varkappa |I| \log (\tfrac{en}{|I|})\big \} = c_{\varkappa ,n}(\tfrac{en}{|I|})^{-\varkappa |I|}, \quad I \in \mathcal {I}, \end{aligned}$$
where $$c_{\varkappa ,n}$$ is the normalizing constant.
Remark 1

A logical choice for $$\lambda _I$$ seems to be the uniform prior on $$\mathcal {I}$$: $$\bar{\lambda }_I=\binom{n}{|I|}^{-1}$$. However, this prior is not monotone with respect to the cardinality |I|, whereas we would like to penalize large cardinalities. As $$(\frac{n}{k})^k \le \binom{n}{k} \le (\frac{en}{k})^k$$ for $$k\in [n]_0$$, we take the above-defined prior $$\lambda _I$$ as a monotone (in |I|) proxy for $$\bar{\lambda }_I$$, with an extra parameter $$\varkappa $$ to control the amount of penalization; $$\varkappa =1$$ corresponds to the prior $$\bar{\lambda }_I$$.

Combining the conditional prior $$\pi _I$$ with the prior $$\lambda _I$$ gives the mixture prior on $$\theta $$: $$\pi = \sum _I \lambda _I \pi _I$$. This leads to the marginal distribution of X: $$\mathrm {P}_X=\sum _I \lambda _I \mathrm {P}_{X,I}$$, with $$\mathrm {P}_{X,I}= \bigotimes _i N\big (\mu _i(I), \sigma ^2+\tau _i^2(I)\big )$$, and the posterior of $$\theta $$ is
$$\begin{aligned} \pi (\vartheta |X) =\sum _I\pi _I(\vartheta |X)\pi (I|X), \end{aligned}$$
(3)
where $$\pi _I(\vartheta |X)$$ is defined by (2) and the posterior $$\pi (I|X)$$ for I is
$$\begin{aligned} \pi (I|X)= \frac{\lambda _I \mathrm {P}_{X,I}}{\mathrm {P}_X}= \frac{\lambda _I \prod _i \varphi \big (X_i,\mu _i(I), \sigma ^2+\tau _i^2(I)\big )}{\sum _{J\in \mathcal {I}}\lambda _J\prod _i \varphi \big (X_i,\mu _i(J), \sigma ^2+\tau _i^2(J)\big )}. \end{aligned}$$
(4)

3.2 Empirical Bayes Posterior

The parameters $$\mu _i(I)$$ are yet to be chosen in the prior. We choose $$\mu _i(I)$$ by using the empirical Bayes approach. The marginal likelihood $$\mathrm {P}_X$$ is readily maximized with respect to $$\mu _i(I)$$: $$\hat{\mu }_i(I)=X_i$$ for $$i\in I$$ and $$\hat{\mu }_i(I) = \bar{X}_{I^c}$$ for $$i \in I^c$$, where $$\bar{X}_{I^c} = \frac{1}{|I^c|}\sum _{i\in I^c} X_i$$. We substitute $$\hat{\mu }=(\hat{\mu }(I), I \in \mathcal {I})$$ for $$\mu =(\mu (I), I \in \mathcal {I})$$ in the expression (3) for $$\pi (\vartheta |X)$$, obtaining the empirical Bayes posterior (called the empirical Bayes model averaging (EBMA) posterior)
$$\begin{aligned} \tilde{\pi }(\vartheta |X)=\sum _I \tilde{\pi }_I(\vartheta |X)\tilde{\pi }(I|X), \end{aligned}$$
where the empirical Bayes conditional posterior (recall that $$N(\mu ,0) =\delta _\mu $$)
$$\begin{aligned} \tilde{\pi }_I(\vartheta |X)&=\prod _{i\in I} N\big (X_i, \tfrac{K_n(I)\sigma ^2}{K_n(I)+1} \big ) \otimes \prod _{i\in I^c} \delta _{\bar{X}_{I^c}} \end{aligned}$$
is obtained from (2) with $$\mu _i(I)=\hat{\mu }_i(I)=X_i \mathbbm {1}\{i \in I\} + \bar{X}_{I^c}\mathbbm {1}\{i \in I^c \} $$, and
$$\begin{aligned} \tilde{\pi }(I|X)=\frac{\lambda _I \mathrm {P}_{X,I}}{\sum _{J\in \mathcal {I}} \lambda _J\mathrm {P}_{X,J}} =\frac{\lambda _I\prod _i \varphi (X_i,\hat{\mu }_i(I) ,\sigma ^2+\tau ^2_i(I))}{\sum _{J\in \mathcal {I}} \lambda _J\prod _i \varphi (X_i, \hat{\mu }_i(J),\sigma ^2+\tau ^2_i(J))} \end{aligned}$$
is the empirical Bayes posterior for $$I \in \mathcal {I}$$, obtained from (4) with $$\mu _i(I)=\hat{\mu }_i(I)$$. Let $$\tilde{\mathrm {E}}$$ and $$\tilde{\mathrm {E}}_I$$ be the expectations with respect to the measures $$\tilde{\pi }(\vartheta |X)$$ and $$\tilde{\pi }_I(\vartheta |X)$$ respectively. Then $$\tilde{\mathrm {E}}_I(\vartheta |X)=\hat{\mu }(I) =(X_i \mathbbm {1}\{i \in I\}+\bar{X}_{I^c}\mathbbm {1}\{i \in I^c\}, i\in [n])$$. Introduce the EBMA mean estimator
$$\begin{aligned} \tilde{\theta }^{EB}&=\tilde{\theta }^{EB}(\varkappa ,X)=\tilde{\mathrm {E}} (\vartheta |X) = \sum _{I\in \mathcal {I}}\tilde{\mathrm {E}}_I(\vartheta |X) \tilde{\pi }(I|X) =\sum _{I\in \mathcal {I}}\hat{\mu }(I)\tilde{\pi }(I|X). \end{aligned}$$
(5)
Consider an alternative (“more Bayesian”) empirical Bayes posterior. First, derive an empirical Bayes variable selector $$\hat{I}$$ by maximizing $$\tilde{\pi }(I|X)$$ over $$I \in \mathcal {I}$$ (any maximizer will do) as follows:
$$\begin{aligned} \hat{I}\!&=\!\arg \!\max _{I\in \mathcal {I}}{\tilde{\pi }}(I|X) =\arg \!\max _{I\in \mathcal {I}} \Big \{\!-\!\sum _{i \in I^c } \tfrac{(X_i-\bar{X}_{I^c})^2}{2\sigma ^2} -\tfrac{|I|}{2}\log (K_n(I)+1)+\log \lambda _I\Big \} \nonumber \\&=\arg \!\min _{I\in \mathcal {I}}\Big \{\sum _{i \in I^c} (X_i-\bar{X}_{I^c})^2 + (2\varkappa +1) \sigma ^2 |I| \log \big (\tfrac{en}{|I|}\big )\Big \}. \end{aligned}$$
(6)
Now plugging $$\hat{I}$$ into $$\tilde{\pi }_I(\vartheta |X)$$ yields another empirical Bayes posterior (called the empirical Bayes model selection (EBMS) posterior) and the corresponding EBMS mean estimator for $$\theta $$: with $$\hat{\mu }(I)=(X_i \mathbbm {1}\{i \in I\}+\bar{X}_{I^c}\mathbbm {1}\{i \in I^c\}, i\in [n])$$,
$$\begin{aligned} \hat{\pi } (\vartheta |X)= \tilde{\pi }_{\hat{I}} (\vartheta |X), \;\; \hat{\theta }^{EB}=\hat{\mathrm {E}}(\vartheta |X)= \hat{\mu }(\hat{I}) =(\hat{\mu }_i(\hat{I}), i\in [n]), \end{aligned}$$
(7)
where $$\hat{\mathrm {E}}$$ denotes the expectation with respect to the measure $$\hat{\pi }(\vartheta |X)$$.
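For implementation purposes, note that the first line of (6) gives $$\log \tilde{\pi }(I|X)$$ up to an additive constant not depending on I; this is the only quantity needed both for the EBMS optimization and for the acceptance ratios of the MCMC algorithm in Sect. 5. The sketch below (the helper name log_post_I and the default values of $$\sigma $$ and $$\varkappa $$ are ours, chosen for illustration) evaluates it using the conventions of Sect. 2.

```python
import numpy as np

def log_post_I(X, in_I, sigma=1.0, kappa=1.0):
    """Unnormalized log of the empirical Bayes posterior tilde{pi}(I | X).

    `in_I` is a boolean mask of length n: True for i in I (the "significant"
    group), False for i in I^c (the sparsity cluster).  Up to an additive
    constant this equals
        -(1/(2 sigma^2)) * sum_{i in I^c} (X_i - Xbar_{I^c})^2
        - (kappa + 1/2) * |I| * log(e n / |I|),
    which is the expression maximized in (6).
    """
    X = np.asarray(X, dtype=float)
    in_I = np.asarray(in_I, dtype=bool)
    n = X.size
    k = int(in_I.sum())                    # |I|
    Xc = X[~in_I]                          # coordinates in the sparsity cluster I^c
    rss = ((Xc - Xc.mean()) ** 2).sum() if Xc.size else 0.0
    pen = (kappa + 0.5) * k * np.log(np.e * n / k) if k else 0.0   # 0*log(en/0) = 0
    return -rss / (2.0 * sigma ** 2) - pen
```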

4 Known Sparsity Cluster Value: Thresholding Procedures

In the traditional sparsity setting, the sparsity cluster value is assumed to be known and set to zero without loss of generality. This case is well studied; various estimators have been proposed and studied in the literature, see [1, 4–8] and further references therein. Many estimation procedures originate as penalized estimators minimizing the criterion $$\text {crit}(X,\theta )=\Vert X-\theta \Vert ^2 + P(\theta )$$, for some appropriately chosen penalties $$P(\theta )$$, or as (empirical) Bayes estimators according to appropriately chosen priors. An extensive discussion on this can be found in [1].

Notice that whenever the penalty $$P(\theta )$$ in $$\text {crit}(X,\theta )$$ is of $$\ell _0$$-type, i.e., $$P(\theta )=p(\Vert \theta \Vert _0)$$ for some function p and $$\Vert \theta \Vert _0= \sum _i\mathbbm {1}\{\theta _i \not = 0\}$$, the resulting penalized estimator is a thresholding estimator $$\check{\theta }_i=X_i \mathbbm {1}\big \{|X_i| \ge \check{t}\big \}$$, where $$\check{t}=|X_{[\check{k}]}|$$ and $$\check{k}$$ is the minimizer of $$\sum _{i=k+1}^n X^2_{[i]} +p(k)$$, $$k\in [n]_0$$ (recall that $$\infty =X_{[0]}^2>X_{[1]}^2\ge \ldots \ge X_{[n]}^2$$). Thresholding strategies are particularly appealing because thresholding automatically generates sparsity. Besides, thresholding procedures generally exhibit fast convergence properties and process the signal in a coordinate-wise way, which results in low-complexity algorithms. There is a vast literature on this topic, see, e.g., [10] and further references therein.

Remark 2

The Bayesian approach can be connected to the penalized estimation by relating the penalties to the corresponding priors on $$\theta $$. Penalties of $$\ell _0$$-type can be linked to Bayesian procedures involving priors on the number of non-zero entries of $$\theta $$, see [1].

Within the framework of the present paper, the case of the known sparsity cluster value corresponds to taking $$\mu _c=0$$ in the prior $$\pi _I$$. This leads to $$\hat{\mu }_i(I) =0$$ for $$i \in I^c$$ in all the posterior quantities of Sect. 3.2, and the criterion (6) reduces to
$$ \check{I}=\arg \!\min _{I\in \mathcal {I}}\Big \{\sum _{i \in I^c} X_i^2 + K\sigma ^2 |I| \log \big (\tfrac{en}{|I|}\big )\Big \},\quad K=2\varkappa +1, $$
which is reminiscent of the penalization procedure from [5] (cf. also [1]), with the penalty $$p(k) = K\sigma ^2 k \log (\tfrac{en}{k})$$. Indeed, it can be easily seen that
$$\begin{aligned} \check{I}=\big \{i\in [n]: |X_i| \ge \check{t}=|X_{[\check{k}]}|\big \}, \quad \check{k}=\arg \!\min _{k\in [n]_0} \Big \{\sum _{i=k+1}^n X^2_{[i]} +p(k)\Big \}, \end{aligned}$$
(8)
and the EBMS procedure yields the corresponding thresholding estimator $$\check{\theta }^{HT}=\check{\theta }^{HT}(K,X)=(\check{\theta }_i^{HT}, i\in [n])$$, with
$$\begin{aligned} \check{\theta }_i^{HT} = X_i\mathbbm {1}\{i \in \check{I}\}= X_i\mathbbm {1}\{|X_i|\ge |X_{[\check{k}]}|\}, \;\; i \in [n]. \end{aligned}$$
(9)
The penalty p(k) corresponds to the complete variable selection case in [5]. Recall that our rather specific choice of the parameter $$K_n(I)$$ in the prior $$\pi _I$$ results in this penalty. As mentioned, other choices of $$K_n(I)$$ are also possible and would lead to other penalties, but the main term $$\sigma ^2k\log (\tfrac{en}{k})$$ would always be present because of the choice of the prior $$\lambda _I$$. The optimality of this kind of penalties (and priors) is discussed in [1, 4, 5]. In [1], it is concluded that essentially only such penalties lead to adaptive penalized estimators over certain sparsity scales.
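For concreteness, here is a short sketch of the thresholding procedure (8)–(9); the function name and the default choice $$K=2.5$$ (the value used in the simulations of Sect. 6) are ours.

```python
import numpy as np

def ebms_threshold(X, sigma=1.0, K=2.5):
    """Hard-thresholding estimator (8)-(9) for a known (zero) sparsity cluster value.

    k_check minimizes  sum_{i>k} X_[i]^2 + K sigma^2 k log(en/k)  over k = 0,...,n,
    and coordinates with |X_i| >= |X_[k_check]| are kept.
    """
    X = np.asarray(X, dtype=float)
    n = X.size
    x2 = np.sort(X ** 2)[::-1]                        # X_[1]^2 >= ... >= X_[n]^2
    tails = np.concatenate(([x2.sum()], x2.sum() - np.cumsum(x2)))  # tail sums, k = 0..n
    k = np.arange(n + 1)
    pen = K * sigma ** 2 * k * np.log(np.e * n / np.maximum(k, 1))  # 0*log(en/0) = 0 at k=0
    k_check = int(np.argmin(tails + pen))
    if k_check == 0:
        return np.zeros(n)                            # threshold X_[0] = +infinity: keep nothing
    t = np.sqrt(x2[k_check - 1])                      # |X_[k_check]|
    return np.where(np.abs(X) >= t, X, 0.0)
```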

5 EBMA and EBMS Procedures for the Case of Unknown Sparsity Cluster Value

Clearly, the thresholding approach relies heavily on the fact that we know the sparsity cluster value, zero by default. Assume now that there is a large (sparsity) cluster of (almost) equal coordinates of $$\theta $$, but its value is not known. Alternatively, one can think of this as so-called robust inference, in the sense that there may be a systematic error in the “known” sparsity cluster value zero, so that the true sparsity cluster value may actually deviate from zero. Using a thresholding procedure in such a situation would lead to a large cumulative error, because the sparsity cluster contains most of the coordinates of the high-dimensional signal $$\theta $$.

Recalling the empirical Bayes approach described in Sect. 3.2 for the case of unknown sparsity cluster value, we immediately see that, unlike (8), the EBMS criterion (6) cannot be reduced to a thresholding procedure. However, the corresponding optimization problem is still feasible from a computational point of view. Indeed, the criterion (6) reduces to
$$ \hat{I}=\arg \!\min _{I\in \mathcal {I}}\Big \{\sum _{i \in I^c} (X_i-\bar{X}_{I^c})^2 + K \sigma ^2 |I| \log \big (\tfrac{en}{|I|}\big )\Big \}=\big \{i\in [n]: X_i\ne X_{[\hat{j}+t]},\, t\in [\hat{k}]\big \}, $$
where $$K=2\varkappa +1$$, $$X_{[1]}\ge X_{[2]} \ge \ldots \ge X_{[n]}$$ are the ordered $$X_1,\ldots , X_n$$,
$$ (\hat{k},\hat{j})=\!\!\!\mathop {{{\,\mathrm{\arg \!\min }\,}}}_{k,j\in [n]_0,\; k+j\le n}\Big \{\sum _{i=j+1}^{j+k} (X_{[i]}-\bar{X}_{jk})^2+K \sigma ^2 (n-k) \log \big (\tfrac{en}{n-k}\big )\Big \}, \;\; \bar{X}_{jk}=\frac{1}{k} \sum _{i=1+j}^{k+j}X_{[i]}. $$
In this case, the EBMS method yields the robust (the sparsity cluster value is unknown) version of EBMS estimator $$\hat{\theta }^{EB}=\hat{\theta }^{EB}(K,X) =(\hat{\theta }_i^{EB}, i\in [n])$$ with
$$\begin{aligned} \hat{\theta }_i^{EB}= X_i\mathbbm {1}\{X_i\ne X_{[t+\hat{j}]},\, t\in [\hat{k}]\}+\bar{X}_{\hat{j}\hat{k}} \mathbbm {1}\{X_i= X_{[t+\hat{j}]},\, t\in [\hat{k}]\}, \;\; i \in [n]. \end{aligned}$$
(10)
It is not difficult to see that this procedure has computational complexity of order $$n^2$$, which is of course worse than that of the procedure (8)–(9), but still computationally feasible. This is demonstrated in the next section.
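A sketch of the $$O(n^2)$$ EBMS search and the resulting estimator (10) is given below; with prefix sums over the ordered observations, each candidate pair (k, j) is evaluated in constant time. The function name and the defaults are again ours, and ties among the $$X_i$$ (a probability-zero event under model (1)) are handled by position in the ordered sample rather than by value.

```python
import numpy as np

def ebms_unknown_cluster(X, sigma=1.0, K=2.5):
    """EBMS estimator (10) when the sparsity cluster value is unknown.

    Searches over all windows of k consecutive order statistics (X sorted in
    decreasing order), starting after position j, for the pair (k, j) minimizing
        sum_{i=j+1}^{j+k} (X_[i] - Xbar_{jk})^2 + K sigma^2 (n-k) log(en/(n-k)).
    Coordinates falling in the selected window form the sparsity cluster and are
    replaced by the window mean; all other coordinates keep their observed value.
    Complexity: O(n^2) pairs (k, j), each evaluated in O(1) via prefix sums.
    """
    X = np.asarray(X, dtype=float)
    n = X.size
    order = np.argsort(-X)                  # indices giving X_[1] >= ... >= X_[n]
    Xs = X[order]
    S = np.concatenate(([0.0], np.cumsum(Xs)))         # prefix sums
    S2 = np.concatenate(([0.0], np.cumsum(Xs ** 2)))   # prefix sums of squares

    best = (np.inf, 0, 0)                   # (criterion, k_hat, j_hat)
    for k in range(n + 1):                  # k = |I^c|, size of the sparsity cluster
        m = n - k                           # |I|, number of "significant" coordinates
        pen = K * sigma ** 2 * m * np.log(np.e * n / m) if m else 0.0
        for j in range(n - k + 1):
            if k:
                s, s2 = S[j + k] - S[j], S2[j + k] - S2[j]
                wss = s2 - s * s / k        # within-window sum of squares
            else:
                wss = 0.0
            if wss + pen < best[0]:
                best = (wss + pen, k, j)

    _, k_hat, j_hat = best
    theta_hat = X.copy()
    if k_hat:
        cluster_idx = order[j_hat:j_hat + k_hat]       # original indices in the cluster
        theta_hat[cluster_idx] = Xs[j_hat:j_hat + k_hat].mean()
    return theta_hat
```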

An alternative is to use the EBMA method. All posterior quantities involved in the construction of the EBMA estimator $$\tilde{\theta }^{EB}$$ given by (5) are explicit, and the major issue is that the number of terms in (5) is exponential in the dimension so that direct computation is not practically feasible for high dimensions. Therefore, in this case, we have to resort to an MCMC procedure.

In the MCMC procedure, each element I in the support of $$\tilde{\pi }(I|X)$$ is encoded (one-to-one) by a binary state vector $$s=(s_1, \dots , s_n)\in \{0,1\}^n$$. The correspondence is that $$s_i=1$$ if, and only if, $$i\in I$$, and $$s_i=0$$ if, and only if, $$i\not \in I$$. The proposal $$s'$$ simply flips one bit of the current state s, chosen uniformly at random. This means that we first select j uniformly at random from $$\{1,\dots ,n\}$$, and then set $$s'_j=1-s_j$$ and $$s'_i=s_i$$, $$i\ne j$$. The chain moves from s to $$s'$$ with probability $$\alpha =\min \big \{1, \; \tilde{\pi }\big (\{i\in [n]: s_i'=1\}|X\big )/\tilde{\pi }\big (\{i\in [n]: s_i=1\}|X\big )\big \}$$. The EBMA estimator $$\tilde{\theta }^{EB}$$ from (5) is the expectation of $$\hat{\mu }(I)$$ with respect to the posterior $$\tilde{\pi }(I|X)$$. If $$I_1, \dots , I_M$$ is a sample drawn from $$\tilde{\pi }(I|X)$$, or indeed a sample produced by the MCMC procedure above, then we approximate the EBMA estimator as
$$\begin{aligned} \tilde{\theta }^{EB}=\tilde{\theta }^{EB}(\varkappa ,X)\approx \frac{1}{M} \sum _{i=1}^M \hat{\mu }(I_i). \end{aligned}$$
(11)
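The following sketch implements this single-bit-flip Metropolis sampler and the approximation (11); it reuses the hypothetical helper log_post_I from the sketch in Sect. 3.2, and the burn-in, number of collected states and thinning match the choices reported in Sect. 6. Recomputing the log posterior from scratch at every step keeps the code short; a production implementation would update the cluster sum incrementally.

```python
import numpy as np

def ebma_mcmc(X, sigma=1.0, kappa=1.0, burn_in=10000, n_iter=25000, thin=25, seed=0):
    """Approximate the EBMA estimator (5) by the single-bit-flip Metropolis sampler,
    returning the average of mu_hat(I) over the thinned posterior sample, as in (11).
    Assumes the helper log_post_I from the earlier sketch is in scope."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.size

    s = np.zeros(n, dtype=bool)            # current state: s_i = 1  <=>  i in I
    logp = log_post_I(X, s, sigma, kappa)
    acc = np.zeros(n)                      # running sum of mu_hat(I) over kept samples
    kept = 0

    for it in range(burn_in + n_iter):
        j = rng.integers(n)                # flip one bit chosen uniformly at random
        s[j] = ~s[j]
        logp_new = log_post_I(X, s, sigma, kappa)
        if np.log(rng.uniform()) <= logp_new - logp:
            logp = logp_new                # accept the move
        else:
            s[j] = ~s[j]                   # reject: undo the flip
        if it >= burn_in and (it - burn_in) % thin == 0:
            # mu_hat(I): X_i for i in I, the cluster mean for i in I^c
            mu = np.where(s, X, X[~s].mean() if (~s).any() else 0.0)
            acc += mu
            kept += 1

    return acc / kept                      # EBMA estimate, cf. (11)
```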

6 Comparative Simulation Study

In this section, we present a comparative simulation study for the cases of known and unknown (or shifted) sparsity cluster value.

We generate observations according to the model (1) with $$\xi _i\overset{\mathrm{ind}}{\sim } N(0,1)$$, $$\sigma =1$$, $$n=500$$, where we use signals $$\theta =(\theta _{1},\ldots ,\theta _{n})$$ of the form $$\theta =(A_1,\ldots ,A_p, \delta ,\ldots ,\delta )$$. The first p coordinates of $$\theta $$ are “significant”; the remaining $$n-p$$ entries form the sparsity cluster. We consider different “sparsity levels” $$p\in \{25,50,100\}$$ and “signal strengths”: $$A_i\overset{\mathrm{ind}}{\sim }U[0,2]$$ (signal is undetectable, i.e., comparable to the noise); $$A_i\overset{\mathrm{ind}}{\sim }U[2,4]$$ (signal is barely distinct from the noise); $$A_i\overset{\mathrm{ind}}{\sim }U[4,6]$$ (signal is well distinct from the noise). Next, we consider two situations: (a) known sparsity cluster value $$\delta =0$$; (b) unknown sparsity cluster value, which we set to $$\delta =-0.5$$ in the simulations.
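A minimal sketch of this data-generating mechanism (the function name and argument defaults are ours) is:

```python
import numpy as np

def simulate(p=25, low=4.0, high=6.0, delta=0.0, n=500, sigma=1.0, rng=None):
    """One data vector from the simulation setup of Sect. 6:
    p significant coordinates A_i ~ U[low, high], the remaining n - p coordinates
    equal to the sparsity cluster value delta, observed with N(0, sigma^2) noise."""
    rng = rng or np.random.default_rng()
    theta = np.full(n, delta, dtype=float)
    theta[:p] = rng.uniform(low, high, p)
    X = theta + sigma * rng.standard_normal(n)
    return theta, X
```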

The following estimators are considered: the projection oracle (PO) estimator $$\hat{\theta }_i^{PO}=X_i\mathbbm {1}\{\theta _i\not =\delta \}+\delta \mathbbm {1}\{\theta _i=\delta \}$$, $$i\in [n]$$; the empirical Bayes mean (EBMean) $$\hat{\theta }^{EBMean}$$ considered in [8], with a standard Laplace prior, as implemented in the R package EbayesThresh (see [9]); the classical universal hard-thresholding (UHT) estimator (see [7]) $$\hat{\theta }_i^{UHT}=X_i\mathbbm {1}\{|X_i|>\sqrt{2\log n}\}$$, $$i \in [n]$$; the HT estimator $$\check{\theta }^{HT}$$ defined by (9); the EBMA estimator $$\tilde{\theta }^{EB}$$ given by (5); and, finally, the EBMS estimator $$\hat{\theta }^{EB}$$ defined by (10). Clearly, the PO procedure is not really an estimator, as it uses oracle knowledge of the active set $$I_*(\theta )=\{i\in [n]: \theta _i \not = \delta \}$$ and of the sparsity cluster value $$\delta $$; this pseudo-estimator cannot be outperformed by any practical estimation procedure, and its performance is provided only for reference, as a benchmark for the ideal situation.

The estimators EBMean, UHT, and HT are all geared towards the known (zero) sparsity cluster value, whereas our EBMA and EBMS estimators $$\tilde{\theta }^{EB}$$ and $$\hat{\theta }^{EB}$$ can also accommodate any unknown sparsity cluster value. To create more competition for our procedures $$\tilde{\theta }^{EB}$$ and $$\hat{\theta }^{EB}$$ in the case of an unknown sparsity cluster value, we also provide adjusted versions aEBMean, aUHT and aHT of the estimators $$\hat{\theta }=\hat{\theta }(X)$$, constructed as follows: $$\hat{\theta }'(X)= \hat{\theta }(X-\bar{X}1_n) +\bar{X}1_n$$, where $$1_n$$ is an n-dimensional vector of ones, $$\bar{X}=\frac{1}{n}\sum \limits _{i=1}^n X_i$$ is the empirical mean of the sample X, and $$\hat{\theta }$$ is the corresponding estimator (respectively, EBMean, UHT, and HT). In the case of an unknown non-zero sparsity cluster value, the adjusted versions are clearly biased and only competitive for relatively small p and $$A_i$$’s. The adjusted versions of the estimators are expected to perform worse (and they do so, as Table 2 shows) for larger values of p and $$A_i$$’s.
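The adjustment is a simple wrapper around any estimator geared to a zero sparsity cluster value; a sketch (the hypothetical helper adjusted, shown with UHT as an example) is:

```python
import numpy as np

def adjusted(estimator, X):
    """aEBMean/aUHT/aHT-style adjustment:  theta'(X) = estimator(X - Xbar 1_n) + Xbar 1_n,
    where Xbar is the empirical mean of the sample X."""
    X = np.asarray(X, dtype=float)
    Xbar = X.mean()
    return estimator(X - Xbar) + Xbar

# e.g. an adjusted universal hard-thresholding (aUHT) estimate:
# aUHT = adjusted(lambda Y: np.where(np.abs(Y) > np.sqrt(2 * np.log(Y.size)), Y, 0.0), X)
```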

Each of our estimators $$\check{\theta }^{HT}(K,X)$$, $$\hat{\theta }^{EB}(K,X)$$ and $$\tilde{\theta }^{EB}(\varkappa ,X)$$ depends on one tuning parameter, K or $$\varkappa $$. It is possible to choose the parameters K and $$\varkappa $$ from the data via a cross-validation procedure, but this significantly increases the running time for computing $$\hat{\theta }^{EB}(K,X)$$, and especially $$\tilde{\theta }^{EB}(\varkappa ,X)$$. However, in the simulation results for several cases, the optimal K did not vary much and appeared to lie mostly in the range [1.8, 3.2]. Moreover, the results were good for many choices of K; the performance deteriorates significantly only when K gets close to 1 or becomes too large. This agrees with the conclusions (about the penalty constants) from [5]. In the simulations for the EBMS estimators $$\check{\theta }^{HT}(K,X)$$ and $$\hat{\theta }^{EB}(K,X)$$, the choice $$K=2.5$$ appeared to be fairly good in all considered cases. When computing the EBMA estimator $$\tilde{\theta }^{EB}(\varkappa ,X)$$, we took $$\varkappa =1$$, which is a natural choice in the light of Remark 1. The EBMA procedure seemed to be even less sensitive to the choice of the parameter $$\varkappa $$; again, many choices are possible as long as $$\varkappa $$ is not too small (it should be larger than 0.7) and not too large. We let the chain burn in for 10000 iterations and then collected 25000 states from the posterior. The final sample of states used to approximate the EBMA estimator was obtained by keeping every 25th state, resulting in $$M=1000$$ in (11). This thinning was done to reduce the correlation between the samples from the MCMC procedure.

Tables 1 and 2 contain estimates of the mean square errors $$MSE(\hat{\theta },\theta )=\mathrm {E}_\theta \Vert \hat{\theta }-\theta \Vert ^2$$ for the above-mentioned estimators $$\hat{\theta }$$ and choices of the signal $$\theta $$. Tables 1 and 2 concern the cases of the known ($$\delta =0$$) and unknown ($$\delta =-0.5$$) sparsity cluster value, respectively. The $$MSE(\hat{\theta },\theta )$$ is evaluated by the average squared error $$\widehat{MSE}(\hat{\theta },\theta )=\frac{1}{l}\sum _{k=1}^l\Vert \hat{\theta }^k-\theta \Vert ^2$$ of l estimates $$\hat{\theta }^1,\ldots ,\hat{\theta }^l$$ computed from $$l=100$$ data vectors simulated independently from the model (1).
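As an illustration of this evaluation scheme, the following snippet (reusing the hypothetical helpers simulate and ebms_unknown_cluster from the sketches above) computes such a Monte Carlo estimate for the EBMS estimator in one setting of Table 2.

```python
import numpy as np

# Monte Carlo estimate of MSE(theta_hat, theta) for the EBMS estimator in the
# setting p = 50, A_i ~ U[2,4], delta = -0.5, based on l = 100 data vectors.
rng = np.random.default_rng(1)
errors = []
for _ in range(100):
    theta, X = simulate(p=50, low=2.0, high=4.0, delta=-0.5, rng=rng)
    theta_hat = ebms_unknown_cluster(X, sigma=1.0, K=2.5)
    errors.append(np.sum((theta_hat - theta) ** 2))
print(np.mean(errors))   # estimated MSE, to be compared with the entries of Table 2
```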

It is not surprising that the shrinkage estimators EBMean and EBMA perform well for weak signals (cases $$A_i\sim \text {U}[0,2]$$ and $$A_i\sim \text {U}[2,4]$$) in situation (a) of a known (zero) sparsity cluster value, as one can see from Table 1. Table 2 concerns situation (b) and is more interesting: it shows a clear advantage of the EBMA and EBMS methods, which take the unknown sparsity cluster value into account. Only in the cases with an undetectable signal ($$A_i\sim \text {U}[0,2]$$) is the adjusted shrinkage estimator aEBMean still competitive, as this case is very favorable for any shrinkage procedure and the absolute shift $$|\delta |$$ is relatively small.
Table 1  Estimated MSEs for the case (a) of the known sparsity cluster value $$\delta =0$$

          |        p = 25        |        p = 50        |        p = 100
$$A_i$$   | U[0,2] U[2,4] U[4,6] | U[0,2] U[2,4] U[4,6] | U[0,2] U[2,4] U[4,6]
----------+----------------------+----------------------+----------------------
PO        |   25     25     25   |   50     50     50   |   99     99     99
EBMean    |   34     96     91   |   64    164    172   |  111    273    319
UHT       |   39    157     68   |   75    316    137   |  141    626    270
HT        |   37    127     62   |   72    194    123   |  138    300    233
EBMA      |   36    103     79   |   66    178    162   |  114    291    254
EBMS      |   42    132     64   |   73    204    124   |  121    313    233

Table 2  Estimated MSEs for the case (b) of an unknown sparsity cluster value $$\delta =-0.5$$

          |        p = 25        |        p = 50        |        p = 100
$$A_i$$   | U[0,2] U[2,4] U[4,6] | U[0,2] U[2,4] U[4,6] | U[0,2] U[2,4] U[4,6]
----------+----------------------+----------------------+----------------------
PO        |   25     25     25   |   50     50     50   |   99     99     99
EBMean    |  136    178    176   |  154    229    240   |  182    312    348
aEBMean   |   57    108    118   |  100    201    254   |  162    332    441
UHT       |  162    280    191   |  191    432    254   |  245    730    374
aUHT      |   69    174     96   |  129    380    285   |  224    811    916
HT        |  157    256    206   |  186    327    268   |  240    414    360
aHT       |   68    128    104   |  127    253    300   |  222    516    577
EBMA      |   66     97     79   |  122    176    161   |  201    281    251
EBMS      |   69    107     59   |  128    170    120   |  224    275    231