1 Introduction
Ideally, one would perform ℓ0 regularisation, i.e. penalising the sum of squared residuals by the number of nonzero variables, but this is a combinatorial problem and therefore computationally intractable. On the other hand, Lasso [8] and other ℓ1-type regularisations (penalisation by the sum of the absolute values of the coefficients) offer a quadratic programming alternative whose solution is still a proper variable selection, as it contains many zeros.
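For concreteness, the two criteria can be sketched as follows (our notation, not necessarily that of the cited references): for a linear model Y = Xβ + ε and a regularisation parameter λ,

$$ \hat\beta_{\ell_0} \in \arg\min_{\beta}\,\|Y - X\beta\|_2^2 + \lambda\,\|\beta\|_0, \qquad \hat\beta_{\ell_1} \in \arg\min_{\beta}\,\|Y - X\beta\|_2^2 + \lambda\,\|\beta\|_1, $$

where ‖β‖0 counts the nonzero entries of β and ‖β‖1 is the sum of their absolute values; only the second problem is convex.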
Lasso has many advantages. First, it applies shrinkage to the variables, which can lead to better predictions than simple least squares due to Stein’s phenomenon [7]. Second, the convexity of its penalty means that Lasso can be solved numerically in an efficient way. Third, ℓ1 regularisation is variable selection consistent under certain conditions, provided that the coefficients of the true model are large enough compared to the regularisation parameter [6, 10, 14]. Fourth, Lasso can take structures into account: simple modifications of its penalisation term result in structured variable selection. Such variants include, among others, the fused lasso [9], the graphical lasso [3] and the composite absolute penalties [13], which include the group-lasso [12]. Fifth, for a fixed regularisation parameter, ℓ1 regularisation has nearly the same sparsity as ℓ0 regularisation [2]. However, this no longer holds when the regularisation parameter is optimised in a data-dependent way, using an information criterion such as AIC [1] or Mallows’s Cp [5].
Mallows’s Cp, like many other information criteria, takes the form of a penalised likelihood or a penalised sum of squared residuals, whose penalty depends on the number of selected variables. Therefore, among all models of equal size, the selection is based on the sum of squared residuals alone. Because of sparsity, it is easy to find a well-chosen combination of falsely significant variables that reduces the sum of squared residuals by fitting the observational errors. The effects of these false positives can be tempered by applying shrinkage. The optimisation of the information criterion then overestimates the number of variables needed, including too many false positives in the model. In order to avoid this scenario, one idea is to combine ℓ1 and ℓ0 regularisation: the former to select the nonzeros and the latter, i.e. least squares on the selected variables, to estimate their values. The optimal balance between the sum of squared residuals and the regularisation penalty should then shift towards smaller models. Of course, this change must be taken into account and the expression of the information criterion adapted accordingly. The correction for the difference between ℓ1 and ℓ0 regularisation has been described as a ‘mirror’ effect [4].
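For reference, in one common formulation (our notation), writing SSEk for the residual sum of squares of a model with k selected variables and σ̂² for an estimate of the noise variance,

$$ C_p(k) = \frac{\mathrm{SSE}_k}{\hat\sigma^2} - n + 2k, $$

so that, for a fixed k, the criterion indeed ranks models by their sum of squared residuals only.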
In Sect. 2, we explain the mirror effect in more detail. The main contribution of this paper follows and concerns the impact of a structure among the variables and how it affects the selection. More precisely, the behaviour of the mirror effect for unstructured and structured signal-plus-noise models is investigated in Sect. 3. A simulation is presented in Sect. 4 to support the previous sections.
2 The Mirror Effect
Among all selections of size k, for k large enough so that the important variables are in the model, the procedure consisting of minimising (2) adds seemingly nonzero variables, which should in fact be zeros, in order to fit the observational errors by further reducing the distance between the estimate and the observations. The consequence is a better-than-average apparent error (the residual sum of squares), contrasting with a worse-than-average true prediction error: indeed, the false positives perform worse at staying close to the noise-free signal than variables selected in a purely arbitrary way. This two-sided effect of appearance versus reality is described as a mirror effect [4].
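Anticipating the signal-plus-noise setting of Sect. 3, a hedged illustration in our own notation (not the paper’s equation (2)) makes the two sides of the mirror explicit. With observations Y = β + ε, a selection S of size k and least squares estimates (β̂i = Yi for i in S, zero otherwise), the apparent and true errors are

$$ \underbrace{\sum_{i \notin S} Y_i^2}_{\text{apparent error } \|Y-\hat\beta\|^2} \qquad\text{versus}\qquad \underbrace{\sum_{i \in S} \varepsilon_i^2 + \sum_{i \notin S} \beta_i^2}_{\text{true prediction error } \|\hat\beta-\beta\|^2}. $$

When S collects the largest values of |Yi|, the left-hand sum falls below its value for an arbitrary selection of the same size, while the error term on the right rises above it.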
3 Qualitative Description of the Mirror
3.1 Unstructured Signal-Plus-Noise Models
- Small selection sizes:
especially if the selection size k does not exceed the true number of nonzeros in β, only nonzero variables should be selected. The errors associated with these variables have an expected variance equal to σ², since they are randomly distributed among these variables. Hence, the mirror correction is close to zero.
- Intermediate selection sizes:
the large errors are selected according to their absolute value. Their expected variance is clearly greater than σ², meaning that the mirror correction increases. Its growth is high at first and decreases to zero as smaller errors are added to the selection, leading to the last stage.
- Large selection sizes:
finally, only the remaining small errors are selected. This has the effect of bringing the (previously abnormally high) variance back down to its original expected value; the mirror correction drops to zero, which is reached for the full model. A numerical sketch of these three stages is given below.
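A minimal numerical sketch of the three stages (our own illustration; the sizes and signal strengths below are arbitrary assumptions, not the settings of Sect. 4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1 = 5000, 250                     # total size and number of true nonzeros (assumed)
beta = np.zeros(n)
beta[:n1] = 10.0                      # large nonzeros, so they are picked up first
eps = rng.standard_normal(n)          # noise with sigma^2 = 1
Y = beta + eps

order = np.argsort(-np.abs(Y))        # unstructured selection: largest |Y_i| first
for k in [100, 250, 1000, 2500, n]:
    selected = order[:k]
    mean_sq_err = np.mean(eps[selected] ** 2)  # average squared error among the selected
    print(f"k = {k:5d}   mean selected squared error = {mean_sq_err:.2f}   (sigma^2 = 1)")
```

With these assumed settings, the printed averages should stay near 1 while only nonzeros are selected, rise well above 1 at intermediate selection sizes, and return to 1 for the full model.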
3.2 Structured Models: Grouped Variables
- Small selection sizes:
groups of nonzero variables are selected first, as they have major effects on the linear regression. As before, the associated errors are randomly distributed, so their expected variance equals σ² and the mirror correction is roughly zero.
- Intermediate selection sizes:
the groups containing only errors are selected according to their norms. These groups typically contain some high-value errors and some low-value errors. Hence, their expected variance is greater than σ², but smaller than in the case of unstructured selection, as the latter simply selects the largest errors. The consequence is that the mirror correction increases, but its growth, although high at first, is smaller than in the unstructured case.
- Large selection sizes:
groups of small errors are selected (although they can still contain some large errors), meaning that their expected variance decreases to σ², which is reached for the full model.
The explanations above hold for any type of structure, so we can deduce that the unstructured mirror has the largest amplitude in signal-plus-noise models. Indeed, as there is no constraint on the variables, once the nonzeros are selected, the errors are included in the model according to their absolute value, and more correction is needed in order to temper their effects, as the sketch below illustrates on pure noise.
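A minimal sketch comparing the amplitude of the effect for unstructured and grouped selection of pure noise, at a matched selection size (our own illustration with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
r, q = 1000, 10                        # number of groups and group size (assumed)
eps = rng.standard_normal((r, q))      # pure noise, sigma^2 = 1

k = 2000                               # matched number of selected variables
# Unstructured: keep the k entries with the largest absolute values.
top_unstructured = np.sort(np.abs(eps).ravel() ** 2)[-k:]
# Grouped: keep the l = k / q groups with the largest Euclidean norms.
l = k // q
norms = np.linalg.norm(eps, axis=1)
top_grouped = eps[np.argsort(-norms)[:l]] ** 2

print("unstructured mean selected squared error:", top_unstructured.mean())
print("grouped mean selected squared error:     ", top_grouped.mean())
# The unstructured value is clearly larger: selecting whole groups drags some
# small errors along with the large ones, so the deviation from sigma^2 is milder.
```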
This description is represented in Fig. 2, where the mirror corrections for group and unstructured selections can be compared visually, as they are plotted in dot-dashed and solid lines, respectively. The three stages of group selection can be seen in the intervals [0, 2500], [2500, 12500] and [12500, 25000]. Details of the calculations can be found in Sect. 4.
4 Simulation
In this simulation, r groups of equal size are generated so that β is an n-dimensional vector. Within group j, the coefficients have the same probability of being nonzero and, for each group j, this probability is randomly drawn from a fixed set of values with prescribed drawing probabilities. This determines the expected proportion of nonzeros for the whole vector β. The nonzeros of β are distributed according to a zero-inflated Laplace model. The observations are then computed as the signal-plus-noise vector Y = β + ε, arranged into the same r groups, where ε is an n-vector of independent, standard normal errors.
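A sketch of this data-generating process follows; the number of groups, the group size, the candidate probabilities and the Laplace scale are our own assumptions, not necessarily the values used in the simulation.

```python
import numpy as np

rng = np.random.default_rng(2025)
r, q = 2500, 10                        # assumed number of groups and group size
n = r * q

# Each group gets its own probability of containing nonzeros, drawn from a
# small candidate set (assumed values and drawing weights).
p_set = np.array([0.0, 0.1, 0.5])
p_weights = np.array([0.6, 0.3, 0.1])
p_group = rng.choice(p_set, size=r, p=p_weights)

# Zero-inflated Laplace coefficients: within group j, each coefficient is
# nonzero with probability p_group[j]; nonzero values are Laplace distributed.
a = 5.0                                # assumed Laplace scale
nonzero = rng.random((r, q)) < p_group[:, None]
beta = np.where(nonzero, rng.laplace(loc=0.0, scale=a, size=(r, q)), 0.0)

# Signal-plus-noise observations: Y = beta + standard normal errors.
Y = beta + rng.standard_normal((r, q))
print("expected proportion of nonzeros:", (p_set * p_weights).sum())
print("observed proportion of nonzeros:", nonzero.mean())
```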
Estimates are calculated for two configurations: the groups of the initial setting and groups of size 1 (unstructured selection). In the latter scenario, the estimate is the componentwise product of Y with the binary n-vector selecting its k largest absolute values. The 10-group estimator keeps the l groups whose norms are the largest, as encoded by a binary r-vector over the groups, and sets all other groups to zero. Using Lasso and group-lasso, respectively, would provide us with the same selections, because of the signal-plus-noise model. Mirror corrections for both configurations are found using (8).
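A sketch of the two estimators, in the same layout as the generation sketch above (an illustration under our assumptions, not the paper’s exact code):

```python
import numpy as np

def unstructured_estimate(Y, k):
    """Keep the k entries of Y with the largest absolute values, zero the rest."""
    flat = Y.ravel()
    keep = np.argsort(-np.abs(flat))[:k]      # index form of the binary n-vector
    est = np.zeros_like(flat)
    est[keep] = flat[keep]
    return est.reshape(Y.shape)

def group_estimate(Y, l):
    """Keep the l groups (rows of Y) with the largest Euclidean norms, zero the rest."""
    norms = np.linalg.norm(Y, axis=1)
    keep = np.argsort(-norms)[:l]             # index form of the binary r-vector
    est = np.zeros_like(Y)
    est[keep, :] = Y[keep, :]
    return est

# In a signal-plus-noise model, thresholding |Y_i| (or the group norms) selects
# the same sets as Lasso (or group-lasso) with a matching threshold, only
# without the shrinkage of the retained values.
```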
A comparison of the false discovery rates (FDR) of nonzero variables for unstructured and group selections is presented in Fig. 3. Because we allow groups to contain both zeros and nonzeros in this simulation, the recovery rate of the nonzeros in the group setting is at first below that of the unstructured setting. However, the two rates quickly cross, and the recovery of the nonzeros in the group setting is better afterwards; in particular, this is the case for the selection that minimises the group-mirror-corrected Cp (marked with the vertical line in Fig. 3).
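For completeness, a small sketch of the quantities behind such a comparison (our own helper with hypothetical names; the paper’s exact definitions may differ):

```python
import numpy as np

def fdr_and_recovery(selected, truly_nonzero):
    """selected, truly_nonzero: boolean arrays of the same shape."""
    n_selected = selected.sum()
    false_pos = (selected & ~truly_nonzero).sum()
    true_pos = (selected & truly_nonzero).sum()
    fdr = false_pos / max(n_selected, 1)              # false discoveries among selected
    recovery = true_pos / max(truly_nonzero.sum(), 1)  # recovered fraction of true nonzeros
    return fdr, recovery
```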
5 Conclusion
During the optimisation of an information criterion over the model size, using both ℓ1 and ℓ0 regularisation for the selection and estimation of variables allows us to take advantage of quadratic programming for the former and least squares projection for the latter. This technique avoids an overestimation of the number of selected variables; however, it requires a corrected expression for the information criterion: the difference between ℓ1 and ℓ0 regularisation is compensated using the mirror effect.
In this paper, we described the behaviour of the mirror effect in signal-plus-noise models, observing three stages depending on the model size. This way we can distinguish the selection of nonzero variables, of large false positives and of small false positives for which the mirror is, respectively, close to zero, then increasing and finally decreasing to zero again. In the special case of structured selection, we note a similar behaviour for the mirror although its amplitude is smaller, meaning that the information criterion needs less correction.