Algebraic sampling methods are a powerful tool to perform hypothesis tests on conditional spaces. We analyse the link between the sampling method introduced in [6] and permutation tests, and we exploit this link to build a two-step sampling procedure to perform two-sample comparisons for non-negative discrete exponential families. We thus establish a link between standard permutation and algebraic-statistics-based sampling. The proposed method reduces the dimension of the space on which the MCMC sampling is performed by introducing a second step in which a standard Monte Carlo sampling is performed. The advantages of this dimension reduction are verified through a simulation study, which shows that, among the methods compared, the proposed approach converges in the shortest time and has the lowest mean squared error.
Keywords
Conditional tests · Discrete exponential families · Markov basis · Markov chain Monte Carlo
1 Introduction
Consider two samples $\mathbf{x}_1 = (x_{11}, \dots, x_{1n_1})$ and $\mathbf{x}_2 = (x_{21}, \dots, x_{2n_2})$ of size $n_1$ and $n_2$, respectively, coming from some non-negative discrete exponential family with natural parameter $\theta$, base measure $H$ and normalising constant $Z(\theta)$,
$$f(x; \theta) = \frac{H(x) \exp(\theta x)}{Z(\theta)}, \qquad x \in \mathbb{Z}_{\geq 0}.$$
We are interested in conditional tests that exploit the joint sample $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$ of size $N = n_1 + n_2$ to test
$$H_0 : \theta_1 = \theta_2 \quad \text{against} \quad H_1 : \theta_1 \neq \theta_2. \qquad (1)$$
Specifically, there exists a uniformly most powerful unbiased (UMPU) procedure performed conditionally on the sum of the entries of the pooled sample, $T = \sum_{i=1}^{N} x_i$ [10]. Moreover, $T$ is a sufficient statistic for the nuisance parameter of the test, the population constant, if we assume the standard one-way ANOVA model for the means [11].
The test statistic adopted in UMPU tests is $U = \sum_{i=1}^{n_1} x_{1i}$, the sum of the entries of the first sample, and its conditional distribution given $T$ under $H_0$ in (1) is
$$f_{U \mid T}(u \mid t) = \frac{\sum_{\mathbf{x}_1 \in \mathcal{F}_u^{n_1}} \prod_{i=1}^{n_1} H(x_{1i}) \; \sum_{\mathbf{x}_2 \in \mathcal{F}_{t-u}^{n_2}} \prod_{i=1}^{n_2} H(x_{2i})}{\sum_{\mathbf{x} \in \mathcal{F}_t^{N}} \prod_{i=1}^{N} H(x_i)}, \qquad (2)$$
where $H$ is the base measure of the non-negative discrete exponential family $f$. We denote by $\mathcal{F}_x^n$ the set of non-negative integer vectors of length $n$ with sum of entries equal to $x$.
In order to perform the test (1), we can either find the critical values for any given risk of type I error or, alternatively, compute the p-value corresponding to the observed value of U. Unfortunately, the distribution (2) can rarely be computed in closed form. In most cases, it is necessary to approximate (2) through Markov Chain Monte Carlo (MCMC).
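As an illustration, (2) can be evaluated exactly for small samples by enumerating the fibres directly. The following sketch is a brute-force check, not part of the proposed algorithms; it assumes the Poisson base measure $H(x) = 1/x!$ used in Sect. 4, and all function names are ours.

```python
import math

def fibre(n, s):
    """Enumerate F_s^n: non-negative integer vectors of length n with entry sum s."""
    if n == 1:
        yield (s,)
        return
    for first in range(s + 1):
        for rest in fibre(n - 1, s - first):
            yield (first,) + rest

def H_poisson(x):
    """Base measure of the Poisson family written as H(x) exp(theta*x) / Z(theta)."""
    return 1.0 / math.factorial(x)

def weight(v, H):
    """prod_i H(v_i), the unnormalised weight of a vector."""
    return math.prod(H(vi) for vi in v)

def conditional_pmf(u, t, n1, n2, H=H_poisson):
    """Brute-force evaluation of (2); feasible only for small n1, n2 and t."""
    num = (sum(weight(x1, H) for x1 in fibre(n1, u))
           * sum(weight(x2, H) for x2 in fibre(n2, t - u)))
    den = sum(weight(x, H) for x in fibre(n1 + n2, t))
    return num / den

# Sanity check: in the Poisson case (2) reduces to the Binomial(t, n1/N) pmf (cf. Sect. 4).
n1, n2, t = 3, 2, 6
for u in range(t + 1):
    exact = math.comb(t, u) * (n1 / (n1 + n2)) ** u * (n2 / (n1 + n2)) ** (t - u)
    assert abs(conditional_pmf(u, t, n1, n2) - exact) < 1e-12
```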
MCMC sampling methods suffer two major drawbacks in the discrete setting: the construction of the Markov basis needed to build the chain is computationally expensive, and the chain may mix slowly [8]. The idea of speeding up the MCMC sampling is therefore not new in the Algebraic Statistics literature. In [7], the first drawback is addressed in the case of bounded contingency tables: instead of computing the entire Markov basis in an initial step, sets of local moves that connect each table in the reference set with a set of neighbouring tables are studied. A similar approach for bounded two-way contingency tables under the independence model with positive bounds is presented in [13]. A hybrid scheme using MCMC and sequential importance sampling, able to address both drawbacks, has been proposed in [8]. We propose a strategy that exploits the structure of the sample space for UMPU tests and does not require sequential importance sampling but only standard independent Monte Carlo sampling.
2 Markov Chain Monte Carlo Samplings
As a consequence of the conditioning on $T = t$, the sample space to be inspected under $H_0$ is the fibre of non-negative integer vectors of size $N$ whose entries add up to $t$,
$$\mathcal{F}_t = \left\{ \mathbf{x} \in \mathbb{Z}_{\geq 0}^{N} : \mathbf{1}_N^{\top} \mathbf{x} = t \right\}, \qquad (3)$$
where $\mathbf{1}_N$ is the vector of length $N$ with all entries equal to 1.
The distribution we are interested in is the cumulative distribution function of the test statistic $U$ given $T = t$ under $H_0$ as specified in (1),
$$F_{U \mid T}(u \mid t) = \sum_{\mathbf{x} \in \mathcal{F}_t} \mathbb{1}_{\{U(\mathbf{x}) \leq u\}} \, \pi(\mathbf{x}), \qquad (4)$$
where $\pi(\mathbf{x}) = \prod_{i=1}^{N} H(x_i) \big/ \sum_{\mathbf{y} \in \mathcal{F}_t} \prod_{i=1}^{N} H(y_i)$, the indicator $\mathbb{1}_{\{U(\mathbf{x}) \leq u\}}$ is 1 if $U(\mathbf{x}) \leq u$ and 0 otherwise and, with a slight abuse of notation, $U(\mathbf{x}) = \sum_{i=1}^{n_1} x_i$ is the test statistic computed on the first $n_1$ entries of $\mathbf{x}$.
In the following, we describe two MCMC algorithms to sample from $\pi$. The first one samples vectors $\mathbf{x} \in \mathcal{F}_t$, while the second one samples orbits of permutations $\mathcal{O} \subseteq \mathcal{F}_t$. Both MCMC algorithms make use of a Markov basis, a set of moves allowing to build a connected Markov chain over $\mathcal{F}_t$ using only simple additions/subtractions [6]. Specifically, a Markov basis for a matrix $A$ is a finite set of moves $\mathcal{M} = \{\mathbf{m}_1, \dots, \mathbf{m}_L\}$ such that
1. each $\mathbf{m}_i$ belongs to the integer kernel of $A$, i.e. $A \mathbf{m}_i = \mathbf{0}$;
2. every pair of elements of $\mathcal{F}_t$ is connected by a path formed by a sequence of moves $\mathbf{m}_{i_1}, \dots, \mathbf{m}_{i_k}$ and signs $\epsilon_1, \dots, \epsilon_k \in \{-1, +1\}$, and this path is contained in $\mathcal{F}_t$ (i.e. all intermediate vectors have non-negative entries).
Markov bases can be found analytically [5, 6] or using the algebraic software 4ti2 [1].
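For the fibre (3) the constraint matrix is $A = \mathbf{1}_N^{\top}$, and a natural set of moves satisfying condition 1 is $\{\mathbf{e}_i - \mathbf{e}_j : i \neq j\}$, which transfers one unit between two entries while preserving the total. The following sketch builds these moves and checks the kernel condition; it is an illustration under this assumption, not necessarily the basis used in the original experiments.

```python
import numpy as np

def fibre_moves(N):
    """Unit-transfer moves e_i - e_j (i != j): shift one unit from entry j
    to entry i, leaving the total sum unchanged."""
    moves = []
    for i in range(N):
        for j in range(N):
            if i != j:
                m = np.zeros(N, dtype=int)
                m[i], m[j] = 1, -1
                moves.append(m)
    return moves

N = 4
A = np.ones((1, N), dtype=int)   # constraint matrix of the fibre (3): 1_N^T x = t
for m in fibre_moves(N):
    assert (A @ m == 0).all()    # condition 1: each move lies in the integer kernel of A
```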
2.1 MCMC—Vector-Based
The first MCMC we consider is an adaptation of the algorithm used in [2, 3, 6] for the fibre $\mathcal{F}_t$.
An example of the Markov chain over $\mathcal{F}_t$ is shown in Fig. 1a. Each vertex of the graph represents a vector $\mathbf{x} \in \mathcal{F}_t$ and each edge represents an applicable move in the Markov basis. The number of states (i.e. vectors) is $\#\mathcal{F}_t = \binom{t+N-1}{N-1}$.
The target distribution of the Markov chain is the probability of sampling $\mathbf{x}$ under $H_0$ in (1), namely $\pi(\mathbf{x}) \propto \prod_{i=1}^{N} H(x_i)$. The estimator used to approximate (4) is the empirical mean of the indicator function, $\widehat{F}_{U \mid T}(u \mid t) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}_{\{U(\mathbf{x}_b) \leq u\}}$, where the $\mathbf{x}_b$s are the sampled vectors and $B$ is the number of iterations.
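A minimal Metropolis sketch of this vector-based sampler, assuming the unit-transfer moves above and the Poisson base measure $H(x) = 1/x!$ of Sect. 4; the moves, the initial state and the number of iterations are illustrative choices, not the settings of [6].

```python
import math
import random

def log_weight(x):
    """log prod_i H(x_i) for the Poisson base measure H(x) = 1/x!."""
    return -sum(math.lgamma(xi + 1) for xi in x)

def vector_based_mcmc(x0, n_iter, rng=None):
    """Metropolis walk on the fibre F_t using the unit-transfer moves e_i - e_j."""
    rng = rng or random.Random(0)
    x = list(x0)
    chain = []
    for _ in range(n_iter):
        i, j = rng.sample(range(len(x)), 2)   # propose moving one unit from entry j to i
        if x[j] > 0:                          # proposals leaving the fibre are rejected
            x_new = x.copy()
            x_new[i] += 1
            x_new[j] -= 1
            if math.log(rng.random()) < log_weight(x_new) - log_weight(x):
                x = x_new
        chain.append(tuple(x))
    return chain

# Estimate (4): fraction of sampled vectors with U(x) = sum of the first n1 entries <= u.
n1, u = 3, 4
chain = vector_based_mcmc([2, 2, 2, 0, 0], n_iter=50_000)
F_hat = sum(sum(x[:n1]) <= u for x in chain) / len(chain)
print(F_hat)
```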
2.2 MCMC—Orbit-Based
The second algorithm is built by exploiting the link of $\mathcal{F}_t$ with orbits of permutations. Clearly, if $\mathbf{x} \in \mathcal{F}_t$, every permutation of $\mathbf{x}$ is an element of $\mathcal{F}_t$ too. Moreover, different orbits do not intersect. Therefore, the orbits of permutations form a partition of $\mathcal{F}_t$.
This partition is particularly interesting, as the elements which belong to the same orbit have the same probability of being sampled under $\pi$: the probability $\pi(\mathbf{x})$ depends on $\mathbf{x}$ only through the product $\prod_{i=1}^{N} H(x_i)$, which is invariant under permutations of the entries [12].
Therefore, it is possible to devise a two-step sampling procedure:
Step 1:
Sample one orbit $\mathcal{O}$ from the set of orbits contained in $\mathcal{F}_t$.
Step 2:
Sample $\mathbf{x}$ uniformly from $\mathcal{O}$.
The first step can be performed through an MCMC algorithm similar to the one described in Sect. 2.1 with target distribution the probability of sampling in orbit $\mathcal{O}$,
$$\pi(\mathcal{O}) = \sum_{\mathbf{x} \in \mathcal{O}} \pi(\mathbf{x}) = \#\mathcal{O} \, \pi(\mathbf{x}_{\mathcal{O}}),$$
where $\mathbf{x}_{\mathcal{O}}$ is any representative of $\mathcal{O}$, while the second one corresponds to a standard Monte Carlo sampling from $\mathcal{O}$.
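Step 1 can be sketched as a Metropolis-Hastings walk directly on orbit representatives. The move set below (unit transfers projected onto sorted representatives) and the explicit Hastings correction are our illustrative choices, not necessarily the parametrisation used in the paper; `log_orbit_weight` again assumes the Poisson base measure.

```python
import math
import random
from collections import Counter

def log_orbit_weight(rep):
    """log pi(O) up to a constant: log(#O) + log prod_i H(rep_i), with
    #O = N!/prod_k f_k! and the Poisson base measure H(x) = 1/x!."""
    f = Counter(rep)
    log_size = math.lgamma(len(rep) + 1) - sum(math.lgamma(c + 1) for c in f.values())
    return log_size - sum(math.lgamma(xi + 1) for xi in rep)

def neighbours(rep):
    """Counter mapping each orbit reachable by one unit transfer to the number of
    ordered index pairs (i, j) producing it; used for the Hastings correction."""
    out = Counter()
    for i in range(len(rep)):
        for j in range(len(rep)):
            if i != j and rep[j] > 0:
                y = list(rep)
                y[i] += 1
                y[j] -= 1
                out[tuple(sorted(y))] += 1
    return out

def orbit_based_mcmc(x0, n_iter, rng=None):
    """Step 1: Metropolis-Hastings walk on the orbits contained in F_t."""
    rng = rng or random.Random(0)
    rep = tuple(sorted(x0))
    chain = []
    for _ in range(n_iter):
        i, j = rng.sample(range(len(rep)), 2)
        if rep[j] > 0:
            y = list(rep)
            y[i] += 1
            y[j] -= 1
            new = tuple(sorted(y))
            # several index pairs can propose the same orbit, hence the q-ratio below
            log_acc = (log_orbit_weight(new) - log_orbit_weight(rep)
                       + math.log(neighbours(new)[rep]) - math.log(neighbours(rep)[new]))
            if math.log(rng.random()) < log_acc:
                rep = new
        chain.append(rep)
    return chain
```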
The number of orbits of permutations contained in the fibre $\mathcal{F}_t$ is given by $p_N(t)$, the number of partitions of $t$ into at most $N$ parts [5], with the partition function $p_N(t)$ defined in [9, 14]. The values of the partition function can be computed using the recurrence
$$p_N(t) = p_{N-1}(t) + p_N(t - N),$$
and depend on both the sample size $N$ and the sum of entries $t$.
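A short sketch of this recurrence, with the standard base cases $p_0(0) = 1$, $p_0(t) = 0$ for $t > 0$ and $p_N(t) = 0$ for $t < 0$ (our assumption of the intended boundary conditions); it also reproduces the first entry of Table 1.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def p(N, t):
    """Number of partitions of t into at most N parts, i.e. the number of
    permutation orbits contained in the fibre F_t."""
    if t < 0:
        return 0
    if N == 0:
        return 1 if t == 0 else 0
    return p(N - 1, t) + p(N, t - N)

# Orbits versus vectors (cf. Table 1): for N = t = 5 the ratio is 7/126 ~ 0.056.
N, t = 5, 5
print(p(N, t), comb(t + N - 1, N - 1), p(N, t) / comb(t + N - 1, N - 1))
```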
To perform Step 1, we parametrise the fibre in such a way that all the vectors in the same permutation orbit are mapped to the same element. To do so, we consider a frequency-based representation, in which the orbit of $\mathbf{x}$ is mapped into the vector of counts $(f_0, f_1, \dots, f_t)$, where $f_k$ is the number of entries of $\mathbf{x}$ equal to $k$. In this notation, vectors (0, 4, 2) and (2, 0, 4), which belong to the same orbit, correspond to the same frequency vector (1, 0, 1, 0, 1, 0, 0).
The target distribution in the frequency-based parametrisation is
$$\pi(f_0, \dots, f_t) \propto C(f_0, \dots, f_t) \prod_{k=0}^{t} H(k)^{f_k},$$
with $C(f_0, \dots, f_t) = N! \big/ \prod_{k=0}^{t} f_k!$ being the number of distinct elements in the orbit $\mathcal{O}$.
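The frequency-based representation and the orbit cardinality $C$ can be computed directly; a minimal sketch (function names are illustrative):

```python
import math
from collections import Counter

def frequency_vector(x, t):
    """Map a vector x in F_t to its orbit label (f_0, ..., f_t)."""
    counts = Counter(x)
    return tuple(counts.get(k, 0) for k in range(t + 1))

def orbit_cardinality(f):
    """Number of distinct vectors in the orbit: C = N! / prod_k f_k!."""
    size = math.factorial(sum(f))
    for fk in f:
        size //= math.factorial(fk)
    return size

f = frequency_vector((0, 4, 2), t=6)   # the example above
assert f == (1, 0, 1, 0, 1, 0, 0)
assert orbit_cardinality(f) == 6       # 3! distinct arrangements of {0, 2, 4}
```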
Because Step 2 corresponds to a standard permutation sampling, we consider the distribution of $U$ given $T$ over one orbit $\mathcal{O}$, i.e. the usual permutation cdf,
$$F_{U \mid T, \mathcal{O}}(u \mid t) = \frac{1}{\#\mathcal{O}} \sum_{\mathbf{x} \in \mathcal{O}} \mathbb{1}_{\{U(\mathbf{x}) \leq u\}}. \qquad (5)$$
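Step 2 then amounts to shuffling a representative of the orbit. A sketch of the Monte Carlo estimate of (5), with illustrative names and sample sizes:

```python
import random

def permutation_cdf_mc(x_rep, n1, u, n_draws, rng=None):
    """Monte Carlo estimate of the permutation cdf (5) over the orbit of x_rep:
    shuffle the representative and count how often U(x) = sum(x[:n1]) <= u."""
    rng = rng or random.Random(0)
    x = list(x_rep)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(x)            # a uniform draw from the orbit of x_rep
        hits += sum(x[:n1]) <= u
    return hits / n_draws

print(permutation_cdf_mc((0, 4, 2), n1=1, u=2, n_draws=10_000))  # close to 2/3
```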
3 Comparison of Vector-Based and Orbit-Based MCMC
Dividing the sampling procedure into the two steps described in Sect. 2.2 has a clear computational advantage: Step 2 corresponds to a standard Monte Carlo sampling from the orbit $\mathcal{O}$, which is faster than performing an MCMC sampling. On the other hand, Step 1 performs an MCMC sampling over the set of orbits contained in $\mathcal{F}_t$, whose cardinality is smaller than that of the set of vectors in $\mathcal{F}_t$:
$$p_N(t) \leq \#\mathcal{F}_t = \binom{t+N-1}{N-1}.$$
Table 1 shows the ratios between the cardinality of the set of orbits and the cardinality of $\mathcal{F}_t$ for values of $N$ and $t$ between 5 and 100. Even for moderately sized samples, the number of orbits contained in $\mathcal{F}_t$ is about two orders of magnitude smaller than the number of vectors (e.g. the ratio is 0.004 for $N = 10$ and $t = 5$).
Hence, if we keep the number of iterations $B$ fixed and compare the number of vectors inspected by the two algorithms, the orbit-based algorithm gives the highest value, namely $\sum_{i=1}^{B} \#\mathcal{O}_i$, where $\mathcal{O}_i$ is the orbit sampled at iteration $i$, against the $B$ vectors inspected by the vector-based algorithm.
Table 1
Ratio between the number of orbits $p_N(t)$ and the number of vectors $\#\mathcal{F}_t$ for several values of $N$ (rows) and $t$ (columns)

N \ t    5        10       15       20       30       50       100
5        0.056    0.030    0.022    0.018    0.015    0.012    0.010
10       0.004
15
20
50
We show, through the simulation study in the next section, how the reduction in the dimension of the space explored by the MCMC algorithm improves convergence and accuracy with respect to the exact distribution (4).
4 Simulation Study
Assume that the two samples $\mathbf{x}_1$ and $\mathbf{x}_2$ are Poisson distributed with means $\lambda_1$ and $\lambda_2$, respectively. In this case, the exact distribution (4) under $H_0$ is the binomial cdf
$$F_{U \mid T}(u \mid t) = \sum_{k=0}^{u} \binom{t}{k} \left(\frac{n_1}{N}\right)^{k} \left(\frac{n_2}{N}\right)^{t-k},$$
i.e. $U \mid T = t$ is $\mathrm{Binomial}(t, n_1/N)$ under $H_0$.
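This ground truth is straightforward to evaluate; a small helper used below (the name `exact_cdf` is ours):

```python
from math import comb

def exact_cdf(u, t, n1, n2):
    """Exact conditional cdf (4) in the Poisson case: Binomial(t, n1/(n1 + n2))."""
    N = n1 + n2
    return sum(comb(t, k) * (n1 / N) ** k * (n2 / N) ** (t - k) for k in range(u + 1))

print(exact_cdf(4, t=10, n1=6, n2=4))  # ground truth for one (u, t) pair
```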
We compare the exact conditional cdf above with the approximated cdfs given by the vector-based algorithm, the orbit-based algorithm and the standard permutation cdf over the orbit of the observed pooled vector (this is the limit case of the orbit-based algorithm in which only the observed orbit is sampled). A preliminary simulation study is presented in [4].
We consider 9 scenarios built taking three sample sizes, $(n_1, n_2) \in \{(6, 4), (10, 15), (20, 30)\}$, and, for each sample size, three different couples of population means, $(\lambda_1, \lambda_2) \in \{(1, 1), (1, 1.5), (1, 2)\}$.
We perform two comparisons: first, we check the convergence behaviour of both algorithms for a fixed runtime (15 s); then we compare their accuracy through the mean squared error (MSE).
4.1 Convergence Comparison
To compare how fast the two MCMC procedures converge to the true distribution (4), we draw one random sample for each scenario above and run both algorithms for 15 s. Figure 2 shows four examples of the behaviour of the two MCMC procedures which are representative of the nine scenarios.
The orbit-based algorithm is very fast and achieves good convergence in seconds. On the contrary, the vector-based algorithm is much less efficient; indeed, its convergence to the exact value is not always satisfactory even after 15 s (Fig. 2a, b).
Remark 1
It would be possible to further reduce the computational time required by the orbit-based algorithm by exploiting one of the key features of this new approach, namely the possibility of sampling from each orbit independently. The Monte Carlo samplings in Step 2 could be made in parallel: once the chain reaches an orbit $\mathcal{O}$, the Monte Carlo sampling over $\mathcal{O}$ can be performed while the chain keeps on moving on the set of orbits.
4.2 Accuracy Comparison
For each scenario, we randomly generate 1,000 samples through which we compute the MSE of the distribution estimated by the three procedures under study,
$$\mathrm{MSE} = \frac{1}{1000} \sum_{s=1}^{1000} \left( \widehat{F}_{U \mid T}(u_s \mid t_s) - F_{U \mid T}(u_s \mid t_s) \right)^2,$$
where $u_s$ and $t_s$ denote the values of $U$ and $T$ observed in the $s$-th sample.
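A sketch of this accuracy check for the permutation estimator alone, reusing the `permutation_cdf_mc` and `exact_cdf` helpers sketched above; the numbers of draws are illustrative choices, not the settings of the study.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw via Knuth's method (adequate for small lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)
n1, n2, lam1, lam2 = 6, 4, 1.0, 1.0    # scenario n1 = 6, n2 = 4, lambda1 = lambda2 = 1
sq_errs = []
for _ in range(1000):
    x = [poisson(lam1, rng) for _ in range(n1)] + [poisson(lam2, rng) for _ in range(n2)]
    u, t = sum(x[:n1]), sum(x)
    est = permutation_cdf_mc(x, n1, u, n_draws=2000, rng=rng)
    sq_errs.append((est - exact_cdf(u, t, n1, n2)) ** 2)
print(sum(sq_errs) / len(sq_errs))     # comparable in spirit to the last column of Table 2
```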
Both MCMC algorithms are run for 15 s with no burn-in steps. The resulting MSE for the nine scenarios is shown in Table 2. As a further comparison, we report the MSE given by the standard Monte Carlo permutation sampling over the observed orbit.
Table 2
Mean squared error (MSE) for the vector-based, the orbit-based and the Monte Carlo permutation sampling. Both MCMC algorithms were run for 15 s with no burn-in steps

n1   n2   λ1   λ2    Orbit-based   Vector-based   Permutation
6    4    1    1     0.00012       0.0016         0.00284
6    4    1    1.5   0.00012       0.00083        0.00212
6    4    1    2     0.00016       0.00043        0.00221
10   15   1    1     0.00034       0.00131        0.00077
10   15   1    1.5   0.00009       0.00046        0.00074
10   15   1    2     0.00007       0.00017        0.00057
20   30   1    1     0.00069       0.00132        0.00036
20   30   1    1.5   0.00006       0.00053        0.00027
20   30   1    2     0.00001       0.00011        0.00009
The orbit-based algorithm always gives the smallest MSE, apart from the scenario $n_1 = 20$, $n_2 = 30$, $\lambda_1 = 1$, $\lambda_2 = 1$, where the standard Monte Carlo permutation sampling has the smallest MSE. Table 3 shows the ratio between the MSE of the vector-based algorithm and the MSE of the orbit-based algorithm (column 5) and the ratio between the MSE of the standard Monte Carlo permutation sampling and the MSE of the orbit-based algorithm (column 6). The MSE of the vector-based algorithm is at least 1.9 times bigger than that of the orbit-based algorithm, while the MSE of the standard Monte Carlo permutation sampling can be 22.7 times bigger than that of the orbit-based algorithm (scenario $n_1 = 6$, $n_2 = 4$, $\lambda_1 = \lambda_2 = 1$).
Table 3
Ratio between the MSE of the vector-based and the MSE of the orbit-based algorithm (column 5) and ratio between the MSE of the standard Monte Carlo permutation sampling and the MSE of the orbit-based algorithm (column 6)

n1   n2   λ1   λ2    MSE vector/MSE orbit   MSE perm/MSE orbit
6    4    1    1     12.82                  22.7
6    4    1    1.5   6.98                   17.79
6    4    1    2     2.67                   13.82
10   15   1    1     3.9                    2.31
10   15   1    1.5   4.96                   8
10   15   1    2     2.45                   8.27
20   30   1    1     1.9                    0.52
20   30   1    1.5   9.03                   4.58
20   30   1    2     15.9                   12.77
The numbers of iterations made by the vector-based and the orbit-based algorithms in the allocated 15 s are reported in Table 4. The orbit-based algorithm performs better than the vector-based one even though it completes fewer iterations: in 15 s, the vector-based algorithm performs from about twice to almost 19 times as many iterations as the orbit-based one. Despite this difference in the number of iterations, the orbit-based algorithm always achieves a lower MSE than the vector-based algorithm.
Table 4
Number of iterations performed by the two algorithms in 15 s

n1   n2   λ1   λ2    Orbit    Vector   Ratio vector/orbit
6    4    1    1     23,977   53,842   2.25
6    4    1    1.5   24,169   53,210   2.20
6    4    1    2     24,560   57,382   2.34
10   15   1    1     11,950   52,504   4.39
10   15   1    1.5   11,564   54,836   4.74
10   15   1    2     7,326    53,492   7.30
20   30   1    1     4,675    45,576   9.75
20   30   1    1.5   3,174    44,817   14.12
20   30   1    2     2,572    48,003   18.66
5 Conclusions
The orbit-based algorithm grants a faster convergence to the exact distribution compared to the standard MCMC algorithm proposed in [6]. At the same time, it gives more reliable estimates by decreasing the MSE. This simulation-based observation can be proved by comparing the variances of the estimators used by the two algorithms (the indicator function in (4) and the permutation cdf (5), respectively) [5].
When permutation-invariant statistics are used, the orbit-based algorithm is dramatically simplified. In this case, it is only necessary to walk among orbits of permutations without performing the second-step sampling and thus the reduction in computational time is significant.
Finally, it is worth noting that the MCMC sampling procedure based on orbits of permutations establishes a link between standard permutation and algebraic-statistics-based sampling that, to the best of our knowledge, has not been previously noted.
A preliminary version of this work has been presented at the 4th ISNPS conference in June 2018 in Salerno, Italy. Extensions of the present work include hypothesis testing for groups and data fitting [5].
Acknowledgments
R. Fontana acknowledges that the present research has been partially supported by MIUR grant Dipartimenti di Eccellenza 2018–2022 (E11G18000350001). The authors thank the anonymous referee for his/her helpful comments.