When the given measures μ1, …, μN are supported on the real line, computing their Fréchet mean is straightforward (Sect. 3.1.4). This is in contrast to the multivariate case, where, apart from the important yet special case of compatible measures, closed-form formulae are not available. This chapter presents an iterative procedure that provably approximates at least a Karcher mean under mild restrictions on the measures μ1, …, μN. The algorithm is based on the differentiability properties of the Fréchet functional developed in Sect. 3.1.6 and can be interpreted as classical steepest descent in the Wasserstein space W2. It reduces the problem of finding the Fréchet mean to a succession of pairwise problems, each involving only the Monge–Kantorovich problem between two measures. In the Gaussian case (or any location-scatter family), the latter can be solved explicitly, rendering the algorithm particularly appealing (see Sect. 5.4.1).
This chapter can be seen as complementary to Chap. 4. On the one hand, one can use the proposed algorithm to construct the regularised Fréchet–Wasserstein estimator that approximates a population version (see Sect. 4.3). On the other hand, it could be that the object of interest is the sample μ1, …, μN itself, but that the latter is observed with some amount of noise. If one only has access to proxies for the μi, then it is natural to use the Fréchet mean of the proxies as an estimator of the Fréchet mean of (μ1, …, μN). The proposed algorithm can then be used, in principle, to construct this estimator, and the consistency framework of Sect. 4.4 then allows one to conclude that if each proxy is consistent, then so is the resulting estimator of the Fréchet mean.
After presenting the algorithm in Sect. 5.1, we make some connections to Procrustes analysis in Sect. 5.2. A convergence analysis of the algorithm is carried out in Sect. 5.3, after which examples are given in Sect. 5.4. An extension to infinitely many measures is sketched in Sect. 5.5.
5.1 A Steepest Descent Algorithm for the Computation of Fréchet Means
Lemma 5.1.1 If γ0 is absolutely continuous and τ = τ0 ∈ [0, 1], then the measure obtained from γ0 by a steepest descent step of size τ0 is also absolutely continuous.
The idea is that push-forwards of γ0 under monotone maps are absolutely continuous if and only if the monotonicity is strict, a property preserved by averaging. See page 118 in the supplement for the details.
Lemma 5.1.1 suggests that the step size should be restricted to [0, 1]. The next result shows that the step size achieving the maximal reduction of an upper bound on the objective function (thus corresponding to an approximate line search) is exactly τ = 1. It does not rely on finite-dimensional arguments and holds when ℝ^d is replaced by a separable Hilbert space.
and the bound on the right-hand side of the last display is minimised when τ = 1.
The proof of Proposition 3.1.2 suggests a generalisation of Algorithm 1 to arbitrary measures in W2(ℝ^d), even if none are absolutely continuous. One can verify that Lemmata 5.1.2 and 5.3.5 (below) also hold in this setup, so it may be that the convergence results apply as well. The iteration no longer has the interpretation of steepest descent, however.
- (A)
Set a tolerance threshold 𝜖 > 0.
- (B)
For j = 0, let γj be an arbitrary absolutely continuous measure.
- (C)
For i = 1, …, N, solve the (pairwise) Monge problem and find the optimal transport map t_{γj}^{μi} from γj to μi.
- (D)
Define the map Tj = N^{-1} ∑_{i=1}^{N} t_{γj}^{μi}, the average of the N optimal maps.
- (E)
Set γj+1 = Tj#γj, i.e. push-forward γj via Tj to obtain γj+1.
- (F)
If ∥F′(γj+1)∥ < 𝜖, stop, and output γj+1 as the approximation of the Fréchet mean and t_{γj+1}^{μi} as the approximation of the optimal map from the Fréchet mean to μi, i = 1, …, N. Otherwise, return to step (C).
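To make steps (A)–(F) concrete, here is a minimal numerical sketch for measures on the real line, where the pairwise optimal map from γj to μi is the composition of the quantile function of μi with the distribution function of γj (Sect. 1.5), so that on a fixed grid of probability levels each map is simply a vector of quantiles. The function names and the discretisation are illustrative, not from the text; in one dimension the Fréchet mean is available in closed form (Sect. 3.1.4), so the descent converges after a single step and the second pass of the loop merely certifies that the gradient vanishes.

```python
import numpy as np
from scipy.stats import norm

def frechet_mean_1d(quantile_mats, n_iter=20, eps=1e-12):
    """Algorithm 1 for measures on the line, each encoded by its quantile
    function evaluated on a common grid u_k.  On gamma's quantile grid the
    optimal map to mu_i is just mu_i's quantile vector, so pushing gamma
    forward by the average map replaces gamma's quantiles by the average
    of the mu-quantiles."""
    Q = np.asarray(quantile_mats)          # shape (N, K): N measures, K grid points
    gamma = Q[0].copy()                    # step (B): start from an absolutely continuous measure
    for _ in range(n_iter):
        T = Q.mean(axis=0)                 # steps (C)-(D): pairwise maps on gamma's grid, averaged
        grad2 = np.mean((T - gamma) ** 2)  # ||F'(gamma_j)||^2 = integral of |T_j - id|^2 d gamma_j
        gamma = T                          # step (E): gamma_{j+1} = T_j # gamma_j
        if grad2 < eps:                    # step (F): stopping criterion
            break
    return gamma

# Three Gaussians on the line: the Fréchet mean's quantile function is the
# average of the three quantile functions, recovered here after one step.
u = (np.arange(1000) + 0.5) / 1000
Q = [norm.ppf(u, loc=m, scale=s) for (m, s) in [(0, 1), (1, 2), (-1, 0.5)]]
mean_quantiles = frechet_mean_1d(Q)
```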
5.2 Analogy with Procrustes Analysis
Algorithm 1 is similar in spirit to another procedure, generalised Procrustes analysis, that is used in shape theory. Given a subset B ⊆ ℝ^d, most commonly a finite collection of labelled points called landmarks, an interesting question is how to define mathematically the shape of B. One way to reach such a definition is to disregard those properties of B that are deemed irrelevant for what one considers this shape should be; typically, these would include its location, its orientation, and/or its scale. Accordingly, the shape of B can be defined as the equivalence class consisting of all sets obtained as gB, where g belongs to a collection G of transformations of ℝ^d containing all combinations of rotations, translations, dilations, and/or reflections (Dryden and Mardia [45, Chapter 4]).
If B1 and B2 are two collections of k landmarks, one may define the distance between their shapes as the infimum of ∥B1 − gB2∥² over the group G. In other words, one seeks to register B2 as closely as possible to B1 by using elements of the group G, with distance being measured as the sum of squared Euclidean distances between the transformed points of B2 and those of B1. In a sense, one can think of the shape problem and the Monge problem as dual to each other. In the former, one constrains how the registration of the points may be carried out, and the cost is judged by how successful the registration is. In the latter, one imposes that the registration be done exactly, and evaluates the cost by how much the space must be deformed in order to achieve this.
The optimal g and the resulting distance can be found in closed form by means of ordinary Procrustes analysis [45, Section 5.2]. Suppose now that we are given N > 2 collections of points, B1, …, BN, with the goal of minimising the sum of squares ∑_{i<j} ∥giBi − gjBj∥² over g1, …, gN ∈ G. As in the case of Fréchet means (Sect. 3.1.2), there is an equivalent formulation in terms of sums of squares from the average of the registered configurations. Unfortunately, there is no explicit solution for this problem when d ≥ 3. Like Algorithm 1, generalised Procrustes analysis (Gower [66]; Dryden and Mardia [45, p. 90]) tackles this “multimatching” setting by iteratively solving the pairwise problem as follows. Choose one of the configurations as an initial estimate/template, then register every other configuration to the template, employing ordinary Procrustes analysis. The new template is then given by the linear average of the registered configurations, and the process is iterated.
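For reference, the closed-form registration mentioned above can be sketched as follows. This is a hedged illustration of ordinary Procrustes analysis over rotations and translations only (no scaling or reflections); the function name and interface are ours, not from [45]. The optimal rotation comes from an SVD of the cross-covariance of the centred configurations (the classical Kabsch solution).

```python
import numpy as np

def ordinary_procrustes(B1, B2):
    """Register B2 (a k x d landmark matrix) to B1 over rotations and translations.

    With centred configurations B1c, B2c, the optimal rotation is R = U D V^T,
    where U S V^T is the SVD of B1c^T B2c and D corrects the sign so that R is
    a proper rotation (reflections excluded)."""
    c1, c2 = B1.mean(axis=0), B2.mean(axis=0)
    A = (B1 - c1).T @ (B2 - c2)
    U, _, Vt = np.linalg.svd(A)
    D = np.eye(A.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))   # keep det(R) = +1
    R = U @ D @ Vt
    registered = (B2 - c2) @ R.T + c1
    dist2 = np.sum((B1 - registered) ** 2)       # squared Procrustes distance
    return registered, dist2

# Example: B2 is a rotated and shifted copy of B1, so the distance is ~0.
B1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
theta = 0.7
Rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
B2 = B1 @ Rot.T + np.array([3.0, -1.0])
reg, d2 = ordinary_procrustes(B1, B2)            # d2 is numerically ~0
```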
- (1)
Registration: by finding the optimal transport maps t_{γj}^{μi}, we identify each μi with the element t_{γj}^{μi} − id of the tangent space at γj. In this sense, the collection (μ1, …, μN) is viewed in the common coordinate system given by the tangent space at the template γj and is registered to it.
- (2)
Averaging: the registered measures are averaged linearly, using the common coordinate system of the registration step (1), as elements of the linear space L2(γj). The linear average is then retracted back onto the Wasserstein space via the exponential map to yield the estimate at the (j + 1)-th step, γj+1.
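In symbols, and under the tangent-space conventions log_γ(μ) = t_γ^μ − id and exp_γ(r) = (id + r)#γ (our paraphrase of the notation of Chap. 2, not a verbatim quotation), the registration and averaging steps combine into

$$
\gamma_{j+1}
\;=\; \exp_{\gamma_j}\!\Bigg(\frac{1}{N}\sum_{i=1}^{N}\log_{\gamma_j}(\mu_i)\Bigg)
\;=\; \Bigg(\frac{1}{N}\sum_{i=1}^{N} t_{\gamma_j}^{\mu_i}\Bigg)_{\#}\gamma_j ,
$$

which is exactly the update Tj#γj of steps (D)–(E) of Algorithm 1.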
5.3 Convergence of Algorithm 1
In order to tackle the issue of convergence, we will use an approach that is specific to the nature of optimal transport. This is because the Hessian-type arguments that are used to prove similar convergence results for steepest descent on Riemannian manifolds (Afsari et al. [1]) or Procrustes algorithms (Le [86]; Groisser [67]) do not apply here, since the Fréchet functional may very well fail to be twice differentiable.
In fact, even in Euclidean spaces, convergence of steepest descent usually requires a Lipschitz bound on the derivative of F (Bertsekas [19, Subsection 1.2.2]). Unfortunately, F is not known to be differentiable at discrete measures, and these constitute a dense subset of W2(ℝ^d); consequently, this Lipschitz condition is very unlikely to hold. Still, the specific geometry of the Wasserstein space affords some advantages: we will place no restriction on the starting point of the iteration, except that it be absolutely continuous, and no assumption will be needed on how “spread out” the collection μ1, …, μN is, as is required in, for example, [1, 67, 86].
Theorem 5.3.1 Let μ1, …, μN ∈ W2(ℝ^d) be probability measures and suppose that one of them is absolutely continuous with a bounded density. Then the sequence (γj) generated by Algorithm 1 stays in a compact set of the Wasserstein space W2(ℝ^d), and any limit point of the sequence is a Karcher mean of (μ1, …, μN).
Since the Fréchet mean is a Karcher mean (Proposition 3.1.8), we obtain immediately:
Alternatively, combining Theorem 5.3.1 with the optimality criterion of Theorem 3.1.15 shows that the algorithm converges to the Fréchet mean when the appropriate assumptions on {μi} and the Karcher mean are satisfied. This allows one to conclude that Algorithm 1 converges to the unique Fréchet mean when the μi are Gaussian measures (see Theorem 5.4.1).
The proof of Theorem 5.3.1 is rather elaborate, since we need to use specific methods that are tailored to the Wasserstein space. Before giving the proof, we state two important consequences. The first is the uniform convergence on compacta of the optimal maps t_{γj}^{μi} to the optimal maps corresponding to the limit of (γj). This convergence does not follow immediately from the Wasserstein convergence of the γj, and is also established for the inverse maps. Both the formulation and the proof of this result are similar to those of Theorem 4.4.3.
for any pair of compacta Ω1 ⊆ A and Ω2 ⊆ Bi. If in addition all the measures μ1, …, μN have the same support, then one can choose all the sets Bi to be the same.
The other consequence is convergence of the optimal multicouplings.
of {μ1, …, μN} converges (in Wasserstein distance on (ℝ^d)^N) to the optimal multicoupling of {μ1, …, μN}.
The proofs of Theorem 5.3.3 and Corollary 5.3.4 are given at the end of the present section.
Lemma 5.3.5 The sequence (γj) generated by Algorithm 1 stays in a compact subset of the Wasserstein space W2(ℝ^d).
For all j ≥ 1, γj takes the form M#π, where M(x1, …, xN) = N^{-1}(x1 + ⋯ + xN) and π is a multicoupling of μ1, …, μN. The compactness of the set of such measures has been established in Step 2 of the proof of Theorem 3.1.5; see page 63 in the supplement, where this is done in a more complicated setup.
A closer look at the proof reveals that a more general result holds true. Let G denote the steepest descent iteration, that is, G(γj) = γj+1. Then the image G({γ : γ absolutely continuous}) has a compact closure in W2(ℝ^d). This is also true if ℝ^d is replaced by a separable Hilbert space.
In order to show that a weakly convergent sequence (γj) of absolutely continuous measures has an absolutely continuous limit γ, it suffices to show that the densities of the γj are uniformly bounded. Indeed, if C is such a bound, then for any open set O ⊆ ℝ^d we have γj(O) ≤ C Leb(O), so γ(O) ≤ C Leb(O) by the portmanteau Lemma 1.7.1. It follows that γ is absolutely continuous with density bounded by C. We now show that such a C can be found that applies to all measures in the image of G, hence to all sequences resulting from iterations of Algorithm 1.
The constant Cμ depends only on the measures (μ1, …, μN), and is finite as long as at least one μi has a bounded density gi, since Cμ ≤ N^d ∥gi∥∞ for any such i.
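To see where such a bound can come from (a sketch under smoothness assumptions, not the supplement's argument verbatim): each pairwise optimal map is the gradient of a convex function, so its derivative is a nonnegative definite matrix, and the Minkowski determinant inequality det(A + B)^{1/d} ≥ (det A)^{1/d} + (det B)^{1/d} applied to the averaged map Tj gives, for any fixed i,

$$
\big(\det DT_j(x)\big)^{1/d}
\;\ge\; \frac{1}{N}\sum_{k=1}^{N}\big(\det Dt_{\gamma_j}^{\mu_k}(x)\big)^{1/d}
\;\ge\; \frac{1}{N}\big(\det Dt_{\gamma_j}^{\mu_i}(x)\big)^{1/d} .
$$

Combining this with the Monge–Ampère (change of variables) equation f_{γj}(x) = g_i(t_{γj}^{μi}(x)) det Dt_{γj}^{μi}(x), where g_i is the density of μi, yields

$$
f_{\gamma_{j+1}}\big(T_j(x)\big)
\;=\; \frac{f_{\gamma_j}(x)}{\det DT_j(x)}
\;\le\; N^d\, g_i\big(t_{\gamma_j}^{\mu_i}(x)\big)
\;\le\; N^d\,\|g_i\|_\infty .
$$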
The third statement (the continuity of the iteration G) is much more subtle to establish, and its rather lengthy proof is given next. In view of Proposition 5.3.6, the uniform bound on the densities is not a hindrance to the proof of convergence of Algorithm 1.
Then ηj → η in .
As has been established in the discussion before Proposition 5.3.6, the limit γ must be absolutely continuous, so η is well-defined.
Step 3: Bounding the first two integrals. The first integral vanishes as n → ∞ by the portmanteau Lemma 1.7.1, and the second by uniform convergence.
Step 4: Bounding the third integral. The integrand is bounded by 8R, so it suffices to bound the measures of the sets involved. This is a bit technical, and uses the uniform density bound on (γn) and the portmanteau lemma.
If W2(γn, γ) → 0 and the γn have uniformly bounded densities, then .
Choose h in the proof of Proposition 5.3.7 to depend only on y.
Choose h in the proof of Proposition 5.3.7 to depend only on t 1, …, t n.
Let and set univalued}. As is absolutely continuous, , and the same is true for . The first assertion then follows from Proposition 1.7.11.
The second statement is proven similarly. Let Ei = supp μi and notice that by absolute continuity the set Bi has measure 1 with respect to μi. Apply Proposition 1.7.11. If in addition E1 = ⋯ = EN, then μi(B) = 1 for B = ∩Bi.
5.4 Illustrative Examples
As an illustration, we implement Algorithm 1 in several scenarios for which the pairwise optimal maps can be calculated explicitly at every iteration, allowing for fast computation without error propagation. In each case, we first give some theory, describing how the optimal maps are calculated, and then implement Algorithm 1 on simulated examples.
5.4.1 Gaussian Measures
Given the explicit formula for the optimal map between two Gaussian measures, application of Algorithm 1 to Gaussian measures is straightforward. The next result shows that, in the Gaussian case, the iterates must converge to the unique Fréchet mean, and that (5.4) can be derived from the characterisation of Karcher means.
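For the reader's convenience, the relevant closed form is the following: if γ = N(0, Γ) with Γ nonsingular and μ = N(0, S), the optimal map is linear,

$$
t_{\gamma}^{\mu}(x) \;=\; \Gamma^{-1/2}\big(\Gamma^{1/2} S\,\Gamma^{1/2}\big)^{1/2}\Gamma^{-1/2}\,x .
$$

Requiring the average of these maps to be the identity at a Gaussian Karcher mean N(0, Γ) of (μ1, …, μN), and conjugating by Γ^{1/2}, gives the fixed-point equation (presumably what the display (5.4) states):

$$
\Gamma \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(\Gamma^{1/2} S_i\,\Gamma^{1/2}\big)^{1/2} .
$$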
Theorem 5.4.1 Let μ1, …, μN be Gaussian measures with zero means and covariance matrices Si, with S1 nonsingular, and let the initial point γ0 be a Gaussian measure with nonsingular covariance matrix Γ0. Then the sequence of iterates generated by Algorithm 1 converges to the unique Fréchet mean of (μ1, …, μN).
Since the optimal maps are linear, so is their mean, and therefore γk is a Gaussian measure for all k, say with nonsingular covariance matrix Γk. Any limit point γ of (γk) is a Karcher mean by Theorem 5.3.1. If we knew that γ itself were Gaussian, then it actually must be the Fréchet mean, because the average of the optimal maps from γ to the μi then equals the identity everywhere on ℝ^d (see the discussion before Theorem 3.1.15).
Let us show that every limit point γ is indeed Gaussian. It suffices to prove that (Γk) is a bounded sequence, because if Γk → Γ, then γk → N(0, Γ) weakly, as can be seen from either Scheffé's theorem (the densities converge) or Lévy's continuity theorem (the characteristic functions converge).
If the means are nonzero, then the optimal maps are affine and the same result applies; the Fréchet mean is still a Gaussian measure with covariance matrix Γ and mean equal to the average of the means of the μi, i = 1, …, N.
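A numerically-minded sketch of this Gaussian specialisation follows; the function name, the stopping rule (a matrix-difference norm rather than ∥F′∥), and the example matrices are ours, not the text's. Because the pairwise optimal maps are linear, one descent step acts on covariance matrices only, and the fixed point of the iteration is exactly the equation displayed before Theorem 5.4.1.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_frechet_mean(covs, gamma0=None, n_iter=200, tol=1e-12):
    """One steepest-descent step for centred Gaussians acts on covariances as
    Gamma -> Gamma^{-1/2} ( mean_i (Gamma^{1/2} S_i Gamma^{1/2})^{1/2} )^2 Gamma^{-1/2},
    since the iterates remain Gaussian with these covariance matrices."""
    d = covs[0].shape[0]
    G = np.eye(d) if gamma0 is None else gamma0       # step (B): nonsingular start
    for _ in range(n_iter):
        R = np.real(sqrtm(G))                         # Gamma^{1/2}; real part kills round-off
        Rinv = np.linalg.inv(R)
        M = np.mean([np.real(sqrtm(R @ S @ R)) for S in covs], axis=0)
        G_next = Rinv @ M @ M @ Rinv                  # steps (C)-(E) in covariance form
        G_next = (G_next + G_next.T) / 2              # symmetrise against round-off
        if np.linalg.norm(G_next - G) < tol:          # surrogate for the gradient test (F)
            break
        G = G_next
    return G_next

# Example: Fréchet mean of two bivariate centred Gaussians.  At the fixed
# point M = Gamma in the code, i.e. Gamma = mean_i (Gamma^{1/2} S_i Gamma^{1/2})^{1/2}.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
Gbar = gaussian_frechet_mean([S1, S2])
```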
5.4.2 Compatible Measures
We next discuss the behaviour of the algorithm when the measures are compatible. Recall that a collection of measures is compatible if, for any γ, μ, ν in the collection, t_{μ}^{ν} ∘ t_{γ}^{μ} = t_{γ}^{ν} in L2(γ) (Definition 2.3.1). Boissard et al. [28] showed that when this condition holds, the Fréchet mean of (μ1, …, μN) can be found by simple computations involving the iterated barycentre. We again denote by γ0 the initial point of Algorithm 1, which can be any absolutely continuous measure.
If {γ0}∪{μi} is compatible, then Algorithm 1 converges to the Fréchet mean of (μi) after a single step.
In this case, Algorithm 1 requires the calculation of N pairwise optimal maps, and this can be reduced to N − 1 if the initial point is chosen to be μ1. This is the same computational complexity as the calculation of the iterated barycentre proposed in [28].
When the measures have a common copula, finding the optimal maps reduces to finding the optimal maps between the one-dimensional marginals (see Lemma 2.3.3) and this can be done using quantile functions as described in Sect. 1.5. The marginal Fréchet means are then plugged into the common copula to yield the joint Fréchet mean. We next illustrate Algorithm 1 in three such scenarios.
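In symbols (our notation): if μi has one-dimensional marginal distribution functions F_{i,1}, …, F_{i,d} and all the μi share the copula C, then the quantile function of the k-th marginal of the Fréchet mean is the average of the corresponding marginal quantile functions, and the joint Fréchet mean is recovered by plugging these marginals back into C:

$$
F_k^{-1}(u) \;=\; \frac{1}{N}\sum_{i=1}^{N} F_{i,k}^{-1}(u),
\qquad
\bar F(x_1,\dots,x_d) \;=\; C\big(F_1(x_1),\dots,F_d(x_d)\big) .
$$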
5.4.2.1 The One-Dimensional Case
5.4.2.2 Independence
5.4.2.3 Common Copula
5.4.3 Partially Gaussian Trivariate Measures
In terms of the orientation (principal axes) of the ellipsoids, the Fréchet mean is most similar to μ1 and μ2, whose orientations are similar to one another.
5.5 Population Version of Algorithm 1
Any descent iterate γ has density bounded by q^{-d}M, where q and M are as in (5.7).
The result is true in the empirical case by Proposition 5.3.6. The key point (observed by Pass [102, Subsection 3.3]) is that the number of measures does not appear in the bound q^{-d}M.
Though it follows that every Karcher mean of Λ has a bounded density, we cannot yet conclude that the same bound holds for the Fréchet mean, because we need the a priori knowledge that the latter is absolutely continuous. This can again be achieved by approximation:
Let Λ be a random measure with finite Fréchet functional. If Λ has a bounded density with positive probability, then the Fréchet mean of Λ is absolutely continuous with a bounded density.
Let q and M be as in (5.7), let Λ1, Λ2, … be a sample from Λ, and let qn be the proportion of (Λi)i≤n with density bounded by M. The empirical Fréchet mean λn of the sample (Λ1, …, Λn) then has a density bounded by qn^{-d}M. The Fréchet mean λ of Λ is unique by Proposition 3.2.7, and consequently λn → λ in W2 by the law of large numbers (Corollary 3.2.10). For any C > lim sup qn^{-d}M, the density of λ is bounded by C by the portmanteau Lemma 1.7.1, and the lim sup equals q^{-d}M almost surely, since qn → q by the law of large numbers. Thus, the density is bounded by q^{-d}M.
In the same way, one shows the population version of Theorem 3.1.9:
Let Λ be a random measure with finite Fréchet functional, and suppose that with positive probability Λ is absolutely continuous and has a bounded density. If the collection {γ}∪ Λ(Ω) is compatible and γ is absolutely continuous, then [E t_{γ}^{Λ}]#γ is the Fréchet mean of Λ.
It is of course sufficient that {γ}∪ Λ(Ω ∖ N0) be compatible for some null set N0.
5.6 Bibliographical Notes
The algorithm outlined in this chapter was suggested independently, in the steepest descent form presented here, by Zemel and Panaretos [134], and in the form of a fixed-point iteration by Álvarez-Esteban et al. [9]. These two papers provide different alternative proofs of Theorem 5.3.1. The exposition here is based on [134]. Although longer and more technical than the one in [9], the formalism of [134] is amenable to directly treating the optimal maps (Theorem 5.3.3) and the multicouplings (Corollary 5.3.4). On the flip side, the proof of the Gaussian case (Theorem 5.4.1) given in [9] is more explicit and quantitative; for instance, it shows the additional property that the traces of the matrix iterates increase monotonically.
Developing numerical schemes for computing Fréchet means in W2 is a very active area of research, and readers are referred to the recent monograph of Peyré and Cuturi [103, Section 9.2] for a survey.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.