The Kantorovich problem described in the previous chapter gives rise to a metric structure, the Wasserstein distance, on the space of probability measures on an underlying space. The resulting metric space, a subspace of the space of all probability measures, is commonly known as the Wasserstein space (although, as Villani [125, pages 118–119] puts it, this terminology is “very questionable”; see also Bobkov and Ledoux [25, page 4]). In Chap. 4, we shall see that this metric is in a sense canonical when dealing with warpings, that is, deformations of the underlying space (for example, in Theorem 4.2.4). In this chapter, we give the fundamental properties of the Wasserstein space. After some basic definitions, we describe the topological properties of that space in Sect. 2.2. It is then explained in Sect. 2.3 how the Wasserstein space can be endowed with a sort of infinite-dimensional Riemannian structure. Measurability issues are dealt with in the somewhat technical Sect. 2.4.
2.1 Definition, Notation, and Basic Properties
The aforementioned setting is by no means the most general one can consider. Firstly, one can define $W_p$ and the corresponding Wasserstein space for $0<p<1$ by removing the power $1/p$ from the infimum; the limit case $p=0$ yields the total variation distance. Another limit case can be defined as $W_\infty(\mu,\nu)=\lim_{p\to\infty}W_p(\mu,\nu)$. Moreover, $W_p$ and the Wasserstein space can be defined whenever the underlying space is a complete and separable metric space (or even only separable; see Clément and Desch [36]): one fixes some point $x_0$ in the space and replaces $\|x\|$ by $d(x,x_0)$. Although the topological properties below still hold at that level of generality (except when $p=0$ or $p=\infty$), for the sake of simplifying the notation we restrict the discussion to Banach spaces. It will always be assumed without explicit mention that $1\le p<\infty$.
We also recall the notation $B_R(x_0)=\{x:\|x-x_0\|<R\}$ and $\overline{B}_R(x_0)=\{x:\|x-x_0\|\le R\}$ for open and closed balls in the underlying space.
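For intuition about $W_p$ and its limit case $W_\infty$, here is a minimal Python sketch on the real line. It assumes the standard fact (not stated in this section) that for two empirical measures on the line with the same number of equally weighted atoms, matching the sorted samples is optimal for every $p\ge 1$. The function names are ours and purely illustrative.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """W_p between the empirical measures of xs and ys (equal length, real-valued)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    xs, ys = sorted(xs), sorted(ys)
    # Assumption: the monotone (sorted) matching is optimal on the line
    # for the convex costs |x - y|^p with p >= 1.
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def wasserstein_inf_1d(xs, ys):
    """Limit case W_infinity: largest displacement under the sorted matching."""
    return max(abs(x - y) for x, y in zip(sorted(xs), sorted(ys)))


mu = [0.0, 1.0, 3.0]
nu = [0.5, 1.0, 5.0]
for p in (1, 2, 50, 200):
    print(p, wasserstein_1d(mu, nu, p))
print("inf", wasserstein_inf_1d(mu, nu))
```

As $p$ grows, the printed values of $W_p$ increase towards the value of $W_\infty$, in line with the definition of $W_\infty$ as a limit.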
2.2 Topological Properties
2.2.1 Convergence, Compact Subsets
The topology of a space is determined by the collection of its closed sets. Since the Wasserstein space is a metric space, whether a set in it is closed or not depends on which sequences in the space converge. The following characterisation from Villani [124, Theorem 7.12] will be very useful.
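Let $\mu,\mu_n$ be measures in the Wasserstein space. Then the following are equivalent:
1. $W_p(\mu_n,\mu)\to 0$ as $n\to\infty$;
2. $\mu_n\to\mu$ weakly and $\int\|x\|^p\,d\mu_n(x)\to\int\|x\|^p\,d\mu(x)$;
3. $\mu_n\to\mu$ weakly and $\sup_n\int_{\{x:\|x\|>R\}}\|x\|^p\,d\mu_n(x)\to 0$ as $R\to\infty$; (2.4)
4. for any $C>0$ and any continuous real-valued $f$ such that $|f(x)|\le C(1+\|x\|^p)$ for all $x$, $\int f\,d\mu_n\to\int f\,d\mu$;
5. (Le Gouic and Loubes [87, Lemma 14]) $\mu_n\to\mu$ weakly and there exists a measure $\nu$ in the Wasserstein space such that $W_p(\mu_n,\nu)\to W_p(\mu,\nu)$.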
Consequently, the Wasserstein topology is finer than the weak topology induced on the Wasserstein space from the space of all probability measures. Indeed, let $A$ be a weakly closed subset of the Wasserstein space. If $\mu_n\in A$ converge to $\mu$ in the Wasserstein space, then $\mu_n\to\mu$ weakly, so $\mu\in A$. In other words, the Wasserstein topology has more closed sets than the induced weak topology. Moreover, each is a weakly closed subset of by the same arguments that lead to (1.3). In view of Theorem 2.2.1, a common strategy to establish Wasserstein convergence is to first show tightness and obtain weak convergence, hence a candidate limit, and then show that the stronger Wasserstein convergence actually holds. In some situations, the last part is automatic:
Let $K$ be a bounded set and suppose that $\mu_n(K)=1$ for all $n\ge 1$. Then $W_p(\mu_n,\mu)\to 0$ if and only if $\mu_n\to\mu$ weakly.
This is immediate from (2.4).
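Without boundedness of the supports, weak convergence alone does not suffice. A minimal Python sketch of a standard counterexample on the real line follows; the measures $\mu_n=(1-1/n)\delta_0+(1/n)\delta_{n^{1/p}}$ and all function names are our own illustrative choices. A small mass escapes to infinity fast enough to keep the $p$-th moment equal to 1, so $\mu_n\to\delta_0$ weakly while $W_p(\mu_n,\delta_0)=1$ for all $n$.

```python
import math


def escaping_mass(n, p):
    """Atoms and weights of mu_n = (1 - 1/n) delta_0 + (1/n) delta_{n^(1/p)}."""
    return [(0.0, 1.0 - 1.0 / n), (n ** (1.0 / p), 1.0 / n)]


def wp_to_dirac_at_zero(measure, p):
    """W_p(mu, delta_0).

    The only coupling of mu with a Dirac mass is the product measure,
    so W_p^p is the p-th absolute moment of mu.
    """
    return sum(w * abs(x) ** p for x, w in measure) ** (1.0 / p)


def integrate(f, measure):
    """Integral of f against a finitely supported measure."""
    return sum(w * f(x) for x, w in measure)


p = 2
for n in (10, 1000, 100000):
    mu_n = escaping_mass(n, p)
    # atan is bounded and continuous: its integral converges to atan(0) = 0.
    gap = abs(integrate(math.atan, mu_n) - math.atan(0.0))
    print(n, gap, wp_to_dirac_at_zero(mu_n, p))
```

The first printed column of differences shrinks like $1/n$, whereas the Wasserstein distance in the last column stays at 1.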
Before giving some examples, it will be convenient to formulate Theorem 2.2.1 in probabilistic terms. Let $X, X_n$ be random elements of the underlying space whose laws $\mu,\mu_n$ belong to the Wasserstein space. Assume without loss of generality that $X, X_n$ are defined on the same probability space and write $W_p(X_n,X)$ to denote $W_p(\mu_n,\mu)$. Then $W_p(X_n,X)\to 0$ if and only if $X_n\to X$ weakly and $\mathbb{E}\|X_n\|^p\to\mathbb{E}\|X\|^p$.
An early example of the use of the Wasserstein metric in statistics is due to Bickel and Freedman [21]. Let $X_n$ be independent and identically distributed random variables with mean zero and variance 1 and let $Z$ be a standard normal random variable. Then $Z_n=n^{-1/2}\sum_{i=1}^n X_i$ converge weakly to $Z$ by the central limit theorem. But $\mathbb{E}Z_n^2=1=\mathbb{E}Z^2$, so $W_2(Z_n,Z)\to 0$. Let $Z_n^*$ be a bootstrapped version of $Z_n$ constructed by resampling the $X_n$’s. If $W_2(Z_n^*,Z_n)\to 0$, then $W_2(Z_n^*,Z)\to 0$ and in particular $Z_n^*$ has the same asymptotic distribution as $Z_n$.
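The central limit theorem part of this example can be checked numerically. The sketch below is our own illustration, not taken from Bickel and Freedman: it takes the summands to be symmetric $\pm1$ signs, so that the law of the normalised sum $n^{-1/2}(X_1+\dots+X_n)$ is an explicit rescaled binomial, and it approximates $W_2$ to the standard normal through the one-dimensional quantile formula $W_2^2=\int_0^1(F^{-1}(u)-G^{-1}(u))^2\,du$ with a midpoint rule. The grid size is an arbitrary choice.

```python
import math
from bisect import bisect_left
from statistics import NormalDist


def w2_rademacher_sum_to_normal(n, grid=20000):
    """Approximate W_2 between (X_1 + ... + X_n) / sqrt(n), X_i = +-1, and N(0, 1)."""
    # Law of the normalised sum: atom (2k - n)/sqrt(n) with weight C(n, k) / 2^n.
    atoms = [(2 * k - n) / math.sqrt(n) for k in range(n + 1)]
    cdf, acc = [], 0.0
    for k in range(n + 1):
        acc += math.comb(n, k) / 2 ** n
        cdf.append(acc)
    std_normal = NormalDist()
    total = 0.0
    for j in range(grid):
        u = (j + 0.5) / grid  # midpoint rule on (0, 1)
        k = min(bisect_left(cdf, u), n)  # quantile index: first k with cdf[k] >= u
        total += (atoms[k] - std_normal.inv_cdf(u)) ** 2
    return math.sqrt(total / grid)


for n in (4, 25, 100, 400):
    print(n, w2_rademacher_sum_to_normal(n))
```

The printed distances decrease with $n$, as the convergence $W_2(Z_n,Z)\to 0$ predicts.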
Condition (2.4) is called uniform integrability of the function $x\mapsto\|x\|^p$ with respect to the collection $(\mu_n)$. Of course, it holds for a single measure by the dominated convergence theorem. This condition allows us to characterise compact sets in the Wasserstein space. One should beware that when the underlying space is infinite-dimensional, (2.4) alone is not sufficient to conclude that $(\mu_n)$ has a convergent subsequence: take $\mu_n$ to be Dirac measures at $e_n$ with $(e_n)$ an orthonormal basis of a Hilbert space (or any sequence with $\|e_n\|=1$ that has no convergent subsequence, if the space is a Banach space). The uniform integrability (2.4) must be accompanied by tightness, which is a consequence of (2.4) only when the space is finite-dimensional.
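The Dirac example can be made explicit with a short computation. It assumes only that the unique coupling of two Dirac measures is the Dirac measure at the pair of points, and that $(e_n)$ is orthonormal.

```latex
% Uniform integrability holds trivially, since every e_n has norm 1:
\int_{\{x:\|x\|>R\}} \|x\|^p \, d\delta_{e_n}(x) = 0
  \qquad \text{for all } n \text{ and all } R \ge 1 .
% Yet distinct Dirac measures stay at a fixed distance from each other:
W_p(\delta_{e_n}, \delta_{e_m}) = \|e_n - e_m\|
  = \sqrt{\|e_n\|^2 + \|e_m\|^2 - 2\langle e_n, e_m\rangle}
  = \sqrt{2}
  \qquad (n \ne m).
```

Hence no subsequence of $(\delta_{e_n})$ is Cauchy in the Wasserstein distance, and none can converge.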
The proof is on page 41 of the supplement.
For any sequence (μ n) in (tight or not) there exists a monotonically divergent g with for all n.
is compact.
2.2.2 Dense Subsets and Completeness
If we identify a measure $\mu$ in the Wasserstein space with a random variable $X$ (having distribution $\mu$), then $X$ has a finite $p$-th moment in the sense that the real-valued random variable $\|X\|$ is in $L_p$. In view of that, it should not come as a surprise that the Wasserstein space enjoys topological properties similar to $L_p$ spaces. In this subsection, we give some examples of useful dense subsets of the Wasserstein space and then “show” that, like the underlying space itself, it is a complete separable metric space. In the next subsection, we describe some of the negative properties that the Wasserstein space has, again in similarity with $L_p$ spaces.
We first show that the Wasserstein space is separable. The core idea of the proof is the feasibility of approximating any measure with discrete measures as follows.
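For any probability measure $\mu$ and the corresponding sequence of empirical measures $\mu_n$, $W_p(\mu_n,\mu)\to 0$ almost surely if and only if $\mu$ belongs to the Wasserstein space.
Indeed, if $\mu$ does not belong to the Wasserstein space, then $W_p(\mu_n,\mu)$ is infinite for all $n$, since $\mu_n$ is compactly supported, hence in the Wasserstein space.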
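The convergence of empirical measures can be observed in a small simulation. The sketch below is illustrative only: it takes the target measure to be Uniform(0, 1) on the real line (our choice), whose quantile function is the identity, and evaluates $W_p$ exactly through the one-dimensional quantile coupling. The seed and sample sizes are arbitrary.

```python
import math
import random


def _int_abs_pow(c, a, b, p):
    """Exact integral of |c - u|^p over u in [a, b]."""
    def antiderivative(u):
        d = u - c
        return math.copysign(abs(d) ** (p + 1), d) / (p + 1)
    return antiderivative(b) - antiderivative(a)


def wp_empirical_to_uniform(sample, p=1.0):
    """W_p between the empirical measure of `sample` and Uniform(0, 1)."""
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sorted(sample)):
        # The i-th order statistic is the empirical quantile on (i/n, (i+1)/n);
        # the quantile function of Uniform(0, 1) is the identity.
        total += _int_abs_pow(x, i / n, (i + 1) / n, p)
    return total ** (1.0 / p)


rng = random.Random(0)
for n in (10, 100, 1000, 20000):
    sample = [rng.random() for _ in range(n)]
    print(n, wp_empirical_to_uniform(sample, p=1))
```

The printed distances shrink as the sample size grows, as expected for a measure with finite $p$-th moment.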
Proposition 2.2.6 is the basis for constructing dense subsets of the Wasserstein space.
The following types of measures are dense in the Wasserstein space:
1. finitely supported measures with rational weights;
2. compactly supported measures;
3. finitely supported measures with rational weights on a dense subset of the underlying space;
4. if the underlying space is $\mathbb{R}^d$, the collection of absolutely continuous and compactly supported measures;
5. if the underlying space is $\mathbb{R}^d$, the collection of absolutely continuous measures with strictly positive and bounded analytic densities.
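In particular, the Wasserstein space is separable (the third set is countable when the dense subset is chosen countable, which is possible as the underlying space is separable).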
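To illustrate approximation by finitely supported measures with rational weights and rational atoms, the following sketch approximates Uniform(0, 1) on the real line by $n$ equally weighted atoms at the midpoints $(2i-1)/(2n)$. The target measure, the quantiser, and the function names are our own illustrative choices; the distance $W_1$ is computed exactly in rational arithmetic through the quantile coupling.

```python
from fractions import Fraction


def midpoint_quantizer(n):
    """Atoms (2i - 1)/(2n), i = 1..n, each with weight 1/n: all rational."""
    return [(Fraction(2 * i - 1, 2 * n), Fraction(1, n)) for i in range(1, n + 1)]


def w1_to_uniform(atoms):
    """Exact W_1 between sum_i w_i delta_{x_i} (atoms sorted) and Uniform(0, 1)."""
    total, lo = Fraction(0), Fraction(0)
    for x, w in atoms:
        hi = lo + w
        # Quantile coupling: levels u in (lo, hi) are transported to the atom x.
        # An antiderivative of |u - x| is (u - x)|u - x| / 2.
        upper = (hi - x) * abs(hi - x) / 2
        lower = (lo - x) * abs(lo - x) / 2
        total += upper - lower
        lo = hi
    return total


for n in (1, 2, 8, 64):
    print(n, w1_to_uniform(midpoint_quantizer(n)))
```

The distance equals $1/(4n)$ for this quantiser, so it can be made arbitrarily small with rational data only.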
This is a simple consequence of Proposition 2.2.6 and approximations, and the details are given on page 43 in the supplement.
The Wasserstein space is complete.
One may find two different proofs in Villani [125, Theorem 6.18] and Ambrosio et al. [12, Proposition 7.1.5]. On page 43 of the supplement, we sketch an alternative argument based on completeness of the weak topology.
2.2.3 Negative Topological Properties
In the previous subsection, we have shown that the Wasserstein space is separable and complete like $L_p$ spaces. Just like them, however, the Wasserstein space is neither locally compact nor σ-compact. For this reason, existence proofs of Fréchet means in the Wasserstein space require tools that are more specific to this space, and do not rely upon local compactness (see Sect. 3.1).
is not compact.
Ambrosio et al. [12, Remark 7.1.9] show this when μ is a Dirac measure, and we extend their argument on page 43 of the supplement.
From this, we deduce:
The Wasserstein space is not σ-compact.
2.2.4 Covering Numbers
with C 1(d) = 3deθ d, and .
Since 𝜖 > 0 is small and L is increasing in 𝜖, the restriction that d𝜖 ≤ L is typically not binding. We provide some examples before giving the proof.
Example 4: if f(L) is only , then behaves like , so p has a very dominant effect.
The proof is divided into four steps.
2.3 The Tangent Bundle
Although the Wasserstein space is non-linear in terms of measures, it is linear in terms of maps. Indeed, if $\mu$ is in the Wasserstein space and $T_1,T_2$ are maps from the underlying space to itself such that $\|T_i\|\in L_p(\mu)$, then $(\alpha T_1+\beta T_2)\#\mu$ is in the Wasserstein space for all $\alpha,\beta\in\mathbb{R}$. Later, in Sect. 2.4, we shall see that the Wasserstein space is in fact homeomorphic to a subset of the space of such functions. The goal of this section is to exploit the linearity of the latter in order to define the tangent bundle of the Wasserstein space. This in particular will be used for deriving differentiability properties of the Wasserstein distance in Sect. 3.1.6. However, the latter can be understood at a purely analytic level, and readers uncomfortable with differential geometry can access most of the rest of the monograph without reference to this section.
We assume here that the underlying space is a Hilbert space and that $p=2$; the results extend to any $p>1$. Absolutely continuous measures are assumed to be so with respect to Lebesgue measure if the space is $\mathbb{R}^d$, and otherwise refer to Definition 1.6.4.
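The linearity in terms of maps rests on the Minkowski inequality. As a brief worked step, with $T_1,T_2$ two maps whose norms are $p$-integrable with respect to $\mu$ and $\alpha,\beta$ real scalars (this notation is ours), the change of variables formula for push-forwards gives:

```latex
\left( \int \|x\|^p \, d\big[(\alpha T_1 + \beta T_2)\#\mu\big](x) \right)^{1/p}
  = \left( \int \|\alpha T_1(x) + \beta T_2(x)\|^p \, d\mu(x) \right)^{1/p}
  \le |\alpha| \left( \int \|T_1\|^p \, d\mu \right)^{1/p}
    + |\beta| \left( \int \|T_2\|^p \, d\mu \right)^{1/p}
  < \infty .
```

So the push-forward of $\mu$ by any linear combination of such maps again has a finite $p$-th moment.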
2.3.1 Geodesics, the Log Map and the Exponential Map in Wasserstein Space
2.3.2 Curvature and Compatibility of Measures
A collection of absolutely continuous measures is compatible if for all $\nu,\mu,\gamma$ in the collection, the optimal map from $\nu$ to $\gamma$ equals the composition of the optimal map from $\mu$ to $\gamma$ with the optimal map from $\nu$ to $\mu$ (in $L_2(\nu)$).
The absolute continuity is not necessary and was introduced for notational simplicity. A more general definition that applies to general measures is the following: every finite subcollection admits an optimal multicoupling whose relevant projections are simultaneously pairwise optimal; see the paragraph preceding Theorem 3.1.9.
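As a sanity check of the idea that optimal maps between compatible measures compose into optimal maps, consider Gaussian measures on the real line. The sketch assumes the standard fact that the optimal map (for quadratic cost) from $N(m_1,s_1^2)$ to $N(m_2,s_2^2)$ is the increasing affine map $x\mapsto m_2+(s_2/s_1)(x-m_1)$. The parameter values and function name are arbitrary illustrations.

```python
def gaussian_optimal_map(source, target):
    """Increasing affine map pushing N(m1, s1^2) forward to N(m2, s2^2) on the line."""
    (m1, s1), (m2, s2) = source, target
    return lambda x: m2 + (s2 / s1) * (x - m1)


# (mean, standard deviation) of three Gaussian measures
nu, mu, gamma = (0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)

t_nu_mu = gaussian_optimal_map(nu, mu)
t_mu_gamma = gaussian_optimal_map(mu, gamma)
t_nu_gamma = gaussian_optimal_map(nu, gamma)

for x in (-2.0, 0.0, 0.7, 3.0):
    composed = t_mu_gamma(t_nu_mu(x))  # go from nu to gamma via mu
    direct = t_nu_gamma(x)             # go from nu to gamma directly
    print(x, composed, direct)
```

The two printed columns agree, reflecting that compositions of increasing maps on the line are increasing and hence optimal.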
Here are some examples of compatible measures. It will be convenient to describe them using the optimal maps from a reference measure. Define the collection of measures obtained by pushing the reference measure forward through maps $t$ belonging to one of the following families. The first imposes the one-dimensional structure by varying only the behaviour of the norm of $x$, while the second allows for separation of variables that splits the $d$-dimensional problem into $d$ one-dimensional ones.
The copulae associated with two absolutely continuous measures are equal if and only if the optimal map between them takes the separable form (2.8).
2.4 Random Measures in Wasserstein Space
Let $\mu$ be a fixed absolutely continuous probability measure in the Wasserstein space. If $\Lambda$ is another probability measure, then the optimal transport map from $\mu$ to $\Lambda$ and the corresponding convex potential are functions of $\Lambda$. If $\Lambda$ is now random, then we would like to be able to make probability statements about them. To this end, it needs to be shown that the transport map and the convex potential are measurable functions of $\Lambda$. The goal of this section is to develop a rigorous mathematical framework that justifies such probability statements. We show that all the relevant quantities are indeed measurable, and in particular establish a Fubini-type result in Proposition 2.4.9. This technical section may be skipped at first reading.
Here is an example of a measurability result (Villani [125, Corollary 5.22]). Recall that the space of Borel probability measures on the underlying space is endowed with the topology of weak convergence, which makes it a metric space. Let the underlying space be a complete separable metric space and let $c$ be a continuous cost function. Let $\Omega$ be a probability space and let $\Lambda,\kappa$ be measurable maps from $\Omega$ into the space of probability measures. Then there exists a measurable selection of optimal transference plans, that is, a measurable map $\pi$ on $\Omega$ such that $\pi(\omega)\in\Pi(\Lambda(\omega),\kappa(\omega))$ is optimal for all $\omega\in\Omega$.
Although this result is very general, it only provides information about $\pi$. If $\pi$ is induced from a map $T$, it is not obvious how to construct $T$ from $\pi$ in a measurable way; we will therefore follow a different path. In order to have an (almost) self-contained exposition, we work in a somewhat simplified setting that nevertheless suffices for the sequel. At least in the Euclidean case, more general measurability results in the flavour of this section can be found in Fontbona et al. [53]. On the other hand, we will not need to appeal to abstract measurable selection theorems as in [53, 125].
2.4.1 Measurability of Measures and of Optimal Maps
Let the underlying space be a separable Banach space. (Most of the results below hold for any complete separable metric space, but we will avoid this generality for brevity and simpler notation.) The Wasserstein space is a metric space for any $p\ge 1$. We can thus define:
A random measure $\Lambda$ is any measurable map from a probability space to the Wasserstein space, endowed with its Borel σ-algebra.
In what follows, whenever we call something random, we mean that it is measurable as a map from some generic unspecified probability space.
A random measure Λ is measurable if and only if it is measurable with respect to the induced weak topology.
Since both topologies are Polish, this follows from abstract measure-theoretic results (Fremlin [57, Paragraph 423F]). We give an elementary proof on page 53 of the supplement.
Optimal maps are functions from the underlying space to itself. In order to define random optimal maps, we need to define a topology and a σ-algebra on the space of such functions.
When the underlying space is separable, the space of such maps whose norm is $p$-integrable with respect to $\mu$ is an example of a Bochner space, though we will not use this terminology.
This space of maps is a Banach space.
The proof, a simple variant of the classical one, is given on page 53 of the supplement.
Random maps lead naturally to random measures:
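Let $\mu$ be a measure in the Wasserstein space and let $t$ be a random map in the space of maps defined above. Then $\Lambda=t\#\mu$ is a continuous function of $t$, taking values in the Wasserstein space, hence a random measure.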
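The continuity of the push-forward operation can be seen from a one-line bound. For two maps $t,s$ with $p$-integrable norms with respect to $\mu$ (our notation), the measure $(t,s)\#\mu$ is a coupling of $t\#\mu$ and $s\#\mu$, though in general not an optimal one, so that:

```latex
W_p^p(t\#\mu, s\#\mu)
  \le \int \|x - y\|^p \, d\big[(t, s)\#\mu\big](x, y)
  = \int \|t(x) - s(x)\|^p \, d\mu(x) .
```

Taking $p$-th roots shows that the push-forward is 1-Lipschitz from the space of maps, with its $L_p(\mu)$-type norm, into the Wasserstein space.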
Conversely, t is a continuous function of Λ:
Let $\Lambda$ be a random measure in the Wasserstein space and let $\mu$ be a measure in the Wasserstein space such that the unique optimal coupling of $\mu$ and $\Lambda$ is induced by a map. Then this optimal map is a continuous function of $\Lambda$, from the Wasserstein space to the space of maps, so it is a random element in the space of maps. In particular, the result holds if the underlying space is a separable Hilbert space, $p>1$, and $\mu$ is absolutely continuous.
This result is more subtle than Lemma 2.4.5, since the mapping from $\Lambda$ to the optimal map is not necessarily Lipschitz. We give here a self-contained proof for the Euclidean case with quadratic cost and $\mu$ absolutely continuous. The general case builds on Villani [125, Corollary 5.23] and is given on page 54 of the supplement.
In Proposition 5.3.7, we show under some conditions that is a continuous function of μ.
2.4.2 Random Optimal Maps and Fubini’s Theorem
From now on, we assume that the underlying space is a separable Hilbert space and that $p=2$. The results can most likely be generalised to all $p>1$ (see Ambrosio et al. [12, Section 10.2]), but we restrict to the quadratic case for simplicity.
The space of functions for which the Bochner integral is defined is the Bochner space $L_1(\Omega;B)$, but we will use neither this terminology nor the notation. It is not difficult to see that Bochner integrals are well-defined: the expectations do not depend on the representation of the simple functions nor on the approximating sequence, and the limit exists in $B$ (because it is complete). More on Bochner integrals can be found in Hsing and Eubank [71, Section 2.6] or Dunford et al. [48, Chapter III.6]. A major difference from the real case is that there is no clear notion of “infinity” here: the Bochner integral is always an element of $B$, whereas expectations of real-valued random variables can be defined in the extended real line. It turns out that separability is quite important in this setting:
Let $f:\Omega\to B$ be measurable. Then there exists a sequence of simple functions $f_n$ such that $\|f_n(\omega)-f(\omega)\|\to 0$ for almost all $\omega$ if and only if $f(\Omega\setminus N)$ is separable for some $N\subseteq\Omega$ of probability zero. In that case, $f_n$ can be chosen so that $\|f_n(\omega)\|\le 2\|f(\omega)\|$ for all $\omega\in\Omega$.
A proof can be found in [48, Lemma III.6.9], or on page 55 of the supplement. Functions satisfying this approximation condition are sometimes called strongly measurable or Bochner measurable. In view of the lemma, we will call them separately valued, since this is the condition that will need to be checked in order to define their integrals.
Two remarks are in order. Firstly, if $B$ itself is separable, then $f(\Omega)$ will obviously be separable. Secondly, the set on which $f_n$ does not converge to $f$ may fail to be measurable, but must have outer probability zero (it is included in a measurable set of measure zero) [48, Lemma III.6.9]. This can be remedied by assuming that the probability space is complete. It will not, however, be necessary to do so, since this measurability issue will not alter the Bochner expectation of $f$.
This holds by linearity when Λ is a simple random measure. The general case follows by approximation: the Wasserstein space is separable and so is the space of optimal maps, by Lemma 2.4.6, so we may apply Lemma 2.4.8 and approximate by simple maps for which the equality holds by linearity. On page 56 of the supplement, we show that these simple maps can be assumed optimal, and give the full details.
2.5 Bibliographical Notes
Our proof of Theorem 2.2.11 borrows heavily from Bolley et al. [29]. A similar result was obtained by Kloeckner [81], who also provides a lower bound of a similar order.
The origins of Sect. 2.3 can be traced back to the seminal work of Jordan et al. [74], who interpret the Fokker–Planck equation as a gradient flow (where functionals defined on the Wasserstein space can be differentiated) with respect to the 2-Wasserstein metric. The Riemannian interpretation was (formally) introduced by Otto [99], and rigorously established by Ambrosio et al. [12] and others; see Villani [125, Chapter 15] for further bibliography and more details.
Compatible measures (Definition 2.3.1) were implicitly introduced by Boissard et al. [28] in the context of admissible optimal maps, where one defines families of gradients of convex functions $(T_i)$ such that $T_j\circ T_i^{-1}$ is a gradient of a convex function for any $i$ and $j$. For (any) fixed measure in a collection, compatibility of the collection is then equivalent to admissibility of the collection of optimal maps from the fixed measure to the members of the collection. The examples we gave are also taken from [28].
Lemma 2.3.3 is from Cuesta-Albertos et al. [38, Theorem 2.9] (see also Zemel and Panaretos [135]).