In this preliminary chapter, we introduce the problem of optimal transport, which is the main concept behind Wasserstein spaces. General references on this topic are the books by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio et al. [12], Ambrosio and Gigli [10], and Santambrogio [119]. This chapter includes only few proofs, when they are simple, informative, or are not easily found in one of the cited references.
1.1 The Monge and the Kantorovich Problems
Perhaps most importantly, a minimiser for the Kantorovich problem exists under weak conditions. In order to show this, we first recall some definitions. Let be the space of real-valued, continuous bounded functions on . A sequence of probability measures is said to converge weakly 2 to if for all , . To avoid confusion with other types of convergence, we will usually write μ n → μ weakly; in the rare cases where a symbol is needed we shall use the notation . Of course, if μ n → μ weakly and , then μ must be in too (this is seen by taking f ≡ 1 and by observing that if f ≥ 0).
A collection of probability measures is tight if for all 𝜖 > 0 there exists a compact set K such that . If is represented by a sequence {μ n}, then Prokhorov’s theorem (Billingsley [24, Theorem 5.1]) states that a subsequence of {μ n} must converge weakly to some probability measure μ.
A remark about terminology is in order. Many authors talk about the Monge–Kantorovich problem or the optimal transport(ation) problem. More often than not, they refer to what we call here the Kantorovich problem. When one of the scenarios presented in Sects. 1.3 and 1.6.1 is considered, this does not result in ambiguity.
1.2 Probabilistic Interpretation
The preceding section was an analytic presentation of the Monge and the Kantorovich problems. It is illuminating, however, to also recast things in probabilistic terms, and this is the topic of this section.
A random element on a complete separable metric space (or any topological space) is simply a measurable function X from some (generic) probability space to (with its Borel σ-algebra). The probability law (or probability distribution) is the probability measure defined on the space ; this is the Borel measure satisfying for all Borel sets A.
1.3 The Discrete Uniform Case
There is a special case in which the Monge–Kantorovich problem reduces to a finite combinatorial problem. Although it may seem at first hand as an oversimplification of the original problem, it is of importance in practice because arbitrary measures can be approximated by discrete measures by means of the strong law of large numbers. Moreover, the discrete case is important in theory as well, as a motivating example for the Kantorovich duality (Sect. 1.4) and the property of cyclical monotonicity (Sect. 1.7).
The Kantorovich problem is a linear program with n 2 variables and 2n constraints. It must have a solution because B n (hence B n∕n) is a compact (nonempty) set in and the objective function is linear in the matrix elements, hence continuous. (This property is independent of the possibly infinite-dimensional spaces and in which the points lie.) The Monge problem also admits a solution because S n is a finite set. To see that the two problems are essentially the same, we need to introduce the following notion. If B is a convex set, then x ∈ B is an extremal point of B if it cannot be written as a convex combination tz + (1 − t)y for some distinct points y, z ∈ B. It is well known (Luenberger and Ye [89, Section 2.5]) that there exists an optimal solution that is extremal, so that it becomes relevant to identify the extremal points of B n. It is fairly clear that each permutation matrix is extremal in B n; the less obvious converse is known as Birkhoff’s theorem, a proof of which can be found, for instance, at the end of the introduction in Villani [124] or (in a different terminology) in Luenberger and Ye [89, Section 6.5]. Thus, we have:
There exists σ ∈ S n such that M(σ) minimises C(M) over B n∕n. Furthermore, if {σ 1, …, σ k} is the set of optimal permutations, then the set of optimal matrices is the convex hull of {M(σ 1), …, M(σ k)}. In particular, if σ is the unique optimal permutation, then M(σ) is the unique optimal matrix.
Thus, in the discrete case, the Monge and the Kantorovich problems coincide. One can of course use the simplex method [89, Chapter 3] to solve the linear program, but there are n! vertices, and there is in principle no guarantee that the simplex method solves the problem efficiently. However, the constraints matrix has a very specific form (it contains only zeroes and ones, and is totally unimodular), so specialised algorithms for this problem exist. One of them is the Hungarian algorithm of Kuhn [85] or its variant of Munkres [96] that has a worst-case computational complexity of at most O(n 4). Another alternative is the class of net flow algorithms described in [89, Chapter 6]. In particular, the algorithm of Edmonds and Karp [50] has a complexity of at most . This monograph does not focus on computational aspects for optimal transport. This is a fascinating and very active area of contemporary research, and readers are directed to Peyré and Cuturi [103].
The special case described here could have been more precisely called “the discrete uniform case on the same number of points”, as “the discrete case” could refer to any two finitely supported measures μ and ν. In the Monge context, the setup discussed here is the most interesting case, see page 8 in the supplement for more details.
1.4 Kantorovich Duality
The discrete case of Sect. 1.3 is an example of a linear program and thus enjoys a rich duality theory (Luenberger and Ye [89, Chapter 4]). The general Kantorovich problem is an infinite-dimensional linear program, and under mild assumptions admits similar duality.
1.4.1 Duality in the Discrete Uniform Case
1.4.2 Duality in the General Case
The proof follows from the fact that (1.2) yields the above equality when φ and ψ are indicator functions. One then uses linearity and approximations to deduce the result.
See the Bibliographical Notes for other versions of the duality.
When the cost function is continuous, or more generally, a countable supremum of continuous functions, the infimum is attained (see (1.3)). The existence of maximisers (φ, ψ) is more delicate and requires a finiteness condition, as formulated in Proposition 1.8.1 below.
The next sections are dedicated to more concrete examples that will be used through the rest of the book.
1.5 The One-Dimensional Case
Since any distribution can be approximated by continuous distributions, in view of the above discussion, the following result from Villani [124, Theorem 2.18] should not be too surprising.
If the infimum is finite and h is strictly convex, then the optimal transference plan is unique. Furthermore, if F is continuous, then the infimum is attained by the transport map T = G −1 ∘ F.
The prototypical choice for h is h(z) = |z|p with p > 1. This result allows in particular a direct evaluation of the Wasserstein distances for measures on the real line (see Chap. 2).
Note that no regularity is needed in order that the optimal transference plan be unique, unlike in higher dimensions (compare Theorem 1.8.2). The structure of solutions in the concave case (0 < p < 1) is more complicated, see McCann [94].
When p = 1, the cost function is convex but not strictly so, and solutions will not be unique. However, the total cost in Theorem 1.5.1 admits another representation that is often more convenient.
The proof is a simple application of Fubini’s theorem; see page 13 in the supplement.
1.6 Quadratic Cost
1.6.1 The Absolutely Continuous Case
We begin with the Euclidean case, where is endowed with the Euclidean metric, and use the Kantorovich duality to obtain characterisations of optimal maps.
Let be a convex function with domain and let be the set of points at which f is not differentiable. Then has Lebesgue measure 0.
Theorem 1.6.1 is usually stated for the interior of domf, denoted int(domf), rather than the closure. But, since A = domf is convex, its boundary has Lebesgue measure zero. To see this assume first that A is bounded. If intA is empty, then A lies in a lower dimensional subspace [113, Theorem 2.4]. Otherwise, without loss of generality 0 ∈intA, and then by convexity of A, ∂A ⊆ (1 + 𝜖)A for all 𝜖 > 0. When A is unbounded, write it as ∪nA ∩ [−n, n]d.
Let μ and ν be probability measures on with finite second moments, and suppose that μ is absolutely continuous with respect to Lebesgue measure. Then the solution to the Kantorovich problem is unique, and is induced from a transport map T that equals μ-almost surely the gradient of a convex function ϕ. Furthermore, the pair (∥x∥2∕2 − ϕ, ∥y∥2∕2 − ϕ ∗) is optimal for the dual problem.
To alleviate the notation we write ϕ instead of . By Proposition 1.8.1, there exists an optimal dual pair (φ, ψ) such that ϕ(x) = ∥x∥2∕2 − φ(x) is convex and lower semicontinuous, and by the discussion in Sect. 1.1, there exists an optimal π. Since ϕ is μ-integrable, it must be finite almost everywhere, i.e., μ(domϕ) = 1. By Theorem 1.6.1, if we define as the set of nondifferentiability points of ϕ, then ; as μ is absolutely continuous, the same holds for μ. (Here Leb denotes Lebesgue measure.)
We conclude that . In other words, ϕ is differentiable μ-almost everywhere, and so for μ-almost any x, there exists a unique y such that , and y = ∇ϕ(x). This shows that π is unique and induced from the transport map ∇ϕ(x). The gradient ∇ϕ is Borel measurable, since each of its coordinates can be written as for some vector v (the canonical basis of ), which is Borel measurable because the limit superior is taken on countably many functions (and ϕ is measurable because it is lower semicontinuous).
1.6.2 Separable Hilbert Spaces
The finite-dimensionality of in the previous subsection was only used in order to apply Theorem 1.6.1, so one could hope to extend the results to infinite-dimensional separable Hilbert spaces.
Although there is no obvious parallel for Lebesgue measure (i.e., translation invariant) on infinite-dimensional Banach spaces, one can still define absolute continuity via Gaussian measures. Indeed, is absolutely continuous with respect to Lebesgue measure if and only if the following holds: if is such that for any nondegenerate Gaussian measure ν, then . This definition can be extended to any separable Banach space via projections, as follows. Let be the (topological) dual of , consisting of all real-valued, continuous linear functionals on .
A probability measure is a nondegenerate Gaussian measure if for any , is a Gaussian measure with positive variance.
A subset is a Gaussian null set if whenever ν is a nondegenerate Gaussian measure, . A probability measure is absolutely continuous if μ vanishes on all Gaussian null sets.
Clearly, if ν is a nondegenerate Gaussian measure, then it is absolutely continuous.
As explained in Ambrosio et al. [12, Section 6.2], a version of Rademacher’s theorem holds in separable Hilbert spaces: a locally Lipschitz function is Gâteaux differentiable except on a Gaussian null set of . Theorem 1.6.2 (and more generally, Theorem 1.8.2) extend to infinite dimensions; see [12, Theorem 6.2.10].
1.6.3 The Gaussian Case
Apart from the one-dimensional case of Sect. 1.5, there is another special case in which there is a unique and explicit solution to the Monge–Kantorovich problem.
Suppose that μ and ν are Gaussian measures on with zero means and nonsingular covariance matrices A and B. By Theorem 1.6.2, we know that there exists a unique optimal map T such that T#μ = ν. Since linear push-forwards of Gaussians are Gaussian, it seems natural to guess that T should be linear, and this is indeed the case.
Since T is a linear map that should be the gradient of a convex function ϕ, it must be that ϕ is quadratic, i.e., for and some matrix Q. The gradient of ϕ at x is (Q + Q t)x and the Hessian matrix is Q + Q t. Thus, T = Q + Q t and since ϕ is convex, T must be positive semidefinite.
1.6.4 Regularity of the Transport Maps
The optimal transport map T between Gaussian measures on is linear, so it is of course very smooth (analytic). The densities of Gaussian measures are analytic too, so that T inherits the regularity of μ and ν. Using the formula for T, one can show that a similar phenomenon takes place in the one-dimensional case. Though we do not have a formula for T at our disposal when μ and ν are general absolutely continuous measures on , d ≥ 2, it turns out that even in that case, T inherits the regularity of μ and ν if some convexity conditions are satisfied.
To guess what kind of results can be hoped for, let us first examine the case d = 1. Let F and G denote the distribution functions of μ and ν, respectively. Suppose that G is continuously differentiable and that G′ > 0 on some open interval (finite or not) I such that ν(I) = 1. Then the inverse function theorem says that G −1 is also continuously differentiable. Recall that the support of a (Borel) probability measure μ (denoted suppμ) is the smallest closed set K such that μ(K) = 1. A simple application of the chain rule (see page 19 in the supplement) gives:
Let possess distribution functions F and G of class C k, k ≥ 1. Suppose further that suppν is an interval I (possibly unbounded) and that G′ > 0 on the interior of I. Then the optimal map is of class C k as well. If F, G ∈ C 0 are merely continuous, then so is the optimal map.
The assumption on the support of ν is important: if μ is Lebesgue measure on [0, 1] and the support of ν is disconnected, then T cannot even be continuous, no matter how smooth ν is.
- 1.
If Ω 1 and Ω 2 are bounded and f, g are bounded below, then ϕ is strictly convex and of class C 1, α(Ω 1) for some α > 0.
- 2.
If and f, g ∈ C 0, α, then ϕ ∈ C 2, α(Ω 1).
If in addition f, g ∈ C k, α, then ϕ ∈ C k+2, α(Ω 1).
In other words, the optimal map T = ∇ϕ ∈ C k+1, α(Ω 1) is one derivative smoother than the densities, so has the same smoothness as the measures μ, ν.
Theorem 1.6.7 will be used in two ways in this book. Firstly, it is used to derive criteria for a Karcher mean of a collection of measures to be the Fréchet mean of that collection (Theorem 3.1.15). Secondly, it allows one to obtain very smooth estimates for the transport maps. Indeed, any two measures μ and ν can be approximated by measures satisfying the second condition: one can approximate them by discrete measures using the law of large numbers and then employ a convolution with, e.g., a Gaussian measure (see, for instance, Theorem 2.2.7). It is not obvious that the transport maps between the approximations converge to the transport maps between the original measures, but we will see this to be true in the next section.
1.7 Stability of Solutions Under Weak Convergence
In this section, we discuss the behaviour of the solution to the Monge–Kantorovich problem when the measures μ and ν are replaced by approximations μ n and ν n. Since any measure can be approximated by discrete measures or by smooth measures, this allows us to benefit from both worlds. On the one hand, approximating μ and ν with discrete measures leads to the finite discrete problem of Sect. 1.3 that can be solved exactly. On the other hand, approximating μ and ν with Gaussian convolutions thereof leads to very smooth measures (at least on ), and so the regularity results of the previous section imply that the respective optimal maps will also be smooth. Finally, in applications, one would almost always observe the measures of interest μ and ν with a certain amount of noise, and it is therefore of interest to control the error introduced by the noise. In image analysis, μ can represent an image that has undergone blurring, or some other perturbation (Amit et al. [13]). In other applications, the noise could be due to sampling variation, where instead of μ one observes a discrete measure μ N obtained from realisations X 1, …, X N of random elements with distribution μ as (see Chap. 4).
In Sect. 1.7.1, we will see that the optimal transference plan π depends continuously on μ and ν. With this result under one’s belt, one can then deduce an analogous property for the optimal map T from μ to ν given some regularity of μ, as will be seen in Sect. 1.7.2.
We shall assume throughout this section that μ n → μ and ν n → ν weakly, which, we recall, means that for all continuous bounded . The following equivalent definitions for weak convergence will be used not only in this section, but elsewhere as well.
μ n → μ weakly;
F n(x) → F(x) for any continuity point x of F. Here , F n is the distribution function of μ n and F is that of μ;
for any open , ;
for any closed , ;
for any bounded measurable h whose set of discontinuity points is a μ-null set.
For a proof, see, for instance, Billingsley [24, Theorem 2.1]. The equivalence with the last condition can be found in Pollard [104, Section III.2].
1.7.1 Stability of Transference Plans and Cyclical Monotonicity
In this subsection, we state and sketch the proof of the fact that if μ n → μ and ν n → ν weakly, then the optimal transference plans π n ∈ Π(μ n, ν n) converge to an optimal π ∈ Π(μ, ν). The result, as stated in Villani [125, Theorem 5.20], is valid on complete separate metric spaces with general cost functions, and reads as follows.
then (π n) is a tight sequence and each of its weak limits π ∈ Π(μ, ν) is optimal.
One can even let c vary with n under some conditions.
A probability measure π on is cyclically monotone if there exists a monotone Borel set Γ such that π(Γ) = 1.
Let and suppose that the cost function c is nonnegative and continuous. Assume that the optimal π ∈ Π(μ, ν) has a finite total cost. Then suppπ is cyclically monotone. In particular, π is cyclically monotone.
Thus, optimal transference plans π solve infinitely many discrete Monge– Kantorovich problems emanating from their support. More precisely, for any finite collection (x i, y i) ∈suppπ, i = 1, …, N and any permutation σ ∈ S N, (1.7) is satisfied. Therefore, the identity permutation is optimal between the measures and .
In the same spirit as Γ defined above for the discrete case, one can strengthen Proposition 1.7.4 and prove existence of a cyclically monotone set Γ that includes the support of any optimal transference plan π: take Γ = ∪supp(π) for π optimal.
The converse of Proposition 1.7.4 also holds.
Let , continuous and π ∈ Π(μ, ν) a cyclically monotone measure with C(π) finite. Then π is optimal in Π(μ, ν).
The argument for more general costs follows similar lines and is sketched at the end of this subsection.
Given these intermediary results, it is now instructive to prove Theorem 1.7.2.
A nonempty is quadratic cyclically monotone if and only if it is included in the graph of the subdifferential of a lower semicontinuous convex function that is not identically infinite.
Secondly, we have not used at all the Kantorovich duality, merely its weak form. The machinery of cyclical monotonicity can be used in order to prove the duality Theorem 1.4.2. This is indeed the strategy of Villani [125, Chapter 5], who explains its advantage with respect to Hahn–Banach-type duality proofs.
1.7.2 Stability of Transport Maps
We now extend the weak convergence of π n to π of the previous subsection to convergence of optimal maps. Because of the applications we have in mind, we shall work exclusively in the Euclidean space with the quadratic cost function; our results can most likely be extended to more general situations.
In this setting, we know that optimal plans are supported on graphs of subdifferentials of convex functions. Suppose that π n is induced by T n and π is induced by T. Then in some sense, the weak convergence of π n to π yields convergence of the graphs of T n to the graph of T. Our goal is to strengthen this to uniform convergence of T n to T. Roughly speaking, we show the following: there exists a set A with μ(A) = 1 and such that T n converge uniformly to T on every compact subset of A. For the reader’s convenience, we give a user-friendly version here; a more general statement is given in Proposition 1.7.11 below.
Let μ n, μ be absolutely continuous measures with finite second moments on an open convex set such that μ n → μ weakly, and let ν n → ν weakly with with finite second moments. If T n and T are continuous on U and C(T n) is bounded uniformly in n, then
for any compact Ω ⊆ U.
Since T n and T are only defined up to Lebesgue null sets, it will be more convenient to work directly with the subgradients. That is, we view T n and T as set-valued functions that to each assign a (possibly empty) subset of . In other words, T n and T take values in the power set of , denoted by .
Of course, if is convex, then u = ∂ϕ is monotone. It follows from Theorem 1.7.6 that u is maximally cyclically monotone (no points can be added to its graph while preserving cyclical monotonicity). It can actually be shown that u is maximally monotone [6, Section 7]. In what follows, we will always work with subdifferentials of convex functions, so unless stated otherwise, u will always be assumed maximally monotone.
Maximally monotone functions enjoy the following very useful continuity property. It is proven in [6, Corollary 1.3] and will be used extensively below.
Let such that u(x) = {y} is a singleton. Then u is nonempty on some neighbourhood of x and it is continuous at x: if x n → x and y n ∈ u(x n), then y n → y.
Notice that this result implies that if a convex function ϕ is differentiable on some open set , then it is continuously differentiable there (Rockafellar [113, Corollary 25.5.1]).
Let B r(x 0) = {x : ∥x − x 0∥ < r} for r ≥ 0 and . The interior of a set is denoted by intG and the closure by . If G is measurable, then LebG denotes the Lebesgue measure of G. Finally, convG denotes the convex hull of G.
We denote the set of points of Lebesgue density of G by G den. Here are some facts about G den: clearly, . Stein and Shakarchi [121, Chapter 3, Corollary 1.5] show that Leb(G ∖ G den) = 0 (and Leb(G den ∖ G) = 0, so G den is very close to G). By the Hahn–Banach theorem, G den ⊆int(conv(G)): indeed, if x is not in int(convG), then there is a separating hyperplane between x and convG ⊇ G, so the fraction above is at most 1∕2 for all t > 0.
The “denseness” of Lebesgue points is materialised in the following result. It is given as exercise in [121] when d = 1, and the proof can be found on page 27 in the supplement.
Of course, this result holds for any if the little o is replaced by big O, since δ is Lipschitz. When x 0 ∈intG, this is trivial because δ vanishes on intG.
The important part here is the following corollary: for almost all x ∈ G, δ(z) = o(∥z − x∥) as z → x. This can be seen in other ways: since δ is Lipschitz, it is differentiable almost everywhere. If and δ is differentiable at x, then ∇δ(x) must be 0 (because δ is minimised there), and then δ(z) = o(∥z − x∥). We just showed that δ is differentiable with vanishing derivative at all Lebesgue points of x. The converse is not true: has no Lebesgue points, but δ(y) ≤ 4y 2 as y → 0.
The locality of monotone functions can now be stated as follows. It is proven on page 27 of the supplement.
Then y ∗ = y 0. In particular, the result is true if the inequality holds on with ∅≠O open and Lebesgue negligible.
With this background on monotone functions at our disposal, we are now ready to state the stability result for the optimal maps. We assume the following.
(convergence ) μ n → μ and ν n → ν weakly;
- (finiteness ) the optimal couplings π n ∈ Π(μ n, ν n) satisfy
(unique limit ) the optimal π ∈ Π(μ, ν) is unique.
We further denote the subgradients ∂ϕ nand ∂ϕ by u nand u, respectively.
These assumptions imply that π has a finite total cost. This can be shown by the argument in the proof of Theorem 1.7.2 but also from the uniqueness of π. As a corollary of the uniqueness of π, it follows that π n → π weakly; notice that this holds even if π n is not unique for any n. We will now translate this weak convergence to convergence of the maximal monotone maps u n to u, in the following form.
In particular, if u is univalued throughout int(E) (so that ϕ ∈ C 1 there), then uniform convergence holds for any compact Ω ⊂int(E).
if a sequence in the graph of u n converges, then the limit is in the graph of u;
sequences in the graph of u n are bounded if the domain is bounded.
If in addition μ is absolutely continuous, then u n(x) → u(x) μ-almost surely.
We first claim that . Indeed, for any x ∈ E and any 𝜖 > 0, the ball B = B 𝜖(x) has positive measure. Consequently, u cannot be empty on the entire ball, because otherwise would be 0. Since domu is almost convex (see the discussion before Assumptions 1), this implies that actually int(convE) ⊆domu.
1.8 Complementary Slackness and More General Cost Functions
It is well-known (Luenberger and Ye [89, Section 4.4]) that the solutions to the primal and dual problems are related to each other via complementary slackness. In other words, solution of one problem provides a lot of information about the solution of the other problem. Here, we show that this idea remains true for the Kantorovich primal and dual problems, extending the discussion in Sect. 1.6.1 to more general cost functions.
Let and be complete separable metric spaces, , , and be a measurable cost function.
1.8.1 Unconstrained Dual Kantorovich Problem
Let μ and ν be probability measures on and such that the independent coupling with respect to the nonnegative and lower semicontinuous cost function is finite: . Then there exists an optimal pair (φ, ψ) for the dual Kantorovich problem. Furthermore, the pair can be chosen in a way that μ-almost surely, φ = ψ c and ν-almost surely, ψ = φ c.
We remark that at the level of generality of Proposition 1.8.1, the function φ c may fail to be Borel measurable; Ambrosio and Pratelli show that this pair can be modified up to null sets in order to be Borel measurable. If c is continuous, however, then φ c is an infimum of a collection of continuous functions (in y). Hence − φ c is lower semicontinuous, which yields that φ c is measurable. When c is uniformly continuous, measurability of φ c is established in a more lucid way, as exemplified in the next subsection.
1.8.2 The Kantorovich–Rubinstein Theorem
1.8.3 Strictly Convex Cost Functions on Euclidean Spaces
The following result generalises Theorem 1.6.2 to other powers p > 1 of the Euclidean norm. These cost functions define the Wasserstein distances of the next chapter.
The existence of the optimal pair (φ, φ c) with the desired properties follows from Proposition 1.8.1 (they are Borel measurable because c is continuous). We shall now show that φ has a unique c-supergradient μ-almost surely.
Step 1: φ is c-superdifferentiable. Let π ∗ be an optimal coupling. By duality arguments, π is concentrated on the set of (x, y) such that y ∈ ∂ cφ(x). Consequently, for μ-almost any x, the c-superdifferential of φ at x is nonempty.
Step 2: φ is differentiable. Here, we impose the additional condition that ν is compactly supported. Then φ can be taken as a c-transform on the compact support of ν. Since h is locally Lipschitz (it is C 1 because p > 1) this implies that φ is locally Lipschitz. Hence, it is differentiable Lebesgue almost surely, and consequently μ-almost surely.
The general result, without assuming compact support for ν, can be found in Gangbo and McCann [59]. It holds for a larger class of functions h, those that are strictly convex on (this yields that ∇h is invertible), have superlinear growth (h(v)∕∥v∥→∞ as v →∞) and satisfying a technical geometric condition (which ∥v∥p∕p does when p > 1). Furthermore, if h is sufficiently smooth, namely h ∈ C 1, 1 locally (it is if p ≥ 2, but not if p ∈ (1, 2)), then μ does not need to be absolutely continuous; it suffices that it not give positive measure to any set of Hausdorff dimension smaller or equal than d − 1. When d = 1 this means that Theorem 1.8.2 is still valid as long as μ has no atoms (μ({x}) = 0 for all ), which is a weaker condition than μ being absolutely continuous.
It is also noteworthy that for strictly concave cost functions (e.g., p ∈ (0, 1)), the situation is similar when the supports of μ and ν are disjoint. The reason is that h may fail to be differentiable at 0, but it only needs to be differentiated at x − y with x ∈suppμ and y ∈suppν. If the supports are not disjoint, then one needs to leave all common mass in place until the supports become disjoint (Villani [124, Chapter 2]) and then the result of [59] applies. As a simple example, let μ be uniform on [0, 1] and ν be uniform on [0, 2]. After leaving common mass in place, we are left with uniforms on [0, 1] and [1, 2] (with total mass 1∕2) with essentially disjoint supports, for which the optimal transport map is the decreasing map T(x) = 2 − x. Thus, the unique optimal π is not induced from a map, but rather from an equal weight mixture of T and the identity. Informally, each point x in the support of μ needs to be split; half stays and x and the other half transported to 2 − x. The optimal coupling from ν to μ is unique and induced from the map S(x) = x if x ≤ 1 and 2 − x if x ≥ 1, which is neither increasing nor decreasing.
1.9 Bibliographical Notes
Many authors, including Villani [124, Theorem 1.3]; [125, Theorem 5.10], give the duality Theorem 1.4.2 for lower semicontinuous cost functions. The version given here is a simplification of Beiglböck and Schachermayer [17, Theorem 1]. The duality holds for functions that take values in [−∞, ∞] provided that they are finite on a sufficiently large subset of , but there are simple counterexamples if c is infinite too often [17, Example 4.1]. For results outside the Polish space setup, see Kellerer [80] and Rachev and Rüschendorf [107, Chapter 4].
Theorem 1.5.1 for the one-dimensional case is taken from [124], where it is proven using the general duality theorem. For direct proofs and the history of this result, one may consult Rachev [106] or Rachev and Rüschendorf [107, Section 3.1]. The concave case is carefully treated by McCann [94].
The results in the Gaussian case were obtained independently by Olkin and Pukelsheim [98] and Givens and Shortt [65]. The proof given here is from Bhatia [20, Exercise 1.2.13]. An extension to separable Hilbert spaces can be found in Gelbrich [62] or Cuesta-Albertos et al. [39].
The regularity theory of Sect. 1.6.4 is very delicate. Caffarelli [32] showed the first part of Theorem 1.6.7; the proof can also be found in Figalli’s book [52, Theorem 4.23]. Villani [124, Theorem 4.14] states the result without proof and refers to Alesker et al. [7] for a sketch of the second part of Theorem 1.6.7. Other regularity results exist, Villani [125, Chapter 12]; Santambrogio [119, Section 1.7.6]; Figalli [52].
Cuesta-Albertos et al. [40, Theorem 3.2] prove Theorem 1.7.2 for the quadratic case; the form given here is from Schachermayer and Teichmann [120, Theorem 3].
The definition of cyclical monotonicity depends on the cost function. It is typically referred to as c-cyclical monotonicity, with “cyclical monotonicity” reserved to the special case of quadratic cost. Since we focus on the quadratic case and for readability, we slightly deviate from the standard jargon. That cyclical monotonicity implies optimality (Proposition 1.7.5) was shown independently by Pratelli [105] (finite lower semicontinuous cost) and Schachermayer and Teichmann [120] (possibly infinite continuous cost). A joint generalisation is given by Beiglböck et al. [18].
Section 1.7.2 is taken from Zemel and Panaretos [134, Section 7.5]; a slightly weaker version was shown independently by Chernozhukov et al. [35]. Heinich and Lootgieter [68] establish almost sure pointwise convergence. If μ n = μ, then the optimal maps converge in μ-measure [125, Corollary 5.23] in a very general setup, but there are simple examples where this fails if μ n≠μ [125, Remark 5.25]. In the quadratic case, further stability results of a weaker flavour (focussing on the convex potential ϕ, rather than its derivative, which is the optimal map) can be found in del Barrio and Loubes [42].
The idea of using the c-transform (Sect. 1.8) is from Rüschendorf [116].
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.