Why is it relevant to construct the Fréchet mean of a collection of measures with respect to the Wasserstein metric? A simple answer is that this kind of average will often express a more natural notion of “typical” realisation of a random probability distribution than an arithmetic average. Much more can be said, however: the Wasserstein–Fréchet mean and the closely related notion of an optimal multicoupling arise canonically as the appropriate framework for the formulation and solution of the problem of separation of amplitude and phase variation of a point process. It would almost seem that Wasserstein–Fréchet means were “made” for precisely this problem.
Amplitude variation. This is the “classical” variation that one would also encounter in multivariate analysis, and refers to the stochastic fluctuations around a mean level, usually encoded in the covariance kernel, at least up to second order.
In short, this is variation “in the y-axis” (ordinate).
Phase variation. This is a second layer of non-linear variation peculiar to continuous domain stochastic processes, and is rarely—if ever—encountered in multivariate analysis. It arises as the result of random changes (or deformations) in the time scale (or the spatial domain) of definition of the process. It can be conceptualised as a composition of the stochastic process with a random transformation (warp map) acting on its domain.
This is variation “in the x-axis” (abscissa).
Phase variation naturally arises in the study of random phenomena where there is no absolute notion of time or space, but every realisation of the phenomenon evolves according to a time scale that is intrinsic to the phenomenon itself, and (unfortunately) unobservable. Processes related to physiological measurements, such as growth curves and neuronal signals, are usual suspects. Growth curves can be modelled as continuous random functions (functional data), whereas neuronal signals are better modelled as discrete random measures (point processes). We first describe amplitude/phase variation in the former case, as that is easier to appreciate, before moving on to the latter case, which is the main subject of this chapter.
4.1 Amplitude and Phase Variation
4.1.1 The Functional Case
Let K denote the unit cube [0, 1]^d. A real random function Y = (Y (x) : x ∈ K) can, broadly speaking, have two types of variation. The first, amplitude variation, results from Y (x) being a random variable for every x and describes its fluctuations around the mean level 𝔼[Y (x)], usually encoded by the variance varY (x). For this reason, it can be referred to as “variation in the y-axis”. More generally, for any finite set x 1, …, x n, the n × n covariance matrix with entries κ(x i, x j) = cov[Y (x i), Y (x j)] encapsulates (up to second order) the stochastic deviations of the random vector (Y (x 1), …, Y (x n)) from its mean, in analogy with the multivariate case. Heuristically, one then views amplitude variation as the collection κ(x, y) for x, y ∈ K in a sense we discuss next.
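Amplitude variation and its covariance-kernel summary are easy to probe by simulation. The following sketch uses a hypothetical two-term random Fourier model of our own choosing (not from the text), for which the kernel κ is known in closed form, and checks that the empirical covariance matrix on a grid recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 21)

# Hypothetical model: Y(x) = A*sin(2*pi*x) + B*cos(2*pi*x) with A, B
# independent standard normals, so that kappa(x, y) = cos(2*pi*(x - y)).
A = rng.standard_normal((50_000, 1))
B = rng.standard_normal((50_000, 1))
Y = A * np.sin(2 * np.pi * grid) + B * np.cos(2 * np.pi * grid)

kappa_hat = np.cov(Y, rowvar=False)  # empirical covariance matrix on the grid
kappa_true = np.cos(2 * np.pi * (grid[:, None] - grid[None, :]))
max_err = np.abs(kappa_hat - kappa_true).max()
```

With 50,000 simulated curves, the entrywise error of the empirical kernel is of order a few hundredths, illustrating the “variation in the y-axis” summarised by κ.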
There is an obvious identifiability problem in the model Ỹ = Y ∘ T^{−1}. If S is any (deterministic) invertible function, then the model with (Y, T) is statistically indistinguishable from the model with (Y ∘ S, T ∘ S). It is therefore often assumed that 𝔼[T] is the identity and, in addition, in nearly all applications, that T is monotonically increasing (if d = 1).
In the bibliographic notes, we review some methods for carrying out this separation of amplitude and phase variation. It is fair to say that no single registration method arises as the canonical solution to the functional registration problem. Indeed, most need to make additional structural and/or smoothness assumptions on the warp maps, further to the basic identifiability conditions requiring that T be increasing and that 𝔼[T] equal the identity. We will eventually see that, in contrast, the case of point processes (viewed as discretely observed random measures) admits a canonical framework, without the need for additional assumptions.
4.1.2 The Point Process Case
A point process is the mathematical object that represents the intuitive notion of a random collection of points in a space 𝒳. It is formally defined as a measurable map Π from a generic probability space into the space of (possibly infinite) Borel integer-valued measures on 𝒳, in such a way that Π(B) is a measurable real-valued random variable for all Borel subsets B of 𝒳. The quantity Π(B) represents the random number of points observed in the set B. Among the plethora of books on point processes, let us mention Daley and Vere-Jones [41] and Karr [79]. Kallenberg [75] treats more general objects, random measures, of which point processes are a particular special case. We will assume for convenience that Π is a measure on a compact subset K ⊆ ℝ^d.
Conditional upon T, the random variables Π̃(A k), k = 1, …, n, are independent, as the sets T^{−1}(A k) are disjoint, and Π̃(A) follows a Poisson distribution with mean λ(T^{−1}(A)) = Λ(A). This is precisely the definition of a Cox process: conditional upon the driving measure Λ, Π̃ is a Poisson process with mean measure Λ. For this reason, it is also called a doubly stochastic process; in our context, the phase variation is associated with the stochasticity of Λ, while the amplitude variation is associated with the Poisson variation conditional upon Λ.
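The doubly stochastic structure is easy to verify by simulation. The sketch below uses an assumed uniform intensity and a fixed warp map T(x) = x² of our own choosing (purely for illustration): conditional on this T, the warped counts in a set A are Poisson with mean Λ(A) = λ(T^{−1}(A)), so their empirical mean and variance should agree:

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 50.0  # Pi is a Poisson process with intensity 50 * Lebesgue on [0, 1]

def warp(u):
    """A fixed increasing warp map T(x) = x**2 (our own choice, illustration)."""
    return u ** 2

# Count warped points falling in A = [0, 0.25]; since T^{-1}(A) = [0, 0.5],
# the conditional mean is Lambda(A) = lambda(T^{-1}(A)) = 50 * 0.5 = 25.
counts = np.empty(10_000)
for r in range(10_000):
    pts = warp(rng.uniform(0.0, 1.0, size=rng.poisson(tau)))
    counts[r] = np.count_nonzero(pts <= 0.25)
```

Both the mean and the variance of the counts concentrate near 25, the Poisson signature of the amplitude variation conditional upon the (here fixed) phase.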
As in the functional case, there are problems with identifiability: the model (Π, T) cannot be distinguished from the model (S#Π, T ∘ S^{−1}) for any invertible S : K → K. It is thus natural to assume that 𝔼T is the identity map (otherwise set S = 𝔼T, i.e., replace Π by S#Π and T by T ∘ S^{−1}).
the expected value of T is the identity;
T is a gradient of a convex function.
4.2 Wasserstein Geometry and Phase Variation
4.2.1 Equivariance Properties of the Wasserstein Distance
This carries over to Fréchet means in an obvious way.
Let Λ be a random measure in 𝒲 p(ℝ^d) with finite Fréchet functional, and let a ∈ ℝ^d. Then γ is a Fréchet mean of Λ if and only if γ ∗ δ a is a Fréchet mean of Λ ∗ δ a.
4.2.2 Canonicity of Wasserstein Distance in Measuring Phase Variation
The purpose of this subsection is to show that the standard functional data analysis assumptions on the warp function T, having mean identity and being increasing, are equivalent to purely geometric conditions on T and the conditional mean measure Λ = T#λ. Put differently, if one is willing to assume that 𝔼T is the identity and that T is increasing, then one is led unequivocally to the problem of estimation of Fréchet means in the Wasserstein space 𝒲 2. When d > 1, “increasing” is interpreted as being the gradient of a convex function, as explained at the end of Sect. 4.1.2.
The total mass is invariant under the push-forward operation, and when it is finite, we may assume without loss of generality that it is equal to one, because all the relevant quantities scale with the total mass. Indeed, if λ = τμ with μ a probability measure and τ > 0, then T#λ = τ·T#μ, and the Wasserstein distance (defined as the infimum-over-coupling integrated cost) between τμ and τν is τW p^p(μ, ν) for probability measures μ, ν.
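For later reference, the two classical one-dimensional representations of W 1 (by quantile functions and by distribution functions; see Sect. 1.5) can be checked numerically. The sketch below compares them on two simulated samples of equal size, for which both formulae admit exact finite computations:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=500)  # sample from mu
y = rng.normal(1.0, 2.0, size=500)  # sample from nu

# Quantile formula: for equal-size empirical measures, W1 is the mean
# absolute difference of order statistics.
w1_quantile = np.mean(np.abs(np.sort(x) - np.sort(y)))

# CDF formula: W1 = integral over R of |F_mu - F_nu|, computed exactly on
# the merged support (the CDFs are constant between merged data points).
z = np.sort(np.concatenate([x, y]))
F_x = np.searchsorted(np.sort(x), z[:-1], side="right") / x.size
F_y = np.searchsorted(np.sort(y), z[:-1], side="right") / y.size
w1_cdf = np.sum(np.abs(F_x - F_y) * np.diff(z))
```

The two values agree to machine precision, reflecting the identity ∫|F_μ^{−1} − F_ν^{−1}| dt = ∫|F_μ − F_ν| dx.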
We begin with the one-dimensional case, where the explicit formulae allow for a more transparent argument, and for simplicity we will assume some regularity.
- (A1) Unbiasedness: 𝔼[T(x)] = x for all x ∈ K.
- (A2) Regularity: T is monotone increasing.
The relevance of the Wasserstein geometry to phase variation becomes clear in the following proposition, which shows that Assumptions 2 are equivalent to geometric assumptions on the Wasserstein space.
- (B1) Unbiasedness: 𝔼[W 2^2(γ, T#λ)] ≥ 𝔼[W 2^2(λ, T#λ)] for any γ ∈ 𝒲 2(K).
- (B2) Regularity: if Q : K → K is such that T#λ = Q#λ, then, with probability one, ∫ K∥T(x) − x∥^2 dλ(x) ≤ ∫ K∥Q(x) − x∥^2 dλ(x).
These assumptions have a clear interpretation: (B1) stipulates that λ is a Fréchet mean of the random measure Λ = T#λ, while (B2) states that T must be the optimal map from λ to Λ.
If T satisfies (B2) then, as an optimal map, it must be nondecreasing λ-almost surely. Since λ is arbitrary, T must be nondecreasing on the entire domain K. Conversely, if T is nondecreasing, then it is optimal for any λ. Hence (A2) and (B2) are equivalent.
The generalisation with respect to the one-dimensional result is threefold. Firstly, since our main interest is the implication (A1–A2) ⇒ (B1–B2), we need not assume T to be injective. Secondly, the support of λ is not required to be compact. Lastly, the result holds in arbitrary dimension, including infinite-dimensional separable Hilbert spaces. In particular, if t is a linear map, then its norm coincides with the operator norm of t, so the assumption is that t be a bounded self-adjoint nonnegative operator with mean identity and finite expected operator norm.
as required. The rigorous mathematical justification for this is given on page 88 in the supplement.
Without the continuity assumption on t, the result may fail (Álvarez-Esteban et al. [9, Example 3.1]). A simple argument shows that the growth condition imposed by the (G1) assumption is minimal; see page 89 in the supplement or Galasso et al. [58].
The same statement holds if the ambient space is replaced by a (Borel) convex subset K thereof. The integrals are then taken over K, showing that λ minimises the Fréchet functional among measures supported on K and, by continuity, on the whole space. By Proposition 3.2.4, λ is a Fréchet mean.
4.3 Estimation of Fréchet Means
4.3.1 Oracle Case
In view of the canonicity of the Wasserstein geometry established in Sect. 4.2.2, separation of amplitude and phase variation of the point processes essentially requires computing Fréchet means in the 2-Wasserstein space. It is both conceptually important and technically convenient to first consider the case where an oracle reveals the conditional mean measures Λ = T#λ entirely. Thus, assuming that λ is the unique Fréchet mean of a random measure Λ, the goal is to estimate the structural mean λ on the basis of independent and identically distributed realisations Λ 1, …, Λ n of Λ.
The warp maps (and their inverses) can then be estimated as the optimal maps from λ n to each Λ i (see Sect. 4.3.4).
4.3.2 Discretely Observed Measures
As a generalisation of the discrete case discussed in Sect. 1.3, the Fréchet mean of discrete measures can be computed exactly. Suppose that N i is nonzero for all i; then each observed measure is a discrete measure supported on N i points. One can then recast the multimarginal formulation (see Sect. 3.1.2) as a finite linear program, solve it, and “average” the solution as in Proposition 3.1.2 in order to obtain a Fréchet mean (an alternative linear programming formulation for finding a Fréchet mean is given by Anderes et al. [14]). Thus, the Fréchet mean can be computed in finite time, even when the ambient space is infinite-dimensional.
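In one dimension, where the Fréchet mean is obtained by averaging quantile functions (Sect. 3.1.4), no linear program is needed. The sketch below (assuming, for simplicity, the same number of support points for each measure, an assumption of this illustration only) computes the mean by averaging order statistics and verifies its minimising property:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three discrete measures with the same number of support points each;
# sorting the atoms gives the (discretised) quantile functions.
samples = [np.sort(rng.normal(loc=m, scale=s, size=200))
           for m, s in [(-1.0, 1.0), (0.0, 0.5), (2.0, 1.5)]]

# In one dimension, the quantile function of the Frechet mean is the
# average of the quantile functions, i.e. the average of order statistics.
bary = np.mean(samples, axis=0)

def w2_sq(a, b):
    """Squared 2-Wasserstein distance between equal-size sorted samples."""
    return np.mean((a - b) ** 2)

def frechet_functional(g):
    return 0.5 * np.mean([w2_sq(g, s) for s in samples])

F_bary = frechet_functional(bary)
F_shifted = frechet_functional(bary + 0.1)  # any perturbation must do worse
```

The Fréchet functional is strictly smaller at the averaged quantile function than at any of the input measures or at a shifted copy, as the theory predicts.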
Finally, a remark about measurability is in order. Point processes can be viewed as random elements in the space of Borel measures on K, endowed with the vague topology induced from convergence of integrals of continuous functions with compact support. If μ n converge to μ vaguely and a n are numbers that converge to a, then a nμ n → aμ vaguely. Thus, the normalised point pattern is a continuous function of the pair (pattern, total mass) and can be viewed as a random measure with respect to the vague topology. The restriction of the vague topology to probability measures is equivalent to the weak topology, and therefore vague, weak, and Wasserstein measurability are all equivalent.
4.3.3 Smoothing
Even when the computational complexity involved in calculating the empirical Fréchet mean is tractable, there is another reason not to use it as an estimator for λ. If one has a-priori knowledge that λ is smooth, it is often desirable to estimate it by a smooth measure. One way to achieve this would be to apply some smoothing technique, e.g., kernel density estimation, to the Fréchet mean itself. However, unless the number of observed points from each measure is the same, N 1 = ⋯ = N n = N, the Fréchet mean will usually be concentrated on many points, essentially N 1 + ⋯ + N n of them. In other words, the Fréchet mean is concentrated on many more points than each of the observed measures, thus potentially hindering its usefulness as a mean, because it will not be representative of the sample.
By counting the number of redundancies in the constraints matrix of the linear program, one can show that this is in general an upper bound on the number of support points of the Fréchet mean.
An alternative approach is to first smooth each observation and then calculate the Fréchet mean. Since it is easy to bound the Wasserstein distances when dealing with convolutions, we will employ kernel density estimation, although other smoothing approaches could be used as well.
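The ease of bounding Wasserstein distances under convolution can be illustrated numerically. The sketch below (with an assumed Gaussian kernel and a bandwidth of our own choosing) checks the coupling bound W 1(empirical, smoothed) ≤ σ·𝔼|Z| = σ√(2∕π), obtained by transporting each atom to its own kernel component:

```python
import numpy as np

rng = np.random.default_rng(4)

def w1(x, y):
    """Exact W1 between the empirical measures of two 1-D samples,
    via the integrated absolute difference of their CDFs."""
    z = np.sort(np.concatenate([x, y]))
    F_x = np.searchsorted(np.sort(x), z[:-1], side="right") / x.size
    F_y = np.searchsorted(np.sort(y), z[:-1], side="right") / y.size
    return np.sum(np.abs(F_x - F_y) * np.diff(z))

atoms = rng.uniform(0.0, 1.0, size=30)  # an observed point pattern
sigma = 0.05                            # assumed smoothing bandwidth

# The Gaussian-smoothed measure is an equal-weight mixture of N(atom, sigma^2);
# approximate it by a large sample from the mixture.
smooth = rng.choice(atoms, size=100_000) + sigma * rng.standard_normal(100_000)

# Coupling each atom with its own Gaussian component gives
# W1(empirical, smoothed) <= sigma * E|Z| = sigma * sqrt(2/pi).
dist = w1(atoms, smooth)
bound = sigma * np.sqrt(2.0 / np.pi)
```

The computed distance stays below the coupling bound (up to Monte Carlo error), and in particular vanishes linearly with the bandwidth, which is the mechanism behind the smoothing bound of Lemma 4.4.2.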
When d ≥ 2, there is in general no explicit expression for the Fréchet mean, although it exists and is unique. In the next chapter, we present a steepest descent algorithm that approximately constructs it by taking advantage of the differentiability properties of the Fréchet functional established in Sect. 3.1.6.
4.3.4 Estimation of Warpings and Registration Maps
4.3.5 Unbiased Estimation When d = 1
Unbiased estimators allow us to avoid the problem of over-registering (the so-called pinching effect; Kneip and Ramsay [82, Section 2.4]; Marron et al. [90, p. 476]). An extreme example of over-registration is if one “aligns” all the observed patterns into a single fixed point x 0. The registration will then seem “successful” in the sense of having no residual phase variation, but the estimation is clearly biased because the points are not registered to the correct reference measure. Thus, requiring the estimator to be unbiased is an alternative to penalising the registration maps.
Due to the Hilbert space embedding of 𝒲 2(ℝ), it is possible to characterise unbiased estimators in terms of a simple condition on their quantile functions. As a corollary, λ n, the Fréchet mean of {Λ 1, …, Λ n}, is unbiased. Our regularised Fréchet–Wasserstein estimator can then be interpreted as approximately unbiased, since it approximates the unobservable λ n.
Let Λ be a random measure in 𝒲 2(ℝ) with finite Fréchet functional and let λ be the unique Fréchet mean of Λ (Theorem 3.2.11). An estimator δ constructed as a function of a sample (Λ 1, …, Λ n) is unbiased for λ if and only if the left-continuous representatives (in L 2(0, 1)) satisfy 𝔼[F δ^{−1}(x)] = F λ^{−1}(x) for all x ∈ (0, 1).
4.4 Consistency
In functional data analysis, one often assumes that the number of curves n and the number of observed points per curve m both diverge to infinity. An analogous framework for point processes would similarly require the number of point processes n as well as the expected number of points τ per process to diverge. A technical complication arises, however, because the mean measures do not suffice to characterise the distribution of the processes. Indeed, if one is given a point process Π with mean measure λ (not necessarily a probability measure), and τ is an integer, there is no unique way to define a process Π (τ) with mean measure τλ. One can define Π (τ) = τΠ, so that every point in Π will be counted τ times. Such a construction, however, can never yield a consistent estimator of λ, even when τ →∞.
The Laplace functional of Π (τ) is L τ(f) = [L 1(f)]^τ for any rational τ, which simply amounts to multiplying the measure ρ by the scalar τ. One can then do the same for an irrational τ, and the resulting Laplace functional determines the distribution of Π (τ) for all τ > 0.
4.4.1 Consistent Estimation of Fréchet Means
We are now ready to define our asymptotic setup. The following assumptions will be made. Notice that the Wasserstein geometry does not appear explicitly in these assumptions, but is rather derived from them in view of Theorem 4.2.4. The compactness requirement can be relaxed under further moment conditions on λ and the point process Π; we focus on the compact case for simplicity and because, in practice, point patterns are observed on a bounded observation window.
For every n, let Π 1, …, Π n be independent point processes, each having the same distribution as a superposition of τ n copies of Π.
Let T be a random injective function on K (viewed as a random element in C b(K, K) endowed with the supremum norm) that is the gradient of a convex function, maps U into U (that is, T ∈ C b(U, U)), and has a nonsingular derivative at almost all x ∈ U. Let {T 1, …, T n} be independent and identically distributed as T.
For every x ∈ U, assume that 𝔼[T(x)] = x.
Assume that the collections {Π 1, …, Π n} and {T 1, …, T n} are independent.
Let Π̃ i = (T i)#Π i be the warped point processes, having conditional mean measures τ nΛ i, where Λ i = (T i)#λ.
Define the estimators Λ̂ i by the smoothing procedure (4.2), using bandwidth σ n (possibly random).
The dependence of the estimators on n will sometimes be tacit. But Λ i does not depend on n.
By virtue of Theorem 4.2.4, λ is a Fréchet mean of the random measure Λ = T#λ. Uniqueness of this Fréchet mean will follow from Proposition 3.2.7 if we show that Λ is absolutely continuous with positive probability. This is indeed the case, since T is injective and has a nonsingular Jacobian matrix; see Ambrosio et al. [12, Lemma 5.5.3]. The Jacobian assumption can be removed when d = 1, because Fréchet means are always unique by Theorem 3.2.11.
Full independence: here the point processes are independent across rows of the triangular array, that is, the processes corresponding to different values of n are also independent.
Nested observations: here the pattern at stage n + 1 includes the same points as the one at stage n plus additional points; that is, it is a superposition of the stage-n process and another point process distributed as (τ n+1 − τ n)Π.
We now state and prove the consistency result for the estimators of the conditional mean measures Λ i and the structural mean measure λ.
Convergence in 1. holds almost surely under additional summability conditions on σ n and τ n. If σ n → 0 only in probability, then convergence in 2. still holds in probability.
Theorem 4.4.1 still holds without smoothing (σ n = 0). In that case, the empirical Fréchet mean is possibly not unique, and the theorem should be interpreted in a set-valued sense (as in Proposition 1.7.8): almost surely, any choice of minimisers converges to λ as n →∞.
The preceding paragraph notwithstanding, we will usually assume that some smoothing is present, in which case the Fréchet mean is unique and absolutely continuous by Proposition 3.1.8. The uniform Lipschitz bounds for the objective function show that, if we restrict the relevant measures to be absolutely continuous, then the Fréchet mean is a continuous function of the sample and hence measurable; this is a minor issue in any case, because many arguments in the proof hold for each ω ∈ Ω separately. Thus, even if the estimator is not measurable, the proof shows that the convergence holds outer almost surely or in outer probability.
The first step in proving consistency is to show that the Wasserstein distance between the unsmoothed and the smoothed estimators of Λ i vanishes with the smoothing parameter. The exact rate of decay will be important later, in establishing the rate of convergence of the estimator to λ, and is determined next.
Since the smoothing parameter will vanish anyway, this restriction to small values of σ is not binding. The constant C ψ,K is explicit. When d = 1, a more refined construction allows one to improve this constant in some situations; see Panaretos and Zemel [100, Lemma 1].
The idea is that (4.2) is a sum of measures with mass 1∕N i that can all be sent to the relevant point x j; we refer to page 98 in the supplement for the precise details.
Proof (of Theorem 4.4.1) The proof, detailed on page 97 of the supplement, proceeds in the following steps: firstly, one shows the convergence in probability of the smoothed estimators to Λ i. This is basically a corollary of Karr [79, Proposition 4.8] and the smoothing bound from Lemma 4.4.2.
4.4.2 Consistency of Warp Functions and Inverses
We next discuss the consistency of the warp and registration function estimators. These are key elements in aligning the observed point patterns. Recall that we have consistent estimators for Λ i and for λ. Then T i is estimated by the optimal map from the estimated structural mean to the estimated conditional mean measure, and T i^{−1} by the optimal map in the reverse direction. We will make the following extra assumptions that lead to more transparent statements (otherwise, one needs to replace K with the set of Lebesgue points of the supports of λ and Λ i).
- 1.
λ has a positive density on K (in particular, suppλ = K);
- 2.
T is almost surely surjective on U = intK (thus a homeomorphism of U).
As a consequence almost surely.
Almost sure convergence can be obtained under the same provisions made at the end of the statement of Theorem 4.4.1.
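In one dimension, these registration maps are compositions of distribution and quantile functions: the optimal increasing map from λ to Λ is T = F Λ^{−1} ∘ F λ (Sect. 1.5). The sketch below, with an assumed warp T(x) = x² of our own choosing and simulated samples, recovers the warp from its empirical factors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x_lam = rng.uniform(0.0, 1.0, size=n)  # sample from the template lambda
x_warp = x_lam ** 2                    # sample from Lambda = T#lambda, T(x) = x^2

# The optimal (increasing) map from lambda to Lambda is the quantile
# composition T = F_Lambda^{-1} o F_lambda; estimate both factors empirically.
grid = np.linspace(0.05, 0.95, 19)
F_lam = np.searchsorted(np.sort(x_lam), grid, side="right") / n  # ECDF of lambda
T_hat = np.quantile(x_warp, F_lam)             # empirical quantiles of Lambda
max_err = np.abs(T_hat - grid ** 2).max()
```

The estimated map agrees with the true warp x ↦ x² up to sampling error of order n^{−1∕2} away from the boundary of K.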
A few technical remarks are in order. First and foremost, it is not clear that the two suprema are measurable. Even though T i and their inverses are random elements of a space of continuous functions, their estimators are only defined in an L 2 sense. The proof of Theorem 4.4.3 is done ω-wise: for any ω in the probability space such that Theorem 4.4.1 holds, the two suprema vanish as n →∞. In other words, the convergence holds in outer probability or outer almost surely.
Secondly, assuming positive smoothing, the random measures Λ̂ i are smooth with densities bounded below on K, so the corresponding optimal maps are defined on the whole of U (possibly as set-valued functions on a Λ i-null set). But the only known regularity result for the Fréchet mean estimator is an upper bound on its density (Proposition 3.1.8), so it is unclear what its support is and, consequently, what the domain of definition of the estimated warp maps is.
Lastly, when the smoothing parameter σ is zero, the estimators of T i and T i^{−1} are not defined. Nevertheless, Theorem 4.4.3 still holds in the set-valued formulation of Proposition 1.7.11, of which it is a rather simple corollary:
The division by the number of observed points ensures that the resulting measures are probability measures; the relevant information is contained in the point patterns themselves, and is invariant under this normalisation.
Possible extensions pertaining to the boundary of K are discussed on page 33 of the supplement.
4.5 Illustrative Examples
In this section, we illustrate the estimation framework put forth in this chapter by considering an example of a structural mean λ with a bimodal density on the real line. The unwarped point patterns Π originate from Poisson processes with mean measure λ and, consequently, the warped points are Cox processes (see Sect. 4.1.2). Another scenario involving triangular densities can be found in Panaretos and Zemel [100].
4.5.1 Explicit Classes of Warp Maps
As a first step, we introduce a class of random warp maps satisfying Assumptions 2, that is, increasing maps that have as mean the identity function. The construction is a mixture version of similar maps considered by Wang and Gasser [128, 129].
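As a concrete sketch, the sinusoidal family below (an assumption in the spirit of such Wang–Gasser-type maps, not necessarily the exact family used here) is increasing, fixes the endpoints of [0, 1], and has the identity as its mean when the index k is drawn symmetrically; both properties can be verified numerically:

```python
import numpy as np

def zeta(k, x):
    """Warp map zeta_k(x) = x - sin(pi*k*x)/(pi*|k|) on [0, 1]; zeta_0 = id.
    (Hypothetical family for illustration.)"""
    if k == 0:
        return x
    return x - np.sin(np.pi * k * x) / (np.pi * abs(k))

x = np.linspace(0.0, 1.0, 1001)
ks = [k for k in range(-3, 4) if k != 0]  # a symmetric law for the index k

# Each zeta_k fixes 0 and 1 and has derivative 1 - sign(k)*cos(pi*k*x) >= 0,
# so it is increasing; since zeta_k + zeta_{-k} = 2*id, a symmetric
# distribution of k yields mean identity.
monotone = all(np.all(np.diff(zeta(k, x)) >= -1e-12) for k in ks)
mean_map = np.mean([zeta(k, x) for k in ks], axis=0)
```

Mixtures over several such indices, as in the mixture construction alluded to above, inherit both properties.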
4.5.2 Bimodal Cox Processes
The main criterion for the quality of our regularised Fréchet–Wasserstein estimator will be its success in discerning the two modes at ± 8; these will be smeared by the phase variation arising from the warp functions.
4.5.3 Effect of the Smoothing Parameter
4.6 Convergence Rates and a Central Limit Theorem on the Real Line
Since the conditional mean measures Λ i are discretely observed, the rate of convergence of our estimators will be affected by the rate at which the number of observed points per process increases to infinity. The latter is controlled by the next lemma, which is valid in any complete separable metric space.
One can also show that the limit superior of the same quantity is almost surely bounded by a constant. If τ n∕logn is bounded below, then the same result holds, but with worse constants. If only τ n →∞, then the result holds for each i separately, but in probability.
The proof is a simple application of Chernoff bounds; see page 108 in the supplement.
where all the constants in the bounds are explicit.
Unlike classical density estimation, no assumptions on the rate of decay of σ n are required, because we only need to estimate the distribution function and not its derivative. If the smoothing parameter is chosen to be of order n^{−α} for some α > 0, the resulting terms are controlled by Lemma 4.6.1; Rosenblatt’s rule corresponds to α = 1∕5.
One can think about the parameter τ as separating the sparse and dense regimes, as in classical functional data analysis (see also Wu et al. [132]). If τ is bounded, then the setting is ultra sparse and consistency cannot be achieved. A sparse regime can be defined as the case where τ n →∞ more slowly than logn. In that case, consistency is guaranteed, but some point patterns will be empty. The dense regime can be defined as τ n ≫ n 2, in which case the amplitude variation is asymptotically negligible when compared with the phase variation.
The exponent − 1∕4 of τ n can be shown to be optimal without further assumptions, but it can be improved to − 1∕2 if f Λ ≥ 𝜖 for some 𝜖 > 0, where f Λ is the density of Λ (see Sect. 4.7). In terms of T, the condition is that T′ ≥ 𝜖 for some 𝜖 > 0 and that λ has a density bounded below. When this is the case, τ n needs to be compared with n rather than n 2 in the next paragraph and the next theorem.
Theorem 4.6.3 provides conditions for the optimal parametric rate to be achieved: this happens if we set σ n to be of the order n^{−1∕2} or less and if τ n is of the order n 2 or more. But if the last two terms in Theorem 4.6.3 are negligible with respect to n^{−1∕2}, then a sort of central limit theorem holds for the estimator:
If the density f λ exists and is (piecewise) continuous and bounded below on K, then the weak convergence also holds in L 2(K).
In view of Sect. 2.3, Theorem 4.6.5 can be interpreted as asymptotic normality of the estimator in the tangential sense: suitably rescaled, it converges to a Gaussian random element in the tangent space Tan λ, which is a subset of L 2(λ). The additional smoothness conditions allow one to switch to the space L 2(K), which is independent of the unknown template measure λ.
See pages 109 and 110 in the supplement for detailed proofs of these theorems. Below we sketch the main ideas only.
Proof (of Theorem 4.6.3)
The quantile formula from Sect. 1.5 and the average quantile formula for the Fréchet mean (Sect. 3.1.4) show that the oracle empirical mean follows a central limit theorem in L 2(0, 1). Since we work in the Hilbert space L 2(0, 1), Fréchet means are simple averages, so the errors in the Fréchet mean have the same rate as the errors in the Fréchet functionals. The smoothing term is easily controlled by Lemma 4.4.2.
Controlling the amplitude term is more difficult. Bounds can be given using the machinery sketched in Sect. 4.7, but we give a more elementary proof by reducing to the 1-Wasserstein case (using (2.2)), which can be more easily handled in terms of distribution functions (Corollary 1.5.3).
4.7 Convergence of the Empirical Measure and Optimality
One may find the term τ n^{−1∕4} in Theorem 4.6.3 somewhat surprising, and expect that it ought to be τ n^{−1∕2}. The goal of this section is to show why the rate τ n^{−1∕4} is optimal without further assumptions, and to discuss conditions under which it can be improved to the optimal rate τ n^{−1∕2}. For simplicity, we concentrate on the case τ n = n and assume that the point process Π is binomial; the Poisson case is easily obtained from the simplified one (using Lemma 4.6.1). We are thus led to study rates of convergence of empirical measures in the Wasserstein space. That is to say, for a fixed exponent p ≥ 1 and a fixed measure μ, we consider independent random variables X 1, … with law μ and the empirical measure μ n. The first observation is that W p(μ n, μ) → 0 almost surely:
The sequence 𝔼W p(μ n, μ) is not monotone, as the simple example μ = (δ 0 + δ 1)∕2 shows (see page 111 in the supplement).
The next question is how quickly 𝔼W p(μ n, μ) vanishes as n →∞. We shall begin with two simple general lower bounds, then discuss upper bounds in the one-dimensional case, put them in the context of Theorem 4.6.3, and finally briefly touch upon the d-dimensional case.
by the central limit theorem and the Kantorovich–Rubinstein theorem (1.11).
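The n^{−1∕2} behaviour can be observed directly in simulation. The sketch below computes W 1 between an empirical measure and the Uniform[0, 1] law exactly (via the quantile-function integral) and compares Monte Carlo averages at two sample sizes; a fourfold increase in n should halve the average distance:

```python
import numpy as np

rng = np.random.default_rng(7)

def w1_to_uniform(sample):
    """Exact W1 between the empirical measure of a sample in [0, 1] and
    Uniform[0, 1], via W1 = int_0^1 |F_n^{-1}(u) - u| du, piecewise."""
    x = np.sort(sample)
    n = x.size
    a = np.arange(n) / n       # F_n^{-1}(u) = x[i] for u in (a[i], b[i]]
    b = a + 1.0 / n
    m = np.clip(x, a, b)       # closest point of the interval to x[i]
    return np.sum(((m - a) ** 2 + (b - m) ** 2) / 2 + (b - a) * np.abs(x - m))

# Monte Carlo averages of W1(mu_n, mu) at two sample sizes; an n^{-1/2}
# rate predicts that multiplying n by 4 halves the average distance.
est = {n: np.mean([w1_to_uniform(rng.uniform(size=n)) for _ in range(400)])
       for n in (250, 1000)}
ratio = est[250] / est[1000]
```

The observed ratio is close to 2, consistent with the central limit theorem lower bound above.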
For discrete measures, the rates scale badly with p. More generally:
Then for any p ≥ 1 there exists c p(μ) > 0 such that 𝔼W p(μ n, μ) ≥ c p(μ)n^{−1∕(2p)}.
Any nondegenerate finitely discrete measure μ satisfies this condition, and so do “non-pathological” countably discrete ones. (An example of a “pathological” measure is one assigning positive mass to any rational number.)
Let k ∼ B(n, q = μ(A)) denote the number of points from the sample (X 1, …, X n) that fall in A. Then a mass of |k∕n − q| must travel between A and B, a distance of at least the gap between the two sets. Thus, W p^p(μ n, μ) is bounded below by a multiple of |k∕n − q|, and the result follows from the central limit theorem for k; see page 112 in the supplement for the full details.
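The two-point example makes the n^{−1∕(2p)} rate easy to simulate. The sketch below takes p = 2 and μ = (δ 0 + δ 1)∕2, for which W 2 has the closed form used in the proof; the rate n^{−1∕4} predicts a ratio of 16^{1∕4} = 2 between the Monte Carlo averages at n = 100 and n = 1600:

```python
import numpy as np

rng = np.random.default_rng(8)

def mean_wp(n, p=2, reps=20_000):
    """Monte Carlo E W_p(mu_n, mu) for mu = (delta_0 + delta_1)/2: a mass
    |k/n - 1/2| travels distance 1, so W_p = |k/n - 1/2|^{1/p}."""
    k = rng.binomial(n, 0.5, size=reps)
    return np.mean(np.abs(k / n - 0.5) ** (1.0 / p))

# An n^{-1/(2p)} = n^{-1/4} rate (p = 2) predicts a ratio of 16^{1/4} = 2
# between n = 100 and n = 1600.
ratio = mean_wp(100) / mean_wp(1600)
```

The p-th root of the binomial fluctuation is exactly what makes the rates scale badly with p for discrete measures.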
Finiteness of J p(μ) is necessary and sufficient for 𝔼W p(μ n, μ) to decay at the rate n^{−1∕2}.
See Bobkov and Ledoux [25, Theorem 5.10] for a proof of the J p condition, and Theorems 5.1 and 5.3 therein for the values of the constants and a stronger result.
When p > 1, for J p(μ) to be finite, the support of μ must be connected; this is not needed when p = 1. Moreover, the J p condition is satisfied when f is bounded below (in which case the support of μ must be compact). However, smoothness alone does not suffice, even for measures with positive density on a compact support. More precisely, we have:
We conclude by proving a lower bound for absolutely continuous measures and stating, without proof, an upper bound.
If, for some d > 2p, the covering numbers satisfy N(μ, 𝜖, 𝜖^{dp∕(d−2p)}) ≤ K𝜖^{−d}, then 𝔼W p(μ n, μ) ≤ Cn^{−1∕d}.
Comparing this with the lower bound in Lemma 4.7.4, we see that in the high-dimensional regime d > 2p, absolutely continuous measures have a worse rate than discrete ones. In the low-dimensional regime d < 2p, the situation is the opposite. We also obtain that, for d > 2 and a compactly supported absolutely continuous μ, 𝔼W 1(μ n, μ) ≍ n^{−1∕d}.
4.8 Bibliographical Notes
Our exposition in this chapter closely follows the papers Panaretos and Zemel [100] and Zemel and Panaretos [134].
Books on functional data analysis include Ramsay and Silverman [109, 110], Ferraty and Vieu [51], Horváth and Kokoszka [70], and Hsing and Eubank [71], and a recent review is also available (Wang et al. [127]). The specific topic of amplitude and phase variation is discussed in [110, Chapter 7] and [127, Section 5.2]. The next paragraph gives some selective references.
One of the first functional registration techniques employed dynamic programming (Wang and Gasser [128]) and dates back to Sakoe and Chiba [118]. Landmark registration consists of identifying salient features for each curve, called landmarks, and aligning them (Gasser and Kneip [61]; Gervini and Gasser [63]). In pairwise synchronisation (Tang and Müller [122]) one aligns each pair of curves and then derives an estimator of the warp functions by linear averaging of the pairwise registration maps. Another class of methods involves a template curve, to which each observation is registered, minimising a discrepancy criterion; the template is then iteratively updated (Wang and Gasser [129]; Ramsay and Li [108]). James [72] defines a “feature function” for each curve and uses the moments of the feature function to guarantee identifiability. Elastic registration employs the Fisher–Rao metric that is invariant to warpings and calculates averages in the resulting quotient space (Tucker et al. [123]). Other techniques include semiparametric modelling (Rønn [115]; Gervini and Gasser [64]) and principal components registration (Kneip and Ramsay [82]). More details can be found in the review article by Marron et al. [90]. Wrobel et al. [131] have recently developed a registration method for functional data with a discrete flavour. It is also noteworthy that a version of the Wasserstein metric can also be used in the functional case (Chakraborty and Panaretos [34]).
The literature on the point process case is scarcer; see the review by Wu and Srivastava [133].
A parametric version of Theorem 4.2.4 was first established by Bigot and Klein [22, Theorem 5.1], and extended to a compact nonparametric formulation in Zemel and Panaretos [134]. There is an infinite-dimensional linear version in Masarotto et al. [91]. The current level of generality appears to be new.
Theorem 4.4.1 is a stronger version of Panaretos and Zemel [100, Theorem 1], where it was assumed that τ n diverges to infinity sufficiently fast. An analogous construction under the Bayesian paradigm can be found in Galasso et al. [58]. Optimality of the rates of convergence in Theorem 4.6.3 is discussed in detail by Bigot et al. [23], where finiteness of the functional J 2 (see Sect. 4.7) is assumed and, consequently, the rate τ n^{−1∕4} is improved to τ n^{−1∕2}.
As far as we know, Theorem 4.6.5 (taken from [100]) is the first central limit theorem for Fréchet means in Wasserstein space. When the measures Λ i are observed exactly (no amplitude variation: τ n = ∞ and σ = 0), Kroshnin et al. [84] have recently proven a central limit theorem for random Gaussian measures in arbitrary dimension, extending a previous result of Agueh and Carlier [3]. It seems likely that in a fully nonparametric setting, the rates of convergence (compare Theorem 4.6.3) might be slower than n^{−1∕2}; see Ahidar-Coutrix et al. [4].
The magnitude of the amplitude variation in Theorem 4.6.3 pertains to the rate of convergence of 𝔼W p(μ n, μ) to zero (Sect. 4.7). This is a topic of intense research, dating back to the seminal paper by Dudley [46], where a version of Theorem 4.7.8 with p = 1 is shown for the bounded Lipschitz metric. The lower bounds proven in this section were adapted from [46], Fournier and Guillin [54], and Weed and Bach [130].
The version of Theorem 4.7.8 given here can be found in [130] and extends Boissard and Le Gouic [27]. Both papers [27, 130] work in a general setting of complete separable metric spaces. An additional term appears in the limiting case d = 2p, as already noted (for p = 1) by [46] and by the classical work of Ajtai et al. [5] for μ uniform on [0, 1]^2. More general results are available in [54]. A longer (but far from complete) bibliography is given in the recent review by Panaretos and Zemel [101, Subsection 3.3.1], including works by Barthe, Dobrić, Talagrand, and coauthors on almost sure results and deviation bounds for the empirical Wasserstein distance.
The J 1 condition is due to del Barrio et al. [43], who showed it to be necessary and sufficient for the empirical process √n·W 1(μ n, μ) to converge in distribution to ∫ 0^1|B(t)|∕f(F^{−1}(t)) dt, with B a Brownian bridge. The extension to 1 ≤ p ≤∞ (and a lot more) can be found in Bobkov and Ledoux [25], employing order statistics and beta distributions to reduce to the uniform case. Alternatively, one may consult Mason [92], who uses weighted approximations to Brownian bridges.
An important aspect that was not covered here is that of statistical inference of the Wasserstein distance on the basis of the empirical measure. This is a challenging question and results by del Barrio, Munk, and coauthors are available for one-dimensional, elliptical, or discrete measures, as explained in [101, Section 3].
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.