© Springer Nature Switzerland AG 2020
M. La Rocca et al. (eds.), Nonparametric Statistics, Springer Proceedings in Mathematics & Statistics 339. https://doi.org/10.1007/978-3-030-57306-5_31

The Discrepancy Method for Extremal Index Estimation

Natalia Markovich
V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences, Moscow, 117997, Russia

Abstract

We consider the nonparametric estimation of the extremal index of stochastic processes. The discrepancy method, which was proposed by the author as a data-driven smoothing tool for probability density function estimation, is extended to find a threshold parameter u for an extremal index estimator in the case of heavy-tailed distributions. To this end, the discrepancy statistics are based on the von Mises–Smirnov statistic and the k largest order statistics instead of an entire sample. The asymptotic chi-squared distribution of the discrepancy measure is derived. Its quantiles may be used as discrepancy values. An algorithm to select u for an estimator of the extremal index is proposed. The accuracy of the discrepancy method is checked by a simulation study.

Keywords
Nonparametric estimation · Discrepancy method · von Mises–Smirnov statistic · Extremal index · Heavy-tailed distribution

1 Introduction

Let $$X^n=\{X_1,\ldots ,X_n\}$$ be a sample of random variables (rvs) with cumulative distribution function (cdf) F(x). We consider the nonparametric estimation of the extremal index (EI) of stochastic processes. Nonparametric methods such as the well-known blocks and runs estimators of the EI require the selection of two parameters, one of which is an appropriate threshold u [3]. Modifications of the blocks estimator [6, 20] and sliding blocks estimators [17, 19] require only the block size and no threshold u. The intervals estimator depends only on u [9]. Comparatively little attention has been devoted to the estimation of the parameters required by these estimators.

The discrepancy method was proposed and studied in [13, 23] as a data-driven smoothing tool for light-tailed probability density function (pdf) estimation by independent identically distributed (iid) data. The idea is to find a smoothing parameter (i.e., a bandwidth) h as a solution of the discrepancy equation:
$$\begin{aligned}\rho (\widehat{F}_h,F_n)=\delta .\end{aligned}$$
Here, $$\widehat{F}_h(x)=\int _{-\infty }^x\widehat{f}_h(t)dt$$, $$\hat{f}_h(t)$$ is some pdf estimate, and $$\delta $$ is a known discrepancy value of the estimation of F(x) by the empirical distribution function $$F_n(t)$$, i.e., $$\delta =\rho (F,F_n)$$, where $$\rho (\cdot ,\cdot )$$ is a metric in the space of cdfs. Since $$\delta $$ is usually unknown, some quantiles of limit distributions of von Mises–Smirnov (M-S)
$$\begin{aligned} \omega ^2_n=n\int _{-\infty }^{\infty }\left( F_n(x)-F(x)\right) ^2dF(x), \end{aligned}$$
and Kolmogorov–Smirnov statistics were proposed as $$\delta $$. The distributions of these statistics are invariant with respect to F(x) [4]. In practice, the bandwidth h may be found as a solution of the equation [13]
$$\begin{aligned} \hat{\omega }^2_n(h)=0.05,\end{aligned}$$
(1)
where
$$\hat{\omega }^2_n(h) = \sum _{i=1}^n\left( \widehat{F}_h(X_{i,n})-\frac{i-0.5}{n}\right) ^2+\frac{1}{12n}$$
is based on the order statistics $$X_{1,n}\le \cdots \le X_{n,n}$$ corresponding to the sample $$X^n$$. The value 0.05, corresponding to the maximum (mode) of the pdf of the statistic $$\omega ^2_n$$, was found from the tables of the statistic $$\omega ^2_n$$ [4] and taken as the discrepancy value $$\delta $$.
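As an illustration (not from the paper), the discrepancy equation (1) can be solved numerically for a Gaussian-kernel cdf estimate. The sample, the bisection bracket, and the function names below are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

def ms_statistic(data, h):
    """hat omega_n^2(h) from (1): a Gaussian-kernel cdf estimate F_h
    compared with the empirical cdf at the order statistics."""
    xs = np.sort(data)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # F_h(x) = mean_j Phi((x - X_j)/h), the cdf of a Gaussian KDE
        f_h = np.mean([0.5 * (1 + erf((x - xj) / (h * sqrt(2)))) for xj in xs])
        total += (f_h - (i - 0.5) / n) ** 2
    return total + 1 / (12 * n)

def discrepancy_bandwidth(data, delta=0.05, lo=1e-3, hi=10.0, iters=60):
    """Bisection for a solution h of ms_statistic(data, h) = delta;
    assumes the statistic is below delta at h = lo and above it at h = hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ms_statistic(data, mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a light-tailed sample, the returned h satisfies the discrepancy equation (1) up to bisection tolerance; for very small h the statistic is close to $$1/(12n)$$, so the bracket above is usually valid.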
It is noted in [14, 16] that for heavy-tailed distributions, the statistic $$\omega ^2_n$$ may not reach the value 0.05; hence, the discrepancy equation (1) may have no solution, or its solutions may provide values of h that are too small and thus unsatisfactory for pdf estimation. In order to estimate heavy-tailed pdfs, a modification of the discrepancy method based on the k largest order statistics instead of the entire sample was considered in [16]. Namely, the statistic
$$\begin{aligned}&\hat{\omega }^2_n(h) = \sum _{i=n-k+1}^n\left( \widehat{F}_h(X_{i,n})-\frac{i-0.5}{n}\right) ^2+\frac{1}{12n}\end{aligned}$$
was proposed to be used in (1). A similar idea was explored in [15] to estimate the EI.
Definition 1
([12, p. 67]) The stationary sequence $$\{X_n\}_{n\ge 1}$$ is said to have EI $$\theta \in [0,1]$$ if for each $$0<\tau <\infty $$ there is a sequence of real numbers $$u_n=u_n(\tau )$$ such that it holds
$$\begin{aligned} \lim _{n\rightarrow \infty }n(1-F(u_n))=\tau , \qquad \lim _{n\rightarrow \infty }P\{M_n\le u_n\}=e^{-\tau \theta },\end{aligned}$$
(2)
where $$M_n=\max \{X_1,\ldots ,X_n\}$$.
The EI reflects the cluster structure of a sequence. If $$X_1,\ldots ,X_n$$ are independent, then $$\theta =1$$ holds. A cluster can be defined as a run of consecutive observations exceeding a threshold u between two consecutive non-exceedances, [9]. The inter-cluster times $$T_1(u)$$ for a given threshold u are then defined as the numbers of consecutive non-exceedances between two consecutive clusters, [9]. We have
$$\begin{aligned} T_1(u)=\min \{j\ge 1: M_{1,j}\le u, X_{j+1}>u|X_{1}>u\}, \end{aligned}$$
$$M_{1,j}=\max \{X_{2},\ldots ,X_{j}\}$$, $$M_{1,1}=-\infty $$. The observations of $$T_1(u_n)$$ normalized by the tail function, $$\{Y_i=\overline{F}(u_n)T_1(u_n)_i\}$$, $$i=1,\ldots ,L$$,  $$L=L(u_n)$$, $$L< n$$, are shown to be asymptotically exponentially distributed with weight $$\theta $$, i.e.,
$$\begin{aligned}P\{\overline{F}(u_n)T_1(u_n)>t\}\rightarrow & {} \theta \exp (-\theta t)\qquad \text{ for }~~t>0\end{aligned}$$
as $$n\rightarrow \infty $$ under a specific mixing condition and $$u_n$$ satisfying (2), [9].
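For concreteness, the inter-exceedance times $$T_1(u)$$ and their normalized versions $$Y_i$$ can be computed from a series as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def inter_exceedance_times(x, u):
    """Observed values of T_1(u): gaps between successive exceedances of u."""
    exceed_idx = np.flatnonzero(np.asarray(x) > u)
    return np.diff(exceed_idx)

def normalized_times(x, u):
    """Y_i = (N_u / n) * T_1(u)_i, normalizing by the empirical tail function."""
    x = np.asarray(x)
    n_u = np.count_nonzero(x > u)
    return (n_u / len(x)) * inter_exceedance_times(x, u)
```

For example, for the series (5, 0, 0, 5, 0, 5) and u = 1 the exceedances occur at positions 1, 4, 6, giving the times (3, 2) and, with $$N_u/n = 1/2$$, the normalized values (1.5, 1.0).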
The discrepancy equation may be based on the k, $$1\le k\le L-1$$, largest order statistics of a sample $$\{Y_i=(N_u/n)T_1(u)_i\}$$ as follows:
$$\begin{aligned} \hat{\omega }^2_L(u) = \sum _{i=L-k+1}^L\left( \widehat{G}(Y_{i,L})-\frac{i-0.5}{L}\right) ^2+\frac{1}{12L}=\delta .\end{aligned}$$
(3)
Here, $$N_u=\sum _{i=1}^n\mathbf{1}\{X_i>u\}$$ is the number of observations which exceed a predetermined high threshold u. $$\widehat{G}(Y_{i,L})$$ is obtained from $$G(t)=1-\theta \exp (-\theta t)$$ by replacing $$\theta $$ with some estimate $$\widehat{\theta }(u)$$ and t with the order statistic $$Y_{i,L}$$, [15]. An appropriate value of the threshold u can then be found as a solution of the discrepancy equation (3) with a predetermined value $$\delta $$ for any consistent nonparametric estimator of the EI. Computing (3) from the entire sample may leave the discrepancy equation without a solution in u, in the same way as for heavy-tailed pdf estimation, or may yield values of u that are too large for the estimation of $$\theta $$. The selection of k and $$\delta $$ remains a problem. We aim to obtain a limit distribution of the discrepancy statistic related to (3) depending on k and to use its quantiles as $$\delta $$.

The paper is organized as follows. Related work is recalled in Sect. 2. In Sect. 3, a limit distribution of the normalized statistic $$\hat{\omega }^2_L(u)$$ is obtained, and an algorithm for the discrepancy method based on the M-S statistic is given. The simulation study is presented in Sect. 4. Conclusions are given in Sect. 5.

2 Related Work

Our results rely on Lemmas 3.4.1 and 2.2.3 of [7] concerning limit distributions of order statistics. They are recalled here.

Lemma 1
([7, p. 89]) Let $$X, X_1, X_2,\ldots ,X_n$$ be iid rvs with common cdf F, and let $$X_{1,n} \le X_{2,n} \le \cdots \le X_{n,n}$$ be the n order statistics. The joint distribution of $$\{X_{i,n}\}^n_{i=n-k+1}$$ given $$X_{n-k,n} = t$$, for some $$k\in \{1,\ldots , n-1\}$$, equals the joint distribution of the set of the order statistics $$\{X^*_{i,k}\}^k_{i=1}$$ of iid rvs $$\{X^*_{i}\}^k_{i=1}$$ with cdf
$$\begin{aligned} F_t(x)= & {} P\{X \le x|X> t\} = (F(x) - F(t))/(1-F(t)), ~~~ x>t.\end{aligned}$$
(4)
Lemma 2
([7, p. 41]) Let $$U_{1,n}\le U_{2,n}\le \cdots \le U_{n,n}$$ be the n order statistics from a standard uniform distribution. Then, as $$n\rightarrow \infty $$, $$k\rightarrow \infty $$, $$n-k\rightarrow \infty $$,
$$\begin{aligned} (U_{k,n}-b_{n})/(a_{n}) \end{aligned}$$
is asymptotically standard normal with
$$\begin{aligned}&b_{n}=(k-1)/(n-1),\qquad a_{n}=\sqrt{b_n(1-b_n)/(n-1)}.\end{aligned}$$
Definition 2
([9]) For real u and integers $$1\le k\le l$$, let $$\mathscr {F}_{k,l}(u)$$ be the $$\sigma $$-field generated by the events $$\{X_i>u\}$$, $$k\le i\le l$$. Define the mixing coefficients $$\alpha _{n,q}(u)$$,
$$\begin{aligned}&\alpha _{n,q}(u)=\max _{1\le k\le n-q}\sup |P(B|A)-P(B)|, \end{aligned}$$
where the supremum is taken over all $$A\in \mathscr {F}_{1,k}(u)$$ with $$P(A)> 0$$ and $$B\in \mathscr {F}_{k+q,n}(u)$$ and k, q are positive integers.
Theorem 1
([9, p. 547]) Let $$\{X_n\}_{n\ge 1}$$ be a stationary process of rvs with tail function $$\overline{F}(x)=1-F(x)$$. Let the positive integers $$\{r_n\}$$ and the thresholds $$\{u_n\}$$, $$n\ge 1$$ be such that $$r_n\rightarrow \infty $$, $$r_n\overline{F}(u_n)\rightarrow \tau $$, and $$P\{M_{r_n}\le u_n\}\rightarrow \exp (-\theta \tau )$$ hold as $$n\rightarrow \infty $$ for some $$\tau \in (0,\infty )$$ and $$\theta \in [0,1]$$. If there are positive integers $$q_n=o(r_n)$$ such that $$\alpha _{cr_n,q_n}(u_n)=o(1)$$ for any $$c>0$$, then we get for $$t>0$$
$$\begin{aligned}&P\{\overline{F}(u_n)T_1(u_n)>t\}\rightarrow \theta \exp (-\theta t),\qquad n\rightarrow \infty . \end{aligned}$$

For declustering the sample into approximately independent inter-cluster times $$\{(T_1(u))_i\}$$, one can take the $$k-1=\lfloor \theta \sum _{i=1}^n\mathbf{1}(X_i>u)\rfloor $$ largest inter-exceedance times, [9]. A larger u corresponds to larger inter-cluster times, whose number L(u) may be small. This leads to a larger variance of the estimates based on $$\{(T_1(u))_i\}$$.

The intervals estimator of the EI follows from Theorem 1 and depends only on u. It is defined as [3, p. 391],
$$\begin{aligned} \hat{\theta }_n(u)= & {} \Big \{ \begin{array}{ll} \min (1,\hat{\theta }_n^1(u)) , \text{ if } \max \{(T_1(u))_i : 1 \le i \le L -1\} \le 2,\\ \min (1,\hat{\theta }_n^2(u)) , \text{ if } \max \{(T_1(u))_i : 1 \le i \le L -1\} > 2, \end{array} \end{aligned}$$
(5)
$$\begin{aligned}&\hat{\theta }_n^1(u)=\frac{2(\sum _{i=1}^{L-1}(T_1(u))_i)^2}{(L-1)\sum _{i=1}^{L-1}(T_1(u))_i^2},\end{aligned}$$
$$\begin{aligned}&\hat{\theta }_n^2(u)=\frac{2(\sum _{i=1}^{L-1}((T_1(u))_i-1))^2}{(L-1)\sum _{i=1}^{L-1}((T_1(u))_i-1)((T_1(u))_i-2)}.\end{aligned}$$
The intervals estimator is shown to be consistent for m-dependent processes, [9]. The asymptotic normality of $$\sqrt{k_n}(\hat{\theta }_n(u)-\theta )$$, where $$k_n=\lfloor n/r_n\rfloor $$ and $$r_n\rightarrow \infty $$, $$r_n=o(n)$$ as $$n\rightarrow \infty $$, is proved for the intervals estimator in [18].
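A direct transcription of the intervals estimator (5) may look as follows (a sketch; the inter-exceedance times are taken as the gaps between successive exceedance indices, and the function name is ours):

```python
import numpy as np

def intervals_estimator(x, u):
    """Intervals estimator (5) of the EI from inter-exceedance times of u."""
    idx = np.flatnonzero(np.asarray(x) > u)
    t = np.diff(idx).astype(float)           # (T_1(u))_i, playing i = 1..L-1
    if len(t) < 2:
        return float('nan')                  # too few exceedances
    if t.max() <= 2:
        num = 2.0 * t.sum() ** 2
        den = len(t) * (t ** 2).sum()
    else:
        num = 2.0 * (t - 1.0).sum() ** 2
        den = len(t) * ((t - 1.0) * (t - 2.0)).sum()
    return min(1.0, num / den)
```

For iid data ($$\theta =1$$) and a high threshold, the estimate is close to 1, in line with the consistency mentioned above.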

3 Main Results

3.1 Theory

Let us rewrite the left-hand side of (3) in the following form:
$$\begin{aligned}\hat{\omega }^2_L(u)= & {} \sum _{i=L-k+1}^L\left( 1-\theta \exp (-Y_{i,L}\theta )-\frac{i-0.5}{L}\right) ^2+\frac{1}{12L}.\end{aligned}$$
We disregard the marginal distribution of the random number of inter-cluster times $$L=L(u)$$. This approach is referred to as a conditional one. This is possible since the limit distribution of $$L(u_n)$$ does not depend on $$u_n$$ as $$n\rightarrow \infty $$. Following [3], Sect. 10.3.1, the probability $$P\{L(u_n)=i\}$$ may be approximated by a binomial distribution with probability $$p_n^*=P\{M_{r_n}\le u_n\}$$ that tends to $$e^{-\tau \theta }$$ by (2). Here, $$r_n$$ denotes the length of a data block. The cluster is defined as a block of data with at least one exceedance over $$u_n$$. The same is true for the cluster defined as in [9]. We have
$$\begin{aligned} 1-\theta \exp (-Y_{i,L}\theta )\stackrel{d}{=} U_{i,L},\end{aligned}$$
(6)
where $$\{U_{i,L}\}$$ are order statistics derived from an iid uniform [0, 1] sample $$\{U_{i}\}_{i=1}^L$$.
By Lemma 1, the joint distribution of upper order statistics $$(U_{L-k+1,L},\ldots ,U_{L,L})$$ given $$U_{L-k,L}=t$$, $$t\in [0,1)$$, for some $$k= 1,\ldots , L-1$$, equals the joint distribution of order statistics $$(U^*_{1,k},\ldots ,U^*_{k,k})$$ of associated iid rvs $$\{U^*_{i}\}_{i=1}^k$$ with cdf
$$\begin{aligned} F_t(x)= & {} (x-t)/(1-t),\qquad 0\le t<x\le 1. \end{aligned}$$
(7)
Lemma 3 establishes the asymptotic normality of $$U^*_{i,k}$$, given $$U_{L-k,L} = t$$, after normalization.
Lemma 3
Let $$U^*_{1,k}\le U^*_{2,k}\le \cdots \le U^*_{k,k}$$ be the k order statistics of iid rvs $$\{U^*_{i}\}_{i=1}^k$$ with cdf (7). Then, as $$k\rightarrow \infty $$, $$i\rightarrow \infty $$, $$k-i\rightarrow \infty $$,
$$\begin{aligned}(U^*_{i,k}-b_{i,k})/a_{i,k}\end{aligned}$$
is asymptotically standard normal with
$$\begin{aligned} b^*_{i,k}= & {} \frac{i-1}{k-1}=\frac{b_{i,k}-t}{1-t},\qquad a_{i,k}=(1-t)\sqrt{\frac{b^*_{i,k}(1-b^*_{i,k})}{k-1}}.\end{aligned}$$
(8)
Proof
The pdf of $$(U^*_{i,k}-b_{i,k})/a_{i,k}$$ is given by [2]
$$\begin{aligned}&\frac{k!}{(i-1)!(k-i)!}\cdot f(a_{i,k}x+b_{i,k})F_t(a_{i,k}x+b_{i,k})^{i-1}(1-F_t(a_{i,k}x+b_{i,k}))^{k-i}\nonumber \\= & {} \frac{k!}{(i-1)!(k-i)!}\cdot \frac{a_{i,k}}{1-t}\cdot \left( \frac{xa_{i,k}+b_{i,k}-t}{1-t}\right) ^{i-1}\cdot \left( 1-\frac{xa_{i,k}+b_{i,k}-t}{1-t}\right) ^{k-i} \\= & {} \left( \frac{k!}{(i-1)!(k-i)!}(b^*_{i,k})^{i-1}(1-b^*_{i,k})^{k-i}\frac{a_{i,k}}{1-t}\right) \nonumber \\&\cdot \left( 1+\frac{xa_{i,k}}{b_{i,k}^*(1-t)}\right) ^{i-1}\left( 1-\frac{xa_{i,k}}{(1-b_{i,k}^*)(1-t)}\right) ^{k-i}.\nonumber \end{aligned}$$
(9)
In the same way as in the proof of Lemma 2.2.3 [7], we obtain
$$\begin{aligned}&(i-1)\log \left( 1+\frac{xa_{i,k}}{b_{i,k}^*(1-t)}\right) +(k-i)\log \left( 1-\frac{xa_{i,k}}{(1-b_{i,k}^*)(1-t)}\right) \\= & {} (i-1)\left( x\frac{a_{i,k}}{b_{i,k}^*(1-t)}-\frac{x^2}{2}\left( \frac{a_{i,k}}{b_{i,k}^*(1-t)}\right) ^2+\cdots \right) \\+ & {} (k-i)\left( -x\frac{a_{i,k}}{(1-b_{i,k}^*)(1-t)}-\frac{x^2}{2}\left( \frac{a_{i,k}}{(1-b_{i,k}^*)(1-t)}\right) ^2-\cdots \right) . \end{aligned}$$
From (8) it follows that $$(i-1)\left( \frac{a_{i,k}}{b_{i,k}^*(1-t)}\right) ^2+(k-i)\left( \frac{a_{i,k}}{(1-b_{i,k}^*)(1-t)}\right) ^2=1$$. The remaining terms are of smaller order. Using Stirling's formula for k!, we find that the factor in the third line of (9) tends to $$(2\pi )^{-1/2}$$. Thus, the statement follows.
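Lemma 3 can be checked by a quick Monte Carlo experiment (illustrative parameter values; not part of the derivation): simulate iid rvs with cdf (7), i.e., uniform on (t, 1), normalize a central order statistic by (8), and compare with the standard normal law.

```python
import numpy as np

rng = np.random.default_rng(0)
k, t, i, reps = 4000, 0.5, 2000, 2000      # illustrative values
bstar = (i - 1) / (k - 1)
b = t + (1 - t) * bstar                    # b_{i,k} recovered from (8)
a = (1 - t) * np.sqrt(bstar * (1 - bstar) / (k - 1))

z = np.empty(reps)
for r in range(reps):
    u = t + (1 - t) * rng.random(k)        # iid rvs with cdf (7)
    z[r] = (np.sort(u)[i - 1] - b) / a     # normalized i-th order statistic
# z should be approximately standard normal for large k, i, k - i
```

The empirical mean and standard deviation of z are close to 0 and 1, as the lemma predicts.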
Lemma 4
Let the conditions of Lemma 3 be fulfilled. Then the statistic
$$\begin{aligned}&\chi ^2=\sum _{i=1}^{k^*}\left( (U^*_{i,k}-b_{i,k})/a_{i,k}\right) ^2\end{aligned}$$
(10)
is asymptotically $$\chi ^2$$ distributed with $$k^*=[k/2]$$ degrees of freedom.
Proof
Let us denote $$Y^*_{i,k}=(U^*_{i,k}-b_{i,k})/a_{i,k}$$ and derive the distribution $$\Phi _{\zeta }(y)$$ of $$\zeta =\chi /\sqrt{k^*}$$. From Lemma 3, $$Y^*_{i,k}\sim N(0,1)$$ holds asymptotically. Due to the symmetry, we get $$Y^*_{1,k}\le \cdots \le Y^*_{k^*,k}$$ and $$Y^*_{k,k}\le \cdots \le Y^*_{k^*+1,k}$$ for odd k ($$Y^*_{k,k}\le \cdots \le Y^*_{k^*,k}$$ for even k). By Lemma 1, the joint pdf of the $$k^*$$ order statistics $$Y^*_{1,k}, \ldots , Y^*_{k^*,k}$$ is determined by [2]
$$\begin{aligned}f(x_1,\ldots ,x_{k^*})= & {} (k^*)!\prod _{i=1}^{k^*}f(x_i)=\frac{(k^*)!}{(\sqrt{2\pi })^{k^*}}\exp \left( -\frac{\sum _{i=1}^{k^*}x_i^2}{2}\right) , \\&t<x_1<x_2<\cdots<x_{k^*}<+\infty . \end{aligned}$$
For positive y, the cdf $$\Phi _{\zeta }(y)$$ is equal to the probability of falling inside the $$k^*$$-dimensional sphere $$\sum _{i=1}^{k^*}\left( Y^*_{i,k}\right) ^2=y^2 k^*$$. For negative y, we have $$\Phi _{\zeta }(y)=0$$. Hence, we obtain
$$\begin{aligned}&\Phi _{\zeta }(y)=\frac{(k^*)!}{(\sqrt{2\pi })^{k^*}}\cdot \\&\int \cdots \int _{\sum _{i=1}^{k^*} x_i^2<y^2k^*}\exp \left( -\frac{\sum _{i=1}^{k^*}x_i^2}{2}\right) \mathbf{1}(t<x_1<x_2<\cdots<x_{k^*}<+\infty )dx_1dx_2\ldots dx_{k^*}. \end{aligned}$$
Using spherical coordinates and replacing $$x_1=\rho \cos \theta _1\cos \theta _2\ldots \cos \theta _{k^*-1}$$, $$x_2=\rho \cos \theta _1\cos \theta _2\ldots \sin \theta _{k^*-1},$$ $$\ldots ,$$ $$x_{k^*-1}= \rho \cos \theta _1\sin \theta _2$$, $$x_{k^*}= \rho \sin \theta _1$$, we determine the range of each variable $$\rho $$ and $$\theta _i$$, $$i=1,\ldots ,k^*-1$$. Since $$t<x_1<x_2<\cdots<x_{k^*}<+\infty $$ holds, we get $$x_2/x_1 = \tan (\theta _{k^*-1})>1$$ and hence, $$\pi /4<\theta _{k^*-1}<\pi /2$$, $$\sqrt{2}/2<\sin (\theta _{k^*-1})<1$$. From $$x_3/x_2 = \tan (\theta _{k^*-2})/\sin (\theta _{k^*-1})>1$$ we then have $$\tan (\theta _{k^*-2})>\sin (\theta _{k^*-1})$$, and since the largest value of $$\sin (\theta _{k^*-1})$$ is 1, we get $$\tan (\theta _{k^*-2})>1$$ and $$\pi /4<\theta _{k^*-2}<\pi /2$$. Finally, we get $$\pi /4<\theta _{i}<\pi /2$$, $$i=1,\ldots ,k^*-1$$. Now, we can evaluate the integral:
$$\begin{aligned}&\Phi _{\zeta }(y)=\frac{(k^*)!}{(\sqrt{2\pi })^{k^*}}\cdot \\&\int _{\pi /4}^{\pi /2}\cdots \int _{\pi /4}^{\pi /2}\int _{0}^{y\sqrt{k^*}}\exp \left( -\frac{\rho ^2}{2}\right) \rho ^{k^*-1}D(\theta _1,\ldots ,\theta _{k^*-1}) d\rho d\theta _{k^*-1}\ldots d\theta _{1} \\= & {} C_{k^*}\int _{0}^{y\sqrt{k^*}}\exp \left( -\frac{\rho ^2}{2}\right) \rho ^{k^*-1}d\rho , \end{aligned}$$
where
$$\begin{aligned}C_{k^*}= & {} \frac{(k^*)!}{(\sqrt{2\pi })^{k^*}}\int _{\pi /4}^{\pi /2}\cdots \int _{\pi /4}^{\pi /2}D(\theta _1,\ldots ,\theta _{k^*-1}) d\theta _{k^*-1}\ldots d\theta _{1}. \end{aligned}$$
Then the constant $$C_{k^*}$$ can be obtained from the equation:
$$\begin{aligned}\Phi _{\zeta }(+\infty )= & {} 1=C_{k^*}\int _{0}^{\infty }\exp \left( -\frac{\rho ^2}{2}\right) \rho ^{k^*-1}d\rho =C_{k^*}\Gamma \left( \frac{k^*}{2}\right) 2^{k^*/2-1}. \end{aligned}$$
Hence, it follows
$$\begin{aligned}\Phi _{\zeta }(y)= & {} \frac{1}{\Gamma \left( k^*/2\right) 2^{k^*/2-1}}\int _{0}^{y\sqrt{k^*}}\exp \left( -\frac{\rho ^2}{2}\right) \rho ^{k^*-1}d\rho .\end{aligned}$$
The pdf of $$\zeta $$ is given by
$$\begin{aligned}\varphi _{\zeta }(y)= & {} \frac{\sqrt{2k^*}}{\Gamma \left( k^*/2\right) }\left( \frac{y\sqrt{k^*}}{\sqrt{2}}\right) ^{k^*-1}\exp \left( -\frac{k^*y^2}{2}\right) .\end{aligned}$$
Hence, one can get the chi-squared pdf of $$\chi ^2$$:
$$\begin{aligned} p(x)= & {} \frac{x^{k^*/2-1}\exp (-x/2)}{2^{k^*/2}\Gamma \left( k^*/2\right) }. \end{aligned}$$

3.2 Discrepancy Equation Based on the Chi-Squared Statistic

Given a consistent estimate $$\widehat{\theta }(u)$$, by (6) and denoting $$i^*=i-L+k^*$$, $$i=L-k^*+1,\ldots ,L$$, the threshold u can be selected as a solution of the discrepancy equation
$$\begin{aligned}&\sum _{i^*=2}^{k^*-1}\left( (1-\widehat{\theta }(u)\exp (-Y_{i^*+L-k^*,L}\widehat{\theta }(u))-b_{i^*,k^*})/a_{i^*,k^*}\right) ^2 =\delta \end{aligned}$$
(11)
for a given k such that $$k^*-1\ge 2$$, where $$k^*=[(k-2)/2]$$. Here, $$\delta $$ is the mode $$\max \{k^*-2,0\}$$ of the $$\chi ^2(k^*)$$ distribution, and $$t=1-\widehat{\theta }(u)\exp (-Y_{L-k,L}\widehat{\theta }(u))$$; $$b_{i,k}$$, $$a_{i,k}$$, and $$k^*$$ are calculated as in Lemmas 3 and 4. This could be an alternative to the method (3). In Fig. 1a, c, one can see that the empirical cdf of the left-hand side statistic in (11), where $$\widehat{\theta }(u)$$ is based on (5), and the chi-squared cdf are rather close. Figure 1b, d shows that the discrepancy equation (11) may have solutions, since the left-hand side statistic in (11) crosses the mode of the $$\chi ^2$$ distribution.
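A literal transcription of the left-hand side of (11) may look as follows (a sketch; the index conventions follow the displayed formula as printed, and `theta_hat` stands for any consistent estimate $$\widehat{\theta }(u)$$):

```python
import numpy as np

def chi2_discrepancy(y, theta_hat, k):
    """Left-hand side of (11): y are normalized inter-exceedance times,
    theta_hat a consistent EI estimate, k the number of largest order
    statistics used; k* = [(k-2)/2]."""
    ys = np.sort(np.asarray(y, dtype=float))
    L = len(ys)
    ks = (k - 2) // 2
    if ks < 3 or L <= k:
        raise ValueError("need k* >= 3 and L > k")
    t = 1 - theta_hat * np.exp(-ys[L - k - 1] * theta_hat)  # t from Y_{L-k,L}
    stat = 0.0
    for istar in range(2, ks):                              # i* = 2,...,k*-1
        bstar = (istar - 1) / (ks - 1)
        b = t + (1 - t) * bstar                             # b_{i*,k*} via (8)
        a = (1 - t) * np.sqrt(bstar * (1 - bstar) / (ks - 1))
        g = 1 - theta_hat * np.exp(-ys[istar + L - ks - 1] * theta_hat)
        stat += ((g - b) / a) ** 2
    return stat
```

The discrepancy equation (11) then amounts to finding a threshold u at which this statistic equals the $$\chi ^2(k^*)$$ mode.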
Fig. 1

Empirical cdf of the left-hand side of (11), built from 300 re-samples of a Moving Maxima (MM) process with sample size $$n=4\cdot 10^4$$, EI $$\theta =0.8$$, $$k^*=5$$, $$t=0.867$$, and $$u=10$$ (solid line), together with the chi-squared cdf (points) (a); the left-hand side statistic in (11) for $$k^*=5$$ against the threshold u, with the $$\chi ^2$$ mode as the discrepancy value (b). The same statistics for an ARMAX process with $$\theta =0.25$$, $$k^*=5$$, $$t=0.953$$, and $$u=50$$ (c, d)

Remark 1

The discrepancy methods (3) and (11) are universal and can be used for any nonparametric estimator $$\widehat{\theta }(u)$$.

Algorithm 3.1
  1. Using $$X^n=\{X_1, X_2,\ldots ,X_n\}$$ and taking thresholds u corresponding to quantile levels $$q\in \{0.90,0.905,\ldots ,0.995\}$$, generate samples of inter-cluster times $$\{T_1(u)_i\}$$ and the normalized rvs
     $$\begin{aligned}\{Y_{i}\}= & {} \{\overline{F}(u)T_1(u)_{i}\}=\{(N/n)T_1(u)_{i}\}, \qquad i=\overline{1,L}, \qquad L=L(u),\end{aligned}$$
     where N is the number of exceedances over the threshold u.
  2. For each u, select $$k=sL(u)$$, $$0<s<1$$, e.g., $$s=0.0001$$.
  3. Use the sorted sample $$Y_{L-k+1,L}\le \cdots \le Y_{L,L}$$ and find all solutions $$u_1,\ldots ,u_l$$ (here, l is a random number) of the discrepancy equation (11).
  4. For each $$u_j$$, $$j\in \{1,\ldots ,l\}$$, calculate $$\hat{\theta }(u_j)$$ and find
     $$\begin{aligned} \widehat{\theta }_1(u)= & {} \frac{1}{l}\sum _{i=1}^l\widehat{\theta }(u_i),\quad \widehat{\theta }_2(u) =\widehat{\theta }(u_{min}),\qquad \widehat{\theta }_3(u) =\widehat{\theta }(u_{max}) \end{aligned}$$
     as the resulting estimates, where $$u_{min}$$ and $$u_{max}$$ are the minimal and maximal values among $$\{u_1,\ldots ,u_l\}$$.
Fig. 2

Left-hand side statistic in (11) for $$k^*= 7, 8$$ against the threshold u, with the $$\chi ^2$$ mode $$\max \{k^*-2,0\}$$ as the discrepancy value, for an ARMAX process with $$\theta =0.75$$
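Algorithm 3.1 can be sketched end-to-end as below (an illustrative sketch, not the author's code: the intervals estimator (5) supplies $$\widehat{\theta }(u)$$, solutions of (11) are detected as sign changes of the statistic minus the $$\chi ^2(k^*)$$ mode over the quantile grid, and only the averaged estimate $$\widehat{\theta }_1$$ is returned):

```python
import numpy as np

def intervals_theta(x, u):
    """Intervals estimator (5) from the inter-exceedance times of u."""
    t = np.diff(np.flatnonzero(x > u)).astype(float)
    if len(t) < 2:
        return float('nan')
    if t.max() <= 2:
        est = 2 * t.sum() ** 2 / (len(t) * (t ** 2).sum())
    else:
        est = 2 * (t - 1).sum() ** 2 / (len(t) * ((t - 1) * (t - 2)).sum())
    return min(1.0, est)

def lhs_statistic(x, u, s=0.01):
    """Left-hand side of (11) with k = max(s*L(u), 8) largest order statistics."""
    exceed = np.flatnonzero(x > u)
    y = np.sort((len(exceed) / len(x)) * np.diff(exceed))
    L = len(y)
    k = max(int(s * L), 8)
    ks = (k - 2) // 2                       # k* = [(k-2)/2]
    th = intervals_theta(x, u)
    if not np.isfinite(th) or L <= k:
        return float('nan'), th, ks
    t = 1 - th * np.exp(-y[L - k - 1] * th) # t from Y_{L-k,L}
    stat = 0.0
    for i in range(2, ks):                  # i* = 2,...,k*-1
        bs = (i - 1) / (ks - 1)
        b = t + (1 - t) * bs
        a = (1 - t) * np.sqrt(bs * (1 - bs) / (ks - 1))
        g = 1 - th * np.exp(-y[i + L - ks - 1] * th)
        stat += ((g - b) / a) ** 2
    return stat, th, ks

def algorithm_31(x, qs=None, s=0.01):
    """Steps 1-4 of Algorithm 3.1, returning the averaged estimate theta_1."""
    qs = np.arange(0.90, 0.996, 0.005) if qs is None else qs
    sols, prev = [], None
    for q in qs:
        stat, th, ks = lhs_statistic(x, np.quantile(x, q), s)
        if not np.isfinite(stat):
            prev = None
            continue
        cur = stat - max(ks - 2, 0)         # statistic minus chi^2(k*) mode
        if prev is not None and prev * cur <= 0:
            sols.append(th)                 # a solution of (11) lies nearby
        prev = cur
    return float(np.mean(sols)) if sols else float('nan')
```

If no threshold on the grid solves (11), the sketch returns NaN, mirroring the possible lack of solutions discussed in Sect. 3.3.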

3.3 Estimation of k

It remains to select k. For declustering purposes, i.e., to have approximately independent clusters of exceedances over u, it is recommended in [9] to take the largest value k such that $$(T_1(u))_{L-k,L}$$ is strictly larger than $$(T_1(u))_{L-k-1,L}$$.

We propose another approach. For each predetermined threshold u and the corresponding L(u), one may decrease the value of k until the discrepancy equation has a solution, and select the largest such k. Figure 2 shows that a solution of (11) exists for $$k^*=7$$ but not for $$k^*=8$$. Since several solutions u are possible, the average may be taken over all estimates $$\widehat{\theta }(u)$$ with such u's.

4 Simulation Study

Our simulation study, examining the behavior of Algorithm 3.1, is based on 1000 replications of samples $$\{X_1,\ldots ,X_n\}$$ of size $$n=10^5$$ generated from a set of models. These models are Moving Maxima (MM), Autoregressive Maximum (ARMAX), AR(1), AR(2), MA(2), and GARCH. The AR(1) process is considered with uniform noise (ARu) and with Cauchy distributed noise (ARc). Using Algorithm 3.1, we check the accuracy of the intervals estimator (5), where u is selected based on (11). The root mean squared error (RMSE) and the absolute bias are given in Tables 1 and 2; the smallest values in each column indicate the best results.

4.1 Models

Let us briefly recall the processes under study. The mth order MM process is $$X_t=\max _{i=0,\ldots ,m}\{\alpha _i Z_{t-i}\}$$, $$t\in Z $$, where $$\{\alpha _i\}$$ are constants with $$\alpha _i \ge 0$$, $$\sum _{i=0}^m \alpha _i=1$$, and $$Z_t$$ are iid standard Fréchet distributed rvs with the cdf $$F(x)=\exp \left( -1/x\right) $$ for $$x>0$$. Its EI is equal to $$\theta =\max _i\{\alpha _i\}$$, [1]. The values $$m=3$$ and $$\theta \in \{0.5,0.8\}$$, corresponding to $$\alpha =(0.5,0.3,0.15,0.05)$$ and $$\alpha =(0.8,0.1,0.08,0.02)$$, are taken.

The ARMAX process is determined by $$X_t=\max \{\alpha X_{t-1},(1-\alpha )Z_t\}$$, $$t\in Z ,$$ where $$0\le \alpha < 1$$, $$\{Z_t\}$$ are iid standard Fréchet distributed rvs, and $$P\{X_t\le x\}=\exp \left( -1/x\right) $$ holds assuming $$X_0=Z_0$$. Its EI is given by $$\theta =1-\alpha $$, [3]. We consider $$\theta \in \{0.25,0.75\}$$.
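The MM and ARMAX processes are easy to simulate from iid standard Fréchet noise; the following sketch (our helper names, not from the paper) may be used to reproduce such samples:

```python
import numpy as np

def frechet(rng, size):
    """Standard Frechet rvs: F(x) = exp(-1/x) via inversion of uniforms."""
    u = np.clip(rng.random(size), 1e-12, 1 - 1e-12)
    return -1.0 / np.log(u)

def moving_maxima(rng, n, alpha):
    """MM process X_t = max_i alpha_i Z_{t-i}; its EI is max(alpha)."""
    m = len(alpha) - 1
    z = frechet(rng, n + m)
    return np.array([max(a * z[t + m - i] for i, a in enumerate(alpha))
                     for t in range(n)])

def armax(rng, n, alpha):
    """ARMAX X_t = max(alpha*X_{t-1}, (1-alpha)*Z_t); its EI is 1 - alpha."""
    z = frechet(rng, n)
    x = np.empty(n)
    x[0] = z[0]
    for t in range(1, n):
        x[t] = max(alpha * x[t - 1], (1 - alpha) * z[t])
    return x
```

In both cases the stationary marginal is standard Fréchet, so the empirical frequency of $$\{X_t\le 1\}$$ should be close to $$e^{-1}$$.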

The ARu process is defined by $$X_j=(1/r) X_{j-1}+\epsilon _j$$, $$j \ge 1$$, and $$X_0\sim U(0,1)$$ with $$X_0$$ independent of $$\epsilon _j$$. For a fixed integer $$r\ge 2$$, let $$\epsilon _n$$, $$n\ge 1$$ be iid rvs with $$P\{\epsilon _1=k/r\}=1/r$$, $$k=0,1, \ldots , r-1$$. The EI of this AR(1) process is $$\theta =1-1/r$$ [5]. The values $$\theta \in \{0.5,0.8\}$$, corresponding to $$r=2$$ and $$r=5$$, are taken.

The MA(2) process is determined by $$X_i= p Z_{i-2} + q Z_{i-1} + Z_i$$, $$i\ge 1,$$ with $$p>0$$, $$q<1$$, and iid Pareto rvs $$Z_{-1}, Z_0, Z_1,\ldots $$ with $$P\{Z_0>x\}=1$$ if $$x<1$$, and $$P\{Z_0>x\}=x^{-\alpha }$$ if $$x\ge 1$$ for some $$\alpha >0$$ [20]. Its EI is $$\theta =\left( 1+p^{\alpha }+q^{\alpha }\right) ^{-1}$$. We consider $$\alpha =2$$, $$(p,q)=(1/\sqrt{2}, 1/\sqrt{2}), (1/\sqrt{3}, 1/\sqrt{6})$$ with corresponding $$\theta =1/2, 2/3$$.

We consider also processes studied in [8, 17, 22]. The ARc process is $$X_j=0.7 X_{j-1}+\epsilon _j$$, where $$\epsilon _j$$ is standard Cauchy distributed and $$\theta =0.3$$. The AR(2) process is $$X_j=0.95 X_{j-1}-0.89X_{j-2} +\epsilon _j$$, where $$\epsilon _j$$ is Pareto distributed with tail index 2 and $$\theta =0.25$$. The GARCH(1,1) is $$X_j=\sigma _j\epsilon _j$$, with $$\sigma _j^2=\alpha +\lambda X_{j-1}^2+\beta \sigma _{j-1}^2$$, $$\alpha =10^{-6}$$, $$\beta =0.7$$, $$\lambda =0.25$$, with an iid sequence of standard Gaussian rvs $$\{\epsilon _j\}_{j\ge 1}$$ and $$\theta =0.447$$ [11].
Table 1

The root mean squared error, $$RMSE \cdot 10^{4}$$ (columns: process and true $$\theta $$; empty cells: result not available)

| Estimator | MM 0.5 | MM 0.8 | ARMAX 0.25 | ARMAX 0.75 | ARu 0.5 | ARu 0.8 | MA(2) 0.5 | MA(2) 2/3 | ARc 0.3 | AR(2) 0.25 | GARCH 0.328 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $$\widehat{\theta }_1$$ ($$s=0.001$$) | 146 | 188 | 136 | 173 | 2719 | 1217 | 227 | 545 | 66 | 426 | 237 |
| $$\widehat{\theta }_2$$ ($$s=0.001$$) | 141 | 164 | 122 | 156 | 2163 | 946 | 295 | 813 | 104 | 498 | 288 |
| $$\widehat{\theta }_3$$ ($$s=0.001$$) | 360 | 440 | 291 | 430 | 3330 | 1527 | 336 | 470 | 67 | 432 | 467 |
| $$\widehat{\theta }_1$$ ($$s=0.0005$$) | 105 | 155 | 100 | 148 | 2519 | 1127 | 268 | 696 | 50 | 498 | 231 |
| $$\widehat{\theta }_2$$ ($$s=0.0005$$) | 139 | 144 | 97 | 151 | 1906 | 854 | 434 | 1107 | 161 | 692 | 860 |
| $$\widehat{\theta }_3$$ ($$s=0.0005$$) | 355 | 451 | 296 | 463 | 3325 | 1527 | 355 | 449 | 67 | 439 | 484 |
| $$\widehat{\theta }_1$$ ($$s=0.0001$$) | 96 | 120 | 88 | 115 | 2224 | 975 | 331 | 846 | 151 | 620 | 416 |
| $$\widehat{\theta }_2$$ ($$s=0.0001$$) | 149 | 135 | 117 | 143 | 1704 | 768 | 503 | 1197 | 358 | 953 | 1127 |
| $$\widehat{\theta }_3$$ ($$s=0.0001$$) | 350 | 453 | 285 | 434 | 3331 | 1518 | 336 | 441 | 67 | 431 | 474 |
| $$\widehat{\theta }^{Kimt}$$ | 217 | 569 | 69 | 498 | 1883 | 199 | 309 | 466 | 33 | 3630 | 4028 |
| $$\widehat{\theta }^{db}$$ | 630 |  | 550 | 3640 |  |  |  |  | 320 |  | 1000 |
| $$\widehat{\theta }^{sb}$$ | 550 |  | 450 | 950 |  |  |  |  | 320 |  | 840 |
| $$\widehat{\theta }^{r}$$ | 550 |  |  |  |  |  |  |  | 140 |  | 1550 |
| $$\widehat{\theta }^{ml}$$ |  |  | 550 | 3640 |  |  | 220 | 485 |  |  |  |
| $$\widehat{\theta }^{bcml}$$ |  |  | 400 | 1070 |  |  | 375 | 210 |  |  |  |
| $$\widehat{\theta }^{mlsb}$$ |  |  | 420 | 853 |  |  |  |  |  |  |  |
| $$\widehat{\theta }^{C}$$ | 660 |  |  |  |  |  |  |  | 950 |  | 1580 |
| $$\widehat{\theta }^{Cms}$$ | 320 |  |  |  |  |  |  |  | 6080 |  | 3520 |
Table 2

The absolute bias, $$|Bias| \cdot 10^{4}$$ (columns: process and true $$\theta $$; empty cells: result not available)

| Estimator | MM 0.5 | MM 0.8 | ARMAX 0.25 | ARMAX 0.75 | ARu 0.5 | ARu 0.8 | MA(2) 0.5 | MA(2) 2/3 | ARc 0.3 | AR(2) 0.25 | GARCH 0.328 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $$\widehat{\theta }_1$$ ($$s=0.001$$) | 7.4016 | 6.2708 | 2.9461 | 3.6254 | 2709 | 1204 | 182 | 516 | 66 | 399 | 70 |
| $$\widehat{\theta }_2$$ ($$s=0.001$$) | 41 | 50 | 22 | 49 | 2139 | 921 | 267 | 791 | 104 | 477 | 139 |
| $$\widehat{\theta }_3$$ ($$s=0.001$$) | 38 | 40 | 36 | 62 | 3302 | 1474 | 72 | 246 | 67 | 314 | 174 |
| $$\widehat{\theta }_1$$ ($$s=0.0005$$) | 37 | 12 | 15 | 13 | 2513 | 1118 | 251 | 684 | 50 | 485 | 137 |
| $$\widehat{\theta }_2$$ ($$s=0.0005$$) | 104 | 59 | 52 | 65 | 1893 | 841 | 424 | 1017 | 161 | 679 | 809 |
| $$\widehat{\theta }_3$$ ($$s=0.0005$$) | 11 | 61 | 48 | 45 | 3296 | 1474 | 47 | 227 | 67 | 319 | 210 |
| $$\widehat{\theta }_1$$ ($$s=0.0001$$) | 56 | 35 | 48 | 43 | 2221 | 970 | 322 | 842 | 151 | 613 | 391 |
| $$\widehat{\theta }_2$$ ($$s=0.0001$$) | 126 | 81 | 97 | 97 | 1701 | 760 | 497 | 1194 | 358 | 949 | 1119 |
| $$\widehat{\theta }_3$$ ($$s=0.0001$$) | 43 | 59 | 31 | 43 | 3303 | 1464 | 51 | 246 | 67 | 304 | 186 |
| $$\widehat{\theta }^{Kimt}$$ | 0.14862 | 567 | 54 | 496 | 1878 | 196 | 306 | 462 | 33 | 3627 | 4027 |
| $$\widehat{\theta }^{db}$$ | 160 |  | 450 | 3530 |  |  |  |  | 80 |  | 690 |
| $$\widehat{\theta }^{sb}$$ | 100 |  | 180 | 340 |  |  |  |  | 80 |  | 630 |
| $$\widehat{\theta }^{r}$$ | 50 |  |  |  |  |  |  |  | 160 |  | 630 |
| $$\widehat{\theta }^{ml}$$ |  |  | 450 | 4170 |  |  | 87.5 | 270 |  |  |  |
| $$\widehat{\theta }^{bcml}$$ |  |  | 0 | 1070 |  |  | 40 | 25 |  |  |  |
| $$\widehat{\theta }^{mlsb}$$ |  |  | 179 | 320 |  |  |  |  |  |  |  |
| $$\widehat{\theta }^{C}$$ | 130 |  |  |  |  |  |  |  | 200 |  | 230 |
| $$\widehat{\theta }^{Cms}$$ | 20 |  |  |  |  |  |  |  | 6000 |  | 3230 |

4.2 Estimators and Their Comparison

In Tables 1 and 2, we include, apart from our estimates $$\widehat{\theta }_1(u)$$–$$\widehat{\theta }_3(u)$$, the available results of the simulation studies in [8, 17, 20, 21]. The estimators are denoted as follows. $$\widehat{\theta }^{db}$$ denotes the disjoint blocks and $$\widehat{\theta }^{sb}$$ the sliding blocks estimators [17, 18]; $$\widehat{\theta }^{r}$$ the runs estimator [24]; $$\widehat{\theta }^{ml}$$ the multilevel estimator [20]; $$\widehat{\theta }^{bcml}$$ and $$\widehat{\theta }^{mlsb}$$ the bias-corrected multilevel and the multilevel sliding blocks estimators [20, 21]; $$\widehat{\theta }^{C}$$ and $$\widehat{\theta }^{Cms}$$ the cycles and the max-stable cycles estimators [8]. We can compare only the results related to processes overlapping with our experiment. We calculate the $$K$$-gaps estimates of [22] with IMT-selected pairs (u, K) (for details regarding the IMT test, see [10]), denoted $$\widehat{\theta }^{Kimt}$$.

We may conclude the following. The intervals estimator coupled with the discrepancy method demonstrates good performance in comparison with the other investigated estimators. It is not appropriate for light-tailed processes (by its definition), as one can see from the example of the ARu process. The K-gaps estimator is indicated as one of the most promising methods in [8, 17]. Our estimators, especially $$\widehat{\theta }_1(u)$$ for smaller s (i.e., a smaller number k of the largest order statistics), may perform better.

Comparing Tables 1 and 2 with Fig. 1 of [21], where the multilevel and the bias-corrected multilevel estimators were compared on data simulated from an ARMAX model only, one can see that the latter estimates demonstrate much larger errors. In particular, our estimate gives the best RMSE values of 0.0088 and 0.0115, whereas the best among those estimates shows about 0.04 and slightly less than 0.15 for $$\theta =0.25$$ and $$\theta =0.75$$, respectively.

In [8], the cycles, max-stable cycles, runs, K-gaps, disjoint blocks, and sliding blocks estimators were compared. For the first three estimators, the misspecification IMT test was applied to choose the threshold-run parameter. As an alternative, the quantiles $$q\in \{0.90,0.95,0.975\}$$ were used as thresholds for these estimators, with the run parameter estimated by the latter test. We can compare only the results related to the MM process with $$\theta =0.5$$, the ARMAX process with $$\theta =0.75$$, and the AR(1) process with $$\theta =0.5$$. The best bias of 0.002 and RMSE of 0.032 for the MM process were achieved by the max-stable cycles estimator $$\widehat{\theta }^{Cms}$$. For our estimator, the best absolute bias is 0.00074 and the RMSE is 0.0096 for the MM process. For the ARMAX process, the best were the cycles estimator with a bias of 0.003 and the max-stable cycles estimator with an RMSE of 0.032. Our estimator provides the best absolute bias 0.00036 and RMSE 0.0115.

The MA(2) process was studied in [20] with respect to the multilevel and bias-corrected multilevel blocks estimators with two specific weight functions. For MA(2) with $$\theta =0.5$$, the best absolute bias obtained there lies in the interval $$(2.5,3.0)\cdot 10^{-3}$$, and the MSE in $$(0.75,1.0)\cdot 10^{-3}$$. Our estimator provides the best absolute bias $$4.7\cdot 10^{-3}$$ and MSE $$5.15259\cdot 10^{-4}$$. For MA(2) with $$\theta =2/3$$, we find in [20] a bias of about 0.0025 and an MSE of $$0.7\cdot 10^{-3}$$. Our estimator shows 0.0227 and $$2.025\cdot 10^{-3}$$ for the bias and the MSE, respectively.

5 Conclusions

The discrepancy method, proposed originally for the smoothing of pdf estimates, is modified to select the threshold parameter u for EI estimation. We derive the asymptotic $$\chi ^2$$ distribution of a statistic related to the M-S statistic. This allows us to use its mode as the unknown discrepancy value $$\delta $$. Since the discrepancy method may be applied to different estimators of the EI, one can find other parameters, such as the block size for the blocks estimator of the EI, in the same way as the threshold. The accuracy of the intervals estimator (5), with u selected by the new discrepancy method (11), is assessed by a simulation study. The comparison with several EI estimators shows its good performance for heavy-tailed processes.

Acknowledgments

The author appreciates the partial financial support by the Russian Foundation for Basic Research, grant 19-01-00090.