© Springer Nature Switzerland AG 2020
M. La Rocca et al. (eds.), Nonparametric Statistics, Springer Proceedings in Mathematics & Statistics 339, https://doi.org/10.1007/978-3-030-57306-5_48

To Rank or to Permute When Comparing an Ordinal Outcome Between Two Groups While Adjusting for a Covariate?

Georg Zimmermann1, 2, 3  
(1)
Department of Mathematics, Paris Lodron University, Hellbrunner Str. 34, 5020 Salzburg, Austria
(2)
Team Biostatistics and Big Medical Data, IDA Lab Salzburg, Paracelsus Medical University, Strubergasse 16, 5020 Salzburg, Austria
(3)
Department of Neurology, Christian Doppler Medical Centre, Ignaz-Harrer-Strasse 79, 5020 Salzburg, Austria
 
 

Abstract

The classical parametric analysis of covariance (ANCOVA) is frequently used when comparing an ordinal outcome variable between two groups, while adjusting for a continuous covariate. However, the normality assumption might be crucial and assuming an underlying additive model might be questionable. Therefore, in the present manuscript, we consider the outcome as truly ordinal and dichotomize the covariate by a median split, in order to transform the testing problem to a nonparametric factorial setting. We propose using either a permutation-based Anderson–Darling type approach in conjunction with the nonparametric combination method or the pseudo-rank version of a nonparametric ANOVA-type test. The results of our extensive simulation study show that both methods maintain the type I error level well, but that the ANOVA-type approach is superior in terms of power for location-shift alternatives. We also discuss some further aspects that should be taken into account when deciding between the two methods. The application of both approaches is illustrated by the analysis of real-life data from a randomized clinical trial with stroke patients.

Keywords
Nonparametric covariate adjustment · Nonparametric combination method · NPC · Rank-based inference · Nonparametric analysis of covariance

1 Introduction

In many clinical studies, the main research interest is focused on the comparison of two groups with respect to a certain outcome variable of interest. Examples include the classical parallel-group design (e.g., verum vs. placebo) in randomized clinical trials, but also observational studies, which are aimed at studying the association between a particular factor (e.g., two different pathologies) and a health-related variable (e.g., a laboratory measurement or a quality of life score). Frequently, the two-sample t-test is used for analyzing data from such studies. If the comparison is adjusted for one or several covariates, the analysis of covariance (ANCOVA) is regarded as the method of choice. A comprehensive overview of the classical ANCOVA and some of its alternatives is given by [11]. However, most methods require that the outcome is continuous, or at least that the effects are additive. Therefore, applying these approaches in settings where the outcome is ordinal might be inappropriate. Yet, ordinal outcomes are quite frequently encountered in applied research: One example, among others, is the modified Rankin Scale (mRS), which is used to quantify the functional outcome of stroke patients by assigning a rating on a scale from 0 (no symptoms) to 6 (dead) [9, 21, 27]. Although some evidence suggests that converting the ordinal factor levels to scores and subsequently applying the standard parametric ANCOVA does not affect the performance of the test substantially [25], there are two crucial issues with this approach. Firstly, with respect to the normality assumption, a highly discrete variable might be particularly challenging to deal with. Secondly, careful thought is required as to whether the resulting effect measure (e.g., the difference of the adjusted means) can be interpreted in a sensible way. Therefore, analyzing the outcome as a truly ordinal variable might be an attractive option.
Again, however, classical approaches such as proportional odds models (for an introduction, see [1]) require assumptions which might be too restrictive.

Nonparametric rank-based methods may serve as a remedy in order to overcome the aforementioned difficulties. In particular, approaches where the so-called relative effect is used as the effect measure are also applicable to ordinal outcomes. One well-known example is the Wilcoxon-Mann-Whitney test [14, 28], which is the classical nonparametric counterpart of the two-sample t-test. Several extensions to general factorial designs have been proposed (e.g., [2, 4, 6, 8]). However, only a few approaches allow for adjustment for a continuous covariate [3, 26].

Alternatively, using permutation methods might be an appealing option, mainly due to their finite-sample properties (e.g., exactness) under relatively mild assumptions. For an overview of the underlying theory, we refer, for example, to [18]. As with the rank-based methods mentioned previously, there is a broad variety of permutation approaches covering various practically relevant settings (see, for example, [16, 19]). Again, however, the present setting, that is, the comparison of an ordinal outcome between two groups while adjusting for a continuous covariate, seems to be a somewhat difficult problem. The reason is that many permutation tests (e.g., the so-called synchronized permutation approach, see [10, 22]) implicitly assume an additive model and, hence, are only applicable if the outcome variable is metric. However, using the so-called “nonparametric combination method” might be a promising alternative to a classical parametric approach: For the purpose of the present manuscript, the comparison of two samples with respect to an ordered categorical outcome, which was studied in [17], is of interest. Nevertheless, as with other approaches, continuous covariates cannot be directly accounted for. One straightforward solution might be to categorize the covariate, because an additional adjustment for a categorical covariate could then be done by another application of the nonparametric combination method. Indeed, in applied research, a continuous covariate is often transformed into a binary variable by, for example, applying a median split. Alternatively, there might be some well-established cutoffs available, or subject-matter expertise could help to define appropriate categories.
Therefore, it might be of interest, especially for biostatisticians, to have some empirical guidance at hand concerning which of the two approaches—permutation tests with the nonparametric combination method, or rank-based tests—should be used in practice.

The manuscript is organized as follows: In Sect. 2, we introduce the pseudo-rank version of the nonparametric ANOVA-type test for general factorial designs [4] and the Anderson–Darling type permutation test [17], as well as some fundamental pieces of the respective theory. Section 3 contains the results of an extensive simulation study, covering balanced and unbalanced settings, as well as different distributional assumptions. The application of the two methods under investigation is illustrated by the analysis of a real-life data example in Sect. 4. Finally, Sect. 5 contains some concluding remarks and a brief synopsis of the respective advantages and drawbacks, which hopefully helps applied researchers to choose an appropriate statistical analysis approach. All tables and figures are provided in the Online Supplement.

2 The Nonparametric Combination Method and a Pseudo-rank-based Approach

Let $$(X_{11},Y_{11}),\ldots ,(X_{1n_1},Y_{1n_1}) \overset{iid}{\sim } F_1$$ and $$(X_{21},Y_{21}),\ldots ,(X_{2n_2},Y_{2n_2}) \overset{iid}{\sim } F_2$$ denote two independent bivariate samples. Thereby, the first and second components are the continuous covariate and the ordinal outcome, respectively. Let $$\mathcal {R}_Y := \{C_1 \le \ldots \le C_K\}$$ denote the support of the outcome Y, which consists of the ordered categories $$C_1,\ldots , C_K$$. For example, in medical research, but also in psychology and other fields, the outcome is frequently assessed by a rating on some scale (e.g., modified Rankin Scale, Glasgow Coma Scale, Visual Analog Scale, or Functional Gait Assessment, just to name a few). We have the impression that such outcomes are quite frequently analyzed using classical methods for metric variables (e.g., t-test, ANOVA). This might indeed be appropriate in cases where interpreting the variable as metric is sensible from the respective subject-matter point of view. Nevertheless, we would like to emphasize that this issue requires careful case-by-case considerations. Moreover, especially in case of a highly discrete variable (i.e., the cardinality of the support of Y is quite “small”), assuming a normal distribution might not be justified.

Regarding the covariate, the continuous random variables $$X_{11},\ldots ,X_{2n_2}$$ are categorized by applying a measurable function $$g:\mathcal {R}_X \rightarrow \{1,2\}$$, where $$\mathcal {R}_X$$ denotes the range of the covariate. The choice of g is either guided by subject-matter expertise or relies on statistical considerations (e.g., median split). Hence, in the sequel, we shall partition the outcomes $$Y_{11},\ldots ,Y_{2n_2}$$ according to the transformed covariate values $$Z_{11}:=g(X_{11}),\ldots ,Z_{2n_2}:=g(X_{2n_2})$$. Doing so, and after some re-indexing, we get
$$\begin{aligned} Y_{111},\ldots ,Y_{11n_{11}}&\overset{iid}{\sim } F_{1|Z=1}, Y_{121},\ldots ,Y_{12n_{12}} \overset{iid}{\sim } F_{1|Z=2}, \\ Y_{211},\ldots ,Y_{21n_{21}}&\overset{iid}{\sim } F_{2|Z=1}, Y_{221},\ldots ,Y_{22n_{22}} \overset{iid}{\sim } F_{2|Z=2}. \\ \end{aligned}$$
It should be noted that actually, $$n_{ij}$$ is a random variable, $$i,j\in \{1,2\}$$, since the covariate and its categorized version are random quantities. Nevertheless, for ease of presentation, we consider the cell sizes as fixed in the sequel, which means that everything has to be understood conditionally on a fixed set of covariate values $$z_{11},\ldots ,z_{2n_2}$$. This does not restrict the generality of the two approaches proposed in Sects. 2.1 and 2.2, because the formal procedures work completely analogously in case of random covariates. However, caution is needed concerning the simulation setup. Therefore, in all settings discussed in Sect. 3, the covariate will be considered as random again.
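As a minimal illustration of this setup, the median-split categorization g and the resulting partition into the four cells (i, j) might be sketched in Python as follows (the simulations and analyses in this manuscript were carried out in R; all function and variable names below are purely illustrative):

```python
import numpy as np

def median_split_cells(x, y, group):
    """Dichotomize the covariate x by a median split (one choice of the
    measurable function g) and partition the ordinal outcomes y into the
    four cells (i, j): i = treatment group, j = covariate stratum."""
    z = np.where(x <= np.median(x), 1, 2)  # transformed covariate Z = g(X)
    cells = {(i, j): y[(group == i) & (z == j)]
             for i in (1, 2) for j in (1, 2)}
    return z, cells

# toy data: two treatment groups of 20, a continuous covariate,
# and an mRS-type ordinal outcome on {0, ..., 6}
rng = np.random.default_rng(1)
x = rng.normal(70.6, 25.0, size=40)
group = np.repeat([1, 2], 20)
y = rng.integers(0, 7, size=40)
z, cells = median_split_cells(x, y, group)
```

For even total sample size and a continuous covariate, the median split above yields the two covariate strata of equal size $$n_{.1} = n_{.2} = N/2$$, as noted in Sect. 2.1.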

For the sake of notational simplicity, let $$F_{ij}:= F_{i|Z=j}$$, $$i,j \in \{1,2\}$$. Recall that our main aim is to compare the outcome Y between the two treatment groups (i.e., $$i = 1$$ and $$i = 2$$). In the following two subsections, we propose two different approaches.

2.1 The Nonparametric Combination Method, Applied to an Anderson–Darling type Permutation Test

The basic idea underlying the nonparametric combination (NPC) method is quite simple: The hypothesis is split up into a finite set of partial hypotheses, and subsequently, a hypothesis test is constructed for each of these partial testing problems. In a second step, the resulting test statistics or p values are combined by using an appropriate combination function (e.g., Fisher’s combination function). It follows immediately from the underlying theory that the basic properties of the separate tests (e.g., consistency, exactness) then carry over to the combined test. For an overview, we refer to [18, 19].

We consider the hypothesis $$H_{0,NPC}: \{F_{11} = F_{21}\} \cap \{F_{12} = F_{22}\}$$ vs. $$H_{1,NPC}: \{F_{11} \ne F_{21}\} \cup \{F_{12} \ne F_{22}\}$$ and construct a corresponding test by a suitable nonparametric combination of two partial permutation statistics $$T_1$$ and $$T_2$$. Let $$n_{.j} = n_{1j} + n_{2j}$$ denote the number of observations with transformed covariate value j, $$j\in \{1,2\}$$. For even total sample size N, dichotomizing the covariate by applying a median split yields $$n_{.1} = n_{.2} = N/2$$. For a permutation $$s \in \mathcal {S}_{n_{.j}}$$, we define the corresponding permutation of the pooled observations $$\mathbf {Y}_j = \{Y_{1},\ldots ,Y_{n_{.j}}\}$$ within covariate subgroup $$Z = j$$ by $$\mathbf {Y}_j^{*} := \{Y_{s(1)},\ldots ,Y_{s(n_{.j})}\}$$, $$j \in \{1,2\}$$. Now, we use an Anderson–Darling type test statistic for each of the two partial tests, that is
$$\begin{aligned} T(\mathbf {Y}_j^{*}) := \sum _{k=1}^{K-1}(\hat{F}_{1j}^{*}(C_k) - \hat{F}_{2j}^{*}(C_k))^2 (\hat{F}_{.j}(C_k)(1-\hat{F}_{.j}(C_k)))^{-1}, j \in \{1,2\}. \end{aligned}$$
(1)
Thereby, $$\hat{F}_{ij}^{*}$$ and $$\hat{F}_{.j}$$ denote the permutation version of the empirical CDF within treatment group i, given $$Z = j$$, and the marginal empirical CDF, given $$Z = j$$, respectively. This Anderson–Darling type permutation test has already been considered for the two-group comparison setting without adjustment for covariates by [17]. The main idea in the present setting is to just apply that test to the observations within each of the two covariate subgroups separately. For the subsequent nonparametric combination of $$T(\mathbf {Y}_1^{*})$$ and $$T(\mathbf {Y}_2^{*})$$, there are several choices available (see, for example, [18]). We would like to mention two of them: On the one hand, the direct combination can be used, that is,
$$ T_{AD,dir}(\mathbf {Y}_1^{*},\mathbf {Y}_2^{*}):= T(\mathbf {Y}_1^{*}) + T(\mathbf {Y}_2^{*}), $$
and subsequently, the permutation p value is calculated by
$$\begin{aligned} p_{AD,dir} = \frac{1}{n_p}\sum _{m=1}^{n_p}\mathbf {1}\{T_{AD,dir}(\mathbf {Y}_1^{*(m)},\mathbf {Y}_2^{*(m)}) \ge T_{AD,dir}(\mathbf {Y}_1,\mathbf {Y}_2)\}, \end{aligned}$$
(2)
where $$n_p$$ denotes the number of Monte Carlo replications (e.g., $$n_p = 2000$$), and $$\mathbf {Y}_j^{*(m)}$$ denotes the m-th permuted dataset, $$j \in \{1,2\}$$. Alternatively, one may first calculate the respective p values and then combine them using the Fisher combination function. Hence, if we let $$p_j := \frac{1}{n_p}\sum _{m=1}^{n_p}\mathbf {1}\{T(\mathbf {Y}_j^{*(m)}) \ge T(\mathbf {Y}_j)\}$$, $$j \in \{1,2\}$$, the Fisher combination p value is obtained by
$$\begin{aligned} p_{AD,F} = 1-H(-2 \log (p_1 p_2)), \end{aligned}$$
(3)
where H denotes the CDF of a central Chi-square distribution with 4 degrees of freedom.
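A sketch of the two NPC variants might look as follows in Python (illustrative code only, not taken from any package: the statistic (1) is evaluated on the original and on permuted data within each covariate stratum, and the p values (2) and (3) are computed by naive Monte Carlo; a small floor on the product of p values guards against taking the logarithm of zero):

```python
import numpy as np
from scipy.stats import chi2

def ad_stat(y1, y2, categories):
    """Anderson-Darling type statistic (1): squared ECDF differences between
    the two treatment groups, weighted by the pooled variance term, summed
    over all but the last category."""
    pooled = np.concatenate([y1, y2])
    F1 = np.array([np.mean(y1 <= c) for c in categories[:-1]])
    F2 = np.array([np.mean(y2 <= c) for c in categories[:-1]])
    Fp = np.array([np.mean(pooled <= c) for c in categories[:-1]])
    w = Fp * (1.0 - Fp)
    keep = w > 0  # skip categories with degenerate weight
    return np.sum((F1[keep] - F2[keep]) ** 2 / w[keep])

def npc_ad_test(cells, categories, n_perm=2000, seed=0):
    """NPC of the two partial AD permutation tests (one per covariate
    stratum j): direct combination (2) and Fisher combination (3)."""
    rng = np.random.default_rng(seed)
    t_obs, t_perm = [], []
    for j in (1, 2):
        y1, y2 = cells[(1, j)], cells[(2, j)]
        pooled, n1 = np.concatenate([y1, y2]), len(y1)
        t_obs.append(ad_stat(y1, y2, categories))
        stats = np.empty(n_perm)
        for m in range(n_perm):
            perm = rng.permutation(pooled)  # permute within stratum j only
            stats[m] = ad_stat(perm[:n1], perm[n1:], categories)
        t_perm.append(stats)
    # direct combination: compare sums of partial statistics, as in (2)
    p_dir = np.mean(t_perm[0] + t_perm[1] >= t_obs[0] + t_obs[1])
    # Fisher combination of the two partial permutation p values, as in (3)
    p1 = np.mean(t_perm[0] >= t_obs[0])
    p2 = np.mean(t_perm[1] >= t_obs[1])
    p_fisher = chi2.sf(-2.0 * np.log(max(p1 * p2, 1e-12)), df=4)
    return p_dir, p_fisher
```

Note that the permutations are carried out separately and independently within the two covariate strata, so that the stratification by Z is respected.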

Observe that each summand in (1) might be regarded as a separate test statistic, which essentially compares the cumulative frequencies up to category $$C_k$$ between the two treatment groups, conditionally on $$Z = j$$, $$k \in \{1,2,\ldots ,K-1\}$$, $$j \in \{1,2\}$$. Hence, the two Anderson–Darling type test statistics are again direct combinations of partial tests, thus representing another application of the NPC method.

2.2 A Nonparametric (Pseudo-)Rank-based Method for Factorial Designs

As an alternative to the permutation approach, the nonparametric rank-based ANOVA-type test proposed by [4] might be used. Analogously to the parametric linear model, the hypothesis corresponding to the main effect of the binary treatment factor is stated as $$H_0: F_{11} - F_{21} + F_{12} - F_{22} = 0$$ vs. $$H_1: F_{11} - F_{21} + F_{12} - F_{22} \ne 0$$. Let $$R_{ijl}$$ denote the rank of $$Y_{ijl}$$ (i.e., the outcome of subject l in treatment group i with dichotomized covariate value $$Z = j$$, for $$l \in \{1,2,\ldots ,n_{ij}\}$$, $$i,j\in \{1,2\}$$) within all $$N = \sum _{i,j}n_{ij}$$ observations. Let $$\bar{R}_{ij.} = n_{ij}^{-1}\sum _{l=1}^{n_{ij}}R_{ijl}$$ and $$S_{ij}^2 = (n_{ij} - 1)^{-1}\sum _{l=1}^{n_{ij}}(R_{ijl}-\bar{R}_{ij.})^2$$ denote the empirical mean and variance of the ranks, $$i,j \in \{1,2\}$$. We consider the test statistic
$$\begin{aligned} T_A(\mathbf {Y}) = \frac{(\bar{R}_{11.} - \bar{R}_{21.} + \bar{R}_{12.}-\bar{R}_{22.})^2}{S_0^2}, \end{aligned}$$
(4)
where $$S_0^2 := \sum _{i=1}^{2}\sum _{j=1}^{2}S_{ij}^2/n_{ij}.$$ Under $$H_0$$, this test statistic has, asymptotically, a central Chi-square distribution with 1 degree of freedom. For small samples, however, the distribution of $$T_A$$ can be approximated by an F-distribution with numerator degrees of freedom equal to 1 and denominator degrees of freedom
$$ \hat{f}_0 = \frac{S_0^4}{\sum _{i,j}(n_{ij}-1)^{-1}(S_{ij}^2/n_{ij})^2}. $$
We would like to add some important remarks. Firstly, in order to allow for establishing a unified framework regardless of whether ties are present or not, the normalized CDF $$F:=(F^{+} + F^{-})/2$$ should be used. Thereby, $$F^{+}$$ and $$F^{-}$$ denote the right and left continuous versions of the CDF, respectively. Accordingly, the so-called mid-ranks are used in (4). For the sake of notational simplicity, however, we have not explicitly used the normalized CDF (mid-ranks) in the formal considerations above. Secondly, it has been noticed recently that using ranks might lead to paradoxical results [5]. Replacing the ranks by the so-called pseudo-ranks has been shown to serve as a remedy [6]. Operationally, one just uses pseudo-ranks instead of ranks when calculating $$T_A(\mathbf {Y})$$. This corresponds to replacing the weighted mean distribution function $$W:= N^{-1}\sum _{i=1}^{2}\sum _{j=1}^{2}n_{ij}F_{ij}$$ by the unweighted mean CDF $$U:=1/4\sum _{i=1}^{2}\sum _{j=1}^{2}F_{ij}$$, as proposed by [8]. For the reasons discussed in [5], we recommend using pseudo-ranks and denote the corresponding test statistic by $$T_A^{\psi }(\mathbf {Y})$$ in the sequel. Finally, since the numerator degrees of freedom of the distribution of $$T_A$$ are equal to 1, one could also consider the linear (pseudo-)rank statistic $$\sqrt{T_A^{\psi }(\mathbf {Y})}$$, which has a large-sample standard normal distribution. For small samples, the same approximation idea as outlined above may be used (see [7] for details). In particular, using the linear rank statistic would allow for testing one-sided hypotheses. Likewise, it is straightforward to construct the one-sided counterparts of the Anderson–Darling type permutation tests that have been discussed in the previous section [17].
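The pseudo-rank variant of the ANOVA-type test might be sketched as follows in Python (illustrative only; the reference implementation used in this manuscript is the rankFD package in R). Pseudo-ranks are computed as $$N \hat{U}(Y) + 1/2$$, with $$\hat{U}$$ the unweighted mean of the four normalized (mid) empirical CDFs, and the p value uses the small-sample $$F(1, \hat{f}_0)$$ approximation:

```python
import numpy as np
from scipy.stats import f as f_dist

def pseudo_ranks(cells):
    """Pseudo-ranks R^psi = N * U_hat(Y) + 1/2, where U_hat is the
    unweighted mean of the four normalized (mid) empirical CDFs F_ij."""
    N = sum(len(v) for v in cells.values())
    def U_hat(y):
        return np.mean([(np.mean(c < y) + np.mean(c <= y)) / 2
                        for c in cells.values()])
    return {k: np.array([N * U_hat(y) + 0.5 for y in v])
            for k, v in cells.items()}

def anova_type_test(cells):
    """ANOVA-type statistic (4) based on pseudo-ranks, with the
    small-sample F(1, f0_hat) approximation for the p value."""
    pr = pseudo_ranks(cells)
    n = {k: len(v) for k, v in pr.items()}
    mean = {k: v.mean() for k, v in pr.items()}
    s2 = {k: v.var(ddof=1) for k, v in pr.items()}
    S0_sq = sum(s2[k] / n[k] for k in pr)
    T_A = (mean[(1, 1)] - mean[(2, 1)]
           + mean[(1, 2)] - mean[(2, 2)]) ** 2 / S0_sq
    f0 = S0_sq ** 2 / sum((s2[k] / n[k]) ** 2 / (n[k] - 1) for k in pr)
    return T_A, f0, f_dist.sf(T_A, 1, f0)
```

In balanced designs (equal cell sizes), the unweighted mean CDF U coincides with the weighted mean CDF W, so the pseudo-ranks reduce to the usual pooled mid-ranks.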

3 Simulations

The simulation scenarios are based on the data from the SIESTA (Sedation vs Intubation for Endovascular Stroke Treatment) trial, where conscious sedation and general anesthesia were compared with respect to early neurological improvement in patients undergoing stroke thrombectomy [23, 24]. We considered the functional outcome, which was assessed at the end of the study (i.e., 3 months after the intervention) by using the modified Rankin Scale (mRS). Since this was one of the secondary outcomes in the original study, no covariate-adjusted analysis had been conducted. Nevertheless, for the present purpose, we compared the mRS at 3 months between the two treatment groups, while adjusting for “door-to-arterial-puncture time” as a covariate. The respective group-specific empirical means and variances for both variables were extracted from Table 3 in [24], and subsequently averaged over the two treatment groups, yielding $$\mu _{\tilde{Y}} = 3.6\, (\sigma _{\tilde{Y}}^2 = 3.425)$$ for the continuous variable $$\tilde{Y}$$, which was assumed to underlie the outcome, and $$\mu _X = 70.6\, (\sigma _X^2 = 627.25)$$ for the covariate X (time), respectively. Observe that for the sake of simplicity, we assumed that the distribution of the covariate was the same in both groups. This assumption is met in (well-designed) randomized clinical trials (note that we refer to the equality of the distributions at the population level, so, empirical imbalance especially in small samples is not an issue). With these specifications, and assuming a correlation of 0.5, the realizations of $$\tilde{Y}$$ and X were simulated from a bivariate normal distribution. Note that for power simulations, the means of the outcome in the first and in the second group were set to $$\mu _{\tilde{Y}}$$ and $$\mu _{\tilde{Y}} - \delta $$, where $$\delta \in \{0.5,1.0,1.5\}$$, respectively. 
Secondly, independent and identically distributed error terms $$\tilde{\xi }_{ij}$$, $$i \in \{1,2\}, j \in \{1,2,\dots ,n_{i}\}$$, were drawn from one out of several different distributions (standard normal, standard lognormal, exp(1), t(3), Cauchy and Pareto(1)) and standardized by
$$ \varepsilon _{ij} = \frac{\tilde{\xi }_{ij} - \mathrm {E}[\tilde{\xi }_{ij}]}{(\mathrm {Var}[\tilde{\xi }_{ij}])^{1/2}}, $$
for $$i \in \{1,2\}, j \in \{1,2,\ldots ,n_{i}\}$$ (of course, provided that the second moments were finite). Then, we calculated the sum of the realizations of the variable underlying the outcome and the errors, that is, $$\tilde{Y}_{ij} + \varepsilon _{ij}$$, and rounded the resulting values. Finally, in order to obtain outcomes $$Y_{ij}$$ within the range of the mRS $$(0-6)$$, we set negative values to 0 and values $$\ge $$7 to 6, respectively. Since doing so yielded relatively large proportions of 0 and 6 values, we reduced the variance $$\sigma _{\tilde{Y}}^2$$ of the underlying continuous variable by 1 (it should be noted that this might resemble the real-life data more closely, because adding the error terms increases the variance by 1). So, summing up, we first generated samples from a bivariate normal distribution, subsequently added the error terms to the first coordinates (i.e., the outcomes), and manipulated the resulting values accordingly, in order to eventually obtain integer values between 0 and 6. Furthermore, by construction, shift effects were considered for power simulations.
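For the normal-error case, the data generation described above might be sketched as follows in Python (illustrative; the study itself was run in R, and only standard normal errors are shown here, since the other error distributions merely replace the standardized error draw):

```python
import numpy as np

def simulate_mrs_data(n1, n2, delta=0.0, rho=0.5, seed=0):
    """One simulated dataset along the lines of Sect. 3: latent outcome and
    covariate are bivariate normal; standardized (here: standard normal)
    errors are added to the latent outcome, which is then rounded and
    clamped to the mRS range 0-6."""
    rng = np.random.default_rng(seed)
    mu_y, var_y = 3.6, 3.425 - 1.0   # latent variance reduced by 1 (see text)
    mu_x, var_x = 70.6, 627.25
    cov = rho * np.sqrt(var_y * var_x)
    Sigma = np.array([[var_y, cov], [cov, var_x]])
    groups, xs, ys = [], [], []
    for i, (n, mu) in enumerate([(n1, mu_y), (n2, mu_y - delta)], start=1):
        latent = rng.multivariate_normal([mu, mu_x], Sigma, size=n)
        eps = rng.standard_normal(n)  # standardized error terms
        y = np.clip(np.rint(latent[:, 0] + eps), 0, 6).astype(int)
        groups.append(np.full(n, i))
        xs.append(latent[:, 1])
        ys.append(y)
    return (np.concatenate(groups), np.concatenate(xs), np.concatenate(ys))
```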

Our main aim was to examine the empirical type I error rates and power of the direct and the Fisher combination of the Anderson–Darling type permutation tests—the corresponding formulas for calculating the p values are given in (2) and (3)—as well as the performance of the pseudo-rank-based nonparametric ANOVA-type test $$T_A^{\psi }(\mathbf {Y})$$. The latter method is implemented in the rankFD package [13], in R [20]. The code that was used for the simulations and the real-life data analysis (Sect. 4), is available upon request. As a competitor for benchmarking the results, we also tested for a significant group effect by employing a probabilistic index model (PIM; see [26]). It should be noted that analogously to the aforementioned setup, the median split version of the covariate (i.e., door-to-arterial-puncture time) was included, in order to ensure comparability of the results. Furthermore, preliminary simulations revealed that including the covariate as a metric variable in the PIM might lead to considerable power loss (results not shown). For carrying out the PIM simulations, we used the pim package [15]. For all scenarios and tests, we considered three balanced and three unbalanced group size configurations, namely $$\mathbf {n}_1 = (20,20)$$, $$\mathbf {n}_2 = (40,40)$$, $$\mathbf {n}_3 = (80,80)$$, $$\mathbf {n}_4 = (20,40)$$, $$\mathbf {n}_5 = (20,60)$$, and $$\mathbf {n}_6 = (20,80)$$. For each combination of the simulation parameters, we conducted $$n_{sim} = 10,000$$ simulations. The number of permutations within each run was set to $$n_{p} = 2,000$$. The significance level was specified as $$\alpha = 5 \%$$.

Both NPC- and rank-based tests maintained the target type I error level very well (Table S1). However, the PIM test tended to be slightly liberal, especially in small samples, as well as in case of severe group size imbalance. With respect to power, the ANOVA-type approach showed either a similar or a better performance compared to the two permutation-based tests, which were empirically almost equivalent. Depending on the scenario, the difference in power was up to about 13 percentage points. The power of the PIM approach was lower compared to our proposed methods, especially for moderate to large effect sizes. The results were very similar across most error distributions, as is evident from Figs. S1 and S2, except for some power loss in case of errors from a Cauchy or a Pareto(1) distribution (Fig. S3). For the latter settings, the simulation study also revealed that there might be some computational problems when conducting the permutation-based tests for small balanced group sizes, and when using the ANOVA-type test with substantially unbalanced groups, due to degenerate empirical variances in one of the subgroups. However, this problem was only present in very few simulation runs (permutation-based tests: 13 runs for type I error simulations, 7 and 1 runs for $$\delta = 0.5$$ and $$\delta = 1.0$$, respectively; ATS: 1 run for each $$\delta \in \{0,0.5,1.0,1.5\}$$). The PIM test was not affected by these problems, yet it was again inferior to the other approaches in terms of power. However, for unbalanced Pareto(1) scenarios, the PIM approach outperformed the NPC-based tests. Apart from that, interestingly, for unbalanced settings, the empirical power values of all tests under consideration did not change substantially as the group allocation ratios became more extreme, despite the fact that the total sample sizes thus increased (Fig. S2).

Moreover, in order to explore whether the aforementioned findings may depend on the particular choices of the test statistics and the alternatives, respectively, we conducted a number of further sensitivity analyses. Firstly, all simulations were repeated using the Fisher combination method, but with the Anderson–Darling type test being replaced by the Wilcoxon-Mann-Whitney (WMW) test [14, 28]. To this end, we used the corresponding function rank.two.samples in the rankFD package [12]. The underlying rationale for using the WMW test was to examine whether or not the aforementioned power discrepancies between the rank- and permutation-based approaches may be at least partially explained by the different structures of the respective test statistics and effect measures. Overall, the results were very similar to the Anderson–Darling-based combination tests, with small gains in power in some scenarios (Table S1, Figs. S1–S3). However, it should be mentioned that computational problems were present in up to 5–10% of the simulation runs.

Secondly, in order to compare the methods under consideration for other alternatives than location shifts, we modified the data generation process in the following way: Both group means were set to $$\mu _{\tilde{Y}} = 3.6$$, but the variances of the outcome in groups 1 and 2 were specified as $$\sigma _1^2 = \sigma _{\tilde{Y}}^2 - 1 = 2.425$$ and $$\sigma _2^2 = d \sigma _1^2$$, where $$d \in \{4,8,12,16,20\}$$. All other simulation parameters that have been described at the beginning of this section were left unchanged. For ease of presentation, only sample size scenarios $$\mathbf{n}_1$$ and $$\mathbf{n}_5$$ with normal and lognormal errors were considered. The results are summarized in Fig. S4. The Anderson–Darling combination tests clearly outperformed the PIM approach, which, in turn, was more powerful than the rank-based test. Hence, the latter method should not be used in settings where scale alternatives are of interest. Observe that the Fisher combination of WMW tests would have been a suboptimal choice as well, because, like the ATS approach, the WMW test is based on the so-called relative effect and, therefore, lacks power for scale alternatives.

Finally, we conducted a small simulation study that was aimed at investigating the performance of our proposed approaches in comparison to a classical parametric proportional odds (PO) model (see, for example, [1]). Analogously to the notations in the previous sections, let Y, Z, and G denote the outcome (i.e., 3-months mRS), the covariate (i.e., door-to-arterial-puncture time), and a dummy variable representing the group indicator (i.e., $$G = i-1$$ for $$i = 1,2$$), respectively. We considered the PO model
$$ logit(P(Y\le C_k)) = \alpha _k + \beta _1 G + \beta _2 Z + \beta _3 (GZ), $$
where $$C_k$$ denotes the k-th mRS category, so, $$C_k = k - 1$$, $$k \in \{1,2,\ldots ,6\}$$. Observe that $$P(Y\le 6) = 1$$, so, this probability does not have to be modeled. Moreover, (GZ) denotes the covariate representing the group-covariate interaction. Data were simulated from this model as follows:
  1. Simulate covariates $$Z_{ij}\overset{iid}{\sim }\mathcal {N}(\mu _X, \sigma _X^2)$$, where $$\mu _X$$ and $$\sigma _X^2$$ denote the (group-averaged) empirical mean and variance of “door-to-arterial-puncture time”, respectively (as defined at the beginning of Sect. 3).

  2. Calculate $$P(Y_{ij}\le C_k) = logit^{-1}(\alpha _k + \beta _1 G_{ij} + \beta _2 Z_{ij} + \beta _3 (GZ)_{ij})$$, where the parameters were specified as follows: $$\beta _1 \in \{0,2,4\}$$, $$\beta _2 = -0.1$$, $$\beta _3 = 0$$, and $$\alpha _k = k+1$$, $$k \in \{1,\ldots ,6\}$$.

  3. Finally, the realization of $$Y_{ij}$$ was sampled from the corresponding probability distribution $$\{P(Y_{ij}=C_1),\ldots ,P(Y_{ij} = C_7)\}$$.

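The three steps above might be sketched as follows in Python (illustrative code under the stated parameter values; the cell probabilities are obtained by differencing the cumulative probabilities, with a small clipping step guarding against rounding artifacts):

```python
import numpy as np

def simulate_po_data(n1, n2, beta1=2.0, beta2=-0.1, beta3=0.0, seed=0):
    """Data generation from the proportional odds model, following
    steps 1-3 of Sect. 3."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    G = np.repeat([0, 1], [n1, n2])                 # dummy group indicator
    # step 1: covariates from N(mu_X, sigma_X^2)
    Z = rng.normal(70.6, np.sqrt(627.25), size=n)
    alpha = np.arange(1, 7) + 1.0                   # alpha_k = k + 1, k = 1..6
    # step 2: cumulative probabilities P(Y <= C_k) for k = 1..6
    eta = (alpha[None, :] + beta1 * G[:, None]
           + beta2 * Z[:, None] + beta3 * (G * Z)[:, None])
    cum = 1.0 / (1.0 + np.exp(-eta))
    cum = np.hstack([cum, np.ones((n, 1))])         # P(Y <= C_7) = 1
    # step 3: sample each Y from the implied category probabilities
    probs = np.clip(np.diff(cum, prepend=0.0, axis=1), 0.0, None)
    probs /= probs.sum(axis=1, keepdims=True)
    Y = np.array([rng.choice(7, p=p) for p in probs])  # mRS categories 0..6
    return G, Z, Y
```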
It should be noted that the probability distribution in step 3 depends on the covariate, which nevertheless has not been stated explicitly for the sake of notational simplicity. The resulting empirical type I error and power rates of the PO-based test for $$H_0:\beta _1 = 0$$ and the aforementioned competitors are reported in Table S2. Note that for the PIM and PO tests, the original covariate instead of its categorized counterpart was included in the model, because using the latter led to computational errors. Obviously, the permutation approaches, as well as the rank-based ANOVA-type test, outperformed their competitors in terms of power, while maintaining the prespecified 5% level. It has to be mentioned, however, that computational problems due to degenerate variances were present in a considerable number of simulation runs (2–5%).

4 Real-Life Data Example

In order to illustrate the different approaches under consideration, we analyzed the data from the SIESTA trial that has been mentioned in Sect. 3. The outcome variable was the modified Rankin Scale (mRS) at 3 months post intervention, and the age at baseline was considered as the (continuous) covariate. Each patient had been randomly assigned to either conscious sedation or general anesthesia. Firstly, the direct combination method yielded the combined Anderson–Darling type statistic $$T_{AD,dir} = 0.53126$$, with the corresponding permutation p value $$p_{AD,dir} = 0.556$$. The Fisher combination p value was very similar ($$p_{AD,F} = 0.570$$). However, the pseudo-rank-based ANOVA-type approach yielded a somewhat smaller p value ($$p_{ATS} = 0.374$$, $$T_A^{\psi } = 0.79539$$, $$\hat{f}_0 = 139.69$$). Summing up, the results might point to some gain in power when using the ANOVA-type statistic. For the sake of completeness, we also conducted a test for the group indicator in a PIM model with the dichotomized covariate age, as well as the covariate-group interaction, as additional explanatory variables. The resulting p value was 0.5513, which is also in line with the findings from our simulation study (see Sect. 3).

5 Discussion

In the present manuscript, we have considered two different nonparametric approaches for comparing an ordinal outcome variable between two groups while adjusting for a continuous covariate. In contrast to existing methods, we did not incorporate the covariate directly, but dichotomized it by applying a median split. In applied research, covariates are frequently categorized, where the choice of the categories is either guided by subject-matter expertise or based on certain empirical quantiles. Hence, our proposed method can be regarded as representative of what is frequently done in practice. The simulation results showed that type I error rates were neither inflated nor deflated. With respect to power, the pseudo-rank-based ANOVA-type statistic outperformed the permutation-based approaches for location-shift alternatives. However, since multiple aspects should be considered when deciding for or against a particular statistical method, we would like to briefly discuss further advantages and drawbacks now.

Firstly, in addition to the gain in power at least for location-shift alternatives, another argument in favor of the ANOVA-type approach is that it provides a corresponding effect measure, the so-called relative effect, which can be estimated from the data. In fact, the ANOVA-type statistic is based on these estimated relative effects [4]. By contrast, the permutation-based methods are designed for testing hypotheses rather than for the purpose of estimation. Of course, one could calculate the estimated relative effects in addition to the permutation p value, yet with the drawback of introducing some discrepancy between what is tested and what is considered as the effect measure.

Secondly, on the other hand, although the relative effects are straightforward to interpret, the ANOVA-type approach might be somewhat more difficult to apply, insofar as there are several options for specifying the hypotheses, the test statistics, and the ranks that are used. Although the rankFD package is a very convenient implementation in R, it might be easier to understand what is going on exactly when using the permutation-based approaches. Apart from that, we have to acknowledge that our empirical results, like any simulation study, only provide evidence for some specific settings. Although the range of sample size scenarios and distributional assumptions is quite broad, we would like to emphasize that we only considered shift effects. But we have demonstrated in Sect. 3 that, for example, considering scale alternatives might yield very different results in terms of power. Hence, we recommend conducting further simulations that appropriately reflect the particular setting of interest, including various types of alternatives, before actually applying the one or the other method. Moreover, especially in case of very small sample sizes, we conjecture that the permutation-based approaches might be superior to the pseudo-rank-based method, because the former are finitely exact. Apart from that, despite the somewhat suboptimal performance of the PIM-based tests in our simulations, it should be emphasized that the PIM model clearly warrants further examination in future research on analysis methods for ordinal outcomes, due to its attractiveness in terms of the broad range of potential applications. Likewise, it might be worthwhile to consider employing a proportional odds model, at least in particular settings where the underlying assumptions are tenable, due to the straightforward interpretation of the results.

Finally, we would like to emphasize that the approaches under consideration can easily be applied to settings with multiple (categorical or categorized) covariates, too. Moreover, we would like to briefly sketch how the permutation methods that have been proposed in the present manuscript can be extended to the case of multivariate outcomes. For example, in the SIESTA trial (see Sect. 4), it would be of interest to compare the mRS, as well as the National Institutes of Health Stroke Scale (NIHSS), between the two treatment groups. Note that standard parametric tests (e.g., Wilks’ Lambda) rely, in particular, on the assumption of multivariate normality, which might be restrictive and is even more difficult to justify than univariate normality. Therefore, using the nonparametric combination (NPC) method could be an appealing alternative. However, as the number of subgroups and/or dimensions increases, the permutation-based method becomes increasingly demanding with respect to computational resources. In addition, a thorough empirical examination of the properties of the resulting tests has to be done in future work, in order to ensure that this approach can be safely used in applied research.