1 Introduction
In many clinical studies, the main research interest is focused on the comparison of two groups with respect to a certain outcome variable of interest. Examples include the classical parallel-group design (e.g., verum vs. placebo) in randomized clinical trials, but also observational studies, which are aimed at studying the association between a particular factor (e.g., two different pathologies) and a health-related variable (e.g., a laboratory measurement or a quality of life score). Frequently, the two-sample t-test is used for analyzing data from such studies. If the comparison is adjusted for one or several covariates, the analysis of covariance (ANCOVA) is regarded as the method of choice. A comprehensive overview of the classical ANCOVA and some of its alternatives is given by [11]. However, most methods require that the outcome is continuous, or at least that the effects are additive. Therefore, applying these approaches in settings where the outcome is ordinal might be inappropriate. However, ordinal outcomes are quite frequently encountered in applied research: One example, among others, is the modified Rankin Scale (mRS), which is used to quantify the functional outcome of stroke patients by assigning a rating on a scale from 0 (no symptoms) to 6 (dead) [9, 21, 27]. Although some evidence suggests that converting the ordinal factor levels to scores and subsequently applying the standard parametric ANCOVA does not affect the performance of the test substantially [25], there are two crucial issues with this approach. Firstly, with respect to the normality assumption, a highly discrete variable might be particularly challenging to deal with. Secondly, careful thought is required concerning whether or not the resulting effect measure (e.g., the difference of the adjusted means) can be interpreted in a sensible way. Therefore, analyzing the outcome as a truly ordinal variable might be an attractive option.
Again, however, classical approaches such as proportional odds models (for an introduction, see [1]) require assumptions which might be too restrictive.
Nonparametric rank-based methods may serve as a remedy in order to overcome the aforementioned difficulties. In particular, approaches where the so-called relative effect is used as the effect measure are also applicable to ordinal outcomes. One well-known example is the Wilcoxon-Mann-Whitney test [14, 28], which is the classical nonparametric counterpart of the two-sample t-test. Several extensions to general factorial designs have been proposed (e.g., [2, 4, 6, 8]). However, only a few approaches allow for adjustment for a continuous covariate [3, 26].
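To make the notion of the relative effect concrete, the following Python sketch (illustrative only; the function name and implementation details are our own, and the manuscript's own software is written in R) estimates the relative effect p = P(Y1 < Y2) + 0.5 P(Y1 = Y2) via midranks of the pooled sample, which makes it directly applicable to ordinal outcomes with ties:

```python
import numpy as np

def relative_effect(y1, y2):
    """Estimate the relative effect p = P(Y1 < Y2) + 0.5 * P(Y1 = Y2)
    via midranks of the pooled sample (ties are handled automatically,
    so ordinal data coded as numbers can be used directly)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = len(y1), len(y2)
    pooled = np.concatenate([y1, y2])
    # assign ordinary ranks first ...
    order = pooled.argsort(kind="mergesort")
    ranks = np.empty(n1 + n2)
    ranks[order] = np.arange(1, n1 + n2 + 1)
    # ... then replace the ranks of tied observations by their midrank
    for v in np.unique(pooled):
        tied = pooled == v
        ranks[tied] = ranks[tied].mean()
    # estimator: (mean rank of group 2 - (n2 + 1) / 2) / n1
    return (ranks[n1:].mean() - (n2 + 1) / 2) / n1
```

A value of 0.5 indicates no tendency toward larger values in either group; values above 0.5 indicate that observations in the second group tend to be larger.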
Alternatively, using permutation methods might be an appealing option, mainly due to their finite-sample properties (e.g., exactness) under relatively mild assumptions. For an overview of the underlying theory, we refer, for example, to [18]. As with the rank-based methods that have been mentioned previously, there is a broad variety of permutation approaches, which cover various practically relevant settings (see, for example, [16, 19]). Again, however, the present setting, that is, the comparison of an ordinal outcome between two groups while adjusting for a continuous covariate, seems to be a somewhat difficult problem. The reason is that many permutation tests (e.g., the so-called synchronized permutation approach, see [10, 22]) implicitly assume an additive model, and hence, are only applicable if the outcome variable is metric. However, using the so-called “nonparametric combination method” might be a promising alternative to a classical parametric approach: For the purpose of the present manuscript, the comparison of two samples with respect to an ordered categorical outcome, which was studied in [17], is of interest. Nevertheless, as with other approaches, continuous covariates cannot be directly accounted for. One straightforward solution might be to categorize the covariate, because an additional adjustment for a categorical covariate could then be done by another application of the nonparametric combination method. Indeed, in applied research, a continuous covariate is often transformed into a binary variable by, for example, applying a median split. Alternatively, there might be some well-established cutoffs available, or subject-matter expertise could help to define appropriate categories.
Therefore, it might be of particular interest for biostatisticians to have some empirical guidance at hand concerning which of the two approaches (permutation tests based on the nonparametric combination method, or rank-based tests) should be used in practice.
The manuscript is organized as follows: In Sect. 2, we introduce the pseudo-rank version of the nonparametric ANOVA-type test for general factorial designs [4] and the Anderson–Darling type permutation test [17], as well as some fundamental pieces of the respective theory. Section 3 contains the results of an extensive simulation study, covering balanced and unbalanced settings, as well as different distributional assumptions. The application of the two methods under investigation is illustrated by the analysis of a real-life data example in Sect. 4. Finally, Sect. 5 contains some concluding remarks and a brief synopsis of the respective advantages and drawbacks. This hopefully helps applied researchers to choose an appropriate statistical analysis approach. All tables and figures are provided in the Online Supplement.
2 The Nonparametric Combination Method and a Pseudo-rank-based Approach
Let $(X_{1k}, Y_{1k})$, $k = 1, \ldots, n_1$, and $(X_{2k}, Y_{2k})$, $k = 1, \ldots, n_2$, denote two independent bivariate samples. Thereby, the first and second components are the continuous covariate and the ordinal outcome, respectively. Let $\mathcal{Y} = \{y_1, \ldots, y_m\}$ denote the support of the outcome $Y$, which consists of the ordered categories $y_1 < y_2 < \cdots < y_m$. For example, in medical research, but also in psychology and other fields, the outcome is frequently assessed by a rating on some scale (e.g., modified Rankin Scale, Glasgow Coma Scale, Visual Analog Scale, or Functional Gait Assessment, just to name a few). We have the impression that quite frequently, such outcomes are analyzed by using classical methods for metric variables (e.g., t-test, ANOVA). This might indeed be appropriate in case the interpretation as a metric variable is sensible from the respective subject-matter point of view. Nevertheless, we would like to emphasize that this issue requires careful case-by-case consideration. Moreover, especially in case of a highly discrete variable (i.e., when the cardinality of the support of $Y$ is quite “small”), assuming a normal distribution might not be justified.
For the sake of notational simplicity, let $n = n_1 + n_2$ denote the total sample size. Recall that our main aim is to compare the outcome $Y$ between the two treatment groups (i.e., groups 1 and 2). In the following subsections, we describe two different approaches.
2.1 The Nonparametric Combination Method, Applied to an Anderson–Darling type Permutation Test
The basic idea underlying the nonparametric combination (NPC) method is quite simple: The hypothesis is split up into a finite set of partial hypotheses, and subsequently, a hypothesis test is constructed for each of these partial testing problems. In a second step, the resulting test statistics or p values are combined by using an appropriate combination function (e.g., Fisher’s combination function). It follows immediately from the underlying theory that the basic properties of the separate tests (e.g., consistency, exactness) are then carried over to the combined test. For an overview, we refer to [18, 19].
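As a minimal illustration of the NPC principle, the following Python sketch (a hypothetical helper, not the authors' implementation) evaluates all partial statistics on the same synchronized permutations, derives partial permutation p values, and combines them via Fisher's combining function $T = -2 \sum_i \log p_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

def npc_fisher(y, group, stats, n_perm=1000):
    """Nonparametric combination with Fisher's combining function.
    `stats` is a list of functions mapping (y, group) to a partial test
    statistic (large values speak against H0). All partial statistics
    are evaluated on the same (synchronized) permutations, so the
    dependence structure between the partial tests is preserved."""
    obs = np.array([s(y, group) for s in stats])
    perm = np.empty((n_perm, len(stats)))
    for b in range(n_perm):
        g = rng.permutation(group)
        perm[b] = [s(y, g) for s in stats]
    # partial permutation p values (pseudo-count avoids p = 0)
    p_obs = (1 + (perm >= obs).sum(axis=0)) / (n_perm + 1)
    p_perm = (1 + (perm[:, None, :] >= perm[None, :, :]).sum(axis=0)) / (n_perm + 1)
    # Fisher's combining function: T = -2 * sum(log p_i)
    t_obs = -2 * np.log(p_obs).sum()
    t_perm = -2 * np.log(p_perm).sum(axis=1)
    # combined p value from the permutation distribution of T
    return (1 + (t_perm >= t_obs).sum()) / (n_perm + 1)
```

For instance, `stats` could contain an absolute mean difference and an absolute difference of empirical variances; because all partial tests are permutation tests based on the same rerandomizations, the exactness of the partial tests carries over to the combined test.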
Observe that each summand in (1) might be regarded as a separate test statistic, which essentially compares the cumulative frequencies up to a given category between the two treatment groups, conditionally on the respective level of the categorized covariate. Hence, the two Anderson–Darling type test statistics are again direct combinations of partial tests, thus representing another application of the NPC method.
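To make the structure of such a statistic concrete, the following Python sketch implements a generic Anderson–Darling type statistic for two ordinal samples, in the general form of squared differences of cumulative relative frequencies weighted by the pooled variance term; normalizing constants and implementation details may differ from the exact definition in [17]:

```python
import numpy as np

def ad_type_stat(y1, y2, categories):
    """Anderson-Darling type statistic for two ordinal samples: the
    squared differences of the group-wise cumulative relative
    frequencies are weighted by 1 / (H * (1 - H)), where H denotes the
    pooled cumulative relative frequency, and summed over categories.
    `categories` must be given in their natural order."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    n1, n2 = len(y1), len(y2)
    stat = 0.0
    for c in categories[:-1]:  # the last category contributes H = 1
        f1 = np.mean(y1 <= c)
        f2 = np.mean(y2 <= c)
        h = (n1 * f1 + n2 * f2) / (n1 + n2)
        if 0 < h < 1:  # skip degenerate categories
            stat += (f1 - f2) ** 2 / (h * (1 - h))
    return n1 * n2 / (n1 + n2) * stat
```

Large values indicate discrepancies between the two cumulative frequency profiles; within the NPC framework, one such statistic is computed per covariate stratum and the results are then combined.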
2.2 A Nonparametric (Pseudo-)Rank-based Method for Factorial Designs
3 Simulations
Our main aim was to examine the empirical type I error rates and power of the direct and the Fisher combination of the Anderson–Darling type permutation tests—the corresponding formulas for calculating the p values are given in (2) and (3)—as well as the performance of the pseudo-rank-based nonparametric ANOVA-type test (ATS). The latter method is implemented in the rankFD package [13] in R [20]. The code that was used for the simulations and the real-life data analysis (Sect. 4) is available upon request. As a competitor for benchmarking the results, we also tested for a significant group effect by employing a probabilistic index model (PIM; see [26]). It should be noted that, analogously to the aforementioned setup, the median-split version of the covariate (i.e., door-to-arterial-puncture time) was included, in order to ensure comparability of the results. Furthermore, preliminary simulations revealed that including the covariate as a metric variable in the PIM might lead to considerable power loss (results not shown). For carrying out the PIM simulations, we used the pim package [15]. For all scenarios and tests, we considered three balanced and three unbalanced group size configurations. For each combination of the simulation parameters, a fixed number of simulation runs was conducted, with a fixed number of permutations within each run. The significance level was specified as 5%.
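The overall structure of such a simulation can be sketched as follows (Python; the data-generating model, the covariate coefficient of 0.5, and the standard normal errors are hypothetical placeholders, and the within-stratum combination of the covariate-adjusted partial tests is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(7)

def empirical_rejection_rate(test, n1, n2, n_sim=1000, alpha=0.05, delta=0.0):
    """Monte Carlo estimate of the rejection rate of a two-sample test:
    type I error for delta = 0, power for delta > 0. `test` maps
    (y1, y2) to a p value; the data-generating model below is a
    hypothetical placeholder, not the manuscript's actual setup."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(size=n1 + n2)            # continuous covariate
        z = (x > np.median(x)).astype(float)    # median-split version
        y = 0.5 * z + rng.normal(size=n1 + n2)  # covariate-dependent outcome
        y[n1:] += delta                         # location shift in group 2
        if test(y[:n1], y[n1:]) < alpha:
            rejections += 1
    return rejections / n_sim
```

Plugging in any of the tests under consideration (e.g., the combined Anderson–Darling type permutation test) for `test` then yields the empirical type I error and power rates reported in the supplementary tables and figures.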
Both the NPC- and the rank-based tests maintained the target type I error level very well (Table S1). However, the PIM test tended to be slightly liberal, especially in small samples, as well as in case of severe group size imbalance. With respect to power, the ANOVA-type approach showed either a similar or a better performance compared to the two permutation-based tests, which were almost empirically equivalent. Depending on the scenario, the difference in power was up to about 13 percentage points. The power of the PIM approach was lower compared to our proposed methods, especially for moderate to large effect sizes. The results were very similar across most error distributions, as is evident from Figs. S1 and S2, except for some power loss in case of errors from a Cauchy or a Pareto(1) distribution (Fig. S3). For the latter settings, the simulation study also revealed that there might be some computational problems when conducting the permutation-based tests for small balanced group sizes, and when using the ANOVA-type test with substantially unbalanced groups, due to degenerate empirical variances in one of the subgroups. However, this problem was only present in very few simulation runs (permutation-based tests: 13 runs in the type I error simulations, and 7 and 1 runs in the two power settings, respectively; ATS: 1 run in each setting). The PIM test was not affected by these problems, yet it was again inferior to the other approaches in terms of power. However, for unbalanced Pareto(1) scenarios, the PIM approach outperformed the NPC-based tests. Apart from that, interestingly, for unbalanced settings, the empirical power values of all tests under consideration did not change substantially as the group allocation ratios became more extreme, despite the fact that the total sample sizes were thus increased (Fig. S2).
Moreover, in order to explore whether the aforementioned findings may depend on the particular choices of the test statistics and the alternatives, respectively, we conducted a number of further sensitivity analyses. Firstly, all simulations were repeated using the Fisher combination method, but with the Anderson–Darling type test being replaced by the Wilcoxon-Mann-Whitney (WMW) test [14, 28]. To this end, we used the corresponding function rank.two.samples in the rankFD package [12]. The underlying rationale for using the WMW test was to examine whether or not the aforementioned power discrepancies between the rank- and permutation-based approaches may be at least partially explained by the different structures of the respective test statistics and effect measures. Overall, the results were very similar to the Anderson–Darling-based combination tests, with small gains in power in some scenarios (Table S1, Figs. S1–S3). However, it should be mentioned that computational problems were present in up to 5–10% of the simulation runs.
Secondly, in order to compare the methods under consideration for alternatives other than location shifts, we modified the data generation process in the following way: Both group means were set to the same value, but the variances of the outcome in groups 1 and 2 were specified to be unequal, with a varying variance ratio. All other simulation parameters that have been described at the beginning of this section were left unchanged. For ease of presentation, only two sample size scenarios with normal and lognormal errors were considered. The results are summarized in Fig. S4. The Anderson–Darling combination tests clearly outperformed the PIM approach, which, in turn, was more powerful than the rank-based test. Hence, the latter method should not be used in settings where scale alternatives are of interest. Observe that the Fisher combination of WMW tests would have been a suboptimal choice as well, because like the ATS approach, the WMW test is based on the so-called relative effect and, therefore, lacks power for scale alternatives.
- 1.
Simulate covariates, with location and scale parameters given by the (group-averaged) empirical mean and variance of “door-to-arterial-puncture time” (as defined at the beginning of Sect. 3).
- 2.
Calculate the category probabilities of the ordinal outcome, with prespecified values for the model parameters.
- 3.
Finally, the realization of the outcome was sampled from the corresponding probability distribution.
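The three steps above can be sketched as follows, here using a proportional odds data-generating model; all parameter values and thresholds in this Python sketch are hypothetical placeholders, as the actual specifications are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_po_outcome(x, group, beta_x=0.02, beta_g=0.5,
                        thresholds=(-1.0, 0.0, 1.0, 2.0)):
    """Draw ordinal outcomes (categories 0, ..., K-1) from a
    proportional odds model, P(Y <= k | x, g) = expit(theta_k - eta)
    with linear predictor eta = beta_x * x + beta_g * g. All parameter
    values here are hypothetical placeholders."""
    eta = (beta_x * np.asarray(x, dtype=float)
           + beta_g * np.asarray(group, dtype=float))
    # step 2: cumulative probabilities, one column per threshold theta_k
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(thresholds) - eta[:, None])))
    # step 3: invert the cumulative distribution with one uniform draw
    u = rng.uniform(size=len(eta))
    return (u[:, None] > cum).sum(axis=1)
```

Here, step 1 would supply the covariate vector `x` (e.g., drawn to match the empirical mean and variance of door-to-arterial-puncture time); a positive `beta_g` makes higher outcome categories more likely in group 1.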
It should be noted that the probability distribution in step 3 depends on the covariate, which nevertheless has not been stated explicitly for the sake of notational simplicity. The resulting empirical type I error and power rates of the PO-based test and the aforementioned competitors are reported in Table S2. Note that for the PIM and PO tests, the original covariate instead of its categorized counterpart was included in the model, because using the latter led to computational errors. Obviously, the permutation approaches, as well as the rank-based ANOVA-type test, outperformed their competitors in terms of power, while maintaining the prespecified 5% level. It has to be mentioned, however, that computational problems due to degenerate variances were present in a considerable number of simulation runs (2–5%).
4 Real-Life Data Example
In order to illustrate the different approaches under consideration, we analyzed the data from the SIESTA trial that has been mentioned in Sect. 3. The outcome variable was the modified Rankin Scale (mRS) at 3 months post intervention, and the age at baseline was considered as the (continuous) covariate. Each patient had been randomly assigned to either conscious sedation or general anesthesia. Firstly, the direct combination method yielded the combined Anderson–Darling type statistic and the corresponding permutation p value. The Fisher combination p value was very similar. However, the pseudo-rank-based ANOVA-type approach yielded a somewhat smaller p value. Summing up, the results might point to some gain in power when using the ANOVA-type statistic. For the sake of completeness, we also conducted a test for the group indicator in a PIM model with the dichotomized covariate age, as well as the covariate-group interaction, as additional explanatory variables. The resulting p value was 0.5513, which is also in line with the findings from our simulation study (see Sect. 3).
5 Discussion
In the present manuscript, we have considered two different nonparametric approaches for comparing an ordinal outcome variable between two groups while adjusting for a continuous covariate. In contrast to existing methods, we did not incorporate the covariate directly, but dichotomized it by applying a median split. In applied research, covariates are frequently categorized, where the choice of the categories is either guided by subject-matter expertise or based on certain empirical quantiles. Hence, our proposed method can be regarded as representative of what is frequently done in practice. The simulation results showed that the type I error rates were neither inflated nor deflated. With respect to power, the pseudo-rank-based ANOVA-type statistic outperformed the permutation-based approaches for location-shift alternatives. However, since multiple aspects should be considered when deciding for or against a particular statistical method, we would like to briefly discuss further advantages and drawbacks.
Firstly, in addition to the gain in power at least for location-shift alternatives, another argument in favor of the ANOVA-type approach is that it provides a corresponding effect measure, the so-called relative effect, which can be estimated from the data. In fact, the ANOVA-type statistic is based on these estimated relative effects [4]. By contrast, the permutation-based methods are designed for testing hypotheses rather than for the purpose of estimation. Of course, one could calculate the estimated relative effects in addition to the permutation p value, yet with the drawback of introducing some discrepancy between what is tested and what is considered as the effect measure.
Secondly, on the other hand, although the relative effects are straightforward to interpret, the ANOVA-type approach might be somewhat more difficult to apply insofar as there are several options for specifying the hypotheses, the test statistics, and the ranks that are used. Although the rankFD package is a very convenient implementation in R, it might be easier to understand exactly what is going on when using the permutation-based approaches. Apart from that, we have to acknowledge that our empirical results, like those of any simulation study, only provide evidence for some specific settings. Although the range of sample size scenarios and distributional assumptions is quite broad, we would like to emphasize that we only considered shift effects. However, as demonstrated in Sect. 3, considering, for example, scale alternatives might yield very different results in terms of power. Hence, we recommend conducting further simulations that appropriately reflect the particular setting of interest, including various types of alternatives, before actually applying one or the other method. Moreover, especially in case of very small sample sizes, we conjecture that the permutation-based approaches might be superior to the pseudo-rank-based method, because the former are exact in finite samples. Apart from that, despite the somewhat suboptimal performance of the PIM-based tests in our simulations, it should be emphasized that the PIM model clearly warrants further examination in future research on analysis methods for ordinal outcomes, due to its attractiveness in terms of the broad range of potential applications. Likewise, it might be worthwhile to consider employing a proportional odds model, at least in particular settings where the underlying assumptions are tenable, due to the straightforward interpretation of the results.
Finally, we would like to emphasize that the approaches under consideration can be easily applied to settings with multiple (categorical or categorized) covariates, too. Moreover, we would like to briefly sketch how the permutation methods that have been proposed in the present manuscript can be extended to the case of multivariate outcomes. For example, in the SIESTA trial (see Sect. 4), it would be of interest to compare the mRS, as well as the National Institute of Health Stroke Scale (NIHSS), between the two treatment groups. Note that standard parametric tests (e.g., Wilks’ Lambda) rely, in particular, on the assumption of multivariate normality, which might be restrictive and is even more difficult to justify than univariate normality. Therefore, using the nonparametric combination (NPC) method could be an appealing alternative. However, as the number of subgroups and/or dimensions increases, the permutation-based method becomes increasingly demanding with respect to computational resources. In addition to that, a thorough empirical examination of the properties of the resulting tests has to be carried out in future work, in order to ensure that this approach can be safely used in applied research.