Rank comparison: two independent samples

Statsmodels provides statistics and tests for the probability that x1 has larger values than x2. This measures are based on ordinal comparisons using ranks.

Define p as the probability that a random draw from the population of the first sample has a larger value than a random draw from the population of the second sample, specifically

p = P(x1 > x2) + 0.5 * P(x1 = x2)

This is a measure underlying Wilcoxon-Mann-Whitney’s U test, Fligner-Policello test and Brunner-Munzel test. Inference is based on the asymptotic distribution of the Brunner-Munzel test. The half probability for ties corresponds to the use of midranks and makes it valid for discrete variables.

The Null hypothesis for stochastic equality is p = 0.5, which corresponds to the Brunner-Munzel test.

This notebook provides a brief overview of the statistics provided in statsmodels.

[1]:
import numpy as np

from statsmodels.stats.nonparametric import (
    rank_compare_2indep, rank_compare_2ordinal, prob_larger_continuous,
    cohensd2problarger)

Example

The main function is rank_compare_2indep which computes the Brunner-Munzel test and returns a RankCompareResult instance with additional methods.

The data for the example are taken from Munzel and Hauschke 2003 and is given in frequency counts. We need to expand it to arrays of observations to be able to use it with rank_compare_2indep. See below for a function that directly takes frequency counts. The labels or levels are treated as ordinal, the specific values are irrelevant as long as they define an order (>, =).

[2]:
levels = [-2, -1, 0, 1, 2]
new = [24, 37, 21, 19, 6]
active = [11, 51, 22, 21, 7]

x1 = np.repeat(levels, new)
x2 = np.repeat(levels, active)
np.bincount(x1 + 2), np.bincount(x2 + 2)
[2]:
(array([24, 37, 21, 19,  6]), array([11, 51, 22, 21,  7]))
[3]:
res = rank_compare_2indep(x1, x2) #, use_t=False)
res
[3]:
<class 'statsmodels.stats.nonparametric.RankCompareResult'>
statistic = np.float64(-1.1757561456581607)
pvalue = np.float64(0.24106066495471642)
s1 = np.float64(1164.3327014635863)
s2 = np.float64(701.9050836550837)
var1 = np.float64(0.09281989010392111)
var2 = np.float64(0.06130710836361985)
var = np.float64(0.3098544504968025)
var_prob = np.float64(0.0014148605045516095)
nobs1 = 107
nobs2 = 112
nobs = 219
mean1 = np.float64(105.04672897196262)
mean2 = np.float64(114.73214285714286)
prob1 = np.float64(0.4557743658210948)
prob2 = np.float64(0.5442256341789052)
somersd1 = np.float64(-0.08845126835781036)
somersd2 = np.float64(0.08845126835781048)
df = np.float64(204.29842398679557)
use_t = True
tuple = (np.float64(-1.1757561456581607), np.float64(0.24106066495471642))

The methods of the results instance are

[4]:
[i for i in dir(res) if not i.startswith("_") and callable(getattr(res, i))]
[4]:
['conf_int',
 'confint_lintransf',
 'effectsize_normal',
 'summary',
 'test_prob_superior',
 'tost_prob_superior']
[5]:
print(res.summary())  # returns SimpleTable
                  Probability sample 1 is stochastically larger
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
prob(x1>x2) c0     0.4558      0.038     -1.176      0.241       0.382       0.530
==================================================================================
[6]:
ci = res.conf_int()
ci
[6]:
(np.float64(0.3816117144128266), np.float64(0.529937017229363))
[7]:
res.test_prob_superior()
[7]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-1.1757561456581602)
pvalue = np.float64(0.24106066495471665)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-1.1757561456581602), np.float64(0.24106066495471665))
[ ]:

One-sided tests, superiority and noninferiority tests

The hypothesis tests functions have a alternative keyword to specify one-sided tests and a value keyword to specify nonzero or nonequality hypothesis. Both keywords together can be used for noninferiority tests or superiority tests with a margin.

A noninferiority test specifies a margin and alternative so we can test the hypothesis that a new treatment is almost as good or better than a reference treatment.

The general one-sided hypothesis is

H0: p = p0
versus
HA: p > p0 alternative is that effect is larger than specified null value p0
HA: p < p0 alternative is that effect is smaller than specified null value p0

Note: The null hypothesis of a one-sided test is often specified as a weak inequality. However, p-values are usually derived assuming a null hypothesis at the boundary value. The boundary value is in most cases the worst case for the p-value, points in the interior of the null hypothesis usually have larger p-values, and the test is conservative in those cases.

Most two sample hypothesis test in statsmodels allow for a “nonzero” null value, that is a null value that does not require that the effects in the two samples is the same. Some hypothesis tests, like Fisher’s exact test for contingency tables, and most k-samples anova-type tests only allow a null hypothesis that the effect is the same in all samples.

The null value p0 for hypothesis that there is no difference between samples, depends on the effect statistic, The null value for a difference and a correlation is zero, The null value for a ratio is one, and, the null value for the stochastic superiority probability is 0.5.

Noninferiority and superiority tests are just special cases of one-sided hypothesis tests that allow nonzero null hypotheses. TOST equivalence tests are a combination of two one-sided tests with nonzero null values.

Note, we are using “superior” now in two different meanings. Superior and noninferior in the tests refer to whether the treatment effect is better than the control. The effect measure in rank_compare is the probability based on stochastic superiority of sample 1 compared to sample 2, but stochastic superiority can be either good or bad.

Noninferiority: Smaller values are better

If having lower values is better, for example if the variable of interest is disease occurencies, then a non-inferiority test specifies a threshold larger than equality with an alternative that the parameter is less than the threshold.

In the case of stochastic superiority measure, equality of the two sample implies a probability equal to 0.5. If we specify an inferiority margin of, say 0.6, then the null and alternative hypothesis are

H0: p >= 0.6 (null for inference based on H0: p = 0.6)
HA: p < 0.6

If we reject the null hypothesis, then our data supports that the treatment is noninferior to the control.

Noninferiority: Larger values are better

In cases where having larger values is better, e.g. a skill or health measures, non-inferiority means that the treatment has values almost as high or higher than the control. This defines the alternative hypothesis. Assuming p0 is the non-inferiority threshold, e.g p0 = 0.4, then null and alternative hypotheses are

H0: p <= p0 (p0 < 0.5)
HA: p > p0

If the null hypothesis is rejected, then we have evidence that the treatment (sample 1) is noninferior to reference (sample 2), that is the superiority probability is larger than p0.

Supperiority tests

Suppertiority tests are usually defined as one-sided alternative with equality as the null and the better inequality as the alternative. However, superiority can also me defined with a margin, in which case the treatment has to be better by a non-neglibible amount specified by the margin.

This means the test is the same one-sided tests as above with p0 either equal to 0.5, or p0 a value above 0.5 if larger is better and below 0.5 if smaller is better.

Example: noninferiority smaller is better

Suppose our noninferiority threshold is p0 = 0.55. The one-sided test with alternative “smaller” has a pvalue around 0.0065 and we reject the null hypothesis at an alpha of 0.05. The data provides evidence that the treatment (sample 1) is noninferior to the control (sample2), that is we have evidence that the treatment is at most 5 percentage points worse than the control.

[8]:
res.test_prob_superior(value=0.55, alternative="smaller")
[8]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-2.505026112598482)
pvalue = np.float64(0.006512753894336686)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-2.505026112598482), np.float64(0.006512753894336686))

Example: noninferiority larger is better

Now consider the case when having larger values is better and the noninferiority threshold is 0.45. The one-sided test has a p-value of 0.44 and we cannot reject the null hypothesis. Therefore, we do not have evidence for the treatment to be at most 5 percentage points worse than the control.

[9]:
res.test_prob_superior(value=0.45, alternative="larger")
[9]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(0.15351382128216023)
pvalue = np.float64(0.43907230923278956)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(0.15351382128216023), np.float64(0.43907230923278956))

Equivalence Test TOST

In equivalence test we want to show that the treatment has approximately the same effect as the control, or that two treatments have approximately the same effect. Being equivalent is defined by a lower and upper equivalence margin (low and upp). If the effect measure is inside this interval, then the treatments are equivalent.

The null and alternative hypothesis are

H0: p <= p_low or p >= p_upp
HA: p_low < p < p_upp

In this case the null hypothesis is that the effect measure is outside of the equivalence interval. If we reject the null hypothesis, then the data provides evidence that treatment and control are equivalent.

In the TOST procedure we evaluate two one-sided tests, one for the null hypothesis that the effect is equal or below the lower threshold and one for the null hypothesis that the effect is equal or above the upper threshold. If we reject both tests, then the data provides evidence that the effect is inside the equivalence interval. The p-value of the tost method will be the maximum of the pvalues of the two test.

Suppose our equivalence margins are 0.45 and 0.55, then the p-value of the equivalence test is 0.43, and we cannot reject the null hypothesis that the two treatments are not equivalent, i.e. the data does not provide support for equivalence.

[10]:
res.tost_prob_superior(low=0.45, upp=0.55)
[10]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(0.15351382128216023)
pvalue = np.float64(0.43907230923278956)
results_larger = <class 'statsmodels.stats.base.HolderTuple'>
    statistic = np.float64(0.15351382128216023)
    pvalue = np.float64(0.43907230923278956)
    df = np.float64(204.29842398679557)
    distribution = 't'
    tuple = (np.float64(0.15351382128216023), np.float64(0.43907230923278956))
results_smaller = <class 'statsmodels.stats.base.HolderTuple'>
    statistic = np.float64(-2.505026112598482)
    pvalue = np.float64(0.006512753894336686)
    df = np.float64(204.29842398679557)
    distribution = 't'
    tuple = (np.float64(-2.505026112598482), np.float64(0.006512753894336686))
title = 'Equivalence test for Prob(x1 > x2) + 0.5 Prob(x1 = x2) '
tuple = (np.float64(0.15351382128216023), np.float64(0.43907230923278956))
[11]:
res.test_prob_superior(value=0.55, alternative="smaller")
[11]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-2.505026112598482)
pvalue = np.float64(0.006512753894336686)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-2.505026112598482), np.float64(0.006512753894336686))
[12]:
res.test_prob_superior(value=0.6, alternative="larger")
[12]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-3.834296079538801)
pvalue = np.float64(0.9999161566635063)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-3.834296079538801), np.float64(0.9999161566635063))
[13]:
res.test_prob_superior(value=0.529937, alternative="smaller")
[13]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-1.9716432456640076)
pvalue = np.float64(0.02500002636314127)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-1.9716432456640076), np.float64(0.02500002636314127))
[14]:
res.test_prob_superior(value=0.529937017, alternative="smaller")
[14]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-1.9716436976157954)
pvalue = np.float64(0.025000000351072277)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-1.9716436976157954), np.float64(0.025000000351072277))
[15]:
res.conf_int()
[15]:
(np.float64(0.3816117144128266), np.float64(0.529937017229363))

Aside: Burden of proof in hypothesis testing

Hypothesis tests have in general the following two properties

  • small samples favor the null hypothesis. Uncertainty in small samples is large and the hypothesis test has little power. The study in this case is underpowered and we often cannot reject the null because of large uncertainty. Non-rejection cannot be taken as evidence in favor of the null because the hypothesis test does not have much power to reject the null.

  • large samples favor the alternative hypothesis. In large samples uncertainty becomes small, and even small deviations from the null hypothesis will lead to rejection. In this case we can have a test result that shows an effect that is statistical significant but substantive irrelevant.

Noninferiority and equivalence tests solve both problems. The first problem, favoring the null in small samples, can be avoided by reversing null and alternative hypothesis. The alternative hypothesis should be the one we want to show, so the test does not just support the hypothesis of interest because the power is too small. The second problem, favoring the alternative in large samples, can be voided but specifying a margin in the hypothesis test, so that trivial deviations are still part of the null hypothesis. By using this, we are not favoring the alternative in large samples for irrelevant deviations of a point null hypothesis. Noninferiority tests want to show that a treatment is almost as good as the control. Equivalence test try to show that the statistics in two samples are approximately the same, in contrast to a point hypothesis that specifies that they are exactly the same.

Reversing the samples

In the literature, stochastic dominance in the sense of p not equal to 0.5 is often defined using a stochastically smaller measure instead of stochastically larger as defined in statsmodels.

The effect measure reverses the inequality and is then

p2 = P(x1 < x2) + 0.5 * P(x1 = x2)

This can be obtained in the statsmodels function by reversing the sequence of the sample, that is we can use (x2, x1) instead of (x1, x2). The two definitions, p, p2, are related by p2 = 1 - p.

The results instance of rank_compare shows both statistics, but the hypothesis tests and confidence interval are only provided for the stochastically larger measure.

[16]:
res = rank_compare_2indep(x2, x1) #, use_t=False)
res
[16]:
<class 'statsmodels.stats.nonparametric.RankCompareResult'>
statistic = np.float64(1.1757561456581607)
pvalue = np.float64(0.24106066495471642)
s1 = np.float64(701.9050836550837)
s2 = np.float64(1164.3327014635863)
var1 = np.float64(0.06130710836361985)
var2 = np.float64(0.09281989010392111)
var = np.float64(0.3098544504968025)
var_prob = np.float64(0.0014148605045516095)
nobs1 = 112
nobs2 = 107
nobs = 219
mean1 = np.float64(114.73214285714286)
mean2 = np.float64(105.04672897196262)
prob1 = np.float64(0.5442256341789052)
prob2 = np.float64(0.4557743658210948)
somersd1 = np.float64(0.08845126835781048)
somersd2 = np.float64(-0.08845126835781036)
df = np.float64(204.29842398679557)
use_t = True
tuple = (np.float64(1.1757561456581607), np.float64(0.24106066495471642))
[17]:
res.conf_int()
[17]:
(np.float64(0.470062982770637), np.float64(0.6183882855871734))
[18]:
st = res.summary()
st
[18]:
Probability sample 1 is stochastically larger
coef std err t P>|t| [0.025 0.975]
prob(x1>x2) c0 0.5442 0.038 1.176 0.241 0.470 0.618

Ordinal data

The rank based analysis assumes only that data is ordinal and uses only ordinal information. Furthermore, the definition in terms of midranks allows for discete data in analysis similar to Brunner-Munzel test.

Statsmodels provides some additional functions for the case when the data is discrete and has only a finite number of support points, i.e. is ordered categorical.

The data above was given as ordinal data, but we had to expand it to be able to use rank_compare_2indep. Instead we can use rank_compare_2ordinal directly with the frequency counts. The latter function mainly differs from the former by using more efficient computation for the special case. The results statistics will be the same.

The counts for treatment (“new”) and control (“active”) in increasing order of the underlying values from the above example are

[19]:
new, active
[19]:
([24, 37, 21, 19, 6], [11, 51, 22, 21, 7])
[20]:
res_o = rank_compare_2ordinal(new, active)
res_o.summary()
[20]:
Probability sample 1 is stochastically larger
coef std err t P>|t| [0.025 0.975]
prob(x1>x2) c0 0.4558 0.038 -1.176 0.241 0.382 0.530
[21]:
res_o
[21]:
<class 'statsmodels.stats.nonparametric.RankCompareResult'>
statistic = None
pvalue = None
s1 = None
s2 = None
var1 = np.float64(0.0919524144954732)
var2 = np.float64(0.06075972346751607)
var = np.float64(0.3098544504968023)
var_prob = np.float64(0.0014148605045516088)
nobs1 = np.int64(107)
nobs2 = np.int64(112)
nobs = np.int64(219)
mean1 = None
mean2 = None
prob1 = np.float64(0.4557743658210948)
prob2 = np.float64(0.5442256341789052)
somersd1 = np.float64(-0.08845126835781036)
somersd2 = np.float64(0.08845126835781048)
df = np.float64(204.29842398679557)
use_t = True
tuple = (None, None)
[22]:
res = rank_compare_2indep(x1, x2)
res
[22]:
<class 'statsmodels.stats.nonparametric.RankCompareResult'>
statistic = np.float64(-1.1757561456581607)
pvalue = np.float64(0.24106066495471642)
s1 = np.float64(1164.3327014635863)
s2 = np.float64(701.9050836550837)
var1 = np.float64(0.09281989010392111)
var2 = np.float64(0.06130710836361985)
var = np.float64(0.3098544504968025)
var_prob = np.float64(0.0014148605045516095)
nobs1 = 107
nobs2 = 112
nobs = 219
mean1 = np.float64(105.04672897196262)
mean2 = np.float64(114.73214285714286)
prob1 = np.float64(0.4557743658210948)
prob2 = np.float64(0.5442256341789052)
somersd1 = np.float64(-0.08845126835781036)
somersd2 = np.float64(0.08845126835781048)
df = np.float64(204.29842398679557)
use_t = True
tuple = (np.float64(-1.1757561456581607), np.float64(0.24106066495471642))
[23]:
res_o.test_prob_superior()
[23]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-1.1757561456581607)
pvalue = np.float64(0.24106066495471642)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-1.1757561456581607), np.float64(0.24106066495471642))
[24]:
res.test_prob_superior()
[24]:
<class 'statsmodels.stats.base.HolderTuple'>
statistic = np.float64(-1.1757561456581602)
pvalue = np.float64(0.24106066495471665)
df = np.float64(204.29842398679557)
distribution = 't'
tuple = (np.float64(-1.1757561456581602), np.float64(0.24106066495471665))
[25]:
res_o.conf_int(), res.conf_int()
[25]:
((np.float64(0.38161171441282665), np.float64(0.529937017229363)),
 (np.float64(0.3816117144128266), np.float64(0.529937017229363)))

Last update: Oct 03, 2024