Interval estimation, correlation measures, and statistical tests

We briefly covered interval estimation in Chapter 1, Introduction to SciPy, as an introductory example of SciPy's capabilities, through the bayes_mvs routine. Its syntax is very simple:

bayes_mvs(data, alpha=0.9)

It returns a tuple of three results, each of the form (center, (lower, upper)). The first result refers to the mean, the second to the variance, and the third to the standard deviation. All intervals are computed at the confidence level given by alpha, which defaults to 0.9.
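As a quick sketch of this, with a small made-up sample (the data values here are purely illustrative), we can unpack each result into the (center, (lower, upper)) form described above:

```python
import numpy as np
from scipy.stats import bayes_mvs

# Hypothetical sample data, for illustration only
data = np.array([2.1, 2.5, 1.9, 2.3, 2.7, 2.2, 2.4])

# One result per statistic: mean, variance, standard deviation
mean_est, var_est, std_est = bayes_mvs(data, alpha=0.9)

# Each result unpacks as (center, (lower, upper))
center, (lower, upper) = mean_est
print(center)           # point estimate of the mean
print((lower, upper))   # 90% interval for the mean
```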

We may use the linregress routine to compute the regression line of two-dimensional data x, or of two sets of one-dimensional data, x and y. We may also compute different correlation coefficients, with their corresponding p-values. Available are the Pearson correlation coefficient (pearsonr), Spearman's rank-order correlation (spearmanr), the point biserial correlation (pointbiserialr), and Kendall's tau for ordinal data (kendalltau). In all cases, the syntax is the same: they require either a two-dimensional array of data, or two one-dimensional arrays of data of the same length.
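A short sketch of this shared syntax, using hypothetical nearly-linear data, might look as follows:

```python
import numpy as np
from scipy.stats import linregress, pearsonr, spearmanr

# Hypothetical paired one-dimensional data of the same length
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Regression line: slope, intercept, correlation, p-value, standard error
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Correlation coefficients with their p-values, same calling convention
r, p_pearson = pearsonr(x, y)
rho, p_spearman = spearmanr(x, y)

print(slope, intercept)
print(r, rho)
```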

SciPy also has most of the best-known statistical tests and procedures: t-tests (ttest_1samp for one group of scores, ttest_ind for two independent samples of scores, or ttest_rel for two related samples of scores), Kolmogorov-Smirnov tests for goodness of fit (kstest, ks_2samp), one-way Chi-square test (chisquare), and many more.
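As a minimal sketch of two of these tests on synthetic data (the samples and seed here are assumptions made for illustration), one could write:

```python
import numpy as np
from scipy.stats import ttest_ind, kstest

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)   # sample from N(0, 1)
b = rng.normal(0.5, 1.0, 200)   # sample from N(0.5, 1)

# Two independent samples: do a and b share the same mean?
t_stat, p_ttest = ttest_ind(a, b)

# Goodness of fit: does a follow a standard normal distribution?
d_stat, p_ks = kstest(a, 'norm')

print(p_ttest, p_ks)
```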

Let us illustrate some of the routines of this module with a textbook example, based on Timothy Sturm's studies on control design.

Twenty-five right-handed individuals were asked to use their right hands to turn a knob that moved an indicator by screw action. There were two identical instruments: one with a right-hand thread, where the knob turned clockwise, and one with a left-hand thread, where the knob turned counter-clockwise. The following table gives the time in seconds each subject took to move the indicator a fixed distance:

Subject         1    2    3    4    5    6    7    8    9   10
Right thread  113  105  130  101  138  118   87  116   75   96
Left thread   137  105  133  108  115  170  103  145   78  107

Subject        11   12   13   14   15   16   17   18   19   20
Right thread  122  103  116  107  118  103  111  104  111   89
Left thread    84  148  147   87  166  146  123  135  112   93

Subject        21   22   23   24   25
Right thread   78  100   89   85   88
Left thread    76  116   78  101  123

A simple one-sample t-test may lead to a conclusion about whether right-handed people find right-hand threads easier to use. We will load the data into memory, as follows:

>>> import numpy
>>> data = numpy.array([[113,105,130,101,138,118,87,116,75,96, \
         122,103,116,107,118,103,111,104,111,89,78,100,89,85,88], \
         [137,105,133,108,115,170,103,145,78,107, \
         84,148,147,87,166,146,123,135,112,93,76,116,78,101,123]])

The difference between the two rows indicates, for each subject, which knob was turned faster and by how much time. We can obtain that information easily and perform some basic statistical analysis on it. We will start by computing the mean, the standard deviation, and a histogram with 10 bins:

>>> dataDiff = data[1,:]-data[0,:]
>>> dataDiff.mean(), dataDiff.std()

The output is shown as:

(13.32, 22.472596645692729)

Let's plot the histogram by issuing the following set of commands:

>>> import matplotlib.pyplot as plt
>>> plt.hist(dataDiff)
>>> plt.show()

This produces the following histogram:

(Histogram of the 25 time differences stored in dataDiff)

In light of this histogram, it is not far-fetched to assume a normal distribution. If we assume that this is a proper simple random sample, the use of t-statistics is justified. We would like to show that it takes longer to turn the left thread than the right, so we contrast the mean of dataDiff against a mean of zero (which would indicate that both threads take the same time to turn).

The one-sample t-statistic and the p-value for the two-sided test are computed with a single command, as follows:

>>> from scipy.stats import ttest_1samp
>>> t_stat,p_value=ttest_1samp(dataDiff,0.0)

The p-value for the one-sided test is then calculated:

>>> print (p_value/2.0)

The output is shown as follows:

0.00389575522747

Note that this p-value is much smaller than either of the usual significance thresholds, alpha = 0.05 or alpha = 0.1. We thus have enough evidence to support the claim that right-hand threads take less time to turn than left-hand threads.
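As a cross-check within the same section's toolkit, one could also compute an interval estimate for the mean of dataDiff with bayes_mvs; if the 90% interval for the mean lies entirely above zero, it corroborates the t-test's conclusion. This is a sketch, not part of Sturm's original analysis:

```python
import numpy as np
from scipy.stats import bayes_mvs

right = np.array([113,105,130,101,138,118,87,116,75,96,
                  122,103,116,107,118,103,111,104,111,89,
                  78,100,89,85,88])
left = np.array([137,105,133,108,115,170,103,145,78,107,
                 84,148,147,87,166,146,123,135,112,93,
                 76,116,78,101,123])
dataDiff = left - right

# 90% interval estimate for the mean of the differences
mean_est, _, _ = bayes_mvs(dataDiff, alpha=0.9)
center, (lower, upper) = mean_est
print(center, (lower, upper))
```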