Interval estimation, correlation measures, and statistical tests

We briefly covered interval estimation in Chapter 1, Introduction to SciPy, as an introductory example of SciPy's capabilities, through the bayes_mvs routine. Its syntax is very simple:

bayes_mvs(data, alpha=0.9)

It returns a tuple of three results, each of the form (center, (lower, upper)). The first result refers to the mean, the second to the variance, and the third to the standard deviation. All intervals are computed at the confidence level given by alpha, which defaults to 0.9.
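As a quick sketch of this, with a small made-up sample (the data values here are purely illustrative), we can unpack each result into the (center, (lower, upper)) form described above:

```python
import numpy as np
from scipy.stats import bayes_mvs

# Hypothetical sample data, for illustration only
data = np.array([2.1, 2.5, 1.9, 2.3, 2.7, 2.2, 2.4])

# One result per statistic: mean, variance, standard deviation
mean_est, var_est, std_est = bayes_mvs(data, alpha=0.9)

# Each result unpacks as (center, (lower, upper))
center, (lower, upper) = mean_est
print(center)           # point estimate of the mean
print((lower, upper))   # 90% interval for the mean
```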

We may use the linregress routine to compute the regression line of two-dimensional data x, or of two sets of one-dimensional data, x and y. We may also compute different correlation coefficients, with their corresponding p-values. Available are the Pearson correlation coefficient (pearsonr), Spearman's rank-order correlation (spearmanr), the point biserial correlation (pointbiserialr), and Kendall's tau for ordinal data (kendalltau). In all cases, the syntax is the same: they require either a two-dimensional array of data, or two one-dimensional arrays of data of the same length.
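A short sketch of this shared syntax, using hypothetical nearly-linear data, might look as follows:

```python
import numpy as np
from scipy.stats import linregress, pearsonr, spearmanr

# Hypothetical paired one-dimensional data of the same length
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Regression line: slope, intercept, correlation, p-value, standard error
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Correlation coefficients with their p-values, same calling convention
r, p_pearson = pearsonr(x, y)
rho, p_spearman = spearmanr(x, y)

print(slope, intercept)
print(r, rho)
```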

SciPy also has most of the best-known statistical tests and procedures: t-tests (ttest_1samp for one group of scores, ttest_ind for two independent samples of scores, or ttest_rel for two related samples of scores), Kolmogorov-Smirnov tests for goodness of fit (kstest, ks_2samp), one-way Chi-square test (chisquare), and many more.
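As a minimal sketch of two of these tests on synthetic data (the samples and seed here are assumptions made for illustration), one could write:

```python
import numpy as np
from scipy.stats import ttest_ind, kstest

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)   # sample from N(0, 1)
b = rng.normal(0.5, 1.0, 200)   # sample from N(0.5, 1)

# Two independent samples: do a and b share the same mean?
t_stat, p_ttest = ttest_ind(a, b)

# Goodness of fit: does a follow a standard normal distribution?
d_stat, p_ks = kstest(a, 'norm')

print(p_ttest, p_ks)
```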

Let us illustrate some of the routines of this module with a textbook example, based on Timothy Sturm's studies on control design.

Twenty-five right-handed individuals were asked to use their right hands to turn a knob that moved an indicator by screw action. There were two identical instruments: one with a right-hand thread, where the knob turned clockwise, and one with a left-hand thread, where the knob turned counter-clockwise. The following table gives the time in seconds each subject took to move the indicator a fixed distance:

Subject         1    2    3    4    5    6    7    8    9   10
Right thread  113  105  130  101  138  118   87  116   75   96
Left thread   137  105  133  108  115  170  103  145   78  107

Subject        11   12   13   14   15   16   17   18   19   20
Right thread  122  103  116  107  118  103  111  104  111   89
Left thread    84  148  147   87  166  146  123  135  112   93

Subject        21   22   23   24   25
Right thread   78  100   89   85   88
Left thread    76  116   78  101  123

A simple one-sample t-test may lead to a conclusion about whether right-handed people find right-hand threads easier to use. We will load the data into memory, as follows:

>>> import numpy
>>> data = numpy.array([[113,105,130,101,138,118,87,116,75,96, \
         122,103,116,107,118,103,111,104,111,89,78,100,89,85,88], \
         [137,105,133,108,115,170,103,145,78,107, \
         84,148,147,87,166,146,123,135,112,93,76,116,78,101,123]])

The difference between the two rows indicates, for each subject, which knob was turned faster and by how much time. We can obtain that information easily and perform some basic statistical analysis on it. We will start by computing the mean, the standard deviation, and a histogram with 10 bins:

>>> dataDiff = data[1,:]-data[0,:]
>>> dataDiff.mean(), dataDiff.std()

The output is shown as:

(13.32, 22.472596645692729)

Let's plot the histogram by issuing the following set of commands:

>>> import matplotlib.pyplot as plt
>>> plt.hist(dataDiff)
>>> plt.show()

This produces the following histogram:

(Histogram of the 25 time differences stored in dataDiff)

In light of this histogram, it is not far-fetched to assume a normal distribution. If we assume that this is a proper simple random sample, the use of t-statistics is justified. We would like to show that it takes longer to turn the left thread than the right, so we contrast the mean of dataDiff against a mean of zero (which would indicate that both threads take the same time to turn).

The one-sample t-statistic and the p-value for the two-sided test are computed with a single command, as follows:

>>> from scipy.stats import ttest_1samp
>>> t_stat,p_value=ttest_1samp(dataDiff,0.0)

The p-value for the one-sided test is then calculated:

>>> print (p_value/2.0)

The output is shown as follows:

0.00389575522747

Note that this p-value is much smaller than either of the usual significance thresholds, alpha = 0.05 or alpha = 0.1. We thus have enough evidence to support the claim that right-hand threads take less time to turn than left-hand threads.
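As a cross-check within the same section's toolkit, one could also compute an interval estimate for the mean of dataDiff with bayes_mvs; if the 90% interval for the mean lies entirely above zero, it corroborates the t-test's conclusion. This is a sketch, not part of Sturm's original analysis:

```python
import numpy as np
from scipy.stats import bayes_mvs

right = np.array([113,105,130,101,138,118,87,116,75,96,
                  122,103,116,107,118,103,111,104,111,89,
                  78,100,89,85,88])
left = np.array([137,105,133,108,115,170,103,145,78,107,
                 84,148,147,87,166,146,123,135,112,93,
                 76,116,78,101,123])
dataDiff = left - right

# 90% interval estimate for the mean of the differences
mean_est, _, _ = bayes_mvs(dataDiff, alpha=0.9)
center, (lower, upper) = mean_est
print(center, (lower, upper))
```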