
Case study: Python for the chi-square independence test

A team member examines an annual report that shows the number of resin batches produced by each of four reactors during each of three shifts. She performs a chi-square independence test to determine whether the reactor and shift are associated.

Since the p-value is less than 0.05, she rejects the null hypothesis and concludes that there is an association between the reactor and shift variables.

A chi-square independence test checks whether the row and column variables of a contingency table are associated. The table has R rows and C columns: each observation carries two variables (V1 and V2) and is classified into exactly one of R mutually exclusive categories for V1 and one of C mutually exclusive categories for V2.
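As an illustration of how such a table arises, the sketch below cross-tabulates a handful of raw records with pandas. The records, the column names reactor and shift, and the variable names are hypothetical and are not the case-study data; they only show how each observation falls into one row category and one column category.

```python
import pandas as pd

# Hypothetical raw records, one row per batch (NOT the case-study data)
records = pd.DataFrame({
    "reactor": [1, 1, 2, 2, 2, 3],
    "shift": ["1st", "2nd", "1st", "1st", "3rd", "2nd"],
})

# Rows are the categories of V1 (reactor), columns the categories of V2 (shift);
# each cell counts the observations that fall into that combination
table = pd.crosstab(records["reactor"], records["shift"])
print(table)
```

Every record contributes to exactly one cell, so the cell counts add up to the number of observations.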

The chi-square independence test is defined as follows: [21]

H0: The row and column variables of the two-way table are independent

Ha: The row and column variables of the two-way table are not independent

Test statistic:

$$T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

$T$ = the test statistic

$r$ = the number of rows in the contingency table

$c$ = the number of columns in the contingency table

$O_{ij}$ = the observed frequency of the ith row and jth column

$E_{ij}$ = the expected frequency of the ith row and jth column

Critical region: $T > \chi^2_{1-\alpha,\,(r-1)(c-1)}$

Where:

$\chi^2_{1-\alpha,\,(r-1)(c-1)}$ = the critical value of the chi-square distribution with $(r-1)(c-1)$ degrees of freedom at significance level $\alpha$
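To make the definitions concrete, here is a minimal sketch that evaluates the test statistic and the critical value by hand with NumPy, using the batch counts from the reactor table in this case study. It assumes the usual expected frequency under independence, E_ij = (row total × column total) / N, and a significance level of 0.05; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed batch counts: rows are reactors 1-4, columns are shifts 1-3
observed = np.array([
    [119, 117, 119],
    [192, 117, 78],
    [88, 98, 83],
    [117, 114, 114],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (4, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Assumed form of the expected frequency under independence
expected = row_totals * col_totals / grand_total

# T = sum over all cells of (O_ij - E_ij)^2 / E_ij
T = ((observed - expected) ** 2 / expected).sum()

r, c = observed.shape
dof = (r - 1) * (c - 1)

alpha = 0.05  # assumed significance level
critical_value = stats.chi2.ppf(1 - alpha, dof)  # chi-square quantile at 1 - alpha
p_value = stats.chi2.sf(T, dof)                  # upper-tail probability of T

print(f"T = {T:.4f}, degrees of freedom = {dof}")
print(f"critical value = {critical_value:.4f}")
print(f"p-value = {p_value:.3e}")
print("reject H0" if T > critical_value else "fail to reject H0")
```

Comparing T with the critical value and comparing the p-value with alpha are two views of the same decision rule.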

The chi-square independence test can be performed in Python with the following steps.

Step 1. Import libraries.

In [1]:

from scipy import stats

import pandas as pd

Step 2. Load and examine the dataset.

In [2]:

df = pd.read_excel('case chi square table.xlsx')

In [3]:

print(df)

   Reactor  1st shift  2nd shift  3rd shift
0        1        119        117        119
1        2        192        117         78
2        3         88         98         83
3        4        117        114        114

The dataset has four columns. The first column holds the reactor labels. The remaining three columns hold the number of batches each reactor produced during the first, second, and third shifts.
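If the Excel workbook is not at hand, the same table can be typed in directly. The sketch below is an alternative to the read_excel call, not the author's method; it assumes the column names shown in the printed output and the counts listed there.

```python
import pandas as pd

# Same table as the printed output, entered by hand instead of read from Excel
df = pd.DataFrame({
    "Reactor": [1, 2, 3, 4],
    "1st shift": [119, 192, 88, 117],
    "2nd shift": [117, 117, 98, 114],
    "3rd shift": [119, 78, 83, 114],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # all four columns should be integer typed
```

Checking the shape and dtypes up front catches stray text cells or missing values before the counts are passed to a statistical test.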

Step 3. Set the Reactor column as the index column.

In [4]:

df = df.set_index(['Reactor'])

print(df)

         1st shift  2nd shift  3rd shift
Reactor
1              119        117        119
2              192        117         78
3               88         98         83
4              117        114        114
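One way to see why Step 3 matters: while Reactor is an ordinary column, the labels 1 to 4 would be treated as counts by any function that consumes the whole table. The sketch below rebuilds the table by hand (an assumption, in place of the Excel file) and compares the numeric array before and after setting the index, then prints the marginal totals that the expected frequencies are based on.

```python
import pandas as pd

df = pd.DataFrame({
    "Reactor": [1, 2, 3, 4],
    "1st shift": [119, 192, 88, 117],
    "2nd shift": [117, 117, 98, 114],
    "3rd shift": [119, 78, 83, 114],
})

# Before: the reactor labels are still part of the data
print(df.to_numpy().shape)

# After: only the batch counts remain as data; labels move to the index
indexed = df.set_index("Reactor")
print(indexed.to_numpy().shape)

# Marginal totals of the contingency table
row_totals = indexed.sum(axis=1)  # batches per reactor
col_totals = indexed.sum(axis=0)  # batches per shift
print(row_totals)
print(col_totals)
```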

Step 4. Perform chi-square independence test.

In [5]:

stats.chi2_contingency(df)

Out[5]:

(36.10950415733108,
 2.624773867337985e-06,
 6,
 array([[135.08849558, 116.76253687, 103.14896755],
        [147.26548673, 127.28761062, 112.44690265],
        [102.36283186,  88.47640118,  78.16076696],
        [131.28318584, 113.47345133, 100.24336283]]))

The stats.chi2_contingency() function performs the chi-square test of independence on a contingency table. It returns four components: the chi-square test statistic, the p-value of the test, the degrees of freedom, and a NumPy array of the expected frequencies, computed from the marginal sums of the table. Here the test statistic is 36.11 with (4 - 1) * (3 - 1) = 6 degrees of freedom, and the p-value is 2.62e-6.
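For readability, the four returned components can be unpacked into named variables and the p-value compared against the significance level. This is a minimal sketch that assumes a significance level of 0.05 and re-enters the counts as a NumPy array; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Batch counts: rows are reactors 1-4, columns are shifts 1-3
observed = np.array([
    [119, 117, 119],
    [192, 117, 78],
    [88, 98, 83],
    [117, 114, 114],
])

# Unpack the four components of the result
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # assumed significance level

print(f"chi-square statistic = {chi2:.4f}")
print(f"degrees of freedom   = {dof}")
print(f"p-value              = {p_value:.3e}")

if p_value < alpha:
    print("Reject H0: reactor and shift appear to be associated.")
else:
    print("Fail to reject H0: no evidence of an association.")
```

Because the p-value is far below 0.05, the script takes the first branch, which matches the team member's conclusion.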