Further Statistical Questions about Categorical Data

Just generating and looking at frequency tables or crosstabs such as those above is not necessarily the last step. You may also wish to ask various questions such as:

Are the observations distributed evenly between categories or cells (that is, are there roughly equal numbers in each cell), or do some cells have significantly higher or lower frequencies than others? For instance, in Figure 15.2 Crosstab example of relating two categorical variables above you can see that the numbers in the six cells are not even. Clearly, the 56 observations in the Small/Freeware cell outnumber the 22 in Small/Premium. However, in other cases these differences may not be as obvious, and you may wish to seek further statistical evidence of differences.
Is the distribution of observations between categories or cells different from a benchmark distribution of your choosing? For instance, in your industry a customer base of 50% big companies, 30% medium companies and 20% small companies may be considered usual. In Figure 15.1 Example of frequency analysis of categorical data in SAS we see that your actual distribution is 43%, 29% and 28% respectively. You may wish to test whether your customer base distribution differs statistically significantly from the industry benchmark.
If there are significant differences, where specifically do they lie? In the example in the previous bullet point, we may find that there are deviations from the industry benchmark, but that these arise specifically in the small and big customer categories (i.e. the 29% in the medium category is not statistically significantly different from the industry benchmark of 30%).
Is there evidence that one of the categorical variables may depend on another? This is a special case, where you believe the allocation of observations across the categories of one variable is partly affected by each observation’s category on another variable. For instance, in the main book example you may believe that a company’s choice to be a freeware versus premium customer partly depends on its size (perhaps because larger companies can more easily afford the premium version).
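
To make the benchmark questions above more concrete, the following is a minimal Python sketch of a Pearson chi-square goodness-of-fit test. It is an illustration only, not the SAS procedure used elsewhere in the book. The percentages (43%, 29%, 28% observed against a 50%, 30%, 20% benchmark) come from the text, but the sample size of 100 is an assumption made purely for illustration, and the function names are hypothetical. The per-category residuals give a rough indication of where any deviation from the benchmark arises.

```python
import math


def chi_square_gof(observed, benchmark_props):
    """Return the Pearson chi-square statistic and per-category residuals.

    observed        -- list of observed counts per category
    benchmark_props -- list of benchmark proportions (should sum to 1)
    """
    n = sum(observed)
    expected = [n * p for p in benchmark_props]
    # Pearson residual: (observed - expected) / sqrt(expected)
    residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
    # The chi-square statistic is the sum of squared Pearson residuals
    statistic = sum(r * r for r in residuals)
    return statistic, residuals


def p_value_df2(statistic):
    """Upper-tail chi-square probability, valid ONLY for 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# ASSUMPTION: a sample of 100 customers, chosen only so that the counts
# match the percentages quoted in the text (big, medium, small).
observed = [43, 29, 28]
benchmark = [0.50, 0.30, 0.20]

statistic, residuals = chi_square_gof(observed, benchmark)
p_value = p_value_df2(statistic)  # 3 categories -> 2 degrees of freedom

print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
for label, r in zip(["big", "medium", "small"], residuals):
    print(f"{label:>6}: Pearson residual = {r:+.2f}")
```

With the assumed 100 customers the statistic is about 4.21 (p of roughly 0.12), which would not be significant at the 5% level. The statistic grows in proportion to the sample size, so the same percentages in a larger sample could lead to a different conclusion. The medium category has a residual close to zero, consistent with the idea that any deviation lies in the big and small categories.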

These and many other questions can be answered using a variety of statistical tests. The following sections discuss just an introductory sample of such tests.
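
As a preview of the dependence question, the sketch below computes a Pearson chi-square test of independence for a size-by-license crosstab in plain Python. Only the Small row (56 freeware, 22 premium) is taken from the text. The Medium and Big rows are invented for illustration and are not the values in Figure 15.2, and the function name is hypothetical.

```python
import math


def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: Small, Medium, Big. Columns: Freeware, Premium.
table = [
    [56, 22],  # from the text (Small/Freeware, Small/Premium)
    [45, 36],  # HYPOTHETICAL counts, for illustration only
    [50, 70],  # HYPOTHETICAL counts, for illustration only
]

statistic, df = chi_square_independence(table)

# exp(-x/2) is the exact chi-square upper-tail probability only when df == 2,
# which holds for this 3x2 table; other table shapes need a general routine.
p_value = math.exp(-statistic / 2) if df == 2 else None

print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value}")
```

A small p-value here would suggest that license type is not independent of company size, although the test alone does not say which variable influences the other.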