Chapter 20: Computing Frequencies
You can use PROC FREQ to create one-way, two-way (row and column), and three-way (page, row, and column) tables. You can output counts and percentages as well as statistics such as chi-square and Fisher’s Exact test. This chapter shows you examples of these tasks.
Creating a Data Set to Demonstrate Features of PROC FREQ
The first step is to have some test data to demonstrate features of PROC FREQ. To this end, you can run Program 20.1 to generate a data set called Risk that includes the following variables:
Variable |
Description |
Subj |
Subject number |
Gender |
1=Male, 2=Female |
Age |
Age in years |
Chol |
Cholesterol level |
BP_Status |
Blood pressure status (High or Low) |
Chol_Status |
Cholesterol status (High or Low) |
Heart_Attack |
1-Yes or 2-No |
As in the previous chapter, this program randomly generates all of the data. Here is the program.
Program 20.1: Program to Generate Test Data Set Risk
proc format;
value yesno 1 = ‘1-Yes’
0 = ‘2-No’;
run;
data Risk;
call streaminit(12345678);
length Age_Group $ 7;
do Subj = 1 to 250;
do Gender = ‘F’,’M’;
Age = round(rand(‘uniform’)*30 + 50);
if Age lt 60 then Age_Group = ‘1:< 60’;
else if Age le 70 then Age_Group = ‘2:60-70’;
else Age_Group = ‘3:71+’;
if rand(‘uniform’) lt .3 then BP_Status = ‘High’;
else BP_Status = ‘Low’;
Chol = rand(‘normal’,200,30) +
rand(‘uniform’)*8*(Gender=’M’);
Chol = round(Chol);
if Chol gt 240 then Chol_Status = ‘High’;
else Chol_Status = ‘Low’;
Score = .1*Chol + age + 8*(Gender eq ‘M’) +
10*(BP_Status = ‘High’);
Heart_Attack = (Score gt 100)*(rand(‘uniform’) lt .6);
output;
end;
end;
keep Subj Gender Age Chol Chol_Status BP_Status Heart_Attack;
format Heart_Attack yesno.;
run;
title “Listing of Data Set Risk (first 10 observations)”;
proc print data=Risk(obs=10);
id Subj;
run;
Figure 20.1 is a listing of the first 10 observations from data set Risk.
Figure 20.1: Output from Program 20.1
Using PROC FREQ to Generate One-Way Frequency Tables
To create one-way frequency tables, specify the list of variables in a TABLES statement as follows.
Program 20.2: Using PROC FREQ to Generate One-Way Frequency Tables
title “One-way Frequency Tables”;
proc freq data=Risk;
tables Gender Heart_Attack;
run;
In this example, you are requesting one-way frequencies for Gender and Heart_Attack. PROC FREQ can also compute frequencies for numeric variables. However, beware, if you have a variable such as Age with many different values, PROC FREQ will compute frequencies for every unique value. Here is the output.
Figure 20.2: Output from Program 20.2
You see frequency, percent, cumulative frequency, and cumulative percent in the two tables. The order of the values is based, by default, on the internal values of the variables. For numeric variables, lower numbers come before higher numbers; for character variables, the values are sorted alphabetically. Because Heart_Attack was coded as 0 and 1, the first entry in the table is 2-No, the formatted value for the internal value of 0.
You might wonder why the formats ‘2-No’ and ‘1-Yes’ were created. Why not just ‘Yes’ and ‘No’? The answer is that you can request the order in any tables to be based on the formatted value of a variable by using the PROC FREQ option ORDER=formatted. Alphabetically, ‘1-Yes’ comes before ‘2-No’. That will force the ‘Yes’ frequencies to be listed before the ‘No’ frequencies. For the 2 x 2 tables coming up next, it is preferable (for statistical purposes) to have the ‘Yes’ values come before the ‘No’ values.
It is rarely useful to see cumulative frequencies or percentages. You can eliminate these values from the tables with a TABLES option called NOCUM. The program that follows uses both the ORDER= procedure option and the NOCUM statement option.
Program 20.3: Changing the Table Order and Removing the Cumulative Statistics
title “One-way Frequency Tables”;
proc freq data=Risk order=formatted;
tables Gender Heart_Attack / nocum;
run;
Figure 20.3 shows how these two options affect the tables.
Figure 20.3: Output from Program 20.3
The order of the frequencies for the variable Heart_Attack is now controlled by the formatted values for Gender and Heart_Attack, and the cumulative frequencies are no longer included in the tables.
Before we leave this section, you should know one other ordering option: ORDER=freq. This option arranges the frequencies from the most frequent to the least frequent. At times, this can be extremely useful.
Creating Two-Way Frequency Tables
You can create a row by column table by placing one or more variables in the TABLES statement, followed by an asterisk, followed by another list of variables. The variables before the asterisk form the rows of the tables, and the variables after the asterisk form the columns of the tables. If you select more than one variable for the rows or columns list, that list must be placed in parentheses. The syntax is:
tables (list of row variables) * (list of column variables) / options;
PROC FREQ will create a table for every combination of variables in the rows list with every variable in the columns list. If you write the following TABLES statement:
tables (A B) * (C D E);
PROC FREQ will create tables A by C, A by D, A by E, B by C, B by D, and B by E.
To demonstrate a two-way frequency table, let’s create a table using the two variables BP_Status and Heart_Attack. Traditionally, epidemiologists like the outcome variable (Heart_Attack, in this example) to form the columns of the table. Here is the program.
Program 20.4: Creating a Two-Way Frequency Table
title “Two-way Frequency Table of BP_Status by Heart_Attack”;
proc freq data=Risk order=formatted;
tables BP_Status * Heart_Attack;
run;
This table request produces a table with BP_Status forming the rows of the table and Heart_Attack forming the columns of the table. Here is the listing.
Figure 20.4: Output from Program 20.4
The key to this table is displayed in the upper left-hand part of the output. The top number in each box is a frequency count. For example, there were 44 subjects with high blood pressure who had a heart attack. The second number in the table is a percent: These 44 subjects represent 8.8% of all the subjects in the table (500). The third number in the table is a row percentage. Of the 154 subjects who had high blood pressure, 28.57% of them had a heart attack. Finally, the bottom number in the box is a column percentage. Of the 85 subjects who had a heart attack, 51.76% of them had high blood pressure.
The primary numbers of interest to a medical researcher looking at this table would be the percent of people in the high blood pressure group who had a heart attack (28.57%) compared to the percent of people in the low blood pressure group who had a heart attack (11.85%).
One popular statistical test to decide whether the difference in these two proportions is statistically significant is called chi-square. You can request a chi-square test by adding the CHISQ TABLES option to your program. The modified program is shown next.
Program 20.5: Adding a Request for a Chi-Square Test
title “Two-way Frequency Table of BP_Status by Heart_Attack”;
proc freq data=Risk order=formatted;
tables BP_Status * Heart_Attack / chisq;
run;
The output contains the same table as above, with the following added information.
Figure 20.5: Partial Output from Program 20.5
The CHISQ option produces other statistics that are not shown here. Of primary interest in this table is the first row where you see a chi-square of 21.1184 with a p-value of <.0001. This very small p-value tells you that if blood pressure status had no relationship to having a heart attack, the probability of getting such a large difference in the proportion between the two blood pressure groups by chance alone is less than .0001. This is considered highly significant by statisticians.
Creating Three-Way Frequency Tables
You can create three-way frequency tables by specifying the page, row, and column variables, separated by an asterisk. You should be cautious when you do this as it might generate a large volume of output if your page variable(s) have large numbers of values. To keep the output small, the example for a three-way table that follows, uses Gender as the page variable, BP_Status as the row variable, and Heart_Attack as the column variable. Here is the program.
Program 20.6: Creating a Three-Way Table
title “Three-way Table of Gender by BP_Status by Heart_Attack”;
proc freq data=Risk order=formatted;
tables Gender * BP_Status * Heart_Attack;
run;
Here is the output.
Figure 20.6: Output from Program 20.6
Inspection of these tables suggests that high blood pressure is related to the incidence of heart attack in females and males. You could verify this by adding the CHISQ option in the TABLES statement.
Using Formats to Create Groups for Numeric Variables
You can use formats to group values to be displayed in the frequency tables. One of the unique features of PROC FREQ is that it automatically uses formatted values for any variable that is associated with a format. For example, suppose you want to compute frequencies for the variable Age, and you want age groups of 20 years. The program that follows demonstrates how you can use a format to accomplish this task.
Program 20.7: Using Formats to Group a Numeric Variable
proc format;
value Agegroup low-19 = ‘<20’
20-39 = ‘20 to 39’
40-59 = ‘40 to 59’
60-79 = ‘60 to 79’
80-high= ‘80+’
. = ‘Missing’;
run;
title “Using a Format to Group a Numeric Variable”;
proc freq data=Risk;
tables Age / nocum;
format Age Agegroup.;
run;
Because you included a FORMAT statement in the procedure, frequencies are computed for the age groups, not the actual age values. Here is the output.
Figure 20.7: Output from Program 20.7
Frequencies are displayed for age groups. Notice the order of the frequencies. Because the option ORDER= was omitted in the program, PROC FREQ used its default ordering called INTERNAL. So, regardless of the values of the formatted ranges, the order in the table is by increasing age.
PROC FREQ is one of the main procedures for computing frequencies. It can construct one-way, two-way, and three-way tables. In addition, by adding options, you can request a variety of statistical tests to be performed on the resulting tables.
1. Using the SASHELP data set Heart, compute one-way frequencies and percentages for the variables Status, BP_Status, and Smoking_Status. Use options to omit cumulative frequencies from the resulting tables.
2. Compute frequencies and percentages for Smoking_Status from the SASHELP data set Heart. Use a procedure option so that the table is in frequency order (with the most frequent category listed first).
3. Compute frequencies and percentages for Status from the SASHELP data set Heart. Create formats and use the appropriate PROC FREQ option so that the category “Dead” comes before “Alive” in the resulting table.
4. Using the SASHELP data set Heart, construct a two-way table for the variables Sex (rows of the table) and Status (columns of the table). Arrange the table so that “Dead” comes before “Alive” and “Male” comes before “Female”. If you are statistically inclined, add an option to compute the chi-square for this table.
5. Using the SASHELP data set Heart, construct a three-way table with Sex as the page dimension, Weight_Status as the row dimension, and Status as the column dimension. There are three categories for Weight_Status. Write the necessary code to eliminate all observations where Weight_Status is equal to “Underweight”.
6. Using the SASHELP data set Heart, construct a two-way table of Systolic (rows) by Diastolic (columns) blood pressures. Both of these variables are numeric. Use formats to divide Systolic and Diastolic into two groups (below 200 versus 200 and above for Systolic; below 120 versus 120 and above for Diastolic). If you are interested, include a request for the chi-square.