Chapter 11: Performing Conditional Processing
All programming languages enable you to perform conditional processing—making decisions based on data values or other conditions. For example, you might want to create a new variable (Age_Group) based on the values of age. Another common use of conditional logic is to check if data values are within a prescribed range.
Grouping Age Using Conditional Processing
For the first example, you have data on gender, age, height, and weight. You want to create a new variable (Age_Group) based on the variable Age. Here is a first attempt that runs but has a logical flaw in regard to SAS missing values.
Program 11.1: First Attempt at Creating an Age Group Variable (Incorrect Program)
data People;
input @1 ID $3.
@4 Gender $1.
@5 Age 3.
@8 Height 2.
@10 Weight 3.;
if Age le 20 then Age_Group = 1;
else if Age le 40 then Age_Group = 2;
else if Age le 60 then Age_Group = 3;
else if Age le 80 then Age_Group = 4;
else if Age ge 80 then Age_Group = 5;
datalines;
001M 5465220
002F10161 98
003M 1770201
004M 2569166
005F   64187
006F 3567135
;
title "Listing of Data Set People";
proc print data=People;
id ID;
run;
To express comparisons such as less than or greater than, you can use either two-letter mnemonics or symbols. The table below shows all the possible logical comparisons.
Table 11.1: Logical Comparison Operators
Logical Comparison | Mnemonic | Symbol
Equal to | EQ | =
Not equal to | NE | ^= or ~= or ¬=
Less than | LT | <
Less than or equal to | LE | <=
Greater than | GT | >
Greater than or equal to | GE | >=
Equal to any value in a list | IN | (no symbol)
The new statements in this program are the IF and ELSE IF statements. They work like this: Following the keyword IF or ELSE IF is a logical expression that is either true or false. If the expression is true, the statement following THEN executes; if it is false, that statement does not execute. Also, if the logical expression on an IF or ELSE IF statement is true, all the subsequent ELSE IF statements are skipped. For example, in the data set People, the first subject is 54 years old. The first ELSE IF statement that is true is
else if age le 60 then Age_Group = 3;
Because this statement is true, all the remaining ELSE IF statements are skipped. This logic has the advantage of being more efficient than a series of IF statements; the program does not have to evaluate more IF statements than necessary.
Let’s run the program and examine the output.
Figure 11.1: Output from Program 11.1
Most of the Age_Group values are correct. However, there is a problem for ID 005. This person had a missing value for age but was placed in age group 1. Why?
In SAS, a numeric missing value is treated logically as the most negative number possible. Thus, a missing value is less than any real—positive or negative—number.
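A short sketch can confirm this ordering. The variable name x and the comparison value -999999 are arbitrary choices for illustration and are not part of the chapter's programs:

```sas
data _null_;
   x = .;   /* numeric missing value */
   /* A missing value compares as less than any number, however negative */
   if x lt -999999 then put 'Missing is less than -999999';
   else put 'Missing is not less than -999999';
run;
```

Running this step should write the first message to the SAS log.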
The first IF statement asks whether Age is less than or equal to 20. Person 005 has a missing value for Age and a missing value is less than 20, so this person is placed in age group 1. Here is one way to fix Program 11.1:
Program 11.2: Corrected Version of Program 11.1
data People;
input @1 ID $3.
@4 Gender $1.
@5 Age 3.
@8 Height 2.
@10 Weight 3.;
if missing(Age) then Age_Group = .;
else if Age le 20 then Age_Group = 1;
else if Age le 40 then Age_Group = 2;
else if Age le 60 then Age_Group = 3;
else if Age le 80 then Age_Group = 4;
else if Age ge 80 then Age_Group = 5;
datalines;
001M 5465220
002F10161 98
003M 1770201
004M 2569166
005F   64187
006F 3567135
;
title "Listing of Data Set People";
proc print data=People;
id ID;
run;
The first IF statement tests if Age is a missing value. This is accomplished using the MISSING function. In SAS, a function name is followed by a set of parentheses. The values placed in the parentheses are called arguments to the function. The MISSING function returns a value of true if the argument is a missing value and false otherwise. (The MISSING function works for both character and numeric arguments.) When the program processes ID 005, the MISSING function returns a true value and the variable Age_Group is set to a missing value (designated by a period). An alternative to testing for a missing value is the following line of code:
if Age = . then Age_Group = . ;
This author strongly recommends that you use the MISSING function to test for missing values.
Most SAS programmers would agree that failing to account for missing values in a DATA step is the most common logical error in SAS programming. Always be sure to consider the consequences of a missing value meeting your program logic.
Output from Program 11.2 results in a missing value for Age_Group for ID 005.
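As a small illustration of the MISSING function handling both variable types, here is a minimal sketch. The variables Name and Score are hypothetical and are not part of the People data set:

```sas
data _null_;
   length Name $ 10;
   Name  = ' ';   /* character missing value (a blank) */
   Score = .;     /* numeric missing value (a period) */
   /* The same function call works for both variable types */
   if missing(Name)  then put 'Name is missing';
   if missing(Score) then put 'Score is missing';
run;
```

Both messages should appear in the SAS log, because each argument is a missing value of its type.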
Using Conditional Logic to Check for Data Errors
You can use IF-THEN-ELSE logic to test if values of certain variables are outside a predetermined range. For example, you might want to check if anyone in the People data set was heavier than 200 pounds or lighter than 100 pounds. The following program does just that.
Program 11.3: Using Conditional Logic to Test for Out-of-Range Data Values
data _null_;
set People;
if (Weight lt 100 and not missing(Weight)) or Weight gt 200 then
put "Weight for ID " ID "is " Weight;
run;
There are several new features in this program. First is the special data set name _NULL_. This is a reserved data set name that enables you to run a DATA step without actually creating a data set. The reason it is used in this program is that you are checking for out-of-range values, and you do not need a data set when you are finished checking—thus the use of DATA _NULL_. Because you are not creating a data set, the program is more efficient than one that does create a data set.
The SET statement brings in observations from the People data set. Notice that the condition in the IF statement checks for two things: First, is the value less than 100 and not missing (remember that a missing value is less than 100)? Second, is the Weight greater than 200? If either condition is true, the PUT statement executes. The PUT statement is an instruction to write out the text “Weight for ID” followed by the value of ID (note ID is not in quotation marks, so it represents a variable name) followed by the word “is” followed by the value of Weight. By default, the PUT statement writes its output to the SAS log. This is fine for programmers but not so fine for nonprogrammers. To tell SAS to write out the Weight values to the RESULTS window, add the line file print; before the PUT statement. Here is the output from Program 11.3.
Figure 11.2: Output from Program 11.3
Three people had weights that were out of range.
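For readers who want the messages somewhere other than the log, here is a sketch of Program 11.3 with the FILE PRINT statement added. It assumes the People data set from Program 11.2 already exists in the current session:

```sas
data _null_;
   set People;
   file print;   /* send PUT output to the output destination, not the log */
   /* Parentheses make the intended grouping of AND and OR explicit */
   if (Weight lt 100 and not missing(Weight)) or Weight gt 200 then
      put 'Weight for ID ' ID 'is ' Weight;
run;
```

The only change from Program 11.3 is the FILE statement, so the same three observations should be reported.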
If you want to check if a value is any one of several values, you can use multiple OR operators or the IN operator. Suppose you want to check if values for Race are ‘W’, ‘B’, ‘H’, or ‘O’ (white, black, Hispanic, or other). Using the OR operator, you could write:
length Race_Value $ 7;
if Race = 'W' or Race = 'B' or Race = 'H' or Race = 'O' then
Race_Value = 'Valid';
else Race_Value = 'Invalid';
Using the IN operator, you could write the same test more compactly:
length Race_Value $ 7;
if Race in ('W','B','H','O') then Race_Value = 'Valid';
else Race_Value = 'Invalid';
When you use the IN operator, you place the character values in quotation marks (single or double) and separate each value by a comma or space. The statement evaluates as true if Race matches any one of the listed values. It is useful to know that once a match is made, the IN operator stops looking for matches. When extreme efficiency is needed, programmers will place values most likely to be present in the data at the beginning of the list of values, saving the extra CPU time in searching the whole list.
You can use the IN operator with numeric data as well. For example, if you want to list all subjects in Age_Group 3, 4, or 5, you could use the following statement:
if Age_Group in (3,4,5) then put “Older Folks”;
You might wonder about the LENGTH statement in these code segments. Why is it needed? Remember that the length of character variables is set at compile time (before any data values are read or any conditional logic is performed). Without the LENGTH statement, the first time the variable Race_Value appears is where it is set equal to ‘Valid’. Because ‘Valid’ is 5 characters long, SAS would set the storage length for Race_Value to 5. The effect would be to truncate the other possible value for Race_Value, ‘Invalid’, to 5 characters (‘Inval’). By using a LENGTH statement before the assignment statement for Race_Value, SAS assigns a storage length of 7 for this variable. A popular trick to avoid entering the LENGTH statement is to pad the first value of Race_Value, ‘Valid’, with two extra blanks (for example, ‘Valid  ’), so that SAS will assign the variable a length of 7 instead of 5. Using a LENGTH statement is considered more elegant and a better programming practice.
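To see the truncation for yourself, here is a minimal sketch that deliberately omits the LENGTH statement. The Race value 'X' is a made-up invalid code chosen for illustration:

```sas
data _null_;
   /* No LENGTH statement: the length of Race_Value comes from the
      first assignment that mentions it ('Valid', 5 characters) */
   Race = 'X';
   if Race in ('W','B','H','O') then Race_Value = 'Valid';
   else Race_Value = 'Invalid';
   put Race_Value=;   /* writes the stored (truncated) value to the log */
run;
```

The log should show Race_Value=Inval. Adding length Race_Value $ 7; before the IF statement should restore the full value.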
Using Boolean Logic (AND, OR, and NOT Operators)
You can combine the three Boolean operators, AND, OR, and NOT, in logical expressions. Here is an example:
*Note: there are no missing values of LDL and HDL in the data;
if (LDL gt 100 or HDL lt 50) and Gender eq 'F' then
Risk = 'High';
else if (LDL gt 100 or HDL lt 40) and Gender eq 'M' then
Risk = 'High';
else Risk = 'Low';
High values of LDL (low-density lipoprotein) or low values of HDL (high-density lipoprotein) are considered a risk for coronary artery disease. In addition, the Mayo Clinic uses different values for the HDL cutoff for men and women.
The order of precedence of the Boolean operators in decreasing order is:
1. NOT
2. AND
3. OR
You can use parentheses to force a different order of operation. For example:
if x and y or z;
is equivalent to
if (x and y) or z;
If you want to perform the OR operation before the AND operation, write the expression like this:
if x and (y or z);
Because the NOT operator has the highest precedence, the expression:
if x and not y or z;
is equivalent to
if x and (not y) or z;
Even though the Boolean operators have a built-in ordering, feel free to add parentheses in your logical expressions—it makes the logic easier to understand.
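The effect of parentheses can be checked with a short sketch. It assumes the arbitrary values x = 0, y = 0, and z = 1 (false, false, and true) and stores each result in a hypothetical variable so that it can be written to the log:

```sas
data _null_;
   x = 0; y = 0; z = 1;
   /* Logical expressions evaluate to 1 (true) or 0 (false) */
   Default_Order = (x and y or z);     /* same as (x and y) or z */
   Forced_Order  = (x and (y or z));   /* OR is evaluated first  */
   put Default_Order= Forced_Order=;
run;
```

With these values the log should show Default_Order=1 and Forced_Order=0, so the two forms are not interchangeable.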
A Special Caution When Using Multiple OR Operators
It is very easy to make a serious error when using multiple OR operators. Take a look at the following program:
Program 11.4: A Common Error Using Multiple OR Operators
data Mystery;
input x;
if x = 3 or 4 then Match = 'Yes';
else Match = 'No';
datalines;
3
4
9
.
-5
;
title "Listing of Data Set Mystery";
proc print data=Mystery noobs;
run;
Before you look at the output, notice that the PROC PRINT option NOOBS was added. This option removes the Obs column from the output of PROC PRINT. If you use an ID statement with PROC PRINT, the NOOBS option is not necessary because the ID variable(s) replace the Obs column.
Here is the output.
Figure 11.3: Output from Program 11.4
That is not what you expected, is it? First of all, you probably expected a syntax error because the statement:
if x = 3 or 4 then Match = 'Yes';
should have been written as:
if x = 3 or x = 4 then Match = 'Yes';
However, if you expressed the logic in this program verbally, you might very well say if x equals 3 or 4, then Match equals ‘Yes’. Let’s first look at the SAS log.
Figure 11.4: SAS Log from Program 11.4
There are no syntax errors, yet the output is obviously wrong. What is going on?
In SAS, any numerical value that is not equal to 0 or a missing value is considered true.
In this program, there are two conditions separated by an OR operator. One is ‘x = 3’; the other is ‘4’. ‘4’ is not equal to 0 or missing, so it is evaluated as true. An OR expression is true if either one (or both) of the expressions is true. Because ‘4’ is true, the logical expression is true regardless of the value of x.
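The rule about numeric values being treated as true or false can be demonstrated directly. The sketch below uses a hypothetical variable called Value as the entire condition of an IF statement:

```sas
data _null_;
   input Value;
   /* A bare numeric value is true unless it is 0 or missing */
   if Value then put Value= 'is treated as true';
   else put Value= 'is treated as false';
datalines;
4
0
.
-5
;
```

The values 4 and -5 should be reported as true, and 0 and the missing value as false.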
Just to be sure this is clear, let’s rewrite Program 11.4 correctly like this.
Program 11.5: A Corrected Version of Program 11.4
data Mystery;
input x;
if x = 3 or x = 4 then Match = 'Yes';
else Match = 'No';
datalines;
3
4
9
.
-5
;
title "Listing of Data Set Mystery";
proc print data=Mystery noobs;
run;
Here is the output from the corrected program.
Figure 11.5: Output from Program 11.5
Seeing how easy it is to make this error when using multiple OR operators, you should be convinced that it is better to write the logical test as:
if x in (3,4) then Match = 'Yes';
Just about every program that you write will need to use conditional logic. Using IF-THEN-ELSE statements and the Boolean operators, NOT, AND, and OR, enables you to evaluate complex conditions. Finally, remember to consider missing values when you write your logical expressions.
Problems
1. Using the SASHELP data set Fish, create a new, temporary SAS data set called Group_Fish that contains the variables Species, Weight, and Height. Using IF-THEN-ELSE logic, create a new variable called Group_Fish that places the fish into weight groups as follows:
0 to 100=1, 101-200=2, 201-500=3, 501-1,000=4, and 1,001 and greater=5.
Use PROC PRINT to list the first 10 observations from data set Group_Fish.
2. Run Program 11.2 in this chapter (listed below), adding statements to print one message if the variable Age is greater than 100 and another message if Age is missing. Include the Age value in the message for ages greater than 100.
data People;
input @1 ID $3.
@4 Gender $1.
@5 Age 3.
@8 Height 2.
@10 Weight 3.;
if missing(Age) then Age_Group = .;
else if Age le 20 then Age_Group = 1;
else if Age le 40 then Age_Group = 2;
else if Age le 60 then Age_Group = 3;
else if Age le 80 then Age_Group = 4;
else if Age ge 80 then Age_Group = 5;
datalines;
001M 5465220
002F10161 98
003M 1770201
004M 2569166
005F   64187
006F 3567135
;
3. Create a new, temporary SAS data set called High_BP containing subjects from the SASHELP.Heart data set who have systolic blood pressure greater than 250 or diastolic blood pressure greater than 180. The new data set should only have variables Diastolic, Systolic, and Status. Use PROC PRINT to list the contents of High_BP.
4. If A is true, B is false, and C is false, what do each of these expressions evaluate as (true or false)?
a) A AND NOT B
b) NOT A OR NOT C
c) A AND NOT B AND C
d) A AND (NOT B OR NOT C)
5. What’s wrong with this program?
data Weights;
input Wt;
if Wt lt 100 then Wt_Group = 1;
if Wt lt 200 then Wt_Group = 2;
if Wt lt 300 then Wt_Group = 3;
6. datalines;
50
150
250
;
6. Starting with the SASHELP data set Retail, write a program to create a new data set (Sales_Status) with the following new variables:
If Sales is greater than or equal to 300, set Bonus equal to ‘Yes’ and Level to ‘High’. Otherwise, if Sales is not a missing value, set Bonus to ‘No’ and Level to ‘Low’. Use PROC PRINT to list the observations in this data set.