Chapter 10: Creating Formats and Labels

What Is a SAS Format and Why Is It Useful?

It is a common practice to store information in a database using codes rather than actual values. For example, you might have a questionnaire where the responses are strongly disagree, disagree, neutral, agree, and strongly agree. It would be unusual to store the actual values in your database. Rather, you would use coded values such as 1=strongly disagree, 2=disagree, and so on.

Even though you are storing codes in your database, you would like to see the actual labels printed in your output. SAS formats are the tool that allows this to happen.

To demonstrate how to create your own SAS formats, let’s start with a SAS data set called Taxes, described in Table 10.1.

Table 10.1: Formats in Taxes Data Set

Variable

Description

Codes Used

SSN

Social Security Number

Gender

Gender

M=Male, F=Female

Question_1

Do you pay taxes?

1=Yes, 0=No

Question_2

Are you satisfied with this service?

1=strongly disagree, 2=disagree, 3=neutral, 4=agree, 5=strongly agree

Question_3

How many phone calls did it take to resolve your problem?

A=0, B=1 or 2, C=3 to 5, D=More than 5

Question_4

Was the person answering your call friendly?

Same as Question_2

Question_5

How much did you pay?

The actual dollar amount

The program to create the Taxes data set is shown next:

Program 10.1: Creating the Taxes Data Set (and Demonstrating a DATALINES Statement)

data Taxes;
informat SSN $11.
Gender $1.
Question_1 - Question_4 $1.;

input SSN Gender Question_1 - Question_5;
datalines;
101-23-1928 M 1 3 C 4 23000
919-67-7800 F 9 2 D 2 17000
202-22-3848 M 0 5 A 5 57000
344-87-8737 M 1 1 B 2 34123
444-38-2837 F . 4 A 1 17233
763-01-0123 F 0 4 A 4 .
;

title “Listing of Data Set Taxes”;
proc print data=Taxes;
id SSN;
run;

This program uses a feature that you have not seen before: a DATALINES statement. When you want to write a short SAS program to test your logic or syntax, you can save the trouble of writing a text file (perhaps using Notepad or some similar editor) and then writing an INFILE statement telling the program where to find the data. Instead, you can actually include the lines of data in the program itself. You do this by writing a DATALINES statement and following this statement with your data. You end your data with a single semicolon or a RUN statement. SAS will read these lines of data as if they were in an external file. There is one additional new feature in this program—that is, the ID statement following PROC PRINT. You use an ID statement to do two things: First, you specify a variable that you want to display in the first column of your listing. If your data set has more variables that can fit across one page of output, the ID variable is repeated on each new page. Second, when you specify an ID variable, the default Obs column is no longer printed (this is a good thing in this author’s opinion).

Here is a listing of the Taxes data set.

Figure 10.1: Output from Program 10.1

Figure 1.1 Some JMP Help Options

You would prefer to have values for Gender be listed as ‘Male’ and ‘Female’, values for Question_1 to be listed as ‘Yes’ and ‘No’, values for Question_2 and Question_4 to display the agreement scale (called a Likert scale by psychometricians), and values for Question_3 to show the number of calls. Finally, you would like to place the dollar amounts into four categories: 0 - $10,000 = Low, $10,001 - $20,000 = Medium, $20,001 - $50,000 = High, and $50,000+ = Very High.

The process of substituting these labels for the coded values is called formatting in SAS terminology. The first step is to create the formats. Once that is accomplished, you can associate one or more variables with these formats. You use PROC FORMAT to create your SAS formats, as shown in the following program.

Program 10.2: Creating Your Own Formats

proc format;
value $Gender ‘M’=’Male’
‘F’=’Female’;
value $Yesno ‘0’=’No’
‘1’=’Yes’
other=’Did not answer’;
value $Likert ‘1’=’Strongly Disagree’
‘2’=’Disagree’
‘3’=’No Opinion’
‘4’=’Agree’
‘5’=’Strongly Agree’;
value $Calls ‘A’=’None’
‘B’=’1 or 2’
‘C’=’3 - 5’
‘D’=’More than 5’;


value Pay_group low-10000 = ‘Low’
10001-20000 = ‘Medium’
20001-50000 = ‘High’
50001-high = ‘Very High’;

run;

You write VALUE statements to define your formats and follow the keyword VALUE with the name of the format that you want to create. Format names can be up to 32 characters long, and they follow the same naming conventions as other SAS names, except that they cannot end in a digit.

Formats that you plan to apply to character variables must begin with a dollar sign ($) (leaving 31 characters for you to use). In this example, the first four formats will be associated with one or more character variables. The last format will be associated with the numeric variable Question_5. Because the first four formats will be associated with character variables, their names all begin with a dollar sign. Following the format name, you list either a single value (such as ‘M’) or a range of values (such as low–10000). Notice that you place the character values in single or double quotation marks. There are several keywords that you can use in defining a value. As you can see in the $Yesno format, the keyword OTHER will supply the label “Did not answer” for all values other than ‘0’ or ‘1’. You can also use the keywords LOW and HIGH to refer to the lowest and highest values, respectively. Note that for the Pay_group format, you cannot include commas in the numerical ranges.

To demonstrate how formats work, let’s first run PROC FREQ to compute frequencies for each variable in the Taxes data set, without formatting any of the variables. The PROC FREQ statements look like this:

Program 10.3: Computing Frequencies on the Taxes Data Set (without Formats)

title “Frequencies for the Taxes Data Set”;
proc freq data=Taxes;
tables Gender Question_1 - Question_5 / nocum;
run;

The TABLES option NOCUM is an instruction to omit cumulative frequencies from the results, shown below.

Figure 10.2: Output from Program 10.3

Figure 1.1 Some JMP Help Options

Figure 1.1 Some JMP Help Options

Figure 1.1 Some JMP Help Options

PROC FREQ listed frequencies for each unique value of the variables. For numeric variables such as Question_5 (How much did you pay, in dollars?), you see frequencies for each unique value—not something that is very useful. There are other SAS procedures (such as PROC SGPLOT) that will automatically place numerical values into groups and produce histograms, etc.

It’s time to see what the output from PROC FREQ looks like when you add a FORMAT statement, associating each of the variables in the Taxes data set with a format. Here is the code.

Program 10.4: Adding a FORMAT Statement to PROC FREQ

title “Frequencies for the Taxes Data Set”;
proc freq data=Taxes;
format Gender $Gender.
Question_1 $Yesno.
Question_2 Question_4 $Likert.
Question_3 $Calls.
Question_5 Pay_group.;
tables Gender Question_1 - Question_5 / nocum;
run;

You use a FORMAT statement to associate your variables with the appropriate format. SAS programs recognize the difference between variables and formats because when you associate formats to a variable, you end each format name with a period. The two variables Question_2 and Question_4 share the same format, so you list these two variables together and follow them with the format that you want to use ($Likert). Output from this program is shown below.

Figure 10.3: Output from Program 10.4

Figure 1.1 Some JMP Help Options

Figure 1.1 Some JMP Help Options

Figure 1.1 Some JMP Help Options

You now see the formatted values in the tables. You might wonder about the order of the values in the tables. Notice that Female comes before Male and No comes before Yes. Did you guess that PROC FREQ orders these values alphabetically? If so, you are correct. Later on, you will see how to change the order in the frequency tables.

If you are very observant, you will have noticed that the table for Question_1 lists two missing values. However, when you look at the unformatted data, you see only one missing value and one value of 9. It turns out that the OTHER category was combined with the missing values (because it was not a valid value for this variable). If you want to separate the missing value from the OTHER category, you can add a format for a missing value like this:

value $Yesno ‘0’=’No’ 
             ‘1’=’Yes’ 
             ‘ ‘=’Did not answer’  
             other=’Invalid value’;

Because $Yesno is a character format, you specify a missing value with a single space between the quotation marks. To specify a missing value for a numeric format, use a single period instead. When you use this format with PROC FREQ, the output frequencies for Question_1 look like Figure 10.4.

Figure 10.4: Using the Modified $Yesno Format

Figure 1.1 Some JMP Help Options

By using this modified format, you now see that there was one invalid value and one missing value.

Using SAS Built-in Formats

SAS provides you with a large number of built-in formats that you can use, along with ones that you create yourself. One example of a numeric format is w.d, where w (stands for width) specifies the total number of spaces to use when writing a number and d specifies how many places to include after the decimal place. Table 10.2 shows some examples. For each of these examples, X=1234.567.

Table 10.2: Built-in Numeric Format Examples

Format

Display

Explanation

8.3
1234.567

The 8 is the total column width, including the decimal point.

10.4
 1234.5670

There is a leading space and a 0 is added to give four places to the right of the decimal point.

8.2
 1234.57

There is one leading blank and the decimal value is rounded.

4.
1235

When the width is shorter than the value, it gets truncated (and rounded).

10.1
   1234.6

There are four leading blanks and the decimal is rounded.

Some other useful SAS formats are dollarw.d and commaw.d. The dollar format adds dollar signs, commas (if needed), and cents (if d is equal to 2). The value of w is the field width. This includes the dollar sign, any commas, a decimal point, and the digits to the right of the decimal point. If you leave out a value of d (the number after the period), the dollar value will be rounded. The comma format is similar to the dollar format, except that it does not include a dollar sign and the value of d can show as many decimal places as needed. Formats for SAS dates are particularly useful and will be discussed in the chapter on SAS dates.

More Examples to Demonstrate How to Write Formats

There is great flexibility in defining values or ranges when you create a SAS format. The following examples help illustrate this flexibility.

In Program 10.2, one of the formats placed numerical values into four categories as shown here:

value Pay_group  low-10000   = ‘Low’
                 10001-20000 = ‘Medium’
                 20001-50000 = ‘High’
                 50001-high  = ‘Very High’;
run;

This works fine as long as the values to be formatted are integers. However, suppose you tried to format a value of 10,000.50. This value falls between the Low and Medium ranges (and would not be formatted—it would print as 10000.50 in a listing). You might specify ranges like this:

 value Pay_group low-10000   = ‘Low’
                 10000-20000 = ‘Medium’
                 20000-50000 = ‘High’
                 50000-high  = ‘Very High’;

Although this code works, it is confusing. What is the formatted value of 10,000? It turns out it would be formatted as ‘Low’. It is better to allow a value to only match a single range. You can exclude a value from the beginning or the end of a range by adding a less than (<) sign before or after the hyphen that specifies ranges. For example, to exclude 10,000 from the low range, you would use:

value Pay_group  low- <10000 = ‘Low’

If you wanted to exclude 10,000 from the medium range, you would use:

value Pay_group  10000< -20000 = ‘Medium’

You can also use hyphens and commas to specify character ranges. Here are some examples:

value $Grades ‘A’ ,’B’  = ‘Good’
              ‘C’ – ‘E’ = ‘Passing’
              ‘F’       = ‘Fail’
              other     = ‘Error’;

Values of A or B are formatted as ‘Good’; C, D, and E are formatted as ‘Passing’; F is formatted as ‘Fail’; and any characters not equal to any of these values is formatted as ‘Error’.

Describing the Difference between a FORMAT Statement in a Procedure and a FORMAT Statement in a DATA Step

In the programs that you have seen thus far, the FORMAT statements have been used inside a SAS procedure. This creates an association between the variables and formats only for that procedure. If you place a FORMAT statement in a DATA step, the association between the variables and formats remains for the entire program. Then, if you write a PROC PRINT or PROC FREQ step, you will see formatted values for all the variables you listed in your FORMAT statement.

You will usually find it more convenient to associate you formats with variables, using a FORMAT statement in a DATA step, saving the trouble of having to rewrite (or copy) it for each procedure that you run. Keep in mind that even though you have assigned a format to a variable, many procedures such as PROC MEANS or other statistical procedures will still use the internal values of the variables when doing their calculations.

Making Your Formats Permanent

All of the user-written formats previously described are temporary formats—that is, they only exist for the duration of your SAS session. If you have formats that you plan to use frequently, you can make them permanent. That way, every time you open up a SAS session, you can use any of your permanent formats without having to rerun PROC FORMAT. There are a few steps that you need to follow to make this happen.

First, you need to decide where you plan to store your formats. To keep this example simple, let’s store your permanent formats in MyBookFiles. You run PROC FORMAT just as you did previously, except you add the procedure option LIBRARY=libref (where libref is the library reference you create using a LIBNAME statement). Here is an example.

Program 10.5: Making a Permanent Format

libname Myfmts “~/MyBookFiles”;

proc format library=Myfmts;
value $Gender 1=’Male’
2=’Female’;
value $Yesno 0=’No’
1=’Yes’;
run;

Figure 10.5: Log Window After Running Program 10.5

Figure 1.1 Some JMP Help Options

The two formats, $Gender and $Yesno, are now permanent formats and are stored in the Myfmts library.

If you have variables associated with formats in a permanent SAS data set, keep in mind that the formats are now a permanent property of those variables, and the format definitions must be available in order to open and read the data set. That is, if you give someone a copy of one of your permanent SAS data sets, be sure to give them the format catalog (it will have the name formats.sas7bcat).

There is a way to open a data set that has formats associated with variables where you do not have the format library. The “trick” used by this author and many others is to include the following statement at the top of your program:

Options nofmterr;

The option nofmterr stands for no format error.

A good way to keep track of your permanent formats is to include another PROC FORMAT option called FMTLIB. This option produces a listing of all the formats in the specified library, along with all of the defined values for these formats. To illustrate this, let’s run Program 10.5: Making a Permanent Format again with the FMTLIB option added.

Program 10.6: Adding the FMTLIB Option

libname Myfmts “~/MyBookFiles”;

proc format library=Myfmts fmtlib;
value $Gender 1=’Male’
2=’Female’;
value $Yesno 0=’No’
1=’Yes’;
run;

Here is the output.

Figure 10.6: Output from Program 10.6

Figure 1.1 Some JMP Help Options

This output is an especially useful document for you to use or to share with others who want to use your formats.

Before you use these permanent formats in a new SAS session, you need to include two statements: One is a LIBNAME statement that defines your libref . The other statement is OPTIONS FMTSEARCH=libref. The option FMTSEARCH=libref, tells SAS to look in the location specified by your libref to find your permanent formats. So, beginning every SAS session, you would add these two statements:

libname Myfmts “~/MyBookFiles”;
options fmtsearch=(Myfmts);

If you plan to use permanent formats, you should put these two statements in the Autoexec file, as described in the last chapter.

Creating Variable Labels

You can associate labels with your SAS variables. If you have variables such as Gender and Race, a label might not be necessary. However, for variable names such as Question_1, Question_2, etc., you might want to provide labels. You use a LABEL statement to associate labels with your variables. For example, the following program adds labels to the variables in the Taxes data set (described in the beginning of this chapter).

Program 10.7: Creating Variable Labels

data Taxes;
informat SSN $11.
Gender $1.
Question_1 - Question_4 $1.;

input SSN Gender Question_1 - Question_5;
label Question_1 = ‘Do you pay taxes?’
Question_2 = ‘Are you satisfied with the service?’
Question_3 – ‘How many phone calls?’
Question_4 = ‘Was the person friendly?’
Question_5 = ‘How much did you pay?’;
datalines;

101-23-1928 M 1 3 C 4 23000
919-67-7800 F 9 2 D 2 17000
202-22-3848 M 0 5 A 5 57000
344-87-8737 M 1 1 B 2 34123
444-38-2837 F . 4 A 1 17233
763-01-0123 F 0 4 A 4 .
;

title ‘Frequencies for the Taxes Data Set’;
proc freq data=Taxes;
format Gender $Gender.
Question_1 $Yesno.
Question_2 Question_4 $Likert.
Question_3 $Calls.
Question_5 Pay_group.;
tables Gender Question_1 - Question_5 / nocum;
run;

Now that you have added variable labels to your program, let’s see how it affects the output. Listed below is a partial output from this program.

Figure 10.7: Output from Program 10.7

Figure 1.1 Some JMP Help Options

You now see the variable label listed at the top of each frequency table.

Conclusion

Adding formats and labels to a SAS program will make the listings and tables much more readable. If you have formats that you use frequently, be sure to create permanent formats so that you don’t have to run PROC FORMAT every time you start a SAS session. If you have associated formats with variables in the DATA step, and plan to share your data set with others, be sure to include the format catalog (formats.sas7bcat) along with the SAS data set.

Problems

In this chapter and several to follow, you will see programs with captions such as Program for Problem Sets 1, and so forth. These programs are all included in the download package in a folder called Problems. All these programs are stored in the file Problem Data Sets.sas. Other files in the Problems folder contain data that you can use in other problems.

1. Modify the following program to supply formats to the variables listed in the table below. (Use the SAS format MMDDYY8. to format Visit.)

Variable

Type

Formatted Values

Gender

Num

1=Male, 2=Female

Q1-Q4

Char

1=’Strongly Disagree’, 2=’Disagree’, 3=’No Opinion’, 4=’Agree’, 5=’Strongly Agree’

Visit

Num

Month/Day/Year

Age

Num

0-20=’Young’, 21-40=’Still young’, 40-60=’Middle’, 61+=’Older’

Program for Problem Sets 1

data Questionnaire;
   informat Gender 1. Q1-Q4 $1. Visit date9.;
   input Gender Q1-Q4 Visit Age;
   format Visit date9.;
datalines;
1 3 4 1 2 29May2015 16
1 5 5 4 3 01Sep2015 25
2 2 2 1 3 04Jul2014 45
2 3 3 3 4 07Feb2015 65
;
title “Listing of Data Set Questionnaire”;
proc print data=Questionnaire noobs;
run;

2. Using the program in Problem 1, create a format that places ages into the following categories:

0-20=‘Group 1’, 21-40=‘Group 2’, 41-60=‘Group 3’, 61-80=‘Group 5’, 81+=‘Group 5’

Use this format for the variable Age and the other formats described in Problem 1. Produce a listing showing the formatted values for all the variables.

3. You have a character variable called Grades. Values of Grades are ‘A’, ‘B’, ‘C’, ‘D’, ‘F’, ‘I’, and missing. Write a format (call it $Grades) that formats ‘A’ and ‘B’ as ‘Good’, ‘C’ as ‘Average’, ‘D’ as ‘Poor’, ‘F’ as ‘Fail’, ‘I’ as ‘Incomplete’, and missing values as ‘Missing’. Also, include a format for any nonmissing value that is not one of the valid values (call them Invalid).