Altering the data type

While processing data, you will frequently need to change data types. Even if the data is of good quality, where numbers have been stored as numerical and all date fields have been specified with the correct date format, transformations or derivations from them may be required to perform analysis.

Some of the commonly used transformations and derivations for a data analyst where the data type is altered are the following:

To convert a number into a character, use the PUT function. The syntax is as follows:

Character_variable = put(numeric_variable, informat.);

To convert the index variable in the cost of living dataset into a character, use the following code:

DATA Num_to_Char;
SET Cost_Living;
Index_char = PUT(Index, 3.);
RUN;

First, we converted a variable from numeric into character format and then added a new variable to the dataset.

To convert from character to numeric, we use the INPUT function. The syntax is as follows:

Numeric_variable = input(character_variable, informat.);

Let’s run the following code to create a dataset that will help us understand the character into numeric conversion:

DATA Convert;
INPUT Id_Char $4. Turnover $7. Turnover_w_Currency $8. Source_Mixed $3.;
DATALINES;
0001 20,000 $20,000 A1
0002 10,000 $10,000 2
;

This gives us the following dataset:

To convert the Id_char variable to numeric, we can use the following statement in a data step:

Id_Num = INPUT(Id_Char, 5.);

This will remove the leading zeros and give us the values 1 and 2 for the variable. But what if we wanted to preserve the leading zeros? In that case, we should use the following code:

Id_Num_Leading_Zero = INPUT(Id_Char, $5.);

To convert the Turnover_w_currency variable to numeric data type, we need to use the correct informat:

Turnover_Num = INPUT(Turnover_w_currency, dollar8.);

The Source_mixed field has alphanumeric data. When we use the following statement to convert it into a numeric variable, we get missing data for observation 1. Users should be careful while deploying numeric conversion using automated scripts. Missing data can be deemed acceptable in some circumstances. However, losing the alphanumeric data might negatively impact data quality in some instances:

Source_numeric = INPUT(source_mixed, 3.);

We have looked at date values and how SAS interprets them as numbers. The date variable with time information is also stored as a number in SAS. The informat and format used for interpretation are different from those used for a date variable. We will try and understand handling date-time values using the following example:

DATA DateTime;
INPUT Id Date_Time Datetime20.;
DATALINES;
1 01aug19:09:10:05.2
2 01aug20:19:20:10.4
;

DATA Convert_DateTime;
SET DateTime;
FORMAT Orig_Date Datetime.;
Orig_Date = Date_Time;
FORMAT Orig_Date_1 Datetime7.;
Orig_Date_1 = Date_Time;
FORMAT Orig_Date_2 Datetime12.;
Orig_Date_2 = Date_Time;
RUN;

The datetime values must be in the following form: ddmmmyy or ddmmmyyyy, followed by a blank or special character, followed by hh:mm:ss.ss (the time), where hh is the number of hours ranging from 00 through 23, mm is the number of minutes ranging from 00 through 59, and ss.ss is the number of seconds ranging from 00 through 59 with the fraction of a second following the decimal point. Only the ss.ss portion is optional in the datetime value.

We get the following output when we create the DateTime table using informats:

Without the time variable, the values for IDs 1 and 2 (dates August 01, 2019 and August 01, 2020) will be stored in SAS as 21762 and 22128. The addition of the time component to the date changes the value of the number stored in SAS significantly.

To make the datetime information stored in SAS readable, we used different formats in DATA Convert_DateTime and got the following output:

For date and time values, we have looked at how to read and output them in SAS. We have also managed to format these dates in a way that we only see them in a manner relevant to us. On occasions, we might just want to extract some information from the date or the datetime variable. This information could be stored in a non-date/datetime variable such as the numeric variable.

We can use the following statements to extract the Year, Month, and Date values from a Date variable with an informat of mmddyy:

Year = YEAR (Date);
Month = MONTH (Date);
Day = DAY (Date);

For the date August 1, 2019, we will get the year as 2019, month as 8, and day as 1 when we use the preceding statements in a data step. The Year, Month, and Day variables will be created as numeric variables.

If we have a Date_Time variable, we need to extract the date part from it before converting it into a Year, Month, and Date variable. We can use the following statements:

Year = YEAR (DATEPART(Date_Time));
Month = MONTH (DATEPART(Date_Time));
Day = DAY (DATEPART(Date_Time));

For the datetime value 01aug19:09:10:05.2, we will get the values 2019, 8, and 1 for the year, month, and day variables respectively.