Converting a categorical variable to dummy variables can be a tedious process when done using a series of IF-THEN statements. Consider the following example data file.
DATA auto ;
LENGTH make $ 20 ;
INPUT make $ 1-17 price mpg rep78 ;
CARDS;
AMC Concord       4099 22 3
AMC Pacer         4749 17 3
Audi 5000         9690 17 5
Audi Fox          6295 23 3
BMW 320i          9735 25 4
Buick Century     4816 20 3
Buick Electra     7827 15 4
Buick LeSabre     5788 18 3
Cad. Eldorado     14500 14 2
Olds Starfire     4195 24 1
Olds Toronado     10371 16 3
Plym. Volare      4060 18 2
Pont. Catalina    5798 18 4
Pont. Firebird    4934 18 1
Pont. Grand Prix  5222 19 3
Pont. Le Mans     4723 19 3
;
RUN;
The variable rep78 is coded with the values 1 through 5, representing various repair histories. We may create dummy variables for rep78 by writing separate assignment statements for each value as follows:
DATA auto2 ;
SET auto ;
IF rep78 = 1 THEN rep78_1 = 1;
ELSE rep78_1 = 0;
IF rep78 = 2 THEN rep78_2 = 1;
ELSE rep78_2 = 0;
IF rep78 = 3 THEN rep78_3 = 1;
ELSE rep78_3 = 0;
IF rep78 = 4 THEN rep78_4 = 1;
ELSE rep78_4 = 0;
IF rep78 = 5 THEN rep78_5 = 1;
ELSE rep78_5 = 0;
RUN;
PROC FREQ DATA=auto2;
TABLES rep78*rep78_1*rep78_2*rep78_3*rep78_4*rep78_5 / list ;
RUN;
As you can see from the PROC FREQ output below, the dummy variables were properly created, but it required a lot of IF-THEN/ELSE statements.
[Output below edited for readability]
REP78  REP78_1  REP78_2  REP78_3  REP78_4  REP78_5  Freq  Percent
-----------------------------------------------------------------
    1        1        0        0        0        0     2     12.5
    2        0        1        0        0        0     2     12.5
    3        0        0        1        0        0     8     50.0
    4        0        0        0        1        0     3     18.8
    5        0        0        0        0        1     1      6.3
Had rep78 ranged from 1 to 10 or 1 to 20, this approach would require a lot of typing and would be prone to error. Here is a shortcut you could use when you need to create dummy variables.
DATA auto3;
set auto;
ARRAY dummys {*} 3. rep78_1 - rep78_5;
DO i=1 TO 5;
dummys(i) = 0;
END;
dummys( rep78 ) = 1;
RUN;
PROC FREQ DATA=auto3;
TABLES rep78*rep78_1*rep78_2*rep78_3*rep78_4*rep78_5 / list ;
RUN;
As you see below, the dummy variables were created successfully.
[Output below edited for readability]
REP78  REP78_1  REP78_2  REP78_3  REP78_4  REP78_5  Freq  Percent
-----------------------------------------------------------------
    1        1        0        0        0        0     2     12.5
    2        0        1        0        0        0     2     12.5
    3        0        0        1        0        0     8     50.0
    4        0        0        0        1        0     3     18.8
    5        0        0        0        0        1     1      6.3
Let's look at each statement in some detail.
ARRAY dummys {*} 3. rep78_1 - rep78_5;
This statement defines an array called dummys and creates the five dummy variables rep78_1 to rep78_5, giving each the minimum storage length required, i.e., 3 bytes. Replace rep78_1 - rep78_5 with the names you want for your dummy variables. The asterisk in the braces tells SAS to count the array elements automatically, based on the number of variables listed at the end of the statement.
DO i=1 TO 5; dummys(i) = 0; END;
This initializes each dummy variable to 0. Change 5 to the number of values your variable can take.
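If you would rather not hard-code the number of levels, one possible variation (a sketch, not part of the original example) is to let the DIM function return the number of elements in the array. The data set name auto4 is made up for illustration, and the DROP statement is an optional addition that keeps the loop counter i out of the output data set.

```sas
DATA auto4;
  SET auto;
  ARRAY dummys {*} 3. rep78_1 - rep78_5;
  /* DIM(dummys) returns the number of array elements (5 here) */
  DO i = 1 TO DIM(dummys);
    dummys(i) = 0;
  END;
  dummys( rep78 ) = 1;
  /* Optional: keep the loop counter out of the output data set */
  DROP i;
RUN;
```

With this version, only the variable list on the ARRAY statement needs to change when the number of categories changes, provided the categories are still coded 1, 2, 3, and so on.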
dummys(rep78) = 1;
This sets the appropriate dummy variable to 1. For example, if rep78 = 3, then dummys(rep78) = 1 assigns a value of 1 to the third element in the array, i.e., it assigns 1 to rep78_3. Change rep78 to the name of the variable for which you want to create dummy variables.
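Note that this assignment assumes rep78 always holds a valid subscript. In the example data every car has a value from 1 to 5, but a missing or out-of-range value would make dummys(rep78) fail with an array subscript out of range error. A minimal sketch of one way to guard against that follows. The data set name auto5 is hypothetical, and leaving all dummies at 0 for such observations is an assumption; you may prefer to set them to missing instead.

```sas
DATA auto5;
  SET auto;
  ARRAY dummys {*} 3. rep78_1 - rep78_5;
  /* DIM(dummys) is the number of array elements (5 here) */
  DO i = 1 TO DIM(dummys);
    dummys(i) = 0;
  END;
  /* Index the array only when rep78 is a valid subscript; */
  /* a missing rep78 fails this test, so its dummies all stay 0 */
  IF 1 <= rep78 <= DIM(dummys) THEN dummys( rep78 ) = 1;
  DROP i;
RUN;
```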
These pages are Copyrighted (c) by UCLA Academic Technology Services