|
|
|
||||
|
|
|||||
We will illustrate creating and replacing variables in SAS using a data file about 26 automobiles with their make, price, mpg, repair record in 1978 (rep78), and whether the car was foreign or domestic (foreign). The program below reads the data and creates a temporary data file called "auto". Please note that there are two missing values for mpg in the data file (coded as a single period).
We will create one new variable to go along with the existing ones. First, we will create cost so that it gives us the price in thousands of dollars. Then we will create mpgpd which will stand for miles per gallon per thousand dollars. In each case, we just type the variable name, followed by an equal sign, followed by an expression for the value.
DATA auto; INPUT make $ price mpg rep78 foreign; cost = ROUND( price / 1000 ); mpgptd = mpg / price; DATALINES; AMC 4099 22 3 0 AMC 4749 17 3 0 AMC 3799 22 3 0 Audi 9690 . 5 1 Audi 6295 23 3 1 BMW 9735 25 4 1 Buick 4816 20 3 0 Buick 7827 15 4 0 Buick 5788 18 3 0 Buick 4453 26 3 0 Buick 5189 20 3 0 Buick 10372 16 3 0 Buick 4082 19 3 0 Cad. 11385 14 3 0 Cad. 14500 14 2 0 Cad. 15906 21 3 0 Chev. 3299 29 3 0 Chev. 5705 16 4 0 Chev. 4504 . 3 0 Chev. 5104 22 2 0 Chev. 3667 24 2 0 Chev. 3955 19 3 0 Datsun 6229 23 4 1 Datsun 4589 35 5 1 Datsun 5079 24 4 1 Datsun 8129 21 4 1 ; RUN;PROC PRINT DATA=auto; RUN;
Here is the output of the proc print. You can compare the output to the original data.
OBS MAKE PRICE MPG REP78 FOREIGN COST MPGPTD 1 AMC 4099 22 3 0 4 .0053672 2 AMC 4749 17 3 0 5 .0035797 3 AMC 3799 22 3 0 4 .0057910 4 Audi 9690 . 5 1 10 . 5 Audi 6295 23 3 1 6 .0036537 6 BMW 9735 25 4 1 10 .0025681 7 Buick 4816 20 3 0 5 .0041528 8 Buick 7827 15 4 0 8 .0019164 9 Buick 5788 18 3 0 6 .0031099 10 Buick 4453 26 3 0 4 .0058388 11 Buick 5189 20 3 0 5 .0038543 12 Buick 10372 16 3 0 10 .0015426 13 Buick 4082 19 3 0 4 .0046546 14 Cad. 11385 14 3 0 11 .0012297 15 Cad. 14500 14 2 0 15 .0009655 16 Cad. 15906 21 3 0 16 .0013203 17 Chev. 3299 29 3 0 3 .0087905 18 Chev. 5705 16 4 0 6 .0028046 19 Chev. 4504 . 3 0 5 . 20 Chev. 5104 22 2 0 5 .0043103 21 Chev. 3667 24 2 0 4 .0065449 22 Chev. 3955 19 3 0 4 .0048040 23 Datsun 6229 23 4 1 6 .0036924 24 Datsun 4589 35 5 1 5 .0076269 25 Datsun 5079 24 4 1 5 .0047253 26 Datsun 8129 21 4 1 8 .0025833
Note that cost is just a one or two-digit value. The vehicle that achieves the best mpgptd is the Chev. for observation 17 which gets 9+ miles per gallon for every thousand dollars in price. The Cad. in observation 14 has the worst mpgptd.
Also note that there are two missing values for mpgptd because of the missing values in mpg.
The variable rep78 is coded 1 through 5 standing for poor, fair, average, good and excellent. We would like to change rep78 so that it has only three values, 1 through 3, standing for below average, average, and above average. We will do this by creating a new variable called repair and recoding the values of rep78 into it.
We will also create a new variable called himpg that is a dummy coding of mpg. All vehicles with better than 20 mpg will be coded 1 and those with 20 or less will be coded 0.
SAS does not have a recode command, so we will use a series of if-then/else commands in a data step to do the job. This data step creates a temporary data file called auto2.
DATA auto2;
SET auto;
repair = .;
IF (rep78=1) or (rep78=2) THEN repair = 1;
IF (rep78=3) THEN repair = 2;
IF (rep78=4) or (rep78=5) THEN repair = 3;
himpg = .;
IF (mpg <= 20) THEN himpg = 0;
IF (mpg > 20) THEN himpg = 1;
RUN;
Note that we begin by setting repair and himpg to missing, just in case we make a mistake in the recoding. Proc freq will show us how the recoding worked.
PROC FREQ DATA=auto2; TABLES repair*rep78 repair*himpg / MISSING; RUN;TABLE OF REPAIR BY REP78 REPAIR REP78 Frequency| Percent | Row Pct | Col Pct | 2| 3| 4| 5| Total ---------+--------+--------+--------+--------+ 1 | 3 | 0 | 0 | 0 | 3 | 11.54 | 0.00 | 0.00 | 0.00 | 11.54 | 100.00 | 0.00 | 0.00 | 0.00 | | 100.00 | 0.00 | 0.00 | 0.00 | ---------+--------+--------+--------+--------+ 2 | 0 | 15 | 0 | 0 | 15 | 0.00 | 57.69 | 0.00 | 0.00 | 57.69 | 0.00 | 100.00 | 0.00 | 0.00 | | 0.00 | 100.00 | 0.00 | 0.00 | ---------+--------+--------+--------+--------+ 3 | 0 | 0 | 6 | 2 | 8 | 0.00 | 0.00 | 23.08 | 7.69 | 30.77 | 0.00 | 0.00 | 75.00 | 25.00 | | 0.00 | 0.00 | 100.00 | 100.00 | ---------+--------+--------+--------+--------+ Total 3 15 6 2 26 11.54 57.69 23.08 7.69 100.00 TABLE OF REPAIR BY HIMPG REPAIR HIMPG Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 1 | 1 | 2 | 3 | 3.85 | 7.69 | 11.54 | 33.33 | 66.67 | | 7.69 | 15.38 | ---------+--------+--------+ 2 | 9 | 6 | 15 | 34.62 | 23.08 | 57.69 | 60.00 | 40.00 | | 69.23 | 46.15 | ---------+--------+--------+ 3 | 3 | 5 | 8 | 11.54 | 19.23 | 30.77 | 37.50 | 62.50 | | 23.08 | 38.46 | ---------+--------+--------+ Total 13 13 26 50.00 50.00 100.00
Uh oh, there's a problem with himpg. There are no missing values for himpg even though there were two missing values of mpg. SAS treats missing values (values coded with a . ) as the smallest number possible (i.e., negative infinity). When we recoded mpg we wrote
IF (mpg <= 20) THEN himpg = 0;
which converted all values of mpg that were 20 or less into a value of 0 for himpg. Since a missing value is also less than 20, the missing values got recoded to 0 as well. (It is unforeseen mistakes like this that make it so important to check every variable that you recode.) Let's try recoding himpg again, being careful to properly treat missing values like this:
IF (. < mpg <= 20) THEN himpg = 0;
The complete program, with the fixed if statement, is shown below.
DATA auto2;
SET auto;
repair = .;
IF (rep78=1) or (rep78=2) THEN repair = 1;
IF (rep78=3) THEN repair = 2;
IF (rep78=4) or (rep78=5) THEN repair = 3;
himpg = .;
IF (. < mpg <= 20) THEN himpg = 0;
IF (mpg > 20) THEN himpg = 1;
RUN;
Now let's use proc freq again to check the recoding.
PROC FREQ DATA=auto2; TABLES repair*himpg / MISSING; RUN;TABLE OF REPAIR BY HIMPG REPAIR HIMPG Frequency| Percent | Row Pct | Col Pct | .| 0| 1| Total ---------+--------+--------+--------+ 1 | 0 | 1 | 2 | 3 | 0.00 | 3.85 | 7.69 | 11.54 | 0.00 | 33.33 | 66.67 | | 0.00 | 9.09 | 15.38 | ---------+--------+--------+--------+ 2 | 1 | 8 | 6 | 15 | 3.85 | 30.77 | 23.08 | 57.69 | 6.67 | 53.33 | 40.00 | | 50.00 | 72.73 | 46.15 | ---------+--------+--------+--------+ 3 | 1 | 2 | 5 | 8 | 3.85 | 7.69 | 19.23 | 30.77 | 12.50 | 25.00 | 62.50 | | 50.00 | 18.18 | 38.46 | ---------+--------+--------+--------+ Total 2 11 13 26 7.69 42.31 50.00 100.00
There, that's better, this time there are two missing values for himpg.
- Set values to missing and then recode them.
- Use new variable names when you create or recode variables. Avoid constructions like this, total = total + sub1 + sub2; that reuse the variable name total.
- Use the missing option with proc freq to make sure all missing values are accounted for.
- For more information about missing data in SAS, see SAS Learning Module: Missing data in SAS .
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services