SAS Class Notes
Modifying Data


1.0 SAS statements and procs in this unit

proc format Creates formats (aka value labels)
label Creates labels for variables
renameChanges the name of a variable in a data step
if then Executes a statement only if the condition is true
functions Creating new variables using SAS functions
mergeMerge files

2.0 Demonstration and explanation

Proc Format, labels and renaming variables

Let's see what the data set looks like. We will continue to use the same data set hs0 from last unit.

libname in 'c:\sas_data\';
proc contents data = in.hs0;
run;

Now let's format one of the variables, schtyp. To create value labels, we need to use proc format. Then we can apply those value labels to the appropriate variable. In the example below, proc freq with the format statement is used to create a table with the value labels.

proc format;
  value scl 1 = "public"
            2 = "private";
run;
proc freq data = in.hs0;
  tables schtyp;
  format schtyp scl.;
run;

We can permanently apply a value label to a variable in a data step using the format statement. Note that this data step also produces a temporary dataset hs0b which is based on the file c:\sas_data\hs0.

data hs0b;
  set "c:\sas_data\hs0" ;
  * or use set in.hs0;
  format schtyp scl.;
run;

In this data step, we will label the dataset and add a variable label to the variable schtyp.  The label option on the data statement sets the dataset label. The label statement in the data step assigns variable schtyp in the dataset hs0b.

data hs0b(label="High School and Beyond");
  set hs0b;
  label schtyp = "type of school";
  rename gender = female;
run;
We will now look at the effects of the data step using proc contents.
proc contents data = hs0b;
run;

In the data step below we change the name of the variable schtyp to public, and gender to female. Then we use proc contents to see that the changes have been made.

data hs0b;
   set hs0b (rename=(schtyp=public gender=female));
run;

proc contents data=hs0b;
run;

Putting things together in a long data step

Now we will run a longer data step to do a variety of tasks.  Comments are used to explain the program. (Note the code below replicates some of the code above.)
proc format;
  * create value labels for schtyp ;
  value scl 1 = "public"
            2 = "private";

  * create value labels for grade ;
  value abcdf 0 = "F" 
              1 = "D" 
              2 = "C" 
              3 = "B" 
              4 = "A";

  * create value labels for female ;
  value fm 1 = "female"
           0 = "male";
run;

* create data file hs1, label it ;
data hs1(label="High School and Beyond") ;

  set in.hs0 (rename=(gender=female));  

  * label the variable schtyp ;
  label schtyp = "type of school";
  
  * apply value labels to schtyp;
  format schtyp scl.;

  * the if-then statements create a new variable, called prog,
    which is numeric variable ;
  if prgtype = "academic"   then prog = 1;
  if prgtype = "general"    then prog = 2;
  if prgtype = "vocational" then prog = 3;
    
  * apply value labels to female;
  format female fm.;

  * the if statement recodes values of 5 in the variable race to be missing (.) ;
  if race = 5 then race = .;

  * create a variable called total that is the sum of read, write, math, and science ;
  total = read + write + math + science;

  * the if-then statements recode the variable total into the variable grade ; 
  if (total < 80) then grade = 0;  
  if (80  <= total  <  110) then grade = 1; 
  if (110 <= total  <  140) then grade = 2; 
  if (140 <= total  <  170) then grade = 3; 
  if (total  >= 170) then grade = 4;
  if (total = .) then grade = .;

 * apply value labels to variable grade;
  format grade abcdf.;
  
run;

Let's check to see that everything worked as planned.

proc contents data = hs1;
run;
proc print data = hs1 (obs = 20);
run;
proc freq data = hs1;
  tables schtyp*female;
run;

Permanently save the dataset as 'c:\sas_data\hs1'.

data in.hs1;
  set hs1;
run;

Creating new variables using procedures

There are also a number of SAS procedures and functions that can be used to modify your data. One such procedure is proc standard, which can be used to standardize your variables (i.e., give the a mean of 0 and a standard deviation of 1). Below we use proc standard to standardize the variables read and write. Note that the variables read and write have been replace by standardized versions of those variables.

proc standard data = hs1 mean=0 std=1 out=hs1b;
   var read write ;
run;

proc print data=hs1b (obs=10);
run;

Using SAS functions

In the long data step above, we created the variable total by adding the variables read, write, math and science together (i.e., total = read + write + math + science). We could do something similar using the sum() function, but there is a major difference between the two methods. That difference is in how missing values are handled. When we add items using "+", a case with missing values on any of the variables listed will have a missing value for the resulting variable. In other words, if a case has valid values of read, write and math, but a missing value for science, total will be equal to  missing for that case. If we use the sum() function, any missing values will be treated as though they were zero, and the new variable will be equal to missing only if all of the variables listed are missing. Which method is most appropriate depends on the situation and what you are trying to achieve. The code below shows how to use the sum() function in a data step. Together with the sum function, we also show the ordinal function, just for the fun of it.

data hs1b;
  set hs1;
  total2 = sum(of read write math science);
  * similarly, mean, max, min and more;
  x3 = ordinal(3, read, write, math, science);
run;

proc print data=hs1b (obs=20);
  var read write math science total total2 x3;
run;

Creating group-level variables using proc means

Now we want to create a new variable that is equal to the mean of read for each of the three types of programs. First we sort the dataset using proc sort, then we use proc means to create new dataset that contains the mean of read by prog. Finally, we merge the two datasets (using the merge statement in the data step), matching on the variable prog. We will discuss the merging process more in the next unit, Managing Data.

* creating group-level variables;
* first try;
proc means data = hs1b mean;
   var read math write;
   class prog;
run;
* second try, it looks good and try to save it to a data set;
proc means data = hs1b;
  var read math write;
  class prog;
  output out = test mean = mean;
run;
proc print data = test;
run;
* third try, fix a couple of things;
proc means data = hs1b nway;
   var read math write;
   class prog ;
   output out = meanscores 
         mean =  
         max = /autoname;
run;
proc print data = meanscores;
run;
* sort the data;
proc sort data = hs1;
   by prog;
run;
* merge the two data sets, matching on prog and drop extra variables from readmean;
data merged;
   merge hs1 meanscores ;
   by prog;
run;
proc print data = merged (obs=20);
run;

* another version of it;
proc means data = hs1b nway;
   var read math write;
   class prog ;
   output out = sumstat 
   mean(read) = mread
   mean(math) = mmath
   n(read)  = vcase_read
   n(write) = vcase_write;
run;
proc print data = sumstat;
run;

3.0 For more information

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.