UCLA Academic Technology Services HomeServicesClassesContactJobs

SAS Class Notes
Modifying Data


1.0 SAS statements and procs in this unit

proc format Creates formats (aka value labels)
label Creates labels for variables
renameChanges the name of a variable in a data step
if then Executes a statement only if the condition is true
mergeMerge files

2.0 Demonstration and explanation

Let's see what the data set looks like.

proc contents data = "c:\sas_data\hs0";
run;

Now let's format one of the variables, schtyp. To create value labels, we need to use proc format. Then we can apply those value labels to the appropriate variable. In the example below, proc freq with the format statement is used to create a table with the value labels. Note that the format is not permanently assigned to the variable. In other words, we need to tell SAS to use the format whenever we want to see the value labels instead of the numbers.

proc format;
  value scl 1 = "public"
            2 = "private";
run;
proc freq data = "c:\sas_data\hs0";
  tables schtyp;
  format schtyp scl.;
run;

We can permanently apply a value label to a variable in a data step using the format statement. Note that this data step also produces a temporary dataset hs0b which is based on the file c:\sas_data\hs0.

data hs0b;
	set "c:\sas_data\hs0" ;
	format schtyp scl.;
run;

In this data step, we will label the dataset and add a variable label to the variable schtyp.  The label option on the data statement sets the dataset label. The label statement in the data step assigns variable schtyp in the dataset hs0b.

data hs0b(label="High School and Beyond");
  set hs0b;
  label schtyp = "type of school";
run;
We will now look at the effects of the data step using proc contents.
proc contents data = hs0b;
run;

In the data step below we change the name of the variable schtyp to public, and gender to female. Then we use proc contents to see that the changes have been made.

data hs0b;
   set hs0b (rename=(schtyp=public gender=female));
run;

proc contents data=hs0b;
run;
Now we will run a longer data step to do a variety of tasks.  Comments are used to explain the program. (Note the code below replicates some of the code above.)
proc format;
  * create value labels for schtyp ;
  value scl 1 = "public"
            2 = "private";

  * create value labels for grade ;
  value abcdf 0 = "F" 
              1 = "D" 
              2 = "C" 
              3 = "B" 
              4 = "A";

  * create value labels for female ;
  value fm 1 = "female"
           0 = "male";
run;

* create data file hs1, label it ;
data hs1(label="High School and Beyond") ;

  * read in the sas file c:\sas_data\hs0;
  set "c:\sas_data\hs0";  

  * label the variable schtyp ;
  label schtyp = "type of school";
  
  * apply value labels to schtyp;
  format schtyp scl.;

  * the if-then statements create a new variable, called prog,
    which is numeric variable ;
  if prgtype = "academic" then prog = 1;
  if prgtype = "general" then prog = 2;
  if prgtype = "vocational" then prog = 3;
  
  * create a new variable, called female, which is 
  identical to the variable gender ;
  * and then use drop statement to remove the variable gender from the dataset;
  female = gender;
  drop gender;

  * label the variable prog ;
  label prog = "type of program";

  * label the variable female ;
  label female = "student's gender";
  
  * apply value labels to female;
  format female fm.;

  * the if statement recodes values of 5 in the variable race to be missing (.) ;
  if race = 5 then race = .;

  * create a variable called total that is the sum of read, write, math, and science ;
  total = read + write + math + science;

  * the if-then statements recode the variable total into the variable grade ; 
  if (total < 80) then grade = 0;  
  if (80  <= total  <  110) then grade = 1; 
  if (110 <= total  <  140) then grade = 2; 
  if (140 <= total  <  170) then grade = 3; 
  if (total  >= 170) then grade = 4;
  if (total = .) then grade = .;


  * label the variable grade ;
  label grade = "combined grades of read, write, math, and science";
  
  * apply value labels to variable grade;
  format grade abcdf.;
  
run;

Let's check to see that everything worked as planned.

proc contents data = hs1;
run;

proc print data = hs1 (obs = 20);
run;

proc freq data = hs1;
  tables schtyp female;
run;

Permanently save the dataset as 'c:\sas_data\hs1'.

data 'c:\sas_data\hs1';
	set hs1;
run;

There are also a number of SAS procedures and functions that can be used to modify your data. One such procedure is proc standard, which can be used to standardize your variables (i.e., give the a mean of 0 and a standard deviation of 1). Below we use proc standard to standardize the variables read and write. Note that the variables read and write have been replace by standardized versions of those variables.

proc standard data = hs1 mean=0 std=1 out=hs1b;
   var read write ;
run;

proc print data=hs1b (obs=10);
run;

In the long data step above, we created the variable total by adding the variables read, write, math and science together (i.e., total = read + write + math + science). We could do something similar using the sum() function, but there is a major difference between the two methods. That difference is in how missing values are handled. When we add items using "+", a case with missing values on any of the variables listed will have a missing value for the resulting variable. In other words, if a case has valid values of read, write and math, but a missing value for science, total will be equal to  missing for that case. If we use the sum() function, any missing values will be treated as though they were zero, and the new variable will be equal to missing only if all of the variables listed are missing. Which method is most appropriate depends on the situation and what you are trying to achieve. The code below shows how to use the sum() function in a data step.

data hs1b;
	set hs1;
	total2 = sum(of read write math science);
run;

proc print data=hs1b (obs=20);
	var read write math science socst total total2;
run;

If we wanted the mean of the items, rather than their sum, we could use the function mean() to calculate the means. Note that the mean() function behaves the same way as the sum() function with respect to missing values.

Now we want to create a new variable that is equal to the mean of read for each of the three types of programs. First we sort the dataset using proc sort, then we use proc means to create new dataset that contains the mean of read by prog. Finally, we merge the two datasets (using the merge statement in the data step), matching on the variable prog. We will discuss the merging process more in the next unit, Managing Data.

proc sort data = 'c:\sas_data\hs1';
   by prog;  
run;

proc means data = 'c:\sas_data\hs1' mean ;
   var read;
   by prog;
   output out = readmean mean=m;
run;

* look at the dataset of means;
proc print data = readmean;
run;

* sort the data;
proc sort data = hs1;
   by prog;
run;

* merge the two data sets, matching on prog and drop extra variables from readmean;
data merged;
   merge hs1 readmean;
   by prog;
   drop _TYPE_ _FREQ_;
run;

proc print data = merged (obs=20);
run;

3.0 For more information


How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.