UCLA Academic Technology Services

SAS Class Notes 2.0
Fancy graphics and other cool code


The SAS program.

Data sets: hsb2, uis.

1.0 SAS statements and procs in this unit

proc gplot creates a variety of plots including scatter plots
proc insight used here to create a scatter plot matrix
proc boxplot creates boxplots fancy and plain
proc reg creates useful diagnostic plots and has powerful model selection routines
proc lifetest used here to create Kaplan-Meier curves
proc phreg used here to estimate the survival function for specific covariate patterns
proc g3d creates 3-dimensional plots

2.0 Basic exploratory plots: Scatter plots

Single scatter plot using the symbol statement to change the plotting symbols to purple dots.
The goptions statement appears before every graph because reset=all restores all graphics options to their defaults. In the symbol statement, note that the value and color options can be abbreviated to v and c respectively. Certain procedures, such as gplot and reg, can be used interactively; to stop them we need to end with a quit statement.

options nocenter nodate nonumber;
data hsb2;
  set 'd:\hsb2';
run;

goptions reset=all;
symbol value=dot color=purple;
proc gplot data=hsb2;
  plot write*math;
run;
quit;
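
As a brief illustration of the abbreviated option names, the same plot can be requested with v= and c=. This is only a sketch: the h= (symbol height) option and its value are added here for illustration and are not part of the original program.

```sas
goptions reset=all;
* v= is short for value= and c= is short for color=;
* h= sets the symbol height (illustrative addition);
symbol v=dot c=purple h=1.2;
proc gplot data=hsb2;
  plot write*math;
run;
quit;
```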

We can also modify the symbol to use the id variable instead of purple dots. This is especially useful to identify outliers or other interesting observations.

*adding an outlier;
data outlier;
  if _n_ = 1 then do;
  write = 50;
  math = 99;
  id = 201;
  end;
  output;  
  set hsb2;
run;

*using labels for the symbols to identify the outlier;
goptions reset=all;
symbol1 pointlabel = ("#id") font=simplex value=none;
proc gplot data=outlier;
  plot write*math=1;
run;
quit;

If we are interested in the relationships among many variables, it can be very tedious to generate scatter plots one at a time. By using proc insight we can create a scatter plot matrix.

proc insight data=hsb2;
  scatter write math socst female * write math socst female ;
run; 

We can also create a scatter plot where we have different symbols depending on the gender of the subjects. This can be used to check whether the relationship between write and math is linear within each gender group. The axis statement changes the label to 'Writing Scores' and rotates it 90 degrees so that it runs vertically instead of horizontally. We need a symbol statement for each symbol we want to use in the plot (thus, we need two symbol statements). The vaxis option in the plot statement specifies that the vertical axis should use the options given in the axis statement.

goptions reset=all;
symbol1 v=circle c=blue;
symbol2 v=dot  c=red;
axis label=(a=90 'Writing Scores');
proc gplot data=hsb2;
  plot write*math=female / vaxis=axis;
run;
quit;

3.0 Basic exploratory plots: Boxplots and histograms

If we want to compare the distributions of different groups, it can be very helpful to look at their boxplots. The options in the plot statement can really improve the look of the boxplots. By specifying schematicid in the boxstyle option, the whiskers are drawn from the upper edge of the box to the largest value within the upper fence and from the lower edge of the box to the smallest value within the lower fence, and outliers are identified by their id number. (The lower fence is located 1.5×IQR (interquartile range) below the 25th percentile and the upper fence is located 1.5×IQR above the 75th percentile.) The nohlabel option suppresses the label on the horizontal axis; the cboxes option specifies the color of the outline of the boxes; the cboxfill option specifies the color inside the boxes; the idcolor option specifies the color of the symbols for the outliers; the boxwidth option specifies the width of the boxes. The id statement specifies the variable used to identify the outliers.

proc sort data=hsb2 out=sort;
 by prog;
run;

goptions reset=all;
proc boxplot data=sort;
 plot (math socst)*prog / boxwidth=10;
run;

*demonstrate the nice options;
proc boxplot data=sort;
 plot (math socst)*prog / boxstyle=schematicid nohlabel cboxes=blue cboxfill=yellow
                          idcolor=red boxwidth=10;
 id id;
run;
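
To make the fence definition concrete, the following sketch computes the 1.5×IQR fences for math within each prog group. The data set name fences and the variables q1, q3, iqr, lower_fence and upper_fence are names chosen here for illustration. The quartiles reported by proc means may differ slightly from those used by proc boxplot if the two procedures use different percentile definitions.

```sas
* quartiles of math within each prog group (data set sort is already sorted by prog);
proc means data=sort noprint;
  by prog;
  var math;
  output out=fences q1=q1 q3=q3;
run;

* fences are 1.5*IQR below the 25th and above the 75th percentile;
data fences;
  set fences;
  iqr = q3 - q1;
  lower_fence = q1 - 1.5*iqr;
  upper_fence = q3 + 1.5*iqr;
run;

proc print data=fences;
  var prog q1 q3 iqr lower_fence upper_fence;
run;
```

Observations with math below lower_fence or above upper_fence for their group are the ones a schematic boxplot would mark as outliers.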

We can create other useful exploratory graphs including histograms. These are obtained through proc univariate and also have some very nice options. The cfill option is used to specify the color of the bars; the normal option fits a normal density curve on the graph (the blue curve); the kernel option fits a kernel density curve and the color option is used to specify the color of this curve (the red curve in the plot); the cbarline option is used to specify the color of the outline of the bars.

goptions reset=all;
proc univariate data=hsb2 noprint;
 histogram socst/ cfill=yellow normal kernel color = red cbarline=blue;
run;

4.0 Regression: Diagnostic plots and model selection methods

The regression procedure in SAS can generate a large number of plots and has powerful model selection capabilities. First we will demonstrate some of the diagnostic plots. The first plot statement requests the basic plot of the residuals versus the fitted values as well as a plot of the residuals versus math2. The second of these plots indicates whether we should include math2 as a predictor in our model (since no systematic pattern is present, we conclude that we do not need to include it). The second plot statement generates a plot of the deleted studentized residuals versus the fitted values, which is very useful for detecting outliers. The third plot statement is a normal probability (P-P) plot of the residuals, which checks the normality assumption of our regression model; this plot emphasizes deviations from normality in the middle of the distribution. The last plot statement is a normal quantile (Q-Q) plot of the residuals. It is another check of the normality assumption, but here the emphasis is on deviations from normality in the tails of the distribution. Note that we did not have to include math2 as a predictor in the model; we can use it in our plots because we listed it in a var statement.

*Creating the math^2 predictor to be used in the regression.;
data hsb;
  set hsb2;
  math2 = math*math;
run;

goptions reset=all;
proc reg data=hsb;
  var math2;
  model write = female math socst;
  plot r.*p. r.*math2/ cline=black;
  plot rstudent.*p. / vref=-2.5 2.5 cline=black;
  plot npp.*r. /  modellab=' Quantile plot:';
  plot r.*nqq. / modellab=' Quantile plot:' noline;
  symbol v=circle c=red h=.8;
run;
quit;
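
The outlier check in the second plot statement can also be done numerically. The sketch below saves the deleted studentized residuals with an output statement and lists the observations that fall outside the same ±2.5 reference lines. The data set name diag and the variable names fitted, resid and rstud are chosen here for illustration.

```sas
* save fitted values, residuals and deleted studentized residuals;
proc reg data=hsb noprint;
  model write = female math socst;
  output out=diag p=fitted r=resid rstudent=rstud;
run;
quit;

* list the observations beyond the +/-2.5 reference lines;
proc print data=diag;
  where abs(rstud) > 2.5;
  var id write fitted resid rstud;
run;
```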

By outputting the predicted values into a new data set called temp, we can generate a graph of the predicted scores with different symbols for each gender group.

proc reg data=hsb2;
  model write = female math socst;
  output out=temp p=predict;
run;
quit;

symbol1 v=dot c=red;
symbol2 v=circle c=blue;
axis1 label=(a=90 'Predicted values');
proc gplot data=temp;
  plot predict*math=female / vaxis=axis1;
  plot predict*socst=female / vaxis=axis1;
run;
quit;

proc reg offers many different model selection methods. Here we demonstrate selection based on Mallows' Cp, which indicates which subsets have less bias in their estimates. Models whose Cp value lies approximately on the line Cp=p tend to have little bias, and models with values above the line tend to have more bias, whereas models with values below the line are considered to have no bias (they fall below the line only because of sampling error).
Next we have the R-square selection method, which lists the models by predictor subset size. The cp option indicates that we also want Mallows' Cp listed for each model; the best option specifies the maximum number of models listed for each predictor subset size; the start option specifies the minimum number and the stop option specifies the maximum number of predictors in the models.

data selection;
  set hsb2;
  math2 = math*math;
  mathf = math*female;
  mathsch = math*schtyp;
  mathsci = math*science;
  sciencef = science*female;
  progsch = prog*schtyp;
run;

proc reg data=selection;
  model write = math socst female schtyp prog science math2 mathf mathsch mathsci 
                sciencef progsch / selection=rsquare noprint;
  plot cp.*np. / cmallows=blue vaxis=0 to 15 by 5;
run;
quit;

*Using R-squared model selection;
proc reg data=selection;
  model write = math socst female schtyp prog science math2 mathf mathsch mathsci 
                sciencef progsch / selection=rsquare cp best=3 start=2 stop=6;
run;
quit;
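
For reference, Mallows' Cp for a subset model with p parameters (including the intercept) is Cp = SSE_p / s^2 - (n - 2p), where SSE_p is the error sum of squares of the subset model, s^2 is the mean squared error of the full model, and n is the number of observations. As a sketch of an alternative to reading Cp off the R-square listing, proc reg can also rank the subsets by Cp directly with selection=cp. The value best=5 is an arbitrary choice made here for illustration.

```sas
*Ranking subsets directly by Mallows Cp (best=5 is an illustrative choice);
proc reg data=selection;
  model write = math socst female schtyp prog science math2 mathf mathsch mathsci
                sciencef progsch / selection=cp best=5;
run;
quit;
```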

5.0 Survival analysis

We can get the Kaplan-Meier curves by using the plots option in proc lifetest.

data uis;
  set 'd:\uis';
run;

goptions reset=all;
proc lifetest data=uis plots=(s);
  time time*censor(0);
  strata treat;
  symbol c=red;
run;
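
If the Kaplan-Meier estimates are needed as data rather than only as a plot, proc lifetest can write them to a data set with the outsurv option. This is a minimal sketch: the data set name km is chosen here for illustration, and the variable names in km are assigned by proc lifetest, so it is worth printing a few rows before using them in further steps.

```sas
* save the Kaplan-Meier estimates for each treatment group;
proc lifetest data=uis outsurv=km noprint;
  time time*censor(0);
  strata treat;
run;

* inspect the first rows to see which variables were created;
proc print data=km (obs=10);
run;
```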

We want to obtain a graph of the survival functions for the subjects who are 30 years old (age=30), have only 5 previous drug treatments (ndrugtx=5), received their treatment at site A (site=0). We would like to have one curve for each treatment group.

We can obtain the estimated survival function for specific covariate patterns by using the baseline statement in proc phreg. The out option names the output data set; the covariates option names the data set that contains the covariate pattern to be used; the survival option names the variable in the output data set that holds the survival function estimate; the nomean option leaves out the estimate computed at the sample means of the covariates.

data cov;
  age = 30;
  ndrugtx = 5;
  treat = 1;
  site = 0;
  agesite = 0;
run;

proc phreg data=uis noprint;
  model time*censor(0) = age ndrugtx treat site agesite; 
  agesite = age*site;
  baseline out=surv covariates=cov survival=surv/ nomean;
run;

data cov_short;
  age = 30;
  ndrugtx = 5;
  treat = 0;
  site = 0;
  agesite = 0;
run;

proc phreg data=uis noprint;
  model time*censor(0) = age ndrugtx treat site agesite; 
  agesite = age*site;
  baseline out=surv_short covariates=cov_short survival=surv/ nomean;
run;

data combo;
  set surv surv_short;
run;

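*adding one extra point at time 1172 for each treatment group;
data combo;
  if _n_ = 1 then do;
    *write both added points before any row of combo is read,
     so that the first row of combo is not overwritten;
    time = 1172;
    surv = 0.08429;
    treat = 0;
    output;
    time = 1172;
    surv = 0.15060;
    treat = 1;
    output;
  end;
  set combo;
  output;
run;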

proc sort data=combo;
  by time;
run;

goptions reset=all;
symbol1 c=red v=triangle h=.6 i=stepjll;
symbol2 c=blue v=circle h=.6 i=stepjll;
axis1 label=(a=90 'Survivorship function');
proc gplot data=combo;
 plot surv*time=treat / vaxis=axis1;
run;
quit;
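
The two proc phreg runs above fit the same model and differ only in the covariate pattern. As a sketch of a shorter route, both patterns can be placed in one covariates data set so that a single run returns both curves. The names cov2 and surv2 are chosen here for illustration, and the sketch relies on the out data set carrying the covariate values (including treat) along with the estimates.

```sas
* one row per treatment group, other covariates held fixed;
data cov2;
  age = 30;
  ndrugtx = 5;
  site = 0;
  agesite = 0;
  do treat = 0 to 1;
    output;
  end;
run;

proc phreg data=uis noprint;
  model time*censor(0) = age ndrugtx treat site agesite;
  agesite = age*site;
  baseline out=surv2 covariates=cov2 survival=surv / nomean;
run;

goptions reset=all;
symbol1 c=red v=triangle h=.6 i=stepjll;
symbol2 c=blue v=circle h=.6 i=stepjll;
axis1 label=(a=90 'Survivorship function');
proc gplot data=surv2;
  plot surv*time=treat / vaxis=axis1;
run;
quit;
```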

6.0 Three dimensional plots

We would like to get a better understanding of the relationship between socst, math, write and prog. We will use the three continuous variables for the three axes and a different color for each category of prog. proc g3d generates three-dimensional plots: the scatter statement produces three-dimensional scatter plots, whereas the plot statement produces three-dimensional surface plots.

In the data step we are creating a string variable for prog called colorval which equals a different color for each level of prog, for example when prog=1 then colorval = red.  The colorval variable has to be in this format in order for us to use it in the color option in the scatter statement in proc g3d.  This is the option which determines the color of the symbols in the scatter plots generated by proc g3d.  In the scatter statement the first variable is used as the y-axis, the second is used as the x-axis and the variable after the equal sign is the z-axis.  The shape option is used to specify the shape of the symbols used in the plot; the caxis option is used to specify the color of the axes; the noneedle option suppresses the lines connecting the symbols to the x-y plane. 

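*Creating the color variable "colorval";
data color;
  set hsb2;
  *declare the length first so that "blue" and "green" are not truncated to 3 characters;
  length colorval $ 8;
  if prog=1 then colorval="red";
  else if prog=2 then colorval="blue";
  else colorval="green";
run;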

proc g3d data=color;
   scatter socst*math=write / shape='pillar' color=colorval caxis=blue;
   scatter socst*math=write / color=colorval caxis=blue noneedle;
run;
quit;

In order to get a better understanding of the result from a multiple regression model it is sometimes very useful to look at the predicted values in three dimensions. To facilitate an even more in-depth understanding, we will generate three-dimensional graphs from different angles to get the complete picture.

We create the math2, soc2 variables as well as the interaction of math and socst.

data interaction;
  set hsb2;
  mathsc = math*socst;
  math2 = math**2;
  soc2 = socst**2;
run;

Using proc reg to get the parameter estimates.

proc reg data=interaction;
  model write = math socst mathsc female math2 soc2; 
run;
quit;

Using proc means to get the range of math and socst.

proc means data=hsb2 max min;
  var math socst;
run;

If we create a graph using only the values in the original data set, we will not get a nice surface plot; instead we will get a very sparse graph consisting of only a few scattered lines. To get the surface plot we have to fill in the data, so we generate values for math and socst at small regular intervals across the range of each variable. Then we compute the predicted values using the parameter estimates obtained from proc reg. Finally, we use proc g3d with a plot statement to generate the surface plot. The rotate option generates a plot at every 15 degrees of rotation, starting at 0 degrees and continuing to 360 degrees. The cbottom option specifies the color of the bottom surface; the ctop option specifies the color of the top surface; the caxis option specifies the color of the axes; the xticknum and yticknum options specify the number of tick marks on the x-axis and y-axis respectively; the grid option draws reference lines at the major tick marks of all the axes.

data graph;
   do math= 30 to 75 by 2.5;
      do socst= 25 to 75 by 2.5;
         yhatf = 6.77 + .74*math + .34*socst - .009*math*socst + .002*math**2 + .004*socst**2;
         yhatm = 2 + .74*math + .34*socst - .009*math*socst + .002*math**2 + .004*socst**2;
         output;
      end;
   end;
run;

proc print data=graph (obs=10);
run;

proc g3d data=graph;
 plot math*socst = yhatf / rotate=0 to 360 by 15 cbottom=black ctop=red caxis=blue 
                            xticknum=5 yticknum=5 grid;
run;
quit;

The SAS website has many useful code examples, including a page with examples of various types of fancy graphics: http://support.sas.com/sassamples/graphgallery/


These pages are Copyrighted (c) by UCLA Academic Technology Services

