|
|
|
||||
|
|
|||||
This page shows several possible uses of the predict statement in proc nlmixed. For more information about fitting models using proc nlmixed see our FAQ pages listed at the bottom of the page. Below we show how to generate predicted values for each case holding some values constant (e.g. some variables held at their mean), and finally how predict can be used with a random intercept model to get predictions that do or do not contain the random effect.
You can download the data used in this example by clicking here: hsb2.sas7bdat.
In this example we use scores on standardized tests in writing (write), and science (science), as well as student's gender (female) to predict scores on a standardized test of reading (read). This is a normal multiple regression (also known as OLS regression), it is not necessary to use proc nlmixed to fit this model, but it works well to demonstrate the predict command. The SAS code to run this model using nlmixed appears below. After the proc nlmixed statement, we define the parameters in the model. Here xb is the predicted value of read, and b0, b1, b2, and b3 are coefficients. SAS will recognize variable names and mathematical operators in these statements. Any letter, series of letters, or letter(s) followed by number(s) that are not variable names are assumed to be parameter names. Although this model only requires one line to define all the parameters, more complex models may involve more than one statement. Next we use the predict statement to record the predicted value for each case. The statement starts with predict, then gives the expression to record, in this case the predicted value of write, that is xb from the previous line. The out=output1 gives the name of the dataset (output1) in which we wish to place the predicted values. Next is the model statement. Here we define write as distributed (~) normally with mean equal to xb, and variance s2, both of which we wish to estimate (xb is estimated through the coefficients, s2 is estimated based on the model residuals).
proc nlmixed data="d:\data\hsb2"; xb = b0 + b1*female + b2*write + b3*science; predict xb out=output1; model read ~ normal(xb,s2); run;
We have omitted the output from nlmixed, since our primary interest in this FAQ is the new dataset output1. After we run the above model we can run proc contents on the dataset output1. We use the varnum option to list the variables by their order in the dataset, rather than in alphabetical order. The first 8 variables (i.e. ID to SOCST) are from the dataset hsb2 that we used to estimate the above model, the rest of the variables were created by the predict statement.
proc contents data=output1 varnum;
run;
<output omitted>
Variables in Creation Order
# Variable Type Len Label
1 ID Num 4
2 FEMALE Num 3
3 RACE Num 3
4 SES Num 3
5 SCHTYP Num 3 type of school
6 PROG Num 3 type of program
7 READ Num 3 reading score
8 WRITE Num 3 writing score
9 MATH Num 3 math score
10 SCIENCE Num 3 science score
11 SOCST Num 3 social studies score
12 Pred Num 8 Predicted Value
13 StdErrPred Num 8 Standard Error of Prediction
14 DF Num 8 Degrees of Freedom
15 tValue Num 8 t Value
16 Probt Num 8 Pr > |t|
17 Alpha Num 8 Alpha
18 Lower Num 8 Lower Confidence Limit
19 Upper Num 8 Upper Confidence Limit
Let's take a closer look at the new variables. The code below uses proc print to print the values of the 8 new variables for the first 10 observations. The variable Pred contains the predicted value for each case, StdErrPred contains the standard error of that prediction. The variables tValue, and Probt contain the t-value and p-value for a test that the predicted value (Pred) is equal to zero, the degrees of freedom for this test are shown in the variable DF. The variables Lower and Upper contain the upper and lower bounds of the confidence interval for the predicted value. Alpha contains the alpha level for the confidence interval, and alpha of 0.05 corresponds to a 95% confidence interval.
proc print data=output1 (obs=10); var pred stderrpred df tvalue probt alpha lower upper; run;
StdErr
Obs Pred Pred DF tValue Probt Alpha Lower Upper
1 51.1568 0.92525 200 55.2898 3.857E-123 0.05 49.3323 52.9812
2 58.4022 0.98214 200 59.4642 4.2178E-129 0.05 56.4655 60.3389
3 47.2408 1.65745 200 28.5021 2.32949E-72 0.05 43.9724 50.5091
4 50.0545 0.88070 200 56.8346 2.1636E-125 0.05 48.3179 51.7912
5 53.5535 0.77956 200 68.6971 4.7046E-141 0.05 52.0163 55.0907
6 57.5480 0.96435 200 59.6752 2.1568E-129 0.05 55.6464 59.4496
7 56.6150 1.00789 200 56.1718 1.9691E-124 0.05 54.6276 58.6025
8 45.3369 1.11523 200 40.6526 1.25755E-98 0.05 43.1378 47.5361
9 57.7376 0.85866 200 67.2414 2.8581E-139 0.05 56.0444 59.4307
10 53.6672 0.92619 200 57.9442 5.642E-127 0.05 51.8409 55.4936
This example uses the dataset used in the previous section.
You can also request predicted values holding one or more of the variables in the model constant, and using the actual value from each case for other variables. In the example below we hold write at it's mean (52.884), female at one, and use each student's science score to calculate the predicted values. This allows us to examine how predicted values of read change as the student's science score changes, holding writing scores and gender constant. Further below we show some ways to use this information.
The code below is the same as above except for the predict statement. This time, instead of simply specifying that we want predicted values of xb we write out the regression equation, this is necessary in order to fix the values of some of the variables. In the predict statement, the variable name female has been replaced with a 1, and the variable name write has been replaced with the value 52.774 (it's mean). This fixes these values to a constant for the purpose of calculating predicted values. Only the variable name science remains in the equation, so that each case's value on this variable will be used to calculate the predicted values. The results of this predict statement are placed in the dataset output2.
proc nlmixed data="d:\data\hsb2"; xb = b0 + b1*female + b2*write + b3*science; predict b0 + b1*1 + b2*52.774 + b3*science out=output2; model read ~ normal(xb,sd); run;
Using the predicted values in output2 we can calculate the predicted mean of read for female students with average writing scores, and the sample values for science. The average predicted reading score, holding writing score and gender constant, while allowing science to vary is 51.23, with a standard deviation of 3.95.
proc means data=output2;
var pred;
run;
The MEANS Procedure
Analysis Variable : Pred Predicted Value
N Mean Std Dev Minimum Maximum
200 51.2252437 3.9549304 40.8994103 60.0731049
The predicted values can also be used to graph the relationship between science and predicted values of read, holding the other variables constant (if the other variables in the model aren't held constant, the result will not be a single regression line, but a scattering of points). Below we use proc gplot to generate a line graph showing this relationship. The first line of code below clears previous graph setting, the second instructs SAS to connect the data points to form a line, rather than graphing a series of unconnected points. Next is the proc gplot statement which tells SAS we want to create a graph using the data in dataset output2. The plot statement instructs SAS to graph pred against science. Finally, run; executes the command and quit; instructs SAS that we will not be working with this graph further. (For more information on proc gplot, see our SAS Learning Module: Graphing data in SAS.)
goptions reset=all border; symbol1 interpol=join; proc gplot data=output2; plot pred*science; run; quit;
Below we show the resulting graph. This graph shows the predicted values of read across the observed values of science, holding write at its mean and female at one. Since this is a linear model, with no interaction terms, if we used other values of write and/or female, the predicted values would be different, but the slope of the line would be the same.
You can download the data used in this example by clicking here: hsb.sas7bdat.
In this example we show how the predict statement can be used in a model with random effects. The dataset for this example includes information on 7185 students in 160 schools. (Note: this dataset comes from the HLM manual.) The variable mathach is a measure of each student's math achievement, the variable female is a binary variable equal to one if the student is female and zero otherwise, and the variable pracad is a school level variable, the proportion of students at each school who are on an academic track. The variable id identifies which school a student attends. In the model we use female and pracad to predict mathach, and include a random intercept for schools.
The first line of code below begins the proc nlmixed command. The second line specifies the fixed portion of the model, i.e. the model without the random intercept, and calls this value fixed. The third line of code creates a value pred that is equal to the fixed part of the model (fixed) plus a random intercept term u. The model statement specifies that mathach is distributed (~) normally with a mean of pred and variance s2. The random statement defines the random effect u as distributed normally with mean zero and a variance term, s2u to be estimated. The level 2 units, that is schools, are identified by subject=id;. The last two lines of the command are predict statements. While we could create only a single set of predicted values, in this case we would like to generate two. The first predict statement gives us the predicted values for the fixed portion of the model, identified by the value fixed, and outputs a dataset called output_fixed. The second predict statement generates predicted values that include the estimate of the random intercept in addition to the fixed portion of the model.
proc nlmixed data="d:\data\hsb"; fixed = b0 + female*b1 + meanses*b2; pred = fixed + u; model mathach ~ normal(pred,s2); random u ~ normal(0,s2u) subject=id; predict fixed out=output_fixed; predict pred out=output_random; run;
We can use the predicted value is to calculate residuals. Using the dataset output_random from the previous example, the data step below calculates the residuals by subtracting each student's predicted mathach score (pred) from their actual score (mathach), creating a new variable resid. This example calculates residuals for the dataset where the predicted values include the random intercept (output_random), however, depending on our purpose, we also could have used the dataset with only the fixed effects (output_fixed). After calculating the residuals we use proc sgplot to plot the residuals (resid) versus fitted values (pred) in a scatterplot.
data output_random; set output_random ; resid=mathach-pred; run; proc sgplot data=output_random; scatter x=pred y=resid; run;
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services