|
|
|
||||
|
Stat Computing > SAS > FAQ
|
|
||||
Imagine you wish to predict temperature using time of day. You believe that the temperature increases until a certain time t and decreases after that point and you wish to estimate two functions that will meet at t. Given this location at which you believe the relationship between your outcome and predictor changes, you can fit two functions to meet at that given point in SAS using proc transreg.
For this example, we will use a fictional dataset where our outcome variable is y, our predictor variable is x, and the relationship between x and y is clearly not a single continuous function. We have borrowed this example data from SAS examples.
data a; x=-0.000001; do i=0 to 199; if mod(i,50)=0 then do; c=((x/2)-5)**2; if i=150 then c=c+5; y=c; end; x=x+0.1; y=y-sin(x-c); output; end; run; proc print data = a (obs = 5); run; Obs X I C Y 1 0.10000 0 25.0000 24.7694 2 0.20000 1 25.0000 24.4427 3 0.30000 2 25.0000 24.0234 4 0.40000 3 25.0000 23.5155 5 0.50000 4 25.0000 22.9241proc gplot data = a; plot y*x; run;
We may have some reason to believe that there is a "knot" located at x = 10 and that the relationship between x and y is linear. By saying there is a knot here, we're suggesting that the relationship between x to y for x < 10 is different from the relationship for x > 10. Rather than fit one line or curve where we believe there are actually two, we can indicate the location of the knot and fit a piecewise model. We will do this using proc transreg. We will indicate the location of the knot with the knots option and specify that we wish to fit one-degree polynomials (lines) with the degree option. We will output a dataset, a2, that includes the coefficients for these lines.
proc transreg data = a;
model identity(y) = pspline(x / knots = 10 degree = 1);
output out = a2 coefficients;
run;
TRANSREG Univariate Algorithm Iteration History for Identity(Y)
Iteration Average Maximum Criterion
Number Change Change R-Square Change Note
-------------------------------------------------------------------------
1 0.00000 0.00000 0.48219 Converged
By fitting this model, we are effectively estimating one intercept and two slopes. The output from proc transreg does not include these estimates as we would see them after fitting a regression with proc reg, but the estimates have been added to the output dataset a2. The output dataset also includes the transformed outcome variable, as many powers of the predictor x are needed for polynomials of the indicated degree, and as many modified versions of x as are required by the number of knots. We can look at a few records in our output dataset to see this.
proc print data = a2 (obs = 5); where x > 9.8; run; Obs _TYPE_ _NAME_ Y TY Intercept X_1 X_2 TIntercept TX_1 TX_2 X 99 SCORE ROW99 -5.85961 -5.85961 1 9.9000 0.00000 1 9.9000 0.00000 9.9000 100 SCORE ROW100 -5.28805 -5.28805 1 10.0000 0.00000 1 10.0000 0.00000 10.0000 101 SCORE ROW101 0.62507 0.62507 1 10.1000 0.10000 1 10.1000 0.10000 10.1000 102 SCORE ROW102 1.32494 1.32494 1 10.2000 0.20000 1 10.2000 0.20000 10.2000 103 SCORE ROW103 2.09263 2.09263 1 10.3000 0.30000 1 10.3000 0.30000 10.3000
In this example, we used the "identity" transformation of y, so the transformed variable ty is equal to y. We indicated one-degree polynomials, so variables intercept = x^0 and x_1 = x^1= x were created. No higher powers of x were generated. We indicated one knot placed at x = 10, so a variable x_2 was generated. x_2 is zero when x < 10 and is x - 10 when x > 10. For example, we can see that when x = 10.1, x_2 = 0.1. Transformations of intercept, x_1, and x_2 also appear in the dataset. In this example, the transformed values are equal to the untransformed.
The last row of the generated dataset contains the coefficients from the model. In this row, the value for _TYPE_ is "M COEFFI". We can print this line to see the coefficients.
proc print data = a2; where _TYPE_ eq "M COEFFI"; run; Obs _TYPE_ _NAME_ Y TY Intercept X_1 X_2 TIntercept TX_1 TX_2 X 201 M COEFFI TY . . . . . 16.9614 -1.47976 3.94202 .
Here we see the intercept (16.9614), the slope of the first line (-1.47976) and the slope of the second line (3.94202). We will see the same estimates if we use proc reg:
proc reg data = a2;
model y = x_1 x_2;
output out = a3 predicted = pred;
run;
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 8181.68207 4090.84103 91.72 <.0001
Error 197 8786.23476 44.60018
Corrected Total 199 16968
Root MSE 6.67834 R-Square 0.4822
Dependent Mean 12.04335 Adj R-Sq 0.4769
Coeff Var 55.45246
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 16.96137 1.26019 13.46 <.0001
X_1 X 1 1 -1.47976 0.18399 -8.04 <.0001
X_2 X 2 1 3.94202 0.32717 12.05 <.0001
We can also see that the R-square value in the above regression is the same that appeared in the proc transreg output. We can plot the predicted values and see that they consist of two lines that meet at x = 10.
legend label=none value=('y' 'predicted y') position=(bottom left inside) mode=share down = 2; proc gplot data = a3; plot (y pred)*x / overlay legend = legend; run; quit;
If we have reason to believe that relationship between x and y is quadratic, we can indicate this in proc transreg and fit two two-degree polynomials to the data. We will again run proc tranreg, this time with degree = 2, and output a dataset, aquad.
proc transreg data = a;
model identity(y) = pspline(x / knots = 10 degree = 2);
output out = aquad coefficients;
run;
TRANSREG Univariate Algorithm Iteration History for Identity(Y)
Iteration Average Maximum Criterion
Number Change Change R-Square Change Note
-------------------------------------------------------------------------
1 0.00000 0.00000 0.46274 Converged
In this model, proc transreg is finding the combination of two quadratic functions that meet at x = 10. The dataset aquad has several added transformed variables: x_2 is the square of x and x_3 is the square of (x-10) for x > 10 and 0 for x < 10. We can look at a few records in our output dataset to see this.
proc print data = aquad (obs = 5); where x > 9.8; run; Obs _TYPE_ _NAME_ Y TY Intercept X_1 X_2 99 SCORE ROW99 -5.85961 -5.85961 1 9.9000 98.010 100 SCORE ROW100 -5.28805 -5.28805 1 10.0000 100.000 101 SCORE ROW101 0.62507 0.62507 1 10.1000 102.010 102 SCORE ROW102 1.32494 1.32494 1 10.2000 104.040 103 SCORE ROW103 2.09263 2.09263 1 10.3000 106.090 Obs X_3 TIntercept TX_1 TX_2 TX_3 X 99 0.000000 1 9.9000 98.010 0.000000 9.9000 100 0.000000 1 10.0000 100.000 0.000000 10.0000 101 0.010000 1 10.1000 102.010 0.010000 10.1000 102 0.040000 1 10.2000 104.040 0.040000 10.2000 103 0.089999 1 10.3000 106.090 0.089999 10.3000
Again, the last row of the generated dataset contains the coefficients from the model. In this row, the value for _TYPE_is "M COEFFI". We can print this line to see the coefficients.
proc print data = aquad;
where _TYPE_ eq "M COEFFI";
run;
T
I I
n n
t t
_ _ e e
T N r r
Y A c c T T T
O P M e X X X e X X X
b E E T p _ _ _ p _ _ _
s _ _ Y Y t 1 2 3 t 1 2 3 X
201 M COEFFI TY . . . . . . 23.5360 -5.39639 0.36708 -0.38836 .
We will see the same estimates if we use proc reg:
proc reg data = aquad;
model y = x_1 x_2 x_3;
output out = aquad_reg predicted = pred;
run;
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 7851.67021 2617.22340 56.27 <.0001
Error 196 9116.24661 46.51146
Corrected Total 199 16968
Root MSE 6.81993 R-Square 0.4627
Dependent Mean 12.04335 Adj R-Sq 0.4545
Coeff Var 56.62817
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 23.53600 1.83580 12.82 <.0001
X_1 X 1 1 -5.39639 0.64211 -8.40 <.0001
X_2 X 2 1 0.36708 0.04645 7.90 <.0001
X_3 X 3 1 -0.38836 0.08628 -4.50 <.0001
We can also see that the R-square value in the above regression is the same that appeared in the proc transreg output. We can plot the predicted values and see that they consist of two curves that meet at x = 10.
legend label=none value=('y' 'predicted y') position=(bottom left inside) mode=share down = 2; proc gplot data = aquad_reg; plot (y pred)*x / overlay legend = legend; run; quit;
We have assumed up to this point that there is a known value at which the relationship between y and x changes. If you know that such a point exists, but do not know its location, see SAS FAQ: How can I find where to split a piecewise regression?.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services