This page discusses how to interpret a regression model when some of its variables have been log transformed. The example data set, lgtrans.csv, used for the examples on this page can be downloaded by following the link. It contains writing, reading, and math scores (write, read and math), the log transformed writing score (lgwrite), the log transformed math score (lgmath), and female. All the examples are done in Stata, but they can easily be reproduced in any statistical package. In the examples below, the variable write or its log transformed version is used as the outcome variable. The examples are for illustrative purposes and are not intended to make substantive sense. Here is a table of different types of means for the variable write.
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
write | Arithmetic 200 52.775 51.45332 54.09668
| Geometric 200 51.8496 50.46854 53.26845
| Harmonic 200 50.84403 49.40262 52.37208
------------------------------------------------------------------------
Very often, a linear relationship is hypothesized between a log transformed outcome variable and a group of predictor variables. Written mathematically, the relationship follows the equation
log(y_i)= β0 + β1*x1 + ... + βk*xk + e_i
where y is the outcome variable and x1, ..., xk are the predictor variables. In other words, we assume that log(y) - x'β is normally distributed (that is, y is log-normal conditional on all the covariates). Since this is just an ordinary least squares regression, we can interpret a regression coefficient, say β1, as the expected change in the log of y for a one-unit increase in x1, holding all other variables at any fixed value, assuming that x1 enters the model only as a main effect. But what if we want to know what happens to the outcome variable y itself for a one-unit increase in x1? The natural way to do this is to interpret the exponentiated regression coefficients, exp(β), since exponentiation is the inverse of the logarithm function.
Let's start with the intercept-only model, log(write) = β0.
------------------------------------------------------------------------------
lgwrite | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
intercept | 3.948347 .0136905 288.40 0.000 3.92135 3.975344
------------------------------------------------------------------------------
We can say that 3.95 is the unconditional expected mean of the log of write. The exponentiated value is exp(3.948347) = 51.85, which is the geometric mean of write. The emphasis here is that it is the geometric mean rather than the arithmetic mean. OLS regression of the original variable y estimates the expected arithmetic mean, whereas OLS regression of the log transformed outcome variable estimates the expected geometric mean of the original variable.
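The link between the mean of the logs and the geometric mean can be checked numerically. The following is a minimal sketch in Python (not Stata), offered only as an illustration: the five scores are made-up values, not taken from lgtrans.csv, and the variable names are our own. The last step uses the intercept reported in the table above.

```python
import math
import statistics

# Hypothetical scores, made up for illustration only (NOT from lgtrans.csv).
scores = [35.0, 44.0, 52.0, 59.0, 67.0]

# The intercept of an intercept-only OLS fit to log(y) is the sample mean of log(y).
mean_log = statistics.fmean([math.log(y) for y in scores])

# Exponentiating the mean of the logs gives the geometric mean ...
geo_from_logs = math.exp(mean_log)
# ... which matches the geometric mean computed directly.
geo_direct = statistics.geometric_mean(scores)

# The arithmetic mean is a different (and larger) quantity.
arith = statistics.fmean(scores)

# Intercept reported above for the intercept-only model of lgwrite.
intercept = 3.948347
geo_write = math.exp(intercept)

print(f"geometric mean via logs: {geo_from_logs:.4f}")
print(f"geometric mean directly: {geo_direct:.4f}")
print(f"arithmetic mean:         {arith:.4f}")
print(f"exp(intercept) for write: {geo_write:.2f}")
```

For any set of positive values that are not all equal, the geometric mean is smaller than the arithmetic mean, which is why the two rows of the means table differ.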
Now let's move on to a model with a single binary predictor variable.
------------------------------------------------------------------------------
lgwrite | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .1032614 .0265669 3.89 0.000 .050871 .1556518
intercept | 3.89207 .0196128 198.45 0.000 3.853393 3.930747
------------------------------------------------------------------------------
log(write)= β0 + β1*female = 3.89 + .10*female
Before diving into the interpretation of these parameters, let's get the means of our dependent variable, write, by gender:
males
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
write | Arithmetic 91 50.12088 47.97473 52.26703
| Geometric 91 49.01222 46.8497 51.27457
| Harmonic 91 47.85388 45.6903 50.23255
------------------------------------------------------------------------
females
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
write | Arithmetic 109 54.99083 53.44658 56.53507
| Geometric 109 54.34383 52.73513 56.0016
| Harmonic 109 53.64236 51.96389 55.43289
------------------------------------------------------------------------
Now we can map the parameter estimates to the geometric means for the two groups. The intercept of 3.89 is the log of the geometric mean of write when female = 0, i.e., for males. Therefore, its exponentiated value is the geometric mean for the male group: exp(3.892) = 49.01. What can we say about the coefficient for female? On the log scale, it is the difference in the expected means of the log of write between female and male students, or equivalently the difference between the logs of the two geometric means. On the original scale of the variable write, it is the ratio of the geometric mean of write for female students to the geometric mean of write for male students: exp(.1032614) = 54.34383/49.01222 = 1.11. In terms of percent change, we can say that switching from male students to female students, we expect to see about an 11% increase in the geometric mean of writing scores.
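This mapping can be verified with a few lines of arithmetic. The sketch below is in Python rather than Stata, purely for illustration; it uses only the estimates and group geometric means reported in the tables above, and the variable names are our own.

```python
import math

# Estimates from the regression of lgwrite on female (table above).
b0 = 3.89207      # intercept
b1 = 0.1032614    # coefficient for female

# Geometric means of write reported for each group.
geo_male = 49.01222
geo_female = 54.34383

# exp(intercept) reproduces the geometric mean for males (female = 0).
pred_geo_male = math.exp(b0)

# exp(intercept + coefficient) reproduces the geometric mean for females.
pred_geo_female = math.exp(b0 + b1)

# exp(coefficient) is the ratio of the two geometric means.
ratio_from_coef = math.exp(b1)
ratio_from_means = geo_female / geo_male

print(f"males:   {pred_geo_male:.2f} vs reported {geo_male:.2f}")
print(f"females: {pred_geo_female:.2f} vs reported {geo_female:.2f}")
print(f"ratio:   {ratio_from_coef:.4f} vs {ratio_from_means:.4f}")
```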
Last, let's look at a model with multiple predictor variables.
------------------------------------------------------------------------------
lgwrite | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .114718 .0195341 5.87 0.000 .076194 .153242
read | .0066305 .0012689 5.23 0.000 .0041281 .0091329
math | .0076792 .0013873 5.54 0.000 .0049432 .0104152
intercept | 3.135243 .0598109 52.42 0.000 3.017287 3.253198
------------------------------------------------------------------------------
log(write)= β0 + β1*female + β2*read + β3*math
The exponentiated coefficient exp(β1) for female is the ratio of the expected geometric mean for the female students group to the expected geometric mean for the male students group, when read and math are held at some fixed value. Of course, the expected geometric means for the male and female students groups will be different for different values of read and math. However, their ratio is a constant: exp(β1). In our example, exp(β1) = exp(.114718) = 1.12. We can say that the expected geometric mean of writing scores is about 12% higher for female students than for male students with the same read and math scores. For the variable read, we can say that for a one-unit increase in read, we expect to see about a 0.7% increase in the geometric mean of writing score, since exp(.0066305) = 1.006653. For a ten-unit increase in read, we expect to see about a 6.9% increase in the geometric mean of writing score, since exp(.0066305*10) = 1.0685526.
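The percent changes quoted above all follow from the same formula, 100*(exp(β*Δ) - 1) for a Δ-unit increase in a predictor. A small Python sketch (for illustration only; the helper name percent_change is our own, not a Stata or library function) reproduces them from the coefficients in the table:

```python
import math

def percent_change(beta: float, delta: float = 1.0) -> float:
    """Percent change in the expected geometric mean of y for a
    `delta`-unit increase in x, when the outcome of the model is log(y)."""
    return 100.0 * (math.exp(beta * delta) - 1.0)

# Coefficients from the model of lgwrite on female, read and math.
b_female = 0.114718
b_read = 0.0066305

pct_female = percent_change(b_female)      # female vs. male
pct_read_1 = percent_change(b_read)        # one-unit increase in read
pct_read_10 = percent_change(b_read, 10)   # ten-unit increase in read

print(f"female vs. male:   {pct_female:.1f}%")
print(f"read + 1 unit:     {pct_read_1:.2f}%")
print(f"read + 10 units:   {pct_read_10:.1f}%")
```

Because the effect is multiplicative, the ten-unit change is slightly larger than ten times the one-unit change.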
In summary, when the outcome variable is log transformed, it is natural to interpret the exponentiated regression coefficients. These values are ratios of the expected geometric means of the original outcome variable.
Occasionally, some predictor variables are log transformed as well. In this section, we will take a look at an example where some predictor variables are log transformed, but the outcome variable is in its original scale.
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | 5.388777 .9307948 5.79 0.000 3.553118 7.224436
lgmath | 20.94097 3.430907 6.10 0.000 14.17473 27.7072
lgread | 16.85218 3.063376 5.50 0.000 10.81076 22.89359
intercept | -99.16397 10.80406 -9.18 0.000 -120.4711 -77.85685
------------------------------------------------------------------------------
Written as an equation, we have
write= β0 + β1*female + β2*lgmath + β3*lgread
Since this is an OLS regression, the interpretation of the regression coefficients for the non-transformed variables is unchanged from an OLS regression without any transformed variables. For example, the expected mean difference in writing scores between female and male students is about 5.4 points, holding the other predictor variables constant. On the other hand, due to the log transformation, the estimated effects of math and read are no longer linear, even though the effects of lgmath and lgread are linear. The plot below shows the curve of predicted values against the reading scores for the female students group holding math score constant.
How do we interpret the coefficient of 16.85218 for the log of reading score? Let's take two values of reading score, r1 and r2. The expected mean difference in writing score at r1 and r2, holding the other predictor variables constant, is write(r2) - write(r1) = β3*(log(r2) - log(r1)) = β3*log(r2/r1), where log is the natural logarithm. This means that as long as the percent increase in read (the predictor variable) is fixed, we will see the same difference in writing score, regardless of where the baseline reading score is. For example, we can say that for a 10% increase in reading score, the difference in the expected mean writing scores will always be β3*log(1.10) = 16.85218*log(1.1) = 1.61.
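The claim that only the ratio r2/r1 matters can be illustrated with a short Python sketch (not Stata). The coefficient comes from the table above; the baseline reading scores of 40 and 60 are arbitrary values chosen for illustration, and the helper name diff_in_write is our own.

```python
import math

b_lgread = 16.85218  # coefficient for lgread from the table above

def diff_in_write(r1: float, r2: float, beta: float = b_lgread) -> float:
    """Expected difference in write when read moves from r1 to r2,
    holding the other predictors constant (natural log)."""
    return beta * math.log(r2 / r1)

# A 10% increase in read gives the same difference at any baseline.
diff_low = diff_in_write(40, 44)    # 40 -> 44 is +10%
diff_high = diff_in_write(60, 66)   # 60 -> 66 is +10%

# A one-unit increase, by contrast, matters less at a higher baseline.
unit_low = diff_in_write(40, 41)
unit_high = diff_in_write(60, 61)

print(f"+10% from 40: {diff_low:.2f}, +10% from 60: {diff_high:.2f}")
print(f"+1 from 40:   {unit_low:.3f}, +1 from 60:   {unit_high:.3f}")
```

The second pair of values shows the nonlinearity in read itself: the same one-point gain in reading score corresponds to a smaller expected gain in writing score when the baseline is higher.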
What happens when both the outcome variable and predictor variables are log transformed? We can combine the two previously described situations into one. Here is an example of such a model.
------------------------------------------------------------------------------
lgwrite | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .1142399 .0194712 5.87 0.000 .07584 .1526399
lgmath | .4085369 .0720791 5.67 0.000 .2663866 .5506872
read | .0066086 .0012561 5.26 0.000 .0041313 .0090859
intercept | 1.928101 .2469391 7.81 0.000 1.441102 2.415099
------------------------------------------------------------------------------
Written as an equation, we can describe the model:
log(write)= β0 + β1*female + β2*log(math) + β3*read
For a variable that is not transformed, such as female, the exponentiated coefficient is the ratio of the geometric mean for the female students group to the geometric mean for the male students group. In our example, we can say that the expected percent increase in geometric mean from the male student group to the female student group is about 12%, holding other variables constant, since exp(.1142399) = 1.12. For reading score, we can say that for a one-unit increase in reading score, we expect to see about a 0.7% increase in the geometric mean of writing score, since exp(.0066086) = 1.007.
Now, let's focus on the effect of math. Take two values of math, m1 and m2, and hold the other predictor variables at any fixed value. The equation above yields
log(write(m2)) - log(write(m1)) = β2*(log(m2) - log(m1))
It can be simplified to log(write(m2)/write(m1)) = β2*(log(m2/m1)), leading to
write(m2)/write(m1) = (m2/m1)^β2.
This tells us that as long as the ratio of the two math scores, m2/m1, stays the same, the expected ratio of the outcome variable, write, stays the same. For example, we can say that for any 10% increase in math score, the expected ratio of the two geometric means for writing score will be 1.10^β2 = 1.10^.4085369 = 1.0397057. In other words, we expect about a 4% increase in the geometric mean of writing score when math score increases by 10%.
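The power relationship write(m2)/write(m1) = (m2/m1)^β2 can be checked numerically. Below is a minimal Python sketch (not Stata) using the lgmath coefficient from the table above; the baseline math scores of 40 and 60 are arbitrary illustrative values, and the helper name ratio_of_geo_means is our own.

```python
import math

b_lgmath = 0.4085369  # coefficient for lgmath from the table above

def ratio_of_geo_means(m1: float, m2: float, beta: float = b_lgmath) -> float:
    """Expected ratio of geometric means of write when math moves
    from m1 to m2, holding the other predictors constant."""
    return (m2 / m1) ** beta

# A 10% increase in math gives the same ratio at any baseline.
ratio_low = ratio_of_geo_means(40, 44)
ratio_high = ratio_of_geo_means(60, 66)

print(f"ratio for +10% from 40: {ratio_low:.7f}")
print(f"ratio for +10% from 60: {ratio_high:.7f}")
print(f"percent increase:       {100 * (ratio_low - 1):.1f}%")
```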
These pages are Copyrighted (c) by UCLA Academic Technology Services