UCLA Academic Technology Services HomeServicesClassesContactJobs
Help the Stat Consulting Group by giving a gift             
Loading

R Data Analysis Examples
Tobit Models

Note: the code on this page works with R 2.8.1.

The tobit model, also called a censored regression model, is designed to estimate linear relationships between variables when there is either left- or right-censoring in the dependent variable (also known as censoring from below and above, respectively). Censoring from above takes place when cases with a value at or above some threshold, all take on the value of that threshold, so that the true value might be equal to the threshold, but it might also be higher. In the case of censoring from below, values those that fall at or below some threshold are censored.

Please Note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.

Examples of Tobit Analysis

Example 1. In the 1980s there was a federal law restricting speedometer readings to no more than 85 mph. So if you wanted to try and predict a vehicle's top-speed from a combination of horse-power and engine size, you would get a reading no higher than 85, regardless of how fast the vehicle was really traveling. This is a classic case of right-censoring (censoring from above) of the data. The only thing we are certain of is that those vehicles were traveling at least 85 mph.

Example 2. A research project is studying the level of lead in home drinking water as a function of the age of a house and family income. The water testing kit cannot detect lead concentrations below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous. These data are an example of left-censoring (censoring from below).

Example 3. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as, the type of program the student is enrolled in (academic, general, or vocational). The problem here is that students who answer all questions on the academic aptitude test correctly receive a score of 800, even though it is likely that these students are not "truly" equal in aptitude. The same is true of students who answer all of the questions incorrectly. All such students would have a score of 200, although they may not all be of equal aptitude.

Description of the Data

For our data analysis below, we are going to expand on Example 3 from above.  We have generated hypothetical data, which can be obtained from our website from within R. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.


mydata <- read.csv("http://www.ats.ucla.edu/stat/r/dae/tobit.csv")

In order to make it easier to refer to the variables, we can attach the dataset. Then we use the names function to list the names of the variables in the dataset.


attach(mydata)
names(mydata)

[1] "id"   "read" "math" "prog" "apt"

The dataset contains 200 observations. The academic aptitude variable is apt, the reading and math test scores are read and math respectively. The variable prog is the type of program the student is in, it is a categorical (nominal) variable that takes on three values, academic (prog = 1), general (prog = 2), and vocational (prog = 3). The variable id is an identification variable.

Now let's look at the data descriptively. Note that in this dataset, the lowest value of apt is 352. That is, no students received a score of 200 (the lowest score possible), meaning that even though censoring from below was possible, it does not occur in the dataset.


summary(apt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  352.0   575.5   633.0   640.0   705.2   800.0 
sd(apt)
[1] 99.21903

summary(read)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.00   44.00   50.00   52.23   60.00   76.00 
sd(read)
[1] 10.25294

summary(math)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  33.00   45.00   52.00   52.65   59.00   75.00 
sd(math)
[1] 9.368448

table(prog)
prog
  academic    general vocational 
        45        105         50 

h<-hist(apt)
xfit<-seq(min(apt),max(apt),length=40)
yfit<-dnorm(xfit,mean=mean(apt),sd=sd(apt))
yfit <- yfit*diff(h$mids[1:2])*length(apt)
lines(xfit, yfit, lwd=2) 

Looking at the above histogram, we can see the censoring in the values of apt, that is, there are far more cases with scores of 750 to 800 than one would expect looking at the rest of the distribution. Below is an alternative histogram that further highlights the excess of cases where apt=800. In the histogram below, the breaks option produces a histogram where each unique value of apt has its own bar (by setting breaks equal to a vector containing values from the minimum of apt to the maximum of apt). Because apt is continuous, most values of apt are unique in the dataset, although close to the center of the distribution there are a few values of apt that have two or three cases. The spike on the far right of the histogram is the bar for cases where apt=800, the height of this bar relative to all the others clearly shows the excess number of cases with this value.

x<-apt
H<-hist(apt, breaks=cbind(min(x):max(x)), freq=TRUE)
l<-dim(table(x))
N<-length(x)
dx <- min(diff(H$breaks))
curve(N*dx*dnorm(x, mean=mean(x), sd=sd(x)), add=TRUE, col="blue")

Next we'll explore the bivariate relationships in our dataset.


cor(mydata[,c(2,3,5)], use="complete.obs")
          read      math       apt
read 1.0000000 0.6622801 0.6451215
math 0.6622801 1.0000000 0.7332702
apt  0.6451215 0.7332702 1.0000000

splom(~ mydata[ , c(2,3,5)])

In the first row of the scatterplot matrix shown above, we see the scatterplots showing the relationship between read and apt, as well as math and apt. Note the collection of cases at the top these two scatterplots, this is due to the censoring in the distribution of apt.

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

R Tobit Analysis

Below we run the tobit model, using the vglm(...) function of the VGAM library.

library(VGAM)
fit <- vglm(apt ~ read + math + as.factor(prog), tobit(Upper=800))

Since we gave our model a name (fit), R will not produce any output from our regression. In order to get the basic output we use the summary command:

summary(fit)

Call:
vglm(formula = apt ~ read + math + as.factor(prog), family = tobit(Upper = 800))

Pearson Residuals:
            Min       1Q    Median      3Q    Max
mu      -3.2606 -0.69522  0.049445 0.82743 2.7935
log(sd) -1.1309 -0.61020 -0.310926 0.21836 4.8277

Coefficients:
                             Value Std. Error t value
(Intercept):1             209.5488  32.642056  6.4196
(Intercept):2               4.1848   0.049756 84.1071
read                        2.6980   0.619577  4.3546
math                        5.9148   0.706472  8.3723
as.factor(prog)general    -12.7145  12.409533 -1.0246
as.factor(prog)vocational -46.1431  13.707785 -3.3662

Number of linear predictors:  2 

Names of linear predictors: mu, log(sd)

Dispersion Parameter for tobit family:   1

Log-likelihood: -872.8971 on 394 degrees of freedom

Number of Iterations: 8

Below we calculate the p-values for each of the coefficients in the model. In the first two lines of code we get the coefficients and standard errors from the model. In the third line we calculate the p-value for each coefficient, and on the fourth we display these values together. The coefficients for read, math, and prog = 3 (vocational) are statistically significant.

coef <- coefficients(summary(fit))
se <- sqrt(diag(vcov(fit)))
pvals <- 2*(1-pt(abs(coef)/se, df.residual(fit)))
cbind(coef,se,pvals)
                                coef          se        pvals
(Intercept):1             209.548822 32.64205591 3.932592e-10
(Intercept):2               4.184806  0.04975566 0.000000e+00
read                        2.698025  0.61957745 1.701602e-05
math                        5.914766  0.70647200 8.881784e-16
as.factor(prog)general    -12.714539 12.40953304 3.061908e-01
as.factor(prog)vocational -46.143141 13.70778548 8.370252e-04

Below we calculate the upper and lower 95% confidence intervals for the coefficients.


upper <- coefficients(summary(fit)) + 1.96*sqrt(diag(vcov(fit)))
lower <- coefficients(summary(fit)) - 1.96*sqrt(diag(vcov(fit)))
cbind(lower,upper)
                               lower      upper
(Intercept):1             145.570392 273.527252
(Intercept):2               4.087285   4.282327
read                        1.483653   3.912397
math                        4.530081   7.299451
as.factor(prog)general    -37.037224  11.608146
as.factor(prog)vocational -73.010400 -19.275881

We can test for an overall effect of prog using the wald.test function of the aod library. The wald.test function refers to the coefficients by their order the model, which is the same as their order in the table of coefficients. Below we first load the aod library. Then we use the wald.test function. The b option supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested, in this case, terms 5, and 6, are the two terms for the levels of prog.


library(aod)
wald.test(b=coef(fit), Sigma=vcov(fit), Terms=5:6)

Wald test:
----------

Chi-squared test:
X2 = 12.0, df = 2, P(> X2) = 0.0025

The chi-squared test statistic of 12, with two degrees of freedom is associated with a p-value of 0.0025, indicating that the overall effect of prog is statistically significant.

We can also test additional hypotheses about the differences in the coefficients for different levels of prog. Below we test that the coefficient for prog=2 is equal to the coefficient for prog=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for prog=2 and prog=3 (i.e. the 5th and 6th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).


l <- cbind(0,0,0,0,1, -1)
wald.test(b=coef(fit), Sigma=vcov(fit), L=l)

Wald test:
----------

Chi-squared test:
X2 = 6.7, df = 1, P(> X2) = 0.0096

The chi-squared test statistic of 6.7, with one degree of freedom is associated with a p-value of 0.0096, indicating that the coefficients for prog=2 and prog=3 are significantly different.

We may also wish to examine how well our model fits the data. This can be particularly useful when comparing competing models. One method of doing this is to compare the predicted values based on the tobit model to the observed values in the dataset. Below we use the predict function to generate predicted values of apt based on the model. Next we correlate the observed values of apt with the predicted values (predicted).


mydata$predicted <- predict(fit, newdata=mydata, type="response")
cor(mydata[, c(5,6)], use="complete.obs")

                apt predicted
apt       1.0000000 0.7824707
predicted 0.7824707 1.0000000

The correlation between the predicted and observed values of apt is 0.7825. If we square this value, we get the multiple squared correlation, this indicates predicted values share about 61% (0.78252 = 0.6123) of their variance with apt.

References

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Tobin, J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26: 24-36.


How to cite this page

Report an error on this page or leave a comment

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.