The tobit model, also called a censored regression model, is designed to estimate linear relationships between variables when there is either left- or right-censoring in the dependent variable (also known as censoring from below and above, respectively). Censoring from above takes place when cases with a value at or above some threshold all take on the value of that threshold, so that the true value might be equal to the threshold, but it might also be higher. In the case of censoring from below, values that fall at or below some threshold are censored.
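To make the definition concrete, here is a minimal Python sketch of what censoring does to recorded values. The function name censor, the limits of 200 and 800, and the list of true values are all hypothetical and chosen only for illustration; they are not taken from any dataset.

```python
def censor(value, lower=None, upper=None):
    """Return the value that would be recorded under the given censoring limits."""
    if upper is not None and value >= upper:
        return upper  # censored from above (right-censoring)
    if lower is not None and value <= lower:
        return lower  # censored from below (left-censoring)
    return value      # observed as-is


# Hypothetical true values, invented for this illustration.
true_values = [150, 420, 640, 800, 910]

# Censoring from above only: anything at or above 800 is recorded as 800.
recorded_above = [censor(v, upper=800) for v in true_values]

# Censoring from both sides: values at or below 200 are recorded as 200.
recorded_both = [censor(v, lower=200, upper=800) for v in true_values]

print(recorded_above)  # [150, 420, 640, 800, 800]
print(recorded_both)   # [200, 420, 640, 800, 800]
```

Note that the recorded value of 800 is ambiguous: it stands for a true value of exactly 800 as well as for any higher value, which is the information the tobit model accounts for.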
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Example 1. In the 1980s there was a federal law restricting speedometer readings to no more than 85 mph. So if you wanted to try to predict a vehicle's top speed from a combination of horsepower and engine size, you would get a reading no higher than 85, regardless of how fast the vehicle was really traveling. This is a classic case of right-censoring (censoring from above) of the data. The only thing we are certain of is that those vehicles were traveling at least 85 mph.
Example 2. A research project is studying the level of lead in home drinking water as a function of the age of a house and family income. The water testing kit cannot detect lead concentrations below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous. These data are an example of left-censoring (censoring from below).
Example 3. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as the type of program the student is enrolled in (academic, general, or vocational). The problem here is that students who answer all questions on the academic aptitude test correctly receive a score of 800, even though it is likely that these students are not "truly" equal in aptitude. The same is true of students who answer all of the questions incorrectly. All such students would have a score of 200, although they may not all be of equal aptitude.
Let's pursue Example 3 from above.
We have a hypothetical data file, tobit.dat, with 200 observations.
The academic aptitude variable is apt, the reading and math test scores are read and math, and the type of program is prog.
Let's start by looking at some descriptive statistics generated in another package. The first table gives the descriptive statistics for the three continuous variables, and the second table tabulates the categorical variable prog. As expected, the highest value of apt is 800. In this dataset, the lowest value of apt is 352, indicating that no students received a score of 200 (i.e., the lowest score possible). Thus, even though censoring from below was possible, it does not occur in this dataset.
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         apt |       200     640.035    99.21903        352        800
        read |       200       52.23    10.25294         28         76
        math |       200      52.645    9.368448         33         75

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
   academic |         45       22.50       22.50
    general |        105       52.50       75.00
 vocational |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00
Even if you've already run descriptive statistics in another package, you probably want to run an Mplus model with type=basic to make sure your data have been read in properly. The input file for such a model is shown below. We have also used the type = plot1 option of the plot command, so that we can use Mplus to generate histograms and scatterplots.
Data:
file is tobit.dat;
Variable:
names are id read math prog apt prog1 prog2 prog3;
usevariables are read math apt prog1 prog2 prog3;
Analysis:
type = basic;
Plot:
type = plot1;
You will want to look at this output carefully to be sure that the dataset was read into Mplus correctly. For example, check that you have the correct number of observations and that the variables all have means that are close to those from the descriptive statistics generated in a general purpose statistical package. If there are missing values for some or all of the variables, the descriptive statistics generated by Mplus may not match those from a general purpose statistical package exactly, because by default, Mplus versions 5.0 and later use maximum likelihood based procedures for handling missing values. Looking at the output shown below, we can confirm that the number of observations is correct and that the means of the variables are consistent with those from a general purpose statistical package. Later on we will use the variance of apt as a point of comparison, so we will make note of this variance (9795.194), shown on the diagonal of the covariance matrix below.
SUMMARY OF ANALYSIS
Number of groups                 1
Number of observations         200

<output omitted>

ESTIMATED SAMPLE STATISTICS

Means
                READ        MATH         APT       PROG1       PROG2
            ________    ________    ________    ________    ________
1             52.230      52.645     640.035       0.225       0.525

Means
               PROG3
            ________
1              0.250

Covariances
                READ        MATH         APT       PROG1       PROG2
            ________    ________    ________    ________    ________
READ         104.597
MATH          63.297      87.329
APT          652.992     678.187    9795.194
PROG1         -0.557      -0.590      -0.228       0.174
PROG2          2.064       2.146      19.807      -0.118       0.249
PROG3         -1.508      -1.556     -19.579      -0.056      -0.131

Covariances
               PROG3
            ________
PROG3          0.188
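When comparing this output with the earlier descriptive statistics, note that the variances on the diagonal are not simply the squared standard deviations from the first table. The numbers are consistent with the covariance matrix using a divisor of n rather than n - 1; this is our inference from the values shown, not something stated in the output. The short Python sketch below (the function name ml_variance is our own) converts the reported standard deviations and recovers the diagonal entries.

```python
n = 200  # number of observations

# Standard deviations from the descriptive statistics table (divisor n - 1).
std_devs = {"read": 10.25294, "math": 9.368448, "apt": 99.21903}


def ml_variance(sd, n):
    """Convert a sample SD (divisor n - 1) into a variance with divisor n."""
    return sd ** 2 * (n - 1) / n


for name, sd in std_devs.items():
    print(name, round(ml_variance(sd, n), 3))
# read 104.597
# math 87.329
# apt 9795.194
```

With 200 observations the two conventions differ by only half a percent, so small discrepancies of this size are not a sign that the data were read incorrectly.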
The plot command included in the input file above allows us to view histograms of our variables. We can view the histogram by clicking on the "Graph" menu, and then moving down to click on "View graphs." In the window that appears select "Histograms" and click "view." A second window will appear, where we can select the variable we wish to plot. Below is a histogram of apt.

Looking at the above histogram showing the distribution of apt, we can see the censoring in the data: there are more cases with scores of 750 to 800 (i.e., the bin labeled 777.5) than one would expect from the rest of the distribution. Below is an alternative histogram that further highlights the excess of cases where apt=800. To produce this graph we proceeded as before, but after selecting apt as the variable to be plotted, we moved to the "Display properties" tab (in the same window) and set the number of bins to the range of apt plus one (800-352+1=449). This produces a histogram with a bin for each integer value from 352 to 800. Because apt is continuous, most values of apt are unique in the dataset, although close to the center of the distribution a few values of apt have two or three cases. The spike on the far right of the histogram is the bar for cases where apt=800; the height of this bar relative to all the others clearly shows the excess number of cases with this value.

Next we'll explore the bivariate relationships in our dataset. We can view scatterplots by going to the "Graph" menu, and down to "View graphs," then selecting "Scatterplots" in the window that appears. Clicking "view" will show a second window, where we can select the variables we wish to plot. Below is a scatterplot showing read and apt. Note the collection of cases near the top of the scatterplot, due to the censoring in the distribution of apt.

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.
Below is the content of an Mplus input file for a tobit regression model. Because we are not using all of the variables in the dataset in the model, we use the usevariables option of the variable command to indicate which variables should be included in the model. The censored option declares that the variable apt is censored. The (a) following apt on the censored option indicates that the variable is censored from above (i.e., right-censoring). If we had censoring from below (i.e., left-censoring), we would have used the (b) option instead. By default, Mplus uses its MLR estimator (maximum likelihood parameter estimates with standard errors and a chi-square test statistic that are robust to non-normality and non-independence of observations) when estimating tobit models. The MLR standard errors are computed using what is often called a sandwich estimator; these are what we generally call robust standard errors. If for some reason you want to match the output from Mplus to output from other packages, you will need to use the ML estimator, by including estimator = ml; in the analysis command.
data:
file is tobit.dat ;
variable:
names are id read math prog apt prog1 prog2 prog3;
usevariables are read math apt prog2 prog3;
censored are apt (a);
model:
apt on read math prog2 prog3;
output:
stdyx;
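To make the role of the censored option concrete, the sketch below writes out the standard log-likelihood of a tobit model with censoring from above in Python. It is a minimal illustration under stated assumptions, not Mplus's internal implementation: the function names and the two-case toy dataset are our own, and the code makes no attempt to reproduce the robust (MLR) standard errors. Uncensored cases contribute a normal density, while cases at the limit contribute the probability that the underlying value is at or above the limit.

```python
import math


def normal_logpdf(z):
    """Log density of a standard normal variate at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_sf(z):
    """Upper-tail probability P(Z > z) of a standard normal variate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tobit_loglik(y, X, beta, sigma, upper):
    """Log-likelihood of a tobit model with right-censoring at `upper`.

    y     : list of observed outcomes
    X     : list of predictor rows (first entry 1.0 for the intercept)
    beta  : list of regression coefficients
    sigma : residual standard deviation
    """
    total = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * x for b, x in zip(beta, xi))
        if yi >= upper:
            # Censored case: only know the latent value is at least `upper`.
            total += math.log(normal_sf((upper - mu) / sigma))
        else:
            # Uncensored case: ordinary normal density of the residual.
            total += normal_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total


# Toy data, invented for illustration: one uncensored and one censored case.
toy_y = [500.0, 800.0]
toy_X = [[1.0, 40.0], [1.0, 70.0]]
toy_ll = tobit_loglik(toy_y, toy_X, beta=[100.0, 10.0], sigma=50.0, upper=800.0)
print(toy_ll)
```

Maximizing this function over the coefficients and the residual standard deviation gives maximum likelihood estimates for the tobit model; the H0 loglikelihood value in the Mplus output plays the role of the maximized total.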
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 200
Number of dependent variables 1
Number of independent variables 4
Number of continuous latent variables 0
Observed dependent variables
Censored
APT
Observed independent variables
READ MATH PROG2 PROG3
Estimator MLR
Information matrix OBSERVED
Optimization Specifications for the Quasi-Newton Algorithm for
Continuous Outcomes
Maximum number of iterations 100
Convergence criterion 0.100D-05
Optimization Specifications for the EM Algorithm
Maximum number of iterations 500
Convergence criteria
Loglikelihood change 0.100D-02
Relative loglikelihood change 0.100D-05
Derivative 0.100D-02
Optimization Specifications for the M step of the EM Algorithm for
Categorical Latent variables
Number of M step iterations 1
M step convergence criterion 0.100D-02
Basis for M step termination ITERATION
Optimization Specifications for the M step of the EM Algorithm for
Censored, Binary or Ordered Categorical (Ordinal), Unordered
Categorical (Nominal) and Count Outcomes
Number of M step iterations 1
M step convergence criterion 0.100D-02
Basis for M step termination ITERATION
Maximum value for logit thresholds 15
Minimum value for logit thresholds -15
Minimum expected cell size for chi-square 0.100D-01
Maximum number of iterations for H1 2000
Convergence criterion for H1 0.100D-03
Optimization algorithm EMA
Integration Specifications
Type STANDARD
Number of integration points 15
Dimensions of numerical integration 0
Adaptive quadrature ON
Cholesky OFF
Input data file(s)
tobit.dat
Input data format FREE
SUMMARY OF DATA
Number of missing data patterns 0
COVARIANCE COVERAGE OF DATA
Minimum covariance coverage value 0.100
SUMMARY OF CENSORED LIMITS
APT 800.000
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -1041.063
H0 Scaling Correction Factor 0.988
for MLR
Information Criteria
Number of Free Parameters 6
Akaike (AIC) 2094.126
Bayesian (BIC) 2113.916
Sample-Size Adjusted BIC 2094.907
(n* = (n + 2) / 24)
MODEL RESULTS
                                                  Two-Tailed
                   Estimate       S.E.  Est./S.E.    P-Value
APT ON
    READ              2.698      0.615      4.386      0.000
    MATH              5.914      0.667      8.872      0.000
    PROG2           -12.715     11.850     -1.073      0.283
    PROG3           -46.144     13.780     -3.349      0.001
Intercepts
    APT             209.567     32.867      6.376      0.000
Residual Variances
    APT            4313.260    438.225      9.843      0.000
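The information criteria in the output above can be reproduced by hand from the loglikelihood, the number of free parameters, and the sample size. The Python sketch below uses the standard definitions of AIC and BIC, plus the n* = (n + 2) / 24 adjustment printed in the output. That these are exactly the formulas Mplus applies is an assumption on our part, although the results agree with the printed values to three decimals.

```python
import math

loglik = -1041.063  # H0 Value from the output above
k = 6               # Number of Free Parameters
n = 200             # Number of observations

aic = -2 * loglik + 2 * k
bic = -2 * loglik + k * math.log(n)

# Sample-size adjusted BIC replaces n with n* = (n + 2) / 24.
n_star = (n + 2) / 24
adj_bic = -2 * loglik + k * math.log(n_star)

print(round(aic, 3), round(bic, 3), round(adj_bic, 3))
# 2094.126 2113.916 2094.907
```

The six free parameters are the four regression coefficients, the intercept, and the residual variance listed under MODEL RESULTS.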
Because we used the stdyx option of the output command, the output includes standardized coefficients. We did this primarily to obtain the R-square value for the outcome variable, so we have omitted the standardized output to save space. Based on this output, the model explains about 62% of the variance in apt.
<output omitted>
R-SQUARE
    Observed                                      Two-Tailed
    Variable       Estimate       S.E.  Est./S.E.    P-Value
    APT               0.615      0.043     14.205      0.000
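The unstandardized estimates reported under MODEL RESULTS describe the underlying (uncensored) aptitude variable, so they can be combined into a predicted value and into the probability that a student's score is recorded at the limit of 800. The Python sketch below does this for a hypothetical student with read = 60 and math = 60 in the reference program (prog2 = 0 and prog3 = 0). The student, the function names, and the use of a normal residual distribution with the estimated residual variance are our assumptions for illustration.

```python
import math

# Unstandardized estimates copied from MODEL RESULTS above.
intercept = 209.567
b_read, b_math, b_prog2, b_prog3 = 2.698, 5.914, -12.715, -46.144
resid_var = 4313.260
upper_limit = 800.0  # censoring limit for apt


def predicted_latent_apt(read, math_score, prog2=0, prog3=0):
    """Predicted value of the underlying (uncensored) aptitude variable."""
    return (intercept + b_read * read + b_math * math_score
            + b_prog2 * prog2 + b_prog3 * prog3)


def prob_at_limit(mu):
    """P(latent apt >= 800), assuming a normal residual with the estimated variance."""
    sigma = math.sqrt(resid_var)
    z = (upper_limit - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical student in the reference program.
mu = predicted_latent_apt(read=60, math_score=60)
p_censored = prob_at_limit(mu)

print(round(mu, 3))  # 726.287
print(p_censored)
```

Setting prog3=1 in the call lowers the predicted value by 46.144 points, which is the interpretation of the prog3 coefficient relative to the omitted program category.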
We may also want to test that the coefficients for prog2 and prog3 are both equal to zero. This type of test can also be described as an overall test for the effect of prog. There are multiple ways to test this type of hypothesis; the model test command requests one of them, a Wald test. The Mplus input file shown below is similar to the first regression model, except that the coefficients for prog2 and prog3 are assigned the names p1 and p2, respectively. Note that each variable to be tested must be alone on a line followed by its label in parentheses. In the model test command, these coefficient names (i.e., p1 and p2) are used to test that each of the coefficients is equal to 0.
Data:
File is tobit.dat;
Variable:
Names are id read math prog apt prog1 prog2 prog3;
usevariables are read math apt prog2 prog3;
censored are apt (a);
Model:
apt on read math
prog2 (p1)
prog3 (p2);
Model test:
p1 = 0;
p2 = 0;
The majority of the output from this model is the same as the first model, so we will only show part of the output generated by the model test command.
Wald Test of Parameter Constraints
Value 11.417
Degrees of Freedom 2
P-Value 0.0033
The test statistic of 11.417, with 2 degrees of freedom and an associated p-value of 0.003 indicates that the overall effect of prog is statistically significant.
We can also test additional hypotheses about the differences in the coefficients for different levels of prog. Below we test that the coefficient for prog2 is equal to the coefficient for prog3. In the output below we see that the two coefficients are significantly different.
Data:
File is tobit.dat;
Variable:
Names are id read math prog apt prog1 prog2 prog3;
Missing are all (-9999) ;
usevariables are read math apt prog2 prog3;
censored are apt (a);
Model:
apt on read math
prog2 (p1)
prog3 (p2);
Model test:
p1 = p2;
Wald Test of Parameter Constraints
Value 6.002
Degrees of Freedom 1
P-Value 0.0143
The test statistic of 6.002, with 1 degree of freedom and an associated p-value of 0.0143, indicates that the coefficient for prog2 is significantly different from the coefficient for prog3.
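The p-values for both Wald tests reported above can be checked against the chi-square distribution. The Python sketch below relies on two closed-form results: for 1 degree of freedom the upper-tail probability is erfc(sqrt(x/2)), and for 2 degrees of freedom it is exp(-x/2). The function names are our own, and the sketch covers only these two cases rather than arbitrary degrees of freedom.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Overall test of prog (p1 = 0 and p2 = 0): Wald value 11.417 with 2 df.
p_overall = chi2_sf_df2(11.417)

# Test that the prog2 and prog3 coefficients are equal: Wald value 6.002 with 1 df.
p_equal = chi2_sf_df1(6.002)

print(round(p_overall, 4))  # 0.0033
print(round(p_equal, 4))    # 0.0143
```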
These pages are Copyrighted (c) by UCLA Academic Technology Services