UCLA Academic Technology Services

Stata Learning Module
A statistical sampler in Stata

This module gives a brief overview of some common statistical tests in Stata. We will use the auto data file for our examples.

use auto 
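
If auto.dta is not in the current working directory, use auto will fail. A copy of the auto data ships with Stata and can be loaded with sysuse. The sketch below assumes the shipped copy matches the file used in this module, which may not hold exactly (for example, the shipped copy labels the values of foreign as Domestic and Foreign rather than showing 0 and 1).

```stata
* Load the example data set shipped with Stata
* (assumed to be equivalent to the auto file used in this module)
sysuse auto, clear

* List the variables, then summarize the ones used in this module
describe
summarize mpg price weight rep78 foreign
```

The Obs column of summarize is a quick way to spot variables with missing values before running any tests.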

t-tests

Let's do a t-test comparing the miles per gallon (mpg) of foreign and domestic cars.

ttest mpg , by(foreign) 
 
Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      52    19.82692     .657777    4.743297    18.50638    21.14747
       1 |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72

                      Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
       t =  -3.6308                t =  -3.6308              t =  -3.6308
   P < t =   0.0003          P > |t| =   0.0005          P > t =   0.9997 

As you see in the output above, the domestic cars (foreign = 0) had significantly lower mpg (19.8) than the foreign cars (24.8), with a two-sided p-value of 0.0005.
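
The test above assumes that the two groups have equal variances, yet the standard deviations in the output differ (4.74 versus 6.61). As a hedged sketch, one might check that assumption with sdtest and rerun the comparison with the unequal option, which does not pool the variances. The last line reproduces the t statistic by hand from the diff row and should come out near the reported -3.6308.

```stata
* Assumes the auto data are in memory.
* Variance-ratio test of equal standard deviations in the two groups
sdtest mpg, by(foreign)

* t-test that does not assume equal variances
ttest mpg, by(foreign) unequal

* t = difference in means / standard error of the difference
* (both values copied from the output above)
display -4.945804 / 1.362162
```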

Chi-square

Let's compare the repair rating (rep78) of the foreign and domestic cars. We can make a crosstab of rep78 by foreign. We may want to ask whether these variables are independent. We can use the chi2 option to request a chi-square test of independence as well as the crosstab.

tabulate rep78 foreign, chi2 
           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

          Pearson chi2(4) =  27.2640   Pr = 0.000 

The chi-square test is not reliable when cells have small expected frequencies, as they do here, where two cells are empty. In such cases you can request Fisher's exact test with the exact option.

tabulate rep78 foreign, chi2 exact 
           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

          Pearson chi2(4) =  27.2640   Pr = 0.000
          Fisher's exact =                 0.000 
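
To see why the chi-square approximation is doubtful for this table, it can help to look at the expected cell counts. A minimal sketch, assuming the auto data are in memory with no observations dropped: the expected option prints the expected frequency under independence beneath each observed count. For example, the expected count for a rep78 of 1 among foreign cars is 2 x 21 / 69, or about 0.6.

```stata
* Observed and expected frequencies; small expected counts
* signal an unreliable chi-square approximation
tabulate rep78 foreign, chi2 expected

* Fisher's exact test on its own, for comparison
tabulate rep78 foreign, exact
```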

Correlation

We can use the correlate command to get the correlations among variables. Let's look at the correlations among price, mpg, weight, and rep78. (We use rep78 in the correlation even though it is not continuous, to illustrate what happens when you use correlate with variables that have missing data.)

correlate price mpg weight rep78 
 (obs=69)

         |    price      mpg   weight    rep78
---------+------------------------------------
   price |   1.0000
     mpg |  -0.4559   1.0000
  weight |   0.5478  -0.8055   1.0000
   rep78 |   0.0066   0.4023  -0.4003   1.0000 

Note that the output above said (obs=69). The correlate command drops data on a listwise basis, meaning that if any of the variables are missing, then the entire observation is omitted from the correlation analysis.

We can use pwcorr (pairwise correlations) if we want to obtain correlations that delete missing data on a pairwise basis instead of a listwise basis. We will use the obs option to show the number of observations used for calculating each correlation.

pwcorr price mpg weight rep78, obs 
           |    price      mpg   weight    rep78
----------+------------------------------------
    price |   1.0000 
          |       74
          |
      mpg |  -0.4686   1.0000 
          |       74       74
          |
   weight |   0.5386  -0.8072   1.0000 
          |       74       74       74
          |
    rep78 |   0.0066   0.4023  -0.4003   1.0000 
          |       69       69       69       69
          | 

Note how the correlations that involve rep78 have an N of 69 compared to the other correlations that have an N of 74. This is because rep78 has five missing values, so it only had 69 valid observations, but the other variables had no missing data so they had 74 valid observations.
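
The missing-value count can be confirmed directly. The sketch below assumes the full auto data are still in memory (nothing dropped yet). It counts the observations with rep78 missing and then adds the sig option to pwcorr so that each correlation is shown with its significance level as well as its N.

```stata
* Count the observations with rep78 missing; the text above reports five
count if missing(rep78)
assert r(N) == 5

* Pairwise correlations with the N and significance level of each
pwcorr price mpg weight rep78, obs sig
```

The assert line stops with an error if the count is not five, which is a cheap way to check that the data are in the state the text describes.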

Regression

Let's look at doing regression analysis in Stata. For this example, let's drop the cases where rep78 is 1 or 2 or missing.

drop if (rep78 <= 2) | (rep78==.) 
 (15 observations deleted) 

Now, let's predict mpg from price and weight. As you see below, weight is a significant predictor of mpg, but price is not.

regress mpg price weight 
 
  Source |       SS       df       MS                  Number of obs =      59
---------+------------------------------               F(  2,    56) =   47.87
   Model |  1375.62097     2  687.810483               Prob > F      =  0.0000
Residual |  804.616322    56  14.3681486               R-squared     =  0.6310
---------+------------------------------               Adj R-squared =  0.6178
   Total |  2180.23729    58  37.5902981               Root MSE      =  3.7905

------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   price |  -.0000139   .0002108     -0.066   0.948      -.0004362    .0004084
  weight |   -.005828   .0007301     -7.982   0.000      -.0072906   -.0043654
   _cons |   39.08279   1.855011     21.069   0.000       35.36676    42.79882
------------------------------------------------------------------------------ 
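
After regress, predicted values and residuals are often the next thing to look at. A small sketch, in which mpghat and mpgres are hypothetical variable names chosen only for illustration:

```stata
* Run right after the regression above (regress mpg price weight).
* Fitted values from the most recent regression
predict mpghat

* Residuals (observed mpg minus fitted mpg)
predict mpgres, residuals

* Compare observed, fitted, and residual values for the first ten cars
list mpg mpghat mpgres in 1/10
```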

What if we wanted to use rep78 as a predictor as well? rep78 is really more of a categorical variable than a continuous variable. To include it in a regression, we should convert rep78 into dummy variables. Fortunately, Stata makes it easy to create dummy variables using tabulate. The gen(rep) option tells Stata that we want to generate dummy variables from rep78 and that we want the stem of the dummy variable names to be rep.

tabulate rep78, gen(rep) 
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         30       50.85       50.85
          4 |         18       30.51       81.36
          5 |         11       18.64      100.00
------------+-----------------------------------
      Total |         59      100.00 

Stata has created rep1 (1 if rep78 is 3), rep2 (1 if rep78 is 4) and rep3 (1 if rep78 is 5). We can use the tabulate command to verify that the dummy variables were created properly.

tabulate rep78 rep1 
           |  rep78==     3.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |         0         30 |        30 
         4 |        18          0 |        18 
         5 |        11          0 |        11 
-----------+----------------------+----------
     Total |        29         30 |        59  

tabulate rep78 rep2 
           |  rep78==     4.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30 
         4 |         0         18 |        18 
         5 |        11          0 |        11 
-----------+----------------------+----------
     Total |        41         18 |        59  

tabulate rep78 rep3 
           |  rep78==     5.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30 
         4 |        18          0 |        18 
         5 |         0         11 |        11 
-----------+----------------------+----------
     Total |        48         11 |        59  

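Now we can include rep1 and rep2 as dummy variables in a regression model. Note that the model below predicts price from mpg, weight, and the dummies. We leave out rep3, so cars with a rep78 of 5 serve as the reference category.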

regress price mpg weight rep1 rep2 
 
  Source |       SS       df       MS                  Number of obs =      59
---------+------------------------------               F(  4,    54) =    8.06
   Model |   179897908     4  44974477.0               Prob > F      =  0.0000
Residual |   301329048    54  5580167.55               R-squared     =  0.3738
---------+------------------------------               Adj R-squared =  0.3274
   Total |   481226956    58  8297016.48               Root MSE      =  2362.2

------------------------------------------------------------------------------
   price |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     mpg |  -45.59633   86.36275     -0.528   0.600      -218.7432    127.5506
  weight |    2.09409   .6236754      3.358   0.001       .8436957    3.344484
    rep1 |  -1889.762   955.4336     -1.978   0.053      -3805.291    25.76703
    rep2 |  -1247.299   962.3382     -1.296   0.200      -3176.671    682.0728
   _cons |   2296.682   3644.487      0.630   0.531      -5010.074    9603.438
------------------------------------------------------------------------------ 
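
Because rep3 is left out of the model, the coefficients on rep1 and rep2 compare repair ratings of 3 and 4 with a rating of 5. Two hedged follow-ups are sketched here: test asks whether rep1 and rep2 are jointly zero, and, assuming Stata 11 or later, factor-variable notation should fit the same model without creating the dummies by hand (ib5. makes 5 the base level).

```stata
* Refit the model from above so that test refers to it
regress price mpg weight rep1 rep2

* Joint test that the coefficients on rep1 and rep2 are both zero
test rep1 rep2

* Same model in factor-variable notation (Stata 11 or later),
* with rep78 == 5 as the base level
regress price mpg weight ib5.rep78
```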

Analysis of variance

If you want to do an analysis of variance looking at the differences in mpg among the three repair groups, you can use the oneway command.

oneway mpg rep78 
                         Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
 

If you include the tabulate option, you get mean mpg for the three groups, which shows that the group with the best repair rating (rep78 of 5) also has the highest mpg (27.4).

oneway mpg rep78, tabulate 
 
            |           Summary of mpg
      rep78 |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          3 |   19.433333   4.1413252          30
          4 |   21.666667   4.9348699          18
          5 |   27.363636   8.7323849          11
------------+------------------------------------
      Total |    21.59322   6.1310927          59

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
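
The F test says that the groups differ but not which ones. As a sketch, oneway can add pairwise comparisons with a multiple-comparison adjustment; Bonferroni is used here only as one common choice. Note also that Bartlett's test above (p = 0.007) suggests that the group variances are not equal, so the ANOVA results deserve some caution.

```stata
* Assumes the 59-observation data from above are in memory.
* Group means plus Bonferroni-adjusted pairwise comparisons of mpg across rep78
oneway mpg rep78, tabulate bonferroni
```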
 

If you want to include covariates, you need to use the anova command. The continuous(price weight) option tells Stata that those variables are covariates.

anova mpg rep78 price weight, continuous(price weight)
 
                           Number of obs =      59     R-squared     =  0.6586
                           Root MSE      = 3.71263     Adj R-squared =  0.6333

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  1435.91975     4  358.979938      26.04     0.0000
                         |
                   rep78 |  60.2987853     2  30.1493926       2.19     0.1221
                   price |   3.8421233     1   3.8421233       0.28     0.5997
                  weight |  529.932889     1  529.932889      38.45     0.0000
                         |
                Residual |  744.317536    54  13.7836581   
              -----------+----------------------------------------------------
                   Total |  2180.23729    58  37.5902981   
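
The continuous() option reflects the anova syntax of older Stata releases. In Stata 11 and later, covariates are instead marked with the c. factor-variable prefix. A sketch of what should be the equivalent call, assuming one of those newer releases and the same 59 observations in memory:

```stata
* Stata 11 or later: c. marks price and weight as continuous covariates
anova mpg rep78 c.price c.weight
```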
 

These pages are Copyrighted (c) by UCLA Academic Technology Services