Stata FAQ
How can I create dummy variables in Stata?

There are two easy ways to create dummy variables in Stata. Let's begin with a simple dataset that has three levels of the variable group:
input group
1
1
2
3
2
2
1
3
3
end
We can create dummy variables using the tabulate command and the generate( ) option, as shown below.
tabulate group, generate(dum)

      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       33.33       33.33
          2 |          3       33.33       66.67
          3 |          3       33.33      100.00
------------+-----------------------------------
      Total |          9      100.00

list 

         group      dum1      dum2      dum3 
  1.         1         1         0         0  
  2.         1         1         0         0  
  3.         2         0         1         0  
  4.         3         0         0         1  
  5.         2         0         1         0  
  6.         2         0         1         0  
  7.         1         1         0         0  
  8.         3         0         0         1  
  9.         3         0         0         1 
The tabulate command with the generate option created three dummy variables called dum1, dum2 and dum3.

We can also use the xi command to create dummy variables for us.
xi i.group

i.group               Igroup_1-3   (naturally coded; Igroup_1 omitted)

list , clean

       group   dum1   dum2   dum3   _Igrou~2   _Igrou~3  
  1.       1      1      0      0          0          0  
  2.       1      1      0      0          0          0  
  3.       2      0      1      0          1          0  
  4.       3      0      0      1          0          1  
  5.       2      0      1      0          1          0  
  6.       2      0      1      0          1          0  
  7.       1      1      0      0          0          0  
  8.       3      0      0      1          0          1  
  9.       3      0      0      1          0          1  
The xi command created two dummy variables called _Igroup_2 and _Igroup_3 and omitted the dummy variable for group 1.

An Example Using the High School and Beyond Dataset

Using High School and Beyond dataset we wish to account for variability in the writing test scores using information on reading, math and the program type the student is in. The categorical variable prog has three levels: 1) general program, 2) academic program, and 3) vocational program. First, we will load the dataset from the Internet, then we will create dummy variables for prog using the tabulate command.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear

tabulate prog, generate(prog)

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         45       22.50       22.50
   academic |        105       52.50       75.00
   vocation |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00
The tabulate command with the generate option created the following variables: prog1, prog2, and prog3. In a regression analysis we can only use two of the three dummy variables. Since prog has three levels it uses two degrees of freedom. Here is the regression analysis.
regress write read math prog2 prog3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   41.03
       Model |  8170.58624     4  2042.64656           Prob > F      =  0.0000
    Residual |  9708.28876   195  49.7860962           R-squared     =  0.4570
-------------+------------------------------           Adj R-squared =  0.4459
       Total |   17878.875   199   89.843593           Root MSE      =  7.0559

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .289028   .0659478     4.38   0.000     .1589656    .4190905
        math |   .3587215   .0745443     4.81   0.000     .2117048    .5057381
       prog2 |   .6647754    1.32845     0.50   0.617    -1.955198    3.284749
       prog3 |  -2.253484   1.468445    -1.53   0.127    -5.149556    .6425886
       _cons |   19.00854    3.40933     5.58   0.000     12.28465    25.73243
------------------------------------------------------------------------------
In the analysis all of the variables were statistically significant except for prog2 and prog3. However, it is necessary to remember that it is the combination of prog2 and prog3 that makes up the variable program type. Let's test prog2 and prog3 together.
test prog2 prog3

 ( 1)  prog2 = 0.0
 ( 2)  prog3 = 0.0

       F(  2,   195) =    2.32
            Prob > F =    0.1015
As it turns out, by testing prog2 and prog3 together, we find that the variable program type is not statistically significant.

We can also do this in one step using the xi prefix, as shown below. Note how the results below match those above exactly.
xi: regress write read math i.prog

i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   41.03
       Model |  8170.58624     4  2042.64656           Prob > F      =  0.0000
    Residual |  9708.28876   195  49.7860962           R-squared     =  0.4570
-------------+------------------------------           Adj R-squared =  0.4459
       Total |   17878.875   199   89.843593           Root MSE      =  7.0559

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .289028   .0659478     4.38   0.000     .1589656    .4190905
        math |   .3587215   .0745443     4.81   0.000     .2117048    .5057381
    _Iprog_2 |   .6647754    1.32845     0.50   0.617    -1.955198    3.284749
    _Iprog_3 |  -2.253484   1.468445    -1.53   0.127    -5.149556    .6425886
       _cons |   19.00854    3.40933     5.58   0.000     12.28465    25.73243
------------------------------------------------------------------------------
As we did in the prior example, we can test the overall effect of program type with the test command as shown below.
test _Iprog_2 _Iprog_3

( 1) _Iprog_2 = 0
( 2) _Iprog_3 = 0

F( 2, 195) = 2.32
Prob > F = 0.1015

For more information

See the Stata manual or Stata help for tabulate and for xi.

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.

F