Statistical Computing Workshop
Introduction to Stata Programming

Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,

Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example, In fact, many of the built-in Stata commands are just ado-files, like the ttest command shown above. You can look at the source code for the ado commands using the viewsource command, for example, You can also use viewsource with your do-files. Do-files can be placed in the same folder as the data but ado-files need to go where Stata can find them. The best place for user written ado-files is in the /ado/personal/ directory. The location of this directory can vary for system to system.

We will try to give you a feel for Stata programming by covering the following topics:

  1. Creating and using do-files for checking and cleaning data.
  2. Using do-files for analyzing data.
  3. Writing an ado program to create a statistical command.
  4. Using Stata and Mata matrix commands

Dirty Data

We will create a do-file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values. Here is how our do-file is used with the dataset hsberr. So how did the hsbcheck program "know" which file to use? This was done using a macro variable, in this case, `1', which takes the first term typed after the name f\of the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.

Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.

One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.

First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.

Analyze This!

Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do. Now, let's use hsbanalyze with our data file hsbclean. This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, "we need to redo the whole analysis using only schtyp equal to one." Here's all you have to do.

Many Happy Returns

Return list are one of the most powerful and useful features of Stata. There are three commonly used return lists: 1) return list for ordinary nonestimation commands; 2) ereturn list for full estimation commands; and 3) creturn list for a list constants and system parameters.

Let's start with the creturn list (abbr: cret lis).

cret lis
You can access any the values using c(name) by replacing name with the name of the function. For example to get today's date and time,
display "Today is" c(current_date) " and the current time is " c(current_time)
This can also be useful appending a date onto a file name.
local today c(current_date)
local today = subinstr(`today'," ","_",.)
save hsb`today', replace
Here is an example of the return list (abbr: ret lis) following the summarize command.
summarize write, detail

                        writing score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           31             31
 5%         35.5             31
10%           39             31       Obs                 200
25%         45.5             31       Sum of Wgt.         200

50%           54                      Mean             52.775
                        Largest       Std. Dev.      9.478586
75%           60             67
90%           65             67       Variance       89.84359
95%           65             67       Skewness      -.4784158
99%           67             67       Kurtosis       2.238527

ret lis

scalars:
                  r(N) =  200
              r(sum_w) =  200
               r(mean) =  52.775
                r(Var) =  89.84359296482411
                 r(sd) =  9.47858602138653
           r(skewness) =  -.4784157665394925
           r(kurtosis) =  2.238527050562138
                r(sum) =  10555
                r(min) =  31
                r(max) =  67
                 r(p1) =  31
                 r(p5) =  35.5
                r(p10) =  39
                r(p25) =  45.5
                r(p50) =  54
                r(p75) =  60
                r(p90) =  65
                r(p95) =  65
                r(p99) =  67
We can use this information to compute a statistics, such as, the coefficient of variation that Stata does not provide.
display as txt "Coefficient of variation = " as res r(sd)/abs(r(mean))*100

Coefficient of variation = 17.960371
The ereturn list (abbr: eret lis) is used following estimation commands, such as, regress, anova, logit, sem, etc. Here is an example following regress.
use http://www.ats.ucla/stat/data/hsbdemo, clear

regress write read female 

[output omitted]

eret lis

scalars:
                  e(N) =  200
               e(df_m) =  2
               e(df_r) =  197
                  e(F) =  77.21062421518373
                 e(r2) =  .439419213038751
               e(rmse) =  7.132734938503835
                e(mss) =  7856.321182518197
                e(rss) =  10022.5538174818
               e(r2_a) =  .4337280375366064
                 e(ll) =  -675.2152914029984
               e(ll_0) =  -733.0934827146214
               e(rank) =  3

macros:
            e(cmdline) : "regress write read female"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "write"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3

functions:
             e(sample) 
The matrix e(b) contains the parameter estimates while e(V) has the covariance matrix of the parameter estimates.
mat list e(b)

e(b)[1,3]
         read     female      _cons
y1  .56588693   5.486894  20.228368

mat list e(V)

symmetric e(V)[3,3]
              read      female       _cons
  read   .00243887
female   .00265893   1.0287262
 _cons  -.12883112  -.69953157   7.3644737
You can also access the coefficients and standard errors using _b[varname] and _se[varname]. For example, if you wanted the predicted score for a female with a reading score of 60, you could type the following.
display _b[_cons] + _b[female]*1 + _b[read]*60 

59.668478

Macrobiotics

Macro variables are a good way to store values for later use. Stata supports two kinds of macro variables: 1) global macros and 2) local macros. Global macros are saved until Stata is shut down or the macros are cleared while local macros exist only while the do-file or ado-file is being run. Then they disappear. Macros have a name and a way to refer to the values stored in the macros. For global macros $name refers to the value of the macro called name. For local macros `name' (watch out for the two kinds quote marks) refers to the value of the macro called name.

Say I wanted to compute the difference in medians for two groups. Here is one way you could do this using local macros.

quietly sum write if female==0, detail
local mmedian = r(p50)
quietly sum write if female==1, detail
local fmedian = r(p50)
display "median difference = " `fmedian' - `mmedian'

median difference = 5

macro lis

[output omitted]
Here is the same thing using global macros.
quietly sum write if female==0, detail
global mmedian = r(p50)
quietly sum write if female==1, detail
global fmedian = r(p50)
display "median difference = " $fmedian - $mmedian

median difference = 5

macro list

fmedian:        57
mmedian:        52

[output omitted]
This only scratches the surface of the utility of macro variables. We will see other examples as we go along.

Feeling Loopy

Many programming languages support looping. Stata has several ways of doing loops: foreach, forvalues and while. We don't have the time to demonstrate all of them, so we will show you two examples of looping across variables.

For the first example we want to create centered variables and squared center variables for five variables.

use http://www.ats.ucla/stat/data/hsbdemo, clear
foreach var of varlist read write math science socst {
  quietly sum `var'
  gen c`var' = `var' - r(mean)  // create centered variable
  gen c`var'2 = c`var'^2        // create squared centered variable
}
For our second example we want to create a long dataset in which our variables are stacked on top on one another.
use http://www.ats.ucla/stat/data/hsbdemo, clear

list id read write math science socst in 1/12, sep(0)

     +---------------------------------------------+
     |  id   read   write   math   science   socst |
     |---------------------------------------------|
  1. |  45     34      35     41        29      26 |
  2. | 108     34      33     41        36      36 |
  3. |  15     39      39     44        26      42 |
  4. |  67     37      37     42        33      32 |
  5. | 153     39      31     40        39      51 |
  6. |  51     42      36     42        31      39 |
  7. | 164     31      36     46        39      46 |
  8. | 133     50      31     40        34      31 |
  9. |   2     39      41     33        42      41 |
 10. |  53     34      37     46        39      31 |
 11. |   1     34      44     40        39      41 |
 12. | 128     39      33     38        47      41 |
     +---------------------------------------------+

local i = 1
foreach var of varlist read write math science socst {
  rename `var' y`i'
  local i = `i' + 1
}

list id y* in 1/12, sep(0)

     +------------------------------+
     |  id   y1   y2   y3   y4   y5 |
     |------------------------------|
  1. |  45   34   35   41   29   26 |
  2. | 108   34   33   41   36   36 |
  3. |  15   39   39   44   26   42 |
  4. |  67   37   37   42   33   32 |
  5. | 153   39   31   40   39   51 |
  6. |  51   42   36   42   31   39 |
  7. | 164   31   36   46   39   46 |
  8. | 133   50   31   40   34   31 |
  9. |   2   39   41   33   42   41 |
 10. |  53   34   37   46   39   31 |
 11. |   1   34   44   40   39   41 |
 12. | 128   39   33   38   47   41 |
     +------------------------------+

reshape long y, i(id) j(var)

lis id y var in 1/12, sep(0)

     +---------------+
     | id    y   var |
     |---------------|
  1. |  1   34     1 |
  2. |  1   44     2 |
  3. |  1   40     3 |
  4. |  1   39     4 |
  5. |  1   41     5 |
  6. |  2   39     1 |
  7. |  2   41     2 |
  8. |  2   33     3 |
  9. |  2   42     4 |
 10. |  2   41     5 |
 11. |  3   63     1 |
 12. |  3   65     2 |
     +---------------+
It is possible to loop throught observations (rows) but it is usually not necessary. Here is an example in which we create an index value that is equal to the observation number..
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

local bign = _N
generate index = .
forvalues i=1/`bign' {
  quietly replace index=`i' in `i'
}

list index id in 1/10

     +-------------+
     | index    id |
     |-------------|
  1. |     1    45 |
  2. |     2   108 |
  3. |     3    15 |
  4. |     4    67 |
  5. |     5   153 |
     |-------------|
  6. |     6    51 |
  7. |     7   164 |
  8. |     8   133 |
  9. |     9     2 |
 10. |    10    53 |
     +-------------+
Here is a much simpler way to get the same result
drop index

generate index = _n
There is no end to the uses for looping. Unfortunately, we don't have time to cover more today.

Somewhat Iffy

Every good programming language has the ability to conditionally execute commands. Stata is no exception. It has both if and else to allow you to controls the execution of commands. We will illustate the use of if with an example to runs a series of ttests and reports which ones are statistically significant.
foreach var of varlist read write math science {
  quietly ttest `var', by(female)
  if r(p)<.05 {
    local msg "* statistically significant" 
  }
  else {
    local msg " not statistically significant"
  }
  display "variable `var' t = " r(t) "`msg'"
}

variable read t = .74801096 not statistically significant
variable write t = -3.7340739* statistically significant
variable math t = .41299865 not statistically significant
variable science t = 1.8123753 not statistically significant
Let's revisit the issue of looping through observation (rows) with an added if statement. Let's say that we want a list of all the females with reading scores less than 50. Here one way.
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

display "Females with a reading score less than 50"
local bign = _N
forvalues i=1/`bign' {
  if female[`i']==1 & read[`i']<50 {
    display id[`i'] "  " read[`i'] 
  }
}

45  34
51  42
2  39
1  34
106  36
89  35
19  28
...
[output omitted]
Here is a much better (simpler and faster) way to get the same results by subsetting with if as part of list the command.
list id read if female==1 & read<50

[output omitted]
Most of Stata's commands work the if and in clauses.

Give'em the Boot

Bootstrapping is an alternative method for determining standard errors. For standard estimation commands, bootstrapping is just a matter of using the vce(boot) option.
use http://www.ats.ucla/stat/data/hsbdemo, clear

regress write read female, vce(boot, reps(100))

(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Linear regression                               Number of obs      =       200
                                                Replications       =       100
                                                Wald chi2(2)       =    215.00
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.4394
                                                Adj R-squared      =    0.4337
                                                Root MSE           =    7.1327

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0428593    13.20   0.000     .4818842    .6498897
      female |   5.486894   .9479372     5.79   0.000     3.628971    7.344817
       _cons |   20.22837   2.423242     8.35   0.000      15.4789    24.97784
------------------------------------------------------------------------------
But what if we want bootstrap standard errors for a statistic that Stata does not compute. We will need to write our own ado program. Let's go back to the example of the difference in medians between males and females. We will write a program named med_dif which we will save is a file named med_diff.ado. Here's how we do this.
/* begin med_dif program */
program med_diff, rclass
  quietly sum write if female==0, detail
  local mmedian = r(p50)
  quietly sum write if female==1, detail
  local fmedian = r(p50)
  return scalar meddif = fmedian - mmedian
  display as txt "median difference = " `fmedian' - `mmedian'
end
/* end med_dif program */
Let's try it.
use http://www.ats.ucla/stat/data/hsbdemo, clear

med_diff

median difference = 5

ret lis

scalars:
             r(meddif) =  5
Now, let's use med_diff with the bootstrap command.
bootstrap r(meddif), reps(100): med_diff
(running med_diff on estimation sample)

Warning:  Because med_diff is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in
          calculating the statistics and so assumes that all observations are used.
          This means that no observations will be excluded from the resampling
          because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure that the dataset in memory
          contains only the relevant data.

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |          5   2.844435     1.76   0.079    -.5749893    10.57499
------------------------------------------------------------------------------

estat boot, percentile

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |           5        .99   2.8444346           0         12   (P)
------------------------------------------------------------------------------
(P)    percentile confidence interval

Talkin' 'Bout My Egeneration

Egen commands can make your life as a programmer much easier by saving you from additional programming. Here is an example that creates a variable containg group means. First, the egen way.
use http://www.ats.ucla.edu/stat/data/hsbmar, clear

egen fmean = mean(write), by(female)
And next, how you might program the same thing yourself.
drop fmean

quietly sum write if female==0, meanonly
generate fmean = r(mean) if female==0
quietly sum write if female==1, meanonly
replace fmean = r(mean) if female==1
Here is another example, in which we create a new variable which is the average of the non-missing values of four variables.
egen amean = rowmean(read write math science)

sum amean read write math science
And here is one way to program this. Note that we need to loop over the observations (rows).
drop amean

local bign = _N
generate amean=.

forvalues i=1/`bign' {
  local sum = 0
  local n = 0
  foreach var of varlist read write math science {
     if `var'[`i']~=. {
       local sum = `sum' + `var'[`i']
       local n = `n' + 1
     }
     if `n'~=0 {
     replace amean = `sum'/`n' in `i'
     }
  }
}
These are just two of the many egen commands available to programmers.

The Matrix Reloaded

There are two matric systems in Stata: 1) Traditional Stata matrix commands and 2) Mata which is a full programming language in addition to a matrix language. Many of the Stata estimation commands are written in Mata. Here are two examples of doing regression using matrices. We will use the basic regression formula shown below with each method.
b = (X'X)-1X'Y
We want to regres write on female and read. Hopefully the results wil look something like this:
regress write female read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
To begin we need to create a constant for the intercept, which we will call cons. Now, let's try this with the Stata matrix commands.
generate cons = 1

mkmat female read cons, mat(X)
mkmat write, mat(y)

matrix b = syminv(X'*X)*X'*y

matrix list b

b[3,1]
            write
female   5.486894
  read  .56588693
  cons  20.228368
In Mata the code would look like this.
tomata female write read cons

mata

y = write
X = (female, read, cons)

b = invsym(X'X)*X'y

b

end

                 1
    +---------------+
  1 |  5.486893967  |
  2 |  .5658869298  |
  3 |  20.22836845  |
    +---------------+
These examples demonstrate how matrix commands may be used. Stata's two matrix systems are powerful tools that can do both simple tasks and allow you to program complex estimation procedures.

Conclusion

Spending some time learning to program in Stata can increase your productivity and efficiency in working with your research data.

updated 10/11/11

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.