### Statistical Computing Workshop Introduction to Stata Programming

Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,

do hsbcheck hsberr
Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,
ttest write, by(female)
In fact, many of the built-in Stata commands are just ado-files, like the ttest command shown above. You can look at the source code for the ado commands using the viewsource command, for example,
You can also use viewsource with your do-files.
viewsource hsbcheck.do
Do-files can be placed in the same folder as the data but ado-files need to go where Stata can find them. The best place for user written ado-files is in the /ado/personal/ directory. The location of this directory can vary for system to system.

We will try to give you a feel for Stata programming by covering the following topics:

1. Creating and using do-files for checking and cleaning data.
2. Using do-files for analyzing data.
3. Writing an ado program to create a statistical command.
4. Using Stata and Mata matrix commands

#### Dirty Data

We will create a do-file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values.
/* begin hsbcheck.do */
version 11.0
clear
use 1', clear
set more off

nmissing
describe
summarize
list id if id<1 | id>200
list id gender if ~ inlist(gender,1,2,.)
list id race   if ~ inlist(race,1,2,3,4,.)
list id ses    if ~ inlist(ses,1,2,3,.)
list id schtyp if ~ inlist(schtyp,1,2,.)
list id prog   if ~ inlist(prog,1,2,3,.)
list id write   if (write<1   | write>99)     & write~=.
list id math    if (math<1    | math>99)      & math~=.
list id science if (science<1 | science>99)   & science~=.
list id socst   if (socst<1   | socst>99)     & socst~=.
*list id female  if (female<0  | female>1)     & female~=.

set more on
/* end hsbcheck.do */
Here is how our do-file is used with the dataset hsberr.
do hsbcheck hsberr

[output omitted]
So how did the hsbcheck program "know" which file to use? This was done using a macro variable, in this case, 1', which takes the first term typed after the name f\of the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.

Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.

/* begin hsbfix.do */
use hsberr, clear

replace id=193     if id==1193
replace science=61 if science==-61
replace gender=.   if gender==5
replace race=.     if race<1 | race>4
replace ses=.      if ses<1 | ses>3
replace schtyp=.   if schtyp<1 | schtyp>2
replace prog=.     if prog<1 | prog>3
tab gender
tab reg
/* create female from gender */

generate female = gender

recode female 1=0 2=1
l
label define fem 0 "male" 1 "female"

label value female fem
tab female gender
label data "hsb clean data using hsberr.do"
save hsbclean, replace
/* end hsbfix.do */
One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.

First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.

do hsbfix

do hsbcheck hsbclean

[output omitted]

#### Analyze This!

Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do.
/* begin hsbanalyze.do */
log using hsb_fall_2011.txt, text replace

summarize read write socst
univar read write math science
tabstat write, stat(n mean sd p25 p50 p75) by(female)
tabstat write, stat(n mean sd p25 p50 p75) by(prog)
ttest write, by(female)
hist write, normal start(30) width(5) name(hist_write, replace)
kdensity write, normal name(kd_write, replace)
tab1 female ses prog
tab prog ses, all
correlate write read socst female
graph matrix read socst write, half name(grmat, replace)
regress write read i.female##c.socst
margins female, at(socst=(25(5)70)) atmeans asbalanced
marginsplot, recast(line) recastci(rarea) name(plot, replace)

log close
/* end hsbanalyze */ 
Now, let's use hsbanalyze with our data file hsbclean.
use hsbdemo, clear

do hsbanalyze

[output omitted]
This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, "we need to redo the whole analysis using only schtyp equal to one." Here's all you have to do.
keep if schtyp==1

do hsbanalyze

[output omitted]

#### Many Happy Returns

Return list are one of the most powerful and useful features of Stata. There are three commonly used return lists: 1) return list for ordinary nonestimation commands; 2) ereturn list for full estimation commands; and 3) creturn list for a list constants and system parameters.

Let's start with the creturn list (abbr: cret lis).

cret lis
You can access any the values using c(name) by replacing name with the name of the function. For example to get today's date and time,
display "Today is" c(current_date) " and the current time is " c(current_time)
This can also be useful appending a date onto a file name.
local today c(current_date)
local today = subinstr(today'," ","_",.)
save hsbtoday', replace
Here is an example of the return list (abbr: ret lis) following the summarize command.
summarize write, detail

writing score
-------------------------------------------------------------
Percentiles      Smallest
1%           31             31
5%         35.5             31
10%           39             31       Obs                 200
25%         45.5             31       Sum of Wgt.         200

50%           54                      Mean             52.775
Largest       Std. Dev.      9.478586
75%           60             67
90%           65             67       Variance       89.84359
95%           65             67       Skewness      -.4784158
99%           67             67       Kurtosis       2.238527

ret lis

scalars:
r(N) =  200
r(sum_w) =  200
r(mean) =  52.775
r(Var) =  89.84359296482411
r(sd) =  9.47858602138653
r(skewness) =  -.4784157665394925
r(kurtosis) =  2.238527050562138
r(sum) =  10555
r(min) =  31
r(max) =  67
r(p1) =  31
r(p5) =  35.5
r(p10) =  39
r(p25) =  45.5
r(p50) =  54
r(p75) =  60
r(p90) =  65
r(p95) =  65
r(p99) =  67
We can use this information to compute a statistics, such as, the coefficient of variation that Stata does not provide.
display as txt "Coefficient of variation = " as res r(sd)/abs(r(mean))*100

Coefficient of variation = 17.960371
The ereturn list (abbr: eret lis) is used following estimation commands, such as, regress, anova, logit, sem, etc. Here is an example following regress.
use http://www.ats.ucla/stat/data/hsbdemo, clear

regress write read female

[output omitted]

eret lis

scalars:
e(N) =  200
e(df_m) =  2
e(df_r) =  197
e(F) =  77.21062421518373
e(r2) =  .439419213038751
e(rmse) =  7.132734938503835
e(mss) =  7856.321182518197
e(r2_a) =  .4337280375366064
e(ll) =  -675.2152914029984
e(ll_0) =  -733.0934827146214
e(rank) =  3

macros:
e(cmdline) : "regress write read female"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "write"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"

matrices:
e(b) :  1 x 3
e(V) :  3 x 3

functions:
e(sample) 
The matrix e(b) contains the parameter estimates while e(V) has the covariance matrix of the parameter estimates.
mat list e(b)

e(b)[1,3]
y1  .56588693   5.486894  20.228368

mat list e(V)

symmetric e(V)[3,3]
female   .00265893   1.0287262
_cons  -.12883112  -.69953157   7.3644737
You can also access the coefficients and standard errors using _b[varname] and _se[varname]. For example, if you wanted the predicted score for a female with a reading score of 60, you could type the following.
display _b[_cons] + _b[female]*1 + _b[read]*60

59.668478

#### Macrobiotics

Macro variables are a good way to store values for later use. Stata supports two kinds of macro variables: 1) global macros and 2) local macros. Global macros are saved until Stata is shut down or the macros are cleared while local macros exist only while the do-file or ado-file is being run. Then they disappear. Macros have a name and a way to refer to the values stored in the macros. For global macros $name refers to the value of the macro called name. For local macros name' (watch out for the two kinds quote marks) refers to the value of the macro called name. Say I wanted to compute the difference in medians for two groups. Here is one way you could do this using local macros. quietly sum write if female==0, detail local mmedian = r(p50) quietly sum write if female==1, detail local fmedian = r(p50) display "median difference = " fmedian' - mmedian' median difference = 5 macro lis [output omitted] Here is the same thing using global macros. quietly sum write if female==0, detail global mmedian = r(p50) quietly sum write if female==1, detail global fmedian = r(p50) display "median difference = "$fmedian - \$mmedian

median difference = 5

macro list

fmedian:        57
mmedian:        52

[output omitted]
This only scratches the surface of the utility of macro variables. We will see other examples as we go along.

#### Feeling Loopy

Many programming languages support looping. Stata has several ways of doing loops: foreach, forvalues and while. We don't have the time to demonstrate all of them, so we will show you two examples of looping across variables.

For the first example we want to create centered variables and squared center variables for five variables.

use http://www.ats.ucla/stat/data/hsbdemo, clear
foreach var of varlist read write math science socst {
quietly sum var'
gen cvar' = var' - r(mean)  // create centered variable
gen cvar'2 = cvar'^2        // create squared centered variable
}
For our second example we want to create a long dataset in which our variables are stacked on top on one another.
use http://www.ats.ucla/stat/data/hsbdemo, clear

list id read write math science socst in 1/12, sep(0)

+---------------------------------------------+
|  id   read   write   math   science   socst |
|---------------------------------------------|
1. |  45     34      35     41        29      26 |
2. | 108     34      33     41        36      36 |
3. |  15     39      39     44        26      42 |
4. |  67     37      37     42        33      32 |
5. | 153     39      31     40        39      51 |
6. |  51     42      36     42        31      39 |
7. | 164     31      36     46        39      46 |
8. | 133     50      31     40        34      31 |
9. |   2     39      41     33        42      41 |
10. |  53     34      37     46        39      31 |
11. |   1     34      44     40        39      41 |
12. | 128     39      33     38        47      41 |
+---------------------------------------------+

local i = 1
foreach var of varlist read write math science socst {
rename var' yi'
local i = i' + 1
}

list id y* in 1/12, sep(0)

+------------------------------+
|  id   y1   y2   y3   y4   y5 |
|------------------------------|
1. |  45   34   35   41   29   26 |
2. | 108   34   33   41   36   36 |
3. |  15   39   39   44   26   42 |
4. |  67   37   37   42   33   32 |
5. | 153   39   31   40   39   51 |
6. |  51   42   36   42   31   39 |
7. | 164   31   36   46   39   46 |
8. | 133   50   31   40   34   31 |
9. |   2   39   41   33   42   41 |
10. |  53   34   37   46   39   31 |
11. |   1   34   44   40   39   41 |
12. | 128   39   33   38   47   41 |
+------------------------------+

reshape long y, i(id) j(var)

lis id y var in 1/12, sep(0)

+---------------+
| id    y   var |
|---------------|
1. |  1   34     1 |
2. |  1   44     2 |
3. |  1   40     3 |
4. |  1   39     4 |
5. |  1   41     5 |
6. |  2   39     1 |
7. |  2   41     2 |
8. |  2   33     3 |
9. |  2   42     4 |
10. |  2   41     5 |
11. |  3   63     1 |
12. |  3   65     2 |
+---------------+
It is possible to loop throught observations (rows) but it is usually not necessary. Here is an example in which we create an index value that is equal to the observation number..
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

local bign = _N
generate index = .
forvalues i=1/bign' {
quietly replace index=i' in i'
}

list index id in 1/10

+-------------+
| index    id |
|-------------|
1. |     1    45 |
2. |     2   108 |
3. |     3    15 |
4. |     4    67 |
5. |     5   153 |
|-------------|
6. |     6    51 |
7. |     7   164 |
8. |     8   133 |
9. |     9     2 |
10. |    10    53 |
+-------------+
Here is a much simpler way to get the same result
drop index

generate index = _n
There is no end to the uses for looping. Unfortunately, we don't have time to cover more today.

#### Somewhat Iffy

Every good programming language has the ability to conditionally execute commands. Stata is no exception. It has both if and else to allow you to controls the execution of commands. We will illustate the use of if with an example to runs a series of ttests and reports which ones are statistically significant.
foreach var of varlist read write math science {
quietly ttest var', by(female)
if r(p)<.05 {
local msg "* statistically significant"
}
else {
local msg " not statistically significant"
}
display "variable var' t = " r(t) "msg'"
}

variable read t = .74801096 not statistically significant
variable write t = -3.7340739* statistically significant
variable math t = .41299865 not statistically significant
variable science t = 1.8123753 not statistically significant

Let's revisit the issue of looping through observation (rows) with an added if statement. Let's say that we want a list of all the females with reading scores less than 50. Here one way.
use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

display "Females with a reading score less than 50"
local bign = _N
forvalues i=1/bign' {
if female[i']==1 & read[i']<50 {
display id[i'] "  " read[i']
}
}

45  34
51  42
2  39
1  34
106  36
89  35
19  28
...
[output omitted]
Here is a much better (simpler and faster) way to get the same results by subsetting with if as part of list the command.
list id read if female==1 & read<50

[output omitted]
Most of Stata's commands work the if and in clauses.

#### Give'em the Boot

Bootstrapping is an alternative method for determining standard errors. For standard estimation commands, bootstrapping is just a matter of using the vce(boot) option.
use http://www.ats.ucla/stat/data/hsbdemo, clear

regress write read female, vce(boot, reps(100))

(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Linear regression                               Number of obs      =       200
Replications       =       100
Wald chi2(2)       =    215.00
Prob > chi2        =    0.0000
R-squared          =    0.4394
Adj R-squared      =    0.4337
Root MSE           =    7.1327

------------------------------------------------------------------------------
|   Observed   Bootstrap                         Normal-based
write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
read |   .5658869   .0428593    13.20   0.000     .4818842    .6498897
female |   5.486894   .9479372     5.79   0.000     3.628971    7.344817
_cons |   20.22837   2.423242     8.35   0.000      15.4789    24.97784
------------------------------------------------------------------------------
But what if we want bootstrap standard errors for a statistic that Stata does not compute. We will need to write our own ado program. Let's go back to the example of the difference in medians between males and females. We will write a program named med_dif which we will save is a file named med_diff.ado. Here's how we do this.
/* begin med_dif program */
program med_diff, rclass
quietly sum write if female==0, detail
local mmedian = r(p50)
quietly sum write if female==1, detail
local fmedian = r(p50)
return scalar meddif = fmedian - mmedian
display as txt "median difference = " fmedian' - mmedian'
end
/* end med_dif program */
Let's try it.
use http://www.ats.ucla/stat/data/hsbdemo, clear

med_diff

median difference = 5

ret lis

scalars:
r(meddif) =  5
Now, let's use med_diff with the bootstrap command.
bootstrap r(meddif), reps(100): med_diff
(running med_diff on estimation sample)

Warning:  Because med_diff is not an estimation command or does not set e(sample),
bootstrap has no way to determine which observations are used in
calculating the statistics and so assumes that all observations are used.
This means that no observations will be excluded from the resampling
because of missing values or other reasons.

If the assumption is not true, press Break, save the data, and drop the
observations that are to be excluded.  Be sure that the dataset in memory
contains only the relevant data.

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Bootstrap results                               Number of obs      =       200
Replications       =       100

command:  med_diff
_bs_1:  r(meddif)

------------------------------------------------------------------------------
|   Observed   Bootstrap                         Normal-based
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_bs_1 |          5   2.844435     1.76   0.079    -.5749893    10.57499
------------------------------------------------------------------------------

estat boot, percentile

Bootstrap results                               Number of obs      =       200
Replications       =       100

command:  med_diff
_bs_1:  r(meddif)

------------------------------------------------------------------------------
|    Observed               Bootstrap
|       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
_bs_1 |           5        .99   2.8444346           0         12   (P)
------------------------------------------------------------------------------
(P)    percentile confidence interval

#### Talkin' 'Bout My Egeneration

Egen commands can make your life as a programmer much easier by saving you from additional programming. Here is an example that creates a variable containg group means. First, the egen way.
use http://www.ats.ucla.edu/stat/data/hsbmar, clear

egen fmean = mean(write), by(female)
And next, how you might program the same thing yourself.
drop fmean

quietly sum write if female==0, meanonly
generate fmean = r(mean) if female==0
quietly sum write if female==1, meanonly
replace fmean = r(mean) if female==1
Here is another example, in which we create a new variable which is the average of the non-missing values of four variables.
egen amean = rowmean(read write math science)

sum amean read write math science
And here is one way to program this. Note that we need to loop over the observations (rows).
drop amean

local bign = _N
generate amean=.

forvalues i=1/bign' {
local sum = 0
local n = 0
foreach var of varlist read write math science {
if var'[i']~=. {
local sum = sum' + var'[i']
local n = n' + 1
}
if n'~=0 {
replace amean = sum'/n' in i'
}
}
}
These are just two of the many egen commands available to programmers.

#### The Matrix Reloaded

There are two matric systems in Stata: 1) Traditional Stata matrix commands and 2) Mata which is a full programming language in addition to a matrix language. Many of the Stata estimation commands are written in Mata. Here are two examples of doing regression using matrices. We will use the basic regression formula shown below with each method.
b = (X'X)-1X'Y
We want to regres write on female and read. Hopefully the results wil look something like this:
regress write female read

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
_cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
To begin we need to create a constant for the intercept, which we will call cons. Now, let's try this with the Stata matrix commands.
generate cons = 1

mkmat female read cons, mat(X)
mkmat write, mat(y)

matrix b = syminv(X'*X)*X'*y

matrix list b

b[3,1]
write
female   5.486894
cons  20.228368
In Mata the code would look like this.
tomata female write read cons

mata

y = write
X = (female, read, cons)

b = invsym(X'X)*X'y

b

end

1
+---------------+
1 |  5.486893967  |
2 |  .5658869298  |
3 |  20.22836845  |
+---------------+`
These examples demonstrate how matrix commands may be used. Stata's two matrix systems are powerful tools that can do both simple tasks and allow you to program complex estimation procedures.

#### Conclusion

Spending some time learning to program in Stata can increase your productivity and efficiency in working with your research data.

updated 10/11/11

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.