Help the Stat Consulting Group by giving a gift

Introduction to Stata Programming

Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files.
Do-files are run from the command line using the **do** command, for example,

We will try to give you a feel for Stata programming by covering the following topics:

- Creating and using do-files for checking and cleaning data.
- Using do-files for analyzing data.
- Writing an ado program to create a statistical command.
- Using Stata and Mata matrix commands

Here is how our do-file is used with the dataset/* begin hsbcheck.do */ version 11.0 clear use `1', clear set more off nmissing describe summarize list id if id<1 | id>200 list id gender if ~ inlist(gender,1,2,.) list id race if ~ inlist(race,1,2,3,4,.) list id ses if ~ inlist(ses,1,2,3,.) list id schtyp if ~ inlist(schtyp,1,2,.) list id prog if ~ inlist(prog,1,2,3,.) list id read if (read<1 | read>99) & read~=. list id write if (write<1 | write>99) & write~=. list id math if (math<1 | math>99) & math~=. list id science if (science<1 | science>99) & science~=. list id socst if (socst<1 | socst>99) & socst~=. *list id female if (female<0 | female>1) & female~=. set more on /* end hsbcheck.do */

So how did thedo hsbcheck hsberr[output omitted]

Now that we know what errors there are in the data we can write a do-file
that will fix the errors. When we know the correct value of an observation, we will replace the
incorrect value with the correct one. When we do not know the correct value for an observation, we
will replace the incorrect value with missing. The do-file **hsbfix.do** will read in
**hsberr**, correct the errors and save the corrected file as **hsbclean**.
Here is what **hsbfix.do** looks like.

One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file,/* begin hsbfix.do */ use hsberr, clear replace id=193 if id==1193 replace read=47 if read==147 replace science=61 if science==-61 replace gender=. if gender==5 replace race=. if race<1 | race>4 replace ses=. if ses<1 | ses>3 replace schtyp=. if schtyp<1 | schtyp>2 replace prog=. if prog<1 | prog>3 tab gender tab reg /* create female from gender */ generate female = gender recode female 1=0 2=1 l label define fem 0 "male" 1 "female" label value female fem tab female gender label data "hsb clean data using hsberr.do" save hsbclean, replace /* end hsbfix.do */

First, we will run **hsbfix** on the original file **hsberr** then, as a check, we will run **hsbcheck** on the new file
**hsbclean**.

do hsbfix do hsbcheck hsbclean[output omitted]

Now, let's use/* begin hsbanalyze.do */ log using hsb_fall_2011.txt, text replace summarize read write socst univar read write math science tabstat write, stat(n mean sd p25 p50 p75) by(female) tabstat write, stat(n mean sd p25 p50 p75) by(prog) ttest write, by(female) hist write, normal start(30) width(5) name(hist_write, replace) kdensity write, normal name(kd_write, replace) tab1 female ses prog tab prog ses, all correlate write read socst female graph matrix read socst write, half name(grmat, replace) regress write read i.female##c.socst margins female, at(socst=(25(5)70)) atmeans asbalanced marginsplot, recast(line) recastci(rarea) name(plot, replace) log close /* end hsbanalyze */

This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, "we need to redo the whole analysis using onlyuse hsbdemo, clear do hsbanalyze[output omitted]

keep if schtyp==1 do hsbanalyze[output omitted]

Let's start with the **creturn list** (abbr: **cret lis**).

You can access any the values usingcret lis

This can also be useful appending a date onto a file name.display "Today is" c(current_date) " and the current time is " c(current_time)

Here is an example of thelocal today c(current_date) local today = subinstr(`today'," ","_",.) save hsb`today', replace

We can use this information to compute a statistics, such as, the coefficient of variation that Stata does not provide.summarize write, detailwriting score ------------------------------------------------------------- Percentiles Smallest 1% 31 31 5% 35.5 31 10% 39 31 Obs 200 25% 45.5 31 Sum of Wgt. 200 50% 54 Mean 52.775 Largest Std. Dev. 9.478586 75% 60 67 90% 65 67 Variance 89.84359 95% 65 67 Skewness -.4784158 99% 67 67 Kurtosis 2.238527ret lisscalars: r(N) = 200 r(sum_w) = 200 r(mean) = 52.775 r(Var) = 89.84359296482411 r(sd) = 9.47858602138653 r(skewness) = -.4784157665394925 r(kurtosis) = 2.238527050562138 r(sum) = 10555 r(min) = 31 r(max) = 67 r(p1) = 31 r(p5) = 35.5 r(p10) = 39 r(p25) = 45.5 r(p50) = 54 r(p75) = 60 r(p90) = 65 r(p95) = 65 r(p99) = 67

Thedisplay as txt "Coefficient of variation = " as res r(sd)/abs(r(mean))*100Coefficient of variation = 17.960371

The matrix e(b) contains the parameter estimates while e(V) has the covariance matrix of the parameter estimates.use http://www.ats.ucla/stat/data/hsbdemo, clear regress write read female[output omitted]eret lisscalars: e(N) = 200 e(df_m) = 2 e(df_r) = 197 e(F) = 77.21062421518373 e(r2) = .439419213038751 e(rmse) = 7.132734938503835 e(mss) = 7856.321182518197 e(rss) = 10022.5538174818 e(r2_a) = .4337280375366064 e(ll) = -675.2152914029984 e(ll_0) = -733.0934827146214 e(rank) = 3 macros: e(cmdline) : "regress write read female" e(title) : "Linear regression" e(marginsok) : "XB default" e(vce) : "ols" e(depvar) : "write" e(cmd) : "regress" e(properties) : "b V" e(predict) : "regres_p" e(model) : "ols" e(estat_cmd) : "regress_estat" matrices: e(b) : 1 x 3 e(V) : 3 x 3 functions: e(sample)

You can also access the coefficients and standard errors using _b[mat list e(b)e(b)[1,3] read female _cons y1 .56588693 5.486894 20.228368mat list e(V)symmetric e(V)[3,3] read female _cons read .00243887 female .00265893 1.0287262 _cons -.12883112 -.69953157 7.3644737

display _b[_cons] + _b[female]*1 + _b[read]*6059.668478

Say I wanted to compute the difference in medians for two groups. Here is one way you could do this using local macros.

Here is the same thing using global macros.quietly sum write if female==0, detail local mmedian = r(p50) quietly sum write if female==1, detail local fmedian = r(p50) display "median difference = " `fmedian' - `mmedian'median difference = 5macro lis[output omitted]

This only scratches the surface of the utility of macro variables. We will see other examples as we go along.quietly sum write if female==0, detail global mmedian = r(p50) quietly sum write if female==1, detail global fmedian = r(p50) display "median difference = " $fmedian - $mmedianmedian difference = 5macro listfmedian: 57 mmedian: 52 [output omitted]

For the first example we want to create centered variables and squared center variables for five variables.

For our second example we want to create a long dataset in which our variables are stacked on top on one another.use http://www.ats.ucla/stat/data/hsbdemo, clear foreach var of varlist read write math science socst { quietly sum `var' gen c`var' = `var' - r(mean) // create centered variable gen c`var'2 = c`var'^2 // create squared centered variable }

It is possible to loop throught observations (rows) but it is usually not necessary. Here is an example in which we create an index value that is equal to the observation number..use http://www.ats.ucla/stat/data/hsbdemo, clear list id read write math science socst in 1/12, sep(0)+---------------------------------------------+ | id read write math science socst | |---------------------------------------------| 1. | 45 34 35 41 29 26 | 2. | 108 34 33 41 36 36 | 3. | 15 39 39 44 26 42 | 4. | 67 37 37 42 33 32 | 5. | 153 39 31 40 39 51 | 6. | 51 42 36 42 31 39 | 7. | 164 31 36 46 39 46 | 8. | 133 50 31 40 34 31 | 9. | 2 39 41 33 42 41 | 10. | 53 34 37 46 39 31 | 11. | 1 34 44 40 39 41 | 12. | 128 39 33 38 47 41 | +---------------------------------------------+local i = 1 foreach var of varlist read write math science socst { rename `var' y`i' local i = `i' + 1 } list id y* in 1/12, sep(0)+------------------------------+ | id y1 y2 y3 y4 y5 | |------------------------------| 1. | 45 34 35 41 29 26 | 2. | 108 34 33 41 36 36 | 3. | 15 39 39 44 26 42 | 4. | 67 37 37 42 33 32 | 5. | 153 39 31 40 39 51 | 6. | 51 42 36 42 31 39 | 7. | 164 31 36 46 39 46 | 8. | 133 50 31 40 34 31 | 9. | 2 39 41 33 42 41 | 10. | 53 34 37 46 39 31 | 11. | 1 34 44 40 39 41 | 12. | 128 39 33 38 47 41 | +------------------------------+reshape long y, i(id) j(var) lis id y var in 1/12, sep(0)+---------------+ | id y var | |---------------| 1. | 1 34 1 | 2. | 1 44 2 | 3. | 1 40 3 | 4. | 1 39 4 | 5. | 1 41 5 | 6. | 2 39 1 | 7. | 2 41 2 | 8. | 2 33 3 | 9. | 2 42 4 | 10. | 2 41 5 | 11. | 3 63 1 | 12. | 3 65 2 | +---------------+

Here is a much simpler way to get the same resultuse http://www.ats.ucla.edu/stat/data/hsbdemo, clear local bign = _N generate index = . forvalues i=1/`bign' { quietly replace index=`i' in `i' } list index id in 1/10+-------------+ | index id | |-------------| 1. | 1 45 | 2. | 2 108 | 3. | 3 15 | 4. | 4 67 | 5. | 5 153 | |-------------| 6. | 6 51 | 7. | 7 164 | 8. | 8 133 | 9. | 9 2 | 10. | 10 53 | +-------------+

There is no end to the uses for looping. Unfortunately, we don't have time to cover more today.drop index generate index = _n

Let's revisit the issue of looping through observation (rows) with an addedforeach var of varlist read write math science { quietly ttest `var', by(female) if r(p)<.05 { local msg "* statistically significant" } else { local msg " not statistically significant" } display "variable `var' t = " r(t) "`msg'" }variable read t = .74801096 not statistically significant variable write t = -3.7340739* statistically significant variable math t = .41299865 not statistically significant variable science t = 1.8123753 not statistically significant

Here is a much better (simpler and faster) way to get the same results by subsetting withuse http://www.ats.ucla.edu/stat/data/hsbdemo, clear display "Females with a reading score less than 50" local bign = _N forvalues i=1/`bign' { if female[`i']==1 & read[`i']<50 { display id[`i'] " " read[`i'] } }45 34 51 42 2 39 1 34 106 36 89 35 19 28 ... [output omitted]

Most of Stata's commands work thelist id read if female==1 & read<50[output omitted]

But what if we want bootstrap standard errors for a statistic that Stata does not compute. We will need to write our own ado program. Let's go back to the example of the difference in medians between males and females. We will write a program named med_dif which we will save is a file named med_diff.ado. Here's how we do this.use http://www.ats.ucla/stat/data/hsbdemo, clear regress write read female, vce(boot, reps(100))(running regress on estimation sample) Bootstrap replications (100) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 .................................................. 100 Linear regression Number of obs = 200 Replications = 100 Wald chi2(2) = 215.00 Prob > chi2 = 0.0000 R-squared = 0.4394 Adj R-squared = 0.4337 Root MSE = 7.1327 ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based write | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- read | .5658869 .0428593 13.20 0.000 .4818842 .6498897 female | 5.486894 .9479372 5.79 0.000 3.628971 7.344817 _cons | 20.22837 2.423242 8.35 0.000 15.4789 24.97784 ------------------------------------------------------------------------------

Let's try it./* begin med_dif program */ program med_diff, rclass quietly sum write if female==0, detail local mmedian = r(p50) quietly sum write if female==1, detail local fmedian = r(p50) return scalar meddif = fmedian - mmedian display as txt "median difference = " `fmedian' - `mmedian' end /* end med_dif program */

Now, let's useuse http://www.ats.ucla/stat/data/hsbdemo, clear med_diffmedian difference = 5ret lisscalars: r(meddif) = 5

bootstrap r(meddif), reps(100): med_diff(running med_diff on estimation sample) Warning: Because med_diff is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means that no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data. Bootstrap replications (100) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 .................................................. 100 Bootstrap results Number of obs = 200 Replications = 100 command: med_diff _bs_1: r(meddif) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _bs_1 | 5 2.844435 1.76 0.079 -.5749893 10.57499 ------------------------------------------------------------------------------estat boot, percentileBootstrap results Number of obs = 200 Replications = 100 command: med_diff _bs_1: r(meddif) ------------------------------------------------------------------------------ | Observed Bootstrap | Coef. Bias Std. Err. [95% Conf. Interval] -------------+---------------------------------------------------------------- _bs_1 | 5 .99 2.8444346 0 12 (P) ------------------------------------------------------------------------------ (P) percentile confidence interval

And next, how you might program the same thing yourself.use http://www.ats.ucla.edu/stat/data/hsbmar, clear egen fmean = mean(write), by(female)

Here is another example, in which we create a new variable which is the average of the non-missing values of four variables.drop fmean quietly sum write if female==0, meanonly generate fmean = r(mean) if female==0 quietly sum write if female==1, meanonly replace fmean = r(mean) if female==1

And here is one way to program this. Note that we need to loop over the observations (rows).egen amean = rowmean(read write math science) sum amean read write math science

These are just two of the manydrop amean local bign = _N generate amean=. forvalues i=1/`bign' { local sum = 0 local n = 0 foreach var of varlist read write math science { if `var'[`i']~=. { local sum = `sum' + `var'[`i'] local n = `n' + 1 } if `n'~=0 { replace amean = `sum'/`n' in `i' } } }

We want to regresb = (X'X)^{-1}X'Y

To begin we need to create a constant for the intercept, which we will callregress write female readSource | SS df MS Number of obs = 200 -------------+------------------------------ F( 2, 197) = 77.21 Model | 7856.32118 2 3928.16059 Prob > F = 0.0000 Residual | 10022.5538 197 50.8759077 R-squared = 0.4394 -------------+------------------------------ Adj R-squared = 0.4337 Total | 17878.875 199 89.843593 Root MSE = 7.1327 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- female | 5.486894 1.014261 5.41 0.000 3.48669 7.487098 read | .5658869 .0493849 11.46 0.000 .468496 .6632778 _cons | 20.22837 2.713756 7.45 0.000 14.87663 25.58011 ------------------------------------------------------------------------------

In Mata the code would look like this.generate cons = 1 mkmat female read cons, mat(X) mkmat write, mat(y) matrix b = syminv(X'*X)*X'*y matrix list bb[3,1] write female 5.486894 read .56588693 cons 20.228368

These examples demonstrate how matrix commands may be used. Stata's two matrix systems are powerful tools that can do both simple tasks and allow you to program complex estimation procedures.tomata female write read cons mata y = write X = (female, read, cons) b = invsym(X'X)*X'y b end1 +---------------+ 1 | 5.486893967 | 2 | .5658869298 | 3 | 20.22836845 | +---------------+

updated 10/11/11

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.