|
|
|
||||
|
Stat Computing > Seminars
> Stata8
|
|
||||
Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands to you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.
There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,
We will try to give users a feel for Stata programming by covering the following topics:
/* begin hsbcheck.do */ version 9.2 clear use `1' set more off nmissing summarize list id if id<1 | id>200 list id gender if ~ inlist(gender,1,2,.) list id race if ~ inlist(race,1,2,3,4,.) list id ses if ~ inlist(ses,1,2,3,.) list id schtyp if ~ inlist(schtyp,1,2,.) list id prog if ~ inlist(prog,1,2,3,.) list id read if (read<1 | read>99) & read~=. list id write if (write<1 | write>99) & write~=. list id math if (math<1 | math>99) & math~=. list id science if (science<1 | science>99) & science~=. list id socst if (socst<1 | socst>99) & socst~=. set more on /* end hsbcheck.do */Here is how our do-file is used with the dataset hsberr.
do hsbcheck hsberr [output omitted]So how did the hsbcheck program "know" which file to use? This was done using a macro variable, in this case, `1', which takes the first term typed after the name f\of the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.
Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.
/* begin hsbfix.do */ version 9.2 clear use hsberr replace id=193 if id==1193 /* correct value known */ replace read=47 if read==147 /* correct value known */ replace science=61 if science==-61 /* correct value known */ replace gender=. if gender==5 /* correct value unknown */ replace race=. if race<1 | race>4 /* correct value unknown */ replace ses=. if ses<1 | ses>3 /* correct value unknown */ replace schtyp=. if schtyp<1 | schtyp>2 /* correct value unknown */ replace prog=. if prog<1 | prog>3 /* correct value unknown */ generate female = gender recode female 1=0 2=1 label define fem 0 "male" 1 "female" label value female fem save hsbclean, replace /* end hsbfix.do */One important thing to note is that after we fix the incorrect values, we will save the the data file with a new name. We will never change any of the values in the original data file, hsberr.
First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.
do hsbfix do hsbcheck hsbclean [output omitted]
/* begin hsbanalyze.do */ verson 9.2 log using hsb10_14_08.txt, text replace summarize read write math science univar read write math science tabstat write, stat(n mean sd p25 p50 p75) by(female) tabstat write, stat(n mean sd p25 p50 p75) by(prog) ttest write, by(gender) histogram write, normal start(30) width(5) more kdensity write, normal more tab1 female ses prog tab prog ses, all correlate write read science female more regress write read female rvfplot more ologit ses read write female gologit2 ses read write female, autofit mlogit prog read write female log close /* end hsbanalyze */Now, let's use hsbanalyze with our data file hsbclean.
use hsbclean, clear do hsbanalyze [output omitted]This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, "we need to redo the whole analysis using only schtyp equal to one." Here's all you have to do.
keep if schtyp==1 do hsbanalyze [output omitted]
We will illustrate the ado program by writing a command that computes the median. Of course, Stata already have commands that compute medians but we are doing this to illustrate the process of creating an ado program.
The basic logic of computing the median is to sort the variable of interest then, if there are an odd number of values take the middle one, an if there are an even number of values take half the distance between the middle two. Below is the first version of our program which is saved in the file names median1.ado.
/* Median Program -- Version #1 */
/* basic program with one variable */
program define median1
version 9.2
sort `1'
quietly count if `1' ~= .
local n = r(N)
local mid = int(`n'/2)
local odd = mod(`n',2)
if `odd' {
local median = `1'[`mid'+1]
}
else {
local median = (`1'[`mid'] + `1'[`mid'+1])/2
}
display
display as text "Median of `1' = " as result `median'
end
/* end median1.ado */
Let's try median1 on the hsbclean dataset.
use hsbclean median1 write Median of write = 54This program worked just fine but it could be improved. We will modify the program to improve the output format and to allow for multiple variables. We will call this new program median2 which will be saved in the file median2.ado.
/* Median Program -- Version #2 */
/* multiple variables; saves results in return list */
program define median2, rclass
version 9.2
syntax varlist(numeric)
preserve /* preserve data */
display
display as text " Variable N Median"
display as text "----------------------------"
foreach var of local varlist {
quietly count if `var' ~= .
local n = r(N)
local mid = int(`n'/2)
local odd = mod(`n',2)
sort `var'
if `odd' {
local median = `var'[`mid'+1]
}
else {
local median = (`var'[`mid'] + `var'[`mid'+1]) / 2
}
display as result %9s "`var'" %9.0f `n' %10.2f `median'
}
return scalar Mdn = `median'
return scalar N = `n'
restore /* restore data */
end
/* end median2.ado */
Here is how median2 works.
median2 read write math science
Variable N Median
----------------------------
read 200 50.00
write 200 54.00
math 200 52.00
science 195 53.00
Again, everything seems to be working fine but it would be better if the
program allowed the use of if or in to subset the data. For example,
what if we wanted the medians just for males. The program median3 will
allow the user to use if and in.
/* Median Program -- Version #3 */
/* Allows for if and in */
program define median3, rclass
version 9.2
syntax varlist(numeric) [if] [in]
display
display as text " Variable N Median"
display as text "----------------------------"
foreach var of local varlist {
preserve
if "`if'"~="" | "`in'"~="" {
quietly keep `if' `in'
}
quietly keep if ~missing(`var')
quietly count
local n = r(N)
local mid = int(`n'/2)
local odd = mod(`n',2)
sort `var'
if `odd' {
local median = `var'[`mid'+1]
}
else {
local median = (`var'[`mid'] + `var'[`mid'+1])/2
}
display as result %9s "`var'" %9.0f `n' %10.2f `median'
restore
}
return scalar Mdn = `median'
return scalar N = `n'
end
/* end median3.ado */
Here is how this version of the program works.
median3 read write math science if female==0
Variable N Median
----------------------------
read 90 52.00
write 90 50.50
math 90 51.50
science 85 55.00
median3 read write math science in 50/150
Variable N Median
----------------------------
read 101 50.00
write 101 54.00
math 101 52.00
science 98 53.00
The rclass option in median2 and median3 allow you to temporarily
store the results from a program. For our program, we keep the frequency and median
for the last variable used in the command. Here is how you can view the values being stored.
return list
scalars:
r(N) = 99
r(Mdn) = 50
Stata has included some tools to make creating and debugging program as bit easier. One
of them is the trace option that shows you command by command what is happening inside
your program. Let's run median3 with trace turned on and see what it
looks like.
set trace on median3 science if female==0 [output omitted] set trace off
We will begin with matreg1.ado that makes use of the famous matrix equation for the regression coefficients, b=(X'X)-1X'Y.
/* begin matreg1.ado */
program define matreg1
version 9.0
syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
marksample touse /* mark cases in the sample */
tokenize "`varlist'"
quietly matrix accum sscp = `varlist' if `touse'
matrix XX = sscp[2...,2...] /* X'X */
matrix Xy = sscp[1,2...] /* X'y */
matrix b = Xy * syminv(XX) /* (X'X)-1X'y */
local k = colsof(b) /* number of coefs */
local nobs = r(N)
local df = `nobs' - (rowsof(sscp) - 1) /* df residual */
matrix hat = Xy * b'
matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'
matrix C = corr(V)
matrix seb = vecdiag(V)
matrix seb = seb[1, 1...]
matrix t = J(1,`k',0)
matrix p = t
local i = 1
while `i' <= `k' {
matrix seb[1,`i'] = sqrt(seb[1,`i'])
matrix t[1,`i'] = b[1,`i']/seb[1,`i']
matrix p[1,`i'] = tprob(`df',t[1,`i'])
local i = `i' + 1
}
display
display "Dependent variable: `1'"
display
display "Regression coefficients"
matrix list b
display
display "Standard error of coefficients"
matrix list seb
display
display "Values of t"
matrix list t
display
display "P values for t"
matrix list p
display
display "Covariance of the regression coefficients"
matrix list V
display
display "Correlation of the regression coefficients"
matrix list C
matrix drop sscp XX Xy b hat V seb t p C
end
/* end matreg1.ado */
Here is an example of how to use matreg1.
use http://www.ats.ucla.edu/stat/data/hsb2, clear
regress write read female
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 77.21
Model | 7856.32118 2 3928.16059 Prob > F = 0.0000
Residual | 10022.5538 197 50.8759077 R-squared = 0.4394
-------------+------------------------------ Adj R-squared = 0.4337
Total | 17878.875 199 89.843593 Root MSE = 7.1327
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .5658869 .0493849 11.46 0.000 .468496 .6632778
female | 5.486894 1.014261 5.41 0.000 3.48669 7.487098
_cons | 20.22837 2.713756 7.45 0.000 14.87663 25.58011
------------------------------------------------------------------------------
matreg1 write read female
Dependent variable: write
Regression coefficients
b[1,3]
read female _cons
write .56588693 5.486894 20.228368
Standard error of coefficients
seb[1,3]
read female _cons
r1 .04938488 1.0142614 2.7137564
Values of t
t[1,3]
c1 c2 c3
r1 11.458708 5.4097434 7.4540105
P values for t
p[1,3]
c1 c2 c3
r1 1.265e-23 1.818e-07 2.800e-12
Covariance of the regression coefficients
symmetric V[3,3]
read female _cons
read .00243887
female .00265893 1.0287262
_cons -.12883112 -.69953157 7.3644737
Correlation of the regression coefficients
symmetric C[3,3]
read female _cons
read 1
female .05308387 1
_cons -.96129327 -.25414792 1
In matreg1 we computed t-tests and p-values manually. We also managed
all of the output display. We can let Stata do all of that for us by using
the estimates post command which will greatly simplify
the program. We will do this in matreg2 as
shown below.
/* begin matreg2.ado */
program define matreg2, eclass
version 9.0
syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
marksample touse /* mark cases in the sample */
tokenize "`varlist'"
quietly matrix accum sscp = `varlist' if `touse'
matrix XX = sscp[2...,2...] /* X'X */
matrix Xy = sscp[1,2...] /* X'y */
matrix b = Xy * syminv(XX) /* (X'X)-1X'y */
local k = colsof(b) /* number of coefs */
local nobs = r(N)
local df = `nobs' - (rowsof(sscp) - 1) /* df residual */
matrix hat = Xy * b'
matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'
ereturn post b V, dof(`df') obs(`nobs') depname(`1') ///
esample(`touse')
ereturn local depvar "`1'"
ereturn local cmd "matreg"
display
ereturn display, level(`level')
matrix drop sscp XX Xy hat
end
/* end matreg2.ado */
Now, check out matreg2.
matreg2 write read female
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .5658869 .0493849 11.46 0.000 .468496 .6632778
female | 5.486894 1.014261 5.41 0.000 3.48669 7.487098
_cons | 20.22837 2.713756 7.45 0.000 14.87663 25.58011
------------------------------------------------------------------------------
The eclass option does for estimation commands what the rclass option
does for regular statistical commands. To view the eclass values we need
to use the ereturn list command>
ereturn list
scalars:
e(N) = 200
e(df_m) = 2
e(df_r) = 197
e(F) = 77.21062421518363
e(r2) = .4394192130387506
e(rmse) = 7.132734938503835
e(mss) = 7856.321182518186
e(rss) = 10022.5538174818
e(r2_a) = .4337280375366059
e(ll) = -675.2152914029984
e(ll_0) = -733.0934827146212
macros:
e(cmdline) : "regress write read female, nohead"
e(title) : "Linear regression"
e(vce) : "ols"
e(depvar) : "write"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 3
e(V) : 3 x 3
functions:
e(sample)
matrix list e(b)
e(b)[1,3]
read female _cons
y1 .56588693 5.486894 20.228368
matrix list e(V)
symmetric e(V)[3,3]
read female _cons
read .00243887
female .00265893 1.0287262
_cons -.12883112 -.69953157 7.3644737
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services