UCLA Academic Technology Services

Stata Textbook Examples
Introduction to the Practice of Statistics by Moore and McCabe
Chapter 2: Looking at Data, Relationships

Figure 2.1, page 107 uses a file with SAT scores for the different states of the United States. The file is called satstate.dta which is a Stata data file. We can read that file into Stata like this.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/satstate, clear
Figure 2.1 can be produced using the graph command below. The second scatter plot draws only the point for California and uses mlabel(state) to label it; mlabposition(9) places the label at the 9 o'clock position, to the left of the point.
graph twoway (scatter svavg perctake) (scatter svavg perctake if state == "California", mlabel(state) mlabposition(9)), ///
	      ylabel(460(20)620) xlabel(0(10)90)
Example 2.4 and figure 2.3, page 110. You can input the data in the table in example 2.4 as shown below. You just type each line, one at a time, in the command window. This is a quick and dirty way to enter a small data file into Stata.
clear
input femur humerus
38 41 
56 63
59 70
64 72
74 84
end
We can now graph the data like this, producing figure 2.3 on page 110.
graph twoway scatter humerus femur, xlabel(30(10)90) ylabel(30(10)90) 
Example 2.5 and figure 2.4, page 111. The corn yield data is stored in the file ta02_001.dta. Since this is a Stata data file, we can read the file like this.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_001, clear
We can compute the average corn yield using the egen command below. It tells Stata to create a new variable, avgcorn, that holds the mean of corn yield (corn_yie) within each level of the number of plants planted (plants_p).
egen avgcorn = mean(corn_yie), by(plants_p)
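As a minimal sketch of what egen with by() does, the toy example below uses hypothetical data (the variables group and y and their values are made up for illustration and are not from the book). Every observation receives the mean of its own group, which is why the means can later be overlaid on the scatter plot.

```stata
* Hypothetical toy data: two groups with two observations each
clear
input group y
1 10
1 20
2 30
2 50
end
* each observation gets the mean of y within its own group
egen ybar = mean(y), by(group)
list, sepby(group)
```

Here ybar should be the same for both rows of a group, which is the pattern avgcorn follows within each value of plants_p.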
When we construct the graph we first do the scatter plot with corn_yie and plants_p and then overlay the second scatterplot of avgcorn and plants_p and connect the means with the connect(l) option.
graph twoway (scatter corn_yie plants_p) (scatter avgcorn plants_p, connect(l)), xlabel(12(4)28) 
We skip Examples 2.6 and 2.7, and figures 2.5 and 2.6 for now.

We skip Example 2.8 and figure 2.7 for now. We also skip section 2.2.

Example 2.11 and figure 2.12, pages 135-136. We read in the height data, contained in the file ta02_005, as shown below.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_005, clear
We can graph the data as shown below to create figure 2.12, page 136.
graph twoway scatter height_y age_x_in, ylabel(74(2)86) xlabel(16(2)32)
Figure 2.13, page 138.

Note: The regression line drawn by Stata does not extend past the range of the data; therefore, in this graph the regression line does not reach the point where the yline() and xline() reference lines cross.

graph twoway (scatter height_y age_x_in) (lfit height_y age_x_in), ///
	ylabel(74(2)86) xlabel(16(2)32) yline(85.25) xline(32)
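If you want the fitted line to extend beyond the observed data, one option is the range() option of twoway lfit, which sets the x range over which the line is drawn. The sketch below is not part of the original example; it uses the small femur and humerus data from example 2.4 so that it is self-contained, and the axis limits are illustrative choices. Applied to the height data, the analogous call would presumably be lfit height_y age_x_in, range(16 32).

```stata
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
* range(30 90) draws the fitted line from x = 30 to x = 90,
* beyond the observed femur values (38 to 74)
graph twoway (scatter humerus femur) (lfit humerus femur, range(30 90)), ///
	xlabel(30(10)90) ylabel(30(10)110)
```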
Example 2.14 and figure 2.15, pages 141-142. Example 2.14 shows how to compute the equation of the regression line. We can do this in Stata using the regress command shown below. In the results, the row labeled _cons gives 64.92832, which is how Stata reports the constant (intercept), or a. The row labeled age_x_in gives .6349653, which is how Stata reports the slope, or b. These results match the hand-computed results in example 2.14 and are similar to the computer output shown in figure 2.15.
regress height_y age_x_in

  Source |       SS       df       MS                  Number of obs =      12
---------+------------------------------               F(  1,    10) =  880.00
   Model |  57.6548678     1  57.6548678               Prob > F      =  0.0000
Residual |  .655171562    10  .065517156               R-squared     =  0.9888
---------+------------------------------               Adj R-squared =  0.9876
   Total |  58.3100394    11  5.30091267               Root MSE      =  .25596

------------------------------------------------------------------------------
height_y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
age_x_in |   .6349653   .0214047     29.665   0.000       .5872726     .682658
   _cons |   64.92832    .508409    127.709   0.000       63.79551    66.06112
------------------------------------------------------------------------------
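After regress, the estimates can also be used directly in expressions through _b[], which can be handy for writing out the fitted equation or computing a prediction by hand. The sketch below is an illustration, not part of the original example. It uses the femur and humerus data from example 2.4 so that it is self-contained, and the femur value of 60 is an arbitrary choice.

```stata
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
regress humerus femur
* _b[_cons] holds the intercept a and _b[femur] holds the slope b
display "a = " _b[_cons] "   b = " _b[femur]
* predicted humerus length for a femur length of 60 (arbitrary value)
display _b[_cons] + _b[femur]*60
```

With the height data, the same idea would use _b[_cons] and _b[age_x_in].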
We skip figures 2.16-2.17, pages 144-146.
We illustrate how to produce the results of example 2.16 and figure 2.18, pages 154-155 below.
We can compute the residuals using the predict command. We name the new variable res_ht (residual height), and the residual option tells Stata that we want residuals. You can replace res_ht with any valid variable name, but residual is the option keyword that tells Stata to compute residuals.
predict res_ht, residual
We list the residuals below, and see they match the residuals in example 2.16 (except for slight differences in rounding).
list res_ht

        res_ht 
  1. -.2576923  
  2.   .007344  
  3.  .4723772  
  4. -.0625896  
  5. -.0975488  
  6.  .1674798  
  7. -.2674809  
  8.  .2975508  
  9.  -.237416  
 10. -.2723751  
 11.  .0926596  
 12.  .1576913  
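A residual is the observed value minus the fitted value. As a quick check of what predict is doing, the sketch below computes the residuals both ways and compares them. It is an illustration using the femur and humerus data from example 2.4, and the variable names fit, res, and res_check are arbitrary.

```stata
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
quietly regress humerus femur
* fitted values from the regression line
predict fit, xb
* residuals computed by Stata
predict res, residual
* residuals computed by hand: observed minus fitted
generate res_check = humerus - fit
list femur humerus fit res res_check
* least-squares residuals average to zero, up to rounding
summarize res
```

The res and res_check columns should agree except for small differences due to floating-point storage.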
Figure 2.18a is the same as figure 2.13, so we omit it. We can obtain the residual plot, figure 2.18b on page 155, by graphing the residual res_ht against age age_x_in, as shown below.
graph twoway scatter res_ht age_x_in, xlabel(16(2)32) ylabel(-.4(.2).8) yline(0)
We show how to work through example 2.17 and produce figures 2.20-2.22 (pages 158-159). First, we read in the data for this example.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/eg2_17, clear
Figure 2.20, page 158.
graph twoway (scatter y x) (lfit y x), xlabel(3800(200)5000) ylabel(6000(500)8000) 
We obtain the regression equation shown in example 2.17 using the regress command. This also shows us the r-squared.
regress y x

  Source |       SS       df       MS                  Number of obs =       8
---------+------------------------------               F(  1,     6) =   13.63
   Model |  486552.229     1  486552.229               Prob > F      =  0.0102
Residual |  214209.271     6  35701.5452               R-squared     =  0.6943
---------+------------------------------               Adj R-squared =  0.6434
   Total |   700761.50     7  100108.786               Root MSE      =  188.95

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |   1.066322   .2888466      3.692   0.010       .3595401    1.773105
   _cons |   2492.692   1267.199      1.967   0.097      -608.0328    5593.416
------------------------------------------------------------------------------
Following the regress command, we get the residual plot in figure 2.21, page 159 using the commands below.
predict yres, residual
graph twoway scatter yres x , yline(0) xlabel(3800(200)5000) ylabel(-300(100)400)
We can also show the residuals across years as shown in figure 2.22, page 159.
graph twoway scatter yres year, yline(0) xlabel(1990(1)1997) ylabel(-300(100)400)
We illustrate example 2.18 and figures 2.23-2.25, pages 160-162, below. We start by using the data file from table 2.8.

Note: For figures 2.23-2.25, there is no variable in the data set that distinguishes points representing one child from points representing two children.

Figure 2.23, page 161.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_008, clear

graph twoway (scatter score age) (lfit score age), ylabel(40(20)140) xlabel(0(10)50)
We calculate the line of best fit using the regress command below.
regress score age

  Source |       SS       df       MS                  Number of obs =      21
---------+------------------------------               F(  1,    19) =   13.20
   Model |  1604.08089     1  1604.08089               Prob > F      =  0.0018
Residual |  2308.58578    19  121.504515               R-squared     =  0.4100
---------+------------------------------               Adj R-squared =  0.3789
   Total |  3912.66667    20  195.633333               Root MSE      =  11.023

------------------------------------------------------------------------------
   score |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -1.126989   .3101721     -3.633   0.002      -1.776187   -.4777913
   _cons |   109.8738   5.067802     21.681   0.000       99.26681    120.4809
------------------------------------------------------------------------------
Figure 2.24, page 161. We compute the residuals (calling the variable scor_res) and show the residual plot from figure 2.24. Note that this plot does not label children 18 and 19 as the book does.
predict scor_res, resid
graph twoway scatter scor_res age, yline(0) xlabel(0(10)50) ylabel(-20(10)30)
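We show the residual graph again, this time labeling children 18 and 19 with their child number to help identify the outlying children. The child variable is the observation number, so this assumes the observations are stored in child-number order. Indeed, we can see children 18 and 19 as shown in figure 2.24, page 161.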
gen child = _n
graph twoway (scatter scor_res age) (scatter scor_res age if child == 18|child ==19, mlabel(child)), ///
	        yline(0) xlabel(0(10)50) ylabel(-20(10)30)
Figure 2.25, page 162
graph twoway (scatter score age) (lfit score age if child ~= 18) (lfit score age), ///
	        xlabel(0(10)50) ylabel(40(20)140)
	        
Example 2.19, page 163 shows the drop in r-squared when child 18 is removed. We can run the regression without child 18 like this, using the child variable created above. Indeed, we see the r-squared does go down to .11.
regress score age if child != 18

  Source |       SS       df       MS                  Number of obs =      20
---------+------------------------------               F(  1,    18) =    2.27
   Model |  280.519481     1  280.519481               Prob > F      =  0.1489
Residual |  2220.48052    18  123.360029               R-squared     =  0.1122
---------+------------------------------               Adj R-squared =  0.0628
   Total |     2501.00    19  131.631579               Root MSE      =  11.107

------------------------------------------------------------------------------
   score |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -.7792208   .5167331     -1.508   0.149      -1.864837    .3063951
   _cons |   105.6299   7.161928     14.749   0.000       90.58322    120.6765
------------------------------------------------------------------------------
Figures 2.26-2.28, pages 164-165. Figure 2.26 shows computations of three regression diagnostics. We first re-run the regression with all of the observations, then use the predict command to create the regression diagnostics shown in figure 2.26. We name these diagnostics resid, rstud, and dffit, but we could have used any valid Stata variable names.
regress score age 

  Source |       SS       df       MS                  Number of obs =      21
---------+------------------------------               F(  1,    19) =   13.20
   Model |  1604.08089     1  1604.08089               Prob > F      =  0.0018
Residual |  2308.58578    19  121.504515               R-squared     =  0.4100
---------+------------------------------               Adj R-squared =  0.3789
   Total |  3912.66667    20  195.633333               Root MSE      =  11.023

------------------------------------------------------------------------------
   score |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -1.126989   .3101721     -3.633   0.002      -1.776187   -.4777913
   _cons |   109.8738   5.067802     21.681   0.000       99.26681    120.4809
------------------------------------------------------------------------------
predict resid, residual
predict rstud, rstudent
predict dffit, dfits
Below we list the variables, and we see the values are the same as figure 2.26, page 164.
list resid rstud dffit

         resid      rstud      dffit 
  1.  2.030993   .1839685    .041274  
  2. -9.572129  -.9415833  -.4025207  
  3. -15.60395  -1.510812    -.39114  
  4. -8.730941  -.8142633  -.2243285  
  5.  9.030993   .8328629    .186856  
  6. -.3340623  -.0306318  -.0085717  
  7.   3.41196   .3112468   .0772239  
  8.  2.523037   .2297157   .0563035  
  9.  3.142071   .2899101   .0854075  
 10.  6.665938   .6176603   .1728405  
 11.  11.01508   1.050847   .3319969  
 12.  -3.73094  -.3428315  -.0944496  
 13. -15.60395  -1.510812    -.39114  
 14. -13.47696  -1.279776  -.3136739  
 15.  4.523037   .4131532   .1012641  
 16.  1.396049   .1273934   .0329814  
 17.  8.650026   .7982811   .1871661  
 18. -5.540306  -.8451108  -1.155779  
 19.  30.28497    3.60698   .8537371  
 20. -11.47696  -1.076481  -.2638462  
 21.  1.396049   .1273934   .0329814  
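To see how these diagnostics relate to one another, note that Stata defines DFITS as the studentized residual multiplied by sqrt(h/(1 - h)), where h is the leverage of the observation. The sketch below is an illustration, not part of the book's example. It checks that relationship on the femur and humerus data from example 2.4, and the names lev and dffit_check are arbitrary.

```stata
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
quietly regress humerus femur
* studentized residuals
predict rstud, rstudent
* DFITS as computed by Stata
predict dffit, dfits
* leverage (diagonal of the hat matrix)
predict lev, leverage
* DFITS rebuilt from the studentized residual and the leverage
generate dffit_check = rstud*sqrt(lev/(1 - lev))
list rstud lev dffit dffit_check
```

This also suggests why child 18 has a modest residual but the largest dffit in the listing above: an observation with high leverage can have a large DFITS even when its residual is small.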
We can produce figure 2.27, page 165 using the qnorm command. We use the mlabel(child) option, together with msymbol(i) to suppress the marker symbol, so that each point is plotted as its child number. We can see that child 19 is far from the diagonal, as shown in the book.
qnorm rstud, mlabel(child) msymbol(i) ylabel(-4(2)4) xlabel(-3(1)3)
Likewise, we can use the qnorm command to make figure 2.28, page 165 as illustrated below.
qnorm dffit, mlabel(child) msymbol(i) 
Figure 2.32, page 182. We can make figure 2.32 by first using the data file with the data from table 2.13 below.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_013, clear
We then use the graph command below. The connect(l) option (that is an l, not a 1 (one)) says to connect the points with a line, and the msymbol(i) option means to omit any symbol for the points (e.g., no dot, square or triangle for each point). This produces a graph like figure 2.32.
graph twoway scatter mbbl_ year, connect(l) msymbol(i) 
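Figure 2.33, page 183. We can make the bacteria data like this. We create 25 observations, set the count to 1 in hour 1, and double the previous hour's count in each later hour.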
clear
set obs 25
generate hours = _n
generate numbact = 1 if hours == 1
replace numbact = numbact[_n-1]*2 if hours > 1
format numbact %10.0g

list

        hours     numbact 
  1.         1           1  
  2.         2           2  
  3.         3           4  
  4.         4           8  
  5.         5          16  
  6.         6          32  
  7.         7          64  
  8.         8         128  
  9.         9         256  
 10.        10         512  
 11.        11        1024  
 12.        12        2048  
 13.        13        4096  
 14.        14        8192  
 15.        15       16384  
 16.        16       32768  
 17.        17       65536  
 18.        18      131072  
 19.        19      262144  
 20.        20      524288  
 21.        21     1048576  
 22.        22     2097152  
 23.        23     4194304  
 24.        24     8388608  
 25.        25    16777216
graph twoway scatter numbact hours, connect(l)
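Because the count doubles every hour starting from 1, the same data can also be generated from the closed form 2^(hours - 1). The sketch below is an alternative to the commands above rather than the original method. It stores numbact as a double, which is a precaution for exact integer storage, not something the original requires.

```stata
clear
set obs 25
generate hours = _n
* doubling every hour from a count of 1 gives 2^(hours - 1)
generate double numbact = 2^(hours - 1)
format numbact %10.0g
list in 1/5
```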
Figure 2.34, page 185 shows the graph of the log of the population by the number of hours. We can make the log of the number of bacteria and graph it like this.

Note: The y-scale for this figure does not match the book.

generate logbact = log(numbact)
graph twoway scatter logbact hours, connect(l)
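In Stata, log() is the natural logarithm (ln() is a synonym), and log10() gives the base-10 logarithm. If the book's figure uses base-10 logs (an assumption; this page does not say), that would be one possible reason for the different y-scale. The sketch below rebuilds the bacteria data and computes both versions so they can be compared. The names lnbact and log10bact are arbitrary.

```stata
clear
set obs 25
generate hours = _n
generate double numbact = 2^(hours - 1)
* natural log; log(numbact) gives the same values
generate double lnbact = ln(numbact)
* base-10 log
generate double log10bact = log10(numbact)
list hours lnbact log10bact in 1/5
* the two differ only by the constant factor ln(10)
display ln(10)
```

Either version plots as a straight line against hours; only the scale of the y-axis changes.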
Figures 2.35-2.37, pages 186-187. Let's use the world oil production data file again.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_013, clear
Let's make the log of oil production and graph that as shown in figure 2.35, page 186.
generate logbbl = log(mbbl_)
graph twoway scatter logbbl year, connect(l) msymbol(i)
Now we can use the log of oil production to get the regression results for example 2.26, page 187.
regress logbbl year

  Source |       SS       df       MS                  Number of obs =      32
---------+------------------------------               F(  1,    30) = 1048.74
   Model |  113.779861     1  113.779861               Prob > F      =  0.0000
Residual |  3.25475832    30  .108491944               R-squared     =  0.9722
---------+------------------------------               Adj R-squared =  0.9713
   Total |  117.034619    31  3.77531029               Root MSE      =  .32938

------------------------------------------------------------------------------
  logbbl |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    year |   .0001608   4.97e-06     32.384   0.000       .0001507     .000171
   _cons |   8.698228   .0594613    146.284   0.000       8.576792    8.819664
------------------------------------------------------------------------------
Figure 2.36, page 186.
graph twoway (scatter logbbl year, connect(l) msymbol(i)) (lfit logbbl year)
We can make the residuals as shown below (calling them oilres) and graph them as shown in figure 2.37, page 187.
predict oilres, residual
graph twoway scatter oilres year, yline(0) 
We skip sections 2.6 and 2.7.

These pages are Copyrighted (c) by UCLA Academic Technology Services

