Figure 2.1, page 107 uses a file with SAT scores for the states of the United States. The file, satstate.dta, is a Stata data file, so we can read it into Stata like this.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/satstate, clear
Figure 2.1 can be produced using the graph command below. In the second scatter plot, we only plot the location for California and use mlabel(state) to label California's point.
graph twoway (scatter svavg perctake) (scatter svavg perctake if state == "California", mlabel(state) mlabposition(9)), ///
    ylabel(460(20)620) xlabel(0(10)90)
Example 2.4 and figure 2.3, page 110. You can input the data in the table in example 2.4 as shown below. You just type each line, one at a time, in the command window. This is a quick and dirty way to enter a small data file into Stata.
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end
We can now graph the data like this, producing figure 2.3 on page 110.
graph twoway scatter humerus femur, xlabel(30(10)90) ylabel(30(10)90)
Example 2.5 and figure 2.4, page 111. The corn yield data is stored in the file ta02_001.dta. Since this is a Stata data file, we can read the file like this.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_001, clear
We can compute the average corn yield using the egen command below. It tells Stata to create a new variable, avgcorn, that is the mean of corn yield (corn_yie) broken down by the number of plants planted (plants_p).
egen avgcorn = mean( corn_yie) , by( plants_p)
When we construct the graph we first do the scatter plot with corn_yie and plants_p and then overlay the second scatterplot of avgcorn and plants_p and connect the means with the connect(l) option.
graph twoway (scatter corn_yie plants_p) (scatter avgcorn plants_p, connect(l)), xlabel(12(4)28)
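To see what egen with mean() and by() computes, here is a minimal sketch on a toy data set. The variable names plants, yield, and avgyield and all of the values are made up for illustration; they are not the textbook's corn yield data.

```stata
* Toy data for illustration only; these values are made up,
* not the textbook's corn yield data.
clear
input plants yield
12 150
12 110
16 170
16 130
20 160
20 140
end

* avgyield repeats the group mean on every row of its group
egen avgyield = mean(yield), by(plants)
list, sepby(plants)

* Cross-check against a table of group means
tabstat yield, by(plants) statistics(mean)
```

Because the group mean is stored on every row, the overlay plot can draw one point per observation for the raw values and still connect the means with a line.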
We skip Examples 2.6 and 2.7, and figures 2.5 and 2.6 for now.
We skip Example 2.8 and figure 2.7 for now. We also skip section 2.2.
Example 2.11 and figure 2.12, pages 135-136. We read in the height data, contained in the file ta02_005, as shown below.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_005, clear
We can graph the data as shown below to create figure 2.12, page 136.
graph twoway scatter height_y age_x_in, ylabel(74(2)86) xlabel(16(2)32)
Figure 2.13, page 138. Note: The regression line drawn by Stata does not extend past the range of the data, so in this graph the regression line does not intersect the reference lines drawn with xline() and yline().
graph twoway (scatter height_y age_x_in) (lfit height_y age_x_in), ///
    ylabel(74(2)86) xlabel(16(2)32) yline(85.25) xline(32)
Example 2.14 and figure 2.15, pages 141-142. Example 2.14 shows how you can compute the formula for the regression line. We can do this in Stata using the regress command shown below. In the results we see _cons = 64.92832 which is the way that Stata reports the constant, or a. We also see age_x_in .6349653 which is the way that Stata reports the slope, or b. You can see these results match the hand computed results in example 2.14, and are similar to the computer output shown in figure 2.15.
regress height_y age_x_in
Source | SS df MS Number of obs = 12
---------+------------------------------ F( 1, 10) = 880.00
Model | 57.6548678 1 57.6548678 Prob > F = 0.0000
Residual | .655171562 10 .065517156 R-squared = 0.9888
---------+------------------------------ Adj R-squared = 0.9876
Total | 58.3100394 11 5.30091267 Root MSE = .25596
------------------------------------------------------------------------------
height_y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
age_x_in | .6349653 .0214047 29.665 0.000 .5872726 .682658
_cons | 64.92832 .508409 127.709 0.000 63.79551 66.06112
------------------------------------------------------------------------------
We skip figures 2.16-2.17, pages 144-146.
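The link between a hand computation and regress can be checked directly. The sketch below assumes the hand computation uses the usual least-squares formulas, slope b = r * s_y / s_x and intercept a = ybar - b * xbar. To stay self-contained it reuses the femur and humerus values entered earlier rather than the height data, and the scalar names (r_xy, slope, intercept, and so on) are our own choices.

```stata
* Data from example 2.4 (same values as entered above)
clear
input femur humerus
38 41
56 63
59 70
64 72
74 84
end

* Pieces of the formula: r, the two means, the two standard deviations
quietly correlate humerus femur
scalar r_xy = r(rho)
quietly summarize femur
scalar xbar = r(mean)
scalar s_x = r(sd)
quietly summarize humerus
scalar ybar = r(mean)
scalar s_y = r(sd)

* Slope b = r * s_y / s_x and intercept a = ybar - b * xbar
scalar slope = r_xy * s_y / s_x
scalar intercept = ybar - slope * xbar
display "b = " slope "   a = " intercept

* regress should report the same two numbers as the coefficient
* on femur and as _cons
regress humerus femur
```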
We illustrate how to produce the results of example 2.16 and figure 2.18, pages 154-155 below.
We can compute the residuals using the predict command. We give the residual the name res_ht (residual height), and the option residual tells Stata that we want to create a residual. You can replace res_ht with any valid variable name, but residual is the keyword that tells Stata you want a residual to be computed.
predict res_ht, residual
We list the residuals below, and see they match the residuals in example 2.16 (except for slight differences in rounding).
list res_ht
res_ht
1. -.2576923
2. .007344
3. .4723772
4. -.0625896
5. -.0975488
6. .1674798
7. -.2674809
8. .2975508
9. -.237416
10. -.2723751
11. .0926596
12. .1576913
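A residual is the observed value minus the fitted value, so the result of predict with the residual option can be rebuilt by hand. The sketch below is an illustration under assumptions: it uses auto.dta, which ships with Stata, in place of the textbook file, and the names res, fit, and res_check are hypothetical.

```stata
* auto.dta ships with Stata; it stands in for the textbook data here
sysuse auto, clear
regress price weight

* Store everything in double precision so rounding stays negligible
predict double res, residual      // residual computed by Stata
predict double fit, xb            // fitted (predicted) value
generate double res_check = price - fit

* The two versions should agree up to floating-point rounding
summarize res res_check
assert reldif(res, res_check) < 1e-8
```

The same check should work after regress height_y age_x_in by substituting height_y for price.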
Figure 2.18a is the same as figure 2.13 so we omit showing that. We can obtain the residual plot, figure 2.18b on page 155 by graphing the residual res_ht by the age age_x_in, as shown below.
graph twoway scatter res_ht age_x_in, xlabel(16(2)32) ylabel(-.4(.2).8) yline(0)
We show how to solve example 2.17, and to get figures 2.20-2.22 below (pages 158-159). First, we read in the data for this example below.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/eg2_17, clear
Figure 2.20, page 158.
graph twoway (scatter y x) (lfit y x), xlabel(3800(200)5000) ylabel(6000(500)8000)
We obtain the regression equation shown in example 2.17 using the regress command. This also shows us the r-squared.
regress y x
Source | SS df MS Number of obs = 8
---------+------------------------------ F( 1, 6) = 13.63
Model | 486552.229 1 486552.229 Prob > F = 0.0102
Residual | 214209.271 6 35701.5452 R-squared = 0.6943
---------+------------------------------ Adj R-squared = 0.6434
Total | 700761.50 7 100108.786 Root MSE = 188.95
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x | 1.066322 .2888466 3.692 0.010 .3595401 1.773105
_cons | 2492.692 1267.199 1.967 0.097 -608.0328 5593.416
------------------------------------------------------------------------------
Following the regress command, we get the residual plot in figure 2.21, page 159 using the commands below.
predict yres, residual
graph twoway scatter yres x, yline(0) xlabel(3800(200)5000) ylabel(-300(100)400)
We can also show the residuals across years as shown in figure 2.22, page 159.
graph twoway scatter yres year, yline(0) xlabel(1990(1)1997) ylabel(-300(100)400)
We illustrate example 2.18 and figures 2.23-2.25, pages 160-162, below. We start by using the data file from table 2.8.
Note: For figures 2.23-2.25 there is no variable in the data set that distinguishes points representing one child from points representing two children.
Figure 2.23, page 161.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_008, clear
graph twoway (scatter score age) (lfit score age), ylabel(40(20)140) xlabel(0(10)50)
We calculate the line of best fit using the regress command below.
regress score age
Source | SS df MS Number of obs = 21
---------+------------------------------ F( 1, 19) = 13.20
Model | 1604.08089 1 1604.08089 Prob > F = 0.0018
Residual | 2308.58578 19 121.504515 R-squared = 0.4100
---------+------------------------------ Adj R-squared = 0.3789
Total | 3912.66667 20 195.633333 Root MSE = 11.023
------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
age | -1.126989 .3101721 -3.633 0.002 -1.776187 -.4777913
_cons | 109.8738 5.067802 21.681 0.000 99.26681 120.4809
------------------------------------------------------------------------------
Figure 2.24, page 161. We compute the residuals (calling the variable scor_res) and show the residual plot from figure 2.24. Note that this plot does not label children 18 and 19 as the book does.
predict scor_res, resid
graph twoway scatter scor_res age, yline(0) xlabel(0(10)50) ylabel(-20(10)30)
We show the residual graph again, labeling points with the child number to help us identify the outlying children. Indeed, we can see children 18 and 19 as shown in figure 2.24, page 161.
gen child = _n
graph twoway (scatter scor_res age) (scatter scor_res age if child == 18 | child == 19, mlabel(child)), ///
    yline(0) xlabel(0(10)50) ylabel(-20(10)30)
Figure 2.25, page 162
graph twoway (scatter score age) (lfit score age if child ~= 18) (lfit score age), ///
    xlabel(0(10)50) ylabel(40(20)140)
Example 2.19, page 163 shows the drop in r-squared when child 18 is removed. We can run the regression without child 18 like this. Indeed, we see the r-squared does go down to .11.
regress score age if child != 18
Source | SS df MS Number of obs = 20
---------+------------------------------ F( 1, 18) = 2.27
Model | 280.519481 1 280.519481 Prob > F = 0.1489
Residual | 2220.48052 18 123.360029 R-squared = 0.1122
---------+------------------------------ Adj R-squared = 0.0628
Total | 2501.00 19 131.631579 Root MSE = 11.107
------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
age | -.7792208 .5167331 -1.508 0.149 -1.864837 .3063951
_cons | 105.6299 7.161928 14.749 0.000 90.58322 120.6765
------------------------------------------------------------------------------
Figures 2.26-2.28, pages 164-165. Figure 2.26 shows computations of three regression diagnostics. We first re-run the regression with all of the observations, then use the predict command to create the regression diagnostics shown in figure 2.26. We name these diagnostics resid, rstud, and dffit, but we could have named them using any valid Stata variable name.
regress score age
Source | SS df MS Number of obs = 21
---------+------------------------------ F( 1, 19) = 13.20
Model | 1604.08089 1 1604.08089 Prob > F = 0.0018
Residual | 2308.58578 19 121.504515 R-squared = 0.4100
---------+------------------------------ Adj R-squared = 0.3789
Total | 3912.66667 20 195.633333 Root MSE = 11.023
------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
age | -1.126989 .3101721 -3.633 0.002 -1.776187 -.4777913
_cons | 109.8738 5.067802 21.681 0.000 99.26681 120.4809
------------------------------------------------------------------------------
predict resid, residual
predict rstud, rstudent
predict dffit, dfits
Below we list the variables, and we see the values are the same as figure 2.26, page 164.
list resid rstud dffit
resid rstud dffit
1. 2.030993 .1839685 .041274
2. -9.572129 -.9415833 -.4025207
3. -15.60395 -1.510812 -.39114
4. -8.730941 -.8142633 -.2243285
5. 9.030993 .8328629 .186856
6. -.3340623 -.0306318 -.0085717
7. 3.41196 .3112468 .0772239
8. 2.523037 .2297157 .0563035
9. 3.142071 .2899101 .0854075
10. 6.665938 .6176603 .1728405
11. 11.01508 1.050847 .3319969
12. -3.73094 -.3428315 -.0944496
13. -15.60395 -1.510812 -.39114
14. -13.47696 -1.279776 -.3136739
15. 4.523037 .4131532 .1012641
16. 1.396049 .1273934 .0329814
17. 8.650026 .7982811 .1871661
18. -5.540306 -.8451108 -1.155779
19. 30.28497 3.60698 .8537371
20. -11.47696 -1.076481 -.2638462
21. 1.396049 .1273934 .0329814
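Scanning a listing by eye works for 21 observations, but a filter is handier on larger files. The sketch below is only an illustration: it uses auto.dta, which ships with Stata, instead of the textbook data, and the cutoff of 2 for the studentized residual is a common rule of thumb that we assume here, not a value taken from the book.

```stata
* auto.dta ships with Stata; it stands in for the textbook data here
sysuse auto, clear
regress price weight

* Same two diagnostics as above, created with predict
predict rstud, rstudent
predict dffit, dfits

* List only the observations whose studentized residual is large;
* the cutoff of 2 is a rule of thumb, not a value from the book
list make price weight rstud dffit if abs(rstud) > 2 & !missing(rstud)
```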
We can produce figure 2.27, page 165 using the qnorm command. We chose to use the mlabel(child) option, together with msymbol(i) to hide the marker, so that each point is plotted as its child number. We can see that child 19 is far from the diagonal, as shown in the book.
qnorm rstud, mlabel(child) msymbol(i) ylabel(-4(2)4) xlabel(-3(1)3)
Likewise, we can use the qnorm command to make figure 2.28, page 165 as illustrated below.
qnorm dffit, mlabel(child) msymbol(i)
Figure 2.32, page 182. We can make figure 2.32 by first using the data file with the data from table 2.13 below.
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_013, clear
We then use the graph command below. The connect(l) option (that is an l, not a 1 (one)) says to connect the points with a line, and the msymbol(i) option means to omit any symbol for the points (e.g., no dot, square or triangle for each point). This produces a graph like figure 2.32.
graph twoway scatter mbbl_ year, connect(l) msymbol(i)
Figure 2.33, page 183. We can make the bacteria data like this.
clear
set obs 25
generate hours = _n
generate numbact = 1 if hours == 1
replace numbact = numbact[_n-1]*2 if hours > 1
format numbact %10.0g
list
hours numbact
1. 1 1
2. 2 2
3. 3 4
4. 4 8
5. 5 16
6. 6 32
7. 7 64
8. 8 128
9. 9 256
10. 10 512
11. 11 1024
12. 12 2048
13. 13 4096
14. 14 8192
15. 15 16384
16. 16 32768
17. 17 65536
18. 18 131072
19. 19 262144
20. 20 524288
21. 21 1048576
22. 22 2097152
23. 23 4194304
24. 24 8388608
25. 25 16777216
graph twoway scatter numbact hours, connect(l)
Figure 2.34, page 185 shows the graph of the log of the population by the number of hours. We can make the log of the number of bacteria and graph it like this. Note: The y-scale for this figure does not match the book.
generate logbact = log(numbact)
graph twoway scatter logbact hours, connect(l)
Figures 2.35-2.37, pages 186-187. Let's use the world oil production data file again.
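Because the count doubles every hour, the log of the count rises by the same amount each hour, which is why the logged series plots as a straight line. The sketch below rebuilds the same series in closed form (2 raised to hours minus 1, our restatement of the doubling rule) and fits a line to the logs. Stata's log() is the natural log, so the fitted slope should come out at ln(2).

```stata
* Rebuild the doubling series in closed form, in double precision
clear
set obs 25
generate hours = _n
generate double numbact = 2^(hours - 1)
generate double logbact = log(numbact)    // log() is the natural log in Stata

* The logs fall exactly on a line, so the fit is perfect
regress logbact hours
display "fitted slope = " _b[hours] "   ln(2) = " ln(2)
```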
use http://www.ats.ucla.edu/stat/stata/examples/mm/webdata/ta02_013, clear
Let's make the log of oil production and graph that as shown in figure 2.35, page 186.
generate logbbl = log(mbbl_)
graph twoway scatter logbbl year, connect(l) msymbol(i)
Now we can use the log of oil production to get the regression results for example 2.26, page 187.
regress logbbl year
Source | SS df MS Number of obs = 32
---------+------------------------------ F( 1, 30) = 1048.74
Model | 113.779861 1 113.779861 Prob > F = 0.0000
Residual | 3.25475832 30 .108491944 R-squared = 0.9722
---------+------------------------------ Adj R-squared = 0.9713
Total | 117.034619 31 3.77531029 Root MSE = .32938
------------------------------------------------------------------------------
logbbl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
year | .0001608 4.97e-06 32.384 0.000 .0001507 .000171
_cons | 8.698228 .0594613 146.284 0.000 8.576792 8.819664
------------------------------------------------------------------------------
Figure 2.36, page 186.
graph twoway (scatter logbbl year, connect(l) msymbol(i)) (lfit logbbl year)
We can make the residuals as shown below (calling them oilres) and graph them as shown in figure 2.37, page 187.
predict oilres, residual
graph twoway scatter oilres year, yline(0)
We skip sections 2.6 and 2.7.
These pages are Copyrighted (c) by UCLA Academic Technology Services