### Stata Learning Module Creating and recoding variables

This module shows how to create and recode variables. In Stata you can create new variables with generate and you can modify the values of an existing variable with replace and with recode.

#### Computing new variables using generate and replace

Let's use the auto data for our examples. In this section we will see how to compute variables with generate and replace.

use auto

The variable length contains the length of the car in inches. Below we see summary statistics for length.

summarize length
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
length |      74    187.9324   22.26634        142        233   

Let's use the generate command to make a new variable that has the length in feet instead of inches, called len_ft.

generate len_ft = length / 12

We should emphasize that generate is for creating a new variable. For an existing variable, you need to use the replace command (not generate). As shown below, we use replace to repeat the assignment to len_ft.

replace len_ft = length / 12

summarize length len_ft

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
length |      74    187.9324   22.26634        142        233
len_ft |      74    15.66104   1.855528   11.83333   19.41667   

The syntax of generate and replace are identical, except:
- generate works when the variable does not yet exist and will give an error if the variable already exists.
- replace works when the variable already exists, and will give an error if the variable does not yet exist.

Suppose we wanted to make a variable called length2 which has length squared.

generate length2 = length^2

summarize length2

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
length2 |      74    35807.69   8364.045      20164      54289   

Or we might want to make loglen which is the natural log of length.

generate loglen = log(length)

summarize loglen

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
loglen |      74    5.229035   .1201383   4.955827   5.451038   

Let's get the mean and standard deviation of length and we can make Z-scores of length.

summarize length

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
length |      74    187.9324   22.26634        142        233   

The mean is 187.93 and the standard deviation is 22.27, so zlength can be computed as shown below.

generate zlength = (length - 187.93) / 22.27

summarize zlength

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
zlength |      74    .0001092   .9998357  -2.062416   2.023799   

With generate and replace
you can use + - for addition and subtraction
you can use * / for multiplication and division
you can use ^ for exponents (e.g., length^2)
you can use ( ) for controlling order of operations.

#### Recoding new variables using generate and replace

Suppose that we wanted to break mpg down into three categories. Let's look at a table of mpg to see where we might draw the lines for such categories.

tabulate mpg

mpg |      Freq.     Percent        Cum.
------------+-----------------------------------
12 |          2        2.70        2.70
14 |          6        8.11       10.81
15 |          2        2.70       13.51
16 |          4        5.41       18.92
17 |          4        5.41       24.32
18 |          9       12.16       36.49
19 |          8       10.81       47.30
20 |          3        4.05       51.35
21 |          5        6.76       58.11
22 |          5        6.76       64.86
23 |          3        4.05       68.92
24 |          4        5.41       74.32
25 |          5        6.76       81.08
26 |          3        4.05       85.14
28 |          3        4.05       89.19
29 |          1        1.35       90.54
30 |          2        2.70       93.24
31 |          1        1.35       94.59
34 |          1        1.35       95.95
35 |          2        2.70       98.65
41 |          1        1.35      100.00
------------+-----------------------------------
Total |         74      100.00 

Let's convert mpg into three categories to help make this more readable. Here we convert mpg into three categories using generate and replace.

generate mpg3    = .

(74 missing values generated)

replace  mpg3    = 1 if (mpg <= 18)

replace  mpg3    = 2 if (mpg >= 19) & (mpg <=23)

replace  mpg3    = 3 if (mpg >= 24) & (mpg <.)

(23 real changes made) 

Let's use tabulate to check that this worked correctly. Indeed, you can see that a value of 1 for mpg3 goes from 12-18, a value of 2 goes from 19-23, and a value of 3 goes from 24-41.

tabulate mpg mpg3

|               mpg3
mpg |         1          2          3 |     Total
-----------+---------------------------------+----------
12 |         2          0          0 |         2
14 |         6          0          0 |         6
15 |         2          0          0 |         2
16 |         4          0          0 |         4
17 |         4          0          0 |         4
18 |         9          0          0 |         9
19 |         0          8          0 |         8
20 |         0          3          0 |         3
21 |         0          5          0 |         5
22 |         0          5          0 |         5
23 |         0          3          0 |         3
24 |         0          0          4 |         4
25 |         0          0          5 |         5
26 |         0          0          3 |         3
28 |         0          0          3 |         3
29 |         0          0          1 |         1
30 |         0          0          2 |         2
31 |         0          0          1 |         1
34 |         0          0          1 |         1
35 |         0          0          2 |         2
41 |         0          0          1 |         1
-----------+---------------------------------+----------
Total |        27         24         23 |        74  

Now, we could use mpg3 to show a crosstab of mpg3 by foreign to contrast the mileage of the foreign and domestic cars.

tabulate mpg3 foreign, column

|        foreign
mpg3 |         0          1 |     Total
-----------+----------------------+----------
1 |        22          5 |        27
|     42.31      22.73 |     36.49
-----------+----------------------+----------
2 |        19          5 |        24
|     36.54      22.73 |     32.43
-----------+----------------------+----------
3 |        11         12 |        23
|     21.15      54.55 |     31.08
-----------+----------------------+----------
Total |        52         22 |        74
|    100.00     100.00 |    100.00  

The crosstab above shows that 21% of the domestic cars fall into the high mileage category, while 55% of the foreign cars fit into this category.

#### Recoding variables using recode

There is an easier way to recode mpg to three categories using generate and recode. First, we make a copy of mpg, calling it mpg3a. Then, we use recode to convert mpg3a into three categories: min-18 into 1, 19-23 into 2, and 24-max into 3.

generate mpg3a = mpg

recode   mpg3a (min/18=1) (19/23=2) (24/max=3)

(74 changes made) 

Let's double check to see that this worked correctly. We see that it worked perfectly.

tabulate mpg mpg3a

|              mpg3a
mpg |         1          2          3 |     Total
-----------+---------------------------------+----------
12 |         2          0          0 |         2
14 |         6          0          0 |         6
15 |         2          0          0 |         2
16 |         4          0          0 |         4
17 |         4          0          0 |         4
18 |         9          0          0 |         9
19 |         0          8          0 |         8
20 |         0          3          0 |         3
21 |         0          5          0 |         5
22 |         0          5          0 |         5
23 |         0          3          0 |         3
24 |         0          0          4 |         4
25 |         0          0          5 |         5
26 |         0          0          3 |         3
28 |         0          0          3 |         3
29 |         0          0          1 |         1
30 |         0          0          2 |         2
31 |         0          0          1 |         1
34 |         0          0          1 |         1
35 |         0          0          2 |         2
41 |         0          0          1 |         1
-----------+---------------------------------+----------
Total |        27         24         23 |        74


#### Recodes with if

Let's create a variable called mpgfd that assesses the mileage of the cars with respect to their origin. Let this be a 0/1 variable called mpgfd which is:
0 if below the median mpg for its group (foreign/domestic)
1 if at/above the median mpg for its group (foreign/domestic).

sort foreign

by foreign: summarize mpg, detail

-> foreign=        0
mpg
-------------------------------------------------------------
Percentiles      Smallest
1%           12             12
5%           14             12
10%           14             14       Obs                  52
25%         16.5             14       Sum of Wgt.          52

50%           19                      Mean           19.82692
Largest       Std. Dev.      4.743297
75%           22             28
90%           26             29       Variance       22.49887
95%           29             30       Skewness       .7712432
99%           34             34       Kurtosis       3.441459

-> foreign=        1
mpg
-------------------------------------------------------------
Percentiles      Smallest
1%           14             14
5%           17             17
10%           17             17       Obs                  22
25%           21             18       Sum of Wgt.          22

50%         24.5                      Mean           24.77273
Largest       Std. Dev.      6.611187
75%           28             31
90%           35             35       Variance       43.70779
95%           35             35       Skewness        .657329
99%           41             41       Kurtosis        3.10734 

We see that the median is 19 for the domestic (foreign==0) cars and 24.5 for the foreign (foreign==1) cars. The generate and recode commands below recode mpg into mpgfd based on the domestic car median for the domestic cars, and based on the foreign car median for the foreign cars.

generate mpgfd = mpg

recode   mpgfd (min/18=0) (19/max=1) if foreign==0

recode   mpgfd (min/24=0) (25/max=1) if foreign==1

(22 changes made) 

We can check using this below, and the recoded value mpgfd looks correct.

by foreign: tabulate mpg mpgfd

-> foreign=        0
|         mpgfd
mpg |         0          1 |     Total
-----------+----------------------+----------
12 |         2          0 |         2
14 |         5          0 |         5
15 |         2          0 |         2
16 |         4          0 |         4
17 |         2          0 |         2
18 |         7          0 |         7
19 |         0          8 |         8
20 |         0          3 |         3
21 |         0          3 |         3
22 |         0          5 |         5
24 |         0          3 |         3
25 |         0          1 |         1
26 |         0          2 |         2
28 |         0          2 |         2
29 |         0          1 |         1
30 |         0          1 |         1
34 |         0          1 |         1
-----------+----------------------+----------
Total |        22         30 |        52

-> foreign=        1
|         mpgfd
mpg |         0          1 |     Total
-----------+----------------------+----------
14 |         1          0 |         1
17 |         2          0 |         2
18 |         2          0 |         2
21 |         2          0 |         2
23 |         3          0 |         3
24 |         1          0 |         1
25 |         0          4 |         4
26 |         0          1 |         1
28 |         0          1 |         1
30 |         0          1 |         1
31 |         0          1 |         1
35 |         0          2 |         2
41 |         0          1 |         1
-----------+----------------------+----------
Total |        11         11 |        22  

#### Summary

Create a new variable len_ft which is length divided by 12.

generate len_ft = length / 12

Change values of an existing variable named len_ft.

replace len_ft = length / 12

Recode mpg into mpg3, having three categories using generate and replace if.

generate mpg3 = .
replace  mpg3 = 1 if (mpg <=18)
replace  mpg3 = 2 if (mpg >=19) & (mpg <=23)
replace  mpg3 = 3 if (mpg >=24) & (mpg <.)  

Recode mpg into mpg3a, having three categories, 1 2 3, using generate and recode.

generate mpg3a = mpg
recode   mpg3a (min/18=1) (19/23=2) (24/max=3)  

Recode mpg into mpgfd, having two categories, but using different cutoffs for foreign and domestic cars.

generate mpgfd = mpg
recode   mpgfd (min/18=0) (19/max=1) if foreign==0
recode   mpgfd (min/24=0) (25/max=1) if foreign==1 

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.