### Stata FAQ How can I quickly convert many string variables to numeric variables?

There may be times that you receive a file that has many (or all) of the variables defined as strings, that is, character variables.  The variables may contain numeric values, but if they are defined as type string, there are very few things you can do to analyze the data.  You cannot get means, you cannot do a regression, you cannot do an ANOVA, etc... Sometimes the dataset contains numerical values that are stored as strings. We will address this scenario first. Then we will address the case where the string variables actually contain strings, and the goal is to assign each value the string takes on to a numeric value.

All of the examples on this page use the same dataset, so let's start by examining the data. The example dataset, hsbs, is a subset of the High School and Beyond data file with all of the variables as string variables.

use http://www.ats.ucla.edu/stat/stata/faq/hsbs, clear
As you see from the describe command below, the variables are all defined as string variables (e.g., science is str2, a string of length 2).
describe

Contains data from hsbs.dta
obs:            20
vars:             6                          25 Nov 1999 19:09
size:           400 (99.6% of memory free)
-------------------------------------------------------------------------------
1. id        str3   %3s
2. gender    str1   %5s
3. race      str5   %9s
4. schtyp    str3   %5s
6. science   str2   %5s
-------------------------------------------------------------------------------
Sorted by:  
Now that we know the variables are string variables, we can use the list command to see what the strings stored in these variables look like. Although the variable science is defined as str2, you can see from the list below that it contains just numeric values. Even so, because the variable is defined as str2, Stata cannot perform any kind of numerical analysis of the variable science. The same is true for the variable read.
list

id  gender       race  schtyp   read  science
1.  70      m          1    pub     45     47
2. 121      f          1    pub     68     63
3.  86      m          1    pub     44     58
4. 141      m          1    pub     63     53
5. 172      m          1    pub     47     53
6. 113      m          1    pub     44     63
7.  50      m          3    pub     50     53
8.  11      m          2    pub     34     39
9.  84      m          1    pub     63      .
10.  48      m          3    pub     57     50
11.  75      m          1    pub     60     53
12.  60      m          X    pub     57     63
13.  95      m          1    pub     73     61
14. 104      m          1    pub     54     55
15.  38      m          3    pub     45     31
16. 115      m          1    pub     42     50
17.  76      m          1    pub     47     50
18. 195      m          1    pri     57
19. 114      m          1    pub     68     55
20.  85      m          1    pub     55     53  

#### Converting string variables with numeric values

One method of converting numbers stored as strings into numerical variables is to use a string function called real that translates numeric values stored as strings into numeric values Stata can recognize as such. The first line of syntax reads in the dataset shown above. The second generates a new variable read_n that is equal to the value of the number stored in the string variable read. The real(s) is the function that translates the values held as strings, where s is the variable containing strings.

gen read_n = real(read)

+---------------+
|---------------|
1. |   45       45 |
2. |   68       68 |
3. |   44       44 |
4. |   63       63 |
5. |   47       47 |
+---------------+

A second method of achieving the same result is the command destring. Let's try using the destring command and see how it works. The first line of syntax loads the dataset again, so that we are starting with a dataset containing only string variables again. The second line of syntax runs the destring command.

use http://www.ats.ucla.edu/stat/stata/faq/hsbs, clear
destring , replace

id has all characters numeric; replaced as int
gender contains non-numeric characters; no replace
race contains non-numeric characters; no replace
schtyp contains non-numeric characters; no replace
read has all characters numeric; replaced as byte
science has all characters numeric; replaced as byte
(2 missing values generated)
As you can see from the describe command below, the destring command converted all of the variables to numeric, except for race, gender and schtyp. Since these variables had characters in them, the destring command left such variables alone. If there had been any numeric variables in the dataset, they would remain unchanged.
describe

Contains data from http://www.ats.ucla.edu/stat/stata/faq/hsbs.dta
obs:            20
vars:             6                          25 Nov 1999 19:09
size:           340 (98.5% of memory free)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %10.0g
gender          str1   %5s
race            str5   %9s
schtyp          str3   %5s
science         byte   %10.0g
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

Both of the techniques described above have attributes that in some situations are advantages and in other situations may be disadvantages. The command destring can be run on an entire dataset in one step, the method using the function "real" requires issuing a command for each variable to be converted (although this can be done with a loop rather than typing out the syntax for each variable). One potential advantage to using the function "real" (the first method) is that if the function "real" encounters a non-numeric value, it sets the variable equal to missing in that case and moves on. To some extent destring can be made to behave similarly, but not identically. In order to convert a string variable containing any non-numeric value using destring one must list the characters that should be ignored (e.g. "," or "."). Additionally, rather than setting values for those cases containing non-numeric values to missing (what the function "real" does), destring removes the specified non-numeric characters. destring will extract the specified strings and then convert, meaning that "a4" can be converted to "4". destring's behavior is very good if one has numeric values stored as strings that occasionally contain things like commas (e.g. 4,354), but there may be situations where this behavior is undesirable.

#### Converting string variables with non-numeric values into numeric values

How do we convert gender and schtyp into numeric values?  We can use the encode command as shown below.  These commands create gender2 and schtyp2.
encode gender, generate(gender2)
encode schtyp, generate(schtyp2)
Notice in the describe command below that gender2 and schtyp2 are numeric variables and they have labels associated with them (called gender2 and schtyp2).
describe

Contains data from http://www.ats.ucla.edu/stat/stata/faq/hsbs.dta
obs:            20
vars:             8                          25 Nov 1999 19:09
size:           560 (99.8% of memory free)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              str3   %3s
gender          str1   %5s
race            str5   %9s
schtyp          str3   %5s
science         str2   %5s
gender2         long   %8.0g       gender2
schtyp2         long   %8.0g       schtyp2
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
If we list out the data, it appears that gender2 and schtyp2 are identical to gender and schtyp, however they are really numeric and what you are seeing are the value labels associated with the variables.
list

id  gen~r       race  sch~p   read  sci~e   gender2   schtyp2
1.  70      m          1    pub     45     47         m       pub
2. 121      f          1    pub     68     63         f       pub
3.  86      m          1    pub     44     58         m       pub
4. 141      m          1    pub     63     53         m       pub
5. 172      m          1    pub     47     53         m       pub
6. 113      m          1    pub     44     63         m       pub
7.  50      m          3    pub     50     53         m       pub
8.  11      m          2    pub     34     39         m       pub
9.  84      m          1    pub     63      .         m       pub
10.  48      m          3    pub     57     50         m       pub
11.  75      m          1    pub     60     53         m       pub
12.  60      m          X    pub     57     63         m       pub
13.  95      m          1    pub     73     61         m       pub
14. 104      m          1    pub     54     55         m       pub
15.  38      m          3    pub     45     31         m       pub
16. 115      m          1    pub     42     50         m       pub
17.  76      m          1    pub     47     50         m       pub
18. 195      m          1    pri     57                m       pri
19. 114      m          1    pub     68     55         m       pub
20.  85      m          1    pub     55     53         m       pub
Below we use the nolabel option and you see that gender2 and schtyp2 are really numeric.
list, nolabel

id  gen~r       race  sch~p   read  sci~e   gender2   schtyp2
1.  70      m          1    pub     45     47         2         2
2. 121      f          1    pub     68     63         1         2
3.  86      m          1    pub     44     58         2         2
4. 141      m          1    pub     63     53         2         2
5. 172      m          1    pub     47     53         2         2
6. 113      m          1    pub     44     63         2         2
7.  50      m          3    pub     50     53         2         2
8.  11      m          2    pub     34     39         2         2
9.  84      m          1    pub     63      .         2         2
10.  48      m          3    pub     57     50         2         2
11.  75      m          1    pub     60     53         2         2
12.  60      m          X    pub     57     63         2         2
13.  95      m          1    pub     73     61         2         2
14. 104      m          1    pub     54     55         2         2
15.  38      m          3    pub     45     31         2         2
16. 115      m          1    pub     42     50         2         2
17.  76      m          1    pub     47     50         2         2
18. 195      m          1    pri     57                2         1
19. 114      m          1    pub     68     55         2         2
20.  85      m          1    pub     55     53         2         2
What about the variable race? It is still a character variable because our prior destring command saw the X in the data and did not attempt to convert it because it had non-numeric values.  Below we can convert it to numeric by include the ignore(X) option that tells destring to convert the variable to numeric and when it encounters X to convert that to a missing value.  You can see the results in the list command below.
destring race, replace ignore(X)
race: characters X removed; replaced as byte
(1 missing value generated)

list

id  gen~r        race  sch~p   read  sci~e   gender2   schtyp2
1.  70      m           1    pub     45     47         m       pub
2. 121      f           1    pub     68     63         f       pub
3.  86      m           1    pub     44     58         m       pub
4. 141      m           1    pub     63     53         m       pub
5. 172      m           1    pub     47     53         m       pub
6. 113      m           1    pub     44     63         m       pub
7.  50      m           3    pub     50     53         m       pub
8.  11      m           2    pub     34     39         m       pub
9.  84      m           1    pub     63      .         m       pub
10.  48      m           3    pub     57     50         m       pub
11.  75      m           1    pub     60     53         m       pub
12.  60      m           .    pub     57     63         m       pub
13.  95      m           1    pub     73     61         m       pub
14. 104      m           1    pub     54     55         m       pub
15.  38      m           3    pub     45     31         m       pub
16. 115      m           1    pub     42     50         m       pub
17.  76      m           1    pub     47     50         m       pub
18. 195      m           1    pri     57                m       pri
19. 114      m           1    pub     68     55         m       pub
20.  85      m           1    pub     55     53         m       pub  
As you have seen, we can use destring to convert string variables that contain numbers into numeric variables, and it can handle situations where some values are stored as a character (like the X we saw with race).  If you have a character variable that is stored as all characters, you can use encode to convert the character variable to numeric and it will create value labels that have the values that were stored with the character variable.