NOTE: This page has been delinked.  It is no longer being maintained, and information on this page may be out of date.

Stata FAQ: How can I convert string variables to numeric variables in Stata?

The easiest way to convert string variables to numeric form is to use the encode command. If the variable is actually a numeric value that just happens to be stored as a string, see our FAQ: How can I quickly convert many string variables to numeric variables?

Let's say that you have the following data:

region     units 
East       800  
South      600  
South      500  
West       850  
East       300  
Central    500  
East       600  
West       390  
Central    480  
Central    400  
South      650  
East       350  
Central    380  
South      470  
West       550  
South      720  
South      460  
East       520  
West       400  
East       590
Let's begin by reading in the dataset from the Internet and using the describe command.
use http://www.ats.ucla.edu/stat/stata/faq/region, clear

describe

Contains data from region.dta
  obs:            20                          
 vars:             2                          
 size:           260 (99.8% of memory free)
--------------------------------------------
   1. region    str7   %9s                    
   2. units     int    %8.0g                  
--------------------------------------------
Sorted by:
The variable region is a string variable. Next, let's do a tabulate on region.
tabulate region

     region |      Freq.     Percent        Cum.
------------+-----------------------------------
    Central |          4       20.00       20.00
       East |          6       30.00       50.00
      South |          6       30.00       80.00
       West |          4       20.00      100.00
------------+-----------------------------------
      Total |         20      100.00
Now let's use the encode command and another describe.
encode region, generate(region2)
describe

Contains data from region.dta
  obs:            20                          
 vars:             3                          
 size:           340 (99.8% of memory free)
--------------------------------------------
   1. region    str7   %9s                    
   2. units     int    %8.0g                  
   3. region2   long   %8.0g       region2    
--------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved
The encode command has generated a new variable called region2 that is of type long integer and has value labels which are defined and are also called region2. Now let's do two tabulates, one with labels and one without labels.
tabulate region2

    region2 |      Freq.     Percent        Cum.
------------+-----------------------------------
    Central |          4       20.00       20.00
       East |          6       30.00       50.00
      South |          6       30.00       80.00
       West |          4       20.00      100.00
------------+-----------------------------------
      Total |         20      100.00
	  
tabulate region2, nolabel

    region2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          4       20.00       20.00
          2 |          6       30.00       50.00
          3 |          6       30.00       80.00
          4 |          4       20.00      100.00
------------+-----------------------------------
      Total |         20      100.00
The first tabulate looks identical to the tabulate done on the string variable region above. As you can see from the second tabulate, region2 has the numeric values 1 through 4 but retains the strings Central, East, South and West as value labels.

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.