Stata FAQ
How do I standardize variables in Stata?

A standardized variable (sometimes called a z-score or a standard score) is a variable that has been rescaled to have a mean of zero and a standard deviation of one. For a standardized variable, each case's value on the standardized variable indicates it's difference from the mean of the original variable in number of standard deviations (of the original variable). For example, a value of 0.5 indicates that the value for that case is half a standard deviation above the mean, while a value of -2 indicates that a case has a value two standard deviations lower than the mean. Variables are standardized for a variety of reasons, for example, to make sure all variables contribute evenly to a scale when items are added together, or to make it easier to interpret results of a regression or other analysis.

Standardizing a variable is a relatively straightforward procedure. First, the mean is subtracted from the value for each case, resulting in a mean of zero. Then, the difference between the individual's score and the mean is divided by the standard deviation, which results in a standard deviation of one. If we start with a variable x, and generate a variable x*, the process is:

x* = (x-m)/sd

Where m is the mean of x, and sd is the standard deviation of x.

To illustrate the process of standardization, we will use the High School and Beyond dataset (hsb2). We will create standardized versions of three variables, math, science, and socst. These variables contain students' scores on tests of knowledge of mathematics (math), science (science), social studies (socst). First, we will use the summarize command (abbreviated as sum below) to get the mean and standard deviation for each variable.

use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
(highschool and beyond (200 cases))

sum math science socst

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        math |       200      52.645    9.368448         33         75
     science |       200       51.85    9.900891         26         74
       socst |       200      52.405    10.73579         26         71

The mean of math is 52.645, and it's standard deviation is 9.368448. Based on this information, we can generate a standardized version of math called z1math. The code below does this with the generate command (abbreviated to gen), then uses summarize to confirm that the mean of z1math is very close to zero (due to rounding error, the mean of a standardized variable will rarely be exactly 0), and the standard deviation is one.

gen z1math = (math-52.645)/9.368448
sum z1math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      z1math |       200   -8.51e-09           1  -2.096932   2.386201

Below we do the same for science and socst, creating two new variables, z1science and z1socst, using their respective means and standard deviations taken from first table of summary statistics. The table of summary statistics shown below demonstrates that both variables are indeed standardized.

gen z1science = (science-51.85)/9.900891
gen z1socst = (socst-52.405)/10.73579

sum z1science z1socst

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   z1science |       200   -4.95e-09           1  -2.610876   2.237172
     z1socst |       200    4.02e-09           1   -2.45953   1.732057

Standardizing variables is not difficult, but to make this process easier, and less error prone, you can use the egen command to make standardized variables. The commands below standardize the values of math, science, and socst, creating three new variables, z2math, z2science, and z2socst.

egen z2math = std(math)
egen z2science = std(science) 
egen z2socst = std(socst)

Again we can look at a table of summary statistics to confirm that these variables are standardized. Note that the means are not exactly zero, nor do they match the means from the set of standardized variables created above using the generate command, in both cases, this is due to very slight rounding error.

sum z2math z2science z2socst
	
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      z2math |       200    3.41e-09           1  -2.096932   2.386201
   z2science |       200   -2.94e-09           1  -2.610876   2.237172
     z2socst |       200   -7.38e-09           1  -2.459529   1.732056

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.