UCLA Academic Technology Services HomeServicesClassesContactJobs
Search

Stata FAQ:
How can I see the number of missing values and patterns of missing values in my data file?

Sometimes, a data set may have "holes" in it, i.e., missing values. Some statistical procedures such as regression analysis will not work as well, or at all, on a data set with missing values.  The observations with missing values have to be deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results.  Thus we may want to know  the number of missing values and the distribution of those missing values so we have a better idea as to what to do with the observations with missing values. Let's look at  the following data set.
 LANDVAL  IMPROVAL    TOTVAL  SALEPRIC SALTOAPR

   30000     64831     94831    118500   1.25
   30000     50765     80765     93900    .
   46651     18573     65224         .   1.16
   45990     91402         .    184000   1.34
   42394         .     40575    168000   1.43
       .      3351     51102    169000   1.12
   63596      2182     65778         .   1.26
   56658     53806     10464    255000   1.21
   51428     72451         .         .   1.18
   93200         .      4321    422000   1.04
   76125     78172     54297    290000   1.14
       .     61934     16294    237000   1.10
   65376     34458         .    286500   1.43
   42400         .     57446         .    .
   40800     92606     33406    168000   1.26

1. Number of missing values vs. number of  non missing values

The first thing we are going to do is determine which variables have a lot of missing values. We have created a small Stata program called tabmiss that counts the number of missing values in both numeric and character variables. You can download tabmiss from within Stata by typing findit tabmiss (see How can I use the findit command to search for programs and get additional help? for more information about using findit).

Then you can run tabmiss for one or more variables as illustrated below.

tabmiss  landval  improval totval salepric saltoapr

    landval |      Freq.     Percent        Cum.
------------+-----------------------------------
  nomissing |         13       86.67       86.67
    missing |          2       13.33      100.00
------------+-----------------------------------
      Total |         15      100.00

   improval |      Freq.     Percent        Cum.
------------+-----------------------------------
  nomissing |         12       80.00       80.00
    missing |          3       20.00      100.00
------------+-----------------------------------
      Total |         15      100.00

     totval |      Freq.     Percent        Cum.
------------+-----------------------------------
  nomissing |         12       80.00       80.00
    missing |          3       20.00      100.00
------------+-----------------------------------
      Total |         15      100.00

   salepric |      Freq.     Percent        Cum.
------------+-----------------------------------
  nomissing |         11       73.33       73.33
    missing |          4       26.67      100.00
------------+-----------------------------------
      Total |         15      100.00

   saltoapr |      Freq.     Percent        Cum.
------------+-----------------------------------
  nomissing |         13       86.67       86.67
    missing |          2       13.33      100.00
------------+-----------------------------------
      Total |         15      100.00  
Now we know the number of missing values in each variable. For instance, variable salepric has four and saltoapr has two missing values. This will help us to identify variables that may have a large number of missing values and perhaps we may want to exclude those from analysis.

2. Obtaining the distribution of the number of missing values per observation

We can also look at the distribution of missing values across observations. For example variable nmis created below is the number of missing values across each observation. Looking at its frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values. If we are willing to substitute one missing value per observation, we will be able to reclaim nine observations back to get a valid data set that is 13/15=87% of the size of the original one. The function rmiss2() used here is an extension to the egen function rmiss(). It counts the number of missing values in the varlist.  rmiss2() accepts both string and numeric variables.  Stata's rmiss() accepts only numeric variables. You can download rmiss2() over the internet from within Stata by typing findit rmiss2 (see How can I use the findit command to search for programs and get additional help? for more information about using findit).

Now let's create our new variable.

egen nmis=rmiss2( landval improval totval salepric saltoapr)

tab nmis

       nmis |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          4       26.67       26.67
          1 |          9       60.00       86.67
          2 |          1        6.67       93.33
          3 |          1        6.67      100.00
------------+-----------------------------------
      Total |         15      100.00  

3. Patterns of missing values

We can also look at the patterns of  missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use the collapse command to compute the frequency for each pattern of missing data.
egen land1=rmiss2(landval)
egen impr1=rmiss2(improval)
egen totv1=rmiss2(totval)
egen sale1=rmiss2(salepric)
egen salt1=rmiss2(saltoapr)
gen id=_n
collapse (count)  freq=id,  by (land1 impr1 sale1 totv1 salt1)
Now we see that there are four observations with no missing values, one observation with one missing value in variable saltoapr, two observations with missing value in variable salepric and one observation with  missing value in both variable totval and salepric, etc. If we want to delete some observations from the original data set, we have a better idea now on which observation to delete, e.g. the observation corresponding to the 7th observation below.
list

         land1      impr1      totv1      sale1      salt1       freq 
  1.         0          0          0          0          0          4  
  2.         0          0          0          0          1          1  
  3.         0          0          1          0          0          2  
  4.         0          0          0          1          0          2  
  5.         0          0          1          1          0          1  
  6.         0          1          0          0          0          2  
  7.         0          1          0          1          1          1  
  8.         1          0          0          0          0          2  

How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California