SAS FAQ
How can I see the number of missing values and patterns of missing values in a data file?

We would like to thank Mike Zdeb of University at Albany School of Public Health who emailed us with his suggestions/examples to improve this page. 

Sometimes, a data set may have "holes" in them, i.e., missing values and we may want to know  the number of missing values of all the variables and the distribution of the missing values. We will use the following data set as our example data set.

data test;
input landval improval totval salepric saltoapr city $6. season $8.;
datalines;
   30000     64831     94831    118500   1.25  A    spring
   30000     50765     80765     93900    .         winter
   46651     18573     65224         .   1.16  B      
   45990     91402         .    184000   1.34  C    winter
   42394         .     40575    168000   1.43       
       .      3351     51102    169000   1.12  D    winter
   63596      2182     65778         .   1.26  E    spring
   56658     53806     10464    255000   1.21      
   51428     72451         .         .   1.18  F    spring
   93200         .      4321    422000   1.04      
   76125     78172     54297    290000   1.14  G    winter
       .     61934     16294    237000   1.10  H    spring
   65376     34458         .    286500   1.43       winter
   42400         .     57446         .    .    K    
   40800     92606     33406    168000   1.26  S    
;
run;

1. Number of missing values vs. number of  non missing values in each variable

The first thing we are going to look at the variables that have a lot of missing values. For numerical variables, we use proc means with the options n and nmiss.

proc means data = test n nmiss;
  var _numeric_;
run;
                     N
Variable     N    Miss
----------------------
LANDVAL     13       2
IMPROVAL    12       3
TOTVAL      12       3
SALEPRIC    11       4
SALTOAPR    13       2

For character variables, we can use proc freq to display the number of missing values in each variable.

proc freq data = test;
  tables city season ;
run;
                                 Cumulative    Cumulative
city    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
A              1       10.00             1        10.00
B              1       10.00             2        20.00
C              1       10.00             3        30.00
D              1       10.00             4        40.00
E              1       10.00             5        50.00
F              1       10.00             6        60.00
G              1       10.00             7        70.00
H              1       10.00             8        80.00
K              1       10.00             9        90.00
S              1       10.00            10       100.00

Frequency Missing = 5


                                   Cumulative    Cumulative
season    Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
spring           4       44.44             4        44.44
winter           5       55.56             9       100.00

Frequency Missing = 6

2. Number of missing values in each observation

We can also look at the number of missing values in each observation. For example, we can use SAS function cmiss to store the number of missing values from both numeric and character variables in each observation.

data test1;
  set test;
  miss_n = cmiss(of landval -- season);
run;

proc print data = test1; 
run;
                                                                  
Obs    landval    improval    totval    salepric    saltoapr    city    season    miss_n

  1     30000       64831      94831     118500       1.25       A      spring       0
  2     30000       50765      80765      93900        .                winter       2
  3     46651       18573      65224          .       1.16       B                   2
  4     45990       91402          .     184000       1.34       C      winter       1
  5     42394           .      40575     168000       1.43                           3
  6         .        3351      51102     169000       1.12       D      winter       1
  7     63596        2182      65778          .       1.26       E      spring       1
  8     56658       53806      10464     255000       1.21                           2
  9     51428       72451          .          .       1.18       F      spring       2
 10     93200           .       4321     422000       1.04                           3
 11     76125       78172      54297     290000       1.14       G      winter       0
 12         .       61934      16294     237000       1.10       H      spring       1
 13     65376       34458          .     286500       1.43              winter       2
 14     42400           .      57446          .        .         K                   4
 15     40800       92606      33406     168000       1.26       S                   1

A slightly different approach is to make use of SAS special names for referring numeric variables, character variables or all variables. For example, for all variables, we can use the SAS keyword _all_ as shown below. Similarly, use _numeric_ for all numeric variables and _character_ for all character variables. 
 

data test1;
  set test;
  miss_n  = cmiss(of _all_) - 1;
run;

3. Distribution of missing values

We can also look at the patterns of  missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use the proc freq with statement tables with option list to compute the frequency for each pattern of missing data.

data miss_pattern (drop=i);
  set test;
  array mynum(*) _numeric_;
  do i=1 to dim(mynum);
    if  mynum(i) =. then mynum{i}=1;
      else mynum(i)=0;
  end;
  array mychar(*) $ _character_;
  do i=1 to dim(mychar);
    if  mychar(i) ="" then mychar{i}=1;
      else mychar(i)=0;
  end;
run;
proc freq data=miss_pattern;
  tables landval*improval*totval*salepric*saltoapr*city*season /list;
run;
landval    improval    totval    salepric    saltoapr    city      season
---------------------------------------------------------------------------
      0           0         0           0           0         0           0
      0           0         0           0           0         0           1
      0           0         0           0           0         1           1
      0           0         0           0           1         1           0
      0           0         0           1           0         0           0
      0           0         0           1           0         0           1
      0           0         1           0           0         0           0
      0           0         1           0           0         1           0
      0           0         1           1           0         0           0
      0           1         0           0           0         1           1
      0           1         0           1           1         0           1
      1           0         0           0           0         0           0

                         Cumulative    Cumulative
Frequency     Percent     Frequency      Percent
-------------------------------------------------
       2       13.33             2        13.33
       1        6.67             3        20.00
       1        6.67             4        26.67
       1        6.67             5        33.33
       1        6.67             6        40.00
       1        6.67             7        46.67
       1        6.67             8        53.33
       1        6.67             9        60.00
       1        6.67            10        66.67
       2       13.33            12        80.00
       1        6.67            13        86.67
       2       13.33            15       100.00

Now we see that there are two observations with no missing values, one observation with one missing value in variable season, and so on.

Another approach to achieve the same output could be via making use of formats, therefore to eliminate the extra data step for creating the indicator variables.

options linesize =120 nocenter nodate 
        formchar = '|----|+|---+=|-/<>*';
proc format;
  value  nm   . = '1' other = '0';
  value $ch ' ' = '1' other = '0';
run;

proc freq data=test;
   tables landval*improval*totval*salepric*saltoapr*city*season 
   /list missing nocum; 
  format _numeric_ nm. _character_ $ch.; 
run;
landval    improval    totval    salepric    saltoapr    city    season    Frequency     Percent
------------------------------------------------------------------------------------------------
1          0           0         0           0           0       0                2       13.33
0          1           0         1           1           0       1                1        6.67
0          1           0         0           0           1       1                2       13.33
0          0           1         1           0           0       0                1        6.67
0          0           1         0           0           1       0                1        6.67
0          0           1         0           0           0       0                1        6.67
0          0           0         1           0           0       1                1        6.67
0          0           0         1           0           0       0                1        6.67
0          0           0         0           1           1       0                1        6.67
0          0           0         0           0           1       1                1        6.67
0          0           0         0           0           0       1                1        6.67
0          0           0         0           0           0       0                2       13.33

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.