SAS FAQ
How can I see the number of missing values and patterns of missing values in a data file?

We would like to thank Mike Zdeb of University at Albany School of Public Health who emailed us with his suggestions/examples to improve this page. 

Sometimes, a data set may have "holes" in them, i.e., missing values and we may want to know  the number of missing values of all the variables and the distribution of the missing values. We will use the following data set as our example data set.

data test;
input landval improval totval salepric saltoapr city $6. season $8.;
datalines;
   30000     64831     94831    118500   1.25  A    spring
   30000     50765     80765     93900    .         winter
   46651     18573     65224         .   1.16  B      
   45990     91402         .    184000   1.34  C    winter
   42394         .     40575    168000   1.43       
       .      3351     51102    169000   1.12  D    winter
   63596      2182     65778         .   1.26  E    spring
   56658     53806     10464    255000   1.21      
   51428     72451         .         .   1.18  F    spring
   93200         .      4321    422000   1.04      
   76125     78172     54297    290000   1.14  G    winter
       .     61934     16294    237000   1.10  H    spring
   65376     34458         .    286500   1.43       winter
   42400         .     57446         .    .    K    
   40800     92606     33406    168000   1.26  S    
;
run;

1. Number of missing values vs. number of  non missing values in each variable

The first thing we are going to look at the variables that have a lot of missing values. For numerical variables, we use proc means with the options n and nmiss.

proc means data = test n nmiss;
  var _numeric_;
run;
VariableNN Miss
landval132
improval123
totval123
salepric114
saltoapr132

For character variables, we can use proc freq to display the number of missing values in each variable.

proc freq data = test;
  tables city season ;
run;
  
cityFrequencyPercentCumulativeCumulative
FrequencyPercent
A110110
B110220
C110330
D110440
E110550
F110660
G110770
H110880
K110990
S11010100
Frequency Missing = 5
seasonFrequencyPercentCumulativeCumulative
FrequencyPercent
spring444.44444.44
winter555.569100
Frequency Missing = 6

2. Number of missing values in each observation

We can also look at the number of missing values in each observation. For example, we can use SAS function cmiss to store the number of missing values from both numeric and character variables in each observation.

data test1;
  set test;
  miss_n = cmiss(of landval -- season);
run;

proc print data = test1; 
run;
                                                                  
Obslandvalimprovaltotvalsalepricsaltoaprcityseasonmiss_n
13000064831948311185001.25Aspring0
230000507658076593900.winter2
3466511857365224.1.16B2
44599091402.1840001.34Cwinter1
542394.405751680001.433
6.3351511021690001.12Dwinter1
763596218265778.1.26Espring1
85665853806104642550001.212
95142872451..1.18Fspring2
1093200.43214220001.043
117612578172542972900001.14Gwinter0
12.61934162942370001.1Hspring1
136537634458.2865001.43winter2
1442400.57446..K4
154080092606334061680001.26S1

3. Distribution of missing values

We can also look at the patterns of missing values. By default the MI procedure will output missing data patterns for the variables in the specified datasets. If no var statement is specified Proc MI will output a table for the all the variables in a dataset. The ods select statement tells SAS to only output the "Missing Data Patterns" table.

proc mi data=test;
ods select misspattern;
run;
Missing Data Patterns
GrouplandvalimprovaltotvalsalepricsaltoaprFreqPercentGroup Means
landvalimprovaltotvalsalepricsaltoapr
1XXXXX426.675089672354482502078751.215
2XXXX.16.6730000507658076593900.
3XXX.X213.33551241037865501.1.21
4XX.XX213.335568362930.2352501.385
5XX..X16.675142872451..1.18
6X.XXX213.3367797.224482950001.235
7X.X..16.6742400.57446..
8.XXXX213.33.32643336982030001.11

You will notice that this report only contains information for numeric variables not character. Another approach to achieve the same output could be via making use of formats, which then allows for character variables to be included. The order of the patterns is different but the information is still the same.

proc format;
value nm . = '.' other = 'X';
value $ch ' ' = '.'other = 'X';
run;

proc freq data=test;
table landval*improval*totval*salepric*saltoapr*city*season / list missing nocum;
format _numeric_ nm. _character_ $ch.;
run;
landvalimprovaltotvalsalepricsaltoaprcityseasonFrequencyPercent
.XXXXXX213.33
X.X..X.16.67
X.XXX..213.33
XX..XXX16.67
XX.XX.X16.67
XX.XXXX16.67
XXX.XX.16.67
XXX.XXX16.67
XXXX..X16.67
XXXXX..16.67
XXXXXX.16.67
XXXXXXX213.33

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.