|
|
|
||||
|
Help the Stat Consulting Group by
giving a gift
| |||||
|
Loading
|
|||||
We would like to thank Mike Zdeb of University at Albany School of Public Health who emailed us with his suggestions/examples to improve this page.
Sometimes, a data set may have "holes" in them, i.e., missing values and we may want to know the number of missing values of all the variables and the distribution of the missing values. We will use the following data set as our example data set.
data test;
input landval improval totval salepric saltoapr city $6. season $8.;
datalines;
30000 64831 94831 118500 1.25 A spring
30000 50765 80765 93900 . winter
46651 18573 65224 . 1.16 B
45990 91402 . 184000 1.34 C winter
42394 . 40575 168000 1.43
. 3351 51102 169000 1.12 D winter
63596 2182 65778 . 1.26 E spring
56658 53806 10464 255000 1.21
51428 72451 . . 1.18 F spring
93200 . 4321 422000 1.04
76125 78172 54297 290000 1.14 G winter
. 61934 16294 237000 1.10 H spring
65376 34458 . 286500 1.43 winter
42400 . 57446 . . K
40800 92606 33406 168000 1.26 S
;
run;
The first thing we are going to look at the variables that have a lot of missing values. For numerical variables, we use proc means with the options n and nmiss.
proc means data = test n nmiss;
var _numeric_;
run;
N
Variable N Miss
----------------------
LANDVAL 13 2
IMPROVAL 12 3
TOTVAL 12 3
SALEPRIC 11 4
SALTOAPR 13 2
For character variables, we can use proc freq to display the number of missing values in each variable.
proc freq data = test; tables city season ; run;Cumulative Cumulative city Frequency Percent Frequency Percent --------------------------------------------------------- A 1 10.00 1 10.00 B 1 10.00 2 20.00 C 1 10.00 3 30.00 D 1 10.00 4 40.00 E 1 10.00 5 50.00 F 1 10.00 6 60.00 G 1 10.00 7 70.00 H 1 10.00 8 80.00 K 1 10.00 9 90.00 S 1 10.00 10 100.00 Frequency Missing = 5 Cumulative Cumulative season Frequency Percent Frequency Percent ----------------------------------------------------------- spring 4 44.44 4 44.44 winter 5 55.56 9 100.00 Frequency Missing = 6
We can also look at the number of missing values in each observation. For example, we can use SAS function cmiss to store the number of missing values from both numeric and character variables in each observation.
data test1;
set test;
miss_n = cmiss(of landval -- season);
run;
proc print data = test1;
run;
Obs landval improval totval salepric saltoapr city season miss_n
1 30000 64831 94831 118500 1.25 A spring 0
2 30000 50765 80765 93900 . winter 2
3 46651 18573 65224 . 1.16 B 2
4 45990 91402 . 184000 1.34 C winter 1
5 42394 . 40575 168000 1.43 3
6 . 3351 51102 169000 1.12 D winter 1
7 63596 2182 65778 . 1.26 E spring 1
8 56658 53806 10464 255000 1.21 2
9 51428 72451 . . 1.18 F spring 2
10 93200 . 4321 422000 1.04 3
11 76125 78172 54297 290000 1.14 G winter 0
12 . 61934 16294 237000 1.10 H spring 1
13 65376 34458 . 286500 1.43 winter 2
14 42400 . 57446 . . K 4
15 40800 92606 33406 168000 1.26 S 1
A slightly different approach is to make use of SAS special names for
referring numeric variables, character variables or all variables. For example,
for all variables, we can use the SAS keyword _all_ as shown below.
Similarly, use _numeric_ for all numeric variables and _character_
for all character variables.
data test1; set test; miss_n = cmiss(of _all_) - 1; run;
We can also look at the patterns of missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use the proc freq with statement tables with option list to compute the frequency for each pattern of missing data.
data miss_pattern (drop=i); set test; array mynum(*) _numeric_; do i=1 to dim(mynum); if mynum(i) =. then mynum{i}=1; else mynum(i)=0; end; array mychar(*) $ _character_; do i=1 to dim(mychar); if mychar(i) ="" then mychar{i}=1; else mychar(i)=0; end; run; proc freq data=miss_pattern; tables landval*improval*totval*salepric*saltoapr*city*season /list; run;landval improval totval salepric saltoapr city season --------------------------------------------------------------------------- 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 Cumulative Cumulative Frequency Percent Frequency Percent ------------------------------------------------- 2 13.33 2 13.33 1 6.67 3 20.00 1 6.67 4 26.67 1 6.67 5 33.33 1 6.67 6 40.00 1 6.67 7 46.67 1 6.67 8 53.33 1 6.67 9 60.00 1 6.67 10 66.67 2 13.33 12 80.00 1 6.67 13 86.67 2 13.33 15 100.00
Now we see that there are two observations with no missing values, one observation with one missing value in variable season, and so on.
Another approach to achieve the same output could be via making use of formats, therefore to eliminate the extra data step for creating the indicator variables.
options linesize =120 nocenter nodate formchar = '|----|+|---+=|-/<>*'; proc format; value nm . = '1' other = '0'; value $ch ' ' = '1' other = '0'; run; proc freq data=test; tables landval*improval*totval*salepric*saltoapr*city*season /list missing nocum; format _numeric_ nm. _character_ $ch.; run;landval improval totval salepric saltoapr city season Frequency Percent ------------------------------------------------------------------------------------------------ 1 0 0 0 0 0 0 2 13.33 0 1 0 1 1 0 1 1 6.67 0 1 0 0 0 1 1 2 13.33 0 0 1 1 0 0 0 1 6.67 0 0 1 0 0 1 0 1 6.67 0 0 1 0 0 0 0 1 6.67 0 0 0 1 0 0 1 1 6.67 0 0 0 1 0 0 0 1 6.67 0 0 0 0 1 1 0 1 6.67 0 0 0 0 0 1 1 1 6.67 0 0 0 0 0 0 1 1 6.67 0 0 0 0 0 0 0 2 13.33
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services