Help the Stat Consulting Group by giving a gift

Missing data in SAS

This module will explore missing data in SAS, focusing on numeric missing
data. It will describe how to indicate missing data in your raw data files, how
missing data are handled in SAS procedures, and how to handle missing data in a
SAS **data step**. Suppose we did a reaction time study with six subjects, and the subjects
reaction time was measured three times. The data file is shown below.

DATA times ; INPUT id trial1 trial2 trial3 ; CARDS ; 1 1.5 1.4 1.6 2 1.5 . 1.9 3 . 2.0 1.6 4 . . 2.2 5 2.1 2.3 2.2 6 1.8 2.0 1.9 ; RUN ; PROC PRINT DATA=times ; RUN ;

You might notice that some of the reaction times are coded using a single dot. For example, for subject 2, the second trial is coded just as a dot. Well, the person measuring response time for that trial did not measure the response time properly so the data for that trial was missing.

OBS ID TRIAL1 TRIAL2 TRIAL3 1 1 1.5 1.4 1.6 2 2 1.5 . 1.9 3 3 . 2.0 1.6 4 4 . . 2.2 5 5 2.1 2.3 2.2 6 6 1.8 2.0 1.9

In your raw data, missing data are generally coded using a single . to indicate a missing value. SAS recognizes a single . as a missing value and knows to interpret it as missing and handles it in special ways. Let's examine how SAS handles missing data in procedures.

As a general rule, SAS procedures that perform
computations handle missing data by omitting the missing values. (We say
**procedures
that perform computations** to indicate that we are not addressing procedures like
**proc contents**). The way that missing values are eliminated is not always the same among
SAS procedures, so let's us look at some examples. First, let's do a **proc means**
on our data file and see how SAS **proc means** handles the missing values.

PROC MEANS DATA=times ; VAR trial1 trial2 trial3 ; RUN ;

As you see in the output below, **proc means** computed the
means using 4 observations for **trial1** and **trial2**
and 6 observations for **trial3**. In short, **proc means** used all of the valid
data and performed the computations on all of the available data.

Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------- TRIAL1 4 1.7250000 0.2872281 1.5000000 2.1000000 TRIAL2 4 1.9250000 0.3774917 1.4000000 2.3000000 TRIAL3 6 1.9000000 0.2683282 1.6000000 2.2000000 -------------------------------------------------------------------

As you see below, **proc freq** likewise performed its
computations using just the available data. Note that the percentages are computed based
on just the total number of non-missing cases.

PROC FREQ DATA=times ; TABLES trial1 trial2 trial3 ; RUN ;Cumulative Cumulative TRIAL1 Frequency Percent Frequency Percent ---------------------------------------------------- 1.5 2 50.0 2 50.0 1.8 1 25.0 3 75.0 2.1 1 25.0 4 100.0 Frequency Missing = 2 Cumulative Cumulative TRIAL2 Frequency Percent Frequency Percent ---------------------------------------------------- 1.4 1 25.0 1 25.0 2 2 50.0 3 75.0 2.3 1 25.0 4 100.0 Frequency Missing = 2 Cumulative Cumulative TRIAL3 Frequency Percent Frequency Percent ---------------------------------------------------- 1.6 2 33.3 2 33.3 1.9 2 33.3 4 66.7 2.2 2 33.3 6 100.0

It is possible that you might want the percentages
to be computed out of the total number of values, and even report the percentage missing
right in the table itself. You can request this using the **missing** option
on the **tables** statement of **proc freq** as shown below
(just for **trial1**).

PROC FREQ DATA=times ; TABLES trial1 / MISSING ; RUN ;

As you see, now the percentages are computed out of the total number of observations, and the percentage missing are shown right in the table as well.

Cumulative Cumulative TRIAL1 Frequency Percent Frequency Percent ---------------------------------------------------- . 2 33.3 2 33.3 1.5 2 33.3 4 66.7 1.8 1 16.7 5 83.3 2.1 1 16.7 6 100.0

Let's look at how **proc corr** handles
missing data. We would expect that it would do the computations based on the available
data, and omit the missing values. Here is an example program.

PROC CORR DATA=times ; VAR trial1 trial2 trial3 ; RUN ;

The output of this program is shown
below. Note how the missing values were excluded. For each **pair**
of variables, **proc corr** used the number of pairs that had valid data. For
the pair formed by **trial1** and **trial2**, there were 3 pairs with valid
data. For the pairing of **trial1** and **trial3 **there were 4 valid pairs,
and likewise there were 4 valid pairs for **trial2** and **trial3**.
Since this used all of the valid **pairs**
of data, this is often called **pairwise deletion of missing data**.

Correlation Analysis 3 'VAR' Variables: TRIAL1 TRIAL2 TRIAL3 Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum TRIAL1 4 1.725000 0.287228 6.900000 1.500000 2.100000 TRIAL2 4 1.925000 0.377492 7.700000 1.400000 2.300000 TRIAL3 6 1.900000 0.268328 11.400000 1.600000 2.200000 Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations TRIAL1 TRIAL2 TRIAL3 TRIAL1 1.00000 0.98198 0.85280 0.0 0.1210 0.1472 4 3 4 TRIAL2 0.98198 1.00000 0.76089 0.1210 0.0 0.2391 3 4 4 TRIAL3 0.85280 0.76089 1.00000 0.1472 0.2391 0.0 4 4 6

It is possible to ask SAS to only perform the
correlations on the observations that had complete data for all of the variables on the
**var** statement. For example, you might want the correlations of the reaction times just for
the observations that had non-missing data on all of the trials. This is called
**listwise
deletion of missing data** meaning that when any of the variables are missing, the
entire observation is omitted from the analysis. You can request listwise deletion within
**proc corr** with the **nomiss** option as illustrated below.

PROC CORR DATA=times NOMISS ; VAR trial1 trial2 trial3 ; RUN ;

As you see in the results below, the N for all the simple statistics is the same, 3, which corresponds to the number of cases with complete non-missing data for trial1 trial2 and trial3. Since the N is the same for all of the correlations (i.e., 3), the N is not displayed along with the correlations.

Correlation Analysis 3 'VAR' Variables: TRIAL1 TRIAL2 TRIAL3 Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum TRIAL1 3 1.800000 0.300000 5.400000 1.500000 2.100000 TRIAL2 3 1.900000 0.458258 5.700000 1.400000 2.300000 TRIAL3 3 1.900000 0.300000 5.700000 1.600000 2.200000 Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3 TRIAL1 TRIAL2 TRIAL3 TRIAL1 1.00000 0.98198 1.00000 0.0 0.1210 0.0001 TRIAL2 0.98198 1.00000 0.98198 0.1210 0.0 0.1210 TRIAL3 1.00000 0.98198 1.00000 0.0001 0.1210 0.0

It is important to understand how SAS procedures handle missing data if you have missing data. To know how a procedure handles missing data, you should consult the SAS manual. Here is a brief overview of how some common SAS procedures handle missing data.

- -
proc means

For each variable, the number of non-missing values are usedproc freq

By default, missing values are excluded and percentages are based on the number of non-missing values. If you use themissingoption on thetablesstatement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.proc corr

By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). Thenomissoption can be used on theproc corrstatement to request that correlations be computed only for observations that have non-missing data for all variables on thevarstatement (listwise deletion of missing data).proc reg

If any of the variables on themodelorvarstatement are missing, they are excluded from the analysis (i.e.,listwise deletion of missing data)proc factor

Missing values are deletedlistwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.proc glm

The handling of missing values inproc glmcan be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing arepeated measures ANOVAor aMANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual aboutproc glm.- For other procedures, see the SAS manual for information on how missing data are handled.

It is important to understand how missing values are handled in assignment statements. Consider the example shown below.

DATA times2 ; SET times ; avg = (trial1 + trial2 + trial3) / 3 ; RUN ; PROC PRINT DATA=times2 ; RUN ;

The **proc print** below illustrates how
missing values are handled in assignment statements. The variable **avg**
is based on the
variables **trial1** **trial2** and **trial3**. If
any of those variables were missing, the value for **avg** was set to
missing. This meant that **avg** was missing for observations 2, 3 and 4.

OBS ID TRIAL1 TRIAL2 TRIAL3 AVG 1 1 1.5 1.4 1.6 1.5 2 2 1.5 . 1.9 . 3 3 . 2.0 1.6 . 4 4 . . 2.2 . 5 5 2.1 2.3 2.2 2.2 6 6 1.8 2.0 1.9 1.9

In fact, SAS included a **NOTE**: in the Log to let you know
about the missing values that were created. The Log entry from this example is shown
below.

222 DATA times2 ; 223 SET times ; 224 avg = (trial1 + trial2 + trial3) / 3 ; 225 RUN ; NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 3 at 224:17 3 at 224:26 3 at 224:36 NOTE: The data set WORK.TIMES2 has 6 observations and 5 variables.

This note tells us that three missing values were created in
the program at line 224. This makes sense, we know that 3 missing values were created for
**avg**
and that **avg** is created on line 224.

As a general rule, computations involving missing values yield missing values. For example,

2 + 2 yields 4

2 + . yields .

2 / 2 yields 1

. / 2 yields .

2 * 3 yields 6

2 * . yields .

whenever you add, subtract, multiply, divide, etc., values that involve missing data, the result is missing.

In our reaction time experiment, the average reaction
time **avg** is missing for three out of six cases. We could try just averaging the
data for the non-missing trials by using the **mean** function as shown in the example
below.

DATA times3 ; SET times ; avg = MEAN(trial1, trial2, trial3) ; RUN ; PROC PRINT DATA=times3 ; RUN ;

The results below show that **avg** now
contains the average of the non-missing trials.

OBS ID TRIAL1 TRIAL2 TRIAL3 AVG 1 1 1.5 1.4 1.6 1.5 2 2 1.5 . 1.9 1.7 3 3 . 2.0 1.6 1.8 4 4 . . 2.2 2.2 5 5 2.1 2.3 2.2 2.2 6 6 1.8 2.0 1.9 1.9

Had there been a large number of trials, say 50 trials,
then it would be annoying to have to type**avg = mean(trial1, trial2, trial3 .... trial50)**

Here is a shortcut you could use in this kind of situation
**avg = mean(of trial1-trial50)**

Also, if we wanted to get the sum of the times instead
of the average, then we could just use the **sum** function instead of the
**mean**
function. The syntax of the **sum** function is just like the
**mean**
function, but it returns the sum of the non-missing values.

Finally, you can use the **N** function to
determine the number of non-missing values in a list of variables, as illustrated below.

DATA times4 ; SET times ; n = N(trial1, trial2, trial3) ; RUN ; PROC PRINT DATA=times4 ; RUN ;

As you see below, observations 1, 5 and 6 had three valid values, observations 2 and 3 had two valid values, and observation 4 had only one valid value.

OBS ID TRIAL1 TRIAL2 TRIAL3 N 1 1 1.5 1.4 1.6 3 2 2 1.5 . 1.9 2 3 3 . 2.0 1.6 2 4 4 . . 2.2 1 5 5 2.1 2.3 2.2 3 6 6 1.8 2.0 1.9 3

You might feel uncomfortable with the variable **avg**
for observation 4 since it is not really an average at all. We can use the variable
**n**
to create **avg** only when there are two or more valid values, but if the number of non-missing values
is 1 or less, then make **avg** to be missing. This is illustrated below.

DATA times5 ; SET times ; n = N(trial1, trial2, trial3) ; IF n >= 2 THEN avg = MEAN(trial1, trial2, trial3) ; IF n <= 1 THEN avg=. ; RUN ; PROC PRINT DATA=times5 ; RUN ;

In the output below, you see that **avg**
now contains the average reaction time for the non-missing values, except for observation
4 where the value is assigned to missing because it had only 1 valid observation.

OBS ID TRIAL1 TRIAL2 TRIAL3 N AVG 1 1 1.5 1.4 1.6 3 1.5 2 2 1.5 . 1.9 2 1.7 3 3 . 2.0 1.6 2 1.8 4 4 . . 2.2 1 . 5 5 2.1 2.3 2.2 3 2.2 6 6 1.8 2.0 1.9 3 1.9

It is important to understand how missing values are
handled in logical statements. For example, say that you want to create
a 0/1 value for **trial1 **that is 0 if it is 1.5 or less, and 1 if it is
over 1.5. We show this below (incorrectly, as you will see).

DATA times2 ; SET times ; if (trial1 <= 1.5) then trial1a = 0; else trial1a = 1 ; RUN ; proc print data=times2; var id trial1 trial1a; run;

And as you can see in the output, the values
for **trial1a **are wrong when** id **is 3 or 4, when **trial1**
is missing. This is because SAS treats a missing value as the smallest
possible value (e.g., negative infinity) and that value is less than 1.5, so then
the value for **trial1a** becomes 0.

Obs id trial1 trial1a 1 1 1.5 0 2 2 1.5 0 3 3 . 0 4 4 . 0 5 5 2.1 1 6 6 1.8 1

Instead, we will explicitly exclude missing values to make sure they are treated properly, as shown below.

DATA times2 ; SET times ; trial1a = .; if (trial1 <= 1.5) and (trial1 > .) then trial1a = 0; if (trial1 > 1.5) then trial1a = 1 ; RUN ; proc print data=times2; var id trial1 trial1a; run;

And now we get the results that we wish. The value for **trial1a **is only
0 when it is less than or equal to 1.5 and it is not missing. The
value for **trial1a **is only 1 when it is over 1.5, as shown below.

Obs id trial1 trial1a 1 1 1.5 0 2 2 1.5 0 3 3 . . 4 4 . . 5 5 2.1 1 6 6 1.8 1

- When creating or recoding variables that involve missing values, always pay attention to the SAS log to detect when you are creating missing values.

- See Subsetting data in SAS for information about subsetting data with variables that are missing.
- See How do I specify types of missing values? for more information about using different missing data values.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.