Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu
In the ideal data collection project, complete data would exist for all variables across all experimental units (also called subjects, cases, or observations). Unfortunately, for a number of reasons it is inevitable that some values won't be collected, will become lost, or will be unusable.
Understanding how statistical programs work with missing data is an important step in dealing with this situation. Establishing mechanisms to track sources of missing data may help you better understand your data and how to present results. It's also helpful to know the various ways to test for and report the presence of missing data with a computer program such as SAS. This article focuses primarily on how SAS works with missing numerical data.
When dealing with missing data, the first step is to understand the three basic ways in which it typically occurs:
MCAR (data are missing completely at random): A "missing" value does not depend on the variable itself or on the values of other variables in the database.
MAR (data are missing at random): The probability of missing data on any variable is not related to its particular value. The pattern of missing data is traceable or predictable from other variables in the database.
NMAR (not missing at random): Missing data are not random and depend on the values that are missing.
Next, it's important to anticipate how a given statistical program handles missing data: will it be listwise or pairwise?
Listwise or casewise deletion. Many statistical procedures will eliminate an entire observation or case if there are any missing data in the defined variables. This is known as listwise or casewise data deletion and occurs when a record has missing data for one or more identified variables. For example, repeated measures or multivariate analyses with the General Linear Model (GLM) in SAS or SPSS will only use data from subjects where all identified variables in the model are completely recorded. This means available data from a substantial number of experimental units could be omitted from the analysis if missing values in any combination of variables are present.
Pairwise data deletion occurs for some procedures that work with data in "pairs." Correlations and covariances computed with PROC CORR are instances where statistics can be computed with complete data for pairs of items that do not need to account for missing data with other variables. Pairwise data deletion is available in a number of SAS and SPSS statistical procedures. The SAS procedure PROC MIXED also works in this manner; it can analyze linear models without deleting subjects with missing data. However, missing data combinations, especially if they are numerous, can cause computational problems.
Three basic methods are commonly applied to code missing numerical data: the single period ( . ), a blank space, or "impossible" numeric values. A blank space or period are the common choices for missing character data. These choices will work with just about any program to work with missing data read from external files.
Impossible numbers that lie outside the relevant range, such as -99 or -9, often represent missing data. For example, when entering survey data, a negative number works well when the value is always positive, such as a person's age; a positive value such as 9 or 99 works well with responses that have a restricted range (such as a Likert scale from 1 to 5).
Visually, a negative number may have the greatest impact (especially when you expect to see a positive value). However, if such numbers are added to the data file, make sure that they will not be used in any numerical computations to avoid biased or illogical results. SPSS allows you to define up to three discrete numerical values or a range of values to be defined as missing in addition to the period or blank space. This approach can also be used with data read into SAS. However, when doing so you must convert the number to a SAS missing value (such as the period) before you perform any numerical computations. Another option is to enter a WHERE statement into the list of statements for the DATA or PROC steps to specify values that lie within the relevant range.
A more flexible method of coding missing data is to apply single-letter codes (a, b, c, ..., y, z). The advantages of this technique when used in SAS will become apparent as you continue to read this article.
When entering data into an Excel worksheet that will eventually be read into SAS, you have two basic options to transfer data: have SAS read the Excel file directly with PROC IMPORT or ODBC, or save the Excel file in text format and have SAS read it with a DATA step. For either situation you should first review the strengths and limitations of Excel for data entry, how to structure the worksheet, and how PROC IMPORT accesses external files (see http://cc.uoregon.edu/cnews/spring2003/datadesign.html and http://cc.uoregon.edu/cnews/summer2003/sasupdate.html).
If you choose to apply only the three basic missing data coding schemes, SAS easily reads data from an Excel worksheet. The key to making the data transfer work properly is to format the range of cells with one data type for each column and either leave cells with missing data blank, or enter a period or an "impossible" number. With a single consistent format in each column and variable names placed in the first row, SAS can effectively import data from Excel worksheets.
If missing data are represented by "impossible" numbers, you need to immediately convert them to a SAS missing value in a DATA step. For example, suppose you apply one or more numeric codes (e.g., -99, -9, 9, 99, etc.) to represent missing values. To convert one or two specific numbers or a range of numbers to a missing code, enter an appropriate IF statement in a DATA step:
IF (x_var EQ -99 OR x_var EQ -9) THEN x_var = . ;
IF (-99 LE x_var LE -1) THEN x_var = . ;
If a period represents missing data, an OPTIONS statement with missing='M' placed at the beginning of the sequence of commands specifies what symbol to print. In this case, the letter M (or another letter of your choice) will be printed for any missing data.
To utilize SAS's ability to apply a variety of missing data codes, a second data transfer approach is necessary. Assume you decide to enter missing data in Excel with a variety of single-letter codes a, b, c, d, .., z (in addition to the three basic codes). Before you can read this data into SAS you will need to save the worksheet as a text file (e.g., comma separated value .CSV or tab-delimited .TXT file). A DATA step with MISSING, INFILE, and INPUT statements can then be used to read the data. Note that character values placed where SAS expects to find numbers will produce error messages in the log file. To avoid this, place a MISSING statement within the DATA step to tell SAS what specific character values should be treated as missing, e.g.,
OPTIONS missing='M'; ;
DATA new;
MISSING n r u;
INFILE cards DLM=',' dsd;
INPUT a1 a2 a3;
CARDS;
n,3,2
3,r,1
u,1,1
3,n,.
In this example the letters n, r, and u defined on the MISSING statement represent legitimate missing values placed in the input data file (they act as if they were a period). The OPTIONS statement designates that any other data value read as missing, such as another letter or the period, will be printed with the letter M. Observe what happens to the missing values when the data set is printed:
PROC PRINT DATA=new; VAR a1 a2 a3; RUN;
obs a1 a2 a3;
1 N 3 2
2 3 R 1
3 U 1 1
4 3 N M
A missing data value in SAS is actually a special, reserved floating point number. You can choose from 28 possible missing data codes (including the period), each having its own inherent sorting order, as follows:
The advantage of using SAS coding. Often in data collection--especially surveys--you should preserve the precise reason why data are missing. Suppose you collect data on a person's income. Missing data for this sometimes sensitive question can occur for a variety of reasons. The SAS system allows you to assign codes for missing data that distinguish why income could not be collected for specific reasons, such as:
n for "No Response"
r for "Refused to Answer"
u for "Unemployed"
. for "Unknown"
You can then create your own formats for the actual income levels, including separate codes for the missing values:
PROC FORMAT;
VALUE incm 1='less than 15000' 2='15000 to 30000'
3='30000 to 45000' 4='greater than 45000'
.n="no response".r="refused to answer"
.u="unemployed".="unknown";
RUN;
DATA two;
LABEL income="Income";
FORMAT income incm. ;
MISSING n r u; * the letters n r u refer to
specific reasons for missing data from
income, listed in PROC FORMAT;
INPUT id income @@;
CARDS;
1 1 2 . 3 r 4 4
5 2 6 n 7 u 8 1
;
PROC PRINT NOobs Label;VAR id income;RUN;
PROC TABULATE NOseps missing;CLASS income;
TABLE income, n="Count"*f=7.0 / RTS=20;
RUN;
Missing data are read with a single letter code; however, the associated formats include a period placed before the letter. Income for id=2 is missing for an unknown reason; thus it is assigned a period rather than one of the specific missing data codes.
The output from PROC PRINT and PROC TABULATE (Figures 1 and 2 on the following page) demonstrate how a variety of missing data formats allows you to work within the coding of the variable itself to include additional discrete levels. The "missing" option on the PROC TABULATE statement tells SAS to print the number of missing data items that exist for each type of missing data.
| id | INCOME |
| 1 | less than 15000 |
| 2 | unknown |
| 3 | refused to answer |
| 4 | greater than 45000 |
| 5 | 15000 to 30000 |
| 6 | no response |
| 7 | unemployed |
| 8 | less than 15000 |
| |-- | - - - - - - - - - - - - - - - - - - | | | - - - - - - - - - - - | | |
| | | | | COUNT | | | |
| |-- | - - - - - - - - - - - - - - - - - - | | | - - - - - - - - - - - | | |
| | | Income | | | | | |
| | | unknown | | | 1 | | |
| | | no response | | | 1 | | |
| | | refused to answer | | | 1 | | |
| | | unemployed | | | 1 | | |
| | | lt 15000 | | | 2 | | |
| | | 15000 to 30000 | | | 1 | | |
| | | 30000 to 45000 | | | 0 | | |
| | | gt 45000 | | | 1 |
|
| |-- | - - - - - - - - - - - - - - - - - - | | | - - - - - - - - - - - | | |
Without missing data coding systems, one could only ignore multiple types of missing data and move on. Such codes may also obviate the need to create different fields to explain the reasons for missing data. Looking at your data with a variety of missing data codes can potentially tell you a lot about your study.
One way to test if a given variable is missing is to enter an IF statement in the DATA step. With missing data stored as periods, to take some specific action if true, one would typically use:
IF (x_var EQ . ) THEN DO; < insert statement(s) > ; END;
This statement legitimately tests whether the current value of x_var is missing; however, it only checks for one of the 28 possible missing data codes. If x_var has been set equal to n, r, u, or one of the other missing data codes, this statement won't work.
SAS also has a MISSING data function which tests for the presence of any missing data, not only numbers, but other data types as well (character, date, etc.) The function returns 0 (zero) if the value of the argument is not missing, and returns 1 if the value is missing.
IF (MISSING(x_var) EQ 1) THEN DO; < insert statement(s) >
; END;
The MISSING function is generally preferred where the existence of missing values requires the program to take some specific action.
To appreciate the usefulness of a variety of missing data values, it's essential to know the ascending order in which SAS evaluates them:
._ < . < .a < .b < .c < .. < .x < .y <
.z
Even though they are not considered numerical data, missing data codes behave as if they had unique, ordered numerical values. Since .z is defined to be the so-called "largest" missing data value, a more comprehensive check that will work for all missing data codes is:
IF (x_var LE .z) THEN DO; < insert statement(s) >; END;
Perhaps what you should be most concerned about with missing numerical data is that SAS treats them as if they were negative numbers of extremely large magnitudes. As a result, the internal value of a missing data item is comparatively "less than" any negative number you would ever compute or even be able to assign. You can test this:
DATA one;
IF (.z < 100) THEN a=1;
IF (.z < -99) THEN b=1;
IF (.z < -1.797654321E308) THEN c=1;
RUN;
PROC PRINT DATA=one NOobs; RUN;
a b c
1 1 1
Notice that when .z (the "largest" of the missing data values) is compared with any real number, no matter the magnitude of the number, positive or negative, the indicator variables a, b, c all evaluate to 1 (i.e., they indicate a true expression). From the viewpoint of SAS, comparing numerical data value with an open-ended IF statement is risky. For example:
IF ( x_var LT <any real number>) THEN DO; <enter statements>;
END;
This IF statement will be "true" whenever x_var contains a missing value. As a result, the statement(s) following THEN will be performed--and this is likely to give you a result you don't want. In other words, open-ended IF statements with no lower bound such as (x_var LT <any number>) are evaluated to be true whenever x_var contains any one of the 28 missing data codes. To avoid this situation, two possible solutions are to use the MISSING function or to always insert a lower bound into IF statements:
IF ( <lower bound> LT x_var LT <upper bound> ) THEN
DO; < SAS statement >; END;
Remember that missing data values behave in comparison expressions as if they were negative numbers with very large magnitudes, even larger than SAS can represent them. Although missing data codes behave as if they have unique, ordered, numerical values, they are not included in any numerical computations.
1. Allison, P. D. (2001) "Missing Data." Sage University Papers in Quantitative Applications in the Social Sciences, 07136. Thousand Oaks, CA: Sage.
2. Little, Roderick J. A. and Donald B. Rubin. Statistical Analysis With Missing Data, 2nd Edition, New York: John Wiley and Sons, 2002.