Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu
One way to code dichotomous response data is to enter a 0 or 1 for each level, e.g., No=0/Yes=1. You can then assign formats to the variable names so that the meaning of each level remains intact. In order to analyze dichotomous data as a response variable you should first define which of the two levels is of greater interest to you. You then typically assume:
- a probability p of observing the level of interest (0 < p < 1)
- n trials within each group
Given this scenario, what method of analysis should you use? Your first inclination might be to employ ANOVA methods (including linear regression). However, this approach is not optimal, for the following reasons:
1. Proportions are bounded by 0 and 1. ANOVA methods assume the continuous response variable ranges from negative infinity to positive infinity. When analyzing data this open interval does not need to be literally true, but with dichotomous data the defined bounds may be too restrictive.
2. Linear regression and ANOVA typically assume homoscedasticity of the residuals. The variance of a proportion depends on the value of its mean (a function of the levels of the independent variables), so it doesn't meet this assumption.
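The dependence of the variance on the mean in point 2 can be written out explicitly. As a sketch, assuming n independent trials that share a common probability p, the sample proportion has variance:

```latex
\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}
```

For a fixed n this is largest at p = 0.5 and shrinks toward zero as p approaches 0 or 1, so groups with different proportions necessarily have different variances.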
To minimize these problems, the arcsin transformation has historically been applied. However, today we can do much better!
The type of analysis designed for dichotomous data is logistic regression. It is similar to linear regression and ANOVA in that the independent or explanatory variables may be either continuous or categorical. Logistic regression produces model coefficients with significance tests and interprets the response variable with odds ratios.
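In its simplest form, with a single explanatory variable, the model can be sketched as follows. The symbols x, beta_0, and beta_1 are notation introduced here for illustration and do not appear in the SAS output:

```latex
\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x,
\qquad
\text{odds ratio for a one-unit increase in } x = e^{\beta_1}
```

The coefficients are estimated on the log-odds scale, which is why exponentiating a coefficient gives an odds ratio.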
SAS has several procedures that will calculate logistic regression models, namely PROC GENMOD, LOGISTIC, and NLMIXED. (You may add to this list FREQ, CATMOD, and SURVEYLOGISTIC, but only GENMOD, LOGISTIC, and NLMIXED are considered in this article.) The objective is to code data that will produce consistent output when read into any of these procedures.
In the examples that follow I will assume the dichotomous response variable
is coded as 0/1. The important point is to give the level of greatest interest
the value that sorts last in ascending order, in this case a 1. How you code
the independent categorical variables will also determine the results you observe
on the output. By default, most procedures in SAS that include a CLASS statement
assign the "largest" coded level (i.e., the one that sorts last
in an ascending order) as the reference category; that is, estimated coefficients
of all other categories are compared with it. This is important to know if
you want to state how females compare to males on the level of the response
coded as 1. This implies females should be coded as gender=0 and males as gender=1.
Coding gender as character "F" and "M" also works,
since "F" precedes "M" when sorting.
With these principles in mind, a very "small" dataset (for illustration only) is shown where 5 women and 4 men, selected randomly, were asked a basic yes/no question where "yes" is the response of greater interest. The final two columns are different ways to dummy-code gender, depending on whether Female or Male is the reference category.
| response | gender | gndrF | gndrM |
| 1 | F | 1 | 0 |
| 1 | F | 1 | 0 |
| 0 | F | 1 | 0 |
| 0 | F | 1 | 0 |
| 0 | F | 1 | 0 |
| 1 | M | 0 | 1 |
| 1 | M | 0 | 1 |
| 1 | M | 0 | 1 |
| 0 | M | 0 | 1 |
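One way to read this dataset into SAS is sketched below. The dataset name b matches the DATA=b option used in the procedure calls in this article. Building gndrF and gndrM from logical comparisons is an illustrative choice, not the only way to create the dummy variables.

```sas
/* Read the nine responses; "@@" lets several observations share one data line */
DATA b;
  INPUT response gender $ @@;
  gndrF = (gender = 'F');   /* 1 for females, 0 for males */
  gndrM = (gender = 'M');   /* 1 for males, 0 for females */
  DATALINES;
1 F  1 F  0 F  0 F  0 F
1 M  1 M  1 M  0 M
;
RUN;

/* List the data to confirm it matches the table */
PROC PRINT DATA=b;
RUN;
```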
By aggregating the response where y = Sum(response) is the
number of 'yes' responses for each level of gender, the same dataset could
also be represented as:
| y | n | gender |
| 2 | 5 | F |
| 3 | 4 | M |
PROC GENMOD. GENMOD is an acronym for "GENeralized linear MODels." Its function and application are similar to those of PROC GLM (general linear model) for ANOVA models; however, GENMOD allows the response variable to assume distributions such as binomial (for dichotomous data) and Poisson (counts). This is the basic syntax for logistic regression:
PROC GENMOD DATA=b descending;
CLASS gender;
MODEL response = gender / dist=binomial link=logit
<options> ;
RUN;
PROC LOGISTIC. LOGISTIC is designed for analysis of dichotomous and ordinal data (e.g., Likert scales) and now includes a CLASS statement. This is the most basic syntax for logistic regression:
PROC LOGISTIC DATA=b descending;
CLASS gender / PARAM=ref;
MODEL response = gender / <options>;
RUN;
A common feature of GENMOD and LOGISTIC is the descending option on the PROC statement, which means for response data coded 0/1, SAS will analyze the probability of a response of '1' rather than the default level of '0'. This option is an essential feature to recognize when interpreting the sign of estimated coefficients. The MODEL statements in the two PROCs look similar, although the choices for options in each are quite different. In GENMOD the "logit" link is the default value for the binomial distribution--just be sure to enter "logit" (and not another valid choice, "log") in order to produce the same results as LOGISTIC. Both procedures allow you to enter aggregated data with the notation y/n for the response (y subjects answered "Yes" in n trials); with this coding the "descending" option no longer applies.
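The events/trials form mentioned above might look like the following sketch. The dataset name b_agg is a hypothetical name chosen here; the counts are the aggregated values from the small example (2 of 5 women and 3 of 4 men answered "yes").

```sas
/* Aggregated data: y "yes" responses out of n subjects for each gender */
DATA b_agg;
  INPUT y n gender $;
  DATALINES;
2 5 F
3 4 M
;
RUN;

/* Events/trials syntax: the descending option is not needed */
PROC GENMOD DATA=b_agg;
  CLASS gender;
  MODEL y/n = gender / dist=binomial link=logit;
RUN;

PROC LOGISTIC DATA=b_agg;
  CLASS gender / PARAM=ref;
  MODEL y/n = gender;
RUN;
```

Because y counts the "yes" responses, both procedures model the probability of "yes" directly.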
Perhaps the greatest potential confusion between the two procedures is found
in the CLASS statement. GENMOD applies 0/1 coding of categorical data (like
PROC GLM), treating the 'highest formatted level' (or numeric,
if unformatted) as the reference category. By default, LOGISTIC uses 'effects'
coding for classification factors, that is, -1/1 values. For a two-level factor this means
the estimated coefficient will be 1/2 the magnitude produced with GENMOD,
which is the reason for PARAM=ref on the CLASS statement. If you follow the
steps outlined above, you will arrive at the same results regardless of which
procedure you use:
| Parameter | Estimate |
| Intercept | 1.0986 |
| gender | -1.5041 |
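Because a model with a single two-level factor fits the observed proportions exactly, these estimates can be checked by hand from the aggregated counts (2 of 5 women and 3 of 4 men answered "yes"). The following is a worked sketch with males as the reference category:

```latex
\begin{aligned}
\text{Intercept} &= \log\frac{3/4}{1/4} = \log 3 \approx 1.0986 \\
\text{gender (F vs. M)} &= \log\frac{2/5}{3/5} - \log\frac{3/4}{1/4} = \log\frac{2}{9} \approx -1.5041 \\
\text{odds ratio (F vs. M)} &= e^{-1.5041} = \frac{2}{9} \approx 0.22
\end{aligned}
```

The gender coefficient is negative because the proportion of "yes" responses is lower among women than among men.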
As shown above, one very nice feature of the CLASS statement
in LOGISTIC is the ability to change the default parameterization to match
GENMOD and also to specify your choice of the reference category:
CLASS gender(REF=first) / PARAM=ref;
This option is of value if you code gender as "F" and "M" and you want to compare Males to Females. Without REF=first attached to gender, LOGISTIC would compare Females to Males.
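A complete call using this option might look like the sketch below. It names the reference level explicitly with REF='F', which for this dataset selects the same level as REF=first. The DATA step simply re-creates the nine-observation example so the block can run on its own.

```sas
/* Re-create the small example dataset */
DATA b;
  INPUT response gender $ @@;
  DATALINES;
1 F  1 F  0 F  0 F  0 F
1 M  1 M  1 M  0 M
;
RUN;

/* Females as the reference category: the coefficient compares Males to Females */
PROC LOGISTIC DATA=b descending;
  CLASS gender(REF='F') / PARAM=ref;
  MODEL response = gender;
RUN;
```

With females as the reference, the gender coefficient has the same magnitude as the female-versus-male coefficient but the opposite sign.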
NLMIXED. A third way to compute logistic regression is the procedure NLMIXED. However, its use assumes knowledge of maximum likelihood estimation. It also requires that you numerically code all categorical data with 0/1 values (with the DATA step or with PROC GLMMOD) as it does not currently have a CLASS statement. Although it has a built-in binomial function, it is more instructive to observe how you can write statements to analyze dichotomous data directly with the likelihood equation:
PROC NLMIXED DATA=b;
PARMS intercept = -.1 _gender = -.1;
eta = intercept + ( _gender * gndrF );
prb_1 = exp(eta) / (1+exp(eta));
liklhd = (prb_1**response) * ((1-prb_1)**(1-response));
loglik = LOG(liklhd);
MODEL response ~ general(loglik);
RUN;
The equation for eta is the linear predictor, a function of the intercept
and dummy-coded gender (here, as with GENMOD and LOGISTIC, Males are treated
as the reference category). Eta enters the formula to compute prb_1,
the probability that response=1; this formula has the same effect as the descending
option discussed above. The log-likelihood is obtained from the probability
equation and is then maximized to compute the parameter estimates (the same
as the estimates produced by GENMOD and LOGISTIC, as shown above).
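For comparison, the same model can be sketched with the built-in distribution mentioned earlier instead of a hand-written likelihood. This version assumes the binary(p) form of the MODEL statement in NLMIXED; the DATA step re-creates the example data with the gndrF dummy variable so the block can run on its own.

```sas
/* Re-create the example data with the dummy variable for females */
DATA b;
  INPUT response gender $ @@;
  gndrF = (gender = 'F');
  DATALINES;
1 F  1 F  0 F  0 F  0 F
1 M  1 M  1 M  0 M
;
RUN;

PROC NLMIXED DATA=b;
  PARMS intercept = -.1 _gender = -.1;
  eta   = intercept + (_gender * gndrF);   /* linear predictor */
  prb_1 = exp(eta) / (1 + exp(eta));       /* probability that response=1 */
  MODEL response ~ binary(prb_1);          /* built-in Bernoulli likelihood */
RUN;
```

Writing the likelihood by hand, as in the general(loglik) version, is more work but shows exactly what the built-in distribution computes.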
Which procedure should you use? The answer depends on many factors, but essentially here are the major differences among the three methods we've discussed:
PROC GENMOD presents a unified approach to the analysis of categorical data, including Poisson and Negative Binomial (for counts), gamma, and normally distributed data (though for this distribution, GLM, REG, or MIXED will likely work better). It also handles repeated measures for count data in much the same way as PROC MIXED works with repeated measures for continuous data.
PROC LOGISTIC is designed for regression applications with one response (0/1) collected from each subject or several independent responses aggregated over subjects. It is the procedure designed to compute ROC curves. It can also perform exact logistic regression when you have small sample sizes or 0 counts in some of the cells (a technique that may be of value when given the warning "quasi-complete separation" in the log file).
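With only nine observations, the small example in this article is a natural candidate for exact methods. A minimal sketch, assuming the EXACT statement with its ESTIMATE=both option (which requests both the exact parameter estimate and the exact odds ratio):

```sas
/* Re-create the small example dataset */
DATA b;
  INPUT response gender $ @@;
  DATALINES;
1 F  1 F  0 F  0 F  0 F
1 M  1 M  1 M  0 M
;
RUN;

/* Exact conditional analysis of the gender effect */
PROC LOGISTIC DATA=b descending;
  CLASS gender / PARAM=ref;
  MODEL response = gender;
  EXACT gender / ESTIMATE=both;
RUN;
```

The usual maximum likelihood results are still printed; the exact results appear in additional tables.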
PROC NLMIXED allows great flexibility in writing program statements. It also
handles random effects models for count data and allows you to enter formulas
for non-standard probability distributions (much like the dichotomous data
example). One of its most useful features is the ability to compute zero-inflated
models. In this situation a large number of legitimate zeros appear in your
dataset, making overdispersion a concern. PROC GENMOD has a scale = option
for approaching this problem, although a zero-inflated model may be a more
attractive solution. Note that this situation may be mistaken for censored
data (i.e., the observed value is only an upper or lower bound), which calls for
methods such as Tobit models. Legitimate zeros and censored observations are two
very different problems, and NLMIXED allows you to treat them as such (see Long,
Chapters 7 and 8).
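To illustrate the zero-inflated case, the following is a minimal sketch of an intercept-only zero-inflated Poisson model written with the general() likelihood. The dataset counts, its values, and the parameter names a0 and b0 are all made up for illustration and do not come from the example used in this article.

```sas
/* Made-up counts with many zeros, for illustration only */
DATA counts;
  INPUT count @@;
  DATALINES;
0 0 0 0 0 0 1 2 0 3 1 0 4 2 0
;
RUN;

PROC NLMIXED DATA=counts;
  PARMS a0 = 0 b0 = 0;
  p0     = 1 / (1 + exp(-a0));   /* probability of a structural zero */
  lambda = exp(b0);              /* mean of the Poisson component */
  /* A zero can come from either component; a positive count only from the Poisson */
  IF count = 0 THEN
    loglik = LOG(p0 + (1 - p0) * exp(-lambda));
  ELSE
    loglik = LOG(1 - p0) + count * LOG(lambda) - lambda - LGAMMA(count + 1);
  MODEL count ~ general(loglik);
RUN;
```

Explanatory variables could be added to either component by extending the expressions inside p0 and lambda, in the same way eta was built for the dichotomous example.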