
Data Coding for Logistic Regression with SAS: Which Procedure Should You Use?

Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu

If you are working with dichotomous data, where typical responses are defined as no/yes or failure/success, you may wonder which of SAS's many data analysis procedures you should use. The purpose of this article is to explain how to code data to produce equivalent results from three SAS procedures designed for dichotomous data.

One way to code dichotomous response data is to enter a 0 or 1 for each level, e.g., No=0/Yes=1. You can then assign formats to the variable names so that the meaning of each level remains intact. In order to analyze dichotomous data as a response variable you should first define which of the two levels is of greater interest to you. You then typically assume:

What Method Should You Use?

Given this scenario, what method of analysis should you use? Your first inclination might be to employ ANOVA methods (including linear regression). However, this approach is not optimal, for the following reasons:

1. Proportions are bounded by 0 and 1. ANOVA methods assume the continuous response variable ranges from negative infinity to positive infinity. When analyzing data this open interval does not need to be literally true, but with dichotomous data the defined bounds may be too restrictive.

2. Linear regression and ANOVA typically assume homoscedasticity of the residuals. The variance of a proportion depends on the value of its mean (a function of the levels of the independent variables), so it doesn't meet this assumption.

To minimize these problems, the arcsin transformation has historically been applied. However, today we can do much better!

Logistic Regression

The type of analysis designed for dichotomous data is logistic regression. It is similar to linear regression and ANOVA in that the independent or explanatory variables may be either continuous or categorical. Logistic regression produces model coefficients with significance tests and interprets the response variable with odds ratios.

SAS has several procedures that will calculate logistic regression models, namely PROC GENMOD, LOGISTIC, and NLMIXED. (You may add to this list FREQ, CATMOD, and SURVEYLOGISTIC, but only GENMOD, LOGISTIC, and NLMIXED are considered in this article.) The objective is to code data that will produce consistent output when read into any of these procedures.

In the examples that follow I will assume the dichotomous response variable is coded as 0/1. The important point is to give the level of greatest interest the value that sorts last in ascending order, in this case a 1. How you code the independent categorical variables will also determine the results you observe on the output. By default, most procedures in SAS that include a CLASS statement assign the "largest" coded level (i.e., the one that sorts last in an ascending order) as the reference category; that is, estimated coefficients of all other categories are compared with it. This is important to know if you want to state how females compare to males on the level of the response coded as 1. This implies females should be coded as gender=0 and males as gender=1. Coding gender as character "F" and "M" also works, since "F" precedes "M" when sorting.
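The idea of attaching formats to 0/1 codes can be sketched as follows. The format names yesno and sexfmt are hypothetical, and the example assumes gender is stored numerically as 0/1. The labels were chosen so that they sort in the same order as the underlying codes ('No' before 'Yes', 'Female' before 'Male'), because a procedure that orders levels by formatted value would otherwise pick a different reference level.

```sas
/* Hypothetical format names; any valid names would do. */
proc format;
  value yesno   0 = 'No'      1 = 'Yes';    /* response: 1 is the level of interest */
  value sexfmt  0 = 'Female'  1 = 'Male';   /* gender: 1 sorts last (reference)     */
run;

/* Attach the formats in a DATA step or PROC step with a statement such as: */
/*   FORMAT response yesno. gender sexfmt.;                                  */
```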

With these principles in mind, a very "small" dataset (for illustration only) is shown where 5 women and 4 men, selected randomly, were asked a basic yes/no question where "yes" is the response of greater interest. The final two columns are different ways to dummy-code gender, depending on whether Female or Male is the reference category.

response gender gndrF gndrM
1 F 1 0
1 F 1 0
0 F 1 0
0 F 1 0
0 F 1 0
1 M 0 1
1 M 0 1
1 M 0 1
0 M 0 1
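One way to read these nine observations into SAS is a DATA step with in-line data. This is a minimal sketch; the dataset name b is taken from the DATA=b option used in the procedure calls in this article.

```sas
/* Read the nine-observation example into dataset b. */
data b;
  input response gender $ gndrF gndrM;
  datalines;
1 F 1 0
1 F 1 0
0 F 1 0
0 F 1 0
0 F 1 0
1 M 0 1
1 M 0 1
1 M 0 1
0 M 0 1
;
run;
```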

By aggregating the response where y = Sum(response) is the number of 'yes' responses for each level of gender, the same dataset could also be represented as:

y n gender
2 5 F
3 4 M
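The aggregation itself can be done in several ways; the sketch below uses PROC SQL. It assumes the unit-level data are already in a dataset named b with the variables response and gender, and the output name agg is hypothetical.

```sas
/* Collapse the 0/1 responses to one row per gender:         */
/* y = number of 'yes' responses, n = number of subjects.    */
proc sql;
  create table agg as
  select sum(response) as y,
         count(*)      as n,
         gender
  from b
  group by gender;
quit;
```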

PROC GENMOD. GENMOD is an acronym for "GENeralized linear MODels." Its function and application are similar to those of PROC GLM (general linear model) for ANOVA models; however, GENMOD allows the response variable to assume distributions such as binomial (for dichotomous data) and Poisson (counts). This is the basic syntax for logistic regression:

PROC GENMOD DATA=b descending;
CLASS gender;
MODEL response = gender / dist=binomial link=logit
<options> ;
RUN;

PROC LOGISTIC. LOGISTIC is designed for analysis of dichotomous and ordinal data (e.g., Likert scales) and now includes a CLASS statement. This is the most basic syntax for logistic regression:

PROC LOGISTIC DATA=b descending;
CLASS gender/PARAM=ref;
MODEL response = gender / <options>;
RUN;

A common feature of GENMOD and LOGISTIC is the descending option on the PROC statement, which means for response data coded 0/1, SAS will analyze the probability of a response of '1' rather than the default level of '0'. This option is an essential feature to recognize when interpreting the sign of estimated coefficients. The MODEL statements in the two PROCs look similar, although the choices for options in each are quite different. In GENMOD the "logit" link is the default value for the binomial distribution--just be sure to enter "logit" (and not another valid choice, "log") in order to produce the same results as LOGISTIC. Both procedures allow you to enter aggregated data with the notation y/n for the response (y subjects answered "Yes" in n trials); with this coding the "descending" option no longer applies.
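The aggregated (events/trials) form mentioned above might look like the following sketch. The dataset name agg is hypothetical; y and n are the aggregated counts shown earlier. No descending option is given, because with y/n the procedures model the probability of the y events.

```sas
/* Aggregated data: y 'yes' responses out of n subjects per gender. */
data agg;
  input y n gender $;
  datalines;
2 5 F
3 4 M
;
run;

proc genmod data=agg;
  class gender;
  model y/n = gender / dist=binomial link=logit;
run;

proc logistic data=agg;
  class gender / param=ref;
  model y/n = gender;
run;
```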

Perhaps the greatest potential confusion between the two procedures is found in the CLASS statement. GENMOD applies 0/1 coding of categorical data (like PROC GLM), treating the 'highest formatted level' (or numeric, if unformatted) as the reference category. By default, LOGISTIC applies 'effects' coding to classification factors, that is, -1/1 values. For a two-level factor such as gender, this means the estimated coefficient will be 1/2 the magnitude produced with GENMOD, which is the reason for PARAM=ref on the CLASS statement. If you follow the steps outlined above, you will arrive at the same results regardless of which procedure you use:

Analysis of Parameter Estimates (GENMOD & LOGISTIC)

Parameter    Estimate
Intercept      1.0986
gender        -1.5041
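Because this model has one parameter per gender, the estimates can be checked by hand from the sample proportions (2 of 5 women and 3 of 4 men answered yes). The sketch below does the arithmetic in a DATA step; the variable names are hypothetical. With Male as the reference category, the coefficient for Female has magnitude 1.5041 and a negative sign, since women were less likely than men to answer yes. The same step shows the halved coefficient that effects coding would give.

```sas
/* Hand check of the estimates; results are written to the log. */
data _null_;
  logit_f = log((2/5) / (3/5));     /* log odds of 'yes' for women: -0.4055 */
  logit_m = log((3/4) / (1/4));     /* log odds of 'yes' for men:    1.0986 */

  /* Reference (0/1) coding with Male as the reference category */
  intercept  = logit_m;             /*  1.0986 */
  gender_f   = logit_f - logit_m;   /* -1.5041 */
  odds_ratio = exp(gender_f);       /*  0.2222, women vs. men */

  /* Effects (-1/1) coding: Female = 1, Male = -1 */
  eff_intercept = (logit_f + logit_m) / 2;   /*  0.3466 */
  eff_gender    = (logit_f - logit_m) / 2;   /* -0.7520 */

  put intercept= gender_f= odds_ratio= eff_intercept= eff_gender=;
run;
```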

As shown above, one very nice feature of the CLASS statement in LOGISTIC is the ability to change the default parameterization to match GENMOD and also to specify your choice of the reference category:

CLASS gender(REF=first) / PARAM=ref;

This option is of value if you code gender as "F" and "M" and you want to compare Males to Females. Without REF=first attached to gender, LOGISTIC would compare Females to Males.
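The reference level can also be named explicitly rather than by position. The following is a sketch assuming dataset b with gender coded "F"/"M" as in the example data. With Female as the reference, the intercept becomes the log odds for women (about -0.4055) and the gender coefficient changes sign (about 1.5041 for men compared with women).

```sas
/* Same model, with Female named explicitly as the reference category. */
proc logistic data=b descending;
  class gender(ref='F') / param=ref;
  model response = gender;
run;
```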

NLMIXED. A third way to compute logistic regression is the procedure NLMIXED. However, its use assumes knowledge of maximum likelihood estimation. It also requires that you numerically code all categorical data with 0/1 values (with the DATA step or with PROC GLMMOD) as it does not currently have a CLASS statement. Although it has a built-in binomial function, it is more instructive to observe how you can write statements to analyze dichotomous data directly with the likelihood equation:

PROC NLMIXED DATA=b;
PARMS intercept = -.1 _gender = -.1;
eta = intercept + ( _gender * gndrF );
prb_1 = exp(eta) / (1+exp(eta));
liklhd = (prb_1**response) * ((1-prb_1)**(1-response));
loglik = LOG(liklhd);
MODEL response ~ general(loglik);

RUN;

The equation for eta is the linear predictor, a function of the intercept and dummy-coded gender (here, as with GENMOD and LOGISTIC, Males are treated as the reference category). Eta enters the formula to compute prb_1, the probability that response=1; this formula has the same effect as the descending option discussed above. The log-likelihood is obtained from the probability equation and is then maximized to compute the parameter estimates (the same as the estimates produced by GENMOD and LOGISTIC, as shown above).
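For comparison, the same model can be written with the built-in distribution instead of a hand-coded likelihood. This is a minimal sketch that assumes dataset b with the variables response and gndrF; the starting values are the arbitrary ones used above.

```sas
/* Same logistic model, using the built-in binary distribution. */
proc nlmixed data=b;
  parms intercept = -.1 _gender = -.1;
  eta   = intercept + _gender * gndrF;   /* linear predictor; Male (gndrF=0) is the reference */
  prb_1 = 1 / (1 + exp(-eta));           /* algebraically the same as exp(eta)/(1+exp(eta))   */
  model response ~ binary(prb_1);
run;
```

Both forms should converge to the same estimates, since the binary distribution implies the same log-likelihood that was written out by hand.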

Which Procedure is Best for Your Data?

The answer to this question depends on many factors, but essentially here are the major differences among the three methods we've discussed:

PROC GENMOD presents a unified approach to the analysis of categorical data, including Poisson and Negative Binomial (for counts), gamma, and normally distributed data (though for this distribution, GLM, REG, or MIXED will likely work better). It also handles repeated measures for count data in much the same way as PROC MIXED works with repeated measures for continuous data.

PROC LOGISTIC is designed for regression applications with one response (0/1) collected from each subject or several independent responses aggregated over subjects. It is the procedure designed to compute ROC curves. It can also perform exact logistic regression when you have small sample sizes or 0 counts in some of the cells (a technique that may be of value when given the warning "quasi-complete separation" in the log file).

PROC NLMIXED allows great flexibility in writing program statements. It also handles random effects models for count data and allows you to enter formulas for non-standard probability distributions (much like the dichotomous data example). One of its most useful features is the ability to compute zero-inflated models. In this situation a large number of legitimate zeros appear in your dataset, making overdispersion a concern. PROC GENMOD has a scale = option for approaching this problem, although a zero-inflated model may be a more attractive solution. Note that this situation may be mistaken for censored data (i.e., the value is an upper or lower bound), such as Tobit models. Legitimate zeros and censored observations are two very different problems and NLMIXED allows you to treat them as such (see Long, Chapters 7 and 8).

References

  1. Hosmer, D. W. and Lemeshow, S. (2002) Applied Logistic Regression, 2nd Ed. N.Y.: John Wiley & Sons.
  2. Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
  3. Articles on data analysis issues: http://darkwing.uoregon.edu/~robinh/analysis.html

Spring 2005 Computing News | Computing Center Home Page