
The Analysis of Variance is Changing! Meet the New Kid on the Block: PROC MIXED

Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu

Statistical analysis would be so simple if only a researcher didn't need to deal with all those pesky assumptions! For example, whenever you run an independent groups analysis of variance (ANOVA) you should always check three very important assumptions:

  1. Observations are independent
  2. Residuals computed from the group means are normally distributed
  3. Residuals have equal variances across groups

The first assumption is generally met from the design, that is, by subjects randomly assigned to two or more groups.

The second assumption focuses on normality of the residuals, not the observations themselves.

The third assumption implies there is equal "spread" of the residuals, as measured by the pooled variance. Side-by-side boxplots or hilo plots for each group are particularly helpful methods for visually detecting violations in this third assumption and should always be among the initial steps of an ANOVA. Fortunately, ANOVA stands up well to minor violations of assumptions 2 and 3; however, it is not a good choice for data analysis if assumption 1 is not met.

This article introduces you to a relatively new SAS procedure called PROC MIXED, which for many applications is destined to replace PROCs TTEST and GLM. One noteworthy advantage over these two older procedures is the provisions it has for data analysis when assumptions 1 and 3 are not met. It is designed to analyze continuous or interval level response data collected across one or more classification factors which for this article will be assumed to have "fixed effects," that is, all levels of interest from each factor are included in the study.

The name MIXED implies it works with both fixed and random effects, but an explanation of how it handles the latter is reserved for future articles. For now, be aware that with only a few exceptions, everything PROCs TTEST and GLM can do, PROC MIXED can do just as well--and in many situations it does it much better.

Mixed Model Analysis

To demonstrate how PROC MIXED works, we start with a simple case: a two-sample t-test on a continuous response variable collected from persons randomly chosen from a much larger population of subjects. A t-test is a special case of ANOVA with subjects randomly selected from two groups and one observation taken on each subject. Here is a small example dataset of test scores from unequal numbers of male and female subjects selected at random:

PROC FORMAT;
VALUE gnd 0=' ' 1='Female' 2='Male' 3=' ';
RUN;

DATA scores;
INPUT gender score @@;
DATALINES;
1 75 1 76 1 80 1 75 1 78 1 77 1 73 1 72 1 74 1 71
2 82 2 83 2 85 2 78 2 77 2 87 2 86
;

The first step is to visually show the data with an informative plot. This will help determine if unequal variances across the groups exist, or if severe outliers are present. Details on how the following plotting statements are constructed can be found at http://cc.uoregon.edu/cnews/spring2001/sasgraphics.html

GOPTIONS reset=all cback=white;
SYMBOL interpol=hiloj value=dot Height=1
  color=black Line=5 Width=1 Repeat=2 ;
PROC GPLOT DATA=scores;
PLOT score*gender / Noframe
  haxis=0 to 3 by 1 hminor=0
  vaxis=70 to 90 by 4 vminor=3 ;
TITLE H=2 "Test Scores";
FORMAT gender gnd. ;
RUN; QUIT;
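The side-by-side boxplots mentioned earlier could be sketched as follows. This is a minimal example, not part of the original article: it assumes a SAS release that includes PROC SGPLOT (9.2 or later) and reuses the scores dataset and the gnd. format defined above.

```sas
/* Assumption: PROC SGPLOT is available (SAS 9.2 or later).        */
/* Draws one box of score for each level of gender, side by side.  */
PROC SGPLOT DATA=scores;
  VBOX score / CATEGORY=gender;
  FORMAT gender gnd. ;
  TITLE "Test Scores by Gender";
RUN;
```

Boxes of noticeably different height, or points flagged beyond the whiskers, would suggest looking more closely at assumption 3.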

Important summary statistics from PROC TABULATE are output in table form.

PROC TABULATE DATA=scores NOseps ;
CLASS gender; VAR score;
TABLE gender,
  score*(n*f=3.0 min*f=4.0 max*f=4.0
  mean*f=6.2 var*f=7.3) /
  rts=12 BOX='Summary Statistics';
FORMAT gender gnd. ;
RUN;

Summary Statistics
                      score
gender      N   Min   Max    Mean      Var
Female     10    71    80   75.10    7.656
Male        7    77    87   82.57   14.952

The sample variances of the two gender groups differ. But is the difference severe enough to pursue an unequal variance model? We'll check that later; for now, given the small sample sizes, we'll treat equal variances in the two groups as a reasonable working assumption. Rather than applying PROC TTEST or PROC GLM to test the null hypothesis that the two means are equal, the following PROC MIXED statements produce the same results:

PROC MIXED DATA=scores NoItPrint ORDER=internal;
CLASS gender;
MODEL score = gender / solution DDFM=bw;
ESTIMATE 'Female - Male' gender 1 -1 / cl;
LSMEANS gender / diff cl;
FORMAT gender gnd. ;
RUN;

This example actually demonstrates three equivalent approaches PROC MIXED offers to test two sample means for equality: the “solution” option on the MODEL statement, the ESTIMATE statement, and the LSMEANS statement. The edited output contains this information:

Class Level Information
Class   Levels   Values
gender   2       Female Male

The residual variance of the model is found under the Covariance Parameter Estimates. Note that the standard deviation output by PROC TTEST is equal to the square root of the Estimate (e.g., 3.2518=SQRT(10.5743)).

Covariance Parameter Estimates

Cov Parm     Estimate

Residual      10.5743
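As a rough check on this estimate: with one fixed factor and equal variances assumed, the residual variance should match the familiar pooled variance. The symbols below are notation introduced here for illustration, with n_1, s_1^2 for Female and n_2, s_2^2 for Male taken from the Summary Statistics table.

```latex
\[
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
      = \frac{9(7.656) + 6(14.952)}{15}
      \approx 10.574,
\qquad
s_p = \sqrt{10.5743} \approx 3.2518
\]
```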

Since we are testing the means of a "fixed" effect, results for the t-test are read from the output produced by the MODEL statement. 'Male' is the reference category, so Estimate = -7.4714 is the difference between the two gender means (Female minus Male), producing a t-value of -4.66 (F-value = 21.74 = (-4.66)^2, up to rounding) with 15 degrees of freedom and a p-value of 0.0003:

Solution for Fixed Effects
Effect   gender   Estimate   Standard Error   t Value   Pr > |t|
gender   Female    -7.4714           1.6025     -4.66     0.0003
gender   Male            0                .         .          .

Type 3 Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
gender        1       15     21.74   0.0003
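The numbers in these two tables can be reproduced by hand from the residual variance of 10.5743 and the group sizes of 10 and 7. The worked arithmetic below is a sketch under the equal variance assumption; the symbols (SE, n_1, n_2, s_p^2) are notation introduced here, not SAS output labels.

```latex
\[
\mathrm{SE} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
            = \sqrt{10.5743\left(\frac{1}{10} + \frac{1}{7}\right)}
            \approx 1.6025
\]
\[
t = \frac{-7.4714}{1.6025} \approx -4.662,
\qquad
F = t^2 \approx 21.74,
\qquad
\mathrm{df} = n_1 + n_2 - 2 = 15
\]
```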

The ESTIMATE statement also produces a p-value which tests the null hypothesis for equality of the gender means. How gender is coded (as printed in the Class Level Information table) defines how to place 1 and -1 on the ESTIMATE statement. By default, SAS orders levels of classification factors in alphabetical order of their formats (or in their numerical order when coded as numbers without a FORMAT or with option ORDER=internal). The two coefficients indicate to take the mean for Females and subtract the mean for Males to get the computed difference (Estimate= -7.4714). The cl option produces 95% confidence intervals for the difference of the two sample means.

Output from the Estimate Statement
Label           Estimate   Standard Error   DF   t Value   Pr > |t|      Lower     Upper
Female - Male    -7.4714           1.6025   15     -4.66     0.0003   -10.8871   -4.0558
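The confidence limits follow from the estimate, its standard error, and the 0.975 quantile of the t distribution with 15 degrees of freedom (approximately 2.1314). The calculation below is an illustrative check, not SAS output, and agrees with the printed limits up to rounding.

```latex
\[
-7.4714 \pm t_{0.975,\,15} \times 1.6025
  = -7.4714 \pm 2.1314 \times 1.6025
  \approx -7.4714 \pm 3.4156
  \;\Rightarrow\; (-10.887,\; -4.056)
\]
```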

The same data coding features apply when working with the Least Squares Means and its associated table of differences produced by the LSMEANS statement with the option diff. It provides a table of the Least Squares Means (listed under the Estimate column). In this example they are the same as the actual means; however, with unbalanced data and more complicated designs this will not necessarily be the result. Their standard errors and respective confidence limits are also printed. P-values also appear on the output for LSMEANS; they usually are of little or no interest unless you want to test whether the means are different from zero.

Least Squares Means
Effect gender Estimate Standard Error Lower Upper
gender Female 75.1000 1.0283 72.9082 77.2918
gender Male 82.5714 1.2291 79.9517 85.1911
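Under the equal variance model, each standard error in this table is the square root of the residual variance divided by the group size. The check below uses notation introduced here for illustration (n_F = 10 females, n_M = 7 males).

```latex
\[
\mathrm{SE}_{\text{Female}} = \sqrt{\frac{10.5743}{n_F}} = \sqrt{\frac{10.5743}{10}} \approx 1.0283,
\qquad
\mathrm{SE}_{\text{Male}} = \sqrt{\frac{10.5743}{n_M}} = \sqrt{\frac{10.5743}{7}} \approx 1.2291
\]
```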

What is of the greatest interest is the difference between the two means. On the output for the table of differences, observe two columns labeled gender and _gender. Female on the line below gender indicates you substitute the lsmean for females, and Male below _gender indicates you substitute the lsmean for males. When you subtract them, the output shows the difference in the two means of -7.4714 (printed under the Estimate column). The standard error of the difference, the t-value, p-value, and 95% confidence limits are also printed. The t-value is the difference between the two means divided by the standard error of that difference.

Differences of Least Squares Means
Effect   gender   _gender   Estimate   Standard Error   DF   t Value   Pr > |t|      Lower     Upper
gender   Female   Male       -7.4714           1.6025   15     -4.66     0.0003   -10.8871   -4.0558

The purpose of this simple example is to show how the output from the Solution for Fixed Effects, the ESTIMATE statement, and the LSMEANS statement all produce results identical to the output obtained with PROC TTEST or PROC GLM (assuming equal variances). In more complicated designs these statements have more specialized functions. The MODEL statement defines the model and produces regression-like coefficient estimates as well as tests for the fixed effects. The ESTIMATE statement allows you to estimate linear combinations of two or more group means or to probe complex interactions. The LSMEANS statement produces all pair-wise comparisons of means. Combined with the Output Delivery System, it is an extremely helpful tool when comparing differences in means.
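For readers who want to confirm the equivalence themselves, the two older procedures could be run on the same data roughly as follows. This is a sketch that reuses the scores dataset and gnd. format from above; the options chosen here are one reasonable way to mirror the PROC MIXED run, not the only one.

```sas
/* Two-sample t-test; compare the equal variance (Pooled) row of   */
/* the output with the PROC MIXED results above.                   */
PROC TTEST DATA=scores;
  CLASS gender;
  VAR score;
  FORMAT gender gnd. ;
RUN;

/* One-way ANOVA with the same contrast and least squares means.   */
PROC GLM DATA=scores ORDER=internal;
  CLASS gender;
  MODEL score = gender / solution;
  ESTIMATE 'Female - Male' gender 1 -1;
  LSMEANS gender / pdiff cl;
  FORMAT gender gnd. ;
RUN; QUIT;
```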

If the variances across groups are unequal…

A statistical test for the equality of means which involves unequal variances (different from the one produced with TTEST) can also be requested with the above statements by changing the MODEL statement option for the degrees of freedom computation to DDFM=satterthwaite and adding the following after the MODEL statement:

REPEATED / GROUP=gender;

The test for equal variances involves comparing the AIC values from the Fit Statistics tables of two runs, one with and one without this REPEATED statement. How this test works and how to account for heterogeneous variances with PROC MIXED can be found at http://www.uoregon.edu/~robinh/mixed_sas.html. This example serves as a good starting point to understand how to embellish the PROC MIXED statements to handle a wide variety of complicated designs containing both fixed and random effects.
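Put together, the two runs might look like the sketch below. The dataset names fit_equal, fit_unequal, and fit_compare and the variable model are hypothetical names chosen for illustration; the ODS OUTPUT statements simply save each Fit Statistics table so the AIC rows can be printed together.

```sas
/* Run 1: equal variances (no REPEATED statement).                 */
ODS OUTPUT FitStatistics=fit_equal;
PROC MIXED DATA=scores NoItPrint ORDER=internal;
  CLASS gender;
  MODEL score = gender / solution DDFM=bw;
  FORMAT gender gnd. ;
RUN;

/* Run 2: a separate residual variance for each gender group.      */
ODS OUTPUT FitStatistics=fit_unequal;
PROC MIXED DATA=scores NoItPrint ORDER=internal;
  CLASS gender;
  MODEL score = gender / solution DDFM=satterthwaite;
  REPEATED / GROUP=gender;
  ESTIMATE 'Female - Male' gender 1 -1 / cl;
  LSMEANS gender / diff cl;
  FORMAT gender gnd. ;
RUN;

/* Stack the two Fit Statistics tables and label each run.         */
DATA fit_compare;
  LENGTH model $ 8;
  SET fit_equal(IN=eq) fit_unequal;
  model = IFC(eq, 'equal', 'unequal');
RUN;

PROC PRINT DATA=fit_compare NOOBS;
RUN;
```

PROC MIXED labels AIC as "smaller is better", so the run with the lower AIC is the one the criterion favors.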

Mastery of these basic features gives you the background needed to graduate to more complex analyses such as repeated measurements, combinations of fixed and random effects, or hierarchical linear models, to name just a few.

PROC MIXED has been called the best thing to happen in statistical computing since sliced bread. In fact, "slice" is a helpful option for the LSMEANS statement.

Information Resources

  1. Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996). SAS System for Mixed Models, SAS Institute, Cary, NC.
  2. McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991). "A unified approach to mixed linear models," The American Statistician, 45:54-64.
  3. Web articles:

Winter 2006 Computing News | Computing Center Home Page