Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu
Statistical analysis would be so simple if only a researcher didn't need to deal with all those pesky assumptions! For example, whenever you run an independent groups analysis of variance (ANOVA) you should always check three very important assumptions: (1) independence of the observations, (2) normality, and (3) equal variances across the groups.
The first assumption is generally met by the design, that is, by randomly assigning subjects to two or more groups.
The second assumption focuses on normality of the residuals, not the observations themselves.
The third assumption implies there is equal "spread" of the residuals, as measured by the pooled variance. Side-by-side boxplots or hilo plots for each group are particularly helpful methods for visually detecting violations in this third assumption and should always be among the initial steps of an ANOVA. Fortunately, ANOVA stands up well to minor violations of assumptions 2 and 3; however, it is not a good choice for data analysis if assumption 1 is not met.
This article introduces you to a relatively new SAS procedure called PROC MIXED, which for many applications is destined to replace PROCs TTEST and GLM. One noteworthy advantage over these two older procedures is its provision for data analysis when assumptions 1 and 3 are not met. It is designed to analyze continuous or interval level response data collected across one or more classification factors, which for this article will be assumed to have "fixed effects," that is, all levels of interest from each factor are included in the study.
The name MIXED implies it works with both fixed and random effects, but explanation of how it works with the latter will be reserved for future articles. For now, be aware that with only a few exceptions, everything PROCs TTEST and GLM can do, PROC MIXED can do just as well--and in many situations it does it much better.
To demonstrate how PROC MIXED works, a simple introduction is to compute a two-sample t-test from a continuous response variable collected from persons randomly chosen from a much larger population of subjects. A t-test is a special case of ANOVA with subjects randomly selected from two groups and one observation taken on each subject. Here is a "small" example dataset of test scores from an unequal number of male and female subjects selected at random:
PROC FORMAT;
VALUE gnd 0=' ' 1='Female' 2='Male' 3=' ';
RUN;
DATA scores;
* gender is coded 1=Female and 2=Male;
INPUT gender score @@;
DATALINES;
1 75 1 76 1 80 1 75 1 78 1 77 1 73
1 72 1 74 1 71
2 82 2 83 2 85 2 78 2 77 2 87 2 86
;
RUN;
The first step is to visually show the data with an informative plot. This will help determine if unequal variances across the groups exist, or if severe outliers are present. Details on how the following plotting statements are constructed can be found at http://cc.uoregon.edu/cnews/spring2001/sasgraphics.html
GOPTIONS reset=all cback=white;
SYMBOL interpol=hiloj value=dot Height=1
color=black Line=5 Width=1 Repeat=2 ;
PROC GPLOT DATA=scores;
PLOT score*gender / Noframe
haxis=0 to 3 by 1 hminor=0
vaxis=70 to 90 by 4 vminor=3 ;
TITLE H=2 "Test Scores";
FORMAT gender gnd. ;
RUN; QUIT;
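Side-by-side boxplots, mentioned earlier as a visual check on the equal-variance assumption, are an alternative to the hilo plot. The following is only a minimal sketch: it assumes a SAS release that includes PROC SGPLOT (which is not part of this article's original code) and reuses the scores dataset and the gnd. format defined above.

```sas
* Sketch only: one boxplot of score for each gender group;
PROC SGPLOT DATA=scores;
VBOX score / CATEGORY=gender;
FORMAT gender gnd.;
TITLE "Test Scores by Gender";
RUN;
```

Boxes of clearly different height, or points flagged far from a box, would suggest unequal spread or outliers worth a closer look.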
Important summary statistics from PROC TABULATE are output in table form.
PROC TABULATE DATA=scores NOseps ;
CLASS gender; VAR score;
TABLE gender,
score*(n*f=3.0 min*f=4.0 max*f=4.0
mean*f=6.2 var*f=7.3) /
rts=12 BOX='Summary Statistics';
FORMAT gender gnd. ;
RUN;
Summary Statistics
                 score
Gender     N   Min   Max    Mean      Var
Female    10    71    80   75.10    7.656
Male       7    77    87   82.57   14.952
The variances of the two levels of gender are different. But is the difference severe enough to pursue an unequal variance model? We'll check that later, but for now, because of the small sample sizes, we'll assume that equal variances in the two groups are a reasonable starting point. Rather than applying PROCs TTEST or GLM to compute a t-test of the two means under the null hypothesis of equality, the following commands from PROC MIXED produce the same results:
PROC MIXED DATA=scores NoItPrint ORDER=internal;
CLASS gender;
MODEL score = gender / solution DDFM=bw;
ESTIMATE 'Female - Male' gender 1 -1 / cl;
LSMEANS gender / diff cl;
FORMAT gender gnd. ;
RUN;
Actually, this example demonstrates three equivalent ways PROC MIXED offers to test two sample means for equality: the “solution” option on the MODEL statement, the ESTIMATE statement, and the LSMEANS statement. The edited output contains this information:
Class Level Information
Class Levels Values
gender 2 Female Male
The residual variance of the model is found under the Covariance Parameter Estimates. Note that the pooled standard deviation output by PROC TTEST is equal to the square root of the Estimate (e.g., 3.2518=SQRT(10.5743)).
Covariance Parameter Estimates
Cov Parm Estimate
Residual 10.5743
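The residuals behind this variance estimate are also what the second assumption (normality) refers to. As a hedged sketch of one way to examine them, the model can be rerun with the OUTP= option on the MODEL statement and the saved residuals passed to PROC UNIVARIATE. The dataset name resids is hypothetical, chosen here for illustration only.

```sas
* Sketch only: save residuals from the same model, then examine them;
PROC MIXED DATA=scores NoItPrint ORDER=internal;
CLASS gender;
MODEL score = gender / OUTP=resids;   * OUTP= saves predicted values and residuals;
RUN;

PROC UNIVARIATE DATA=resids NORMAL;   * NORMAL requests tests of normality;
VAR Resid;                            * Resid is the residual variable in the OUTP= dataset;
QQPLOT Resid / NORMAL(MU=EST SIGMA=EST);
RUN;
```

With only 17 observations, formal normality tests have little power, so the Q-Q plot is likely to be the more informative part of this output.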
Since we are testing the means of a "fixed" effect, results for the t-test appear in the output produced by the MODEL statement. 'Male' is the reference category, so Estimate = -7.4714 is the difference between the two gender means (Female minus Male). It produces a t-value of -4.66 (equivalently, F-value = 21.74 = (-4.66)², apart from rounding) with 15 degrees of freedom, resulting in a p-value of 0.0003:
Solution for Fixed Effects
Effect   gender   Estimate   Standard Error   t Value   Pr > |t|
gender   Female    -7.4714           1.6025     -4.66     0.0003
gender   Male            0                .         .          .
Type 3 Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
gender        1       15     21.74   0.0003
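The numbers in these tables can be checked by hand. The worked arithmetic below is only a sketch: the group variances 7.6556 (female, n = 10) and 14.9524 (male, n = 7) were computed from the DATALINES values rather than taken from printed output, and small discrepancies in the last digit reflect rounding.

```latex
\begin{align*}
s_p^2 &= \frac{(10-1)(7.6556) + (7-1)(14.9524)}{10 + 7 - 2}
       = \frac{158.61}{15} \approx 10.574 \\
SE(\bar{y}_F - \bar{y}_M) &= \sqrt{s_p^2\left(\tfrac{1}{10} + \tfrac{1}{7}\right)}
       = \sqrt{10.5743 \times 0.24286} \approx 1.6025 \\
t &= \frac{75.1000 - 82.5714}{1.6025} \approx -4.66,
\qquad F = t^2 \approx 21.74
\end{align*}
```

The pooled variance is the Residual estimate in the Covariance Parameter Estimates table, and the standard error and t-value are those printed in the Solution for Fixed Effects.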
The ESTIMATE statement also produces a p-value which tests the null hypothesis of equality of the gender means. How gender is coded (as printed in the Class Level Information table) determines where to place 1 and -1 on the ESTIMATE statement. By default, SAS orders levels of classification factors in alphabetical order of their formatted values (or in their numerical order when coded as numbers without a FORMAT or with option ORDER=internal). The two coefficients tell SAS to take the mean for Females and subtract the mean for Males to get the computed difference (Estimate = -7.4714). The cl option produces 95% confidence limits for the difference of the two sample means.
Output from the Estimate Statement
Label           Estimate   Standard Error   DF   t Value   Pr > |t|      Lower     Upper
Female - Male    -7.4714           1.6025   15     -4.66     0.0003   -10.8871   -4.0558
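The confidence limits requested by the cl option are the estimate plus or minus a critical value of t times the standard error. As a rough check, and only as a sketch that retypes the printed (rounded) values into a DATA step, the limits can be rebuilt with the TINV function:

```sas
* Sketch only: rebuild the confidence limits from the printed estimate;
DATA _NULL_;
est = -7.4714;              * difference in means, Female minus Male;
se  = 1.6025;               * standard error of the difference;
df  = 15;                   * denominator degrees of freedom;
tcrit = TINV(0.975, df);    * two-sided 95 percent critical value of t;
lower = est - tcrit*se;
upper = est + tcrit*se;
PUT tcrit= lower= upper=;   * values are written to the SAS log;
RUN;
```

The critical value is about 2.13, and the limits should agree with the Lower and Upper columns to within rounding of the inputs.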
The same data coding features apply when working with the Least Squares Means and its associated table of differences produced by the LSMEANS statement with the option diff. It provides a table of the Least Squares Means (listed under the Estimate column). In this example they are the same as the actual means; however, with unbalanced data and more complicated designs this will not necessarily be the result. Their standard errors and respective confidence limits are also printed. P-values also appear on the output for LSMEANS; they usually are of little or no interest unless you want to test whether the means are different from zero.
Least Squares Means
Effect   gender   Estimate   Standard Error     Lower     Upper
gender   Female    75.1000           1.0283   72.9082   77.2918
gender   Male      82.5714           1.2291   79.9517   85.1911
What is of the greatest interest is the difference between the two means. On the output for the table of differences, observe two columns labeled gender and _gender. Female on the line below gender indicates you substitute the lsmean for females, and Male below _gender indicates you substitute the lsmean for males. When you subtract them, the output shows the difference in the two means of -7.4714 (printed under the Estimate column). The standard error of the difference, the t-value, p-value, and 95% confidence limits are also printed. The t-value is the difference between the two means divided by the standard error of that difference.
Differences of Least Squares Means
Effect   gender   _gender   Estimate   Standard Error   DF   t Value   Pr > |t|      Lower     Upper
gender   Female   Male       -7.4714           1.6025   15     -4.66     0.0003   -10.8871   -4.0558
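The tables produced by LSMEANS can also be saved as datasets through the Output Delivery System, which becomes convenient once there are many pairwise differences to sort or filter. A minimal sketch follows. LSMeans and Diffs are the ODS table names PROC MIXED uses for these two tables, while the dataset names lsm_out and diff_out are hypothetical.

```sas
* Sketch only: capture the LSMEANS tables as datasets via ODS;
ODS OUTPUT LSMeans=lsm_out Diffs=diff_out;
PROC MIXED DATA=scores NoItPrint ORDER=internal;
CLASS gender;
MODEL score = gender / DDFM=bw;
LSMEANS gender / diff cl;
FORMAT gender gnd.;
RUN;

PROC PRINT DATA=diff_out NOOBS;
RUN;
```

With only two groups, diff_out holds a single row, the Female minus Male comparison shown above.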
The purpose of this simple example is to show how the output from the Solution for Fixed Effects, the ESTIMATE statement, and the LSMEANS statement all produce results identical to the output obtained with PROC TTEST or PROC GLM (assuming equal variances). In more complicated designs these statements have more specialized functions. The MODEL statement defines the model and produces regression-like coefficient estimates as well as tests for the fixed effects. The ESTIMATE statement allows you to estimate linear combinations of two or more group means or to probe complex interactions. The LSMEANS statement with the diff option produces all pair-wise comparisons of means. Combined with the Output Delivery System, it is an extremely helpful tool when comparing differences in means.
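For readers who want to confirm the equivalence themselves, the same equal-variance comparison can be requested from the two older procedures. This is a sketch for comparison only, reusing the scores dataset and gnd. format from above:

```sas
* Sketch only: the same equal-variance comparison from the older procedures;
PROC TTEST DATA=scores;
CLASS gender;
VAR score;
FORMAT gender gnd.;
RUN;

PROC GLM DATA=scores ORDER=internal;
CLASS gender;
MODEL score = gender / SOLUTION;
LSMEANS gender / PDIFF CL;
FORMAT gender gnd.;
RUN; QUIT;
```

In the PROC TTEST output, the pooled (equal variance) line is the one to compare with the PROC MIXED results. In PROC GLM, the F test for gender corresponds to the Type 3 test shown earlier.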
A statistical test for the equality of means which involves unequal variances (different from the one produced with TTEST) can also be requested with the above statements by changing the MODEL statement option for the degrees of freedom computation to DDFM=satterthwaite and adding the following after the MODEL statement:
REPEATED / GROUP=gender;
The test for equal variances involves comparing the AIC values from the Fit Statistics tables produced by running the model with and without this REPEATED statement. How this test works and how to account for heterogeneous variances with PROC MIXED can be found at http://www.uoregon.edu/~robinh/mixed_sas.html. This example serves as a good starting point to understand how to embellish the PROC MIXED statements to handle a wide variety of complicated designs containing both fixed and random effects.
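Putting the two changes together, the unequal-variance version of the earlier step might look like the following. It is a sketch assembled from the statements already shown, with nothing changed except the DDFM= option and the added REPEATED statement:

```sas
* Sketch only: heterogeneous-variance version of the earlier model;
PROC MIXED DATA=scores NoItPrint ORDER=internal;
CLASS gender;
MODEL score = gender / solution DDFM=satterthwaite;
REPEATED / GROUP=gender;
LSMEANS gender / diff cl;
FORMAT gender gnd.;
RUN;
```

The Covariance Parameter Estimates table should now list a separate residual variance for each gender instead of the single pooled value. The AIC in the Fit Statistics table can then be compared with the AIC from the equal-variance run.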
Mastery of these basic features gives you the background needed to graduate to more complex analyses such as repeated measurements, combinations of fixed and random effects, or hierarchical linear models, to name just a few.
PROC MIXED has been called the best thing to happen in statistical computing since sliced bread. In fact, "slice" is a helpful option for the LSMEANS statement.