
Should You Be Using PROC MIXED to Analyze Continuous Data?

Here are some examples to help you determine the answer...

Robin High
robinh@darkwing.uoregon.edu

SAS continues to enhance existing products and develop new ones to fulfill data analysis needs as computational technology advances. Recently, the next level of sophistication has become available with PROC MIXED, a statistical procedure that even ardent SAS users may know very little about. PROC MIXED offers many features to work around assumptions in your data that are often overlooked or ignored. The purpose of this article is to show you how ANOVA, GLM, MIXED (all SAS) and UNIANOVA (SPSS) are alike, and to demonstrate how PROC MIXED is superior.

Suppose you set up six greenhouse benches as blocks for a plant-growing experiment. Within each block, you grew each of four varieties of a house plant, and after a given length of time you measured the plant heights in centimeters:

Plant Heights (cm)

                    Plant Variety
Greenhouse Bench    1       2       3       4
1                   19.8    21.9    16.4    14.7
2                   16.7    19.8    15.4    13.5
3                   17.7    21.0    14.8    12.8
4                   18.2    21.4    15.6    13.7
5                   20.3    22.1    16.4    14.6
6                   15.5    20.8    14.6    12.9

This is an example of a randomized complete block design where the blocking factor (bench) has six levels and the treatment factor (variety) has four levels. Variety is considered to be a "fixed" effect in this analysis because the levels used are the only ones of interest. Bench (block) is defined as a random effect because it is a component of the design structure and reflects a random outcome that many similar experiments would produce. (A random factor occurs when its specific levels could be replaced by other equally acceptable levels without changing the research questions or the conclusions.) A two-way analysis of variance could be performed on the data with the following SAS code. This example briefly reviews how to use PROC GLM.

DATA heights;
LABEL bench='Greenhouse Bench' variety='Plant Variety';
INPUT bench variety height @@;
DATALINES;
1 1 19.8 1 2 21.9 1 3 16.4 1 4 14.7
2 1 16.7 2 2 19.8 2 3 15.4 2 4 13.5
3 1 17.7 3 2 21.0 3 3 14.8 3 4 12.8
4 1 18.2 4 2 21.4 4 3 15.6 4 4 13.7
5 1 20.3 5 2 22.1 5 3 16.4 5 4 14.6
6 1 15.5 6 2 20.8 6 3 14.6 6 4 12.9
;
PROC GLM data=heights;
CLASS bench variety;
MODEL height= bench variety;
RANDOM bench;
LSMEANS variety / stderr pdiff;
TITLE "Analysis of Variance with PROC GLM";
RUN;

The PROC GLM statement invokes the General Linear Model procedure. The CLASS statement specifies that bench and variety are classification variables in the model. SAS automatically creates dummy (indicator) variables corresponding to all distinct levels of bench and variety. The MODEL statement specifies the response (dependent) variable "height" to the left of the equal sign (=). It then lists both factors as explanatory variables, with the RANDOM statement specifying that bench is considered a random effect. The LSMEANS statement computes least squares means for "variety"; the "stderr" option prints their standard errors and the "pdiff" option prints p-values for all pairwise differences. (Note: least squares means are important to use with unbalanced data; when the data are balanced, they are the same as averages for varieties across benches.)

In this example the two explanatory variables are "bench" and "variety," and they comprise the main effects of interest. An edited version of the SAS output appears below:

Class Level Information

Class Levels Values
Bench 6 1 2 3 4 5 6
Variety 4 1 2 3 4

Number of observations 24

Dependent Variable: height

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               8         208.33167        26.0415      59.85    <.0001
bench               5          19.79333         3.9587       9.10    0.0004
variety             3         188.53833        62.8461     144.44    <.0001
Error              15           6.52667         0.4351
Corrected Total    23         214.85833

The residual mean square from this model is $\sigma^2 = 0.4351$. Notice that the statistical test for differences among the levels of variety is highly significant (F = 144.44 with 3 and 15 degrees of freedom, p-value < .0001). This implies that the average plant height for at least one of the varieties is significantly different from the others.

variety height LSMEAN Standard Error Pr > | t | LSMEAN Number
1 18.0333 0.26929 <.0001 1
2 21.1667 0.26929 <.0001 2
3 15.5333 0.26929 <.0001 3
4 13.7 0.26929 <.0001 4

The column of p-values (Pr > |t|) is computed using a standard error of a treatment mean equal to

$$\mathrm{SE}(\mu_i) = \sqrt{\frac{\sigma^2}{k}} = \sqrt{\frac{0.4351}{6}} = 0.2693$$

where k = 6 is the number of blocks (benches). These p-values are usually of little practical value in hypothesis testing, since they test whether the average plant height for each variety equals 0.

It is very useful to test pairwise differences of interest between types of plant (variety). Since the F-test for variety is significant, one approach to determine where the significant differences lie is to compute p-values of pairwise differences. With a balanced design, the standard error of a difference between any two means is:

$$\mathrm{SE}(\mu_i - \mu_j) = \sqrt{\frac{2\sigma^2}{k}} = \sqrt{\frac{2(0.4351)}{6}} = 0.3808$$

µi and µj are the estimated means for two specified levels of variety, k is the number of blocks, and t is the number of treatments; the residual mean square $\sigma^2$ has (k - 1)(t - 1) = 15 degrees of freedom. PROC GLM prints the following symmetric matrix of p-values, with the four levels of variety indexed by i = 1 to 4 and j = 1 to 4, using the LSMEAN number given in the final column of the table above.

Dependent Variable: height

i/j    1         2         3         4
1                <.0001    <.0001    <.0001
2      <.0001              <.0001    <.0001
3      <.0001    <.0001              0.0002
4      <.0001    <.0001    0.0002

When interpreting multiple p-values, you should use a smaller critical value (depending on the number of pairwise differences of interest) as the cutoff to get the desired overall alpha level (usually 0.01 or 0.05). You can reach similar conclusions with one of the many multiple comparison procedures. (The results shown in this GLM example are identical to the output from the SPSS UNIANOVA procedure.)
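
As a rough illustration of the adjustment described above: four varieties give six pairwise comparisons, so a Bonferroni-style cutoff for an overall 0.05 level would be 0.05/6, or about 0.0083. The largest unadjusted p-value in the matrix (0.0002) is still below that. The sketch below asks PROC GLM for Tukey-adjusted p-values instead; it assumes the "heights" data set created earlier, and the choice of Tukey rather than another method is an assumption made only for this example.

```sas
/* Sketch: multiplicity-adjusted pairwise comparisons of variety.    */
/* Assumes the "heights" data set from the DATA step shown earlier.  */
PROC GLM DATA=heights;
  CLASS bench variety;
  MODEL height = bench variety;
  /* ADJUST=TUKEY replaces the unadjusted p-values with              */
  /* Tukey-adjusted ones; ADJUST=BON would request Bonferroni.       */
  LSMEANS variety / PDIFF ADJUST=TUKEY;
  TITLE "Tukey-adjusted pairwise comparisons of variety";
RUN;
```

The adjusted p-values can then be compared directly with the desired overall alpha level, with no need to shrink the cutoff by hand.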

But is this really the complete picture? One important aspect of PROC GLM and UNIANOVA is that neither treats the random effects properly in the computation of expected mean squares, as the following results will now show. The same data analysis can be performed with PROC MIXED as follows:

PROC MIXED data=heights;
CLASS bench variety;
MODEL height = variety;
RANDOM bench;
LSMEANS variety / diff ;
TITLE "Analysis of Variance with PROC MIXED";
RUN;

This set of statements for PROC MIXED looks nearly the same as for PROC GLM. The PROC MIXED statement invokes the procedure to analyze the data from the SAS data set called "heights." The CLASS statement creates dummy variables associated with the bench and variety effects. Variety appears in the MODEL statement as a fixed effect, and the random effect for bench now appears only in the RANDOM statement. The LSMEANS statement computes least squares means, and the "diff" (not "pdiff") option computes all pairwise differences. Even though the coding is nearly the same as for the GLM procedure, the output from PROC MIXED is structured quite differently, as indicated by the following edited portions:

Covariance Parameter Estimates (REML)

Cov Parm    Estimate
Bench       0.8809    [ = $\sigma^2_{blk}$ ]
Residual    0.4351    [ = $\sigma^2$ ]

The "Covariance Parameter Estimates" table displays the estimates of the variance components, including the variance due to "Bench" ($\sigma^2_{blk} = 0.8809$) and the "Residual" error term ($\sigma^2 = 0.4351$), which is the same value computed with PROC GLM.
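
The bench variance component can also be recovered by hand from the PROC GLM mean squares. This is a sketch that relies on the design being balanced (and on the estimate being positive), in which case the REML estimate agrees with the method-of-moments estimate; t = 4 is the number of varieties.

```latex
% Expected mean square for bench in a balanced randomized complete block design
E[\mathrm{MS}_{bench}] = \sigma^2 + t\,\sigma^2_{blk}

% Solve for the block variance and plug in the GLM mean squares
\sigma^2_{blk} = \frac{\mathrm{MS}_{bench} - \mathrm{MS}_{error}}{t}
              = \frac{3.9587 - 0.4351}{4}
              = 0.8809
```

This matches the "Bench" estimate reported by PROC MIXED.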
The "Type 3 Tests of Fixed Effects" table displays significance tests for the one fixed effect (variety) listed in the MODEL statement. This table is analogous to the ANOVA table from GLM:

Type 3 Tests of Fixed Effects

Effect     Num DF    Den DF    F Value    Pr > F
variety    3         15        144.44     <.0001

The degrees of freedom, F-statistic, and p-value for variety are identical to those produced by the GLM procedure. However, PROC MIXED uses restricted maximum likelihood (REML) estimation, which is based on normal distribution theory, and therefore neither computes nor displays sums of squares as PROC GLM does. But what happens with the least squares means?

Least Squares Means

Effect variety Estimate Standard Error DF t Value Pr > |t|
variety 1 18.0333 0.4683 8.53 38.51 <.0001
variety 2 21.1667 0.4683 8.53 45.20 <.0001
variety 3 15.5333 0.4683 8.53 33.17 <.0001
variety 4 13.7000 0.4683 8.53 29.25 <.0001

The numbers in the column labeled "Estimate" are the same as those computed with PROC GLM. However, one noticeable difference is that the column labeled Standard Error is now equal to 0.4683, which is considerably larger than the standard error used by PROC GLM. This is because the variance of a predicted mean needs to account for both the variance due to the random effect of block and the residual:

$$\mathrm{SE}(\mu_i) = \sqrt{\frac{\sigma^2_{blk} + \sigma^2}{k}} = \sqrt{\frac{0.8809 + 0.4351}{6}} = 0.4683$$

PROC MIXED correctly includes the variance component for bench in the estimate of the variance (whereas PROC GLM and UNIANOVA do not). However, the standard error of a difference between two least squares means is the same as that found with PROC GLM, because the mathematics show that the difference between two means does not involve the blocking factor, since all four varieties appear in each block. Thus, inferences for pairwise differences will be the same as with GLM.
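
The cancellation of the bench effect can be seen in a short derivation. The notation here is introduced only for this sketch: $y_{ih}$ is the height of variety i on bench h, $\mu_i$ is the true variety mean, $b_h$ is the random bench effect, $e_{ih}$ is the residual, and $\bar{y}_{i\cdot}$ is the average of variety i over the k = 6 benches.

```latex
% Model for variety i on bench h, with independent random terms
y_{ih} = \mu_i + b_h + e_{ih}, \qquad
b_h \sim N(0, \sigma^2_{blk}), \quad e_{ih} \sim N(0, \sigma^2), \quad h = 1, \dots, k

% A variety mean averages over the k bench effects, so their variance remains
\bar{y}_{i\cdot} = \mu_i + \bar{b} + \bar{e}_{i\cdot}
\quad\Rightarrow\quad
\mathrm{Var}(\bar{y}_{i\cdot}) = \frac{\sigma^2_{blk} + \sigma^2}{k}

% In a difference, both means contain the same \bar{b}, which cancels
\bar{y}_{i\cdot} - \bar{y}_{j\cdot} = (\mu_i - \mu_j) + (\bar{e}_{i\cdot} - \bar{e}_{j\cdot})
\quad\Rightarrow\quad
\mathrm{Var}(\bar{y}_{i\cdot} - \bar{y}_{j\cdot}) = \frac{2\sigma^2}{k}
```

With the estimates above, the first variance gives a standard error of 0.4683 and the second gives 0.3808.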

Differences of Least Squares Means

Effect variety Estimate Standard Error DF t Value Pr > |t|
variety 1 2 -3.1333 0.3808 15 -8.23 <.0001
variety 1 3 2.5000 0.3808 15 6.56 <.0001
variety 1 4 4.3333 0.3808 15 11.38 <.0001
variety 2 3 5.6333 0.3808 15 14.79 <.0001
variety 2 4 7.4667 0.3808 15 19.61 <.0001
variety 3 4 1.8333 0.3808 15 4.81 0.0002
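
One detail of the tables above: the least squares means are reported with 8.53 degrees of freedom, while the differences use 15. Fractional degrees of freedom of this kind are what a Satterthwaite approximation produces, and applying that approximation by hand to the variance of a variety mean gives about 8.53 here. The article does not say which degrees-of-freedom method was used, so the DDFM=SATTERTH option in the sketch below is an inference from the output, not the author's stated code.

```sas
/* Sketch: request Satterthwaite degrees of freedom in PROC MIXED.  */
/* ASSUMPTION: DDFM=SATTERTH is inferred from the fractional DF     */
/* (8.53) in the Least Squares Means table; the original MODEL      */
/* statement shows no options.                                      */
PROC MIXED DATA=heights;
  CLASS bench variety;
  MODEL height = variety / DDFM=SATTERTH;
  RANDOM bench;
  LSMEANS variety / DIFF;
  TITLE "PROC MIXED with Satterthwaite degrees of freedom";
RUN;
```

The differences keep 15 degrees of freedom under this option because their variance involves only the residual component.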

Why Use PROC MIXED Instead of PROC GLM or SPSS UNIANOVA?

Given that these two analyses are virtually the same, why should you even consider using PROC MIXED rather than PROC GLM or UNIANOVA? First, consider the following important assumptions concerning the residuals computed from linear models by the GLM and UNIANOVA procedures:

  1. The residuals are normally distributed.
  2. The residuals are independent of one another.
  3. The residuals have constant variance.

The normality assumption is usually realistic when the original data are measured on a continuous scale and are not highly skewed or contaminated with outliers (a suitable transformation can help). However, the remaining two assumptions are often not satisfied, especially when data are collected over time or space or come from subjects that are somehow related to each other within blocks.

The methods implemented in PROC MIXED are also based on the assumption of normally distributed residuals. However, MIXED is superior to GLM or UNIANOVA for several important reasons, of which I will list only a few:

If you plan to perform repeated measures analysis, you should read the following PDF document comparing PROCs GLM and MIXED: http://www.sas.com/rnd/app/papers/mixedglm.pdf

Data Structure

For repeated measures and other multivariate analyses, PROC GLM and other similar programs require all data for each subject to be listed on a single record, that is, in a horizontal or multivariate format. For example, PROC GLM requires data from a repeated measures study with weights collected on the first two subjects at three points in time to have the following multivariate structure:

Subject weight1 weight2 weight3
1 150 155 158
2 200 212 216

When using PROC MIXED, the data shown above must be structured in a univariate format (i.e., placed in a column):

Subject time weight
1 1 150
1 2 155
1 3 158
2 1 200
2 2 212
2 3 216

A new variable, time, is now included, rather than an index on the variable name. There are several ways to transform data into this structure. Details can be found at http://www.uoregon.edu/~robinh/065appl_trnsp.txt
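
One common way to do this is a DATA step with an array, sketched below. The data set names "wide" and "long" and the array name "w" are made up for this illustration; the variable names and values come from the two tables above.

```sas
/* Sketch: reshape one record per subject into one record per       */
/* subject and time point. Data set names are hypothetical.         */
DATA wide;
  INPUT subject weight1 weight2 weight3;
  DATALINES;
1 150 155 158
2 200 212 216
;

DATA long;
  SET wide;
  ARRAY w{3} weight1-weight3;   /* the three repeated measurements  */
  DO time = 1 TO 3;
    weight = w{time};           /* copy the value for this time     */
    OUTPUT;                     /* write one record per time point  */
  END;
  KEEP subject time weight;
RUN;

PROC PRINT DATA=long;
RUN;
```

The resulting data set has six records with the same subject, time, and weight values as the univariate layout shown above.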


Note: The examples in this article are based on a sample problem in Chapter Two ("Randomized Complete Block Design") of Mixed Models for the Practicing Statistician Using SAS by Linda J. Young and George A. Milliken. © Kansas State University, 1998.


References

Three very important references for further information on these techniques include:

  1. Brown, H. and Prescott, R. (1999). Applied mixed models in medicine. Wiley and Sons.
  2. Jackson, Sally, and Brashers, Dale (1994). Random Factors in ANOVA. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-098. Newbury Park, CA: Sage.
  3. Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996). SAS System for Mixed Models, SAS Institute, Cary, NC.

Summer 2001 Computing News | Computing Center Home Page