Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu
For almost 30 years now, PROC GLM (General Linear Models) has been one of SAS’s most important analysis routines. In addition to being able to analyze unbalanced designs, PROC GLM was designed to compute fixed effects analysis of variance and covariance models in a manner that combines features from both PROC ANOVA (analysis of variance) and PROC REG (regression). Its task is to fit a variety of linear models to experimental or observational data, enabling you to make statistical inferences.
But now a comparatively new procedure called PROC MIXED gives you even more power to analyze a wide variety of analysis of variance and covariance models with balanced or unbalanced datasets. (For a comparison of some of the features of PROC MIXED with PROC GLM, see http://cc.uoregon.edu/cnews/summer2001/procmixed.html )
To illustrate how PROC MIXED works, consider this example drawn from Chapter 41 of the SAS/STAT User’s Guide (see References section at the end of this article), which uses SAS commands to read an unbalanced dataset of children’s heights that was collected from four families in a specified region:
DATA heights;
INPUT family obs gender $ height @@;
CARDS;
1 1 F 67 1 2 F 66 1 3 F 64
1 1 M 71 1 2 M 72
2 1 F 63 2 2 F 63 2 3 F 67
2 1 M 69 2 2 M 68 2 3 M 70
3 1 F 63 3 1 M 64
4 1 F 67 4 2 F 66
4 1 M 67 4 2 M 67 4 3 M 69
;
PROC TABULATE NOseps;
CLASS family gender obs;
VAR height;
TABLE family, gender*obs=' '*height=' '*sum=' '*f=6.0
/ rts=10 BOX=' Heights' misstext=' ';
RUN;
In this example, the response variable height measures the heights (in inches) of 18 individuals who are classified according to their respective family and gender. Results are shown in the following table:

You can also perform a traditional two-way analysis of variance (ANOVA) of these unbalanced data (which produces output similar to what you get with PROC GLM) by using the following PROC MIXED statements:
PROC MIXED DATA=heights NOitprint;
CLASS gender family;
MODEL height = gender family family*gender;
RUN;
In this example, the PROC MIXED statement invokes the procedure and identifies the SAS dataset. The CLASS statement treats both family and gender as classification variables. For this analysis, both gender and family are treated as fixed effects (i.e., gender can only be male or female and these four families are the only levels of interest).
The MODEL statement specifies the continuous response (or dependent) variable, height, which is placed to the left of the equal sign. The explanatory (or independent) variables are listed to the right of the equal sign. In this example the two explanatory variables are categorical, as they represent levels of gender and family; they comprise the main effects of the design. The third term, family*gender, provides a way to test for the interaction between the two main effects.
PROC MIXED computes dummy variables associated with gender, family, and family*gender to construct the n×p design matrix (X) for the linear model. A column of 1’s is also included as the first column of X to model the intercept. By the assumption of independence, the residual matrix, R, is equal to σ²I₁₈, where I₁₈ is an 18×18 identity matrix.
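In matrix form, the fixed effects model described above can be sketched as follows. The symbols y, β, and ε are notation assumed here for illustration; the column count follows from the class levels (2 genders, 4 families, 8 gender-by-family cells).

```latex
\[
  \mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
  \qquad
  \boldsymbol{\varepsilon} \sim N(\mathbf{0}, R),
  \qquad
  R = \sigma^2 I_{18}
\]
% Columns of X: 1 (intercept) + 2 (gender) + 4 (family) + 8 (family*gender) = 15.
% X is overparameterized: all 8 gender-by-family cells are observed,
% so rank(X) = 8 and the residual degrees of freedom are
\[
  n - \operatorname{rank}(X) = 18 - 8 = 10
\]
```

The value 10 agrees with the denominator degrees of freedom that PROC MIXED reports for the fixed effects tests in Table 3.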
The RUN; statement completes the list of statements for this procedure. The coding looks very similar to PROC GLM statements. However, the output from PROC MIXED takes on a somewhat different look than that produced by PROC GLM:
| Class | Levels | Values |
| FAMILY | 4 | 1 2 3 4 |
| GENDER | 2 | F M |
Table 1: Class Level Information
The “Class Level Information” in Table 1 lists the values of all classification variables specified in the CLASS statement. It also tells you the order in which SAS interprets the levels of categorical variables. Check this table to make sure that the data have been coded and read in to the SAS dataset correctly.
| Cov Parm | Estimate |
| Residual | 2.10000000 |
Table 2: Covariance Parameter Estimates (REML)
The “Covariance Parameter Estimates” in Table 2 display the estimate of the residual variance, σ², for the model (analogous to the Mean Square Error from PROC GLM).
| Source | NDF | DDF | Type III F | Pr > F |
| GENDER | 1 | 10 | 17.63 | 0.0018 |
| FAMILY | 3 | 10 | 5.90 | 0.0139 |
| FAMILY*GENDER | 3 | 10 | 2.89 | 0.0889 |
Table 3: Tests of Fixed Effects
Table 3 shows the Type III significance tests for the fixed effects listed in the MODEL statement. The Type III F-statistics and p-values are the same as those produced by the Type III analysis from PROC GLM. However, because PROC MIXED applies a likelihood-based estimation scheme by default, it does not compute or display sums of squares or mean squares.
The Type III test for the interaction effect, family*gender, is not significant at the 5% level, but the tests for both main effects are significant. As long as you don’t enter a RANDOM or REPEATED statement, the analyses from both PROC MIXED and PROC GLM will agree if the model contains only fixed effects.
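For comparison, the same fixed effects model can be fitted with PROC GLM. The following is a minimal sketch that assumes the heights dataset created earlier; the SS3 option simply limits the output to the Type III analysis.

```sas
/* Fixed effects two-way ANOVA with PROC GLM, for comparison with PROC MIXED */
PROC GLM DATA=heights;
CLASS gender family;
/* SS3 requests only the Type III sums of squares and F tests */
MODEL height = gender family family*gender / SS3;
RUN;
QUIT;
```

The Type III F tests in this output should agree with Table 3, and the error mean square plays the role of the residual variance estimate in Table 2.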
The assumptions concerning the residuals from an analysis of variance include normality, independence, and constant variance.
The normality assumption is probably realistic in this example since observed heights are measured on a reasonably continuous scale and the absolute lower limit of 0 is not even close to the smallest values observed. However, since the data were collected from clusters (families), it is very likely that the heights from members of the same family are positively correlated with each other (i.e., they should not be treated as independent observations).
The methods implemented in PROC MIXED are also based on the assumption of normally distributed data; however, the assumption of independence can be modified by modeling statistical correlation in a variety of ways. You can also work with heterogeneous variances, that is, variances that are not constant across the groups. I will examine these two features with the REPEATED statement in the fall issue of Computing News.
For the children’s height data, one of the simplest ways to model the correlation is through inclusion of a random effect. Since the four families were actually selected at random from a large population, the effect for family is assumed to be a random variable that is normally distributed with zero mean and some unknown variance, σ_f². Declaring family as a random effect in this model sets up a common correlation among all heights measured from the same family.
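The way a random family effect induces a common within-family correlation can be sketched with a simplified model that contains only gender as a fixed effect and family as a random effect. The subscripts (i for gender, j for family, k for child) and the symbols μ, α, f, and e are notation assumed here, and the family effects and residuals are taken to be independent of each other.

```latex
\[
  y_{ijk} = \mu + \alpha_i + f_j + e_{ijk},
  \qquad
  f_j \sim N(0, \sigma_f^2),
  \qquad
  e_{ijk} \sim N(0, \sigma^2)
\]
% Two different children from the same family j share the same f_j:
\[
  \operatorname{Var}(y_{ijk}) = \sigma_f^2 + \sigma^2,
  \qquad
  \operatorname{Cov}(y_{ijk},\, y_{i'jk'}) = \sigma_f^2
\]
\[
  \operatorname{Corr}(y_{ijk},\, y_{i'jk'}) = \frac{\sigma_f^2}{\sigma_f^2 + \sigma^2}
\]
```

Heights from different families share no random term, so under this sketch they remain uncorrelated.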
The interaction of family*gender as a second random effect also accounts for the correlation between all observations that have the same level of both family and gender. One interpretation is that a female will have a higher correlation with the other females in the same family than with the males in that family, and likewise for a male. With the height data, this random effects model seems reasonable. Here is the code to fit this mixed model (which includes both random and fixed effects) in PROC MIXED:
PROC MIXED DATA=heights NOitprint;
CLASS gender family;
MODEL height = gender;
RANDOM family family*gender / subject=family type=vc vcorr ;
RUN;
The random effects, family and family*gender, are now listed only on the RANDOM statement (notice the two terms must not appear on the MODEL statement with PROC MIXED). The type=vc option specifies the variance components model for both family and family*gender. The residual matrix is assumed to equal σ²I₁₈, where I₁₈ is an 18×18 identity matrix. The output from this revised analysis is as follows:
| Row | Col1 | Col2 | Col3 | Col4 | Col5 |
| 1 | 1.000 | 0.658 | 0.658 | 0.379 | 0.379 |
| 2 | 0.658 | 1.000 | 0.658 | 0.379 | 0.379 |
| 3 | 0.658 | 0.658 | 1.000 | 0.379 | 0.379 |
| 4 | 0.379 | 0.379 | 0.379 | 1.000 | 0.658 |
| 5 | 0.379 | 0.379 | 0.379 | 0.658 | 1.000 |
Table 4: Estimated V Correlation Matrix for Family 1
Table 4 shows the estimated correlation matrix for family 1, which appears in the first row of the data table and has 5 members (3 females and 2 males). Rows 1-3 with columns 1-3, and rows 4-5 with columns 4-5, show that the correlation of females with females and of males with males is 0.658. The remaining cells show that the correlation between females and males within the family is 0.379.
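The two correlations in Table 4 can be reproduced by hand from the variance component estimates reported in Table 5. The sketch below assumes the usual variance components structure, in which two children of the same gender in the same family share both the family and the family*gender effects, while a brother and a sister share only the family effect. The symbol for the family*gender variance is notation assumed here.

```latex
% Estimates from Table 5:
% family = 2.4010, family*gender = 1.7657, residual = 2.1668
\[
  \operatorname{Corr}(\text{same family, same gender})
  = \frac{\sigma_f^2 + \sigma_{fg}^2}{\sigma_f^2 + \sigma_{fg}^2 + \sigma^2}
  = \frac{2.4010 + 1.7657}{2.4010 + 1.7657 + 2.1668}
  = \frac{4.1667}{6.3335}
  \approx 0.658
\]
\[
  \operatorname{Corr}(\text{same family, different gender})
  = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_{fg}^2 + \sigma^2}
  = \frac{2.4010}{6.3335}
  \approx 0.379
\]
```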
| Cov Parm | Estimate |
| FAMILY | 2.4010 |
| FAMILY*GENDER | 1.7657 |
| RESIDUAL | 2.1668 |
Table 5: Covariance Parameter Estimates (REML)
Table 5 displays the results of the REML fit. The “Estimate” column contains the variance component estimates for family and family*gender, as well as the residual variance, σ²=2.1668. PROC MIXED and PROC GLM estimate variance components from the linear model with different methods. PROC GLM uses method-of-moments estimators, while PROC MIXED offers several options, including moment estimates (METHOD=typen, where n=1, 2, or 3), maximum likelihood (METHOD=ml), and restricted/residual maximum likelihood (METHOD=reml), the default. The estimation method you choose may lead to different answers, especially with random effects or unbalanced data.
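To see how the choice of estimation method affects the variance components, the same mixed model can be refitted with a different METHOD= value. This is a minimal sketch that assumes the heights dataset created earlier. The RANDOM statement is written without the subject= option because, as far as the SAS documentation indicates, the METHOD=typen estimators are restricted to variance component models without a SUBJECT= effect.

```sas
/* Maximum likelihood instead of the default REML */
PROC MIXED DATA=heights METHOD=ml NOitprint;
CLASS gender family;
MODEL height = gender;
RANDOM family family*gender;
RUN;

/* Method-of-moments estimates based on Type III expected mean squares */
PROC MIXED DATA=heights METHOD=type3;
CLASS gender family;
MODEL height = gender;
RANDOM family family*gender;
RUN;
```

Comparing the “Covariance Parameter Estimates” tables from these runs with Table 5 shows how much the estimates depend on the method for a dataset this small and unbalanced.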
| Source | NDF | DDF | Type III F | Pr > F |
| GENDER | 1 | 3 | 7.95 | 0.0667 |
Table 6: Type 3 Tests of Fixed Effects
Table 6 contains the significance test for the fixed effect, gender. Note that its p-value (p=0.0667) is larger than the one observed for gender in the first statistical model, which treated all effects as fixed (p=0.0018 in Table 3). The contrast in these two results illustrates the importance of modeling family as a random, rather than a fixed, effect. In fact, if 0.05 is applied as the cutoff point for significance, the fixed effects model shows a significant effect, whereas the model with random effects does not.
An additional benefit of a random effects analysis is that it enables you to make inferences about gender that apply to a population of families, whereas the inferences about gender from the analysis where family and family*gender are treated as fixed effects apply only to the particular families present in the dataset.
This simple example was designed to show you how PROC MIXED lets you model correlation in your data directly and make inferences about fixed effects that apply to entire populations of random effects.
1. Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996). SAS System for Mixed Models, SAS Institute, Cary, NC.
2. SAS/STAT User’s Guide: The Mixed Procedure. Chapter 41, Section 6, “Clustered Data Example,” http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap41/sect6.htm and http://sas.uoregon.edu/sashtml/stat/chap41/sect6.htm