
Seven Problematic Areas of Statistical Analysis

Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu

Advances in computing technology and sophisticated programs offer many more choices for data analysis than even a few years ago, and researchers should make informed decisions about which technique to apply. The industrial statistician George Box, best known for his books on time series and experimental design, once wrote, "All models are wrong but some are useful." I would add that some models are more useful than others--or even more to the point, some models are more appropriate than others.

One of the potential weaknesses of any statistical analysis is accepting the output at face value. Just because SPSS or SAS (or any other program) produced it doesn't mean it's necessarily correct or the best technique you could have used. The purpose of this article is to briefly describe seven particular problematic areas of statistical analysis and introduce you to important new developments that will help you deal with them.

1. Lack of study planning

This first problematic area directly relates to my article on power calculations (http://cc.uoregon.edu/cnews/spring2006/powercalc.htm). Data analysis begins with study planning and focuses on the inputs to the problem and resources required to answer your research questions. This important first step is often difficult, but it's essential for data analysis later on. (Remember Abe Lincoln's maxim, "If I had eight hours to chop down a tree, I'd spend six hours sharpening my ax.") The type of data (discrete or continuous) and how data will be collected (i.e., the design) all help determine appropriate analysis tools to answer research questions.
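As a rough illustration of the planning arithmetic, the sketch below estimates the number of subjects per group needed to compare two means. It is not taken from the power calculations article; it assumes a two-sided test, equal group sizes, a standardized effect size (mean difference divided by the common standard deviation), and the normal approximation, which runs slightly below the answer from a t-test. The function name n_per_group is made up for this example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided comparison of two means.

    effect_size is the standardized difference (mean difference / common SD).
    Normal approximation only; an assumption of this sketch.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


print(n_per_group(0.5))              # medium effect: 63 per group
print(n_per_group(0.5, power=0.90))  # asking for more power costs more subjects
```

Halving the effect size roughly quadruples the required sample, which is why variability and effect size deserve attention before any data are collected.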

2. Placing importance on study results only when p-values are less than .05

There is nothing scientific about a cutoff for statistical significance of α=.05, even though this particular value is the default value in most programs and does have intuitive appeal. If you followed the study planning steps mentioned in #1 above, you have a solid basis for selecting α=.05 as a reasonable cutoff p-value for significance. This implies you have given thought to variability, effect size, and Type I and Type II errors, and considered what they actually mean in your study when viewed together.

In data analysis, a p-value measures the strength of evidence the sample data provide against the null hypothesis, H0--that is, against a hypothesis of no effect. When interpreting a p-value, you start by assuming your data show nothing interesting. You then apply the statistical test saying, OK, prove me wrong and show me the evidence! Out comes a number: the probability of observing an effect at least as large as the one in your sample if only random variation were at work. The smaller that probability, the stronger the evidence against H0. You then assess whether the magnitude of the observed effect is of practical value along with its associated strength of evidence (i.e., you need to observe a sufficiently small p-value to reject the null hypothesis).

The cutoff value should reflect the purpose of the study. For exploratory studies, a larger cutoff (say, approximately 0.10) can flag results that look interesting and are worthy of further research. Use a much smaller cutoff value (say, less than 0.005) if the study has public safety or financial implications.
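The short sketch below shows why a p-value should be read alongside the size of the effect. It holds the observed difference fixed at a hypothetical 2 points (with an assumed known standard deviation of 10) and changes only the sample size. The numbers and the two-sample z-test are assumptions chosen for illustration, not results from any study.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with a known common SD."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference in means
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed effect, three different sample sizes per group.
for n in (20, 100, 500):
    print(n, round(two_sided_p(2.0, 10.0, n), 4))
```

The identical 2-point difference is unremarkable with 20 subjects per group and falls below even a strict 0.005 cutoff with 500, so the p-value alone says nothing about whether the effect is of practical value.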

3. Chi-square tests with small samples

The Pearson chi-square test and the likelihood-ratio test are among the methods most widely applied to test the independence of factors in multi-dimensional contingency tables. Although the formulas look very different, they can be shown to be asymptotically equivalent. That is, the two statistics should be approximately the same number as the sample size gets large; both are compared to a chi-square distribution with appropriate degrees of freedom. However, the key phrase here is large sample size. Under the null hypothesis of factor independence, tables may have one or more small expected cell counts. Most statistical programs will warn you if the expected cell counts are not large enough for the mathematical properties of these tests to hold. Collapsing multiple categories into fewer levels is one way to work around the problem. Another way, and perhaps the only one when categories cannot sensibly be combined, is to apply exact tests of independence.
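To make the two statistics concrete, here is a minimal sketch that computes the expected cell counts, the Pearson statistic, and the likelihood-ratio statistic for a two-way table. The table values are invented, and the "expected count below 5" warning is a conventional rule of thumb assumed for this example rather than a rule stated in this article.

```python
from math import log


def independence_stats(table):
    """Return (expected counts, Pearson X2, likelihood-ratio G2) for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    cells = [(o, e) for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row)]
    x2 = sum((o - e) ** 2 / e for o, e in cells)
    g2 = 2 * sum(o * log(o / e) for o, e in cells if o > 0)  # empty cells add 0
    return expected, x2, g2


expected, x2, g2 = independence_stats([[8, 2], [1, 5]])  # hypothetical small sample
print(expected)
print(round(x2, 3), round(g2, 3))
if min(min(row) for row in expected) < 5:
    print("Warning: small expected counts; the chi-square approximation is doubtful.")
```

With only 16 observations the two statistics already disagree noticeably, and three of the four expected counts fall below 5.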

When you have only small amounts of data, exact tests are available for many analyses that typically require large sample sizes (PROCs FREQ, NPAR1WAY, and LOGISTIC in SAS; separately acquired exact test modules in SPSS). These tests often have intensive computing and memory requirements and may not run in a reasonable time--or not run at all. In that case, you can try tests based on simulations. Exact tests offer a better alternative for the computation of that all-important p-value when you are faced with small sample sizes, either by design or practical limitations.
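For a 2x2 table, an exact test is small enough to write out by hand. The sketch below implements one common form of the two-sided Fisher exact test: it enumerates every table with the same margins and sums the probabilities of those no more likely than the observed table. This illustrates the idea only; it is not the algorithm used by SAS or SPSS, and the table values are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible range for the top-left cell
    # Small tolerance so that ties with the observed table are counted.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


print(round(fisher_exact_two_sided(8, 2, 1, 5), 4))
```

For this table the exact p-value is 400/11440, about 0.035. Enumeration is cheap here but grows quickly for larger tables, which is where the computing and memory demands mentioned above come from.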

4. Indiscriminate application of regression and analysis of variance

Assume you have collected one, and only one, observation for each subject randomly selected from a well defined population (i.e., all observations are independent). Regression or ANOVA techniques to test for linear relationships of a continuous response variable with one or more predictors (which may be categorical or continuous) are then appropriate.

If the data exhibit clustering (i.e., the observations are not collected independently), then see #7 below. One common but problematic practice with clustered data is to measure a subject multiple times and then analyze the averages as if each were a single observation. Although this has been a common approach in the past, with today's advances in multi-level modeling, it is usually not necessary or even desirable.

Many data problems lurk in the background when running linear regression or ANOVA models. Diagnostic techniques should be applied to test for influential data, outliers, homogeneity of variance, and collinearity. You should also always consider tests for normality and outliers in the residuals so that you don't claim or miss significance on the basis of a small number of discrepant observations. PROC ROBUSTREG is a new SAS procedure that can help with this situation.
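As one example of such a diagnostic, the sketch below computes leverage and Cook's distance for a simple linear regression with a single predictor. The data are invented so that the last observation sits far from the others in x and off the trend of the rest. The formulas are the standard ones for a two-parameter model; statistical packages report the same quantities for models with more predictors.

```python
def simple_regression_diagnostics(x, y):
    """Fit y = intercept + slope * x; return (slope, intercept, leverage, cooks_d)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - 2)  # two estimated parameters
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Cook's distance: squared residual scaled by leverage; p = 2 parameters.
    cooks_d = [e * e / (2 * mse) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]
    return slope, intercept, leverage, cooks_d


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 25.0]
slope, intercept, leverage, cooks_d = simple_regression_diagnostics(x, y)
worst = cooks_d.index(max(cooks_d))
print("most influential observation:", worst, "leverage:", round(leverage[worst], 3))
```

The first nine points follow a slope of about 2, but the single high-leverage point pulls the fitted slope well below that, which is exactly the situation in which significance can be claimed or missed because of one observation.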

It's generally accepted that you should never apply stepwise regression to select the variables for a model. Three reasons for not doing it are found in Thompson (1995). One of the main problems is that the independent variables are usually not all that independent (i.e., they exhibit a degree of collinearity), and this redundancy has the potential to cause mild to severe problems in model selection. The resulting sample-specific models frequently fail when applied to new datasets. The same rationale applies to ranking the importance of the variables in a model, another problematic area.
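One simple way to see how much redundancy two predictors share is the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where the VIF reduces to 1 / (1 - r^2) with r the correlation between the predictors. The data are invented, and the VIF is offered here as a common collinearity diagnostic rather than something Thompson (1995) prescribes.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r * r)


x1 = [1, 2, 3, 4]
print(vif_two_predictors(x1, [1, -1, -1, 1]))        # uncorrelated predictors
print(vif_two_predictors(x1, [1.1, 1.9, 3.2, 3.9]))  # nearly redundant predictors
```

A VIF of 1 means the predictors carry separate information. Large values mean the coefficient estimates are unstable, so which variable a stepwise procedure happens to keep can change from one sample to the next.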

5. Analysis of data from non-normal distributions as if they were normal

This problem follows from #4. Not choosing the appropriate model for the data is the real concern. Discrete response variables, whether they are dichotomous, ordinal, counts, proportions, or other data with non-normal distributions, are more appropriately analyzed with generalized linear models. For example, the odds ratio is a convenient way to examine dichotomous (or binary), ordinal (e.g., Likert scale), and multinomial data. Data with one observation for each subject can be examined with PROC LOGISTIC. Many types of generalized linear models can be analyzed with PROC GENMOD, including discrete data collected as repeated measures.
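For a single dichotomous outcome and a two-level factor, the odds ratio can be computed directly from the 2x2 table of counts. The sketch below adds an approximate confidence interval based on the standard error of the log odds ratio (the Woolf method). The counts are invented, and this large-sample interval is an assumption of the example; a logistic regression program reports comparable quantities for more general models.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and approximate CI for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (event, no event). All cells must be nonzero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 30 of 100 events in one group, 15 of 100 in the other.
estimate, lower, upper = odds_ratio_ci(30, 70, 15, 85)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

The interval is built on the log scale and then transformed back, which keeps both limits positive and reflects the skewed sampling distribution of a ratio.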

Working with ratios presents another type of data analysis problem. You should always store both the numerator and the denominator from which a ratio is computed and analyze them directly, rather than treating the ratios as if they were normally distributed, even after applying a transformation. (Note: the theory behind the arcsine transformation, often applied to percents, is based on a ratio of known integers that count independent events.)
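A small sketch shows what is lost when only the ratios are kept. With invented counts for three subjects, averaging the individual ratios gives every subject equal weight, while the pooled estimate weights each subject by its denominator. The function names are made up for this illustration.

```python
def mean_of_ratios(numerators, denominators):
    """Unweighted average of the individual ratios."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


def pooled_ratio(numerators, denominators):
    """Total events divided by total trials."""
    return sum(numerators) / sum(denominators)


events = [1, 45, 2]   # hypothetical numerators
trials = [2, 100, 3]  # hypothetical denominators
print(round(mean_of_ratios(events, trials), 3))
print(round(pooled_ratio(events, trials), 3))
```

The ratio 1/2 is far less precise than 45/100, but that distinction disappears once only 0.50 and 0.45 are stored. Keeping both counts lets a binomial or other generalized linear model weight each observation appropriately.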

6. Distorted graphic displays of data

Graphic displays are powerful tools for summarizing research findings and should be an integral part of any publication or public presentation of results. It's important to use visual methods in addition to tables or prose descriptions to show what results actually mean. Unfortunately, you may find it difficult to get the desired graph with existing programs, as they include many feature choices that may distract from what you want to show.

Anyone who summarizes data in charts and graphs should read Edward Tufte's classic book The Visual Display of Quantitative Information as well as his other publications. These present a wealth of information on how to visually present data effectively. William Cleveland's books on this subject are also a particularly good resource for statistical graphs.

7. Not understanding the concept of a within-cluster covariance matrix

Whenever you have two or more values collected within subjects (clusters), you have a data analysis situation that requires an assumption about the structure of a within-subject covariance matrix. You could choose to ignore it, but failing to understand how a covariance matrix works may influence the results you observe.

It's possible to assume an unstructured covariance matrix and apply a multivariate model. However, researchers often respond to correlated data with a repeated measures approach by implementing the general linear model (GLM) module from either SAS or SPSS and then checking the sphericity condition; if necessary, the error degrees of freedom are adjusted downward for statistical tests on the within-subjects factors. This was the state-of-the-art approach up until only a few years ago, but today we can do much better by analyzing clustered or repeated measures data with a mixed model.

Learning how to work with the covariance matrix beyond the sphericity assumption or an unstructured matrix opens many possibilities for more appropriate data analyses. In many situations, the General Linear Model (GLM) procedures of both SAS and SPSS simply don't work well with clustered data. PROCs MIXED, NLMIXED, and GLIMMIX are designed for continuous data, and PROCs GENMOD, NLMIXED, and GLIMMIX all work with discrete data (see Littell et al., 2006).
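To make the idea of a covariance structure concrete, the sketch below builds two commonly used within-subject structures for repeated measurements, compound symmetry and first-order autoregressive, and counts the parameters an unstructured matrix would need. The variance and correlation values are arbitrary, and the choice of these two structures is an assumption of this example; mixed model software offers many others.

```python
def compound_symmetry(t, variance, rho):
    """Equal variances and one common correlation between any two occasions."""
    return [[variance if i == j else variance * rho for j in range(t)]
            for i in range(t)]


def ar1(t, variance, rho):
    """Correlation decays as rho ** lag between occasions i and j."""
    return [[variance * rho ** abs(i - j) for j in range(t)] for i in range(t)]


def n_unstructured_params(t):
    """Distinct variances and covariances in an unstructured t x t matrix."""
    return t * (t + 1) // 2


for row in compound_symmetry(4, 4.0, 0.5):
    print(row)
for row in ar1(4, 4.0, 0.5):
    print(row)
print(n_unstructured_params(4))
```

Both structured matrices are described by just two parameters, while the unstructured 4 x 4 matrix needs ten. Choosing among such structures is the kind of decision a mixed model makes explicit and the sphericity-based approach does not.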

References

1. Box, G.E.P., "Robustness in the Strategy of Scientific Model Building," in Launer and Wilkinson, eds., Robustness in Statistics (NY, Academic Press, 1979), pp. 201-236.

2. Littell, Ramon C., George A. Milliken, Walter W. Stroup, Russell D. Wolfinger, and Oliver Schabenberger. (2006). SAS for Mixed Models, Second Edition. Cary, NC: SAS Institute Inc.

3. Thompson, B. (1995). "Stepwise regression and stepwise discriminant analysis need not apply here…" Educational and Psychological Measurement, 55, 525-534.

4. Tufte, Edward R., (Second edition, May 2001) The Visual Display of Quantitative Information, Graphics Press: Cheshire, Connecticut.


Summer 2006 Computing News | Computing Center Home Page