
Important Factors in Designing Statistical Power Analysis Studies

The size of your study sample is critical to producing meaningful results


By Robin High (robinh@darkwing.uoregon.edu)

What is "statistical power analysis?" And how and under what circumstances is it applied?

The purpose of many research projects is to search for evidence that a value of a parameter from a population of interest is different from the hypothesized value. In other words, if the computed p-value from a statistical test is "small" (e.g., less than some arbitrary cutoff value), the null hypothesis is rejected in favor of the alternative. This process can be compared to an investigation in which the researcher determines if there is sufficient evidence from a collected sample to change our view of a population characteristic.

Power is broadly defined as "the probability that a statistical significance test will reject the null hypothesis for a specified value of an alternative hypothesis." Another way to define it is "the ability of a test to detect an effect, given that the effect actually exists."
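One way to make this definition concrete is to simulate many studies in which the effect really exists and count how often the test rejects the null hypothesis. The sketch below is only an illustration under stated assumptions: two equal groups, normally distributed responses, a known standard deviation (so a simple z-test can be used), and made-up numbers for the effect (5 units), the standard deviation (10 units), and the group size (63). None of these values come from a real study.

```python
import math
import random
from statistics import NormalDist, fmean


def simulated_power(effect, sd, n_per_group, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power as the share of simulated studies that reject H0.

    Assumes normal responses and a known standard deviation (z-test).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    se = sd * math.sqrt(2 / n_per_group)          # std. error of the mean difference
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# Hypothetical values: a true 5-unit effect, then no effect at all.
print(simulated_power(effect=5, sd=10, n_per_group=63))
print(simulated_power(effect=0, sd=10, n_per_group=63))
```

With a true effect, the first proportion should land near 0.80. With no effect, the second should land near the significance level, since every rejection is then a Type I error.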

Here is a simple, illustrative example: suppose you want to explore the effect regular exercise has on a person's "quality of life." When designing such an experiment, what questions should you consider?

Initial Considerations. Assume the exercise program is of scientific value and two groups with an equal number of subjects randomly assigned to each group will be available (one group is assigned to the exercise program and one group assigned as a control). Here are a few important considerations:

This is only a partial list of many important questions that must be carefully considered when designing any data collection activity. All too often, however, the question of sample size is slighted--or perhaps even ignored--even though it's critical to the usefulness of the results.

Essential information. When performing a statistical power analysis, you'll need to consider the following important information:

1. Significance level (alpha), or the probability of a Type I error. A common, yet arbitrary, choice is alpha = .05.

2. Power to detect an effect. This is expressed as power = 1 - β, where β is the probability of a Type II error. Power = 0.80 is also a common, yet arbitrary, choice.

3. Effect size (in actual units of the response) the researcher wants to detect. Effect size and the ability to detect it go hand in hand: the smaller the effect, the more difficult it will be to find.

4. Variation in the response variable. The standard deviation, which usually comes from previous research or pilot studies, is often used for the response variable of interest.

5. Sample size. A larger sample size generally leads to parameter estimates with smaller variances, giving you a greater ability to detect a significant difference.

These five components of a power analysis are not independent: in fact, any four of them automatically determine the fifth. The usual objective of a power analysis is to calculate the sample size (5) for given values of items (1)-(4). In studies with limited resources, however, the maximum sample size will be known in advance. Power analysis then becomes a useful tool to determine whether sufficient power (2) exists for specified values of (1), (3), (4), and (5), so the researcher can evaluate whether the study is worth pursuing.
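To see how four of the components pin down the fifth, here is a minimal sketch for the two-group comparison of means. It relies on the common normal approximation, n per group = 2 * ((z for alpha/2 + z for power) * sd / effect)^2, and is not the exact method used by any of the programs mentioned in this article (a t-based calculation gives slightly larger samples). The function names and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def sample_size_per_group(alpha, power, effect, sd):
    """Solve for component (5) given (1)-(4), two-sided test, equal groups."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


def power_for_sample_size(alpha, effect, sd, n_per_group):
    """Solve for component (2) given (1), (3), (4), and (5).

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # std. error of the mean difference
    return _STD_NORMAL.cdf(abs(effect) / se - z_alpha)


# Hypothetical planning values: detect a 5-unit difference when sd = 10.
n = sample_size_per_group(alpha=0.05, power=0.80, effect=5, sd=10)
print(n)  # 63 subjects per group under this approximation
print(round(power_for_sample_size(alpha=0.05, effect=5, sd=10, n_per_group=n), 2))
```

The second function covers the limited-resources case: fix the sample size you can afford and check whether the resulting power is acceptable.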

Comments

Significance level. Using alpha=.05 is completely arbitrary. In fact, I am 90% confident that 75% of researchers do not know what alpha actually means, so how could a "correct" cutoff level be chosen?

Similar comments can be made concerning β, but explaining this important probability is beyond the limited scope of this article.

What effect size is meaningful? The size of a practical difference in the response you would like to detect among the groups is crucial. It essentially measures the "distance" between the null (H0) and a specified value of the alternative (HA) hypotheses. It also relates to the underlying population, not to data from a sample. A desirable effect size is the degree of deviation from the null hypothesis (in actual units of the response) that is considered large enough to attract your attention.

Jacob Cohen, an important contributor to power analysis documentation, defined effect sizes as small, medium, and large, and he has stated that "all null hypotheses, at least in their two-tailed forms, are false." A difference is always going to be there; however, it might be so small that you should not be concerned about finding it. The concept of small, medium, and large effect sizes can be a reasonable starting point if you do not have more precise information. (Note that an effect size should be stated as a number in the actual units of the response, not as a percent change such as 5% or 10%.)

Returning to the example, if a difference in quality of life due to an exercise program exists, is the magnitude of the difference worth detecting? Suppose the levels of exercise you apply to subjects cause an observed change in quality of life of one unit on the chosen measurement scale. Is a one-unit change--or even 5 or 10 units--meaningful when facing the reality that many factors external to the study will also affect a person's quality of life?

Estimates of variation. You'll also need an estimate of the variability in the response of interest before you can determine the sample size needed to estimate an effect. This value is often found from pilot studies or from previous research, although it is all too often not readily available in published documents. Some parameters of interest are dimensionless quantities, such as a correlation or coefficient of variation, so in these cases a standard deviation would not be required.
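Once you have both a raw effect size and an estimate of the standard deviation, you can relate them to the small, medium, and large labels by dividing one by the other. The sketch below assumes the benchmarks commonly quoted from Cohen for a difference between two means (0.2, 0.5, and 0.8 standard deviations). Those cutoffs are not given in this article, and the function names and example numbers are hypothetical.

```python
def cohens_d(raw_effect, sd):
    """Standardized effect size: the raw difference in units of the sd."""
    return abs(raw_effect) / sd


def describe(d):
    """Label d using the commonly quoted benchmarks (an assumption here)."""
    if d < 0.2:
        return "below small"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# Hypothetical example: a 5-unit change on a scale whose sd is 10 units.
d = cohens_d(raw_effect=5, sd=10)
print(d, describe(d))  # 0.5 medium
```

The labels are only a starting point. Whether a given difference matters still has to be judged in the actual units of the response.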

Power calculations. Computing power for any specific study may very well be a difficult task. However, if you do not evaluate the joint influence of the size of the effect that is important and the inherent variability of the response during the planning stage, one of two inefficient outcomes will most likely result:

1. "Low power" (too little data; meaningful effect sizes are difficult to detect). If too few subjects are used, a hypothesis test will result in such low power that there is little chance to detect a significant effect. Consider someone attempting to start a car on a cold winter morning with a weak battery--it just doesn't provide the cranking power to get the engine going. This is analogous to designing an experiment in which resources were not put to optimal use (i.e., data from fewer subjects than the necessary number were collected to detect a meaningful effect).

2. "High power" (too much data; trivially small effect sizes can be detected). At the other extreme, consider an experiment where data collection is so large that a trivially small difference in the effect is detectable. One could describe this approach as the "Tim the Tool-Man" method ("MORE POWER, eh, eh!"). If you have ever watched the popular TV show "Tool Time," you'll know exactly what I mean. Again, the researcher has not put all of his or her time and resources to good use--in statistical terms, too many subjects have been studied.

A study with low power will have indecisive results, even if the phenomenon you're investigating is real. Stated differently, the effect may well be there, but without adequate power, you won't find it.

The situation with high power is the reverse: you will likely see very significant results, even if the size of the effect you're investigating is not practical. Stated differently, the effect is there, but its magnitude is of little value.
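The two extremes can be put into rough numbers. The sketch below uses a normal approximation for a two-sided comparison of two equal groups. All values are invented for illustration: a 5-unit effect stands in for a meaningful difference, a 0.1-unit effect for a trivial one, and the standard deviation is assumed to be 10 units.

```python
import math
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # std. error of the mean difference
    return std_normal.cdf(abs(effect) / se - z_crit)


# Too little data: a meaningful 5-unit effect, only 10 subjects per group.
underpowered = approx_power(effect=5, sd=10, n_per_group=10)

# Too much data: a trivial 0.1-unit effect, 500,000 subjects per group.
overpowered = approx_power(effect=0.1, sd=10, n_per_group=500_000)

print(f"low-power design:  {underpowered:.2f}")
print(f"high-power design: {overpowered:.4f}")
```

Under these assumptions the small study has only about a 20% chance of finding the meaningful effect, while the enormous study finds the trivial one more than 99% of the time.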

In conclusion, the number of subjects you use is critical to the success of research. Without a sufficient number, you won't be able to achieve adequate power to detect the effect you're looking for. With too many subjects, you may be using valuable resources inefficiently. Either way, implementing a study with too little or too much power does not spend time and resources economically; this is viewed by some reviewers as unethical scientific behavior.


Computer Software for Power Calculations

Several Internet sites will help you understand power analysis and will get you started with either commercial or free software. Descriptions of available software and URL resources are shown below.

nQuery Advisor Release 4.0 - This software, which is used for sample size and power calculations, contains extensive table entries and many other convenient features. For more information, see http://www.statsol.ie

SamplePower® 1.2 - SamplePower, available from SPSS, arrives at sample sizes for a variety of common data analysis situations. You can learn more about it at http://www.spss.com/spower/research.htm

G*Power - You can download this free program from http://www.psychologie.uni-trier.de:8000/projects/gpower.html

G*Power allows you to calculate a sample size for a given effect size, alpha level, and power value.

UnifyPow is another free power analysis program that uses SAS. You can find example programs and workshop notes at the UnifyPow web site at http://www.bio.ri.ccf.org/power.html


Summer 2000 Computing News | Computing Center Home Page