Glossary

Statistical Power

Statistical power is the probability that a statistical test will detect a real effect if one exists. A study with low power may fail to detect a genuine difference between designs, leading to the false conclusion that they are equivalent. Power is the complement of the Type II error rate (β): power = 1 − β.

By convention, studies aim for power of 0.80 (80% chance of detecting the effect) with significance level α = 0.05.

Power depends on four factors:

  1. Effect size — how large is the real difference? (Larger effects are easier to detect)
  2. Sample size — how many participants? (More participants give more power)
  3. Significance level (α) — stricter thresholds reduce power
  4. Variability — how much do participants differ from each other? (More variation reduces power)
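The dependence of power on sample size and effect size can be seen directly by simulation. The sketch below is illustrative only (the function name and parameters are not from this glossary): it draws repeated samples of paired differences with mean d and unit standard deviation, applies a two-sided test using a normal critical value, and reports the fraction of runs that reject the null, which estimates power.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def simulated_power(d, n, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo estimate of power for a one-sample (paired) test
    of a mean difference d with unit SD, using a z critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(d, 1.0) for _ in range(n)]
        z = mean(xs) / (stdev(xs) / sqrt(n))      # test statistic
        if abs(z) > z_crit:
            hits += 1                             # null rejected
    return hits / reps

# Larger samples and larger effects both raise power:
print(simulated_power(0.5, 15))   # modest n: well below 0.80
print(simulated_power(0.5, 34))   # larger n: near 0.80
print(simulated_power(0.8, 15))   # larger effect: high power even at small n
```

Raising the per-run variability (the SD passed to `rng.gauss`) would lower all three estimates, illustrating the fourth factor above.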

Power analysis is the calculation that determines the required sample size given the other three factors. It should always be done before running an experiment, not after. An underpowered experiment squanders participants' effort on inconclusive results; a substantially overpowered one recruits more participants than the question requires.

Rough guidelines for usability experiments comparing two conditions at a medium effect size (Cohen's d = 0.5), with α = 0.05 and power = 0.80:

  • Within-subjects: ~34 participants
  • Between-subjects: ~64 participants per group

Low power is a pervasive problem in usability research. Many published studies are underpowered, producing "no significant difference" findings that reflect weak measurement rather than genuine equivalence.

Related terms: Effect Size, Within-Subjects Design, Between-Subjects Design

Also defined in: Textbook of Usability, Textbook of Medical Statistics