Glossary

Statistical Power

Statistical power is the probability that a statistical test will detect a real effect if one exists. A study with low power may fail to detect a genuine difference between designs, leading to the false conclusion that they are equivalent. Power is the complement of the Type II error rate (β): power = 1 − β.

By convention, studies aim for power of 0.80 (80% chance of detecting the effect) with significance level α = 0.05.

Power depends on four factors:

  1. Effect size — how large is the real difference? (Larger effects are easier to detect)
  2. Sample size — how many participants? (More participants give more power)
  3. Significance level (α) — stricter thresholds reduce power
  4. Variability — how much do participants differ from each other? (More variation reduces power)
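The dependence of power on sample size and effect size can be seen directly by simulation. The sketch below is illustrative only (the function name and parameters are not from this glossary): it draws repeated samples of paired differences with mean d and unit standard deviation, applies a two-sided test using a normal critical value, and reports the fraction of runs that reject the null, which estimates power.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def simulated_power(d, n, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo estimate of power for a one-sample (paired) test
    of a mean difference d with unit SD, using a z critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(d, 1.0) for _ in range(n)]
        z = mean(xs) / (stdev(xs) / sqrt(n))      # test statistic
        if abs(z) > z_crit:
            hits += 1                             # null rejected
    return hits / reps

# Larger samples and larger effects both raise power:
print(simulated_power(0.5, 15))   # modest n: well below 0.80
print(simulated_power(0.5, 34))   # larger n: near 0.80
print(simulated_power(0.8, 15))   # larger effect: high power even at small n
```

Raising the per-run variability (the SD passed to `rng.gauss`) would lower all three estimates, illustrating the fourth factor above.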

Power analysis is the calculation that determines the required sample size given the other three factors. It should always be done before running an experiment, not after. An underpowered experiment squanders participants' effort on inconclusive results; a substantially overpowered one recruits more participants than the question requires.

Rough guidelines for usability experiments comparing two conditions at a medium effect size (Cohen's d = 0.5), with α = 0.05 and power = 0.80:

  • Within-subjects: ~34 participants
  • Between-subjects: ~64 participants per group

Low power is a pervasive problem in usability research. Many published studies are underpowered, producing "no significant difference" findings that reflect weak measurement rather than genuine equivalence.

Related terms: Effect Size, Within-Subjects Design, Between-Subjects Design

Also defined in: Textbook of Usability, Textbook of Medical Statistics