Chapter Eighteen

Experimental Design and Statistics for Usability

Learning Objectives
  1. Design controlled experiments to evaluate usability hypotheses
  2. Choose between within-subjects and between-subjects designs
  3. Select appropriate statistical tests for common usability comparisons
  4. Interpret and report effect sizes alongside statistical significance
  5. Apply A/B testing methodology to evaluate design changes at scale

Introduction

The preceding chapters covered methods for evaluating usability through expert judgment (Chapter 16), predictive modelling (Chapter 17), and observational testing (Chapter 15). This chapter covers the most rigorous approach: controlled experiments (Lazar et al., 2017). When the question is "Does design A produce better performance than design B, and can we be confident that the difference is real and not due to chance?", only a properly designed experiment with appropriate statistical analysis can provide a definitive answer (Sauro & Lewis, 2016).

The Logic of Controlled Experiments

A controlled experiment isolates the effect of one factor (the independent variable) on one or more outcomes (dependent variables) while holding all other factors constant. In usability, the independent variable is typically a design feature (menu layout, button size, colour scheme, interaction pattern) and the dependent variables are usability metrics (task time, error rate, completion rate, satisfaction score).

Independent Variables

The independent variable is the factor the experimenter manipulates. Examples:

  • Menu structure (flat vs. hierarchical)
  • Input method (mouse vs. touch vs. keyboard)
  • Alert modality (visual only vs. visual + auditory)
  • Font size (12pt vs. 14pt vs. 16pt)

Each value of the independent variable is called a level or condition. An experiment with two conditions (design A vs. design B) is the simplest; experiments with three or more conditions allow more nuanced comparisons.

Dependent Variables

Common dependent variables in usability experiments:

  • Task completion time (continuous, measured in seconds)
  • Task completion rate (binary: success/failure)
  • Error count (count data)
  • Satisfaction rating (ordinal, e.g., SUS score)
  • Number of clicks or navigation steps (count data)

Confounding Variables

A confounding variable is a factor that varies systematically with the independent variable, making it impossible to attribute observed differences to the intended cause. Common confounds in usability experiments include:

  • Individual differences: some participants are faster/more experienced than others
  • Learning effects: performance improves with practice, regardless of design
  • Fatigue effects: performance degrades over time
  • Task order effects: earlier tasks may influence later ones

Experimental design techniques (described below) control for these confounds.

Within-Subjects vs. Between-Subjects Designs

Within-Subjects (Repeated Measures)

Each participant uses all conditions: participant 1 uses both design A and design B.

Advantages:

  • Controls for individual differences (each participant serves as their own control)
  • Requires fewer participants for the same statistical power
  • Directly measures the difference within each participant

Disadvantages:

  • Learning effects: performance with the second design may be influenced by experience with the first
  • Fatigue: participants may perform worse on later conditions simply because they are tired
  • Carryover effects: exposure to one design may change how the participant approaches the next

Mitigation: counterbalancing — half the participants use design A first, half use design B first. For more than two conditions, Latin square designs ensure each condition appears in each position equally often.
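As an illustration (not from the chapter), a cyclic Latin square satisfying the property above — each condition appearing exactly once in each ordinal position — can be generated by rotating the condition list; the function name is my own:

```python
def latin_square(conditions):
    """Cyclic Latin square: row i is the condition list rotated by i.
    Each condition appears exactly once in each ordinal position."""
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

# Assign participant p the order in row p % n
orders = latin_square(["A", "B", "C"])
# orders == [['A', 'B', 'C'], ['B', 'C', 'A'], ['C', 'A', 'B']]
```

For two conditions this reduces to simple AB/BA counterbalancing. Note that a cyclic square balances position but not immediate carryover; designs that also balance which condition precedes which require a balanced Latin square construction.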

Between-Subjects

Each participant uses only one condition: one group of participants uses design A; a different group uses design B.

Advantages:

  • No learning, fatigue, or carryover effects
  • Each participant encounters the design "fresh"

Disadvantages:

  • Individual differences between groups may obscure the treatment effect
  • Requires more participants (typically 2–3 times as many as within-subjects)

Mitigation: random assignment of participants to conditions ensures that individual differences are distributed equally across groups (in expectation).

Key Principle

Within-subjects designs are more statistically powerful (they require fewer participants) but are vulnerable to order effects. Between-subjects designs avoid order effects but require more participants. For usability experiments comparing two designs, a within-subjects design with counterbalancing is usually the most efficient choice. For experiments where exposure to one condition would contaminate the other (e.g., comparing onboarding flows), between-subjects is necessary.

Sample Size and Statistical Power

Power Analysis

Statistical power is the probability that a test will detect a real effect if one exists. A study with low power may fail to detect a genuine difference between designs, leading to a false conclusion that they are equivalent. Power depends on four factors:

  1. Effect size: how large is the difference between conditions?
  2. Sample size: how many participants per condition?
  3. Significance level (α): the threshold for declaring a result "significant" (typically 0.05)
  4. Variability: how much do participants differ from each other?

Design Law

Determine sample size before the experiment using a power analysis (Cohen, 1988). Specify the minimum effect size you want to detect (based on practical significance: what size improvement would matter?), set power to 0.80 (the conventional minimum), and calculate the required sample size. Running an experiment with too few participants risks wasting everyone's time with an inconclusive result. Running with too many wastes resources.

For a within-subjects comparison of task times with a medium effect size (Cohen's d = 0.5; Cohen, 1988), approximately 34 participants provide 80% power at α = 0.05. For a between-subjects comparison with the same parameters, approximately 64 participants per group are needed.
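These sample sizes can be approximated with the standard normal-approximation formulas, sketched below using only the Python standard library (the function name is my own). The normal approximation runs slightly below the exact t-test-based values quoted above (32 vs. 34 within-subjects; 63 vs. 64 per group between-subjects):

```python
import math
from statistics import NormalDist

def sample_size(d, alpha=0.05, power=0.80, within=True):
    """Approximate n to detect effect size d (Cohen's d) with a
    two-tailed test, via the normal approximation to the t-test."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # ≈ 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # ≈ 0.84 for power = 0.80
    n = ((z_alpha + z_beta) / d) ** 2     # participants (within-subjects)
    if not within:
        n *= 2                            # participants per group
    return math.ceil(n)

print(sample_size(0.5, within=True))    # 32 (exact t-based value: 34)
print(sample_size(0.5, within=False))   # 63 per group (exact: 64)
```

In practice, dedicated tools (e.g., G*Power or statsmodels) perform the exact calculation, including the small-sample t-distribution correction.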

Common Statistical Tests

Comparing Two Conditions

  • Within-subjects, continuous outcome (e.g., task time): paired t-test; Wilcoxon signed-rank test as the non-parametric alternative
  • Between-subjects, continuous outcome: independent-samples t-test; Mann–Whitney U test as the non-parametric alternative
  • Binary outcome (e.g., completion rate), between-subjects: chi-squared test of proportions, or Fisher's exact test for small samples

Comparing Three or More Conditions

  • Within-subjects: repeated-measures ANOVA; Friedman test as the non-parametric alternative
  • Between-subjects: one-way ANOVA; Kruskal–Wallis test as the non-parametric alternative

When ANOVA reveals a significant effect, post-hoc pairwise comparisons (with correction for multiple testing, e.g., Bonferroni or Tukey's HSD) identify which specific conditions differ.
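The Bonferroni correction is simple to apply: multiply each pairwise p-value by the number of comparisons, capping at 1. A minimal sketch (function name is my own):

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each p multiplied by the number
    of comparisons m, capped at 1.0."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Three pairwise comparisons after a significant ANOVA
print(bonferroni([0.010, 0.030, 0.200]))  # ≈ [0.03, 0.09, 0.60]
```

Tukey's HSD is less conservative when all pairwise comparisons are of interest, but requires the studentised range distribution and is usually left to a statistics package.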

Effect Size

Key Principle

Statistical significance (the p-value) tells you whether an observed difference is likely to be real. Effect size tells you whether it matters. A comparison of two designs with 1,000 participants may find a statistically significant difference of 0.3 seconds in task time — real, but practically irrelevant. Always report effect size alongside p-values. For continuous outcomes, Cohen's d (small: 0.2, medium: 0.5, large: 0.8; Cohen, 1988) is standard. For binary outcomes, odds ratios or relative risk provide practical measures.
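For a between-subjects comparison, Cohen's d is the mean difference divided by the pooled standard deviation. A minimal sketch with hypothetical task times (standard library only; function name is my own):

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups:
    mean difference / pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical task times (seconds) for two designs
a = [42, 39, 45, 41, 44, 40]
b = [36, 38, 35, 39, 34, 37]
print(round(cohens_d(a, b), 2))  # well above 0.8: a large effect
```

For within-subjects data, d is instead computed from the per-participant differences (mean difference / standard deviation of the differences).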

A/B Testing

A/B testing is the application of between-subjects experimental design to live software systems at scale (Kohavi et al., 2020). Users are randomly assigned to see version A or version B of a feature, and their behaviour is measured through analytics.

Methodology

  1. Define the hypothesis: "Changing the checkout button from grey to green will increase conversion rate."
  2. Define the metric: conversion rate (purchases / visits).
  3. Calculate sample size: based on the baseline conversion rate, the minimum detectable effect, and the desired power.
  4. Randomise: assign each visitor randomly to condition A or B.
  5. Run the experiment: collect data until the required sample size is reached.
  6. Analyse: compare conversion rates using a chi-squared test or z-test for proportions.
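Step 6 can be sketched in pure Python with a two-proportion z-test; the counts in the usage example are hypothetical and the function name is my own:

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.
    conv_* are conversion counts; n_* are visitor counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)         # pooled proportion
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed
    return z, p_value

# Hypothetical: 480/10,000 conversions (A) vs 540/10,000 (B)
z, p = two_proportion_z_test(480, 10_000, 540, 10_000)
```

For these hypothetical counts the test lands near the conventional 0.05 threshold, which is exactly the situation where peeking and early stopping (discussed below under Practical Considerations) are most dangerous.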

Practical Considerations

Duration: A/B tests should run for at least one full business cycle (typically one week) to account for day-of-week effects, even if the required sample size is reached sooner.

Multiple testing: testing many variations simultaneously (A/B/C/D...) or checking results repeatedly (peeking) inflates the false positive rate (Kohavi et al., 2020). Sequential testing methods and Bonferroni corrections address this.

Metric selection: the primary metric should be the one most directly related to the business or user goal. Secondary metrics provide context but should not drive the decision if they conflict with the primary metric.

Example

An e-commerce company A/B tests two checkout flows: the current multi-page flow (A) and a new single-page flow (B). After 50,000 visitors per condition:

  • Flow A: 3.2% conversion rate
  • Flow B: 3.8% conversion rate
  • Difference: 0.6 percentage points (18.75% relative improvement)
  • Chi-squared test: p < 0.001

The result is both statistically significant (p < 0.001) and practically significant (an 18.75% relative improvement in conversion is substantial). The single-page flow is adopted.
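The example's figures can be checked directly with a two-proportion z-test (equivalent to the chi-squared test for a 2×2 table); a standard-library sketch:

```python
from statistics import NormalDist

n = 50_000                                          # visitors per flow
conv_a, conv_b = int(0.032 * n), int(0.038 * n)     # 1,600 and 1,900 conversions
p_a, p_b = conv_a / n, conv_b / n
p_pool = (conv_a + conv_b) / (2 * n)                # pooled proportion
se = (p_pool * (1 - p_pool) * (2 / n)) ** 0.5       # standard error of the difference
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-tailed
print(round(z, 2), p_value < 0.001)                 # z ≈ 5.16, p < 0.001
```

A z of roughly 5 is far beyond the 1.96 threshold for α = 0.05, consistent with the reported p < 0.001.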

Limitations of A/B Testing

A/B testing measures what users do, not why they do it. A test might show that version B has higher conversion but cannot explain whether users found it easier, more trustworthy, or simply more visually prominent. Combining A/B testing with qualitative research (usability testing, surveys) provides the "why" behind the "what." A/B testing also optimises for the measured metric, which may not capture the full user experience. Aggressive optimisation of click-through rates can lead to dark patterns (Chapter 11; Gray et al., 2018) if the metric does not align with genuine user benefit.

Think About It

A/B testing treats users as a means to generate data. Participants are not informed that they are in an experiment; they have not given consent. In most jurisdictions, A/B testing of minor design variations falls within the normal scope of product development. But what about A/B tests that manipulate emotional content, pricing, or access to features? Where is the ethical line between product development and human experimentation?

Reporting Usability Experiments

A well-reported usability experiment includes:

  1. Participants: number, demographics, recruitment method, relevant experience
  2. Design: independent and dependent variables, within/between-subjects, counterbalancing
  3. Materials: system or prototype used, task descriptions, questionnaires
  4. Procedure: step-by-step description of what participants did
  5. Results: descriptive statistics (means, standard deviations, confidence intervals), inferential statistics (test statistic, p-value, effect size), and visualisations (bar charts with error bars, box plots)
  6. Discussion: interpretation of results, limitations, practical implications

Key Takeaways

  • Controlled experiments isolate the effect of design changes on usability metrics by manipulating one factor while controlling others.
  • Within-subjects designs control for individual differences but require counterbalancing; between-subjects designs avoid order effects but require more participants.
  • Sample size should be determined by power analysis before the experiment.
  • Statistical significance (p-value) indicates whether an effect is real; effect size indicates whether it matters. Always report both.
  • A/B testing applies experimental design to live systems at scale, measuring real user behaviour.
  • A/B testing measures what users do but not why; combine with qualitative methods for complete understanding.

Further Reading

  • Field, A. (2024). Discovering Statistics Using IBM SPSS Statistics (6th ed.). SAGE.
  • Lazar, J., Feng, J. H., & Hochheiser, H. (2017). Research Methods in Human-Computer Interaction (2nd ed.). Morgan Kaufmann.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  • Sauro, J., & Lewis, J. R. (2016). Quantifying the User Experience (2nd ed.). Morgan Kaufmann.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum.