Experimental Design and Statistics for Usability

Dr Chris Paton

Learning Objectives

Design controlled experiments to evaluate usability hypotheses
Choose between within-subjects and between-subjects designs
Select appropriate statistical tests for common usability comparisons
Interpret and report effect sizes alongside statistical significance
Apply A/B testing methodology to evaluate design changes at scale

Introduction

The preceding chapters covered methods for evaluating usability through expert judgment (Chapter 16), predictive modelling (Chapter 17), and observational testing (Chapter 15). This chapter covers the most rigorous approach: controlled experiments Lazar, 2017. When the question is "Does design A produce better performance than design B, and can we be confident that the difference is real and not due to chance?", only a properly designed experiment with appropriate statistical analysis can answer it with a quantified level of confidence Sauro, 2016.

The Logic of Controlled Experiments

A controlled experiment isolates the effect of one factor (the independent variable) on one or more outcomes (dependent variables) while holding all other factors constant. In usability, the independent variable is typically a design feature (menu layout, button size, colour scheme, interaction pattern) and the dependent variables are usability metrics (task time, error rate, completion rate, satisfaction score).

Independent Variables

The independent variable is the factor the experimenter manipulates. Examples:

Menu structure (flat vs. hierarchical)
Input method (mouse vs. touch vs. keyboard)
Alert modality (visual only vs. visual + auditory)
Font size (12pt vs. 14pt vs. 16pt)

Each value of the independent variable is called a level or condition. An experiment with two conditions (design A vs. design B) is the simplest; experiments with three or more conditions allow more nuanced comparisons.

Dependent Variables

Common dependent variables in usability experiments:

Task completion time (continuous, measured in seconds)
Task completion rate (binary: success/failure)
Error count (count data)
Satisfaction rating (ordinal for individual Likert items; composite scores such as SUS are usually analysed as continuous)
Number of clicks or navigation steps (count data)

Confounding Variables

A confounding variable is a factor that varies systematically with the independent variable, making it impossible to attribute observed differences to the intended cause. Common confounds in usability experiments include:

Individual differences: some participants are faster/more experienced than others
Learning effects: performance improves with practice, regardless of design
Fatigue effects: performance degrades over time
Task order effects: earlier tasks may influence later ones

Experimental design techniques (described below) control for these confounds.

Within-Subjects vs. Between-Subjects Designs

Within-Subjects (Repeated Measures)

Each participant uses all conditions: for example, every participant uses both design A and design B.

Advantages:

Controls for individual differences (each participant serves as their own control)
Requires fewer participants for the same statistical power
Directly measures the difference within each participant

Disadvantages:

Learning effects: performance with the second design may be influenced by experience with the first
Fatigue: participants may perform worse on later conditions simply because they are tired
Carryover effects: exposure to one design may change how the participant approaches the next

Mitigation: counterbalancing. Half the participants use design A first and half use design B first. For more than two conditions, Latin square designs ensure each condition appears in each position equally often.

It is worth distinguishing the two order effects that within-subjects designs introduce, because they behave differently. A learning effect is a general improvement that carries across conditions: a participant who has just completed a search task with design A is now more familiar with the underlying content, the task instructions, and the laboratory setup, so they will tend to be faster on design B regardless of which design is genuinely better. A carryover effect is specific to the condition just experienced: exposure to a particular menu structure may set an expectation about where items live, and that expectation may help or hinder performance in the next condition in a way that depends on which design came first. Fatigue, in which performance degrades late in a session, acts in the opposite direction to learning. Counterbalancing does not eliminate these effects; it distributes them symmetrically across conditions so that, in aggregate, they do not favour one design over another Lazar, 2017.

A Latin square is the standard tool for distributing order effects when there are more than two conditions. With four conditions (A, B, C, D), a four by four Latin square assigns each participant one of four orderings (for example ABCD, BCDA, CDAB, DABC) such that every condition appears exactly once in each ordinal position across the square. This balances simple position effects but not pairwise sequence effects (the fact that B always follows A in the first row); a balanced Latin square, or a fully randomised order with enough participants, addresses the residual sequence dependence. Counterbalancing only works cleanly when the number of participants is a multiple of the number of orderings, which is a practical constraint on recruitment.
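The two kinds of square described above can be generated in a few lines. The following is a minimal sketch, not a prescribed procedure: the function names are illustrative, and the balanced construction shown is the standard one for an even number of conditions (an odd number needs a pair of squares, which is not covered here).

```python
def cyclic_latin_square(conditions):
    """Simple Latin square: each row is the previous row rotated left by one."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def balanced_latin_square(conditions):
    """Balanced square for an EVEN number of conditions.

    Each condition appears once in every position, and each condition
    immediately follows every other condition exactly once.
    """
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("this construction assumes an even number of conditions")
    # Index pattern for the first row: 0, 1, n-1, 2, n-2, ...
    first_row = [0]
    low, high = 1, n - 1
    while len(first_row) < n:
        first_row.append(low)
        low += 1
        if len(first_row) < n:
            first_row.append(high)
            high -= 1
    # Later rows shift every index in the first row by the row number.
    return [[conditions[(idx + row) % n] for idx in first_row] for row in range(n)]


def ordering_for(participant_index, square):
    """Cycle through the rows so each ordering is used equally often."""
    return square[participant_index % len(square)]


conditions = ["A", "B", "C", "D"]
print(["".join(r) for r in cyclic_latin_square(conditions)])
# ['ABCD', 'BCDA', 'CDAB', 'DABC']
print(["".join(r) for r in balanced_latin_square(conditions)])
# ['ABDC', 'BCAD', 'CDBA', 'DACB']
```

In the cyclic square B always follows A; in the balanced square every ordered pair of adjacent conditions occurs exactly once. Recruiting participants in multiples of four keeps the orderings equally represented.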

Key Principle

The choice between designs is a trade between two sources of unwanted variation. A within-subjects design removes individual differences (each participant is compared against themselves, so a naturally fast or slow participant cannot bias the comparison) but introduces order effects (learning, fatigue, carryover). A between-subjects design removes order effects (every participant sees one design, fresh) but reintroduces individual differences as noise between the groups. Counterbalancing converts the within-subjects order effects from a systematic bias into symmetrically distributed noise, which is why a counterbalanced within-subjects design is usually both the most powerful and the safest default for a two-design comparison.

Between-Subjects

Each participant uses only one condition. One group of participants uses design A; a different group uses design B.

Advantages:

No learning, fatigue, or carryover effects
Each participant encounters the design "fresh"

Disadvantages:

Individual differences between groups may obscure the treatment effect
Requires more participants (often several times as many in total as a within-subjects design with the same power)

Mitigation: random assignment of participants to conditions ensures that individual differences are distributed equally across groups (in expectation).
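Random assignment is simple to script, which also makes it auditable. The sketch below shuffles a participant list and deals it out to the conditions in turn so that group sizes stay equal; the function name, the participant identifiers, and the fixed seed are illustrative assumptions rather than part of any standard procedure.

```python
import random


def randomly_assign(participant_ids, conditions=("A", "B"), seed=None):
    """Shuffle participants, then deal them round-robin into the conditions."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for position, participant in enumerate(shuffled):
        groups[conditions[position % len(conditions)]].append(participant)
    return groups


groups = randomly_assign(range(1, 21), seed=42)
for condition, members in groups.items():
    print(condition, sorted(members))
```

Shuffling first and dealing second guarantees balanced group sizes, whereas tossing an independent coin for each participant balances the groups only in expectation.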

Key Principle

Within-subjects designs are more statistically powerful (they require fewer participants) but are vulnerable to order effects. Between-subjects designs avoid order effects but require more participants. For usability experiments comparing two designs, a within-subjects design with counterbalancing is usually the most efficient choice. For experiments where exposure to one condition would contaminate the other (e.g., comparing onboarding flows), between-subjects is necessary.

Sample Size and Statistical Power

Power Analysis

Statistical power is the probability that a test will detect a real effect if one exists. A study with low power may fail to detect a genuine difference between designs, leading to a false conclusion that they are equivalent.

Power is best understood through the two ways a statistical test can be wrong. A Type I error (false positive) occurs when the test declares a difference between designs that does not really exist; its probability is the significance level, alpha, conventionally set at 0.05. A Type II error (false negative) occurs when a genuine difference exists but the test fails to detect it; its probability is denoted beta. Power is simply 1 minus beta: the probability of correctly detecting a real effect. Setting power to 0.80 therefore accepts a 20% chance of missing a true difference, which is a deliberate asymmetry. The convention treats a false positive (alpha = 0.05) as roughly four times more costly than a false negative (beta = 0.20), on the reasoning that wrongly adopting an ineffective design is usually worse than failing to detect a small benefit Cohen, 1988.

The effect size quantifies how large the difference between conditions is, in units that do not depend on sample size. For comparing two means (for example, task times under two designs), the standard measure is Cohen's d, the difference between the two means divided by the pooled standard deviation. A d of 0.5 means the two design conditions differ by half a standard deviation. Cohen proposed conventional benchmarks of 0.2 (small), 0.5 (medium), and 0.8 (large) Cohen, 1988, though these are rules of thumb rather than laws; what counts as a meaningful effect should be judged against the practical stakes of the design decision.
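Cohen's d as defined above takes only a few lines to compute. The following is a minimal sketch using the Python standard library; the task times are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference between the means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical task times in seconds for two independent groups.
times_design_a = [52, 48, 55, 60, 45]
times_design_b = [47, 44, 50, 53, 41]

print(round(cohens_d(times_design_a, times_design_b), 2))  # 0.94
```

Here the means differ by 5 seconds and the pooled standard deviation is about 5.3 seconds, so d is a little under 1: a large effect by Cohen's benchmarks, although five participants per group is far too few to estimate it precisely.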

Power, effect size, alpha, and sample size are mathematically linked, so fixing any three determines the fourth. This is what makes a power analysis possible: choose the effect size you care about, fix alpha at 0.05 and power at 0.80, and the required sample size follows. The relationship between effect size and sample size is steeply non-linear. Detecting a medium effect (d = 0.5) at 80% power and alpha = 0.05 in a two-group between-subjects comparison requires about 64 participants per group (128 in total). Dropping the target to a small effect (d = 0.2) does not merely double the requirement; because the required sample size grows with the inverse square of the effect size, it rises more than sixfold, to roughly 400 participants per group. This is the central practical lesson of power analysis: small, subtle improvements are genuinely expensive to detect, and a study that is underpowered for the effect it hopes to find is likely to produce a non-significant result that proves nothing either way Lazar, 2017.
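The inverse-square relationship can be made explicit with the usual normal approximation for a two-sided, two-group comparison of means. This is an approximation rather than the exact t-based calculation, so it runs a participant or so below the figures quoted above.

```latex
n_{\text{per group}} \;\approx\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{d^{2}}

% With \alpha = 0.05 and power 1 - \beta = 0.80:
%   z_{0.975} \approx 1.960, \quad z_{0.80} \approx 0.842,
%   so 2\,(1.960 + 0.842)^2 \approx 15.7

d = 0.5:\quad n \approx \frac{15.7}{0.25} \approx 63
\qquad\qquad
d = 0.2:\quad n \approx \frac{15.7}{0.04} \approx 392

\frac{n_{d=0.2}}{n_{d=0.5}} \;=\; \left(\frac{0.5}{0.2}\right)^{2} \;=\; 6.25
```

Halving the effect size quadruples the required sample, which is why the move from d = 0.5 to d = 0.2 multiplies the requirement by 6.25 rather than 2.5.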

Power depends on four factors:

Effect size: how large is the difference between conditions?
Sample size: how many participants per condition?
Significance level (α): the threshold for declaring a result "significant" (typically 0.05)
Variability: how much do participants differ from each other?

Design Law

Determine sample size before the experiment using a power analysis Cohen, 1988. Specify the minimum effect size you want to detect (based on practical significance: what size improvement would matter?), set power to 0.80 (the conventional minimum), and calculate the required sample size. Running an experiment with too few participants risks wasting everyone's time with an inconclusive result. Running with too many wastes resources.

For a within-subjects comparison of task times with a medium effect size (Cohen's d = 0.5, standardised against the standard deviation of the paired differences) Cohen, 1988, approximately 34 participants provide 80% power at α = 0.05. For a between-subjects comparison with the same parameters, approximately 64 participants per group (128 in total) are needed.
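A rough version of this calculation fits in a few lines of standard-library Python. The sketch below uses the normal approximation, so it returns slightly smaller numbers (32 and 63) than the 34 and 64 quoted above, which come from the exact t-based calculation that dedicated power-analysis software performs. The function names are illustrative, and the within-subjects function assumes the effect size is expressed in units of the standard deviation of the paired differences.

```python
from math import ceil
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the normal quantiles for a two-sided alpha and the target power."""
    z = NormalDist().inv_cdf
    return z(1 - alpha / 2) + z(power)


def n_per_group_between(d, alpha=0.05, power=0.80):
    """Approximate participants PER GROUP for two independent groups."""
    return ceil(2 * (_z_sum(alpha, power) / d) ** 2)


def n_within(d_z, alpha=0.05, power=0.80):
    """Approximate participants IN TOTAL for a paired comparison."""
    return ceil((_z_sum(alpha, power) / d_z) ** 2)


print(n_within(0.5))             # 32 (exact t-based answer: 34)
print(n_per_group_between(0.5))  # 63 per group (exact t-based answer: 64)
print(n_per_group_between(0.2))  # 393 per group
```

The approximation is adequate for planning, but the final figure for a study report should come from an exact calculation.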

Common Statistical Tests

The choice of test follows from the design and the type of dependent variable, not the other way round. A useful first move is to map the common usability comparisons to their standard tests, while remembering that each test carries assumptions (independence of observations, and for the parametric tests an approximately normal distribution of the outcome) that should be checked before the result is trusted Field, 2024.

Comparison	Design	Outcome type	Standard test
Two designs, different participants	Between-subjects	Continuous (task time)	Independent-samples t-test
Two designs, same participants	Within-subjects	Continuous (task time)	Paired t-test
Three or more designs, different participants	Between-subjects	Continuous	One-way ANOVA
Three or more designs, same participants	Within-subjects	Continuous	Repeated-measures ANOVA
Two or more designs	Between-subjects	Categorical (success/failure)	Chi-square test of proportions

Comparing Two Conditions

To compare a continuous outcome such as task completion time across two designs, the t-test is the workhorse. The independent-samples t-test applies when the two conditions are seen by different groups of participants (a between-subjects design); it asks whether the difference between the two group means is large relative to the variation within the groups. The paired t-test applies when the same participants experience both conditions (a within-subjects design); it works on the within-participant differences, which is precisely how it strips out individual differences and gains power. Choosing the wrong member of the pair (for example, an independent-samples test on within-subjects data) discards the design's structure and either throws away power or violates the test's independence assumption. When the outcome is markedly non-normal or the sample is very small, the non-parametric counterparts (the Mann-Whitney U test for independent samples, the Wilcoxon signed-rank test for paired samples) make weaker distributional assumptions Field, 2024.
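The cost of choosing the wrong test is easiest to see by computing both statistics on the same data. The sketch below uses invented task times for five participants who each used both designs; it computes only the t statistics, because turning them into p-values requires the t distribution, which the Python standard library does not provide (a statistics package would normally be used for that step).

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Student's t with pooled variance (assumes roughly equal variances)."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / n_a + 1 / n_b))


def paired_t(a, b):
    """t computed on the within-participant differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical task times (seconds); position i is the same participant.
design_a = [30, 40, 50, 60, 70]
design_b = [28, 37, 48, 57, 65]

print(round(paired_t(design_a, design_b), 2))       # 5.48 on 4 degrees of freedom
print(round(independent_t(design_a, design_b), 2))  # 0.31 on 8 degrees of freedom
```

Every participant is 2 to 5 seconds faster with design B, but participants differ from one another by tens of seconds. The paired test sees only the consistent differences and yields a large t; the independent-samples test buries the same 3-second mean difference in between-participant variation.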

Comparing Three or More Conditions

Comparing three or more designs with a series of t-tests is tempting but wrong: each test carries its own Type I error risk, and running several of them inflates the chance that at least one comes up falsely significant. Analysis of variance (ANOVA) addresses this by testing, in a single step, whether any of the condition means differ. A significant ANOVA is an omnibus result; it says the conditions are not all equal but does not say which differ. When ANOVA reveals a significant effect, post-hoc pairwise comparisons (with correction for multiple testing, e.g., Bonferroni or Tukey's HSD) identify which specific conditions differ. For categorical outcomes the logic is different again: completion rates (success versus failure) across conditions are compared with a chi-square test, which contrasts observed cell counts against those expected if condition and outcome were independent. None of these tests, on its own, establishes that a difference matters; that is the separate question of effect size, addressed next.
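The inflation from repeated t-tests is easy to quantify. The sketch below assumes, as a simplification, that the pairwise tests are independent (in practice they share data, so the true figure differs somewhat), and shows the per-test threshold that a Bonferroni correction would impose.

```python
from math import comb


def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for n_conditions in (3, 4, 5):
    n_tests = comb(n_conditions, 2)  # number of pairwise comparisons
    print(
        n_conditions,
        n_tests,
        round(familywise_error(n_tests), 3),
        round(bonferroni_alpha(n_tests), 4),
    )
# 3 conditions ->  3 tests -> 0.143 familywise, 0.0167 per test
# 4 conditions ->  6 tests -> 0.265 familywise, 0.0083 per test
# 5 conditions -> 10 tests -> 0.401 familywise, 0.005 per test
```

With five designs, uncorrected pairwise testing would produce at least one spurious "significant" difference about 40% of the time under these assumptions.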

Effect Size

Key Principle

Statistical significance (the p-value) tells you how surprising the observed difference would be if the designs were truly equivalent, and so whether it is plausibly real rather than chance. Effect size tells you whether it matters. A comparison of two designs with 1,000 participants may find a statistically significant difference of 0.3 seconds in task time: real, but practically irrelevant. Always report effect size alongside p-values. For continuous outcomes, Cohen's d (small: 0.2, medium: 0.5, large: 0.8) Cohen, 1988 is standard. For binary outcomes, odds ratios or relative risk provide practical measures.
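For binary outcomes, the two effect-size measures mentioned above can be computed directly from the success counts. The following sketch uses invented completion counts; the function names are illustrative, and the odds ratio is undefined if either group has no failures.

```python
def relative_risk(success_a, n_a, success_b, n_b):
    """Success rate under B divided by success rate under A."""
    return (success_b / n_b) / (success_a / n_a)


def odds_ratio(success_a, n_a, success_b, n_b):
    """Odds of success under B divided by odds of success under A."""
    odds_a = success_a / (n_a - success_a)
    odds_b = success_b / (n_b - success_b)
    return odds_b / odds_a


# Hypothetical: 60 of 100 participants complete the task with design A,
# 80 of 100 with design B.
print(round(relative_risk(60, 100, 80, 100), 2))  # 1.33
print(round(odds_ratio(60, 100, 80, 100), 2))     # 2.67
```

The two measures describe the same data but are not interchangeable: participants are 1.33 times as likely to succeed with design B, while the odds of success are 2.67 times higher. Reports should state which measure is being used.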

A/B Testing

A/B testing is the application of between-subjects experimental design to live software systems at scale Kohavi, 2020. Users are randomly assigned to see version A or version B of a feature, and their behaviour is measured through analytics.

Methodology

Define the hypothesis: "Changing the checkout button from grey to green will increase conversion rate."
Define the metric: conversion rate (purchases / visits).
Calculate sample size: based on the baseline conversion rate, the minimum detectable effect, and the desired power.
Randomise: assign each visitor randomly to condition A or B.
Run the experiment: collect data until the required sample size is reached.
Analyse: compare conversion rates using a chi-square test or z-test for proportions.
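The sample-size step in the methodology above can be sketched with the standard normal approximation for comparing two proportions. The baseline rate and minimum detectable lift below are hypothetical planning inputs, and the function name is illustrative; production experimentation platforms may use different formulas.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_arm(baseline_rate, minimum_detectable_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH arm to detect an absolute lift."""
    p_a = baseline_rate
    p_b = baseline_rate + minimum_detectable_lift
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil(z_sum ** 2 * variance_sum / (p_b - p_a) ** 2)


# Hypothetical plan: baseline conversion of 3.0%, and the smallest lift
# worth acting on is 0.5 percentage points (to 3.5%).
print(visitors_per_arm(0.030, 0.005))  # a little under 20,000 per arm
```

Because the lift appears squared in the denominator, halving the minimum detectable lift roughly quadruples the traffic required, which is why low-baseline metrics need such large tests.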

Practical Considerations

Duration: A/B tests should run for at least one full business cycle (typically one week) to account for day-of-week effects, even if the required sample size is reached sooner.

Multiple testing: testing many variations simultaneously (A/B/C/D...) or checking results repeatedly (peeking) inflates the false positive rate Kohavi, 2020. Corrections such as Bonferroni address the first problem; sequential testing methods address the second.
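The effect of peeking can be demonstrated with a small simulation. The sketch below is a simplification: it assumes there is no true difference, that data arrive in equal-sized batches, and that the interim test statistic is a z-score, in which case the statistic at the k-th look is the sum of k independent standard normal batch scores divided by the square root of k. An experiment counts as a false positive if it crosses the 1.96 threshold at any look.

```python
import random
from math import sqrt


def false_positive_rate(n_looks, n_experiments=2000, z_crit=1.96, seed=1):
    """Share of null experiments declared significant at ANY interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0, 1)  # evidence from one more batch
            if abs(running_sum / sqrt(look)) > z_crit:
                false_positives += 1  # the analyst stops and declares a winner
                break
    return false_positives / n_experiments


print(false_positive_rate(n_looks=1))   # close to the nominal 0.05
print(false_positive_rate(n_looks=10))  # several times the nominal rate
```

Checking once at the planned end keeps the error rate near 5%; stopping at the first significant look out of ten raises it several-fold, even though each individual check uses the conventional threshold.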

Metric selection: the primary metric should be the one most directly related to the business or user goal. Secondary metrics provide context but should not drive the decision if they conflict with the primary metric.

Example

An e-commerce company A/B tests two checkout flows: the current multi-page flow (A) and a new single-page flow (B). After 50,000 visitors per condition:

Flow A: 3.2% conversion rate
Flow B: 3.8% conversion rate
Difference: 0.6 percentage points (18.75% relative improvement)
Chi-square test: p < 0.001

The result is both statistically significant (p < 0.001) and practically significant (an 18.75% relative improvement in conversion is substantial). The single-page flow is adopted.
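The significance claim in this example can be checked by hand with a two-proportion z-test, which for a two-by-two comparison is equivalent to the chi-square test without continuity correction (the chi-square statistic is the square of z). The conversion counts below are those implied by the stated rates; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if A and B were equivalent
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# 3.2% and 3.8% of 50,000 visitors are 1,600 and 1,900 conversions.
z, p = two_proportion_z_test(1600, 50_000, 1900, 50_000)
print(round(z, 2), p < 0.001)  # 5.16 True
```

A z of about 5.2 lies far beyond the 1.96 threshold, consistent with the reported p < 0.001.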

A Worked A/B Test

Consider a subscription service testing whether a redesigned sign-up page lifts the rate at which visitors complete registration. Visitors are randomly assigned on arrival to the existing page (control, A) or the redesign (variant, B), and the metric is the registration completion rate (completed sign-ups divided by page views).

Variant A (control): 1,180 completions from 40,000 visitors, a rate of 2.95%.
Variant B (redesign): 1,360 completions from 40,000 visitors, a rate of 3.40%.
Absolute difference: 0.45 percentage points. Relative improvement: about 15.3%.

The comparison is between two proportions, so a chi-square test (equivalently, a z-test for two proportions) is appropriate. With samples this large, the difference of 0.45 points is comfortably significant, with p well below 0.01, and a 95% confidence interval on the difference that lies entirely above zero (roughly 0.2 to 0.7 percentage points). What does that significant result actually tell you? It tells you that the observed gap is very unlikely to have arisen by chance alone if the two pages were genuinely equivalent: the redesign really does convert better, in this population, over this period. What it does not tell you is at least as important. It does not tell you why the redesign won (clearer copy, fewer form fields, a more prominent button, or simple novelty); it does not tell you that the 0.45-point lift will persist once the novelty fades or once the visitor mix shifts with a different marketing campaign; and it does not tell you that registration is the right thing to optimise, since a page that maximises sign-ups could still lower the rate at which those sign-ups become paying, retained customers. A significant A/B result is a precise answer to a narrow question, and the discipline lies in remembering exactly how narrow that question was Kohavi, 2020.
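The confidence interval quoted above can be reproduced with the standard normal-approximation (Wald) interval for a difference between two proportions. This is a sketch using only the Python standard library; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def difference_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (rate B - rate A), using unpooled standard errors."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z_crit * se, diff + z_crit * se


low, high = difference_ci(1180, 40_000, 1360, 40_000)
print(f"{low * 100:.2f} to {high * 100:.2f} percentage points")
# 0.21 to 0.69 percentage points

relative_lift = (1360 / 40_000) / (1180 / 40_000) - 1
print(f"{relative_lift:.1%}")  # 15.3%
```

The interval matters as much as the p-value: the data are compatible with a true lift anywhere from about 0.2 to about 0.7 points, so a business case that depends on the full 0.45 points is resting on the point estimate alone.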

Think About It

Suppose the same redesign had produced 1,210 completions from B against 1,180 from A: a difference of 30 completions, about 3.03% versus 2.95%, with p = 0.53. It would be wrong to conclude that the two pages are equivalent. A non-significant result with this sample size means only that any true difference, if it exists, is probably small enough that 40,000 visitors per arm could not reliably distinguish it from zero. Absence of evidence is not evidence of absence; ruling a difference out requires either an equivalence test or a much larger sample.
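One way to see why this result is inconclusive rather than negative is to ask what difference the test could reliably have detected. The sketch below estimates the minimum detectable difference at 80% power using a normal approximation that treats both arms as having roughly the baseline variance; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_difference(baseline_rate, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute difference detectable with the given power."""
    z = NormalDist().inv_cdf
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z(1 - alpha / 2) + z(power)) * se


mde = minimum_detectable_difference(0.0295, 40_000)
observed = 1210 / 40_000 - 1180 / 40_000

print(f"minimum detectable: {mde * 100:.2f} points")    # 0.34 points
print(f"observed:           {observed * 100:.3f} points")  # 0.075 points
```

With 40,000 visitors per arm the test is well powered only for differences of about a third of a percentage point or more. An observed gap of 0.075 points sits far below that, so the data cannot distinguish a small real improvement from no improvement at all.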

Limitations of A/B Testing

A/B testing measures what users do, not why they do it. A test might show that version B has higher conversion but cannot explain whether users found it easier, more trustworthy, or simply more visually prominent. Combining A/B testing with qualitative research (usability testing, surveys) provides the "why" behind the "what." A/B testing also optimises for the measured metric, which may not capture the full user experience. Aggressive optimisation of click-through rates can lead to dark patterns (Chapter 11) Gray, 2018 if the metric does not align with genuine user benefit.

Think About It

A/B testing treats users as a means to generate data. Participants are not informed that they are in an experiment; they have not given consent. In most jurisdictions, A/B testing of minor design variations falls within the normal scope of product development. But what about A/B tests that manipulate emotional content, pricing, or access to features? Where is the ethical line between product development and human experimentation?

Reporting Usability Experiments

A well-reported usability experiment includes:

Participants: number, demographics, recruitment method, relevant experience
Design: independent and dependent variables, within/between-subjects, counterbalancing
Materials: system or prototype used, task descriptions, questionnaires
Procedure: step-by-step description of what participants did
Results: descriptive statistics (means, standard deviations, confidence intervals), inferential statistics (test statistic, p-value, effect size), and visualisations (bar charts with error bars, box plots)
Discussion: interpretation of results, limitations, practical implications

Key Takeaways

Controlled experiments isolate the effect of design changes on usability metrics by manipulating one factor while controlling others.
Within-subjects designs control for individual differences but require counterbalancing; between-subjects designs avoid order effects but require more participants.
Sample size should be determined by power analysis before the experiment.
Statistical significance (p-value) indicates whether an effect is plausibly real rather than chance; effect size indicates whether it matters. Always report both.
A/B testing applies experimental design to live systems at scale, measuring real user behaviour.
A/B testing measures what users do but not why; combine with qualitative methods for complete understanding.

Textbook of Usability