- Plan and conduct a usability test including participant recruitment, task design, and facilitation
- Apply the think-aloud protocol to elicit user reasoning during testing
- Select appropriate usability metrics for different evaluation goals
- Determine sample sizes and understand the diminishing returns of additional participants
- Conduct remote usability testing effectively
Introduction
Heuristic evaluation, predictive modelling, and expert review (covered in the following chapters) are valuable for identifying potential usability problems. But the definitive test of usability is observation: watching real users attempt real tasks with the actual system (Rubin & Chisnell, 2008). Usability testing is the empirical core of usability practice — the method that grounds design decisions in evidence about actual human performance. This chapter covers the planning, execution, and analysis of usability tests, from formative tests with five participants to summative evaluations with rigorous metrics (Nielsen, 1993).
Types of Usability Testing
Formative Testing
Formative testing is conducted during the design process to identify usability problems and inform redesign. It is typically qualitative, focusing on discovering what goes wrong and why rather than measuring performance with statistical precision. Formative testing uses small samples (5–8 participants), think-aloud protocols, and iterative cycles of testing and redesign.
Summative Testing
Summative testing evaluates a finished or near-finished design against defined criteria. It is typically quantitative, measuring task completion rates, task times, error rates, and satisfaction scores. Summative testing uses larger samples (20+ participants) and statistical analysis to determine whether usability targets have been met.
Comparative Testing
Comparative testing evaluates two or more alternative designs using the same tasks and metrics. It can be within-subjects (each participant uses all designs) or between-subjects (different participants use different designs). Comparative testing requires careful experimental design to control for learning effects and individual differences.
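One standard way to control for learning effects in a within-subjects comparison is to counterbalance the presentation order across participants. A minimal sketch (the design names and participant IDs are illustrative):

```python
from itertools import permutations

# Two hypothetical designs; permutations gives every presentation order.
designs = ["A", "B"]
orders = list(permutations(designs))  # [('A', 'B'), ('B', 'A')]

# Cycle through the orders so each appears equally often across participants,
# spreading learning effects evenly over both conditions.
participants = [f"P{i}" for i in range(1, 9)]
assignment = {p: orders[i % len(orders)] for i, p in enumerate(participants)}

for p, order in assignment.items():
    print(p, "->", " then ".join(order))
```

With more than two designs, a Latin square (each design appearing once in each position) is typically used instead of the full set of permutations, which grows factorially.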
Planning a Usability Test
Define the Goals
Before recruiting participants or writing tasks, define what the test should reveal. Common goals include:
- Identifying the most severe usability problems (formative)
- Measuring whether the system meets specified usability targets (summative)
- Comparing two design alternatives (comparative)
- Validating that a specific design change solved a known problem
The test goals determine everything else: the number and type of participants, the tasks, the metrics, and the analysis method. A test without clear goals produces data that is difficult to interpret and easy to ignore. State the goals before designing the test, and ensure that every element of the test plan serves at least one goal.
Identify Representative Users
Participants should represent the system's actual user population in relevant characteristics: experience level, domain knowledge, age, technical proficiency, and any other factors that affect interaction with the system. Testing with colleagues or developers is convenient but misleading — their familiarity with the system's concepts and conventions makes them unrepresentative of actual users.
Write Task Scenarios
Tasks are the heart of a usability test. Each task describes a realistic goal that the participant must accomplish using the system. Good tasks are:
- Realistic: based on actual use cases, not artificial exercises
- Specific enough to be actionable: "Find the nearest pharmacy open after 8pm" rather than "Explore the pharmacy section"
- Neutral in wording: avoiding terminology that mirrors the interface labels (which would make the task a word-matching exercise rather than a usability test)
- Independent: each task can be completed without information gained from another task
Poor task: "Click on the Settings gear icon and change the notification preference to email only." This task tells the participant exactly what to do, testing their ability to follow instructions rather than the interface's usability. Better task: "You're receiving too many notifications on your phone. Change things so you only get notified by email." This task describes a realistic goal using the participant's language, requiring them to discover how to accomplish it through the interface.
Determine Sample Size
Nielsen and Landauer's model (1993) predicts that five participants discover approximately 85% of usability problems in a formative test. This figure has been widely cited to justify small-sample testing (Krug, 2010). However, the 85% applies to problems with a discovery probability of roughly 31% per participant; less common problems require more participants. For formative testing, 5–8 participants per user group is a practical starting point. For summative testing with quantitative metrics, larger samples (20–30+) are needed for statistical power. For comparative testing, power analysis based on the expected effect size should determine sample size.
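The model behind these figures is simple: if each problem has an independent probability p of being observed with any one participant, the expected fraction of problems found with n participants is 1 − (1 − p)^n. A quick sketch:

```python
def problems_found(n, p=0.31):
    """Expected fraction of usability problems observed with n participants,
    assuming each problem has independent discovery probability p per
    participant (Nielsen & Landauer's model)."""
    return 1 - (1 - p) ** n

# With p = 0.31, five participants find about 84-85% of problems,
# and additional participants show sharply diminishing returns.
for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} participants: {problems_found(n):.0%}")
```

Lowering p (rarer problems) shifts the whole curve down, which is why uncommon problems need larger samples or repeated rounds of testing.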
Five participants are sufficient for formative testing aimed at discovering major usability problems. This is not because five participants find "all" problems — they do not — but because the most severe problems (those affecting most users) are very likely to be observed with five participants, and fixing those problems before testing with more participants is more efficient than testing with a larger sample upfront. The strategy is iterative: test with 5, fix the biggest problems, test again with 5.
Conducting the Test
The Think-Aloud Protocol
The think-aloud protocol, introduced by Ericsson and Simon (1984), asks participants to verbalise their thoughts as they work through tasks. "Tell me what you're thinking" is the standard instruction. The facilitator prompts silent participants with neutral phrases ("What are you looking at?" "What are you thinking now?") without directing their actions. Think-aloud provides rich qualitative data about the user's reasoning, expectations, and confusions. It reveals not just what participants do but why — information that is essential for understanding the root cause of observed problems and designing effective fixes.
Facilitator Behaviour
The facilitator's role is to create conditions for authentic behaviour while collecting useful data. Key principles:
- Do not help. When the participant struggles, the natural impulse is to provide hints or guidance. Resist it — the struggle is the data.
- Do not lead. Questions like "Did you see the button in the top right?" direct attention and contaminate the results.
- Remain neutral. Do not react positively or negatively to participant actions. Facial expressions, tone of voice, and body language all communicate evaluative feedback.
- Probe, don't direct. When the participant does something unexpected, ask "What were you expecting to happen?" rather than "That's the wrong button."
Recording and Observation
Usability tests should be recorded (screen capture plus audio, and optionally video of the participant) for later review. However, recording is a supplement to real-time observation, not a replacement. The facilitator's real-time notes capture contextual information (the participant's facial expression when encountering an error, a muttered comment that the microphone missed) that the recording may not preserve. Additional observers — designers, developers, product managers — should watch the test (from a separate room or via screen sharing) to build empathy with the user's experience. Observed problems are more compelling than reported problems.
Usability Metrics
Effectiveness Metrics
- Task completion rate: the percentage of participants who successfully complete each task. The most fundamental usability metric.
- Error rate: the number of errors per task. Errors can be classified by type (slips, mistakes) and severity (recovered, unrecovered).
- Assists: the number of times the facilitator had to provide help for the participant to continue. (Assists indicate a task failure for the participant but provide useful data about the specific point of failure.)
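These effectiveness metrics are simple tallies over per-participant session records. A minimal sketch with hypothetical data, using the convention above that an assisted completion counts as a failure:

```python
# Hypothetical results for one task across five participants.
sessions = [
    {"completed": True,  "errors": 0, "assists": 0},
    {"completed": True,  "errors": 2, "assists": 1},
    {"completed": False, "errors": 3, "assists": 2},
    {"completed": True,  "errors": 1, "assists": 0},
    {"completed": False, "errors": 4, "assists": 1},
]

n = len(sessions)
completion_rate = sum(s["completed"] for s in sessions) / n
# An assisted completion is scored as a failure, but the assist itself
# pinpoints where the participant got stuck.
unassisted_rate = sum(s["completed"] and s["assists"] == 0 for s in sessions) / n
mean_errors = sum(s["errors"] for s in sessions) / n
total_assists = sum(s["assists"] for s in sessions)

print(f"Completion: {completion_rate:.0%} ({unassisted_rate:.0%} unassisted), "
      f"mean errors: {mean_errors:.1f}, assists: {total_assists}")
```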
Efficiency Metrics
- Task time: the time from task start to completion. Typically measured as the mean or median across participants, with a target time for comparison.
- Deviation from optimal path: the number of extra steps or pages visited beyond the minimum required. Indicates navigational confusion.
- Lostness score: a composite metric combining the number of unique pages visited, the number of total pages visited, and the minimum required pages (Sauro, 2016).
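The efficiency list above can be made concrete with the lostness score. One common formulation combines the three counts just described, with S the total pages visited (counting revisits), N the unique pages, and R the minimum required; values near 0 indicate efficient navigation and values above roughly 0.5 are typically read as "lost":

```python
from math import sqrt

def lostness(unique_pages, total_pages, min_required):
    """Lostness score from one common formulation: 0 means perfectly
    efficient navigation; higher values indicate more wandering."""
    n, s, r = unique_pages, total_pages, min_required
    return sqrt((n / s - 1) ** 2 + (r / n - 1) ** 2)

# A participant who needed only 3 pages but visited 8 (6 of them unique):
print(round(lostness(6, 8, 3), 2))
```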
Satisfaction Metrics
- System Usability Scale (SUS): a 10-item questionnaire yielding a single score from 0 to 100. Quick to administer and interpret. An SUS score above 68 is considered "above average." Developed by Brooke (1996) and validated against several competing questionnaires by Tullis and Stetson (2004).
- Single Ease Question (SEQ): "Overall, how easy or difficult was this task?" rated on a 7-point scale. Administered after each task.
- Post-Study System Usability Questionnaire (PSSUQ): a more detailed instrument measuring system usefulness, information quality, and interface quality (Lewis, 1995).
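SUS scoring follows a fixed recipe: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5 to give a 0–100 score. A sketch:

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten responses on a
    1-5 scale (item 1 first). Odd items contribute (response - 1), even
    items contribute (5 - response); the sum is scaled by 2.5 to 0-100."""
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# All-neutral responses (3 on every item) give the midpoint score of 50.
print(sus_score([3] * 10))
```

Note that the midpoint of 50 sits well below the "above average" threshold of 68, because observed SUS scores skew high in practice.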
Quantitative metrics tell you what happened; qualitative observations tell you why. A task completion rate of 60% indicates a problem, but only observation of the failures reveals the root cause. Effective usability testing combines both: metrics to measure the severity of problems and observations to understand their causes.
Remote Usability Testing
Remote usability testing — where the participant and facilitator are in different locations — has become the dominant modality, accelerated by the COVID-19 pandemic. It offers advantages in participant recruitment (no geographic constraint), cost (no travel, no lab), and speed (easier scheduling). It sacrifices some observational richness (harder to read body language, potential technical issues with screen sharing).
Moderated Remote Testing
The facilitator conducts the session via video conferencing, sharing the participant's screen. The think-aloud protocol, task scenarios, and facilitation principles are the same as in-person testing. This is the closest remote equivalent to traditional lab testing.
Unmoderated Remote Testing
Participants complete tasks independently, recorded by specialised software (UserTesting, Maze, Lookback). The facilitator is not present during the session. Unmoderated testing scales easily (dozens of participants can test simultaneously) but produces less rich data (no follow-up probes, no real-time observation of confusion). Unmoderated testing is well suited to summative evaluation (measuring completion rates and times for a well-defined set of tasks) but less suited to formative discovery (where the facilitator's probing questions reveal why problems occur).
Analysing and Reporting Results
Problem Identification
The primary output of formative testing is a list of usability problems, each with:
- A description of the problem
- The task(s) during which it was observed
- The number of participants who encountered it
- A severity rating (critical, major, minor, cosmetic)
- A recommended fix
Severity Rating
A widely used severity scale (adapted from Nielsen, 1994):
- Cosmetic: noticed by some users but does not affect task performance
- Minor: causes minor delays or confusion but users recover
- Major: causes significant difficulty; some users fail the task
- Critical: prevents task completion for most users; causes data loss or safety risk
Reporting
Usability test reports should be concise and action-oriented. Stakeholders need:
- Key findings (the 3–5 most important problems)
- Supporting evidence (video clips, participant quotes, metrics)
- Recommended fixes (specific, actionable design changes)
- Priority ranking (which problems to fix first, based on severity and frequency)
Usability test participants are not a statistically representative sample of the user population. With 5–8 participants, confidence intervals are wide and p-values are meaningless. Does this invalidate the findings? How should formative usability test results be communicated to stakeholders who expect statistical rigour?
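One honest way to communicate small-sample results is to report completion rates with a confidence interval suited to small n. The adjusted-Wald interval is commonly recommended for this; a sketch (the 4-of-5 figure is illustrative):

```python
from math import sqrt

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted-Wald 95% confidence interval for a completion rate.
    Better behaved than the classic Wald interval at the small sample
    sizes typical of formative usability tests."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# 4 of 5 participants completed the task.
lo, hi = adjusted_wald_ci(4, 5)
print(f"95% CI: {lo:.0%} to {hi:.0%}")
```

The resulting interval (roughly 36% to 98% for 4 of 5) makes the uncertainty visible, which is often more persuasive to statistically minded stakeholders than a bare point estimate.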
Key Takeaways
- Usability testing is the empirical foundation of usability practice: observing real users with real tasks.
- Formative testing (small samples, qualitative, iterative) discovers problems; summative testing (larger samples, quantitative) measures performance.
- Five participants are sufficient for formative testing to discover most major problems; the strategy is to iterate.
- The think-aloud protocol reveals the user's reasoning, not just their actions.
- The facilitator must avoid helping, leading, or reacting to participant behaviour.
- Usability metrics span effectiveness (completion, errors), efficiency (time, path deviation), and satisfaction (SUS, SEQ).
- Remote testing (moderated and unmoderated) offers scalability advantages with some loss of observational richness.
Further Reading
- Rubin, J., & Chisnell, D. (2008). Handbook of Usability Testing (2nd ed.). Wiley.
- Krug, S. (2010). Rocket Surgery Made Easy. New Riders.
- Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. Proceedings of CHI '93, 206–213.
- Brooke, J. (1996). SUS: A "quick and dirty" usability scale. In P. W. Jordan et al. (Eds.), Usability Evaluation in Industry (pp. 189–194). Taylor & Francis.
- Ericsson, K. A., & Simon, H. A. (1984). Protocol Analysis: Verbal Reports as Data. MIT Press.