- Conduct a heuristic evaluation using Nielsen's framework
- Apply severity ratings to prioritise identified usability problems
- Perform a cognitive walkthrough for a specific user task
- Compare the strengths and limitations of expert review methods versus user testing
- Determine when to use expert review methods versus empirical testing
Introduction
Usability testing (Chapter 15) observes real users performing real tasks. Expert review methods take a different approach: trained evaluators examine the interface and predict usability problems based on established principles and cognitive models. These methods are faster, cheaper, and less logistically demanding than user testing, making them valuable for early-stage evaluation, quick assessments, and situations where user recruitment is difficult. This chapter covers the two most widely used expert review methods: heuristic evaluation and cognitive walkthrough.
Heuristic Evaluation
Heuristic evaluation, developed by Jakob Nielsen and Rolf Molich [Nielsen, 1990], is a systematic inspection method in which evaluators examine an interface and judge its compliance with recognised usability principles (heuristics). The heuristics serve as the evaluation criteria; the evaluator's expertise and judgement determine how they are applied [Nielsen, 1994].
Procedure
- Briefing: evaluators are given a description of the user population, the primary tasks, and the context of use.
- Individual evaluation: each evaluator independently examines the interface, going through it at least twice: first to get an overall sense of the system, then to focus on specific elements.
- Problem documentation: each evaluator records every usability problem found, noting which heuristic it violates, where it occurs, and its likely impact.
- Consolidation: evaluators' findings are combined into a single list, with duplicates merged.
- Severity rating: the consolidated list is rated for severity, either by the evaluators or by a separate group.
Heuristic evaluation must be performed by multiple independent evaluators. Individual evaluators find different subsets of problems; no single evaluator finds them all [Hertzum, 2003]. Nielsen recommends 3 to 5 evaluators for a cost-effective analysis [Nielsen, 1993]. With 5 evaluators and the average hit rate discussed in the next section, roughly 85% of usability problems are expected to be found; a single evaluator covers far fewer.
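The consolidation step in the procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the report tuples, the evaluator labels, and the rule that two reports are duplicates when they share a location and a heuristic are all assumptions made for the example. In practice, deciding what counts as a duplicate is a judgement call made by the evaluators.

```python
from collections import defaultdict

# Hypothetical raw reports: (evaluator, location, heuristic, description).
raw_reports = [
    ("A", "date fields", "error prevention", "Free-typed dates are misread"),
    ("B", "date fields", "error prevention", "Ambiguous date format accepted"),
    ("B", "room dropdown", "recognition rather than recall", "Internal codes shown"),
    ("C", "room dropdown", "recognition rather than recall", "Room codes unexplained"),
    ("C", "continue button", "user control", "No way back to previous step"),
]


def consolidate(reports):
    """Merge reports that share a location and heuristic into one problem."""
    merged = defaultdict(lambda: {"evaluators": set(), "descriptions": []})
    for evaluator, location, heuristic, description in reports:
        key = (location, heuristic)  # assumed duplicate rule, for illustration only
        merged[key]["evaluators"].add(evaluator)
        merged[key]["descriptions"].append(description)
    return dict(merged)


problems = consolidate(raw_reports)
for (location, heuristic), info in problems.items():
    found_by = ", ".join(sorted(info["evaluators"]))
    print(f"{location} / {heuristic}: found by {found_by}")
```

Keeping the set of evaluators who reported each problem preserves the information about who found what, which is useful later when judging how much each additional evaluator contributed.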
How Many Evaluators?
The recommendation to use three to five evaluators rests on the same problem-discovery mathematics that governs sample size in user testing (Chapter 15). Each evaluator finds only a fraction of the problems present; Nielsen's empirical studies put the average single-evaluator hit rate at roughly 31%, with experienced specialists doing better and novices doing worse [Nielsen, 1992]. Because different evaluators stumble on different problems, the union of their findings grows as evaluators are added, following the same diminishing-returns curve seen in user testing: the proportion of problems found rises steeply with the first few evaluators and then flattens.
The practical consequence is that a single evaluator is dangerous, not because they are incompetent, but because they predictably miss roughly two-thirds of the problems, and there is no way to tell from one person's report which two-thirds were missed. A second evaluator typically finds many problems the first overlooked; a third and fourth keep adding new ones but at a slowing rate. By around five evaluators the curve has flattened to the point where each additional reviewer finds few genuinely new problems while still costing time, which is why Nielsen frames three to five as the cost-effective range rather than a hard rule [Nielsen, 1993; Nielsen, 1992]. Crucially, the evaluators must work independently and only pool their findings afterwards; if they evaluate as a group, the loudest voice anchors the others and the independence that makes multiple evaluators valuable is lost. This evaluator effect, the wide variation in what different competent reviewers find, is one of the strongest empirical results in the inspection-methods literature [Hertzum, 2003].
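The diminishing-returns curve can be reproduced with a short calculation. The sketch below assumes the simplest form of the problem-discovery model: every problem is equally easy to find, each evaluator independently finds a given problem with the same probability, and that probability is the 31% average quoted above. Under those assumptions the expected proportion found by n evaluators is 1 - (1 - 0.31)^n. Real evaluations depart from these assumptions, so the numbers are indicative only.

```python
def proportion_found(hit_rate: float, evaluators: int) -> float:
    """Expected share of problems found by at least one of n independent evaluators."""
    return 1 - (1 - hit_rate) ** evaluators


HIT_RATE = 0.31  # average single-evaluator hit rate quoted in the text

for n in range(1, 9):
    print(f"{n} evaluators: {proportion_found(HIT_RATE, n):.0%}")
```

Under this model one evaluator finds 31% of the problems, three find about 67%, and five find about 84%, while each evaluator beyond the fifth adds only a few percentage points.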
The Heuristics
The evaluation criteria are typically Nielsen's 10 heuristics (described in Chapter 8) [Nielsen, 1994]: visibility of system status, match between system and real world, user control, consistency, error prevention, recognition rather than recall, flexibility, aesthetic minimalism, error recovery, and help. Evaluators may supplement these with domain-specific heuristics, for instance, Gerhardt-Powals' cognitive engineering principles [Gerhardt-Powals, 1996] or medical device heuristics that include patient safety considerations.
A Worked Walkthrough
The value of the heuristics becomes concrete only when they are applied against a real screen. The skill of heuristic evaluation lies not in reciting the ten principles but in scanning a specific element, asking which principle it might violate, and naming the violation precisely enough that a developer can act on it. The following walkthrough applies several of Nielsen's heuristics to a single screen: the date and room selection step of a hotel-booking form.
Interface under review: the "Choose your stay" page of a hotel-booking website. The page contains a check-in date field, a check-out date field, a room-type dropdown, a guest-count stepper, and a "Continue" button. A trained evaluator works across the screen element by element.
The date fields interpret free-typed text as mm/dd/yyyy, matching the placeholder, but most of the site's international users expect dd/mm/yyyy. A British user typing 06/02/2026 (6 February) is silently interpreted as 2 June. Violates: match between system and the real world (the format contradicts the user's calendar convention) and error prevention (no constrained date picker stops the mistake). Severity 4: it is frequent, high-impact (the wrong dates are booked and charged), and persistent.
After the user picks dates, no running total or nightly rate appears; the price is shown only on the next page. Violates: visibility of system status. The user cannot tell what their choices cost until they have committed to them. Severity 3.
The room-type dropdown lists internal codes ("DLX-K", "STD-T2") rather than plain descriptions ("Deluxe, king bed"). Violates: match between system and the real world and recognition rather than recall, because the user must remember or guess what each code means. Severity 3.
Selecting a check-out date earlier than the check-in date produces a generic message, "Invalid input," with no indication of which field is wrong or how to fix it. Violates: help users recognise, diagnose, and recover from errors (the message is not expressed in plain language and offers no constructive remedy) and error prevention (the interface should not have allowed the out-of-order selection in the first place). Severity 3.
There is no visible way to return to the previous step once "Continue" is pressed, and the browser back button silently clears all selections. Violates: user control and freedom (no clearly marked emergency exit) and consistency and standards (the back button does not behave as the platform convention leads users to expect). Severity 3.
A single evaluator, working for roughly an hour, has produced five specific, principle-anchored findings with severity ratings, each phrased so that a developer knows exactly what to change.
Notice that two distinct problems (the date format and the date ordering) both touch error prevention, and that four of the five problems, the room codes among them, each violate two heuristics at once. This is typical: the heuristics overlap, and a single design flaw can be described under more than one principle. What matters is that the violation is named and located, not that it is filed under exactly one heading.
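The overlap between findings and heuristics in the walkthrough can be made explicit by indexing the findings by heuristic. The sketch below is illustrative only: the short problem labels are invented for the example, while the heuristic names are those cited in the five findings above.

```python
from collections import defaultdict

# Findings from the hotel-booking walkthrough; the short labels are illustrative.
findings = {
    "date format": ["match between system and the real world", "error prevention"],
    "hidden price": ["visibility of system status"],
    "room codes": ["match between system and the real world",
                   "recognition rather than recall"],
    "date ordering": ["help users recognise, diagnose, and recover from errors",
                      "error prevention"],
    "no way back": ["user control and freedom", "consistency and standards"],
}

# Invert the mapping: heuristic -> problems that violate it.
by_heuristic = defaultdict(list)
for problem, heuristics in findings.items():
    for heuristic in heuristics:
        by_heuristic[heuristic].append(problem)

# Heuristics touched by more than one problem, and problems citing more than one heuristic.
shared = {h: ps for h, ps in by_heuristic.items() if len(ps) > 1}
multi = [p for p, hs in findings.items() if len(hs) > 1]

for heuristic, problems in shared.items():
    print(f"{heuristic}: {', '.join(problems)}")
print("Problems citing more than one heuristic:", ", ".join(multi))
```

An index like this is a reporting convenience: it shows a development team which principles are violated most often without requiring each problem to be filed under a single heading.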
Severity Rating
After consolidation, each problem is rated for severity. Severity is not a single property of a problem; it emerges from three contributing factors [Nielsen, 1994]:
- Frequency: how often will users encounter it? Is it common or rare?
- Impact: how serious is the problem when it does occur? Can users work around it easily, or are they blocked?
- Persistence: is it a one-off annoyance that users overcome and then ignore, or a recurring obstacle that bothers them on every visit?
A cosmetic typo that appears once and is immediately understood is trivial. A confusing control that every user hits on every transaction, cannot recover from, and is forced to relearn each time is catastrophic. Nielsen's widely used scale collapses these three factors into a single ordinal rating from 0 to 4 [Nielsen, 1994]:
- 0: Not a usability problem. The evaluator disagrees that the flagged issue is a problem at all. Including this rating is useful because it captures genuine disagreement between evaluators rather than forcing a false consensus.
- 1: Cosmetic problem. Need not be fixed unless extra time is available. A slightly misaligned label or an inconsistent capitalisation that does not impede the task.
- 2: Minor usability problem. Fixing this should be given low priority. Users notice it and are mildly slowed, but they recover without help.
- 3: Major usability problem. Important to fix, so should be given high priority. Users are substantially delayed, make recoverable errors, or fail the task on first attempt before finding a workaround.
- 4: Usability catastrophe. Imperative to fix before the product can be released. Users are blocked, lose data, make unrecoverable errors, or are exposed to harm.
The rating is an ordinal judgement rather than an arithmetic product, but the three factors pull it upward together: a problem that is frequent, high-impact and persistent lands at 4, while a problem that is strong on only one factor (for example, high impact but extremely rare) is usually pulled down to a 2 or 3. Ratings are most reliable when each evaluator rates the consolidated list independently and the ratings are then averaged, because severity judgements vary between evaluators just as problem discovery does [Nielsen, 1994]. The averaged ratings give a defensible order in which to spend a finite remediation budget.
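Averaging independent ratings and ordering the problem list by the result takes only a few lines. The sketch below is a minimal illustration: the problem names and the three ratings per problem are hypothetical, and the mean is used simply because the text describes averaging.

```python
from statistics import mean

# Hypothetical independent ratings (0 to 4) from three evaluators per problem.
ratings = {
    "Price hidden until next page": [3, 2, 3],
    "Misaligned label": [1, 0, 1],
    "Date format misread": [4, 4, 3],
    "Room codes unexplained": [3, 3, 3],
}


def prioritise(ratings_by_problem):
    """Average each problem's ratings and sort from most to least severe."""
    averaged = {problem: mean(scores) for problem, scores in ratings_by_problem.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


ranked = prioritise(ratings)
for problem, score in ranked:
    print(f"{score:.2f}  {problem}")
```

A wide spread of ratings for the same problem (for instance a 0 alongside a 3) is worth discussing before the average is trusted, because it signals that the evaluators disagree about whether the issue is a problem at all.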
A heuristic evaluation of a hospital EHR system might produce findings such as:
Problem: Allergy information is not visible on the medication ordering screen (violates: visibility of system status, error prevention).
Severity: 4, usability catastrophe: high impact (potential patient harm) combined with high frequency (every medication order).
Recommendation: Display active allergies in a persistent banner on the ordering screen.
Problem: The "Cancel order" button is styled identically to the "Confirm order" button (violates: error prevention, consistency).
Severity: 3, major usability problem: high impact (wrong action) combined with medium frequency (occurs under time pressure).
Recommendation: Visually differentiate destructive and confirmatory actions using colour, size, and position.
Strengths and Limitations
Strengths:
- Fast (a single evaluator can review a system in 1 to 2 hours)
- Inexpensive (no participant recruitment, no lab)
- Can be conducted early (on wireframes, prototypes, or specifications)
- Produces actionable, specific findings
Limitations:
- Depends on evaluator expertise; novice evaluators miss problems and report false positives
- Does not reveal actual user behaviour, task times, or satisfaction
- Evaluators may disagree on severity ratings
- Cannot identify problems that arise from the user's mental model rather than from heuristic violations
Cognitive Walkthrough
The cognitive walkthrough, developed by Lewis, Polson, Wharton, and Rieman [Lewis, 1990], focuses specifically on learnability. It traces the steps required to complete a specific task and, at each step, asks whether a new user would know what to do [Wharton, 1994].
Procedure
- Define the task: specify a realistic user task and the correct sequence of actions to complete it.
- Define the user: describe the target user's goals, knowledge, and experience.
- Walk through each step: for each action in the correct sequence, the evaluator answers four questions:
  - Will the user try to achieve the right effect? (Does the user's goal match what the interface requires?)
  - Will the user notice that the correct action is available? (Is the control visible and recognisable?)
  - Will the user associate the correct action with the desired effect? (Does the label, icon, or affordance suggest the right action?)
  - If the correct action is performed, will the user see that progress is being made? (Is the feedback adequate?)
- Record failures: any step where the answer to one or more questions is "no" (or "probably not") represents a learnability problem.
The cognitive walkthrough's four questions operationalise the core of learnability: visibility (Can the user see what to do?), affordance (Does the control suggest the right action?), mapping (Does the label match the user's goal description?), and feedback (Can the user tell that the action worked?). These correspond directly to Norman's design principles (Chapter 8) [Norman, 2013]. A control that fails any of these four tests will cause problems for new users.
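The walkthrough procedure amounts to a simple record-keeping structure: one row per action, four yes/no answers per row, and a list of failures at the end. The sketch below shows one possible way to hold that record. The `Step` class, the example task, and the answers are hypothetical; a real walkthrough would also record the evaluator's reasoning for each "no".

```python
from dataclasses import dataclass

QUESTIONS = (
    "Will the user try to achieve the right effect?",
    "Will the user notice that the correct action is available?",
    "Will the user associate the correct action with the desired effect?",
    "Will the user see that progress is being made?",
)


@dataclass
class Step:
    action: str
    answers: tuple  # four booleans, one per question; False means "no" or "probably not"


def failures(steps):
    """Return (action, question) pairs for every question answered 'no'."""
    return [
        (step.action, QUESTIONS[index])
        for step in steps
        for index, passed in enumerate(step.answers)
        if not passed
    ]


# Hypothetical walkthrough of a booking task.
walkthrough = [
    Step("Enter check-in date", (True, True, True, True)),
    Step("Choose room type", (True, True, False, True)),
    Step("Press Continue", (True, True, True, False)),
]

for action, question in failures(walkthrough):
    print(f"{action}: {question}")
```

Each reported pair names both the step and the question that failed, which is what lets the finding point to a specific remedy such as a clearer label or better feedback.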
Strengths and Limitations
Strengths:
- Focuses specifically on learnability: the first-use experience
- Forces the evaluator to adopt the user's perspective
- Identifies specific points of failure in specific tasks
- Can be conducted on paper prototypes or wireframes
Limitations:
- Time-consuming (each task must be walked through step by step)
- Narrow focus (only evaluates the specific tasks analysed)
- Does not address efficiency, satisfaction, or error recovery
- Assumes a specific "correct" path, which may not match how users actually approach the task
Heuristic Evaluation versus Cognitive Walkthrough
Heuristic evaluation and the cognitive walkthrough are the two dominant expert-review methods, and they answer different questions [Polson, 1992]. Heuristic evaluation is broad and holistic: the evaluator roams freely over the whole interface, applying a fixed set of principles to whatever element they are looking at, and finding problems of any kind (efficiency, consistency, error handling, aesthetics) wherever they occur. The cognitive walkthrough is narrow and task-focused: the evaluator fixes one realistic task, defines the correct action sequence in advance, and asks the same four learnability questions at every single step. Heuristic evaluation asks "does this interface comply with good principles?"; the cognitive walkthrough asks "will a new user be able to work out what to do here?".
The trade-offs follow from that difference. Because heuristic evaluation is unstructured and fast, it covers a lot of ground cheaply but depends heavily on the evaluator's expertise and can drift toward whatever the reviewer happens to notice. Because the cognitive walkthrough is highly structured and theory-driven, it is more systematic and less dependent on individual flair, and it surfaces first-use problems that a heuristic sweep might gloss over; but it is slower, it evaluates only the tasks chosen, and it says nothing about the experience of an expert user who already knows the path. In practice the two are often combined: a heuristic evaluation gives a broad first pass across the entire product, and a cognitive walkthrough is then reserved for the small number of critical first-use tasks (account setup, an emergency action, a one-time configuration) where learnability matters most.
Pluralistic Walkthrough
The pluralistic walkthrough combines elements of expert review and user testing. A group comprising users, developers, and usability experts walks through the interface together, discussing each step. The group format generates diverse perspectives: users reveal mental model mismatches, developers explain technical constraints, and usability experts identify principle violations. The pluralistic walkthrough is particularly effective for building shared understanding among team members about usability issues. Its limitation is the social dynamics of group settings: quieter participants may defer to more vocal ones, and the presence of developers may inhibit users from expressing confusion.
Comparing Expert Review with User Testing
Expert review and user testing are complementary, not competing methods. Research comparing the two approaches [Jeffries, 1991; Hertzum, 2003] consistently finds:
- Different problems: expert review methods and user testing identify overlapping but distinct sets of problems. Expert reviewers find problems that users work around (and therefore might not report), while users encounter problems that experts fail to predict.
- False positives: expert reviewers sometimes flag issues that do not cause problems for actual users. User testing avoids this because problems are identified by observing actual difficulty.
- Context sensitivity: user testing reveals problems that arise from the user's context, mental model, and task approach, factors that expert reviewers can only approximate.
- Efficiency: expert review is faster and cheaper; user testing is more expensive but produces more ecologically valid results.
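The "overlapping but distinct" relationship between the two kinds of findings is easy to inspect once both lists use the same problem labels. The sketch below uses Python set operations on two hypothetical lists; the problem names are invented for illustration and do not come from any study.

```python
# Hypothetical problem labels from an expert review and a round of user testing.
expert_findings = {"date format", "room codes", "button contrast", "no way back"}
user_test_findings = {"date format", "no way back", "guest stepper confusion"}

found_by_both = expert_findings & user_test_findings
expert_only = expert_findings - user_test_findings
users_only = user_test_findings - expert_findings

print("Found by both:", sorted(found_by_both))
print("Expert review only:", sorted(expert_only))
print("User testing only:", sorted(users_only))
```

The comparison alone cannot say whether an item found only by experts is a false positive or a real problem that test participants happened to work around; that distinction needs further observation.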
A common pattern in practice is to use expert review early (to catch obvious problems before investing in user testing) and user testing later (to validate the design with real users). But this sequence can mean that expert review findings are treated as more urgent than user testing findings, simply because they come first. Is this the right prioritisation? Could it lead to over-investment in fixing predicted problems while missing discovered problems?
When to Use Which Method
Use heuristic evaluation when:
- The design is early-stage (wireframes, prototypes)
- Budget or time constraints prevent user testing
- You need a quick assessment of a competitor's product
- You want to identify low-hanging fruit before user testing
Use cognitive walkthrough when:
- Learnability is a primary concern (new users, infrequent use)
- The task flow is complex and sequential
- You want to evaluate whether a specific task can be completed without training
Use user testing when:
- You need to validate that the design works for real users
- Quantitative metrics (task time, completion rate, satisfaction) are needed
- The design is at a stage where user feedback can still influence changes
- You want to discover problems that experts cannot predict
Use multiple methods when:
- The stakes are high (safety-critical systems, high-traffic consumer products)
- The budget allows iterative evaluation
- You want the most comprehensive assessment possible
Key Takeaways
- Heuristic evaluation uses trained evaluators and established principles to identify usability problems without user involvement. Use 3 to 5 evaluators for adequate coverage.
- Cognitive walkthrough focuses on learnability by tracing task steps and asking whether a new user would succeed at each step.
- Expert review is faster and cheaper than user testing but identifies different (overlapping) problems and is subject to evaluator expertise and bias.
- Severity ratings based on frequency, impact, and persistence prioritise remediation efforts.
- Expert review and user testing are complementary; the strongest evaluation programmes use both.
Further Reading
- Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. Proceedings of CHI '90, 249-256.
- Nielsen, J. (1994). Heuristic evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability Inspection Methods (pp. 25-62). John Wiley & Sons.
- Wharton, C., Rieman, J., Lewis, C., & Polson, P. (1994). The cognitive walkthrough method: A practitioner's guide. In J. Nielsen & R. L. Mack (Eds.), Usability Inspection Methods (pp. 105-140). John Wiley & Sons.
- Hertzum, M., & Jacobsen, N. E. (2003). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15(1), 183-204.
- Jeffries, R., Miller, J. R., Wharton, C., & Uyeda, K. (1991). User interface evaluation in the real world: A comparison of four techniques. Proceedings of CHI '91, 119-124.