Live demo · Open source

Usability Engine — an audit catalog you can run.

Nielsen's 10 rewritten for modern surfaces. Plus two extensions for AI agents. Each heuristic carries its own audit question and, where they apply, its own LLM prompt and interactive demo.

Period: 2026 · live on this site
Role: Designer · Engineer
Tags: UX research · Heuristic evaluation · Local LLM · Open source

Most usability writing online — Nielsen's 10, Norman doors, the WCAG checklist — gives you the principle but not the audit. You read the heuristic. You nod. You close the tab. The product you were going to fix is still broken.

The Usability Engine is the catalog as engine. Twelve heuristics — Nielsen's 10 rewritten in the vocabulary of modern product surfaces, plus two extensions for AI: Uncertainty must be legible and Reversibility is the policy axis. Each row carries its audit question, its fix, its automation spec, and where it makes sense, an interactive good-vs-bad demo.

Type a URL in. The engine pages through each heuristic, marks a verdict, reports back. Heuristics a script can answer get a script. Ones that need judgment route through your local Ollama. Ones that need a human reading a system diagram stay manual and say so. Nothing is faked. The static site never touches the cloud.

Open the live engine · usability-engine source · heuristics.ts

heuristics.ts · 12 rows · one renderer

The catalog is the spec.

One row of data per heuristic — story, severity, audit question, fix, checkability, automation spec, optional demo key. The engine handles surface filtering, demo lookup, verdict aggregation, and report generation. Add a row, the engine picks it up.

Static export · No backend · Local Ollama, opt-in · Apache 2.0

{
  id: "user-control",
  number: "03",
  title: "User control & freedom",
  severity: "blocker",
  appliesTo: ["website","application",
              "form","mobile-app"],
  story: "...",
  auditQuestion: "...",
  fix: "...",
  checkability: "hybrid",
  automationSpec: "...",
  demo: "undo",
}
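The row above implies a shape the engine can filter on. What follows is a minimal sketch of that shape and of surface filtering, not the actual contents of heuristics.ts: the type names, the `heuristicsFor` helper, and the closed sets of surfaces and severities are assumptions inferred from the sample row and the catalog table.

```typescript
// Hypothetical types inferred from the sample row; the real file may differ.
type Surface = "website" | "application" | "form" | "mobile-app"; // assumed closed set
type Severity = "blocker" | "major"; // the only two values the table shows
type Checkability = "script" | "llm" | "hybrid" | "manual";

interface Heuristic {
  id: string;
  number: string;
  title: string;
  severity: Severity;
  appliesTo: Surface[];
  story: string;
  auditQuestion: string;
  fix: string;
  checkability: Checkability;
  automationSpec: string;
  demo?: string; // optional demo key
}

// Surface filtering: keep only the rows that declare the audited surface.
// Adding a row to the catalog needs no change here.
function heuristicsFor(catalog: Heuristic[], surface: Surface): Heuristic[] {
  return catalog.filter((h) => h.appliesTo.some((s) => s === surface));
}
```

Because the filter reads only `appliesTo`, a new row is picked up as soon as it is added to the array.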

What's in the catalog

Twelve heuristics. Ten are Nielsen's, rewritten in the vocabulary of modern product surfaces — gone is "Help and documentation" as a placid footer link; in is "help that arrives where the user is stuck." Two are mine: an AI confidence axis and an agent reversibility axis. Severity is opinionated — blockers are the ones I will not ship past.

#	Heuristic	Severity	Checkability	Note
01	Visibility of system status	blocker	hybrid	Silence is the most expensive UX bug.
02	Match the user's world, not the system's	major	llm	Jargon audit — replace anything 2+ users define differently.
03	User control & freedom	blocker	hybrid	Soft-delete with snackbar > confirmation modal.
04	Consistency & standards	major	script	Same word, same icon, same action — everywhere.
05	Error prevention	blocker	hybrid	Make the wrong state unreachable, not just recoverable.
06	Recognition over recall	major	llm	Show options. Don't make people remember them.
07	Flexibility & efficiency	major	script	Power users deserve shortcuts; novices shouldn't see them.
08	Aesthetic & minimalist design	major	llm	Every element competes for attention.
09	Recognize, diagnose, recover	blocker	llm	Errors must say what, why, and what to do next.
10	Help & documentation	major	llm	Help that arrives where the user is stuck, not buried in a menu.
11	Uncertainty must be legible	blocker	llm	New. Every AI claim shows its confidence.
12	Reversibility is the policy axis	blocker	manual	New. Map each agent action to its recovery cost.

Checkability tiers — what an audit can honestly automate

"Run an audit" is a vague verb. Some heuristics reduce to a regex on the DOM. Some are entirely judgment calls a script can never resolve. Each row in the catalog declares which it is — so the engine never pretends to answer a question it can't.

Script
A deterministic check on the DOM, accessibility tree, or rendered text. No model needed; the answer is yes/no.
e.g. Find every interactive element on the page and verify it has a visible focus ring.
LLM
A prompt against the visible content. The model evaluates judgment-shaped questions a script can't reduce to a regex.
e.g. Read every error message on the page and judge whether it tells the user what went wrong, why, and what to do next.
Hybrid
Script enumerates candidates, LLM evaluates them. The split is mechanical: scripts find the elements, models judge the quality.
e.g. Script lists every destructive button; LLM follows each click and rates whether recovery is visible without a modal.
Manual
The judgment requires reading the system architecture or the user model. Pattern detection isn't enough; this is design review.
e.g. Mapping every agentic action to its recovery cost — needs the actual blast-radius diagram and the approval-authority model.
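The hybrid split can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: `ButtonInfo`, the `DESTRUCTIVE` pattern, and `hybridCheck` are hypothetical names, and the `judge` callback stands in for whatever answer comes back from the model or the human reading it.

```typescript
// Hypothetical record for a button already extracted from the page.
interface ButtonInfo {
  label: string;
  hasVisibleUndo: boolean; // assumed to be observed during the audit
}

type Verdict = "pass" | "fail";
type Judge = (button: ButtonInfo) => Verdict;

// Script half: a deterministic pattern enumerates candidates.
// The word list is illustrative, not exhaustive.
const DESTRUCTIVE = /\b(delete|remove|discard|erase)\b/i;

function findDestructive(buttons: ButtonInfo[]): ButtonInfo[] {
  return buttons.filter((b) => DESTRUCTIVE.test(b.label));
}

// Judgment half: each candidate is handed to the judge, one at a time.
function hybridCheck(
  buttons: ButtonInfo[],
  judge: Judge,
): { label: string; verdict: Verdict }[] {
  return findDestructive(buttons).map((b) => ({
    label: b.label,
    verdict: judge(b),
  }));
}
```

Keeping the judge as a parameter means the script half can be tested without any model present.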

Two heuristics that aren't Nielsen's

Nielsen's 10 were written in 1994 for desktop GUIs. They still hold. They don't cover what generative interfaces broke. These two are the additions I argue for — both rated blocker, both shipped in the catalog.

11 · Blocker

Uncertainty must be legible.

Generative interfaces output confident prose regardless of how much they actually know. Without a visible confidence signal, the user has no way to weight the output — and repeated overconfidence erodes trust in the whole system.

Fix: a calibrated vocabulary — Confident, Likely, Unsure, Low. Reserve raw percentages for power users who hover. Show the basis for every confident claim.
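A minimal sketch of that vocabulary as a function. The four labels come from the fix above; the numeric thresholds and the names `confidenceLabel` and `hoverDetail` are assumptions for illustration, and real cut-offs would need calibrating against how often the model is actually right.

```typescript
type ConfidenceLabel = "Confident" | "Likely" | "Unsure" | "Low";

// Map a probability in [0, 1] to the calibrated vocabulary.
// Thresholds are illustrative assumptions, not values from the catalog.
function confidenceLabel(p: number): ConfidenceLabel {
  if (isNaN(p) || p < 0 || p > 1) {
    throw new RangeError("confidence must be a number between 0 and 1");
  }
  if (p >= 0.9) return "Confident";
  if (p >= 0.7) return "Likely";
  if (p >= 0.4) return "Unsure";
  return "Low";
}

// Raw percentage, intended only for the hover state.
function hoverDetail(p: number): string {
  return Math.round(p * 100) + "%";
}
```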

12 · Blocker

Reversibility is the policy axis.

"Safety" is too vague to design around. Recovery cost is the lever: how quickly, how completely, and at what cognitive cost can the user undo the agent's action?

Fix: a reversibility chip in the agent UX. Cheap to undo → run autonomously. Expensive to undo → present for human approval first. The recovery path is part of the design, not an afterthought.
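The rule in that fix reduces to a small policy function. This is a sketch under assumptions: `AgentAction`, `RecoveryCost`, and `policyFor` are hypothetical names, and treating an unmapped action as approval-required is a conservative default of this example, not something the catalog states.

```typescript
type RecoveryCost = "cheap" | "expensive";
type Policy = "run-autonomously" | "ask-for-approval";

interface AgentAction {
  name: string;
  recoveryCost?: RecoveryCost; // undefined means nobody has mapped it yet
}

// Cheap to undo: run autonomously. Expensive to undo: ask first.
// Unmapped actions fall through to approval (assumed default).
function policyFor(action: AgentAction): Policy {
  return action.recoveryCost === "cheap"
    ? "run-autonomously"
    : "ask-for-approval";
}
```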

The live audit mode

The engine has two modes. Manifesto mode is what most visitors see — twelve numbered heuristics, each with its story, an interactive demo where one is registered, and a self-audit question with a tap-to-reveal fix.

Audit mode takes a URL. The engine pages through every applicable heuristic, asks the user to mark Pass / Fail / N/A, and assembles a report with the per-heuristic verdicts and a severity-weighted tally. For heuristics with an LLM prompt, the prompt is right there in the interface — copy it, paste it into Ollama with the page text, get an answer.
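One way a severity-weighted tally could be computed from the marked verdicts is sketched below. The weights (blocker 3, major 1), the `tally` function, and the result shape are assumptions for illustration; the source does not specify how the engine weights severities.

```typescript
type Severity = "blocker" | "major";
type Verdict = "pass" | "fail" | "n/a";

interface AuditResult {
  id: string;
  severity: Severity;
  verdict: Verdict;
}

// Assumed weights: one failed blocker counts as three failed majors.
const WEIGHT: Record<Severity, number> = { blocker: 3, major: 1 };

interface Tally {
  failedWeight: number;     // summed weight of failed heuristics
  applicableWeight: number; // summed weight of everything not marked N/A
  failedBlockers: string[]; // ids that should stop a launch
}

function tally(results: AuditResult[]): Tally {
  const out: Tally = { failedWeight: 0, applicableWeight: 0, failedBlockers: [] };
  for (const r of results) {
    if (r.verdict === "n/a") continue; // N/A rows do not count either way
    out.applicableWeight += WEIGHT[r.severity];
    if (r.verdict === "fail") {
      out.failedWeight += WEIGHT[r.severity];
      if (r.severity === "blocker") out.failedBlockers.push(r.id);
    }
  }
  return out;
}
```

Listing failed blockers separately keeps a single stop-the-line failure from being averaged away by a long run of passes.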

What it doesn't do: pretend to be an autonomous crawler. The site is a static export — there is no backend, no headless browser, no cloud call. The audit is human-in-the-loop on purpose. The engine's job is to make the audit cheap and well-organised, not to fake it.
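For readers who would rather script the copy-and-paste step on their own machine, the sketch below builds a request body for Ollama's local `/api/generate` endpoint (by default at `http://localhost:11434`). It is not part of the static site, which makes no network call. The `buildOllamaRequest` helper and the page-text delimiter are assumptions of this example, and the model name is whichever one you have pulled locally.

```typescript
// Body shape accepted by Ollama's /api/generate endpoint.
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

// Combine a heuristic's LLM prompt with the page text you copied.
// The delimiter line is an arbitrary choice for this sketch.
function buildOllamaRequest(
  heuristicPrompt: string,
  pageText: string,
  model: string,
): OllamaGenerateRequest {
  return {
    model,
    prompt: `${heuristicPrompt}\n\n--- PAGE TEXT ---\n${pageText}`,
    stream: false, // ask for a single JSON reply instead of a token stream
  };
}
```

Serialise the result with `JSON.stringify` and POST it to the endpoint from your own terminal or script; with `stream` set to `false`, the model's answer comes back in the `response` field of the reply.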

Design moves I'm proud of

LESSON 01
Checkability as a first-class field, not a footnote.
Every row declares whether a script, an LLM, a hybrid, or a human eye answers its question. The engine renders the tier on the card; the report respects it. No claim is made that a manual heuristic was 'automated.'
LESSON 02
Severity is opinionated, not democratic.
Six of twelve are blockers, including both AI extensions. I'd rather over-call severity than ship a checklist where everything reads as major and nothing as a stop-the-line. A blocker says: I won't help you launch past this.
LESSON 03
Demos pair a good and a bad version side by side.
The good-vs-bad pattern is the smallest reproducible UX experiment. One artifact teaches the principle better than a paragraph of prose can. Where a heuristic has one (visibility-of-status, undo, error-prevention, recognition), that demo is the focal point of the card.
LESSON 04
Ollama is opt-in, never default.
The LLM prompts are part of the spec, not a hosted feature. Anyone with Ollama can run them on their own machine; the static site never makes a network call. The product is the catalog and the engine — the model is whichever one you brought.
LESSON 05
Two new heuristics in a place that respects Nielsen.
Adding to a canon is delicate work. The 12 are presented in number order — Nielsen's 10 keep their original numbering; the AI extensions land at 11 and 12 with their own claim. The reader can audit the lineage without being asked to take the additions on faith.

A heuristic without an audit is a poster. The Usability Engine's bet is that every principle worth writing down deserves the verb that proves it — the question you can answer, the fix you can ship, the check that catches you when you don't.

Open the live engine →
