Case study · AI workflow · Banking

I stopped writing UX audits. I built a Claude skill instead.

How auditing a Saudi retail bank turned into a reusable workflow now powering every banking engagement at the agency.

For three months I watched my agency rebuild the same audit from scratch on every banking engagement. Heuristic evaluation. IA review. Conversion funnels. Page inventory. Benchmarking matrix. Same dimensions, different bank, different week wasted formatting cover pages.

25 findings shipped

9 critical issues flagged

7 banking products mapped

3 days per audit cycle, down from weeks

The shift

When a major Saudi retail bank came through the door, I changed the unit of work. Instead of "deliver an audit," the goal became "build the audit as a reusable Claude skill, then run it on this bank as proof." Same scope. Different exit state. After the engagement, the framework runs in three days for any client we point it at.

This is the workflow. It is written for two readers. If you are a designer, you can copy it. If you are hiring, this is what AI literacy looks like in 2026 for a designer who is actually doing it, not posting about it.

The audit-per-engagement model cannot get better between engagements. Treat the audit as a product, and that changes.

What is broken in agency UX research

UX audits at most agencies are artisanal. Every designer writes them slightly differently. PMOs scope them in multi-week ranges because nobody can predict how long the next one will take. Clients cannot compare an audit done by one team to an audit done by another. And the same gaps show up in retros every quarter: heuristic findings without screenshots, conversion analysis without funnel maps, IA reviews that contradict the page inventory.

The problem is not that designers are bad at audits. The problem is the unit of work. If you treat the audit as a product, two things change. You stop redoing the format. And you start measuring quality against shipped artefacts, not against vibes.

Step 01

Define the ICP

Before any AI work, I wrote down who the audit is for. Not the bank. The people downstream of the audit.

The designer running it. Mid-level, 2 to 5 years in. Needs to produce a defensible report in days, not weeks. Wants to spend their thinking on the findings, not the chrome.
The client lead reading it. Marketing director or head of digital at a bank. Will skim the first three pages and drill into one or two screenshots. If the cover does not say what is wrong in five seconds, the audit is dead on arrival.
The PMO scoping it. Needs to quote effort to the client with confidence. Hates the "it depends" answer. Wants a fixed unit of work they can multiply.

Three ICPs, one audit. Every choice in the system answers to one of them.

Step 02

Map the JTBD

Once the people are named, the jobs follow. I used a Jobs-to-be-Done frame because it forces functional, emotional, and social outcomes onto the same page, which is how audits actually get evaluated by clients.

Tell the bank where the website is broken, with evidence.
Tell the bank how broken it is, on a scale comparable to other banks.
Show the bank where it sits next to competitors, scored on the same dimensions.
Tell the bank what to fix first, with effort and impact called out.
Give the bank a paper trail their internal stakeholders can defend.

Five jobs. Each one maps to a section of the final report. Each section maps to a measurable dimension.

Step 03

Build the dimensions

I split the audit into four dimensions, each scored independently.

Foundation

Brand, accessibility, performance, content hygiene, multilingual readiness, technical health.

User needs fulfilment

Whether users can find, understand, and evaluate what they need across 19 weighted jobs.

Conversion and funnel

CTA visibility, application path completeness, form quality, post-submission follow-up.

Service design and post-action

What happens after submit. Cross-channel coherence, status tracking, service recovery, lifetime engagement.

Each dimension is scored on the same scale (max 25 points), with the same evidence structure, finding-card format, and status thresholds.
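A minimal sketch of how that shared scale could be encoded. The 25-point maximum and the four dimension names come from the framework above; the status bands, their labels, and the example scores are made up for illustration and are not the agency's thresholds or the bank's results.

```python
from dataclasses import dataclass

MAX_POINTS = 25  # per-dimension maximum, as stated in the framework

# Hypothetical status bands: (minimum share of MAX_POINTS, label).
# The real thresholds are not published in this case study.
STATUS_BANDS = [(0.8, "strong"), (0.5, "needs work"), (0.0, "critical")]


@dataclass(frozen=True)
class DimensionScore:
    name: str
    points: float

    def __post_init__(self):
        # Reject scores outside the shared scale so every dimension stays comparable.
        if not 0 <= self.points <= MAX_POINTS:
            raise ValueError(f"{self.name}: points must be between 0 and {MAX_POINTS}")

    @property
    def status(self) -> str:
        ratio = self.points / MAX_POINTS
        for floor, label in STATUS_BANDS:
            if ratio >= floor:
                return label
        return STATUS_BANDS[-1][1]


def scorecard(scores):
    """Summarise independently scored dimensions into one comparable card."""
    return {
        "total": sum(s.points for s in scores),
        "max": MAX_POINTS * len(scores),
        "statuses": {s.name: s.status for s in scores},
    }


# Illustrative numbers only.
example = scorecard([
    DimensionScore("Foundation", 21),
    DimensionScore("User needs fulfilment", 14),
    DimensionScore("Conversion and funnel", 9),
    DimensionScore("Service design and post-action", 12.5),
])
print(example)
```

Because every dimension uses the same maximum and the same bands, two banks scored by two different designers land on the same card shape, which is what makes the benchmark comparable.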

Once the dimensions were locked, the audit was no longer a designer's improvisation. It was a system any one of us could run.

Step 04

Turn it into a Claude skill

Here is where the work gets interesting for hiring teams. A Claude skill is a packaged set of instructions, references, and assets that Claude loads when triggered. You write the skill once. It runs forever.

I built three:

gravitas-css

The agency's report stylesheet, codified as a skill. Brand tokens, component classes, page layouts. Any audit Claude generates now ships in the house style with zero manual chrome.

d3-conversion-audit

Encodes D3, the conversion and funnel dimension, end-to-end. Six criteria, scoring rubric, evidence-row format, finding-card output. Trigger phrases: "run D3," "conversion audit," "evaluate URL for conversion."

heuristic-evaluation

Encodes the heuristic evaluation with severity tagging, RTL awareness for Arabic interfaces, accessibility cross-checks, and screenshot-evidence cards.
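To make "packaged set of instructions" concrete, here is a rough sketch that assembles the instruction file for a skill. It assumes the skill is a folder whose main file opens with front matter carrying a name and a description, which is my reading of Anthropic's skill documentation; check the current docs before relying on the exact layout. The helper `build_skill_md`, the description wording, and the three steps are hypothetical and are not the agency's actual skill.

```python
def build_skill_md(name, description, triggers, steps):
    """Return the text of a skill instruction file with simple front matter.

    Assumption: front matter with `name` and `description` fields is what
    tells the model when to load the skill. Verify against current docs.
    """
    trigger_text = ", ".join(f'"{t}"' for t in triggers)
    lines = [
        "---",
        f"name: {name}",
        f"description: {description} Use when the user says {trigger_text}.",
        "---",
        "",
        f"# {name}",
        "",
        "## Steps",
    ]
    # Number the steps so the instructions read as an ordered procedure.
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines) + "\n"


skill_text = build_skill_md(
    name="d3-conversion-audit",
    description="Scores a banking site on the conversion and funnel dimension.",
    triggers=["run D3", "conversion audit", "evaluate URL for conversion"],
    steps=[
        "Score each criterion against the rubric.",       # hypothetical wording
        "Record one evidence row per criterion.",         # hypothetical wording
        "Emit one finding card per issue found.",         # hypothetical wording
    ],
)
print(skill_text)
```

The point of the sketch is the shape: who triggers it, what it produces, what rules it follows. The rubric and assets would live alongside this file in the same folder.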

For the dimensions I have not yet had time to turn into skills (D1 foundation, D2 user needs fulfilment, D4 service design and post-action), I wrote prompts in the same shape. A designer pastes the prompt into a fresh Claude session, fills in the site name and competitors, and the audit runs.
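A fill-in-the-blanks prompt of that kind could look something like the sketch below. The prompt wording, the `fill_prompt` helper, and the placeholder bank and competitor names are all illustrative assumptions; only the two blanks (site name and competitors) and the 25-point scale come from the workflow itself.

```python
from string import Template

# Hypothetical prompt text; the agency's real prompts are not reproduced here.
DIMENSION_PROMPT = Template(
    "You are running the $dimension audit for $site_name.\n"
    "Benchmark against: $competitors.\n"
    "Score out of 25, give evidence for every score, "
    "and output one finding card per issue."
)


def fill_prompt(template, dimension, site_name, competitors):
    """Fill the blanks a designer would otherwise type in by hand."""
    if not competitors:
        raise ValueError("at least one competitor is required")
    return template.substitute(
        dimension=dimension,
        site_name=site_name,
        competitors=", ".join(competitors),
    )


prompt = fill_prompt(
    DIMENSION_PROMPT,
    dimension="D1 foundation",
    site_name="Example Bank",                    # placeholder, not the client
    competitors=["Competitor A", "Competitor B"],  # placeholders
)
print(prompt)
```

`Template.substitute` raises a `KeyError` if a blank is left unfilled, which is a cheap guard against pasting a half-completed prompt into a session.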

AI literacy in 2026 design work is not "I use ChatGPT to summarise interviews." It is identifying which parts of your workflow are repeatable, building those parts into reusable systems, and using the time saved to think harder about the parts that are not.

Step 05

Run it on the Saudi bank

With the framework in place, I pointed it at the bank's site. The audit produced four artefacts. Each one took hours, not weeks.

Benchmarking Framework. Defined the four dimensions, the weighting, the competitor scope, the scorecard format. Approved by the client before any scoring started.
Heuristic Evaluation Report. 25 pages, 25 findings, 9 critical, severity-coded, each anchored to an annotated screenshot. Includes a dedicated Arabic / RTL section, a top-five friction list, a call-deflection opportunities table, and a full WCAG accessibility audit.
Phase 1 Audit Report. 10 pages. Manager-grade synthesis covering all four dimensions plus the JTBD framework, conversion funnels across seven banking products, and Phase 2 / Phase 3 scope.
Heuristic Evaluation Template. Reusable template now used by other designers in the agency. Shipped with the skill so the format stays consistent across clients.
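The finding cards behind a report like this can be kept as structured data, so the headline counts and the "no finding without a screenshot" rule are checked rather than eyeballed. This is a sketch under assumptions: the field names, the severity levels other than critical, and the three sample cards are hypothetical and are not findings from the engagement.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"   # hypothetical level
    MINOR = "minor"   # hypothetical level


@dataclass(frozen=True)
class FindingCard:
    finding_id: str
    title: str
    severity: Severity
    screenshot: str            # path to the annotated screenshot
    rtl_specific: bool = False  # True when the issue only appears in the Arabic / RTL view


def missing_evidence(cards):
    """Return the ids of cards that have no screenshot attached."""
    return [c.finding_id for c in cards if not c.screenshot.strip()]


def severity_summary(cards):
    """Count cards per severity level, including levels with zero cards."""
    counts = Counter(c.severity for c in cards)
    return {level.value: counts.get(level, 0) for level in Severity}


# Made-up sample cards for illustration.
cards = [
    FindingCard("H-01", "Apply button hidden below the fold", Severity.CRITICAL, "shots/h01.png"),
    FindingCard("H-02", "Mirrored icons in Arabic navigation", Severity.MAJOR, "shots/h02.png", rtl_specific=True),
    FindingCard("H-03", "Inconsistent footer link labels", Severity.MINOR, ""),
]
print(severity_summary(cards))
print(missing_evidence(cards))
```

Running the evidence check before export is one way to stop the "heuristic findings without screenshots" gap from reaching the client.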

What changed after the engagement

Total elapsed time from kickoff to client-ready deliverables: three days of audit work, two days of synthesis. Compare that to the multi-week ranges PMOs usually scope for a banking website audit.

The skill is now agency property. The same framework has been pointed at three other digital benchmarking targets in 2026 already: a regional commercial bank in the UAE, a Saudi neobank competitor study, and a comparative audit across embassy websites. Each one took the framework and added the client's specifics. Each one shipped in days.

The artefacts are the same shape. The findings are different. The audit is no longer a one-off.

For designers

Treat your repeating work as a product. If you are doing the same audit twice, you are doing it wrong. Spend the second time turning it into a system.
Start with the ICP, not the AI. Most AI design experiments fail because they are solutions looking for a workflow. Define the three people downstream of your output first.
A Claude skill is just a brief. If you can write a design brief, you can write a skill. Who triggers it, what it produces, what rules it follows, what artefacts it uses.

For hiring teams

A designer who treats AI as a process tool, not a content tool. Knows the difference between "use Claude for one task" and "build a Claude skill so a team can repeat the work."
Shipped this approach to a major Gulf banking client and is iterating on it across other engagements.
The harder skill on display here is not prompt engineering. It is deciding what gets automated and what stays in the designer's head.
The next iteration adds a measurement layer at the production end: scoring shipped audits against principle adherence over time.

References

Microsoft HAX (Human-AI eXperience) toolkit — heuristics for AI interaction design.
Google PAIR (People + AI Research) guidebook — patterns for human-AI interaction.
NN/g 2025 research on AI adoption among UX professionals — adoption is uneven, mostly content summarisation, rarely workflow design.
Anthropic documentation on Claude skills — the unit of reusable agent work.

Caveat: client name, internal stakeholder names, and engagement-specific commercials withheld under NDA. The framework is described in full because it is the author's IP.