How we use LLMs across generation, accessibility review, and execution support
McKinsey's 2024 survey found that 72% of organizations have adopted AI in at least one function, but only 21% report scaled deployment. QA shows the same pattern: LLMs are powerful when constrained to a specific job inside a controlled workflow, and unreliable when used as general-purpose assistants.
Key Signals
AI Adoption: 72%. Organizations using AI in at least one business function (McKinsey Global Survey, 2024).
Scaled Deploy: 21%. Organizations with AI deployed at scale, beyond pilot stage.
Accuracy Gain: 3.2x. Context-grounded LLM generation vs generic prompting in scenario accuracy (internal benchmark, 500 scenarios).
LLM effectiveness by application area
Scenario Generation: 86%. Intent → executable scenarios. Highest LLM leverage point.
Accessibility Review: 74%. Context summarization and priority ranking for WCAG findings.
Execution Triage: 68%. Failure classification and root cause suggestion during reruns.
Product Proof
01
GenRafi uses retrieval-augmented generation (RAG) to ground LLM output in the team's own product documentation. A benchmark across 500 generated scenarios showed that RAG-grounded generation produced 3.2x fewer hallucinated steps compared to generic prompting against the same model.
02
AccessRafi uses LLM summarization to turn raw WCAG violation data into actionable review context. Instead of "element fails 4.1.2", the team sees "the payment form's card number field has no programmatic label, which means screen readers announce it as 'edit text' without context."
03
During execution triage, the LLM layer classifies failures into categories (environment, locator drift, product defect) and suggests the most likely root cause. Teams report that 73% of LLM triage suggestions match the final human determination, reducing investigation time by roughly 45%.
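The grounding idea in item 01 can be illustrated with a minimal sketch. It is not GenRafi's implementation: the function names are hypothetical, and a crude token-overlap score stands in for whatever retrieval method the product actually uses. It only shows the shape of the technique, which is to select relevant documentation first and then restrict the prompt to it.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, passage: str) -> int:
    """Count shared tokens (an assumed stand-in for embedding similarity)."""
    shared = Counter(tokenize(query)) & Counter(tokenize(passage))
    return sum(shared.values())


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return up to k passages with a non-zero overlap, best first."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked[:k] if score(query, p) > 0]


def build_grounded_prompt(intent: str, passages: list[str]) -> str:
    """Assemble a prompt that exposes only the retrieved documentation."""
    context = retrieve(intent, passages)
    lines = [
        "Use ONLY the documentation below. "
        "If a step is not supported by it, omit the step.",
        "",
    ]
    lines += [f"[doc {i + 1}] {p}" for i, p in enumerate(context)]
    lines += ["", f"Intent: {intent}"]
    return "\n".join(lines)


# Hypothetical documentation snippets for illustration only.
docs = [
    "The checkout page has a card number field and a Pay button.",
    "Password reset sends an email link.",
    "The profile page shows the user's avatar.",
]
prompt = build_grounded_prompt("pay with card on checkout", docs)
```

Passages with no overlap are dropped rather than padded in, so the model never sees documentation unrelated to the intent.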
Why general-purpose LLM usage fails in QA
The default approach — "paste your test plan into ChatGPT and ask it to generate test cases" — produces output that looks plausible but is structurally disconnected from the real product. The LLM does not know your DOM structure, your user flows, your component library, or your regression history.
This is why the adoption-versus-scale gap McKinsey describes also shows up in QA. Teams try LLMs, get impressive-looking but unusable output, and conclude that AI is not ready for testing. The problem is not the model. It is the absence of context and constraints.
A controlled LLM integration means the model receives structured context (documentation, existing scenarios, component metadata) and produces output in a constrained format (executable step sequences, not freeform text). The workflow provides the guardrails that the model cannot provide itself.
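One way to picture the "constrained format" guardrail is a validator that rejects model output before anything is executed. The sketch below is an assumption-laden illustration, not RafiRun's actual schema: the JSON shape, the action names, and the `known_targets` set are all hypothetical.

```python
import json

# Hypothetical action vocabulary; a real execution format would define its own.
ALLOWED_ACTIONS = {"navigate", "click", "fill", "assert_text"}


def validate_scenario(raw: str, known_targets: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output is accepted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not steps:
        return ["missing non-empty 'steps' list"]

    problems = []
    for i, step in enumerate(steps, start=1):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not an object")
            continue
        action = step.get("action")
        target = step.get("target")
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            problems.append(f"step {i}: unknown action {action!r}")
        # Navigation targets are URLs or paths, so only element targets are checked.
        if action != "navigate" and (
            not isinstance(target, str) or target not in known_targets
        ):
            problems.append(f"step {i}: unknown target {target!r}")
    return problems


# Hypothetical component metadata supplied by the workflow, not by the model.
targets = {"card-number", "pay-button"}
good = (
    '{"steps": [{"action": "navigate", "target": "/checkout"},'
    ' {"action": "fill", "target": "card-number", "value": "4242"}]}'
)
bad = '{"steps": [{"action": "hover", "target": "promo-banner"}]}'
```

Here `validate_scenario(good, targets)` returns an empty list, while the `bad` output is rejected for both an unknown action and a target that does not exist in the component metadata. Freeform prose fails at the JSON parsing step.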
Three specific LLM jobs inside RafiRun
Job 1: Scenario generation. GenRafi feeds the LLM a product brief, existing step patterns, and flow constraints. The output is a structured scenario (not prose) that maps to RafiRun's execution format. The model's role is translation — from human intent to executable steps — not invention.
Job 2: Accessibility review context. When AccessRafi detects a WCAG violation, the raw data (element type, failed criterion, DOM context) is passed to the LLM for summarization. The model translates technical WCAG language into developer-actionable descriptions with specific fix suggestions.
Job 3: Execution triage. When a test fails, the LLM receives the failure trace, the self-healing log, the last successful run data, and recent UI changes. It classifies the failure and suggests next steps. This is not autonomous decision-making — it is decision support that reduces the human investigation surface.
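To make Job 3 concrete, the sketch below models the triage inputs and a closed set of failure categories. All type and function names are hypothetical, and the `needs_human` fallback and the `<label>: <rationale>` reply convention are assumptions added for illustration. The point is that the model's reply is mapped onto a fixed set of labels, and every suggestion remains flagged for human confirmation.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    ENVIRONMENT = "environment"
    LOCATOR_DRIFT = "locator_drift"
    PRODUCT_DEFECT = "product_defect"
    NEEDS_HUMAN = "needs_human"  # assumed fallback when the reply is unusable


@dataclass
class TriageInput:
    failure_trace: str
    self_healing_log: str
    last_successful_run: str
    recent_ui_changes: list[str]


@dataclass
class TriageSuggestion:
    category: FailureCategory
    rationale: str
    requires_human_confirmation: bool = True  # decision support, not a verdict


def build_triage_prompt(item: TriageInput) -> str:
    """Assemble the structured context the model is asked to classify."""
    labels = ", ".join(
        c.value for c in FailureCategory if c is not FailureCategory.NEEDS_HUMAN
    )
    changes = "\n".join(f"- {c}" for c in item.recent_ui_changes) or "- none recorded"
    return (
        f"Classify the failure as one of: {labels}.\n"
        "Reply as '<label>: <one-sentence rationale>'.\n\n"
        f"Failure trace:\n{item.failure_trace}\n\n"
        f"Self-healing log:\n{item.self_healing_log}\n\n"
        f"Last successful run:\n{item.last_successful_run}\n\n"
        f"Recent UI changes:\n{changes}\n"
    )


def parse_triage_reply(reply: str) -> TriageSuggestion:
    """Map a model reply onto a known category, or fall back to a human."""
    label, _, rationale = reply.partition(":")
    key = label.strip().lower().replace(" ", "_")
    try:
        category = FailureCategory(key)
    except ValueError:
        return TriageSuggestion(FailureCategory.NEEDS_HUMAN, reply.strip())
    return TriageSuggestion(category, rationale.strip())


suggestion = parse_triage_reply("Locator drift: button id changed from pay to pay-now")
fallback = parse_triage_reply("I think it is flaky")
```

A reply that does not start with a recognized label is never guessed at; it is routed to `NEEDS_HUMAN` with the raw text preserved as the rationale.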
Model governance as a product requirement
Because LLMs are part of product behavior (generating scenarios that will be executed, summarizing findings that will be acted on), model selection is a governance decision. Teams need to know which model is being used, where the data is processed, and what happens when the model changes.
RafiRun supports model configuration at the workspace level. Enterprise teams can restrict which models are available, enforce data processing region requirements, and audit model usage alongside test execution history. This is not optional complexity: for teams in regulated industries, it is typically a procurement requirement.
The practical implication: when an LLM provider ships a model update that changes output characteristics, RafiRun's constraint layer ensures the generated scenarios still conform to the team's step patterns and format requirements. The model is interchangeable; the workflow is stable.
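The workspace-level restrictions described in this section could be expressed as a small policy check that runs before any model call. This is a sketch under stated assumptions, not RafiRun's configuration API: `WorkspacePolicy`, `AuditLog`, `authorize`, and the model and region names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class WorkspacePolicy:
    allowed_models: frozenset[str]
    allowed_regions: frozenset[str]


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)


def authorize(policy: WorkspacePolicy, model: str, region: str, log: AuditLog) -> bool:
    """Check a requested model and region against policy and record the decision."""
    allowed = model in policy.allowed_models and region in policy.allowed_regions
    # Denied requests are logged too, so the audit trail shows attempted usage.
    log.entries.append(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "region": region,
            "allowed": allowed,
        }
    )
    return allowed


# Hypothetical workspace: one approved model, EU-only processing.
policy = WorkspacePolicy(
    allowed_models=frozenset({"model-a"}),
    allowed_regions=frozenset({"eu-west"}),
)
log = AuditLog()
ok = authorize(policy, "model-a", "eu-west", log)
wrong_model = authorize(policy, "model-b", "eu-west", log)
wrong_region = authorize(policy, "model-a", "us-east", log)
```

Keeping the policy separate from the generation code is what lets a model be swapped without touching the workflow: only the `allowed_models` entry changes.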