We measure failure exposure, not correctness.
Internal QA confirms the system works on the happy path. We map where and how it breaks under real pressure — scored against a per-service rubric, validated through a three-layer review system, and ranked by severity. Structured failure intelligence, not opinion.
Each service is scored on its own rubric.
There is no single universal scorecard. Each service line is evaluated on the dimensions that matter for that interaction type. Dimension weights sum to 100%, and each dimension is scored from 1.0 to 5.0. Every rubric is labeled with the service it belongs to, and the weights show what each evaluation prioritizes.
Voice Agent Testing
Reliable real-world conversation handling under live phone conditions.
Accent Handling covers native-English accents and speech variation — not multiple languages.
AI Sales / SDR Testing
Effective buyer qualification, discovery, objection handling, and conversion.
Customer Support AI Evaluation
Accurate issue resolution, customer guidance, and escalation handling.
Contact Center AI QA Program
Consistent, policy-adherent interactions across high-volume operations.
Policy & Script Adherence measures whether the agent follows your own defined rules, scripts, and disclosures — not regulatory compliance.
Enterprise AI Performance Assessment
Operational reliability, governance, and performance across production workflows.
Policy & Script Adherence covers your own rules, scripts, and disclosures — not regulatory compliance.
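As a rough illustration of how a weighted rubric could roll up into one number, the sketch below assumes the overall score is the weighted average of the dimension scores. The page does not state the exact formula, so treat that as an assumption. The `Dimension` class, the `overall_score` function, the weights, and the scores are all hypothetical; only "Accent Handling" is a dimension name taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float  # share of the rubric; all weights must sum to 1.0 (100%)
    score: float   # dimension score on the 1.0 to 5.0 scale


def overall_score(dimensions: list[Dimension]) -> float:
    """Weighted average of dimension scores (assumed roll-up, not confirmed)."""
    total_weight = sum(d.weight for d in dimensions)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 100%, got {total_weight:.0%}")
    for d in dimensions:
        if not 1.0 <= d.score <= 5.0:
            raise ValueError(f"{d.name}: score {d.score} is outside 1.0 to 5.0")
    return round(sum(d.weight * d.score for d in dimensions), 2)


# Hypothetical rubric: placeholder dimensions, weights, and scores.
voice_rubric = [
    Dimension("Accent Handling", 0.25, 4.0),
    Dimension("Placeholder dimension B", 0.50, 3.0),
    Dimension("Placeholder dimension C", 0.25, 5.0),
]
print(overall_score(voice_rubric))  # 3.75
```

Validating that the weights sum to 100% up front keeps a mis-specified rubric from silently inflating or deflating the result.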
Score interpretation
Applied uniformly across every rubric.
| Score | Interpretation |
|---|---|
| 4.5 – 5.0 | Excellent |
| 4.0 – 4.49 | Good |
| 3.0 – 3.99 | Acceptable |
| 2.0 – 2.99 | Improvement required |
| Below 2.0 | Critical concern |
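The bands above can be expressed as a small lookup. This is a minimal sketch: the function name `interpret` is hypothetical, and it assumes each band's lower bound is inclusive, so a score such as 4.495 falls in "Good" rather than between bands.

```python
def interpret(score: float) -> str:
    """Map a 1.0 to 5.0 score to its interpretation band."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside 1.0 to 5.0")
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Acceptable"
    if score >= 2.0:
        return "Improvement required"
    return "Critical concern"


print(interpret(4.49))  # Good
print(interpret(1.8))   # Critical concern
```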
A three-layer review system behind every score.
No single evaluator decides a score on their own. Every label moves through three independent layers before it reaches you — which is what makes the output consistent, auditable, and reproducible.
Execution
Evaluators label each output, tag the failure type, and attach a confidence score.
Validation
A QA reviewer re-checks the work, sampling 10–20% of every batch and 100% of the work from new evaluators, then corrects labeling errors and flags inconsistencies.
Final authority
An evaluation authority resolves disputes, adjusts scoring rules, and signs off the dataset before it is delivered.
Evaluators are themselves scored — on label accuracy, inter-rater agreement, reasoning quality, and detection sensitivity — so scoring stays consistent across the team rather than varying by who happened to run the test.
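The validation sampling rule could be sketched as follows. The record shape, the evaluator IDs, and the `select_for_review` function are hypothetical. The sketch also assumes the 10 to 20% rate applies to labels from established evaluators, with new evaluators' labels reviewed in full on top of that; the text does not say how the two rules combine.

```python
import math
import random


def select_for_review(batch, new_evaluators, rate=0.15, seed=None):
    """Pick the labels a QA reviewer re-checks.

    batch: list of dicts with an "evaluator" key (hypothetical record shape).
    new_evaluators: set of evaluator IDs whose work is reviewed in full.
    rate: share of the remaining labels to sample, between 0.10 and 0.20.
    """
    if not 0.10 <= rate <= 0.20:
        raise ValueError("rate must be between 10% and 20%")
    rng = random.Random(seed)  # seeded so a review sample can be reproduced
    from_new = [label for label in batch if label["evaluator"] in new_evaluators]
    from_established = [
        label for label in batch if label["evaluator"] not in new_evaluators
    ]
    sample_size = math.ceil(rate * len(from_established))
    return from_new + rng.sample(from_established, sample_size)


# Hypothetical batch: 5 labels from a new evaluator, 10 from an established one.
batch = (
    [{"id": i, "evaluator": "ev-new"} for i in range(5)]
    + [{"id": i, "evaluator": "ev-established"} for i in range(5, 15)]
)
to_review = select_for_review(batch, new_evaluators={"ev-new"}, rate=0.2, seed=7)
print(len(to_review))  # 7: all 5 new-evaluator labels plus 2 of the other 10
```

Passing a fixed seed makes the sample itself reproducible, which fits the auditable review trail described above.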
What keeps results consistent.
Tester independence
Testers sit outside the client team and score against the rubric, not against how the system was meant to behave.
Scenario design
Each scenario maps to a specific behavior and failure mode, defined before any testing begins.
Adversarial testing
Designed pressure, not random usage — every pass probes a behavior we expect to break.
Audit trail
Every score is backed by transcripts, recordings, timestamps, and reproduction steps.
Regression testing
The same suite re-runs across releases, so you can see whether a change fixed or broke behavior.
Structured reporting
Findings map to dimensions and severity, formatted for engineering — not marketing.
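To make the regression idea concrete, here is a minimal sketch that compares two runs of the same suite. It assumes each scenario reduces to a pass or fail result; the scenario IDs and the `compare_runs` function are hypothetical.

```python
def compare_runs(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, str]:
    """Label each scenario present in both runs as fixed, broke, or unchanged."""
    report = {}
    for scenario_id in sorted(previous.keys() & current.keys()):
        was_passing = previous[scenario_id]
        is_passing = current[scenario_id]
        if was_passing == is_passing:
            report[scenario_id] = "unchanged"
        elif is_passing:
            report[scenario_id] = "fixed"
        else:
            report[scenario_id] = "broke"
    return report


# Hypothetical results for the same three scenarios across two releases.
release_1 = {"S-01": True, "S-02": False, "S-03": True}
release_2 = {"S-01": True, "S-02": True, "S-03": False}
print(compare_runs(release_1, release_2))
# {'S-01': 'unchanged', 'S-02': 'fixed', 'S-03': 'broke'}
```

Only scenarios that appear in both runs are compared, so a scenario added or retired between releases does not register as a regression.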
What we classify against.
Every defect is tagged to a failure type, so patterns are countable across a run instead of described one finding at a time.
| Failure type | What it is |
|---|---|
| Factual error / hallucination | Confident output that is wrong or unsupported by the source material. |
| Instruction-following failure | An explicit instruction is ignored or only partially followed. |
| Context-retention failure | Earlier facts in the conversation are lost or contradicted. |
| Ambiguity / missing context | The agent proceeds on assumption instead of clarifying an unclear request. |
| Overgeneralization | A narrow case is answered with a broad, imprecise response. |
| Escalation failure | A handoff condition is met but no transfer to a human is triggered. |
| Intent misclassification | The agent misreads what the caller is actually asking for. |
| Flow breakdown | Context is lost after an interruption, correction, or topic switch. |
| Pricing deflection | The agent dodges or fumbles a direct pricing question. |
| Weak differentiation | The agent fails to articulate why its offering beats an alternative. |
| Multi-intent miss | One of several requests made in a single turn is dropped. |
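A short sketch of what "countable across a run" can look like in practice. The failure types come from the table above; the `FailureType` enum, the `count_by_type` function, and the defect IDs are hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    FACTUAL_ERROR = "Factual error / hallucination"
    INSTRUCTION_FOLLOWING = "Instruction-following failure"
    CONTEXT_RETENTION = "Context-retention failure"
    AMBIGUITY = "Ambiguity / missing context"
    OVERGENERALIZATION = "Overgeneralization"
    ESCALATION = "Escalation failure"
    INTENT_MISCLASSIFICATION = "Intent misclassification"
    FLOW_BREAKDOWN = "Flow breakdown"
    PRICING_DEFLECTION = "Pricing deflection"
    WEAK_DIFFERENTIATION = "Weak differentiation"
    MULTI_INTENT_MISS = "Multi-intent miss"


def count_by_type(defects):
    """Count defects per failure type, most frequent first."""
    return Counter(failure_type for _, failure_type in defects).most_common()


# Hypothetical run: (defect ID, failure type) pairs.
defects = [
    ("D-001", FailureType.CONTEXT_RETENTION),
    ("D-002", FailureType.ESCALATION),
    ("D-003", FailureType.CONTEXT_RETENTION),
    ("D-004", FailureType.MULTI_INTENT_MISS),
    ("D-005", FailureType.CONTEXT_RETENTION),
    ("D-006", FailureType.ESCALATION),
]
for failure_type, count in count_by_type(defects):
    print(f"{failure_type.value}: {count}")
# Context-retention failure: 3
# Escalation failure: 2
# Multi-intent miss: 1
```

Using a closed enum rather than free-text tags means a misspelled or invented failure type fails loudly instead of creating a new category in the counts.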
Four bands, ranked by impact.
Every finding is classified by impact and recurrence — so you fix what matters first, not whatever surfaced last.
| Severity | Meaning |
|---|---|
| CRITICAL | High-impact failure: blocks task completion, creates customer risk, or recurs across many interactions. |
| MAJOR | Significant defect that degrades experience or conversion but does not fully block completion. |
| MINOR | Low-impact inconsistency or edge-case failure. |
| OBSERVATIONS | Noted behaviors worth tracking — not defects. |
Illustrative distribution for a single engagement: 127 findings, skewed toward minor findings and observations. Not a client result.
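One way to turn the severity bands into a fix order is sketched below. It assumes findings are sorted by severity band first and by how often they recur second; the `Finding` fields, the `fix_order` function, and the sample findings are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority, so a plain sort puts CRITICAL first.
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    OBSERVATIONS = 3


@dataclass(frozen=True)
class Finding:
    finding_id: str
    severity: Severity
    occurrences: int  # how many interactions showed the failure


def fix_order(findings):
    """Sort by severity band, then by recurrence (most frequent first)."""
    return sorted(findings, key=lambda f: (f.severity, -f.occurrences))


# Hypothetical findings from one run.
findings = [
    Finding("F-03", Severity.MINOR, 9),
    Finding("F-01", Severity.CRITICAL, 4),
    Finding("F-04", Severity.OBSERVATIONS, 2),
    Finding("F-02", Severity.MAJOR, 6),
    Finding("F-05", Severity.MAJOR, 11),
]
print([f.finding_id for f in fix_order(findings)])
# ['F-01', 'F-05', 'F-02', 'F-03', 'F-04']
```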
One finding, end to end.
Rubric, score, severity, evidence, and fix — the chain behind every line in a report. Illustrative, and scored on the Voice Agent Testing rubric.
Why critical
High customer-risk failure, recurring across interactions — the lowest-scoring dimension in the run.
Recommendation
After any correction, re-anchor on the most recently confirmed value before proceeding.
See the method in action.
A pilot is the fastest way to understand how we score and what you get.