Methodology

We measure failure exposure, not correctness.

Internal QA confirms the system works on the happy path. We map where and how it breaks under real pressure — scored against a per-service rubric, validated through a three-layer review system, and ranked by severity. Structured failure intelligence, not opinion.

Scoring model

Each service is scored on its own rubric.

There is no single universal scorecard. Each service line is evaluated on the dimensions that matter for that interaction type — weighted to 100% and scored from 1.0 to 5.0. Every rubric is labeled with the service it belongs to, and weights show what each evaluation prioritizes.

Voice Agent Testing

Reliable real-world conversation handling under live phone conditions.

Intent Recognition: 25%
Accent Handling: 20%
Response Accuracy: 20%
Context Retention: 20%
Conversation Quality: 15%

Accent Handling covers native-English accents and speech variation — not multiple languages.

AI Sales / SDR Testing

Effective buyer qualification, discovery, objection handling, and conversion.

Lead Qualification: 20%
Discovery Questions: 20%
Product Positioning: 20%
Objection Handling: 20%
Meeting Booking: 20%

Customer Support AI Evaluation

Accurate issue resolution, customer guidance, and escalation handling.

Response Accuracy: 25%
Resolution Effectiveness: 25%
Escalation Handling: 20%
Customer Experience: 15%
Consistency: 15%

Contact Center AI QA Program

Consistent, policy-adherent interactions across high-volume operations.

Accuracy: 25%
Customer Experience: 20%
Policy & Script Adherence: 20%
Resolution Quality: 20%
Escalation Handling: 15%

Policy & Script Adherence measures whether the agent follows your own defined rules, scripts, and disclosures — not regulatory compliance.

Enterprise AI Performance Assessment

Operational reliability, governance, and performance across production workflows.

Accuracy: 20%
Customer Experience: 20%
Operational Reliability: 20%
Escalation Handling: 15%
Policy & Script Adherence: 15%
Conversation Effectiveness: 10%

Policy & Script Adherence covers your own rules, scripts, and disclosures — not regulatory compliance.
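
As a rough illustration of how a weighted rubric could roll up into one number, the sketch below assumes the overall score is the weighted average of the dimension scores. The page does not state that formula, so `overall_score` and the dictionary layout are hypothetical. The weights are the Voice Agent Testing rubric, and the dimension scores are the illustrative ones from the scored example later on this page.

```python
# Voice Agent Testing weights, as listed in the rubric (percent).
VOICE_WEIGHTS = {
    "Intent Recognition": 25,
    "Accent Handling": 20,
    "Response Accuracy": 20,
    "Context Retention": 20,
    "Conversation Quality": 15,
}


def overall_score(scores, weights):
    """Weighted average of 1.0-5.0 dimension scores (assumed roll-up formula)."""
    if sum(weights.values()) != 100:
        raise ValueError("rubric weights must total 100%")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 1.0 <= value <= 5.0:
            raise ValueError(f"{name} score {value} is outside 1.0-5.0")
    return sum(scores[name] * weights[name] for name in weights) / 100


# Illustrative dimension scores taken from the scored example on this page.
example_scores = {
    "Intent Recognition": 4.3,
    "Accent Handling": 3.7,
    "Response Accuracy": 4.1,
    "Context Retention": 2.8,
    "Conversation Quality": 3.4,
}

print(f"{overall_score(example_scores, VOICE_WEIGHTS):.3f}")  # 3.705
```

Checking that the weights total 100 before scoring catches a mistyped rubric early, which matters when each service line carries its own weights.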

Score interpretation

Applied uniformly across every rubric.

Score | Interpretation
4.5 – 5.0 | Excellent
4.0 – 4.49 | Good
3.0 – 3.99 | Acceptable
2.0 – 2.99 | Improvement required
Below 2.0 | Critical concern
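
The bands above translate into a simple lookup. This is a minimal sketch with a hypothetical function name, `interpret`. It assumes scores are rounded to two decimals before banding, since the table leaves a gap between 4.49 and 4.5 and the page does not say how such values are handled.

```python
def interpret(score):
    """Map a 1.0-5.0 score to the interpretation bands in the table above."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("score must be between 1.0 and 5.0")
    # Assumption: round to two decimals first, so a value such as 4.496
    # lands in a band instead of the gap between 4.49 and 4.5.
    score = round(score, 2)
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Acceptable"
    if score >= 2.0:
        return "Improvement required"
    return "Critical concern"


print(interpret(2.8))  # Improvement required
```
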
Quality assurance

A three-layer review system behind every score.

No single evaluator decides a score on their own. Every label moves through three independent layers before it reaches you — which is what makes the output consistent, auditable, and reproducible.

1. Execution

Evaluators label each output, tag the failure type, and attach a confidence score.

2. Validation

A QA reviewer re-checks the work (sampling 10–20% of every batch and 100% for new evaluators), correcting labeling errors and flagging inconsistency.

3. Final authority

An evaluation authority resolves disputes, adjusts scoring rules, and signs off the dataset before it is delivered.
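
The sampling rule in the validation layer can be sketched as follows. This is one reading of the rule, not a documented implementation: labels from new evaluators are reviewed in full, and everyone else's labels are sampled at a rate inside the stated 10–20% range. The function name, the tuple layout, and the 15% default are assumptions made for illustration.

```python
import math
import random


def select_for_review(batch, new_evaluators, rate=0.15, seed=None):
    """Pick labels for the QA re-check.

    batch: list of (label_id, evaluator_id) tuples (hypothetical layout).
    new_evaluators: set of evaluator ids whose work is reviewed in full.
    rate: sampling rate for all other evaluators; the page gives 10-20%.
    """
    if not 0.10 <= rate <= 0.20:
        raise ValueError("rate should stay within the stated 10-20% range")
    rng = random.Random(seed)
    full = [item for item in batch if item[1] in new_evaluators]
    rest = [item for item in batch if item[1] not in new_evaluators]
    # Round up so a small batch still gets at least one sampled label.
    k = math.ceil(len(rest) * rate) if rest else 0
    return full + rng.sample(rest, k)


# Hypothetical batch: labels 0-4 come from a new evaluator, 5-24 from a veteran.
batch = [(i, "new" if i < 5 else "vet") for i in range(25)]
picked = select_for_review(batch, {"new"}, rate=0.2, seed=1)
print(len(picked))
```

Passing a seed makes the sample repeatable, which helps when a reviewer's selection has to be audited later.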

Evaluators are themselves scored — on label accuracy, inter-rater agreement, reasoning quality, and detection sensitivity — so scoring stays consistent across the team rather than varying by who happened to run the test.
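
The page does not say which statistic is used for inter-rater agreement. One common choice is Cohen's kappa, which corrects raw agreement between two evaluators for the agreement expected by chance. The sketch below is an illustration of that statistic under that assumption, with hypothetical label lists.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two evaluators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each evaluator's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both evaluators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical severity labels from two evaluators on the same four findings.
evaluator_a = ["MAJOR", "MINOR", "MAJOR", "MINOR"]
evaluator_b = ["MAJOR", "MINOR", "MINOR", "MINOR"]
print(cohen_kappa(evaluator_a, evaluator_b))  # 0.5
```

Here the evaluators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.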

Controls

What keeps results consistent.

Tester independence

Testers sit outside the client team and score against the rubric, not against how the system was meant to behave.

Scenario design

Each scenario maps to a specific behavior and failure mode, defined before any testing begins.

Adversarial testing

Designed pressure, not random usage — every pass probes a behavior we expect to break.

Audit trail

Every score is backed by transcripts, recordings, timestamps, and reproduction steps.

Regression testing

The same suite re-runs across releases, so you can see whether a change fixed or broke behavior.

Structured reporting

Findings map to dimensions and severity, formatted for engineering — not marketing.
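
The regression-testing control above amounts to comparing the same scenarios across two releases. A minimal sketch, assuming each run is stored as a mapping from scenario id to a pass/fail flag; the function name and the scenario ids are hypothetical.

```python
def compare_runs(previous, current):
    """Classify each scenario by how its pass/fail result changed.

    previous, current: dicts mapping scenario id -> True (pass) or False (fail).
    Only scenarios present in both runs are compared.
    """
    changes = {"fixed": [], "broke": [], "still_failing": [], "still_passing": []}
    for scenario in sorted(previous.keys() & current.keys()):
        before, after = previous[scenario], current[scenario]
        if not before and after:
            changes["fixed"].append(scenario)
        elif before and not after:
            changes["broke"].append(scenario)
        elif not before:
            changes["still_failing"].append(scenario)
        else:
            changes["still_passing"].append(scenario)
    return changes


# Hypothetical results for the same suite on two releases.
release_1 = {"order-correction": False, "pricing-question": True, "handoff": False}
release_2 = {"order-correction": True, "pricing-question": False, "handoff": False}
print(compare_runs(release_1, release_2))
```

Keeping "broke" separate from "still_failing" is the point of re-running an unchanged suite: it shows which failures a release introduced rather than inherited.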

Failure taxonomy

What we classify against.

Every defect is tagged to a failure type, so patterns are countable across a run instead of described one finding at a time.

Failure type | What it is
Factual error / hallucination | Confident output that is wrong or unsupported by the source material.
Instruction-following failure | An explicit instruction is ignored or only partially followed.
Context-retention failure | Earlier facts in the conversation are lost or contradicted.
Ambiguity / missing context | The agent proceeds on assumption instead of clarifying an unclear request.
Overgeneralization | A narrow case is answered with a broad, imprecise response.
Escalation failure | A handoff condition is met but no transfer to a human is triggered.
Intent misclassification | The agent misreads what the caller is actually asking for.
Flow breakdown | Context is lost after an interruption, correction, or topic switch.
Pricing deflection | The agent dodges or fumbles a direct pricing question.
Weak differentiation | The agent fails to articulate why it wins against an alternative.
Multi-intent miss | One of several requests made in a single turn is dropped.
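
Because every defect carries a tag from a fixed list, counting patterns across a run is a small tallying step. The sketch below assumes tags are stored as plain strings matching the taxonomy names; the function name and the sample tags are hypothetical and are not client data.

```python
from collections import Counter

# Failure types exactly as named in the taxonomy table.
FAILURE_TYPES = {
    "Factual error / hallucination",
    "Instruction-following failure",
    "Context-retention failure",
    "Ambiguity / missing context",
    "Overgeneralization",
    "Escalation failure",
    "Intent misclassification",
    "Flow breakdown",
    "Pricing deflection",
    "Weak differentiation",
    "Multi-intent miss",
}


def count_failure_types(tags):
    """Tally failure-type tags, rejecting any tag outside the taxonomy."""
    unknown = set(tags) - FAILURE_TYPES
    if unknown:
        raise ValueError(f"tags outside the taxonomy: {sorted(unknown)}")
    return Counter(tags)


# Hypothetical tags from one run, for illustration only.
run_tags = [
    "Context-retention failure",
    "Escalation failure",
    "Context-retention failure",
    "Multi-intent miss",
    "Context-retention failure",
]
for failure_type, count in count_failure_types(run_tags).most_common():
    print(f"{failure_type}: {count}")
```
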
Severity

Four bands, ranked by impact.

Every finding is classified by impact and recurrence — so you fix what matters first, not whatever surfaced last.

Severity | Meaning
CRITICAL | High-impact failure: blocks task completion, creates customer risk, or recurs across many interactions.
MAJOR | Significant defect that degrades experience or conversion but does not fully block completion.
MINOR | Low-impact inconsistency or edge-case failure.
OBSERVATIONS | Noted behaviors worth tracking, not defects.

Critical: 8 · 6%
Major: 22 · 17%
Minor: 41 · 32%
Observations: 56 · 44%

Illustrative distribution from a single engagement — 127 findings, skewed to minor and observational. Not a client result.
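
The percentages in the illustrative distribution follow directly from the counts. The short sketch below reproduces them, assuming each share is rounded to the nearest whole percent; `distribution` is a hypothetical helper name.

```python
SEVERITY_ORDER = ["CRITICAL", "MAJOR", "MINOR", "OBSERVATIONS"]

# Counts from the illustrative distribution on this page.
counts = {"CRITICAL": 8, "MAJOR": 22, "MINOR": 41, "OBSERVATIONS": 56}


def distribution(counts):
    """Return {band: (count, whole-number percent)} in severity order."""
    total = sum(counts.values())
    return {
        band: (counts[band], round(100 * counts[band] / total))
        for band in SEVERITY_ORDER
    }


for band, (count, percent) in distribution(counts).items():
    print(f"{band}: {count} · {percent}%")
```

The four counts total 127, while the rounded percentages total 99 rather than 100. That is a rounding effect, not a missing finding.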

A scored example

One finding, end to end.

Rubric, score, severity, evidence, and fix — the chain behind every line in a report. Illustrative, and scored on the Voice Agent Testing rubric.

Scored example · Voice agent (illustrative)

Intent Recognition: 4.3 / 5
Accent Handling: 3.7 / 5
Response Accuracy: 4.1 / 5
Context Retention: 2.8 / 5
Conversation Quality: 3.4 / 5

Finding · context retention (illustrative)

Severity: CRITICAL

Context lost after correction

Caller changed the order mid-call on 3 of 4 tests; the agent reverted to the original item.

Why critical

High customer-risk failure, recurring across interactions — the lowest-scoring dimension in the run.

Recommendation

Re-anchor on the most recently confirmed value after any correction before proceeding.
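
The chain of rubric, score, severity, evidence, and fix can be pictured as one record per finding. The sketch below is a hypothetical data structure, not the actual report schema; field names are assumptions, and the values are copied from the illustrative finding above. In a real record, the evidence field would presumably point to transcripts, recordings, timestamps, and reproduction steps rather than a one-line summary.

```python
from dataclasses import dataclass

# Lower rank sorts first, matching the order of the four severity bands.
SEVERITY_RANK = {"CRITICAL": 0, "MAJOR": 1, "MINOR": 2, "OBSERVATIONS": 3}


@dataclass
class Finding:
    dimension: str       # rubric dimension the finding maps to
    score: float         # dimension score, 1.0-5.0
    severity: str        # one of the four severity bands
    title: str
    evidence: str        # summary only; see the note in the lead-in
    recommendation: str

    def __post_init__(self):
        if self.severity not in SEVERITY_RANK:
            raise ValueError(f"unknown severity: {self.severity}")
        if not 1.0 <= self.score <= 5.0:
            raise ValueError("score must be between 1.0 and 5.0")


finding = Finding(
    dimension="Context Retention",
    score=2.8,
    severity="CRITICAL",
    title="Context lost after correction",
    evidence="Caller changed the order mid-call on 3 of 4 tests; "
             "the agent reverted to the original item.",
    recommendation="Re-anchor on the most recently confirmed value "
                   "after any correction before proceeding.",
)


def by_severity(findings):
    """Order findings so the highest-impact band comes first."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])


print(by_severity([finding])[0].title)  # Context lost after correction
```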

See the method in action.

A pilot is the fastest way to understand how we score and what you get.