KNK Global — Independent human evaluation for conversational AI

Methodology

We measure failure exposure, not correctness.

Internal QA confirms the system works on the happy path. We map where and how it breaks under real pressure — scored against a per-service rubric, validated through a three-layer review system, and ranked by severity. Structured failure intelligence, not opinion.

Scoring model

Each service is scored on its own rubric.

There is no single universal scorecard. Each service line is evaluated on the dimensions that matter for that interaction type. Dimension weights sum to 100%, and each dimension is scored from 1.0 to 5.0. Every rubric is labeled with the service it belongs to, and the weights show what each evaluation prioritizes.

Voice Agent Testing

Reliable real-world conversation handling under live phone conditions.

Dimension	Weight
Intent Recognition	25%
Accent Handling	20%
Response Accuracy	20%
Context Retention	20%
Conversation Quality	15%

Accent Handling covers native-English accents and speech variation — not multiple languages.

AI Sales / SDR Testing

Effective buyer qualification, discovery, objection handling, and conversion.

Dimension	Weight
Lead Qualification	20%
Discovery Questions	20%
Product Positioning	20%
Objection Handling	20%
Meeting Booking	20%

Customer Support AI Evaluation

Accurate issue resolution, customer guidance, and escalation handling.

Dimension	Weight
Response Accuracy	25%
Resolution Effectiveness	25%
Escalation Handling	20%
Customer Experience	15%
Consistency	15%

Contact Center AI QA Program

Consistent, policy-adherent interactions across high-volume operations.

Dimension	Weight
Accuracy	25%
Customer Experience	20%
Policy & Script Adherence	20%
Resolution Quality	20%
Escalation Handling	15%

Policy & Script Adherence measures whether the agent follows your own defined rules, scripts, and disclosures — not regulatory compliance.

Enterprise AI Performance Assessment

Operational reliability, governance, and performance across production workflows.

Dimension	Weight
Accuracy	20%
Customer Experience	20%
Operational Reliability	20%
Escalation Handling	15%
Policy & Script Adherence	15%
Conversation Effectiveness	10%

Policy & Script Adherence covers your own rules, scripts, and disclosures — not regulatory compliance.
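Since every rubric is weighted to 100%, a quick consistency check is easy to sketch. The snippet below transcribes the five rubrics above into a dictionary and reports any rubric whose weights do not total 100. The names `RUBRICS` and `invalid_rubrics` are illustrative, not part of any KNK Global tooling.

```python
# Rubric weights (percent) transcribed from the five service rubrics above.
RUBRICS = {
    "Voice Agent Testing": {
        "Intent Recognition": 25,
        "Accent Handling": 20,
        "Response Accuracy": 20,
        "Context Retention": 20,
        "Conversation Quality": 15,
    },
    "AI Sales / SDR Testing": {
        "Lead Qualification": 20,
        "Discovery Questions": 20,
        "Product Positioning": 20,
        "Objection Handling": 20,
        "Meeting Booking": 20,
    },
    "Customer Support AI Evaluation": {
        "Response Accuracy": 25,
        "Resolution Effectiveness": 25,
        "Escalation Handling": 20,
        "Customer Experience": 15,
        "Consistency": 15,
    },
    "Contact Center AI QA Program": {
        "Accuracy": 25,
        "Customer Experience": 20,
        "Policy & Script Adherence": 20,
        "Resolution Quality": 20,
        "Escalation Handling": 15,
    },
    "Enterprise AI Performance Assessment": {
        "Accuracy": 20,
        "Customer Experience": 20,
        "Operational Reliability": 20,
        "Escalation Handling": 15,
        "Policy & Script Adherence": 15,
        "Conversation Effectiveness": 10,
    },
}


def invalid_rubrics(rubrics):
    """Return the names of rubrics whose weights do not total 100%."""
    return [name for name, weights in rubrics.items()
            if sum(weights.values()) != 100]


print(invalid_rubrics(RUBRICS))
```

Keeping weights as integer percentages avoids floating-point drift when checking the total.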

Score interpretation

Applied uniformly across every rubric.

Score	Interpretation
4.5 – 5.0	Excellent
4.0 – 4.49	Good
3.0 – 3.99	Acceptable
2.0 – 2.99	Improvement required
Below 2.0	Critical concern
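The bands in the table can be expressed as a small lookup. This is a minimal sketch, assuming each band is defined by its lower bound, so a value such as 4.495 falls into "Good"; the table does not say how scores between 4.49 and 4.5 are handled, and the function name `interpret` is hypothetical.

```python
def interpret(score: float) -> str:
    """Map a 1.0-5.0 score to its interpretation band.

    Assumption: bands are closed on their lower bound, so anything
    below 4.5 but at or above 4.0 is "Good", and so on.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError("score must be between 1.0 and 5.0")
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Acceptable"
    if score >= 2.0:
        return "Improvement required"
    return "Critical concern"


print(interpret(2.8))
```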

Quality assurance

A three-layer review system behind every score.

No single evaluator decides a score on their own. Every label moves through three independent layers before it reaches you — which is what makes the output consistent, auditable, and reproducible.

Execution

Evaluators label each output, tag the failure type, and attach a confidence score.

Validation

A QA reviewer re-checks the work, sampling 10–20% of every batch and 100% of labels from new evaluators, correcting labeling errors and flagging inconsistency.

Final authority

An evaluation authority resolves disputes, adjusts scoring rules, and signs off the dataset before it is delivered.
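The sampling rule in the Validation layer can be sketched as follows. This is one possible reading, not a description of the actual tooling: it assumes the 10–20% rate applies to labels from established evaluators, while every label from a new evaluator is reviewed. The function name, the `evaluator` key, and the `seed` parameter are all hypothetical.

```python
import math
import random


def select_for_review(batch, new_evaluators, rate=0.10, seed=None):
    """Pick the labels a QA reviewer re-checks in one batch.

    batch: list of dicts, each with an "evaluator" key (illustrative schema).
    new_evaluators: set of evaluator names whose work is reviewed in full.
    rate: sampling rate for everyone else, between 0.10 and 0.20.
    """
    if not 0.10 <= rate <= 0.20:
        raise ValueError("rate must be between 0.10 and 0.20")
    rng = random.Random(seed)

    # Every label from a new evaluator is reviewed.
    selected = [item for item in batch if item["evaluator"] in new_evaluators]

    # A random sample of the remaining labels is reviewed, rounded up.
    rest = [item for item in batch if item["evaluator"] not in new_evaluators]
    sample_size = min(len(rest), math.ceil(rate * len(rest)))
    selected.extend(rng.sample(rest, sample_size))
    return selected
```

Passing a fixed `seed` makes the sample reproducible, which fits the audit-trail goal of being able to show which labels were re-checked.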

Evaluators are themselves scored — on label accuracy, inter-rater agreement, reasoning quality, and detection sensitivity — so scoring stays consistent across the team rather than varying by who happened to run the test.
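Inter-rater agreement can be measured in several ways, and the text does not say which statistic is used. As an illustration only, the sketch below computes Cohen's kappa, a common chance-corrected agreement measure for two raters labeling the same items.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement, and 0.0 means agreement no better than chance.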

Controls

What keeps results consistent.

Tester independence

Testers sit outside the client team and score against the rubric, not against how the system was meant to behave.

Scenario design

Each scenario maps to a specific behavior and failure mode, defined before any testing begins.

Adversarial testing

Designed pressure, not random usage — every pass probes a behavior we expect to break.

Audit trail

Every score is backed by transcripts, recordings, timestamps, and reproduction steps.

Regression testing

The same suite re-runs across releases, so you can see whether a change fixed or broke behavior.

Structured reporting

Findings map to dimensions and severity, formatted for engineering — not marketing.
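The regression-testing control above amounts to comparing per-scenario outcomes between two releases. A minimal sketch, assuming each run is recorded as a mapping from scenario ID to pass/fail; the scenario IDs and the function name `compare_runs` are hypothetical.

```python
def compare_runs(previous, current):
    """Compare two runs of the same suite.

    previous, current: dicts mapping scenario ID -> True (pass) or False (fail).
    Only scenarios present in both runs are compared.
    """
    shared = previous.keys() & current.keys()
    return {
        "fixed": sorted(s for s in shared if not previous[s] and current[s]),
        "broken": sorted(s for s in shared if previous[s] and not current[s]),
        "still_failing": sorted(s for s in shared if not previous[s] and not current[s]),
    }


# Hypothetical results for two releases.
release_1 = {"order-correction": False, "pricing-question": True, "handoff": False}
release_2 = {"order-correction": True, "pricing-question": False, "handoff": False}
print(compare_runs(release_1, release_2))
```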

Failure taxonomy

What we classify against.

Every defect is tagged to a failure type, so patterns are countable across a run instead of described one finding at a time.

Failure type	What it is
Factual error / hallucination	Confident output that is wrong or unsupported by the source material.
Instruction-following failure	An explicit instruction is ignored or only partially followed.
Context-retention failure	Earlier facts in the conversation are lost or contradicted.
Ambiguity / missing context	The agent proceeds on assumption instead of clarifying an unclear request.
Overgeneralization	A narrow case is answered with a broad, imprecise response.
Escalation failure	A handoff condition is met but no transfer to a human is triggered.
Intent misclassification	The agent misreads what the caller is actually asking for.
Flow breakdown	Context is lost after an interruption, correction, or topic switch.
Pricing deflection	The agent dodges or fumbles a direct pricing question.
Weak differentiation	The agent fails to articulate why it wins against an alternative.
Multi-intent miss	One of several requests made in a single turn is dropped.
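Tagging every defect with one failure type is what makes patterns countable. The sketch below tallies tags across a run; the findings, their IDs, and the `failure_type` key are made up for illustration.

```python
from collections import Counter

# Hypothetical findings from one run; IDs and the "failure_type" key are illustrative.
FINDINGS = [
    {"id": "F-001", "failure_type": "Context-retention failure"},
    {"id": "F-002", "failure_type": "Escalation failure"},
    {"id": "F-003", "failure_type": "Context-retention failure"},
    {"id": "F-004", "failure_type": "Multi-intent miss"},
    {"id": "F-005", "failure_type": "Context-retention failure"},
]


def count_by_failure_type(findings):
    """Count how often each failure type was tagged in a run."""
    return Counter(f["failure_type"] for f in findings)


# Print failure types from most to least frequent.
for failure_type, n in count_by_failure_type(FINDINGS).most_common():
    print(f"{failure_type}: {n}")
```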

Severity

Four bands, ranked by impact.

Every finding is classified by impact and recurrence — so you fix what matters first, not whatever surfaced last.

Severity	Meaning
CRITICAL	High-impact failure: blocks task completion, creates customer risk, or recurs across many interactions.
MAJOR	Significant defect that degrades experience or conversion but does not fully block completion.
MINOR	Low-impact inconsistency or edge-case failure.
OBSERVATIONS	Noted behaviors worth tracking — not defects.

Severity	Findings	Share
Critical	8	6%
Major	22	17%
Minor	41	32%
Observations	56	44%

Illustrative distribution for a single engagement: 127 findings, skewed toward minor and observational. Shares are rounded to the nearest percent, so they total 99%. Not a client result.
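The shares in the illustrative distribution follow directly from the counts. A short sketch of the arithmetic, assuming each share is rounded to the nearest whole percent; the names `SEVERITY_COUNTS` and `shares` are illustrative.

```python
# Counts from the illustrative distribution above.
SEVERITY_COUNTS = {"Critical": 8, "Major": 22, "Minor": 41, "Observations": 56}


def shares(counts):
    """Return each band's share of all findings, rounded to a whole percent."""
    total = sum(counts.values())
    return {band: round(100 * n / total) for band, n in counts.items()}


print(sum(SEVERITY_COUNTS.values()))
print(shares(SEVERITY_COUNTS))
```

Because each share is rounded independently, the rounded values need not add up to exactly 100.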

A scored example

One finding, end to end.

Rubric, score, severity, evidence, and fix — the chain behind every line in a report. Illustrative, and scored on the Voice Agent Testing rubric.

Scored example · Voice agent (illustrative)

Dimension	Score
Intent Recognition	4.3 / 5
Accent Handling	3.7 / 5
Response Accuracy	4.1 / 5
Context Retention	2.8 / 5
Conversation Quality	3.4 / 5
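The dimension scores above can be combined using the Voice Agent Testing weights. The page does not state that an overall score is reported or how it would be computed, so treat this as a sketch under the assumption that the overall score is the weighted mean of the dimension scores. The function name `weighted_score` is hypothetical.

```python
# Weights (percent) from the Voice Agent Testing rubric.
WEIGHTS = {
    "Intent Recognition": 25,
    "Accent Handling": 20,
    "Response Accuracy": 20,
    "Context Retention": 20,
    "Conversation Quality": 15,
}

# Illustrative dimension scores from the scored example.
SCORES = {
    "Intent Recognition": 4.3,
    "Accent Handling": 3.7,
    "Response Accuracy": 4.1,
    "Context Retention": 2.8,
    "Conversation Quality": 3.4,
}


def weighted_score(weights, scores):
    """Weighted mean of dimension scores; weights are percentages totalling 100."""
    if weights.keys() != scores.keys():
        raise ValueError("weights and scores must cover the same dimensions")
    if sum(weights.values()) != 100:
        raise ValueError("weights must total 100")
    return sum(weights[d] * scores[d] for d in weights) / 100


overall = weighted_score(WEIGHTS, SCORES)
print(f"{overall:.2f}")
```

Under that assumption the example works out to roughly 3.7, which sits in the 3.0 – 3.99 "Acceptable" band even though one dimension scores 2.8. That gap is why findings are also ranked by severity rather than read off the aggregate alone.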

Finding · context retention (illustrative)

CRITICAL

Context lost after correction

In 3 of 4 tests in which the caller changed the order mid-call, the agent reverted to the original item.

Why critical

High customer-risk failure that recurs across interactions (3 of 4 tests). Context Retention is also the lowest-scoring dimension in the run, at 2.8.

Recommendation

Re-anchor on the most recently confirmed value after any correction before proceeding.

See the method in action.

A pilot is the fastest way to understand how we score and what you get.
