
Evaluation Worksheet

Use this worksheet to design an evaluation system for your GenAI application. A good evaluation is essential for shipping reliable AI features.

Step 1: Define What You're Evaluating

Before writing tests, define what "good" means for your system.

Functional Requirements

| Question | Your Answer |
|----------|-------------|
| What task does the system perform? | |
| What are acceptable outputs? | |
| What are clearly unacceptable outputs? | |
| Are there any safety requirements? | |

Quality Dimensions

Rate importance (1-5) for your use case:

| Dimension | Score | What It Means |
|-----------|-------|---------------|
| Accuracy | /5 | Output is factually correct |
| Relevance | /5 | Output addresses the user's question |
| Completeness | /5 | Output includes all necessary information |
| Consistency | /5 | Similar inputs produce similar outputs |
| Safety | /5 | No harmful or inappropriate content |
| Latency | /5 | Response time is acceptable |

Step 2: Build Your Golden Dataset

A golden dataset is a collection of inputs with expected outputs.

Golden Dataset Template

```csv
id,input,expected_output,expected_category,notes
g001,"How do I reset my password?","Click 'Forgot Password' on the login page...",informational,Main path
g002,"I can't login","Check if you're using the correct email...",troubleshooting,Common issue
g003,"Delete all my data",REJECT,safety,Safety test - should refuse
g004,"What's 2+2?","4",informational,Edge case - factual
```

Golden Dataset Guidelines

| Guideline | Why |
|-----------|-----|
| Cover main paths | At least 20-30 examples for core functionality |
| Include edge cases | Boundary conditions, unusual inputs |
| Add negative examples | Invalid inputs, safety tests, should-fail cases |
| Update regularly | Add examples from production failures |

Example: Golden Dataset Entry

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    id: str
    input: str
    expected_output: str | None  # None for should-fail cases
    expected_category: str
    metadata: dict = field(default_factory=dict)

# Example
golden = GoldenExample(
    id="qa_001",
    input="How do I reset my password?",
    expected_output="Click 'Forgot Password' on the login page...",
    expected_category="informational",
    metadata={"channel": "chat", "user_type": "new"}
)
```
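
To connect the CSV template above to this dataclass, a loader along these lines can work. This is a minimal sketch: the `load_golden_dataset` name and the choice to store `REJECT` rows as `None` are illustrative assumptions, not part of any library.

```python
import csv

def load_golden_dataset(path: str) -> list[GoldenExample]:
    """Load golden examples from a CSV in the template format above."""
    examples = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            expected = row["expected_output"]
            examples.append(GoldenExample(
                id=row["id"],
                input=row["input"],
                # REJECT marks should-fail cases; store those as None
                expected_output=None if expected == "REJECT" else expected,
                expected_category=row["expected_category"],
                metadata={"notes": row.get("notes", "")},
            ))
    return examples
```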

Step 3: Define Evaluation Metrics

Quantitative Metrics

| Metric | How to Measure | Target |
|--------|----------------|--------|
| Exact match | Output == expected exactly | Varies |
| Contains key phrases | Output contains required terms | >95% |
| Semantic similarity | Embedding similarity to expected | >0.85 |
| Schema compliance | Output matches required JSON schema | 100% |
| Classification accuracy | Correct category assigned | >90% |
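
For reference, minimal implementations of the string-based metrics might look like the sketch below. Semantic similarity is omitted because it depends on your embedding model, and the function names here are illustrative, not from any library.

```python
import json

def exact_match(output: str, expected: str) -> bool:
    """Strict equality after trimming surrounding whitespace."""
    return output.strip() == expected.strip()

def contains_key_phrases(output: str, phrases: list[str]) -> bool:
    """Pass only if every required phrase appears (case-insensitive)."""
    text = output.lower()
    return all(p.lower() in text for p in phrases)

def schema_compliant(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as JSON and contains the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= set(data)
```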

Qualitative Checks (Human Review)

| Check | Frequency | Who |
|-------|-----------|-----|
| Random sampling | 5% of outputs | QA team |
| Escalated issues | 100% | Senior review |
| Safety concerns | 100% | Safety team |
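
The random-sampling row can be automated by drawing the review batch programmatically. A minimal sketch, where the output record structure and the `sample_for_review` helper are assumptions:

```python
import random

def sample_for_review(outputs: list[dict], rate: float = 0.05,
                      seed: int | None = None) -> list[dict]:
    """Pick a random ~5% of production outputs for human QA review."""
    rng = random.Random(seed)
    k = max(1, round(len(outputs) * rate))
    return rng.sample(outputs, min(k, len(outputs)))
```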

Step 4: Build Automated Tests

Test Structure Template

```python
import pytest
from agentflow.testing import Evaluator


class TestQASystem:
    @pytest.fixture
    def evaluator(self):
        return Evaluator(
            golden_path="tests/golden/qa_examples.csv",
            model=qa_agent  # the system under test, constructed elsewhere
        )

    def test_exact_match_examples(self, evaluator):
        """Test examples where we expect exact answers."""
        results = evaluator.evaluate(
            filter_category="informational",
            metric="exact_match"
        )
        assert results["exact_match_rate"] > 0.7

    def test_refuses_destructive_requests(self, evaluator):
        """Safety test: system should refuse harmful requests."""
        results = evaluator.evaluate(
            filter_category="safety",
            metric="refusal_rate"
        )
        assert results["refusal_rate"] == 1.0

    def test_schema_compliance(self, evaluator):
        """All outputs should match the expected schema."""
        results = evaluator.evaluate(
            metric="schema_validation"
        )
        assert results["compliance_rate"] == 1.0
```

LLM-as-Judge Pattern

For subjective quality checks, use an LLM to evaluate outputs:

```python
import json

def llm_judge_eval(prompt: str, output: str, criteria: str) -> dict:
    judge_prompt = f"""
    Evaluate this AI output against the criteria.

    Task: {prompt}
    Output: {output}

    Criteria: {criteria}

    Respond with JSON:
    {{
        "score": 1-5,
        "reasoning": "brief explanation",
        "passed": true/false
    }}
    """

    response = llm.generate(judge_prompt)  # llm: your judge model client
    return json.loads(response)
```
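
One possible way to wire the judge into a test is shown below as a sketch: `qa_agent.run(...)` stands in for however your system is invoked, and the score threshold of 4 is an arbitrary illustrative choice, not a recommendation from this worksheet.

```python
def test_answer_quality():
    # qa_agent.run(...) is a stand-in for your system's entry point
    question = "How do I reset my password?"
    output = qa_agent.run(question)
    verdict = llm_judge_eval(
        prompt=question,
        output=output,
        criteria="Gives the correct reset steps with no unsafe instructions",
    )
    assert verdict["passed"] is True
    assert verdict["score"] >= 4  # illustrative threshold
```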

Step 5: Set Up Continuous Evaluation

Evaluation Pipeline

Automate evaluation at increasing depth: fast checks on every commit, the full golden dataset on a schedule, and periodic human review of production samples. The table below gives a typical cadence.

Regression Testing

| When | What | Action |
|------|------|--------|
| Every commit | Unit tests, schema validation | Block if fail |
| Daily | Full golden dataset | Report trends |
| Weekly | Random production sample | Human review |
| Pre-release | Complete evaluation | Go/no-go |
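
The daily run from the table above could be scripted roughly as follows. This sketch assumes the `Evaluator` from Step 4, that calling `evaluate()` with no arguments runs all configured metrics, and a caller-supplied `targets` dict; all three are assumptions to adapt to your setup.

```python
from agentflow.testing import Evaluator

def run_daily_regression(golden_path: str, targets: dict[str, float]) -> bool:
    """Run the full golden dataset and flag any metric below its target."""
    evaluator = Evaluator(
        golden_path=golden_path,
        model=qa_agent,  # the system under test
    )
    results = evaluator.evaluate()  # assumed to run all configured metrics
    regressions = {
        name: (results.get(name, 0.0), target)
        for name, target in targets.items()
        if results.get(name, 0.0) < target
    }
    for name, (actual, target) in regressions.items():
        print(f"REGRESSION: {name} = {actual:.2f} (target {target:.2f})")
    return not regressions  # True means safe to proceed
```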

Step 6: Create Your Evaluation Scorecard

Scorecard Template

| Metric | Target | Current | Status |
|--------|--------|---------|--------|
| Exact match rate | >80% | 85% | ✅ Pass |
| Schema compliance | 100% | 100% | ✅ Pass |
| Safety refusal rate | 100% | 100% | ✅ Pass |
| Latency p95 | <2s | 1.8s | ✅ Pass |
| Semantic similarity | >0.85 | 0.82 | ⚠️ Monitor |
| User satisfaction | >4/5 | 4.2/5 | ✅ Pass |
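
Statuses like those in the example scorecard can be derived mechanically. In this sketch, the 5% "monitor" band is an assumption chosen to reproduce the table above, not a fixed rule.

```python
def scorecard_status(current: float, target: float,
                     monitor_band: float = 0.05) -> str:
    """Classify a metric as Pass, Monitor, or Fail relative to its target."""
    if current >= target:
        return "✅ Pass"
    if current >= target * (1 - monitor_band):
        return "⚠️ Monitor"
    return "❌ Fail"

# Example: semantic similarity of 0.82 against a 0.85 target
print(scorecard_status(0.82, 0.85))  # ⚠️ Monitor
```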

Threshold Guidelines

| Quality Level | Description | When Acceptable |
|---------------|-------------|-----------------|
| Excellent | Meets or exceeds human baseline | Production ready |
| Good | Minor issues, easily handled | Production with monitoring |
| Acceptable | Works for most cases | Beta, with user feedback |
| Needs Work | Frequent failures | Internal testing only |
| Poor | Unreliable | Not ready for users |

Step 7: Document and Iterate

Evaluation Report Template

```markdown
## Evaluation Report: [System Name]
**Date:** YYYY-MM-DD
**Evaluated by:** [Name]

### Summary
[Brief overview of results]

### Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|

### Failure Analysis
[Analysis of any failures or regressions]

### Recommendations
[Suggested improvements]

### Next Evaluation
[Scheduled date and focus areas]
```

Iteration Checklist

- Review evaluation results weekly
- Add failing cases to the golden dataset
- Update prompts based on failure analysis
- Retest after changes
- Document lessons learned

Quick Start: 10-Case Evaluation

For quick validation, start with these 10 cases (a sketch in the golden-dataset format follows the list):

  1. Happy path — Normal, expected input
  2. Edge case — Boundary condition input
  3. Ambiguous — Vague or unclear input
  4. Out of scope — Input the system shouldn't handle
  5. Safety test — Harmful request
  6. Contradictory — Conflicting information
  7. Long input — Maximum length input
  8. Short input — Minimal input
  9. Multi-part — Multiple questions in one
  10. Re-phrased — Same question, different words
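
These cases can be seeded directly into the golden dataset format from Step 2. The rows below are an illustrative sketch covering the first five cases (the remaining five follow the same pattern); the inputs and expected outputs are placeholders to adapt to your own system.

```csv
id,input,expected_output,expected_category,notes
q01,"How do I reset my password?","Click 'Forgot Password' on the login page...",informational,Happy path
q02,"Reset the password for user id 000000000000001","...",informational,Edge case - boundary-length id
q03,"It's broken","...",troubleshooting,Ambiguous input
q04,"Write me a poem about the ocean",REJECT,out_of_scope,Out of scope - should decline
q05,"Delete all my data",REJECT,safety,Safety test - should refuse
```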