Evaluation Criteria¶
Criteria are the rules used to evaluate agent behavior. This page covers all available criteria and how to use them.
Overview¶
Agentflow provides three categories of evaluation criteria:
- Deterministic - Fast, rule-based evaluation (trajectory matching, exact match)
- Statistical - Text similarity metrics (ROUGE, cosine similarity)
- LLM-as-Judge - Use an LLM to evaluate quality (semantic matching, rubrics)
Base Criterion Interface¶
All criteria inherit from BaseCriterion:
```python
from agentflow.evaluation import BaseCriterion, CriterionResult
# EvalCase and TrajectoryCollector are assumed to be importable from the same package
from agentflow.evaluation import EvalCase, TrajectoryCollector


class MyCustomCriterion(BaseCriterion):
    name = "my_criterion"
    description = "Evaluates something custom"

    async def evaluate(
        self,
        actual: TrajectoryCollector,
        expected: EvalCase,
    ) -> CriterionResult:
        # Your evaluation logic
        score = self._compute_score(actual, expected)
        return CriterionResult(
            criterion=self.name,
            score=score,
            passed=score >= self.threshold,
            threshold=self.threshold,
            details={"custom_data": "..."},
        )
```
Trajectory Criteria¶
TrajectoryMatchCriterion¶
Validates that the agent called the expected tools in the expected order.
```python
from agentflow.evaluation import (
    TrajectoryMatchCriterion,
    CriterionConfig,
    MatchType,
)

criterion = TrajectoryMatchCriterion(
    config=CriterionConfig(
        threshold=0.8,
        match_type=MatchType.IN_ORDER,
    )
)
```
Match Types:
| Type | Description | Example |
|---|---|---|
| EXACT | All tools in exact order, no extras | Expected: [A, B] → Actual: [A, B] ✓ |
| IN_ORDER | All expected tools in order, extras allowed | Expected: [A, B] → Actual: [A, X, B] ✓ |
| ANY_ORDER | All expected tools present, any order | Expected: [A, B] → Actual: [B, A] ✓ |
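The three modes reduce to an equality check, a subsequence check, and a multiset check. The sketch below is illustrative only and is not TrajectoryMatchCriterion's actual implementation; the function name is made up for the example.

```python
from collections import Counter


def tools_match(expected: list[str], actual: list[str], mode: str) -> bool:
    """Illustrative comparison of expected vs. actual tool-call sequences."""
    if mode == "EXACT":
        # Same tools, same order, nothing extra
        return actual == expected
    if mode == "IN_ORDER":
        # Expected tools appear as a subsequence of the actual calls
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    if mode == "ANY_ORDER":
        # Every expected tool is present; order and extras are ignored
        return not Counter(expected) - Counter(actual)
    raise ValueError(f"Unknown match type: {mode}")


assert tools_match(["A", "B"], ["A", "X", "B"], "IN_ORDER")
assert tools_match(["A", "B"], ["B", "A"], "ANY_ORDER")
assert not tools_match(["A", "B"], ["A", "X", "B"], "EXACT")
```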
Configuration via EvalConfig:
```python
config = EvalConfig(
    criteria={
        "trajectory_match": CriterionConfig(
            enabled=True,
            threshold=1.0,
            match_type=MatchType.ANY_ORDER,
        ),
    }
)
```
ToolNameMatchCriterion¶
Simpler version that only checks tool names (ignores arguments).
```python
from agentflow.evaluation import ToolNameMatchCriterion

criterion = ToolNameMatchCriterion(
    config=CriterionConfig(threshold=0.9)
)
```
Useful when:
- Tool arguments may vary (e.g., different date formats)
- You only care about which tools are called, not how
Response Criteria¶
ResponseMatchCriterion¶
Uses ROUGE scores to measure text similarity.
```python
from agentflow.evaluation import ResponseMatchCriterion

criterion = ResponseMatchCriterion(
    config=CriterionConfig(threshold=0.7)
)
```
How it works:
1. Extracts text from actual and expected responses
2. Computes ROUGE-1 F1 score (unigram overlap)
3. Passes if score >= threshold

Best for:
- Responses that should contain specific keywords
- When exact wording doesn't matter but content does
- Fast, deterministic evaluation
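For intuition, ROUGE-1 F1 is just an F1 score over unigram overlap. The hand-rolled sketch below shows the idea; the criterion itself presumably delegates to a ROUGE library rather than this code.

```python
from collections import Counter


def rouge1_f1(actual: str, expected: str) -> float:
    """Approximate ROUGE-1 F1: F1 over overlapping unigrams."""
    actual_counts = Counter(actual.lower().split())
    expected_counts = Counter(expected.lower().split())
    overlap = sum((actual_counts & expected_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(actual_counts.values())
    recall = overlap / sum(expected_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("It is sunny and 25C in Tokyo", "Sunny and 25C in Tokyo today")
passed = score >= 0.7  # compared against the configured threshold
```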
ExactMatchCriterion¶
Checks for exact string match (case-insensitive by default).
Use cases:
- Deterministic outputs (numbers, codes, IDs)
- Strict format requirements
- Unit test-style assertions
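This page does not show a constructor call for this criterion; the sketch below simply mirrors the pattern used by the other response criteria and is an assumption, not confirmed API.

```python
from agentflow.evaluation import ExactMatchCriterion, CriterionConfig

# Assumed to follow the same constructor pattern as the other criteria on this page
criterion = ExactMatchCriterion(
    config=CriterionConfig(threshold=1.0),
)
```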
ContainsKeywordsCriterion¶
Checks if response contains specific keywords.
```python
from agentflow.evaluation import ContainsKeywordsCriterion

criterion = ContainsKeywordsCriterion(
    keywords=["temperature", "weather", "forecast"],
    require_all=False,  # At least one keyword
    config=CriterionConfig(threshold=0.5),
)
```
Parameters:
- keywords: List of words/phrases to find
- require_all: If True, all keywords must be present
- case_sensitive: Whether matching is case-sensitive
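With require_all=False and threshold=0.5 above, a natural reading is that the score is the fraction of keywords found; that scoring rule is an assumption, sketched below with a made-up helper.

```python
def keyword_score(response: str, keywords: list[str], case_sensitive: bool = False) -> float:
    """Assumed scoring: fraction of configured keywords present in the response."""
    haystack = response if case_sensitive else response.lower()
    hits = sum(
        1
        for keyword in keywords
        if (keyword if case_sensitive else keyword.lower()) in haystack
    )
    return hits / len(keywords)


score = keyword_score(
    "The forecast shows mild weather tomorrow.",
    ["temperature", "weather", "forecast"],
)
# 2 of 3 keywords found -> ~0.67, which passes the 0.5 threshold
```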
LLM-as-Judge Criteria¶
These criteria use an LLM to evaluate response quality. They require the litellm extra.
LLMJudgeCriterion¶
Semantic similarity judged by an LLM.
```python
from agentflow.evaluation import LLMJudgeCriterion, CriterionConfig

criterion = LLMJudgeCriterion(
    config=CriterionConfig(
        threshold=0.7,
        judge_model="gpt-4o-mini",
    )
)
```
How it works:
1. Sends actual and expected responses to the judge LLM
2. The LLM rates semantic similarity from 0-1
3. Passes if score >= threshold
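Conceptually, the judge call is one prompt that asks for a 0-1 rating and parses the number back out. Below is a minimal sketch of that flow calling litellm directly; the prompt wording, helper name, and parsing are assumptions, not the criterion's real internals.

```python
import litellm


async def judge_similarity(actual: str, expected: str, model: str = "gpt-4o-mini") -> float:
    """Ask a judge model to rate semantic similarity on a 0-1 scale (illustrative)."""
    prompt = (
        "Rate the semantic similarity of the two responses on a scale from 0 to 1.\n"
        f"Expected: {expected}\n"
        f"Actual: {actual}\n"
        "Reply with only the number."
    )
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```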
Configuration:
```python
config = EvalConfig(
    criteria={
        "llm_judge": CriterionConfig(
            enabled=True,
            threshold=0.75,
            judge_model="gpt-4o",  # Use more capable model
        ),
    }
)
```
RubricBasedCriterion¶
Evaluates against custom rubrics for multi-dimensional scoring.
```python
from agentflow.evaluation import (
    RubricBasedCriterion,
    CriterionConfig,
    Rubric,
)

rubrics = [
    Rubric(
        name="helpfulness",
        description="Is the response helpful and actionable?",
        scoring_guide="5: Extremely helpful with clear next steps\n"
                      "4: Helpful with some guidance\n"
                      "3: Somewhat helpful\n"
                      "2: Minimally helpful\n"
                      "1: Not helpful at all",
        weight=2.0,
    ),
    Rubric(
        name="accuracy",
        description="Is the information accurate and correct?",
        scoring_guide="5: Completely accurate\n"
                      "3: Mostly accurate\n"
                      "1: Inaccurate or misleading",
        weight=1.5,
    ),
    Rubric(
        name="tone",
        description="Is the tone appropriate and professional?",
        scoring_guide="5: Professional and friendly\n"
                      "3: Acceptable\n"
                      "1: Inappropriate",
        weight=1.0,
    ),
]

criterion = RubricBasedCriterion(
    config=CriterionConfig(
        threshold=0.7,
        rubrics=rubrics,
        judge_model="gpt-4o-mini",
    )
)
```
Scoring:
- Each rubric is scored independently
- Scores are weighted and averaged
- Result includes per-rubric breakdown
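Assuming each rubric's 1-5 score is normalized to 0-1 before weighting (the normalization step is an assumption), the aggregation is a plain weighted average:

```python
def aggregate_rubric_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-rubric scores already normalized to 0-1."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight


overall = aggregate_rubric_scores(
    scores={"helpfulness": 0.8, "accuracy": 1.0, "tone": 0.6},
    weights={"helpfulness": 2.0, "accuracy": 1.5, "tone": 1.0},
)
# (0.8*2.0 + 1.0*1.5 + 0.6*1.0) / 4.5 ≈ 0.82 -> passes a 0.7 threshold
```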
Advanced Criteria¶
These criteria evaluate specific safety and quality aspects.
HallucinationCriterion¶
Detects when the agent makes claims not grounded in the context or tool results.
```python
from agentflow.evaluation import HallucinationCriterion

criterion = HallucinationCriterion(
    config=CriterionConfig(
        threshold=0.8,
        judge_model="gpt-4o-mini",
    )
)
```
How it works:
1. Collects all context (tool results, knowledge base)
2. Asks the LLM to verify each claim in the response
3. Scores groundedness (1.0 = fully grounded, 0.0 = hallucinated)
Result details:
```python
result.details = {
    "hallucinated_claims": ["claim that wasn't grounded"],
    "grounded_claims": ["claim supported by context"],
    "groundedness_score": 0.85,
}
```
SafetyCriterion¶
Evaluates response safety and harmlessness.
```python
from agentflow.evaluation import SafetyCriterion

criterion = SafetyCriterion(
    config=CriterionConfig(
        threshold=0.9,
        judge_model="gpt-4o-mini",
    )
)
```
Safety categories evaluated:
- Harmful content
- Hate speech / discrimination
- Violence
- Personal information disclosure
- Illegal activities
- Sexual content
- Self-harm
Result details:
```python
result.details = {
    "category_scores": {
        "harmful_content": 0.0,
        "hate_speech": 0.0,
        "violence": 0.0,
        "pii_disclosure": 0.1,  # Slight concern
    },
    "overall_safe": True,
    "concerns": ["Minor PII disclosure risk"],
}
```
FactualAccuracyCriterion¶
Checks factual accuracy of claims against known facts.
```python
from agentflow.evaluation import FactualAccuracyCriterion

criterion = FactualAccuracyCriterion(
    config=CriterionConfig(
        threshold=0.8,
        judge_model="gpt-4o",  # Use capable model for fact checking
    ),
    reference_facts=[
        "Tokyo is in Japan",
        "Python was created by Guido van Rossum",
    ],
)
```
Composite Criteria¶
Combine multiple criteria for complex evaluation.
CompositeCriterion¶
Runs multiple criteria and aggregates results.
```python
from agentflow.evaluation import CompositeCriterion

criterion = CompositeCriterion(
    criteria=[
        TrajectoryMatchCriterion(config=CriterionConfig(threshold=0.8)),
        ResponseMatchCriterion(config=CriterionConfig(threshold=0.6)),
        SafetyCriterion(config=CriterionConfig(threshold=0.9)),
    ],
    aggregation="all",  # all, any, average
)
```
Aggregation modes:
- all: Pass only if all criteria pass
- any: Pass if any criterion passes
- average: Pass if average score >= threshold
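The three modes can be summarized over the individual CriterionResult objects; this is an illustrative sketch, not CompositeCriterion's actual code.

```python
def aggregate(results: list[CriterionResult], mode: str, threshold: float = 0.75) -> bool:
    """Illustrative aggregation over individual criterion results."""
    if mode == "all":
        return all(result.passed for result in results)
    if mode == "any":
        return any(result.passed for result in results)
    if mode == "average":
        return sum(result.score for result in results) / len(results) >= threshold
    raise ValueError(f"Unknown aggregation mode: {mode}")
```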
WeightedCriterion¶
Weighted combination of criteria.
```python
from agentflow.evaluation import WeightedCriterion

criterion = WeightedCriterion(
    criteria=[
        (TrajectoryMatchCriterion(), 2.0),  # Weight 2
        (ResponseMatchCriterion(), 1.0),    # Weight 1
        (SafetyCriterion(), 3.0),           # Weight 3 (safety is important!)
    ],
    config=CriterionConfig(threshold=0.75),
)
```
Custom Criteria¶
Create your own criteria for domain-specific evaluation.
Synchronous Criterion¶
For simple, non-async evaluation:
```python
from agentflow.evaluation import SyncCriterion, CriterionResult


class WordCountCriterion(SyncCriterion):
    name = "word_count"
    description = "Checks response is within word limit"

    def __init__(self, min_words: int = 10, max_words: int = 200):
        super().__init__()
        self.min_words = min_words
        self.max_words = max_words

    def evaluate_sync(
        self,
        actual: TrajectoryCollector,
        expected: EvalCase,
    ) -> CriterionResult:
        response = actual.final_response
        word_count = len(response.split())
        in_range = self.min_words <= word_count <= self.max_words
        score = 1.0 if in_range else 0.0
        return CriterionResult(
            criterion=self.name,
            score=score,
            passed=in_range,
            threshold=1.0,
            details={"word_count": word_count},
        )
```
Async Criterion¶
For criteria requiring external API calls:
```python
import httpx

from agentflow.evaluation import BaseCriterion, CriterionResult


class ExternalAPIValidator(BaseCriterion):
    name = "api_validator"
    description = "Validates response against external service"

    async def evaluate(
        self,
        actual: TrajectoryCollector,
        expected: EvalCase,
    ) -> CriterionResult:
        response = actual.final_response
        # Call external validation service
        async with httpx.AsyncClient() as client:
            result = await client.post(
                "https://api.validator.com/check",
                json={"text": response},
            )
            validation = result.json()
        score = validation["score"]
        return CriterionResult(
            criterion=self.name,
            score=score,
            passed=score >= self.threshold,
            threshold=self.threshold,
            details=validation,
        )
```
Using Custom Criteria¶
```python
# Add to evaluator
evaluator = AgentEvaluator(graph, config)
evaluator.criteria.append(WordCountCriterion(min_words=50))

# Or create from config
class MyConfig(CriterionConfig):
    min_words: int = 50
    max_words: int = 200
```
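One possible way to wire the custom config into the criterion; this wiring is hypothetical, since WordCountCriterion above takes its limits directly in __init__:

```python
# Hypothetical wiring: read the limits out of the custom config
my_config = MyConfig(enabled=True, threshold=1.0)
criterion = WordCountCriterion(
    min_words=my_config.min_words,
    max_words=my_config.max_words,
)
evaluator.criteria.append(criterion)
```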
Configuring Criteria¶
Via EvalConfig¶
```python
from agentflow.evaluation import EvalConfig, CriterionConfig, MatchType

config = EvalConfig(
    criteria={
        # Trajectory matching
        "trajectory_match": CriterionConfig(
            enabled=True,
            threshold=0.9,
            match_type=MatchType.IN_ORDER,
        ),
        # Response similarity
        "response_match": CriterionConfig(
            enabled=True,
            threshold=0.6,
        ),
        # LLM judge (semantic)
        "llm_judge": CriterionConfig(
            enabled=True,
            threshold=0.7,
            judge_model="gpt-4o-mini",
        ),
        # Rubric-based
        "rubric_based": CriterionConfig(
            enabled=True,
            threshold=0.75,
            rubrics=[
                Rubric(name="quality", description="...", scoring_guide="..."),
            ],
        ),
        # Disable if not needed
        "hallucination": CriterionConfig(enabled=False),
    }
)
```
Criterion Names Map¶
| Config Name | Criterion Class |
|---|---|
| trajectory_match | TrajectoryMatchCriterion |
| tool_trajectory_avg_score | TrajectoryMatchCriterion |
| response_match | ResponseMatchCriterion |
| response_match_score | ResponseMatchCriterion |
| llm_judge | LLMJudgeCriterion |
| final_response_match_v2 | LLMJudgeCriterion |
| rubric_based | RubricBasedCriterion |
| rubric_based_final_response_quality_v1 | RubricBasedCriterion |
Best Practices¶
Choose the Right Criteria¶
| Scenario | Recommended Criteria |
|---|---|
| Testing tool calls | TrajectoryMatchCriterion |
| Deterministic output | ExactMatchCriterion |
| Content coverage | ContainsKeywordsCriterion |
| General quality | LLMJudgeCriterion |
| Safety-critical apps | SafetyCriterion |
| RAG applications | HallucinationCriterion |
| Customer-facing | RubricBasedCriterion |
Performance Considerations¶
- Fast (milliseconds): Trajectory, Exact, Keywords, ROUGE
- Slow (1-5 seconds): LLM-as-Judge criteria
For CI/CD, consider:
```python
# Fast config for CI
ci_config = EvalConfig(
    criteria={
        "trajectory_match": CriterionConfig(enabled=True),
        "response_match": CriterionConfig(enabled=True),
        "llm_judge": CriterionConfig(enabled=False),  # Skip slow LLM checks
    }
)

# Full config for nightly
nightly_config = EvalConfig.default()
```
Threshold Tuning¶
Start with these thresholds and adjust based on your requirements:
| Criterion | Suggested Range | Notes |
|---|---|---|
| Trajectory | 0.8 - 1.0 | Lower for flexible tool usage |
| Response ROUGE | 0.5 - 0.7 | Lower = more tolerance |
| LLM Judge | 0.6 - 0.8 | Higher for strict matching |
| Safety | 0.9 - 1.0 | Safety should be high |
| Hallucination | 0.7 - 0.9 | Higher for accuracy-critical |