Getting Started with Agent Evaluation¶
This guide walks you through setting up and running your first agent evaluation.
Prerequisites¶
- A compiled Agentflow graph to test
- Python 3.12+
- Basic understanding of Agentflow graphs
Quick Start (5 Lines!)¶
The fastest way to test your agent is with QuickEval:
from agentflow.evaluation import QuickEval

# Test your agent with a single call
report = await QuickEval.check(
    graph=compiled_graph,
    query="What's the weather in Tokyo?",
    expected_response_contains="weather",
)

print(f"Pass rate: {report.summary.pass_rate * 100:.1f}%")
That's it! QuickEval handles:
- ✅ Creating eval sets automatically
- ✅ Configuring evaluation criteria
- ✅ Running the evaluation
- ✅ Generating reports
85% less code compared to the manual approach below.
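Because the returned report exposes summary.pass_rate (used in the print above), the same call can also gate an automated run; a minimal sketch, with an illustrative threshold:

# Fail the build if fewer than 80% of cases pass (threshold is illustrative)
assert report.summary.pass_rate >= 0.8, (
    f"Agent eval pass rate too low: {report.summary.pass_rate:.0%}"
)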
More QuickEval Examples¶
Batch Testing¶
Test multiple scenarios at once:
report = await QuickEval.batch(
    graph=compiled_graph,
    test_pairs=[
        ("Hello", "Hi"),
        ("Weather in NYC?", "sunny"),
        ("Thank you", "welcome"),
    ],
    threshold=0.7,
)
Tool Usage Validation¶
Verify your agent calls the right tools:
report = await QuickEval.tool_usage(
    graph=compiled_graph,
    test_cases=[
        ("Weather in NYC?", "It's sunny", ["get_weather"]),
        ("Temperature in Paris?", "18°C", ["get_weather"]),
    ],
    strict=True,  # Require exact tool matches
)
Using Presets¶
from agentflow.evaluation import EvalPresets

# Use preset configurations
report = await QuickEval.preset(
    graph=compiled_graph,
    preset=EvalPresets.response_quality(threshold=0.8),
    eval_set="tests/my_tests.json",
)
Building Test Sets Fluently¶
Use EvalSetBuilder for complex test scenarios:
from agentflow.evaluation import EvalSetBuilder, QuickEval, EvalPresets

builder = (
    EvalSetBuilder("weather_tests")
    .add_case(
        query="Weather in Tokyo?",
        expected="sunny",
        expected_tools=["get_weather"],
    )
    .add_case(
        query="Temperature in Paris?",
        expected="18°C",
        expected_tools=["get_weather"],
    )
    .add_multi_turn(
        conversation=[
            ("Hello", "Hi there!"),
            ("Weather?", "Which city?"),
            ("Tokyo", "It's sunny in Tokyo"),
        ],
    )
)

# Materialize the eval set if you need it elsewhere
eval_set = builder.build()

# Run the evaluation directly from the builder
report = await QuickEval.from_builder(
    graph=compiled_graph,
    builder=builder,
    config=EvalPresets.comprehensive(),
)
Manual Approach (Advanced)¶
For full control over evaluation configuration, you can build everything manually.
Step 1: Create an Evaluation Set¶
Evaluation sets can be created programmatically or loaded from JSON files.
Programmatic Creation¶
from agentflow.evaluation import (
    EvalSet,
    EvalCase,
    Invocation,
    MessageContent,
    ToolCall,
)

# Create a simple eval set
eval_set = EvalSet(
    eval_set_id="weather_tests",
    name="Weather Agent Tests",
    description="Integration tests for weather agent functionality",
    eval_cases=[
        EvalCase(
            eval_id="test_1",
            name="Basic weather lookup",
            conversation=[
                Invocation(
                    invocation_id="turn_1",
                    user_content=MessageContent.user("What's the weather in Paris?"),
                    expected_tool_trajectory=[
                        ToolCall(name="get_weather", args={"city": "Paris"})
                    ],
                    expected_final_response=MessageContent.assistant(
                        "The weather in Paris is currently 18°C and sunny."
                    ),
                )
            ],
            tags=["weather", "basic"],
        ),
    ],
)
JSON File Format¶
Save evaluation sets as JSON files for reusability:
{
  "eval_set_id": "weather_tests",
  "name": "Weather Agent Tests",
  "eval_cases": [
    {
      "eval_id": "test_1",
      "name": "Basic weather lookup",
      "conversation": [
        {
          "invocation_id": "turn_1",
          "user_content": {
            "role": "user",
            "content": "What's the weather in Paris?"
          },
          "expected_tool_trajectory": [
            {"name": "get_weather", "args": {"city": "Paris"}}
          ]
        }
      ]
    }
  ]
}
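If you build a set programmatically and want to keep it as a file in this format, a sketch like the following works, assuming EvalSet supports Pydantic-style serialization (model_dump_json is an assumption here, not a documented API):

from pathlib import Path

# Assumption: EvalSet is a Pydantic model, so model_dump_json() emits the schema shown above
Path("tests/my_tests.json").write_text(eval_set.model_dump_json(indent=2))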
Step 2: Configure Evaluation Criteria¶
from agentflow.evaluation import EvalConfig, CriterionConfig, EvalPresets, MatchType

# Use presets (recommended)
config = EvalPresets.response_quality(threshold=0.8)

# Or customize criteria manually
config = EvalConfig(
    criteria={
        "trajectory_match": CriterionConfig(
            enabled=True,
            threshold=0.8,
            match_type=MatchType.IN_ORDER,
        ),
        "response_match": CriterionConfig(
            enabled=True,
            threshold=0.6,
        ),
    }
)
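The same structure lets you weight one criterion over another. For example, a sketch of a trajectory-focused configuration, assuming a disabled criterion is simply skipped during scoring:

# Tool calls must match exactly and in order; the response text is not scored
trajectory_only_config = EvalConfig(
    criteria={
        "trajectory_match": CriterionConfig(
            enabled=True,
            threshold=1.0,
            match_type=MatchType.IN_ORDER,
        ),
        "response_match": CriterionConfig(
            enabled=False,
            threshold=0.6,
        ),
    }
)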
Step 3: Create and Run the Evaluator¶
from agentflow.evaluation import AgentEvaluator
# Create evaluator
evaluator = AgentEvaluator(graph=compiled_graph, config=config)
# Run evaluation
report = await evaluator.evaluate("tests/my_tests.json", verbose=True)
# Check results
print(f"Pass rate: {report.summary.pass_rate * 100:.1f}%")
Step 4: View and Export Results¶
from agentflow.evaluation import ConsoleReporter, JSONReporter
# Console output
ConsoleReporter(verbose=True).report(report)
# Export to JSON
JSONReporter().save(report, "results/evaluation_report.json")
Next Steps¶
- Learn about Data Models in detail
- Explore all Criteria options
- Set up Pytest Integration
- Use User Simulation for dynamic testing