Reporters¶
Reporters format and output evaluation results. Agentflow includes four built-in reporters for different use cases.
Overview¶
| Reporter | Use Case | Output Format |
|---|---|---|
| ConsoleReporter | Development, debugging | Terminal (ANSI colors) |
| JSONReporter | Data analysis, storage | JSON file/dict |
| JUnitXMLReporter | CI/CD integration | JUnit XML |
| HTMLReporter | Stakeholder reporting | Interactive HTML |
ConsoleReporter¶
Pretty-prints evaluation results to the terminal with ANSI colors.
Basic Usage¶
from agentflow.evaluation import ConsoleReporter, print_report
# Quick usage
print_report(eval_report)
# Or with options
reporter = ConsoleReporter(
    verbose=True,    # Show detailed output
    use_color=True,  # Use ANSI colors
)
reporter.report(eval_report)
Output Example¶
═══════════════════════════════════════════════════════════════════════
EVALUATION REPORT: weather_tests
═══════════════════════════════════════════════════════════════════════
Summary
───────────────────────────────────────────────────────────────────────
Total Cases: 10
Passed: 8 ✓
Failed: 2 ✗
Pass Rate: 80.0%
Duration: 3.45s
Criterion Statistics
───────────────────────────────────────────────────────────────────────
trajectory_match    9/10 passed    avg: 0.92
response_match      8/10 passed    avg: 0.78
llm_judge           8/10 passed    avg: 0.81
Failed Cases
───────────────────────────────────────────────────────────────────────
✗ test_edge_case_1 (0.45s)
    - trajectory_match: 0.50 (threshold: 0.80)
    - response_match: 0.62 (threshold: 0.70)
✗ test_complex_query (1.23s)
    - trajectory_match: 0.75 (threshold: 0.80)
Options¶
import sys

reporter = ConsoleReporter(
    verbose=True,       # Show all cases, not just failures
    use_color=True,     # Colorful output
    output=sys.stdout,  # Where to write (default: stdout)
)
Disable Colors¶
For non-terminal environments:
from agentflow.evaluation.reporters.console import Colors
Colors.disable() # Removes all ANSI codes
reporter = ConsoleReporter(use_color=False)
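In scripts that run both interactively and in CI, a common pattern is to key color off whether stdout is a real terminal. A minimal sketch (the isatty check is plain Python, not an Agentflow API):

import sys

from agentflow.evaluation import ConsoleReporter

# Enable color only when attached to an interactive terminal
reporter = ConsoleReporter(use_color=sys.stdout.isatty())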
JSONReporter¶
Exports evaluation results as JSON for analysis and storage.
Save to File¶
from agentflow.evaluation import JSONReporter
reporter = JSONReporter()
reporter.save(eval_report, "results/report.json")
Get as Dictionary¶
# Full report
data = reporter.to_dict(eval_report)
# Only failed cases
data = reporter.to_dict(eval_report, include_passed=False)
# Include full trajectory data
data = reporter.to_dict(eval_report, include_trajectory=True)
JSON Structure¶
{
  "report_id": "rpt_abc123",
  "eval_set_id": "weather_tests",
  "eval_set_name": "Weather Agent Tests",
  "created_at": "2024-01-15T10:30:00Z",
  "duration_seconds": 3.45,
  "summary": {
    "total_cases": 10,
    "passed_cases": 8,
    "failed_cases": 2,
    "pass_rate": 0.8,
    "avg_score": 0.85,
    "criterion_stats": {
      "trajectory_match": {
        "passed": 9,
        "failed": 1,
        "avg_score": 0.92
      }
    }
  },
  "results": [
    {
      "eval_id": "test_1",
      "name": "Basic Weather Query",
      "passed": true,
      "duration_seconds": 0.45,
      "criterion_results": [
        {
          "criterion": "trajectory_match",
          "score": 1.0,
          "passed": true,
          "threshold": 0.8,
          "details": {}
        }
      ]
    }
  ],
  "config_used": {
    "criteria": {...}
  }
}
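Because the output is plain JSON with the structure above, downstream tooling needs nothing beyond the standard library. For example, listing the failed cases from a saved report:

import json

with open("results/report.json", encoding="utf-8") as f:
    data = json.load(f)

# Walk the documented structure and print each failing case
for result in data["results"]:
    if not result["passed"]:
        print(result["eval_id"], result["duration_seconds"])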
Options¶
reporter = JSONReporter(
    indent=2,               # JSON indentation
    include_metadata=True,  # Include config and metadata
)
# Filter output
data = reporter.to_dict(
    report,
    include_passed=True,       # Include passing cases
    include_trajectory=False,  # Omit raw trajectory data
    include_config=True,       # Include configuration used
)
JUnitXMLReporter¶
Generates JUnit XML format for CI/CD integration (GitHub Actions, Jenkins, etc.).
Save to File¶
from agentflow.evaluation import JUnitXMLReporter
reporter = JUnitXMLReporter()
reporter.save(eval_report, "results/junit.xml")
XML Structure¶
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="weather_tests" tests="10" failures="2" time="3.45">
  <testsuite name="weather_tests" tests="10" failures="2" time="3.45">
    <testcase name="test_basic_weather" classname="weather_tests" time="0.45">
    </testcase>
    <testcase name="test_edge_case" classname="weather_tests" time="0.67">
      <failure message="trajectory_match failed: 0.50 &lt; 0.80">
        Criterion: trajectory_match
        Score: 0.50
        Threshold: 0.80
        Details: Expected [get_weather], Got [get_forecast]
      </failure>
    </testcase>
  </testsuite>
</testsuites>
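The file is standard JUnit XML, so you can also post-process it yourself. A sketch using the standard library to read the failure count from the root element shown above:

import xml.etree.ElementTree as ET

root = ET.parse("results/junit.xml").getroot()

# The failure count is an attribute on the <testsuites> root element
failures = int(root.get("failures", "0"))
print(f"{failures} failing case(s)")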
CI/CD Integration¶
GitHub Actions:
# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluations
        run: python run_evals.py
      - name: Upload Test Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: results/junit.xml
      - name: Publish Test Report
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: 'results/junit.xml'
Jenkins:
pipeline {
    agent any
    stages {
        stage('Evaluate') {
            steps {
                sh 'python run_evals.py'
            }
            post {
                always {
                    junit 'results/junit.xml'
                }
            }
        }
    }
}
HTMLReporter¶
Generates interactive HTML reports for sharing with stakeholders.
Save to File¶
from agentflow.evaluation import HTMLReporter
reporter = HTMLReporter()
reporter.save(eval_report, "results/report.html")
HTML Features¶
The generated HTML includes:
- Summary Dashboard - Pass/fail rates, charts
- Filtering - Filter by status, criterion, tags
- Case Details - Expandable sections for each case
- Criterion Breakdown - Per-criterion scores and details
- Search - Find specific test cases
- Responsive Design - Works on desktop and mobile
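To preview a saved report locally, the standard library will do (this is plain Python, not an Agentflow helper):

import webbrowser
from pathlib import Path

# Open the generated report in the default browser
webbrowser.open(Path("results/report.html").resolve().as_uri())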
Customization¶
reporter = HTMLReporter(
    title="Agent Evaluation Report",
    theme="light",  # "light" or "dark"
    include_charts=True,
)
Sample Output¶
The HTML report displays:
┌─────────────────────────────────────────────────────────┐
│ 📊 Weather Agent Evaluation Report │
│ Generated: 2024-01-15 10:30:00 │
├─────────────────────────────────────────────────────────┤
│ Summary │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ 10 Total │ │ 8 Passed │ │ 2 Failed │ │
│ │ Cases │ │ ✓ │ │ ✗ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Pass Rate: [████████░░] 80% │
├─────────────────────────────────────────────────────────┤
│ Filter: [All ▼] [Status ▼] [Criterion ▼] [🔍 Search] │
├─────────────────────────────────────────────────────────┤
│ ✓ test_basic_weather 0.45s │
│ trajectory_match: 1.00 ✓ │
│ response_match: 0.85 ✓ │
│ │
│ ✗ test_edge_case 0.67s │
│ trajectory_match: 0.50 ✗ (threshold: 0.80) │
│ response_match: 0.62 ✗ (threshold: 0.70) │
│ [▼ Show Details] │
└─────────────────────────────────────────────────────────┘
Using Multiple Reporters¶
Generate reports in multiple formats:
from agentflow.evaluation import (
    ConsoleReporter,
    JSONReporter,
    JUnitXMLReporter,
    HTMLReporter,
)
# Run evaluation
report = await evaluator.evaluate(eval_set)
# Output to console
ConsoleReporter(verbose=True).report(report)
# Save all formats
JSONReporter().save(report, "results/report.json")
JUnitXMLReporter().save(report, "results/junit.xml")
HTMLReporter().save(report, "results/report.html")
Reporter Factory¶
Create a helper for consistent reporting:
from pathlib import Path

def save_all_reports(report, output_dir: str = "results") -> None:
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Console
    ConsoleReporter(verbose=True).report(report)

    # Files
    JSONReporter().save(report, f"{output_dir}/report.json")
    JUnitXMLReporter().save(report, f"{output_dir}/junit.xml")
    HTMLReporter().save(report, f"{output_dir}/report.html")

    print(f"\nReports saved to {output_dir}/")
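Then one call produces every format after a run:

report = await evaluator.evaluate(eval_set)
save_all_reports(report, output_dir="results")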
Custom Reporters¶
Create custom reporters by implementing the base pattern:
from agentflow.evaluation import EvalReport

class MarkdownReporter:
    """Generate a Markdown report."""

    def report(self, report: EvalReport) -> str:
        lines = [
            f"# Evaluation Report: {report.eval_set_name}",
            "",
            "## Summary",
            "",
            f"- **Total Cases**: {report.summary.total_cases}",
            f"- **Passed**: {report.summary.passed_cases}",
            f"- **Failed**: {report.summary.failed_cases}",
            f"- **Pass Rate**: {report.summary.pass_rate * 100:.1f}%",
            "",
            "## Results",
            "",
        ]
        for result in report.results:
            status = "✓" if result.passed else "✗"
            lines.append(f"### {status} {result.name or result.eval_id}")
            for cr in result.criterion_results:
                mark = "✓" if cr.passed else "✗"  # avoid shadowing the case-level status
                lines.append(f"- {cr.criterion}: {cr.score:.2f} {mark}")
            lines.append("")
        return "\n".join(lines)

    def save(self, report: EvalReport, filepath: str) -> None:
        content = self.report(report)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(content)
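Usage mirrors the built-in reporters:

MarkdownReporter().save(report, "results/report.md")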
Slack Reporter Example¶
import httpx

class SlackReporter:
    """Send evaluation results to Slack."""

    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    async def report(self, report: EvalReport) -> None:
        status = "✅" if report.summary.pass_rate == 1.0 else "⚠️"
        message = {
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"{status} Evaluation: {report.eval_set_name}",
                    },
                },
                {
                    "type": "section",
                    "fields": [
                        {"type": "mrkdwn", "text": f"*Pass Rate:* {report.summary.pass_rate * 100:.1f}%"},
                        {"type": "mrkdwn", "text": f"*Duration:* {report.duration_seconds:.2f}s"},
                        {"type": "mrkdwn", "text": f"*Passed:* {report.summary.passed_cases}"},
                        {"type": "mrkdwn", "text": f"*Failed:* {report.summary.failed_cases}"},
                    ],
                },
            ]
        }
        async with httpx.AsyncClient() as client:
            await client.post(self.webhook_url, json=message)
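Usage from async code (the webhook URL below is a placeholder; substitute your own incoming-webhook URL):

reporter = SlackReporter(webhook_url="https://hooks.slack.com/services/...")
await reporter.report(report)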
Best Practices¶
Development Workflow¶
# During development
ConsoleReporter(verbose=True).report(report)

# Save for analysis
if report.summary.pass_rate < 1.0:
    JSONReporter().save(report, f"failures/{report.eval_set_id}.json")
CI/CD Workflow¶
import sys

# Always save structured output
JSONReporter().save(report, "results/report.json")
JUnitXMLReporter().save(report, "results/junit.xml")

# Fail the build if pass rate is too low
if report.summary.pass_rate < 0.95:
    sys.exit(1)