User Simulation¶
User simulation enables dynamic conversation testing by using an LLM to simulate realistic user behavior. This is useful when fixed prompts aren't practical or when you want to test agent robustness.
Why User Simulation?¶
Static eval sets have limitations:
- Fixed prompts don't test edge cases
- Multi-turn conversations are tedious to write manually
- User behavior varies in real-world usage
- You can't predict all possible user inputs
User simulation solves this by:
- Dynamically generating user messages based on context
- Following conversation plans to test specific scenarios
- Checking goal completion to validate outcomes
- Creating diverse test cases automatically
Core Concepts¶
ConversationScenario¶
A scenario defines what the simulated user is trying to accomplish:
from agentflow.evaluation import ConversationScenario
scenario = ConversationScenario(
    scenario_id="travel_planning",
    description="User planning a trip to Japan",
    starting_prompt="I'm thinking about visiting Japan next month",
    conversation_plan="""
    1. Ask about weather conditions
    2. Inquire about recommended destinations
    3. Ask about visa requirements
    4. Request packing suggestions
    """,
    goals=[
        "Get weather information",
        "Receive destination recommendations",
        "Learn about visa requirements",
    ],
    max_turns=8,
)
UserSimulator¶
The simulator runs conversations against your agent:
from agentflow.evaluation import UserSimulator
simulator = UserSimulator(
    model="gpt-4o-mini",  # LLM for generating user messages
    temperature=0.7,      # Creativity in responses
    max_turns=10,         # Default turn limit
)
SimulationResult¶
The result contains the full conversation and goal tracking:
result = await simulator.run(graph, scenario)
print(f"Turns: {result.turns}")
print(f"Completed: {result.completed}")
print(f"Goals achieved: {result.goals_achieved}")
print(f"Conversation: {result.conversation}")
Quick Start¶
import asyncio
from agentflow.evaluation import UserSimulator, ConversationScenario
async def main():
    # Create your compiled graph
    graph = await create_travel_agent_graph()

    # Create simulator
    simulator = UserSimulator(model="gpt-4o-mini")

    # Define scenario
    scenario = ConversationScenario(
        scenario_id="simple_weather",
        description="User wants to know the weather",
        starting_prompt="What's the weather like in Tokyo?",
        goals=["Get current temperature"],
        max_turns=4,
    )

    # Run simulation
    result = await simulator.run(graph, scenario)

    # Check results
    print(f"Completed: {result.completed}")
    print(f"Turns: {result.turns}")
    print(f"Goals achieved: {result.goals_achieved}")

    # Print conversation
    for msg in result.conversation:
        print(f"{msg['role'].upper()}: {msg['content'][:100]}...")

asyncio.run(main())
Creating Scenarios¶
Basic Scenario¶
scenario = ConversationScenario(
    scenario_id="greeting",
    description="Basic greeting interaction",
    starting_prompt="Hello!",
    goals=["Receive a friendly greeting back"],
    max_turns=2,
)
Multi-Step Scenario¶
scenario = ConversationScenario(
    scenario_id="flight_booking",
    description="User wants to book a flight from NYC to London",
    starting_prompt="I need to book a flight to London",
    conversation_plan="""
    1. Provide departure city (New York)
    2. Specify travel dates (next Friday)
    3. Indicate passenger count (2 adults)
    4. Select flight preference (morning, direct)
    5. Confirm booking
    """,
    goals=[
        "Search for flights",
        "View flight options",
        "Complete booking",
    ],
    max_turns=10,
)
Edge Case Scenario¶
scenario = ConversationScenario(
    scenario_id="error_recovery",
    description="User makes mistakes and needs to correct them",
    starting_prompt="Book me a flight to Londno",  # Typo
    conversation_plan="""
    1. Make typo in city name
    2. Correct when prompted
    3. Provide incomplete info
    4. Complete booking successfully
    """,
    goals=[
        "Handle typo gracefully",
        "Complete booking despite errors",
    ],
    max_turns=8,
    metadata={"test_type": "error_handling"},
)
Adversarial Scenario¶
scenario = ConversationScenario(
    scenario_id="off_topic",
    description="User tries to go off-topic",
    starting_prompt="Can you help me with travel?",
    conversation_plan="""
    1. Start with valid travel question
    2. Try to discuss unrelated topics
    3. Return to travel planning
    """,
    goals=[
        "Agent stays focused on travel",
        "Agent politely redirects",
    ],
    max_turns=6,
)
Running Simulations¶
Single Scenario¶
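Running a single scenario takes your compiled graph and a ConversationScenario (a minimal sketch, assuming graph and scenario are defined as in the examples above):

simulator = UserSimulator(model="gpt-4o-mini")
result = await simulator.run(graph, scenario)

print(f"Completed: {result.completed}")
print(f"Goals achieved: {result.goals_achieved}")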
With Configuration¶
from agentflow.evaluation import UserSimulatorConfig
config = UserSimulatorConfig(
    model="gpt-4o",        # More capable model
    temperature=0.5,       # Less random responses
    max_invocations=12,    # Higher turn limit
    timeout_seconds=60,    # Per-turn timeout
)
simulator = UserSimulator(config=config)
result = await simulator.run(graph, scenario)
Batch Simulation¶
Run multiple scenarios:
from agentflow.evaluation import BatchSimulator
# Create scenarios
scenarios = [
    ConversationScenario(
        scenario_id="weather",
        starting_prompt="What's the weather?",
        goals=["Get weather info"],
        max_turns=4,
    ),
    ConversationScenario(
        scenario_id="booking",
        starting_prompt="Book a hotel",
        goals=["Complete booking"],
        max_turns=6,
    ),
    ConversationScenario(
        scenario_id="support",
        starting_prompt="I have a problem",
        goals=["Issue resolved"],
        max_turns=8,
    ),
]
# Run all scenarios
batch_simulator = BatchSimulator(model="gpt-4o-mini")
results = await batch_simulator.run_all(graph, scenarios)
# Analyze results
for result in results:
    print(f"{result.scenario_id}: {'✓' if result.completed else '✗'}")
    print(f"  Goals: {result.goals_achieved}")
Parallel Batch Simulation¶
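Large scenario sets can be run concurrently. A minimal sketch using the parallel and max_concurrency options that also appear in the stress-testing example later in this page:

batch_simulator = BatchSimulator(model="gpt-4o-mini")
results = await batch_simulator.run_all(
    graph,
    scenarios,
    parallel=True,        # run scenarios concurrently
    max_concurrency=5,    # cap simultaneous simulations
)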
Goal Checking¶
Goals are checked against the conversation history to determine if objectives were met.
Simple Keyword Goals¶
scenario = ConversationScenario(
    scenario_id="weather",
    starting_prompt="What's the weather in Paris?",
    goals=[
        "temperature",  # Response should mention temperature
        "Paris",        # Response should mention Paris
        "weather",      # Response should discuss weather
    ],
)
Complex Goal Patterns¶
For more sophisticated goal checking, subclass UserSimulator:
class CustomSimulator(UserSimulator):
    def _check_goals(
        self,
        scenario: ConversationScenario,
        conversation: list[dict],
    ) -> list[str]:
        achieved = []
        full_text = " ".join(m["content"] for m in conversation)

        for goal in scenario.goals:
            # Custom logic per goal type
            if goal.startswith("TOOL:"):
                tool_name = goal.replace("TOOL:", "")
                if self._tool_was_called(tool_name):
                    achieved.append(goal)
            elif goal.startswith("CONTAINS:"):
                keyword = goal.replace("CONTAINS:", "")
                if keyword.lower() in full_text.lower():
                    achieved.append(goal)
            else:
                # Default: keyword matching
                if goal.lower() in full_text.lower():
                    achieved.append(goal)

        return achieved
Integration with Evaluation¶
Combining with EvalSet¶
Generate dynamic eval cases from simulations:
from agentflow.evaluation import EvalSet, EvalCase, Invocation, MessageContent
async def generate_eval_cases(graph, scenarios):
    """Run simulations and convert to eval cases."""
    simulator = UserSimulator(model="gpt-4o-mini")
    cases = []

    for scenario in scenarios:
        result = await simulator.run(graph, scenario)

        if result.completed:
            # Convert successful simulation to eval case
            invocations = [
                Invocation(
                    invocation_id=f"turn_{i}",
                    user_content=MessageContent.user(msg["content"]),
                )
                for i, msg in enumerate(result.conversation)
                if msg["role"] == "user"
            ]

            case = EvalCase(
                eval_id=scenario.scenario_id,
                name=scenario.description,
                conversation=invocations,
                metadata={"generated_by": "simulation"},
            )
            cases.append(case)

    return EvalSet(
        eval_set_id="generated",
        name="Generated from simulations",
        eval_cases=cases,
    )
Quality Evaluation of Simulations¶
async def evaluate_simulation_quality(result: SimulationResult):
    """Evaluate the quality of a simulation run."""
    from agentflow.evaluation import HallucinationCriterion, SafetyCriterion

    # Extract assistant responses
    assistant_msgs = [
        m["content"] for m in result.conversation
        if m["role"] == "assistant"
    ]

    # Check safety
    safety = SafetyCriterion()
    # ... evaluate responses

    return {
        "turns": result.turns,
        "goals_achieved": len(result.goals_achieved),
        "completion": result.completed,
    }
Advanced Usage¶
Custom User Personas¶
PERSONA_PROMPT = """You are simulating a user with this persona:
PERSONA:
{persona}
Stay in character throughout the conversation.
"""
class PersonaSimulator(UserSimulator):
    def __init__(self, persona: str, **kwargs):
        super().__init__(**kwargs)
        self.persona = persona

    def _build_prompt(self, scenario, conversation):
        base_prompt = super()._build_prompt(scenario, conversation)
        return PERSONA_PROMPT.format(persona=self.persona) + base_prompt

# Usage
impatient_user = PersonaSimulator(
    persona="An impatient user who wants quick answers and gets frustrated with long responses",
    model="gpt-4o-mini",
)

tech_savvy = PersonaSimulator(
    persona="A technically proficient user who understands APIs and wants detailed information",
    model="gpt-4o-mini",
)
Conditional Behavior¶
scenario = ConversationScenario(
    scenario_id="conditional_flow",
    starting_prompt="I need help with my order",
    conversation_plan="""
    1. Ask about order status
    2. IF order is delayed: Express frustration
       ELSE: Thank the agent
    3. Request follow-up action
    """,
    goals=["Order status provided", "Issue resolved"],
)
Stress Testing¶
async def stress_test_agent(graph, num_simulations: int = 50):
    """Run many simulations to find edge cases."""
    scenarios = [
        generate_random_scenario(i)
        for i in range(num_simulations)
    ]

    simulator = BatchSimulator(model="gpt-4o-mini")
    results = await simulator.run_all(
        graph,
        scenarios,
        parallel=True,
        max_concurrency=10,
    )

    # Analyze failures
    failures = [r for r in results if not r.completed]
    print(f"Failure rate: {len(failures) / len(results) * 100:.1f}%")

    for failure in failures:
        print(f"\nFailed scenario: {failure.scenario_id}")
        print(f"Error: {failure.error}")
        print(f"Last message: {failure.conversation[-1] if failure.conversation else 'N/A'}")

    return results
Configuration Reference¶
UserSimulatorConfig¶
from agentflow.evaluation import UserSimulatorConfig
config = UserSimulatorConfig(
    # LLM settings
    model="gpt-4o-mini",   # Model for user simulation
    temperature=0.7,       # Response creativity (0-1)

    # Limits
    max_invocations=10,    # Max conversation turns
    timeout_seconds=30,    # Per-turn timeout

    # Behavior
    retry_on_error=True,   # Retry failed LLM calls
    max_retries=3,         # Number of retries
)
ConversationScenario Fields¶
| Field | Type | Description |
|---|---|---|
| scenario_id | str | Unique identifier |
| description | str | Human-readable description |
| starting_prompt | str | First user message |
| conversation_plan | str | High-level conversation flow |
| goals | list[str] | Objectives to achieve |
| max_turns | int | Maximum conversation turns |
| metadata | dict | Additional data |
SimulationResult Fields¶
| Field | Type | Description |
|---|---|---|
| scenario_id | str | Scenario that was run |
| turns | int | Number of turns executed |
| conversation | list[dict] | Full conversation history |
| goals_achieved | list[str] | Goals that were met |
| completed | bool | Whether simulation completed |
| error | str | Error message if failed |
Best Practices¶
1. Start Simple¶
# Good: Start with basic scenarios
simple = ConversationScenario(
    scenario_id="simple",
    starting_prompt="Hello",
    goals=["greeting"],
    max_turns=2,
)

# Then add complexity
complex = ConversationScenario(
    scenario_id="complex",
    starting_prompt="I need help with multiple things...",
    conversation_plan="1. ... 2. ... 3. ...",
    goals=["goal1", "goal2", "goal3"],
    max_turns=10,
)
2. Define Clear Goals¶
# Good: Specific, verifiable goals
goals=["temperature", "humidity", "forecast"]
# Bad: Vague goals
goals=["helpful", "good response"]
3. Use Conversation Plans¶
# Good: Clear plan
conversation_plan="""
1. Ask about current weather
2. Ask about tomorrow's forecast
3. Ask about packing recommendations
"""
# Bad: No structure
conversation_plan=""
4. Set Appropriate Turn Limits¶
# Simple query: 2-4 turns
max_turns=4
# Multi-step task: 6-10 turns
max_turns=8
# Complex workflow: 10-15 turns
max_turns=12
5. Monitor Costs¶
# Use cheaper model for bulk testing
simulator = UserSimulator(model="gpt-4o-mini")
# Use capable model for quality testing
simulator = UserSimulator(model="gpt-4o")
Troubleshooting¶
Simulation Doesn't Complete¶
- Increase max_turns: Conversation may need more turns
- Simplify goals: Goals may be too complex
- Check agent responses: Agent may be stuck
Goals Not Achieved¶
- Check goal keywords: Ensure they match expected responses
- Review conversation: Agent may not be providing expected info
- Adjust conversation plan: Guide the simulated user better
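To see which keyword goals matched, you can replay the default keyword check by hand (a debugging sketch based on the matching logic shown earlier, not a library helper):

full_text = " ".join(m["content"] for m in result.conversation)
for goal in scenario.goals:
    status = "found" if goal.lower() in full_text.lower() else "missing"
    print(f"{goal}: {status}")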
Inconsistent Results¶
- Lower temperature: Reduce randomness
- Use more specific prompts: Better guide the simulator
- Run multiple times: Average results for reliability
async def run_multiple_times(graph, scenario, n=5):
    """Run simulation multiple times for reliability."""
    simulator = UserSimulator(model="gpt-4o-mini")
    results = []

    for _ in range(n):
        result = await simulator.run(graph, scenario)
        results.append(result)

    # Calculate success rate
    success_rate = sum(r.completed for r in results) / n
    avg_goals = sum(len(r.goals_achieved) for r in results) / n

    return {
        "success_rate": success_rate,
        "avg_goals_achieved": avg_goals,
        "results": results,
    }