User Simulation¶
User simulation enables dynamic conversation testing by using an LLM to simulate realistic user behavior. This is useful when fixed prompts aren't practical or when you want to test agent robustness.
Why User Simulation?¶
Static eval sets have limitations:
- Fixed prompts don't test edge cases
- Multi-turn conversations are tedious to write manually
- User behavior varies in real-world usage
- You can't predict all possible user inputs
User simulation solves this by:
- Dynamically generating user messages based on context
- Following conversation plans to test specific scenarios
- Checking goal completion to validate outcomes
- Creating diverse test cases automatically
Core Concepts¶
ConversationScenario¶
A scenario defines what the simulated user is trying to accomplish:
from agentflow.evaluation import ConversationScenario
scenario = ConversationScenario(
    scenario_id="travel_planning",
    description="User planning a trip to Japan",
    starting_prompt="I'm thinking about visiting Japan next month",
    conversation_plan="""
    1. Ask about weather conditions
    2. Inquire about recommended destinations
    3. Ask about visa requirements
    4. Request packing suggestions
    """,
    goals=[
        "Get weather information",
        "Receive destination recommendations",
        "Learn about visa requirements",
    ],
    max_turns=8,
)
UserSimulator¶
The simulator runs conversations against your agent:
from agentflow.evaluation import UserSimulator
simulator = UserSimulator(
    model="gpt-4o-mini",  # LLM for generating user messages
    temperature=0.7,      # Creativity in responses
    max_turns=10,         # Default turn limit
)
SimulationResult¶
The result contains the full conversation and goal tracking:
result = await simulator.run(graph, scenario)
print(f"Turns: {result.turns}")
print(f"Completed: {result.completed}")
print(f"Goals achieved: {result.goals_achieved}")
print(f"Conversation: {result.conversation}")
Quick Start¶
import asyncio
from agentflow.evaluation import UserSimulator, ConversationScenario
async def main():
    # Create your compiled graph
    graph = await create_travel_agent_graph()

    # Create simulator
    simulator = UserSimulator(model="gpt-4o-mini")

    # Define scenario
    scenario = ConversationScenario(
        scenario_id="simple_weather",
        description="User wants to know the weather",
        starting_prompt="What's the weather like in Tokyo?",
        goals=["Get current temperature"],
        max_turns=4,
    )

    # Run simulation
    result = await simulator.run(graph, scenario)

    # Check results
    print(f"Completed: {result.completed}")
    print(f"Turns: {result.turns}")
    print(f"Goals achieved: {result.goals_achieved}")

    # Print conversation
    for msg in result.conversation:
        print(f"{msg['role'].upper()}: {msg['content'][:100]}...")

asyncio.run(main())
Creating Scenarios¶
Basic Scenario¶
scenario = ConversationScenario(
    scenario_id="greeting",
    description="Basic greeting interaction",
    starting_prompt="Hello!",
    goals=["Receive a friendly greeting back"],
    max_turns=2,
)
Multi-Step Scenario¶
scenario = ConversationScenario(
    scenario_id="flight_booking",
    description="User wants to book a flight from NYC to London",
    starting_prompt="I need to book a flight to London",
    conversation_plan="""
    1. Provide departure city (New York)
    2. Specify travel dates (next Friday)
    3. Indicate passenger count (2 adults)
    4. Select flight preference (morning, direct)
    5. Confirm booking
    """,
    goals=[
        "Search for flights",
        "View flight options",
        "Complete booking",
    ],
    max_turns=10,
)
Edge Case Scenario¶
scenario = ConversationScenario(
    scenario_id="error_recovery",
    description="User makes mistakes and needs to correct them",
    starting_prompt="Book me a flight to Londno",  # Typo
    conversation_plan="""
    1. Make typo in city name
    2. Correct when prompted
    3. Provide incomplete info
    4. Complete booking successfully
    """,
    goals=[
        "Handle typo gracefully",
        "Complete booking despite errors",
    ],
    max_turns=8,
    metadata={"test_type": "error_handling"},
)
Adversarial Scenario¶
scenario = ConversationScenario(
    scenario_id="off_topic",
    description="User tries to go off-topic",
    starting_prompt="Can you help me with travel?",
    conversation_plan="""
    1. Start with valid travel question
    2. Try to discuss unrelated topics
    3. Return to travel planning
    """,
    goals=[
        "Agent stays focused on travel",
        "Agent politely redirects",
    ],
    max_turns=6,
)
Running Simulations¶
Single Scenario¶
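Running a single scenario takes your compiled graph and a ConversationScenario (a minimal sketch, assuming graph and scenario are defined as in the examples above):

simulator = UserSimulator(model="gpt-4o-mini")
result = await simulator.run(graph, scenario)

print(f"Completed: {result.completed}")
print(f"Goals achieved: {result.goals_achieved}")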
With Configuration¶
from agentflow.evaluation import UserSimulatorConfig
config = UserSimulatorConfig(
    model="gpt-4o",        # More capable model
    temperature=0.5,       # Less random responses
    max_invocations=12,    # Higher turn limit
    timeout_seconds=60,    # Per-turn timeout
)
simulator = UserSimulator(config=config)
result = await simulator.run(graph, scenario)
Batch Simulation¶
Run multiple scenarios:
from agentflow.evaluation import BatchSimulator
# Create scenarios
scenarios = [
    ConversationScenario(
        scenario_id="weather",
        starting_prompt="What's the weather?",
        goals=["Get weather info"],
        max_turns=4,
    ),
    ConversationScenario(
        scenario_id="booking",
        starting_prompt="Book a hotel",
        goals=["Complete booking"],
        max_turns=6,
    ),
    ConversationScenario(
        scenario_id="support",
        starting_prompt="I have a problem",
        goals=["Issue resolved"],
        max_turns=8,
    ),
]
# Run all scenarios
batch_simulator = BatchSimulator(model="gpt-4o-mini")
results = await batch_simulator.run_all(graph, scenarios)
# Analyze results
for result in results:
    print(f"{result.scenario_id}: {'✓' if result.completed else '✗'}")
    print(f"  Goals: {result.goals_achieved}")
Parallel Batch Simulation¶
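Large scenario sets can be run concurrently. A minimal sketch using the parallel and max_concurrency options that also appear in the stress-testing example later in this page:

batch_simulator = BatchSimulator(model="gpt-4o-mini")
results = await batch_simulator.run_all(
    graph,
    scenarios,
    parallel=True,        # run scenarios concurrently
    max_concurrency=5,    # cap simultaneous simulations
)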
Goal Checking¶
Goals are checked against the conversation history to determine if objectives were met.
Simple Keyword Goals¶
scenario = ConversationScenario(
    scenario_id="weather",
    starting_prompt="What's the weather in Paris?",
    goals=[
        "temperature",  # Response should mention temperature
        "Paris",        # Response should mention Paris
        "weather",      # Response should discuss weather
    ],
)
Complex Goal Patterns¶
For more sophisticated goal checking, subclass UserSimulator:
class CustomSimulator(UserSimulator):
    def _check_goals(
        self,
        scenario: ConversationScenario,
        conversation: list[dict],
    ) -> list[str]:
        achieved = []
        full_text = " ".join(m["content"] for m in conversation)

        for goal in scenario.goals:
            # Custom logic per goal type
            if goal.startswith("TOOL:"):
                tool_name = goal.replace("TOOL:", "")
                if self._tool_was_called(tool_name):
                    achieved.append(goal)
            elif goal.startswith("CONTAINS:"):
                keyword = goal.replace("CONTAINS:", "")
                if keyword.lower() in full_text.lower():
                    achieved.append(goal)
            else:
                # Default: keyword matching
                if goal.lower() in full_text.lower():
                    achieved.append(goal)

        return achieved
Integration with Evaluation¶
Combining with EvalSet¶
Generate dynamic eval cases from simulations:
from agentflow.evaluation import EvalSet, EvalCase, Invocation, MessageContent
async def generate_eval_cases(graph, scenarios):
    """Run simulations and convert to eval cases."""
    simulator = UserSimulator(model="gpt-4o-mini")
    cases = []

    for scenario in scenarios:
        result = await simulator.run(graph, scenario)

        if result.completed:
            # Convert successful simulation to eval case
            invocations = [
                Invocation(
                    invocation_id=f"turn_{i}",
                    user_content=MessageContent.user(msg["content"]),
                )
                for i, msg in enumerate(result.conversation)
                if msg["role"] == "user"
            ]

            case = EvalCase(
                eval_id=scenario.scenario_id,
                name=scenario.description,
                conversation=invocations,
                metadata={"generated_by": "simulation"},
            )
            cases.append(case)

    return EvalSet(
        eval_set_id="generated",
        name="Generated from simulations",
        eval_cases=cases,
    )
Quality Evaluation of Simulations¶
async def evaluate_simulation_quality(result: SimulationResult):
    """Evaluate the quality of a simulation run."""
    from agentflow.evaluation import HallucinationCriterion, SafetyCriterion

    # Extract assistant responses
    assistant_msgs = [
        m["content"] for m in result.conversation
        if m["role"] == "assistant"
    ]

    # Check safety
    safety = SafetyCriterion()
    # ... evaluate responses

    return {
        "turns": result.turns,
        "goals_achieved": len(result.goals_achieved),
        "completion": result.completed,
    }
Advanced Usage¶
Custom User Personas¶
PERSONA_PROMPT = """You are simulating a user with this persona:
PERSONA:
{persona}
Stay in character throughout the conversation.
"""
class PersonaSimulator(UserSimulator):
    def __init__(self, persona: str, **kwargs):
        super().__init__(**kwargs)
        self.persona = persona

    def _build_prompt(self, scenario, conversation):
        base_prompt = super()._build_prompt(scenario, conversation)
        return PERSONA_PROMPT.format(persona=self.persona) + base_prompt

# Usage
impatient_user = PersonaSimulator(
    persona="An impatient user who wants quick answers and gets frustrated with long responses",
    model="gpt-4o-mini",
)

tech_savvy = PersonaSimulator(
    persona="A technically proficient user who understands APIs and wants detailed information",
    model="gpt-4o-mini",
)
Conditional Behavior¶
scenario = ConversationScenario(
    scenario_id="conditional_flow",
    starting_prompt="I need help with my order",
    conversation_plan="""
    1. Ask about order status
    2. IF order is delayed: Express frustration
       ELSE: Thank the agent
    3. Request follow-up action
    """,
    goals=["Order status provided", "Issue resolved"],
)
Stress Testing¶
async def stress_test_agent(graph, num_simulations: int = 50):
    """Run many simulations to find edge cases."""
    scenarios = [
        generate_random_scenario(i)
        for i in range(num_simulations)
    ]

    simulator = BatchSimulator(model="gpt-4o-mini")
    results = await simulator.run_all(
        graph,
        scenarios,
        parallel=True,
        max_concurrency=10,
    )

    # Analyze failures
    failures = [r for r in results if not r.completed]
    print(f"Failure rate: {len(failures) / len(results) * 100:.1f}%")

    for failure in failures:
        print(f"\nFailed scenario: {failure.scenario_id}")
        print(f"Error: {failure.error}")
        print(f"Last message: {failure.conversation[-1] if failure.conversation else 'N/A'}")

    return results
Configuration Reference¶
UserSimulatorConfig¶
from agentflow.evaluation import UserSimulatorConfig
config = UserSimulatorConfig(
    # LLM settings
    model="gpt-4o-mini",   # Model for user simulation
    temperature=0.7,       # Response creativity (0-1)

    # Limits
    max_invocations=10,    # Max conversation turns
    timeout_seconds=30,    # Per-turn timeout

    # Behavior
    retry_on_error=True,   # Retry failed LLM calls
    max_retries=3,         # Number of retries
)
ConversationScenario Fields¶
| Field | Type | Description |
|---|---|---|
| scenario_id | str | Unique identifier |
| description | str | Human-readable description |
| starting_prompt | str | First user message |
| conversation_plan | str | High-level conversation flow |
| goals | list[str] | Objectives to achieve |
| max_turns | int | Maximum conversation turns |
| metadata | dict | Additional data |
SimulationResult Fields¶
| Field | Type | Description |
|---|---|---|
| scenario_id | str | Scenario that was run |
| turns | int | Number of turns executed |
| conversation | list[dict] | Full conversation history |
| goals_achieved | list[str] | Goals that were met |
| completed | bool | Whether simulation completed |
| error | str | Error message if failed |
Best Practices¶
1. Start Simple¶
# Good: Start with basic scenarios
simple = ConversationScenario(
    scenario_id="simple",
    starting_prompt="Hello",
    goals=["greeting"],
    max_turns=2,
)

# Then add complexity
complex = ConversationScenario(
    scenario_id="complex",
    starting_prompt="I need help with multiple things...",
    conversation_plan="1. ... 2. ... 3. ...",
    goals=["goal1", "goal2", "goal3"],
    max_turns=10,
)
2. Define Clear Goals¶
# Good: Specific, verifiable goals
goals=["temperature", "humidity", "forecast"]
# Bad: Vague goals
goals=["helpful", "good response"]
3. Use Conversation Plans¶
# Good: Clear plan
conversation_plan="""
1. Ask about current weather
2. Ask about tomorrow's forecast
3. Ask about packing recommendations
"""
# Bad: No structure
conversation_plan=""
4. Set Appropriate Turn Limits¶
# Simple query: 2-4 turns
max_turns=4
# Multi-step task: 6-10 turns
max_turns=8
# Complex workflow: 10-15 turns
max_turns=12
5. Monitor Costs¶
# Use cheaper model for bulk testing
simulator = UserSimulator(model="gpt-4o-mini")
# Use capable model for quality testing
simulator = UserSimulator(model="gpt-4o")
Troubleshooting¶
Simulation Doesn't Complete¶
- Increase max_turns: Conversation may need more turns
- Simplify goals: Goals may be too complex
- Check agent responses: Agent may be stuck
Goals Not Achieved¶
- Check goal keywords: Ensure they match expected responses
- Review conversation: Agent may not be providing expected info
- Adjust conversation plan: Guide the simulated user better
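To see which keyword goals matched, you can replay the default keyword check by hand (a debugging sketch based on the matching logic shown earlier, not a library helper):

full_text = " ".join(m["content"] for m in result.conversation)
for goal in scenario.goals:
    status = "found" if goal.lower() in full_text.lower() else "missing"
    print(f"{goal}: {status}")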
Inconsistent Results¶
- Lower temperature: Reduce randomness
- Use more specific prompts: Better guide the simulator
- Run multiple times: Average results for reliability
async def run_multiple_times(graph, scenario, n=5):
    """Run simulation multiple times for reliability."""
    simulator = UserSimulator(model="gpt-4o-mini")
    results = []

    for _ in range(n):
        result = await simulator.run(graph, scenario)
        results.append(result)

    # Calculate success rate
    success_rate = sum(r.completed for r in results) / n
    avg_goals = sum(len(r.goals_achieved) for r in results) / n

    return {
        "success_rate": success_rate,
        "avg_goals_achieved": avg_goals,
        "results": results,
    }