Advanced Patterns & Performance¶

This section explores higher-level compositions and tuning techniques once you grasp core graph mechanics.

Multi-Agent Orchestration¶

While a single StateGraph can coordinate reasoning + tools, complex systems may compose multiple specialized graphs:

Pattern	Description	Example
Router → Workers	Classifier graph delegates to domain-specific subgraphs	Customer support triaging billing vs tech
Supervisor + Tools	Supervisory graph decides the next sub-task and spawns a tool-rich worker	Research agent splitting work into search, summarisation, and synthesis
Map-Reduce	Parallel subgraphs process shards; aggregator combines	Summarizing many documents
Hierarchical Memory	One graph updates long-term store; another handles short-term dialog	Knowledge-grounded assistants

First-class nested graph APIs are planned and will simplify this; today you can approximate these patterns by having a node explicitly invoke another compiled graph.
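The Router → Workers row can be approximated along these lines. This is a minimal sketch under stated assumptions, not the library's API: `CompiledGraphStub`, its `invoke(state)` method, `classify`, and the dict-shaped state are hypothetical stand-ins for whatever your compiled graphs actually expose.

```python
from typing import Callable, Dict


class CompiledGraphStub:
    """Hypothetical stand-in for a compiled graph; only `invoke` is assumed."""

    def __init__(self, name: str, handler: Callable[[dict], dict]):
        self.name = name
        self._handler = handler

    def invoke(self, state: dict) -> dict:
        return self._handler(state)


def classify(state: dict) -> str:
    """Toy keyword classifier; a real router would usually ask an LLM."""
    text = state["question"].lower()
    return "billing" if "invoice" in text or "refund" in text else "tech"


# Domain-specific worker graphs, keyed by the label the classifier returns.
WORKERS: Dict[str, CompiledGraphStub] = {
    "billing": CompiledGraphStub("billing", lambda s: {**s, "answer": "Routed to billing"}),
    "tech": CompiledGraphStub("tech", lambda s: {**s, "answer": "Routed to tech support"}),
}


def router_node(state: dict) -> dict:
    """Node body for the outer graph: pick a worker and invoke it explicitly."""
    label = classify(state)
    result = WORKERS[label].invoke(state)
    return {**result, "handled_by": label}
```

Keeping the worker lookup in a plain mapping means a new domain is added by registering one more compiled graph, without touching the router node.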

Dynamic Tool Injection¶

Inject tool availability based on state or user tier:

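def provide_tools(state):
    """Return the tools the LLM may call on this turn."""
    # Copy so the shared base list is never mutated between requests.
    tool_list = base_tools.copy()
    # `user_profile` is assumed to be a dict-like field on your state.
    if state.user_profile.get("tier") == "pro":
        tool_list += pro_tools
    return tool_list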

Call this right before each LLM request and pass the result as the available tools, instead of instantiating one monolithic `ToolNode` up front.
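A minimal, self-contained sketch of that just-in-time wiring follows. `State`, `call_llm`, `reasoning_node`, and the tool names are hypothetical placeholders; substitute your own state class and model client.

```python
from dataclasses import dataclass, field

base_tools = ["search", "calculator"]   # placeholder tool names
pro_tools = ["code_interpreter"]        # placeholder pro-only tool


@dataclass
class State:
    # Assumed dict-like profile carried on the state.
    user_profile: dict = field(default_factory=dict)


def provide_tools(state):
    tool_list = base_tools.copy()        # copy: never mutate the shared list
    if state.user_profile.get("tier") == "pro":
        tool_list += pro_tools
    return tool_list


def call_llm(messages, tools):
    """Stub for a model call; it only echoes which tools were offered."""
    return {"offered_tools": list(tools)}


def reasoning_node(state, messages):
    # Resolve tools at call time so a tier change takes effect on the next turn.
    return call_llm(messages, tools=provide_tools(state))
```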

Background Enrichment¶

Long-running tasks (vector indexing, summarisation) can trail the main conversation:

1. The user asks a complex question.
2. A node schedules a retrieval-expansion job via `BackgroundTaskManager`.
3. The conversation proceeds with a placeholder.
4. When the task completes, its result is appended to the store, so future turns benefit.

Make jobs idempotent by hashing their inputs (e.g. a digest of the document chunk) and skipping any job whose hash has already been seen.
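One way to get that deduplication is sketched below. It does not use `BackgroundTaskManager`, whose API is not shown here; a standard-library `ThreadPoolExecutor` stands in for it, and `job_key` and `schedule_once` are hypothetical helper names.

```python
import hashlib
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict

_executor = ThreadPoolExecutor(max_workers=2)   # stand-in for a task manager
_jobs: Dict[str, Future] = {}                   # digest -> already scheduled job
_lock = threading.Lock()


def job_key(job_name: str, payload: str) -> str:
    """Stable digest of the job inputs; equal inputs give equal keys."""
    return hashlib.sha256(f"{job_name}\x00{payload}".encode("utf-8")).hexdigest()


def schedule_once(job_name: str, payload: str, fn: Callable[[str], object]) -> Future:
    """Submit `fn(payload)` unless an identical job was already submitted."""
    key = job_key(job_name, payload)
    with _lock:  # check-and-insert must be atomic across threads
        if key not in _jobs:
            _jobs[key] = _executor.submit(fn, payload)
        return _jobs[key]
```

The registry here lives in process memory, so it only deduplicates within one worker; persisting the digests in your store would extend the guarantee across restarts.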

State Minimisation Strategy¶

State grows with every turn; consider splitting memory into layers:

Layer	Contents	Persistence
Active Context	Last N messages	Always in `state.context`
Summary	Rolling narrative	Stored in `context_summary`
External Store	Full history, embeddings	`BaseStore` / vector DB

Periodically:

1. Summarise older messages into `context_summary`.
2. Offload full transcripts to the store.
3. Truncate `context` to a sliding window.
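The three periodic steps can be sketched as a single compaction pass. `ConversationState`, `compact`, the list used as an archive, and the `summarise` callback are assumptions for illustration; in practice the summary would come from an LLM call and the archive would be your external store.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConversationState:
    context: List[str] = field(default_factory=list)   # active window
    context_summary: str = ""                          # rolling narrative


def compact(
    state: ConversationState,
    archive: List[str],
    summarise: Callable[[str, List[str]], str],
    window: int = 4,
) -> None:
    """Summarise and offload everything older than the last `window` messages."""
    cut = len(state.context) - window
    if cut <= 0:
        return  # nothing has fallen out of the window yet
    older, recent = state.context[:cut], state.context[cut:]
    state.context_summary = summarise(state.context_summary, older)  # step 1
    archive.extend(older)                                            # step 2
    state.context = recent                                           # step 3
```

Summarising before truncating matters: once the older messages leave `context`, the summary and the archive are the only places they survive.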

Performance Tuning Cheatsheet¶

Issue	Mitigation
Slow tool chain	Parallelize independent calls (future feature) or restructure them into a single batch tool
High token usage	Aggressive summarisation + retrieval instead of raw replay
Frequent identical tool calls	Memoize with a cache layer keyed by tool name and arguments
Unstable latency	Warm LLM/model sessions; pre-create container-bound clients
Large message objects	Strip raw provider payloads after conversion (optional config)
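For the "frequent identical tool calls" row, a cache keyed by arguments might look like the sketch below. `cache_key` and `cached_call` are hypothetical helpers, and the in-memory dict is a stand-in for whatever cache layer you deploy.

```python
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}


def cache_key(tool_name: str, args: dict) -> str:
    """Canonical key: sorted JSON, so keyword order does not matter."""
    return tool_name + ":" + json.dumps(args, sort_keys=True, default=str)


def cached_call(tool_name: str, tool: Callable[..., Any], **args: Any) -> Any:
    """Run `tool(**args)` once per distinct argument set and reuse the result."""
    key = cache_key(tool_name, args)
    if key not in _cache:
        _cache[key] = tool(**args)
    return _cache[key]
```

This only suits tools whose results stay valid for the cache lifetime; for anything time-sensitive, add an expiry to each entry.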

Observability Enhancements¶

Add correlation identifiers:

- Use a custom `BaseIDGenerator` with a tenant prefix
- Include `thread_name` in every published event
- Attach semantic spans via callback hooks (`before_node`, `after_node`)

Expose metrics:

Metric	Source
`agent_steps_total`	Increment after each node
`tool_invocations_total`	Count executed tool calls
`reasoning_tokens_total`	Sum from `Message.usages`
`latency_node_seconds`	Timestamp diff in callbacks
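The metrics in the table could be collected roughly as follows. The hook signatures are assumptions: the real `before_node` and `after_node` callbacks may receive different arguments, and the plain dictionaries stand in for a metrics client.

```python
import time
from collections import defaultdict

counters = defaultdict(int)          # metric name -> running total
node_latencies = defaultdict(list)   # node name -> observed durations (seconds)
_started = {}                        # (thread_id, node) -> start timestamp


def before_node(thread_id: str, node: str) -> None:
    """Assumed hook: remember when this node started for this thread."""
    _started[(thread_id, node)] = time.perf_counter()


def after_node(thread_id: str, node: str, tool_calls: int = 0, tokens: int = 0) -> None:
    """Assumed hook: record latency and bump the counters for one node run."""
    start = _started.pop((thread_id, node), None)
    if start is not None:
        # Feeds `latency_node_seconds`.
        node_latencies[node].append(time.perf_counter() - start)
    counters["agent_steps_total"] += 1
    counters["tool_invocations_total"] += tool_calls
    counters["reasoning_tokens_total"] += tokens
```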

Fault Tolerance Patterns¶

Failure	Strategy
Transient LLM errors	Retry with exponential backoff wrapper inside node
Tool timeout	Circuit-breaker: mark tool unavailable for cool-down window
Checkpointer outage	Fall back to an in-memory checkpointer and emit a warning event
Partial stream drop	Buffer deltas locally until final message commit
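The first two rows can be sketched with the standard library alone. `TransientError`, `retry_with_backoff`, and `CircuitBreaker` are hypothetical names; map `TransientError` to whatever retryable exception your LLM client raises.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for the retryable error your LLM client raises."""


def retry_with_backoff(fn, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `fn`; on TransientError wait base_delay * 2**n (plus jitter) and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


class CircuitBreaker:
    """Marks a tool unavailable for `cooldown` seconds after `threshold` failures."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.failures, self.opened_at = 0, None  # cool-down over: close again
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
```

A node would check `available()` before offering the tool, and call `record_failure()` on each timeout and `record_success()` otherwise. The `sleep` and `clock` parameters exist so both helpers can be tested without real waiting.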

Safe Execution Sandbox¶

For untrusted tool logic:

- Run tool execution in a restricted subprocess
- Validate inputs strictly against a JSON schema
- Enforce a timeout per tool and a global budget per step
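A minimal sketch of those three points follows. `TOOL_SOURCE`, `validate_args`, and `run_tool` are hypothetical, and the hand-written validation stands in for a real JSON Schema validator. A child process with a timeout limits runaway tools, but it is not a security boundary on its own; untrusted code also needs OS-level isolation such as containers or resource limits.

```python
import json
import subprocess
import sys

# Toy tool, executed in a child interpreter: reads JSON args, prints a JSON result.
TOOL_SOURCE = """
import json, sys
args = json.load(sys.stdin)
print(json.dumps({"sum": args["a"] + args["b"]}))
"""


def validate_args(args: dict) -> dict:
    """Strict check: exactly the keys `a` and `b`, both numbers (bool excluded)."""
    if set(args) != {"a", "b"}:
        raise ValueError("expected exactly the keys 'a' and 'b'")
    for key, value in args.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key!r} must be a number")
    return args


def run_tool(args: dict, timeout_s: float = 5.0) -> dict:
    """Run the tool in a separate interpreter with a hard timeout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", TOOL_SOURCE],  # -I: isolated mode
        input=json.dumps(validate_args(args)),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
        check=True,         # raises CalledProcessError on non-zero exit
    )
    return json.loads(proc.stdout)
```

A per-step budget can be layered on top by tracking elapsed time across calls and passing the remainder as `timeout_s`.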

Experimentation & A/B¶

Encode experiment variant in config:

config = {"thread_id": tid, "variant": "tool-strategy-B"}

Branch in node:

if config.get("variant") == "tool-strategy-B":
    tools = alt_toolset

Log variant with every published event for offline comparison (success rate, latency, token cost).
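Put together, branching and event tagging might look like this sketch. `select_tools`, `tag_event`, the event dictionaries, and the `"control"` label for threads without a variant are all assumptions for illustration.

```python
def select_tools(config, default_toolset, alt_toolset):
    """Branch on the variant stored in config; other variants get the default."""
    if config.get("variant") == "tool-strategy-B":
        return alt_toolset
    return default_toolset


def tag_event(config, event):
    """Return a copy of `event` that carries the thread and variant labels."""
    return {
        **event,
        "thread_id": config.get("thread_id"),
        "variant": config.get("variant", "control"),  # assumed default label
    }
```

Tagging at the point where events are published, rather than inside each node, keeps the variant label from being forgotten when new nodes are added.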

Roadmap-Oriented Extensibility¶

Design choices enabling future features:

Future Feature	Existing Hook
Nested graphs	`Command(graph=...)` placeholder
Parallel branches	Background tasks + future branch scheduler
Adaptive memory pruning	`BaseContextManager` injection
Multi-provider ensemble	Converter abstraction + dynamic provider selection node

Checklist Before Production¶

[ ] Deterministic termination paths tested
[ ] Recursion limit sized for longest scenario
[ ] Tool idempotency validated
[ ] State serialisation size acceptable under worst cases
[ ] Observability events consumed by monitoring stack
[ ] Security review of external tool surfaces
[ ] Back-pressure strategy for streaming consumers

With these patterns you can evolve incrementally from a prototype assistant to a resilient agent platform.