Graceful Shutdown
Source example: agentflow/examples/graceful_shutdown/graceful_shutdown_example.py
What you will build
A long-running asyncio service that:
- handles `SIGINT` and `SIGTERM`
- keeps initialization and cleanup protected from interruption
- processes work in a loop until shutdown is requested
- closes graph resources with `app.aclose()` and logs shutdown statistics
This pattern is important for real services, workers, and containerized deployments.
Why graceful shutdown matters
Without a shutdown strategy, a process can stop in the middle of:
- tool execution
- background work
- state persistence
- resource cleanup
That can leave your system in a partially updated state.
The example shows how to avoid that.
Shutdown architecture
Step 1 - Create a realistic graph
The example uses a normal ReAct pattern rather than a fake no-op loop.
It defines three tools:
- `get_current_time`
- `get_system_status`
- `calculate`
Then it builds:
- a `ToolNode`
- a main `Agent`
- a conditional route back to the tool node when tool calls are present
That makes the example valuable because the shutdown logic is tested around a graph that really does work.
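The example's tool bodies are not reproduced here, but as a rough sketch (plain Python functions with hypothetical implementations, without AgentFlow's tool-registration machinery, which may differ), the three tools could look like:

```python
import ast
import operator
import platform
from datetime import datetime, timezone


def get_current_time() -> str:
    """Return the current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()


def get_system_status() -> dict:
    """Report a minimal snapshot of the host running the service."""
    return {"platform": platform.system(), "python": platform.python_version()}


# Binary operators the calculator tool is willing to evaluate.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculate(expression: str) -> float:
    """Evaluate basic arithmetic safely via the AST, never eval()."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)
```

For example, `calculate("2 + 3 * 4")` returns `14`. The AST-based evaluator is one common way to keep a calculator tool safe against model-provided input.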
Step 2 - Create a GracefulShutdownManager
Inside long_running_service() the example sets:
```python
SHUTDOWN_TIMEOUT = 30.0
shutdown_manager = GracefulShutdownManager(shutdown_timeout=SHUTDOWN_TIMEOUT)
```
Then it compiles the graph with the same timeout:
```python
graph = build_graph().compile(shutdown_timeout=SHUTDOWN_TIMEOUT)
```
This matters because the compiled graph can use that timeout when draining internal resources during shutdown.
Step 3 - Register signal handlers
The service explicitly registers handlers for SIGINT and SIGTERM:
```python
shutdown_manager.register_signal_handlers()
```
Once that is in place:
- pressing `Ctrl+C` sends `SIGINT`
- container orchestrators usually send `SIGTERM`
- the manager flips `shutdown_requested` to `True`
That gives the main loop a clean signal to stop accepting new work.
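To make the flag-flipping concrete, here is a minimal stand-in for the manager's signal plumbing (a hypothetical re-implementation, not AgentFlow's actual code; the real class likely does more, such as coordinating protected sections):

```python
import signal


class MiniShutdownManager:
    """Toy version of a shutdown manager's signal handling."""

    def __init__(self) -> None:
        self.shutdown_requested = False
        self._previous = {}

    def register_signal_handlers(self) -> None:
        for sig in (signal.SIGINT, signal.SIGTERM):
            # Remember the old handler so we can restore it later.
            self._previous[sig] = signal.signal(sig, self._handle)

    def _handle(self, signum, frame) -> None:
        # Just flip the flag; the main loop notices it between tasks.
        self.shutdown_requested = True

    def unregister_signal_handlers(self) -> None:
        for sig, handler in self._previous.items():
            signal.signal(sig, handler)


mgr = MiniShutdownManager()
mgr.register_signal_handlers()
signal.raise_signal(signal.SIGTERM)  # simulate what an orchestrator sends
mgr.unregister_signal_handlers()
print(mgr.shutdown_requested)  # True
```

The key design choice is that the handler does almost nothing: it records the request and lets the main loop decide when to act, which keeps signal handlers fast and reentrancy-safe.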
Signal handling flow
Step 4 - Protect initialization and cleanup
One of the best parts of this example is the use of protect_section():
```python
with shutdown_manager.protect_section():
    await asyncio.sleep(2)
    logger.info("Initialization complete")
```
The same pattern is used during cleanup:
```python
with shutdown_manager.protect_section():
    stats = await graph.aclose()
    shutdown_manager.unregister_signal_handlers()
```
This protection is built on delayed interrupt handling. In practice, it means:
- signals are noticed
- interruption is deferred briefly
- critical startup or teardown code gets a chance to finish safely
Use it sparingly, but definitely use it around the most sensitive phases.
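The delayed-interrupt idea can be sketched as a context manager that records signals during the critical section and re-delivers them afterwards (a hypothetical illustration of the mechanism, not AgentFlow's implementation):

```python
import contextlib
import signal


class DeferredInterrupts:
    """Sketch of how a protect_section() can defer SIGINT/SIGTERM."""

    def __init__(self) -> None:
        self._pending = []

    @contextlib.contextmanager
    def protect_section(self):
        previous = {}

        def defer(signum, frame):
            # Remember the signal instead of acting on it immediately.
            self._pending.append(signum)

        for sig in (signal.SIGINT, signal.SIGTERM):
            previous[sig] = signal.signal(sig, defer)
        try:
            yield
        finally:
            # Restore the real handlers first...
            for sig, handler in previous.items():
                signal.signal(sig, handler)
            # ...then re-deliver anything that arrived mid-section.
            for signum in self._pending:
                signal.raise_signal(signum)
            self._pending.clear()


guard = DeferredInterrupts()
received = []
signal.signal(signal.SIGTERM, lambda s, f: received.append(s))
with guard.protect_section():
    signal.raise_signal(signal.SIGTERM)  # arrives mid-section, is deferred
    assert received == []                # outer handler has not fired yet
# After the block exits, the deferred SIGTERM reaches the outer handler.
print(received)
```

Signals are noticed (the deferring handler runs), but their effect is postponed until the protected block completes, which is exactly the behavior the example relies on during startup and teardown.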
Step 5 - Stop the loop without dropping work abruptly
The main processing loop looks roughly like this:
```python
while not shutdown_manager.shutdown_requested:
    ...
    result = await graph.ainvoke(...)
    ...
```
That pattern is simple and dependable. The loop keeps running until a shutdown request is raised, then it stops taking on new work and falls into cleanup.
A few other practical touches in the example are worth copying:
- it catches exceptions per task so one bad task does not crash the process
- it logs task progress clearly
- it catches `KeyboardInterrupt` at the top level as a final safeguard
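The per-task error handling can be sketched like this; the graph and manager below are hypothetical stand-ins so the loop is runnable on its own, and the real service pulls work from a queue rather than a list:

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("service")


async def run_service(graph, shutdown_manager, tasks):
    """Drain `tasks` until shutdown is requested; isolate per-task failures."""
    completed = failed = 0
    for task in tasks:
        if shutdown_manager.shutdown_requested:
            logger.info("Shutdown requested; not starting new work")
            break
        try:
            await graph.ainvoke(task)
            completed += 1
        except Exception:
            # One bad task must not take down the whole service.
            failed += 1
            logger.exception("Task failed: %r", task)
    return completed, failed


class _FakeGraph:
    async def ainvoke(self, task):
        if task == "boom":
            raise RuntimeError("bad task")


class _FakeManager:
    shutdown_requested = False


completed, failed = asyncio.run(
    run_service(_FakeGraph(), _FakeManager(), ["a", "boom", "b"])
)
print(completed, failed)  # 2 1
```

The `except Exception` (rather than a bare `except`) is deliberate: it lets `KeyboardInterrupt` and `SystemExit` propagate to the top-level safeguard instead of being swallowed per task.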
Step 6 - Close the graph and inspect stats
The cleanup section calls:
```python
stats = await graph.aclose()
```
Then it logs details such as:
- total duration
- background task information
- checkpointer stats
- publisher stats
- store stats
This is the key production lesson: shutdown is not just about stopping. It is also about learning whether the stop was clean.
Cleanup lifecycle
Run the example
```bash
python agentflow/examples/graceful_shutdown/graceful_shutdown_example.py
```
Then press Ctrl+C while it is running.
What you should see:
- the process logs normal task execution
- `Ctrl+C` triggers graceful shutdown instead of an abrupt crash
- cleanup begins
- shutdown stats are logged
- the application exits cleanly
Production takeaways
This example maps well to:
- API worker processes
- background job runners
- long-lived CLI daemons
- containerized services in Kubernetes or Docker
A healthy shutdown design usually includes all four of these ideas:
- stop accepting new work
- finish or cancel work intentionally
- close resources explicitly
- log enough detail to debug bad shutdowns later
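These four ideas can be combined in an asyncio-native skeleton. This sketch uses `loop.add_signal_handler` (Unix-only) instead of AgentFlow's manager, uses placeholder work, and simulates the signal with a timer so it can run unattended:

```python
import asyncio
import signal


async def serve(stop: asyncio.Event) -> dict:
    """Run placeholder work until `stop` is set, then drain and report."""
    started = 0
    in_flight = set()
    while not stop.is_set():
        # 1. Stop accepting new work once `stop` is set.
        t = asyncio.create_task(asyncio.sleep(0.01))  # placeholder unit of work
        in_flight.add(t)
        t.add_done_callback(in_flight.discard)
        started += 1
        await asyncio.sleep(0.01)
    # 2. Finish in-flight work intentionally instead of dropping it.
    if in_flight:
        await asyncio.wait(in_flight, timeout=5)
    # 3. Close resources explicitly here (graph.aclose() in the real service).
    # 4. Log enough detail to debug bad shutdowns later.
    return {"tasks_started": started, "left_in_flight": len(in_flight)}


async def main() -> dict:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)
    loop.call_later(0.05, stop.set)  # simulate a signal for this demo
    return await serve(stop)


stats = asyncio.run(main())
print(stats["left_in_flight"])  # 0: all in-flight work was drained
```

Using an `asyncio.Event` instead of a boolean flag has the same effect as `shutdown_requested`, with the bonus that coroutines can `await stop.wait()` instead of polling.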
Common mistakes
- Relying on `KeyboardInterrupt` alone and ignoring `SIGTERM`.
- Doing cleanup in `finally` without protecting that section from interruption.
- Never calling `app.aclose()`, which can leave background tasks or stores hanging.
- Starting new work after a shutdown signal has already been received.
What you learned
- How to use `GracefulShutdownManager` to coordinate process shutdown.
- Why protected initialization and protected cleanup matter.
- How `app.aclose()` fits into a safe shutdown lifecycle for long-running AgentFlow services.
Next step
→ Pair this with the production and troubleshooting docs in later sprints when you document deployment-specific shutdown behavior.