Graceful Shutdown

Source example: agentflow/examples/graceful_shutdown/graceful_shutdown_example.py

What you will build

A long-running asyncio service that:

handles SIGINT and SIGTERM
keeps initialization and cleanup protected from interruption
processes work in a loop until shutdown is requested
closes graph resources by calling aclose() on the compiled graph and logs shutdown statistics

This pattern is important for real services, workers, and containerized deployments.

Why graceful shutdown matters

Without a shutdown strategy, a process can stop in the middle of:

tool execution
background work
state persistence
resource cleanup

That can leave your system in a partially updated state.

The example shows how to avoid that.

Shutdown architecture

Step 1 - Create a realistic graph

The example uses a normal ReAct pattern rather than a fake no-op loop.

It defines three tools:

get_current_time
get_system_status
calculate

Then it builds:

a ToolNode
a main Agent
a conditional route back to the tool node when tool calls are present

That makes the example more useful: the shutdown logic is exercised around a graph that does real work.
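For orientation, the three tools could be plain Python functions along these lines. Only the function names come from the example; the bodies, return shapes, and the `_evaluate` helper are assumptions made for illustration, not the example's actual implementations.

```python
import ast
import operator
import platform
from datetime import datetime, timezone

# Hypothetical tool bodies; only the three tool names come from the example.


def get_current_time() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def get_system_status() -> dict:
    """Return a small health summary (shape is an assumption)."""
    return {"status": "ok", "python": platform.python_version()}


_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Walk a parsed expression, allowing only numbers and + - * /."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        return _OPERATORS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculate(expression: str) -> float:
    """Evaluate simple arithmetic without calling eval()."""
    return _evaluate(ast.parse(expression, mode="eval").body)
```

Keeping tools small and side-effect free like this makes it easier to reason about what may be in flight when a shutdown signal arrives.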

Step 2 - Create a `GracefulShutdownManager`

Inside long_running_service() the example sets:

SHUTDOWN_TIMEOUT = 30.0
shutdown_manager = GracefulShutdownManager(shutdown_timeout=SHUTDOWN_TIMEOUT)

Then it compiles the graph with the same timeout:

graph = build_graph().compile(shutdown_timeout=SHUTDOWN_TIMEOUT)

This matters because the compiled graph can use that timeout when draining internal resources during shutdown.
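To make the role of a shutdown timeout concrete, here is a minimal standard-library sketch of bounded draining: wait for outstanding tasks up to a deadline, then cancel whatever is left. The `drain` and `drain_demo` names and the returned dictionary are hypothetical; this is not how the compiled graph is implemented internally.

```python
import asyncio


async def drain(tasks: set, timeout: float) -> dict:
    """Wait up to `timeout` seconds for tasks, then cancel the stragglers."""
    if not tasks:
        return {"completed": 0, "cancelled": 0}
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; return_exceptions absorbs their CancelledError.
    await asyncio.gather(*pending, return_exceptions=True)
    return {"completed": len(done), "cancelled": len(pending)}


async def drain_demo() -> dict:
    quick = asyncio.create_task(asyncio.sleep(0.01))  # finishes before the deadline
    slow = asyncio.create_task(asyncio.sleep(10))     # still running at the deadline
    return await drain({quick, slow}, timeout=0.2)
```

Running `asyncio.run(drain_demo())` reports one completed and one cancelled task. Using the same timeout value for the manager and the graph keeps both layers working against a single deadline.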

Step 3 - Register signal handlers

The service explicitly registers handlers for SIGINT and SIGTERM:

shutdown_manager.register_signal_handlers()

Once that is in place:

pressing Ctrl+C sends SIGINT
container orchestrators usually send SIGTERM
the manager flips shutdown_requested to True

That gives the main loop a clean signal to stop accepting new work.
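The mechanism behind a flag like shutdown_requested can be sketched with the standard library alone. `ShutdownFlag` below is a hypothetical stand-in written for illustration, not the source of GracefulShutdownManager; it relies on `loop.add_signal_handler`, which asyncio provides on Unix only.

```python
import asyncio
import signal


class ShutdownFlag:
    """Hypothetical stand-in that flips a flag when SIGINT or SIGTERM arrives."""

    SIGNALS = (signal.SIGINT, signal.SIGTERM)

    def __init__(self) -> None:
        self.shutdown_requested = False
        self.received = None  # name of the last signal seen, for logging

    def request(self, signum: int) -> None:
        # Runs inside the event loop, so it is safe to touch ordinary state here.
        self.shutdown_requested = True
        self.received = signal.Signals(signum).name

    def register(self, loop: asyncio.AbstractEventLoop) -> None:
        for signum in self.SIGNALS:
            loop.add_signal_handler(signum, self.request, signum)

    def unregister(self, loop: asyncio.AbstractEventLoop) -> None:
        for signum in self.SIGNALS:
            loop.remove_signal_handler(signum)


async def wait_for_shutdown(flag: ShutdownFlag, poll: float = 0.1) -> str:
    """Register handlers, idle until a signal arrives, then clean up."""
    loop = asyncio.get_running_loop()
    flag.register(loop)
    try:
        while not flag.shutdown_requested:
            await asyncio.sleep(poll)
    finally:
        flag.unregister(loop)
    return flag.received
```

Because both signals are routed to the same callback, Ctrl+C in a terminal and a SIGTERM from an orchestrator take the same code path.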

Signal handling flow

Step 4 - Protect initialization and cleanup

One of the best parts of this example is the use of protect_section():

with shutdown_manager.protect_section():
    await asyncio.sleep(2)
    logger.info("Initialization complete")

The same pattern is used during cleanup:

with shutdown_manager.protect_section():
    stats = await graph.aclose()
    shutdown_manager.unregister_signal_handlers()

This protection is built on delayed interrupt handling. In practice, it means:

signals are noticed
interruption is deferred briefly
critical startup or teardown code gets a chance to finish safely

Use it sparingly, but definitely use it around the most sensitive phases.
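The idea of deferring an interruption can be modeled in a few lines. The sketch below is a simplified, hypothetical model: `DeferredInterrupt`, `on_signal`, and the `delivered` list are invented here, and "delivering" only records the signal name, where a real implementation would act on it. It is not the library's protect_section() implementation.

```python
from contextlib import contextmanager


class DeferredInterrupt:
    """Hypothetical model: notice signals at once, act on them after protected code."""

    def __init__(self) -> None:
        self.shutdown_requested = False
        self.delivered = []   # signal names acted on so far, in order
        self._depth = 0       # nesting level of protected sections
        self._pending = []    # signal names noticed while protected

    def on_signal(self, name: str) -> None:
        self.shutdown_requested = True   # the signal is noticed immediately
        if self._depth > 0:
            self._pending.append(name)   # but acting on it is deferred
        else:
            self.delivered.append(name)

    @contextmanager
    def protect_section(self):
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            if self._depth == 0:
                # Leaving the outermost protected section: deliver what was held back.
                self.delivered.extend(self._pending)
                self._pending.clear()
```

A signal that arrives inside the `with` block sets shutdown_requested right away but is only delivered once the block exits, which is the "noticed, then deferred briefly" behavior described above.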

Step 5 - Stop the loop without dropping work abruptly

The main processing loop looks roughly like this:

while not shutdown_manager.shutdown_requested:
    ...
    result = await graph.ainvoke(...)
    ...

That pattern is simple and dependable. The loop keeps running until a shutdown request is raised, then it stops taking on new work and falls into cleanup.

A few other practical touches in the example are worth copying:

it catches exceptions per task so one bad task does not crash the process
it logs task progress clearly
it catches KeyboardInterrupt at the top level as a final safeguard
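The loop shape and the per-task exception handling can be sketched without the library. In this sketch `run_until_shutdown`, `handle`, and `loop_demo` are hypothetical names, an `asyncio.Event` stands in for shutdown_requested, and the `handle` coroutine stands in for the call to graph.ainvoke(...).

```python
import asyncio
import logging

logger = logging.getLogger("service")


async def run_until_shutdown(tasks, handle, stop: asyncio.Event) -> dict:
    """Process tasks one at a time until `stop` is set or the input runs out."""
    counts = {"ok": 0, "failed": 0}
    for task in tasks:
        if stop.is_set():
            break  # stop taking on new work; the current task already finished
        try:
            await handle(task)
            counts["ok"] += 1
        except Exception:
            # One bad task is logged and counted, not allowed to end the service.
            logger.exception("task %r failed", task)
            counts["failed"] += 1
    return counts


async def loop_demo() -> dict:
    stop = asyncio.Event()

    async def handle(task: int) -> None:
        if task == 2:
            raise ValueError("bad input")
        if task == 4:
            stop.set()  # simulate a signal arriving while task 4 is in flight

    return await run_until_shutdown(range(1, 10), handle, stop)
```

In the demo, task 2 fails and is logged, task 4 completes even though the stop flag is set during it, and tasks 5 through 9 are never started.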

Step 6 - Close the graph and inspect stats

The cleanup section calls:

stats = await graph.aclose()

Then it logs areas such as:

total duration
background task information
checkpointer stats
publisher stats
store stats

This is the key production lesson: shutdown is not just about stopping. It is also about learning whether the stop was clean.
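One way to log nested stats consistently is to flatten them into key paths. The sketch below assumes the stats object is a nested dictionary; the `summarize_stats` helper and the keys in `example_stats` are hypothetical, and the real structure returned by aclose() may differ.

```python
import logging

logger = logging.getLogger("shutdown")


def summarize_stats(stats: dict, prefix: str = "") -> list:
    """Flatten a nested stats mapping into 'a.b=value' strings, sorted by key."""
    lines = []
    for key in sorted(stats):
        value = stats[key]
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            lines.extend(summarize_stats(value, prefix=f"{path}."))
        else:
            lines.append(f"{path}={value}")
    return lines


# Hypothetical shape, for illustration only.
example_stats = {
    "total_duration": 0.42,
    "background_tasks": {"status": "completed", "remaining": 0},
    "checkpointer": {"status": "completed"},
}

for line in summarize_stats(example_stats):
    logger.info("shutdown %s", line)
```

One flat line per value makes it easy to search logs for a single field when comparing a clean shutdown with a bad one.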

Cleanup lifecycle

Run the example

python agentflow/examples/graceful_shutdown/graceful_shutdown_example.py

Then press Ctrl+C while it is running.

What you should see:

the process logs normal task execution
Ctrl+C triggers graceful shutdown instead of an abrupt crash
cleanup begins
shutdown stats are logged
the application exits cleanly

Production takeaways

This example maps well to:

API worker processes
background job runners
long-lived CLI daemons
containerized services in Kubernetes or Docker

A healthy shutdown design usually includes all four of these ideas:

stop accepting new work
finish or cancel work intentionally
close resources explicitly
log enough detail to debug bad shutdowns later

Common mistakes

Relying on KeyboardInterrupt alone and ignoring SIGTERM.
Doing cleanup in finally without protecting that section from interruption.
Never calling aclose() on the compiled graph (graph.aclose() in this example), which can leave background tasks or stores hanging.
Starting new work after a shutdown signal has already been received.

What you learned

How to use GracefulShutdownManager to coordinate process shutdown.
Why protected initialization and protected cleanup matter.
How calling aclose() on the compiled graph fits into a safe shutdown lifecycle for long-running AgentFlow services.

Next step

→ Pair this with the production and troubleshooting docs when you need deployment-specific shutdown behavior.
