Multimodal

Source example: agentflow/examples/multimodal/multimodal_agent.py

What you will build

A single-node graph that accepts multiple content types:

  • images
  • audio
  • video
  • documents

The tutorial also shows three ways to provide media:

  • external URL
  • inline base64 data
  • uploaded file_id

Prerequisites

  • Python 3.11 or later
  • 10xscale-agentflow installed
  • a multimodal-capable provider key such as GOOGLE_API_KEY

Multimodal architecture

Step 1 — Set up storage

The example uses:

checkpointer = InMemoryCheckpointer()
media_store = InMemoryMediaStore()

Why two stores:

  • checkpointer tracks thread state
  • media_store stores uploaded media for file_id workflows

You can skip the media store if you only use external URLs or inline base64.
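For intuition, the media store's job is conceptually small: accept bytes plus a MIME type, hand back an opaque file_id, and return the payload later. Below is a toy, synchronous stand-in (a hypothetical `SimpleMediaStore`, not the AgentFlow class; the real `InMemoryMediaStore.store` is async):

```python
import uuid

class SimpleMediaStore:
    """Toy stand-in for an in-memory media store: maps file_ids to payloads."""

    def __init__(self) -> None:
        self._blobs: dict[str, tuple[bytes, str]] = {}

    def store(self, data: bytes, mime_type: str) -> str:
        # Generate an opaque file_id and keep the payload in memory.
        file_id = uuid.uuid4().hex
        self._blobs[file_id] = (data, mime_type)
        return file_id

    def get(self, file_id: str) -> tuple[bytes, str]:
        # Look up the payload and its MIME type by file_id.
        return self._blobs[file_id]

store = SimpleMediaStore()
fid = store.store(b"\x89PNG\r\n\x1a\n", mime_type="image/png")
data, mime = store.get(fid)
```

Anything that satisfies this store/retrieve contract can back a file_id workflow; the in-memory version simply trades durability for zero setup.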

Step 2 — Configure the Agent for multimodal input

The example creates an agent with MultimodalConfig:

agent = Agent(
    model="gemini-2.5-flash",
    provider="google",
    system_prompt=[
        {
            "role": "system",
            "content": (
                "You are a helpful multimodal assistant. "
                "Describe what you see in any images, "
                "transcribe any audio, and summarize any documents."
            ),
        },
    ],
    multimodal_config=MultimodalConfig(
        image_handling=ImageHandling.BASE64,
        document_handling=DocumentHandling.EXTRACT_TEXT,
    ),
)

This controls how media is prepared for the provider.
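As a rough mental model, `ImageHandling.BASE64` means raw image bytes end up base64-encoded inside the provider request. A stdlib sketch of that preparation step (illustrative only; not AgentFlow's actual adapter code, and the dict shape is a generic inline-data part, not a specific provider schema):

```python
import base64

def to_base64_part(data: bytes, mime_type: str) -> dict:
    """Encode raw media bytes the way an inline-data provider part expects."""
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }

# PNG magic bytes stand in for a real image payload here.
part = to_base64_part(b"\x89PNG\r\n\x1a\n", "image/png")
```

This also makes the size tradeoff concrete: base64 inflates the payload by roughly a third, which is why inline data suits small test images rather than production media.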

Media input paths

Step 3 — Build the graph

The graph is intentionally simple:

graph = StateGraph()
graph.add_node("agent", agent)
graph.set_entry_point("agent")
graph.add_edge("agent", END)

app = graph.compile(checkpointer=checkpointer)

Step 4 — Build multimodal messages

Each message is made of blocks.

External URL image

Message(
    role="user",
    content=[
        TextBlock(text="What is in this image?"),
        ImageBlock(
            media=MediaRef(
                kind="url",
                url="https://...",
                mime_type="image/png",
            )
        ),
    ],
)

Inline base64 image

ImageBlock(
    media=MediaRef(
        kind="data",
        data_base64=png_b64,
        mime_type="image/png",
    )
)

Uploaded file_id

import asyncio

# media_store.store is a coroutine, so run it on an event loop first.
loop = asyncio.get_event_loop()
file_id = loop.run_until_complete(
    media_store.store(data=sample_image, mime_type="image/png")
)

ImageBlock(
    media=MediaRef(
        kind="file_id",
        file_id=file_id,
        mime_type="image/png",
    )
)

Step 5 — Use other media blocks

The example also demonstrates:

  • AudioBlock
  • DocumentBlock
  • VideoBlock

That means a single message can mix multiple modalities.
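Conceptually, a mixed message is just an ordered list of typed blocks. A schematic sketch using plain dicts as stand-ins for the typed block classes (the real code uses `TextBlock`, `ImageBlock`, `AudioBlock`, and so on; the URL and file_id values here are placeholders):

```python
# Hypothetical dict-shaped stand-ins for AgentFlow's typed content blocks.
mixed_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Compare the image with the audio clip."},
        {"type": "image", "media": {"kind": "url",
                                    "url": "https://example.com/a.png",
                                    "mime_type": "image/png"}},
        {"type": "audio", "media": {"kind": "file_id",
                                    "file_id": "abc123",
                                    "mime_type": "audio/wav"}},
    ],
}

# Order is preserved, so the model sees blocks in the sequence you built them.
block_types = [block["type"] for block in mixed_message["content"]]
```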

Mixed-media execution flow

Step 6 — Run examples

The example defines named scenarios:

EXAMPLES = {
    "url": ("External URL image", build_message_with_external_url),
    "base64": ("Inline base64 image", build_message_with_base64),
    "file_id": ("Uploaded file_id image", build_message_with_file_id),
    "audio": ("Audio input", build_message_with_audio),
    "document": ("Document input", build_message_with_document),
    "video": ("Video input", build_message_with_video),
    "mixed": ("Mixed media types", build_mixed_message),
}

Run a few:

python agentflow/examples/multimodal/multimodal_agent.py url base64 file_id

Or let the script run the default examples:

python agentflow/examples/multimodal/multimodal_agent.py
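The selection logic behind that CLI behavior is typically a few lines: look each argument up in the EXAMPLES mapping and fall back to a default set when none match. A hedged sketch (the actual script's logic and default set may differ):

```python
import sys

# Trimmed-down mapping; the real script pairs each key with a builder function.
EXAMPLES = {
    "url": "External URL image",
    "base64": "Inline base64 image",
    "file_id": "Uploaded file_id image",
}
DEFAULT = ["url", "base64"]

def select_examples(argv: list[str]) -> list[str]:
    """Return requested example names, skipping unknown ones; default if none given."""
    requested = [name for name in argv if name in EXAMPLES]
    return requested or DEFAULT

names = select_examples(sys.argv[1:])
```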

When to use each media strategy

Strategy | Best for
URL      | public remote media already hosted elsewhere
base64   | small inline test data and quick experiments
file_id  | production usage, reuse, and repeated access

Common mistakes

  • Using inline base64 for large files in production.
  • Forgetting a media store when you rely on file_id.
  • Choosing a model that does not support your target media type.
  • Sending MIME types that do not match the payload.
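The last mistake (a MIME type that does not match the payload) is easy to guard against for common image formats by checking magic bytes before sending. A minimal sketch using a hypothetical helper, covering only a few well-known signatures:

```python
# Leading magic bytes for a few common image formats.
_MAGIC = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "image/gif": b"GIF8",
}

def mime_matches_payload(data: bytes, declared: str) -> bool:
    """Return True when the declared MIME type's magic bytes match the payload."""
    magic = _MAGIC.get(declared)
    if magic is None:
        return True  # Unknown type: nothing to check against.
    return data.startswith(magic)
```

Running this check before building a `MediaRef` turns a confusing provider-side error into an immediate local one.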

Key concepts

Concept             | Details
MediaRef            | Canonical reference to media payloads
block-based content | Messages can contain text and media together
MultimodalConfig    | Controls adaptation for image and document handling
InMemoryMediaStore  | Supports uploaded file workflows through file_id

What you learned

  • How to structure multimodal messages in AgentFlow.
  • How to choose between URL, base64, and file_id.
  • How multimodal agent configuration affects provider adaptation.

Next step

Continue with the Multiagent tutorial to route between multiple specialized nodes in one graph.