Media and files

AgentFlow has first-class multimodal support. Any message can contain a mix of text, images, audio, video, and documents through typed content blocks. Media is referenced via MediaRef, which decouples the message structure from where the bytes actually live.


Content block types

All block types live in agentflow.core.state (re-exported from the top-level agentflow package).

| Class | type discriminator | When to use |
| --- | --- | --- |
| TextBlock | "text" | Plain text content |
| ImageBlock | "image" | Images (PNG, JPEG, WebP, GIF) |
| AudioBlock | "audio" | Audio data (WAV, MP3, OGG) |
| VideoBlock | "video" | Video data (MP4, WebM) |
| DocumentBlock | "document" | PDFs, Word docs, plain text |
| DataBlock | "data" | Any raw binary blob with a MIME type |
| ToolCallBlock | "tool_call" | Tool invocation request from the model |
| ToolResultBlock | "tool_result" | Result returned from a tool execution |
| ReasoningBlock | "reasoning" | Chain-of-thought reasoning traces |
| AnnotationBlock | "annotation" | Citations, references, structured notes |
| ErrorBlock | "error" | Error information from failed operations |

MediaRef — the reference model

MediaRef is how you tell a block where the binary data is. It has three kind values:

from agentflow import MediaRef

# 1. External URL — the agent fetches it per provider
MediaRef(kind="url", url="https://example.com/photo.png", mime_type="image/png")

# 2. Inline base64 — embed the bytes directly (small payloads only)
MediaRef(kind="data", data_base64="<base64-string>", mime_type="image/png")

# 3. Store key — uploaded to a MediaStore first, then referenced by key
MediaRef(kind="file_id", file_id="a1b2c3d4...", mime_type="image/png")

Full MediaRef fields

class MediaRef(BaseModel):
    kind: Literal["url", "file_id", "data"] = "url"
    url: str | None = None          # https:// or agentflow://media/<key>
    file_id: str | None = None      # opaque key from MediaStore.store()
    data_base64: str | None = None  # base64-encoded bytes (small payloads only)
    mime_type: str | None = None
    size_bytes: int | None = None
    sha256: str | None = None
    filename: str | None = None
    # Media-specific hints
    width: int | None = None
    height: int | None = None
    duration_ms: int | None = None
    page: int | None = None
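
The size and integrity fields can be derived from the raw bytes before a reference is built. The sketch below uses only the standard library and returns a plain dict whose keys mirror the MediaRef fields listed above. The helper name describe_media is hypothetical and not part of AgentFlow.

```python
import base64
import hashlib
import mimetypes


def describe_media(data: bytes, filename: str) -> dict:
    """Return a dict keyed like the MediaRef fields above (hypothetical helper)."""
    # Guess the MIME type from the file extension; None if it is unknown.
    mime_type, _ = mimetypes.guess_type(filename)
    return {
        "kind": "data",
        "data_base64": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "filename": filename,
    }


fields = describe_media(b"hello", "photo.png")
print(fields["size_bytes"], fields["mime_type"])
```

Assuming the field names match the model above, the dict could be unpacked as MediaRef(**fields).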

Building multimodal messages

Import all blocks from the top-level agentflow package:

from agentflow import (
    AudioBlock,
    DocumentBlock,
    ImageBlock,
    MediaRef,
    Message,
    TextBlock,
    VideoBlock,
)

Example 1: Image from an external URL

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="What is in this image?"),
            ImageBlock(
                media=MediaRef(
                    kind="url",
                    url="https://upload.wikimedia.org/wikipedia/commons/4/47/example.png",
                    mime_type="image/png",
                )
            ),
        ],
    )
]

result = app.invoke({"messages": messages}, config={"thread_id": "t1"})

Example 2: Image from inline base64

import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Describe this photo."),
            ImageBlock(
                media=MediaRef(
                    kind="data",
                    data_base64=b64,
                    mime_type="image/jpeg",
                )
            ),
        ],
    )
]
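
Because base64 encodes every 3 bytes as 4 characters, an inline payload grows by roughly a third. A quick check like the one below can help decide whether a file is small enough for kind="data" or should go through a MediaStore instead. The 1 MiB threshold and both function names are illustrative assumptions, not AgentFlow defaults.

```python
import base64

INLINE_LIMIT_BYTES = 1024 * 1024  # assumed threshold; tune it for your provider


def base64_length(size_bytes: int) -> int:
    """Length of the padded base64 string for a payload of size_bytes."""
    return 4 * ((size_bytes + 2) // 3)


def suggested_kind(size_bytes: int, limit: int = INLINE_LIMIT_BYTES) -> str:
    """Suggest "data" when the encoded payload fits the limit, else "file_id"."""
    return "data" if base64_length(size_bytes) <= limit else "file_id"


payload = b"x" * 10
# The formula agrees with what the standard library actually produces.
assert base64_length(len(payload)) == len(base64.b64encode(payload))
print(suggested_kind(len(payload)))
```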

Example 3: Image from a MediaStore (file_id)

Upload the file once and reference it by key in any number of subsequent messages:

import asyncio
from agentflow import InMemoryMediaStore, ImageBlock, MediaRef, Message, TextBlock

media_store = InMemoryMediaStore()

# Upload — returns an opaque storage key
with open("photo.png", "rb") as f:
    file_key = asyncio.run(media_store.store(data=f.read(), mime_type="image/png"))

# Reference by key in a message
messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Analyze this uploaded image."),
            ImageBlock(
                media=MediaRef(
                    kind="file_id",
                    file_id=file_key,
                    mime_type="image/png",
                )
            ),
        ],
    )
]

Example 4: Audio

import base64

# Read the clip to embed; "clip.wav" is a placeholder path.
with open("clip.wav", "rb") as f:
    audio_bytes = f.read()

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Transcribe this audio clip."),
            AudioBlock(
                media=MediaRef(
                    kind="data",
                    data_base64=base64.b64encode(audio_bytes).decode(),
                    mime_type="audio/wav",
                ),
                # optional hints
                sample_rate=16000,
                channels=1,
            ),
        ],
    )
]

Example 5: Document

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Summarize this document."),
            DocumentBlock(
                text="Pre-extracted text content (optional; provide when you already have the text).",
                media=MediaRef(
                    kind="file_id",
                    file_id="doc-storage-key",
                    mime_type="application/pdf",
                ),
            ),
        ],
    )
]

Example 6: Mixed media in one message

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Here are multiple inputs; process all of them."),
            ImageBlock(media=MediaRef(kind="url", url="https://example.com/chart.png", mime_type="image/png")),
            DocumentBlock(
                text="This document discusses agent frameworks.",
                media=MediaRef(kind="file_id", file_id="doc-001", mime_type="text/plain"),
            ),
        ],
    )
]

MediaStore — binary storage backends

MediaStore stores the actual bytes outside the message, keeping messages lightweight. The BaseMediaStore interface exposes five methods:

async def store(data: bytes, mime_type: str, metadata: dict | None) -> str  # returns storage key
async def retrieve(storage_key: str) -> tuple[bytes, str] # bytes + mime_type
async def delete(storage_key: str) -> bool
async def exists(storage_key: str) -> bool
async def get_metadata(storage_key: str) -> dict | None # without loading bytes
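
To make the contract concrete, here is a minimal stand-alone sketch of the five methods backed by a dict. It does not subclass BaseMediaStore (whose constructor and exact signatures are not shown here); the class name DictMediaStore, the UUID-based keys, and the metadata layout are assumptions for illustration only.

```python
from __future__ import annotations

import asyncio
import uuid


class DictMediaStore:
    """Hypothetical stand-alone store mirroring the five methods above."""

    def __init__(self) -> None:
        self._blobs: dict[str, tuple[bytes, str]] = {}
        self._meta: dict[str, dict] = {}

    async def store(self, data: bytes, mime_type: str, metadata: dict | None = None) -> str:
        key = uuid.uuid4().hex  # opaque key; the real key scheme may differ
        self._blobs[key] = (data, mime_type)
        self._meta[key] = {"mime_type": mime_type, "size_bytes": len(data), **(metadata or {})}
        return key

    async def retrieve(self, storage_key: str) -> tuple[bytes, str]:
        return self._blobs[storage_key]  # raises KeyError for unknown keys

    async def delete(self, storage_key: str) -> bool:
        self._meta.pop(storage_key, None)
        return self._blobs.pop(storage_key, None) is not None

    async def exists(self, storage_key: str) -> bool:
        return storage_key in self._blobs

    async def get_metadata(self, storage_key: str) -> dict | None:
        # Metadata is kept separately, so the bytes are never touched here.
        return self._meta.get(storage_key)


async def demo() -> tuple[bool, bool]:
    store = DictMediaStore()
    key = await store.store(b"\x89PNG", "image/png", {"filename": "tiny.png"})
    data, mime = await store.retrieve(key)
    assert (data, mime) == (b"\x89PNG", "image/png")
    existed = await store.exists(key)
    deleted = await store.delete(key)
    return existed, deleted


print(asyncio.run(demo()))
```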

Available backends

| Class | Module | Use case |
| --- | --- | --- |
| InMemoryMediaStore | agentflow.storage.media.storage | Development, tests |
| LocalFileMediaStore | agentflow.storage.media.storage | Single-server, dev |
| CloudMediaStore | agentflow.storage.media.storage | S3 / GCS (production) |

InMemoryMediaStore

from agentflow import InMemoryMediaStore

store = InMemoryMediaStore()

# `await` needs an async context: run these lines inside an async function.
key = await store.store(data=image_bytes, mime_type="image/png")
bytes_back, mime = await store.retrieve(key)

Data is lost on process restart. Thread-safe via asyncio.

LocalFileMediaStore

from agentflow.storage.media.storage import LocalFileMediaStore

store = LocalFileMediaStore(base_dir="./agentflow_media")
key = await store.store(data=pdf_bytes, mime_type="application/pdf")

Files are sharded on disk as {base_dir}/{key[:2]}/{key[2:4]}/{key}.{ext} with a .meta.json sidecar.
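
The sharding scheme can be reproduced with a few lines of path arithmetic. This is a sketch of the layout described above, not LocalFileMediaStore's actual code; the function name shard_path is hypothetical, and the sidecar file name is left out because its exact form is not specified here.

```python
from pathlib import PurePosixPath


def shard_path(base_dir: str, key: str, ext: str) -> PurePosixPath:
    """Build {base_dir}/{key[:2]}/{key[2:4]}/{key}.{ext} as described above."""
    # The first two and next two characters of the key become directory levels,
    # which keeps any single directory from accumulating too many files.
    return PurePosixPath(base_dir) / key[:2] / key[2:4] / f"{key}.{ext}"


print(shard_path("./agentflow_media", "a1b2c3d4e5f6", "pdf"))
```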

CloudMediaStore (S3 / GCS)

pip install "10xscale-agentflow[cloud-storage]"

from cloud_storage_manager import CloudStorageFactory, StorageProvider, StorageConfig, AwsConfig
from agentflow.storage.media.storage import CloudMediaStore

config = StorageConfig(
    aws=AwsConfig(bucket_name="my-bucket", access_key_id="...", secret_access_key="...")
)
cloud_storage = CloudStorageFactory.get_storage(StorageProvider.AWS, config)
store = CloudMediaStore(cloud_storage, prefix="agentflow-media")

Stores binary blobs in the cloud bucket. Supports generating signed URLs via get_direct_url() so providers can fetch media directly.


MultimodalConfig — per-agent media handling

Pass MultimodalConfig to Agent to control how media is delivered to the LLM provider:

from agentflow import Agent, InMemoryMediaStore, MultimodalConfig
from agentflow.storage.media.config import ImageHandling, DocumentHandling

agent = Agent(
    model="gemini-2.5-flash",
    provider="google",
    multimodal_config=MultimodalConfig(
        image_handling=ImageHandling.BASE64,              # "base64" | "url" | "file_id"
        document_handling=DocumentHandling.EXTRACT_TEXT,  # "extract_text" | "pass_raw" | "skip"
        max_image_size_mb=10.0,
        max_image_dimension=2048,
        supported_image_types={"image/jpeg", "image/png", "image/webp", "image/gif"},
        supported_doc_types={"application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    ),
)
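
The type and size limits in this configuration amount to a pre-flight check on each image. How AgentFlow enforces them internally is not shown here, so the sketch below only illustrates the idea with the same values. The function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from __future__ import annotations

SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/webp", "image/gif"}
MAX_IMAGE_SIZE_MB = 10.0


def image_rejection_reason(mime_type: str, size_bytes: int) -> str | None:
    """Return why an image would be rejected, or None if it passes both checks."""
    if mime_type not in SUPPORTED_IMAGE_TYPES:
        return f"unsupported type: {mime_type}"
    if size_bytes > MAX_IMAGE_SIZE_MB * 1024 * 1024:
        return f"larger than {MAX_IMAGE_SIZE_MB} MB"
    return None


print(image_rejection_reason("image/tiff", 2048))
```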

Image handling strategies

| Strategy | Description |
| --- | --- |
| ImageHandling.BASE64 | Convert image to base64 and embed inline |
| ImageHandling.URL | Send a URL (external or signed from CloudMediaStore) |
| ImageHandling.FILE_ID | Upload via provider-native file API (e.g. Google File API) |

Document handling strategies

| Strategy | Description |
| --- | --- |
| DocumentHandling.EXTRACT_TEXT | Extract text and send as text context |
| DocumentHandling.PASS_RAW | Forward the raw bytes to the provider |
| DocumentHandling.SKIP | Ignore document blocks entirely |

Full graph wiring with a media store

import asyncio
from agentflow import (
    Agent, StateGraph, InMemoryCheckpointer, InMemoryMediaStore,
    ImageBlock, MediaRef, Message, MultimodalConfig, TextBlock, END,
)
from agentflow.storage.media.config import ImageHandling, DocumentHandling

checkpointer = InMemoryCheckpointer()
media_store = InMemoryMediaStore()

agent = Agent(
    model="gemini-2.5-flash",
    provider="google",
    system_prompt=[{"role": "system", "content": "You are a helpful multimodal assistant."}],
    multimodal_config=MultimodalConfig(
        image_handling=ImageHandling.BASE64,
        document_handling=DocumentHandling.EXTRACT_TEXT,
    ),
)

graph = StateGraph()
graph.add_node("agent", agent)
graph.set_entry_point("agent")
graph.add_edge("agent", END)

# Pass media_store to compile so the resolver can dereference file_id refs
app = graph.compile(checkpointer=checkpointer, media_store=media_store)

# Upload a file and invoke
with open("chart.png", "rb") as f:
    key = asyncio.run(media_store.store(data=f.read(), mime_type="image/png"))

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Describe this chart."),
            ImageBlock(media=MediaRef(kind="file_id", file_id=key, mime_type="image/png")),
        ],
    )
]

result = app.invoke({"messages": messages}, config={"thread_id": "media-demo"})

File upload via REST API

When running behind the API server, upload a file with multipart form data:

curl -X POST http://127.0.0.1:8000/v1/files/upload \
  -F "file=@photo.jpg"

Response:

{
  "file_id": "a1b2c3d4e5f6...",
  "filename": "photo.jpg",
  "content_type": "image/jpeg",
  "size_bytes": 24576,
  "access_url": "/v1/files/a1b2c3d4e5f6..."
}

Use the returned file_id in subsequent invoke or stream requests:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image", "media": {"kind": "file_id", "file_id": "a1b2c3d4e5f6...", "mime_type": "image/jpeg"}}
      ]
    }
  ],
  "config": {"thread_id": "media-demo", "recursion_limit": 10}
}
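
When the request is assembled in code rather than by hand, the upload response can feed the message body directly. The sketch below only builds and serializes the JSON shown above; it makes no HTTP call. The helper name build_invoke_payload and the short file_id value are placeholders.

```python
import json


def build_invoke_payload(file_id: str, mime_type: str, prompt: str, thread_id: str) -> dict:
    """Assemble a request body with one text block and one image block."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image",
                        "media": {"kind": "file_id", "file_id": file_id, "mime_type": mime_type},
                    },
                ],
            }
        ],
        "config": {"thread_id": thread_id},
    }


# Stand-in for the parsed JSON returned by the upload endpoint.
upload_response = {"file_id": "a1b2c3", "content_type": "image/jpeg"}

payload = build_invoke_payload(
    upload_response["file_id"],
    upload_response["content_type"],
    "What is in this image?",
    "media-demo",
)
print(json.dumps(payload, indent=2))
```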

File upload via TypeScript client

import { AgentFlowClient } from "@10xscale/agentflow-client";

const client = new AgentFlowClient({ baseUrl: "http://127.0.0.1:8000" });

const file = new File([imageBytes], "photo.jpg", { type: "image/jpeg" });
const upload = await client.uploadFile(file);

const result = await client.invoke(
  [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        { type: "image", media: { kind: "file_id", file_id: upload.file_id, mime_type: "image/jpeg" } },
      ],
    },
  ],
  { config: { thread_id: "ts-media-demo" } },
);

Provider capability matrix

Not all providers support all media types and transport modes. AgentFlow's internal capability matrix (agentflow.storage.media.capabilities) determines the best transport for each provider/model combination. The resolver tries transport modes in preference order:

| Transport mode | Description |
| --- | --- |
| remote_url | Send a public or signed HTTPS URL directly |
| provider_file | Upload via provider-native file API (e.g. Google File API) |
| inline_bytes | Send raw bytes inline (base64 data URI) |
| unsupported | The provider/model cannot handle this media type |

You do not need to manage this yourself — MultimodalConfig and Agent handle the fallback chain automatically based on your configured strategy.
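
Conceptually, the fallback chain is a first-match search over an ordered list. The sketch below shows that idea only: the real matrix lives in agentflow.storage.media.capabilities, and the function name, the preference tuple, and the example capability sets here are invented for illustration.

```python
from __future__ import annotations

# Assumed preference order, taken from the table above.
PREFERENCE_ORDER = ("remote_url", "provider_file", "inline_bytes")


def pick_transport(supported: set[str], preference: tuple[str, ...] = PREFERENCE_ORDER) -> str:
    """Return the first preferred transport mode the provider supports."""
    for mode in preference:
        if mode in supported:
            return mode
    return "unsupported"


# A made-up provider that accepts file uploads and inline bytes but not URLs.
print(pick_transport({"inline_bytes", "provider_file"}))
```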


Accessing an uploaded file

GET /v1/files/{file_id}

This returns the raw file bytes with the correct Content-Type header.
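
The access_url in the upload response is relative, so a client needs to resolve it against the server's base URL before fetching. A minimal sketch with the standard library follows; the helper name file_url and the short file id are placeholders.

```python
from urllib.parse import urljoin


def file_url(base_url: str, access_url: str) -> str:
    """Resolve the relative access_url against the API server's base URL."""
    return urljoin(base_url, access_url)


print(file_url("http://127.0.0.1:8000", "/v1/files/a1b2c3"))
```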

What you learned

  • Upload files with POST /v1/files/upload and receive a file_id.
  • Reference the file_id in message content blocks.
  • AgentFlowClient.uploadFile handles the multipart upload in TypeScript.
  • File content is stored in the configured MediaStore.