Media and files

AgentFlow has first-class multimodal support. Any message can contain a mix of text, images, audio, video, and documents through typed content blocks. Media is referenced via MediaRef, which decouples the message structure from where the bytes actually live.

Content block types

All block types live in agentflow.core.state (re-exported from the top-level agentflow package).

Class	type discriminator	When to use
`TextBlock`	`"text"`	Plain text content
`ImageBlock`	`"image"`	Images (PNG, JPEG, WebP, GIF)
`AudioBlock`	`"audio"`	Audio data (WAV, MP3, OGG)
`VideoBlock`	`"video"`	Video data (MP4, WebM)
`DocumentBlock`	`"document"`	PDFs, Word docs, plain text
`DataBlock`	`"data"`	Any raw binary blob with a MIME type
`ToolCallBlock`	`"tool_call"`	Tool invocation request from the model
`ToolResultBlock`	`"tool_result"`	Result returned from a tool execution
`ReasoningBlock`	`"reasoning"`	Chain-of-thought reasoning traces
`AnnotationBlock`	`"annotation"`	Citations, references, structured notes
`ErrorBlock`	`"error"`	Error information from failed operations

MediaRef — the reference model

MediaRef is how you tell a block where the binary data is. It has three kind values:

from agentflow import MediaRef

# 1. External URL — the agent fetches it per provider
MediaRef(kind="url", url="https://example.com/photo.png", mime_type="image/png")

# 2. Inline base64 — embed the bytes directly (small payloads only)
MediaRef(kind="data", data_base64="<base64-string>", mime_type="image/png")

# 3. Store key — uploaded to a MediaStore first, then referenced by key
MediaRef(kind="file_id", file_id="a1b2c3d4...", mime_type="image/png")

Full MediaRef fields

class MediaRef(BaseModel):
    kind: Literal["url", "file_id", "data"] = "url"
    url: str | None = None          # https:// or agentflow://media/<key>
    file_id: str | None = None      # opaque key from MediaStore.store()
    data_base64: str | None = None  # base64-encoded bytes (small payloads only)
    mime_type: str | None = None
    size_bytes: int | None = None
    sha256: str | None = None
    filename: str | None = None
    # Media-specific hints
    width: int | None = None
    height: int | None = None
    duration_ms: int | None = None
    page: int | None = None
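
Each kind value in the examples above pairs with one locator field: url, file_id, or data_base64. As a rough illustration of that pairing, the sketch below uses a plain dataclass as a stand-in for MediaRef. The names RefSketch, LOCATOR_FOR_KIND, and missing_locator are hypothetical, and the real model may validate this differently or not at all.

```python
from dataclasses import dataclass
from typing import Optional

# Which locator field each kind is expected to populate (taken from the examples above).
LOCATOR_FOR_KIND = {"url": "url", "file_id": "file_id", "data": "data_base64"}


@dataclass
class RefSketch:
    """Hypothetical stand-in for MediaRef; not the library class."""

    kind: str = "url"
    url: Optional[str] = None
    file_id: Optional[str] = None
    data_base64: Optional[str] = None
    mime_type: Optional[str] = None


def missing_locator(ref: RefSketch) -> Optional[str]:
    """Return the locator field that `kind` needs but is unset, else None."""
    field_name = LOCATOR_FOR_KIND[ref.kind]
    return None if getattr(ref, field_name) else field_name
```

A check like this can run before a message is sent, so a reference with kind="file_id" and no file_id is caught early instead of failing at resolution time.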

Building multimodal messages

Import all blocks from the top-level agentflow package:

from agentflow import (
    AudioBlock,
    DocumentBlock,
    ImageBlock,
    MediaRef,
    Message,
    TextBlock,
    VideoBlock,
)

Example 1: Image from an external URL

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="What is in this image?"),
            ImageBlock(
                media=MediaRef(
                    kind="url",
                    url="https://upload.wikimedia.org/wikipedia/commons/4/47/example.png",
                    mime_type="image/png",
                )
            ),
        ],
    )
]

# `app` is a compiled graph; see "Full graph wiring with a media store" below.
result = app.invoke({"messages": messages}, config={"thread_id": "t1"})

Example 2: Image from inline base64

import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Describe this photo."),
            ImageBlock(
                media=MediaRef(
                    kind="data",
                    data_base64=b64,
                    mime_type="image/jpeg",
                )
            ),
        ],
    )
]
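
Inline base64 is meant for small payloads because the encoding grows the data by roughly a third. The sketch below shows one way to decide between inlining and uploading to a store. MAX_INLINE_BYTES is a hypothetical cutoff chosen for illustration and is not an AgentFlow constant; the helper names are likewise assumptions.

```python
import base64
from typing import Optional

# Hypothetical cutoff for illustration; AgentFlow does not define this constant.
MAX_INLINE_BYTES = 256 * 1024


def encoded_size(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


def inline_payload(data: bytes, limit: int = MAX_INLINE_BYTES) -> Optional[str]:
    """Return the base64 string if it fits within `limit` characters, else None."""
    if encoded_size(len(data)) > limit:
        return None  # caller should upload to a MediaStore and use kind="file_id"
    return base64.b64encode(data).decode()
```

When inline_payload returns None, the file_id route described in Example 3 is the natural fallback.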

Example 3: File uploaded to MediaStore (recommended for production)

Upload the file once and reference it by key in any number of subsequent messages:

import asyncio
from agentflow import InMemoryMediaStore, ImageBlock, MediaRef, Message, TextBlock

media_store = InMemoryMediaStore()

# Upload — returns an opaque storage key
with open("photo.png", "rb") as f:
    file_key = asyncio.run(media_store.store(data=f.read(), mime_type="image/png"))

# Reference by key in a message
messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Analyze this uploaded image."),
            ImageBlock(
                media=MediaRef(
                    kind="file_id",
                    file_id=file_key,
                    mime_type="image/png",
                )
            ),
        ],
    )
]

Example 4: Audio

import base64

# audio_bytes holds the raw WAV bytes, e.g. read from a file opened in "rb" mode.
messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Transcribe this audio clip."),
            AudioBlock(
                media=MediaRef(
                    kind="data",
                    data_base64=base64.b64encode(audio_bytes).decode(),
                    mime_type="audio/wav",
                ),
                # optional hints
                sample_rate=16000,
                channels=1,
            ),
        ],
    )
]

Example 5: Document

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Summarize this document."),
            DocumentBlock(
                text="Pre-extracted text content (optional — provide when you already have the text).",
                media=MediaRef(
                    kind="file_id",
                    file_id="doc-storage-key",
                    mime_type="application/pdf",
                ),
            ),
        ],
    )
]

Example 6: Mixed media in one message

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Here are multiple inputs — process all of them."),
            ImageBlock(media=MediaRef(kind="url", url="https://example.com/chart.png", mime_type="image/png")),
            DocumentBlock(
                text="This document discusses agent frameworks.",
                media=MediaRef(kind="file_id", file_id="doc-001", mime_type="text/plain"),
            ),
        ],
    )
]

MediaStore — binary storage backends

MediaStore stores the actual bytes outside the message, keeping messages lightweight. The BaseMediaStore interface exposes five methods:

async def store(data: bytes, mime_type: str, metadata: dict | None) -> str  # returns storage key
async def retrieve(storage_key: str) -> tuple[bytes, str]                    # bytes + mime_type
async def delete(storage_key: str) -> bool
async def exists(storage_key: str) -> bool
async def get_metadata(storage_key: str) -> dict | None                      # without loading bytes
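
To make the shape of the interface concrete, here is a minimal dict-backed sketch that mirrors the five methods. DictMediaStore is a hypothetical class that does not subclass BaseMediaStore, and using a SHA-256 content hash as the storage key is an assumption; the real backends may generate keys differently.

```python
import asyncio
import hashlib
from typing import Optional


class DictMediaStore:
    """Hypothetical dict-backed store mirroring the five methods."""

    def __init__(self) -> None:
        self._blobs = {}  # storage_key -> (bytes, mime_type)
        self._meta = {}   # storage_key -> metadata dict

    async def store(self, data: bytes, mime_type: str, metadata: Optional[dict] = None) -> str:
        # Assumption: the key is a content hash; real backends may differ.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = (data, mime_type)
        self._meta[key] = {"mime_type": mime_type, "size_bytes": len(data), **(metadata or {})}
        return key

    async def retrieve(self, storage_key: str) -> tuple:
        return self._blobs[storage_key]

    async def delete(self, storage_key: str) -> bool:
        self._meta.pop(storage_key, None)
        return self._blobs.pop(storage_key, None) is not None

    async def exists(self, storage_key: str) -> bool:
        return storage_key in self._blobs

    async def get_metadata(self, storage_key: str) -> Optional[dict]:
        # Answers from the metadata dict without touching the stored bytes.
        return self._meta.get(storage_key)


async def round_trip() -> tuple:
    """Store, read back, then delete one blob."""
    store = DictMediaStore()
    key = await store.store(b"hello", "text/plain")
    data, mime = await store.retrieve(key)
    deleted = await store.delete(key)
    return data, mime, deleted and not await store.exists(key)
```

Calling asyncio.run(round_trip()) exercises store, retrieve, delete, and exists in sequence.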

Available backends

Class	Module	Use case
`InMemoryMediaStore`	`agentflow.storage.media.storage`	Development, tests
`LocalFileMediaStore`	`agentflow.storage.media.storage`	Single-server, dev
`CloudMediaStore`	`agentflow.storage.media.storage`	S3 / GCS (production)

InMemoryMediaStore

from agentflow import InMemoryMediaStore

store = InMemoryMediaStore()
# Run inside an async function (or another context where `await` is allowed).
key = await store.store(data=image_bytes, mime_type="image/png")
bytes_back, mime = await store.retrieve(key)

Data lives in process memory and is lost on restart. Access is coordinated through asyncio, so it is safe for concurrent coroutines on a single event loop.

LocalFileMediaStore

from agentflow.storage.media.storage import LocalFileMediaStore

store = LocalFileMediaStore(base_dir="./agentflow_media")
# Run inside an async function (or another context where `await` is allowed).
key = await store.store(data=pdf_bytes, mime_type="application/pdf")

Files are sharded on disk as {base_dir}/{key[:2]}/{key[2:4]}/{key}.{ext} with a .meta.json sidecar.
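
The sharding pattern can be sketched as a small path helper. shard_path is a hypothetical function written only to illustrate the layout; how the store derives the extension from the MIME type, and the exact sidecar file name, are not covered here.

```python
from pathlib import PurePosixPath


def shard_path(base_dir: str, key: str, ext: str) -> PurePosixPath:
    """Build {base_dir}/{key[:2]}/{key[2:4]}/{key}.{ext} for a storage key."""
    return PurePosixPath(base_dir) / key[:2] / key[2:4] / f"{key}.{ext}"
```

Using the first four characters of the key as two directory levels keeps any single directory from accumulating too many files.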

CloudMediaStore (S3 / GCS)

pip install "10xscale-agentflow[cloud-storage]"

from cloud_storage_manager import CloudStorageFactory, StorageProvider, StorageConfig, AwsConfig
from agentflow.storage.media.storage import CloudMediaStore

config = StorageConfig(
    aws=AwsConfig(bucket_name="my-bucket", access_key_id="...", secret_access_key="...")
)
cloud_storage = CloudStorageFactory.get_storage(StorageProvider.AWS, config)
store = CloudMediaStore(cloud_storage, prefix="agentflow-media")

Stores binary blobs in the cloud bucket. Supports generating signed URLs via get_direct_url() so providers can fetch media directly.

MultimodalConfig — per-agent media handling

Pass MultimodalConfig to Agent to control how media is delivered to the LLM provider:

from agentflow import Agent, InMemoryMediaStore, MultimodalConfig
from agentflow.storage.media.config import ImageHandling, DocumentHandling

agent = Agent(
    model="gemini-2.5-flash",
    provider="google",
    multimodal_config=MultimodalConfig(
        image_handling=ImageHandling.BASE64,           # "base64" | "url" | "file_id"
        document_handling=DocumentHandling.EXTRACT_TEXT,  # "extract_text" | "pass_raw" | "skip"
        max_image_size_mb=10.0,
        max_image_dimension=2048,
        supported_image_types={"image/jpeg", "image/png", "image/webp", "image/gif"},
        supported_doc_types={"application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    ),
)
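
The size and type limits can be read as a pre-flight filter on incoming images. The sketch below is one possible interpretation, not the library's own check: image_rejection_reason is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from typing import Optional

# Values copied from the MultimodalConfig example above.
SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/webp", "image/gif"}
MAX_IMAGE_SIZE_MB = 10.0


def image_rejection_reason(mime_type: str, size_bytes: int) -> Optional[str]:
    """Return why an image would fail the limits, or None if it passes."""
    if mime_type not in SUPPORTED_IMAGE_TYPES:
        return f"unsupported type: {mime_type}"
    # Assumes 1 MB = 1024 * 1024 bytes; the library may count differently.
    if size_bytes > MAX_IMAGE_SIZE_MB * 1024 * 1024:
        return f"too large: {size_bytes} bytes"
    return None
```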

Image handling strategies

Strategy	Description
`ImageHandling.BASE64`	Convert image to base64 and embed inline
`ImageHandling.URL`	Send a URL (external or signed from `CloudMediaStore`)
`ImageHandling.FILE_ID`	Upload via provider-native file API (e.g. Google File API)

Document handling strategies

Strategy	Description
`DocumentHandling.EXTRACT_TEXT`	Extract text and send as text context
`DocumentHandling.PASS_RAW`	Forward the raw bytes to the provider
`DocumentHandling.SKIP`	Ignore document blocks entirely

Full graph wiring with a media store

import asyncio
from agentflow import (
    Agent, StateGraph, InMemoryCheckpointer, InMemoryMediaStore,
    ImageBlock, MediaRef, Message, MultimodalConfig, TextBlock, END,
)
from agentflow.storage.media.config import ImageHandling, DocumentHandling

checkpointer = InMemoryCheckpointer()
media_store = InMemoryMediaStore()

agent = Agent(
    model="gemini-2.5-flash",
    provider="google",
    system_prompt=[{"role": "system", "content": "You are a helpful multimodal assistant."}],
    multimodal_config=MultimodalConfig(
        image_handling=ImageHandling.BASE64,
        document_handling=DocumentHandling.EXTRACT_TEXT,
    ),
)

graph = StateGraph()
graph.add_node("agent", agent)
graph.set_entry_point("agent")
graph.add_edge("agent", END)

# Pass media_store to compile so the resolver can dereference file_id refs
app = graph.compile(checkpointer=checkpointer, media_store=media_store)

# Upload a file and invoke
with open("chart.png", "rb") as f:
    key = asyncio.run(media_store.store(data=f.read(), mime_type="image/png"))

messages = [
    Message(
        role="user",
        content=[
            TextBlock(text="Describe this chart."),
            ImageBlock(media=MediaRef(kind="file_id", file_id=key, mime_type="image/png")),
        ],
    )
]

result = app.invoke({"messages": messages}, config={"thread_id": "media-demo"})

File upload via REST API

When running behind the API server, upload a file with multipart form data:

curl -X POST http://127.0.0.1:8000/v1/files/upload \
  -F "file=@photo.jpg"

Response:

{
  "file_id": "a1b2c3d4e5f6...",
  "filename": "photo.jpg",
  "content_type": "image/jpeg",
  "size_bytes": 24576,
  "access_url": "/v1/files/a1b2c3d4e5f6..."
}

Use the returned file_id in subsequent invoke or stream requests:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image", "media": {"kind": "file_id", "file_id": "a1b2c3d4e5f6...", "mime_type": "image/jpeg"}}
      ]
    }
  ],
  "config": {"thread_id": "media-demo", "recursion_limit": 10}
}
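
If you script this flow from Python, the request body can be built directly from the upload response. The sketch below only constructs the JSON; it sends nothing, because the invoke endpoint path is not shown here. payload_from_upload is a hypothetical helper and "example-id" is a placeholder value.

```python
import json


def payload_from_upload(upload: dict, prompt: str, thread_id: str) -> dict:
    """Build the invoke request body from an upload response."""
    media = {
        "kind": "file_id",
        "file_id": upload["file_id"],
        "mime_type": upload["content_type"],  # reuse the type reported by the server
    }
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image", "media": media},
                ],
            }
        ],
        "config": {"thread_id": thread_id},
    }


# Placeholder response shaped like the upload response above.
upload_response = {"file_id": "example-id", "filename": "photo.jpg", "content_type": "image/jpeg"}
body = json.dumps(payload_from_upload(upload_response, "What is in this image?", "media-demo"))
```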

File upload via TypeScript client

import { AgentFlowClient } from "@10xscale/agentflow-client";

const client = new AgentFlowClient({ baseUrl: "http://127.0.0.1:8000" });

const file = new File([imageBytes], "photo.jpg", { type: "image/jpeg" });
const upload = await client.uploadFile(file);

const result = await client.invoke(
  [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        { type: "image", media: { kind: "file_id", file_id: upload.file_id, mime_type: "image/jpeg" } },
      ],
    },
  ],
  { config: { thread_id: "ts-media-demo" } },
);

Provider capability matrix

Not all providers support all media types and transport modes. AgentFlow's internal capability matrix (agentflow.storage.media.capabilities) determines the best transport for each provider/model combination. The resolver tries transport modes in preference order:

Transport mode	Description
`remote_url`	Send a public or signed HTTPS URL directly
`provider_file`	Upload via provider-native file API (e.g. Google File API)
`inline_bytes`	Send raw bytes inline (base64 data URI)
`unsupported`	The provider/model cannot handle this media type

You do not need to manage this yourself — MultimodalConfig and Agent handle the fallback chain automatically based on your configured strategy.
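
The fallback idea can be sketched in a few lines. This is an illustration of first-match selection, not the resolver's actual code: pick_transport is a hypothetical function, and the preference order is assumed to follow the table above.

```python
from typing import Iterable

# Assumed preference order, following the table above.
PREFERENCE = ("remote_url", "provider_file", "inline_bytes")


def pick_transport(supported: Iterable[str]) -> str:
    """Return the first preferred mode present in `supported`, else "unsupported"."""
    available = set(supported)
    for mode in PREFERENCE:
        if mode in available:
            return mode
    return "unsupported"
```

For a provider that accepts only provider_file and inline_bytes, this selection lands on provider_file; with no supported mode it reports unsupported.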

Accessing an uploaded file

GET /v1/files/{file_id}

This returns the raw file bytes with the correct Content-Type header.

What you learned

Upload files with POST /v1/files/upload and receive a file_id.
Reference the file_id in message content blocks.
AgentFlowClient.uploadFile handles the multipart upload in TypeScript.
File content is stored in the configured MediaStore.
