Lesson 2: Prompting, Context Engineering, and Structured Outputs

Learning Outcome

By the end of this lesson, you will be able to:

Design prompts that produce reliable, consistent outputs
Use structured outputs to guarantee response formats
Manage context efficiently to stay within token limits
Implement validation and error handling for production systems

Prerequisites

Read LLM basics for engineers for failure mode context
Read Prompt patterns cheatsheet for pattern reference
Read Tokenization and context windows for context management

Concept: Better Prompting vs. Better System Design

There's a limit to what better prompting can achieve. Sometimes you need better system design.

The Reliability Spectrum

When Prompting Alone Isn't Enough

Problem	Prompting Fix	Better Solution
Inconsistent format	Add format instructions	Use structured outputs
Wrong information	"Answer accurately"	Ground with retrieval
Missing edge cases	"Consider X, Y, Z"	Add validation layer
Slow responses	"Be concise"	Use faster model
Hallucination	"Don't make things up"	Retrieval + citations
Brittle behavior	Many examples	Structured outputs + validation
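
As one illustration of moving a concern out of the prompt and into the system, the "Retrieval + citations" row can be backed by a small check in code. This is a minimal sketch that assumes answers cite sources with bracketed numbers such as [1]; that convention and the `check_citations` name are assumptions, not something this lesson or AgentFlow specifies.

```python
import re


def check_citations(answer: str, source_ids: set[int]) -> dict:
    """Report citation markers like [2] that do not match a retrieved source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {
        "has_citation": bool(cited),
        "unknown_ids": sorted(cited - source_ids),
    }


report = check_citations(
    "Reset links expire after 24 hours [1]. Admins can extend this [4].",
    source_ids={1, 2, 3},
)
print(report)
```

An answer with no citations, or with IDs that were never retrieved, can then be rejected or retried instead of relying on "Don't make things up" in the prompt.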

Concept: Prompt Structure for Reliability

The Anatomy of a Reliable Prompt

Instruction Hierarchy

Prompts have a priority order: higher-priority instructions override lower-priority ones.

Prompt Positioning

Important information should appear:

At the beginning — Set the stage
At the end — Reinforce the task

prompt = """
[BEGINNING] You are a technical support assistant for Acme Corp.
Be helpful, concise, and professional.

Always cite sources when providing factual information.
Never make up information you don't know.

[CONTEXT] The user is asking about password reset.

[TASK] Answer their question directly. If you don't know, say so.

User: How do I reset my password?
"""

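The positioning advice above can be sketched as a small helper that puts the role and rules first and restates the task last. The `build_positioned_prompt` name and its parameters are illustrative assumptions, not part of AgentFlow.

```python
def build_positioned_prompt(role_and_rules: str, context: str, task: str, user_message: str) -> str:
    """Assemble a prompt with key instructions first and the task restated last."""
    sections = [
        role_and_rules.strip(),           # beginning: set the stage
        f"Context:\n{context.strip()}",   # middle: supporting material
        f"User: {user_message.strip()}",
        f"Task: {task.strip()}",          # end: reinforce the task
    ]
    return "\n\n".join(sections)


prompt = build_positioned_prompt(
    role_and_rules=(
        "You are a technical support assistant for Acme Corp.\n"
        "Never make up information you don't know."
    ),
    context="The user is asking about password reset.",
    task="Answer their question directly. If you don't know, say so.",
    user_message="How do I reset my password?",
)
print(prompt)
```

Keeping the assembly in one function makes the ordering explicit and easy to change if testing shows a different layout works better for a given model.
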
System Prompt Template

SYSTEM_PROMPT = """
You are a {ROLE} with expertise in {DOMAIN}.

Your responsibilities:
1. {Responsibility 1}
2. {Responsibility 2}
3. {Responsibility 3}

IMPORTANT RULES:
- Never {Rule 1}
- Always {Rule 2}
- If unsure, {Fallback behavior}

OUTPUT FORMAT:
{Format requirements}
"""

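One way to fill a template like the one above is sketched here with `string.Template` from the standard library. The `$`-style placeholders, the `render_system_prompt` helper, and the sample values are assumptions chosen for illustration; `$` placeholders are used so that literal braces in the output-format section (for example a JSON sample) need no escaping.

```python
from string import Template

# Hypothetical $-style version of the template shown above.
SYSTEM_TEMPLATE = Template(
    "You are a $role with expertise in $domain.\n\n"
    "Your responsibilities:\n$responsibilities\n\n"
    "IMPORTANT RULES:\n$rules\n\n"
    "OUTPUT FORMAT:\n$output_format\n"
)


def render_system_prompt(
    role: str,
    domain: str,
    responsibilities: list[str],
    rules: list[str],
    output_format: str,
) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return SYSTEM_TEMPLATE.substitute(
        role=role,
        domain=domain,
        responsibilities="\n".join(f"{i}. {r}" for i, r in enumerate(responsibilities, 1)),
        rules="\n".join(f"- {r}" for r in rules),
        output_format=output_format,
    )


system_prompt = render_system_prompt(
    role="support ticket classifier",
    domain="customer support",
    responsibilities=["Assign a category", "Assign a priority"],
    rules=["Never invent ticket details", "If unsure, choose the lower priority"],
    output_format='JSON such as {"category": "billing", "priority": "low"}',
)
print(system_prompt)
```
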
Concept: Structured Outputs

Structured outputs constrain the model's response to a schema, which makes the format far more consistent than prompt instructions alone. This is essential for production systems.

Why Structured Outputs Matter

Schema Definition with Pydantic

from pydantic import BaseModel, Field
from enum import Enum

class Priority(str, Enum):
    URGENT = "urgent"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class TicketClassification(BaseModel):
    category: str = Field(description="Ticket category")
    priority: Priority = Field(description="Urgency level")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    reasoning: str = Field(description="Brief explanation")
    
    class Config:
        use_enum_values = True
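
To see what a schema like this buys you, the sketch below validates a few payloads against a trimmed-down copy of it that includes a range constraint on `confidence`. The `BoundedClassification` and `is_valid` names are illustrative assumptions; the calls used (`BaseModel`, `Field(ge=..., le=...)`, `ValidationError`) exist in both Pydantic v1 and v2.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class Priority(str, Enum):
    URGENT = "urgent"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class BoundedClassification(BaseModel):
    """Illustrative schema with an enum field and a bounded float."""
    category: str
    priority: Priority
    confidence: float = Field(ge=0.0, le=1.0)


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the schema, False otherwise."""
    try:
        BoundedClassification(**payload)
        return True
    except ValidationError:
        return False


good = is_valid({"category": "billing", "priority": "high", "confidence": 0.9})
bad_enum = is_valid({"category": "billing", "priority": "critical", "confidence": 0.9})
bad_range = is_valid({"category": "billing", "priority": "high", "confidence": 1.5})
print(good, bad_enum, bad_range)
```

The enum rejects labels outside the allowed set and the bounds reject out-of-range scores, so downstream code never has to handle those cases itself.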

Structured Output in AgentFlow

from agentflow.core.llm import OpenAIModel
from agentflow.core.graph import StateGraph, AgentState

# Initialize model with structured output
llm = OpenAIModel(
    "gpt-4o",
    response_format=TicketClassification
)

# Create agent
builder = StateGraph(AgentState)

@builder.node
def classify(state: AgentState) -> AgentState:
    last_message = state.messages[-1].content if state.messages else ""
    
    # With response_format set, the result is expected to parse into a TicketClassification
    result = llm.generate(messages=[{"role": "user", "content": last_message}])
    
    return {
        **state.dict(),
        "classification": result.dict()
    }

app = builder.compile()

Concept: Validation and Error Handling

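Structured outputs can still fail: a response may be cut off at the token limit, the model may refuse, or a value may fit the schema and still be wrong. You need validation.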

Validation Pipeline

Validation Error Handling

from typing import Optional

from pydantic import BaseModel, ValidationError

def safe_generate(
    prompt: str,
    schema: type[BaseModel],
    max_retries: int = 3
) -> tuple[bool, Optional[BaseModel], Optional[str]]:
    """
    Generate with validation and retry.
    
    Returns:
        (success, result, error_message)
    """
    for attempt in range(max_retries):
        try:
            response = llm.generate(prompt, response_format=schema)
            return True, response, None
        
        except ValidationError as e:
            # Try to fix common issues
            error_msg = str(e)
            
            if attempt < max_retries - 1:
                # Add correction hint to prompt
                prompt += f"\n\nPlease fix: {error_msg}"
                continue
            
            return False, None, error_msg
        
        except Exception as e:
            return False, None, str(e)
    
    return False, None, "Max retries exceeded"
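
The retry-with-feedback loop above can be exercised without a live model. The sketch below uses only the standard library and a stub in place of `llm.generate`; `generate_with_retry`, `validate`, `fake_model`, and `REQUIRED_KEYS` are hypothetical names for illustration. Unlike the version above, it rebuilds each attempt from the original prompt so correction hints do not pile up across retries.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"category", "priority"}


def validate(raw: str) -> dict:
    """Parse JSON and check required keys; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def generate_with_retry(generate: Callable[[str], str], prompt: str, max_retries: int = 3):
    """Call generate(), validate the text, and retry with the last error appended."""
    last_error: Optional[str] = None
    for _ in range(max_retries):
        attempt_prompt = prompt if last_error is None else f"{prompt}\n\nPlease fix: {last_error}"
        try:
            return True, validate(generate(attempt_prompt)), None
        except ValueError as exc:
            last_error = str(exc)
    return False, None, last_error


# Stub standing in for a model: incomplete on the first call, valid afterwards.
calls: list[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    if len(calls) == 1:
        return '{"category": "billing"}'
    return '{"category": "billing", "priority": "low"}'


ok, data, error = generate_with_retry(fake_model, "Classify: invoice question")
print(ok, data, len(calls))
```

Passing the generator in as a callable keeps the retry logic testable and independent of any particular client library.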

Error Recovery Patterns

import re

from pydantic import BaseModel, ValidationError

def parse_with_fallback(response: str, schema: type[BaseModel]) -> BaseModel:
    """Try parsing the whole response; fall back to extracting embedded JSON."""
    try:
        return schema.parse_raw(response)
    except (ValidationError, ValueError):
        pass  # Not clean JSON; try extraction below

    # Greedy match from the first "{" to the last "}" in the text
    match = re.search(r'\{[\s\S]*\}', response)
    if match:
        try:
            return schema.parse_raw(match.group())
        except (ValidationError, ValueError):
            pass

    # Last resort: raise
    raise ValueError(f"Could not parse response: {response[:100]}")
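
A greedy brace regex can over-match when the surrounding prose also contains braces. As an alternative extraction step, the sketch below uses `json.JSONDecoder.raw_decode` from the standard library, which parses one JSON value starting at a given index and ignores whatever follows. The `extract_first_json_object` name is an assumption for illustration, and this swaps in a different technique from the regex shown above.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # Parse a single JSON value beginning at `start`; trailing text is ignored.
            value, _end = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # This brace did not start valid JSON; try the next one
        start = text.find("{", start + 1)
    return None


reply = 'Sure! Here it is: {"category": "billing", "priority": "low"} Let me know if {that} helps.'
extracted = extract_first_json_object(reply)
print(extracted)
```

The extracted dict can then be passed to the schema for validation, so extraction and validation stay separate steps.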

Example: Building a Reliable Classification System

Complete Implementation

from enum import Enum
from pydantic import BaseModel, Field, ValidationError
from agentflow.core.graph import StateGraph, AgentState
from agentflow.core.state import Message
from agentflow.core.llm import OpenAIModel

# 1. Define the schema
class TicketPriority(str, Enum):
    URGENT = "urgent"      # Needs immediate attention
    HIGH = "high"          # Important but not critical
    MEDIUM = "medium"      # Standard priority
    LOW = "low"            # Can wait

class ClassificationResult(BaseModel):
    category: str = Field(description="Ticket category: billing, technical, general, etc.")
    priority: TicketPriority
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence 0-1")
    reasoning: str = Field(description="Brief explanation of classification")
    suggested_response: str = Field(description="Suggested first response")

# 2. Create the system prompt
SYSTEM_PROMPT = """
You are a customer support ticket classifier for Acme Corp.

Classify each ticket accurately based on:
1. Category: What type of issue is this?
2. Priority: How urgent is it?
3. Reasoning: Why did you choose this classification?

Be conservative with HIGH and URGENT - only use when truly critical.
"""

# 3. Create the classifier
llm = OpenAIModel("gpt-4o", response_format=ClassificationResult)

builder = StateGraph(AgentState)

@builder.node
def classify_ticket(state: AgentState) -> AgentState:
    # Read messages as an attribute, consistent with state.dict() below
    messages = list(state.messages) if state.messages else []
    last_message = messages[-1].content if messages else ""
    
    # Generate with structured output
    result = llm.generate(
        system_instruction=SYSTEM_PROMPT,
        messages=[Message(role="user", content=last_message)]
    )
    
    # Add response to history
    messages.append(Message(
        role="assistant",
        content=f"Category: {result.category}, Priority: {result.priority}"
    ))
    
    return {
        **state.dict(),
        "messages": messages,
        "classification": result.dict()
    }

builder.add_node("classify", classify_ticket)
builder.set_entry_point("classify")
builder.set_finish_point("classify")

app = builder.compile()

# 4. Safe wrapper with validation
def classify_safe(message: str) -> dict:
    try:
        result = app.invoke({
            "messages": [Message(role="user", content=message)]
        })
        
        return {
            "success": True,
            "classification": result.get("classification")
        }
    
    except ValidationError as e:
        return {
            "success": False,
            "error": f"Validation failed: {e}"
        }
    
    except Exception as e:
        return {
            "success": False,
            "error": f"Unexpected error: {e}"
        }

Testing the Classifier

# Test cases
test_tickets = [
    "My entire website is down!",
    "I have a question about my invoice",
    "Can you help me reset my password?",
    "URGENT: Production database is corrupted",
    "What are your business hours?",
]

for ticket in test_tickets:
    result = classify_safe(ticket)
    
    print(f"Ticket: {ticket[:50]}")
    if result["success"]:
        print(f"  Category: {result['classification']['category']}")
        print(f"  Priority: {result['classification']['priority']}")
    else:
        # Surface failures instead of silently skipping them
        print(f"  Error: {result['error']}")
    print()
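
Printing results is a start, but a regression check needs expected values. The sketch below pairs tickets with the priorities considered acceptable and reports mismatches. The expectations, the `evaluate` helper, and the keyword-based `stub_classify` (standing in for `classify_safe` so the sketch runs offline) are all hypothetical.

```python
from typing import Callable

# Hypothetical expectations: priorities counted as acceptable for each ticket.
EXPECTED = [
    ("My entire website is down!", {"urgent", "high"}),
    ("I have a question about my invoice", {"medium", "low"}),
    ("What are your business hours?", {"low"}),
]


def evaluate(classify: Callable[[str], dict], cases) -> list[str]:
    """Run each case and return human-readable descriptions of any failures."""
    failures = []
    for ticket, allowed in cases:
        result = classify(ticket)
        if not result.get("success"):
            failures.append(f"{ticket!r}: error {result.get('error')}")
            continue
        priority = result["classification"]["priority"]
        if priority not in allowed:
            failures.append(f"{ticket!r}: got {priority}, expected one of {sorted(allowed)}")
    return failures


def stub_classify(ticket: str) -> dict:
    """Keyword stub with the same return shape as classify_safe."""
    text = ticket.lower()
    if "down" in text:
        priority = "urgent"
    elif "invoice" in text:
        priority = "medium"
    else:
        priority = "low"
    return {"success": True, "classification": {"category": "general", "priority": priority}}


failures = evaluate(stub_classify, EXPECTED)
print(f"{len(EXPECTED) - len(failures)}/{len(EXPECTED)} passed")
```

Swapping `stub_classify` for `classify_safe` runs the same cases against the real classifier; allowing a set of acceptable priorities per ticket avoids flagging reasonable disagreements as failures.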

Context Management

Token Budget for Prompts

Automatic Context Truncation

from agentflow.core.utils import count_tokens

MAX_CONTEXT = 8000  # Leave room for response

def build_prompt(state: AgentState, new_message: str) -> list[dict]:
    """Build prompt with automatic truncation."""
    
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]
    
    # Add conversation history
    for msg in state.messages:
        messages.append({"role": msg.role, "content": msg.content})
    
    # Add new message
    messages.append({"role": "user", "content": new_message})
    
    # Truncate if too long
    while count_tokens(messages) > MAX_CONTEXT and len(messages) > 3:
        messages.pop(1)  # Remove oldest non-system message
    
    return messages
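
The same idea can be sketched without AgentFlow by making the budget explicit: the input budget is the context window minus the tokens reserved for the response. The `approx_tokens` heuristic (about 4 characters per token) is a rough assumption and not a real tokenizer, and `fit_to_budget` is a hypothetical helper. It walks from the newest message backwards, so each message is costed once and the latest turns are kept.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(messages: list[dict], context_window: int, reserved_for_response: int) -> list[dict]:
    """Keep the system message plus the newest messages that fit the input budget."""
    budget = context_window - reserved_for_response
    system, history = messages[0], messages[1:]
    used = approx_tokens(system["content"])
    kept = []
    # Walk from newest to oldest so the latest turns survive truncation.
    for msg in reversed(history):
        cost = approx_tokens(msg["content"])
        # The newest message is always kept, even if it alone exceeds the budget.
        if used + cost > budget and kept:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]


messages = [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "How do I reset my password?"},
]
trimmed = fit_to_budget(messages, context_window=200, reserved_for_response=80)
print([m["role"] for m in trimmed])
```

In practice, replace `approx_tokens` with the tokenizer for the model in use, since character-based estimates can be off by a wide margin for code or non-English text.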

Exercise: Build a Code Review Assistant

Your Task

Build a code review assistant with structured output:

1. Define a schema for code review results:
   - Bug severity (critical, high, medium, low)
   - Files affected (list of strings)
   - Suggested fixes (list of strings)
   - Overall recommendation (approve, request_changes, reject)
2. Create an agent that uses the schema
3. Test with this vulnerable code:

# Vulnerable code to review
code = '''
def get_user(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)

def login(username, password):
    user = db.query(f"SELECT * FROM users WHERE username = '{username}'")
    if user.password == password:
        return jwt.encode({user_id: user.id})
'''

Expected Output Structure

class CodeReviewResult(BaseModel):
    bugs: list[dict] = Field(description="List of bugs found")
    severity: str = Field(description="critical/high/medium/low")
    files_affected: list[str]
    suggested_fixes: list[str]
    recommendation: str = Field(description="approve/request_changes/reject")
    reasoning: str

What You Learned

Prompting has limits — Sometimes you need better system design
Prompt structure matters — Clear hierarchy, positioning, and examples
Structured outputs guarantee consistency — Use schemas for production
Validation is essential — Structured outputs can still fail
Context management prevents overflow — Truncate or summarize when needed

Common Failure Mode

Relying on prompt instructions for format without schema validation

Even with explicit format instructions, LLMs sometimes deviate:

import json

from pydantic import ValidationError

# ❌ Don't trust the output without validation
response = llm.generate("Return JSON with name and age")
data = json.loads(response)  # Might fail or be wrong!

# ❌ Don't even trust structured outputs blindly
response = llm.generate(
    "Return JSON",
    response_format=PersonSchema
)
# Still might fail in edge cases

# ✅ Always validate
def safe_generate(prompt: str, schema: type):
    try:
        return llm.generate(prompt, response_format=schema)
    except ValidationError as e:
        # Output did not match the schema: handle gracefully
        return fallback_result(e)
    except Exception as e:
        # Network, rate-limit, or other runtime failure
        return fallback_result(e)
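
The `fallback_result` helper referenced above is not defined in this lesson. One possible shape, offered as a hypothetical sketch, is a safe default that records what went wrong and flags the item for human review rather than guessing a classification.

```python
def fallback_result(error: Exception) -> dict:
    """Hypothetical fallback: a safe default that flags the item for human review."""
    return {
        "success": False,
        "needs_human_review": True,
        "error_type": type(error).__name__,
        "error": str(error)[:200],  # truncate long validation messages
    }


try:
    raise ValueError("priority: value is not a valid enumeration member")
except ValueError as exc:
    result = fallback_result(exc)

print(result["error_type"], result["needs_human_review"])
```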

Next Step

Continue to Lesson 3: Tools, files, and MCP basics to extend your agent with external capabilities.

Or Explore

Tools Reference — AgentFlow tool patterns
Agents and Tools concepts — Tool design principles
Prompt patterns cheatsheet — Pattern reference
