Error Handling Guidelines for AgentFlow

This document provides comprehensive guidelines for error handling in the AgentFlow framework.

Table of Contents

Overview
Exception Hierarchy
Error Codes
Structured Error Responses
Logging Best Practices
Usage Examples
Migration Guide
Best Practices Summary
Future Enhancements

Overview

AgentFlow uses a structured error handling approach with:

- Error Codes: unique identifiers for each error type
- Contextual Information: additional data to aid debugging
- Structured Logging: a consistent log format that includes error codes and context
- Serializable Responses: errors convert to dictionaries for API responses

Exception Hierarchy

```text
Exception
├── GraphError (GRAPH_XXX)
│   ├── NodeError (NODE_XXX)
│   └── GraphRecursionError (RECURSION_XXX)
├── StorageError (STORAGE_XXX)
│   ├── TransientStorageError (STORAGE_TRANSIENT_XXX)
│   ├── SerializationError (STORAGE_SERIALIZATION_XXX)
│   └── SchemaVersionError (STORAGE_SCHEMA_XXX)
├── MetricsError (METRICS_XXX)
└── ValidationError (VALIDATION_XXX)
```
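
Because the hierarchy is ordinary subclassing, an except clause for a parent also catches all of its children, so handlers should be ordered from most to least specific. The sketch below uses stand-in classes that only mirror the shape of the tree (no error codes or context) so it runs without AgentFlow installed; the describe helper is hypothetical.

```python
# Stand-in classes mirroring the hierarchy above, for illustration only.
# In real code, import these from agentflow.exceptions instead of redefining them.
class GraphError(Exception): ...
class NodeError(GraphError): ...
class GraphRecursionError(GraphError): ...
class StorageError(Exception): ...
class TransientStorageError(StorageError): ...
class SerializationError(StorageError): ...
class SchemaVersionError(StorageError): ...
class MetricsError(Exception): ...
class ValidationError(Exception): ...


def describe(exc: Exception) -> str:
    """Classify an exception by the most specific branch of the hierarchy that matches."""
    try:
        raise exc
    except TransientStorageError:  # must come before StorageError, its parent
        return "storage (retryable)"
    except StorageError:  # also catches SerializationError and SchemaVersionError
        return "storage"
    except GraphError:  # also catches NodeError and GraphRecursionError
        return "graph"
    except (MetricsError, ValidationError):
        return "other agentflow"
    except Exception:
        return "unrelated"
```

If the StorageError clause came first, TransientStorageError would never reach its own handler, and retryable failures would be treated like permanent ones.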

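Error Codes

Error codes follow the pattern CATEGORY_NNN, where NNN is a three-digit number. Categories that have subcategories insert the subcategory before the number, as in CATEGORY_SUBCATEGORY_NNN (for example, STORAGE_TRANSIENT_000). Code 000 in each category is the generic error.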

Graph Errors (GRAPH_XXX)

GRAPH_000: Generic graph error
GRAPH_001: Invalid graph structure
GRAPH_002: Missing entry point
GRAPH_003: Orphaned nodes detected
GRAPH_004: Invalid edge configuration

Node Errors (NODE_XXX)

NODE_000: Generic node error
NODE_001: Node execution failed
NODE_002: No tool calls to execute
NODE_003: Invalid node configuration
NODE_004: Node not found

Recursion Errors (RECURSION_XXX)

RECURSION_000: Generic recursion error
RECURSION_001: Recursion limit exceeded
RECURSION_002: Infinite loop detected

Storage Errors (STORAGE_XXX)

STORAGE_000: Generic storage error
STORAGE_TRANSIENT_000: Transient storage error (retryable)
STORAGE_SERIALIZATION_000: Serialization/deserialization error
STORAGE_SCHEMA_000: Schema version mismatch

Metrics Errors (METRICS_XXX)

METRICS_000: Generic metrics error
METRICS_001: Failed to emit metrics

Validation Errors (VALIDATION_XXX)

VALIDATION_000: Generic validation error
VALIDATION_001: Prompt injection detected
VALIDATION_002: Message content validation failed
VALIDATION_003: Content policy violation

Structured Error Responses

All exceptions in agentflow.exceptions support the to_dict() method for structured responses. For example, a NodeError serializes as:

```json
{
    "error_type": "NodeError",
    "error_code": "NODE_001",
    "message": "Node failed to execute",
    "context": {
        "node_name": "process_data",
        "input_size": 100,
        "execution_time_ms": 1500
    }
}
```

Logging Best Practices

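1. Always Include Context

```python
from agentflow.exceptions import NodeError

# node_name, input_data, and execution_time come from the surrounding node code
raise NodeError(
    message="Node failed to execute",
    error_code="NODE_001",
    context={
        "node_name": node_name,
        "input_size": len(input_data),
        "execution_time_ms": execution_time,
    },
)
```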

2. Use Appropriate Log Levels

ERROR: For exceptions that indicate a failure (GraphError, NodeError, SerializationError)
WARNING: For recoverable issues (TransientStorageError, MetricsError)
INFO: For normal operation logs
DEBUG: For detailed diagnostic information
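
These rules can be centralized so every handler picks the same level. A minimal sketch, assuming stand-in classes that mirror the hierarchy (import the real ones from agentflow.exceptions); log_level_for and log_exception are hypothetical helpers, and any type the list above does not name falls back to ERROR.

```python
import logging


# Stand-ins mirroring the documented hierarchy so the sketch runs on its own;
# in real code, import these classes from agentflow.exceptions.
class GraphError(Exception): ...
class NodeError(GraphError): ...
class StorageError(Exception): ...
class TransientStorageError(StorageError): ...
class SerializationError(StorageError): ...
class MetricsError(Exception): ...


# Types the guidelines treat as recoverable; everything else logs at ERROR.
WARNING_TYPES: tuple[type[BaseException], ...] = (TransientStorageError, MetricsError)


def log_level_for(exc: BaseException) -> int:
    """Return the logging level the guidelines assign to this exception's type."""
    return logging.WARNING if isinstance(exc, WARNING_TYPES) else logging.ERROR


def log_exception(logger: logging.Logger, exc: BaseException) -> None:
    """Log exc at its assigned level; passing the instance as exc_info attaches its traceback."""
    logger.log(log_level_for(exc), "%s: %s", type(exc).__name__, exc, exc_info=exc)
```

Because isinstance respects subclassing, TransientStorageError logs at WARNING even though its parent StorageError would log at ERROR.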

3. Include Stack Traces

All exception classes automatically include exc_info=True in their logging, which captures the full stack trace.

4. Avoid Sensitive Information

Never log sensitive information such as:

- API keys or credentials
- Personally identifiable information (PII)
- Raw user data
- Password hashes
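
One way to enforce this is to scrub context dicts before they are logged or returned. A minimal sketch, assuming context is a plain (possibly nested) dict; the redact helper and the SENSITIVE_KEYS deny-list are hypothetical and should be adapted to the field names your application actually uses.

```python
from typing import Any

# Hypothetical deny-list of field names; compare case-insensitively.
SENSITIVE_KEYS = {"api_key", "password", "password_hash", "token", "secret", "authorization", "email"}
REDACTED = "[REDACTED]"


def redact(context: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of context with sensitive values replaced, recursing into nested dicts."""
    clean: dict[str, Any] = {}
    for key, value in context.items():
        if str(key).lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

A deny-list only catches names you anticipated; for contexts that may hold arbitrary user data, an allow-list of known-safe keys is the stricter choice.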

Usage Examples

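Basic Usage

```python
from agentflow.exceptions import NodeError

try:
    result = process_node(data)  # process_node and data stand in for your own code
except Exception as e:
    # Re-raise with an error code and context; "from e" preserves the original traceback
    raise NodeError(
        message=f"Failed to process node: {e!s}",
        error_code="NODE_001",
        context={
            "node_name": "data_processor",
            "error_type": type(e).__name__,
        },
    ) from e
```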

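With Retry Logic

TransientStorageError marks a retryable failure, so the loop catches it and tries again, escalating to StorageError only after the last attempt:

```python
import logging
import time

from agentflow.exceptions import StorageError, TransientStorageError

logger = logging.getLogger(__name__)
max_retries = 3


def save_once(data, attempt: int):
    """Wrap connection failures from the storage call in TransientStorageError."""
    try:
        return save_to_database(data)  # your storage call
    except ConnectionError as e:
        raise TransientStorageError(
            message=f"Database connection failed, attempt {attempt}/{max_retries}",
            error_code="STORAGE_TRANSIENT_000",
            context={"attempt": attempt, "max_retries": max_retries},
        ) from e


for attempt in range(1, max_retries + 1):
    try:
        result = save_once(data, attempt)
        break
    except TransientStorageError as e:
        if attempt == max_retries:
            raise StorageError(
                message="Database connection failed after all retries",
                error_code="STORAGE_000",
                context={"total_attempts": max_retries},
            ) from e
        # Recoverable: log at WARNING, back off exponentially, then retry
        logger.warning("Storage attempt %d/%d failed; retrying", attempt, max_retries, exc_info=True)
        time.sleep(2 ** (attempt - 1))
```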
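
The retry loop can also be factored into a reusable helper. A minimal sketch, assuming a stand-in TransientStorageError (import the real one from agentflow.exceptions); retry_transient and its parameters are hypothetical, not an AgentFlow API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientStorageError(Exception):
    """Stand-in for agentflow.exceptions.TransientStorageError so the sketch runs on its own."""


def retry_transient(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on TransientStorageError with exponential backoff.

    After the last attempt the TransientStorageError propagates unchanged, so the
    caller can wrap it (for example in a StorageError with error_code="STORAGE_000").
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except TransientStorageError:
            if attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # waits 0.5s, 1s, 2s, ... by default
            attempt += 1
```

The sleep parameter lets tests pass a fake and run instantly; production code keeps the default. Adding random jitter to the delay is a common refinement when many clients retry at once.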

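API Response

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from agentflow.exceptions import GraphError

app = FastAPI()


@app.exception_handler(GraphError)
async def graph_error_handler(request: Request, exc: GraphError) -> JSONResponse:
    # Also handles subclasses of GraphError (NodeError, GraphRecursionError)
    return JSONResponse(
        status_code=400,
        content=exc.to_dict(),
    )
```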
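
The handler above returns 400 for every GraphError, although failures such as NODE_001 (node execution failed) are usually server-side. A sketch that derives the HTTP status from the error-code category instead; the mapping itself is an assumption about a reasonable API contract, not an AgentFlow convention, and status_for is a hypothetical helper.

```python
from http import HTTPStatus

# Assumed mapping from error-code category to HTTP status; adjust to your API's contract.
STATUS_BY_CATEGORY: dict[str, HTTPStatus] = {
    "VALIDATION": HTTPStatus.UNPROCESSABLE_ENTITY,  # client sent bad input
    "GRAPH": HTTPStatus.BAD_REQUEST,                 # caller supplied an invalid graph
    "NODE": HTTPStatus.INTERNAL_SERVER_ERROR,        # execution failed on the server
    "RECURSION": HTTPStatus.INTERNAL_SERVER_ERROR,
    "STORAGE": HTTPStatus.INTERNAL_SERVER_ERROR,
    "METRICS": HTTPStatus.INTERNAL_SERVER_ERROR,
}


def status_for(error_code: str) -> int:
    """Map an error code such as NODE_001 to an HTTP status code."""
    if error_code.startswith("STORAGE_TRANSIENT_"):
        return int(HTTPStatus.SERVICE_UNAVAILABLE)  # retryable: tell clients to try again later
    category = error_code.split("_", 1)[0]
    return int(STATUS_BY_CATEGORY.get(category, HTTPStatus.INTERNAL_SERVER_ERROR))
```

In the handler, status_code=status_for(exc.to_dict()["error_code"]) would replace the fixed 400.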

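Conditional Logging

Metrics errors are non-critical, so log them as warnings and continue instead of raising:

```python
import logging

from agentflow.exceptions import MetricsError

logger = logging.getLogger(__name__)

try:
    emit_metric("node_execution", value)  # emit_metric and value stand in for your metrics code
except Exception as e:
    # Intentionally broad: a metrics failure must not interrupt the main flow
    error = MetricsError(
        message=f"Failed to emit metric: {e!s}",
        error_code="METRICS_001",
        context={"metric_name": "node_execution"},
    )
    logger.warning("Failed to emit metric", extra={"error": error.to_dict()}, exc_info=True)
```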
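
When several calls are non-critical, the log-and-continue pattern can be packaged as a decorator. A minimal sketch; non_critical is a hypothetical helper, and emit_metric here is a hypothetical stand-in that always fails so the behavior is visible.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def non_critical(error_code: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator: log failures at WARNING with the given error code and return None instead of raising."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except Exception:
                # Intentionally broad: the wrapped call must never break the caller
                logger.warning(
                    "%s failed; continuing without it",
                    func.__name__,
                    exc_info=True,
                    extra={"error_code": error_code},
                )
                return None

        return wrapper

    return decorator


@non_critical("METRICS_001")
def emit_metric(name: str, value: float) -> None:
    """Hypothetical metrics call that always fails, to demonstrate the decorator."""
    raise ConnectionError("metrics backend unreachable")
```

The error code travels in extra, so it appears as a record attribute that structured log formatters can emit as its own field.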

Input Validation Error

ValidationError takes a violation type and an optional details dict. Its definition and a typical usage:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


# Definition shown for reference. In application code, import it instead:
#     from agentflow.utils.validators import ValidationError
class ValidationError(Exception):
    """Custom exception raised when input validation fails."""

    def __init__(self, message: str, violation_type: str, details: dict[str, Any] | None = None):
        """
        Initialize ValidationError.

        Args:
            message: Human-readable error message
            violation_type: Type of validation violation
            details: Additional details about the validation failure
        """
        super().__init__(message)
        self.violation_type = violation_type
        self.details = details or {}


# Usage example
def check_user_input(user_input: str) -> None:
    try:
        if "DROP" in user_input.upper():
            raise ValidationError(
                message="Potential SQL injection detected",
                violation_type="injection_pattern",
                details={"content_sample": user_input[:100]},
            )
    except ValidationError as e:
        # details holds raw user input, so log only the violation type
        logger.error(
            "Validation failed: %s",
            e.violation_type,
            extra={"violation_type": e.violation_type},
        )
        raise
```

Migration Guide

Updating Existing Code

Before (Old Style)

```python
from agentflow.exceptions import GraphError

raise GraphError("Invalid graph structure")
```

After (New Style)

```python
from agentflow.exceptions import GraphError

raise GraphError(
    message="Invalid graph structure",
    error_code="GRAPH_001",
    context={"node_count": 5, "edge_count": 3},
)
```

We recommend migrating to the new structured format for better observability and debugging.

Finding Exceptions to Update

Search for exception raises in your codebase (-n prints line numbers; --include limits the search to Python files):

```bash
# Find all GraphError raises
grep -rn --include="*.py" "raise GraphError" agentflow/

# Find all NodeError raises
grep -rn --include="*.py" "raise NodeError" agentflow/

# Find all other exception raises
grep -rn --include="*.py" "raise.*Error" agentflow/
```

Best Practices Summary

✅ Always include meaningful error codes
✅ Provide contextual information in the context dict
✅ Use structured logging with consistent format
✅ Chain exceptions with from e to preserve stack traces
✅ Document error codes in your API documentation
✅ Use to_dict() for API responses
❌ Don't log sensitive information
❌ Don't catch generic Exception without re-raising with context
❌ Don't suppress errors silently
❌ Don't use the same error code for different error scenarios

Future Enhancements

Add error code registry with descriptions
Implement error monitoring integration (Sentry, etc.)
Add error metrics and dashboards
Create error code lookup CLI tool
Add internationalization (i18n) support for error messages